11 Linear models
Statistical models of a linear relationship between variables: \[Y=\beta_0+\beta_1X+e,\] where:
- \(Y\) is the dependent variable;
- \(X\) is the independent variable;
- \(e\) is the error term.
The errors are assumed to be independent and identically normally distributed, with mean \(0\) and variance \(\sigma^2>0\).
The model parameters to be estimated are: \(\beta_0\) and \(\beta_1\).
Examples: \(\hat y=1+0.5x\), \(\hat y=0-2x\).
In R, consider 1000 simulated pairs \((x_i,y_i)\) generated as follows:
> set.seed(98765421)
> x<-rchisq(1000,df=1)
> y<-1+0.5*x+0.7*rnorm(1000)
>
> plot(x,y,xlab='x',ylab='y')
- The main idea is to compute the best linear model, that is, the straight line that best fits the points (drawn in blue on the scatterplot).
Suppose the estimated model for the above data is \(\hat y=1+0.5x\).
Model residuals:
> residuals.y<-y-(1+0.5*x)
>
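> # Check the normality assumption: histogram of the residuals,
> # overlaid with a normal density with mean 0 and the same standard deviation
> f<-function(x){dnorm(x,sd=sd(residuals.y))}
> hist(residuals.y,probability=TRUE,ylim=c(0,0.7),nclass = 20)
> curve(f,col="blue", lwd=3, add=TRUE)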
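Alongside the histogram, a quick numerical check can be made. The sketch below regenerates the simulated data so that it runs on its own, then compares the sample mean and standard deviation of the residuals with the values \(0\) and \(0.7\) used in the simulation. The names `res_mean` and `res_sd` are introduced here only for illustration.

```r
# Regenerate the simulated data from above so the block is self-contained
set.seed(98765421)
x <- rchisq(1000, df = 1)
y <- 1 + 0.5 * x + 0.7 * rnorm(1000)

# Residuals with respect to the assumed model y-hat = 1 + 0.5 x
residuals.y <- y - (1 + 0.5 * x)

# Sample mean and standard deviation of the residuals (illustrative names)
res_mean <- mean(residuals.y)   # expected to be close to 0
res_sd   <- sd(residuals.y)     # expected to be close to 0.7

print(c(mean = res_mean, sd = res_sd))
```

Values far from \(0\) and \(0.7\) would suggest that the assumed line or the assumed error variance does not describe the data well.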
11.1 Fitting models: the lm() function
\[y=\beta_0+\beta_1x+e\]
- In R, linear models can be fitted to data with the lm() function:
> analysis<-lm(y~x)
> analysis
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##       0.991        0.496
So the parameter estimates are \(\hat{\beta}_0=0.991\) and \(\hat{\beta}_1=0.496\), close to the values \(1\) and \(0.5\) used to simulate the data.
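lm() estimates the coefficients by ordinary least squares. With a single predictor the estimates have a closed form, \(\hat{\beta}_1=\mathrm{cov}(x,y)/\mathrm{var}(x)\) and \(\hat{\beta}_0=\bar y-\hat{\beta}_1\bar x\), so as a rough check the sketch below computes them by hand and compares them with the output of lm(). The names `b0` and `b1` are illustrative only.

```r
# Regenerate the simulated data so the block is self-contained
set.seed(98765421)
x <- rchisq(1000, df = 1)
y <- 1 + 0.5 * x + 0.7 * rnorm(1000)

# Closed-form least-squares estimates for one predictor
b1 <- cov(x, y) / var(x)          # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Estimates computed by lm()
analysis <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(analysis))

# The two sets of estimates should agree up to numerical precision
print(all.equal(unname(coef(analysis)), c(b0, b1)))
```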
Model formulas: the main argument to lm() is a formula object, which specifies the linear model. A formula may look like this:
> y~x+z+w
The corresponding linear model is: \[y=\beta_0+\beta_1x+\beta_2z+\beta_3w+e.\]
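To illustrate a formula with several predictors, the sketch below simulates data from a model of this form and fits it. Everything about the data is an assumption made for the example: the predictors `z` and `w`, the seed, the sample size and the "true" coefficients are hypothetical.

```r
# Hypothetical data: three predictors and arbitrary true coefficients
set.seed(1)                       # arbitrary seed, used only for this illustration
n <- 200
x <- rnorm(n)
z <- rnorm(n)
w <- rnorm(n)
y <- 1 + 0.5 * x - 1 * z + 2 * w + 0.7 * rnorm(n)

# A formula is an ordinary R object and can be stored in a variable
model <- y ~ x + z + w
print(class(model))               # "formula"

# The intercept beta_0 is included automatically
fit <- lm(model)
print(coef(fit))                  # one estimate each for (Intercept), x, z, w
```

Each term on the right-hand side of `~` contributes one coefficient, and the intercept is added without being written in the formula.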
Contents of the object returned by lm():
> analysis<-lm(y~x)
> names(analysis)
##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "xlevels"       "call"          "terms"         "model"
- Accessing the contents:
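A minimal sketch of how these components might be accessed, assuming the same simulated data and the `analysis` object fitted above. The `$` operator and the extractor functions `coef()`, `residuals()` and `fitted()` are standard R; the choice of which components to display is made here only for illustration.

```r
# Regenerate the data and refit the model so the block is self-contained
set.seed(98765421)
x <- rchisq(1000, df = 1)
y <- 1 + 0.5 * x + 0.7 * rnorm(1000)
analysis <- lm(y ~ x)

# Components can be accessed by name with the $ operator
print(analysis$coefficients)
print(head(analysis$residuals))
print(analysis$df.residual)       # n - 2 = 998 for this model

# The extractor functions return the same information
print(coef(analysis))
print(head(residuals(analysis)))
print(head(fitted(analysis)))
```

The extractor functions are usually preferable to `$`, since they do not depend on the internal layout of the fitted object.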
- Summaries:
> summary(analysis)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.2027 -0.4663 -0.0532  0.5073  1.9379
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.9909     0.0267    37.1   <2e-16 ***
## x             0.4964     0.0142    35.0   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.698 on 998 degrees of freedom
## Multiple R-squared: 0.551, Adjusted R-squared: 0.551
## F-statistic: 1.23e+03 on 1 and 998 DF, p-value: <2e-16
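The quantities printed by summary() can also be extracted for further computation. The sketch below, which assumes the same simulated data, pulls out the coefficient table, the residual standard error and the R-squared values, and recomputes the t values as estimate divided by standard error. The names `s` and `coef_table` are illustrative only.

```r
# Regenerate the data and refit the model so the block is self-contained
set.seed(98765421)
x <- rchisq(1000, df = 1)
y <- 1 + 0.5 * x + 0.7 * rnorm(1000)
analysis <- lm(y ~ x)

s <- summary(analysis)

# Coefficient table as a matrix with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coef_table <- coef(s)
print(coef_table)

print(s$sigma)                    # residual standard error
print(s$r.squared)                # multiple R-squared
print(s$adj.r.squared)            # adjusted R-squared

# Each t value is the estimate divided by its standard error
print(coef_table[, "Estimate"] / coef_table[, "Std. Error"])
```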