Assumptions of Linear Regression

Sanghavi Vemulapati
2 min read · Jan 25, 2022

What does a Linear Regression algorithm do?
A linear regression algorithm models the relationship between a single dependent variable and one or more independent variables.

If the model has only one independent variable, we call it simple linear regression. If it has more than one independent variable, it is multiple linear regression.

How does the Linear regression algorithm work?
By fitting a line to the data. Not just any line: it finds the best-fitting line by minimizing the sum of squared residuals (the squared differences between the predicted and actual values).
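As a minimal sketch of this idea (using synthetic toy data and NumPy, not code from the article), fitting a line by ordinary least squares and computing the residual sum of squares might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                  # toy predictor
y = 4 + 2.5 * x + rng.normal(0, 1.5, 100)    # toy target with a linear trend plus noise

# Ordinary least squares: find the slope and intercept that minimize
# the sum of squared residuals
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
print(f"slope={slope:.2f}, intercept={intercept:.2f}, RSS={rss:.2f}")
```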

Assumptions of Linear regression?

  1. Linearity: The relationship between X and the mean of Y is linear.
    Linear regression assumes the relationship between the dependent variable and independent variables is linear.
    How to check?
    Using scatter plots of each independent variable against the dependent variable (see the sketch after this list).
    How to handle violations of this assumption?
    Try transformations of the variables.
  2. Normality: For any fixed value of X, Y is normally distributed
    Linear regression assumes that the residuals (equivalently, Y conditional on X) are normally distributed.
    How to check?
    Using a histogram, a Q-Q plot, or the Kolmogorov-Smirnov goodness-of-fit test on the residuals (see the sketch after this list).
    How to handle violations of this assumption?
    Use a non-linear transformation (e.g., a log transformation).
  3. Multicollinearity: linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are highly correlated with each other.
    How to check?
    Correlation matrix,
    Tolerance (T = 1 - R²),
    Variance Inflation Factor (VIF = 1/T) (see the sketch after this list)
    How to handle violations of this assumption?
    Center the variables (subtract the mean of each variable from its scores),
    remove independent variables with high VIF values,
    or conduct a factor analysis and rotate the factors to ensure the independence of the factors in the regression analysis.
  4. Autocorrelation: linear regression assumes that there is little or no autocorrelation in the residuals. Autocorrelation occurs when the residuals are not independent of each other; we typically see this in time-series data.
    How to check?
    The Durbin-Watson d test. It tests the null hypothesis that the residuals are not linearly autocorrelated (see the sketch after this list).
    Values of 1.5 < d < 2.5 suggest that there is no autocorrelation in the data. However, the Durbin-Watson test only detects linear autocorrelation between direct neighbors, i.e., first-order effects.
  5. Homoscedasticity: the residuals have constant variance across the regression line.
    If the assumption of constant variance is violated, the least-squares estimators are still unbiased, but the Gauss-Markov theorem (which guarantees that OLS has the lowest sampling variance among linear unbiased estimators) no longer holds, and the standardized scores do not have the assumed distribution.
    How to check?
    The Goldfeld-Quandt test checks for heteroscedasticity. It splits the data into two groups and tests whether the variance of the residuals is similar across the groups (see the sketch after this list).
    How to handle violations of this assumption?
    Transformations or a non-linear correction can help.
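The checks above can be tried in code. For linearity, a scatter plot of the predictor against the target is usually enough; a minimal sketch with synthetic data (hypothetical variable names, not from the article) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)              # toy predictor
y = 3 * x + rng.normal(0, 2, 200)        # roughly linear relationship plus noise

plt.scatter(x, y, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter plot to eyeball linearity")
plt.show()
```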
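For normality, a Q-Q plot of the residuals and a Kolmogorov-Smirnov test can be run with statsmodels and SciPy; this is only a sketch, assuming the same kind of synthetic data:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + rng.normal(0, 2, 200)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

# Q-Q plot of the residuals against a normal distribution
sm.qqplot(resid, line="s")
plt.show()

# Kolmogorov-Smirnov goodness-of-fit test on standardized residuals
stat, p = stats.kstest((resid - resid.mean()) / resid.std(), "norm")
print(f"KS statistic = {stat:.3f}, p-value = {p:.3f}")
```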
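For multicollinearity, the variance inflation factor is available in statsmodels; in this sketch, x2 is deliberately built to be nearly collinear with x1 (toy variable names are assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # highly correlated with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

exog = add_constant(X)  # include the intercept when computing VIF
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
    index=X.columns,
)
print(vif)       # values above roughly 5-10 usually flag problematic collinearity
print(X.corr())  # the correlation matrix mentioned above
```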
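For autocorrelation, the Durbin-Watson statistic can be computed from the residuals of a fitted model; this sketch fabricates AR(1) errors so the statistic falls well below 2:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(200, dtype=float)
e = np.zeros(200)
for t in range(1, 200):                  # AR(1) errors to induce autocorrelation
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 2 + 0.5 * x + e

model = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.2f}")      # d near 2 suggests little first-order autocorrelation
```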
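For homoscedasticity, statsmodels implements the Goldfeld-Quandt test; this sketch uses errors whose variance grows with x, so the test should flag heteroscedasticity:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = 1 + 2 * x + rng.normal(scale=0.5 + 0.5 * x)   # error variance grows with x

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Splits the sample and compares residual variance between the two halves
f_stat, p_value, _ = het_goldfeldquandt(model.resid, X)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")  # small p-value suggests heteroscedasticity
```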

Learn more about dealing with model assumption violations here.

Advantages of using Linear Regression?
1. Works well when the relationship between the variables is (approximately) linear.
2. Easy to interpret and efficient to train.
3. Overfitting can be handled with dimensionality reduction techniques, regularisation, and cross-validation.
4. Allows extrapolation (estimating values by assuming that existing trends will continue).

Disadvantages of using Linear Regression?
1. Sensitive to outliers
2. Prone to multicollinearity
