Linear Regression

Sailaja Karra
3 min read · Dec 15, 2020

Introduction:

Regression analysis is often the first real learning application that aspiring data scientists come across. It is one of the simplest techniques to master, but it requires some mathematical and statistical understanding.

Regression analysis is a parametric technique, meaning a set of parameters is used to predict the value of an unknown target variable (or dependent variable) Y based on one or more known input features (or independent variables, predictors), often denoted by X.

Linear Regression:

The term linear implies that the model follows a straight (or nearly straight) line. Linearity, one of the assumptions of this approach, suggests that the relationship between the dependent and independent variables can be expressed as a straight line.

Assumptions for Linear Regression:

Here are the assumptions that define the scope of regression analysis; the underlying data must fulfill them. If they are violated, regression makes biased and unreliable predictions.

1. Linearity:

The linearity assumption requires that there is a linear relationship between the response variable (Y) and the predictor (X). Linear means that the change in Y for a 1-unit change in X is constant.

For non-linear relationships, you can use non-linear mathematical functions to fit the data, e.g., polynomial and exponential functions.
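For instance, here is a minimal sketch on synthetic data (illustrative only, not from a real dataset) of fitting a quadratic relationship by expanding the feature with scikit-learn's PolynomialFeatures:

```python
# A minimal sketch on synthetic data: a quadratic relationship
# captured by a linear model over polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] ** 2 + rng.normal(0, 5, size=100)  # quadratic, not linear

# Expand X into [X, X^2] so a straight-line model can capture the curve
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.score(X_poly, y))  # R^2 of the polynomial fit
```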

2. Normality:

The normality assumption states that the model residuals should follow a normal distribution. The easiest way to check for normality is with a histogram or a Q-Q plot.
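As an illustration, here is a minimal sketch (again on synthetic data) that fits a line and checks the residuals with both a histogram and a SciPy Q-Q plot:

```python
# A minimal sketch on synthetic data: checking residual normality
# with a histogram and a Q-Q plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)

residuals = y - LinearRegression().fit(X, y).predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=20)          # roughly bell-shaped if normal
ax1.set_title("Histogram of residuals")
stats.probplot(residuals, plot=ax2)   # points hug the line if normal
plt.show()
```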

3. Homoscedasticity:

Heteroscedasticity refers to a circumstance in which the variability of the dependent variable is unequal across the range of values of the predictors. When there is heteroscedasticity in the data, a scatterplot of these variables will often create a cone-like shape: the scatter of the dependent variable widens or narrows as the value of the independent variable increases. The inverse of heteroscedasticity is homoscedasticity, which indicates that the dependent variable's variability is equal across values of the independent variable. Homoscedasticity is the third assumption necessary when creating a linear regression model.
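Here is a minimal sketch, using synthetic data with deliberately non-constant noise, of the residuals-vs-fitted plot where that cone shape shows up:

```python
# A minimal sketch on synthetic data: a residuals-vs-fitted plot;
# a cone shape suggests heteroscedasticity.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(200, 1))
# Noise grows with X, so the variance is deliberately NOT constant here
y = 3 * X[:, 0] + rng.normal(0, X[:, 0], size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```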

There are two main types of linear regression:

Simple Linear Regression:

Simple Linear Regression uses a single feature (one independent variable) to model a linear relationship with a target (the dependent variable) by fitting an optimal model (i.e., the best straight line) to describe the relationship. A straight line can be written as

Y = mX + b

where Y is the dependent variable that needs to be predicted, X is the independent variable (the input), m is the slope, which determines the angle of the line, and b is the intercept, the constant giving the value of Y when X is 0.
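As a quick preview of the scikit-learn workflow the next post will cover, here is a minimal sketch on synthetic data (not the Boston housing data) that fits this line and reads off m and b:

```python
# A minimal sketch on synthetic data: fitting Y = mX + b with
# scikit-learn and recovering the slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 1))   # a single feature
y = 2.5 * X[:, 0] + 4 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print("slope m:", model.coef_[0])        # close to 2.5
print("intercept b:", model.intercept_)  # close to 4
```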

Multivariable Regression:

Multivariable Linear Regression uses more than one feature to predict a target variable by fitting the best linear relationship. The relationship can be written as

Y = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ

where b₀ is the intercept and b₁ through bₙ are the slopes (coefficients) for the features X₁ through Xₙ.
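And a matching sketch on synthetic data with three features, where coef_ holds one slope per feature:

```python
# A minimal sketch on synthetic data: multivariable regression with
# three features; coef_ holds one coefficient per feature.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 3))            # three features
true_coefs = np.array([1.5, -2.0, 0.5])
y = X @ true_coefs + 4 + rng.normal(0, 1, size=200)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)     # close to [1.5, -2.0, 0.5]
print("intercept:", model.intercept_)   # close to 4
```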

This is a simple blog introducing linear regression and its assumptions. In my next blog, I will show the coding part using scikit-learn on the Boston housing data.

Happy reading!!!
