Linear Regression#
Linear modeling starts with the most elementary model: simple linear regression, in which a variable Y is explained, or modeled, by an affine function of another variable X. It is mostly used to predict continuous output values.
Introduction#
First of all, what is linear?#
The term “linearity” in algebra refers to a linear relationship between two or more variables. If we draw this relationship in a two-dimensional space (between two variables), we get a straight line. In three dimensions it is a plane, and in more than three dimensions, a hyperplane.
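For example, with two explanatory variables, the relationship \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\) describes a plane in three-dimensional space.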
In this example, we will use Scikit-Learn which is a Python machine learning library.
Goal#
Y the real random variable to be explained (endogenous, dependent or response variable)
X the explanatory variable or fixed effect (exogenous).
We assume that, on average, Y is an affine function of X: \(E(Y \mid X = x) = \beta_0 + \beta_1 x\). Writing the model this way implicitly assumes a prior notion of causality, in the sense that Y depends on X, because the model is not symmetric.
We want to predict the value of a dependent variable (y) from a given independent variable (x), so this regression technique finds a linear relationship between x (input) and y (output). When we plot the variables, linear regression gives us the straight line that best fits the data points.
So, what does the linear regression algorithm do?#
It gives us the optimal values for the intercept and the slope. Y and X cannot be changed, since they are fixed data; the only values we can control are b and m. The algorithm looks for the line with the least error, the one that fits the most data points.
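Concretely, "least error" here means minimizing the sum of squared residuals over the \(n\) data points:

\(\min_{m, b} \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2\)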
I - Simple linear regression#
Basic linear equation:
\(Y = mx + b\)
\(b\) : intercept
\(m\) : slope
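To make this concrete, here is a minimal sketch (on hypothetical toy data) that computes the least-squares slope and intercept by hand with NumPy, using the classical closed-form formulas; Scikit-Learn will do this job for us below.
## Minimal sketch on hypothetical toy data: compute m and b by hand
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
## Closed-form least-squares estimates:
## m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print('slope m =', m, '| intercept b =', b)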
## Data from: https://www.kaggle.com/dronio/SolarEnergy?select=SolarPrediction.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
## Load the dataset (adjust the path to wherever you saved SolarPrediction.csv)
dataset = pd.read_csv('/Users/Laurine/Documents/Python Scripts/IMAC2/Learn-computer-graphics/Linear-Regression/SolarPrediction.csv')
## See data
dataset.describe()
|  | UNIXTime | Radiation | Temperature | Pressure | Humidity | WindDirection(Degrees) | Speed |
|---|---|---|---|---|---|---|---|
| count | 3.268600e+04 | 32686.000000 | 32686.000000 | 32686.000000 | 32686.000000 | 32686.000000 | 32686.000000 |
| mean | 1.478047e+09 | 207.124697 | 51.103255 | 30.422879 | 75.016307 | 143.489821 | 6.243869 |
| std | 3.005037e+06 | 315.916387 | 6.201157 | 0.054673 | 25.990219 | 83.167500 | 3.490474 |
| min | 1.472724e+09 | 1.110000 | 34.000000 | 30.190000 | 8.000000 | 0.090000 | 0.000000 |
| 25% | 1.475546e+09 | 1.230000 | 46.000000 | 30.400000 | 56.000000 | 82.227500 | 3.370000 |
| 50% | 1.478026e+09 | 2.660000 | 50.000000 | 30.430000 | 85.000000 | 147.700000 | 5.620000 |
| 75% | 1.480480e+09 | 354.235000 | 55.000000 | 30.460000 | 97.000000 | 179.310000 | 7.870000 |
| max | 1.483265e+09 | 1601.260000 | 71.000000 | 30.560000 | 103.000000 | 359.950000 | 40.500000 |
## We want to predict the temperature, taking radiation as the input feature.
dataset.plot(x='Temperature', y='Radiation', style='o')
plt.title('Temperature vs Radiation')
plt.xlabel('Temperature')
plt.ylabel('Radiation')
plt.show()
## We need to check the distribution of the temperature: the average is around 45-50
plt.figure(figsize=(10,10))
plt.tight_layout()
## Note: distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the modern equivalent
seabornInstance.histplot(dataset['Temperature'], kde=True)
## We need to divide the data into attributes (independent variables) and labels (dependent variables). Labels are the values we want to predict.
## Here, our attribute is "radiation" and our label is "temperature"
X = dataset['Radiation'].values.reshape(-1,1)
y = dataset['Temperature'].values.reshape(-1,1)
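The reshape(-1, 1) call turns each 1-D column into the two-dimensional (n_samples, 1) shape that Scikit-Learn estimators expect for their inputs.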
## We need to split the data between a training set and a test set. Let's say we give 70% of the data to the training set and 30% to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
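Fixing random_state makes the split reproducible: you get the same train/test partition on every run, which is convenient when comparing results.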
## It is time to train our algorithm!
regressor = LinearRegression()
regressor.fit(X_train, y_train) #training the algorithm
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
## We found the best value for the intercept (b) and the slope (m)
#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)
[48.12878042]
[[0.01442639]]
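In other words, the fitted line is approximately \(\hat{y} = 48.13 + 0.0144\,x\): each additional unit of radiation raises the predicted temperature by about 0.014 degrees.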
## Time to make predictions! (we use the test set and compare the results to the actual data)
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df1 = df.head(30)
df1.plot(kind='bar',figsize=(20,20))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='orange')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='orange')
plt.show()
The predicted values are quite close to the actual ones, which suggests that our algorithm is reasonably well trained.
## Well, now it's time to plot our straight line!
plt.scatter(X_test, y_test, color='darkblue')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()
## The final step is to evaluate the performance of the algorithm.
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 3.3428145712086974
Mean Squared Error: 17.923509301005502
Root Mean Squared Error: 4.2336165746327925
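Another common metric is the coefficient of determination \(R^2\), which measures the fraction of the variance in y that the model explains; a quick sketch using the metrics module we already imported:
## R^2 is 1.0 for a perfect fit and 0.0 for a model that always predicts the mean
print('R2 Score:', metrics.r2_score(y_test, y_pred))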
You can see that the value of the root mean squared error is 4.23, which is a little under 10% of the mean temperature (about 51, from the describe() table above). By that rough rule of thumb the fit is borderline: not very accurate, but still usable as a first approach.
So what if simple linear regression doesn't work well?#
The best way to handle that is to choose the right amount of data and to train the model properly. Machine learning is largely about this loop: train your model, realize that the training was not good enough and the predictions are not very accurate, and start over. Good luck!
II - Multiple linear regression#
After performing a simple linear regression, you might wonder how to proceed with more than two variables. The steps are much the same as above, but the evaluation is different. Multiple regression can be used to find out which factor has the highest impact on the predicted output and how the different variables are related to each other.
The formula for multiple Linear Regression is : \(y_i = \beta_0 + \beta _1 x_{i1} + \beta _2 x_{i2} + ... + \beta _p x_{ip} + \epsilon\)
where, for \(i = 1, \dots, n\) observations:
\(y_i\) : dependent variable
\(x_{i1}, \dots, x_{ip}\) : explanatory variables
\(\beta_0\) : y-intercept (constant term)
\(\beta_1, \dots, \beta_p\) : slope coefficients for each explanatory variable
\(\epsilon\) : the model's error term (also known as the residuals)
The multiple regression model is based on the following assumptions:
There is a linear relationship between the dependent variable and the independent variables.
The independent variables are not too highly correlated with each other.
The \(y_i\) observations are selected independently and randomly from the population.
Residuals should be normally distributed with a mean of 0 and constant variance \(\sigma^2\).
Take a deep breath. We won't directly use this formula, since we work with Scikit-Learn. However, it is always useful to understand what data you are manipulating, why, and how.
# Now, let's start coding!
# Data from: https://www.kaggle.com/bappekim/air-pollution-in-seoul
# Load the dataset (adjust the path to wherever you saved Measurement_summary.csv)
dataset = pd.read_csv('/Users/Laurine/Documents/Python Scripts/IMAC2/Learn-computer-graphics/Linear-Regression/AirPollutionSeoul/Measurement_summary.csv')
# Let's see our data
dataset.describe()
|  | Station code | Latitude | Longitude | SO2 | NO2 | O3 | CO | PM10 | PM2.5 |
|---|---|---|---|---|---|---|---|---|---|
| count | 647511.000000 | 647511.000000 | 647511.000000 | 647511.000000 | 647511.000000 | 647511.000000 | 647511.000000 | 647511.000000 | 647511.000000 |
| mean | 113.000221 | 37.553484 | 126.989340 | -0.001795 | 0.022519 | 0.017979 | 0.509197 | 43.708051 | 25.411995 |
| std | 7.211315 | 0.053273 | 0.078790 | 0.078832 | 0.115153 | 0.099308 | 0.405319 | 71.137342 | 43.924595 |
| min | 101.000000 | 37.452357 | 126.835151 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
| 25% | 107.000000 | 37.517528 | 126.927102 | 0.003000 | 0.016000 | 0.008000 | 0.300000 | 22.000000 | 11.000000 |
| 50% | 113.000000 | 37.544962 | 127.004850 | 0.004000 | 0.025000 | 0.021000 | 0.500000 | 35.000000 | 19.000000 |
| 75% | 119.000000 | 37.584848 | 127.047470 | 0.005000 | 0.038000 | 0.034000 | 0.600000 | 53.000000 | 31.000000 |
| max | 125.000000 | 37.658774 | 127.136792 | 3.736000 | 38.445000 | 33.600000 | 71.700000 | 3586.000000 | 6256.000000 |
# Same as for the simple linear regression, we need to divide our data into attributes and labels.
# X variable contains all the attributes/features and y variable contains labels.
# Let's say we want to predict the NO2
X = dataset[['Latitude','Longitude', 'SO2','O3', 'CO', 'PM10','PM2.5']].values
y = dataset['NO2'].values
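Before training, it is worth a quick check of the assumption that the features are not too highly correlated with each other; a minimal sketch using the pandas corr() method:
# Pairwise Pearson correlations between the chosen features;
# values close to +1 or -1 would signal problematic multicollinearity
feature_cols = ['Latitude', 'Longitude', 'SO2', 'O3', 'CO', 'PM10', 'PM2.5']
print(dataset[feature_cols].corr())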
# We need to check the average value of NO2
plt.figure(figsize=(10,10))
plt.xlim((0,1))
plt.tight_layout()
# Note: distplot is deprecated in recent seaborn releases; histplot(..., kde=True) is the modern equivalent
seabornInstance.histplot(dataset['NO2'], kde=True)
Looking at the distribution together with the describe() table above, typical NO2 values are small: the mean is about 0.023, and most values fall between 0 and 0.05 (the minimum of -1 looks like a sentinel value for missing measurements).
# Next, we split 70% of the data into the training set and 30% into the test set, using the code below.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Time to train our model!
regressor = LinearRegression()
regressor.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
# We want to see which coefficients our regression algorithm has chosen
print('Latitude','Longitude', 'SO2','O3', 'CO', 'PM10','PM2.5')
print(regressor.coef_)
Latitude Longitude SO2 O3 CO PM10 PM2.5
[-5.63239684e-02 1.01338366e-02 2.52565384e-01 7.52547726e-01
2.02347868e-02 1.35922634e-05 3.33477562e-05]
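To make the output easier to read, you can pair each coefficient with its feature name; a small sketch:
# Pair each coefficient with its feature name for readability
feature_cols = ['Latitude', 'Longitude', 'SO2', 'O3', 'CO', 'PM10', 'PM2.5']
coeff_df = pd.DataFrame(regressor.coef_, index=feature_cols, columns=['Coefficient'])
print(coeff_df)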
A one-unit increase in "Latitude" is associated with a decrease of about \(5.6 \times 10^{-2}\) units of NO2, holding the other features fixed.
# Time to make our predictions:
y_pred = regressor.predict(X_test)
# We need to check the difference between the actual value and the predicted one
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)
df1.plot(kind='bar', figsize=(10,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
Our model is quite bad at predicting here, but it could be worse. Remember: machine learning, once again, is all about training and testing your model.
## Well, now it's time to plot our predictions against the first feature (Latitude)!
plt.scatter(X_test[:,0], y_test, color='darkblue')
## With several input features, the predictions no longer form a single straight line against one feature
plt.plot(X_test[:,0], y_pred, color='red', linewidth=2)
plt.show()
It looks rather chaotic because the prediction depends on several variables, not just Latitude, so the predicted values no longer form a single straight line; still, you can see that the predictions cover the data well.
# It's time to evaluate our algorithm!
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 0.01950224842895457
Mean Squared Error: 0.0013053866895743607
Root Mean Squared Error: 0.03613013547683375
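You can also ask the fitted estimator directly for its \(R^2\) score on the test set:
# score() returns R^2: 1.0 is a perfect fit, 0.0 amounts to always predicting the mean
print('R2 Score:', regressor.score(X_test, y_test))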
The root mean squared error is about 0.036. That is sizable compared to the mean NO2 value (about 0.023), but it is well below the standard deviation of NO2 (about 0.115), so the model still explains most of the variability in the data. We can make reasonably good predictions.
So what if multiple linear regression doesn't work well?#
Maybe you need more data? Maybe you thought your data had a linear relationship when it does not? Maybe your training failed at some point?
There are many pitfalls in machine learning. Remember: training and testing your algorithm is the best thing to do.
Conclusion#
Congrats! You've learned one of the most fundamental machine learning algorithms (thanks to Scikit-Learn).