Linear Regression Understanding by CHIRAG

 Linear Regression

1.) Linear regression is a supervised learning algorithm used when the target / dependent variable is a continuous real number.

2.) Linear regression is a type of statistical analysis used to model the relationship between dependent and independent variables using a BEST FIT LINE.

3.) It works on the principle of ordinary least squares (OLS) / mean squared error (MSE).

Simple Linear Regression

In a simple linear regression, there is one independent variable and one dependent variable. The model estimates the slope and intercept of the line of best fit, which represents the relationship between the variables. The slope represents the change in the dependent variable for each unit change in the independent variable, while the intercept represents the predicted value of the dependent variable when the independent variable is zero.



Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε

y = Dependent variable (target variable)

x = Independent variable (predictor variable)

a0 = Intercept of the line (gives an additional degree of freedom)

a1 = Linear regression coefficient (the scale factor applied to the input value)

ε = Random error
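
To make these symbols concrete, here is a minimal sketch of fitting a simple linear regression with scikit-learn; the data points below are invented purely for illustration:

```python
# Minimal simple linear regression sketch (toy data, invented for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # independent variable (2-D array for sklearn)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])            # dependent variable

model = LinearRegression().fit(x, y)

print("a0 (intercept):", model.intercept_)  # predicted y when x = 0
print("a1 (coefficient):", model.coef_[0])  # change in y per unit change in x
```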

What Is Multiple Linear Regression?

Multiple Linear Regression (MLR) basically means that we have several input features, such as f1, f2, f3, and f4, and an output feature f5. Taking a house-price example, suppose:

f1 is the size of the house,

f2 is the number of bedrooms in the house,

f3 is the locality of the house,

f4 is the condition of the house, and

f5 is our output feature, which is the price of the house.

Now you can see that multiple independent features make a huge impact on the price of the house, meaning the price can vary from feature to feature. When we move to multiple linear regression, the simple linear regression equation y = A + Bx is extended to something like:

y = A + B1x1 + B2x2 + B3x3 + B4x4

“If we have one dependent feature and multiple independent features, then we basically call it multiple linear regression.”

Now, our aim in using multiple linear regression is to compute A, which is the intercept, and the key parameters B1, B2, B3, and B4, which are the slopes or coefficients corresponding to the independent features. This basically indicates that if we increase the value of x1 by 1 unit, then B1 tells us how much it will affect the price of the house; B2, B3, and B4 work similarly for their features.
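
A short sketch of the same idea in code; the feature values and prices below are made up only to show how A and B1..B4 are obtained:

```python
# Multiple linear regression sketch with four features (toy numbers, for illustration only).
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: f1 = size, f2 = bedrooms, f3 = locality score, f4 = condition score
X = np.array([
    [1200, 2, 7, 8],
    [1500, 3, 6, 7],
    [1700, 3, 8, 9],
    [2000, 4, 9, 8],
    [2500, 4, 9, 9],
], dtype=float)
y = np.array([200_000, 250_000, 290_000, 340_000, 400_000], dtype=float)  # f5 = price

model = LinearRegression().fit(X, y)

print("A (intercept):", model.intercept_)
print("B1..B4 (coefficients):", model.coef_)  # price change per unit change in each feature
```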

What Is the Cost Function For Linear Regression?

When working with linear regression, we aim to find the best line that fits the training data. The cost function measures the difference between the predicted values of the model and the actual target values. By minimizing this cost function, we can determine the optimal values for the model’s parameters and improve its performance.

For linear regression, we use Mean Squared Error (MSE) as the cost function, which is the average of the squared errors between the predicted values and the actual values.

For the above linear equation, MSE can be calculated as:

MSE = (1/n) Σ (Yi − (Wxi + b))²

Where,

n = total number of observations
Yi = actual value
(Wxi + b) = predicted value
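
As a quick sketch, the same formula can be written directly in code; the data and the candidate lines here are illustrative:

```python
# MSE sketch: average squared difference between actual and predicted values (toy data).
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])

w, b = 1.0, 0.0                 # candidate line y_hat = w*x + b
print(mse(y, w * x + b))        # 0.0 for this perfect fit
print(mse(y, 0.5 * x + b))      # larger error for a worse slope
```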







As we can see, we get the least value of the cost function when w = 1. You can compute J for other values of w and plot the different values of the cost function; you will observe a curve (in the first quadrant) and find that the best value of the cost function occurs at w = 1.

IF OUR COST FUNCTION IS 0 ON THE TRAINING DATA, THEN THIS IS A CASE OF OVERFITTING. THIS PROBLEM CAN BE SOLVED BY RIDGE OR LASSO REGRESSION TECHNIQUES.
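
Both regularized variants are available in scikit-learn; here is a minimal sketch (the penalty strengths alpha below are illustrative, not tuned values):

```python
# Ridge (L2 penalty) and Lasso (L1 penalty) sketch to reduce overfitting (toy data).
import numpy as np
from sklearn.linear_model import Ridge, Lasso

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)  # can set some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```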

What is Gradient Descent?

Gradient descent is an optimization algorithm used in machine learning to minimize the cost function by iteratively adjusting parameters in the direction of the negative gradient, aiming to find the optimal set of parameters.


(In simple words, we have to find the optimal values of the slope (m) and intercept (b) at which our cost function is minimum.)







The goal of the gradient descent algorithm is to minimize the given function (say cost function). To achieve this goal, it performs two steps iteratively:

  1. Compute the gradient (slope), the first-order derivative of the function at that point
  2. Take a step (move) in the direction opposite to the gradient, i.e., opposite to the direction in which the slope increases, from the current point, by alpha times the gradient at that point
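
Here is a minimal sketch of these two steps applied to simple linear regression; the toy data, the fixed learning rate alpha, and the number of iterations are all chosen just for illustration:

```python
# Gradient descent sketch for y ≈ w*x + b using the MSE cost (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # true relation is y = 2x

w, b = 0.0, 0.0   # initial parameters
alpha = 0.01      # learning rate
n = len(x)

for _ in range(1000):
    y_pred = w * x + b
    # Step 1: gradient of MSE with respect to w and b.
    dw = (-2.0 / n) * np.sum(x * (y - y_pred))
    db = (-2.0 / n) * np.sum(y - y_pred)
    # Step 2: move opposite to the gradient, scaled by alpha.
    w -= alpha * dw
    b -= alpha * db

print("w:", round(w, 3), "b:", round(b, 3))  # approaches w = 2, b = 0
```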

Types of Gradient Descent

The choice of gradient descent algorithm depends on the problem at hand and the size of the dataset. Batch gradient descent is suitable for small datasets, while stochastic gradient descent is more suitable for large datasets. Mini-batch gradient descent is a good compromise between the two and is often used in practice.

Batch Gradient Descent

Batch gradient descent updates the model’s parameters using the gradient of the entire training set. It calculates the average gradient of the cost function for all the training examples and updates the parameters in the opposite direction. Batch gradient descent guarantees convergence to the global minimum, but can be computationally expensive and slow for large datasets.

Stochastic Gradient Descent

Stochastic gradient descent updates the model’s parameters using the gradient of one training example at a time. It randomly selects a training example, computes the gradient of the cost function for that example, and updates the parameters in the opposite direction. Stochastic gradient descent is computationally efficient and can converge faster than batch gradient descent. However, it can be noisy and may not converge to the global minimum.

Mini-Batch Gradient Descent

Mini-batch gradient descent updates the model’s parameters using the gradient of a small subset of the training set, known as a mini-batch. It calculates the average gradient of the cost function for the mini-batch and updates the parameters in the opposite direction. Mini-batch gradient descent combines the advantages of both batch and stochastic gradient descent, and is the most commonly used method in practice. It is computationally efficient and less noisy than stochastic gradient descent, while still being able to converge to a good solution.
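
A hedged sketch of the mini-batch variant, reusing the same toy update rule as above; the batch size, learning rate, and number of epochs are illustrative choices:

```python
# Mini-batch gradient descent sketch for y ≈ w*x + b (toy data, illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 9.0)                                  # 1..8
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)   # noisy line, roughly y = 2x + 1

w, b = 0.0, 0.0
alpha, batch_size, epochs = 0.01, 4, 500

for _ in range(epochs):
    idx = rng.permutation(len(x))                        # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        y_pred = w * xb + b
        dw = (-2.0 / len(xb)) * np.sum(xb * (yb - y_pred))  # average gradient of the mini-batch
        db = (-2.0 / len(xb)) * np.sum(yb - y_pred)
        w -= alpha * dw
        b -= alpha * db

print("w:", round(w, 2), "b:", round(b, 2))               # close to 2 and 1
```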

Alpha – The Learning Rate

We have the direction we want to move in, now we must decide the size of the step we must take.

*It must be chosen carefully so that we end up at the (local) minimum.

  • If the learning rate is too high, we might OVERSHOOT the minima and keep bouncing, without reaching the minima
  • If the learning rate is too small, the training might turn out to be too long

  a) Learning rate is optimal: the model converges to the minimum
  b) Learning rate is too small: it takes more time but converges to the minimum
  c) Learning rate is higher than the optimal value: it overshoots but converges (1/C < η < 2/C)
  d) Learning rate is very large: it overshoots and diverges, moves away from the minima, and performance on learning decreases

Note: As the gradient decreases while moving towards the local minima, the size of the step decreases. So, the learning rate (alpha) can be constant over the optimization and need not be varied iteratively.
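
A small sketch of this behaviour on a one-parameter toy problem, with a few illustrative learning rates: a reasonable rate converges, a very small one is still on its way after the same number of steps, and a too-large one overshoots and diverges.

```python
# Sketch: effect of the learning rate on gradient descent for y ≈ w*x (toy, one parameter).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                         # the minimum of the cost is at w = 2

def run(alpha, steps=100):
    w, n = 0.0, len(x)
    for _ in range(steps):
        dw = (-2.0 / n) * np.sum(x * (y - w * x))  # gradient of MSE with respect to w
        w -= alpha * dw
    return w

for alpha in (0.001, 0.01, 0.1):    # illustrative: too small, reasonable, too large
    print(f"alpha={alpha}: w -> {run(alpha):.3f}")
```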

Local Minima

The cost function may have many minimum points. The gradient may settle at any one of the minima, depending on the initial point (i.e., the initial parameters theta) and the learning rate. Therefore, the optimization may converge to different points for different starting points and learning rates.

==================================================

Hope you guys like this article. Stay connected for more updates!

THANK YOU 

CHIRAG GUPTA

