Gradient Descent for Linear Regression from Scratch
Let's find the best parameters for linear regression using gradient descent optimisation.
Introduction
Gradient descent (GD) is an optimization algorithm used in machine learning to find the parameters that minimize a loss function by iteratively moving toward its minimum. The algorithm is explained in detail in this gradient descent blog.
Gradient descent is a fundamental concept in machine learning and is widely used to obtain optimal parameters. In this blog, we will implement gradient descent from scratch for linear regression to find the optimal parameters w and b.
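Concretely, each parameter is repeatedly updated in the direction opposite to the gradient of the loss, scaled by a learning rate:
$$\theta \leftarrow \theta - \eta \,\frac{\partial L}{\partial \theta}$$
where $\theta$ stands for either parameter (w or b) and $\eta$ is the learning rate. This is the update rule implemented below.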
Linear Regression
Linear regression is an algorithm that finds the line (or hyperplane) that best fits the given data. The model predicts y from x as follows:
$$y = f(x) = w^T x + b$$
The error for a given input $x_i$, where $\hat y_i$ is the corresponding prediction, is as follows:
$$e_i = y_i - \hat y_i$$
To get the best-fit line, the sum of all the errors must be minimized across the training data as follows:
$$\min \sum_{i} e_i$$
Some errors may be positive and some negative, so they can cancel each other out. To avoid this, we take the sum of squared errors instead.
$$\min \sum_{i} e_i^2 = \min \sum_{i} (y_i - \hat y_i)^2$$
The loss function that must be optimized to find the best parameters of linear regression is as follows:
$$(w^*, b^*) = \operatorname{argmin}_{w,b} \sum_{i=1}^{N} \left( y_i - (w^T x_i + b) \right)^2$$
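To apply gradient descent, we need the partial derivatives of this loss with respect to w and b. For the single-feature case used in the code below, they are:
$$\frac{\partial L}{\partial w} = \sum_{i=1}^{N} -2\,x_i\left(y_i - (w x_i + b)\right), \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{N} -2\left(y_i - (w x_i + b)\right)$$
These are exactly the quantities accumulated as dl_dw and dl_db in the implementation.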
Implementing GD for the above Loss Function
Import the libraries and initialize x and y
import numpy as np
# initialize x and y
x = np.random.randn(10,1)
y = 6*x + np.random.rand()
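Note that np.random.rand() called with no arguments returns a single scalar, so the same offset is added to every point. The generated data therefore lie exactly on a line with slope 6, and that constant offset plays the role of the intercept b that the model should recover.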
Initialize the parameters and set the learning rate hyperparameter
# initialize parameters (w, b)
w = np.random.rand(1)[0]
b = np.random.rand(1)[0]
# learning rate
lr = 0.01
Write the logic for gradient descent
# gradient descent logic
def grad_desc(x, y, w, b, lr):
    # initialize gradient accumulators
    dl_dw = 0
    dl_db = 0
    N = x.shape[0]
    for xi, yi in zip(x, y):
        # partial derivatives of the squared error for one point
        dl_dw += -2 * xi * (yi - (w*xi + b))
        dl_db += -2 * (yi - (w*xi + b))
    # update equations (gradients averaged over the N points)
    w = w - lr * dl_dw * (1/N)
    b = b - lr * dl_db * (1/N)
    return w, b
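As an aside, the same step can be written without the explicit Python loop by letting NumPy operate on the whole array at once. The following is a minimal sketch (the name grad_desc_vec is ours, not part of the original code) and should produce the same updates as grad_desc:
# vectorised gradient descent step (sketch)
def grad_desc_vec(x, y, w, b, lr):
    N = x.shape[0]
    yhat = w * x + b                            # predictions for all points at once
    dl_dw = (-2 / N) * np.sum(x * (y - yhat))   # averaged dL/dw
    dl_db = (-2 / N) * np.sum(y - yhat)         # averaged dL/db
    w = w - lr * dl_dw
    b = b - lr * dl_db
    return w, b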
Call the gradient descent step at each epoch to find the best parameters
# update
def update(epochs, x, y, w, b, lr):
    for epoch in range(epochs):
        w, b = grad_desc(x, y, w, b, lr)
        yhat = (w * x) + b
        loss = np.divide(np.sum((y - yhat)**2, axis=0), x.shape[0])
        if loss < 0.0001:
            print(f"minima is found at {epoch}th epoch and has loss of {loss}")
            print(f"optimal parameters are w:{w}, b:{b}")
            break
Run the code and find the optimal parameters that minimize the loss function
update(50000,x,y,w,b,lr)
The output obtained is as follows:
minima is found at 386th epoch and has loss of [9.81000523e-05]
optimal parameters are w:[5.98897974], b:[0.49323804]
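As expected, the recovered slope is close to the true value of 6, and the learned intercept corresponds to the constant offset that np.random.rand() added when generating y.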
Conclusion
Gradient descent is a simple yet powerful optimisation algorithm, and as we saw in this article, it is also straightforward to implement. The beauty of the algorithm is that we start with random values for the parameters and end up with their optimal values.
With some minor changes to the above code, stochastic gradient descent, a variant of gradient descent that computes each update from a single randomly chosen sample (or a small batch), can be implemented, as sketched below. This can reduce the number of computations required to find the minimum.
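For illustration, a minimal single-sample update could look like the following (the name sgd_step is ours and not part of the original code):
# stochastic gradient descent step: one random point per update (sketch)
def sgd_step(x, y, w, b, lr):
    i = np.random.randint(x.shape[0])        # pick a single random sample
    xi, yi = x[i], y[i]
    dl_dw = -2 * xi * (yi - (w * xi + b))    # gradient from this point only
    dl_db = -2 * (yi - (w * xi + b))
    w = w - lr * dl_dw
    b = b - lr * dl_db
    return w, b
Calling sgd_step inside the epoch loop in place of grad_desc turns the implementation into stochastic gradient descent.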