Week 2

Multiple Feature Linear Regression

Let's assume we have multiple features.

  • $x_1, x_2, x_3, x_4$ represent the individual features

  • $n$ represents the total number of features.

  • $x_j$ represents the $j^{th}$ feature

  • $\vec{x}^{(i)}$ still represents the features of the $i^{th}$ training example, but it is now a row vector with multiple values

  • $x_j^{(i)}$ represents the value of the $j^{th}$ feature of the $i^{th}$ training example

The model is revised as follows:

  • Previously: $f(x) = wx + b$

  • Now: $f(\vec{x}) = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$

Note: this is called multiple linear regression; it is not multivariate regression, which is a different technique.
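As an illustration (a hypothetical housing example with made-up parameter values), a model with four features might look like:

$$f_{\vec{w},b}(\vec{x}) = 0.1x_1 + 4x_2 + 10x_3 + (-2)x_4 + 80$$

where $x_1$ could be the size of the house, $x_2$ the number of bedrooms, $x_3$ the number of floors, and $x_4$ the age of the home; every number here is hypothetical.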

Vectorization

Using vectorization makes your code both shorter and faster to run.

Without vectorization: $f = w_1x_1 + w_2x_2 + w_3x_3 + b$, or more generally $f = \sum_{j=1}^{n} w_jx_j + b$

With vectorization: $f = \vec{w} \cdot \vec{x} + b$. This is much more efficient for large $n$ because of optimizations in the NumPy library (parallel computation as opposed to a for loop).
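A minimal NumPy sketch of both forms (the weight, feature, and bias values below are made up for illustration):

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])    # example parameters (hypothetical values)
x = np.array([10.0, 20.0, 30.0])  # example feature values (hypothetical)
b = 4.0

# Without vectorization: an explicit loop over the n features
f_loop = 0.0
for j in range(w.shape[0]):
    f_loop += w[j] * x[j]
f_loop += b

# With vectorization: NumPy's optimized dot product
f_vec = np.dot(w, x) + b

print(f_loop, f_vec)  # both give the same prediction
```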

Implementing Gradient Descent for Multiple Linear Regression with Vectorization:

  • $\vec{w} = [w_1 \; w_2 \; w_3 \; \dots \; w_n]$ represents the parameters of the model

  • $b$ is a scalar (a single number)

  • $\vec{x} = [x_1 \; x_2 \; x_3 \; \dots \; x_n]$

  • Therefore, $f(\vec{x}) = \vec{w} \cdot \vec{x} + b$

  • Cost function: $J(w_1, \dots, w_n, b) = J(\vec{w}, b)$

  • Gradient Descent algorithm:

    • $w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$

    • $b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$

  • Update rule for multiple features (shown for $j = 1$): $w_1 = w_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_1^{(i)} = w_1 - \alpha \frac{\partial}{\partial w_1} J(\vec{w}, b)$; update every $w_j$ and $b$ simultaneously (see the sketch after this list).
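A possible vectorized sketch of these simultaneous updates (the function name, and treating the training set as an $m \times n$ matrix, are my own assumptions, not the course's code):

```python
import numpy as np

def gradient_descent_step(X, y, w, b, alpha):
    """One simultaneous update of w and b for multiple linear regression.
    X: (m, n) training examples, y: (m,) targets, w: (n,) weights,
    b: scalar bias, alpha: learning rate."""
    m = X.shape[0]
    err = X @ w + b - y          # f_wb(x^(i)) - y^(i) for every example i
    dj_dw = (X.T @ err) / m      # partial derivative of J w.r.t. each w_j
    dj_db = err.sum() / m        # partial derivative of J w.r.t. b
    return w - alpha * dj_dw, b - alpha * dj_db
```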

An alternative to Gradient Descent

The normal equation can be used for linear regression to solve for $w$ and $b$ directly, without iterations.

Its disadvantages are that it doesn't generalize to other learning algorithms, and it is slow when the number of features is large.
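For illustration only, a normal-equation sketch in NumPy (handling the bias by appending a column of ones is my assumption about one common setup, not the course's implementation):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution: theta = (A^T A)^(-1) A^T y, where A is X with a
    column of ones appended so that the last entry of theta is the bias b."""
    A = np.c_[X, np.ones(X.shape[0])]
    theta = np.linalg.solve(A.T @ A, A.T @ y)  # solve instead of explicit inverse
    return theta[:-1], theta[-1]               # w, b
```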

Feature Scaling

If one feature has a large range of values and another has a small range, a good model tends to compensate by learning a small weight for the former and a large weight for the latter. If the features are not on comparable scales, gradient descent will still work, but it will be very slow.

Scaling means applying a transformation to the data so that the features fall into a similar range (often roughly 0 to 1). Rescaling the features to comparable ranges of values helps speed up gradient descent.

Implementation of feature scaling:

  • Divide by maximum: $x_{j,\text{scaled}} = \frac{x_j}{x_{\max}}$

  • Mean normalization: $x_{j,\text{scaled}} = \frac{x_j - \mu_j}{x_{\max} - x_{\min}}$

  • Z-score normalization: $x_{j,\text{scaled}} = \frac{x_j - \mu_j}{\sigma_j}$

  • Aim for about $-1 \le x_j \le 1$, or a similar range, for each feature $x_j$ (see the sketch after this list)
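A minimal z-score normalization sketch in NumPy (assuming X is an $m \times n$ matrix of raw feature values):

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)       # mean of each feature
    sigma = X.std(axis=0)     # standard deviation of each feature
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma to rescale future inputs

# Any new example must be scaled with the same mu and sigma before prediction:
# x_scaled = (x_new - mu) / sigma
```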

Convergence and Learning Rate

If $J(\vec{w}, b)$ decreases by $\le \epsilon$ (e.g., $\epsilon = 0.001$) in one iteration, declare convergence.

Learning rate:

  • If $\alpha$ is too small, gradient descent takes many more iterations to converge

  • If $\alpha$ is too large, the cost function may bounce around and sometimes even increase.

  • Try values such as 0.001, 0.01, 0.1, 1, etc. (see the sketch after this list)
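A rough sketch of an automatic convergence check (the tiny dataset, $\alpha$, and $\epsilon$ below are made up so the snippet runs on its own):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Mean squared error cost J(w, b), with the usual 1/(2m) factor."""
    err = X @ w + b - y
    return (err @ err) / (2 * X.shape[0])

# Tiny made-up dataset, just to make the sketch runnable
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.5])
w, b = np.zeros(X.shape[1]), 0.0
alpha, epsilon = 0.1, 1e-3             # learning rate and convergence threshold

prev_cost = compute_cost(X, y, w, b)
for it in range(10_000):
    err = X @ w + b - y
    w -= alpha * (X.T @ err) / X.shape[0]  # simultaneous update of w ...
    b -= alpha * err.sum() / X.shape[0]    # ... and b
    cost = compute_cost(X, y, w, b)
    if prev_cost - cost <= epsilon:        # J barely decreased: declare convergence
        print(f"Converged after {it + 1} iterations, cost = {cost:.4f}")
        break
    prev_cost = cost
```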

Feature engineering is where you use intuition to design new features by transforming or combining original features.
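For example (with made-up lot measurements), two raw features could be combined into a new, more informative one:

```python
import numpy as np

# Hypothetical housing features: lot frontage (width) and depth
frontage = np.array([40.0, 30.0, 55.0])  # made-up values
depth    = np.array([20.0, 35.0, 25.0])  # made-up values

area = frontage * depth                  # engineered feature: x3 = x1 * x2
X = np.c_[frontage, depth, area]         # train on original + engineered features
```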
