Week 2

Multiple Feature Linear Regression

Let's assume we have multiple features.

  • $x_1, x_2, x_3, x_4$ represent the individual features

  • $n$ represents the total number of features.

  • $x_j$ represents the $j^{th}$ feature

  • $\vec{x}^{(i)}$ still represents the features of the $i^{th}$ training example, but it is now a row vector with multiple values

  • $x_j^{(i)}$ represents the value of the $j^{th}$ feature of the $i^{th}$ training example

The model is revised as follows:

  • Previously: $f(x) = wx + b$

  • Now: $f(\vec{x}) = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$

Note: this is called multiple linear regression; it is not multivariate regression, which is a different technique.
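As an illustration (a hypothetical housing example with made-up parameter values), a model with four features might look like:

$$f_{\vec{w},b}(\vec{x}) = 0.1x_1 + 4x_2 + 10x_3 + (-2)x_4 + 80$$

where $x_1$ could be the size of the house, $x_2$ the number of bedrooms, $x_3$ the number of floors, and $x_4$ the age of the home; every number here is hypothetical.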

Vectorization

Using vectorization makes your code both shorter and faster to run.

Without vectorization: $f = w_1x_1 + w_2x_2 + w_3x_3 + b$, or more generally $f = \sum_{j=1}^{n} w_jx_j + b$

With vectorization: $f = \vec{w} \cdot \vec{x} + b$. This is much more efficient for large $n$ because of optimizations in the NumPy library (parallel computation as opposed to a for loop).
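A minimal NumPy sketch of both forms (the weight, feature, and bias values below are made up for illustration):

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])    # example parameters (hypothetical values)
x = np.array([10.0, 20.0, 30.0])  # example feature values (hypothetical)
b = 4.0

# Without vectorization: an explicit loop over the n features
f_loop = 0.0
for j in range(w.shape[0]):
    f_loop += w[j] * x[j]
f_loop += b

# With vectorization: NumPy's optimized dot product
f_vec = np.dot(w, x) + b

print(f_loop, f_vec)  # both give the same prediction
```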

Implementing Gradient Descent for Multiple Linear Regression with Vectorization:

  • $\vec{w} = [w_1 \; w_2 \; w_3 \; \dots \; w_n]$ represents the parameters of the model

  • $b$ is a scalar (a single number)

  • $\vec{x} = [x_1 \; x_2 \; x_3 \; \dots \; x_n]$

  • Therefore, $f(\vec{x}) = \vec{w} \cdot \vec{x} + b$

  • Cost function: $J(w_1, \dots, w_n, b) = J(\vec{w}, b)$

  • Gradient Descent algorithm:

    • $w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$

    • $b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$

  • Update rule for multiple features (shown for $j = 1$): $w_1 = w_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_1^{(i)} = w_1 - \alpha \frac{\partial}{\partial w_1} J(\vec{w}, b)$; update every $w_j$ and $b$ simultaneously (see the sketch after this list).
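A possible vectorized sketch of these simultaneous updates (the function name, and treating the training set as an $m \times n$ matrix, are my own assumptions, not the course's code):

```python
import numpy as np

def gradient_descent_step(X, y, w, b, alpha):
    """One simultaneous update of w and b for multiple linear regression.
    X: (m, n) training examples, y: (m,) targets, w: (n,) weights,
    b: scalar bias, alpha: learning rate."""
    m = X.shape[0]
    err = X @ w + b - y          # f_wb(x^(i)) - y^(i) for every example i
    dj_dw = (X.T @ err) / m      # partial derivative of J w.r.t. each w_j
    dj_db = err.sum() / m        # partial derivative of J w.r.t. b
    return w - alpha * dj_dw, b - alpha * dj_db
```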

An alternative to Gradient Descent

The normal equation can be used for linear regression to solve for $w$ and $b$ directly, without iterations.

Its disadvantages are that it doesn't generalize to other learning algorithms, and it is slow when the number of features is large.
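For illustration only, a normal-equation sketch in NumPy (handling the bias by appending a column of ones is my assumption about one common setup, not the course's implementation):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution: theta = (A^T A)^(-1) A^T y, where A is X with a
    column of ones appended so that the last entry of theta is the bias b."""
    A = np.c_[X, np.ones(X.shape[0])]
    theta = np.linalg.solve(A.T @ A, A.T @ y)  # solve instead of explicit inverse
    return theta[:-1], theta[-1]               # w, b
```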

Feature Scaling

If one feature has a large range of values and another has a small range, a good model tends to compensate by learning a small weight for the former and a large weight for the latter. If the features are not on comparable scales, gradient descent will still work, but it will be very slow.

Scaling means applying a transformation to the data so that the features fall into a similar range (often roughly 0 to 1). Rescaling the features to comparable ranges of values helps speed up gradient descent.

Implementation of feature scaling:

  • Divide by maximum: $x_{j,\text{scaled}} = \frac{x_j}{x_{\max}}$

  • Mean normalization: $x_{j,\text{scaled}} = \frac{x_j - \mu_j}{x_{\max} - x_{\min}}$

  • Z-score normalization: $x_{j,\text{scaled}} = \frac{x_j - \mu_j}{\sigma_j}$

  • Aim for about $-1 \le x_j \le 1$, or a similar range, for each feature $x_j$ (see the sketch after this list)
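A minimal z-score normalization sketch in NumPy (assuming X is an $m \times n$ matrix of raw feature values):

```python
import numpy as np

def zscore_normalize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)       # mean of each feature
    sigma = X.std(axis=0)     # standard deviation of each feature
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma to rescale future inputs

# Any new example must be scaled with the same mu and sigma before prediction:
# x_scaled = (x_new - mu) / sigma
```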

Convergence and Learning Rate

If $J(\vec{w}, b)$ decreases by $\le \epsilon$ (e.g., $\epsilon = 0.001$) in one iteration, declare convergence.

Learning rate:

  • If $\alpha$ is too small, gradient descent takes many more iterations to converge

  • If $\alpha$ is too large, the cost function may bounce around and sometimes even increase.

  • Try values such as 0.001, 0.01, 0.1, 1, etc. (see the sketch after this list)
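A rough sketch of an automatic convergence check (the tiny dataset, $\alpha$, and $\epsilon$ below are made up so the snippet runs on its own):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Mean squared error cost J(w, b), with the usual 1/(2m) factor."""
    err = X @ w + b - y
    return (err @ err) / (2 * X.shape[0])

# Tiny made-up dataset, just to make the sketch runnable
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.5])
w, b = np.zeros(X.shape[1]), 0.0
alpha, epsilon = 0.1, 1e-3             # learning rate and convergence threshold

prev_cost = compute_cost(X, y, w, b)
for it in range(10_000):
    err = X @ w + b - y
    w -= alpha * (X.T @ err) / X.shape[0]  # simultaneous update of w ...
    b -= alpha * err.sum() / X.shape[0]    # ... and b
    cost = compute_cost(X, y, w, b)
    if prev_cost - cost <= epsilon:        # J barely decreased: declare convergence
        print(f"Converged after {it + 1} iterations, cost = {cost:.4f}")
        break
    prev_cost = cost
```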

Feature engineering is where you use intuition to design new features by transforming or combining original features.
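For example (with made-up lot measurements), two raw features could be combined into a new, more informative one:

```python
import numpy as np

# Hypothetical housing features: lot frontage (width) and depth
frontage = np.array([40.0, 30.0, 55.0])  # made-up values
depth    = np.array([20.0, 35.0, 25.0])  # made-up values

area = frontage * depth                  # engineered feature: x3 = x1 * x2
X = np.c_[frontage, depth, area]         # train on original + engineered features
```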
