Week 1

Overview of Machine Learning

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed

  • Supervised Machine Learning

  • Unsupervised Machine Learning

Supervised Machine Learning

  • Algorithms that learn from $x \to y$, or input-to-output, mappings

  • They are given "right answers" to learn from, meaning they are given the correct label y for each input x.

Regression

  • Using inputs x and output labels y to predict continuous values.

  • Fitting a line (or a more complex curve) to the data (see the sketch after this list)

  • The task of the learning algorithm is to produce more of these "right" answers, especially for new inputs it has not seen before

  • Example: Housing price prediction
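
As a quick illustration of fitting a line to data, here is a minimal sketch using NumPy's polyfit; the house sizes and prices are made-up numbers, not from the course.

```python
import numpy as np

# Hypothetical training data: house sizes (1000s of sqft) and prices ($1000s)
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([300.0, 420.0, 500.0, 610.0, 720.0])

# Fit a straight line y ~ w*x + b (a degree-1 polynomial) to the data
w, b = np.polyfit(x, y, deg=1)

# Regression predicts a continuous value for a new input
print(f"Predicted price for 1.8 (1000 sqft): {w * 1.8 + b:.1f} ($1000s)")
```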

Classification

  • Using given inputs, classify into output categories

  • Classification algorithms predict categories

  • Categories can be numeric or non-numeric. Either way, the prediction comes from a discrete, finite set of possible outputs (see the toy sketch after this list).

  • May also have multiple inputs

  • Example: Cancer cell detection
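
As a toy illustration of predicting a category from an input (not an algorithm from the course), here is a hypothetical rule-based classifier; the 3.0 cm threshold is invented for the example.

```python
# Hypothetical threshold classifier: maps a continuous input (tumor size)
# to one of a finite set of categories. The 3.0 cm cutoff is made up.
def classify_tumor(size_cm: float) -> str:
    return "malignant" if size_cm >= 3.0 else "benign"

# The prediction is always one of a small, discrete set of outputs
print(classify_tumor(1.2))  # benign
print(classify_tumor(4.5))  # malignant
```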

Unsupervised Machine Learning

  • The algorithm learns by itself; it is not given the output labels y

  • The goal is to find patterns or other interesting structure in the data

Clustering

  • An algorithm that groups the data into clusters based on common patterns it finds (a sketch follows this list)

  • Example: Google News, Market Segmentation
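
A minimal clustering sketch, assuming scikit-learn is available; the two-feature customer data is hypothetical and only meant to mimic a market-segmentation setup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, unlabeled customer data: [annual spend ($1000s), visits per month]
X = np.array([[1.0, 2], [1.2, 3], [8.0, 10], [7.5, 12], [3.0, 5], [3.2, 6]])

# Group the data into 3 clusters based on similarity; no y labels are given
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster index assigned to each customer
```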

Anomaly Detection

  • Detect unusual events in data

  • Example: Detect fraud in the financial system
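
One simple way to flag unusual events, shown as a hedged sketch rather than the course's method: mark values that lie far from the mean. The transaction amounts are made up.

```python
import numpy as np

# Hypothetical transaction amounts; the last one is unusually large
amounts = np.array([12.0, 9.5, 14.2, 11.0, 10.3, 950.0])

# Flag transactions more than 2 standard deviations away from the mean
mean, std = amounts.mean(), amounts.std()
anomalies = amounts[np.abs(amounts - mean) > 2 * std]
print(anomalies)  # [950.]
```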

Dimensionality Reduction

  • Reduce a big data set to a smaller one by compressing it while losing as little information as possible
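
A minimal sketch using PCA from scikit-learn, one common dimensionality-reduction technique (used here purely for illustration; the data is random).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 examples, each with 50 features
X = np.random.default_rng(0).normal(size=(100, 50))

# Compress to 2 dimensions while keeping as much variance as possible
X_small = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_small.shape)  # (100, 50) -> (100, 2)
```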

Useful Terminology

  • Training set: Dataset used to train the model

  • Input: The feature or input variable $x$

  • Output: The target or output variable $y$ that you are trying to predict

  • Training example: $(x^{(i)}, y^{(i)})$, representing the $i^{th}$ row of the training set (see the indexing sketch after this list)
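
In code, a training set is just paired arrays of inputs and targets, and indexing row i gives the training example $(x^{(i)}, y^{(i)})$; the housing numbers below are hypothetical.

```python
import numpy as np

# Hypothetical training set: house sizes (inputs x) and prices (targets y)
x_train = np.array([1.0, 1.5, 2.0, 2.5])          # input / feature variable x
y_train = np.array([300.0, 420.0, 500.0, 610.0])  # target / output variable y

m = x_train.shape[0]  # number of training examples
i = 1                 # Python indexing starts at 0
print(x_train[i], y_train[i])  # the training example (x^(i), y^(i))
```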

Regression Model

Linear Regression

  • Fitting a straight line to the data

  • To train the model, you feed the training set (the input features and targets) to the learning algorithm. It then produces a function $f(x)$, also called a hypothesis function.

  • We can use the model by giving the input x to f such that $f(x) \to \hat{y}$, where $\hat{y}$ is the predicted y value.

  • $f(x) = wx + b$. The values of w and b determine the prediction $\hat{y}$, and different values will output different predictions (see the sketch after this list).

  • This is linear regression with one variable, or one input feature x, also called Univariate Linear Regression.
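
A minimal sketch of the univariate model $f(x) = wx + b$; the parameter values here are arbitrary placeholders, since training (below) is what actually chooses w and b.

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear regression model f(x) = w*x + b; returns y_hat."""
    return w * x + b

# Arbitrary example parameters and inputs
w, b = 200.0, 100.0
x_train = np.array([1.0, 1.5, 2.0])
print(predict(x_train, w, b))  # predicted y_hat for each input
```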

Cost Function

  • To implement Linear Regression, we need to define a cost function

  • It tells us how well the model is doing, so we can refine it

  • The model is represented by $f(x) = wx + b$, where w and b are the parameters, also called coefficients or weights.

  • w represents the slope of the line, while b represents the y-intercept.

  • How do we tell whether the line "fits the data" or not? Intuitively, a good fit is a line that passes through or close to the training examples.

  • Model notation: $\hat{y}^{(i)} = wx^{(i)} + b$

  • The cost function measures how close $\hat{y}^{(i)}$ is to $y^{(i)}$ by calculating the error $(\hat{y}^{(i)} - y^{(i)})$

  • Final cost function: $J(w, b) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2$, where m is the number of training examples. We want to find the values of w and b which minimize this function (implemented in the sketch after this list).

  • If you visualize the cost function over w and b, you get a 3D surface plot. You can look at it from above via a contour plot by slicing it. For more intuition, look up visualizations of the cost function.
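
A direct implementation of the squared-error cost $J(w, b) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2$, as a sketch; the variable names and toy data are my own.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) over the m training examples."""
    m = x.shape[0]
    y_hat = w * x + b                         # predictions for every example
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Toy check: a line that fits the data exactly has zero cost
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0 (perfect fit)
print(compute_cost(x_train, y_train, w=150.0, b=100.0))  # larger cost
```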

Gradient Descent

  • Gradient Descent applies not just to Linear Regression, but also to more general functions $J(w_1, w_2, \dots, w_n, b)$.

  • Gradient Descent Intuition

    • Have some function $J(w, b)$

    • Want to minimize it

  • Outline of the process

    • Start with some initial w, b (e.g., set w = 0, b = 0)

    • Keep changing w, b to reduce $J(w, b)$

    • Until we settle at or near a minimum

    • For non-convex functions, you may end up at a local minimum

  • Implementation

    • On each step, w is updated to $w = w - \alpha \frac{\partial}{\partial w}J(w, b)$

    • $\alpha$ refers to the learning rate. It is always a positive number; a negative learning rate would push w in the direction that increases the cost.

    • b is updated to $b = b - \alpha \frac{\partial}{\partial b}J(w, b)$

    • To simultaneously update w and b,

      • Set a temporary variable $tmp_w = w - \alpha \frac{\partial}{\partial w}J(w, b)$

      • Set another temporary variable $tmp_b = b - \alpha \frac{\partial}{\partial b}J(w, b)$

      • Update w and b to the new values only after computing both $tmp_w$ and $tmp_b$

    • Near a local minimum,

      • Derivative becomes smaller

      • Update steps become smaller

      • So gradient descent can converge to a local minimum even with a fixed learning rate (there is no need to shrink $\alpha$ over time)

    • Gradient Descent for Linear Regression

      • $w = w - \alpha \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)})\, x^{(i)}$

      • $b = b - \alpha \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)})$

      • Repeat until convergence

    • Batch Gradient Descent: On every step of gradient descent we look at all the training examples instead of a subset; we compute the sum over all m training examples. This is expensive for large training sets. Other versions exist which use smaller subsets at each step. A from-scratch sketch of batch gradient descent for linear regression follows this list.
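
Putting the pieces together, a from-scratch sketch of batch gradient descent for univariate linear regression with the simultaneous update described above; the learning rate, iteration count, and toy data are arbitrary choices.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=5000):
    """Batch gradient descent for f(x) = w*x + b, starting from w = 0, b = 0."""
    m = x.shape[0]
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = (w * x + b) - y                    # f(x^(i)) - y^(i) for all examples
        tmp_w = w - alpha * np.sum(err * x) / m  # step along the derivative w.r.t. w
        tmp_b = b - alpha * np.sum(err) / m      # step along the derivative w.r.t. b
        w, b = tmp_w, tmp_b                      # simultaneous update
    return w, b

# Toy data generated from y = 2x + 1; gradient descent should recover w ~ 2, b ~ 1
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = 2 * x_train + 1
w, b = gradient_descent(x_train, y_train)
print(w, b)
```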
