Week 1

Overview of Machine Learning

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed

  • Supervised Machine Learning

  • Unsupervised Machine Learning

Supervised Machine Learning

  • Algorithms that learn from $x \to y$, or input-to-output, mappings

  • They are given "right answers" to learn from, meaning they are given the correct label y for each input x.

Regression

  • Using inputs x and output labels y to predict continuous values.

  • Fitting a line (or a more complex curve) to the data (see the sketch after this list)

  • The task of the learning algorithm is to produce more of these "right" answers, especially for new inputs it has not seen before

  • Example: Housing price prediction
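
As a quick illustration of fitting a line to data, here is a minimal sketch using NumPy's polyfit; the house sizes and prices are made-up numbers, not from the course.

```python
import numpy as np

# Hypothetical training data: house sizes (1000s of sqft) and prices ($1000s)
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([300.0, 420.0, 500.0, 610.0, 720.0])

# Fit a straight line y ~ w*x + b (a degree-1 polynomial) to the data
w, b = np.polyfit(x, y, deg=1)

# Regression predicts a continuous value for a new input
print(f"Predicted price for 1.8 (1000 sqft): {w * 1.8 + b:.1f} ($1000s)")
```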

Classification

  • Using given inputs, classify into output categories

  • Classification algorithms predict categories

  • Categories can be numeric or non-numeric. Either way, the prediction comes from a discrete, finite set of possible outputs (see the toy sketch after this list).

  • May also have multiple inputs

  • Example: Cancer cell detection
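
As a toy illustration of predicting a category from an input (not an algorithm from the course), here is a hypothetical rule-based classifier; the 3.0 cm threshold is invented for the example.

```python
# Hypothetical threshold classifier: maps a continuous input (tumor size)
# to one of a finite set of categories. The 3.0 cm cutoff is made up.
def classify_tumor(size_cm: float) -> str:
    return "malignant" if size_cm >= 3.0 else "benign"

# The prediction is always one of a small, discrete set of outputs
print(classify_tumor(1.2))  # benign
print(classify_tumor(4.5))  # malignant
```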

Unsupervised Machine Learning

  • The algorithm learns by itself; it is not given the output labels y

  • The goal is to find patterns or other interesting structure in the data

Clustering

  • An algorithm that groups the data into clusters based on common patterns it finds (a sketch follows this list)

  • Example: Google News, Market Segmentation
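
A minimal clustering sketch, assuming scikit-learn is available; the two-feature customer data is hypothetical and only meant to mimic a market-segmentation setup.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, unlabeled customer data: [annual spend ($1000s), visits per month]
X = np.array([[1.0, 2], [1.2, 3], [8.0, 10], [7.5, 12], [3.0, 5], [3.2, 6]])

# Group the data into 3 clusters based on similarity; no y labels are given
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster index assigned to each customer
```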

Anomaly Detection

  • Detect unusual events in data

  • Example: Detect fraud in the financial system
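
One simple way to flag unusual events, shown as a hedged sketch rather than the course's method: mark values that lie far from the mean. The transaction amounts are made up.

```python
import numpy as np

# Hypothetical transaction amounts; the last one is unusually large
amounts = np.array([12.0, 9.5, 14.2, 11.0, 10.3, 950.0])

# Flag transactions more than 2 standard deviations away from the mean
mean, std = amounts.mean(), amounts.std()
anomalies = amounts[np.abs(amounts - mean) > 2 * std]
print(anomalies)  # [950.]
```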

Dimensionality Reduction

  • Reduce a big data set to a smaller one by compressing it while losing as little information as possible
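
A minimal sketch using PCA from scikit-learn, one common dimensionality-reduction technique (used here purely for illustration; the data is random).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 examples, each with 50 features
X = np.random.default_rng(0).normal(size=(100, 50))

# Compress to 2 dimensions while keeping as much variance as possible
X_small = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_small.shape)  # (100, 50) -> (100, 2)
```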

Useful Terminology

  • Training set: Dataset used to train the model

  • Input: The feature or input variable $x$

  • Output: The target or output variable $y$ that you are trying to predict

  • Training example: $(x^{(i)}, y^{(i)})$, representing the $i^{th}$ row of the training set (see the indexing sketch after this list)
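
In code, a training set is just paired arrays of inputs and targets, and indexing row i gives the training example $(x^{(i)}, y^{(i)})$; the housing numbers below are hypothetical.

```python
import numpy as np

# Hypothetical training set: house sizes (inputs x) and prices (targets y)
x_train = np.array([1.0, 1.5, 2.0, 2.5])          # input / feature variable x
y_train = np.array([300.0, 420.0, 500.0, 610.0])  # target / output variable y

m = x_train.shape[0]  # number of training examples
i = 1                 # Python indexing starts at 0
print(x_train[i], y_train[i])  # the training example (x^(i), y^(i))
```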

Regression Model

Linear Regression

  • Fitting a straight line to the data

  • To train the model, you feed the training set (the input features and targets) to the learning algorithm. It then produces a function $f(x)$, also called a hypothesis function.

  • We can use the model by giving the input x to f such that $f(x) \to \hat{y}$, where $\hat{y}$ is the predicted y value.

  • $f(x) = wx + b$. The values of w and b determine the prediction $\hat{y}$, and different values will output different predictions (see the sketch after this list).

  • This is linear regression with one variable, or one input feature x, also called Univariate Linear Regression.
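
A minimal sketch of the univariate model $f(x) = wx + b$; the parameter values here are arbitrary placeholders, since training (below) is what actually chooses w and b.

```python
import numpy as np

def predict(x, w, b):
    """Univariate linear regression model f(x) = w*x + b; returns y_hat."""
    return w * x + b

# Arbitrary example parameters and inputs
w, b = 200.0, 100.0
x_train = np.array([1.0, 1.5, 2.0])
print(predict(x_train, w, b))  # predicted y_hat for each input
```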

Cost Function

  • To implement Linear Regression, we need to define a cost function

  • It tells us how well the model is doing, so we can refine it

  • The model is represented by $f(x) = wx + b$, where w and b are the parameters, also called coefficients or weights.

  • w represents the slope of the line, while b represents the y-intercept.

  • How do we tell whether the line "fits the data" or not? Intuitively, a good fit is a line that passes through or close to the training examples.

  • Model notation: $\hat{y}^{(i)} = wx^{(i)} + b$

  • The cost function measures how close $\hat{y}^{(i)}$ is to $y^{(i)}$ by calculating the error $(\hat{y}^{(i)} - y^{(i)})$

  • Final cost function: $J(w, b) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2$, where m is the number of training examples. We want to find the values of w and b which minimize this function (implemented in the sketch after this list).

  • If you visualize the cost function over w and b, you get a 3D surface plot. You can look at it from above via a contour plot by slicing it. For more intuition, look up visualizations of the cost function.
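
A direct implementation of the squared-error cost $J(w, b) = \frac{1}{2m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})^2$, as a sketch; the variable names and toy data are my own.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) over the m training examples."""
    m = x.shape[0]
    y_hat = w * x + b                         # predictions for every example
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Toy check: a line that fits the data exactly has zero cost
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0 (perfect fit)
print(compute_cost(x_train, y_train, w=150.0, b=100.0))  # larger cost
```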

Gradient Descent

  • Gradient Descent applies not just to Linear Regression, but also to more general functions $J(w_1, w_2, \dots, w_n, b)$.

  • Gradient Descent Intuition

    • Have some function $J(w, b)$

    • Want to minimize it

  • Outline of the process

    • Start with some initial w, b (e.g., set w = 0, b = 0)

    • Keep changing w, b to reduce $J(w, b)$

    • Until we settle at or near a minimum

    • For non-convex functions, you may end up at a local minimum

  • Implementation

    • On each step, w is updated to $w = w - \alpha \frac{\partial}{\partial w}J(w, b)$

    • $\alpha$ refers to the learning rate. It is always a positive number; a negative learning rate would push w in the direction that increases the cost.

    • b is updated to $b = b - \alpha \frac{\partial}{\partial b}J(w, b)$

    • To simultaneously update w and b,

      • Set a temporary variable $tmp_w = w - \alpha \frac{\partial}{\partial w}J(w, b)$

      • Set another temporary variable $tmp_b = b - \alpha \frac{\partial}{\partial b}J(w, b)$

      • Update w and b to the new values only after computing both $tmp_w$ and $tmp_b$

    • Near a local minimum,

      • Derivative becomes smaller

      • Update steps become smaller

      • So gradient descent can converge to a local minimum even with a fixed learning rate (there is no need to shrink $\alpha$ over time)

    • Gradient Descent for Linear Regression

      • $w = w - \alpha \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)})\, x^{(i)}$

      • $b = b - \alpha \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)})$

      • Repeat until convergence

    • Batch Gradient Descent: On every step of gradient descent we look at all the training examples instead of a subset; we compute the sum over all m training examples. This is expensive for large training sets. Other versions exist which use smaller subsets at each step. A from-scratch sketch of batch gradient descent for linear regression follows this list.
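
Putting the pieces together, a from-scratch sketch of batch gradient descent for univariate linear regression with the simultaneous update described above; the learning rate, iteration count, and toy data are arbitrary choices.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=5000):
    """Batch gradient descent for f(x) = w*x + b, starting from w = 0, b = 0."""
    m = x.shape[0]
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = (w * x + b) - y                    # f(x^(i)) - y^(i) for all examples
        tmp_w = w - alpha * np.sum(err * x) / m  # step along the derivative w.r.t. w
        tmp_b = b - alpha * np.sum(err) / m      # step along the derivative w.r.t. b
        w, b = tmp_w, tmp_b                      # simultaneous update
    return w, b

# Toy data generated from y = 2x + 1; gradient descent should recover w ~ 2, b ~ 1
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = 2 * x_train + 1
w, b = gradient_descent(x_train, y_train)
print(w, b)
```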
