The posts on the machine learning series are notes from the __Machine Learning Specialization__ course offered by __Deeplearning.ai__ and Stanford Online, available on Coursera.

The learning rate, or alpha (α), is usually a small positive number between 0 and 1.

The choice of learning rate has a huge impact on the efficiency of our implementation of gradient descent.

If the learning rate is too small, gradient descent will still work, but it will be very slow and take a very long time to reach convergence.

If the learning rate is too large, gradient descent may overshoot the minimum and fail to converge; it may even diverge.
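A minimal sketch (not from the course) of these two failure modes, using the simple cost J(w) = w², whose derivative is dJ/dw = 2w:

```python
def gradient_descent(alpha, w=10.0, steps=20):
    """Run gradient descent on J(w) = w^2 and return the history of w."""
    history = [w]
    for _ in range(steps):
        w = w - alpha * 2 * w  # update rule: w := w - alpha * dJ/dw
        history.append(w)
    return history

slow = gradient_descent(alpha=0.01)      # too small: creeps toward 0 very slowly
good = gradient_descent(alpha=0.1)       # reasonable: converges quickly
diverging = gradient_descent(alpha=1.1)  # too large: overshoots, |w| keeps growing
```

With α = 1.1 each update flips the sign of **w** and makes it larger in magnitude, which is exactly the oscillating, diverging behavior described above.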

Another question that's often asked is, "what if the cost function J is already at a local minimum?" In this case, gradient descent leaves **w** unchanged: the derivative is zero at a minimum, so the update sets the new value of **w** to exactly the old value of **w**.
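This can be seen directly in the update rule. A tiny sketch (my own illustration, not course code) using J(w) = (w − 3)², whose derivative 2(w − 3) vanishes at the minimum w = 3:

```python
alpha = 0.1
w = 3.0              # already at the minimum of J(w) = (w - 3)^2
grad = 2 * (w - 3)   # dJ/dw = 0 here
w_new = w - alpha * grad  # the update leaves w exactly where it is
```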

Another important point:

As we get closer to a local minimum, gradient descent will automatically take smaller steps, as seen in yesterday's example with the red arrows. That's because as we approach the local minimum, the derivative automatically gets smaller, so the update steps automatically get smaller as well, even if the learning rate (α) is kept at a fixed value.
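A quick sketch of this effect (my own example, assuming J(w) = w² with derivative 2w): even with α fixed, each step α·dJ/dw shrinks as **w** approaches the minimum.

```python
alpha = 0.1
w = 10.0
step_sizes = []
for _ in range(5):
    step = alpha * 2 * w       # size of this update, alpha * dJ/dw
    step_sizes.append(abs(step))
    w = w - step
# step sizes shrink on every iteration (roughly 2.0, 1.6, 1.28, ...)
```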

Let's take a look at a plot where the learning rate is too large, so gradient descent diverges instead of converging:

*Above, the left graph shows **w**'s progression over the first few steps of gradient descent. **w** oscillates from positive to negative and cost grows rapidly. Gradient descent is operating on both **w** and **b** simultaneously, so we need the 3-D plot on the right for the complete picture.*

We have reached the end of the Introduction to Machine Learning series notes. Tomorrow, we'll be looking at the summary of what we have learned so far before moving on to the next series, Regression with multiple input variables.
