Gradient Descent
Gradient Descent is a very powerful optimization algorithm that is used almost everywhere in machine learning, from solving logistic regression in the 1950s to training GPT-3!
You can play with the algorithms here and check out the implementation details here.
Problem Statement
Consider a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$; our aim is to find the minimum value of this function. Before we begin, we could have generalised a little more and asked about functions whose domain is a subset of $\mathbb{R}^n$, but for the time being let us continue with the former problem statement.
Let $f(x) = x^2$ be a one-dimensional function whose domain is the real line. One can quickly infer that its minimum value is 0 and is attained at $x = 0$. One way to see this follows from a careful inspection of the derivative, namely by setting it to 0: $f'(x) = 2x = 0$ gives $x = 0$.
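For concreteness, here is a minimal sketch of gradient descent applied to $f(x) = x^2$ (whose derivative is $2x$). The step size `lr`, the starting point `x0`, and the iteration count are illustrative assumptions, not values prescribed anywhere above.

```python
# A minimal sketch of gradient descent on f(x) = x^2, whose gradient is 2x.
# The fixed step size `lr` and step count `steps` are illustrative choices.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step in the direction opposite to the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

if __name__ == "__main__":
    # f(x) = x^2, so f'(x) = 2x; starting from x0 = 5.0 the iterates approach 0.
    x_min = gradient_descent(grad=lambda x: 2 * x, x0=5.0)
    print(x_min)  # ~0, the same minimizer we found analytically above
```

Starting from $x_0 = 5$, the iterates shrink towards 0, matching the minimizer obtained by setting the derivative to zero.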
Theorem 1 (First-Order Necessary Conditions) [1]
If $x^*$ is a local minimizer and $f$ is continuously differentiable in an open neighborhood of $x^*$, then $\nabla f(x^*) = 0$.
Proof. We argue by contradiction. Suppose that the gradient is non-zero at $x^*$. If we take a sufficiently small step in the direction opposite to the gradient, we are bound to attain a lower value of the function: the gradient points in the direction of steepest ascent, so the opposite direction is the direction of steepest descent. This contradicts $x^*$ being a local minimizer, hence $\nabla f(x^*) = 0$.
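To make this step precise, one can use a first-order Taylor expansion (a sketch, assuming $f$ is continuously differentiable as in the theorem and the step length $\alpha > 0$ is sufficiently small):

$$
f\bigl(x^* - \alpha \nabla f(x^*)\bigr) = f(x^*) - \alpha \,\|\nabla f(x^*)\|^2 + o(\alpha) < f(x^*),
$$

which contradicts the assumption that $x^*$ is a local minimizer.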