Questions I had

Why do we use numerical methods to solve the optimization problem instead of solving it analytically?

Neural networks are often highly non-linear functions, and as such, the resulting equations cannot be solved in closed form using elementary functions.

A simple example of an equation with no closed-form solution is $xe^x = a$, whose inverse is the Lambert W function: $x = W(a)$. It has been proved that this function cannot be expressed in terms of elementary functions. You can, however, compute it with a Taylor series obtained from the Lagrange inversion theorem, or approximate it numerically.
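As an illustration, here is a minimal sketch (the function name `solve_xexp` and the starting point are my own choices) of solving $xe^x = a$ numerically with Newton's method, compared against SciPy's `lambertw`:

```python
import numpy as np
from scipy.special import lambertw  # reference value for comparison

def solve_xexp(a, x0=1.0, tol=1e-12, max_iter=100):
    """Solve x * exp(x) = a for x with Newton's method.

    There is no closed-form solution in elementary functions;
    the exact answer is the Lambert W function, x = W(a).
    """
    x = x0
    for _ in range(max_iter):
        f = x * np.exp(x) - a
        fprime = (x + 1.0) * np.exp(x)   # d/dx [x e^x] = (x + 1) e^x
        step = f / fprime
        x -= step
        if abs(step) < tol:
            break
    return x

a = 3.0
print(solve_xexp(a))        # numerical root
print(lambertw(a).real)     # Lambert W reference, same value (~1.0499)
```

This is exactly the situation in training: we cannot write the answer down, but we can iterate towards it numerically.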

Okay, so we use SGD. How do we know we are at a local minimum?

If the gradient is 0 and the Hessian is positive semidefinite, we are at a (possibly flat) local minimum; strictly speaking this is a necessary condition, and a positive definite Hessian would make it sufficient. Also, in general it is not possible to identify the global minimum, but check out this work by Neuman (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=41b6b7737113557347f993f1fe7e68be484708a6).

About the positive semidefinite Hessian: we say positive SEMI-definite because we also count flat points as minima. Positive semidefinite means that for any vector $x$, $xAx^T \geq 0$. In our setting this means that in any direction we choose, the value of the loss function does not decrease (to second order), so we are at a local minimum.
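To make the check concrete, here is a minimal sketch (the helper name and the toy example are mine) that tests the two conditions: gradient numerically zero and all Hessian eigenvalues non-negative:

```python
import numpy as np

def is_local_minimum_candidate(grad, hessian, tol=1e-8):
    """Second-order check at a point: gradient (numerically) zero and
    Hessian positive semidefinite, i.e. all eigenvalues >= 0, so no
    direction decreases the loss to second order."""
    grad_zero = np.linalg.norm(grad) < tol
    eigvals = np.linalg.eigvalsh(hessian)   # Hessian is symmetric
    psd = eigvals.min() >= -tol
    return grad_zero and psd

# Toy example: f(x, y) = x^2 + y^4 at the origin.
grad = np.array([0.0, 0.0])              # gradient vanishes
hessian = np.array([[2.0, 0.0],          # d2f/dx2 = 2
                    [0.0, 0.0]])         # d2f/dy2 = 12*y^2 = 0 at the origin (flat direction)
print(is_local_minimum_candidate(grad, hessian))  # True: flat directions are allowed
```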

Is it possible to find the global minimum?

Short answer: no. Since we use millions of parameters, it is infeasible to explore the whole parameter space, so we only explore it locally. Also, most loss functions are only locally convex, not globally convex.

Why don't we just use second-order optimization?

Well… you only have to evaluate the Hessian at a given point… but consider the size. For a medium/small model with, say, 1M parameters, the weights take about 4 MB in float32; the Hessian matrix would have $10^{12}$ entries, about 4 TB, which is far too much memory.
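As a quick back-of-the-envelope check of these numbers (assuming float32, i.e. 4 bytes per entry):

```python
# Memory cost of a full Hessian for a 1M-parameter model in float32.
n_params = 1_000_000

weights_bytes = n_params * 4          # ~4 MB for the parameters themselves
hessian_entries = n_params ** 2       # 10^12 entries
hessian_bytes = hessian_entries * 4   # ~4 TB for the full Hessian

print(f"weights: {weights_bytes / 1e6:.0f} MB")
print(f"Hessian: {hessian_entries:.0e} entries, {hessian_bytes / 1e12:.0f} TB")
```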

But you could argue that it makes no sense to compute every pairwise interaction between weights, since most of those interactions probably don't matter. What if we only kept the relevant entries of the Hessian? (This is essentially the idea behind quasi-Newton and diagonal or low-rank Hessian approximations; see the sketch below.)
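One cheap version of this idea, sketched below with my own helper names and toy problem, is to estimate only the diagonal of the Hessian with finite differences of the gradient and use it as a per-parameter scaling, so the full $n \times n$ matrix is never stored:

```python
import numpy as np

def diag_hessian_fd(grad_fn, w, eps=1e-4):
    """Estimate only the diagonal of the Hessian with central finite
    differences of the gradient:
    H_ii ≈ (g(w + eps*e_i)_i - g(w - eps*e_i)_i) / (2*eps).
    Costs 2*n gradient evaluations instead of an n x n matrix."""
    n = w.size
    diag = np.empty(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        diag[i] = (grad_fn(w + e)[i] - grad_fn(w - e)[i]) / (2 * eps)
    return diag

# Toy quadratic loss: f(w) = 0.5 * w^T A w, whose Hessian is A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
grad_fn = lambda w: A @ w

w = np.array([1.0, -2.0])
d = diag_hessian_fd(grad_fn, w)
print(d)                                   # ≈ [3.0, 1.0], the diagonal of A

# Newton-like step using only the diagonal as a per-parameter step size.
lr = 1.0
w_new = w - lr * grad_fn(w) / (d + 1e-8)
```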

But there also exist methods such as Adam that, without computing any second-order derivatives, automatically adjust the gradient descent step size for each parameter.
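For reference, here is a minimal NumPy sketch of the Adam update rule (the learning rate and the toy quadratic problem are arbitrary choices for the example):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and of its square (v) give a per-parameter step size, with no
    second-order derivatives involved."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimize f(w) = sum(w^2), whose gradient is 2*w.
w = np.array([1.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)   # close to [0, 0]
```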