Of course, Free Energies come from Chemical Physics. And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.
They are so important that Karl Friston has proposed the The Free Energy Principle : A Unified Brain Theory ?
(see also the wikipedia and this 2013 review)
What are free Energies and why do we use them in Deep Learning ?
The Free Energy is a Temperature Weighted Average Energy
In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has
This is the T=0 configurational Energy, where each configuration is some pair. In chemical physics, these Energies resemble an Ising model.
The Free Energy is a weighted average of the all the global and local minima
Zero Temperature Limit
Note: as , the the Free Energy becomes the T=0 global energy minima . In limit of zero Temperature, all the terms in the sum approach zero
and only the largest term, the largest negative Energy, survives.
We may also see F written in terms of the partition function Z:
where the brakets denote an equilibrium average, and expected value over some equilibrium probability distribution . (we don’t normalize with 1/N here; in principle, the sum could be infinite.)
Of course, in deep learning, we may be trying to determine the distribution , and/or we may approximate it with some simpler distribution during inference. (From now on, I just write P and Q for convenience)
But there is more to Free Energy learning than just approximating a distribution.
The Free Energy is an average solution to a non-convex optimization problem
In a chemical system, the Free Energy averages over all global and local minima below the Temperature T–with barriers below T as well. It is the Energy available to do work.
Being Scale Free: T=1
For convenience, Hinton explicitly set T=1. Of course, he was doing inference, and did not know the scale of the weights W. Since we don’t specify the Energy scale, we learn the scale implicitly when we learn W. We call this being scale-free
So in the T=1, scale free case, the Free Energy implicitly averages over all Energy minima where , as we learn the weights W. Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.
Highly degenerate non-convex problems
Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems:
When do Free Energy solutions fail ?
They will fail, however, when the barriers between Energy basins are larger than the Temperature.
This can happen if the effective Temperature drops close to zero during inference. Since T=1 implicitly in inference, this means when the weights W are exploding.
Systems may also get trapped if the Energy barriers grow very large –as, say, in the glassy phase of a mean field spin glass. Or a supercooled liquid–the co-called Adam Gibbs phenomena. I will discuss this in a future post.
In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros). But this is probably not the optimal solution to learning when using a Free Energy method.
Free Energies produce Ruggedly Convex Landscapes
It is sometimes argued that Deep Learning is a non-convex optimization problem. And, yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima? How can this be ?
At least for unsupervised methods, it has been clear since 1987 that:
An important property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …
Hence, the probability of getting stuck in a local minima decreases
Although this is not specifically how Hinton argued for the Helmholtz Free Energy — a decade later.
The Hinton Argument for Free Energies
Why do we use Free energy methods ? Hinton used the bits-back argument:
Imagine we are encoding some training data and sending it to someone for decoding. That is, we are building an Auto-Encoder.
If have only 1 possible encoding, we can use any vanilla encoding method and the receiver knows what to do.
But what if have 2 or more equally valid codes ?
Can we save 1 bit by being a little vague ?
Suppose we have N possible encodings , each with Energy . We say the data has stochastic complexity.
Pick a coding with probability and send it to the receiver. The expected cost of encoding is
Now the receiver must guess which encoding we used. The decoding cost of the receiver is
where H is the Shannon Entropy of the random encoding
The decoding cost looks just like a Helmholtz Free Energy.
Moreover, we can use a sub-optimal encoding, and they suggest using a Factorized (i.e. mean field) Feed Forward Net to do this.
To understand this better, we need to relate
Thermodynamics and Inference
In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.
In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy. We need
and F is defined as
In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N)
Variables like E and S depend on the system size N. That is,
We say S and T are conjugate pairs; S is extensive, T is intensive.
(see more on this in the Appendix)
The conjugate pairs are used to define Free Energies via the Legendre Transform:
Helmholtz Free Energy: F(T) = E(S) – TS
We switch the Energy from depending on S to T, where .
Why ? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S. However, we are free to change and measure the Temperature–the derivative of E w/r.t. S:
This is a powerful and general mathematical concept.
Say we have a convex function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x
Then we can form the Legendre Transform , which gives g(w,y,z) as
the ‘Tangent Envelope‘ of f() along x
Note: we have converted a convex function into a concave one. The Legendre transform is concave in the intensive variables and convex in the extensive variables.
Of course, the true Free Energy F is convex; this is central to Thermodynamics (see Appendix). But that is because while it is concave in T, we evaluate it at constant T.
But what if the Energy function is not convex in the Entropy ? Or, suppose we extract an pseudo-Entropy from sampling some data, and we want to define a free energy potential (i.e. as in protein folding). These postulates also fail in systems like blog post on spin chains.
Answer: Take the convex hull
Legendre Fenchel Transform
When a convex Free Energy can not be readily be defined as above, we can use the the generalized the Legendre Fenchel Transform, which provides a convex relaxation via
the Tangent Envelope , a convex relaxation
The Legendre-Fenchel Transform can provide a Free Energy, convexified along the direction internal (configurational) Entropy, allowing the Temperature to control how many local Energy minima are sampled.
- This how the (TAP) Free Energy is formed for the EMF_RBM. I have written an open source version of the code in pythonopen source version of the code in python.
- Google Deep Mind released a paper in 2013 on DARN: Deep Auto Regressive Networks.
- The modeling package Edward provides Variational Inference on top of TensorFlow
- Variational AutoEncoders are available in Keras
- And it is how Jordan does Variational Inference–the topic of part II of this blog.
Extra stuff I just wanted to write down…
Convexity in Thermodynamics and Statistical Physics
- S and V (or U and V) are the coordinates for the manifold of Equilibrium states
- The Energy U is a convex function of S, V, and S is concave
- The Temperature
Extensivity and Weight Constraints
If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive–as they would be in an actual thermodynamic system–then the weight norm constraints act to enforce the size-extensivity.
then W should remain bounded to prevent the Energy E(n) from growing faster than Mn. And, of course, most Deep Learning algorithms do bound W in some form.
Back to Stat Mech
where C denotes a contour integral.