Free Energy for Computer Scientists

Figure: Perfect weather for minimizing free energy

It took me a really long time to understand Helmholtz free energy. When it was first introduced to me, I was confused: Is it energy? Not energy? What's so "free" about it? Despite how important it is in so many fields, I find that many people don't really have an intuitive understanding of what it actually means. In principle it's not any more complicated than energy or entropy, and we often have good intuitions for those two.

A few years ago, my aunt and I started talking about statistical physics ... I don't quite remember why this happened. She's a computer scientist, so she understands things like probability distributions, optimization, and information theory inside and out. And now she wanted to understand free energy. Inspired by that conversation, in this post I'll try to outline how I think about free energy, in a way that would make sense to an educated non-physicist (possibly a computer scientist, or even an economist!).

Why it Matters

Have you ever wondered: Why is there less oxygen at high altitudes? What sets the length scale of this change in pressure? Why doesn't the pressure change noticeably when we go from the first floor to the third floor of a building? Why do proteins fold into specific shapes? Why do we have phase transitions? Why does a magnet suddenly stop being magnetic when we cross a certain temperature? Why does a superconductor suddenly superconduct below a certain temperature? Why does water boil?

Free energy is key to understanding all of these phenomena in a quantitative way.

Two Competing Tendencies in Natural Systems

Physical systems at equilibrium are governed by two competing "desires". They want to minimize their energy (U): systems tend to move toward lower energy states. They simultaneously want to maximize their entropy (S): systems tend toward more disordered arrangements.

Sometimes these two pull a system in opposite directions. If nature only cared about minimizing energy, all the gas molecules around us would collapse to the floor and we would suffocate. If nature only maximized entropy, all the gas would expand and disperse throughout the universe, and we would suffocate. Either way we suffocate. So it's important to strike a balance!

What Are We Actually Optimizing Over?

When we talk about a physical system (like the air molecules around us), we don't track the exact position and velocity of every single molecule. That would be absurd. There's just no point in doing that. I want to know how much oxygen I'll have to breathe on top of the mountain, not the position and velocity of every molecule. The type of information we're looking for is encoded in probability distributions.

Let's call one such distribution $P(x)$, where $x$ represents a particular state the system could be in. For air molecules, $x$ might specify the position and velocity of every molecule. Or it might represent the height at which a gas molecule finds itself in our atmosphere. We call this a "microstate". Usually we're averaging over microstates that have similar statistical properties.

So what we're really doing is searching for the probability distribution $P(x)$ that best describes the system – the one that properly balances energy minimization and entropy maximization.

Would an objective function that strikes a balance between $U$ and $S$ look something like
$$F = U - S?$$ Close, but this expression looks a bit bare. There's no way to tune the relative importance of $U$ and $S$.

Maybe we need something like a hyperparameter to control this balance. Sure, let's call that $T$. So now we're minimizing
$$F = U - TS.$$

If $T$ were $0$, we'd just be minimizing $F=U$. Then if there were some microstate $x$ for which $E(x)$ is minimal, the probability distribution would just be peaked at that specific $x$. This causes $S$ to go to $0$. So the energy goes to its lowest possible value, and the entropy is minimized... sounds like what we expect to happen at very low temperatures.

Figure 1: Balancing Energy and Entropy. Blue curves show potential energy U(x), red curves show probability distributions P(x) at different temperatures. As T increases from near-zero to high values, probability shifts from concentrating at the energy minimum to spreading uniformly, illustrating the probability distribution minimizing F.

If we make $T$ really big, we're effectively maximizing the entropy $S$, without regard to the energy. So the energy $U$ could end up being large. It doesn't matter in the optimization – it just matters that everything will be really random looking. Sounds like what we expect to happen at very high temperatures.

We're going to call $T$ temperature.

To make this more concrete, let's write out what $U$ and $S$ actually are in terms of our probability distribution $P(x)$:

  • The expected energy is: $U[P] = \sum_x P(x) E(x)$ or $\int P(x) E(x) dx$ in the continuous case
  • The entropy is: $S[P] = -\sum_x P(x) \log P(x)$ or $-\int P(x) \log P(x) dx$ in the continuous case – computer scientists will immediately recognize this entropy term: it's exactly the Shannon entropy from information theory (its continuous cousin is the differential entropy)!

So our Helmholtz free energy is actually a functional – a function whose argument is itself a function, the probability distribution $P$:

$F[P] = U[P] - TS[P]$

Nature finds the probability distribution $P$ that minimizes this functional.

(We can actually derive the above expression for free energy from first principles. Maybe I will cover that in a future post.)
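
To see this variational picture in action, here's a minimal sketch (the five energy values and the brute-force Dirichlet search are my own toy choices, nothing canonical): it samples lots of candidate distributions and keeps the one with the lowest $F$ at a few different temperatures.

```python
import numpy as np

# Toy setup (my own choice of energies): five states, and a crude brute-force
# search over probability distributions for the one minimizing
# F[P] = U[P] - T * S[P].

rng = np.random.default_rng(0)
E = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # energies of the five states

def minimize_F(T, n_samples=200_000):
    P = rng.dirichlet(np.ones(len(E)), size=n_samples)   # random candidate distributions
    U = P @ E                                             # expected energy of each candidate
    S = -np.sum(P * np.log(P + 1e-12), axis=1)            # Shannon entropy (in nats)
    F = U - T * S
    return P[np.argmin(F)]                                # keep the best candidate found

for T in [0.1, 1.0, 10.0]:
    print(f"T = {T:5.1f}  ->  P ≈ {np.round(minimize_F(T), 3)}")
# Low T: nearly all the weight sits on the lowest-energy state.
# High T: the best distribution found is close to uniform.
```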

The Simplest Example: The Two-State System

Let's look at the simplest possible example: A system with just two states (like a bit in a computer, or a protein going between two conformations).

Let's say we have:

  • State 0 with energy $E_0$ and probability $P_0$
  • State 1 with energy $E_1$ and probability $P_1$

The expected energy is $U[P] = E_0 P_0 + E_1 P_1$, and the entropy is $S[P] = -P_0 \log P_0 - P_1 \log P_1$.

Since $P_0 + P_1 = 1$ (the probabilities must sum to 1), we can rewrite $P_1 = 1-P_0$.

This means our Helmholtz free energy is a function of just one variable, $P_0$:

$F(P_0) = E_0 P_0 + E_1 (1-P_0) + T P_0 \log P_0 + T(1-P_0)\log(1-P_0)$

To find the minimum of $F$, we take the derivative with respect to $P_0$ and set it to zero:

$$\frac{dF}{dP_0} = E_0 - E_1 + T \log P_0 + T - T \log (1-P_0) - T = 0$$

Simplifying: $E_0 - E_1 + T \log \frac{P_0}{1-P_0} = 0$

Solving for the ratio: $$\frac{P_0}{P_1} = \exp\left(-\frac{E_0 - E_1}{T}\right) = \frac{e^{-\beta E_0}}{e^{-\beta E_1}}$$

where $\beta = 1/T$.

This is the famous Boltzmann distribution - one of the most important results in statistical physics! It tells us that the probability of finding a system in a particular state is proportional to $e^{-\beta E}$.
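
Here's a quick numerical sanity check of the two-state result (the values of $E_0$, $E_1$, and $T$ are arbitrary toy choices, with $k=1$): scan $P_0$ on a grid, find the minimizer of $F(P_0)$, and compare the resulting ratio to the Boltzmann prediction.

```python
import numpy as np

# Numerical check of the two-state result with toy values of E0, E1, T.
E0, E1, T = 0.0, 1.0, 0.7

P0 = np.linspace(1e-6, 1 - 1e-6, 200_001)
F = (E0 * P0 + E1 * (1 - P0)
     + T * P0 * np.log(P0)
     + T * (1 - P0) * np.log(1 - P0))

p0 = P0[np.argmin(F)]              # numerical minimizer of F(P0)
print("ratio from minimizing F :", p0 / (1 - p0))
print("Boltzmann prediction    :", np.exp(-(E0 - E1) / T))
# The two numbers agree up to the resolution of the grid.
```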

Generalizing to Many States

Now that we've derived the probabilities for a simple two-state system, let's extend this to the general case where we have many possible states. Each state $x$ has an energy $E(x)$. Our goal is to find the probability distribution $P(x)$ that minimizes the Helmholtz free energy: $$F[P] = \sum_x P(x)E(x) - T \cdot \left(-\sum_x P(x)\log P(x)\right).$$ This optimization problem has a constraint: the total probability must sum to 1, i.e., $\sum_x P(x) = 1$. We'll use the method of Lagrange multipliers, introducing a multiplier $\lambda$: $$L[P] = \sum_x P(x)E(x) - T \cdot \left(-\sum_x P(x)\log P(x)\right) - \lambda\left(\sum_x P(x) - 1\right)$$

To find the minimum, we take the gradient with respect to each $P(x)$ and set it to zero: $$\frac{\partial L}{\partial P(x)} = E(x) + T\log P(x) + T - \lambda = 0$$

Rearranging, this gives us $$\begin{align}\log P(x) &= \frac{\lambda - E(x)}{T} - 1 \\ P(x) &= e^{\frac{\lambda}{T} - 1} \cdot e^{-\frac{E(x)}{T}} \\ &= \frac{1}{Z} e^{-\beta E(x)}.\end{align}$$

The normalization factor ($Z$) to get the probabilities to sum to 1 is special enough that it gets its own name, the partition function:

$$Z = \sum_x e^{-\beta E(x)}$$
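
In code, the general result is just the softmax that computer scientists already know, with logits $-\beta E(x)$. A minimal sketch with made-up energies:

```python
import numpy as np

# The general minimizer is a softmax of -beta * E (toy random energies).
rng = np.random.default_rng(1)
E = rng.uniform(0.0, 5.0, size=10)   # ten states with random energies
beta = 0.5                           # inverse temperature, with k = 1

weights = np.exp(-beta * E)          # unnormalized Boltzmann weights
Z = weights.sum()                    # partition function
P = weights / Z                      # Boltzmann distribution

print("Z =", Z)
print("P =", np.round(P, 3), "(sums to", P.sum(), ")")
```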

Units

When we write $F = U - TS$, we need to be careful about units. Energy ($U$) has units of joules, while entropy in information theory ($S$) is dimensionless. Temperature ($T$) has units of kelvin. So how can we add or subtract quantities with different units?

This is where Boltzmann's constant ($k$) comes in. In physics, entropy actually has units of energy divided by temperature ($\text{J/K}$), and is related to information-theoretic entropy by the factor $k$:

$$S_{\text{physics}} = k \cdot S_{\text{information}}$$

where $k \approx 1.38 \times 10^{-23} \text{ J/K}$ is Boltzmann's constant. With this adjustment, the Helmholtz free energy equation becomes:

$$F = U - T(k \cdot S_{\text{information}}) = U - kTS_{\text{information}}$$

Now the units work out properly: $F$ has units of energy (joules), as does $U$, and $kTS$ also has units of energy ($\text{K} \times \text{J/K} = \text{J}$).

In our derivations, we've been implicitly using $k = 1$ or folding it into the temperature $T$. This is a common practice in theoretical discussions (similar to how physicists often use $c = 1$ or $\hbar = 1$), but it's important to restore the constant when doing actual calculations.
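
As a quick sanity check on the numbers (a tiny sketch, nothing deep), the thermal energy scale $kT$ at room temperature comes out to a perfectly ordinary, if tiny, number of joules:

```python
# Restoring Boltzmann's constant: the thermal energy scale kT at room
# temperature, in joules.
k = 1.380649e-23      # J/K, Boltzmann's constant
T = 300.0             # K, roughly room temperature
print(f"kT ≈ {k * T:.2e} J")   # ≈ 4.14e-21 J
```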

What "Free" Really Means

The "free" in Helmholtz free energy refers to how much of the free energy is up for grabs, i.e., we can extract work from. At a fixed temperature, if a system starts out at $P_\text{initial}$, nature shuffles between microstates, and the probability distribution evolves until $F$ bottoms out at $F_0=F[P_\text{eq.}]$. At that point, there’s literally no free work left on the table. During the approach to this equilibrium though, there is free work. We can extract $$\Delta F = F[P_{\text{initial}}]- F[P_\text{eq.}]>0$$ from the system. So the amount of energy in the system we can use to do thinks like drive a turbine is not $U$, but $\Delta F$.

The Atmosphere Example: Why Doesn't All Air Float Away or Fall Down?

Now let's apply this to a more complex example: Earth's atmosphere. Why doesn't all the air around us just escape into space? Or why doesn't it all collapse to the ground?

For a gas molecule at height $z$, the gravitational potential energy is $E(z) = mgz$, where $m$ is the molecule's mass and $g$ is the gravitational acceleration ($9.8 \text{ m/s}^2$ on Earth).

Following the same principle of minimizing Helmholtz free energy, we find that the probability density of finding a molecule at height $z$ is: $$P(z) \propto e^{-mgz/kT} = e^{-z/h_0}.$$ This is the barometric formula, which explains why air density decreases exponentially with altitude. The characteristic height scale is $h_0 = kT/mg$, representing the height at which the density drops to $1/e$ (about $37\%$) of its ground value.

Let's calculate this for oxygen ($\text{O}_2$) molecules on Earth:

  • Boltzmann constant $k = 1.38 \times 10^{-23} \text{ J/K}$
  • Average temperature at sea level $T \approx 300 \text{ K}$
  • Gravitational acceleration $g = 9.8 \text{ m/s}^2$
  • Mass of an $\text{O}_2$ molecule $m = 5.31 \times 10^{-26} \text{ kg}$ ($32$ atomic mass units)

$$h_0 = \frac{1.38 \times 10^{-23} \times 300}{5.31 \times 10^{-26} \times 9.8} \approx 8,000 \text{ meters}$$

So the length scale for oxygen in Earth's atmosphere is about $8.0$ kilometers. This explains why mountaineers need oxygen tanks on high peaks like Everest ($8.848~\text{km}$), but not at the climbing wall in the gym.
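
If you want to reproduce these back-of-the-envelope numbers yourself, here's a short sketch (this is the idealized isothermal-atmosphere picture used above; real atmospheric profiles are messier):

```python
import numpy as np

# Barometric formula numbers for O2 in an idealized isothermal atmosphere.
k = 1.380649e-23      # J/K, Boltzmann's constant
T = 300.0             # K
g = 9.8               # m/s^2
m = 5.31e-26          # kg, mass of an O2 molecule

h0 = k * T / (m * g)                  # characteristic height scale
print(f"h0 ≈ {h0 / 1000:.1f} km")      # ≈ 8 km

z_everest = 8848.0                    # m
print(f"relative density at Everest ≈ {np.exp(-z_everest / h0):.2f}")
# Roughly a third of the sea-level value in this model -- hence the oxygen tanks.
```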

A Tiny Bit More About Partition Functions

The partition function $Z$ we introduced earlier is a fundamental object in statistical physics, but it also appears in many other fields, including machine learning and information theory.

Connection to Machine Learning

If you've worked with probabilistic models in machine learning, you've likely encountered concepts equivalent to the partition function. In machine learning, we often define energy-based models where:

$$P(x) = \frac{1}{Z} e^{-E(x)}$$

Here, $e^{-E(x)}$ is the unnormalized probability, $E(x)$ is the energy function (the negative unnormalized log-probability), and $Z = \sum_x e^{-E(x)}$ is the partition function that we use to normalize and obtain a proper probability distribution. This partition function is usually very difficult to compute exactly, so a lot of machine learning is about finding ways of estimating or sidestepping it.
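
One practical aside that isn't part of the physics story above: when energies are large, $e^{-E(x)}$ under- or overflows in floating point, so in practice one usually works with $\log Z$ via the standard log-sum-exp trick. A minimal sketch with made-up energies:

```python
import numpy as np

# Computing log Z stably with the log-sum-exp trick (made-up energies chosen
# so that exp(-E) underflows in double precision).
E = np.array([1000.0, 1001.0, 1005.0])

def log_Z(E, beta=1.0):
    a = -beta * E
    a_max = a.max()                               # shift before exponentiating
    return a_max + np.log(np.sum(np.exp(a - a_max)))

logZ = log_Z(E)
P = np.exp(-E - logZ)                             # normalized probabilities (beta = 1)
print("log Z =", logZ)
print("P     =", np.round(P, 4), "sums to", P.sum())
```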

Free Energy and the Partition Function

There's an elegant relationship between the free energy $F$ and the partition function $Z$:

$$F = -kT \ln Z$$

I'll leave the derivation as an exercise for the reader (hint: start with $F = U - TS$ and the definitions of $U$ and $S$ in terms of the Boltzmann distribution).

This relationship is powerful because it means if we can compute $Z$, we immediately know the free energy of the system, which tells us about its thermodynamic stability.
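
Here's a quick numerical check of this identity (with $k = 1$ and random toy energies): compute $U - TS$ directly from the Boltzmann distribution and compare it to $-T \ln Z$.

```python
import numpy as np

# Numerical check of F = -T ln Z (k = 1), using toy energies.
rng = np.random.default_rng(2)
E = rng.uniform(0.0, 5.0, size=20)
T = 1.3
beta = 1.0 / T

w = np.exp(-beta * E)
Z = w.sum()
P = w / Z                              # Boltzmann distribution

U = np.sum(P * E)                      # expected energy
S = -np.sum(P * np.log(P))             # entropy (in nats)
print("U - T*S  =", U - T * S)
print("-T*ln(Z) =", -T * np.log(Z))    # the two agree to machine precision
```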

What We Can Learn from Partition Functions

The partition function contains a wealth of information about a system. By taking derivatives of $\ln Z$ with respect to various parameters, we can compute important physical quantities. For example, the average energy is $$\langle E \rangle = -\frac{\partial \ln Z}{\partial \beta}$$ where $\beta = 1/kT$.
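
And a finite-difference check of this identity, again with toy energies and $k = 1$:

```python
import numpy as np

# Finite-difference check of <E> = -d(ln Z)/d(beta), with toy energies.
rng = np.random.default_rng(3)
E = rng.uniform(0.0, 5.0, size=20)

def log_Z(beta):
    return np.log(np.sum(np.exp(-beta * E)))

beta, eps = 0.8, 1e-6
P = np.exp(-beta * E)
P /= P.sum()                           # Boltzmann distribution at this beta

print("direct  <E>  =", np.sum(P * E))
print("-dlnZ/dbeta  =", -(log_Z(beta + eps) - log_Z(beta - eps)) / (2 * eps))
# The two values agree to within the finite-difference error.
```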

Free energy can be non-analytic: Phase transitions

This will have to go into a future post.