Fidelity, Fisher Information, QCRB and All that [pt. 1]
What about those time-energy and number-phase uncertainty relations? The way the time-energy uncertainty was often explained to me when I first learned quantum mechanics felt pretty hand-wavy ... The QCRB solves this and puts these uncertainty relations on a proper footing.
![Fidelity, Fisher Information, QCRB and All that [pt. 1]](/content/images/size/w1200/2025/05/IMG_3285.jpeg)
Have you ever looked at the formula for Fidelity between two quantum states: $$F(\hat \rho_1, \hat \rho_2) = \left[ \mathrm{tr} \sqrt{\sqrt{\hat \rho_1}\, \hat \rho_2 \sqrt{\hat \rho_1}} \right]^2,$$ and thought, "who hurt the physicist who came up with this?" How can you do so many square roots on just two matrices? Statisticians are not much better; I present to you the Fisher information: $$F_C(\theta) = \int \frac{1}{p(x|\theta)} \left(\frac{\partial p(x|\theta)}{\partial \theta}\right)^2 dx$$
These expressions definitely made me very confused the first time I saw them.
If you're a graduate student working on quantum information, sensing, or metrology, these expressions and terms like the Quantum Cramér-Rao Bound and the Heisenberg limit are likely on your radar – maybe you already know them, or maybe you just nod along. This post outlines roughly how I understand these and related concepts. It's loosely based on my notes and something I presented in our group's weekly meeting a couple of years ago. It is also meant to convince you that these are actually a bunch of surprisingly elegant and simple concepts that are closely connected to each other, and which I honestly find easier to digest all together, all at once.
In either case, I hope you find this tutorial helpful 😃
For now my plan is to split this post into a few parts (I know I just said they should be digested all at once, but there are limits [haha]). This post will give an introduction to classical Fisher information and parameter estimation, and give a preview of what quantum parameter estimation is all about. The second post will build out the quantum part, look at applications, talk about the limitations of these approaches in sensing applications, and cover some new and exciting advances in the field.
First Things First: Classical Parameter Estimation
A lot of times, we're faced with a physical system whose behavior depends on some unknown parameter, let's call it $\theta$. The parameter $\theta$ can be anything: the strength of a magnetic field, the frequency or intensity of a light field, the position of a mechanical oscillator, or a parameter in a quantum gate. We perform measurements on the system, collect data (let's say $\mathbf{x}$), and then try to deduce the value of $\theta$. This is the essence of parameter estimation.
Imagine we have a process whose outcomes $\mathbf{x}$ are described by a probability distribution $p(\mathbf{x}|\theta)$ that depends on an unknown parameter $\theta$. Our goal is to estimate $\theta$ after observing some data $\mathbf{x}$.
If a small change of the parameter from $\theta$ to $\theta + d\theta$ leads to a drastically different probability distribution $p(\mathbf{x}|\theta+d\theta)$, we can intuitively guess that it'll be easier to estimate $\theta$. Conversely, if $p(\mathbf{x}|\theta)$ and $p(\mathbf{x}|\theta+d\theta)$ are very similar, distinguishing them (and thus accurately estimating $\theta$) will be harder. So our program is the following:
- Find a way to quantify the "distance" between two nearly identical probability distributions.
- Talk about distances and divergences more generally, and use these to estimate the sensitivity of the distribution to a parameter. (Fisher information)
- Use the sensitivity to estimate how much uncertainty we will have on our estimate of $\theta$. (Cramér-Rao bound)
Deriving Classical Fisher Information
Quick and dirty: estimating the bias of a coin
We have a coin which lands heads with probability $p(H|\theta)=\theta$. We want to estimate $\theta$ by doing measurements, i.e., flipping the coin a bunch of times. We flip the coin $N$ times, and obtain heads $m$ times. Let's call our estimate of $\theta$, $\check \theta$. We can convince ourselves that a good expression for this estimate is
$$\newcommand{\sq}{^2}
\check \theta = \frac{m}{N}$$
What is the uncertainty in this estimate, $\Delta \check \theta$? That would be given by the variance:
$$ (\Delta \check \theta)^2 = \frac{\theta(1-\theta)}{N}$$
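If you'd like to convince yourself of this numerically, here's a minimal simulation sketch (assuming numpy; the bias $\theta=0.3$ and the numbers of flips and trials are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3      # true bias of the coin (arbitrary choice for the demo)
N = 1000         # flips per experiment
trials = 20000   # repeat the whole experiment many times

# m ~ Binomial(N, theta); our estimator is m / N
m = rng.binomial(N, theta, size=trials)
theta_check = m / N

print("mean of estimator    :", theta_check.mean())   # ~ theta (unbiased)
print("variance of estimator:", theta_check.var())    # ~ theta(1-theta)/N
print("theory               :", theta * (1 - theta) / N)
```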
We can use this variance to determine how close two probability distributions can be before we can no longer distinguish them. So $(\Delta \check \theta)^2$ is really one unit of distinguishability. We have two probability distributions with parameters $\theta$ and $\theta+d\theta$, so they are separated by $d\theta$. A natural way of measuring the distance between them is to count how many "units of distinguishability" they are separated by, i.e., we can divide by the variance of the estimate:
$$\frac{(d\theta)^2}{(\Delta \check \theta)^2} = N\frac{(d \theta)^2}{\theta(1-\theta)}=N \left(\frac{(d \theta)^2}{\theta}+ \frac{(d \theta)^2}{1-\theta}\right)$$
We can generalize this to a probability distribution that has more outcomes, so $\theta_1, \theta_2, \cdots$, and so on, and find that the right way of estimating the difference between two distributions is proportional to
$$\frac{(d\theta_1)^2}{\theta_1} + \frac{(d\theta_2)^2}{\theta_2} + \cdots $$
We can write this as $$\sum_x \frac{1}{p(x|\theta)} \left(\frac{\partial p(x|\theta)}{\partial \theta} d\theta \right)^2 = F_C(\theta) (d\theta)\sq $$ and call $F_C$ the classical Fisher information.
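As a quick sanity check, here is the sum-over-outcomes formula applied to the coin (a sketch; `coin_cfi` is just a throwaway helper name I made up). For a single flip it reproduces $F_C = 1/[\theta(1-\theta)]$, which is exactly what appeared per flip in the expression above:

```python
import numpy as np

def coin_cfi(theta):
    """Classical Fisher information of a single coin flip, via the sum formula."""
    probs = np.array([theta, 1 - theta])   # p(H|theta), p(T|theta)
    dprobs = np.array([1.0, -1.0])         # d/dtheta of each probability
    return np.sum(dprobs**2 / probs)

theta = 0.3
print(coin_cfi(theta))               # matches the closed form below
print(1 / (theta * (1 - theta)))
```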
The expression $$\frac{(d\theta_1)^2}{\theta_1} + \frac{(d\theta_2)^2}{\theta_2} + \cdots $$ tells us something about the geometry of the space of probability distributions ... the right way of counting distance between two probability distributions is not the Euclidean metric $(d\theta_1)^2 + (d\theta_2)\sq+\cdots$ as one would naively expect. Actually a really curious property is that if we instead thought in terms of square roots of probabilities (kind of like amplitudes in quantum mechanics), i.e., $c_1 = \sqrt{\theta_1}$, then $d c_1 = d\sqrt{\theta_1}=\frac{1}{2\sqrt{\theta_1}} d\theta_1$, so in fact the right way to calculate distance would be something like $$(d c_1)\sq + (d c_2)\sq + \cdots$$ I honestly have no idea why this is the case. In this lecture, Carl Caves strongly encourages the audience to avoid thinking about this fact and any possible quantum connection at any significant depth, and we'll take his advice here.
Now I’ve been pretty sloppy with the derivation above. This is rectified below.
Slightly more respectable: starting from a distance metric
Another way to think about things is to start with a "distance" measure to tell us how different two probability distributions are. Defining a single, universally "best" measure can be a bit tricky, as various options exist, each with its own properties and interpretations and religious devotees. However, for the purpose of understanding how quickly the distribution changes locally (i.e., for infinitesimal $d\theta$), it turns out that several choices (among common, well-behaved measures) lead to the same fundamental underlying quantity describing the local sensitivity. Let's look at two distance measures...
For two continuous probability distributions $p(x|\theta)$ and $p(x|\theta+d\theta)$, the Hellinger Distance is defined as:
$$H^2(\theta, \theta+d\theta) = \frac{1}{2} \int \left(\sqrt{p(x|\theta+d\theta)} - \sqrt{p(x|\theta)}\right)^2 dx$$
This measures how much the two distributions differ (it vanishes when they are identical). A larger Hellinger distance means they are more distinguishable.
What happens if $d\theta$ is very small? We can use a Taylor expansion for $\sqrt{p(x|\theta+d\theta)}$:
$$\sqrt{p(x|\theta+d\theta)} \approx \sqrt{p(x|\theta)} + \frac{\partial}{\partial \theta} \left( \sqrt{p(x|\theta)} \right) d\theta$$
$$\sqrt{p(x|\theta+d\theta)} \approx \sqrt{p(x|\theta)} + \frac{1}{2\sqrt{p(x|\theta)}} \frac{\partial p(x|\theta)}{\partial \theta} d\theta$$
Substituting this into the Hellinger distance formula:
$$H^2(\theta, \theta+d\theta) \approx \frac{1}{2} \int \left( \frac{1}{2\sqrt{p(x|\theta)}} \frac{\partial p(x|\theta)}{\partial \theta} d\theta \right)^2 dx$$
$$H^2(\theta, \theta+d\theta) \approx \frac{1}{8} \left( \int \frac{1}{p(x|\theta)} \left(\frac{\partial p(x|\theta)}{\partial \theta}\right)^2 dx \right) (d\theta)^2$$
Definition: Classical Fisher Information (CFI)
The Classical Fisher Information $F_C(\theta)$ that a random variable $X$ (with outcomes $x$ following $p(x|\theta)$) carries about a parameter $\theta$ is:
$$F_C(\theta) = \int p(x|\theta) \left(\frac{\partial \log p(x|\theta)}{\partial \theta}\right)^2 dx = \int \frac{1}{p(x|\theta)} \left(\frac{\partial p(x|\theta)}{\partial \theta}\right)^2 dx$$
or, for discrete outcomes,
$$F_C(\theta) = \sum_x p(x|\theta) \left(\frac{\partial \log p(x|\theta)}{\partial \theta}\right)^2 = \sum_x \frac{1}{p(x|\theta)} \left(\frac{\partial p(x|\theta)}{\partial \theta}\right)^2 $$
(The equality comes from $\frac{\partial \log p}{\partial \theta} = \frac{1}{p} \frac{\partial p}{\partial \theta}$).
So, we see a beautiful connection:
$$F_C(\theta) = 8 \lim_{d\theta \to 0} \frac{H^2(\theta, \theta+d\theta)}{(d\theta)^2}$$
The Fisher Information is proportional to the curvature of the Hellinger distance. It quantifies how distinguishable $p(x|\theta)$ is from $p(x|\theta+d\theta)$ for infinitesimal $d\theta$. A larger $F_C(\theta)$ means the distributions are locally more distinguishable, implying we can get more information about $\theta$ from our measurements.
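Here is a small numerical check of this relation (a sketch of my own, not a rigorous computation): for a Gaussian with unknown mean $\theta$ and known width $\sigma$, the Fisher information is $1/\sigma^2$, and the Hellinger distance between two slightly shifted Gaussians reproduces it.

```python
import numpy as np

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
sigma, theta, dtheta = 1.0, 0.0, 1e-3     # arbitrary illustration values

def gauss(mu):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def hellinger_sq(p, q):
    return 0.5 * np.sum((np.sqrt(q) - np.sqrt(p))**2) * dx

H2 = hellinger_sq(gauss(theta), gauss(theta + dtheta))
print("8 H^2 / dtheta^2:", 8 * H2 / dtheta**2)   # ~ 1/sigma^2
print("F_C = 1/sigma^2 :", 1 / sigma**2)
```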
Starting from the Kullback-Leibler Divergence
Folks from more Information Theory / Machine Learning background may prefer the Kullback-Leibler (KL) divergence (also known as relative entropy). It provides another way to quantify the "difference" between two probability distributions. For two continuous probability distributions $p(x)$ and $q(x)$, the KL divergence of $q$ from $p$ is defined as:
$$D_{KL}(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx$$
It's important to note that $D_{KL}(p || q)$ is not symmetric (i.e., generally $D_{KL}(p || q) \neq D_{KL}(q || p)$) and doesn't satisfy the triangle inequality, so it's not a true distance metric in the mathematical sense – which is why we call it a "divergence."
Let's see how it relates to Fisher Information by considering our two infinitesimally close distributions, $p(x|\theta)$ and $p(x|\theta+d\theta)$. We'll look at $D_{KL}(p(x|\theta) || p(x|\theta+d\theta))$:
$$D_{KL}(p(x|\theta) || p(x|\theta+d\theta)) = \int p(x|\theta) \log \frac{p(x|\theta)}{p(x|\theta+d\theta)} dx$$
We can Taylor expand the term $\log p(x|\theta+d\theta)$ around $\theta$:
$$\log p(x|\theta+d\theta) \approx \log p(x|\theta) + \left(\frac{\partial \log p(x|\theta)}{\partial \theta}\right) d\theta + \cdots$$
Therefore, for the ratio inside the logarithm we get (note this is the inverse of the ratio appearing in $D_{KL}$, which is where the minus sign below comes from):
$$\log \frac{p(x|\theta+d\theta)}{p(x|\theta)}\approx \left(\frac{\partial \log p(x|\theta)}{\partial \theta}\right) d\theta + \frac{1}{2} \left(\frac{\partial^2 \log p(x|\theta)}{\partial \theta^2}\right) (d\theta)^2$$
Substituting this back into the expression for $D_{KL}$:
$$D_{KL}(p(x|\theta) || p(x|\theta+d\theta)) \approx -\int p(x|\theta) \left[ \left(\frac{\partial \log p(x|\theta)}{\partial \theta}\right) d\theta + \frac{1}{2} \left(\frac{\partial^2 \log p(x|\theta)}{\partial \theta^2}\right) (d\theta)^2 \right] dx$$
$$= -d\theta \int p(x|\theta) \frac{\partial \log p(x|\theta)}{\partial \theta} dx - \frac{(d\theta)^2}{2} \int p(x|\theta) \frac{\partial^2 \log p(x|\theta)}{\partial \theta^2} dx$$
The first integral vanishes, since $\int p(x|\theta) \frac{\partial \log p(x|\theta)}{\partial \theta} dx = \int \frac{\partial p(x|\theta)}{\partial \theta} dx = \frac{\partial}{\partial \theta}\int p(x|\theta)\, dx = 0$, and we are left with:
$$D_{KL}(p(x|\theta) || p(x|\theta+d\theta)) \approx - \frac{(d\theta)^2}{2} \int p(x|\theta) \frac{\partial^2 \log p(x|\theta)}{\partial \theta^2} dx$$
It turns out that $F_C(\theta) = -\mathbb{E}\left[\frac{\partial^2 \log p(X|\theta)}{\partial \theta^2}\right]$ (try to show this), so we find:
$$D_{KL}(p(x|\theta) || p(x|\theta+d\theta)) \approx \frac{1}{2} F_C(\theta) (d\theta)^2$$
We see again that the Fisher Information $F_C(\theta)$ emerges naturally as a measure of the sensitivity of a probability distribution to a parameter.
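The same Gaussian-mean example works here too. A minimal check (assuming numpy, with the same arbitrary choices of $\sigma$ and $d\theta$ as before) that $D_{KL} \approx \frac{1}{2}F_C (d\theta)^2$:

```python
import numpy as np

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
sigma, theta, dtheta = 1.0, 0.0, 1e-2

def gauss(mu):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

p, q = gauss(theta), gauss(theta + dtheta)
D_kl = np.sum(p * np.log(p / q)) * dx        # numerical KL divergence

print("D_KL                 :", D_kl)
print("0.5 * F_C * dtheta^2 :", 0.5 * (1 / sigma**2) * dtheta**2)
```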
Fisher Information: A Measure of Local Sensitivity
So, we've explored two ways to quantify how a probability distribution deforms as we modify the parameter $\theta$: first using the Hellinger distance (which is an actual metric) and then using the Kullback-Leibler divergence (which is not symmetric, so not a metric). In both cases, for small $d\theta$, the measure of difference scaled with $(d\theta)\sq$, and the coefficient of this $(d\theta)^2$ term was directly proportional to the same quantity: the classical Fisher Information, $F_C(\theta)$.
Estimators
So far, we've established that a larger Fisher Information $F_C(\theta)$ means the distribution $p(x|\theta)$ is more sensitive, and intuitively, this should mean we can do a better job estimating $\theta$ from our data. How do we make this into a precise statement? First we need to introduce the "estimator".
Estimators: making sense of data
An estimator is basically a function of our measurement data. We make a bunch of measurements, $\mathbf x$, and then we feed it into a function $\check \theta(\mathbf x)$, and get the resulting estimate. Since the measurement results are random, $\check \theta(\mathbf x)$ is also random. So it has a mean, variance, etc.
We want estimators that are unbiased, i.e., their mean is the actual parameter $\theta$, and efficient, i.e., their variance $\langle \Delta^2 \check \theta(\mathbf x)\rangle\equiv \langle (\check \theta(\mathbf x)-\theta)^2\rangle$ is small.
The Cramér-Rao bound (CRB) is a statement about the highest efficiency we can achieve with any estimator. It says that no matter what estimator we choose, its variance is bounded from below by the inverse of the CFI:
$$\langle \Delta^2 \check \theta(\mathbf x)\rangle \ge \frac{1}{F_C(\theta)}$$
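To make this concrete, here's a minimal sketch (my own toy example: estimating the mean of a Gaussian with known $\sigma$ from $N$ samples). The sample mean is unbiased and, in this particular case, actually saturates the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, theta, N, trials = 2.0, 0.7, 50, 50000   # arbitrary illustration values

# Each "experiment" is N iid Gaussian samples; the estimator is the sample mean
x = rng.normal(theta, sigma, size=(trials, N))
theta_check = x.mean(axis=1)

F_C = N / sigma**2                        # Fisher information of N Gaussian samples
print("variance of estimator:", theta_check.var())
print("CRB: 1/F_C           :", 1 / F_C)  # the two agree: the bound is saturated here
```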
Quick proof of CRB
Let $\check{\theta}(\mathbf{x})$ be an unbiased estimator for the parameter $\theta$. This means its expectation value is $\theta$:
$$\langle \check{\theta}(\mathbf{x}) \rangle = \int \check{\theta}(\mathbf{x}) p(\mathbf{x}|\theta) d\mathbf{x} = \theta$$
Differentiating both sides with respect to $\theta$:
$$\frac{\partial}{\partial \theta} \int \check{\theta}(\mathbf{x}) p(\mathbf{x}|\theta) d\mathbf{x}=\int \check{\theta}(\mathbf{x}) \frac{\partial p(\mathbf{x}|\theta)}{\partial \theta} d\mathbf{x} = \frac{\partial \theta}{\partial \theta} = 1$$
Using the identity $\frac{\partial p(\mathbf{x}|\theta)}{\partial \theta} = p(\mathbf{x}|\theta) \frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta}$:
$$\int \check{\theta}(\mathbf{x}) p(\mathbf{x}|\theta) \frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta} d\mathbf{x} = 1$$
Given that $\left\langle \frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta} \right\rangle = 0$, it follows that $\int \theta\, p(\mathbf{x}|\theta) \frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta} d\mathbf{x} = 0$. Subtracting this from the equation above yields:
$$1 = \int (\check{\theta}(\mathbf{x}) - \theta) p(\mathbf{x}|\theta) \frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta} d\mathbf{x}$$
Applying the Cauchy-Schwarz inequality $\left(\int fg\, d\mathbf{x}\right)^2 \le \left(\int f^2 d\mathbf{x}\right) \left(\int g^2 d\mathbf{x}\right)$ with $f = (\check{\theta}(\mathbf{x}) - \theta)\sqrt{p(\mathbf{x}|\theta)}$ and $g = \sqrt{p(\mathbf{x}|\theta)}\, \frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta}$:
$$(1)^2 \le \left( \int (\check{\theta}(\mathbf{x}) - \theta)^2 p(\mathbf{x}|\theta) d\mathbf{x} \right) \left( \int \left(\frac{\partial \log p(\mathbf{x}|\theta)}{\partial \theta}\right)^2 p(\mathbf{x}|\theta) d\mathbf{x} \right)$$
This simplifies to:
$$1 \le \langle \Delta^2 \check{\theta}(\mathbf{x}) \rangle\, F_C(\theta),$$
where $\langle \Delta^2 \check{\theta}(\mathbf{x}) \rangle$ is the variance of the estimator and $F_C(\theta)$ is the classical Fisher information. Rearranging gives the Cramér-Rao bound:
$$\langle \Delta^2 \check{\theta}(\mathbf{x}) \rangle \ge \frac{1}{F_C(\theta)}$$
Now, Let's Get Quantum
Quantum Fisher Information (for pure states)
Just as Classical Fisher Information quantifies the distinguishability between probability distributions $p(x|\theta)$ and $p(x|\theta+d\theta)$, Quantum Fisher Information (QFI) aims to quantify the distinguishability of quantum states $\rho_\theta$ and $\rho_{\theta+d\theta}$. This will then set the ultimate limit on how well $\theta$ can be estimated in quantum systems, via the Quantum Cramér-Rao Bound.
For now, let's focus on how QFI emerges for pure states. Let's assume we have a state parameterized by $\theta$, so $|\psi_\theta\rangle$. This could come about due to a unitary transformation: $|\psi_\theta\rangle = e^{-iA\theta}|\psi_0\rangle$, where $|\psi_0\rangle$ is an initial state and $A$ is a Hermitian operator (the generator of the transformation).
As with CFI, we need a way to measure the "distance" or "difference" between $|\psi_\theta\rangle$ and $|\psi_{\theta+d\theta}\rangle$. The obvious starting point is their overlap. The overlap or fidelity between two pure states is $$\mathcal F(\psi_\theta, \psi_{\theta+d\theta}) = |\langle\psi_\theta|\psi_{\theta+d\theta}\rangle|^2.$$
For an infinitesimally small change $d\theta$, we get
$$\langle\psi_\theta|\psi_{\theta+d\theta}\rangle = \langle\psi_\theta|e^{-iAd\theta}|\psi_\theta\rangle$$
For small $d\theta$, we can Taylor expand $$e^{-iAd\theta} \approx I - iAd\theta - \frac{1}{2}(Ad\theta)^2 + O((d\theta)^3)$$
Substituting this into the overlap: $$\langle\psi_\theta|\psi_{\theta+d\theta}\rangle \approx \langle\psi_\theta|(I - iAd\theta - \frac{1}{2}A^2(d\theta)^2)|\psi_\theta\rangle$$ $$\langle\psi_\theta|\psi_{\theta+d\theta}\rangle \approx 1 - i\langle A\rangle_\theta d\theta - \frac{1}{2}\langle A^2\rangle_\theta (d\theta)^2.$$
Now, $$\mathcal F = |\langle\psi_\theta|\psi_{\theta+d\theta}\rangle|^2 \approx |(1 - \frac{1}{2}\langle A^2\rangle_\theta (d\theta)^2) - i\langle A\rangle_\theta d\theta|^2$$
which gives us
$$\mathcal F \approx 1 - \langle A^2\rangle_\theta (d\theta)^2 + \langle A\rangle_\theta^2 (d\theta)^2 + O((d\theta)^4)$$
At the end we get
$$\mathcal F \approx 1 - (\langle A^2\rangle_\theta - \langle A\rangle_\theta^2)(d\theta)^2 = 1 - (\Delta A)_\theta^2 (d\theta)^2.$$
The fidelity measures how much the states overlap; $1-\mathcal F$ measures how different they are (for small differences) and is related to the Bures metric. The important thing is that we see
$$\text{Some Measure of Distance}(\theta,\theta+d\theta) \propto 1 - \mathcal F \approx \mathrm{Var}[A,|\psi_\theta\rangle](d\theta)^2,$$
where $\mathrm{Var}[A,|\psi_\theta\rangle]=\langle \psi_\theta| (A - \langle A\rangle_\theta)^2| \psi_\theta\rangle$ with $\langle A\rangle_\theta = \langle\psi_\theta|A|\psi_\theta\rangle$. This gives us the opening we need to identify the Quantum Fisher Information (for pure states, with transformation generated by $A$ on a state $|\psi_\theta\rangle$) as we did before for classical probability distributions:
$$F_Q[\theta,|\psi\rangle] = 4 \mathrm{Var}[A,|\psi_\theta\rangle]$$
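(The factor of 4 is a convention; it's what makes the quantum Cramér-Rao bound below come out as $1/F_Q$.) Here's a minimal numerical sketch for a single qubit (a toy example of my own choosing): take $A = \sigma_z/2$ and $|\psi_0\rangle = |+\rangle$, and check that $1-\mathcal F \approx \mathrm{Var}(A)\,(d\theta)^2$:

```python
import numpy as np

A = np.diag([0.5, -0.5])                     # generator: sigma_z / 2
psi0 = np.array([1.0, 1.0]) / np.sqrt(2)     # |+> state

def psi(theta):
    # |psi_theta> = exp(-i A theta) |psi_0>, easy to apply since A is diagonal
    return np.exp(-1j * np.diag(A) * theta) * psi0

theta, dtheta = 0.3, 1e-3
fid = abs(np.vdot(psi(theta), psi(theta + dtheta)))**2

varA = psi0 @ (A @ A) @ psi0 - (psi0 @ A @ psi0)**2
print("(1 - F)/dtheta^2:", (1 - fid) / dtheta**2)   # ~ Var(A) = 1/4
print("Var(A)          :", varA)
print("F_Q = 4 Var(A)  :", 4 * varA)
```

For the $|+\rangle$ state the variance of $\sigma_z/2$ is $1/4$, so this toy example has $F_Q = 1$.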
Quick proof of Quantum CRB (for pure states)
We've established that the Quantum Fisher Information $F_Q[\theta,|\psi\rangle] = 4 \mathrm{Var}(A,|\psi_\theta\rangle)$, where $A$ is the generator of the transformation with respect to $\theta$. Let's see how this $F_Q[\theta,|\psi\rangle]$ bounds the precision of any estimate $\check{\theta}$.
- Measurement & Signal: To estimate $\theta$, we measure some observable. Let's call it ${O}$. The rate at which its average value $\langle {O} \rangle_\theta$ changes with $\theta$ is basically our responsivity to the parameter:
$$ \left| \frac{d\langle {O} \rangle_\theta}{d\theta} \right| = |\langle [A, {O}] \rangle_\theta| $$
This comes from $\frac{d\langle {O} \rangle}{d\theta} = i\langle [A, {O}] \rangle_\theta$. - Minimum noise in the estimator: The best precision we can possibly expect to have is when the uncertainty in our estimate $(\mathrm{Var}(\check \theta))$ the inherent quantum noise [haha] of the observable itself $\mathrm{Var}(O,|\psi_\theta\rangle)$. This gives:
$$\mathrm{Var}(\check \theta)\ge \frac{\mathrm{Var}(O,|\psi_\theta\rangle)}{\left|\frac{d\langle {O} \rangle_\theta}{d\theta}\right|\sq}$$ - Heisenberg's Uncertainty Principle: For the generator $A$ and our chosen observable ${O}$, their uncertainties are fundamentally linked:
$$\mathrm{Var}(A,|\psi_\theta\rangle)\mathrm{Var}(O,|\psi_\theta\rangle) \ge \frac{1}{4} |\langle [A, {O}] \rangle_\theta|^2 $$
This implies:
$$\mathrm{Var}(O,|\psi_\theta\rangle) \ge \frac{|\langle [A, {O}] \rangle_\theta|\sq}{4\mathrm{Var}(A,|\psi_\theta\rangle)} $$ - The QCRB: Substitute the minimum possible value for $\mathrm{Var}(O,|\psi_\theta\rangle) $ (from step 3) into our estimation uncertainty expression (from step 2):
$$ \mathrm{Var}(\check \theta )\ge \frac{\left( \frac{|\langle [A, {O}] \rangle_\theta|\sq}{4\mathrm{Var}(A,|\psi_\theta\rangle)} \right)}{|\langle [A, {O}] \rangle_\theta|\sq}= \frac{1}{F_Q[\theta,|\psi\rangle]} $$
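Continuing the qubit toy example from above, we can also check that a concrete measurement actually reaches this bound: measuring $\sigma_x$ on $e^{-i\sigma_z\theta/2}|+\rangle$ gives outcome probabilities $\cos^2(\theta/2)$ and $\sin^2(\theta/2)$, whose classical Fisher information equals $F_Q = 1$ independent of $\theta$. A sketch (the choice of $\sigma_x$ as the "good" measurement is specific to this example):

```python
import numpy as np

def probs(theta):
    """Outcome probabilities for measuring sigma_x on exp(-i sigma_z theta/2)|+>."""
    return np.array([np.cos(theta / 2)**2, np.sin(theta / 2)**2])

def classical_fi(theta, eps=1e-6):
    p = probs(theta)
    dp = (probs(theta + eps) - probs(theta - eps)) / (2 * eps)  # numerical derivative
    return np.sum(dp**2 / p)

for theta in [0.2, 0.9, 1.7]:
    print(theta, classical_fi(theta))   # ~ 1 = F_Q for |+> with A = sigma_z/2
```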
Application Teaser: what about those time-energy and number-phase uncertainty relations?
What about those time-energy and number-phase uncertainty relations? The way the time-energy uncertainty was often explained to me when I first learned quantum mechanics felt pretty hand-wavy. You know the usual Heisenberg uncertainty principle, like the one we just used: $\mathrm{Var}(A,|\psi_\theta\rangle)\mathrm{Var}(O,|\psi_\theta\rangle) \ge \frac{1}{4} |\langle [A, {O}] \rangle_\theta|^2$. It involves two operators ($A$ and ${O}$) and their commutator. But then with $\Delta E \Delta t \ge \hbar/2$, time ($t$) isn't an operator in the same way position or momentum is. The same kind of puzzle arises with the number-phase uncertainty relation, $\Delta N \Delta \phi \ge 1/2$. The number of particles $N$ is a perfectly good operator, but a universally well-behaved phase operator $\phi$ that properly conjugates to $N$ is tricky to define, especially over a full $2\pi$ range.
The QCRB solves this and puts these uncertainty relations on a proper footing.
Instead of trying to force time or phase to be operators, we can think of them as parameters we want to estimate.
- Time is a ~~flat circle~~ Parameter: If we have a quantum system evolving under a Hamiltonian $\hat H_0$ (our generator, like $A$ in the previous section, and we can set $\hbar=1$ for simplicity here), then time $t$ is the parameter that tells us for how long the evolution $e^{-i\hat H_0 t}$ has been applied. Our QCRB gives us: $$(\Delta \check t)^2 \ge \frac{1}{F_Q[t,|\psi\rangle]} = \frac{1}{4 \mathrm{Var}(\hat H_0, |\psi_t\rangle)}$$ This means $\Delta \check t \cdot \sqrt{\mathrm{Var}(\hat H_0, |\psi_t\rangle)} \ge 1/2$, or more familiarly, $$\Delta \check t \cdot \Delta \hat H_0 \ge 1/2.$$ This is the Mandelstam-Tamm version of the time-energy uncertainty relation. The uncertainty in our estimate of an elapsed time $\check t$ is fundamentally linked to the energy uncertainty (standard deviation $\Delta \hat H_0$) of the state that's evolving.
- Phase as a Parameter: Similarly, if we have a state that acquires a phase $\phi$ due to some process generated by an operator like the number operator $\hat N$ (e.g., in an interferometer, the state might be $e^{-i\hat N \phi}|\psi_0\rangle$), then $\phi$ is the parameter. The QCRB would be: $$(\Delta \check \phi)^2 \ge \frac{1}{F_Q[\phi,|\psi\rangle]} = \frac{1}{4 \mathrm{Var}(\hat N, |\psi_\phi\rangle)}$$ So, $\Delta \check \phi \cdot \sqrt{\mathrm{Var}(\hat N, |\psi_\phi\rangle)} \ge 1/2$, or $\Delta \check \phi \cdot \Delta \hat N \ge 1/2$. This is a rigorous number-phase uncertainty relation.
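For the number-phase case just above, here's a quick numerical illustration (a sketch; the coherent state and its mean photon number $\bar n = 10$ are arbitrary choices): a coherent state has Poissonian photon statistics, so $\mathrm{Var}(\hat N) = \bar n$, and the QCRB gives the familiar shot-noise scaling $\Delta \check \phi \ge 1/(2\sqrt{\bar n})$.

```python
import numpy as np
from math import factorial

nbar = 10.0                        # mean photon number |alpha|^2 (arbitrary choice)
n = np.arange(0, 61)               # truncated Fock space; plenty for nbar = 10

# Poissonian photon-number distribution of a coherent state
p_n = np.exp(-nbar) * nbar**n / np.array([float(factorial(int(k))) for k in n])

mean_N = np.sum(n * p_n)
var_N = np.sum(n**2 * p_n) - mean_N**2      # ~ nbar
F_Q = 4 * var_N
print("Var(N)             :", var_N)
print("QCRB on phase      :", 1 / np.sqrt(F_Q))
print("1 / (2 sqrt(nbar)) :", 1 / (2 * np.sqrt(nbar)))
```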
Application Teaser: Finding the Optimal State to Maximize Quantum Fisher Information
We've seen that the Quantum Fisher Information, $F_Q[\theta,|\psi\rangle] = 4 \mathrm{Var}(A,|\psi_\theta\rangle)$, depends on the initial state $|\psi_0\rangle$ (which becomes $|\psi_\theta\rangle$ after the transformation $e^{-iA\theta}$) through the variance of the generator $A$. To achieve the best possible precision in estimating $\theta$, we want to maximize this $F_Q$. This means we need to find the optimal initial state $|\psi_0\rangle$ that maximizes $\mathrm{Var}(A, |\psi_0\rangle)$ (so we want the state that's noisiest with respect to $A$).
Getting GHZ and N00N states
If the Hermitian operator $A$ (the generator of our parameter $\theta$) has a largest eigenvalue $a_{\text{max}}$ and a smallest eigenvalue $a_{\text{min}}$, the strategy to maximize $\mathrm{Var}(A, |\psi_0\rangle)$ is to use a state that is a superposition of the eigenstates corresponding to these extreme eigenvalues.
The optimal state $|\psi_0\rangle_{\text{opt}}$ is an equal superposition of the eigenstate $|a_{\text{min}}\rangle$ associated with the smallest eigenvalue and the eigenstate $|a_{\text{max}}\rangle$ associated with the largest eigenvalue:
$$|\psi_0\rangle_{\text{opt}} = \frac{1}{\sqrt{2}} (|a_{\text{min}}\rangle + |a_{\text{max}}\rangle)$$
We can calculate the variance:
$$\mathrm{Var}(A, |\psi_0\rangle_{\text{opt}}) = \frac{1}{4}(a_{\text{max}} - a_{\text{min}})^2$$
This gives a maximum Quantum Fisher Information of:
$$F_Q[\theta, |\psi_0\rangle_{\text{opt}}] = (a_{\text{max}} - a_{\text{min}})^2$$
This makes sense: to be most sensitive to the operation $e^{-iA\theta}$, the state should have the widest possible "spread" in terms of the eigenvalues of $A$.
So for example the GHZ state (a superposition of all spins up and all spins down) is a really good state if you're trying to detect something like a magnetic field. So is the N00N state (a superposition of all $N$ photons being in one mode and all $N$ being in the other).
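Here's a small sketch (my own toy example, using dense matrices, so keep $N$ small) checking that a GHZ state of $N$ spins, with the collective generator $A = J_z = \frac{1}{2}\sum_i \sigma_z^{(i)}$, indeed gives $F_Q = (a_{\text{max}} - a_{\text{min}})^2 = N^2$ – the Heisenberg scaling:

```python
import numpy as np
from functools import reduce

def collective_jz(N):
    """J_z = (1/2) sum_i sigma_z^(i), built as a dense 2^N x 2^N matrix."""
    sz, I = np.diag([1.0, -1.0]), np.eye(2)
    total = np.zeros((2**N, 2**N))
    for i in range(N):
        total += reduce(np.kron, [sz if j == i else I for j in range(N)])
    return total / 2

N = 4
Jz = collective_jz(N)

ghz = np.zeros(2**N)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)        # (|00...0> + |11...1>) / sqrt(2)

var = ghz @ (Jz @ Jz) @ ghz - (ghz @ Jz @ ghz)**2
print("F_Q =", 4 * var, "  vs  N^2 =", N**2)    # Heisenberg scaling
```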
Open Questions and Next Steps
The discussion so far has given us a powerful theoretical limit (the QCRB) and a way to identify states that, in principle, can help us reach it for pure states and bounded generators. However, this naturally opens up a bunch of practical questions:
- Are these "optimal" states always practical? The state $\frac{1}{\sqrt{2}} (|a_{\text{min}}\rangle + |a_{\text{max}}\rangle)$ might be theoretically optimal for maximizing $F_Q$. But what if $|a_{\text{min}}\rangle$ and $|a_{\text{max}}\rangle$ represent, for example, states with vastly different particle numbers or energies? Highly entangled states like GHZ states or NOON states, which are known to be optimal for certain sensing tasks (like phase sensing, where $A$ can be related to a collective spin operator or number operator), are stupidly difficult to prepare and maintain in the presence of noise and decoherence. So, while theoretically optimal, are they experimentally feasible or robust? What are the trade-offs? Are there other states that are optimal or nearly optimal, but easier to use and work with in other ways?
- What happens if we are dealing with mixed states? We've focused on pure states $|\psi_\theta\rangle$. But in reality, quantum systems are often in mixed states $\rho_\theta$ due to imperfect preparation, noise, or interaction with an environment. How does the concept of Quantum Fisher Information extend to mixed states?
- Is there a proof of QCRB that's more general? I.e., can we make a definitive statement where we optimize over all possible POVM measurements? (yes.)
- What's the optimal measurement? We've talked about the optimal state for maximizing the QFI. The QCRB tells us the ultimate precision limit, $1/F_Q(\theta)$. But it doesn't explicitly tell us which measurement we need to perform to actually achieve this precision. If I have my optimal state $|\psi_\theta\rangle$, what operator ${O}$ should I measure? Here as well, some measurements are stupidly hard, others are much easier. How do we find things that are (nearly) optimal but easy to implement?
- What if we have multiple parameters to estimate? Often, a quantum system might depend on several unknown parameters simultaneously, $\vec{\theta} = (\theta_1, \theta_2, \dots, \theta_k)$. How do we define the Quantum Fisher Information then (it becomes a matrix)?
- Other questions ... feel free to ask!
References
- Braunstein, Samuel L., and Carlton M. Caves. "Statistical distance and the geometry of quantum states." Physical Review Letters 72.22 (1994): 3439. https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.72.3439
- Helstrom, Carl W. "Quantum detection and estimation theory." Journal of Statistical Physics 1 (1969): 231-252. https://link.springer.com/article/10.1007/bf01007479
- Wilde, Mark M. Quantum Information Theory. Cambridge University Press, 2013. (Chapter 9 for Fidelity, Chapter 12 for Estimation).
- Sisi Zhou's thesis.
- Carlton Caves Talk at ICTS in India.