Minimax quantum state estimation under Bregman divergence

We investigate minimax estimators for quantum state tomography under general Bregman divergences. First, generalizing the work of Komaki et al. $\href{http://dx.doi.org/10.3390/e19110618}{\textrm{[Entropy 19, 618 (2017)]}}$ for relative entropy, we find that given any estimator for a quantum state, there always exists a sequence of Bayes estimators that asymptotically perform at least as well as the given estimator, on any state. Second, we show that there always exists a sequence of priors for which the corresponding sequence of Bayes estimators is asymptotically minimax (i.e. it minimizes the worst-case risk). Third, by re-formulating Holevo's theorem for the covariant state estimation problem in terms of estimators, we find that any covariant measurement is, in fact, minimax (i.e. it minimizes the worst-case risk). Moreover, we find that a measurement is minimax if it is only covariant under a unitary 2-design. Lastly, in an attempt to understand the problem of finding minimax measurements for general state estimation, we study the qubit case in detail and find that every spherical 2-design is a minimax measurement.


Introduction
Quantum state tomography [1,2] refers to the process of determining an unknown quantum state of a physical system by performing quantum measurements. Any information processing task necessarily involves verifying the output of a quantum channel which mandates the study of quantum state tomography, apart from the unavoidable theoretical necessity. With recent developments leading to a transition of quantum computation from theory to practice, the verification of quantum systems and processes is of particular importance.
Given an unknown quantum state with no prior knowledge, it is clear that the measurement must be informationally complete, i.e. a measurement with outcome statistics sufficient to fully specify the quantum state [3]. Conventional data-processing techniques such as direct inversion and maximum likelihood estimation, thus, implicitly assume that the measurement statistics are informationally complete.
In direct inversion, given a fixed informationally complete measurement, one identifies the frequencies of each outcome with the corresponding probabilities. Then, by inverting Born's rule, one obtains a unique estimator for the density operator that reproduces the measurement statistics; an estimator is defined as a map on the set of measurement outcomes X, $\hat\rho : X \to S(H)$, where S(H) is the set of density operators on the underlying Hilbert space H that describes the physical system. However, this strategy suffers from the drawback that such an estimator might not be a physical state, i.e. it may have negative eigenvalues.

Example 1. Suppose one measures an unknown quantum state in $C^2$ along the x, y and z directions. Assume that each of the measurements is performed only once and that each outcome is 'up'. Thus, $n_x = n_y = n_z = 1$ and $N_x = N_y = N_z = 1$, so that $p_x = n_x/N_x = 1$, etc. Now, an estimator that yields the same probabilities is the one with the Bloch vector $(2p_x - 1, 2p_y - 1, 2p_z - 1) = (1, 1, 1)$. This is an invalid quantum state as it lies outside the Bloch ball, and thus necessarily has a negative eigenvalue.
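The arithmetic of Example 1 can be checked numerically. The following is a minimal sketch, assuming the standard Bloch parametrization ρ = (I + r·σ)/2; the function name `direct_inversion` is hypothetical and is introduced here only for illustration.

```python
import numpy as np

# Pauli matrices
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def direct_inversion(n_up, n_total):
    """Linear-inversion estimate from counts of 'up' outcomes along x, y, z."""
    p_up = np.asarray(n_up, dtype=float) / np.asarray(n_total, dtype=float)
    bloch = 2.0 * p_up - 1.0                      # r_i = 2 p_i - 1
    rho = 0.5 * (np.eye(2) + bloch[0] * SX + bloch[1] * SY + bloch[2] * SZ)
    return bloch, rho

# Data of Example 1: one shot per axis, every outcome 'up'.
bloch, rho_di = direct_inversion(n_up=[1, 1, 1], n_total=[1, 1, 1])
eigenvalues = np.linalg.eigvalsh(rho_di)          # real, in ascending order
print("Bloch vector:", bloch)
print("eigenvalues :", eigenvalues)
```

The eigenvalues of (I + r·σ)/2 are (1 ± |r|)/2, so a Bloch vector of length √3 gives one negative eigenvalue, which is what the sketch reports.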
Reference [4] referred to such a shortcoming of direct inversion and proposed an alternative that enforces positivity on the estimator, called maximum likelihood estimation.
A likelihood functional $L[\rho] : S(H) \to [0, 1]$ is the probability of observing a data set D given that the system is in the state ρ:
$L[\rho] = p(D|\rho) = \prod_i \left[\mathrm{tr}(E_i \rho)\right]^{n_i}$. (1)
The data set D is characterized by the outcome set $\{E_i\}$ of the given measurement, where $n_i$ is the number of times the i-th outcome is recorded in D. Maximum likelihood estimation involves maximizing Equation (1) over the space of density operators S(H), and thus obtaining as the estimator the state that maximizes the likelihood functional.
The problem with MLE is that the estimator $\hat\rho_{MLE}$ can be rank-deficient. A rank-deficient estimator is undesirable, as it would mean that by performing only a finite number of measurements we are absolutely certain to rule out many possibilities. This kind of certainty must be bogus, suggesting that there has to be a better estimator. Let us look at Example 1 once again to illustrate this point.

Example 1 (continued).
Given the choice of measurement and the corresponding outcomes, the likelihood functional is $L[\rho] = (1 + r_x)(1 + r_y)(1 + r_z)/6^3$, which needs to be maximized under the constraint $\|r\| \le 1$ that characterizes the physical set of states in $C^2$. The maximum is attained at $r_x = r_y = r_z = 1/\sqrt{3}$, which corresponds to an estimator that is a pure state.
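The constrained maximization above can be reproduced with a few lines of projected gradient ascent. This is only an illustrative sketch: the optimizer, step size, iteration count and starting point are arbitrary choices made here, not a method taken from the references.

```python
import numpy as np

def log_likelihood(r):
    """log L[rho] for one 'up' outcome along each of x, y, z (Example 1)."""
    return np.sum(np.log1p(r)) - 3.0 * np.log(6.0)

def mle_bloch(r0, steps=2000, lr=0.05):
    """Projected gradient ascent of the log-likelihood on the Bloch ball ||r|| <= 1."""
    r = np.array(r0, dtype=float)
    for _ in range(steps):
        r = r + lr / (1.0 + r)          # gradient of sum_i log(1 + r_i)
        norm = np.linalg.norm(r)
        if norm > 1.0:                  # project back onto the unit ball
            r = r / norm
    return r

r_mle = mle_bloch([0.3, -0.2, 0.1])
purity_gap = 1.0 - np.linalg.norm(r_mle)   # vanishes for a pure (rank-one) state
print("MLE Bloch vector:", r_mle)
print("1 - |r|         :", purity_gap)
```

The maximizer sits on the surface of the Bloch ball, so the corresponding density operator has a zero eigenvalue, which is the rank deficiency discussed in the text.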
This shows that state estimators that are unphysical in direct inversion get mapped in MLE to the closest physical states, which lie on the boundary of the state space and are thus rank-deficient. In fact, it can be shown [5] that if there exists a $\hat\rho_{DI}$ obtained via direct inversion over a data set D which is physical, then it also maximizes the likelihood functional, i.e. $\hat\rho_{DI} = \hat\rho_{MLE}$. Although Example 1 is an instance of an extreme case where probabilities are approximated by frequencies of a single measurement, it suffices to illustrate that in direct inversion as well as MLE, all that one cares about is to obtain an estimate of the true state that reproduces the observed measurement statistics, regardless of the fact that in the light of new data the state's estimate might change completely. Reference [5] gives a detailed critique of both direct inversion and MLE, proposing Bayesian Mean Estimation (BME) as a more plausible estimation technique. Moreover, it has been shown in reference [6] that such an estimation technique is quantitatively better.
Generally speaking, in estimation theory [7], the average measure of closeness of an estimator to the actual state is defined as the risk,
$R(\rho, \hat\rho) = E_{X|\rho}[L(\rho, \hat\rho(X))]$, (2)
where X is the random variable corresponding to the measurement outcomes and L is a distance-measure between the true state and the estimator. One way of choosing an optimal estimator is to look at the average risk, defined as the expectation of the risk with respect to a prior distribution over S(H). Then, by minimizing the average risk over the set of all estimators, one obtains what is called a Bayes estimator, $\hat\rho_B$ [7, pg. 228]. In fact, it has been shown that the Bayes estimator is the mean if the loss function is the relative entropy [8], while in reference [9] the same was proved, but in the classical setting, for a more general class of distance-measures called Bregman divergence (see Definition 3.3), which generalizes two important distance-measures: relative entropy and Hilbert-Schmidt distance. We provide a proof for the quantum setting in Appendix A for completeness. Now, the Bayesian mean estimator for a prior distribution π(ρ) is given by
$\hat\rho_B(D) = \int_{S(H)} \rho \, p(\rho|D) \, d\rho$, (3)
where p(ρ|D) is the posterior probability density given by the Bayes rule:
$p(\rho|D) \, d\rho = \frac{p(D|\rho) \, d\pi(\rho)}{p(D)}$,
and $p(D) = \int_{S(H)} d\pi(\rho) \, p(D|\rho)$. However, BME can yield nonsensical estimators if one starts with a bad prior, as the following example illustrates.

Example 2.
Consider a $\sigma_X$ measurement on an unknown quantum state ρ in $C^2$. Suppose there exists a prior π(ρ) that assigns zero measure to all states in $C^2$ but $|-\rangle\langle-|$. A single measurement outcome of '+' rules out the state $|-\rangle\langle-|$ and thus annihilates the prior!
In fact, it should be clear from the above example that some priors can be annihilated by a finite number of (independent) measurements. Thus, in general, one needs a robust [5] prior that cannot be annihilated in order to prevent rank-deficient estimates. However, the estimator's knowledge of the true state can still be jeopardized in the presence of an adversary who provides her with a wrong prior. Therefore, although BME seems to be the best bet, it remains inherently ambiguous due to its dependence on the choice of priors. A systematic approach towards deriving optimality criteria for priors is thus a compelling problem.
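For contrast with Example 2, the sketch below computes a Bayesian mean estimate for the data of Example 1 under a prior with full support. The prior is an assumption made here purely for illustration (uniform on the Bloch ball), the integral is approximated by Monte Carlo sampling, and the function name `bme_bloch` is hypothetical.

```python
import numpy as np

def bme_bloch(num_samples=200_000, seed=0):
    """Posterior-mean Bloch vector for one 'up' outcome along each of x, y, z."""
    rng = np.random.default_rng(seed)
    # Uniform samples in the unit ball: random direction times radius u^(1/3).
    directions = rng.normal(size=(num_samples, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.random(num_samples) ** (1.0 / 3.0)
    r = directions * radii[:, None]
    # Likelihood of the data, up to a constant factor: (1 + r_x)(1 + r_y)(1 + r_z).
    weights = np.prod(1.0 + r, axis=1)
    # Posterior mean = likelihood-weighted average of the prior samples.
    return (weights[:, None] * r).sum(axis=0) / weights.sum()

r_bme = bme_bloch()
print("BME Bloch vector:", r_bme)
print("|r|             :", np.linalg.norm(r_bme))
```

By symmetry, the exact posterior mean under this assumed prior has all three components equal to 1/5, so the estimate lies strictly inside the Bloch ball and is full rank, unlike the MLE estimate for the same data.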
The minimax approach, complementary to BME, seems to be doing just that. In classical statistics, the problem of estimating probability distributions (analogous to state estimation) has been studied using the minimax approach [10-13], which offers an alternative characterization of optimality of estimators. In the minimax approach, one looks at the space of all possible estimators defined on X and, for each estimator $\hat\rho$, picks the state ρ for which it has the worst performance or risk (quantified in terms of a suitably chosen distance-measure between the estimator and the true state). Then, the minimax estimator is the one that has the best worst-case risk. Such an estimator necessarily works for all states ρ ∈ S(H). It can be shown [10] that such a minimax estimator is a Bayes estimator given a particular choice of 'non-informative' prior. Thus, the solution to the minimax problem leads to a natural identification of a prior.
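A textbook classical instance of this connection, which is not taken from this paper, is the estimation of a binomial success probability under squared-error loss: the posterior mean under a Beta(√n/2, √n/2) prior has constant risk and is minimax. The sketch below evaluates the exact risk of this Bayes estimator and of the maximum likelihood estimator on a grid; the function names are hypothetical.

```python
import numpy as np
from math import comb

def risk(estimator, n, p):
    """Mean squared error E[(estimator(X) - p)^2] for X ~ Binomial(n, p)."""
    x = np.arange(n + 1)
    pmf = np.array([comb(n, k) for k in x], dtype=float) * p**x * (1 - p) ** (n - x)
    return float(np.sum(pmf * (estimator(x, n) - p) ** 2))

def mle(x, n):
    """Maximum likelihood estimate of p."""
    return x / n

def bayes_beta(x, n):
    """Posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior."""
    a = np.sqrt(n) / 2.0
    return (x + a) / (n + 2.0 * a)

n = 10
grid = np.linspace(0.0, 1.0, 101)
risk_mle = np.array([risk(mle, n, p) for p in grid])
risk_bayes = np.array([risk(bayes_beta, n, p) for p in grid])
print("worst-case risk, MLE  :", risk_mle.max())
print("worst-case risk, Bayes:", risk_bayes.max())
```

The Bayes estimator is worse than the MLE near p = 0 and p = 1 but has the smaller worst-case risk, which is the trade-off the minimax criterion formalizes.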
However, as pointed out in reference [5], no such rigorous statements were known for the quantum analogue of the problem until then. Recently, the authors of reference [14] have studied the quantum minimax estimation problem in analogy to the classical problem [13], quantifying the estimator's risk in terms of relative entropy. To summarize, they find that given an unknown quantum state ρ and some estimator $\hat\rho$ of it, there always exists a sequence of Bayes estimators that perform at least as well as $\hat\rho$ in the limiting case. Moreover, they show that there always exists a class of priors, called latent information priors (although, conventionally, such priors are called least favourable, and we shall follow the convention), for which there is a corresponding sequence of Bayes estimators whose limit is minimax. Finally, they define a minimax POVM as a POVM that minimizes the minimax risk, see Definition 4.1, and study the qubit ($C^2$) case in detail, obtaining the class of the least favourable priors as well as the minimax POVM for $C^2$.
This paper is divided into six sections. We discuss our main results in Section 2, followed by Section 3, which contains the formalism, and Section 4, which contains the proofs in detail. In Section 5, we discuss the state estimation problem for $C^2$. Finally, in Section 6, we summarize the results and outline future work.

Main Results
Bayesian mean estimation is arguably a more plausible approach towards state estimation than maximum likelihood estimation or direct inversion. However, the performance of BME is tied to the choice of the prior. A complementary approach towards the state estimation problem, the minimax approach, provides a window to explore all possible classes of priors, enabling one to narrow down those that are consistent with the requirements of both the Bayesian and the minimax analysis.
We extend the work done in reference [14] on minimax analysis (as discussed earlier) to a more general class of distance-measures called the Bregman divergence, see Definition 3.3, that generalizes both relative entropy and Hilbert-Schmidt distance. We also generalize the minimax POVM for $C^2$ to Hilbert-Schmidt distance, finding that such a minimax POVM is a spherical 2-design. Moreover, by re-formulating Holevo's theorem [15, pg. 171] for the covariant state estimation problem in terms of estimators, we find that a covariant POVM is, in fact, minimax with Bregman divergence as the distance-measure. Let us discuss these results in detail, informally, postponing the formal statements and proofs to Section 4.

Result 2.1.
Given any estimator of a quantum state, there always exists a sequence of Bayes estimators that asymptotically performs at least as well as the given estimator on every state.
So, for a given quantum state and some estimator, the above result says that one can always find a sequence of Bayes estimators that asymptotically perform at least as well as the estimator.
That there exists such a sequence of Bayes estimators means that there exists a corresponding convergent sequence of priors, see Equation (3). However, in general, the Bayes estimator is uniquely defined only up to the null set of $p_\pi$, comprised of outcomes that have zero probability under π. This illustrates that one cannot replace any given estimator by a corresponding Bayes estimator. However, one can still find a sequence of Bayes estimators that, in the limiting case, perform at least as well as the given estimator. This is an interesting result as it benchmarks BME against any given estimation technique. A related question is: "Does there exist a Bayes estimator that is also minimax?" If one is given a minimax estimator, then Result 2.1 tells us that there exists a sequence of Bayes estimators that are at least as good as the minimax estimator, which in turn implies that the limit itself is minimax. In fact, the next result gives us a Bayesian procedure to arrive at such a minimax estimator.

Result 2.2.
There always exists a sequence of priors such that the limit of the sequence maximizes the average risk of a Bayes estimator, and the limit of the respective sequence of Bayes estimators minimizes the worst-case risk, i.e., it is minimax.
We find that the prior that maximizes the average risk of the Bayes estimator, referred to as the least favourable prior (as it maximizes the minimum possible average risk), is the limit of a convergent sequence of priors, such that the limit of the corresponding sequence of Bayes estimators is minimax. Note that the average risk with relative entropy as the loss function is the average of the maximal accessible information, i.e. the Holevo information, for the ensemble {π(ρ|x), ρ} with respect to the total probability of measurement outcomes. So, maximizing this average Holevo information of the given ensemble over all possible priors is like asking "What is the prior for which the posterior yields maximum accessible information for the ensemble {π(ρ|x), ρ}?" However, no such interpretation can be made for general loss functions such as Bregman divergence.
At this point, it must be noted that the underlying measurement in all the analysis done so far is fixed. Thus, a least favourable prior is inherently tied to the measurement. Any construction of such a prior necessarily implies that we also need to find a class of measurements for which it works. Although it is not clear if one would be able to solve this problem in general, one can do so for at least a subset of the estimation problem, namely the covariant state estimation problem.
In covariant state estimation, given a fixed state $\rho_0$, one is interested in estimating the states $\rho_\theta \in \{V_g \rho_0 V_g^\dagger\}$, where g ∈ G is a group element acting on the parameter space Θ with θ ∈ Θ, and $V_g$ is the projective unitary representation of the parametric group G. By generalizing Holevo's theorem [15, Theorem 3.1] to Bregman divergences, we obtain a least favourable prior.

Lemma 2.1. The uniform measure on the parameter space Θ is a least favourable prior for covariant measurements.
A covariant measurement is a measurement that reflects the transformation of the state under the group action appropriately in the outcome statistics (see Definition 4.2). Now, the question is: "What is the measurement that minimizes the worst-case risk?" An answer to this question is closely related to finding the class of measurements for the least favourable priors. The only additional information that is needed is whether such a class of measurements also minimizes the average risk with respect to the least favourable prior. It turns out that this is indeed the case as far as covariant estimation is concerned. It is not so straightforward to generalize these results to the general state estimation problem. To better understand the situation for the general case, we look at the simplest system of a single qubit, extending the results of reference [14] to Hilbert-Schmidt distance (details in Section 5). In particular, we find that every spherical 2-design in $C^2$ is a minimax POVM.
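To make the notion of a spherical 2-design concrete, the sketch below checks the defining moment conditions on the Bloch sphere for a regular tetrahedron and for an octahedron: the first moment of the points must vanish and the second moment must equal I/3, as for the uniform measure on $S^2$. The function name and the choice of point sets are illustrative assumptions; the associated POVM is taken here to have elements proportional to the projectors onto these Bloch directions.

```python
import numpy as np

# Bloch vectors of a regular tetrahedron inscribed in the unit sphere.
TETRAHEDRON = np.array(
    [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float
) / np.sqrt(3.0)

# Bloch vectors of an octahedron (the +/- x, y, z directions).
OCTAHEDRON = np.vstack([np.eye(3), -np.eye(3)])

def is_spherical_2_design(points, tol=1e-12):
    """Check the first- and second-moment conditions of a spherical 2-design on S^2."""
    points = np.asarray(points, dtype=float)
    on_sphere = np.allclose(np.linalg.norm(points, axis=1), 1.0, atol=tol)
    first_moment = points.mean(axis=0)                  # must vanish
    second_moment = points.T @ points / len(points)     # must equal I/3
    return bool(
        on_sphere
        and np.allclose(first_moment, 0.0, atol=tol)
        and np.allclose(second_moment, np.eye(3) / 3.0, atol=tol)
    )

print("tetrahedron:", is_spherical_2_design(TETRAHEDRON))
print("octahedron :", is_spherical_2_design(OCTAHEDRON))
print("+/- z only :", is_spherical_2_design([[0, 0, 1], [0, 0, -1]]))
```

A projective measurement along a single axis fails the second-moment condition, which is one way to see that it cannot distinguish states that differ only in the orthogonal directions.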

Formalism
Consider a quantum system S described by a finite-dimensional Hilbert space H with S(H) as the set of density operators on H. Then, consider a quantum measurement to be an experiment in which the quantum system S is measured and let X be the corresponding outcome space of the measurement outcomes. Each possible event of the experiment can be identified with a subset B ⊆ X, the event being 'the measurement outcome x lies in B'. The probability distribution of the events is thus defined over a σ-algebra of the measurable subsets B ⊆ X. To be in touch with physical reality, we choose the outcome space to be a Hausdorff space, i.e. a topological space where for any two distinct points $x_1, x_2 \in X$ there exist two disjoint open sets $X_1, X_2 \subset X$ such that $x_1 \in X_1$ and $x_2 \in X_2$. This ensures that the σ-algebra is a Borel σ-algebra generated by countable intersections, countable unions and relative complements of open subsets of X. Let P(H) be the set of positive operators on H.

Definition 3.1 (Quantum measurement). A Positive Operator-Valued Measure (POVM) is a map
$P : \Sigma \to P(H)$,
where Σ is the σ-algebra of all measurable subsets of X. Thus, a POVM associates an operator P(B) to each B ∈ Σ satisfying the following:
1. $P(\emptyset) = 0$ and $P(X) = I$;
2. $P(\bigcup_i B_i) = \sum_i P(B_i)$ for every countable collection of pairwise disjoint sets $B_i \in \Sigma$.
The set of all POVMs on Σ forms a convex set denoted by $\mathcal{P}$. A POVM is informationally complete [3] if the operators {P(B)} span L(H), the space of linear operators on H. The measurement statistics of such an IC-POVM are sufficient to determine, uniquely, all possible states that the quantum system could be in, in the limit when an infinite number of measurements are performed. Optimization of data-processing deals with the practical aspect of not having infinite resources and minimizing the corresponding statistical error. Reference [16] reviews the theoretical development of optimization techniques in quantum tomography based on informationally complete measurements. However, in this paper, we make no assumptions on the POVM. In fact, we look at an alternative definition for an optimal POVM, to be discussed later in this section.
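For a finite outcome set, the definition and the span criterion for informational completeness are easy to check numerically. The sketch below uses, as an assumed example, the six-outcome qubit POVM that corresponds to choosing one of the x, y, z axes with probability 1/3 and measuring along it (the measurement behind the factor 1/6 per outcome in Example 1); all function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
PAULIS = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_povm():
    """Six-outcome POVM, ordered as x+, x-, y+, y-, z+, z-."""
    return [(I2 + sign * s) / 6.0 for s in PAULIS for sign in (+1, -1)]

def is_informationally_complete(povm):
    """True if the POVM elements span the d^2-dimensional operator space L(H)."""
    d = povm[0].shape[0]
    stacked = np.array([e.reshape(-1) for e in povm])
    return bool(np.linalg.matrix_rank(stacked) == d * d)

def born_probabilities(povm, rho):
    """Outcome probabilities p_i = tr(E_i rho)."""
    return np.array([np.trace(e @ rho).real for e in povm])

povm = pauli_povm()
rho_up = np.array([[1, 0], [0, 0]], dtype=complex)      # the state |0><0|
z_only = [(I2 + PAULIS[2]) / 2.0, (I2 - PAULIS[2]) / 2.0]

print("sums to identity:", np.allclose(sum(povm), I2))
print("IC (six-outcome):", is_informationally_complete(povm))
print("IC (z only)     :", is_informationally_complete(z_only))
print("probabilities   :", born_probabilities(povm, rho_up))
```

The rank test is just the statement that the vectorized POVM elements span a space of dimension d² = 4; a single projective measurement spans only a two-dimensional subspace and therefore fails it.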
The following lemma (see Appendix B for proof) provides a convenient way of representing a POVM as an operator-valued density.

Lemma 3.1 (Existence of a POVM density). Every $P \in \mathcal{P}$ admits a density, i.e. for any POVM P there exists a finite measure µ(dx) over X such that µ(X) = 1 and
$P(B) = \int_B M(x) \, d\mu(x)$
for an operator-valued density $M(x) \ge 0$.
The conditional probability of the event 'the measurement outcome x lies in B given that the system is in a state ρ' is given by Born's rule as
$p(B|\rho) = \mathrm{tr}[P(B)\rho] = \int_B d\mu(x) \, \mathrm{tr}[M(x)\rho]$, (7)
or, in the differential form, as
$dp(x|\rho) = d\mu(x) \, \mathrm{tr}[M(x)\rho]$, (8)
which will come in handy later.

Now that we have defined a quantum measurement, we proceed with the formulation. In estimation theory [7], one typically parametrizes the system S by a parameter θ. The data set of the measurement outcomes is represented by a random variable X. Using this data set one estimates the parameter θ or, more generally, $\rho_\theta$, the estimand. Succinctly, this involves two random variables Θ and X defined as below:
• In the Bayesian model, the quantity θ that parametrizes the system S is treated as a random variable Θ. This random variable is defined over the parameter space $\Omega_\Theta$ and is distributed according to an a-priori probability distribution $\pi_\Theta \in P(\Theta)$ (where P(Θ) is the set of all probability distributions on Θ).
• X is the random variable associated with the outcomes of the measurement performed on the system S, defined over the sample space X . The outcomes of the measurement are conditioned on the random variable Θ. Thus, X is distributed according to the conditional probability p X (x|θ), given by Equation (7).
The parameter space Θ is chosen to be a compact metric space. The set of all bounded continuous real-valued functions on Θ is denoted by C(Θ, R). The set of probability distributions P(Θ) on Θ is endowed with the weak topology, which defines the notion of weak convergence.

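Definition 3.2. A sequence of probability measures $(\pi_n)_n$ in P(Θ) converges weakly to $\pi \in P(\Theta)$ if $\int_\Theta f \, d\pi_n \to \int_\Theta f \, d\pi$ for every $f \in C(\Theta, R)$.
Then, as Θ is a compact metric space and P(Θ) is endowed with the weak topology, it follows from [17, Theorem 6.4] that P(Θ) is also a compact metric space.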
The central problem in quantum state estimation is to obtain an estimator of $\rho_\theta$. We define an estimator as the map
$\hat\rho : X \to S(H)$.
The value of $\hat\rho(x)$ is the estimate of $\rho_\theta$ when the measurement outcome is X = x. We want $\hat\rho(X)$ to be close to $\rho_\theta$, but $\hat\rho(X)$ is a random variable. One way of defining a meaningful measure of closeness is by taking an expectation over the conditional distribution of X, Equation (7). Let $L(\rho_\theta, \hat\rho(x))$ be the loss function that quantifies the closeness of an estimated state $\hat\rho(x)$ to the true state $\rho_\theta$. We assume two things about L: that it is non-negative, and that it vanishes when the estimate coincides with the true state. The average measure of closeness of $\hat\rho(X)$ to $\rho_\theta$ is defined as the risk function
$R(\rho_\theta, \hat\rho) = E_{X|\theta}[L(\rho_\theta, \hat\rho(X))] = \int_X dp(x|\theta) \, L(\rho_\theta, \hat\rho(x))$.
One would like to obtain an estimator that minimizes the risk for all values of θ. This problem does not have a solution in general, i.e. there does not exist an estimator that uniformly minimizes the risk for all values of θ, except for the case when $\rho_\theta$ is a constant. Instead, one can look at the following two quantities that are a good measure of the risk in a global sense:
1. Average risk:
$r(\pi, \hat\rho) = \int_\Theta d\pi(\theta) \, R(\rho_\theta, \hat\rho)$,
where π(θ) is an a-priori distribution over the parameter space Θ.
2. Worst-case/minimax risk:
$\sup_{\theta \in \Theta} R(\rho_\theta, \hat\rho)$.
The estimator that minimizes the average risk is the Bayes estimator $\hat\rho_B$ [7, pg. 228]. In reference [8] it was shown that the Bayes estimator is the mean if the loss function is the relative entropy $D(\rho_\theta \| \hat\rho(x))$, i.e.,
$\hat\rho_B(x) = \int_\Theta d\pi(\theta|x) \, \rho_\theta$,
where dπ(θ|x) is the posterior probability distribution obtained via the Bayes rule,
$d\pi(\theta|x) = \frac{dp(x|\theta)}{dp_\pi(x)} \, d\pi(\theta)$,
where $p_\pi(B) = \int_B \int_\Theta dp(x|\theta) \, d\pi(\theta)$. Note that in the continuous case, the likelihood ratio $p(x|\theta)/p_\pi(x)$ is replaced by the corresponding Radon-Nikodym derivative, which is defined uniquely up to the null set of $p_\pi$. In fact, the same was proved [9], but only in the classical setting, for a more general class of distance-measures called Bregman divergence, which is the measure we will use in our analysis. We provide a proof for the quantum setting in Appendix A for completeness. Let us now define Bregman divergence.
Bregman divergence generalizes two important classes of distance-measures: the relative entropy, obtained by choosing $f : x \mapsto x \log x$, and the Hilbert-Schmidt distance (Schatten 2-norm), obtained by choosing $f : x \mapsto x^2$. Let us now look at a few of its important properties. First, Bregman divergence is invariant under unitary transformations of its arguments, i.e. $D_f(U\rho U^\dagger, U\sigma U^\dagger) = D_f(\rho, \sigma)$. Second, it is not a metric, as it is neither symmetric nor does it satisfy the triangle inequality, but by the strict convexity of f, $D_f(\rho, \sigma) \ge 0$, with equality if and only if ρ = σ. Third, the convexity of f implies that $D_f(\cdot, \cdot)$ is convex in its first argument; it is jointly convex if f is operator convex and numerically nonincreasing [18]. Moreover, by generalizing the proof of lower semi-continuity of relative entropy as in reference [19], we obtain the lower semi-continuity of Bregman divergence (see Appendix C for the proof).
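The two special cases above, and the fact that the mean of an ensemble minimizes the average Bregman divergence, can be checked numerically. The sketch below assumes the standard operator form $D_f(\rho, \sigma) = \mathrm{tr}[f(\rho) - f(\sigma) - f'(\sigma)(\rho - \sigma)]$, which is taken here as a stand-in for Definition 3.3; the helper names, the random test states and the ensemble weights are illustrative choices.

```python
import numpy as np

def apply_fn(fn, rho):
    """Apply a scalar function to a Hermitian matrix through its eigenvalues."""
    vals, vecs = np.linalg.eigh(rho)
    return (vecs * fn(vals)) @ vecs.conj().T

def bregman(rho, sigma, f, df):
    """Assumed form: D_f(rho, sigma) = tr[f(rho) - f(sigma) - f'(sigma)(rho - sigma)]."""
    term = apply_fn(f, rho) - apply_fn(f, sigma) - apply_fn(df, sigma) @ (rho - sigma)
    return np.trace(term).real

def random_state(dim, rng):
    """Full-rank random density matrix (Ginibre construction)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

# Pairs (f, f') for the two cases named in the text.
xlogx = (lambda x: x * np.log(x), lambda x: np.log(x) + 1.0)
square = (lambda x: x**2, lambda x: 2.0 * x)

rng = np.random.default_rng(1)
rho, sigma = random_state(2, rng), random_state(2, rng)

# Direct evaluations to compare against.
relative_entropy = np.trace(rho @ (apply_fn(np.log, rho) - apply_fn(np.log, sigma))).real
hs_distance_sq = np.trace((rho - sigma) @ (rho - sigma)).real

# A small ensemble and its mean state.
states = [random_state(2, rng) for _ in range(3)]
weights = np.array([0.5, 0.3, 0.2])
mean_state = sum(w * s for w, s in zip(weights, states))

def average_divergence(candidate, f, df):
    """Ensemble average of D_f(rho_i, candidate)."""
    return sum(w * bregman(s, candidate, f, df) for w, s in zip(weights, states))

print(bregman(rho, sigma, *xlogx), relative_entropy)
print(bregman(rho, sigma, *square), hs_distance_sq)
```

For $f(x) = x^2$ the divergence reduces to $\mathrm{tr}(\rho - \sigma)^2$, the square of the Hilbert-Schmidt distance. Evaluating `average_divergence` at `mean_state` and at any other candidate state illustrates why the Bayes estimator is a mean for this class of loss functions.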
We are now ready to state and prove the main results of this paper.
Proof. Consider the average distance between the Bayes estimator and a given estimator $\hat\rho$ for some prior π ∈ P(Θ) as the map $g : \pi \mapsto D_f^{\hat\rho}(\pi)$. Now, the Bayes estimator is uniquely defined up to the null set of $p_\pi$. In fact, it is discontinuous on X, see Appendix D, and the points of discontinuity belong to the null set of $p_\pi$. So, unless the null set of $p_\pi$ is empty, the Bayes estimator cannot be defined continuously, since there can exist different sequences that converge to the same prior, but the limits of the corresponding sequences of Bayes estimators may not coincide on the null set of $p_\pi$. To deal with the discontinuity of the Bayes estimator, we consider closed subsets of P(Θ) with the defining property that every element of these subsets renders the corresponding Bayes estimator continuous on X. Then, g is lower semi-continuous on each closed subset as Bregman divergence is lower semi-continuous (Appendix C). Thus, there exists a prior $\pi_n$ in every subset that minimizes it on that subset. So, we look at the sequence of such priors $(\pi_n)_n$ and find that the corresponding sequence of Bayes estimators $(\hat\rho_B^{\pi_n})_n$ converges to a limit that has a risk lower than or equal to that of the given estimator. Let us now proceed with the proof.
We define the closed subsets of P(Θ) as
$P_{\mu/n} = \{ \mu/n + (1 - 1/n)\pi : \pi \in P(\Theta) \}$, (16)
where µ is a measure such that $p_\mu(x) > 0$ for all x ∈ X. The latter condition ensures that the Bayes estimator for a prior that lies in $P_{\mu/n}$ is continuous on $P_{\mu/n}$. Then, as a closed subset of a compact set is compact, there exists a prior $\pi_n \in P_{\mu/n}$ such that
$D_f^{\hat\rho}(\pi_n) = \inf_{\pi \in P_{\mu/n}} D_f^{\hat\rho}(\pi)$.
In fact, as P(Θ) is a compact metric space, the sequence of priors $(\pi_n)_n$ has a convergent subsequence. Let us denote this subsequence as $(\pi_m)_m$. Let $n_m$ be such that $\pi_{n_m} = \pi_m$.
Then, the idea is to use the fact that each π m minimizes D f ρ on the corresponding closed subset P µ/nm to obtain a suitable condition. To begin with, we define a prior in the neighbourhood of π m+1 ∈ P µ/nm+1 by taking a convex sum of it with another element in P µ/nm+1 . Observe that nm nm+1 π m+1 + (1 − nm nm+1 )δ(θ − θ 0 ) lies in P µ/nm+1 , for any θ 0 ∈ Θ. So, we define a prior with 0 ≤ u ≤ 1. This is like considering a perturbation in the neighbourhood of π m+1 and noting that the derivative of D f ρ (π) is positive as one approaches π m+1 as it minimizes D f ρ (π) on the set P µ/nm+1 . Thus, we have Let k = nm nm+1 . Then, dp π (x) = dp(x|θ) u kπ m (θ) So, the first term (18) is while the second term, using Lemma E.1, is Let us now calculate the derivative of the Bayes estimator.
Plugging in (20) we obtain, dp(x|θ) 2 k dp π m (x) dp(x|θ) + (1 − k) dp(x|θ 0 ) dp(x|θ) − dp π m+1 (x) dp(x|θ) . Now, in the limit m → ∞, the coefficient of k vanishes in the expression above due to weak convergence, while the last term of the first integral cancels with the last term of the second integral. So, we have Applying the limit and plugging (21) in (19), we find that the second term in (18) Finally, combining both the terms of (18) and applying the limit, we find that This implies that The right-hand side of the inequality above can be rearranged to obtain This implies that But, as Bregman divergence is lower semi-continuous (Appendix C), we have ).
Therefore, we arrive at our result, i.e.

A Bayesian method for minimax state estimation
Formally, Result 2.2 is stated as the following theorem.

Theorem 4.2.
There exists a convergent sequence of priors $(\pi_n)_n$ such that the limit of the sequence maximizes the average risk, Equation (5), of the Bayes estimator. The limit of such a sequence is referred to as a least favourable prior. Moreover, the sequence of Bayes estimators $(\hat\rho_B^{\pi_n})_n$ converges such that the limit of the sequence is minimax.
Proof. Consider the average risk of the Bayes estimator for a prior π ∈ P(Θ) as the map $h : \pi \mapsto r(\pi, \hat\rho_B^\pi)$. Due to the discontinuity of the Bayes estimator, we follow the same regularization arguments as made earlier and define closed subsets of P(Θ), Equation (16), such that the Bayes estimator is continuous on each of these subsets. The map h is then continuous on each of the subsets. Since these subsets are closed subsets of a compact set, they are themselves compact. Therefore, h attains a maximum on each of the subsets. Then, denoting the maximizer in each subset $P_{\mu/n_{m+1}}$ by $\pi_{m+1}$, we define a prior as done earlier in Equation (17) as a convex sum of $\pi_{m+1}$ and another element in $P_{\mu/n_{m+1}}$. Since the average risk is maximized on $P_{\mu/n_{m+1}}$, the derivative of $r(\pi, \hat\rho_B^\pi)$ is negative as one approaches $\pi_{m+1}$.
Evaluating the derivatives using Lemma E.1, the first term in (22) is while the derivative in the second term of (22) is So, plugging this in the second term of (22), we have Thus, applying the limit m → ∞ we arrive at the following inequality, The lower semi-continuity of Bregman divergence implies that ).
But, the other direction of the inequality above is true trivially i.e. ).
Therefore, we obtain ). But, By Lemma E.2, the limit of the suprema over subsets P µ/nm+1 can be replaced by a supremum over the set P(Θ) since the sequence of subsets P µ/nm+1 is dense in P(Θ). Thus, we have Using the minimax theorem for lower semi-continuous and quasi-convex functions [20, Theorem 3.4], we can exchange the infimum and the supremum to obtain Thus, by (24) and (25)  ).
Result 2.1 and Result 2.2 are based on the assumption that the underlying POVM is fixed. However, in general, the risk depends on the POVM P, i.e. $R(\rho_\theta, \hat\rho) \equiv R_P(\rho_\theta, \hat\rho)$. One way of defining an optimal POVM could be to minimize the worst-case risk over $\mathcal{P}$, the convex set of all POVMs. The POVM that minimizes the worst-case risk is called a minimax POVM.

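Definition 4.1 (Minimax POVM). A POVM is said to be a minimax POVM if it attains the minimum
$\min_{P \in \mathcal{P}} \inf_{\hat\rho} \sup_{\theta \in \Theta} R_P(\rho_\theta, \hat\rho)$,
where $\rho_\theta$ is the estimand, $\hat\rho : X \to S(H)$ is an estimator with risk $R_P(\rho_\theta, \hat\rho)$ which is a function of the POVM P, and $\mathcal{P}$ is the convex set of all POVMs on the measurement outcome space X.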
It remains unclear as to how one could obtain a minimax POVM for general state estimation, but if we restrict ourselves to the case of covariant state estimation the situation simplifies.

Covariant state estimation
In the covariant state estimation problem, as discussed in reference [15], one is given a fixed state $\rho_{\theta_0}$ and is interested in estimating all the states $\rho_\theta$ that lie in the orbit $\{V_g \rho_{\theta_0} V_g^\dagger\}$, where G is a parametric group of transformations of the parameter space Θ with elements g ∈ G, and $g \mapsto V_g$ is a (continuous) projective unitary representation of G. One can think of this as representing the following physical scenario. Given that the parameter θ labels the quantum states of the Hilbert space H, θ can be assumed to describe some aspects of the preparation procedure for the state $\rho_\theta$: a transformation g of the parameter $\theta_0$ results in the preparation of the state $\rho_\theta = V_g \rho_{\theta_0} V_g^\dagger$, where $\theta = g\theta_0$. Covariant state estimation thus corresponds to the estimation of the state $\rho_\theta$ with the measurement outcome space X being identical to the parameter space Θ. Let us first define a covariant measurement.

Definition 4.2 (Covariant measurement). A measurement P is covariant with respect to the representation $g \mapsto V_g$ if $P(B_g) = V_g P(B) V_g^\dagger$ for all g ∈ G and all measurable B ⊆ Θ, where $B_g = \{g\theta : \theta \in B\}$.

Thus,
$\mathrm{tr}[P(B_g)\rho_{g\theta}] = \mathrm{tr}[P(B)\rho_\theta]$,
i.e. a covariant measurement preserves the probability distribution under the transformation of the state. We refer the reader to reference [15] for a more detailed discussion on covariant measurements. Before we start building towards the proof of Result 2.3, let us look at some of the properties of the parametric group G to understand the situation better. First, the group G is chosen to act transitively on Θ. This ensures that the map $g \mapsto g\theta_0$ maps G onto the whole of Θ. Second, G is assumed to be unimodular, which implies that there exists an invariant measure µ on G. Third, G is assumed to be compact, which ensures that the measure µ is finite. The measure µ is normalized as µ(G) = 1. Now, we are interested in an invariant measure ν on the σ-algebra A(Θ) on Θ such that
$\nu(B_g) = \nu(B)$ for all g ∈ G and B ∈ A(Θ),
where $B_g = \{\theta' : \theta' = g\theta, \theta \in B\}$. If $G_0$, the stationary subgroup of G, is unimodular then such a measure ν exists, and if $G_0$ is compact then ν is finite and can be constructed from µ by demanding that the following relation holds for all integrable functions f on Θ:
$\int_G f(g\theta_0) \, \mu(dg) = \int_\Theta f(\theta) \, \nu(d\theta)$.
We now state Proposition 2.1 from reference [15] as the following lemma that gives a relation between the two measures.
Let us pause here to look at an example.

Example 3. Let us assume that we are interested in estimating all those states in C 2 that lie on the Bloch sphere. Thus, the parameter space Θ is S 2 . This is a covariant estimation problem with the parametric group of transformations on S 2 being SO(3). Its projective unitary representation is the quotient subgroup SU(2)/U(1). Let us assume that the initial state
It is straightforward to verify that $\frac{1}{4\pi} \int_{S^2} M(\theta, \phi) \sin\theta \, d\theta \, d\phi = I$.
Having defined and illustrated the problem of covariant state estimation, we now recall Holevo's theorem [15, Theorem 3.1], which states that for every loss function that is invariant under the group transformation g, the minimax risk as well as the average risk attain their minima at a covariant measurement. But, the analysis in reference [15] is done for loss functions expressed as functions of the true parameter and the estimator of the parameter. It can be recast in terms of the general framework involving estimators that are functions of the parameter, $\hat\rho : \Theta \to S(H)$, by simply choosing the domain of the loss function to be the set of density matrices S(H) as opposed to the parameter space Θ. We thus state it as the following lemma, which is the main ingredient of the proof of Result 2.3.

Lemma 4.2 (Theorem 3.1, [15]). In the quantum covariant statistical estimation problem, given an estimator $\hat\rho$ of the state, the minima of the average risk $r_P(\nu, \hat\rho)$ with respect to the uniform Haar measure ν on Θ and the worst-case risk $\sup_\theta R_P(\rho_\theta, \hat\rho)$ for all Θ-measurements are achieved on a covariant measurement. Moreover, for any covariant measurement $P_c$, we have
$R_{P_c}(\rho_\theta, \hat\rho) = r_{P_c}(\nu, \hat\rho) = \sup_{\theta'} R_{P_c}(\rho_{\theta'}, \hat\rho)$ for all θ ∈ Θ.
Note: A covariant measurement is not a unique minimum for either the average or the worst-case risk.
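As a numerical companion to Example 3 and Lemma 4.2, the sketch below works in the Bloch representation for pure qubit states. It assumes, for illustration only, the covariant POVM density M(n) = I + n·σ on the sphere, the uniform prior on pure states, and the loss tr[(ρ − ρ̂)²] = |r − r̂|²/2; with these choices the Born density of outcome n for a state with Bloch vector r is 1 + n·r. Integrals over the sphere are evaluated with a product quadrature, and all names are hypothetical.

```python
import numpy as np

def sphere_quadrature(n_z=8, n_phi=16):
    """Nodes and weights (summing to 1) for the uniform measure on S^2."""
    z, wz = np.polynomial.legendre.leggauss(n_z)     # Gauss-Legendre in z = cos(theta)
    phi = 2.0 * np.pi * np.arange(n_phi) / n_phi     # uniform grid in phi
    zz, pp = np.meshgrid(z, phi, indexing="ij")
    s = np.sqrt(1.0 - zz**2)
    nodes = np.stack([s * np.cos(pp), s * np.sin(pp), zz], axis=-1).reshape(-1, 3)
    weights = np.repeat(wz / 2.0, n_phi) / n_phi
    return nodes, weights

NODES, WEIGHTS = sphere_quadrature()

def bayes_estimate(outcome):
    """Posterior-mean Bloch vector for a given outcome direction, uniform prior."""
    likelihood = 1.0 + NODES @ outcome               # tr[M(outcome) rho_n] = 1 + outcome.n
    posterior = WEIGHTS * likelihood
    return posterior @ NODES / posterior.sum()

def risk(true_bloch):
    """Expected loss |r - r_hat|^2 / 2 of the Bayes estimator for a pure true state."""
    outcome_prob = WEIGHTS * (1.0 + NODES @ true_bloch)   # Born rule on the quadrature nodes
    estimates = np.array([bayes_estimate(n) for n in NODES])
    losses = 0.5 * np.sum((true_bloch - estimates) ** 2, axis=1)
    return float(outcome_prob @ losses)

north = np.array([0.0, 0.0, 1.0])
tilted = np.array([1.0, 2.0, 2.0]) / 3.0
print("resolution of identity (first moment):", WEIGHTS @ NODES)
print("risk at north pole :", risk(north))
print("risk at tilted axis:", risk(tilted))
```

Since M(n) = I + n·σ, the resolution of the identity reduces to the weights summing to one and the first moment of the nodes vanishing. Under these assumptions the Bayes estimate for outcome n is the Bloch vector n/3 and the risk works out to 4/9 for every pure state, in line with the statement that the risk of a covariant measurement does not depend on the state.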
The above lemma implies that for any measurement P , there exists a covariant measurement that minimizes the average risk as well as the worst-case risk for a fixed estimator ρ̂. But, we know that for a fixed measurement the average risk is minimized by the Bayes estimator ρ̂ B . Thus, the Bayes estimator minimizes the average risk for a covariant measurement. Moreover, the invariance of the loss function implies that the Bayes estimator must be covariant under the group transformations, as shown below.

Lemma 4.3. The Bayes estimator is covariant under the group transformations
Proof. Recalling the invariance property of the loss function: , let us verify for the case of the Bayes estimator. Recalling Equation (14), we have ρ̂ .
As we are interested in a covariant measurement P (B g ) = V g P (B)V † g , therefore ρ̂ .
By the invariance of ν and the fact that ρ θ = V g ρ g −1 θ V † g , we finally obtain

Thus, we have established two things about a covariant measurement. First, that the risk of a covariant measurement is independent of the state ρ θ and second, that it minimizes the average as well as the worst-case risk among all measurements. But, the problem of finding a minimax POVM is closely tied to obtaining a least favourable prior which in turn is tied to the underlying measurement, as we discussed in Section 2. The following lemma gives a least favourable prior and the corresponding measurement in the context of covariant state estimation.

Lemma 2.1. The uniform measure on the parameter space Θ is a least favourable prior for covariant measurements.
Proof. As the Bayes estimator ρ̂ π B minimizes the average risk with respect to the prior π, However, by Lemma 4.2, we know that for a covariant measurement the risk is independent of the state ρ θ , i.e.
This implies that The other direction of the above inequality holds trivially, i.e. Therefore, and so ν is a least favourable prior for a covariant measurement P c .
The only remaining ingredient needed to prove Result 2.3 is the following lemma.

Lemma 4.4. The Bayes estimator for a covariant measurement P c is
Proof. Recalling Equation (14), the Bayes estimator for a covariant measurement P c is , B ∈ A(Θ).
Using Lemma 4.1, the denominator in the above expression is while the numerator is

Now that we have a class of measurements and the corresponding least favourable prior, in order to show that it is minimax (first part of Result 2.3), we have to show that this class of measurements also minimizes the average risk with respect to such a least favourable prior. Recall Result 2.3. Formally, the first part of Result 2.3 is stated as the following theorem and the second part as a corollary to the theorem.

Proof. Recalling Definition 4.1 of a minimax POVM and the fact that the Bayes estimator minimizes the average risk, we have But, as it implies that the same holds for the uniform Haar measure ν as well. Thus, where P c is a covariant measurement that minimizes the average risk. Also, by Lemma 2.1, ν is a least favourable prior, which means that ρ̂ ν B is a minimax estimator. Therefore, we have
The other direction of the inequality above holds trivially, i.e.
Hence, we have proved that P c is a minimax POVM, i.e.
As the risk for a covariant measurement is independent of the state ρ θ (by Lemma 4.2) and depends only on the estimator (the Bayes estimator in this case), we have the following corollary.
where B g −1 = {g −1 θ| θ ∈ B}, and P c and P c have the same seed, then P c is also minimax.
In order to understand the above corollary better let us look at what it means for Example 3.

Example 3 (continued). Now, since any state |θ, φ⟩ in S 2 can be generated by elements of SU(2)/U(1), the following equivalence holds: where dU θ,φ is the Haar measure on SU(2)/U(1). In fact, the above implies that Now, the above equivalence along with [21, Theorem 3.3.1] implies that one can construct a unitary 2-design from a quantum 2-design. Thus, the set of unitary matrices that generate the set of eigenstates of the Pauli matrices σ x , σ y , σ z is a unitary 2-design given as below.

The corresponding measurement that is covariant under the above unitary 2-design is then obtained via
It is straightforward to see that the corresponding M θ,φ are the same as the Pauli measurements apart from a normalization constant. This measurement is thus minimax.
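The two properties used in this example can be checked directly in a few lines. The sketch below is a hedged illustration rather than the construction used in the text: it assumes the POVM elements are the six Pauli eigenprojectors with weight 1/3 each (the normalization constant is not stated explicitly here), and it tests the state 2-design property through the standard identity that the average of |ψ⟩⟨ψ| ⊗ |ψ⟩⟨ψ| over a 2-design in dimension 2 equals one third of the projector onto the symmetric subspace. Variable names are chosen for this example only.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

# Collect the six eigenstates of the Pauli matrices.
states = []
for pauli in (SX, SY, SZ):
    _, vecs = np.linalg.eigh(pauli)          # columns are normalized eigenvectors
    states.extend(vecs[:, k] for k in range(2))

projectors = [np.outer(v, v.conj()) for v in states]

# ASSUMED normalization: each POVM element is (1/3) |psi><psi|.
povm = [p / 3 for p in projectors]
povm_sum = sum(povm)

# 2-design test: mean of P (x) P should equal Pi_sym / 3 when d = 2.
swap = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
pi_sym = (np.eye(4) + swap) / 2
second_moment = sum(np.kron(p, p) for p in projectors) / len(projectors)

print("resolves identity:", np.allclose(povm_sum, I2))
print("state 2-design:   ", np.allclose(second_moment, pi_sym / 3))
```

Both checks depend only on the projectors, so the arbitrary phases returned by the eigensolver do not affect the result.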
Note: In the above example we obtained a minimax POVM, namely a covariant measurement for the parameter space S 2 , which describes only pure qubit states. There is no a priori reason to believe that the same measurement would also be minimax for estimating an arbitrary state of a qubit. However, curiously, it happens to be true for a qubit, as we will see in the following section.

Example: Minimax measurement for a qubit
We look at the single qubit case as studied in reference [14] wherein the authors obtain such a minimax POVM with relative entropy as the distance-measure. We generalize their results to the squared distance $\|\rho - \sigma\|^2 = \mathrm{tr}(\rho - \sigma)^2$. However, the proof does not follow the generalized treatment in terms of Bregman divergence as done in the previous sections.
To begin with, let us write the most general expression for a POVM on C 2 . Recalling Lemma 3.1, we can write any POVM (see Definition 3.1) as an operator-valued density, i.e. Thus, the most general form of a POVM element on C 2 is

Now that we have obtained the general expression for a POVM on C 2 , we next evaluate the Bayes estimator, which in turn is needed to evaluate the risk R P (ρ θ , ρ π B ). Recalling that the Bayes estimator is given as Recalling the differential form of Born's rule (8), and using the Bloch sphere notation of ρ θ , $\rho_\theta = \frac{1}{2}(I + \theta \cdot \sigma)$, it is a straightforward calculation to obtain

However, to be able to further simplify the Bayes estimator, we need to impose some restrictions on the prior π defined on the parameter space Θ (which in this case is the unit ball in R 3 ). In particular, we choose a uniform prior π * supported only on pure states, i.e. π * (θ) is zero for all vectors θ with $\|\theta\| < 1$ but is uniformly distributed on the set of unit vectors with $\|\theta\| = 1$. It can be verified that such a prior has the following two properties:
1. $E_{\pi^*}[\theta_i] = 0$, ∀i ∈ {x, y, z}.
2. $E_{\pi^*}[\theta_i \theta_j] = \frac{1}{3}\delta_{ij}$, ∀i, j ∈ {x, y, z}.
By property (1) of the prior π * , we have Thus, the Bayes estimator reduces to Now, recalling that the risk is a function of the POVM P, i.e.
we evaluate the risk for both relative entropy and Hilbert-Schmidt distance below.
Lemma 5.1. The risk for relative entropy and Hilbert-Schmidt distance are given as

respectively.
Proof. (a) For relative entropy, see [14, pg. 11]. Thus, This implies that the risk is We can write the above using the property of a general POVM on C 2 , i.e. $\int_X d\mu(\vec{x})\,\vec{x} = 0$, as

Note: Although in reference [14] it is assumed that the POVM P is rank-1, the above expressions hold in general for any POVM P, i.e. the vector x in (32) need not be a unit vector.

Lemma 5.2. For any POVM P, the average risk of the Bayes estimator with respect to the prior π * satisfies the inequalities: for relative entropy and Hilbert-Schmidt distance respectively.
Proof. (a) For relative entropy: From Lemma 5.1 and the properties of the prior π * we get However, since $\sum_j r_j^2 \le 1$, it follows that $\sum_j E_\mu[r_j^2] \le 1$, and we obtain the required inequality:
(b) For Hilbert-Schmidt distance: From Lemma 5.1 and the properties of the prior π * we get Again, as $\sum_j r_j^2 \le 1$, it follows that $\sum_j E_\mu[r_j^2] \le 1$, and we obtain the required inequality:

Lemma 5.3. For any POVM P * that satisfies $E_\mu[r_i r_j] = \frac{1}{3}\delta_{ij}$, the average risk of the Bayes estimator coincides with the worst-case risk:

Proof. (a) For relative entropy: see [14, pg. 11].
Lemma 5.4. The uniform Haar measure on S 2 is a least favourable prior for spherical 2-designs in C 2 .
Proof. Firstly, note that a POVM P * with $E_\mu[r_i r_j] = \frac{1}{3}\delta_{ij}$ is a spherical 2-design. (See Definition F.2 of spherical t-designs. Examples of spherical 2-designs include the SIC-POVM [22] on C 2 as well as the POVM defined through the Pauli measurements.) It can be seen from Lemma 5.1 that the risk is a polynomial function of degree 2 in the variables x, y, z. It is straightforward to see that the average of the typical term x i x j with respect to the Haar measure on S 2 is $\delta_{ij}/3$, so a POVM with $E_\mu[r_i r_j] = \frac{1}{3}\delta_{ij}$ is a spherical 2-design.

Let us now proceed with the proof of the lemma. As the Bayes estimator ρ̂ π B minimizes the average risk with respect to the prior π, However, we just proved in Lemma 5.3 that The other direction of the above inequality holds trivially, i.e.
Thus, π * is a least favourable prior for a spherical 2-design in C 2 .
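The interplay between Lemmas 5.2 to 5.4 can be illustrated numerically for the Hilbert-Schmidt case. The following is a minimal sketch under stated assumptions, not the derivation used in the text: POVM elements are taken to be $w_k (I + r_k \cdot \sigma)$ with equal weights $w_k = 1/n$ and unit vectors $r_k$, the Bayes estimator is taken to be the posterior mean under π * (computed here by quadrature over the sphere), and the loss is $\mathrm{tr}(\rho - \sigma)^2 = \frac{1}{2}\|a - b\|^2$ in terms of Bloch vectors a and b. All function names are hypothetical.

```python
import numpy as np

def sphere_quadrature(n_theta=16, n_phi=32):
    """Unit vectors and weights (summing to 1) that average over S^2."""
    x, wx = np.polynomial.legendre.leggauss(n_theta)   # x = cos(theta)
    phi = 2 * np.pi * np.arange(n_phi) / n_phi
    ct, ph = np.meshgrid(x, phi, indexing="ij")
    st = np.sqrt(1 - ct**2)
    pts = np.stack([st * np.cos(ph), st * np.sin(ph), ct], axis=-1).reshape(-1, 3)
    w = np.repeat(wx / 2, n_phi) / n_phi
    return pts, w

def bayes_bloch(r, pts, w):
    """Posterior-mean Bloch vector for each outcome k under the uniform prior."""
    p = (1 + pts @ r.T) / len(r)                       # p[q, k] = p(k | theta_q)
    return (p * w[:, None]).T @ pts / (w @ p)[:, None]

def hs_risk(theta, r, b):
    """Expected tr(rho_theta - rho_hat_k)^2 over outcomes k."""
    p = (1 + r @ theta) / len(r)
    return float(np.sum(p * 0.5 * np.sum((theta - b) ** 2, axis=1)))

# Two spherical 2-designs, given by Bloch vectors (assumed equal weights).
DESIGNS = {
    "Pauli": np.vstack([np.eye(3), -np.eye(3)]),
    "SIC": np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3),
}

pts, w = sphere_quadrature()
results = {}
for name, r in DESIGNS.items():
    b = bayes_bloch(r, pts, w)
    pure = np.array([hs_risk(t, r, b) for t in pts])          # ||theta|| = 1
    mixed = np.array([hs_risk(0.5 * t, r, b) for t in pts])   # ||theta|| = 1/2
    results[name] = {
        "bayes_is_r_over_3": bool(np.allclose(b, r / 3)),
        "average": float(w @ pure),
        "pure_min": float(pure.min()),
        "pure_max": float(pure.max()),
        "mixed_max": float(mixed.max()),
    }
    print(name, results[name])
```

Under these assumptions a short calculation gives the risk $(7\|\theta\|^2 + 1)/18$ for any such 2-design, which is constant on pure states, equal to 4/9 there, and smaller for mixed states. The average risk with respect to π * therefore coincides with the worst-case risk, in line with Lemma 5.3.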
Theorem 5.1. Any spherical 2-design for C 2 is a minimax POVM.
Proof. Recalling Definition 4.1 of a minimax POVM and the fact that the Bayes estimator minimizes the average risk, we have: Now, Lemma 5.2 and Lemma 5.3 together imply that the average risk with respect to π * is minimized by the POVM P * , i.e.
Also, by Lemma 5.4, π * is a least favourable prior which means that ρ π * B is a minimax estimator. Therefore, we have Thus, we obtain The other direction of the inequality above holds trivially, i.e.
Hence, we have proved that P * , a spherical 2-design, is a minimax POVM, i.e.

Discussion & Future work
To summarize, we extended the work done in reference [14] on minimax analysis to Bregman divergences. Moreover, by re-formulating Holevo's theorem [15, pg. 171] for the covariant state estimation problem in terms of estimators, we found that a covariant POVM is, in fact, minimax with Bregman divergence as the distance-measure. In addition, we found that for a measurement to be minimax it suffices that it be covariant only under a subgroup H of G such that the unitary representation of H forms a unitary 2-design. Finally, in order to understand the problem of finding a minimax POVM for an arbitrary quantum state, we studied the problem for a qubit, observing that a spherical 2-design defines a minimax POVM for a qubit.
In the covariant state estimation problem, we assume that the underlying group G is compact. It is natural to ask if these results can be extended to infinite-dimensional systems, or equivalently, non-compact groups. The natural system that comes to mind when one thinks of an infinite-dimensional system is the set of coherent states of a harmonic oscillator. The underlying group is the translation group T acting on the complex plane, whose projective unitary representation is given by the Weyl-Heisenberg translation operators {D(α) | α ∈ C}. Now, the translation group is non-compact. This means that one cannot define a normalisable invariant measure on the group.

Our derivation of the main result on covariant state estimation, Theorem 4.3, to obtain a minimax measurement uses a Bayesian approach. Recall that a minimax measurement is the one that minimizes the worst-case risk of a minimax estimator. Thus, we are interested in the following expression: The very first step of the proof involves re-writing the supremum over θ as a supremum over the probability distributions on Θ, i.e.
Obviously, we cannot do so in the case of the translation group T that acts on the complex plane. So, our approach will not apply to the most general problem of estimating coherent states generated by the Weyl-Heisenberg translation operators {D(α) | α ∈ C}. Indeed, a more general theorem for the case of locally compact groups [23,24] shows that covariant measurements minimize the worst-case risk (average risk cannot be defined for non-compact groups). However, the formalism considered in [23] does not include estimators. It would be interesting to extend the same to our setting and, moreover, to come up with an appropriate definition of a Bayesian estimator for such cases.

The next obvious extension of this work is to find minimax POVMs for an arbitrary quantum state. It would be interesting to see if some kind of a t-design comes out as a solution. However, this requires a more general approach than mere brute-force calculations, which become tedious in higher dimensions. Moreover, one could also generalize this result to other distance-measures such as fidelity and Rényi divergences. The authors of reference [25] have derived the Bayes estimator for distance-measures based on the Bhattacharyya distance. Partial results [26] are known for fidelity as the distance-measure, but the Bayes estimator remains unknown for a general state with fidelity as the distance-measure. These generalizations are not straightforward either and require a different technique.
The last inequality follows from the non-negativity of Bregman divergence.

We first present the Radon-Nikodym theorem for operator-valued measures [27], without proof, as stated in [15, pg. 167].

E Additional lemma(s)
Lemma E.1. Given a self-adjoint operator A parametrized by u, the derivative of a function of the operator with respect to the parameter at u = u 0 is given by
Proof. See reference [28].

Lemma E.3 ([15], Theorem 2.1). Let P 0 be a positive operator in the representation space such that [P 0 , V g ] = 0 ∀g ∈ G 0 , where G 0 is the stationary subgroup of G, and satisfying: Then setting P (gθ 0 ) = V g P 0 V † g , we get an operator-valued function of θ such that: is a covariant measurement with respect to g → V g . Conversely, for any covariant measurement M (dθ) there is a unique operator P 0 satisfying (38) such that M (B) can be expressed as in (39). P 0 is referred to as the seed of the covariant measurement.

Definition F.1 (Unitary t-design). Consider the set of unitary matrices U(d) on a d-dimensional Hilbert space H. A unitary t-design is a finite subset
such that for all states ρ ∈ S(H) the following holds: where 'dU' is the Haar measure on U(d).