Quantum chi-squared tomography and mutual information testing

,


Introduction
Quantum state tomography -learning a -dimensional quantum state from  copies -is a ubiquitous task in quantum information science.It is the quantum analogue of the classical task of learning a -outcome probability distribution from  samples.
In more detail, the goal is to design an algorithm that, given  ⊗ for some (generally mixed) quantum state  ∈ C × , outputs (the classical description of) an estimate 2̂︀  that is "-close" to  with high probability.The main challenge is to minimize the sample (copy) complexity  as a function of  and  (and sometimes other parameters, such as  = rank ).We will also be concerned with the practical issue of designing algorithms that make only single-copy (as opposed to collective) measurements.
An important aspect in specifying the quantum tomography task is the meaning of "-close"; i.e., what the loss function is for judging the algorithm's estimate.There are many natural ways for measuring the divergence of two quantum states -even more than for two classical probability distributions -and the precise measure chosen can make a great deal of difference both to the necessary sample complexity, as well as to the utility of the final estimate for future applications.
The main goal of this paper is to show a new tomography algorithm that achieves the most stringent notion of accuracy, (Bures)  2 -divergence, while having essentially the same sample complexity as previously known algorithms using infidelity as a loss function.We then given an application, to the quantum mutual information testing problem, which crucially relies on our ability to achieve efficient state tomography with respect to  2 -divergence.
(1) (Here the "≲" ignores small constant factors.)The first of these, ℓ 2  2 -distance, does not have an operational interpretation, but it is by far the easiest to calculate and reason about.The remainder are the "big four" [50, p.26]: total variation (TV) distance controls the advantage in distinguishing  from  with 1 sample; Hellinger-squared controls the number of samples needed to distinguish  from  with high probability; KL divergence has several information-theoretic interpretations; and,  2 -divergence plays a central role in goodness of fit (whether an unknown  is close to a known ).We remark that the first three quantities are bounded in [0, 1], but KL divergence and  2 -divergence may be unbounded.
It is extremely easy to show (see Proposition 2.14) that, given  samples from , the empirical estimate ̂︀  has expected ℓ 2 2 -distance at most 1/ from ; hence  = (1/) samples suffices for highprobability estimation with this loss function.Moreover, Cauchy-Schwarz immediately bounds TV 2 by  times ℓ 2  2 , and hence (/) samples suffice when  denotes TV 2 (and Ω(/) can be proven necessary).But in fact,  = (/) samples suffice even when  denotes the most stringent distance,  2 -divergence.This also follows from a short calculation of the expected  2 -divergence of ̂︀  from  when ̂︀  is the add-one empirical estimator (see Proposition 2.16).
The preceding five distances have natural generalizations for quantum states ,  ∈ C × .The analogous chain of inequalities to eq. ( 1) is not quite true, but we have instead (Frobenius distance) 2 ≲ (trace distance) 2 ≲ infidelity ≲ quantum relative entropy, Bures  2 -divergence.
(2) While both quantum relative entropy and Bures  2 -divergence are bounded from below by the infidelity, neither bounds the other by a constant [45].We remark that using the "measured relative entropy" rather than the "standard" (Umegaki) quantum relative entropy does make the full analogous chain of inequalities hold, turning the comma above into a ≲; however, the measured relative entropy is rarely used in practice.
In the quantum case, there is a very simple empirical estimation algorithm that achieves Frobeniussquared distance  with  = ( 2 /) samples (see Section 3.6); this algorithm has the additional practical merit that copies of  are measured individually and nonadaptively, meaning it uses  POVMs of dimension  that are fixed in advance.Kueng, Rauhut, and Terstiege [27] gave another natural algorithm of this form with a refined rank-based bound: Theorem 1.1.([27,Thm. 2].)There is a state tomography algorithm using nonadaptive single-copy measurements achieving expected Frobenius-squared error (/) on -dimensional states of rank at most .Hence  = (/) samples suffice to get 3 Frobenius-squared accuracy .
Again, Cauchy-Schwarz implies that trace distance-squared is bounded by  times Frobeniussquared, so one immediately concludes that  = ( 2 /) copies suffice for a nonadaptive single-copy measurement algorithm achieving trace distance-squared .
Allowing for adaptive single-copy measurement algorithms (in which the POVM used on the th copy of  may be chosen based on the outcomes of the first  − 1 measurements), it is known that for  = 2 (a single qubit),  = (1/) measurements with one "round" of adaptivity suffice for estimation with infidelity .The idea for this dates back to at least [38], with a proof appearing in, e.g., [8,Eq. 4.17].The case of higher  is discussed in [33], but no complete mathematical analysis seems to appear in the literature.
Remark 1.2.However, prior to completing our work, we were informed by the authors of [13] that they could achieve infidelity  with ̃︀ ( 3 /) single-copy measurements and logarithmically many rounds of adaptivity.
Moving to quantum tomography algorithms that allow for a general collective measurement on all  copies, it would seem that some amount of representation theory is needed to get optimal results (intuitively, because  ⊗ lies in the symmetric subspace).The following two results were shown independently and contemporaneously: Theorem 1.3.([31,Cor. 1.4].)There is state tomography algorithm using collective measurements achieving expected Frobenius-squared error (/) on -dimensional states.Hence  = (/) samples suffice to get Frobenius-squared accuracy .As a corollary of Cauchy-Schwarz,  = (/) samples suffice to get trace distance-squared accuracy .Theorem 1.4.( [21, (14)].)There is a state tomography algorithm using collective measurements on  = (/) • log(/) copies that achieves infidelity .Remark 1.5.Except for the log(/) factor, Theorem 1.4 is stronger than the corollary in Theorem 1.3, since (trace distance) 2 ≲ infidelity.If one wishes to have optimal (1/) dependence on  (no log factor), the best known result is  = ( 2 /) using very sophisticated representation theory [32].
On the other hand, if one wishes to have optimal () dependence (no log factor), prior to the present work the best result was (/ 2 ), following from Theorem 1.3 and infidelity ≲ (trace distance) 2 .
Turning to lower bounds, Haah-Harrow-Ji-Wu-Yu [21] showed that for collective measurements, Ω( 2 /) samples are necessary for trace distance-squared tomography in the full-rank case, and Ω(   log(/) ) are necessary in the general rank- case; Yuen [51] recently removed the log factor in case  stands for infidelity.As for single-copy measurement algorithms, [21] showed (improving on [17]) that for nonadaptive algorithms, Ω(  2   2 log(1/) ) copies are needed for infidelity-tomography, and Ω( 3 /) copies are needed for trace distance-squared tomography in the full-rank case.This latter bound was also very recently established [13] even in the adaptive single-copy case.

Our results
A major question left open by the preceding results is whether quantum state tomography with ̃︀ (1/) dependence is possible for a notion of accuracy more stringent than that of infidelity, such as quantum relative entropy or  2 -divergence.Although efficient learning with respect to these more stringent measures is known to be possible in the classical case, we are not aware of any previous provable results along these lines in the quantum case.Indeed, these divergences seem fundamentally more difficult to handle, not being bounded in [0, 1], and prior works seemed to suggest that negative results might hold for them.
Prior authors have considered tomography with respect to these stronger error notions.For example, Ferrie and Blume-Kohout [17] investigated qubit tomography with respect to quantum relative entropy, and Ref. [34] uses  2 hypothesis testing to study tomography of (Choi states of) quantum channels.A further motivation comes from the work of Blume-Kohout and Hayden [11], who showed that the quantum relative entropy is singled out as the unique loss function for quantum tomography once certain plausible and general desiderata of an estimator are specified.
Our main motivation, which we return to in Section 4, is a property test for zero quantum mutual information.For this application, our argument requires us to do quantum state tomography with respect to Bures  2 -divergence, as only then can we use the quantum " 2 -vs.-H 2 identity tester" from Ref. [7].
For these two stronger error notions, we essentially show that the strongest upper bounds that one could possibly hope for indeed hold.Our main theorem is the following: Theorem 1.6.Suppose there exists a tomography algorithm  that obtains expected Frobenius-squared error at most  (, )/ when given  copies of a quantum state  ∈ C × of rank at most .Then it may be transformed into a tomography algorithm  ′ that, given , , and of , outputs (with probability at least .99) the classical description of a state ̂︀  having Note that in the collective-measurement case, Corollary 1.8 matches (up to a logarithmic factor) the ̃︀ (/) bound known previously only for infidelity-tomography, and Corollary 1.7 also matches it in the high-rank  = Θ() case.As for Corollaries 1.9 and 1.10, independent and contemporaneous work [14] showed a weaker version of Corollary 1.10 with infidelity accuracy in place of relative entropy.
Remark 1.11.Although one would wish to achieve ̃︀ (/) scaling for  2 -tomography, we later discuss in Remark 3.17 why it seems hard to achieve dependence better than ̃︀ ( 1.5 /) even in the pure  = 1 case.Remark 1.12.In the case of  = 2 (a qubit), we remove all log factors and show that  = (1/) single-copy measurements with one round of adaptivity suffice for tomography with respect to  2divergence.This simple algorithm, which illustrates the very basic idea of our Theorem 1.6, is given in Section 3.1.
Remark 1.13.Although we have suppressed polylog factors (at most quadratic) with our ̃︀ (•) notation, for the case of tomography with respect to infidelity our polylog factors are actually better than previously known in some regimes.As an example, for collective measurements we have an infidelity algorithm with complexity  = ̃︀ (   log 2 (1/) log log(1/)), which improves on the ̃︀ (   log(/)) bound from [21] (and the ̃︀ (   2 ) bound following from [31]) whenever  is "large"; specifically, for  ≥ exp Finally, in Section 4 we apply our  2 -divergence tomography algorithm to the task of testing for zero quantum mutual information.In this problem, the tester gets access to  copies of a bipartite quantum state  on C  ⊗C  where || = || = .The task is to accept (with probability at least 2/3) if the mutual information ( : )  is zero (meaning  =   ⊗   is a product state), and to reject (with probability at least 2/3) if ( : )  ≥ .We show: Theorem 1.14.Testing for zero quantum mutual information can be done with  = ̃︀ (1/) • ( 2 +  1.5 +  .5  1.75 ) samples when   ,   have rank at most  ≤ .
Remark 1.15.The above bound is no worse than ̃︀ ( 2.5 /), and is ̃︀ ( 2 /) whenever  ≤ √ .One should also recall the total dimension of  is  2 .Remark 1.16.Harrow and Montanaro [23] have considered a related "product tester" problem in the special case where the input is a pure state |⟩.Whenever the maximum overlap ⟨||⟩ with any product state  is 1 − , the test passes with probability 1 − Θ() using only two copies of |⟩.By itself however, this bound does not test quantum mutual information in the above sense, even for the rank-1 case.
Remark 1.17.An important feature of our result is its (near-)linear scaling in 1/.This is despite the fact that estimating mutual information to ± accuracy requires Ω(1/ 2 ) samples, even for  = 2 and even for the classical case.
Our proof of Theorem 1.14 has two steps.First, we learn an estimate ̂︀   ⊗ ̂︀   of the marginals   ⊗   that has small  2 -divergence.Then second we use the " 2 -vs.-infidelity" state certification algorithm from [7] to test whether the unknown state  is close to the "known" state ̂︀   ⊗ ̂︀   .The second step requires us to relate infidelity to relative entropy (and hence mutual information); but more crucial is that in the first step, we must be able to do state tomography with Bures  2 -divergence as the loss measure.Thus we have an example where  2 -tomography is not just done for its own sake, but is necessary for a subsequent application.
Incidentally, we also show that the same two-step process works well for the problem of testing zero classical mutual information given samples from a probability distribution  on [] × []: Theorem 1.18.Testing for zero classical mutual information can be done with  = ((/) • log(/)) samples.
This actually improves on the best known previous algorithm, due to Bhattacharyya-Gayen-Price-Vinodchandran [10], by a factor of  log .
2 Basic results on distances and divergences
We now recall some distances between probability distributions.
Definition 2.5.For  ∈ [0, ∞], the associated Rényi divergence is defined by We will use a few particular cases: Definition 2.6.The total variation distance, a metric, is the  -divergence with  () = 1 2 | − 1|: Definition 2.7.The Hellinger distance d H (, ), a metric, is the square-root of the  -divergence with It is also essentially a Rényi divergence.More precisely, the Bhattacharyya coefficient between  and  is and we have d Definition 2.9.The  2 -divergence is the  -divergence with  () = ( − 1) 2 : We will sometimes use the first formula even when  and/or  do not sum to 1.
Definition 2.10.The max-relative entropy (or worst-case regret) is defined to be The following chain of inequalities is well known (see, e.g., [20]): Some of the inequalities in the above can be slightly sharpened; e.g., one also has d TV (, ) ≤ √︁ 1 2 d KL ( ‖ ), usually called Pinsker's inequality.Perhaps less well known is the following "reverse" form of Pinsker's inequality: Moreover, it is possible to strengthen the above by putting Hellinger-squared in place of total variation distance.These facts were proven in [39]; for the convenience of the reader, we provide a streamlined proof of the following: Proposition 2.12.For ,  probability distributions on [] we have Proof.Let us write   =   /  .Defining the elementary Lemma 2.13 proven below shows that It follows that Lemma 2.13.Inequality (17) holds.
Proof.Consider () := ℎ()() −  ().This function is continuous and piecewise differentiable on  ≥ 0 with an exceptional point at  = 1.We will first show that () is nonnegative and increasing on  ≥ 1. Clearly (1) = 0, so we only need to show that  ′ () ≥ 0. For  ≥ 1, by the integral definition of the logarithm and the Cauchy-Schwarz inequality we have Calculating the derivative of () and using the above inequality for the logarithm, we have that For the case 0 ≤  ≤ 1, we change variables to  = 1/ and define () := ℎ(1/)(1/) −  (1/) for  ≥ 1.We have (1) = 0, and using again the logarithm inequality we find We remark that Inequality (17) can be strengthened to ℎ() = 2 + ln (︀ (2 + )/3 )︀ , but as this does not change the scaling of any of our results, we will not use this stronger inequality or present our (annoyingly complicated) proof.
Finally, we mention the ℓ 2  2 -distance between probability distributions, ‖ − ‖ 2 2 = ∑︀  (  −   ) 2 .Though it does not have an operational meaning, the simplicity of computing it makes it a useful tool when analyzing other distances.For example, the  2 -divergence is a kind of "weighted" version of ℓ 2 2 -distance, in which the error term (  −   ) 2 is weighted by 1/  .We record here basic facts about estimation with respect to these distance measures.Proof.Items (b) and (c) follow from standard Chernoff bounds.As for Item (a), it follows from the known high-probability bound for empirically learning a distribution with respect to ℓ 2 2 -error; see, e.g., [35].We remark that it is important to use this latter result, as opposed to the generic "medianof-(log(1/))-estimates" method; if we used the latter, it would be unclear how to simultaneously achieve Items (b) and (c) Proposition 2.16.Fix a subset  ⊆ [] of cardinality .Given  samples from an unknown distribution  on [], let  be the estimator formed by using the add-one estimator on elements from , and the empirical estimator on the remaining elements.(Note that  is itself a probability distribution.)Then Dropping the term above involving (1 −   ) +1 , and then summing over  ∈ , yields Inequality (23).As for the "moreover" statement, a Chernoff bound tells us that  is within a 2-factor of   except with probability at most /, using   ≥  log(/).When this occurs,   is at least   /4 (using  ≥ ) and at most 2+1  ≤ 3  (using  ≥ 1), so the proof is complete by a union bound over  ∈ .

Quantum distances and divergences
The analogous theory of distances and divergences between quantum states is quite rich [44,26], as there are multiple quantum generalizations of both  -divergences and Rényi divergences.To distinguish between the quantum and classical cases, we use an upper-case  for quantum divergences and a lowercase  for classical divergences.
Throughout this section, let ,  ∈ C × be (mixed) quantum states.Definition 2.17.Given an  -divergence d  (• ‖ •), the associated measured (aka minimal) quantum  -divergence [26] is Remark 2.18.All measured  -divergences satisfy the (quantum) data processing inequality.This fact follows from the definition and a reduction to the classical case.
Definition 2.19.For  ∈ [0, ∞], the associated conventional quantum Rényi divergence [36] is defined by Let us also describe a further relationship between classical and quantum Rényi entropies.To do so let us introduce the following notation: Definition 2.20.Given the spectral decompositions we define two probability distributions   ,   on [] × [], as follows: We now give a simple calculation that allows us to compute a quantum Rényi divergence from an associated classical probability distribution.This calculation has appeared in the literature as early as [30,Thm. 2.2]; see [6,Prop. 1] for an explicit statement.For convenience, we repeat the calculation here.
We now define some particular quantum distances/divergences: Definition 2.22.The trace distance, a metric, is the measured  -divergence associated to total variation distance [25]: Definition 2.23.The Bures distance D B (, ), a metric, is the square-root of the measured divergence associated to Hellinger-squared [19].It has the formula where F(, ) = ‖ √  √ ‖ 1 is the fidelity between  and  (in the "square root" convention).The infidelity between  and  is simply

𝜌, 𝜎).
There is a close analogy between the quantum fidelity and the classical Bhattacharrya coefficient, and indeed the analogue of Equation ( 9) holds if one uses the "sandwiched Rényi entropy" [28,48].Using instead the conventional Rényi entropy yields a slightly different notion: Definition 2.24.The quantum Hellinger affinity is defined by and the quantum Hellinger distance D H (, ), a metric, is defined by the last equality using Proposition 2.21.(Note also the useful tensorization identity, A( Fortunately, the preceding two distances differ by only a small constant factor: The left inequality in Fact 2.25 is from A(, ) ≤ F(, ); the right inequality follows from [6, Eq. ( 32)].
Definition 2.26.The Bures  2 -divergence of  from  is the measured  -divergence associated to classical  2 -divergence [12,42].It can also be given the following formula when  = diag( 1 , . . .,   ) is diagonal of full rank (and this suffices to define it for general full-rank , since it is unitarily invariant): We will use this formula even when  1 , . . .,   ≥ 0 do not sum to 1.
Similar to the connection between ℓ2 2 -distance and  2 -divergence in the classical case, the Bures  2 -divergence can be seen as a kind of "weighted" version of the Frobenius-squared distance, in which the error term |  | 2 is weighted by 2 + = Θ( 1 max{, } ).Indeed, we will frequently consider applying Equation (33) when the   's form (or approximately form) a nondecreasing sequence, meaning that (we expect)   ≤   .In this case it is reasonable to use   +   ≥   , which motivates the following simple bound: Definition 2.27.In the notation from Definition 2.26, we define and, for  = [ ′ ] (for  ′ ≤ ) we define Definition 2.28.The quantum relative entropy [46] is defined by the last equality using Proposition 2.21.Also, if  is a "bipartite" quantum state on  ⊗ , where  ∼ =  ∼ = C  , and if   ,   denote its marginals (obtained by tracing out the ,  components, respectively), the quantum mutual information of  is defined to be Fact 2.29.The conventional quantum ∞-Rényi divergence (discussed in, e.g., [3]) is, by Proposition 2.21, Remark 2.30.This quantity is not the same as the "quantum max-relative entropy" defined in [16]; it would be if one replaced the conventional Rényi entropy with its sandwiched form.
Relating some of these divergences is the following chain of inequalities: The first inequality above is from from [6, Thm.2].The second follows from the classical case [19].The third also follows from the classical case and the observation that the "measured" quantum relative entropy is at most S( ‖ ) (see, e.g.[9, App.A]).The fourth also follows from the classical case, using that Bures  2 is the measured form of classical  2 [12,42].As with Proposition 2.11, some of these inequalities can be sharpened slightly; for example we have the quantum Pinsker inequality D tr (, ) ≤ √︁

Quantum Tomography with Quantum Relative Entropy Loss
One of our main results follows easily from the above discussion of divergences.The idea is to improve on certain "reverse quantum Pinsker" results which have been studied previously; see, e.g., [5] for a quantum generalization of the reverse-Pinsker Inequality (14).We will use the following strengthened version with quantum Hellinger-squared in place of trace distance: Proof.This is immediate from S( ‖ ) = d KL (  ‖   ), Proposition 2.12, and Fact 2.29.
Despite following directly from known results (up to constant factors), the above theorem does not seem to have appeared previously in the literature.Our next result shows that this can be used to automatically upgrade any quantum tomography algorithm with an infidelity guarantee to one with a relative entropy guarantee, at the expense of only a log factor (cf. our main Theorem 1.6 upgrading Frobenius-squared-tomography to  2 -tomography).Notation 2.33.We write Δ  for the completely depolarizing channel, which for 0 ≤  ≤ 1 acts on 3 Quantum state tomography We give a guide to this section: • Section 3.1 gives a simple  2 -tomography algorithm for qubits; it achieves copy complexity  = (1/) (no logs) using single-copy measurements with one round of adaptivity.It serves as a small warmup for our main algorithm.
• Section 3.2 begins the main exposition of our reduction from Frobenius-squared-tomography to  2 -tomography.This section shows how to give several useful black-box "upgrades" to any Frobenius-squared estimator.
• In Section 3.3 we give a high-level sketch of the central estimation routine for our main theorem, which takes Frobenius-squared-tomography and turns it into a  2 -tomography algorithm "except for very small eigenvalues".
• The most involved Section 3.4 follows; it fills in all the technical details for the preceding sketch.
• Section 3.5 shows how to take the newly-established central estimation routine and massage its output to achieve either good  2 -accuracy (with one set of parameters) or good relative entropy accuracy (with another set of parameters).It is in this section that we establish all the theorems and corollaries from Section 1.2.
• Finally, for the convenience of the reader, Section 3.6 gives a simple Frobenius-squared-tomography algorithm using single-copy measurements with complexity  = ( 2 /).

Qubit tomography with single-copy measurements
As mentioned in Section 1.
we see that all of the "good" points appear consecutively.(That is, "reading left-to-right", the ̂︀  1 values consist of some "bad" points, followed by some "good" points, followed by some "bad" points.)The reason for this is that d  2  Proof.The first phase of the algorithm (using 3/4 copies) can employ any standard single-copy quantum state tomography routine; the specific one we describe in Section 3.6 has the stated Pauli format, and (using Proposition 2.15) will return a PSD matrix  ′ (not necessarily a state) satisfying except with probability at most /2.Next, the algorithm employs a change of basis so as to make  ′ diagonal.It suffices to estimate  in this new basis.Since Frobenius-distance is unitarily invariant, in the new basis Inequality (41) implies As for the diagonal entries  = ( 00 ,  Theorem 3.6.There is an estimation algorithm using single-copy measurements with Frobeniussquared rate () on -dimensional states of rank at most .
Finally, Proposition 3.25 gives a simple single-copy measurement algorithm that has Frobeniussquared rate ( 2 ) (matching matching Theorem 3.6 in the high-rank case).
We will now successively describe several black-box "upgrades" one may make to a Frobeniussquared estimation algorithm.All of these will have the feature that they preserve the single-copy measurement property.Our ultimate goal will be to upgrade to closeness guarantees with respect to much stronger distance measures, with minimal loss in rate.To illustrate the idea, we start with a very simple upgrade (that most natural algorithms are unlikely to need): Proposition 3.7.A Frobenius-squared estimation algorithm may be transformed to one that always outputs Hermitian estimates, with no loss in rate.
Proof.Given algorithm , let  ′ be the algorithm that on input  runs , producing ̂︀ , and then outputs ̂︀   := (̂︀ +̂︀  † )/2, so that ̂︀   := (̂︀ −̂︀  † )/2 and ̂︀  = ̂︀   +̂︀   .The Hermitian matrices are a real vector space, so by picking a (Hilbert-Schmidt) orthogonal basis (for example, the generalized Pauli matrices), is it easy to verify that for Hermitian , we always have F .The claim then follows by taking expectations.
The next upgrade is not a change in algorithm, but rather in terminology.
Definition 3.8.Say that an estimation algorithm with Frobenius-squared rate  returns diagonal estimates if, when run on  ∈ C × , it returns a unitary  and a (real) diagonal matrix ̂︀ Given such an algorithm, we can get an Frobenius-squared estimator with rate  for  just by returning  †  ̂︀  .But we will prefer the interpretation that the algorithm is allowed to "revise"  to state   † (with  of its choosing), and then try to estimate this new state.Proposition 3.9.A Frobenius-squared estimation algorithm may be transformed to one that returns diagonal estimates, with no loss in rate.
Proof.First we transform the algorithm to output Hermitian estimates, using Proposition 3.7.Then, given output ̂︀ , the algorithm simply chooses a unitary  such that ̂︀  =  diag() † with  1 ≤ • • • ≤   , and returns the unitary  along with diagonal estimate diag().The proof is complete because Frobenius-squared distance is unitarily invariant.Proposition 3.10.With only constant-factor rate loss, a Frobenius-squared estimation algorithm may be transformed to one that outputs diagonal estimates ̂︀  = diag() that are genuine quantum states, meaning that  is a probability vector.
Proof.First we apply Proposition 3.9, obtaining algorithm  ′ with diagonal estimates and rate  .Now our transformed algorithm, when given  copies of , will start by running  ′ on the first /2 copies (we may assume  is even), yielding a diagonal estimate  ′ (and, to be formal, a unitary  which should be used to conjugate the remaining copies of ).Say that ‖ ′ − ‖ 2 F = , and recall that E[] ≤ 2 ()/.The next step of the algorithm is to use single-copy standard basis measurements with the remaining /2 copies of  to make a new estimate  of the diagonal of .Applying Proposition 2.14, the empirical estimator  is a genuine probability distribution, and the algorithm will finally output ̂︀  = diag().(Actually, since  might not have nondecreasing entries, we should finally "revise" by a permutation matrix.)The Frobenius-squared error of ̂︀  is its off-diagonal Frobenius-squared error plus its diagonal Frobenius-squared error; the former is at most  and the latter is, in expectation, at most 2/ by Proposition 2.14.Since E[] ≤ 2 ()/, the total expected Frobenius-squared error is at most 2 ()/ + 2/ = ( ())/, as needed.
We will also need a high-probability version of the preceding result, with some extra properties.The reader should recall the   notation from Proposition 2.15.
Proposition 3.11.The algorithm from Proposition 3.10 may be modified so that, given 0 <  < 1/2, its output satisfies each of the following statements except with probability at most  (for any fixed  ∈ []): Proof.The first statement may be obtained in a black-box way using the "median trick", which upgrades estimation-in-expectation to estimation-with-confidence-(1 − ) at the expense of only an (log(1/)) sample complexity factor.This trick may be applied whenever the loss measure is a metric (as Frobenius distance is); see, e.g., [22,Prop. 2.4] for details.It is sufficient to prove this statement with the (•), because we may then remove it by raising  in the   notation.(Similarly, we may tolerate achieving 2 failure probabilities, rather than .)To get the other two conclusions, we need to re-estimate the diagonal of , just as we did in Proposition 3.10.For this we use Proposition 2.15.As in Proposition 3.10, this re-estimation contributes some new on-diagonal Frobenius-squared distance, but only at most 1/  ≤  /  ; thus the proposition's first statement remains okay.The remaining statements follow from Proposition 2.15 by taking its "" to be {}.Now we come to a most important reduction: being able to estimate subnormalized states.Let us define terms, and make the simplifying assumption that rate functions for proper states only depend on dimension and rank, and that they are nondecreasing functions of these parameters.We also assume for simplicity that our subnormalized states arise just from submatrices, but they could just as well arise from any given projector Π. Definition 3.12.We say a subnormalized state estimation algorithm  has Frobenius-squared rate  (, ,  ) if the following holds: Whenever  is given a subset  ⊆ [], as well as  ⊗ for some quantum state  ∈ Remark 3.13.In the above definition, we may also include the condition of "returning diagonal estimates" as in Definition 3.8, with the returned unitary  being in C × .Moreover, for linguistic simplicity we will henceforth assume that "diagonal estimates" are also required to have nonnegative (diagonal) entries.Remark 3.14.Our subnormalized state estimation algorithms will actually achieve improved rate  ( ′ ,  ′ ,  ), where  ′ = || ≤  and  ′ = rank [] ≤ , but we will not try to squeeze anything out of this, for simplicity.Proof.We first apply Proposition 3.10 so that  may be assumed to output diagonal, genuine quantum states.This only changes bounds by constant factors on , to which the statement of this proposition is anyway insensitive.
Given  ⊆ [] and  ⊗ , let us write  = tr [] and also introduce the quantum state  | = []/ (when  > 0).The first step of the new algorithm  ′ is to measure each copy of  using the twooutcome PVM (1  , 1 []∖ ).It retains all copies that have outcome  and discards the rest.In this way,  ′ obtains ( | ) ⊗ ′ , where  ′ ∼ Binomial(,  ).If  ′ = 0 then the algorithm will return the 0 matrix.Otherwise, if  ′ ̸ = 0 the algorithm applies  to  | and obtains an estimate ̂︀  | with expected Frobenius-squared error at most  ( ′ ,  ′ )/ ′ ≤  (, )/ ′ , where The final estimate that  ′ produces for [] will be ̂︀ [] := ( ′ /)̂︀  | ; indeed, we can use this expression even in the  ′ = 0 case.We now have We write and use By assumption on , for  ′ > 0 we have and this is also true even for  ′ = 0 (recall we always assume  ≥ 1).Using the elementary fact As for ]︀ , let us first observe that conditioned on any  ′ =  (including  = 0), we have ‖̂︀  | ‖ F ≤ 1 with certainty, simply because  always outputs a genuine quantum state.Thus Combining all of the above (and using  ≥ 1 again), we conclude Proof.Since the definition of   anyway contains an unspecified constant , it is sufficient to prove the proposition with constant losses on various bounds (and then raise 's value to compensate).In particular, for notational simplicity we assume that we get 2 rather than  copies of .
The algorithm begins by using the first  copies of  to obtain ( | ) ⊗ ′ as in Proposition 3.15; this is done just to get  ′ .The algorithm's output ̂︀  is  ′ /, and the proposition's conclusion Item (i) follows straightforwardly from Chernoff bounds (assuming  is sufficiently large).Similarly, the  -vs.-̂︀  part of Item (iii) follows from a Chernoff bound, and we will actually ensure 1.01-factor closeness for later convenience.
If ̂︀  ≤ 1.1/  , then the algorithm runs Proposition 3.15 on the second  copies of , outputting the result.The conclusion in Item (ii) then holds except with probability at most .0001,by applying Markov's inequality to Proposition 3.15's guarantee.
We now describe how the remainder of the algorithm proceeds, when ̂︀  ≥ 1.1/  .Note that since it only remains to prove Items (iii) to (vi), we may as well assume  ≥ 1/  .The algorithm proceeds similarly to Proposition 3.15, using the second  copies of  to get ( | ) ⊗ ′ for a new value of  ′ .Since we are now assuming  ≥ 1/  , a Chernoff bound implies that except with probability at most  we'll have  49) is within a 1.02-factor of  , thereby completing the proof of Item (iii) (recall that  and ̂︀  are within a 1.01-factor).Next, we verify Item (iv) up to a constant factor (as is sufficient).Following Equations ( 43) to (45) (but without expectations), we have But Equation (49) (and using  ≥ 1) we can bound the above by 2.03 •  (, )/  , establishing Item (iv).
To show Item (v), let  denote the set of all  ∈  with   ≥ .Since  ≥  /(100) = (tr [])/(100), we know that || ≤ 100.Moreover, for any  ∈  we may use (employing Equation ( 49)).(We weakened / to / just to illustrate this is all we need for Item (v).)So by using the second bullet point of Proposition 3.11 in a union bound over the at most () indices in , we conclude that (except with probability at most ()) for all  ∈  it holds that (̂︀  | )  is within a 1.01-factor of ( | )  , and hence (by Equation ( 49)) ̂︀   is within a 1.1-factor of   .This completes the verification of Item (v).
Finally, verifying Item (vi) is similar; for simplicity, we just union-bound over all  ∈  ⊆ [], using the fact that  ≥ 1/ / .

The plan for learning in 𝜒 2 : refining diagonal estimates on submatrices
Suppose we have come up with a diagonal estimate  1 of  ∈ C × having some Frobenius-squared distance  1 = ‖ −  1 ‖ 2 F .(Here we will have "revised" some original  by the unitary that makes  1 diagonal; this revision will be taken into account in all future uses of .) Suppose we now choose some  2 ≤  1 := , define  2 to be the top-left  2 ×  2 submatrix of , and apply Proposition 3.16 to it.The idea is that we hope to improve the top-left part of our estimate  1 .
Recall that Proposition 3.16 affords us a diagonal estimate  2 ∈ C 2×2 ; let us understand a little more carefully what this means.The algorithm will give us a unitary  2 ∈ C 2×2 such that for some small value  2 .The idea now is to "revise" both  1 :=  and  1 by the unitary  2 ⊕ 1, where here 1 has dimension  1 −  2 .By design, the revised version of  2 will have Frobenius-squared distance  2 from  2 .Moreover, after revision, the fact that ‖ 1 −  1 ‖ 2 F =  1 is unchanged (since Frobenius distance is unitarily invariant).On the other hand, although  1 was previously diagonal, it no longer will be after revision.But it's easy to see that it will remain diagonal except on its top-left  2 ×  2 block, which we are intending to replace by  2 anyway.In particular, the off-diagonal  2 × ( 1 −  2 ) and ( 1 −  2 ) ×  2 blocks of  1 remain zero.
Let us summarize.We will first obtain a diagonal estimate  1 of  1 with some error  1 .Then after choosing some  2 ∈ [ 1 ], we will obtain a further diagonal estimate  2 of the top-left  2 ×  2 block of , with some error  2 .We might then take as final estimate ̂︀  the diagonal matrix formed by replacing the top-left  2 ×  2 block of  1 by  2 .
Naturally, this plan can be iterated (meaning we can try to improve the estimate's top-left  3 ×  3 block for some  3 ∈ [ 2 ]) but let us pause here to discuss error.If we're interested in the Frobeniussquared error of our current estimate ̂︀ , we can't say more than that it is bounded by  1 +  2 .Here we're decomposing the error into the contribution from the top-left  2 ×  2 block (which is  2 ) plus the contribution from the remaining ⌟ ⌟ ⌟ -shaped region (consisting of the bottom-right ( 1 −  2 ) × ( 1 −  2 ) block plus the two off-diagonal blocks).We will just bound this second error contribution by the whole Frobenius-squared distance of  1 from  1 , which is  1 .
It would seem that this scheme of refining our estimate for the top-left block hasn't helped, since it took us from Frobenius-squared error  1 to Frobenius-squared error (at most)  1 +  2 .But the idea is that our new estimate ̂︀  may have improved Bures  2 -divergence.Recall the formula for  2divergence, Equation (33) (which we will apply even though ̂︀  might not precisely be a state, meaning of trace 1).Recall also that our diagonal estimates  1 = diag( (1) ) and  2 = diag( (2) ) are chosen to have nondecreasing entries along the diagonal.(We moreover expect that ̂︀  will also have nondecreasing entries, meaning (2) 2+1 , but we won't rely on this.)Now we can use the bound The idea here is that if, perhaps 2 ≈ (tr  2 )/; and (1) then hopefully from Proposition 3.15 with  copies we will have  1 ≈ ( Unfortunately, we will have to deal separately with any extremely small eigenvalues of , which causes some additional losses. Remark 3.17.Ideally this plan suggests we might be able to achieve sample complexity  = ̃︀ (/) for tomography with respect to Bures  2 -divergence (for collective measurements).But the "small eigenvalue" issue causes problems for this.Without explicitly claiming a lower bound, let us sketch why it seems difficult to significantly beat the  = ̃︀ ( .5  1.5 /) complexity from Corollary 1.7, even in the case  = 1.

The central estimation algorithm
Proof.Fix a Frobenius-squared estimation algorithm  with rate  =  (, ), and assume we have passed it through Proposition 3.16 so that we may use it to make diagonal estimates of subnormalized states.
The algorithm  ′ will run in some ℓ stages, where we guarantee ℓ ≤ ℓ max .Each stage will consume  copies of .After the ℓth stage, there will be some final processing that uses the remaining /2 (at least) copies of .
As the algorithm progresses, it will define a sequence of numbers  =  1 ≥  2 ≥ • • • ≥  ℓ , with the value  +1 being selected at the end of the th stage.We introduce the notation   = { +1 +1, . . .,   }; each of these sets will have cardinality at most .
At the beginning of the th stage,  ′ will run the algorithm from Proposition 3.16 on [  ], with confidence parameter , resulting in some ̂︀   and a diagonal estimate that we will call   .We will use the fact that  always satisfies all of the following (provided  is large enough and using  ≥ log ): By losing probability at most 5 in each stage, we may assume that except with probability at most .0006,all of the desired outcomes from Proposition 3.16 do occur over the course of the algorithm.
If ̂︀   ≤ 1.1ε or  > , then this is declared the final stage; i.e., the algorithm will define ℓ =  and move to its "final processing".Otherwise, in a non-final stage we have ̂︀  > 1.1ε ≥ 1.1/  (using Inequality (55)), so by Item (i) of Proposition 3. 16  Finally, we record the main conclusions Items (ii) and (iv) of Proposition 3.16, taking care to distinguish the final stage: Now we explain how algorithm  ′ defines  +1 at the end of non-final stage , where recall nonfinality implies from Inequality (56) Considering the first bound in Inequality (59), note that rank [  ] ≤ , so we have that the diagonal matrix   has Frobenius-squared distance at most    /  from a matrix of rank at most .But the rank-at-most- matrix that is Frobenius-squared-closest to   is simply  ′  , the matrix formed by zeroing out all but the  largest entries of   .Recalling that  ′  has nondecreasing diagonal entries, this means  ′  is formed by zeroing out all diagonal entries of index at most  ′ +1 := max{  − , 0}.Thus we have
Aside from establishing the above, it remains to describe how algorithm  ′ forms ̃︀  satisfying the theorem's conclusion Item (d).We first describe a candidate output we'll call  ′ that almost works: namely,  ′ is formed by setting its diagonal elements from   to be those from   , for  < ℓ.(The remaining diagonal entries may be set to 0.) The difficulty with this is that it's not easy to control tr  ′ , but let us ignore this issue and calculate  2 -divergence. 5Ignoring the fact that we are not working with normalized states, we may bound Each summand above can be upper-bounded using Inequalities (59) and (66), yielding We now work to control the trace of our estimate.Our strategy is to have  ′ perform diagonal measurements on the remaining /2 copies of  to classically relearn its diagonal via Proposition 2.16, with its "" set to .Calling the resulting probability distribution , the algorithm will finally take First we complete the verification by Item (c) by establishing the condition in Inequality (70): since [] is formed by the empirical estimator, Markov's inequality and Proposition 2.14 imply that except with probability at most .0001we have ‖diag( , and we have a factor of  ℓ max to spare.
Next, using Markov again with Proposition 2.16 we get that except with probability at most .0001, Also, using  ≥ log  (and  large enough), we indeed have 1/(/2)  ′ ≤ ε/(100) for  ′ = .0001/||≥ .0001/(ℓmax ); since also   ≥ ε/(100) for all  ∈  (recall Inequality (64)), we conclude   is within a 4-factor of   for all  ∈ , ( except with probability at most .0001.Finally, it is easy to see that in Proposition 2. 16  Finally we finish the analysis of ̂︀ D −  2 ( ‖ ̃︀ ).The contribution to this quantity from the diagonal entries is precisely Inequality (73).On the other hand, since  ′ and ̃︀  are both diagonal, the offdiagonal contribution to ̂︀ D −  2 ( ‖ ̃︀ ) can be bounded by a constant times Inequality (72), using the fact that the diagonal entries (from ) of  ′ and ̃︀  are all within a constant factor by virtue of Inequalities (68) and (74).This completes the verification of Item (d).
We also show the following, to improve some log(1/) factors in the case that  is extremely small and  = Θ().(The reader might like to think of the case when  = (1).)

Theorem 3.19.
There is a variant version of  ′ from Theorem 3.18 with the following alternative parameter settings: Proof.Besides verifying that Inequality (55) still holds with our changed  and ℓ max , there is one alternative idea to be explained.In the preceding proof, the driver of progress was Inequality (69) showing  +1 < 1 2   ; this enabled us to take ℓ max logarithmic in 1/ε.In this variant, we will only use this inequality weakly, to show that |  | ≥ 1 so that  +1 <   strictly; this is already enough to ensure that taking ℓ max =  + 1 is acceptable.On the other hand, if we only implement this change then  would become unnecessarily large (namely, ε( + 1)).
To get the improved value of , we change how  ′ chooses the   values.Returning to Inequality (65), in the th stage there is a set  ′′  of at most  indices  on which each (  )  exceeds  := (1.1) 2 • tr  100 , and their sum  exceeds .66 ≥ .6(tr  ).Then  ′ chooses   to consist of all indices  ∈  ′  with (  )  ≥ , of which there are at most ().Note that if we conversely had |  | at least Ω() for every , then the algorithm would halt in at most (/) stages, allowing us to take  = (/)ε rather than εℓ max (a significant improvement when  = Θ()).
The idea is now for  ′ to choose a slightly different   in each round, of cardinality   ≥ 1, so that (  )  ≥ Ω( tr   ln  ).(Note that we need not be with the sum of (  )  on the new   , since we're now only using that |  | ≥ 1 always.)If we can show this is possible, then we can use it as a replacement for Inequality (66) when deriving Inequality (72); we'll then get as claimed.
But the proof that we can choose   as described is elementary.Essentially, the algorithm has a nonincreasing sequence of (at most)  numbers  1 , . . .,   (where   = (  ) +1−,+1− ) whose sum is (at least) .We need to show that for some   it holds that   ≥ Ω( Item (b) tells us that  ,  ′ ≤ (ε) (which may be assumed at most, say, 1 2 ); from this, it is not hard to show that the "rescaling" only makes a constant-factor difference to the off-diagonal  2 -divergence contributions.So to establish Inequality (78), it remains to analyze the effect of rescaling on the on-diagonal  2 -divergence contributions.Writing   = (1 +   )  for some numbers   > 0, the bound on just the diagonal contribution in Inequality (79) is equivalent to Then in the rescaling, when   is replaced by
It remains to obtain the Frobenius-to- 2 transformation promised in Theorem 1.6.This is Corollary 3.24 below, which we achieve in two steps.
where we have split out the "off-diagonal" and "on-diagonal" contributions to D  2 ( ‖ ̂︀ ).(Also, the factors of "" in the above three bounds may be replaced by ||, which is potentially much smaller.) and this is ( 2 + ) ≤ ( + ), as required.
Working out the parameters (just using Theorem 3.18), along the lines of Remark 3.
When we measure  with this POVM, we obtain outcome ({, }, ±) with probability If we similarly define a POVM ( ±  ) but with a factor of i = √ −1 in the off-diagonal elements of Equation (100), we will similarly get outcomes with probabilities   ±   , where   := Im   .We focus on analyzing Equation (101), as the imaginary-part analysis will be identical.
Then E[̂︀   ] =   , and Repeating the analysis for the imaginary parts, we use 2 copies of  to get estimates for all   , {, } ∈  , achieving total expected squared-error Repeating this for all  ∈  uses () copies of  and gives estimates for all off-diagonal elements of , with total expected squared-error ( − 1)/.Finally, we can use standard basis measurements to estimate the diagonal elements of , using Proposition 2.14:  more copies of  suffice to achieve total expected squared-error 1/.log 2 (1/) log log(1/)), so the polylog terms have no dependence on .In case )︀ , one may take  = ( √  ()  log(/) log log(/)).

Testing zero mutual information
We now move on to showing the main application of our  2 tomography algorithm: testing zero quantum mutual information.We will explain below in Section 4.3 how our  2 tomography algorithm is crucial to achieving this result.But first, we introduce and analyze a variant of the quantum mutual information that features in our analysis.

Mutual information versus its Hellinger variant
The goal of this subsection is to prove the below theorem, showing that the standard quantum mutual information is not much larger than the "Hellinger mutual information": We first state a bound on the continuity of mutual information in terms of the trace distance and the subsystem dimension.A bound of the following form can be proven a number of ways, for example by appealing to the Petz-Fannes-Audenaert [4] and Alicki-Fannes [2] inequalities; see [18,Appendix F].The bound we use is an immediate corollary of [40, Prop.1] which gives small explicit constants.
where the constant can be chosen as  = 4 + 4 √ 2.
Proof.Since D H (, ) is a metric, we have ( Therefore the Rényi entropy term is bounded using Fact 2.29 as . (126) First, using Proposition Moving to (ON-OFF), the first term in it factorizes to The first factor above is precisely The second factor in Equation ( 128

Open Problems
One obvious and by now longstanding open question related to our work is learning in infidelity to precision  with (/) samples, without any logarithms.This would settle the sample complexity of tomography with infidelity loss up to constant factors.In light of our work, perhaps we could even ask for more: Given our result that learning in quantum relative entropy is possible with ̃︀ (/) samples, might a similar no-logarithm bound hold here as well?
Our algorithm uses only single-copy measurements, but even these are challenging on present-day quantum computers.A stronger assumption on measurements is to restrict to product measurements, meaning that all POVM elements factorize into tensor products over subsystems.We believe this measurement model will require strictly greater sample complexity for learning in  2 -divergence and for quantum mutual information testing than the single-copy case analyzed here.
Regarding quantum mutual information testing, note that in the classical case we could learn product states to  2 -divergence well enough that the entire testing complexity was dominated by the  2 -vs.-Hellinger identity tester.Unfortunately, in the quantum case we couldn't quite match this.Might it be possible to reduce the complexity of testing zero quantum mutual information down to to ︀ ( 2 /)?
For learning in  2 -divergence, it would be interesting to show that ̃︀ Ω( √  1.5 /) is the right lower bound; currently, we have nothing better than the infidelity-tomography lower bound of ̃︀ Ω(/).As explained in Remark 3.17, though, it seems like reducing the upper bound could be difficult even for the case  = 1.
Although the Bures  2 -divergence is usually the largest of the "big four" quantities considered in this paper, there are other quantum generalizations of  2 -divergence in the literature that are larger still than Bures  2 -divergence (see, e.g., [37,43]).An example is the so-called "standard" quantum  2 -divergence, in which the the arithmetic mean reciprocal-prefactor in Equation ( 33) is replaced by a geometric mean.Similarly, there are also multiple generalizations of the quantum relative entropy besides the "Umegaki" quantum relative entropy S(• ‖ •) studied herein.As explained above, the main reason for us to consider learning with respect to Bures  2 -divergence (as opposed to other metrics) is that it seems necessary for some applications; for example, our quantum mutual information testing problem.It is an interesting open question to study state tomography with respect to other generalizations of relative entropy and  2 -divergence, and in particular to decide if this is possible while still having ̃︀ (1/) scaling.More generally, a very interesting direction is to investigate for which quantum learning and testing tasks we can get away with ̃︀ (1/) samples, and for which we require (say) ̃︀ Ω(1/ 2 ) samples.
and using the notation   from Proposition 2.15, if   ≥ 1/ / for all  ∈ , then except with probability at most  we have that   is within a 4-factor of   simultaneously for all  ∈ .Proof.For  ∈  we have that   is distributed as +1 + , where  ∼ Binomial(,   ).It is elementary to show that the resulting contribution to d  2 with 1 denoting the identity matrix).Let  be a state tomography algorithm that, given  copies of  ∈ C × and param-Applying this theorem with the previously known result of Haah-Harrow-Ji-Wu-Yu, Theorem 1.4, we immediately conclude Corollary 1.8, that there is a state tomography algorithm with respect to quantum relative entropy that has copy complexity  = (/) • log 2 (/) (using collective measurements).Theorem 2.34 is immediate from the following (together with the fact that Hellinger-squared is upper bounded by 4 times infidelity (Fact 2.25)): Suppose ,  ∈ C × are quantum states with D 2 B (, ̂︀ ) ≤  ≤ 1/2.Then letting  ′ = Δ 2 (̂︀ ), we have S( ‖  ′ ) ≤ 16 • (2 + ln(/2)).H (, ) ≤ .Then for  ′ = Δ /2 () we have S( ‖  ′ ) ≤ 4 • (2 + ln(2/)).Proof.Since Δ /2 () has smallest eigenvalue at least /2, we have D Rén ∞ ( ‖  ′ ) ≤ ln(2/) and hence from Theorem 2.32 it suffices to show D 2 H (,  ′ ) ≤ 4.In turn, since D H (•, •) is a metric, by Remark 2.2 it suffices to prove D 2 H (,  ′ ) ≤ .But using Proposition 2.31, we indeed have 1, it has long been known that one can learn a single qubit state  to infidelity  using single-copy measurements on (1/) copies of , combined with one "round" of adaptivity.In this section we give a short proof of the same result but with a stronger conclusion:  accuracy with respect to Bures  2 -divergence.We first repeat Proposition 2.16 in the simpler context of  = 2, and at the same time achieving a concentration bound: There is a simple classical estimation algorithm that, given  = (log(1/)/) samples from an unknown probability distribution  = ( 0 ,  1 ) on {0, 1}, outputs an estimate ̂︀  satisfying d 2 ( ‖ ̂︀ ) ≤  except with probability at most .Proof.As shown in Proposition 2.16, if ̂︀  is the "add-one estimator" formed from  ≥ 4/ samples, then E[d  2 ( ‖ ̂︀ )] ≤ 1 +1 ≤ /4.By Markov's inequality, the estimator is "good", meaning d  2 ( ‖ ̂︀ ) ≤ , except with probability at most 1/4.If we now use  = (log(1/)/) samples to produce (1/) independent such estimators, a Chernoff bound tells us that, except with probability at most , at least a 2/3 fraction of them are "good".If we now associate each of our estimates ̂︀  = (̂︀  0 , ̂︀  1 ) with the point Moreover, the algorithm is simple to implement in the following sense: The first /4 copies of  are separately measured in the Pauli  basis, the next /4 in the Pauli  basis, the next /4 in the Pauli  basis, and the final /4 in a fixed basis determined by the first 3/4 measurement outcomes.
2 ( ‖ ̂︀ ) ≤ /2 except with probability at most /2.The final estimate of  (in the new basis) will be ̂︀  = diag(̂︀ ).Except with probability at most /2 + /2 = , both components of the preceding algorithm produce good estimates.Then using Equation(33)we may decompose D  2 ( ‖ ̂︀ ) into the on-diagonal contribution, which is d  2 ( ‖ ̂︀ ) ≤ /2, and the off-diagonal contribution, which is 2| 01 | 2 + 2| 10 | 2 ≤ /2 (by Inequality (42).This completes the proof.We say a quantum state estimation algorithm  has Frobenius-squared rate  if the following holds: Whenever  is given  ∈ N + as well as  ⊗ for some quantum state  ∈ C × , it outputs a matrix ̂︀  ∈ C × (not necessarily a state) satisfying E[‖ − ̂︀ 11) of , the algorithm measures its remaining /4 copies of  in the diagonal basis and employs Lemma 3.1.For  = (log(1/)/), this produces an estimate ̂︀  satisfying dTheorems 1.1 and 1.3 may be restated as follows:Theorem 3.5.There is an estimation algorithm with Frobenius-squared rate () on -dimensional states.
Proposition 3.16.A state estimation algorithm having Frobenius-squared rate  (, ) may be transformed (preserving the single-copy measurement property) into a subnormalized state estimation algorithm returning diagonal estimates with the following properties: Given parameters , , and  ⊆ [], as well as  ⊗ for some quantum state  ∈ C × of rank at  •  (, )/  , except with probability at most ; (v) simultaneously for all  ∈  with   ≥  := max{ /(100), 1/ / }, we have that ̂︀   is within a 1.1-factor of   , except with probability at most .(vi) simultaneously for all  ∈  with   ≤ , we have that ̂︀ Finally, we use Proposition 3.11 to obtain some high-probability guarantees:  ≤ 1.1, except with probability at most .
11−   in  | and   is replaced by11− ′   in ̂︀ , it is as though   is replaced by1− ′ 1−   = (1 ± (ε))  .Putting that into Inequality (80) shows that the rescaling only changes the on-diagonal  2 -divergence contribution by an additive (ε)From tr [] = 1 −  ≥ 1 − (ε),the above directly yields D 2 B (,  | ) ≤ (ε), and thus Inequality (77) follows from Inequality (81) and D B (•, •) being a metric.Now by working out the parameters (using both Theorems 3.18 and 3.19), we get the below Frobenius-to-infidelity transformation.The further transformation to relative entropy accuracy promised in Theorem 1.6 follows by applying Theorem 2.34.Corollary 3.21.A state estimation algorithm with Frobenius-squared rate  =  (, ) ≫ log  may be transformed (preserving the single-copy measurement property) into a state estimation algorithm with the following property: Given parameters , , and  copies of a quantum state  ∈ C × of rank at most , either / final ), as stated in Theorem 1.6.Note that with these choices, the smallest eigenvalue of ̂︀  will be ̃︀ Ω( final /), as expected.The reason we do not simply directly fix  = √︀ / • ε in the proof of Corollary 3.22 is that in our later application to quantum zero mutual information testing (Section 4.3), it will be important to allow for a tradeoff between the off-diagonal  2 -error, the on-diagonal  2 -error, and the minimum eigenvalue of ̂︀ .
2 ( ‖ ̃︀ ) by a factor of at most 1/(1 − ) = (1), so we may indeed bound it by () from Item (d).Finally, the on-diagonal contribution to ̂︀ D [47]we get:Proof.It suffices to achieve Frobenius-squared error / using () copies.Assume for simplicity  is even.(Weleave the odd  case to the reader.)Thenthere is a simple way[47]to construct a partition  of the edges of the complete graph on  vertices into  − 1 matchings.Fix a particular matching  ∈ , and associate to it the POVM with elements Given parameters ,  (with  ≤ 1/2), and  copies of a quantum state  ∈ C × of rank at most , = ̃︀  (︃ √  •  (, ) Theorem 4.1.Let  =   be a bipartite quantum state on  ⊗ , where  ∼ =  ∼ = C  .Writing D 2 H (,   ⊗   ) = , it holds that ( : )  ≤  • (log(/)).We also observe that by restricting   to be diagonal, we immediately obtain the analogous theorem concerning classical mutual information.We remark that proving this classical version directly is no easier than proving the quantum version.Let  =   be bipartite classical state on  × , where || = || = .Writing d 2 H (,   ×   ) = , it holds that ( : )  ≤  • (log(/)).