Multivariate trace estimation in constant quantum depth

There is a folkloric belief that a depth-$\Theta(m)$ quantum circuit is needed to estimate the trace of the product of $m$ density matrices (i.e., a multivariate trace), a subroutine crucial to applications in condensed matter and quantum information science. We prove that this belief is overly conservative by constructing a constant quantum-depth circuit for the task, inspired by the method of Shor error correction. Furthermore, our circuit demands only local gates in a two dimensional circuit -- we show how to implement it in a highly parallelized way on an architecture similar to that of Google's Sycamore processor. With these features, our algorithm brings the central task of multivariate trace estimation closer to the capabilities of near-term quantum processors. We instantiate the latter application with a theorem on estimating nonlinear functions of quantum states with"well-behaved"polynomial approximations.


Introduction
The task of estimating quantities like given access to copies of the quantum states ρ 1 through ρ m is a fundamental building block in quantum information science.This subroutine, which we call 'multivariate trace estimation', opens the door to estimating nonlinear functions of quantum states [1,2], such as quantum distinguishability measures [3], integer Rényi entropies [4,5,6], entanglement measures [7,8,9] or testing sets of states for linear independence, coherence, and imaginarity [10].This estimation procedure is a component of many quantum protocols such as quantum fingerprinting [3] and quantum digital signatures [11].Let us note that multivariate traces are also known as Bargmann invariants [10,12,13].
When ρ i = ϱ for all i ∈ [m] in (1), an important application of multivariate trace estimation is to entanglement spectroscopy [4] -deducing the full set of eigenvalues {λ 1 , . . ., λ D } (the 'spectrum') of ϱ, where ϱ is the reduced state of a bipartite pure state (that is, ϱ = Tr B (ψ AB ) with ψ AB a bipartite pure state).The spectrum unlocks a wealth of information about the properties of ϱ.The smallest eigenvalue of ϱ diagnoses whether ψ is separable or entangled [7].In addition, the inverse of the smallest eigenvalue acts as a condition number for many quantum algorithms that manipulate quantum states (see for instance [14,15,16]), and it constrains their runtime.The entanglement spectrum is also useful to identify topological order [17,18,19,20], emergent irreversibility [21], quantum phase transitions [22], and to determine if the system obeys an area law [23].
With the wealth of applications described above, there has been much interest in bringing multivariate trace estimation within the reach of near-term quantum hardware.A glimmer of hope in this regard is the observation that quantities like that in (1) can be estimated without the need for full-state tomography.In the quantum information sphere, one of the progenitors of this line of thinking was Ref. [1], which proposed a method leveraging the following well-known identity (related to the replica trick originating in spin glass theory [24]): where the right-hand-side is the multivariate trace we would like to estimate, and W π is a unitary representation of the cyclic shift permutation π := (1, 2, . . ., m) . ( Here, π represents a cyclic permutation that sends 1 to 2, 2 to 3, and so forth.That is to say, multivariate trace estimation can be accomplished by estimating the real and imaginary parts of the cyclic shift operator in (2), using quantum hardware.This identity subsequently became the backbone of many proposals [7,4,25,26,10] for multivariate trace estimation with yet more nearterm constraints.However, there appears to be a lack of clarity regarding the actual resource requirements of these methods.Refs.[25,26,4,7,10] have all suggested that a quantum circuit whose depth is linear in m is needed to perform the task.Ref. [10] has additionally proposed a logarithmic-depth circuit for this task, but it requires all-to-all qubit connectivity in a quantum computer, which is unlikely to be available in the nearterm.
In this paper, we show that these characterizations are overly-conservative: multivariate trace estimation can be implemented in constant quantum depth -the shortest depth one could hope for -while simultaneously respecting near-term quantum architecture constraints.In particular, we require only twodimensional nearest-neighbor connectivity (already a feature of Google's Sycamore processor [27]), linearlymany controlled two-qubit gates and a linear amount of classical pre-processing.We invoke ideas from Shor error correction [28] to construct a circuit that achieves this claim (Section 3), show how this circuit can be implemented in a highly parallelized way on a twodimensional architecture similar to Google's (Section 4), prove Theorem 3 about the statistical guarantees of the resulting estimator (Section 5), and show that our method finds further application in estimating traces of 'well-behaved' polynomial functions of density matrices (Section 6).The next section begins the technical part of our paper by reviewing the basic principle behind multivariate trace estimation.

The principle: Using cyclic shifts for multivariate trace estimation
The principle behind our circuit construction is simple.To explain it, let us first recall a well known circuit construction [1] for multivariate trace estimation that has no clear realization on near-term quantum computers.The idea is to estimate the quantities Re The circuits to estimate both quantities are similar, and we now describe them.To estimate the real part we proceed according to the following steps: 1. Prepare a qubit in the |+⟩ := (|0⟩ + |1⟩)/ √ 2 state and adjoin to it the state 2. Perform a controlled cyclic permutation unitary gate, defined as

Repeat Steps 1 to 3 a number of times equal to
and thus X computed in Step 4 is an empirical estimate of the desired quantity.That is, by invoking the well known Hoeffding inequality, it is guaranteed, for ε > 0 and δ ∈ (0, 1), that For completeness, we recall the Hoeffding inequality now: Lemma 1 (Hoeffding [29]) Suppose that we are given n independent samples Y 1 , . . ., Y n of a bounded random variable Y taking values in [a, b] and having mean µ.Set to be the sample mean.Let ε > 0 be the desired accuracy, and let 1 − δ be the desired success probability, where δ ∈ (0, 1).Then where M := b − a.
To see that (5) holds, note that in the special case when all the states are pure, i.e., where in the last equality we have used the well-known identity in (2).Similarly, we have that so that Eq. (5), which asserts that the conclusion in (14) still holds when each ρ i is a mixed state, follows by convexity (i.e., that every mixed state can be written as a convex combination of pure states).
To estimate the second quantity , a simple variation of the above argument suffices.The technique is identical, except that the final measurement is in the basis

Our circuit construction
We propose a variation of the above method, which instead estimates This makes the circuit more amenable to run on near-term quantum processors.This circuit is depicted in Figure 2 for m = 8 -it has some similarities with the method of Shor error correction from fault-tolerant quantum computation [28] (see also Figure 2 of [30]).
The key enabler of our depth reduction is the replacement of the |+⟩ = 1 √ 2 (|0⟩ + |1⟩) state at the single qubit control wire in the circuit in [1], with an ⌊m/2⌋-party GHZ state on ⌊m/2⌋ control wires.This modification was also proposed in [10], but their circuit for it is of logarithmic depth.In any case, this modification allows the number of controls to the permutation to increase to ⌊m/2⌋, which is half the input size.In turn, it paves the way for an implementation of multivariate trace estimation in quantum depth two using parallelized controlled-SWAP gates.
In order to achieve an overall constant quantum depth, in Section 3.1 we show that constant quantum depth suffices to generate the input GHZ state in (18).Then, in Section 3.2, we show how to implement the permutation W π , again in constant quantum depth.Finally, in Section 3.3, we describe our full estimator and the accompanying circuit.

Preparing GHZ states in constant quantum depth
We now describe two methods to generate the GHZ state in constant quantum depth.The first method has been discussed previously in [31] and is a constant quantum depth circuit assisted by measurements, classical feedback, and a logarithmic depth classical circuit.See also the discussion after [32,Theorem 1.1].The second is a variation of the first, which additionally allows for qubit resets to make more efficient usage of qubits.
Method 1.We begin by discussing the first approach, which generates an r-party GHZ state using a constant quantum depth circuit assisted by measurements, classical feedback, and a logarithmic depth classical circuit.We proceed according to the following steps: 1. Generate r − 1 Bell states; i.e., each pair is in the state 2. Perform controlled-NOT gates between the second qubit in each Bell state and the first qubit of the following one.
3. Measure the target of each controlled-NOT gate (all odd-numbered qubits except the first qubit) in the computational basis.
4. Controlled on the measurement outcome b 1 , . . ., b r−2 , apply a tensor product of Pauli X operators as a correction to all even-numbered qubits except the second qubit.
We now show in detail that the scheme prepares an r-party GHZ state.Consider that the initial state can be written as After step 2, the state becomes After step 3, the (r − 2)-bit string b 1 • • • b r−2 is obtained from the measurements, where and this projects the state onto the following state: Figure 1: A constant-depth quantum circuit for preparing an eight-party GHZ state, assisted by measurement, classical feedback, and qubit resets.Method 1 consists of all the steps depicted, except for the qubit resets and final controlled-NOTs (but only prepares a five-party state).In Method 2, the measured qubits are additionally reset to the |0⟩ state and connected by controlled-NOTs so that a larger, eight-party state can be prepared instead.
The r-party GHZ state on qubits 1, 2, 4, 6, . . ., 2(r − 1) is then recovered by performing the following correction operation: The leftmost part of Figure 1 (all except the qubit resets and final controlled-NOTs) depicts this procedure.
Method 2. The second method is very similar to the one just described.The only difference is that we additionally perform qubit resets on the measured qubits and then controlled-NOTs from the nearest neighbor qubits in the GHZ state to the reset qubits.This scheme is depicted in Figure 1.It prepares a GHZ state on a number of parties equal to the number of input qubits (hence a 2(r − 1)-party GHZ state, as opposed to the r-party GHZ state without the qubit resets).

Multiply-controlled cyclic permutation in constant quantum depth
der to reduce the depth of this part of the circuit from linear to constant.Our implementation of the multiplycontrolled-W π gate in constant depth is based on the observation that there is a particularly convenient way to decompose a cyclic shift into a product of transpositions, as shown in [33, Eqs.(4.2)-(4.3)].We state this observation here, along with a brief proof, for completeness: Proposition 2 The following decomposition holds where all arithmetic is modulo m.
So, for instance, when m = 8 (the case in Figure 2), Eq. (30) reduces to it gets sent to 1 by the first transposition and is not acted on by the second transposition.Otherwise, the second transposition sends m For m odd, there are two indices that are involved in only one transposition: k = m, which, as before, gets sent to 1 by the first transposition and is not acted on by the second transposition; and k = ⌈m/2⌉, which is transposed only in the second transposition, where it gets sent to m+2−⌈m/2⌉ = ⌈m/2⌉+1.All other indices are involved in the same two transpositions as described above.Thus, the overall effect is also Eq. (30) says that, for a fixed m, the m-wise cyclic shift permutation can be decomposed into a product of two terms.Each term is itself a product of disjoint transpositions and every index gets transposed once per term.
This has a clear interpretation in terms of how to construct a quantum circuit to implement a multiplycontrolled-W π .Transposing two qubit labels can be achieved by applying a SWAP gate to the relevant qubits.Disjoint transpositions can thus be accomplished by implementing SWAP gates in parallel, in a single time step.Proposition 2 thus implies that, for every m, the multiply-controlled-W π can be implemented in two time steps (depth two), each of which performs ⌊m/2⌋ controlled SWAPs in parallel.
We give a more explicit description of the circuit in the next subsection.Note that, as depicted in Figure 2, we have made another optimization for near-term feasibility: we adjoin the input states to the cyclic shift in a specific order that ensures that only nearest-neighbor states need to be swapped.

Explicit circuit description
We now put together the findings of the previous two subsections and describe our proposed technique for estimating Tr[ρ 1 • • • ρ m ] in the case that each local dimension d = 2 (i.e., each ρ i is a single-qubit state).The corresponding circuit is depicted in Figure 2.After that, we discuss how to generalize the construction to the case in which d is a power of two, so that each local system consists of multiple qubits.The estimator works as follows: 1. Prepare an ⌊m/2⌋-party GHZ state using one of the constant quantum-depth circuit constructions described in the previous section.Let us call the ⌊m/2⌋ qubits of the GHZ state the control qubits and the m states ρ 1 ⊗ • • • ⊗ ρ m the target qubits.
It can be checked that this prescription implements precisely (30), with the states adjoined in a specific order (see (32) or (33)) that allows only nearest neighbors to be swapped.
3. Perform a Hadamard on all ⌊m/2⌋ control qubits and measure them in the computational basis, receiving outcomes 0 or 1.This has the effect of performing a measurement in the X basis on each control qubit.Let X i ∈ {0, 1} denote the result of the ith measurement, for i ∈ {1, . . ., ⌊m/2⌋}.Set 4. Repeat Steps 1 to 3 a number of times equal to

To estimate
except that in Step 3, replace each Hadamard with HS † , where S is the phase gate before the measurement in the computational basis.This has the effect of performing a measurement in the Y basis on each control qubit.Let Y (j) i ∈ {0, 1} be the outcome of the measurement on the i-th qubit on the j-th application of the circuit.Set J j = (−1) Our proposed architecture is highly flexible and can be tailored to the availability of resources such as long coherence times and multi-qubit gates.This is evident in two ways: Firstly, we can smoothly trade off circuit width for the availability of entangling gates.At one extreme, observe that we could have implemented our circuit with only one control qubit (instead of ⌊m/2⌋) if we had at our disposal a highly-entangling gate: a single-qubit controlled simultaneous SWAP gate that swapped ⌊m/2⌋ pairs of qubits simultaneously.Such a gate would not be feasible in the near term, as it requires highly nonlocal interactions to implement in a real physical architecture.However, even gates that entangle only a subset of qubits afford us savings in circuit width: every additional controlled k-wise SWAP gate at our disposal allows for a reduction of circuit width by k, as k fewer control qubits are necessary (hence the GHZ state needs to be on k fewer parties).
Secondly, to generalize the estimation of beyond single-qubit states, we can either increase the width or the depth of the circuit described above.Suppose that ρ 1 , . . ., ρ m each consist of p qubits.
• We can increase the width of the circuit by preparing GHZ states with mp qubits and then group these into p groups of m qubits each.Correspondingly, group the 2m states ρ 1 , . . ., ρ 2m into p groups of 2m qubits, where the kth group has the kth qubit of each state, for k ∈ {1, . . ., p}.Then we perform controlled SWAPs as detailed above, for each group.Finally, perform Hadamards on all of the mp control qubits and measure each of them in the computational basis.
• To increase the depth of the circuit, prepare an mparty GHZ state, and then sequentially perform the controlled-SWAP tests for the p groups of qubits, so that the depth of the circuit increases by a factor of p. Then measure the control qubits as before.
4 Implementing multivariate trace estimation on a two-dimensional architecture We now outline how to implement our algorithm using a two-dimensional architecture similar to Google's [27].We do so by means of a series of figures, which outline the time steps of the circuit implementation.See Figure 3.These figures can be understood as a twodimensional implementation of the circuit depicted in Figure 2, with the exception that we also include qubit resets during the GHZ state preparation, as depicted in Figure 1.Explanations of the steps are given in the caption of Figure 3.The main point to highlight here is that our circuit leads to a highly parallelized implementation on a two-dimensional architecture that should make it more amenable to realization on nearterm quantum computers.

Guarantees of our estimator
In this section, we show that the estimator T described in Section 3.3 is accurate and precise with high probability.
There exists a random variable T that can be computed with O( 1 ε 2 log( 1 δ )) repetitions of a constant quantum depth circuit consisting of O(m) three-qubit gates, and satisfies Proof.By the Hoeffding inequality (see Lemma 1), it suffices to prove that the estimator T output by the method of Section 3.3 satisfies It suffices to prove the first equality.To begin with, let us suppose for simplicity that all the states are pure, i.e. ρ i = |ψ i ⟩⟨ψ i |, and define the state of the target qubits as Step 1 of our procedure prepares a GHZ state |Φ ⌊m/2⌋ GHZ ⟩ and adjoins it to |ψ (m) ⟩, such that the overall state at the end of Step 1 is After Step 2 (the multiply-controlled cyclic shift), the overall state becomes In Step 3, one measures the ⌊m/2⌋ control qubits in the X basis, obtaining a ⌊m/2⌋ bits, such that the i-th bit is denoted by the random variable X i ∈ {0, 1}.We then compute the expectation of the random variable R, defined as which will be output as the real part of our estimator.
Introducing the following notation for X basis eigenvectors the probability of the X basis measurement outputting the bitstring Thus, where, in the penultimate equality, we have used the fact that x1,...,x ℓ ∈{0,1} ℓ (−1) The claim for mixed states ρ (m follows by convexity (i.e., that every mixed state can be written as a convex combination of pure states).That is, we use the fact that 2 m (51) for this case and then repeat the calculation above for every pure state in the convex decomposition of ρ (m) . (1) Load Data (2) GHZ Preparation (4)

GHZ Preparation
(5) GHZ Preparation (6) ) Using a similar chain of logic, we conclude that and the first equality of (36) follows.
We also compute the variance of T .Consider that we conclude that Var Since these two random variables are independent, We now discuss the generalization of Theorem 3 to states of more than one qubit.This generalization can be accomplished by increasing the circuit width or the circuit depth, as discussed in Section 3.3.

Proposition 4
Let {ρ 1 , . . ., ρ m } be a set of p-qubit states, and fix ε > 0 and δ ∈ (0, 1).There exists a random variable Tp that can be computed using O( 1 ε 2 log( 1 δ )) repetitions of a constant quantum depth circuit consisting of O(mp) three-qubit gates, and satisfies Proof.The gate count for the circuits for both constructions detailed in Section 3.3 is O(mp).For the second construction (increasing the depth), Eq. (57) follows from the exact same calculation as in the proof of Theorem 3.For the first construction (increasing the width), let us define ρ (m,p) := ρ 1 ⊗• • •⊗ρ m .It suffices to prove that for the estimator Rp := (−1) . This follows from a calculation similar to that in (42)- (50).

Similarly,
6 Applications of our method An application of our method is to estimate functions of density matrices that can be approximated by 'wellbehaved' polynomials.This was already suggested in the original work of [1], but no complexity analysis was put forth.Here we formalize the analysis of the application in the following theorem: Theorem 5 Let ρ be a quantum state with rank at most d.Suppose there exists a constant ε > 0 and a function C such that g : R → R is approximated by a degree-m polynomial and Then estimating Tr[g(ρ)] within ε additive error and with success probability not smaller than (Here, we should think of C as a slowly-growing function of m.) An example of a function of a quantum state that can be estimated in this way is g(x) = (1 + x) α for α > 0, which has the following expansion as a binomial series where α k is the generalized binomial coefficient.It is well known that the binomial series converges absolutely for x = 1 and α > 0, which implies that, for every α > 0, there exists a positive constant C(α) such that Thus, this series satisfies the criterion in (60).
Another function of a quantum state that can be estimated in this way is g(x) = ln(x + 1).Indeed, this function has the following well known expansion: so that the absolute partial sum of the coefficients satisfies m k=1 ≈ ln m + γ, where γ ≈ 0.577 is the Euler-Mascheroni constant.The point here is that, even though the absolute partial sums are not bounded by a constant, they still grow sufficiently slowly such that the algorithm runs efficiently.
Yet another example of a function, but in this case with applications in thermodynamics, is g(x) = e βx , where β ∈ R.Here we have that Thus, in this case the absolute partial sums can be bounded by a constant.This function is particularly interesting for thermodynamics applications (and possibly beyond) because our method allows for estimating the partition function Tr[e βρ ] whenever the Hamiltonian is encoded in a density matrix ρ, as assumed, e.g., in the density matrix exponentiation algorithm [34,35].
Proof of Theorem 5.
We will present an estimator for the desired quantity that satisfies the claimed complexity guarantees.Let us describe the estimator for Re [Tr[g(ρ)]] (the estimator for the imaginary part follows immediately).To estimate Re[Tr[g(ρ)]], we run the following procedure:

For each k ∈ [m], run the circuit described in
Steps 1-3 of Section 3.3 once to output a random variable R k ∈ {−1, 1} such that 2. Linearly combine the above random variables to form the new random variable Repeat the first two steps N −1 more times, on the i-th iteration outputting the random variable R (i) .Let us now prove the correctness of this procedure.Suppose the spectral decomposition of ρ is as follows: where r ρ is the rank of ρ.In the limit of N → ∞, the only error in the estimator ĝ would come from the error in the polynomial approximation.That is, where the inequality follows from (59) and the fact that d > r ρ .Now we account for the other source of error, which is the statistical error caused by taking a finite number of samples.Consider that for all i ∈ {1, . . ., N }, where the first inequality follows from the triangle inequality, the second from the fact that R k ∈ {−1, 1}, and the third from the assumption in (60).By applying (71), the Hoeffding inequality in Lemma 1 immediately applies to our setting when we take we get that Combining the two sources of error in (70) and (73), we get that with probability 1 − δ, Repeating the analysis for the estimation of Im[Tr[g(ρ)]] yields the stated complexity.

Discussion
An alternative way to estimate Tr[ρ k ] for each k ∈ [m] is by using single-copy randomized measurements to estimate nonlinear functions of a density matrix [36].More recent approaches use the method of classical shadows to obtain 'classical snapshots' of ρ that can be linearly combined to obtain a classical random variable whose expectation is Tr[ρ k ] (see [37,Supplementary Material Section 6] and [38]).The latter reference provides rigorous bounds on the sample complexity required for the estimation of quantities like Tr(O q ρ ⊗q ) for arbitrary q, where O q is an observable acting on the Hilbert space of q systems; setting O q to be the real and imaginary parts of W π on the q systems and taking their linear combination, we recover our setting.The benefit of these alternative methods is that they allow for sequential measurements and do not require the assumption that the copies of ρ available to the algorithm are identical and independent.However, this robustness comes at a significant cost in terms of sample and computational complexity, as they generically require a number of copies scaling in the dimension d of the states involved -i.e.exponential in n -as well as poly(1/δ) where δ is the failure probability of the estimation (see [38,Appendix E]).Computational complexity is always lower-bounded by sample complexity.On the other hand, our sample complexity does not depend on d at all, and the dependence on 1/δ is only logarithmic.However, our computational complexity does depend on d as we use d-dimensional SWAP gates (but these have complexity O(log d)).
One might also wonder whether we could use this approach to estimate Rényi or von Neumann entropies of quantum states.The main difficulty in doing so is that well known polynomial approximations of the functions x α and −x ln x, given respectively by do not satisfy the condition in (60), in the sense that the absolute partial sums grow too quickly and therefore do not lead to an efficient algorithm using this approach.See also [39] in this context.It thus remains open whether this approach can be used effectively for estimating these important uncertainty measures.See [40,41,42,43] for work on this topic in classical information theory and [44,45,46,47,48,49,50] for a flurry of recent efforts on estimating Rényi and von Neumann entropies using quantum computers, which propose alternative approaches.The authors of [51,52] have also proposed and analyzed quantum algorithms for fidelity estimation.
We note here that the method outlined above can be generalized to functions of multiple density matrices.For example, let ρ and σ be quantum states, and suppose that g 1 and g 2 are well behaved polynomials in the sense described in Theorem 5. Then we can employ a similar approach to estimate the functions Tr[g 1 (ρ)g 2 (σ)] and Tr[g 1 (σ 1/2 ρσ 1/2 )].Polynomial approximations of these functions take the following form: respectively, and can be estimated using our circuits combined with classical postprocessing.Thus, by the discussion after Theorem 5, we can take g 1 (x) = (1+x) α and g 2 (x) = (1 + x) β for α, β > 0. A case of interest is when we set α ∈ (0, 1) and β = 1 − α.The resulting function Tr[g 1 (ρ)g 2 (σ)] then satisfies faithfulness and the data-processing inequality under unital quantum channels [53] and thus can serve as an alternative to the widely used Hilbert-Schmidt distance measure (which also satisfies the data-processing inequality under unital channels [53]).We prove these claims in Appendix A.
Other functions of two density matrices of interest include the Schatten-p distances.For states ρ and σ, these are defined for p > 1 as The simplest of these is the Hilbert-Schmidt distance (i.e., when p = 2), which reduces to This distance measure has been extensively employed in various works on quantum computing, with applications to compiling and machine learning, mainly because it is convenient to estimate it on quantum computers using the well known swap test.We can also consider the cases of p = 4 and p = 6, leading to the following formulas proved in Appendix B: ∥ρ − σ∥ These distances can also be estimated using our algorithms proposed here, because each of the terms involved is a multivariate trace.It remains open for future work to use these distances in applications like compiling and machine learning.

Conclusion
We have provided a quantum circuit for multivariate trace estimation that requires only constant quantum depth, and hence it is more amenable to be implemented on near-term quantum computers than previous methods that require linear depth.Our architecture is also flexible and can be smoothly tailored to the availability of circuit width (at the cost of more-entangling gates).Going forward from here, one can further consider the application of our method to estimating nonlinear functions of quantum states."The most important application of computers has been designing better computers" [54], and these methods can be used for this purpose.Our method to estimate functions of quantum states based on their polynomial approximations opens the door to the idea that near-term quantum computers can be used to design better quantum computers.An important open question in this regard is whether any functions whose polynomial approximations fulfill (60) have an interpretation as quality metrics for quantum computer design.Conversely, it would also be interesting to explore further whether there are any quantum state distinguishability measures -critical in applications like quantum compiling [55,56] and state learning [57,58] -that fulfill this condition.
Our findings open up many avenues for future research.While we already achieve the optimal possible circuit depth, other aspects like circuit width, gate number, and noise robustness could be further optimized to fit near-term constraints.See Ref. [59] for some recent developments in these directions.
A possible extension for future work is to combine our algorithm with the crosstalk mitigation techniques of Ref. [60] to make them directly applicable to noisy quantum processors.As mentioned before, a related exploration was performed recently in [59], wherein the authors combined our approach with error mitigation and explored circuit depth-width trade-offs.Another possible avenue is to employ amplitude estimation techniques [61] to improve the dependence of our algorithm on ε from 1 ε 2 to 1 ε ; however, doing so using known techniques will incur a significant increase in circuit depth [62], and thus we would find it surprising if a constant quantum depth circuit could have this dependence on the accuracy parameter ε.Heuristic methods have been explored in [63] and can also be considered for this purpose.

A Proofs of faithfulness and data processing
Let us define the following measures for states ρ and σ and α ∈ (0, 1) ∪ (1, ∞): These measures are related as follows: where we observe that I+ρ d+1 and I+σ d+1 are states.It is known that for all states ρ and σ (the lower bound follows because ρ and σ are positive semi-definite and the upper bound follows by applying the Hölder inequality).Furthermore, the measure Q α (ρ∥σ) is faithful on states, i.e., equal to 1 if and only if ρ = σ, and it satisfies the dataprocessing inequality [64,65]: , for α ∈ (0, 1) , (89) for every channel N .By the equality in (87), we can conclude properties of K α (ρ∥σ) from properties of Q α (ρ∥σ).Indeed, Also, the measure K α (ρ∥σ) is faithful, i.e., equal to d+1 if and only if ρ = σ.To see this, consider that if and only if This last equality is equivalent to ρ = σ.Thus, the faithfulness claim follows.Finally, the measure K α (ρ∥σ) obeys the data-processing inequality for unital quantum channels.For α ∈ (0, 1) and a unital channel N (i.e., satisfying N (I) = I), we have that and for α ∈ (1, 2], we have that These inequalities follow from the data-processing inequality for Q α .Indeed, consider for α ∈ (0, 1) that The second equality follows from linearity of the channel N and the fact that it is unital.The inequality for α ∈ (1, 2] follows similar reasoning as above but instead makes use of (90).

B Schatten p-distances as linear combinations of multivariate traces
In this we establish formulas for the Schatten-4 and -6 distances of quantum states, in terms of linear combinations of multivariate traces.To start, let us define the Schatten-2k distance measure of states ρ and σ as follows: where k ∈ Z + .Let us consider the cases k ∈ {1, 2, 3}.
First consider k = 1: giving that ∥ρ − σ∥ Now taking the trace, we find that Let us continue and consider the case of k = 3:

Figure 2 :
Figure 2: The leftmost part of the circuit prepares a fourparty GHZ state.The middle part of the circuit performs a controlled cyclic-shift.The final part of the circuit results in the classical bits x1, x2, x3, x4, which are used to generate r = (−1) x 1 +x 2 +x 3 +x 4 .As argued in Section 5, the expectation of r is equal to Re[Tr[ρ1 • • • ρ8]], so that this latter quantity can be estimated through repetition.

Figure 3 : ( 1 )
Figure3:(1) The squares in light grey represent control qubits, and the squares in dark grey represent data qubits.In this example, there are five data states involved, each consisting of four qubits.(2)The quantum data is loaded during this stage, which we note here can be conducted in parallel with the preparation of the GHZ state in the control qubits.The state ρi, for i ∈ {1, . . ., 5}, is a four-qubit state that occupies the indicated column of the data qubits in dark grey.The light grey control qubits are prepared in the all zeros state.(3) First step of the preparation of the GHZ state of the control qubits.Every other control qubit has a Hadamard gate applied in parallel.(4) Every pair of control qubits has CNOT gates applied in parallel.(5) Every other pair of control qubits has CNOT gates applied in parallel.(6) Starting from the third control qubit from the top left, every other control qubit is measured, and the measurement outcome is stored in a binary vector b1, . . ., b7.(7) Based on the measurement outcomes from the previous step, Pauli-X corrections are applied to every other qubit, starting from the fourth in the top row.The particular corrections needed are abbreviated by a multivariate function f , the details of which are available in(29).(8)The measured qubits are reset to the all zeros state.(9) Final step of the preparation of the GHZ state of the control qubits.CNOT gates are again applied to every other control qubit.The final state of all control qubits is equal to a GHZ state.(10) Controlled-SWAPs are applied in parallel between control qubits and data qubits, in a first round of the implementation of the cyclic shift.(11)Controlled-SWAPs are applied in parallel between other control qubits and data qubits, in a second round of the implementation of the cyclic shift.(12)Hadamard gates are applied to all control qubits.(13) In a final step, all control qubits are measured in the computational basis and the measurement outcomes are processed according to(40) to form an estimate of the real part of Tr[ρ1 • • • ρ5].To estimate the imaginary part, replace every H in step(12) with HS † .