Can Error Mitigation Improve Trainability of Noisy Variational Quantum Algorithms?

Variational Quantum Algorithms (VQAs) are often viewed as the best hope for near-term quantum advantage. However, recent studies have shown that noise can severely limit the trainability of VQAs, e.g., by exponentially flattening the cost landscape and suppressing the magnitudes of cost gradients. Error Mitigation (EM) shows promise in reducing the impact of noise on near-term devices. Thus, it is natural to ask whether EM can improve the trainability of VQAs. In this work, we first show that, for a broad class of EM strategies, exponential cost concentration cannot be resolved without committing exponential resources elsewhere. This class of strategies includes as special cases Zero Noise Extrapolation, Virtual Distillation, Probabilistic Error Cancellation, and Clifford Data Regression. Second, we perform analytical and numerical analysis of these EM protocols, and we find that some of them (e.g., Virtual Distillation) can make it harder to resolve cost function values compared to running no EM at all. As a positive result, we do find numerical evidence that Clifford Data Regression (CDR) can aid the training process in certain settings where cost concentration is not too severe. Our results show that care should be taken in applying EM protocols as they can either worsen or not improve trainability. On the other hand, our positive results for CDR highlight the possibility of engineering error mitigation methods to improve trainability.

A central challenge in the NISQ regime is to combat the effects of noise, as full error correction is not possible [41]. Decoherence, gate errors, and measurement noise all conspire to limit the complexity of quantum circuits that can be implemented on NISQ devices. While VQAs themselves offer some strategies to mitigate the impact of noise [1], it is widely viewed that VQAs alone will not be enough, and additional strategies will be needed to obtain quantum advantage in the face of noise. This has spawned the field of error mitigation (EM), and many researchers believe that VQAs combined with EM techniques will be the path forward. Indeed, EM methods like Zero-Noise Extrapolation [10,42-44], Clifford Data Regression [45], Virtual Distillation [46,47], Probabilistic Error Cancellation [42,43] and others [48-52] have been demonstrated to reduce errors of observable expectation values, sometimes by orders of magnitude. Hence, there has been hope that one can simply train the VQA in the presence of noise, and then after training apply an EM method to extract the correct cost value (e.g., the ground state energy in the case of the Variational Quantum Eigensolver [21]).
However, new challenges have recently been discovered for this approach [53,54]. It is now recognized that noise impacts the trainability of VQAs, that is, the ability of the classical optimizer to find the global cost minimum. For ansatzes (i.e., parameterized quantum circuits) with depth linear or superlinear in the number of qubits and local Pauli noise, the cost function landscape exponentially flattens, leading to an exponentially vanishing cost gradient, a phenomenon known as Noise-Induced Barren Plateaus (NIBPs) [53]. Thus, noise impedes the training process of VQAs, as in such a setting one requires an exponential number of shots per optimization step to resolve the cost landscape against finite sampling noise. As with other barren plateau effects [55,56], this exponential scaling does not only arise for gradient-based optimizers but also impacts gradient-free methods [57] and optimizers that use higher-order derivatives [58]. NIBPs represent a serious issue for VQA scalability, and could ultimately be a roadblock for near-term quantum advantage. It is therefore crucial to investigate potential methods to mitigate them.
Given the great success of EM methods in suppressing errors in observable expectation values, it is natural to ask whether EM methods could address NIBPs. More generally, one could simply ask: does it help to use error mitigation during the training process for VQAs? This question is precisely the topic of our article. We remark that error mitigation has been successfully implemented during the VQA training process for a small-scale problem [44]. However, it is an open question as to whether or not EM can resolve large-scale trainability issues associated with cost concentration. This is due to the fact that even though EM can reverse the concentration of cost values, it also increases the stochastic uncertainty in the mitigated quantities. As summarized in Figure 1, this is a trade-off that should be carefully considered. Thus, it is a non-trivial question whether EM improves the resolvability of cost function values, which is a key factor in determining the trainability of the landscape.
In this work, we investigate the effects of error mitigation on the resolvability of the cost function landscape. First, we consider a broad class of error mitigation protocols and show that, under the class of local depolarizing noise that is known to cause NIBPs, any such protocol needs to spend resources (e.g., shot resources or number of state copies) scaling at least exponentially in the number of qubits in order to reverse exponential cost concentration. This suggests that NIBPs are a serious scaling issue that cannot simply be resolved with error mitigation.
Second, we study four specific error mitigation protocols in further detail: Zero Noise Extrapolation, Virtual Distillation, Probabilistic Error Cancellation, and strategies that implement a linear ansatz, which include Clifford Data Regression. We find that Virtual Distillation can actually decrease the resolvability of the noisy cost landscape and impede trainability. Under more restrictive assumptions on the cost landscape, we find a similar result for Zero Noise Extrapolation. We also show that any improvement in the resolvability after applying Probabilistic Error Cancellation under local depolarizing noise degrades exponentially with increasing number of qubits. Finally, for strategies that use a linear ansatz, such as Clifford Data Regression, we show that there is no change to the resolvability of any pair of cost values if the same ansatz is used. However, we do observe numerically that Clifford Data Regression increases trainability in some settings. This last observation provides some hope that a careful choice of error mitigation method can be useful. It also suggests that researchers could design and engineer error mitigation methods to enhance VQA trainability.
The rest of the manuscript is structured as follows. Section II introduces the framework and notation for our work. We present our theoretical results in Section III and our numerical results in Section IV. Finally, our concluding discussions are presented in Section V. The proofs for our main results are presented in the Appendix.

FIG. 1. (a) Resolving two cost values (C(θ1) and C(θ2)) on the cost landscape in parameter space. Ideally (with infinite sampling), these cost values correspond to the mean values of some probability distributions (left panel). However, in an experimental setup, one only has a finite shot budget, and by collecting measurement statistics one obtains an estimate of the mean values by sampling from these distributions (right panel). (b) The effect of certain types of noise models is to concentrate cost function values. This impedes trainability, as any two cost function values (C(θ1) and C(θ2)) have small separation and require many shots to accurately distinguish. (c) Error mitigation can mitigate many effects of noise and potentially recover key features of the noise-free cost function. In an ideal scenario, the separation of the mitigated cost values (Cm(θ1) and Cm(θ2)) closely resembles that of the noise-free landscape. However, the caveat is that the variance of statistical outcomes can increase greatly. As a result, the two cost function points can often require even more shots to resolve accurately, compared to the unmitigated case.

A. Variational Quantum Algorithms
The main goal of Variational Quantum Algorithms (VQAs) is to solve an optimization problem by minimizing a cost function that can be efficiently estimated on a quantum computer. In this work we consider settings where the cost function takes the form

C(θ) = Tr[O U(θ) ρ_in U(θ)†].    (1)

In the above, given some Hilbert space H, we define the set of density operators S(H) and the set of bounded linear operators B(H). We then denote ρ_in ∈ S(H) as the input state, U(θ) ∈ B(H) as a unitary that corresponds to a parametrized quantum circuit with trainable parameters θ, and O ∈ B(H) as a Hermitian operator. The Variational Quantum Eigensolver [21], variational quantum compiling [33-35,59], quantum autoencoders [60], and several other VQAs fit under the framework of Eq. (1).
A quantum computer is employed to evaluate the cost function, or gradients thereof, and part of the computational complexity of the algorithm is delegated to a classical computer that leverages the power of classical optimizers to solve the problem

θ* = arg min_θ C(θ).    (2)

The optimization task defined in Eq. (2) has been shown to be NP-hard [61]. Moreover, on top of the typical difficulties associated with solving classical non-convex optimization problems, there are challenges that arise when training the parameters of a VQA due to the quantum nature of the problem itself.
As quantum mechanics is intrinsically a probabilistic theory, one has to deal with shot noise arising from finite sampling when estimating the cost function (or its gradient). This has led to the development of several quantum-aware optimizers that are frugal in the number of shots [62-64]. Additionally, it has been recently shown that certain properties of the cost function can induce so-called barren plateaus, originating due to highly expressive ansatzes [55,65,66], global cost functions [56], high levels of entanglement [67,68], or the controllability of U(θ) [69]. When a cost function exhibits a barren plateau, with high probability the cost function partial derivatives are exponentially suppressed across the landscape. This means that an exponentially large number of shots is needed to navigate the flat landscape and determine a cost-minimizing direction [57,58].
In this work we investigate the effect of noise and error mitigation techniques on solving the optimization task of Eq. (2). For this purpose we investigate the task of resolving two points on the cost function landscape, as presented in Fig. 1. This is a central primitive in the training process that is utilized at each optimization step, regardless of whether one is using gradient-based or gradient-free methods. In gradient-based methods, a common strategy is to use the parameter shift rule, which constructs partial derivatives from two cost function values [70,71]. Gradient-free methods such as simplex-based methods also compare two or more cost function values at each optimization step [72,73]. Thus, this task is a key step for both gradient-based and gradient-free optimizers, and it reflects the ability of the optimizer to find a cost-minimizing direction at each step of the optimization. As discussed below, under a finite shot budget this task becomes harder under cost concentration, leading to trainability issues.
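To make this primitive concrete, here is a minimal sketch of how the shot count needed to distinguish two cost values scales with their separation. The unit single-shot variance and the two-standard-error criterion are assumptions for illustration, not values from this work:

```python
from math import ceil

def shots_to_resolve(c1, c2, sigma, z=2.0):
    """Shots per point so that the standard error of the estimated difference,
    sqrt(2) * sigma / sqrt(N), falls below |c1 - c2| / z."""
    delta = abs(c1 - c2)
    return ceil(2.0 * (z * sigma / delta) ** 2)

sigma = 1.0                                      # assumed single-shot standard deviation
n_wide = shots_to_resolve(0.30, 0.10, sigma)     # well-separated cost values
n_conc = shots_to_resolve(0.302, 0.300, sigma)   # concentrated landscape
# Shrinking the separation by 100x inflates the shot count by roughly 10^4.
```

Since the shot count scales as the inverse square of the separation, a hundredfold concentration of the landscape inflates the shot budget by roughly four orders of magnitude, which is the mechanism behind the trainability issues discussed below.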

B. Effect of noise on the training landscape
Hardware noise can impact the cost function landscape in a variety of ways, such as changing the optimal cost function value, shifting the position of minima, and demoting a global minimum to a local minimum. All of the above present further challenges in the training of VQAs. In this section we briefly review some of the literature on the effect of noise on VQA cost function landscapes. We summarize some of these effects in Fig. 2.

Noise resilience
Certain cost functions have been demonstrated to show optimal parameter resilience under particular noise models [34]. This is a phenomenon where the position of the global minimum of the cost landscape is invariant under the action of noise. This has important consequences for trainability. There are many VQAs where the goal is to obtain optimal parameters, rather than the optimal cost value, such as when solving combinatorial optimization problems with the Quantum Approximate Optimization Algorithm [24]. If such cost landscapes display optimal parameter resilience, this leaves open the possibility of noisy training even if the cost value of the global minimum is altered by the noise. However, noise can also severely affect the trainability of the landscape in a number of ways, which we summarize below.

Noise-induced cost concentration and noise-induced barren plateaus
Here we summarize the phenomenon of noise-induced cost concentration and noise-induced barren plateaus (NIBPs), as well as introduce some notation that we will use throughout the rest of this manuscript. This was formulated in Ref. [53] for a general class of VQAs and a class of Pauli noise that includes local depolarizing noise as a special case. (See also Refs. [54,74,75] for other discussions of the impact of noise.) In this section we follow the treatment for local depolarizing noise. Consider a model of noise acting through a depth-L circuit with n-qubit input state ρ_in as

ρ̃ = N ∘ U_L ∘ N ∘ U_{L-1} ∘ ... ∘ N ∘ U_1 ∘ N (ρ_in),    (3)

where {U_k}_{k=1}^{L} denote unitary channels that describe collections of gates that act together in a layer, and N = ⊗_{i=1}^{n} D_i is an instance of local depolarizing channels, with depolarizing probabilities {p_i}_{i=1}^{n}. We denote the noisy cost function as

C̃(θ) = Tr[O ρ̃],    (4)

where O is some Hermitian measurement operator (throughout the article we will use a tilde to denote noisy quantities). In Ref. [53] it was shown that

|C̃(θ) − Tr[O]/2^n| ≤ D(q, n),    (5)

where q = max_i {1 − p_i} < 1 and D(q, n) ∈ O(q^{αn}) for some positive constant α if L ∈ Ω(n). Thus, in the presence of the class of noise models considered, the noisy cost function exponentially concentrates on a fixed value if the depth scales linearly or superlinearly in the number of qubits.
The gradients across the cost function landscape show similar scaling [53], demonstrating a phenomenon known as NIBPs. This implies that the task of accurately determining gradients or cost function differences during the training process requires an exponential number of shots, due to the need to resolve quantities to an exponentially small precision.
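The exponential concentration can be illustrated with a deliberately crude toy model: one global depolarizing channel per layer acting on a traceless observable. This is a simplification for intuition only, not the local-Pauli analysis of Ref. [53]:

```python
def noisy_cost(c_clean, fixed_point, p, L):
    """Toy model: one global depolarizing channel (probability p) per layer.
    Each layer contracts the traceless part of the state by q = 1 - p, so the
    cost decays exponentially toward the fixed point (Tr[O]/2^n)."""
    q = 1.0 - p
    return q**L * c_clean + (1.0 - q**L) * fixed_point

fixed_point = 0.0   # e.g. a traceless Pauli observable, so Tr[O]/2^n = 0
c_clean = 1.0
gaps = [abs(noisy_cost(c_clean, fixed_point, p=0.05, L=L) - fixed_point)
        for L in (10, 20, 40)]
# In this model doubling the depth squares the contraction: gap(2L) = gap(L)^2.
```

The gap to the fixed point shrinks geometrically in the depth, so once the depth grows linearly in n the gap (and hence any gradient signal) is exponentially small in n, matching the scaling of Eq. (5).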

Cost corruption
In general, a noise model that exhibits cost concentration and NIBPs will not simply uniformly flatten the cost landscape. Instead, we expect noise to additionally alter the cost landscape in many non-trivial ways. We refer to any additional adverse effects on the landscape as cost corruption. For example, it was shown in Ref. [76] that non-unital noise can break the degeneracy of exponentially-occurring global minima, thus proliferating local minima and impacting trainability. In addition, cost functions that do not exhibit optimal parameter resilience [34] limit the quality of noisy optimization, as the optimal parameters of the noisy cost C̃(θ) do not correspond to the optimal parameters of C(θ).

C. Error Mitigation Techniques
We finish the discussion of our framework with a summary of the key features of the error mitigation techniques that we study in this article. For a more detailed review, readers can refer to Refs. [2,77].
Consider the effects of noise on the cost function in Eq. (1). We suppose the noise can be characterized by a single parameter ε, and we denote the corresponding noisy state and cost function as ρ̃(θ, ε) and C̃(θ, ε) = Tr[ρ̃(θ, ε) O], respectively. The goal of error mitigation is to construct an experimental protocol which obtains a mitigated cost function estimator C_m(θ) that approximates the noise-free value C(θ). The protocol to obtain C_m(θ) generally consists of running circuits that modify the original circuit of interest by inserting additional gates, preparing multiple copies of a state, changing the measurement operator, and classically post-processing the expectation values of these circuits. These different utilizations of resources are summarized in a schematic in Fig. 3.
Error mitigation protocols often lead to a larger variance in the statistical outcomes of each experiment, and thus more shots are required to estimate the error-mitigated cost value C_m(θ, ε) to a desired precision compared to the unmitigated noisy value C̃(θ, ε). This is often quantified by the error mitigation cost, which is defined below.

Definition 1 (Error mitigation cost). We define the error mitigation cost as

γ(θ, ε) = Var[C_m(θ, ε)] / Var[C̃(θ, ε)],

the ratio of the variances of the finite-shot estimators of the mitigated and unmitigated cost values, where C̃(θ, ε) denotes the noisy cost function value corresponding to the vector of parameters θ at noise level ε, and C_m(θ, ε) denotes the corresponding error-mitigated quantity.
In certain settings we encounter, γ(θ, ε) is independent of θ. In other cases, where we have to compare γ(θ, ε) for two different parameters, we will seek parameter-independent bounds. We now summarize the error mitigation techniques that we study in this article. We note that, recently, unified error mitigation techniques have also been proposed that combine two or more of the protocols that we discuss in this section [78-80]. Our results are also applicable to such strategies; however, we will only review the root strategies here.

Zero Noise Extrapolation
The goal of Zero Noise Extrapolation is to run a given circuit of interest at m + 1 increasing noise levels ε < a_1 ε < ... < a_m ε, and to use information from the resulting expectation values to obtain an estimate of the zero-noise result. Here we summarize the key features of a protocol using Richardson extrapolation [10,42], and exponential extrapolation [43].
Richardson Extrapolation. Suppose that C̃(θ_i, ε) admits a Taylor expansion in the small noise parameter ε as

C̃(θ_i, ε) = C(θ_i, 0) + Σ_{k=1}^{∞} p_k ε^k,    (7)

where the p_k are unknown parameters and C(θ_i, 0) = C(θ_i) is the zero-noise cost function. By considering the equivalent expansion of C̃(θ_i, a_1 ε) and combining the two equations one obtains

C_m(θ_i) = [a_1 C̃(θ_i, ε) − C̃(θ_i, a_1 ε)] / (a_1 − 1) = C(θ_i, 0) + O(ε²),    (8)

which is a higher-order approximation of C(θ_i, 0) compared to simply using C̃(θ_i, ε). This process can be repeated iteratively m times to obtain an estimator which is accurate up to O(ε^{m+1}) error.
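A minimal sketch of the two-point Richardson step, using a synthetic noisy cost whose Taylor coefficients (C0, p1, p2) are invented for illustration:

```python
def richardson_two_point(c_eps, c_a_eps, a1):
    """Two-point Richardson extrapolation: the linear combination of the
    expectation values at noise levels eps and a1*eps for which the
    first-order term in eps cancels, leaving an O(eps^2) error."""
    return (a1 * c_eps - c_a_eps) / (a1 - 1.0)

# Synthetic noisy cost with zero-noise value C0 (hypothetical coefficients):
C0, p1, p2 = 0.7, -0.5, 0.3
cost = lambda eps: C0 + p1 * eps + p2 * eps**2

eps, a1 = 0.01, 3.0
raw_err = abs(cost(eps) - C0)                                            # O(eps)
mit_err = abs(richardson_two_point(cost(eps), cost(a1 * eps), a1) - C0)  # O(eps^2)
```

On this synthetic model the linear combination cancels the O(ε) term exactly, leaving a residual of 9 × 10⁻⁵ against an unmitigated error of about 5 × 10⁻³; on hardware the cancellation is only as good as the Taylor model of the noise.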
Exponential extrapolation. In some cases the noisy behavior may not be well described by a Taylor expansion. As an alternative one can consider an exponential model

C̃(θ_i, ε) = r e^{−tε},    (9)

for some r and t which in general can be functions of ε. We can also construct an extrapolation strategy that is tailored towards noisy cost function values that are dominated by NIBP scaling as in Eq. (5), where we model the effects of noise as

C̃(θ_i, ε) = B(θ_i) e^{−tε} + C_fp.    (10)

Here, C_fp is the fixed point of the noise and B(θ_i) + C_fp is the noise-free cost value. For these two strategies we can similarly construct C_m(θ_i) as linear combinations of {C̃(θ_i, a_j ε)}_{j=0}^{m} (with a_0 = 1) to achieve O(ε^{m+1}) approximations of C(θ_i, 0). We detail these constructions in Section A 1 a of the Appendix.

Virtual Distillation
Virtual Distillation, also known as Error Suppression by Derangement, was proposed concurrently in Refs. [47] and [46]. In this article we consider the two error mitigation protocols in Ref. [47] (denoted "A" and "B"), which respectively prepare the mitigated values

Tr[O ρ̃_i^M] / Tr[ρ̃_i^M]   and   Tr[O ρ̃_i^M] / λ_i^M,

where λ_i is the dominant eigenvalue of ρ̃_i ≡ ρ̃(θ_i). The operator ρ̃_i^M can be obtained by preparing M copies of ρ̃_i in a tensor product state ρ̃_i^{⊗M} and applying a cyclic shift operator. We note that protocol B presumes access to the dominant eigenvalue beforehand, which could potentially be computed via the techniques of Ref. [37].
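Classically, the quantity targeted by protocol A can be sketched by exponentiating the density matrix directly. The single-qubit mixed state below is hypothetical, and this is a numerical illustration of the formula, not an implementation of the copy-and-cyclic-shift circuit:

```python
import numpy as np

def virtual_distillation(rho, O, M):
    """Mitigated expectation Tr[rho^M O] / Tr[rho^M] (protocol-A-style quantity).
    Experimentally this uses M copies and a cyclic shift; here we simply
    exponentiate the density matrix classically."""
    rho_M = np.linalg.matrix_power(rho, M)
    return np.trace(rho_M @ O).real / np.trace(rho_M).real

# Toy 1-qubit noisy state: mostly |0>, with some mixed-in |1> population.
rho = np.diag([0.9, 0.1])
Z = np.diag([1.0, -1.0])

vals = [virtual_distillation(rho, Z, M) for M in (1, 2, 4)]
```

As M grows the estimate converges to the expectation value in the dominant eigenvector (here <0|Z|0> = 1), which is the sense in which the noisy state is "distilled".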

Probabilistic Error Cancellation
Probabilistic Error Cancellation utilizes many modified circuit runs in order to construct a quasiprobability representation of the noise-free cost function [42,43]. We assume that the effect of the noise can be described by a quantum channel N that occurs after a gate that we denote with unitary channel U. Here we make the simplifying assumption that this is the only gate in the circuit; we treat the general case in Section A 1 b of the Appendix, where we also provide a more detailed exposition. The goal of this protocol is to simulate the inverse map N⁻¹. Note that, in general, this will not correspond to a CPTP map. Despite this fact, if one has a basis of (noisy) quantum channels {B_α}_α, corresponding to experimentally available channels, one can expand the inverse map in this basis as N⁻¹ = Σ_α q_α B_α, for some set of q_α ∈ R. By defining a probability distribution p_α = |q_α| / G_N, with normalization G_N = Σ_α |q_α|, the noise-free expectation value can then be written as a quasiprobability distribution

C_U(ρ_in) = Tr[O U(ρ_in)] = G_N Σ_α sgn(q_α) p_α Tr[O (B_α ∘ N ∘ U)(ρ_in)],

where ρ_in is the input state, O is the measurement operator, and sgn(q_α) denotes the sign of q_α. The idea is that if one has access to the set of CPTP maps {B_α}_α in the noisy native hardware gate set, then one can obtain an estimate of the noise-free cost C_U(ρ_in) as follows: (1) With probability p_α, prepare the circuit of interest with the additional gate B_α in order to obtain the corresponding expectation value. (2) Multiply the result by sgn(q_α) G_N. (3) Repeat the process many times and average the results.

FIG. 3 (caption fragment). Many such circuits with different hyperparameters can be run, with their expectation values combined in a post-processing step, in order to construct the final error-mitigated cost value C_m(θ). One feature that distinguishes the approaches to error mitigation studied here from error correction is that error correction allows global access to the larger Hilbert space from the start of the circuit, whereas the error mitigation techniques studied here only allow the possibility for global operations at the end of the circuit.
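The steps above can be sketched for a single-qubit depolarizing channel, whose inverse has a known expansion with one negative quasiprobability per Pauli insertion. For simplicity the ideal gate U is taken to be the identity and the parameter values are illustrative:

```python
import numpy as np

# Single-qubit depolarizing channel N(rho) = (1-p) rho + (p/3) sum_P P rho P
# shrinks traceless-Pauli expectations by f = 1 - 4p/3. Its (non-CPTP) inverse
# expands as q_I * Id + q_P * (X, Y, Z conjugations), with q_P < 0.
p = 0.1
f = 1.0 - 4.0 * p / 3.0
q_I = (3.0 / f + 1.0) / 4.0          # > 0
q_P = (1.0 - q_I) / 3.0              # < 0, one weight per Pauli insertion
q = np.array([q_I, q_P, q_P, q_P])   # insertion order: I, X, Y, Z
G = np.abs(q).sum()                  # sampling overhead (quasiprobability 1-norm)

# Noisy <Z> for input |0> is f; inserting X or Y after the noise flips its sign.
E = np.array([f, -f, -f, f])

exact = np.dot(q, E)                 # deterministic weighted sum -> recovers 1.0

# Monte Carlo version: sample insertion alpha with prob |q_a|/G, reweight by sgn(q_a)*G.
rng = np.random.default_rng(0)
alphas = rng.choice(4, size=100_000, p=np.abs(q) / G)
sampled = np.mean(np.sign(q[alphas]) * G * E[alphas])
```

Note that the overhead G_N exceeds 1, so the reweighted samples have a larger spread than the raw ones; this variance amplification is exactly what the error mitigation cost of Definition 1 quantifies.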
Linear ansatz methods and Clifford Data Regression

The main idea of linear ansatz methods is to assume that we can approximately reverse the effects of noise with an affine map, and thus we construct a linear ansatz of the form

C_m(θ) = a_1(θ) C̃(θ) + a_2(θ),

where a(θ) = (a_1(θ), a_2(θ)) is a vector of parameters to be determined. In general we expect a to be highly dependent on θ. In Ref. [45], the authors use data regression to learn the optimal parameters a*(θ) with training data comprising pairs of noise-free and corresponding noisy cost function values T_θ = {(C_j, C̃_j)}_j, where the circuits are predominantly constructed from Clifford gates. The noise-free cost values can be simulated efficiently on a classical computer, whilst the noisy cost values can be evaluated directly on the quantum computer. This strategy is known as Clifford Data Regression.
Other methods have been proposed to learn the optimal parameters a*(θ). In Ref. [81] the authors further develop the idea of training-based error mitigation by considering alternative training data comprising fermionic linear optics circuits. One can also model the noise as global depolarizing noise. Under this assumption, a*(θ) has an exact solution in terms of a single noise parameter. Subsequently, various techniques can be used to estimate the noise parameter [82-86].
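A sketch of the regression step, with synthetic training data standing in for the near-Clifford circuit pairs of Ref. [45]. The noise here is exactly affine by construction, so the fit recovers the true parameters; real hardware data would not be this clean:

```python
import numpy as np

# CDR-style fit: C_m = a1 * C_noisy + a2, learned from (noisy, exact) pairs.
rng = np.random.default_rng(1)
a1_true, a2_true = 2.5, -0.1               # hypothetical affine noise inversion
C_exact = rng.uniform(-1, 1, size=30)      # stand-ins for classically simulable costs
C_noisy = (C_exact - a2_true) / a1_true    # synthetic "hardware" values

# Ordinary least squares on [C_noisy, 1] against C_exact.
A = np.column_stack([C_noisy, np.ones_like(C_noisy)])
(a1, a2), *_ = np.linalg.lstsq(A, C_exact, rcond=None)

mitigated = a1 * 0.2 + a2                  # apply the fit to a new noisy cost value
```

The learned map is then applied to noisy cost values evaluated on the quantum computer, at the price of generating and evaluating the training set.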

III. Theoretical Results
We present two sets of theoretical results. First, in Section III A we show that a broad class of error mitigation techniques cannot undo the exponential resource requirement that exponential cost concentration presents. This has implications both for the trainability of noisy VQAs and for the accurate estimation of noise-free cost function values in general. Second, in Section III B, we work predominantly in the non-asymptotic regime (in terms of scaling in n) and investigate to what extent different error mitigation strategies can improve the resolvability of the noisy cost landscape, assuming that some cost concentration has occurred. For these purposes we introduce a class of quantities which quantify the improvement of the resolvability of the cost function landscape after error mitigation, which we call the relative resolvability (see Definitions 2-4). Using these quantities we study Zero Noise Extrapolation (Sec. III B 2), Virtual Distillation (Sec. III B 3), Probabilistic Error Cancellation (Sec. III B 4) and linear ansatz methods, which include Clifford Data Regression (Sec. III B 5). In the settings that we consider, we find that in many cases error mitigation impedes the optimizer's ability to find good optimization steps, and is worse than performing no error mitigation.

A. Asymptotic scaling results (exponential estimator concentration)
In this section we show that full mitigation of exponential cost concentration is not possible for a general class of error mitigation strategies. Specifically, we show that one cannot remove the exponential scaling that local depolarizing noise incurs without investing exponential resources elsewhere in the mitigation protocol.
We start by remarking that, as summarized in Fig. 3, all of the strategies presented in Sec. II C consist of preparing linear combinations of expectation values of the form

E_{σ,X,M,k} = Tr[X (σ^{⊗M} ⊗ |0⟩⟨0|^{⊗k})],    (16)

for some n-qubit quantum state σ ∈ S(H) that in general can be prepared by a different circuit to that of the state of interest, for |0⟩⟨0| ∈ S(H'), and for some X ∈ B(H^{⊗M} ⊗ H'^{⊗k}). That is, one can prepare multiple copies of a state, prepare different quantum circuits, and apply general measurement operators. In order to generalize the setting further, we also allow the possibility to utilize multiple clean ancillary qubits at the end of the circuit. By considering linear combinations of such quantities, one also accounts for the ability to post-process measurement results classically with a linear map, as is the case with Probabilistic Error Cancellation. In the following theorem we show how quantities of the form (16) concentrate under local depolarizing noise.
Theorem 1. Consider an error mitigation strategy that, as a step in its protocol, estimates E_{σ,X,M,k} as defined in Eq. (16). Suppose that σ is prepared with a depth-L_σ circuit and experiences local depolarizing noise according to Eq. (3). Under these conditions, E_{σ,X,M,k} exponentially concentrates with increasing circuit depth on a state-independent fixed point as

|E_{σ,X,M,k} − Tr[X ((𝟙/2^n)^{⊗M} ⊗ |0⟩⟨0|^{⊗k})]| ≤ G(n),    (17)

where 𝟙 ∈ B(H) is the n-qubit identity operator and

G(n) ∈ O(M ‖X‖_∞ q^{L_σ}),    (18)

with noise parameter q ∈ [0, 1).
We remark that a similar result can be obtained for local Pauli noise, provided that each Pauli error occurs with a non-zero probability. Theorem 1 shows that quantities of the form (16) exponentially concentrate in the depth of the circuit. As we summarize in the schematic in Fig. 3, such quantities generalize the expectation values that are prepared by many different error mitigation protocols. We now explicitly demonstrate how Theorem 1 affects the mitigated cost values that these protocols output.

Corollary 1 (Exponential estimator concentration).
Consider an error mitigation protocol that approximates the noise-free cost value C(θ) by estimating the quantity

C_m(θ) = Σ_{(σ,X,M,k)∈T} a_{X,M,k} E_{σ(θ),X,M,k},    (19)

where each E_{σ,X,M,k} takes the form (16). We denote M_max and a_max as the maximum values of M and a_{X,M,k}, respectively, accessible from a set T defined by the given protocol. Assuming ‖X‖_∞ ∈ O(poly(n)), there exists a fixed point F independent of θ such that

|C_m(θ) − F| ∈ O(|T| a_max M_max poly(n) β^{−n}),    (20)

for some constant β > 1, if the circuit depths satisfy

L_{σ(θ)} ∈ Ω(n),    (21)

for all σ(θ) in the construction (19). That is, if the depths of the circuits scale linearly or superlinearly in n, then one requires at least exponential resources to distinguish C_m from its fixed point, for instance by requiring an exponential number of shots, or an exponential number of state copies M_max.
We note that the assumption ‖X‖_∞ ∈ O(poly(n)) is satisfied in most settings, and in particular is satisfied for all error mitigation protocols discussed in Sec. II C. For instance, in the case of Virtual Distillation, X corresponds to a cyclic shift operator followed by a Pauli observable, and thus ‖X‖_∞ ∈ O(1). Corollary 1 implies that, under conditions that generate a NIBP, in order to obtain an estimate of a noise-free cost value up to some arbitrary additive error, one requires resource consumption that scales exponentially in the number of qubits. In Appendix C we present a more detailed statement that explains how such resources may be consumed.
Whilst the use of clean ancillary qubits as part of an error mitigation protocol, utilized as in Eq. (16) and Fig. 3, has not been widely studied, Corollary 1 rules out the possibility that such resources used at the end of the circuit would offer an advantage in countering the exponential scaling effects due to cost concentration. Indeed, upon inspecting Eq. (17), the ancillas appear explicitly in the form of the fixed point. This highlights a key difference between many error mitigation strategies and error correction, as error correction utilizes resources (such as a larger Hilbert space) in the middle of the computation, whilst the error mitigation protocols considered here are based on processing states obtained at the end of a noisy computation. Our result leaves open the possibility that novel error mitigation protocols that move beyond the framework of Eq. (19) and Fig. 3 can have hope of countering the exponential scaling of exponential cost concentration and NIBPs.

B. Non-asymptotic protocol-specific results
In this section we present predominantly non-asymptotic results for Zero-Noise Extrapolation (Sec. III B 2), Virtual Distillation (Sec. III B 3), Probabilistic Error Cancellation (Sec. III B 4), and methods which use a linear ansatz, such as Clifford Data Regression (Sec. III B 5). For each protocol, we investigate the effect of error mitigation on the resolvability of the cost landscape, for different classes of noisy states. To this end, we first define a class of resolvability measures which quantify how many shots it takes to resolve the cost landscape at some fixed precision after applying error mitigation, compared to no mitigation at all.

Definitions
Definition 2 (Relative resolvability for two points). Consider two locations in parameter space θ_1, θ_2 and their corresponding points on the cost landscape. Denote the number of shots required to resolve these two points up to some fixed precision with and without error mitigation as N_EM and N_noisy, respectively. We define the relative resolvability for θ_1 and θ_2 at error level ε as

χ(θ_{1,2}, ε) = N_noisy(θ_{1,2}, ε) / N_EM(θ_{1,2}, ε) = (1/γ) [ΔC_m(θ_{1,2}, ε) / ΔC̃(θ_{1,2}, ε)]²,

where we have used the shorthand notation χ(θ_{1,2}) = χ(θ_1, θ_2), γ is the error mitigation cost as defined in Definition 1, and where we denote

ΔC_m(θ_{1,2}, ε) = C_m(θ_1, ε) − C_m(θ_2, ε),   ΔC̃(θ_{1,2}, ε) = C̃(θ_1, ε) − C̃(θ_2, ε).

We can see that if χ(θ_{1,2}, ε) > 1, then error mitigation has successfully increased the resolvability of the cost values at θ_1 and θ_2. Note that this criterion is a necessary but not sufficient condition for error mitigation to reverse the effects of cost concentration on the cost landscape. Namely, it does not require the mitigated landscape to accurately reflect the noise-free landscape, and it does not account for other trainability issues such as the proliferation of minima. If χ(θ_{1,2}, ε) < 1 then N_noisy(θ_{1,2}, ε) < N_EM(θ_{1,2}, ε). This implies that error mitigation has exacerbated the resolvability issues associated with cost concentration and NIBPs, and has been counterproductive in fixing these trainability issues.
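Under the shot-count reading of Definition 2 (shots scale as variance over squared separation), the relative resolvability can be sketched as follows; the separations and overhead below are hypothetical numbers chosen for illustration:

```python
def relative_resolvability(dC_mit, dC_noisy, gamma):
    """chi = (1/gamma) * (dC_mit / dC_noisy)^2: the ratio of shots needed
    without vs. with mitigation to resolve two cost values at fixed precision,
    assuming the shot count scales as variance / separation^2."""
    return (dC_mit / dC_noisy) ** 2 / gamma

# Mitigation widens the gap 10x but incurs a 200x variance overhead:
chi = relative_resolvability(dC_mit=0.10, dC_noisy=0.01, gamma=200.0)
```

Here χ = 0.5 < 1: even though mitigation restores a larger separation between the two cost values, the variance overhead dominates and the mitigated points are harder, not easier, to resolve.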
For general cost functions, the relative resolvability of cost function points after mitigation may vary significantly across the landscape, or differ between choices of ansatz and noise model. This motivates us to seek averaged measures of resolvability. We consider two types of averaging: first, an average over cost function points generated by a given ansatz, noise model and cost; second, an average over a set of noisy states.

Definition 3 (Average relative resolvability I). Denote the vector of parameters that corresponds to the global cost minimum at noise parameter ε as θ*. We then define the averaged relative resolvability as

χ̄(ε) = (1/γ) ⟨ΔC_m(θ_i, θ*, ε)²⟩_i / ⟨ΔC̃(θ_i, θ*, ε)²⟩_i,

where ⟨•⟩_i denotes the mean over all parameter vectors θ_i accessible with the given ansatz of consideration, and where we denote

ΔC_m(θ_i, θ*, ε) = C_m(θ_i, ε) − C_m(θ*, ε),   ΔC̃(θ_i, θ*, ε) = C̃(θ_i, ε) − C̃(θ*, ε).

Averaging across a given cost landscape gives a result that is particular to the choice of ansatz, measurement operator and noise model. In order to evaluate the performance of error mitigation in a more general setting, we consider a broader average over noisy states that have the same spectrum. This choice of class of noisy states is motivated by the fact that a central mechanism of cost concentration and NIBPs under unital Pauli noise is the loss of purity. When we consider the second averaged relative resolvability for Virtual Distillation in Section III B 3, it will turn out to be bounded by a function of the purity for such states.

Definition 4 (Average relative resolvability II). Consider a normalized spectrum λ ∈ R_+^{2^n} which corresponds to the eigenspectrum of some noisy state. We define the 2-design-averaged relative resolvability as

χ̂(λ, ε) = (1/γ) ⟨ΔC_m(U_1, U_2, ε)²⟩_{U_1,U_2} / ⟨ΔC̃(U_1, U_2, ε)²⟩_{U_1,U_2},

where ⟨•⟩_{U_i} denotes an average over unitaries U_i drawn from a unitary 2-design, ρ_λ is an arbitrarily chosen reference state with spectrum λ, and where we denote

ΔC_m(U_1, U_2, ε) = Tr[O M(U_1 ρ_λ U_1†)] − Tr[O M(U_2 ρ_λ U_2†)],

where M : S(H) → B(H) is a map that describes the action of the error mitigation protocol.
In Fig. 4 we present a schematic of our second averaged relative resolvability.
In the results that follow we do not constrain ourselves to investigating specific models of Pauli noise. Instead, we simply suppose that there exists a noise model that causes some concentration of the cost function onto some state-independent fixed point. Any further assumptions on the effects of the noise model are explicitly specified in the statement of the particular result in question. As a precursor to looking at more generic models of noisy states, we will also find it useful to investigate the performance of error mitigation strategies in the presence of global depolarizing noise of the form D(ρ) = (1 − p)ρ + p 𝟙/2ⁿ, where D is the global depolarizing channel and p is the depolarizing probability. Our justification for studying this noise model is twofold. First, global depolarizing noise provides a clean model of cost concentration with no other cost-corrupting effects of the noise. Therefore, if a given error mitigation strategy is to mitigate the effects of cost concentration and NIBPs, we expect it to perform well on this noise model. Second, the structure of many error mitigation strategies is directly motivated by the model of global depolarizing noise [45,[81][82][83][84][85][86]. Indeed, many such strategies have been shown to achieve good or perfect performance with this noise model in mitigating noisy cost function values [45][46][47]82]. However, we stress that trainability may simultaneously get worse, which is what we will now investigate.
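Cost concentration under global depolarizing noise can be seen in a few lines. The sketch below assumes the standard behavior that L layers of global depolarizing noise shrink any cost value toward its fixed point as (1 − p)^L; the specific numbers are illustrative.

```python
import numpy as np

def noisy_cost(c_clean, L, p, fixed_point=0.0):
    """Cost after L layers of global depolarizing noise: the clean value
    decays geometrically toward the state-independent fixed point Tr[O]/2^n."""
    decay = (1 - p) ** L
    return decay * c_clean + (1 - decay) * fixed_point

# Two landscape points drift exponentially close together as depth grows.
c1, c2, p = 0.8, -0.3, 0.05
for L in (10, 50, 200):
    gap = abs(noisy_cost(c1, L, p) - noisy_cost(c2, L, p))
    print(f"L={L:4d}  |C1 - C2| = {gap:.2e}")
```

The printed gaps shrink as (1 − p)^L, so resolving the two points to fixed precision requires a shot count growing exponentially in the depth.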

Zero Noise Extrapolation
In this section we present our results on Zero Noise Extrapolation.First, we consider the simple model of global depolarizing noise.
Proposition 1 (Relative resolvability of Zero Noise Extrapolation with global depolarizing noise, 2 noise levels). Consider a circuit with L instances of global depolarizing noise of the form Eq. (32). Consider a Richardson extrapolation strategy based on Eq. (7), an exponential extrapolation strategy based on Eq. (9) and a NIBP extrapolation strategy based on Eq. (10). We presume access to an augmented noisy circuit where the error probability is exactly increased by a factor a1 > 1 as p → a1p. Then we have χ_depol ≤ 1 for all of the above extrapolation strategies with access to 2 noise levels, where χ_depol is the relative resolvability (see Definition 2) for global depolarizing noise; the bound involves a strategy-dependent constant, equal to a1^{−(L+1)} for NIBP extrapolation.
We see that for all the above techniques, Zero Noise Extrapolation with access to 2 noise levels decreases the resolvability of the cost function under global depolarizing noise. Further, if one attempts to directly reverse the exponential scaling of NIBPs that global depolarizing noise incurs, one obtains an exponentially worse relative resolvability. We now consider how resolvability behaves under Zero Noise Extrapolation on average across the cost landscape, given a generic noise model. Proposition 2 (Average relative resolvability of Zero Noise Extrapolation, 2 noise levels). Consider a Richardson extrapolation strategy based on Eq. (7), an exponential extrapolation strategy based on Eq. (9) and a NIBP extrapolation strategy based on Eq. (10). We presume perfect access to an augmented noisy circuit where the noise rate is increased by a factor a1 > 1. We denote θ_ε* as the parameter corresponding to the global cost minimum at base noise parameter ε, and denote by z the ratio of the average cost differences at the base and boosted noise levels. Then, under the assumption that z ≥ 1 and ⟨∆C(θ_{i,ε*}, a1ε)⟩_i ≥ 0, any such noise model has an average relative resolvability χ ≤ 1 for all of the above extrapolation strategies with access to 2 noise levels.
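The mechanism behind Proposition 1 can be sketched numerically. The example below is a simplified model, not the paper's proof: it uses two-point Richardson extrapolation, a depolarizing decay (1 − p)^L for cost gaps, and the bound γ ≥ A² + B² on the sampling overhead of a linear combination C_m = A·C̃(p) − B·C̃(a p) of two noisy estimates with (at least) equal per-shot variances.

```python
import numpy as np

def richardson_2pt(c_base, c_boost, a):
    """Two-point Richardson extrapolation to zero noise:
    C_m = (a*C(p) - C(a*p)) / (a - 1)."""
    return (a * c_base - c_boost) / (a - 1)

# Global depolarizing model: a cost gap shrinks as (1 - p)^L.
dC, L, p, a = 1.0, 20, 0.02, 2.0
gap_base = (1 - p) ** L * dC
gap_boost = (1 - a * p) ** L * dC
gap_mitigated = richardson_2pt(gap_base, gap_boost, a)

# Overhead factor gamma >= A^2 + B^2 for the combination A*C(p) - B*C(a*p).
A, B = a / (a - 1), 1 / (a - 1)
gamma = A**2 + B**2
chi = (gap_mitigated / gap_base) ** 2 / gamma
print(f"mitigated gap = {gap_mitigated:.3f}, chi = {chi:.3f}")
```

The extrapolation does widen the gap, but the variance amplification γ dominates, so the toy resolvability ratio χ lands below 1, in line with the proposition.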
Proposition 2 shows that under mild assumptions on the effect of the noise on the cost landscape, Zero Noise Extrapolation with access to 2 noise levels impairs the resolvability of the cost landscape. These assumptions have physical meaning: z ≥ 1 implies that on average the cost concentrates when the noise parameter is boosted, whilst ⟨∆C(θ_{i,ε*}, a1ε)⟩_i ≥ 0 implies that the landscape is not heavily corrupted after boosting the noise parameter, so that the global minimum at the base noise level remains below the average cost value. We also see that in the presence of exponential cost concentration and NIBPs, the relative resolvability is exponentially small if one attempts to directly reverse the exponential scaling of NIBPs.
In the Appendix, we study a modification of the averaged resolvability in Definition 4 and find that it is bounded by a function of the purity of the noisy states, such that the resolvability decreases if purity decreases with increasing noise level. This result, along with the proofs of the above propositions, can be found in Appendix D 1. Finally, we remark that in the above results we consider a scenario where the Richardson, exponential or NIBP extrapolation strategies utilize expectation values from only two noise levels. In Appendix D 1 c we show that similar results may be obtained for Richardson extrapolation with access to 3 distinct noise levels.

Virtual Distillation
Here we present our results on Virtual Distillation.In the following proposition we start again with the simple model of global depolarizing noise.
Proposition 3 (Relative resolvability of Virtual Distillation with global depolarizing noise). Consider global depolarizing noise of the form in Eq. (32) acting on some n-qubit pure state ρ with error probability p. We consider the two error mitigation protocols of Ref. [47] (denoted "A" and "B") to respectively prepare (12) and (13), using M copies of a quantum state. The relative resolvabilities to resolve any two arbitrary cost function points satisfy χ ≤ Γ(n, M, p) for all n ≥ 1, M ≥ 2, p ∈ [0, 1], where Γ(n, M, p) is a monotonically decreasing function of M (with asymptotically exponential decay) for n ≥ 1, M ≥ 2. Within this region the bound is saturated as Γ(1, 2, p) = 1 for all p.
Proposition 3 shows that Virtual Distillation decreases the resolvability of cost landscapes suffering from global depolarizing noise.Moreover, as the number of state copies M increases, the effect worsens.We find similar results in the following proposition when considering averaged resolvabilities over a class of noisy states.
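The Virtual Distillation estimator Tr[ρ^M O]/Tr[ρ^M] itself is easy to reproduce with density matrices. The sketch below (random state and observable, illustrative noise level) shows the estimator sharpening a depolarized expectation value toward the clean one as M grows; note that Propositions 3 and 4 concern the shot cost of resolving such values, which this noiseless-statistics sketch does not capture.

```python
import numpy as np

def vd_estimate(rho, O, M):
    """Virtual Distillation estimator Tr[rho^M O] / Tr[rho^M]."""
    rho_M = np.linalg.matrix_power(rho, M)
    return np.trace(rho_M @ O).real / np.trace(rho_M).real

n = 3
d = 2 ** n
rng = np.random.default_rng(0)
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
O = np.diag(rng.choice([-1.0, 1.0], size=d))   # random diagonal +/-1 observable

p = 0.3                                        # global depolarizing probability
rho = (1 - p) * np.outer(psi, psi.conj()) + p * np.eye(d) / d

clean = (psi.conj() @ O @ psi).real
print("clean   :", clean)
print("noisy   :", vd_estimate(rho, O, 1))     # M=1 recovers the noisy value
print("VD, M=2 :", vd_estimate(rho, O, 2))
```

For global depolarizing noise the dominant eigenvector of ρ is the clean state, so the M = 2 estimate lies much closer to the clean value than the noisy one; the trade-off identified in the text is that the required shot count grows with M.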
Proposition 4 (Average relative resolvability of Virtual Distillation). Consider an error mitigation protocol that prepares the Virtual Distillation estimator. Consider the average relative resolvability χ_λ for noisy states of some spectrum λ with purity P, as defined in Definition 4. We have χ_λ ≤ G(n, M, P), where G(n, M, P) is a monotonically decreasing function of M (with asymptotically exponential decay) for all n ≥ 1, M ≥ 2. Within this region the bound is saturated as G(1, 2, P) = 1 for all P, and as G(n, M, 1) = 1 for all n ≥ 1, M ≥ 2.
We present the explicit forms of Γ(n, M, p) and G(n, M, P), as well as a proof of the above propositions, in Sec. D 2 of the Appendix. We note that in most cases the bound given by Proposition 4 decreases with decreasing noisy-state purity. This indicates that within such settings, the greater the loss of purity due to noise, the worse the impact on resolvability after error mitigation with Virtual Distillation.

Probabilistic Error Cancellation
Here we present our results for Probabilistic Error Cancellation.We utilize the optimal quasiprobability decompositions studied in Ref. [87], and the proofs can be found in Section D 3 of the Appendix.

Proposition 5 (Relative resolvability of Probabilistic Error Cancellation under global depolarizing noise).
Consider a quasiprobability method that corrects global depolarizing noise of the form (32). For any pair of states corresponding to points on the cost function landscape, the optimal quasiprobability scheme gives χ_depol ≥ 1 for all n ≥ 1, p ∈ [0, 1], which is achieved with access to noisy Pauli gates.
Proposition 5 shows that for the special case of global depolarizing noise, Probabilistic Error Cancellation actually improves the resolvability of the noisy cost landscape. However, this improvement is generally small and decreases quickly with the number of qubits n. For instance, for n = 1, χ_depol has maximum value 4/3 (achieved in the limit of maximum depolarization probability). For n = 2, χ_depol has maximum value ≈ 1.07. In the limit of large n, χ_depol tends to 1.
We extend this study to local depolarizing noise in Appendix D 3. We find that for a single instance of local depolarizing noise, Probabilistic Error Cancellation can either improve resolvability or worsen it, depending on the strength of the concentration of the cost. In addition, we show in the following proposition that if one wishes to mitigate all the noisy gates in the circuit and one has NIBP scaling, the improvement due to Probabilistic Error Cancellation degrades exponentially, and ultimately for large problem sizes this impairs resolvability.
Proposition 6 (Scaling of Probabilistic Error Cancellation with local depolarizing noise). Consider local depolarizing noise with depolarizing probability p acting in L layers through a depth-L circuit as in Eq. (3). Suppose that the effect of this noise is to cause cost concentration for some constant A and noise parameter q ∈ [0, 1). Then the optimal quasiprobability method to mitigate the depolarizing noise in the circuit yields the bound (41). Thus, the average relative resolvability has unfavourable scaling with system size.
We note that (41) gives the best possible scaling of noisy cost differences allowed by (5) under local depolarizing noise. Thus, Proposition 6 shows that even if the cost has NIBP scaling under local depolarizing noise, one still has unfavourable scaling for resolvability.

Linear ansatz methods
In Proposition 7 we consider a scenario where the same linear ansatz (15) is applied to two points on the noisy cost landscape. For Clifford Data Regression this is a reasonable assumption in scenarios where one is comparing two points that are close in parameter space, for instance, when a simplex-based optimizer is exploring a small local region. However, we remark that this is not always true in general settings.
Proposition 7 (Linear ansatz methods). Consider any error mitigation strategy that mitigates a noisy cost function value C̃(θ) by constructing an estimator C_m(θ) of the form (15). For any two noisy cost function points to which the same ansatz is applied, we have χ ≤ 1 for any noise process.
Corollary 2 (Linear ansatz methods under global depolarizing noise).Under global depolarizing noise, the optimal linear ansatz gives χ = 1 for any pair of cost function points.
Corollary 2 comes simply by noting that the optimal choice of linear ansatz under global depolarizing noise corrects the noise exactly and is state independent [45]. The above results imply that in some settings CDR has a neutral effect on the resolvability of the cost function landscape. This opens up the possibility that in practical settings CDR can improve the trainability of cost landscapes, if it can remedy other cost-corrupting effects due to noise beyond cost concentration. This motivates our numerical studies of CDR, which we present in the following section.
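The linear-ansatz fit at the heart of CDR can be sketched with synthetic data. Everything below is illustrative, not the paper's implementation: the "training set" stands in for exact and noisy costs of near-Clifford circuits, and the affine noise model (slope 0.6, offset 0.1, small shot noise) is an assumption chosen so the fit is easy to check.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data: exact costs of near-Clifford circuits and their
# noisy counterparts under a hypothetical affine noise model.
exact = rng.uniform(-1, 1, size=30)
noisy = 0.6 * exact + 0.1 + rng.normal(scale=0.01, size=30)

# Least-squares fit of the linear ansatz  C_exact ~= a * C_noisy + b.
A = np.column_stack([noisy, np.ones_like(noisy)])
(a, b), *_ = np.linalg.lstsq(A, exact, rcond=None)

# Mitigate a new noisy cost value from the target circuit (true value 0.42).
c_target_noisy = 0.6 * 0.42 + 0.1
c_mitigated = a * c_target_noisy + b
print(f"a={a:.3f}, b={b:.3f}, mitigated={c_mitigated:.3f}")  # ~0.42
```

Because the same fitted (a, b) is applied to every point, the map rescales all cost differences by the same factor, which is the intuition behind the neutral-resolvability results above.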

IV. Numerical Results
As discussed in Sec. III B, in many settings current state-of-the-art error mitigation methods do not mitigate the effects of cost concentration. Nevertheless, as discussed in Sec. II B 3, the trainability of VQAs is also affected by other cost-corrupting effects. We expect that error mitigation can reverse some of these cost-corrupting effects when cost concentration is not too severe. In this section, we numerically investigate the effects of error mitigation on trainability in such a setting to provide possible evidence for beneficial effects of error mitigation. To this end, we focus on CDR, as in some settings it does not worsen the effects of cost concentration, as shown in Sec. III B 5.
We perform our numerical experiments by simulating the Quantum Approximate Optimization Algorithm (QAOA) [24] for 5-qubit MaxCut problems.We use a realistic noise model of an IBM quantum computer [88], which has been obtained by gate set tomography of IBM's Ourense quantum device.Further, we assume linear connectivity of the simulated quantum computer.
A MaxCut problem is defined for a graph G = (V, E) of nodes V and edges E. The problem is to find a bipartition of the nodes into two sets which maximizes the number of edges connecting the sets. This problem can be reformulated as finding the ground state of the Hamiltonian H_MaxCut = Σ_{(i,j)∈E} Z_i Z_j, where Z_i, Z_j are Pauli Z matrices. Here we consider graphs with n = 5 vertices, with 36 instances generated randomly according to the Erdős–Rényi model [89], where for each pair of vertices in the graph there is a connecting edge with probability 0.5.
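Since H_MaxCut is diagonal in the computational basis, the whole problem setup fits in a few lines. This is a minimal sketch of the construction above (the helper name and the seed are our own), for a random Erdős–Rényi G(n, 0.5) instance:

```python
import itertools
import numpy as np

def maxcut_hamiltonian_diag(n, edges):
    """Diagonal of H = sum_{(i,j) in E} Z_i Z_j in the computational basis."""
    diag = np.zeros(2 ** n)
    for idx in range(2 ** n):
        bits = [(idx >> k) & 1 for k in range(n)]
        diag[idx] = sum((1 - 2 * bits[i]) * (1 - 2 * bits[j]) for i, j in edges)
    return diag

rng = np.random.default_rng(42)
n = 5
edges = [(i, j) for i, j in itertools.combinations(range(n), 2)
         if rng.random() < 0.5]                    # Erdős–Rényi G(n, 0.5)

diag = maxcut_hamiltonian_diag(n, edges)
ground = diag.min()
cut = (len(edges) - ground) / 2                    # edges cut by the best bipartition
print(f"{len(edges)} edges, ground energy {ground:.0f}, max cut {cut:.0f}")
```

An edge contributes +1 when its endpoints are on the same side and −1 otherwise, so the energy of a bitstring equals |E| minus twice its cut value, which is why minimizing H_MaxCut maximizes the cut.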
To approximate the ground state of H_MaxCut we simulate the QAOA for numbers of rounds p = 1 to 8. The QAOA ansatz applied to the input state |+⟩^{⊗n} is ∏_{j=1}^{p} e^{−iβ_j H_M} e^{−iγ_j H_MaxCut} |+⟩^{⊗n}, where H_M = Σ_j X_j, the X_j are Pauli X matrices, |+⟩ = (|0⟩ + |1⟩)/√2, and β_j, γ_j are variational parameters. We minimize the cost function ⟨H_MaxCut⟩ using the Nelder–Mead algorithm [72]. We perform the optimization with shot budgets ranging from N_tot = 10⁷ to 1.5 × 10⁸, where N_tot is defined as the total number of shots spent on the optimization. We detail the optimization procedure in Appendix E. In our numerics, the values of N_tot are chosen to enable implementation of the optimization with current quantum computers. To quantify the quality of the noisy or CDR-mitigated optimization, we compute approximation ratios of the solutions using the exact expectation value of H_MaxCut. The approximation ratio is defined here as the ratio of a given solution's energy to the true ground state energy. We gather our numerical results for CDR in Fig. 5, where we plot the approximation ratio averaged over 36 randomly chosen graphs versus N_tot. We compare the quality of the solutions of noisy (unmitigated) and CDR-mitigated optimization and find that CDR-mitigated optimization outperforms noisy optimization for all considered p and N_tot values. We observe that the solutions for p = 2 outperform those for p = 1 for both CDR-mitigated and noisy optimization. The quality of p > 2 solutions declines with increasing p for noisy optimization, while it remains approximately the same for p = 2 to 6 for CDR-mitigated optimization. With CDR-mitigated optimization we see a decrease in the quality of solutions for the largest considered p = 8.
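The QAOA cost function above can be evaluated exactly with a small statevector simulation. This is a noiseless sketch only (the paper's numerics use a tomography-based noise model of IBM's Ourense device); the helper names are our own, and the cost layer exploits the fact that H_MaxCut is diagonal.

```python
import numpy as np

def apply_rx_all(state, beta, n):
    """Apply exp(-i*beta*X) to every qubit of an n-qubit statevector."""
    c, s = np.cos(beta), -1j * np.sin(beta)
    for q in range(n):
        state = state.reshape(2 ** (n - q - 1), 2, 2 ** q)   # middle axis = qubit q
        state = np.stack([c * state[:, 0] + s * state[:, 1],
                          s * state[:, 0] + c * state[:, 1]], axis=1)
    return state.reshape(-1)

def qaoa_cost(params, h_diag, n):
    """<H> for the p-round QAOA ansatz applied to |+>^n.
    params = [gamma_1, beta_1, ..., gamma_p, beta_p]."""
    state = np.full(2 ** n, 2 ** (-n / 2), dtype=complex)    # |+>^n
    for gamma, beta in zip(params[0::2], params[1::2]):
        state = np.exp(-1j * gamma * h_diag) * state         # diagonal cost layer
        state = apply_rx_all(state, beta, n)                 # mixer layer
    return float(np.real(state.conj() @ (h_diag * state)))

# Sanity check: at zero angles the state stays |+>^n and <Z_i Z_j> = 0.
h_diag = np.array([1.0, -1.0, -1.0, 1.0])                    # H = Z_0 Z_1 on 2 qubits
print(qaoa_cost(np.zeros(4), h_diag, 2))
```

In a full experiment this cost would be fed to Nelder–Mead (e.g. `scipy.optimize.minimize(..., method="Nelder-Mead")`) with shot-noise added to each evaluation.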
The numerical results presented here are obtained for circuits shallow enough to be trainable while using the CDR-mitigated cost function.Therefore they are outside of the NIBP scaling regime.As discussed in Section II B, even outside the NIBP regime noise may adversely impact trainability by corrupting the cost function landscape, which error mitigation has a chance to remedy.

FIG. 5.
Comparing CDR-mitigated and noisy optimization for 5-qubit MaxCut QAOA. We plot the approximation ratio of solutions for noisy (red circles) and CDR-mitigated (blue diamonds) optimization of MaxCut QAOA for 5 qubits. Different panels show results for different numbers of QAOA rounds p, plotted versus the total number of shots N_tot spent on the optimization of a MaxCut problem. Here, we compute the approximation ratios using exact H_MaxCut energies to benchmark the quality of the noisy and CDR-mitigated optimization. For each p we average the approximation ratio over 36 MaxCut graphs chosen randomly from the Erdős–Rényi ensemble. For all p and N_tot values we see an advantage of the CDR-mitigated optimization over noisy optimization.
Our results give hope that CDR-mitigated optimization may overall offer a trainability advantage for problems with such cost function landscape corruption.
As discussed in Section III B, optimizing an error-mitigated cost function is not guaranteed to outperform noisy optimization even outside the NIBP regime. Indeed, we find numerically that for p = 2, 4, optimization with Virtual Distillation does not outperform noisy optimization for the considered implementations (see Appendix E 2).

V. Discussion
Noise can exponentially degrade the trainability of linear (or superlinear) depth Variational Quantum Algorithms (VQAs) by flattening the cost landscape, thus requiring a precision exponential in system size to resolve its features [53,54]. This limits the scope for achieving possible quantum advantage with VQAs. At present there are no known strategies to avoid this exponential scaling completely aside from pursuing algorithms with sublinear circuit depth, and current strategies to mitigate this effect consist only of reducing hardware noise rates. Thus, it is a pressing challenge to search for possible solutions to this problem. Error mitigation strategies emerge as a natural candidate to tackle this problem under near-term constraints.
In this work we investigate the effects of error mitigation on the trainability of noisy cost function landscapes in two regimes. First, we work in the asymptotic regime (in terms of scaling with system size) and find that if a VQA suffers from exponential cost concentration, requiring an exponential number of shots to accurately resolve cost values, then a broad class of error mitigation strategies (including as special cases Zero Noise Extrapolation, Virtual Distillation, Probabilistic Error Cancellation and Clifford Data Regression) cannot remove this exponential scaling. Within the considered paradigm, this implies that at least an exponential number of resources must be spent to extract information from the cost landscape accurate enough to find a cost-minimizing optimization direction. In Corollary 1 we identify circuit samples (or shots) as well as the number of copies of a quantum state as two such resources.
Second, we move out of the asymptotic regime and investigate whether or not particular error mitigation protocols can improve the resolvability of noisy cost landscapes.Should such a landscape be burdened with exponential cost concentration, this would correspond to an improvement in the coefficient in the exponential scaling.Our results indicate that some error mitigation protocols can worsen the resolvability, and ultimately the trainability, of cost landscapes in certain settings.In particular, in Propositions 3 and 4 we show analytically that Virtual Distillation impairs resolvability with worsening resolvability as the number of state copies increases.We obtain similar results for Zero Noise Extrapolation in Propositions 1 and 2 under some assumptions of the cost landscape.Numerical analysis of a particular MaxCut problem indicates that trainability is overall similarly impaired for Virtual Distillation.
Clifford Data Regression (CDR) distinguishes itself from the other error mitigation techniques considered in this article, as in contrast to the other protocols it does not necessarily increase the statistical uncertainty of cost values more than it reverses their concentration.This is reflected in the fact that under a global depolarizing noise model, CDR has neutral impact on resolvability (Corollary 2).However, it is also known that CDR can remedy the effects of more complex noise models.This indicates that CDR could resolve trainability issues arising due to corruptions of the cost function outside of cost concentration, whilst having a neutral effect on cost concentration itself, and thus overall improve trainability.In the numerical example studied, presented in Fig. 5, we observe this to be the case.This points to deeper future work studying the mechanisms that allow error mitigation to improve the trainability of noisy cost landscapes.
Finally, we identify that the broad class of error mitigation protocols we study in our asymptotic analysis all only consist of post-processing expectation values of noisy circuits, as summarized in Fig. 3.This gives intuition as to why they cannot escape the exponential scaling of noise-induced barren plateaus (NIBPs).However, the theory of error correction indicates that with sufficient resources NIBPs can indeed be avoided.This gives hope that there can exist novel error mitigation strategies that move beyond the framework of the protocols considered in this article and thereby avoid the exponential impairment to trainability that NIBPs present.
prove our main results, as well as provide further details on the error mitigation protocols studied in this article. In Appendix B we derive some useful lemmas that are required for our proofs. In Appendix C we present the proof of our asymptotic results on the exponential concentration of estimators. In Appendix D we present our protocol-specific results on the change in resolvability of the cost landscape under error mitigation. Finally, in Appendix E we discuss details of our numerical implementations for Clifford Data Regression and Virtual Distillation. This approximates C(θ_i, 0) to a higher order in ε than C̃(θ_i, ε). NIBP extrapolation. We can construct an alternative Zero Noise Extrapolation strategy that is tailored to noisy cost function values dominated by NIBP scaling as in Eq. (5). We model the effects of noise polynomially in the noise parameter q for all noisy cost function points C̃(θ_i, q), where C(θ_i, 0) is the fixed point of the noise and B(θ_i) + C(θ_i, 0) is the noise-free cost value. (Note that for NIBPs we cannot consider lower-order polynomials of q, as otherwise the NIBP condition would be broken for small q.) We construct estimators for any given parameter θ_i, where C(θ_i, 0) is the parameter-independent fixed value obtained at maximum noise level, and K = C(θ_i, 0) − Σ_k p_k is an additive constant. As we are only interested in cost function differences for our results, this constant will cancel out.

b. Probabilistic Error Cancellation
General idea. Probabilistic Error Cancellation utilizes many modified circuit runs in order to construct a quasiprobability representation of the noise-free cost function [42,43]. We assume that the effect of the noise can be described by a quantum channel N that occurs after a gate denoted by the unitary channel U. For now we assume this is the only instance of noise in the circuit; however, we will later generalize to many instances of noise. The goal of Probabilistic Error Cancellation is to simulate the inverse map N⁻¹. Note that in general this will not correspond to a CPTP map. Despite this, if we have a basis of (noisy) quantum channels {B_α}_α, corresponding to experimentally available channels, we can expand the inverse map in this basis as N⁻¹ = Σ_α q_α B_α for some set of q_α ∈ ℝ. Then, the channel that describes the noiseless gate can be written as Σ_α q_α K_α, where we have defined K_α = B_α ∘ N ∘ U. Denote the input state to the gate as ρ_in and a measurement operator as O. The expectation value can then be written as a quasiprobability combination, where for simplicity we first assume that U is the only gate in the circuit, and where sgn(q_α) denotes the sign of q_α. We call this the quasiprobability representation of the gate U. The idea is that if we have access to the set of CPTP maps {B_α}_α in our noisy native hardware gate set, then we can obtain an estimate of the noiseless expectation value as follows: (1) with probability p_α, prepare the circuit corresponding to K_α(ρ) and obtain C_{K_α(ρ)}; (2) multiply the measurement result by sgn(q_α)G_N; (3) repeat this process many times and sum the results.
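The expansion above can be checked exactly for single-qubit depolarizing noise. This is a simplified sketch under our own assumptions (perfect Pauli gates available after the noisy channel, whereas the paper's optimal schemes use noisy Paulis): the twirl identity (1/4)Σ_P PρP = Tr[ρ]·I/2 gives the quasiprobability inverse in closed form.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)
PAULIS = [I2, X, Y, Z]

def depolarize(rho, p):
    return (1 - p) * rho + p * np.trace(rho) * I2 / 2

# Quasiprobability inverse of single-qubit depolarizing noise:
# D^{-1} = 1/(1-p) * id - p/(1-p) * (1/4) * sum_P P . P,
# i.e. c_I = (4-p)/(4(1-p)) and c_X = c_Y = c_Z = -p/(4(1-p)).
p = 0.2
c = np.array([4 - p, -p, -p, -p]) / (4 * (1 - p))
gamma = np.abs(c).sum()            # sampling overhead, equals (2+p)/(2(1-p))

rho = np.array([[0.9, 0.3], [0.3, 0.1]], dtype=complex)   # a pure input state
noisy = depolarize(rho, p)

# Linearly combine the Pauli-corrected noisy outputs with quasiprobabilities.
recovered = sum(ci * P @ noisy @ P.conj().T for ci, P in zip(c, PAULIS))
print(np.allclose(recovered, rho), f"gamma = {gamma:.3f}")
```

In a sampling implementation one would draw the Pauli correction with probability |c_α|/γ and multiply each outcome by γ·sgn(c_α); the negative coefficients are what inflate the variance by roughly γ².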
for any two operators σ, ρ ∈ B(H), dim(H) = d.This implies Proof.We use the following standard expressions for integration with respect to the Haar measure over the unitary group of degree d: where w i,j are the matrix elements of the unitary W ∈ U(d).Then, the expectation values over 2-designs can be evaluated as and where we have used Tr[ρ] = Tr[σ] = 1.The final statement comes by noting that which can be factorized to give the desired result.

Appendix C: Exponential estimator concentration
We present a proof of Theorem 1, and restate the result here for convenience.
Theorem 1. Consider an error mitigation protocol that prepares the quantity E_{σ,X,M,k} for some quantum state σ ∈ S(H), for |0⟩⟨0| ∈ S(H′) and for some X ∈ B(H^{⊗M} ⊗ H′^{⊗k}). We suppose σ is prepared with a depth-L_σ circuit and experiences noise according to Eq. (3). Under these conditions, E_{σ,X,M,k} exponentially concentrates on a state-independent fixed point in the depth of the circuit, where 𝟙 ∈ S(H) denotes the (normalized) n-qubit identity operator and the noise parameter satisfies q ∈ [0, 1).

Proof. The first inequality is due to the matrix Hölder inequality, and the second inequality follows from Lemma 2. Similarly, we have M − 1 further such equations, which we display together with the original equation: ... (C11) Summing these equations gives the desired bound.
We now present a more detailed statement of Corollary 1, which explains how one can spend an exponential number of resources in different ways in order to resolve concentrated cost values.
Corollary 1 (Exponential estimator concentration). Consider an error mitigation protocol that approximates the noise-free cost value C(θ) by estimating the quantity (C15), where each E_{σ,X,M,k} takes the form (C1). We denote M_max and a_max as the maximum values of M and a_{X,M,k} respectively accessible from a set T defined by the given protocol. Assuming ∥X∥_∞ ∈ O(poly(n)), there exists a fixed point F independent of θ on which C_m exponentially concentrates, for some constant β ≥ 1, if the circuit depths scale at least linearly for all σ(θ) in the construction (C15). That is, if the depth of the circuits scales linearly or faster, then one requires at least exponential resources to distinguish C_m from its fixed point, for instance in one of the following ways: • a_max|T|M_max ∈ O(poly(n)), and an exponentially large number of shots is used to distinguish two quantities with exponentially small separation; • a_max|T| ∈ Ω(2^{βn}) for some constant β ≥ 1, and an exponentially large number of shots is required to distinguish two quantities with exponentially large statistical uncertainty, as measurement outcomes are multiplied by a_max|T|;
• M_max ∈ Ω(2^{βn}) for some constant β ≥ 1, and an exponentially large number of copies of some quantum state σ is required.
Proof. Explicitly applying the result of Theorem 1 to the construction of C_m(θ) in (C15), and using (C3) in the second line, we have the stated bound,
as required, where we denote β = min_θ β(θ) and the fixed point as F, noting that F is indeed parameter independent. From here, we can inspect the three presented cases: • If a_max|T|M_max ∈ O(poly(n)), then C_m has exponentially small separation from F.
• There exists a choice β ≥ 1 such that if a_max|T| ∈ Ω(2^{βn}), then C_m is not exponentially concentrated on F; however, C_m now has an exponentially large statistical uncertainty, as measurement outcomes are multiplied by coefficients of order a_max|T|.
• There exists a choice β ≥ 1 such that if M_max ∈ Ω(2^{βn}), then C_m is not exponentially concentrated on F.
for all a ≥ 1. Thus, C̃(θ, a) denotes the noisy cost value at the boosted noise level, and C̃(θ, 1) denotes the noisy cost value at the base noise level.
Throughout this section, in order to estimate the sample cost of error mitigation we will make the key assumption that Var[C̃(θ, a)] ≥ Var[C̃(θ, 1)] for all θ and a ≥ 1; that is, the statistical fluctuations in measurement outcomes at the boosted noise level are no smaller than those at the base noise level. We note that similar assumptions are made in the literature for Zero Noise Extrapolation and Quasi-Probability Methods [43,77]. Indeed, for noise models with a maximally mixed fixed point, we expect that high noise rates will tend to lead to larger variances. For example, in the simple scenario of a local Pauli measurement, the variance of measurement outcomes takes the form (1 − p₀)p₀/N, where p₀ is the probability of obtaining a "0" outcome and N is the number of shots. This variance is maximized for p₀ = 1/2.
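A quick numerical check of this variance formula (batch count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 10_000                                          # shots per estimate
for p0 in (0.5, 0.9, 0.99):
    # 1000 batches of N Bernoulli(p0) shots; variance of the batch means.
    outcomes = rng.random((1000, N)) < p0
    empirical = outcomes.mean(axis=1).var()
    predicted = (1 - p0) * p0 / N
    print(f"p0={p0}: empirical {empirical:.2e}, predicted {predicted:.2e}")
```

The empirical variances track (1 − p₀)p₀/N and peak at p₀ = 1/2, consistent with noise (which pushes outcome probabilities toward 1/2) increasing statistical fluctuations.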
a. Relative resolvability under global depolarizing noise

Proposition 1 (Relative resolvability of Zero Noise Extrapolation with global depolarizing noise, 2 noise levels). Consider a circuit with L instances of global depolarizing noise of the form (32). Consider a Richardson extrapolation strategy based on Eq. (A1), an exponential extrapolation strategy based on Eq. (A5) and a NIBP extrapolation strategy based on Eq. (A7). We presume access to an augmented noisy circuit where the error probability is exactly increased by a factor a1 > 1 as p → a1p. Then we have the bound (D4), where χ_depol is the relative resolvability (see Definition 2) for global depolarizing noise and the strategy-dependent constant equals a1^{−(L+1)} for NIBP extrapolation. Thus, χ_depol ≤ 1 for all of the above extrapolation strategies with access to 2 noise levels.
Proof. Upon inspecting Eqs. (A2), (A6) and (A8), one can verify that the Richardson, exponential and NIBP extrapolation strategies all take the form C_m = A C̃(1) − B C̃(a1) + E, where A, B ≥ 0 (note that for NIBP extrapolation E contains the state-independent cost value that represents the fixed point of the noise), and where we have adopted the notation defined in (D1). We note that under L instances of global depolarizing noise (of the form (32)) with error probability p, noisy cost differences are given by ∆C̃(p) = (1 − p)^L ∆C for any pair of cost function points, where ∆C is the corresponding noise-free cost difference.
The error-mitigated cost function difference ∆C_m = C_m(θ1) − C_m(θ2) between two arbitrary points is then ∆C_m = A ∆C̃(1) − B ∆C̃(a1). Inspecting (D7), we see that the error mitigation cost can be bounded simply as γ ≥ A² + B², where the inequality comes from our core assumption (D2). Inserting γ, ∆C_m and ∆C̃(1) into Definition 2 yields the stated bound, where we have denoted c = A/B. By inspecting the specific values of A and B for the Richardson, exponential and NIBP extrapolation strategies respectively, we obtain the results for each strategy.

b. Average relative resolvability
Proposition 2 (Average relative resolvability of Zero Noise Extrapolation, 2 noise levels). Consider a Richardson extrapolation strategy based on Eq. (A1), an exponential extrapolation strategy based on Eq. (A5) and a NIBP extrapolation strategy based on Eq. (A7). We presume perfect access to an augmented noisy circuit where the noise rate is increased by a factor a1 > 1. We denote θ_ε* as the parameter corresponding to the global cost minimum at base noise parameter ε, and denote by z the ratio of the average cost differences at the base and boosted noise levels. Any such noise model has an average relative resolvability bounded as in (D14), with a strategy-dependent constant equal to a1^{−(L+1)} for NIBP extrapolation. Thus, under the assumption that z ≥ 1 and ⟨∆C̃(θ_{i,ε*}, a1ε)⟩_i ≥ 0, we have χ ≤ 1 for all of the above extrapolation strategies with access to 2 noise levels.
Proof. As in the previous proof, we can inspect Eqs. (A2), (A6) and (A8) and see that the Richardson, exponential and NIBP extrapolation strategies all take the form (D15), where A, B, D ≥ 0 (note that E contains the state-independent cost value that represents the fixed point of the noise) and we have adopted the notation of (D1). The average mitigated cost differences (averaged over accessible parameters {θ_i}_i) can then be evaluated directly. Finally, by noting once again that the error mitigation cost is bounded via (D2), we obtain the desired form by defining c = A/B. The specific values of c for each extrapolation strategy can be read off by inspecting Eqs. (A2), (A6) and (A8).
We now introduce a modification of the 2-design-averaged relative resolvability in Definition 4 that we will use to prove an additional result for Zero Noise Extrapolation.
Definition 5 (Average relative resolvability III). Consider a spectrum λ ∈ ℝ_{≥0}^{2ⁿ} with unit ℓ₁-norm, which corresponds to a noisy reference state. We then define the unitarily-averaged relative resolvability analogously to Definition 4, where ⟨·⟩_{U_i} denotes an average over U_i drawn from a unitary 2-design, and M : S(H) → B(H) is the map that describes the action of the error mitigation protocol.
Note that for Zero Noise Extrapolation we would generally expect to apply the map describing the error mitigation after conjugation by the unitary U_i, counter to the definition presented above. Definition 5 presumes that augmenting the noise level and conjugating the noisy state by a unitary are two processes that approximately commute, and as such its use is not justified in generic cases. However, below we still present a result using this definition as a mathematical curiosity.
Supplemental Proposition 1 (Average relative resolvability III with Zero Noise Extrapolation). Consider a Richardson extrapolation strategy based on Eq. (A1), an exponential extrapolation strategy based on Eq. (A5) and a NIBP extrapolation strategy based on Eq. (A7). We presume perfect access to an augmented noisy circuit where the noise rate is increased by a factor a > 1. Denote the output states at the base and augmented noise levels as ρ(1) and ρ(a) respectively. Then we have the bound below, where χ_λ is the averaged relative resolvability defined in Definition 5, P(1) is the purity of ρ(1), P(a) is the purity of ρ(a), and the constant takes the value a^{−(L+1)} for NIBP extrapolation, with the corresponding values for Richardson and exponential extrapolation read off from Eqs. (A2) and (A6).
Proof. We denote reference states ρ̃(ε) and ρ̃(aε) as states with purity P(ε) and P(aε) respectively. Moreover, denote the noisy cost function values as before, and further denote C_m(U_i) as the corresponding error-mitigated estimator. We start again by noting that in all three Zero Noise Extrapolation strategies the estimator takes a common form with coefficients A, B ≥ 0 (see Eqs. (A2), (A6) and (A8)), where we have adopted the notation of (D1). We first evaluate the relevant expectation values, which correspond to integrals over the Haar distribution over the unitary group of degree 2^n. We now proceed to derive the result for Richardson/exponential extrapolation; however, we note that the proof follows in a similar way for NIBP extrapolation with the simple substitution aε → q/a. Utilizing Lemma 3, we obtain the required bound, where the inequality comes by observing that Tr[ρ̃(ε)] = Tr[ρ̃(aε)] = 1, further applying Cauchy-Schwarz to Tr[ρ̃(ε)ρ̃(aε)], and noting that the purity of an n-qubit state is lower bounded by 1/2^n. Inspecting Eq. (D25), the next inequality comes by substituting in the expressions for ⟨∆C̃(U_i, ε)⟩_{U_i} and ⟨∆C̃(U_i, aε)⟩_{U_i} obtained in (D28), and dropping the third term in the numerator, where we have used Eq. (D30). Finally, we note that (D15) gives the expression for γ^{−1} in terms of A, B and D. Substituting the obtained expressions for γ^{−1}, (D35) and (D28) into Definition 5, we obtain the result, where we define c = A/B to obtain the desired form. Further, the explicit form of c for Richardson, exponential and NIBP extrapolation can respectively be found by inspecting Eqs. (A2), (A6) and (A8).

c. Richardson extrapolation with 3 noise levels
In this section we focus on Richardson extrapolation (see Appendix A 1 a for review) and investigate the change in resolvability under an extrapolation strategy that utilizes 3 distinct noise levels.
Supplemental Proposition 2 (Relative resolvability of Richardson extrapolation with global depolarizing noise, 3 noise levels). Consider L instances of global depolarizing noise of the form (32) acting through a circuit. Consider a Richardson extrapolation strategy based on Eq. (A1), an exponential extrapolation strategy based on Eq. (A5) and a NIBP extrapolation strategy based on Eq. (A7) in the appendix. We presume access to two augmented noisy circuits where the error probability is perfectly increased by factors a_2 > a_1 > 1 as p → a_1 p and p → a_2 p respectively. Then for all three extrapolation strategies and any such choice of a_1 and a_2, we have the bound below, where χ_depol is the relative resolvability (see Definition 2) for global depolarizing noise.
Proof. We start by noting that under L instances of global depolarizing noise with error probability p (of the form (32)), noisy cost differences are given by

∆C̃(a) = (1 − ap)^L ∆C , (D38)

for any noise augmentation factor a and any pair of cost function points, where ∆C is the corresponding noise-free cost difference.
The error-mitigated cost function difference ∆C_m(θ_{1,2}) = C_m(θ_1) − C_m(θ_2) between two arbitrary points constructed under Richardson extrapolation with 3 noise levels follows, where in order to obtain the first equality we have used (A4), and the second equality comes by substituting in (D38). Inspecting (D7), we see that the error mitigation cost γ can be bounded simply for any θ, where the inequality comes from our core assumption (D2). Inserting our expressions for γ, ∆C_m and ∆C̃(1) into Definition 2, the desired result can be observed by noting that a_2(a_2 − 1) > a_1(a_1 − 1).

Supplemental Proposition 3 (Average resolvability of Richardson extrapolation, 3 noise levels). Consider a Richardson extrapolation strategy based on Eq. (A1), an exponential extrapolation strategy based on Eq. (A5) and a NIBP extrapolation strategy based on Eq. (A7) in the appendix. We presume perfect access to two augmented noisy circuits where the noise rate is increased by factors a_2 > a_1 > 1. We denote θ_{ε*} as the parameter corresponding to the global cost minimum at base noise parameter ε, and further denote the averaged quantities z_1, z_2 and the averaged cost differences as below, where χ is the averaged relative resolvability (see Definition 2). Thus, under the assumptions that z_2 ≤ z_1 ≤ 1 (on average the cost concentrates with increasing noise level) and ⟨∆C̃_{i,ε*}(a_1 ε)⟩_i, ⟨∆C̃_{i,ε*}(a_2 ε)⟩_i ≥ 0 (boosting the noise level does not shift the cost value of the global minimum above the average cost value), we have χ ≤ 1.
Proof. The averaged error-mitigated cost function difference ⟨∆C_m(θ_{i,ε*})⟩_i = ⟨C_m(θ_i) − C_m(θ_{ε*})⟩_i between two arbitrary points constructed under Richardson extrapolation with 3 noise levels follows directly. As in the previous proof, we can inspect (D7) and see that the error mitigation cost γ can be bounded simply for any θ, where the inequality comes from our core assumption (D2). Inserting our expressions for γ and ∆C_m into Definition 2, the desired result comes by denoting z_1 and z_2 as above. As with the results of Proposition 2, we see that χ decreases with increasing cost concentration.
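The mechanics of these extrapolation results can be made concrete with a small numerical sketch. Below is a minimal numpy illustration (our own naming, not code from this work) of Richardson extrapolation to zero noise via Lagrange interpolation, applied to a toy global depolarizing cost model in the spirit of Eq. (32), C̃(c) = (1 − cp)^L C + (1 − (1 − cp)^L) C_fix. With 3 noise levels and L = 2 the extrapolation is exact, while the 1-norm of the combination weights exceeding 1 reflects the statistical-variance amplification that drives the resolvability bounds:

```python
import numpy as np

def richardson_coeffs(scales):
    # Lagrange-basis weights for extrapolating noisy values at noise-scale
    # factors c_k to the zero-noise point c = 0; the weights sum to 1.
    scales = np.asarray(scales, dtype=float)
    coeffs = []
    for k, ck in enumerate(scales):
        others = np.delete(scales, k)
        coeffs.append(np.prod(others / (others - ck)))
    return np.array(coeffs)

def noisy_cost(c, p, L, C_free, C_fix):
    # Toy global-depolarizing model: the cost decays toward the noise
    # fixed point C_fix as (1 - c p)^L.
    q = (1.0 - c * p) ** L
    return q * C_free + (1.0 - q) * C_fix

scales = [1.0, 1.5, 2.0]            # base and augmented noise levels, a_2 > a_1 > 1
beta = richardson_coeffs(scales)    # combination weights (6, -8, 3) for these scales
vals = [noisy_cost(c, p=0.05, L=2, C_free=-1.0, C_fix=0.0) for c in scales]
C_m = float(np.dot(beta, vals))     # mitigated estimate

print(beta.sum())                   # 1: unbiased for constant functions
print(np.abs(beta).sum())           # 17 > 1: statistical-noise amplification
print(C_m)                          # recovers the noise-free cost -1 exactly (L = 2)
```

The large 1-norm of the weights (here 17, for modest scale factors) is precisely the mechanism by which the mitigated estimator requires more shots to resolve, even when its bias vanishes.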

Virtual Distillation

a. Bounds on error mitigation cost
We recall the two error mitigation protocols of Ref. [47], denoted "A" and "B" respectively, which prepare estimates of the form below, where λ_i is the dominant eigenvalue of ρ_i ≡ ρ(θ_i). The protocols explicitly construct these quantities from the ratios below, where prob_{1,i}, prob_{M,i} and prob′_{M,i} are expectation values of a Pauli-Z measurement on a qubit ancillary subsystem.
In order to obtain our results we will hereon make the core assumptions that the statistical uncertainty of the measurement outcomes of the circuit that prepares ρ_i and ρ^M_i can be approximated as state independent, and that the statistical uncertainty of the measurement outcomes of the circuit that prepares ρ^M_i is at best equal to that of ρ_i. In the case of large M we expect Var[prob_{M,i}] to be large, as for any (non-pure) ρ̃, the quantity Tr[ρ̃^M O] is close to zero for large M. This corresponds to prob_{M,i} = 1/2, which maximizes the variance for a binomial distribution.
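As a concrete illustration of the quantities being estimated, the following sketch (exact linear algebra, no shot noise or derangement circuit; function name and toy state are our own) evaluates the virtual distillation ratio Tr[ρ^M O]/Tr[ρ^M] for a single-qubit state. It shows the estimator approaching the dominant-eigenvector expectation value as M grows, while Tr[ρ^M] shrinks toward zero, which is the small-denominator effect underlying the variance blow-up discussed above:

```python
import numpy as np

def vd_estimator(rho, O, M):
    # Virtual-distillation estimate Tr[rho^M O] / Tr[rho^M]
    rhoM = np.linalg.matrix_power(rho, M)
    return np.trace(rhoM @ O).real / np.trace(rhoM).real

# Toy single-qubit state: dominant eigenvector |0> with eigenvalue 0.8
rho = np.diag([0.8, 0.2])
Z = np.diag([1.0, -1.0])

for M in (1, 2, 3, 6):
    # The estimate drifts toward <0|Z|0> = 1 as M grows, while Tr[rho^M]
    # (a denominator one must estimate from shots) shrinks toward 0.
    print(M, vd_estimator(rho, Z, M), np.trace(np.linalg.matrix_power(rho, M)))
```

For this state the ratio moves from 0.6 at M = 1 to roughly 0.88 at M = 2, while Tr[ρ^2] = 0.68 < 1; for more strongly mixed states the denominator decays much faster, which is why the shot cost of the protocol grows with M and with mixedness.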
Lemma 4 (Bounds on error mitigation cost of virtual distillation). Denote the error mitigation costs (see Definition 1) corresponding to (D51) and (D52) as γ^(A) and γ^(B) respectively. The bounds below hold.

Proof. Let us first treat the qubit setting of n = 1. Consider the eigenvalue decomposition ρ_λ = λ|ψ⟩⟨ψ| + (1 − λ)|ψ⊥⟩⟨ψ⊥|, where we have defined λ_1 = 1 − λ, λ_2 = λ and, without loss of generality, we fix 1 − λ ≥ λ. We define G(1, M, P) = f(M)/f(1) and will determine f(M) exactly for single-qubit states. For generic M we evaluate f(M) directly, where in the final equality we use the fact that for single-qubit states λ = (1 − √(2P − 1))/2. Further, using f(1) = P − 1/2, we obtain the bound as required.

Now let us consider the setting of n ≥ 2. We will construct two bounds, for the respective high-purity and low-purity limits. We start with the bound for high-purity states. We can write the right-hand side of Eq. (D109) explicitly, where in order to obtain the first inequality we have dropped the cross terms (1/2^{2n}) Σ_{i≠j} λ_i^M λ_j^M, and in the second inequality we have introduced new cross terms. The final equality comes by substituting in the definition of the purity P. We note this first bound is upper bounded by 1 for all P ≥ 1/(2^n − 1).

We can now construct our second bound for strongly mixed states. We will consider bounds on Var[X^M] for a random variable X when it is known to take values close to its mean µ. We consider a decomposition in terms of random variables Y_k, where the inequality is due to Cauchy-Schwarz. We now take X to be the random variable which takes values {λ_i}_i with uniform probability and mean µ = 1/2^n. We will bound Var[Y_k] under the assumption that {λ_i}_i are close in value to the maximally mixed value 1/2^n. First, we note that each Y_k is a function of (X − µ)^k, and so we must investigate the shifted spectrum, which we denote λ̃, where λ̃_i = λ_i − 1/2^n for all i. Using Popoviciu's inequality, we have a bound in terms of λ̃_max and λ̃_min. Now suppose that we have the constraint (D126) for some b ≥ 0. For any k, let us bound the quantity on the right by considering its maximum value over all spectra satisfying the constraint (D126). The quantity on the right is maximized by the choice of vector (λ̃_max, λ̃_min) that majorizes all others, given some fixed value of λ̃_max + λ̃_min; indeed, λ̃_max + λ̃_min = b is fixed by our constraint (D126) (λ̃_min must be negative in order to preserve trace). Thus the quantity on the right-hand side of (D127) can be bounded by maximizing λ̃_max and minimizing λ̃_min. This is achieved by setting all other λ̃_i equal to λ̃_min. We then have the pair of constraints

λ̃_max + (2^n − 1) λ̃_min = 0 , (D128)
λ̃_max − λ̃_min = 2b , (D129)

where the first constraint comes from preservation of trace, and the second is our original constraint. This is a linear system of equations with solution λ̃_max = 2b(2^n − 1)/2^n and λ̃_min = −2b/2^n. We will find it necessary to use the tighter bound (D131) in the case of M = 2, but the looser bound (D132) will enable us to write a bound with a more compact form for M ≥ 3.
We now relate b to the purity. We can write a general spectrum that satisfies the constraint in (D126) as λ_{b,c,a}.

FIG. Var[λ̃^{(1)}] against purity for randomly generated states at different values of M at n = 2. We also plot the final bound (D96), which is a function of purity, as a line.
where we have used (A11). From (D154) we see that the error mitigation cost factorizes, where γ, γ′ are the individual error mitigation costs for each gate. The above reasoning can be extended inductively to show that the error mitigation cost of a collection of gates under probabilistic error cancellation is equal to the product of the individual error mitigation costs.
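The quasi-probability inversion and its cost can be illustrated with a single-qubit sketch (a simpler analogue of the n-qubit global scheme treated below; the decomposition here is derived from the Bloch-vector contraction of the channel, and all names are our own). The 1-norm of the quasi-probability weights gives the per-gate cost γ > 1, and, per the inductive argument above, L such gates incur total cost γ^L:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def conj_super(K):
    # Matrix of rho -> K rho K^dagger acting on the flattened rho
    return np.kron(K, K.conj())

def depol(p):
    # Single-qubit depolarizing channel (1-p) rho + (p/3)(X rho X + Y rho Y + Z rho Z)
    return (1 - p) * conj_super(I2) + (p / 3) * sum(conj_super(P) for P in (X, Y, Z))

def depol_inverse_quasiprob(p):
    # Quasi-probability weights (q_I = alpha, q_X = q_Y = q_Z = beta) that invert
    # the channel, obtained from its Bloch-vector contraction factor f = 1 - 4p/3.
    f = 1 - 4 * p / 3
    beta = (1 - 1 / f) / 4              # negative: a genuine quasi-probability
    alpha = 1 / f + beta
    gamma = abs(alpha) + 3 * abs(beta)  # 1-norm of the weights = mitigation cost
    return alpha, beta, gamma

p = 0.1
alpha, beta, gamma = depol_inverse_quasiprob(p)
inv = alpha * conj_super(I2) + beta * sum(conj_super(P) for P in (X, Y, Z))

print(np.allclose(inv @ depol(p), np.eye(4)))  # True: the weights undo the noise
print(gamma)                                    # ~1.23 > 1: per-gate overhead
# For L noisy gates the costs multiply: total cost gamma**L, exponential in L.
```

In a sampling implementation one draws each Pauli insertion with probability |q_α|/γ and weights the outcome by γ·sign(q_α); the estimator is unbiased, but its variance is inflated by γ², and by γ^{2L} over L corrected gates.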

b. Global depolarizing noise
Proposition 5 (Relative resolvability of Probabilistic Error Cancellation for global depolarizing noise). Consider a quasi-probability method that corrects global depolarizing noise of the form (32). For any pair of states corresponding to points on the cost function landscape, the optimal quasi-probability scheme gives the resolvability below for all n ≥ 1 and p ∈ [0, 1], which is achieved with access to noisy Pauli gates.
Proof. Ref. [87] gives the optimal quasi-probability decomposition for the inverse noise channel, where I is the identity channel and P_i is the Pauli channel corresponding to the i-th Pauli tensor product. This has the corresponding error mitigation cost given below.

For the noisy (no error mitigation) optimization we benchmark various combinations of N_i and N_s values for p = 4. In particular, we consider increasing N_s for N_i = 30 and increasing N_i for N_s = 1024. We gather the results in Fig. 7. We find that for N_tot ranging from 10^7 to 1.5 × 10^8 (as considered in Fig. 5) the best results are obtained for N_i = 3000, N_s = 1024. We use these values for the noisy optimization presented in Fig. 5.

Optimization with Virtual Distillation
In this Appendix we compare 5-qubit MaxCut QAOA optimization of the VD-mitigated cost function with optimization of the noisy and CDR-mitigated cost functions for p = 2, 4. We perform the comparison for 36 randomly chosen graphs from the Erdös-Rényi ensemble, as in Fig. 5. We perform VD mitigation for each expectation value of a 2-site term of H_MaxCut according to (12). A key assumption here is that we neglect derangement noise, which would affect a realistic VD implementation on hardware [47]. We use the Nelder-Mead algorithm as described in Appendix E. We take N_i = 30, as for the CDR simulations of Fig. 5, and assign 65536 shots to estimate Tr[ρ̃^M Z_i Z_j] for each 2-site term of H_MaxCut and Tr[ρ̃^M]. Consequently, the total shot cost of a mitigated cost function estimation is (n_e + 1) × 65536, where n_e is the number of MaxCut graph edges. We consider M = 2, 3 state copies, as the shot cost of VD mitigation at fixed precision grows with increasing M [95], and M = 2, 3 was shown to be sufficient for typical applications [96]. We find that M = 2 gives better results than M = 3, as suggested by our analytical results. Here we allow for N_tot up to 1.5 × 10^9, i.e. up to 10 times more shots than considered for the CDR-mitigated and noisy optimization in Fig. 5.
We gather the results in Fig. 8, comparing them with the noisy and CDR-mitigated optimization from Fig. 5. We find that even with a smaller N_tot, the noisy and CDR-mitigated optimizations outperform the VD-mitigated optimization. This example shows that, even for circuits outside of the NIBP regime, there is no guarantee that optimizing an error-mitigated cost function outperforms noisy cost function optimization. We note that this result does not preclude an advantage for VD-mitigated optimization for different choices of N_tot, M, or the shot number per cost function evaluation.

FIG. 1. Error mitigation can impair the resolvability of cost function landscapes. (a): A central primitive in training VQAs is the task of comparing two cost function values (C(θ1) and C(θ2)) on the cost landscape in parameter space. Ideally (with infinite sampling), these cost values correspond to the mean values of some probability distributions (left panel). However, in an experimental setup, one only has a finite shot budget, and by collecting measurement statistics one obtains an estimate of the mean values by sampling from these distributions (right panel). (b): The effect of certain types of noise models is to concentrate cost function values. This impedes trainability, as any two cost function values (C̃(θ1) and C̃(θ2)) have small separation and require many shots to accurately distinguish. (c): Error mitigation can mitigate many effects of noise and potentially recover key features of the noise-free cost function. In an ideal scenario, the separation of the mitigated cost values (Cm(θ1) and Cm(θ2)) closely resembles that of the noise-free landscape. However, the caveat is that the variance of statistical outcomes can increase greatly. The effect of this is that the two cost function points can often require even more shots to resolve accurately, compared to the unmitigated case.

FIG. 3. Schematic of resource use in error mitigation. Noise is indicated by the shaded orange region. (a) Cost function values are obtained by taking input state ρin, applying parameterized gates which we denote as a unitary channel U(θ), and measuring the resulting state U(θ)(ρin) with observable O. (b) Noise can corrupt the gates in the circuit and the measurement process. (c) Error mitigation aims to obtain a good approximation to the noise-free cost C(θ) by employing a number of strategies, such as: modifying the implemented gates U(θ) → V(θ) or the input state ρin → σin, utilizing multiple copies of the quantum circuit, modifying the measurement operator O → X, and utilizing clean ancillary qubits at the end of the circuit. Many such circuits with different hyperparameters can be run, with their expectation values combined in a post-processing step, in order to construct the final error-mitigated cost value Cm(θ). One feature that distinguishes the approaches to error mitigation studied here from error correction is that error correction allows global access to the larger Hilbert space from the start of the circuit, whereas the error mitigation techniques studied here only allow the possibility of global operations at the end of the circuit.

FIG. 4. Schematic for average relative resolvability II. In Definition 4 we consider a broader averaged resolvability measure, where the average is taken over a class of noisy states rather than a cost landscape generated by a particular ansatz. This is constructed from the following game: (1) Prepare a reference noisy state ρλ with spectrum λ and conjugate by a unitary Ui drawn from a 2-design. (2) Pass the resulting state through the considered error mitigation protocol, and evaluate the resolvability from the fixed point of the noise. (3) Do the same, without error mitigation. (4) Average over the 2-design and compare the averaged resolvabilities.
Finally, we can explicitly define a probability distribution p_α = |q_α|/G_N where G_N = Σ_α |q_α|. This gives us an alternative way to write (A11) and (A12) as

2. Averages over unitary 2-designs

Lemma 3. Consider the cost function value C_σ(U) = Tr[UσU†O], where U is a d × d unitary matrix and σ ∈ S(H) is some quantum state. Consider expectation values over U_i ∈ Y, where Y ⊂ U(d) is a unitary 2-design and U(d) is the unitary group of degree d. Denote such expectation values as ⟨·⟩_U. Then, we have
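The first moment of this average coincides with the Haar average, ⟨Tr[UσU†O]⟩_U = Tr[σ]Tr[O]/d (a 2-design reproduces Haar moments up to second order). As a numerical sanity check (a sketch with our own helper names), one can sample Haar-random unitaries and verify that the average vanishes for a traceless observable:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_unitary(d, rng):
    # Haar-random unitary: QR of a complex Ginibre matrix, with the phases
    # of R's diagonal absorbed so that the distribution is uniform.
    A = (rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))) / np.sqrt(2)
    Q, R = np.linalg.qr(A)
    return Q * (np.diag(R) / np.abs(np.diag(R)))

d = 4                                       # two-qubit example
sigma = np.diag([0.5, 0.3, 0.2, 0.0]) + 0j  # a quantum state, Tr[sigma] = 1
O = np.diag([1.0, -1.0, 1.0, -1.0]) + 0j    # a traceless observable

samples = []
for _ in range(20000):
    U = haar_unitary(d, rng)
    samples.append(np.trace(U @ sigma @ U.conj().T @ O).real)

mean_val = np.mean(samples)
print(mean_val)  # ≈ Tr[sigma] Tr[O] / d = 0, up to sampling error
```

The second-moment identities used in the proofs above can be checked the same way, at the cost of more samples.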

FIG. 7. Benchmarking various implementations of the noisy optimization for p = 4, 5-qubit MaxCut QAOA. We plot the approximation ratio averaged over 36 MaxCut graphs chosen randomly from the Erdös-Rényi ensemble as a function of Ntot. We compare the results for various numbers Ni of optimization instances initialized randomly and various numbers of shots Ns per cost function evaluation. As a reference we show the results of CDR optimization for p = 4. We consider Ntot = 10^7 to 1.5 × 10^8, as in Fig. 5. Additionally, as in Fig. 5, we use the approximation ratio computed with the exact energy to benchmark the optimization, and in the case of Ni > 1 we choose as the result of optimization the best instance determined according to the optimized cost function. We consider various values of Ns = 1024, 16384, 262144 for Ni = 30, and various values of Ni = 30, 150, 3000 for Ns = 1024. We find that Ni = 3000, Ns = 1024 yields the best results, although the differences in quality between most of the noisy optimization implementations are relatively small in comparison to the CDR-mitigated optimization.

FIG. 8. Virtual Distillation mitigated optimization for p = 2, 4, 5-qubit MaxCut QAOA. We plot the approximation ratio averaged over 36 MaxCut graphs chosen randomly from the Erdös-Rényi ensemble as a function of the total shot number Ntot. The results were obtained with Ni = 30 initializations and 65536 shots per Tr[ρ̃^M Z_iZ_j] and Tr[ρ̃^M] estimation. For reference we also present our results of CDR-mitigated and noisy optimization from Fig. 5. We observe that for this setting the optimization with Virtual Distillation does not outperform the noisy or CDR-mitigated optimization.