Error mitigation with Clifford quantum-circuit data

Achieving near-term quantum advantage will require accurate estimation of quantum observables despite significant hardware noise. For this purpose, we propose a novel, scalable error-mitigation method that applies to gate-based quantum computers. The method generates training data $\{X_i^{\text{noisy}},X_i^{\text{exact}}\}$ via quantum circuits composed largely of Clifford gates, which can be efficiently simulated classically, where $X_i^{\text{noisy}}$ and $X_i^{\text{exact}}$ are noisy and noiseless observables respectively. Fitting a linear ansatz to this data then allows for the prediction of noise-free observables for arbitrary circuits. We analyze the performance of our method versus the number of qubits, circuit depth, and number of non-Clifford gates. We obtain an order-of-magnitude error reduction for a ground-state energy problem on 16 qubits in an IBMQ quantum computer and on a 64-qubit noisy simulator.


Introduction
Currently, one of the great unsolved technological questions is whether near-term quantum computers will be useful for practical applications. These noisy intermediate-scale quantum (NISQ) devices do not have enough qubits or high enough gate fidelities for fault-tolerant quantum error correction [41]. Consequently, any observable measured on a NISQ device will have limited accuracy. However, candidate applications such as quantum chemistry require chemical accuracy to beat classical methods [8,33]. Similarly quantum approximate optimization has the potential to beat classical optimization when high accuracy is achieved [14,21,25].
Hence, it is widely regarded that near-term quantum advantage will only be achieved through error mitigation. Error mitigation (EM) is broadly defined as methods that reduce the impact of noise, rather than directly correct it. EM includes efforts to optimize quantum circuits with compiling and machine learning [12,13,35]. It also includes variational quantum algorithms [3,10,18,34,40], some of which can be used to remove the effects of incoherent noise [9,[29][30][31]42]. A prominent EM approach is to perform classical post-processing of observable expectation values. Perhaps the most widely used example is state-of-the-art zero-noise extrapolation (ZNE), which has shown great promise [16,28]. For a set of states that are classically simulable, one generates noisy and corresponding noise-free data on a quantum computer and classical computer, respectively. One learns to correct on this training data by fitting the parameters of an ansatz. Finally, one uses this ansatz with the fitted parameters to predict noise-free observables for arbitrary quantum states. ZNE involves collecting data at various levels of noise, achieved by stretching gate times or inserting identities, and using this noisy data to extrapolate an observable's expectation value to the zero-noise limit [6,17,23,37,45]. It has been successfully employed to correct ground-state energies for problem sizes up to 4-qubits [16,26,28]. Among other approaches are quasi-probabilistic error mitigation [45] and numerous application specific approaches leveraging symmetries and/or post-selection techniques to mitigate errors [4,7,32,38].
A crucial requirement of any EM method is scalability. While it is relatively easy to develop EM methods for small qubit systems, EM methods that work effectively at the quantum supremacy scale (> 50 qubits) are much more challenging to construct. Even methods that are in principle scalable may not actually scale well in practice.
This work aims to address this issue by proposing a novel, EM method that is applicable to all gatebased quantum computers. The basic idea is shown in Fig. 1. First we generate training data, of the form {X noisy i , X exact i }, where X noisy i and X exact i are the noisy and noiseless versions of an observable's expectation value of interest. The noisy values are obtained directly from the quantum computer, while the noiseless values are simulated on a classical computer. The training data are obtained from quantum circuits composed largely of Clifford gates (gates that map Pauli operators to Pauli operators), and hence these circuits are efficiently classically simulable. Next we fit the training data with a model, and finally we use the fitted model to predict the noise-free observable.
Our method is conceptually simple and could be refined with sophisticated model fitting methods offered by modern machine learning [36]. Nevertheless, even with simple linear-regression-based fitting, our method performs extremely well in practice. We consider the task of estimating the ground-state energy of an Ising spin chain by variationally training the Quantum Alternating Operator Ansatz (QAOA) [21,25]. For this task, our method reduces the error by an order-of-magnitude for both a 16-qubit problem solved on IBMQ quantum computer and a 64-qubit problem solved on a noisy simulator, though we expect that this improvement will generically vary depending upon the task considered. We also demonstrate the utility of our method for non-variational algorithms such as quantum phase estimation. Finally we find that our method appears to perform better than ZNE for the larger scale problems we consider.

Our method
We refer to our method as Clifford Data Regression (CDR). Let X be the observable of interest whose expectation value X exact ψ = ψ|X|ψ one wishes to estimate for a given state |ψ . Let X noisy ψ be the noisy version of this expectation value obtained from the quantum computer. To remove the noise from this expectation value, the CDR method involves the following steps: 1. One chooses a set of states S ψ = {|φ i } that will be used to construct the training data T ψ . Each |φ i state must satisfy the property that it is efficient to classically compute the expectation value of X for this state. The CDR method ensures this property by constructing each |φ i state from a quantum circuit composed largely of Clifford gates. We denote the number of non-Clifford gates used to prepare each |φ i as N , which (as shown below) plays the role of a refinement parameter.
2. For each |φ i ∈ S ψ , one evaluates X exact φi = φ i |X|φ i using a classical computer. One also evaluates the noisy version of this expectation value, X noisy φi , using the quantum computer of interest. These two quantities are incorporated into the training data set T ψ = {X noisy φi , X exact φi }.
3. One constructs an ansatz or model for the noisefree value of the observable in the vicinity of |ψ , where a are free parameters. The parameters can be found either by regression or machine-learning methods. In this article we use least square regression, with a linear ansatz (see Appendix A for discussion): obtaining parameters a 1 and a 2 by minimizing 4. One uses the ansatz f (X noisy ψ , a) with the fitted parameters to correct X noisy ψ .
We now discuss several strategies for how to choose S ψ in Step 1. In general, we have found that it is advantageous to tailor the set S ψ to the specific state |ψ ; in other words, to bias the training data towards the target state of interest. A simple strategy for this purpose is to generate the state-preparation circuits for the |φ i states by replacing a subset of the gates in the circuit that prepares |ψ with Clifford gates that are close in distance to the original gates. An alternative strategy, which is used in our implementations, is to employ Markov Chain Monte Carlo (MCMC) to generate classically-simulable states |φ i based on the values of their observables. (See Appendix B for details.) In general, one may base the MCMC sampling on the closeness of X noisy φi to X noisy ψ . However, a potentially more efficient strategy exists for the specific application of variational quantum algorithms [34,40,48]. When |ψ is a state that is meant to optimize a variational cost function (e.g., the energy of a given Hamiltonian), then one can instead base the sampling on minimizing this cost function. In this way, one can employ MCMC to obtain classically-simulable states that are close to extremizing the variational cost function. For our QAOA implementation (see Section 3.1) we find that this MCMC scheme produces results that are more robust to implementation details than the simple strategy described above. Furthermore, it outperforms the simple strategy for deep circuits while for shallower ones it provides results similar to the best implementations of the simple scheme.
The shot cost of CDR error mitigation depends on number of circuits in S ψ and the required precision of the X noisy ψ and X noisy φi measurements. These quantities need to be determined empirically by analyzing convergence of the results with increasing number of training circuits and shots per X noisy ψ evaluation. In the case of our QAOA implementation we observe that 60−80 training circuits and 8192 shots per expectation value estimation are enough to obtain an order of magnitude improvement for all analyzed numbers of qubits Q = 8 − 64 and numbers of QAOA rounds p = 1 − 24. See details in Section 3.1.
The classical cost of expectation value computations for near-Clifford circuits grows exponentially with N and polynomially with number of the Clifford gates [1]. In the case of state-of-the art classical simulators one can simulate circuits with N = 80 using a desktop computer [39]. In this work, in order to provide our method's proof of principle, we use our own implementation and N ≤ 30. We find empirically that number of all near-Clifford circuit simulations required typically grows polynomially with number of qubits and circuit depth, see details in Appendix B.

Implementation for QAOA
A central application of error mitigation is to correct the energies of Hamiltonian eigenstates prepared on a quantum computer. Here we illustrate this application with the Quantum Alternating Operator Ansatz (QAOA) [21,25], which can serve as an ansatz for Hamiltonian ground states. We consider the transverse Ising chain with open boundary conditions, given by the Hamiltonian where σ X , σ Z are Pauli operators, and Q is the length of the chain. We study the case of g = 2 (belonging to a paramagnetic phase) with open boundary conditions for different numbers of qubits Q. To apply the QAOA, we write where β j , γ j are variational parameters, p is the number of ansatz layers, and |+ = (|0 + |1 )/ √ 2. The exponentials of H 2 and H 1 can be easily decomposed into quantum circuits (see Appendix D for a description of the resulting circuits).
For this implementation, we perform the optimization of the β j , γ j parameters, and then correct the energies of low-energy local minima of this optimization. This correction involves correcting all σ j X and σ j Z σ j Z terms associated with (4). (Note however that the same training set S ψ , which is generated based on the total energy, is used to correct each term.) Figure 2 shows that our CDR method reduces the relative energy error (calculated with respect to the exact energy of the minima) by an orderof-magnitude on IBM's Almaden quantum processor. The (recently retired) device had quantum volume [15] QV = 8 and connectivity enabling simulation of Q = 16 Ising chain without use of qubit swap gates. We remark that the mitigated circuits have 60 CNOTs and 62 non-Clifford rotations. We show error bars that reflect our confidence in our ability to correct the training data. Specifically we take the error bars as three standard deviations, where the standard deviation is obtained directly from the cost function in (3) by dividing C by the training size and then taking the square root (see Appendix C for more details).
To study the scaling behavior, we also implement this problem using a classical matrix-productstate [20] simulator and full density matrix simulators of noisy circuits. The matrix-product-state simulator enables efficient simulation of large shallow circuits which are beyond reach of the full density matrix simulator. We use a noise model obtained from gate set tomography of IBM's Ourense quantum computer [13]. The noise model does not capture spatial and time inhomogeneous aspects of the real device noise. It also does not have cross-talk terms present in the real device. Detailed description of the noise model including its process matrices can be found in [13]. Furthermore, in our simulations we omit mea- Here Q is the number of system qubits, p is the number of QAOA rounds, and N is the number of non-Clifford gates in the training circuits. The noisy results are obtained with 16384 shots per circuit in the case of (b), (c), (d) and 65536 shots per circuit in the case of (a). The training sets contain 70 near-Clifford circuits. The total shot cost of mitigation is 1.16 × 10 6 and 4.65 × 10 6 shots per minimum, respectively. surement noise as it can be mitigated by specialized techniques [5,11]. Figure 3 presents the relative energy error, uncorrected (red) and corrected (blue), for different values of N , p, and Q. One can see that our CDR method results in between one to two orders-of-magnitude reduction in the error. Increasing the number of non-Clifford gates N in the training data monotonically reduces the error, as shown in Fig. 3(a). This is expected because increasing N allows the training set S ψ to become closer to the target state |ψ of interest. Hence, our results show that N is a refinement parameter, allowing one to obtain better error mitigation with the increased computational difficulty of simulating more non-Clifford gates. N is limited to 80 for state-of-the-art classical simulators [39], but our results show that N 30 already leads to ordersof-magnitude reduction in the error. Figures 3(b), (c) and (d) show that correcting errors with CDR becomes more challenging with deeper circuits and larger qubit counts, respectively. However we still obtain order of magnitude error reductions for either p = 4 layers or 64 qubits. It is worth noting that 64 qubits is considered to be in the regime where quantum advantage might be demonstrated. Furthermore, the p = 4 ansatz has more two-qubit entangling gates per pair of neighboring qubits than the ansatzes used to demonstrate quantum supremacy [2]. We note that the largest circuits considered here (Q = 64, p = 3) contain 378 CNOTs and 381 non-Clifford rotations.
To obtain substantial improvement for deeper p > 4 we consider the case of noise rates modestly reduced in comparison to current devices. To investigate such a scenario we construct a noise model by taking gate process matrices E as a convex combinations of process matrices of the IBM's Ourense noise model [13] E noisy and process matrices of the noiseless case E exact .
For this noise model we optimize the QAOA for p = 4 − 24 and Q = 8 using full density matrix noisy simulator. Therefore, for p = 4 we recover the realistic noise model and for the deepest simulated p = 24 we reduce the noise rates by a factor of 6 which is insufficient to enable the fault-tolerant quantum computing [24]. Again, we mitigate energies of the optimization local minima using near-Clifford circuits with N = 28 non-Clifford gates. We gather results in Fig. 4. The mitigated energies show an order of magnitude improvement even for the deepest p = 24. Furthermore, we observe that the improvement remains approximately constant as a function of number of ansatz layers p. Finally, we note that the deepest circuits simulated here contain 336 CNOTs  Fig. 3(a)) shots per circuit. Therefore, the total shot cost of the CDR mitigation per circuit of interest equals 1.16 × 10 6 and 4.65 × 10 6 , respectively. These shot numbers are of the same order as the ones used in our real-device implementations. To perform classical, finite-shot, large Q simulations we use perfect sampling with matrix product states [22]. The training circuits were generated using MCMC sampling with a target energy distribution of the training circuits peaking at an energy close to the ground state energy. The distribution was parametrized using the ground state energy, see Appendix B for details. Such a distribution favors training circuits with the smallest energy as proposed in Section 2. In the case of variational ground state simulations in a quantum advantage regime, the ground state energy is not known and its estimation is one of the goals of the quantum simulation. Therefore, in Appendix B we instead use a target distribution which favors low energy training circuits but is not parametrized by the ground state energy. We show numerically that for such a distribution one can obtain very similar results to the ones shown in Fig. 3. Consequently, special knowledge of the ground state properties is not necessary to achieve a high quality correction with CDR.
To give further insight into the scaling with Q, we show the results for individual optimization instances in Fig. 5. Interestingly, Fig. 5 shows that the CDR method is capable of removing noise-induced Figure 6: Correcting the results of quantum phase estimation on a noisy simulator with CDR. (a) Decomposition of a random pure state in the binned eigenbasis ofH defined in the text, without (red) and with (blue) correction. The probability distribution q is shown as a function of the binned energy eigenvalueλ. The inset shows its error. The random state whose decomposition is shown is the one for which CDR performed most poorly. (b) Relative error of this decomposition over 20 instances (different random pure states), ordered by increasing corrected error. For all instances, we employed training data constructed from quantum circuits with 6 of the total 12 non-Clifford gates replaced by Clifford gates (see Appendix E for a circuit diagram).
fluctuations in the energy, i.e., very different noisy energy values are correctly mapped to the same corrected ground-state energy values. For large Q, some remnant of these fluctuations still linger in the corrected energies, leading to the worse performance in Fig. 3(c), although it suggests that employing a more detailed model in (1) accounting for additional features in the training data could further improve the accuracy of the corrected energy.

ZNE Implementation
In addition to applying our CDR method for QAOA, we also examined the performance of the ZNE method under the same conditions as those in Fig. 2, i.e., for the Q = 16, p = 2 case with the same circuits on IBM's Almaden quantum processor. Furthermore, to mitigate a circuit of interest we used similar total number of shots (1.70×10 6 ) as in the case of our CDR implementation (1.05 × 10 6 ). Unfortunately, for this problem instance we were unable to attain a meaningful correction with ZNE. This difficulty was likely largely due to the number of qubits and depth of the quantum circuit we considered, as ZNE depends on the base circuit being considered not being too noisy to start with. For more details, see Appendix F.

Implementation for phase estimation
Let us now illustrate how our method applies to quantum phase estimation (QPE). We consider a version of phase estimation optimized for near-term quantum computers [43]. For an input state |χ , this algorithm estimates χ|e −iHt |χ = e −iHt on a quantum computer for a series of times t and then classically Fourier transforms the time series to estimate |χ decomposition in a binned eigenbasis of H. More precisely it estimates probability of |χ being in an eigenstate of H with an eigenvalueλ j − < λ <λ j + whereλ j is a center of the bin and defines width of the bin with a quantity q j which definition and convergence properties are discussed in detail in [43]. The bins are chosen to cover the whole spectrum of H.
The expectation value e −iHt is obtained by measuring σ X + iσ Y on an ancilla qubit after applying the controlled e −iHt gate to the state |χ [43]. We performed error mitigation of e −iHt for a threequbit Max-Cut HamiltonianH = 1 6 (σ 1 , |χ being a random pure state, and times t ∈ {1, 2, ..., 136}. We calculated q j for a binned eigenbasis withλ j = −0.5, −0.375, . . . , 0.5 using exact, noisy and CDR mitigated e −iHt . We employed full density matrix simulator using the IBM Ourense noise model [13] described in more detail in Section 3.1. To reduce circuit depth for our H we apply circuit compilation techniques of [12] obtaining a circuit of depth 24 with 12 CNOTs and 12 non-Clifford R Z rotations, see Appendix E for details. Figure 6 shows the results. One can see in Fig. 6(a) that the decomposition of |χ in the energy eigenbasis is significantly improved after correction with the CDR method. Indeed, Fig. 6(b) shows that the relative error is reduced by at least a factor of three by CDR.
The QPE algorithm results in circuits with depth growing proportionally to number of H terms which in turn usually scales polynomially with the number of qubits Q. Consequently, we expect that implementation of the algorithm for large systems with near-term quantum computers and successful error mitigation of the resulting circuits will be challenging despite the algorithm being more resource efficient than its alternatives. Additionally, circuit compilation techniques used here have cost scaling exponentially with the numbers of qubits [12] restricting their usability for large systems. Therefore, we do not analyze performance of CDR as a function of Q here.

Conclusions
With quantum supremacy recently demonstrated [2], the next milestone in quantum computing may be to achieve quantum advantage for practical applications such as chemistry or optimization. These applications will require accurate estimation of observables on noisy quantum hardware, and hence largescale error mitigation will be necessary. In this work, we proposed a method to significantly reduce the errors (potentially by orders of magnitude) of quantum observables. The method, called Clifford Data Regression (CDR), learns how to correct errors on a training data set. Constructing this training set exploits the classical simulability of quantum circuits composed largely of Clifford gates. This allows one to apply our method to large problem size. For example, we obtained meaningful corrections with CDR for a 64-qubit ground-state-energy problem, with the total shot number being feasible with current quantum devices. Additionally, we found that the circuit depth our method can mitigate will be notably increased with modest reductions in hardware noise rates, reaching 24 rounds of 8-qubit QAOA.
Further testing our method on real quantum hardware will be important, as will be further studying the scaling of our method's performance with circuit depth and problem size. Furthermore, additional benchmarks will be required to determine the best training set construction methods for circuits generated by non-variational algorithms. In addition, refining our method with ansatzes that are more sophististicated than our linear ansatz (e.g., using neural networks [46] or other machine-learning approaches) could be fruitful. Finally, it would be interesting to extend the applicability of our approach to analog quantum simulators. Furthermore, we note that a related, but distinct, approach using Clifford circuits to learn quasi-probabilistic error mitigation was recently proposed [44].

Acknowledgements
We thank Kipton Barros, Kenneth Rudinger, and Mohan Sarovar for helpful conversations. We thank IBM for providing access to their quantum computers. The views expressed are those of the authors and do not reflect those of IBM. All authors acknowledge support from LANL's Laboratory Directed Research and Development (LDRD) program. PJC acknowledges support from the LANL ASC Beyond Moore's Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program. This research used quantum computing resources provided by the LANL Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001.

A Motivation for linear ansatz
In this work, we assumed essentially the simplest ansatz possible for correcting noisy observables, given by the linear relationship in Eq. (2). While more complicated ansatzes may improve the performance of CDR, we note here that there does exist theoretical motivation for a linear ansatz.
Consider a depolarizing noise channel that acts globally on all Q qubits. The action of this channel takes the form: where d is the Hilbert-space dimension and 0 p 1. Suppose that this channel acts repeatedly at m different times throughout the quantum circuit of interest. Then, for the circuit that prepares the noise-free state |ψ , the actual final state will be: Similarly, for a circuit that prepares a state |φ i from the training set S ψ , the actual final state will be: For the observable of interest, X, the actual relationship between the noisy and noise-free values is given by which is equivalent to Hence we see that global depolarizing noise leads precisely to a linear relationship between noisy and noisefree expectation values. Similarly the training data will take the form of: where a 1 and a 2 are independent of i and are given by the formulas in (11). Fitting a linear regression to the data in the training set will lead to a perfect fit, and the fitted parameters will have the values of a 1 and a 2 given in (11). Hence, using our linear regression approach in this case will lead to exactly the right correction for X exact ψ . The above analysis shows that our linear ansatz will perfectly correct for global depolarizing noise. This analysis can be extended to a global version of the quantum erasure channel. More generally, this analysis applies to any channel that maps the input state to a convex combination of the input state and a fixed state, of the form: where ρ 0 is a fixed state.
In addition, we remark that the linear ansatz will perfectly correct for certain types of measurement noise. This is because measurement noise typically can be thought of as noise acting on the local state just prior to a noise-free measurement. If this noise corresponds to depolarizing noise (e.g., for symmetric measurement noise) or more generally to the noise in Eq. (13) (which allows for asymmetric measurement noise), then one can apply the analysis above to show that the linear ansatz will perfectly correct the noise.
While we do not expect the linear ansatz to correct for all types of noise, one can nevertheless replace the linear ansatz with a more complicated one, and further work is needed to explore the benefits of such (more complicated) ansatzes.

B Generating the set S ψ
We use a Markov Chain Monte Carlo (MCMC) [47] sampling technique in order to generate a set of classically simulable training states, S ψ , for use in CDR. Starting from some initial simulable circuit, we build S ψ by making small update steps to our circuit. At each update step, we choose to either accept or reject the change. On accepting a new circuit it is added to S ψ , and we look for updates starting from this new circuit. In the case when acceptance of the circuit is based on its exact expectation values (as in the case of our QAOA implementation) one can further improve shot efficiency of the method. As for large circuits it takes many small changes to form a circuit with significant fraction of gates updated one can form S ψ post-selecting the accepted circuits to choose only ones sufficiently different. We formalize this notion below.
In our applications, the initial point is chosen by finding a near-Clifford circuit with N non-Clifford gates that is close to the circuit we wish to correct. To generate update steps, we randomly pick n p pairs of the circuit's σ Z rotations. Here, each pair consists of one rotation that has been replaced by a Clifford gate (specifically, S n for some integer n where S = R Z (π/2) = e −iσ Z π/4 is the π/2 rotation gate) and one that has not. We chose n p = 5 in our implementations. For each pair we then replace the non-Clifford rotation by a power of S and the Clifford gate by the original rotation in that part of the circuit. When replacing a rotation R Z (α) by S n , the power n is randomly sampled with weight w(n), which is given by with σ = 1/2. Note that such an update preserves the number of non-Clifford gates N . We visualize the update for a simple circuit in Fig. 7. For our implementations, the new state |φ proposed by this update step is then either accepted or rejected according to a Metropolis-Hastings rule with  (14). The non-Clifford rotations are replaced by original Clifford rotations (RZ (k4π/2), RZ (k5π/2)). The update preserves the number of non-Clifford gates N . As a result of the update we obtain a circuit at the right which is later accepted or rejected according to the Metropolis-Hastings rule. Here P = RX (π/2) and to obtain Clifford rotations we have k1, k2, . . . , k7 = 0, 1, 2, 3.
a likelihood function L. In the limit of many MCMC steps, such sampling produces a distribution of near-Clifford training circuits with probability proportional to L. The likelihood was defined differently for our QAOA and phase estimation examples. In the case of QAOA our goal was to produce probability distribution peaking as close to the ground state energy as possible. To that purpose we used Here, X is the energy per qubit, and X 0 = −2.1 being roughly the ground state energy. We used X σ = 0.05, though we found that instead using X σ = 0.1 or 0.2 produced similar results. Larger values (X σ ∼ 1) degraded the quality of the results. For this proof of principle demonstration we chose X 0 close to the ground state energy. Nevertheless, for open problems we may lack knowledge about the optimization solution to base choice of X 0 on that. One may set X 0 using lower bounds on the energy or performing MCMC sampling with decreasing X 0 until training circuits energy distribution stops shifting towards smaller energies. An arguably more elegant and simpler solution is using likelihood function that increases monotonically with decreasing energy. An example of such a likelihood is In Fig. 8 we test this likelihood for correcting energy for local minima of Q = 8 − 64, p = 3 QAOA optimization. These circuits were corrected using the likelihood (15) in Fig. 3(d). As in Fig. 3(d), we use the noise model of IBM's Ourense quantum computer. Furthermore, we choose the same number of training circuits and total shots spent performing error mitigation as for simulations from Fig. 3(d). With X σ = 0.02 we find that the quality of the correction is similar to the one obtained by earlier approach. This result demonstrates that knowledge of the solution's properties is not necessary to construct a good training set. The value of X σ = 0.02 is chosen to ensure that the exact energies of the training circuits are close to the lowest energies obtained with the training circuits constructed by randomly projecting the non-Clifford gates according to (14). For the phase estimation example, we used a similar expression to what we used for QAOA: with X σ = 0.1. This version is appropriate when correcting observables that we do not have special information about (e.g., that X is not supposed to be minimized).
To give insight into resource cost of the CDR method we analyze in detail construction of the training sets for deep QAOA circuits from Fig. 4 and large, shallow circuits from Fig. 3(d). The MCMC procedure described above generates a chain of near-Clifford circuits indexed by an index i for each mitigated circuit. We build the training set choosing the chain elements with i = i 0 +kξ MCMC , k = 0, 1, . . . , 69, where ξ MCMC measures auto-correlation in the chain and i 0 determines how many initial elements of the chain are rejected. We define ξ MCMC as where θ is a vector of R Z rotations' angles in the circuit,θ = 1 L−i0 L i=i0 θ i and L is the chain's length. For fixed N , required classical resources are determined in our case by ξ MCMC , i 0 and number of Clifford gates in the circuit.  Fig. 3(d). As in Fig. 3(d) we use the IBM's Ourense noise model. The blue curves are results obtained with the likelihood (15) which is used in the main text and are taken from Fig. 3(d). The black curves are obtained with the likelihood (16). Here X = E/Q, X0 = −2.1, Xσ = 0.05 and X σ = 0.02. The likelihood function (15) utilizes special knowledge about ground state of the model as it peaks close to the ground state energy. That's not the case for the other one as it increases monotonically with decreasing energy. We find that both approaches yield similar quality of the mitigation showing that special knowledge about the ground state of the model is not necessary to obtain a good quality error mitigation with CDR. In both cases we use 70 training circuits and 1.16 × 10 6 shots for the mitigation.
The classical resources scale polynomially with increasing number of the Clifford gates. Scaling of ξ MCMC and i 0 needs to be analyzed empirically. We observe that mean value of ξ MCMC obtained by averaging over the minima scales polynomially with increasing p and Q, see Fig. 9. The behavior of i 0 is somewhat more complicated as it depends on a starting point of the MCMC chain. If the initial point with the energy close to the chain's stationary energy distribution, one can choose i 0 = 1 without decreasing quality of the energy correction. Otherwise it is necessary to reject some initial elements of the chain. For the shallow circuits and the deep one with p ≤ 16 we were able to find good initial points starting from a set of 200 near-Clifford circuits obtained by our chain initialization algorithm. That is no longer possible for p = 20, 24 for which we choose i 0 = 70ξ MCMC + 1 and i 0 = 150ξ MCMC + 1, respectively. We remark that to reduce classical computation time for these p values one can choose the initial points using MCMC chains obtained for smaller p. To sum up we observe that for fixed N total cost of classical simulations for our implementation scales polynomially with Q and p while showing deviations for the largest p. Furthermore, we note that the MCMC sampling for N = 28 and the largest circuits (Q = 8, p = 24 and Q = 64, p = 3) takes less than 2 days and 12 hours, respectively, while using a single core of Intel Xeon E5-2695 (v4, 2.1 GHz) processor per a circuit of interest.

C Error bars
In the main text, we display error bars on the corrected observables obtained by the CDR method. These error bars are meant to convey one's confidence in the predicted noise-free observable. Conceptually speaking, there are two main sources of error to consider: (1) Imperfect training on the training data such that the cost function C does not go to zero, and (2) Inability of the training data to capture the noise processes that affect the target state |ψ . The latter source of error is difficult to quantify in practice, although it can be systematically removed by increasing the number of non-Clifford gates N as discussed in the main text.
Therefore we focus our error bars on the former source of error, i.e., imperfect training. For the most  Figure 10: A circuit implementing a layer of the QAOA ansatz for a system with Q = 4 qubits. β, γ are QAOA parameters. Here P = RX (π/2) = e −iσ X π/4 and is a Clifford gate. The only non-Clifford gates in this circuit are the σZ rotations: RZ (α) = e −iσ Z α/2 . We note that we have made use of the decomposition of e iγσ j Z σ j Z from Ref. [27].
part, we find (see Fig. 4) that the error bars associated with imperfect training are sufficient to encompass the discrepancy between the predicted and exact observable values. However, for our largest implementation, Q = 64 in Fig. 5(d), one can see that our error bars sometimes underestimate the true error. This suggests that our error bars are useful as lower bounds on the true error.
As mentioned in the main text, we calculate our error bars using the value of the cost function C obtained after training. Specifically, the magnitude of the error bar is given by three standard deviations, where the standard deviation is given by C/(L − 1), where L is the number of states in the training set S ψ . D QAOA circuit structure Figure 10 shows the structure of our QAOA circuit for the Ising spin chain considered in the main text. We note that all gates except for the the (2Q−1)p R Z gates are Clifford gates. The circuits contain (2Q − 2)p CNOTs. The only changes made to this circuit for generating the training data S ψ are therefore only replacements of these σ Z rotations by S n , as discussed in Appendix B. Figure 11 shows the quantum phase estimation circuit from Ref. [43]

E Quantum Phase Estimation circuit structure
compiled by an algorithm from Ref. [12] and applied to a randomly chosen input product state. As in the QAOA ansatz, we note that all gates except for the the σ Z rotations are Clifford gates. Again, our classically simulable training data set S ψ is generated by replacing some of these σ Z rotations with S n as discussed in Appendix B.

F Zero Noise Extrapolation
For comparison with our CDR method, we have performed zero noise extrapolation (ZNE) for the QAOA implementations discussed in the main text on IBM's 20-qubit Almaden quantum processor. Specifically, we used Qiskit's Pulse package [19] to systematically stretch the microwave pulse sequences used to physically implement our Q = 16, p = 2 optimized QAOA circuits as done in Ref. [28].

F.1 Background
Following Ref. [45], the evolution of the quantum computer can be modeled in terms of the drive Hamiltonian described by the quantum circuit, K(t) and a Lindblad operator L (ρ) representing the physical noise channels: Here λ is a (hopefully) small parameter that represents the strength of the action of the noise channels, so the limit λ → 0 would represent noiseless quantum computation. Attempting to approximate this limit is the heart of ZNE. While an experimenter cannot in general directly adjust λ, under the assumption that the form of L is invariant under time re-scaling and independent of K(t), one can in effect increase λ. Increasing λ is accomplished by increasing (stretching) the time for the circuit evolution (T ) by a factor of c while decreasing the magnitude of the drive Hamiltonian: To see how this works, let us integrate (19) with respect to time from t = 0 to t = T , using our modified drive Hamiltonian from (20). Calling the state evolved this way ρ (t) = ρ(c −1 t), we have: We therefore have that, under these assumptions, the final state driven over a longer time with the stretched drive Hamiltonian is equivalent to one evolved with the original drive Hamiltonian with λ → cλ. For a more detailed derivation of this formalism, see Ref. [45].

F.2 Implementation
For our implementation, we follow Ref. [28] Figure 11: Quantum circuit used to estimate Re( χ|e −iHt |χ ) forH, and a random product state |χ given by random angles α1, α2, α3, α4, α5, α6. Parameters f1(t), f2(t), f3(t), f4(t), f5(t), f6(t) were obtained by compiling the circuit with an algorithm from Ref. [12]. The compilation was accurate up to numerical precision. We assume here all to all device connectivity. then stretched the pulse sequences generated from the QAOA circuit as shown in Fig. 12. In addition to running the circuit at different stretch factors, we also made use of Qiskit's built-in measurement error mitigation functions [19] as ZNE does not directly handle read-out error. Finally, we used 212992 shots to measure each operator for each value of c.

F.3 Results
The data points we measured as well as our extrapolated energies are shown in Fig. 13 for the three instances of low energy QAOA studied in the main text. For the sake of completeness, we show the extrapolation with a linear fit, a quadratic fit, and finally the cubic polynomial that is the standard ZNE approach for four values of c [28,45]. In addition, we also show an exponential fit for ZNE [17]. As shown in Fig. 13, it appears that for our particular use case the ZNE method did not provide an accurate correction for the energy expectation values.

G Correction with a constant ansatz
Training set construction methods tested in the paper are reliant on building a set of training circuits similar in some respect to the circuit of interest. In the case of the QAOA error mitigation from Section 3.1 the training circuits are chosen to have similar energy as the circuit of interest. In the case of Quantum Phase Estimation presented in Section 3.3 they are chosen to have similar noisy expectation value of the mitigated observable. One may wonder if good performance of the method is limited to the cases for which the exact expectation value of the mitigated observable is very similar for both the training circuits and the circuit of interest.
In the case of simple global depolarizing noise treated analytically in Appendix A, we see that CDR is capable of learning the perfect correction from any two training circuits with different values of the observable of interest. Therefore, at least in some cases the method has potential to learn correction even if the training circuits have expectation values of the observable of interest very different from the circuit of interest. In this appendix we give insight into this problem for the more realistic case of the IBM Ourense noise model. To that end we perform the correction with a constant ansatz f (X noisy ψ , a) = a 1 (22) and compare it with CDR correction. As a test case we correct the p = 3 QAOA circuits for which CDR correction was performed in Figs. 3(c) and 8. We use the training sets used above in Fig. 8 which are constructed with MCMC sampling method (15). We present the results in Fig. 14 showing that CDR outperforms the correction with the constant ansatz. This shows that good quality of the correction can be obtained for the realistic noise even when construction of the training set with X exact φi concentrated around X exact ψ is impossible. Finally, to provide additional insight into properties of the used training sets  Table 1: Average absolute values of fitted slope a1 and intercept a2 of the CDR ansatz (2) used to perform error mitigation in Fig. 14. We average over different local minima of the QAOA optimization and different observables contributing to the energy. We also include standard deviation of the absolute values using parentheses notation.
we gather average absolute values of the fitted CDR ansatz coefficients in Tab. 1.  Fig. 8. The black curves depict the results obtained with the constant ansatz using the same training sets as for CDR. We see that the CDR outperforms constant correction performed with (22). The difference is especially pronounced for circuits with the largest qubit count.