Unifying and benchmarking state-of-the-art quantum error mitigation techniques

Error mitigation is an essential component of achieving a practical quantum advantage in the near term, and a number of different approaches have been proposed. In this work, we recognize that many state-of-the-art error mitigation methods share a common feature: they are data-driven, employing classical data obtained from runs of different quantum circuits. For example, zero-noise extrapolation (ZNE) uses variable-noise data and Clifford-data regression (CDR) uses data from near-Clifford circuits. We show that Virtual Distillation (VD) can be viewed in a similar manner by considering classical data produced from different numbers of state preparations. Observing this fact allows us to unify these three methods under a general data-driven error mitigation framework that we call UNIfied Technique for Error mitigation with Data (UNITED). In certain situations, we find that our UNITED method can outperform the individual methods (i.e., the whole is better than the individual parts). Specifically, we employ a realistic noise model obtained from a trapped-ion quantum computer to benchmark UNITED, as well as other state-of-the-art methods, in mitigating observables produced from random quantum circuits and the Quantum Alternating Operator Ansatz (QAOA) applied to Max-Cut problems with various numbers of qubits, circuit depths, and total numbers of shots. We find that the performance of different techniques depends strongly on shot budgets, with more powerful methods requiring more shots for optimal performance. For our largest considered shot budget ($10^{10}$), we find that UNITED gives the most accurate mitigation. Hence, our work represents a benchmark of current error mitigation methods and provides a guide for the regimes when certain methods are most useful.


Introduction
As new generations of quantum computers become larger and less noisy, the practical application of quantum algorithms is just around the corner. It is likely that any quantum algorithm that can achieve an advantage over its classical counterpart will have to contend with substantial hardware errors. While quantum error correction may eventually allow one to eliminate hardware noise, doing so will require many more qubits and significantly lower error rates. Although full error correction will not be practical in the near term, it is possible to partially mitigate errors with far lower resource requirements. A number of such error mitigation techniques have been proposed [6,14], each with its own benefits and trade-offs.
One of the most popular approaches to error mitigation is zero noise extrapolation (ZNE) [45]. ZNE involves collecting data at different noise levels and fitting an extrapolation to the noiseless case. A variety of methods have been introduced to amplify the noise as well as to perform the extrapolation [13,45]. ZNE has been shown to work well for small problem sizes [24] but has faced challenges for larger problems [10]. Due to the uncertainty of the extrapolation, ZNE does not have performance guarantees for regimes of high noise. This is particularly noticeable for circuits involving many gates. Here, the lowest available error points are likely to be too noisy for such fits to be helpful.
Alternatively, classically simulable circuits can be used to inform the experimenter about noise in a quantum device [10, 33,43,47,48]. The Clifford Data Regression (CDR) method [10] makes use of the Gottesman-Knill theorem [18] which guarantees that quantum circuits containing only Clifford gates can be classically simulated. CDR constructs a training set using data obtained from classically simulable near-Clifford circuits similar to the generic quantum circuit of interest. This set of data is then used to perform a regressive fit on an ansatz mapping noisy to exact expectation values. The fitted ansatz is used to estimate the value of the noiseless result. Furthermore, the combination of CDR and ZNE, which can outperform either method separately, has been considered and is called variable-noise CDR (vnCDR) [28]. vnCDR uses training data involving near-Clifford training circuits evaluated at multiple noise levels and can be thought of as using Clifford data to inform the extrapolation to the zero noise limit.
More recently, error mitigation via virtual distillation (VD) has been introduced [22,26]. VD mitigates noise by simultaneously preparing multiple copies of a noisy state, entangling them, and making measurements to virtually prepare a state which is purer than the original noisy states. Further improvements have been made and analyses have been performed [5,9,23,27]. While VD can achieve exponential error suppression in some regimes [9,22,26], it has been noted that, for sufficiently large noise rates, VD requires a number of shots that scales exponentially with the number of qubits used [9].
Given the success of merging ZNE and CDR into vnCDR, in this paper we consider whether it is beneficial to also unify these error mitigation methods with VD. By treating the number of copies used in VD as a noise-control parameter, we use Clifford data to guide a combination of VD-mitigated observables, much as in vnCDR. Adding ZNE-like noise amplification as well, we arrive at a framework unifying CDR, ZNE, and VD, which we call UNIfied Technique for Error mitigation with Data (UNITED). A summary of the steps used in UNITED is shown in Figure 1.
In order to get a realistic idea of how ZNE, vnCDR, VD, and UNITED compare, we carry out a numerical study benchmarking the performance of each technique when mitigating a local observable produced by a random quantum circuit, and one term of a Max-Cut problem cost function for a circuit obtained when optimizing the Quantum Alternating Operator Ansatz (QAOA). We explore their performance on a simulator based on the current error rates of an ion-trap quantum computer. We compare performance across various system sizes and circuit depths, involving systems of up to 10 qubits and depths of up to 128 Mølmer-Sørensen entangling gate layers. Furthermore, we systematically compare the necessary shot cost for each method in its mitigation task, providing a fair comparison to guide experimenters in this field.
Overall, for the largest considered shot budget ($10^{10}$) our new unified method UNITED gives an advantage over all other methods considered in mitigating observables produced by both random quantum circuits (RQC) and QAOA applied to the Max-Cut problem. We observe that ZNE outperforms the learning-based methods for the $10^5$ shot budget but fails to improve significantly when the total number of shots is increased. This is in contrast to the learning-based methods, such as vnCDR and UNITED, which outperform ZNE when using $10^6$ and $10^8$ shots, respectively. VD performs similarly to ZNE in the RQC task but performs significantly better in the QAOA task, being the best method for the $10^5$, $10^6$, and $10^8$ shot budgets. That is expected, as solutions of the Max-Cut problems are eigenstates of each term in the cost function, which is beneficial to the performance of VD [27]. Nevertheless, for the highest shot budget we consider ($10^{10}$), UNITED improves on the best-performing mitigation methods for the QAOA and RQC experiments. These results suggest that the unified method successfully leverages the strengths of each of its constituent parts. Hence, we believe that it is crucial to explore such unified methods.
State-of-the-art error mitigation methods

Zero Noise Extrapolation

ZNE [45] uses noisy expectation values of an observable of interest $X$, obtained by running the quantum circuit of interest $U$ at increased noise levels, to extrapolate to the noise-free limit. Originally, Richardson extrapolation [38] was proposed to estimate the noise-free observable [45]. The method was applied successfully to mitigate variational ground state quantum simulations with IBM quantum computers [12,24].
Let us now explain the ZNE setting more formally. Here, the goal is to estimate the noiseless expectation value $\mu = \langle 0|^{\otimes Q} U^\dagger X U |0\rangle^{\otimes Q}$ from measurements on a noisy quantum device, where $|0\rangle^{\otimes Q}$ denotes the input state to the quantum circuit. In a realistic scenario, one cannot experimentally access $\mu$ due to the presence of hardware noise. Instead, one measures on the quantum computer the noisy observable $\hat{\mu}$. In the ZNE framework one usually assumes that the noise in the device can be characterized by a single parameter $\lambda$, the noise strength. Moreover, while we assume that the noise cannot be decreased below a certain value $\lambda_0$, there always exist ways to further increase the noise strength (see below). Thus, in ZNE one estimates the noisy observable at increased noise levels and then leverages extrapolation techniques to obtain the noiseless value. To perform the extrapolation, we measure the observable at $n+1$ noise strengths $\lambda_j = c_j \lambda_0$, where the coefficients $1 = c_0 < c_1 < \cdots < c_n$ are called noise levels. This allows us to create a dataset of expectation values estimated at different noise levels, $T^{\mathrm{ZNE}} = \{(c_j, \hat{\mu}_j)\}_{j=0}^{n}$. Richardson extrapolation estimates the noise-free value $\mu$ via $T^{\mathrm{ZNE}}$ as
$$\mu^{\mathrm{ZNE}} = \sum_{j=0}^{n} \gamma_j \hat{\mu}_j, \qquad (1)$$
where
$$\gamma_j = \prod_{m=0,\, m \neq j}^{n} \frac{c_m}{c_m - c_j}. \qquad (2)$$
Under standard assumptions, Richardson extrapolation approximates the true noiseless value $\mu$ with an error scaling as $|\mu^{\mathrm{ZNE}} - \mu| \in O(\lambda_0^{n+1})$ [45]; see Appendix A for details. Nevertheless, in the presence of shot noise, Richardson extrapolation leads to exponential growth of the uncertainty in $\mu^{\mathrm{ZNE}}$ with increasing $n$ [17]. Richardson extrapolation is equivalent to fitting a polynomial of order $n$ to the noisy expectation values, treating the noise level $c_j$ as an independent variable [21]. To address problems with the reliability of Richardson extrapolation for large $n$, lower-order polynomial fits and exponential fits have been proposed [3,12,17].
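The extrapolation step is straightforward to implement: the Richardson weights are the Lagrange interpolation coefficients through the noise levels, evaluated at zero noise. A minimal numpy sketch follows; the quadratic decay model for the noisy observable is a toy assumption used only to exercise the code.

```python
import numpy as np

def richardson_coefficients(c):
    """Weights gamma_j of the Richardson estimate: Lagrange interpolation
    coefficients through the noise levels c_j, evaluated at c = 0."""
    c = np.asarray(c, dtype=float)
    gammas = []
    for j, cj in enumerate(c):
        others = np.delete(c, j)
        gammas.append(np.prod(others / (others - cj)))
    return np.array(gammas)

def zne_estimate(c, mu_hat):
    """Zero-noise estimate as a linear combination of noisy expectation values."""
    return richardson_coefficients(c) @ np.asarray(mu_hat)

# Toy noisy observable decaying polynomially: mu(c) = mu_exact + a*c + b*c^2
mu_exact, a, b = 0.7, -0.3, 0.05
c = np.array([1.0, 1.5, 2.0])
mu_hat = mu_exact + a * c + b * c**2
print(zne_estimate(c, mu_hat))  # recovers 0.7 exactly for this degree-2 model
```

Note that the weights sum to one, so an observable that is independent of the noise level is left unchanged.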
Furthermore, approaches for noise characterized by multiple parameters have been postulated [36].
Regarding how one can increase the noise strength, the original proposal stretches the pulses used to implement gates in order to increase the hardware noise [45]. Nevertheless, it is often more convenient to increase the noise by performing so-called identity insertions [12,17,21]. These methods insert groups of gates equivalent to the identity into the circuit of interest. This does not change the state encoded by the noiseless circuit $U$ but increases the number of gates in the circuit, scaling the noise.
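As a minimal illustration of identity insertion, global unitary folding replaces the circuit $U$ by $U (U^\dagger U)^k$, which is logically the identity-preserving operation but multiplies the gate count (and hence, roughly, the noise) by $2k+1$. The sketch below operates on a list of gate matrices; the two-gate single-qubit circuit is a toy example.

```python
import numpy as np

def fold_global(unitaries, scale_factor):
    """Global unitary folding: U -> U (U^dagger U)^k, which leaves the noiseless
    circuit unchanged while multiplying the gate count by scale_factor."""
    assert scale_factor % 2 == 1, "global folding produces odd scale factors"
    k = (scale_factor - 1) // 2
    inverse = [u.conj().T for u in reversed(unitaries)]
    folded = list(unitaries)
    for _ in range(k):
        folded += inverse + list(unitaries)
    return folded

# Toy 2-gate circuit; folding at scale factor 3 triples the gate count
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
folded = fold_global([H, T], 3)
print(len(folded))  # 6

# The folded gate sequence is logically equivalent to the original circuit
total = np.eye(2)
for u in folded:
    total = u @ total
assert np.allclose(total, T @ H)
```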
ZNE provides a very simple framework to mitigate errors introduced by noise on some observable of interest. However, there are several technical challenges that limit its power. Firstly, ZNE assumes that low enough noise levels are experimentally accessible to enable successful extrapolation to the zero noise limit. Furthermore, the best choice of the ansatz used to perform the extrapolation is not known a priori [17]. Also, a controlled increase of the noise strength is hardware dependent and may be challenging to implement in practice [24].

Clifford Data Regression
Let us now introduce the CDR framework, which provides background for the more general vnCDR and UNITED methods. CDR [10] uses classically simulable training circuits consisting mainly of Clifford gates to learn the effects of the noise on an observable of interest. In particular, given the circuit of interest $U$, one constructs a set of circuits $\{V_i\}_{i=1}^{N_t}$, which are chosen to be similar to $U$. The simplest strategy to construct such a training set is to replace most of the non-Clifford gates in the circuit of interest with Clifford gates [10]. Here we note that to ensure classical simulability of the training circuits, they contain only a small number $N$ of non-Clifford gates, as the cost of their classical simulation grows exponentially with $N$.
For each training circuit $V_i$ one computes via classical simulation [18] the noiseless expectation value of the observable of interest, $\mu_i = \langle 0|^{\otimes Q} V_i^\dagger X V_i |0\rangle^{\otimes Q}$, and uses a quantum computer to estimate the noisy expectation value $\hat{\mu}_i^{\mathrm{CDR}}$ of $X$. Here, we denote by $E^{\mathrm{noiseless}}$ and $E^{\mathrm{noisy}}$ the spaces of noiseless and noisy expectation values, respectively, so that the training set
$$T^{\mathrm{CDR}} = \{(\hat{\mu}_i^{\mathrm{CDR}}, \mu_i)\}_{i=1}^{N_t}$$
is used to train a function $f^{\mathrm{CDR}}: E^{\mathrm{noisy}} \to E^{\mathrm{noiseless}}$, i.e., a function mapping the noisy to the exact expectation values. The simplest form for $f^{\mathrm{CDR}}$ which approximates the relation between the noisy and exact expectation values is
$$f^{\mathrm{CDR}}(\hat{\mu}) = a_1 \hat{\mu} + a_2.$$
Here, the values of the parameters $a_1, a_2$ are found by least-squares linear regression on $T^{\mathrm{CDR}}$. Once the function is trained, CDR estimates the noiseless expectation value for $U$ as $\mu^{\mathrm{CDR}} = f^{\mathrm{CDR}}(\hat{\mu})$. This method has been successfully applied to quantum simulations of condensed matter systems with IBM quantum computers [10,41]. One feature of near-Clifford circuits constructed in this manner is the clustering of their exact expectation values around 0. This can lead to strong clustering of the expectation values in a near-Clifford training set, which is detrimental to the shot efficiency of the approach [11]. Therefore, to improve the shot efficiency, strategies have been proposed to ensure sufficient variance of the exact observable of interest among the training circuits [11]. Alternatively, the usage of Markov Chain Monte Carlo has been proposed to construct training sets in which the training circuits have a low value of a cost function, in the case of variational quantum algorithms [10].
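The regression step of CDR amounts to an ordinary least-squares fit of two parameters. A minimal sketch with synthetic training data follows; the linear contraction model for the noise is a toy assumption, not the noise model used in this work.

```python
import numpy as np

def fit_cdr(noisy, exact):
    """Least-squares fit of f(x) = a1 * x + a2 mapping noisy to exact values."""
    A = np.stack([noisy, np.ones_like(noisy)], axis=1)
    (a1, a2), *_ = np.linalg.lstsq(A, exact, rcond=None)
    return a1, a2

# Toy training data: noise contracts expectation values towards zero with an offset
exact = np.array([0.9, -0.4, 0.1, 0.7, -0.8])
noisy = 0.6 * exact + 0.02
a1, a2 = fit_cdr(noisy, exact)

mu_noisy = 0.6 * 0.5 + 0.02         # noisy value for the circuit of interest
mu_cdr = a1 * mu_noisy + a2         # corrected estimate
print(round(mu_cdr, 6))             # ~0.5, the assumed exact value
```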
The primary challenge in CDR is the construction of the near-Clifford training circuits as the optimal procedure to construct the training set is not known in general. Furthermore, the method requires evaluation of the observable of interest for the training circuits which may result in a large shot cost of the mitigation for large training sets.

Variable noise Clifford Data Regression
Variable noise Clifford Data Regression (vnCDR) conceptually unifies ZNE with CDR by using a training set involving near-Clifford circuits evaluated at several noise levels [28]. The dataset for vnCDR is constructed as follows. First, just like in CDR, one constructs a set of circuits $\{V_i\}_{i=1}^{N_t}$ chosen to be similar to $U$. Subsequently, one uses a classical computer to obtain the noiseless expectation value $\mu_i$ of the observable of interest. Then, for each $V_i$ and each noise level $c_j$ one uses a quantum computer to estimate the noisy expectation values $\hat{\mu}_{i,j}^{\mathrm{vnCDR}}$, just like in ZNE. This allows one to obtain a training dataset of the form
$$T^{\mathrm{vnCDR}} = \{((\hat{\mu}_{i,0}^{\mathrm{vnCDR}}, \ldots, \hat{\mu}_{i,n}^{\mathrm{vnCDR}}), \mu_i)\}_{i=1}^{N_t}.$$
We note that for $n = 0$ one recovers the CDR training set $T^{\mathrm{CDR}}$. The dataset $T^{\mathrm{vnCDR}}$ is then used to fit a linear function $f^{\mathrm{vnCDR}}: (E^{\mathrm{noisy}})^{n+1} \to E^{\mathrm{noiseless}}$ of the form
$$f^{\mathrm{vnCDR}}(\hat{\mu}_0, \ldots, \hat{\mu}_n) = \sum_{j=0}^{n} a_j \hat{\mu}_j.$$
This function is trained using least-squares regression, which leads to the best-fit parameters $a_j^*$. These fitted parameters are then used to obtain the vnCDR estimate of $\mu$ as $\mu^{\mathrm{vnCDR}} = f^{\mathrm{vnCDR}}(\hat{\boldsymbol{\mu}}^{\mathrm{ZNE}})$. Here, we have defined $\hat{\boldsymbol{\mu}}^{\mathrm{ZNE}} = (\hat{\mu}_0^{\mathrm{ZNE}}, \ldots, \hat{\mu}_n^{\mathrm{ZNE}})$ as the vector comprised of the $n+1$ noisy expectation values for the circuit of interest (that is, $\hat{\boldsymbol{\mu}}^{\mathrm{ZNE}}$ collects the data in the ZNE dataset $T^{\mathrm{ZNE}}$). The vnCDR method has been successfully implemented for quantum dynamics simulations with IBM quantum computers [41].
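The vnCDR fit is likewise a least-squares problem, now with one feature per noise level and no constant term. A toy sketch follows; the exponential damping model, the chosen noise levels, and the noise magnitude are illustrative assumptions.

```python
import numpy as np

def fit_vncdr(noisy_levels, exact):
    """noisy_levels: (N_t, n+1) array, one row per training circuit, one column
    per noise level. Fits exact ~ sum_j a_j * noisy_levels[:, j]."""
    coeffs, *_ = np.linalg.lstsq(noisy_levels, exact, rcond=None)
    return coeffs

rng = np.random.default_rng(1)
exact = rng.uniform(-1, 1, size=20)                 # noiseless training values
c = np.array([1.0, 2.0, 3.0])                       # noise levels
# Toy noise: exponential damping with noise level, plus small statistical noise
noisy = exact[:, None] * np.exp(-0.2 * c)[None, :]
noisy += rng.normal(0.0, 0.01, noisy.shape)

a = fit_vncdr(noisy, exact)
mu_hat = 0.3 * np.exp(-0.2 * c)                     # circuit of interest, exact value 0.3
print(float(a @ mu_hat))                            # close to 0.3
```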
As we can see, vnCDR shares several common features with CDR and ZNE. The functional form of the ansatz is a linear combination of noisy estimates which resembles Richardson extrapolation. Therefore, vnCDR can be understood as performing an extrapolation guided by near-Clifford circuits with a similar structure to the circuit of interest. Nevertheless, the fitted ansatz does not require explicit knowledge of the noise levels, unlike the ZNE ansatzes, potentially easing ZNE requirements for precise control of the noise strength. Both CDR and vnCDR use training on data produced by near-Clifford quantum circuits. Consequently, vnCDR can also be treated as adding additional relevant data features to the CDR training set. We note that the ansatz for vnCDR does not include a constant term unlike that of CDR.
As in the case of CDR, the construction of an informative training set is the primary technical challenge. Furthermore, adding additional noise levels to the vnCDR training set increases the shot cost of the mitigation. Therefore, the best choice of noise levels is not known a priori.

Virtual Distillation
Virtual distillation (VD) [22,26] takes a slightly different approach than ZNE and CDR. While the former suppress errors via post-processing, VD attempts to mitigate the effect of noise coherently. Specifically, let us denote by $\rho$ the noisy state obtained at the output of the circuit of interest $U$. In VD, one uses $M > 1$ copies of the noisy state $\rho$ to obtain expectation values of an observable of interest $X$ for a purer, "distilled" state. Specifically, VD mitigates the expectation value of $X$ by computing the quantity
$$\hat{\mu}^{\mathrm{VD}} = \frac{\mathrm{Tr}[\rho^M X]}{\mathrm{Tr}[\rho^M]}.$$
To better understand VD, consider the eigenvalue decomposition $\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|$, with the $p_i$ decreasing with increasing $i$, and $Q$ being the number of qubits. Then,
$$\hat{\mu}^{\mathrm{VD}} = \frac{\sum_i p_i^M \langle\psi_i|X|\psi_i\rangle}{\sum_i p_i^M} = \langle\psi_0|X|\psi_0\rangle + O\!\left((p_1/p_0)^M\right).$$
Therefore, we can see that VD exponentially suppresses contributions to $\hat{\mu}^{\mathrm{VD}}$ from eigenvectors of $\rho$ other than the dominant one, $|\psi_0\rangle$. While increasing $M$ suppresses the errors, for large error rates it also exponentially increases the shot cost of estimating $\hat{\mu}^{\mathrm{VD}}$ to a given accuracy [9].
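For density matrices small enough to write down explicitly, the VD quantity and its exponential suppression of subdominant eigenvectors can be checked directly. A toy single-qubit sketch, with a hypothetical error probability chosen only for illustration:

```python
import numpy as np

def vd_estimate(rho, X, M):
    """VD-mitigated expectation value Tr[rho^M X] / Tr[rho^M]."""
    rho_M = np.linalg.matrix_power(rho, M)
    return (np.trace(rho_M @ X) / np.trace(rho_M)).real

# Toy noisy state: dominant eigenvector |0>, one orthogonal error state with weight p
p = 0.1
rho = np.diag([1 - p, p])
Z = np.diag([1.0, -1.0])

for M in (1, 2, 3):
    print(M, vd_estimate(rho, Z, M))  # approaches <0|Z|0> = 1 exponentially in M
```

With these numbers the $M = 1$ (unmitigated) error is $0.2$, while $M = 3$ leaves an error below $0.003$.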
VD implicitly assumes that the dominant eigenvector $|\psi_0\rangle$ is close to the noiseless state $|\psi_{\mathrm{exact}}\rangle = U|0\rangle^{\otimes Q}$. Therefore, the quality of the VD correction is limited by a noise floor
$$\epsilon = \langle\psi_0|X|\psi_0\rangle - \langle\psi_{\mathrm{exact}}|X|\psi_{\mathrm{exact}}\rangle. \qquad (8)$$
The noise floor can be bounded from above using a quantity called the coherent mismatch, $c = 1 - |\langle\psi_{\mathrm{exact}}|\psi_0\rangle|^2$, i.e.,
$$|\epsilon| \le 2\sqrt{c}\,\|X\|_\infty, \qquad (10)$$
where $\|X\|_\infty$ is the absolute value of the largest eigenvalue of $X$ [27]. When $|\psi_0\rangle$ is an eigenvector of $X$, the bound (10) is replaced [27] by a quadratically better bound, $|\epsilon| \le 2c\,\|X\|_\infty$, leading to better-than-expected performance of VD. For realistic noise, $\epsilon$ is usually larger than 0 [26]. Nevertheless, it was shown that for sufficiently small error rates the noise floor does not prevent VD from significantly reducing the impact of noise [27].
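The noise floor and coherent mismatch are easy to evaluate for a toy dominant eigenvector that is slightly rotated away from the noiseless state; the rotation angle below is an arbitrary illustrative choice.

```python
import numpy as np

theta = 0.05
psi_exact = np.array([1.0, 0.0])                     # noiseless state |0>
psi0 = np.array([np.cos(theta), np.sin(theta)])      # slightly rotated dominant eigenvector
Z = np.diag([1.0, -1.0])

# Noise floor: difference of expectation values in the two states
noise_floor = psi0 @ Z @ psi0 - psi_exact @ Z @ psi_exact
# Coherent mismatch: infidelity between the dominant eigenvector and the noiseless state
mismatch = 1 - abs(psi_exact @ psi0) ** 2

print(noise_floor, mismatch)  # here noise_floor = -2 * mismatch exactly
```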
Here we recall that $\mathrm{Tr}[\rho^M X]$ can be computed using $M$ copies of the noisy state $\rho$ and an ancillary qubit with a controlled permutation of the copies which changes the position of each copy (a so-called controlled derangement). Consequently, $QM + 1$ qubits are required to mitigate a $Q$-qubit state. This number can be reduced to $2Q + 1$ by employing qubit resets and increasing the depth of the circuit performing the controlled derangement [9]. The controlled derangement can be implemented as a circuit of $QM$ controlled swaps. Therefore, the derangement introduces errors to the estimates of $\mathrm{Tr}[\rho^M X]$ and $\mathrm{Tr}[\rho^M]$, which may require correction with other error mitigation methods [26].
Alternative approaches utilizing state verification to implement VD while further reducing required qubit counts were proposed recently [5, 23]. Furthermore, approaches conceptually unifying VD with both ZNE and subspace expansion techniques were proposed [4, 29, 51, 52].

Unifying variable noise Clifford Data Regression methods with Virtual Distillation
As we have discussed in the previous section, results obtained with CDR and vnCDR show that it is beneficial to train an error mitigation scheme with training data obtained from near-Clifford circuits, as well as from data arising from running circuits at different noise levels. Here, we propose to additionally use data from VD with multiple numbers of copies. This conceptually unifies VD, ZNE, and CDR into one mitigation method which we call UNIfied Technique for Error mitigation with Data (UNITED). Let us now explain how the dataset for UNITED is generated. Given a set of near-Clifford circuits $\{V_i\}_{i=1}^{N_t}$ similar to the circuit of interest $U$, we first use a classical computer to calculate the noiseless expectation values (denoted by $\mu_i$) of the observable of interest $X$ for each circuit. Then, we estimate the noisy and VD-mitigated expectation values of $X$ for different noise levels $c_j$ with $j = 0, \ldots, n$ (as in ZNE), and for different numbers of copies $M = 1, \ldots, M_{\max}$. These estimates are made using a noisy quantum computer. Note that $M = 1$ labels the standard noisy estimates, as they are obtained using a single copy. Putting all this together, we obtain a training set
$$T^{\mathrm{UNITED}} = \{(\hat{\boldsymbol{\mu}}_i^{\mathrm{UNITED}}, \mu_i)\}_{i=1}^{N_t},$$
where $\hat{\boldsymbol{\mu}}_i^{\mathrm{UNITED}}$ is a tensor of the noisy and VD-mitigated estimates for $V_i$, and $\hat{\mu}_{i,j,M}$ is the expectation value obtained at noise level $c_j$ and with $M$ copies. We use $T^{\mathrm{UNITED}}$ to train a linear function
$$f^{\mathrm{UNITED}}(\hat{\boldsymbol{\mu}}) = \sum_{j=0}^{n} \sum_{M=1}^{M_{\max}} d_{j,M}\, \hat{\mu}_{j,M}. \qquad (12)$$
Here we use $E^{\mathrm{VD}}$ to denote the space of VD-mitigated expectation values; for simplicity of notation, $E^{\mathrm{VD}}$ also includes the noisy expectation values ($M = 1$). We use least-squares regression to find the values $d_{j,M}^*$ that correspond to the best possible fit of $T^{\mathrm{UNITED}}$. This fitted ansatz is then used to obtain an error-mitigated expectation value of $X$ for the circuit of interest as $\mu^{\mathrm{UNITED}} = f^{\mathrm{UNITED}}(\hat{\boldsymbol{\mu}}^{\mathrm{UNITED}})$. Here, $\hat{\boldsymbol{\mu}}^{\mathrm{UNITED}} = (\hat{\mu}_{0,1}, \ldots, \hat{\mu}_{n,M_{\max}})$ is the tensor of the VD-mitigated and noisy expectation values of $X$ for $U$, where $\hat{\mu}_{j,M}$ is obtained at the noise level $c_j$ with $M$ copies.
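Operationally, the fit flattens the $(n+1) \times M_{\max}$ tensor of estimates for each training circuit into a feature vector and performs a least-squares regression. A toy sketch; the per-channel damping factors and noise magnitude are illustrative assumptions, not derived from any real device.

```python
import numpy as np

def fit_united(features, exact):
    """features: (N_t, (n+1)*M_max) array holding, per training circuit, the
    flattened tensor of noisy and VD-mitigated estimates. Fits a linear map
    (coefficients d_{j,M}) to the exact values."""
    d, *_ = np.linalg.lstsq(features, exact, rcond=None)
    return d

rng = np.random.default_rng(2)
n_train, n_levels, m_max = 30, 2, 2
exact = rng.uniform(-1, 1, n_train)

# Toy data: each (noise level, copy number) channel damps the exact value
# by a different factor; small statistical noise is added on top
damping = np.array([0.5, 0.3, 0.8, 0.7])            # one factor per (j, M) pair
features = exact[:, None] * damping[None, :]
features += rng.normal(0.0, 0.01, features.shape)

d = fit_united(features, exact)
target_features = 0.25 * damping                    # circuit of interest, exact value 0.25
print(float(d @ target_features))                   # close to 0.25
```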
It is not hard to see that under this framework, vnCDR is a special case of UNITED. Indeed, for the special case of $M_{\max} = 1$, $T^{\mathrm{UNITED}}$ becomes the vnCDR dataset $T^{\mathrm{vnCDR}}$. For the reader's benefit, we summarize the similarities and differences between UNITED and other error mitigation methods in Table 1.
While we have presented the UNITED method, we still need to show that it is well-motivated. In what follows we will study special cases of UNITED that illustrate how its different moving parts help to mitigate errors.

Error mitigation from VD data with multiple copy numbers
In this section, we consider a special case of the ansatz used in the UNITED method where we do not use multiple noise levels (n = 0). That is, we show that one can use a combination of the VD mitigated observables with different numbers of copies as an estimate of the noiseless observable that has an error smaller than each VD mitigated observable. For this purpose, we consider a toy noise model where for each gate an error occurs with probability λ, and every different sequence of errors results in an orthogonal state. Additionally, at the end of the circuit we place a coherent error given by a unitary U (θ) = e −iθE , where E is a Hermitian operator. Such an error produces a non-zero noise floor for θ ̸ = 0, while preserving the orthogonality of states corresponding to different sequences of the gate errors. We describe this noise model in more detail in Appendix B.2. Furthermore, in the following, we neglect the errors caused by an imperfect derangement operation and the shot noise.
First, let us consider taking a linear combination of VD-mitigated expectation values of $X$ for $U$ obtained using various $M$ up to $M_{\max}$. We use such a linear combination to construct an estimate of $\mu$ that improves on all the VD-mitigated estimates. That is, we take
$$\mu'_{M_{\max}} = \sum_{M=1}^{M_{\max}} b_M\, \hat{\mu}_M^{\mathrm{VD}}, \qquad (13)$$
where the $b_M$ are coefficients to be found. In Appendix B we show that for this noise model the orthogonal error of a single VD estimate scales as
$$\left|\hat{\mu}_M^{\mathrm{VD}} - \langle\psi_0|X|\psi_0\rangle\right| \in O(\lambda^M), \qquad (14)$$
while one can find coefficients $b_M$ (see Appendix B.2) which give
$$\left|\mu'_{M_{\max}} - \langle\psi_0|X|\psi_0\rangle\right| \in O(\lambda^{M_{\max}+1}). \qquad (15)$$
Therefore, $\mu'_{M_{\max}}$ can be used to suppress the orthogonal error beyond the error of the VD estimate with the largest number of copies. We note that the ansatz used here is similar to the one used by Richardson extrapolation (1), with higher noise levels $c > 1$ replaced by larger copy numbers $M > 1$.
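The effect can be checked numerically in the extreme toy case of a single orthogonal error state, where coefficients computed from the known noise model cancel the orthogonal error entirely; in practice such knowledge is unavailable, which is what motivates learning the coefficients from near-Clifford data. The error weight below is a hypothetical choice.

```python
import numpy as np

lam = 0.05
rho = np.diag([1 - lam, lam])        # one dominant and one orthogonal error state
Z = np.diag([1.0, -1.0])

def vd(M):
    """VD-mitigated estimate with M copies."""
    r = np.linalg.matrix_power(rho, M)
    return (np.trace(r @ Z) / np.trace(r)).real

mu1, mu2 = vd(1), vd(2)
e1, e2 = 1 - mu1, 1 - mu2            # errors of the individual VD estimates

# Coefficients computed from the known model: b1 + b2 = 1 and b1*e1 + b2*e2 = 0
b1 = -e2 / (e1 - e2)
b2 = e1 / (e1 - e2)
mu_combined = b1 * mu1 + b2 * mu2

print(abs(1 - mu_combined))          # far below the M = 2 error e2
```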

Error mitigation from VD data with multiple copy numbers guided by near-Clifford circuits
Here we consider a special case of UNITED which corresponds to taking $n = 0$ and $M_{\max} > 1$. This is equivalent to fitting the ansatz considered above (13) using a training set obtained from near-Clifford circuits. First, we note that even in the case of the simple noise model considered in the previous section, one needs detailed knowledge of the noise model to determine the coefficients $b_M$ (see Appendix B.2). However, such knowledge would require expensive process tomography in the case of a real device. Therefore, we propose to use data from near-Clifford quantum circuits to fit the coefficients $b_M$ that lead to the desired orthogonal error suppression. We note that this corresponds to constructing a training set analogously to the case of vnCDR but replacing the $c > 1$ noisy expectation values with VD-mitigated ones. Furthermore, as in the case of vnCDR, we use a linear ansatz. We call this approach Clifford Guided Virtual Distillation (CGVD).
In more detail, the CGVD training set for an observable of interest $X$ and a circuit of interest $U$ is defined as
$$T^{\mathrm{CGVD}} = \{((\hat{\mu}_{i,1}, \ldots, \hat{\mu}_{i,M_{\max}}), \mu_i)\}_{i=1}^{N_t},$$
where $(\hat{\mu}_{i,1}, \ldots, \hat{\mu}_{i,M_{\max}})$ is a vector of the VD-mitigated and noisy expectation values of $X$ for a near-Clifford circuit $V_i$, obtained at the base noise level $c = 1$ and with $M = 1, \ldots, M_{\max}$ copies. The training set is fitted with a linear function $f^{\mathrm{CGVD}}: (E^{\mathrm{VD}})^{M_{\max}} \to E^{\mathrm{noiseless}}$ of the form
$$f^{\mathrm{CGVD}}(\hat{\mu}_1, \ldots, \hat{\mu}_{M_{\max}}) = \sum_{M=1}^{M_{\max}} b_M\, \hat{\mu}_M,$$
where the parameters $b_M$ are found using least-squares regression. This leads to the best-fit parameters $b_M^*$, which are then used to correct the noisy and VD-mitigated estimates of the expectation value of $X$ for the circuit of interest, i.e., $\mu^{\mathrm{CGVD}} = f^{\mathrm{CGVD}}((\hat{\mu}, \ldots, \hat{\mu}_{M_{\max}}^{\mathrm{VD}}))$. We note that CGVD perfectly mitigates global depolarizing noise, as shown in Appendix C. Such perfect correction is possible because for this noise model the noise floor is zero.

UNITED ansatz for a simple noise model with orthogonal and coherent errors
In this section we demonstrate that by combining ZNE, VD, and CDR, the UNITED method can suppress the error better than its individual parts. For this purpose, we consider again the simple noise model from Section 3.1. Let us construct for this model a dataset $\hat{\boldsymbol{\mu}}^{\mathrm{UNITED}}$, which is created using $n + 1$ noise levels and $M_{\max}$ copy numbers. For a noise level $c_j$ we scale the noise strengths $\lambda$ and $\theta$ as $\lambda_j = c_j \lambda_0$ and $\theta_j = c_j \theta_0$, where $\lambda_0$ and $\theta_0$ are the base noise strengths. As described in detail in Appendix B.3, for each of these noise levels we suppress the orthogonal gate errors with a combination of VD-mitigated observables of the form (13), i.e.,
$$\mu'_{M_{\max},j} = \sum_{M=1}^{M_{\max}} b_M\, \hat{\mu}_{M,j}^{\mathrm{VD}}.$$
From Eq. (15) we obtain
$$\mu'_{M_{\max},j} = \mu + \epsilon_j + O(\lambda_j^{M_{\max}+1}),$$
where $\epsilon_j$ is the noise floor for the noise level $c_j$. As shown in Appendix B.3, for the estimates $\{\mu'_{M_{\max},j}\}_{j=0}^{n}$ obtained in this way one can perform a Richardson extrapolation to additionally suppress the coherent error responsible for the noise floor. By doing that one gets the final estimate of the expectation value of interest,
$$\mu^{\mathrm{est}} = \sum_{j=0}^{n} \gamma_j\, \mu'_{M_{\max},j},$$
where the $\gamma_j$ coefficients are given by Eq. (2), which for large enough $M_{\max}$ has an error $|\mu^{\mathrm{est}} - \mu| \in O(\theta_0^{n+1})$. For a more detailed explanation of the final error see Appendix B.3. This improved estimate has the form of the UNITED ansatz (12), motivating its choice. We note that, similarly to the case of CGVD, even for our simple noise models determining the ansatz coefficients requires detailed knowledge of the noise model, which is usually unavailable for real-world quantum computers. Therefore, in the UNITED method we aim to find these coefficients by learning from near-Clifford training circuits.

General Remarks
It is important to note that although a combination of noisy observables inspired by Richardson extrapolation motivates the form of the ansatz in Eq. (12), it does not guarantee good performance of the new method for several reasons. Firstly, all methods to scale physical noise are only approximate so the noise levels will be imperfectly controlled, leading to imperfect extrapolated estimates. Nevertheless, successful applications of ZNE and vnCDR show that imperfect noise scaling control does not preclude successful error mitigation. Very importantly, training a model based on near-Clifford circuit data is not equivalent to performing the Richardson extrapolation, so the performance of UNITED depends on the ability to construct informative training sets. However, analogous to the case of vnCDR, we expect that fitting ansatz coefficients (12) instead of performing an extrapolation is beneficial as it does not require knowledge of noise levels, reducing the impact of imperfect noise control. Overall, combining observables with different noise levels can be used to construct estimates with less error as shown by Richardson extrapolation. However, the current level of noise in quantum computers is significant. Due to the concentration of the expectation values of observables for high noise levels, a large number of shots is necessary to extract information from highly noisy data [42,50]. This limits the range of the noise levels that can be used in practice. Therefore, at an intuitive level, using VD-mitigated observables is another way to provide data corresponding to different noise levels enhancing previous error mitigation methods using variable noise levels, like ZNE and vnCDR. From this perspective, the UNITED ansatz uses near-Clifford circuits to learn the best way to combine the various observables calculated under different noise levels.
Another important remark is that adding copy data to the training set further increases the shot cost of the mitigation. Therefore, the best choice of noise levels and the optimal number of copies is not known a priori. Despite this, we find that the method gives mitigated estimates that remain stable with increasing $n$ and $M_{\max}$, and that it either outperforms or matches vnCDR for sufficiently large shot numbers.

Numerical Results
In this section we present the performance of ZNE, vnCDR, VD, and UNITED applied to noisy outputs obtained from simulations of a trapped-ion quantum computer with a realistic noise model. We consider the task of mitigating the output of random quantum circuits and the output of the Quantum Alternating Operator Ansatz applied to Max-Cut problems. We use these two settings to benchmark the mitigation methods for various system sizes and different circuit depths. We study several total shot budgets per mitigated expectation value, highlighting the strengths and weaknesses of each method.

Random quantum circuits
We investigate random quantum circuits for various numbers of qubits $Q = 4, 6, 8, 10$ and several circuit depths. We consider random quantum circuits constructed from the native gates of the trapped-ion quantum computer we are modeling, namely $L$ layers of alternating nearest-neighbor Mølmer-Sørensen entangling gates decorated with general single-qubit unitaries; see Figure 2 for details. Both the entangling gates and the single-qubit unitaries are parametrized by randomly chosen angles. Here we simulate random quantum circuits with a number of layers $L = gQ$, taking $g = 1$ and $g = 16$.
We expect that for this setup, random quantum circuits will output states that converge, with increasing $L$, to random states sampled according to the Haar measure [32]. Therefore, the setup is general and challenging, making it relevant for testing error mitigation methods. Furthermore, it is worth noting that deep circuits implemented in a trapped-ion quantum computer present a favorable setting for VD, as this architecture enables all-to-all connectivity, which decreases the depth of the controlled derangement circuit compiled to native gates [30]. Furthermore, it is expected that for sufficiently deep circuits and sufficiently small gate error rates, the coherent mismatch does not prevent a good quality mitigation with VD [27].

Figure 2: Random quantum circuit structure used for our benchmark simulations. Here we show a layer of an exemplary random quantum circuit for $Q = 4$. The layer is built from two layers of alternating nearest-neighbor Mølmer-Sørensen entangling gates decorated with general single-qubit unitaries; $\sigma_Z$ and $\sigma_Y$ are Pauli operators. We choose the angles $\alpha, \beta, \gamma, \delta$ randomly for each random quantum circuit instance. We remark that $R_Y$, $R_Z$, and $XX$ are native gates of a trapped-ion quantum computer.

Random quantum circuit results
First, we analyze the case of $L = Q$ ($g = 1$). For this choice, we consider system sizes of $Q = 4, 6, 8, 10$. For each pair of $L$ and $Q$ values we consider 30 instances of random quantum circuits, and for each random quantum circuit we mitigate the expectation value of $\sigma_Z$ acting on the first qubit of the circuit, $\langle\sigma_Z^1\rangle$. We generate noisy data using a realistic trapped-ion computer noise model described in Appendix H. In general, deep RQCs will produce observables that are clustered around 0, and noise makes this clustering phenomenon even stronger [32,50]. However, the RQCs we explore here are not sufficiently deep to observe this feature; see Appendix G for details.
We perform the mitigation with data obtained using a finite number of shots. For each mitigated expectation value and each error mitigation method, we assign a total number of shots, the total shot budget $N_{\rm tot}$, that the method may use; we consider $N_{\rm tot} = 10^5$–$10^{10}$. Different error mitigation methods require observable estimates from different total numbers of circuits. For ZNE, the circuit of interest must be evaluated at several noise levels. For vnCDR, estimates of the observable of interest are additionally required for many training circuits evaluated at several noise levels. UNITED further requires performing VD mitigation for both the training circuits and the circuit of interest. Consequently, for a fixed $N_{\rm tot}$, the accuracy of individual observable estimates decreases as the total number of circuits required by the method increases. While the shots could be distributed among the different observable evaluations in many ways, for simplicity we distribute them uniformly: for a given method, the number of shots per circuit is $N_{\rm tot}$ divided by the total number of circuits the method must run.
In Figure 3 we plot the mean and maximal absolute errors of $\langle\sigma_Z^1\rangle$. We find that the performance of the error mitigation methods depends strongly on the shot budget. The mitigated observables produced by ZNE and VD converge fastest with increasing $N_{\rm tot}$, and ZNE offers the best correction for the smallest total shot budget $N_{\rm tot} = 10^5$. The convergence of vnCDR is slower, but it outperforms ZNE and the other methods for $N_{\rm tot} = 10^6$ up to $N_{\rm tot} = 10^8$. We find that UNITED exhibits the slowest convergence with increasing $N_{\rm tot}$ but gives the best results for $N_{\rm tot} = 10^{10}$. The quality of the mitigated results decreases with increasing $Q$ as the number of noisy gates increases. Nevertheless, even for the largest considered $Q = 10$, a factor of 20 improvement over the noisy result can be obtained with vnCDR and UNITED. The convergence of the results with $N_{\rm tot}$ is shown more explicitly for $g = 1$, $Q = 4$ in Figure 4(a).
Next, we analyze the case of $L = 16Q$ ($g = 16$). For this choice, we simulate $Q = 4, 6, 8$. Again, for each pair of $L$ and $Q$ values, we consider 30 instances of random quantum circuits and mitigate $\langle\sigma_Z^1\rangle$ for each of them. In this regime, the clustering of the exact observables, due to the properties of RQCs, becomes apparent for the largest system size; see Appendix G for details.
As in the case of $g = 1$, we perform the mitigation for $N_{\rm tot} = 10^5$–$10^{10}$. For $Q = 6$, $L = 96$ and $Q = 8$, $L = 128$ we were unable to obtain improvement over the noisy results with any error mitigation method, as the number of noisy gates is too large. For the smaller system of $Q = 4$, $L = 64$ we find that learning-based error mitigation methods outperform VD and ZNE even for $N_{\rm tot} = 10^5$. For $N_{\rm tot} = 10^7$–$10^{10}$ both vnCDR and UNITED give a factor of $5$–$10$ reduction of the error in the noisy results, and for $N_{\rm tot} = 10^9$–$10^{10}$ we see an improvement of UNITED over vnCDR. Again we find that ZNE and VD estimates converge fastest with increasing $N_{\rm tot}$, while the convergence of UNITED is the slowest but it also provides the best mitigation quality for the largest shot budget. We show the $g = 16$, $Q = 4$ convergence results in detail in Figure 4(b).
In this high-noise regime, learning-based error mitigation is particularly sensitive to shot noise [11], as noisy expectation values in the training set become more clustered with increasing noise strength. Different runs with different shot-noise instances can then lead to large differences in the mitigated observable. Therefore, one can expect the mean and maximal errors of the mitigated expectation values for a sample of RQCs to be very sensitive to finite-sample-size effects. Consequently, the peak around $N_{\rm tot} = 10^8$ observed in the convergence of vnCDR with an increasing number of shots in Figure 4(b) is consistent with the expected behavior in this regime.

QAOA for Max-Cut problems
In this section, we investigate mitigating observables arising from the application of the Quantum Alternating Operator Ansatz (QAOA) [16, 19] to solve Max-Cut problems. We consider a graph $G = (V, E)$ with vertices in $V$, $|V| = Q$, and edges in $E$. The Max-Cut problem is to find a bipartition of the graph that maximizes the number of edges connecting the halves. It can be solved by minimizing the energy of the Hamiltonian
$$H_P = \sum_{(i,j) \in E} \sigma_Z^i \sigma_Z^j,$$
where the sum runs over the edges of the graph. Here we consider Erdős–Rényi graphs [15], which connect each pair of nodes $(i, j)$ with probability $p_e$. The QAOA ansatz is built with $L$ layers as
$$|\psi(\alpha, \beta)\rangle = \prod_{l=1}^{L} e^{-i\beta_l H_M} e^{-i\alpha_l H_P} \, |+\rangle^{\otimes Q},$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_L)$, $\beta = (\beta_1, \beta_2, \ldots, \beta_L)$ are parameters of the ansatz, $H_M = \sum_{i=1}^{Q} \sigma_X^i$ is the mixer Hamiltonian, and $|+\rangle^{\otimes Q}$ is a tensor product of single-qubit $|+\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$ states.
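To make the cost function concrete, the following toy sketch samples an Erdős–Rényi graph and evaluates the Max-Cut energy $\langle H_P \rangle$ on computational-basis states by brute force. The helper names (`erdos_renyi_edges`, `maxcut_energy`, `best_cut_energy`) are our own illustrative choices, not code from this work; the brute-force minimum is only feasible for the small $Q$ considered here.

```python
import itertools
import random

def erdos_renyi_edges(q, p_e, rng):
    """Sample an Erdős–Rényi graph on q vertices: keep each edge with probability p_e."""
    return [(i, j) for i, j in itertools.combinations(range(q), 2) if rng.random() < p_e]

def maxcut_energy(bits, edges):
    """Cost <H_P> = sum over edges (i,j) of s_i * s_j with s_k = +/-1 for bit 0/1.
    A maximum cut minimizes this energy (each cut edge contributes -1)."""
    s = [1 - 2 * int(b) for b in bits]
    return sum(s[i] * s[j] for i, j in edges)

def best_cut_energy(q, edges):
    """Brute-force minimum of <H_P> over all bitstrings (fine for small q)."""
    return min(maxcut_energy(format(x, f"0{q}b"), edges) for x in range(2 ** q))
```

For example, on a 4-cycle the alternating bipartition `"0101"` cuts all four edges, so it attains the brute-force minimum energy of $-4$.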
Here we consider $Q = 4, 6, 8, 10$ and $L = Q$. For each $Q$ we construct 28 Erdős–Rényi graphs with $p_e = 0.5$, obtained by uniform sampling of their distribution. For each graph, we perform noiseless optimization of the Max-Cut cost function $\langle H_P \rangle$ to find optimal parameters $(\alpha_0, \beta_0)$. Then, we mitigate the expectation value $\langle\psi(\alpha_0, \beta_0)| \sigma_Z^i \sigma_Z^j |\psi(\alpha_0, \beta_0)\rangle$ for one randomly selected $(i, j) \in E$. We compile $|\psi(\alpha_0, \beta_0)\rangle$ to the trapped-ion native gate set and obtain the noisy expectation value using the trapped-ion noisy simulator, assuming full device connectivity. Additional details on the compilation are given in Appendix I. For $Q = 10$ the compiled circuits have on average 96 layers of gates and 227 two-qubit Mølmer–Sørensen gates. Since the number of edges in Erdős–Rényi graphs is a random variable, the structure of the compiled circuits varies from instance to instance; hence, we give average values of the depth and the gate count here.

Figure 4: Convergence of mitigated $\langle\sigma_Z^1\rangle$ for random quantum circuits plotted versus the shot budget, using 30 instances of random quantum circuits. In (a) we show results for $Q = 4$, $g = 1$; in (b) for $Q = 4$, $g = 16$, where the number of layers is $L = gQ$. For $Q = 4$, $L = 4$ we find that the ZNE and VD estimates converge first. The estimates produced by vnCDR converge slower but outperform those given by VD and ZNE for shot budgets of $10^7$–$10^{10}$. The convergence of UNITED is the slowest but it gives the best results for $N_{\rm tot} = 10^9, 10^{10}$. For $Q = 4$, $L = 64$ the estimates produced by ZNE and VD offer essentially no mitigation and are outperformed by vnCDR and UNITED for all considered shot budgets. We see an improvement of UNITED over vnCDR for $N_{\rm tot} = 10^9, 10^{10}$. A further discussion of the performance of VD for different shot budgets is given in Appendix E.

QAOA results
We mitigate the expectation value of a random term of $H_P$ for each instance of the Erdős–Rényi graphs. Once again, we employ the trapped-ion noise model described in Appendix H. In Figure 5 we show the mean and maximal absolute errors of the mitigated observables over the 28 graph instances. We again find (as in the case of random quantum circuits) that the performance of the error mitigation methods depends primarily on the shot budget. Interestingly, here VD performs much better than in the case of random quantum circuits, and it is the best-performing method for the $N_{\rm tot} = 10^5, 10^6, 10^8$ shot budgets. ZNE is the second-best method for $N_{\rm tot} = 10^5$ but fails to improve significantly for larger shot budgets. The learning-based methods improve systematically with increasing $N_{\rm tot}$. For the largest considered $N_{\rm tot} = 10^{10}$, UNITED is the best method for $Q = 6, 8, 10$ and vnCDR is the best for $Q = 4$. Figure 6 analyzes the convergence of the various techniques for $Q = 8$ in more detail. Here VD is the best performer at the smallest shot budget ($N_{\rm tot} = 10^5$), followed by ZNE, vnCDR, and UNITED. ZNE shows no improvement with an increased shot budget. UNITED, vnCDR, and VD, on the other hand, continue improving, with UNITED becoming better than vnCDR at $N_{\rm tot} = 10^8$ and vnCDR seeing no further improvement after this point. Similarly, VD sees continued improvements up to $N_{\rm tot} = 10^8$, being the best technique up to that point. From $N_{\rm tot} = 10^9$ and above, UNITED is the best method.
The reason for the better performance of VD when compared to the random quantum circuit case is likely due to the fact that the solutions to the Max-Cut problems are eigenstates of the mitigated observables. Therefore, we expect the noise floor to be reduced relative to the random quantum circuits (see (10) and (11)).
We note that the relative performance of VD with respect to vnCDR depends on the application. In particular, VD performance is better for QAOA than for the RQC experiments. This coincides with a larger improvement of UNITED over vnCDR for QAOA. Therefore, these two numerical studies highlight the utility of unifying error mitigation methods. Leveraging the strengths of ZNE, CDR, and VD into one method appears to allow for better performance across a range of tasks.

Zero Noise Extrapolation
In the simulations described above, when implementing ZNE we use two noise levels c ∈ {1, 2} and linear extrapolation. For the case of random quantum circuits, we compared this approach to exponential extrapolation (defined as in Ref. [17]) and linear extrapolation with c ∈ {1, 2, 3} and found that the method we employ works best in general. We present details of this comparison in Appendix F.
When scaling the noise in our simulations we use factors $c = 2, 3$. To implement $c = 2$ we replace $XX(\theta)$ by $XX(\alpha)XX(\theta - \alpha)$ with $\alpha$ chosen randomly, and we treat $R_Z$ and $R_Y$ in the same way. To implement $c = 3$ we replace $XX(\theta)$ by $XX(\theta)XX(\alpha)XX(-\alpha)$, performing an identity insertion; again, $R_Z$ and $R_Y$ are treated in the same way. Both procedures scale the number of gates of each type in our random quantum circuits by the noise level.

Figure 5: Mean (a) and maximal (b) absolute errors of the expectation value of a random term of the Max-Cut Hamiltonian on 28 instances of Erdős–Rényi graphs, plotted versus $N_{\rm tot}$. ZNE has already converged at $N_{\rm tot} = 10^5$, while VD is the best performer up to its convergence at $N_{\rm tot} = 10^8$, after which ($N_{\rm tot} = 10^9$) UNITED becomes the best performer. vnCDR converges at $N_{\rm tot} = 10^8$.
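The two gate-replacement rules above can be sketched on a toy circuit representation (a list of `(gate_name, angle)` pairs; this is our own illustrative encoding, not the paper's compiler). Because rotations of the same type about a fixed axis commute and their angles add, each replacement preserves the net unitary while multiplying the gate count, and hence the noise, by $c$.

```python
import math
import random

def scale_noise(circuit, c, rng):
    """Scale the effective noise of a parametrized circuit by the level c.
    c = 1: leave each gate G(theta) unchanged.
    c = 2: replace G(theta) by G(alpha) G(theta - alpha), alpha random.
    c = 3: replace G(theta) by G(theta) G(alpha) G(-alpha), an identity insertion.
    `circuit` is a toy list of (gate_name, angle) pairs."""
    out = []
    for name, theta in circuit:
        if c == 1:
            out.append((name, theta))
        elif c == 2:
            alpha = rng.uniform(0, math.pi)
            out += [(name, alpha), (name, theta - alpha)]
        elif c == 3:
            alpha = rng.uniform(0, math.pi)
            out += [(name, theta), (name, alpha), (name, -alpha)]
        else:
            raise ValueError("only noise levels c in {1, 2, 3} are implemented")
    return out
```

A quick check: for any $c$, the angles of each gate type still sum to the original angle, while the gate count is multiplied by $c$.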

variable noise Clifford Data Regression
When constructing the noise-scaled circuits necessary for vnCDR we use the same noise-scaling method as in the ZNE implementation. In our simulations above we use $c \in \{1, 2, 3\}$; for random quantum circuits, we found that this choice of noise levels yields better results than $c \in \{1, 2\}$. The method used to construct the near-Clifford training circuits is discussed in Section 5.5.

Virtual Distillation
We implement VD by computing $\mathrm{Tr}[\rho^M X]/\mathrm{Tr}[\rho^M]$ for the noisy state $\rho$, neglecting errors that occur during the derangement circuit execution. This assumption enables classical simulation of VD for a $Q$-qubit system by simulating $Q$ noisy qubits instead of the $2Q + 1$ that would otherwise be required to simulate the derangement. We analyze the performance of VD with derangement errors for $Q = 4$ systems in Appendix D. We present results of VD using $M = 2$ copies, as we observed that this typically gave the best results; we explore the performance of VD for various numbers of copies in Appendix E.
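Under the noiseless-derangement assumption above, the VD estimator reduces to pure linear algebra on the density matrix. A minimal sketch (the function name `vd_estimate` and the toy state are our own illustration):

```python
import numpy as np

def vd_estimate(rho, X, M):
    """Virtual Distillation estimate Tr[rho^M X] / Tr[rho^M], assuming (as in the
    text) a noiseless derangement, so rho can be manipulated classically."""
    rho_M = np.linalg.matrix_power(rho, M)
    return float(np.real(np.trace(rho_M @ X) / np.trace(rho_M)))

# Toy example: a pure |0> state mixed with white noise. Powering up rho suppresses
# the subdominant eigenvalue, driving the estimate of <Z> toward the
# dominant-eigenvector value as M grows.
rho = 0.9 * np.diag([1.0, 0.0]) + 0.1 * np.eye(2) / 2   # eigenvalues 0.95, 0.05
Z = np.diag([1.0, -1.0])
```

Here the noisy ($M = 1$) expectation is $0.9$, while already $M = 2$ moves the estimate close to the noiseless value of $1$.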

Unified Data-driven Error Mitigation
To implement UNITED we use the same near-Clifford circuits as in vnCDR; the training set construction methods used to build these circuits are discussed in Section 5.5. We choose $M_{\rm max} = 3$ and state preparation noise levels $c \in \{1, 2, 3\}$, as we found that these parameters performed best across the regimes investigated in the RQC experiments.

Constructing the near-Clifford training circuits
To construct the near-Clifford training circuits for the random quantum circuits we use an algorithm from [9]. We replace $XX$ and $R_Y$ gates with their closest Clifford gates of the same type. We also replace a fraction of the $R_Z$ gates by their closest Clifford $R_Z$ gates, leaving $N = 10$ $R_Z$ gates intact.
To maximize the information contained within the training set while minimizing the number of shots used to evaluate the circuits on noisy hardware, we inspect the distribution of the exact expectation values they produce. Initially, we construct 100 near-Clifford circuits using the method described above. We find that, with an increasing number of qubits, an increasing fraction of their exact expectation values approaches 0. This clustering effect is a property of near-Clifford quantum circuits [11]. The shot efficiency of learning-based error mitigation can be improved by constructing a training set with significant variance in the exact expectation values, as explored in detail in [11]. Therefore, to construct the training set in this work, we post-select the $N_t = 50$ near-Clifford circuits with the largest absolute values of the exact observable from the initial 100. We find that this strategy does not decrease the quality of the mitigation while decreasing the shot cost; see Appendix G for details.
However, the strategy outlined above is sub-optimal when applied to the Max-Cut QAOA problems, as for $Q = 10$ we see instances for which all 100 training circuits generated with this approach have an exact expectation value of 0. Therefore, we use a more advanced training set construction strategy, based on the method proposed in [11]. We use $N_t = 100$ training circuits with $X^{\rm exact}$ evenly distributed between $-0.495$ and $0.495$, i.e., $X^{\rm exact} \in \{-0.495 + 0.01n\}_{n=0}^{99}$. To construct these training circuits we start by building $10^5$ near-Clifford circuits using the same gate replacement strategy as outlined above, including the same number of non-Clifford gates ($N = 10$). We then post-select these circuits to match the desired distribution of exact expectation values. We emphasize that this process is scalable, as near-Clifford circuits are classically simulable.
While previous studies used Markov-chain Monte Carlo sampling to generate such a distribution [9], the simpler method of post-selection can also produce the desired distribution in certain cases.
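The gate-replacement step above can be sketched as follows. This is a toy illustration on the `(gate_name, angle)` circuit encoding (our own choice, not the code of [9]); for simplicity it snaps every replaced angle to a multiple of $\pi/2$, whereas the actual Clifford replacement depends on the gate type.

```python
import math
import random

def closest_clifford_angle(theta):
    """Closest Clifford rotation angle, taken here as a multiple of pi/2 (a
    simplification; the true Clifford set depends on the gate type)."""
    return round(theta / (math.pi / 2)) * (math.pi / 2)

def make_near_clifford(circuit, n_keep, rng):
    """Replace all but n_keep randomly chosen RZ gates by their closest Clifford
    rotations; all other gates are always replaced. In the text, n_keep = N = 10."""
    rz = [k for k, (name, _) in enumerate(circuit) if name == "RZ"]
    keep = set(rng.sample(rz, min(n_keep, len(rz))))
    return [(name, theta) if (name == "RZ" and k in keep)
            else (name, closest_clifford_angle(theta))
            for k, (name, theta) in enumerate(circuit)]
```

Post-selection then amounts to classically simulating many such circuits (cheap, since they are near-Clifford) and keeping those whose exact expectation values match the desired distribution.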

Shot budgeting
Each technique described above requires a certain total number of circuits, and therefore a certain number of observable evaluations, to mitigate the observable of interest. For VD, it is necessary to perform measurements on 2 different circuits obtained with the controlled derangement [26]. For ZNE, one needs to measure the expectation values of $n + 1$ circuits built from the circuit of interest at $n + 1$ noise levels. For vnCDR, we additionally need to evaluate each training circuit at the $n + 1$ noise levels, so in total vnCDR requires $(n + 1)(N_t + 1)$ circuit evaluations. Finally, for UNITED we additionally mitigate the observable estimates using VD with $M_{\rm max} - 1$ different numbers of copies, for the training circuits and the circuit of interest, at each of the $n + 1$ noise levels. Therefore, the number of circuit evaluations required by UNITED is $(n + 1)(N_t + 1)(2M_{\rm max} - 1)$.
In this work we consider several total shot budgets $N_{\rm tot}$. For each method, we distribute the shots uniformly across measurements, dividing $N_{\rm tot}$ by the number of measurements. We note that this strategy may not be optimal, as it may be beneficial to distribute the shots unequally between noise levels or copy numbers. One could also imagine fine-tuning the values of $n$, $N_t$, and $M_{\rm max}$ for a given $N_{\rm tot}$; we leave a systematic investigation of more advanced budgeting strategies for future work.
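The uniform budgeting rule can be written down directly from the circuit counts above (the function name and default parameters, matching $n + 1 = 3$ noise levels, $N_t = 50$, and $M_{\rm max} = 3$ from the text, are our own illustrative packaging):

```python
def shots_per_circuit(n_tot, method, n_levels=3, n_train=50, m_max=3):
    """Uniform shot budgeting: divide the total budget N_tot by the number of
    distinct circuit evaluations each method needs, as counted in the text.
    n_levels is n + 1 noise levels; n_train is N_t training circuits."""
    n_circuits = {
        "VD": 2,                                          # two derangement circuits
        "ZNE": n_levels,                                  # n + 1 noise-scaled circuits
        "vnCDR": n_levels * (n_train + 1),                # (n + 1)(N_t + 1)
        "UNITED": n_levels * (n_train + 1) * (2 * m_max - 1),
    }[method]
    return n_tot // n_circuits
```

This makes the trade-off explicit: for the same $N_{\rm tot}$, methods requiring more circuits get fewer shots per circuit, and hence noisier individual observable estimates.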

Discussion
The ability to efficiently and accurately mitigate hardware errors will be essential to achieving practical applications of near-term quantum computers. A number of approaches have been proposed to fill this need, each bringing useful new ideas. In this work, we have discussed some state-of-the-art methods and examined how the approaches might best be combined. Specifically, we have considered Zero Noise Extrapolation (ZNE), variable noise Clifford Data Regression (vnCDR), and Virtual Distillation (VD), and proposed two new methods that unify the ideas behind these approaches. We applied these methods to the mitigation of errors in local observables measured on states prepared with random circuits and on states produced by applying QAOA to Max-Cut problems. As these methods have different advantages, disadvantages, and resource requirements, we evaluated their performance with $10^5$–$10^{10}$ total shots to mitigate the expectation value of an observable of interest, for up to $Q = 10$ qubits at depth $L = Q$ and up to $Q = 8$ at depth $L = 16Q$. We used a realistic simulation of a noisy trapped-ion device, chosen because the trapped-ion architecture enables all-to-all connectivity.
Combining the near-Clifford circuit data and the multi-copy data from VD, we formulated Clifford-guided VD (CGVD), which uses Clifford data to guide a combination of VD-mitigated observables to obtain an improved estimate. We then generalized this method to perform a Clifford-guided Richardson-like extrapolation on VD-mitigated data at variable noise levels. This can be thought of as first suppressing the effect of orthogonal errors, by using a combination of VD-mitigated observables, and then extrapolating to the zero-noise limit. The resulting method, the UNIfied Technique for Error mitigation with Data (UNITED), can be viewed as a unification of the CDR, ZNE, and VD methods; furthermore, vnCDR is a special case of this general method.
We find that UNITED is the technique that offers the best mitigation in both the RQC and QAOA numerical experiments at $N_{\rm tot} = 10^{10}$. We also find that vnCDR is preferable at budgets $N_{\rm tot} = 10^6$–$10^8$ in the case of RQCs. For $N_{\rm tot} = 10^5$, ZNE is optimal for circuits with depth $L = Q$, while vnCDR is optimal for the noisier $L = 16Q$. For the QAOA results, VD is optimal for all shot budgets apart from the largest, for which UNITED is the best method. This is because the noise floor is significantly lower in QAOA, due to the fact that the solution of the algorithm is also an eigenstate of the measurement operator [27]. We observe that increasing the shot budget improves the error-mitigated expectation value until a performance ceiling is reached. Different techniques have different performance ceilings, and we find that UNITED, being able to draw from more data than VD and ZNE, reaches better mitigated values when more resources are available.
We note it may be difficult to assess when the performance ceiling of a given method has been reached. This is an issue particularly for methods that lack analytical scaling guarantees, or when such guarantees only exist under restrictive noise assumptions. In this case, we propose using the convergence of the mitigated observable with the number of shots employed to judge if the performance ceiling has been reached.
As the available quantum devices continue to grow in qubit number and reduce the rate of hardware errors, practical applications of quantum computing grow ever closer. Due to the high number of qubits that are required for full error correction, it seems likely that error mitigation will play a pivotal part in those early applications. This work provides a useful reference for those who are seeking such applications to determine which combination of strategies is best for their use case.
A natural question one might ask is if there are other forms of data that would be useful to include in an error mitigation approach. Error mitigation for deep quantum circuits is challenging, but perhaps some other source of data (other than those we considered) would extend the depth for which error mitigation is practical. Further unification with other error mitigation techniques such as noise-resilient quantum compiling [7,8,25,34,39], quasi-probabilistic error decomposition [44,45], verified phase estimation [35], truncated Neumann series [49] and specific approaches leveraging symmetries and/or post-selection [1,4,31,37] is likely to be a promising research direction.
Another interesting direction for future work is to consider how best to determine when to measure new expectation values (i.e., more near-Clifford circuits, more noise levels, more copies of the state, etc.) versus expending more shots per expectation value, in order to get the best mitigation possible with fixed resources. Finally, we note that an investigation of the effects of a noisy controlled derangement on VD and UNITED performance, including large circuits that are challenging for noisy classical simulation, will be necessary for a full understanding of these methods' potential.

A Cancellation of errors using Richardson extrapolation
By using Richardson extrapolation one can cancel the leading orders of a Taylor expansion of a noisy observable expectation value $\hat{\mu}^{\rm ZNE}(\lambda)$ in the noise strength $\lambda$, thereby suppressing its error. The order to which one can suppress the error depends on how many noisy observables one has access to; each observable must be evaluated at a different, known noise strength. More precisely, Richardson extrapolation uses a combination of $n + 1$ noisy expectation values $\hat{\mu}^{\rm ZNE}_0, \hat{\mu}^{\rm ZNE}_1, \ldots, \hat{\mu}^{\rm ZNE}_n$ of the mitigated observable, with noise strengths $\lambda_0 < \lambda_1 < \cdots < \lambda_n$, to suppress the error up to $O(\lambda_0^{n+1})$. Writing the series expansion of the noisy observable as a function of $\lambda$ we have
$$\hat{\mu}^{\rm ZNE}(\lambda) = \mu + \sum_{i=1}^{\infty} \alpha_i \lambda^i,$$
where $\mu$ is the exact expectation value and the $\alpha_i$ are expansion coefficients. We introduce noise levels $c_0, c_1, \ldots, c_n$ such that $\lambda_j = c_j \lambda_0$, with $c_0 = 1$, which enables us to write
$$\hat{\mu}^{\rm ZNE}_j = \mu + \sum_{i=1}^{\infty} \alpha_i c_j^i \lambda_0^i.$$
To build intuition, we first consider the combination of just two noisy observables, $\hat{\mu}^{\rm ZNE}_0$ and $\hat{\mu}^{\rm ZNE}_1$, with coefficients $\gamma_0$ and $\gamma_1$:
$$\gamma_0 \hat{\mu}^{\rm ZNE}_0 + \gamma_1 \hat{\mu}^{\rm ZNE}_1 = (\gamma_0 + \gamma_1)\mu + (\gamma_0 + \gamma_1 c_1)\,\alpha_1 \lambda_0 + O(\lambda_0^2).$$
Therefore, to cancel terms up to order $O(\lambda_0^2)$ we require
$$\gamma_0 + \gamma_1 = 1, \qquad \gamma_0 + \gamma_1 c_1 = 0.$$
One can generalize this procedure to larger numbers of noise levels. In such a case we take a combination of the $n + 1$ estimates,
$$\hat{\mu}^{\rm ZNE} = \sum_{j=0}^{n} \gamma_j \hat{\mu}^{\rm ZNE}_j,$$
with $\gamma_j$ being the coefficients. Expanding this sum in powers of $\lambda_0$ and using the noise levels gives
$$\hat{\mu}^{\rm ZNE} = \mu \sum_{j=0}^{n} \gamma_j + \sum_{i=1}^{\infty} \alpha_i \lambda_0^i \sum_{j=0}^{n} \gamma_j c_j^i. \tag{32}$$
Therefore, to obtain an error of order $O(\lambda_0^{n+1})$, Richardson extrapolation chooses the $\gamma_j$ coefficients to cancel the leading $n$ orders of the error in (32). This leads to a system of linear equations for the $\gamma_j$:
$$\sum_{j=0}^{n} \gamma_j = 1, \qquad \sum_{j=0}^{n} \gamma_j c_j^i = 0 \quad \text{for } i = 1, \ldots, n. \tag{33}$$
To sum up, Richardson extrapolation combines noisy estimates obtained at different noise levels so as to cancel the leading error terms.
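The linear system (33) is small and can be solved directly. A minimal sketch (the helper names `richardson_coefficients` and `zne_estimate` are our own):

```python
import numpy as np

def richardson_coefficients(c):
    """Solve the system (33): sum_j gamma_j = 1 and sum_j gamma_j c_j^i = 0 for
    i = 1..n, given noise levels c = (c_0, ..., c_n) with c_0 = 1."""
    c = np.asarray(c, dtype=float)
    n = len(c) - 1
    A = np.vander(c, n + 1, increasing=True).T   # A[i, j] = c_j ** i
    b = np.zeros(n + 1)
    b[0] = 1.0                                    # the i = 0 row enforces sum = 1
    return np.linalg.solve(A, b)

def zne_estimate(noisy_values, c):
    """Zero-noise estimate: the gamma-weighted combination of the noisy values."""
    return float(np.dot(richardson_coefficients(c), noisy_values))
```

For two noise levels $c = (1, 2)$ this recovers the familiar linear-extrapolation weights $\gamma = (2, -1)$, and an observable with a purely linear noise dependence is corrected exactly.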

B Suppressing error through a combination of Virtual-Distillation-mitigated observables
Here we consider a combination of VD-mitigated observables $\hat{\mu}^{\rm VD}_M$ obtained for different numbers of VD copies $M = 1, 2, \ldots, M_{\rm max}$. By analogy with Richardson extrapolation, we look for a combination with an error smaller than that of each individual observable. As proved below, Virtual Distillation suppresses the contributions to the observable from eigenstates of the noisy state other than the dominant one, leading to
$$\hat{\mu}^{\rm VD}_M = \mu + \epsilon + O(\delta^M), \tag{35}$$
where $\mu$ is the exact observable, $\epsilon$ is the noise floor defined in (8) and repeated here for convenience,
$$\epsilon = \langle\psi_0|X|\psi_0\rangle - \mu,$$
with $|\psi_0\rangle$ the dominant eigenvector of the noisy state, and $\delta$ is the sum of the eigenvalues of the mitigated state (6) other than the dominant one.

B.1 Proof of VD error scaling with the number of state copies
To prove (35) we use the eigendecomposition of the noisy state $\rho$ given by Eq. (6),
$$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|, \qquad p_0 > p_1 \geq p_2 \geq \cdots.$$
Then we have
$$\hat{\mu}^{\rm VD}_M = \frac{\mathrm{Tr}[\rho^M X]}{\mathrm{Tr}[\rho^M]} = \frac{\sum_i p_i^M X_i}{\sum_i p_i^M},$$
where $X_i = \langle\psi_i|X|\psi_i\rangle$. Transforming it further, we obtain
$$\hat{\mu}^{\rm VD}_M - X_0 = \frac{\sum_{i \geq 1} p_i^M (X_i - X_0)}{\sum_i p_i^M}.$$
Therefore, we obtain
$$\big|\hat{\mu}^{\rm VD}_M - X_0\big| \leq \frac{2\|X\|_\infty \sum_{i \geq 1} p_i^M}{p_0^M},$$
where we use that $p_i \geq 0$, and $\|X\|_\infty$ is the infinity norm of $X$. But
$$\sum_{i \geq 1} p_i^M \leq \Big(\sum_{i \geq 1} p_i\Big)^M = \delta^M,$$
where we again use that $p_i \geq 0$. Combining the two bounds leads to
$$\big|\hat{\mu}^{\rm VD}_M - X_0\big| \leq 2\|X\|_\infty \left(\frac{\delta}{p_0}\right)^M.$$
Therefore, if $\|X\|_\infty$ is finite, as for a Pauli string, and recalling that $X_0 = \mu + \epsilon$, we obtain (35), which was to be proved.
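The geometric suppression derived above is easy to verify numerically for a state written in its eigenbasis. A toy check (the eigenvalues, eigenstate expectation values, and function name `vd_error` are invented for illustration):

```python
import numpy as np

def vd_error(p, x_diag, M):
    """|mu_M^VD - X_0| for a state diagonal in its eigenbasis: a toy check of the
    bound 2 ||X||_inf (delta / p_0)^M derived above."""
    p = np.asarray(p, dtype=float)
    x_diag = np.asarray(x_diag, dtype=float)
    est = np.dot(p ** M, x_diag) / np.sum(p ** M)
    return abs(est - x_diag[0])

p = np.array([0.85, 0.10, 0.05])       # eigenvalues of rho, p_0 dominant
x = np.array([1.0, -1.0, 0.3])         # X_i = <psi_i| X |psi_i>
delta = p[1:].sum()
errors = [vd_error(p, x, M) for M in (1, 2, 3, 4)]
bound = [2 * np.max(np.abs(x)) * (delta / p[0]) ** M for M in (1, 2, 3, 4)]
```

The error falls roughly by a factor $\delta / p_0$ per extra copy and stays below the analytic bound at every $M$.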

B.2 Error suppression with a combination of VD terms
Here, to motivate the CGVD ansatz (16), which is a linear combination of VD-mitigated observables with up to $M_{\rm max}$ copies, we find a combination of VD-mitigated observables that estimates the noise-free observable of interest with an error smaller than that of its parts. We start with $M_{\rm max} = 2$ and then generalize this approach to more copies. We consider a simple noise model in which, for each gate, an error occurs with probability $\lambda$, and every two different sequences of errors result in orthogonal states. Additionally, at the end of the circuit we place a coherent error given by a unitary $U(\theta) = e^{-i\theta E}$, where $E$ is a Hermitian operator. Such an error produces a non-zero noise floor for $\theta \neq 0$, while preserving the orthogonality of the states corresponding to different sequences of gate errors. For a circuit with $G$ gates the noisy output $\rho$ can be written as
$$\rho = (1 - \lambda)^G \rho_0 + \sum_{k} q_k(\lambda)\, |\psi_k\rangle\langle\psi_k|,$$
where the sum runs over the possible error sequences, the $q_k(\lambda)$ are their probabilities, and all of the states are mutually orthogonal. For a small enough error rate, $\rho_0$ is the dominant eigenvector, i.e., $\rho_0 = |\psi_0\rangle\langle\psi_0|$. We consider taking linear combinations of observables mitigated with VD using different numbers of copies up to $M = M_{\rm max}$. That is,
$$\hat{\mu} = \sum_{M=1}^{M_{\rm max}} b_M \hat{\mu}^{\rm VD}_M, \tag{45}$$
where the $b_M$ are coefficients to be found. We show that this ansatz contains solutions that lead to an estimate of the observable of interest with a smaller error than $\hat{\mu}^{\rm VD}_{M_{\rm max}}$.

B.2.1 Combining two VD-mitigated observables
To build intuition we first consider the combination of VD-mitigated observables with $M = 1$ and $M = 2$ copies. Expanding $\hat{\mu}^{\rm VD}_1$ in powers of $\lambda$ we can write
$$\hat{\mu}^{\rm VD}_1 = \mu + \sum_{k \geq 1} \alpha^{(1)}_k \lambda^k,$$
where the $\alpha^{(1)}_k$ are expansion coefficients. In the following we assume that $\alpha^{(1)}_1 \neq 0$, so that the error is of order $O(\lambda)$. A similar expression holds for $\hat{\mu}^{\rm VD}_2$,
$$\hat{\mu}^{\rm VD}_2 = \mu + \sum_{k \geq 1} \alpha^{(2)}_k \lambda^k,$$
with the coefficients $\alpha^{(2)}_k$ defined analogously to the $\alpha^{(1)}_k$, and with $\alpha^{(2)}_1 \neq \alpha^{(1)}_1$. Let us now show how one can take a linear combination of $\hat{\mu}^{\rm VD}_1$ and $\hat{\mu}^{\rm VD}_2$ to further suppress the error. Consider the combination
$$\hat{\mu} = b_1 \hat{\mu}^{\rm VD}_1 + b_2 \hat{\mu}^{\rm VD}_2,$$
where $b_1$ and $b_2$ are coefficients to be found. Expanding this out we obtain
$$\hat{\mu} = (b_1 + b_2)\mu + \big(b_1 \alpha^{(1)}_1 + b_2 \alpha^{(2)}_1\big)\lambda + O(\lambda^2).$$
Consider taking
$$b_1 + b_2 = 1, \qquad b_1 \alpha^{(1)}_1 + b_2 \alpha^{(2)}_1 = 0.$$
Solving for $b_1$ and $b_2$ gives
$$b_1 = \frac{\alpha^{(2)}_1}{\alpha^{(2)}_1 - \alpha^{(1)}_1}, \qquad b_2 = \frac{-\alpha^{(1)}_1}{\alpha^{(2)}_1 - \alpha^{(1)}_1}.$$
Putting these coefficients into the combination leads to
$$\hat{\mu} = \mu + O(\lambda^2).$$
Therefore, we have shown that taking a linear combination of VD-mitigated observables further reduces the error on the estimate of the noiseless observable. Having seen how the errors can be canceled for $M_{\rm max} = 2$, let us generalize this error cancellation to the many-copy case.

B.2.2 Generalizing to the many copy case
Analogously to the case of $M_{\rm max} = 2$, one can use the combination (45) to suppress the error for larger $M_{\rm max}$. For $M$ copies and our simple noise model we have
$$\hat{\mu}^{\rm VD}_M = \mu + \sum_{k \geq 1} \alpha^{(M)}_k \lambda^k,$$
and choosing the coefficients $b_M$ to cancel the leading $M_{\rm max} - 1$ orders in $\lambda$ yields an estimate with error $O(\lambda^{M_{\rm max}})$. One can then additionally apply Richardson extrapolation over noise levels $c_j$, obtaining a mitigated estimate of the observable
$$\hat{\mu} = \sum_{j=0}^{n} \sum_{M=1}^{M_{\rm max}} \gamma_j b_M \hat{\mu}^{\rm VD}_{M,j}, \tag{75}$$
where
$$\sum_{j=0}^{n} \gamma_j = 1, \qquad \sum_{j=0}^{n} \gamma_j c_j^k = 0 \quad \text{for } k = 1, \ldots, n. \tag{74}$$
This is a linear combination of VD-mitigated observables obtained with different numbers of copies at different coherent noise levels, and it has exactly the same form as the UNITED ansatz (12). As discussed in Appendix A, this application of Richardson extrapolation suppresses the error, assuming that the orthogonal errors are small enough. We note that, as long as the orthogonal error rates are not too large, one can scale both $\theta$ and $\lambda$, e.g., choosing $\theta_j = c_j \theta_0$ and $\lambda_j = c_j \lambda_0$, and obtain the same error scaling, since for large enough $M_{\rm max}$ the error will always be dominated by the coherent mismatch.

C Correcting for global depolarizing noise with CGVD or UNITED
Here we motivate the UNITED ansatz by considering its performance in mitigating the action of a global depolarizing noise channel on an observable of interest. The ansatzes used in CDR and vnCDR have been shown to perfectly correct global depolarizing noise, and we use this fact to motivate the form of the CGVD ansatz. We note that in this setting there is no limit set by coherent mismatch when using VD, so one expects a combination of VD and CDR to perform well. First, we consider CGVD, since the treatment is the same as in the UNITED case and can be generalized very simply. Consider a global depolarizing noise channel acting on all $Q$ qubits. For a state $\rho = |\psi\rangle\langle\psi|$ produced by a quantum circuit $U$, this channel takes the form
$$\rho \mapsto \rho_\lambda = (1 - \lambda)\rho + \lambda \frac{I}{2^Q}. \tag{78}$$
The effect of the above channel on an observable of interest $X$ is
$$\mathrm{Tr}[\rho_\lambda X] = (1 - \lambda)\mu + \lambda \frac{\mathrm{Tr}[X]}{2^Q},$$
where $\mu = \mathrm{Tr}(\rho X)$. We consider the CGVD ansatz that combines observables evaluated at several numbers of copies in the VD circuit structure. For simplicity, we omit the small changes in the error rates for different numbers of copies due to the swap noise. We can rewrite (16) as
$$\mu'_{M_{\rm max}} = \sum_{M=1}^{M_{\rm max}} a_M \hat{\mu}^{\rm VD}_M,$$
where $\mu'_{M_{\rm max}}$ is the estimate of the observable of interest and the $a_M$ are parameters fitted using least-squares regression on the data generated by the near-Clifford training circuits mitigated using VD with up to $M_{\rm max}$ copies.
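The key property used here, that a global depolarizing channel rescales the expectation value of a traceless observable by $(1 - \lambda)$ so that a single multiplicative fit parameter (the CDR-style linear ansatz) corrects it exactly, can be checked directly. A toy verification on two qubits (the function name `depolarize` and the example state are our own):

```python
import numpy as np

def depolarize(rho, lam):
    """Global depolarizing channel (78): rho -> (1 - lam) rho + lam I / 2^Q."""
    d = rho.shape[0]
    return (1 - lam) * rho + lam * np.eye(d) / d

# Q = 2 qubits: |00> and the traceless observable sigma_Z on the first qubit.
psi = np.zeros(4); psi[0] = 1.0
rho = np.outer(psi, psi)
Z1 = np.diag([1.0, 1.0, -1.0, -1.0])
lam = 0.3
mu = float(np.trace(rho @ Z1))                       # exact expectation value
mu_noisy = float(np.trace(depolarize(rho, lam) @ Z1))  # equals (1 - lam) * mu
```

Dividing the noisy value by $(1 - \lambda)$, which is what the regression effectively learns, recovers $\mu$ exactly.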
We can express an observable mitigated with VD using $M$ copies as
$$\hat{\mu}^{\rm VD}_M = \frac{\mathrm{Tr}[\rho_\lambda^M X]}{\mathrm{Tr}[\rho_\lambda^M]}.$$
We can evaluate $\hat{\mu}^{\rm VD}_M$, assuming $\rho$ has been acted upon by the global depolarizing channel defined in (78), focusing first on the denominator: for pure $\rho$, the eigenvalues of $\rho_\lambda$ are $(1 - \lambda) + \lambda/2^Q$ (once) and $\lambda/2^Q$ (with multiplicity $2^Q - 1$), so
$$\mathrm{Tr}[\rho_\lambda^M] = \Big((1 - \lambda) + \frac{\lambda}{2^Q}\Big)^M + (2^Q - 1)\Big(\frac{\lambda}{2^Q}\Big)^M.$$

Figure 7: Comparing VD and UNITED error mitigation with (the brown and the red curves) and without controlled-derangement noise (the green and the blue curves). We perform the comparison for random quantum circuits with $Q = 4$ and $L = 5$, plotting the error of the mitigated and noisy $\langle\sigma_Z^1\rangle$ averaged over 30 instances of the random quantum circuits. The derangement noise affects the VD and UNITED implementations, decreasing their performance for large $N_{\rm tot}$. Nevertheless, even with the derangement noise, we obtain a factor of 2.5 improvement for VD with $N_{\rm tot} \geq 10^6$ and a factor of 1.2 improvement of UNITED over vnCDR for the largest considered $N_{\rm tot} = 10^{10}$. For reference, we also show the noisy, ZNE, and vnCDR results. Here, ZNE is performed using linear extrapolation with noise levels $c = \{1, 2\}$. VD uses $M = 3$ copies. We obtain the vnCDR correction using noise levels $c = \{1, 2, 3\}$ and training circuits with $N = 30$ non-Clifford $R_Z$ gates left intact. In the case of UNITED we have $c = \{1, 2, 3\}$, $M_{\rm max} = 3$, and $N = 30$.
D Effects of a noisy controlled derangement
The impact of the derangement noise can be reduced by improved compilation of the controlled derangement circuit to native gates [26] and by using further error mitigation of the controlled derangement noise [26]. Another interesting direction is VD implementations based on state verification instead of controlled derangement [5, 23].

E Optimal parameters for Virtual Distillation
Here we systematically explore the convergence of the VD estimates with an increasing number of copies $M$. In Figure 8 we plot the mean and maximal errors of the VD estimate of $\langle\sigma_Z^1\rangle$ for 30 random quantum circuits and $M = 2$–$6$, showing results for the smallest ($N_{\rm tot} = 10^5$) and the largest ($N_{\rm tot} = 10^{10}$) shot budgets considered here. We find that for $Q = 4$–$10$ and $g = 1$, $M = 2$ typically gives optimal results for both shot budgets. We note that for $Q = 4$–$10$, $g = 1$ and $N_{\rm tot} = 10^{10}$, VD with $M = 2$–$5$ gives very similar results, indicating that $M = 2$ is large enough to reach the noise floor. Our findings agree with [27], which argues that, with the exception of highly noisy circuits, $M \leq 3$ is sufficient to reach the noise floor. For the noisier $Q = 4$, $g = 16$ circuits we find that the optimal $M$ depends strongly on $N_{\rm tot}$. For $N_{\rm tot} = 10^5$, $M = 2$ is optimal but gives on average only a small factor of 1.2 improvement over the noisy results. For $N_{\rm tot} = 10^{10}$, $M = 4$ is optimal, giving a factor of 2.5 improvement. Based on these results, we use $M = 2$ in the main text, as it is optimal for most circuits. We note that although VD performance can be improved with $M = 4$ for $Q = 4$, $g = 16$ and large $N_{\rm tot}$, this improvement is not large enough to outperform vnCDR and UNITED.

F Comparison of ZNE implementations
For ZNE we compare the mitigation performance when using two or three noise levels, c ∈ {1, 2} and c ∈ {1, 2, 3} respectively. For three noise levels, we apply both linear and exponential extrapolations, while data from two noise levels only allows us to perform a linear extrapolation. We benchmark the extrapolation approaches when mitigating observables produced by RQCs, taking N_tot = 10^5, 10^6, ..., 10^10. For L = Q and Q = 4, 6, 8, 10 we find that, with the exception of Q = 4, N_tot = 10^5, using two noise levels leads to better mitigation quality. The exponential extrapolation gives results an order of magnitude worse than the linear one.
The results for Q = 4, 10 are shown in Figure 9. For L = 16Q and Q = 4, 6, 8, we find that both linear-extrapolation variants of ZNE lead to similar mitigation quality. The exponential extrapolation again produces results an order of magnitude worse than the linear extrapolation.
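The two extrapolation ansätze compared here can be sketched in a few lines. This is a minimal illustration on made-up, noiseless data (the decay rate and helper names are our own assumptions): given expectation values measured at noise scale factors c, ZNE fits a model in c and evaluates it at c = 0.

```python
import numpy as np

def zne_linear(c, y):
    # Fit y = slope * c + intercept and report the zero-noise value.
    slope, intercept = np.polyfit(c, y, 1)
    return intercept

def zne_exponential(c, y):
    # Fit y = a * exp(b * c) via a linear fit of log|y| (assumes fixed sign).
    b, log_a = np.polyfit(c, np.log(np.abs(y)), 1)
    return np.sign(y[0]) * np.exp(log_a)

exact = 0.8
decay = lambda c: exact * np.exp(-0.15 * c)   # assumed exponential noise model
c2, c3 = np.array([1.0, 2.0]), np.array([1.0, 2.0, 3.0])

print("linear, c={1,2}:   ", zne_linear(c2, decay(c2)))
print("linear, c={1,2,3}: ", zne_linear(c3, decay(c3)))
print("exponential:       ", zne_exponential(c3, decay(c3)))
```

On this noiseless, exactly exponential data the exponential ansatz recovers the exact value, and the two-level linear fit beats the three-level one because the extra, more strongly decayed point drags the linear intercept down; the instability of the exponential fit reported in this appendix only appears once shot noise enters the fit of log|y|.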

G Clustering of the near-Clifford circuits
It is a well-known property of near-Clifford circuits that exact expectation values of Pauli strings cluster around 0. This clustering negatively impacts the shot efficiency of learning-based error mitigation methods that use near-Clifford circuits as training circuits [11]. Here we demonstrate that this clustering does indeed occur for the training sets used to mitigate the random quantum circuit expectation values in Section 4.2. In Figure 10 we show the relative frequencies of exact expectation values of ⟨σ_Z^1⟩ for a sample of 30 Q = 4, L = 4 random quantum circuits and 3000 near-Clifford circuits obtained by random substitution of the non-Clifford gates in these random quantum circuits. For each circuit of interest, 100 near-Clifford circuits were generated. We see strong clustering of the expectation values for the near-Clifford circuits, despite the lack of this effect in the random quantum circuits used to generate them. The training circuits used in Section 4.2 were then chosen by post-selecting on these near-Clifford circuits.

In Table 2 we further analyze the clustering of the near-Clifford circuits. The table shows the medians |⟨σ_Z^1⟩|_2 and third quartiles |⟨σ_Z^1⟩|_3 of the absolute value of the observable of interest, obtained for all the random quantum circuits mitigated in Section 4.2 and for all near-Clifford circuits generated before post-selection. We see that |⟨σ_Z^1⟩|_2 and |⟨σ_Z^1⟩|_3 decrease slowly with increasing Q for the random quantum circuits. For example, for Q = 4, L = 4 the RQCs have |⟨σ_Z^1⟩|_2 = 0.270 and |⟨σ_Z^1⟩|_3 = 0.438, while for the largest Q = 10 (L = 10) we obtain |⟨σ_Z^1⟩|_2 = 0.184 and |⟨σ_Z^1⟩|_3 = 0.265. Unlike the

Table 2: Clustering of the near-Clifford circuits. Here we gather the medians |⟨σ_Z^1⟩|_2 and third quartiles |⟨σ_Z^1⟩|_3 of |⟨σ_Z^1⟩|, obtained for samples of 30 random quantum circuits with various Q and L values mitigated in Section 4.2. We also show |⟨σ_Z^1⟩|_2, |⟨σ_Z^1⟩|_3, and the 95th percentile of |⟨σ_Z^1⟩| for 3000 near-Clifford circuits obtained by substituting non-Clifford gates in the random quantum circuits. These circuits are then post-selected, choosing a subset that is used as the training circuits for UNITED and vnCDR in Section 4.2. When an obtained quartile is of the order of numerical precision (10^-15) or smaller, we simply denote it as 0 in the table.
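The quartile statistics and post-selection step described above can be illustrated with synthetic data. The distributions below are our own stand-ins (uniform values for RQCs, sharply peaked values for near-Clifford circuits; the cutoff is a hypothetical parameter), chosen only to show how clustering shows up in the quartiles and how post-selection filters the training set.

```python
import numpy as np

rng = np.random.default_rng(1)
rqc_vals = rng.uniform(-0.6, 0.6, size=30)         # stand-in RQC expectation values
near_clifford = rng.laplace(0.0, 0.02, size=3000)  # sharply clustered stand-ins

def quartiles(vals):
    # Median, third quartile, and 95th percentile of the absolute values.
    a = np.abs(vals)
    return np.percentile(a, 50), np.percentile(a, 75), np.percentile(a, 95)

print("RQC           median/Q3/95th:", quartiles(rqc_vals))
print("near-Clifford median/Q3/95th:", quartiles(near_clifford))

# Post-select training circuits whose |expectation value| exceeds a cutoff.
cutoff = 0.1
training = near_clifford[np.abs(near_clifford) > cutoff]
print(len(training), "of", len(near_clifford), "circuits kept")
```

The clustered sample has far smaller quartiles than the RQC sample, and only the small tail surviving the cutoff would be used as training circuits.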