Volumetric Benchmarking of Error Mitigation with Qermit

The detrimental effect of noise accumulates as quantum computers grow in size. In the case where devices are too small or noisy to perform error correction, error mitigation may be used. Error mitigation does not increase the fidelity of quantum states, but instead aims to reduce the approximation error in quantities of concern, such as expectation values of observables. However, it is as yet unclear which circuit types, and devices of which characteristics, benefit most from the use of error mitigation. Here we develop a methodology to assess the performance of quantum error mitigation techniques. Our benchmarks are volumetric in design, and are performed on different superconducting hardware devices. Extensive classical simulations are also used for comparison. We use these benchmarks to identify disconnects between the predicted and practical performance of error mitigation protocols, and to identify the situations in which their use is beneficial. To perform these experiments, and for the benefit of the wider community, we introduce Qermit, an open-source Python package for quantum error mitigation. Qermit supports a wide range of error mitigation methods, is easily extensible, and has a modular graph-based software design that facilitates composition of error mitigation protocols and subroutines.


Introduction
Noise inhibits the development of quantum processors [1,2]. Techniques for reducing the level and effects of noise have been proposed, and act at each layer in the quantum computing stack. At the hardware level, noise reduction may be achieved through calibration [3], dynamical decoupling [4], pulse-level optimisation [5], etc. At the highest level of abstraction, fault-tolerant methods for encoding and processing information with logical qubits may be used [6,7].
Between these two extremes there exist two classes of noise reduction techniques. The first are those that increase process fidelity through operations at compile time. This class includes noise-aware routing and qubit allocation [8,9], noise tailoring [10], and circuit optimisation [9]. The second are those which do not increase the fidelity of prepared states directly, but instead improve accuracy when measuring quantities of concern, such as expected values of observables. Several such error mitigation protocols have been proposed, including Zero-Noise Extrapolation (ZNE) [11][12][13], Probabilistic Error Cancellation (PEC) [11], Clifford Data Regression (CDR) [14], Virtual Distillation [15], and many others [16][17][18].
Importantly, both of these classes of noise reduction techniques require less information about the experimental setup than do hardware level approaches. Additionally, neither compile time error reduction nor error mitigation protocols require a significant overhead in the number of qubits, as in the case of error correction [19,20]. However, error mitigation often comes at the cost of an increase in the number of circuits and shots required.
The importance of noise suppression at the hardware and logical levels is clear, as both are vital to the development of scalable fault-tolerant quantum processors. Compile time approaches to noise reduction have also been extensively benchmarked [21] and shown to perform strongly. What remains unclear is the impact of error mitigation. While proof of principle experiments have shown improved accuracy when using these techniques [11][12][13][14], there has been no systematic study of their practical effectiveness and limitations. As such, it is unclear which circuit dimensions and types, and which device characteristics, are well suited to the use of error mitigation. Indeed, the differing assumptions made about the underlying noise by each error mitigation protocol may manifest as unpredictable practical performance.
We clarify the practical utility of error mitigation by introducing and implementing a methodology for the benchmarking of error mitigation protocols. Our benchmarking experiments compare the accuracy of operator expectation value calculations when a selection of error mitigation protocols are employed, and when they are not.
The design of our benchmarks is inspired by volumetric benchmarks of quantum processors [22], allowing us to estimate the 'volume' of the circuit depth and width where the error mitigation protocols perform well. The particular circuits used are constructed from a variety of application-motivated circuit classes [21]. This ensures our benchmarks are indicative of practical performance, which we use to identify classes of computations where error mitigation may be used fruitfully. Taking a volumetric approach ensures that results of these benchmarks may be quickly compared, and that a wide range of circuit sizes are covered. In this work, we benchmark CDR and ZNE, but our method is applicable to a broad class of error mitigation protocols.
We conduct our experiments on real quantum computing devices, and using classical simulators. Each error mitigation method makes different assumptions about the underlying noise, and as such it is difficult to compare error mitigation methods using noisy simulations alone without introducing bias. This is particularly true since existing noise models are poor predictors of hardware behaviour beyond a few qubits. Indeed, our experiments on real hardware demonstrate reduced performance as compared with noisy classical simulation, across different error mitigation methods.
For the benefit of the community, and in order to conduct these benchmarking experiments, we have developed Qermit. Qermit is an open-source Python package for the design and automated execution of error mitigation protocols. The error mitigation protocols presently available through Qermit include those explored in our benchmarking experiments, namely ZNE and CDR. Qermit also includes implementations of PEC, error mitigation based on frame randomisation [10], and protocols performing correction through characterisation of State Preparation And Measurement (SPAM) errors. In all cases, several variations of each protocol are provided. Qermit provides a common interface to this selection of error mitigation schemes, simplifying their use. Further, Qermit supports the straightforward construction and combination of error mitigation protocols and subroutines, facilitating quick prototyping of new protocols. By virtue of being implemented using Tket [9], Qermit is platform-agnostic, and so may be used with a wide range of quantum hardware, and in conjunction with several common quantum software development kits.
The remainder of this paper is structured as follows. In Section 2 we give an overview of error mitigation and the particular schemes that we investigate in this work. In Section 3 we introduce Qermit and details of the implementations of CDR and ZNE. In Section 4 we introduce the design of our benchmarks, and the philosophy that motivates them. In Section 5 we give the results of the experiments we have conducted. Finally we conclude in Section 6.

Quantum Error Mitigation
Error mitigation protocols typically have many steps in common. In Section 2.1 we describe a general framework for error mitigation of observables which takes into account the practical aspects of their implementation on quantum hardware. Recent works [23,24] have also exploited these common features to analyse the overhead in sample complexity, and to determine (universal) lower bounds. Our approach focusses on the modularity of error mitigation protocols, also exploited by the design of Qermit, and takes into account all quantum and classical resources required for an error mitigation experiment. In Section 2.2 and Section 2.3 we use this framework to describe ZNE and CDR. In Section 2.4 we describe the noise profile assumptions that justify the design choices made when developing ZNE and CDR.

Unifying Framework
Consider a target input quantum circuit given by a sequence of gates $U = U_1 U_2 \cdots U_d$, an initial (pure) state $\rho_0$, and an observable $O$. The error mitigation protocols studied here output an estimator $\langle \hat{O} \rangle_{EM}$ of the true expectation value $\langle O \rangle = \mathrm{tr}(U \rho_0 U^\dagger O)$. This estimator should reduce the noise bias in the estimation of this quantity on the backend.
To achieve this, error mitigation protocols run the given circuit and/or other quantum circuits on backends, such as quantum processing units or classical simulators, which may be noisy or ideal. The outputs from backends are binary strings, called shots, which may be combined to produce, for example, expectation values.
In generating the estimator $\langle \hat{O} \rangle_{EM}$, many error mitigation protocols employ the following steps.
Data Collection: The first step consists of a noise characterisation procedure $\mathcal{N}$ that takes as input the target experiment(s) $(U, \rho_0, O)$, along with a set of resource parameters $R$. $R$ can include: the total number of distinct circuits $K$, the number of shots $(n_i)_{i=1}^{K}$ allocated to each circuit, and the allowed qubits; the type and amount of classical simulation used; and the backend $q$ with fixed specifications such as the compilation strategy, architecture, and noise features.
$\mathcal{N}$ involves (i) a series of sub-processes $\mathcal{N}_1, \ldots, \mathcal{N}_K$, each of which modifies the input circuit $U$ in a method-specific way, and (ii) a set of measurement circuits $M_m(O)$. The data collection step returns a set of (labelled) complex parameters given by $D = (D^q_{i,m}, D^c_{i,m})$, indexed by each of the sub-processes and measurement circuits, where $D^q_{i,m} = \frac{1}{n_i}\sum_{s=1}^{n_i} Z_s$, with $Z_s$ an independent identically distributed (i.i.d.) indicator random variable over measurement outcomes obtained from evaluating $M_m(O)\,\mathcal{N}_i(U)(\rho_0)$ on the quantum device $q$. If required and available, $D^c_{i,m}$ corresponds to the exact classical simulation.
Functional Model: An implicit mapping $\mathcal{F}$ (or a set thereof) between the output parameters of the noise characterisation step and the (unknown) error mitigated estimator $\langle \hat{O} \rangle_{EM}$, so that $\mathcal{F}(D, \langle \hat{O} \rangle_{EM}) = 0$. The specific form of this function is typically motivated by assumptions on the noise characteristics of the quantum device.
Data Processing: This step is completely classical and aims to produce an output estimator $\langle \hat{O} \rangle_{EM}$ by fitting the data $D$ to the functional model $\mathcal{F}$. Depending on the particular function, this may be a simple summation or a classical optimisation algorithm to determine the coefficients of $\mathcal{F}$.
All the processes involved in producing $D$ may be described in the quantum combs formalism, which generalises quantum channels to higher order operations [25]. $\mathcal{N}$ will generally depend on the type of noise a particular method is aiming to mitigate. For example, a set of the sub-processes and measurements may be independent of $U$ or $O$, with the aim to produce (partial) tomographic information [11]. The framework also allows for adaptive processes, which is to say that $\mathcal{N}_i$ may depend on outcomes $D_{1,m}, \ldots, D_{i-1,m}$ for a subset of the measurement circuits $m$. Typically the Data Collection step will include the identity process, which does not modify the circuit $U$ or observable, and produces a noisy (sample mean) estimator $\langle \hat{O} \rangle_N$ of the expectation value of the observable given resources $R$.
Note that several factors at each step can influence the performance of an error mitigation technique. An example is the accuracy in the noise modelling assumptions that motivate the choice of functional model, and we discuss this further in the case of ZNE and CDR in Section 2.4. The accuracy of the data collected, influenced by the available shot budget, also impacts the accuracy and variance in the final error mitigated observable expectation approximation. We explore this variance in the case of our experiments in Appendix E.4.
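The three-step structure above can be sketched as a minimal pipeline. This is a conceptual illustration only, not the Qermit API; all function names, the toy backend, and the two-point linear model are hypothetical:

```python
# Conceptual sketch of the three-step structure (not the Qermit API):
# a protocol combines Data Collection, a Functional Model, and Data
# Processing, so different protocols can share surrounding machinery.

def error_mitigate(circuit, collect, fit, resources):
    data = collect(circuit, resources)  # Data Collection -> D
    return fit(data)                    # Data Processing against the model

# A ZNE-flavoured instantiation: collect noisy expectation values at
# scaled noise levels, then fit a line and read off its zero-noise value.
def collect_scaled(circuit, resources):
    run = resources["backend"]          # callable: (circuit, level) -> value
    return [(lam, run(circuit, lam)) for lam in resources["levels"]]

def fit_line_at_zero(data):
    (x0, y0), (x1, y1) = data[0], data[-1]  # two-point linear model
    slope = (y1 - y0) / (x1 - x0)
    return y0 - slope * x0              # model evaluated at lambda = 0

# Toy backend whose expectation decays linearly with the noise level.
backend = lambda circuit, lam: 1.0 - 0.25 * lam
resources = {"backend": backend, "levels": [1.0, 3.0]}
print(error_mitigate("my-circuit", collect_scaled, fit_line_at_zero, resources))
```

Because the data collection and data processing stages are passed in as separate callables, swapping in a different functional model leaves the rest of the pipeline untouched, which is the modularity the framework is designed to expose.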

Zero Noise Extrapolation
The Zero Noise Extrapolation (ZNE) method [11,12] assumes that noise may be controlled by a parameter $\lambda$ (or more generally a set of parameters), which can be viewed as a proxy for average gate error rates. Then, for a given quantum circuit, the noisy expectation value of a target observable will be a function of $\lambda$. One may produce different samples of this function by artificially increasing the noise parameter to different levels $\lambda_1, \lambda_2, \ldots, \lambda_k$. An extrapolation process then produces an estimate of the expected value at $\lambda = 0$, the ideal case where no physical errors occur.
There are several ways in which one may boost physical errors affecting a quantum circuit. One method involves increasing the duration of pulses involved in producing each gate within the circuit [11]. A second approach, which we review here, is to introduce additional gates to obtain a higher depth but equivalent unitary circuit [13].
Data Collection: ZNE involves a series of sub-processes $\mathcal{N}_{\lambda_1}, \ldots, \mathcal{N}_{\lambda_k}$ that take the input circuit $U = U_1 U_2 \cdots U_d$ and produce a modified, or folded [13], circuit $\mathcal{N}_{\lambda_i}(U) = U_1 (C_1 C_1^\dagger)^{\alpha_1^i} \cdots U_d (C_d C_d^\dagger)^{\alpha_d^i}$. Here $C_j C_j^\dagger = I$, so as to not introduce logical error, and the $\alpha_j^i$ are non-negative integers. There is flexibility in the choice of the $C_j$ unitaries, and [13] analyse different folding variations: (i) circuit folding, where only $\alpha_d^i \neq 0$ and $C_d = U_1 \cdots U_d$; (ii) random gate folding, with $C_j = U_j$ and the $\alpha_j^i$ chosen uniformly at random for a fixed noise level $\lambda_i = \sum_j (2\alpha_j^i + 1)$. The output of the first step will be $D := (D_{\lambda_1}, D_{\lambda_2}, \ldots, D_{\lambda_k})$, where each entry is a sample mean estimator for the expectation value $\mathrm{tr}(O\,\mathcal{N}_{\lambda_i}(U)\rho_0[\mathcal{N}_{\lambda_i}(U)]^\dagger)$ of the target observable with respect to the modified circuit at each noise level, given the fixed set of resources (i.e. number of shots).
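Gate folding can be illustrated on a symbolic gate list. This standalone sketch (with hypothetical helper names) tracks only gate identities and dagger flags, not the unitaries themselves; inserting a gate followed by its inverse leaves the overall unitary unchanged while lengthening the circuit:

```python
# Sketch of unitary folding on a gate list. A gate is a (name, dagger)
# pair; inserting (C C^dagger)^alpha after a gate leaves the unitary
# unchanged while increasing the depth, and hence the physical noise.

def fold_gate(gate, alpha):
    """Return the gate followed by alpha rounds of (gate^dagger, gate)."""
    name, dag = gate
    return [gate] + [(name, not dag), (name, dag)] * alpha

def gate_fold(circuit, alphas):
    """Gate folding: fold the j-th gate alphas[j] times."""
    out = []
    for gate, alpha in zip(circuit, alphas):
        out.extend(fold_gate(gate, alpha))
    return out

def noise_level(alphas):
    """lambda_i = sum_j (2 * alpha_j + 1), as in the text."""
    return sum(2 * a + 1 for a in alphas)

circuit = [("H", False), ("CX", False), ("T", False)]
folded = gate_fold(circuit, [1, 0, 2])
print(len(folded), noise_level([1, 0, 2]))  # 9 gates, noise level 9
```

With this convention the noise level coincides with the folded gate count, making explicit how the folding exponents control the artificial noise boost.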

Functional Model:
In the case of ZNE, there are several possibilities for the data-fitting functional. We outline several below which have been explored in Refs. [11,13].
a) The polynomial extrapolation assumes that the dependency on the noise level parameter $\lambda$ can be expressed as a truncated Taylor series, so that the data is fitted to the function $F(\lambda) = F_0 + F_1 \lambda + \cdots + F_K \lambda^K$, with negligible higher order terms $O(\lambda^{K+1})$ and unknown (complex) parameters $F_i$. If the number of data points, or equivalently in our case the number of noise parameters $k$, is at least $K + 1$, the number of unknown parameters, then the extrapolation is well defined. In the special case when $k = K + 1$ there exist analytic expressions for the coefficients, and the method is referred to as Richardson extrapolation [16].
b) The exponential extrapolation assumes the expected values of observables decay exponentially with the noise level parameter $\lambda$, and the data can then be fitted to $F(\lambda) = F_0 + F_1 e^{-F_2 \lambda}$. c) The poly-exponential extrapolation assumes that the exponent of this decay has a polynomial expansion, so the data is fitted to the function $F(\lambda) = F_0 + F_1 e^{P(\lambda)}$, with $P$ a polynomial in $\lambda$.
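In the Richardson case ($k = K + 1$), the extrapolated value is the unique interpolating polynomial evaluated at $\lambda = 0$, which can be computed directly from Lagrange coefficients. A minimal sketch, with an illustrative quadratic decay standing in for real data:

```python
# Richardson extrapolation: with k = K + 1 noise levels, the degree-K
# polynomial through the data is unique, and its value at lambda = 0
# is a linear combination of the samples with Lagrange coefficients.

def richardson_zero(lambdas, values):
    """Evaluate at 0 the interpolating polynomial through (lambdas, values)."""
    estimate = 0.0
    for i, (li, yi) in enumerate(zip(lambdas, values)):
        coeff = 1.0
        for j, lj in enumerate(lambdas):
            if j != i:
                coeff *= lj / (lj - li)  # Lagrange basis evaluated at 0
        estimate += coeff * yi
    return estimate

# Noisy expectations following a quadratic decay (unknown to the fitter).
f = lambda lam: 0.9 - 0.2 * lam + 0.05 * lam ** 2
lambdas = [1.0, 3.0, 5.0]
print(richardson_zero(lambdas, [f(l) for l in lambdas]))  # recovers 0.9
```

Because the underlying decay here is exactly quadratic and three points are supplied, the extrapolation is exact; with real shot noise the same coefficients are applied to fluctuating samples, which is the source of the variance amplification discussed later.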

Clifford Data Regression
Clifford Data Regression (CDR) [14] is a machine learning approach to error mitigation. The method relies on the idea that circuits containing a number of T gates logarithmic in the number of qubits can be efficiently simulated [26].
Data Collection: CDR involves a series of sub-processes $\mathcal{N}_0(U) = U$ and $\mathcal{N}_1, \ldots, \mathcal{N}_k$, each of which modifies the input $U$, synthesised into a universal Clifford + T gate set, to produce a circuit where all except a small number $N_{nc}$ of T gates are replaced by a single-qubit Clifford gate from $\{1, S, S^\dagger, Z\}$. The resulting unitary circuits $\mathcal{N}_i(U)$ are efficiently simulable classically, and so measurements of $O$ will involve both quantum and classical evaluation. The output will be $D = (D^c_i, D^q_i)_{i=0}^{k}$, where each pair consists of the ideal classically simulated expectation value of the target observable $O$ with respect to the modified circuit, $D^c_i = \mathrm{tr}(O\,\mathcal{N}_i(U)\rho_0[\mathcal{N}_i(U)]^\dagger)$, and respectively $D^q_i$, a corresponding sample mean estimator evaluated on the quantum device with resources $R$.

Functional Model:
The functional relates an error mitigated estimate $\langle \hat{O} \rangle_{EM}$ of the ideal expected value $\langle O \rangle = \mathrm{tr}(O U \rho_0 U^\dagger)$ to the data obtained in the noise characterisation via a fit $D^q_i \approx f(D^c_i)$ for some $f \in \mathcal{F}$. In particular, in [14] the class of fitting functions $\mathcal{F}$ was assumed to be linear, so that $f(x) := F_1 x + F_0$ and (assuming $F_1 \neq 0$) $\langle \hat{O} \rangle_{EM} = (\langle \hat{O} \rangle_N - F_0)/F_1$.
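The linear fit and its inversion can be sketched as follows. The helper names are illustrative, and the toy noise model (a uniform shrink and shift of every expectation value) stands in for real training data:

```python
# CDR-style linear fit: regress the noisy training expectations D^q
# against the exact ones D^c, then invert f(x) = F1*x + F0 to correct
# the noisy estimate for the circuit of interest.

def linear_fit(xs, ys):
    """Least-squares slope and intercept for ys ~ F1*xs + F0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    f1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    return f1, my - f1 * mx  # (F1, F0)

def cdr_correct(exact_train, noisy_train, noisy_estimate):
    f1, f0 = linear_fit(exact_train, noisy_train)
    assert f1 != 0, "degenerate fit: cannot invert"
    return (noisy_estimate - f0) / f1

# Toy noise: every expectation is shrunk by 0.6 and shifted by 0.05.
exact_train = [0.2, 0.5, 0.8, 1.0]
noisy_train = [0.6 * x + 0.05 for x in exact_train]
print(cdr_correct(exact_train, noisy_train, 0.6 * 0.9 + 0.05))  # ~0.9
```

Since the toy noise really is affine, the fit recovers it exactly and the corrected value matches the true expectation; with real data the residual mismatch between the noise and the linear model appears as bias in the output.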

Noise Profile Assumptions
The use of different fitting functions in ZNE and CDR is motivated by an incoherent, Markovian noise model. Such a model is described by the quantum channel $\mathcal{N} = (1 - \lambda)\mathcal{I} + \lambda\mathcal{E}$, with $\mathcal{I}$ the identity operation and $\mathcal{E}$ an arbitrary process. Therefore, the noisy implementation of the target unitary channel $\mathcal{U}(\cdot) := U(\cdot)U^\dagger$ is given by $\tilde{\mathcal{U}} = (\mathcal{N} \circ \mathcal{U}_d) \circ \cdots \circ (\mathcal{N} \circ \mathcal{U}_1)$. We can expand this out in terms of linear combinations of channels with coefficients depending on the noise parameter $\lambda$ to get $\tilde{\mathcal{U}} = \sum_{k=0}^{d} \binom{d}{k} (1-\lambda)^{d-k} \lambda^{k}\, \mathcal{E}^{(k)}$, where the notation $\mathcal{E}^{(k)}$ is an average over all processes in the expansion of $\tilde{\mathcal{U}}$ that have exactly $k$ errors $\mathcal{E}$.
In total there are $\binom{d}{k}$ different ways for $k$ errors to occur within the circuit. Therefore, $\tilde{\mathcal{U}}$ and the corresponding noisy expectation value can be expressed in terms of a power series in $\lambda$ with degree at most $d$, thus motivating the polynomial fit. Alternatively, the analysis in [27] considers approximating the binomial coefficients in the expansion of $\tilde{\mathcal{U}}$ with a Poisson distribution, so that $\binom{d}{k}(1-\lambda)^{d-k}\lambda^{k} \approx e^{-\lambda d}(\lambda d)^{k}/k!$, and under this approximation the noisy expectation value takes the form of an exponential decay in $\lambda d$, which motivates the use of exponential and poly-exponential fitting functions. In particular, for a global depolarising noise model (where $\mathcal{E}$ outputs a fixed state $\psi_0$) the noisy and exact expected values have a linear relationship, $\langle \hat{O} \rangle_N = (1-\lambda)^{d} \langle O \rangle + (1 - (1-\lambda)^{d})\,\mathrm{tr}(O\psi_0)$. In such a case, the linear fit in CDR and polynomial fit in ZNE (with $K \leq d$) are not susceptible to noise bias, so the true expectation can be recovered exactly, up to finite sampling size errors.
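The linear relationship under global depolarising noise can be checked in a small standalone simulation. Here the channel resets to the maximally mixed state, so the offset term vanishes for the traceless observable Z; all names are illustrative:

```python
import math

# Single-qubit density-matrix toy model: after every gate, reset to the
# maximally mixed state with probability lam (global depolarising).
# For a traceless observable the noisy expectation is then exactly
# (1 - lam)^d times the ideal one.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(A):
    return [[A[j][i].conjugate() for j in range(2)] for i in range(2)]

def apply_gate(rho, U):
    return matmul(matmul(U, rho), dagger(U))

def depolarise(rho, lam):
    # (1 - lam) * rho + lam * I/2
    return [[(1 - lam) * rho[i][j] + (lam / 2 if i == j else 0)
             for j in range(2)] for i in range(2)]

def z_expectation(rho):
    return (rho[0][0] - rho[1][1]).real

# Circuit: d repetitions of a small X-rotation.
theta = 0.3
RX = [[complex(math.cos(theta / 2)), -1j * math.sin(theta / 2)],
      [-1j * math.sin(theta / 2), complex(math.cos(theta / 2))]]

def run(d, lam):
    rho = [[1 + 0j, 0j], [0j, 0j]]  # |0><0|
    for _ in range(d):
        rho = depolarise(apply_gate(rho, RX), lam)
    return z_expectation(rho)

d, lam = 5, 0.1
ideal, noisy = run(d, 0.0), run(d, lam)
print(noisy, (1 - lam) ** d * ideal)  # the two values agree
```

The maximally mixed state is invariant under every unitary, which is why the offset is circuit-independent here; for a general fixed state $\psi_0$ the offset would also depend on the gates applied after an error occurs.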

Qermit Implementation Details
Qermit has a graph based architecture design, which splits the execution of error mitigation methods into atomic functional processes. These sub-processes may be the submission of a circuit to a QPU, the modification of a circuit for the purposes of error mitigation, or the fitting of models. A graph, with these processes as vertices, is specified to describe how the processes should exchange inputs and outputs.
Qermit is complementary to Mitiq [28], an open-source Python toolkit that implements an overlapping set of error mitigation schemes. Qermit takes a different approach that breaks down the implementation of each protocol into standalone modular units, with a graph describing how each module interacts defined separately. This allows the modules to be easily re-used, modified and composed.
This software architecture allows vertices to be amended to adapt the protocol where necessary, and relieves developers of some of the work of piping together processes. A modular design additionally means sub-graphs and graphs may be reused in other, original error mitigation protocols, which is particularly useful as error mitigation schemes typically have several steps in common. As a result, Qermit is easily extensible, allows one to readily combine error mitigation protocols, and facilitates quick prototyping of new protocols through the reuse of existing sub-protocols. Combining error mitigation protocols has been shown to be a fruitful endeavour [27,29,30], and is one which is simplified in Qermit.

MitRes and MitEx
There are two types of error mitigation methods in Qermit: MitRes methods, which modify the distribution of shots retrieved from a backend; and MitEx methods, which return a modified expectation value estimator of some observable. In general a MitRes or MitEx object may perform any modification matching this form, or no modification at all. The instances of MitRes and MitEx objects in Qermit are designed to perform error mitigation.
MitRes and MitEx objects are constructed as dataflow graphs, called a TaskGraph. Each node of a TaskGraph is a MitTask object; itself a function that computes some step or sub-process of an error mitigation protocol. Edges of the TaskGraph move data between MitTask objects which depend on each other. When the run function is called on a MitRes or MitEx object, a topological sort is applied to the graph to order these tasks sequentially. The tasks are then run locally in this sequential order.
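The TaskGraph execution model can be illustrated with a minimal dataflow executor built on the standard library's topological sort. This is a conceptual sketch with hypothetical task names, not Qermit's implementation:

```python
from graphlib import TopologicalSorter

# Minimal dataflow-graph executor in the spirit of a TaskGraph. Nodes
# are functions; edges say whose outputs feed whose inputs. Running the
# graph topologically sorts the tasks and executes them in sequence.

def run_task_graph(tasks, edges, source_value):
    """tasks: name -> callable; edges: name -> list of predecessor names."""
    order = list(TopologicalSorter(edges).static_order())
    results = {}
    for name in order:
        inputs = [results[p] for p in edges.get(name, [])]
        # Source tasks (no predecessors) receive the graph's input value.
        results[name] = tasks[name](*inputs) if inputs else tasks[name](source_value)
    return results[order[-1]]

# A toy "mitigation" pipeline: duplicate a circuit description, "run"
# both copies, then combine the results.
tasks = {
    "duplicate": lambda c: (c, c + "-folded"),
    "run": lambda pair: [len(pair[0]), len(pair[1])],
    "combine": lambda res: res[0] + res[1],
}
edges = {"run": ["duplicate"], "combine": ["run"]}
print(run_task_graph(tasks, edges, "ghz"))
```

Expressing the protocol as a graph rather than a fixed script is what makes individual nodes swappable: replacing the "run" task with a folded-circuit variant changes the protocol without touching the surrounding plumbing.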
In its default construction, as displayed in Fig. 1, a MitRes object will simply run each circuit through a backend and gather the results. Similarly, the default construction of a MitEx object, as displayed in Fig. 2, will simply estimate the expectation of each desired observable without applying any mitigation methods. To do so it will modify the circuit, appropriately rotating the measurement basis according to the target observable. An example of such a use of the default MitRes object can be seen in Appendix A.1, and of the default MitEx object in Appendix A.2.1.

ZNE in Qermit
Several options exist in Qermit for both noise scaling and extrapolation to the zero noise limit. Each noise scaling operation available is of the digital variety, and includes: circuit folding, random gate folding, and odd gate folding. Extrapolation functions available include: exponential, linear, poly-exponential, polynomial, and Richardson. Those chosen for our experiments are discussed in Section 5.
An example TaskGraph which corresponds to a ZNE MitEx can be seen in Fig. 3. One notices in particular that the initial circuit is duplicated by the Duplicate MitTask, before each duplicate is passed to an iFoldMitEx MitTask. Each iFoldMitEx increases the noise in the circuit by a factor of i, and runs the resulting circuit. The original circuit is also passed through a default MitEx, as displayed in Fig. 2.

CDR in Qermit
One approach to generating the near-Clifford training circuit set required for CDR uses Markov chain Monte Carlo techniques [14]. Here a first training circuit is produced, in which all but a fixed number of non-Clifford gates are randomly replaced with their nearest Clifford gates according to a weighted distribution. Subsequent steps generate circuits sequentially by randomly replacing n pairs of the Clifford gates in the previously generated circuit with their original non-Clifford gates, and similarly n pairs of non-Clifford gates with (nearest) Clifford gates. At each step the new training circuit is accepted or rejected according to a (usually problem-dependent) user-defined maximum likelihood function. A second option, implemented in Qermit and used in the experiments of Section 5, replaces uniformly at random n non-Clifford/Clifford pairs in the first training circuit, without producing a chained training set.
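The construction of a training circuit, in which all but a small number of non-Clifford gates are snapped to nearest Clifford angles, can be sketched on a toy representation where a circuit is just a list of Rz angles and Clifford angles are multiples of π/2. The helper names and the uniform replacement rule are illustrative simplifications:

```python
import math
import random

# Sketch of near-Clifford training circuit generation: keep n_nc angles
# non-Clifford, snap every other angle to the nearest multiple of pi/2.

def nearest_clifford(angle):
    return (math.pi / 2) * round(angle / (math.pi / 2))

def training_circuit(angles, n_nc, rng):
    """Keep n_nc angles non-Clifford, replace the rest with Cliffords."""
    non_clifford = [i for i, a in enumerate(angles)
                    if not math.isclose(a, nearest_clifford(a))]
    keep = set(rng.sample(non_clifford, min(n_nc, len(non_clifford))))
    return [a if i in keep else nearest_clifford(a)
            for i, a in enumerate(angles)]

rng = random.Random(7)
angles = [0.3, math.pi / 2, 1.2, 2.5, math.pi]
train = training_circuit(angles, n_nc=1, rng=rng)
n_non_clifford = sum(not math.isclose(a, nearest_clifford(a)) for a in train)
print(n_non_clifford)  # exactly 1 non-Clifford angle remains
```

Keeping the non-Clifford count fixed and small is what keeps each training circuit classically simulable while still probing how the device's noise acts on circuits structurally close to the target.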
Parameters specifying an instance of a CDR MitEx include the device backend, the classical simulator backend, the number of non-Clifford gates in the training circuits, the number of pair replacements, and the total number of training circuits. An example TaskGraph which corresponds to a CDR MitEx can be seen in Fig. 4. The initial circuit is transformed by CCLStateCircuits to prepare it for three different experiments. Respectively these are to: run the original circuit on the given backend, run training circuits with a reduced number of non-Clifford gates on the same given backend, and run the training circuits on an ideal classical simulator. Once these experiments have been conducted, a series of checks conducted in QualityCheckCorrect ensure that the training data gives a well-conditioned (least-squares) optimisation.

Figure 2: MitEx TaskGraph. The MitRes task may be expanded as in Fig. 1. Other MitTask objects featuring here include: FilterObservableTracker, which takes as input a description of the circuit to be implemented, and the observable to be measured, returning circuits modified to perform the measurements necessary to calculate the expectation of the requested observable; CollateExperimentCircuits, which reformats the list of circuits to facilitate parallelism; SplitResults, which undoes this reformatting; and GenerateExpectations, which uses the results to calculate the requested expectation values.
CDRCorrect finds the fit parameters and produces an error mitigated estimator. An example of the use of CDR can be found in Appendix A.2.3.

Benchmark Design
Here we develop a benchmark procedure which can be used to compare error mitigation methods on different backends. To assess the performance of an error mitigation protocol, we introduce the relative error of mitigation in Section 4.1. We further discuss in Section 4.2 factors that influence such performance. Our benchmarks are volumetric by design, with the required circuit structure, presentation style and results interpretation discussed in Section 4.3. Finally, we detail the precise circuits we use in Section 4.4.

Figure 3: ZNE TaskGraph. MitTask objects featuring here include: CompileToBackend, which ensures the circuit obeys the connectivity and gate set constraints of the backend used; Duplicate, which creates copies of the inputted circuit; iFoldMitEx, which increases the noise in the circuit by a factor i, and runs the resulting circuit using a MitEx of Fig. 2; and CollateZNEResults, which gathers the experiment results and extrapolates to the zero noise limit. Note that other MitTask objects appearing in this figure have parallels in Fig. 2. Indeed, the iFoldMitEx contains a MitEx TaskGraph, which we do not expand for succinctness, as well as a MitTask performing the noise scaling.

Accuracy Measures for Error Mitigation
In order to assess the performance of an error mitigation strategy, we propose the following criteria that such a performance metric should satisfy: (i) faithful, taking a fixed (zero) value when the error mitigated estimator matches the exact value; (ii) operational, relating to improvement in a computational task; and (iii) efficiently estimable from experimental data and classical processing, allowing for scalability.

Figure 4: CDR TaskGraph. As compared to Fig. 2, this TaskGraph additionally includes: CCLStateCircuits, which outputs the original circuits as output 0, and training circuits copied as outputs 1 and 2; StateSimMitEx, which runs these two sets of circuits through a given backend and an ideal classical simulator, returning the results; QualityCheckCorrect, which assesses the expected quality of the resulting function model; and CDRCorrect, which uses the results from the original and training circuits to find a function model and produce an error mitigated result. Note that StateSimMitEx would expand to reveal a MitEx for the inputted backend, and a MitEx for an ideal classical simulator.
The relative error of mitigation is defined as $\epsilon := \frac{|\langle \hat{O} \rangle_{EM} - \langle O \rangle|}{|\langle \hat{O} \rangle_{N} - \langle O \rangle|}$. The intuition for the above definition is that we want a measure that expresses how much closer to the true value the mitigated estimator is compared to the noisy expectation. Note that the relative error of mitigation has the following operational properties: (i) faithful, $\epsilon = 0$ iff the corrected expectation value matches the exact value; and (ii) whenever $\epsilon \leq 1$ the mitigation of errors was successful.
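As a sketch, the relative error of mitigation and its success criterion can be computed directly from the two estimators and the true value; the numbers below are illustrative:

```python
# The relative error of mitigation compares how far the mitigated and
# unmitigated estimators are from the true expectation value; values
# below 1 mean the mitigation helped.

def relative_error_of_mitigation(em_estimate, noisy_estimate, exact):
    return abs(em_estimate - exact) / abs(noisy_estimate - exact)

# Mitigation moved the estimate from 0.62 to 0.88 against a true 0.90:
eps = relative_error_of_mitigation(0.88, 0.62, 0.90)
print(eps, eps <= 1)  # well below 1: mitigation was successful
```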
Generally, the measures in Definition 4.1 involve a classical simulation computing the true expectation value, and therefore are limited to determine performance of error mitigation schemes in these regimes. In Appendix E.1 we discuss the problem of certifying error mitigation for target states or circuit classes for which there is no available classical simulation. We introduce several performance metrics that satisfy (i) -(iii) and are based on mirroring, or efficiently simulable circuits. For example, mirrored circuits give scalable benchmarks since the exact expectation values depend only on the input (product) state and observable itself; no classical simulation of the circuits is needed in this case. We make use of this approach to benchmarking in Section 5. However, we leave for future work to determine how well these measures predict the general performance of error mitigation on quantum hardware, particularly in regimes approaching quantum advantage.
Furthermore, since both $\langle \hat{O} \rangle_{EM}$ and $\langle \hat{O} \rangle_N$ are estimators, they incur a variance due to finite sampling statistics. In general the ratio of two random variables does not have a well defined variance, and the ratio of two normally distributed variables with zero mean gives rise to the Cauchy distribution, which is typically heavy tailed. However, under mild conditions one can show [31] that $\epsilon$ can be approximated by a normal distribution with mean $\mu = \mu_{EM}/\mu_{N}$ and variance $\sigma^2 = \frac{\sigma_{EM}^2}{\mu_N^2} + \frac{\mu_{EM}^2 \sigma_N^2}{\mu_N^4}$, where $\sigma^2_{EM}$ and $\sigma^2_N$ are the variances of the estimators $\langle \hat{O} \rangle_{EM}$ and $\langle \hat{O} \rangle_N$ due to finite sampling statistics, and $\mu_{EM} = \mathbb{E}\langle \hat{O} \rangle_{EM}$, $\mu_N = \mathbb{E}\langle \hat{O} \rangle_N$ correspond to the means in the limit of an infinite number of samples. We discuss in Appendix E.2 the conditions under which the above approximation holds.

Performance of Error Mitigation Methods
There are a series of factors that can affect the performance of error mitigation methods. Firstly, finite sampling effects are amplified in the error mitigated estimator which incurs a higher variance than the noisy expectation value estimator for a fixed number of shots. This limitation has previously been discussed in Ref. [16] and recent work [23] derives theoretical lower bounds on the sample complexity overhead of error mitigation in terms of the maximal noise bias and distinguishability measures. In particular, for a local depolarising noise model the number of shots required to produce a mitigated estimator with the same accuracy as the noisy estimator scales exponentially with the circuit depth [23,32]. Practically, if the architecture has a restricted topology then this scaling can even depend exponentially on the number of qubits for sparser graphs that require an O(n) routing overhead [33].
Secondly, the functional model used to produce the error mitigated estimator will generally not fully capture the underlying backend noise effects. This is particularly restrictive for real hardware, where ⟨Ô⟩ EM will therefore be susceptible to noise bias.
Thirdly, fitting parameters to the functional model involves a classical optimisation that may be ill conditioned or unstable, partly due to the increased variance in the finite sampling or in the functional fit.
Finally, there are several specific regimes where the above issues can be more detrimental to the performance of error mitigation strategies. For example, if noise levels are low then $\langle \hat{O} \rangle_N$ is already an accurate estimator of $\langle O \rangle$, and due to the additional sampling overhead error mitigation will not, on average, improve upon that estimator for a fixed shot budget. Typically, error mitigation strategies involve reconstructing the surface defined by $\mathcal{F}$ from the noisy samples $D$ in the data collection step; however, if the finite sample error dominates then the solutions for the fit parameters will be unstable. This situation can occur at high levels of noise, and is typically exacerbated when the target observables have low expected values. In the following section and Appendix E.2 we make these observations more precise.

Amplification of finite sampling variance in the error mitigated estimator
Recall that the error mitigated estimator $\langle \hat{O} \rangle_{EM}$ is produced by classical post-processing which fits the experimental results to a functional model $\mathcal{F}(D, \langle \hat{O} \rangle_{EM}) = 0$. The terms $D^q_{i,m}$ are sample mean estimators that will introduce finite sampling effects. In the simplest case one might have a linear functional (1). The variance due to finite sampling in the error mitigated estimator is then $\mathrm{Var}[\langle \hat{O} \rangle_{EM}] = \sum_{m,i} o_{m,i}\,\sigma_i^2/n_i$, where the $o_{m,i} := \sum_z o_{m,i}(z)^2$ are constants predetermined from the target observables (and any modified observables required in the data collection step), and recall that $n_i$ is the number of independent samples $\{Z_s\}_{s=1}^{n_i}$, each with variance $\sigma_i^2$.
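For a fixed linear combination of independent sample means, the variance can be evaluated directly. The following sketch uses the Richardson coefficients at the illustrative noise levels 1, 2, 3 to show how shot noise is amplified relative to a single estimator:

```python
# If the mitigated estimator is a fixed linear combination sum_i c_i*X_i
# of independent sample means, its variance is sum_i c_i^2 * sigma_i^2/n_i.
# For Richardson extrapolation the |c_i| can exceed 1, amplifying shot noise.

def lagrange_coeffs_at_zero(lambdas):
    """Coefficients c_i such that sum_i c_i * y_i extrapolates to 0."""
    coeffs = []
    for i, li in enumerate(lambdas):
        c = 1.0
        for j, lj in enumerate(lambdas):
            if j != i:
                c *= lj / (lj - li)
        coeffs.append(c)
    return coeffs

def mitigated_variance(coeffs, sigma2, n_shots):
    return sum(c ** 2 * s / n for c, s, n in zip(coeffs, sigma2, n_shots))

coeffs = lagrange_coeffs_at_zero([1.0, 2.0, 3.0])  # [3.0, -3.0, 1.0]
# Equal per-level variance and shot budget: amplification factor sum c_i^2.
amplification = mitigated_variance(coeffs, [1.0] * 3, [1] * 3)
print(coeffs, amplification)  # variance 19x that of a single estimator
```

The sum of squared coefficients (here 19) is exactly the factor by which the shot budget would have to grow for the mitigated estimator to match the statistical accuracy of the unmitigated one, illustrating the sampling overhead discussed above.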

Volumetric Benchmarking
Volumetric benchmarking of error mitigation assesses the overall performance of a method on a specific backend with a fixed set of total resources $R$. We employ circuit classes with increasing depth $d$ (as determined by the number of layers of primitive circuits) and qubit number $n$ (which is to say the number of qubits the circuit acts on). Our methodology is inspired by volumetric benchmarks of quantum processors [22] and consists of:

1. Select a class of circuits $C(n, d)$ and a probability distribution or method to sample $C$ individual circuits.
2. Select a (Pauli) observable O (or set thereof) with fixed locality.

Determine relative error of mitigation
4. Determine the median relative error of mitigation over the sampled circuits The choice of circuit classes and sampling methods are flexible and we discuss them in detail in the following section. In our benchmarks we will consider global observables, acting non-trivially on each qubit. Measurements of global observables are typically affected by all errors occurring throughout a circuit whereas a local observable is affected by errors within its light-cone. Indeed, if the noisy operations outside the light-cone are described by completely positive trace-preserving maps, then their action on the identity observable cancels out. This assumption may fail if, for example, correlated errors across the cone partition occur. Therefore the choice of a global observable will place stronger constraints on depths/qubit number at which error mitigation gives improved estimators. One might also consider observables with a fixed locality constraint determined by applications of interest. There are two motivations behind the use of medians as a measure of overall success. First, we discussed how finite sampling statistics can give a long-tailed distribution of the relative error of mitigation. Secondly we observe for multiple experiments a similar long-tailed behaviour for the distribution over different circuits in the same class C(n, d).
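The four steps above can be sketched as a generic driver loop. Everything here is schematic: sample_circuit, ideal_expectation, noisy_expectation and mitigated_expectation are hypothetical stand-ins for a circuit sampler, an ideal simulator and the backend runs.

```python
import numpy as np

def relative_error_of_mitigation(ideal, noisy, mitigated):
    """|<O>_EM - <O>| / |<O>_N - <O>|: values below 1 mean mitigation helped."""
    return abs(mitigated - ideal) / abs(noisy - ideal)

def volumetric_cell(n, d, n_circuits, sample_circuit,
                    ideal_expectation, noisy_expectation, mitigated_expectation):
    """Median relative error of mitigation for one (qubit number, depth) cell."""
    errors = []
    for _ in range(n_circuits):
        circ = sample_circuit(n, d)       # step 1: draw from C(n, d)
        ideal = ideal_expectation(circ)   # step 2: fixed (Pauli) observable
        noisy = noisy_expectation(circ)
        mitigated = mitigated_expectation(circ)
        errors.append(relative_error_of_mitigation(ideal, noisy, mitigated))  # step 3
    return np.median(errors)              # step 4
```

One such median per (n, d) cell is what each square of the volumetric plots reports.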

Circuits
In this section, we formally define and motivate the circuits used in benchmarking error mitigation methods. In particular, for each circuit class we define a primitive layer type, providing pseudocode for generating it. A layer acts on n qubits, where n is called the qubit number of the circuit. We specify the depth of a layer, and discuss how layers are composed to create a depth d circuit from the given class. This terminology corresponds to that used in Section 4.3. While the qubit number and depth are indicative of the circuit sizes, the exact number of gates will vary between circuit classes. The compiled circuits will additionally depend on the architecture of the backend. This is discussed and displayed in Fig. 5. The particular implementation generating the circuits used in our experiments is available as specified in the Code Availability.
The circuit classes that we make use of are Random Circuits, introduced in Section 4.4.1 and inspired by those used for quantum volume experiments [21,34], and Pauli-Gadget Circuits, introduced in Section 4.4.2 and inspired by ansatz circuits for NISQ algorithms [35,36]. Collectively, this selection of circuit classes encompasses several important potential applications of quantum computing, covering circuits of varied depth, connectivity, and gate types, strengthening the predictive power of our benchmarks.
While these circuit classes allow us to draw conclusions about the performance of error mitigation on specific applications, there are many circuits that would not be covered by such classes. However, the mirroring method discussed in Section 4.4.3, and the experiment design in Section 4.4.4 are sufficiently general to allow for the use of different classes of circuits, as may be desirable to others. General purpose benchmark suites may provide inspiration for such choices [37].
We avoid favouring any device in particular through the design of the benchmark circuits used here. The connectivity and the gate-set assumed by the circuits are very general, to ensure we benchmark the error mitigation schemes as general-purpose schemes. A compilation step will be required to execute these circuits on a real device. The compilation strategy used may also influence the performance of the error mitigation scheme. While the same compilation strategy is used during error mitigated runs as with noisy runs, compilation passes may have different effects on error mitigated and unaltered circuits. As such, we fix the compilation scheme used during our benchmarks to avoid this dependency. Note in particular that caution should be taken in the case of mirrored circuits (see Section 4.4.3) to ensure that the circuits are not compiled to the identity.

Random Circuits
While circuits required for applications are typically not random, sampling from the output distributions of random circuits built from two-qubit gates has been suggested as a means to demonstrate quantum computational supremacy [38][39][40][41]. By utilising uniformly random two-qubit unitaries and all-to-all connectivity, the Random Circuits introduced now provide a comprehensive benchmark.
The circuits used here, taken from Ref. [21] and similar to those used for the quantum volume benchmark [34], are generated according to Algorithm 1, and are illustrated in Fig. 6. This circuit class consists of d layers of two-qubit gates acting between a bipartition of the qubits.
By utilising uniformly random two-qubit unitaries, Random Circuits test the ability of the error mitigation scheme to mitigate errors acting on a universal gate set. Allowing two-qubit gates to act between any pair of qubits in the uncompiled circuit means that Random Circuits avoid favouring any device in particular. This choice adheres closely to our motivation of being hardware-agnostic. Because of this generality, Random Circuits are complementary to the other classes introduced in this work.
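A layer of these Random Circuits pairs up the qubits at random and assigns an independent Haar-random SU(4) unitary to each pair. The following is a minimal data-level sketch in numpy/scipy; compilation of each unitary into native gates (for example via the decompositions of [42]) is not shown:

```python
import numpy as np
from scipy.stats import unitary_group

def random_layer(n, rng):
    """One layer: a random pairing of the qubits, each pair assigned a Haar-random SU(4)."""
    qubits = rng.permutation(n)
    layer = []
    for i in range(n // 2):
        pair = (int(qubits[2 * i]), int(qubits[2 * i + 1]))
        u = unitary_group.rvs(4, random_state=rng)   # Haar-random U(4)
        u = u / np.linalg.det(u) ** 0.25             # fix global phase so det(u) = 1
        layer.append((pair, u))
    return layer

def random_circuit(n, d, mirrored=False, seed=0):
    """d layers; if mirrored, each layer is immediately followed by its inverse."""
    rng = np.random.default_rng(seed)
    circuit = []
    for _ in range(d // 2 if mirrored else d):
        layer = random_layer(n, rng)
        circuit.append(layer)
        if mirrored:
            # gates in a layer act on disjoint pairs, so inverting each gate inverts the layer
            circuit.append([(pair, u.conj().T) for pair, u in layer])
    return circuit
```

The mirrored variant composes each layer with its inverse, so the whole circuit implements the identity regardless of n and d.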

Pauli-Gadget Circuits
Pauli gadgets [43] are quantum circuits implementing an operation corresponding to exponentiating a Pauli tensor. Sequences of Pauli gadgets acting on qubits form product formula circuits, most commonly used in Hamiltonian simulation [44]. Many algorithms employing these circuits require fault-tolerant devices, but they are also the basis of trial state preparation circuits in many variational algorithms, which are the most promising applications of noisy quantum computers.
A notable example of this in quantum chemistry is the physically-motivated Unitary Coupled Cluster family of trial states used in the variational quantum eigensolver (VQE) [35,36]. Motivated by these practical applications, we include in our benchmarks circuits with a similar structure. The Pauli-Gadget Circuits are built as in Algorithm 2, and may be visualised as in Fig. 6. They are constructed from several layers of Pauli Gadgets, each acting on a random subset of n qubits. Note that the circuits in this class differ from running end-to-end VQE. Focusing on the state preparation portion of a VQE circuit, we might deduce the performance of error mitigation when running VQE on ansatze of a similar type.

Algorithm 1 (excerpt) The pattern for building Random Circuits.
8: ▷ This loop's content constitutes a layer.
9: Divide the qubits into ⌊n/2⌋ pairs {q_{i,1}, q_{i,2}} at random.
11: for all i ∈ Z, 0 ≤ i ≤ ⌊n/2⌋ do
13:   Generate U_{i,t} ∈ SU(4) uniformly at random according to the Haar measure.
14:   Enact the gate corresponding to the unitary U_{i,t} on qubits q_{i,1} and q_{i,2}.
15:   ▷ Decompositions of this gate can be found in [42]
16:   if mirrored then
17:     Enact the gate corresponding to the unitary U†_{i,t} on qubits q_{i,1} and q_{i,2}.

Mirrored Circuits
Besides the division by circuit class, given by the specific layer type, circuits are also defined as being mirrored or un-mirrored. While un-mirrored circuits consist of independently generated random layers of a fixed type, mirrored circuits consist of randomly generated layers, each followed by its inverse. Mirrored circuits implement an identity operation and thus have the property that their expectation values can be efficiently computed even as qubit number and depth increase. Note that mirrored circuits in this work are inspired by, and are a special case of, those introduced in [45]. "Mirroring" in [45] consists of concatenating a circuit with its quasi-inverse (its inverse up to Pauli operations) and inserting Pauli operations before, after and between the circuit and its quasi-inverse. The ideal behaviour of such a circuit, inspired by the Loschmidt echo, is to return the input state. In our case we take the quasi-inverse to be the inverse, and do not add layers of Pauli operations.
Fig. 6 (caption excerpt): In the case of mirrored Pauli-Gadget Circuits, one can imagine appending the gate exp(−i ⊗_j s^t_j α_t) after the gate exp(i ⊗_j s^t_j α_t). Respectively, in the case of mirrored Random Circuits, one can imagine appending the gate U†_{i,t} after the gate U_{i,t}. Note that the gates U_{i,t} in the Random Circuit class act between 2 randomly selected qubits.

Algorithm 2 (excerpt) The pattern for building Pauli-Gadget Circuits.
11: Enact the Pauli gadget corresponding to the unitary exp(i ⊗_j s^t_j α_t) on qubits q_1, ..., q_n.
12: ▷ Decompositions of this gate can be found in [43]
13: if mirrored then
14:   Enact the Pauli gadget corresponding to the unitary exp(−i ⊗_j s^t_j α_t) on qubits q_1, ..., q_n.
19: Measure all qubits in the computational basis.

In both Algorithm 1 and Algorithm 2 the option to generate mirrored circuits is provided as an input parameter. This has the effect of inverting each layer of the circuit once the layer has been applied. Random layers are generated and inverted a total of ⌊d/2⌋ times to ensure that the final depth is d, assuming that d ∈ 2Z. This results in a circuit implementing the identity, with an ideal expectation value that depends only on the observable and input state. Therefore, mirrored circuits can provide scalable benchmarks to assess the performance of error mitigation strategies. We discuss this further in Appendix E.1.
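The mirroring property of Pauli gadgets is easy to check numerically: exp(iαP) followed by exp(−iαP) composes to the identity. A minimal sketch, with an illustrative Pauli string and angle:

```python
import numpy as np
from scipy.linalg import expm

# single-qubit Pauli matrices
PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_tensor(string):
    """Tensor product of single-qubit Paulis, e.g. 'XYZ' -> X (x) Y (x) Z."""
    op = np.array([[1.0 + 0j]])
    for s in string:
        op = np.kron(op, PAULIS[s])
    return op

def pauli_gadget(string, alpha):
    """Unitary exp(i * alpha * P) for the Pauli tensor P."""
    return expm(1j * alpha * pauli_tensor(string))

gadget = pauli_gadget("XYZ", 0.7)
mirror = pauli_gadget("XYZ", -0.7)               # the inverse gadget
print(np.allclose(gadget @ mirror, np.eye(8)))   # True: the mirrored pair is the identity
```

In practice the gadget is realised as a gate sequence rather than a dense matrix, but the composition argument is the same.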

Experiment Design
In our experiments, for a given qubit number and depth we choose a fixed number of random instances from each class of Random and Pauli-Gadget Circuits and their mirrored versions. The measurements are in the computational basis, with the target observable set to O = ⊗_{i=0}^{n−1} Z_i and input state ρ_0 = |0⟩^⊗n. The choice of a global observable was motivated in Section 4.3.
From the randomly generated circuit instances we select those with ideal expectation values within a fixed range (specifically from 0.4 to 0.6). As the circuit size increases, the probability of generating circuits with expectation values in this range falls exponentially. This limits the size of such circuits that we can generate, demonstrating an additional advantage of the mirrored circuits, which in this case have a fixed ideal expectation value of 1.
We require this control of the ideal expectation value for the following reason: random circuits show anti-concentration properties with increasing depth, meaning that for Pauli observables the expectation values will be close to zero with high probability [46]. However, our experiments, particularly the hardware runs, are limited by finite-size sampling, so that the obtained data D = {D q i,m } i,m incurs a variance depending on the fixed number of shots. Fitting this data to the surface F(D, ⟨Ô⟩ EM ) = 0 to determine an error mitigated estimator can be numerically unstable if the variance due to finite sampling is comparable to the value of the noisy data D itself. This is particularly relevant for very low ideal expected values, since the accumulation of noise within the circuit decreases the value further (typically dropping exponentially with the number of noisy gates). To avoid this situation, where the decrease in error mitigation performance with increasing depth can be (partly) attributed to the random generation of benchmark circuits, we restrict to a high, fixed range of expected values. We refer to Appendix E.4 for further details.
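The post-selection described above is plain rejection sampling: keep drawing circuits until the ideal expectation value lands in the target window. A schematic sketch, where sample_circuit and ideal_expectation are hypothetical stand-ins for the circuit generator and an exact simulator:

```python
def sample_in_window(n, d, sample_circuit, ideal_expectation,
                     low=0.4, high=0.6, max_tries=10_000):
    """Rejection-sample a circuit whose ideal expectation value lies in [low, high].

    As circuit size grows the acceptance probability falls exponentially
    (anti-concentration), so max_tries bounds the search.
    """
    for _ in range(max_tries):
        circ = sample_circuit(n, d)
        if low <= ideal_expectation(circ) <= high:
            return circ
    raise RuntimeError("no circuit found in the expectation-value window")
```

The exponentially shrinking acceptance rate of this loop is exactly what limits the sizes of un-mirrored circuits in the benchmarks.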

Results
We use the benchmarks introduced in Section 4 to assess the performance of ZNE and CDR on both superconducting hardware and noisy simulator backends. In Fig. 7 and Fig. 8 we present volumetric plots displaying the relative error of mitigation for Pauli-Gadget Circuits and Random Circuits respectively, when run on both ibm lagos and ibmq casablanca. Extensive noisy simulations are provided in Appendix D.

Noisy Classical Simulations
The first set of benchmarks employs a local depolarising noise model with error rates of e₁ = 10⁻³ and e₂ = 10⁻² for single- and two-qubit gates respectively.² These values are chosen to be broadly in line with those reported by several hardware providers, but do not correspond to any device in particular. In this case there is no restriction on the pairs of qubits between which two-qubit gates can act. As such, only minimal compilation to the simulator's gate set is required.
The results of these experiments are presented in the volumetric plots of Figs. 11 and 12. For each fixed width/depth (i.e. square), 10 circuits are sampled from the specified class, with 5 × 10⁵ shots per circuit for each error mitigation protocol, and 10⁵ in the case of the noisy expected value. This shot count is higher than could be taken from the real backends at our disposal in a reasonable runtime. It is used here to present an idealised scenario and to illustrate the boundaries of qubit count and circuit depth beyond which error mitigation fails to improve the noisy estimator of the expected value. In particular, such boundaries align with the theoretical analysis limiting the depth to scale with a constant inverse of the error rate, d = O(1/e₂) [24,27].
In the case of ZNE, we use an exponential fit and the target circuit is folded repeatedly by factors λ = 1, 3, 5, 7, 9. The shot budget is distributed equally between these noise scaling values. Curve fitting is achieved via least-squares optimisation. For CDR, we distribute the total shot budget equally between each of the 21 required circuits; namely 20 near-Clifford circuits and 1 unaltered circuit. Generation of the training set fixes the number of non-Clifford operations to 10 and generates the new near-Clifford circuits uniformly at random by pair replacements from an initial seed circuit. In the case where the total number of non-Clifford gates is less than 10, at least one non-Clifford gate is replaced at random. Additionally, using the classical simulation of these circuits, we test that the training data gives a well-conditioned matrix for the linear regression to be performed, and generate new training circuits if that is not the case.

² To create this model we make use of the qiskit noise model module, as described at https://qiskit.org/documentation/apidoc/aer_noise.html. For more details see Code Availability.
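The ZNE post-processing can be sketched with a standard least-squares fit: evaluate the noisy expectation value at the fold factors λ = 1, 3, 5, 7, 9, fit an exponential decay, and read off the extrapolated value at λ = 0. The two-parameter decay model and the synthetic data below are illustrative, not the exact model used in the experiments:

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential(lam, a, b):
    """Two-parameter exponential decay model for the noisy expectation value."""
    return a * np.exp(-b * lam)

def zne_exponential(fold_factors, noisy_values):
    """Fit the decay curve and extrapolate to the zero-noise limit lambda = 0."""
    popt, _ = curve_fit(exponential, fold_factors, noisy_values, p0=(1.0, 0.1))
    a, b = popt
    return exponential(0.0, a, b)   # equals a for this model

# Synthetic example: an ideal value of 0.5 decaying with the fold factor.
lams = np.array([1, 3, 5, 7, 9])
values = 0.5 * np.exp(-0.2 * lams)
print(zne_exponential(lams, values))   # ~0.5
```

With real, finite-shot data each entry of values carries sampling noise, which is where the variance amplification of Section 2.1 enters.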
There are circuit sizes where ZNE and CDR are consistently beneficial. Comparing first the performance of CDR and ZNE, we note from Figs. 11a and 11b that for Random Circuits, in the size ranges explored, both show an improvement over the case where error mitigation is not used. ZNE appears to perform particularly well in this domain, outperforming CDR for each circuit size explored. To identify the limits of the utility of CDR and ZNE we extend the depth of the circuits explored using mirrored circuits, as discussed in Section 4.4. We see from Figs. 11c and 11d that while ZNE outperforms CDR for smaller circuit sizes, the range of circuit sizes where CDR outperforms the use of no error mitigation appears to be larger than for ZNE.

CDR outperforms ZNE with Pauli-Gadget Circuits. Comparing Fig. 12 to Fig. 11, we note that while ZNE outperforms CDR in the case of Random Circuits, this appears not to be the case for Pauli-Gadget Circuits. As shown in Fig. 5, the CX gate counts of circuits with a fixed qubit number and depth are similar between the Random Circuits and Pauli-Gadget Circuits classes. As such, the improved relative performance of CDR is likely because there is a smaller proportion of non-Clifford gates in Pauli-Gadget Circuits than in Random Circuits. In the case where there are few non-Clifford gates, the training circuits differ little from the original circuit, which improves the accuracy of the CDR training procedure.
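The CDR correction amounts to a linear regression of exact against noisy expectation values over the near-Clifford training circuits, together with the conditioning check described above. A schematic sketch with synthetic training data standing in for the near-Clifford runs:

```python
import numpy as np

def cdr_correct(noisy_train, exact_train, noisy_target, cond_limit=1e8):
    """Fit exact ~ a * noisy + b over the training set and correct the target value."""
    A = np.column_stack([noisy_train, np.ones_like(noisy_train)])
    if np.linalg.cond(A) > cond_limit:
        raise ValueError("ill-conditioned training set: generate new circuits")
    (a, b), *_ = np.linalg.lstsq(A, exact_train, rcond=None)
    return a * noisy_target + b

# Synthetic training set: noise rescales and shifts the exact values.
rng = np.random.default_rng(0)
exact = rng.uniform(-1, 1, 20)     # classically simulable near-Clifford values
noisy = 0.6 * exact - 0.05        # an illustrative linear noise map
print(cdr_correct(noisy, exact, noisy_target=0.6 * 0.5 - 0.05))   # ~0.5
```

The conditioning test mirrors the regeneration step used in our benchmarks: a near-degenerate training set makes the learnt map unreliable.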
More refined device modelling reduces performance The second set of benchmarks employs device emulation that includes many properties of the corresponding real backends used in Section 5.2.³ These properties include the connectivity of the architecture, the available gate set, and an expansion of the noise model to include thermal relaxation and readout errors, in addition to a local depolarising model with parameters corresponding to calibration results. Exact values of these properties can be found in Appendix B. The parameters used for ZNE and CDR are identical to those used when performing the depolarising noise simulation, with the exception that the total shot budget is reduced to 1.6 × 10⁵ for the mitigated experiments, and 0.32 × 10⁵ for the unmitigated experiments. The shot budget is reduced to match that of the real device experiments, where computing resources are more scarce. Comparing Figs. 11a, 11c, 12a and 12c to Figs. 13a, 13c, 14a and 14c, we see a marked fall in the performance of ZNE when additional device constraints are imposed by emulation.

Fig. 7 (caption): ibm lagos and ibmq casablanca are used with ZNE or CDR on Pauli-Gadget Circuits and mirrored Pauli-Gadget Circuits. Median ε̄ (outer square) and worst-case (inner square) relative error of mitigation over the sampled circuits are shown, a metric that ranges over (0, 1) for successful mitigation, with lower values corresponding to better performance. Each square corresponds to 5 sampled circuits of the specified type with a total 1.6 × 10⁵ shot budget per mitigated expectation value.

Quantum Hardware Backend
Finally, we perform volumetric benchmarks to assess and compare the performance of ZNE and CDR error mitigation on ibm lagos and ibmq casablanca, as measured by the relative error of mitigation.
The experiments used in the volumetric benchmarks are produced according to Section 4.4 and employ the two types of circuit classes as described in Section 4.4.4. For each fixed qubit number and depth we use 5 circuits on which we sequentially perform CDR, ZNE and no error mitigation. The same sets of circuits are used on both ibmq casablanca and ibm lagos. As discussed, the parameters used for ZNE and CDR are the same as in the case of the depolarising noise simulations, and the device emulation experiments. As in the case of the emulation experiments, the shot budget is set to 1.6 × 10⁵ for the error mitigated experiments, and 0.32 × 10⁵ for the unmitigated experiments.
It is important to emphasize that for any given circuit we run the two mitigation schemes, and the unmitigated experiments, back-to-back so that any time drift in the device's noise parameters does not skew the performance comparison. For both devices the calibration data during these experiments can be found in Table 2 and Table 1. Furthermore, they share the same architecture connectivity.

Improvements are inconsistent As seen in Figs. 7 and 8, for a fixed qubit number and depth (as measured by the number of layers from a specific circuit class), we typically found at least one sampled circuit for which error mitigation does not improve upon the noisy expectation value. This occurred for both error mitigation methods even when their median performance gave a 2-fold or higher error reduction (i.e. ε̄ ≤ 0.5). This stands in contrast with the emulation results, where there was a significant region of sizes (n, d) for which error mitigation was successful even in the worst case. Method parameters and shot budget were the same for emulation and device experiments.
Improvements are less on ibm lagos Strikingly, whilst the dominating two-qubit error rates are within 10% for the two devices, the performance of error mitigation drastically decreased on ibm lagos. In particular, as seen in Figs. 7a and 7c, the structured Pauli-Gadget Circuits showed little to no improvement over the noisy expectation value when exponential ZNE was performed. This is not due to ibm lagos producing results with low absolute error. As indicated in Tables 3 and 4, the absolute errors of the results from ibm lagos are in fact typically higher than those of ibmq casablanca. Comparatively, as seen in Fig. 7e, on ibmq casablanca for small qubit numbers n = 3 and 2, ZNE mitigation was successful, with median relative errors of 0.51 and 0.50 respectively for 4 Pauli circuit layers. However, comparing Figs. 7f and 7h against Figs. 7b and 7d, we see that the performance of CDR was consistent between the two devices for this circuit class, and on average scaled well to the largest circuit sizes considered.
On the other hand, we see from Fig. 8 that for Random Circuits both ZNE and CDR had worse performance on ibm lagos compared to ibmq casablanca, with the limit ε̄ > 1 reached within the sizes considered in both depth and qubit number. For this circuit class, as seen in Figs. 8e and 8g, the performance of ZNE on ibmq casablanca consistently reached, on average, a reduction in the noisy estimator's absolute error by a factor of 2 or more for all sizes considered. This was generally slightly better than CDR, although both methods still improved the noisy estimators even for the larger qubit sizes and depths.
As a result of these features, each appearing unique to the particular device, we do not expect the results presented here to carry over generically to other devices. This is particularly the case for architectures other than superconducting.
Mirrored circuits are optimistic In the case of mirrored circuits, the performance was more consistent between the two devices. Notably, as seen in Figs. 7d and 7h, on mirrored Pauli-Gadget Circuits CDR decreased, on average, the absolute error in the noisy estimator by up to an order of magnitude for most circuit sizes. This corresponds in Fig. 7 to a median relative error of mitigation of 0.1 or less, and the results were similar for both devices. Comparing Figs. 7a and 7e against Figs. 7c and 7g shows that ZNE on mirrored Pauli circuits was a good predictor of the poor performance on non-mirrored circuits of similar dimensions.
Surprisingly, the relative error of mitigation for mirrored Random Circuits improved from 0.45 on ibmq casablanca to 0.15 on ibm lagos for n = 5 qubits, as seen in Fig. 8g and Fig. 8c respectively. This suggests that while mirrored circuits provide scalable benchmarks to assess the performance of error mitigation, they generally underestimate the median relative error for circuits of similar sizes using both of the methods considered. As such, they may be taken to provide a prediction of an upper bound on the performance of error mitigation schemes, and so to indicate when a scheme should be expected to bring no benefit.

Conclusions and Outlook
In this work we introduce Qermit -an open source package for error mitigation with a composable software architecture that allows for combining different error mitigation methods and facilitates development of new techniques.
A key question we address is: given a particular device and class of circuits, which error mitigation method has, on average, the better performance? To that aim, we design volumetric benchmarks to assess the performance of error mitigation methods for a given class of quantum circuits with varying qubit number and depth. The methodology is based on frameworks for benchmarking quantum devices [22]. We use this framework to delineate the boundary beyond which a given error mitigation technique fails, on average, to give an improved estimator of the target expectation value.
In this setup, we compare the performance of ZNE and CDR implementations on current superconducting hardware. We show how noise in real devices places even stronger constraints on the scalability of error mitigation than theoretical lower bounds, as for example derived in [23] or determined from classical noisy simulations [29]. Qualitatively, we found that even for a low total shot budget (≈ 10⁵ per mitigated expected value) CDR generally outperforms ZNE on structured Pauli circuits and slightly underperforms on random SU(4) circuits. This generic behaviour is captured by the emulation, which includes depolarisation, thermal relaxation and readout in the noise model as well as the device's architecture. However, emulation largely overestimates the performance, with successful mitigation for significantly larger circuit sizes compared to those accessible with real hardware. In contrast, a simplified depolarising noise model without connectivity constraints fails to predict even qualitative features of error mitigation performance.
On real hardware, we find that for circuits of the same type and depth the relative error of mitigation can vary extensively. Often this results in unsuccessful mitigation in the worst case, even when on average the error in the expectation value is reduced by an order of magnitude. We run the same benchmarks on two superconducting devices with the same connectivity and comparable calibration data, and find a high level of variability in the performance of both error mitigation methods. This was particularly striking for ZNE, where the digitised method used to artificially increase the noise level may be more susceptible to the differing noise profiles. These results flag two issues: first, that fine-grained noise characterisation is needed to predict the behaviour of error mitigation on hardware; and second, that without access to the ideal expectation value one needs to validate that error mitigation has succeeded.
Our benchmarks include mirrored circuits, which can be efficiently classically simulated as the number of qubits grows. As such, mirrored circuits provide a useful benchmark of the performance of error mitigation, particularly when classical simulation of their un-mirrored counterparts is inaccessible. We observe that, given a fixed qubit number, depth, shot budget and device, mirrored circuits typically produced a lower average relative error of mitigation for both ZNE and CDR compared to un-mirrored circuits from the same class. This illustrates that mirrored circuits give a lower bound on the relative error of mitigation, and so an upper bound on performance. This feature was consistent across devices, circuit classes and schemes, and so it gives a way to infer when error mitigation will not produce improved results.
Due to the emphasis on composability, Qermit is particularly well suited to exploring the application of multiple noise tailoring and mitigation techniques to the same circuit and observable. In future work we will explore the performance of combined methods, which is particularly well motivated by recent results that show significant improvements when using Pauli frame randomisation, ZNE and dynamical decoupling in combination, and with additional bespoke device calibration [47,48]. In particular, for these combined methods the error mitigated estimates of magnetisation gave a better approximation than the leading approximate classical tensor network simulations for 2D Ising models on up to N = 127 qubits. Note that these results do not contradict those of this paper, as we do not explore here combinations of error mitigation schemes, or techniques which make such extensive use of device characteristics. However, this indicates that Qermit's composable design makes it suitable for testing how useful error mitigation can be in achieving practical quantum advantage.
It would also be pertinent to understand how the performance of error mitigation schemes varies between different device architectures, such as ion trap devices. As we have seen, for example when comparing the classical simulations of Appendix D.2 to those performed on a real device in Section 5.2, idealised noise models can be a poor predictor of real performance. This discrepancy will likely be emphasised when using different devices, between which noise models vary greatly. Even apparently similar devices were shown in Section 5.2 to perform differently, and so we do not expect our results to carry over generically to other superconducting devices. Qermit facilitates such experiments as it is implemented with pytket [9], which gives access to an assortment of devices. While in Appendix D.2 we note some observations regarding the performance of error mitigation with changing noise levels, a more refined and extensive study with more complex noise models would be of great interest.
Since all parameters, including the total shot budget per mitigated expected value, are fixed, the discrepancy in performance between the two devices can be attributed to the noise bias and differing noise profiles. Therefore, it would be of interest to determine how the results of the volumetric benchmarks change for the same device at different calibration times. This was not possible in the case of ibmq casablanca as the machine was taken out of use during our experiments, so we leave this type of assessment for future work.
It will also be particularly interesting to explore the effect of finite sampling in the case of ion traps, where the rate at which shots are generated is lower. Such restrictions will impact the accuracy with which a functional model, as discussed in Section 2.1, can be approximated. This could manifest, for example, in the accuracy of the learnt extrapolation function in the case of ZNE, and so the variance in the learnt zero noise limit. In Appendix E.4 we introduce theoretical tools to explore this particular point, inspired by numerical results revealing strong performance variation depending on shot budget availability [29]. We look to future work to establish strictly how the relative error of mitigation varies not just with circuit size as explored here, but also with shot count.
Finally, it would be insightful to compare our approach, targeting an understanding of the practical utility of error mitigation, to those exploring information theoretic bounds on performance [23,[49][50][51]. In such results, lower bounds on the resources required to ensure beneficial use of error mitigation are derived, and are shown to grow exponentially with circuit size. We anticipate that the approach of our work will be complementary to those results, firstly as a means of verifying the predicted scaling, and secondly because our techniques could be used to explore more complicated unknown noise models.
Data Availability A complete set of results can be found in [52]. This additionally includes the circuits used, and other data required to reproduce the experiments.

A Qermit Installation and Usage

A.1 MitRes
The MitRes.run method takes a list of CircuitShots objects as an argument. Each CircuitShots object contains the basic information required to run a circuit, namely: a state preparation circuit, and the number of shots to run for the circuit. The example code snippet: with graph being the TaskGraph of Fig. 1.

A.2 MitEx
The MitEx.run method takes a list of ObservableExperiment objects as an argument. Each ObservableExperiment object contains the basic information required to estimate the expectation value of an observable: a state preparation circuit, a dictionary between symbols and parameter values (where appropriate), a pytket QubitPauliOperator detailing the operator being measured and used for preparing measurement circuits, and the number of shots to run for each measurement circuit. The following is an example of an ObservableExperiment set-up. In the following we discuss the definition of a MitEx object in several cases; in each we will use the ObservableExperiment defined above. In all cases an experiment returns a QubitPauliOperator object containing an expectation value for each QubitPauliString.

A.2.1 Default MitEx Example
In its default version, a MitEx object will append a measurement circuit for each QubitPauliString to the ansatz circuit and execute it through the pytket Backend the MitEx object is defined by. In particular the example code snippet: with graph being the TaskGraph of Fig. 2. Note that as we have used a noisy simulator, the returned value is different from the ideal value of −1.

A.2.2 ZNE MitEx in Qermit
The ZNE MitEx has the required inputs: backend, the backend on which the circuits will be run; and noise_scaling_list, a list of factors by which the noise is scaled. Optionally, the type of folding used can be specified via the folding_type keyword argument, which expects a Folding object, and the fit used to extrapolate results can be specified via the fit_type keyword argument, which expects a Fit object. The example code snippet demonstrates this, with graph being the TaskGraph of Fig. 3. Note that ZNE returned a value closer to the ideal value of −1 than did the example in Appendix A.2.1, where no error mitigation was used.

A.2.3 CDR MitEx in Qermit
The CDR MitEx has the required inputs: device_backend, the possibly noisy backend used to approximate the expectation values of the inputted circuits; simulator_backend, an ideal classical simulator; n_non_cliffords, the number of non-Clifford gates in the training circuits; n_pairs; and total_state_circuits, the total number of training circuits to use. The example code snippet demonstrates this.

B Backend Properties
The coupling maps, describing the connectivity between the qubits, of the ibm_lagos and ibmq_casablanca devices are as seen in Fig. 9. Vertices in the coupling map correspond to qubits, while edges indicate that 2-qubit gates can act between the qubits corresponding to the vertices connected by the edge. The gates available on both devices are those of the limited but universal set of Eq. (16). Emulation of the devices requires compilation to the coupling map described in Fig. 9, and a rebasing of the gates used by the circuits to the set of Eq. (16). Average noise properties reported by the devices over the time our experiments were conducted are seen in Tables 1 and 2.

C Additional Results Details
Here we present some additional details about the experiments discussed in Section 5. In particular, in Table 3 and Table 4 we give the average absolute error and absolute error of mitigation, as defined in Definition 4.1, for the results presented in Fig. 7 and Fig. 8 respectively. Note that we present mean values here, rather than medians as in Fig. 7 and Fig. 8.

D.1 Global Depolarising Noise Model
The global depolarising channel D acts on an n-qubit state ρ as D(ρ) = (1 − p)ρ + p I/2ⁿ, where p is the depolarisation factor. It commutes with any other linear map, so D ∘ U = U ∘ D for any unitary channel U. Suppose that a unitary circuit U contains d_1 single-qubit gates and d_2 two-qubit gates, with average errors p_1 and p_2 respectively. If we model the error as globally depolarising, then for any traceless observable O the noisy expected value ⟨O⟩_N (in the limit of infinite shots) depends on the exact value ⟨O⟩ via ⟨O⟩_N = γ⟨O⟩, with γ = (1 − p_1)^{d_1}(1 − p_2)^{d_2}. For this simplified noise model, the relative error of mitigation remains constant and is independent of the observable and the specific circuit structure. It depends only on the error mitigation method, the number of noisy gates, and their average errors.

D.1.1 Zero Noise Extrapolation -Polynomial Fit
We assume that the error rates are increased artificially via unitary folding with noise stretching factors α_0 = 1, α_1 = 2k_1 + 1 and α_2 = 2k_2 + 1. The mitigated expected value under a polynomial fit via Richardson extrapolation is then ⟨O⟩_EM = Σ_i F_i ⟨O⟩_{α_i}, where the coefficients satisfy Σ_i F_i = 1 (20) and Σ_i F_i α_i^k = 0 for k = 1, 2. Therefore the relative error of mitigation is ϵ_ZNE,poly = |Σ_i F_i γ^{α_i} − 1| / (1 − γ). (21) For example, the minimal number of folds, e.g. k_1 = 1 and k_2 = 2, gives F_0 = 15/8, F_1 = −5/4 and F_2 = 3/8. This analytic expression allows us to explore the relationship between the relative error of mitigation and the depolarising factor. Fig. 10a reveals a sharp increase in the expected relative error of mitigation as the depolarisation factor grows.
Figure 10: (a) Contour plot of the relative error of mitigation in terms of D = d_1 + d_2 (x-axis) and error rate p = p_1 = p_2 (y-axis) for ZNE with polynomial extrapolation. This is a plot of the function seen in Eq. (21). (b) Volumetric plot of the relative error of mitigation against 4-qubit Pauli-Gadget Circuit depth and 2-qubit depolarising factor. Exponential fit ZNE is used and 500,000 shots are taken per circuit. The depolarising factor on 1-qubit gates is a tenth of that on 2-qubit gates. (c) Volumetric plot of the relative error of mitigation against 4-qubit Pauli-Gadget Circuit depth and 2-qubit depolarising factor. Exponential fit CDR is used and 500,000 shots are taken per circuit. The depolarising factor on 1-qubit gates is a tenth of that on 2-qubit gates.
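The Richardson coefficients quoted above can be computed exactly from the stretch factors via the standard closed form F_i = Π_{j≠i} α_j/(α_j − α_i), a sketch of which follows:

```python
from fractions import Fraction

def richardson_coefficients(alphas):
    """Coefficients F_i of Richardson extrapolation to zero noise:
    F_i = prod_{j != i} alpha_j / (alpha_j - alpha_i). These satisfy
    sum_i F_i = 1 and sum_i F_i * alpha_i**k = 0 for 0 < k < len(alphas)."""
    coeffs = []
    for i, a_i in enumerate(alphas):
        f = Fraction(1)
        for j, a_j in enumerate(alphas):
            if j != i:
                f *= Fraction(a_j, a_j - a_i)
        coeffs.append(f)
    return coeffs

# Minimal folds k1 = 1, k2 = 2 give stretch factors (1, 3, 5) and the
# coefficients quoted in the text: F0 = 15/8, F1 = -5/4, F2 = 3/8.
print(richardson_coefficients([1, 3, 5]))
```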

D.1.2 Zero Noise Extrapolation -Exponential Fit
The effect of the global depolarising noise model is that for different noise levels ⟨O⟩_{α_i} = γ^{α_i} ⟨O⟩. This means that an exponential ZNE fit exactly captures the underlying noise model, with the error-mitigated observable given by ⟨O⟩_EM = ⟨O⟩_{α_0}^{α_1/(α_1−α_0)} ⟨O⟩_{α_1}^{−α_0/(α_1−α_0)} = ⟨O⟩ for two levels of noise α_0, α_1. Therefore, in the absence of finite sampling errors, ϵ_ZNE,exp = 0.
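This exact recovery can be verified numerically. The sketch below fits an exponential through two noise-scaled estimates; the sign of the expectation value is tracked separately, an assumption made here so that negative values can be handled through the logarithm:

```python
import math

def zne_exponential_two_point(y0, a0, y1, a1):
    """Zero-noise value from an exact exponential fit y(a) = F0 * exp(F1*a)
    through two noise-scaled estimates (a0, y0) and (a1, y1). The sign is
    carried separately so negative expectation values are handled."""
    sign = math.copysign(1.0, y0)
    f1 = (math.log(abs(y1)) - math.log(abs(y0))) / (a1 - a0)
    return sign * math.exp(math.log(abs(y0)) - f1 * a0)

# Under global depolarising noise <O>_a = gamma**a * <O>, so the fit
# recovers <O> exactly (up to floating point) absent shot noise.
gamma, exact = 0.9, -1.0
mitigated = zne_exponential_two_point(
    gamma ** 1 * exact, 1, gamma ** 3 * exact, 3)
print(mitigated)
```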

D.1.3 Learning with Clifford Data Regression
Clifford-based learning is specifically designed for (global) stochastic Pauli noise, a case in which the linear ansatz exactly matches the noisy expectation values.
Therefore, in the absence of sampling errors the method should fully correct for the depolarising noise model, and so ϵ CDR = 0.

D.2 Noisy Classical Simulations
We perform extensive classical simulations of the experiments conducted in Section 5, with the results also discussed there. In particular, depolarising noise simulations with Random Circuits and Pauli-Gadget Circuits are found in Figs. 11 and 12 respectively. Besides the fixed depolarising factor simulations presented in Figs. 11 and 12 and discussed in Section 5, we additionally explore simulations with varying noise in Fig. 10. As discussed in Appendix D.1.1, such an analytical study is possible in the case of ZNE with polynomial fit, and is presented in Fig. 10a. By comparison, in other areas of this work we consider exponential extrapolation, which in the idealised case of global depolarising noise results in non-zero relative error of mitigation only as a result of finite sample errors. Fig. 10b reveals that for 4-qubit Pauli-Gadget Circuits in the more realistic case of single- and two-qubit-gate local depolarising errors, and with errors due to sampling, there is a sharp increase in the relative error of mitigation as the noise level increases. For the same set of experiments conducted with CDR we again expect zero relative error of mitigation in the presence of global depolarising errors and in the absence of finite sampling errors. As seen in Fig. 10c, CDR is less susceptible to finite sampling errors in the presence of local depolarising noise than is ZNE, speaking to the comparative robustness of the functional model fitting step of the two schemes.
Finally, emulation additionally classically simulates noise, motivated by properties of the device. In the case of the emulated experiments shown here, this noise model includes:
Readout errors: single-qubit readout errors on measurements.
Single-qubit gate errors: a depolarising error followed by a thermal relaxation error on each qubit the gate acts on.
Two-qubit gate errors: a depolarising error followed by a single-qubit thermal relaxation error on each qubit participating in the gate.
In particular, emulations with Random Circuits and Pauli-Gadget Circuits are found in Fig. 13 and Fig. 14 respectively.

E Performance of Error Mitigation

E.1 Certifying Error Mitigation
In practical applications we do not have a priori access to the exact value ⟨O⟩, as this is the quantity we aim to estimate in the first place. Typically, proof-of-principle demonstrations of error mitigation strategies require a classical simulation to compare against, and the measures introduced in Definition 4.1 also assume that ⟨O⟩ can be directly computed. Therefore, the problem of assessing the performance of error mitigation on a specific circuit/observable in regimes beyond classical simulation becomes similar to that of certifying quantum computations. For the purpose of benchmarking the techniques at increasing depth and qubit number, one may employ scalable benchmarks consisting, for example, of mirrored circuits. Mirrored circuits have the general form C(w, d) = C_1 ⋯ C_{d/2} C_{d/2}† ⋯ C_1†, where the C_i are sampled from a specified class of primitive circuits. Then for any observable the ideal expectation of the target state ρ_mirror = C(w, d)ρ_0 C(w, d)† is tr(Oρ_mirror) = tr(Oρ_0). For example, if O is a Pauli observable and ρ_0 corresponds to the pure product state |0⟩^⊗w, then the mirrored circuits will give ideal expected values in {0, 1, −1}, which can be efficiently determined. The relative error of mitigation for the mirrored circuit can then be used as a proxy for the performance of error mitigation on the target state.
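The invariance of ideal expectation values under mirroring can be illustrated with a small sketch. The example below is a one-qubit toy (w = 1) with randomly sampled rotation layers as the primitives, an illustrative assumption rather than the circuit classes used in the benchmarks:

```python
import cmath
import random

# One-qubit sketch: a mirrored circuit C = C1...Cm Cm^dag ... C1^dag
# composes to the identity, so ideal expectation values of the input
# state are unchanged whatever the sampled primitives C_i are.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

def rz_rx(theta, phi):
    """A primitive unitary: Rz(theta) followed by Rx(phi)."""
    rz = [[cmath.exp(-1j * theta / 2), 0], [0, cmath.exp(1j * theta / 2)]]
    c, s = cmath.cos(phi / 2), -1j * cmath.sin(phi / 2)
    rx = [[c, s], [s, c]]
    return matmul(rx, rz)

random.seed(0)
layers = [rz_rx(random.uniform(0, 3), random.uniform(0, 3)) for _ in range(4)]

# Apply the forward layers, then their inverses in reverse order.
circuit = [[1, 0], [0, 1]]
for u in layers:
    circuit = matmul(u, circuit)
for u in reversed(layers):
    circuit = matmul(dagger(u), circuit)

# <Z> on C|0> equals <Z> on |0>, i.e. 1.
state = [circuit[0][0], circuit[1][0]]  # first column: C applied to |0>
expectation_z = abs(state[0]) ** 2 - abs(state[1]) ** 2
print(expectation_z)
```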

E.2 Relative Error of Mitigation Under Finite Sampling Statistics
In [31] the authors derive conditions for when the ratio of two independent normally distributed random variables can itself be approximated by a normal distribution. The following is a direct corollary of that result.
Lemma E.1. Let X, Y be two independent normal variables with positive means μ_X, μ_Y and variances σ²_X and σ²_Y, and suppose the coefficient of variation δ_Y = σ_Y/μ_Y is sufficiently small. Then the ratio Z = X/Y is itself approximately normally distributed.
In our case, X = ⟨Ô⟩_EM − ⟨O⟩ corresponds to the difference between the error-mitigated estimator and the exact value, and Y = ⟨Ô⟩_N − ⟨O⟩ corresponds to the difference between the noisy estimator and the exact value of the target expectation value. The expected values are then μ_X = E⟨Ô⟩_EM − ⟨O⟩ and μ_Y = E⟨Ô⟩_N − ⟨O⟩ respectively. Since we are interested in the relative error of mitigation, and this corresponds to absolute values of Z, one can assume that μ_X and μ_Y are positive quantities: if that happens not to be the case for a fixed observable, then we have the freedom to redefine the random variables X and Y so as to ensure the positivity constraint.
In the following section we discuss the remaining condition, which controls the signal-to-noise ratio, and how the tension between low levels of noise and high variance in the noisy estimators makes it difficult to quantify the performance of error mitigation in this regime.

E.3 Detrimental Effect of Low Noise Level
Therefore the only condition left to ensure that Lemma E.1 can be directly applied to ϵ(O) is that the coefficient of variation δ_Y = σ_N/(E⟨Ô⟩_N − ⟨O⟩) is sufficiently small. As in the main text, if ρ = U_1 ⋯ U_d ρ_0 U_d† ⋯ U_1† is the target quantum state, ⟨O⟩ = tr(Oρ), and ρ̃ is its noisy implementation, then E⟨Ô⟩_N = tr(Oρ̃). We now look at the difference |E⟨Ô⟩_N − ⟨O⟩| = |tr(O(ρ̃ − ρ))| ≤ ||O||_∞ ||ρ̃ − ρ||_1, where we have used the Hölder inequality in the first bound, and then taken a supremum over all possible ρ to bound by the diamond norm ||Φ||_⋄ = sup_{X:||X||_1≤1} ||(Φ ⊗ I)(X)||_1. For the conditions of Lemma E.1 to be met we want that μ_N ≥ σ_N/λ for some λ ≤ 1. This, along with the sub-multiplicativity of the diamond norm, implies that σ_N ≤ λ d e_max ||O||_∞, (28) where e_max is the maximal (gate) error rate. Since σ²_N is the variance of the sample-mean estimator for the noisy expectation value ⟨Ô⟩_N, it will depend on the number of shots n_0, so that σ_N = σ_0/√n_0, where σ_0 is the standard deviation of an individual (i.i.d.) sample of a single measurement. The condition Eq. (28) then becomes σ_0/√n_0 ≤ λ d e_max ||O||_∞. (29) In particular this shows that, if one is to ensure that the relative error of mitigation follows a normal distribution under finite sampling, then one must accept the above trade-off between the number of shots and the depth. The condition also puts a lower bound on the depth of the circuit.

Figure 11: Random Circuits run on a depolarising noise simulator. Each square presents the results of 10 circuits. 500,000 shots are taken in total for each circuit by ZNE and CDR, and 100,000 in the case of the un-mitigated runs. (a) Random Circuits mitigated using ZNE. (b) Random Circuits mitigated using CDR. (d) Mirrored Random Circuits mitigated using CDR.

E.4 Effect of Finite Sampling on the Classical Optimisation to Determine Fit Parameters
In this section we consider the effect of finite-size sampling on fitting the parameters of the functional model F. This type of analysis depends not only on the form of the functional F itself, but also on the classical optimisation algorithm used to fit the function to the noisy data D.
For simplicity, we will assume that the data processing step involves a linear least-squares optimisation. Recall the following analytical expressions for the fit parameters in a linear model y_i = Ax_i + B for K data samples: Â = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)² and B̂ = ȳ − Âx̄, where the estimators minimise the mean square error Σ_i (y_i − Âx_i − B̂)². As it proves valuable in calculating σ²_EM, the variance in ⟨Ô⟩_EM, note the following expression for the variance in the estimator Â: Var[Â] = σ² / Σ_i (x_i − x̄)², (32) where we have assumed Var[y_i] = σ² for all i, employed ȳ = (1/K) Σ_i y_i, and used that the x_i and x̄ are constants and not random variables. Note further Var[B̂] = Var[ȳ − Âx̄] = σ²/K + x̄² Var[Â]. (35) In the examples considered below, for CDR and ZNE the finite-size sampling errors imply that the y_i's can be viewed as normally distributed random variables.
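The estimators and variance formulas above can be sketched directly; the helper names below are illustrative:

```python
# Least-squares fit of y = A*x + B, with the variance formulas quoted in
# the text: Var[A_hat] = sigma^2 / sum_i (x_i - xbar)^2 (Eq. (32)) and
# Var[B_hat] = sigma^2/K + xbar^2 * Var[A_hat] (Eq. (35)).

def linear_fit(xs, ys):
    """Estimators A_hat, B_hat minimising the mean square error."""
    k = len(xs)
    xbar = sum(xs) / k
    ybar = sum(ys) / k
    sxx = sum((x - xbar) ** 2 for x in xs)
    a_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b_hat = ybar - a_hat * xbar
    return a_hat, b_hat

def fit_variances(xs, sigma2):
    """Variances of A_hat and B_hat assuming Var[y_i] = sigma2 for all i."""
    k = len(xs)
    xbar = sum(xs) / k
    sxx = sum((x - xbar) ** 2 for x in xs)
    var_a = sigma2 / sxx
    var_b = sigma2 / k + xbar ** 2 * var_a
    return var_a, var_b

# Noiseless linear data is recovered exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 0.5 for x in xs]
print(linear_fit(xs, ys))
```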

E.4.1 Clifford Data Regression
We consider a linear functional model as in Eq. (8), with least-squares optimisation as the classical method to determine the model parameters. Then we have Â = F̂_1, B̂ = F̂_0, x_i = D^c_i, and y_i = D^q_i, and the error-mitigated estimator is given by ⟨Ô⟩_EM = F̂_1⟨Ô⟩_N + F̂_0. If F̂_0 and F̂_1 were constant fixed values, F_0 and F_1, then the variance of ⟨Ô⟩_EM would simply be F_1²σ²_N. However, both are unbiased estimators constructed out of (noisy) sample-mean estimators, and so incur a variance due to the finite number of shots. Treating F̂_0, F̂_1 and ⟨Ô⟩_N as normally-distributed independent random variables gives the expression for σ²_EM used below, where we employ respectively Lemma E.1, Eq. (35), and Eq. (32), and use |D^c| to denote the number of Clifford training circuits. Note that an honest treatment of Lemma E.1 would require knowledge of the exact values of ⟨O⟩_N, F_0, and F_1, rather than the estimators we use here. As we do not have access to these exact values, we conjecture that this approximation is accurate in the limit of large numbers of shots.

E.4.2 Zero Noise Extrapolation
We consider exponential extrapolation, with the parameters in the functional model D_{λ_i} = F_0 e^{F_1 λ_i} and ⟨Ô⟩_EM = F̂_0 determined via linear least-squares optimisation. Linearisation of the exponential function gives Â = F̂_1, B̂ = log F̂_0, x_i = λ_i, and y_i = log D^q_{λ_i}. If again we treat F̂_1 as taking the constant value F_1, then we recover from Eq. (35) that Var[log⟨Ô⟩_EM] = (1/|λ|) Var[log⟨Ô⟩_N], (42) where |λ| is the number of noise scaling values, and where we have assumed Var[log D^q_{λ_i}] = Var[log⟨Ô⟩_N] for all i. Assuming further the standard error-propagation approximation Var[log X] ≈ Var[X]/(EX)², this yields a simple expression for Var[⟨Ô⟩_EM]. However, it is more honest to consider F̂_1 as a random variable, in which case additional terms contribute to Var[⟨Ô⟩_EM] (Eq. (44)).

Figure 15: Data is as in Fig. 8a and Fig. 8b, with a qubit number of two. Error bars show one standard deviation.

E.4.3 Relative Error of Mitigation Variance
We combine the arguments in Appendices E.4.1 and E.4.2 with Eq. (13) to produce Fig. 15, which exemplifies the well-controlled variance in the case of our experiments. Here σ²_N, the variance in the estimator of the noisy expectation value, is approximated by resampling the shots produced in the experiments of Figs. 8a and 8b. σ²_EM is calculated using Eqs. (37) and (44).