Modeling and mitigation of cross-talk effects in readout noise with applications to the Quantum Approximate Optimization Algorithm

1Center for Theoretical Physics, Polish Academy of Sciences, Al. Lotników 32/46, 02-668 Warsaw, Poland 2Max-Planck-Institut für Quantenoptik, Hans-Kopfermann-Straße 1, 85748 Garching, Germany 3Wigner Research Centre for Physics, H-1525 Budapest, P.O.Box 49, Hungary 4BME-MTA Lendület Quantum Information Theory Research Group, Budapest, Hungary 5Mathematical Institute, Budapest University of Technology and Economics, P.O.Box 91, H-1111, Budapest, Hungary


Motivation
Outstanding progress has been made in recent years on the path to the development of scalable and fully functional quantum devices. With state-of-the-art quantum processors reaching a scale of 50-100 qubits [1], the scientific community is approaching a regime in which quantum systems cannot be modeled on modern supercomputers using known methods [2]. This situation is unarguably exciting since it opens possibilities for the first demonstrations of quantum advantage and, potentially, useful applications [3][4][5]. At the same time, however, it entails a plethora of non-trivial problems related to fighting the effects of experimental imperfections on the performance of quantum algorithms. As quantum devices in the near term will be unable to implement proper error correction [5], various methods of noise mitigation have recently been developed [6][7][8][9][10][11][12][13][14][15]. Those methods aim at reducing the effects of errors present in quantum gates and/or in quantum measurements. In this work, we focus on the latter, i.e., noise affecting quantum detectors.
Indeed, it has been found in currently available quantum devices that the noise affecting measurements is quite significant. Specifically, errors of the order of a few percent in a single-qubit measurement and non-negligible effects of cross-talk were reported [1,13,14,15,16]. Motivated by this, a number of methods to characterize and/or reduce measurement errors were proposed [13][14][15][16][17][18][19][20][21][22][23]. Readout-noise mitigation usually relies on classical post-processing of experimental statistics, preceded by some procedure of noise characterization. Existing techniques typically suffer from the curse of dimensionality due to the characterization cost, the sampling complexity, and the complexity of post-processing, all of which scale exponentially with the number of qubits N. Fortunately, some interesting problems in quantum computing do not require measurements to be performed across the whole system. An important class of algorithms with this feature is the Quantum Approximate Optimization Algorithm (QAOA) [24], which only requires the estimation of a number of few-particle marginals. QAOA is a heuristic, hybrid quantum-classical optimization technique [24,25] that was later generalized as a standalone ansatz [26] shown to be computationally universal [27,28]. In its original form, QAOA aims at finding approximate solutions to hard combinatorial problems, notable examples of which are maximum satisfiability (SAT) [29], maximum cut (MAXCUT) [30], and the Sherrington-Kirkpatrick (SK) spin-glass model [31,32]. Regardless of the underlying problem, the main QAOA subroutine is the estimation of the energy of local classical Hamiltonians on a quantum state generated by the device. Those Hamiltonians are composed of a number of few-body commuting operators, and hence the estimation of the energy can be done via estimation of the local terms (which can be performed simultaneously).
Since the estimation of marginal distributions is the task of low sampling and post-processing complexity, this suggests that error mitigation techniques can be efficiently applied in QAOA. In this work, we present a number of contributions justifying the usage of measurement error-mitigation in QAOA, even in the presence of significant cross-talk effects.

Results outline
Our first contribution is to provide an efficiently describable measurement-noise model that incorporates asymmetric errors and cross-talk effects. Importantly, our noise model admits efficient error-mitigation on marginal probability distributions, which can be used, e.g., to improve the performance of variational quantum algorithms. We show how to efficiently characterize the noise using a number of circuits scaling logarithmically with the number of qubits. To this end, we generalize the techniques of the recently introduced Quantum Overlapping Tomography (QOT) [33] to the problem of readout-noise reconstruction. Specifically, we introduce the notion of Diagonal Detector Overlapping Tomography (DDOT), which allows one to reconstruct a noise description with k-local cross-talk on an N-qubit device using O(k 2^k log(N)) quantum circuits consisting of a single layer of X and identity gates. Furthermore, we explain how to use that characterization to mitigate noise on the marginal distributions, and we provide a bound on the accuracy of the mitigation. Importantly, assuming that the cross-talk in the readout noise is of bounded locality, the sampling complexity of the error-mitigation is not significantly higher than that of the underlying problem of marginal estimation.
We test our error-mitigation method in experiments on 15 qubits using IBM's device and on 23 qubits using Rigetti's device, both architectures based on superconducting transmon qubits [34]. We obtain a significant advantage by using a correlated error model for error mitigation in the task of ground state energy estimation. Interestingly, the locality structure of the reconstructed errors in these devices does not match the spatial locality of qubits in these systems.
We also study statistical errors that appear in the simultaneous estimation of multiple local Hamiltonian terms, a task that appears frequently in QAOA. In particular, we provide arguments why one can expect the estimated energies of local terms to effectively behave as uncorrelated variables, for the quantum states appearing at the beginning and near the end of the QAOA algorithm. This allows us to prove significant reductions in the sampling complexity of the total energy estimation (compared to the worst-case upper bounds).
Finally, we present a numerical study and a detailed discussion of the possible effects that measurement noise can have on the performance of the Quantum Approximate Optimization Algorithm. This includes a study of how noise distorts the quantum-classical optimization in QAOA. We simulate the QAOA protocol on an 8-qubit system affected by correlated measurement noise inspired by the results of IBM's device characterization. For a number of random Hamiltonians, we conclude that the error-mitigation greatly improves the accuracy of energy estimates, and can help the optimizer to converge faster than when no error-mitigation is used.

Related works
The effects of simple, uncorrelated noisy quantum channels on QAOAs were analyzed in Refs. [35][36][37]. In particular, the symmetric bitflip noise analyzed in [35] can be used also to model symmetric, uncorrelated measurement noise -a type of readout noise which has been demonstrated to be not very accurate in currently available devices based on transmon qubits [13,16]. A simple readout noise-mitigation technique on the level of marginals for QAOAs was recently experimentally implemented on up to 23 qubits in a work by Google AI Quantum team and collaborators [38]. Importantly, the authors assumed uncorrelated noise on each qubit -we believe that our approach which accounts for correlations could prove beneficial in those kinds of experiments. In a similar context, readout noise-mitigation using classical post-processing on global probability distributions was implemented in QAOA and Variational Quantum Eigensolver (VQE) experiments on up to 6 qubits [39,40]. While for such small system sizes it is possible to efficiently perform global error mitigation, we emphasize that our approach based on the noise-mitigation on the marginals could be performed also in larger experiments of this type.
Alternative methods of characterization of correlated measurement noise were recently proposed in Refs. [15,16,20]. Out of the abovementioned works, perhaps the most related to ours in terms of studied problems is Ref. [15]; therefore we will now comment on it thoroughly. The authors introduced a correlated readout noise model based on Continuous Time Markov Processes (CTMP). They provide both error-characterization and error-mitigation methods. The CTMP noise model assumes that the noisy stochastic process in a measurement device can be described by a set of two-qubit generators with corresponding error-rate parameters. The total number of parameters required to describe a generic form of such noise for an N-qubit device is 2N^2. The authors propose a method of noise characterization that requires the preparation of a set of suitably chosen classical states and performing post-selection on the noise-free outcomes (i.e., correct outcomes given the known input classical state) on a subset of (N − 2) qubits. Since the probability of noise-free outcomes can be exponentially small even for uncorrelated readout noise, this method can prove relatively costly in practice. We believe that our DDOT characterization technique could prove useful in reducing the number of parameters needed to describe the CTMP model (by showing which pairs of qubits are correlated, and for which the cross-talk can be neglected), hence allowing for a much more efficient version of the noise reconstruction presented in Ref. [15]. The authors also present a novel noise-mitigation method that allows one to estimate noise-reduced expected values of observables. The method is based on a decomposition of the inverse noise matrix into a linear combination of stochastic matrices, and on constructing a random variable (with the aid of classical randomness and post-processing) whose expected value agrees with that of the noise-free observable.
In general, both the sampling complexity and the classical post-processing cost of that method scale exponentially with the number of qubits. Since the authors aim to correct observables with arbitrary locality, the problem they consider is different from our approach, which aims to correct only local observables (and therefore does not exhibit exponential scalings). It is an interesting problem to see whether the CTMP noise model can also be interpreted on the level of marginal probability distributions in a way that allows for a mitigation analogous to ours in terms of complexity.

Preliminaries
In quantum mechanics, the most general measurement M that can be performed is represented by a Positive Operator-Valued Measure (POVM) [41]. A POVM M with r outcomes on a d-dimensional Hilbert space is a set of positive-semidefinite operators that sum up to the identity, i.e.,

M = {M_i}_{i=1}^r,  M_i ≥ 0,  Σ_{i=1}^r M_i = 1,   (1)

where 1 is the identity operator. When one performs a measurement M on the quantum state ρ, the probability of obtaining outcome i is given by Born's rule: Pr(i|M, ρ) = tr(ρ M_i).
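As a small sanity check of these definitions, the POVM conditions and Born's rule can be verified numerically; the single-qubit error rates below are hypothetical, chosen for illustration only.

```python
import numpy as np

# A 2-outcome POVM on 1 qubit describing a noisy computational-basis
# measurement (hypothetical error rates: p(1|0)=0.05, p(0|1)=0.10).
M0 = np.diag([0.95, 0.10])
M1 = np.diag([0.05, 0.90])

# POVM conditions: positive-semidefinite effects summing to the identity.
assert np.allclose(M0 + M1, np.eye(2))
assert np.all(np.linalg.eigvalsh(M0) >= 0) and np.all(np.linalg.eigvalsh(M1) >= 0)

# Born's rule Pr(i|M, rho) = tr(rho M_i) for the state |+> = (|0>+|1>)/sqrt(2).
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(plus, plus)
p = np.array([np.trace(rho @ M) for M in (M0, M1)])
assert np.isclose(p.sum(), 1.0)
```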
In quantum computing a perfect measurement is often modeled as a so-called projective measurement P = {P_i}_{i=1}^r, for which the measurement operators, in addition to the requirements from Eq. (1), are also projectors, i.e., ∀_i P_i^2 = P_i. While non-projective measurements have many applications in certain quantum information processing tasks, in this work we are interested in using them to model imperfections in measurement devices. The relationship between an ideal measurement P and its noisy implementation M can be modeled by the application of a generic quantum channel. Recently it has been experimentally demonstrated for superconducting quantum devices that in practice those channels, to a good approximation, belong to a restricted class of stochastic maps [13,14], which we will refer to as 'classical measurement noise'. In such a model, the relation between M and P is given by some stochastic transformation Λ. Namely, we have M = ΛP, i.e., M_i = Σ_j Λ_ij P_j. Due to the linearity of Born's rule, it follows that the probabilities p^noisy from which the noisy detector samples are related to the noiseless probabilities p^ideal via the same stochastic map, hence [13]

p^noisy = Λ p^ideal.   (2)

Specifically, in the convention where probability vectors are columns, the noise matrix Λ is left-stochastic, meaning that each of its columns contains non-negative numbers that sum up to 1.
Such noise is thus equivalent to a stochastic process in which an outcome from a perfect device probabilistically changes to another (possibly erroneous) one. Equation (2) suggests a simple way to mitigate errors on the noisy device: via left-multiplying the estimated statistics by the inverse of the noise matrix, Λ^{−1} [13,14]. From the stochasticity of Λ it follows that its inverse preserves the sum of the elements of probability vectors; however, it may introduce some unphysical (i.e., lower than 0 or higher than 1) entries in the corrected vector. A common practice in such a scenario is to solve an optimization problem in which one minimizes the distance to the corrected vector over all proper probability distributions. This introduces an additional error in the final estimates which can be easily upper-bounded [13]. Before going further, we note that while multiplication by Λ^{−1} is perhaps the most natural (and simple) method to reduce the noise and has been shown to be useful in practical situations [13,14], there exist more sophisticated techniques of noise-mitigation that do not exhibit this problem. For example, Iterative Bayesian Unfolding [18] always returns physical probability vectors. We note that our noise model and noise-mitigation on marginal probability distributions (to be introduced in the following sections) are consistent with methods other than Λ^{−1} correction, and we intend to test them in the future.
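A minimal sketch of this correction step, assuming a hypothetical single-qubit noise matrix; the clip-and-renormalize helper at the end is a simple stand-in for the full optimization over proper probability distributions.

```python
import numpy as np

# Hypothetical left-stochastic single-qubit noise matrix (columns sum to 1);
# the asymmetric error rates are illustrative, not measured values.
Lam = np.array([[0.95, 0.10],
                [0.05, 0.90]])
assert np.allclose(Lam.sum(axis=0), 1.0)

p_ideal = np.array([0.7, 0.3])
p_noisy = Lam @ p_ideal            # Eq. (2): p_noisy = Lambda p_ideal

# Mitigation by left-multiplying with the inverse noise matrix.
p_corrected = np.linalg.inv(Lam) @ p_noisy
assert np.allclose(p_corrected, p_ideal)

# With finite statistics the corrected vector may leave the simplex; a crude
# stand-in for projecting back onto proper probability distributions:
def project_to_simplex_naive(q):
    q = np.clip(q, 0.0, 1.0)
    return q / q.sum()
```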
To finish this introduction, we recall that, as mentioned above, Eq. (2) does not represent the most general model of quantum measurement noise. Specifically, coherent errors might occur and reduce the effectiveness of error-mitigation (a detailed analysis of this effect was presented in [13]). However, it was validated experimentally on multiple occasions that coherent errors in superconducting quantum devices are small, and error-mitigation by classical post-processing was shown to work very well in few-qubit scenarios [13,14].

Correlations in readout noise
The size of the matrix Λ scales exponentially with the number of qubits. Thus, if one wants to estimate such a generic Λ using standard methods, both the number of circuits and the sampling complexity scale exponentially. Indeed, the standard method of reconstructing Λ is to create all 2^N computational-basis states and estimate the resulting probability distributions (which constitute the columns of Λ). We refer to such characterization as Diagonal Detector Tomography (DDT), since it probes the diagonal elements of the measurement operators describing the detector. This is a restricted version of the more general Quantum Detector Tomography (QDT) [42][43][44][45].
These complexity issues can be circumvented if one assumes some locality structure in the measurement errors. For example, in the simplest model with completely uncorrelated readout noise, the matrix Λ is a simple tensor product of single-qubit error matrices:

Λ = Λ^{(1)} ⊗ Λ^{(2)} ⊗ … ⊗ Λ^{(N)}.

In this model, we need to estimate only single-qubit matrices, which renders the complexity of the problem linear in the number of qubits. However, for contemporary quantum devices based on superconducting qubits, it was demonstrated that such a noise model is not very accurate due to cross-talk effects [1,13,14,16]. At the same time, completely correlated noise is not realistic either, which motivates the search for a model that can account for correlations in readout errors while still giving an efficient description of Λ.
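The uncorrelated model above can be sketched as a Kronecker product of per-qubit matrices; the error rates below are hypothetical placeholders.

```python
import numpy as np

def uncorrelated_noise_matrix(single_qubit_lambdas):
    """Tensor product of per-qubit left-stochastic noise matrices."""
    Lam = np.array([[1.0]])
    for lam in single_qubit_lambdas:
        Lam = np.kron(Lam, lam)
    return Lam

# Hypothetical single-qubit error rates, identical on every qubit.
lam_q = np.array([[0.97, 0.08],
                  [0.03, 0.92]])
Lam3 = uncorrelated_noise_matrix([lam_q] * 3)

assert Lam3.shape == (8, 8)
assert np.allclose(Lam3.sum(axis=0), 1.0)   # the product is still left-stochastic
```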
In this work, we propose such a model and give a method to characterize it. Let us lay out its basic concepts. Consider the correlated errors between some group of qubits C_χ. The most general way of describing those errors is to treat the qubits in C_χ as a single object, i.e., to always consider their measurement outcomes together. In terms of the noise-matrix description, this means that the noise matrix on C_χ is some generic Λ_{C_χ} acting on C_χ. This gives rise to the first basic object in our model: the clusters of qubits. The cluster C_χ is a group of qubits with correlations between them so strong that one cannot consider the outcomes of their measurements separately. At the same time, it is unlikely that in actual devices the correlations between all the qubits will be so strong that one should assign them all to a single cluster. This motivates the introduction of another, milder possibility. Consider a measurement performed on the qubits in a cluster C_χ and on some other qubits N_χ (not belonging to that cluster). It is conceivable that some complicated physical process results in a situation in which the noise matrix Λ_{C_χ} on the cluster C_χ slightly depends on the state of the qubits in N_χ. To account for that, we introduce the second basic object of our model: the neighborhood of a cluster. The neighborhood N_χ of a cluster C_χ is a group of qubits whose state just before the measurement slightly affects the noise matrix acting on the cluster C_χ.
For example, if C_χ contains only a single qubit, say Q_0, it is possible that due to some effective ferromagnetic-type interaction, the probability of erroneously detecting the state "0" of Q_0 as "1" rises when the neighboring qubit Q_1 is in state "1" (compared to when it is in state "0").
A notion related to our "neighborhood" has appeared in recent literature. Specifically, in the context of measurement-error characterization, 'spectator qubits' are qubits that affect the measurement noise on other qubits [16,20]. However, so far this effect has been treated rather as an undesired complication, while here it is an inherent element of the proposed noise model. Now we are ready to provide an efficient noise model. We construct a global noise matrix Λ with matrix elements of the following form:

Λ_{X|Y} = ⊗_χ Λ^{Y_{N_χ}}_{X_{C_χ}|Y_{C_χ}}.   (5)

In the next subsection we give some illustrative examples, but first let us thoroughly describe the notation used in the above equation. A collection {C_χ}_χ gives us a partitioning of the set of all qubits. Explicitly, ∪_χ C_χ = {Q_0, …, Q_{N−1}} with C_χ ∩ C_{χ'} = ∅ for χ ≠ χ', where N is the number of qubits. To each cluster C_χ the model associates its neighborhood N_χ. Equation (5) can now be understood in the following way. The noise matrix Λ^{Y_{N_χ}} describing the measurement noise occurring in cluster C_χ depends on the state, just before measurement, of the qubits in the neighborhood N_χ of that cluster (hence the superscript Y_{N_χ} denoting that state). Importantly, each Λ^{Y_{N_χ}} is left-stochastic for any state of the neighbors. By X_{C_χ} (or Y_{C_χ}) we denote bit-strings of qubits belonging to cluster C_χ which were measured (or put inside the device just before measurement). Finally, Y_{N_χ} indicates the bit-string denoting the state just before the measurement of the qubits from the neighborhood N_χ of the cluster C_χ (see Fig. 1a for an illustration).

Figure 1: Exemplary correlations in the measurement noise that can be captured by our model. Each circle represents a qubit. Subfigure a) represents a 9-qubit device with some cluster C_χ (orange envelope) consisting of two qubits. The noise on the qubits from that cluster depends on the state of the qubits from its neighborhood N_χ (magenta envelope). Note that the "neighborhood" does not have to correspond to the spatial arrangement of qubits in the device. Subfigure b) shows a more detailed example of a four-qubit device. Qubits Q0 and Q1 are in one cluster, which is indicated by coloring them with the same color. Qubits Q2 and Q3 are white, meaning they do not belong to any non-trivial cluster. Qubits at the beginning of an arrow are the neighbors of the qubits at the end of the arrow. Correlations in the readout errors for qubits Q0 and Q1 can be arbitrary, while for the rest of the qubits the dependencies are restricted by the structure of the clusters and the neighbors. In particular, the noise matrix on Q3 depends on the state of Q1, while the state of Q3 affects the noise on qubits Q0 and Q2. At the same time, qubit Q2 does not affect the noise matrix on any other qubit. See the description in the main text.

Note that in general the correlations in measurement errors (expressed by the structure of the clusters and neighborhoods) do not need to be directly related to the physical layout of the device. In general, Λ in Eq. (5) is specified by only Σ_χ (2^{|C_χ|})^2 2^{|N_χ|} parameters, where |C_χ| and |N_χ| are the sizes of the χ'th cluster and its neighborhood, respectively. Therefore this description is efficient provided the sizes of the clusters and their neighborhoods are bounded by a constant.
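To make the bookkeeping of Eq. (5) concrete, the following sketch assembles matrix elements of a global noise matrix from conditional cluster matrices. The cluster and neighborhood structure follows the four-qubit example of Fig. 1b, while the conditional matrices themselves are randomly generated placeholders rather than characterized data.

```python
import numpy as np

def random_left_stochastic(d, rng):
    """A random left-stochastic d x d matrix with a dominant diagonal."""
    A = rng.random((d, d)) + d * np.eye(d)
    return A / A.sum(axis=0)

rng = np.random.default_rng(0)
# (cluster qubits, neighborhood qubits), as in Fig. 1b:
# cluster {Q0,Q1} with neighbor Q3; {Q2} with neighbor Q3; {Q3} with neighbor Q1.
clusters = [((0, 1), (3,)), ((2,), (3,)), ((3,), (1,))]
# One conditional noise matrix per computational-basis state of the neighborhood.
cond = {cl: {y: random_left_stochastic(2 ** len(cl[0]), rng)
             for y in range(2 ** len(cl[1]))} for cl in clusters}

def global_element(x, y):
    """Matrix element Lambda_{X|Y}: product over clusters of the conditional
    elements Lambda^{Y_N}_{X_C|Y_C}, cf. Eq. (5)."""
    val = 1.0
    for (C, Nb) in clusters:
        yn = int("".join(str(y[q]) for q in Nb), 2)   # neighborhood state
        xc = int("".join(str(x[q]) for q in C), 2)    # measured cluster outcome
        yc = int("".join(str(y[q]) for q in C), 2)    # prepared cluster state
        val *= cond[(C, Nb)][yn][xc, yc]
    return val

# Left-stochasticity of the implied 16 x 16 matrix: each column sums to 1.
y = (0, 1, 0, 1)
col = sum(global_element(tuple(int(b) for b in f"{i:04b}"), y) for i in range(16))
assert np.isclose(col, 1.0)
```

Since the clusters partition the qubits, the column sum factorizes into a product of column sums of the conditional matrices, which is why left-stochasticity is inherited automatically.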

Illustrative examples
In what follows we present examples of readout correlation structures that can be described with our model. It is instructive to start with a simple example of a hypothetical four-qubit device depicted in Fig. 1b. Note that in this example we have only one non-trivial (i.e., of size ≥ 2) cluster. Let us write explicitly the matrix elements of the global noise matrix acting on that exemplary 4-qubit device:

Λ_{X|Y} = Λ^{Y_3}_{X_0 X_1 | Y_0 Y_1} ⊗ Λ^{Y_3}_{X_2 | Y_2} ⊗ Λ^{Y_1}_{X_3 | Y_3}.   (6)

Note that on the RHS of Eq. (6), the superscript Y_3 appears two times, indicating that the noise matrices on the cluster C_1 and on the cluster C_2 both depend on the state just before measurement of qubit Q_3. At the same time, there is no superscript Y_2, which follows from the fact that qubit Q_2 does not affect the noise on any other qubits. Note that while a generic noise matrix on 4 qubits would require the reconstruction of a 16 × 16 matrix, here we need only a number of smaller-dimensional matrices to fully describe the noise.
We now move to a more general readout-error model, which is particularly inspired by current superconducting-qubit implementations of quantum computing devices. Consider a collection of qubits arranged on a device with limited connectivity. This can be schematically represented as a graph G(V, E), where each vertex in V represents a qubit and each edge in E connects two qubits that can interact in the device. In such a scenario, a natural first step beyond an uncorrelated readout noise model is a nearest-neighbour correlated model, where readout errors on each qubit are assumed to be influenced at most by the state of the neighbouring ones. Using the notation introduced in the previous section, we can represent such a model by associating a single-qubit cluster to each vertex, C_i = {i} for i ∈ V, and defining the neighbourhoods according to the graph structure, namely N_i = {j : (i, j) ∈ E}. The global noise matrix then reads

Λ_{X|Y} = ⊗_{i∈V} Λ^{Y_{N_i}}_{X_i|Y_i}.   (7)

If we specialise this to the case of a 2D rectangular lattice of size L, the neighbourhood of a generic vertex (i.e., one not belonging to the boundary) consists of its four nearest neighbours. Hence the noise on each qubit can be represented by a collection of 2^4 = 16 matrices of size 2 × 2, which is an exponential improvement with respect to a generic noise matrix of size 2^{L^2} × 2^{L^2}.
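A small sketch of the parameter counting for this nearest-neighbour model on an L × L lattice; the row-major qubit indexing is an assumption of the example.

```python
# Nearest-neighbour readout-noise model on an L x L lattice:
# each single-qubit cluster C_i = {i} has neighbourhood N_i = {j : (i,j) in E}.
L = 4

def neighbours(i):
    """Lattice neighbours of vertex i under row-major indexing."""
    r, c = divmod(i, L)
    out = []
    if r > 0:
        out.append(i - L)
    if r < L - 1:
        out.append(i + L)
    if c > 0:
        out.append(i - 1)
    if c < L - 1:
        out.append(i + 1)
    return out

# A bulk qubit has 4 neighbours -> 2**4 = 16 conditional 2x2 matrices.
bulk = L + 1   # first interior vertex
assert len(neighbours(bulk)) == 4

# Total parameter count: 2**|N_i| conditional 2x2 matrices per qubit,
# versus a single 2**(L*L) x 2**(L*L) matrix for unstructured noise.
n_params = sum(2 ** len(neighbours(i)) * 2 * 2 for i in range(L * L))
assert n_params < (2 ** (L * L)) ** 2
```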
Although the above correlated noise model seems a very natural one, we will see in the following Section that it does not encompass all the correlated readout errors in current superconducting devices, for which it will be more convenient to resort to models (5) with more general cluster and neighbourhood structures that do not necessarily correspond to the physical layout of the devices.

Characterization of readout noise
Here we outline a strategy to determine a noise matrix of the form (5) which closely represents the readout noise of a given device. We proceed in two steps: first we infer the structure of the clusters {C_χ} and neighbourhoods {N_χ} by making use of Diagonal Detector Tomography (DDT); then we proceed to experimentally determine the corresponding noise matrices. For the first step, we propose to reconstruct all two-qubit noise matrices (averaged over all other qubits) by means of DDT and to calculate the following quantities:

c_{j→i} := (1/2) ||Λ^{Y_j=0}_{(i)} − Λ^{Y_j=1}_{(i)}||_{1→1},   (8)

where Λ^{Y_j}_{(i)} denotes the noise matrix on qubit Q_i conditioned on the state Y_j of qubit Q_j just before measurement (averaged over all other qubits). The above quantity has a strong operational motivation in terms of the Total-Variation Distance (TVD). This distance quantifies the statistical distinguishability of probability distributions p and q and can be defined by

TVD(p, q) := (1/2) Σ_x |p(x) − q(x)|.

We can give the following intuitive interpretation of the quantity from Eq. (8): c_{j→i} represents the maximal TVD by which the output probability distributions on qubit Q_i can differ due to the impact of the state of qubit Q_j on the readout noise on Q_i. Note that in general c_{j→i} ≠ c_{i→j}, which encapsulates the asymmetry of the correlations that is built into our noise model.
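A sketch of how the correlation coefficients of Eq. (8) could be evaluated from reconstructed conditional noise matrices; the matrices below are hypothetical stand-ins for DDT estimates.

```python
import numpy as np

def one_to_one_norm(M):
    """Induced 1->1 norm of a matrix: the maximal column 1-norm."""
    return np.abs(M).sum(axis=0).max()

def correlation_coefficient(lam_i_given_j0, lam_i_given_j1):
    """c_{j->i}: maximal TVD between the noise on Q_i for the two states of Q_j."""
    return 0.5 * one_to_one_norm(lam_i_given_j0 - lam_i_given_j1)

# Hypothetical conditional noise matrices on Q_i (averaged over other qubits):
lam0 = np.array([[0.95, 0.08], [0.05, 0.92]])  # neighbor Q_j prepared in |0>
lam1 = np.array([[0.85, 0.08], [0.15, 0.92]])  # neighbor Q_j prepared in |1>
c = correlation_coefficient(lam0, lam1)
```

Note that for a difference of two left-stochastic matrices, half the 1→1 norm is exactly the maximum (over input states) of the TVD between the two output distributions, matching the operational interpretation given above.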
We propose to use the values of c_{j→i} to decide whether qubits should belong to the same cluster, to a neighborhood, or should be considered uncorrelated (a simple, intuitive algorithm for this procedure is presented in Appendix B.2, Algorithm 3; in the future, we intend to extend those methods). After doing so, the noise matrices {Λ^{Y_{N_χ}}_{X_{C_χ}|Y_{C_χ}}} can be reconstructed by means of joint DDT over the sets of qubits {C_χ ∪ N_χ}_χ.
In the above construction we assumed that the joint size of a cluster and its neighborhood is at most k. This means that one has to gather DDT data on subsystems of fixed size, implying a number of different circuits that scales at most as O(N^k). However, for any characterization procedure, it is expedient to use as few resources as possible. In order to reduce the number of circuits even further, in the next Section we generalize the recently introduced Quantum Overlapping Tomography (QOT) [33] (see also recent follow-ups [46,47]) to the context of Diagonal Detector Tomography. We will refer to our method as Diagonal Detector Overlapping Tomography (DDOT).

Diagonal Detector Overlapping Tomography
Quantum Overlapping Tomography is a technique that was introduced for a problem of efficient estimation of all k-particle marginal quantum states. The main result of Ref.
[33] was to use the concept of hashing functions [48][49][50][51] to reduce the number of circuits needed to reconstruct all k-qubit marginal states. Specifically, it was shown there that O(e^k log(N)) circuits suffice for this purpose. Here we propose to use an analogous technique to estimate all noise matrices corresponding to k-particle subsets of qubits. Specifically, we propose to construct a collection of circuits consisting of certain combinations of identity and X gates in order to initialize qubits in the states |0⟩ or |1⟩. For fixed k, the collection of quantum circuits for DDOT must have the following property: for each subset of qubits of size k, every computational-basis state on that subset must appear at least once in the whole collection of circuits. Intuitively, if a collection has this property, then the implementation of all circuits in the collection allows us to perform a tomographic reconstruction (via standard DDT) of the noise matrices on all k-qubit subsets. One can think of DDOT as a method of parallelizing multiple local DDTs in order to minimize the number of circuits needed to obtain a description of all local k-qubit noise processes. In Appendix B.4 we show that it suffices to implement O(k 2^k log(N)) quantum circuits consisting of random combinations of X and identity gates in order to construct a DDOT circuit collection that captures all k-qubit correlations in readout errors (see Algorithm 1 and Algorithm 2). This is an exponential improvement over the standard technique of performing local DDTs separately (which, as mentioned above, requires O(N^k) circuits). For example, if one chooses k = 5 for an N = 15-qubit device, the naive estimation of all 5-qubit marginals would require the implementation of 2^5 × (15 choose 5) ≈ 10^5 quantum circuits, while DDOT allows doing so using ≈ 350 circuits. We note that this efficiency, however, comes at a price.
Namely, since different circuits are sampled with different frequencies, some false-positive correlations might appear. This may cause some correlations in the reconstructed noise model to be overestimated (see Appendix B.6 for a detailed explanation of this effect). This effect can be mitigated either by certain post-processing of the experimental results (see Appendix B.6), or by constructing DDOT collections that sample each term the same number of times. Using probabilistic arguments, in Appendix B.3 we show that a number of circuits exponential in k and logarithmic in N still suffices if we want all k-particle subsets to be sampled with approximately equal frequency.
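The defining property of a DDOT collection is easy to verify programmatically. The sketch below checks whether a set of X/identity layers (encoded as bitstrings) realizes all computational-basis states on every k-qubit subset; the toy collection is a deterministic illustration, and a random candidate of size O(k 2^k log N) could be certified with the same function.

```python
import numpy as np
from itertools import combinations

def covers_all_k_subsets(circuits, k):
    """DDOT property: every k-qubit subset sees all 2**k local basis states."""
    for subset in combinations(range(circuits.shape[1]), k):
        seen = {tuple(row[list(subset)]) for row in circuits}
        if len(seen) < 2 ** k:
            return False
    return True

# Each DDOT circuit is a single layer of X / identity gates, i.e. a bitstring
# over N qubits; a random candidate collection can be drawn like this:
rng = np.random.default_rng(42)
candidate = rng.integers(0, 2, size=(60, 8))
# covers_all_k_subsets(candidate, k) then certifies (or rejects) the candidate.

# Deterministic toy check on 4 qubits: these four circuits form a valid DDOT
# collection for k = 1 (every qubit prepared in both |0> and |1>), but fail
# for k = 2 (e.g. the pair (0, 2) never sees the state 01).
toy = np.array([[0, 0, 0, 0],
                [1, 1, 1, 1],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
assert covers_all_k_subsets(toy, 1)
assert not covers_all_k_subsets(toy, 2)
```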

Experimental results
We implemented the above procedure with k = 5 for IBM's 15-qubit Melbourne device and a 23-qubit subset of Rigetti's Aspen-8 device (the details of the experiments are given in the Appendices). The obtained correlation models are depicted in Fig. 2. In the case of Rigetti's device, our procedure reports a very complicated structure of multiple correlations in the readout noise, while in the case of IBM's device the correlations are fairly simple. We discuss this issue in detail in further sections while presenting the results of noise-mitigation benchmarks. Here we conclude with the observation that, despite common intuition, the structure of the correlations in the readout noise cannot be directly inferred from the physical layout of the device.

Noise mitigation on marginals
Now we will analyze the estimation and correction of marginal probability distributions affected by the kind of correlated noise model introduced in the previous section. In the next subsection, we will use those findings to propose a benchmark of the adopted noise model.

Noise on marginal probability distributions
Let us denote by p^noisy a global probability distribution generated by the measurement of an arbitrary quantum state on a noisy detector for which Eq. (5) holds. As mentioned previously, for many interesting problems, such as the QAOA or VQE algorithms, one is interested not in the estimation of p itself (which is an exponentially hard task), but instead in the estimation of multiple marginal probability distributions obtained from p. Let us say that we are interested in the marginal on a subset S formed by the clusters C_χ indexed by a set A, S := ∪_{χ∈A} C_χ, where each C_χ is some cluster of qubits (see the green envelope in Fig. 3 for an illustration). Our goal is to perform error mitigation on S. To achieve this, we need to understand how our model of noise affects the marginal distribution on S.
From the definition of the noise model in Eq. (5) we get that the marginal probability distribution p^noisy_S on S is a function of the local noise matrices acting on the qubits from S and on the "joint neighborhood" of S, N(S) := (∪_{χ∈A} N_χ) \ S. The set N(S) consists of qubits which are neighbors of qubits from S but do not themselves belong to S.
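As a minimal illustration of the marginals in question, the sketch below reduces a global probability vector to a subset S; taking qubit 0 as the most significant bit is a convention of this example.

```python
import numpy as np

def marginal(p, subset, n_qubits):
    """Marginal of a 2**n probability vector on the given qubit subset
    (qubit 0 is the most significant bit of the index)."""
    P = p.reshape([2] * n_qubits)
    trace_out = tuple(q for q in range(n_qubits) if q not in subset)
    return P.sum(axis=trace_out).reshape(-1)

# A 3-qubit distribution supported on the outcomes 101 and 001.
p = np.zeros(8)
p[0b101] = 0.5
p[0b001] = 0.5

# Marginal on S = {q0, q2}: outcomes (1,1) and (0,1), each with weight 0.5.
m = marginal(p, {0, 2}, 3)
```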
Because of this, one cannot simply use the standard mitigation strategy, i.e., estimate p^noisy_S and reconstruct the probability distribution p^ideal_S by applying the inverse of Λ (more discussion of this matter is given in the Appendices).
To circumvent the above problem, we propose the following natural ansatz for the construction of an approximate effective noise model on the marginal S,

Λ^S_av := (1 / 2^{|N(S)|}) Σ_{Y_{N(S)}} Λ^{Y_{N(S)}},

where the summation is over the states Y_{N(S)} of the qubits in the joint neighborhood N(S) defined above. In other words, it is the noise matrix averaged over all states of the neighbors of the clusters in S, excluding potential neighbors which themselves belong to the clusters in S. Indeed, it might happen that a qubit from one cluster is a neighbor of a qubit from another cluster; in that case, one does not average over it but includes it in the noise model. Importantly, the average matrices Λ^S_av can be calculated explicitly using the data obtained in the characterization of the readout noise.

(Figure 2 caption: For the layout of the graphs, we used the qubits' actual connectivity in the devices (i.e., it is possible to physically implement two-qubit entangling gates on all nearest neighbors in the graph). For IBM's device, we included qubits in the same cluster if the correlations given by Eq. (8) were higher than 4% in any direction, and we marked qubits as neighbors if the correlations were higher than 1%. In the case of Rigetti, the respective thresholds were chosen to be 6% and 2%. Moreover, for Rigetti we imposed locality constraints by forcing the joint size of the cluster and the neighborhood to be at most 5, disregarding the smallest correlations. In practice, the correlations within clusters were significantly higher than the chosen thresholds; heatmaps of all correlations can be found in Appendix E.2.)
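Concretely, the averaged matrix is just the uniform average of the characterized conditional noise matrices. A minimal numpy sketch (a single-qubit cluster S with a single-qubit neighborhood; the matrix values are hypothetical, not characterization data):

```python
import numpy as np

# Conditional noise matrices Lambda^{Y} on the cluster S, one per
# computational-basis state Y of the neighborhood N(S). Columns are indexed
# by the true outcome, rows by the observed one, so each column sums to 1.
lambda_given_neighbor = {
    "0": np.array([[0.98, 0.05],
                   [0.02, 0.95]]),
    "1": np.array([[0.90, 0.08],
                   [0.10, 0.92]]),
}

def averaged_noise_matrix(conditional):
    """Uniform average over all neighbor states: the ansatz Lambda^S_av."""
    return sum(conditional.values()) / len(conditional)

lam_av = averaged_noise_matrix(lambda_given_neighbor)
# lam_av is still a stochastic matrix (each column sums to 1).
```

The average of stochastic matrices is stochastic, so Λ^S_av is a valid noise matrix by construction.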

Approximate noise mitigation
The noise matrix Λ^S_av can be used to construct the corresponding effective correction matrix C^S_av := (Λ^S_av)^{-1}. Correcting the marginal distribution via left-multiplication by this ansatz matrix is not perfect and can introduce an error in the mitigation. In the following Proposition 1 we provide an upper bound on that error measured in TV distance; the bound is governed by the quantity (1/2) max_{Y_{N(S)}} ||Λ^S_av − Λ^{Y_{N(S)}}||_{1→1}, where the maximization goes over all possible states of the neighbors of S.
The proof of the above Proposition is given in Appendix A.2; it uses the convexity of the set of stochastic matrices together with standard properties of matrix norms and the triangle inequality (similar methods were used to derive error bounds on mitigated statistics in Ref. [13]). Note that the quantity on the RHS of Eq. (12) resembles c_{i→j} from Eq. (8) (which we used to quantify correlations). Hence (1/2) max_{Y_{N(S)}} ||Λ^S_av − Λ^{Y_{N(S)}}||_{1→1} can be interpreted as the maximal TVD between states on S generated by Λ^S_av and states generated by the Λ^{Y_{N(S)}} (which appear in the description of the noise model). It can also be interpreted as a measure of how strongly the noise on the qubits in S depends on the state of their neighbors just before the measurement. Indeed, if the true noise does not depend on the state of the neighbors, the RHS of inequality Eq. (12) is 0, and it grows as the noise matrices Λ^{Y_{N(S)}} increasingly differ.
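The correction step and the quantity appearing in the Proposition 1 bound can be sketched as follows (the matrices and the `one_to_one_norm` helper are our own illustrative assumptions):

```python
import numpy as np

def one_to_one_norm(m):
    """||M||_{1->1} = maximum absolute column sum."""
    return np.abs(m).sum(axis=0).max()

def mitigate(p_noisy, lam_av):
    """Left-multiply the noisy marginal by the correction C = Lambda_av^{-1}."""
    return np.linalg.inv(lam_av) @ p_noisy

def mitigation_error_bound(lam_av, conditional_lams):
    """(1/2) max_Y ||Lambda_av - Lambda^Y||_{1->1}, the state-dependence measure."""
    return 0.5 * max(one_to_one_norm(lam_av - lam_y)
                     for lam_y in conditional_lams)

lams = [np.array([[0.98, 0.05], [0.02, 0.95]]),
        np.array([[0.90, 0.08], [0.10, 0.92]])]
lam_av = sum(lams) / 2
p_corr = mitigate(np.array([0.7, 0.3]), lam_av)   # corrected marginal
bound = mitigation_error_bound(lam_av, lams)
```

Note that if the conditional matrices coincide, the bound vanishes, consistent with the discussion above; also, since the columns of Λ^S_av sum to 1, the corrected vector still sums to 1 (though it may contain negative quasi-probabilities).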
In practice, it might happen that one is interested in the marginal distribution on qubits from some subset S_k ⊂ S (red envelope in Fig. 3). In principle, one could then consider a coarse-graining of the noise model within S (i.e., construct a noise model averaged over the qubits from S that do not belong to S_k, treating those qubits like neighbors) and perform error mitigation on the coarse-grained subset S_k. However, due to the high level of correlations within clusters, we expect such a strategy to work worse than performing error mitigation on S and then taking the marginal on S_k. Indeed, we observed numerous times that the latter strategy works better in practice. Yet, it is also more costly (since, by definition, S is bigger than S_k); hence, in actual implementations with restricted resources, one may also consider implementing the coarse-grained strategy. In the following sections, we focus on error mitigation on the set S. All these considerations can be easily generalized to the case of S_k ⊂ S.
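The "mitigate on S, then marginalize to S_k" strategy reduces to a simple sum over the dropped bits. A sketch (the 3-qubit distribution is hypothetical and stands in for an already error-mitigated marginal on S):

```python
import numpy as np

def marginalize(p, n_qubits, keep):
    """Marginal of a distribution over n-qubit bitstrings onto qubits `keep`."""
    p = np.asarray(p).reshape([2] * n_qubits)
    drop = tuple(q for q in range(n_qubits) if q not in keep)
    return p.sum(axis=drop).reshape(-1)

# Error-mitigated distribution on the full cluster set S (3 qubits),
# then marginalized onto S_k = {0, 1}.
p_s = np.array([0.40, 0.10, 0.05, 0.05, 0.20, 0.10, 0.05, 0.05])
p_sk = marginalize(p_s, 3, keep={0, 1})   # -> [0.5, 0.1, 0.3, 0.1]
```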

Sampling complexity of error-mitigation
Let us now briefly comment on the sampling complexity of this error-mitigation scheme (a detailed discussion is postponed to Section 6). If one is interested in estimating the expected value of a local Hamiltonian, a standard strategy is to estimate the local marginals and calculate the expected values of the local Hamiltonian terms on those marginals. Without any error mitigation, this has sampling complexity exponential in the locality of the marginals (which for local Hamiltonians is small) and logarithmic in the number of local terms (hence, for typically considered Hamiltonians, also logarithmic in the number of qubits); see Eq. (18) and its derivation in Appendix A.3. Adding error mitigation on the marginals to this picture does not, under reasonable assumptions, significantly change the scaling of the sampling complexity. We identify two sources of increased sampling complexity (compared to non-mitigated marginal estimation). First, noise mitigation via the inverse of the noise matrix does propagate statistical deviations; the bound on this effect depends quantitatively on the norm of the correction matrix (see Ref. [13] and the detailed discussion around Eq. (18) in Section 6). Assuming that the local noise matrices are not singular (which is anyway required for error mitigation to work), this increases the sampling complexity by a constant factor. Second, additional errors can come from the fact that, as described above, it is sometimes desirable to perform noise mitigation on higher-dimensional marginals (if some qubits are highly correlated). However, assuming that the readout noise has bounded locality, this can increase the sampling complexity only by a constant factor (proportional to the increase of the marginal size compared to estimation without error mitigation).
In both cases, for a fixed size of the marginals (as is the case for local Hamiltonians), this does not change the scaling of the sampling complexity with the number of qubits, which remains logarithmic.

5 Benchmark of the noise model

Energy estimation of local Hamiltonians
After having characterized the noise model, how can we assess whether it is accurate? To answer this question we propose the following application-driven, heuristic benchmark. The main idea is to test whether the error mitigation of local marginals based on the adopted noise model is accurate. To check this we propose to consider the problem of estimating the expectation value ⟨H⟩ of a local classical Hamiltonian H = Σ_α H_α measured on its ground state |ψ_0(H)⟩. Here by "local" we mean that the maximal number of qubits on which each H_α acts non-trivially does not scale with the system size. Classicality of the Hamiltonian means that every H_α is a linear combination of products of σ_z Pauli matrices. In turn, the ground state |ψ_0(H)⟩ can be chosen to be classical, i.e., |ψ_0(H)⟩ = |X(H)⟩ for some bit-string X(H) representing one of the states from the computational basis. This problem is a natural candidate for an error-mitigation benchmark for at least three reasons. First, a variety of interesting optimization problems can be mapped to Ising-type Hamiltonians of the form in Eq. (13). Indeed, this is the type of Hamiltonian appearing in the Quantum Approximate Optimization Algorithm, whose goal is to get as close as possible to the ground state |ψ_0(H)⟩. Second, the estimation of ⟨H⟩ reduces to the estimation of the energies of the local terms H_α, and therefore error mitigation on marginals can be efficiently applied. Finally, the preparation of the classical ground state |ψ_0⟩, once it is known, is very easy and requires only the application of local σ_x (NOT) gates. This works in our favor because we want to isolate the effects of the readout noise, and single-qubit gates are usually of high quality in existing devices.
To perform the benchmark we propose to implement quantum circuits preparing the ground states of many different local classical Hamiltonians, measure them on the noisy device, and perform two estimations of the energy: first from the raw data, and second with error mitigation based on our characterization. Naturally, it is also desirable to compare both with error mitigation based on a completely uncorrelated noise model (cf. Eq. (4)). If the mitigation based on a particular noise model works well on average (over some number of Hamiltonians), one can infer that this model accurately describes the readout noise.
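For intuition, the benchmark's core quantities are easy to emulate classically: the exact energy of a classical Ising Hamiltonian on a basis state versus the estimate obtained from readout corrupted by bit-flips. The Hamiltonian, basis state, and flip probability below are illustrative assumptions, not the instances used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def ising_energy(bits, J, h):
    """Energy of H = (1/2) s.J.s + h.s on a computational-basis bit-string,
    with Z eigenvalues s = 1 - 2*bit (J symmetric, zero diagonal)."""
    s = 1 - 2 * np.asarray(bits)
    return s @ J @ s / 2 + h @ s

n = 4
J = np.zeros((n, n)); J[0, 1] = J[1, 0] = 1.0; J[2, 3] = J[3, 2] = -0.5
h = np.array([0.3, -0.2, 0.1, 0.0])
ground = np.array([0, 1, 0, 0])          # candidate classical ground state

exact = ising_energy(ground, J, h)       # noiseless energy of |X(H)>

# Noisy readout: flip each measured bit independently with probability 2%.
samples = np.tile(ground, (2000, 1))
flips = rng.random(samples.shape) < 0.02
noisy_estimate = np.mean([ising_energy(b, J, h) for b in samples ^ flips])
```

Even a 2% uncorrelated flip rate visibly biases the energy estimate, which is exactly the deviation the benchmark measures (and which mitigation is meant to remove).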

Experimental results
We applied the benchmark strategy described above to the 15-qubit IBM Melbourne device and to a 23-qubit subset of Rigetti's Aspen-8 device. We implemented Hamiltonians encoding random MAX-2-SAT instances with clause density 4 (600 on IBM and 399 on Rigetti) and fully connected graphs with random interactions and local fields (600 on IBM). MAX-2-SAT instances were generated by considering all possible sets of 4 * 15 = 60 clauses with 8 variables, choosing one at random and mapping it to an Ising Hamiltonian (see, e.g., Ref. [52]). Random interactions and local fields for the fully connected graphs were chosen uniformly at random from [−1, 1]. Figure 4 presents the results of our experiments, together with a comparison with the uncorrelated noise model.
Let us first analyze the results of the experiments performed on the 15-qubit IBM device. Here it is clear that the error mitigation based on our noise model performs well, often reducing errors in the estimation by as much as an order of magnitude. We further note that the uncorrelated noise model also performs quite well (yet visibly worse than ours). To compare the accuracy using a single number (as opposed to looking at the whole histogram), we take the ratio of the mean deviations from the ideal energy for the error-mitigated data based on the two models. The results are presented in the tables embedded in Fig. 4. In those tables we also provide additional experimental data. Namely, the second row of each table shows data for the noise model labeled "only neighbors". This corresponds to a noise model in which each qubit is a trivial, single-qubit cluster and correlations are included only via neighborhoods. The worse error-mitigation results for this model, as compared to the full clusters-and-neighborhoods model, motivate the introduction of non-trivial clusters. Furthermore, we found experimentally that the characterization of the uncorrelated noise model exhibits significant memory effects (see, for example, Ref. [53]). In particular, if one performs the uncorrelated noise characterization in the standard way, i.e., in a separate job request to the provider, without any other preceding experiments ("local 2" in the tables), the accuracy (measured by the error in the energy after mitigation based on the given noise model) is much lower than for a characterization preceded by some other experiments ("local 1" in the tables). Indeed, the difference in mean accuracy can be as big as ≈ 26%.
Clearly, the overall performance of Rigetti's 23-qubit device is lower than that of IBM's device. First, the effects of noise (measured in energy error per qubit) are stronger. Second, the mean error with error mitigation is only around ≈ 5.6 times smaller than the error without error mitigation (as opposed to a factor of over 22 for IBM's device). Third, the comparison with the uncorrelated noise model shows that the uncorrelated model performs not much worse than the correlated one.
Here we provide possible explanations for the poorer quality of the experiments performed on Rigetti's device. Due to the limited availability of the device, we used a much lower number of samples to estimate the Hamiltonians' energies in those experiments. Specifically, each energy estimator for Rigetti's experiments was calculated using only 1000 samples, while for IBM's experiments the number of samples was 40960. This alone should lead to statistical errors higher by a factor of roughly √41 ≈ 6.4 (and, indeed, the errors in the error-mitigated energy estimation on Rigetti's device are around 7.8 times higher than the corresponding errors for IBM's device for the same class of Hamiltonians). Similarly, we used fewer measurements to perform DDOT: on Rigetti's device we implemented 504 DDOT circuits sampled 1000 times each, while on IBM's device we performed 749 circuits sampled 8192 times each. Fewer DDOT circuits imply a less balanced collection; hence, as already mentioned, some correlations might have been overestimated. In summary, our characterization of this device was in general less accurate than that of IBM's device. This might have been further amplified by the fact that the single-qubit gates (which are used to implement DDOT circuits) were of lower quality on Rigetti's device. Finally, as illustrated in Fig. 2, we observed that the correlations in the measurement noise of Rigetti's device are much more complex than in the case of IBM's. As mentioned in the Figure's description, to work around this we imposed locality constraints in the constructed noise model by disregarding the lowest correlations between qubits, which made the model less accurate.
We note that, due to the limited availability of Rigetti's device, we did not perform a study of memory effects similar to that performed on IBM's device. The local noise model presented for this device originates from a separate uncorrelated noise characterization performed prior to the rest of the experiments (hence it is analogous to the "local 2" model in IBM's case).
To summarize, the presented results suggest that in experiments on near-term quantum devices it will be indispensable to account for cross-talk effects in measurement noise. For both studied quantum devices we provided proof-of-principle experiments showing significant improvements in ground-state energy estimation for system sizes in the NISQ regime. Motivated by those results, we hope that the framework developed in this work will prove useful in future, more complex experiments on even larger systems.

Error analysis for QAOA
In this section, we analyze the magnitude of the errors resulting from our noise-mitigation scheme when applied to an energy estimation problem. Those errors arise from a combination of two independent sources: first, from the fact that we use approximate correction matrices instead of the exact ones (see Proposition 1); second, from statistical errors due to the common practice of measuring multiple marginals simultaneously in a single run of the experiment. In the following, we analyze the two effects separately and then provide a bound that takes both into account. We restrict our analysis to local Hamiltonians diagonal in the computational basis. A detailed derivation of the results below can be found in the Appendices.

QAOA overview
Before starting, let us provide a short overview of the QAOA algorithm. In the standard implementation [24], one initializes the quantum system in the state |+⟩^⊗N, where |+⟩ = (1/√2)(|0⟩ + |1⟩). Then a p-layer QAOA is performed via the implementation of unitaries of the form U_p = Π_{j=1}^{p} exp(−i β_j H_D) exp(−i α_j H_O), where {α_j}, {β_j} are the angles to be optimized, H_D is the driver (mixing) Hamiltonian, and H_O is the objective Hamiltonian that one wishes to optimize (i.e., to approximate its ground-state energy). The quantum state after the p-th layer is |ψ_p⟩ = U_p |+⟩^⊗N, and the function passed to the classical optimizer is the estimator of the expected value ⟨ψ_p| H_O |ψ_p⟩ (note that this makes those estimators effectively a function of the parameters {α_j}, {β_j}). The estimator is obtained by sampling from the distribution defined by the measurement of |ψ_p⟩ in the computational basis, taking the relevant marginals, and calculating the expected value of H_O using those estimated marginals. Let us now proceed to the analysis of the possible sources of errors while performing noise mitigation on the level of marginals to estimate the energy of local Hamiltonians, such as those present in QAOA.
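The circuit structure described above can be sketched with a small statevector simulation. Here H_O is a hypothetical random diagonal spectrum, H_D = Σ_i X_i is the standard transverse-field driver, and the angles are arbitrary (no optimization is performed):

```python
import numpy as np

n = 4
dim = 2 ** n
rng = np.random.default_rng(1)

# Diagonal (classical) objective Hamiltonian H_O, stored as its spectrum.
h_o = rng.uniform(-1, 1, size=dim)

def apply_phase(state, alpha):
    """exp(-i alpha H_O) acting on a statevector (H_O is diagonal)."""
    return np.exp(-1j * alpha * h_o) * state

def apply_driver(state, beta):
    """exp(-i beta H_D) with H_D = sum_i X_i: a product of 1-qubit rotations."""
    c, s = np.cos(beta), -1j * np.sin(beta)
    for q in range(n):
        psi = state.reshape(2 ** q, 2, -1)
        state = np.stack([c * psi[:, 0, :] + s * psi[:, 1, :],
                          s * psi[:, 0, :] + c * psi[:, 1, :]], axis=1).ravel()
    return state

# p = 2 QAOA layers starting from |+>^n (uniform superposition).
state = np.full(dim, 1 / np.sqrt(dim), dtype=complex)
for alpha, beta in [(0.4, 0.7), (0.8, 0.3)]:
    state = apply_driver(apply_phase(state, alpha), beta)

# <psi_p| H_O |psi_p>: the quantity passed to the classical optimizer.
energy = float(np.real(np.conj(state) @ (h_o * state)))
```

In an actual QAOA run, this energy would be estimated from computational-basis samples via the local marginals rather than computed exactly.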

Approximation errors
We start by recalling that performing noise mitigation with the average noise matrix instead of the exact one subjects the estimation of each marginal to an error upper bounded by Eq. (12). It follows that the correction of multiple marginal distributions can lead to an accumulation of errors, which for each marginal α (we label by α the subset of qubits on which the local term H_α acts non-trivially) we denote by δ_α. Here the relevant set S_α consists of the clusters to which the qubits from α belong, and C^{S_α}_av is the average correction matrix for the marginal on that set. It is straightforward to show that the total possible deviation between the error-mitigated expected value ⟨H⟩_corr and the noiseless one ⟨H⟩ is upper bounded by the sum over α of the marginal errors δ_α weighted by the norms of the local terms H_α (Eq. (17)).

Additive statistical bound
Moving to the effect of measuring several marginals simultaneously, let us start by considering the simplest bound on the propagation of statistical deviations under our error-mitigation.
In Appendix A we derive that the TVD (Eq. (9)) between the estimator p^est_α and the actual local marginal p_α is upper bounded by Eq. (18), where n is the number of qubits in the support of each local term (for simplicity we assume it to be the same for all H_α), K is the total number of local terms, s is the number of samples, and 1 − P_err is the confidence with which the bound is stated. Importantly, the bound is satisfied for each marginal simultaneously, hence the logarithmic overhead log(K). Using Eq. (18) together with standard norm inequalities, one obtains a bound, Eq. (19), for the total energy estimation. Here ⟨H⟩^est_corr denotes the estimator of the total energy with error mitigation performed on each local term independently, and A_α is the exact (not approximate) correction matrix on the marginal α.
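For illustration, a Hoeffding-type bound with a union bound over the K marginals reproduces the qualitative behavior described here: exponential cost in the marginal size n but only logarithmic cost in K. The constants below are illustrative and need not match Eq. (18) exactly:

```python
import math

def samples_needed(n, K, eps, p_err):
    """Samples s guaranteeing TVD <= eps for all K n-qubit marginals
    simultaneously, from a Hoeffding-type concentration bound combined
    with a union bound over the K marginals (illustrative constants)."""
    return math.ceil((2 ** n * math.log(2) + math.log(K / p_err))
                     / (2 * eps ** 2))

s_small = samples_needed(n=2, K=10, eps=0.05, p_err=0.01)
s_many_terms = samples_needed(n=2, K=1000, eps=0.05, p_err=0.01)
# Growing K by 100x only adds a log(100) term: logarithmic overhead in K.
```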

Joint approximation and statistical bound
The two bounds provided above take into account the two considered sources of errors independently. By using the triangle inequality (see Appendix A.4), we can combine them to obtain a joint bound, Eq. (20). It follows that the dominant contribution to the overall error is linear in the number of terms K (caused by summing over all of them), with a logarithmic overhead in K added by the statistical errors.

Sampling complexity of energy estimation
While the additive bound from Eq. (19) could be tight in principle, we observed numerically on many occasions that in practice the statistical errors are much smaller (see Fig. 5 for exemplary results).
Here we provide arguments showing that natural estimators of the local energy terms H_α effectively behave as uncorrelated for a broad class of quantum states, leading to a significantly smaller total error than that obtained from the additive bound.
We start by describing in detail the natural strategy for energy estimation in the considered scenarios. In this work we are concerned with classical local Hamiltonians, meaning that all local terms H_α can be measured simultaneously via a single computational-basis measurement. The natural estimation procedure amounts to repeating s independent computational-basis measurements on the quantum state ρ of interest. The outcomes of these measurements are then used to obtain the local energy estimators E^est_α = (1/s) Σ_{i=1}^{s} E^i_α, where E^i_α are the values of the local energy terms obtained in the i-th experimental run. To estimate the expected value of the energy, ⟨H⟩, we simply sum the local estimators, ⟨H⟩^est = Σ_α E^est_α. It is clear that ⟨H⟩^est is an unbiased estimator of ⟨H⟩; likewise, the E^est_α are unbiased estimators of ⟨H_α⟩. We would like to understand the statistical properties of ⟨H⟩^est (specifically its variance) as a function of the number of experimental runs (samples) s and the number of local terms K in the Hamiltonian. To this end, we observe that the random variables E^i_α and E^j_β are independent unless i = j. Assuming that the measured energies E^i_α are distributed according to probabilities compatible with the Born rule, this allows us to write

Var(⟨H⟩^est) = (1/s) [ Σ_α Var(H_α) + Σ_{α≠β} Cov(H_α, H_β) ].   (24)

The variance of ⟨H⟩^est can be related to the sample complexity of the energy estimation. Let ΔE > 0 be some positive number. Then, using the Chebyshev inequality, we get Pr(|⟨H⟩^est − ⟨H⟩| ≥ ΔE) ≤ Var(⟨H⟩^est)/ΔE². Choosing some parameter P_f as an upper bound on the RHS of this inequality, we obtain from (24) that for a number of samples s ≥ [Σ_α Var(H_α) + Σ_{α≠β} Cov(H_α, H_β)] / (P_f ΔE²), the estimator ⟨H⟩^est will be within accuracy ΔE of the expectation value ⟨H⟩ with probability at least 1 − P_f. Now, if the different Hamiltonian terms H_α are correlated, then according to the above bound the sample complexity grows like K², where K is the total number of terms in H. Conversely, if Cov(H_α, H_β) ≈ 0 (for α ≠ β), we get a sample complexity scaling linearly with K.
The above considerations can be equivalently translated into estimates of the confidence intervals associated with the estimator ⟨H⟩^est for a fixed number of samples s. Specifically, if the local terms are uncorrelated, the confidence interval (statistical error) scales as the square root √K of the number of Hamiltonian terms, contrary to the pessimistic bound in Eq. (20), which is linear in K. We now show that one should expect this sub-linear scaling of energy errors to hold for a variety of quantum states, which in turn greatly reduces the sampling complexity compared to the pessimistic (linear) bound. We want to emphasize that these results are of great practical importance for near-term devices: our findings indicate a reduction (compared to the naive bounds) of the sample complexity of energy estimation by orders of magnitude, even for relatively small systems (K ≈ 100 and larger).
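The K-versus-K² distinction is easy to see in a toy Monte Carlo experiment with ±1-valued local terms (entirely synthetic; the two cases model perfectly uncorrelated versus perfectly correlated energy terms):

```python
import numpy as np

rng = np.random.default_rng(2)
K, s = 100, 4000          # number of local terms, number of samples

def energy_std(correlated):
    """Empirical std of the energy estimator sum_a E_a over s shots."""
    if correlated:
        # Fully correlated terms: one shared +-1 outcome per shot.
        shots = np.repeat(rng.choice([-1.0, 1.0], size=(s, 1)), K, axis=1)
    else:
        # Uncorrelated terms: independent +-1 outcomes per term and shot.
        shots = rng.choice([-1.0, 1.0], size=(s, K))
    return shots.sum(axis=1).std()

std_uncorr = energy_std(False)   # ~ sqrt(K) = 10
std_corr = energy_std(True)      # ~ K = 100
```

The uncorrelated case gives a statistical error of order √K per shot, the fully correlated case of order K, matching the variance analysis above.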

Generic 2-local Hamiltonians in QAOA
We start by considering states that appear at the beginning and at the end of QAOA. In recent work [54] it was shown that after the p-th layer of QAOA, for two given local terms H_α and H_β, there is no entanglement between the qubits from α and the qubits from β if they are further away from each other than 2p (on the graph corresponding to the interactions present in the Hamiltonian). Therefore, for generic QAOA optimization one can expect that for low p, i.e., at the beginning of the QAOA, the local Hamiltonian terms will be uncorrelated variables. On the other hand, it is well known that the ground states of classical local Hamiltonians can be chosen as product states. Therefore, provided that the QAOA worked well and converged to a state close to the ground state of the objective Hamiltonian, the same argument can be applied to the reversed QAOA circuit, and one expects the local terms to be uncorrelated as well.
The following Proposition provides a more quantitative description for generic 2-local Hamiltonians corresponding to random graphs.

Proposition 2.
Consider a Hamiltonian whose connectivity is given by an Erdős-Rényi random graph in which each edge is added independently at random with some fixed probability. Assume that the probability of adding an edge is chosen so that the average degree of a node equals q = K/N, i.e., the random graph has on average N nodes and K edges. For QAOA starting from a product state, if the number of layers satisfies the condition given in Appendix D.2 (parametrized by some w < 1), then with probability 1 − e^{−Na/2} the variance of the Hamiltonian is bounded as in Eq. (29), where the maximization in the first definition goes over all two-qubit local terms acting on subsets of qubits α and β. Since we can always choose w < 1 such that A < 1, the variance scales sub-quadratically for shallow-depth QAOA. Importantly, the parameter w can be chosen in such a way that A goes asymptotically to 0, which in turn means that in the large-N regime the variance Var(H) will scale almost linearly with the number of terms.
The proof of the above Proposition 2 uses insights from Ref. [54] and is delegated to Appendix D.2.

Random quantum states
A simple argument can be made to show that for generic Haar-random pure states, as well as for random states appearing in random local quantum circuits, the variance of a local Hamiltonian H behaves as if the different energy terms were independent. Let |ψ⟩ be a pure state on (C²)^⊗N. For a subset of qubits γ, let ρ_γ denote the reduced density matrix of |ψ⟩⟨ψ| on the qubits in γ, and let id_γ denote the normalized maximally mixed state on the qubits in γ. Assume now that the local Hamiltonian terms H_α, H_β have disjoint supports. It can then be shown (see Proposition 4 in the Appendices for the proof) that the covariance Cov(H_α, H_β) is controlled by the distance of the relevant marginals from the maximally mixed state, Eq. (30). Now, it is a well-known fact [55] that with overwhelming probability all few-body marginals ρ_γ of a Haar-random multiqubit state |ψ⟩ are exponentially close to maximally mixed states. Therefore, assuming that max_{α,β} ||H_α|| ||H_β|| does not scale with the system size, we have that, with high probability over the choice of |ψ⟩, the covariances of all disjoint terms are exponentially small. This reasoning mimics the computation done in Theorem 1 of [56], where it was used to establish that generic Haar-random pure states attain only the so-called standard quantum limit in the paradigmatic interferometric scenarios (again, the underlying argument was based on the fact that all few-body reduced density matrices of a generic pure state |ψ⟩ are very close to maximally mixed states with overwhelming probability). An analogous analysis can be carried out for typical states generated by local random quantum circuits. Such circuits are known to form approximate t-designs, i.e., to capture those properties of typical Haar-random unitaries that are determined by low-degree moments [57]. Specifically, a recent paper [58] considered the evolution of local entropies for pure states |ψ⟩ generated by shallow local random quantum circuits.
From Theorem 1 of that work it directly follows that, with probability greater than 1 − δ over the choice of random states |ψ⟩ generated by random local circuits in the brick-wall architecture of depth r, all marginals ρ_γ of size |γ| ≤ k are close to maximally mixed (with an accuracy depending on r, k, and δ). Clearly, if the size k of the marginals is fixed, setting r = c log(N/ε), for a suitable constant c, allows us to conclude that for all γ such that |γ| ≤ k one has ||ρ_γ − id_γ||_1 ≤ ε with probability approaching 1 as the system size increases.
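The closeness of few-body marginals of Haar-random states to the maximally mixed state is simple to check numerically (a normalized complex Gaussian vector is a standard sampler of Haar-random pure states; the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 2                              # total qubits, marginal size

# Haar-random pure state: normalized complex Gaussian vector.
psi = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
psi /= np.linalg.norm(psi)

# Reduced density matrix on the first k qubits.
m = psi.reshape(2 ** k, -1)
rho = m @ m.conj().T

# Trace distance (1/2)||rho - id_gamma||_1 to the maximally mixed state.
eigs = np.linalg.eigvalsh(rho - np.eye(2 ** k) / 2 ** k)
trace_dist = 0.5 * np.abs(eigs).sum()
```

Already at 10 qubits the 2-qubit marginal is close to id_γ, and the distance shrinks exponentially as the total number of qubits grows.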

Effects of measurement noise
To conclude, let us provide some analysis of the effects of measurement noise on the above considerations. First, let us note that it is straightforward to generalize all of the above arguments to include uncorrelated readout noise. Intuitively, if the measurement noise is not correlated, it cannot increase the level of correlations in the energy estimators, and the same holds for noise mitigation under such a model (see Appendix D.3 for derivations).
For the correlated noise model it is hard to provide analytical results; however, one can expect that if the correlations in the measurement noise are mild, they should not drastically increase the sample complexity. To test this hypothesis for small system sizes, we performed numerical simulations in the following way. Consider the confidence intervals of the energy estimation. The most pessimistic bound on the error, as already explained, is given by Eq. (20) and is additive in the number of Hamiltonian terms K (to recover the situation without measurement noise, it suffices to set δ_α = 0 and ||C^{S_α}_av||_{1→1} = 1, and Eq. (20) still holds). We compared these pessimistic confidence intervals with those expected for uncorrelated variables (which scale as √K) and with the numerically observed errors (see Fig. 5). It is clear that for the tested Hamiltonians the confidence intervals of the estimation behave roughly as for uncorrelated variables.
7 Effects of measurement noise on QAOA - numerical study

In this section we apply our noise characterization and mitigation strategies to numerically study the effects of correlated measurement errors on QAOA and how they can be reduced with our techniques. As test Hamiltonians, we choose those encoding random MAX-2-SAT instances with clause density 4, Hamiltonians corresponding to fully connected graphs with random interactions and local fields with magnitudes from [−1, 1], and the Sherrington-Kirkpatrick (SK) model in 2D (i.e., random Gaussian ZZ-interactions on a square lattice). For all models, we classically simulate the QAOA algorithm on an 8-qubit device with the number of layers ranging from p = 3 to 30. The parameter optimization is performed using Simultaneous Perturbation Stochastic Approximation (SPSA) [39, [59][60][61]] (see Appendix E.1 for details of the optimization). As a correlated noise model, we adopt one inspired by our previous characterization of the IBM 15-qubit device.
We first simulated the results obtained by a QAOA algorithm in which both the energy estimation and the gradient estimation used to guide the state evolution were affected by readout noise. For different numbers of gate layers, we compared the resulting energy estimators between an optimization guided by noisy estimators ("noisy-guided" in Fig. 6) and an optimization guided by estimators on which error mitigation was performed ("corrected-guided" in Fig. 6). The results of our numerical studies are presented in the first column of Fig. 6, together with the noiseless case as a reference. Note that to make the differences more visible, we set an offset on the vertical axis. It is clear that for the considered Hamiltonians the noise-mitigated estimators are much better than the noisy ones. This suggests that our noise-mitigation scheme can be used to obtain an overall more reliable QAOA algorithm.
For our second analysis, we wanted to examine more closely the effects of noise (and its mitigation) on the parameter optimization alone. To do so, we again compare the results of QAOA between the cases where the optimization is guided by noiseless, noisy ("noisy-guided" in Fig. 6), and noise-mitigated ("corrected-guided" in Fig. 6) energy estimators. However, in order to isolate the effect of noise on the optimization procedure, instead of sampling the energy from the noisy probability distributions, we calculated its expectation value directly on the quantum state obtained at each layer of the QAOA circuit (the same circuits as in the left column). As a comparison between the optimal parameters found by QAOA, we took the distance between the resulting average energies on the state, divided by the number of qubits. The plots in the second column of Fig. 6 show our numerical results. For the considered models, the noise-mitigated optimization leads to better parameter regions. We note that the difference is not large, yet it is systematic. Since this is the case already for 8 qubits, one can expect that for bigger systems the relative improvement will be higher.

Figure 6: Numerical study of the effects of readout noise on energy estimation for QAOA on an 8-qubit system. The algorithm is used to prepare the ground state of a,d) Hamiltonians encoding random MAX-2-SAT problems, b,e) a Hamiltonian corresponding to a fully connected graph with uniformly random (from [−1, 1]) interactions and local fields, and c,f) the 2D Sherrington-Kirkpatrick model. In each plot, the horizontal axis shows the number of layers in the QAOA optimization, while the vertical axis shows the absolute difference between the obtained and theoretical energies per qubit. Each data point is the average over 96 Hamiltonians. Note that we introduced an offset on the y-axis in order to make the differences visible. Red data points indicate optimization guided by noisy function evaluations, green points indicate noise-mitigated evaluations, and blue lines correspond to noiseless optimization, given for reference. The estimators were obtained from ≈ 10^4 samples. Due to differences in the spectra of the various Hamiltonians over which the mean is calculated, the fluctuations around the presented means are very high; therefore, for clarity, we decided not to include error bars.
To conclude, let us stress that the purpose of this section was to illustrate the possible effects of mildly correlated, realistic readout noise on the QAOA algorithm. The above results clearly indicate that correlated readout noise can influence the optimization, as opposed to identical and uncorrelated noise (see Ref. [35]). We find that in a number of instances the error-mitigation strategy helps to land in better parameter regions; however, we also note that the effects of noise on the optimization are not dramatic in the studied cases. From the point of view of near-term applications, this should be viewed as a positive result: the noise does not seem to strongly affect the optimization, yet even those mild effects can be reduced by performing our error mitigation. Furthermore, we note that, due to the stochastic nature of the SPSA optimizer, the results might differ from run to run (each data point presented in the plots comes from the better of two independently performed optimization runs; see Appendix E.1 for details). Finally, we note that, to obtain accurate estimates of the energy, noise mitigation unsurprisingly remains highly beneficial.

Conclusions
In the first part of this work, we proposed an efficiently describable model of correlated measurement noise in quantum detectors. The basic idea of the model is to group qubits into strongly correlated clusters that are mildly affected by their neighborhoods, which, provided that the size of those groups is bounded by a constant, allows one to describe the global noise model by a much smaller number of parameters than in the most generic situation. To characterize our noise model, we introduced Diagonal Detector Overlapping Tomography, a procedure inspired by the recently introduced Quantum Overlapping Tomography [33] and tailored to the efficient characterization of the proposed measurement-noise model. Similarly to Ref. [33], the proposed method can estimate k-qubit correlations in the measurement noise affecting an N-qubit device using O(k 2^k log(N)) quantum circuits. We have shown that measurement noise can be efficiently mitigated in problems that require the estimation of multiple marginal probability distributions, an example of which is the Quantum Approximate Optimization Algorithm [24]. Importantly, from the fact that the noise mitigation is performed on marginal distributions it follows that the sampling complexity of the noise mitigation is similar to that of the original problem, provided that the cross-talk in the readout noise is of bounded locality. We proposed a benchmark of the noise model and error reduction, which we implemented in experiments on up to 15 qubits on IBM's Melbourne device and on 23 qubits on Rigetti's Aspen-8 device, and observed significant improvements compared to the simple, uncorrelated noise model. Interestingly, additional experimental data pointed to previously unreported memory effects in IBM's device, which were demonstrated to non-negligibly change the results of the experiments.
In the second part of the work, we provided an analysis of the statistical errors one may expect when performing the simultaneous estimation of multiple local terms of a Hamiltonian on various classes of states. We provided simple arguments why low sampling complexity (i.e., scaling of the estimation error as the square root of the number of local Hamiltonian terms) should be expected for Haar-random quantum states and for states generated by shallow random quantum circuits. Similarly, we gave arguments, based on [54], why for states appearing at the beginning and at the end of the QAOA one may expect that the estimated local terms will effectively behave as uncorrelated variables, reducing the sampling complexity. Furthermore, we provided analytical results for Hamiltonians encoding random MAX-2-SAT instances.
In the last part of the work, we presented numerical results and an extended discussion of the effects that correlated measurement noise can have on the performance of QAOA. We demonstrated that already for 8 qubits the correlated measurement noise can alter the energy landscape in such a way that the quantum-classical optimization leads to sub-optimal energy regions (compared to reference runs without noise). At the same time, we demonstrated that our noise-mitigation procedure can reduce those effects, improving the optimization.

Future research
We believe that our findings will prove useful in both near- and long-term applications as efficient methods of characterizing and reducing measurement noise. At the same time, a number of future research directions open up. It is natural to ask how well the error mitigation will perform in actual multi-qubit experiments of QAOA, which can be tested only with more access to quantum devices than is publicly available via cloud services. Similarly, testing noise mitigation in more general Variational Quantum Eigensolver (VQE) scenarios is of great interest [62,63]. The analysis of statistical errors in the VQE setup is a natural extension; we note that since the local terms in VQE Hamiltonians do not commute, it is less straightforward than in the QAOA scenario. Another interesting problem is to design more benchmarks for the noise model and mitigation beyond the one method described in this work. A natural extension of the techniques presented here would be to develop further methods for inference of the correlation structure and construction of the noise model from DDOT data. Having a noise model for marginal probability distributions, it is desirable to test and compare techniques of noise mitigation that go beyond the standard noise-matrix inversion analyzed in this work. In particular, methods based on Bayesian inference seem promising [18]. Moreover, it is tempting to ask whether error-mitigation performance and the demonstrated memory effects are stable over time, a question which is often omitted (see, however, the recent work [64], where systematic methods for studying time instabilities were developed, and [65], which made an important contribution by studying the stability of various types of noise over time for IBM's devices).
Another open problem is to find out whether further generalizations of Quantum Overlapping Tomography can be used to perform reliable characterization of more general types of noise, such as coherent measurement noise and the noise affecting quantum gates. For example, in the work [66] the authors consider an ansatz for a generic noisy quantum process which uses only 2-local noise processes, a setup which seems well suited to benefit from similar techniques. In Refs. [67,68] the authors develop methods of estimating generic Pauli channels, and it would be of great interest to assess whether those methods can benefit from measurement error-mitigation techniques such as ours. A very important research problem is that of mitigating measurement noise in scenarios which involve estimators obtained from very few samples, such as those in Refs. [69,70]. We note that in its current form our method cannot be directly applied in such scenarios because it operates on marginal probability distributions. Finally, it would be extremely interesting to investigate whether our model of measurement noise is accurate for quantum devices based on architectures different from transmon qubits, such as flux qubits [71], trapped ions [72] or photonic quantum devices [73]. We intend to investigate some of the listed problems in future works.

Data availability
The Python code for creating DDOT perfect collections and for the characterization of readout noise (in qiskit [74] and pyquil) will be available in our open-source Quantum Readout-Error Mitigation (QREM) GitHub repository [75]. We intend to further develop the package to contain practically relevant techniques of noise characterization and mitigation, including modifications and refinements of the algorithms presented in this work. The results of the experiments used in this work will be available as exemplary data-sets in the QREM repository. Any requests for more experimental data or code used in the creation of this work are welcome.

Acknowledgements
We are grateful to Tomek Rybotycki for providing help with running experiments on Rigetti's device. We acknowledge the use of the IBM Quantum Experience and Amazon AWS Braket for this work. The views expressed are those of the authors and do not reflect the official policy or position of IBM, Rigetti, or Amazon.

Appendices
We collect here technical results that are used in the main part of the paper. Some of the results stated here can be of independent interest for further works on quantum error mitigation. In Sec. A we discuss the details of error mitigation on marginals for correlated readout noise models, while Sec. B provides details of our noise characterization procedure based on the Diagonal Detector Overlapping Tomography technique. In Sec. C we give a short overview of the whole noise characterization scheme in a step-by-step manner. Results concerning the sample complexity of energy estimation are discussed in Sec. D, and finally in Sec. E we provide some additional experimental data and details of the numerical simulations.

A Correlated readout noise model and its usage for error-mitigation on marginals
In this section, we provide details on how correlated measurement noise affects marginal probability distributions. We start by providing in Appendix A.1 an explicit relation between ideal and noisy marginals when the global probability distribution is affected by readout noise described by our model. Then, in Appendix A.2, we discuss the error-mitigation on the level of marginals, including the proof of Proposition 1 from the main text. We finish by analyzing errors in the simultaneous estimation of multiple marginal distributions (Appendix A.3) and of multiple expected values of local terms (Appendix A.4).

A.1 Noise model for marginal distributions
In this subsection, we show how to translate a noise model for the full probability distribution given in Eq. (5) into a simple noise model for marginal distributions. Our results can be summarised in the following Proposition 3.

Proposition 3. Let p(Y_1, ..., Y_N) and p̃(X_1, ..., X_N) be the ideal distribution and the one obtained on a noisy detector, respectively. Denote by p(Y_S) and p̃(X_S) their corresponding marginal distributions for qubits in some subset S. If the full distributions are connected by a stochastic matrix of the form Eq. (5) by means of

$$\tilde{p}(X_1,\dots,X_N)=\sum_{Y_1,\dots,Y_N}\Lambda_{X_1\dots X_N|Y_1\dots Y_N}\; p(Y_1,\dots,Y_N),$$

then there exists a left-stochastic matrix Λ^(S) that connects the noisy marginal to the ideal one, namely

$$\tilde{p}(X_S)=\sum_{Y_S}\Lambda^{(S)}_{X_S|Y_S}\; p(Y_S). \qquad (34)$$

Moreover, the form of the marginal noise matrix can be explicitly derived from the decomposition Eq. (5).
We will now proceed to prove the above Proposition by deriving a concrete expression for Λ^(S). Starting from the global relation and summing over the outcomes of all qubits outside S, one exploits the fact that the noise matrices Λ^{Y_{N_χ}} are left-stochastic (for any fixed Y_{N_χ}), so that all factors acting on qubits outside the clusters C(S) and their neighborhoods sum to one. What remains involves only p_{Y_{N(S)∪C(S)}}, the marginal of the ideal distribution on the qubits belonging to the clusters C(S) and to their neighborhoods. Here N(S) denotes the set of qubits from the neighborhoods of the clusters C(S), excluding the qubits from S, i.e., N(S) = ∪_{χ∈L(S)} (N_χ \ S); this exclusion is necessary since qubits which are neighbors of some clusters in C(S) might themselves belong to other clusters in C(S). Decomposing this marginal by means of the chain rule for probability, p_{Y_{N(S)∪C(S)}} = p(Y_{{N(S)∪C(S)}\{S}} | Y_S) p(Y_S), and substituting into Eq. (35) gives Eq. (37), which coincides exactly with Eq. (34) if we identify the marginal noise matrix elements as in Eq. (38). This shows, unsurprisingly, that the noise matrix acting on the marginal is state-dependent (via the conditional marginal distribution p(Y_{{N(S)∪C(S)}\{S}} | Y_S)). Note that Eq. (38) does not reduce to a simple matrix product of noise matrices for the neighborhoods N(S). In other words, the generic marginal noise matrix is not a convex combination of the matrices {Λ^{Y_{N(S)}}}, even if the neighborhoods do not overlap.
It remains to show that the matrix defined in Eq. (38) is left-stochastic. To this aim, let us sum its elements over the strings X_S: using the fact that the cluster noise matrices are left-stochastic and that the conditional distribution p(Y_{{N(S)∪C(S)}\{S}} | Y_S) is normalised, the sum equals 1 for every Y_S, which concludes the proof.
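The structure of Proposition 3 can be checked numerically. The sketch below (plain numpy; all noise matrices are hypothetical numbers, not taken from any device, and `marginal_noise_matrix` is our own helper name) builds the column-wise mixture of Eq. (38) for a single qubit with one neighbor and verifies that the resulting marginal noise matrix is left-stochastic for arbitrary conditional distributions of the neighbor.

```python
import numpy as np

# Hypothetical single-qubit noise matrices for a qubit in S, conditioned on
# its neighbor being in state 0 or 1 (columns: ideal outcome, rows: noisy
# outcome; each column sums to 1, i.e. the matrices are left-stochastic).
Lambda_n0 = np.array([[0.98, 0.05],
                      [0.02, 0.95]])
Lambda_n1 = np.array([[0.93, 0.08],
                      [0.07, 0.92]])

def marginal_noise_matrix(p_neighbor_given_ys):
    """Column-wise mixture as in Eq. (38): column Y_S is averaged over the
    neighbor's state with weights p(Y_neighbor | Y_S)."""
    cols = []
    for ys in (0, 1):
        w0, w1 = p_neighbor_given_ys[ys]
        cols.append(w0 * Lambda_n0[:, ys] + w1 * Lambda_n1[:, ys])
    return np.column_stack(cols)

# Left-stochasticity holds for any conditional distribution of the neighbor.
rng = np.random.default_rng(0)
for _ in range(100):
    w = rng.random(2)
    cond = {0: (w[0], 1.0 - w[0]), 1: (w[1], 1.0 - w[1])}
    assert np.allclose(marginal_noise_matrix(cond).sum(axis=0), 1.0)
```

Note that with Y_S-dependent weights the resulting matrix is in general not a convex combination of the two conditional matrices, in line with the remark below Eq. (38).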

A.2 Noise mitigation for marginal distributions -proof of Proposition 1
Here we analyze in detail the noise-mitigation strategy outlined in the main text, focusing on its scalability and its effectiveness in recovering the ideal marginal distribution. Again, let us assume that we are interested in the marginal on qubits from some subset S. Let us start by outlining the reasoning behind our choice of the mitigation strategy. Recall that, as proved in the previous section, the noisy marginals can always be related to the ideal ones by means of a marginal stochastic matrix, as in Eq. (34). Hence, taking the inverse of the marginal noise matrix would be the natural strategy for noise mitigation. However, recall that, as shown in Eq. (38), the form of the marginal noise matrix depends on the ideal distribution itself, which is generally unknown. Therefore one needs an ansatz for that distribution which will hopefully work in the generic case of an arbitrary conditional distribution. As indicated in the main text, the natural choice for such an ansatz is the uniform distribution (inserted in place of p(Y_{{N(S)∪C(S)}\{S}} | Y_S) in Eq. (38)). Now, let us make the following important observation. We are interested in mitigating the noise on an |S|-qubit marginal distribution of qubits generally belonging to some set of distinct clusters C(S). Note that, by definition, a cluster is a set of qubits with correlations in errors so strong that one cannot consider the measurement outcomes on them separately. We argue that, if one is interested in correcting the marginal distribution on only parts of the clusters, say p̃(X_S), it is still a better idea to first correct the marginal distribution on all the variables in the clusters, namely p̃(X_{C(S)}) (Eq. (40)). Notice that the corrected cluster marginal distribution can always be post-processed to obtain the marginal distribution of interest (e.g., a two-body marginal).
By proceeding in a similar manner as in the previous section, one can obtain the expression of Eq. (41) for the cluster noise matrix, where L(S) is the set of labels of the clusters C(S) (note that those are also the labels of the neighborhoods of those clusters). Recall that N(S) was defined as the set of qubits from the neighborhoods of the clusters C(S), excluding the qubits from S, i.e., N(S) = ∪_{χ∈L(S)} (N_χ \ S); hence in the above definition we do not average over any qubits in the clusters. As mentioned before, since the exact form of such a matrix requires access to the unknown conditional distribution p(Y_{N(S)} | Y_{C(S)}), we propose instead to invert the average cluster matrix of Eq. (42) and use it to perform the error mitigation. In our numerical analysis, replacing the cluster noise matrix with its average version yields a much better correction than making a similar replacement at the level of the |S|-body marginals (however, we tested this only for the case of 2-qubit marginal distributions). This provides a clear indication in favor of performing the error mitigation at the level of the clusters instead of the single variables. Moreover, notice that the computational cost of computing the noise matrices in Eqs. (41)-(42) grows only with the size of the involved clusters and neighborhoods. Hence, as long as that size is reasonably small (in particular, if it does not scale with the system size), computing the matrix in Eq. (42) is efficient. Due to the above, the following discussion will concern mitigation on the level of marginals over whole clusters C(S). For clarity, from now on we slightly change notation to indicate that we are interested in error-mitigation on a subset of qubits S which consists of full clusters, i.e., prior to performing error-mitigation we do not marginalize over qubits belonging to the same cluster (while later one can of course marginalize the corrected distributions).
Having obtained Λ^(S)_av, one can use its inverse as a correction matrix which reduces the noise on the marginal distribution, by defining the corrected distribution as

$$p^{\mathrm{corr}}_S=\left(\Lambda^{(S)}_{\mathrm{av}}\right)^{-1}\tilde{p}_S,$$

where we used vector notation for clarity: p̃_S is the noisy distribution on the qubits from S, and the vector without the tilde symbol, p_S, denotes the corresponding ideal distribution, i.e., we have p̃_S = Λ^(S) p_S. Error reduction via the average matrix is perfect only in the case of infinite statistics and under the assumption that the actual conditional distribution in Eq. (41) is uniform (because then Λ^(S) = Λ^(S)_av).1 Since this scenario is not realistic, in practice one can hope only for a partial noise mitigation, not its complete reversal. The errors related to statistical noise when correcting a probability distribution were thoroughly analyzed in [13] (we also repeat a similar analysis in the context of expected values of Hamiltonians in the next sections). Here we analyze the errors which arise from using the inverse of the average cluster matrix Λ^(S)_av (as in Eq. (11)). 1 Clearly, the implicit assumption is that our model of noise is exact.
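To illustrate the mitigation step, the following sketch (numpy only; the 2x2 noise matrices and the 60/40 neighbor distribution are hypothetical numbers) corrects a noisy single-qubit marginal with the inverse of the average noise matrix. The uniform-ansatz average differs from the true state-dependent matrix, so the correction is only partial, but it still reduces the TVD to the ideal distribution.

```python
import numpy as np

# Hypothetical cluster noise matrices conditioned on the neighbor state,
# and their uniform-ansatz average (Eq. (42)).
Lambda_n0 = np.array([[0.98, 0.05], [0.02, 0.95]])
Lambda_n1 = np.array([[0.93, 0.08], [0.07, 0.92]])
Lambda_av = 0.5 * (Lambda_n0 + Lambda_n1)

# Noisy marginal generated with an unknown neighbor distribution (here 60/40,
# taken independent of Y_S for simplicity, so the true matrix is a plain
# convex combination).
p_ideal = np.array([0.7, 0.3])
Lambda_true = 0.6 * Lambda_n0 + 0.4 * Lambda_n1
p_noisy = Lambda_true @ p_ideal

# Mitigation: act with the inverse of the average matrix on the noisy marginal.
p_corrected = np.linalg.inv(Lambda_av) @ p_noisy

tvd = lambda p, q: 0.5 * np.abs(p - q).sum()
# Partial mitigation: closer to the ideal marginal, though not exactly equal.
assert tvd(p_corrected, p_ideal) < tvd(p_noisy, p_ideal)
```

Since the columns of Λ_av sum to one, multiplying by its inverse preserves normalization of the corrected vector (though it may in general produce small negative entries, which in practice calls for a projection back onto the probability simplex).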
To quantify the errors, we use the Total-Variation Distance (see Eq. (89)) to measure the distance between probability distributions. For later purposes, let us also recall that the operator norm induced by the vector L1 norm is

$$\|A\|_{1\to 1}=\max_{j}\sum_{i}|A_{ij}|,$$

i.e., the maximal absolute column sum, where A is any linear operator acting on the given vector space. We now proceed to prove Proposition 1, which states the maximum error incurred by using the average correction matrix, Eq. (46). As a starting point for the proof, let us decompose Λ^(S) = Λ^(S)_av + (Λ^(S) − Λ^(S)_av). The TVD between the corrected and the ideal distribution is then bounded in two steps: the first inequality comes from maximizing over all possible probability distributions on the set S of variables, and the second follows from Eq. (45) and the sub-multiplicativity of the induced L1 norm. Next, the second term is bounded by removing the dependence on the unknown conditional distribution: the relevant quantity is maximized over the polytope whose extremal points are the matrices Λ^{Y_{N(S)}}. The maximized function is convex, due to the triangle inequality and the absolute homogeneity of the L1 norm; from this fact it follows that the maximum in Eq. (48) is attained at one of the extremal points of the polytope, i.e., at a particular Λ^{Y_{N(S)}}. Combining this last bound with Eq. (47), one obtains exactly Eq. (46), which proves Proposition 1.
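The two norm facts used in the proof can be verified directly. The sketch below (numpy, random matrices; `induced_l1_norm` is our own helper name) checks that the operator norm induced by the vector L1 norm equals the maximum absolute column sum, and that it is sub-multiplicative.

```python
import numpy as np

def induced_l1_norm(A):
    """Operator norm induced by the vector L1 norm:
    ||A||_{1->1} = max_j sum_i |A_ij| (maximum absolute column sum)."""
    return np.abs(A).sum(axis=0).max()

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

# The maximum of ||A x||_1 over the L1 unit ball is attained at a standard
# basis vector (an extreme point of the ball), i.e. at a single column.
direct = max(np.abs(A @ e).sum() for e in np.eye(4))
assert np.isclose(induced_l1_norm(A), direct)

# Sub-multiplicativity, used in the proof of Proposition 1.
assert induced_l1_norm(A @ B) <= induced_l1_norm(A) * induced_l1_norm(B) + 1e-12
```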

A.3 Statistical error bounds
The idea behind the energy estimation routine in variational algorithms such as QAOA is to estimate a Hamiltonian with K terms by estimating those terms separately and then adding them up. However, if one provides statistical error bounds for each of those terms, one needs to take into account the probabilistic nature of such bounds. Suppose we want to estimate some distribution p with n outcomes by sampling from it s times. It is well known [76] that the probability of the estimated distribution being ε-far from the true distribution p in terms of the TVD (Eq. (9)) vanishes exponentially in the number of samples:

$$\Pr\left[\mathrm{TVD}\left(p,p^{\mathrm{est}}\right)\geq\epsilon\right]\leq\left(2^{n}-2\right)\exp\left(-2s\epsilon^{2}\right). \qquad (49)$$
It is often convenient to fix the probability and consider confidence intervals, i.e., the upper bound for TVD(p, p^est). We can then rewrite the above equation as a function ε*(n, s, P_1,err, 1) of three fixed parameters: the number of outcomes n, the number of samples s, and the probability P_1,err of the upper bound being incorrect. Basic manipulations of Eq. (49) give

$$\mathrm{TVD}\left(p,p^{\mathrm{est}}\right)\leq\epsilon^{*}\left(n,s,P_{1,\mathrm{err}},1\right)=\sqrt{\frac{\log\left(2^{n}-2\right)-\log\left(P_{1,\mathrm{err}}\right)}{2s}}. \qquad (50)$$

However, since in the estimation of a K-term Hamiltonian one combines upper bounds of the form (50) for the particular terms into an upper bound ε*(n, s, P_1,err, K) for the whole Hamiltonian, one needs to make sure that all the upper bounds for the particular terms hold at the same time with the desired probability. The union bound states that the probability of at least one event from some set occurring is no greater than the sum of the probabilities of the particular events. In our case, the relevant set consists of events of the type "one of the bounds of type (50) is not satisfied". Hence the probability of at least one bound being violated is upper bounded by the sum of the individual error probabilities. Therefore, if we wish to ensure that the probability of at least one bound failing is fixed and equal to P_1,err for a fixed number of samples s, we need to effectively enlarge the upper bound to

$$\epsilon^{*}\left(n,s,P_{1,\mathrm{err}},K\right)=\sqrt{\frac{\log\left(2^{n}-2\right)-\log\left(P_{1,\mathrm{err}}\right)+\log K}{2s}}, \qquad (52)$$

with the additional term log K under the square root. In other words, we effectively lower the allowed probability of error in the estimation of each particular marginal distribution by a factor of K, which ensures that the simultaneous estimation of K marginal distributions has the precision of Eq. (52) with failure probability not larger than the initial P_1,err (by virtue of the union bound). It is instructive to look at this from the perspective of sampling complexity. Suppose we would like to increase the number of samples to some s̃ in such a way that the upper bound remains fixed, i.e., ε*(n, s̃, P_1,err, K) = ε*(n, s, P_1,err, 1).
Simple calculations show that in this case we would need to (multiplicatively) increase the number of samples s by the overhead

$$\frac{\tilde{s}}{s}=1+\frac{\log K}{C},$$

where the parameter C is equal to

$$C=\log\left(2^{n}-2\right)-\log\left(P_{1,\mathrm{err}}\right).$$

Hence we see that, for a fixed dimension (fixed number of outcomes), the simultaneous estimation of K Hamiltonian terms leads to a sampling overhead logarithmic in K.
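A minimal numeric sketch of the confidence bound and of the resulting sampling overhead (our reading of Eqs. (50) and (52); `eps_star` and `sampling_overhead` are our own helper names, and the shot counts are illustrative):

```python
import numpy as np

def eps_star(n, s, p_err, K=1):
    """TVD confidence bound of Eq. (52) (Eq. (50) for K = 1):
    n outcomes, s samples, failure probability p_err, K simultaneous marginals."""
    return np.sqrt((np.log(2**n - 2) - np.log(p_err) + np.log(K)) / (2 * s))

def sampling_overhead(n, p_err, K):
    """Multiplicative increase of s that keeps the bound fixed: 1 + log(K)/C."""
    C = np.log(2**n - 2) - np.log(p_err)
    return 1 + np.log(K) / C

# Example: 2-qubit marginals (n = 4 outcomes), 10^4 shots, 5% failure
# probability, K = 100 simultaneously estimated Hamiltonian terms.
e1 = eps_star(4, 10_000, 0.05)
eK = eps_star(4, 10_000, 0.05, K=100)
assert eK > e1  # simultaneous estimation loosens the bound...

s_new = 10_000 * sampling_overhead(4, 0.05, 100)
# ...but a logarithmic-in-K sampling overhead restores the single-term bound.
assert np.isclose(eps_star(4, s_new, 0.05, K=100), e1)
```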
Note that if one fixes the initial sample size s, the probability of error P_1,err and the number K of simultaneously estimated marginals, the sampling overhead is, perhaps counterintuitively, decaying in the number of outcomes: C grows roughly linearly with the number of outcomes, hence exponentially with the number of qubits. Now we want to bound the error in the energy estimated after acting with the correction matrix Λ^{-1}_α on the estimated distribution above. In analogy to Eq. (55), we can write a chain of bounds in which the first inequality follows from the definition of the operator norm, the second from the definition of any induced operator norm, and the last from the analysis in the previous section, with ε* being the bound given by Eq. (52). In analogy to Eq. (56), by applying the triangle inequality multiple times we obtain Eq. (19).

A.4.3 Approximation and statistical errors
Now we combine the two previous bounds using the triangle inequality. Let us denote by p^noisy,est the estimator of the noisy probability distribution p^noisy on the marginal α (hence the RHS of Eq. (57)), and by C^α_av the average correction matrix used to correct that marginal. We want to bound the distance between the corrected estimator of the noisy distribution, C^α_av p^noisy,est, and the ideal distribution p^ideal. For a particular marginal, the triangle inequality for the Total-Variation Distance gives

$$\mathrm{TVD}\left(C^{\alpha}_{\mathrm{av}}\,p^{\mathrm{noisy,est}},\,p^{\mathrm{ideal}}\right)\leq\mathrm{TVD}\left(C^{\alpha}_{\mathrm{av}}\,p^{\mathrm{noisy,est}},\,C^{\alpha}_{\mathrm{av}}\,p^{\mathrm{noisy}}\right)+\mathrm{TVD}\left(C^{\alpha}_{\mathrm{av}}\,p^{\mathrm{noisy}},\,p^{\mathrm{ideal}}\right),$$

where the first term is controlled by the statistical bound ε* given by Eq. (52), and the second by the approximation error analyzed above. To finish this section, let us note that the reasoning presented here is analogous to the one given in Ref. [13], where an analysis of the effects of statistical noise and non-classical noise on error-mitigation performed on global distributions was presented.

B Details of Diagonal Detector Overlapping Tomography
In this section, we give more details regarding our noise characterization procedure using DDOT. We start by providing an efficient way of constructing DDOT circuits in Appendix B.1. In Appendix B.2 we show how to use the results of DDOT to infer the correlation structure of the readout noise in a device. Then we discuss the construction of Diagonal Detector Overlapping Tomography circuits which are balanced (Appendix B.3) and perfect (Appendix B.4), together with proofs for the scaling of the required number of random circuits. In Appendix B.5, we discuss heuristic procedures for making DDOT circuits more balanced. Finally, in Appendix B.6 we explain the effect of overestimating correlations, which can happen if the implemented DDOT collection is not balanced.

B.1 Construction of Diagonal Detector Overlapping Tomography circuits
Here we will present in detail our building block for efficient characterization of correlations in a measurement device: Diagonal Detector Overlapping Tomography (DDOT). In analogy to Quantum Overlapping Tomography (QOT) [33], a collection of (N, k) DDOT circuits allows one to reconstruct the noise matrices for the readout process of any group of k qubits in an N -qubit device.
More precisely, we define an (N, k) perfect collection of DDOT circuits as a collection of N-qubit quantum circuits constructed only from identity and X gates, with the property that for every subset of qubits of size k, each of the computational-basis states on that subset is prepared at least once. For example, we will call a collection of circuits (N, 2) perfect if for all pairs of qubits, each of the states from {|00⟩, |01⟩, |10⟩, |11⟩} is prepared at least once in the whole collection. One way to construct a DDOT collection is to follow Ref. [33] and make use of the notion of hash functions (here by a hash function we mean any function [N] → [k] with k < N). To construct an (N, k) perfect collection of DDOT circuits using hash functions one can use Algorithm 1, which encapsulates the idea of Quantum Overlapping Tomography from Ref.
[33] translated to the construction of DDOT circuits. Specifically, this method corresponds to QOT with two "bases", namely preparation of the state |0⟩ or |1⟩. Each hash function assigns each qubit a label from [k]. For a given function, qubits are assigned to k disjoint batches based on the function value. For a fixed assignment of batches, the qubits belonging to a batch are initialized in the same state (|0⟩ or |1⟩), independently from the qubits in the other batches. In this way, 2^k circuits are specified, which independently implement all computational-basis states on all the batches. For example, in the case of k = 2 and N = 6 qubits, a specific hash function could result in initializing the following states, {000000, 000111, 111000, 111111}, which implement all two-qubit computational-basis states on the pairs of qubits taken from the left and right parts of the register (in this example the two batches are {Q0, Q1, Q2} and {Q3, Q4, Q5}). The DDOT circuit collection constructed in this way is perfect if the underlying collection of hash functions is perfect. This means that for each k-qubit subset of an N-qubit device, there exists at least one hash function in the collection which assigns each qubit from that subset a distinct number from [k]. Note that if a given function is indeed injective on some k-qubit subset, the qubits from that subset belong to distinct batches, and therefore all computational-basis states will be implemented on them. Hence, if this holds for all subsets, the collection is perfect. In Algorithm 1 we generate random hash functions to create the DDOT collection, so there is no deterministic guarantee that a collection constructed in this way will indeed be perfect.
However, it follows directly from the arguments in [33] (specifically, Section III) that if we use Algorithm 1 to generate the DDOT collection and we wish the collection to be perfect with probability at least 1 − δ, the needed number of circuits κ is at least of the order given in Eq. (60); see Appendix B.4 for detailed bounds.
On the other hand, one can also consider a construction which uses random circuits, without referring to the notion of hash functions. The generation of random bitstrings, used directly as definitions of quantum circuits, constructs the DDOT collection in Algorithm 2. Similarly to Algorithm 1, since we use randomness to generate the collection, there is no deterministic guarantee that the resulting collection will be perfect. However, in Appendix B.4 we show that if we wish the collection to be perfect with probability at least 1 − δ, then the needed number of circuits κ is at least of the order given in Eq. (61), which looks similar to Eq. (60) but in practice exhibits better scaling due to specific constant factors (see Appendix B.4 for more detailed bounds). Hence we expect that for larger system sizes the random-circuits algorithm should perform better in practice than the one using hash functions. Of course, once a collection with the desired properties has been constructed, it does not matter which method was used to create it.
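A minimal sketch of the random-circuit construction and the perfectness check (the function names `random_ddot_collection` and `is_perfect` are ours; the check enumerates all subsets by brute force, which is feasible only for small N and k):

```python
import itertools
import numpy as np

def random_ddot_collection(n_qubits, k, n_circuits, seed=0):
    """Algorithm-2-style construction: random bitstrings define the circuits
    (bit 0 -> identity, bit 1 -> X gate on that qubit)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_circuits, n_qubits))

def is_perfect(collection, k):
    """Brute-force check of (N, k)-perfectness: every k-qubit subset must see
    all 2^k computational-basis states at least once."""
    n_qubits = collection.shape[1]
    for subset in itertools.combinations(range(n_qubits), k):
        seen = {tuple(row) for row in collection[:, subset]}
        if len(seen) < 2**k:
            return False
    return True

# Small example: N = 6 qubits, k = 2; ~100 random circuits make the failure
# probability of the construction negligible (cf. the bounds of Appendix B.4).
coll = random_ddot_collection(6, 2, 100)
print(is_perfect(coll, 2))
```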
As mentioned in the main text, during the implementation of DDOT different circuits will in general be sampled a different number of times, which may cause some correlations to be overestimated (this effect can be reduced by proper post-processing of the data, see Appendix B.6). Hence it is beneficial for the perfect collection of DDOT circuits to be approximately balanced. Here by "balanced" we mean that all k-qubit computational-basis states are sampled the same number of times, and by "approximately balanced" that they are sampled approximately the same number of times (see the related notion for hash functions, for example in Ref. [51]). For the construction using random circuits, we prove in Appendix B.3 that with high probability the collection will also have this property, with the required number of circuits scaling similarly to Eq. (61).
To conclude, let us point out that for a given pair of numbers (N, k) it suffices to generate the DDOT collection only once, and it can be used in the design of future experiments. We will make a number of pre-generated collections publicly available in our GitHub repository QREM (Quantum Readout Errors Mitigation) [75]. We note that the described techniques are suitable for noise characterization not only for the noise model we proposed but also for other models with bounded locality of correlations, for example with two-qubit correlations considered in Ref. [15].
while for the second qubit they are constructed analogously. Note that the idea is simply to fix the state of the "neighbouring" qubit to be either 0 or 1 and calculate the corresponding conditional marginals. We propose to use the information about the two-qubit noise matrices to calculate the correlations (Eq. (8)) between a given pair of qubits. For the i-th qubit, its dependence on the j-th qubit can be calculated by constructing two single-qubit matrices on the i-th qubit: the first one with the condition that the j-th qubit was initialized in the |0⟩ state, and the second in the |1⟩ state.
Those matrices can be used to calculate the parameter c_{j→i} as a norm of the difference of those matrices (Eq. (8)). We then propose to infer the structure of the readout correlations according to the magnitude of the parameters c_{j→i}, as follows. One specifies threshold parameters δ_clust and δ_neighb for the level of correlations between qubits, and then assigns qubit j to the neighborhood or to the cluster of qubit i if the parameter c_{j→i} is greater than the respective threshold. In general, we advise setting those thresholds higher than the likely magnitude of statistical fluctuations. We outline the whole procedure in Algorithm 3, which takes as input the conditional single-qubit noise matrices (for all pairs i ≠ j) together with a set of thresholds, and outputs the structure of the clusters and neighborhoods in a device.
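The inference step can be sketched as follows (the conditional noise matrices are hypothetical numbers; we take c_{j→i} as half the induced-L1 norm of the difference, though the exact normalization of Eq. (8) may differ, and the thresholds are purely illustrative):

```python
import numpy as np

# Hypothetical conditional single-qubit noise matrices for qubit i,
# reconstructed from DDOT data with qubit j initialized in state 0 or 1.
Lambda_i_j0 = np.array([[0.98, 0.04],
                        [0.02, 0.96]])
Lambda_i_j1 = np.array([[0.90, 0.09],
                        [0.10, 0.91]])

def correlation_parameter(L0, L1):
    # Half the induced-L1 norm (maximum absolute column sum) of the difference.
    return 0.5 * np.abs(L0 - L1).sum(axis=0).max()

c = correlation_parameter(Lambda_i_j0, Lambda_i_j1)

# Threshold-based assignment as in Algorithm 3 (illustrative thresholds).
delta_neighb, delta_clust = 0.01, 0.06
if c > delta_clust:
    role = "cluster"
elif c > delta_neighb:
    role = "neighborhood"
else:
    role = "negligible"
print(c, role)
```

With the numbers above the readout of qubit i shifts by several percent depending on the state of qubit j, so the pair ends up assigned to the same cluster.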
Finally, let us note that the above-described inference of correlations from two-qubit marginal distributions works under the assumption that the correlations do not vanish when taking marginals over qubits other than the given pair. This need not be true in practice, and one can consider generalizations of Algorithm 3. The analogous set of single-qubit matrices depending on the states of t neighboring qubits would be created in a fully analogous manner to that of Example 1, but now one would need to fix the state of t qubits. Note that if one implemented an (N, k) DDOT collection, the data to create such matrices for t = k − 1 is available from the experiments. In the future, we intend to investigate more elaborate methods of inferring the correlation structure using DDOT, and we will accordingly expand our repository [75].
Once a model for the above structure is obtained, one can use the rest of the data obtained in the DDOT procedure to reconstruct the cluster noise matrices as a function of the state of the neighbors, and consequently construct a global noise model (Eq. (5)) as well as the correction matrices for the marginals (Eqs. (10), (11)). From the definition of a perfect (N, k) collection, it follows that one can reconstruct only cluster noise matrices involving a number t = |C_χ| + |N_χ| of qubits which does not exceed k. If it happens that t > k, we propose to implement one of the two following solutions: 1. Neglect correlations between some qubits in order to enforce t = k. For example, in Algorithm 3, if the number of qubits assigned to the neighborhood of some cluster exceeds the limit, one could decide not to assign to that neighborhood the qubits with the lowest values of the c_{j→i} parameters. This will result in an imperfect model, which might nevertheless be accurate enough for the purposes of error mitigation.
2. Refine the noise model by performing additional experiments of standard Diagonal Detector Tomography on a chosen subset.
We note that for the refinement of the noise model, there also exists the alternative possibility of constructing a restricted DDOT collection that implements the characterization on a specific set of t-length subsets (as opposed to all such subsets). We will analyze this problem in more detail in future works, as well as during the development of our GitHub repository [75].
To finish this subsection, we note that Algorithm 3, while straightforward and efficient, is not a flawless method. For example, it might happen that the noise on qubit i depends strongly on the joint state of a group of qubits, even though the pairwise parameters c_{j→i} for the qubits from that group are small.

The probability of a given k-qubit marginal of the implemented collection (viewed as samples from the global uniform distribution of bitstrings) being at least ε-distant in TVD from the uniform distribution is bounded by

$$\left(2^{2^{k}}-2\right)\exp\left(-2s\epsilon^{2}\right),$$

where s is the number of random circuits in the collection. Since there are (N choose k) k-bit marginals, applying the union bound (as in the derivations in the previous sections) gives the upper bound for the probability P_{N,k} that at least one of the (N choose k) marginals is at least ε-distant from the uniform distribution:

$$P_{N,k}\leq\binom{N}{k}\,2^{2^{k}}\exp\left(-2s\epsilon^{2}\right). \qquad (70)$$

Hence, with probability at least 1 − (N choose k) 2^{2^k} exp(−2sε²), all marginals are at most ε-distant from the uniform distribution. Note that ε = 0 corresponds to a perfectly balanced family of circuits, and a small but non-zero ε corresponds to an approximately balanced family.
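Balancedness of a given collection can also be checked directly, by computing the worst-case TVD between the empirical k-qubit marginals of the collection and the uniform distribution (a brute-force sketch; `max_marginal_tvd` is our own helper name):

```python
import itertools
from collections import Counter

import numpy as np

def max_marginal_tvd(collection, k):
    """Largest TVD between any k-qubit marginal of the circuit collection
    (rows viewed as sampled bitstrings) and the uniform distribution over
    2^k outcomes; epsilon = 0 means a perfectly balanced collection."""
    s, n_qubits = collection.shape
    uniform = 1.0 / 2**k
    worst = 0.0
    for subset in itertools.combinations(range(n_qubits), k):
        counts = Counter(tuple(row) for row in collection[:, subset])
        tvd = 0.5 * sum(abs(counts.get(b, 0) / s - uniform)
                        for b in itertools.product((0, 1), repeat=k))
        worst = max(worst, tvd)
    return worst

# A perfectly balanced (4, 1) example: each qubit sees 0 and 1 equally often.
balanced = np.array([[0, 0, 0, 0],
                     [1, 1, 1, 1]])
assert max_marginal_tvd(balanced, 1) == 0.0
```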
Let us now choose some parameter δ as an upper bound for the probability in Eq. (70) and find the number of random circuits s (viewed as samples from a global uniform distribution) needed to obtain a family for which each marginal is distant from the uniform distribution by at most ε. After basic manipulations of Eq. (70) we obtain

$$s\geq\frac{1}{2\epsilon^{2}}\left[\log\binom{N}{k}+2^{k}\log 2+\log\left(1/\delta\right)\right].$$

B.4 The efficiency of the random construction of DDOT collection

B.4.1 Random circuits
Adopting the perspective from the previous section, we can straightforwardly tackle the problem of bounding the number of random circuits needed to obtain a perfect (N, k) collection of DDOT circuits. Consider randomly sampling s bit-strings of length N. For a fixed k-element subset of qubits, the probability that a particular k-bit string does not appear in a single sample is 1 − 1/2^k. Hence after s samples, the probability that this particular string has not appeared is (1 − 1/2^k)^s ≤ exp(−s/2^k). Since there are 2^k \binom{N}{k} combinations of interest (i.e., 2^k k-bit strings for each of the \binom{N}{k} subsets), we can use the union bound to obtain
P ≤ 2^k \binom{N}{k} \exp(−s/2^k)   (72)
as the upper bound for the probability that at least one k-bit string did not appear after s samples. This means that with probability at least 1 − 2^k \binom{N}{k} \exp(−s/2^k), all of the k-bit strings have appeared.
Let us now choose some parameter δ as the upper bound for Eq. (72) and calculate the resulting bound for the number of samples (i.e., random circuits). After basic manipulations we obtain
s ≥ 2^k ( k \ln 2 + \ln\binom{N}{k} + \ln(1/δ) ).   (73)
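The coupon-collector argument above is easy to verify numerically. In the sketch below (with our own helper names), `circuits_for_perfect` evaluates the bound of Eq. (73), and `is_perfect` brute-force checks whether a small collection is perfect:

```python
import math
from itertools import combinations

def circuits_for_perfect(N, k, delta):
    """Bound of Eq. (73): with probability >= 1 - delta, this many
    uniformly random N-bit circuits realise every k-bit string on
    every k-qubit subset, i.e., form a perfect (N, k) DDOT collection."""
    log_term = k * math.log(2) + math.log(math.comb(N, k)) + math.log(1 / delta)
    return math.ceil(2 ** k * log_term)

def is_perfect(circuits, N, k):
    """Brute-force perfectness check (feasible for small N only)."""
    for subset in combinations(range(N), k):
        seen = {tuple(c[q] for q in subset) for c in circuits}
        if len(seen) < 2 ** k:
            return False
    return True
```

For example, `circuits_for_perfect(6, 2, 0.05)` returns 29, i.e., 29 random 6-bit circuits suffice with probability at least 0.95 according to the bound.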

B.4.2 Random hash functions
Using simple combinatorial arguments, it is shown in Ref. [33] that the probability of a collection of L random hash functions (from N elements to k symbols) not being perfect (for our purposes, a perfect hash function collection means one that can be used to construct a perfect DDOT collection) is upper bounded by
\binom{N}{k} (1 − k!/k^k)^L.   (74)
The authors then use the above fact to derive a bound on the number of hash functions needed for the collection to be perfect. Let us repeat the derivation from Ref.
[33], paying attention to the specific factors. From Algorithm 1 it follows that the number of circuits s obtained from a given hash function collection is equal to s = 2 + (2^k − 2)L. Now, choosing the parameter δ as an upper bound on Eq. (74), we obtain that the required number of circuits is lower bounded by
s ≥ 2 + (2^k − 2) \frac{\ln\binom{N}{k} + \ln(1/δ)}{k!/k^k + (k!/k^k)²},   (75)
where we used that log(1 − x) ≥ −x − x². Let us note that Eq. (75) is a reiteration of a result from Ref. [33].
In the above derivations we assumed that −\log\binom{N}{k} + \log δ < 0 (which holds trivially for δ < 1); we also made use of elementary bounds together with the approximation \binom{N}{k} ≈ N^k.

B.5 Heuristic balancing of DDOT collection
After generating a perfect DDOT collection, one may be interested in making it more balanced. There are various possible figures of merit that can be used to quantify the "balance" of the family. For example, in previous sections, we used the TVD between the uniform distribution and the generated sample, viewing the obtained circuits in the collection as samples from the uniform distribution. Another possibility is to calculate the number of appearances of each marginal term in the whole collection (for example, for 2-qubit subsets, the number of times each of 00, 01, 10, and 11 appeared, for every 2-qubit subset), and then take the empirical standard deviation σ_{N,k} of this quantity. A perfectly balanced family would have zero standard deviation defined in that way. Now we discuss a simple heuristic method to improve balance. The starting point is a perfect (N, k) DDOT collection. We then apply the following steps in a loop.
1. Calculate the number of appearances of all k-qubit marginal terms in the collection.
2. Find a set of non-overlapping k-bit marginals which appear in the collection the least number of times compared to the whole population. If k does not divide N, choose ⌊N/k⌋ subsets and add random gates on the remaining qubits.
3. Add a circuit which implements those least-appearing marginals to the collection.
For example, say that in the last step of the above procedure for k = 2 and N = 5 we added the circuit
01100.   (77)
This might mean that we found the least-appearing marginal to be the state 01 on qubits Q0 and Q1, that the marginal 10 on qubits Q2 and Q3 appeared the same or the second-least number of times in the whole collection, while the state 0 on the last qubit Q4 was chosen at random. In this way, we are adding missing marginal terms "by hand" at each step of the loop. Clearly, this procedure is heuristic and is not guaranteed to succeed, since by adding certain circuits that implement desired marginals we also implement other marginals (in the example above we add, e.g., the marginal 00 on qubits Q0 and Q3). If those other marginals happen to lie in the opposite "tail" of the distribution (i.e., they are the most abundant ones), then the collection will not become more balanced (it can actually become less balanced). However, on average it is more likely that by doing so we add marginals that are closer to the "average marginals" (i.e., those which appear a number of times close to the mean appearance count). Many practical refinements of the described method are possible; for example, in the second step, instead of adding the least-appearing marginals, one can focus on maximizing the number of low-appearance marginals. In practice, we found that the described method without modifications usually reduced σ_{N,k} as the number of circuits added in the loop grew.
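The counting step and one iteration of the balancing loop can be sketched as follows. This is a minimal Python illustration with our own helper names; ties and the choice of non-overlapping subsets are resolved greedily, and leftover qubits are set to 0 rather than randomized:

```python
from itertools import combinations, product
from statistics import pstdev

def marginal_counts(circuits, k):
    """Number of appearances of each k-bit pattern on each k-qubit subset."""
    N = len(circuits[0])
    counts = {(s, p): 0 for s in combinations(range(N), k)
              for p in product((0, 1), repeat=k)}
    for s in combinations(range(N), k):
        for c in circuits:
            counts[(s, tuple(c[q] for q in s))] += 1
    return counts

def balance_score(circuits, k):
    """Empirical standard deviation sigma_{N,k}; 0 means perfectly balanced."""
    return pstdev(marginal_counts(circuits, k).values())

def balancing_step(circuits, k):
    """One loop iteration: append a circuit realising the least-seen
    patterns on a greedily chosen set of non-overlapping subsets."""
    counts = marginal_counts(circuits, k)
    N = len(circuits[0])
    new, used = [None] * N, set()
    for (subset, pattern), _ in sorted(counts.items(), key=lambda kv: kv[1]):
        if all(q not in used for q in subset):
            for q, b in zip(subset, pattern):
                new[q] = b
            used.update(subset)
    circuits.append([b if b is not None else 0 for b in new])
```

Running `balancing_step` repeatedly and monitoring `balance_score` mirrors the loop described above.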
To conclude, let us note that a special case of the above technique can also be used to add circuits to a non-perfect collection of DDOT circuits in order to make it closer to being perfect. Namely, if the collection is not perfect, it does not implement some computational-basis states on some k-qubit subsets. In that case, the above procedure would report "the least-appearing marginals" to be exactly those which are missing from the collection, and it would keep adding them in a loop until no terms are missing; a strategy which might turn out to be better than adding random circuits if the number of missing terms is not too big.

B.6 Overestimation of correlations
In the main text, we explained that when considering the implementation of DDOT circuits, some two-qubit correlations might be overestimated due to the fact that the collection of circuits is not balanced. To understand this effect, let us now consider the following illustrative (but rather unrealistic) example.

B.6.1 Explanation of the effect and post-processing strategy
Say that we want the marginal probability distribution on qubits Q0 and Q2 when the input state on those qubits was |00⟩. If the noise is uncorrelated, it does not matter whether the global input state on three qubits was |000⟩ or |010⟩ (|001⟩ or |011⟩). However, when there are correlations, those two distributions may differ depending on the input state of Q1. When we marginalize over Q1, we forget "where the data came from". Normally, we would just add the marginal probabilities from circuits implementing both global states and then normalize. But if the global state |000⟩ was implemented a different number of times than |010⟩, correlations which are actually caused by Q1 can be wrongly identified as correlations between Q0 and Q2, because some global probability distributions contribute to the marginal with higher weights (i.e., are effectively counted more times when calculating marginal distributions). As mentioned in the main text, the natural way to reduce such effects is to create a balanced collection, so that all two-qubit input states are sampled the same number of times. Alternatively, one can post-process the experimental data in such a way that all contributions to a given marginal are weighted by the inverse of the number of times they were implemented. Note that the contributions come from different global distributions, so what matters is the number of times a given marginal state was implemented together with some specific state on all the other qubits. Importantly, this method is not perfect, because for big systems the "state on all the other qubits" will likely be different each time anyway (since the collection of DDOT circuits is random, it becomes quite unlikely to obtain the same bitstring twice if the number of qubits is high).
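The inverse-multiplicity weighting described above can be sketched as follows. This is a minimal Python illustration (the function name and data layout are our own); every distinct global input context contributes to the marginal with equal weight, regardless of how many circuits realized it:

```python
from collections import defaultdict

def weighted_marginal(experiments, subset, marginal_state):
    """experiments: list of (input bitstring, {outcome bitstring: counts}).
    Returns the outcome marginal on `subset` for input `marginal_state`,
    averaging over distinct global input contexts with equal weight."""
    by_context = defaultdict(list)
    for inp, outcomes in experiments:
        if tuple(inp[q] for q in subset) == marginal_state:
            by_context[tuple(inp)].append(outcomes)
    marginal = defaultdict(float)
    for runs in by_context.values():
        shots = sum(sum(o.values()) for o in runs)
        for o in runs:
            for out, n in o.items():
                key = tuple(out[q] for q in subset)
                marginal[key] += (n / shots) / len(by_context)
    return dict(marginal)
```

In the three-qubit example above, a context |000⟩ implemented twice would count exactly as much as a context |010⟩ implemented once.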
Another thing one can do is to change the weights of the given contribution to the marginal distribution depending on the state of some particular subset of qubits in order to assess whether inferred correlations were correct. Specifically, one might perform a recursive procedure in which the structure of clusters and neighborhoods inferred from non-post-processed data is validated on particular subsets using this type of post-processing. There are certainly a lot of practical possibilities to improve the post-processing scheme and we intend to investigate them in future research.

C Noise characterization scheme overview
Here we provide a step-by-step description of our noise characterization procedure. We note that stages 0 and 1 were only briefly mentioned in the main text; they correspond to the verification of the underlying assumptions: stage 0 verifies the quality of the single-qubit gates, and stage 1 the assumption about the classical nature of the noise in the measurement device. Stages 2 and 3 describe the proper characterization scheme of Diagonal Detector Overlapping Tomography, which was discussed in detail in the main part of the work and in Appendix B.

C.1 Stage 0 -single-qubit gate fidelities
To begin, let us note that our characterization procedure relies on the assumption of perfect state preparation. In practice, however, this assumption might be significantly violated. Therefore we propose not to use qubits with single-qubit gate infidelities above some threshold; in our experiments we arbitrarily chose this threshold to be 0.01.
In experiments on IBM's Melbourne backend, the single-qubit gate fidelities were good enough (all above 99%) to use all of the qubits. In the case of Rigetti's Aspen-8, we discarded 8 qubits which had fidelities below 98%, while still using 5 qubits with fidelities in the range [98%, 99%].

C.2 Stage 1 -assessing classical form of the noise
In order to perform simultaneous estimation of single-qubit detectors with an overcomplete operator basis, one needs to implement 6 different circuits, each implementing an eigenstate of one of the Pauli matrices on every qubit at the same time. After this implementation, one needs to post-process the data to obtain the marginal single-qubit distributions and use standard detector tomography algorithms, for example those described in [44] (and implemented in Python in the online repository [75]).
Having reconstructed POVMs describing each single-qubit detector in a device, one can assess the classicality of the noise using methods described in Ref.
[13]; for the sake of completeness, we recall the main notions of that procedure here. First, we need a notion of distance between quantum measurements. Such distances are usually related to the probability distributions that those measurements generate via Born's rule. Recall from Eq. (9) that the Total Variation Distance between probability distributions p and q is defined via the l1 norm of the difference of those vectors, TVD(p, q) = ½ ||p − q||₁. The distance between quantum measurements related to the TVD is the operational distance D_op, defined for two POVMs M and P as [77]
D_op(M, P) := max_ρ TVD(p^ρ_M, q^ρ_P),   (90)
where p^ρ_M (q^ρ_P) is the probability distribution generated by the measurement M (P) on the state ρ via Born's rule, and the maximization goes over all quantum states. Hence, the operational distance between two quantum measurements is the worst-case distance between the probability distributions they can generate. Now, to quantify the readout noise, one can simply calculate the operational distance between the reconstructed POVM M of each qubit and the ideal measurement P. Then, to quantify the coherent part of the noise, one can make the following decomposition [13]
M = ΛP + ∆,   (91)
where Λ is a stochastic matrix and ∆ contains the remaining, coherent part of the noise. For the ideal measurement P being the standard measurement in the computational basis, this decomposition is straightforward: the classical part of the noise is contained in the diagonal parts of the measurement operators, while the off-diagonal terms form the coherent part. The magnitude of the coherent part of the noise can be quantified as D_op(M, ΛP). The assumption of fully classical noise amounts to discarding the coherent part ∆ in Eq. (91) and performing error mitigation as if the POVM ΛP were the exact description of the detector. This leads to the propagation of coherent errors under error mitigation, which can be quantified via ||Λ⁻¹||₁→₁ D_op(M, ΛP). Following the guidelines in Ref.
[13], we propose to discard every qubit for which this quantity exceeds a chosen threshold (the inequality of Eq. (92)). Fortunately, in experiments on both IBM's and Rigetti's machines, this step did not lead us to discard any qubits; the noise in those devices is highly classical, as indicated in previous experiments [13,14]. Before proceeding further, let us note that in assessing the classicality of the noise via single-qubit QDTs we make the following implicit assumption: the coherent part of the noise does not scale significantly with growing system size (by this we mean that it is at most additive in the number of qubits). Furthermore, in using the rule of Eq. (92) we disregard the effects of statistical deviations.
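For the classical part of the noise, the quantities entering the above discussion are simple to compute. The sketch below (illustrative numbers, our own helper names) extracts the stochastic matrix Λ from the diagonal of a reconstructed two-outcome POVM and evaluates ||Λ⁻¹||₁→₁ together with the operational distance restricted to classical stochastic maps, for which the maximization over states is attained on basis states:

```python
import numpy as np

def classical_part(povm):
    """Stochastic matrix Lambda from a 2-outcome qubit POVM:
    Lambda[i, j] = p(outcome i | basis state j) = Re M_i[j, j]."""
    return np.array([[op[j, j].real for j in range(2)] for op in povm])

def one_to_one_norm(A):
    """||A||_{1->1} equals the maximal absolute column sum."""
    return np.abs(A).sum(axis=0).max()

def classical_op_distance(L1, L2):
    """Worst-case TVD between outputs of two stochastic maps; by
    convexity it is attained on a deterministic (basis-state) input."""
    return 0.5 * np.abs(L1 - L2).sum(axis=0).max()

# illustrative noisy measurement effects (diagonal, i.e., purely classical)
M0 = np.diag([0.98, 0.05])
M1 = np.eye(2) - M0
Lam = classical_part([M0, M1])
```

For these illustrative numbers, the classical operational distance to the ideal measurement is 0.05 and ||Λ⁻¹||₁→₁ ≈ 1.11.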

C.3 Stage 2 -Diagonal Detector Overlapping Tomography
The main idea of DDOT was described in the main text. The practical generation of relevant circuits was described in Appendix B.1 and summarized in Algorithm 1 (using random hash functions) and in Algorithm 2 (using random circuits).

C.4 Stage 3 -inferring correlations structure
The procedure of inferring correlations (i.e., structure of clusters and neighbourhoods) was described in detail in Appendix B.2 and summarized in Algorithm 3.

D Sample complexity of energy estimation
In this section, we give some derivations related to the estimation of expectation values of local Hamiltonian terms on various quantum states. We start by discussing correlations in random states in Appendix D.1. Then, in Appendix D.2, we prove Proposition 2 from the main text, which concerns states appearing in the QAOA algorithms. Finally, in Appendix D.3 we analyze what happens with the covariances of local Hamiltonian terms when uncorrelated measurement noise and its mitigation are present.

D.2 Proof of Proposition 2
Consider a Hamiltonian with connectivity given by an Erdős–Rényi random graph with N nodes and K edges, with average degree q = K/N. We will show that the claimed bound holds if the number of levels satisfies p = c log(N) + 1 with a sufficiently small constant c. Let G = (V, E) be a graph (with vertex set V and edge set E); we denote by B(i, r) the set of vertices that are at graph distance r or less from vertex i. Similarly, for an edge α = (i, j) ∈ E, we denote by C(α, r) the set of edges at graph distance r or less from α. For α = (i, j) and β = (v, w), we have that if v ∉ B(i, r + 2), then β ∉ C(α, r). The random Hamiltonian corresponding to a graph can be written as a sum of 2-qubit terms (with the single-qubit terms incorporated into these) corresponding to the edges E of the random graph G(N, q), i.e., H = Σ_{(i,j)∈E} H_{(i,j)}. The variance is then
Var(H) = Σ_{α∈E} Σ_{β∈E} Cov(H_α, H_β).   (97)
To bound this quantity we will utilize the following two facts. First, we notice that any non-zero term in Eq. (97) can be bounded as
|Cov(H_α, H_β)| ≤ √(Var(H_α) Var(H_β)) ≤ ¼ (λ_max(H_α) − λ_min(H_α)) (λ_max(H_β) − λ_min(H_β)),   (98)
where we used the known covariance–variance inequality together with Popoviciu's inequality. Second, we reiterate an important observation from Ref. [54] about these types of QAOA circuits: consider two operators O₁ and O₂ acting non-trivially only on the sets of qubits A₁ ⊂ V and A₂ ⊂ V, respectively. If U is a unitary corresponding to a p-level QAOA (with any parameter setting), the sets A₁ and A₂ are at graph distance at least 2p from each other, and |ψ⟩ is a product state, then the covariance of O₁ and O₂ vanishes on the state U|ψ⟩. With these two ingredients, we can bound the variance as
Var(H) ≤ Σ_{α∈E} |C(α, 2p)| max_γ Var(H_γ).
An even rougher upper bound is
Var(H) ≤ K max_{α∈E} |C(α, 2p)| max_γ Var(H_γ).
Now we can turn to the concrete case of Hamiltonians corresponding to graphs sampled from the Erdős–Rényi ensemble with average degree q.
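The graph-theoretic quantities B(i, r) and C(α, r) used above are cheap to compute by breadth-first search. A minimal pure-Python sketch (our own helper names; the adjacency structure is given as a dict of neighbor lists):

```python
from collections import deque

def ball(adj, i, r):
    """B(i, r): vertices at graph distance at most r from vertex i (BFS)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        v = queue.popleft()
        if dist[v] == r:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

def edges_near(adj, edge, r):
    """Edges whose endpoints both lie within distance r of an endpoint
    of `edge`; a convenient proxy for C(edge, r)."""
    verts = ball(adj, edge[0], r) | ball(adj, edge[1], r)
    return {(v, w) for v in verts for w in adj[v] if w in verts and v < w}
```

On a path graph 0-1-2-3-4, for instance, `ball(adj, 0, 2)` returns {0, 1, 2}, and the count of edges near an edge is bounded by |B(i, r + 2)|², echoing the estimate used later in the proof.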
For such graphs, the Neighborhood Size Theorem of Ref. [54] states the following: for any r < w log(N) / (4 log(2q/ln 2)), where 0 < w < 1, there exist constants a > 0 and A < 1 controlling the neighborhood sizes |B(i, r)| with high probability (we refer to Ref. [54] for the precise statement). To be more specific, a can be expressed in terms of r and N as (Eqs. (105)-(106))
a = w / (3(1 + |log_{2q}(ln 2)|)),
with the expression for A involving the same quantity 1 + |log_{2q}(ln 2)|.
The proof of the above comes from the proof of the Neighborhood Size Theorem given in [54], page 13; specifically, our inequalities correspond to setting s = 1. Now note that for any α = (i, j) ∈ E and β = (v, w) ∈ E, if β ∈ C(α, r) then v ∈ B(i, r + 2). This immediately implies that for α = (i, j) ∈ E we have |C(α, r)| ≤ |B(i, r + 2)|², and thus the variance bound follows.

D.3 Covariances in the presence of measurement noise and error mitigation

As a starting point, let us assume that we have a state with correlations bounded in trace norm,
||ρ_{αβ} − ρ_α ⊗ ρ_β||₁ ≤ ε,
where ρ_{αβ} is the marginal quantum state on subsystems α and β, which is thus close to a product state on those subsystems. The special instance ε = 0 corresponds to the product-state case. We are interested in what happens with the covariances of local Hamiltonian terms if the measurement is affected by uncorrelated classical noise of the form Λ = ⊗_i Λ_{Q_i}, where Λ_{Q_i} is the noise matrix acting on qubit Q_i (see Eq. (16)). Similarly as in the main text, the Hamiltonian we consider is classical (i.e., diagonal), and the local terms can be decomposed into sums of products of Pauli σ_z operators. Let us denote by p^{noisy}_α = Λ_α p^{ideal}_α the noisy marginal distribution on the qubit subset α corresponding to the local Hamiltonian term H_α.

E.1 Details of the QAOA optimization

The function which is passed to the classical optimizer is the estimator of the expected value ⟨ψ_p| H_O |ψ_p⟩ (note that this makes those estimators effectively functions of the parameters {α_j}, {β_j}). The estimator is obtained by sampling from the distribution defined by the measurement of |ψ_p⟩ in the computational basis, taking the relevant marginals, and calculating the expected value of H_O using those estimated marginals.
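The uncorrelated noise model Λ = ⊗_i Λ_{Q_i} acting on a marginal distribution, and its inversion used for mitigation, can be sketched as follows (the numbers are illustrative, not experimental values):

```python
import numpy as np

def marginal_noise_matrix(single_qubit_lambdas, subset):
    """Lambda_alpha: tensor product of the single-qubit noise matrices
    on the qubits in `subset` (uncorrelated-noise assumption)."""
    L = np.array([[1.0]])
    for q in subset:
        L = np.kron(L, single_qubit_lambdas[q])
    return L

# illustrative single-qubit readout noise: p(1|0) = 0.02, p(0|1) = 0.05
lam = np.array([[0.98, 0.05], [0.02, 0.95]])
lams = [lam] * 3
subset = (0, 2)
p_ideal = np.array([0.7, 0.1, 0.1, 0.1])          # marginal on qubits 0, 2
L_alpha = marginal_noise_matrix(lams, subset)
p_noisy = L_alpha @ p_ideal                        # what the device reports
p_mitigated = np.linalg.inv(L_alpha) @ p_noisy     # corrected marginal
```

Since Λ_α is stochastic, p_noisy remains normalized, and applying Λ_α⁻¹ recovers the ideal marginal exactly in this noiseless-statistics idealization.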
Theoretically, one could optimize a p-layer QAOA by simultaneous optimization of all 2p angles. However, this is hard in practice, since the number of optimized parameters increases the complexity of the classical optimization. In our runs, we therefore divided the optimization into steps in the following way. In each step, we optimized over a set of 6 angles (i.e., 3 QAOA layers), and the input to the next step was the optimized state obtained in the previous steps. For example, the input to the second optimization step was the quantum state U^{(1)}_3 |+⟩^{⊗N} = U_3 |+⟩^{⊗N}, where the superscript denotes the optimization step; the input to the third optimization step was the state U^{(2)}_3 U^{(1)}_3 |+⟩^{⊗N}, and so on. In summary, for a p-layer QAOA the optimization was effectively divided into p/3 steps, and in each step we optimized over 6 angles, i.e., 3 QAOA layers. Each step used 2 × 800 = 1600 function evaluations (the factor 2 comes from the two evaluations needed for each gradient estimate, which was performed 800 times) plus a single final function evaluation. We further performed each such procedure one additional time and chose the better of the two runs. Each energy estimator was obtained using 10⁴ energy measurements. Therefore, the total number of energy measurements was ≈ (p/3) × 3.2 × 10⁷. As the classical optimizer, we used Simultaneous Perturbation Stochastic Approximation (SPSA) [39,59-61]. With a growing number of optimization steps, we gradually changed the parameters to (heuristically) make the optimization more adaptive: the parameters α and γ were not changed, while the other parameters were updated according to the prescription a_p = a_0 · 0.9^p, c_p = c_0 · 0.9^p, where the 0 subscript denotes the starting values of the parameters given in Table 1. We note that the results presented in Fig. 6 exhibit rather poor convergence of the algorithm, in the sense that adding more layers above p = 9 changed the resulting energies only slightly.
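A bare-bones SPSA sketch with a decaying-gain schedule of the kind described above (the hyperparameter values and per-iteration decay are illustrative, not the Table 1 settings):

```python
import random

def spsa_minimize(f, theta, steps=200, a0=0.1, c0=0.1, seed=7):
    """Minimal SPSA: two function evaluations per iteration yield a
    stochastic gradient estimate; gains decay geometrically, echoing
    the a_p = a0 * 0.9**p, c_p = c0 * 0.9**p prescription."""
    rng = random.Random(seed)
    for t in range(steps):
        a = a0 * 0.9 ** (t / 20)
        c = c0 * 0.9 ** (t / 20)
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        f_plus = f([x + c * d for x, d in zip(theta, delta)])
        f_minus = f([x - c * d for x, d in zip(theta, delta)])
        g = (f_plus - f_minus) / (2 * c)
        # for +/-1 perturbations, 1/delta_i == delta_i
        theta = [x - a * g * d for x, d in zip(theta, delta)]
    return theta
```

In a real run, `f` would be the (noisy) energy estimator built from the measured marginals; here any scalar function of the parameter list works.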
This can be explained by the fact that we did not perform an exhaustive search over hyperparameters, but it can also be a manifestation of the recently reported fact that QAOA might have problems with reaching global minima for relatively complicated Hamiltonians (like the high-density MAX-2-SAT instances used in our work) [79]. Clearly, in the context of our work, only the comparison of the noisy and noise-mitigated optimization to the noiseless run was of significance.
Figure 8: Correlations (cf. Eq. (8)) in IBM's 15-qubit Melbourne device (left) and a 23-qubit subset of Rigetti's Aspen-8 device (right). The convention is that "row is affected by column", i.e., the measurement noise on the qubit with the label given by the row index depends on the state of the qubit with the label given by the column index, and the magnitude of the dependence is indicated by colors.

E.2 Additional experimental data
Here we present additional data concerning the correlations reconstructed in the DDOT characterization of IBM's 15-qubit Melbourne device and a 23-qubit subset of Rigetti's Aspen-8 device. The full depiction of correlations in Rigetti's device is presented in Fig. 7. The heatmaps showing the reported correlations in both devices are presented in Fig. 8. In Fig. 9 we show how many parameters are needed to describe various noise models, and the table in Fig. 10 shows how much time the data processing took. The data was processed using a laptop with 32 GB of DDR4 RAM (speed 2667 MT/s) and an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz. We note that no multi-threading was implemented; in principle, one can likely reduce the run-time further by exploiting parallel calculations. We intend to optimize the code used for data processing during the development of our online repository [75].
Figure 9: Number of parameters needed to describe various noise models, including the CTMP model of Ref. [15], which is formulated without any assumptions on the correlations structure. We note that combining our DDOT characterization with the CTMP model could reduce the number of parameters in CTMP by pointing to negligible correlations that can be disregarded. For comparison, the "full classical" noise model refers to a generic stochastic map, and "full quantum" to a generic d-outcome POVM.
Figure 10: Time of data processing. In the first table, the "pre-processing time" includes the calculation of marginal noise matrices on 2-qubit subsets, the calculation of pairwise correlations, the reconstruction of the noise model, and the calculation of the inverse noise matrices needed for the corrections of all possible two-qubit marginal probability distributions (note that this in general includes also higher-dimensional corrections, as explained in Fig. 3). In the second table, the "Total post-processing time" includes both the calculation of the marginal distributions needed to estimate the energies of the investigated 2-local Hamiltonians and the error mitigation performed on them. The "Error-mitigation time" shows only the latter.