Adaptive estimation of quantum observables

The accurate estimation of quantum observables is a critical task in science. With progress on the hardware, measuring a quantum system will become increasingly demanding, particularly for variational protocols that require extensive sampling. Here, we introduce a measurement scheme that adaptively modifies the estimator based on previously obtained data. Our algorithm, which we call AEQuO, continuously monitors both the estimated average and the associated error of the considered observable, and determines the next measurement step based on this information. We allow both for overlap and non-bitwise commutation relations in the subsets of Pauli operators that are simultaneously probed, thereby maximizing the amount of gathered information. AEQuO comes in two variants: a greedy bucket-filling algorithm with good performance for small problem instances, and a machine learning-based algorithm with more favorable scaling for larger instances. The measurement configuration determined by these subroutines is further post-processed in order to lower the error on the estimator. We test our protocol on chemistry Hamiltonians, for which AEQuO provides error estimates that improve on all state-of-the-art methods based on various grouping techniques or randomized measurements, thus greatly lowering the toll of measurements in current and future quantum applications.

In this work, we introduce a novel algorithm called AEQuO that maximises the information obtained from sampling the quantum system and learns from previous outcomes to improve the allocation of the remaining measurement budget. Compared to previous strategies, AEQuO is based on the ability to estimate on the fly not only the average of a given quantum observable, but also its error. This allows us to faithfully determine the precision of the estimated quantity, without the requirement of deriving error bounds [19,21,25,26]. Furthermore, instantaneous error knowledge also permits better allocation of the measurement budget and is at the core of the learning capabilities of our algorithm. We focus on the problem of accurately estimating a given observable O. This is a typical challenge encountered, for example, in quantum-classical hybrid protocols such as the variational quantum eigensolver [14,15]. In particular, we want to estimate the expectation value of

  O = Σ_j c_j P_j   (1)

using M repeated preparations of the quantum state. Here, all c_j are real constants, the Pauli string (PS) P_j labels a tensor product of Pauli operators, and ⟨Q⟩ ≡ ⟨Q⟩_ρ = tr(ρQ) denotes the expectation value of an observable Q with respect to the quantum state ρ. Devising a protocol that allocates the budget M to minimise the estimation error is far from trivial. Several approaches have been proposed, including joint measurements of pairwise commuting operators [17], randomized and derandomized measurements [19,21,25,26], grouping by weights [20], neural networks [22], minimizing the number of measurement groups [16,18,25,27,29-32], and a method based on maximum entropy and optimal transport [33]. In this work, we introduce a measurement scheme for estimating observables that adaptively allocates the measurement budget based on previously collected data, allows for both non-bitwise commutation between PS and overlap in their grouping, and assesses both the average and variance of the observable O in Eq. (1).
The article is structured as follows. In Sec. 2, we provide an overview of our main results. In Sec. 3, we introduce the necessary background for observable estimation. We explain in detail the process of encoding an observable as in Eq. (1) in a weighted graph, and we discuss the connection between the estimation task and clique covers of a weighted graph. We explain AEQuO's subroutines in Sec. 4 and provide numerical results using chemistry Hamiltonians as a benchmark in Sec. 5. We conclude in Sec. 6 with some open problems and future directions of research. The appendices contain additional information further explaining our methods and results.

Overview of the main results
We introduce the Adaptive Estimator of Quantum Observables (AEQuO), a protocol designed to allocate the measurement budget in order to minimise the error affecting the estimate of an observable O as in Eq. (1). AEQuO iteratively changes the employed estimator based on the continuous inflow of new measurement data, it allows for both non-bitwise commutation relations between the PS and overlap in their grouping, and it yields estimates for both the average value and the error of O. It also employs Bayesian statistics, which permits the inclusion of prior knowledge and gives meaningful results even for small sample sizes. As a benchmark, we use AEQuO to estimate chemistry Hamiltonians, obtaining error estimates that improve on all state-of-the-art methods of estimating quantum observables.
A qualitative description of AEQuO's subroutines is given below. First, an operator O as in Eq. (1) is encoded in a (weighted) graph with vertex set {c_j P_j} [16]. An edge is drawn between P_i and P_j whenever the two PS commute, and the corresponding edge weight, representing their covariance, is initialized [see Eq. (2b) and Sec. 4]. An estimate of ⟨O⟩ and its error can be directly calculated from this graph by assigning the vertices to groups of commuting operators that can be measured simultaneously. Crucially, we allow for overlap among these groups, that is, the operators P_j in Eq. (1) can belong to multiple groups, thus effectively increasing the amount of gathered information.
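To make the graph construction concrete, the commutation tests can be sketched as follows. This is a minimal Python illustration using the standard symplectic criterion (two Pauli strings commute iff they anticommute on an odd number of qubits an even number of times, i.e. on an even number of sites); the character encoding '0xyz' for Pauli strings is our own convention, not the paper's.

```python
from itertools import combinations

def site_anticommute(a, b):
    """Two single-qubit Paulis anticommute iff both are non-identity and differ."""
    return a != "0" and b != "0" and a != b

def commute(p, q):
    """General commutation: P and Q commute iff they anticommute on an even number of qubits."""
    return sum(site_anticommute(a, b) for a, b in zip(p, q)) % 2 == 0

def bitwise_commute(p, q):
    """Stricter notion: the two strings must commute qubit by qubit."""
    return not any(site_anticommute(a, b) for a, b in zip(p, q))

def commutation_graph(strings, bitwise=False):
    """Adjacency sets of the (unweighted) commutation graph over the PS."""
    test = bitwise_commute if bitwise else commute
    adj = {p: set() for p in strings}
    for p, q in combinations(strings, 2):
        if test(p, q):
            adj[p].add(q)
            adj[q].add(p)
    return adj

# Example from Fig. 1: O = σ^x_1 σ^x_2 + σ^z_1 σ^z_2 + σ^z_2.
adj = commutation_graph(["xx", "zz", "0z"])
print(sorted(adj["zz"]))   # "zz" commutes with both other strings
```

Note that "xx" and "zz" commute only in the general (non-bitwise) sense, which is why forbidding non-bitwise commutation removes that edge from the graph.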
The second step of AEQuO consists in iteratively assigning a number of measurements to each of these groups until the budget M is exhausted. Importantly, we develop a method (see Sec. 3) to calculate the estimation error that AEQuO minimizes while allocating the M shots.
We provide two different subroutines for this assignment. One is a greedy "bucket-filling" (BF) algorithm that assigns the M measurements one by one. This algorithm gives a good estimation error in small problem instances, but performs poorly for large problem sizes due to its greedy nature. To remedy this, we developed another subroutine based on machine learning (ML) that requires fewer repetitions, as it allocates a fraction of the total budget M to the most promising groups of commuting PS in each iteration. As a result, the ML algorithm is faster for large values of M and n, the latter being the number of PS in Eq. (1).
In a final step, we post-process the gathered data to obtain the estimator characterized by the smallest error based on the performed measurements. The post-processing utilizes the cumulative knowledge gathered during the measurement phase. It considers the contribution to the estimation error from each PS in the different commuting groups, and removes a PS from a group whenever doing so is statistically likely to improve the error.

Theory
The PS P_j in Eq. (1) acts on d qubits, P_j = σ^{I_j(1)}_1 ⊗ · · · ⊗ σ^{I_j(d)}_d. The superscript I_j(i) labels the element of the PS, i.e. I_j(i) ∈ {0, x, y, z}, with σ^0 denoting the identity and σ^x, σ^y, σ^z the Pauli matrices. The subscript i indicates the corresponding qubit on which a Pauli operator acts. A decomposition of an observable into PS as in Eq. (1) has at most n ≤ 4^d terms; however, in most physical examples the number of PS with non-zero weight c_j scales polynomially in d.
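As a concrete illustration of Eq. (1) and the PS notation, the sketch below builds a PS as a tensor product and evaluates ⟨O⟩ = Σ_j c_j tr(ρ P_j). The function names are ours, and the operator is the three-term example used in Fig. 1.

```python
import numpy as np
from functools import reduce

# Single-qubit Paulis, indexed as in the text: I_j(i) ∈ {0, x, y, z}.
PAULI = {
    "0": np.eye(2, dtype=complex),
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(labels):
    """Tensor product σ^{I(1)}_1 ⊗ ... ⊗ σ^{I(d)}_d, e.g. 'xz0' for d = 3."""
    return reduce(np.kron, (PAULI[l] for l in labels))

def expectation(rho, coeffs, strings):
    """⟨O⟩ = Σ_j c_j tr(ρ P_j) for O = Σ_j c_j P_j as in Eq. (1)."""
    return sum(c * np.trace(rho @ pauli_string(s)).real
               for c, s in zip(coeffs, strings))

# Example: O = σ^x_1 σ^x_2 + σ^z_1 σ^z_2 + σ^z_2 evaluated on |00⟩.
psi = np.zeros(4, dtype=complex)
psi[0] = 1.0
rho = np.outer(psi, psi.conj())
print(expectation(rho, [1.0, 1.0, 1.0], ["xx", "zz", "0z"]))  # → 2.0
```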
In an experiment, we collect m_j measurement outcomes ("shots") for each P_j and take their average to obtain an estimate P̃_j of ⟨P_j⟩, which subsequently allows us to estimate ⟨O⟩ through Õ = Σ_j c_j P̃_j. Considering that two commuting PS P_j and P_i can be measured simultaneously (i.e., in the same shot), an estimate for the error (∆Õ)² affecting Õ is

  (∆Õ)² = Σ_{j,i} c_j c_i [m_ji / (m_j m_i)] Q̂_ji.   (2a)

Here, Q̂ is the covariance matrix of all measured data [estimated via Eq. (2b)], m_ji denotes the number of shots where P_j and P_i have been sampled simultaneously, and δ is the delta function with δ(0) = 1 and δ(x) = 0 for x ≠ 0, used in Eqs. (2) to regularize the terms for which no data are available. In the limit of many measurements, Q̂_ji → cov(P_j, P_i) = ⟨P_j P_i⟩ − ⟨P_j⟩⟨P_i⟩. Note that m_jj = m_j, and that Σ_j m_j = M if and only if m_ji = 0 for all j ≠ i. In this special case, Eqs. (2) yield the well-known result (∆Õ)² = Σ_j c_j² (∆P_j)²/m_j.

Figure 1: (a): Example of a graph. Vertices and edges are weighted with c_j P̃_j and c_j c_i C(P_j, P_i), respectively. As shown by the grey circle, we include self-edges (omitted in the rest of the graph) with variances c_j² (∆P_j)². Two maximal cliques are highlighted in green and violet. (b-e): Case study of the operator O = σ^x_1 σ^x_2 + σ^z_1 σ^z_2 + σ^z_2 using its graph representation (top). The four non-redundant covers are highlighted within plots (b-d), along with the Q̂_ji elements contributing to the associated error (∆Õ)² (matrices with red and green squares). Results from measurements of these covers are plotted in the graphs. We considered 2·10⁵ experiments where M = 10⁴ experimental shots are used to estimate Õ and (∆Õ)² for pure states uniformly sampled on the Bloch sphere. For each cover, measurement allocation is optimal and is determined with the full knowledge of Q̂. As discussed in Sec. 3 and App. A, for each state one cover gives the smallest error (∆Õ)²_opt, to which the results obtained for the four plots are compared. Vertical lines represent averages. In the bottom panel (e), we collect the results obtained by AEQuO without prior knowledge of Q̂, and using the ML subroutine with L = 1.

To obtain Õ, there is no unique way to choose groups of PS that are pairwise commuting, in particular if one allows these groups to have overlap. Every choice of groups corresponds to a specific estimator in the form of Eqs. (2) characterized by the values of m_ji and m_j. In the following, we explain the aforementioned graph representation used in our protocol, and the idea behind our algorithm for grouping PS.
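The error estimator can be sketched numerically. The formula below is our reading of Eq. (2a): the covariance of the sample means P̃_j and P̃_i is Q̂_ji weighted by m_ji/(m_j m_i), which reproduces both the quoted special case and the pair bound discussed later; the delta-function regularization is replaced here by simply skipping terms without data.

```python
def error_estimate(c, Q, m_pair):
    """Sketch of the error estimator (our reading of Eq. (2a)):
        (ΔÕ)² = Σ_{j,i} c_j c_i (m_ji / (m_j m_i)) Q̂_ji,
    where m_pair[j][i] = m_ji and the diagonal m_pair[j][j] = m_j.
    Terms without any data are skipped (the paper handles them with
    delta functions)."""
    n = len(c)
    return sum(c[j] * c[i] * m_pair[j][i] * Q[j][i] / (m_pair[j][j] * m_pair[i][i])
               for j in range(n) for i in range(n)
               if m_pair[j][j] and m_pair[i][i])

# Sanity check: with no overlap (m_ji = 0 for j ≠ i) the expression reduces
# to the familiar Σ_j c_j² (ΔP_j)² / m_j = 0.5/100 + 0.2/200 + 0.9/400.
c = [1.0, 1.0, 1.0]
Q = [[0.5, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.9]]   # variances on the diagonal
m = [[100, 0, 0], [0, 200, 0], [0, 0, 400]]
print(error_estimate(c, Q, m))
```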
As shown in Fig. 1(a), we represent the operator O in Eq. (1) with a weighted graph whose vertices correspond to c_j P̃_j. If P_j and P_i commute, their vertices are connected by an edge with weight c_j c_i C(P_j, P_i). Therefore, Õ is obtained by summing over all vertices, while (∆Õ)² is the sum of all edges (including self-edges). Different P_j can only be measured simultaneously if they belong to the same clique of the graph, i.e., a fully connected subset of vertices. A clique is called maximal if no further vertex can be added to it [see Fig. 1(a)]. Evidently, an estimator corresponds to a clique cover, i.e., a set of cliques such that each vertex of the graph is included in at least one clique.
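Maximal cliques can be enumerated with the Bron-Kerbosch algorithm mentioned in Sec. 4. A minimal sketch (without the pivoting optimization), applied to the graph of the Fig. 1 example:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration (without pivoting) of all maximal cliques of an
    undirected graph given as {vertex: set(neighbours)}."""
    cliques = []
    def bk(R, P, X):
        if not P and not X:
            cliques.append(frozenset(R))   # no vertex can extend R: maximal
            return
        for v in list(P):
            bk(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    bk(set(), set(adj), set())
    return cliques

# Graph of the Fig. 1(b-e) example: edges (xx, zz) and (zz, 0z).
adj = {"xx": {"zz"}, "zz": {"xx", "0z"}, "0z": {"zz"}}
print(sorted(sorted(cl) for cl in maximal_cliques(adj)))
# → [['0z', 'zz'], ['xx', 'zz']], the two maximal cliques of the example
```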
For illustrative purposes, let us consider the example in Fig. 1(b-e). There are five cliques that can be arranged in four different, non-redundant covers.¹ The particular choice of clique cover determines the terms C(P_j, P_i) contributing to the error (∆Õ)², as can be seen in the pictorial representations of the matrices Q̂ in Fig. 1(b-d). For a given input state, the estimator Õ corresponding to the cover yielding the minimal (∆Õ)² for fixed M gives the most accurate estimation (since all estimators are unbiased).

We test the performances of the four estimators (viz. covers) in Fig. 1(b-d) with randomly chosen two-qubit pure states, and calculate the relative deviation of the error estimate w.r.t. the best estimator. Here, we use the asymptotic values for Eq. (2b) that are obtained for M → ∞. The results demonstrate that, as expected, measuring each PS separately [Fig. 1(b)] is never a good strategy, while the intuitively best estimator [Fig. 1(d)], corresponding to the cover made of maximal cliques only, is optimal in only ∼54% of the cases. More details about how the best estimator is determined and the measurement budget is allocated can be found in App. A.

¹ In general, one can find more covers by nesting smaller cliques into bigger ones. However, these cases are not relevant to us since only one of these nested cliques will yield minimal error. Therefore, without loss of generality, we can limit ourselves to the study of the four covers in Fig. 1.
In practice, the state to be measured is unknown and it is not possible to test all covers beforehand to identify the best one. This motivates an adaptive approach in which the clique cover is changed during the measurement process depending on former outcomes. In our protocol, we update the vertex and edge weights of the graph after some (or each) of the M shots. We then decide which clique is measured next, based on its contribution to Eq. (2a).
Conveniently, the choice of the particular estimator can be relegated to a second post-processing step; in the data acquisition phase, we may restrict ourselves to maximal cliques only. Changing the estimator (or equivalently, the clique cover) can then be done subsequently by adjusting the measurement numbers m_ji and m_j, and updating Q̂. For example, if one removes P_j from a clique, then for all P_i in that clique one has to reduce m_ji and m_j by the number of shots allocated to that clique. Furthermore, all outcomes gathered for P_j in the considered clique have to be discarded in order to avoid introducing biases in the estimator. In the example in Fig. 1(d), changing the estimator corresponds to removing the PS σ^z_1 σ^z_2 from one of the two cliques.
The advantages of employing maximal cliques and post-processing are twofold. First, restricting to maximal cliques reduces the number of choices in each shot. Second, potentially available data is never neglected, resulting in a better knowledge of the vertex and edge weights of the graph. This knowledge can then be used for more informed choices, both of the maximal cliques to be measured at each shot, and in the postprocessing stage itself.
In the histograms of Fig. 1(b-e), we plot the expected deviations from the minimal error, which is derived by always identifying the best cover out of the four. The black and red lines in the figures indicate the mean values of the associated histograms. A comparison shows that AEQuO [in Fig. 1(e)], based on maximal cliques and post-processing, consistently provides near-minimal error and greatly outperforms the static approaches [in Fig. 1(b-d)] represented by the four covers in the figure. For more details, see App. A.

As a conclusive remark, we present two corollaries following from our graph representation of the observable O and Eqs. (2). Namely, we derive simple upper bounds on both the error (∆Õ)² and its scaling with respect to the number n of PS within O. To bound the error, we observe that the maximum error contribution of two PS P_j and P_i measured together is c_j² + c_i² + 2|c_j c_i|, which can be obtained by setting all Q̂_ji equal to sgn(c_j)sgn(c_i) in Eq. (2b). This yields the tight upper bound

  (∆Õ)² ≤ Σ_{j,i} |c_j c_i| m_ji / (m_j m_i).

Here, we have omitted the delta functions for clarity. Given any measurement strategy, this equation determines, beforehand, the maximum error that can possibly affect the resulting estimator. The scaling of (∆Õ)² in terms of n can be understood using graph theory. Considering that it is always possible to find a cover of the graph with n cliques, one concludes that in the worst-case scenario (∆Õ)² grows linearly in n.
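A short numerical check of the pair bound, assuming the upper bound takes the form obtained by substituting Q̂_ji → sgn(c_j)sgn(c_i) into our reading of Eq. (2a):

```python
def error_bound(c, m_pair):
    """Upper bound obtained by replacing every Q̂_ji with sgn(c_j)sgn(c_i) in the
    (reconstructed) error formula: (ΔÕ)² ≤ Σ_{j,i} |c_j c_i| m_ji / (m_j m_i)."""
    n = len(c)
    return sum(abs(c[j] * c[i]) * m_pair[j][i] / (m_pair[j][j] * m_pair[i][i])
               for j in range(n) for i in range(n))

# Two PS always measured together (m_j = m_i = m_ji = m): the bound reproduces
# the maximal pair contribution (c_j² + c_i² + 2|c_j c_i|)/m quoted in the text.
m = 100
print(error_bound([1.0, 2.0], [[m, m], [m, m]]))   # (1 + 4 + 2·2)/100 = 0.09
```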

Algorithm
In the following, we present our algorithm AEQuO, which is outlined in the diagram in Fig. 2. First, the graph is constructed in time O(n²). At this stage, it is possible to choose whether or not to connect vertices whose associated PS are non-bitwise commuting. As discussed below in Sec. 5, restricting to bitwise commuting PS results in higher errors (∆Õ)². However, simultaneously measuring non-bitwise commuting PS requires entangling gates, which in the context of NISQ devices are expensive. AEQuO includes a subroutine for finding a suitable circuit that diagonalizes a group of commuting PS [16,34,35]. This subroutine is classically efficient for large values of the numbers n and d of PS and qubits, respectively, and the resulting circuit is guaranteed to contain at most d(d − 1)/2 (zero) entangling gates when allowing (forbidding) non-bitwise commutation. After the graph is built, AEQuO finds a desired number r of maximal cliques. We prioritize cliques that are statistically likely to have bigger contributions to the error (∆Õ)² in Eq. (2a) (see App. D and [34]), and only consider maximal ones for the reasons explained above. However, our protocol may also operate on all cliques. The runtime required to find the clique cover scales as O(r). Efficient algorithms for finding maximal cliques are known, e.g., the Bron-Kerbosch algorithm [36].

Figure 2: Depending on the number of cliques r and the measurement budget M, the BF or the ML algorithm is chosen. Afterwards, the system is probed, and the vertices and edges of the graph are updated according to the outcomes. If M shots have been taken, the post-processing yields the desired estimates for the average and error. Otherwise, another round of allocation and measurements is performed. Green and light blue boxes represent essential and optional subroutines, respectively. In fact, the user is free to choose between the BF and the ML algorithm, while post-processing can be turned off if desired.
We now assign weights to the vertices and edges of the graph constructed from the observable O. We resort to Bayesian estimation (see App. B) for both the initialization and all subsequent updates of the weights of the graph, which gives meaningful results with scarce statistics or even in the absence of any data. This is crucial due to the adaptive nature of our protocol, and allows us to initialize the graph without any pre-sampling. Indeed, when no shots are taken, Bayesian estimation prescribes P̃_j = 0 and Q̂_ji = 2δ(i − j)/3 for all j, i = 1, . . . , n.
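One simple Bayesian model reproducing these prior values can be sketched as follows. We stress that the flat Beta(1,1) prior below is an assumption of this sketch, chosen because it reproduces the quoted no-data values P̃_j = 0 and Q̂_jj = 2/3; the paper's full Bayesian treatment is in App. B.

```python
def bayes_pauli(k_plus, m):
    """Posterior mean P̃ and expected outcome variance E[1 − μ²] for a single PS
    after observing k_plus outcomes +1 out of m shots, under a flat Beta(1,1)
    prior on the probability p of the +1 outcome (an assumption of this sketch)."""
    a, b = 1 + k_plus, 1 + (m - k_plus)          # posterior Beta parameters
    e_p = a / (a + b)                            # E[p]
    e_p2 = e_p * (a + 1) / (a + b + 1)           # E[p²] for a Beta(a, b)
    mean = 2 * e_p - 1                           # μ = 2p − 1 maps p to ⟨P⟩
    e_mu2 = 4 * e_p2 - 4 * e_p + 1               # E[μ²]
    return mean, 1 - e_mu2

print(bayes_pauli(0, 0))   # → (0.0, 2/3): the no-data initialization quoted above
```

With no data, the uniform prior on μ ∈ [−1, 1] gives E[1 − μ²] = 1 − 1/3 = 2/3, matching the prescribed diagonal Q̂_jj; as shots accumulate, the estimates smoothly approach the sample statistics.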
Once the r cliques have been found and the graph has been initialized, the system can be measured. In order to choose the clique to be probed at each shot, we propose two possible subroutines whose objective is to minimize the cost function (∆Õ)² in Eqs. (2) [for a numerically efficient method to compute (∆Õ)², see App. C]. The first choice is a greedy "bucket-filling" (BF) algorithm [37] whose premise is to allocate measurements one by one by evaluating the predicted cost function before each shot. After a certain number of measurements (the M shots are divided into L chunks of increasing sizes, as explained below), the graph is updated and the BF algorithm is run again. This approach works well for instances where r and M are small, since the number of cost function evaluations increases linearly in these variables.
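The greedy allocation step can be sketched as below. The cost function is our reading of Eq. (2a) (covariances of sample means weighted by m_ji/(m_j m_i)), and the penalty for still-unmeasured strings is an assumption of this sketch standing in for the paper's delta-function terms.

```python
import numpy as np

def predicted_error(c, Q, m_pair):
    """(ΔÕ)² for a given allocation; an unmeasured string contributes a
    conservative 2c_j² penalty (an assumption of this sketch)."""
    n = len(c)
    err = 0.0
    for j in range(n):
        for i in range(n):
            if m_pair[j][j] and m_pair[i][i]:
                err += c[j] * c[i] * m_pair[j][i] * Q[j][i] / (m_pair[j][j] * m_pair[i][i])
            elif i == j:
                err += 2 * c[j] ** 2
    return err

def bucket_filling(c, Q, cliques, M):
    """Greedy BF sketch: give each of the M shots to the clique whose
    measurement yields the smallest predicted error."""
    n = len(c)
    m_pair = np.zeros((n, n), dtype=int)
    counts = [0] * len(cliques)
    for _ in range(M):
        best, best_err, best_trial = None, None, None
        for k, clique in enumerate(cliques):
            trial = m_pair.copy()
            for j in clique:         # one extra joint shot for every pair
                for i in clique:
                    trial[j][i] += 1
            e = predicted_error(c, Q, trial)
            if best_err is None or e < best_err:
                best, best_err, best_trial = k, e, trial
        m_pair = best_trial
        counts[best] += 1
    return counts, m_pair

# Fig. 1 example with maximal cliques {P1,P2} and {P2,P3} and flat variances:
counts, _ = bucket_filling([1.0, 1.0, 1.0], np.eye(3), [(0, 1), (1, 2)], 10)
print(counts)   # symmetric problem, so the budget is split roughly evenly
```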
As an alternative to the BF method, we develop an approach inspired by a recent structural optimization algorithm based on machine learning (ML) [38]. We reparametrize the vector of clique measurements, which is the set of optimization variables in the BF approach, in terms of the weights and biases of a densely connected neural network. The output of the network yields the (locally) optimal measurement allocations after learning occurs. This approach alters the optimization landscape by increasing the number of parameters.
In contrast to BF, the ML algorithm allocates a fraction of the total budget to the most promising cliques. Subsequently, the graph is updated and the ML algorithm is run again. We perform L iterations of the process, at each step updating the covariance matrixQ with the experimental outcomes and progressively increasing the available shots by a factor l > 0 until all M measurements are exhausted. Thus, for large problem sizes the ML subroutine scales more favorably compared to the BF algorithm, since it requires fewer repetitions (L for ML versus M for BF), and is more efficient for large values of r.
The ML approach is based on minimizing the cost function, Eq. (2a), over the trainable parameters of the neural network. We implement a variety of different optimizers to determine the weights and activation biases [34], including stochastic gradient descent [39], Adam [40], and a Limited-Memory BFGS method [39,41,42]. These different optimizers yield similar performances; for instance, for Figs. 3 and 4 the stochastic gradient descent and the Limited-Memory BFGS methods, respectively, were employed.
The architecture of the neural network we settled on is composed of three densely connected layers of width r, with ReLU activation functions on the hidden layers and a softmax activation function on the output [43]. Such a structure yields a probability distribution at the output, corresponding to the ratio of times that each clique should be measured. The output of the neural network is then converted to an integer measurement allocation vector by first scaling by the number of measurements to be performed, then flooring the entries, and finally allocating excess measurements to the cliques with the largest percentage change in budget due to the flooring operation [34]. Before the learning phase, the network is initialized such that each clique is measured the same number of times and measurements between various cliques are initially uncorrelated.
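The scale-floor-redistribute conversion described above can be sketched directly (the function name is ours):

```python
import numpy as np

def integer_allocation(probs, shots):
    """Convert the softmax output into integer shot counts: scale by the budget,
    floor, then give the leftover shots to the cliques with the largest relative
    loss from flooring."""
    scaled = np.asarray(probs, dtype=float) * shots
    alloc = np.floor(scaled).astype(int)
    leftover = shots - int(alloc.sum())
    rel_loss = (scaled - alloc) / np.maximum(scaled, 1e-12)   # guard against 0
    for k in np.argsort(rel_loss)[::-1][:leftover]:
        alloc[k] += 1
    return alloc

print(integer_allocation([0.25, 0.35, 0.4], 10))   # entries sum to the budget
```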
After completing all measurements, post-processing can be applied. In this step of the protocol, we determine the estimator with the lowest variance that can be realized with the available data. For each pair P_j and P_i such that m_ji ≠ 0 and c_j c_i C(P_j, P_i) > 0, we find all cliques where these strings have been simultaneously probed, and consider all possibilities of removing either of them or keeping the estimator as it is. Eventually, the configuration minimizing the updated error function (∆Õ)² is kept. The runtime of this procedure scales exponentially in the size of the subset of cliques where P_j and P_i have been measured simultaneously. In practical examples, this number is small and post-processing can be used for large values of n. For instance, post-processing has been used for all numerical results presented in Fig. 3, where we considered observables with up to n = 1086 terms, corresponding to the Hamiltonian of H₂O.
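The exhaustive search over removal configurations can be sketched as follows. The error function is our reading of Eq. (2a), and the `(j, others, shots)` encoding of a candidate removal is our own simplification of the procedure in the text.

```python
import numpy as np
from itertools import product

def error(c, Q, m_pair):
    """Our reading of Eq. (2a): (ΔÕ)² = Σ_{j,i} c_j c_i m_ji Q̂_ji / (m_j m_i)."""
    n = len(c)
    return sum(c[j] * c[i] * m_pair[j][i] * Q[j][i] / (m_pair[j][j] * m_pair[i][i])
               for j in range(n) for i in range(n)
               if m_pair[j][j] and m_pair[i][i])

def post_process(c, Q, m_pair, removable):
    """Try every combination of candidate removals — (j, others, shots) meaning
    'drop P_j from a clique that contributed `shots` joint shots with the strings
    in `others`' — and keep the configuration with the smallest error."""
    best, best_err = np.array(m_pair), error(c, Q, np.array(m_pair))
    for keep in product([False, True], repeat=len(removable)):
        trial = np.array(m_pair)
        for flag, (j, others, shots) in zip(keep, removable):
            if flag:
                trial[j][j] -= shots         # P_j loses that clique's shots
                for i in others:
                    trial[j][i] -= shots     # ...and its joint counts
                    trial[i][j] -= shots
        if any(trial[j][j] <= 0 for j in range(len(c))):
            continue   # never leave a PS completely unmeasured
        e = error(c, Q, trial)
        if e < best_err:
            best, best_err = trial, e
    return best, best_err

# Fig. 1(d) cover, 100 shots per clique, with an illustrative large positive Q̂_12:
c = [1.0, 1.0, 1.0]
Q = np.array([[1.0, 0.9, 0.0], [0.9, 1.0, -0.5], [0.0, -0.5, 1.0]])
m_pair = np.array([[100, 100, 0], [100, 200, 100], [0, 100, 100]])
removable = [(0, [1], 100), (1, [0], 100)]   # drop P1 or P2 from clique {P1,P2}
best, err = post_process(c, Q, m_pair, removable)
print(err)   # dropping P2 from {P1,P2} lowers the error from 0.029 to 0.020
```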
In Fig. 3, we benchmark AEQuO against four state-of-the-art protocols: LDF, OGM, Derand, and APS. In the first two, the goal is to minimize the number of cliques in the resulting cover, while the latter two reconstruct the desired estimator from the outcomes of partially random as well as deterministically allocated measurements. We choose these as representatives of the variety of algorithms based on different grouping strategies [16-20, 27, 30-32] and the classical shadow technique [21-26]. AEQuO is run both for L = 1 and L = 2 (in this case l = 9) using the ML subroutine, and outperforms the other approaches in determining more precise estimates.
As discussed in more detail below, this improvement has two reasons. First, we allow for both non-bitwise commutation relations between PS and overlapping cliques. On the one hand, this allows us to gather more information from the same number of shots. On the other hand, it increases the number of non-zero elements Q̂_ji in Eqs. (2). While these covariances can either lower or increase the error (∆Õ)², the adaptive nature of AEQuO (see next paragraphs) and the post-processing ensure that negative ones are preferred. We remark that our version of the LDF protocol also allows for non-bitwise commutation relations; as a result, it typically outperforms OGM, APS and Derand. As shown in App. E, when restricted to bitwise commutation relations, LDF yields errors that are generally higher than those of all other protocols.
Second, AEQuO allocates the shots by directly minimizing the estimated error (∆Õ)² in Eqs. (2) based on the available experimental information. Provided the ML and/or the BF subroutines find the global minimum of (∆Õ)², this results in AEQuO finding the estimator Õ that, constrained by the used cliques and the transient knowledge about Q̂, is characterized by the smallest possible error. This is supported by Fig. 3. For L = 1, there is no memory of prior outcomes; in this case, the points obtained with either subroutine lie on a horizontal line. However, for L = 3 and l = 4, AEQuO learns and uses the gathered information to find better strategies for allocating the remaining shots. For large M, asymptotes are recovered, which are considerably lower than the ones for L = 1. Importantly, the advantage from the adaptive nature of AEQuO comes for free, in the sense that it follows exclusively from a better allocation of the measurement budget and does not require extra resources (such as entangling gates for the simultaneous diagonalization of the PS). We investigate the improvement resulting from the adaptivity of AEQuO in more detail in Fig. 4 and in App. E, where we restrict all protocols considered in this section to bitwise commutation relations.
While the BF and ML subroutines yield comparable results, Fig. 3(c) indicates that they outperform each other in different parameter regimes, depending on the problem at hand. This is a consequence of several factors. By construction, the BF algorithm picks the clique to be measured based on the gradient of (∆Õ)². On the other hand, the ML subroutine (in this context) follows a stochastic gradient descent algorithm [39] that allows for exploring the error landscape. However, its performance depends on hyperparameters [34,38,40] that, in Fig. 3, have not been optimized in order to keep the computational requirements low.
As explained above, our protocol has three distinctive features. Besides the ability to monitor and determine the error (∆Õ)² [see Eqs. (2)], which allows for better allocation of the measurements and assessment of the precision of the estimator, AEQuO can also exploit both non-bitwise commutation between PS and overlaps between cliques, and it is adaptive. In the remainder of this section, we investigate the advantages provided by these last two features both with chemistry Hamiltonians and with a family of 2D and 3D lattice models of interacting spin-1/2 particles (for details, see App. F).
In Fig. 4(a), we list averaged errors (∆Õ)² for several chemistry Hamiltonians, with M = 10⁴ and the ground state as input [9]. AEQuO is run with different settings, namely (not) allowing non-bitwise commutation relations between the PS, and with adaptive allocation turned either on (L = 3, l = 4) or off (L = 1). As expected, the smallest errors are obtained when AEQuO can learn from previous outcomes (L = 3, l = 4) and non-bitwise commuting PS can be simultaneously measured [indicated by GC in Fig. 4(a)]. In the case of chemistry Hamiltonians with their ground state as input, the advantage is remarkable, as can be seen by comparing AEQuO restricted to bitwise commutation (BC) and L = 1 against the best results (bold numbers). For comparison, up to statistical noise, the Derand [21] and OGM [27] protocols yield results that are similar to AEQuO's when we restrict it to bitwise commutation and turn the adaptive allocation off (BC, L = 1). The direct comparison between the BC versions of AEQuO and all other protocols considered in this section is presented in App. E.
A similar analysis is made in Fig. 4(b-e), where we considered a generalized Hubbard model [44] in which qubits are located on the edges of the 2D [panels (b) and (d)] and 3D [panels (c) and (e)] lattices drawn at the bottom of the figure. As explained in App. F, they are allowed to hop with (many-body) flip-flop interactions, and we include energy shifts that depend on the states of qubits on neighbouring edges. We use the ground state as input, set M = 10⁴, and run AEQuO with the same settings as in Fig. 4(a): allowing for (squares) or forbidding (circles) non-bitwise commutation, and turning the adaptive allocation on (red and violet points) or off (blue and green points).
In panels (b-c), we show the relative increase of the error (∆Õ)² compared to (∆Õ)²_opt, that is, the value obtained by AEQuO when allowing for non-bitwise commutation and providing prior knowledge of the input state. In practice, prior knowledge is given by initializing the covariance matrix Q̂_ji in Eq. (2b) with cov(P_j, P_i) for all i, j = 1, . . . , n. Therefore, the data points in Fig. 4(b-c) quantify the error increase incurred when this prior knowledge is not available. It is evident from Fig. 4 that the advantages from the adaptive allocation and non-bitwise commutation are consistent, yet vary considerably between data points. All red squares in panels (b) and (d) are approximately zero, indicating that AEQuO learns in the process and, for sufficiently large M, allocates the measurements as if it knew the input state beforehand. All other coloured data points lie above, suggesting that it is detrimental not to exploit non-bitwise commutation and adaptive allocation.
How advantageous these features are depends strongly on the considered observable and state. Empirically, non-bitwise commutation is more beneficial if the Hamiltonian is dominated by long-range many-qubit interactions, such that two commuting PS are more likely to commute non-bitwise. Chemistry Hamiltonians fall within this category. On the other hand, in most of the considered examples, adaptive allocation yielded precision improvements varying between 25% and 400%, with a tendency to increase with the numbers of qubits d and PS n.

Conclusions and outlook
We introduced AEQuO, an adaptive algorithm to estimate quantum observables expressed as a sum of Pauli operators. By allowing for both overlap between groups of PS that are simultaneously measured and general commutation relations, we gather more information per shot and introduce possibly negative covariances into Eqs. (2). AEQuO is capable of allocating the remaining measurement budget depending on previous outcomes, it post-processes the estimator to lower its error, and it gives estimates both for the average Õ and the variance (∆Õ)² of the considered observable. Our protocol is based on two subroutines which provide either near-optimal (BF) or computationally efficient (ML) allocation of the measurement budget. For the observables considered here, both subroutines result in similar errors (∆Õ)².
We tested our algorithm on several Hamiltonians with different settings that include allowing or forbidding non-bitwise commutation between PS, and turning the adaptive allocation on (L > 1) or off (L = 1). Our protocol yields numerical results that outperform state-of-the-art estimators based on various grouping techniques [16,18,25,27] and the classical shadow method [21,25,26]. We also studied the advantages our algorithm gains from adaptive allocation as well as non-bitwise commutation. We found that, while being highly problem dependent, these advantages are consistent.
There are different possibilities for improving our algorithm. Generating the list of all maximal cliques of the graph is computationally demanding for large problem instances. In this case, one could find better strategies to select a subset of (maximal) cliques, and this subset can in principle be updated while measurements are taken. Designing a better method of choosing suitable cliques could increase the performance of our algorithm without sacrificing the quality of the results. The ML-based subroutine could be further improved by employing graph neural networks to leverage the graph structure of the problem [45]. Another possibility is the extension of our adaptive algorithm to non-qubit-based hardware and, in view of the Bayesian framework underlying our estimation, the direct inclusion of the experimental characteristics of the considered platform via hierarchical modelling [46].

Acknowledgments

We are also grateful to Jinglei Zhang for valuable feedback on a prior version of this manuscript. This work has been supported by Transforma-

Added notes
The algorithms developed in the present work are available at [34].

Appendix A A minimal example
In this section we consider the example presented in Fig. 1(b-e), consisting of the operator

  O = σ^x_1 σ^x_2 + σ^z_1 σ^z_2 + σ^z_2.   (3)

For this operator O and a generic input state ρ_in, we are interested in finding the best possible estimator (viz. a clique cover), as well as the optimal allocation of the M measurements.
In order to do so, we assume that the three PS P_j (j = 1, 2, 3) have been previously probed infinitely many times, such that the values Q̂_ji → tr{ρ_in P_j P_i} − tr{ρ_in P_j} tr{ρ_in P_i} are available beforehand. For the sake of clarity, we remark that this assumption was not made when using AEQuO [see Fig. 1(e)]. This demonstrates that our approach, for a generic input state, is more likely to provide lower errors (∆Õ)² than any of the possible covers associated to the operator O in Eq. (3), despite the disadvantage of not knowing Q̂ beforehand.
For a generic input state ρ_in, one of the four covers presented in Fig. 1(b-d) yields the smallest error (∆Õ)²_opt, provided the M measurements have been allocated optimally. The idea is to find this (∆Õ)²_opt and compare it with the errors (∆Õ)² obtained with the different covers in Fig. 1(b-d) and with AEQuO in Fig. 1(e). The relative difference [(∆Õ)² − (∆Õ)²_opt]/(∆Õ)²_opt plotted in the histograms is thus representative of how well the associated covers (or AEQuO) perform over a wide variety of random input states ρ_in. Specifically, the ρ_in are pure states uniformly distributed on the Bloch sphere.
As a first step, we describe how to determine the optimal allocation of the M measurements for any clique cover. We start by rewriting the errors (∆Õ)² associated to the four different covers as a sum of the elements of the matrices in Eqs. (4), (5) and (6) (recall c_j = 1 for all j = 1, 2, 3). The two alternatives in the curly bracket in Eq. (5) indicate the two possible cases where one clique is maximal and the other is not, as depicted in Fig. 1(c). These matrices correspond to the ones pictorially represented in Fig. 1(b-d), and fully determine the expected error (∆Õ)² for any choice of ξ₁, ξ₂ and (for the cover without maximal cliques) ξ₃ [see Eqs. (2)]. These parameters represent the number of times that the corresponding clique is measured, and can be obtained from m_j and m_ji (and vice versa, see Sec. C). From Eqs. (4), (5) and (6) it is possible to determine which clique each PS belongs to by looking at the subscripts of the parameters ξ_j in the diagonal elements. If Q̃_jj is divided by ξ₁, ξ₂ or ξ₃, the PS P_j belongs to the first, second or third clique, respectively. Similarly, in Eq. (6), the element Q̃₂₂/(ξ₁ + ξ₂) indicates that the PS P₂ is measured any time either clique is probed, in agreement with Fig. 1(d).
In order to determine the optimal allocation of the M measurements for any of the covers in Fig. 1(b-d), we minimize the corresponding error (∆Õ)² in Eqs. (4), (5) or (6) under the additional constraints that Σ_j ξ_j = M and that all ξ_j are nonnegative integers. We note that a similar problem (without allowing for an overlap of cliques) has been studied in Ref. [47]. Since we assume that the exact values of Q_ji are known, the errors (∆Õ)² derived with the optimal ξ_j are the minima achievable with the associated cover for the considered input state ρ_in and the measurement budget M. Furthermore, the four covers presented in Fig. 1(b-d) are all the non-redundant covers that can be found for the operator O in Eq. (3); hence, out of these four values of (∆Õ)², the smallest one is the minimal error (∆Õ)²_opt that can generally be obtained when measuring O with M shots on ρ_in.
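For small M and few cliques, the constrained integer minimization described above can be carried out by brute force. The sketch below uses a toy non-overlapping cover whose per-clique contributions v_k are illustrative numbers, not taken from the paper:

```python
from itertools import combinations

def allocations(M, r):
    """All r-tuples of nonnegative integers summing to M (stars and bars)."""
    for bars in combinations(range(M + r - 1), r - 1):
        prev, xs = -1, []
        for b in bars:
            xs.append(b - prev - 1)
            prev = b
        xs.append(M + r - 2 - prev)
        yield tuple(xs)

def best_allocation(error, M, r):
    """Minimize error(xi) over integer allocations with sum(xi) = M.
    Each clique gets at least one shot to keep the error finite."""
    best = min((a for a in allocations(M, r) if all(x > 0 for x in a)),
               key=error)
    return best, error(best)

# Toy cover without overlap: clique k contributes v_k / xi_k to (Delta O~)^2.
v = [1.0, 4.0]
error = lambda xi: sum(vk / xk for vk, xk in zip(v, xi))

xi_opt, err_opt = best_allocation(error, M=30, r=2)
# The continuous optimum allocates shots proportionally to sqrt(v_k),
# i.e. xi = (10, 20) here, with error 1/10 + 4/20 = 0.3.
```

The same enumeration applies to the overlapping covers of Eqs. (5) and (6); only the error function changes.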
Once the lowest possible errors associated to each cover have been determined, it is possible to test whether one cover always yields the minimal error (∆Õ)²_opt. One may think that the cover in Fig. 1(d), consisting of maximal cliques only, always outperforms the others, since it probes more PS per shot. However, this is not the case, due to possibly positive covariances Q̃_ji (j ≠ i) in the off-diagonal terms in Eq. (6).⁴ In fact, it turns out that the error from the cover made of maximal cliques only equals the minimal (∆Õ)²_opt in ∼54% of the cases. In the other ∼46% of the cases, (∆Õ)²_opt corresponds to one of the two covers in Fig. 1(c).
Despite not always being the optimal one, the cover made of maximal cliques does outperform all other covers on average. This motivated the choice of using maximal cliques only when probing the system, followed by the post-processing unit that removes vertices from the cliques based on their covariances and their contribution to (∆Õ)² within the considered clique. In fact, by comparing Fig. 1(e) with Fig. 1(b-d), we see that on average AEQuO provides smaller errors than the theoretical minima that can be obtained with the four covers considered in the figure.

B Bayesian estimation for the graph
In this section, we use Bayesian inference [48,49] to estimate the averages, variances and covariances of the PS spanning O. We consider a pair of PS P_j and P_i whose measurements are collected in the dataset D. From D, it is possible to obtain the following quantities. We denote by $s_\lambda^\pm$ (with $s_\lambda^+ + s_\lambda^- = m_\lambda$) the total number of times an outcome ±1 is collected from measurements of P_λ, for λ = j, i. The associated underlying (but unknown) probability is denoted by θ_±(λ), where the argument λ will be dropped whenever it is clear from context. Similarly, $s_{\pm\pm}$ is the number of times that P_j and P_i yielded the corresponding {±1, ±1} pair of outcomes, which is characterized by a probability θ_±±. Finally, $w_\lambda^\pm$ is the number of occurrences of ±1 in the cases where P_j and P_i have been probed independently. It follows that $w_j^+ + w_j^- = m_j - m_{ji}$ and $w_i^+ + w_i^- = m_i - m_{ji}$. Using the collected data D, our goal is to estimate the posterior probability P(θ|D) that best describes the parameters in our measurement process. By Bayes' theorem, $P(\theta|D) = P(D|\theta)P(\theta)/P(D)$, where the likelihood P(D|θ) is the probability of obtaining the dataset D given θ. The prior P(θ) is the probability of θ before the current evidence or dataset D is observed. The probability P(D) is also called the marginal likelihood. Following standard procedures [48,49], we consider P(θ) = Dir(K, a), where Dir indicates the Dirichlet distribution of order K and a = {a_1, ..., a_K} are its hyperparameters [48,49]. Then P(θ|D) = Dir(K, c + a), where c = {c_1, ..., c_K} is a vector giving the number of occurrences of each category in the dataset D.

⁴ We remark that, in general, the best measurement strategy is the one measuring all PS at once. However, when some of the PS do not commute, and as such their covariances cannot contribute to (∆Õ)², it is sometimes advantageous not to measure a PS, for the reasons explained in Sec. 3.
The Dirichlet distribution of order K has the probability density function $\mathrm{Dir}(K, a) = \frac{\Gamma(\sum_k a_k)}{\prod_k \Gamma(a_k)} \prod_{k=1}^{K} \theta_k^{a_k - 1}$, where $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx$ is the gamma function. The likelihood function P(D|θ) then takes the multinomial form $P(D|\theta) \propto \prod_{k=1}^{K} \theta_k^{c_k}$, which leads to the posterior P(θ|D) = Dir(K, c + a) in Eq. (10). With the posterior probability P(θ|D) in Eq. (10), we can determine the quantities of interest for the measurement protocol considered in this work by marginalization. In more detail, assume we are interested in estimating a certain quantity f(θ). We define its estimator f̃ as the posterior mean, $\tilde f = \int d\theta\, f(\theta)\, P(\theta|D)$ [Eq. (11)]. This equation is used for determining P̃_j and the elements of the covariance matrix Q̃ (see Sec. 3).
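The conjugate update and the estimator of Eq. (11) can be sketched as follows. The counts are hypothetical, and the Monte Carlo marginalization is our illustration of the posterior mean (the paper evaluates it analytically):

```python
import numpy as np

rng = np.random.default_rng(7)

# Conjugate update of Eq. (10): prior Dir(K, a), counts c -> posterior Dir(K, c + a).
K = 4
a = np.ones(K)                   # uniform prior
c = np.array([12, 3, 4, 1])      # hypothetical outcome counts from the dataset D
posterior = c + a                # Dirichlet parameters of P(theta|D)

def posterior_mean(f, params, n_samples=100_000):
    """Eq. (11): estimate f~ as the mean of f over posterior samples."""
    thetas = rng.dirichlet(params, size=n_samples)
    return float(np.mean([f(t) for t in thetas]))

# The posterior mean of theta_1 has the closed form (c_1 + a_1) / sum(c + a).
mc = posterior_mean(lambda t: t[0], posterior)
exact = posterior[0] / posterior.sum()   # = 13 / 24
```

For the quantities needed here (means, variances, covariances) the integrals have closed forms, so the sampling above is purely illustrative.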
In the following, we explicitly derive P̃_j and the variance Q̃_jj of a single PS P_j in Sec. B.1. In Sec. B.2, we consider the case in which two PS P_j and P_i are measured together, and we estimate their covariance. As explained in Sec. B.3, this estimate for their covariance may lead to wrong errors (∆Õ)² when M is small. We then obtain a valid estimate for Q̃_ji (j ≠ i) by considering the case in which the PS may also have been measured independently from each other. Finally, we summarize the results and explain how Bayesian estimation is used in AEQuO in Sec. B.4.

B.1 Single Pauli string
Whenever we measure a single PS P_j, we obtain a dataset D which is a collection of +1 and −1 outcomes, from which $s_j^+$ and $s_j^-$ can be found. In this case, K = 2, θ = {θ_+, θ_-} with θ_- = 1 − θ_+, and for simplicity we set D → c = {s_j^+, s_j^-}. From Eq. (10) we get the posterior probability $P(\theta_+|D) \propto \theta_+^{s_j^+} (1-\theta_+)^{s_j^-}$, where we set a_1 = a_2 = 1. This choice for a, corresponding to a uniformly distributed prior, is generally taken when no information is available prior to the measurement.
The exact values P_j and (∆P_j)² of the average and variance of the PS P_j can be expressed in terms of the probabilities θ as $P_j = \theta_+ - \theta_-$ and $(\Delta P_j)^2 = 1 - P_j^2$. Using these expressions in place of f(θ) in Eq. (11), we obtain the estimates $\tilde P_j = (s_j^+ - s_j^-)/(m_j + 2)$ and $\tilde Q_{jj} = 4(s_j^+ + 1)(s_j^- + 1)/[(m_j + 2)(m_j + 3)]$. These relations are used by our algorithm to update vertices and self-edges in the graph [see, e.g., Fig. 1(a)].
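A minimal sketch of these single-PS estimates; since Eq. (14) is not reproduced above, the closed forms below are our reconstruction of the posterior means under the uniform prior:

```python
from fractions import Fraction

def single_ps_estimates(s_plus, s_minus):
    """Posterior-mean estimates for a single Pauli string under the
    uniform Beta prior of Eq. (12). Derived from Eqs. (11) and (13);
    Eq. (14) itself is elided in the text, so treat these closed forms
    as a reconstruction."""
    m = s_plus + s_minus
    # P~_j = E[theta+ - theta-] = (s+ - s-) / (m + 2)
    p = Fraction(s_plus - s_minus, m + 2)
    # Q~_jj = E[1 - (theta+ - theta-)^2]
    #       = 4 (s+ + 1)(s- + 1) / ((m + 2)(m + 3))
    q = Fraction(4 * (s_plus + 1) * (s_minus + 1), (m + 2) * (m + 3))
    return p, q

p, q = single_ps_estimates(8, 2)
# With no data at all, the estimates reduce to P~ = 0 and Q~ = 2/3,
# i.e. valid values even with zero statistics.
```

Note that, unlike the sample variance, Q̃_jj never vanishes for finite datasets, which is what makes the budget allocation well defined from the first shot.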

B.2 Two PS always measured together
The procedure used in the previous section to determine P̃_j and Q̃_jj can be repeated here for the covariance Q̃_ji (j ≠ i), provided we have meaningful likelihood and prior functions. In this case, we deal with two PS P_j and P_i, and as such we have four possible outcomes described by the probabilities θ_++, θ_+−, θ_−+ and θ_−− = 1 − θ_++ − θ_+− − θ_−+. Furthermore, we assume here that P_j and P_i are always measured together; the generalization is presented in Sec. B.3. The posterior in Eq. (15) is then the Dirichlet distribution of order K = 4 with counts c = {s_++, s_+−, s_−+, s_−−}, where we left the parameters a = {a_++, a_+−, a_−+, a_−−} in their explicit form for reasons that will become clear in the following Sec. B.3.
Here, we assume a_++ = a_+− = a_−+ = a_−− = 1, as in the single-PS case, since no information is available prior to the measurements.
By expressing the exact covariance cov(P_j, P_i) in terms of the probabilities θ as $\mathrm{cov}(P_j, P_i) = \theta_{++} + \theta_{--} - \theta_{+-} - \theta_{-+} - P_j P_i$, with $P_j = \theta_{++} + \theta_{+-} - \theta_{-+} - \theta_{--}$ and $P_i = \theta_{++} - \theta_{+-} + \theta_{-+} - \theta_{--}$, and using Eqs. (11) and (15), we can determine a possible estimate Q̃_ji (j ≠ i) [Eq. (17)]. This equation, while representing a valid estimate for the covariance, is not used in our algorithm to obtain Q̃_ji. Indeed, since we allow for overlap among the cliques, there are cases in which two PS are both measured individually and together, and Eq. (17) disregards all outcomes in which P_j and P_i are probed independently. In the following section, we study two possibilities for including these events into the estimation of Q̃_ji, such that we do not neglect any collected information on the system and always obtain a meaningful estimate of the error (∆Õ)².

B.3 Two PS measured together and individually
As anticipated at the end of the previous section, Eq. (17) ignores all events where P_j and P_i were not measured together. This has two implications. First, we discard available information that could be used to obtain a more precise estimate of the covariance. Second, and more importantly, in case of scarce statistics Eqs. (17) and (14) may lead to a wrong estimate of the error (∆Õ)². This is because the variances Q̃_jj and the covariances Q̃_ji (j ≠ i) are derived independently from each other, ignoring the fact that they are correlated. In fact, it is possible to express the probabilities θ_±± in terms of θ_±(j), θ_±(i), and the exact cov(P_j, P_i) as $\theta_{\pm\pm} = \theta_\pm(j)\,\theta_\pm(i) \pm \mathrm{cov}(P_j, P_i)/4$, where the sign in front of the covariance is positive when the two outcomes coincide and negative otherwise, and where we recall that the subscripts ± in θ_±± refer to the PS P_j and P_i, in this order. Since the relations in Eqs. (18) are not always satisfied by Eqs. (17) and (14), it may happen that in measuring an observable we obtain an estimate resulting in a wrong error (∆Õ)². As an example, assume that we take 25 measurements of each of P_1 and P_2 independently, and always get +1 and −1 outcomes, respectively. Afterwards, we measure these PS together twice and get the two pairs {−1, +1} and {+1, −1}. In this case, our estimates for the variances will be small, while we get a comparatively large negative covariance. This leads to a covariance matrix Q̃ with Q̃_11 + Q̃_22 + 2Q̃_12 = −0.07 < 0, i.e., Q̃ is not positive semidefinite, which is not physically allowed.
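The pathology in this example can be reproduced with plain sample estimators (the −0.07 quoted above comes from the Bayesian estimates, so only the qualitative effect, a negative quadratic form, is recovered here):

```python
import numpy as np

# 25 individual shots of each PS, followed by 2 joint shots.
p1 = np.array([+1] * 25 + [-1, +1])
p2 = np.array([-1] * 25 + [+1, -1])

var1 = np.var(p1)                  # small: P_1 looks almost deterministic
var2 = np.var(p2)                  # small: P_2 looks almost deterministic
joint1, joint2 = p1[-2:], p2[-2:]  # only the joint shots enter the covariance
cov12 = np.mean(joint1 * joint2) - joint1.mean() * joint2.mean()  # large, negative

# Variance of the estimator of P_1 + P_2 must be nonnegative; here it is not,
# so this naive covariance matrix is nonphysical.
quad_form = var1 + var2 + 2 * cov12
```

Treating variances and covariances as independent quantities is what breaks the constraint; the constructions below restore it.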
To prevent this from happening, we describe two alternative ways for estimating the covariance. The first one is more precise but unpractical, since it yields a result that cannot be efficiently evaluated numerically. The second is less precise, but computationally advantageous.
The first way consists of defining a probability distribution function describing the measurements of two PS that are both independently and simultaneously probed, in which Eqs. (18) are satisfied. To do so, we start from the joint posterior distribution P(θ|D) in Eq. (19), of order K = 8, which is the product of the three distinct distributions associated to P_j and P_i being measured together and independently [Eq. (20)]. Here, θ_−− = 1 − θ_++ − θ_+− − θ_−+, θ_−(λ) = 1 − θ_+(λ) for λ = j, i, and we set a to be the all-ones vector in the following. Since K = 8, there are eight probabilities in θ, but only four are independent. Indeed, by inverting the relations in Eqs. (18), we obtain $\theta_+(j) = \theta_{++} + \theta_{+-}$, $\theta_+(i) = \theta_{++} + \theta_{-+}$, and $\mathrm{cov}(P_j, P_i) = 4[\theta_{++} - \theta_+(j)\,\theta_+(i)]$. Using these, we can define a new probability distribution P(θ|D) as in Eq. (21), where only the independent variables θ_±± are used to describe the measurements. From P(θ|D), repeating the same steps as in Sec. B.2, it is then possible to determine the analytical form of the expected covariance Q̃_ji (j ≠ i) when including events in which the PS P_j and P_i are measured both independently and together. This form of the covariance (which is lengthy and thus not explicitly reported) ensures that the error (∆Õ)² and the covariance matrix Q̃ are always well defined and rapidly converge to the exact values. However, since the probability distribution function P(θ|D) in Eq. (21) is not a Dirichlet distribution, the resulting form of the covariance cannot be efficiently computed numerically. In more detail, we obtain a complicated sum of factorials which we were unable to simplify, and which requires a long time to be evaluated when the dataset D is large.
There is a second alternative to estimate the covariance that both includes the information from the PS being measured independently, and avoids getting wrong errors (∆Õ) 2 . Rather than defining a joint probability distribution as before, we modify the hyperparameters a in the prior in Eq. (8a) to obtain a more precise posterior distribution P ( θ|D). In other words, we are going to modify the values of a in Eq. (15) depending on the outcomes w ± j and w ± i . As such, the dataset described by c, as well as the probabilities θ, are going to be the same as in Sec. B.2.
The first step consists in observing that, in the absence of measurements, c = 0 = {0, 0, 0, 0}, the estimates for the probabilities θ obtained from the posterior in Eq. (15) reduce to a_k/A [Eqs. (22)], which are all equal to 1/4 for a = {1, 1, 1, 1} (as expected when there is no information about the measurements). However, in this case we do have information available, coming from all the times in which the PS P_j and P_i have been independently probed. In particular, by using the procedure outlined in Sec. B.1 with the substitution $s_\lambda^\pm \to w_\lambda^\pm$, we can determine the expected values θ̃_±(λ) of θ_±(λ) (λ = j, i) as $\tilde\theta_+(\lambda) = (w_\lambda^+ + 1)/(w_\lambda^+ + w_\lambda^- + 2)$.⁵ The idea is now to plug these θ̃_±(λ) into Eqs. (18), determine the expected values of θ_±±, and then use those to find the best hyperparameters a from Eqs. (22). To do so, however, we need two more ingredients. First, as can be seen from Eqs. (18), we also require an estimate for the exact covariance cov(P_j, P_i). Since independent measurements of P_j and P_i are uncorrelated, we set cov(P_j, P_i) to zero as our initial guess. Second, we are free to choose a normalization A = a_++ + a_+− + a_−+ + a_−− that fixes the bias towards the prior in determining the estimates from the resulting posterior distribution [48,49]. Since we want to calculate Q̃_ji (j ≠ i), but our prior assumes zero covariance, we set A = 4 in the following, which is the same normalization generally used when no knowledge is available (in fact, we set A = K throughout this work).

⁵ Notice that, to estimate θ_±(j), we only use the events in which the PS P_j has been measured independently from P_i (and vice versa). This is done to avoid biases that could falsify the error estimate (∆Õ)².
From the above considerations, we determine the elements of a in Eq. (24). By using these values of a in Eq. (15) we obtain the posterior distribution P(θ|D), from which it is possible to find the covariance estimate Q̃_ji in Eq. (25) [see Eq. (11) or the first row in Eq. (17)]. Compared to the estimate for Q̃_ji given in Eq. (17), this equation provides more accurate results, particularly when the dataset D is small. Furthermore, it lowers the probability of getting nonphysical values for the covariance matrix Q̃ and the cost function (∆Õ)². However, since the relations in Eqs. (18) are still not directly enforced in the posterior probability P(θ|D), there are still cases in which Nature conspires against us and we obtain a nonphysical Q̃. In addition, the post-processing subroutine in our algorithm (see Fig. 2 and Sec. 4) is particularly efficient in finding these cases, since it tries to minimize (∆Õ)².
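Our reading of this construction can be sketched as follows, under the stated assumptions: zero prior covariance, A = 4, and the uniform-prior estimates θ̃_±(λ). The exact form of Eq. (24) is not reproduced in the text, so the closed form below is a hypothesis consistent with the description:

```python
def prior_hyperparameters(w_plus_j, w_minus_j, w_plus_i, w_minus_i, A=4):
    """Hyperparameters a for the prior of the joint posterior, following
    the recipe in the text: estimate theta_pm(lambda) from the independent
    shots, plug them into Eqs. (18) with cov = 0, and normalize to A = 4.
    This gives a_ss' = A * theta~_s(j) * theta~_s'(i); treat this closed
    form as our reconstruction of the (elided) Eq. (24)."""
    tj_p = (w_plus_j + 1) / (w_plus_j + w_minus_j + 2)
    tj_m = 1 - tj_p
    ti_p = (w_plus_i + 1) / (w_plus_i + w_minus_i + 2)
    ti_m = 1 - ti_p
    return {('+', '+'): A * tj_p * ti_p,
            ('+', '-'): A * tj_p * ti_m,
            ('-', '+'): A * tj_m * ti_p,
            ('-', '-'): A * tj_m * ti_m}

# Example of Sec. B.3: 25 independent +1 shots of P_j, 25 independent -1 of P_i.
a = prior_hyperparameters(25, 0, 0, 25)
# The hyperparameters always sum to A, preserving the total prior weight.
```

With these a, the prior already encodes the individual-shot marginals, so the joint posterior no longer starts from the uninformative 1/4 probabilities.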
To make a wrong estimation of the covariances statistically impossible, we assign a risk factor R_ji to each pair of PS P_j and P_i, proportional to how likely the corresponding covariance matrix Q̃ is to be nonphysical. We define R_ji in Eq. (26), where sign(x) is the sign of x, and β_1, β_2 and β_3 are arbitrary parameters.⁶ This definition of R_ji assigns a higher risk (in absolute value) when P_j and P_i have been measured more times individually than together. Once R_ji is assigned to the pair P_j and P_i, we increase the covariance estimate by the statistical error δQ̃_ji, where (Q̃²)_ji is the estimated square of the covariance cov²(P_j, P_i). The quantity (Q̃²)_ji can be found from Eq. (15) by using the usual approach presented in Eq. (11), with the elements of a defined in Eqs. (24).
Since δQ̃_ji is a statistical error on the covariance Q̃_ji, it approaches zero as the size of the dataset increases.

B.4 Summary
To summarize, we used Bayesian theory to estimate averages, variances and covariances of PS measured together and/or individually. This is crucial for allocating the measurement budget to different cliques, since our estimators yield valid results with scarce or even no statistics. As detailed in [34], our algorithm allows for choosing between different features. Every time the covariance matrix is updated, the estimator for the covariance of two PS P_j and P_i is given by Eq. (29), where Q̃_ji, R_ji and δQ̃_ji are given in Eqs. (25), (26) and (28), respectively. The user then has the freedom of choosing the hyperparameters β_1, β_2 and β_3 of the risk factor R_ji, which has implications for the allocation of the measurement budget. In our experience, setting the risk factor to zero (i.e., β_1 = 0) never produced unphysical covariance matrices and yielded slightly better results. Otherwise, an empirically reasonable choice for these hyperparameters is β_1 = 1, β_2 = 10 and β_3 = 0.25. Another feature of our algorithm is that, after all measurements have been allocated and taken, it offers different options for calculating the estimator Õ and its error (∆Õ)². The user can substitute the Bayesian estimators for P̃_j and Q̃_jj in Eqs. (14), and for Q̃_ji (j ≠ i) in Eq. (29), with the sample average, variance and covariance [50], respectively. This is a reasonable choice, since these sample quantities are known to converge faster than the Bayesian ones. We remark, however, that this substitution is appropriate only if there is a sufficiently large dataset for correctly estimating all contributions in Eq. (2b).
Yet another feature of our algorithm is that it allows, both when allocating the measurement budget and when outputting the final estimate and error, the use of the exact values cov(P_j, P_i) for Q̃_ji in Eq. (2b). On the one hand, this allows for investigating the limits of our protocol. On the other hand, it is useful to determine the exact error (i.e., without statistical fluctuations) that is expected from a given measurement allocation, as we did for the values in square brackets in Fig. 3(a) and the points in Fig. 3(c).
C A computationally efficient method to compute (∆Õ)²

In this section, we explain a computationally efficient method to derive the error (∆Õ)². As evident from Eq. (2), this involves two objects: the matrix C (with elements c_j c_i Q̃_ji) and the parameters m_ji (j, i = 1, ..., n). The first one can be obtained by following the procedure explained in Sec. B. The latter is found from the outputs of either the BF or the ML subroutine, as explained in Sec. 4. Indeed, these subroutines provide an r-dimensional vector ξ whose elements ξ_i represent the number of times that the i-th clique is measured. To derive m_ji from ξ, one may use the measurement characteristic matrix E, an (n × r)-matrix whose elements E_ji are equal to 1 if the Pauli string j belongs to the clique i, and zero otherwise. It then follows that m_ji is the (j, i)-element of the product EΞEᵀ, where Ξ is a diagonal (r × r)-matrix with ξ as the diagonal entries, and ᵀ indicates transposition.
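The construction m = EΞEᵀ described above can be sketched directly; the cover below (three PS, two overlapping cliques sharing P_2) is a hypothetical example:

```python
import numpy as np

# Measurement characteristic matrix E: n = 3 Pauli strings, r = 2 cliques;
# P_2 (second row) belongs to both cliques.
E = np.array([[1, 0],
              [1, 1],
              [0, 1]])
xi = np.array([10, 20])          # shots allocated to each clique

# m_ji = (E Xi E^T)_ji, with Xi = diag(xi), as described in the text.
m = E @ np.diag(xi) @ E.T

# Diagonal: m_jj is the total number of shots of P_j (30 for P_2).
# Off-diagonal: m_ji counts the shots where P_j and P_i were probed together.
```

Since Ξ is diagonal, the product costs O(n²r) operations, negligible compared to the clique search.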
Given m_ji, the error (∆Õ)² can be rewritten in terms of matrix operations that can be implemented efficiently. Denoting the Hadamard product and Hadamard division [51] by • and ⊘, respectively, we obtain Eq. (30), where J is the all-ones matrix of size (r × r) and j is the (r × 1) all-ones vector. Given C and ξ, this equation is used in the BF and the ML subroutines of our algorithm to efficiently compute the error function. Notice that, compared to Eq. (2), the Kronecker delta does not appear in Eq. (30). This is not problematic, since the BF [ML] subroutine considers the elements of ξ to be positive integers [real numbers that are then rounded to the nearest integer]. For the post-processing we directly use Eq. (2) to calculate (∆Õ)².
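Since Eq. (30) itself is not reproduced above, the sketch below implements the standard overlapping-group error formula Σ_{j,i} C_ji m_ji/(m_jj m_ii), which is consistent with the description of Eq. (2); the numbers are illustrative:

```python
import numpy as np

def expected_error(C, E, xi):
    """(Delta O~)^2 for allocation xi, assuming the standard
    overlapping-group formula sum_{j,i} C_ji m_ji / (m_jj m_ii);
    Eq. (30) is elided in the text, so this is a reconstruction.
    The Kronecker-delta correction of Eq. (2) is omitted, as in Eq. (30)."""
    m = E @ np.diag(xi) @ E.T
    m_tot = np.diag(m).astype(float)       # m_jj: total shots of P_j
    return float(np.sum(C * m / np.outer(m_tot, m_tot)))

# Hypothetical C_ji = c_j c_i Q~_ji with all c_j = 1.
C = np.array([[0.5, 0.1, 0.0],
              [0.1, 0.6, -0.2],
              [0.0, -0.2, 0.4]])
E = np.array([[1, 0], [1, 1], [0, 1]])
err = expected_error(C, E, np.array([10, 20]))
```

All operations are elementwise or dense matrix products, so the cost per evaluation is O(n²), which matters because the BF and ML subroutines evaluate this function at every allocation step.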

D Numerical results
In this section, we provide details on the numerical results in Figs. 3 and 4. This includes details on the LDF algorithm that we have used, and an extended explanation of how the values in the tables, as well as the points in the plots, have been derived.
The numbers outside the square brackets in Fig. 3(a) represent the standard deviations $\Sigma = [\frac{1}{R} \sum_{j=1}^{R} (\tilde O_j - \langle O \rangle)^2]^{1/2}$ [Eq. (31)], where Õ_j is the j-th estimated average of the observable O under consideration, ⟨O⟩ is the exact average of O, and R is the total number of times the whole measurement procedure is repeated [R = 40 in Fig. 3(a)]. For the Derand [21] and the APS [25,26] methods, the Õ_j are determined with the algorithms reported in the associated references. The same holds for the OGM [27] method, where we resorted to version 2 of their algorithm and followed their sampling procedure.
As explained in Sec. B.4 and in [34], one can choose different quantities to be returned by AEQuO. The Õ_j in Eq. (31) used for calculating the numbers outside the square brackets are obtained with the sample average [50] for estimating all PS P̃_j within Õ_j. The numbers inside the square brackets in Fig. 3(a), on the other hand, are derived by averaging the values (∆Õ)² obtained in the R iterations of the algorithm. The associated errors in the parentheses are the sample root mean squares [50]. While AEQuO performs the measurement allocation without prior knowledge of the covariance matrix Q̃, the values for (∆Õ)² that are returned at each j-th iteration are calculated with the exact Q̃_ji → cov(P_j, P_i). This allows for a very precise estimate of the real error (hence the small uncertainties in the parentheses), even with only R = 40.
In panels (b) and (c) of Fig. 3 we show the rescaled errors MΣ²/⟨O⟩² and M(∆Õ)²/⟨O⟩², respectively. In (b), MΣ²/⟨O⟩² is chosen in order to compare AEQuO with other approaches that cannot directly estimate the variance of the considered observable O. In (c), we report M(∆Õ)²/⟨O⟩², where (∆Õ)² and the associated error bars are calculated in the same way as the values in the square brackets within Fig. 3(a) (see previous paragraph). We have chosen R = 100 to reduce statistical fluctuations affecting Σ in (b), and R = 40 in (c). Furthermore, since the APS method and AEQuO with the BF subroutine allocate the measurement shots one at a time, when M is large they require longer runtimes than the other approaches. Therefore, they have been run up to M = 6400. Every time AEQuO is used to determine the Q̃ used for measurement allocation, it resorts to Bayesian estimation with β_1 = 0 (see Sec. B.4).
For observables characterized by small values of n, AEQuO can find all maximal cliques and feed them into either the BF or the ML subroutine for measurement allocation. This procedure led to the reported values in Fig. 3(a) up to (and excluding) the LiH 2 Hamiltonian. Since the worst-case scaling of r is exponential in the number of vertices [52], resorting to all maximal cliques is not an option when n is large. We thus developed an algorithm that spans all vertices and, for each, finds a user-specified number of maximal cliques (see Sec. 4 and [34] for more information). When building each maximal clique, this algorithm favours vertices with more edges and higher contributions to the cost function (∆Õ)² in Eq. (2b). This creates a bias towards larger cliques, which are highly desirable in the context of measuring a quantum observable. In fact, we have tested AEQuO when using either all maximal cliques, or a subset of r (large) maximal cliques. For the examples we considered, the errors obtained in these two cases differed by at most 15%.
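Enumerating all maximal cliques (here with a plain Bron-Kerbosch recursion, our stand-in for whatever routine AEQuO uses) illustrates why this step is exponential in the worst case:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of all maximal cliques of a graph given
    as an adjacency dict {vertex: set(neighbours)}. Worst-case exponential
    in the number of vertices, which is why a per-vertex heuristic is
    needed for large n."""
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)        # R is maximal: nothing left to add
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques

# Commutation graph matching the example of Fig. 1: P_2 commutes with
# P_1 and P_3, while P_1 and P_3 do not commute with each other.
adj = {1: {2}, 2: {1, 3}, 3: {2}}
cliques = maximal_cliques(adj)       # the two overlapping cliques of Fig. 1(d)
```

The per-vertex heuristic described above replaces this exhaustive enumeration with a bounded number of greedy clique constructions per vertex.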
The values reported in Fig. 3(a-b) for the LDF method have been determined with an algorithm based on Ref. [16] and integrated with the framework developed in this work. We first find a cover of the graph such that each vertex is contained in exactly one clique. To find this cover, we also prioritise larger cliques and vertices with higher contributions to (∆Õ)², as we did for the maximal cliques algorithm described in the previous paragraph. Once a cover is found, we use the BF subroutine to allocate all measurements, without ever updating the covariance matrix Q̃, i.e., no adaptive features are used for the LDF method. We then perform the measurements and, using the sample average, obtain R = 40 [R = 100] values Õ_j that are used in Eq. (31) to determine the numbers reported in Figs. 3(a) and 3(c) [Fig. 3(b)].
The numerical results from the LDF method reported in Sec. 5 are lower than those reported in Refs. [16,18,21,25,27]. This is a consequence of two facts. First, for each graph cover, our measurement allocation is optimal (as can be demonstrated by following the procedure in App. A). Second, we allow for general commutation relations between PS, i.e., we group together two PS that commute but do not necessarily bitwise commute. Indeed, when we restrict our LDF to bitwise commutation relations (as in App. E), it yields results that are similar to the ones reported for the LDF algorithm in Refs. [16,18,21,25,27].
Our version of the LDF is representative of other measurement protocols (such as the ones in Refs. [16,29-32]) that also allow for non-bitwise commutation relations between PS. In fact, prior to this work, two criteria were commonly used to group PS: first, the magnitude of their coefficients, and second, the total number of resulting groups. For building the groups of PS to be measured together, our LDF protocol instead employs the expected contributions of the PS to the error, available via Eqs. (2), resulting in a strategy similar to the one used by AEQuO (described in Sec. 4 and [34]). Since having fewer groups and collecting together PS with large coefficients are highly correlated with lower estimation errors, we expect the algorithms in Refs. [16,29-32] to achieve errors comparable to the numerical LDF results reported in Sec. 5.
In all panels of Fig. 4, we report averaged values of (∆Õ)² with their statistical errors [in parentheses in (a) and as error bars in (b-e)], with R = 25 [see the above discussion about Fig. 3(c)]. The same chemistry Hamiltonians previously utilized in Fig. 3 have been used in Fig. 4(a), while for Fig. 4(b-d) we resorted to the family of lattice models introduced in the main text and in App. F. In (a), (b) and (d), the settings chosen for estimating Q̃ are the same ones used in Fig. 3. For (c) and (e), the covariance matrix is initialized to the exact values, as explained at the bottom of Sec. B.4.
For all numerical values in Fig. 3 where the ML subroutine has been utilized, we employed a stochastic gradient descent optimization algorithm [40] to minimize the cost function (∆Õ)². This algorithm's performance depends on hyperparameters [34,38,40], the learning rate in particular, which have not been optimized in order to keep the computational requirements low. The ML subroutine in Fig. 4, on the other hand, employs the limited-memory BFGS optimizer [53] that, belonging to the family of quasi-Newton methods [42], does not require choosing a learning rate.

E Bitwise commutation comparison
In this section, we compare AEQuO and our LDF protocol, both restricted to bitwise commutation (BC) relations, with the APS [25,26], the Derand [21] and the OGM [27] methods. As explained in Sec. 4, simultaneous measurements of non-bitwise commuting PS require entangling gates that, in the context of NISQ devices, generate errors that can jeopardize the result. It is therefore important to determine the advantage resulting from the adaptive nature of AEQuO alone.
In Fig. 5 we present numerical results analogous to those in Fig. 3(a-b), with the additional constraint that all measurement schemes are restricted to bitwise commutation relations. From panel (a), it is possible to conclude that, for the considered chemistry Hamiltonians, the adaptive nature of AEQuO does provide a consistent advantage over all other schemes. Except for the water molecule, non-adaptive (L = 1, black dotted line) AEQuO, the OGM, and the Derand protocols yield results that are within statistical fluctuations of one another [in agreement with Fig. 5(b) and Figs. 3 and 4]. Furthermore, while the LDF protocol in Fig. 3(a-b) yielded results that, in several instances, were better than the OGM and the Derand, here the LDF always performs comparably or worse. We thus conclude that its advantage came from the non-bitwise commutation between PS that, in this section, is not allowed.

Figure 5: (a): Errors obtained with LDF [16], OGM [27], APS [25,26], Derand [21], and AEQuO with the ML subroutine and L = 1 or L = 3 and l = 4. We consider chemistry Hamiltonians in the BK encoding [9]. Values (in bold the lowest) are variances Σ² with Σ as in Eq. (31). In square brackets, we report averages of (∆Õ)² with their statistical error. The input is the ground state, R = 25, M = 10⁴, and values are rescaled by 10⁻⁴. (b): Relative errors MΣ²/⟨O⟩² as a function of M obtained considering the BeH₂ Hamiltonian with BK encoding and R = 10². In both panels, AEQuO and the LDF [16] protocol are restricted to bitwise commutation (BC) relations.
The case of H₂O in Fig. 5(a) presents an interesting feature. The OGM, while performing worse than adaptive (L = 3, l = 4) AEQuO, outperforms all other approaches. After a careful analysis, we traced this advantage to the fact that the OGM protocol, for small measurement budgets, does not measure the full input observable. Instead, it removes PS with small coefficients from the Hamiltonian [27]. These terms, which are considered and measured by AEQuO, the LDF and the Derand protocols, introduce extra statistical fluctuations that increase the error. This is supported by Fig. 5(b), where it is evident that the OGM (yellow downward triangles) asymptotically reaches (within statistical fluctuations) the Derand (red upward triangles) and non-adaptive AEQuO (L = 1, black dotted line) for large values of M, when all PS are considered by the OGM.
To confirm that the OGM's enhanced performance for small M comes from neglecting PS that bear extra statistical errors, we investigated the performance of all protocols with different input states. For small values of M, we found several instances in which the OGM yields results that are comparable to the Derand and to non-adaptive (L = 1) AEQuO.

F Lattice Hamiltonians
For deriving the data in Fig. 4(b-d), we considered different 2- and 3-dimensional lattice models (see bottom of the figure), described by the Hamiltonians in Eq. (32). To better explain this operator, we refer to Fig. 6. In panel (a), we show an example of a lattice in D = 2 dimensions with d = 16 qubits lying on the edges. Vertices are indicated by vectors v, and the Cartesian coordinate frame is characterized by versors {ê} = {ê₁, ..., ê_D}. It follows that each qubit can be identified by v ± ê_j (j = 1, ..., D), where v is the vertex from which it can be reached by one of the versors (see Fig. 1). In Eq. (32), α_k and β_2k are real constants, and the products run over all possible combinations of versors (S are subsets of k oriented versors). The raising and lowering operators are indicated with σ± = (σ_x ± iσ_y)/2. In Fig. 6(b), all elements of the Hamiltonian in Eq. (32) are represented for a D = 2 dimensional lattice with d = 4 qubits. This class of physical models, describing many-body, nearest-neighbour interacting spin-1/2 particles on a lattice, is a generalization of the Hubbard model [44] with multi-particle hopping terms and energy shifts depending on the particles' state. For determining the data points in Fig. 4(b-e), we set α_k = 1/k and β_2k = 1/(2k).

Figure 6: (a): Example of a lattice used in Fig. 4(b-d).