Overlapped grouping measurement: A unified framework for measuring quantum states

Quantum algorithms designed for realistic quantum many-body systems, such as chemistry and materials, usually require a large number of measurements of the Hamiltonian. Exploiting different ideas, such as importance sampling, observable compatibility, or classical shadows of quantum states, several advanced measurement schemes have been proposed to greatly reduce this measurement cost. Yet, the underlying cost-reduction mechanisms seem distinct from each other, and how to systematically find the optimal scheme remains a critical challenge. Here, we address this challenge by proposing a unified framework of quantum measurements that incorporates the advanced measurement methods as special cases. Our framework allows us to introduce a general scheme -- overlapped grouping measurement -- which simultaneously exploits the advantages of most existing methods. Intuitively, the scheme partitions the measurements into overlapped groups, each consisting of compatible measurements. We provide explicit grouping strategies and numerically verify the performance of the scheme for different molecular Hamiltonians with up to 16 qubits. Our numerical results show significant improvements over existing schemes. Our work paves the way for efficient quantum measurement and fast quantum processing with current and near-term quantum devices.

Without introducing additional entangling circuits, three types of advanced measurement schemes have been proposed to reduce the measurement cost by exploiting different features of the to-be-measured observables [11,15,16,27,31,33,38,62,63,65,67]. First, observables may have different weight coefficients, and we can exploit importance sampling to distribute more measurements to observables with large weights [44,67]. Next, observables may be compatible with one another, in the sense that they can be simultaneously measured in the same measurement basis. We can thus group observables into sets of compatible observables that require fewer measurements [16,28,35,38,50,64,65,76]. Another notable but conceptually different scheme considers classical shadows of quantum states using uniformly random local measurements, which have been extensively investigated in theoretical and experimental works [1,2,10,31,58,74,77]. By properly post-processing the classical measurement outcomes, one can simultaneously obtain the expectation values of arbitrary observables. The cost of the original uniform classical shadow scheme [31] scales exponentially in the number of qubits on which the observable acts nontrivially, and the LBCS and derandomized CS methods were later proposed to reduce this measurement cost [27,33]. While the optimized classical shadow method outperforms the other two types of methods in numerical experiments, how they are related, and how to find an optimized method that exploits the advantages of all these advanced measurement schemes, remain open questions.
Here, we address these problems in quantum state measurement. We first introduce a unified framework that integrates the advantages of these advanced measurement schemes in Sec. 2. In particular, we show how to understand the classical shadow method as a generalized observable grouping method. We next introduce the overlapped grouping measurement scheme, which simultaneously exploits the features of importance sampling, observable compatibility, and classical shadows, in Sec. 3. While finding the optimal overlapped groups can be computationally challenging, we provide explicit algorithms that output an optimized measurement scheme in Sec. 4. We then numerically benchmark our method in Sec. 5 by comparing it to existing advanced works [27,31,33,38,65] in estimating expectation values of molecular Hamiltonians, a subroutine in most quantum algorithms. Our numerical results show prominent improvements over all the other methods. The proposed method is immediately applicable to currently available and near-future quantum computing experiments. In Sec. 6, we conclude this work and suggest some interesting future investigations.

A unified framework
Now we introduce a framework for measuring Hermitian objective observables $O := \sum_j \alpha_j Q^{(j)}$ on a multi-qubit quantum state $\rho$. Here $Q^{(j)} \in \{I, X, Y, Z\}^{\otimes n}$ are tensor products of single-qubit Pauli operators, which we also call local Pauli strings. Naively, we could measure each term $Q^{(j)}$ to obtain its expectation value $\mathrm{tr}(\rho Q^{(j)})$, and hence the expectation value of the objective observable, $\mathrm{tr}(\rho O)$, whereas more efficient schemes may be found by exploiting the properties of the objective observables.
We first consider observable compatibility. Let $Q$ and $R$ be tensor products of single-qubit Pauli operators. We write $Q \triangleleft R$ to denote that $Q_i = R_i$ or $Q_i = I$ for every $i$, indicating that measuring the observable $R$ equivalently measures $Q$. We say that $Q$ is compatible with $R$ when $Q_i \triangleleft R_i$ or $R_i \triangleleft Q_i$ for every $i$, meaning that $Q$ and $R$ can be simultaneously measured. In an extreme case, when every operator $Q^{(j)}$ is compatible with the same Pauli basis $P$, we can simultaneously obtain all the expectation values $\mathrm{tr}(\rho Q^{(j)})$ by measuring the single basis $P$. Nevertheless, a practical case generally consists of observables that are compatible with only a subset of the other observables. Then we need to find a set of Pauli bases $\{P\}$ such that each observable $Q^{(j)}$ is compatible with at least one Pauli basis. After choosing the Pauli bases $\{P\}$, the next question is how to distribute the measurement samples to each basis $P$, which corresponds to the idea of importance sampling. Without loss of generality, we select each basis $P$ randomly with probability $K(P)$. Now, suppose that we have determined $\{(P, K(P))\}$; we can then define an estimator
$$\hat{v} = \sum_j \alpha_j\, f(P, Q^{(j)}, K)\, \mu(P, \mathrm{supp}(Q^{(j)})), \qquad (1)$$
where $\mathrm{supp}(Q) := \{i \mid Q_i \neq I\}$ is the support of $Q$, $\mu(P, \mathrm{supp}(Q^{(j)})) := \prod_{i \in \mathrm{supp}(Q^{(j)})} \mu(P_i)$, and $\mu(P_i)$ is the single-shot outcome of measuring the $i$th qubit with the single-qubit Pauli operator $P_i$. Here, $\mu(P, \mathrm{supp}(Q^{(j)}))$ effectively gives the measurement result of $Q^{(j)}$ obtained from measuring in the basis $P$. The function $f(P, Q^{(j)}, K)$ is associated with the probability distribution $K$ of the measurement $P$ and with $Q^{(j)}$, and is designed to guarantee that $\hat{v}$ is an unbiased estimator of $\mathrm{tr}(\rho O)$. It depends on the measurement scheme, and we will show its explicit form for different schemes later.
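The two relations just introduced can be made concrete in a few lines of code. The following is a minimal sketch (our own illustration, not the paper's implementation), writing an $n$-qubit Pauli string as a Python string over the alphabet "IXYZ":

```python
# Sketch of the relations Q ◁ P ("P covers Q") and pairwise compatibility.
# Helper names are our own; they are not taken from the paper's code.

def covers(q: str, p: str) -> bool:
    """Return True iff q ◁ p, i.e. q_i == p_i or q_i == 'I' on every qubit,
    so that measuring the basis p also measures q."""
    return all(qi == pi or qi == "I" for qi, pi in zip(q, p))

def compatible(q: str, r: str) -> bool:
    """q and r are compatible iff on every qubit one of them acts trivially
    or they act identically, so a common measurement basis exists."""
    return all(qi == ri or qi == "I" or ri == "I" for qi, ri in zip(q, r))

def joint_basis(q: str, r: str) -> str:
    """For compatible q, r: a basis p with q ◁ p and r ◁ p (qubit-wise join)."""
    return "".join(ri if qi == "I" else qi for qi, ri in zip(q, r))

print(covers("XIZ", "XYZ"))       # True: X1 Z3 is measured by the basis XYZ
print(compatible("XIZ", "IYZ"))   # True: a joint basis exists
print(joint_basis("XIZ", "IYZ"))  # XYZ
```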
Assuming that $\mathbb{E}[f(P, Q^{(j)}, K)] = 1$, we will show that $\mathbb{E}[\hat{v}] = \mathrm{tr}(\rho O)$, i.e., that $\hat{v}$ is an unbiased estimator of the observable expectation $\mathrm{tr}(\rho O)$, in the following proposition.
Proposition 1. Let $O = \sum_j \alpha_j Q^{(j)}$, where the $Q^{(j)}$ are local Pauli strings, and let $\hat{v}$ be defined as in Eq. (1) with $\mathbb{E}[f(P, Q^{(j)}, K)] = 1$. Then $\hat{v}$ is an unbiased estimator of $\mathrm{tr}(\rho O)$.
Proof. By the definition of $\hat{v}$, we have
$$\mathbb{E}[\hat{v}] = \mathbb{E}_P\big[\mathbb{E}[\hat{v} \mid P]\big] = \sum_j \alpha_j\, \mathbb{E}_P\big[f(P, Q^{(j)}, K)\big]\, \mathrm{tr}(\rho Q^{(j)}) = \sum_j \alpha_j\, \mathrm{tr}(\rho Q^{(j)}) = \mathrm{tr}(\rho O),$$
where the first equality holds because of the conditional expectation formula.
In the following, we give explicit expressions of f (·, ·, ·) for different existing measurement schemes.
(1) Importance sampling. The importance sampling strategy measures each observable $Q^{(j)}$ independently in the basis $P^{(j)} = Q^{(j)}$, with the probability distribution determined by the weight of the observable as $K(P^{(j)}) = |\alpha_j|/\|\alpha\|_1$, where $\|\alpha\|_1 = \sum_{j=1}^m |\alpha_j|$ is the $l_1$ norm of $\alpha = (\alpha_1, \ldots, \alpha_m)$, and with $f$ defined as
$$f_{l_1}(P, Q^{(j)}, K) = \delta_{P, Q^{(j)}} / K(Q^{(j)}). \qquad (2)$$
[Figure 1: comparison of the measurement schemes on an example observable. The variances for the importance sampling, LDF-grouping, LBCS, and OGM algorithms on this instance are 0.90, 0.56, 0.74, and 0.50, respectively. (e) Schematic diagram of the unified framework: we first determine the set of Pauli bases $\{P\}$ and the probability distribution $\{K(P)\}$ using Algorithm 1; we then measure the quantum state $\rho$ in Pauli bases drawn from the distribution $K$ and post-process the measurement outcomes to obtain the estimate of $\mathrm{tr}(\rho O)$.]
It is easy to check that $\mathbb{E}_P[f_{l_1}(P, Q^{(j)}, K)] = 1$. This method is also referred to as $l_1$-sampling, since the sampling probability is associated with the $l_1$-norm of the weights. The $l_1$-sampling needs $O(\|\alpha\|_1^2/\varepsilon^2)$ copies of the quantum state $\rho$ to approximate the expectation of $O$ with additive error $\varepsilon$. The number of copies is obtained from the Chebyshev inequality, and we leave the details of the error analysis to Appendix A.
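As a quick sanity check of the normalization $\mathbb{E}_P[f_{l_1}(P, Q^{(j)}, K)] = 1$, here is a toy instance with made-up coefficients (our own illustration, not the paper's code):

```python
# Verify E_P[f_l1(P, Q^(j), K)] = 1 for l1-sampling on a toy observable.

alphas = {"XX": 0.5, "ZI": -0.3, "IZ": 0.2}        # hypothetical coefficients
l1 = sum(abs(a) for a in alphas.values())          # ||alpha||_1
K = {q: abs(a) / l1 for q, a in alphas.items()}    # K(P) = |alpha_j| / ||alpha||_1

def f_l1(p: str, q: str) -> float:
    # f_l1(P, Q, K) = delta_{P,Q} / K(Q): nonzero only when the sampled
    # basis is the observable itself.
    return (1.0 / K[q]) if p == q else 0.0

for q in alphas:
    expectation = sum(K[p] * f_l1(p, q) for p in K)  # average over sampled P
    assert abs(expectation - 1.0) < 1e-12
print("E_P[f_l1] = 1 for every observable")
```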
(2) Grouping. The grouping method exploits observable compatibility by partitioning the observables $\mathcal{O} = \{Q^{(j)}\}$ into several non-overlapping sets $S = \{e_1, \ldots, e_s\}$ such that $e_j \cap e_{j'} = \emptyset$ $(\forall j \neq j')$ and $\cup_j e_j = \mathcal{O}$. It also requires that observables in the same set be compatible with each other, so that there exists a measurement basis $P^{(j)}$ satisfying $Q \triangleleft P^{(j)}$, $\forall Q \in e_j$. Let $K(P^{(j)})$ be the probability that $P^{(j)}$ is selected. It can be optimized using importance sampling by setting $K(P^{(j)})$ proportional to the total weight of the observables in the set $e_j$, i.e., $K(P^{(j)}) = \|e_j\|_1/\|\alpha\|_1$. Here the weight of a set $e_j$ is defined as the $l_1$-norm of the weights of the observables in this set, $\|e_j\|_1 = \sum_{Q^{(k)} \in e_j} |\alpha_k|$. The function $f$ for the optimized grouping method is
$$f_g(P, Q^{(j)}, K) = \delta_{P, P^{(j)}} / K(P^{(j)}), \qquad (3)$$
where $P^{(j)}$ denotes the basis of the set containing $Q^{(j)}$, and $\delta_{P, P^{(j)}}$ equals one if $P = P^{(j)}$ and zero otherwise. Then we have $\mathbb{E}_P[f_g(P, Q^{(j)}, K)] = 1$. The number of copies of the quantum state $\rho$ needed here depends on the grouping strategy, and we give an explicit upper bound on the number of copies required to approximate $\mathrm{tr}(\rho O)$ with additive error $\varepsilon$ for any given grouping strategy in Appendix A.
Finding the exact minimum number of groups has been proven to be NP-hard [37]. Several heuristic algorithms, such as the largest-degree-first (LDF) method [65], have been proposed to give approximate solutions. We refer to Appendix B for a detailed implementation of the heuristic grouping method.
(3) Classical shadows. The conventional classical shadow (CS) method measures the quantum state in a random Pauli basis, which corresponds to a Pauli string $P \in \{X, Y, Z\}^{\otimes n}$ within our framework. The original scheme in the seminal work [31] considers the uniform probability $K(P) = 1/3^n$, whereas the locally biased classical shadow (LBCS) method [27] allows a general product distribution $K(P) = \prod_i K_i(P_i)$. One can check that $\mathbb{E}_P[f_{\mathrm{CS}}(P, Q^{(j)}, K)] = 1$ with $f_{\mathrm{CS}}(P, Q^{(j)}, K) = \delta_{Q^{(j)} \triangleleft P} \prod_{i \in \mathrm{supp}(Q^{(j)})} K_i(Q^{(j)}_i)^{-1}$. The number of copies of $\rho$ needed to approximate the expectation of $O$ with additive error $\varepsilon$ is bounded, provided $Q^{(j)}$ is not the identity for all $j$. We refer to Appendix A for a detailed proof.
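For a product distribution, the probability that a sampled basis covers a given observable factorizes over the observable's support; this can be checked by brute force on a small example. The helper names below are our own, not from the paper:

```python
# chi(Q) under a product distribution of single-qubit bases, as used by
# (LB)CS, compared against brute-force enumeration of all 3^n bases.

from itertools import product

def chi_product(q: str, K_local) -> float:
    """chi(Q) = prod_{i in supp(Q)} K_i(Q_i): probability that a basis drawn
    qubit-wise from K_local covers Q (identity qubits always match)."""
    prob = 1.0
    for i, qi in enumerate(q):
        if qi != "I":
            prob *= K_local[i][qi]
    return prob

n = 3
uniform = [{"X": 1 / 3, "Y": 1 / 3, "Z": 1 / 3}] * n  # original CS: K(P) = 1/3^n
q = "XIZ"                                             # weight-2 Pauli string
brute = sum(
    (1 / 3) ** n
    for basis in product("XYZ", repeat=n)
    if all(qi == bi or qi == "I" for qi, bi in zip(q, basis))
)
assert abs(chi_product(q, uniform) - brute) < 1e-12
print(chi_product(q, uniform))  # 1/9 = 3^(-|supp(Q)|)
```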
(4) Derandomized CS. Recently, Huang, Kueng and Preskill [33] proposed a derandomized classical shadow algorithm, which shows great practical performance compared with conventional classical shadow methods. The derandomization algorithm first assigns a collection of $T$ completely random $n$-qubit Pauli measurements, and then derandomizes the sampling of the measurement set by greedily and adaptively choosing the current $P^{(j)}$ in the $j$th step, given the derandomized measurements $P^{(1)}, \ldots, P^{(j-1)}$, so as to minimize the conditional expected value over all remaining random measurement assignments. Given all the selected measurements, the estimator of the derandomized CS algorithm can be expressed as $\hat{v}$ within our framework. We now give a sampling version of this method and show that it also fits the unified framework of Eq. (1).
For the measurement sequence $P^{(1)}, \ldots, P^{(T)}$, let the frequency of $P^{(k)}$ be $t_k$, let $p_k = t_k/T$ be the probability of selecting $P^{(k)}$, and denote this distribution by $K$. Then Proposition 1 still holds as long as for every observable $Q^{(j)}$, $j \in [m]$, there exists a measurement $P^{(k)}$ in the sequence such that $Q^{(j)} \triangleleft P^{(k)}$. In this case, we can rewrite Eq. (1) as
$$\hat{v} = \sum_j \alpha_j\, p_{Q^{(j)}}^{-1}\, \delta_{Q^{(j)} \triangleleft P^{(k)}}\, \mu(P^{(k)}, \mathrm{supp}(Q^{(j)})) \qquad (9)$$
for a selected measurement $P^{(k)}$, where $p_{Q^{(j)}} = \sum_{k: Q^{(j)} \triangleleft P^{(k)}} p_k$ is the probability of (effectively) measuring $Q^{(j)}$. It is easy to check that $\mathbb{E}_P[p_{Q^{(j)}}^{-1} \delta_{Q^{(j)} \triangleleft P}] = 1$. In Fig. 1, we show explicit examples of the above three typical methods. While Ref. [27] showed the superiority of the LBCS method, we can see that LBCS essentially exploits an alternative view of observable compatibility, which is captured by the unified framework.
However, since the measurements are selected locally in LBCS, redundant observables must be measured, such as $X_1 Z_2 X_3$ in our example of Fig. 1. This term contributes nothing to the objective observable, but is still assigned a certain number of measurements. For a general observable, many measurements might be assigned to such redundant terms, which makes LBCS suboptimal or even costly as the system size increases.

Overlapped grouping measurement
Here, we propose a new scheme that exploits the advantages of the aforementioned measurement methods. We first introduce the concept of overlapped grouping and then compare this strategy with other existing strategies.

Overlapped grouping
Suppose that we have determined the probabilities $\{K(P^{(j)})\}$; we can then define a new function $f$ for the overlapped grouping as
$$f_G(P, Q, K) = \delta_{Q \triangleleft P}\, \chi(Q)^{-1}, \qquad (10)$$
where $\chi(Q) := \sum_{P: Q \triangleleft P} K(P)$ represents the probability that $Q$ is effectively measured by the sampled basis $P$. Now we can define
$$\hat{v}_G = \sum_j \alpha_j\, f_G(P, Q^{(j)}, K)\, \mu(P, \mathrm{supp}(Q^{(j)})) \qquad (11)$$
as an unbiased estimator of $\mathrm{tr}(\rho O)$. Intuitively, an unbiased estimate of $\mathrm{tr}(\rho Q^{(j)})$ can be generated from the measured results of $Q^{(j)}$ divided by its measurement probability. From the definition of $f_G$, we have $\mathbb{E}_P[f_G(P, Q, K)] = \sum_{P: Q \triangleleft P} K(P)\, \chi(Q)^{-1} = 1$, and hence $\hat{v}_G$ is also an unbiased estimator of $\mathrm{tr}(\rho O)$ by Proposition 1. To summarise, an overlapped grouping measurement (OGM) scheme works as follows.
S1. Find the overlapped sets $S$ with corresponding measurements $\{P^{(1)}, \ldots, P^{(s)}\}$.
S2. Determine the probability distribution $\{K(P^{(j)})\}$ over the measurements.
S3. Measure the quantum state with a randomly generated basis $P$ and process the outcomes with Eqs. (1) and (10).
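The ingredients $\chi$ and $f_G$ above can be illustrated on a toy overlapped grouping. The bases and probabilities below are made up for illustration and are not taken from the paper:

```python
# Overlap in action: one observable covered by two bases, with the
# reweighting f_G = delta_{Q ◁ P} / chi(Q) restoring unbiasedness.

def covers(q: str, p: str) -> bool:
    return all(qi == pi or qi == "I" for qi, pi in zip(q, p))

# Hypothetical overlapped-group bases and their sampling probabilities K(P):
K = {"XXZ": 0.5, "ZXZ": 0.3, "XYY": 0.2}

def chi(q: str) -> float:
    # chi(Q) = sum of K(P) over all bases P with Q ◁ P
    return sum(kp for p, kp in K.items() if covers(q, p))

def f_G(p: str, q: str) -> float:
    # f_G(P, Q, K) = delta_{Q ◁ P} / chi(Q)
    return (1.0 / chi(q)) if covers(q, p) else 0.0

q = "IXZ"                         # covered by both XXZ and ZXZ (overlap!)
assert abs(chi(q) - 0.8) < 1e-12
# Unbiasedness of the reweighting: E_P[f_G(P, Q, K)] = 1
assert abs(sum(K[p] * f_G(p, q) for p in K) - 1.0) < 1e-12
print("IXZ is measured whenever XXZ or ZXZ is drawn; E_P[f_G] = 1")
```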
A specific OGM scheme is determined by the choice of sets $S$ (equivalently $\{P^{(j)}\}$) and the probability distribution $\{K(P^{(j)})\}$. To quantify the performance of the scheme, we consider the variance of the estimator, as shown in the following proposition.

Proposition 2. The variance of $\hat{v}_G$ defined in Eq. (11) is
$$\mathrm{Var}(\hat{v}_G) = \sum_{j,k} \alpha_j \alpha_k\, \frac{\sum_{P:\, Q^{(j)} \triangleleft P,\, Q^{(k)} \triangleleft P} K(P)}{\chi(Q^{(j)})\, \chi(Q^{(k)})}\, \mathrm{tr}\big(\rho\, Q^{(j)} Q^{(k)}\big) - \mathrm{tr}(\rho O)^2. \qquad (12)$$
Proof. Proposition 2 follows from
$$\mathbb{E}[\hat{v}_G^2] = \sum_{j,k} \alpha_j \alpha_k\, \mathbb{E}_P\Big[f_G(P, Q^{(j)}, K)\, f_G(P, Q^{(k)}, K)\, \mathbb{E}\big[\mu(P, \mathrm{supp}(Q^{(j)} Q^{(k)})) \mid P\big]\Big],$$
where the first equality holds directly by the definition of $\hat{v}_G$, and the second holds because $\mu(P, \mathrm{supp}(Q))\, \mu(P, \mathrm{supp}(R)) = \mu(P, \mathrm{supp}(Q \oplus R)) = \mu(P, \mathrm{supp}(QR))$ when $Q, R \in \{I, X, Y, Z\}^{\otimes n}$.
The variance determines the sample complexity. In particular, by the Chebyshev inequality, a total number of measurements $T \ge \mathrm{Var}(\hat{v}_G)/(\delta \varepsilon^2)$ suffices to estimate $\mathrm{tr}(\rho O)$ with additive error $\varepsilon$ and success probability at least $1 - \delta$.

Illustration and comparison with other measurement schemes
We illustrate the differences between our OGM scheme and other measurement schemes in Fig. 1. As illustrated there, importance sampling selects an observable in each iteration and measures the prepared state with the sampled observable to obtain the associated estimate. The grouping strategy leverages the compatibility of the observables and measures compatible observables jointly. Nevertheless, it exploits only a very limited part of the full probability space over the $4^n$ possible measurements in $\{I, X, Y, Z\}^{\otimes n}$. Moreover, because the groups are determined by a heuristic strategy, an observable cannot appear in two different sets, so the grouping method might use its measurements inefficiently. The classical shadow method finds optimized probabilities for each qubit of the measurement and also measures observables jointly. However, since the CS method generates the Pauli operator on each qubit independently, it produces useless measurements, such as $X_1 Z_2 X_3$ in Fig. 1(c). In contrast, for any measurement $P$ in the set $P^{(1)}, \ldots, P^{(s)}$ generated by the OGM scheme, there exists at least one observable $Q^{(j)}$ such that $Q^{(j)} \triangleleft P$.
The overlapped grouping measurement framework defined in Eq. (10), without an explicit assignment of $P^{(k)}$ and $K$, covers importance sampling, LDF grouping, CS, and the "probabilistic version" of the derandomized CS algorithm in Eq. (9). The importance sampling, grouping, and CS algorithms can be regarded as special cases of the OGM framework with certain restrictions on the distribution of measurements $\{(P, K(P))\}$. We note that in our method $f_G(P, Q, K) = \chi(Q)^{-1} \delta_{Q \triangleleft P}$ may lead to more effective data post-processing, since it exploits all the compatibility properties of the observables in the overlapped sets; the "probabilistic version" of the derandomized CS algorithm in Eq. (9) also falls within this scope, since it is equivalent to the estimator of the OGM grouping strategy in Eq. (11).
We can also observe that OGM subsumes the classical shadow method: the OGM scheme reduces to the CS method when we choose $P^{(j)} \in \{X, Y, Z\}^{\otimes n}$ and restrict the probability distribution $K(P^{(j)})$ to a local product structure over the qubits. We remark that, unlike the local shadow methods, OGM does not measure redundant observables.
The estimator in Eq. (8) indicates that derandomization utilizes the compatibility of observables when measuring in the predetermined bases. Once the measurement bases are determined, it can be regarded as a special overlapped grouping method.
Two challenges remain in OGM: (1) determining the collection $S$ and (2) finding the probability distribution $K$. As in the grouping method, finding the optimal overlapped groups for given objective observables is NP-hard. In Sec. 4, we develop an explicit strategy that determines an approximate solution $S$ via a greedy algorithm based on the weights of the observables. To find the probability distribution $K$, we apply an optimization procedure that adaptively searches for the solution minimizing the estimator variance.

Explicit grouping strategies
We show in Algorithm 1 our strategy for determining the overlapped sets $e_1, \ldots, e_s$ and the associated probabilities $K_j := K(P^{(j)})$. The main idea is that, under the premise of covering all the objective observables, we add an observable that has not yet been accessed into a new set, and then add all compatible observables into this set. We give priority to observables with larger absolute weights, since they contribute more to the estimate. We note that the order in which new observables are added to an existing set influences the structure and the number of sets. Algorithm 1 provides a grouping strategy that adds new observables by their importance (weight) and tries to reduce the number of sets as far as possible. Meanwhile, the procedure guarantees that whenever an observable $Q^{(j)}$ is compatible with the measurement $P^{(i)}$, it is in the set $e_i$. See Appendix C for an alternative strategy, which has slightly better performance but relies on a more involved optimization procedure. The core steps of Algorithm 1 read:
9: add $Q^{(k)}$ into set $e_s$;
10: update $P^{(s)}$ to the joint basis of $P^{(s)}$ and $Q^{(k)}$;
11: let the initial probability of $P^{(s)}$ be the sum of the weights of all observables in this set;
12: for $k \leftarrow 1$ to $j - 1$:
13: if $Q^{(k)}$ is compatible with $P^{(s)}$ and not in $e_s$:
14: add $Q^{(k)}$ into set $e_s$, and update $P^{(s)}$ to the joint basis of $P^{(s)}$ and $Q^{(k)}$.
The algorithm outputs the measurements $\{P\}$ with non-optimized probabilities $\{K\}$.
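The greedy procedure can be sketched compactly as follows. This is our own reading of Algorithm 1, with guesses for details such as tie-breaking and the order in which compatible terms are absorbed, not the authors' implementation:

```python
# Greedy overlapped grouping sketch: heaviest uncovered observable seeds a
# new set, then all observables compatible with the running basis join it.
# Observables may appear in several sets (the overlap).

def covers(q, p):
    return all(qi == pi or qi == "I" for qi, pi in zip(q, p))

def compatible(q, p):
    return all(qi == pi or qi == "I" or pi == "I" for qi, pi in zip(q, p))

def join(p, q):
    # Least basis covering both p and q (assumes compatibility).
    return "".join(qi if pi == "I" else pi for pi, qi in zip(p, q))

def greedy_overlapped_grouping(observables):
    """observables: list of (pauli_string, weight); heaviest handled first.
    Returns overlapped sets, their bases, and unnormalized initial weights."""
    obs = sorted(observables, key=lambda t: -abs(t[1]))
    groups, bases, init_K = [], [], []
    covered = set()
    for q, _ in obs:
        if q in covered:
            continue                      # already placed in an earlier group
        e, p = [q], q                     # open a new set seeded by q
        for r, _ in obs:                  # greedily absorb compatible terms,
            if r != q and compatible(r, p):   # including already-covered ones
                e.append(r)
                p = join(p, r)
        covered.update(e)
        groups.append(e)
        bases.append(p)
        # Initial (unnormalized) probability: total weight measured by p.
        init_K.append(sum(abs(w) for r, w in obs if covers(r, p)))
    return groups, bases, init_K

H = [("XXI", 0.9), ("IXX", 0.5), ("ZII", 0.4), ("IZZ", 0.2)]  # toy Hamiltonian
groups, bases, init_K = greedy_overlapped_grouping(H)
print(bases)    # ['XXX', 'ZXX', 'ZZZ']: every term covered by some basis
```

Note how "IXX" lands in both the first and the second group, which is exactly the overlap that the non-overlapping grouping method forbids.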
Here the initial probability of $P^{(s)}$ is not chosen as the weight of $e_s$, since we wish to distinguish the importance of different sets and give more priority to the sets generated before $e_s$. We can then optimize $\{K\}$ to further minimize the estimator variance. However, the variance $\mathrm{Var}(\hat{v}_G)$ in Eq. (12) depends on the input state $\rho$, which is in general unknown. Alternatively, we consider the diagonal approximation of $\mathrm{Var}(\hat{v}_G)$ (see similar techniques in Ref. [27]), which is explicitly expressed as
$$l(\vec{K}) = \sum_j \alpha_j^2\, \chi(Q^{(j)}, \vec{K})^{-1}, \qquad (14)$$
where $\chi(Q, \vec{K}) = \sum_{P: Q \triangleleft P} K(P)$ and $\vec{K} := (K_1, \cdots, K_s)$ collects all the corresponding probabilities. We give mathematical support for using $l(\vec{K})$ as the cost function in Appendix E. There are several advantages of using the diagonal approximation $l(\vec{K})$ instead of the actual variance: (1) independence of the quantum state; (2) fast classical evaluation; (3) it captures the dominant contribution to the variance, since $|\mathrm{tr}(\rho Q^{(j)} Q^{(k)})| \le \mathrm{tr}(\rho Q^{(j)} Q^{(j)}) = 1$ for $j \neq k$. Therefore, we can regard $l(\vec{K})$ as the cost function and minimize it by optimizing over $\vec{K}$. From the expression of $l(\vec{K})$ in Eq. (14), we see that the cost function is not convex in $\vec{K}$, and hence there is no closed-form minimum. An estimate can be generated by searching for a local minimum of the cost function in Eq. (14). To obtain a better estimate and avoid being trapped in bad local minima, we slightly revise the cost function, as shown in the following subsection.
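The diagonal cost $l(\vec K)$ is indeed cheap to evaluate classically. A minimal sketch, with a toy Hamiltonian and bases of our own choosing (not from the paper):

```python
# Evaluate l(K) = sum_j alpha_j^2 / chi(Q^(j), K) for candidate distributions.

def covers(q, p):
    return all(qi == pi or qi == "I" for qi, pi in zip(q, p))

def diagonal_cost(observables, bases, K):
    """observables: list of (pauli, alpha); bases: candidate measurements;
    K: probabilities (summing to 1) aligned with bases."""
    cost = 0.0
    for q, a in observables:
        chi = sum(k for p, k in zip(bases, K) if covers(q, p))
        cost += a * a / chi            # requires chi(Q) > 0 for every term
    return cost

H = [("XXI", 0.9), ("IXX", 0.5), ("ZII", 0.4), ("IZZ", 0.2)]
bases = ["XXX", "ZXX", "ZZZ"]
print(diagonal_cost(H, bases, [0.5, 0.3, 0.2]))
print(diagonal_cost(H, bases, [0.4, 0.4, 0.2]))  # shifting weight away from
                                                 # the heavy term raises l(K)
```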

Optimization process
For the optimization process of the OGM method, we further speed it up by adaptively deleting the groups that have very small initial probabilities, until the cost function stops decreasing under the perturbation. Note that after cutting the groups with small weights, some observables with small coefficients disappear from the cost function. Therefore, we adjust the final cost function to
$$l'(\vec{K}) = \sum_{j:\, Q^{(j)} \in S} \alpha_j^2\, \chi(Q^{(j)}, \vec{K})^{-1} + T \sum_{j:\, Q^{(j)} \notin S} \alpha_j^2, \qquad (15)$$
where $T$ is the total number of samples, $Q^{(j)} \in S$ if there exists a set $e$ such that $Q^{(j)} \in e$, and the term over $Q^{(j)} \notin S$ is the penalty caused by deleting some sets. The form of the final cost function in Eq. (15) is inspired by the relationship between the variance and the number of samples. More specifically, the Chebyshev inequality $T \ge \mathrm{Var}(\hat{v})/(\delta \varepsilon^2)$ indicates that $\mathrm{Var}(\hat{v})$ is linear in $T$. Hence we introduce the term $\alpha_j^2 T$ to compensate the initial error $\varepsilon_0$ caused by excluding the observable $Q^{(j)}$. The initial error $\varepsilon_0 = \sum_{j:\, Q^{(j)} \notin S} \alpha_j\, \mathrm{tr}(\rho Q^{(j)})$ quantifies the bias of our estimate. We can search for an optimized $T$ in a real experiment at small input size, starting from an initial $T_0$. Since the cost function in Eq. (15) is not convex, we find a local minimum using nonconvex optimization methods.
Since our OGM method draws measurements from a probability distribution, the measurement accuracy may fluctuate. We derandomize the scheme by fixing almost all of the choices of measurements $P$ in the next subsection.

Sampling strategy
Suppose that we have determined the measurement basis sets $e_1, \ldots, e_s$ and the optimized probability distribution $\{(P, K(P))\}$ using the above strategy. In practical computation, we usually have a constraint on the maximum allowed number of measurements. In what follows, we provide a partially derandomized strategy for a given number of measurements $T$. For the $j$th measurement $P^{(j)}$ with sampling probability $K_j$, we deterministically assign $\lfloor T K_j \rfloor$ measurements of $P^{(j)}$, and select one additional $P^{(j)}$ with probability $T K_j - \lfloor T K_j \rfloor$, as shown in Algorithm 2. Observe that the estimator $\hat{v}_G$ does not rely on the ordering of the measurements. Let $M$ be the list of selected measurements, where a measurement is allowed to appear more than once. It is easy to check that the expected number of samples for $P^{(j)}$ equals $K_j T$ under Algorithm 2. Note that in Algorithm 2 the number of sampled measurements may not exactly equal $T$, although it is close to $T$ when $s \ll T$. Hence, in the numerical experiments, if the size of $M$ is less than $T$, we additionally add to $M$ the measurements $P^{(j)}$ satisfying $K_j T < 1$, in the descending order sorted in Step (1) of Algorithm 2. We provide detailed discussions of the variance of the partially derandomized strategy in Appendix D.
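The floor-plus-Bernoulli shot allocation can be sketched as follows. This is our reading of Algorithm 2 (the top-up step for an undersized $M$ is omitted), with hypothetical bases and probabilities:

```python
# Partially derandomized allocation: floor(T*K_j) deterministic shots per
# basis, plus one extra shot with the fractional probability T*K_j - floor.

import math
import random

def allocate_shots(bases, K, T, rng):
    """Return a shot list M whose expected count for basis j is K_j * T."""
    M = []
    for p, k in zip(bases, K):
        shots = math.floor(T * k)              # deterministic part
        if rng.random() < T * k - shots:       # Bernoulli(fractional part)
            shots += 1
        M.extend([p] * shots)
    return M

bases = ["XXX", "ZXX", "ZZZ"]                  # hypothetical bases
K = [0.55, 0.30, 0.15]
trials = 20000
counts = [0.0, 0.0, 0.0]
for _ in range(trials):                        # check E[#shots] = K_j * T
    M = allocate_shots(bases, K, T=10, rng=random.Random())
    for i, p in enumerate(bases):
        counts[i] += M.count(p) / trials
print(counts)                                  # close to [5.5, 3.0, 1.5]
```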

Numerical tests
In this section, we numerically demonstrate the overlapped grouping measurement algorithm for the energy estimation of molecular systems, and compare our method with other advanced measurement strategies, including LDF grouping, locally biased classical shadows, and derandomized classical shadows. We do not include the importance sampling method in the comparison, since its performance is worse than the others. Algorithm 3 gives the full estimation process of the OGM algorithm. In Step 2 of Algorithm 3, we begin the optimization from better-initialized probabilities by picking, out of 10 distributions uniformly randomly sampled around the initialized distribution $\vec{K}_s$ from Step 1, the one with the minimum cost function. To show the robustness of the OGM advantage, we also directly use the probabilities initialized in Algorithm 1, without performing Step 2, as the optimized measurement distribution, and report the resulting estimation errors with 1000 samples in Table 1.
We compare the measurement schemes for different molecular Hamiltonians, ranging from 4 to 16 qubits. We first consider Hamiltonian measurement on the ground state of molecular Hamiltonians, in which the fermionic Hamiltonians are mapped to qubit ones under the Jordan-Wigner (JW) transformation and the number of terms in the molecular Hamiltonian scales quartically with the system size. In practice, the cost function in Eq. (14) may leave the optimized probability distribution trapped in a local minimum. Here, we address this problem by adding an additional disturbance term to the cost function to escape local minima; the estimate $v$ is then calculated with Eq. (8). We compare the estimation error (averaged over 100 independent tests) using 1000 measurement samples in Table 1. Here the error is estimated with the formula
$$\varepsilon_v = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big(v_i - \mathrm{tr}(\rho O)\big)^2},$$
where the $v_i$ are the estimates from $N$ independent tests. The definition of $\varepsilon_v$ is consistent with the standard deviation of the estimate $v$. It is worth mentioning that we numerically show in Appendix G that $N = 1000$ independent experiments are sufficient to output a convincing estimation error. We also include the recently proposed derandomized classical shadow method, which is the current state of the art and has been numerically shown to outperform the other methods [33]. The numerical results again show that our OGM method achieves much higher accuracy than the other methods when the number of measurements is limited, including the derandomized classical shadow method, verifying its significant performance in practical computation. The OGM algorithm shows simultaneous advantages in energy estimation under different fermion-to-qubit encodings, including the Bravyi-Kitaev (BK) and parity encodings; we refer to Appendix G for the numerical results and a detailed comparison under these encodings. We also provide a comparison of the variances of the existing algorithms with the OGM algorithm in Appendix F. We further show in Figs. 3 and 4 that the advantage of OGM is independent of the input quantum state, where we approximate the expectation for the molecule H$_2$ (8 qubits) with JW encoding. In Fig. 3, we compare the errors of the LBCS, LDF grouping, derandomized CS, and OGM algorithms on 10 randomly generated 8-qubit states with (a) 1000 samples and (b) 10,000 samples. In Fig. 4, we further compare these algorithms on a randomly generated 8-qubit state as the number of samples increases, with both axes on logarithmic scales. Here we choose the input quantum state as an 8-qubit state with uniformly randomly generated real amplitudes.
We additionally provide experimental results in Appendix I. The experimental results clearly show a much faster convergence of our OGM method using a few hundred measurements, which aligns with our theoretical prediction and numerical simulation. We can observe that our method is practically useful even for the current generation of quantum devices.

Discussion and outlook
We introduce a unified framework of quantum measurement that reveals the underlying mechanism of the existing advanced measurement strategies, which are seemingly distinct from each other. We further propose the overlapped grouping measurement (OGM) scheme, which integrates the advantages of these typical measurement strategies. Our numerical results suggest a significant improvement over existing advanced measurement methods, showing that our method already demonstrates advantages in practical problems. Since efficient quantum measurement is crucial for many quantum algorithms and quantum information processing tasks, our work has wide applications, for example in variational quantum algorithms and quantum many-body tasks involving eigenenergy estimation [3,7,43,54], where we need to efficiently measure expectation values of complicated Hamiltonians $\langle H \rangle$ or their moments $\langle H^2 \rangle$ [64]. Our method can significantly reduce the measurement cost and hence speed up the quantum computation, especially when we aim to realize quantum advantage for realistic problems. Moreover, our method applies to adaptive variational quantum simulation, which requires a large number of measurements in each subroutine [25,75]. We expect our measurement scheme to show even greater advantages with increasing system size, which is of great practical relevance to both theoretical and experimental tasks. The optimization goal of the OGM algorithm is completely different from that of the derandomized CS method: here we use a partial variance as the cost function, while the derandomized CS algorithm uses a confidence bound. The numerical results also show that our algorithm has clear advantages for a large number of measurements. Our work considers explicit strategies for choosing the overlapped sets, which could be improved using more advanced classical algorithms.
Note that the dimension of the considered measurement space is $4^n$, and both the OGM and CS-variant algorithms aim to find a good distribution in this huge space. The expressivity of the CS algorithm is limited, since it explores only a $3n$-parameter product subset of this space. One possible approach is to combine the OGM algorithm with CS for molecules in which subsets of qubits are strongly correlated. Briefly speaking, we could use the CS method to generate $n/s$ independent subspaces, each grouped with OGM over a space of dimension $4^s$. We leave this idea as interesting future work. Another possible extension is to use neural networks to generate samples within the OGM framework [8,62,63]. In this work, we assume local Pauli measurements, whereas more general measurements, such as arbitrary local measurements or entangled measurements, could be considered [35,36,44,49,55,70,76]. Several measurement schemes have been proposed that add a polynomial-depth circuit before the local measurement to implement entangled measurements [24,34,70]. How to extend our OGM scheme to generalized measurements is an interesting future direction. The source code for the OGM optimization process is available at https://github.com/GillianOoO/Overlapped-grouping-measurement.
A Variance analysis

We further provide the variance of the derandomization algorithm and show its relation to our overlapped grouping measurement method. Let $\hat{v}_1, \ldots, \hat{v}_T$ be the estimates after $T$ independent samples, and let $\hat{v} = (\hat{v}_1 + \cdots + \hat{v}_T)/T$ be their average. Then by the Chebyshev inequality, we have
$$\Pr\big[|\hat{v} - \mathrm{tr}(\rho O)| \ge \varepsilon\big] \le \frac{\mathrm{Var}(\hat{v}_1)}{T \varepsilon^2}.$$
Hence the error is bounded by $\varepsilon$ with failure probability $\delta$ when the number of samples is $T \ge \mathrm{Var}(\hat{v}_1)/(\delta \varepsilon^2)$.
$l_1$-sampling.-The variance of $\hat{v}_1$ generated by $l_1$-sampling can be bounded by $\mathrm{Var}(\hat{v}_1) \le \|\alpha\|_1^2$. Hence $T \ge \|\alpha\|_1^2/(\delta \varepsilon^2)$ suffices to give an estimate with error less than $\varepsilon$ and success probability $1 - \delta$.
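As a small worked instance of the Chebyshev-type bound above, with toy coefficients of our own choosing:

```python
# T >= Var(v_1) / (delta * eps^2), with Var(v_1) <= ||alpha||_1^2 for
# l1-sampling; the coefficients below are hypothetical.

alphas = [0.5, -0.3, 0.2]                 # toy Hamiltonian coefficients
l1 = sum(abs(a) for a in alphas)          # ||alpha||_1 = 1.0
eps, delta = 0.1, 0.1                     # target error / failure probability
T_l1 = l1 ** 2 / (delta * eps ** 2)       # l1-sampling sample bound
print(T_l1)                               # about 1000 copies suffice
```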
Grouping.-The variance of $\hat{v}_1$ generated by the grouping method satisfies
$$\mathrm{Var}(\hat{v}_1) \le \sum_k \frac{\|\alpha_{e_k}\|_1^2}{K(P^{(k)})} - \mathrm{tr}(\rho O)^2,$$
where $\|\alpha_{e_k}\|_1 = \sum_{j: Q^{(j)} \in e_k} |\alpha_j|$; with the importance-sampling choice $K(P^{(k)}) = \|\alpha_{e_k}\|_1/\|\alpha\|_1$, the bound becomes $\|\alpha\|_1 \sum_k \|\alpha_{e_k}\|_1 - \mathrm{tr}(\rho O)^2$. Classical shadow.-For the classical shadow algorithm, the variance of the generated estimate $\hat{v}_1$ satisfies
$$\mathrm{Var}(\hat{v}_1) \le \Big(\sum_j |\alpha_j|\, \chi(Q^{(j)})^{-1/2}\Big)^2,$$
where $\chi(Q^{(j)}) = \prod_{i \in \mathrm{supp}(Q^{(j)})} K_i(Q^{(j)}_i)$ reduces to $3^{-|\mathrm{supp}(Q^{(j)})|}$ for the uniform distribution.

B LDF Grouping method
In the LDF grouping method, we map the observables and the "compatible with" relation to a graph. Specifically, we denote each observable by a vertex, and connect two observables by an edge if they are not compatible. We then obtain a graph $G(V, E)$ whose number of vertices equals $m$. Next, we can proceed with the grouping method as follows.
(1) Sort the vertices in descending order of their degrees (largest degree first).
(2) Repeat the following step until every vertex is in one of the sets.
(3) For $j$ from 1 to $m$ (in the sorted order), if $V^{(j)}$ is not yet in any set, add $V^{(j)}$ to a set such that there is no edge between $V^{(j)}$ and any vertex in this set. If no such set exists, add $V^{(j)}$ to a new set.
After the above process, changing each $V^{(i)}$ back into $Q^{(i)}$, we obtain grouping sets in which any two observables of the same set are compatible with each other.
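The three steps above amount to greedy coloring of the incompatibility graph in largest-degree-first order. A runnable sketch (our own implementation, not the paper's code):

```python
# LDF grouping: build the incompatibility graph, sort vertices by degree
# (largest degree first), then greedily place each vertex into the first
# set containing no conflicting edge.

def compatible(q, r):
    return all(a == b or a == "I" or b == "I" for a, b in zip(q, r))

def ldf_grouping(paulis):
    n = len(paulis)
    # Edge (i, j) iff the two observables are NOT compatible.
    adj = {i: {j for j in range(n)
               if j != i and not compatible(paulis[i], paulis[j])}
           for i in range(n)}
    order = sorted(range(n), key=lambda i: -len(adj[i]))  # largest degree first
    sets = []                                             # lists of vertex ids
    for v in order:
        for s in sets:
            if not (adj[v] & set(s)):                     # no edge into s
                s.append(v)
                break
        else:
            sets.append([v])                              # open a new set
    return [[paulis[i] for i in s] for s in sets]

groups = ldf_grouping(["XXI", "IXX", "ZII", "IZZ", "ZZI"])
print(groups)   # two sets; all pairs within a set are compatible
```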

C Greedy overlapped grouping strategy
Aside from the overlapped set generation strategy in Algorithm 1 of the main text, we propose an alternative strategy that differs slightly in how it selects observables for a set, denoted as grouping version 2. The main difference lies in the order in which observables are added to a new set. The new strategy has potentially better performance but needs more time, since it produces a larger number of sets. We further add a token to each observable that limits how many times the observable can be visited, to avoid an explosion of the number of sets. Note that without this restriction, observables at the tail of the sequence would be much harder to add to existing sets, and the number of sets could become very large. Let the token of an observable $Q^{(k)}$ be $U_k = 2^{d-1}$, where $d$ is the number of digits of $\lceil |\alpha_k|/\mathrm{MinWeight} \rceil$ and $\mathrm{MinWeight} = \min_{j \le m} |\alpha_j|$ is the minimum weight. This version of the grouping strategy is depicted in Algorithm 4.
We compare the estimation errors of grouping versions 1 and 2 in Table 2. The table shows that when we give more consideration to the observables with larger weights, we obtain better optimized results, at the cost of a longer optimization time due to the larger number of sets (optimized parameters).

D Variance for the partially derandomized strategy
Suppose that we have determined the measurement basis set $\mathcal{M}$, in which $M_k$ measurements are assigned to the basis $P^{(k)}$, so that the total number of measurements is $T = \sum_k M_k$. We now derive the variance of the partially derandomized strategy for a given measurement budget $T$. We denote by $e_k$ the set of all $Q^{(j)}$ that element-wise commute with the basis $P^{(k)}$, and by $s_k$ the total number of times that $Q^{(k)}$ is effectively measured, $s_k = \sum_{P \in \mathcal{M}} \delta(Q^{(k)}, P)$, where $\delta(Q^{(k)}, P) = 1$ if $Q^{(k)}$ can be measured with the basis $P$ and $0$ otherwise. Let $t_{j,k}$ be the number of times that the measurement outcome of the $j$th observable $Q^{(j)}$, measured with the basis $P^{(k)}$, is $+1$. The outcome associated with the measurement $P^{(k)}$ for the observable $Q^{(j)}$ is thus $\hat v_{j,k} = 2 t_{j,k}/M_k - 1$, and the estimator $\hat v$ is expressed in terms of these outcomes.

Algorithm 4 (excerpt): greedy overlapped grouping with tokens.
6:  Initialize the measurement of $e_s$ as $P^{(s)} \leftarrow Q^{(j)}$;
7:  for $k \leftarrow 1$ to $m$:
8:      if $Q^{(k)}$ is compatible with $P^{(s)}$, $k \neq j$, and the number of sets in which $Q^{(k)}$ appears is less than its token $U_k$:
9:          add $Q^{(k)}$ to the set $e_s$, and update $P^{(s)}$ to $P^{(s)} Q^{(k)}$;
10: let the initial probability of $P^{(s)}$ be the summation of the weights of all observables in this set;
11: for $k \leftarrow 1$ to $j - 1$:
12:     if $Q^{(k)}$ is compatible with $P^{(s)}$ and not in $e_s$:
13:         add $Q^{(k)}$ to the set $e_s$, and update $P^{(s)}$ to $P^{(s)} Q^{(k)}$;

Here, we use the fact that measurement outcomes obtained from different bases $P^{(k)}$ are independent, since the measurements $P^{(k)}$ are performed independently of each other. We also note that the outcomes $\hat v_{j,k}$ are correlated, so the variance in Eq. (21) depends on the covariance $\mathrm{Cov}(\hat v_{j,k}, \hat v_{j',k})$.
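As a toy illustration of the single-basis outcome $\hat v_{j,k} = 2 t_{j,k}/M_k - 1$, one can simulate the $M_k$ shots as independent $\pm 1$ outcomes. This is a sketch assuming a simple Bernoulli model for the $+1$ frequency; `estimate_pauli` is a hypothetical name, not the authors' code.

```python
import random

def estimate_pauli(p_plus, M, rng):
    # Draw M single-shot outcomes for one observable under one basis:
    # each shot is +1 with probability p_plus, otherwise -1.
    # t counts the +1 outcomes; the empirical expectation is 2 t / M - 1.
    t = sum(rng.random() < p_plus for _ in range(M))
    return 2 * t / M - 1

rng = random.Random(0)
# With p_plus = 0.9 the true expectation is 2*0.9 - 1 = 0.8;
# the estimate concentrates around 0.8 as M grows.
est = estimate_pauli(0.9, 100_000, rng)
```

Outcomes drawn for different bases use independent shot budgets $M_k$, matching the independence assumption used in the variance derivation above.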

E Relationship between cost function and variance
Let $o_j := f(P, Q^{(j)}, K)\,\mu(P, Q^{(j)})$ be the estimation of $\mathrm{tr}(\rho Q^{(j)})$. The cost function in the manuscript is chosen as $\sum_j \alpha_j^2\, \mathbb{E}[o_j^2] = \sum_j \alpha_j^2/\chi(Q^{(j)})$ to evaluate $\mathrm{Var}(\hat v)$. In the following, we relate $\mathrm{Var}(\hat v)$ to this cost function and to the cross term $\sum_{j \neq k} \alpha_j \alpha_k \sum_{P:\, Q^{(j)} \triangleright P,\; Q^{(k)} \triangleright P} K(P)$, where $Q \triangleright P$ denotes that the observable $Q$ can be measured with the basis $P$.
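Under the assumption that $\chi(Q^{(j)}) = \sum_{P:\, Q^{(j)} \triangleright P} K(P)$ is the probability that a basis drawn from the distribution $K$ covers $Q^{(j)}$, the cost function can be evaluated numerically as in this sketch (function names are illustrative, and qubit-wise compatibility is assumed for $\triangleright$):

```python
def compatible(q, p):
    # Qubit-wise compatibility of two Pauli strings: Q can be measured
    # with basis P if at every position they agree or Q acts as I.
    return all(a == b or a == "I" for a, b in zip(q, p))

def cost(alphas, observables, bases, K):
    # Cost function: sum_j alpha_j^2 / chi(Q_j), with
    # chi(Q) = sum over bases P covering Q of their probability K(P).
    total = 0.0
    for a, q in zip(alphas, observables):
        chi = sum(k for p, k in zip(bases, K) if compatible(q, p))
        total += a ** 2 / chi
    return total
```

For example, with a single basis "ZZ" drawn with probability one, both "ZI" and "IZ" are covered, so each term contributes $\alpha_j^2/1$.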

F Variance comparison
We compare the variances of different measurement schemes in Table 3, where the initial point is chosen directly from Algorithm 1. Here we generate the measurement distribution by choosing $T = 1000$ in Eq. (15). Up to a negligibly small error $\varepsilon_0$, we find that our method has a much smaller variance than LDF grouping and LBCS. The improvement becomes more prominent for larger molecules (approximately one order of magnitude compared to the classical-shadow methods), which indicates its effectiveness for large practical problems with a limited number of measurements.

G Numerical results and discussions
In this section, we numerically demonstrate the advantages of our OGM algorithm over $\ell_1$-sampling, LDF grouping, LBCS, and the derandomized CS algorithm by computing the corresponding variances and errors. Table 4 compares the variances of the OGM, $\ell_1$-sampling, LDF grouping, and LBCS algorithms under different fermionic-to-qubit encodings, including the JW, BK, and parity encodings. The last column shows the deviation of the OGM algorithm after a small perturbation of the cost function, as introduced in the main text.
We compare the estimation accuracy with 1000 measurements for the molecules H$_2$, LiH, BeH$_2$, and H$_2$O under the JW, BK, and parity encodings in Table 5, where the initial probability distribution for the OGM algorithm is chosen directly from Algorithm 1.

H Error of the estimation
We use the root-mean-squared error (RMSE) to quantify the error of the estimation. In each repetition, we generate an estimate $\hat v_i$ by independently performing $T$ measurements on the initial state $\rho$. Over $N$ independent repetitions, the average error of the $T$-sample estimates is $\epsilon = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat v_i - \mathrm{tr}(\rho O))^2}$. We plot this quantity to show that the average error fluctuates within a small range after more than 10 repetitions. In the simulation experiments of the main text (Table 1 and Fig. 2), we choose $N = 100$ for the molecules with fewer than 14 qubits. For the NH$_3$ molecule, we let
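The averaged error over the $N$ repetitions can be computed as in this minimal sketch (`rmse` is an illustrative name):

```python
import math

def rmse(estimates, exact):
    # Root-mean-squared error over N independent repetitions,
    # each estimate built from T measurements of the state rho.
    n = len(estimates)
    return math.sqrt(sum((v - exact) ** 2 for v in estimates) / n)
```

Larger $N$ smooths out the shot-to-shot fluctuation of the reported error, which is why the error stabilizes after sufficiently many repetitions.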

I Experimental results
The numerical study above ignores device errors, and how noise in realistic hardware affects measurement efficiency is critical for assessing the practical performance of these schemes on real quantum devices. To further demonstrate the advantage of our OGM method on current quantum devices, we implement and compare the measurement schemes on IBM quantum cloud hardware with device imperfections. We aim to estimate $\mathrm{tr}(|\psi\rangle\langle\psi|\, O_{\mathrm{H}_2})$ with the GHZ state $|\psi\rangle = (|0000\rangle + |1111\rangle)/\sqrt{2}$ and the four-qubit Hamiltonian $O_{\mathrm{H}_2}$ of the H$_2$ molecule under the JW encoding. We note that the GHZ state has a much larger variance than the ground state, and is thus a suitable testbed for comparing the performance of different measurement schemes. In Fig. 6, we compare the estimation errors of the $\ell_1$-sampling, LDF-grouping, LBCS, derandomized-CS, and OGM methods for different numbers of copies (samples) of the prepared entangled state. We evaluate the error by comparing against the reference result obtained using the OGM method with 49140 samples.
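For reference, the ideal (noise-free) expectation values of Pauli terms on the four-qubit GHZ state can be checked with a small statevector sketch. This is illustrative only, not the hardware experiment; `apply_pauli` and `expectation` are hypothetical helper names.

```python
def apply_pauli(p, bit):
    # Action of a single-qubit Pauli on a computational basis bit:
    # returns (phase, new_bit).
    if p == "I":
        return 1, bit
    if p == "X":
        return 1, bit ^ 1
    if p == "Y":
        return (1j if bit == 0 else -1j), bit ^ 1
    return (1 if bit == 0 else -1), bit  # "Z"

def expectation(state, pauli):
    # <psi| P |psi> for an n-qubit state given as {basis_int: amplitude}.
    n = len(pauli)
    val = 0
    for idx, amp in state.items():
        phase, out = 1, 0
        for q in range(n):
            bit = (idx >> (n - 1 - q)) & 1
            ph, nb = apply_pauli(pauli[q], bit)
            phase *= ph
            out |= nb << (n - 1 - q)
        val += state.get(out, 0).conjugate() * phase * amp
    return val.real

# GHZ state (|0000> + |1111>)/sqrt(2), stored sparsely.
ghz = {0b0000: 2 ** -0.5, 0b1111: 2 ** -0.5}
```

For the GHZ state, even-weight $Z$ strings and the all-$X$ string have expectation $+1$, while any single-qubit $Z$ term vanishes, which is one reason its estimator variance is large for local measurement schemes.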