Eﬃcient quantum measurement of Pauli operators in the presence of ﬁnite sampling error

A key result of interest is the maximum number of two-qubit gates required to obtain a measurement of all the operators in a given Hamiltonian. We show the theoretical maximum, given by the largest value of kn − 12 k ( k + 1) for any collection, and the true largest value for any collection. The ratio of these two numbers is shown in Fig. 3(c). We also show the mean number of two-qubit gates required in a rotation circuit, averaged across all collections for a given molecule.


Introduction
Estimating the expectation value of an operator corresponding to an observable on a quantum state is a fundamental task in quantum mechanical experiments. Expectation estimation of a Hamiltonian features prominently as the quantum sub-routine of the Variational Quantum Eigensolver (VQE) algorithm [1], which has emerged as a leading candidate for exhibiting quantum advantage in the Noisy Intermediate Scale Quantum era [2]. VQE is a hybrid quantum-classical algorithm designed to find the ground state [1,[3][4][5][6][7][8][9], or energy spectrum [10][11][12][13][14][15], of a physical or chemical system. Often there is no natural way to obtain the expectation of an operator directly and some indirect protocol is required. In particular, this is true of current quantum computers, which can only obtain measurements in the computational basis defined, by convention, as eigenstates of the Pauli-Z operator on each qubit.
One naive method of proceeding, therefore, is to decompose the operator into a weighted sum of Pauli operators (or Paulis) and then measure each separately. The paper that introduced VQE [1] proposed measuring the Hamiltonian in this way. However, this can be inefficient. For example, a second-quantised chemical Hamiltonian on n qubits decomposes into a very large number of Paulis that scales as n 4 . An improved method, therefore, is to arrange the Paulis in commuting collections. All Paulis in a collection can be measured at the same time, as any set of commuting Paulis can be simultaneously diagonalised by a single unitary. There are typically many arrangements of Paulis in mutually commuting collections, and the reduction in the number of measurements required will depend on the arrangement.
In the context of VQE, McClean et al. [3] proposed this improved protocol and argued using a toy example that the arrangement with the fewest collections might not result in the fewest measurements as splitting a single commuting collection into two might sometimes offer an improvement. However, Ref. [3] did not show how to obtain commuting collections nor how to construct the unitary rotation, U , that enables simultaneous measurement.
Recently, a series of papers [16][17][18][19][20][21][22] * have appeared that make progress on both the collecting strategy and rotation circuit construction problems. Our paper is in the same arena and addresses both problems.
First, we define two natural metrics, R andR, that quantify the performance of any given arrangement. R andR measure the ratio of the number of measurements required in the uncollected case to the collected case to attain a fixed level of accuracy. The key feature of these two metrics is that they assume measurements are distributed optimally between the collections to minimise the finite sampling error, following Refs [8,25,26]. The difference between them is that R is state-dependent butR is designed to approximate E[R] over the uniform spherical measure. Therefore, R is more suitable for use given knowledge of the underlying quantum state, whileR is more suitable otherwise.
With R andR defined, we prove a first result that breaking a single commuting collection into two is never advantageous, as the number of measurements required to obtain an expectation estimate to a given accuracy is never reduced. This result already contradicts the conclusion of the aforementioned toy example used by McClean et al. [3], and analysed in full in Ref. [20,Sec. 10.1], that breaking a collection can be advantageous. The reason for the discrepancy is that we distribute measurements optimally among the collections so as to minimise the overall error due to finite sampling for a given number of measurements. In contrast, in the previous works, measurements are distributed uniformly between the collections.
We further propose an intrinsically new collecting strategy, which we call Sorted Insertion, that is designed to maximise R andR. Unlike all strategies used previously that seek the minimum number of collections [16,17,20,21], Sorted Insertion seeks to maximiseR by explicitly exploiting the coefficients of the Paulis in its assignment of collections. It is important to stress that maximisingR is not the same as minimising the number of collections, as we show by a toy example. Perhaps counter-intuitively, this does not contradict the above conclusion, which showed only that it is never better to break a single collection into two.
When considering the rotation circuit construction problem, we contribute two new methods for constructing Clifford rotation circuits that enable simultaneous measurement of a collection containing arbitrary commuting Paulis. Like Ref. [20], we approach the problem via the stabiliser formalism but further * We mention that both the collecting and rotation circuit synthesis problems have been addressed on an ad-hoc basis by experimentalists since Kandala et al. [5]. We refer readers to Table 2 of Ref. [20] for a good summary. We also mention that recent Ref. [23] and the less recent Ref. [24] allow for collecting of non-commuting Paulis. We found it interesting that such approaches are feasible, but found it hard to compare them to our work on a like-for-like basis.
consider the case where the number of independent operators, k, in a collection can be less than the number of qubits, n. We show that the number of twoqubit gates in the rotation circuit, U , can be reduced in a way that scales with k. This is important because it is atypical for actual collections to have exactly k = n independent Paulis and reducing the number of two-qubit gates is important, especially in the nearterm [27][28][29][30][31][32][33][34][35][36][37][38][39][40][41]. As far as we are aware, ours is the first paper to consider the k < n case explicitly. Furthermore, we allow classical post-processing, which can save quantum resources.
More specifically, we introduce constructions "CZ" and "CNOT". The CZ-construction builds on work by Van den Nest, Dehaene, and De Moor [42] in the graph-state literature to yield U with a number of two-qubit gates at most The CNOT-construction builds on our CZconstruction, and work by Aaronson and Gottesman [43] and Patel, Markov, and Hayes [44] to yield U with a number of two-qubit gates at most We stress that u cnot and u cz are worst-case upper bounds. In practice, numerical simulations are needed to determine whether the CZ-or CNOT-construction is actually more efficient. We note that, in the case of k = n, our constructions have two-qubit gate-counts scaling no worse than the previous best of O(n 2 ) [20].
Other works, such as Ref. [ [19,22]. We end our paper with a series of numerical results on molecular Hamiltonians that serve to illustrate and validate our theoretical work. In doing so, we first quantify the performance of Sorted Insertion using the metricR for molecules ranging in size from hydrogen H 2 , which requires two qubits, to hydrogen selenide H 2 Se, which requires 38, finding that it leads to a 10 to 60 fold improvement in the number of measurements required. Note that we are defining a single measurement to consist of a measurement of all qubits, and so the number of measurements equals the number of ansatz state preparations.
We further present results of using Sorted Insertion alongside our CZ-construction on molecules requiring up to 38 qubits to calculate the actual number of two-qubit gates required for real molecular systems. Our numerical results show that the typical number of two-qubit gates is fewer than the worst-case u cz (k, n) by a factor of 3.5.
Lastly, we present data showing Sorted Insertion outperforming the four conventional greedy colouring algorithms we tested, as measured by the metricR. Our data strongly challenges the assumption that minimising the number of collections is best, as arrangements with the smallest number of collections do not typically perform the best with respect toR.
We mention that an entirely different measurement strategy has been proposed recently [45] that is based on so-called "classical shadows". Ref. [45,Illustrative Example Applications] explicitly mentions that their method is unsuitable for estimating expectation values of global observables. Even for a Hamiltonian consisting of a single global Pauli operator P := P 1 ⊗ P 2 ⊗ · · · ⊗ P n , with P i ∈ {X, Y, Z}, their number of classical shadows must scale as 2 Ω(n) . In contrast, our method, as with the original VQE proposal, scales independently of n and would simply measure each P i and multiply the outcomes.
Follow-up Ref. [46] sought to ameliorate this problem by using so-called "locally-biased classical shadows". However, as they mention in [46,Remark 1], the analytical upper bound on their estimator can still scale as 2 Ω(n) for the same P above. Separately, Ref. [46] did not compare their techniques to ours, rather they observe that we require deeper circuits. However, the additional depth our method requires is modest relative to that of ansatz state preparation.
There have also been interesting developments [47,48] on measurement reduction in VQE focusing on designing classical optimisers that intelligently choose how many measurements to use at each iteration -for example, fewer measurements may be used near the start versus near the end. These methods complement ours and can be combined with ours to reduce further the overall number of measurements.

Collecting strategies
In this section, we develop a method for collecting Pauli operators designed to minimise the number of measurements required to obtain an estimate of the expectation value of an operator to a given level of accuracy.
We have an operator, O, of the form where N is the number of collections of mutually commuting operators, m i is the number of operators in collection i, P ij is the jth Pauli operator in the ith collection and a ij ∈ R is its coefficient. We assume that the P ij s are distinct for different (i, j), i.e., we do not collect a single Pauli operator appearing in O into more than one collection by splitting its coefficient. This is discussed further at the end of this section.
Given ǫ, let M u and M g be the minimum number of measurements required to attain an accuracy ǫ in the uncollected and collected (as per Eq. 3) cases respectively. Finding M u is a special case of finding M g . To find M g , we can solve the constrained optimisation problem that asks how a given number of measurements should be distributed in order to maximise accuracy. The solution to such a problem gives the optimal measurement strategy for a given operator and state. Using Lagrange multipliers [8,25,26], we find that the optimal number n i of measurements of collection i is Therefore, we have where Since M u is just M g evaluated with every operator in its own collection, we have where The ratio R, defined as therefore acts as a natural metric for the performance of a particular arrangement of Paulis under the assumption that measurements are distributed optimally. The larger the value of R, the greater the saving that assembling operators into collections provides. In Theorem 1 below, we show that combining two collections into one can only improve R.
Proof. We write the operators associated with collections A and B as for some k, l respectively, where the indexing of the sum in O B is for later convenience. The operator associated with C can therefore be written as We define the (k + l) × (k + l) covariance matrix C, which is symmetric and positive semi-definite, associated with collection C by its entries with We write a for the vector of length k + l whose first k entries are given by {a j } k j=1 and the remaining l entries are zero. Similarly, we write b for the vector of length k + l whose first k entries are zero and the remaining l entries are given by {a j } k+l j=k+1 . Since the numerator of R is constant for different arrangements, to prove our claim it suffices for us to show that the denominator of R is not smaller in the case of {A, B} than in the case of {C}, i.e., that With the above notation, this is equivalent to where the fourth line follows due to the Cauchy-Schwarz inequality on a semi-inner product defined by a, b := a T Cb [ Theorem 1 shows that it is impossible to mitigate covariances by splitting collections under the optimal measurement strategy. This is in contrast to Refs [3,20], which showed that it is possible using a sub-optimal measurement strategy. In Appendix A, we re-do precisely their example using the optimal measurement strategy. Note that Theorem 1 simply implies that it is never better to break a single collection into two, and not that the minimum number of collections is necessarily the best; indeed, we provide a counter-example below.
If all of the variances in R are replaced by their expectation values over the uniform spherical measure (see Ref. [50,Ch. 7]), we obtain another metric,R, given byR (20) The derivation ofR is given in Appendix B. The same proof as in Theorem 1 can be used to show that breaking a collection into two is never better when measured byR; the only difference is the covariance matrix must be replaced by its expectation over the uniform spherical measure. R is a particularly useful metric because it approximates R, but can be calculated analytically. When ignorant of the quantum state |ψ , a good collecting strategy should aim to maximiseR. As mentioned above, this is emphatically not the same as minimising the number of collections. For example, let us consider the operator O, on two qubits, defined as The operators cannot be assembled into a single commuting collection but there is a unique arrangement, G 1 , consisting of two collections, given by which hasR = 1.47 to two decimal places. G 1 performs worse than an arrangement G 2 , with three collections, given by which hasR = 1.71 to two decimal places. Reflecting upon such examples and the nature of the square root function in the denominator ofR leads us to propose a collecting algorithm that prioritises collecting Paulis with large coefficients.
This algorithm, which we name Sorted Inser-tion, can be described as follows. Given an operator where t is the total number of Pauli operators, the entire set {(a j , P j )} t j=1 is sorted by the absolute value of coefficients a j , so that |a 1 | ≥ · · · ≥ |a t |. Then, in the order i = 1, . . . t, it is checked whether P i commutes with all elements in an existing collection. If it does, it is added to that collection. If not, a new collection is created and P i is inserted there. The collections are checked in order of their creation * . The collections formed are tracked and outputted at the end once the final P t has been inserted.
Sorted Insertion has worst-case complexity O(nt 2 ), where we recall that n is the number of qubits. In contrast, greedy colouring algorithms, as implemented in Ref. [16], require pre-generating the full commutation graph which has complexity Θ(nt 2 ) even in the best case. The colouring algorithms then run on this graph adding further complexity -see Table 1. Therefore, Sorted Insertion is at least as efficient as these greedy colouring algorithms.

Colouring Algorithm
Time Complexity Note that Sorted Insertion is a heuristic algorithm for maximisingR. It works well in practice, as demonstrated by Table 3 in Sec. 4, but may fail to output collections that actually maximiseR. In fact, it is unlikely we can go beyond heuristics as the problem of maximisingR is NP-hard in general. We can show this by combining the reduction from Min-Clique-Cover in Ref. [20, Appendix A] with the NP-hardness of |V | 1−ǫ -approximating the Min-Clique-Cover [52], where |V | is the number of vertices and ǫ = 0.5. We appeal to the hardness of approximation because, as we have stressed, maximisinĝ R is not the same as minimising the number of collections, which corresponds to minimising number of cliques in a cover. For a full proof, see Appendix F.
Lastly, we mention that our choice not to put a single Pauli into multiple collections by splitting its coefficient may be sub-optimal as can be seen by con-* We note that other collection orderings could also be considered, for example, ordering the collections by their values of sidering the operator O, on two qubits, defined as and a grouping of the form for α ∈ [0, 2]. It can be verified thatR, defined as before even with coefficient splitting, is maximised at α = 1. We leave maximisingR with consideration of coefficient splitting for future work.

Rotation constructions
In this section, we present two methods of calculating a rotation circuit which enables measurements of all operators in a mutually commuting collection to be obtained simultaneously. We assume familiarity of the reader with the stabiliser formalism, especially the 2nbit binary representation of n-qubit Paulis [43,[53][54][55].
We follow the convention that the upper and lower halves of the binary matrix encode Z and X operators respectively. This representation is reviewed in Appendix C. We also reserve symbols I m and 0 m for the m × m identity and all-zero matrices respectively. Our starting point is a commuting collection, S ′ start , of m Paulis which can be represented as a binary 2n× m matrix S ′ start . By Gaussian elimination on S ′ start , we can form a 2n × k matrix S start representing a set S start of k independent Paulis drawn from S ′ start where k ≤ min(n, m). Our goal is to transform S start , using certain allowed transformations, into a 2n × n matrix S end where Let U denote the circuit consisting of one-qubit and two-qubit transformations in the order they were applied from S start → S end . Then applying U to any state |ψ , measuring in the computational basis, and classically post-processing allows us to obtain measurements of all operators in S ′ start on |ψ simultaneously.
The allowed set, T , of transformations on a binary 2n × m matrix S is, where p ranges over all columns, r ranges over all rows, and addition is mod 2: 1. One-qubit and two-qubit quantum gates, corresponding to row operations, specifically: CZ on qubits i and j: CNOT on control-qubit i and target-qubit j: Hadamard (H) on qubit i: 2. Classical post-processing: Products of eventual single-qubit computational-basis measurements: right-multiplication by invertible m × m matrix. Relabelling of qubits i and j:

Basis extension:
Addition of further stabiliser: appending of new column S r,m+1 .
In the near term, operations in T have justifiably different costs. Two-qubit gates are much more costly than one-qubit gates, which are more costly than classical post-processing. Basis extension has no cost.
In presenting our constructions, we shall refer to the commutativity condition, preserved under T , given by where S is the 2n × m matrix encoding the Paulis and We ignore any changes in sign of stabilisers under T as this can be easily accounted for by classical post-processing. Readers interested in this and other details are referred to Appendix D, where we work through our CZ-construction with a specific example.

CZ-construction
Important to our first approach is the special class of stabiliser states known as graph states. Consider any graph G on n vertices. The graph state |Φ G is then defined by n independent stabiliser generators where nbd(i) is the set of neighbours of vertex i in G.
The binary representation of these stabilisers is where A, an n × n symmetric matrix with 0s on its diagonal, is exactly the adjacency matrix of G.
It is well-known that |Φ G = V H ⊗n |0 n where V is a product of CZ gates and H is the Hadamard gate. More specifically, V applies CZ between qubits i and j if and only if vertex i neighbours j in G. Van den Nest, Dehaene, and De Moor [42] tell us that any stabiliser state can be transformed to a graph state by a product of single-qubit Clifford gates and classical post-processing. It is therefore clear that we can transform any S start to S end via S graph using at most n(n − 1)/2 two-qubit (CZ) gates, as this is the maximum number of edges on an n-vertex graph. The interesting question is whether we can do better by exploiting the potential low rank k ≤ n of S start .
Our answer is in the affirmative and we now present an explicit and efficient algorithm that constructs U with at most u cz (k, n) = kn − k(k + 1)/2 two-qubit gates. In Fig. 1, we illustrate the sequence of reductions that allow us to reach S graph , and so S end , from S start . We now describe the salient aspects of each step.
1. S start → S 1 . Following Ref. [43,Lemma 6], we can apply Hadamard gates so that B has rank k. By classical row-swaps (relabelling of qubits), we can ensure that the first k rows of B have full-rank.
Since the upper k × k submatrix of B has full-rank, column operations corresponding to classical post-processing can reduce it to I k .
3. S 2 → S 3 . Additional columns, corresponding to further operators, are appended. We can directly verify that the extension to S 3 is valid by Eq. (28). Clearly S 3 has full column-rank n.
The sparsity of our chosen extension shall play a crucial role in the reduced two-qubit gate size of U when k < n in both the CZ-and CNOTconstructions. 4. S 3 → S 4 = S graph . Column operations can eliminate F , then Phase gates can ensure E has ze-ros on its diagonal. Importantly, S 4 represents a graph state S graph .
5. S 4 → S end . Phase and CZ gates can implement this final reduction as discussed above. The maximum number of CZ gates required to map S 4 to S end equals the maximum number of offdiagonal 1s in the upper half of S 4 . When n = k, this is n(n − 1)/2 = O(n 2 ). When k = n, this is w cz (k, n) = kn − k(k + 1)/2 due to sparsity of the upper half of S 4 which traces back to the structure of the additional operators appended in step Note that in step S 4 → S end , we can first try to reduce the upper half of S 4 by single-qubit gates before applying CZ. One way to do this is to reduce the number of edges in the graph whose adjacency matrix is specified by the upper half of S 4 by the so-called "local complementation" operation [42,56,57]. This corresponds precisely to reducing the number of CZ gates in our CZ-construction. We start from S 4 above, which we reached without using two-qubit gates. Now, instead of using one block of CZ gates, we reduce to S end as shown in Fig. 2

CNOT-construction
where Next, we apply CNOT gates corresponding to M 1 . The number of CNOT gates required here equals the number of row operations required to reduce M 1 to I n . We find this is at most u cnot (k, n) = O(kn/ log k) via arguments of Patel, Markov, and Hayes [44]. The proof is given in Appendix E. Note that in the three steps S 4 → S 5 , S 6 → S 7 , and S 8 → S end , we have used blocks of CNOT gates. The method we used to synthesise these blocks is size-optimal [44, Lemma 1], but we could have alternatively used methods in Ref. [58], that built on Ref. [59], to achieve optimal space-depth tradeoff, where space refers to extra ancilla qubits.
To end our discussion of constructing rotation circuits, we briefly mention a third, ancilla-based construction with two-qubit gate size at most kn. This construction is well-known in the context of syndrome measurement [55,60] in quantum error correction but does not seem to have been mentioned in the context of measuring a Pauli decomposition of an operator, as in VQE. To measure k commuting Paulis {P i } k i=1 , this "ancilla-construction" uses k ancilla and involves k consecutive blocks of generalised-CNOT gates, each targeted at a different ancilla. The controls in block b ≤ k are activated or deactivated by the +1 or −1 eigenstates of the single-qubit Paulis forming P b * . k single-qubit measurements are performed on the ancilla at the end of each block to exactly give measurements of P i . Unfortunately, this construction requires k extra ancilla qubits (or else a single extra ancilla qubit that needs to be measured and reset k times) and has worse worst-case two-qubit gate size than both of our constructions.

Application to VQE
In this section, we present numerical results of the collecting method discussed in Sec. 2, alongside the CZ-construction of Sec. 3.1 to construct the rotation circuits for given commuting collections. In particular, we have applied our methods to the Hamiltonians of simple molecules so as to demonstrate their use in the context of VQE. The full results are given in Table 4 of Appendix G, with a subset shown in Table 2. In all cases, we used OpenFermion [61] to obtain Hamiltonians in the STO-3G basis, at approximately the equilibrium geometry of the molecules, with the symmetry conserving Bravyi-Kitaev transformation [62,63]. In order to reduce the number of two-qubit gates required, we considered qubits on which all operators in a collection locally commute separately -a one-qubit rotation per locally commuting qubit is all that is required to do so.
In Fig. 3(a), we plot the average collection size against the number of qubits, n, for the molecular Hamiltonians. We can see that the average collection size increases with increasing n, and the increase does not appear to be slowing down. We therefore conclude that our sorting method works well on systems up to at least size n = 38, and looks likely to work for larger systems. However, the key advantage of assembling a Hamiltonian into collections of mutually commuting operators is a reduction in the number of measurements required to obtain an energy expectation to a certain level of accuracy, and collection size alone does not directly quantify this reduction. For a given Hamiltonian and quantum state, the reduction is instead given by R, as in Eq. (10).
We therefore calculated the value of R for 100 different quantum states, generated using 100 random sets of ansatz parameters with a hardware efficient ansatz of depth 1, for the nine smallest molecular systems. We show the mean, minimum and maximum values for each molecule. In practice, the value of R can at best be obtained approximately by making measurements on the quantum computer and so cannot be used to determine the expected advantage of a particular arrangement a priori. The metricR, given by Eq. (20), on the other hand, depends only on the coefficients of the terms in the Hamiltonian. From Table 2, we can see thatR closely approximates the average of R over many ansatz parameters, for the ansatz we have considered, but can be calculated analytically without the need for simulations. Further investigation of the relationship betweenR and R will be useful if considering a different ansatz. In Fig. 3(b), we showR as a function of the number of qubits for our full selection of molecules. We can see that it is highly molecule dependent, with systems of similar size having very different values.
The reduction in the number of measurements required comes at the cost of applying additional quan-   Table 2, and the full data is shown in Table 4.
tum gates before the qubits are measured, the most costly of which are two-qubit gates. For the CZconstruction, we demonstrated in Sec. 3.1 that the maximum number of additional two-qubit gates required for a collection with k independent terms is nk − k(k + 1)/2. We would like to know, in practice, how many additional two-qubit gates are required at a maximum, as this is the quantum resource that is most limiting. Assuming for a given Hamiltonian that at least one collection has rank n, obtaining a measurement of all terms in a Hamiltonian on n qubits may therefore require applying an additional n(n − 1)/2 gates in a single circuit. However, for the molecules we have considered, we find that the largest number of two-qubit gates required is in fact far lower than this, typically by a factor of about 3.5, as can be seen in Fig. 3(c).
Given the close relationship between the average value of R and the value ofR, we propose usingR as a metric for the quality of a collecting method and compare different methods of collecting the operators with this metric in mind. The results are shown in Table 3, along with the number of collections of op-   Table 3: Comparison of the arrangements produced by the greedy colouring algorithms "Largest First", "Connected Sequential d.f.s." (depth first search), "Independent Set" and "DSATUR" as implemented by the Python package NetworkX [64] with our method Sorted Insertion. For each method, the number of collections produced, N , and the metricR given by Eq. (20), are presented. The best or joint best methods are highlighted in bold for each molecule.
erators, N , that each method produces. Out of the methods, "Independent Set" was best at approximating the minimum clique cover -it found the cover with the fewest cliques in all but one case. However, the minimum clique cover does not necessarily result in the fewest measurements, with "Independent Set" only performing best once with respect toR. Overall, our Sorted Insertion was best at maximisinĝ R, performing best or joint best in all but two cases.

Conclusion
We have addressed two problems related to the efficient measurement of Pauli operators on a quantum computer in the presence of finite sampling error. The first is how to assemble a set of Paulis into collections in which they mutually commute, and the second is how to construct rotation circuits that enable mutu-ally commuting Paulis to be measured simultaneously.
For the first problem, we contribute two natural metrics, R andR, that justifiably measure the effectiveness of an arrangement, followed by a collecting strategy motivated byR that we call Sorted Insertion. For the second problem, we contribute two rotation circuit constructions, CZ and CNOT. The CZconstruction uses a maximum of u cz (k, n) = kn−k(k+ 1)/2 two-qubit gates while the CNOT-construction uses a maximum of u cnot (k, n) = O(kn/ log k).
We have applied our theoretical work to the task of estimating energies of molecules in the context of VQE. Comparison to other collecting methods shows that while Sorted Insertion does not normally result in the smallest number of collections, it nearly always results in the best value ofR. We also find that, for the CZ-construction, the largest number of two-qubit gates required is typically less than the theoretical worst-case by a factor of about 3.5.

Acknowledgement
We thank Oscar Higgott for informing us of the ancilla-based simultaneous measurement method and Ref. [60,Fig. 14]. We also thank the editor Ronald de Wolf and two anonymous reviewers for their comments and suggestions.
A Example to demonstrate that combining two collections into one never reduces R Consider the example in Refs [3,20] where we consider measuring the energy, given by the Hamiltonian of the state |ψ = |01 . The covariance matrix is The covariance between the non-commuting operators in the upper right and lower left blocks is not defined. We have set them to equal zero for convenience (highlighted in bold).
First, we consider collecting the Paulis into For these collections of Paulis, the coefficient vectors are The number of measurements to achieve an accuracy of ǫ is Now, let us consider breaking up the first collection into In this case, the coefficient vectors are The number of measurements required to attain an accuracy ǫ is therefore Therefore, under the optimal measurement strategy it is not preferable to break the {−XX, −Y Y, ZZ} collection into {−XX} and {−Y Y, ZZ}. In this specific example we have equality because for α = −β we have αa + βb, αa + βb = 0.

B Derivation ofR formula
Claim 1. For R as defined in Eq. (10), if all variances and covariances are replaced with their expectation value over uniform spherical distribution, we obtain a new metric,R, given bŷ Proof. The variance of a single Pauli operator is The expectation of this variance for all P i = I is where α n := 1/(2 n + 1), with n the number of qubits, is independent of P i [50,Exercise 7.3]. Trivially, Var[I] = 0. In addition, it was shown in Ref. [20] that for all P i = P j . Substitution of these results yields Eq. (43).

C Binary representation
The Pauli subset P n on n-qubits is a collection of 4 n+1 elements defined by The binary representation, first introduced by Calderbank, Rains, Shor, and Sloane [53], is a representation of P n as binary vectors. In this representation, Paulis differing only in phase i k are represented in the same way.
Single-qubit Paulis are represented by 2dimensional binary vectors, so that σ 00 := I → (0, 0), σ 01 := X → (0, 1), σ 10 := Z → (1, 0), An n-qubit Pauli σ u1v1 ⊗ · · · ⊗ σ unvn (49) is then represented by the 2n-dimensional binary vector In this representation, two n-qubit Paulis with binary vectors a and b commute if and only if where J 2n denotes the 2n × 2n matrix Given a set S of m n-qubit Paulis, we can write down a corresponding 2n × m binary matrix S where each column represents a Pauli. Then, from Eq. (51), we deduce that all Paulis in S mutually commute if and only if which recovers Eq. (28) in the main text. We say that the set S of Paulis is independent if the matrix S has rank m. We shall often find it helpful to write S in terms of its upper half S (Z) and lower half S (X) , separated by a horizontal line for visual-aid, i.e., The conjugation action of quantum gates on S can be represented as transformations to the matrix S. For example, we document the transformations on S that represent four common quantum gates. In the following, addition is mod 2 and p ranges over all columns {1, . . . , m}.
CZ on qubits i and j: CNOT on control-qubit i and target-qubit j: S ip ← S ip + S jp , S j+n,p ← S j+n,p + S i+n,p .
Hadamard (H) on qubit i: Phase (P) on qubit i: These rules can be directly verified by conjugating X i , X j , Z i , Z j by the listed gates. They are also reproduced in Sec. 3 of the main text.

D CZ-construction example
We walk through our CZ-construction for a specific example. In this example, we would like to obtain measurements simultaneously of a collection S ′ start of six four-qubit Paulis given by We can represent these Paulis in a matrix S ′ start with By Gaussian elimination, we find the reduced row echelon form of S ′ start to be The pivot columns are numbers 1, 2 and 4 which tells us that P 1 , P 2 and P 4 are the three independent Paulis from which the remaining Paulis in S ′ start can be constructed. Therefore, we can write Note that the inverse on R −1 0 is purely notational. Now, the lower half S (X) start of S start has column echelon form and so the first two rows are pivot rows. In order to give the lower half of S start a rank of k = 3, we therefore apply a Hadamard to the rows corresponding to qubits 3 and 4 so that where The lower half of S 1 now has rank 3, and performing Gaussian elimination on it, we find where We now extend S 2 to a rank n = 4 matrix by adding a column that corresponds to a fourth Pauli P ext . In the main text, this is the crucial basis extension step from S 2 → S 3 which might have seemed fortuitous. In fact, the extension was systematically obtained as follows.
To make our reasoning clearer, let us represent S 2 alternatively by the matrix where each row corresponds to a Pauli operator given by a column of S 2 . Looking at the form of P (S 2 ), we see that we can place X in the 4th qubit position of P ext (and nowhere else) to ensure P ext is independent of the other Paulis. Then we observe that the left 3-by-3 sub-matrix of P (S 2 ) has X/Y on the diagonal and I/Z everywhere else. This means we can place I/Z in the other qubit positions of P ext depending on whether the X in its 4th qubit position commutes with the 4th position terms of the other Paulis.
By this prescription, we find P ext = Z 1 I 2 I 3 X 4 . Therefore, S 2 is extended to and where Note that the inverse on R −1 2 is also purely notational. The lower half of S 3 is full-rank and so we can take its inverse to find (69) Finally, we apply Phase to qubits 1 and 2 so that where S 4 is of the form of a graph state and represents the following Paulis: We now have where  Figure 4: The rotation circuit U for the CZ-construction walk-through example.
and (75) The rotation circuit is shown in Fig. 4. Using Q −1 , S 4 and R −1 , we can work out that the phases for the six original operators are ( +1 +1 +1 −1 −1 −1 ). Therefore, we can construct measurements of the original Pauli strings as follows: P 1 from product of measurements of qubits 3 and 4, P 2 from product of measurements of qubits 1 to 4, P 3 from product of measurements of qubits 1 and 2, P 4 from the negative of measurement of qubit 2, P 5 from the negative of measurement of qubit 1, P 6 from the negative product of measurements of qubits 1, 3 and 4.

E Proof of O(kn/ log k)
We prove the following Claim 2 via arguments of Patel, Markov, and Hayes [44]. As acknowledged in Ref. [44], these arguments originate from the "Method of Four Russians" [65]. Note that row operations correspond to CNOT gates, as explained in detail in Ref. [44].

Claim 2.
Let M be a n × n matrix with block form where A is any k × (n − k) matrix. Then O(kn/ log k) row operations suffice to reduce M to identity I n .
Proof. Let m be a constant we later choose. Partition A into l := (n − k)/m consecutive column-blocks A i , each containing m columns.
Start at A 1 and eliminate any duplicate rows using at most k row operations. There then remain at most 2 m unique rows in A 1 which can be eliminated by at most m2 m−1 row operations that add rows from I n−k . A 1 is now zero.
For each of A 2 , . . . , A l perform the same operation as was done to A 1 . M then becomes where B is some k × k matrix that must be invertible. B can then be row-reduced to I k using O(k 2 / log k) by the result of Ref. [44].
The total number N of row operations is therefore Choosing m = α log k, we find which is O(kn/ log k) provided α < 1.

F NP-hardness of maximisingR
In this Appendix, we show that maximisingR is NPhard by a reduction argument.
is a tensor product of n single-qubit Paulis, so that It can be easily seen [20, Appendix A] that a commuting collection of the P i s corresponds exactly to a clique in G.
For an arrangement A of O into N commuting collections with sizes m i ≥ 1, we consider the term appearing in the denominator ofR, and see that where the second equality is because the coefficients of P i s in O are all 1s and we do not consider splitting these coefficients. We have where the second inequality follows from Cauchy-Schwarz and N i=1 m i = n. Now, let N 0 denote the minimum number of commuting collections corresponding to some arrangement A 0 and let N 1 denote the number of commuting collections in an arrangement A 1 that maximisesR. By definition, N 0 ≤ N 1 . Since maximisingR is equivalent to minimising D(A), we also have using Eq. (82) for the first and third inequalities. Hence, D(A 1 ) is a √ n-approximation to N 0 . But N 0 is also the solution to the Min-Clique-Cover problem on G by construction. So maximisingR involves computing a √ n-approximation to Min- gives the stronger result that computing a n 1−ǫ approximation to Min-Clique-Cover is NP-hard for any ǫ > 0. This implies that it is NP-hard to compute a n 1/2−ǫ approximation to D(A 1 ) for any ǫ > 0. In other words, it is NP-hard to compute a 1/n 1/2−ǫ approximation to the maximumR for any ǫ > 0.

G Full numerical results
In this Appendix, we present the full version of Table 2 Table 4: The full set of results of the numerical simulations discussed in the main text. The molecular geometry is approximately that of the equilibrium configuration. A number of results related to the collecting of Hamiltonian terms, how the arrangement reduces the number of measurements required, and the number of two-qubit gates in the resulting rotation circuits are shown. We used OpenFermion [61] and Psi4 [66] to obtain Hamiltonians in the STO-3G basis and under the symmetry conserving Bravyi-Kitaev transformation [62,63]. Using our collecting method, Sorted Insertion, the number of collections, N , the average number of terms per collection, mi, and the average rank of the collections, ki are shown, for molecules with n qubits and t Pauli operators, excluding the identity, in the Hamiltonian sum. For the smallest nine molecules, we calculated the resulting ratio R, as given in Eq. (10), for 100 randomly selected trial states, prepared by choosing random sets of parameters for a hardware efficient ansatz preparation circuit of depth 1. For each molecule, we show the mean, minimum and maximum values of R obtained from the 100 runs. We also show the value ofR, given by Eq. (20), obtained, which is close to the mean value of R where this has been calculated. A key result of interest is the maximum number of two-qubit gates required to obtain a measurement of all the operators in a given Hamiltonian. We show the theoretical maximum, given by the largest value of kn − 1 2 k(k + 1) for any collection, and the true largest value for any collection. The ratio of these two numbers is shown in Fig. 3(c). We also show the mean number of two-qubit gates required in a rotation circuit, averaged across all collections for a given molecule.