Circuit optimization of Hamiltonian simulation by simultaneous diagonalization of Pauli clusters

Many applications of practical interest rely on time evolution of Hamiltonians that given by a sum of Pauli operators. Quantum circuits for exact time evolution of single Pauli operators are well known, and can be extended trivially to sums of commuting Paulis by concatenating the circuits of individual terms. In this paper we reduce the circuit complexity of Hamiltonian simulation by partitioning the Pauli operators into mutually commuting clusters and exponentiating the elements within each cluster after applying simultaneous diagonalization. We provide a practical algorithm for partitioning sets of Paulis into commuting subsets, and show that the propose approach can help to significantly reduce both the number of CNOT operations and circuit depth for Hamiltonians arising in quantum chemistry. The algorithms for simultaneous diagonalization are also applicable in the context of stabilizer states; in particular we provide novel four- and five-stage representations, each containing only a single stage of conditional gates.


Introduction
Simulation of quantum systems by means of Hamiltonian time evolution is an important application of quantum computers [15,24]. The time evolution of a Hamiltonian H is given by e itH , and the main challenge is to generate an efficient circuit that implements or closely approximates this time-evolution operator. Given the prominent position of Hamiltonian time evolution in quantum computing, it should come as no surprise that this area has been well studied, and that different approaches have been developed, including those based on, for instance, product formulas [29,32], quantum walks [6], linear combinations of unitaries [12], truncated Taylor series [7], and quantum signal processing [25] (see [11] for a good overview). Product formulas are applicate when, as is often the case, the Hamiltonian can be decomposed as the sum H = j H j , such that the time evolution of each of the terms H j is readily evaluated. Through successive application of the terms with appropriately chosen time steps, it is then possible to simulate the original Hamiltonian. For instance, using the Lie-Trotter product formula [32] we have that e itH = lim k→∞ j e i(t/k)Hj k , whereas in the non-asymptotic regime, the Trotter scheme provides a first-order approximation, with the norm of the difference between the exact and approximate time evolution operators scaling as O(t 2 /k). More advanced higher-order schemes, such as those by Suzuki [29], are also available, and are analyzed for example in [11]. The approximation errors arising in the use of product formulas are ultimately caused by non-commuting terms in the Hamiltonian. Indeed, given any set of mutually commuting operators P 1 through P m , the exponent of the sum is equal to products of the individual exponents, provided that the time slices for each operator add up to t. As a simple example, it holds that whenever the operators commute. A natural idea, therefore, is to partition the operators into mutually commuting subsets. This can be done by applying graph coloring [8] to a graph whose nodes correspond to the operators and whose edges connecting nodes for which the associated operators do not commute. The resulting coloring is such that all nodes sharing the same color commute. Time evolution for the sum of nodes within each subset is then trivial, and product formulas can be applied to the sum of Hamiltonians formed as the sum of each subset. This approach is especially applicable in scenarios where the Hamiltonian is expressed as a sum of Pauli operators, for which the commutativity relations are easily evaluated. This situation arises by definition in spin simulation of magnetic systems using the Heisenberg model. In other applications, such as the quantum simulation of fermionic systems, the terms in the Hamiltonian can be mapped to Pauli operators using for example the Jordan-Wigner or Bravyi-Kitaev transformation [10,22,31].
In this paper we focus on quantum circuits for evaluating the product of commuting exponentials, appearing on the right-hand side of equation (1). We also consider the partitioning of terms, and application of the proposed methods to quantum chemistry. Given the limited qubit connectivity in near-term architectures, we largely focus on reducing the number of cnot gates, since these may translate into large numbers of swap gates. For systems that use error-correction codes, it may be important to reduce other gates, such as the T -gate. These gates only appear in the exponentiation of the diagonalized operators, and these parts of the circuit can be independently simplified using techniques such as those described in [3,4]. We further note that clustering of Pauli operators and simultaneous diagonalization of commuting operators also arises in variational quantum eigensolvers [9,14,17,21,23,33,34]. In that context, however, the techniques are used for an altogether different purpose; namely, to reduce the number of measurements to estimate inner-products of the initial state with different Pauli operators. The schemes we develop for simultaneous diagonalization and partitioning are also applicable in the context of variational quantum eigensolvers.
The paper is organized as follows. In Section 2 we review the basic circuit for exponentiation of individual Pauli operators, and how these can be combined. Section 3 describes the proposed approach based on simultaneous diagonalization. Synthesis and optimization of circuits for diagonalization are studied in Section 4. In Section 5 we perform numerical experiments to obtain the circuit complexity for simulating random Paulis and Hamiltonians arising in quantum chemistry. Conclusions are given in Section 6.
Notation We denote the Pauli matrices by σ x , σ y , and σ z , and write σ i for the two-by-two identity matrix. The tensor product of n Pauli matrices gives an n-Pauli operators, which we denote by the corresponding string of characters, for example zxi = σ z ⊗ σ x ⊗ σ i . We write [n] = {1, . . . , n} and denote the binary group by F 2 .

Direct exponentiation of Pauli operators
Given a Hermitian operator M with eigendecomposition M = QΛQ † = k λ k |q k q k |, it holds that exponentiation of the matrix is equivalent to exponentiation of the individual eigenvalues; that is, Alternatively, we can look at operators D = Q † that diagonalize M , that is DM D † = Λ. The identity and Pauli σ z matrices are already diagonal, and therefore have a trivial diagonalization with D = I. From this it follows directly that e iθσi = e iθ I, and e iθσz = e iθ 0 0 e −iθ =: R z (θ) The remaining two Pauli operators σ x and σ y can be diagonalized to Λ = σ z with operators It then follows that e iθσx = e iθD † x σzDx = D † x e iθσz D x = D † x R z (θ)D x , and likewise for σ y . A direct way to exponentiate a Pauli matrix is to first apply the appropriate diagonalization operator D, followed by the rotation R z (θ), and finally the adjoint diagonalization operator D † .
In order to exponentiate general n-Pauli operators we first diagonalize the matrix, which is done by applying the tensor product of the diagonalization operators corresponding to each of the terms. The resulting diagonal is the (b) Circuit after permuting blocks and applying gate cancellations tensor product of σ i and σ z matrices; a σ i for each i term, and σ z for each of the x,y, or z terms. For a given element in the computational basis we can determine the sign induced by the σ z diagonal terms and maintain the overall sign in an ancilla qubit using cnot operators. The rotation operator R z (θ) is then applied to the ancilla to achieve the exponentiation of the eigenvalue (see also [26,Chapter 4]). We then uncompute the ancilla by reapplying the cnot gates, and complete the procedure by applying the adjoint diagonalization operator. An example for the the successive exponentiation of Pauli operators ixx, zyz, xxi with angles θ 1 ,θ 2 , and θ 3 , is shown in Figure 1(a). Several remarks are in place here. First, in the diagonalization of σ y we include a not operator (X) to ensure diagonalization to σ z rather than −σ z . In practice this term can be omitted, and for each occurrence of a σ y term we can simply multiply the corresponding rotation angle θ by −1. Second, it is often the case that time evolution needs to be done as a conditional circuit. Instead of making each gate conditional it suffices to merely make the R z gates conditional. Third, for sets of commuting Pauli operators it is possible to obtain circuits with reduced complexity by rearranging the order in which the Pauli operators are applied in such as way that as many gates as possible cancel. In the example shown in Figure 1(b) we rearrange the blocks and apply simple gate cancellation to adjacent pairs of identical diagonalization operations and cnot gates.

Proposed approach
It is well known for any set of mutually commuting operators there exists a unitary U that simultaneously diagonalizes each of the operators in the set [20,Thm. 1.3.19]. Applying this to a set of commuting n-Pauli operations {P j } m j=1 , we know that there exists a unitary U ∈ C 2 n ×2 n , such that UP j U † = Λ j is diagonal for all i ∈ [m]. Moreover, not only are the resulting operators diagonal, they are in fact Pauli operators themselves, consisting only of σ i and σ z terms along with a sign. As an example we apply the techniques we develop in Section 4 to the three commuting Paulis used in Figure 1. The resulting circuits that each diagonalize all three Paulis, along with the resulting diagonals are shown in Figure 2.
The advantage of simultaneous diagonalization become apparent when looking at the exponentiation of the sum of commuting Paulis. From (1) we know that this is equal to the product of individual exponents. Additionally Circuit after permuting blocks and applying gate cancellations The last equality follows from the fact that successive UU † terms cancel, thereby allowing us to apply the diagonalization operator and its adjoin only once, instead of once for each individual term. Since we know how to exponentiate the diagonal Paulis, we can put everything together to obtain the circuit shown in Figure 3(a). If needed, the sign of the diagonalized terms can be incorporated in the rotation angle. Similar to the original approach we can often simplify the circuit by canceling adjacent gates that multiply to identity. A key advantage of diagonalization then is that, aside from the R z gates, each term consists entirely of cnot gates. This provides much more room for optimization, since instead of having to match four terms (i, x, y, and z), we only need to consider two (i and z). This makes is easier to find orderings of the terms that reduce the number of cnot gates in the circuit. The circuit after simplification can be seen in Figure 3(b). Compared to Figure 1(b), which requires twelve cnot gates, this example uses only a total of ten cnot operation: six for exponentiation and a further four for the diagonalization circuit and its adjoint. In practice it is unlikely that all terms in a Hamiltonian commute. For these cases we first need to partition the terms into subsets of commuting operators. For each of these subsets we can then apply simultaneous diagonalization for simulating that part of the Hamiltonian. When the number of terms in a subset is small we may find that exponentiation using the original approach gives a better circuit. We can use this to our advantage by simply choosing the best method for each subset.

Circuits for simultaneous diagonalization
In this section we consider the construction of circuits that simultaneously diagonalize a given set of commuting Pauli operators. This is conveniently done using the tableau representation originally used to simulate stabilizer circuits [1,18] and reviewed next. The schemes for simultaneous diagonalization presented in [14,17] use the same techniques, but differ from ours in the number and type of stages.

Tableau representation and operations
The tableau representation is a binary array in which each row represents a single n-Pauli operator. The columns of the tableau are partitioned as [X, Z, s], such that (X i,j , Z i,j ) represents the jth component of the ith Pauli operator. The value is (1, 0) for x, (0, 1) for z, (1, 1) for y, and (0, 0) for i. Entries in s are set if the corresponding Pauli operator has a negative sign. For instance: To keep the exposition clear, we do not show the sign column in the tableaus for the remainder of the paper. It is of crucial importance, however, that the appropriate signs are maintained, as they eventually appear in the exponent of the Paulis. Once the tableau is set up we can apply different operations. The first two operations, illustrated in Figures 4(a) and 4(b), change the order of respectively the Pauli operators and the qubits. A third operation, shown in Figures 4(b) sweeps one row with another. This operation corresponds to multiplication of the operators, which results in the given entries in the X and Z blocks to be added modulo two. The sign update is more involved and we refer to [1] for details. Even though these operations alter the tableau they do not generate any corresponding gates in the circuit.  In addition to these basic operations, we can apply operators from the Clifford group. Operators C in this group are unitary and have the property that CP C † is a Pauli operator for any Pauli operator P . The Clifford group can be generated by three gates: the Hadamard gate (H), the phase gate (S), and the conditional-not (cnot) gates. The definition of these gates along with the respective update rules and effect on the tableau are summarized in Figure 5. The Hadamard gate applied to a column (qubit) results in the exchange of the corresponding columns in the X and Z blocks. The phase gate adds a given column in the X block to the matching column in the Z block, along with appropriate updates to the signs as shown in the figure. The conditional-not operation CX(a, b), that is, negation of qubit b conditional on qubit a, has the effect of adding column a to column b in block X, and adding column b to column a in block Z. From the basic three operations we can form another convenient gate, the conditional-Z gate (see also [16]). Denoted CZ(a, b), this gate is equivalent to successively applying H(b), CX(a, b), and H(b), and has the effect of adding columns a and b of block X to columns a and b of block Z, respectively.

Simultaneous diagonalization
For simultaneous diagonalization we initialize the tableau with our commuting set of Paulis. We then need to apply the different tableau operations in such a way that the entries in the X block of the final tableau are all zero. In our Operator matrix sign update block update algorithms we use row swaps and row sweep operations. Even though these operations do not generate any gates for the circuit, they do alter the tableau, and the underlying Pauli operations. In order to obtain the appropriate diagonalization of the original Pauli operations we can do one of two things. First, since these operations commute with the Clifford operations we can apply the inverse operations of all row operations at the end. Second, we can work with a parallel tableau on which only the Clifford operations are applied. The desired diagonalized Pauli operators are then represented by the final tableau. We now look at several algorithm that clear the X block. In this we occasionally need to use the rank of the tableau, which we define as the rank of the [X, Z] matrix.

Diagonalizing the X block
For the simultaneous diagonalization process we proceed in phases. As the first phase we would like to operate on the tableau such that only the entries on the diagonal of the X block are nonzero. More precisely, let r be the rank of the matrix [X, Z], then we would like the first r diagonal elements of the X block to be one, and all remaining elements of the block to be zero. The algorithm we use for this is given in Algorithm 1. At the beginning of the algorithm we are given a tableau corresponding to commuting Paulis. At this point there is no clear structure, and the tableau therefore looks something like Figure 6(a), where gray indicate possibly nonzero entries (although we illustrate the procedure on a tableau with m > n, the process applies equally to tableaus with other shapes). In steps 2-11 we iteratively diagonalize the X block. Starting at k = 1 we first look for a nonzero element in rows and columns of the X block with indices at least k. If found, we move the one entry to location (k, k) by applying appropriate row and column swaps, sweep all other nonzero entries in the new column, increment k, and continue. If no such item we could be found we are done with the first stage and have a tableau of the form illustrated in Figure 6(b). In steps 13-22, we then repeat the same process on the Z block, starting off at the current k. The tableau at the end of the second stage would look like Figure 6(c). In the third stage, given by steps 23-25, we apply Hadamard gates to swap the diagonalized columns in the Z block with the corresponding columns in the X block, resulting the a tableau as shown in Figure 6(d). If the rank r is less than n, there may be spurious nonzero elements to the right of the diagonal block in X. These are swept using cnot operations in steps 26-28. The resulting tableau after the final fourth stage is depicted in Figure 6(e).
Recalling that the tableau has rank r, it is immediate by construction that any row in X with index exceeding r will be zero. It therefore follows immediately that the Paulis associated with these rows contain only i and z terms. The Pauli string for rows i with i ≤ k consist of all i and z terms, except for an x or y term at location i. We now show that rows i in Z with i > r are also all zero. This certainly holds for column indices j > k, and we therefore assume that we have Z[i, j] = 1 with i > r and j ≤ k. The terms in the Pauli operators for rows i and j commute at all indices except j, where row i has z and row j has x or y. The Pauli operations therefore anticommute, which contradicts our assumption that the Paulis in the tableau commute, and it therefore follows that rows i > r in Z are all zero. Now, note that the cnot operations in the third stage and the Hadamard operations in the second stage, did not affect the values in the bottom-left block of Z. We conclude that these values must therefore already have been zero at the end of stage two, as shown in Figure 6(f). The following result is a direct consequence of the above discussion: Theorem 4.1. The X block of any tableau corresponding to commuting n-Paulis with rank n can be diagonalized using only Hadamard gates.
The fourth stage of the algorithm for diagonalizing X is applicable whenever the rank of the tableau is less than n. In the implementation given in Algorithm 1 we clear the spurious entries using cnot operations. There are several ways in which this stage could be improved. We could determine, for instance, if the corresponding column in Z has fewer nonzero entries. If that were the case, we could swap the column using a Hadamard operation and sweep the alternative column instead. Likewise, it would be possible to see if sweeping the Z column with that of X using a phase gate, followed by a swap would be more efficient. In both these cases the number of cnot operations would be reduced at the cost of single-qubit operations. If two columns in the residual column block are similar, one could be simplified by sweeping with the other using a cnot operation. Further optimization is possible using a combination of these techniques.

Updating Z and clearing X
After diagonalizing the X block, we need to update the Z block, such that all nonzero columns in X are matched with a zero or identical column in Z. Application of combinations of Hadamard and phase gates then allows us to zero out X and obtain the circuit for simultaneous diagonalization. In this section we consider three algorithm to achieve this.

Algorithm 1 Diagonalization of the X block.
Input: the input to this function is a tableau T = [X, Z, S] of size m × 2n + 1, consisting of the X and Z blocks, as well as a sign vector S. We use the convention that indexing of the X or Z blocks corresponds to indexing the tableau at the corresponding location. Swapping or sweeping rows applies to the entire tableau. Swapping columns i and j means swapping these columns in both the X and Z blocks.
if (index found) then

5:
Swap rows i and k; swap columns j and k 6: Sweep row i with row k 8: end for 9: 15: if (index found) then 16: Swap rows i and k; swap columns j and k 17: Apply gate CNOT(i, j) 28: end for

Pairwise elimination
Application of the controlled-Z operation on qubits a and b is equivalent to successively applying H(b), cnot(a, b), and H(b). The overall effect, as illustrated in Figure 5, is the sweeping of columns a and b in Z with respectively columns b and a of X. This operation can therefore simultaneously eliminate Z[a, b] and Z[b, a] whenever both elements are one. The following result shows that and off-diagonal one is matched by the reflected element: Given a tableau T corresponding to a set of commuting Paulis of rank k, and apply the diagonalization procedure. Then the top-left k-by-k sub-block of the resulting Z is symmetric.
Proof. Consider any pair of distinct indices i, j ∈ [k], and denote the string representation of the corresponding Pauli operators of the updated tableau T by P i and P j . The operations performed during diagonalization preserve commutativity, and P i and P j therefore commute. For commutativity, we can focus on the symbols at locations i and j; all others are either σ i or σ z . It can be verified that symbols It follows that in order for the Pauli operators to commute, we must have Z[i, j] = Z[j, i]. The result follows by the fact that indices i and j were arbitrary.
With this, the algorithm for updating the Z block simply reduces to eliminating the lower-triangular entries in Z (the corresponding upper-triangular entries will be eliminated simultaneously). This process is summarized in lines 1-5 of Algorithm 2. After this first step we are ready to clear the X block using single-qubit gates, by considering the values of the diagonal entries in Z. This is done in lines 6-9 of the algorithm. One notable benefit of algorithm is that the elimination process only affects the targeted entries, which means that there is no fill-in. Together with the diagonalization of X in Section 4.3, we obtain a classical complexity of O(n 2 max(m, n)), along with the following result: Theorem 4.3. Given a tableau for commuting n-Paulis with rank n. We can diagonalize the operators using H-CZ-S-H stages with O(n 2 ) CZ gates.
Since the application of the CZ gates do not affect the diagonal entries in the Z block, it is possible to apply the phase gates first and obtain an H-S-CZ-H scheme. Note that it is always possible to obtain a full-rank tableau by adding commuting Paulis that were not in the original span. The resulting diagonalization then has the stages as given above, and clearly applies to the original set of Paulis as well. Doing so may however come at the cost of an increased circuit complexity.
Algorithm 2 Pairwise update of Z, clear X.
Apply H(i) 9: end for

Elimination using cnot operations
Alternative way of updating Z that is based on cnot operations is given by Algorithm 3. The main for-loop in lines 1-11 iteratively ensures that the top-left i × i block of Z has ones one the diagonal and zeroes elsewhere. The update process for a given i is illustrated in Figure 7. At the beginning of iteration i, the (i − 1) × (i − 1) block Algorithm 3 Update of Z using cnot operations, clear X.
Input: Tableau T with diagonal X of rank k. Apply S(i), H(i) 14: end for of Z is diagonal, and to obtain the desired state at the end of the iteration we therefore need to eliminate any nonzeros occurring in the first i − 1 entries in the in the i-th row and column of Z, and ensure that Z[i, i] = 1.
As an example, consider the tableau in Figure 7(a) at the beginning of iteration i. During the iteration we will need to eliminate entries Z [4,1], Z [4,3], and their reflections Z [1,4] and Z [3,4]. For now we assume that that the entry Z[i, i] is 0 or 1 respectively. To eliminate entry Z [1,4] we first apply a cnot(4, 1) gate. In addition it also flips the value in Z[i, i] to 1 or 0 respectively, and fills in element X [1,4], as shown in Figure 7(b). Aside from this there are some further updates to the entries of column i with indices exceeding i; these are irrelevant to the current iteration and will be dealt with in later iterations. Next, we eliminate the undesirable fill of element X [1,4] by sweeping row 4 with row 1, which also clears up element Z [4,1]. Note that this is no coincidence: since the X block is diagonal again, if follows form Theorem 4.2 that corresponding block in Z must be symmetric. We again ignore the additional updates beyond the block boundaries. This leaves us at the state shown in Figure 7(c). As the next step we eliminate entries Z [3,4] and Z [4,3] by applying cnot(4, 3), followed by a sweep of row 4 with row 4, as shown in Figures 7(d) and 7(e). Applying of the cnot operation again caused the value of Z[i, i] to flip to 0 or 1 respectively. As a final step, we now need to ensure that the Z[i, i] entry is one. For this we could check the latest value, and apply S(i) whenever the value is zero. Instead, we prefer to set the value appropriately at the beginning, and ensure that at the end of all value flips it ends at the one value. For this we can simply consider the value of Z[i, i] at the beginning and add the number of entries that need to be eliminated and thus incur a flip. If this result value is even we need to to change the initial value of Z[i, i] by applying S(i). This is done in lines 2-4 of Algorithm 3. Once completed, the first k columns in Z exactly match those of X. We can therefore clear the X block by applying phase and Hadamard operations on the first k qubits, which is done in lines 12-14. Combined with the diagonalization of X from Section 4.3, we have the following result: This result can be further improved using [27], which shows that cnot circuits consisting of O(n 2 ) gates can be reduced to O(n 2 / log(n)) gates. The overall classical complexity of this diagonalization procedure is O(mn min(m, n)).

Column-based elimination
In the two methods described so far, each iteration of the algorithm for updating the Z block zeroes out exactly two elements. In many cases we can do much better and clear multiple entries at once. To see how, consider the situation where the X block is diagonal and the initial Z block is as shown in Figure 8(a). The second and third column are nearly identical, and sweeping one with the other using a cnot operation would leave only a single non-zero entry in the updated column in the location where the two differed. This suggests the following approach. Given a set of columns that is yet to be swept, I, we first determine the column i ∈ I that has the minimal number of non-zero off-diagonal elements; that is, the number of cnot gates needed to clear them. We then consider the Hamming distance between all pairs of columns i, j ∈ I, excluding rows i and j. The reason for excluding these entries is that the X block is diagonal, and we can therefore easily update the diagonal entries in the Z block to the desired value using Hadamard or phase gates. The total number of cnot operations to clear column i with column j is then equal to their off-diagonal distance plus one for the column sweep itself. That is, after sweeping the columns we still need to take care of the remaining entries in the column using elementwise elimination. There are many possible ways to combine these steps, but one approach is to greedily determine the lowest number of cnot operations needed to clear any of the remaining columns in I, an approach we refer to as greedy-1. Once the column has been cleared aside from the diagonal entry we can zero out the corresponding column in the X block and remove the entry from I.
As an example we apply this method to the example in Figure 8(a). Starting with I = {1, 2, 3, 4, 5, 6} we first determine the number of off-diagonal elements to sweep in each single column, which turns out to be three. For elimination using pairs of columns, we see that the distance between columns 1 and 3 is one, provided we update the diagonal entry in column 3. Columns 2 and 3 also have an off-diagonal distance of two, as do columns 4 and 5. At each iteration we choose the first minimum we encounter, in this case columns 1 and 3, as highlighted in Figure 8(b). To clear column 1 we first update the diagonal entry in 3 by applying a phase gate. Next, we apply a cnot operation that sweeps column 1 with the updated column 3, to arrive at the Z block shown in Figure 8(c). As seen in Figure 7, the cnot operation causes fill-in of the X block, which we can eliminate by sweeping row 1 with row 3. Doing so restores diagonality of the X block, and symmetry of the Z block. The result of this operation can be seen in Figure 8(d) . What remains is to pairwise eliminate the remaining entries in column 1, and by symmetry of row 1, and clear column 1 of the X block. This finalizes the clearance of column 1, so we can remove it from the active set I, and leaves us with the tableau shown in Figure 8(e). Starting we a new iteration, we again count the number of off-diagonal entries to sweep per column. The minimum of two occurs in column 5. Pairwise sweeping does not improve on this, and we therefore use the technique from Section 4.4.1 to clear these entries directly. We then clear column 5 of the X block and remove the column from I. The algorithm continues in this fashion until I is empty.
So far, we have only considered the number of cnot operations. An alternative approach, referred to in the experiments section as greedy-2, takes into account the number of single-qubit gates when the number of cnot gates match. Recall that in the first iteration there were several pairs of columns with a minimal off-diagonal distance of one. The greedy-1 strategy chooses to clear column 1 with column 3, which requires one phase gate to clear the diagonal entry of column 3, a cnot and cz operation respectively for sweeping the column and remaining off-diagonal entry, and finally a Hadamard operation to clear column 1 of the X block. Alternatively choosing to clear column 2 with column 3 would require an initial cnot for the column sweep, a cz for removing the remaining off-diagonal entry, and a Hadamard operation to clear column 2 of X. The latter approach requires the same number of cnot operations, but requires one fewer single-qubit gate. The greedy-2 method would therefore choose this option. For this particular example, pairwise elimination requires ten cnot operations, whereas the greedy approach require seven and six cnot operations, respectively. For all three algorithms, the number of single-qubit operations is six. The complexity of column-based elimination of the Z block is O(k 4 ), where k the rank of the tableau. This assumes that at each stage of the algorithm we recompute the distance between all pairs of remaining columns, and more efficient implementations may be possible.

Ordering of terms
Once the X block in the tableau has been cleared we can either undo all row sweep and row swap operations, or reapply all Clifford operators on the initial tableau, to obtain the diagonalized Pauli terms corresponding to the given set of commuting Paulis. Figure 9(a) shows the transpose of the resulting Z block for a set of 20 Paulis over 7 qubits, represented as columns. In the plot gray cells represents a Pauli z terms, while white cells represent identity terms i. For exponentiation we need to add cnot gates for each of the z terms. As illustrated in Figure 3, we can cancel cnot operators between successive z terms one the same qubit. The resulting number of cnot gates for each of the seven qubits is given on the right of Figure 9(a), for a total of 72 cnot gates. (For ease of counting, imaging all-identity Paulis before the first and after the last operator and count the number of transitions from white to gray and vice versa.) In order to reduce the number of transitions we can permute the order of the operators within the commuting set. This is done in Figure 9(b), where we first sort all operators in qubit one. We then recursively partition the operators in the i set, such that all i operators appear before z operators, and vice versa for the z set. The resulting binary tree like structure in Figure 9(b) reduces the total number of cnot gates needed to implement the circuit from the original 72 down to 58. The order in which the qubits are traversed can make a big difference. Figure 9(c) shows a histogram of the number of cnot gates required for all possible permutations of traversal order, ranging from 38 to 60 gates. The large range in gate count indicates that there is still a lot of room for improvement for the ordering strategy. As seen in Figure 9(b), qubits that appears earlier in the ordering tend to require fewer cnot gates. This can be leveraged when optimizing the circuit for a particular quantum processor where operators between non-neighboring qubits are implemented using intermediate swap operations. In this case we can reduce the number of cnot operations between topologically distant qubits by having them appear in the ordering earlier.
Alternative implementation where cnot gates are connected to qubits of successive z terms are possible, but will not be considered in this paper. Ordering of operators in the Z block has a classical complexity of O(mn).

Experiments
We now consider the practical application of the methods described in earlier sections. In the experiments we consider the number of cnot and single-qubit operations, as well as the circuit depth. The number of cnot gates that appear in the circuit are especially important for processors with limited qubit connectivity. In particular, cnot operations between qubits that are not physically connected may require a substantial number of swap operations. We use Qiskit [2] circuit optimization where indicated, and also use the package to determine all circuit depths.

Random Paulis
Pauli bases. As a first set of experiments we consider the circuit complexity for diagonalizing random sets of commuting Paulis. In order to run these experiments we need an algorithm for sampling bases of commuting Paulis uniformly at random. For this we proceed in two stages: first we uniformly sample a canonical generator set, and second we sample a full-rank binary matrix. The resulting set of Paulis is then obtained by multiplication of the generator set tableau generator set with the binary matrix. Many of the random generators can be sampled by end if 13: end for 14: Return the tableau with random signs setting the X block in the tableau to the identity, followed by randomly sampling a symmetric Z block, as required by Theorem 4.2. Besides these there are generators with one or more of the diagonal entries in X set to zero. Such entries are generated by clearing out the entries on, below, and to the right of the given diagonal element in the Z block and exchanging the associated columns in the X and Z blocks. Zeroing out the entries is needed to ensure that the diagonal element in the X block cannot be set to one using row exchanges. The algorithm for stage one is summarized in Algorithm 4. For the first row of the tableau we have 2 n possibilities for Z if the diagonal of X is set, and a single possibility otherwise, for a total of 1 + 2 n . For the second row we can only set n − 1 entries in Z due to the symmetry requirement, therefore giving a total of 1 + 2 n−1 . The total number of possible generators thus obtained is indeed the maximum [28]: For the second stage we generate a binary n × n matrix with entries selected uniformly at random. The probability that the given matrix is full rank is given by [5]: After sampling a matrix we therefore need to check whether the matrix is full rank. If not we need to sample another matrix, until we find a full-rank one. The expected number of matrix samples is no more than five for any matrix size.
Using this procedure we generated twenty random sets of commuting n-Pauli operators of size n ranging from 3 to 25. The resulting tableaus are guaranteed to have rank n by construction. For each set we apply the diagonalization procedure from Section 4.4.1 (cz), the cnot-based approach from Section 4.4.2, either directly (cnot), or using the cnot reduction from [27] with block size equal to log 2 (n) or the optimal block size in the range 1 through n; labeled (cnot-log2) and (cnot-best), respectively. In addition to the two greedy methods (greedy-1, greedy-2) described in Section 4.4.3, we also applied the tableau normalization procedure described in [16], and denoted (gmc).
The results averaged over the twenty problem instances of a given size are summarized in Table 1. The first column of results list the number of cnot operations, the number of single-qubit gates, and the depth of the generated circuit for diagonalizing the set of Paulis. The second and third columns summarize the circuit complexity when the methods are applied to simulate products of the Pauli exponentials. We will first focus on the circuit complexity of the diagonalization and consider the simulation results later. For the diagonalization process we also provide an aggregated comparison of the performance of the different methods in Figure 10. This figure gives the percentage of problem instances, across all problem sizes, for which the method on the vertical axis strictly outperforms the method on the horizontal axis. From the results in Table 1 we see that the performance of the gmc method is closest to that of the cnot method. Overall, though we still find that the cnot method requires fewer cnot gates for 62% of the problems and fewer single-qubit gates in 84% of the cases. In terms of depth of the diagonalization circuit, we see that gmc generally outperforms cnot-best and both greedy methods. However, the latter three methods require far fewer cnot and single-qubit gates than gmc. The cz method generally outperforms gmc and the three cnot methods in terms of both gate counts and circuit depth. The greedy approaches excel at reducing the number of cnot gates, but generally have a larger circuit depth. The greedy-2 approach additionally outperforms all methods in terms of the number of single-qubit gates, although this difference is only marginal for the cz method. The cnot-best method chooses a block size that minimizes the cnot count across all possible block sizes, and by definition is therefore never outperformed by cnot-log2. The number of single qubit gates is not affected by the optimization of the cnot operations and is therefore identical for all three cnot methods. The optimal choice of blocksize was relatively small and equal to two for 48% of the test problems, three for 28%, and 4 for some 10% of the problems. For problems with n between 20 and 25, the frequencies changed to 48%, 40%, and 10%, respectively. For the very small problem sizes it was often found that the unoptimized cnot circuit was at least as good as the optimized one, and amounted to around 12% over all test problems.
We now consider the performance of the different methods in evaluating the product of exponentials of the Paulis in each set. For this we include the direct method, which was described in Section 2. The circuits generated are pre-optimized by omitting gates that clearly cancel. For the direct method we additionally apply level-two circuit optimization as proved by Qiskit. The results of these experiments are summarized in the two simulation columns of Table 1. The second of these columns gives the result after optimizing the order of the Pauli operators. For the diagonalization-based approaches we use the procedure described in Section 4.5, with sorting applied according to the canonical qubit order. For the direct approach we adopt a greedy approach in which we iteratively pick an unused operator whose addition requires the smallest number of additional cnot gates, and in case of a tie, the smallest the number of single-qubit gates.
part responsible for exponentiation, plus an additional single-qubit R z gate for each of the n operators. From the cnot exp. column in Table 1 we see that the number of cnot gates in the central part of the circuit is nearly identical for the different methods, and the difference must therefore be due to the depth of the diagonalization circuits. Having more cnot gates in a shallower circuit indicates a higher level of parallelism where two or more gates can be applied simultaneously. This also suggests an improvement to the cz approach: instead of simply sweeping the entries row by row, we could process the entries in a way that promotes parallelism by avoiding repeated dependence on a single qubit. Another possible modification, which applies to all methods, is to connect the cnot gates between pairs of qubits where the Pauli term is z, and only eventually connect the partial parity values to the ancilla. This approach can help improve locality of the cnot operators, and enable a higher level of parallelism, as the cost of potentially more complex optimization and circuit generation code.
General sets of Paulis. When ignoring the sign, the number of n-Pauli operators that can mutually commute is 2 n . We can therefore expect that the number of commuting Paulis in a set exceeds n, which was used in the experiments above. In our next set of experiments we consider sets of size m. We generate these by multiplying the XZ blocks of the initial tableaus used earlier by a full-rank m × n binary matrix, thereby generating a new tableau with X and Z block sizes equal to m × n. The sign column of the tableau is initialized at random. We perform three types of optimization regarding the operator order. The base option uses the operators in the order they are provided. The opt strategy applies the ordering described above for our experiments with sets of size n. The final optimization strategy (rnd) aims to minimize the number of cnot gates based on random permutations. In particular, for the diagonalization methods, we use permutations of [n] to determine the qubit sorting order, as described in Section 4.5. For the direct approach we use permutations of [m] to shuffle the operator order before applying the greedy optimization procedure described above; the first permutation is the canonical ordering to ensure the results are at least as good as those of the opt strategy. For our experiments we use 100 random permutations per setting and then select the result that has the lowest number of cnot gates. The gmc method as given in [16] does not apply to non-square tableaus and we therefore do not use it in subsequent experiments. The average circuit complexities for simulation, obtained for the three optimization procedures for n = 20 and varying values of m, are shown in Table 2. Results in the table are grouped by the resource type: cnot and single-qubit counts and depth. Note that this differs from Table 1, where the results were grouped by optimization type (base or opt). Looking at Table 2 we see that diagonalization-based simulation is uniformly better than the direct method on our test problems, even for m much less than n. The diagonalization part of the circuit has a complexity that is essentially constant for m ≥ n, and the overhead therefore diminishes as m grows, thereby leading to a further improvement over the direct method. Aside for m = 3 we see that the single optimization step used in opt can significantly reduce the cnot gate count and circuit depth. As before, the number of single-qubit gates is unaffected by optimization for the diagonalization-based methods, but reduced substantially for the direct method. Randomized optimization helps further lower the circuit complexity, although the improvement is much less pronounced.
In Table 2 we purposely omit results on the complexity of the diagonalization circuit, as they were found to be similar for m < n and identical for m ≥ n to the ones shown in Table 1. The fact that we obtain identical circuits for m ≥ n may seem surprising at first, but becomes apparent when noting that a circuit that diagonalizes a generating set for Paulis automatically diagonalizes all Paulis in the group it generates. We here show the result for a slightly different procedure of diagonalizing the X block, as summarized in Algorithm 5. Then the output of Algorithm 5 applied to tableau B · T gives the same tableau and index set I for any full-rank B ∈ F m×n 2 with m ≥ n.
Proof. For analysis it will be easier to update the algorithm to omit column exchanges between the X and Z blocks, and instead sweep directly based on the entries in the column of X if the index was found, there or based on the entries in the column of Z, otherwise. Note that full-rankness of the tableau guarantees that at least one of the indices exists. Although we do not apply the column exchanges, we do maintain index set I. Applying the Hadamard operator to the columns (qubits) in I after normalization, then gives the original algorithm since row-based operations commute with Hadamard.
All tableaus are generated as linear combinations of rows in T . It then follows from full-rankedness of B that all Paulis corresponding to the tableaus can be instantiated using the same generator set. The updated normalization algorithm produces generator sets of the same form used in Algorithm 4. From the analysis of the latter algorithm we know that representation in this form is unique; no generator set has more than one tableau representation. Algorithm 5 must therefore return the same tableau and index set I.
Given that the tableaus after diagonalization of the X block the number of Hadamard gates used in the process are identical, it follows that the circuit complexity for simultaneous diagonalization is the same for m ≥ n. For cz-based diagonalization, the expected cnot count then follows directly from the construction of random Pauli bases in Algorithm 4. For each of the rows that are set in the Z block, on average half of the entries will be one. In case of the column swap, no additional entries are set to one, and the expected number of elements to sweep is therefore A consequence of Theorem 5.1 is that Algorithm 5 can be used to generate a unique representation of a stabilizer state, irrespective of its original representation. Moreover, the Z block and index set I can be concisely represented as a n × n + 1 binary matrix. Similarly, the technique can be used to check if two sets of commuting Paulis have a common generator set up to signs. Note that our condition of full-rankedness of the tableau T can be relaxed; if needed the tableau can be augmented by adding rows with the missing diagonal elements. These basis vectors are never used in linear combinations of the original rows in T and can be discarded after normalization.
Algorithm 5 Normalization of full-rank tableau.
Input: Full-rank tableau T = [X, Z] with block size m × n such that m ≥ n. Swap rows i and k 10: for i ∈ [m] such that i = k and X[i, k] = 1 do

11:
Sweep row i with row k 12: end for 13: end for 14: Return the updated tableau along with I.

Quantum chemistry
The Hamiltonians we have considered so far were randomly generated, and may therefore be structurally different from those found in practical applications. In this section we look at the time evolution of Hamiltonians arising from fermionic many-body quantum systems. We use the spin Hamiltonians obtained in [13] by using the second quantization formalism of the fermionic system followed by a conversion to interacting spin models by applying the Jordan-Wigner, Bravyi-Kitaev, or parity encodings [10,22]. The resulting Hamiltonians are expressed as a weighted set of Paulis, as desired. Table 3 summarizes the molecular Hamiltonians, along with the basis sets [30] used in the discretization. In order to apply simultaneous diagonalization we first need to partition the Hamiltonian terms into sets of commuting Paulis. For this we use two different greedy coloring strategies (largest first, and independent set) implemented in NetworkX [19], along with a custom implementation of a greedy algorithm in which each of the Paulis is sequentially added to the first set it commutes with, creating a new set if needed (this approach, which was also given in [14], has the additional advantage that no graph needs to be constructed). Overall, as seen in Table 3, the three different partitioning strategies give similar results in terms of number of partitions, as well as median and maximum partition size. The same applies across the different encoding schemes, but we assume that these are given; the partitioning scheme can be freely chosen. Note that the maximum partition size can be much larger than the number of qubits (terms in each of the Paulis). In some cases the NetworkX graph coloring algorithms either ran out of memory or did not return a result in a reasonable amount of time. Throughout the results we indicate those cases are by dashes.
Once the terms in the Hamiltonian and partitioned into commuting sets we can apply the different simulation algorithms to each of the individual partitions. We compare the diagonalization-based approaches with direct exponentiation. As before, we apply level-two circuit optimization as provided in Qiskit to the direct exponentiation approach only as it was found not to give any improvements in circuit complexity for the diagonalization-based circuits. We additionally use the opt strategy described in Section 5.1 to determine the order of the Paulis within each partition. For the cz and direct methods we additional allow the use of the rnd optimization strategy. In the determination of the circuit complexity we assume that the R z operators has a single-gate implementation. We determine the total circuit depth by simple adding the depths of the circuits for each of the partitions. It might be possible to reduce the depth and single-qubit counts due to potential simplifications at the circuit boundaries; we expect this reduction to be negligible.
The circuit complexity when partitioning the Hamiltonians with the greedy sequential approach is shown in Table 4. The first thing we note is that the cnot-based diagonalization performs substantially worse that both the cz and greedy-based approaches, in stark contrast to the results on random Paulis in Section 5.1, where the performance of the cnot approach closely matched that of cz. This could be caused by the fill-in during normalization the Z block, which is present in the cnot approach, but absent in the other two diagonalization approaches. Despite its relatively poor performance, the cnot method still consistently outperforms direct exponentiation in terms of  Table 3: Problem instances of different molecules when discretized in the given bases, along with the number of qubits (#) and the resulting number of Pauli terms in the Hamiltonian. The entries in columns for largest-first, independent-set, and sequential greedy partitioning methods give the number of sets in the partition, as well as the median and maximum size of the sets, respectively, for each of the three encodings: Bravyi-Kitaev (BK), Jordan-Wigner (JW), and parity (P). circuit depth and cnot count. This difference is much more substantial for the cz and greedy approaches, where we see reductions of up to 50%. Application of the rnd optimization strategy shows good improvements for cz diagonalization. For the direct method, however, the improvements are only marginal. When applied to randomized Hamiltonians, we saw that the greedy approach generally required fewer cnot gates than the cz approach. We concluded that this was mostly due to the cnot count in the diagonalization part of the circuit, rather than exponentiation part. For the experiments here we see that the difference between the different methods is minimal at best. A similar pattern emerges for the remaining experiments using different partitioning algorithms. The cnot counts and circuit depths for these experiments are summarized in Tables 5  and 6, respectively. Due to their poor relative performance we omit the results for cnot-based diagonalization, and also leave out the single-qubit counts, as these are very similar to the ones given in Table 4.
Somewhat surprisingly, we see that across the different simulation methods the results for independent-set  Table 4: Results based on the greedy sequential partitioning method, with the cnot count and circuit depth for the Jordan-Wigner encoding, as well as the single-qubit count for the Jordan-Wigner, Bravyi-Kitaev, and parity encodings, respectively.  greedy partitioning are substantially worse than those of the other two partitioning methods, despite the similarity of the partition metrics shown in Table 3. To get a better understanding of what causes this difference we plot the number of cnot gates for the diagonalization circuit for each of the partitions against the size of the partition. The resulting plots, shown in Figure 11, indicate that the largest first and sequential strategies behave very similar. The independent set coloring strategy on the other hand often requires a substantial larger number of cnot gates for small partitions. This difference is seen across all molecules and encoding schemes, but is especially apparent for the Jordan-Wigner encoding. Given that, among the three coloring strategies considered, the independent-set strategy is the most time-consuming anyway, we would not recommend its use in this setting.
Overall, we see from the results in Tables 5 and 6 that the circuits for the Hamiltonians based on the Jordan-Wigner encoding tend to be simpler than those for the Bravyi-Kitaev and parity encodings. Finally, we note that each subset of commuting Paulis can be simulated independently. It is therefore possible to choose a different method per partition. For instance, we could select the direct method for some partitions and the diagonalizationbased method for others. To implement this idea we modified the experiments based on diagonalization such that the direct method was used if it was found to have a circuit with fewer cnot gates. The improvements obtained with this approach were very minor and in fact showed that in most cases the diagonalized-based approach was not outperformed by the direct method on any of partitions.

Conclusions
In this paper we have shown that the use of simultaneous diagonalization for Hamiltonian simulation can yield substantial reduction of circuit complexity in terms of cnot count and circuit depth, compared to direct exponentiation of the individual Pauli operators. The proposed approach first partitions the Pauli operators into sets of mutually commuting operators. We used two different strategies provided by the NetworkX package (independent set and largest first) and compared them against a pure greedy scheme in which Paulis are added sequentially to the first partition whose elements it commutes with. Given the need to instantiate the entire commutatively graph in NetworkX prior to coloring, the latter strategy is clearly favorable in terms of computational complexity. For synthetic test problems we found the three strategies to have similar performance, but a clear difference was   Table 6: The circuit depth for different exponentiation methods for the sequential, largest-first, and independentset partitioning methods. The results per method correspond to different encodings given by, from top to bottom, Jordan-Wigner, Bravyi-Kitaev, and parity.
found in application to problems in quantum chemistry, where the independent-set strategy was found to perform substantially worse compared to the other two. The next step is generate circuits that simultaneously diagonalize the operators in each set of commuting operators in the partition. This can be done by representing the Pauli operators in a tableau form consisting of X and Z blocks, along with a sign vector. The operators are diagonalized when all entries in the X block are eliminated using appropriate Clifford operators along with row and column manipulations. We introduce novel elimination schemes that first diagonalize the X block using row operations and Hadamard gates only. When applied to tableaus with full column rank, the resulting schemes give circuits consisting of sequence of H-S-CZ-H and H-S-CX-S-H gates respectively. The introduction of column-based elimination of entries in the Z block can help reduce the cnot count, and may have separate application in representing stabilizer states.
We apply the proposed techniques to random sets of commuting Pauli operators as well as practical problems arising in quantum chemistry. To facilitate the generation of random test problems we introduce an efficient new algorithm for uniformly sampling generator sets of commuting Paulis. The resulting insights also lead to a compact and unique representation in the form of a binary n × n + 2 matrix for sets of commuting Paulis that can be generated using the same generator set. This construction can also be used in the representation of stabilizer states.
For the chemistry problems we show that the cnot count can be reduced by a factor of two to three compared to the direct approach. The circuit depth is generally halved, but this may be further improved when considering the mapping of the circuits to systems with limited qubit connectivity.