Clifford recompilation for faster classical simulation of quantum circuits

Simulating quantum circuits classically is an important area of research in quantum information, with applications in computational complexity and validation of quantum devices. One of the state-of-the-art simulators, that of Bravyi et al, utilizes a randomized sparsification technique to approximate the output state of a quantum circuit by a stabilizer sum with a reduced number of terms. In this paper, we describe an improved Monte Carlo algorithm for performing randomized sparsification. This algorithm reduces the runtime of computing the approximate state by the factor $\ell/m$, where $\ell$ and $m$ are respectively the total and non-Clifford gate counts. The main technique is a circuit recompilation routine based on manipulating exponentiated Pauli operators. The recompilation routine also facilitates numerical search for Clifford decompositions of products of gates, which can further reduce the runtime in certain cases. It may additionally lead to a framework for optimizing circuit implementations over a gate set, reducing the overhead for state-injection in fault-tolerant implementations. We provide a concise exposition of randomized sparsification, and describe how to use it to estimate circuit amplitudes in a way which can be generalized to a broader class of gates and states. This latter method can be used to obtain additive error estimates of circuit probabilities with a faster runtime than the full techniques of Bravyi et al. Such estimates are useful for validating near-term quantum devices provided that the target probability is not exponentially small.


Introduction
Digital simulation of quantum phenomena is a central problem in computational physics, with deep theoretical and practical significance. On the theory side, research into this problem has produced surprising connections between different branches of mathematics and physics, while providing major insight into the innerworkings of quantum theory. On the practical side, there is high demand for faster and more accurate simulations in fields such as quantum computing, materials science, and quantum chemistry. In these disciplines, it is desirable to have a toolkit of simulation techniques which can handle the wide variety of quantum systems encountered.
Quantum computers are widely believed to be hard to simulate classically. Nevertheless, this hardness is only asymptotic, and it is always possible in principle to simulate larger and larger quantum computers by Hammam Qassim: hqassim@uwaterloo.ca optimizing the usage of classical resources. Efforts to demonstrate such optimizations have surged recently [2,[6][7][8]15]. These efforts are partially motivated by upcoming quantum computers containing on the order of 100 qubits [13,14]. Since these prototypes are likely to be considerably noisy, the ability to simulate their ideal counterparts is essential for validating their performance and to aid in their design and debugging. Furthermore, simulation of quantum circuits is directly tied to establishing when quantum computers are useful, as demonstrating quantum supremacy requires sampling from distributions which are provably hard to (approximately) sample from using classical means [10].
Near-Clifford circuits constitute an important class of quantum circuits. Fault-tolerant implementations of quantum computing in the near-term are likely to contain a small number of non-Clifford gates. This is the case since the overhead associated with implementing non-Clifford gates fault-tolerantly is huge, for example accounting for nearly 90% of the number of physical qubits in surface code implementations of Shor's algorithm [9]. A host of classical simulation schemes have recently been proposed for simulating near-Clifford circuits, with runtimes exhibiting mild exponential dependence on n [1,[3][4][5]12]. State-of-the-art techniques rely on decomposing the n-qubit output state of a near-Clifford circuit U as a superposition of stabilizer states [3,5], with N 2 n . These techniques allow for approximate sampling from the outcome distribution of the circuit, with the runtime for producing a single sample scaling like O(N −2 ), where is the error in total variation distance [3,5]. The minimum number of terms over all such decompositions of |ψ is called its stabilizer rank, and is a measure of the complexity of simulating |ψ . Finding stabilizer decompositions with a small number of terms is typically quite hard, even for simple product states [3,5]. This motivates using approximate stabilizer decompositions. It was shown in [5] that an approximation to |ψ with a smaller number of stabilizer terms can be obtained by "randomly sparsifying" any decomposition of the type in eq. (1). This is done by generating k random stabilizer states, each chosen independently to be |v i with probability |a i |/ a 1 , then setting the approximate state |ψ to be their equal superposition (with a uniform scaling factor). If k is large enough (roughly a 2 1 −2 ), then ψ −ψ ≤ with high probability. Typically k N , so that using the approximate state |ψ for simulation purposes is preferable.
Randomized sparsification can be performed using a sum-over-Cliffords Monte Carlo method [5], which we briefly review before stating our contribution. Given a circuit U , each non-Clifford gate in U can be expanded as a linear combination of Clifford gates. The entire circuit can then be expressed as for a collection of (typically exponentially many) Clifford circuits C i . The sum-over-Cliffords method uses this decomposition to compute the stabilizer states in eq. (1) while preserving the relative phases between them. For each term, a phase-sensitive Clifford simulator is used to convert the description of the circuit C i to a description of the stabilizer state |v i = C i |0 n while keeping track of the global phase. Similarly, the stabilizer terms of the approximate state |ψ can be computed by applying circuit C i with probability |a i |/ a 1 . The Clifford simulator computes the so-called CH-form of C i |0 n by performing an update for each gate in C i . This computation takes time O( n 2 ), where is the total number of gates in the circuit. Thus to compute the CH-form for all stabilizer terms comprising the approximate state |ψ , the method in [5] runs in time O(k n 2 ).
In the current paper, we report an improved variant of randomized sparsification based on recompiling the circuit using exponentiated Pauli operators. This variant removes the dependence on the total number of gates in the circuit. More precisely, it reduces the time cost of computing the CH description of |v i from O( n 2 ) to O(mn 2 ), where m is the non-Clifford gate count in U . This reduces the runtime of key subroutines in [5] by the factor /m, which can be significant for circuits in which the number of Clifford gates is very large compared to non-Clifford gates. This is relevant, for example, in the case of the QAOA circuit simulated in [5], which contains hundreds of Clifford gates and only 64 T gates. The above savings are achieved at an additional memory cost of O(nm) bits, which is negligible in the regime of interest.
As a side benefit, the Clifford recompilation method makes it easier to numerically search for decompositions of the kind in eq. (1) with a reduced 1-norm. As the exponential scaling of the runtimes in [5] is determined by a 2 1 , this allows for a significant speed-up in some cases. Furthermore, the total number of non-Clifford gates in the circuit can be reduced if adjacent gates in the recompiled circuit cancel out. This suggests a systematic way of optimizing the compilation step to reduce the number of non-Clifford gates in a given circuit, with the possibility of significant overhead savings in practical fault-tolerant implementations.
Another use of Clifford recompilation is in estimating circuit amplitudes. It is straight-forward to see that the random variable x|ψ is an unbiased estimator of the amplitude x|ψ , where x ∈ F n 2 denotes a computational basis state. Clifford recompilation reduces the complexity of evaluating x|ψ by the factor /m, in comparison to [5]. Estimates of the amplitude obtained in this way are used in the Metropolis simulator in this latter work. They can also be used to estimate the probability | x|ψ | 2 up to additive error with a runtime scaling like O( a 2 1 −2 ). Learning probabilities in this way can be generalized to a wider range of quantum circuits. It is also much less memory intensive, and is better suited for massive parallelization and GPU-computing. However, in general such polynomial precision estimates are only useful as long as the target probability is not exponentially small.
Notation. We write X, Y , and Z for the single qubit Pauli matrices, and typically use the letters P, Q and W to refer to Hermitian elements of the n-qubit Pauli group. The shorthand M b:a , with b > a, denotes the product

Improved randomized sparsification
The CH-form of stabilizer states. We begin by reviewing the CH-form of stabilizer states. A C-type Clifford is one which satisfies U C |0 n = |0 n . An Htype Clifford consists of Hadamard gates on some subset of the qubits. Any stabilizer state of n-qubits can be written as where U C is a C-type Clifford, U H is an H-type Clifford, s ∈ Z n 2 , and ω is a complex number. U C can be described by a stabilizer tableau specifying its action on the Pauli operators X i and Z i for i = 1, . . . , n, and U H can be described by an n-bit string specifying which qubits are acted on. As shown in [5], the tuple Proof. Without loss of generality we assume that θ ∈ {π/2, π/4}. We can compute a Pauli P such that (U C U H ) † P U C U H = P in time O(n 2 ), using the stabilizer tableau of U C and the fact that HXH = Z. Given such P , the state can be updated in time O(n 2 ). This follows from the following observations. If θ = π/2, then exp(iθP )wU C U H |s = wU C U H (iP )|s = w U C U H |s , for some w and s which can be computed in time O(n). If θ = π/4, then exp(iθP )wU C U H |s = wU C U H exp(i π 4 P )|s = w U C U H (|s + i δ |s ), for some δ ∈ Z 4 , ω , and s , all of which can be computed in time O(n). If s = s then the update is trivial. If s = s , then the update can be computed in time O(n 2 ) by combining Proposition 4 in [5] with the update rule for U C under right-multiplication by gates from the set {CNOT, CPHASE, PHASE}, which is given in the same section in [5].
Clifford recompilation. We now describe recompiling the circuit in order to reduce the complexity of computing the CH-forms of the output state. Consider a quantum circuit U consisting of gates, m of which are non-Clifford gates. We can write U in the form where C i are Clifford circuits, and U i are non-Clifford gates. We can then "pull" all non-Clifford gates to the end of the circuit where We can absorb the action of the Clifford gates into the input state, so that the output state is given by where |φ ≡ C m:0 |0 n is a stabilizer state. Note that the CH-form of |φ can be computed in time Storing the CH-form of |ψ in memory requires O(n 2 ) bits.
Next we discuss how to compute the U i . Suppose we are given a Clifford decomposition of each non-Clifford gate in U ; where each V ij is a product of a small number of elementary Clifford gates. We will use exponentiated Pauli matrices to compute and store U i efficiently. First note that each V ij can be written as a product of Clifford gates of the form exp(iθP ) where P is a Pauli and θ ∈ (π/4)Z 1 . We can therefore write where a ij absorbs any phases from the elementary gates, and {P ijk } is some collection of Pauli matrices. Then 1 For example, the Hadamard, PHASE, and CZ gates can be written as Similar identities can be derived for any Clifford gate acting on a fixed number of qubits.
Set ω ← (bij/|bij|)ω; where each stabilizer state |u i is produced by acting on the fixed stabilizer state |φ with a circuit consisting of O(m) Clifford gates of the form exp(iθP ), θ ∈ (π/4)Z.
Combining this with Proposition 1 shows that we can produce the CH-form of |u i in time O(mn 2 ), which improves on the O( n 2 ) required by the method described in [5]. This improvement is applicable to both the norm estimation and the heuristic Metropolis algorithms of [5]. The total time cost of computing the CH-forms of the approximate state is O(kmn 2 ) in our method, vs. O(k n 2 ) in [5]. The runtime reduction factor of /m is significant because the randomized sparsification subroutine requires computing an exponential number of CH-forms, roughly k ≈ a 2 1 −2 , in order to obtain an -approximation to |ψ . Recall that the factor a 2 1 is exponential in the number of non-Clifford gates. For a Clifford+T circuit, for example, it is equal to 2 0.228m , where m is the T count. Set ω ← (a il /|a il |)ω; can be implemented as a Monte Carlo algorithm. The state space is the set of CH-tuples (ω, U C , U H ). In our improved variant, the initial state is the CH-form of |φ . For each gate U i in the reduced circuit, a random Clifford gate is applied to the current state according to eq. (10). Namely, V ij is applied with probability |b ij |/ b i 1 . The relative phases between the terms are accounted for by updating ω appropriately. After the last gate, the final state is scaled by the factor b 1 ≡ i b i 1 . This is summarized in Algorithm 1.
A direct calculation shows that the output |z of this algorithm satisfies Moreover, it is shown in [5] that the average of k runs of the algorithm is a random vector |ψ = k −1 k i=1 |z i which satisfies For example, taking k = (2 a 1 / ) 2 ensures that the probability of the error exceeding is at most 1/4.
where all collinear terms in the second sum are collected in the third sum. Note that computing eq. (15) can be done efficiently by updating the CH description of each term under the Clifford gates V ij . It is then straightforward to show that a i 1 ≤ a i 1 .
Consider the modified Monte Carlo in Algorithm 2. As before, we have E(|z ) = |ψ , and one can take the average of k iterations of this algorithm to approximate |ψ . As a 1 depends on the particular trajectory taken, we need to choose k ≈ ∆ 2 −2 , where Computing ∆ is generally hard since it involves a maximization over all trajectories. Unless a smaller upper bound for ∆ can be learned, the safe solution is to settle for the pessimistic sample size which scales like a 2 1 . Even with this limitation, the above method improves the precision of the approximation obtained for the same sample size. We can use heuristics to predict whether adaptive sampling is useful. Let α be the random variable which takes value a 1 with probability equal to the sum of the probabilities of all trajectories consistent with a 1 . One can estimate the mean and variance of α by sampling, and heuristically argue that most trajectories have a 1 within, say, 3 standard deviations of the mean. Thus if the mean is much smaller than a 1 , and the variance is not too large, it is a good indication that adaptive Monte Carlo improves the rate at which |ψ converges to |ψ .

Clifford decompositions of products of gates.
Finding a Clifford decomposition of U i with minimal 1-norm a i 1 can be done using a convex optimization software [5,12]. These brute-force optimizations can only be performed for one-or two-qubit gates, as it is too computationally intensive for more qubits.
For an operator A, denote by Ω(A) the minimum 1norm over all Clifford decompositions of A. Then Ω is sub-multiplicative, i.e.

Ω(AB) ≤ Ω(A)Ω(B). (17)
For two operators A and B, let us call a Clifford decomposition of the product AB contractive if its 1-norm is strictly less than Ω(A)Ω(B). Finding a contractive Clifford decomposition of AB is generally hard unless A and B are at most 2-qubit gates acting on the same pair of qubits, so that we can use brute-force convex optimization. An exception is when the two operators can be mapped to the same pair of qubits via a Clifford C. In that case, one can find a contractive decomposition of CABC † using convex-optimization, and then apply the inverse C † , using the idea of eq. (10) to keep the terms in a form enabling fast CH updates. This method is well-suited to the case of Clifford+e iθP circuits, which we discuss in the next section.

Example: Clifford+e iθP circuits
As an application of the methods presented so far, we consider circuits where each non-Clifford gate has the form e iθP , where P is a Pauli operator, and θ ∈ [−π, π]. This class of circuits includes important cases such as Clifford+T circuits, and more generally Clifford+Z rotation circuits. For a circuit U of this type containing m non-Clifford gates, eq. (6) reduces to for a list of Hermitian Pauli operators Q 1 , . . . , Q m , a list of angles θ 1 , . . . , θ m , and a stabilizer state |φ . Note that the unitary e iθP is Clifford if and only if θ ∈ (π/4)Z.
Thus we may, without loss of generality, assume that θ i ∈ (0, π 4 ), as otherwise we can "factor out" Clifford gates e ±i π 4 Qi and absorb them into the state |φ , at the cost of applying a Clifford mapping to the list of Pauli operators Q j with j < i. The Clifford mapping can be quickly computed using the relation We refer to eq. (18) as the canonical form of the output state of a Clifford+e iθP circuit. For a single qubit Pauli operator P , the optimal decomposition of e iθP into Clifford gates is given by [5] e ±iθP = (cos θ − sin θ)1 + ( This decomposition achieves the minimum 1-norm, ).
Clifford decompositions of products of exponentiated Pauli operators. If a sequence of exponentials in eq. (18) multiplies to a Clifford gate, it can be commuted past the gates prior to it and absorbed into the input state, effectively reducing the number of non-Clifford gates in the circuit. As we are dealing with a universal set of gates, it is generally intractable to determine whether such a product belongs to the Clifford group. In the simplest case of multiplying two gates e iθP e iφQ with θ, φ ∈ (0, π/4), it can be shown that the product is a Clifford gate if and only if P Q = ±1 and θ + φ ∈ (π/4)Z. If P Q = ±1, there exists no better decomposition than the trivial one, obtained by decomposing each gate separately; This can be verified numerically for 1 and 2 qubits, and the general case follows from the fact that any pair of n-qubit Paulis P and Q can be mapped by a Clifford circuit to a pair of Paulis with support on the first two qubits only, together with the fact that Ω(U ⊗ 1) = Ω(U ). For products of three or more gates, we can find decompositions with significantly reduced 1-norm in certain cases. Numerically, we observe that the possibility of 1-norm reduction is correlated with the rank of the binary representation of the Paulis. Specifically, for x, z ∈ Z n 2 write X[x] ≡ X x1 ⊗ · · · ⊗ X xn , Z[z] ≡ Z z1 ⊗· · ·⊗Z zn , and P (x,z) ≡ i x·z X[x]Z [z]. Then numerics in the two qubit case suggest that, if and only if rank(v 1 , v 2 , · · · , v k ) < k with rank(·) computed mod 2. For example, it is possible to numerically verify that for any choice of angles (θ 1 , θ 2 , θ 3 ). We can also find the optimal Clifford expansion 2 Therefore for any triple of n-qubit Paulis (P, Q, W ) which are equivalent to (X 1 , Y 1 , Z 1 ) via a Clifford V , the optimal decomposition is given by As before, each Clifford term V † (C i ⊗ 1 ⊗n−1 )V is easy to cast as a product of a small number of Cliffords of the form e iφR where R is a Pauli and φ ∈ (π/4)Z. This way we obtain a decomposition of the many-body operator e iθ1P e iθ2Q e iθ3W which is both optimal and in a form enabling fast phase-sensitive updates.
The reduction in the 1-norm by decomposing products of gates can be quite large, see for example Figure 1, where the log of the 1-norm is plotted for different one and two qubit gate sequences.

Amplitude estimation
To illustrate the value of Clifford recompilation, we consider the problem of estimating circuit amplitudes of near-Clifford circuits. Randomized sparsification can be used to estimate the amplitudes x|ψ of the output state. Indeed eq. (13) implies that where |z is the random vector generated by Algorithm 1. By a complex-valued version of Hoeffding's inequality [11,16], the random vector |ψ obtained by averaging k instances of |z satisfies is optimal. More generally, the Clifford terms in eq. (26) vary depending on (θ 1 , θ 2 , θ 3 ), and a general expression is too tedious to write down, although finding the optimal decomposition numerically is easy on a case-by-case basis.
good estimate of x|ψ with high probability. Given the CH-form of |z , it takes time O(n 2 ) to compute the amplitude x|z using the methods in [5]. This gives a runtime scaling of O( a 2 1 mn 2 −2 log(δ −1 )) for computing an additive error estimate of the amplitude, using the improved sparsification Monte Carlo of Section 2. Typically we are interested in estimating the probability Pr(x) = | x|ψ | 2 , rather than the amplitude. It is straightforward to check that if | x|ψ − x|ψ | ≤ c , where c = √ 2 − 1, then | x|ψ | 2 − P (x) ≤ (as long as ≤ 1). In other words, estimating the probability instead of the amplitude only requires multiplying k by the constant c −2 ≈ 6.
Note that the heuristic Metropolis simulator of [5] relies on taking the ratio between two amplitude estimates, obtained exactly as described above, in order to determine the next move at each Metropolis step.
Estimating amplitudes as above relies on very few properties of stabilizer states. It is only needed that stabilizer states admit an efficient, phase-sensitive description which can be updated under Clifford gates, and that this description can be used to efficiently compute any desired amplitude of the state. Thus estimating amplitudes as described in this section, and indeed the heuristic Metropolis simulator of [5], can both be used for other states and gates satisfying similar criteria. Perhaps the main candidate for this generalization is Fermionic Gaussian states and nearest-neighbor matchgates [17,18]. Nearest-neighbor matchgates acting on Fermionic Gaussian states are known to be classically simulable, and yet augmenting the gate set with nearest-neighbor SWAP gates is sufficient for universal quantum computing. This motivates searching for matchgate-decompositions of the SWAP gate. Our numerical search suggests that the decomposition has minimal 1-norm over all matchgate-decompositions of SWAP, where G(1, ±iX) is the ±iSWAP gate (which is a matchgate). This decomposition has 1-norm equal to √ 2, which implies that amplitude estimation using this decomposition has a runtime scaling like O(2 m ), where m is the number of SWAP gates. Since it is possible to exactly compute the amplitude in time O(2 m ) by computing the 2 m matchgate contributions corresponding to the expansion in eq. (30), this decomposition is not useful for this purpose. It may be possible nevertheless to find smaller 1-norms by considering decompositions of parallel SWAP gates into nearest-neighbor matchgate circuits involving more than two qubits.
Applying the same idea to local vs. entangling gates, we find numeric evidence that the decomposition achieves the minimum 1-norm over all decompositions of CPHASE into product gates (not necessarily Cliffords), where S i is the PHASE gate on qubit i. Similar to the matchgate case, the 1-norm of this decomposition is too large to provide an improvement over brute force methods. Nevertheless, our numerics is not exhaustive and a better decomposition may exist. It may also be the case that a decomposition of the form exists such that each V j is a product gate on 2k qubits and a 1 < 2 k/2 . Our numerical search did not detect such decompositions for k = 2 allowing for up to 4 terms in the sum.

Conclusion
We have described how to improve the simulators of Bravyi et al [5] by recompiling the circuit using exponentiated Pauli operators. The theoretical runtime reduction is given by the factor /m where and m are respectively the total and non-Clifford gate counts, which can be significant when the target circuit is vastly dominated by Clifford gates. As an added benefit, the recompilation puts the circuit in a form where it is easier to search for optimal Clifford decompositions, since all non-Clifford gates appear in sequence. This can significantly reduce the exponential prefactor in the runtime. The circuit recompilation is well-suited to be used with a scheme for computing additive-error estimates of outcome probabilities. This scheme requires much less work than the full techniques of [5], and can in principle be used with a broader class of states and gates.