Clifford Circuit Optimization with Templates and Symbolic Pauli Gates

The Clifford group is a finite subgroup of the unitary group generated by the Hadamard, the CNOT, and the Phase gates. This group plays a prominent role in quantum error correction, randomized benchmarking protocols, and the study of entanglement. Here we consider the problem of finding a short quantum circuit implementing a given Clifford group element. Our methods aim to minimize the entangling gate count assuming all-to-all qubit connectivity. First, we consider circuit optimization based on template matching and design Clifford-specific templates that leverage the ability to factor out Pauli and SWAP gates. Second, we introduce a symbolic peephole optimization method. It works by projecting the full circuit onto a small subset of qubits and optimally recompiling the projected subcircuit via dynamic programming. CNOT gates coupling the chosen subset of qubits with the remaining qubits are expressed using symbolic Pauli gates. Software implementation of these methods finds circuits that are only 0.2% away from optimal for 6 qubits and reduces the two-qubit gate count in circuits with up to 64 qubits by 64.7% on average, compared with the Aaronson-Gottesman canonical form.


Introduction
One of the central challenges in quantum computation is the problem of generating a short schedule of physically implementable quantum gates realizing a given unitary operation, otherwise known as the quantum circuit synthesis/optimization problem. In this paper, we focus on a restricted class of quantum circuits belonging to the Clifford group, which is a subgroup of the group of all unitary transformations. Clifford group elements play a crucial role in quantum error correction [25], quantum state distillation [6,20], randomized benchmarking [21,23], study of entanglement [5,25], and, more recently, shadow tomography [2,16], to name some application areas. Clifford group elements are important and frequently encountered subsets of physical-level and fault-tolerant quantum circuits; sometimes, an entire quantum algorithm can be a Clifford circuit (e.g., Bernstein-Vazirani [25] and its generalizations [9]).
A special property of the Clifford group that plays the central role in many applications is being a unitary 2-design [11,12]. It guarantees that a random uniformly distributed element of the Clifford group has exactly the same second order moments as the Haar random unitary operator. Thus random Clifford operators can serve as a substitute for Haar random unitaries in any application that depends only on the second order moments. However, in contrast to Haar random unitaries, any Clifford operator admits an efficient implementation by a quantum circuit. For example, randomized benchmarking [21,23] provides a scalable fidelity metric for multi-qubit operations which is insensitive to the state preparation and measurement errors. Randomized benchmarking works by measuring the decay rate of a signal generated by a sequence of random Clifford operators of varying length. The 2-design property ensures that the effective noise model obtained after averaging over the Clifford group is the depolarizing channel with a single unknown noise parameter. As another example, classical shadows [16] provide a succinct classical description of a multiqubit quantum state that can be efficiently measured in an experiment without performing the full state tomography. At the same time, a classical shadow determines many physically relevant properties of a state such as expected values of observables. A classical shadow of a quantum state $\rho$ is obtained by repeatedly preparing a state $U\rho U^\dagger$ with a random Clifford operator $U$ and measuring each qubit in the computational basis. The ability to realize a random element of the Clifford group by a short quantum circuit plays the central role in the above examples.
Clifford circuits also serve as a basis change transformation in quantum simulation algorithms. For example, simultaneous diagonalization of mutually commuting Pauli operators by a Clifford basis change can reduce the circuit depth for simulating quantum chemistry Hamiltonians [30]. Another example is tapering off qubits for quantum simulations by identifying Pauli-type symmetries of quantum chemistry Hamiltonians [8,28]. Such symmetry operators can be mapped to single-qubit Pauli Z by applying a suitable Clifford circuit, after which the respective qubits can be removed from the simulation.
Earlier studies of the synthesis of n-qubit Clifford circuits resulted in the construction of asymptotically optimal (i.e., optimal up to a constant factor) implementations in the number of gates used. Specifically, the canonical form introduced by Aaronson and Gottesman [3] accomplishes this using $\Theta(n^2/\log(n))$ gates [26]. In contrast, in this paper we focus on the practical aspects of Clifford circuit optimization: our goal is to implement a given Clifford unitary by a circuit with the smallest possible number of entangling gates. We focus on the minimization of the cnot gate count, drawing motivation from physical layer realizations where entangling gates come at a higher cost than the single-qubit gates, and ignore the connectivity constraints. While in the worst-case scenario ignoring connectivity may lead to an O(n) blowup in the cnot gate count or depth (consider the cost of implementation of the maximal-distance $\text{cnot}(x_1; x_n)$ gate in a linear chain with n qubits), the known difference between the upper bounds on the circuit depth for the all-to-all and Linear Nearest Neighbor (LNN) architectures remains small. Indeed, for the all-to-all architecture the best known upper bound on the two-qubit gate depth is $\frac{10}{3}n + O(\log(n))$ (obtained by combining Lemma 8 in [7] with Corollary III.2.2 in [15], and noting that the cz gate layer can be implemented in depth (n−1) or n depending on whether n is even or odd), and the best-known lower bound is $\Omega(n/\log(n))$ (obtained by a slight modification of the counting argument employed in [26]). In the LNN architecture, the upper bound is 9n [7] and the lower bound is 2n+1 [22]. The above suggests that executing a (random) Clifford circuit in restricted architectures (LNN may often be embedded in other architectures) comes with a relatively small overhead. We also note that our methods and algorithms can be straightforwardly modified to respect a restricted connectivity and target depth minimization rather than gate count minimization.
Current approaches to the synthesis of exactly optimal Clifford circuits are prohibitively expensive even for small parameters: the largest number of qubits for which optimal Clifford circuits are known is six [10]. Using these exhaustive tools leaves little hope of scaling optimal implementations beyond six qubits. Thus, efficient heuristics are desirable for practical applicability. Here we focus on the synthesis and optimization of Clifford circuits that cannot be obtained optimally, namely, circuits with n > 6 qubits.
Here we develop heuristic approaches for the synthesis and optimization of Clifford circuits. Our algorithms and their implementation bridge the gap between nonscalable methods for the synthesis of exactly optimal Clifford circuits and the suboptimal (albeit asymptotically optimal) synthesis methods. Our circuit synthesizer is based on the reduction of the tableau matrix representing a Clifford unitary to the identity, while applying gates on both the input and output sides. Our optimization approach is based on the extension and modification of two circuit optimization techniques: template matching [24] and peephole optimization [27].
To generate an optimized circuit for a specific Clifford unitary, we first compile it using the tableau representation and then apply the optimization techniques to the compiled circuit. We note that the optimization techniques can be applied independently of the synthesizer considered in this paper.
The first optimization technique we develop is a Clifford-specific extension of the template matching method [24]. We discuss previous results on template matching in depth in Subsection 2.1. We introduce a three-stage approach that leverages the observation that in Clifford circuits Pauli gates can always be "pushed" to the end of the circuit without changing the non-Pauli Clifford gates (i.e., Hadamard, controlled-NOT, and Phase gates) and that all swap gates can be factored out of any quantum circuit by qubit relabeling. We thus partition the circuit into "compute," "swap," and "Pauli" stages by "pushing" Pauli and swap gates to the end of the circuit. Next we optimize the "compute" stage using templates. Then we optimize the "swap" stage by exploiting the fact that a swap gate can be implemented at the effective cost of one entangling gate if it can be merged with a cnot or a cz gate.
The second technique we develop is symbolic peephole optimization. It is inspired by the peephole optimization method first introduced in the context of reversible computations [27]. At each step, the symbolic peephole algorithm considers subcircuits spanning a small set of qubits (2 and 3 in this paper) by introducing symbolic Pauli gates (SPGs) to replace the two-qubit gates that entangle qubits in the chosen set with a qubit outside of it. The resulting Clifford+SPG subcircuit is optimized via dynamic programming using a library of optimal circuits.
We numerically evaluate the proposed methods using two sets of benchmarks. The first benchmark is based on the database of optimal Clifford circuits constructed in [10]. We consider a selection of 1,003 randomly sampled 6-qubit Clifford unitaries, conditional on the optimal cnot gate implementation cost being higher than 4 (otherwise, it is easy to implement such a unitary optimally). The set of tools developed in this work is able to recover an optimal (in terms of the cnot count) implementation for 97.9% of the circuits, while producing circuits no more than one cnot away from the optimal count in the worst case. Second, to evaluate the performance on "large" circuits, we consider a toy model of Hamiltonian evolution with a graph state Hamiltonian, defined as follows. For a given graph with n nodes, the Hamiltonian evolution performs the transformation $(\text{cz} \cdot \text{h})^t$, where cz gates apply to graph edges, h gates apply to graph nodes (individual qubits), and t is the evolution time. At integer times, the evolution by such a Hamiltonian is described by a Clifford unitary. Implementing it as a circuit $\text{cz} \cdot \text{h}$ repeated t times turns out to be less efficient than implementing it by using the techniques reported here. The methods we developed are evaluated on a collection of 2,264 circuits and shown to reduce the average cnot gate count by 64.7% compared with the methods proposed by Aaronson and Gottesman in [3]. We make the full benchmark and the raw results available online [1].
The rest of the paper is organized as follows. We begin by briefly revisiting relevant concepts and defining the notations (Section 2). We next discuss previous results that our work is based on (Subsection 2.1, Subsection 2.2). Following this discussion, we describe the proposed methods (Section 3), report numerical results, and evaluate the performance (Section 4). We conclude with a short summary (Section 5).

Background
We assume basic familiarity with quantum computing concepts, stabilizer formalism, and Clifford circuits. Below we briefly introduce relevant concepts and notations. For detailed discussion, the reader is referred to [25] and [3].
Clifford circuits (also known as stabilizer circuits) consist of Hadamard (h), Phase (s, also known as p gate), and controlled-NOT (cnot) gates, as well as Pauli x, y, and z gates. We use I to denote the identity gate/matrix. We also utilize the controlled-z (cz) gate, which can be constructed as a circuit with Hadamard and cnot gates as follows:
$$\text{cz}_{a,b} = \text{h}_b \, \text{cnot}_{a,b} \, \text{h}_b, \qquad (1)$$
where a is the control and b is the target of the cnot.
Clifford circuits acting on n qubits generate a finite group $C_n$, known as the Clifford group. An important property of Clifford circuits is that Clifford gates h, s, and cnot map tensor products of Pauli matrices into tensor products of Pauli matrices. This property can be employed to "push" Pauli gates through the Clifford gates h, s, and cnot as follows:
[Eqs. (2)-(5): circuit diagrams of the Pauli pushing rules for the h, s, and cnot gates.]
Our approach combines two building blocks: Clifford-specific extension of template matching and symbolic peephole optimization. Below we briefly review these techniques. While the developed methods reduce both single- and two-qubit gate count, in this paper we focus on the optimization of the number of two-qubit gates it takes to implement a Clifford group element. The reason for our focus is that the leading quantum information processing technologies, trapped ions [13] and superconducting circuits [17], both feature two-qubit gates that take longer time and have higher error rates compared with those of single-qubit gates.
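The pushing rules can be checked numerically. The sketch below is not a reproduction of Eqs. (2)-(5), whose exact grouping is not available here; it only verifies the standard conjugation relations for h, s, and cnot, together with the cz construction, using small dense matrices. All helper names (`matmul`, `kron`, `conjugate_by`, and so on) are illustrative choices, and the cnot is assumed to have its control on the first (more significant) qubit.

```python
from math import sqrt

def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Tensor product; the first factor is the more significant qubit."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def dagger(a):
    """Conjugate transpose."""
    return [[complex(a[j][i]).conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def conjugate_by(u, p):
    """Return u p u^dagger: the Pauli obtained when p is pushed through u."""
    return matmul(matmul(u, p), dagger(u))

def scaled(c, a):
    return [[c * x for x in row] for row in a]

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

r = 1 / sqrt(2)
I = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
H = [[r, r], [r, -r]]
S = [[1, 0], [0, 1j]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control first
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

checks = {
    "h: x -> z": close(conjugate_by(H, X), Z),
    "h: z -> x": close(conjugate_by(H, Z), X),
    "s: x -> y": close(conjugate_by(S, X), Y),
    "s: y -> -x": close(conjugate_by(S, Y), scaled(-1, X)),
    "s: z -> z": close(conjugate_by(S, Z), Z),
    "cnot: x(control) -> x x": close(conjugate_by(CNOT, kron(X, I)), kron(X, X)),
    "cnot: x(target) -> x(target)": close(conjugate_by(CNOT, kron(I, X)), kron(I, X)),
    "cnot: z(control) -> z(control)": close(conjugate_by(CNOT, kron(Z, I)), kron(Z, I)),
    "cnot: z(target) -> z z": close(conjugate_by(CNOT, kron(I, Z)), kron(Z, Z)),
    "cz = h cnot h on target": close(
        matmul(matmul(kron(I, H), CNOT), kron(I, H)), CZ),
}
```

Each entry of `checks` evaluates to `True`, which is the matrix-level statement that a Pauli placed before the gate can be replaced by the listed Pauli placed after it.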

Template Matching
A size m template [24] is a sequence of m gates that implements the identity function:
$$T = G_0 G_1 \cdots G_{m-1} = I.$$
The templates can be used to optimize a target circuit as follows.
First, a subcircuit $G_i G_{i+1 \pmod m} \cdots G_{i+p-1 \pmod m}$ of the template is matched with a subcircuit in the given circuit. If the gates in the target circuit can be moved together, this sequence of gates can be replaced with the inverse of the other m−p gates in the template. The larger the length p of the matched sequence is, the more beneficial it is to perform the replacement, and for any $p > m/2$ the gate count is reduced. The exact criteria for the application of the template depend on the choice of the optimization objective (e.g., depth, total gate count, 2-qubit gate count). More formally, for parameter p, $m/2 \le p \le m$, the template T can be applied in two directions as follows:
$$\text{forward:} \quad G_i G_{i+1} \cdots G_{i+p-1} \;\to\; G_{i-1}^{-1} G_{i-2}^{-1} \cdots G_{i+p}^{-1},$$
$$\text{backward:} \quad G_i^{-1} G_{i-1}^{-1} \cdots G_{i-p+1}^{-1} \;\to\; G_{i+1} G_{i+2} \cdots G_{i-p},$$
where all indices are taken modulo m.
Any template T of size m should be independent of smaller templates; that is, an application of a smaller template should not decrease the number of gates in T or make it equal to another template. Circuit optimization using template matching is an iterative procedure where at each step we start at an index gate and attempt to match a given template by considering gates to the left of the index gate in the target circuit. If the matched gates can be moved together and the substitution is beneficial, the template is applied as defined above. This step is repeated by incrementing the position of the index gate by one when no match is found, until the last gate is reached.
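As an illustration of the forward application, the following minimal sketch replaces a matched run of p template gates with the inverses of the remaining m−p gates in reverse order. It assumes the matched gates are already adjacent in the circuit (the "moved together" step is not modeled), represents a gate as a hypothetical `(name, qubits)` tuple, and uses the identity s⁴ = I purely as an example; it is not one of the templates of Fig. 1.

```python
# Inverse of each gate name used in this toy example.
INVERSE = {"h": "h", "cz": "cz", "cnot": "cnot", "s": "sdg", "sdg": "s"}

def invert(gate):
    name, qubits = gate
    return (INVERSE[name], qubits)

def forward_replacement(template, i, p):
    """Gates that replace the matched run G_i ... G_{i+p-1} (indices mod m)."""
    m = len(template)
    # The unmatched gates G_{i+p}, ..., G_{i-1} follow the matched run in the
    # cyclically rotated template, so the run equals their inverses reversed.
    rest = [template[(i + p + j) % m] for j in range(m - p)]
    return [invert(g) for g in reversed(rest)]

def apply_forward(circuit, start, template, i, p):
    """Apply the template at position `start` if the p gates match exactly."""
    m = len(template)
    matched = [template[(i + j) % m] for j in range(p)]
    if circuit[start:start + p] != matched:
        return list(circuit)  # no match: leave the circuit unchanged
    return circuit[:start] + forward_replacement(template, i, p) + circuit[start + p:]

s0 = ("s", (0,))
template = [s0, s0, s0, s0]                      # s^4 = I, size m = 4
circuit = [("h", (0,)), s0, s0, s0, ("h", (0,))]
optimized = apply_forward(circuit, 1, template, 0, 3)
```

Here p = 3 > m/2, so three s gates are replaced by the single gate `("sdg", (0,))`, and `optimized` contains three gates instead of five.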
Circuit optimization with templates was originally proposed in [24]. This work has been extended with the introduction of graph-based matching techniques [18]. While the methods in these references are applicable to Clifford circuits since they are defined for universal quantum circuits, neither of them leverages the particular structure Clifford circuits have for optimization. After completion of the present work we became aware that template-based optimization techniques have been recently applied to Clifford circuits in [29].

Peephole Optimization of Quantum Circuits
Peephole optimization [27] is an iterative local optimization technique that optimizes a circuit by considering subcircuits spanning small subsets of qubits A and attempting to replace them with an optimized version drawn from a database (or synthesized on the spot in some other versions). At each step, for a given gate all subcircuits on a fixed small number of qubits (e.g., |A|=4 in [19]) including that gate are considered. For each subcircuit, its cost and the optimal cost (retrieved from the database of precomputed optimal circuits) of the unitary it implements are compared. If a substitution is beneficial, the given subcircuit is replaced with its optimal implementation. The step is repeated for all gates until a convergence criterion is satisfied. Peephole optimization of reversible circuits was introduced in [27] and identified to be complementary to template matching. Since its introduction in the context of reversible computations, this approach has been applied to Clifford circuits [19].
The performance of the standard peephole optimization is limited by the need to store the entire database of optimal circuits in memory and to perform $O\left(\binom{n-2}{|A|-2} g^3\right)$ lookups, where g is the number of gates in the circuit [27]. Furthermore, since the size of the n-qubit Clifford group (inclusive of the Pauli group) equals $2^{n^2+2n} \prod_{j=1}^{n} (2^{2j}-1)$ and grows very quickly with n, it is unlikely that all optimal circuits can be found and stored in a suitable database for more than 6 qubits [10].

Algorithms
We introduce two algorithms for Clifford circuit optimization and apply them to the problem of compiling optimized Clifford circuits. The first algorithm is a Clifford-specific extension of the template matching technique, which we describe in Subsection 3.2. The second algorithm is symbolic peephole optimization, detailed in Subsection 3.3.
These optimizations can be applied in at least the following two ways. First, if the input is a Clifford unitary, we begin by synthesizing a circuit using a "greedy" compiler (described in Subsection 3.1) and then reduce the gate count by our proposed circuit optimization techniques. Second, if the input is already a Clifford circuit, we can either resynthesize it or apply the circuit optimizations directly. The gate count in the final circuit can be further decreased at the cost of increasing the runtime by a constant factor if the circuit is resynthesized k times using a randomized version of the "greedy" compiler, the k circuits are optimized individually, and the best of the k results is picked. Note that the k repetitions can be done in parallel.

"Greedy" Compiler
Suppose $U \in C_n$ is a Clifford unitary to be compiled and $L \in C_n$ is an operator that reproduces the action of U on a single pair of Pauli operators, $x_j$ and $z_j$. In other words, $U P U^{-1} = L P L^{-1}$ for $P \in \{x_j, z_j\}$. The requisite operator L, as well as a Clifford circuit with O(n) cnots implementing L, can be easily constructed for any given qubit j by using the standard stabilizer formalism [3]. Then the operator $L^{-1} U$ acts trivially on the jth qubit and can be considered as an element of the Clifford group $C_{n-1}$. The greedy compiler applies this operation recursively such that each step reduces the number of qubits by one. A qubit j, picked at each recursion step, is chosen such that the operator L has the minimum cnot count. In the randomized version of the algorithm, qubit j is picked randomly. The compiler runs in time $O(n^3)$ and outputs a circuit with the cnot count at most $3n^2/4 + O(n)$. We also developed and employ a bidirectional version of the greedy compiler that follows the same strategy as above except that each recursion step applies a transformation $U \leftarrow L^{-1} U R^{-1}$, where $L, R \in C_n$ are chosen such that after the transformation U acts trivially on the jth qubit and the combined cnot count of L and R is minimized. In Section 4, we use the bidirectional version of the greedy compiler as it leads to lower cnot costs of optimized circuits. We include a detailed description of the greedy compilers in Appendix A.

Template Matching for Clifford Circuits
We extend template matching, described in Subsection 2.1, by introducing a three-stage approach that takes advantage of the observation that Clifford gates map tensor products of Pauli matrices into tensor products of Pauli matrices. Below we describe the features used in the proposed three-stage approach. In Subsection 3.4, we combine this approach with symbolic peephole optimization.
First, we partition the circuit into three stages, "compute," "swap," and "Pauli", by pushing swap and Pauli gates to the end of the circuit. Paulis are "pushed" according to the rules in Eqs. (2)-(5). This step results in the construction of the "compute" stage consisting of h, s, cnot, and cz gates only.
Second, we apply the template matching to the "compute" stage. We further simplify template matching by converting all two-qubit gates into cz gates (at the cost of introducing two Hadamard gates when the cnot is considered) before performing template optimization. Templates are applied as described in Subsection 2.1. The list of templates is given in Fig. 1a-1h.
We reduce the single-qubit gate count and increase the opportunities for template application by introducing Hadamard and Phase gate pushing. Specifically, assuming that a circuit was optimized with templates, the idea is then to "push" Hadamard and Phase gates to one side of the two-qubit gates as far as possible. "Pushing" a gate through a two-qubit gate is implemented as the application of a template where a fixed subsequence must be matched. For example, the rule in Fig. 1i can be used to push a Hadamard to the right of the cnot gate.
Note that once the circuit is optimized in terms of the two-qubit gate count, template matching can be applied to reduce the single-qubit gate count by restricting the set of templates and how they are applied. This can be accomplished by applying templates spanning a single qubit and considering certain applications of templates with an even number of two-qubit gates.
Third, we consider swap gate optimization as a separate problem. The swap optimization is performed by observing that a swap gate can be implemented at the effective cost of one two-qubit gate if it is aligned with a two-qubit gate (cnot or cz).
[Circuit diagram: a swap gate aligned with a two-qubit gate.]
In order to reduce the number of swaps, the swap stage is resynthesized with the goal of aligning as many swaps as possible with the two-qubit gates in the "compute" stage.
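The effective cost claim can be checked for the cnot case with a small classical simulation, since swap and cnot both permute computational basis states. The sketch below is an illustration under that assumption (it does not cover the cz case, which is diagonal and needs a matrix-level check); gates are hypothetical `(name, a, b)` tuples listed in time order.

```python
from itertools import product

def apply_gate(bits, gate):
    """Act with a cnot or swap on a tuple of classical bits."""
    name, a, b = gate
    out = list(bits)
    if name == "cnot":          # a = control, b = target
        out[b] ^= out[a]
    elif name == "swap":
        out[a], out[b] = out[b], out[a]
    else:
        raise ValueError(f"unsupported gate: {name}")
    return tuple(out)

def run(circuit, bits):
    for gate in circuit:        # gates are applied in list order
        bits = apply_gate(bits, gate)
    return bits

def same_action(c1, c2, n):
    """True if both circuits map every n-bit basis state identically."""
    return all(run(c1, b) == run(c2, b) for b in product((0, 1), repeat=n))

swap_as_cnots = [("cnot", 0, 1), ("cnot", 1, 0), ("cnot", 0, 1)]
swap_then_cnot = [("swap", 0, 1), ("cnot", 0, 1)]
merged = [("cnot", 0, 1), ("cnot", 1, 0)]

swap_costs_three = same_action([("swap", 0, 1)], swap_as_cnots, 2)
merge_costs_two = same_action(swap_then_cnot, merged, 2)
```

Both flags are `True`: a standalone swap takes three cnots, whereas a swap adjacent to a cnot on the same pair of qubits takes two cnots in total, that is, one extra two-qubit gate for the swap.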

Symbolic Peephole Optimization
As outlined in Subsection 2.2, various methods were proposed to create a database of optimal few-qubit Clifford circuits; some employ such databases to perform peephole optimization of larger Clifford circuits. However, these methods are limited to few-qubit subcircuits that must be completely decoupled from the remaining qubits. To address this limitation, we introduce a modified approach to Clifford circuit optimization, symbolic peephole optimization.
Consider a circuit $U \in C_n$ and a small subset of qubits $A \subseteq [n]$. Our goal is to meaningfully define and optimize the restriction of U onto A. Let $B = [n] \setminus A$ be the complement of A. We say that a cnot gate is entangling if it couples A and B. Assume without loss of generality that each entangling cnot has its target qubit in the set A (otherwise, switch the control and the target by adding extra Hadamards). Partition entangling cnots into groups such that all cnots in the same group have the same control bit. Let k be the number of groups. Expanding each entangling cnot as $|0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes x$ (projector on the control, Pauli on the target), one gets
$$U = \sum_{v \in \{0,1\}^k} U_A(v) \otimes U_B(v),$$
where $U_A(v)$ is a Clifford circuit obtained from U by retaining all gates acting on A and replacing each entangling cnot from the ith group with the Pauli gate $x^{v_i}$ acting on the target qubit of the respective cnot. Likewise, $U_B(v)$ is a (nonunitary) circuit obtained from U by retaining all gates acting on B and replacing each entangling cnot from the ith group with the projector $|v_i\rangle\langle v_i|$ acting on the control qubit of the respective cnot. We refer to the single-qubit gates $x^{v_i}$, $y^{v_i}$, and $z^{v_i}$ as Symbolic Pauli Gates (SPGs). These are similar to controlled Pauli gates except that the control qubit is replaced by a symbolic variable $v_i \in \{0, 1\}$.
Figure 2 (caption, partially recovered): The subcircuit acting on A is optimized to reduce the number of SPGs. Here we used the commutation rules $\text{h}\, x^v = z^v\, \text{h}$, $x^v z^v = (-iy)^v$, and $\text{s}\, y^v = (-x)^v\, \text{s}$. Yellow arrow: The subcircuits acting on A and B are merged by replacing the SPG $x^v$ with the cnot. The phase factor $i^v$ is replaced with the phase gate s acting on B.
A symbolic Clifford circuit $U_A(v)$ can be optimized as a regular Clifford circuit on |A| qubits with the following caveats. First, $U_A(v)$ must be expressed by using the Clifford+SPG gate set. The cost of $U_A(v)$ should be defined as the number of cnots plus the number of SPGs. Second, the optimization must respect the temporal order of SPGs. In other words, if i<j, then all SPGs controlled by $v_i$ must be applied before SPGs controlled by $v_j$. Third, the optimization must preserve the overall phase of $U_A(v)$ modulo phase factors $(-1)^{v_j}$ or $i^{v_j}$. The phase factors can be generated by single-qubit gates z or s applied to control qubits of the entangling cnots. These conditions guarantee that the optimized circuit $U'_A(v)$ can be lifted to a full circuit $U' \in C_n$ that is functionally equivalent to U. A toy optimization example is shown in Fig. 2.
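The decomposition into symbolic subcircuits can be illustrated on the smallest possible case. The sketch below is a toy example chosen for this purpose, not taken from the paper: B is a single control qubit, A is a single target qubit, and the circuit (in time order) is h on A, cnot, s on A, cnot. Both cnots share the control, so k = 1, and the symbolic subcircuit is $U_A(v) = x^v\, \text{s}\, x^v\, \text{h}$ with $U_B(v) = |v\rangle\langle v|$. Helper names are illustrative, and the tensor factors are ordered (B, A).

```python
from math import sqrt

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Tensor product; the first factor is the more significant qubit."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def product_of(*ops):
    """Operator product ops[0] ops[1] ...; the rightmost factor acts first."""
    out = ops[0]
    for op in ops[1:]:
        out = matmul(out, op)
    return out

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

r = 1 / sqrt(2)
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
H = [[r, r], [r, -r]]
S = [[1, 0], [0, 1j]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control = B

# Full two-qubit circuit: h on A, cnot, s on A, cnot (time order).
full = product_of(CNOT, kron(I2, S), CNOT, kron(I2, H))

def u_a(v):
    """Symbolic subcircuit on A: each entangling cnot becomes x^v."""
    xv = X if v else I2
    return product_of(xv, S, xv, H)

def projector(v):
    """|v><v| on the control qubit; this plays the role of U_B(v)."""
    return [[1 - v, 0], [0, v]]

decomposed = add(kron(projector(0), u_a(0)), kron(projector(1), u_a(1)))
matches = close(full, decomposed)
```

`matches` is `True`, confirming that summing the symbolic subcircuits over v reproduces the original unitary in this example.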
We now describe the optimization of $U_A(v)$ in more detail. Let $P_A$ and $C_A$ be the groups of Pauli and Clifford operators acting on A, respectively. The circuit $U_A(v)$ can be compactly specified by a k-tuple of Pauli operators $P_1, P_2, \ldots, P_k \in P_A$ and a Clifford operator $R \in C_A$ such that
$$U_A(v) = P_k^{v_k} \cdots P_2^{v_2} P_1^{v_1} R$$
for all $v \in \{0, 1\}^k$. Indeed, any SPG can be commuted to the left since Clifford gates map Pauli operators to Pauli operators. The most general Clifford+SPG circuit that implements $U_A(v)$ can be parameterized as
$$U_A(v) = U_k^{-1} Q_k^{v_k} (U_k U_{k-1}^{-1}) \cdots Q_2^{v_2} (U_2 U_1^{-1}) Q_1^{v_1} U_1 R \qquad (7)$$
for some Clifford operators $U_j \in C_A$ and Pauli operators $Q_j = U_j P_j U_j^{-1} \in P_A$. The cost of the circuit in Eq. (7) includes the cnot count of subcircuits $U_j U_{j-1}^{-1}$ and the SPG count of controlled Pauli operators $Q_j^{v_j}$. Note that $Q_j^{v_j}$ is a product of $|Q_j|$ single-qubit SPGs, where $|Q_j|$ is the Hamming weight of $Q_j$. Denoting $U_0 := R^{-1}$, one can express the cost of the circuit in Eq. (7) as
$$f(U_1, \ldots, U_k) = \$(U_k) + \sum_{i=1}^{k} \left[ \$(U_i U_{i-1}^{-1}) + |U_i P_i U_i^{-1}| \right]. \qquad (8)$$
Here $\$(V)$ is the cnot cost of a Clifford operator $V \in C_A$. Our goal is to minimize the cost function f over all k-tuples $U_1, U_2, \ldots, U_k \in C_A$. We claim that the global minimum of f can be computed in time O(k), as long as |A| = O(1). The key observation is that the function f is a sum of terms that depend on at most two consecutive variables $U_i$ and $U_{i-1}$. Such functions can be minimized efficiently using the dynamic programming method; see, for example, [4]. Indeed, define intermediate cost functions $f_1, f_2, \ldots, f_k : C_A \to \mathbb{Z}_+$ such that $f_j$ is obtained from f by removing the term $\$(U_k)$, retaining the first j terms in the sums over i, and taking the minimum over $U_1, U_2, \ldots, U_{j-1}$. More formally,
$$f_1(U_1) = \$(U_1 U_0^{-1}) + |U_1 P_1 U_1^{-1}| \qquad (9)$$
and
$$f_j(U_j) = \min_{U_1, \ldots, U_{j-1} \in C_A} \sum_{i=1}^{j} \left[ \$(U_i U_{i-1}^{-1}) + |U_i P_i U_i^{-1}| \right]$$
for j = 2, 3, . . ., k. Using the induction in j, one can easily check that
$$f_j(U_j) = |U_j P_j U_j^{-1}| + \min_{U_{j-1} \in C_A} \left[ f_{j-1}(U_{j-1}) + \$(U_j U_{j-1}^{-1}) \right] \qquad (10)$$
for j = 2, 3, . . ., k. Below we assume that a lookup table specifying the cnot cost $\$(V)$ for all $V \in C_A$ is available. Then one can compute a lookup table of $f_1$ by iterating over all $U_1 \in C_A$ and evaluating the right-hand side of Eq. (9). Proceeding inductively, one can compute a lookup table of $f_j$ with j = 2, 3, . . ., k by iterating over all $U_j \in C_A$ and evaluating the right-hand side of Eq. (10). Each step takes time roughly $|C_A|^2 = O(1)$ since we assumed that |A| = O(1). Finally, use the identity
$$\min_{U_1, \ldots, U_k \in C_A} f(U_1, \ldots, U_k) = \min_{U_k \in C_A} \left[ \$(U_k) + f_k(U_k) \right] \qquad (11)$$
to compute the global minimum of f. Thus, the full computation takes time O(k).
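The structure of this dynamic program can be sketched independently of the Clifford group. In the toy code below, a small set of integers stands in for $C_A$, and three hypothetical callables stand in for the cost terms: `pair_cost(u, w)` for the cnot cost of $U_j U_{j-1}^{-1}$, `node_cost(j, u)` for the Hamming weight $|U_j P_j U_j^{-1}|$, and `end_cost(u)` for $\$(U_k)$. The numbers are arbitrary; only the recursion (a table for $f_1$, an update for each $f_j$, and a final minimization over $U_k$) reflects the text, and the result is compared against exhaustive search.

```python
from itertools import product

def minimize_chain(states, u0, k, pair_cost, node_cost, end_cost):
    """Minimum of a chain-structured cost over k variables, in O(k |states|^2)."""
    # Table for f_1: depends on the fixed starting point u0 only.
    f = {u: pair_cost(u, u0) + node_cost(1, u) for u in states}
    # Tables for f_2, ..., f_k: each uses only the previous table.
    for j in range(2, k + 1):
        f = {u: node_cost(j, u) + min(f[w] + pair_cost(u, w) for w in states)
             for u in states}
    # Add the term that depends on the last variable and minimize over it.
    return min(f[u] + end_cost(u) for u in states)

def brute_force(states, u0, k, pair_cost, node_cost, end_cost):
    """Reference answer: enumerate all |states|^k assignments."""
    best = None
    for assignment in product(states, repeat=k):
        total = end_cost(assignment[-1])
        prev = u0
        for j, u in enumerate(assignment, start=1):
            total += pair_cost(u, prev) + node_cost(j, u)
            prev = u
        best = total if best is None else min(best, total)
    return best

# Toy instance (all values are placeholders, not Clifford data).
states = list(range(5))
pair_cost = lambda u, w: abs(u - w)
node_cost = lambda j, u: (3 * u + j) % 4
end_cost = lambda u: u % 3

dp_value = minimize_chain(states, 0, 4, pair_cost, node_cost, end_cost)
reference = brute_force(states, 0, 4, pair_cost, node_cost, end_cost)
```

The two values agree, while the dynamic program touches each pair of consecutive variables once instead of enumerating all assignments. Recovering the minimizing tuple would additionally require storing the argmin at each step, which is omitted here.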
To make the above algorithm more practical, we exploited symmetries of the cost function, Eq. (8). Namely, function f is invariant under multiplying $U_j$ on the left by any element of the local subgroup $C_A^0 \subseteq C_A$ generated by the single-qubit gates $\text{h}_a$ and $\text{s}_a$ with $a \in A$. In other words, $f(U_1, U_2, \ldots, U_k)$ depends only on the right cosets $C_A^0 U_j$ of the local subgroup. Thus one can restrict the minimizations in Eqs. (10,11) to some fixed set of coset representatives $\mathcal{R} \subset C_A$ such that each coset $C_A^0 V$ has a unique representative $r(V) \in \mathcal{R}$. We chose r(V) as the left-reduced form of V defined in [10, Lemma 2]. This lemma provides an algorithm for computing r(V) with the runtime $O(|A|^2)$. Now each variable $U_i$ takes only $|\mathcal{R}| = |C_A| / |C_A^0|$ values. For example, $|\mathcal{R}|=20$ and $|\mathcal{R}|=6720$ for |A|=2 and |A|=3, respectively. Likewise, it suffices to compute the lookup table for the cnot cost $\$(V)$ only for $V \in \mathcal{R}$. This computation was performed using the breadth-first search on the Clifford group $C_A$.
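The quoted numbers of coset representatives can be reproduced with a short calculation. The sketch assumes the standard formula $2^{n^2+2n}\prod_{j=1}^{n}(4^j-1)$ for the order of the n-qubit Clifford group modulo global phase, and assumes that the local subgroup is the |A|-fold product of the single-qubit Clifford group; the function names are illustrative.

```python
def clifford_group_order(n):
    """Order of the n-qubit Clifford group modulo global phase."""
    order = 2 ** (n * n + 2 * n)
    for j in range(1, n + 1):
        order *= 4 ** j - 1
    return order

def num_coset_representatives(a):
    """Number of cosets of the local (single-qubit) subgroup for |A| = a."""
    local_order = clifford_group_order(1) ** a   # 24 per qubit
    return clifford_group_order(a) // local_order

sizes = {a: num_coset_representatives(a) for a in (2, 3)}
```

This gives 20 representatives for |A| = 2 and 6720 for |A| = 3, in agreement with the values stated above, so the dynamic programming tables shrink by a factor of $24^{|A|}$ relative to the full group.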
An important open question concerns the selection of the subsets A to be considered. From numerical experiments with |A| ∈ {2, 3}, our most successful strategy turned out to be the random subset selection. Specifically, we generate a list of all $\binom{n}{2}$ pairs and $\binom{n}{3}$ triples of qubits. We run passes of the symbolic peephole method first on pairs of qubits and next on triples of qubits until no further improvement can be obtained. At each pass of the symbolic peephole optimization, we randomly reshuffle both lists and run optimization on all the subsets in the reshuffled order. We continue passes until either the optimal cnot count is reached (for circuits for which the optimal cnot count is known) or there is no improvement between two consecutive passes.
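One possible reading of this pass structure is sketched below. The function `optimize_subset` is a hypothetical stand-in for the symbolic peephole step on a given subset, `cost` is a hypothetical cnot counter, and the "circuit" in the usage example is just an integer so that the loop can be exercised; whether pairs and triples are interleaved within a pass exactly as shown is an assumption.

```python
import random
from itertools import combinations

def symbolic_peephole_passes(circuit, n, optimize_subset, cost,
                             known_optimum=None, seed=0):
    """Repeat shuffled passes over all pairs and triples until no improvement."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))
    triples = list(combinations(range(n), 3))
    while True:
        before = cost(circuit)
        rng.shuffle(pairs)       # fresh random order at every pass
        rng.shuffle(triples)
        for subset in pairs + triples:   # pairs first, then triples
            circuit = optimize_subset(circuit, subset)
        after = cost(circuit)
        reached_optimum = known_optimum is not None and after <= known_optimum
        if after >= before or reached_optimum:
            return circuit

# Toy usage: the "circuit" is its own cnot count, and the stub removes one
# cnot per pair subset until the count reaches 7.
toy_step = lambda c, subset: c - 1 if (len(subset) == 2 and c > 7) else c
result = symbolic_peephole_passes(10, 3, toy_step, cost=lambda c: c)
```

In the toy run the first pass lowers the count from 10 to 7 and the second pass makes no progress, so the loop stops and `result` is 7.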

Full Algorithm
We combine the components described above in the following way. We begin by synthesizing the circuit using the "greedy" compiler described in Subsection 3.1. Then the synthesized circuit is optimized as follows. First, the circuit is partitioned into three stages. Second, template matching and swap gate merging are performed until a pass yields no further optimization. Third, symbolic peephole optimization is performed, as described in Subsection 3.3. Lastly, a single pass of template matching is performed to reduce the single-qubit gate count.
Figure 3: Quality of the solution (a) and the mean running time (b) for 6-qubit circuits with known optimal cnot gate count. Panel (a) shows the ratio of circuits for which the implemented methods recover the optimal cnot count; the "nonsmoothness" of the line for the cnot count of 15 is due to only 3 circuits being considered. Panel (b) shows the mean running time (s). To demonstrate the trade-off between running time and the quality of the solution, we consider 20 time limits between 100 seconds and 15 hours. We observe that for all problems there exists a time limit at which the ratio of the recovered optimal circuits and the mean running time stop increasing. This value depends on the hardness of the circuits; for the hardest circuits (optimal gate count of 12) the metrics are saturated at ≈4 hours. With this time limit we recover the optimal cnot count for 97.9% of the circuits and observe the difference of 0.2% between the average optimal cnot count and the average cnot count recovered by our software. Small deviations from monotonic growth of running time and quality with the time limit are due to the experiments being performed on a heterogeneous computing cluster and the random nature of the algorithm implementation. The mean running time being above the time limit for the time limit of 100 s is due to letting template matching complete even after the time limit is triggered.
Table 1: Optimization results for Hamiltonian evolution circuits. For each graph on $n_q$ qubits, we generate and optimize $t_{\max}$ circuits corresponding to all integer numbers of steps between 1 and $t_{\max} = \min(t_p, 300)$. $C_{\text{orig}}$ is the average cnot gate count in the original circuits, $C_{\text{A-G}}$ is the average cnot gate count of the circuits in Aaronson-Gottesman canonical form [3], $C_{\text{greedy}}$ is the average cnot gate count of the circuits produced by the bidirectional "greedy" compiler, $C_{\text{opt}}$ is the average cnot gate count of the optimized circuits, and $r = (C_{\text{A-G}} - C_{\text{opt}})/C_{\text{A-G}}$ is the improvement in the average cnot gate count over the Aaronson-Gottesman canonical form. For all runs we set the time limit to 36 hours and stop both peephole optimization and template matching when the time limit is reached. We note that the "greedy" compiler by itself (without any further optimization) reduces the cnot gate count by 48.6% compared to [3]. We additionally compare the performance of our methods with the CliffordSimp method of the tket framework [29] applied to the output of our "greedy" compiler (column $C_{\text{tket}}$). As tket ignores the swap gates, we modified our implementation such that once the swaps are factored out in the template matching phase, they are ignored. The resulting cnot gate counts are presented in column $C_{\text{no swap}}$. We observe that our optimizations result in cnot counts that are 6.58% lower on average as compared to tket, with larger improvements (up to 17.8%) observed for harder (deeper) circuits.

Experimental Results
We ran two sets of computational experiments designed to test the performance of our synthesis and optimization algorithms, detailed in the next two subsections. In addition, we compared our results to [29] as well as to 8-qubit t gate free circuits from [14]. The comparison to [29] is detailed in Table 1, and the comparison to [14] reads 24.4529 for our methods (obtained using 10,000 random samples) versus 50+ in [14, Figure 3].

Recovering Optimal CNOT Count for Clifford Unitaries on Six Qubits
First we compare the proposed heuristic methods with the optimal Clifford compiler for n ≤ 6 qubits [10]. The latter uses breadth-first search on the Clifford group to construct a database specifying the optimal cnot gate count of each Clifford operator. As shown in [10], the optimal cnot gate count for 6-qubit Clifford operators takes values 0, 1, ..., 15. We generate 1,003 uniformly sampled random Clifford unitaries with cnot gate counts between 5 and 15. We consider only unitaries with cnot gate count ≥5 because one needs at least 5 cnots to entangle all 6 qubits (connecting n qubits requires at least n − 1 two-qubit gates). For the cnot gate counts from 5 to 14, we consider 100 circuits for each cost value. For the cnot gate count of 15, there are only 3 Clifford circuits (modulo single-qubit Cliffords on the left and on the right and modulo qubit permutations) to consider [10].
For each Clifford unitary, we start by synthesizing it using the bidirectional "greedy" compiler. The optimization is run as described in Subsection 3.4. The circuit is then resynthesized by using the randomized version of the compiler, and the resynthesized circuit is optimized. This process is repeated until the time limit is reached, and the circuit with the lowest cnot count is chosen as the output. Note that we also stop the peephole optimization when the time limit is reached, but we allow template matching to complete. The reason is that template matching is fast compared to peephole optimization, and allowing it to complete pushes the actual running time above the time limit for only 0.66% of the instances considered.
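The resynthesize-and-optimize loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's C++ implementation: `anytime_optimize`, `toy_synthesize`, and `toy_optimize` are hypothetical names, the "circuit" is just a list of cnot labels, and the toy optimizer only cancels adjacent identical cnots in a single pass. Unlike the procedure in the text, the sketch checks the deadline only between rounds.

```python
import random
import time

def anytime_optimize(synthesize, optimize, cost, time_limit_s, seed=0):
    """Keep the cheapest circuit found before the deadline (hypothetical helper)."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_limit_s
    # First attempt: deterministic greedy synthesis followed by optimization.
    best = optimize(synthesize(rng, randomized=False))
    # Later attempts: randomized resynthesis, each followed by optimization.
    while time.monotonic() < deadline:
        candidate = optimize(synthesize(rng, randomized=True))
        if cost(candidate) < cost(best):
            best = candidate
    return best

# Toy stand-ins (assumptions for illustration only): a circuit is a list of
# cnot labels (control, target), and two adjacent identical cnots cancel.
A, B, C = (0, 1), (1, 2), (2, 3)

def toy_synthesize(rng, randomized):
    if not randomized:
        # Fixed, redundant circuit equal to A B C as a gate sequence.
        return [A, B, C, A, B, B, A]
    circuit = [A, B, C]
    for _ in range(3):
        gate = rng.choice([A, B, C])
        pos = rng.randrange(len(circuit) + 1)
        circuit[pos:pos] = [gate, gate]  # inserting g g leaves the unitary unchanged
    return circuit

def toy_optimize(circuit):
    """Single left-to-right pass removing adjacent identical cnots."""
    out, i = [], 0
    while i < len(circuit):
        if i + 1 < len(circuit) and circuit[i] == circuit[i + 1]:
            i += 2
        else:
            out.append(circuit[i])
            i += 1
    return out

best = anytime_optimize(toy_synthesize, toy_optimize, len, time_limit_s=0.05)
print(len(best))
```

Because the best circuit is replaced only by a strictly cheaper one, the returned cost never increases with the time limit, which is the trade-off between quality and running time that the loop is meant to expose.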
The quality of the solution obtained by the implemented methods as a function of the time limit is shown in Fig. 3a. Our algorithm converges before exhausting the time limit on most instances. Fig. 3b shows the actual observed mean running time as a function of the time limit. We note that the combination of the iterative nature of symbolic peephole optimization and the randomized resynthesis allows the user to trade off the quality of the optimization and the running time as desired.

Circuits for Hamiltonian Evolution
To evaluate the performance of the proposed methods on circuits with n > 6 qubits, we consider a toy model of Hamiltonian time evolution. Suppose G = (V, E) is a fixed graph with n vertices. We place a qubit at each vertex of G. Define a Hamiltonian evolution circuit with time t as t repetitions of a step consisting of a layer of Hadamard gates on all qubits and a layer of cz gates on all edges of G. The layers of Hadamard and cz gates model time evolution under an external magnetic field and nearest neighbor two-qubit interactions, respectively.

We consider several choices for the interaction graph. First, we take instances of the path and cycle graphs with the number of qubits n ∈ {5, 15, 25, 35, 45, 55}. Second, we include all three regular plane tessellations (by triangles, squares, and hexagons). We choose the numbers of vertices between 6 and 64 such that the convex hull spanned by the centers of mass of individual tiles in the gapless regular tiling is congruent to the basic tile. Third, we consider a heavy hexagon grid, obtained from the hexagonal tessellation by adding a node in the middle of each edge. This set includes some of the frequently appearing qubit-to-qubit connectivities/architectures.

We consider the number of layers 1 ≤ t ≤ t_max = min(t_p, 300), where t_p is the period such that the Hamiltonian evolution with t_p layers produces the identity transformation. For each interaction graph G we compute the cnot gate count of optimized circuits averaged over the number of layers t = 1, 2, ..., t_max. The total number of circuits considered is 2,264. We set the time limit to 36 hours and stop both peephole optimization and template matching when the time limit is reached, only allowing the current pass of template matching to complete. Allowing the current pass to complete causes only a small fraction of problems (4.5%) to exceed the time limit significantly (by ≥10%). The results are reported in Table 1. The maximum graph size in these experiments is n = 64 because we represent n-qubit Pauli operators by a pair of 64-bit integers in our C++ implementation; this limitation can be easily removed by revising the data structure.
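To illustrate why a pair of 64-bit integers suffices for Pauli operators on up to 64 qubits, the sketch below encodes a Pauli string as an x-mask and a z-mask and tests anti-commutation with a parity of set bits. It is a minimal Python illustration, not the paper's C++ data structure: the names `encode_pauli`, `weight`, and `anticommute` are hypothetical, and signs and phases are ignored.

```python
MAX_QUBITS = 64  # mirrors the 64-bit words mentioned in the text (assumed layout)

def encode_pauli(label):
    """Encode a string such as 'xizy' as (x_bits, z_bits); qubit k uses bit k."""
    if len(label) > MAX_QUBITS:
        raise ValueError("at most 64 qubits fit into a pair of 64-bit words")
    x_bits = z_bits = 0
    for qubit, letter in enumerate(label.lower()):
        if letter not in "ixyz":
            raise ValueError(f"unknown Pauli letter: {letter!r}")
        if letter in "xy":   # x and y both have an x component
            x_bits |= 1 << qubit
        if letter in "yz":   # y and z both have a z component
            z_bits |= 1 << qubit
    return x_bits, z_bits

def popcount(word):
    return bin(word).count("1")

def weight(pauli):
    """Number of qubits on which the Pauli acts nontrivially."""
    x_bits, z_bits = pauli
    return popcount(x_bits | z_bits)

def anticommute(p, q):
    """True if the two Paulis anti-commute (odd symplectic inner product)."""
    return popcount((p[0] & q[1]) ^ (p[1] & q[0])) % 2 == 1

print(anticommute(encode_pauli("xxx"), encode_pauli("zzz")))
```

With this layout, commutation checks and weights reduce to a few bitwise operations per pair, independent of n up to the word size; supporting more qubits would only require replacing each word by an array of words.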

Conclusion
We reported a bidirectional synthesis approach and two circuit optimization techniques that extend known approaches to Clifford circuits by exploiting the unique properties of the Clifford group. We demonstrate the effectiveness of these methods by recovering the optimal cnot gate count for 98.9% of 6-qubit circuits (over 1,003 samples) and by reducing the cnot gate count by 64.7% on average for Hamiltonian evolution circuits with up to 64 qubits (2,264 circuits considered), compared to the Aaronson-Gottesman canonical form [3]. We show evidence of an improvement in the gate count by a factor of 2 compared to other techniques, such as [14].
Recall that the single-qubit Clifford group C_1 acts by permutations on the Pauli operators x, y, and z. Below we use the notation $(O, O′) for the cnot gate count of the disentangling circuit constructed in Algorithm 1.
Our implementation of the greedy compiler optimizes the order in which the qubits are disentangled. Namely, suppose that at some step j of the compiler a subset of qubits S_j has been disentangled such that C = L_1 L_2 ⋯ L_j C_j, where C_j acts trivially on S_j. Let p ∉ S_j be a qubit with the smallest disentangling cost. Let L be the circuit disentangling C_j x_p C_j^{-1} and C_j z_p C_j^{-1}.

The greedy synthesizer described above has the runtime O(n^3). Indeed, consider the first step of the synthesis. Since the disentangling cost $(O, O′) can be computed in time O(n), picking a qubit with the smallest disentangling cost takes time O(n^2). Computing the disentangling circuit L and the product L^{-1}C takes time O(n^2) since L contains O(n) gates and the action of a single gate can be simulated in time O(n) using the stabilizer formalism [3]. Each of the n steps thus takes time O(n^2), and the full runtime of the greedy synthesizer is O(n^3).
The bidirectional greedy synthesizer sequentially constructs Clifford circuits L_1, L_2, ..., L_n and R_1, R_2, ..., R_n such that C = L_1 L_2 ⋯ L_j C_j R_j ⋯ R_2 R_1 for each j, where C_j acts trivially on the first j qubits. This gives a circuit implementing C with the cnot cost at most the sum of $(L_j) + $(R_j) over all steps j.

By definition, R^{-1} is a disentangler for the pair (P, P′). Simple algebra shows that C_1 commutes with the Pauli operators x_1 and z_1 if and only if L is a disentangler for the pair (O, O′). The above shows that minimizing the combined cost $(L) + $(R) subject to the constraint that C_1 = L^{-1} C R^{-1} acts trivially on the first qubit is equivalent to minimizing the function f(P, P′) = $(P, P′) + $(C P C^{-1}, C P′ C^{-1}) over all pairs of n-qubit anti-commuting Pauli operators P and P′. Note that f(P, P′) can be computed in time O(n) for a given pair (P, P′). Once the optimal pair (P, P′) is found, one chooses L and R^{-1} as disentanglers for the Pauli pairs (O, O′) and (P, P′), respectively.

Since the total number of n-qubit anti-commuting Pauli pairs grows exponentially with n, the global minimum of f(P, P′) cannot be computed exactly for large n. To make the problem tractable, we restricted the minimization to Pauli operators P, P′ with weight at most two. The number of such pairs (P, P′) is O(n^3) since the anti-commutativity condition implies that the supports of P and P′ must overlap on at least one qubit. Now the minimum of f(P, P′) can be computed in time O(n^4), since each of the O(n^3) pairs is evaluated in time O(n), and thus the full runtime of the compiler over n steps is O(n^5). Note that the unidirectional greedy compiler described earlier corresponds to R = I, that is, P = x_1 and P′ = z_1. Thus the bidirectional compiler subsumes the unidirectional one, even with the restricted minimization domain.
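The restricted search domain can be made concrete with a brute-force enumeration. The sketch below is an illustration only, with hypothetical function names and a bitmask encoding chosen for convenience. It lists all Pauli operators of weight one or two (signs ignored), keeps the ordered anti-commuting pairs, and checks that every such pair overlaps on at least one qubit, which is the fact behind the O(n^3) count.

```python
from itertools import combinations, product

def low_weight_paulis(n, max_weight=2):
    """All n-qubit Paulis with weight 1..max_weight, as (x_bits, z_bits) masks."""
    paulis = []
    for w in range(1, max_weight + 1):
        for support in combinations(range(n), w):
            for letters in product("xyz", repeat=w):
                x_bits = z_bits = 0
                for qubit, letter in zip(support, letters):
                    if letter in "xy":
                        x_bits |= 1 << qubit
                    if letter in "yz":
                        z_bits |= 1 << qubit
                paulis.append((x_bits, z_bits))
    return paulis

def anticommute(p, q):
    """Odd symplectic inner product means the Paulis anti-commute."""
    return bin((p[0] & q[1]) ^ (p[1] & q[0])).count("1") % 2 == 1

def support(p):
    return p[0] | p[1]

def anticommuting_pairs(n, max_weight=2):
    """Ordered pairs (P, P') of low-weight Paulis that anti-commute."""
    paulis = low_weight_paulis(n, max_weight)
    return [(p, q) for p in paulis for q in paulis if anticommute(p, q)]

for n in range(1, 6):
    pairs = anticommuting_pairs(n)
    # Anti-commuting Paulis must share at least one qubit in their supports.
    assert all(support(p) & support(q) for p, q in pairs)
    print(n, len(pairs))
```

In this enumeration there are at most 4.5n^2 choices of P and, for each, at most 18n overlapping choices of P′, so the number of candidate pairs is bounded by 81n^3. The sketch scans all pairs of low-weight Paulis for simplicity; an implementation aiming at the stated runtime would generate only overlapping pairs.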

Disclaimer
This paper was prepared for information purposes with contributions from the Future Lab for Applied Research and Engineering (FLARE) Group of JPMorgan Chase & Co. and its affiliates, and is not a product of the Research Department of JPMorgan Chase & Co. JPMorgan Chase & Co. makes no explicit or implied representation and warranty, and accepts no liability, for the completeness, accuracy or reliability of information, or the legal, compliance, tax or accounting effects of matters contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction.
program under contract number DE-AC02-06CH11357. Clemson University is acknowledged for generous allotment of compute time on the Palmetto cluster. We gratefully acknowledge the computing resources provided on Bebop, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. SB is partially supported by the IBM Research Frontiers Institute.

Figure 1: Templates (a)-(h) are used for template matching. The rewriting rule (i) is used for Hadamard gate pushing.

Figure 2: Example of symbolic peephole optimization. Purple arrow: each entangling cnot gate is replaced with a symbolic Pauli gate (SPG) x^v, where v ∈ {0, 1}. Now the subcircuit acting on A is isolated from the remaining qubits. Green arrow: The subcircuit acting on A is optimized to reduce the number of SPGs. Here we used the commutation rules h x^v = z^v h, x^v z^v = (−iy)^v, and s y^v = (−x)^v s. Yellow arrow: The subcircuits acting on A and B are merged by replacing the SPG x^v with the cnot. The phase factor i^v is replaced with the phase gate s acting on B.
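The three commutation rules quoted in the caption can be checked numerically for both values of the symbolic exponent v, with the third rule read as s y^v = (−x)^v s. The sketch below is a plain-Python verification with 2×2 complex matrices; the helper names are hypothetical and are not part of the paper's software.

```python
import math

I2 = ((1, 0), (0, 1))
X = ((0, 1), (1, 0))
Y = ((0, -1j), (1j, 0))
Z = ((1, 0), (0, -1))
S = ((1, 0), (0, 1j))              # phase gate
R = 1 / math.sqrt(2)
H = ((R, R), (R, -R))              # Hadamard gate

def mul(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def scale(c, a):
    return tuple(tuple(c * entry for entry in row) for row in a)

def power(a, v):
    """a^v for the symbolic exponent v in {0, 1}."""
    return a if v == 1 else I2

def close(a, b, tol=1e-12):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))

def spg_rules_hold():
    for v in (0, 1):
        rule_h = close(mul(H, power(X, v)), mul(power(Z, v), H))
        rule_xz = close(mul(power(X, v), power(Z, v)), power(scale(-1j, Y), v))
        rule_s = close(mul(S, power(Y, v)), mul(power(scale(-1, X), v), S))
        if not (rule_h and rule_xz and rule_s):
            return False
    return True

print(spg_rules_hold())
```

For v = 0 every rule reduces to a trivial identity, so the content of the check is the v = 1 case: hx = zh, xz = −iy, and sy = −xs.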
Figure 3(b): Mean running time of the implemented methods. Note that the actual running times are significantly lower than the time limit (black line) for most instances.

cnot_{A(1),A(2j+1)} (perform simultaneous mapping xxx → xii and zzz → zii)
end for

Let L be the operator realized by the above circuit combined with the initial layer of single-qubit Cliffords. Direct inspection shows that L has the desired property, Eq. (12), up to sign factors. The latter can be fixed by applying Pauli x_1 or y_1 or z_1 as the first gate of L. Simple algebra shows that L has the cnot gate count of at most (3/2)|A| + |B| + |C| + |D| + O(1) ≤ 3n/2 + O(1).

With L disentangling C_j x_p C_j^{-1} and C_j z_p C_j^{-1}, set S_{j+1} = S_j ∪ {p} and L_{j+1} = L · SWAP_{1,p}. Then C = L_1 L_2 ⋯ L_{j+1} C_{j+1}, where C_{j+1} acts trivially on S_{j+1}. Thus one can proceed inductively. The extra swap gates are isolated and incorporated back into the compiled circuit as described in Subsection 3.2.
The total cnot cost of the bidirectional synthesis is at most the sum over all steps j of $(L_j) + $(R_j). Consider the first step, j = 1. Let us construct the circuits L = L_1 and R = R_1 (all subsequent steps are analogous). Our goal is to minimize the combined cost $(L) + $(R) subject to the constraint that C_1 = L^{-1} C R^{-1} acts trivially on the first qubit. Equivalently, C_1 should commute with the Pauli operators x_1 and z_1. Define n-qubit Pauli operators P := R^{-1} x_1 R, P′ := R^{-1} z_1 R, O := C P C^{-1}, and O′ := C P′ C^{-1}.
Thus one can transform any Pauli pair (O, O′) into the standard form by applying a layer of single-qubit Clifford operators. This gives rise to a partition of n qubits into five disjoint subsets A, B, C, D, and E. Note that A has odd size since otherwise O and O′ would commute. Let A(j) be the j-th qubit of A. Consider the following circuit.