Exponentially faster implementations of Select(H) for fermionic Hamiltonians

We present a simple but general framework for constructing quantum circuits that implement the multiply-controlled unitary $\text{Select}(H) \equiv \sum_\ell |\ell\rangle\langle\ell|\otimes H_\ell$, where $H = \sum_\ell H_\ell$ is the Jordan-Wigner transform of an arbitrary second-quantised fermionic Hamiltonian. $\text{Select}(H)$ is one of the main subroutines of several quantum algorithms, including state-of-the-art techniques for Hamiltonian simulation. If each term in the second-quantised Hamiltonian involves at most $k$ spin-orbitals and $k$ is a constant independent of the total number of spin-orbitals $n$ (as is the case for the majority of quantum chemistry and condensed matter models considered in the literature, for which $k$ is typically 2 or 4), our implementation of $\text{Select}(H)$ requires no ancilla qubits and uses $\mathcal{O}(n)$ Clifford+T gates, with the Clifford gates applied in $\mathcal{O}(\log^2 n)$ layers and the $T$ gates in $O(\log n)$ layers. This achieves an exponential improvement in both Clifford- and T-depth over previous work, while maintaining linear gate count and reducing the number of ancillae to zero.


Introduction
Quantum computers have the potential to efficiently simulate quantum systems. A particularly promising application of both near-term and fault-tolerant architectures is solving problems in quantum chemistry and materials science. In recent years, significant advances have been made on this front; for reviews of the major algorithmic developments, we refer the reader to Refs. [1][2][3].
Much of the current research in quantum simulation is concerned with the estimation of Hamiltonian spectra and preparation of energy eigenstates, which can provide insight into various properties of molecules and materials. As shown by Ref. [4], the quantum phase estimation algorithm [5,6] can be used to perform projective energy measurements, collapsing the system into a desired eigenstate with high probability if the initial state has appreciable overlap with that eigenstate. Even in the absence of a suitable initial approximation, such measurements may be applied to prepare an eigenstate by exploiting the quantum Zeno effect [7,8]. Alternatively, approximate eigenstates may be obtained via adiabatic state preparation, given sufficient information about the gap(s) in the spectrum of the interpolating Hamiltonian [9,10].
Several techniques are useful for realising these schemes on a gate-based quantum computer. For instance, the qubitisation procedure of Refs. [11,12] can implement the time-evolution operator $\exp(-iHt)$ or, more directly, a walk operator corresponding to $\exp[-i\arccos(Ht)]$, for a time-independent Hamiltonian $H$ and some $t \in \mathbb{R}$. Either of these operators can be used as the unitary input to phase estimation for the purpose of approximating eigenvalues and eigenstates of $H$ [4,8,13,14]. Adiabatic evolution can be digitally simulated by applying the truncated Dyson series algorithm of Refs. [15,16] for time-dependent Hamiltonian simulation, or by using the method of Ref. [17], which is based on quasi-adiabatic continuation [18]. Approximate ground states can also be prepared using the methods of Refs. [19,20]. All of these techniques are formulated in terms of queries to unitary oracles that encode the relevant Hamiltonian(s) in some form. One such encoding is the "linear combination of unitaries" (LCU) query model, motivated by the algorithms of Refs. [21,22]. In this model, the input Hamiltonian is decomposed as

$$H = \sum_{\ell=0}^{L-1} \alpha_\ell H_\ell,$$

where each $H_\ell$ is a time-independent unitary and the (possibly time-dependent) coefficients $\alpha_\ell$ are real and nonnegative. Information about the Hamiltonian is accessed via two oracles: Select(H) and Prepare(α), which respectively encode the unitaries $H_\ell$ and the coefficients $\alpha_\ell$. Specifically,

$$\text{Select}(H) := \sum_{\ell=0}^{L-1} |\ell\rangle\langle\ell| \otimes H_\ell \tag{1}$$

is a multiply-controlled operation that applies the unitary $H_\ell$ to the target register conditioned on the control register being in the state $|\ell\rangle$, and Prepare(α) is some unitary that transforms the all-zeros state of the control register as

$$\text{Prepare}(\alpha)\,|0\rangle = \sum_{\ell=0}^{L-1} \sqrt{\frac{\alpha_\ell}{\alpha}}\,|\ell\rangle,$$

where $\alpha := \sum_{\ell=0}^{L-1} \alpha_\ell$. (In the case where the coefficients are time-dependent, Prepare(α) may be controlled on an additional register that encodes time [16].) While any operator can in principle be written as a linear combination of unitaries, some Hamiltonians are more naturally expressed in this framework.
Since the complexities of the aforementioned algorithms are typically dominated by that of Select(H) and Prepare(α), it is important to design time- and space-efficient circuits for these oracles. The purpose of this paper is to provide an efficient construction for Select(H) in the case where $H$ is obtained from a fermionic Hamiltonian via the Jordan-Wigner transformation [23], so that each $H_\ell$ is a tensor product of Pauli operators. Although our method is applicable to arbitrary fermionic Hamiltonians, it is worth noting that in many of the models considered in practice, each site interacts with only a small number of other sites. More precisely, for a fermionic Hamiltonian given in its second-quantised representation, let $k$ denote the maximum number of distinct spin-orbitals that appear in each term. For most Hamiltonians of physical interest, such as the commonly studied molecular electronic structure Hamiltonian and the Fermi-Hubbard model, $k$ does not scale with the system size. Our contribution can be stated as follows.
Main result: For any fermionic Hamiltonian for which $k$ is a constant independent of the number of spin-orbitals $n$, we can construct a circuit for Select(H) using zero ancilla qubits and $O(n)$ Clifford and T gates, with the Clifford gates performed in $O(\log^2 n)$ layers and the T gates in $O(\log n)$ layers.
This constitutes an exponential reduction in Clifford- and T-depth compared to existing methods. The approach of Ref. [24] can be applied to arbitrary LCU inputs but requires $O(L)$ Clifford and T gates and $O(L)$ Clifford- and T-depth, and in general $L \in O(n^k)$ for the type of Hamiltonians considered here. Ref. [14] improves the gate count and depth (for both Clifford and T gates) to $O(n)$ for two specific $k = 2$ Hamiltonians. Like Ref. [14], we obtain a speedup by exploiting the structure of the Jordan-Wigner encoding. However, our circuits are completely different in structure from those in Ref. [14], which cannot be parallelised to sublinear depth in any straightforward way. Moreover, our implementation uses no ancilla qubits, in contrast to the $\sim \log n$ required by Refs. [14] and [24].
Our construction can be directly applied to asymptotically improve the circuit depth of existing fermionic simulation algorithms that are bottlenecked by Select(H). In Ref. [14], for example, the complexity of simulating the planar Fermi-Hubbard model is dominated by that of Select(H), while Prepare(α) is extremely easy to implement as there are only three unique coefficients in the Hamiltonian. By using our circuit for Select(H), the overall circuit depth of estimating energies via phase estimation to absolute error at most $\epsilon$ can be immediately reduced from $\tilde{O}(\alpha n/\epsilon)$ to $\tilde{O}(\alpha/\epsilon)$ in Theorem 2 of Ref. [14] [cf. Eq. (27) therein], where $\tilde{O}$ hides logarithmic factors. Similarly, in Ref. [25], the overall depth of approximating the time-evolution operator $e^{-iHt}$ for the $k = 4$ Sachdev-Ye-Kitaev model with $n$ Majorana modes can be reduced from $\tilde{O}(n^{3.5} t)$ to $\tilde{O}(n^{2.5} t)$.
In addition to the exponentially reduced circuit depth and minimal space overhead, an advantage of our construction lies in its simplicity and broad applicability. The circuits consist of very few different components, and take exactly the same form for all Hamiltonians with the same k (though straightforward optimisations can be made if the class of input Hamiltonians is further restricted). The bulk of the gate complexity is due to a single gadget, composed entirely of controlled-Swap and cnot gates. Thus, while the use of circuit depth as a complexity measure is mainly justified by long-term considerations (of prospective architectures in which many fault-tolerant gates can be executed in parallel), the simple structure of our circuits potentially makes them amenable to near-term implementation.

Circuit construction
In this section, we prove our main result. We begin in subsection 2.1 by developing the circuit for Select(H) for a particular class of fermionic Hamiltonians, to illustrate the main idea. It will then become obvious how circuits for arbitrary fermionic Hamiltonians can be built, as we discuss in subsection 2.3, and that these circuits have linear gate count and polylogarithmic depth provided that $k \in O(1)$. We also describe, in subsection 2.2, a simple way to substantially reduce the constant factors in the scaling of the T-count and T-depth.
Conventions. Unsurprisingly, circuit diagrams are an essential part of this paper. We will use the following convention for representing operators that are controlled in a nontrivial manner on one or more qubits. Such an operator will be depicted by drawing a small solid square on the control register, connected to a box on the target register that contains the name of the operator or an abbreviation thereof. For instance, Select(H) will be represented by

[Circuit diagram: Select(H), with a solid square on the control register connected to a box on the target register]

To clearly distinguish different registers, we will often label a control register using a computational basis state, and the target register using an arbitrary state $|\psi\rangle$. As an example, since $\ell$ is used to index the computational basis states of the control register of Select(H) in Eq. (1), we may add the labels $|\ell\rangle$ and $|\psi\rangle$ to the above circuit representation of Select(H):

[Circuit diagram: Select(H), with the control register labelled $|\ell\rangle$ and the target register labelled $|\psi\rangle$]

(Note that if a circuit identity holds for any computational basis state on the control register and any arbitrary state on the target register, it holds for all input states.) We will refer to the control register of Select(H) as the "selection register" and the target register as the "system register."

Main idea
Our method is most easily explained by first considering quadratic fermionic Hamiltonians that consist only of terms involving two distinct spin-orbitals. The most general form of such a Hamiltonian in a second-quantised representation is

$$\hat{H} = \sum_{p<q} \left( t_{pq}\, a_p^\dagger a_q + \Delta_{pq}\, a_p^\dagger a_q^\dagger + \text{h.c.} \right),$$

where $a_p^\dagger$ and $a_p$ are fermionic creation and annihilation operators associated with spin-orbital $p$, and $t_{pq}, \Delta_{pq} \in \mathbb{C}$. For a system of $n$ spin-orbitals, we label the spin-orbitals from $0$ to $n-1$ in accordance with the canonical ordering chosen for the Jordan-Wigner transformation. Under this transformation, the fermionic operators are mapped to Pauli operators on $n$ qubits as

$$a_p \to \left(\prod_{j=0}^{p-1} Z_j\right) \frac{X_p + iY_p}{2}, \qquad a_p^\dagger \to \left(\prod_{j=0}^{p-1} Z_j\right) \frac{X_p - iY_p}{2}.$$

Here and throughout the paper, $Z_p$ denotes the $n$-qubit operator that acts as $Z$ on qubit $p$ and as the identity on the rest of the qubits, and similarly for $X$ and $Y$, while $Z_{p,q} := \prod_{j=p+1}^{q-1} Z_j$ denotes a string of $Z$ operators on all of the qubits between $p$ and $q$ (exclusive). Thus, the Jordan-Wigner transform $H$ of $\hat{H}$ is a linear combination with real coefficients of operators that all have the form

$$(P_1)_p\, Z_{p,q}\, (P_2)_q, \qquad P_1, P_2 \in \{X, Y\}.$$

Absorbing the signs of the coefficients into $P_1$, we can write

$$H = \sum_{p,q=0}^{n-1}\ \sum_{P_1 \in \{\pm X, \pm Y\}}\ \sum_{P_2 \in \{X, Y\}} \alpha_{p,q,P_1,P_2}\, (P_1)_p\, Z_{p,q}\, (P_2)_q,$$

where the coefficients $\alpha_{p,q,P_1,P_2}$ are all nonnegative and $\alpha_{p,q,P_1,P_2} = 0$ for $p \geq q$. Clearly, $H$ is a linear combination of unitaries, and each of the unitaries is completely specified by the two spin-orbitals $p$ and $q$ and the Pauli operators $P_1$ and $P_2$. Accordingly, we allocate $2\lceil\log n\rceil + 3$ qubits to the selection register, and encode each computational basis state $|\ell\rangle$ of the selection register as $|\ell\rangle \equiv |p\rangle|q\rangle|P_1\rangle|P_2\rangle$. The first two subregisters each contain $\lceil\log n\rceil$ qubits and store the binary representations of $p, q \in \{0, \dots, n-1\}$. The third and fourth subregisters, which have two qubits and one qubit, respectively, specify $P_1 \in \{\pm X, \pm Y\}$ and $P_2 \in \{X, Y\}$. By Eq. (1), Select(H) can then be defined by its action on computational basis states in the selection register (and an arbitrary state $|\psi\rangle$ in the system register) as

$$\text{Select}(H)\, |p\rangle|q\rangle|P_1\rangle|P_2\rangle|\psi\rangle = |p\rangle|q\rangle|P_1\rangle|P_2\rangle \otimes (P_1)_p\, Z_{p,q}\, (P_2)_q\, |\psi\rangle \tag{4}$$

for $p, q \in \{0, \dots, n-1\}$.
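As a small illustration of this encoding, the following sketch builds the unsigned Pauli string $(P_1)_p Z_{p,q} (P_2)_q$ as a label over I, X, Y, Z and computes the selection register size $2\lceil\log n\rceil + 3$. The function names `pauli_string` and `selection_register_size` are hypothetical and introduced only for this example; the sign that the text absorbs into $P_1$ is not modelled.

```python
def pauli_string(n, p, q, P1, P2):
    """Return (P1)_p Z_{p,q} (P2)_q as a length-n string over I, X, Y, Z."""
    if not 0 <= p < q < n:
        raise ValueError("require 0 <= p < q < n")
    if P1 not in ("X", "Y") or P2 not in ("X", "Y"):
        raise ValueError("P1 and P2 must each be 'X' or 'Y'")
    ops = ["I"] * n
    ops[p] = P1
    for j in range(p + 1, q):  # Z on every qubit strictly between p and q
        ops[j] = "Z"
    ops[q] = P2
    return "".join(ops)


def selection_register_size(n):
    """2*ceil(log2 n) qubits for |p>|q>, plus 2 for |P1> and 1 for |P2>."""
    ceil_log2 = (n - 1).bit_length()  # exact ceil(log2 n) for n >= 1
    return 2 * ceil_log2 + 3


print(pauli_string(6, 1, 4, "X", "Y"))  # IXZZYI
print(selection_register_size(6))       # 9
```

Qubit 0 is the leftmost character of the label, matching the convention that spin-orbital 0 sits on the top wire.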
(The action of Select(H) on basis states for which $p$ and/or $q$ are out of range is unimportant, as Prepare(α)$|0\rangle$ has no support on such states.) To construct a circuit that implements Eq. (4), our starting point is the following circuit identity:

[Circuit identity: Eq. (5)]

which is an immediate consequence of the elementary identities

[Circuit diagrams: elementary identities]

and the fact that cnot is self-inverse. The analogue of Eq. (5) for an arbitrary number of qubits and with the two $Z$ operators on a different pair of qubits is obvious. Letting $Q_{P_1}$ denote the Pauli operator such that

[Defining relation for $Q_{P_1}$ and circuit identity: Eq. (6)]

Therefore, if the $n$ qubits in the system register are ordered such that the qubit corresponding to spin-orbital $0$ is on the top wire and the qubit corresponding to spin-orbital $n-1$ is on the bottom, the circuit on the left-hand side would implement the term $(P_1)_p Z_{p,q} (P_2)_q |\psi\rangle$ for a particular $p, q, P_1, P_2$. From here, we would obtain a circuit for Select(H) if we were to control the $Q_{P_1}$, $P_2$, and $Z$ operators in the circuit of Eq. (6) on the selection register such that (1) the states $|P_1\rangle$ and $|P_2\rangle$ of the third and fourth selection subregisters determine which Pauli operators $Q_{P_1}$ and $P_2$ represent, and (2) conditioned on the first two selection subregisters being in the state $|p\rangle|q\rangle$, $Q_{P_1}$ and one of the $Z$ operators are applied to qubit $p$ of the system register, while $P_2$ and the other $Z$ operator are applied to qubit $q$.
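The mechanism behind this kind of identity can be checked classically for small $n$. The sketch below is an illustration under an assumed orientation of the cnot ladder, not a reproduction of the paper's diagram: it takes a cascade of $n-1$ cnots (control $j+1$, target $j$, applied from the bottom wire upwards), which maps each basis state to its suffix parities. Conjugating $Z_p Z_q$ by this cascade then gives a contiguous string of $Z$ operators, here on qubits $p$ through $q-1$. All function names are hypothetical.

```python
from itertools import product


def ladder(bits):
    """Cascade of n-1 cnots; output bit j is x_j ^ x_{j+1} ^ ... ^ x_{n-1}."""
    out = list(bits)
    for j in range(len(out) - 2, -1, -1):
        out[j] ^= out[j + 1]  # cnot with control j+1 and target j
    return out


def conjugated_phase(x, p, q):
    """Phase that Ladder^dagger Z_p Z_q Ladder applies to basis state |x>."""
    s = ladder(x)
    return (-1) ** (s[p] ^ s[q])


def z_string_phase(x, p, q):
    """Phase that Z_p Z_{p+1} ... Z_{q-1} applies to basis state |x>."""
    return (-1) ** (sum(x[p:q]) % 2)


def identity_holds(n):
    """Check the identity on every basis state and every pair p < q."""
    return all(
        conjugated_phase(x, p, q) == z_string_phase(x, p, q)
        for x in product((0, 1), repeat=n)
        for p in range(n)
        for q in range(p + 1, n)
    )


print(identity_holds(5))  # True
```

Because the ladder only permutes computational basis states and the $Z$ operators are diagonal, comparing phases on basis states is enough to establish the operator identity.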
Condition (1) can be straightforwardly satisfied by constructing Select(Q) and Select(P) operators that choose the appropriate $Q_{P_1}$ and $P_2$ according to the states $|P_1\rangle$ and $|P_2\rangle$. For concreteness, suppose that $|P_1\rangle = |00\rangle, |01\rangle, |10\rangle, |11\rangle$ for $P_1 = X, -X, Y, -Y$, respectively, and $|P_2\rangle = |0\rangle, |1\rangle$ for $P_2 = X, Y$, respectively. Then, Select(Q) and Select(P) can be implemented as

[Circuit diagrams: Eq. (7), Select(Q) and Select(P)]

Condition (2) requires the ability to target a particular qubit in the system register depending on the states of the selection subregisters that encode $|p\rangle$ and $|q\rangle$. For this purpose, we define for any single-qubit unitary $U$ a $(\lceil\log n\rceil + n)$-qubit operator Inject(U). When applied to $|x\rangle|\psi\rangle$, where $x \in \{0, \dots, n-1\}$ (encoded in binary) and $|\psi\rangle$ is an arbitrary $n$-qubit state, Inject(U) implements $U$ on qubit $x$ of $|\psi\rangle$ and acts as the identity on the other qubits, i.e.,

$$\text{Inject}(U)\, |x\rangle|\psi\rangle = |x\rangle\, U_x |\psi\rangle.$$

To synthesise Inject(U) for any $U$, we use a $(\lceil\log n\rceil + n)$-qubit operator SwapUp, defined as follows. For any $x \in \{0, \dots, n-1\}$ and $n$-qubit product state $\bigotimes_{y=0}^{n-1} |\phi_y\rangle_y$,

$$\text{SwapUp}\, |x\rangle \bigotimes_{y=0}^{n-1} |\phi_y\rangle_y = |x\rangle \bigotimes_{y=0}^{n-1} |\phi_{\sigma(y)}\rangle_y,$$

where $\sigma$ is some permutation of $\{0, \dots, n-1\}$ such that $\sigma(0) = x$. In other words, conditioned on the $\lceil\log n\rceil$-qubit control register being in the state $|x\rangle$ for $x \in \{0, \dots, n-1\}$, SwapUp moves the state $|\phi_x\rangle$ of the qubit indexed by $x$ in the target register up to the first qubit of the target register, and permutes the states of the other qubits in some way. Ref. [26] shows that SwapUp can be implemented without ancilla qubits using $O(n)$ Clifford and T gates, $O(\log^2 n)$ Clifford-depth, and $O(\log n)$ T-depth. We sketch the construction in Appendix A.1. From Eq. (8), it is easy to see that $\text{Inject}(U) = \text{SwapUp}^\dagger\, (U \otimes I^{\otimes(n-1)})\, \text{SwapUp}$. SwapUp permutes the qubits in the target register in such a way that the state of qubit $x$ is moved up to the first qubit, $U$ is then applied to the first qubit, and SwapUp$^\dagger$ undoes the permutation.
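The conjugation pattern Inject(U) = SwapUp† (U ⊗ I) SwapUp can be mimicked on product states with a toy classical model, in which each qubit's state is a label in a list. In the sketch below, `swap_up` is a hypothetical stand-in that uses the simplest permutation with σ(0) = x (a single exchange of positions 0 and x); the actual SwapUp of Ref. [26] realises some such permutation with a logarithmic-depth network of controlled-swaps, which this example does not attempt to reproduce.

```python
def swap_up(x, qubits):
    """Move the entry at index x to index 0 (here via one transposition)."""
    out = list(qubits)
    out[0], out[x] = out[x], out[0]
    return out


def swap_up_inverse(x, qubits):
    """Undo swap_up; a single transposition is its own inverse."""
    return swap_up(x, qubits)


def inject(U, x, qubits):
    """Apply U to entry x by moving it to the top, acting there, and moving it back."""
    moved = swap_up(x, qubits)
    moved[0] = U(moved[0])  # U acts only on the first qubit
    return swap_up_inverse(x, moved)


def apply_u(label):
    """Toy single-qubit 'unitary' that just tags the label."""
    return "U(" + label + ")"


print(inject(apply_u, 2, ["a", "b", "c", "d"]))  # ['a', 'b', 'U(c)', 'd']
```

Any other permutation with σ(0) = x would give the same result, since the inverse permutation restores the original ordering after U has been applied.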
[Circuit diagrams: Eq. (9), Inject(U)]

(The third circuit above illustrates the effect of Inject(U) in the special case where the input to the control register is a computational basis state, whereas the second circuit implements Inject(U) on arbitrary inputs.) Hence, we can ensure that the two $Z$ operators in Eq. (6) are applied to qubits $p$ and $q$ of the system register conditioned on the state of the first two selection subregisters being $|p\rangle|q\rangle$ by implementing two Inject(Z) operators, one with $|p\rangle$ as the control register and the other with $|q\rangle$ as the control register. To correctly apply $Q_{P_1}$ and $P_2$, we use the Select(Q) and Select(P) circuits constructed in Eq. (7) in conjunction with SwapUp to form Inject-Select(Q) and Inject-Select(P). This is shown below for Inject-Select(Q); the construction of Inject-Select(P) is analogous.

[Circuit diagram: Eq. (10), Inject-Select(Q)]
With these components in hand, we can assemble the circuit for Select(H):

[Circuit diagram: Eq. (11), Select(H) for quadratic Hamiltonians]

where Ladder denotes the operator implemented by the ladder-like sequence of $n-1$ cnot gates in Eq. (6):

[Circuit diagram: Eq. (12), Ladder]

By comparing the circuit in Eq. (11) to that in Eq. (6), it can be verified that the former correctly implements Select(H) as it is defined in Eq. (4). When the selection register is in the computational basis state $|p\rangle|q\rangle|P_1\rangle|P_2\rangle$, the two Inject(Z) operators apply $Z$ operators on qubits $p$ and $q$ in the system register, between the two sequences of cnot gates corresponding to Ladder and Ladder$^\dagger$. Then, Inject-Select(Q) applies $Q_{P_1}$ to qubit $p$ and Inject-Select(P) applies $P_2$ to qubit $q$. Thus, by Eq. (6), the circuit applies the operator $(P_1)_p Z_{p,q} (P_2)_q$ on the system register conditioned on the state of the selection register being $|p\rangle|q\rangle|P_1\rangle|P_2\rangle$, as required by Eq. (4). This holds for all of the computational basis states, and therefore for arbitrary states of the selection register.

| Operator | T-count | T-depth | Gates to be controlled |
|---|---|---|---|
| Inject-Select(Q) | $28(n-1)$ | $32\lceil\log n\rceil$ | 4 |
| Inject-Select*(Q) | $16(n-1)$ | $16\lceil\log n\rceil$ | |
| Inject-Select(P) | $28(n-1)$ | $32\lceil\log n\rceil$ | 2 |
| Inject-Select*(P) | $16(n-1)$ | $16\lceil\log n\rceil$ | |

Table 1: T-count and T-depth of each of the main components used to construct circuits for Select(H) in Section 2. (The Clifford-count and Clifford-depth are $O(n)$ and $O(\log^2 n)$, respectively, for all of the components.) The fourth column specifies the number of one- or two-qubit (Clifford) gates to which controls need to be added in order to construct the controlled version of the operator in the first column.
It is clear from Eqs. [...] (7) and (10)], and the $(-i)$-phase gate (by implementing an $S^\dagger$ gate on the control qubit). When these operators are not applied, the circuit implements the identity since the Ladder and SwapUp operators are cancelled by their inverses. Therefore, controlling the entire circuit on any constant number of qubits incurs only constant additive gate complexity (which can be quantified using the fourth column of Table 1). This is important because algorithms that use the LCU query model generally require access to controlled-Select(H).

Reducing the constant factors
Before generalising the Select(H) circuit in subsection 2.1 to arbitrary fermionic Hamiltonians, we provide more efficient versions of the main circuit components, which can be used to reduce the T-count and T-depth by constant multiplicative factors. The Clifford-count and Clifford-depth are reduced by constant factors as well. However, we focus on the T complexity in this subsection because T gates are the bottleneck in many models of fault-tolerant quantum computation, notably those that are based on topological error-correcting codes. In these settings, T gates require significantly more time and physical qubits to implement than Clifford gates [27], and even a constant-factor improvement in the T complexity may be useful.
The strategy is to replace all of the SwapUp operators by a particular phase-incorrect SwapUp operator, which is based on a phase-incorrect Toffoli gate introduced in Ref. [28]. As pointed out by Ref. [26], this phase-incorrect SwapUp, which we will call SwapUp*, can be implemented using $4(n-1)$ T gates that are applied in $4\lceil\log n\rceil$ layers, a considerable reduction from the $14(n-1)$ T-count and $16\lceil\log n\rceil$ T-depth of SwapUp. The circuit for SwapUp* is described in Appendix A.2. In the computational basis, SwapUp* has the same matrix elements as SwapUp up to sign, i.e.,

$$\text{SwapUp}^*\, |z\rangle = \pm |z'\rangle$$

for any computational basis state $|z\rangle$, where $|z'\rangle = \text{SwapUp}\,|z\rangle$ is also a computational basis state. Hence, we can write $\text{SwapUp}^* = D \cdot \text{SwapUp}$ for some operator $D$ that is diagonal in the computational basis, with eigenvalues $\pm 1$. This implies that Inject(Z) would still be implemented correctly if SwapUp and SwapUp$^\dagger$ in the circuit of Eq. (9) were replaced with SwapUp* and SwapUp*$^\dagger$:

$$\text{SwapUp}^{*\dagger}\, Z_1\, \text{SwapUp}^* = \text{SwapUp}^\dagger D^\dagger Z_1 D\, \text{SwapUp} = \text{SwapUp}^\dagger Z_1\, \text{SwapUp} = \text{Inject}(Z),$$

where the second equality follows from the fact that $D$ commutes with $Z_1$ and is unitary. We denote this implementation of Inject(Z) by Inject*(Z). By the same token, SwapUp* can be used to construct Inject(U) for any $U$ that is diagonal in the computational basis.
On the other hand, the circuit in Eq. (10) would not correctly implement Inject-Select(Q) if SwapUp* were used instead of SwapUp, since $X$ and $Y$ are not diagonal in the computational basis. To reduce the T cost of Inject-Select(Q), we modify the circuits so that $X$ and $Y$ are "injected" separately by applying Inject*(Z) conjugated by basis change operators, as follows:

[Circuit diagram: $X$ and $Y$ injected via Inject*(Z) conjugated by $C_X$ and $C_Y$]

using $C_X$ and $C_Y$ to represent $B_X^{\otimes n}$ and $B_Y^{\otimes n}$, where $B_X$ and $B_Y$ are Clifford gates for which $B_X Z B_X^\dagger = X$ and $B_Y Z B_Y^\dagger = Y$. Note that controlled-Inject(Z) can be constructed by simply adding a control to the $Z$ operator in Inject*(Z). The construction for Inject-Select(Q) is similar [cf. Eq. (7)]. We denote these alternative implementations of Inject-Select(Q) and Inject-Select(P) by Inject-Select*(Q) and Inject-Select*(P).
The T-count and T-depth of each of these improved circuit components follow directly from the T cost of SwapUp*, and are listed in Table 1. Inject(Z), Inject-Select(P), and Inject-Select(Q) can always be replaced with their asterisked counterparts to minimise the complexity. For example, the circuit in Eq. (11), which implements Select(H) for quadratic fermionic Hamiltonians, has T-count $112(n-1)$ and T-depth $128\lceil\log n\rceil$. Replacing the components in Eq. (11) by their improved versions would reduce the total T-count to $48(n-1)$ and the T-depth to $48\lceil\log n\rceil$.
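The totals quoted above can be reproduced by adding up the component costs. The following sketch assumes that the circuit of Eq. (11) consists of two Inject(Z) operators, one Inject-Select(Q), and one Inject-Select(P) applied one after another, and that each Inject(Z) costs two SwapUp operators (and each Inject*(Z) two SwapUp* operators). The dictionary `T_COST` and the function names are hypothetical, introduced only for this tally.

```python
def ceil_log2(n):
    """Exact ceil(log2 n) for integers n >= 1."""
    return (n - 1).bit_length()


# Each entry: (coefficient of (n - 1) in the T-count,
#              coefficient of ceil(log n) in the T-depth).
T_COST = {
    "Inject(Z)": (28, 32),          # assumed: two SwapUp, each (14, 16)
    "Inject*(Z)": (8, 8),           # assumed: two SwapUp*, each (4, 4)
    "Inject-Select(Q)": (28, 32),   # Table 1
    "Inject-Select*(Q)": (16, 16),  # Table 1
    "Inject-Select(P)": (28, 32),   # Table 1
    "Inject-Select*(P)": (16, 16),  # Table 1
}


def select_h_quadratic_t_cost(n, improved=False):
    """T-count and T-depth of the quadratic Select(H) circuit for n spin-orbitals."""
    star = "*" if improved else ""
    parts = [f"Inject{star}(Z)"] * 2
    parts += [f"Inject-Select{star}(Q)", f"Inject-Select{star}(P)"]
    t_count = sum(T_COST[name][0] for name in parts) * (n - 1)
    t_depth = sum(T_COST[name][1] for name in parts) * ceil_log2(n)
    return t_count, t_depth


print(select_h_quadratic_t_cost(8))                 # (784, 384)
print(select_h_quadratic_t_cost(8, improved=True))  # (336, 144)
```

For $n = 8$ these values are $112 \cdot 7$ and $128 \cdot 3$ for the original components, and $48 \cdot 7$ and $48 \cdot 3$ for the asterisked ones, in agreement with the coefficients stated in the text.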

Generalising to arbitrary k
We can extend the ideas of subsections 2.1 and 2.2 to devise ancilla-free implementations of Select(H) for the Jordan-Wigner transforms of arbitrary fermionic Hamiltonians. If at most $k$ distinct spin-orbitals are involved in each term of the fermionic Hamiltonian, the resulting circuit has Clifford- and T-count $O(kn)$, Clifford-depth $O(k \log^2 n)$, and T-depth $O(k \log n)$. To help illustrate the concepts by way of circuit diagrams, we will use $k = 4$ Hamiltonians as a concrete example.
In its second-quantised representation, each term in a general fermionic Hamiltonian is a product of interaction operators $a_p^\dagger a_q$ or $a_p^\dagger a_q^\dagger$ (with $p < q$) and their Hermitian conjugates, and number operators $n_p := a_p^\dagger a_p$. By definition of $k$, Hamiltonians with $k = 4$ may include such terms as $a_p^\dagger a_q a_r^\dagger a_s + \text{h.c.}$, $a_p^\dagger a_q n_r n_s + \text{h.c.}$, $n_p$, and $n_p n_q n_r$, to list a few examples. The circuit for Select(H) can be constructed in two main parts. Loosely speaking, one part of the circuit implements interaction operators and the other part implements number operators. As we saw in subsection 2.1, an interaction operator involving two spin-orbitals $p$ and $q$ is mapped under the Jordan-Wigner transformation to linear combinations of Pauli "strings" of the form $(P_1)_p Z_{p,q} (P_2)_q$, with $P_1, P_2 \in \{X, Y\}$ [cf. Eqs. (2) and (3)]. More generally, any product of interaction operators is mapped to a linear combination of products of such Pauli strings. For instance, $a_p^\dagger a_q a_r^\dagger a_s + \text{h.c.}$ (for $p < q < r < s$) becomes a linear combination of $(P_1)_p Z_{p,q} (P_2)_q (P_3)_r Z_{r,s} (P_4)_s$, for $P_1, P_2, P_3, P_4 \in \{X, Y\}$. Hence, the circuit in Eq. (11), which implements Select(H) in the special case that the Hamiltonian consists only of interactions between two spin-orbitals, can be easily expanded to implement Hamiltonians containing arbitrary products of interaction operators. The key observation is that the identities in Eqs. (5) and (6) hold analogously for any number of pairs of $Z$ operators, e.g., for two pairs of $Z$ operators, we have

[Circuit identity: Eq. (13)]

Consequently, by the exact same logic as that in subsection 2.1, we can "select" between Pauli operators corresponding to interaction terms using a circuit composed of the Inject(Z), Inject-Select(Q), and Inject-Select(P) subroutines defined in subsection 2.1, along with a Ladder and Ladder$^\dagger$. This circuit would essentially be an extended version of that in Eq. (11), with two minor modifications.
First, instead of absorbing the sign of the Pauli operator into $P_1$, as we did in subsection 2.1, we use one qubit $|\text{sgn}\rangle$ in the selection register to encode the sign (with $|\text{sgn}\rangle = |0\rangle$ corresponding to $+1$ and $|\text{sgn}\rangle = |1\rangle$ to $-1$). We then remove the second wire in the circuit for Select(Q) in Eq. (7), and modify Inject-Select(Q) accordingly. Second, to account for the possibility that different terms in the Hamiltonian may be products of different numbers of interaction operators (e.g., $a_p^\dagger a_q + \text{h.c.}$ and $a_p^\dagger a_q a_r^\dagger a_s + \text{h.c.}$ may both be present in a $k = 4$ Hamiltonian), we use $k$ of the qubits in the selection register as control qubits. We denote the states of these control qubits by $|i_p\rangle, |i_q\rangle, |i_r\rangle, |i_s\rangle$, etc. Note that the Pauli string $(P_3)_r Z_{r,s} (P_4)_s$ on the right-hand side of Eq. (13) would not be implemented if the bottom two $Z$ operators, $Q_{P_3}$, and $P_4$ are not applied on the left-hand side. It follows that by controlling the corresponding Inject(Z), Inject-Select(P), and Inject-Select(Q) operators on $|i_r\rangle$ and $|i_s\rangle$, either $(P_1)_p Z_{p,q} (P_2)_q (P_3)_r Z_{r,s} (P_4)_s$ or $(P_1)_p Z_{p,q} (P_2)_q$ is applied depending on the state $|i_r\rangle|i_s\rangle$. The operators associated with $p$ and $q$ are controlled as well in order to allow for the implementation of number operators, which do not transform into Pauli strings of the form in Eq. (13) [cf. Eq. (14) below]. For the example of $k = 4$, the (sub)circuit for interaction operators is the left part (labelled "interaction operators") of the circuit in Eq. (15).
Number operators are very straightforward to implement using Inject(Z) gates, since

$$n_p \to \frac{1}{2}\left(I - Z_p\right) \tag{14}$$

under the Jordan-Wigner transformation. Therefore, to incorporate the Hamiltonian terms that involve number operators, the state of the selection register simply needs to indicate whether $Z$ operators should be applied on certain qubits (recalling that the overall sign is encoded in $|\text{sgn}\rangle$). It suffices to use another $k$ of the selection register qubits as control qubits, labelling their states by $|n_p\rangle, |n_q\rangle, |n_r\rangle, |n_s\rangle$, etc., and control an Inject(Z) operator on $|p\rangle$ and $|n_p\rangle$, another Inject(Z) operator on $|q\rangle$ and $|n_q\rangle$, and so on. The full circuit for Select(H) for any $k = 4$ Hamiltonian is shown in Eq. (15). As always, each of the Inject(Z), Inject-Select(Q), and Inject-Select(P) gates can be replaced by their more efficient variants constructed in subsection 2.2.
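The single-qubit content of the number-operator mapping can be checked directly with 2×2 matrices. The sketch below assumes the common sign convention in which the annihilation operator acts on its own qubit as $(X + iY)/2$, so that $|1\rangle$ is the occupied state; the string of $Z$ operators in the Jordan-Wigner transform cancels in $a_p^\dagger a_p$ because $Z^2 = I$, so only this single-qubit factor matters. The helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(A):
    """Conjugate transpose of a 2x2 matrix with complex entries."""
    return [[A[j][i].conjugate() for j in range(2)] for i in range(2)]


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

# Assumed convention: a = (X + iY)/2, i.e. a = |0><1|.
a = [[(X[i][j] + 1j * Y[i][j]) / 2 for j in range(2)] for i in range(2)]

number_op = matmul(dagger(a), a)
half_i_minus_z = [[(I2[i][j] - Z[i][j]) / 2 for j in range(2)] for i in range(2)]

print(number_op == half_i_minus_z)  # True
```

With the opposite convention for which basis state is occupied, the same calculation would give $(I + Z)/2$ instead; in either case the number operator reduces to the identity plus a single $Z$, which is all that the Inject(Z) construction relies on.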
[Circuit diagram: Eq. (15), the full Select(H) circuit for $k = 4$, with one part labelled "interaction operators" and the other "number operators"]

All possible terms can be encoded by appropriately choosing the correspondence between the Pauli operators in the Jordan-Wigner encoding and the computational basis states of each subregister (and constructing Prepare(α) in a way that is consistent with this correspondence). As an explicit example, suppose that the Hamiltonian includes terms of the form $a_p^\dagger a_q n_r + \text{h.c.}$, which transform to linear combinations of $(P_1)_p Z_{p,q} (P_2)_q$ and $(P_1)_p Z_{p,q} (P_2)_q Z_r$, with $P_1, P_2 \in \{X, Y\}$. It can be seen from Eq. (15) that the first type of Pauli operators are applied when $|n_r\rangle = |0\rangle$ and the second type are applied when $|n_r\rangle = |1\rangle$, with $|i_p\rangle|i_q\rangle|i_r\rangle|i_s\rangle = |1\rangle|1\rangle|0\rangle|0\rangle$ and $|n_p\rangle|n_q\rangle|n_s\rangle = |0\rangle|0\rangle|0\rangle$ for both. While the circuit in Eq. (15) implements arbitrary terms involving up to $k = 4$ spin-orbitals, it is often the case that the Hamiltonian in question only contains a few types of terms. Some of the qubits could then be removed from the selection register and the control logic could be simplified. For the molecular electronic structure Hamiltonian, which is a linear combination of $a_p^\dagger a_q + \text{h.c.}$, $a_p^\dagger a_q a_r^\dagger a_s + \text{h.c.}$, $n_p$, $n_p n_q$, and $a_p^\dagger a_q n_r + \text{h.c.}$ [29,30], we would not need the qubits storing $|n_r\rangle$ and $|n_s\rangle$ and the last two Inject(Z) gates in Eq. (15), for instance.
Thus, for arbitrary $k$, the circuit for Select(H) can be constructed from at most $2k$ controlled-Inject(Z), $\lfloor k/2 \rfloor$ controlled-Inject-Select(Q), $\lfloor k/2 \rfloor$ controlled-Inject-Select(P), and 2 Ladder operators. Inject(Z), Inject-Select(Q), and Inject-Select(P) are implemented using $O(n)$ Clifford and T gates, $O(\log^2 n)$ Clifford-depth, and $O(\log n)$ T-depth, as discussed in subsection 2.1. Ladder can be implemented using $O(n)$ Clifford gates and $O(\log n)$ Clifford-depth [cf. Appendix A.3]. The circuit therefore has Clifford- and T-count $O(kn)$, Clifford-depth $O(k \log^2 n)$, and T-depth $O(k \log n)$ in all cases. It is clear from Eq. (15) that the selection register comprises $k\lceil\log n\rceil + O(1)$ qubits, and that no ancillae are required. For practical applications, $k$ is usually a constant independent of $n$, in which case the total gate count is $O(n)$ and the Clifford-depth and T-depth are $O(\log^2 n)$ and $O(\log n)$, respectively. The exact constant factors in the T complexity can be directly calculated using Table 1.
$O(2^j)$ cnots applied in $O(j)$ layers. Summing over $j$, the total number of controlled-swaps is

$$2\left(n - 2^{\lceil\log n\rceil - 1}\right) + \sum_{j=0}^{\lceil\log n\rceil - 2} 2 \cdot 2^j = 2(n-1),$$

and these are applied in $4\lceil\log n\rceil$ layers. Each controlled-swap can be decomposed into one Toffoli and two cnot gates, and each Toffoli has T-count 7 and T-depth 4. Therefore, SwapUp has T-count $14(n-1)$ and T-depth $16\lceil\log n\rceil$. The total number of Clifford gates (including the multi-target cnots in Eq. (16) and the Cliffords in the decomposition of each controlled-swap) is $O(n)$, and the total Clifford-depth is $O(\log^2 n)$.
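The closed form for the number of controlled-swaps can be confirmed numerically. The short check below evaluates the sum term by term for a range of $n$ and compares it with $2(n-1)$, then multiplies by the Toffoli T-count of 7 to recover the $14(n-1)$ T-count of SwapUp. The function name is hypothetical.

```python
def controlled_swap_count(n):
    """Evaluate 2(n - 2^(m-1)) + sum_{j=0}^{m-2} 2 * 2^j with m = ceil(log2 n)."""
    m = (n - 1).bit_length()  # exact ceil(log2 n) for n >= 2
    last_level = 2 * (n - 2 ** (m - 1))
    lower_levels = sum(2 * 2 ** j for j in range(m - 1))  # j = 0, ..., m - 2
    return last_level + lower_levels


# The sum collapses to 2(n - 1) for every n >= 2.
print(all(controlled_swap_count(n) == 2 * (n - 1) for n in range(2, 200)))  # True

# One Toffoli (T-count 7) per controlled-swap gives the T-count of SwapUp.
print(7 * controlled_swap_count(10))  # 126, i.e. 14 * (10 - 1)
```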

A.2 SwapUp *
In subsection 2.2, we use the fact that a certain phase-incorrect version SwapUp* of SwapUp is less costly than SwapUp to reduce the T-count and T-depth by constant multiplicative factors. SwapUp* is obtained by replacing each controlled-swap gate in Eq. (17) with a phase-incorrect version of controlled-swap:

[Circuit diagram: phase-incorrect controlled-swap]

where $A := e^{i(\pi/8)Y} = S^\dagger H T H S$ (here, $H$ denotes the Hadamard gate). The circuit on the right-hand side implements an operator whose action on computational basis states differs from controlled-swap only in that $|100\rangle$ is mapped to $-|100\rangle$. Hence, since controlled-swap merely permutes the computational basis states, SwapUp* acts the same as SwapUp on computational basis states up to sign. As observed in Ref. [26], any sequence of these phase-incorrect controlled-swap operators that target disjoint pairs of qubits can be parallelised such that the $A$ and $A^\dagger$ gates are applied in four layers, and the cnots in $O(\log n)$ layers. It follows that SwapUp* has T-count $4(n-1)$ and T-depth $4\lceil\log n\rceil$ (and Clifford-count $O(n)$ and Clifford-depth $O(\log^2 n)$).
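The decomposition of $A$ into one T gate and Clifford gates can be checked numerically. The sketch below reads $S^\dagger H T H S$ as a matrix product in the written order and compares it with $e^{i(\pi/8)Y} = \cos(\pi/8)\, I + i \sin(\pi/8)\, Y$; under that reading the two agree up to the global phase $e^{i\pi/8}$, which has no effect on the circuit. Helper names are hypothetical.

```python
import cmath
import math
from functools import reduce


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


r = 1 / math.sqrt(2)
H = [[r, r], [r, -r]]
S = [[1, 0], [0, 1j]]
S_DAG = [[1, 0], [0, -1j]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]

theta = math.pi / 8
# exp(i*theta*Y) = cos(theta) I + i sin(theta) Y, a real rotation matrix.
A = [[math.cos(theta), math.sin(theta)],
     [-math.sin(theta), math.cos(theta)]]

M = reduce(matmul, [S_DAG, H, T, H, S])  # S^dagger H T H S as a matrix product


def equal_up_to_global_phase(P, Q, tol=1e-12):
    """True if P = e^{i phi} Q for some real phi (assumes Q[0][0] != 0)."""
    phase = P[0][0] / Q[0][0]
    if abs(abs(phase) - 1) > tol:
        return False
    return all(abs(P[i][j] - phase * Q[i][j]) < tol
               for i in range(2) for j in range(2))


print(equal_up_to_global_phase(M, A))  # True
```

Since the product contains a single T gate, each $A$ or $A^\dagger$ contributes one T gate to the count, consistent with the $4(n-1)$ T-count quoted for SwapUp*.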
It is easily verified that the same transformation can be realised by arranging cnot gates in a tree-like structure, shown below for $n = 8$:

[Circuit diagram: tree-like arrangement of cnot gates implementing Ladder for $n = 8$]

(For $n$ that is not a power of 2, the circuit for Ladder can be obtained by starting with the circuit for the next largest power of 2, then removing all of the cnots supported on qubits that are out of range.) Thus, a circuit for Ladder can be constructed using $O(n)$ cnots applied in $O(\log n)$ layers.