Trading T gates for dirty qubits in state preparation and unitary synthesis

Efficient synthesis of arbitrary quantum states and unitaries from a universal fault-tolerant gate-set e.g. Clifford+T is a key subroutine in quantum computation. As large quantum algorithms feature many qubits that encode coherent quantum information but remain idle for parts of the computation, these should be used if it minimizes overall gate counts, especially that of the expensive T-gates. We present a quantum algorithm for preparing any dimension-$N$ pure quantum state specified by a list of $N$ classical numbers, that realizes a trade-off between space and T-gates. Our scheme uses $\mathcal{O}(\log{(N/\epsilon)})$ clean qubits and a tunable number of $\sim(\lambda\log{(\frac{\log{N}}{\epsilon})})$ dirty qubits, to reduce the T-gate cost to $\mathcal{O}(\frac{N}{\lambda}+\lambda\log{\frac{N}{\epsilon}}\log{\frac{\log{N}}{\epsilon}})$. This trade-off is optimal up to logarithmic factors, proven through an unconditional gate counting lower bound, and is, in the best case, a quadratic improvement in T-count over prior ancillary-free approaches. We prove similar statements for unitary synthesis by reduction to state preparation. Underlying our constructions is a T-efficient circuit implementation of a quantum oracle for arbitrary classical data.


Introduction
Many quantum algorithms require coherent access to classical data, that is, data that can be queried in superposition through a unitary quantum operation. This property is crucial in obtaining quantum speedups for applications such as machine learning [1], simulation of physical systems [2,3] and solving systems of linear equations [4,5].
The nature of quantum-encoded classical data is itself varied. For example, quantum data regression [6] queries a classical list of $N$ data-points $a_x$ through a unitary data-lookup oracle [7] $O|x\rangle|0\rangle = |x\rangle|a_x\rangle$. Other applications, particularly in quantum chemistry [8], instead access Hamiltonian coefficient data through a unitary $A|0\rangle = |\psi\rangle$ that prepares these numbers as amplitudes in a normalized quantum state, or as probabilities in a purified density matrix. Even more generally, the central challenge is synthesizing some arbitrary unitary $A \in \mathbb{C}^{N\times N}$ of which $k \le N$ columns are either partially or completely specified by a list of complex coefficients that is, say, provided on paper. Synthesis of these data-access unitaries is typically a dominant factor in the overall algorithm cost. In any scalable approach to quantum computation, all unitaries decompose into a universal fault-tolerant quantum gate set, such as Clifford gates {H, S, Cnot} and T gates [9]. Solovay and Kitaev [9] were the first to recognize that any single-qubit unitary could be ϵ-approximated using $O(\log^{c}(1/\epsilon))$ fault-tolerant gates for c = 3.97, which was later improved to c = 1 [10,11]. By bootstrapping these results, it is well-known that a roughly equal number of $O(kN\log(N/\epsilon))$ [12] Clifford and non-Clifford gates suffice for arbitrary dimensions. Notably, the total gate count scaling is optimal in all parameters, following gate-counting arguments [13].
The possibility that T gates could be substantially fewer in number than the Clifford gates, however, is not excluded by known lower bounds. It is believed that fault-tolerant Clifford gates {H, S, Cnot} will be cheap in most practical implementations of fault-tolerant quantum computation. In contrast, the equivalent cost of each fault-tolerant non-Clifford T gate, implemented at machine precision, is placed at a space-time volume ≈ (225 logical qubits) × (10 Clifford depth) for realistic estimates [14] based on |T⟩ magic-state distillation at a physical error rate of $\approx 10^{-3}$.
We present an approach to arbitrary quantum state preparation and unitary synthesis that focuses on minimizing the T count. Unique to our approach is the exploitation of a variable number $O(\lambda\log(1/\epsilon))$ of ancillary qubits, in a manner not considered by prior gate-counting arguments or algorithms. We find a O(λ) improvement in the T count and circuit depth while keeping the Clifford count unchanged, excluding logarithmic factors. Most surprisingly, its benefit far exceeds the naïve approach of applying these ancillary qubits to producing |T⟩ magic-states for any $\lambda = O(\sqrt{N})$, as seen in Table 1. In the best case, the T count of $\tilde{O}(\sqrt{N})$ is a square-root factor smaller than prior art, such as for preparing arbitrary pure states. Moreover, we prove this approach realizes an optimal ancillary qubit and T count trade-off up to log factors.
In particular, our approach is always advantageous as all but a logarithmic number $O(\log(N/\epsilon))$ of qubits, independent of λ, may be dirty, meaning that they start in, and are returned to the same undetermined initial state. At first glance, the full quadratic speedup is not always desirable as any clean ancillary qubit, initialized in the |0⟩ state, is a resource that may be better allocated to magic-state distillation. However, dirty qubits may not be used for magic-state distillation, and are a resource typically abundant in many algorithms, such as quantum simulation by a linear combination of unitaries [2]. Even in the most pessimistic scenario where no dirty qubits are available, a reduction in the overall execution time of the algorithm, including the effective cost of magic-state distillation, is possible.
We also consider applications of our approach. For instance, a similar speedup to unitary synthesis where K ≤ N columns are specified, follows by a well-known reduction based on Householder reflections. Improvements to state preparation with garbage relevant to the most advanced quantum simulation techniques [16,17,8] are also presented in Appendix A. Underlying our T gate scaling results is an improved implementation of a data-lookup oracle that outputs the data together with a garbage state, $O|x\rangle|0\rangle|0\rangle = |x\rangle|a_x\rangle|\mathrm{garbage}_x\rangle$ (5). Note the attached garbage state may always be uncomputed by applying O in reverse. We begin by describing our implementation of Eq. (5), which we call a 'SelectSwap' network, with costs outlined in Table 2. Subsequently, we apply SelectSwap to the state preparation problem using the fact that there exists classical data such that preparing any |ψ⟩ requires only O(polylog(N)) queries and additional primitive quantum gates [18]. The reduction of unitary synthesis to state preparation is then described. Finally, we prove optimality of our approach through matching lower bounds, and discuss the results.

Data-lookup oracle by a SelectSwap network
The unitary data-lookup oracle of Eq. (5) accepts an input number state |x⟩ with x ∈ {0, …, N − 1} and returns an arbitrary b-bit number $a_x \in \{0,1\}^b$. Our approach combines a multiplexer implementation of O [20], called Select, and a unitary swap network Swap, with costs summarized in Table 2. The Select operator applies some arbitrary unitary $U_x$ controlled by the index state |x⟩, that is, $\mathrm{Select} = \sum_{x=0}^{N-1}|x\rangle\langle x|\otimes U_x$. Thus O is realized by choosing $U_x = X^{a_x} \equiv \bigotimes_{j=0}^{b-1} X^{a_{x,j}}$ [8] to either be identity or the Pauli-X gate depending on the bit string $a_x$. As described in Fig. 1a, the cost, excluding that of $\{U_x\}$, is O(N) Clifford+T gates. As controlled-X is Clifford too, an additional O(bN) Clifford gates are applied. These Cnots may be applied in logarithmic depth using an ancillary-qubit-free quantum fanout discussed in Appendix B.1.
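The unitary Swap network moves a b-qubit quantum register indexed by x to the x = 0 register, controlled by the state |x⟩. For any quantum states $\{|\phi_j\rangle\}$, $\mathrm{Swap}\,|x\rangle\bigotimes_{j=0}^{N-1}|\phi_j\rangle_j = |x\rangle|\phi_x\rangle_0\otimes(\cdots)$, where the remaining quantum states in registers j > 0 are unimportant. As illustrated in Fig. 1b, this decomposes into a network of controlled-swap operators. As each controlled-swap operator decomposes into two Cnots and one Toffoli, this network uses O(bN) Clifford+T gates. An ancillary-qubit-free logarithmic-depth version of Swap is discussed in Appendix B.2.
Our SelectSwap network illustrated in Fig. 1c is a simple hybrid of the above two schemes. Similar to the Swap approach, we duplicate the b-bit register λ times, where λ ∈ {1, …, N} is an integer. For λ that is not a power of 2, we compute |x⟩ → |q⟩|r⟩, where q = ⌊x/λ⌋ is the quotient and r = x mod λ is the remainder. This contributes an additive cost of O(log N log λ) gates. Select is controlled by |q⟩ to write multiple values of $a_x$ simultaneously into these duplicated registers by choosing $U_x = \bigotimes_{j=0}^{\lambda-1} X^{a_{x\lambda+j}}$, where $x \in [\lfloor N/\lambda\rfloor]$. Swap is then controlled by |r⟩ to move the desired data entry $|a_x\rangle$ to the output register. As the T gate complexity of $O(\lambda b + N/\lambda)$ is determined only by the dimension of the Select and Swap control registers, this is minimized with value $O(\sqrt{bN})$ by choosing $\lambda = O(\sqrt{N/b})$.
Importantly, all but $b + \lceil\log_2 N\rceil$ of the qubits may be made dirty using a simple modification shown in Fig. 1d. For any computational basis state $|\phi\rangle = \bigotimes_{r=0}^{\lambda-1}|\phi_r\rangle_r$ and any input state |x⟩ = |q⟩|r⟩, one may evaluate the state |0⟩|ϕ⟩ at each dotted line of Fig. 1d to verify that the output register receives $|a_x\rangle$ while |ϕ⟩ is returned unchanged. By linearity, this is true for all quantum states |ϕ⟩.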
As the T gate complexity begins to increase with sufficiently large λ, one may simply elect to not use excess available dirty qubits. However, continued reduction of the T depth down to O(log N) might be a useful property. In Appendix C we discuss an alternate construction that achieves logarithmic T depth and preserves the quadratic T count improvement for larger $\lambda = \Omega(\sqrt{N})$.
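The division of labour between Select and Swap can be traced classically. The following is a minimal sketch rather than the circuit itself: the function names, the zero-padding of the table, and the cost model with all constants set to 1 are assumptions made for illustration, and a plain exchange of list entries stands in for the controlled-swap network.

```python
import math


def select_swap_lookup(data, lam, x):
    """Classically trace one SelectSwap lookup of data[x] using lam registers."""
    q, r = divmod(x, lam)  # |x> -> |q>|r>
    # Select, controlled by q, writes lam consecutive entries into the registers
    # (entries past the end of the table are padded with 0, an assumption).
    registers = [data[q * lam + j] if q * lam + j < len(data) else 0
                 for j in range(lam)]
    # Swap, controlled by r, moves register r to the output position 0.
    # A plain exchange stands in for the controlled-swap network here.
    registers[0], registers[r] = registers[r], registers[0]
    return registers[0]


def t_cost_model(n_entries, b, lam):
    """Leading-order T-count model N/lam + lam*b with constants set to 1."""
    return math.ceil(n_entries / lam) + lam * b


table = [(7 * i + 3) % 32 for i in range(16)]  # arbitrary 5-bit entries, N = 16
best_lam = min(range(1, 17), key=lambda lam: t_cost_model(16, 5, lam))
```

Under this toy model the minimizing λ sits close to $\sqrt{N/b}$, which is the behaviour the scaling argument predicts; the true constants depend on the circuit details in Table 2.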

Arbitrary quantum state preparation
Preparation of an arbitrary dimension $N = 2^n$ quantum state $|\psi\rangle = \frac{1}{\|\vec a\|_2}\sum_{x\in\{0,1\}^n} a_x|x\rangle$ using the data-lookup oracle O of Eq. (5) is well-known in prior art. The basic idea was introduced by [21], and an ancillary-free implementation was presented in [12]. We outline the inductive argument of [18], and evaluate its cost using our SelectSwap implementation of O.
For any bit-string $y \in \{0,1\}^w$ of length w ≤ n, let $p_y$ be the probability that the first w qubits of |ψ⟩ are in state |y⟩; for instance, $p_0$ is the probability that the first qubit of |ψ⟩ is in state |0⟩. We then recursively apply single-qubit rotations on the (w + 1)-th qubit conditioned on the first w qubits being in state |y⟩. The rotation angles $\theta_y = \cos^{-1}\sqrt{p_{y0}/p_y}$ are chosen so that the state produced $|\psi_{w+1}\rangle$ reproduces the correct probabilities on the first w + 1 qubits. These conditional rotations are implemented using a sequence of data-lookup oracles $O_w$, where $O_w$ stores a b-bit approximation of all $\theta_y$ where $y \in \{0,1\}^w$. At the w-th iteration, $O_w$ writes $\theta_y$ into an ancillary register, the (w + 1)-th qubit is rotated by this angle, and $O_w$ is applied in reverse to erase the register. Note that we omit any garbage registers as they are always uncomputed. Also, the rotation step is implemented using b single-qubit rotations each controlled by a bit of $\theta_y$. The complex phases of the target state |ψ⟩ are applied to $|\psi_n\rangle$ by a final step with a data-lookup oracle storing a b-bit approximation of the phases $\arg[a_x]$.
We implement these oracles with the SelectSwap network of Fig. 1, using a fixed value of λ for all $O_k$. A straightforward sum over the T count of Fig. 1 is $O(b\lambda\log N + N/\lambda)$, which is then added to the total T count of $O(b\log(N/\delta))$ for synthesizing all single-qubit rotations each to error δ using the phase gradient technique [22], outlined in Appendix D. The error of the resulting state |ψ′⟩ produced is determined by the number of bits b used to represent the rotation angles, in addition to rotation synthesis errors δ. Adding these errors leads to a total error that is bounded by ϵ with the choice $b = \Theta(\log(\frac{\log N}{\epsilon}))$ and δ = Θ(ϵ). As a function of ϵ, the total T gate complexity is then $O\left(\frac{N}{\lambda} + \lambda\log N\log\frac{\log N}{\epsilon} + \log\frac{N}{\epsilon}\log\frac{\log N}{\epsilon}\right)$, and is plotted in Fig. 2.
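The recursion on conditional probabilities can be checked classically for real, non-negative coefficients. This is a minimal sketch under stated assumptions: the function names are hypothetical, the first qubit is taken to be the most significant bit of x, and the angle convention is $\theta_y = \cos^{-1}\sqrt{p_{y0}/p_y}$ with exact (not b-bit) angles.

```python
import math


def rotation_angles(amplitudes):
    """Return (n, {(w, y): theta_y}) for non-negative amplitudes of length 2**n."""
    n = len(amplitudes).bit_length() - 1
    probs = [a * a for a in amplitudes]
    norm = sum(probs)
    probs = [p / norm for p in probs]
    angles = {}
    for w in range(n):
        block = len(probs) >> w  # number of entries sharing a w-bit prefix
        for y in range(1 << w):
            p_y = sum(probs[y * block:(y + 1) * block])
            p_y0 = sum(probs[y * block:y * block + block // 2])
            ratio = p_y0 / p_y if p_y > 0 else 1.0
            angles[(w, y)] = math.acos(math.sqrt(min(1.0, ratio)))
    return n, angles


def rebuild(n, angles):
    """Apply the conditional rotations qubit by qubit, starting from amplitude 1."""
    state = [1.0]
    for w in range(n):
        nxt = []
        for y, amp in enumerate(state):
            theta = angles[(w, y)]
            nxt += [amp * math.cos(theta), amp * math.sin(theta)]
        state = nxt
    return state
```

For example, `rebuild(*rotation_angles([1, 2, 3, 4, 0, 1, 2, 1]))` returns the normalized input vector up to floating-point error. Complex phases would be handled by the separate final lookup described above.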

Unitary synthesis by state preparation
The ability to prepare arbitrary quantum states enables synthesis of arbitrary unitaries $U \in \mathbb{C}^{N\times N}$. Given the matrix elements $\{|u_k\rangle \mid k \in [K]\}$ for the first K columns of U, the isometry synthesis problem is to find a quantum circuit that implements a unitary V that approximates U in the first K columns. We use the Householder reflections decomposition [24] to find a V that is a product of K reflections $I - 2|v_k\rangle\langle v_k|$ and a diagonal gate $\mathrm{diag}(e^{i\phi'_1}, \ldots, e^{i\phi'_K}, 1, \ldots, 1)$, for some set of quantum states $|v_k\rangle$. Note that this representation is not unique. The diagonal gate can be eliminated by using one ancillary qubit as discussed in [25]. There, it suffices to implement reflections about states that are prepared as follows. Apply Hadamard to the first qubit, then a sequence of CNOT gates to prepare |1⟩ ⊗ |k⟩ + |0⟩ ⊗ |0⟩, and finally apply controlled-$A_k$ negatively controlled on the first qubit. Note that when this method is applied to the synthesis of sparse isometries, the states being synthesized are again sparse. Moreover, the cost of converting a state into a reflection doubles the number of non-Clifford gates. Thus the number of T gates used to synthesize an isometry is twice that for all the Controlled-$A_k$ operations.
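The basic fact behind the Householder reduction is that a single reflection can map a basis vector onto a prescribed column. The sketch below illustrates this for real unit vectors only, which sidesteps the phases that the diagonal gate accounts for; the function names are hypothetical and this is not the ancilla-assisted construction of [25].

```python
import math


def householder_vector(u, k=0):
    """Unit vector v such that (I - 2 v v^T) e_k = u, for a real unit vector u != e_k."""
    v = [-c for c in u]
    v[k] += 1.0  # v is proportional to e_k - u
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def reflect(v, w):
    """Apply the reflection I - 2 v v^T to the real vector w."""
    overlap = sum(a * b for a, b in zip(v, w))
    return [b - 2.0 * overlap * a for a, b in zip(v, w)]
```

With `u = [0.6, 0.0, 0.8, 0.0]`, reflecting `[1, 0, 0, 0]` about `householder_vector(u)` returns `u` up to rounding, and the same reflection maps `u` back, since a reflection is its own inverse.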

Lower bound
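We prove the optimality of our construction through a circuit counting argument. The most general circuit on q qubits that uses Γ T-gates has the canonical form [26] $C\cdot\prod_{j=1}^{\Gamma} e^{-i\pi P_j/8}$, where each $P_j$ is one of $4^q$ possible Pauli operators, and C is one of $2^{O(q^2)}$ possible Clifford operators. Thus the number of unique quantum circuits is at most
Unique quantum circuits $= O(4^{q\Gamma + O(q^2)})$. (13)
A lower bound on the qubit and T-gate complexity of the data-lookup oracle of Eq. (5) is obtained by counting the number of unique Boolean functions $f : [N] \to \{0,1\}^b$. As there are $2^{bN}$ such functions, we compare with Eq. (13). This leads to a lower bound on the space-T gate product $q\Gamma = \Omega(bN - q^2)$. As the SelectSwap complexity in Fig. 1 is $q\Gamma = O(\lambda^2b^2 + bN + \log(N)(1/\lambda + \lambda b))$, this is optimal up to logarithmic factors so long as the number of T-gates dominates the qubit count like $\lambda = o(\sqrt{N/b})$, which is the case in most quantum circuits of interest.
A similar lower bound on state preparation is obtained by counting the number of dimension-N quantum states that are distinguishable with error ϵ. Without loss of generality, we only count quantum states $|\psi\rangle \in \mathbb{R}^N$ with real coefficients. These states live on the surface of a unit-ball $B_N$ of dimension N, with area $\mathrm{Area}[B_N] = \frac{2\pi^{N/2}}{(N/2-1)!}$. Let us now fix a state |ψ⟩. Then the states |χ⟩ within distance ϵ of |ψ⟩ occupy a patch of this surface whose area scales as $\epsilon^{N-1}$ [27]. Thus the number of distinguishable quantum states is at least the ratio of these two areas. Once again by comparing with Eq. (13), we obtain a corresponding lower bound on the T-gate count. This also matches the T gate cost of our approach up to logarithmic factors, so long as $\lambda = o(\sqrt{N/\log(1/\epsilon)})$. The total number of isometries within at least distance ϵ from each other can also be estimated using Lemma 4.3 on Page 14 in [28], and is roughly $\Omega((1/\epsilon)^{KN})$. An analogous argument can be made for state preparation with garbage by considering the unit simplex instead of the unit ball.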
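The data-lookup counting bound can be turned into a number once the constant hidden in the $O(q^2)$ Clifford term is fixed. The sketch below assumes that constant is a parameter `c` (set to 1 purely for illustration) and solves $4^{q\Gamma + cq^2} \ge 2^{bN}$ for the smallest integer Γ; the function name is hypothetical.

```python
import math


def t_count_lower_bound(n_entries, b, q, c=1.0):
    """Smallest Gamma with 4**(q*Gamma + c*q*q) >= 2**(b*n_entries).

    c models the unknown constant in the O(q^2) Clifford term (an assumption).
    """
    needed = b * n_entries / 2 - c * q * q  # from 2*(q*Gamma + c*q^2) >= b*N
    return max(0, math.ceil(needed / q))
```

For N = 1024, b = 8 and q = 32 this toy bound evaluates to 96 T gates. It shows how the bound on Γ weakens as q grows, which is the space-T trade-off that SelectSwap exploits.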
Let us now establish a lower bound on state preparation that holds when measurements and an arbitrary number of ancillae are used. For the purpose of the lower bound we also allow the use of post-selected measurements of multiple-qubit Pauli observables. Every preparation of an n-qubit state by a Clifford+T circuit with ancillae and post-selected Pauli measurements can be rewritten [29] as the following sequence of operations: 1) initialization of Γ qubits into the T state $|0\rangle + e^{i\pi/4}|1\rangle$; 2) post-selected measurement of Γ − n commuting Pauli observables; 3) application of a Clifford unitary on Γ qubits. After the three steps, the first Γ − n qubits are in a zero state and the last n qubits are in the state being prepared. Let us count the number of distinct states that can be prepared by the steps described above. For step two, there are at most $2\cdot 4^{\Gamma}$ ways of choosing the first Pauli observable and at most $4^{\Gamma}$ ways of choosing each of the remaining Γ − n − 1 observables because each of them needs to commute with the first observable. Therefore, in step two we have at most $2\cdot 4^{\Gamma(\Gamma-n)}$ choices of Pauli observables. For step three, two distinct Clifford unitaries can lead to preparing the same state, and counting the total number of Clifford unitaries on Γ qubits leads to an overestimate. The prepared state |ψ⟩ is completely described by $4^n$ numbers $\alpha_P = \mathrm{Tr}(P|\psi\rangle\langle\psi|)$, where P goes over all n-qubit Pauli matrices $\{I, X, Y, Z\}^{\otimes n}$. Let us count how many distinct $4^n$-dimensional vectors of $\alpha_P$ we can get in step three by applying different Clifford unitaries C.
Let ρ′ be a density matrix describing the state of all qubits after step two; then $\alpha_P = \mathrm{Tr}((I\otimes P)\,C\rho' C^{\dagger}) = \mathrm{Tr}(C^{\dagger}(I\otimes P)C\,\rho')$. We see that the vector of $\alpha_P$ is uniquely defined by the action of C on the 2n Pauli operators $Z_{\Gamma-n+k}$, $X_{\Gamma-n+k}$ for k from 1 to n. There are at most $2\cdot 4^{2\Gamma n}$ ways of choosing the action of C on the listed Pauli operators. Therefore we can prepare at most $4\cdot 4^{\Gamma(\Gamma+n)}$ distinct states. This leads to the lower bound on the number of required T gates Ω N log N log(1/ϵ) .

Conclusion
We have shown that arbitrary quantum states with N coefficients, or unitaries with KN values specified by classical data, may be synthesized with a T gate complexity that is an optimal $\tilde{O}(\sqrt{N})$ reduction over prior art. As these subroutines are ubiquitous in many quantum algorithms we expect this result to be widely applicable.
We also expect our approach to be practical due to its almost exclusive usage of dirty qubits, which are typically abundant in larger quantum algorithms.Though our results are asymptotically optimal, constant factor and logarithmic improvements in costs could still be possible through careful optimization [30,31].For instance, our approach can be modified to use only O(1) additional clean qubits, but this increases the T count by a logarithmic factor.As more limited trade-offs between T gates and ancillary qubits are observed in other quantum circuits, such as for addition [22] or And [19], a major open question highly relevant to implementation in nearer-term quantum computers, is whether such a property could be generic for many other quantum circuits and algorithms.

Developments after 2018 release of preprint
There have been numerous improvements in trade-offs [15] between qubits, depth, and overall spacetime volume [32] of quantum circuits for state preparation and unitary synthesis [15], especially in connection to the block-encoding framework [16,33] for matrices represented by classical data [34].A notable new direction is error-resilient table-lookup [35], which achieves logarithmic error scaling by using a large number of O(N ) qubits.However, minimizing use of expensive non-Clifford gates is rarely a priority in these results, and almost all exhibit the same scaling for number of Clifford and non-Clifford gates.We have updated Table 1 with a comparison to recent work [15] that claims an optimal qubit-depth trade-off.To facilitate comparison, we rescale λ and we use the tighter bound Eq. ( 11) on the T gate cost that was originally Eq. (12).
Our SelectSwap architecture of table lookup remains state-of-the-art. To date, all methods with an asymptotic advantage in non-Clifford gate cost for state preparation or unitary synthesis [34] do so by reduction to SelectSwap. The circuit implementation of SelectSwap has also seen some optimizations. By allowing intermediate circuit measurements, the multi-target $\mathrm{Cnot}_n$ gates in Appendix B.1 can be implemented in constant, rather than logarithmic depth, though at the cost of using an additional O(n) clean qubits. Intermediate measurements also enable uncomputing the garbage register of Eq. (5) of SelectSwap using $4\lceil N/\lambda\rceil + 4\lambda$ T gates and $\lambda + \lceil\log(N/\lambda)\rceil$ clean qubits, which is independent of b [36].
The T count reductions enabled by SelectSwap have been key to numerous state-of-art resource estimates for quantum computing applications, such as in chemistry [36,37,38,39], and other classical-data-intense routines [40,34].

A Purified density matrix preparation
In some applications, particularly in quantum simulation based on a linear combination of unitaries or qubitization [2,16], it suffices to prepare the density matrix $\rho = \sum_x \frac{a_x}{\|\vec a\|_1}|x\rangle\langle x|$ through a purification $|\psi\rangle = \sum_x \sqrt{\frac{a_x}{\|\vec a\|_1}}\,|x\rangle|\mathrm{garbage}_x\rangle$ (4), where the number state |x⟩ is entangled with some garbage that depends only on x. By allowing garbage, it was shown by [8] that strictly linear T gate complexity in N is achievable, using a Select data-lookup oracle corresponding to the λ = 1 case of Table 2. We outline the original idea, then generalize the procedure using the SelectSwap network, which enables sublinear T gate complexity and better error scaling than the garbage-free approach. As density matrices have positive diagonals, we only consider the case of positive $a_x \ge 0$.
The original approach is based on a simple observation.By comparing a b-bit number state |a⟩ together with a uniform superposition state |u where we denote a uniform superposition after the first a elements by |u ≥a ⟩ = . This may be implemented using quantum addition [41], which costs O(b) Clifford+T gates with depth O(b).This observation is converted to state-preparation in four steps.First, the normalized coefficients a x N 2 b ∥a∥1 ≈ a ′ x are rounded to nearest integer values such that ∥⃗ a ′ ∥ 1 = N 2 b .Second, the data-lookup oracle that writes two numbers a ′′ where we have omitted the irrelevant garbage state.Third, the oracle O is applied to a uniform superposition over |x⟩, and the comparator trick of Eq. ( 16) is applied.This produces the state Finally, |f (x)⟩ is swapped with |x⟩, controlled on the |1⟩ state.This leads to a state |ψ⟩ After tracing out the garbage register, the resulting density matrix ρ ′ approximates the desired state ρ with trace distance The T gate complexity is then the cost of the data-lookup oracle of Eq. ( 17) plus O(b) for the comparator of Eq. ( 16), plus O(log N ) for the controlled swap with |f (x)⟩.By implementing this data-lookup oracle with the SelectSwap network, one immediately obtains the stated T gate complexity of O(λ(b , where we choose b = O(log (1/ϵ)).

B Data-lookup oracle implementation details
In this section, we present additional details on the implementation of the data-lookup oracle. In particular, we discuss a multi-target Cnot implementation in logarithmic depth without ancillary qubits in Appendix B.1, and a swap network Swap with similar properties in Appendix B.2. Also evaluated is the T count and Clifford depth of these implementations up to constant factors. We define the Clifford depth to be the number of layers of two-qubit Clifford gates that cannot be executed in parallel, assuming all-to-all qubit connectivity. We also assume that each T magic-state injection circuit has a Clifford depth of 1.

B.1 Quantum fanout in logarithmic depth without ancillary qubits
In this section, we construct a controlled-NOT gate that targets n qubits, that is, $\mathrm{Cnot}_n|z\rangle\bigotimes_{j=0}^{n-1}|x_j\rangle = |z\rangle\bigotimes_{j=0}^{n-1}|x_j\oplus z\rangle$. The most straightforward approach applies n NOT gates in sequence, each controlled by the same qubit. A slight modification results in logarithmic depth as shown in Table 3.
Given any number of n qubits in states $|x_j\rangle$ for j = 0, 1, …, n − 1, one may use a ladder of n − 1 controlled-NOT gates to realize a linear reversible transformation of these bits. Let us call this unitary operation $\mathrm{Ladder}_n$. We now introduce a control qubit |z⟩. One implementation of $\mathrm{Cnot}_n$ is then obtained by applying $\mathrm{Ladder}_n^{\dagger}$, followed by a NOT on $|x_0\rangle$ controlled by |z⟩, and finally followed by $\mathrm{Ladder}_n$. This has a Clifford depth of 2n − 1 as depicted below for the example n = 4.
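The ladder construction can be verified on classical bit strings, since Cnot circuits act linearly over GF(2). The exact form of $\mathrm{Ladder}_n$ is not recoverable from the text, so the sketch below assumes it is the cascade that maps $x_j$ to $x_0 \oplus \cdots \oplus x_j$, which is one choice that makes the stated sequence work; the function names are hypothetical.

```python
def ladder(bits):
    """Assumed Ladder_n: a CNOT cascade mapping x_j to x_0 ^ ... ^ x_j."""
    out = list(bits)
    for j in range(1, len(out)):
        out[j] ^= out[j - 1]
    return out


def ladder_inverse(bits):
    """Inverse cascade: the same n - 1 CNOTs applied in reverse order."""
    out = list(bits)
    for j in range(len(out) - 1, 0, -1):
        out[j] ^= out[j - 1]
    return out


def fanout(z, bits):
    """Ladder^dagger, then a NOT on x_0 controlled by z, then Ladder."""
    out = ladder_inverse(bits)
    out[0] ^= z
    return ladder(out)


def sequential_depth(n):
    """Depth of the ladder-based fanout: (n - 1) + 1 + (n - 1) CNOT layers."""
    return 2 * (n - 1) + 1
```

Every target bit ends up flipped exactly when z = 1, and the layer count reproduces the Clifford depth of 2n − 1 quoted above.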

By distributing the controls and targets above in a tree-structure as depicted below for the example n = 4, the Clifford depth of $\mathrm{Cnot}_n$ may be reduced to $2\lceil\log_2 n\rceil + 1$.

As the control qubit |z⟩ is only used once in the above circuit, a further reduction in depth is possible by repeatedly using it to apply additional multi-target Cnot gates in each time-slice. Let us denote by $g(d) = 2^{(d-1)/2}$ the maximum number of qubits targeted by the above circuit in a depth of d. Then the total number of qubits n(d) targeted with this additional reduction satisfies a recurrence in terms of g. Let us denote by $D[\mathrm{Cnot}_n]$ the depth of this implementation of $\mathrm{Cnot}_n$.

B.2 Implementations of a Swap network
In this section, we detail various implementations of the unitary swap network Swap that moves a b-qubit quantum register indexed by x ∈ {0, …, N − 1} to the position of the x = 0 register, controlled by an index state |x⟩. More precisely, for any set of quantum states $\{|\phi_j\rangle\}$, $\mathrm{Swap}\,|x\rangle\bigotimes_{j=0}^{N-1}|\phi_j\rangle_j = |x\rangle|\phi_x\rangle_0\otimes(\cdots)$, where the final quantum states in registers indexed by x > 0 are unimportant. Let us express the index $x \equiv x_0x_1\cdots$ in binary, where $x_0$ is the smallest bit. Then it suffices to perform swaps, controlled by the bit $x_j$, between all pairs of registers indexed by $m2^{j+1}$ and $m2^{j+1} + 2^j$. Each controlled pair-wise swap may be understood as a circuit $\mathrm{CSwap}_n$ that swaps two n-qubit quantum registers in any state $|\psi\rangle, |\phi\rangle \in \mathbb{C}^{2^n}$, controlled by a single qubit $|z\rangle \in \mathbb{C}^2$. That is, $\mathrm{CSwap}_n|0\rangle|\psi\rangle|\phi\rangle = |0\rangle|\psi\rangle|\phi\rangle$ and $\mathrm{CSwap}_n|1\rangle|\psi\rangle|\phi\rangle = |1\rangle|\phi\rangle|\psi\rangle$. The overall cost of Swap is then the sum of costs of $\mathrm{CSwap}_n$, over all j, where $n = \lfloor\frac{N-1}{2^{j+1}} + \frac{1}{2}\rfloor$. We now consider different implementations that realize trade-offs between Toffoli-gate count, circuit depth, and ancillary qubit usage, as summarized in Table 4.
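The layered structure of Swap can be checked with a classical trace. This sketch assumes the pairing rule that level j exchanges registers $m2^{j+1}$ and $m2^{j+1} + 2^j$ when bit $x_j$ is set; that rule is inferred from the pair count $\lfloor\frac{N-1}{2^{j+1}} + \frac{1}{2}\rfloor$ rather than read off a surviving equation, and the function name is hypothetical.

```python
def swap_network(registers, x):
    """Move registers[x] to position 0; also return the number of pairs per level."""
    regs = list(registers)
    n = len(regs)
    pair_counts = []
    j = 0
    while (1 << j) < n:
        step = 1 << j
        # Level j pairs register i with register i + 2**j for i a multiple of 2**(j+1).
        pairs = [(i, i + step) for i in range(0, n - step, 2 * step)]
        pair_counts.append(len(pairs))
        if (x >> j) & 1:  # every swap in this level is controlled by bit x_j
            for lo, hi in pairs:
                regs[lo], regs[hi] = regs[hi], regs[lo]
        j += 1
    return regs[0], pair_counts
```

For N = 5 the levels contain 2, 1 and 1 pairs, in agreement with the floor formula, and the register indexed by x always ends at position 0.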

B.2.1 CSwap n in linear depth without ancillary qubits
It is simple to construct $\mathrm{CSwap}_n$ with depth O(n) without any ancillary qubits. The circuit that swaps two qubits is constructed from three Cnot gates. A circuit that implements Eq. (27), a swap between n pairs of qubits, is then this circuit repeated n times in sequence, each controlled by the same qubit.

B.2.2 CSwap n in logarithmic depth without ancillary qubits
Constructing $\mathrm{CSwap}_n$ with depth O(log n) without any ancillary qubits requires a little more thought. Let us consider a more general problem. Suppose we have an arbitrary unitary operator V that is self-inverse, meaning $V^2 = I$; one may verify that the two-qubit swap satisfies this property. Our goal is to implement a multi-target controlled-V gate on n registers. To begin, consider the following circuit identity, which is motivated by the 'toggling' trick in [42].
Observe that the bottom qubit may be dirty: its state does not affect the computation, and remains unchanged at the end of it. Thus a multi-target controlled-V on n registers may be constructed by applying n singly-controlled V gates in parallel before and after a single multiply-controlled not gate, using a total of n extra dirty qubits as follows.
As the additional qubits may be dirty, this is easily modified to use no ancillary qubits at all.Let us apply the multi-target V on ⌈n/2⌉ registers, using any ⌊n/2⌋ qubits from the other registers as dirty qubits.When n is odd, the topmost V may be controlled directly by the |z⟩ qubit.We then apply the same circuit on the remaining registers by using qubits in the initial targets as control qubits.In total, this uses at most 2n controlled-V gates and two multiply-controlled not gates on ⌈n/2⌉ qubits, each with cost given by Table 3.
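The reason a dirty qubit can relay the control follows from $V^2 = I$ and can be checked classically. The gate ordering below (controlled-V, Cnot, controlled-V, Cnot) is an assumption, since the circuit figure is not reproduced here, and V is modelled by a swap of two classical entries; the function names are hypothetical.

```python
def apply_v(pair):
    """A self-inverse operation V (here: exchange the two entries)."""
    return (pair[1], pair[0])


def toggled_controlled_v(z, dirty, pair):
    """Apply V to `pair` iff z == 1, using a dirty bit in an unknown state."""
    if dirty:            # V controlled by the dirty qubit
        pair = apply_v(pair)
    dirty ^= z           # Cnot from z onto the dirty qubit
    if dirty:            # V controlled by the (possibly toggled) dirty qubit
        pair = apply_v(pair)
    dirty ^= z           # second Cnot restores the dirty qubit
    return dirty, pair
```

If z = 0 the two controlled-V gates cancel, and if z = 1 exactly one of them fires, so the net effect is a V controlled by z with the dirty bit returned to its initial value.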

B.2.3 T-gate decomposition
Each controlled-swap may be decomposed into Clifford+T gates using standard techniques. For instance, the standard synthesis of each Toffoli uses 7 T-gates [12], as seen below. A further reduction to just 4N T-gates is possible if we allow the output state to be correct up to a phase factor. The decomposition by [19] below approximates the Toffoli gate up to a minus sign on one of the matrix elements.
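Thus a controlled-swap is obtained by a simple modification of this decomposition. One might then expect that $\mathrm{CSwap}_n$ in logarithmic depth requires 14N T-gates. However, simple cancellations using the above decomposition reduce this to 10N T-gates.
Divide the input into n − k and k bit pieces, where k is to be determined later. Let f, acting on the high order bits $x_{hi}$, be the function which outputs the function for all possible values of the low order bits, given the high order bits. Suppose, for the moment, that we have as many clean ancillary qubits as we want. We can naïvely compute $f(x_{hi})$ by constructing $e^{(n-k)}(x_{hi})$, making $b\cdot 2^k$ copies, and computing the parity of some subset of bits of $e^{(n-k)}(x_{hi})$ for each output bit of $f(x_{hi})$. Constructing $e^{(n-k)}(x_{hi})$ requires $O(2^{n-k})$ ancillary qubits, $O(2^{n-k})$ T gates and $O((n-k)^2)$ depth. The rest is done with $\mathrm{Cnot}_{b\cdot 2^k}$ gates to make copies and $\mathrm{Cnot}_{O(2^{n-k})}$ gates conjugated by Hadamards to compute parities. This uses many ancillary qubits: $O(2^{n-k})$ for the original vector times $O(b2^k)$ copies is O(bN) qubits.
Since the layers of Cnot gates compute (in the output bits) a linear function of the inputs, it is not difficult to adapt for dirty ancillary qubits. Just apply the linear function, flip the dirty ancillary qubits that are set to 1, then apply the linear function again, appealing to the identity $T(x\oplus y)\oplus T(y) = T(x)$ for a linear function. Thus, whatever state |y⟩ was in the dirty qubits, we still manage to compute T(x).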
We have shown how to compute f (x hi ) and XOR it into dirty ancillary qubits.We also know how to compute e k (x lo ) and XOR b copies of it into dirty ancillary qubits.Think of f (x hi ) as a b × 2 k matrix, then all that remains is to return the correct column of f (x hi ).If we also think of e k (x lo ) as a length 2 k column vector, then we are computing a matrix/vector product.The simplest way to do this is to make b copies of e k (x lo ) and execute b vector/vector products in parallel, at a cost of O(2 k ) Toffoli gates for each one.
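The matrix/vector picture can be made concrete on classical bits. The sketch below assumes that $e^{(n-k)}(x_{hi})$ and $e_k(x_{lo})$ are one-hot encodings of the high and low index bits, which is consistent with the description but not stated explicitly in the surviving text; the function names and the tuple-of-bits table layout are hypothetical.

```python
def one_hot(value, width):
    """One-hot bit vector of the given width with a single 1 at `value`."""
    return [1 if i == value else 0 for i in range(width)]


def lookup_via_one_hot(table, n, k, x):
    """Return table[x] (a tuple of b bits) via parities and a GF(2) matrix/vector product."""
    x_hi, x_lo = x >> k, x & ((1 << k) - 1)
    e_hi = one_hot(x_hi, 1 << (n - k))
    e_lo = one_hot(x_lo, 1 << k)
    b = len(table[0])
    # f(x_hi) as a b x 2^k bit matrix: each entry is a parity of a subset of e_hi.
    f = [[sum(e_hi[h] & table[(h << k) | lo][bit] for h in range(1 << (n - k))) % 2
          for lo in range(1 << k)]
         for bit in range(b)]
    # Product with e_lo: AND gates (Toffoli-like) accumulated by XOR, one row per output bit.
    return tuple(sum(f[bit][lo] & e_lo[lo] for lo in range(1 << k)) % 2
                 for bit in range(b))
```

Only the final products involve AND operations, $2^k$ per output bit, which mirrors the $O(2^k)$ Toffoli gates per vector/vector product quoted above; the parities that build f are linear and correspond to Cnot layers.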
To compute the Toffoli gates on dirty ancillary qubits, we decompose them into Cnot gates and single qubit gates.The layers of Cnot gates are linear, so it is possible to compute each such layer with dirty ancillary qubits.
Computing $f(x_{hi})$ uses O(bN) ancillary qubits, $O(2^{n-k})$ T gates, and has depth $O((n-k)^2 + k + \log b)$. The rest of the circuit has $O(b2^k)$ ancillas to store $f(x_{hi})$ and/or copies of $e_k(x_{lo})$, and uses $O(b2^k)$ T gates in depth $O(k^2 + \log b)$. Since the T gate count is $O(2^{n-k} + b2^k)$, we set k such that $2^k \approx \sqrt{N/b}$, and it becomes $O(\sqrt{bN})$. It is possible to trade off depth and number of ancillary qubits. We only need $O(2^{n-k})$ qubits to store $e^{(n-k)}(x_{hi})$ in the computation of $f(x_{hi})$, if we are willing to compute parities for each of the $b2^k$ output bits of $f(x_{hi})$ one at a time, in depth $O(b2^k)$. More generally, we can use $O(\lambda 2^{n-k})$ ancillary qubits for any integer $\lambda \in [1, b2^k]$ and use depth $O(b2^k/\lambda)$.

D Pure-state preparation implementation details
The approach by Shende, Bullock, and Markov [12] synthesizes a unitary A that prepares a pure state $A|0\rangle = \sum_x \frac{a_x}{\|\vec a\|_2}|x\rangle = |\psi\rangle$ with arbitrary coefficients in $N = 2^n$ dimensions. The underlying circuit is built from j ∈ [n] multiply-controlled arbitrary single qubit rotations $U_j$, where $U_j = \sum_{x=0}^{2^j-1}|x\rangle\langle x|\otimes e^{i2\pi\theta_{j,x}Z}$, for some set of rotation angles $\theta_{j,x}$. Note that it suffices to consider Z-phase rotations as rotations about the X, Y Pauli operators are equivalent up to a single-qubit Clifford similarity transformation. Each multiplexor is applied twice: once to create a pure state with the right probabilities $|a_x|^2$, and once to apply the correct phase $e^{i\arg[a_x]}$. Below we describe how $U_j$ may be implemented using the data-lookup oracle of Eq. (5) and evaluate the overall error and cost of state preparation.

D.1 Multiply-controlled phase gate from data lookup oracles
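Consider a multiply-n-controlled arbitrary single qubit rotation $U = \sum_x |x\rangle\langle x|\otimes e^{i2\pi\theta_x Z}$, where each rotation angle $\theta_x \in [0, 1)$. Given a number state |x⟩ and an arbitrary single-qubit state |z⟩, this unitary performs a controlled-rotation $U|x\rangle|z\rangle = |x\rangle e^{i2\pi\theta_x Z}|z\rangle$. Each rotation angle has a binary expansion $\theta_x = \sum_{k\ge 0}\theta_{x,k}2^{-(k+1)}$ with bits $\theta_{x,k}\in\{0,1\}$. By truncating the above to b bits of precision, we obtain an integer approximation $a_x$ of $\theta_x$, where $a_x = \sum_{k=0}^{b-1}a_{x,k}2^{b-1-k}$ with $a_{x,k} = \theta_{x,k}$, so that $|\theta_x - a_x/2^b| \le 2^{-b}$. Let us encode these values of $a_x$ into the data-lookup oracle, and express its T cost as the function f(n, b). Its output is then $|x\rangle|a_{x,0}\rangle\cdots|a_{x,b-1}\rangle|\mathrm{garbage}_x\rangle$, where we explicitly represent the number state $|a_x\rangle$ in terms of its component qubits.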
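The truncation step and the bit-by-bit application of the phase can be illustrated numerically. The sketch assumes the bit convention in which bit k, counted from the most significant, contributes a phase $e^{i\pi/2^k}$ on the Z = +1 eigenstate; the function names are hypothetical.

```python
import cmath
import math


def truncate_angle(theta, b):
    """Integer a with a / 2**b <= theta < (a + 1) / 2**b, for theta in [0, 1)."""
    return math.floor(theta * (1 << b))


def phase_from_bits(a, b):
    """Product of the phases exp(i*pi/2**k) over the set bits of a, most significant first."""
    phase = 1 + 0j
    for k in range(b):
        if (a >> (b - 1 - k)) & 1:
            phase *= cmath.exp(1j * math.pi / (1 << k))
    return phase
```

For θ = 0.3 and b = 10 the truncation gives a = 307, within $2^{-10}$ of θ, and the product of the b bit-controlled phases equals $e^{i2\pi a/2^b}$, so each stored bit controls exactly one fixed rotation.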
Now, each controlled-arbitrary phase rotation decomposes into 2 arbitrary single-qubit rotations and Cnot gates. Note that a decomposition into 1 arbitrary single-qubit rotation is possible if we modify the above for the range $\theta_x \in (-1/2, 1/2)$, but the explanation is slightly more complicated.
As any arbitrary single-qubit rotation is approximated to error ϵ using O(log(1/ϵ)) T gates [43], we may use a triangle inequality to bound the error ∥U − U′∥ of this approach. Another possible approximation, call it U″, uses an adder, controlled by the target state |z⟩, acting on the registers containing the desired phase rotation $|a_x\rangle$ and the Fourier state. This realizes the same transformation in Eq. (40), using the circuit depicted below.
The state preparation unitary A ′′ applies n such approximations U ′′ j to the multiply-controlled rotations U j , leading to an overall error bounded by

Figure 1: (a) Example Select operator $\sum_{x=0}^{N-1}|x\rangle\langle x|\otimes X^{a_x}$ with N = 4. The symbol ⊘ indicates control by a number state. A naive decomposition of all multiply-controlled-Nots requires O(N log N) Clifford+T gates and only one dirty qubit [19]. Cancellation of adjacent gates can reduce this to only O(N) [20, 8], but by using additional $\lceil\log_2 N\rceil$ clean qubits. (b) Example Swap network with N = 4 using O(bN) Clifford+T. Any arbitrary state $|\phi_x\rangle_x$ in register index x is swapped to the x = 0 position. (c) The SelectSwap network with N = 16, λ = 4 that combines the above two approaches. (d) Modification of SelectSwap network that uses $2\lceil\log_2 N\rceil + b$ clean qubits and bλ dirty qubits to implement the data-lookup oracle of Eq. (5) without garbage. We omit the Select ancillary qubits for clarity.

Figure 2: T gate count dependence on number of dirty qubits exploited for approximating an arbitrary quantum state of dimension $N = 10^{\{2,4,6,8\}}$ to error $10^{-3}$ using our algorithm (dots) in comparison with the standard ancillary-free approach (dashed) [12]. Note that at this error, b ≥ 17 qubits are required to represent each coefficient in binary, but this may be halved by randomization techniques [23]. Moreover, one may always use fewer than the maximum number of available dirty qubits.

Thus one might expect that $\mathrm{CSwap}_n$ in logarithmic depth requires 14N T-gates. However, simple cancellations using the above decomposition reduce this to 10N T-gates.
n−k ) ancillary qubits, O(2 n−k ) T gates and O((n − k) 2 ) depth.The rest is done with Cnot b•2 k gates to make copies and Cnot O(2 n−k ) gates conjugated by Hadamards to compute parities.This uses many ancillary qubits-O(2 n−k ) for the original vector times O(b2 k ) copies is O(bN ) qubits.

∥⃗
a∥2 |x⟩ = |ψ⟩ with arbitrary coefficients in N = 2 n dimensions.The underlying circuit, illustrated below for the example of N = 8 for positive coefficients, is built from j ∈ [n] multiply-controlled arbitrary single qubit rotations, U j where U j = 2 j −1 x=0

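D.1.1 Approach using arbitrary single-qubit synthesis
One possible approximation, call it U′, of U in Eq. (35) applies the single-qubit rotation $e^{i\pi Z/2^k}$ to the target state |z⟩, controlled by the state $|a_{x,k}\rangle$. The garbage register is then uncomputed by running O in reverse. Explicitly, this circuit realizes the transformation
$|x\rangle|0\rangle^{\otimes b}|0\rangle|z\rangle \to |x\rangle\bigotimes_{k=0}^{b-1}|a_{x,k}\rangle|\mathrm{garbage}_x\rangle|z\rangle \to |x\rangle\bigotimes_{k=0}^{b-1}|a_{x,k}\rangle|\mathrm{garbage}_x\rangle e^{i\pi\sum_{k=0}^{b-1}a_{x,k}Z/2^k}|z\rangle = |x\rangle|a_x\rangle|\mathrm{garbage}_x\rangle e^{i2\pi a_x Z/2^b}|z\rangle \to |x\rangle|0\rangle^{\otimes b}|0\rangle e^{i2\pi a_x Z/2^b}|z\rangle.$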

Table 2: Upper bounds on cost of possible implementations of the data lookup oracle O of Eq. (5). See Appendices B.1, B.2 and C for other variations, such as a linear-depth phase-incorrect version of Swap using ≤ 4bN T gates and no additional ancillary qubits. Our results allow for a space-depth trade-off determined by a choice of λ ∈ [1, N], with a minimized T gate complexity of $O(\sqrt{bN})$ by choosing $\lambda = O(\sqrt{N/b})$. Note that bλ qubits of the Fig. 1d implementation may be dirty.
λ+j , where x ∈ [⌊N/λ⌋].Swap is then controlled |r⟩ to move the desired data entry |a x ⟩ to the output register.As the T gate complexity of O(λb + N λ ) is determined only by the dimension of the Select and Swap control registers, this is minimized with value is qΓ = O(λ 2 b 2 + bN + log (N )(1/λ + λb)), this is optimal up to logarithmic factors so long as the number of T-gates dominates the qubit count like λ = o( N/b), which is the case in most quantum circuits of interest.A similar lower bound on state preparation is obtained by counting the number of dimension-N quantum states that a distinguishable with error ϵ.Without loss of generality, we only count quantum states |ψ⟩ ∈ R N with real coefficients.These states live on the surface a unit-ballB N of dimension N , with area Area[B N ] = 2π N/2(N/2−1)! .Let us now fix a state |ψ⟩.Then the states |χ⟩ that satisfy ∥|ψ⟩

Table 3: Different implementations of a controlled-NOT gate Eq. (20) that targets n qubits.

Table 4 :
Different implementations of a controlled-swap between two n-qubit registers.The depth D[Cnotn] ≤ 1} n and y ∈ {0, 1} b .There is a circuit for U of depth O(log 2 N + log b), using O(bN ) dirty ancillary qubits and Divide the input into n − k and k bit pieces, where k is to be determined later.Let f : {0 n−k ) ancillary qubits for any integer λ ∈ [1, b2 k ] and use depth O(b2 k /λ).For the optimal T count we use the same setting of k, giving O(