Spacetime-Efficient Low-Depth Quantum State Preparation with Applications

We propose a novel deterministic method for preparing arbitrary quantum states. When our protocol is compiled into CNOT and arbitrary single-qubit gates, it prepares an $N$-dimensional state in depth $O(\log(N))$ and spacetime allocation (a metric that accounts for the fact that oftentimes some ancilla qubits need not be active for the entire circuit) $O(N)$, which are both optimal. When compiled into the $\{\mathrm{H,S,T,CNOT}\}$ gate set, we show that it requires asymptotically fewer quantum resources than previous methods. Specifically, it prepares an arbitrary state up to error $\epsilon$ with optimal depth of $O(\log(N) + \log (1/\epsilon))$ and spacetime allocation $O(N\log(\log(N)/\epsilon))$, improving over $O(\log(N)\log(\log (N)/\epsilon))$ and $O(N\log(N/\epsilon))$, respectively. We illustrate how the reduced spacetime allocation of our protocol enables rapid preparation of many disjoint states with only constant-factor ancilla overhead -- $O(N)$ ancilla qubits are reused efficiently to prepare a product state of $w$ $N$-dimensional states in depth $O(w + \log(N))$ rather than $O(w\log(N))$, achieving effectively constant depth per state. We highlight several applications where this ability would be useful, including quantum machine learning, Hamiltonian simulation, and solving linear systems of equations. We provide quantum circuit descriptions of our protocol, detailed pseudocode, and gate-level implementation examples using Braket.


Introduction
Quantum state preparation (QSP) is a crucial subroutine in many proposed quantum algorithms that claim speedup over their classical counterparts in applications such as quantum machine learning [1][2][3][4][5][6][7][8], simulating quantum systems [9][10][11][12], and solving linear systems.

Figure 1: Illustration of spacetime allocation in our state preparation method (SP+CSP) vs. previous methods. Green regions correspond to spacetime allocated for arbitrary single-qubit rotations. Orange regions correspond to spacetime allocated to qubits that are active (i.e., not in $|0\rangle$) without single-qubit rotations; they might experience, e.g., CNOT or Toffoli gates or might be idling while entangled with other qubits. In our method, the spacetime allocation is asymptotically smaller than the product of qubit count and depth.
Table 1: Comparison of our results to previous state-of-the-art low-depth QSP methods. $N = 2^n$ is the total number of basis/parameters, and $\epsilon$ is the error precision parameter. Note that Sun et al. [17] proposed additional variations of the QSP protocol with larger circuit depth and fewer ancilla qubits that all have spacetime allocation of $\Theta(N)$ and $O(N\log(N/\epsilon))$. The dagger † in the row for Ref. [18] indicates lower bounds from our calculation because the paper did not analyze the Clifford + T costs. Note that T depth is optimized in Clader et al. [20]. We also note that our SP+CSP protocol allows some ancilla registers to be dirty, similar to that in [16]. See the corresponding theoretical lower bound for spacetime allocation using the {H, S, T, CNOT} gate set in Sec. 2.3.

The state preparation problem is to create an arbitrary quantum state of the form
$$|\psi\rangle = \frac{1}{\|x\|}\sum_{i=0}^{2^n-1} x_i\,|i\rangle, \tag{1}$$
where $n$ denotes the number of qubits, $\|\cdot\|$ denotes the standard Euclidean norm, and $x$ is a $2^n$-dimensional vector with components $x_i \in \mathbb{C}$. In many applications, it is sufficient to let $x_i \in \mathbb{R}^+ \cup \{0\}$, and henceforth, we assume this for simplicity; the additional phase information can be easily incorporated with minimal overhead (see App. C). We also denote $N = 2^n$ as the total number of parameters encoded, the same as the dimension of the vector space.
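As a concrete illustration of the target in Eq. (1), the following minimal sketch computes the amplitudes $x_i/\|x\|$ classically for a small nonnegative vector. The function name `target_amplitudes` is hypothetical and NumPy is an assumption of this sketch; it is not part of the protocol itself.

```python
import numpy as np

def target_amplitudes(x):
    """Return the amplitudes x_i / ||x|| of the n-qubit target state."""
    x = np.asarray(x, dtype=float)
    # The vector length must be N = 2^n for some integer n >= 0.
    if x.size == 0 or x.size & (x.size - 1):
        raise ValueError("length of x must be a power of two")
    norm = np.linalg.norm(x)
    if norm == 0:
        raise ValueError("x must be nonzero")
    return x / norm

# Example with n = 2 qubits (N = 4 basis states).
amps = target_amplitudes([3.0, 0.0, 4.0, 0.0])
print(amps)
```

The squared amplitudes sum to one, so the output is a valid list of measurement probabilities for the basis states $|00\rangle, \ldots, |11\rangle$.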
A recent advance by Sun et al. [17] gave an optimal construction that creates $|\psi\rangle$ with $\Theta(2^n/n)$ circuit depth using arbitrary single-qubit gates and two-qubit CNOT gates (henceforth called the {U(2), CNOT} gate set) and no ancilla qubits. On the other hand, if ancilla qubits are available, one can dramatically reduce the circuit depth. It is preferable to use low-depth protocols when the state-preparation procedure must be completed quickly or repeated many times sequentially. Oftentimes, the quantum algorithm that follows the preparation of $|\psi\rangle$ (for instance, making a machine learning inference) runs in depth $\mathrm{poly}(n)$, exponentially faster than the ancilla-free QSP implementations, so low-depth state preparation is vital to make the overall runtime reasonable.
Toward that end, recent deterministic methods [17,19,21] have achieved optimal $\Theta(n)$ quantum circuit depth in the {U(2), CNOT} gate set. However, this exponential depth reduction comes with an exponential overhead in space. In other words, one would need an exponential number of ancilla qubits. Recent work [18] is able to achieve $\Theta(n)$ depth with only $\Theta(2^n/n)$ ancilla qubits, which is optimal.
When considering the total required quantum resource tradeoffs, metrics such as circuit depth, circuit size, and qubit count are typically considered. Assuming the physical architecture allows one to perform gates in parallel, the circuit depth is a proxy for the overall runtime of the computation, and the qubit count represents the overall amount of space that must be allocated to the computation. We now propose another metric, spacetime allocation: the total time that each individual qubit must be active (i.e., not in the $|0\rangle$ state), summed over all qubits. The spacetime allocation is bounded below by the circuit size and bounded above by the product of the qubit count and the circuit depth, but it carries a distinct operational meaning. In a model where one wishes to perform many distinct ancilla-intensive jobs (such as rapidly preparing many independent $n$-qubit states) with a fixed number of ancilla qubits, the availability of fresh ancillae in the state $|0\rangle$ becomes the algorithmic bottleneck. Assuming ancillae can be reallocated from one job to another as soon as they are returned to the $|0\rangle$ state, the overall runtime to complete the batch of jobs is determined by the spacetime allocation of the jobs rather than their depth or size. State-preparation algorithms that are optimal in terms of depth or size are not necessarily also optimal in terms of spacetime allocation. We will show that when compiled into the {U(2), CNOT} gate set, our state preparation protocol is simultaneously optimal in depth, size, and spacetime allocation up to constant factors. This feat has already been achieved by the state preparation method of Ref. [18], which has $\Theta(n)$ depth and $\Theta(2^n/n)$ qubit count, along with $\Theta(2^n)$ size and spacetime allocation.
However, in practice, it may not be possible to perform arbitrary single-qubit gates to exact precision, and thus the {U(2), CNOT} gate set may not be applicable. In this case, we cannot hope to prepare the state $|\psi\rangle$ exactly; rather, given an error parameter $\epsilon$, we seek to prepare a state $|\tilde{\psi}\rangle$ such that
$$\big\| |\psi\rangle - |\tilde{\psi}\rangle \big\| \leq \epsilon. \tag{2}$$
For example, this is the case in most proposals for fault-tolerant quantum computation based on error-correcting codes, where logical single-qubit gates are approximately performed using a sequence of logical gates drawn from a discrete gate set (see, e.g., [22]). A common choice of discrete gate set is {H, S, T, CNOT} (defined in the next section); we quote the scaling of the resource cost of our protocol for approximate QSP when compiled into this gate set. We use the terminology approximate spacetime allocation and exact spacetime allocation, denoted as $SA_{approx}$ and $SA_{exact}$, to differentiate the cost in the two models. Similarly, we denote the circuit depth in the two models by $D_{approx}$ and $D_{exact}$. Crucially, in the {H, S, T, CNOT} gate set, for constant target error $\epsilon$, the method of Ref. [18] does not achieve $\Theta(n)$ depth. While they do not explicitly report the depth in this gate set, this can be seen from the fact that their circuit has $O(2^n)$ arbitrary single-qubit gates spread out over $O(n)$ distinct layers; to convert to the discrete gate set while incurring constant overall error, each single-qubit gate must be approximately decomposed into a sequence of depth at least $\Omega(\log(n))$, yielding total depth at least $\Omega(n\log(n))$. Our state preparation method will have depth $\Theta(n)$ even in the discrete gate set, when $\epsilon$ is taken as a constant.
The value of minimizing with respect to spacetime allocation is also apparent in the context of realistic hardware considerations. For example, in gate-based near-term devices without error correction, reducing the spacetime allocation (even as circuit size remains the same) amounts to reducing the time that qubits are left idling. By deallocating and reinitializing ancillas in $|0\rangle$, this kind of idling time can be minimized, and the impacts of noise can be made minimal. Moreover, idling of logical qubits is also costly in fault-tolerant architectures, which require continuous error correction even when no gates are being performed. Each round of error correction requires a set of parity check circuits and measurements, followed by a nontrivial classical decoding calculation, all of which contribute to the overall energy expenditure of the computation.
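Another feature of our constructions is that they are garbage-free. This contrasts with some previous work (e.g., [23,24]), which performs a relaxed version of the QSP task, where, instead of preparing the $n$-qubit state $|\psi\rangle$ from Eq. (1), one aims to prepare an $(n+a)$-qubit state $|\Psi\rangle$ given by
$$|\Psi\rangle = \frac{1}{\|x\|}\sum_{i=0}^{2^n-1} x_i\,|i\rangle\,|\mathrm{garbage}_i\rangle,$$
where $|\mathrm{garbage}_i\rangle$ is an $a$-qubit state that may depend on $i$. Typically, $a$ would scale at least linearly in $n$; these $a$ qubits are left entangled with the $n$ data qubits.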
Allowing garbage makes the QSP task easier, and in some applications, garbage is tolerable (since the prepared state only acts as control qubits). However, as long as the $a$ ancilla qubits are entangled with the data, they cannot be used as fresh $|0\rangle$ ancillae for other tasks, and they continue to contribute toward the spacetime allocation of the QSP protocol. Moreover, it is important to emphasize that this is a fundamentally different task. After tracing out the garbage, the $n$-qubit state that results is a mixed state, which lacks the coherence of the pure state in Eq. (1). This is especially relevant when one wants to directly manipulate the prepared states (e.g., in quantum linear system solvers and quantum machine learning applications) rather than only using them as control bits.
Lastly, we point out an interesting feature of our state preparation protocol: a constant fraction of the $O(N)$ ancilla qubits needed to achieve logarithmic depth can be dirty, that is, they can start in an arbitrary state, and they will be returned to the state they began in at the end of the circuit. Dirty qubits are appealing because they can be prepared without spending resources to fault-tolerantly prepare high-fidelity $|0\rangle$ states. Additionally, dirty qubits can be qubits from another computation experiencing a long period of idling, provided they are returned to their original state before the end of the idling period. This situation is especially salient when using the {H, S, T, CNOT} gate set, since our state preparation circuits involve a layer where some qubits are idling and others experience a single-qubit gate sequence approximating a single-qubit rotation; the idling qubits can be used as dirty ancillae for other state preparation instances. The possibility of state preparation ancillae being dirty has been previously explored in Ref. [16], which studied tradeoffs between the number of T gates and the number of dirty ancillae. We believe that, with additional innovations, it may be possible for the number of clean qubits to be further reduced to be asymptotically better than $O(N)$.
The contributions of this work are the following:

1. We propose a novel, deterministic, garbage-free quantum state preparation method, which we call SP+CSP (Section 3), and we show that in the {U(2), CNOT} gate set it simultaneously achieves optimal depth and spacetime allocation. We also show that in the {H, S, T, CNOT} gate set, it achieves depth and spacetime allocation that are both asymptotically superior to previous methods (see Table 1). A matching lower bound shows that the depth scaling is optimal in this gate set, whereas the best lower bound on spacetime allocation leaves open the possibility of improvements by logarithmic factors.
2. We show how the optimal spacetime allocation of our protocol allows it to prepare multiple copies of a state more rapidly than other methods (Section 4).
3. We discuss several applications of the optimal spacetime allocation quantum state preparation protocol (Section 5), including
   (a) Quantum machine learning
   (b) Hamiltonian simulation
   (c) Solving linear systems with HHL-style quantum algorithms
4. We provide a circuit-level implementation for the SP+CSP method (Sec. 3 & App. A) with detailed pseudocode.
We summarize our results compared to other state-of-the-art results in Table 1, with a pictorial depiction in Fig. 1.
To achieve these results, we build from the work of Clader et al. [20], who gave a QSP method with depth $O(n)$ and qubit count $O(2^n)$ under the assumption that the FANOUT-CNOT operation (that is, the product of up to $O(2^n)$ CNOT gates sharing the same control qubit) could be performed in a single time step. Our protocol (we call it the SP protocol) eliminates the need for FANOUT-CNOT at the expense of only constant-factor overhead in the number of ancilla qubits and the depth. The high-level idea is to use a tree-like data copying circuit consisting only of CNOT gates to copy the control bit into many ancillas, so it can then be used to control many operations in parallel (see App. A.1). However, this observation alone is not sufficient: to achieve $O(n)$ overall depth, we require a delicate method that alternates between layers of control-bit copying and layers of controlled operations (see App. A.4). Like that of Clader et al., this circuit has the feature that only a constant number of layers involve single-qubit rotations (which incur depth $O(\log(n/\epsilon))$ when approximately decomposed into a finite gate set), allowing the overall depth to scale as $O(n + \log(1/\epsilon))$. In comparison, previous methods (e.g., [17,19]) assemble these single-qubit rotations across at least $n$ layers, which would then make the total depth at least $\Omega(n\log(n/\epsilon))$.
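The tree-like copying idea can be illustrated with a small classical sketch. It schedules CNOT layers that spread one control bit onto fresh ancillas by doubling, and then applies them to a computational basis state. This is only a toy model under stated assumptions: the function names are hypothetical, the qubit labelling (control at index 0, ancillas after it) is a choice of this sketch, and the construction in App. A.1 may differ in detail. Note that this copies the bit in the computational basis only; it does not clone a quantum state.

```python
def copy_tree_layers(num_copies):
    """Schedule CNOTs that spread qubit 0 onto qubits 1..num_copies-1.

    Each layer is a list of (control, target) pairs acting on disjoint
    qubits, so every layer can be executed in a single time step.
    """
    layers = []
    have = 1  # number of qubits that already hold the control bit
    while have < num_copies:
        new = min(have, num_copies - have)
        # Qubits 0..new-1 act as controls; qubits have..have+new-1 are targets.
        layers.append([(c, have + c) for c in range(new)])
        have += new
    return layers

def run_on_basis_state(layers, bits):
    """Apply the CNOT layers to a classical bit string (a basis state)."""
    bits = list(bits)
    for layer in layers:
        for control, target in layer:
            bits[target] ^= bits[control]
    return bits

layers = copy_tree_layers(8)
print(len(layers))                                 # 3
print(run_on_basis_state(layers, [1] + [0] * 7))   # [1, 1, 1, 1, 1, 1, 1, 1]
```

Because the number of copies doubles in each layer, producing $2^n$ copies takes $n$ CNOT layers, which is the source of the $O(n)$ depth overhead when a FANOUT-CNOT with $2^n$ targets is replaced by two-qubit CNOT gates.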
The SP protocol (and protocols designed in previous works such as [17] and [19]) does not achieve optimal spacetime allocation since it fundamentally requires $O(2^n)$ ancilla qubits to be entangled with the $O(n)$ data qubits for a constant fraction of the $O(n)$ circuit depth, leading to $O(n2^n)$ spacetime allocation (similar to what is illustrated on the left side of Fig. 1). To circumvent this, we pursue an additional idea: prepare roughly half of the qubits using the SP method, and then perform controlled state preparation (CSP) from those qubits into the rest of the qubits. Both SP and CSP require only $O(1)$ layers of single-qubit rotations, preserving the $O(n + \log(1/\epsilon))$ depth for the approximate compilation. Moreover, the full $O(2^n)$ ancilla qubits are only needed very briefly during the CSP procedure, and most can be freed up after only $O(1)$ depth (illustrated on the right side of Fig. 1). Ultimately, we show that $O(2^n)$ spacetime allocation can be achieved. We also give a detailed circuit implementation that shows the difference in spacetime allocation requirements in App. A.7.

Background
This section will introduce the required quantum gates and provide more details about the spacetime allocation metric, as well as relevant lower bounds and a summary of the relevant prior work.

Quantum Gates
In this work, we will use the following quantum gates: X gate, S gate, T gate, Hadamard (H) gate, $R_y$ gate, CNOT gate, SWAP gate, controlled-$R_y$ gate, Fredkin gate, and Toffoli gate.
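The redundancies among the elementary gates (for instance $S = T^2$ and $X = HS^2H$) can be checked numerically. The sketch below assumes the standard textbook matrices for H, S, T, X, and $R_y(\theta) = e^{-i\theta Y/2}$; these conventions are an assumption of this example and may differ from the paper's own table by a global phase or sign convention.

```python
import numpy as np

# Standard single-qubit gate matrices (textbook conventions, assumed here).
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
S = np.diag([1, 1j])
T = np.diag([1, np.exp(1j * np.pi / 4)])
X = np.array([[0, 1], [1, 0]])

def ry(theta):
    """R_y(theta) = exp(-i * theta * Y / 2), a real rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

# Redundancies within the discrete gate set.
s_from_t = np.allclose(T @ T, S)
x_from_hs = np.allclose(H @ S @ S @ H, X)
print(s_from_t, x_from_hs)  # True True

# R_y(theta) applied to |0> gives cos(theta/2)|0> + sin(theta/2)|1>.
theta = 0.7
state = ry(theta) @ np.array([1.0, 0.0])
print(state)
```

With this convention, $R_y(\theta)|0\rangle$ has real, nonnegative amplitudes for $\theta \in [0, \pi]$, which is why a single $R_y$ rotation per angle suffices when the target amplitudes are nonnegative reals.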

Table 2: Elementary quantum gates (columns: Gate, Unitary). Note that there is some redundancy, as $S = T^2$ and $X = HS^2H$. The $R_y$ gate can be approximated to arbitrary precision as a sequence of H, S, and T gates (see, e.g., [22]).

Table 2 shows the elementary quantum gates required for the {U(2), CNOT} gate set and the {H, S, T, CNOT} gate set. We also summarize all other required quantum gates that can be constructed in Figs. 2, 3, 4, and 5.

Spacetime Allocation
In classical computing, "core hours" [27] refers to the number of CPUs used to run a certain computing task multiplied by the duration of the job in hours.
We now define a quantum analogue of the classical "core hours" as the quantum spacetime allocation cost, equivalently defined in either of the following ways:
• Sum of the individual duration (depth) that each logical qubit is active (i.e., not in the $|0\rangle$ state)
• Sum of the number of active qubits in each layer
Or quantitatively:
$$SA = \sum_{i=1}^{q} d_i = \sum_{t=1}^{d} q_t,$$
where $d$ is the total depth, $q$ is the total number of qubits, $d_i$ is the active time (depth) for the $i$-th qubit, and $q_t$ is the number of active qubits at layer $t$.
In the case that all qubits are active for the entirety of the computation, the spacetime allocation is simply the product of the circuit depth and the total number of qubits. However, if most of the ancilla qubits are needed only briefly and can be reset to the $|0\rangle$ state before the end of the computation, the spacetime allocation can be significantly less [28]. In this work, we will show that freeing up the ancilla qubits at early times can bring an asymptotic spacetime allocation advantage for quantum state preparation.
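The two equivalent ways of counting spacetime allocation can be sketched on a toy activity table. Here `active[i, t]` marks whether qubit `i` is active during layer `t`; the table representation and the function name `spacetime_allocation` are assumptions made for illustration only.

```python
import numpy as np

def spacetime_allocation(active):
    """Spacetime allocation of a circuit given its qubit-activity table.

    active[i, t] is True if qubit i is active (not in |0>) during layer t.
    """
    active = np.asarray(active, dtype=bool)
    per_qubit = active.sum(axis=1)  # d_i: active time of each qubit
    per_layer = active.sum(axis=0)  # q_t: active qubits in each layer
    # Both ways of counting must agree.
    assert per_qubit.sum() == per_layer.sum()
    return int(per_qubit.sum())

# Toy example: one data qubit active in all 4 layers, and three ancillas
# that are each active for a single layer before being reset to |0>.
active = np.zeros((4, 4), dtype=bool)
active[0, :] = True
active[1, 1] = active[2, 1] = active[3, 1] = True

sa = spacetime_allocation(active)
print(sa, active.shape[0] * active.shape[1])  # 7 16
```

In this toy case the spacetime allocation (7) is well below the product of qubit count and depth (16), which is the kind of gap the SP+CSP protocol exploits at scale.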

Lower Bounds
Previous work has shown QSP lower bounds for circuit depth, size, number of qubits, and the corresponding tradeoffs therein [17,29] when using only arbitrary single-qubit gates and CNOT gates. Since each gate in the circuit acts on at least one qubit, at least one qubit must be active per gate in each layer. Thus, the number of gates provides a lower bound for the spacetime allocation.

Lemma 2.1. If a quantum circuit has circuit size $C$, its spacetime allocation must be at least $C$.
Consequently, if we wish to lower bound the spacetime allocation, it suffices to bound the circuit size. An arbitrary quantum state is described by $O(2^n)$ real parameters. Each single-qubit gate can introduce at most $O(1)$ real parameters, so there must be $\Omega(2^n)$ single-qubit gates. Additionally, any consecutive single-qubit gates on the same qubit that do not have a CNOT in between can be combined into a single gate, so the number of CNOTs must be of the same order as the number of single-qubit gates. This line of reasoning gives rise to a lower bound on circuit size [29].

Proof. We follow logic similar to Theorem 9 of [30]. Following Lemma 1 in [30], the circuit size $C$ must satisfy $C\log(C) \geq \Omega(2^n\log(1/\epsilon))$. Letting $U = C\log(C)$ and taking the logarithm on both sides, we have $\log(U) = \log(C\log(C)) \geq \log(C)$, and hence
$$C = \frac{U}{\log(C)} \geq \frac{U}{\log(U)} = \Omega\!\left(\frac{2^n\log(1/\epsilon)}{\log\!\big(2^n\log(1/\epsilon)\big)}\right).$$

Note that Ref. [16] has shown that the number of T gates can be significantly smaller than this lower bound, an interesting fact since T gates are significantly more expensive than the other gates in many approaches to fault-tolerant quantum computation. Lower bounds on depth can also be shown in both gate sets.

Lemma 2.4 (Theorem 3 of [17]). A quantum circuit consisting of gates drawn from {U(2), CNOT} that prepares an arbitrary $n$-qubit state using $q$ ancilla qubits must have depth at least $\Omega\!\left(\max\!\left(n, \frac{2^n}{q+n}\right)\right)$.

Lemma 2.5. A quantum circuit consisting of gates drawn from {H, S, T, CNOT} that prepares an arbitrary $n$-qubit state using $O(2^n)$ ancilla qubits must have depth at least $\Omega(n + \log(1/\epsilon))$.
Proof. First we examine the $n$ dependence for any constant $\epsilon < 1$. Following a similar argument as [17, Theorem 3], at depth $D$, the number of gates that are in the "lightcone" of the $n$ data qubits is upper bounded by $n2^D$. The rest of the gates can be ignored. Since there are a constant number of gates in the gate set, the total number of unique output states is upper bounded by $e^{O(n2^D)}$. Meanwhile, Eq. (4.85) of Ref. [31] asserts that the number of circuits needed to cover the entire set of $n$-qubit states up to $\epsilon = O(1)$ precision is at least $e^{\Omega(2^n)}$. Together, these imply that $D = \Omega(n)$. Separately, we can lower bound the $\epsilon$ dependence as $D = \Omega(\log(1/\epsilon))$ directly from Theorem 9 of [30], once we set $n_{anc} = O(2^n)$. These two bounds hold independently for sufficiently large $n$ and sufficiently small $\epsilon$; thus the sum of the bounds must also hold (up to a constant factor that can be absorbed into big-$\Omega$), implying the overall stated bound of $\Omega(n + \log(1/\epsilon))$.
The SP+CSP circuit construction illustrated in Sec. 3 will give an upper bound that matches the lower bound of Lemma 2.5.

{U(2), CNOT} gate set
To the best of our knowledge, the state-of-the-art quantum state preparation methods [17,18] achieve the following depth:
$$D_{exact} = \Theta\!\left(\max\!\left(n, \frac{2^n}{a+n}\right)\right).$$
Here $D_{exact}$ labels the depth of the exact quantum state preparation circuit using the {U(2), CNOT} gate set, $a$ denotes the number of ancillae available, and $n$ denotes the number of qubits of the desired arbitrary quantum state to be prepared.
We can compute the spacetime allocation upper bound by simply multiplying the depth by the number of qubits required; we find that the lower bound of $\Omega(2^n)$ is saturated for any choice of ancilla count $a = O(2^n/n)$, and in particular:
$$SA_{exact} = \begin{cases} \Theta(2^n), & a = O(2^n/n), \\ O(na), & a = \Omega(2^n/n). \end{cases}$$
Interestingly, Sun et al. [17] also proved, from a graph-theoretic perspective, a depth lower bound of $\Omega(n)$ for circuits that use arbitrary single-qubit and two-qubit gates, regardless of the number of ancillary qubits. In Sec. 3, we show that our SP+CSP state preparation protocol can also achieve $\Theta(n)$ depth and $\Theta(2^n)$ spacetime allocation. This provides an alternative construction to that of Ref. [18], which is also optimal.

{H, S, T, CNOT} gate set
When compiled into gates from {H, S, T, CNOT}, previous methods [17,19] achieve a depth that depends on the number of ancillae $a$ available, and the corresponding spacetime allocation cost can be computed by multiplying the circuit depth by the number of qubits involved (see Table 1). We also note that the method proposed by Yuan et al. [18] will have at least $\Omega(n\log(n/\epsilon))$ depth and $\Omega(2^n\log(n/\epsilon))$ spacetime allocation. This was not stated explicitly in the paper but can be deduced from the fact that single-qubit rotations appear in $O(n)$ different layers and each requires depth at least $\Omega(\log(n/\epsilon))$ in the discrete gate set.
In addition, Clader et al. [20] gave a method with depth $O(\log(2^n/\epsilon))$ and spacetime allocation $O(2^n\log(2^n/\epsilon))$, but under the assumption of unit-time FANOUT-CNOT. Decomposing FANOUT-CNOT with $2^n$ targets into two-qubit CNOT gates incurs a multiplicative overhead of $O(n)$ in the depth of the protocol, which also induces a similar overhead to the spacetime allocation (see Table 1).
In this work, we will show that our method achieves the depth scaling of Ref. [20] in the {H, S, T, CNOT} gate set (i.e., not requiring FANOUT-CNOT) while achieving superior spacetime allocation of $O(2^n\log(n/\epsilon))$.

SP+CSP State Preparation
In this section, we will walk through the details of our SP+CSP method that achieves $\Theta(N)$ spacetime allocation while keeping the $\Theta(\log(N))$ depth using the {U(2), CNOT} gate set.
Recall that we wish to create the $n$-qubit state $|\psi\rangle$ in Eq. (1), which has known coefficients $x_i/\|x\|$ in the computational basis. We propose to use the following method that first uses a state preparation step (SP) on a subset of the data qubits, followed by a controlled state preparation step (CSP). The rationale for doing this rather than a direct SP is that the SP+CSP protocol can harness the advantages of both the SP and CSP steps while avoiding their disadvantages if set up correctly. We explain the details of the complexity advantages and disadvantages in Sec. 3.3.
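The general idea of the SP+CSP protocol is to partition the $N = 2^n$ basis states into $M = 2^m$ non-overlapping sets of size $N/M$. Computational basis states for which the first $m$ bits agree (when written in binary) are placed into the same set. Denote the $M$ sets by $J_i$ with $i = 0, \ldots, M-1$. For each $i$, we let $y_i = \sqrt{\sum_{j \in J_i} |x_j|^2}$. We first prepare the state of $m$ qubits ($m < n$):
$$|\phi\rangle = \frac{1}{\|x\|}\sum_{i=0}^{M-1} y_i\,|i\rangle.$$
The total state is now $|\phi\rangle \otimes |0^{n-m}\rangle$.
Next, we perform the controlled state preparation operation $U_{CSP}$, defined by
$$U_{CSP}\big(|i\rangle \otimes |0^{n-m}\rangle\big) = |i\rangle \otimes \frac{1}{y_i}\sum_{l=0}^{N/M-1} x_{iN/M+l}\,|l\rangle$$
for a particular $|i\rangle$ state being controlled on (with $y_i \neq 0$), so that
$$U_{CSP}\big(|\phi\rangle \otimes |0^{n-m}\rangle\big) = \frac{1}{\|x\|}\sum_{j=0}^{N-1} x_j\,|j\rangle = |\psi\rangle.$$
The circuit diagram is shown in Fig. 6. Implementation details of $U_{SP}$ and $U_{CSP}$ are shown in Fig. 7 and Fig. 8. We will now describe how to perform the $U_{SP}$ and $U_{CSP}$ circuits.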
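The SP+CSP split described above can be checked classically on a small vector. The sketch below assumes that the first $m$ bits of a basis-state label are its most significant bits (so the set $J_i$ is a contiguous block of the vector) and that $y_i$ is the Euclidean norm of block $i$; the helper name `split_sp_csp` is hypothetical.

```python
import numpy as np

def split_sp_csp(x, m):
    """Split x into SP amplitudes (phi) and per-block CSP target states."""
    x = np.asarray(x, dtype=float)
    n = x.size.bit_length() - 1            # x has length N = 2^n
    blocks = x.reshape(2 ** m, 2 ** (n - m))  # row i holds the set J_i
    y = np.linalg.norm(blocks, axis=1)        # block norms y_i
    phi = y / np.linalg.norm(x)               # amplitudes of the m-qubit state
    # Normalized (n-m)-qubit state for each control value i (zero if y_i = 0).
    cond = np.divide(blocks, y[:, None],
                     out=np.zeros_like(blocks), where=y[:, None] > 0)
    return phi, cond

rng = np.random.default_rng(0)
x = rng.random(16)                      # n = 4 qubits
phi, cond = split_sp_csp(x, m=2)

# Recombining the two stages reproduces the full target amplitudes.
psi = (phi[:, None] * cond).reshape(-1)
print(np.allclose(psi, x / np.linalg.norm(x)))  # True
```

The SP stage only needs the $M$ numbers in `phi`, while the CSP stage needs one row of `cond` per control value, which mirrors how the protocol divides the $N$ amplitudes between the two steps.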

SP Circuit Structure
To perform $U_{SP}$, we give a method that has depth $O(m)$ and uses $O(2^m)$ ancilla qubits. The spacetime allocation is thus at most $O(m2^m)$. If we require a discrete gate set and have approximation error $\epsilon$, this method achieves $O(m + \log(1/\epsilon))$ depth and spacetime allocation $O(m2^m + 2^m\log(m/\epsilon))$.
The idea behind the SP protocol follows previous literature and begins by defining $M-1$ angles that can each be efficiently computed classically from the list of amplitudes $\{y_i\}_{i=0}^{M-1}$. For convenience and consistency with previous literature [8], we use a 2-index angle definition to define the rotation angles:
$$\theta_{s,p} = 2\arccos\left(\sqrt{\frac{\sum_{l=0}^{2^{m-s-1}-1} |y_{p2^{m-s}+l}|^2}{\sum_{l=0}^{2^{m-s}-1} |y_{p2^{m-s}+l}|^2}}\right), \tag{8}$$
where $s = 0, \ldots, m-1$, and $p = 0, \ldots, 2^s-1$, for a total of $1 + 2 + 4 + \ldots + 2^{m-1} = M-1$ angles. Naively, computing each angle can require querying as many as $O(2^m)$ entries of the vector $y$, resulting in $O(2^{2m})$ total classical runtime. However, we can reuse some of the computations by storing the quantities
$$S_{s,p} = \sum_{l=0}^{2^{m-s}-1} |y_{p2^{m-s}+l}|^2$$
in a binary-tree data structure with $m$ levels [3]. The tree has $2^{m-1}$ leaves; we store the quantities $S_{m-1,p}$ in the leaves for $p = 0, \ldots, 2^{m-1}-1$. The tree is then constructed recursively by the rule that a parent stores the sum of its children. Since $S_{s,p} = S_{s+1,2p} + S_{s+1,2p+1}$, we can verify that the $p$-th node at level $s$ will store the value $S_{s,p}$. The angle $\theta_{s,p}$ can then be determined only from the values stored at this node along with its two children. The total work to construct the entire tree is $O(2^m)$, and the tree has the added benefit that if an individual entry of $y$ is modified or if the entries arrive in an online fashion, updating the tree data structure can be classically done in time $O(m)$ by following a single path from a leaf back to the root. Additionally, although we do not utilize this property, in the case that $y$ is sparse and has only $d$ nonzero entries, the tree structure will also be sparse and have at most $dm$ nonzero entries.

This 2-index angle definition is also convenient for labeling the ancilla registers holding the computed angles shown in Eq. (9) and the corresponding data qubits. In particular, $s$ corresponds to the index of the ancilla register $A_s$ (referenced in, e.g., Algorithm 5) as well as the index of the data qubit being processed. Meanwhile, $p$ corresponds to the index of the qubit within the particular ancilla register $A_s$.

Araujo et al. [24] proposed a low-depth strategy for implementing $U_{SP}$ using $O(M)$ ancilla qubits. The general idea is to "pre-rotate" many angles by preparing the states
$$|\theta_{s,p}\rangle = R_y(\theta_{s,p})|0\rangle = \cos(\theta_{s,p}/2)\,|0\rangle + \sin(\theta_{s,p}/2)\,|1\rangle \tag{9}$$
in parallel for each $s = 0, \ldots, m-1$, $p = 0, \ldots, 2^s-1$, and then efficiently inject a subset of these states into the data qubits. Which states are injected is controlled by the data qubits themselves, leaving the data qubits entangled with a large garbage register. Clader et al. [20] showed how to uncompute the garbage using a flag mechanism. It was shown that the overall T depth of the injection steps (e.g., SPF and FLAG) was $O(m)$, but the Clifford depth was not studied, and a naive calculation would suggest the Clifford depth is at least $\Omega(m^2)$ when we insist on using only two-qubit Clifford gates, due to the existence of the FANOUT-CNOT gate. In this work, we give an adapted version of Ref. [20] by utilizing a bit-copying mechanism to guarantee that the Clifford depth is also $O(m)$.
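The binary-tree computation of the rotation angles described above can be sketched classically as follows. This is a minimal illustration under stated assumptions: it uses the convention $R_y(\theta)|0\rangle = \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$, takes the left child of each node as the $|0\rangle$ branch, and keeps one extra tree level holding the squared amplitudes for convenience. The names `angle_tree` and `amplitudes_from_angles` are hypothetical.

```python
import numpy as np

def angle_tree(y):
    """Return theta[s][p] for s = 0..m-1, p = 0..2^s-1, from amplitudes y."""
    y = np.asarray(y, dtype=float)
    m = y.size.bit_length() - 1            # y has length M = 2^m
    levels = [y ** 2]                      # deepest level: squared amplitudes
    while levels[0].size > 1:
        child = levels[0]
        # Each parent stores the sum of its two children.
        levels.insert(0, child[0::2] + child[1::2])
    theta = []
    for s in range(m):
        parent = levels[s]                 # S_{s,p}
        left = levels[s + 1][0::2]         # S_{s+1,2p}
        # Empty subtrees get ratio 1 (angle 0) to avoid dividing by zero.
        ratio = np.divide(left, parent,
                          out=np.ones_like(parent), where=parent > 0)
        theta.append(2 * np.arccos(np.sqrt(np.clip(ratio, 0.0, 1.0))))
    return theta

def amplitudes_from_angles(theta):
    """Rebuild the normalized amplitudes by descending the tree."""
    amps = np.array([1.0])
    for layer in theta:
        amps = np.stack([amps * np.cos(layer / 2),
                         amps * np.sin(layer / 2)], axis=1).reshape(-1)
    return amps

y = np.array([0.5, 1.0, 0.0, 2.0, 1.5, 0.5, 1.0, 3.0])
theta = angle_tree(y)
print(sum(t.size for t in theta))   # 7 angles for M = 8
print(np.allclose(amplitudes_from_angles(theta), y / np.linalg.norm(y)))  # True
```

Building all levels costs time linear in the length of `y`, and changing one entry of `y` only affects the nodes on a single leaf-to-root path, in line with the $O(2^m)$ construction and $O(m)$ update costs quoted in the text.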
The high-level circuit that accomplishes the $U_{SP}$ routine is shown in Fig. 7. The circuit describes the gate sequence that prepares an arbitrary $m$-qubit state $|\phi\rangle$ with the assistance of $2M - 2 = 2^{m+1} - 2$ ancilla qubits that begin and end in $|0\rangle$. Note that implementations of the SPF circuit and FLAG circuit, discussed in App. A, require an additional $O(M)$ ancillae not shown in the figure to implement the bit-copying mechanism mentioned above.
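We now follow Ref. [20] and define the action of circuits SPF and FLAG that appear in Fig. 7. First, define the product states
$$|\Theta_s\rangle = \bigotimes_{p=0}^{2^s-1} |\theta_{s,p}\rangle, \qquad |\Theta\rangle = \bigotimes_{s=0}^{m-1} |\Theta_s\rangle. \tag{10}$$
Next, for each $j = 0, \ldots, M-1$, $s = 0, \ldots, m-1$, $p = 0, \ldots, 2^s-1$, define
$$f_{(s,p)|j} = \begin{cases} 1, & \text{if the first } s \text{ bits of } j \text{ (written as an } m\text{-bit string) are the binary representation of } p, \\ 0, & \text{otherwise.} \end{cases}$$
Thus, if we fix a value of $j$, there are $m$ pairs $(s,p)$ for which $f_{(s,p)|j} = 1$: in fact, for each value of $s$, there is exactly one $p \in [0, 2^s-1]$ for which $f_{(s,p)|j} = 1$. The values of $(s,p)$ for which $f_{(s,p)|j} = 1$ correspond to the angle states $|\theta_{s,p}\rangle$ that are injected into the data by the SPF circuit for the data qubit setting $|j\rangle$.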
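The counting claim above (for each $j$ and each $s$, exactly one $p$ has $f_{(s,p)|j} = 1$) can be checked with a short sketch. It assumes, as a reading of the text rather than a confirmed definition, that $f_{(s,p)|j} = 1$ exactly when the $s$ most significant bits of the $m$-bit index $j$ equal $p$; the function name `flag_bit` is hypothetical.

```python
def flag_bit(s, p, j, m):
    """1 if the s most significant bits of the m-bit index j equal p, else 0.

    Assumed definition of f_{(s,p)|j}; for s = 0 the empty prefix matches p = 0.
    """
    return int((j >> (m - s)) == p)

m = 3
for j in range(2 ** m):
    selected = []
    for s in range(m):
        hits = [p for p in range(2 ** s) if flag_bit(s, p, j, m)]
        # Exactly one p is selected at every level s.
        assert hits == [j >> (m - s)]
        selected.append((s, hits[0]))
    print(j, selected)
```

For example, $j = 5$ (binary 101) selects the pairs $(0,0)$, $(1,1)$, and $(2,2)$, which is the root-to-leaf path in the angle tree that ends above leaf 5.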
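The $U_{SP}$ procedure begins by preparing $|\Theta\rangle$ into an ancilla "angle" register (register $A$ in Fig. 7) of size $M-1$ using $M-1$ parallel $R_y$ gates. The SPF circuit is then applied, which injects some of the angle states into the data and produces the action
$$\mathrm{SPF}\big(|0^m\rangle \otimes |\Theta\rangle\big) = \frac{1}{\|x\|}\sum_{j=0}^{M-1} y_j\,|j\rangle \otimes \bigotimes_{s=0}^{m-1}\bigotimes_{p=0}^{2^s-1} \big|\theta_{s,p}(1 - f_{(s,p)|j})\big\rangle.$$
Here, the notation $|\theta(1-f)\rangle$ is used to denote that the state is $|\theta\rangle$ when $f = 0$ and $|0\rangle$ when $f = 1$.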
In other words, the registers holding $|\theta_{s,p}\rangle$ for which $f_{(s,p)|j} = 1$ are replaced with $|0\rangle$. We see that the output of SPF has the correct amplitudes as the target state $|\phi\rangle$, but leaves the data entangled with an $(M-1)$-qubit garbage register. The FLAG operation computes the bit $f_{(s,p)|j}$ into a fresh ancilla "flag" register (register $F$ in Fig. 7) for each pair $(s,p)$, i.e.,
$$\mathrm{FLAG}\big(|j\rangle \otimes |0^{M-1}\rangle\big) = |j\rangle \otimes \bigotimes_{s=0}^{m-1}\bigotimes_{p=0}^{2^s-1} |f_{(s,p)|j}\rangle,$$
while leaving the garbage states on the angle qubits untouched. Thus, by applying a controlled-$R_y(-\theta_{s,p})$ gate controlled by the flag bit $(1 - f_{(s,p)|j})$ onto the angle qubit in the state $|\theta_{s,p}(1 - f_{(s,p)|j})\rangle$ after the FLAG, we reset all of the angle qubits back to $|0\rangle$. This is done in parallel for each pair $(s,p)$. Finally, the adjoint of the FLAG operation (FLAG†) can be applied to uncompute the flag bits and bring the flag register back to $|0^{M-1}\rangle$. The output of this process is the state $|\phi\rangle$ without any garbage.
The construction for SPF and FLAG described in Ref. [20] was only optimized for the T-depth and T-count. The Clifford count and depth were not explicitly studied. Moreover, it was assumed that a FANOUT-CNOT gate with an arbitrary number of targets could be performed in a single time step. In the appendix, we detail our constructions for the SPF (Sec. A.4) and FLAG (Sec. A.5) subroutines. There, we illustrate how both subroutines can be performed in $O(m)$ depth and $O(m2^m)$ spacetime allocation using the {U(2), CNOT} gate set, with near-optimal costs for the {H, S, T, CNOT} gate set.
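We document the pseudocode of the SP circuit implementation below in Algorithm 1.

Algorithm 1 SP Circuit (see Fig. 7)
procedure SP
    for s in range(m) and p in range($2^s$) do
        Classically compute $\theta_{s,p}$ from y    ▷ Eq. (8)
    end for
    for s in range(m) and p in range($2^s$) do
        $R_y(\theta_{s,p}, A_{s,p})$ onto fresh "angle" ancilla
    end for
    SPF on the data qubits and the angle register A
    for s in range(m) and p in range($2^s$) do
        X($F_{s,p}$) onto fresh "flag" ancilla
    end for
    FLAG on the data qubits and the flag register F
    for s in range(m) and p in range($2^s$) do
        controlled-$R_y(-\theta_{s,p})$ onto $A_{s,p}$, controlled by $F_{s,p}$
    end for
    FLAG† to uncompute the flag bits
    for s in range(m) and p in range($2^s$) do
        X($F_{s,p}$) to reset "flag" ancilla
    end for
end procedure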

CSP Circuit Structure
The controlled state preparation circuit prepares a different $(n-m)$-qubit state for each of $M$ possible settings $|k\rangle \in \{|0\rangle, \ldots, |M-1\rangle\}$ of an $m$-qubit control register. Thus, for each $k$, there are $2^{n-m}-1 = N/M - 1$ angles, which we denote by $\theta^{(k)}_{s,p}$, for $s = 0, \ldots, n-m-1$ and $p = 0, \ldots, 2^s-1$. These angles can be computed from the amplitudes of $x$ by the equation (cf. Eq. (8))
$$\theta^{(k)}_{s,p} = 2\arccos\left(\sqrt{\frac{\sum_{l=0}^{2^{n-m-s-1}-1} |x_{kN/M + p2^{n-m-s}+l}|^2}{\sum_{l=0}^{2^{n-m-s}-1} |x_{kN/M + p2^{n-m-s}+l}|^2}}\right).$$
The total number of angles is then $M(N/M - 1) = N - M$. The states $|\theta^{(k)}_{s,p}\rangle$ are then defined as in Eq. (9), and the product states $|\Theta^{(k)}_s\rangle$ and $|\Theta^{(k)}\rangle$ as in Eq. (10). We document the pseudocode of the CSP circuit implementation in Algorithm 2.
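Algorithm 2 CSP Circuit (see Fig. 8)
procedure CSP
    for k in range($2^m$) do
        for s in range(n − m) and p in range($2^s$) do
            Classically compute $\theta^{(k)}_{s,p}$ from x
        end for
    end for
    for s in range(n − m) and p in range($2^s$) do
        X($F_{s,p}$) onto fresh "flag" ancilla
    end for
    LOADF to load $|\Theta^{(k)}\rangle$, controlled on the control register being $|k\rangle$
    SPF on the n − m data qubits and the angle register
    FLAG on the n − m data qubits and the flag register F
    LOADF† to unload the remaining angle states
    FLAG† to uncompute the flag bits
    for s in range(n − m) and p in range($2^s$) do
        X($F_{s,p}$) to reset "flag" ancilla
    end for
end procedure

The controlled state preparation circuit, shown in Fig. 8, is a generalized version of the controlled state preparation circuit proposed in Clader et al. [20]. The general idea of the CSP circuit is to first, controlled on the control register being $|k\rangle$, load in the correct state $|\Theta^{(k)}\rangle$. Once this has been done, the same SPF circuit as was used for the $U_{SP}$ protocol is applied to inject the correct angles into the $n-m$ data qubits. A FLAG mechanism similar to that of the $U_{SP}$ protocol is then employed to disentangle the data from the angle register and reset the angle register to $|0^{M-1}\rangle$.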
The loading and unloading is accomplished with a circuit we call LOADF, similar to (but not the same as) the LOADF in Ref. [20], which is defined by the Note that implementations of LOADF, SPF, and FLAG, given in the appendix, involve O(N ) additional unshown ancillae.We also note that in the actual implementation, one can reduce the first LOADF operation so that it is controlled only by the B register (not on the F register, which is guaranteed to be in the |1 N/M −1 ⟩ state at this stage in the circuit) in order to save a constant depth.We choose to define the LOADF operator in this way only to avoid another definition of the later LOADF † . action Note that if all the flag bits f s,p are set to 1, the state |Θ (k) ⟩ is prepared into the second register.Our implementation of LOADF is shown in Fig. 19 in the appendix, and involves an additional O(2 n ) ancilla qubits.

Overall depth and spacetime allocation
In this subsection, we compute the overall depth and ancilla allocation of our protocol. To verify the stated complexities of subroutines, we refer the reader to the sections where those subroutines are implemented. We state complexities in the {U(2), CNOT} gate set, and then quote the associated complexity in the {H, S, T, CNOT} gate set in parentheses when it is different. We summarize all the individual complexity contributions and the resulting total complexities in Table 3. Note that to achieve overall error ϵ in state preparation or controlled state preparation, it suffices to prepare each angle state $|\theta_{s,p}\rangle$ to precision ϵ/n. To see this, note that the protocol is equivalent to a sequence of n multi-controlled rotations on the data qubits. If every angle is accurate up to error ϵ/n, each of these n unitaries is enacted up to spectral norm error ϵ/n, yielding total error at most ϵ. More thorough justifications of this appear in [20, Section V B] and [19, Section VIIIA of Supplementary Material].
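The error-accumulation step above can be spelled out with a standard telescoping argument. In the sketch below, $U_j$ denotes the j-th ideal multi-controlled rotation and $\tilde U_j$ its implemented (unitary) approximation; these symbols are introduced here for illustration only and are not notation from the main text.

```latex
\begin{align}
\bigl\| \tilde U_n \cdots \tilde U_1 - U_n \cdots U_1 \bigr\|
  &= \Bigl\| \sum_{j=1}^{n} \tilde U_n \cdots \tilde U_{j+1}\,
       (\tilde U_j - U_j)\, U_{j-1} \cdots U_1 \Bigr\| \\
  &\le \sum_{j=1}^{n} \bigl\| \tilde U_j - U_j \bigr\|
   \;\le\; n \cdot \frac{\epsilon}{n} \;=\; \epsilon .
\end{align}
```

The first line is an exact telescoping identity; the second uses the triangle inequality together with the fact that multiplying by a unitary does not change the spectral norm.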
We begin with $U_{SP}$, implemented as in Fig. 7. For $U_{CSP}$, the required total qubits are the $n + 2N/M$ that appear in Fig. 8, plus the additional O(N) needed to perform the LOADF, SPF, and FLAG circuits. The product of space and depth gives an immediate upper bound on the spacetime allocation of $O(n2^n)$ ($\mathrm{SA}_{\mathrm{approx}} = O(2^n \log(2^n/\epsilon))$). However, in this case, the actual spacetime allocation is better than this upper bound: the LOADF subroutine achieves spacetime allocation $O(2^n)$ in the {U(2), CNOT} gate set. Adding both parts together, and choosing m appropriately, the SP+CSP protocol achieves the stated spacetime allocation.

In retrospect, we can pinpoint the rationale for pursuing the SP+CSP approach to state preparation. The issue with simply performing the $U_{SP}$ protocol with n = m is that it would require $2^n - 1$ ancilla qubits to store the $2^n - 1$ angles, and each of these ancillae must remain allocated for the entire O(n) depth of the circuit. The SP+CSP protocol gets around this by first preparing the m-qubit state $|\phi\rangle$, requiring only $2^m - 1$ angles, and then, for each of the $2^m$ basis states, preparing a different $(n-m)$-qubit state in the remaining qubits. The latter operation requires $2^{n-m} - 1$ angles (stored in the buffer ancilla register B), so we avoid the need for $O(2^n)$ angle states to be allocated for the entire O(n) depth. The trick comes from the ability to load the correct set of $2^{n-m} - 1$ angles among all $2^n - 2^m$ angles with only $O(2^n)$ spacetime allocation, which is accomplished by our LOADF implementation presented in the appendix.
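The angle bookkeeping behind this argument is easy to check numerically. The following is a minimal sketch, not part of the protocol itself; the function name `angle_budget` and the dictionary keys are hypothetical, and only the counts quoted in the text ($2^m - 1$ angles for SP, $2^{n-m} - 1$ per control setting for CSP) are taken from the source.

```python
def angle_budget(n: int, m: int) -> dict:
    """Count rotation angles when an n-qubit state is split into
    SP on m qubits followed by CSP on the remaining n - m qubits."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    N, M = 2 ** n, 2 ** m
    sp_angles = M - 1                # angles defining the m-qubit state
    csp_per_branch = N // M - 1      # angles for one control setting k
    csp_total = M * csp_per_branch   # all control settings: N - M
    return {
        "sp_angles": sp_angles,
        "csp_angles_per_branch": csp_per_branch,
        "csp_angles_total": csp_total,
        "total_angles": sp_angles + csp_total,               # N - 1
        # Qubits that hold angles at any one time: the SP angle register
        # plus one CSP buffer, i.e. M + N/M - 2.
        "angle_register_qubits": sp_angles + csp_per_branch,
    }


if __name__ == "__main__":
    for m in (0, 3, 6):
        print(m, angle_budget(6, m))
```

For n = 6, the balanced split m = 3 keeps only 14 qubits loaded with angles at once, while still accounting for all N − 1 = 63 angles of the target state.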
In the appendix, we also comment on parts of the LOADF subroutine where clean ancillae in the |0⟩ state can be replaced with dirty ancillae in an arbitrary state. In total, a constant fraction of the ancillae in our construction can be dirty, but a constant fraction must remain clean. Future work aims to investigate if a fraction of ancillae that asymptotically approaches 1 can be made dirty. If that is the case, it may be meaningful to differentiate between the portion of spacetime allocation that corresponds to clean ancillae and the fraction that corresponds to dirty ancillae.

Preparing Many Copies of Independent Quantum States
This section shows how the encoding method proposed in Sec. 3 can be used to more efficiently prepare many copies of arbitrary quantum states (the same or different ones) using the {U(2), CNOT} gate set. Specifically, we consider the task of preparing a product state of w separate N-dimensional states as quickly as possible. If we had O(wN) ancilla qubits, we could perform state preparation on each of the w copies in parallel for total depth O(log(N)). However, in applications, N is likely to be large, and we may not have so many ancillae. Suppose we have only O(N) ancilla qubits, enough to prepare one state in O(log(N)) depth, but not w states. Naively, we could prepare the w states in series, incurring depth O(w log(N)). However, since the spacetime allocation required for a single state preparation is O(N), the spacetime allocation for preparing w states is O(wN), suggesting that if the O(N) ancillae are reused optimally, we might achieve O(w) total depth, rather than O(w log(N)). Indeed, we now illustrate that using our SP+CSP protocol one can accomplish the task in depth O(w + log(N)), which is constant O(1) depth per copy when w = Ω(log(N)).
There are two essential properties of the single-copy arbitrary state preparation method described in Sec. 3:
1. All ancilla qubits are disentangled from the data qubits and returned to the |0⟩ state. The previously used ancillae are ready to act as either new data qubits or new ancilla qubits.
2. The amount of spacetime occupied by active ancilla qubits is only O(N ), even though the depth is O(log(N )) and there are O(N ) ancilla qubits.
Thus, once the ancillae are returned to |0⟩, they can begin to assist with the next state preparation, even if the previous state preparation has not been completed.We can concatenate and stack many of the state preparation circuits one after another.
Quantitatively speaking, we can prepare many unentangled copies of the same quantum state. Here, the $U_s$ are the same quantum state preparation unitary, but acting on different $|0^n\rangle$ registers. This description does not include the ancillae that assist the implementation of the $U_s$ unitaries: even though the $U_s$ act on different $|0^n\rangle$ data registers, they can share many of the same ancilla qubits.
We can also prepare many copies of different unentangled quantum states. Here, the $U_d$ are different quantum state preparation unitaries acting on different $|0^n\rangle$ registers and preparing different N-dimensional states described by the coefficient vectors $x_d$. Even though each $U_d$ is mathematically different, the gate scheduling is exactly the same; the only difference is in the single-qubit rotation angles.

SP+CSP Multi-Copy
To prepare the joint state, we proceed in two stages. First, we prepare the states $|\phi_d\rangle$, where $|\phi_d\rangle$ is defined from $x_d$ as in Eq. (5). Then, we jointly execute all the $U_{d(CSP)}$ circuits.
We do not have sufficient ancillae to perform these circuits in parallel, so we perform them with some "indentation" k between each other. That is, we begin the d = 0 protocol at layer 0, the d = 1 protocol at layer k, the d = 2 protocol at layer 2k, etc. We illustrate the above operation as a quantum circuit in the accompanying figure. O(log(N)) joint states can be parallelized in this way without introducing additional ancilla qubits.

Now let us walk through the second operation described in Eq. (21). The most ancilla-consuming steps of the CSP circuit are the LOADF operations, and in particular, the layer of doubly-controlled rotation ($CCR_y$) operations that appear in Fig. 19. We offer a toy-model gate-level illustration in Fig. 10. We can see that the circuit depth will be 2w + log(N) instead of w log(N). The number of ancillae used is upper bounded by 8N. When the ratio between w and log(N) grows larger, which is often the case in near-term applications such as amplitude encoding for quantum machine learning (as illustrated in Sec. 5.1), we would effectively have a state preparation circuit depth of O(w), the same as the depth using the protocol proposed in [18].
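The effect of the indentation on total depth can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions: `single_depth` is a hypothetical stand-in for the O(log(N)) depth of one state preparation (the constant is not specified here), `indentation` plays the role of k, and the model does not check ancilla availability, which is what the actual protocol must guarantee.

```python
def serial_depth(w: int, single_depth: int) -> int:
    """Depth when the w state preparations run one after another."""
    return w * single_depth


def pipelined_depth(w: int, single_depth: int, indentation: int) -> int:
    """Depth when copy d starts at layer d * indentation.

    The last copy starts at layer (w - 1) * indentation and then runs
    for single_depth layers.
    """
    if w < 1 or single_depth < 1 or indentation < 1:
        raise ValueError("all arguments must be positive integers")
    return (w - 1) * indentation + single_depth


def max_concurrent(w: int, single_depth: int, indentation: int) -> int:
    """Largest number of staggered copies that overlap in any one layer."""
    return min(w, -(-single_depth // indentation))  # ceiling division


if __name__ == "__main__":
    w, d, k = 100, 20, 2
    print("serial:", serial_depth(w, d))
    print("pipelined:", pipelined_depth(w, d, k))
    print("overlapping copies:", max_concurrent(w, d, k))
```

With indentation 2, the pipelined depth grows as 2w plus a single-copy term, matching the 2w + log(N) scaling quoted above, and the number of overlapping copies is proportional to the single-copy depth, consistent with O(log(N)) copies being in flight at once.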
When using the {H, S, T, CNOT} gate set, one can develop specific compilation strategies that utilize this early-ancilla-free-up structure based on the particular ϵ values.Asymptotically speaking, since each state preparation occupies spacetime allocation O(N log(log(N )/ϵ)), with O(N ) ancilla qubits it would be possible to create w states in depth O(w log(log(N )/ϵ) + log(N ) + log(1/ϵ)).

Applications
In general, quantum measurements are probabilistic. Therefore, for many quantum algorithms, we typically need to run many shots with the same quantum circuit setup; that is, we need to execute the entire circuit (state preparation, algorithm circuit, and measurement) many times to reach sufficient measurement precision. For quantum algorithms that require multiple shots/executions and arbitrary quantum state preparation, our state preparation method can therefore reduce the average time per shot by as much as a factor linear in the number of qubits.
However, there exist more concrete examples where our novel encoding methods can provide advantages beyond the need to improve measurement precision, including but not limited to: quantum machine learning, Hamiltonian simulation, and solving linear systems with algorithms in the style of the Harrow-Hassidim-Lloyd (HHL) algorithm [13].

Quantum Machine Learning -Batching
One proposed quantum advantage for machine learning tasks is utilizing quantum computers' exponentially large Hilbert space to encode and process data in parallel, providing potential speedup over classical machine learning methods. One of the bottlenecks to realizing such an advantage is that encoding the classical data into the quantum machine in an exponentially compact form (amplitude encoding [8]) is costly. In amplitude encoding, we wish to encode the feature vector $x = [x_0, x_1, \ldots, x_{N-1}]$ into the amplitudes of the quantum state such that
$$|x\rangle = \frac{1}{\|x\|}\sum_{i=0}^{N-1} x_i\,|i\rangle.$$
Notice that $N = 2^n$, where n is the number of required qubits. This indicates we can encode the feature vector in an exponentially compact way. For instance, if we have a 2 × 2 image that has a pixel value array of x = [232, 31, 62, 137] (each pixel has an integer color value ranging from 0 to 255), we only need $\log_2 4 = 2$ qubits to encode the values in the amplitudes:
$$|x\rangle = \frac{1}{\sqrt{77398}}\bigl(232\,|00\rangle + 31\,|01\rangle + 62\,|10\rangle + 137\,|11\rangle\bigr).$$

In typical machine learning tasks, a model is trained by feeding it many copies of training data and adjusting the parameters in the model to minimize a loss function evaluated based on that data. One can update the parameters after computing the loss sum of many data points (e.g., batch gradient descent [32]).
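As a concrete illustration of amplitude encoding, the classical preprocessing for the 2 × 2 image example can be sketched as follows. The helper name `amplitude_encode` is hypothetical, and padding with zeros up to the next power of two is an assumption made here for inputs whose length is not already a power of two.

```python
import math


def amplitude_encode(x):
    """Return (n_qubits, amplitudes) for the normalized state sum_i x_i |i> / ||x||."""
    if not x:
        raise ValueError("feature vector must be non-empty")
    n_qubits = (len(x) - 1).bit_length()           # ceil(log2(len(x)))
    padded = list(x) + [0] * (2 ** n_qubits - len(x))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0:
        raise ValueError("feature vector must be non-zero")
    return n_qubits, [v / norm for v in padded]


if __name__ == "__main__":
    n, amps = amplitude_encode([232, 31, 62, 137])
    print("qubits:", n)
    for i, a in enumerate(amps):
        print(f"|{i:0{n}b}>: {a:.4f}")
```

The resulting amplitudes are the target of the state preparation circuit; the squared amplitudes sum to one, so they define a valid two-qubit state.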
In the quantum computing setting, one could execute a QML iteration as follows (also depicted in Fig. 11): 1. encoding many data points into separate quantum states simultaneously; 2. processing all states with, e.g., separate quantum neural network circuits simultaneously; and 3. measuring the output of each circuit simultaneously (disjoint or joint measurements) and computing the total loss.Since the encoding process can take significant portions of the total required quantum resources, one must consider the best strategy for state preparations.
Using our SP+CSP encoding method, we would be able to achieve O(w+log(N )) circuit depth using only O(N ) qubits, using the indentation method described in Sec. 4. In principle, one could encode many data points in effectively constant depth, given O(N ) ancilla qubits.When running on near-term machines, this effectively matches the best available theoretical performance using the protocol proposed by Yuan et al. [18] using the {U(2), CNOT} gate set.The actual best strategy using near-term machines will depend on the machine properties, such as additional available native gates (e.g., CSWAP) and connectivity allowance.For fault-tolerant level quantum machine learning algorithms (e.g., using the {H, S, T, CNOT} gate set), our SP+CSP protocol has clear advantages in terms of both depth and spacetime allocation.

Hamiltonian Simulation -Linear combination of unitaries
Hamiltonian simulation is the task of synthesizing the time evolution unitary U(t) given a length of time t and a description of some Hamiltonian H. For example, Hamiltonian simulation allows one to measure the properties of a time-evolved state
$$|\psi(t)\rangle = U(t)\,|\psi(0)\rangle,$$
where $|\psi(0)\rangle$ is the initial state and $U(t) = e^{-iHt}$. One method to perform Hamiltonian simulation is to use linear combination of unitaries (LCU) [9,33]. The idea of this method is to approximate the Hamiltonian evolution operator with a Taylor expansion truncated to order K and then expanded as a linear combination of (unitary) Pauli matrices
$$U(t) \approx \sum_{k=0}^{K} \frac{(-iHt)^k}{k!} = \sum_{j=0}^{\Gamma-1} \beta_j V_j,$$
where $V_j$ are Pauli matrices and the number of Pauli matrices in the linear combination is $\Gamma = \sum_{k=0}^{K} L^k$, where L is the number of Pauli terms in the Hamiltonian H. Let $a = \lceil \log_2(\Gamma) \rceil$.
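Implementing the linear combination of unitaries is then done by constructing a SELECT operator and a state-preparation operator B, defined as
$$\mathrm{SELECT} = \sum_{j=0}^{\Gamma-1} |j\rangle\langle j| \otimes V_j, \qquad B\,|0^a\rangle = \frac{1}{\sqrt{\sum_{j=0}^{\Gamma-1}\beta_j}} \sum_{j=0}^{\Gamma-1} \sqrt{\beta_j}\,|j\rangle.$$
Combining these to form $W = (B^\dagger \otimes I)\,\mathrm{SELECT}\,(B \otimes I)$, we see that
$$(\langle 0^a| \otimes I)\,W\,(|0^a\rangle \otimes I) = \frac{1}{\sum_{j=0}^{\Gamma-1}\beta_j} \sum_{j=0}^{\Gamma-1} \beta_j V_j \approx \frac{U(t)}{\sum_{j=0}^{\Gamma-1}\beta_j}.$$
The above equation is equivalent to the statement that W is an approximate "block-encoding" of U(t) [34,35], with block-encoding factor $\sum_{j=0}^{\Gamma-1}\beta_j$. Thus, U(t) can be applied by application of W and postselection onto the ancilla state $|0^a\rangle$, which occurs with probability $(\sum_{j=0}^{\Gamma-1}\beta_j)^{-2}$. The success probability can be boosted to 1 by dividing the evolution time t into r segments of length t/r for a specific choice of r, and applying oblivious amplitude amplification [9,10].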
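To make the role of the state-preparation operator B concrete, the sketch below computes the amplitudes that B would have to prepare on the a-qubit ancilla register, together with the postselection probability quoted above. It assumes the standard LCU convention that the coefficients $\beta_j$ are non-negative (phases absorbed into the $V_j$) and that B maps $|0^a\rangle$ to amplitudes proportional to $\sqrt{\beta_j}$; the function names are hypothetical.

```python
import math


def lcu_prepare_amplitudes(betas):
    """Return (a, amplitudes) for B|0^a> with amplitudes sqrt(beta_j / sum(betas)).

    a = ceil(log2(Gamma)) ancilla qubits; unused basis states get amplitude 0.
    """
    if not betas or any(b < 0 for b in betas):
        raise ValueError("betas must be a non-empty list of non-negative numbers")
    total = sum(betas)
    if total == 0:
        raise ValueError("at least one coefficient must be positive")
    a = (len(betas) - 1).bit_length()
    amps = [math.sqrt(b / total) for b in betas]
    amps += [0.0] * (2 ** a - len(betas))
    return a, amps


def postselection_probability(betas):
    """Probability (sum_j beta_j)^(-2) of finding the ancillae in |0^a>."""
    return sum(betas) ** -2


if __name__ == "__main__":
    betas = [0.5, 1.0, 0.5]
    print(lcu_prepare_amplitudes(betas))
    print(postselection_probability(betas))
```

The amplitude vector returned here is exactly the kind of arbitrary $2^a$-dimensional state that a low-depth state preparation routine would be asked to prepare inside the LCU circuit.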
In any case, the state-preparation operator B is a key subroutine of the algorithm, which could be implemented in shallow depth with our state-preparation method. In a typical quantum chemistry simulation setting, the second-quantized Hamiltonian considered includes two-electron and sometimes three-electron integrals, which would make Γ scale as $O(n^4)$ [36] and sometimes $O(n^6)$ [37,38] (where n indicates the number of orbitals). The number of terms in the LCU will be even larger (depending on the truncation parameter); however, our low-depth state preparation method would have depth scaling as O(log(n)).
Recent works [39][40][41] have shown the advantage of performing joint measurements on multiple copies of quantum states (e.g., |ψ(t)⟩ in Eq. ( 24)) compared to separate measurements of the end states.Doing this would give us a better sample complexity scaling, which, in the end, will result in better total circuit complexity by saving the total number of circuit executions in order to reach certain measurement thresholds.

Quantum linear system solving -Repeat until Success
The classical linear system problem is defined as follows: given an N × N invertible matrix A and an N × 1 vector b, find the solution vector x such that Ax = b. The quantum linear system problem is to perform the following related task: prepare a state $|x\rangle = \frac{1}{\|x\|}\sum_i x_i |i\rangle$ whose amplitudes are proportional to the solution vector x. While the state |x⟩ does not give access to all N entries of the vector x, multiple copies of |x⟩ can be used, for example, to estimate expectation values ⟨x|M|x⟩ of observables M. The first quantum algorithm for solving the quantum linear system problem was by Harrow, Hassidim, and Lloyd (HHL) [13]. Their solution begins by preparing the state $|b\rangle = \|b\|^{-1}\sum_i b_i |i\rangle$, where $b_i$ are the entries of the vector b. Then, if A is sparse, and assuming coherent query access to the entries of A in polylog(N) time, the state |x⟩ can be prepared to high precision in time poly(κ, log(N)), where κ is the condition number of A, that is, the ratio of the largest to smallest singular values. When κ = O(1), this is exponentially faster than classical methods that manipulate vectors of length N, such as Gaussian elimination or conjugate gradient descent [42]. However, reading out useful information from |x⟩ can often negate this exponential speedup (leaving the possibility of a polynomial speedup) in specific applications, such as solving differential equations [43]. Even in these cases, there is still a possible exponential advantage in space complexity, due to the exponentially compact representation of the data involved as quantum states. Relatedly, this compactness is leveraged to yield an unconditional exponential quantum advantage for the linear system problem in the communication complexity setting [44].
It is important to note that even if A is a sparse matrix, the vector b can often be dense. This is the case, for example, when solving differential equations via mapping to linear systems [43], where b encodes arbitrary boundary conditions of the problem. Thus, preparing the state |b⟩ as the first step of the quantum algorithm is precisely an application of our state-preparation protocol.
There are a few reasons that our low-depth state preparation protocol is well suited for usage within algorithms for the quantum linear systems problem. When b is dense, preparing |b⟩ will dominate the algorithm's runtime unless a low-depth state-preparation method is used. Furthermore, extracting information from the solution vector |x⟩ may require repeating the algorithm many times, and thus preparing many copies of |b⟩; as outlined in Sec. 4, the fact that our method has optimal spacetime allocation allows many copies to be rapidly prepared with efficient ancilla usage. Furthermore, depending on which quantum approach is taken, the algorithm itself may require preparing many copies of |b⟩ to generate a single copy of |x⟩. For example, in modern approaches such as [34], one can access the matrix A via a unitary "block-encoding" $U_A$. Using signal processing techniques, one can create a block-encoding $U_{A^{-1}}$ of $A^{-1}$. The unitary $U_{A^{-1}}$ uses Õ(κ) calls to the unitary $U_A$ and (up to small error) maps $|b\rangle|0^\ell\rangle$ to $c\,|x\rangle|0^\ell\rangle$ plus a component $|\perp\rangle$ that is in the null space of the projector $I \otimes |0^\ell\rangle\langle 0^\ell|$, where |c| = O(1/κ). A measurement of the ancilla register then produces the desired state |x⟩ with probability $|c|^2 = O(1/\kappa^2)$. Thus, to produce |x⟩, it suffices to create $O(\kappa^2)$ copies of |b⟩ and make Õ(κ) queries to $U_A$ for each copy, for a total of $\tilde{O}(\kappa^3)$ queries. In principle, these copies could be processed in parallel, and, as described above, our method offers a minimal-depth approach for preparing |b⟩ with only O(N) ancillae. Using amplitude amplification, the number of queries to $U_A$ can be reduced to $\tilde{O}(\kappa^2)$, which would involve O(κ) serial reflections about the state |b⟩, an operation that can be performed optimally by our protocol (see App. B). Finally, using variable-time amplitude amplification [14,34] or ideas from adiabatic quantum computing [45,46], the $U_A$ query complexity can be further reduced to O(κ). This process also requires O(κ) serial applications of our state-preparation protocol.

Block-encodings
We have already seen above how block-encodings are useful as a framework for giving a quantum algorithm, such as Hamiltonian simulation or quantum linear systems solvers, access to some underlying data [34,35,47]. Quantum state preparation is also a useful ingredient for constructing block-encodings.
The definition of a block-encoding of a matrix A is a unitary $U_A$ where A appears in the upper left block, scaled by some constant α such that ∥A/α∥ ≤ 1:
$$U_A = \begin{pmatrix} A/\alpha & \cdot \\ \cdot & \cdot \end{pmatrix}.$$
For example, given a Hamiltonian H representing some physical system, one may seek a block-encoding $U_H$ of H, which can then be used to build algorithms that produce the ground state or thermal state of H. It can also be used to perform Hamiltonian simulation via qubitization [12,35], a technique that can have superior performance to the LCU method discussed previously. How the block-encoding $U_A$ is constructed depends on the contents of A and the context of the algorithm. Many of the most common methods explicitly involve the need for QSP or controlled QSP (see Sec. 4.2 of [35]).
For example, when A is a Gram matrix, that is, a matrix with entries $A_{ij} = \langle\psi_i|\phi_j\rangle$ for some sets $\{|\psi_i\rangle\}_i$, $\{|\phi_j\rangle\}_j$, one can provide a block-encoding of A as $U_L^\dagger U_R$, where $U_L$ and $U_R$ are unitaries that prepare the arbitrary states $|\psi_i\rangle$ and $|\phi_j\rangle$ [35, Lemma 47]. A similar approach can be used to construct block-encodings of arbitrary matrices [20,34].
Both U L and U R can be constructed using our controlled state preparation protocol with the optimized depth of O(log(N )) and O(N ) ancilla qubits.

Qubit Data Copying
Our implementation of the SPF and FLAG circuits requires additional ancillae that are used as part of coherent data-copying subroutines. That is, we perform the isometry $\alpha|0\rangle + \beta|1\rangle \mapsto \alpha|0^c\rangle + \beta|1^c\rangle$, which was also utilized, for example, in Refs. [17,18], using only CNOT gates and in depth $\lceil\log_2(c)\rceil$. We need this subroutine because we often want to perform many CSWAP gates with disjoint targets but controlled by the same register. We can accomplish this in shallower depth by copying the control register and applying the CSWAPs in parallel. Clader et al. [20] avoided this issue by assuming the ability to do FANOUT-CNOT with an arbitrary number of targets, which is a Clifford gate, in a single time step.
The protocol for copying consists of $\lceil\log_2(c)\rceil$ sequential depth-1 operations, labeled by the indices $0, \ldots, \lceil\log_2(c)\rceil - 1$. We refer to these as copying layers. In particular, the j-th copying layer consists of a single layer of $2^j$ parallel CNOTs where the targets of the CNOTs are fresh ancillae. This is illustrated in Fig. 12 and described in Algorithm 3. Notably, while the total number of qubits is c and the depth is $\lceil\log_2(c)\rceil$, most of the ancillae need not be allocated until close to the end of the protocol, and thus the spacetime allocation is Θ(c), rather than O(c log(c)).
Figure 12: Example circuit implementing the state-copy subroutine for c = 8 qubits. An arbitrary single-qubit state α|0⟩ + β|1⟩ is copied into $c = 2^t$ qubits (in the sense of mapping to $\alpha|0^c\rangle + \beta|1^c\rangle$) via a series of t layers of CNOTs, indexed by 0, ..., t − 1. The c − 1 additional qubits are fresh ancilla qubits that begin in the state |0⟩. The depth is $\log_2(c)$, and the spacetime allocation is only Θ(c), as seen in the right diagram, by waiting to allocate ancillae until the final moment that they are needed.
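The copying layers can be simulated classically on a basis-state input to check the depth and the Θ(c) spacetime claim. This is a minimal sketch under stated assumptions: the qubit labeling (control i, target i + 2^j in layer j) is one convenient choice made here, and spacetime is counted as the number of (qubit, layer) cells from the layer in which a qubit is first touched until the end of the copying.

```python
def copy_layers(c: int):
    """CNOT layers for copying one qubit into c = 2**t qubits.

    Layer j holds 2**j CNOTs, each written as (control, target); every target
    is a fresh ancilla that has not been touched before.
    """
    if c < 1 or c & (c - 1):
        raise ValueError("c must be a power of 2")
    t = c.bit_length() - 1
    return [[(i, i + 2 ** j) for i in range(2 ** j)] for j in range(t)]


def apply_copy(bit: int, c: int):
    """Apply the layers to the classical input (bit, 0, ..., 0)."""
    qubits = [bit] + [0] * (c - 1)
    for layer in copy_layers(c):
        for control, target in layer:
            qubits[target] ^= qubits[control]   # CNOT on basis states
    return qubits


def spacetime_allocation(c: int) -> int:
    """Count (qubit, layer) cells with just-in-time ancilla allocation."""
    layers = copy_layers(c)
    t = len(layers)
    cells = t                                   # the source qubit is always active
    for j, layer in enumerate(layers):
        cells += len(layer) * (t - j)           # ancillae allocated in layer j
    return cells


if __name__ == "__main__":
    print(apply_copy(1, 8), len(copy_layers(8)), spacetime_allocation(8))
```

Under this counting the allocation is 2c − 2 cells, linear in c, whereas allocating all c qubits for the full depth would cost c·log2(c) cells.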

Algorithm 3 State-Copy Subroutine
▷ R is a register of size c, where c is assumed to be a power of 2 for simplicity
end for
end procedure
procedure t (R)
    for j in range($2^t$) do ▷ All values of j performed in parallel
    end for
end procedure

With the help of the State-Copy subroutine, we can now effectively perform the parallel CSWAP operations using single-qubit gates and two-qubit CNOT gates, without using the FANOUT-CNOT gate that was required in Ref. [20] to perform the parallel CSWAPs when sharing the same control bits.
We denote a layer of 2 t parallel CSWAPs by CS t , as depicted in Fig. 13 and Algorithm 4.
Figure 13: Implementation of CSt for t = 0, 1, 2, which can be accomplished in one layer of parallel CSWAP gates.

Algorithm 4 Parallel CSWAP
procedure $CS_t$(R, S) ▷ t: log of the number of parallel controlled-swaps; R: control-bit data register with at least $2^t$ qubits; S: target-bit angle register with at least $2^{t+1}$ qubits (note that the subscript here labels the qubit indices of every single register)
    for i in range($2^t$) do ▷ All values of i performed in parallel
    end for
end procedure

A.3 COPY Layer Application Example
In Fig. 14, we illustrate how we can effectively parallelize many CSWAP gates with the same control bits and different target bits with constant ancilla and total depth overhead. The same logic can also apply to Toffoli gates. CNOT layers and O(N) ancilla qubits in the |0⟩ state, same scale as the total circuit depth and total number of ancilla qubits.
Figure 15: SPF circuit for preparing a state with m = 4 qubits, which are left entangled with $2^q - 1$ qubits (garbage). The circuit has depth O(m) and uses an additional $O(2^m)$ ancillae that begin and end in |0⟩. Thus, the total spacetime allocation is upper bounded by $O(m2^m)$. The second half of the SPF circuit is very similar to the reverse of the first half, except that SWAP gates are not present.

for i in range(m) do ▷ Each value of i occupies O(1) depth, shown as one box in Fig. 16
    for q in range(i − 1) do ▷ 1st parallelized sequence in the space between red lines in Fig. 16
        if (i − q) is odd and $3(i-q-1)/2 - 1 \le m - q - 3$ then
        else if (i − q) is even and
    end for
    for q in range(i) do ▷ 2nd parallelized sequence in the space between red lines in Fig. 16
        else if (i − q) is even and $3(i-q)/2 - 2 \le m - q - 3$ then
        end if
    for q in range(i + 1) do ▷ 3rd parallelized sequence in the space between red lines in Fig. 16
    end for
    for q in range(i) do ▷ All values of q performed in parallel
        else if (i − q) is even and $3(i-q)/2 - 2 \le m - q - 3$ then
        end if
    end for
    for q in range(i − 1) do ▷ All values of q performed in parallel
        if (i − q) is odd and $3(i-q-1)/2 - 1 \le m - q - 3$ then
        else if (i − q) is even and

A.5 FLAG
The implementation of the FLAG circuit is similar to that of the SPF circuit, with a few simplifications. Here, the data qubits are only acting as controls, and there is no injection of angles into the data qubits. Thus, we do not need to alternate between copying layers and CSWAP layers; we can simply perform all the copying at the beginning, then all the CSWAPs, and then all the uncopying. The conceptual idea behind the FLAG implementation is that a flag is set in the p-th flag qubit by first flagging qubit 0 and then swapping qubit 0 into the p-th position using a sequence of CSWAP layers with different data qubits acting as the control. We give an example of FLAG for m = 4 in Fig. 17 and pseudocode for FLAG in Algorithm 6.

for q in range(m − i − 1) do
    $CS_i(D_q, F_{q+1+i})$

A.6 Copy Swap Operation
To simplify the circuit logic of the LOADF subroutines, we define another subroutine operation, CopySwap. To define CopySwap, let |k⟩ be an m-qubit computational basis state, written in binary as $k_{m-1}k_{m-2}\ldots k_0$, so that $k_0$ represents the least significant bit. Let |ξ⟩ be an arbitrary single-qubit state. The operation CopySwap enacts an isometry from an (m + 1)-qubit space to a (2M − 1)-qubit space, defined for any m-qubit computational basis state |k⟩ and any arbitrary single-qubit state |ξ⟩. Implementing CopySwap efficiently involves a combination of copying layers and layers of parallel CSWAPs, depicted in Fig. 18 using the copying-layer and $CS_t$ subroutines defined in App. A.1. Similar to previous sections, the goal of performing many copying operations at the earliest time, before the corresponding $CS_t$ sequences, is to ensure that the $CS_t$ sequences can be maximally parallelized into a single step.

$CS_i(R_i, S)$ ▷ This subroutine will also involve additional O(N) ancilla qubits. They do not need to be allocated in each ancilla group until individually called on. Once uncomputed, they can be deallocated (i.e., able to participate in a different computing task).
▷ $D_0$ is the address register of size m. $D_1$, $D_2$, $D_3$ contain O(N) additional ancillae needed to complete all the parallelized CSWAP and $CCR_y$ gates.
▷ $B_0$ is the "buffer" ancilla register of size B to hold the angles for SPF to pump into the data registers. $B_1$ contains O(N) additional ancillae needed to complete all the parallelized CSWAP and $CCR_y$ gates.
▷ $F_0$ is the FLAG register of size $2^{n-m} - 1$ (only required for LOADF$^\dagger$, optional for LOADF; since the FLAG bits are all in the |1⟩ state and thus guaranteed to satisfy the control requirement for the following $CCR_y$ gates, we can effectively treat them as $CR_y$ gates controlled only on the A register). $F_1$ contains O(N) additional ancillae needed to complete all the parallelized CSWAP and $CCR_y$ gates.
▷ A is the register of size M − N to hold the M − N expanded angle address (not shown in Fig. 8).
▷ For the purpose of illustration, we ignore the particular indices in the subroutine parameters. We provide exact circuit-level implementations in 6 with all the indices specified.
for i in range(M) do ▷ all i in parallel, to prepare control bits in the $CCR_y$ layer's control on the A registers
for i in range(B) do ▷ all i in parallel, to prepare control bits in the $CCR_y$ layer's control on the F register
end for
end procedure

When compiling the CSP circuit into the {H, S, T, CNOT} gate set, a significant share of the quantum resource consumption (O(log(n/ϵ)) depth and $O(2^n \log(n/\epsilon))$ spacetime allocation) is contributed by the parallelized $CCR_y$ gates. As illustrated in Fig. 19, a majority of the qubits in B that perform the parallelized $R_y$ gates can be dirty (see the proofing code for a state preparation task with m = 1, n = 3, where we use time-dependent gates before and after each segment to emulate the dirty qubits in the $B_1$ register), and dirty qubits would be in abundant supply in fault-tolerant algorithms such as LCU [9]. The ancilla qubits required to perform the Toffoli operations (i.e., the A and F registers) can be immediately uncomputed using COPY$^\dagger$ once the Toffoli gates are done if the ratio between log(1/ϵ) and n is large, thus making the required spacetime allocation of clean qubits $O((N/M)\log(\log(N)/\epsilon) + N)$ rather than $O(N\log(\log(N)/\epsilon))$. Future work can also aim to modify the circuit structure to allow, e.g., the D, A, and/or F registers to be dirty.
When considering the number of clean qubits required to perform $R_y$ gates in the context of the whole circuit (both the SP and CSP circuits require parallelized $R_y$ gates), we can see that we essentially need O(M) clean qubits for the SP circuit and O(N/M) clean qubits for the CSP circuit. Given a sufficient supply of dirty qubits in the case when we need to reach high quantum state precision using the {H, S, T, CNOT} gate set, we can let $M = \sqrt{N}$ (in other words, m = n/2, which also satisfies Eq. (17)), so the total clean qubit count is upper bounded by $O(\sqrt{N})$, resulting in $O(\sqrt{N}\log(\log(N)/\epsilon) + N)$ spacetime allocation for the clean qubits.
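The choice $M = \sqrt{N}$ can be checked with a short search over the split point m. The sketch below drops the constants hidden in O(M) and O(N/M) and simply minimizes $2^m + 2^{n-m}$; the function names are hypothetical, and whether a given m also satisfies the condition of Eq. (17) is not checked here.

```python
def clean_rotation_qubits(n: int, m: int) -> int:
    """Clean qubits for the rotation layers, with constants dropped:
    M = 2**m for the SP part plus N/M = 2**(n - m) for the CSP part."""
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= n")
    return 2 ** m + 2 ** (n - m)


def best_split(n: int) -> int:
    """Value of m minimizing the clean-qubit count (first minimizer on ties)."""
    return min(range(n + 1), key=lambda m: clean_rotation_qubits(n, m))


if __name__ == "__main__":
    n = 10
    m = best_split(n)
    print("best m:", m, "clean qubits:", clean_rotation_qubits(n, m))
    print("unbalanced m = 2:", clean_rotation_qubits(n, 2))
```

For even n the minimum is at m = n/2, where the count is $2\sqrt{N}$; any unbalanced split is dominated by the larger of M and N/M.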
We illustrate the early ancilla free-up advantage over previous work in Fig. 20 and Fig. 21 using Quirk [48]. When Eq. (17) is satisfied, we can see a much larger ratio between the qubits labeled with the red arrows and the qubits labeled with the green arrows, which shows the full advantage of this work's approach.
B Action of our protocol when the input state is not $|0^n\rangle$

Our state preparation procedure acts on n data qubits and uses a number ℓ = O(N) of ancilla qubits. It implements a unitary operation U on the n + ℓ qubits that sends $|0^n\rangle|0^\ell\rangle \to |\psi\rangle|0^\ell\rangle$, that is, all the ancillae begin and end in |0⟩. However, if the same unitary is performed on a state $|z\rangle|0^\ell\rangle$ for some $z \neq 0^n$ (or more generally, on a superposition of such states), it will no longer be the case that all ancillae are reset to |0⟩. This results from the SPF circuit structure, described in App. A.4, which enacts rotations by swapping the n data qubits into the ancilla register. It exploits the fact that the data qubits are known to start in |0⟩ in order to guarantee that the ancillae are left in the state |0⟩.
This feature could be problematic in applications where the state preparation unitary is a part of a larger algorithm, and does not always act on the state $|0^n\rangle$. Is the procedure still useful in these other cases, and does it still achieve efficient spacetime allocation? Here we argue that the favorable properties of our procedure, in particular its optimal depth and spacetime allocation, extend to other common situations where state preparation appears, for example, implementing reflections and projections.

B.1 Implementing reflections about arbitrary states
A common use of state preparation within larger algorithms is to perform a reflection about a particular state |ψ⟩, that is, the operation $I_n - 2|\psi\rangle\langle\psi|$, where $I_n$ denotes the identity operator on n qubits. We now discuss how to implement this operator using U. First, we make a distinction between different ancilla registers in our state-preparation protocol. For the SP portion of the protocol, we have ancilla registers A and F, each of size M − 1, depicted in Fig. 7. For the CSP portion, we have an additional register B of size N/M − 1 (register F is the same as the one from the SP portion). The figures depict the registers A and B beginning and ending in |0⟩, but this is only the case because some of the data registers have input |0⟩. On the other hand, the F register has the property that it begins in |0⟩ if and only if it ends in |0⟩, regardless of the state of the other registers. Additionally, the implementations of SPF, FLAG, and LOADF introduce additional ancilla registers, as depicted in Figs. 15, 17, and 19, but these ancilla registers are similar to the F register: they begin in |0⟩ if and only if they end in |0⟩, regardless of the state elsewhere in the circuit. Thus, we separate the ℓ ancillae into two groups: the group of $\ell' = M + N/M - 2$ ancillae in registers A and B, and the other $\ell'' = \ell - \ell'$ ancillae.
With this in mind, for inputs whose ancillae all begin in $|0\rangle$, we can express
$$(I_n - 2|\psi\rangle\langle\psi|) \otimes |0^\ell\rangle\langle 0^\ell| = U \left[ \left(I_{n+\ell'} - 2|0^{n+\ell'}\rangle\langle 0^{n+\ell'}|\right) \otimes I_{\ell''} \right] U^\dagger \left(I_n \otimes |0^\ell\rangle\langle 0^\ell|\right).$$
Let us verify the above formula. When the input is $|\psi\rangle$, the action of $U^\dagger$ yields $|0^n\rangle|0^\ell\rangle$, a sign is applied, and application of $U$ outputs the state $-|\psi\rangle|0^\ell\rangle$, as expected. When the input is $|\perp\rangle$ orthogonal to $|\psi\rangle$, application of $U^\dagger$ yields a state $|\perp'\rangle|0^{\ell''}\rangle$, where all we can guarantee is that $|\perp'\rangle$ is orthogonal to $|0^n\rangle|0^{\ell'}\rangle$. The reflection operation does not apply a sign, and subsequent application of $U$ outputs $|\perp\rangle|0^\ell\rangle$, as expected. Crucially, this requires that we perform a reflection about the $n$ data qubits and the $\ell'$ ancillae that make up registers $A$ and $B$ all simultaneously being in $|0\rangle$ (but not the other $\ell''$ ancillae, as they are already guaranteed to be in $|0\rangle$). A reflection about $t$ qubits being in the state $|0\rangle$ can be implemented in depth $O(\log(t))$ using $O(t)$ ancillae and $O(t)$ Toffoli gates by computing whether all qubits are set to $|0\rangle$ in a tree-like fashion, and applying a phase if the result is 1. Here $t = n + N/M + M - 2$, so the depth is $O(n)$. The spacetime allocation is upper bounded by the depth times the number of active qubits, i.e., $O(n \cdot \max(N/M, M))$. Since we choose $M$ such that $\max(N/M, M) = O(N/n)$, this spacetime allocation is at most $O(N)$. In conclusion, our state preparation method can be used to perform reflections about arbitrary states on $n$ qubits in depth $O(n)$ and spacetime allocation $O(N)$.
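The conjugation argument above can be checked numerically on a toy instance. The following is a minimal sketch, not the protocol itself: it ignores the ancilla registers entirely and uses a real Householder matrix as a hypothetical stand-in for the state-preparation unitary $U$ (the helper names `householder_prep`, `reflection_via_prep`, and `direct_reflection` are illustrative assumptions). It confirms that conjugating a reflection about the all-zeros basis state by $U$ gives the reflection about $|\psi\rangle$ for real amplitude vectors.

```python
import math


def householder_prep(psi):
    """Real orthogonal matrix U with U e_0 = psi.

    Hypothetical stand-in for the state-preparation unitary; psi must be
    a real unit vector. Uses U = I - 2 v v^T / (v^T v) with v = e_0 - psi.
    """
    dim = len(psi)
    v = [(1.0 if i == 0 else 0.0) - psi[i] for i in range(dim)]
    norm2 = sum(x * x for x in v)
    identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    if norm2 < 1e-15:  # psi already equals e_0, so U is the identity
        return identity
    return [[identity[i][j] - 2.0 * v[i] * v[j] / norm2 for j in range(dim)]
            for i in range(dim)]


def matmul(a, b):
    """Plain-Python matrix product of two square or rectangular lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def reflection_via_prep(psi):
    """Compute U (I - 2|0><0|) U^T, i.e. 'unprepare, flip sign of |0>, prepare'."""
    dim = len(psi)
    u = householder_prep(psi)
    r0 = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        r0[i][i] = -1.0 if i == 0 else 1.0
    return matmul(matmul(u, r0), transpose(u))


def direct_reflection(psi):
    """Compute I - 2|psi><psi| directly from the amplitudes."""
    dim = len(psi)
    return [[(1.0 if i == j else 0.0) - 2.0 * psi[i] * psi[j] for j in range(dim)]
            for i in range(dim)]


def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(len(a)) for j in range(len(a)))


if __name__ == "__main__":
    raw = [1.0, 2.0, 3.0, 4.0]
    norm = math.sqrt(sum(x * x for x in raw))
    psi = [x / norm for x in raw]
    print(max_abs_diff(reflection_via_prep(psi), direct_reflection(psi)))
```

In the actual protocol the middle reflection must also include the $A$ and $B$ registers, as discussed above; the sketch only illustrates the algebraic identity on the data register.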

B.2 Implementing projections and block-encodings involving projections
Another operator where state preparation is relevant is the projection onto the complement of an arbitrary state $|\psi\rangle$, that is, $P = I_n - |\psi\rangle\langle\psi|$. This operator appears, for example, in the matrices that form the adiabatic path used in query-optimal quantum linear systems solvers [46]. In that context, Appendix F of Ref. [46] explains how a block-encoding of the relevant matrix can be formed given a state preparation unitary that maps $|0\rangle \to |\psi\rangle$. This larger block-encoding involves constructing a block-encoding of the projector $P$, which is reduced to implementing a reflection about $|\psi\rangle$ (see Figs. 1-6 of Ref. [49], especially Fig. 4, for a quantum circuit interpretation of the discussion in Appendix F of Ref. [46]). As illustrated above, our method can perform reflections with optimal depth and spacetime allocation, and thus it can also be used in this setting.
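One standard way to see why a reflection suffices is the following short derivation. It is a sketch of a generic linear-combination-of-unitaries argument and is not claimed to be the exact construction of Refs. [46] or [49]; the symbol $R_\psi$ and the single ancilla qubit with Hadamard gates $H$ are notation assumed here for illustration.

```latex
\begin{align}
  R_\psi &= I_n - 2|\psi\rangle\langle\psi|, \\
  P &= I_n - |\psi\rangle\langle\psi| = \tfrac{1}{2}\left(I_n + R_\psi\right), \\
  \left(\langle 0| H \otimes I_n\right)
  \left(|0\rangle\langle 0| \otimes I_n + |1\rangle\langle 1| \otimes R_\psi\right)
  \left(H |0\rangle \otimes I_n\right)
  &= \tfrac{1}{2}\left(I_n + R_\psi\right) = P .
\end{align}
```

Under these assumptions, a single controlled application of the reflection, sandwiched between Hadamard gates on one ancilla qubit, has $P$ as its top-left block, so the cost of block-encoding $P$ is dominated by the cost of the reflection.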

C Complex amplitudes
The constructions presented in the main text can prepare arbitrary states with real non-negative coefficients.
Extending the construction to work for arbitrary coefficients is straightforward. The SP portion of the SP+CSP protocol is unchanged, as the state $|\phi\rangle$, as defined in Eq. (5), is insensitive to any phases in the vector $x$ (all $y_i$ are positive). On the other hand, the CSP portion of the protocol must be slightly modified. For each $k = 0, \ldots, M - 1$ and $j = 0, \ldots, N/M - 1$, define the phase of the amplitude $x_{j+kN/M}$, namely $x_{j+kN/M}/|x_{j+kN/M}|$ (if $x_{j+kN/M} = 0$, then the phase can be defined arbitrarily).
Next, we redefine the state $|\theta_{s,p}\rangle$ to incorporate these phases. To prepare the redefined state, we need to apply a single-qubit rotation about the $y$-axis, and then, if $s = n - m - 1$, we must apply a single-qubit rotation about the $z$-axis, as well as a global phase (the global phase will be important when we add controls). To prepare a state with complex amplitudes, the only aspect of the protocol that has to change is the LOADF operation, which appears twice in the circuit for $U_{CSP}$ in Fig. 8. As seen in Fig. 19, implementing LOADF involves a doubly controlled $R_y$ gate. To account for complex amplitudes, we must augment this step with a doubly controlled $R_z$ gate, as well as a doubly controlled global phase (which is equivalent to a singly controlled $R_z$ gate). These additions lead to at most a constant-factor increase in the depth and spacetime allocation. In the approximate $\{\mathrm{H, S, T, CNOT}\}$ gate set, all rotations need only be synthesized to error $O(\epsilon/n)$ to achieve overall error $\epsilon$ in the state preparation protocol, requiring only $O(\log(n/\epsilon))$ gates per rotation. None of the complexity statements in the paper are impacted.
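The role of the three ingredients (a $y$-rotation, a $z$-rotation, and a global phase) can be illustrated on a single qubit. The sketch below is an assumption-laden toy example, not the paper's LOADF circuit: it uses the conventions $R_y(\theta)|0\rangle = \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$ and $R_z(\varphi) = \mathrm{diag}(e^{-i\varphi/2}, e^{i\varphi/2})$, and the function name `prepare_complex_qubit` is hypothetical. It shows that these three operations reproduce an arbitrary normalized pair of complex amplitudes exactly.

```python
import cmath
import math


def prepare_complex_qubit(a, b):
    """Return the state obtained from |0> by R_y, then R_z, then a global phase.

    a, b are the target complex amplitudes with |a|^2 + |b|^2 = 1.
    Conventions (assumed for this sketch):
      R_y(theta)|0> = (cos(theta/2), sin(theta/2))
      R_z(phi)      = diag(exp(-i phi/2), exp(+i phi/2))
    """
    mag_a, mag_b = abs(a), abs(b)
    # The y-rotation angle fixes the magnitudes only.
    theta = 2.0 * math.atan2(mag_b, mag_a)
    # Phases of the amplitudes; a zero amplitude may take an arbitrary phase.
    alpha = cmath.phase(a) if mag_a > 0 else 0.0
    beta = cmath.phase(b) if mag_b > 0 else 0.0
    phi = beta - alpha            # z-rotation angle sets the relative phase
    gamma = 0.5 * (alpha + beta)  # global phase fixes the remaining overall phase

    state = [math.cos(theta / 2.0), math.sin(theta / 2.0)]  # after R_y
    state = [cmath.exp(-0.5j * phi) * state[0],
             cmath.exp(0.5j * phi) * state[1]]              # after R_z
    global_phase = cmath.exp(1j * gamma)
    return [global_phase * state[0], global_phase * state[1]]


if __name__ == "__main__":
    target_a = 0.6 * cmath.exp(0.3j)
    target_b = 0.8 * cmath.exp(-1.1j)
    out = prepare_complex_qubit(target_a, target_b)
    print(abs(out[0] - target_a), abs(out[1] - target_b))
```

The global phase is irrelevant for an uncontrolled single-qubit preparation, but once the operation is controlled it becomes a relative phase between branches, which is why the text above retains it.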

Figure 6: Circuit for the SP+CSP protocol that implements state preparation of an arbitrary n-qubit state |ψ⟩ by first creating an m-qubit state |ϕ⟩ and then performing controlled state preparation into the final n − m qubits.The square control indicates a different operation is performed on the final n − m qubits for each setting of the first m qubits.

Figure 7: Circuit that implements $U_{SP}$, which prepares an arbitrary $m$-qubit state from the state $|0^m\rangle$ with the assistance of $O(2^m)$ ancillae that begin and end in $|0\rangle$. We let $M = 2^m$ and clarify that the controlled rotation gate denotes $M - 1$ controlled rotations occurring in parallel, by different angles $\theta_{s,p}$ with $s = 0, \ldots, m - 1$ and $p = 0, \ldots, 2^s - 1$. Note that the implementation of SPF and FLAG, given in the appendix, involves $O(M)$ additional unshown ancilla qubits.

Figure 8: Circuit that implements the $U_{CSP}$ operation, which prepares an arbitrary $(n - m)$-qubit state for each setting of an $m$-qubit control register, with the assistance of $O(2^{n-m})$ ancillae that begin and end in $|0\rangle$. We let $M = 2^m$ and $N = 2^n$. Note that implementations of LOADF, SPF, and FLAG, given in the appendix, involve $O(N)$ additional unshown ancillae. We also note that in the actual implementation, one can reduce the first LOADF operation so that it is controlled only by the $B$ register (not on the $F$ register, which is guaranteed to be in the $|1^{N/M-1}\rangle$ state at this stage in the circuit) in order to save a constant depth. We choose to define the LOADF operator in this way only to avoid another definition of the later LOADF$^\dagger$.
7. The parallel rotation gates and parallel controlled rotation gates are accomplished in depth $O(1)$ ($D_{\mathrm{approx}} = O(\log(m/\epsilon))$), while the SPF and FLAG circuits have depth $O(m)$, yielding a total depth $O(m)$ (total $D_{\mathrm{approx}} = O(m + \log(1/\epsilon)) = O(\log(2^m/\epsilon))$). The required total qubits are the $m + 2M - 2$ that appear in Fig. 7, plus an additional $O(M)$ needed to perform the SPF and FLAG operations. The product of space and depth gives an immediate upper bound on the spacetime allocation of $O(m 2^m)$ ($SA_{\mathrm{approx}} = O(2^m \log(2^m/\epsilon))$), and indeed for $U_{SP}$ there is a matching lower bound (up to a constant factor). We now consider $U_{CSP}$. Here, we again note that to perform the operation to precision $\epsilon$, it suffices to perform each single-qubit rotation gate to error $\epsilon/n$. The LOADF operation has depth $O(n)$ ($D_{\mathrm{approx}} = O(n + \log(1/\epsilon))$). Meanwhile, in this context, the SPF and FLAG operations each have depth $O(n - m)$, yielding total depth $O(n)$ (total $D_{\mathrm{approx}} = O(n + \log(1/\epsilon))$).
(19), we can now let $U_d = U_{d(\mathrm{CSP})} U_{d(\mathrm{SP})}$. We first execute all the $U_{d(\mathrm{SP})}$ circuits in parallel and let $m = n - \lceil \log_2(n) \rceil$.
9. If we choose $m = n - \lceil \log_2(n) \rceil$, the first operation described in Eq. (20) can be achieved in $O(\log(N))$ depth and $O(N)$ spacetime allocation. Since each of the $U_{d(\mathrm{SP})}$ is using $\frac{N}{\log(N)}$ qubits, up to

Figure 9: Illustration of stacking different SP+CSP circuits together ($w = 4$).

of the appendix, which requires $O(N)$ ancillae in the $A$ register. Note that these ancillae can be quickly freed up after the parallelized CCR$_y$ operations by the CopySwap operation under $O(N)$ spacetime allocation, as shown in Fig. 19. Thus we can indent the $U_{CSP}$ part of the later $U_d$'s by some constant number of layers $k$. Besides the $A$ register that uses $O(N)$ ancillae, we also have the $C$, $B$, and $F$ registers (as shown in Fig. 19) in the $U_{CSP}$ operation. By choosing $m = n - \lceil \log_2(n) \rceil$, we would have negligible ancilla counts for the $C$ register ($m \cdot \frac{N}{M} = m \log(N)$ ancillae), the $B$ register ($M = \frac{N}{\log(N)}$ ancillae), and the $F$ register ($\frac{N}{M} = n$ ancillae). Therefore, they will not contribute much to the overall spacetime allocation cost.
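The effect of the choice $m = n - \lceil \log_2(n) \rceil$ on the quantities quoted above can be tabulated directly. The following is a small arithmetic sketch; the function names `ceil_log2` and `csp_parameters` are hypothetical helpers, and no circuit is built. Note that $N/M = 2^{\lceil \log_2 n \rceil}$ equals $n$ exactly only when $n$ is a power of two; in general it lies between $n$ and $2n$, which is consistent with the asymptotic statements in the text.

```python
def ceil_log2(n: int) -> int:
    """Exact integer ceil(log2(n)) for n >= 1, avoiding floating point."""
    return (n - 1).bit_length()


def csp_parameters(n: int) -> dict:
    """Tabulate m, M, N, N/M, and m * N/M for the choice m = n - ceil(log2 n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    m = n - ceil_log2(n)
    big_n = 1 << n          # N = 2^n
    big_m = 1 << m          # M = 2^m
    ratio = big_n // big_m  # N/M = 2^(n - m) = 2^ceil(log2 n)
    return {
        "m": m,
        "N": big_n,
        "M": big_m,
        "N_over_M": ratio,
        "m_times_N_over_M": m * ratio,
    }


if __name__ == "__main__":
    for n in (4, 8, 10, 16):
        print(n, csp_parameters(n))
```

For example, with $n = 10$ this gives $m = 6$, $M = 64$, and $N/M = 16$, so $M$ and $N/M$ are both far smaller than $N = 1024$.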

Figure 10: Illustration of stacking a portion of the U CSP circuit (LOADF) with indentation k = 2.

Figure 14: Circuit equivalence with and without COPY layers. Note that we are only introducing an additional $O(\log(N))$ CNOT layers and $O(N)$ ancilla qubits in the $|0\rangle$ state, which is the same scale as the total circuit depth and total number of ancilla qubits.

Algorithm 5: SPF Subroutine

Figure 16: First part of SPF for $m = 7$. The circuit consists of 7 data registers (D0-D6) and 7 ancilla registers (A0-A6). Each space between the red lines in the circuit represents one iteration over the $i$ values in Algorithm 5. There are 3 parallelized gate sequences in each red-line space, separated by the blue dashed lines. The first parallelized gate sequence corresponds to the first for loop in lines 3-9. The second parallelized gate sequence corresponds to the second for loop in lines 10-16. The third parallelized gate sequence corresponds to the third for loop in lines 17-25. (Notice that, depending on the condition, not all 3 sequences appear in every box, for instance at the beginning of the circuit.)

Algorithm 6: FLAG Operation

Figure 17: FLAG circuit for $m = 4$. All the gate sequences between each pair of blue dashed lines can be executed in parallel.

Algorithm 7

Figure 18: Left: action of the CopySwap operation, which simultaneously copies $m$ control bits set to $|k\rangle$ and moves a target register into the $k$-th position of a target register. The input is $m + 1$ qubits and the output is $2M - 1$ qubits, as several fresh ancillae are introduced during the protocol. Right: implementation of the CopySwap operation using copying layers and swap operations for $m = 5$. The total depth is $m$. At layer $t = 0, \ldots, m - 1$, there are $m + (2^t - 1)(m - t) + 2^{t+1}$ active qubits, so the total spacetime allocation is $\sum_{t=0}^{m-1} \left(m + (2^t - 1)(m - t) + 2^{t+1}\right) = O(2^m)$.
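The $O(2^m)$ claim for the CopySwap spacetime allocation can be checked by evaluating the sum explicitly. The sketch below is pure arithmetic on the per-layer active-qubit count stated in the caption; the function names and the closed form $2^{m+2} + (m^2 - 3m)/2 - 4$ are worked out here for illustration and are not taken from the paper.

```python
def copyswap_active_qubits(m: int, t: int) -> int:
    """Active qubits at layer t, using the count stated in the Fig. 18 caption."""
    return m + (2**t - 1) * (m - t) + 2 ** (t + 1)


def copyswap_spacetime(m: int) -> int:
    """Total spacetime allocation: sum of active qubits over layers t = 0..m-1."""
    return sum(copyswap_active_qubits(m, t) for t in range(m))


def copyswap_spacetime_closed_form(m: int) -> int:
    """Closed form of the same sum: 2^(m+2) + (m^2 - 3m)/2 - 4.

    m^2 - 3m = m(m - 3) is always even, so integer division is exact.
    """
    return 2 ** (m + 2) + (m * m - 3 * m) // 2 - 4


if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        total = copyswap_spacetime(m)
        # The ratio to 2^m approaches 4 as m grows, confirming O(2^m) scaling.
        print(m, total, total / 2**m)
```

For the $m = 5$ instance shown in the figure, the five layers have 7, 13, 22, 35, and 52 active qubits, for a total of 129, which is about $4 \cdot 2^5$.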

    for i in range(N − M) do    ▷ All i in parallel, to load N − M angles into N − M qubits
        CCRy(F, A, B, θ)    ▷ O(1) (O(log(n/ϵ))) depth
    end for
    Setup† Subroutine
end procedure
procedure Setup Subroutine(D, B, A, F)
    CopySwap(D, A, m)    ▷ O(m) depth
    ▷ The following 3 for loops can be done in parallel
    for i in range(B) do    ▷ all i in parallel
        CopySwap(D, B, n − m)    ▷ O(n − m) depth

1, previous methods (e.g., Clader et al., shown in Fig. 20, see code) require $O(N)$ ancilla qubits to be entangled with the $O(n)$ data qubits for $O(n)$ depth, leading to $O(n 2^n)$ spacetime allocation. This work (illustrated in Fig. 21, see code, and also in the right part of Fig. 1) requires most of the $O(N)$ qubits to be allocated only briefly.

Figure 20: First part of Clader et al. [20] (i.e., $U_{SP}$ without COPY). Note that all the ancilla qubits in the $A$ register are left entangled with the data qubit in the $D$ register (indicated by the red arrows). These $O(N)$ ancilla qubits are not returned to $|0\rangle$ until after the execution of the depth-$O(n)$ FLAG subroutine.

Figure 21: The $U_{SP}$ + LOADF part of the circuit developed in this work, on an example with $m = 1$ and $n = 3$. Note that most of the ancilla qubits in the $B$ register are freed up (indicated by the green arrows). The remaining $O(M/N)$ ancilla qubits (indicated by the red arrows) are freed up after the following SPF circuit, which takes only $O(m)$ depth rather than $O(n)$ depth as in the previous case. All other ancilla qubits in the $D$, $B$, and $A$ registers are freed up almost immediately after initialization. Note that as $N$ grows larger, and if the relation in Eq. (17) is satisfied, we see a much larger ratio between the qubits labeled with the red arrows and the qubits labeled with the green arrows, which shows the full advantage of this work's approach.

Table 3: Individual complexity summary for each step of the state preparation protocol ($M = 2^m$, $N = 2^n$).