Optimal (controlled) quantum state preparation and improved unitary synthesis by quantum circuits with any number of ancillary qubits

As a cornerstone for many quantum linear algebraic and quantum machine learning algorithms, controlled quantum state preparation (CQSP) aims to provide the transformation of $|i\rangle |0^n\rangle \to |i\rangle |\psi_i\rangle $ for all $i\in \{0,1\}^k$ for the given $n$-qubit states $|\psi_i\rangle$. In this paper, we construct a quantum circuit for implementing CQSP, with depth $O\left(n+k+\frac{2^{n+k}}{n+k+m}\right)$ and size $O\left(2^{n+k}\right)$ for any given number $m$ of ancillary qubits. These bounds, which can also be viewed as a time-space tradeoff for the transformation, are optimal for any integer parameters $m,k\ge 0$ and $n\ge 1$. When $k=0$, the problem becomes the canonical quantum state preparation (QSP) problem with ancillary qubits, which asks for efficient implementations of the transformation $|0^n\rangle|0^m\rangle \to |\psi\rangle |0^m\rangle$. This problem has many applications and has been studied extensively, yet its circuit complexity remains open. Our construction completely solves this problem, pinning down its depth complexity to $\Theta(n+2^{n}/(n+m))$ and its size complexity to $\Theta(2^{n})$ for any $m$. Another fundamental problem, unitary synthesis, asks to implement a general $n$-qubit unitary by a quantum circuit. Previous work shows a lower bound of $\Omega(n+4^n/(n+m))$ and an upper bound of $O(n2^n)$ for $m=\Omega(2^n/n)$ ancillary qubits. In this paper, we quadratically shrink this gap by presenting a quantum circuit of depth $O\left(n2^{n/2}+\frac{n^{1/2}2^{3n/2}}{m^{1/2}}\right)$.


Introduction
Quantum algorithms use quantum effects such as entanglement and coherence to process information with an efficiency beyond what any classical counterpart can achieve. In the past decade, many quantum machine learning algorithms [1] have shared a common subroutine, quantum state preparation (QSP), which loads a $2^n$-dimensional complex-valued vector $v = (v_x : x \in \{0,1\}^n)^T \in \mathbb{C}^{2^n}$ into an $n$-qubit quantum state $|\psi_v\rangle = \sum_{x\in\{0,1\}^n} v_x |x\rangle$. These include quantum principal component analysis [2], quantum recommendation systems [3], quantum singular value decomposition [4], quantum linear system algorithms [5,6], quantum clustering [7,8], quantum support vector machines [9], etc. Quantum state preparation is also a key step in many Hamiltonian simulation algorithms [10][11][12][13].
Some of these quantum machine learning algorithms, such as the quantum linear system algorithm [6], quantum recommendation systems [3] and quantum k-means clustering [7], need an oracle that can coherently prepare many states: $|i\rangle|0^n\rangle \to |i\rangle|\psi_i\rangle$, for all $i \in \{0,1\}^k$. We shall refer to this as the controlled quantum state preparation (CQSP) problem. The QSP and CQSP problems are also used in quantum walk algorithms such as the ones by Szegedy [14] and by MNRS [15]. Given a general $N \times N$ state transition probability matrix $P = [P_{xy}]_{x,y\in[N]}$, quantum walk algorithms often call three subroutines: Setup, Check, and Update. The Setup procedure needs to prepare the state $\sum_x \sqrt{\pi(x)}\,|x\rangle$, where $\pi$ is the stationary distribution of $P$. The Update procedure needs to realize $|x\rangle|0^{\log N}\rangle \to |x\rangle \sum_y \sqrt{P_{xy}}\,|y\rangle$, a typical CQSP problem. More generally, quantum algorithms can be represented as unitaries, which need to be implemented by quantum circuits for a digital quantum computer to run the algorithm. What are the minimum depth and size to which any unitary operation can be compressed? This paper also addresses this unitary synthesis (US) problem by presenting a parametrized quantum circuit that can implement any given unitary operation.
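To make the Update step concrete, the following is a minimal sketch of how the $2^k$ target states of a CQSP instance arise from a transition matrix. It assumes the usual convention that the amplitudes are the square roots of the transition probabilities; the function name `update_amplitudes` and the 4-state chain are hypothetical and serve only as an illustration.

```python
import math

def update_amplitudes(P):
    """For each row x of a row-stochastic matrix P, return the amplitude
    vector (sqrt(P[x][y]))_y of the state prepared when the control is |x>."""
    states = []
    for row in P:
        if any(p < 0 for p in row) or not math.isclose(sum(row), 1.0):
            raise ValueError("each row must be a probability distribution")
        states.append([math.sqrt(p) for p in row])
    return states

# Hypothetical 4-state chain: N = 4, so log N = 2 target qubits
# and k = 2 control qubits.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.5, 0.25],
    [0.0, 0.0, 0.5, 0.5],
]
states = update_amplitudes(P)
# Each row yields a unit vector, i.e. a valid 2-qubit state |psi_x>.
norms = [sum(a * a for a in psi) for psi in states]
print(norms)
```

Each row of `states` is one target state $|\psi_x\rangle$ of the CQSP instance; the circuit constructions in this paper address how to prepare all of them coherently.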
In all these CQSP, QSP, and US problems, we hope to find quantum circuits that are as simple as possible for the sake of efficient execution and physical realization. Standard measures for quantum circuits include depth, size (i.e. the number of gates), and the number of qubits. The depth of a circuit corresponds to time and the number of qubits to space. The rapid growth in the number of available qubits provides opportunities to trade space for time, and indeed it has been found that ancillary qubits are useful in compressing the circuit depth for many tasks including CQSP, QSP, and US. It is a fundamental question to pin down the time-space tradeoff, or in circuit complexity language, the depth-qubit number tradeoff, for both quantum state preparation and general unitary synthesis problems.
Controlled quantum state preparation and quantum state preparation. Little previous work focuses specifically on CQSP by quantum circuits [16][17][18]. QSP, in contrast, has been extensively studied. Bergholm et al. presented a QSP circuit with $2^{n+1} - 2n - 2$ CNOT gates and depth $O(2^n)$, without ancillary qubits [19]. Plesch and Brukner [20] improved the number of CNOT gates to $\frac{23}{24}2^n - 2^{\frac{n}{2}+1} + \frac{5}{3}$ for even $n$, and $\frac{115}{96}2^n$ for odd $n$. Ref. [20] also gives a depth upper bound of $\frac{23}{48}2^n$ for even $n$ and $\frac{115}{192}2^n$ for odd $n$. The best result was obtained in [21], where the authors achieve depth $O(2^n/n)$, which is optimal.
Zhang et al. [22] presented a QSP circuit of depth $O(n^2)$ by using $O(4^n)$ ancillary qubits, but the circuit involves measurement and the probability of successfully generating the target state is only $\Omega(1/(\max_i |v_i|^2 2^n))$.
The best previous result on QSP for an arbitrary number $m$ of ancillary qubits is by [21], where the authors presented a quantum circuit of depth $O\left(n + \frac{2^n}{n+m}\right)$ and size $O(2^n)$ for $m = O\left(\frac{2^n}{n\log n}\right)$ or $m = \Omega(2^n)$, which is asymptotically optimal. For $m \in \left[\Omega\left(\frac{2^n}{n\log n}\right), O(2^n)\right]$, they proposed a QSP circuit of depth $O(n\log n)$, which is only a factor of $O(\log n)$ off from the lower bound $\Omega\left(\max\left\{n, \frac{2^n}{n+m}\right\}\right)$. Later, Rosenthal independently constructed a QSP circuit of depth $O(n)$ using $O(n2^n)$ ancillary qubits [23]. The result also showed that $n$-qubit quantum state preparation is in $\mathrm{QAC}^0_f$ with the same number of ancillary qubits. After that, [24] gave yet another proof of the $O(n)$ depth upper bound using $O(2^n)$ ancillary qubits. Neither [23] nor [24] gave results for general $m$.
A related study [25] prepares a quantum state in the unary encoding $\sum_{k=0}^{2^n-1} v_k |e_k\rangle$ instead of the binary encoding $\sum_{k=0}^{2^n-1} v_k |k\rangle$, where $e_k \in \{0,1\}^{2^n}$ is the vector with the $k$-th bit being 1 and all other bits being 0. Binary encoding is more space-efficient than unary encoding, because binary encoding QSP uses $n$ qubits whereas unary encoding uses $2^n$ qubits. In [25], Johri et al. prepared a unary encoding quantum state by a circuit of depth $O(n)$. Moreover, by encoding $k$ as a $d$-dimensional tensor $(k_1, k_2, \ldots, k_d)$, they extended the QSP circuit construction and obtained circuit depth $O\left(\frac{n}{d}2^{n-n/d}\right)$. If $d = n$, their encoding of $k$ is the binary encoding, and the depth upper bound is $O(2^n)$.
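The difference in qubit count between the two encodings can be seen in a short sketch on basis labels. The helper names `binary_encoding` and `unary_encoding` are hypothetical, and the convention that bit positions are counted from the left starting at 0 is an assumption made only for this illustration.

```python
def binary_encoding(k, n):
    """Label of the basis state |k> on n qubits (n >= 1)."""
    if not 0 <= k < 2 ** n:
        raise ValueError("k must satisfy 0 <= k < 2**n")
    return format(k, "0{}b".format(n))

def unary_encoding(k, n):
    """Label of the basis state |e_k> on 2**n qubits: a single 1 at
    position k (counted from the left, starting at 0)."""
    if not 0 <= k < 2 ** n:
        raise ValueError("k must satisfy 0 <= k < 2**n")
    bits = ["0"] * (2 ** n)
    bits[k] = "1"
    return "".join(bits)

n = 3
for k in range(2 ** n):
    # n qubits for the binary label versus 2**n qubits for the unary label
    print(k, binary_encoding(k, n), unary_encoding(k, n))
```

For $n = 3$ the binary label has 3 bits while the unary label has 8, which is the exponential space gap mentioned above.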
In this paper, we first give new quantum circuit constructions for CQSP with quantum content. Taking $k = 0$, this immediately implies the following result for QSP.
Theorem 2 (QSP). For any $m \ge 0$, any $n$-qubit quantum state $|\psi_v\rangle$ can be generated by a quantum circuit, using single-qubit gates and CNOT gates, of depth $O\left(n + \frac{2^n}{n+m}\right)$ and size $O(2^n)$ with $m$ ancillary qubits. These bounds are optimal for any $m \ge 0$.
These bounds match the known lower bounds of circuit depth and size for QSP: $\Omega\left(\max\left\{n, \frac{2^n}{n+m}\right\}\right)$ for depth [21,26] and $\Omega(2^n)$ for size [27]. Thus we completely characterize the depth and size complexity for QSP with any number $m$ of ancillary qubits.
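The time-space tradeoff in Theorem 2 can be explored numerically. The sketch below evaluates the expression $n + 2^n/(n+m)$ with all hidden constants set to 1, which is an assumption made purely for illustration; the function name `qsp_depth_bound` is hypothetical.

```python
def qsp_depth_bound(n, m):
    """Evaluate n + 2^n / (n + m), the asymptotic QSP depth expression,
    with constant factors ignored (illustrative assumption)."""
    if n < 1 or m < 0:
        raise ValueError("need n >= 1 and m >= 0")
    return n + 2 ** n / (n + m)

n = 20
for m in (0, 2 ** 10, 2 ** 16, 2 ** 20, 2 ** 24):
    # The second term shrinks as m grows until the additive n dominates.
    print(m, qsp_depth_bound(n, m))
```

Once $m$ reaches the order of $2^n/n$, the value is dominated by the additive $n$ term, so further ancillary qubits no longer reduce the bound.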
Unitary synthesis. For general unitary synthesis, Barenco et al. constructed a circuit involving $O(n^3 4^n)$ CNOT gates [28]. Knill reduced the circuit size to $O(n4^n)$ in [29], which was further improved by Vartiainen et al. [30] and Mottonen and Vartiainen [31] to $O(4^n)$, the same order as the lower bound of $\frac{1}{4}(4^n - 3n - 1)$ for the number of CNOT gates [27]. These results assume no ancillary qubits. When there are $m$ ancillary qubits available, Ref. [21] presented a quantum circuit for $n$-qubit general unitary synthesis of depth $O\left(n2^n + \frac{4^n}{n+m}\right)$, and also proved a depth lower bound of $\Omega\left(n + \frac{4^n}{n+m}\right)$. Hence, their circuit depth is asymptotically optimal when $m = O(2^n/n)$, and leaves a gap of $\left[\Omega\left(n + \frac{4^n}{m}\right), O\left(n2^n + \frac{4^n}{m}\right)\right]$ when $m = \Omega(2^n/n)$. By using Grover search in a clever way, Rosenthal improved the depth upper bound to $O(n2^{n/2})$ with $m = \Theta(n4^n)$ ancillary qubits [23], but did not give results for smaller $m$. Based on cosine-sine decomposition and Grover search, we optimize the circuit depth for general unitary synthesis.
Theorem 3 (Unitary synthesis). For any $m \ge 0$, any $n$-qubit unitary $U \in \mathbb{C}^{2^n \times 2^n}$ can be implemented by a quantum circuit with $m$ ancillary qubits, using single-qubit gates and CNOT gates, of depth $O\left(n2^{n/2} + \frac{n^{1/2}2^{3n/2}}{m^{1/2}}\right)$ when $\Omega(2^n/n) \le m \le O(4^n/n)$. In particular, the depth is $O(n2^{n/2})$ when $m = \Theta(4^n/n)$.
The improvement over the result in [23] is two-fold. First, to achieve the same minimum depth of $O(n2^{n/2})$, we need fewer ancillary qubits: we need $m = \Theta(4^n/n)$, compared to $m = \Theta(n4^n)$ used in [23]. Second, our method works for any $m$, as opposed to the one in [23], which needs $m = \Theta(n4^n)$ ancillary qubits. Note that there is still a gap between the upper and lower bounds when $m = \omega(2^n/n)$, which is left as an interesting open question for future studies.
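As a sanity check on the depth expression $n2^{n/2} + n^{1/2}2^{3n/2}/m^{1/2}$ stated in the abstract, the sketch below evaluates it with all hidden constants set to 1 (an illustrative assumption; the name `us_depth_bound` is hypothetical). At $m = 4^n/n$ the two terms coincide, so the value is exactly $2n2^{n/2}$, matching the claimed minimum depth of $O(n2^{n/2})$.

```python
import math

def us_depth_bound(n, m):
    """Evaluate n*2^(n/2) + sqrt(n)*2^(3n/2)/sqrt(m), ignoring constants.
    The expression is only claimed for 2^n/n <= m <= 4^n/n (up to constants)."""
    if n < 1 or m <= 0:
        raise ValueError("need n >= 1 and m > 0")
    return n * 2 ** (n / 2) + math.sqrt(n) * 2 ** (1.5 * n) / math.sqrt(m)

n = 4
m_min_depth = 4 ** n // n          # m = 4^n / n = 64
print(us_depth_bound(n, m_min_depth))   # both terms equal n * 2^(n/2)
print(us_depth_bound(n, 2 ** n // n))   # smaller m gives a larger bound
```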
Theorem 3 and previous results on circuit depth for general unitary synthesis are shown in Figure 1.
Figure 1: Circuit depth bounds for general unitary synthesis: depth upper bound in [21], depth lower bound in [21], depth upper bound in Theorem 3, and depth upper bound in [23].
Relation to QRAM. The CQSP problem has a close relation to quantum random access memory (QRAM). In the original proposal [32,33], QRAM aims to provide the transformation of $|i\rangle|0^n\rangle \to |i\rangle|\psi_i\rangle$ for all $i \in \{0,1\}^k$, where the $|\psi_i\rangle$'s are states in $\{|0\rangle, |1\rangle\}^{\otimes n}$ or in $(\mathbb{C}^2)^{\otimes n}$, depending on whether the QRAM stores classical or quantum information as its content. Many quantum algorithms, such as those mentioned at the beginning of this section, assume an efficient implementation of QRAM with classical content, and the hope is to have a hardware device to realize this. Despite some conceptual designs, working QRAM devices are yet to be convincingly demonstrated, even at a small scale. Results in this paper address a related and fundamental question of implementing QRAM (with quantum content) by standard quantum circuits, and show tight depth and size bounds for it.
Organization. The rest of this paper is organized as follows. In Section 2, we introduce notation and review some previous results. In Section 3, we present a quantum circuit for (controlled) quantum state preparation using an arbitrary number of ancillary qubits. Then we show a quantum circuit for general unitary synthesis in Section 4.
Quantum gates and circuits. An $n$-qubit gate/unitary is a $2^n \times 2^n$ unitary operation on $n$ qubits. The identity unitary is usually denoted by $I_n$. The $X$ gate is the single-qubit gate that flips the basis states $|0\rangle$ and $|1\rangle$. Single-qubit gates are known to have the following factorization. A CNOT gate acts on two qubits, one control qubit and one target qubit. The gate flips the basis states $|0\rangle$ and $|1\rangle$ on the target qubit, conditioned on the control qubit being in $|1\rangle$. A quantum circuit on $n$ qubits implements a unitary transform of dimension $2^n \times 2^n$. A quantum circuit may consist of different types of gates. One typical set of gates contains all 1-qubit gates and 2-qubit CNOT gates. This is sufficient to implement any unitary transform. For notational convenience, we call this type of quantum circuit a standard quantum circuit. Unless otherwise stated, a circuit in this paper refers to a standard quantum circuit. A subset of circuits is the CNOT circuits, which are the ones consisting of 2-qubit CNOT gates only.
A Toffoli gate is a 3-qubit CCNOT gate which flips the basis states $|0\rangle, |1\rangle$ of (i.e. applies the $X$ gate to) the third qubit conditioned on the first two qubits both being in $|1\rangle$. Namely, there are two control qubits and one target qubit. This can be extended to an $n$-fold Toffoli gate, which applies the $X$ gate to the $(n+1)$-th qubit conditioned on the first $n$ qubits all being in $|1\rangle$. This $n$-fold Toffoli gate can be implemented by a standard quantum circuit of linear depth and size without ancillary qubits [35], and of logarithmic depth and linear size if a linear number of ancillary qubits are available [36].

Lemma 5. An $n$-fold Toffoli gate can be implemented by a standard quantum circuit of $O(n)$ depth and size without using any ancillary qubits, and by one of $O(\log n)$ depth and $O(n)$ size using $O(n)$ ancillary qubits.
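The action of these gates on computational basis states is purely classical, which the following minimal sketch illustrates. It only describes what an $n$-fold Toffoli gate does to a basis state, not the circuit decomposition of Lemma 5; the function name `apply_toffoli` is hypothetical.

```python
def apply_toffoli(bits, controls, target):
    """Act with a multi-controlled X on a computational basis state.

    bits     : list of 0/1 values, one per qubit
    controls : indices of the control qubits
    target   : index of the target qubit
    Returns a new list; the target is flipped iff all controls are 1.
    """
    if target in controls:
        raise ValueError("target must differ from the controls")
    out = list(bits)
    if all(out[c] == 1 for c in controls):
        out[target] ^= 1
    return out

# CNOT is the case of a single control qubit.
print(apply_toffoli([1, 0], [0], 1))
# 3-fold Toffoli: three controls, one target.
print(apply_toffoli([1, 1, 1, 0], [0, 1, 2], 3))
print(apply_toffoli([1, 0, 1, 0], [0, 1, 2], 3))
```

Applying the same gate twice returns the original basis state, reflecting that the gate is its own inverse.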
A non-standard quantum circuit model is the $\mathrm{QAC}^0_f$ circuit. A $\mathrm{QAC}^0_f$ circuit is a quantum circuit with one-qubit gates, unbounded-arity Toffoli gates, and fanout gates.
CQSP, QSP, and US problems.
1. The Controlled Quantum State Preparation (CQSP) problem is: Given $2^k$ quantum states $|\psi_i\rangle$ of $n$ qubits, realize the transformation $|i\rangle|0^n\rangle \to |i\rangle|\psi_i\rangle$ for all $i \in \{0,1\}^k$. We sometimes write $(k, n)$-CQSP to emphasize the parameters.
2. The Quantum State Preparation (QSP) problem is the above CQSP problem in the special case of $k = 0$: given a unit vector $v = (v_0, \ldots, v_{2^n-1})^T \in \mathbb{C}^{2^n}$, generate the state $|\psi_v\rangle = \sum_{j=0}^{2^n-1} v_j |j\rangle$ by a quantum circuit from the initial state $|0\rangle^{\otimes n}$, where $\{|j\rangle : j = 0, 1, \ldots, 2^n - 1\}$ is the computational basis of the quantum system. We sometimes call a quantum circuit for quantum state preparation a QSP circuit.
3. The general Unitary Synthesis (US) problem is: Given an $n$-qubit unitary $U$, find a quantum circuit to implement it.
In all these problems, we hope to find circuits that are as simple as possible, and standard measures for quantum circuits include depth, size (i.e. the number of gates), and number of qubits. The depth of a circuit corresponds to time and the number of qubits to space. For many information processing tasks including QSP and US, ancillary qubits turn out to be very helpful, and indeed there have been studies on quantum circuits with ancillary qubits for QSP and US. Since these tasks are often used as subroutines, it is usually desirable to have the ancillary qubits initialized to $|0\rangle$ at the beginning and restored to $|0\rangle$ at the end. Thus we say that a quantum circuit $C$ prepares an $n$-qubit quantum state $|\psi\rangle$ with $m$ ancillary qubits if $C|0^n\rangle|0^m\rangle = |\psi\rangle|0^m\rangle$. Similarly, we say that an $(n+m)$-qubit quantum circuit $C$ implements an $n$-qubit unitary $U$ using $m$ ancillary qubits if $C(|\phi\rangle|0^m\rangle) = (U|\phi\rangle)|0^m\rangle$ for all $n$-qubit states $|\phi\rangle$.
A uniformly controlled unitary (UCU) $V^S_T$ is an operation of the form $\sum_{x\in\{0,1\}^k} |x\rangle\langle x|_S \otimes (U_x)_T$, where $S$ is the index set of the $k$ control qubits, and $T$ is the index set of the $\ell$ target qubits. The $2^k$ multiple-controlled unitary operations are conditioned on distinct basis states of the $k$ control qubits; see Figure 2. When there are no control qubits, $V^S_T$ is just an $\ell$-qubit unitary operation. If $\ell = 1$, the UCU is also called a uniformly controlled gate (UCG), and we refer to a UCG with $k$ control qubits as a $k$-UCG. Ref. [21] gives the following size and depth upper bounds of a general UCG, which is a special case of our later Lemma 10 with $p = 1$ and $q = n - 1$. The following framework of a QSP circuit was given in [3,37].

Lemma 7. The QSP problem can be solved by n UCGs of growing sizes, V
Decomposition of an $n$-qubit quantum gate. Based on the cosine-sine decomposition, any $n$-qubit unitary can be decomposed into two $(1, n-1)$-UCUs and one $(n-1)$-UCG.
Lemma 8 (Cosine-sine decomposition). Any $n$-qubit unitary $U \in \mathbb{C}^{2^n \times 2^n}$ can be decomposed as
The circuit representation of the cosine-sine decomposition is shown in Figure 3.
Figure 3: Cosine-sine decomposition of an $n$-qubit quantum gate in the language of UCU.
Asymptotically optimal circuit depth for (controlled) quantum state preparation
Now we give a more detailed implementation and analyze the correctness and cost of the quantum circuit. Recall that we are constructing quantum circuits to implement QSP and CQSP, without assuming that any QRAM hardware is available. First, we will use the following copying circuit many times, so we single it out as a lemma.
Proof. For any $x \in \{0,1\}^q$ and $y = y_1 \cdots y_p \in \{0,1\}^p$, the unitary $\sum_{x\in\{0,1\}^q} |x\rangle\langle x| \otimes L_x$ can be realized as follows.
The first $pq$ ancillary qubits are divided into $p$ registers, which are labelled as registers $R_1, R_2, \ldots, R_p$. Based on Lemma 9, we make $p$ copies of $|x\rangle$, using a quantum circuit of depth $O(\log p)$ and size $O(pq)$ in Eq. (2). For every $i \in [p]$, we apply a $q$-UCG. All these $q$-UCGs act on different qubits, so they can be implemented in parallel, each with $\frac{m-pq}{p}$ ancillary qubits. According to Lemma 6, Eq.
Two remarks are in order. First, we can compare this result to Theorem 2 in [39], which says that the unitary $\sum_{x\in\{0,1\}^q} |x\rangle\langle x| \otimes L_x$ can be implemented by a quantum circuit of depth $O\left(q^2 + \frac{p2^q}{m}\right)$ using $m$ ancillary qubits. Apart from the difference between their depth bound and ours, the assumptions are also different. On one hand, our construction works for any $m \ge pq$, while theirs needs $\sqrt{p2^q} \le m \le p2^q$. On the other hand, ours takes $m$ "clean" qubits in $|0\rangle$ (and restores them afterwards), while theirs can handle "dirty" qubits, i.e. those with unknown content before the circuit.
Second, the above Lemma 10 extends to general UCUs: as long as each $U_i$ in a UCU has a shallow circuit implementation, the UCU can be easily implemented.
Proof. Similar to Lemma 10, we first make $p$ copies of $|x\rangle$ and un-copy these at the end. Between these two steps, we proceed layer by layer for the $d$ layers of the $W_x$ circuits.
In each layer, we use the method of Lemma 10 to handle the single-qubit gates, and use Lemma 5 to handle the $q$-controlled CNOT gates, i.e. $(q+1)$-fold Toffoli gates.
The cost is analyzed as follows. The copy and un-copy steps take depth $O(\log p)$, size $O(pq)$, and $pq$ ancillary qubits. In each layer, all the single-qubit gates can be handled in parallel, as we have one copy of $|x\rangle$ for each of them. As in Lemma 10, these single-qubit gates take depth $O\left(q + \frac{p2^q}{m}\right)$, size $O(p2^q)$, and $m - pq$ ancillary qubits. Since the ancillary qubits are restored to $|0\rangle$, they can be reused in the next layer. Each CNOT layer takes $O(q)$ depth and size, without using ancillary qubits. Again, these $q$-controlled CNOT gates, i.e. $(q+1)$-fold Toffoli gates, can be parallelized, as we have one copy of $|x\rangle$ for each of these Toffoli gates. Putting all these costs together gives the claimed bounds.
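The copy step used in the proofs above can be illustrated by the standard doubling schedule for fanning out a basis state with CNOT gates. This is a sketch of one common way to reach depth $O(\log p)$ and size $O(pq)$, not necessarily the exact circuit of Lemma 9; the function names are hypothetical, and the simulation acts on classical basis states only (on a superposition the same CNOTs map $|x\rangle|0\cdots0\rangle$ to $|x\rangle|x\rangle\cdots|x\rangle$ for each basis state $x$).

```python
def copy_schedule(p):
    """Layers of (source, destination) register pairs. Register 0 holds x;
    registers 1..p receive copies. Every register appears at most once
    per layer, so each layer is q parallel CNOTs per pair (depth 1)."""
    layers = []
    filled = 1  # registers 0..filled-1 currently hold x
    while filled < p + 1:
        new = min(filled, p + 1 - filled)
        layers.append([(s, filled + s) for s in range(new)])
        filled += new
    return layers

def simulate_copy(x, p):
    """Apply the schedule to a classical bit string x of length q."""
    regs = [list(x)] + [[0] * len(x) for _ in range(p)]
    for layer in copy_schedule(p):
        for src, dst in layer:
            # q bitwise CNOTs: dst <- dst XOR src
            regs[dst] = [a ^ b for a, b in zip(regs[dst], regs[src])]
    return regs

p, x = 5, [1, 0, 1]
layers = copy_schedule(p)
print(len(layers))                         # number of CNOT layers
print(sum(len(l) for l in layers) * len(x))  # total CNOT count p*q
print(simulate_copy(x, p))
```

Running the same layers in reverse order un-copies the registers and restores the ancillary qubits to $|0\rangle$.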

Rosenthal's quantum state preparation framework
In [23], Rosenthal presents a $\mathrm{QAC}^0_f$ circuit of depth $O(n)$ with $O(n2^n)$ ancillary qubits for $n$-qubit QSP. As mentioned in [23], this result suffices to yield a standard quantum circuit for QSP, with depth $O(n)$ and $O(n2^n)$ ancillary qubits. Indeed, each $k$-qubit Toffoli or fanout gate can be simulated by a standard quantum circuit of depth $O(\log k)$ with $O(k)$ ancillary qubits (Lemma 5). However, the $\mathrm{QAC}^0_f$ circuit needs $O(n2^n)$ ancillary qubits, which is outside our parameter regime of $m \in \left[\omega\left(\frac{2^n}{n\log n}\right), o(2^n)\right]$. Next we will analyze the $\mathrm{QAC}^0_f$ circuit and see how to make it suitable for any $m = \Omega(2^n/n^2)$. Let us first review Rosenthal's framework. In the following, we will use $\epsilon$ to denote the empty string and $\{0,1\}^{<n}$ to denote the set of strings of length at most $n-1$. For any $x = x_1 x_2 \cdots x_n \in \{0,1\}^n$, let $x_{\le i}$ denote the $i$-bit string $x_1 x_2 \cdots x_i$ and $x_{<i}$ denote the $(i-1)$-bit string $x_1 x_2 \cdots x_{i-1}$. Let $R(\alpha)$ denote the single-qubit gate $R(\alpha) = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\alpha} \end{pmatrix}$ for any $\alpha \in \mathbb{R}$, which puts a phase of $\alpha$ on the basis state $|1\rangle$.
Let $|\psi_v\rangle = \sum_{x\in\{0,1\}^n} v_x |x\rangle$ denote the target quantum state. For all $x \in \{0,1\}^{<n}$, let $|x|$ denote the length of $x$. Define $(n-|x|)$-qubit states $\{|\psi_x\rangle : 0 \le |x| < n\}$ recursively by the equations
where $z_x \in \{0,1\}$ for all $x \in \{0,1\}^{<n}$. The leaf function $\ell$ is defined in the following way: Identify the input index set $\{0,1\}^{<n}$ with the vertices of the complete binary tree, with each interior vertex $x$ having the left and right children $x0$ and $x1$, respectively. The root corresponds to the empty string $\epsilon$. Given an input $z$, $\ell(z)$ is the leaf that the following walk from the root leads to: at any interior node $x$, move to the left or right child if $z_x = 0$ or $1$, respectively. For example, given an input $z := z_\epsilon z_0 z_1 z_{00} z_{01} z_{10} z_{11} = 0110010$, $\ell(z)$ is obtained as follows. First, since $z_\epsilon = 0$, we move to the left child of $\epsilon$, which is labeled by 0. Second, since $z_0 = 1$, we move to the right child of 0, which is labeled by 01 and is the leaf node $\ell(z)$, i.e. $\ell(z) = 01$. It can be verified that
Also define a corresponding $(2^n + n - 1)$-qubit unitary transformation $U$ by
In the rest of this section, $R_x$ is a one-qubit register for each $x \in \{0,1\}^{<n}$, and $S$ is an $n$-qubit register.
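The walk defining the leaf function can be sketched directly. The code below assumes that the bits of $z$ are listed in the order used in the example (by string length, then lexicographically) and that the leaves are the strings of length $n-1$, as in the worked example with $n = 3$; the function names are hypothetical.

```python
from itertools import product

def tree_index(n):
    """All binary strings of length < n, ordered by length and then
    lexicographically: '', '0', '1', '00', '01', '10', '11', ..."""
    return ["".join(b) for length in range(n)
            for b in product("01", repeat=length)]

def leaf(z, n):
    """Follow the walk from the root (empty string): at an interior node x,
    append z_x. Stop when a string of length n - 1 (a leaf) is reached."""
    if len(z) != 2 ** n - 1 or set(z) - {"0", "1"}:
        raise ValueError("z must be a bit string of length 2**n - 1")
    labels = dict(zip(tree_index(n), z))
    x = ""
    while len(x) < n - 1:
        x += labels[x]
    return x

# The example from the text: z = z_eps z_0 z_1 z_00 z_01 z_10 z_11.
print(leaf("0110010", 3))
```

On the input from the text the walk visits the root, then node 0, and stops at node 01, in agreement with the example.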
The QSP algorithm in [23] can be summarized as follows.
Lemma 12. Any $n$-qubit quantum state $|\psi\rangle$ can be generated by the following three steps:
For correctness please refer to [23]; here we focus on the implementation and the corresponding analysis in a way suitable for our later circuit construction.
The first step of the algorithm consists of single-qubit rotations on $2^n - 1$ qubits, and thus naturally has depth 1 and size $2^n - 1$. We denote this step of the operation by $L^v_1$, where the superscript emphasizes that the gate parameters depend on the target vector $v \in \mathbb{C}^{2^n}$.
As shown in [23], the second step can be implemented by a $\mathrm{QAC}^0_f$ circuit on $O(n2^n)$ qubits, which translates to a standard circuit of depth $O(n)$ and size $O(n2^n)$, with $O(n2^n)$ ancillary qubits. We also note that this second step is independent of the target state. We denote the circuit of this step by $C_1$, where the absence of the superscript $v$ emphasizes the independence of the target state.
The third step, though also of depth $O(n)$ and size $O(n2^n)$ with $O(n2^n)$ ancillary qubits, unfortunately depends on the target vector $v$. This causes some difficulty for small $m$, and we will show how to handle it next.

Implementation: separating depth and dependence on the target state
In this section we will show how to implement the third step in Rosenthal's algorithm in such a way that (1) it has a constant number of rounds, some deep and some shallow, (2) deep rounds have depth $O(n)$, but are independent of the target vector $v$, and (3) shallow rounds each have depth 1, and depend on $v$. This separation of depth and dependence is useful for our later construction of efficient circuits. The circuit and these conditions are formalized in the following lemma.

Lemma 13. A unitary transformation $\Gamma^\dagger$ satisfying Eq. (7) can be implemented by a standard quantum circuit of the following form

Proof. We first introduce the notation $C^{S,y}_{R_x}(V)$: For any $y = y_1 \cdots y_\ell \in \{0,1\}^{\le n}$, $C^{S,y}_{R_x}(V)$ is a unitary operation acting on an $n$-qubit register $S$ (the first $n$ qubits) and a 1-qubit register $R_x$ (the last qubit) as follows
The unitary $C^{S,y}_{R_x}(V)$ makes the following transformation
where
By introducing an ancillary qubit called register $A$, $C^{S,y}_{R_x}(V)$ can be implemented by the quantum circuit in Figure 4. In the quantum circuit, the single-qubit gates $A, B, C, R(\alpha)$ satisfy $V = e^{i\alpha} AXBXC$ and $ABC = I_1$ (Lemma 4). According to Figure 4, we can rewrite the circuit of $C^{S,y}_{R_x}(V)$:
where $W^1_y, W^2_y$ are defined as
and $\mathrm{Tof}^S_A$ is a Toffoli gate whose control qubits are in $S$ and whose target qubit is in $A$. Because any $n$-qubit Toffoli gate can be implemented by a quantum circuit of depth $O(n)$ based on Lemma 5, $W^1_y, W^2_y$ can be implemented by a quantum circuit of depth $O(n)$. The unitaries $D^1_V$ and $D^2_V$ consist of single-qubit gates and $D^3_V$ consists of 2 single-qubit gates. The total depth of $C^{S,y}_{R_x}(V)$ is $O(n)$. It is worth mentioning that in the circuit construction of $C^{S,y}_{R_x}(V)$, only $D^1_V$, $D^2_V$ and $D^3_V$ depend on the unitary $V$.
Now we start the circuit construction of $\Gamma$. For all $x \in \{0,1\}^{<n}$, let $U_x$ denote a single-qubit gate satisfying $U_x|0\rangle = |\phi_x\rangle$. First, we implement the following transformation $\Gamma^{S,x}_{R_x}$ on registers $S$ and $R_x$ by the method in Figure 4:
This needs depth and size $O(n)$ with 1 ancillary qubit:
For each $x \in \{0,1\}^{<n}$, $S_x$ denotes a register which stores a copy of the state $|t\rangle$ in register $S$. Based on the construction of $\Gamma^{S_x,x}_{R_x}$, we can now implement $\Gamma$ by a quantum circuit of depth $O(n)$ and size $O(n2^n)$ with $(n+1)(2^n - 1)$ ancillary qubits. To compress the depth, we first make a copy of $|t\rangle$ for each $x \in \{0,1\}^{<n}$.
Here all three transformation steps have depth $O(n)$ and size $O(n2^n)$, by Lemma 9 and the analysis in Eq. (11)-(13).
as discussed in Eq. (9). Here $D^1_{U_x^\dagger}$ and $D^2_{U_x^\dagger}$ are single-qubit gates, and $D^3_{U_x^\dagger}$ consists of 2 single-qubit gates, all of which are determined by $U_x^\dagger$ (or by the target quantum state $|\psi_v\rangle$). $W^1_x, W^2_x$ are quantum circuits of depth $O(n)$, independent of $|\psi_v\rangle$. As discussed above, $\Gamma$ is represented as
The conclusion for the decomposition of $\Gamma^\dagger$ then follows. For the cost analysis: According to Eq. (18)
Recall that C 1 is the second step $U$ in Lemma 12. Now letting C 1 = C 1 C 1 , we get the following result.
Lemma 14. Any $n$-qubit quantum state $|\psi_v\rangle$ can be generated by a quantum circuit $\mathrm{QSP}_v$, using single-qubit gates and CNOT gates, of depth $O(n)$ and size $O(n2^n)$, with $O(n2^n)$ ancillary qubits. The QSP circuit can be written as

Quantum circuit for (controlled) quantum state preparation
Next, we will use Lemma 14 to efficiently realize the controlled quantum state preparation. Let us fix a constant $c$ in the size upper bound of $L^v_i$ in Lemma 14, i.e. $s_i \le c \cdot 2^n$. The next lemma is a restatement of the upper bound part of Theorem 1.
Proof. We consider two cases depending on $m$.
Case 1: $m = O(2^{n+k}/(n+k)^2)$. Let $\mathrm{QSP}_i$ denote the QSP circuit for generating the quantum state $|\psi_i\rangle$ on qubits $\{k+1, k+2, \cdots, k+n\}$ obtained from Lemma 7; then $\mathrm{QSP}_i$ can be decomposed into $n$ UCGs:
Therefore, the controlled quantum state preparation can be implemented as
For all $i \in [n]$, UCG $V$
Case 2: $m = \Omega(2^{n+k}/(n+k)^2)$. We will show the quantum circuit for CQSP in two sub-cases: $k \ge 4\log(n+k)$ and $k < 4\log(n+k)$.
Case 2.1: $k \ge 4\log(n+k)$. Then $m \ge \max\{2cn2^n, k2^n\}$ for any constant $c > 0$. For all $i \in \{0,1\}^k$, suppose $|\psi_i\rangle = \sum_{j=0}^{2^n-1} v^i_j |j\rangle$. Let $\mathrm{QSP}_i$ denote a QSP circuit with $m_1 = cn2^n$ ancillary qubits as guaranteed by Lemma 14 to prepare $|\psi_i\rangle$, which can be represented as
Here each $L^i_r$ is a depth-1 circuit consisting of $s_r = O(2^n)$ single-qubit gates $U^{r,i}_j$, and $L^i_r$ is determined by $|\psi_i\rangle$. For $r \in [5]$, $C_r$ is an $(n + m_1)$-qubit circuit of depth $O(n)$, which is independent of $|\psi_i\rangle$. Note that the task in the statement of this lemma is nothing but the UCU of $\{\mathrm{QSP}_i\}$, which can be implemented by applying $\sum_{i\in\{0,1\}^k} |i\rangle\langle i| \otimes \mathrm{QSP}_i$. This operator can be decomposed as follows.
where the notation $\prod_{r=5}^{1} A_r$ means to multiply the matrices $A_r$ in the order $A_5 A_4 A_3 A_2 A_1$. The second equation above holds because, when viewed as matrices, the equation is just a block diagonal matrix multiplication: $= \mathrm{diag}(C_5, \cdots, C_5) \times \mathrm{diag}(L^0_5, \cdots, L^{2^k-1}_5) \times \cdots$
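The block diagonal identity behind this step, $\sum_i |i\rangle\langle i| \otimes (A_i B_i) = \left(\sum_i |i\rangle\langle i| \otimes A_i\right)\left(\sum_i |i\rangle\langle i| \otimes B_i\right)$, can be checked on small matrices. The sketch below uses plain Python lists and arbitrary integer blocks chosen only for illustration (they are not unitaries, which does not matter for the algebraic identity); the helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[r][t] * B[t][c] for t in range(n)) for c in range(n)]
            for r in range(n)]

def block_diag(blocks):
    """diag(blocks[0], blocks[1], ...) for square blocks of equal size,
    i.e. the matrix of sum_i |i><i| (x) blocks[i]."""
    d = len(blocks[0])
    size = d * len(blocks)
    M = [[0] * size for _ in range(size)]
    for i, blk in enumerate(blocks):
        for r in range(d):
            for c in range(d):
                M[i * d + r][i * d + c] = blk[r][c]
    return M

A = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]   # A_0, A_1
B = [[[2, 0], [0, 2]], [[1, 1], [0, 1]]]   # B_0, B_1

lhs = block_diag([matmul(a, b) for a, b in zip(A, B)])
rhs = matmul(block_diag(A), block_diag(B))
print(lhs == rhs)
```

This is why the controlled version of a product of circuit layers factors into a product of controlled layers, where the layers that do not depend on $i$ need no control at all.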
Among the $m$ ancillary qubits, we use $m_1$ for the UCU $\{\mathrm{QSP}_i\}$, and have $m - m_1$ left. Since $m \ge k2^n$, we can apply Lemma 10 (where $q = k$ and $p \le c2^n$) and obtain that, for every $r \in [5]$, $\sum_{i\in\{0,1\}^k} |i\rangle\langle i| \otimes L^i_r$ can be implemented by a quantum circuit of depth
$|v_{\eta\cdot 2^{n-s}+p,\,i}|^2$. Our construction consists of two steps. In the first step, we implement a $4\log(n+k)$-qubit CQSP, using $m = \Omega(2^{n+k}/(n+k)^2)$ ancillary qubits:
where $s = 4\log(n+k) - k$. Note that $k$ and $s$ satisfy $m > \max\{2cs2^s, k2^s\}$; thus, similar to Case 2.1 above, we can implement Eq. (21) by a circuit of depth $O(\log(n+k))$ and size $O((n+k)^4)$. In the second step, we implement an $(n+k)$-qubit CQSP using $m$ ancillary qubits:
where $|\phi_{i,\eta}\rangle \stackrel{\mathrm{def}}{=} \sum_{p=0}^{2^{(n+k)-4\log(n+k)}} v_{\eta\cdot 2^{(n+k)-4\log(n+k)}+p,\,i}/v_{\eta,i}\,|p\rangle$. Eq. (22) is a CQSP in which the number of control qubits is $4\log(n+k)$, and $m > \max\{2c((n+k) - 4\log(n+k))2^{(n+k)-4\log(n+k)},\ 4\log(n+k)\,2^{(n+k)-4\log(n+k)}\}$ holds. Therefore Eq. (22) can be implemented in the same way as in Case 2.1, such that the depth and size are $O\left(n + k + \frac{2^{n+k}}{n+k+m}\right)$ and $O(2^{n+k})$, respectively. It can be verified that the CQSP operator can be implemented by $\mathrm{CQSP}_2 \cdot \mathrm{CQSP}_1$, and that the depth and size are $O\left(n + k + \frac{2^{n+k}}{n+k+m}\right)$ and $O(2^{n+k})$, respectively.
The paper [21] presents an $n$-qubit QSP circuit with $m$ ancillary qubits. For $m \in \left[0, O\left(\frac{2^n}{n\log n}\right)\right] \cup \left[\Omega(2^n), +\infty\right)$, the circuit depth for $n$-qubit quantum state preparation is optimal. However, if $m \in \left[\omega\left(\frac{2^n}{n\log n}\right), o(2^n)\right]$, there still exists a logarithmic gap between the upper and lower bounds of QSP circuit depth. Since our Theorem 1 gives a unified construction that works for any $k$, including $k = 0$, we obtain Theorem 2, which closes the gap left open in [21].

Remarks
1. In [20], it was shown that any $n$-qubit quantum state is determined by $2^n - 1$ free parameters, omitting a global phase. In Theorem 1, an $(n+k)$-qubit CQSP is defined by $2^k$ $n$-qubit quantum states. Therefore, it is determined by $2^k \cdot 2^n - 1 = 2^{n+k} - 1$ free parameters. Thus, by an argument similar to the one for the depth lower bound of quantum state preparation in [20], we obtain a depth lower bound of $\Omega\left(\frac{2^{n+k}}{n+k+m}\right)$ for $(k, n)$-CQSP using $m$ ancillary qubits.
Moreover, as in the proof of Lemma 37 in [21], we can also obtain a depth lower bound of $\Omega(n+k)$ by the light cone argument. Combining the two results above, the depth lower bound for CQSP is $\Omega\left(n + k + \frac{2^{n+k}}{n+k+m}\right)$. Therefore, the circuit depth in Theorem 1 is optimal.
2. If the number of control qubits $k$ in Theorem 1 is 0, the CQSP degenerates to standard QSP. Therefore, we obtain an optimal QSP circuit as in Theorem 2.

Circuit depth optimization of general unitary synthesis
The following oracle is used in the circuit constructions in [23].
Let $U = [u_{y,x}]_{y,x\in\{0,1\}^n} \in \mathbb{C}^{2^n \times 2^n}$ denote a general $n$-qubit unitary operator. Let the vector $u_x \in \mathbb{C}^{2^n}$ denote the $x$-th column of $U$, and let $|u_x\rangle = \sum_{y\in\{0,1\}^n} u_{y,x} |y\rangle$ be the corresponding $n$-qubit quantum state. The following unitary transformation $O_U$ is defined as the $U$-oracle:
This oracle is defined as an intermediate step of the circuit construction. Note that $O_U$ prepares the state $|u_x\rangle$ in the second register conditioned on the first register being $|x\rangle$. Given $O_U$, it is not immediate how to implement $U$, which changes $|x\rangle$ to $|u_x\rangle$ in place. However, we can indeed implement $U$ if we are allowed to use many queries to $O_U$ and $O_U^\dagger$. First, we can directly apply Theorem 1 to obtain the following circuit construction for the oracle.
If the number of ancillary qubits is large, this bound improves the previous depth bound of $O(n2^n)$ in [21]. Next, we will show how to further improve the circuit depth for the parameter regime $\Omega(2^n) \le m \le O(4^n)$ by cosine-sine decomposition and the following UCU. Note that each cosine-sine decomposition (Lemma 8) reduces a general unitary to two $(1, n-1)$-UCUs and one $(n-1)$-UCG. One can continue this decomposition to further decrease the number of target qubits to $n-2$, $n-3$, and so on. But it turns out that going all the way down to 1 target qubit does not give the most efficient construction. To see where to stop using the cosine-sine decomposition, we need to understand the circuit complexity of an $(n-k, k)$-UCU for a general $k$, which is the subject of the next lemma.
Proof. We will implement $V^S_T$ by an $(n+m)$-qubit quantum circuit. The idea is to implement each $(n-k)$-qubit controlled $k$-qubit unitary $U_x$ by Lemma 18. Observe that these $n-k$ control qubits can be combined with the $k$ control qubits in the definition of the oracle $O_{U_x}$ (Definition 16) to form an $(n, k)$-CQSP. Then we invoke Lemma 15 to implement them and obtain the bounds.
According to Lemma 18, for all $x \in \{0,1\}^{n-k}$, any $k$-qubit unitary $U_x \in \mathbb{C}^{2^k \times 2^k}$ acting on qubits $\{n-k+1, n-k+2, \ldots, n\}$ can be implemented by $O(2^{k/2})$ queries to the $2k$-qubit oracles $O_{U_x}$ and $O_{U_x}^\dagger$. Here each $O_{U_x}$ is on $2k$ qubits, $k$ of which are for $U_x$, while the other $k$ are ancillary qubits. Using the notation $O_{U_x} C_1 |\phi\rangle |0\rangle^{\otimes m}$, for all $n$-qubit states $|\phi\rangle$, where $\ell = O(2^{k/2})$ and $C_1, \ldots, C_\ell$ are depth-$O(\log m)$ and size-$O(m)$ quantum circuits independent of $U_x$. Any $n$-qubit UCU $V^S_T$ can thus be implemented as follows
where we switched the summation and multiplication again because of the block diagonal matrix structure, as for Eq. (20). Now we implement $\sum_{x\in\{0,1\}^{n-k}} |x\rangle\langle x| \otimes O_{U_x}$. It can be regarded as a controlled quantum state preparation, which has $n$ control qubits and $k$ target qubits. Hence, by Lemma 15, we can implement it by a circuit of depth $O\big(n + k +$

Remarks.
1. Extension of UCG. In [21], it was shown that any n-UCG can be implemented by a standard circuit of depth O(n + 2 n /(m + n)). Lemma 20 generalizes this result to any k.

2. Tightness. In [27], it was shown that any $n$-qubit unitary is determined by $4^n - 1$ free parameters, omitting a global phase. Because the UCU $V^S_T$ in Lemma 20 is defined by $2^{n-k}$ different $k$-qubit unitaries, it is determined by $2^{n-k} \cdot 4^k - 1 = 2^{n+k} - 1$ free parameters. Similar to the depth lower bound for a general unitary in [21], given $m$ ancillary qubits, the depth lower bound for the UCU $V^S_T$ is $\Omega\left(\frac{2^{n+k}}{n+m}\right)$. Moreover, we can also obtain a depth lower bound of $\Omega(n)$ by the light cone argument. This proof is the same as the proof of the depth lower bound for quantum state preparation in [21]. Combining the two results above, given $m \ge 0$ ancillary qubits, the depth lower bound for the UCU $V^S_T$ is $\Omega\left(n + \frac{2^{n+k}}{n+m}\right)$. When $k = O(1)$, the depth in Lemma 20 is asymptotically optimal. The case for general $k$ is left as an interesting open question.
Proof. Let $D_n(k, m)$ and $S_n(k, m)$ denote the minimum circuit depth and size, respectively, of a general $n$-qubit UCU $V^{[n-k]}_{\{n-k+1,\ldots,n\}}$ with $m$ ancillary qubits. In particular, $D_n(n, m)$ and $S_n(n, m)$ denote the minimum depth and size of an $n$-qubit unitary $U$. According to the cosine-sine decomposition in Figure 3, for every $k \in [n]$ we have
When $m = \Omega(4^n/n)$, we take $k = n$, and get depth $D_n(n, m) \le O(n2^{n/2})$. This completes the proof for the depth. The size follows from a similar argument: For $m = O(2^n)$, $\Omega(2^n) \le m \le O(4^n)$ and $m = \Omega(4^n)$, we take $k = 1$, $k = \log m - n$, and $k = n$, respectively, obtaining the size upper bounds $S_n(n, m) = O(4^n)$, $O(2^{3n/2} m^{1/2})$, and $O(2^{5n/2})$, respectively. This completes the proof.