Quantum Circuits for Sparse Isometries

We consider the task of breaking down a quantum computation given as an isometry into C-NOTs and single-qubit gates, while keeping the number of C-NOT gates small. Although several decompositions are known for general isometries, here we focus on a method based on Householder reflections that adapts well in the case of sparse isometries. We show how to use this method to decompose an arbitrary isometry before illustrating that the method can lead to significant improvements in the case of sparse isometries. We also discuss the classical complexity of this method and illustrate its effectiveness in the case of sparse state preparation by applying it to randomly chosen sparse states.


I. INTRODUCTION
A general quantum computation on an isolated system can be represented by a unitary matrix. In order to execute such a computation on a quantum computer, it is common to decompose the unitary into a quantum circuit, i.e., a sequence of quantum gates that can be physically implemented on a given architecture. There are different universal gate sets for quantum computation. Here we choose the universal gate set consisting of C-not and single-qubit gates [1]. We measure the cost of a circuit by the number of C-not gates as they are usually more difficult to implement than singlequbit gates and since the number of single-qubit gates is bounded by about twice the number of C-nots [2,3].
More generally, we can consider operations where the dimensions of the input are different to those of the output. An isometry from m qubits to n qubits can be represented by a 2 n × 2 m matrix V satisfying V † V = I. Unitaries and state preparation are special cases of isometries where m = n or m = 0 respectively. An isometry with m = n can be implemented by extending it to a unitary and implementing the unitary instead. The freedom in the extension can be exploited to lower the number of gates required.
We will briefly summarize previous work on decompositions of generic quantum computations using the gate set consisting of C-not and single-qubit gates. Arbitrary unitaries can be decomposed using 23 48 4 n Cnots [4] to leading order, about twice as many as the best known lower bound. The most efficient known method for preparing arbitrary states requires about 2 n C-nots to leading order [5][6][7] which is about twice the best known lower bound [5]. The decomposition of arbitrary isometries has been considered in [7,8]. Near optimal methods for decomposing arbitrary isometries * maemanue@ethz.ch † itenr@itp.phys.ethz.ch ‡ roger.colbeck@york.ac.uk exist, and again they achieve C-not counts approximately twice as large as the best known lower bounds. The implementation of quantum channels has been considered in [9] and these have been implemented along with POVMs and instruments in [10].
In this work we focus on the special case of sparse isometries, including in particular sparse state preparation. We present a method for decomposing arbitrary isometries that achieves essentially the same near optimal C-not counts as the methods in [7,8], and adapts well to the case of sparse isometries, where the C-not count will depend on the number of non-zero entries and their structure. Some particular examples of sparse unitaries have been studied in previous work, e.g., diagonal gates [4], uniformly controlled singlequbit gates [6] and permutation gates [11]. The case of efficiently computable sparse unitaries with a polynomial number of non-zero entries per row or column has also been considered [12], and can be implemented using a polynomial number of C-nots.
The general approach taken here to implement an isometry V from m to n qubits is to apply a sequence of elementary gates G = G k · · · G 1 G 0 to the columns of the isometry in order to reduce it to the first 2 m columns of the identity matrix on n qubits denoted by I n,m , that is GV = I n,m . Then G † is an extension of V to a unitary and yields an implementation of V using elementary gates. Our decompositions are based on Householder reflections, whose significance for (dense) circuit decompositions has been considered in [13,14].
The main results presented in this work are summarised in Table II. In particular we can bound the number of C-not gates required to implement a sparse isometry in terms of a quantity that we call the matrix envelope (see Section IV B 2), or in terms of the number of non-zero entries of the isometry.

II. SPARSE STATE PREPARATION
State preparation is the main building-block of the decompositions used in this work. In this section we in- k-controlled single-qubit gate C k (U) Theorem 4] k-controlled single-qubit gate C k (U) 6k − 4 k − 1 clean [15,Corollary 4] k-controlled not gate C k (X) 16k − 8 1 dirty Corollary 29, [7] k-controlled not gate C k (X) 8k − 12 if k ≥ 5  troduce a method for implementing sparse state preparation more efficiently than is possible in the dense case.
Definition 1. Let |v be a state on n qubits. We say that a unitary SP v on n qubits implements state preparation for |v if We start by presenting a useful pivoting algorithm for permuting entries in a sparse state such that all nonzero entries are grouped together. The idea is to then perform a decomposition scheme for dense state preparation on the grouped entries, which correspond to the state of a subset of the n qubits.

Lemma 2.
Let |v be a state on n qubits and let nnz(v) denote the number of non-zero entries of |v in the computational basis. Let s = log 2 nnz(v) . Then there exists a permutation gate Piv v that disentangles n − s qubits in a computational basis state, i.e., Piv v |v = |i n−s ⊗ |ṽ s for some s-qubit state |ṽ s . Let N Piv (n, s) denote the number of C-nots required for pivoting any state on n qubits with at most 2 s non-zero entries and let N C n s (X) denote the number of C-nots required to implement an s-controlled not when there are a total of n qubits. Then N Piv (n, s) ≤ (n − 1 + N C n s (X) ) nnz(v). Explicit counts are given in Table II. Proof. If s = n there is nothing to do so assume s < n. The given state can be represented as a column vector in the computational basis, where the basis states can be written in terms of n binary indices |b 1 b 2 . . . b n , which we split into two |b 1 . . . b n−s |b n−s+1 . . . b n . The first of these corresponds to a block of the vector. The goal is then to move all non-zero elements of |v into a single block. We achieve this by using the following algorithm: 1. If all non-zero entries are in the target block, we stop the algorithm.
2. Pick a non-zero entry outside the target block and a zero entry inside the target block. Write the basis state of the non-zero entry as |t n−s |r s and that of the zero entry as |t n−s |r s .
3. Choose a qubit on which t n−s and t n−s differ. 4. With this as a control qubit, use at most n − 1 Cnots to adjust |t n−s |r s to |t n−s |r s such that t and t differ only on the control qubit.
5. Use one s-controlled not (controlling on |r s ) to exchange |t n−s |r s and |t n−s |r s . Note that none of the other entries of the target block are affected by this process.
At the end of this algorithm, all non-zero entries of |v are in the target block. Thus we used at most n − 1 C-nots and one s-controlled not to insert one non-zero entry into the target block. Since no non-zero entry ever leaves the target block, the claimed bound follows.

Remark 3.
In the preceding Lemma we split the vector into blocks, where we have chosen to separate the n − s most significant qubits from the s remaining ones. However, any splitting into n − s and s qubits would work. In addition, the target block and the order in which to proceed are not fixed and none of these choices affect any of the decompositions used in this work. Making these choices in the right way can reduce the C-not counts. See Section V B for details on implementing the algorithm. Table I the following bounds on the  C-not count hold if a dirty ancilla are available. N Piv (n, s) ≤ (n + 16s 2 − 28s − 3) nnz(v) for n + a ≥ s + 1 N Piv (n, s) ≤ 27n2 n for n + a ≥ s + 1 N Piv (n, s) ≤ (n + 16s − 9) nnz(v) for n + a ≥ s + 2 N Piv (n, s) ≤ (n + 8s − 13) nnz(v) for n + a ≥ s + s 2 , s ≥ 5

Remark 4. Using
In the case n = s + 1, a = 0, there are not enough qubits available for any of the known decompositions of an scontrolled not and hence we can decompose it as an s-controlled single-qubit gate (first line) or note that the entire pivoting can be written as a permutation (second line -see Appendix C for a more precise bound). The first of these gives a better count for small n.
This pivoting algorithm allows us to find a scheme with low cost for the preparation of sparse states.

Corollary 5.
Let |v be a state on n qubits and let nnz(v) denote the number of non-zero entries of |v in the computational basis. Let s = log 2 nnz(v) . The number of C-nots required for sparse state preparation of a state of n qubits with nnz(v) non-zero elements is bounded by N SSP (n, s) ≤ N ∆ Piv (n, s) + N SP (s), where N ∆ Piv (n, s) denotes the number of C-not gates used to implement pivoting up to a diagonal gate, i.e., to implement ∆ Piv for some diagonal gate ∆. Explicit counts are given in Table II. Proof. It is sufficient to find a circuit that maps |v to |0 n , since the inverse of this circuit implements state preparation for |v . Let Piv v be constructed as in Lemma 2, then ∆ Piv v |v has the form |i n−s ⊗ |ṽ for any diagonal gate ∆. Without loss of generality i = 0. Now we merely need to apply reverse state preparation for |ṽ on s qubits. Thus sparse state preparation can be implemented as which gives the claimed count.

Remark 6.
For use in our sparse state preparation decomposition, it is sufficient to decompose the scontrolled not gates of Lemma 2 up to a diagonal gate. For example the Toffoli gate (with two controls) requires six C-nots when implemented exactly [1], but it can be implemented using only three C-nots when implemented up to a diagonal gate [1, Section VI B]. We are not currently aware of schemes to decompose not gates with more controls up to diagonal, but, if these were found, our counts would be improved.

Remark 7.
A more straightforward way to implement sparse state preparation is to use two-level unitaries to eliminate the non-zero entries one by one, but this leads to higher counts than the method presented here. A similar method is used in [1] to implement arbitrary unitaries.

III. GENERALIZED HOUSEHOLDER REFLECTIONS
Given a unit vector |v , the standard Householder reflection [16] with respect to |v is defined as We call |v the Householder vector associated with the reflection. The generalized Householder reflection of phase φ with respect to |v is defined as and coincides with the standard definition if φ = π. On certain architectures generalized Householder reflections can be implemented directly [17] and in a faulttolerant way [18]. Standard Householder reflections can be approximated well using Clifford and T gates [13]. In the circuit model a state preparation scheme can be used to perform a generalized Householder reflection.
Proof. This can be seen by considering the action on an orthonormal basis containing |v .

Lemma 9.
The gate H φ 0 on n qubits can be implemented using an (n − 1)-controlled single-qubit gate. The special case H 0 can be implemented using the same number of Cnots and ancilla qubits as an (n − 1)-controlled not gate. Explicit counts for these gates are given in Table I. Proof. The gate H φ 0 is a multi-controlled single-qubit gate with n − 1 controls and hence the C-not count is as in Table I. If φ = π, then using some common gates (note the Hadamard gate H has a similar notation to the Householder reflection) and −Z = XZX = X(HXH)X we obtain The C-not count is thus the one of C n−1 (X) (see again  Table I).  Here n is the number of output qubits, m the number of input qubits, nnz(·) the number of non-zero entries of a state or of an isometry in the computational basis and s = log 2 nnz(v) . The number of eliminations elim is defined in Eq. (5) and ed is defined in Definition 21. All results follow from the given reference and the results in Table I and other entries in the present table. Slightly different results can be derived by using different decompositions for multi-controlled not gates. An ancillary qubit is called clean if it starts in a known computational basis state and is restored to that state after the computation. It is called dirty if it starts in an unknown state which must be restored after the computation.
Given two states |v and |w we can construct a gate that maps |v to e iθ |w for some real θ using a standard Householder reflection defined as with θ = π − arg( v| w ) or θ = 0 if v| w = 0. We also define the generalized Householder reflectioñ which has the propertỹ The motivation for these definitions and related proofs are given in Appendix A.

IV. HOUSEHOLDER DECOMPOSITION
Our goal is to find a decomposition of any isometry V from m to n qubits. Let I n,m denote the first 2 m columns of the identity gate on n qubits. If G = G k · · · G 1 G 0 is a product of elementary gates on n qubits such that GV = I n,m , then G † is an extension of V to a unitary. Thus G † yields an implementation of V using elementary gates.
Householder reflections provide a straightforward method for implementing arbitrary isometries. Let |v 0 = V |0 be the first column of V and consider H v 0 ,0 , the Householder reflection mapping |v 0 to |0 up to a phase. We can reduce the first column (and row by orthogonality) of V by applying the Householder reflection to the isometry, i.e., the only entry in the first row and column of H v 0 ,0 V is that corresponding to |0 0|. Using the same idea the isometry can be reduced column by column to a diagonal isometry. Applying a diagonal gate on m qubits then yields I n,m . See Figure 1 for a schematic representation of the decomposition.
Before we describe the decomposition in more detail we show how the reduction of a column via Householder reflection affects the other columns of the isometry.
Lemma 10. Let V be an isometry. Let v j = V |j be the j th column of V and i be the target row index. The Householder reflection H v j ,i reduces the j th column to the i th row (i.e., such that its only non-zero element is in the i th row). For s = i and t = j we have The basic idea of the Householder decomposition for dense isometries. Here * represents an arbitrary complex entry. Each step reduces one column without affecting the previous columns. The rows are reduced automatically due to the orthogonality of the columns. The final diagonal gate sets the phases on the diagonal equal to one. and Proof. By construction we have H v j ,i V |j = e iθ |i and by orthogonality of the columns i| H v j ,i V |t = 0. For the final case given in the statement, we compute In other words if the j th column is reduced to the i th row then the entry of V at position s, t does not change if i| V|t = 0 or s| V|j = 0.

A. Dense isometries
First we consider the Householder decomposition for dense isometries. This decomposition works for any isometry, but does not take advantage of any sparseness. The idea is to reduce the columns one by one. It follows from Corollary 11 that previously reduced columns are not affected by subsequent Householder reflections. More precisely define V 0 = V and iteratively where ∆ m denotes a diagonal gate on m qubits. This proves the following Lemma.

Lemma 12.
Let V be an isometry from m to n qubits. Then V can be implemented using 2 m Householder reflections on n qubits and a diagonal gate on m qubits.
In any sequence of Householder reflections implemented using the construction in Lemma 8, gates of the form SP † v · SP w occur. Using the dense state preparation scheme from [5] one can merge some gates as described in [7, Theorem 1] for Knill's decomposition [8]. This essentially halves the C-not count. The structural similarity between the Householder decomposition and Knill's decomposition suggests a generalized decomposition, which we present in Appendix D.
Lemma 13. Let V be an isometry from m to n qubits with n ≥ 5. Then V can be implemented via standard Householder reflections using one dirty ancilla qubit and Proof. This follows analogously to the proof of Theorem 1 of [7], so we only point out the necessary modifications. The idea is to decompose the isometry via where S i implements state preparation of the i th column of V. Our modification is to replace the 2 m generalized Householder reflections by standard Householder reflections, and then to correct for the difference by using a diagonal gate on m qubits at the end, i.e., Lemma 9 shows that using one dirty ancilla qubit the standard Householder reflection H 0 can be implemented using 16n − 24 C-nots instead of 16n 2 − 60n + 42 used for the generalized version H φ 0 . The cost of the diagonal gate is 2 m − 2 (see Table I).
Note that, to leading order, the counts match those of [7, Theorem 1] and hence, in the case of dense isometries, the advantage of using the Householder decomposition rather than Knill's is only apparent for small cases.

Corollary 14.
Let U be a unitary on n qubits with n ≥ 5. Then U can be implemented via standard Householder reflections using one dirty ancilla qubit and N U (n) C-not gates where Proof. First we claim that a controlled isometry from m − 1 to m qubits with n − m controls can be implemented using at most N(m − 1, m) + 16(n − m)2 m−1 + 2 n−1 − 2 m−1 C-nots where N(m, n) denotes the upper bound for implementing an isometry from m to n qubits given in Lemma 13 1 . This follows from Lemma 13 and the fact that when implementing the controlled isometry by controlling all the gates in (4), we do not need to control the state preparation gates or their inverses (i.e., the S i and S † i gates do not need controls). Thus the control does not affect the first three terms in the counts from Lemma 13.
To decompose the unitary, we start by reducing the first half of the columns using the inverse of a circuit implementing an isometry from n − 1 to n qubits. This yields |0 0| ⊗ I + |1 1| ⊗ U 1 , where U 1 is an n − 1 qubit unitary. We can then control on the first qubit and do the inverse of an isometry from n − 2 to n − 1 qubits and so on. At each step we reduce half of the remaining columns and requires the inverse of an isometry from n − k − 1 to n − k qubits with k controls, where k = 0, 1, . . . , n − 1.
The C-not count is thus This can be used with the counts from Lemma 13 to upper bound the number of C-nots. To generate a clean bound on the leading order term, note that if n This result improves on the C-not count for a similar decomposition for dense unitaries based on Householder reflections given in [14] which achieves 4 n to leading order.

B. Sparse isometries
Now we consider the Householder decomposition for sparse isometries. Again the decomposition works for any isometry, but now the number of C-not gates depends on the number of non-zero entries and their structure. The method yields lower C-not counts than the decomposition for dense isometries if the isometry is sufficiently sparse. We do not specify a precise number of zeros an isometry should have in order to be considered sparse, but use W to denote isometries in the context of methods designed to make use of sparseness.

Decomposition of sparse isometries
The basic idea of the sparse Householder decomposition is that if the columns of W are sparse, then so are the Householder vectors used in the decomposition. Therefore we can use our sparse state preparation method (Corollary 5) to save C-nots. The main obstacle is fill-in generated by the Householder reflections, i.e., where zero entries of the isometry are removed by the reflection. Corollary 11 implies that such fill in is relatively small. It can be further reduced by decomposing the columns of the isometry in a well-chosen order.
More precisely let ρ be a permutation of the rows of W and let σ be a permutation of its columns. We call the pair (ρ, σ) an elimination strategy. Then the sparse Householder decomposition proceeds as in the dense case, except that at step i we reduce column σ(i) to row ρ(i). If we implement the Householder reflections up to a permutation and a diagonal gate, then, after all reductions, the isometry will be a row permutation of a diagonal isometry. More precisely define W 0 = W and iteratively for i = 0, . . . , 2 m − 1 and where Π i is a permutation gate andρ(i) = (Π i−1 · · · Π 1 Π 0 • ρ)(i) with • denoting the action of a permutation gate on a permutation. Then G 2 m −1 · · · G 0 W = Π n I n,m ∆ m . The following two lemmas show how to decompose permuted diagonal gates and how to implement Householder reflections up to a permutation and a diagonal gate.

Lemma 15.
Let Π n be a permutation gate on n qubits and ∆ m a diagonal gate on m qubits. An isometry of the form Π n I n,m ∆ m can be implemented using C-nots. Explicit counts are given in Table II. Proof. Consider the vector |u formed by summing the columns of Π n I n,m ∆ m and the pivoting algorithm (Algorithm 1) applied to this. If the gates required to pivot |u are applied to Π n I n,m ∆ m , all the 2 m non-zero entries of Π n I n,m ∆ m are moved to the top. [Note that when taking the counts from Remark 4, we replace nnz(v) by 2 m .] The resulting isometry has the form I n,m Π m ∆ m .
Since it leads to the same structure, it suffices to perform the pivoting up to a diagonal. It remains to implement a permutation on m qubits (up to diagonal) and a diagonal gate on m qubits.

Lemma 16.
Let |v be a state on n qubits and let nnz(v) denote the number of non-zero entries of |v in the computational basis. Let s = log 2 nnz(v) . Then the standard Householder reflection H v can be implemented up to diagonal and permutation gates using C-nots. Explicit counts are given in Table II.
Proof. It follows from Lemma 8 and Eq. (1) that H v can be implemented up to diagonal and permutation gates as which gives the claimed bound.
Now we can give the C-not counts for the sparse Householder decomposition. First we define to be the total number of eliminations during the Householder decomposition of W when using the elimination strategy (ρ, σ). Recall that |w i denotes the column reduced in the i th step of the decomposition, which differs in general from W |σ(i) .

Lemma 17.
Let W be an isometry from m to n qubits and let (ρ, σ) be an elimination strategy. Then W can be implemented using C-nots where s(i) = log 2 (1 + nnz(w i )) . Explicit counts are given in Table II. Proof. In the Householder decomposition the steps involve H w i ,ρ(i) . This is a standard Householder reflection with respect to a vector that may have one additional non-zero element (see Eq. (2)). The bound given then follows from the sparse Householder decomposition.
The count resulting from Lemma 17 depends on the chosen elimination strategy. The optimal strategy is the one that minimizes the amount of fill-in produced and therefore the number of eliminations required.

Remark 18.
It can be beneficial to use the idea of the sparse Householder decomposition, without adhering to the exact form given above. For example, using a single standard Householder reflection we can implement a k-controlled single-qubit gate up to diagonal. This gate can be used to improve the C-not count of the column-by-column decomposition [7]. Indeed, without loss of generality we can assume that the least significant qubit is the target qubit of the k-controlled singlequbit gate. Then the corresponding unitary is the identity matrix except for the 2 × 2 block in the bottom right corner. We can reduce the penultimate column (up to diagonal) using a standard Householder reflection with respect to |1 . . . 10 n and two single-qubit gates for the state preparation and reverse state preparation. The Cnot count is then that of a k-controlled not gate by a similar argument as in Lemma 9.

Remark 19.
To obtain a more explicit bound we can plug in the counts from Table II: where we have used the bound s(i) ≤ n. The given bound is valid with one dirty ancilla.

Fill-in and envelopes
In order to gain as much advantage as possible from the sparseness of an isometry, we need to minimize fillin as much as possible, which corresponds to choosing ρ and σ so as to minimize elim(W, ρ, σ). Reducing a column of W in general affects other columns and creates new non-zero entries. Due to the orthogonality of the columns however, Householder reflections create little fill-in when applied to isometries. In fact, it follows immediately from Corollary 11 that when reducing column j to row i fill-in can only occur in columns that are non-zero in the i th entry and fill-in is confined to the rows where W |j is non-zero.
We use matrix envelopes to give a bound on the amount of fill-in the sparse Householder decomposition produces. The envelope of a sparse matrix gives for each column of the matrix the row index of the lowest non-zero element (i.e., the largest row index of a non-zero element) in that column or any previous column.
Definition 20. Let W be an isometry from m to n qubits. The envelope of W is defined to be the function env W : {0, 1, . . . , 2 m − 1} → {0, 1, . . . , 2 n − 1} that maps each column index j to the smallest row index env W (j) such that i| W|j = 0 for all i > env W (j) and such that env W (j + 1) ≥ env W (j).
Definition 21. Let W be an isometry from m to n qubits. Define ed(W) = ∑ 2 m −1 j=0 (env W (j) − j). Then ed(W) denotes the number of entries between the envelope and the diagonal of W.
The definition of ed(W) is motivated by the following result.
Proof. LetW = Π ρ WΠ σ for some permutations (ρ, σ). Reducing the columns of W according to (ρ, σ) is equivalent in terms of fill-in to reducingW according to the trivial elimination strategy (ι, ι) where ι denotes the identity permutation, i.e., elim(W, ρ, σ) = elim(W, ι, ι). It follows from Corollary 11 that the fill-in forW is confined to entries (i, j) with i ≤ env(j). Due to orthogonality of the columns, if we proceed in column order we never need to eliminate any elements above the diagonal, so elim(W, ι, ι) ≤ ed(W).
Finding the row and column permutations that minimize the envelope is a computationally difficult task. Methods for similar problems have been considered in the context of sparse matrix decompositions [19]. Note that once a column permutation is fixed, the optimal row permutation can be found in the following straightforward way.
Let W be an isometry from m to n qubits. Consider the following algorithm for constructing a modified isometry W .

Set k to be the number of non-zero elements in
the column with index j with row indices greater than or equal to i.
3. If k = 0, permute the rows with indices greater than or equal to i such that these k non-zero elements have row indices i to i + k − 1, assign the new isometry to W and set i = i + k.
Lemma 23. Let W be an isometry from m to n qubits and W be the output after applying Algorithm 2 to W. For all row permutations ρ we have env W (j) ≤ env Π ρ W (j) for all j.
Proof. Given a column index j, let t(j) be the number of rows such that for each of the columns with indices smaller than or equal to j, those rows have at least one non-zero element, i.e., t(j) := 2 n − |{i : i| W j = 0 ∀ j ∈ {0, 1, . . . , j}}| .
It follows that for any row permutation ρ we have env Π ρ W (j) ≥ t(j) − 1 for all j (recall that env Π ρ W is a non-decreasing function by definition). However, by construction, env W (j) = t(j) − 1 for all j, from which the claim follows.
The difficult part is therefore finding the best column permutation. In this work we consider a simple greedy algorithm for finding a good column permutation. 1. Set i = 0, j = 0 and W = W.

Set M to be the submatrix of W formed by only
considering rows with row index greater than or equal to i and columns with column index greater than or equal to j.
3. Pick one of the columns of M with the fewest nonzero elements and set k to be the number of nonzero elements.
4. Permute the columns of W with column index greater than or equal to j such that when restricting the permutation to M, the chosen column becomes the first column of M. Then permute the rows of W with row index greater than or equal to i such that when restricting the permutation to M, all non-zero elements of the chosen column are moved to the top. Set i = i + k, j = j + 1.
5. If j < 2 m and i < 2 n return to Step 2, otherwise output W .
This algorithm corresponds to minimizing the increment of the envelope at each step. More information on an efficient way to implement it can be found in Section V C.

C. Fixed envelope method
The asymptotic C-not count for the sparse Householder decomposition contains an undesirable factor n stemming from Lemma 2. We now show how this factor can be avoided if we give up some control on the amount of fill-in.
For this we make use of the decrement gate Dec n which subtracts 1 in the computational basis (modulo 2 n ), i.e., Dec n = ∑ 2 n −1 i=0 |i i ⊕ 1| where ⊕ denotes addition modulo 2 n . This gate can be implemented using O(n) C-nots and one ancilla qubit [20]. The method is illustrated in Figure 2. 2. Illustration of the first step of the fixed envelope method. Here * represents an arbitrary complex entry, denotes the target entry of the reduction, × stands for an entry that was eliminated, − denotes entries eliminated due to the orthogonality constraint and + means that fill-in occurs. Here σ(0) = 0.
Lemma 24. Let W be a sparse isometry from m to n qubits and let (ρ, σ) be some row and column permutations. Then W can be implemented using Explicit counts are given in Table II.
Proof. We consider reducing W in the following way: First apply the row permutation ρ up to diagonal. Then reduce the columns in the order given by the column permutation σ. For each column, use a Householder reflection to reduce it to the topmost row. Apply the decrement gate and then move to the next column. After all columns have been reduced in this way we apply X ⊗n−m ⊗ I m . The resulting isometry has the form I n Π m ∆ m which can be reduced to the identity by applying a permutation on m qubits up to diagonal and a diagonal gate on m qubits. By construction, before each Householder reflection, all non-zero entries in the column being reduced are in the topmost 2 s(i) positions. We can hence perform the Householder reflections using the method of Lemma 16 but omitting the pivoting steps.

Remark 25.
To obtain a more explicit bound we can plug in the counts from Table II This bound is valid with one dirty ancilla. The factor 4 stems from the fact that each Householder reflection uses state preparation twice and each state preparation acts on s(i) qubits where 2 s(i) is at most twice the height of the envelope 1 + env Π ρ WΠ σ (i) − i in column i of Π ρ WΠ σ .

D. No fill-in method
Using a clean ancilla qubit we can avoid fill-in altogether. The method is illustrated in Figure 3.
Lemma 26. Let W be a sparse isometry from m to n qubits. Then, using one additional clean ancilla, W can be implemented using C-nots where s(i) = log 2 (1 + nnz(W |i )) . Explicit counts are given in Table II.
Proof. We implement the isometryW from m to n + 1 qubits defined byW |i = |0 ⊗ W |i which in the computational basis is just W stacked on top of a zero matrix of the same size. Then each of the 2 m columns can be reduced to one of the 2 n zero rows without creating any fill-in. These reductions can be implemented using Householder reflections up to a diagonal and permutation gate as in Lemma 16. Then using Lemma 15 we reduce the resulting permuted diagonal isometry to the identity.

V. CLASSICAL COMPLEXITY
In this section we compute the classical worst-case time complexity of some decompositions presented in this work. We compare the dense Householder decomposition to other known methods and we propose a sparse storage format which is well adapted to the sparse Householder decomposition.

A. Dense isometries
First we consider the dense case. For each column we have to compute the corresponding Householder vector (see Eq. (2)) and apply the Householder reflection to the entire isometry and produce the circuit implementing the Householder reflection. Computing the vector takes O(2 n ) and applying the reflection takes O(2 m+n ). Producing the circuit takes O(2 3n/2 ) according to Appendix B.4 of [10]. Thus the classical complexity is O(2 m+n (2 m + 2 n/2 )). For comparison, the column-bycolumn decomposition requires O(n2 2m+n ) and Knill's 3. Illustration of the no fill-in method. The clean ancilla leads to the empty 4 × 4 block at the start. Here * represents an arbitrary complex entry, × stands for an entry that was eliminated and denotes the target entry of each reduction. We actually implement each of the Householder reflections up to permutation, so the final state is a row permutation of that shown (for simplicity we did not depict this). decomposition and the Cosine-sine decomposition both require O(2 3n ) [10]. A high performance implementation for a decomposition of dense unitaries based on Householder reflections is presented in [14].

B. Sparse state preparation
The non-zero pattern of a sparse state on n qubits with at most 2 s non-zero entries can be compactly stored as a list of the n-bit indices of the non-zero entries. This requires O(n2 s ) space.
The pivoting algorithm, Algorithm 1, can be implemented with some greedy optimizations. First there are ( n s ) possible splittings for the n qubits. We will think of the vector as being reshaped into a two dimensional array with 2 s rows and 2 n−s columns corresponding to the chosen qubit splitting. If the number of possible splittings is small enough, we try all splittings and choose the one with the largest number of non-zero elements in one column and this will be the target column. Otherwise one can randomly sample a fixed number of splittings and use the best one. Second, in each insertion step we choose an element not yet in the target column for which the cost of inserting it into the target column is minimal and perform the insertion. We iterate this until all elements are in the target column.
The insertion of one element into the target column can be implemented using one s-controlled not gate and d − 1 C-nots, where d is the Hamming distance between the index of the non-zero entry and the index of the target entry. Thus we want to find a non-zero entry outside the target column and a zero entry in the target column for which the Hamming distance is minimal. We do this by finding for each non-zero entry outside the target column the closest zero entry in the target column. The Hamming distance can be written as d = d c + d r where d c is the Hamming distance obtained when restricting the indices to the column indices, and d r is the part corresponding to the row indices. For each non-zero entry, we can compute d c in O(n − s), which adds up to O((n − s)2 s ) for all non-zero entries outside the target column. We can compute d r for all non-zero entries at the same time in O(s2 s ) by using breadth first search on the s-dimensional hypercube with multiple starting vertices, given by the row indices of the zero entries in the target column. More precisely, we store a list of length 2 s , where each entry corresponds to one row and stores the minimal distance d r to some free row of the target column. The entries corresponding to free rows are initialized with distance 0 and all other entries are initialized with distance ∞. Then we perform the usual breadth first search on the graph whose vertices are given by the entries of the list and whose edges connect any pair of entries whose indices have Hamming distance one. Performing the insertion takes O(n2 s ). The entire circuit implementing sparse state preparation with the greedy optimizations mentioned above can thus be computed in time O(( n s ) + n2 2s ).

C. Sparse isometries
To store a sparse isometry W we store two arrays R and C of size 2 n and 2 m respectively. For a given row index i, R(i) stores a reference to a balanced tree (e.g. a red-black tree) containing for each non-zero element of the i th row a triplet of the form (i, j, i| W|j ). The elements of the tree are sorted according to the key j.
Analogously we define C(j) to store a reference to a tree containing the non-zero elements of the j th column. This requires O(2 n + 2 m + nnz(W)) space. Given row and column indices i and j, the corresponding entry can be created, read, modified or deleted in time O(n).
We now show how to reduce column j to row i. From Corollary 11 we know that the modified entries are those in column j and row i and those with indices s and t such that s ∈ C(j) and t ∈ R(i). First we iterate over all choices for s = i and t = j and create or modify the entries according to Lemma 10. Then we set the entries in column j and row i to zero except for the entry with indices (i, j) which is determined by Lemma 10. Thus each reduction can be done in time O(n mod) where mod denotes the number of modified elements.
In Algorithm 3 we presented a greedy method for constructing a permutation of an isometry leading to a small envelope. For a sparse isometry W given in the data structure described above, the corresponding row and column permutations ρ and σ can be computed iteratively as follows. Using the notation from Algorithm 3, we only store the submatrix M, which is initially set to W. In each step we find the sparsest column of M, which will be the next column in the column permutation, and the rows containing the non-zero elements of this column, which will be the next rows in the row permutation. Then we simply delete the non-zero elements in the chosen column and rows and iterate the procedure. In order to find the sparsest column in each iteration, we maintain a minheap storing for each column the number of non-zero elements. The smallest element can be removed in time O(n) which yields the sparsest column. Then we can delete the elements as described above, each in time O(n). For every deleted element we decrease the number of non-zero elements in the containing column by one, and thus we may have to reorder the minheap. But this can also be done in time O(n). The procedure stops when all non-zero elements have been deleted after total time O(n nnz(W)).

VI. NUMERICAL RESULTS FOR SPARSE STATE PREPARATION
We compare the C-not counts resulting from our sparse state preparation scheme presented in Section II to the dense case. The implementation of the sparse state preparation scheme is described in Section V B. In order to improve the classical computation time, we do not consider all possible qubit splittings, but randomly sample 100 splittings and choose the one with the largest number of non-zero elements in one column. We use the dense state preparation scheme from [5], implemented in [10], which achieves near optimal C-not counts for arbitrary dense states. The results are presented in Figure 4. These indicate the advantage we gain by taking into account the sparseness. Note however, that the dense case outperforms the sparse case for fairly dense states (where the cost of pivoting is not compensated by the smaller state preparation).
RC is supported by EPSRC's Quantum Communications Hub (grant numbers EP/M013472/1 and EP/T001011/1). RI acknowledges support from the Swiss National Science Foundation through SNSF Requiring H φ u θ |v ∝ |w leads to the following condition which we can also write as φ = π + 2 arg(z) mod 2π.

Standard Householder reflection
For a standard Householder reflection we have φ = π. Now choose θ = π − arg( v| w ) or θ = 0 if v| w = 0. This implies z = 1 + | v| w | = 0. Then we define This map has the property that H v,w |v = e iθ |w .

Generalized Householder reflection
If we want to get rid of the phase e iθ we have to use a generalized Householder reflection. Setting θ = 0 implies that z = 1 − v| w which might lead to numerical instabilities. Then φ = π + 2 arg(z). We definẽ

Appendix B: Multi-controlled not gates
We denote a k-controlled not gate by C k (X). Using a single dirty ancilla qubit such gates can be decomposed with a linear number of C-not gates. We start by recalling two lemmas from [1,7].
Note that if k ≥ 5 this bound can be reduced to 8k − 12 [15]. However, we do not use this here for the convenience of having a single bound for all k ≤ n 2 . The desired decomposition of multi-controlled not gates using a single ancilla qubit follows.
Corollary 29. Let n ≥ 3 denote the total number of qubits. Then we can implement a C n−2 (X) gate with at most 16n − 40 C-nots.

Appendix C: Permutation gates
One key feature of several of our methods is the use of permutations to adjust the form of the given isometry. Here we discuss the number of C-nots needed for these.
Proof. Permuting the rows of a state on n qubits corresponds to constructing a 2 n × 2 n permutation matrix (i.e., a unitary matrix with one 1 in every row and column). Each such matrix corresponds to a permutation on 2 n objects. It is known that all permutations can be decomposed as a sequence of swaps. A permutation is even if it can be decomposed into an even number of swaps and otherwise it is odd. It is known that all even permutations on n ≥ 3 qubits can be performed with at most n nots, n 2 C-nots and 3(2 n + n + 1)(3n − 7) Toffoli gates [11,Theorem 33]. A Toffoli gate can be performed up to a diagonal gate using 3 C-nots [1] and diagonal gates can be commuted with not, C-not and Toffoli gates up to another diagonal. Thus, ignoring the single-qubit (not) gates, any even permutation can be decomposed into at most (27n − 63)2 n + 28n 2 − 36n − 63 C-nots without ancilla, up to a diagonal gate.
If we have an odd permutation on n qubits, we can apply an (n − 1)-controlled not to make it an even permutation. Without an ancilla, we can do this as a n − 1 controlled single-qubit unitary with an overhead of 16n 2 − 60n + 42 C-nots (see Table I). This leads to an overall C-not count of (27n − 63)2 n + 44n 2 − 96n − 21 C-nots without ancilla 2 .
A diagonal gate on n qubits can be performed using 2 n − 2 C-nots (see Table I), leading to an overall C-not count of (27n − 62)2 n + 44n 2 − 96n − 23.
Slightly lower counts are possible with ancillas, but these do not change the leading order so we do not consider them here for simplicity. Note also that the above bound is always less than 27n2 n for n ≥ 3, so we use the latter as a simplification.
Any unitary on n ≥ 2 qubits can be decomposed using at most 23 48 4 n − 3 2 2 n + 4 3 C-nots [4] (without ancilla) which gives a better count for an arbitrary permutation for n ≤ 8.
[In [11], the authors note that if we add a qubit on which we do not act, an odd permutation becomes even and hence any permutation on n ≥ 3 qubits can be done using one ancilla and an even permutation on n + 1 qubits [11,Corollary 13]. However, this is significantly worse than the above bound.] Lemma 31. A permutation gate on n ≥ 2 qubits can be performed with one dirty ancilla and at most (18n − 26)(2 n − 1) C-nots.
Proof. The idea is to use Householder reflections. We can write the permutation gate as ∑ 2 n −1 i=0 |j(i) i|, where |j(i) is a computational basis state. The Householder reflection H j(i),i takes |j(i) → |i and |i → |j(i) without affecting any other columns (cf. Corollary 11).
Since H j(i),i = H u , where |u = 1 √ 2 (|i − |j(i) ), we can implement each Householder reflection along the lines given in Lemma 8. The state |u can be reduced to |i as follows. Let |i = |i 1 i 2 . . . i n and |j(i) = |j 1 j 2 . . . j n in the computational basis. Find an index k such that j k = i k . Controlling on the k th qubit we can apply at most n − 1 C-nots such that |j(i) → |i 1 . . . i k−1 j k i k+1 . . . i n and |i is unchanged. When applied to |u , this results in |u = 1 √ 2 (|i − X k |i ), where X k is a not on the k th qubit. Applying a single qubit rotation to the k th qubit then maps |u to |i . Let us denote the reverse of these steps by SP i,u , so that SP i,u |i = |u . We have where H i = I − 2 |i i| is either a Z or −Z gate with n − 1 controls, and hence has the C-not count of C n−1 (X), which is 16n − 24 if one dirty ancilla is available (see Table I).
It follows that we can do H j(i),i using 18n − 26 Cnots, and hence the whole permutation matrix using at most (18n − 26)(2 n − 1) C-nots.

Appendix D: Knill-Householder decomposition
In this appendix we consider a generalized decomposition scheme that contains Knill's decomposition [8] and the Householder decomposition as special cases. Suppose U is a unitary that we want to implement and B is another unitary representing a change of basis. Assume that B andŨ = B † UB are sparse. Let |b i = B |i , then |ũ i := UB |i = U |b i and Consider the generalized Householder reflectionHũ 0 ,b 0 mapping |ũ 0 to |b 0 (in this section we want this map to be exact and not up to a phase). In the basis formed by the columns of B this reduces the first column ofŨ. Consider the effect on the second column |ũ 1 . Since it is orthogonal to |ũ 0 , this column will suffer fill-in if and only if it is not orthogonal to |b 0 . In this case, fill-in will be confined to the subspace spanned by |ũ 0 and |b 0 . This means that fill-in is determined by the structure ofŨ and works in the same way as for the Householder reflection in the standard basis. It is important to note however that state preparation is only efficient if |ũ 0 is sparse, i.e., if bothŨ and B are sufficiently sparse. Continuing in this fashion allows us to reduce U to the identity.
Remark 32. If we define B to be the unitary whose columns form an eigenbasis of U, then this scheme reduces to Knill's decomposition, and if B is the identity, it is essentially the Householder decomposition. Note however that in the case of the Householder decomposition we used standard Householder reflections and it was sufficient to implement them up to diagonal and permutation gates.
This method can be generalized to isometries by extending a given isometry V to a unitary U. This can be done such that U has at least 2 n − 2 m eigenvalues equal to 1 (see [8], [10,Lemma 5]). Let B be an n qubit unitary whose first 2 n − 2 m columns are eigenvectors of U with eigenvalue 1. ThenŨ = B † UB is blockdiagonal with a (2 n − 2 m ) × (2 n − 2 m ) trivial block and an 2 m × 2 m nontrivial block. This observation can be used to reduce U as described above.