Accelerating Quantum Algorithms with Precomputation

Real-world applications of computing can be extremely time-sensitive. It would be valuable if we could accelerate such tasks by performing some of the work ahead of time. Motivated by this, we propose a cost model for quantum algorithms that allows quantum precomputation, i.e., a polynomial amount of "free" computation before the input to an algorithm is fully specified, and we present methods for taking advantage of it. We analyze two families of unitaries that are asymptotically more efficient to implement in this cost model than in the standard one. The first example of quantum precomputation, based on density matrix exponentiation, could offer an exponential advantage under certain conditions. The second example uses a variant of gate teleportation to achieve a quadratic advantage when compared with implementing the unitaries directly. These examples hint that quantum precomputation may offer a new arena in which to seek quantum advantage.


Introduction
In order to efficiently use limited computational resources, it is natural to quantify and minimize their use. In quantum computing, we frequently try to minimize some proxy for the spacetime cost of an algorithm, such as the number of two-qubit gates on a near-term machine or the number of non-Clifford gates on a fault-tolerant device. Focusing on spacetime metrics allows one to easily incorporate the fungibility of additional qubits and time inside error correcting codes [18,21,35], as well as elements of algorithmic parallelism. However, in some cases, one is interested in the raw time to solution, or "wall-clock time," given any reasonable resources. As such, in this paper, we explore a different cost model that allows for what we call "quantum precomputation." In the process, we aim to understand the opportunities and challenges inherent in generalizing classical ideas of precomputation, e.g., caching of results, indexing in databases, or creating lookup tables. The precomputation cost model allows for a quantum algorithm to start with access to a specially prepared resource state that depends on the algorithm and some portion (but not all) of its input. We neglect the cost of preparing this resource state, but we demand that it can be prepared efficiently, i.e., that the quantum and classical resources required scale polynomially in the size of the input.
Our precomputation cost model is motivated by real-world problems where the crucial limited resource is the computational power available after the problem is fully specified.
For some of these problems, the value of finding a solution as quickly as possible would justify investing extra effort ahead of time preparing to perform a computation. In fields ranging from optimization, to finance, to data analysis, there are tasks that naturally fit into this framework. If we can build useful quantum primitives that accelerate such tasks in the precomputation cost model, it could have a substantial impact even in cases where the overall quantum advantage is modest or non-existent. We study quantum precomputation because of these potential practical applications, and also because it offers the chance to investigate the nature of quantum computation from another angle. Notably, the no-cloning theorem imposes limitations on our ability to reuse the results of earlier quantum computations, which implies that precomputation may occupy a different role in quantum computing than it does classically.
In order for the precomputation cost model to make sense, there must be some components of the computational task that are naturally specified before others. For example, we could be given a classical description of a Hamiltonian now with the understanding that we will want to estimate some properties of its ground state that will be determined at a later time. In such a situation, we could prepare for when these properties are specified by generating and storing a sufficient number of copies of the ground state. In other cases, we might have a classical description of some unitary U available now that we will later wish to apply to a (currently unknown) state |ψ⟩. In this paper, we ask if we can find interesting or useful families of tasks that can be implemented using asymptotically fewer quantum resources in a cost model that allows for free precomputation.
We formalize our definition of the precomputation cost model in Section 2. In Section 3, we discuss some of the connections that quantum precomputation has with prior work on quantum and classical computation. We go on to explore how existing algorithmic primitives can be interpreted as tools for quantum precomputation in Section 4. Specifically, we make use of density matrix exponentiation and gate teleportation to accelerate the application of certain unitaries in the precomputation cost model [23,36], finding the possibility of speedups that range from quadratic to exponential (when comparing the cost in the precomputation model with the usual quantum gate complexity). In Section 5, we present a less straightforward protocol for quantum precomputation that uses a technique known as selective teleportation [18] to yield a quadratic improvement in complexity for a family of diagonal unitaries. We conclude with a discussion of open questions and potential applications in Section 6.

The Precomputation Cost Model

Formalizing the cost model
Analyzing the resources required to execute an algorithm requires a cost model. A good cost model encodes useful assumptions that simplify the analysis, abstracting away irrelevant details while keeping the essential information required to answer the questions at hand. There are a number of different choices one could make in formalizing the intuition behind quantum precomputation into a cost model, i.e., in specifying what it means to "allow a reasonable amount of work to be performed for free." In this section, we propose a concrete definition flexible enough to encompass several interesting examples rather than a maximally general abstract definition.
There are many kinds of computational tasks that we might wish to analyze in the precomputation cost model. We will loosely formalize a computational task as an algorithm, which we treat as a map that takes an input from some set of valid inputs and returns a correct output (or a sample from a correct distribution over outputs). Different algorithms may define different notions of valid inputs and correct outputs. For now, we leave these details unspecified, although they may be crucial to determining the complexity of implementing an algorithm. For example, there are some tomographic tasks that are efficient for pure state inputs but prohibitively expensive for general mixed state inputs [22]. In other cases, the computational complexity of a problem may vary depending on the definition of the "correct" output, e.g., what kind of approximation is allowed [19].
To be sufficiently general, we need a notion of a quantum algorithm that can accept both quantum and classical input and can output both quantum and classical data. We also need to allow for the possibility that the input is partitioned into two components that are provided at different times. For simplicity, we assume that the earlier input (that might be used in the precomputation step) is classical, and that the later input may be a combination of classical and quantum data. Let x denote the (classical) input provided at the earlier time and let ρ and y denote the quantum and classical components of the input provided at a later time. For the quantum and classical outputs we use the symbols σ and z respectively.
In the usual situation, where we do not take advantage of the fact that some portion of the input may be available ahead of time, a quantum algorithm A implements a map A : x, y, ρ → z, σ. In general, we can understand A as performing some classical computation that takes x and y as an input, determining a quantum circuit that is subsequently applied to ρ. The portions of the resulting quantum state that are not measured or discarded constitute σ. The classical component of the output, z, is classically computed from x, y, and the measurement outcomes. In a standard cost model, we are concerned with the cost of executing the algorithm A given access to x, y, and ρ. In a model that allows for free precomputation, we aim to produce the same (distribution over) outputs by implementing the map P : x(A, x), |Γ(A, x)⟩, y, ρ → z, σ, where x(A, x) and |Γ(A, x)⟩ represent the classical and quantum outputs of some precomputation step. We allow for x(A, x) and |Γ(A, x)⟩ to be generated using a "reasonable" amount of classical and quantum computation performed ahead of time, i.e., with knowledge of A and x but not ρ or y. In a precomputation cost model, the only cost that we consider directly is the cost of performing the map P.
In order to fully define a precomputation cost model and compare it to a standard cost model, we therefore have to specify answers to two questions: i) How will we quantify the costs of implementing A and P? ii) What do we mean when we say that we allow for a "reasonable" amount of classical and quantum computation to be used in the preparation of |Γ(A, x)⟩ and x(A, x)?
In this paper, we focus on quantifying the quantum resources used to implement P (and A itself) in terms of the quantum circuit complexity (a term that we use interchangeably with "gate complexity"), the number of gates from some elementary set of discrete operations required to implement the algorithm. We consider a discrete set of gates that consists of one- and two-qubit Clifford gates, single-qubit computational basis measurement operations, and T gates. We also choose to count single-qubit identity operations as gates in order to include the cost of storage (which is comparable in most architectures to the cost of active workspace). This choice implies that our notion of circuit complexity grows asymptotically as fast as the product of the number of qubits and the circuit depth (the number of layers of gates, executed in parallel).
We could define other related models that allow for free precomputation but account for "cost" differently. Depending on the context, it might be useful to work in an oracle model, or to count only the number of non-Clifford gates, or even to quantify the spacetime volume used in a particular error-correcting code. It might also be useful to discuss the number of gates required for the best known implementation of an algorithm, rather than the absolute minimum required. For the examples we consider, this distinction will not be important. We find that discussing the gate complexity is convenient because it allows us to use the same model to consider several different examples, but we will make some comments along the way regarding other notions of cost. As we consider these examples, it will sometimes make sense to allow for A or P to be implemented with some error. In the context of this work, when we need to allow for some notion of error, it will be sufficient to focus on the case where the output is a quantum state and we can quantify the error using a single parameter ϵ that bounds the trace distance between the ideal output and the actual output.
By focusing on quantifying the cost in the precomputation model in terms of the number of quantum operations, we are implicitly treating quantum operations as a fundamentally different and more limited resource than classical ones. This decision is motivated by the practical observation that quantum operations on a fault-tolerant computer are expected to be vastly slower and more expensive than classical operations [4]. Nevertheless, we would like a definition of the precomputation cost model that is useful in practice, so we demand that the classical time and space complexity of implementing P scales as O(poly(ϵ^{-1}, |x|, |y|, |ρ|)). Here the notation |*| indicates the size of * in terms of classical or quantum bits.
Besides specifying how we quantify the cost of implementing A or P, we also need to formalize the notion that the amount of work performed ahead of time is required to be "reasonable." We should bound the quantum gate complexity of the precomputation step, as well as the classical time and space complexities. For all of these resources, we allow their usage during the precomputation step to scale as O(poly(ϵ^{-1}, |x|)). Although we define our model with this coarse-grained notion of what is allowed during the precomputation step, we will discuss the actual scaling of the various resources in more detail for the particular examples we consider in this paper.

Prior work
While the authors are not aware of prior work that has focused on a cost model that allows for free precomputation in the sense that we consider, there are a number of closely related ideas that we draw inspiration from. The paper that first described gate teleportation speculated that it might be used to mass manufacture resource states for later consumption [23]. For example, one could imagine using magic state distillation to distill a large number of magic states, storing them for use in a later computation [8]. Going beyond the prototypical use of magic state distillation to implement a T gate, state distillation schemes have been proposed for a variety of other few-qubit operations [10,13,21,28,35]. In Ref. 13, Jones et al. proposed a method that implements an arbitrary single-qubit Z rotation with success probability 1 − δ by precomputing and storing a resource state on O(log(1/δ)) qubits. More abstractly, measurement-based quantum computing has some similarity to quantum precomputation, but it aims to prepare generically useful resource states rather than ones that are tailored to accelerating particular algorithms [40,43].
The idea of precomputing and storing a reservoir of resource states for single- or few-qubit operations is appealing, but it faces serious challenges. In particular, the number of such resource states required for interesting and classically intractable applications appears large [6,20,44], while quantum memory has a comparable cost with active workspace in most proposed architectures [49]. For example, Ref. 20 estimates that thousands of logical qubits and billions of Toffoli and T gates would be required to factor a 2048-bit RSA integer using Shor's algorithm. A fault-tolerant quantum computer that is large enough to perform this computation, but not too much larger, would be unable to precompute and store more than a tiny fraction of the necessary resource states ahead of time.
Even so, one might ask if precomputing resource states for T or Toffoli gates offers a simple example of asymptotic advantage when the cost of the precomputation itself is neglected. In our definition of the precomputation cost model, the answer is no. This is because, even with access to the appropriate resource state, applying either of these gates still requires a (nonzero) constant number of operations and our model allows T gates to be performed at unit cost. If we instead consider the task of implementing arbitrary single-qubit rotations to within some precision ϵ, Ref. 13 provides an example where allowing for free precomputation does indeed change the asymptotic cost. Specifically, precomputation can be used to remove the dependence on ϵ from the cost (not including the cost of the precomputation step) at the expense of incurring some logarithmic dependence on the allowed failure probability δ.
The idea of supplementing a quantum computer with a specially-prepared resource state has also been considered from a complexity-theoretic perspective. The complexity class BQP/qpoly formalizes the power of a polynomial-time quantum computer augmented with an arbitrary resource state, referred to as "quantum advice," that is allowed to depend on the length of the input. Comparing this complexity class to our model of quantum precomputation requires some care, so we provide a longer discussion in Appendix A and merely summarize the conclusions here. First of all, the model formalized in BQP/qpoly places no restrictions on the computational power used to prepare the resource state, whereas we require that it be preparable in polynomial time. Secondly, the quantum advice states of BQP/qpoly can only depend on the length of the input. We allow for the resource states to depend on a subset of the parameters, denoted by x. Thirdly, the only problems that fit into the framework of BQP/qpoly are decision problems, which have a classical input and a (single bit of) classical output. This is a more limited setting than the one that we consider.
Despite these differences between BQP/qpoly and our notion of quantum precomputation, we can make a useful comparison if we restrict ourselves to considering the power of both models to solve decision problems. One might suspect that our model of quantum precomputation gets some additional power from the fact that we allow the resource state to depend on the input in richer ways than allowed by the complexity class BQP/qpoly. However, any decision problem that is solvable in polynomial time in the precomputation model we have defined is not only a member of BQP/qpoly, but also BQP itself. This is because we only allow a polynomial amount of "free" precomputation, which can't add any power to a machine that is already allowed to run arbitrary polynomial-time quantum computations. Ultimately, our model of quantum precomputation is trying to capture a finer-grained notion of speedup than these particular complexity classes are designed to address. Imprecisely, we could say that we are interested in the power of the "advice that a polynomial time quantum computer can give itself."
In the context of classical computing, the term "precomputation" has been used extensively to describe variations on the idea of performing useful work ahead of time and caching the result. For example, branch prediction is an essential component of modern computer architecture design [47]. Precomputation is used to optimize certain tasks in computer graphics [46] and computer vision [24]. The precomputation of expensive operations involved in breaking cryptographic schemes is both a practical and theoretical concern [5], which is closely related to the study of advice in classical computational complexity theory [30]. For the most part, these examples seem slightly different than the quantum algorithmic primitives that we will discuss. Classically, some applications of precomputation derive their usefulness from the ability to reuse the precomputed information rather than the time-sensitive nature of the computation. In contrast with the classical case, the resource states that we consider are generally consumed when used, precluding their reuse. It would be interesting if other techniques, perhaps based on gentle measurements [3], can be used to design quantum precomputation protocols that allow for some amount of information reuse.

Examples of Precomputation
In this section, we discuss several examples of quantum precomputation. These examples show how existing quantum primitives can be leveraged to obtain an advantage in a cost model that allows for free precomputation. In particular, we study the application of density matrix exponentiation (introduced in Ref. 36, reviewed in Appendix B.1) and gate teleportation (introduced in Ref. 23, reviewed in Appendix B.2) as tools for quantum precomputation.
Before turning towards these examples, it is worth briefly discussing two particularly simple forms of quantum precomputation. One natural example is the case where precomputation is equivalent to performing the first steps of some algorithm and then waiting until the problem is fully specified to perform the rest. For example, many quantum algorithms consist of applying a known unitary to the all zero state and performing a measurement. If we knew the unitary ahead of time but the measurement wasn't yet specified, we could perform the state preparation in advance. More speculatively, there may be settings where it is natural to prepare for the future execution of some quantum machine learning task by encoding data into a quantum state "on the fly" as it streams in. This latter idea is related to rigorous work on quantum algorithms in streaming settings, which is itself connected to the study of quantum communication complexity [29,32].
It is easy to understand how one might be able to usefully perform precomputation by executing the steps at the beginning of some algorithm ahead of time. We could try to imagine situations where this naturally occurs, but it is unclear if our formal definition of the notion of quantum precomputation adds anything to the understanding of such cases. For this reason, in the other examples that we consider in this paper, we focus instead on the goal of using precomputation to accelerate steps that lie in the middle of an algorithm, rather than at the beginning.
Turning towards a second example, recall that we briefly discussed the idea of precomputing magic states to use as resources for implementing non-Clifford gates in an error correcting code in Section 3. We explained how there is no advantage to this idea in the primary cost model we use throughout this paper because we do not distinguish between Clifford and non-Clifford gates. This is true, but it is instructive to consider this example in a slightly different model of quantum precomputation, where we instead quantify the amount of spacetime volume required to implement a circuit in a quantum error correcting code. For simplicity, let us work in units where a depth d circuit acting on n qubits has a volume of dn and let us assume that the spacetime volume required to prepare a suitably distilled T state is λ ≫ 1. Furthermore, we will neglect the spacetime cost of qubits that have not yet been initialized and qubits that have already been measured (since they could presumably be used for other purposes).
Under this more nuanced cost model, we can compare the cost of implementing an algorithm with and without the precomputed T states. Let us consider a depth d circuit on n qubits that consumes one magic state per time step. Implementing this algorithm without precomputation would require a spacetime volume of nd + λd in order to account for the cost of the circuit itself and the cost of the magic state distillation. In the precomputation model, we allow ourselves to start with all d magic states already prepared, but we must account for the cost of storing them while the algorithm executes. We are using d − s qubits to store the magic states at each step s from 0 to d − 1, so the spacetime volume required is nd + Σ_{s=0}^{d−1} (d − s) = nd + d(d + 1)/2. In this cost model, precomputing the magic states removes the dependence on λ but it increases the dependence on d from linear to quadratic. Realistic values of λ are expected to be significantly less than 100, which suggests that only relatively short-depth circuits of this type would benefit from free access to precomputed magic states [34]. This example highlights the fact that our model implicitly penalizes precomputation protocols for the space used to store their precomputed resource states. Because of this penalization, it is not trivially true that a precomputation protocol is at least as efficient as a straightforward approach to executing an algorithm.
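The comparison above can be made concrete with a short numerical sketch. The function names and the value λ = 50 are illustrative assumptions rather than quantities taken from the text; the formulas follow the simplified volume accounting described in this paragraph.

```python
def volume_without_precomputation(n: int, d: int, lam: float) -> float:
    """Circuit volume n*d plus a distillation cost lam for each of the d magic states."""
    return n * d + lam * d


def volume_with_precomputation(n: int, d: int) -> int:
    """Circuit volume n*d plus storage of the d - s unused magic states at step s."""
    storage = sum(d - s for s in range(d))  # equals d * (d + 1) / 2
    return n * d + storage


def precomputation_helps(n: int, d: int, lam: float) -> bool:
    """True when the precomputed variant has strictly smaller spacetime volume."""
    return volume_with_precomputation(n, d) < volume_without_precomputation(n, d, lam)


if __name__ == "__main__":
    lam = 50  # hypothetical distillation volume, chosen only for illustration
    for depth in (10, 98, 99, 200):
        print(depth, precomputation_helps(n=10, d=depth, lam=lam))
```

Under these assumptions the storage term d(d + 1)/2 is smaller than the distillation term λd only when d < 2λ − 1, which is the sense in which only short-depth circuits benefit.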

Precomputation with density matrix exponentiation
In this subsection, we consider applications where reflections about an expensive-to-prepare state, |b⟩, are a dominant contribution to the complexity of an algorithm. As we explain below, an algorithm that requires q calls to the reflection operator R = I − 2 |b⟩⟨b| can be implemented by consuming O(q^2) copies of |b⟩ (at nearly unit time per consumption) in lieu of making any calls to R directly. A cost model that allows for free precomputation can therefore entirely remove the component of such an algorithm's cost that depends on |b⟩. In the most extreme cases, this could lead to a cost in the precomputation model that is exponentially smaller than the cost in a standard model. For example, preparing or reflecting about the state |b⟩ might require using poly(|x|) gates to implement a brute-force encoding of some classical input x into n = polylog(|x|) qubits, while the other components of the algorithm could scale polynomially in n. We consider the quantum algorithm for linear systems as a specific example of an algorithm where such a speedup might prove useful [12,26].
This type of quantum precomputation makes use of a technique called density matrix exponentiation. Introduced in Ref. 36, density matrix exponentiation allows us to consume copies of some density matrix ρ in order to approximately apply the unitary e^{−itρ} for some time t. We provide a brief review of density matrix exponentiation in Appendix B.1, but for now we just recall the fact that using density matrix exponentiation to implement e^{−itρ} to within an error ϵ (in the diamond norm) requires O(t^2 ϵ^{-1}) copies of ρ [31].
Before explaining how we can make good use of density matrix exponentiation for quantum precomputation, let us examine why it does not lead to efficient protocols for implementing general unitaries in the precomputation cost model. Imagine that we want to implement a unitary U that corresponds to evolution under a Hamiltonian H for a time t, where ||H|| (the spectral norm of H) and t are both O(1). We can shift H by some multiple c of the identity to obtain a positive semidefinite operator H + cI with ||H + cI|| = O(1). Applying U using density matrix exponentiation entails evolving under the Hamiltonian corresponding to the normalized state ρ = (H + cI)/Tr(H + cI) for a time t' = t Tr(H + cI). The cost of implementing U using density matrix exponentiation scales quadratically with t', which can scale exponentially with the number of qubits in the worst case. This occurs easily even for simple unitaries, for example, when H is a non-trivial Pauli operator.
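As a worked instance of this obstruction, consider the n-qubit Pauli operator H = Z^{⊗n} with shift c = 1. This specific choice, and the symbol t' for the rescaled evolution time, are illustrative assumptions; the copy count uses the quadratic-in-time, inverse-in-error scaling of density matrix exponentiation.

```latex
\begin{align}
H &= Z^{\otimes n}, \qquad c = 1, \qquad H + cI \succeq 0, \\
\rho &= \frac{H + cI}{\operatorname{Tr}(H + cI)} = \frac{Z^{\otimes n} + I}{2^{n}}, \\
e^{-i t' \rho} &= e^{-i t (H + cI)} = e^{-i t c}\, e^{-i t H}
  \quad \text{for} \quad t' = t \operatorname{Tr}(H + cI) = 2^{n} t, \\
\text{copies of } \rho &= O\!\left(t'^{2} \epsilon^{-1}\right) = O\!\left(4^{n} t^{2} \epsilon^{-1}\right).
\end{align}
```

Since Tr(Z^{⊗n}) = 0 and Tr(I) = 2^n, the normalization alone forces an exponentially long effective evolution time, even though ||H|| = 1 and t = O(1).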
In order for density matrix exponentiation to be a useful tool for precomputation, we need to focus on cases where the normalization factor is small. One natural example of a unitary that is efficiently implementable using density matrix exponentiation is the reflection about a state |b⟩, R = I − 2 |b⟩⟨b| = e^{−iπ|b⟩⟨b|}. In order to implement R up to an accuracy ε using density matrix exponentiation, it suffices to consume O(ε^{-1}) copies of the state |b⟩⟨b|. If an algorithm involves q calls to R, we can guarantee a constant overall error ϵ by setting ε ∝ ϵ q^{-1}. We can therefore implement all q calls to R to within the desired accuracy by consuming a total of O(ϵ^{-1} q^2) copies of |b⟩⟨b|.
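The inverse scaling of the error with the number of copies can be checked numerically on a single qubit. The following is a minimal sketch, not the construction of Ref. 36 in full generality: it assumes |b⟩ = |+⟩ and a target state |0⟩, and it uses the standard closed form for one partial-swap step followed by discarding the consumed copy. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def combine(terms):
    """Sum of coeff * matrix over a list of (coeff, matrix) pairs."""
    return [[sum(c * m[i][j] for c, m in terms) for j in range(2)] for i in range(2)]


def dme_step(sigma, rho, delta):
    """Apply exp(-i * delta * SWAP) to sigma (x) rho, then trace out the copy of rho."""
    c, s = math.cos(delta), math.sin(delta)
    commutator = combine([(1, matmul(rho, sigma)), (-1, matmul(sigma, rho))])
    return combine([(c * c, sigma), (s * s, rho), (-1j * s * c, commutator)])


def trace_distance(a, b):
    """Trace distance between two single-qubit density matrices."""
    d = combine([(1, a), (-1, b)])  # traceless and Hermitian
    return math.sqrt(d[0][0].real ** 2 + abs(d[0][1]) ** 2)


def reflection_error(num_copies):
    """Error of approximating R = exp(-i * pi * |b><b|) with num_copies copies of |b>."""
    rho_b = [[0.5, 0.5], [0.5, 0.5]]  # |b> = |+>, an illustrative choice
    sigma = [[1, 0], [0, 0]]          # target register starts in |0><0|
    ideal = [[0, 0], [0, 1]]          # R|0> = -|1>, so the ideal output is |1><1|
    for _ in range(num_copies):
        sigma = dme_step(sigma, rho_b, math.pi / num_copies)
    return trace_distance(sigma, ideal)


if __name__ == "__main__":
    for copies in (10, 100, 1000):
        print(copies, reflection_error(copies))
```

In this toy setting, increasing the number of copies tenfold reduces the trace-distance error by roughly a factor of ten, consistent with the O(ε^{-1}) copy count quoted above.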
As an example of a context where this kind of precomputation might be useful, consider the quantum linear systems problem [12,14,26,33,37,48]. Given a matrix A and a vector b⃗, the linear systems problem is to find a vector x⃗ such that A x⃗ = b⃗. The quantum formulation of this problem encodes the vector b⃗ into the amplitudes of a state |b⟩ and asks that we prepare a state |x⟩ ∝ A^{-1} |b⟩. Without loss of generality we can assume that A is Hermitian.³ The access models for A and |b⟩ can vary, but it is usually assumed that one has access to an oracle that prepares |b⟩ and either i) the ability to perform time evolution by A, ii) oracle access to the non-zero entries of (a sparse) A, or iii) a block encoding of A. Regardless of the access model for A, the most efficient algorithms for this problem query the state preparation oracle for |b⟩ a number of times that scales as Õ(κ), where κ denotes the condition number of A and the Õ(·) notation hides logarithmic factors in κ and the precision. These queries are used to prepare |b⟩ and to implement the reflection R about |b⟩.
In a context where a classical description of |b⟩ is available before A, preparing Õ(ϵ^{-1} κ^2) copies of |b⟩ during the precomputation step would allow us to apply one of the standard quantum algorithms for the linear systems problem at a cost that is independent of the cost of preparing |b⟩. As we argued above, it is easy to imagine situations where preparing or reflecting about |b⟩ is exponentially more expensive than any other component of the algorithm. For example, we could take |b⟩ to be a brute-force encoding of some classical data x into n = polylog(|x|) qubits, such that preparing or reflecting about |b⟩ has a complexity that scales polynomially in |x|. We could also make the (sometimes reasonable) assumption that the condition number of A and the gate complexity of implementing A (under whatever notion of access is appropriate) scale polynomially in n. Given these two conditions, the complexity of applying any of the standard quantum algorithms for the linear systems problem would be exponentially better in the precomputation cost model than in the standard one (assuming that the target precision is a constant).
³ One can always solve a linear systems problem on a larger space with the Hermitian Ã := \begin{pmatrix} 0 & A \\ A^{\dagger} & 0 \end{pmatrix} instead of the original A.
Of course, this separation is entirely due to the fact that we discount the cost of preparing the resource state. In fact, in this form of precomputation, the cost of preparing the resource state would be asymptotically larger than the cost of implementing the reflections in the standard way since we require Õ(q^2) copies of |b⟩ to implement the reflection R a total of q times with constant error in the overall algorithm. Furthermore, the optimal algorithms for the quantum linear systems problem have a logarithmic dependence on the target precision [12], whereas our approach introduces a polynomial dependence. Additionally, sufficient storage for the copies of |b⟩ would be required. Nevertheless, in a situation where b⃗ is specified ahead of time and the solution to the problem is sufficiently valuable and time-sensitive, quantum precomputation could prove useful. Note that there is no significant classical cost in terms of storage or computation for this form of precomputation.
It is worth pointing out that, if one is willing to prepare Õ(κ^2) copies of |b⟩ ahead of time, there is a simpler strategy for solving the linear systems problem that does not require density matrix exponentiation. However, this simpler strategy is less efficient with respect to the number of times that A must be queried. Consider the original HHL algorithm of Ref. 26. This algorithm requires starting with the state |b⟩ and time-evolving under the Hamiltonian A for a time that scales as Õ(κ) (to perform phase estimation). This is followed by a postselection step that succeeds with probability Ω(1/κ^2). Normally one uses amplitude amplification to increase the success probability to O(1).
Instead of using amplitude amplification, one could repeatedly prepare the appropriate state and actually perform the postselection based on the output from phase estimation. This would solve the quantum linear systems problem with high probability using a number of copies of |b⟩ that scales as Õ(κ^2). However, it would also require a total amount of time evolution under A equal to Õ(κ^3). The approach we proposed above uses a similar number of copies of |b⟩, but the scaling in terms of A (either time evolution under A or a related notion of access) can be made nearly linear with respect to κ by using the optimal algorithms of, e.g., Ref. 12.

Precomputing Clifford unitaries with gate teleportation
In this subsection, we consider accelerating the task of implementing an n-qubit unitary from the Clifford group using precomputation. We explain how a well-known construction allows for a quadratic savings in gate complexity (when comparing the cost in the precomputation model to the gate complexity in a standard cost model). This construction is a straightforward application of gate teleportation, a technique introduced in Ref. 23 which we illustrate in Figure 1 and review in more detail in Appendix B.2 (along with the definition of the Clifford group). Although this example of quantum precomputation is particularly simple, it provides a good introduction to some of the concerns relevant in the more technically interesting example that we consider in Section 5.
We recall that an arbitrary unitary from the n-qubit Clifford group can be efficiently implemented using one-and two-qubit Clifford gates arranged in a circuit with depth O(n) [9], leading to a gate complexity of O(n 2 ).A counting argument shows that this asymptotic scaling must be optimal for most elements of the Clifford group.We will show that, in the precomputation cost model, the quantum gate complexity of applying the same unitary is only O(n).
Let $U$ be an arbitrary unitary in $\mathcal{C}^{(2)}$ (the Clifford group on $n$ qubits) and $|\psi\rangle$ be an arbitrary $n$-qubit quantum state. Using standard multi-qubit gate teleportation, we can prepare a state $|\Gamma(U)\rangle$ on $2n$ qubits that we can consume to apply $U$ to $|\psi\rangle$ (up to a Pauli correction). This straightforward generalization of the procedure presented in Figure 1 consists of preparing $n$ Bell pairs and applying $U$ to a set of $n$ qubits, one taken from each Bell pair. Let us consider the steps involved in applying $U$ to $|\psi\rangle$ once $|\Gamma(U)\rangle$ is already prepared. Applying a Clifford unitary using gate teleportation involves making $n$ simultaneous Bell-basis measurements of the $3n$-qubit state $|\psi\rangle \otimes |\Gamma(U)\rangle$. The resulting $n$-qubit state, $U P |\psi\rangle = (U P U^\dagger) U |\psi\rangle$, can therefore be obtained in constant depth, where $P$ is the "byproduct operator," a member of the Pauli group that is determined by the measurement outcomes. By the definition of the Clifford group, the correction operator $U P^\dagger U^\dagger$ is also a Pauli operator (up to a possible phase) and can therefore be applied in constant depth to yield the desired state $U|\psi\rangle$. The overall quantum circuit complexity (neglecting the cost of preparing $|\Gamma(U)\rangle$) is therefore $O(n)$, in contrast with the $O(n^2)$ cost of applying $U$ without precomputation.
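The single-qubit version of this procedure can be checked numerically. The following is a minimal NumPy sketch rather than the construction of Ref. 23 verbatim: the function names `resource_state` and `teleport_gate`, the labelling of the Bell outcomes by bits `(a, b)`, and the convention $P = X^a Z^b$ are illustrative assumptions. It projects $|\psi\rangle \otimes |\Gamma(U)\rangle$ onto each Bell outcome and confirms that applying $U P^\dagger U^\dagger$ recovers $U|\psi\rangle$.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)
PHI_PLUS = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)


def resource_state(U):
    """|Gamma(U)>: a Bell pair with U applied to its second qubit."""
    return np.kron(I2, U) @ PHI_PLUS


def teleport_gate(U, psi, a, b):
    """Post-measurement output state and byproduct P for Bell outcome (a, b)."""
    # Assumed labelling: outcome (a, b) corresponds to byproduct P = X^a Z^b.
    P = np.linalg.matrix_power(X, a) @ np.linalg.matrix_power(Z, b)
    # Bell-basis state (I x P)|Phi+> on the input qubit and first resource qubit.
    bell = (np.kron(I2, P) @ PHI_PLUS).reshape(2, 2)
    joint = np.kron(psi, resource_state(U)).reshape(2, 2, 2)
    out = np.einsum("ij,ijk->k", bell.conj(), joint)  # unnormalized U P |psi>
    return out / np.linalg.norm(out), P


U = H @ S                                   # an example single-qubit Clifford
psi = np.array([0.6, 0.8j], dtype=complex)  # an arbitrary input state
for a in (0, 1):
    for b in (0, 1):
        out, P = teleport_gate(U, psi, a, b)
        correction = U @ P.conj().T @ U.conj().T  # Pauli up to phase for Clifford U
        assert np.allclose(correction @ out, U @ psi)
```

The projection stands in for the measurement: each of the four outcomes occurs with probability 1/4, and the loop checks all of them.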
Although we are primarily concerned with the quantum gate complexity of applying $U$ given $|\Gamma(U)\rangle$, we may also wish to consider the classical computational costs of determining which of the $4^n$ possible correction operators to apply once the measurement outcomes are known. We need to use $2n$ bits initially to store the results of the Bell-basis measurement that determines the byproduct operator. We could store a classical description of the $O(n^2)$ Clifford gates in $U$ and apply them to the byproduct operator. This would require $O(n^2)$ operations (updating a constant number of the $O(n)$ stored bits each time we conjugate by a gate in the circuit), which could be performed in $O(n)$ sequential steps by parallelizing across gates in the same layer of the circuit.
We can reduce the depth of the classical computation (although not the overall number of operations) by factorizing the correction operator ahead of time, $U P^\dagger U^\dagger \propto \prod_{i=1}^{n} (U X_i U^\dagger)^{x_i} (U Z_i U^\dagger)^{z_i}$, where the $x_i$ and $z_i$ are determined by the outcomes of the Bell-basis measurements. This allows us to classically precompute each of the $2n$ Pauli operators of the form $U X_i U^\dagger$ or $U Z_i U^\dagger$ and store the results using $O(n^2)$ bits. Once we know the measurement results, we can multiply the appropriate operators together in logarithmic depth using a divide-and-conquer strategy, ultimately computing the final correction operator using $O(n^2)$ operations in $O(\log n)$ sequential steps (neglecting the classical cost of the precomputation).
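To make the classical bookkeeping concrete, the following sketch stores each precomputed Pauli operator as a $2n$-bit vector (x-part followed by z-part) and combines the selected ones by pairwise XOR in a logarithmic number of rounds. It is illustrative only: signs and phases are ignored, the table is assumed to have been computed beforehand, and `combine_corrections` is a hypothetical helper name.

```python
import numpy as np


def combine_corrections(table, outcomes):
    """XOR together the rows of `table` selected by `outcomes`.

    table[j] is the 2n-bit (x | z) vector of the j-th precomputed Pauli,
    ordered as the images of X_1, ..., X_n, Z_1, ..., Z_n under conjugation by U.
    Returns the combined vector and the number of parallel XOR rounds used.
    """
    rows = [table[j] for j, bit in enumerate(outcomes) if bit]
    if not rows:
        return np.zeros(table.shape[1], dtype=np.uint8), 0
    rounds = 0
    while len(rows) > 1:
        # Every XOR in this pass is independent of the others, so one pass
        # corresponds to a single parallel step.
        paired = [rows[i] ^ rows[i + 1] for i in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:
            paired.append(rows[-1])
        rows = paired
        rounds += 1
    return rows[0], rounds


# Example: U = CNOT (control qubit 1, target qubit 2), n = 2.
# Images: X1 -> X1 X2, X2 -> X2, Z1 -> Z1, Z2 -> Z1 Z2, written as (x1 x2 | z1 z2).
cnot_table = np.array([[1, 1, 0, 0],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 1, 1]], dtype=np.uint8)
vector, rounds = combine_corrections(cnot_table, [1, 0, 0, 1])  # byproduct X1 Z2
```

The table occupies $(2n)^2$ bits, matching the $O(n^2)$ storage quoted above, and the number of rounds grows as the logarithm of the number of selected rows.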

5 Precomputing diagonal unitaries in the Clifford hierarchy with selective gate teleportation
In this section, we show how a more sophisticated form of gate teleportation introduced in Ref. 18 can be used to construct a precomputation protocol for a set of diagonal unitaries in the Clifford hierarchy (reviewed in Appendix B.2). We graphically illustrate this selective gate teleportation in Figure 2 and present a more substantial review in Appendix B.3. In Section 4.2, we considered a simple example of quantum precomputation that uses standard gate teleportation to apply some $U \in \mathcal{C}^{(2)}$ (the Clifford group). We explained how the $O(n^2)$ gate complexity required to implement an arbitrary $n$-qubit Clifford unitary can be reduced to $O(n)$ in the precomputation model. Although the approach is less straightforward, the generalization that we present in this section achieves the same quadratic compression for a subset of unitaries from higher levels of the Clifford hierarchy. In other words, we show that unitaries from the family $\mathcal{Z}^{(k)}$, defined below, that have a gate complexity of $\Theta(n^k)$ when implemented directly can be implemented with a gate complexity of $\tilde{O}(k n^{k/2})$ in the precomputation cost model (assuming $k$ is even for simplicity). The basic strategy we use is to apply such a unitary with gate teleportation and then use a series of selective gate teleportation steps to apply the correction operator up to some simpler correction that can be implemented directly.
Before we present our actual proposal, let us consider a naive generalization, where we use gate teleportation to implement some $U \in \mathcal{C}^{(3)}$ (the third level of the Clifford hierarchy). By definition, the correction operator required will be some $R \in \mathcal{C}^{(2)}$. Applying $R$ directly would result in an overall gate complexity of $O(n^2)$, essentially saving a factor of $n$ compared to the cost of implementing $U$ directly, which is $\Omega(n^3)$ by a counting argument. For a general $U \in \mathcal{C}^{(k)}$, it is unclear if it is possible to obtain an advantage greater than a factor of $n$ in the precomputation model.
Figure 2: A circuit diagram for the one-qubit version of selective gate teleportation [18]. This protocol allows for the teleportation of a choice of unitaries, $U_A$ or $U_B$, onto an input state. Which unitary is teleported is controlled by the measurement settings (the four ancilla qubits are each measured in the $X$ or $Z$ basis according to the prescriptions shown in the blue and red shaded areas of the diagram). The possible states of the output qubit are color-coded to match the measurement settings that select for them. In our use of selective teleportation, we take $U_A = U$ and $U_B = I$. Byproduct operators $P^{(1)}$ and $P^{(2)}$ from the set $\{I, X, Z, XZ\}$ are randomly applied before and after the selected unitary based on the measurement outcomes.
However, if we restrict ourselves to considering a smaller set of unitaries, we can do better. Rather than allowing for arbitrary elements of the Clifford hierarchy, we limit ourselves to considering elements of the hierarchy that are also diagonal. To simplify the presentation, we actually restrict ourselves even further in this section, considering only those gates in $\mathcal{C}^{(k)}$ that are composed of products of $\pm I$, Pauli $Z$ operators, and controlled-$Z$ operators with up to $k-1$ controls. We denote this set $\mathcal{Z}^{(k)}$ and, in Appendix C, we show that it forms a group. We also note that $\mathcal{Z}^{(j)}$ is a proper subgroup of $\mathcal{Z}^{(k)}$ for $j < k$ and prove the following proposition:
Proposition 1. Consider a gate $G \in \mathcal{Z}^{(k)}$ and a product of single-qubit Pauli $X$ operators that we denote by $X_s$ (where $s \subseteq [n]$ indicates the indices of the qubits where $X_s$ acts non-trivially). Define $G'$ in the following way, $G' = X_s G X_s G^\dagger$. Then $G' \in \mathcal{Z}^{(k-1)}$. As a corollary, we also have that [equation missing from the source text].
Diagonal unitaries commute, and the elements of $\mathcal{Z}^{(k)}$ are all self-inverse. As a result, we can specify a $U \in \mathcal{Z}^{(k)}$ using exactly $1 + \sum_{j=1}^{k} \binom{n}{j}$ bits, one to specify the sign and one to specify the presence or absence of each possible $C^{j-1}Z$ gate for each $j \in [1..k]$. A $C^{j-1}Z$ gate can be implemented using $O(j)$ $T$ gates in depth $O(\log j)$ [39]. An arbitrary gate $G \in \mathcal{Z}^{(k)}$ can therefore be implemented in depth $\tilde{O}(n^{k-1})$ and gate complexity $\tilde{O}(n^k)$, even under reasonable assumptions about qubit connectivity [41]. Furthermore, by counting the number of distinct elements of $\mathcal{Z}^{(k)}$, we can also see that a typical element must have a circuit complexity lower bounded by $\Omega(n^k)$.
We begin our construction by preparing the usual $2n$-qubit resource state for applying the gate $U \in \mathcal{Z}^{(k)}$ using teleportation. If this state were used directly for gate teleportation, we would need to perform a correction of the form $U P^\dagger U^\dagger$ for some $n$-qubit byproduct operator $P$ (which we can write as a product of single-qubit $X$ and $Z$ operators). We will perform this correction using selective teleportation. Note that we can neglect the $Z$ corrections (as they can be trivially commuted to the end of the circuit up to a sign). Factorizing the $X$ component of the corrections, we see that we need to apply the unitary $\prod_{i=1}^{n} U X_i^{x_i} U^\dagger$, where the bits $x_i$ will be chosen based on the measurement outcomes of the first gate teleportation step. It is convenient to rewrite each of the terms in the product as
$$U X_i^{x_i} U^\dagger = X_i^{x_i} \left( X_i^{x_i} U X_i^{x_i} U^\dagger \right), \tag{13}$$
i.e., a product of $X_i^{x_i}$ and an operator that is in $\mathcal{Z}^{(k-1)}$ by Proposition 1. We can use selective gate teleportation to apply the diagonal term $(X_i^{x_i} U X_i^{x_i} U^\dagger)$ from each of the $n$ possible factors of the correction operator. Note that we can do this after applying $U$ to the $n$ Bell pairs and before performing the Bell-basis measurement that completes the gate teleportation. We ignore the $X_i^{x_i}$ terms that precede the diagonal components of the factors of the correction operator in Equation (13) because we can absorb them into the byproduct operators that will arise anyway from the selective teleportation. For each of the correction operators, we need $4n$ additional qubits to implement the selective gate teleportation, so the overall overhead is $4n^2$ qubits.
When we attempt to use selective teleportation in this way to implement the correction operator, we will actually end up implementing the operator $P^{(n+1)} R^{(n)} P^{(n)} \cdots R^{(1)} P^{(1)}$, where the $P^{(i)}$ terms represent randomly obtained products of Pauli operators and the $R^{(i)} \in \mathcal{Z}^{(k-1)}$. Notice that we can commute the Pauli terms to the left at the cost of requiring a series of corrections $R^{(i)\prime} \in \mathcal{Z}^{(k-2)}$.
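The level-lowering property used above, namely that $X_i U X_i U^\dagger$ sits one level below $U$ in the $\mathcal{Z}^{(k)}$ family, can be checked on small examples. The sketch below represents an element of the family by its $\pm 1$ diagonal; the helper names `cz_diagonal`, `commute_x_through`, and `level` are illustrative, and `level` simply reports the largest number of qubits touched by any multi-controlled $Z$ factor.

```python
import numpy as np


def cz_diagonal(n, qubits):
    """Diagonal of a multi-controlled Z on `qubits`: -1 iff all of them are 1."""
    idx = np.arange(2 ** n)
    mask = sum(1 << (n - 1 - q) for q in qubits)
    return np.where((idx & mask) == mask, -1, 1)


def commute_x_through(diag, n, s):
    """Diagonal of X_s G X_s G^dagger for a diagonal gate G with entries +-1."""
    idx = np.arange(2 ** n)
    flip = sum(1 << (n - 1 - q) for q in s)
    # Conjugating by X_s permutes the diagonal entries: b -> b XOR s.
    return diag[idx ^ flip] * diag


def level(diag, n):
    """Largest number of qubits in any C^{j-1}Z factor (0 for +-I)."""
    f = ((1 - diag) // 2).astype(np.uint8)   # phase bits: diag = (-1)^f
    for i in range(n):                        # Moebius transform over GF(2)
        for b in range(2 ** n):
            if b & (1 << i):
                f[b] ^= f[b ^ (1 << i)]
    return max((bin(b).count("1") for b in range(2 ** n) if f[b]), default=0)


ccz = cz_diagonal(3, [0, 1, 2])               # an element of Z^(3)
residual = commute_x_through(ccz, 3, [0])     # X on the first qubit
assert np.array_equal(residual, cz_diagonal(3, [1, 2]))  # a CZ, i.e. in Z^(2)
```

Commuting an $X$ on the first qubit through a CCZ gate leaves a CZ gate on the remaining two qubits, which is the familiar special case of the proposition.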
We can proceed recursively. We factored the one byproduct operator to obtain $n$ possible factors of the correction operator, each of which we applied using selective gate teleportation. Implementing these corrections required a total of $4n^2$ additional ancilla qubits and resulted in the addition of Pauli byproduct operators at $n + 1$ locations. We can commute these byproduct operators through to the left, starting at the right-hand side of our expression. Each time we commute an $n$-qubit operator of the form $\prod_{i=1}^{n} X_i^{x_i}$ through a diagonal gate, we do so by factorizing it, and we pick up $n$ possible correction terms one level lower in the $\mathcal{Z}^{(k)}$ hierarchy. The number of corrections that we must apply, and the number of additional ancilla qubits that we require, therefore increases by a factor of $n$ each time we descend the hierarchy by a level. For example, we can use $O(n^3)$ ancilla qubits to implement each of the $n^2$ possible second-order corrections using selective teleportation, leaving only corrections that are three or more levels down the hierarchy. More generally, to implement $U \in \mathcal{Z}^{(k)}$ up to a correction $R \in \mathcal{Z}^{(k-a)}$ (and some Pauli $X$ operators), we require a resource state on $O(n^a)$ qubits.
If we were to descend the hierarchy all the way to the point where the only remaining corrections were Pauli corrections ($a = k - 1$), we would obtain only a modest compression in circuit complexity (compared with directly applying $U$). This is because, although the circuit depth would be merely $O(k)$, we would require $O(n^{k-1})$ qubits. However, consider what happens when we stop at the level $a = \lfloor k/2 \rfloor$. To simplify the presentation, we assume that $k$ is even. We can use a resource state on $O(n^{k/2})$ qubits to implement $U$ up to a correction $R \in \mathcal{Z}^{(k/2)}$ (and some additional Pauli terms) in $k/2$ rounds of measurement.
We can implement the remaining correction directly with a gate complexity of $\tilde{O}(k n^{k/2})$ in depth $\tilde{O}(k)$ with no additional space overhead using the constant-depth fanout and unfanout circuits of Ref. 42. Therefore, the overall gate complexity of implementing an arbitrary $U \in \mathcal{Z}^{(k)}$ in the precomputation model (i.e., neglecting the cost of preparing the resource state) is $\tilde{O}(k n^{k/2})$.
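The choice of stopping level can be motivated by a back-of-the-envelope balance between the size of the resource state, which grows as $n^a$ qubits, and the cost of the leftover correction, which grows as $n^{k-a}$ gates. The sketch below drops constants and logarithmic factors and uses the larger of these two quantities as a stand-in for the gate complexity in the precomputation model. The function name and this simplified cost proxy are assumptions made for illustration, not part of the analysis above.

```python
def best_stopping_level(n, k):
    """Return the a in [1, k-1] minimizing max(n**a, n**(k - a)), plus all proxies."""
    costs = {a: max(n ** a, n ** (k - a)) for a in range(1, k)}
    return min(costs, key=costs.get), costs


k = 6
a_star, costs = best_stopping_level(n=10, k=k)
for a in sorted(costs):
    print(f"a = {a}: resource qubits ~ n^{a}, residual gates ~ n^{k - a}, "
          f"proxy = {costs[a]:.0e}")
print("balanced choice:", a_star)
```

Under this proxy the two contributions cross at the midpoint of the hierarchy, which is consistent with stopping at half of $k$ when $k$ is even.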
Recall that a fanout operation takes an $n$-qubit state $|\psi\rangle = \sum_i \alpha_i |i\rangle$ and performs the map $\sum_i \alpha_i |i\rangle \mapsto \sum_i \alpha_i |i\rangle^{\otimes m}$ for some integer $m > 1$, where the states in $\{|i\rangle\}$ are the computational basis states. Unfanout reverses this mapping. Ref. 42 explains how both of these operations can be implemented using constant-depth quantum circuits and classical feedback. We can parallelize the implementation of $m$ diagonal unitaries by performing a fanout, applying each unitary to a separate fanned-out copy of $|\psi\rangle$, and then performing an unfanout. We can take advantage of this capability by partitioning the individual terms that make up an arbitrary $R \in \mathcal{Z}^{(k/2)}$ into $O(k)$ sets of gates, where each set contains only terms that act on disjoint qubits. By setting $m = n^{k/2-1}$, we can apply the terms from each of the sets in parallel. We can therefore apply all of the terms with the desired gate complexity and depth. Because the fanout and unfanout operations are constant depth, they do not increase the asymptotic scaling of the gate complexity. The remaining Pauli correction can then be applied to complete the implementation of $U$.
Now let us consider the classical computational cost associated with applying $U$ this way in the precomputation model. Applying $U$ up to a correction at level $\mathcal{Z}^{(k-a)}$ is trivial for $a = 1$. For $a = 2$, we apply some subset of the $n$ possible corrections that corresponds directly to the bits we obtained from the first set of measurements. For $a = 3$, we need to repeatedly XOR one $n$-bit string into another $O(n)$ times in order to determine the measurement settings, using $O(n^2)$ classical operations. This growth continues, and we find that we need to perform $O(n^{k/2-1})$ classical operations to determine which corrections to perform at the level that leaves us with a final correction in $\mathcal{Z}^{(k/2)}$. Actually computing the final correction $R \in \mathcal{Z}^{(k/2)}$ requires determining the $O(n^{k/2})$ elements of $\mathcal{Z}^{(k/2)}$ that arise from commuting the byproduct operators through and then taking their product, which ultimately takes $O(n^k)$ XOR operations. The classical postprocessing involved in the fanout operation is negligible compared to these costs, so the overall classical complexity is $O(n^k)$.
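The fanout-based parallelization of diagonal unitaries described above can be illustrated with a small dense simulation. This is a sketch only: the names `fanout`, `unfanout`, and `apply_in_parallel` are hypothetical, the state vector is stored explicitly, and the constant-depth measurement-and-feedback circuits of Ref. 42 are not modelled (the maps are applied directly to amplitudes).

```python
import numpy as np


def fanout(psi, m):
    """Map sum_i a_i |i> to sum_i a_i |i>^{(x) m} as a dense vector."""
    d = len(psi)
    out = np.zeros(d ** m, dtype=complex)
    for i, amp in enumerate(psi):
        out[sum(i * d ** j for j in range(m))] = amp
    return out


def unfanout(state, d, m):
    """Inverse of `fanout` on states supported on the basis states |i>^{(x) m}."""
    return np.array([state[sum(i * d ** j for j in range(m))] for i in range(d)])


def apply_in_parallel(psi, diagonals):
    """Apply each diagonal unitary to its own fanned-out copy, then unfanout."""
    d, m = len(psi), len(diagonals)
    joint = diagonals[0]
    for diag in diagonals[1:]:
        joint = np.kron(joint, diag)   # diagonal of D_1 (x) D_2 (x) ... (x) D_m
    return unfanout(joint * fanout(psi, m), d, m)


rng = np.random.default_rng(0)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)
diagonals = [rng.choice([-1, 1], size=4) for _ in range(3)]
result = apply_in_parallel(psi, diagonals)
# The parallel application agrees with applying the three diagonals in sequence.
assert np.allclose(result, diagonals[0] * diagonals[1] * diagonals[2] * psi)
```

Because every copy carries the same computational basis label, the phases from the separate copies multiply, which is why the trick is specific to diagonal unitaries.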
We can also ask about the quantum and classical complexities of performing the precomputation step. Neglecting the operations involved in setting up the teleportation and selective teleportation gadgets themselves, since they contribute negligibly to the overall complexity, we can just consider the gate complexities of performing one operation from $\mathcal{Z}^{(k)}$, $n$ operations from $\mathcal{Z}^{(k-1)}$, and so on, down to $n^{k/2-1}$ operations at the level $\mathcal{Z}^{(k/2+1)}$. The only clear way to apply these operations is to work serially (since the use of selective teleportation may prevent us from using fanout and unfanout operations to parallelize). This means that, although we only require $\tilde{O}(k n^k)$ non-identity gates, our definition of gate complexity (which attempts to account for storage space by counting the single-qubit identity operation as a gate) implies that the overall gate complexity of the precomputation step is $\tilde{O}(k n^{3k/2})$. This may not be a fundamental requirement, and it is also true that most of the $O(n^{k/2})$ qubits are not required at all until the very last portions of the precomputation step, so they could be used for other things in the meantime. The classical complexity of the precomputation step arises from computing the various correction operators and is not substantially larger than would be expected from the need to generate some kind of classical description of the circuits involved anyway.
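The scalings quoted for the precomputation step can be traced through a rough count. This sketch assumes that an element of the level-$j$ family costs $\tilde{O}(n^{j})$ gates, as stated earlier in this section, and that every one of the $O(n^{k/2})$ qubits is charged an identity gate at each serial time step:

```latex
\begin{align}
\text{non-identity gates} &= \sum_{a=0}^{k/2-1} n^{a} \cdot \tilde{O}\!\left(n^{k-a}\right)
  = \tilde{O}\!\left(k\, n^{k}\right), \\
\text{total gates (with identities)} &= \tilde{O}\!\left(k\, n^{k}\right) \times O\!\left(n^{k/2}\right)
  = \tilde{O}\!\left(k\, n^{3k/2}\right).
\end{align}
```

Each of the $k/2$ levels contributes the same $\tilde{O}(n^k)$ to the first line, since the number of operations grows by a factor of $n$ exactly as the cost per operation shrinks by one.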
In many ways, the techniques of this section are a generalization of the simpler scheme for applying Clifford operators using gate teleportation that we presented in Section 4.2. In order to make a comparison easy, we summarize the various scalings of these two examples of quantum precomputation in Table 1.
Table 1: A summary of the scalings for applying arbitrary Clifford operators using gate teleportation (Section 4.2) and arbitrary elements of $\mathcal{Z}^{(k)}$ (products of $Z$ and controlled-$Z$ operators with up to $k-1$ controls) using selective gate teleportation (Section 5). For simplicity we assume that $k$ is even. For the gate complexity, we count the number of one- and two-qubit gates from the Clifford + $T$ gate set (counting single-qubit identity operations as gates). The quoted gate complexity in the precomputation model includes only those quantum operations required to consume the resource state $|\Gamma(U)\rangle$. The (quantum) cost of preparing the resource state is provided separately, as is the number of classical operations required to consume the resource state to apply $U$.

Discussion
In this paper, we introduced a new cost model for quantum computation that allows for "quantum precomputation." This model is motivated by practical scenarios where it is highly valuable to perform a time-sensitive computation as quickly as possible, and where some portion of the problem's input is naturally known ahead of time. In the precomputation cost model, we allow a reasonable (polynomial in the input size) amount of effort to be spent "for free" preparing a resource state before the input is fully specified. The cost of an algorithm in the precomputation cost model is determined solely by the resources required to implement the algorithm given access to the resource state. We presented three realizations of quantum precomputation that require asymptotically fewer resources in the precomputation cost model than in a standard one. The first realization uses density matrix exponentiation to implement reflections about a state by consuming copies of that state. We explained how, in some cases, this type of quantum precomputation can offer an exponential advantage (in the sense that the complexity required to execute an algorithm by consuming the resource state can be exponentially smaller than the complexity required to execute an algorithm directly). As a particular example, we considered the task of accelerating quantum algorithms for linear systems in cases where it is natural to prepare copies of the state $|b\rangle$ ahead of time.
In the future, we hope to find practical examples where this type of precomputation is useful, either for solving particular linear systems of equations, or for executing some other quantum algorithm whose cost might be dominated by the cost of implementing low-rank reflections. In practice, the advantage need not be exponential to be useful. It would be especially interesting if we could find situations where the ability to accelerate an algorithm using precomputation was the deciding factor that made it worth solving a particular problem using quantum rather than classical computation.
As a second example, we pointed out that standard techniques for implementing Clifford unitaries using gate teleportation constitute a simple illustration of an asymptotic advantage in the precomputation cost model. These techniques allow for unitaries with a gate complexity of $\Theta(n^2)$ to be implemented in $O(1)$ (quantum gate) depth by consuming a state on $2n$ qubits. This example highlights the importance of choosing an appropriate notion of cost when defining a model of quantum precomputation. Under a definition of cost that treated Clifford operations as free, there could be no value in using precomputation to apply a Clifford unitary more efficiently. However, as schemes for magic state distillation continue to improve, it is becoming less clear if quantifying the cost of a fault-tolerant quantum algorithm solely in terms of the number of non-Clifford gates is an accurate approximation [34]. This motivated our particular definition of a precomputation cost model (that counts gate complexity, including Clifford gates), but it is possible that a metric of cost even closer to the hardware might be more appropriate. For instance, one could imagine squeezing some additional benefit out of a scheme for quantum precomputation by preparing the resource states using shorter-distance error correcting codes (and therefore, fewer physical qubits and less actual time) in conjunction with error detection and postselection.
Even within the particular cost model we have defined, there are many degrees of freedom to explore in defining precomputation protocols. For example, the technique we used to implement an arbitrary Clifford unitary could be modified to apply an $n$-qubit circuit $U$ that interleaved Clifford operations with a small number ($t$) of $T$ gates. Such a modified scheme could use a combination of gate teleportation and selective teleportation to apply the Clifford gates as normal, while selectively implementing the possible corrections after each $T$ gate. This would require an $O(n + t)$-qubit resource state that would be consumed in $O(t)$ rounds of measurement to apply $U$ up to a final Pauli correction.
The most novel example of precomputation that we proposed in this paper uses selective teleportation to achieve a quadratic reduction in the complexity of implementing a family of diagonal unitaries from the Clifford hierarchy (when comparing the cost in the precomputation model with the standard cost). Our scheme is likely generalizable to all diagonal unitaries that are members of the Clifford hierarchy, but this is still a relatively restricted class of unitaries. This naturally raises the question: are there ways to compile existing algorithms such that they would make heavy use of the kinds of diagonal unitaries that we have shown can be accelerated by precomputation? Diagonal unitaries appear in a variety of places, oftentimes as a natural way of encoding the output of a classical function into a phase. For example, the Forrelation problem [2], IQP circuits [45], QAOA [17], and Grover's algorithm itself [25] can all be formulated to involve heavy use of diagonal unitaries. In the future, we hope that extensions of our precomputation protocols can be used to accelerate some such algorithms for interesting and time-sensitive applications.
More broadly, does quantum precomputation have anything to teach us about the nature or power of quantum computation? The power of advice (computation supplemented by a resource state) has been studied both in classical and quantum contexts [1,30], but, as we discuss in Section 2, the precomputation model we introduced differs from these prior works in that we require that the extra resource state be efficient to prepare. In this finer-grained setting, what can we say about the difference between quantum and classical computation? Are there classical analogues of the kinds of quantum precomputation that we have proposed, or are there some types of precomputation that are uniquely quantum mechanical? Conversely, classical precomputation is widely applicable in situations where the precomputed information is used multiple times. One could interpret recent shadow tomography proposals as examples of quantum precomputation that allow for information reuse [3,11], and it would be interesting to see if techniques from that domain can be adapted to enable such reuse in the context of other types of quantum precomputation.
Finally, are there other, perhaps more general, classes of quantum computation that we can accelerate in the precomputation cost model? Many proposed applications of quantum machine learning techniques to classical data rely on quantum random access memory (QRAM) to obtain a computational advantage [7]. Are there real-world applications where it would be natural to circumvent the need for QRAM by encoding some classical data into quantum states ahead of time?

A Precomputation and quantum advice
The purpose of this appendix is to relate our proposed model of quantum precomputation to the notion of quantum advice and the complexity class BQP/qpoly. We do not aim to provide a self-contained introduction to quantum complexity theory, but we will briefly mention some basic definitions that will aid in making the comparison. The most well-studied computational problems in complexity theory are decision problems, questions that have a yes or no answer. We can formalize a decision problem as a language, a set of bitstrings that encode the inputs to the problem for which the answer is yes. Informally, a decision problem is in the complexity class BQP if it can be solved in polynomial time on a quantum computer. Formally, we have the following definition:
Definition A.1. Let $\{0,1\}^*$ denote the set of all binary strings. A language $L \subseteq \{0,1\}^*$ is in BQP if there exists a uniform family of polynomial-size quantum circuits, $\{C_n\}$, such that the following conditions hold for all $x \in \{0,1\}^n$:
1. If $x \in L$, then the probability that the first qubit is measured to be $|1\rangle$ after $C_n$ is applied to the input $|x\rangle \otimes |0 \cdots 0\rangle$ is at least 2/3.
2. If $x \notin L$, then the probability that the first qubit is measured to be $|1\rangle$ after $C_n$ is applied to the input $|x\rangle \otimes |0 \cdots 0\rangle$ is at most 1/3.
Note that the circuit $C_n$ depends only on $n$, the size of the input. The condition that the family of circuits is uniform essentially requires that a polynomial-time classical computer can generate the description of the circuit that the quantum computer will execute.
Like our model of quantum precomputation, the complexity class BQP/qpoly is intended to capture the power of a polynomial-time quantum machine augmented with an additional resource state. Formally, the class can be defined as follows:
Definition A.2. A language $L \subseteq \{0,1\}^*$ is in BQP/qpoly if there exists a uniform family of polynomial-size quantum circuits, $\{C_n\}$, and a family of polynomial-size quantum states, $\{|\psi_n\rangle\}$, such that the following conditions hold for all $x \in \{0,1\}^n$:
1. If $x \in L$, then the probability that the first qubit is measured to be $|1\rangle$ after $C_n$ is applied to the input $|x\rangle \otimes |0 \cdots 0\rangle \otimes |\psi_n\rangle$ is at least 2/3.
2. If $x \notin L$, then the probability that the first qubit is measured to be $|1\rangle$ after $C_n$ is applied to the input $|x\rangle \otimes |0 \cdots 0\rangle \otimes |\psi_n\rangle$ is at most 1/3.
It is important to note that the additional quantum resources afforded to the polynomially powerful quantum machine can be arbitrarily complex states on $\mathrm{poly}(n)$ qubits. However, these states are only allowed to depend on the size of the input.
There are therefore three key differences between the model of computation considered in BQP/qpoly and the model we consider when we allow for "free" polynomial-time quantum precomputation. First of all, we have defined quantum precomputation to allow inputs and outputs that are combinations of classical and quantum information. BQP/qpoly is concerned with machines that take a classical bitstring as an input and return (with some probability of failure) a single classical bit as output. Secondly, in the precomputation cost model, we require that the quantum resource states are preparable in polynomial time, whereas the quantum advice states allowed in BQP/qpoly can be arbitrary quantum states. Finally, in the precomputation model, we partition the input into two subsets and allow for the resource state to depend on one subset, but not the other. The complexity class BQP/qpoly only allows for the resource states to depend on the size of the input, but none of its other features.

B Algorithmic Primitives

B.1 Density matrix exponentiation
Density matrix exponentiation is a technique that allows one to consume copies of a mixed quantum state $\rho$ in order to approximately implement the unitary $e^{-it\rho}$ [36]. In Ref. 36, it was shown that $e^{-it\rho}$ can be implemented to within error $\epsilon$ using $O(t^2/\epsilon)$ copies of $\rho$. This scaling is optimal with respect to $\epsilon$, and optimal with respect to $t$ for general $\rho$ (but not necessarily for pure states) [31]. Furthermore, the protocol is relatively simple to implement. In order to act on an input state $\sigma$, one repeatedly consumes a single copy of $\rho$ to apply an approximation to $e^{-it\rho/m}$, where $m$ is the number of steps. This is done by performing a partial swap operator (with a small angle) on the joint system $\rho \otimes \sigma$ and discarding the first register. The entire evolution can be performed using $O(n t^2/\epsilon)$ one- and two-qubit gates [31].
Density matrix exponentiation is a basic algorithmic primitive that has been applied in a variety of ways [22,36,38]. In the original paper, Ref. 36, it was used as a building block in the quantum principal component analysis algorithm. Quantum principal component analysis allows one to (approximately) sample the eigenvectors of $\rho$ corresponding to large eigenvalues exponentially more quickly than any classical algorithm that has access only to single copies of $\rho$ [15,27]. In Ref. 38, density matrix exponentiation was used to efficiently emulate the action of a unitary $U$ on a small subspace by consuming samples of the form $|b\rangle \otimes U|b\rangle$, where the input states $|b\rangle$ span the subspace. This type of application closely resembles a sort of quantum lookup table, and shares some features with our proposed use of density matrix exponentiation for precomputation, although the aim of that work is different.
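The partial-swap step described above can be simulated directly for small systems. The following NumPy sketch is illustrative rather than a reproduction of Ref. 36: the function names are hypothetical, and it uses the identity $e^{-i\delta S} = \cos(\delta) I - i \sin(\delta) S$, which holds because the swap operator $S$ squares to the identity. Each step consumes one fresh copy of $\rho$ and evolves $\sigma$ by approximately $e^{-i\delta\rho}$.

```python
import numpy as np


def swap_operator(d):
    """SWAP on two d-dimensional registers."""
    S = np.zeros((d * d, d * d))
    for i in range(d):
        for j in range(d):
            S[i * d + j, j * d + i] = 1.0
    return S


def dme_step(rho, sigma, delta):
    """Apply exp(-i*delta*SWAP) to rho (x) sigma and discard the rho register."""
    d = rho.shape[0]
    S = swap_operator(d)
    V = np.cos(delta) * np.eye(d * d) - 1j * np.sin(delta) * S  # since S @ S = I
    joint = V @ np.kron(rho, sigma) @ V.conj().T
    # Partial trace over the first register.
    return np.einsum("ijik->jk", joint.reshape(d, d, d, d))


def density_matrix_exponentiation(rho, sigma, t, m):
    """Approximate exp(-i t rho) sigma exp(+i t rho) using m copies of rho."""
    for _ in range(m):
        sigma = dme_step(rho, sigma, t / m)
    return sigma


rho = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)    # |+><+|
sigma = np.array([[1, 0], [0, 0]], dtype=complex)          # |0><0|
t = 1.0
exact_U = np.eye(2) + (np.exp(-1j * t) - 1) * rho          # exp(-i t rho) for a projector
exact = exact_U @ sigma @ exact_U.conj().T
for m in (10, 100, 1000):
    approx = density_matrix_exponentiation(rho, sigma, t, m)
    print(m, np.linalg.norm(approx - exact))
```

The printed errors should shrink as the number of copies grows, in line with the $t^2/\epsilon$ copy count.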

B.2 Gate teleportation and the Clifford hierarchy
Our work makes heavy use of the concept of gate teleportation [23]. We illustrated the single-qubit version of gate teleportation in Figure 1 in the main text, but we present a more detailed review here. Given a unitary $U$, gate teleportation allows us to prepare a resource state $|\Gamma(U)\rangle$ that we can later consume to apply $U P$ to an arbitrary state $|\psi\rangle$, where the "byproduct operator" $P$ is an element of the Pauli group randomly determined by the measurement outcomes of the teleportation protocol. The state obtained when using gate teleportation to apply $U$ (actually $U P$) to $|\psi\rangle$ can be written as $U P U^\dagger U |\psi\rangle$. Multiplying by $U P^\dagger U^\dagger$ therefore yields the desired state $U |\psi\rangle$.
Gate teleportation can be especially useful when $U P^\dagger U^\dagger$ is simpler to apply than $U$ itself. This is the case in the canonical application of gate teleportation, implementing $T$ gates in a quantum error correcting code that supports fault-tolerant Clifford gates [8]. The problem of applying $T$ gates without error is reduced to the problem of preparing high-fidelity "magic states," because, for all possible byproduct operators $P$, $T P T^\dagger$ is a Clifford gate despite the fact that $T$ is not. Just as state teleportation trivially generalizes to multiple qubits, gate teleportation can likewise be straightforwardly applied to multiple qubits. In the $n$-qubit case, the byproduct operator is an $n$-qubit Pauli operator (up to a phase) that depends on the $2n$-bit measurement outcome obtained from $n$ simultaneous Bell-basis measurements.
The notion that gate teleportation is most useful when $U P^\dagger U^\dagger$ is easier to implement than $U$ itself led Gottesman and Chuang to define an infinite hierarchy of unitaries now known as the Clifford hierarchy [23]. The first level of the Clifford hierarchy, which we denote by $\mathcal{C}^{(1)}$, is defined to be the Pauli group. The $k$th level of the Clifford hierarchy is defined inductively, $\mathcal{C}^{(k)} = \{ U : U P U^\dagger \in \mathcal{C}^{(k-1)} \ \forall P \in \mathcal{C}^{(1)} \}$. The second level of the hierarchy is therefore the usual Clifford group. The higher levels of the Clifford hierarchy are harder to characterize in familiar terms, but we can give some examples. For instance, $T$ gates, Toffoli gates, and CCZ gates belong to $\mathcal{C}^{(3)}$. More generally, multi-controlled $C^{k-1}\mathrm{NOT}$ and $C^{k-1}Z$ gates are in $\mathcal{C}^{(k)}$, as are the single-qubit rotations $Z_k = \mathrm{diag}(1, e^{2\pi i/2^k})$. It is an open problem to fully characterize the higher levels of the hierarchy, although the diagonal elements are well-understood algebraically in terms of polynomials and roots of unity [16].
Figure 3: Selective destination teleportation and selective source teleportation. Both protocols allow for a choice that is made by selecting between two measurement settings (indicated by the blue and red shaded areas of the diagrams). Selective destination teleportation teleports the state of one qubit to a choice of two different qubits. Selective source teleportation allows one to choose which of two qubits will have its state teleported to a fixed target. The possible states of the output qubit(s) are color-coded to match the measurement settings that select for them. In both cases, a byproduct operator $P$ drawn from the set $\{I, X, Z, XZ\}$ is randomly applied based on the measurement outcomes.
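The inductive definition of the Clifford hierarchy translates directly into a recursive membership test for single-qubit unitaries. The sketch below is a brute-force check for illustration only: the function names are hypothetical, global phases are ignored, and the phase convention used for the rotation `z_rotation(k)`, with $Z_1 = Z$, $Z_2 = S$, and $Z_3 = T$, is an assumption about the intended definition.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = (I2, X, Y, Z)


def is_pauli(U):
    """True if the 2x2 unitary U equals a Pauli matrix up to a global phase."""
    return any(np.isclose(abs(np.trace(P.conj().T @ U)), 2.0) for P in PAULIS)


def in_level(U, k):
    """Membership of a single-qubit unitary in the k-th level C^(k)."""
    if k == 1:
        return is_pauli(U)
    # U is in C^(k) if U P U^dagger is in C^(k-1) for every Pauli P.
    return all(in_level(U @ P @ U.conj().T, k - 1) for P in (X, Y, Z))


def hierarchy_level(U, max_level=6):
    """Smallest k <= max_level with U in C^(k), or None if there is none."""
    for k in range(1, max_level + 1):
        if in_level(U, k):
            return k
    return None


def z_rotation(k):
    """Assumed convention: Z_k = diag(1, exp(2*pi*i / 2**k))."""
    return np.diag([1, np.exp(2j * np.pi / 2 ** k)])


S, T = z_rotation(2), z_rotation(3)
levels = {name: hierarchy_level(U) for name, U in [("X", X), ("S", S), ("T", T)]}
```

The recursion visits $3^{k-1}$ conjugates, so it is only practical for small $k$, but it is enough to confirm the examples quoted in the text for one qubit.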

B.3 Review of selective teleportation
When gate teleportation is used to implement a unitary U that is not in the Clifford group, the resulting correction operator U P † U † is not, in general, a Pauli operator.For example, consider the use of gate teleportation to implement a T gate.With probability 1 2 , correcting for the byproduct operator requires the subsequent implementation of a phase gate (S ∈ C (2) ).Naively, this means that after applying a T gate using gate teleportation it is necessary to determine and apply the correction before performing additional Clifford gates.However, in Ref. 18, Fowler showed how a generalization of quantum teleportation can be used to selectively implement this phase gate correction using a small number of ancilla qubits measured in a classically controlled choice of the X or Z basis.
Fowler's selective teleportation relies on two related constructions, selective source teleportation and selective destination teleportation. Selective destination teleportation allows one to teleport a single qubit's state to either one of two destination qubits. Selective source teleportation allows for teleportation from a choice of two different source qubits to a fixed destination qubit. Both types of selective teleportation are controlled by making an appropriate choice of measurement basis, and both introduce a Pauli byproduct operator $P \in \{I, X, Z, XZ\}$ that can be inferred from the (uniformly random) measurement outcomes. We give circuit diagrams for the single-qubit versions of these primitives in Figure 3. The multi-qubit versions are straightforward generalizations.
Together, selective source and destination teleportation can be used to implement a primitive that we refer to as selective gate teleportation. We illustrated the single-qubit version of selective gate teleportation in Figure 2 in the main text. Selective gate teleportation allows us to apply our choice of unitaries $U_1$ or $U_2$ to an unknown $n$-qubit state $|\psi\rangle$ by choosing how to measure some set of $4n$ ancilla qubits. As a special case, we can use selective gate teleportation to defer the choice of whether or not to apply a unitary $U$ by taking $U_1 = U$ and $U_2 = I$. Selective gate teleportation randomly introduces the byproduct operators $P^{(1)}$ and $P^{(2)}$ (both $n$-qubit Pauli operators) after and before, respectively, the location at which the chosen unitary is applied. For example, let $s \in \{0, 1\}$ denote the classical bit that determines whether or not to perform the teleportation that applies $U$. Rather than obtaining the desired $U^s |\psi\rangle$, we instead obtain the state $|\phi\rangle = P^{(1)} U^s P^{(2)} |\psi\rangle$. To obtain $U^s |\psi\rangle$, we would need to subsequently apply the correction operator $U^s P^{(2)\dagger} (U^s)^\dagger P^{(1)\dagger}$.
In the case that Fowler originally considered in Ref. 18, one first uses gate teleportation to implement a $T$ gate (up to a possible $S$ gate correction) and then selectively applies the $S$ gate. Because $S$ is a Clifford gate, $S^s P^{(2)\dagger} (S^s)^\dagger P^{(1)\dagger}$ is a Pauli operator regardless of the choice of $s$ or the measurement outcomes. As a consequence, the measurements for both teleportation steps can be deferred or performed while applying additional Clifford gates, and the necessary Pauli correction can be propagated through the resulting circuit afterwards. This type of optimization has been used to create efficient surface code layouts for a variety of algorithmic primitives [18,21,35].
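The role of the correction operator can be made concrete for a single qubit. The sketch below (our own illustration, with hypothetical helper names) enumerates every choice of $s$ and of Pauli byproducts, confirms that the correction $U^s P^{(2)\dagger} (U^s)^\dagger P^{(1)\dagger}$ maps $P^{(1)} U^s P^{(2)}$ back to $U^s$, and reports whether the correction is always a Pauli operator up to a global phase.

```python
import cmath


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def close(a, b, tol=1e-9):
    """Entrywise comparison of two 2x2 matrices."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
PAULIS = [I2, X, Z, matmul(X, Z)]  # the byproduct set {I, X, Z, XZ}
S = [[1, 0], [0, 1j]]
T = [[1, 0], [0, cmath.exp(1j * cmath.pi / 4)]]


def is_pauli(u, tol=1e-9):
    """True if the unitary u is proportional to an element of PAULIS."""
    for p in PAULIS:
        m = matmul(dagger(p), u)
        if abs(abs(m[0][0] + m[1][1]) - 2) < tol:
            return True
    return False


def check(u):
    """Return (correction always restores U^s, correction is always Pauli)."""
    restores, pauli = True, True
    for s in (0, 1):
        us = u if s else I2  # U^s
        for p1 in PAULIS:
            for p2 in PAULIS:
                noisy = matmul(matmul(p1, us), p2)
                corr = matmul(matmul(matmul(us, dagger(p2)), dagger(us)),
                              dagger(p1))
                restores = restores and close(matmul(corr, noisy), us)
                pauli = pauli and is_pauli(corr)
    return restores, pauli


assert check(S) == (True, True)   # Clifford U: corrections stay Pauli
assert check(T) == (True, False)  # non-Clifford U: some corrections are not
```

The contrast between `check(S)` and `check(T)` mirrors the text: only when the selectively applied gate is Clifford can all corrections be tracked as Pauli frame updates.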

C The $Z^{(k)}$ hierarchy
In Section 5, we defined $Z^{(k)}$ to be the set of $n$-qubit unitaries generated by arbitrary products of controlled $Z$ gates with up to $k - 1$ control qubits (including the case with 0 controls, $Z$ gates themselves) and $\pm I$. For convenience, we define $Z^{(0)} := \{\pm I\}$. Let $D^{(k)}$ denote the elements of the $k$-th level of the Clifford hierarchy that are also diagonal. As sets, we have that $Z^{(k)} \subseteq D^{(k)} \subset C^{(k)}$. While $C^{(k)}$ does not form a group for $k > 2$, Ref. 16 showed that $D^{(k)}$ is a group for all $k$.
The set $Z^{(k)}$ can also be shown to form a group under composition. By definition, $Z^{(k)}$ is closed under composition (which is associative) and includes the identity element. Because diagonal unitaries commute and $C^k Z$ gates are self-inverse for all $k$, we can see that each element of $Z^{(k)}$ is its own inverse. Therefore, $Z^{(k)}$ is a group.
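One convenient way to experiment with these group properties is to encode each element of $Z^{(k)}$ by the set of qubit subsets on which its generators act. This encoding is our own illustrative choice, not notation from the main text: a $C^{j}Z$ gate on qubits $q$ is the monomial $q$, $-I$ is the empty monomial, and a product of generators is the symmetric difference of monomial sets, since repeated generators cancel.

```python
from itertools import combinations

IDENTITY = frozenset()  # no generators at all


def gate(*qubits):
    """A Z or multi-controlled Z gate on the given qubits; gate() is -I."""
    return frozenset({frozenset(qubits)})


def compose(g, h):
    """Product of two gates: generators appearing twice cancel."""
    return g ^ h


def level(g):
    """Smallest k such that g lies in Z^(k) in this encoding."""
    return max((len(m) for m in g), default=0)


def sign(g, x):
    """Diagonal entry (+1 or -1) of g on the basis state whose 1s are at x."""
    return (-1) ** sum(m <= x for m in g)


def composition_is_consistent(g, h, n):
    """Brute-force check that compose() matches multiplying diagonal entries."""
    for r in range(n + 1):
        for ones in combinations(range(n), r):
            x = frozenset(ones)
            if sign(compose(g, h), x) != sign(g, x) * sign(h, x):
                return False
    return True


g = compose(gate(0, 1, 2), gate(0, 1))  # CCZ times CZ
h = compose(gate(0, 1), gate(3))        # CZ times Z

assert compose(g, g) == IDENTITY                      # self-inverse
assert compose(g, h) == compose(h, g)                 # commutative
assert level(compose(g, h)) <= max(level(g), level(h))  # closure in Z^(k)
assert composition_is_consistent(g, h, 4)
```

In this encoding, membership in $Z^{(k)}$ is simply the statement that no monomial involves more than $k$ qubits, which makes closure and self-inverseness immediate.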
The following proposition will be useful:
Proposition 1. Consider a gate $G \in Z^{(k)}$ and a product of single-qubit Pauli $X$ operators that we denote by $X_s$ (where $s \subseteq [n]$ indicates the indices of the qubits where $X_s$ acts non-trivially). Define $G'$ in the following way,
$$G' = X_s G X_s G^\dagger.$$
Then $G' \in Z^{(k-1)}$ if $k > 1$ and $G' = \pm I$ if $k \in \{0, 1\}$. As a corollary, we also have that
$$G X_s = X_s G' G.$$
Proof. We will prove this proposition by induction. The $k = 0$ case is clear by inspection and the $k = 1$ case follows from the fact that Pauli operators either commute or anticommute. Now let us assume that the proposition is true for all $j < k$ and prove that it must also hold for $j = k$. Consider an arbitrary $G \in Z^{(k)}$ and $s \subseteq [n]$.
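First of all, we can simplify the proof by considering a single Pauli $X$ operator acting on an arbitrary qubit $i$ rather than the product $X_s$. This is because, writing $X_s = X_{i_1} \cdots X_{i_r}$, we can expand
$$X_s G X_s G^\dagger = X_s \, (G X_{i_r} G^\dagger)(G X_{i_{r-1}} G^\dagger) \cdots (G X_{i_1} G^\dagger)$$
using repeated resolutions of the identity, $I = G^\dagger G$. If we can show that $X_i G X_i G^\dagger \in Z^{(k-1)}$ for all $i$, then it would follow that
$$X_s G X_s G^\dagger = X_s \, X_{i_r} G_{i_r} X_{i_{r-1}} G_{i_{r-1}} \cdots X_{i_1} G_{i_1}$$
for some set of $G_{i_1}, \ldots, G_{i_r} \in Z^{(k-1)}$. We could then use the inductive hypothesis to commute the various $X$ operators through to the left, incurring additional terms from the $Z^{(j)}$ hierarchy with $j < k$. These are all elements of $Z^{(k-1)}$, which is a group, and therefore their product is also in $Z^{(k-1)}$. The $X$ terms would cancel, completing the proof.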
With that simplification established, the task that remains is to show that
$$X_i G X_i G^\dagger \in Z^{(k-1)}$$
for an arbitrary qubit $i$. We can further simplify by expanding $G$ as a product of $m$ unitaries that are either $\pm I$, single-qubit $Z$ gates, or $C^j Z$ gates (for $j < k$),
$$G = \prod_{\ell=1}^{m} G_\ell. \tag{20}$$
We will proceed by showing that $G_\ell X_i = X_i G'_\ell G_\ell$ for some $G'_\ell \in Z^{(k-1)}$. If this statement holds, then we can commute $X_i$ to the left through each of the $G_\ell$ terms that make up $G$ in Equation (20) and cancel it, picking up a collection of additional $G'_\ell$ terms from $Z^{(k-1)}$. Because diagonal unitaries commute, we could also commute these additional terms to the left through the $G_\ell$ terms, allowing $G$ and $G^\dagger$ to cancel and leaving us with a product of $G'_\ell$ terms. Because $Z^{(k-1)}$ is a group, this product of $G'_\ell$ terms would be in $Z^{(k-1)}$ and we would therefore be done. Now all that remains is to show that
$$G_\ell X_i = X_i G'_\ell G_\ell \tag{22}$$
for some $G'_\ell \in Z^{(k-1)}$. First consider the case where $G_\ell$ and $X_i$ have support on disjoint qubits. Then we trivially have $G_\ell X_i = X_i G_\ell$, which shows that the equality in Equation (22) holds if we take $G'_\ell = I$. Now we address the case where $X_i$ acts on one of the qubits that $G_\ell$ also acts non-trivially on. Let $x$ denote the indices of the qubits where $G_\ell$ acts non-trivially.
Consider the action of the operator $X_i G_\ell X_i G_\ell$ on an arbitrary state $|\psi\rangle$. Applying $G_\ell$ flips the sign of those computational basis states where the qubits indexed by $x$ are all in the 1 state. Applying $X_i$ flips the state of the $i$th qubit. Applying $G_\ell$ once again flips the sign of those basis states where the qubits indexed by $x$ are all in the 1 state. Applying $X_i$ unflips the state of the $i$th qubit. The cumulative result of these operations is to flip the sign of those basis states where the qubits indexed by $x \setminus \{i\}$ are all in the 1 state. In other words, $X_i G_\ell X_i G_\ell$ acts as a controlled $Z$ operator with one fewer control than $G_\ell$ (the control on qubit $i$ is removed). Letting $G'_\ell$ denote this new operator, we have that $G'_\ell \in Z^{(k-1)}$ by the definition of $Z^{(k-1)}$. We can multiply the expression $G'_\ell = X_i G_\ell X_i G_\ell$ by $X_i$ on the left and $G_\ell$ on the right to obtain the desired result,
$$X_i G'_\ell G_\ell = G_\ell X_i.$$
This completes the proof.
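Proposition 1 can be spot-checked by brute force on small examples. The sketch below uses an encoding of our own choosing rather than one from the text: an element of $Z^{(k)}$ is stored as a set of qubit subsets, one per $Z$ or multi-controlled $Z$ generator, with the empty subset standing for $-I$. Under that assumption, conjugating by $X_s$ replaces each generator by the product of generators on its proper sub-supports. The function names are hypothetical.

```python
from itertools import combinations


def sign(g, x):
    """Diagonal entry (+1 or -1) of g on the basis state whose 1s are at x."""
    return (-1) ** sum(m <= x for m in g)


def level(g):
    """Smallest k such that g lies in Z^(k) in this encoding."""
    return max((len(m) for m in g), default=0)


def conjugated(g, s):
    """Generators of G' = X_s G X_s G^dagger for a gate g and qubit set s."""
    out = set()
    for m in g:
        hit, rest = m & s, m - s
        # Flipping the qubits in `hit` turns the generator on m into a product
        # over all subsets of `hit`; multiplying by G^dagger removes the full one.
        for r in range(len(hit)):
            for kept in combinations(sorted(hit), r):
                out ^= {rest | frozenset(kept)}
    return frozenset(out)


def matches_definition(g, s, n):
    """Compare conjugated() with the diagonal of X_s G X_s G^dagger directly."""
    for r in range(n + 1):
        for ones in combinations(range(n), r):
            x = frozenset(ones)
            if sign(conjugated(g, s), x) != sign(g, x ^ s) * sign(g, x):
                return False
    return True


# G = CCZ(0, 1, 2) * CZ(1, 3), an element of Z^(3) on four qubits.
g = frozenset({frozenset({0, 1, 2}), frozenset({1, 3})})
for s in (frozenset({1}), frozenset({0, 1}), frozenset({0, 1, 2, 3})):
    assert matches_definition(g, s, 4)
    assert level(conjugated(g, s)) <= level(g) - 1  # G' is in Z^(k-1)
```

For $s = \{1\}$ the result is $CZ$ on qubits $(0, 2)$ times $Z$ on qubit 3, matching the statement in the proof that the control on the flipped qubit is removed.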

Figure 3: Circuit diagrams for the one-qubit versions of selective destination and source teleportation [18]. Both protocols allow for a choice that is made by selecting between two measurement settings (indicated by the blue and red shaded areas of the diagrams). Selective destination teleportation teleports the state of one qubit to a choice of two different qubits. Selective source teleportation allows one to choose which of two qubits will have its state teleported to a fixed target. The possible states of the output qubit(s) are color-coded to match the measurement settings that select for them. In both cases, a byproduct operator $P$ drawn from the set $\{I, X, Z, XZ\}$ is randomly applied based on the measurement outcomes.
Figure 1: A quantum circuit diagram for the one-qubit version of gate teleportation [23]. The circuit in the blue shaded area prepares a Bell pair and the circuit in the red shaded area performs a Bell basis measurement (the X/Z in the rounded caps indicate X/Z basis measurements). Based on the outcome of the measurement, a classically controlled operation $U P^\dagger U^\dagger$ is performed, where the "byproduct operator" $P \in \{I, X, Z, ZX\}$ depends on the measurement outcome. When $U$ is a member of the Clifford group, $U P^\dagger U^\dagger$ is an element of the Pauli group. Single-qubit gate teleportation generalizes naturally to a multi-qubit version. Using multi-qubit gate teleportation to apply an $n$-qubit unitary from the Clifford group offers a simple example of advantage in the precomputation cost model, reducing the quantum gate complexity from $O(n^2)$ to $O(n)$.
, Lloyd et al. gave a protocol for implementing $e^{-it\rho}$ to within an error $\epsilon$ (in the diamond