Faster Coherent Quantum Algorithms for Phase, Energy, and Amplitude Estimation

We consider performing phase estimation under the following conditions: we are given only one copy of the input state, the input state does not have to be an eigenstate of the unitary, and the state must not be measured. Most quantum estimation algorithms make assumptions that make them unsuitable for this 'coherent' setting, leaving only the textbook approach. We present novel algorithms for phase, energy, and amplitude estimation that are both conceptually and computationally simpler than the textbook method, featuring both a smaller query complexity and ancilla footprint. They do not require a quantum Fourier transform, and they do not require a quantum sorting network to compute the median of several estimates. Instead, they use block-encoding techniques to compute the estimate one bit at a time, performing all amplification via singular value transformation. These improved subroutines accelerate the performance of quantum Metropolis sampling and quantum Bayesian inference.

Phase estimation is one of the most widely used quantum subroutines, because it grants quantum computers two unique capabilities. First, it yields a quadratic speedup in the accuracy of Monte Carlo estimates [BHMT00, Mon15, HM18]. Achieving 'Heisenberg limited' accuracy scaling has numerous applications in physics, chemistry, machine learning, and finance [Wright&20, ESP20, Rall20, An&20]. Second, it allows quantum computers to diagonalize unitaries in a certain restricted sense: if $U = \sum_j e^{2\pi i \lambda_j} |\psi_j\rangle\langle\psi_j|$ then phase estimation performs the transformation

$$\sum_j \alpha_j |0^n\rangle |\psi_j\rangle \to \sum_j \alpha_j |\lambda_j\rangle |\psi_j\rangle \tag{1}$$

where $|\lambda_j\rangle$ is an $n$-bit estimate of $\lambda_j$. This access to spectral information enables quantum speedups for linear algebra [HHL08, CKS15], studying physical systems [Temme&09, YA11, Lemi&19, JKKA20], estimating partition functions [Mon15], and performing Bayesian inference [HW19, AHNTW20].

Phase estimation is also a very complicated algorithm. Textbook phase estimation [NC00] requires the Quantum Fourier Transform (QFT), a tool we commonly associate with the exponential speedup of Shor's algorithm [Shor95]. However, phase estimation only delivers a quadratic speedup for estimation, and the exponential speedup for linear algebra can sidestep the QFT [CKS15, GSLW18]. When applied to energies, phase estimation also requires Hamiltonian simulation, a quantum subroutine that is an entire subject of study in its own right: it requires recent innovations to apply optimally in a black-box setting [BCK15, LC1606, LC1610, LC17, GSLW18], and optimal Hamiltonian simulation for specific systems is still being actively studied [SHC20, Cam20]. Furthermore, phase estimation demands median amplification to guarantee an accurate answer. This can be challenging to implement coherently on a quantum computer [HNS01, Klauck02, Beals&12], because it requires many ancillae and a quantum sorting network.
The probability with which phase estimation gives the correct answer (sometimes referred to as the 'Fejér kernel' [Rogg20]) is not always high enough to be amplified, which must be dealt with either by rounding or adaptivity [AA16]. The conceptual complexity of textbook phase estimation and the resulting computational overhead motivates a search for alternatives.
Fortunately, a lot of simplification is possible if we allow for 'incoherent' algorithms, where incoherence manifests via a variety of assumptions. We could be allowed to measure the quantum state, either because we are given many copies of the input state, or because we have the ability to prepare it inexpensively. Alternatively, we can assume a type of adaptivity where quantum states wait patiently without decohering while a classical computer performs a computation on the side. This assumption sometimes lets us repair the input state after using it [MW05,Temme&09,Poulin&17]. But most importantly, incoherent phase estimation algorithms usually require that the input state is an eigenstate of the unitary or Hamiltonian in question.
If enough of these assumptions hold, then 'iterative phase estimation' [Kit95] removes an enormous amount of conceptual and computational overhead. The estimate can be extracted one bit at a time, thereby removing the QFT. The accuracy of each bit can be amplified individually via many classical samples, removing the need for the quantum sorting network. The awkward notion of an 'n-bit estimate' can be removed and replaced with a traditional additive-error estimate [AA16]. Iterative phase estimation has seen many refinements and has been applied to gate set tomography and ground state energy estimation [Higgins07,KLY15,LT21]. An incoherent iterative approach also permits direct simplification of amplitude estimation [AR19,GGZW19].
However, for some applications of phase estimation maintaining coherence remains crucial. The original strategy for quantum matrix inversion [HHL08], quantum Metropolis sampling [Temme&09, YA11, Lemi&19], partition function estimation [Mon15], thermal state preparation, and Bayesian inference [HW19, AHNTW20] all violate the assumptions above. The eigenvalues must be estimated while preserving the superposition, and there is no guarantee that the input state is an eigenstate. These applications of 'coherent' phase estimation motivate the main question of this paper: does there exist a conceptually and computationally simpler algorithm for performing the transformation (1) while remaining coherent? That is: the input state is not necessarily an eigenstate, we are given exactly one copy, and there is no adaptive interaction with a classical computer.
In this paper we answer this question in the affirmative. We present simplified coherent algorithms for phase estimation, energy estimation, and amplitude estimation. None of these algorithms require a QFT, and they can be made arbitrarily close to the ideal transformation without a quantum sorting network to compute a median. These algorithms are about 14x faster for phase estimation and about 10x faster for energy estimation than the traditional approach in terms of query complexity, a performance metric that neglects the fact that they also require fewer ancilla qubits.
However, we also observe that there are unavoidable barriers to estimation in superposition. Consider an estimation algorithm that outputs a superposition of two estimates $\hat\lambda^{(1)}_j$ and $\hat\lambda^{(2)}_j$, both of which could be quite close to the true value $\lambda_j$:

$$\sum_j \alpha_j |0^n\rangle |\psi_j\rangle \to \sum_j \alpha_j \left( \xi_j |\hat\lambda^{(1)}_j\rangle + \zeta_j |\hat\lambda^{(2)}_j\rangle \right) |\psi_j\rangle. \tag{2}$$

It is a well-known fact [BBBV97] that uncomputation cannot work in such a situation (unless one of $\xi_j$ or $\zeta_j$ is $\approx 0$). Thus, any algorithm that actually makes use of the estimates must necessarily damage the input superposition and can no longer be considered coherent. Even worse, we show that any unitary quantum algorithm for estimation must perform a map of the form (2), and there always exist some values of $\lambda_j$ where neither of $\xi_j$ and $\zeta_j$ is $\approx 0$. Therefore, the only way to get an algorithm that performs the deterministic transformation (1) is to assume that certain values of $\lambda_j$ do not appear. We refer to this as a rounding promise, and show that if the rounding promise holds, then our algorithms perform the map (1). However, we also construct our algorithms in such a way that they give reasonable non-deterministic estimates as in (2) even when no rounding promise holds.
In the following we give a brief outline of the method we use to construct the algorithms. They strongly resemble iterative phase estimation [Kit95], which works roughly like this: the estimate is computed one bit at a time, starting with the least significant bit. A 'Hadamard-test' computes each bit with a decent success probability, which is then amplified to a high probability by taking the majority vote of many samples. This process could be made coherent naively by computing each sample into an ancilla, but this requires so many ancillae that any performance benefit over the QFT-and median-based approach is lost. The key idea is to amplify without using any new ancillae.
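To make the incoherent baseline concrete, the bit-by-bit procedure can be simulated classically when the input is a single eigenstate. The following is a minimal sketch (the feedback-rotation bookkeeping and all function names are our own illustrative choices, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)

def iterative_pe(lam, n, shots=301):
    """Estimate the n-bit expansion lam = 0.b1 b2 ... bn, least significant
    bit first, by majority-voting repeated Hadamard tests."""
    bits = [0] * (n + 1)                 # bits[j] holds b_j; bits[0] unused
    for j in range(n, 0, -1):
        # feedback correction 0.0 b_{j+1} ... b_n from the known lower bits
        tail = sum(bits[i] / 2 ** (i - j) for i in range(j + 1, n + 1))
        phi = 2 * np.pi * (2 ** (j - 1) * lam - tail / 2)
        # Hadamard test on U^(2^(j-1)): P(outcome 1) = sin^2(phi / 2)
        ones = (rng.random(shots) < np.sin(phi / 2) ** 2).sum()
        bits[j] = int(2 * ones > shots)  # majority vote over the shots
    return sum(bits[j] * 2 ** (n - j) for j in range(1, n + 1))

# an exactly representable phase is recovered exactly
assert iterative_pe(0.390625, 6) == int(0.390625 * 2 ** 6)  # = 25
```

The key point is the cost of the majority vote: each of the `shots` samples is an independent measurement, which is exactly what a coherent algorithm cannot afford to store in ancillae.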
To manipulate the probabilities of the Hadamard test, we use 'block-encodings'. Block-encodings permit quantum computers to manipulate non-unitary matrices. In our case, the matrices' eigenvalues encode our probabilities. A unitary matrix $U_A$ is a block-encoding of $A$ if $A$ is in the top-left corner:

$$U_A = \begin{bmatrix} A & \cdot \\ \cdot & \cdot \end{bmatrix}, \quad \text{that is,} \quad A = (\langle 0| \otimes I)\, U_A\, (|0\rangle \otimes I).$$

Block-encoded matrices can be manipulated in two ways. First, linear combinations of unitaries [BCK15, CKS15] allow us to build a block-encoding of $\alpha A + \beta B$ given block-encodings of $A$ and $B$. Linear combinations of unitaries allow us to make block-encodings of Hamiltonians presented as a sum of local terms, covering most practical applications. Second, singular value transformation [LC1610, LC1606, GSLW18] lets us apply certain polynomials $p(x)$ to the singular values of a block-encoded matrix $A$, using $\deg(p)$ many queries to controlled-$U_A$ and its inverse. Together, these two techniques permit the unification of many quantum algorithms into a single framework. For an accessible introduction to these methods, along with a comprehensive review of the most modern versions of these algorithms, we refer to [MRTC21]. This work also presents the independent discovery of an algorithm very similar to our improved method of phase estimation, which is also sketched in [Chuang20].
To obtain the $k$-th bit ($0 \le k \le n-1$), iterative phase estimation performs a Hadamard test on $U^{2^{n-k-1}}$, where $U = \sum_j e^{2\pi i \lambda_j}|\psi_j\rangle\langle\psi_j|$. The outcome of the test is a coin toss that is heads with probability $\cos^2(2^{n-k-1}\pi\lambda_j)$. We observe that the quantum circuit for the Hadamard test resembles a linear combination of unitaries. Therefore, we can construct a block-encoding of a matrix whose eigenvalues encode the Hadamard-test probability for each eigenvector of $U$.
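The Hadamard-test probability can be sanity-checked by simulating the two-dimensional circuit directly on an eigenstate (a sketch; function names ours):

```python
import numpy as np

def hadamard_test_prob0(lam, m):
    """P(ancilla = 0) for a Hadamard test on U^m applied to an eigenstate
    whose eigenvalue is e^{2*pi*i*lam}."""
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    # controlled-U^m restricted to the eigenstate acts on the ancilla as
    # |0><0| + e^{2*pi*i*m*lam} |1><1|
    cU = np.diag([1.0, np.exp(2j * np.pi * m * lam)])
    psi = H @ cU @ H @ np.array([1.0, 0.0])
    return abs(psi[0]) ** 2

n, k, lam = 4, 1, 0.3
m = 2 ** (n - k - 1)
assert np.isclose(hadamard_test_prob0(lam, m), np.cos(np.pi * m * lam) ** 2)
```

The final assertion is exactly the cos²(2^{n−k−1}πλ) coin described above.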
Iterative phase estimation then proceeds to perform the Hadamard test several times and takes the majority vote. We observe that the probability that the majority vote of $M$ samples is 1 is a polynomial in the Hadamard-test probability $p$:

$$p_{\mathrm{maj}}(p) = \sum_{k > M/2} \binom{M}{k} p^k (1-p)^{M-k}.$$

We can therefore apply such an 'amplifying polynomial' [Diak09] directly to the block-encoding using singular value transformation, which requires no new ancillae. In fact, using techniques from [LC17], we can construct a polynomial that performs the same task with much smaller degree. Now we have amplified the eigenvalues of the block-encoding to be close to 0 or 1, so we have a projector. To extract the 0/1-eigenvalue into a qubit we use the following novel technical tool:

Theorem. Block-measurement. Say we have an approximate block-encoding of a projector $\Pi$. Then there exists a quantum channel that approximately implements the map:

$$|\psi\rangle \mapsto |1\rangle \otimes \Pi|\psi\rangle + |0\rangle \otimes (I - \Pi)|\psi\rangle.$$

The channel is based on uncomputation [BBBV97], but the error analysis also features an interesting trick to deal with the case where the uncomputation fails. This theorem may find applications elsewhere.
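The majority-vote amplification described above is easy to check numerically: the probability that the majority of M coin flips comes up heads is a degree-M polynomial in the bias p, and it pushes any bias bounded away from 1/2 towards 0 or 1 (a sketch; function names ours):

```python
from math import comb

def majority_poly(p, M):
    """Probability that the majority of M i.i.d. flips with heads-probability
    p comes up heads: a degree-M polynomial in p (take M odd)."""
    return sum(comb(M, k) * p ** k * (1 - p) ** (M - k)
               for k in range(M // 2 + 1, M + 1))

# values bounded away from 1/2 are pushed towards 0 or 1; 1/2 is fixed
assert majority_poly(0.75, 15) > 0.95
assert majority_poly(0.25, 15) < 0.05
assert abs(majority_poly(0.5, 15) - 0.5) < 1e-12
```

Applying this polynomial via singular value transformation costs degree-many queries but, crucially, no fresh ancillae.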
Repeating the above procedure for each bit while carefully adjusting the phases at every step yields an algorithm that performs coherent phase estimation with no ancillae required. To estimate energies, we can construct a block-encoding with eigenvalues $\cos^2(2^{n-k-1}\pi\lambda_j)$ directly using the Jacobi-Anger expansion [GSLW18] rather than going through Hamiltonian simulation, which boosts the performance further, although now the algorithm does require ancillae. Finally, to perform coherent amplitude estimation, we construct a block-encoding of a 1×1 matrix containing the amplitude to be estimated and then invoke energy estimation.
A key research question of this paper is: do these algorithms perform better than traditional phase estimation in practice? An asymptotic analysis is not sufficient to answer this question. Instead, we carefully bound the query complexity and failure probability of all algorithms involved using the diamond norm and carefully select constants to maximize the performance. Subsequently, we perform a numerical analysis of the query complexity. We find that we improve the query complexity of phase estimation by a factor of about 14x, and the query complexity of energy estimation by about 10x. Our amplitude estimation algorithm inherits the speedup of the energy estimation algorithm. This paper is structured as follows. In section 1 we carefully define the problem of coherent estimation and establish the notion of a rounding promise. Then we analyze the textbook method. In section 2 we present our novel algorithms for phase and energy estimation. In section 3 we discuss a numerical analysis of the query complexities of the algorithms. In section 4 we give a proof of the block-measurement theorem above, and in section 5 we show how to use energy estimation to perform non-destructive amplitude estimation.

Preliminaries
In this section we formally define the estimation tasks we want to solve (Definition 2). This requires setting up the notion of a 'rounding promise' (Definition 1) which highlights an inherent complication with coherent estimation that is not present in the incoherent case. We argue that this complication is unavoidable, and should be taken into account when coherent estimation is performed in practice. Next, we set up some simple technical tools for dealing with the error analysis of uncomputation (Lemmas 3,4). These will be used several times in this paper. Finally, we give a precise description and analysis of 'textbook' phase estimation (Proposition 5). Thus, this section clearly defines the problem and the previous state-of-the-art which we improve.
Our paper presents algorithms for phase, energy, and amplitude estimation. Amplitude estimation follows as a relatively simple corollary of energy estimation, so we will only be talking about the former two until Section 4. To talk about both phases and energies at once, we standardize input unitaries and Hamiltonians into the following form:

$$U = \sum_j e^{2\pi i \lambda_j} |\psi_j\rangle\langle\psi_j|, \qquad H = \sum_j \lambda_j |\psi_j\rangle\langle\psi_j|.$$

We refer to the $\{\lambda_j\}$ as the 'eigenvalues', and assume they live in the range $[0, 1)$. While any unitary can be put into this form, the form places the constraint $0 \preceq H \prec I$ onto the Hamiltonian.
Our goal is to compute $|\lambda_j\rangle$, an '$n$-bit estimate of $\lambda_j$'. In this paper, we take this to be an $n$-qubit register containing a binary encoding of the number $\mathrm{floor}(2^n \lambda_j)$. Since $\lambda_j \in [0, 1)$ we are guaranteed that $\mathrm{floor}(2^n \lambda_j)$ is an integer in $\{0, ..., 2^n - 1\}$ and thus has an $n$-bit encoding.
Consider phase estimation using a quantum circuit composed of elementary unitaries and controlled applications of $U = \sum_j e^{2\pi i\lambda_j}|\psi_j\rangle\langle\psi_j|$. Following an argument related to the polynomial method, we see that the resulting state must be of the form:

$$\sum_{x,y} \alpha_{x,y}(e^{2\pi i \lambda_j})\, |x\rangle |y\rangle |\psi_j\rangle$$

where $\alpha_{x,y}(e^{2\pi i \lambda_j})$ is some polynomial of $e^{2\pi i \lambda_j}$. We would like $|x\rangle$ to encode $\mathrm{floor}(2^n\lambda_j)$, meaning that $\alpha_{x,y}(e^{2\pi i \lambda_j}) = 1$ if $x = \mathrm{floor}(2^n\lambda_j)$ and $\alpha_{x,y}(e^{2\pi i \lambda_j}) = 0$ otherwise. This is impossible: $\alpha_{x,y}(e^{2\pi i \lambda_j})$ is a continuous function of $\lambda_j$, but the desired amplitude indicating $x = \mathrm{floor}(2^n\lambda_j)$ is discontinuous. For energy estimation a similar argument applies, just that the amplitudes are of the form $\alpha_{x,y}(\lambda_j)$. This argument even holds in the approximate case when we demand that $|\alpha_{x,y}| \le \delta$ or $\ge 1-\delta$ for some small $\delta$ -- the discontinuity is present regardless.
To some extent, this issue stems from the awkwardness of the notion of an 'n-bit estimate' since it requires rounding or flooring, a discontinuous operation, when only continuous manipulation of amplitudes is possible. A more comfortable notion is that of an additive-error estimate, used by [AA16,KP16] in their estimation algorithms.
However, the primary application of our algorithms is a situation where this simplification is not possible: Szegedy walks based on Hamiltonian eigenspaces [YA11, Lemi&19, JKKA20, WT21]. The original quantum Metropolis algorithm [Temme&09] implements a random walk over Hamiltonian eigenspaces, where the superposition is measured at every step and it is demonstrated that an additive-error estimate is sufficient. But in order to harness a quadratic speedup due to quantum walks [Sze04], the measurement of energies must be made completely coherent, which is not achieved by an additive-error estimate unless the error is smaller than the gap between any two eigenvalues. Therefore, we retain the notion of an '$n$-bit estimate' in this paper. We could always fall back to an estimate with additive error $\varepsilon/2$ by computing $n = \mathrm{ceil}(\log_2(\varepsilon^{-1}))$ bits of accuracy.
However, the above argument still applies for additive-error estimation: the amplitude of any given estimate $|\hat\lambda\rangle$ is a continuous function of the eigenvalue $\lambda_j$. This means that the amplitude cannot be 0 or 1 everywhere; there must exist points where it crosses intermediate values in order to interpolate between the two. In both the '$n$-bit estimate' case and the additive-error case, this causes a problem for coherent quantum algorithms, since the estimate cannot always be uncomputed. For some eigenvalues $\lambda_j$, the algorithm will yield an output state of the form $\alpha|\hat\lambda\rangle + \beta|\tilde\lambda\rangle$. Even if $\hat\lambda$ and $\tilde\lambda$ are good estimates of $\lambda_j$, the uncompute trick [BBBV97] cannot be used for such an output state. Thus, the damage to the input superposition over the $|\psi_j\rangle$ is irreparable.
The only way to deal with this issue is to assume that the $\lambda_j$ do not take certain values. Then the amplitudes can interpolate between 0 and 1 at those points. Observe that the discontinuities in the function $\mathrm{floor}(2^n\lambda_j)$ occur at multiples of $1/2^n$. A 'rounding promise' simply disallows eigenvalues near these regions.

Definition 1. Let $n \in \mathbb{Z}^+$ and $\alpha \in (0, 1)$. A Hermitian matrix $H$ satisfies an $(n, \alpha)$-rounding promise if it has an eigendecomposition $H = \sum_j \lambda_j |\psi_j\rangle\langle\psi_j|$, all the eigenvalues $\lambda_j$ satisfy $0 \le \lambda_j < 1$, and for all $x \in \{0, ..., 2^n\}$:

$$\lambda_j \not\in \left( \frac{x}{2^n} - \frac{\alpha}{2^{n+1}},\; \frac{x}{2^n} + \frac{\alpha}{2^{n+1}} \right).$$

Similarly, a unitary matrix $U$ satisfies an $(n, \alpha)$-rounding promise if it can be written as $U = \sum_j e^{2\pi i \lambda_j}|\psi_j\rangle\langle\psi_j|$ and the phases $\lambda_j$ satisfy the same assumptions.
We have λ j ∈ [0, 1), and we have disallowed λ j from certain sub-intervals of [0, 1). The definition above is chosen such that the total length of these disallowed subintervals is α, regardless of the value of n. We have essentially cut out an α-fraction of the allowed eigenvalues.
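For intuition, the promise and the claim about the total disallowed measure can be checked numerically. The sketch below assumes the disallowed intervals are centered at multiples of 1/2^n with half-width α/2^{n+1} (the gap quoted later in the text); the helper name is ours:

```python
def satisfies_rounding_promise(eigs, n, alpha):
    """Check that no eigenvalue lies within alpha / 2**(n+1) of a multiple
    of 1 / 2**n (assumed interval shape)."""
    half_width = alpha / 2 ** (n + 1)
    bin_width = 1 / 2 ** n
    for lam in eigs:
        assert 0 <= lam < 1
        r = lam % bin_width
        if min(r, bin_width - r) < half_width:
            return False
    return True

n, alpha = 3, 0.1
# total disallowed measure inside [0, 1) is alpha, independent of n:
# 2**n full intervals of width alpha / 2**n (the two half-intervals at the
# endpoints 0 and 1 together count as one)
assert abs(2 ** n * (alpha / 2 ** n) - alpha) < 1e-15
assert satisfies_rounding_promise([0.30, 0.80], n, alpha)       # mid-bin
assert not satisfies_rounding_promise([0.5 + 1e-4], n, alpha)   # near 4/8
```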
Guaranteeing a rounding promise demands an enormous amount of knowledge about the eigenvalues. We do not expect many Hamiltonians or unitaries in practice to actually provably satisfy such a promise. However, we expect that the estimation algorithms discussed in this paper will still perform well even if the input does not satisfy such a promise. One can increase the chances of success by setting α to be very small, so that the vast majority of the eigenvalues do not fall into a disallowed region. Then, if the input state is close to a uniform distribution over the eigenstates, one can be fairly confident that only an α fraction of the eigenvalues in the support will be disallowed. The need for a rounding promise can also be sidestepped entirely by avoiding the use of energy estimation as a subroutine and approaching the desired problem directly. This has been accomplished for ground state finding [LT21].
The rounding promise is not just a requirement of our work in particular -- it is a requirement for any coherent phase estimation protocol. The fact that coherent phase estimation does not work for certain phases is largely disregarded in the literature: for example, works like [KP16] neglect this issue entirely. However, there are some works that have observed this problem and attempted to mitigate it. Such methods are called 'consistent phase estimation' [TaShma13], since the error of their estimate is supposedly independent of the phase being estimated, thus allowing amplification to make the amplitudes always close to 0 or 1. We claim that all of these attempts fail, and furthermore that achieving coherent phase estimation without some kind of promise is impossible in principle. This is due to the polynomial method argument above: any coherent quantum algorithm's output state's amplitudes must be a continuous function of the phase, and continuous functions that are sometimes $\approx 0$ and sometimes $\approx 1$ must somewhere have an intermediate value. Recall that uncomputation only works when the amplitudes are close to 0 or 1. The error depends on whether the phase is close to this transition point or not, so the error must depend on the phase. This issue was not taken into account by [Ambainis10], [KP17], and [KLLP18], since all of these either implicitly or explicitly state that there is an algorithm that approximately performs $|\psi_i\rangle|0^n\rangle \to |\psi_i\rangle|\lambda_i\rangle$ in superposition, where $|\lambda_i\rangle$ is a deterministic computational basis state. A more promising approach is detailed in [TaShma13], which, crudely speaking, shifts the transition points by a classically chosen random amount. Now whether or not a phase is close to a transition point is independent of the phase itself, making amplification possible. [Ambainis10] describes a similar idea, calling it 'unique-answer' eigenvalue estimation. However, we claim that this only works for a single phase.
If we consider, for example, a unitary whose phases are uniformly distributed in [0, 1) at a sufficiently high density, then there will be a phase near a transition point for any choice of random shift. The only way to avoid the rounding promise is to sacrifice coherence and measure the output state.
All of the algorithms in this paper, including textbook phase estimation, achieve an asymptotic runtime of $O(2^n \alpha^{-1} \log(\delta^{-1}))$, where $\delta$ is the error in diamond norm. Before we move on to the formal definition of the estimation task, we informally argue that the $\alpha^{-1}$ dependence is optimal, via a reduction to approximate counting. We are given $N$ items, $K$ of which are marked. Following the standard method for approximate counting [BHMT00], we construct a Grover unitary whose phases $\lambda_j$ encode $\arcsin(\sqrt{K/N})$. Given a $(1, \alpha)$-rounding promise, computing $\mathrm{floor}(2^1 \lambda_j)$ amounts to deciding if $\lambda_j \le \frac{1 - \alpha/2}{2}$ or $\lambda_j \ge \frac{1 + \alpha/2}{2}$, given that one of these is the case. By shifting the $\lambda_j$ around appropriately we can thus decide if $K \ge (1/2 + C\alpha)N$ or $K \le (1/2 - C\alpha)N$, for some constant $C$ obtained by linearising $\arcsin(\sqrt{K/N})$. We have achieved approximate counting with a promise gap $\sim \alpha$. Thus the $\Omega(\alpha^{-1})$ lower bound on approximate counting [NW98] implies our runtime must be $\Omega(\alpha^{-1})$.
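The reduction hinges on the Grover eigenphase moving past 1/4 as K/N crosses 1/2. A small numerical sanity check, using the normalization λ = arcsin(√(K/N))/π so that λ ∈ [0, 1/2] (the constants below are illustrative choices of ours, not the paper's):

```python
import numpy as np

def grover_phase(K, N):
    # Grover eigenphases encode theta / pi where sin(theta) = sqrt(K / N)
    return np.arcsin(np.sqrt(K / N)) / np.pi   # lies in [0, 1/2]

N, alpha = 1000, 0.05
# K/N = 1/2 maps to phase exactly 1/4; near that point the phase moves
# linearly in K/N with slope 1/pi (derivative of arcsin(sqrt(x))/pi at 1/2)
assert np.isclose(grover_phase(N // 2, N), 0.25)
assert grover_phase(int(0.55 * N), N) > 0.25 + alpha / (2 * np.pi)
assert grover_phase(int(0.45 * N), N) < 0.25 - alpha / (2 * np.pi)
```

So a 1-bit estimate of 2λ under an (1, α)-rounding promise separates the two counting cases with a gap proportional to α.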
Equipped with the notion of a rounding promise, we can define our estimation tasks. Many algorithms in this paper produce some kind of garbage, which can be dealt with using the uncompute trick [BBBV97]. Rather than repeat the analysis of uncomputation in every single proof, we present a modular framework where we deal with uncomputation separately. Furthermore, some applications may require computing some function of the final estimate, resulting in more garbage which also needs to be uncomputed. Rather than baking the uncomputation into each algorithm, it is thus more efficient to leave the decision of when to uncompute to the user of the subroutine.

Definition 2.
A phase estimator is a protocol that, given some $n \in \mathbb{Z}^+$ and $\alpha \in (0, 1)$, and any error target $\delta > 0$, produces a quantum circuit involving controlled $U$ and $U^\dagger$ calls to some unitary $U$. If $U = \sum_j e^{2\pi i\lambda_j}|\psi_j\rangle\langle\psi_j|$ satisfies an $(n, \alpha)$-rounding promise then this circuit implements a quantum channel that is $\delta$-close in diamond norm to the map:

$$|0^n\rangle|\psi_j\rangle \mapsto |\mathrm{floor}(2^n\lambda_j)\rangle|\psi_j\rangle.$$

Similarly, an energy estimator is such a protocol that instead involves such calls to $U_H$, which is a block-encoding (see Definition 9) of a Hamiltonian $H = \sum_j \lambda_j|\psi_j\rangle\langle\psi_j|$ that satisfies an $(n, \alpha)$-rounding promise.
The query complexity of an estimator is the number of calls to U or U H in the resulting circuit, as a function of n, α, δ.
An estimator is said to 'have garbage' or 'have $m$ qubits of garbage' if it is instead close to a map that produces some $j$-dependent $m$-qubit garbage state in another register:

$$|0^n\rangle|0^m\rangle|\psi_j\rangle \mapsto |\mathrm{floor}(2^n\lambda_j)\rangle|\mathrm{gar}_j\rangle|\psi_j\rangle.$$

(Note that the quantum circuit can allocate and discard ancillae, but that does not count as garbage.) An estimator is said to 'have phases' if the map introduces a $j$-dependent phase $\phi_j$:

$$|0^n\rangle|\psi_j\rangle \mapsto e^{i\phi_j}|\mathrm{floor}(2^n\lambda_j)\rangle|\psi_j\rangle.$$

If an estimator is both 'with phases' and 'with garbage', then we can just absorb the $e^{i\phi_j}$ into $|\mathrm{gar}_j\rangle$, so the 'with phases' is technically redundant. However, in our framework it makes more sense to treat phases and garbage independently, since some of our algorithms are just 'with phases'.
The reason we measure errors in diamond norm has to do with uncomputation of approximate computations with garbage. Consider for example a transformation $V$ acting on an answer register and a garbage register:

$$V|0\rangle|0...0\rangle = \sqrt{1-\varepsilon}\,|0\rangle|\mathrm{gar}_0\rangle + \sqrt{\varepsilon}\,|1\rangle|\mathrm{gar}_1\rangle$$

for some small nonzero $\varepsilon$. We copy the answer register into the output register:

$$\sqrt{1-\varepsilon}\,|0\rangle|\mathrm{gar}_0\rangle|0\rangle + \sqrt{\varepsilon}\,|1\rangle|\mathrm{gar}_1\rangle|1\rangle$$

and then we project the answer and garbage registers onto $V|0\rangle|0...0\rangle$:

$$(1-\varepsilon)\,|0\rangle + \varepsilon\,|1\rangle.$$

The resulting state $(1-\varepsilon)|0\rangle + \varepsilon|1\rangle$ is not normalized, meaning that the projection onto $V|0\rangle|0...0\rangle$ succeeds with some probability $< 1$. Thus the uncomputed registers are not always returned to the $|0\rangle|0...0\rangle$ state. At this point our options are either to postselect these registers onto $|0\rangle|0...0\rangle$ or to discard them. Postselection improves the accuracy, but also implies that the algorithms do not always succeed. Since many applications of coherent energy estimation demand repeating this operation many times, it is important that the algorithm always succeeds. Thus, we need to discard qubits, so we need to talk about quantum channels. Therefore, the diamond norm is the appropriate choice of error metric. Recall that the diamond norm is defined in terms of the trace norm [AKN98]:

$$\|\Lambda - \Gamma\|_\diamond := \max_\rho \big\| (\Lambda \otimes I)(\rho) - (\Gamma \otimes I)(\rho) \big\|_1$$

where $I$ is the identity channel for some Hilbert space of higher dimension than that of $\Lambda$. The following analysis shows how to remove phases and garbage from an estimator, even in the approximate case.
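The unnormalized state above can be reproduced with a small linear-algebra calculation. The following is a minimal hypothetical instance with one answer qubit and one garbage qubit (register layout and names ours):

```python
import numpy as np

eps = 0.01
# V|0>|0...0> = sqrt(1-eps)|0>|g0> + sqrt(eps)|1>|g1>, with basis |ans, gar>
v = np.zeros(4)
v[0b00] = np.sqrt(1 - eps)   # |0>|g0>
v[0b11] = np.sqrt(eps)       # |1>|g1>

# copy the answer bit into a fresh output register, basis |out, ans, gar>
copied = np.zeros(8)
copied[0b000] = v[0b00]
copied[0b111] = v[0b11]

# project the (ans, gar) registers back onto V|0>|0...0>  (the uncompute step)
out = np.array([sum(v[ag] * copied[(b << 2) | ag] for ag in range(4))
                for b in (0, 1)])

assert np.isclose(out[0], 1 - eps) and np.isclose(out[1], eps)
assert out @ out < 1   # unnormalized: the projection succeeds w.p. < 1
```

The shortfall in the norm is exactly the probability that the uncomputed registers fail to return to |0>|0...0>.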
Lemma 3. Getting rid of phases and garbage. Given a phase/energy estimator with phases and/or garbage that is implemented by a unitary circuit and has query complexity $Q(n, \alpha, \delta)$, we can construct a phase/energy estimator without phases and without garbage with query complexity $2Q(n, \alpha, \delta/2)$.
Proof. Without loss of generality, we assume we are given a quantum channel $\Lambda$ that implements something close in diamond norm to the map:

$$|0^n\rangle|0^m\rangle|\psi_j\rangle \mapsto e^{i\phi_j}|\mathrm{floor}(2^n\lambda_j)\rangle|\mathrm{gar}_j\rangle|\psi_j\rangle. \tag{16}$$

If the channel is actually without phases then we can just set $\phi_j = 0$, and if it is actually without garbage then the following calculation will proceed without problems. Our strategy is just to use the uncompute trick and then to discard the uncomputed ancillae. This requires us to implement $\Lambda^{-1}$, which is why we required that $\Lambda$ be implementable by a unitary.
First, we consider the ideal case when $\Lambda$ implements the map (16) exactly. If we use $|\psi_j\rangle$ as input and we stop the circuit before the discards, we produce a state $|\mathrm{ideal}_j\rangle$. Let $\rho^{\mathrm{ideal}}_{i,j} := |\mathrm{ideal}_i\rangle\langle\mathrm{ideal}_j|$, so that we can write down the ideal channel $\Gamma_{\mathrm{ideal}}$. If we apply $\Lambda$ with error $\delta/2$ instead, we obtain the actual channel $\Gamma$. If $A$ refers to the subsystems that start (and approximately end) in the states $|0^n\rangle|0...0\rangle$, then the final output state satisfies the desired error bound, where we have used the fact that for all $\rho$ and subsystems $A$ we have $\|\mathrm{Tr}_A(\rho)\|_1 \le \|\rho\|_1$. Upon plugging in $\sigma = |\psi_j\rangle\langle\psi_j|$ we can see that tracing out the middle register after applying $\Gamma_{\mathrm{ideal}}$ implements the map without phases and garbage, so the circuit in (17) is an estimator without phases and garbage, as desired.
The proof is complete, up to the fact that $\|\mathrm{Tr}_A(\rho)\|_1 \le \|\rho\|_1$ for all $\rho$ and subsystems $A$.
This establishes that uncomputation works in the approximate case as expected. While we are reasoning about the diamond norm, we also present the following technical tool which will come in handy several times. In particular, we will need it for our analysis of textbook phase estimation.

Lemma 4. Diamond norm from spectral norm. Say $U, V$ are unitary matrices satisfying $\|U - V\| \le \varepsilon$. Then the corresponding channels $\Lambda_U(\rho) := U\rho U^\dagger$ and $\Lambda_V(\rho) := V\rho V^\dagger$ satisfy $\|\Lambda_U - \Lambda_V\|_\diamond \le 2\varepsilon$.
Proof. We have that:

$$\|\Lambda_U - \Lambda_V\|_\diamond = \max_\rho \big\| \bar U \rho \bar U^\dagger - \bar V \rho \bar V^\dagger \big\|_1$$

where $I$ is the identity channel on some subsystem of dimension larger than that of $U, V$, and $\|M\|_1$ is the sum of the magnitudes of the singular values of $M$, with $\bar U := U \otimes I$ and $\bar V := V \otimes I$. Letting $\bar E := \bar U - \bar V$, we can proceed with the upper bound:

$$\big\| \bar U \rho \bar U^\dagger - \bar V \rho \bar V^\dagger \big\|_1 = \big\| \bar E \rho \bar U^\dagger + \bar V \rho \bar E^\dagger \big\|_1 \le \|\bar E \rho \bar U^\dagger\|_1 + \|\bar V \rho \bar E^\dagger\|_1 \le 2\|\bar E\| = 2\|U - V\|.$$

Now we have all the tools required to analyze the textbook algorithm [NC00]. This algorithm combines several applications of controlled-$U$ with an inverse QFT to obtain the correct estimate with a decent probability. One can then improve the success probability via median amplification: if a single estimate is correct with probability $\ge \frac12 + \eta$ for some $\eta$, then the median of $\ln(\delta^{-1})/2\eta^2$ estimates is incorrect with probability $\le \delta$.
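The spectral-to-diamond-norm relationship can be probed numerically: for any pair of unitaries, the trace distance between channel outputs on a doubled space never exceeds twice the spectral norm of U − V. A sketch (Haar sampling via QR is a standard trick; all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unitary(d):
    """Haar-random unitary via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return q * (np.diag(r) / np.abs(np.diag(r)))

d = 4
U, V = random_unitary(d), random_unitary(d)
spec = np.linalg.norm(U - V, ord=2)                 # spectral norm ||U - V||

# lower-bound the diamond norm with random pure states on a doubled space
Ubar, Vbar = np.kron(U, np.eye(d)), np.kron(V, np.eye(d))
worst = 0.0
for _ in range(200):
    psi = rng.normal(size=d * d) + 1j * rng.normal(size=d * d)
    psi /= np.linalg.norm(psi)
    a, b = Ubar @ psi, Vbar @ psi
    diff = np.outer(a, a.conj()) - np.outer(b, b.conj())
    worst = max(worst, np.abs(np.linalg.eigvalsh(diff)).sum())  # trace norm

assert worst <= 2 * spec + 1e-9   # never exceeds 2 ||U - V||
```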
However, median amplification alone is not sufficient to accomplish phase estimation as we have defined it in Definition 1. This is because for any $\lambda_j$ that is within $\approx 10\% \cdot 2^{-n-1}$ of a multiple of $1/2^n$, the probability of correctly obtaining $\mathrm{floor}(2^n\lambda_j)$ is actually less than $1/2$! This means that no matter how much median amplification is performed, this approach cannot achieve $\alpha$ below $\approx 10\%$. (For reference, a lower bound $\gamma$ on the probability is plotted in Figure 1, although reading this figure demands some notation from the proof of the following proposition. We see that when $\lambda_j$ is close enough to a multiple of $1/2^n$, the probability of obtaining $\mathrm{floor}(2^n\lambda_j)$ is less than $1/2$.) Furthermore, even if the signal obtained from the QFT could achieve arbitrarily small $\alpha$, it would not achieve the desired $\alpha^{-1}$ scaling. This is because the amplification gap $\eta$ scales linearly with $\alpha$, and median amplification scales as $\eta^{-2}$ due to the Chernoff-Hoeffding theorem. Therefore, we would obtain a scaling of $\alpha^{-2}$.
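The dip below 1/2 can be seen directly from the Fejér kernel. Writing the eigenvalue's offset from the nearest multiple of 1/2^n in units of 1/2^n, the success probability is (sin(πδ)/(2^n sin(πδ/2^n)))², and it crosses 1/2 when the eigenvalue sits within roughly 10% of a half-bin of the boundary (a sketch for the rounding variant; the function name is ours):

```python
import numpy as np

n = 6

def p_correct(delta):
    """Fejer-kernel probability of the nearest bin, for an eigenvalue offset
    by delta bins (|delta| <= 1/2) from the nearest multiple of 1/2**n."""
    if delta == 0:
        return 1.0
    return (np.sin(np.pi * delta)
            / (2 ** n * np.sin(np.pi * delta / 2 ** n))) ** 2

assert p_correct(0.0) == 1.0
assert p_correct(0.25) > 0.5   # safely inside a bin: amplifiable
assert p_correct(0.49) < 0.5   # too close to the boundary: not amplifiable
```

No amount of majority voting can rescue a per-sample success probability below 1/2, which is exactly why a minimum α is forced.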
Fortunately, achieving $\sim \alpha^{-1}$ is possible, using a different method for reducing $\alpha$: we perform estimation for $r$ additional bits which are then ignored. This makes use of the fact that $\alpha$ is independent of the number of estimated bits, but the gap away from the multiples of $1/2^n$, which is $\alpha/2^{n+1}$, is not. Every time we increase $n$, the width of each disallowed interval is chopped in half, but to compensate the number of disallowed intervals is doubled. If we simply round away some of the bits, then we do not care about the additional disallowed regions introduced. If we estimate and round $r$ additional bits, we suppress $\alpha$ by a factor of $2^{-r}$ while multiplying our runtime by a factor of $2^r$ -- so the scaling is $\sim \alpha^{-1}$.
We still need median amplification, though. In order to use the above strategy we need rounding to be successful: if an eigenvalue $\lambda_j$ falls between two bins $\mathrm{floor}(2^n\lambda_j)$ and $\mathrm{floor}(2^n\lambda_j) + 1$, then it must be guaranteed to be rounded to one of those bins with failure probability at most $\delta$. Before amplification, the probability that rounding succeeds is $\ge 8/\pi^2$, a constant gap above $1/2$ corresponding to $\alpha = 1/2$. We cannot guarantee a success probability higher than $8/\pi^2$ using the rounding trick alone, and thus need median amplification to finish the job.
Below, we split our analysis into two regimes: the α ≤ 1/2 regime and the α > 1/2 regime. When α > 1/2 then we do not really need the rounding trick described above, and we can achieve this accuracy with median amplification alone. Otherwise when α ≤ 1/2 we use median amplification to get to the point where rounding is likely to succeed, and then estimate r additional bits such that α2 r ≥ 1/2.

Proposition 5. Standard phase estimation.
There exists a phase estimator with phases and garbage. Consider a unitary satisfying an (n, α)-rounding promise. Let η be the amplification gap derived in the proof below (chosen so that the rounding promise guarantees each individual estimate is correct with probability ≥ 1/2 + η), let δ_med := δ^2/6.25, and let M := ⌈log(δ_med^{-1})/(2η^2)⌉. Then, when α > 1/2, the phase estimator has query complexity M(2^n − 1) and has nM qubits of garbage. When α ≤ 1/2, let r := ⌈log_2((2α)^{-1})⌉ and use the constant gap η_0 defined below in place of η; then the phase estimator has query complexity M(2^{n+r} − 1) and has (n + r)M qubits of garbage.

Proof. We begin with the α > 1/2 case, which is simpler, and then extend to the α ≤ 1/2 case. The following is a review of phase estimation, modified to estimate floor(2^n λ_j) rather than round(2^n λ_j):

1. Prepare a uniform superposition over times: 2^{-n/2} Σ_{t=0}^{2^n−1} |t⟩ ⊗ |ψ_j⟩.

2. Apply U to the |ψ_j⟩ register t times, along with an additional phase shift of −2πt(1−α)/2^{n+1} to account for flooring: 2^{-n/2} Σ_t e^{2πit(λ_j − (1−α)/2^{n+1})} |t⟩ ⊗ |ψ_j⟩.

3. Apply an inverse quantum Fourier transform to the |t⟩ register: Σ_{x=0}^{2^n−1} β(λ_j^{(x)}) |x⟩ ⊗ |ψ_j⟩, where in the final line we have defined λ_j^{(x)} := λ_j − (1−α)/2^{n+1} − x/2^n and β(λ) := 2^{-n} Σ_{t=0}^{2^n−1} e^{2πitλ}.

4. Let η and δ_med be as in the theorem statement, and let M := ⌈log(δ_med^{-1})/(2η^2)⌉. Repeat the above process for a total of M times. If we sum x over {0, ..., 2^n − 1}^M, then we obtain a superposition over vectors of M independent estimates.

5. Run a sorting network, e.g. [Beals&12], on the M estimates, and copy the median into the output register.
The query complexity analysis is straightforward: each estimate requires 2^n − 1 applications of controlled-U, and there are M estimates. So all we must do is demonstrate accuracy. It is clear that the process above implements a map of the appropriate form for some probabilities p_{j,x} and phases ϕ_{j,x}, since this is just a general description of a unitary map that leaves the |ψ_j⟩ register intact. This way of writing the map lets us bound the error in diamond norm to the ideal map (selecting ϕ_j := ϕ_{j,floor(2^n λ_j)} and |gar_j⟩ := |gar_{j,floor(2^n λ_j)}⟩) without uncomputing, while still employing the standard analysis of phase estimation, which reasons only about the probabilities p_{j,x}. These can be bounded from a median amplification analysis of the individual estimates. Consider a vector of M estimates x. First we show that the probabilities |β(λ_j^{(x)})|^2 associated with the individual estimates are bounded away from 1/2. Using some trigonometric identities we can rewrite |β(λ_j^{(x)})|^2, and using the small-angle approximation sin(θ) ≤ θ for 0 ≤ θ ≤ π, we can derive a lower bound γ(λ_j^{(x)}), which is plotted in Figure 1. We would like this probability to be larger than 1/2 + η for some η when λ_j satisfies a rounding promise. This is guaranteed if we select η as in the theorem statement. From the figure, we see that when α > 1/2 then η > η_0. This fact is not actually necessary for this construction to work, but observing it will be useful for the α ≤ 1/2 case. It is, however, necessary to observe that when α > 1/2 we are guaranteed that η is positive. Then, from the rounding promise and the definition of λ_j^{(x)}, we are guaranteed that when x = floor(2^n λ_j) then |β(λ_j^{(x)})|^2 ≥ 1/2 + η. Now we perform a standard median amplification analysis. The probability of a particular estimate x being 'correct' is ≥ 1/2 + η, so the probability of being incorrect is ≤ 1/2 − η. The only way that a median of M estimates can be incorrect is if more than half of the estimates are incorrect.
Let X be the random variable counting the number of incorrect estimates. Then we can invoke the Chernoff-Hoeffding theorem: Pr[X ≥ M/2] ≤ exp(−2Mη^2). We can bound the above by δ_med, a constant we will fix later, if we select M := ⌈log(δ_med^{-1})/(2η^2)⌉ as we did above.
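As a quick sanity check on this choice of M, a small sketch of the Chernoff-Hoeffding bound (the numerical values of η and δ_med here are arbitrary illustrations, not values from the text):

```python
import math

def repetitions(eta, delta_med):
    # M := ceil(log(1/delta_med) / (2 eta^2)) estimates suffice, since
    # Hoeffding gives Pr[median wrong] <= exp(-2 M eta^2)
    return math.ceil(math.log(1 / delta_med) / (2 * eta**2))

eta, delta_med = 0.2, 1e-6
M = repetitions(eta, delta_med)
failure_bound = math.exp(-2 * M * eta**2)
print(M, failure_bound)  # failure_bound <= delta_med
```

Note the η^{-2} scaling of M, which is the reason median amplification alone cannot reach the desired α^{-1} runtime.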
Figure 1: Sketch of γ(λ_j^{(x)}), a lower bound on the magnitude of the amplitude of phase estimation, as indicated in the shaded region. From this sketch we draw two conclusions necessary for our proof. First, when α > 1/2 then γ(λ_j^{(x)}) is guaranteed to be strictly greater than 1/2, so η > 0 (in fact, η > η_0). Second, even when no rounding promise applies, the sum of the probabilities of two adjacent bins is at least 8/π^2, which we use to define the amplification threshold η_0. Also visible in this figure is the impossibility of reducing α below ≈ 10% without rounding: several values of λ_j^{(x)} near 1/2 have probabilities less than 1/2 and cannot be amplified.
We have demonstrated that p_{j,x} ≥ 1 − δ_med whenever x = floor(2^n λ_j). Now all that remains is a bound on the error in diamond norm. We begin by bounding the spectral norm. Let p_j := p_{j,floor(2^n λ_j)} and observe that 1 ≥ p_j ≥ 1 − δ_med. Taking the difference between equations (47) and (48) we obtain a spectral-norm bound. Now we apply Lemma 4 to bound the diamond norm, and since the error is ∼ √δ_med to leading order, we construct a bound that holds whenever √δ_med ≤ 1. If we select δ_med := δ^2/6.25 then the diamond norm is bounded by δ.
Having completed the α > 1/2 case, we proceed to the α ≤ 1/2 case. As stated above, the construction we just gave also works when α ≤ 1/2. However, the asymptotic dependence on α is ∼ α^{-2}, which is not optimal.
Looking again at Figure 1, we see that the sum of the probabilities of two adjacent bins is at least 8/π^2, even when λ_j^{(x)} is in a region disallowed by the rounding promise. Therefore, if we perform median amplification with gap parameter η_0, we are guaranteed that values of λ_j^{(x)} that fall into a rounding gap will at least be rounded to an adjacent bin.
We can use this fact to achieve an ∼ α^{-1} dependence. Recall that α is the fraction of the range [0, 1) where the λ_j are not allowed to appear due to the rounding promise, independent of n. If we perform phase estimation with n + 1 bits rather than n, then the width of each disallowed region is cut in half from α · 2^{-n} to α · 2^{-n-1}, but we also double the number of disallowed regions. However, if we simply ignore the final bit then, because we are amplifying to η_0, any values of λ_j that fall into one of the newly introduced regions will simply be rounded. Thus, we cut the gaps in half without introducing any new gaps, so α is cut in half. We will make this argument more formal in a moment.
So, in order to achieve a particular α, we observe from Figure 1 that if we amplify to η_0 then the gaps have width (1/2) · 2^{-n}. If we estimate an additional r bits, then the gaps have width (1/2) · 2^{-n-r}. Solving (1/2) · 2^{-n-r} ≤ α · 2^{-n} for r, we obtain r := ⌈log_2((2α)^{-1})⌉. We briefly make the n explicit in λ_j^{(x)}, writing λ_j^{(n,x)}. After running the protocol at the beginning of the proof with n + r bits, we obtain, for x ∈ {0, ..., 2^n − 1}^M, a state in which the normalized state |gar_x⟩, the probability p_{j,x}, and the phase ϕ_{j,x} are defined just as before. Now if we can demonstrate that p_{j,x} ≥ 1 − δ_med when x = floor(2^n λ_j) then the same argument as in (60-65) holds and we are done. Say x = floor(2^n λ_j). Then consider t := floor(2^{n+r} λ_j) − x · 2^r, which is an integer ∈ {0, ..., 2^r − 1}. The estimate of the first n bits lands on x whenever the (n+r)-bit estimate lands on 2^r x + t for any such t. Looking at Figure 1, we see that the probability that the first n bits are x is Σ_{t=0}^{2^r−1} |β(λ_j^{(n+r, 2^r x+t)})|^2 ≥ 1/2 + η_0, which is all that was required to perform the median amplification argument above. So we can conclude that p_{j,x} ≥ 1 − δ_med, and the modified algorithm retains the same accuracy.
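The 8/π^2 bound used throughout can be checked directly from the phase-estimation amplitude β(λ) = 2^{-n} Σ_t e^{2πitλ}. A small numerical sketch (n = 6 is an arbitrary choice, and the flooring phase shift from step 2 is ignored for simplicity):

```python
import numpy as np

n = 6
N = 2**n
t = np.arange(N)

def prob(offset):
    # |beta(offset)|^2 with beta(lam) = (1/N) * sum_t exp(2 pi i t lam)
    return abs(np.exp(2j * np.pi * t * offset).sum() / N)**2

# Worst case: lambda sits exactly halfway between bins 10 and 11, so
# neither single bin has probability >= 1/2, but the two together do.
lam = (10 + 0.5) / N
p_sum = prob(lam - 10 / N) + prob(lam - 11 / N)
print(p_sum)  # slightly above 8/pi^2 ~ 0.81
```

Each adjacent bin individually receives only ≈ 4/π^2 < 1/2 of the probability, which is exactly why a λ_j in a rounding gap cannot be amplified to a single bin, while rounding to either adjacent bin can.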

Coherent Iterative Estimation
In this section we present the novel algorithms for phase estimation and energy estimation. As described in the introduction, both of these feature a strong similarity to 'iterative phase estimation' [Kit95], where the bits of the estimate are obtained one at a time. Unlike iterative phase estimation, however, the state is never measured and the entire process is coherent. We therefore call these algorithms 'coherent iterative estimators'.
Another similarity that these new algorithms share with the original iterative phase estimation is that the less significant bits are taken into account when obtaining the current bit. This greatly reduces the amount of amplification required for the later bits, so the runtime is vastly dominated by the estimation of the least significant bit. We will go into more detail on this later.
To permit discussion of coherent iterative estimation of phases and energies in a unified manner, we fit this idea into the modular framework of Definition 2 and Lemma 3. A 'coherent iterative estimator' obtains a single bit of the estimate, given access to all the previous bits. Several invocations of a coherent iterative estimator yield a regular estimator as in Definition 2. Furthermore, we can choose to uncompute the garbage at the very end using Lemma 3, or, as we will show, we can remove the garbage early, which prevents it from piling up.

Definition 6. A coherent iterative phase estimator is a protocol that, given some n, α, any k ∈ {0, ..., n − 1}, and any error target δ > 0, produces a quantum circuit involving calls to controlled-U and U†. If the unitary U satisfies an (n, α)-rounding promise, then this circuit implements a quantum channel that is δ-close to some map that, for every eigenstate |ψ_j⟩, performs: |0⟩ ⊗ |∆_k⟩ ⊗ |ψ_j⟩ → |bit_k(λ_j)⟩ ⊗ |∆_k⟩ ⊗ |ψ_j⟩. Here bit_k(λ_j) is the (k + 1)'th least significant bit of an n-bit binary expansion of λ_j, and |∆_k⟩ is a k-qubit register encoding the k least significant bits. (For example, if n = 4 and λ_j = 0.1011... then bit_2(λ_j) = 0 and ∆_2 = 11.) Note that the target map is only constrained on a subspace of the input Hilbert space, and can be anything else on the rest. A coherent iterative energy estimator is the same thing, just with controlled-U_H and U_H† queries to some block-encoding U_H of a Hamiltonian H that satisfies an (n, α)-rounding promise.
A coherent iterative estimator can also be with garbage and/or with phases, just like in Definition 2.
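The indexing in Definition 6 can be made concrete with a small helper (the function name is hypothetical, for illustration only):

```python
def bit_and_delta(lam, n, k):
    # bit_k is the (k+1)-th least significant bit of floor(2^n * lam);
    # Delta_k packs the k less significant bits.
    m = int(lam * 2**n)            # floor(2^n * lambda)
    bit_k = (m >> k) & 1
    delta_k = m & ((1 << k) - 1)
    return bit_k, delta_k

# Definition 6's example: n = 4 and lambda = 0.1011 in binary
lam = 0b1011 / 16
print(bit_and_delta(lam, 4, 2))  # (0, 3), i.e. bit_2 = 0 and Delta_2 = 11_2
```

Chaining these estimators then amounts to prepending each new output bit to ∆_k to form ∆_{k+1}.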
Lemma 7. Stitching together coherent iterative estimators. Given a coherent iterative phase/energy estimator with query complexity Q(n, k, α, δ'), we can construct a non-iterative phase/energy estimator with query complexity: Σ_{k=0}^{n−1} Q(n, k, α, δ · 2^{-k-1}). The non-iterative estimator has phases if and only if the coherent iterative estimator has phases, and if the coherent iterative estimator has m qubits of garbage then the non-iterative estimator has nm qubits of garbage.
Proof. We will combine n coherent iterative phase/energy estimators, for k = 0, 1, 2, ..., n − 1. The diamond norm satisfies a triangle inequality, so if we let the k'th iterative estimator have an error δ_k := δ · 2^{-k-1} then the overall error will be Σ_{k=0}^{n−1} δ_k ≤ δ. So now all that is left is to observe that the exact iterative estimators chain together correctly. This should be clear by observing that for all k > 0 we have |∆_{k+1}⟩ = |bit_k(λ_j)⟩ ⊗ |∆_k⟩, and |∆_0⟩ = 1 ∈ C since when k = 0 there are no less significant bits. So the k'th iterative estimator takes the k least significant bits as input and computes one more bit, until finally at k = n − 1 we have all n bits of the estimate. The total query complexity is just the sum of the n invocations of the iterative estimators for k = 0, ..., n − 1 with error δ_k. If the iterative estimator has garbage, then the garbage from each of the n invocations just piles up. Similarly, if the estimator is with phases, contributing e^{iϕ_{j,k}} on the k'th invocation, then the composition of the maps will have a phase Π_{k=0}^{n−1} e^{iϕ_{j,k}}.

Lemma 8. Removing garbage and phases from iterative estimators. Given a coherent iterative phase/energy estimator with phases and/or garbage and with query complexity Q(n, k, α, δ') that has a unitary implementation, we can construct a coherent iterative phase/energy estimator without phases and without garbage with query complexity 2Q(n, k, α, δ/2).
Proof. This argument proceeds exactly the same as Lemma 3, just with the extra |∆_k⟩ register trailing along. If Λ is the channel that implements the iterative estimator with garbage and/or phases, then we obtain an estimator without garbage and phases via the usual construction: apply Λ's unitary implementation, copy the output qubit into a fresh register, and apply the inverse (hence the factor of two in the query complexity). The advantage of the modular framework we just presented is that it maximizes the amount of flexibility when implementing these algorithms. How exactly uncomputation is performed will vary from application to application, and depending on the situation uncomputation may be performed before invoking Lemma 7 using Lemma 8, after Lemma 7 using Lemma 3, or not at all! Of course, the idea of uncomputation combined with iterative estimation is itself quite simple, so given a complete understanding of the techniques we present the reader may be able to perform this modularization themselves. However, we found that this presentation significantly de-clutters the main algorithms, those that actually implement the coherent iterative estimators from Definition 6. While the intuitive concept behind these strategies is not so complicated, the rigorous presentation and error analysis is quite intricate. We therefore prefer to discuss uncomputation separately.

Coherent Iterative Phase Estimation
This section describes our first novel algorithm, presented in Theorem 12. Before stating the algorithm in complete detail, performing an error analysis, and showing how to optimize the performance up to constant factors, we outline the tools that we will need and give an intuitive description.
As stated in the introduction, a very similar algorithm was independently discovered by [MRTC21]. The techniques of the two algorithms for phase estimation feature some minor differences: our work focuses on maintaining coherence of the input state and on the algorithm's constant-factor runtime improvement over prior art, and our runtime is O(2^n) rather than O(n · 2^n) (O(n) vs. O(n log n) in the language of [MRTC21]). On the other hand, [MRTC21] elegantly show how a quantum Fourier transform emerges as a special case of the algorithm, and their presentation is significantly more accessible. Both methods avoid the use of ancillae entirely.
A key tool for these algorithms is the block-encoding, which allows us to manipulate arbitrary non-unitary matrices. In this paper we simplify the notion a bit by restricting to square matrices with spectral norm ≤ 1.

Definition 9. Say A is a square matrix ∈ L(H). A unitary matrix U_A is a block-encoding of A if U_A acts on m qubits joined with H, and: (⟨0|^{⊗m} ⊗ I) U_A (|0⟩^{⊗m} ⊗ I) = A.
We can also make m explicit by saying that A has a block-encoding with m ancillae. We allow the quantum circuit implementing U_A to allocate ancillae in the |0⟩ state and then return them to the |0⟩ state with probability 1. These ancillae are not postselected and do not contribute to the ancilla count m.
A unitary matrix is a trivial block-encoding of itself. In this sense, we already have a block-encoding of the matrix Σ_j e^{2πiλ_j} |ψ_j⟩⟨ψ_j| = U. Recall that the goal of the coherent iterative estimator is to compute bit_k(λ_j). The strategy involves preparing an approximate block-encoding of Σ_j bit_k(λ_j) |ψ_j⟩⟨ψ_j|. We begin by rewriting the above expression a bit. Recall that ∆_k is an integer from 0 to 2^k − 1 encoding the k less significant bits of λ_j. If we subtract ∆_k/2^n from λ_j, we obtain a multiple of 1/2^{n−k} plus something < 1/2^n which we floor away. Then, bit_k(λ_j) indicates if this is an even or an odd multiple. That means we can write: bit_k(λ_j) = parity(floor(2^{n−k}(λ_j − ∆_k/2^n))). Since λ_j − ∆_k/2^n is a multiple of 1/2^{n−k} plus something < 1/2^n, we equivalently have that 2^{n−k}(λ_j − ∆_k/2^n) is an integer plus something less than 1/2^k. For such values, we can write the parity(floor(x)) function in terms of a squared cosine that has been 'amplified', where a shift φ centers the extrema of the cosine in the intervals where x occurs. Therefore: Σ_j bit_k(λ_j) |ψ_j⟩⟨ψ_j| = Σ_j A(cos^2 πλ_j^{(k)}) |ψ_j⟩⟨ψ_j|, where we have defined: λ_j^{(k)} := 2^{n−k−1}(λ_j − ∆_k/2^n) + φ_k for some k-dependent choice of phase φ_k = φ/2. By applying a phase shift conditioned on the |∆_k⟩ register, and then iterating it 2^{n−k−1} times, we can construct a block-encoding of the unitary with eigenvalues e^{2πiλ_j^{(k)}}. Our goal is now to transform this unitary into the projector above. To obtain a cosine we use linear combinations of unitaries [BCK15, CKS15] to take a linear combination with the identity: (I + U^{(k)})/2 = Σ_j |cos πλ_j^{(k)}| (± e^{iπλ_j^{(k)}} |ψ_j⟩)⟨ψ_j|, where ± indicates the sign of cos πλ_j^{(k)}. This way the above is a singular value decomposition of the block-encoded matrix, with singular values |cos πλ_j^{(k)}|. So all that is left to do is to approximately transform the singular values a of the matrix above by a → A(a^2). We will accomplish this using singular value transformation.
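The linear-combination step can be illustrated with a small numpy sketch: a Hadamard on one ancilla, a controlled-U, and another Hadamard block-encode (I + U)/2, whose singular values are |cos πλ_j| (the eigenvalues below are arbitrary illustrative choices, not values from the text):

```python
import numpy as np

lambdas = np.array([0.1, 0.3, 0.7])
d = len(lambdas)
U = np.diag(np.exp(2j * np.pi * lambdas))   # stand-in for the oracle unitary

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CU = np.block([[np.eye(d), np.zeros((d, d))],
               [np.zeros((d, d)), U]])       # controlled-U on the ancilla
V = np.kron(H, np.eye(d)) @ CU @ np.kron(H, np.eye(d))

top_left = V[:d, :d]                         # the <0| . |0> block
svals = np.linalg.svd(top_left, compute_uv=False)
print(np.sort(svals))  # matches sorted |cos(pi * lambda_j)|
```

The ancilla used here is exactly the single ancilla referred to later: the block-encoding's output qubit.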

Lemma 10. Singular value transformation. Say
A is a square matrix with singular value decomposition A = Σ_i a_i |ψ_i^l⟩⟨ψ_i^r|, and p(x) is a degree-d even polynomial with real coefficients satisfying |p(x)| ≤ 1 for |x| ≤ 1, and suppose A has a block-encoding U_A with m ancillae. Then, for any δ > 0, there exists an O(poly(d, log(1/δ)))-time classical algorithm that produces a circuit description of a block-encoding of the matrix Σ_i p(a_i) |ψ_i^r⟩⟨ψ_i^r|, accurate to error δ, which makes d queries to U_A and its inverse. Proof. This is a slightly simplified version of the main result of [GSLW18]. The m > 1 case involves an application of Corollary 18, which demands an extra control qubit that is also postselected. For an accessible review of the subject see [MRTC21].
We briefly elaborate on the m = 1 special case when U_A = Σ_i V_i ⊗ |ψ_i^l⟩⟨ψ_i^r| and the V_i are reflections. We find that when this is the case, the (W_x, S_z, ⟨+|·|+⟩)-QSP convention (see Theorem 13 of [MRTC21]) can actually be implemented by converting to the reflection convention and then using the circuit from Lemma 19 of [GSLW18], simply by applying a Hadamard gate to the ancilla register at the beginning and end of the circuit. This lets us implement polynomials with the constraints above without resorting to Corollary 18 of [GSLW18].
Note that the circuit from Lemma 19 of [GSLW18] (which is used in both the m = 1 and m > 1 cases) initializes an extra qubit to perform reflections. However, this qubit is guaranteed to be returned to the |0 state exactly and is not postselected, so it is not an ancilla in the sense of Definition 9.
Singular value transformation can perform the desired conversion if we can construct a polynomial A(x) with the amplifying behavior described next. If we invoke the above lemma with the even polynomial A(x^2), we get an approximation to the desired block-encoding of Σ_j bit_k(λ_j) |ψ_j⟩⟨ψ_j|.
In particular, the behavior we must capture in the polynomial approximation is that A(x) ≈ 1 when x ∈ [0, 1/2 − η] and A(x) ≈ 0 when x ∈ [1/2 + η, 1] for some gap η away from 1/2. If we view the input x as a probability, then A(x) essentially 'amplifies' this probability to something close to 0 or 1 (and additionally flips the outcome). We hence call A(x) an 'amplifying polynomial'.
One approach to constructing such an amplifying polynomial is to simply adapt a classical algorithm for amplification. We stated this method in the introduction: the polynomial is the probability that the majority vote of several coin tosses is heads, where each coin comes up heads with probability x. Then the desired properties can be obtained from the Chernoff-Hoeffding theorem [Diak09]. The number of coins we have to toss to accomplish a particular η, δ is the degree of the polynomial, which is bounded by O(η −2 log(δ −1 )).
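As a concrete instance of this majority-vote construction (the degree d = 301 and gap η = 0.1 are arbitrary illustrative choices, not values from the text):

```python
import math

def majority_poly(x, d):
    # Probability that fewer than half of d coins (each heads w.p. x) land
    # heads: a degree-d polynomial in x that is ~1 on [0, 1/2 - eta] and
    # ~0 on [1/2 + eta, 1] -- amplification with a built-in flip.
    return sum(math.comb(d, j) * x**j * (1 - x)**(d - j)
               for j in range((d + 1) // 2))

d, eta = 301, 0.1
print(majority_poly(0.5 - eta, d), majority_poly(0.5 + eta, d))
# Hoeffding predicts both errors are at most exp(-2 d eta^2) ~ 0.0024
```

The required degree grows like η^{-2} log(δ^{-1}), which is exactly the suboptimal scaling the next paragraph improves upon.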
However, this polynomial does not achieve the optimal η dependence of O(η −1 ). This might be achieved by a polynomial inspired by a quantum algorithm for approximate counting, which does achieve the O(η −1 ) dependence. But rather than go through such a complicated construction we simply adapt a polynomial approximation to the sign(x) function developed in [LC17], which accomplishes the optimal O(η −1 log(δ −1 )).
The method for obtaining M η→δ given η and δ is complicated enough that it is not worth re-stating here. However, the complexity is by all means worth it: in our numerical analyses we found that the polynomials presented in Appendix A of [LC17] feature excellent performance in terms of degree. This is a major source of the query complexity speedup of our algorithms.
The value of δ is chosen such that the final error in diamond norm is bounded. The value of η depends on how far cos^2 πλ_j^{(k)} can be guaranteed to stay from the points where cos^2 πλ_j^{(k)} = 1/2 exactly; at such points η = 0 and amplification is impossible. This is where the rounding promise comes in: it ensures that λ_j^{(k)} is always sufficiently far from such values, so that we can guarantee that cos^2 πλ_j^{(k)} is always either ≥ 1/2 + α/2 or ≤ 1/2 − α/2. So we select η = α/2. When k = 0 then indeed the rounding promise is the only thing guaranteeing that amplification will succeed. However, if the less significant bits have already been computed then the set of values that λ_j^{(k)} can take is restricted. This is because the previous bits of λ_j have been subtracted off. This widens the region around the solutions of cos^2 πλ_j^{(k)} = 1/2 where λ_j^{(k)} cannot be found, allowing us to increase η. Furthermore, this can be done without relying on the rounding promise anymore: bits with k ≥ 1 are guaranteed to be deterministic even if no rounding promise is present. That means that if the rounding promise is violated, then only the least significant bit can be wrong. The polynomials are sketched in Figure 2.

Figure 2 ('Polynomials for Coherent Iterative Estimation', n = 4): Sketch of the polynomials used in Theorems 12 and 15. In black we show a shifted cos^2(x) function that indicates if the bit is 0 or 1. Then, the amplifying polynomial from Lemma 11 is applied to it to yield the dashed line, which is either ≤ δ or ≥ 1 − δ depending on the bit. For the k = 0 bit, the gaps between the allowed intervals are guaranteed by the rounding promise. But as more bits are estimated and subtracted off, the gaps for k ≥ 1 require no rounding promise, and also become larger and larger, so less and less amplification is needed.

After constructing the amplifying polynomial A_{η→δ}, we use singular value transformation to apply A_{η→δ}(x^2), which is even as required by Lemma 10. Now we have an approximate block-encoding of Σ_j bit_k(λ_j) |ψ_j⟩⟨ψ_j|, which is in fact a projector. In the introduction we stated that we would use a block-measurement theorem to compute the map |ψ⟩ → |0⟩ ⊗ Π|ψ⟩ + |1⟩ ⊗ (I − Π)|ψ⟩ given an approximate block-encoding of Π. However, this general tool involves uncomputation, which we specifically wanted to modularize. The fact that we are already measuring errors in terms of the diamond norm means that Lemmas 3 and 8 are already capable of dealing with garbage. We therefore defer the proof of this general tool to Section 4, specifically Theorem 19. There is another reason not to use Theorem 19 in a black-box fashion, specifically for coherent iterative phase estimation. The block-encoding of Σ_j bit_k(λ_j) |ψ_j⟩⟨ψ_j| actually features only one ancilla qubit: the qubit we used to take the linear combination of U and the identity. That means that the block-encoding itself is already very close to the map |0⟩ ⊗ |ψ⟩ → |0⟩ ⊗ Π|ψ⟩ + |1⟩ ⊗ (I − Π)|ψ⟩ (note the flipped output qubit). The details of this usage of the block-encoding will appear in the proof. This completes the sketch of the procedure to implement the map for some phases ϕ_j. We can now state the protocol in detail and perform the accuracy analysis.
Proof. We construct the estimator as follows: 1. Construct a unitary that performs a phase shift depending on ∆_k. We call this e^{−2πi∆_k/2^n}, employing some notation inspired by the physics literature.
2. Rewrite the oracle unitary in this notation as U = e^{2πiλ}. Then define: λ_j^{(k)} := 2^{n−k−1}(λ_j − ∆_k/2^n) + φ_k, and implement the corresponding phase shift: U^{(k)} := e^{2πiφ_k} (e^{−2πi∆_k/2^n} · e^{2πiλ})^{2^{n−k−1}}, which acts as e^{2πiλ_j^{(k)}}. This unitary acts jointly on the |∆_k⟩ and |ψ_j⟩ inputs. Since k ∈ {0, ..., n − 1}, the exponent 2^{n−k−1} is always an integer.
3. Use one ancilla qubit to take a linear combination with the identity, as outlined before the proof, yielding a unitary U_signal^{(k)}. Observe that: (I + U^{(k)})/2 = Σ_j |cos πλ_j^{(k)}| (± e^{iπλ_j^{(k)}} |ψ_j⟩)⟨ψ_j|, where ± is the sign of cos πλ_j^{(k)}. In other words, U_signal^{(k)} is a block-encoding of a matrix with singular values |cos πλ_j^{(k)}|: the above is a singular value decomposition of the block-encoded matrix.
4. Choose the amplification threshold via: η_0 := α/2, and η_k := 1/2 − 2^{−k}(1/2 + α/2) for k ≥ 1. Also, let δ_amp < 1 be an error threshold we will pick later. Now, let A_{η→δ_amp}(x) be the polynomial from Lemma 11. Viewing U_signal^{(k)} as a block-encoding of cos πλ_j^{(k)}, apply singular value transformation as in Lemma 10 to U_signal^{(k)} with a polynomial p̃(x) approximating A_{η_k→δ_amp}(x^2) to accuracy δ_svt, which we also pick later.
Lemma 10 applies because A_{η→δ_amp}(x^2) is even. Furthermore, U_signal^{(k)} has only one ancilla qubit, and has the special form where it implements a reflection on the ancilla (116). Thus, the circuit from Lemma 10 also has just one ancilla.
Call the resulting circuit U_svt^{(k)}, which implements a unitary block-encoding Σ_j γ(λ_j^{(k)}) |ψ_j⟩⟨ψ_j| for some matrix elements γ(λ_j^{(k)}).

Now we prove that U_svt^{(k)} is an iterative phase estimator with phases. Recall that U_svt^{(k)} is a block-encoding with only one ancilla, and that ancilla is the output qubit of the map.
We must show that U_svt^{(k)} is close in diamond norm to a map that leaves the first qubit as |bit_k(λ_j)⟩ whenever ∆_k encodes the k least significant bits of an n-bit binary expansion of λ_j.
To study this map we will proceed through the recipe above, proving statements about the expressions encountered along the way. In step 2 we defined λ_j^{(k)} := 2^{n−k−1}(λ_j − ∆_k/2^n) + φ_k. We discuss the relationship between λ_j^{(k)} − φ_k and bit_k(λ_j). If k = 0 then ∆_k = 0, and due to the rounding promise we find λ_j in regions of the form m/2^n + [α/2^n, 1/2^n] for integers m. Thus, we find λ_j^{(0)} − φ_0 in regions of the form m/2 + [α/2, 1/2]. The function bit_k(λ_j) just indicates the parity of m. If k > 0 then we can show a similar property. While for k = 0 we used the rounding promise to guarantee that λ_j^{(0)} − φ_0 only falls into certain regions, for larger k we simply use the fact that ∆_k has been subtracted off in the definition of λ_j^{(k)}. That means that the regions where any of the bits below bit_k(λ_j) would be 1 are no longer possible. Note that the claim for k > 0 did not make use of the rounding promise, and is true regardless of whether the rounding promise holds. These regions are shown in Figure 3. In step 3 we defined U_signal^{(k)}, which is a block-encoding of cos πλ_j^{(k)}. Later, this will approximately be transformed by singular value transformation via x → A_{η→δ_amp}(x^2), so we employ the trigonometric identity: cos^2 πλ_j^{(k)} = (1 + cos 2πλ_j^{(k)})/2. Clearly this is a probability, and since cosine has period 2π, the integer part floor(λ_j^{(k)} − φ_k) does not matter. We argue that this probability is either ≥ 1/2 + η_k or ≤ 1/2 − η_k depending on bit_k(λ_j). See Figure 3. The idea is that the φ_k are chosen precisely so that the peaks and troughs of cos^2 πλ_j^{(k)} line up with the centers of the intervals corresponding to bit_k(λ_j) = 0 and bit_k(λ_j) = 1 respectively. Then, the nodes of cos^2 πλ_j^{(k)} line up with the midpoints of the disallowed regions.

Figure 3: We are guaranteed that λ_j^{(k)} can only appear in the un-shaded regions: for k = 0 the rounding promise rules out the shaded regions, and for k ≥ 1 we have subtracted ∆_k/2^n off of λ_j, preventing regions where previous bits are 1.
We can see how the probability cos^2 πλ_j^{(k)} is close to 1 if bit_k(λ_j) = 0 and close to 0 if bit_k(λ_j) = 1. To make this claim precise, we simply fit a line with slope 2 to the points where the probability intersects 1/2, and see that this line alternatingly gives upper or lower bounds on the probability. So if η_k is equal to half the distance between allowed intervals, then on the allowed intervals the probability cos^2 πλ_j^{(k)} is alternatingly ≥ 1/2 + η_k and ≤ 1/2 − η_k. By Lemma 11, A_{η_k→δ_amp}(cos^2 πλ_j^{(k)}) is then correspondingly ≤ δ_amp or ≥ 1 − δ_amp. And finally, since p̃(x) approximates A_{η_k→δ_amp}(x^2) to accuracy δ_svt, the amplitude p̃(cos πλ_j^{(k)}) is within δ_amp + δ_svt of bit_k(λ_j). The circuit for U_svt^{(k)} is completely unitary, so the resulting state is normalized; therefore, the amplitude of the branch orthogonal to the block-encoded output is also controlled. Using this fact we can reason that if p̃(cos πλ_j^{(k)}) is within δ_amp + δ_svt of 1, then the output qubit is close to one basis state; similarly, if p̃(cos πλ_j^{(k)}) is within δ_amp + δ_svt of 0, then the output qubit is close to the other. Now that we have bounds on the amplitudes of the output state, we can bound its distance to |bit_k(λ_j)⟩ for a favorable choice of e^{iϕ_j}. Say bit_k(λ_j) = 0. Then we select ϕ_j = 0. Otherwise, if bit_k(λ_j) = 1, then we define ϕ_j via the phase of γ(λ_j^{(k)}). Either way, the output state is within 2(δ_amp + δ_svt) in spectral norm of the ideal state. Thus, the unitary map U_svt^{(k)} is close in spectral norm to some ideal unitary. Invoking Lemma 4, the distance in diamond norm is at most δ. The inequality holds if we select, for any m > 0: δ_amp := (1 − 10^{−m})δ^2/8 and δ_svt := 10^{−m}δ^2/8. This solution to the inequality relies on the fact that only δ_amp actually enters the query complexity, so if classical resources are cheap but query complexity is expensive then we can make the classical computer do as much work as possible by making m larger.
Finally, we compute the query complexity. An invocation of e^{2πiλ_j^{(k)}} requires 2^{n−k−1} invocations of U = e^{2πiλ}. By Lemma 10, the number of queries made by the unitary U_svt^{(k)} to U_signal^{(k)} is the degree of the polynomial. By Lemma 11, the polynomial A_{η→δ_amp}(x^2) has degree 2 · M_{η→δ_amp}, so the query complexity is 2^{n−k−1} · 2M_{η_k→δ_amp}. A really nice feature of the coherent iterative phase estimator we present is that it produces no garbage qubits. All singular value transformation is performed on the final output qubit. It does still produce extra phase shifts between the eigenstates, which in some applications may need to be uncomputed. However, in applications where phase differences between eigenstates do not matter, like thermal state preparation, we expect that this uncomputation step can be skipped.
To finish the discussion of coherent iterative phase estimation, we stitch the iterative estimator we just defined into a regular phase estimator. In doing so, we also remark on what happens when no rounding promise is present. A summary of the construction is presented in Figure 4.
Corollary 13. Improved phase estimation. The coherent iterative phase estimator from Theorem 12 can be combined with Lemma 7 to make a phase estimator with phases with query complexity in O(2^n α^{−1} log(δ^{−1})), assuming α is bounded away from 1 by a constant. Furthermore, if no rounding promise is given, then the estimator δ-approximates in diamond norm a map: |ψ_j⟩ → (ξ_j |floor(2^n λ_j)⟩ + ζ_j |λ̃_j⟩) ⊗ |ψ_j⟩ for some complex amplitudes ξ_j, ζ_j, where λ̃_j := floor(2^n λ_j) − 1 mod 2^n is an erroneous estimate. The performance is the same, except that 0 < α < 1 can be any constant.
Proof. Write η_0 := α/2 and η_k := 1/2 − 2^{−k}(1/2 + α/2) for k ≥ 1, and, following Lemma 7, demand an accuracy of δ_amp,k := (1 − 10^{−m})(δ·2^{−k−1})^2/8 for the k'th bit. Recall from Lemma 11 that M_{η_k→δ_amp} ∈ O(η_k^{−1} log(δ_amp^{−1})). Then, the overall query complexity is Σ_{k=0}^{n−1} 2^{n−k−1} · 2M_{η_k→δ_amp,k}. When k = 0 we have η_0 = α/2, and when k > 0 we have η_k ≥ (1 − α)/4, which is bounded from below by a constant. We can use this to split the sum: the k = 0 term contributes O(2^n α^{−1} log(δ^{−1})), and the k > 0 terms together contribute O(2^n log(δ^{−1})). This completes the runtime analysis. Now we turn to the case when no rounding promise is present. Notice that when k ≥ 1, the regions where λ_j is assumed not to appear are excluded by the previously estimated bits alone, so the k ≥ 1 estimators behave deterministically regardless of the promise. Thus, the only bit that can be wrong is the first bit. When an eigenvalue λ_j falls into a region disallowed by the rounding promise, the first bit will be some superposition ξ|0⟩ + ζ|1⟩.
Flipping the final bit of an estimate in general results in an error of ±1. However, recall that when estimating future bits the value of ∆ k /2 n is subtracted off from λ j . That means that when we erroneously measure a 1, the rest of the algorithm proceeds to measure λ j − 1/2 n instead. As a result, if the first bit is wrong, then the algorithm will output floor(2 n λ j ) − 1 instead. Since the algorithm is periodic in λ j with period 1, the output will be 2 n − 1 if an error occurs when floor(2 n λ j ) = 0.
Notice that if we had allocated the error evenly between the n steps, then we would have incurred an extra O(2 n log n) term in the above. Spreading the error via a geometric series avoids this, and we find that we obtain better constant factors with this choice as well. This is because the k = 0 term dominates, so we want to make δ amp,k as large as possible.
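The effect of the two error-allocation schedules on the cost sum can be compared with a quick sketch (the cost model keeps only the 2^{n−k−1} · log(1/δ_k) factors and drops the η_k dependence; the values of n and δ are arbitrary illustrations):

```python
import math

def cost(n, deltas):
    # queries ~ sum_k 2^(n-k-1) * log(1/delta_k), ignoring eta_k factors
    return sum(2**(n - k - 1) * math.log(1 / deltas[k]) for k in range(n))

n, delta = 20, 1e-3
even_split = cost(n, [delta / n] * n)                        # delta_k = delta/n
geometric = cost(n, [delta * 2**-(k + 1) for k in range(n)]) # delta_k = delta/2^(k+1)
print(geometric / even_split)  # < 1: the geometric schedule is cheaper
```

The even split pays log n on every term including the dominant k = 0 one, while the geometric split concentrates its generosity exactly where the 2^{n−k−1} weight is largest.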

Coherent Iterative Energy Estimation
While for phase estimation we are given access to j e 2πiλ j |ψ j ψ j |, for energy estimation we have a block-encoding of j λ j |ψ j ψ j |. For phase estimation we could relatively easily synthesize a cosine in the eigenvalues, just by taking a linear combination with the identity. But for energy estimation we must build the cosine directly.
To do so we leverage a tool employed by [GSLW18] to perform Hamiltonian simulation. The Jacobi-Anger expansion yields highly efficient polynomial approximations to sin(tx) and cos(tx). To perform Hamiltonian simulation one then takes the linear combination cos(tx) + i sin(tx). However, we only need the cosine component.
Lemma 14. Polynomial approximation of the cosine. For any t > 0 and δ > 0, let p_cos,t(x) denote the truncation of the Jacobi-Anger expansion of cos(tx), with the truncation order R chosen as in [GSLW18]. Then p_cos,t(x) is an even polynomial of degree 2R such that for all x ∈ [−1, 1]: |p_cos,t(x) − cos(tx)| ≤ δ. Furthermore: |p_cos,t(x)| ≤ 1. Proof. These results are shown in Lemma 57 and Lemma 59 of [GSLW18], outlined in their Section 5.1.
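The truncated Jacobi-Anger expansion can be sketched without any special-function library, using the integral representation of the Bessel functions (the truncation order R = 8 and t = π are arbitrary illustrative choices, not the parameters of the lemma):

```python
import numpy as np

def bessel_j(m, t, points=4001):
    # J_m(t) = (1/pi) * integral_0^pi cos(m*theta - t*sin(theta)) d(theta),
    # evaluated with a composite trapezoid rule
    theta = np.linspace(0, np.pi, points)
    f = np.cos(m * theta - t * np.sin(theta))
    h = theta[1] - theta[0]
    return h * (f[0] / 2 + f[1:-1].sum() + f[-1] / 2) / np.pi

def cos_tx(x, t, R):
    # Jacobi-Anger: cos(t x) = J_0(t) + 2 sum_{k>=1} (-1)^k J_{2k}(t) T_{2k}(x)
    total = bessel_j(0, t) * np.ones_like(x)
    for k in range(1, R + 1):
        cheb = np.cos(2 * k * np.arccos(np.clip(x, -1, 1)))  # T_{2k}(x)
        total += 2 * (-1)**k * bessel_j(2 * k, t) * cheb
    return total

x = np.linspace(-1, 1, 201)
err = np.max(np.abs(cos_tx(x, np.pi, R=8) - np.cos(np.pi * x)))
print(err)  # tiny: the Bessel coefficients decay superexponentially once 2R > t
```

The rapid decay of J_{2k}(t) beyond 2k ≈ t is what makes the degree of the expansion essentially linear in t, which is the "high performance" referred to below.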
Again, computation of the degree of the Jacobi-Anger expansion is a bit complicated, but the complexity is worth it due to the method's high performance. Our approach is to first synthesize cos πλ (k) j to obtain a signal that oscillates to indicate bit k (λ j ), and then apply A η→δ (x 2 ) to amplify the signal to 0 or 1, as shown in Figure 2.
Actually, one might observe that synthesizing cos(πλ_j^{(k)}) first is not necessary to make polynomials that look like those in Figure 2. Instead one can take an approach similar to the one used for making rectangle functions in [LC17]: simply shift, scale, and add several amplifying polynomials A_{η→δ}(x) to make the desired shape. None of these operations affect the degree, so this approach also yields the same asymptotic scaling O(2^n α^{−1} log(δ^{−1})) as phase estimation. We will see in Corollary 16 that the method using the Jacobi-Anger expansion actually achieves the worse scaling O(α^{−1} log(δ^{−1})(2^n + log(α^{−1}))).
The reason why we present the approach using the Jacobi-Anger expansion, despite it having worse asymptotic scaling, is that in the regime of interest (n ≈ 10, α ≈ 2^{−10}) we numerically find that the Jacobi-Anger expansion actually performs better. There may be a regime where it is better to remove the Jacobi-Anger expansion from the construction, in which case the algorithm is easily adapted.
The rest of the construction of Theorem 15 strongly resembles coherent iterative phase estimation, so much so that we can re-use parts of the proof of Theorem 12. One further difference is that this estimator now has garbage, because we have no guarantee that the block-encoding of the Hamiltonian only has one ancilla.

Theorem 15. Coherent Iterative Energy Estimation.
Say the block-encoding of H requires a ancillae, that is, U_H acts on C^{2^a} ⊗ H. Then there is an iterative energy estimator with phases and a + n + 3 qubits of garbage with query complexity: where M_{η→δ} and η_k are as in Lemma 11, δ_amp can be chosen to be (1 − 10^{−m_svt})δ²/8 for any m_svt > 0, and we can choose any m_cos > 0.
Proof. We construct the estimator as follows: 1. Let W_k be a hermitian matrix on k qubits defined by: W_k has a block-encoding which can be constructed as follows. First, prepare |+⟩^{⊗(n−1)}: Next, observe that ∆_k ≤ 2^k − 1 ≤ 2^{n−1} − 1. Into an ancilla register compute the comparison |x < ∆_k⟩, and uncompute any garbage necessary to do so.
Then postselect that the final register is in the |1⟩ state: Finally, postselect that the x register is in the |+⟩^{⊗(n−1)} state: This process makes use of n − 1 ancillae initialized and postselected to |+⟩, and one more ancilla that is postselected to |1⟩.
2. Use linear combinations of unitaries to construct a block-encoding of: where the φ_k are selected just as in Theorem 12. Observe that φ_k < 1/2, so 4φ_k 2^{k−n} is a probability which can be block-encoded. Since the block-encoding of H has a ancillae, that of W_k has n ancillae, and the three terms in the linear combination need two control qubits, the block-encoding of H^{(k)} has a + n + 2 ancillae.
3. Use Lemma 14 to construct a polynomial approximation p_{cos,π2^{n−k}}(x) of cos(π2^{n−k} x) to accuracy 2δ_cos, to be picked later.
Use Lemma 11 to construct a polynomial A_{(η−δ_cos)→δ_amp}. We will pick δ_amp later and select η_k just as in Theorem 12.
Finally, use Lemma 10 to construct U_svt, a block-encoding of p̃(H^{(k)}) which approximates the even polynomial: To perform singular value transformation we needed one extra ancilla, so U_svt has a + n + 3 ancillae; these are the garbage output of this map.
We rewrite H^{(k)} in terms of its eigendecomposition: where we defined λ_j^{(k)} := 2^{n−k−1}(λ_j − ∆_k/2^n) + φ_k just like in Theorem 12. This protocol implements a map: Here γ(λ_j^{(k)}) is defined such that all the other amplitudes for the failed branches of the block-encoding are absorbed into the normalized state |gar_{1,j}⟩.
We see that p̃(2^{k−n} λ_j^{(k)}) approximates: which is the same expression encountered in Theorem 12, with the same definition of λ_j^{(k)}, up to a minor shift in η_k. Therefore, we can follow the same reasoning as in Theorem 12 up to two minor differences, and arrive at the exact same conclusion. Namely, if we select some m_svt > 0 and then let: then the unitary channel we implement is at most δ-far in diamond norm from a channel that implements the map: whenever ∆_k encodes the last k bits of λ_j. Since the argument in Theorem 12 is lengthy and the modifications are extremely minor, we will not repeat the argument here. Instead we will articulate the two things that change.
First, Theorem 12 gives an estimator with phases and no garbage, whereas here we also have garbage. The garbage just tags along for the entire calculation, and when we come to selecting ϕ_j depending on λ_j we can also select |gar_{λ_j}⟩ = |gar_{0/1,j}⟩ depending on bit_k(λ_j).
Second, we incur an error of δ_cos in the approximation of cos(π2^{n−k} x) with p_{cos,π2^{n−k}}(x). Since we show that cos²(π2^{n−k} x) is bounded away from 1/2 by ±η, we therefore have that p_{cos,π2^{n−k}}(x)² is bounded away from 1/2 by ±(η − δ_cos). The amplifying polynomial then proceeds to amplify (η − δ_cos) → δ_amp appropriately, so the calculation proceeds the same. We just need to ensure that η − δ_cos > 0, so we select δ_cos := 10^{−m_cos} · η for some m_cos > 0. This completes the accuracy analysis.

Finally, we analyze the query complexity. The block-encoding of H^{(k)} makes one query to U_H, so by Lemma 10 the query complexity is exactly the degree of p̃(x), which is the degree of A_{(η−δ_cos)→δ_amp}(p²_{cos,π2^{n−k}}(x)). From Lemma 14 and Lemma 11, the degree is: Substituting the definitions for η, δ_amp and δ_cos yields the final runtime. As with Theorem 12, δ_amp can be made larger by increasing m_amp. By increasing m_cos we can decrease δ_cos, which decreases the degree of A_{(η−δ_cos)→δ_amp}(x) while increasing the degree of p_{cos,π2^{n−k}}(x). The Jacobi-Anger expansion deals with error more efficiently than the amplifying polynomial, so in practice m_cos should be quite large.
As with phase estimation, we also pack the coherent iterative energy estimator into a regular energy estimator using Lemma 7. This time, since the iterative estimator produces garbage it makes sense to uncompute the garbage using Lemma 8.
Corollary 16. Improved Energy Estimation. The iterative energy estimator with phases and garbage from Theorem 15 can be combined with Lemma 8 and Lemma 7 to make an energy estimator without phases and without garbage with query complexity bounded by: assuming that α is bounded away from 1 by a constant. Furthermore, even when there is no rounding promise, there exists an algorithm that, given an eigenstate |ψ_j⟩ of the Hamiltonian with energy λ_j, for any δ > 0 performs a transformation δ-close in diamond norm to the map: where p is some probability and λ̃_j = floor(2^n λ_j) − 1 mod 2^n is an erroneous estimate. Just as in Corollary 13, the performance is the same except that 0 < α < 1 can be any constant.
Proof. As with Corollary 13, we write η_k to make the k dependence explicit and demand accuracy δ_{amp,k} = (1 − 10^{−m_amp})(δ2^{−k−1})²/8 for the k'th bit. From Lemma 14 we obtain an asymptotic upper bound r(t, ε) ∈ O(t + log(ε^{−1})). Again, recall from Lemma 11 that M_{η_k→δ_amp} ∈ O(η_k^{−1} log(δ_amp^{−1})). The asymptotic query complexity of the iterative energy estimator from Theorem 15 is then bounded by:

Next we invoke Lemma 8 to remove the phases and the garbage, doubling the query complexity. We do this before invoking Lemma 7, because Lemma 7 involves blowing up the number of garbage registers by a factor of n. While we could also invoke Lemma 7 and then invoke Lemma 3 to obtain a map without garbage, this would involve many garbage registers sitting around waiting to be uncomputed for a long time. If we invoke Lemma 8 first we get rid of the garbage immediately. Finally, we invoke Lemma 7 to turn our iterative energy estimator without garbage and phases into a regular energy estimator without garbage and phases. As in Corollary 13, we observe that η_0 = α/2 and for k > 0 we have η_k bounded from below by a constant. Then, the total query complexity is:

Next, we show that it is possible to implement a map that, given an eigenstate |ψ_j⟩, measures an estimate that is either floor(2^n λ_j) or floor(2^n λ_j) − 1 mod 2^n with some probability. Note that this is not the same algorithm as above. Just as with phase estimation, it is only the first bit that actually needs the rounding promise; all other bits are guaranteed to be deterministic.
The first bit performs a map of the form: We immediately see that Lemma 8 cannot be used to perform uncomputation here, because uncomputation only works when p = 1 or p = 0. Instead, we simply measure the output register containing |0⟩ or |1⟩ and discard the garbage. This would damage any superposition over the |ψ_j⟩, which is why this algorithm only works if the input is an eigenstate. We then use iterative estimators for the remaining bits to compute the rest of the estimate. This time their outputs will be deterministic, but there is no point in doing uncomputation since the superposition has already collapsed. Instead, we do the same thing as for the k = 0 estimator: compute the next bit and some garbage, and discard the garbage. The final answer is either floor(2^n λ_j) or floor(2^n λ_j) − 1 for the same reason as in Corollary 13.

Performance Comparison
Above, we have presented a modular framework for phase and energy estimation using the key ingredients in Theorem 12 and Theorem 15 respectively. These results already demonstrate several advantages over textbook phase estimation as presented in Proposition 5.
First, they eliminate the QFT and do not require a sorting network to perform median amplification. Instead, they rely on just a single tool: singular value transformation.
Second, they require far fewer ancillae. Our improved phase estimation algorithm requires no ancillae at all, and is merely 'with phases' so arguably does not even need uncomputation for some applications. On the other hand, textbook phase estimation requires O((n + log(α^{−1})) log(δ^{−1})) ancillae in order to implement median amplification.
Given a block-encoding of a Hamiltonian with a ancillae, our energy estimation algorithm requires a + n + 3 ancillae. But in order to even compare Proposition 5 to Theorem 15 we need a method to perform energy estimation using textbook phase estimation. This is achieved through Hamiltonian simulation.
Which method of Hamiltonian simulation is best depends on the particular physical system involved. Hamiltonian simulation using the Trotter approximation can perform exceedingly well in many situations [SHC20]. However, in our analysis we must be agnostic to the particular Hamiltonian in question, and furthermore need a unified method for comparing the performance. Hamiltonian simulation via singular value transformation [LC17, GSLW18] lets us compare Proposition 5 and Theorem 15 on the same footing. After all, this method features the best known asymptotic performance in terms of the simulation time [Childs&20] in a black-box setting.
Singular value transformation constructs an approximate block-encoding of e^{iHt} with a + 2 ancillae. Since e^{iHt} is unitary, in the ideal case the ancillae start in the |0⟩ state and are also guaranteed to be mapped back to the |0⟩ state. But in the approximate case the ancillae are still slightly entangled with the remaining registers, so they become an additional source of error. Certainly the ancillae cannot be re-used to perform singular value transformation again, because then the errors pile up with each use. Thus, we are in a similar situation to Lemma 3, where we must discard some qubits and take into account the error.
Therefore, we must do some additional work beyond the Hamiltonian simulation method presented in [GSLW18], to turn the approximate block-encoding of a unitary into a channel that approximates the unitary in diamond norm. The main trick for this proof is to consider postselection of the ancilla qubits onto the |0⟩ state. Then the error splits into two parts: the error of the channel when the postselection succeeds, and the probability that the postselection fails.
The Hamiltonian simulation method is extremely accurate, letting us obtain block-encodings with an error that decays exponentially in the query complexity to U_H. Thus, the error contribution of this channel is almost entirely negligible in the performance analysis. The purpose of the argument below is really to provide a lemma analogous to Lemma 4 that lets us convert spectral-norm error bounds on block-encodings to diamond-norm bounds. That way, our entire error analysis is completely formal, as it should be for a fair comparison of all algorithms involved. Notably, an ε-accurate block-encoding of a unitary is not the same thing as an ε-accurate implementation of that unitary: the latter implies a channel with diamond norm error 2ε while the former yields an error 4ε due to the ancilla registers.

Lemma 17. Hamiltonian simulation. Say U_H is a block-encoding of a Hamiltonian H.
Then, for any t > 0 and δ > 0 there exists a quantum channel that is δ-close in diamond norm to the channel induced by the unitary e^{iHt}. This channel can be implemented using: Proof. This is an extension of Theorem 58 of [GSLW18], which states that there exists a block-encoding U_A of a matrix A such that ‖A − e^{iHt}‖ ≤ ε with query complexity 3r(et/2, ε/6). This result leverages the Jacobi-Anger expansion (Lemma 14) to construct approximate block-encodings of sin(tH) and cos(tH) and uses linear combinations of unitaries to approximate e^{iHt}/2. Then it uses oblivious amplitude amplification to get rid of the factor of 1/2, obtaining U_A. If U_H is a block-encoding with a ancillae, then U_A has a + 2 ancillae.
All that is left to do is to turn this block-encoding into a quantum channel that approximately implements e^{iHt}. To do so, we just initialize the ancillae to |0^{a+2}⟩, apply U_A, and trace out the ancillae. We can write this channel Λ as: To finish the theorem we must select an ε so that the error in diamond norm between Λ and the channel implemented by e^{iHt} is bounded by δ. To do so, we write Λ as a sum of postselective channels Λ_i(ρ): so that Λ = ∑_i Λ_i. If we let Γ_{e^{iHt}}(ρ) := e^{iHt} ρ e^{−iHt}, then we can bound the error in diamond norm using the triangle inequality: Now we proceed to bound the two terms individually. Observe that: Since ‖A − e^{iHt}‖ ≤ ε, we can invoke Lemma 4 and see that ‖Λ_0 − Γ_{e^{iHt}}‖ ≤ 2ε, and we are done with the first term.
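This postselection decomposition is easy to illustrate numerically. The sketch below is an assumption-laden toy (a random 2×2 Hamiltonian and a generic unitary-dilation block-encoding, rather than the [GSLW18] construction) that checks, on a test state, that the resulting channel is within the claimed few-ε trace distance of conjugation by e^{iHt}:

```python
import numpy as np

rng = np.random.default_rng(7)
d, t = 2, 1.3

def psd_sqrt(M):
    # Square root of a Hermitian PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def unitary_dilation(A):
    # Standard block-encoding of a matrix with ||A|| <= 1:
    # U = [[A, sqrt(I - A A†)], [sqrt(I - A† A), -A†]] is unitary.
    I = np.eye(A.shape[0])
    return np.block([[A, psd_sqrt(I - A @ A.conj().T)],
                     [psd_sqrt(I - A.conj().T @ A), -A.conj().T]])

# Target unitary V = e^{iHt} for a random Hamiltonian.
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
Hm = (G + G.conj().T) / 2
w, P = np.linalg.eigh(Hm)
V = (P * np.exp(1j * w * t)) @ P.conj().T

# A: a slightly perturbed (hence sub-normalized) approximation to V.
E = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
A = (V + 0.01 * E) / (1 + 0.01 * np.linalg.norm(E, 2))
eps = np.linalg.norm(A - V, 2)          # spectral-norm error of the block

# Λ(ρ) = Tr_anc[U (|0><0| ⊗ ρ) U†] splits into postselective branches with
# Kraus operators K_i = (<i| ⊗ I) U (|0> ⊗ I); K_0 = A is the success branch.
U = unitary_dilation(A)
Ks = [U[i * d:(i + 1) * d, 0:d] for i in range(U.shape[0] // d)]

rho = np.full((d, d), 0.5, dtype=complex)          # test state |+><+|
diff = sum(K @ rho @ K.conj().T for K in Ks) - V @ rho @ V.conj().T
trace_dist = 0.5 * np.sum(np.abs(np.linalg.eigvalsh(diff)))
```

On a fixed input state the trace distance lower-bounds the diamond-norm distance, so `trace_dist` landing well below 4ε is consistent with (though weaker than) the diamond-norm claim.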
The above proof uses the trick for the proof of the block-measurement theorem that we mentioned in the introduction. We will need it again in the proof of Theorem 19. Thus, while we are at it, we may as well state the generalization of this result to any block-encoded unitary as a proposition.
Proposition 18. Say U_A is a block-encoding of A, which is ε-close in spectral norm to a unitary V. Then there exists a quantum channel 4ε-close in diamond norm to the channel ρ → VρV†.
Proof. Lemma 17 proved this result with V = e iHt . The exact same argument holds for abstract V .
Returning to the ancilla discussion, say that the block-encoding of the input Hamiltonian has a ancillae. Then we see that Theorem 15 requires a + n + 3 ancillae, while textbook phase estimation combined with Lemma 17 requires O(a + (n + log(α^{−1})) log(δ^{−1})). This is because the extra ancillae required to perform the procedure in Lemma 17 can be discarded and reset after each application of e^{iHt}. Also note that despite the fact that the above implementation of e^{iHt} is not unitary, the overall protocol in Proposition 5 is still approximately invertible as required by Lemma 3, since we can just use Lemma 17 to implement e^{−iHt} instead.
Next, we compare the algorithms in terms of their query complexity to the unitary U or the block-encoding U_H. Note that this is not the gate complexity: the number of gates from some universal gate set used to implement the algorithm. The reason for this choice is that the gate complexity of U or U_H depends on the application, and is likely much larger than that of any of the additional gates used to implement singular value transformation. The algorithms' query complexities depend on three parameters: α, n, and δ. In the following we discuss the impact of these parameters on the complexity and show how we arrive at the 14x and 10x speedups stated in the introduction.
The following Jupyter notebook contains the code we used to evaluate the query complexities of the algorithms: https://github.com/qiskit-community/improved-phase-estimation/blob/main/phase_estimation_performance.ipynb The implementation of the algorithms follows the constructions in Corollary 13 and Corollary 16 with only one minor optimization: rather than setting η_k based on a linear lower bound on cos(πλ^{(k)}), we simply evaluate cos(πλ^{(k)}) at the relevant point. The performance in terms of α is plotted in Figure 5, for fixed n = 10 and δ = 10^{−30}. This comparison shows several features. First, we see that the novel algorithms in this paper are consistently faster than the traditional methods in this regime. The only exception is Corollary 13 combined with Lemma 3 to remove the phases, which is outperformed by traditional phase estimation for some values of α > 1/2. Such enormous α is obviously impractical: even traditional phase estimation does not round to the nearest bit in this regime.
Second, the performance of Proposition 5 exhibits a zig-zag behavior. This is because we reduce α by estimating more bits than we need and then rounding them away, a process that can only obtain α of the form 2^{−1−r} for integer r. When α > 1/2 we no longer use the rounding strategy, since we can achieve these values with median amplification alone. Therefore, for large enough α the line is no longer a zig-zag and becomes a curve of shape ∼ α^{−2}.
The zig-zag behavior makes comparison for continuous values of α complicated. In our analysis we would like to compare against the most efficient version of traditional phase estimation, so we continue our performance analysis with α a power of two, maximizing the traditional method's efficiency (and thereby minimizing our speedup). In Figure 6 we show the performance when we vary α, n and δ independently.
We see that once n ≳ 10, α ≲ 2^{−10} and δ ≲ 10^{−30}, the speedup stabilizes at about 14x for phase estimation and 10x for energy estimation. Our method therefore shows a significant improvement over the state of the art.
Of course, several assumptions needed to be made in order to claim a particular multiplicative speedup. Many of these assumptions were made specifically to maximize the performance of the traditional method. For example, we count performance in terms of query complexity rather than gate complexity, which neglects all the additional processing that phase estimation needs to perform for median amplification. This method of comparison favors the traditional method, since it neglects the fact that our new methods require significantly less ancillary processing. Furthermore, we select α to be a power of two, so that phase estimation is maximally efficient. However, we also assume that n is large enough and that α and δ are small enough that the speedup is stable. If the accuracy required is not so large, then the speedup is less significant.
Which method is best in reality will depend on the situation. In particular, the appropriate choice of α when studying real-world Hamiltonians remains an interesting direction of study. In reality it will not be possible to guarantee a rounding promise, so one's only option is to pick a small value of α and hope for the best. How small of an α is required?
Figure 5: Performance of estimation algorithms presented in this paper. Phase estimation algorithms are presented with slim lines on the left, and energy estimation algorithms are presented with thick lines on the right. The zig-zag behavior of phase estimation is explained by the method by which Proposition 5 reduces α: by estimating more bits than needed and then rounding them. Thus, the α achieved by phase estimation is always a power of two.

Figure 6: When n ≳ 10, δ ≲ 10^{−30} and α ≲ 2^{−10} the speedups become stable. Together with Figure 5 we can conclude that the speedup for phase estimation is about 14x and the speedup for energy estimation is about 10x.

Block-Measurement
We have presented improved algorithms for phase and energy estimation. In this section we prove the block-measurement theorem from the introduction. The goal of the block-measurement protocol is the following: given an approximate block-encoding of a projector Π, implement a channel close to the unitary: When the block-encoding of Π is exact, the above is easily accomplished through linear combinations of unitaries and oblivious amplitude amplification. In particular, we can rewrite the above as |0⟩ ⊗ I − √2 |−⟩ ⊗ Π, so we need to amplify away a factor of 1/(1 + √2) from the linear combination. Following Theorem 28 of [GSLW18], we observe that T_5(x) is the first Chebyshev polynomial that has a solution T_5(x) = ±1 with x < 1/(1 + √2). We nudge the factor down to the solution x with some extra postselection, and then we meet the conditions of this theorem. This demands five queries to the block-encoding of Π. Then we invoke our Proposition 18 to turn the block-encoding of |1⟩ ⊗ Π + |0⟩ ⊗ (I − Π) into a channel.
However, the above strategy is both more complicated and more expensive than necessary: we can do this in just two queries. Say U_Π is a block-encoding of Π with m ancillae. Then, let V_Π := (I ⊗ U_Π†) · CNOT · (I ⊗ U_Π), where the CNOT above refers to X ⊗ |0^m⟩⟨0^m| + I ⊗ (I − |0^m⟩⟨0^m|). Then, V_Π satisfies: So viewing the |0^m⟩ register as the postselective part of a block-encoding, we see that V_Π is a block-encoding of the desired operation. Then we invoke Proposition 18 to construct a channel Λ_Π from V_Π. Now all that is left to do is the error analysis.
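This two-query construction is easy to verify numerically. The sketch below uses a hypothetical one-ancilla (m = 1) block-encoding of Π = |0⟩⟨0| built from reflections, and checks that postselecting the ancilla on |0^m⟩ leaves exactly X ⊗ Π + I ⊗ (I − Π):

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])

def R(a):
    # Reflection with top-left entry a; R(a) is unitary for |a| <= 1.
    b = np.sqrt(1.0 - a * a)
    return np.array([[a, b], [b, -a]])

# A one-ancilla block-encoding of Π = |0><0| on the system:
# U_Π = R(1) ⊗ |0><0| + R(0) ⊗ |1><1|, so (<0|⊗I) U_Π (|0>⊗I) = Π.
Pi = P0
U_Pi = np.kron(R(1.0), P0) + np.kron(R(0.0), P1)   # acts on ancilla ⊗ system

# Modified CNOT: flip the output qubit iff the ancilla register is |0^m>.
G = np.kron(X, np.kron(P0, I2)) + np.kron(I2, np.kron(P1, I2))

# V_Π = (I ⊗ U_Π†) · CNOT · (I ⊗ U_Π): two queries to the block-encoding.
V = np.kron(I2, U_Pi.conj().T) @ G @ np.kron(I2, U_Pi)

# Postselect the ancilla on |0>: the surviving block should be the target.
bra0 = np.kron(I2, np.kron(np.array([[1.0, 0.0]]), I2))   # (I ⊗ <0| ⊗ I)
block = bra0 @ V @ bra0.T
target = np.kron(X, Pi) + np.kron(I2, I2 - Pi)
```

A short calculation shows why this works in general: the projected block of V_Π is X ⊗ A†A + I ⊗ (I − A†A), where A is the encoded block, and A†A = Π when A = Π is a projector.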
In fact, when m = 1, to an extent we do not even need the modified CNOT gate and can get away with just a single query. We used this fact in Theorem 12. This is because if Π = ∑_j α_j |ψ_j⟩⟨ψ_j|, then we can write: where β_j is some amplitude satisfying |β_j|² + |α_j|² = 1. To compare, the desired operation is: Since α_j ∈ {0, 1}, we know that β_j = e^{iφ_j}(1 − α_j) for some phase φ_j that may depend on |ψ_j⟩. So, with the exception of the phase correction, we are already one X gate away from the desired unitary. Is it possible to remove this phase correction without uncomputation? If U_Π is obtained through singular value transformation of some real eigenvalues λ_j then we actually have more control than Lemma 10 would indicate. Looking at Theorem 3 of [GSLW18], we can actually select polynomials P, Q such that α_j = P(λ_j) and β_j = Q(λ_j), provided that P, Q satisfy some conditions including |P(x)|² + (1 − x²)|Q(x)|² = 1. For what positive-real-valued polynomials P(x) is it possible to choose a positive-real-valued Q(x) that satisfies this relation, thus removing the phases e^{iφ_j}? We leave this question for future work. Now we proceed to the formal statement and error analysis of the block-measurement theorem.
Theorem 19. Block-measurement. Say Π is a projector, and A is a hermitian matrix satisfying ‖Π − A²‖ ≤ ε. Also, A has a block-encoding U_A. Then the channel Λ_A, constructed in the same way as Λ_Π above just with U_A instead of U_Π, approximates Λ_Π in diamond norm: Proof. Just like V_Π, V_A is a block-encoding of the unitary: The distance in spectral norm to V_Π = X ⊗ Π + I ⊗ (I − Π) is: So we have a block-encoding of a matrix √(2ε)-close in spectral norm to the unitary X ⊗ Π + I ⊗ (I − Π). Now we just use Proposition 18, which gives a channel with diamond norm accuracy 4√(2ε). Then we get the desired operation by initializing an extra qubit to |0⟩ before applying the channel.
What happens when A is not a projector? For simplicity, say the input state to the circuit is |Ψ⟩, which happens to be an eigenstate of A. Then: We are interested in the state obtained by tracing out the |0^m⟩ register above. For j ∈ {0, 1}^m, we let γ_j abbreviate a matrix element of U_A: Note that γ_0 is the eigenvalue of A corresponding to |Ψ⟩. Then, the reduced density matrix after tracing out the middle register is: This somewhat complicated calculation produced a simple result: tracing out the |0^m⟩ register collapses the superposition on the output register. A more general version of this calculation where the input state is not an eigenstate of A demonstrates that this collapse also damages the superposition on the input register (although it does not fully collapse it). Intuitively, we can see why this is the case even without doing the full calculation: since measuring the middle register collapses the superposition that encodes the eigenvalues, the environment that traced out the middle register thus learns information about the eigenvalues. But the distribution over eigenvalues also contains information about the input superposition over eigenvectors, so the input state will be damaged.
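This collapse can be reproduced in a small numerical instance: take a hypothetical diagonal A = diag(a_0, a_1) with a one-ancilla block-encoding built from reflections (an illustrative choice, not the paper's construction), feed in the eigenstate |Ψ⟩ = |0⟩, and trace out everything but the output qubit. The reduced state comes out exactly diagonal, diag(1 − a_0², a_0²), with no coherence:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])

def R(a):
    # Reflection with top-left entry a.
    b = np.sqrt(1.0 - a * a)
    return np.array([[a, b], [b, -a]])

# One-ancilla block-encoding of A = diag(a0, a1), which is NOT a projector.
a0, a1 = 0.6, 0.9
U_A = np.kron(R(a0), P0) + np.kron(R(a1), P1)     # acts on ancilla ⊗ system

# Same circuit as for block-measurement: flip the output qubit iff anc = |0>.
G = np.kron(X, np.kron(P0, I2)) + np.kron(I2, np.kron(P1, I2))
V = np.kron(I2, U_A.conj().T) @ G @ np.kron(I2, U_A)

# Input |0>_q |0>_anc |Ψ>, with |Ψ> = |0> an eigenstate of A (eigenvalue a0).
e0 = np.array([1.0, 0.0])
out = V @ np.kron(e0, np.kron(e0, e0))

# Trace out the ancilla and system registers, keeping the output qubit q.
rho = np.outer(out, out.conj()).reshape(2, 2, 2, 2, 2, 2)
rho_q = np.einsum('iabjab->ij', rho)
```

The output qubit carries outcome 1 with probability a_0² and outcome 0 with probability 1 − a_0², and its off-diagonal terms vanish, matching the conclusion that tracing out the middle register collapses the output superposition.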
In essence, we have re-derived the fact that uncomputation is impossible unless the output of the computation is deterministic.

Non-Destructive Amplitude Estimation
In this section we show how to use our energy estimation algorithm to perform amplitude estimation. The algorithms for energy and phase estimation demanded a rounding promise on the input Hamiltonian, which guarantees that they do not damage the input state even if it is not an eigenstate of the unitary or Hamiltonian of interest. However, as we shall see, this is not a concern for amplitude estimation. This is because we can construct a Hamiltonian whose eigenstate is the input state.
Say Π is some projector and |Ψ⟩ is a quantum state. Non-destructive amplitude estimation obtains an estimate of a = ‖Π|Ψ⟩‖ given exactly one copy of |Ψ⟩, and leaves that copy of |Ψ⟩ intact. This subroutine is explicitly required by the algorithms in [HW19, AHNTW20], which can be used to perform Bayesian inference, thermal state preparation and partition function estimation. However, it is also quite useful in general when |Ψ⟩ is expensive to prepare. For example, when estimating the expectation of an observable on the ground state of a Hamiltonian, preparing the ground state can be very expensive. Thus, it may be practical to only prepare it once.
The only previously known algorithm for non-destructive amplitude estimation is given in [HW19]. It works via several invocations of amplitude estimation according to [BHMT00], which is based on phase estimation. We argue that our new algorithm has several performance advantages over the prior art. However, these advantages are not easily quantifiable, preventing us from computing a numerical constant-factor speedup. The advantages are as follows: • Our new algorithm requires dramatically fewer ancillae. This is because [HW19] relies on phase estimation with median amplification. As argued above, median amplification requires O(n log(δ^{−1})) ancillae. In contrast, we simply use the protocol for energy estimation, which requires n + O(1) ancillae.
• Our new algorithm runs in a fixed amount of time: one application of our energy estimation algorithm suffices. In contrast, [HW19]'s algorithm works by repeatedly attempting to 'repair' the input state. This process succeeds only with probability 1/2, so while the expected number of attempts is constant, the resulting algorithm is highly adaptive and may need a variable amount of time.
• Our new algorithm does not require knowledge of a lower bound on the amplitude a.
The 'repair' step in [HW19] itself involves another application of phase estimation, which must produce an estimate with enough accuracy to distinguish arcsin(a) and − arcsin(a). This also implies that [HW19] can only produce relative-error estimates, even when an additive-error estimate might be sufficient, as it is in [AHNTW20].
• While, due to the above differences, it is not really possible to perform a side-by-side constant-factor comparison of the algorithms, we do expect to inherit a modest constant-factor speedup from our energy estimation algorithm. [BHMT00] has no need to perform Hamiltonian simulation, so we expect the speedup to look more like the one in the case of phase estimation. Since no rounding promise is required, making α tiny is not really necessary. But it is necessary that α < 1/2, since this makes traditional phase estimation round correctly. Looking at Figure 5, we already see constant-factor speedups in this regime.
Another minor advantage of our method over [HW19] is that it estimates a² directly, rather than going through the Grover angle θ := arcsin(a); a² is the probability with which a measurement of |Ψ⟩ yields a state in Π, which may be useful directly. Coherently evaluating a sine function in superposition to convert between the two, while possible, would carry an enormous ancilla overhead.
We note that an additive-error estimate of a is slightly stronger than an additive-error estimate of a². Moreover, this accuracy seems to be necessary for some versions of quantum mean estimation: see for example Appendix C of [AHNTW20]. If the amplitude a is desired rather than a², then, since the algorithm proceeds by obtaining a block-encoding of a², one can use singular value transformation to make a block-encoding of a instead. A polynomial approximation of √x could be constructed from Corollary 66 of [GSLW18]. We leave a careful error analysis of a direct estimate of a to future work.
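As a rough numerical illustration of the last point, a low-degree polynomial approximates √x well away from the origin. This is a plain least-squares Chebyshev fit, not the rigorous construction of Corollary 66, and the lower cutoff is an arbitrary assumption needed because √x is not smooth at 0:

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Fit a degree-30 Chebyshev approximation of sqrt(x) on [lo, 1].
lo = 0.05
xs = np.linspace(lo, 1.0, 2000)
p = Chebyshev.fit(xs, np.sqrt(xs), deg=30, domain=[lo, 1.0])
err = np.max(np.abs(p(xs) - np.sqrt(xs)))
```

Since √x is analytic on [lo, 1], the error decays geometrically in the degree; applying such a polynomial to a block-encoding of a² via singular value transformation would yield an approximate block-encoding of a on the relevant range.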
Corollary 20. Non-destructive amplitude estimation. Say Π is a projector and R_Π is a unitary that reflects about this projector: Let |Ψ⟩ be some quantum state such that a := ‖Π|Ψ⟩‖, and let M := floor(a²2^n). Then for any positive integer n and any δ > 0 there exists a quantum channel that implements δ-approximately in diamond norm the map: for some probability p (i.e., there is some probability that instead of obtaining floor(a²2^n) we obtain floor(a²2^n) − 1). This channel uses O(2^n log(δ^{−1})) controlled applications of R_Π and R_{|Ψ⟩} := 2|Ψ⟩⟨Ψ| − I.