Basic quantum subroutines: finding multiple marked elements and summing numbers

We show how to find all $k$ marked elements in a list of size $N$ using the optimal number $O(\sqrt{N k})$ of quantum queries and only a polylogarithmic overhead in the gate complexity, in the setting where one has a small quantum memory. Previous algorithms either incurred a factor $k$ overhead in the gate complexity, or had an extra factor $\log(k)$ in the query complexity. We then consider the problem of finding a multiplicative $\delta$-approximation of $s = \sum_{i=1}^N v_i$ where $v=(v_i) \in [0,1]^N$, given quantum query access to a binary description of $v$. We give an algorithm that does so, with probability at least $1-\rho$, using $O(\sqrt{N \log(1/\rho) / \delta})$ quantum queries (under mild assumptions on $\rho$). This quadratically improves the dependence on $1/\delta$ and $\log(1/\rho)$ compared to a straightforward application of amplitude estimation. To obtain the improved $\log(1/\rho)$ dependence we use the first result.


1 Introduction

1.1 Finding multiple marked elements in a list
Grover's famous search algorithm [Gro96] can be used to find a marked element in a list quadratically faster than is possible classically. Formally, it solves the following problem: given a bit string x ∈ {0, 1}^N with x ≠ 0, find an index i ∈ [N] such that x_i = 1.
In this work we consider the problem of finding all indices i ∈ [N] for which x_i = 1. We give a query-optimal quantum algorithm with polylogarithmic gate overhead in the setting where one has a small quantum memory. We explain below why this last assumption makes the problem non-trivial. This improves over the previous state of the art: previous algorithms were either query-optimal but with a polynomial gate overhead, or had a polylogarithmic gate overhead but also a logarithmic overhead in the query count.
A well-known query-optimal algorithm for the problem is as follows [dGdW02, Lem. 2]. Let k be the Hamming weight |x| := Σ_{i=1}^N x_i of x. For ease of exposition, suppose the algorithm knows k. (For our results we will work with weaker assumptions, such as knowing only an upper bound on k or an estimate of it; see Section 3. We also ignore failure probabilities in this part of the introduction.) A variant of Grover's algorithm [BBHT98] can find a single marked element using O(√(N/k)) quantum queries and O(√(N/k) log(N)) additional single- and two-qubit gates. One can then find all k marked elements using O(√(Nk)) quantum queries to x. This complexity is obtained as follows. Suppose we have already found a set J ⊆ [N] of marked elements. Then, to find a new marked element, we replace x by the string z ∈ {0, 1}^N defined by z_i = 0 for i ∈ J and z_i = x_i otherwise. A quantum query to z can be made using a single quantum query to x and a single quantum query to J (which on input |i⟩|b⟩ for i ∈ [N], b ∈ {0, 1} returns |i⟩|b ⊕ δ_{i∈J}⟩, where δ_{i∈J} ∈ {0, 1} is one iff i ∈ J). In particular, if J can be stored in a quantum memory (i.e., queried and updated in unit time), then the query complexity is O(√(Nk)) and the time complexity is O(√(Nk)). We refer the interested reader to [GLM08] and [CHI+18, Sec. 5] for a discussion of quantum memory and its (dis)advantages.
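As a sanity check on this accounting, the total query count Σ_{j=1}^k √(N/(k − j + 1)) can be tabulated numerically. A minimal sketch (ours, for intuition; the constant 2 comes from the standard estimate Σ_{m=1}^k 1/√m ≤ 2√k, not from the text):

```python
import math

def sequential_search_queries(N, k):
    """Model of the query count: finding the j-th marked element among
    the k - j + 1 remaining ones costs ~ sqrt(N / (k - j + 1)) queries."""
    return sum(math.sqrt(N / (k - j + 1)) for j in range(1, k + 1))

# The total is O(sqrt(N k)): summing sqrt(N/m) over m = 1..k gives
# sqrt(N) * sum_{m=1}^{k} 1/sqrt(m) <= 2 * sqrt(N k).
```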
However, when we cannot store J in a quantum memory, a naive implementation of the quantum queries to J is expensive in terms of gate complexity: if |J| = s, then one can use O(s log(N)) quantum gates to implement a single query to J. Since the size of J grows to k, the total gate complexity of finding all marked elements scales as O(√N · k^{3/2}), which is a factor k larger than the query complexity. We show that this factor of k in the gate complexity can be avoided: we give an algorithm that finds, with high probability, all k marked indices using the optimal number O(√(Nk)) of quantum queries to x, while incurring only a polylogarithmic overhead in the gate complexity, in the case where we only have a small quantum memory. We state a simplified version of our main result below; for the full version, see Theorem 3.9 and the corresponding algorithm GroverMultipleFast.
We mention that by a simple coupon-collector argument one can already achieve both query- and gate-complexity √(Nk) · polylog(N, 1/ρ); see Proposition 3.7. Our algorithm completely removes the polylog(N) factors in the query complexity and moreover has a much improved dependence on log(1/ρ): one can achieve ρ = 1/poly(k) without increasing the number of quantum queries made by the algorithm. In the same spirit, we mention that previous work had already shown that simply boosting a constant success probability is not optimal for finding a single marked element: one can do so with probability ≥ 1 − ρ using O(√(N log(1/ρ))) quantum queries [BCdWZ99].
In a nutshell, our algorithm is a hybrid between the quantum coupon collector and the query-optimal algorithm described above. First, we use the coupon-collection strategy to find t marked indices 1 ≤ i_1 < ⋯ < i_t ≤ N, for t roughly k/log(k)². A basic property of this strategy is that the resulting indices {i_1, . . ., i_t} form a uniformly random subset of size t of the marked indices in x. Next, for every j ∈ [t + 1], we use the query-optimal algorithm to find all remaining marked elements in the interval (i_{j−1}, i_j) ⊆ [N], where we write i_0 = 0 and i_{t+1} = N + 1. With high probability over the found indices {i_1, . . ., i_t}, each of the intervals (i_{j−1}, i_j) contains few remaining marked indices, which reduces the effect of the high gate-complexity overhead of the query-optimal search algorithm.
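The partition step above is purely classical; a minimal sketch (our helper name, for illustration) that computes the open intervals between consecutive sampled marked indices, with sentinels i_0 = 0 and i_{t+1} = N + 1:

```python
def search_intervals(found, N):
    """Return the t+1 open intervals (i_{j-1}, i_j) determined by the
    sampled marked indices, with sentinels 0 and N + 1."""
    pts = [0] + sorted(found) + [N + 1]
    return [(pts[j], pts[j + 1]) for j in range(len(pts) - 1)]
```

For example, with N = 10 and sampled indices {3, 7}, this yields the three intervals (0, 3), (3, 7) and (7, 11); the query-optimal search is then run inside each interval separately.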

Improved quantum summing algorithm
Given quantum query access to a binary description of v ∈ [0, 1]^N, how difficult is it to obtain, with probability ≥ 1 − ρ, a multiplicative δ-approximation of the sum s = Σ_{i=1}^N v_i? We provide an algorithm to do so whose complexity can be tuned by choosing a parameter p ∈ (0, 1); one special case of our second main result is as follows, see Theorem 4.3 for the full version. In the version below we make very mild assumptions on the failure probability ρ and precision δ, which essentially correspond to the regime in which one makes at most O(N) quantum queries.
The algorithm uses O(√(N log(1/ρ)/δ)) quantum queries to binary descriptions of the entries of v, and a gate complexity which is larger by a factor polylogarithmic in N, 1/δ and 1/ρ.
In a nutshell, our algorithm first finds all indices of "large enough" entries of v using GroverMultipleFast and sums the corresponding elements classically. It then rescales the remaining "small enough" elements and uses amplitude estimation [BHMT02] to approximate their sum. To determine what "large enough" means, we use a recent quantum quantile-estimation procedure from [Ham21]. Choosing the quantile carefully controls both the number of elements that need to be found in the first stage and the size of the elements that remain to be summed in the second stage. Note that it is the above version of Grover's algorithm that allows us to obtain a query complexity with only a log(1/ρ)-dependence, and without additional polylogarithmic factors in N and δ. Indeed, the fact that the number of quantum queries required to find multiple marked elements does not depend on log(1/ρ) (for ρ not too small) allows us to balance the complexities of the two stages.

The problem we consider can be viewed as a special case of the mean estimation problem, or as a generalization of the approximate counting problem for binary strings x ∈ {0, 1}^N. We briefly discuss how our results compare to prior work on those problems.

Mean estimation algorithms.
After multiplying the v_i by a factor 1/N, we can interpret the problem of finding a multiplicative δ-approximation of the sum s = Σ_{i=1}^N v_i as the problem of obtaining a multiplicative δ-approximation of the mean µ = (1/N) Σ_{i=1}^N v_i of the random variable that, for each i ∈ [N], takes value v_i with probability 1/N. Quantum algorithms for the mean estimation problem date back to the work of Grover [Gro97, Gro98]. A careful application of maximum finding and quantum amplitude estimation yields such an approximation of µ, with probability ≥ 1 − ρ, using O((√N/δ) log(1/ρ)) quantum queries and polylogarithmic gate overhead; see Theorem 2.9. We improve the dependence on δ from 1/δ to 1/√δ. As for applications, we note that Theorem 2.9 was used to give quantum speedups for the matrix scaling problem in [vAGL+21, GN22], where it is used to approximate the row and column sums of a matrix with non-negative entries. This is one of their main sources of quantum speedup, and the quality of this approximation directly affects the achievable precision for the matrix scaling problem. Using the improved quantum summing subroutine of Theorem 4.3, the dependence on the desired precision ε for the matrix scaling problem is further improved. More precisely, if A ∈ R_{≥0}^{N×N} is an N × N matrix with non-negative entries, let r(A) ∈ R_{≥0}^N denote its vector of row sums, i.e., r_i(A) = Σ_{j=1}^N A_{ij}. Then, given quantum query access to A, using the improved summing subroutine one can compute, with high probability, a vector r that is an entrywise multiplicative δ-approximation of r(A). Computing such an r with δ = ε² is the bottleneck in the second-order method for matrix scaling presented in [GN22]. By reducing the complexity of this step, this method is improved to become better than the fastest classical first-order method (Sinkhorn's algorithm) for entrywise-positive matrices: the classical method finds an ε-ℓ_1-scaling of entrywise-positive matrices in time O(N²/ε), whereas the box-constrained Newton method now runs in time O(N^{1.5}/ε). Note that this gives an algorithm for
matrix scaling whose runtime is sublinear in the input size when ε = Ω(1/√N), which is precisely the regime of δ for which the quantum subroutine improves over classical summing.
We remark that faster mean estimation algorithms have been developed, for example, for random variables with a small variance σ². Indeed, the current state of the art obtains a multiplicative δ-approximation, with probability ≥ 1 − ρ, using O((σ/(δµ) + 1/√(δµ)) log(1/ρ)) queries in expectation [Ham21, KO23]. For comparison, we mention that σ² ≤ µ(1 − µ) always holds, and when given binary access to the v_i, one may additionally assume (after maximum finding and rescaling) that µ ∈ [1/N, 1]. The second term in the complexity is then at most √(N/δ) log(1/ρ) (i.e., at most our bound when we ignore the ρ-dependence). The first term, however, is larger than our complexity if and only if δN ≤ (σ/µ)² (again ignoring ρ). Our algorithm thus improves over prior mean estimation algorithms when the support is relatively small: when δN is at most (σ/µ)².

Approximate counting algorithms.
As mentioned above, our algorithm improves the error dependence for mean estimation (for random variables with small support). It therefore makes sense to compare our upper bound with the well-known lower bound for the approximate counting problem for binary strings x ∈ {0, 1}^N. We first recall a precise statement. Let x ∈ {0, 1}^N, k = |x|, and let U_x be a unitary implementing quantum oracle access to x. Then, for an integer ∆ > 0, any quantum algorithm which, with probability ≥ 2/3, computes an additive ∆-approximation of k uses at least Ω(√(N/∆) + √(k(N − k))/∆) applications of controlled-U_x [NW99, Thm. 1.10 and 1.11]. A matching upper bound is given in [BHMT02, Thm. 18]; see Theorem 2.7 for a precise formulation. We can compare the complexity of our algorithm by converting multiplicative error into additive error, i.e., to achieve an additive error ε we take δ = ε/k (or ε divided by a suitable multiplicative approximation of k). Then the key point is that if one considers Eq. (1.1) for ε ≤ ∆ and k ≥ 1, then √(Nk/∆) = Ω(√(N/∆) + √(k(N − k))/∆), where the last inequality follows from concavity of the square-root function and ∆ ≥ 1. In other words, for all parameters N, k, ∆, the complexity of our algorithm (left-hand side) is at least as large as the lower bound on approximate counting (right-hand side), so we do not break the lower bound. We highlight two ranges of parameters. On the one hand, when ∆ is large, our upper bound is suboptimal for quantum counting. For example, when ∆ = k/2 (i.e., δ = 1/2), our algorithm uses O(√N) queries whereas the approximate counting algorithm from [BHMT02, Thm. 18] uses only O(√(N/k)) queries. This is no surprise given that our algorithm finds all "large" elements, which in the counting setting amounts to finding all ones. On the other hand, when ∆ is a small constant, say ∆ = 1, the approximate counting lower bound shows that our upper bound is essentially tight. To see this, note that if one had a quantum algorithm for computing (1 ± δ)-multiplicative approximations of sums with quantum
query complexity O(√N/δ^c) (that succeeds with probability ≥ 2/3), this would give an upper bound of O(√N · k^c) for finding an additive ∆ = 1 approximation of k. The lower bound becomes Ω(√(k(N − k))) when ∆ = 1, and so we must have c ≥ 1/2. We leave it as an open problem whether one can obtain a quantum algorithm for approximate summing of vectors v ∈ [0, 1]^N that matches the approximate counting complexity, when applied to v ∈ {0, 1}^N, for the entire range of parameters N, k, ∆.
Finally, we highlight that our quantum upper bound for summing outperforms the classical lower bound for approximate counting for a certain range of parameters. The classical randomized query complexity of achieving a multiplicative δ-approximation of the Hamming weight of x ∈ {0, 1}^N is Θ(min{N, N/(δ²k)}). This classical bound exceeds our quantum upper bound of O(√(N/δ)) if 1/δ ∈ O(N) and 1/δ ∈ Ω(k^{2/3}/N^{1/3}) (ignoring logarithmic factors).
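The crossover between the two bounds can be checked numerically by comparing the formulas with constants suppressed, as in the Θ/O notation above (a sketch, not a statement about exact query counts):

```python
def classical_counting_bound(N, k, delta):
    """Classical randomized bound Theta(min{N, N/(delta^2 k)}), constants dropped."""
    return min(N, N / (delta ** 2 * k))

def quantum_summing_bound(N, delta):
    """Our quantum upper bound O(sqrt(N/delta)), constants dropped."""
    return (N / delta) ** 0.5

# Example regime: N = 10^6, k = 100, delta = 0.01 satisfies
# 1/delta <= N and 1/delta >= k^(2/3)/N^(1/3), so the quantum bound wins.
```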

Organization of the paper
In Section 2 we discuss notation, the computational model, and some basic results we build upon. In Section 3 we provide our algorithm for searching for multiple marked elements. Lastly, in Section 4, we give our summation algorithm.

Notation and assumptions
Throughout the paper, we will assume that N ≥ 1 and N = 2^n for some n ≥ 1. We identify C^N with C^{2^n} via |j⟩ ↦ |j_1 . . . j_n⟩, where (j_1, . . ., j_n) ∈ {0, 1}^n is the standard binary encoding of j − 1 ∈ {0, . . ., 2^n − 1}. We write log for the logarithm with base 2 and ln for the natural logarithm. For a bit string x ∈ {0, 1}^N we write |x| = Σ_{i∈[N]} x_i. Throughout we use k to denote the Hamming weight of x, i.e., |x| = k, and we write k_est, k_lb, k_ub for various bounds on k: k_est will denote an integer such that k/2 ≤ k_est ≤ 3k/2, and k_lb and k_ub are lower and upper bounds on k, respectively.

Computational model
We express the cost of a quantum algorithm in terms of the number of one- and two-qubit gates it uses. Note that in particular we allow single-qubit rotations with arbitrary real angles. In Section 3, the angle will always be determined by classical data. In Section 4 we additionally apply controlled rotations where the control register is allowed to be in superposition; in this case we only use angles of the form π/2^m and we carefully count the number of gates used. In the query setting, we separately count the number of quantum queries the algorithm makes, i.e., (controlled) applications of the query unitary or its inverse. We will use the following types of quantum queries, to access either a bit string x ∈ {0, 1}^N or a vector v ∈ [0, 1]^N (via fixed-point representations of its entries). In both cases we allow the unitary to act on additional workspace registers, which we omit for notational convenience. Moreover, throughout the paper, every algorithm will use at most a logarithmic number of additional ancillary qubits.
We additionally use a classical data structure to maintain sorted lists, supporting both insertion and look-up in time that scales logarithmically with the size of the list; see for example [Knu98, Sec. 6.2.3] or [CLRS22, Ch. 13]. We emphasize that we allow neither writing nor reading of such a data structure in superposition.
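A minimal classical stand-in for such a structure (ours, not the paper's implementation) can be sketched with Python's bisect module; note that the list insertion below shifts elements in O(n) time, whereas the balanced trees cited above achieve O(log n) for both operations:

```python
import bisect

class SortedList:
    """Sorted list with O(log n) look-up. Insertion here is an O(log n)
    binary search plus an O(n) shift; a red-black tree (as in [CLRS22])
    achieves O(log n) for both insertion and look-up."""

    def __init__(self):
        self._data = []

    def insert(self, x):
        bisect.insort(self._data, x)

    def contains(self, x):
        i = bisect.bisect_left(self._data, x)
        return i < len(self._data) and self._data[i] == x
```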

Various quantum subroutines
In this section we summarize the external results that we build upon, and in some cases give a quick proof of an aspect of the result that is not mentioned explicitly in the original source.
There is a quantum algorithm AmpEst that, with probability ≥ 8/π², outputs an estimate ã of the amplitude a satisfying |ã − a| ≤ 2π√(a(1 − a))/M + π²/M², using M applications of controlled-U and M applications of controlled-U†. If M is a power of 2, the algorithm uses O(qM) additional quantum gates, plus the computation of the sine-squared function of the normalized phase.
Proof. This follows from the formulation in [BHMT02] by setting k = 1 and implementing the reflection through |0^q⟩ using O(q) gates, which needs to be performed M times. If M is a power of 2, we can implement the quantum Fourier transform on m = log₂(M) qubits using m Hadamard gates, and the QFT and its inverse need only be performed once; this cost is therefore absorbed in the big-O.
We note that the above formulation of AmpEst outputs a real number ã, whereas we require a fixed-point encoded number for later use. However, it suffices to use fixed-point arithmetic with O(log(M)) bits; after all, the guarantee of AmpEst only gives a precision of 1/poly(M).
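The rounding step can be sketched as follows, under the assumption from the text that O(log M) fractional bits suffice because AmpEst's own precision is only 1/poly(M):

```python
def to_fixed_point(a, bits):
    """Round a in [0, 1] to the nearest multiple of 2^-bits."""
    return round(a * (1 << bits)) / (1 << bits)

# With bits = O(log M), the rounding error of at most 2^-(bits+1) stays
# below the 1/poly(M) precision that AmpEst guarantees anyway.
```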
We also need a version of amplitude amplification where the success probability is 1, for the case when one knows the amplitude of the "good" part of the state exactly. In a nutshell, the algorithm with success probability 1 is the usual amplitude amplification algorithm applied not to U but to U followed by a rotation of the last qubit that slightly reduces the amplitude a to ā. Carefully choosing ā ensures that the success probability is exactly 1 after an integer number of rounds of amplitude amplification. This requires access to gates implementing rotations by arbitrary angles, not just angles of the form π/2^m for some integer m. We specialize the statement of this result to the search setting, but remark that it holds more generally. For exactly N/4 marked elements this observation was first made in [BBHT98].
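The choice of the reduced amplitude ā can be made concrete. In the sketch below (our notation, not the paper's), for good-state probability a we pick the round count m and then the angle θ̄ = π/(2(2m + 1)) ≤ θ = arcsin(√a), so that (2m + 1)θ̄ = π/2 exactly:

```python
import math

def deterministic_grover_params(a):
    """Given the exact good-state probability a in (0, 1), return the round
    count m and the reduced probability a_bar <= a for which m rounds of
    amplitude amplification succeed with certainty."""
    theta = math.asin(math.sqrt(a))
    m = math.ceil(math.pi / (4 * theta) - 0.5)   # ensures (2m+1)*theta >= pi/2
    theta_bar = math.pi / (2 * (2 * m + 1))      # so theta_bar <= theta
    return m, math.sin(theta_bar) ** 2

# After m rounds at angle theta_bar the success probability is
# sin((2m+1) * theta_bar)^2 = sin(pi/2)^2 = 1.
```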
Then there is a quantum algorithm GroverCertainty that takes as input a quantum oracle U_x for x and an integer k_0 ∈ [N], and that outputs an index i ∈ [N] such that x_i = 1 with certainty if k_0 = k. It uses O(√(N/k_0)) quantum queries to x and O(√(N/k_0) log(N)) additional gates.
The other version of Grover that we need is the following, which is originally due to [BBHT98, Thm. 3], although we use a slightly different version from [BHMT02, Thm. 3]. Let x ∈ {0, 1}^N with |x| = k, where k is not necessarily known. Then there is a quantum algorithm GroverExpectation that takes as input a quantum oracle U_x for x and, if k ≥ 1, outputs an index i ∈ [N] such that x_i = 1. The number of quantum queries to x that it uses is a random variable Q such that, if k ≥ 1, then E[Q] = O(√(N/k)).

Proof. The algorithm GroverExpectation finds an index i such that x_i = 1. Its number of applications of controlled-U_x is a random variable Q, and the number of additional gates is O(Q log(N)). Markov's inequality shows that if we terminate GroverExpectation after at most C√(N/|x|) quantum queries, for a suitable constant C > 0, then it finds an index i such that x_i = 1 with probability at least 2/3. The procedure Grover_{2/3} uses the lower bound k_lb on |x| to decide when to terminate GroverExpectation.
For the same constant C > 0 as before, it terminates after at most C√(N/k_lb) quantum queries. Since C√(N/k_lb) ≥ C√(N/|x|), the success probability of Grover_{2/3} is also at least 2/3.
Let us make some remarks about the complexity of finding a single marked element. First, to find such an element with certainty, one can essentially remove the log(N) factor in the gate complexity: O(√N log(log*(N))) gates suffice [AdW17]. Second, by cleverly combining GroverCertainty and Grover_{2/3}, one can find a marked element (among an unknown number of solutions) with probability ≥ 1 − ρ using O(√(N log(1/ρ))) quantum queries [BCdWZ99]. This shows that the standard way of boosting the success probability of Grover_{2/3} is not optimal.
Next, we recall a well-known result on approximate counting.
Then there is a quantum algorithm that, with probability at least 2/3, outputs an estimate k̃ such that |k̃ − k| ≤ εk, using an expected number of O((1/ε)√(N/k)) quantum queries to x (for k ≥ 1). If k = 0, the algorithm outputs k̃ = 0 with certainty, using Θ(√N) quantum queries to x. In both cases, the algorithm uses a number of gates that is O(log(N)) times the number of quantum queries. To boost the success probability to 1 − ρ, repeat the procedure O(log(1/ρ)) times and output the median of the returned values.
We often use the special case ε = 1/2 of the above theorem, hence we record it here for future use. (Note that the proof of Theorem 2.7 given in [BHMT02] in fact starts by obtaining a constant-factor approximation of |x|.)

Corollary 2.8. Let x ∈ {0, 1}^N and write |x| = k. Then there is a quantum algorithm that outputs a k_est such that, with probability ≥ 1 − ρ, we have k/2 ≤ k_est ≤ 3k/2, and that uses O(√(N/(k + 1)) log(1/ρ)) quantum queries and O(√(N/(k + 1)) log(1/ρ) log(N)) gates.
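The median-boosting step is classical post-processing; a minimal sketch (our helper name, with a toy stand-in estimator):

```python
import statistics

def median_boost(estimate, reps):
    """Run a (2/3)-correct estimator `reps` times and return the median.
    By a Chernoff bound the median is wrong with probability exp(-Omega(reps)),
    so reps = O(log(1/rho)) repetitions reach failure probability rho."""
    return statistics.median(estimate() for _ in range(reps))
```

For instance, with a stand-in estimator that returns the correct value 3 times out of 5, the median recovers it even though individual runs can be off.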
We now discuss known extensions of the above results, from counting the Hamming weight of a bit string to the problem of mean estimation. This was first studied in [Gro97] and later in [Gro98], where it was shown that one can find an additive ε-approximation of the mean of v using O((1/ε) log log(1/ε)) quantum queries to a unitary that prepares a state encoding the entries of v in its amplitudes, and a similar number of additional gates (also dependent on N). Using amplitude amplification techniques one can reduce the query complexity to O(1/ε), with O(log(N)/ε) additional gates. This result may easily be recovered from Lemma 2.3 with M = Θ(1/ε). It is well known that when one has quantum oracle access to fixed-point representations of the entries of v (cf. Definition 2.2), rather than just a state encoding its entries in the amplitudes, one can give an algorithm whose complexity depends only on N and δ, with guarantees as given below.
Then with O((√N/δ) log(1/ρ)) quantum queries and a polylogarithmic gate overhead, one can find, with probability ≥ 1 − ρ, a multiplicative δ-approximation of s = Σ_{i=1}^N v_i.
We give an informal description of the algorithm here, and refer the interested reader to [vAGL+21] for a careful implementation along with a bit-complexity analysis. By using quantum maximum finding [DH96], with O(√N) quantum queries and O(b√N log(N)) other gates, one may find v_max = max_i v_i. If v_max = 0, one may output v_est = 0 as an estimate of the sum. Note that having binary access here makes it easy to compare elements. Next, set w_i = v_i/v_max, and let U be a unitary preparing a state encoding the w_i in its amplitudes. Then Lemma 2.3 with M = 8√N/δ outputs an estimate w_est of w such that |w_est − w| ≤ δw (using that 1/N ≤ w), so w_est is a multiplicative δ-approximation of w. Therefore w_est · v_max is a multiplicative δ-approximation of the sum of the v_i. We note that in this step the binary access to the entries of v enables the "binary amplification" by ensuring the largest entry of w is 1.
3 Fast Grover search for multiple items, without quantum memory

In this section we give a version of Grover's search algorithm for the problem of, given a string x ∈ {0, 1}^N, finding all k indices i ∈ [N] such that x_i = 1. For ρ ∈ (0, 1) with ρ = Ω(1/poly(k)), our algorithm finds all such indices with probability ≥ 1 − ρ, using O(√(Nk)) quantum queries and √(Nk) · polylog(N) single- and two-qubit gates. The contribution here is that the query complexity is optimal and the time complexity is only polylogarithmically worse than the query complexity, without using a QRAM.

Deterministic Grover for multiple elements
We first recall the well-known result [dGdW02, Lem. 2] that it is possible to find all solutions with probability 1 using O(√(Nk)) quantum queries, which is optimal, but with a gate complexity that is too high in terms of k. The algorithm is given in GroverCertaintyMultiple. We first define, for each j ∈ [N], a gate C_j, referred to as the "control-on-j-NOT" gate, and describe how to implement it with a standard gate set. The point of this gate is that if one has quantum oracle access U_x to x ∈ {0, 1}^N, then C_j U_x implements quantum oracle access to the bit string y ∈ {0, 1}^N which agrees with x on all indices except the j-th, where the bit is flipped.
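Classically, the effect of composing C_j with the oracle is just a bit flip at position j; a minimal model (ours, for intuition only):

```python
def flipped_oracle(x, j):
    """Classical model of C_j composed with U_x: an oracle for the string y
    that agrees with x everywhere except at index j, where the bit flips."""
    return lambda i: x[i] ^ (1 if i == j else 0)
```

If x_j = 1 has already been found, the new oracle reports y_j = 0, so a search over y only sees the not-yet-found marked indices.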
Then the C j -gate can be implemented with O(n) standard gates and n − 1 ancillary qubits.
Proof. Let |j⟩ = |j_1 . . . j_n⟩ be the binary encoding of j − 1. Then:
1. For each l ∈ [n] such that j_l = 0, apply a NOT gate on the l-th qubit of the index register.
2. Apply an n-fold controlled NOT with the index register as control and the output qubit as target; this can be implemented with O(n) Toffoli gates using n − 1 ancillary qubits.
3. Undo the NOT gates from step 1.

Proof. We first establish correctness of GroverCertaintyMultiple.
Let J_m ⊆ [N] be the index set and U_{J_m} the unitary in the algorithm at the m-th step. Then, by the definition of C_j, U_{J_m} implements oracle access to the bit string y_m which agrees with x on [N] \ J_m and is zero on the indices in J_m (whereas x_j = 1 for j ∈ J_m). Clearly j ∈ J_0 implies that x_j = 1. It remains to show that in k_ub iterations we find all marked elements. To do so, observe that there can be at most k_ub − k iterations in which one fails to find a new j ∈ [N] such that x_j = 1: indeed, as soon as this happens, we have m = |y_m|, and in every iteration afterwards we find a new index with certainty by the guarantees of GroverCertainty.
Procedure GroverCertaintyMultiple(U_x, k_ub)
Guarantee: If |x| ≤ k_ub, then for every j ∈ [N], j ∈ J if and only if x_j = 1.
Analysis: Lemma 3.2

In total, this procedure uses O(√(N k_ub)) quantum queries. Summing the per-iteration gate cost over all iterations yields a total gate complexity of O(√(N k_ub) (k + 1) log(N)).

Coupon collecting Grover
We next give another simple version of Grover which can be used to find a large fraction of the marked elements in a time-efficient manner, but which does not yield a query-optimal bound when the fraction is close to 1. The algorithm is given in GroverCoupon and is analyzed in Proposition 3.7. The algorithm is simple: the idea is to repeatedly call Grover_{2/3} to sample marked elements. The analysis is based on the observation that the required number of calls to Grover_{2/3} is a sum of geometrically distributed random variables: for 1 ≤ i ≤ t, the number of calls needed to obtain the i-th distinct marked element is a geometrically distributed random variable with success probability p'_i ≥ (2/3) p_i, where p_i = (k − i + 1)/k is the probability of observing a new element after i − 1 distinct elements have been found. This uses that Grover_{2/3} succeeds with probability ≥ 2/3, and that if Grover_{2/3} successfully finds a marked element, it is uniformly random among the marked elements. The number of calls can then be bounded using a general tail bound on sums of geometrically distributed random variables given in [Jan18, Thm. 2.3] (see Lemma 3.4).
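The expectation of this sum of geometric variables is easy to tabulate; a sketch (our helper name) using the bound p'_i ≥ (2/3)(k − i + 1)/k from the analysis:

```python
def expected_calls_upper(k, t):
    """Upper bound on E[# calls to Grover_{2/3}] needed to collect t distinct
    marked elements out of k: sum of 1/p'_i with p'_i >= (2/3)(k-i+1)/k,
    i.e. (3/2) * k * (H_k - H_{k-t})."""
    return sum(1.5 * k / (k - i + 1) for i in range(1, t + 1))
```

For t = k this is (3/2) k H_k = O(k log k), the usual coupon-collector behaviour; the tail bound of [Jan18] turns this expectation into a high-probability statement.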
The analysis is based on tail bounds for sums of geometrically distributed random variables. These tail bounds are in turn stated in terms of the harmonic numbers, for which we recall some basic properties in the following lemma.

For k ≥ 1, let H_k = Σ_{i=1}^k 1/i denote the k-th harmonic number, and we shall use the convention H_0 = 0. Then

1/(2(k + 1)) ≤ H_k − ln(k) − γ ≤ 1/(2k),

where γ ≈ 0.577 is the Euler–Mascheroni constant. Furthermore, for 0 ≤ t < k, this implies

H_k − H_{k−t} ≤ ln(k/(k − t)) + 1/(2k).

Proof. The bounds on H_k − γ − ln(k) are well known; see [You91] for an elementary proof. For the estimate on H_k − H_{k−t}, subtract the lower bound for H_{k−t} from the upper bound for H_k. We use the following tail bound for geometrically distributed variables.
Procedure GroverCoupon(U_x, R, k_lb, t)
Guarantee: If R ≥ R_{t,k,ρ}, a threshold involving ln(3k/(k + 2(t − 1))), then, with probability ≥ 1 − ρ, we have |J| = t and x_j = 1 for all j ∈ J.
Analysis: Proposition 3.7

We now establish correctness. By construction, r ≤ R with certainty. Lemma 2.6 shows that, with probability ≥ 2/3, the index returned by Grover_{2/3} in line 3 is a uniformly random marked element. Hence Lemma 3.6 shows that after obtaining such indices, we have obtained t distinct indices with probability at least 1 − ρ. In other words, if R ≥ R_{t,k,ρ}, then, with probability at least 1 − ρ, GroverCoupon terminates at line 8 with a sorted list J ⊆ [N] of t distinct marked indices.

Grover for multiple elements, fast
In this section we improve the complexity of finding all marked indices by combining the two previously discussed algorithms, GroverCoupon and GroverCertaintyMultiple. The structure of our algorithm, GroverMultipleFast, is as follows. As before, suppose we are given query access to a string x ∈ {0, 1}^N. Let the (unknown) number of marked indices be k ≥ 1, i.e., k = |x|. We first use GroverCoupon to find a (large) fraction of the marked elements. That is, we find a uniformly random subset J_0 ⊆ [N] of τk marked elements, where 0 < τ < 1 is a parameter we can use to tune the complexity of the algorithm. This subset J_0 partitions [N] into intervals. We then use GroverCertaintyMultiple to find all marked indices in each interval separately.
The following lemma upper bounds the probability that, when we draw a set S ⊆ [k] of size t uniformly at random, there exists an interval of length ≥ ℓ in the set [k] \ S. In the analysis of GroverMultipleFast (see Theorem 3.9), we will use this bound to control the number of elements that lie between any two elements of the previously sampled indices J_0.

Lemma 3.8. Let S ⊆ [k] be a uniformly random t-element set, and let 1 ≤ ℓ ≤ k − t. The probability that [k] \ S contains a contiguous subset I of length ≥ ℓ, i.e., I = {a, a + 1, . . ., a + ℓ − 1} for some a, is at most k(1 − ℓ/k)^t.

Proof. The probability that [k] \ S contains a contiguous subset I of length at least ℓ is the same as the probability that it contains a contiguous subset of length exactly ℓ. This is in turn given by Pr[∃ a ∈ [k − ℓ + 1] : {a, . . ., a + ℓ − 1} ∩ S = ∅]. By a union bound, this is at most Σ_{a=1}^{k−ℓ+1} Pr[{a, . . ., a + ℓ − 1} ∩ S = ∅]. By the uniform randomness of S, each of the latter probabilities is the same, and given by Pr[{a, . . ., a + ℓ − 1} ∩ S = ∅] = C(k − ℓ, t)/C(k, t) ≤ (1 − ℓ/k)^t. We conclude that the probability that [k] \ S contains a contiguous subset of length at least ℓ is at most (k − ℓ + 1)(1 − ℓ/k)^t ≤ k(1 − ℓ/k)^t.

Procedure GroverMultipleFast(U_x, k_est, ρ, λ)
Guarantee: If λ and ρ are such that log(6 k_est/ρ) ≤ ⌈k_est/λ⌉, then, with probability ≥ 1 − ρ, we have |J| = |x| and x_j = 1 for all j ∈ J.
Analysis: Theorem 3.9
1 J ← ∅;
2 t ← ⌈k_est/λ⌉;
3 R ← 6 ln(2)(t + 1) + 2 ln(1/ρ)/ln(3/2);
4 use GroverCoupon(U_x, R, (2/3) k_est, t) to find, with probability ≥ 1 − ρ/3, a sorted list J_0 ⊆ [N] with x_j = 1 for all j ∈ J_0 and |J_0| = t;
5 set J ← J_0 and write J_0 = {a_1 < a_2 < ⋯ < a_t};
6 set a_0 = 0 and a_{t+1} = N + 1;
7 for i = 0, . . ., t do
8   if a_{i+1} = a_i + 1, continue with the next loop iteration; otherwise, let b_i = 2^{⌈log(a_{i+1} − 1 − a_i)⌉};
9   construct from U_x an oracle U_y which implements access to the bit string y ∈ {0, 1}^{b_i} given by y_j = x_{a_i + j} if a_i + j < a_{i+1}, and 0 otherwise;

We remark here that GroverMultipleFast takes a multiplicative estimate k_est of k as additional input, which can be found with O(√(N/k) log(1/ρ)) quantum queries and O(√(N/k) log(1/ρ) log(N)) additional gates; see Corollary 2.8. Both of these costs are dominated by that of finding the actual elements. The above theorem also includes a parameter λ that allows for a trade-off between query complexity and gate complexity. Before we provide the proof of Theorem 3.9, let us highlight the two extremal cases, which follow from taking λ either as large as useful or as small as possible.
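The union bound in Lemma 3.8 can be checked numerically; the exact per-position probability is a ratio of binomial coefficients (helper name is ours):

```python
from math import comb

def gap_probability_bound(k, t, ell):
    """Union bound from Lemma 3.8: the probability that the complement of a
    uniformly random t-subset S of [k] contains a run of ell consecutive
    elements is at most (k - ell + 1) * C(k - ell, t) / C(k, t),
    which is itself at most k * (1 - ell/k)**t."""
    return (k - ell + 1) * comb(k - ell, t) / comb(k, t)
```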
Then we can find, with probability ≥ 1 − ρ, all k indices i for which x_i = 1, for example via Theorem 3.9 with λ = 6.
Let a_1 < a_2 < ⋯ < a_t denote the found indices for which x_{a_j} = 1, and define the intervals I_0 = {1, . . ., a_1 − 1}, I_t = {a_t + 1, . . ., N}, and, for j ∈ [t − 1], I_j = {a_j + 1, . . ., a_{j+1} − 1}. We use k_j to denote the (unknown) number of marked elements in I_j, so in particular Σ_{j=0}^t k_j ≤ k − t. Then, by Lemma 3.8, the probability that there is a k_j larger than ℓ := (k/t)(log(k) + log(3/ρ)) is at most k(1 − ℓ/k)^t ≤ k · 2^{−ℓt/k} = ρ/3. Here we used that ℓ ≥ 1, that (1 − x) ≤ 2^{−x} for x ∈ [0, 1], and that log(k) + log(3/ρ) ≥ 0. For the rest of the argument we may thus assume that there is no interval with more than ℓ not-yet-found marked elements.
In the next step of our algorithm we search for all marked elements in each interval. To do so for the j-th interval, we search over the elements of [2^{⌈log(|I_j|)⌉}], marking an element i ∈ [2^{⌈log(|I_j|)⌉}] if x_{i + a_j} = 1 and i ≤ |I_j| (letting a_0 = 0). One can implement this unitary using O(1) quantum queries and O(log(N)) gates (to implement the addition and comparison). For each interval, we first compute an estimate (k_j)_est of k_j that satisfies k_j/2 ≤ (k_j)_est ≤ 3k_j/2 using Corollary 2.8, with success probability ≥ 1 − ρ/(3(t + 1)). The associated query cost is O(√(|I_j|/(k_j + 1)) log(t/ρ)), and it uses O(√(|I_j|/(k_j + 1)) log(t/ρ) log(N)) additional gates. Then Lemma 3.2 shows that we can find all marked elements in the j-th interval with probability 1 using O(√(|I_j| (k_j)_est)) quantum queries and O(√(|I_j|) (k_j)_est^{3/2} log(N)) additional gates. By a union bound, with probability ≥ 1 − ρ/3, all (k_j)_est are correct, and this step has a total query complexity of

Σ_{j=0}^t O(√(|I_j| (k_j)_est) + √(|I_j|/(k_j + 1)) log(t/ρ)) = O(√(Nk) + √(N(t + 1)) log(t/ρ)),  (3.4)

where the first step uses Cauchy–Schwarz for both terms (reading √(|I_j| (k_j)_est) as √(|I_j|) · √((k_j)_est)). To analyze the gate complexity of this step, we first bound Σ_{j=0}^t k_j³. We have Σ_{j=0}^t k_j³ = ⟨k, k²⟩ ≤ ∥k∥_∞² ∥k∥_1 ≤ ℓ² ∥k∥_1, where k is the vector with entries k_j and k² is its entrywise square. As we also have ∥k∥_1 ≤ k, we get Σ_{j=0}^t k_j³ ≤ ℓ² k. Then the gate complexity of the final search steps becomes

Σ_{j=0}^t O(√(|I_j|) (k_j)_est^{3/2} log(N)) = O(√(Σ_j |I_j|) · √(Σ_j (k_j)_est³) · log(N)) = O(ℓ √(Nk) log(N)),

where we again used Cauchy–Schwarz in the first step, and the total error probability is bounded by ρ/3 + ρ/3 + (t + 1) · ρ/(3(t + 1)) = ρ. To conclude, the upper bound on the total query complexity follows by combining Eqs. (3.2) and (3.4), using that log(1/ρ) ≤ log(6 k_est/ρ) and, by assumption, log(6 k_est/ρ) ≤ ⌈k_est/λ⌉ = t ≤ k. A similar argument using Eqs. (3.3) and (3.5) and λ ≥ 1 establishes the desired gate complexity.

4 Improved query complexity for approximate summation

In this section, we provide an algorithm ApproxSum, which given quantum query access to a binary description of v ∈ [0, 1]^N, in the sense
of Definition 2.2, finds a multiplicative $\delta$-approximation of $s = \sum_{i=1}^N v_i$ with probability $\ge 1-\rho$, using a number of quantum queries given in Eq. (4.1) and a similar gate complexity (with only a polylogarithmic overhead). In Eq. (4.1) we have made very mild assumptions on the values of $\rho$ and $\delta$; a precise statement is given in Theorem 4.3. The algorithm is given in ApproxSum. By slightly perturbing the entries of $v$, we may assume without loss of generality that all entries of $v$ are distinct; we make this assumption throughout this section, including in the description of the algorithm.
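The distinctness assumption can be realized by the perturbation described in the proof of Theorem 4.3: add $i\,2^{-\ell}$ to the $i$th entry for sufficiently large $\ell$. A minimal sketch (the concrete values of $b$, $\ell$, and $v$ are hypothetical):

```python
# Perturbing v_i by (i+1) * 2**-l makes all entries distinct while changing
# the sum by at most N*(N+1)/2 * 2**-l. Here b is the (hypothetical) number
# of bits per entry and l is chosen comfortably larger than log(N) + b.
N, b, l = 8, 4, 10
v = [0.25, 0.25, 0.5, 0.5, 0.5, 0.75, 0.25, 1.0]   # multiples of 2**-b
v_pert = [x + (i + 1) * 2 ** -l for i, x in enumerate(v)]

assert len(set(v_pert)) == len(v_pert)              # entries now distinct
assert abs(sum(v_pert) - sum(v)) <= N * (N + 1) / 2 * 2 ** -l
```

Distinct original entries differ by at least $2^{-b}$, which exceeds the largest offset $N 2^{-\ell}$, so no two perturbed entries collide.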
We briefly explain the overall strategy. Recall from the proof of Theorem 2.7 that it is useful to preprocess the vector $v$ by using quantum maximum finding to find $v_{\max} = \max_{i\in[N]} v_i$, and then to use amplitude estimation on the vector $w = v/v_{\max}$. We take this approach slightly further: we first find the largest $k$ entries $z_1, \dots, z_k$ of $v$, where $k = \Theta(pN)$ for $p \in (0,1)$, and sum their values classically. Let $z$ be the smallest value among $z_1, \dots, z_k$. For the next part, we treat the corresponding entries of $v$ as zero: checking whether an entry exceeds the threshold $z$ is a binary comparison, and hence can be done in superposition without explicitly using the indices of the large entries. We use the following lemma to derive a bound on the required precision for certain arithmetic operations.

For the quantile estimation, we use a subroutine from [Ham21]. Let $v \in [0,1]^N$. Then for $p \in (0,1)$, we define the $p$-quantile $Q(p) \in [0,1]$ as the largest value $z \in [0,1]$ such that there are at least $pN$ entries of $v$ which are larger than $z$. The subroutine we invoke produces an estimate of $Q(p)$ in the sense of Theorem 4.2. The actual access model for which that theorem holds is more general, but we have instantiated it for our setting. The gate-complexity overhead follows from having to implement their access model from ours, which involves arithmetic and comparisons on the fixed-point representations we use, and from the fact that the underlying technique is amplitude amplification.

We now get to the main theorem of this section, which proves the correctness of ApproxSum and analyzes its complexity.

Theorem 4.3. Let $v \in [0,1]^N$, let $U_v$ be a unitary implementing quantum query access to $(0,b)$-fixed-point representations of $v$, and let $\delta \in (0,1)$. Let $p, \rho \in (0,1)$ and choose $6 \le \lambda \le \min\{cpN/\log(pN/\rho), \log(cpN/\rho)^2\}$. Then ApproxSum computes, with probability $\ge 1-\rho$, a multiplicative $\delta$-approximation of $s = \sum_{i=1}^N v_i$. It uses quantum queries as stated above, and the number of
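A classical reference implementation of the $p$-quantile may clarify the definition: with distinct entries, $Q(p)$ is attained at the $\lceil pN\rceil$-th largest entry. The function name and the tie-breaking convention below are illustrative choices, not from the paper:

```python
import math

def quantile(v, p):
    """Classical analogue of the p-quantile Q(p): the ceil(p*N)-th largest
    entry of v, so that at least p*N entries are >= Q(p). The quantum
    subroutine of [Ham21] estimates this value with far fewer queries."""
    n = len(v)
    k = max(1, math.ceil(p * n))
    return sorted(v, reverse=True)[k - 1]

v = [0.1, 0.9, 0.35, 0.7, 0.2]
z = quantile(v, 0.4)            # ceil(0.4 * 5) = 2, so the 2nd largest entry
assert z == 0.7
assert sum(1 for x in v if x >= z) >= 0.4 * len(v)
```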
additional gates is similar, up to polylogarithmic factors.

Before we give the proof, we discuss two useful regimes for $p$ and $\lambda$:

Corollary 4.4. Let $v \in [0,1]^N$, let $U_v$ be a unitary implementing quantum oracle access to $(0,b)$-fixed-point representations of $v$, and let $\delta \in (0,1)$. Then we can find, with probability $\ge 1-\rho$, a multiplicative $\delta$-approximation of $s = \sum_{i=1}^N v_i$, using:
• $O(\sqrt{N\log(1/\rho)/\delta})$ quantum queries, when $p = \Theta(\log(1/\rho)/(\delta N)) < 1$ and we choose $\lambda = \min\{cpN/\log(6pN/\rho), \log(cpN/\rho)^2\} \ge 6$, and using $\sqrt{N/\delta}\cdot\mathrm{poly}(\log(1/\rho), b, \log(N), \log(1/\delta))$ additional gates, or
• $O(\sqrt{N/\delta}\,\log(1/\rho))$ quantum queries, when $p = \Theta(1/(\delta N)) < 1$ and we choose $\lambda = 6$, and using $\sqrt{N/\delta}\cdot\mathrm{poly}(\log(1/\rho), b, \log(N), \log(1/\delta))$ additional gates.
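The two regimes trade off the $\log(1/\rho)$ dependence; for any $\rho < 1/e$ the first regime's query count is strictly smaller. A quick numerical comparison with illustrative parameter values (the concrete $N$, $\delta$, $\rho$ are hypothetical):

```python
import math

# Illustrative parameters; any N, delta, rho with log(1/rho) > 1 behave alike.
N, delta, rho = 10**6, 0.01, 1e-6
L = math.log(1 / rho)

q_regime1 = math.sqrt(N * L / delta)        # O(sqrt(N log(1/rho) / delta))
q_regime2 = math.sqrt(N / delta) * L        # O(sqrt(N / delta) log(1/rho))

# Regime 1 improves the log(1/rho) dependence quadratically: sqrt(L) vs L.
assert q_regime1 < q_regime2
```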
Proof of Theorem 4.3. We assume without loss of generality that all the entries of $v$ are distinct. If this is not the case, one can perturb the $i$th entry of $v$ by $i2^{-\ell}$ for some sufficiently large $\ell = \Omega(\log(N) + b)$, where we recall that $b$ is the number of bits describing $v_i$, and discard these trailing bits from the output value $\tilde s$. We use Theorem 4.2 to find a value $z$ such that the number of elements of $v$ that are at least as large as $z$ is at most $pN$ and at least $cpN$; the number of quantum queries and the number of additional gates used for this are as given by Theorem 4.2.

As we have found all the $z_j$'s, we can compute their sum exactly; therefore, to determine a multiplicative $\delta$-approximation of $s$, we must produce an additive $\delta s$-approximation of $z\sum_{i=1}^N w_i$. Let $\varepsilon := \delta s$; note that we do not know $\varepsilon$, as we do not know $s$. Then we have to approximate $\frac{1}{N}\sum_{i=1}^N w_i$ with precision $\varepsilon/(Nz)$. For this, we use amplitude estimation as follows. First, one can implement query access to $U_w$ using two quantum queries to $v$ and $O(b\log(b))$ non-query gates: query an entry, compare the entry to $z$, conditioned on the comparison uncompute the query, and lastly perform the division by $z$. From this, we can construct a unitary $U$ with $U|0\rangle = |\psi\rangle$, where $|\psi\rangle$ encodes values $\widetilde{w}_i$ close to $w_i$. One can implement such a unitary as follows. First, set up a uniform superposition over the index register using $O(\log(N))$ gates. Use $U_w$ to load binary descriptions of the entries of $w$. Calculate a $\lceil\log_2(4N/\delta)\rceil$-bit approximation $\alpha_i$ of $\arcsin(\sqrt{w_i})$ using $O(\log(bN/\delta)\log^2\log(bN/\delta))$ gates [BZ10, Ch. 4]. Then conditionally rotate the last qubit from $0$ to $1$ over angles $\pi/4$, $\pi/8$, et cetera, depending on the bits of $\alpha_i$. Lastly, we uncompute $\alpha_i$ and $U_w$ to return the work registers to the zero state, and we have obtained the desired state $|\psi\rangle$, where $\sqrt{\widetilde w_i} = \sin(\alpha_i)$.

We now show that $\widetilde w_i = \sin(\alpha_i)^2$ is close to $w_i$. Since $\alpha_i$ is a $\lceil\log_2(4N/\delta)\rceil$-bit approximation of $\arcsin(\sqrt{w_i})$, we may apply the above with $\xi = \delta/(4N)$ for every $i \in [N]$. Because $s \ge z$, we have $\delta = \varepsilon/s \le \varepsilon/z$, and $\delta \le 1$, so the total error satisfies the stated bound. Next, we use this to derive an upper bound on $a$, where the last inequality uses $\varepsilon = \delta s \le s$ and $\sum_{i: v_i < z} v_i \le s$. Therefore, using AmpEst with $M$ applications of $U$ yields a number $\tilde a \in [0,1]$ with the guarantee of Lemma 2.3.

We now determine an appropriate number of rounds $M$ to be used for amplitude estimation. We will choose $M$ such that $|\tilde a - a| \le \frac{1}{2}\varepsilon/(Nz)$; if we do so, then the desired bound follows by the triangle inequality. Even though we do not know $\varepsilon$, by choosing $p$ carefully we can enforce upper bounds on $z$ and give a safe choice for $M$. We use that the number of entries $k$ which are at least $z$ satisfies $k \ge cpN$, so that $cpNz \le \sum_{i: v_i \ge z} v_i \le s$, i.e., $z \le s/(cpN)$. Therefore it suffices to take $M = 12\pi/\sqrt{\delta^2 pc}$, as this satisfies the required precision.

To amplify the success probability to $1-\rho$, we repeat the above procedure $\log(1/\rho)$ many times and output the median of the individual estimates. The query and gate complexities of the entire algorithm follow by combining those of the four parts: the quantile estimation, the approximate counting, Grover search for finding all large elements, and amplitude estimation for approximating the sum of the small elements.
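The median-amplification step at the end relies on a standard fact: the median of many runs of a constant-success-probability estimator is accurate whenever a majority of the runs succeed. A minimal classical sketch with hypothetical values (a true value $a = 0.3$, per-run accuracy $0.05$, and a $67$-out-of-$101$ success pattern):

```python
import random
import statistics

random.seed(1)
a = 0.3  # hypothetical true value to be estimated

# Simulate 67 "good" runs (within 0.05 of a) and 34 "failed" runs (arbitrary
# values in [0, 1]); any majority of good runs suffices for the argument.
runs = [a + random.uniform(-0.05, 0.05) for _ in range(67)]
runs += [random.uniform(0, 1) for _ in range(34)]
random.shuffle(runs)

# With at most 50 of the 101 values below a-0.05 and at most 50 above a+0.05,
# the 51st order statistic (the median) must lie in [a-0.05, a+0.05].
est = statistics.median(runs)
assert abs(est - a) <= 0.05
```

Running $O(\log(1/\rho))$ repetitions makes a minority of failures happen with probability at most $\rho$, by a Chernoff bound.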
$C\min\{N, N/(\delta^2 k)\}$ queries for some universal constant $C$. Consider the distribution $D$ on inputs that with probability $1/2$ samples a uniformly random element from $X_k$, and with probability $1/2$ samples a uniformly random element from $X_{k(1+\delta)}$. Suppose $A$ is a deterministic $t$-query algorithm that on input $x \sim D$ correctly returns $|x|$ with probability at least $5/6$ (where the probability is over the sample from $D$). Note that we allow $A$ to know $k$ and $\delta$. We show the desired lower bound on $t$. Let $A(x) = a$ denote the substring $(x_{i_1}, \dots, x_{i_t})$ of $x$ that corresponds to the $t$ queried indices $i_1, \dots, i_t$. Note that the output of the algorithm, an element of $\{k, k(1+\delta)\}$, is deterministic and only a function of $a$ (for a fixed algorithm). If one thinks of $A$ as a decision tree, then the first index to be queried does not depend on $a$, and after every subsequent query, the index to be queried is a deterministic function of the previously queried indices and outcomes. It follows that the queried indices $i_1, \dots, i_t$ are themselves a function of the query outcomes $a$. Therefore, we may view the output as just a function of $a = A(x)$. Let $B \subseteq \{0,1\}^t$ be the set of strings $a$ on which the algorithm outputs $k(1+\delta)$.
Let $P_\ell$ be the distribution of $a = A(x) \in \{0,1\}^t$ induced by $x \sim \mathrm{Unif}(X_\ell)$, where the latter refers to the uniform distribution on $X_\ell$. By assumption, $A$ can distinguish (with constant success probability) the distributions $P_k$ and $P_{k(1+\delta)}$. Therefore, the total variation distance between these distributions is large. Indeed, the probability, with respect to $D$, that $A$ outputs the wrong value of $|x|$ is at most $1/6$; therefore $A$ fails with probability at most $1/3$ when $x \sim \mathrm{Unif}(X_\ell)$, for both $\ell = k$ and $\ell = k(1+\delta)$, and hence $d_{TV}(P_k, P_{k(1+\delta)}) \ge 1/3$.

We now relate $P_\ell$ to the hypergeometric distribution $\mathrm{Hyp}(N, \ell, t)$, so that we can upper bound the above total variation distance as a function of $t$. We prove that $P_\ell(a)$ depends only on the Hamming weight $|a|$, and that under $P_\ell$ the weight $|a|$ is distributed as $\mathrm{Hyp}(N, \ell, t)$. We prove this by exploiting the permutation symmetry of the distribution on $X_\ell$, along with an iterative conditioning argument. Let $i_1, \dots, i_t$ denote the sequence of indices of $x$ queried, so that $A(x) = (x_{i_1}, \dots, x_{i_t})$. Recall that $i_1, \dots, i_t$ may be chosen adaptively, but $i_{j+1}$ is determined completely by $i_1, \dots, i_j$ and $x_{i_1}, \dots, x_{i_j}$.

We now give a $t$-dependent upper bound on this total variation distance; combined with the assumption that $d_{TV}(P_k, P_{k(1+\delta)}) \ge 1/3$, this will lead to the right lower bound on $t$. Let $\tau$ be as defined above. If $\tau \ge 1$, then in this case $t = \Omega(N/(\delta^2 k))$ (unless the latter is $O(1)$, in which case the query lower bound we aim for is constant and uninteresting). Otherwise, by the triangle inequality, we can bound the total variation distance by applying Theorems A.2 and A.3 (note that we needed $\tau < 1$). Since the left-hand side is at least $1/3$ as shown before, at least one of the two terms must be $1/6$ or greater. If $8t/N \ge 1/6$, then $t \ge N/48$ and we are done; otherwise, $\tau \ge 1/24$. If $k \le N/2$, then $N - k \ge N/2$ and one deduces $t + 2 \ge N/(144\delta^2 k)$.
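The hypergeometric claim can be verified exhaustively for small parameters: querying $t$ positions of a uniformly random weight-$\ell$ string yields a number of ones distributed as $\mathrm{Hyp}(N,\ell,t)$. The sketch below checks this for one (hypothetical) non-adaptive choice of query positions; by the symmetry argument in the text, the weight distribution is the same for adaptive queries:

```python
import math
from itertools import combinations

def hyp_pmf(N, l, t, j):
    # Pr[Hyp(N, l, t) = j]: number of ones among t positions drawn without
    # replacement from a string with l ones. math.comb(n, k) = 0 for k > n.
    return math.comb(l, j) * math.comb(N - l, t - j) / math.comb(N, t)

N, l, t = 8, 3, 4
positions = [0, 2, 5, 7]        # a fixed (non-adaptive) set of queried indices
counts = [0] * (t + 1)
total = 0
for ones in combinations(range(N), l):   # enumerate all of X_l exactly
    x = [1 if i in ones else 0 for i in range(N)]
    counts[sum(x[i] for i in positions)] += 1
    total += 1

for j in range(t + 1):
    assert abs(counts[j] / total - hyp_pmf(N, l, t, j)) < 1e-12
```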

2. Apply a NOT gate to the output register containing $b$, controlled on all $n$ qubits of the index register. This can be implemented using $O(n)$ Toffoli gates, one CNOT gate, and $n-1$ ancilla qubits, see [NC02, Fig. 4.10].
3. Apply the NOT gates from the first step again.

Lemma 3.2. Let $x \in \{0,1\}^N$, let $U_x$ be a quantum oracle to access $x$, and let $k_{ub} \ge 1$. If $|x| = k \le k_{ub}$, then GroverCertaintyMultiple($U_x$, $k_{ub}$) finds, with probability 1, all $k$ indices $i$ such that $x_i = 1$. The algorithm uses $O(\sqrt{N k_{ub}})$ applications of $U_x$, and $O(\sqrt{N k_{ub}}\,(k+1)\log(N))$ additional non-query gates.
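The $O(\sqrt{N/k})$ iteration count underlying such search subroutines can be illustrated with a toy statevector simulation of textbook Grover search with $k$ marked elements among $N$. This is a generic sketch of the amplitude dynamics, not the paper's exact GroverCertaintyMultiple procedure:

```python
import math

# Toy simulation of Grover iterations with k marked items among N.
N, k = 1024, 4
marked = set(range(k))                     # hypothetical marked positions
amp = [1 / math.sqrt(N)] * N               # uniform start state

iters = int(round((math.pi / 4) * math.sqrt(N / k)))
for _ in range(iters):
    for i in marked:                       # oracle: phase-flip marked entries
        amp[i] = -amp[i]
    mean = sum(amp) / N                    # diffusion: reflect about the mean
    amp = [2 * mean - a for a in amp]

p_success = sum(amp[i] ** 2 for i in marked)
assert p_success > 0.9                     # ~pi/4 * sqrt(N/k) iterations suffice
```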
and the output value $\tilde s = \sum_{j=1}^k z_j + Nz\tilde a$ satisfies $|\tilde s - s| \le \varepsilon = \delta s$. The number of quantum queries used for this step is therefore $O(M) = O(1/(\delta\sqrt{p}))$, and the number of additional gates used is $O(M\, b\log(b)\log(N/\delta)\log^2\log(N/\delta))$.
the algorithm runs forever). The number of additional gates used is $O(Q\log(N))$. The index $i$ which is output is uniformly random among all such indices, and independent of the value of $Q$.

Procedure Grover$_{2/3}$($U_x$, $k_{lb}$)
Input:
Output: An index $i \in [N]$.
Guarantee: If $|x| \ge 1$, then with probability $\ge 2/3$, $x_i = 1$.
Analysis: Lemma 2.6

Procedure ApproxCount($U_x$, $\varepsilon$, $\rho$)
By the assumption that the $v_i$ are all distinct, $cpN \le k \le pN$.
Let $z_1, \dots, z_k$ be the entries of $v$ that are $\ge z$. Then $\sum_{i=1}^N v_i = \sum_{j=1}^k z_j + z\sum_{i=1}^N w_i$.
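This split of the sum is a simple identity, since $w_i = v_i/z$ for $v_i < z$ and $w_i = 0$ otherwise. A quick numerical check with a hypothetical random input:

```python
import random

random.seed(2)
# s = (sum of the k largest entries, those >= z) + z * sum(w_i),
# where w_i = v_i / z for v_i < z and w_i = 0 otherwise.
N = 100
v = [random.random() for _ in range(N)]
z = sorted(v, reverse=True)[9]                    # 10th largest as threshold
large = [x for x in v if x >= z]                  # summed exactly (classically)
w = [(x / z if x < z else 0.0) for x in v]        # estimated via AmpEst

s_split = sum(large) + z * sum(w)
assert abs(s_split - sum(v)) < 1e-9
```

In the algorithm, only the second term is estimated (via amplitude estimation); the first is computed exactly, which is what makes the multiplicative guarantee possible.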