Quantum SDP-Solvers: Better upper and lower bounds

Brandão and Svore very recently gave quantum algorithms for approximately solving semidefinite programs, which in some regimes are faster than the best-possible classical algorithms in terms of the dimension $n$ of the problem and the number $m$ of constraints, but worse in terms of various other parameters. In this paper we improve their algorithms in several ways, getting better dependence on those other parameters. To this end we develop new techniques for quantum algorithms, for instance a general way to efficiently implement smooth functions of sparse Hamiltonians, and a generalized minimum-finding procedure. We also show limits on this approach to quantum SDP-solvers, for instance for combinatorial optimization problems that have a lot of symmetry. Finally, we prove some general lower bounds showing that in the worst case, the complexity of every quantum LP-solver (and hence also SDP-solver) has to scale linearly with $mn$ when $m\approx n$, which is the same as classical.

1 Introduction

Semidefinite programs
In the last decades, particularly since the work of Grötschel, Lovász, and Schrijver [GLS88], semidefinite programs (SDPs) have become an important tool for designing efficient optimization and approximation algorithms. SDPs generalize the better-known linear programs (LPs), but (like LPs) they are still efficiently solvable. Thanks to their stronger expressive power, SDPs can sometimes give better approximation algorithms than LPs. The basic form of an SDP is the following:

max Tr(CX)
s.t. Tr(A_j X) ≤ b_j for all j ∈ [m],    (1)
     X ⪰ 0,

where C, A_1, . . . , A_m are Hermitian n × n matrices and b_1, . . . , b_m are reals. For normalization purposes we assume ‖C‖, ‖A_j‖ ≤ 1. The number of constraints is m (we do not count the standard X ⪰ 0 constraint for this). The variable X of this SDP is an n × n positive semidefinite (psd) matrix. LPs correspond to the case where all matrices are diagonal.
A famous example is the algorithm of Goemans and Williamson [GW95] for approximating the size of a maximum cut in a graph G = ([n], E): the maximum, over all subsets S of vertices, of the number of edges between S and its complement. Computing MAXCUT(G) exactly is NP-hard. It corresponds to the following integer program

MAXCUT(G) = max Σ_{{i,j}∈E} (1 − v_i v_j)/2
            s.t. v_j ∈ {+1, −1} for all j ∈ [n],

using the fact that (1 − v_i v_j)/2 = 1 if v_i and v_j have different signs, and (1 − v_i v_j)/2 = 0 if they have the same sign. We can relax this integer program by replacing the signs v_j by unit vectors, and replacing the product v_i v_j in the objective function by the dot product v_i^T v_j. We can implicitly optimize over such vectors (of unspecified dimension) by explicitly optimizing over an n × n psd matrix X whose diagonal entries are 1. This X is the Gram matrix of the vectors v_1, . . . , v_n, so X_ij = v_i^T v_j. The resulting relaxation is

max Σ_{{i,j}∈E} (1 − X_ij)/2
s.t. Tr(E_jj X) = 1 for all j ∈ [n],
     X ⪰ 0,

where E_jj is the n × n matrix that has a 1 at the (j, j)-entry, and 0s elsewhere. This SDP is a relaxation of a maximization problem, so it may overshoot the correct value, but Goemans and Williamson showed that an optimal solution to the SDP can be rounded to a cut in G whose size is within a factor ≈ 0.878 of MAXCUT(G) (which is optimal under the Unique Games Conjecture [KKMO07]). This SDP can be massaged into the form of (1) by replacing the equality Tr(E_jj X) = 1 by the inequality Tr(E_jj X) ≤ 1 (so m = n) and letting C be a properly normalized version of the Laplacian of G.
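As a concrete illustration of the identity behind this integer program, the following Python sketch (ours, purely illustrative) computes MAXCUT(G) exactly by brute force; it is of course exponential in n, which is why one resorts to the SDP relaxation:

```python
import itertools

def maxcut_brute_force(n, edges):
    """Exact MAXCUT by enumerating all sign vectors v in {-1,+1}^n and
    counting cut edges via the identity (1 - v_i*v_j)/2."""
    best = 0
    for v in itertools.product([-1, 1], repeat=n):
        cut = sum((1 - v[i] * v[j]) // 2 for (i, j) in edges)
        best = max(best, cut)
    return best

# 4-cycle: bipartite, so all 4 edges can be cut
print(maxcut_brute_force(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # prints 4
```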

Classical solvers for LPs and SDPs
Ever since Dantzig's development of the simplex algorithm for solving LPs in the 1940s [Dan51], much work has gone into finding faster solvers, first for LPs and then also for SDPs. The simplex algorithm for LPs (with some reasonable pivot rule) is usually fast in practice, but has worst-case exponential runtime. The ellipsoid method and interior-point methods can solve LPs and SDPs in polynomial time; they will typically approximate the optimal value to arbitrary precision [GLS81, NN94]. The best known general SDP-solvers [LSW15] approximate the optimal value OPT of such an SDP up to additive error ε, with complexity O(m(m² + n^ω + mns) polylog(m, n, R, 1/ε)), where ω ∈ [2, 2.373) is the (still unknown) optimal exponent for matrix multiplication; s is the sparsity: the maximal number of non-zero entries per row of the input matrices; and R is an upper bound on the trace of an optimal X. The assumption here is that the rows and columns of the matrices of SDP (1) can be accessed as adjacency lists: we can query, say, the ℓth non-zero entry of the kth row of matrix A_j in constant time.
Arora and Kale [AK16] gave an alternative way to approximate OPT, using a matrix version of the "multiplicative weights update" method. In Section 2.1 we will describe their framework in more detail, but in order to describe our result we will start with an overly simplified sketch here. The algorithm goes back and forth between candidate solutions to the primal SDP and to the corresponding dual SDP, whose variables are non-negative reals y_1, . . . , y_m:

min b^T y
s.t. Σ_{j=1}^m y_j A_j − C ⪰ 0,    (2)
     y ≥ 0.

Under assumptions that will be satisfied everywhere in this paper, strong duality applies: the primal SDP (1) and dual SDP (2) will have the same optimal value OPT. The algorithm does a binary search for OPT by trying different guesses α for it. Suppose we have fixed some α, and want to find out whether α is bigger or smaller than OPT. Start with some candidate solution X^(1) for the primal, for example a multiple of the identity matrix (X^(1) has to be psd but need not be a feasible solution to the primal). This X^(1) induces the following polytope:

P_ε(X^(1)) := { y ∈ R^m : y ≥ 0, b^T y ≤ α, Tr((Σ_{j=1}^m y_j A_j − C) X^(1)) ≥ −ε }.

This polytope can be thought of as a relaxation of the feasible region of the dual SDP with the extra constraint that OPT ≤ α: instead of requiring that Σ_j y_j A_j − C is psd, we merely require that its inner product with the particular psd matrix X^(1) is not too negative. The algorithm then calls an "oracle" that provides a y^(1) ∈ P_ε(X^(1)), or outputs "fail" if P_0(X^(1)) is empty (how to efficiently implement such an oracle depends on the application). In the "fail" case we know there is no dual-feasible y with objective value ≤ α, so we can increase our guess α for OPT, and restart. In case the oracle produced a y^(1), this is used to define a Hermitian matrix H^(1) and a new candidate solution X^(2) for the primal, which is proportional to e^{−H^(1)}.
Then the oracle for the polytope P_ε(X^(2)) induced by this X^(2) is called to produce a candidate y^(2) ∈ P_ε(X^(2)) for the dual (or "fail"), this is used to define H^(2) and X^(3) proportional to e^{−H^(2)}, and so on. Surprisingly, the average of the dual candidates y^(1), y^(2), . . . converges to a nearly-dual-feasible solution. Let R be an upper bound on the trace of an optimal X of the primal, r be an upper bound on the sum of entries of an optimal y for the dual, and w* be the "width" of the oracle for a certain SDP: the maximum of ‖Σ_{j=1}^m y_j A_j − C‖ over all psd matrices X and all vectors y that the oracle may output for the corresponding polytope P_ε(X). In general we will not know the width of an oracle exactly, but only an upper bound w ≥ w*, that may depend on the SDP; this is, however, enough for the Arora-Kale framework. In Section 2.1 we will show that without loss of generality we can assume the oracle returns a y such that ‖y‖_1 ≤ r. Because we assumed ‖A_j‖, ‖C‖ ≤ 1, we have w* ≤ r + 1 as an easy width-bound. General properties of the multiplicative weights update method guarantee that after T = O(w²R²/ε²) iterations, if no oracle call yielded "fail", then the vector (1/T) Σ_{t=1}^T y^(t) is close to dual-feasible and satisfies b^T y ≤ α. This vector can then be turned into a dual-feasible solution by tweaking its first coordinate, certifying that OPT ≤ α + ε, and we can decrease our guess α for OPT accordingly.
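The exponential-update step just described can be illustrated with a short numpy sketch (a toy model of the update rule only, not the full framework; the loss matrices here stand in for the matrices derived from the oracle's answers):

```python
import numpy as np

def mmw_states(loss_matrices, eta):
    """Matrix Multiplicative Weights update: given symmetric loss matrices
    M^(1), ..., M^(T) with -I <= M <= I, return the density matrices
    rho^(t+1) proportional to exp(-eta * (M^(1) + ... + M^(t)))."""
    n = loss_matrices[0].shape[0]
    H = np.zeros((n, n))
    states = [np.eye(n) / n]  # rho^(1) is the maximally mixed state
    for M in loss_matrices:
        H += M
        # matrix exponential of a symmetric matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(-eta * H)
        expH = (vecs * np.exp(vals)) @ vecs.T
        states.append(expH / np.trace(expH))
    return states

# after several rounds of loss diag(1, -1), the state concentrates
# on the low-loss coordinate
rho = mmw_states([np.diag([1.0, -1.0])] * 10, eta=0.5)[-1]
```

Each state is psd with unit trace. In the quantum algorithms discussed below, these matrix exponentials are never computed explicitly, which is exactly where the speed-up enters.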
The framework of Arora and Kale is really a meta-algorithm, because it does not specify how to implement the oracle. They themselves provide oracles that are optimized for special cases, which allows them to give a very low width-bound for these specific SDPs. For example for the MAXCUT SDP, they obtain a solver with near-linear runtime in the number of edges of the graph. They also observed that the algorithm can be made more efficient by not explicitly calculating the matrix X^(t) in each iteration: the algorithm can still be made to work if instead of providing the oracle with X^(t), we feed it good estimates of Tr(A_j X^(t)) and Tr(CX^(t)). Arora and Kale do not describe oracles for general SDPs, but as we show at the end of Section 2.4 (using Appendix A to estimate Tr(A_j X^(t)) and Tr(CX^(t))), one can get a general classical SDP-solver in their framework with complexity

Õ( nms (Rr/ε)^4 + ns (Rr/ε)^7 ).    (3)

Compared to the complexity of the SDP-solver of [LSW15], this has much worse dependence on R and ε, but better dependence on m and n. Using the Arora-Kale framework is thus preferable over standard SDP-solvers for the case where Rr is small compared to mn, and a rough approximation to OPT (say, small constant ε) is good enough. It should be noted that for many specific cases, Arora and Kale get significantly better upper bounds than (3) by designing oracles that are specifically optimized for those cases.

Quantum SDP-solvers: the Brandão-Svore algorithm
Given the speed-ups that quantum computers give over classical computers for various problems [Sho97, Gro96,DHHM06,Amb07,HHL09], it is natural to ask whether quantum computers can solve LPs and SDPs more efficiently as well. Very little was known about this, until recently when Brandão and Svore [BS17] discovered quantum algorithms that significantly outperform classical SDP-solvers in certain regimes. Because of the general importance of quickly solving LPs and SDPs, and the limited number of quantum algorithms that have been found so far, this is a very interesting development.
The key idea of the Brandão-Svore algorithm is to take the Arora-Kale approach and to replace two of its steps by more efficient quantum subroutines. First, given a vector y^(t−1), it turns out one can use "Gibbs sampling" to prepare the new primal candidate X^(t) ∝ e^{−H^(t−1)} as a log(n)-qubit quantum state ρ^(t) := X^(t)/Tr(X^(t)) in much less time than needed to compute X^(t) as an n × n matrix. Second, one can efficiently implement the oracle for P_ε(X^(t)) based on a number of copies of ρ^(t), using those copies to estimate Tr(A_j ρ^(t)) and Tr(A_j X^(t)) when needed (note that Tr(Aρ) is the expectation value of operator A for the quantum state ρ). This is based on something called "Jaynes's principle." The resulting oracle is weaker than what is used classically, in the sense that it outputs a sample j ∼ y_j/‖y‖_1 rather than the whole vector y. However, such sampling still suffices to make the algorithm work (it also means we can assume the vector y^(t) to be quite sparse).
Using these ideas, Brandão and Svore obtain a quantum SDP-solver with complexity

Õ( √(mn) s² (Rr/δ)^{32} ),

with multiplicative error 1 ± δ for the special case where b_j ≥ 1 for all j ∈ [m], and OPT ≥ 1 (the latter assumption allows them to convert additive error ε to multiplicative error δ) [BS17, Corollary 5 in arXiv version 4]. They describe a reduction to transform a general SDP of the form (1) to this special case, but that reduction significantly worsens the dependence of the complexity on the parameters R, r, and δ.
Note that compared to the runtime (3) of our general instantiation of the original Arora-Kale framework, there are quadratic improvements in both m and n, corresponding to the two quantum modifications made to Arora-Kale. However, the dependence on R, r, s and 1/ε is much worse now than in (3). This quantum algorithm thus provides a speed-up only in regimes where R, r, s, 1/ε are fairly small compared to mn (finding good examples of SDPs in such regimes is an open problem).

Our results
In this paper we present two sets of results: improvements to the Brandão-Svore algorithm, and better lower bounds for the complexity of quantum LP-solvers (and hence for quantum SDP-solvers as well).

Improved quantum SDP-solver
Our quantum SDP-solver, like the Brandão-Svore algorithm, works by quantizing some aspects of the Arora-Kale algorithm. However, the way we quantize is different and faster than theirs.
First, we give a more efficient procedure to estimate the quantities Tr(A_j ρ^(t)) required by the oracle. Instead of first preparing some copies of the Gibbs state ρ^(t) ∝ e^{−H^(t−1)} as a mixed state, we coherently prepare a purification of ρ^(t), which can then be used to estimate Tr(A_j ρ^(t)) more efficiently using amplitude-estimation techniques. Also, our purified Gibbs sampler has logarithmic dependence on the error, which is exponentially better than the Gibbs sampler of Poulin and Wocjan [PW09b] that Brandão and Svore invoke. Chowdhury and Somma [CS17] also gave a Gibbs sampler with logarithmic error-dependence, but assuming query access to the entries of √H rather than H itself. Second, we have a different implementation of the oracle, without using Gibbs sampling or Jaynes's principle (though, as mentioned above, we still use purified Gibbs sampling for approximating the Tr(Aρ) quantities). We observe that the vector y^(t) can be made very sparse: two non-zero entries suffice. We then show how we can efficiently find such a 2-sparse vector (rather than merely sampling from it) using two applications of a new generalization of the well-known quantum minimum-finding algorithm of Dürr and Høyer [DH96], which is based on Grover search [Gro96].
The dependence on m, n, and s is the same as in Brandão-Svore, but our dependence on R, r, and 1/ε is substantially better. Note that each of the three parameters R, r, and 1/ε now occurs with the same 8th power in the complexity. This is no coincidence: as we show in Appendix E, these three parameters can all be traded for one another, in the sense that we can massage the SDP to make each one of them small at the expense of making the others proportionally bigger. These trade-offs suggest we should actually think of Rr/ε as one parameter of the primal-dual pair of SDPs, not three separate parameters. For the special case of LPs, we can improve the runtime to Õ( √(mn) (Rr/ε)^5 ).
Like in Brandão-Svore, our quantum oracle produces very sparse vectors y, in our case even of sparsity 2. This means that after T iterations, the final ε-optimal dual-feasible vector (which is a slightly tweaked version of the average of the T y-vectors produced in the T iterations) has only O(T) non-zero entries. Such sparse vectors have some advantages, for example they take much less space to store than arbitrary y ∈ R^m. In fact, to obtain a runtime sublinear in m, this sparsity is necessary. However, this sparsity of the algorithm's output also points to a weakness of these methods: if every ε-optimal dual-feasible vector y has many non-zero entries, then the number of iterations needs to be large. For example, if every ε-optimal dual-feasible vector y has Ω(m) non-zero entries, then these methods require T = Ω(m) iterations before they can reach an ε-optimal dual-feasible vector. Since T = O((R²r²/ε²) ln(n)), this would imply that Rr/ε = Ω(√(m/ln(n))), and hence many classical SDP-solvers would have a better complexity than our quantum SDP-solver. As we show in Section 3, this will actually be the case for families of SDPs that have a lot of symmetry.

Tools that may be of more general interest
Along the way to our improved SDP-solver, we developed some new techniques that may be of independent interest. These are mostly tucked away in appendices, but here we will highlight two.
Implementing smooth functions of a given Hamiltonian. In Appendix B we give a general technique to efficiently apply a smooth function f(H) of a sparse Hamiltonian H to a quantum state; this underlies our purified Gibbs sampler and our subroutines for estimating Tr(Aρ).

A generalized minimum-finding algorithm. Dürr and Høyer [DH96] showed how to find the minimal value of a function f : [N] → R using O(√N) queries to f, by repeatedly using Grover search to find smaller and smaller elements of the range of f. In Appendix C we describe a more general minimum-finding procedure. Suppose we have a unitary U which prepares a quantum state U|0⟩ = Σ_{k=1}^N |ψ_k⟩|x_k⟩, where the |ψ_k⟩ are unnormalized states. Our procedure can find the minimum value x_{k*} among the x_k's that have support in the second register, using roughly O(1/‖ψ_{k*}‖) applications of U and U^{−1}. Also, upon finding the minimal value x_{k*} the procedure actually outputs the normalized state proportional to |ψ_{k*}⟩|x_{k*}⟩. This immediately gives the Dürr-Høyer result as a special case, if we take U to produce U|0⟩ = (1/√N) Σ_{k=1}^N |k⟩|f(k)⟩ using one query to f. Unlike Dürr-Høyer, we need not assume direct query access to the individual values f(k).
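For intuition, here is the classical skeleton of that threshold-lowering loop in Python (our toy model: the Grover search that finds a smaller element is replaced by uniform sampling among the still-smaller elements, so only the structure of the loop, not the quantum speed-up, is captured):

```python
import random

def minimum_finding_skeleton(values, rng=random):
    """Classical skeleton of Durr-Hoyer minimum-finding: repeatedly move the
    threshold to a uniformly random element strictly below it (the step that
    Grover search implements quantumly) until no smaller element remains."""
    threshold = rng.choice(values)
    updates = 0
    while True:
        smaller = [v for v in values if v < threshold]
        if not smaller:
            return threshold, updates
        threshold = rng.choice(smaller)
        updates += 1
```

The expected number of threshold updates is O(log N), and quantumly each update costs roughly √(N/k) Grover iterations when k elements remain below the threshold, which is how the O(√N) total arises.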
More interestingly for us, for a given n-dimensional Hamiltonian H, if we combine our minimum-finder with phase estimation using the unitary U = e^{iH} on one half of a maximally entangled state, then we obtain an algorithm for estimating the smallest eigenvalue of H (and preparing its ground state) using roughly O(√n) applications of phase estimation with U. A similar result on approximating the smallest eigenvalue of a Hamiltonian was already shown by Poulin and Wocjan [PW09a], but we improve on their analysis in order to apply it as a subroutine in our procedure to estimate Tr(A_j ρ).

Lower bounds
What about lower bounds for quantum SDP-solvers? Brandão and Svore already proved that a quantum SDP-solver has to make Ω(√n + √m) queries to the input matrices, for some SDPs. Their lower bound is for a family of SDPs where s, R, r, and 1/ε are all constant, and is proved by a reduction from a search problem.
In this paper we prove lower bounds that are quantitatively stronger in m and n, but for SDPs with non-constant R and r. The key idea is to consider a Boolean function F on N = abc input bits that is the composition of an a-bit majority function with a b-bit OR function with a c-bit majority function. The known quantum query complexities of majority and OR, combined with composition properties of the adversary lower bound, imply that every quantum algorithm that computes this function requires Ω(a√b c) queries. We define a family of LPs with non-constant R and r such that a constant-error approximation of OPT computes F. Choosing a, b, and c appropriately, this implies a lower bound of

Ω( √(max{n, m}) (min{n, m})^{3/2} )

queries to the entries of the input matrices for quantum LP-solvers. Since LPs are SDPs with sparsity s = 1, we get the same lower bound for quantum SDP-solvers. If m and n are of the same order, this lower bound is Ω(mn), the same scaling with mn as the classical general instantiation of Arora-Kale (3).

In a different algorithmic direction, Kerenidis and Prakash [KP18] recently obtained a quantum interior-point method for solving SDPs and LPs. It is hard to compare the latter algorithm to the other SDP-solvers for two reasons. First, the output of their algorithm consists only of almost-feasible solutions to the primal and dual (their algorithm has a polynomial dependence on the distance to feasibility). It is therefore not clear what their output implies for the optimal value of the SDPs. Second, the runtime of their algorithm depends polynomially on the condition number of the matrices that the interior-point method encounters, and no explicit bounds for these condition numbers are given.
Our results on implementing smooth functions of a given Hamiltonian have been extended to more general input models (block-encodings) in [GSLW19]. This recent paper builds on some of our techniques, but achieves slightly improved complexities by directly implementing the transformations without using Hamiltonian simulation as a subroutine.
Recently van Apeldoorn et al. [vAGGdW20] and Chakrabarti et al. [CCLW20] developed quantum algorithms for general black-box convex optimization, where one optimizes over a general convex set K, and the access to K is via membership and/or separation oracles. Since we work in a model where we are given access directly to the constraints defining the problem, our results are incomparable to theirs.
Organization. The paper is organized as follows. In Section 2 we start with a description of the Arora-Kale framework for SDP-solvers, and then we describe how to quantize different aspects of it to obtain a quantum SDP-solver with better dependence on R, r, and 1/ε (or rather, on Rr/ε) than Brandão and Svore got. In Section 3 we describe the limitations of primal-dual SDP-solvers using general oracles (not optimized for specific SDPs) that produce sparse dual solutions y: if good solutions are dense, this puts a lower bound on the number of iterations needed. In Section 4 we give our lower bounds. A number of the proofs are relegated to the appendices: how to classically approximate Tr(A j ρ) (Appendix A), how to efficiently implement smooth functions of Hamiltonians on a quantum computer (Appendix B), our generalized method for minimum-finding (Appendix C), upper and lower bounds on how efficiently we can query entries of sums of sparse matrices (Appendix D), and how to trade off the parameters R, r, and 1/ε against each other (Appendix E).

An improved quantum SDP-solver
Here we describe our quantum SDP-solver. In Section 2.1 we describe the framework designed by Arora and Kale for solving semidefinite programs. As in the recent work by Brandão and Svore, we use this framework to design an efficient quantum algorithm for solving SDPs. In particular, we show that the key subroutine needed in the Arora-Kale framework can be implemented efficiently on a quantum computer. Our implementation uses different techniques than the quantum algorithm of Brandão and Svore, allowing us to obtain a faster algorithm. The techniques required for this subroutine are developed in Sections 2.2 and 2.3. In Section 2.4 we put everything together to prove the main theorem of this section (the notation is explained below):

Theorem 1. Instantiating Meta-Algorithm 1 using the trace calculation algorithm from Section 2.2 and the oracle from Section 2.3 (with width-bound w := r + 1), and using this to do a binary search for OPT ∈ [−R, R] (using different guesses α for OPT), gives a quantum algorithm for solving SDPs of the form (1), which (with high probability) produces a feasible solution y to the dual program which is optimal up to an additive error ε, and uses

Õ( √(nm) s² (Rr/ε)^8 )

queries to the input matrices and the same order of other gates.
Notation/Assumptions. We use log to denote the logarithm in base 2. We denote the all-zero matrix and vector by 0, and the all-one vector by 1. Throughout we assume each element of the input matrices can be represented by a bitstring of size poly(log n, log m). We use s for the sparsity of the input matrices, that is, the maximum number of non-zero entries in a row (or column) of any of the matrices C, A_1, . . . , A_m is s. Recall that for normalization purposes we assume ‖A_1‖, . . . , ‖A_m‖, ‖C‖ ≤ 1. We furthermore assume that A_1 = I and b_1 = R, that is, the trace of primal-feasible solutions is bounded by R (and hence also the trace of primal-optimal solutions is bounded by R). The analogous quantity for the dual SDP (2), an upper bound on Σ_{j=1}^m y_j for an optimal dual solution y, will be denoted by r. However, we do not add the constraint Σ_{j=1}^m y_j ≤ r to the dual. We will assume r ≥ 1. For r to be well-defined we have to make the explicit assumption that the optimal solution in the dual is attained. In Section 3 it will be necessary to work with the best possible upper bounds: we let R* be the smallest trace of an optimal solution to SDP (1), and we let r* be the smallest ℓ_1-norm of an optimal solution to the dual. These quantities are well-defined; indeed, both the primal and dual optimum are attained: the dual optimum is attained by assumption, and due to the assumption A_1 = I, the dual SDP is strictly feasible, which means that the optimum in (1) is attained.
Unless specified otherwise, we always consider additive error. In particular, an ε-optimal solution to an SDP will be a feasible solution whose objective value is within additive error ε of the optimum. Specifically in the quantum case, as described in [BCK15], we assume access to an oracle O_I which serves the purpose of sparse access. O_I calculates the function index_{A_j} : [n] × [s] → [n], which for input (k, ℓ) gives the column index of the ℓth non-zero element in the kth row of A_j. We assume this oracle computes the index "in place":

O_I |j, k, ℓ⟩ = |j, k, index_{A_j}(k, ℓ)⟩.    (4)

(In the degenerate case where the kth row has fewer than ℓ non-zero entries, index_{A_j}(k, ℓ) is defined to be ℓ together with some special symbol.) We also assume we can apply the inverse of O_I. We also need another oracle O_M, returning a bitstring representation of (A_j)_{ki} for any j ∈ {0} ∪ [m] and k, i ∈ [n]:

O_M |j, k, i, z⟩ = |j, k, i, z ⊕ (A_j)_{ki}⟩,

where A_0 := C. The slightly unusual "in place" definition of oracle O_I is not too demanding. In particular, if instead we had an oracle that computed the non-zero entries of a row in order, then we could implement both O_I and its inverse using log(s) queries (we can compute ℓ from k and index_{A_j}(k, ℓ) using binary search) [BCK15].
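As a toy classical model of this access structure (the class name and layout are ours, for illustration only), one can store each row's non-zero column indices in sorted order; the forward direction is then an array lookup and the inverse direction is the log(s) binary search mentioned above:

```python
import bisect

class SparseRowOracle:
    """Toy classical model of sparse access for one matrix:
    rows[k] holds the column indices of the non-zero entries of row k."""
    def __init__(self, rows):
        self.rows = [sorted(r) for r in rows]

    def index(self, k, l):
        """Column index of the l-th (1-based) non-zero element of row k."""
        return self.rows[k][l - 1]

    def position(self, k, col):
        """Inverse direction: recover l from (k, col) by binary search,
        using log(s) comparisons, as in the [BCK15] remark."""
        l = bisect.bisect_left(self.rows[k], col)
        assert self.rows[k][l] == col, "entry (k, col) is zero"
        return l + 1
```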
Computational model: As our computational model, we assume a slight relaxation of the usual quantum circuit model: a classical control system that can run quantum subroutines. We limit the classical control system so that its number of operations is at most a polylogarithmic factor bigger than the gate complexity of the quantum subroutines, i.e., if the quantum subroutines use C gates, then the classical control system may not use more than O(C polylog(C)) operations.
When we talk about gate complexity, we count the number of two-qubit quantum gates needed for implementation of the quantum subroutines. Additionally, we assume for simplicity that there exists a unit-cost QRAM gate that allows us to store and retrieve qubits in a memory, by means of a swap of two registers indexed by another register:

QRAM : |i, x, r_1, . . . , r_K⟩ ↦ |i, r_i, r_1, . . . , r_{i−1}, x, r_{i+1}, . . . , r_K⟩,

where the registers r_1, . . . , r_K are only accessible through this gate. The QRAM gate can be seen as a quantum analogue of pointers in classical computing. The only place where we need QRAM is in Appendix D, for a data structure that allows efficient access to the non-zero entries of a sum of sparse matrices; for the special case of LP-solving it is not needed.

The Arora-Kale framework for solving SDPs
In this section we give a short introduction to the Arora-Kale framework for solving semidefinite programs. We refer to [AK16,AHK12] for a more detailed description and omitted proofs.
The key building block is the Matrix Multiplicative Weights (MMW) algorithm introduced by Arora and Kale in [AK16]. The MMW algorithm can be seen as a strategy for you in a game between you and an adversary. We first introduce the game. There is a number of rounds T. In each round you present a density matrix ρ to the adversary, who replies with a loss matrix M satisfying −I ⪯ M ⪯ I. After each round you have to pay Tr(Mρ). Your objective is to pay as little as possible. The MMW algorithm is a strategy that guarantees your total loss is not much larger than that of the best fixed density matrix in hindsight. In Algorithm 1 we state the MMW algorithm; in each round t it proceeds as follows:
1. Show the density matrix ρ^(t) to the adversary.
2. Obtain the loss matrix M^(t) from the adversary.
3. Update the density matrix: ρ^(t+1) ∝ exp(−η Σ_{τ=1}^t M^(τ)).
Arora and Kale use the MMW algorithm to construct an SDP-solver. For that, they construct an adversary who promises to satisfy an additional condition: in each round t, the adversary returns a matrix M^(t) whose trace inner product with the density matrix ρ^(t) is non-negative. The MMW guarantee then shows that after T rounds, the average of the adversary's responses satisfies the stronger condition that its smallest eigenvalue is not too negative. More explicitly, the MMW algorithm is used to build a vector y such that the smallest eigenvalue of the matrix Σ_{j=1}^m y_j A_j − C is only slightly below zero and y's objective value is at most α. Since A_1 = I, increasing the first coordinate of y makes the smallest eigenvalue of Σ_j y_j A_j − C bigger, so that this matrix becomes psd and hence y becomes dual-feasible. By the above we know how much the minimum eigenvalue has to be shifted, and with the right choice of parameters it can be shown that this gives a dual-feasible vector y that satisfies b^T y ≤ α + ε. In order to present the algorithm formally, we require some definitions.
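This first-coordinate shift is easy to state in code. The sketch below (illustrative; the function name is ours) uses the fact that A_1 = I, so adding |λ_min| to the first coordinate adds |λ_min| · I to Σ_j y_j A_j − C:

```python
import numpy as np

def shift_to_feasible(y_bar, A, C):
    """Given a near-feasible dual vector y_bar for which
    lambda_min(sum_j y_j A_j - C) may be slightly negative, and A[0] = I,
    increase the first coordinate just enough to make the matrix psd."""
    S = sum(yj * Aj for yj, Aj in zip(y_bar, A)) - C
    lam_min = np.linalg.eigvalsh(S).min()
    y = np.array(y_bar, dtype=float)
    if lam_min < 0:
        y[0] += -lam_min  # shifts S by |lam_min| * I since A[0] = I
    return y
```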
Given a candidate solution X ⪰ 0 for the primal problem (1) and a parameter ε ≥ 0, define the polytope

P_ε(X) := { y ∈ R^m : y ≥ 0, b^T y ≤ α, Tr((Σ_{j=1}^m y_j A_j − C) X) ≥ −ε }.

One can verify the following:

Lemma 3 ([AK16, Lemma 4.2]). If for a given candidate solution X ⪰ 0 the polytope P_0(X) is empty, then a scaled version of X is primal-feasible and of objective value at least α.
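Given the trace values, membership in this polytope is cheap to verify. A minimal numpy sketch (ours), assuming the polytope P_ε(X) = { y ≥ 0 : b^T y ≤ α, Tr((Σ_j y_j A_j − C)X) ≥ −ε } used in the Arora-Kale framework:

```python
import numpy as np

def in_polytope(y, X, A, b, C, alpha, eps):
    """Check y in P_eps(X): y is entrywise non-negative, its objective
    value b.y is at most alpha, and the trace inner product of
    sum_j y_j A_j - C with the psd matrix X is at least -eps."""
    y = np.asarray(y, dtype=float)
    if np.any(y < 0) or np.dot(b, y) > alpha:
        return False
    S = sum(yj * Aj for yj, Aj in zip(y, A)) - C
    return bool(np.trace(S @ X) >= -eps)
```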
The Arora-Kale framework for solving SDPs uses the MMW algorithm where the role of the adversary is taken by an ε-approximate oracle:

Input: An n × n psd matrix X, a number α ∈ [−R, R], and the description of an SDP as in (1).
Output: Either Oracle_ε returns a vector y from the polytope P_ε(X), or it outputs "fail". It may only output "fail" if P_0(X) = ∅.
Algorithm 2: Definition of an ε-approximate Oracle_ε for maximization SDPs

As we will see later, the runtime of the Arora-Kale framework depends on a property of the oracle called the width:

Definition 4 (Width of Oracle_ε). The width of Oracle_ε for an SDP is the smallest w* ≥ 0 such that for every psd matrix X and every α ∈ [−R, R], the vector y returned by Oracle_ε satisfies ‖Σ_{j=1}^m y_j A_j − C‖ ≤ w*.
In practice, the width of an oracle is not always known. However, it suffices to work with an upper bound w ≥ w*: as we can see in Meta-Algorithm 1, the purpose of the width is to rescale the matrix M^(t) in such a way that it forms a valid response for the adversary in the MMW algorithm. The following theorem shows the correctness of the Arora-Kale primal-dual meta-algorithm for solving SDPs, stated in Meta-Algorithm 1:

Theorem 5 ([AK16, Theorem 4.7]). Suppose we are given an SDP of the form (1) with input matrices A_1 = I, A_2, . . . , A_m and C having operator norm at most 1, and input reals b_1 = R, b_2, . . . , b_m. If Meta-Algorithm 1 does not output "fail" in any of the rounds, then the returned vector y is feasible for the dual (2) with objective value at most α + ε. If Oracle_{ε/3} outputs "fail" in the t-th round, then a suitably scaled version of X^(t) is primal-feasible with objective value at least α.
The SDP-solver uses T = 9w 2 R 2 ln(n) ε 2 iterations. In each iteration several steps have to be taken. The most expensive two steps are computing the matrix exponential of the matrix −ηH (t) and the application of the oracle. Note that the only purpose of computing the matrix exponential is to allow the oracle to compute the values Tr(A j X) for all j and Tr(CX), since the polytope depends on X only through those values. To obtain faster algorithms it is important to note, as was done already by Arora and Kale, that the primal-dual algorithm also works if we provide a (more accurate) oracle with approximations of Tr(A j X). Let a j := Tr(A j ρ) = Input The input matrices and reals of SDP (1) and trace bound R. The current guess α of the optimal value of the dual (2). An additive error tolerance ε > 0. An ε 3 -approximate oracle Oracle ε/3 as in Algorithm 2 with width-bound w.
Output: Either "Lower" and a vector $y \in \mathbb{R}^m_+$ feasible for (2) with $b^T y \leq \alpha + \varepsilon$, or "Higher" and a symmetric $n \times n$ matrix $X$ that, when scaled suitably, is primal-feasible with objective value at least $\alpha$.
if Oracle_{ε/3} outputs "fail" then return "Higher" and a description of $X^{(t)}$. end if
Let $y^{(t)}$ be the vector generated by Oracle_{ε/3}.
Update the state matrix as follows: $\rho^{(t+1)} := \exp(-\eta H^{(t)}) / \mathrm{Tr}\!\left(\exp(-\eta H^{(t)})\right)$. end for
If Oracle_{ε/3} does not output "fail" in any of the $T$ rounds, then output the dual solution $y = \frac{\varepsilon}{R} e_1 + \frac{1}{T}\sum_{t=1}^T y^{(t)}$.

For convenience we will denote $\tilde a = (\tilde a_1, \ldots, \tilde a_m)$ and $c' := \tilde c - (r+1)\theta$. Notice that $\tilde P$ also contains a new type of constraint: $\sum_j y_j \leq r$. Recall that $r$ is defined as a positive real such that there exists an optimal solution $y$ to SDP (2) with $\|y\|_1 \leq r$. Hence, using that $P_0(X)$ is a relaxation of the feasible region of the dual (with bound $\alpha$ on the objective value), we may restrict our oracle to return only such $y$. The benefit of this restriction is that an oracle that always returns a vector with bounded $\ell_1$-norm automatically has a width $w^* \leq r + 1$, due to the assumptions on the norms of the input matrices. The downside of this restriction is that the analogue of Lemma 3 does not hold for $P_0(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$.^6 The following shows that an oracle that always returns a vector $y \in \tilde P(\tilde a, c')$ if one exists, is a $4Rr\theta$-approximate oracle as defined in Algorithm 2.
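The state-matrix update above is just a normalized matrix exponential. The following NumPy sketch illustrates one such update classically (this is not the quantum implementation, which is the subject of Section 2.2; the random matrix is a hypothetical stand-in for $-\eta H^{(t)}$):

```python
import numpy as np

def gibbs_state(H):
    """Return exp(-H) / Tr(exp(-H)) for a Hermitian matrix H."""
    evals, evecs = np.linalg.eigh(H)
    # Shifting by the smallest eigenvalue leaves the state unchanged
    # and avoids overflow in the exponentials.
    w = np.exp(-(evals - evals.min()))
    rho = (evecs * w) @ evecs.conj().T
    return rho / np.trace(rho)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))
H = (H + H.T) / 2                      # Hermitian stand-in for -eta * H^{(t)}
rho = gibbs_state(H)                   # a density matrix: unit trace, psd
```

The eigendecomposition route is used here because $H$ is Hermitian, which also makes the stability shift by $\lambda_{\min}$ exact rather than approximate.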
We have now seen the Arora-Kale framework for solving SDPs. To obtain a quantum SDP-solver it remains to provide a quantum oracle subroutine. By the above discussion it suffices to set $\theta = \varepsilon/(12Rr)$ and to use an oracle that is based on $\theta$-approximations of $\mathrm{Tr}(A\rho)$ (for $A \in \{A_1, \ldots, A_m, C\}$), since with that choice of $\theta$ we have $P_{4Rr\theta}(X) = P_{\varepsilon/3}(X)$. In the section below we first give a quantum algorithm for approximating $\mathrm{Tr}(A\rho)$ efficiently (see also Appendix A for a classical algorithm). Then, in Section 2.3, we provide an oracle using those estimates. The oracle will be based on a simple geometric idea and can be implemented both on a quantum computer and on a classical computer (of course, resulting in different runtimes). In Section 2.4 we conclude with an overview of the runtime of our quantum SDP-solver. We want to stress that our solver is meant to work for any SDP. In particular, our oracle does not use the structure of a specific SDP. As we will show in Section 3, any oracle that works for all SDPs necessarily has a large width-bound. To obtain quantum speedups for a specific class of SDPs it will be necessary to develop oracles tuned to that problem; we view this as an important direction for future work. Recall from the introduction that Arora and Kale also obtain fast classical algorithms for problems such as MAXCUT by developing specialized oracles.

^6 Using several transformations of the SDP, from Appendix E and Lemma 2 of [BS17], one can show that there is a way to remove the need for this restriction. Hence, after these modifications, if for a given candidate solution $X \succeq 0$ the oracle outputs that the set $P_0(X)$ is empty, then a scaled version of $X$ is primal-feasible for this new SDP, with objective value at least $\alpha$. This scaled version of $X$ can be modified to a near-feasible solution to the original SDP (it will be psd, but it might violate the linear constraints a little bit) with nearly the same objective value.

Approximating the expectation value Tr(Aρ) using a quantum algorithm
In this section we give an efficient quantum algorithm to approximate quantities of the form $\mathrm{Tr}(A\rho)$. We are going to work with Hermitian matrices $A, H \in \mathbb{C}^{n\times n}$, such that $\rho$ is the Gibbs state $e^{-H}/\mathrm{Tr}(e^{-H})$. Note the analogy with quantum physics: in physics terminology $\mathrm{Tr}(A\rho)$ is simply called the "expectation value" of $A$ for a quantum system in a thermal state corresponding to $H$.
The general approach is to separately estimate $\mathrm{Tr}(Ae^{-H})$ and $\mathrm{Tr}(e^{-H})$, and then to use the ratio of these estimates as an approximation of $\mathrm{Tr}(A\rho) = \mathrm{Tr}(Ae^{-H})/\mathrm{Tr}(e^{-H})$. Both estimates are obtained by preparing a pure state with a flag, such that the probability that the flag is 0 is proportional to the quantity we want to estimate, and then using amplitude estimation to estimate that probability. Below in Section 2.2.1 we first describe the general approach. In Section 2.2.2 we then instantiate this for the special case where all matrices are diagonal, which is the relevant case for LP-solving. In Section 2.2.3 we handle the general case of arbitrary matrices (needed for SDP-solving); the state-preparation part will be substantially more involved there, because in the general case we need not know the diagonalizing bases for $A$ and $H$, and $A$ and $H$ may not be simultaneously diagonalizable.
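As a purely classical sanity check of this ratio strategy (exact linear algebra standing in for the quantum estimates), one can verify numerically that the two traces recombine to the desired expectation value:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
A /= np.linalg.norm(A, 2)                  # ensure ||A|| <= 1
H = rng.standard_normal((n, n)); H = (H + H.T) / 2

evals, evecs = np.linalg.eigh(H)
expH = (evecs * np.exp(-evals)) @ evecs.T  # e^{-H}
rho = expH / np.trace(expH)                # Gibbs state

numerator = np.trace(A @ expH)             # Tr(A e^{-H})
denominator = np.trace(expH)               # Tr(e^{-H})
assert np.isclose(numerator / denominator, np.trace(A @ rho))
```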

General approach
To start, consider the following lemma about the multiplicative approximation error of a ratio of two real numbers that are each given by a multiplicative approximation. The inequality can be proven as follows, where the last step used $\tilde Z \geq \frac{2}{3} Z$.
Corollary 8. Let $A$ be such that $\|A\| \leq 1$. A multiplicative $\frac{\theta}{9}$-approximation of both $\mathrm{Tr}\!\left(\frac{I}{4} e^{-H}\right)$ and $\mathrm{Tr}\!\left(\frac{I+A/2}{4} e^{-H}\right)$ suffices to get an additive $\theta$-approximation of $\mathrm{Tr}(A\rho)$.

Proof. According to Lemma 7, by dividing the two multiplicative approximations we get a multiplicative $\frac{\theta}{3}$-approximation of their ratio $\mathrm{Tr}\!\left(\frac{I+A/2}{4} e^{-H}\right) / \mathrm{Tr}\!\left(\frac{I}{4} e^{-H}\right) = 1 + \mathrm{Tr}(A\rho)/2$. Since this quantity lies in $[1/2, 3/2]$, the additive error on it is at most $\theta/2$, and hence the additive error on $\mathrm{Tr}(A\rho)$ is at most $\theta$.

To estimate these traces we will use a unitary $U_{A,H}$ for which the probability of outcome 0 when measuring the first qubit of $U_{A,H}|0\ldots 0\rangle$ is proportional to the trace we want to estimate.
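The error propagation in Corollary 8 can be checked numerically: multiplicative $\theta/9$ errors on numerator and denominator never cause more than an additive $\theta$ error on $\mathrm{Tr}(A\rho)$. A small brute-force check over the worst-case sign choices (a verification sketch, not part of the algorithm):

```python
def worst_additive_error(ratio, theta):
    """Worst additive error on Tr(A rho) = 2*(ratio - 1) when the numerator and
    denominator each carry a multiplicative theta/9 error."""
    worst = 0.0
    for s_num in (-1.0, 1.0):
        for s_den in (-1.0, 1.0):
            perturbed = ratio * (1 + s_num * theta / 9) / (1 + s_den * theta / 9)
            worst = max(worst, abs(2 * (perturbed - 1) - 2 * (ratio - 1)))
    return worst

# The ratio Tr((I + A/2) rho) = 1 + Tr(A rho)/2 always lies in [1/2, 3/2].
for theta in (0.01, 0.1, 0.5, 1.0):
    for ratio in (0.5, 0.75, 1.0, 1.25, 1.5):
        assert worst_additive_error(ratio, theta) <= theta
```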
(To clarify the notation: if $|\psi\rangle$ is a 2-register state, then $(\langle 0| \otimes I)|\psi\rangle$ is the (unnormalized) state in the 2nd register that results from projecting on $|0\rangle$ in the 1st register.) In practice we will not be able to construct such a $U_{A,H}$ exactly; instead we will construct a $\tilde U_{A,H}$ that yields a sufficiently close approximation of the correct probability. When we have access to such a unitary, the following lemma allows us to use amplitude estimation to estimate the probability and hence $\mathrm{Tr}\!\left(\frac{I+A/2}{4} e^{-H}\right)$ up to the desired error.

Lemma 9. Suppose we have a unitary $U$ acting on $q$ qubits such that $U|0\ldots 0\rangle = |0\rangle|\psi\rangle + |\Phi\rangle$ with $(\langle 0| \otimes I)|\Phi\rangle = 0$ and $\||\psi\rangle\|^2 = p \geq p_{\min}$ for some known bound $p_{\min}$. Let $\mu \in (0, 1]$ be the allowed multiplicative error in our estimation of $p$.
Then, using $\mathcal{O}\!\left(\frac{1}{\mu\sqrt{p_{\min}}}\right)$ applications of $U$ and $U^{-1}$ and $\mathcal{O}\!\left(\frac{q}{\mu\sqrt{p_{\min}}}\right)$ gates on the $q$ qubits and some ancilla qubits, we obtain a $\tilde p$ such that $|p - \tilde p| \leq \mu p$ with probability at least 4/5.
Proof. We use the amplitude-estimation algorithm of [BHMT02, Theorem 12] with $M$ applications of $U$ and $U^{-1}$. This provides an estimate $\tilde p$ of $p$ that, with probability at least $8/\pi^2 > 4/5$, satisfies

$$|p - \tilde p| \leq 2\pi \frac{\sqrt{p(1-p)}}{M} + \frac{\pi^2}{M^2}.$$

Choosing $M$ to be the smallest power of 2 such that $M \geq 3\pi/(\mu\sqrt{p_{\min}})$, with probability at least 4/5 we get $|p - \tilde p| \leq \mu p$. The factor of $q$ in the gate complexity comes from the implementation of the amplitude-amplification steps needed in amplitude estimation. The gate complexity of the whole amplitude-estimation procedure is dominated by this contribution, proving the final gate complexity.
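The choice of $M$ in this proof can be verified directly: once $M \geq 3\pi/(\mu\sqrt{p_{\min}})$, the [BHMT02] error bound falls below $\mu p$ for every $p \geq p_{\min}$. A quick numerical check of this inequality (the constant $3\pi$ and the power-of-2 rounding are as in the proof):

```python
import math

def amp_est_M(mu, p_min):
    """Smallest power of 2 with M >= 3*pi / (mu * sqrt(p_min))."""
    target = 3 * math.pi / (mu * math.sqrt(p_min))
    M = 1
    while M < target:
        M *= 2
    return M

def bhmt_error(p, M):
    """Estimation error bound of [BHMT02, Theorem 12] with M applications."""
    return 2 * math.pi * math.sqrt(p * (1 - p)) / M + math.pi ** 2 / M ** 2

mu, p_min = 0.05, 1e-4
M = amp_est_M(mu, p_min)
for p in (p_min, 1e-3, 0.1, 0.5, 0.99):
    assert bhmt_error(p, M) <= mu * p    # relative error at most mu
```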
Applying the procedure of Lemma 9 to $\tilde U_{A',H}$ (both for $A' = 0$ and for $A' = A$) with $p_{\min} = \frac{z}{9n}$ and $\mu = \theta/19$, and combining the results using Corollary 8, yields an additive $\theta$-approximation of $\mathrm{Tr}(A\rho)$ with probability at least 4/5.
Also, by (6), using Lemma 9 with $\mu = \theta/19$ we obtain, with probability at least 4/5, a $\tilde p$ satisfying the stated multiplicative error bound. By combining (6)-(7) and using the triangle inequality, we see that Corollary 8 can indeed be applied. The complexity statement follows from Lemma 9 and our choices of $p_{\min}$ and $\mu$.
Notice the $1/\sqrt{z} \geq 1/\sqrt{\mathrm{Tr}(e^{-H})}$ factor in the complexity statement of the last corollary. To make sure this factor is not too large, we would like to ensure $\mathrm{Tr}(e^{-H}) = \Omega(1)$. This can be achieved by substituting $H_+ = H - \lambda_{\min} I$, where $\lambda_{\min}$ is the smallest eigenvalue of $H$. It is easy to verify that this will not change the value $\mathrm{Tr}(Ae^{-H})/\mathrm{Tr}(e^{-H})$.
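The substitution $H_+ = H - \lambda_{\min} I$ is easy to verify numerically: the shift cancels in the ratio, while guaranteeing $\mathrm{Tr}(e^{-H_+}) \geq 1$. A small NumPy check:

```python
import numpy as np

def expectation(A, H):
    """Tr(A e^{-H}) / Tr(e^{-H}) via eigendecomposition of Hermitian H."""
    evals, evecs = np.linalg.eigh(H)
    expH = (evecs * np.exp(-evals)) @ evecs.conj().T
    return np.trace(A @ expH) / np.trace(expH)

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
H = rng.standard_normal((n, n)); H = (H + H.T) / 2

H_plus = H - np.linalg.eigvalsh(H).min() * np.eye(n)   # H_+ is psd
assert np.isclose(expectation(A, H), expectation(A, H_plus))
assert np.exp(-np.linalg.eigvalsh(H_plus)).sum() >= 1  # Tr(e^{-H_+}) >= 1
```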
It remains to show how to compute $\lambda_{\min}$ and how to apply $\tilde U_{A,H}$. Both of these steps are considerably easier in the case where all matrices are diagonal, so we will consider this case first.

The special case of diagonal matrices -for LP-solving
In this section we consider diagonal matrices, assuming oracle access to $H$ of the form $|i\rangle|0\rangle \mapsto |i\rangle|H_{ii}\rangle$, and similarly for $A$. Notice that this kind of oracle can easily be constructed from the general sparse matrix oracle (5) that we assume access to.
Lemma 11. Let $A, H \in \mathbb{R}^{n\times n}$ be diagonal matrices such that $\|A\| \leq 1$ and $H \succeq 0$, and let $\mu > 0$ be an error parameter. Then there exists a unitary $\tilde U_{A,H}$ such that the probability of outcome 0 when measuring the first qubit of $\tilde U_{A,H}|0\ldots 0\rangle$ is an additive $\mu$-approximation of $\mathrm{Tr}\!\left(\frac{I+A/2}{4}e^{-H}\right)/n$, which uses 1 quantum query to $A$ and $H$ and $\mathcal{O}(\log^{O(1)}(1/\mu) + \log(n))$ other gates.
Proof. First we prepare the state $\frac{1}{\sqrt n}\sum_{i=1}^n |i\rangle$ with $\mathcal{O}(\log(n))$ one- and two-qubit gates. If $n$ is a power of 2, we do this by applying $\log_2(n)$ Hadamard gates to $|0\rangle^{\otimes \log_2(n)}$; in the general case it is still possible to prepare the state $\frac{1}{\sqrt n}\sum_{i=1}^n |i\rangle$ with $\mathcal{O}(\log(n))$ two-qubit gates, for example by preparing the state $\frac{1}{\sqrt k}\sum_{i=1}^k |i\rangle$ for $k = 2^{\lceil \log_2(n) \rceil}$ and then using (exact) amplitude amplification in order to remove the $i > n$ from the superposition.
Then we query the diagonal values of $H$ and $A$ to get the state $\frac{1}{\sqrt n}\sum_{i=1}^n |i\rangle|H_{ii}\rangle|A_{ii}\rangle$. Using these binary values we apply a finite-precision arithmetic circuit to prepare $\frac{1}{\sqrt n}\sum_{i=1}^n |i\rangle|H_{ii}\rangle|A_{ii}\rangle|\beta_i\rangle$, where $\beta_i$ approximates $\frac{(1+A_{ii}/2)e^{-H_{ii}}}{4}$ up to error $\delta_i$. The error $\delta_i$ arises because we only write down a finite number of bits $b_1.b_2 b_3 \ldots b_{\log(8/\mu)}$. Due to our assumptions on $A$ and $H$, we know that $\beta_i$ lies in $[0, 1]$. We proceed by first adding an ancilla qubit initialized to $|1\rangle$ in front of the state; then we apply $\log(8/\mu)$ controlled rotations to this qubit: for each $b_j = 1$ we apply a rotation by angle $\pi 2^{-j}$. In other words, if $b_1 = 1$, then we rotate $|1\rangle$ fully to $|0\rangle$; if $b_2 = 1$, then we rotate halfway, and we proceed by halving the angle for each subsequent bit. It is now easy to see that the squared norm of the $|0\rangle$-part of the resulting state is as required, which is an additive $\mu$-approximation of $\mathrm{Tr}\!\left(\frac{I+A/2}{4}e^{-H}\right)/n$ since the $|\delta_i|$ are at most $\mu/8$.

Proof. Since $H$ is a diagonal matrix, its eigenvalues are exactly its diagonal entries. Using the quantum minimum-finding algorithm of Dürr and Høyer [DH96] one can find (with high success probability) the minimum $\lambda_{\min}$ of the diagonal entries using $\mathcal{O}(\sqrt{n})$ queries to the matrix elements. Applying Lemma 11 and Corollary 10 to $H_+ = H - \lambda_{\min} I$, with $z = 1$, gives the stated bound.
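For intuition, the diagonal case can be simulated classically: the per-index values $\beta_i = (1 + A_{ii}/2)e^{-H_{ii}}/4$ average to exactly $\mathrm{Tr}\big(\frac{I+A/2}{4}e^{-H}\big)/n$, and truncating each $\beta_i$ to $\lceil\log_2(8/\mu)\rceil$ bits perturbs this average by at most $\mu/8$. A sketch, with classical arithmetic standing in for the finite-precision circuit:

```python
import numpy as np

def flag_probability(a_diag, h_diag, mu):
    """Average of beta_i = (1 + A_ii/2) * exp(-H_ii) / 4, each truncated
    to ceil(log2(8/mu)) binary digits, mimicking the arithmetic circuit."""
    bits = int(np.ceil(np.log2(8 / mu)))
    beta = (1 + a_diag / 2) * np.exp(-h_diag) / 4
    beta_truncated = np.floor(beta * 2 ** bits) / 2 ** bits
    return beta_truncated.mean()

rng = np.random.default_rng(3)
n, mu = 64, 1e-3
a = rng.uniform(-1, 1, n)      # diagonal of A, so ||A|| <= 1
h = rng.uniform(0, 3, n)       # diagonal of H, so H >= 0
exact = np.mean((1 + a / 2) * np.exp(-h) / 4)   # Tr((I+A/2)/4 e^{-H}) / n
assert abs(flag_probability(a, h, mu) - exact) <= mu
```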

General case -for SDP-solving
In this section we will extend the ideas from the last section to non-diagonal matrices. There are a few complications that arise in this more general case. These mostly follow from the fact that we now do not know the eigenvectors of $H$ and $A$, which were the basis states before, and that these eigenvectors might not be the same for both matrices. For example, to find the minimal eigenvalue of $H$, we can no longer simply minimize over its diagonal entries. To solve this, in Appendix C we develop new techniques that generalize minimum-finding. Furthermore, the unitary $\tilde U_{A,H}$ in the LP case could be seen as applying the operator to a superposition of its eigenvectors. This is also more complicated in the general setting, due to the fact that the eigenvectors are no longer the basis states. In Appendix B we develop general techniques to apply smooth functions of Hamiltonians to a state. Among other things, this will be used to create an efficient purified Gibbs sampler. Our Gibbs sampler uses similar methods to the work of Chowdhury and Somma [CS17] for achieving logarithmic dependence on the precision. However, the result of [CS17] cannot be applied to our setting, because it implicitly assumes access to an oracle for $\sqrt{H}$ instead of $H$. Although their paper describes a way to construct such an oracle, it comes with a large overhead: they construct an oracle for $\sqrt{H + \nu I}$ for some shift $\nu > 0$, which can be prohibitive due to the $1/Z$ factor in the runtime, which blows up exponentially in $\nu$.
In the following lemma we show how to implement $\tilde U_{A,H}$ using the techniques we developed in Appendix B. Note that we can assume without loss of generality that $\mu \leq 1$, otherwise the statement is trivial.

Lemma 13. Let $A, H \in \mathbb{C}^{n \times n}$ be Hermitian matrices such that $\|A\| \leq 1$ and $H \succeq 0$.
Let $\tilde W_0 = (\langle 0| \otimes I)\tilde W(|0\rangle \otimes I)$ be a $\mu/5$-approximation of the map $e^{-H/2}$ (in operator norm) implemented by using Theorem 43, and let $\tilde V_0 = (\langle 0| \otimes I)\tilde V(|0\rangle \otimes I)$ be a $\mu/5$-approximation of the map $\sqrt{\frac{I+A/2}{4}}$ implemented by using Theorem 40. We define $\tilde U_{A,H} := \tilde V \tilde W$, noting that there is a hidden $\otimes I$ factor in both $\tilde V$ and $\tilde W$ corresponding to their respective ancilla qubit. As in the linear programming case, we are interested in the probability $p$ of measuring outcome 00 in the first register (i.e., the two "flag" qubits) after applying $\tilde U_{A,H}$. We will analyze this in terms of these operators below.
Now we show that the above quantity is a good approximation of $\mathrm{Tr}\!\left(\frac{I+A/2}{4}e^{-H}\right)/n$. For this we show that $\tilde V_0^\dagger \tilde V_0 \approx (I+A/2)/4$ and $\tilde W_0 \tilde W_0^\dagger \approx e^{-H}$. To see this, first note that for all matrices $B, \tilde B$ with $\|B\| \leq 1$ we have $\|\tilde B^\dagger \tilde B - B^\dagger B\| \leq 2\|\tilde B - B\| + \|\tilde B - B\|^2$. Since $\mu \leq 1$, and hence $2\mu/5 + (\mu/5)^2 \leq \mu/2$, this implies that both approximations above hold up to error $\mu/2$ in operator norm. Let $\|\cdot\|_1$ denote the trace norm (a.k.a. Schatten 1-norm). Bounding the difference of the products in trace norm, dividing both sides by $n$, and using equation (8) then implies the claimed approximation guarantee. This proves the correctness of $\tilde U_{A,H}$. It remains to show that the complexity statement holds.
To show this we only need to specify how to implement the map $\sqrt{\frac{I+A/2}{4}}$.

An efficient 2-sparse oracle
To motivate the problem below, recall from the end of Section 2.1 that $\tilde a_j$ is an additive $\theta$-approximation to $\mathrm{Tr}(A_j\rho)$, $\tilde c$ is an additive $\theta$-approximation to $\mathrm{Tr}(C\rho)$, and $c' = \tilde c - r\theta - \theta$. We first describe a simplified version of our quantum 2-sparse oracle (Lemma 16) that assumes access to a unitary acting as $|j\rangle|0\rangle|0\rangle \mapsto |j\rangle|\tilde a_j\rangle|\psi_j\rangle$, where $|\psi_j\rangle$ is some workspace state depending on $j$.
At the end of this section we then discuss how to modify the analysis when we are given an oracle that acts as $|j\rangle|0\rangle|0\rangle \mapsto |j\rangle\sum_i \beta^i_j |\tilde a^i_j\rangle|\psi^i_j\rangle$, where each $|\tilde a^i_j\rangle$ is an approximation of $a_j$ and the amplitudes $\beta^i_j$ are such that measuring the second register with high probability returns an $\tilde a^i_j$ that is $\theta$-close to $a_j$. We do so since the output of the trace-estimation procedure of the previous section is of this more complicated form.
Our goal is to find a $y \in \tilde P(\tilde a, c')$, i.e., a $y \in \mathbb{R}^m_+$ such that $b^T y \leq \alpha$, $\tilde a^T y \geq c'$, and $\sum_j y_j \leq r$. Our first observation is that the polytope $\tilde P(\tilde a, c')$ is extremely simple: it has only three non-trivial constraints and, if it is non-empty, then it contains a point with at most 2 non-zero coordinates. The latter follows from general properties of linear programs: any feasible LP with at most $k$ constraints has a $k$-sparse optimal solution (see, e.g., [Sch86, Ch. 7]). Note that non-emptiness of $\tilde P(\tilde a, c')$ is equivalent to the optimal value of

$$\min\Big\{\sum_j y_j : b^T y \leq \alpha,\ \tilde a^T y \geq c',\ y \geq 0\Big\} \tag{10}$$

being at most $r$ (the latter being an LP with only 2 non-trivial constraints). We will give an alternative, constructive proof that we can obtain a 2-sparse solution in Lemma 15. This will also illustrate the intuition behind the definition of our 2-sparse oracle. Before doing so, let us give a first naive approach to finding a $y \in \tilde P(\tilde a, c')$, which will not be sufficiently efficient for our purposes. Using the formulation in Equation (10), we could attempt to find a $y \in \tilde P(\tilde a, c')$ by solving $\mathcal{O}(m^2)$ linear programs with only 2 variables and 2 constraints each (these LPs are obtained from (10) by setting all but 2 variables equal to zero) and searching for an optimal solution whose value is at most $r$. Here each linear program is determined by the values $\tilde a_j$, $b_j$, and $c'$, and thus we can decide whether $\tilde P(\tilde a, c')$ is non-empty using $\mathcal{O}(m^2)$ classical time (given these values).
We use a more sophisticated approach to show that $\mathcal{O}(m)$ classical operations (and queries) suffice. Our approach is amenable to a quantum speedup: it also implies that only $\mathcal{O}(\sqrt{m})$ quantum queries suffice. In particular, we now show how to reduce the problem of finding a $y \in \tilde P(\tilde a, c')$ to finding a convex combination of points $(b_j, \tilde a_j)$ that lies within a certain region of the plane.
First observe that if $\alpha \geq 0$ and $c' \leq 0$, then $y = 0$ is a solution and our oracle can return it. From now on we will assume that $\alpha < 0$ or $c' > 0$. Then for a feasible point $y$ we may write $y = Nq$ with $N = \|y\|_1 > 0$ and hence $\|q\|_1 = 1$. So we are looking for an $N \in (0, r]$ and a $q$ such that $b^T q \leq \alpha/N$ and $\tilde a^T q \geq c'/N$. We can now view $q \in \mathbb{R}^m_+$ as the coefficients of a convex combination of the points $p_i = (b_i, \tilde a_i)$ in the plane. We want such a combination that lies to the upper left of $g_N = (\alpha/N, c'/N)$ for some $0 < N \leq r$. Let $G_N$ denote the upper-left quadrant of the plane starting at $g_N$.
Proof. Let $y \in \tilde P(\tilde a, c')$ and $N = \|y\|_1$. Consider $p_i = (b_i, \tilde a_i)$ and $g = (\alpha/N, c'/N)$ as before, and write $q = y/N$ so that $\sum_{j=1}^m q_j = 1$, $q \geq 0$. The vector $q$ certifies that a convex combination of the points $p_i$ lies in $G_N$. But then there exist $j, k \in [m]$ such that the line segment $\overline{p_j p_k}$ intersects $G_N$. All points on this line segment are convex combinations of $p_j$ and $p_k$, hence there is a convex combination of $p_j$ and $p_k$ that lies in $G_N$. This gives a 2-sparse $q'$, and $y' = N q' \in \tilde P(\tilde a, c')$.
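The geometric fact used in this proof — if some convex combination of the points lies in an upper-left quadrant, then a combination of just two of them (possibly twice the same point) already does — is easy to test numerically. Below is a brute-force verification sketch; the segment-quadrant test is exact, obtained by intersecting the feasible parameter intervals of the two linear constraints:

```python
import numpy as np

def interval(c0, c1):
    """Solve c0 + t*c1 <= 0 for t in [0, 1]; return (lo, hi) or None."""
    if abs(c1) < 1e-15:
        return (0.0, 1.0) if c0 <= 1e-12 else None
    t = -c0 / c1
    if c1 > 0:
        return (0.0, min(1.0, t)) if t >= 0 else None
    return (max(0.0, t), 1.0) if t <= 1 else None

def segment_hits_quadrant(p, q, g):
    """Does some point (1-t)*p + t*q lie in {(x, y): x <= g[0], y >= g[1]}?"""
    ix = interval(p[0] - g[0], q[0] - p[0])        # x-coordinate <= g_x
    iy = interval(g[1] - p[1], p[1] - q[1])        # y-coordinate >= g_y
    if ix is None or iy is None:
        return False
    return max(ix[0], iy[0]) <= min(ix[1], iy[1]) + 1e-12

rng = np.random.default_rng(4)
for _ in range(100):
    pts = rng.standard_normal((8, 2))
    w = rng.random(8); w /= w.sum()                 # random convex weights
    centre = w @ pts                                # combination of all points
    g = centre + np.array([0.1, -0.1])              # a quadrant containing it
    assert any(segment_hits_quadrant(pts[i], pts[j], g)
               for i in range(8) for j in range(i, 8))
```

The underlying reason a pair suffices: from the full combination one can move straight up (staying inside the quadrant) until exiting the convex hull through an edge, and hull edges join two of the original points.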
We can now restrict our search to 2-sparse $y$. Let $G = \bigcup_{N \in (0,r]} G_N$; see Figure 1 for the shape of $G$. Then we want to find two points $p_j, p_k$ that have a convex combination in $G$, since this implies that a scaled version of their convex combination gives a $y \in \tilde P(\tilde a, c')$ with $\|y\|_1 \leq r$ (this scaling can be computed efficiently given $p_j$ and $p_k$).
Furthermore, regarding the possible (non-)emptiness of $G$ we know the following by Lemma 6 and Lemma 15:

• If $P_0(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is non-empty, then some convex combination of two of the $p_j$'s lies in $G$.
• If $P_{4Rr\theta}(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is empty, then no convex combination of the $p_j$'s lies in $G$.
We first prove a simplified version of the main result of this section. The analysis below applies if there are $m$ points $p_j = (b_j, \tilde a_j)$, and we are given a unitary which acts as $|j\rangle|0\rangle|0\rangle \mapsto |j\rangle|\tilde a_j\rangle|\psi_j\rangle$. We later prove the result for more general oracles, which may give superpositions of approximations instead of just the one value $\tilde a_j$.

Lemma 16 (Simple version).
There is an algorithm that returns a 2-sparse vector $q$ (with $q \geq 0$ and $\|q\|_1 = 1$) such that $\sum_{j=1}^m q_j p_j \in G$, if one exists, using one search and two minimizations over the $m$ points $p_j = (b_j, \tilde a_j)$. This gives a classical algorithm that uses $\mathcal{O}(m)$ calls to the subroutine that gives the entries of $\tilde a$, and $\mathcal{O}(m)$ other operations; and a quantum algorithm that (in order to solve the problem with high probability) uses $\mathcal{O}(\sqrt{m})$ calls to an (exact quantum) subroutine that gives the entries of $\tilde a$, and $\mathcal{O}(\sqrt{m})$ other gates.
Proof. The algorithm can be summarized as follows:
1. Check if $\alpha \geq 0$ and $c' \leq 0$. If so, then return $q = 0$.
2. Check if there is a $p_i \in G$. If so, then return $q = e_i$.
3. Find $p_j, p_k$ such that the line segment $\overline{p_j p_k}$ goes through $G$, and return the corresponding $q$.
4. If the first three steps did not return a vector $q$, then output 'Fail'.
The main realization is that in step 3 we can search separately for $p_j$ and $p_k$. We explain this in more detail below, but first we need a better understanding of the shape of $G$ (see Figure 1 for an illustration). The shape of $G$ depends on the signs of $\alpha$ and $c'$:
(a) If $\alpha < 0$ and $c' < 0$: the corner point of $G$ is $(\alpha/r, c'/r)$. One edge goes up vertically and the other follows the line segment $\lambda \cdot (\alpha, c')$ for $\lambda \in [1/r, \infty)$, starting at the corner.
(b) If $\alpha < 0$ and $c' \geq 0$: the corner point is again $(\alpha/r, c'/r)$, but now one edge goes up vertically and one goes to the left horizontally.
(c) If $\alpha \geq 0$ and $c' \leq 0$: this is the case where $y = 0$ is a solution; $G$ is the whole plane and has no corner.
(d) If $\alpha \geq 0$ and $c' > 0$: the corner point of $G$ is again $(\alpha/r, c'/r)$. From there one edge goes to the left horizontally and one edge follows the line segment $\lambda \cdot (\alpha, c')$ for $\lambda \in [1/r, \infty)$.
Since $G$ is always an intersection of at most 2 halfspaces, steps 1-2 of the algorithm are easy to perform. In step 1 we handle case (c) by simply returning $y = 0$. For the other cases, $(\alpha/r, c'/r)$ is the corner point of $G$ and the two edges are simple lines. Hence in step 2 we can easily search through all the points to find out whether one lies in $G$; since $G$ is a very simple region, this only amounts to checking on which side of the two lines a point lies.
Figure 2: Illustration of $G$ with the points $p_j, p_k$ and the angles $\angle \ell_j L_1$, $\angle L_1 L_2$, $\angle L_2 \ell_k$ drawn in. Clearly the line $\overline{p_j p_k}$ only crosses $G$ when the total angle is less than $\pi$.

Now, if we cannot find a single point in $G$ in step 2, then we need a combination of two points in step 3. Let $L_1, L_2$ be the edges of $G$ and let $\ell_j$ and $\ell_k$ be the line segments from $(\alpha/r, c'/r)$ to $p_j$ and $p_k$, respectively. Then, as can be seen in Figure 2, the line segment $\overline{p_j p_k}$ goes through $G$ if and only if (up to relabeling $p_j$ and $p_k$) $\angle \ell_j L_1 + \angle L_1 L_2 + \angle L_2 \ell_k \leq \pi$. Since $\angle L_1 L_2$ is fixed, we can simply look for a $j$ such that $\angle \ell_j L_1$ is minimized and a $k$ such that $\angle L_2 \ell_k$ is minimized. If $\overline{p_j p_k}$ does not pass through $G$ for this pair of points, then it does not for any pair of points.
Notice that these minimizations can be done separately and hence within the stated complexity. Given the minimizing points $p_j$ and $p_k$, it is easy to check whether they give a solution by calculating the angle between $\ell_j$ and $\ell_k$. The coefficients of the convex combination $q$ are then easy to compute.
We now consider the more general case where we are given access to a unitary which for each j provides a superposition over different valuesã j . We do so because the trace estimation procedure of Corollary 14 provides an oracle of this form.
Lemma 16 (General version). Assume that we are given an oracle that acts as $|j\rangle|0\rangle|0\rangle \mapsto |j\rangle\sum_i \beta^i_j|\tilde a^i_j\rangle|\psi^i_j\rangle$, where each $|\tilde a^i_j\rangle$ is an approximation of $a_j$ and the amplitudes $\beta^i_j$ are such that measuring the second register with high probability returns an $\tilde a^i_j$ that is $\theta$-close to $a_j$. There is a quantum algorithm that uses $\mathcal{O}(\sqrt{m})$ calls to the oracle described above, and the same order of two-qubit gates, and (with high probability) has the following guarantees.
• If $P_0(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is non-empty, then the algorithm returns a 2-sparse vector in $P_{4Rr\theta}(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$.
• If $P_{4Rr\theta}(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is empty, then the algorithm correctly concludes that $P_0(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is empty.
Proof. Since we can exponentially reduce the probability of obtaining an $\tilde a^i_j$ that is further than $\theta$ away from $a_j$, we will for simplicity assume that for all $i, j$ we have $|\tilde a^i_j - a_j| \leq \theta$; the neglected exponentially small probabilities will only affect the analysis in negligible ways.
Note that while we do not allow our quantum algorithm enough time to obtain classical descriptions of all the $\tilde a_j$'s (we aim for a runtime of $\mathcal{O}(\sqrt{m})$), we do have enough time to compute $\tilde c$ once initially (after this measurement, $G$ is well-defined). Knowing $\tilde c$, we can compute the angles defined by the points $p^i_j = (b_j, \tilde a^i_j)$ with respect to the corner point $(\alpha/r, (\tilde c - \theta)/r - \theta)$ and the lines $L_1, L_2$ (see Figure 2). We now apply our generalized minimum-finding algorithm with runtime $\mathcal{O}(\sqrt{m})$ (see Theorem 49), starting with a uniform superposition over the $j$'s, to find $k, \ell \in [m]$ and points $p^i_k$ and $p^i_\ell$ approximately minimizing the respective angles to the lines $L_1, L_2$. Here 'approximately minimizing' means that there is no $j \in [m]$ such that for all $i$ the angle of $p^i_j = (b_j, \tilde a^i_j)$ with $L_1$ is smaller than that of $p^i_k$ with $L_1$ (and similarly for $\ell$ and $L_2$). From this point on we can simply consider the model in the simple version of this lemma, since by the analysis above there exists an approximation $\tilde a \in \mathbb{R}^m$ with $\tilde a_k = \tilde a^i_k$ and $\tilde a_\ell = \tilde a^i_\ell$, and where $k$ and $\ell$ are the correct minimizers.
It follows from Lemma 6 and Lemma 15 that if $P_0(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is non-empty, then some convex combination of $(\tilde a_\ell, b_\ell)$ and $(\tilde a_k, b_k)$ lies in $G$. On the other hand, if $P_{4Rr\theta}(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is empty, then the same lemmas guarantee that we correctly conclude that $P_0(X) \cap \{y \in \mathbb{R}^m : \sum_j y_j \leq r\}$ is empty.

Total runtime
We are now ready to add our quantum implementations of the trace calculations and the oracle to the classical Arora-Kale framework.

Proof. Using our implementations of the different building blocks, it remains to calculate the total complexity when they are used together.
Cost of the oracle for $H^{(t)}$. The first problem in each iteration is to obtain access to an oracle for $H^{(t)}$. In each iteration the oracle will produce a $y^{(t)}$ that is at most 2-sparse, and hence in the $(t+1)$-th iteration, $H^{(t)}$ is a linear combination of $2t$ of the $A_j$ matrices and the $C$ matrix.
We can write down a sparse representation of the coefficients of the linear combination that gives H (t) in each iteration by adding the new terms coming from y (t) . This will clearly not take longer than O(T ), since there are only a constant number of terms to add for our oracle. As we will see, this term will not dominate the complexity of the full algorithm.
Using such a sparse representation of the coefficients, one query to a sparse representation of $H^{(t)}$ costs $\mathcal{O}(st)$ queries to the input matrices and $\mathcal{O}(st)$ other gates. For a detailed explanation and a matching lower bound, see Appendix D.
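To illustrate the bookkeeping (not the quantum data structure from Appendix D), here is a classical sketch in which each input matrix is represented by a hypothetical dictionary of its non-zero entries; one entry-query to $H^{(t)} = \sum_j y_j A_j - C$ then costs one lookup per stored coefficient, matching the factor $t$ (the factor $s$ accounts for locating the non-zero entries of a sparse row, which dictionary access hides here):

```python
def query_H(i, k, coeffs, A, C):
    """Entry (i, k) of sum_j y_j * A_j - C, where `coeffs` is the sparse list
    of (j, y_j) pairs accumulated so far and each matrix is a dict
    {(row, col): value} of its non-zero entries."""
    value = -C.get((i, k), 0.0)
    for j, y_j in coeffs:               # one input-matrix query per term
        value += y_j * A[j].get((i, k), 0.0)
    return value

# Tiny example: H = 2*A_0 + 0.5*A_1 - C on 2x2 matrices.
A = [{(0, 0): 1.0, (1, 1): 1.0},        # A_0 = I
     {(0, 1): 3.0}]                     # A_1
C = {(0, 0): 0.5}
coeffs = [(0, 2.0), (1, 0.5)]
assert query_H(0, 0, coeffs, A, C) == 1.5      # 2*1 - 0.5
assert query_H(0, 1, coeffs, A, C) == 1.5      # 0.5*3
```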
Cost of the oracle for $\mathrm{Tr}(A_j\rho)$. In each iteration $M^{(t)}$ is made to have operator norm at most 1.

Further remarks. We want to stress again that our solver is meant to work for all SDPs.
In particular, it does not use the structure of a specific SDP. As we show in the next section, every oracle that works for all SDPs must have large width. To obtain quantum speedups for a specific class of SDPs, it will be necessary to develop oracles tuned to that problem. We view this as an important direction for future work. Recall from the introduction that Arora and Kale also obtain fast classical algorithms for problems such as MAXCUT by doing exactly that: they develop specialized oracles for those problems.
3 Downside of this method: general oracles are restrictive

In this section we show some of the limitations of a method that uses sparse or general oracles, i.e., ones that are not optimized for the properties of specific SDPs. We will start by discussing sparse oracles in the next section. We will use a counting argument to show that sparse solutions cannot hold too much information about a problem's solution. In Section 3.2 we will show that width-bounds that do not depend on the specific structure of an SDP are inefficient for many problems. As in the rest of the paper, we will assume the notation of Section 2, in particular of Meta-Algorithm 1.

Sparse oracles are restrictive
Lemma 17. If, for some specific SDP of the form (1), every $\varepsilon$-optimal dual-feasible vector has at least $\ell$ non-zero elements, then the width $w$ of any $k$-sparse Oracle_{ε/3} for this SDP satisfies $\frac{Rw}{\varepsilon} = \Omega\!\left(\sqrt{\frac{\ell}{k \ln(n)}}\right)$.

Proof. The vector $\bar y$ returned by Meta-Algorithm 1 is, by construction, the average of $T$ vectors $y^{(t)}$ that are all $k$-sparse, plus one extra 1-sparse term $\frac{\varepsilon}{R} e_1$, and hence $\ell \leq kT + 1$. The stated bound on $\frac{Rw}{\varepsilon}$ then follows directly by combining this inequality with $T = \mathcal{O}\!\left(\frac{R^2 w^2}{\varepsilon^2}\ln(n)\right)$.
The oracle presented in Section 2.3 always provides a 2-sparse vector $y$. This implies that if an SDP requires an $\ell$-sparse dual solution, we must have $\frac{Rw}{\varepsilon} = \Omega(\sqrt{\ell/\ln(n)})$. This in turn means that the upper bound on the runtime of our algorithm will be of order $\ell^4 \sqrt{nm}\, s^2$. This is clearly bad if $\ell$ is of the order of $n$ or $m$.
Of course it could be the case that almost every SDP of interest has a sparse approximate dual solution (or can easily be rewritten so that it does), and hence sparseness might not be a restriction at all. However, as we will see below, this is not the case. We will prove that for certain kinds of SDPs, no "useful" dual solution can be very sparse. Intuitively, a dual solution to an SDP is "useful" if it can be turned into a solution of the problem that the SDP is trying to solve. We make this more precise in the definition below.
Definition 18. A problem is defined by a function $f$ that, for every element $p$ of the problem domain $D$, gives a subset of the solution space $S$, consisting of the solutions that are considered correct. We say a family of SDPs, $\{\mathrm{SDP}^{(p)}\}_{p\in D}$, solves the problem via the dual if there is an $\varepsilon \geq 0$ and a function $g$ such that for every $p \in D$ and every $\varepsilon$-optimal dual-feasible vector $y^{(p)}$ to $\mathrm{SDP}^{(p)}$: $g(y^{(p)}) \in f(p)$.
In other words, an ε-optimal dual solution can be converted into a correct solution of the original problem without more knowledge of p.
For these kinds of SDP families we will prove a lower bound on the sparsity of the dual solutions. The idea for this bound is as follows. If you have a lot of different instances that require different solutions, but the SDPs are equivalent up to permuting the constraints and the coordinates of $\mathbb{R}^n$, then a dual solution should have a lot of unique permutations and hence cannot be too sparse.
Theorem 19. Consider a problem and a family of SDPs as in Definition 18. Let $T \subseteq D$ be such that for all $p, q \in T$:

• $f(p) \cap f(q) = \emptyset$. That is, a solution to $p$ is not a solution to $q$ and vice versa.
• The number of constraints m and the primal variable size n are the same for SDP (p) and SDP (q) .
• Let $A^{(p)}_j$ be the constraints of $\mathrm{SDP}^{(p)}$ and $A^{(q)}_j$ those of $\mathrm{SDP}^{(q)}$ (and define $C^{(p)}$, $C^{(q)}$, $b^{(p)}_j$, and $b^{(q)}_j$ in the same manner). Then there exist $\sigma \in S_n$, $\pi \in S_m$ such that the SDPs are the same up to the permutation $\pi$ of the labels of the constraints and the permutation $\sigma$ of the coordinates of $\mathbb{R}^n$.
If $y^{(p)}$ is an $\varepsilon$-optimal dual-feasible vector to $\mathrm{SDP}^{(p)}$ for some $p \in T$, then $y^{(p)}$ is at least $\frac{\log|T|}{\log m}$-dense (i.e., has at least that many non-zero entries).
Proof. We first observe that, with $\mathrm{SDP}^{(p)}$ and $\mathrm{SDP}^{(q)}$ as in the theorem, if $y^{(p)}$ is an $\varepsilon$-optimal dual-feasible vector of $\mathrm{SDP}^{(p)}$, then the vector $y^{(q)}$ obtained by permuting the coordinates of $y^{(p)}$ according to $\pi$ is an $\varepsilon$-optimal dual-feasible vector for $\mathrm{SDP}^{(q)}$. Here we use the fact that a permutation of the $n$ coordinates in the primal does not affect the dual solutions. Since $f(p) \cap f(q) = \emptyset$ we know that $g(y^{(p)}) \neq g(y^{(q)})$ and so $y^{(p)} \neq y^{(q)}$. Since this is true for every $q \in T$, there must be at least $|T|$ different vectors $\pi(y^{(p)})$.
A $k$-sparse vector can have $k$ different non-zero entries, and hence the number of possible unique permutations of that vector is at most $\binom{m}{k} k! \leq m^k$. This must be at least $|T|$, so $k \geq \log|T|/\log m$.

Example. Consider the $(s,t)$-mincut problem, i.e., the dual of the $(s,t)$-maxflow problem. Specifically, consider a simple instance of this problem: the union of two complete graphs of size $z + 1$, where $s$ is in one subgraph and $t$ in the other. Let the other vertices be labeled by $\{1, 2, \ldots, 2z\}$. Every assignment of the labels over the two halves gives a unique mincut, in terms of which labels fall on which side of the cut. There is exactly one partition of the vertices into two sets that cuts no edges (namely the partition consisting of the two complete graphs), and every other partition cuts at least $z$ edges. Hence a $z/2$-approximate cut is a mincut. This means that there are $\binom{2z}{z}$ problems that require a different output. So for every family of SDPs that is symmetric under permutation of the vertices and for which a $z/2$-approximate dual solution gives an $(s,t)$-mincut, the sparsity of a $z/2$-approximate dual solution is at least $\frac{\log\binom{2z}{z}}{\log m} \geq \frac{2z - \log(2\sqrt{z})}{\log m}$,^7 where we used that $\binom{2z}{z} \geq \frac{2^{2z}}{2\sqrt{z}}$. In particular this holds for the standard linear programming formulation of the $(s,t)$-maxflow/$(s,t)$-mincut problem.
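The counting bound in this example is easy to check numerically; the helper below is an illustration of the argument (with the number of vertices standing in for $m$), not part of any algorithm:

```python
import math

def sparsity_lower_bound(num_instances, m):
    """k-sparse vectors admit at most m^k distinct coordinate-permutations,
    so m^k >= |T| forces sparsity k >= log|T| / log(m)."""
    return math.log(num_instances) / math.log(m)

z = 20
m = 2 * z + 2                       # vertices of the two glued cliques
T = math.comb(2 * z, z)             # number of distinguishable instances
assert T >= 2 ** (2 * z) / (2 * math.sqrt(z))   # binom(2z,z) >= 4^z/(2 sqrt z)
k = sparsity_lower_bound(T, m)
assert k >= (2 * z - math.log2(2 * math.sqrt(z))) / math.log2(m)
```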

General width-bounds are restrictive for certain SDPs
In this section we will show that width-bounds can be restrictive when they do not consider the specific structure of an SDP.

Definition 20. An algorithm $O$ is called a general oracle if, when provided with an error parameter $\varepsilon$, it correctly implements an Oracle_ε (as in Algorithm 2) for all inputs. We use $O_\varepsilon$ to denote the algorithm provided by $O$ with error parameter $\varepsilon$ fixed. A function $w(n, m, s, r, R, \varepsilon)$ is called a general width-bound for a general oracle if, for every $0 < \varepsilon < 1/2$, the value $w(n, m, s, r, R, \varepsilon)$ is a correct width-bound (see Definition 4) for $O_\varepsilon$ for every SDP with parameters $n, m, s, r$, and $R$. In particular, the function $w$ may not depend on the structure of the input $A_1, \ldots, A_m, C, b$ or on the value of $\alpha$.
We will show that general width-bounds need to scale with $r^*$ (recall that $r^*$ denotes the smallest $\ell_1$-norm of an optimal solution to the dual). We then go on to show that if two SDPs in a class can be combined into another element of that class in a natural manner, then, under some mild conditions, $r^*$ will be of the order of $n$ and $m$ for some instances of the class.
We start by showing, for specifically constructed LPs, a lower bound on the width of any oracle. Although these LPs will not solve any useful problem, every general width-bound should also apply to these LPs. This gives a lower bound on general width-bounds.
Proof. We construct an LP for $n = m = 3$. This suffices to prove the lemma, since LPs are a subclass of SDPs, and we can increase $n$, $m$, and $s$ by adding more dimensions and $s$-dense SDP constraints that do not influence the analysis below. For some $k > 0$, consider the following LP, where the first row is the primal trace constraint. Notice that $x_1 = x_2 = 0$ due to the second constraint. This implies that $\mathrm{OPT} = 0$ and, due to the last constraint, that $x_3 \ge R$. Notice that $(0, 0, R)$ is in fact an optimal solution, so $R^* = R$.
To calculate $r^*$, consider the dual of the LP: by strong duality its optimal value is 0 as well. This implies $y_1 = y_3$, so the first constraint becomes $y_2 \ge k$. This in turn implies $r^* \ge k$, which is in fact attained (by $y = (0, k, 0)$), so $r^* = k$.
Since the oracle and width-bound should work for every $x \in \mathbb{R}^3_+$ and every $\alpha$, they should in particular work for $x = (R, 0, 0)$ and $\alpha = 0$. In this case the polytope for the oracle becomes $P_\varepsilon(x) := \{y \in \mathbb{R}^m_+ : y_1 - y_3 \le 0, \ldots\}$, since $b^T y = y_1 - y_3$, $c^T x = 1$, $a_1^T x = 1$, $a_2^T x = 1/k$, and $a_3^T x = -1$. This implies that for every $y \in P_\varepsilon(x)$ we have $y_2 \ge k(1 - \varepsilon/R) \ge k/2 = r^*/2$, where the second inequality follows from the assumptions $\varepsilon \le 1/2$ and $R \ge 1$.
Notice that the term $\left\|\sum_{j=1}^m y_j A_j - C\right\|$ in the definition of width for an SDP becomes $\left\|\sum_{j=1}^m y_j a_j - c\right\|_\infty$ in the case of an LP. In our case, due to the second constraint in the dual, we know that $\left\|\sum_{j=1}^m y_j a_j - c\right\|_\infty \ge y_2 \ge r^*/2$ for every vector $y$ from $P_\varepsilon(x)$. This shows that every oracle has width at least $r^*/2$ for this LP.
Proof. Consider the LP given by Lemma 21 with r * = r. Using a general oracle with general width-bound w for this LP implies the corollary.
Note that this bound applies both to our algorithm and to the one given by Brandão and Svore. It turns out that for many classes of SDPs it is natural to assume that $m$, $r^*$, and $R^*$ grow linearly with $n$, and that the "logical" choice of $\varepsilon$ also scales linearly with $n$. We now argue that this is for instance the case when SDPs in a class combine in a natural manner. As an example, consider the MAXCUT problem. For, e.g., $d$-regular graphs the MAXCUT value grows linearly with the number of vertices $n$. Therefore, one is generally interested in finding a constant multiplicative approximation to the optimal value. For $d$-regular graphs this translates to an additive error which scales linearly with the number of vertices. We argue below that for the SDP-relaxation it is also natural to assume that $r$ and $R$ grow linearly in $n$. Take for example two SDP-relaxations for the MAXCUT problem on two graphs $G^{(1)}$ and $G^{(2)}$ (on $n^{(1)}$ and $n^{(2)}$ vertices, respectively): for $i = 1, 2$,
$$\max\ \mathrm{Tr}\big(L(G^{(i)})X^{(i)}\big) \quad \text{s.t.}\quad \mathrm{Tr}\,X^{(i)} \le n^{(i)},\quad \mathrm{Tr}\big(E_{jj}X^{(i)}\big) \le 1 \text{ for } j = 1, \ldots, n^{(i)},\quad X^{(i)} \succeq 0,$$
where $L(G)$ is the Laplacian of a graph. Note that this is not normalized to operator norm $\le 1$, but for simplicity we ignore this here. If we denote the direct sum of two matrices by $\oplus$, that is,
$$A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix},$$
then, for the disjoint union of the two graphs, we have $L(G^{(1)} \cup G^{(2)}) = L(G^{(1)}) \oplus L(G^{(2)})$.
This, plus the fact that the trace distributes over direct sums of matrices, means that the SDP relaxation for MAXCUT on $G^{(1)} \cup G^{(2)}$ is the same as a natural combination of the two separate maximizations:
$$\max\ \mathrm{Tr}\big(L(G^{(1)})X^{(1)}\big) + \mathrm{Tr}\big(L(G^{(2)})X^{(2)}\big)$$
$$\text{s.t.}\quad \mathrm{Tr}\,X^{(1)} + \mathrm{Tr}\,X^{(2)} \le n^{(1)} + n^{(2)},$$
$$\mathrm{Tr}\big(E_{jj}X^{(1)}\big) \le 1 \ \text{ for } j = 1, \ldots, n^{(1)},$$
$$\mathrm{Tr}\big(E_{jj}X^{(2)}\big) \le 1 \ \text{ for } j = 1, \ldots, n^{(2)},$$
$$X^{(1)}, X^{(2)} \succeq 0.$$
It is easy to see that the new value of $n$ is $n^{(1)} + n^{(2)}$, the new value of $m$ is $m^{(1)} + m^{(2)} - 1$, and the new value of $R^*$ is $n^{(1)} + n^{(2)} = R^{*(1)} + R^{*(2)}$. Likewise, it is natural to assume that the desired approximation error for the combined SDP is the sum of the desired errors for the individual SDPs: starting with feasible solutions $X^{(i)}$ that are $\varepsilon^{(i)}$-approximate solutions to the two SDP-relaxations ($i = 1, 2$), the matrix $X^{(1)} \oplus X^{(2)}$ is an $(\varepsilon^{(1)} + \varepsilon^{(2)})$-approximate solution to the combined SDP. It remains to see what happens to $r^*$ and hence, for general width-bounds, what happens to $w$. As we will see later in this section, under some mild conditions, these kinds of combinations imply that there are MAXCUT-relaxation SDPs for which $r^*$ also increases linearly, but this requires a bit more work.
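The identity $L(G^{(1)} \cup G^{(2)}) = L(G^{(1)}) \oplus L(G^{(2)})$ used above is easy to check numerically. The following sketch (illustrative only; the two tiny example graphs are our own choice) builds the block-diagonal adjacency matrix of a disjoint union and compares Laplacians:

```python
import numpy as np

def laplacian(adj: np.ndarray) -> np.ndarray:
    """Graph Laplacian L = D - A (degree matrix minus adjacency matrix)."""
    return np.diag(adj.sum(axis=1)) - adj

# Two small graphs: a triangle and a single edge.
A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A2 = np.array([[0, 1], [1, 0]])

# Disjoint union: block-diagonal adjacency matrix.
A_union = np.block([[A1, np.zeros((3, 2), dtype=int)],
                    [np.zeros((2, 3), dtype=int), A2]])

# L(G1 ∪ G2) equals the direct sum L(G1) ⊕ L(G2).
L_direct_sum = np.block([[laplacian(A1), np.zeros((3, 2), dtype=int)],
                         [np.zeros((2, 3), dtype=int), laplacian(A2)]])
assert np.array_equal(laplacian(A_union), L_direct_sum)
```

Since the trace distributes over direct sums, the combined MAXCUT objective also splits as in the display above.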

Definition 23. We say that a class of SDPs (each with an associated allowed approximation error) is combinable if there is a $k \ge 0$ such that for every two elements in this class, $(\mathrm{SDP}^{(a)}, \varepsilon^{(a)})$ and $(\mathrm{SDP}^{(b)}, \varepsilon^{(b)})$, there is an instance in the class, $(\mathrm{SDP}^{(c)}, \varepsilon^{(c)})$, that is a combination of the two in the following sense: the first $k$ constraints are summed pairwise, the remaining constraints are added separately, the objective matrices are combined as $C^{(a)} \oplus C^{(b)}$, and $\varepsilon^{(c)} = \varepsilon^{(a)} + \varepsilon^{(b)}$. In other words, some fixed set of constraints is summed pairwise, and the remaining constraints are added separately.
The motivation for the above definition is the following: if $X^{(a)}$ and $X^{(b)}$ are feasible solutions to $\mathrm{SDP}^{(a)}$ and $\mathrm{SDP}^{(b)}$ that are $\varepsilon^{(a)}$-approximate and $\varepsilon^{(b)}$-approximate solutions respectively, then $X^{(a)} \oplus X^{(b)}$ is an $(\varepsilon^{(a)} + \varepsilon^{(b)})$-approximate solution to $\mathrm{SDP}^{(c)}$.
Furthermore, note that this is a natural generalization of the combining property of the MAXCUT relaxations (in that case $k = 1$, to account for the constraint giving the trace bound).

Theorem 24. If a class of SDPs is combinable and there is an element $\mathrm{SDP}^{(1)}$ for which every optimal dual solution has the property that $\sum_{j=k+1}^m y_j \ge \delta$ for some $\delta > 0$, then there is a sequence $(\mathrm{SDP}^{(t)}, \varepsilon^{(t)})_{t\in\mathbb{N}}$ in the class such that $\frac{R^{*(t)} r^{*(t)}}{\varepsilon^{(t)}}$ increases linearly in $n^{(t)}$, $m^{(t)}$, and $t$, where $\mathrm{SDP}^{(t)}$ is the $t$-fold combination of $\mathrm{SDP}^{(1)}$ with itself.
Proof. The sequence we consider is the $t$-fold combination of $\mathrm{SDP}^{(1)}$ with itself, where $\mathrm{SDP}^{(1)}$ is written as $\max\ \mathrm{Tr}(CX)$ subject to its $m$ constraints.

First, let us consider the value of $\mathrm{OPT}^{(t)}$. Let $X^{(1)}$ be an optimal solution to $\mathrm{SDP}^{(1)}$ and for all $i \in [t]$ let $X_i = X^{(1)}$. Since these $X_i$ form a feasible solution to $\mathrm{SDP}^{(t)}$, this shows that $\mathrm{OPT}^{(t)} \ge t \cdot \mathrm{OPT}^{(1)}$. Furthermore, let $y^{(1)}$ be an optimal dual solution of $\mathrm{SDP}^{(1)}$; then the corresponding combined dual vector (sharing the first $k$ coordinates of $y^{(1)}$ and repeating the remaining coordinates $t$ times) is a feasible dual solution for $\mathrm{SDP}^{(t)}$ with objective value $t \cdot \mathrm{OPT}^{(1)}$, so $\mathrm{OPT}^{(t)} = t \cdot \mathrm{OPT}^{(1)}$.

Next, let us consider the value of $r^{*(t)}$. Let $\tilde{y} \oplus y_1 \oplus \cdots \oplus y_t$ be an optimal dual solution for $\mathrm{SDP}^{(t)}$, split into the parts of $y$ that correspond to the different parts of the combination. Then $\tilde{y} \oplus y_i$ is a feasible dual solution for $\mathrm{SDP}^{(1)}$ and hence $b^T(\tilde{y} \oplus y_i) \ge \mathrm{OPT}^{(1)}$. On the other hand we have
$$\sum_{i=1}^t b^T(\tilde{y} \oplus y_i) = \mathrm{OPT}^{(t)} = t \cdot \mathrm{OPT}^{(1)},$$
so each term in the sum is actually equal to $\mathrm{OPT}^{(1)}$. But if $\tilde{y} \oplus y_i$ is an optimal dual solution of $\mathrm{SDP}^{(1)}$, then $\|\tilde{y} \oplus y_i\|_1 \ge r^{*(1)}$ by definition and $\|y_i\|_1 \ge \delta$. We conclude that $r^{*(t)} \ge r^{*(1)} - \delta + t\delta$.
Now that we know the behavior of $r^*$ under combinations, let us look at the primal to find a similar statement for $R^{*(t)}$. Define a new SDP, $\overline{\mathrm{SDP}}^{(t)}$, in which all the constraints are summed when combining; that is, in Definition 23 we take $k = m^{(1)}$, and, contrary to that definition, we even sum the psd constraints. This SDP has the same objective function as $\mathrm{SDP}^{(t)}$ but a larger feasible region: every feasible $X_1, \ldots, X_t$ for $\mathrm{SDP}^{(t)}$ is also feasible for $\overline{\mathrm{SDP}}^{(t)}$. However, by a change of variables $X := X_1 + \cdots + X_t$, $\overline{\mathrm{SDP}}^{(t)}$ is simply a scaled version of $\mathrm{SDP}^{(1)}$. So $\overline{\mathrm{SDP}}^{(t)}$ has optimal value $t \cdot \mathrm{OPT}^{(1)}$. Since optimal solutions to $\overline{\mathrm{SDP}}^{(t)}$ are scaled optimal solutions to $\mathrm{SDP}^{(1)}$, we have $\overline{R}^{*(t)} = t \cdot R^{*(1)}$. Combining the above, it follows that every optimal solution to $\mathrm{SDP}^{(t)}$ is optimal for $\overline{\mathrm{SDP}}^{(t)}$ as well, and hence has trace at least $t \cdot R^{*(1)}$, so $R^{*(t)} \ge t \cdot R^{*(1)}$. We conclude that
$$r^{*(t)} \ge r^{*(1)} + (t-1)\delta \quad \text{and} \quad \frac{R^{*(t)} r^{*(t)}}{\varepsilon^{(t)}} \ge \frac{t \cdot R^{*(1)}\big(r^{*(1)} + (t-1)\delta\big)}{t \cdot \varepsilon^{(1)}} = \Omega(t).$$
This shows that for many natural SDP formulations of combinatorial problems, such as the MAXCUT relaxation or LPs that have to do with resource management, $R^* r^*/\varepsilon$ increases linearly in $n$ and $m$ for some instances. Hence, using $R^* \le R$ and Lemma 21, $Rw/\varepsilon$ grows at least linearly when a general width-bound is used.

Lower bounds on the quantum query complexity
In this section we will show that every LP-solver (and hence every SDP-solver) that can distinguish two optimal values with high probability needs $\Omega\!\left(\sqrt{\max\{n,m\}}\,(\min\{n,m\})^{3/2}\right)$ quantum queries in the worst case.
For the lower bound on LP-solving we give a reduction from a composition of majority and OR functions.

Lemma 26. The promise $\mathrm{MAJ}_a$-$\mathrm{OR}_b$-$\mathrm{MAJ}_c$ problem requires $\Omega(a\sqrt{b}\,c)$ quantum queries.

Proof. The promise version of $\mathrm{MAJ}_k$ is known to require $\Omega(k)$ quantum queries. Likewise, it is known that the $\mathrm{OR}_k$ function requires $\Omega(\sqrt{k})$ queries. Furthermore, the adversary bound tells us that quantum query complexity is multiplicative under composition of functions; Kimmel [Kim13, Lemma A.3 (Lemma 6 in the arXiv version)] showed that this even holds for promise functions.

Lemma 27. Determining the value
$$v(Z) := \sum_{i=1}^{a} \max_{j \in [b]} \big|Z^{(i)}_{\cdot j}\big|$$
(where $Z^{(i)}$ is the $c \times b$ matrix with entries $Z^{(i)}_{kj} = Z_{ijk}$, $Z^{(i)}_{\cdot j} \in \{0,1\}^c$ is its $j$th column, and $|\cdot|$ denotes Hamming weight) for a $Z$ from the promise $\mathrm{MAJ}_a$-$\mathrm{OR}_b$-$\mathrm{MAJ}_c$ problem up to additive error $\varepsilon = 1/3$ solves the promise $\mathrm{MAJ}_a$-$\mathrm{OR}_b$-$\mathrm{MAJ}_c$ problem.

Proof. Notice that due to the first promise, every inner string $Z_{ij} \in \{0,1\}^c$ has Hamming weight $c/2$ or $c/2 + 1$. This implies that:
• if the $i$th OR is 0, then all of its inner MAJ functions are 0 and hence $\max_{j\in[b]} |Z^{(i)}_{\cdot j}| = c/2$;
• if the $i$th OR is 1, then at least one of its inner MAJ functions is 1 and hence $\max_{j\in[b]} |Z^{(i)}_{\cdot j}| = c/2 + 1$.
Now, if we denote the string of outcomes of the OR functions by $\bar{Z} \in \{0,1\}^a$, then $v(Z) = a \cdot c/2 + |\bar{Z}|$. Hence determining the left-hand side determines $|\bar{Z}|$; this Hamming weight is either $a/2$ if the full function evaluates to 0, or $a/2 + 1$ if it evaluates to 1.

Lemma 28. For an input $Z \in \{0,1\}^{a\times b\times c}$ there is an LP with $m = c + a$ and $n = c + ab$ for which the optimal value is $v(Z) = \sum_{i=1}^{a} \max_{j \in [b]} |Z^{(i)}_{\cdot j}|$.

Furthermore, a query to an entry of the input matrix or vector costs at most 1 query to Z.
Proof. Let $Z^{(i)}$ be the matrix one gets by fixing the first index of $Z$ and putting the entries in a $c \times b$ matrix, so $Z^{(i)}_{kj} = Z_{ijk}$. We define the following LP, with non-negative variables $w \in \mathbb{R}^c_+$ and $v^{(1)}, \ldots, v^{(a)} \in \mathbb{R}^b_+$:
$$\max \sum_{k=1}^{c} w_k \quad \text{s.t.} \quad w_k \le \sum_{i=1}^{a} \big(Z^{(i)} v^{(i)}\big)_k \ \text{ for all } k \in [c], \qquad \sum_{j=1}^{b} v^{(i)}_j \le 1 \ \text{ for all } i \in [a].$$
Notice every $Z^{(i)}$ is of size $c \times b$, so that indeed $m = c + a$ and $n = c + ab$.

For every $i \in [a]$ there is a constraint that says $\|v^{(i)}\|_1 = \sum_{j=1}^b v^{(i)}_j \le 1$. The constraints involving $w$ say that for every $k \in [c]$ we have $w_k \le \sum_{i=1}^a (Z^{(i)} v^{(i)})_k$, where $(Z^{(i)} v^{(i)})_k$ is the $k$th entry of the matrix-vector product $Z^{(i)} v^{(i)}$. Clearly, for an optimal solution these constraints will be satisfied with equality, since in the objective function $w_k$ has a positive weight. Summing over $k$ on both sides, we see that
$$\sum_{k=1}^{c} w_k = \sum_{i=1}^{a} \big\|Z^{(i)} v^{(i)}\big\|_1$$
will be maximized. Note that we can use the $\ell_1$-norm as a shorthand for the sum over vector elements, since all elements are positive. In particular, the value of $\|Z^{(i)} v^{(i)}\|_1$ is given by $\sum_{j=1}^b v^{(i)}_j |Z^{(i)}_{\cdot j}|$. Now $\|Z^{(i)} v^{(i)}\|_1$ will be maximized by putting all weight of $v^{(i)}$ on the index that corresponds to the column of $Z^{(i)}$ with the highest Hamming weight; in particular, in the optimum $\|Z^{(i)} v^{(i)}\|_1 = \max_{j\in[b]} |Z^{(i)}_{\cdot j}|$. Putting everything together gives
$$\mathrm{OPT} = \sum_{i=1}^{a} \max_{j\in[b]} \big|Z^{(i)}_{\cdot j}\big|.$$

Proof. This follows from Theorem 29 and LP duality.
It is important to note that the parameters $R$ and $r$ from the Arora-Kale algorithm are not constant in this family of LPs ($R, r = \Theta(\min\{n,m\}^2)$ here), and hence this lower bound does not contradict the $\sqrt{mn}$ scaling of the complexity of our SDP-solver or Brandão and Svore's. Since we show in the appendix that one can always rewrite the LP (or SDP) so that two of the parameters $R$, $r$, $\varepsilon$ are constant, the lower bound implies that any algorithm with a sublinear dependence on $m$ or $n$ has to depend at least polynomially on $Rr/\varepsilon$. For example, the above family of LPs shows that an algorithm with a $\sqrt{mn}$ dependence has to have an $(Rr/\varepsilon)^\kappa$ factor in its complexity with $\kappa \ge 1/4$. It remains an open question whether a lower bound of $\Omega(\sqrt{mn})$ can be proven for a family of LPs/SDPs where $\varepsilon$, $R$, and $r$ are all constant.

Conclusion
In this paper we gave better algorithms and lower bounds for quantum SDP-solvers, improving upon recent work of Brandão and Svore [BS17]. Here are a few directions for future work:
• Better upper bounds. The runtime of our algorithm, like that of the earlier algorithm of Brandão and Svore, has better dependence on m and n than the best classical SDP-solvers, but worse dependence on s and on Rr/ε. In subsequent work (see the introduction), these dependencies have been improved, but there is room for further improvement.
• Applications of our algorithm. As mentioned, both our and Brandão-Svore's quantum SDP-solvers only improve upon the best classical algorithms in a specific regime of parameters, namely where $\sqrt{mn} \gg Rr/\varepsilon$. Unfortunately, we don't know particularly interesting problems in combinatorial optimization in this regime. As shown in Section 3, many natural SDP formulations will not fall into this regime. However, it would be interesting to find useful SDPs for which our algorithm gives a significant speed-up.
• New algorithms. As in the work by Arora and Kale, it might be more promising to look at oracles (now quantum) that are designed for specific SDPs. Such oracles could build on the techniques developed here, or develop totally new techniques. It might also be possible to speed up other classical SDP solvers, for example those based on interior-point methods.
• Better lower bounds. Our lower bounds are probably not optimal, particularly for the case where m and n are not of the same order. The most interesting case would be to get lower bounds that are simultaneously tight in the parameters m, n, s, and Rr/ε.

A Classical estimation of the expectation value Tr(Aρ)
To provide contrast to Section 2.2, here we describe a classical procedure to efficiently estimate $\mathrm{Tr}(A\rho)$, where $A$ is a Hermitian matrix such that $\|A\| \le 1$, and $\rho = \exp(-H)/\mathrm{Tr}(\exp(-H))$ for some Hermitian matrix $H$. The results in this section can be seen as a generalization of [AK16, Section 7]. The key observation is that if we are given a Hermitian matrix $B \succeq 0$, and if we take a random vector $u = (u_1, \ldots, u_n)$ where each $u_i \in \{\pm 1\}$ is uniformly distributed, then
$$\mathbb{E}\Big[u^T \sqrt{B} A \sqrt{B}\, u\Big] = \mathrm{Tr}\Big(\sqrt{B} A \sqrt{B}\Big) = \mathrm{Tr}(AB).$$
We now show that $u^T \sqrt{B} A \sqrt{B} u$ is highly concentrated around its mean, using Chebyshev's inequality.
Lemma 31. Given a Hermitian matrix $A$ with $\|A\| \le 1$, a psd matrix $B$, and a parameter $0 < \theta \le 1$: with probability $1 - 1/16$, the average of $k = O(1/\theta^2)$ independent samples from the distribution of $u^T\sqrt{B}A\sqrt{B}u$ is at most $\theta\,\mathrm{Tr}(B)$ away from $\mathrm{Tr}(AB)$. Here $u = (u_i)$ and the $u_i \in \{\pm 1\}$ are i.i.d. uniformly distributed.

Proof. Let $F_k$ be the random variable $\frac{1}{k}\sum_{i=1}^{k} (u^{(i)})^T\sqrt{B}A\sqrt{B}\,u^{(i)}$, where each of the vectors $u^{(i)} \in \{\pm 1\}^n$ is sampled from the distribution described above. By the above it is clear that $\mathbb{E}[F_k] = \mathrm{Tr}(AB)$.
We will use Chebyshev's inequality, which, in our setting, states that for every $t > 0$
$$\Pr\big[|F_k - \mathbb{E}[F_k]| \ge t\sigma_k\big] \le \frac{1}{t^2}, \tag{12}$$
where $\sigma_k^2$ is the variance of $F_k$. We now upper bound this variance. First note that $\mathrm{var}(F_k) = \frac{1}{k}\mathrm{var}(u^T\sqrt{B}A\sqrt{B}u)$. It therefore suffices to upper bound the variance $\sigma^2$ of $u^T\sqrt{B}A\sqrt{B}u$. Writing out the second moment and calculating the cross terms using $\mathbb{E}[u_i] = 0$, $\mathbb{E}[u_i^2] = 1$, and the independence of the $u_i$'s, an application of the Cauchy-Schwarz inequality yields
$$\sigma^2 \le 2\,\mathrm{Tr}(BA^2B) \le 2\,\mathrm{Tr}(B^2) \le 2\,\mathrm{Tr}(B)^2,$$
where we used $\|A\| \le 1$ and $\mathrm{Tr}(AY) \le \|A\|\,\mathrm{Tr}(Y)$ for any $Y \succeq 0$, in particular for $BA^2B$ and $B^2$.
It follows that $\sigma_k^2 \le 2\,\mathrm{Tr}(B)^2/k$. Chebyshev's inequality (12) therefore shows that, for $k = 32/\theta^2$ and $t = 4$,
$$\Pr\big[|F_k - \mathrm{Tr}(AB)| \ge \theta\,\mathrm{Tr}(B)\big] \le \frac{1}{16},$$
since $t\sigma_k \le 4\sqrt{2/k}\,\mathrm{Tr}(B) = \theta\,\mathrm{Tr}(B)$. A simple computation shows that the success probability in the above lemma can be boosted to $1 - \delta$ by taking the median of $K = O(\log(1/\delta))$ repetitions: applying a Chernoff bound to the indicator variables of the events that the individual repetitions succeed shows that the median is $\theta\,\mathrm{Tr}(B)$-accurate with probability at least $1 - \delta$. This proves the following lemma:

Lemma 32. Given a Hermitian matrix $A$ with $\|A\| \le 1$, a psd matrix $B$, and parameters $0 < \delta \le 1/2$ and $0 < \theta \le 1$: using $k = O\big(\log(\frac{1}{\delta})/\theta^2\big)$ samples from the distribution of $u^T\sqrt{B}A\sqrt{B}u$, one can find an estimate of $\mathrm{Tr}(AB)$ that, with probability $1 - \delta$, has additive error at most $\theta\,\mathrm{Tr}(B)$. Here $u = (u_i)$ and the $u_i \in \{\pm 1\}$ are i.i.d. uniformly distributed.
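The estimator of Lemmas 31 and 32 is easy to experiment with classically. The sketch below (a hedged illustration, not the paper's implementation; matrix sizes and parameter choices are arbitrary) averages batches of samples $u^T\sqrt{B}A\sqrt{B}u$ and takes a median of batch means to boost the success probability:

```python
import numpy as np

def estimate_trace_AB(A, B_sqrt, theta=0.2, delta=0.05, seed=None):
    """Median-of-means estimate of Tr(A B) for B = B_sqrt @ B_sqrt
    (B_sqrt assumed symmetric psd), using the identity
    E[u^T sqrt(B) A sqrt(B) u] = Tr(A B) for uniform sign vectors u."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    k = int(np.ceil(32 / theta ** 2))            # samples per batch (Chebyshev)
    K = 2 * int(np.ceil(np.log(1 / delta))) + 1  # odd number of batches (median)
    batch_means = []
    for _ in range(K):
        u = rng.choice([-1.0, 1.0], size=(k, n))     # k random sign vectors
        v = u @ B_sqrt                               # rows are u^T sqrt(B)
        batch_means.append(np.einsum('ij,jk,ik->i', v, A, v).mean())
    return float(np.median(batch_means))
```

With these parameters the estimate is, with high probability, within roughly $\theta\,\mathrm{Tr}(B)$ of $\mathrm{Tr}(AB)$.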
Looking back at Meta-Algorithm 1, we would like to apply the above lemma to $B = \exp(-H)$. Since it is expensive to compute the exponential of a matrix, it is of interest to consider samples of $v^T A v$, where $v$ is an approximation of $\sqrt{B}u$. Say $\|\sqrt{B}u - v\| \le \kappa$ and, as always, $\|A\| \le 1$. Then
$$\big|u^T\sqrt{B}A\sqrt{B}u - v^TAv\big| \le \kappa\big(2\|\sqrt{B}u\| + \kappa\big).$$
Now observe that we are interested in an additive error of the form $\theta\,\mathrm{Tr}(B)$; choosing $\kappa = \theta/\|u\|$ shows that the additional error incurred by sampling $v^TAv$ is proportional to $\theta\,\mathrm{Tr}(B)$. Finally, a $\kappa$-approximation of $\sqrt{B}u$ with $\kappa = \theta/\|u\|$ can be obtained by using the truncated Taylor series of $\exp(-H/2)$ of degree $p = \max\{2e\|H\|, \log(\sqrt{n}/\theta)\}$ applied to $u$. (For ease of notation, we write $H$ for $\eta H$.) Here we assume that $\theta \le 1$. Then, since $\lambda_{\min}(B) \ge 1$, we trivially have $\theta/\|u\| \le 1/\sqrt{n} \le \sqrt{n} \le \|\sqrt{B}u\|$.
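The truncated-Taylor approximation of $\sqrt{B}u = e^{-H/2}u$ needs only matrix-vector products with $H$, avoiding any matrix exponentiation. A minimal sketch (illustrative parameters, not the paper's choices):

```python
import numpy as np

def taylor_exp_apply(H, u, p):
    """Approximate exp(-H/2) u by the degree-p truncated Taylor series,
    using only matrix-vector products with H."""
    term = u.astype(float)
    out = u.astype(float)
    for k in range(1, p + 1):
        term = (-0.5 / k) * (H @ term)   # term is now (-H/2)^k u / k!
        out = out + term
    return out

rng = np.random.default_rng(0)
n = 32
M = rng.standard_normal((n, n))
H = (M + M.T) / 2
H = H / np.linalg.norm(H, 2)             # normalize so ||H|| <= 1
u = rng.choice([-1.0, 1.0], size=n)      # random sign vector as in Lemma 31

# Exact sqrt(B) u with B = exp(-H), via eigendecomposition.
lam, V = np.linalg.eigh(H)
exact = V @ (np.exp(-lam / 2) * (V.T @ u))

assert np.linalg.norm(taylor_exp_apply(H, u, 20) - exact) < 1e-10
```

Since $\|H/2\| \le 1/2$ here, the degree-20 remainder is bounded by $\|u\|\sum_{k>20} (1/2)^k/k!$, which is far below the tolerance used.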

Lemma 33. Given a Hermitian $s$-sparse matrix $A$ with $\|A\| \le 1$, a psd matrix $B = \exp(-H)$ with $H \preceq 0$ for a $d$-sparse $H$, and parameters $0 < \delta \le 1/2$ and $0 < \theta \le 1$: with probability $1 - \delta$, using $k = O\big(\log(\frac{1}{\delta})/\theta^2\big)$ samples from the distribution of $v^T A v$, one can find an estimate that is at most $\theta\,\mathrm{Tr}(B)$ away from $\mathrm{Tr}(AB)$. Here $v$ is the truncated Taylor series of $\exp(-H/2)$, as above, applied to a uniformly random sign vector $u \in \{\pm 1\}^n$.

Lemma 34. Given $m$ Hermitian $s$-sparse $n \times n$ matrices $A_1 = I, A_2, \ldots, A_m$ with $\|A_j\| \le 1$ for all $j$, a Hermitian $d$-sparse $n \times n$ matrix $H$ with $\|H\| \le K$, and parameters $0 < \delta \le 1/2$ and $0 < \theta \le 1$: with probability $1 - \delta$, we can compute $\theta$-approximations $a_1, \ldots, a_m$ of $\mathrm{Tr}(A_1\rho), \ldots, \mathrm{Tr}(A_m\rho)$, where $\rho = \exp(-H)/\mathrm{Tr}(\exp(-H))$.

Proof. The results of Lemma 33 say that for each $j$, using those $k$ samples of $v^T A_j v$, we can construct a $\theta\,\mathrm{Tr}(B')/4$-approximation $a'_j$ of $\mathrm{Tr}(A_j B')$ with probability $1 - \delta/(2m)$. Therefore, by a union bound, with probability $1 - \delta/2$ we can construct $\theta\,\mathrm{Tr}(B')/4$-approximations $a'_1, \ldots, a'_m$ of $\mathrm{Tr}(A_1 B'), \ldots, \mathrm{Tr}(A_m B')$. Therefore, for each $j$, with probability at least $1 - \delta$, by Lemma 7 we have that $a_j = a'_j/a'_1$ is a $\theta$-approximation of $\mathrm{Tr}(A_j B')/\mathrm{Tr}(B')$, and hence it is a $\theta$-approximation of $\mathrm{Tr}(A_j B)/\mathrm{Tr}(B)$.

B Implementing smooth functions of Hamiltonians
In this appendix we show how to efficiently implement smooth functions of a given Hamiltonian.
First we explain what we mean by a function of a Hamiltonian $H \in \mathbb{C}^{n\times n}$, i.e., of a Hermitian matrix. Since Hermitian matrices are diagonalizable by a unitary matrix, we can write $H = U^\dagger \mathrm{diag}(\lambda) U$, where $\lambda \in \mathbb{R}^n$ is the vector of eigenvalues. Then for a function $f: \mathbb{R} \to \mathbb{C}$ we define $f(H) := U^\dagger \mathrm{diag}(f(\lambda)) U$, with the slight abuse of notation that we apply $f$ to the eigenvalues in $\lambda$ one by one. Note that if we approximate $f$ by $\tilde{f}$, then
$$\big\|f(H) - \tilde{f}(H)\big\| = \max_{\lambda_i \in \mathrm{Spec}(H)} \big|f(\lambda_i) - \tilde{f}(\lambda_i)\big|.$$
The main idea of the method presented below is to implement a map $\tilde{f}(H)$, where $\tilde{f}$ is a good (finite) Fourier approximation of $f$ for all $x \in [-\|H\|, \|H\|]$. The novelty in our approach is that we construct a Fourier approximation based on some polynomial approximation. In the special case when $f$ is analytic and $\|H\|$ is less than the radius of convergence of the Taylor series, we can obtain good polynomial approximations simply by truncating the Taylor series, with logarithmic dependence on the precision parameter. Finally we implement the Fourier series using Hamiltonian simulation and the Linear Combination of Unitaries (LCU) trick [CW12, BCC+15, BCK15].
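The definition $f(H) := U^\dagger\mathrm{diag}(f(\lambda))U$ and the fact that the operator-norm error of an approximation $\tilde{f}$ is controlled by the pointwise error over the spectrum can be checked directly. A small numerical sketch (our own illustration, using a truncated Taylor series of the exponential as $\tilde{f}$):

```python
import numpy as np

def apply_fn(H, fn):
    """f(H) := V diag(f(lambda)) V^dag for Hermitian H with eigenpairs (lambda, V)."""
    lam, V = np.linalg.eigh(H)
    return (V * fn(lam)) @ V.conj().T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2

f = np.exp
f_approx = lambda x: 1 + x + x ** 2 / 2 + x ** 3 / 6   # truncated Taylor of exp

lam = np.linalg.eigvalsh(H)
sup_err = np.max(np.abs(f(lam) - f_approx(lam)))       # max error over Spec(H)
op_err = np.linalg.norm(apply_fn(H, f) - apply_fn(H, f_approx), 2)
# The operator-norm error equals the worst pointwise error on the spectrum.
assert op_err <= sup_err + 1e-12
```

Both matrices are diagonal in the same eigenbasis, which is exactly why the operator norm of the difference is attained at an eigenvalue of $H$.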
This approach was already used in several earlier papers, particularly in [CKS17, CS17]. There the main technical difficulty was to obtain a good truncated Fourier series. This is a nontrivial task, since on top of the approximation error one needs to optimize two other parameters of the Fourier approximation that determine the complexity of the implementation, namely:
• the largest time parameter t that appears in some Fourier term e^{−itH}, and
• the total weight of the coefficients, by which we mean the 1-norm of the vector of coefficients.
Earlier works used clever integral approximations and involved calculus to construct a good Fourier approximation for a specific function f. We are not aware of a general result.
In contrast, our Theorem 40 and Corollary 42 avoid the use of any integration: they obtain a low-weight Fourier approximation directly from the Taylor series. The described method is completely general, and has the nice property that the maximal time parameter t depends logarithmically on the desired approximation precision. Since it uses the Taylor series, it is easy to apply to a wide range of smooth functions.
The circuit we describe for the implementation of the linear operator f (H) : C n → C n is going to depend on the specific function f , but not on H; the H-dependence is only coming from Hamiltonian simulation. Since the circuit for a specific f can be constructed in advance, we do not need to worry about the (polynomial) cost of constructing the circuit, making the analysis simpler. When we describe gate complexity, we count the number of two-qubit gates needed for a quantum circuit implementation, just as in Section 2.
Since this appendix presents stand-alone results, here we deviate slightly from the notation used throughout the rest of the paper, to conform to the standard notation used in the literature (for example, ε, r, θ, and a have a different meaning in this appendix). For simplicity we also assume that the Hamiltonian H acts on $\mathbb{C}^n$, where n is a power of 2. Whenever we write log(formula) in some complexity statement we actually mean $\log_2(2 + \text{formula})$, in order to avoid incorrect near-0 or even negative expressions in complexity bounds that would appear for small values of the formula.
Hamiltonian simulation. We implement each term in a Fourier series using a Hamiltonian simulation algorithm, and combine the terms using the LCU Lemma. Specifically we use [BCK15], but in fact our techniques would work with any kind of Hamiltonian simulation algorithm. The following definition describes what we mean by controlled Hamiltonian simulation.
Definition 35. Let $M = 2^J$ for some $J \in \mathbb{N}$, $\gamma \in \mathbb{R}$, and $\varepsilon \ge 0$. We say that a unitary $\tilde{W}$ is an $(M, \gamma, \varepsilon)$-simulation of the Hamiltonian $H$ if
$$\Big\|\tilde{W} - \sum_{m=-M}^{M-1} |m\rangle\langle m| \otimes e^{im\gamma H}\Big\| \le \varepsilon,$$
where $m$ ranges over the $(J+1)$-bit signed integers. Note that in this definition we assume that both positive and negative powers of $e^{i\gamma H}$ are simulated. This is necessary for our Fourier series, but sometimes we use only positive powers, e.g., for phase estimation; in that case we can simply ignore the negative powers.
The following lemma is inspired by the techniques of [CKS17]. It calculates the cost of such a controlled Hamiltonian simulation in terms of queries to the input oracles (4)-(5) as described in Section 2.
Lemma 36. Let $H$ be a $d$-sparse Hamiltonian with $\|H\| \le K$, and let $\tau := M\gamma K$. Then an $(M, \gamma, \varepsilon)$-simulation of $H$ can be implemented using $O(\tau d \log(\tau/\varepsilon)/\log\log(\tau/\varepsilon))$ queries, with gate complexity $O\big(\tau d \log(\tau/\varepsilon)/\log\log(\tau/\varepsilon)\big[\log(n) + \log^{2.5}(\tau/\varepsilon)\big]\big)$.

Proof. We use the results of [BCK15, Lemmas 9-10], which tell us that a $d$-sparse Hamiltonian $H$ can be simulated for time $t$ with precision $\varepsilon$ in the operator norm using
$$O\left(td\|H\| \frac{\log(td\|H\|/\varepsilon)}{\log\log(td\|H\|/\varepsilon)}\right) \tag{13}$$
queries and gate complexity
$$O\left(td\|H\| \frac{\log(td\|H\|/\varepsilon)}{\log\log(td\|H\|/\varepsilon)}\Big[\log(n) + \log^{2.5}(td\|H\|/\varepsilon)\Big]\right). \tag{14}$$
Now we use a standard trick to remove log factors from the implementation cost, and write the given unitary $W$ as a product of increasingly precisely implemented controlled Hamiltonian simulation unitaries. For $b \in \{0,1\}$ let us introduce the projector $|b\rangle\langle b|_j := I_{2^j} \otimes |b\rangle\langle b| \otimes I_{2^{J-j}}$, where $J = \log(M)$. Observe that
$$W = \prod_{j=0}^{J} \Big(|0\rangle\langle 0|_j \otimes I + |1\rangle\langle 1|_j \otimes e^{\pm i 2^j \gamma H}\Big), \tag{15}$$
where the sign in the exponent is negative for the sign bit (which has weight $-M = -2^J$) and positive for the other bits. The $j$-th operator $e^{\pm i2^j\gamma H}$ in the product (15) can be implemented with precision $2^{j-J-1}\varepsilon$ using
$$O\left(2^j\gamma K d \log\Big(\tfrac{2^j\gamma K}{\varepsilon 2^{j-J-1}}\Big)\Big/\log\log\Big(\tfrac{2^j\gamma K}{\varepsilon 2^{j-J-1}}\Big)\right) = O\big(2^j\gamma K d \log(\tau/\varepsilon)/\log\log(\tau/\varepsilon)\big)$$
queries by (13), and using $O\big(2^j\gamma K d \log(\tau/\varepsilon)/\log\log(\tau/\varepsilon)[\log(n) + \log^{2.5}(\tau/\varepsilon)]\big)$ gates by (14). Let $\tilde{W}$ denote the concatenation of all these controlled Hamiltonian simulation unitaries. Adding up the costs, we see that our implementation of $\tilde{W}$ uses $O(\tau d \log(\tau/\varepsilon)/\log\log(\tau/\varepsilon))$ queries and has gate complexity $O\big(\tau d \log(\tau/\varepsilon)/\log\log(\tau/\varepsilon)\big[\log(n) + \log^{2.5}(\tau/\varepsilon)\big]\big)$. Using the triangle inequality repeatedly, it is easy to see that $\|W - \tilde{W}\| \le \sum_{j=0}^{J} 2^{j-J-1}\varepsilon \le \varepsilon$.
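The product structure of the controlled-simulation unitary $W$ can be verified numerically on small instances. The sketch below (our own illustration, using the two's-complement sign convention for the $(J+1)$-bit control register; dimensions are arbitrary) builds $W$ blockwise and compares it against the per-bit product of controlled $e^{\pm i 2^j \gamma H}$ factors:

```python
import numpy as np

def u_exp(H, t):
    """e^{itH} for Hermitian H via eigendecomposition."""
    lam, V = np.linalg.eigh(H)
    return (V * np.exp(1j * t * lam)) @ V.conj().T

rng = np.random.default_rng(0)
n, J = 2, 2
M = 2 ** J                        # control values m in [-M, M-1]
A = rng.standard_normal((n, n))
H = (A + A.T) / 2
gamma = 0.3

def signed(idx):                  # (J+1)-bit two's-complement value of the index
    return idx - 2 * M if idx >= M else idx

# Target: blocks of W = sum_m |m><m| (x) e^{i m gamma H}.
blocks = [u_exp(H, signed(idx) * gamma) for idx in range(2 * M)]

# Product form: one controlled e^{i w_j gamma H} per bit, with weight
# w_j = 2^j for ordinary bits and w_J = -M for the sign bit.
prod = [np.eye(n, dtype=complex) for _ in range(2 * M)]
for j in range(J + 1):
    weight = -M if j == J else 2 ** j
    U_j = u_exp(H, weight * gamma)
    for idx in range(2 * M):
        if (idx >> j) & 1:        # apply U_j only where control bit j is set
            prod[idx] = U_j @ prod[idx]

assert all(np.allclose(prod[idx], blocks[idx]) for idx in range(2 * M))
```

The factors commute because they are all exponentials of the same $H$, so the per-bit product reproduces $e^{im\gamma H}$ exactly for every signed $m$.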

B.1 Implementation of smooth functions of Hamiltonians: general results
The first lemma we prove provides the basis for our approach. It shows how to turn a polynomial approximation of a function f on the interval [−1, 1] into a nice Fourier series in an efficient way, while not increasing the weight of coefficients. This is useful, because we can implement a function given by a Fourier series using the LCU Lemma, but only after scaling it down with the weight of the coefficients.
Lemma 37. Let $\delta, \varepsilon \in (0,1)$ and let $f: \mathbb{R} \to \mathbb{C}$ be such that $\big|f(x) - \sum_{k=0}^{K} a_k x^k\big| \le \varepsilon/4$ for all $x \in [-1+\delta, 1-\delta]$. Then there exist an $M = O\big(\log(\|a\|_1/\varepsilon)/\delta\big)$ and a vector $c \in \mathbb{C}^{2M+1}$ with $\|c\|_1 \le \|a\|_1$ such that $\big|f(x) - \sum_{m=-M}^{M} c_m e^{\frac{i\pi m}{2}x}\big| \le \varepsilon$ for all $x \in [-1+\delta, 1-\delta]$. Moreover, $c$ can be computed in time $\mathrm{poly}(K, M, \log(1/\varepsilon))$.

Proof. First suppose $\|a\|_1 \le \varepsilon/2$. Then $|f(x)| \le |f(x) - \sum_{k} a_k x^k| + \|a\|_1 \le 3\varepsilon/4$ for all $x \in [-1+\delta, 1-\delta]$, so in this case the statement holds with $M = 0$ and $c = 0$, i.e., even with an empty sum.
From now on we assume $\|a\|_1 \ge \varepsilon/2$. We build up our approximation gradually. Our first approximation function $\tilde{f}_1(x) := \sum_{k=0}^{K} a_k x^k$ satisfies $\|f - \tilde{f}_1\|_\infty \le \varepsilon/4$ on $[-1+\delta, 1-\delta]$ by assumption. In order to construct a Fourier series, we work towards a linear combination of sines.

To that end, note that for all $x \in [-1, 1]$:
$$\tilde{f}_1(x) = \sum_{k=0}^{K} a_k \left(\frac{\arcsin(\sin(x\pi/2))}{\pi/2}\right)^k.$$
Let $b^{(k)}$ denote the series of coefficients such that
$$\left(\frac{\arcsin(y)}{\pi/2}\right)^k = \sum_{\ell=0}^{\infty} b^{(k)}_\ell\, y^{k+2\ell} \quad \text{for all } y \in [-1, 1].$$
For $k = 1$ the coefficients are just $\frac{2}{\pi}$ times the coefficients of the Taylor series of $\arcsin$, so we know that $b^{(1)} \ge 0$; moreover, $b^{(k)}$ is obtained by convolving $b^{(k-1)}$ with $b^{(1)}$, so one can recursively calculate each $b^{(k)}$. As $b^{(1)} \ge 0$, one can use this identity inductively to show that $b^{(k)} \ge 0$. Therefore
$$\big\|b^{(k)}\big\|_1 = \left(\frac{\arcsin(1)}{\pi/2}\right)^k = 1.$$
Using the above definitions and observations we can rewrite
$$\tilde{f}_1(x) = \sum_{k=0}^{K} a_k \sum_{\ell=0}^{\infty} b^{(k)}_\ell \sin^{k+2\ell}(x\pi/2). \tag{16}$$
To obtain the second approximation function, we truncate the summation over $\ell$ at $L := \big\lceil \ln\big(\frac{4\|a\|_1}{\varepsilon}\big)\frac{1}{\delta^2}\big\rceil$ in the above formula. We first estimate the tail of the sum, using that $\sin((1-\delta)\pi/2) \le 1 - \delta^2$ for all $\delta \in [0,1]$. For all $k \in \mathbb{N}$ and $x \in [-1+\delta, 1-\delta]$ the tail satisfies
$$\sum_{\ell=L+1}^{\infty} b^{(k)}_\ell \sin^{k+2\ell}(x\pi/2) \le \big(1-\delta^2\big)^{2L} \le e^{-2L\delta^2} \le \frac{\varepsilon}{4\|a\|_1}.$$
Thus, denoting by $\tilde{f}_2$ the truncated version of (16), we get $\|\tilde{f}_1 - \tilde{f}_2\|_\infty \le \varepsilon/4$ on $[-1+\delta, 1-\delta]$.

To obtain our third approximation function, we approximate the powers $\sin^j(z)$. First observe that
$$\sin^j(z) = \left(\frac{e^{iz} - e^{-iz}}{2i}\right)^j = \frac{i^{-j}}{2^j}\sum_{m=0}^{j} (-1)^m \binom{j}{m}\, e^{iz(j-2m)},$$
which, as we will show, is very well approximated by the partial sum containing only the terms with $|j - 2m| \le M$. Truncating the summation in (16) based on this approximation reduces the maximal time evolution parameter (i.e., the maximal value of the parameter $t$ in the $e^{izt}$ terms) quadratically.
To make this approximation precise, we use Chernoff's inequality for the binomial distribution [AS08, A.1.7], or more precisely its corollary for sums of binomial coefficients, stating that $\sum_{m > (j+a)/2}\binom{j}{m} \le 2^j e^{-a^2/(2j)}$ for $a > 0$. Let $M := \big\lceil \ln\big(\frac{4\|a\|_1}{\varepsilon}\big)\frac{1}{\delta}\big\rceil$ and suppose $\ell \le L$; then this bound implies that
$$\frac{1}{2^{k+2\ell}}\sum_{m:\,|k+2\ell-2m| > M}\binom{k+2\ell}{m} \le \frac{\varepsilon}{4\|a\|_1}, \tag{17}$$
where for the last inequality we use the assumption $\varepsilon \le 2\|a\|_1$. By combining (16) and (17) we get that for all $\ell \le L$ the truncated expansion of $\sin^{k+2\ell}(z)$, keeping only the terms $e^{iz(k+2\ell-2m)}$ with $|k+2\ell-2m| \le M$, is $\frac{\varepsilon}{4\|a\|_1}$-close to $\sin^{k+2\ell}(z)$. Substituting $z = x\pi/2$ into this bound, we can conclude that the resulting function $\tilde{f}_3$, written as a Fourier series
$$\tilde{f}_3(x) = \sum_{m=-M}^{M} c_m e^{i\frac{\pi}{2}mx}, \tag{18}$$
is an $\varepsilon$-approximation to $f$ on $[-1+\delta, 1-\delta]$:
$$\|f - \tilde{f}_3\|_\infty \le \|f - \tilde{f}_1\|_\infty + \|\tilde{f}_1 - \tilde{f}_2\|_\infty + \|\tilde{f}_2 - \tilde{f}_3\|_\infty \le \frac{\varepsilon}{4} + \frac{\varepsilon}{4} + \frac{\varepsilon}{4} \le \varepsilon.$$
Observe that every frequency occurring in (18) is at most $M$ in absolute value. Now fix a value of $k$ in the expansion leading to (18), and observe that after taking the absolute value of each term, the remaining summations over $\ell$ and $m$ still yield a value $\le 1$, since $\|b^{(k)}\|_1 = 1$ and $2^{-j}\sum_{m=0}^{j}\binom{j}{m} = 1$. It follows that $\|c\|_1 \le \|a\|_1$. From the construction in the proof, it is easy to see that (an $\varepsilon$-approximation of) $c$ can be calculated in time $\mathrm{poly}(K, M, \log(1/\varepsilon))$.

Now we present the Linear Combination of Unitaries (LCU) Lemma [CW12, BCC+15, BCK15], which we will use for combining the Fourier terms in our quantum circuit. Since we intend to use LCU for implementing non-unitary operations, we describe a version without the final amplitude amplification step. We provide a short proof for completeness.
The next result summarizes how to efficiently implement a Fourier series of a Hamiltonian.

Proof. This is a direct corollary of Lemma 38. To work out the details, note that we can always extend $c$ with some 0 values, so we may assume without loss of generality that $M$ is a power of 2. This is useful, because then we can represent each $m \in [-M, M-1]$ as a $(J+1)$-bit signed integer for $J = \log(M)$. The implementation of the operator $A$ in Lemma 38 does not need any queries, and it can be constructed exactly using $O(M(\log(M) + 1))$ two-qubit gates, e.g., by the techniques of [GR02]. We sketch the basic idea, which is based on induction: for $J = 1$ the operator $A$ is just a two-qubit unitary; supposing we proved the claim for bitstrings of length $J$, for a given $a \in \mathbb{R}^{2^{J+1}}_+$ with $\|a\|_1 = 1$ one defines $\tilde{a} \in \mathbb{R}^{2^J}_+$ by summing, for each length-$J$ bitstring, the weights of its two one-bit extensions, prepares $\tilde{a}$ by the induction hypothesis, and then rotates one further qubit controlled on the first $J$. This yields an $\varepsilon$-precise implementation. Now we can state the main result of this appendix, which tells us how to efficiently turn a function of a Hamiltonian $H$ (provided with its Taylor series) into a quantum circuit, by using controlled Hamiltonian simulation.
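The LCU computation underlying Lemma 38 can be simulated with dense matrices: prepare a state with amplitudes $\sqrt{c_m/\|c\|_1}$, apply a select operator, and unprepare; the top-left block of the resulting unitary is $\sum_m c_m U_m/\|c\|_1$. A hedged sketch (the state-preparation via a Householder reflection is our own choice for illustration, and we restrict to non-negative real coefficients):

```python
import numpy as np

def householder_prep(col):
    """Real orthogonal A with A e_0 = col (state preparation), ||col|| = 1."""
    K = len(col)
    e0 = np.zeros(K); e0[0] = 1.0
    if np.allclose(col, e0):
        return np.eye(K)
    v = e0 - col
    v /= np.linalg.norm(v)
    return np.eye(K) - 2.0 * np.outer(v, v)

def lcu_block(coeffs, unitaries):
    """Top-left n x n block of (A^T (x) I) SELECT (A (x) I), which equals
    sum_m c_m U_m / ||c||_1 (LCU without amplitude amplification)."""
    c = np.asarray(coeffs, dtype=float)          # assumed c_m >= 0
    n = unitaries[0].shape[0]
    A = householder_prep(np.sqrt(c / c.sum()))
    select = np.zeros((len(c) * n, len(c) * n), dtype=complex)
    for m, U in enumerate(unitaries):            # SELECT = sum_m |m><m| (x) U_m
        select[m * n:(m + 1) * n, m * n:(m + 1) * n] = U
    W = np.kron(A.T, np.eye(n)) @ select @ np.kron(A, np.eye(n))
    return W[:n, :n]

# Example: (3*I + X)/4 as the postselected block of a 2-term LCU.
X = np.array([[0.0, 1.0], [1.0, 0.0]])
assert np.allclose(lcu_block([3.0, 1.0], [np.eye(2), X]),
                   (3 * np.eye(2) + X) / 4)
```

The $1/\|c\|_1$ factor is exactly the subnormalization that the final amplitude amplification step (omitted here, as in the lemma) would remove.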
In the following theorem we assume that the eigenvalues of $H$ lie in a radius-$r$ ball around $x_0$. The main idea is that if even $r + \delta$ is less than the radius of convergence of the Taylor series, then we can obtain an $\varepsilon$-approximation of $f$ by truncating the series at a logarithmically high power. $B$ will be an upper bound on the absolute value of the function within the radius-$(r+\delta)$ ball around $x_0$; in particular $\|f(H)/B\| \le 1$, so we can implement $f(H)/B$ as a block of some larger unitary. It turns out that, apart from the norm and sparsity of $H$ and the precision parameters, the complexity depends on the ratio of $\delta$ and $r$.
Theorem 40 (Implementing a smooth function of a Hamiltonian). Let $x_0 \in \mathbb{R}$ and $r > 0$ be such that the spectrum of $H$ lies in $[x_0 - r, x_0 + r]$.

Proof. The basic idea is to combine Lemma 37 and Lemma 39, applying them to a transformed version of the function. First we define $\delta' := \delta/(r+\delta)$, which is at most $1/2$ by assumption. Then, for all $\ell \in \mathbb{N}$, let $b_\ell := a_\ell(r+\delta)^\ell$ and define the function $g(y) := \sum_{\ell=0}^{\infty} b_\ell y^\ell$, so that $f(x) = g\big(\frac{x-x_0}{r+\delta}\big)$, with $\frac{x-x_0}{r+\delta} \in [-1+\delta', 1-\delta']$ for $x$ in the spectrum of $H$. Now we set $L$ large enough that truncating $g$ to its first $L$ terms is sufficiently precise. We would now like to obtain a Fourier approximation of $g$ for all $y \in [-1+\delta', 1-\delta']$, with precision $\varepsilon' = \varepsilon B/2$. Let $b' := (b_0, b_1, \ldots, b_{L-1})$ and observe that $\|b'\|_1 \le \|b\|_1 \le B$. We apply Lemma 37 to the function $g$, using the polynomial approximation corresponding to the truncation to the first $L$ terms, i.e., using the coefficients in $b'$. Transforming the resulting Fourier series back to the variable $z = x - x_0$, we get a Fourier series in $z$ while preserving $\|c'\|_1 = \|c\|_1 \le B$. In the trivial case, when $c = 0$, we choose a unitary $\tilde{U}$ that maps the ancilla state $|0\rangle$ to $|1\rangle$; then clearly $(\langle 0| \otimes I)\tilde{U}(|0\rangle \otimes I) = 0 = \tilde{f}(H)$. Clearly such a $\tilde{U}$ can be implemented using $O(1)$ gates and 0 queries. Otherwise we can apply Lemma 39 to this modified Fourier series to construct a unitary circuit $\tilde{V}$ implementing an $\frac{\varepsilon}{2}$-approximation of $\tilde{f}(H)/\|c\|_1$. We can further scale down the amplitude of the $|0\rangle$-part of the output by a factor of $\|c\|_1/B \le 1$, to obtain an approximation of $\tilde{f}(H)/B$, as follows. We simply add an additional ancilla qubit initialized to $|0\rangle$, on which we act with the one-qubit unitary
$$\begin{pmatrix} \frac{\|c\|_1}{B} & -\sqrt{1-\frac{\|c\|_1^2}{B^2}} \\ \sqrt{1-\frac{\|c\|_1^2}{B^2}} & \frac{\|c\|_1}{B} \end{pmatrix}.$$
Corollary 42. Suppose $(x_\ell)_{\ell\in[L]} \in \mathbb{R}^L$ and $r, \theta \in \mathbb{R}_+$ are such that the spectrum of $H$ lies in $\bigcup_{\ell\in[L]} [x_\ell - r, x_\ell + r]$, and $B > 0$. If $\|H\| \le K$ and $\varepsilon \in \big(0, \frac{1}{2}\big)$, then we can implement a unitary $\tilde{U}$ such that $\big\|(\langle 0|\otimes I)\tilde{U}(|0\rangle\otimes I) - \tilde{f}(H)/B\big\| \le \varepsilon$, using $O\big(Lr/\delta \log(r/(\delta\varepsilon))\log(1/\varepsilon) + \log(K/\theta)\log\log(K/(\theta\varepsilon))\big)$ gates, with $O(\log(1/\varepsilon))$ uses of an $(O(1/\theta), \pi/K, \Omega(\varepsilon^2/\log(1/\varepsilon)))$-simulation of $H$, and a single use of a circuit for controlled $(O(r\log(1/\varepsilon)/\delta), O(1/r), \varepsilon/2)$-simulation of $H$.

Sketch of the proof. We start by performing phase estimation on $e^{iH}$ with resolution $\approx \theta$ in the phase. We boost the success probability by taking the median outcome of $O(\log(1/\varepsilon))$ parallel repetitions, so that we get a worse-than-$\theta$ estimate with probability at most $O(\varepsilon^2)$. This way the boosted phase estimation circuit is $O(\varepsilon)$-close in operator norm to an "idealized" phase estimation unitary that never makes an approximation error greater than $\theta$ (for more details on this type of argument, see the proof of Lemma 50). Phase estimation uses controlled $(O(K/\theta), \pi/K, \Omega(\varepsilon^2/\log(1/\varepsilon)))$-simulation of $H$ and a Fourier transform on $O(\log(K/\theta))$-bit numbers, which can be implemented using $O(\log(K/\theta)\log\log(K/(\theta\varepsilon)))$ gates. The probability-boosting uses $O(\log(1/\varepsilon))$ repetitions. Controlled on the phase estimate $\tilde{\lambda}$ that we obtained, we implement a $1/B$-scaled version of the corresponding function "patch" $f(x)|_{[x_\ell - r, x_\ell + r]}$, centered around $\arg\min_{x_\ell} |x_\ell - \tilde{\lambda}|$, using Theorem 40 and Remark 41. The additional gate complexities of the "patches" add up to $O(Lr/\delta \log(r/(\delta\varepsilon))\log(1/\varepsilon))$, but since each "patch" uses the same controlled $(O(r\log(1/\varepsilon)/\delta), O(1/r), \varepsilon/2)$-simulation of $H$, we only need to implement that once. Finally we uncompute the phase estimation. For the final complexity, note that we can assume without loss of generality that $L(r - 2\theta) = O(K)$, since otherwise we can just remove some redundant intervals from the domain. Hence $Lr = O(K)$, and Lemma 36 implies the stated complexities.
This corollary is essentially as general and efficient as we can hope for. Let $D$ denote the domain of possible eigenvalues of $H$. If we want to implement a reasonably smooth function $f$, then it probably satisfies the following: there is some $r = \Omega(1)$ such that for each $x \in D$ the Taylor series in the radius-$r$ neighborhood of $x$ converges quickly; more precisely, the Taylor coefficients $a^{(x)}_k$ of the $x$-centered series satisfy $\sum_{k=0}^{\infty} |a^{(x)}_k|\, r^k = O(\|f\|_\infty)$, where we define $\|f\|_\infty := \sup_{x\in D} |f(x)|$. If this is the case, covering $D$ with radius-$O(r)$ intervals and choosing $\theta = \Theta(r)$ and $\delta = \Theta(r)$, Corollary 42 provides an implementation of $f(H)/B$ with $O(\|H\|d)$ query and gate complexity, where $B = O(\|f\|_\infty)$. The value of $B$ is optimal up to constant factors, since $f(H)/B$ must have norm at most 1. Also the $\|H\|d$ factor in the complexity is very reasonable, and we achieve the logarithmic error dependence which is the primary motivation of the related techniques. An application along the lines of this discussion can be found in Lemma 46.
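The "patch" idea can be illustrated classically: estimate each eigenvalue up to precision $\theta$, pick the nearest expansion center, and evaluate a local truncated Taylor series there. The following sketch (with $f(x) = e^{-x}$ and arbitrary parameters of our choosing; the noisy eigenvalue estimate stands in for phase estimation) checks that the resulting matrix function is close in operator norm:

```python
import numpy as np
from math import factorial

def patch_value(lam, center, K=12):
    """Degree-K Taylor series of exp(-x) around `center`, evaluated at lam."""
    return sum((-1) ** k * np.exp(-center) * (lam - center) ** k / factorial(k)
               for k in range(K + 1))

rng = np.random.default_rng(0)
n = 24
A = rng.standard_normal((n, n))
H = (A + A.T) / 2
H = H / np.linalg.norm(H, 2) * 2.0                     # normalize so ||H|| = 2
lam, V = np.linalg.eigh(H)

r, theta = 0.5, 0.05
centers = np.arange(lam.min() - r, lam.max() + r, r)   # cover the spectrum
lam_est = lam + rng.uniform(-theta, theta, size=n)     # stand-in for phase estimation
chosen = centers[np.abs(centers[None, :] - lam_est[:, None]).argmin(axis=1)]

# Apply the local Taylor "patch" of the nearest center to each eigenvalue.
f_tilde = (V * np.array([patch_value(l, c) for l, c in zip(lam, chosen)])) @ V.T
f_exact = (V * np.exp(-lam)) @ V.T
assert np.linalg.norm(f_tilde - f_exact, 2) < 1e-6
```

Each eigenvalue is within about $r/2 + 2\theta$ of its chosen center, so a low-degree local expansion already gives high accuracy, even when no single Taylor series would converge well over the whole spectrum.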
Also note that in the above corollary we added up the gate complexities of the different "patches." Since these gates prepare the Fourier coefficients of the function corresponding to the different Taylor series at different points, one could use this structure to implement all coefficients with a single circuit. This can potentially result in much smaller circuit sizes, which could be beneficial when the input model allows more efficient Hamiltonian simulation (which then would no longer be the bottleneck in the complexity).

B.2 Applications of smooth functions of Hamiltonians
In this subsection we use the input model for the d-sparse matrix H as described at the start of Section 2. We calculate the implementation cost in terms of queries to the input oracles (4)-(5), but it is easy to convert the results to more general statements as in the previous subsection.
The following theorem shows how to efficiently implement the function $e^{-H}$ for some $H\succeq I$. We use this result in the proof of Lemma 13 to estimate expectation values of the quantum state $\rho=e^{-H}/\mathrm{Tr}\,e^{-H}$ (for the application we ensure that $H\succeq I$ by adding some multiple of $I$).

Proof. In order to use Theorem 40 we set $x_0:=K+1/2$ so that $\|H-x_0 I\|\le K=:r$, and use the function $f(x):=e^{-x}$.

To conclude this appendix, we now sketch the proofs of a few interesting consequences of the earlier results in this appendix. These will, however, not be used in the body of the paper.
First, we show how to use the above subroutine together with amplitude amplification to prepare a Gibbs state with cost depending logarithmically on the precision parameter, as shown by the following lemma. To our knowledge this is the first Gibbs sampler that achieves logarithmic dependence on the precision parameter without assuming access to the entries of $\sqrt{H}$ as in [CS17]. This can mean a significant reduction in complexity; for more details see the introduction of Section 2.

Sketch of proof. First we show how to prepare a purified sub-normalized Gibbs state. Then we use the exponential search algorithm of Boyer et al. [BBHT98] (with exponentially decreasing guesses for the norm $a$ of the sub-normalized Gibbs state, and hence an exponentially increasing number of amplitude-amplification steps) to postselect on this sub-normalized state, in a similar fashion as in Algorithm 3. There is a possible caveat here: if we postselect on a state with norm $a$, then it gets rescaled by $1/a$, and its preparation error is rescaled by $1/a$ as well. Therefore, during the rounds of the search algorithm we always increase the precision of the implementation to compensate for the increased (error) amplification. Since the success probability of postselection in a round is upper bounded by the square of $O(a\cdot\#\{\text{amplification steps in the round}\})$, the probability for the postselection to succeed in any of the early rounds is small. Preparing a purified sub-normalized Gibbs state and combining it with this search procedure, we obtain a Gibbs state within the claimed expected runtime. If we also know a lower bound $z\le\mathrm{Tr}\,e^{-H}$, then we have an upper bound on the expected runtime, so we can turn the procedure into a unitary circuit using standard techniques.
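The control loop of the exponential search of Boyer et al. can be sketched classically. In the sketch below, the growth factor `lam` and the cap at $1/a$ are modeling assumptions in the spirit of [BBHT98], and the success probability $\sin^2((2m+1)\theta)$ with $\sin\theta=a$ stands in for measuring after $m$ amplitude-amplification steps:

```python
import math, random

def exponential_search(a, lam=1.2, rng=random):
    """Classical sketch of the exponential search loop of [BBHT98]:
    round j applies m ~ lam^j amplification steps; measuring then
    succeeds with probability sin^2((2m+1)*theta), where sin(theta) = a.
    Returns the total number of amplification steps used."""
    theta = math.asin(a)
    total, m_max = 0, 1.0
    while True:
        m = rng.randint(1, int(m_max))        # random step count, as in [BBHT98]
        total += m
        if rng.random() < math.sin((2 * m + 1) * theta) ** 2:
            return total
        m_max = min(lam * m_max, 1 / a)       # grow the range, capped at 1/a

random.seed(7)
a = 0.01
avg = sum(exponential_search(a) for _ in range(200)) / 200
print(avg)  # expected number of steps scales like O(1/a)
```

Since the early rounds use few amplification steps, their success probability is of order $(a\cdot m)^2$, matching the remark above that postselection rarely succeeds in the early rounds.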
We can also recover the Gibbs sampler of Chowdhury and Somma [CS17]: if we apply our Corollary 42 to the function $e^{-x^2}$, assuming access to $\sqrt{H}$ for some psd matrix $H$, then we get a Gibbs sampler for the state $e^{-H}/\mathrm{Tr}(e^{-H})$, similar to [CS17]. The advantage of the presented approach is that it avoids the usage of involved integral transformations and can be presented without writing down a single integral sign; moreover, due to our general results the proof is significantly shorter. Before we prove the precise statement in Lemma 46, we need some preparation for the application of Corollary 42.

Proof. We prove both claims by induction. We have $\partial_x^0 e^{-x^2}=e^{-x^2}$, $\partial_x^1 e^{-x^2}=-2xe^{-x^2}$ and $\partial_x^2 e^{-x^2}=(4x^2-2)e^{-x^2}$, so (21) holds for $k=1$. Supposing (21) holds for $k$, the inductive step follows by differentiating once more. Similarly, observe that (22) holds for $k=0$ and $k=1$; supposing (22) holds for $k$, the induction step follows in the same way. Finally, using the previous two statements we can prove (23).

The former has query complexity $O(d\kappa^2\log^{2.5}(\kappa/\varepsilon))$; the latter makes $O(\kappa\log(d\kappa/\varepsilon))$ uses of $U$ and has query complexity $O(d\kappa^2\log^2(d\kappa/\varepsilon))$. Now we provide a sketch of how to solve the HHL problem with $O(\kappa)$ uses of $U$ and with query complexity $O(d\kappa^2\log^2(\kappa/\varepsilon))$ using our techniques. The improvement over both previously mentioned results is not very large, but our proof is significantly shorter thanks to our general Theorem 40 and Corollary 42.
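The inductive computation of the derivatives of $e^{-x^2}$ can be checked mechanically: by Rodrigues' formula they equal $(-1)^k H_k(x)e^{-x^2}$ for the physicists' Hermite polynomials $H_k$. A stdlib-only sketch verifying the two recurrences against each other (polynomials are coefficient lists, low degree first):

```python
def poly_diff(p):
    """Derivative of a polynomial given as a coefficient list (p[i] ~ x^i)."""
    return [i * c for i, c in enumerate(p)][1:] or [0]

def poly_sub(p, q):
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) - (q[i] if i < len(q) else 0)
            for i in range(n)]

def poly_eq(p, q):
    return all(c == 0 for c in poly_sub(p, q))

# d^k/dx^k e^{-x^2} = P_k(x) e^{-x^2}, where P_{k+1} = P_k' - 2x P_k.
P = [[1]]
for k in range(10):
    P.append(poly_sub(poly_diff(P[k]), [0] + [2 * c for c in P[k]]))

# Physicists' Hermite polynomials: H_0 = 1, H_1 = 2x,
# H_{k+1} = 2x H_k - 2k H_{k-1}; Rodrigues' formula gives P_k = (-1)^k H_k.
H = [[1], [0, 2]]
for k in range(1, 10):
    H.append(poly_sub([0] + [2 * c for c in H[k]], [2 * k * c for c in H[k - 1]]))

assert P[1] == [0, -2] and P[2] == [-2, 0, 4]  # -2x e^{-x^2}, (4x^2-2) e^{-x^2}
assert all(poly_eq(P[k], [(-1) ** k * c for c in H[k]]) for k in range(10))
print("P_k = (-1)^k H_k verified for k < 10")
```

The base cases printed by the assertions are exactly the first and second derivatives computed in the proof above.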
To solve the HHL problem we need to implement the function $H^{-1}$, i.e., apply the function $f(x)=1/x$ to $H$. Due to the constraints on $H$, the eigenvalues of $H$ lie in the union of the intervals $[-1,-1/\kappa]$ and $[1/\kappa,1]$. We first assume that the eigenvalues actually lie in $[1/\kappa,1]$. In this case we can easily implement the function $1/x$ by Theorem 40 using the Taylor series around 1: $\frac{1}{x}=\sum_{k=0}^{\infty}(1-x)^k$, which converges for all $x\in(0,2)$. Setting $\varepsilon':=c\varepsilon/\kappa$ for an appropriate constant $c$ and using amplitude amplification, we can prepare an $\varepsilon$-approximation of the state $H^{-1}|b\rangle/\|H^{-1}|b\rangle\|$ as required by HHL, using $O(\kappa)$ amplitude-amplification steps. Therefore we use $U$ at most $O(\kappa)$ times and make $O(d\kappa^2\log^2(\kappa/\varepsilon))$ queries.
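The Taylor-series step can be illustrated numerically (a classical sketch, not the quantum algorithm): truncating the series $\sum_k (I-H)^k$ after $K=O(\kappa\log(1/\varepsilon))$ terms already approximates $H^{-1}$ well when the spectrum lies in $[1/\kappa,1]$, since the tail is bounded by $\kappa(1-1/\kappa)^{K+1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
kappa, eps, n = 10.0, 1e-6, 30

# Random Hermitian H with spectrum inside [1/kappa, 1].
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = rng.uniform(1 / kappa, 1.0, size=n)
H = (Q * eigs) @ Q.T

# Truncated Taylor (Neumann) series: 1/x = sum_{k>=0} (1-x)^k, so
# H^{-1} ~ sum_{k=0}^{K} (I - H)^k with K = O(kappa log(1/eps)).
K = int(kappa * np.log(1 / eps)) + 1
approx, term = np.zeros((n, n)), np.eye(n)
for _ in range(K + 1):
    approx += term
    term = term @ (np.eye(n) - H)

err = np.linalg.norm(approx - np.linalg.inv(H), 2)
print(err)  # tail bound kappa * (1 - 1/kappa)^(K+1), i.e. about eps * kappa
```

This degree-$O(\kappa\log(1/\varepsilon))$ truncation is what keeps the logarithmic error dependence in the query complexity above.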

C Generalized minimum-finding algorithm
In this appendix we describe our generalized quantum minimum-finding algorithm, which we are going to apply to finding an approximation of the ground state energy of a Hamiltonian. This algorithm generalizes the results of Dürr and Høyer [DH96] in a manner similar to the way amplitude amplification [BHMT02] generalizes Grover search: we do not need to assume the ability to query individual elements of the search space, we just need to be able to generate a superposition over the search space. The algorithm also has the benefit over binary search that it removes a logarithmic factor from the complexity.
The backbone of our analysis will be the meta-algorithm below from [DH96]. The meta-algorithm finds the minimal element in the range of the random variable X by sampling, where by "range" we mean the values which occur with non-zero probability. We assume X has finite range.
Input A discrete random variable X with finite range.
Output The minimal value x_min in the range of X.
Init t ← 0; s_0 ← ∞
Repeat:
1. t ← t + 1
2. Sample a value s_t according to the conditional distribution Pr(X = s_t | X < s_{t−1}).

Meta-Algorithm 2: Minimum-finding
Note that the above algorithm will always find the minimum, since the obtained samples are strictly decreasing.
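Meta-Algorithm 2 is easy to simulate classically. The following Monte-Carlo sketch (with an arbitrary example distribution) also lets one check the membership probabilities Pr(X = x_k)/Pr(X ≤ x_k) empirically:

```python
import random
from collections import Counter

def meta_algorithm_2(values, probs, rng):
    """Sample s_1 > s_2 > ... by repeatedly drawing from X conditioned on
    being below the previous sample; returns the set of visited values."""
    s, visited = float('inf'), []
    while True:
        support = [(v, p) for v, p in zip(values, probs) if v < s]
        if not support:
            return visited          # the last visited value is the minimum
        r = rng.random() * sum(p for _, p in support)
        for v, p in support:
            r -= p
            if r <= 0:
                break
        visited.append(v)
        s = v

rng = random.Random(42)
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]
hits, runs = Counter(), 20000
for _ in range(runs):
    hits.update(meta_algorithm_2(values, probs, rng))

# Empirically, Pr(x_k in S) matches Pr(X = x_k) / Pr(X <= x_k).
for k, (v, p) in enumerate(zip(values, probs)):
    print(v, hits[v] / runs, round(p / sum(probs[: k + 1]), 3))
```

For instance the largest value 4 is visited exactly when it is the first sample, with probability 0.4, while the minimum 1 is visited in every run.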
Lemma 47. Let $X$ be a finite discrete random variable whose range of values is $x_1<x_2<\ldots<x_N$. Let $S(X)=\{s_1,s_2,\ldots\}$ denote the random set of values obtained via sampling during a run of Meta-Algorithm 2 with input random variable $X$. If $k\in[N]$, then
$$\Pr(x_k\in S(X))=\frac{\Pr(X=x_k)}{\Pr(X\le x_k)}.$$
Proof. The intuition of the proof is to show that, whenever $\Pr(s_{t-1}>x_k)>0$, we have
$$\Pr(s_t=x_k \mid t\in[N] \text{ is the first time such that } s_t\le x_k)=\frac{\Pr(X=x_k)}{\Pr(X\le x_k)}. \quad (25)$$
To formulate the statement more precisely, we consider a fixed value $t\in[N]$. For notational convenience let $x:=x_k$ and $x_{N+1}:=\infty$; we then prove (25) by a direct calculation. This is enough to conclude the proof, since there is always a smallest $t\in[N]$ such that $s_t\le x$, as the algorithm always finds the minimum in at most $N$ steps. So we can finish the proof by $\Pr(x\in S(X))=\sum_{t=1}^{N}\Pr(s_t=x)$.

Now we describe our generalized minimum-finding algorithm, which is based on Meta-Algorithm 2. We take some unitary $U$, and replace $X$ by the distribution obtained if we measured the second register of $U|0\rangle$. We implement conditional sampling via amplitude amplification and the use of the exponential search algorithm of Boyer et al. [BBHT98]. If a unitary prepares the state $|0\rangle|\phi\rangle+|1\rangle|\psi\rangle$, where $\|\phi\|^2+\|\psi\|^2=1$, then this exponential search algorithm built on top of amplitude amplification prepares the state $|1\rangle|\psi\rangle/\|\psi\|$ probabilistically, using an expected number of $O(1/\|\psi\|)$ applications of $U$ and $U^{-1}$ (we will skip the details here, which are straightforward modifications of [BBHT98]).
Input A number $M$ and a unitary $U$, acting on $q$ qubits, such that $U|0\rangle=\sum_{k=1}^{N}|\psi_k\rangle|x_k\rangle$, where $x_k$ is a binary string representing some number and $|\psi_k\rangle$ is an unnormalized quantum state on the first register. Let $x_1<x_2<\ldots<x_N$ and define $X$ to be the random variable with $\Pr(X=x_k)=\|\psi_k\|^2$.
Output Some $|\psi_k\rangle|x_k\rangle$ for a (hopefully) small $k$.
Init t ← 0; s_0 ← ∞
While the total number of applications of U and U^{−1} does not exceed M:
1. t ← t + 1
2. Use the exponential search algorithm with amplitude amplification on states such that x_k < s_{t−1} to obtain a sample |ψ_k⟩|x_k⟩.
3. s_t ← x_k
Algorithm 3: Generalized minimum-finding

Lemma 48. There exists $C\in\mathbb{R}_+$ such that, if we run Algorithm 3 indefinitely (setting $M=\infty$), then for every $U$ and $x_k$ the expected number of uses of $U$ and $U^{-1}$ before obtaining a sample $x\le x_k$ is at most $C/\sqrt{\Pr(X\le x_k)}$.
Proof. Let $X_{<x'}$ denote the random variable for which $\Pr(X_{<x'}=x)=\Pr(X=x\mid X<x')$.
The expected number of uses of $U^{\pm1}$ in Algorithm 3 before obtaining a value $x\le x_k$ can be written as a sum, over the relevant values $x'$ in the range of $X$, of $\Pr(x'\in S(X))\cdot\mathbb{E}[\#\text{uses of }U^{\pm1}\text{ for sampling }X_{<x'}]$, where the expectation is bounded using Equation (30). The basic idea is that we treat the expression on the right-hand side of (26) as a Riemann-type approximation of the integral $\int_{p_0}^{1}z^{-3/2}\,\mathrm{d}z$, and show that it is in fact always at most the value of this integral. We proceed by showing that subdivision always increases the sum.
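The comparison mechanism here is elementary: for the decreasing integrand $z^{-3/2}$, a right-endpoint-style sum never exceeds the integral, and refining the partition can only increase it (the particular sum in (26) is not reproduced here; the sketch below illustrates the principle under these assumptions):

```python
import random

def right_sum(partition):
    """Right-endpoint Riemann sum of z^{-3/2} over a partition of [p0, 1].
    Since z^{-3/2} is decreasing, this never exceeds the integral,
    and refining the partition can only increase the sum."""
    return sum((b - a) * b ** -1.5 for a, b in zip(partition, partition[1:]))

rng = random.Random(0)
p0 = 0.01
partition = sorted([p0, 1.0] + [rng.uniform(p0, 1.0) for _ in range(8)])
integral = 2 * (p0 ** -0.5 - 1)      # exact value of the integral from p0 to 1

s = right_sum(partition)
# Halve the first subinterval: the sum can only grow, but stays below the integral.
refined = sorted(partition + [(partition[0] + partition[1]) / 2])
assert s <= right_sum(refined) <= integral
print(s, right_sum(refined), integral)
```

The monotonicity under subdivision is immediate: splitting $[a,b]$ at $m$ replaces $(b-a)f(b)$ by $(m-a)f(m)+(b-m)f(b)\ge(b-a)f(b)$, since $f$ is decreasing.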
Let us fix some parameter $\delta>0$. Recursively applying this halving procedure for different indices, we can find some $J\in\mathbb{N}$ and $\tilde p\in\mathbb{R}^{J+1}$ with the required properties.

An estimate $\tilde e$ with $|\tilde e-e_j|\le 3$ would satisfy $|E-E_j|=|\tilde e-e_j|\cdot 2K/T\le 3\varepsilon/4<\varepsilon$. If we repeat phase estimation $O(\log(n))$ times (on the same state $|\phi_j\rangle$) and take the median of the estimates (in superposition), then we obtain a more concentrated superposition of estimates $\tilde e$ such that a measurement would reveal an estimate with $|\tilde e-e_j|\le 3$ with probability at least $1-b/n$, for some $b=\Theta(1)$. Since in our maximally entangled state $|\phi_j\rangle$ is entangled with $|\phi_j^*\rangle$ on the second register, applying phase estimation to the first register in superposition does not cause interference. Let us denote by $U$ the unitary corresponding to the above preparation-estimation-boosting procedure. Define $\Pi$ to be the projector onto the subspace of estimation values $|\tilde e\rangle$ such that there is a $j\in[n]$ with $|\tilde e-e_j|\le\varepsilon$. By the non-interference argument we can see that, after applying $U$, the probability that we get an estimate $\tilde e$ such that $|\tilde e-e_j|>3$ is at most $b/n$ for all $j\in[n]$; moreover $\|(I-\Pi)U|0\rangle\|^2\le b/n$. Also let $\Pi_1$ denote the projector onto phase estimates $|\tilde e\rangle$ such that $|\tilde e-e_1|\le 3$. It is easy to see that $\|\Pi_1 U|0\rangle\|^2\ge 1/n-b/n^2$. Now let us replace $V$ by $\tilde V$ implemented via Lemma 36, such that $\|V-\tilde V\|\le c'/(n\log(n))$ for some $c'=\Theta(1)$. Let $\tilde U$ denote the circuit that we obtain from $U$ by replacing $V$ with $\tilde V$. Since in the repeated phase-estimation procedure we use $V$ in total $O(\log(n))$ times, by the triangle inequality we get $\|U-\tilde U\|\le c/(2n)$, where $c=\Theta(1)$. We use the well-known fact that if two unitaries are $\delta$-close in operator norm, then applying them to the same quantum state yields measurement statistics that are $2\delta$-close. Therefore we can upper bound the difference in the probability of getting outcome $(I-\Pi)$; it follows that we obtain an estimate $\tilde e$ such that $\tilde e\le e_1+3$ with probability at least $3/4$.
But since $\Pi|\psi\rangle=|\psi\rangle$, we find that any estimate that we might obtain satisfies $\tilde e\ge e_1-3$. So an estimate $\tilde e\le e_1+3$ always satisfies $|\tilde e-e_1|\le 3$. The problem is that we only have access to $\tilde U$ as a quantum circuit. Let $C_{MF}(\tilde U)$ denote the circuit that we get from Theorem 49 when using it with $\tilde U$, and define $C_{MF}(U)$ similarly for $U$. Since we use $\tilde U$ a total of $O(\sqrt{n})$ times in $C_{MF}(\tilde U)$, the measurement statistics of the two circuits differ by at most $O(\sqrt{b}+c)$. Choosing $b,c$ to be small enough constants ensures that $C_{MF}(\tilde U)$ outputs a proper estimate $\tilde e$ with $|\tilde e-e_1|\le 3$ with probability at least $2/3$. As we have shown at the beginning of the proof, such an $\tilde e$ yields an $\varepsilon$-approximation of $E_1$ via $E:=\tilde e\cdot 2K/T$.
The query complexity has an $O(Td\log(Tn))=O(Kd/\varepsilon\log(Kn/\varepsilon))$ factor coming from the implementation of $\tilde V$ by Lemma 36. This gets multiplied by $O(\log(n))$ due to the boosting of phase estimation, and by $O(\sqrt{n})$ due to the minimum-finding algorithm. The gate complexity is dominated by the cost $O(Kd/\varepsilon\log^{7/2}(Kn/\varepsilon))$ of implementing $\tilde V$, multiplied by the same $O(\sqrt{n}\log(n))$ factor as for the query complexity.
Note that the minimum-finding algorithm of Theorem 49 can also be used for state preparation. If we choose $2\varepsilon$ to be less than the energy gap of the Hamiltonian, then upon finding an approximation of the ground-state energy we also prepare an approximate ground state. The precision of this state preparation can be improved at logarithmic cost, as can be seen from the proof of Lemma 50.

D Sparse matrix summation
As seen in Section 2, the Arora-Kale algorithm requires an approximation of $\exp(-\eta H^{(t)})$, where $H^{(t)}$ is a sum of matrices. To keep this section general we simplify the notation. Let $H$ be the sum of $k$ different $d$-sparse matrices $M_1,\ldots,M_k$: $H:=\sum_{i=1}^{k}M_i$. In this section we study the complexity of one oracle call to $H$, given access to respective oracles for the matrices $M_1,\ldots,M_k$. Here we assume that the oracles for the $M_i$ are given in sparse matrix form, as defined in Section 2. In particular, the goal is to construct a procedure that acts as a similar sparse matrix oracle for $H$. We will only focus on the oracle that computes the non-zero indices of $H$, since the oracle that gives access to the matrix elements is easy to compute by summing the outputs of the separate oracles.
In the remainder of this section we only consider one row of $H$. We denote this row by $R_H$ and the corresponding rows of the matrices $M_i$ by $R_i$. Notice that such a row is given as an ordered list of integers, namely the non-zero indices in $R_i$. Then $R_H$ will again be an ordered list of integers, containing each integer appearing in the $R_i$ lists exactly once (i.e., $R_H$ does not contain duplicates).
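Constructing $R_H$ from the rows $R_i$ is a standard merge of sorted lists with duplicate removal; a minimal classical sketch (the example rows are hypothetical):

```python
import heapq

def merge_rows(rows):
    """Merge sorted lists of non-zero column indices into the sorted,
    duplicate-free index list of the corresponding row of H."""
    merged = []
    for idx in heapq.merge(*rows):   # streams the pre-sorted inputs in order
        if not merged or merged[-1] != idx:
            merged.append(idx)       # drop duplicates (overlapping non-zeros)
    return merged

R1 = [2, 5, 9]
R2 = [1, 5, 7]
R3 = [2, 7, 11]
print(merge_rows([R1, R2, R3]))  # [1, 2, 5, 7, 9, 11]
```

Classically this merge costs time linear in the total input length; the point of the next subsection is that a quantum speed-up for computing such merged index lists is limited.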

D.1 A lower bound
We show a lower bound on the query complexity of the oracle $O_I^H$ described above by observing that determining the number of elements in a row of $H$ solves the majority function. Notice that, given access to $O_I^H$, we can decide whether there are at least a certain number of non-zero elements in a row of $H$.
Lemma 51. Given $k+1$ ordered lists of integers $R_0,\ldots,R_k$, each of length at most $d$. Let $R_H$ be the merged list that is ordered and contains every element of the lists $R_i$ only once (i.e., we remove duplicates). Deciding whether $|R_H|\le d+\frac{dk}{2}$ or $|R_H|\ge d+\frac{dk}{2}+1$ takes $\Omega(dk)$ quantum queries to the input lists in general.
Proof. We prove this by a reduction from MAJ on $dk$ elements. Let $Z\in\{0,1\}^{d\times k}$ be a Boolean string. It is known that it takes $\Omega(dk)$ quantum queries to $Z$ to decide whether $|Z|\le\frac{dk}{2}$ or $|Z|\ge\frac{dk}{2}+1$. Now let $R_0,R_1,\ldots,R_k$ be lists of length $d$ defined as follows:
• $R_0[j]=j(k+1)$ for $j=1,\ldots,d$.
• $R_i[j]=j(k+1)+i$ if $Z_{ij}=1$, and $R_i[j]=R_0[j]$ otherwise, for $i=1,\ldots,k$ and $j=1,\ldots,d$.
By construction, if $Z_{ij}=1$ then the value of the entry $R_i[j]$ is unique among the lists $R_0,\ldots,R_k$, and if $Z_{ij}=0$ then $R_i[j]=R_0[j]$. So $R_H$ contains one element for each element of $R_0$, and one element for each bit of $Z$ that is one. The length of $R_H$ is therefore $d+|Z|$. Hence, distinguishing between $|R_H|\le d+\frac{dk}{2}$ and $|R_H|\ge d+\frac{dk}{2}+1$ would solve the MAJ problem and therefore requires $\Omega(dk)$ queries to the lists in general.

It follows that an optimal solution to the dual is of the form $(y_0,y_0\cdot y)$, where $y$ is an optimal solution to the reduced dual. Observe that the optimal value, as a function of $y_0$, is of the form $y_0\cdot(-L+\mathrm{OPT})$. Since by assumption $\mathrm{OPT}\ge L$, the objective increases linearly with $y_0$, and hence $y_0=\frac{4}{U-L}$ is optimal.
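The counting step of the reduction is easy to verify classically. The offsets used below ($R_i[j]=j(k+1)+i$ when $Z_{ij}=1$, and $R_0[j]$ otherwise) are one concrete encoding with the properties the proof needs; the exact offsets are an assumption of this sketch:

```python
import random

def build_lists(Z, d, k):
    """Encoding in the spirit of the reduction: R_0[j] = j(k+1), and
    R_i[j] = j(k+1) + i if Z_ij = 1, else R_0[j]. A set bit contributes
    one value that appears in no other list."""
    R0 = [j * (k + 1) for j in range(1, d + 1)]
    Rs = [[j * (k + 1) + (i + 1) * Z[j - 1][i] for j in range(1, d + 1)]
          for i in range(k)]
    return [R0] + Rs

rng = random.Random(3)
d, k = 6, 4
Z = [[rng.randint(0, 1) for _ in range(k)] for _ in range(d)]
lists = build_lists(Z, d, k)
R_H = sorted(set().union(*map(set, lists)))   # merged, duplicate-free row
weight = sum(map(sum, Z))                     # Hamming weight |Z|

print(len(R_H), d + weight)  # |R_H| = d + |Z|
```

Since each offset $i\in\{1,\ldots,k\}$ is smaller than $k+1$, the values $j(k+1)+i$ never collide with the multiples $j(k+1)$ or with each other, so the merged length is exactly $d+|Z|$, which is what lets majority be read off from $|R_H|$.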