Quantum algorithms for escaping from saddle points

We initiate the study of quantum algorithms for escaping from saddle points with provable guarantee. Given a function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, our quantum algorithm outputs an $\epsilon$-approximate second-order stationary point using $\tilde{O}(\log^{2} (n)/\epsilon^{1.75})$ queries to the quantum evaluation oracle (i.e., the zeroth-order oracle). Compared to the classical state-of-the-art algorithm by Jin et al. with $\tilde{O}(\log^{6} (n)/\epsilon^{1.75})$ queries to the gradient oracle (i.e., the first-order oracle), our quantum algorithm is polynomially better in terms of $\log n$ and matches its complexity in terms of $1/\epsilon$. Technically, our main contribution is the idea of replacing the classical perturbations in gradient descent methods by simulating quantum wave equations, which constitutes the improvement in the quantum query complexity with $\log n$ factors for escaping from saddle points. We also show how to use a quantum gradient computation algorithm due to Jordan to replace the classical gradient queries by quantum evaluation queries with the same complexity. Finally, we also perform numerical experiments that support our theoretical findings.


Introduction
Nonconvex optimization is a central research topic in optimization theory, mainly because the loss functions in many machine learning models (including neural networks) are typically nonconvex. However, finding a global optimum of a nonconvex function is NP-hard in general. Instead, many theoretical works focus on finding local optima, since there are landscape results suggesting that local optima are nearly as good as global optima for many learning problems [11, 35-38, 43]. On the other hand, it is known that saddle points (and local maxima) can give highly suboptimal solutions in many problems [45, 65]. Furthermore, saddle points are ubiquitous in high-dimensional nonconvex optimization problems [16, 29, 33].
Therefore, one of the most important problems in nonconvex optimization is to escape from saddle points. Suppose we have a twice-differentiable function $f\colon\mathbb{R}^n\to\mathbb{R}$ such that
• f is $\ell$-smooth: $\|\nabla f(x_1)-\nabla f(x_2)\|\le \ell\|x_1-x_2\|$ for all $x_1,x_2\in\mathbb{R}^n$;
• f is $\rho$-Hessian Lipschitz: $\|\nabla^2 f(x_1)-\nabla^2 f(x_2)\|\le \rho\|x_1-x_2\|$ for all $x_1,x_2\in\mathbb{R}^n$.
The goal is to find an $\epsilon$-approximate local minimum $x^*$ (also known as an $\epsilon$-approximate second-order stationary point) such that
$$\|\nabla f(x^*)\|\le\epsilon,\qquad \lambda_{\min}(\nabla^2 f(x^*))\ge-\sqrt{\rho\epsilon}. \qquad (1)$$
Intuitively, this means that at $x^*$ the gradient is small, with norm at most $\epsilon$, and the Hessian is close to positive semi-definite, with smallest eigenvalue at least $-\sqrt{\rho\epsilon}$.
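As a concrete reading of condition (1), the following sketch checks both requirements numerically for an explicit toy gradient and Hessian (the function and parameters are hypothetical, purely for illustration):

```python
import numpy as np

def is_approx_sosp(grad, hess, eps, rho):
    """Check the eps-approximate second-order stationary point condition (1):
    ||grad f(x)|| <= eps  and  lambda_min(hess f(x)) >= -sqrt(rho * eps)."""
    small_grad = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess).min()   # eigvalsh: symmetric eigensolver
    not_saddle = lam_min >= -np.sqrt(rho * eps)
    return bool(small_grad and not_saddle)

# Example: f(x, y) = x^2 - y^2 has a strict saddle at the origin.
grad = np.zeros(2)
print(is_approx_sosp(grad, np.diag([2.0, -2.0]), eps=0.01, rho=1.0))  # -> False
print(is_approx_sosp(grad, np.diag([2.0, 2.0]), eps=0.01, rho=1.0))   # -> True
```

The threshold $-\sqrt{\rho\epsilon}$ (here $-0.1$) is what distinguishes an acceptable nearly-flat direction from genuine negative curvature.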
There have been two main focuses in designing algorithms for escaping from saddle points. First, algorithms with good performance in practice are typically dimension-free or almost dimension-free (i.e., with $\mathrm{poly}(\log n)$ dependence), especially considering that most real-world machine learning models have enormous dimensions. Second, practical algorithms prefer simple oracle access to the nonconvex function. If we are given a Hessian oracle of f, which takes x as input and outputs $\nabla^2 f(x)$, we can find an $\epsilon$-approximate local minimum by second-order methods; for instance, Ref. [61] takes $O(1/\epsilon^{1.5})$ queries. However, because the Hessian is an $n\times n$ matrix, its construction takes $\Omega(n^2)$ cost in general. Therefore, there has been notable interest in escaping from saddle points using simpler oracles.
A seminal work along this line was by Ge et al. [35], which finds an $\epsilon$-approximate local minimum satisfying (1) using only the first-order oracle, i.e., gradients. Although this work has a $\mathrm{poly}(n)$ dependence in its query complexity, the follow-up work [46] achieved an almost dimension-free complexity of $\tilde{O}(\log^4(n)/\epsilon^2)$, and the state-of-the-art result takes $\tilde{O}(\log^6(n)/\epsilon^{1.75})$ queries [48]. However, these results suffer from a significant overhead in terms of $\log n$, and it has been an open question to keep both merits of using only the first-order oracle and being close to dimension-free [49].
On the other hand, quantum computing is a rapidly advancing technology. In particular, the capability of quantum computers is dramatically increasing and recently reached "quantum supremacy" [63] with Google's experiment [7]. However, at the moment the noise of quantum gates prevents current quantum computers from being directly useful in practice; consequently, it is also of significant interest to understand quantum algorithms from a theoretical perspective, to pave the way for future applications.
In this paper, we explore quantum algorithms for escaping from saddle points. This is a mutual generalization of both classical and quantum algorithms for optimization: • For classical optimization theory, since many classical optimization methods are physics-motivated, including Nesterov's momentum-based methods [62], Hamiltonian Monte Carlo [34], and stochastic gradient Langevin dynamics [76], the elevation from classical mechanics to quantum mechanics can potentially bring more insights for designing fast quantum-inspired classical algorithms. In fact, quantum-inspired classical machine learning algorithms have been an emerging topic in theoretical computer science [20-22, 40, 64, 66, 67], and it is worthwhile to explore relevant classical algorithms for optimization.
• For quantum computing, the vast majority of previous quantum optimization algorithms have been devoted to convex optimization, with focuses on semidefinite programs [4, 5, 14, 15, 53] and general convex optimization [6, 19]; these results have at least a $\sqrt{n}$ dependence in their complexities, and these quantum algorithms are far from dimension-free. Up to now, little is known about quantum algorithms for nonconvex optimization.
However, there are indications that quantum speedups in nonconvex scenarios can potentially be more significant than in convex ones. In particular, quantum tunneling is a phenomenon in quantum mechanics where the wave function of a quantum particle can tunnel through a potential barrier and appear on the other side with significant probability. This closely resembles escaping from poor landscapes in nonconvex optimization. Moreover, quantum algorithms motivated by quantum tunneling are essentially different from those motivated by Grover search [42], and would demonstrate significant novelty if the quantum speedup over the classical counterparts is more than quadratic.

Contributions
Our main contribution is a quantum algorithm that finds an $\epsilon$-approximate local minimum of a function $f\colon\mathbb{R}^n\to\mathbb{R}$ that is $\ell$-smooth and $\rho$-Hessian Lipschitz. Compared to the classical state-of-the-art algorithm [48] using $\tilde{O}(\log^6(n)/\epsilon^{1.75})$ queries to the gradient oracle (i.e., the first-order oracle), our quantum algorithm achieves an improvement in query complexity by $\log n$ factors. Furthermore, our quantum algorithm only takes queries to the quantum evaluation oracle (i.e., the zeroth-order oracle), defined as a unitary map $U_f$ such that for any $x\in\mathbb{R}^n$,
$$U_f(|x\rangle\otimes|0\rangle)=|x\rangle\otimes|f(x)\rangle. \qquad (2)$$
Furthermore, for any $m\in\mathbb{N}$, $x_1,\ldots,x_m\in\mathbb{R}^n$, and $c\in\mathbb{C}^m$ such that $\sum_{i=1}^m|c_i|^2=1$,
$$U_f\Big(\sum_{i=1}^m c_i|x_i\rangle\otimes|0\rangle\Big)=\sum_{i=1}^m c_i|x_i\rangle\otimes|f(x_i)\rangle.$$
If we measure this quantum state, we get $f(x_i)$ with probability $|c_i|^2$. Compared to the classical evaluation oracle (i.e., $m=1$), the quantum evaluation oracle allows querying different locations in superposition, which is the essence of quantum speedups. In addition, if the classical evaluation oracle can be implemented by explicit arithmetic circuits, the quantum evaluation oracle in (2) can be implemented by quantum arithmetic circuits of about the same size. As a result, it is the standard assumption in previous literature on quantum algorithms for various optimization problems, including quadratic forms [51], basin hopper [17], and general convex optimization [6, 19]. We adopt it here for general nonconvex optimization.
Technically, our work is inspired by both the perturbed gradient descent (PGD) algorithm in [46, 47] and the perturbed accelerated gradient descent (PAGD) algorithm in [48]. More specifically, PGD applies gradient descent iteratively until it reaches a point with a small gradient. Such a point can potentially be a saddle point, so PGD applies a uniform perturbation in a small ball centered at that point and then continues the GD iterations. It can be shown that with an appropriate choice of the radius, PGD can shake the iterate off the saddle and converge to a local minimum with high probability. PAGD [48] adopts a similar perturbation idea, but the GD steps are replaced by Nesterov's AGD [62].
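The PGD loop just described can be sketched in a few lines. This is a minimal classical sketch (step size, perturbation radius, cooldown, and the toy landscape are our illustrative choices, not the tuned parameters of [46, 47]):

```python
import numpy as np

def perturbed_gradient_descent(grad_f, x0, eta=0.1, r=0.1,
                               g_thresh=1e-3, cooldown=10, T=500, seed=0):
    """Sketch of classical PGD: run gradient descent, and whenever the
    gradient is small (a candidate saddle), add a perturbation sampled
    uniformly from a ball of radius r, then continue the GD iterations."""
    rng = np.random.default_rng(seed)
    x, last = np.asarray(x0, dtype=float), -np.inf
    for t in range(T):
        if np.linalg.norm(grad_f(x)) <= g_thresh and t - last > cooldown:
            u = rng.normal(size=x.shape)                 # random direction
            u *= r * rng.random() ** (1.0 / len(x)) / np.linalg.norm(u)
            x, last = x + u, t                           # uniform in the ball
        x = x - eta * grad_f(x)                          # plain GD step
    return x

# Toy landscape: f(x, y) = x^2 + y^4/4 - y^2/2 has a strict saddle at the
# origin and local minima at (0, +1) and (0, -1).
grad = lambda z: np.array([2 * z[0], z[1] ** 3 - z[1]])
x_final = perturbed_gradient_descent(grad, np.zeros(2))
```

Started exactly at the saddle, plain GD would stay there forever; the ball perturbation pushes the iterate into the negative curvature direction, after which GD converges to one of the two minima.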
Our quantum algorithm is built upon PGD and PAGD and shares their simplicity of being single-loop, but we propose two main modifications. On the one hand, in the perturbation steps for escaping from saddle points, we replace the uniform perturbation by evolving a quantum wave function governed by the Schrödinger equation and using the measurement outcome as the perturbed result. Intuitively, the Schrödinger equation screens the local geometry of a saddle point through wave interference, so the wave packet disperses rapidly along the directions with significant function value decrease. Specifically, quantum mechanics finds the negative curvature directions more efficiently than the classical counterpart: for constant $\epsilon$, the classical PGD and PAGD take $O(\log n)$ steps to decrease the function value by $\Omega(1/\log^3 n)$ and $\Omega(1/\log^5 n)$ with high probability, respectively. Quantumly, simulating the Schrödinger equation for time t takes $\tilde{O}(t\log n)$ evaluation queries, but simulation for time $t=O(\log n)$ suffices to decrease the function value by $\Omega(1)$ with high probability. See Proposition 1 and Theorem 4.
In addition, we replace the gradient descent steps by a quantum algorithm for computing gradients, also using quantum evaluation queries. The idea was initiated by Jordan [50], which computes the gradient at a point by applying the quantum Fourier transform over a mesh near that point. Prior work has applied Jordan's algorithm to general convex optimization [6, 19]; we follow the same path and conduct a detailed analysis (see Theorem 5) showing how to replace classical gradient queries by the same number of quantum evaluation queries in nonconvex optimization.
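For intuition on what Jordan's algorithm buys, here is the classical mesh-based analogue: a central-difference estimate needs 2n separate evaluation queries for an n-dimensional gradient, whereas Jordan's algorithm queries the mesh in superposition with O(1) quantum evaluation queries. The sketch below is only this classical baseline, not Jordan's algorithm itself:

```python
import numpy as np

def central_diff_grad(f, x, h=1e-5):
    """Estimate the gradient from evaluation queries alone: 2n central
    differences, i.e., the classical counterpart of the zeroth-order oracle."""
    x = np.asarray(x, dtype=float)
    g = np.empty_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # one coordinate per pair
    return g

f = lambda z: z[0] ** 2 - z[1] ** 2
print(central_diff_grad(f, [1.0, 2.0]))   # ≈ [2, -4]
```

Central differences are exact on quadratics up to floating-point rounding, which is why the toy check above recovers the gradient of $x^2-y^2$ essentially exactly.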
It is worth highlighting that our quantum algorithm enjoys the following two nice features: • Classical-quantum hybrid: In Algorithm 3 and Algorithm 4, the transition between consecutive iterations is still classical, while the only quantum part happens inside each iteration, replacing the classical uniform perturbation. This feature is friendly to implementation on near-term quantum computers.
• Robustness: Our quantum algorithm is robust in two respects. On the one hand, we can escape even from an approximate saddle point by evolving the Schrödinger equation (see Proposition 1). On the other hand, Theorem 5 essentially shows the robustness of escaping from saddle points even with noisy gradient descent, which may be of independent interest.
Finally, we perform numerical experiments that support our theoretical findings. Specifically, we observe the dispersion of quantum wave packets along the negative curvature direction in various landscapes. In a comparative study, our PGD with quantum simulation outperforms the classical PGD, with a higher probability of escaping from saddle points and fewer iteration steps. We also compare the dimension dependence of the classical and quantum algorithms on a model problem with dimension varying from 10 to 1000, and our quantum algorithm achieves better dimension scaling overall.

Related Work
Escaping from saddle points by gradients was initiated by [35] with complexity $O(\mathrm{poly}(n/\epsilon))$.
The follow-up work by [55] improved this to $O(n^3\,\mathrm{poly}(1/\epsilon))$, which is still polynomial in the dimension n. The breakthrough result by [46, 47] achieves iteration complexity $\tilde{O}(\log^4(n)/\epsilon^2)$, which is poly-logarithmic in n. The best-known result has complexity $\tilde{O}(\log^6(n)/\epsilon^{1.75})$ by [48] (the same result in terms of $\epsilon$ was independently obtained by [3, 71]). Besides the gradient oracle, escaping from saddle points can also be achieved using the Hessian-vector product oracle with $O(\log(n)/\epsilon^{1.75})$ queries [1, 18]. There has also been a rich literature on stochastic optimization algorithms for finding second-order stationary points using only the first-order oracle. The seminal work [35] showed that noisy stochastic gradient descent (SGD) finds approximate second-order stationary points in $O(\mathrm{poly}(n)/\epsilon^4)$ iterations. This was later improved to $\tilde{O}(\mathrm{poly}(\log n)/\epsilon^{3.5})$ [2, 3, 31, 68, 72], and the current state-of-the-art iteration complexity of stochastic algorithms is $\tilde{O}(\mathrm{poly}(\log n)/\epsilon^3)$ due to [30, 77].
Quantum algorithms for nonconvex optimization with provable guarantees are a largely open topic. As far as we know, the only work in this direction is [75], which gives a quantum algorithm for finding the negative curvature of a point in time $\tilde{O}(\mathrm{poly}(r,1/\epsilon))$, where r is the rank of the Hessian at that point. However, that algorithm has a few drawbacks: 1) the cost is expensive when $r=\Theta(n)$; 2) it relies on a quantum data structure [52] which can be dequantized to classical algorithms with comparable cost [20, 66, 67]; 3) it can only find the negative curvature for a fixed Hessian. In all, it is unclear whether this quantum algorithm achieves a speedup for escaping from saddle points.

Open Questions
Our paper leaves several natural open questions for future investigation: • Can we give quantum-inspired classical algorithms for escaping from saddle points? Our work suggests that compared to uniform perturbation, there exist physics-motivated methods to better exploit the randomness in gradient descent. A natural question is to understand the potential speedup of using (classical) mechanical waves.
• Can quantum algorithms achieve speedup in terms of $1/\epsilon$? The current speedup due to quantum simulation only improves the dependence on $\log n$.
• Beyond local minima, do quantum algorithms provide an advantage in approaching global minima? Potentially, simulating quantum wave equations can not only escape from saddle points but also escape from some poor local minima.

Organization
We introduce quantum simulation of the Schrödinger equation in Section 2.1, and present how it provides quantum speedup for perturbed gradient descent and perturbed accelerated gradient descent in Section 2.2 and Section 2.3, respectively. We introduce how to replace classical gradient descents by quantum evaluations in Section 3. We present numerical experiments in Section 4. Necessary tools for our proofs are given in Appendix A.

Escape from Saddle Points by Quantum Simulation
The main contribution of this section is to show how to escape from a saddle point by replacing the uniform perturbation in the perturbed gradient descent (PGD) algorithm [47,Algorithm 4] and the perturbed accelerated gradient descent (PAGD) algorithm [48, Algorithm 2] with a distribution adaptive to the saddle point geometry. The intuition behind the classical algorithms is that without a second-order oracle, we do not know in which direction a perturbation should be added, thus a uniform perturbation is appropriate. However, quantum mechanics allows us to find the negative curvature direction without explicit Hessian information.

Quantum Simulation of the Schrödinger Equation
We consider the most standard evolution in quantum mechanics, the Schrödinger equation
$$i\frac{\partial}{\partial t}\Phi = \Big(-\frac{1}{2}\Delta + f(x)\Big)\Phi, \qquad (4)$$
where Φ is a wave function on $\mathbb{R}^n$, Δ is the Laplacian operator, and f can be regarded as the potential of the evolution. In the one-dimensional case, we can prove that Φ enjoys an explicit form if f is a quadratic function:

Lemma 1. Suppose a quantum particle is in a one-dimensional potential field $f(x)=\frac{\lambda}{2}x^2$ with initial state $\Phi(0,x)=(\frac{1}{2\pi})^{1/4}\exp(-x^2/4)$; in other words, the initial position of this quantum particle follows the standard normal distribution N(0,1). The time evolution of this particle is governed by (4). Then, at any time $t\ge 0$, the position of the quantum particle still follows the normal distribution $N(0,\sigma^2(t;\lambda))$, where the variance $\sigma^2(t;\lambda)$ is given by
$$\sigma^2(t;\lambda)=\begin{cases}\cos^2(\sqrt{\lambda}\,t)+\frac{1}{4\lambda}\sin^2(\sqrt{\lambda}\,t), & \lambda>0,\\ 1+t^2/4, & \lambda=0,\\ \cosh^2(\sqrt{-\lambda}\,t)+\frac{1}{4(-\lambda)}\sinh^2(\sqrt{-\lambda}\,t), & \lambda<0.\end{cases} \qquad (5)$$

Lemma 1 shows that the wave function disperses when the potential field has negative curvature, i.e., λ < 0, and that the dispersion is exponentially fast. Furthermore, we prove in Appendix A.1 that this "escaping-at-negative-curvature" behavior of the wave function still emerges for a quadratic potential field $f(x)=\frac{1}{2}x^T H x$ in any finite dimension. To turn this idea into a quantum algorithm, we need quantum simulation. In fact, quantum simulation in real spaces is a classical problem that has been studied since the 1990s [70, 73, 74]. There is a rich literature on the cost of quantum simulation [9, 10, 23, 57-59]; it is typically linear in the evolution time, which is formally known as the "no-fast-forwarding theorem"; see Theorem 3 of [9] and Theorem 3 of [24]. In Section 2.1.1, we prove the following lemma about the cost of simulating the Schrödinger equation using the quantum evaluation oracle in (2):

Lemma 2. Consider the (rescaled) Schrödinger equation (6) defined on the domain $\Omega=\{x\in\mathbb{R}^n : \|x\|\le M\}$ (where M > 0 is the diameter that will be specified later) with periodic boundary conditions. Given the quantum evaluation oracle $U_f(|x\rangle\otimes|0\rangle)=|x\rangle\otimes|f(x)\rangle$ in (2) and an arbitrary initial state at time t = 0, the evolution of (6) for time t > 0 can be simulated using $\tilde{O}(t\log n\log^2(t/\epsilon))$ queries to $U_f$, where $\epsilon$ is the precision.
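The exponentially fast dispersion claimed in Lemma 1 can be reproduced classically with a standard split-step Fourier integrator of (4). The script below is an illustrative sanity check (grid size, time step, and the λ < 0 variance formula $\cosh^2(\sqrt{-\lambda}\,t)+\sinh^2(\sqrt{-\lambda}\,t)/(4(-\lambda))$ are our own choices for this check, not part of the quantum algorithm):

```python
import numpy as np

# Split-step Fourier simulation of i dPhi/dt = (-Laplacian/2 + lam*x^2/2) Phi
# in 1D, starting from the Gaussian of Lemma 1 (position distribution N(0,1)).
lam, t_final, steps = -1.0, 1.0, 1000
N, L = 1024, 40.0
dx = L / N
x = (np.arange(N) - N // 2) * dx
k = 2 * np.pi * np.fft.fftfreq(N, d=dx)       # angular frequencies
dt = t_final / steps

psi = (1 / (2 * np.pi)) ** 0.25 * np.exp(-x ** 2 / 4)   # initial wave packet
half_V = np.exp(-0.5j * dt * lam * x ** 2 / 2)          # half-step in potential
kin = np.exp(-1j * dt * k ** 2 / 2)                      # full kinetic step
for _ in range(steps):                                   # Strang splitting
    psi = half_V * psi
    psi = np.fft.ifft(kin * np.fft.fft(psi))
    psi = half_V * psi

prob = np.abs(psi) ** 2
prob /= prob.sum() * dx                                  # position distribution
var = (prob * x ** 2).sum() * dx
w = np.sqrt(-lam)                                        # predicted by Lemma 1
predicted = np.cosh(w * t_final) ** 2 + np.sinh(w * t_final) ** 2 / (4 * -lam)
print(var, predicted)
```

For λ = −1 and t = 1 the measured variance is about 2.73, versus $1+t^2/4=1.25$ for a free particle, illustrating the fast dispersion along negative curvature.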
Because we have assumed that f is ρ-Hessian Lipschitz, we can use the second-order Taylor expansion to approximate the function value of f near a saddle point $\bar{x}$. Such an approximation is more accurate on a ball centered at $\bar{x}$ with a small enough radius $r_0$. Accordingly, we scale the initial distribution as well as the Schrödinger equation to be localized at scale $r_0$, which results in Algorithm 1.
Algorithm 1 is the main building block of our quantum algorithms for escaping from saddle points, and also the main source of our quantum speedup.

Quantum Query Complexity of Simulating the Schrödinger Equation
We prove Lemma 2 in this subsection. Before doing that, we briefly discuss why we simulate the scaled Schrödinger equation (6) instead of the common non-relativistic Schrödinger equation in (4), rewritten below:
$$i\frac{\partial}{\partial t}\Phi = \Big(-\frac{1}{2}\Delta + f(x)\Big)\Phi.$$

Algorithm 1: QuantumSimulation($\bar{x}$, $r_0$, $t_e$, $f(\cdot)$).
1. Put a Gaussian wave packet into the potential field f, with its initial state being a Gaussian centered at $\bar{x}$ whose position distribution has covariance $r_0^2 I$, and simulate its evolution in the potential field f under the Schrödinger equation for time $t_e$;
2. Measure the position of the wave packet and output the measurement outcome.

In real-world problems, we are likely to encounter an objective function f(x) with a saddle point at $x_0$ that is not a quadratic form. In this situation, a quadratic approximation is only valid within a small neighborhood of the first-order stationary point $x_0$, say the domain Ω defined in Lemma 2. It is therefore necessary to rescale the spatial variable in order to make the wave packet more localized. However, scaling the spatial variable simultaneously scales the time variable under Eq. (4). This is not preferable because the scaling in time can dramatically change the variance $\sigma^2(t;\lambda)$ in (5), which causes trouble when bounding the time complexity in the analysis of our algorithms. To leave the time scale invariant, we introduce a modified Schrödinger equation (6), in which the quantum simulation is restricted to a domain of diameter $O(r_0)$: this localization guarantees that the quantum wave packet captures the saddle point geometry while not being significantly affected by other features of the landscape of f(x), thus simplifying our analysis. We justify the construction of (6) from three aspects: • Geometric aspect: Eq. (6) is obtained by applying the spatial dilation $\Phi(t,x)\to\Phi(t,x/r)$ to the wave function without changing the time scale. This guarantees that the variance of the Gaussian distribution corresponding to $\Phi(t,x/r)$ is exactly $r^2$ times the original variance $\sigma^2(t;\lambda)$ (we prove this in Proposition 2). Mathematically, this time invariance means the dispersion speed is an intrinsic quantity, mostly determined by the saddle point geometry.
• Physical aspect: When the wave function is too localized in position space, by the uncertainty principle the momentum variable spreads over a large domain in frequency space. To reconcile this disparity, we introduce a small $r^2$ factor for the kinetic energy operator $-\frac{1}{2}\Delta$ in order to balance position and momentum. • Complexity aspect: The circuit complexity of simulating the Schrödinger equation is linear in the operator norm of the Hamiltonian. Our scaling in (6) drags down the operator norm of the Laplacian (which we discretize when doing the simulation) while leaving the operator norm of the potential field $O(\|H\|)$ on an $O(r_0)$-ball. This normalization helps reduce the circuit complexity.
Complexity bounds for quantum simulation are a well-established research topic; see e.g. [9, 10, 23, 57-59] for detailed results and proofs. In this paper, we apply quantum simulation in the interaction picture [60]. In particular, we use the following result:

Theorem 2 ([60, Lemma 6]). Let $A,B\in\mathbb{C}^{d\times d}$ be time-independent Hamiltonians that are promised to obey $\|A\|\le\alpha_A$ and $\|B\|\le\alpha_B$, where $\|\cdot\|$ denotes the spectral norm. Then the time-evolution operator $e^{-i(A+B)t}$ can be simulated up to error $\epsilon$ using $O\big(\alpha_B t\,\mathrm{polylog}(t(\alpha_A+\alpha_B)/\epsilon)\big)$ queries to the oracles encoding A and B.

Our Lemma 2 is inspired by [27], which gives a quantum algorithm for simulating the Schrödinger equation but without the potential function f. It discretizes the space into grids of side-length a; in this case, $-\frac{1}{2}\Delta$ reduces to $-\frac{1}{2a^2}L$, where L is the Laplacian matrix of the grid graph (whose off-diagonal entries are −1 for adjacent grid points and zero otherwise; the diagonal entries are the degrees of the grid points). For instance, in the one-dimensional case,
$$-\frac{1}{2a^2}(L\phi)_j = \frac{\phi_{j+1}-2\phi_j+\phi_{j-1}}{2a^2},$$
where $\phi_j$ is the value on the j-th grid point. As $a\to 0$, this becomes half the second derivative of φ; in practice, as mentioned above, it suffices to take $1/a=\mathrm{poly}(\log(1/\epsilon))$ so that the overall precision is bounded by $\epsilon$. The discretization used in [27] is a special case of the finite difference method (FDM), a common method in applied mathematics for discretizing the space of ODE or PDE problems so that their solutions become numerically tractable. More specifically, the continuous space is approximated by discrete grids, and the partial derivatives are approximated by finite differences in each direction. There are higher-order methods for estimating derivatives by finite difference formulas [56], and it is known that the number of grid points in each coordinate can be as small as $\mathrm{poly}(\log(1/\epsilon))$ by applying the high-order approximations to the FDM adaptively [8].
See also Section 3 of [25] which gave quantum algorithms for solving PDEs that applied FDM with this poly(log(1/ )) complexity for the grids.
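For concreteness, here is a minimal classical sketch of the second-order FDM Laplacian on a periodic one-dimensional grid (the grid size is an illustrative choice; the higher-order operators $L_k$ of [25, 56, 8] refine the same construction):

```python
import numpy as np

# Second-order central-difference Laplacian on a periodic 1D grid:
# (L phi)_j = 2 phi_j - phi_{j+1} - phi_{j-1}, so -(1/(2 a^2)) L phi -> phi''/2.
N = 256
a = 2 * np.pi / N
x = np.arange(N) * a
L = (2 * np.eye(N)
     - np.roll(np.eye(N), 1, axis=1)     # neighbor j+1
     - np.roll(np.eye(N), -1, axis=1))   # neighbor j-1 (periodic wrap)

phi = np.sin(x)                          # test function with phi'' = -sin(x)
approx = -(L @ phi) / (2 * a ** 2)       # should approach phi''/2 = -sin(x)/2
err = np.max(np.abs(approx + np.sin(x) / 2))
print(err)                               # O(a^2) discretization error
# L is the Laplacian of a degree-2 graph, so ||L|| <= 4 and
# ||L/(2 a^2)|| = O(1/a^2), matching the operator-norm discussion above.
```

The $O(1/a^2)$ norm of the discretized kinetic term is exactly what the $r_0^2$ rescaling in (6) suppresses in the proof of Lemma 2.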
We are now ready to prove Lemma 2.
Proof. There are two steps in the quantum simulation of (6): (1) discretizing the spatial domain using (9) so that the Schrödinger equation (6) reduces to an ordinary differential equation (10); (2) simulating (10) in the interaction picture. In each step, we fix the error tolerance to $\epsilon/2$; by the triangle inequality, the overall error is $\epsilon$. First, we consider the k-th order finite difference method of Section 3 of [25] (the discrete Laplacian is denoted $L_k$). With spacing a between grid points, if we choose the mesh number along each direction as $1/a=\mathrm{poly}(n)\,\mathrm{poly}(\log(2/\epsilon))$, the finite difference error is of order $\epsilon/2$. The Schrödinger equation (6) then becomes (10), where $L_k$ is the Laplacian matrix associated with the k-th order finite difference discretization of the hypercube Ω, and B is a diagonal matrix whose entry for the grid point at x is $\frac{1}{r_0^2}f(x)$. Here, the function evaluation oracle $U_f$ is trivially encoded in the matrix evaluation oracle $O_B$. By [25], the spectral norm of $L_k$ is of order $O(n/a^2)=\mathrm{poly}(n)\,\mathrm{poly}(\log(2/\epsilon))$, where n is the spatial dimension of the Schrödinger equation.
We simulate the evolution of (10) by Theorem 2, taking A to be the rescaled discretized Laplacian term in (10) and B the potential term. Recall that $\|L_k\|\le\mathrm{poly}(n)\,\mathrm{poly}(\log(2/\epsilon))$. By the $\ell$-smoothness condition, we have $\|\nabla f(x)\|\le\ell M$ for $x\in\Omega$, so the maximal absolute value of f(x) on Ω is bounded by $\ell M^2$ by the Poincaré inequality. Therefore, $\alpha_A\le C r_0^2\,\mathrm{poly}(n)\,\mathrm{poly}(\log(2/\epsilon))$ for an absolute constant C > 0, and $\alpha_B\le\ell(M/r_0)^2$. It follows from Theorem 2 that simulating the time-evolution operator $e^{-i(A+B)t}$ for time t > 0 takes $\tilde{O}(t\log n\log^2(t/\epsilon))$ quantum queries to $O_B$ (or, equivalently, to $U_f$). The radius M of the simulation region is chosen large enough that the wave packet does not hit the boundary during the simulation. Intuitively, M should be proportional to the initial standard deviation $r_0$; quantitatively, it is shown in Section 2.2 that under our choice of parameters, $M/r_0$ equals an absolute constant $1/C_r$. Absorbing all poly-logarithmic factors into the $\tilde{O}$ notation, the total number of quantum queries to $U_f$ is $\tilde{O}(t\log n\log^2(t/\epsilon))$, as claimed in Lemma 2.

Remark 1. In our scenario of escaping from saddle points, the initial state is a Gaussian wave packet. It is well-known that a Gaussian state can be efficiently prepared on quantum computers [54]; Gaussian states are also ubiquitous in the literature on continuous-variable quantum information [69]. However, although the evolution under the Schrödinger equation keeps the state a Gaussian wave packet when f is quadratic (Lemma 8), the evolution intrinsically depends on f, and it is not totally clear how to prepare the Gaussian wave packet at time t directly via continuous-variable quantum information. It seems that the quantum simulation above, using the quantum evaluation oracle $U_f$ in (2), is necessary for our purpose.

Perturbed Gradient Descent with Quantum Simulation
We now introduce a modified version of perturbed gradient descent. We start with gradient descent until the gradient becomes small. Then, we perturb the point by applying Algorithm 1 for a time period $t_e=T$, perform a measurement on all coordinates (which gives an output $x_0$), and continue with gradient descent until the algorithm has run for T iterations. This is summarized as Algorithm 2.

Algorithm 2: Perturbed Gradient Descent with Quantum Simulation.
Intuitively, Algorithm 2 applies QuantumSimulation to find the negative curvature of saddle points. Hence, in Line 3 we simulate the wave packet under the potential $f(x)-\langle\nabla f(x_t),\,x-x_t\rangle$ instead of f itself, since the first-order term in the Taylor expansion of f at $x_t$ is not relevant to the negative curvature, which is characterized by the second-order Hessian matrix. Once a negative curvature direction is identified, we add a perturbation $\Delta_t$ in that direction to decrease the function value and escape from the saddle point.
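To convey the effect of this perturbation step classically, the sketch below replaces the measurement outcome of QuantumSimulation with a sample from the Gaussian $N(0, r_0^2\Sigma)$ that the quadratic-potential analysis (Lemma 1 and Proposition 2) predicts. The variance formula and all parameters here are illustrative assumptions, and, unlike the quantum algorithm, this stand-in explicitly diagonalizes the Hessian:

```python
import numpy as np

def sigma2(t, lam):
    """Per-eigendirection variance growth in the style of Lemma 1
    (illustrative reconstruction of the quadratic-potential case)."""
    if lam > 0:
        s = np.sqrt(lam)
        return np.cos(s * t) ** 2 + np.sin(s * t) ** 2 / (4 * lam)
    if lam == 0:
        return 1 + t ** 2 / 4
    w = np.sqrt(-lam)
    return np.cosh(w * t) ** 2 + np.sinh(w * t) ** 2 / (4 * -lam)

def simulated_perturbation(hess, r0, t_e, seed=0):
    """Classical stand-in for QuantumSimulation's measurement outcome at a
    saddle with Hessian `hess`: sample N(0, r0^2 Sigma), with Sigma diagonal
    in the Hessian eigenbasis and entries sigma2(t_e, lambda_i)."""
    rng = np.random.default_rng(seed)
    lams, U = np.linalg.eigh(hess)
    stds = r0 * np.sqrt([sigma2(t_e, l) for l in lams])
    return U @ (stds * rng.normal(size=len(lams)))

H = np.diag([1.0, -1.0])            # saddle: negative curvature along y
xi = simulated_perturbation(H, r0=0.1, t_e=5.0)
# The sample is overwhelmingly stretched along the lambda = -1 direction:
# sigma2(5, -1) ~ cosh(5)^2 ~ 5.5e3, versus sigma2(5, 1) <= 1.
```

The exponential imbalance between the stable and unstable directions is why a single measurement already indicates a negative curvature direction with high probability.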

Effectiveness of the Perturbation by Quantum Simulation
We show that our method of quantum wave packet simulation is significantly better than the classical method of uniform perturbation in a ball. More specifically, we focus on the scenario $\epsilon\le\ell^2/\rho$ (the standard assumption adopted in [48]); intuitively, this is the case when the local landscape is "flat" and the Hessian has a small spectral radius. In this case, classical gradient descent may move slowly, but the quantum Gaussian wave packet still disperses fast, i.e., the variance of the probability distribution corresponding to the wave packet still increases at a large rate. Hence, if we let this wave packet evolve for a long enough time period, it is drastically stretched along the directions with negative curvature. As a result, if we measure its position at that time, with high probability the output vector indicates a negative curvature direction, or equivalently, a direction along which we can decrease the function value. We can thus add a large perturbation along that direction to escape from the saddle point. Formally, we prove:

Proposition 1. For any constant $\delta_0>0$, we specify our choices of the parameters and constants that we use, where $C_r$, $C_0$, α are absolute constants and the value of α is specified in Lemma 3. Let $f\colon\mathbb{R}^n\to\mathbb{R}$ be an $\ell$-smooth, ρ-Hessian Lipschitz function. For an approximate saddle point $\bar{x}$ satisfying $\|\nabla f(\bar{x})\|\le\epsilon$ and $\lambda_{\min}(\nabla^2 f(\bar{x}))\le-\sqrt{\rho\epsilon}$, Algorithm 2 adds a perturbation by QuantumSimulation with the radius M of the simulation region set to $M=r_0/C_r$, and decreases the function value by at least $F$, with probability at least $1-\delta_0$.
Compared to the provable guarantee of classical perturbation [47, Lemma 22], speaking only in terms of n: classically it takes $T=O(\log n)$ steps to decrease the function value by $F=\Omega(1/\log^3 n)$, whereas our quantum simulation for time $T=O(\log n)$, together with T subsequent GD iterations, decreases the function value by $F=\Omega(1)$ with high success probability.
Intuitively, the proof of Proposition 1 is composed of two parts. If the potential f is quadratic, we can use Lemma 1 to prove Proposition 2 (both proof details are given in Appendix A.1), which demonstrates the exponential rate for quantum simulation to escape along the negative eigen-directions of the Hessian of f . However, the objective function f is rarely a standard quadratic form in reality, and we cannot expect the quantum wave packet to preserve its shape as a Gaussian distribution. Nevertheless, we are able to show that the quantum wave packets do not differ significantly from a perfect Gaussian distribution in the course of quantum simulation, which preserves our quantum speedup in the general case.
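This near-Gaussianity can be illustrated numerically in one dimension by evolving the same initial wave packet under a quadratic potential and under a cubic perturbation of it, then comparing the two position distributions in total variation. Everything below (the cubic coefficient ρ, grid, and evolution time) is our illustrative choice, not the paper's setting:

```python
import numpy as np

def evolve(potential, t_final=1.0, steps=800, N=1024, L=40.0):
    """Split-step integrator of the Schrodinger equation (4);
    returns the grid and the final position distribution."""
    dx = L / N
    x = (np.arange(N) - N // 2) * dx
    k = 2 * np.pi * np.fft.fftfreq(N, d=dx)
    dt = t_final / steps
    psi = (1 / (2 * np.pi)) ** 0.25 * np.exp(-x ** 2 / 4)
    half_V = np.exp(-0.5j * dt * potential(x))
    kin = np.exp(-1j * dt * k ** 2 / 2)
    for _ in range(steps):                       # Strang splitting
        psi = np.fft.ifft(kin * np.fft.fft(half_V * psi)) * half_V
    p = np.abs(psi) ** 2
    return x, p / (p.sum() * dx)

lam, rho = -1.0, 0.05
x, p_quad = evolve(lambda x: lam * x ** 2 / 2)                     # quadratic f_q
_, p_pert = evolve(lambda x: lam * x ** 2 / 2 + rho * x ** 3 / 6)  # non-quadratic f
tv = 0.5 * np.abs(p_quad - p_pert).sum() * (x[1] - x[0])
print(tv)   # small for small rho: the wave packet stays near-Gaussian
```

For a small cubic term the total variation distance between the two measurement distributions remains small, which is the one-dimensional analogue of the bound in Lemma 3.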
Formally, we introduce the following lemma to bound the deviation from a perfect Gaussian during the quantum evolution. Before proceeding to its details, we first specify our choice of the constant $C_r$. As shown in the statement of Proposition 1, $C_r$ is the ratio between the initial wave packet standard deviation and the radius of the simulation region. We choose the constant $C_r$ small enough that the simulation region is much larger than the range of the wave packet during the entire simulation. Since the function f is $\ell$-smooth, the spectral norm of its Hessian at any point is upper bounded by the constant $\ell$; hence a small enough constant $C_r$ can be chosen independently of f. The radius M of the simulation region then satisfies
$$M = r_0/C_r. \qquad (13)$$

Lemma 3. Let H be the Hessian matrix of f at a saddle point $\bar{x}$, and define
$$f_q(x) = f(\bar{x}) + \langle\nabla f(\bar{x}),\,x-\bar{x}\rangle + \frac{1}{2}(x-\bar{x})^T H (x-\bar{x})$$
to be the quadratic approximation of the function f near $\bar{x}$. Denote by ξ the measurement outcome of the quantum simulation (see Algorithm 1) with potential field f and evolution time $t_e$, and by ξ′ the measurement outcome of the quantum simulation with potential field $f_q$ and the same evolution time $t_e$. Let the law of ξ (resp. ξ′) be $P_\xi$ (resp. $P_{\xi'}$). If the quantum wave packet is confined to a hypercube with edge length M, then
$$TV(P_\xi, P_{\xi'}) \le \alpha\, M n^{3/2} t_e^2,$$
where $TV(\cdot,\cdot)$ is the total variation distance between measures and α is an absolute constant.

The proof of Lemma 3 is deferred to Appendix A.2. This lemma shows that the true perturbation ξ ∼ $P_\xi$ given by quantum simulation only deviates from the Gaussian random vector ξ′ ∼ $P_{\xi'}$ by a magnitude of $\tilde{O}(M n^{3/2} t_e^2)$ when $t_e = T = O(\log n)$ in Algorithm 2. Such a deviation is immaterial compared to our choice of M in (13). Therefore, we may estimate the performance of our quantum simulation subroutine on the quadratic approximation and then bound the error caused by the non-quadratic part, as in the following lemma:

Lemma 4. Under the parameter choices of Proposition 1, with probability at least $1-\delta_0$, the measurement outcome ξ of QuantumSimulation in Algorithm 2 indicates a negative curvature direction along which the function value can be decreased.
Proof. Without loss of generality, assume $\nabla f(x_t)=0$. First consider the case where the potential f is purely quadratic; we estimate the error term caused by the non-quadratic deviation afterwards. Note that the Hessian matrix H admits the eigen-decomposition
$$H=\sum_{i=1}^n \lambda_i u_i u_i^T,$$
where the set $\{u_i\}_{i=1}^n$ forms an orthonormal basis of $\mathbb{R}^n$. Without loss of generality, we assume that the eigenvalues $\lambda_1,\lambda_2,\ldots,\lambda_n$ corresponding to $u_1,u_2,\ldots,u_n$ satisfy $\lambda_1\le\lambda_2\le\cdots\le\lambda_n$. We use S, $S^\perp$ to denote the subspaces of $\mathbb{R}^n$ spanned by the eigenvectors with the most negative eigenvalues and by the remaining eigenvectors, respectively, and similarly S′, $S'^\perp$ for a second pair of subspaces defined by a different eigenvalue threshold. Furthermore, we define $\xi_\parallel$, $\xi_\perp$, $\xi'_\parallel$, $\xi'_\perp$ respectively as the components of ξ in the subspaces S, $S^\perp$, S′, $S'^\perp$. Also, we define $\xi_1:=\langle u_1,\xi\rangle u_1$ as the component of ξ along $u_1$, the most negative eigendirection.
Under the basis {u₁, . . . , u_n}, by Proposition 2, the time evolution of the initial wave function is governed by (6). Then, at t_e = T, the wave function still follows a multivariate Gaussian distribution N(0, r₀²Σ), with the covariance matrix Σ diagonal in this basis, where the variance σ²(T; λ_i) is defined in (5).
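The effect of this anisotropic covariance can be illustrated with a small numerical sketch (Python/numpy; the eigenvalues, the amplification factor along u₁, and the sample count are illustrative choices, not the paper's parameter settings): a Gaussian whose variance is amplified along the most negative eigendirection concentrates on directions of negative curvature, in the sense that the Rayleigh quotient ξᵀHξ/‖ξ‖² is negative for most samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Toy Hessian in its eigenbasis: one direction of negative curvature,
# the remaining directions with small positive curvature (values illustrative).
eigvals = np.full(n, 0.05)
eigvals[0] = -0.5

# Squeezed Gaussian perturbation: the variance along the most negative
# eigendirection u_1 is amplified, mimicking the dispersed wave packet.
sigma = np.ones(n)
sigma[0] = 10.0

xi = rng.normal(size=(2000, n)) * sigma
rayleigh = np.einsum('ij,j,ij->i', xi, eigvals, xi) / np.sum(xi**2, axis=1)

# Most samples point in a direction of negative curvature.
print(rayleigh.mean(), np.mean(rayleigh < 0))
```

Without the amplification (sigma identically 1), the ξ₁ component would carry only a 1/n fraction of the mass and the quotient would typically be positive.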
Due to our choice of the parameter T, we can further derive the bounds below. Denote by p′ the dimension of S′. We define an (n − p′)-dimensional Gaussian distribution p̄(·) on S′⊥; then the actual distribution of ξ_{S′⊥} is dominated by the distribution of y under the probability density function p̄(y). Furthermore, by Lemma 12 in Appendix A.3, with probability at least 1 − δ₀/3 the stated bound on ‖ξ_{S′⊥}‖ holds. On the other hand, along the most negative eigendirection i = 1, the variance grows according to (5). Hence, after we measure the wave packet, ξ₁ satisfies the stated lower bound. By the union bound, with probability at least 1 − 2δ₀/3, the output ξ satisfies both bounds. Considering the fact that the function f is not purely quadratic, by Lemma 3 the inequality above may be violated with probability at most the stated amount, in which M = r₀/C_r due to our parameter choice. Choose the constant C₀ in r₀ large enough to satisfy C₀ ≥ C. Then, with probability at least 1 − δ₀, the bound still holds after accounting for the deviation from the purely quadratic field. Under this circumstance, use ξ̂ to denote ξ/‖ξ‖. Observe that ξᵀHξ decomposes accordingly, since Hξ_{S⊥} ∈ S⊥ and Hξ_S ∈ S. Due to the ℓ-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value at most ℓ; thus we obtain the first inequality. Furthermore, according to the definition of S, we have the second inequality. Combining these two inequalities, we obtain the claimed bound on ξᵀHξ/‖ξ‖². Now we are ready to prove Proposition 1.
Proof. Without loss of generality, we assume x̃ = 0. By Lemma 4, with probability at least 1 − δ₀, the output ξ of QuantumSimulation lies in a negative curvature direction, in the quantitative sense of Lemma 4. Since we choose the one with smaller function value from {∆_t, −∆_t} to be the perturbation result, without loss of generality we can assume ⟨∇f(0), ∆_t⟩ ≤ 0. Then the function value can be expanded, with the gradient ∇f(θ∆_t) expressed via the fundamental theorem of calculus, which leads to the next bound. Here, H(ν, ∆_t) stays close to the Hessian at the saddle point due to the ρ-Hessian Lipschitz property of f, which yields the claimed decrease in function value.

Proof of Our Quantum Speedup
We now prove the following theorem using Proposition 1:
Theorem 3. For any ε, δ > 0, Algorithm 2 with parameters chosen as in Proposition 1 satisfies, with probability at least 1 − δ, that at least one half of its iterations are ε-approximate local minima, using the stated numbers of queries to U_f in (2) and of gradient queries.
Proof. For ε ≤ ℓ²/ρ, let the parameters be chosen according to (11) and (12), and set the total number of iteration steps T as specified, similar to the classical GD algorithm. We first assume that each time we apply QuantumSimulation (Algorithm 1) at some x_t with λ_min(H(x_t)) ≤ −√(ρε), we successfully obtain an output ξ with ξᵀHξ/‖ξ‖² ≤ −√(ρε)/3. The error probability of this assumption is bounded later.
Under this assumption, Algorithm 1 can be called for at most (81(f(x₀) − f*)/2)·√(ρ/ε³) ≤ T/4 times, for otherwise the total function value decrease would be greater than f(x₀) − f*, which is impossible. Then the probability that some call to Algorithm 1 fails to indicate a negative curvature direction is upper bounded by a union bound over these calls. Excluding the iterations in which QuantumSimulation is applied, we still have at least 3T/4 steps left. They are either large gradient steps, ‖∇f(x_t)‖ ≥ ε, or ε-approximate second-order stationary points. Among them, the number of large gradient steps cannot exceed T/4 because otherwise, by Lemma 13 in Appendix A.4, the function value would decrease by more than f(x₀) − f*, a contradiction. Therefore, we conclude that at least T/2 of the iterations must be ε-approximate second-order stationary points with probability at least 1 − δ.
The number of queries can be viewed as the sum of two parts: the number of queries needed for gradient descent, denoted T₁, and the number of queries needed for quantum simulation, denoted T₂. With probability at least 1 − δ, T₁ is bounded as stated. As for T₂, with probability at least 1 − δ quantum simulation is called at most 4(f(x₀) − f*)/F times, and by Lemma 2 each simulation takes Õ(T log n · log²(T²/ε)) queries to carry out. As a result, the total query complexity is T₁ + T₂, as claimed.

Perturbed Accelerated Gradient Descent with Quantum Simulation
In Theorem 3, the 1/ε² term is a bottleneck of the whole algorithm, but [48] improved the corresponding classical complexity to 1/ε^{1.75} by replacing GD with the accelerated gradient descent (AGD) of [62]. We next introduce a hybrid quantum-classical algorithm (Algorithm 3) that reflects this intuition. We make the following comparisons to [48]: • Same: When the gradient is large, we both apply AGD iteratively until we reach a point with small gradient. If the function becomes "too nonconvex" during AGD, we both reset the momentum and decide whether to exploit the negative curvature at that point.
• Difference: At a point with small gradient, we apply quantum simulation instead of the classical uniform perturbation. Speaking only in terms of n, [48] takes O(log n) steps to decrease the Hamiltonian f(x) + ‖v‖²/(2η) by Ω(1/log⁵ n) with high probability, whereas our quantum simulation for time T = O(log n) decreases the Hamiltonian by Ω(1) with high probability.
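The Hamiltonian viewpoint can be illustrated with a minimal sketch (Python/numpy; the quadratic objective, step size, and momentum parameter are illustrative choices, not the tuned values of [48]): on a convex quadratic, the AGD iteration decreases E_t = f(x_t) + ‖v_t‖²/(2η) monotonically, and escaping a saddle amounts to forcing this quantity down in the nonconvex regions as well.

```python
import numpy as np

def f(x):          # convex quadratic with smoothness constant 1
    return 0.5 * float(x @ x)

def grad(x):
    return x

eta, theta = 0.25, 0.1          # step size 1/(4*smoothness), small momentum parameter
x = np.array([1.0, -2.0])
v = np.zeros(2)

energies = []
for _ in range(50):
    energies.append(f(x) + (v @ v) / (2 * eta))   # Hamiltonian E_t
    y = x + (1 - theta) * v                        # momentum step
    x_next = y - eta * grad(y)                     # gradient step
    v = x_next - x
    x = x_next
energies.append(f(x) + (v @ v) / (2 * eta))

print(all(b <= a + 1e-12 for a, b in zip(energies, energies[1:])))
```

Even when f(x_t) itself oscillates because of the momentum, the Hamiltonian accounts for the kinetic part and decreases step by step.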
Algorithm 3: Perturbed Accelerated Gradient Descent with Quantum Simulation.
The following theorem provides the complexity of this algorithm:
Theorem 4. Let the parameters be chosen as specified, where c_A is chosen large enough to satisfy the condition in Lemma 14, and C₀ and C_r are the constants specified in Proposition 1. Then, for any δ > 0 and ε ≤ ℓ²/ρ, if we run Algorithm 3 with this choice of parameters, then with probability at least 1 − δ one of the iterations x_t will be an ε-approximate second-order stationary point, using the stated number of queries to U_f in (2) and classical gradient queries.
Proof. We use T to denote the total number of iterations, chosen as specified, where E = √(ε³/ρ)·c_A^{−7}, the same as our choice of E in Lemma 14. Similar to Proposition 1, we set the radius M of the simulation region to be r₀/C_r. We assume the contrary, i.e., that none of the iterations is an ε-approximate second-order stationary point.
Similar to our analysis in the proof of Theorem 3, we first assume that each time we apply QuantumSimulation (Algorithm 1) at some x_t with λ_min(H(x_t)) ≤ −√(ρε), we successfully obtain an output ξ with ξᵀHξ/‖ξ‖² ≤ −√(ρε)/3. The error probability of this assumption is bounded later. Under this assumption, Algorithm 1 can be called for at most (81(f(x₀) − f*)/2)·√(ρ/ε³) ≤ T/3 times, for otherwise the total function value decrease would be greater than f(x₀) − f*, which is impossible. Then the probability that some call to Algorithm 1 fails to indicate a negative curvature direction is upper bounded by a union bound over these calls. Excluding the iterations in which QuantumSimulation is applied, we still have at least 2T/3 steps left, which are all accelerated gradient descent steps.
Since ε ≤ ℓ²/ρ implies T ≥ 𝒯 (the length of the time intervals in Lemma 14), we can find at least T/(3𝒯) disjoint time periods, each of length 𝒯. By Lemma 14, during these time intervals the Hamiltonian decreases in total by at least the stated amount, which is impossible: by Lemma 15, the Hamiltonian decreases monotonically at every step where quantum simulation is not called, and the overall decrease cannot exceed f(x₀) − f*. Note that the iteration number T satisfies the stated bound. As for the number of queries, it can be viewed as the sum of two parts: the number of queries needed for accelerated gradient descent, denoted T₁, and the number of queries needed for quantum simulation, denoted T₂. With probability at least 1 − δ, T₁ is bounded as stated. For T₂, with probability at least 1 − δ quantum simulation is called at most (f(x₀) − f*)/F times, and by Lemma 2 each simulation takes Õ(T log n · log²(T²/ε)) queries to carry out. As a result, the total query complexity T₁ + T₂ is as claimed.

Remark 2. Although the theorem above only guarantees that one of the iterations is an ε-approximate second-order stationary point, such a point can easily be identified by adding a proper termination condition: once quantum simulation is called, we keep track of the point x̃ prior to the quantum simulation and compare the function value at x̃ with that of x_t after the perturbation.
If the function value decreases by at least F, then the algorithm has made progress; otherwise, with probability at least 1 − δ, x̃ is an ε-approximate second-order stationary point. Doing so adds an extra register for saving the point but does not increase the asymptotic complexity.

Gradient Descent by the Quantum Evaluation Oracle
Another important contribution of this paper is to show how to replace the classical gradient queries by quantum evaluation queries. This has been shown in the case of convex optimization [6,19], and we generalize the result to nonconvex optimization. The idea was initiated by [50]. Classically, with only an evaluation oracle, the natural way to construct a gradient oracle is to move a little along each coordinate direction and compute finite differences. Quantumly, a cleverer approach is to take the uniform superposition over a mesh around the point, query the quantum evaluation oracle (in superposition) in the phase, and apply the quantum Fourier transform (QFT). By Taylor expansion, the QFT can recover all the partial derivatives simultaneously. In this paper, we refer to Lemma 2.2 of [19] for a precise version of Jordan's algorithm: Lemma 5. Let f : Rⁿ → R be an ℓ-smooth function specified by the evaluation oracle in (2) with accuracy δ_q, i.e., it returns a value f̃(x) such that |f̃(x) − f(x)| ≤ δ_q. For any x ∈ Rⁿ, there is a quantum algorithm that uses one query to (2) and returns a vector ∇̃f(x) satisfying the stated error bound.
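The phase-plus-QFT idea can be emulated classically in one dimension (Python/numpy; the grid size N, mesh width h, and rescaling factor gamma are illustrative choices; of course, the quantum algorithm evaluates f only once, in superposition, whereas this classical emulation evaluates it at every mesh point):

```python
import numpy as np

def jordan_gradient_1d(f, x0, N=256, h=1e-4, gamma=0.01):
    """Emulate Jordan's gradient algorithm in 1D.

    Encode f into the phase of a uniform superposition over a mesh of N
    points with spacing h around x0, then read the slope off a Fourier
    peak. gamma rescales the gradient so f'(x0)/gamma lies in (-N/2, N/2).
    """
    idx = np.arange(N)
    x = x0 + (idx - N / 2) * h
    # For locally linear f the phase is linear in idx, so the DFT of psi
    # concentrates at the frequency f'(x0)/gamma.
    psi = np.exp(2j * np.pi * f(x) / (N * h * gamma)) / np.sqrt(N)
    k = int(np.argmax(np.abs(np.fft.fft(psi))))
    if k >= N / 2:                   # undo wraparound for negative slopes
        k -= N
    return gamma * k

g_est = jordan_gradient_1d(np.sin, 0.3)
print(abs(g_est - np.cos(0.3)))
```

The estimate is accurate up to the bin width gamma plus a curvature term controlled by shrinking the mesh, mirroring the role of the smoothness assumption in Lemma 5.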
The main technical contribution of this section is to replace the gradient descent steps in Section 2 by Lemma 5. We give error bounds of gradient computation steps in Section 3.1, and give the proof details of escaping from saddle points in Section 3.2.

Error Bounds of Gradient Computation Steps
We first give the following bound on gradient descent using Lemma 5:
Lemma 6. Under the parameter choices above, any specific step of the gradient descent sequence {x_t : x_{t+1} ← x_t − η∇̃f(x_t)} satisfies the stated decrease bound, where A_q = 400nδ_q stands for a constant quantifying the accuracy of the quantum algorithm.
Ideally, A_q can be made arbitrarily small given a sufficiently accurate quantum computer, by using more qubits to improve the precision δ_q.
Proof. Since f is ℓ-smooth, we have the standard descent bound. We use g(x) to denote the outcome of the quantum gradient computation; then the definition of gradient descent gives the next bound. By Lemma 5, for a fixed constant c, the value of η²‖δ[g(x_t)]‖² is smaller than c times the stated quantity with at least the stated probability. Now, we replace all the gradient queries in Algorithm 2 by quantum evaluation queries, which results in Algorithm 4. We aim to show that if the algorithm starts at x₀ and the value of the objective function does not decrease too much over iterations, then the whole iteration sequence {x_τ}_{τ=0}^t is located in a small neighborhood of x₀. Intuitively, this is a robust version of the "improve or localize" phenomenon presented in [47]: under the parameter choices above, if quantum simulation is not called during [0, t], the iterates satisfy the localization bound of the following lemma.
Proof. Observe the telescoping identity below. Using the Cauchy–Schwarz inequality, the formula can be converted into the next bound, which results in the stated estimate. Going back to the first inequality, suppose that during each step from 1 to t, the value of ‖δ[g(x_{τ−1})]‖² is smaller than the fixed constant c. By Lemma 5, this condition is satisfied with at least the stated probability, and the claimed localization bound follows.
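The noiseless core of this "improve or localize" bound, ‖x_t − x₀‖² ≤ 2ηt(f(x₀) − f(x_t)) for η = 1/ℓ, can be checked numerically; the lemma above says the bound degrades gracefully under small gradient errors. A minimal sketch (Python/numpy, with an illustrative diagonal quadratic as the objective):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = np.diag(rng.uniform(0.1, 1.0, size=n))   # PSD quadratic f(x) = x^T A x / 2
ell = float(np.max(np.diag(A)))              # smoothness constant
eta = 1.0 / ell

def f(x):
    return 0.5 * float(x @ (A @ x))

x0 = rng.normal(size=n)
x = x0.copy()
for t in range(1, 51):
    x = x - eta * (A @ x)                    # exact gradient descent step
    # improve-or-localize: total movement is controlled by function decrease
    lhs = float(np.sum((x - x0) ** 2))
    rhs = 2 * eta * t * (f(x0) - f(x))
    assert lhs <= rhs + 1e-9
print("improve-or-localize bound holds for all prefixes")
```

The proof pattern is the same as above: the per-step decrease f(x_{τ+1}) ≤ f(x_τ) − (η/2)‖∇f(x_τ)‖² is telescoped and combined with Cauchy–Schwarz over the t steps.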

Escaping from Saddle Points with Quantum Simulation and Gradient Computation
In this subsection, we prove the result below for escaping from saddle points with both quantum simulation and gradient computation. Compared to Theorem 3, it reduces classical gradient queries to the same number of quantum evaluation queries.
Theorem 5. Let f : Rⁿ → R be an ℓ-smooth, ρ-Hessian Lipschitz function. Suppose that we have the quantum evaluation oracle U_f in (2) with accuracy δ_q ≤ O(δε²/(ℓ²n⁴)). Then Algorithm 4 finds an ε-approximate local minimum satisfying (1) with probability at least 1 − δ, using the stated number of queries to U_f, under the following parameter choices, where C₀ and C_r are the constants specified in Proposition 1, x₀ is the initial point, and f* is the global minimum of f.
Note that Theorem 5 essentially shows that the perturbed gradient descent method still converges with the same asymptotic bound if there is a small error in gradient queries. This robustness of escaping from saddle points may be of independent interest.
Proof. Set δ₀ according to the parameter choice above, and let the total number of iteration steps T be as specified, similar to the classical GD algorithm. As in Proposition 1, we set the radius M of the simulation region to be r₀/C_r. First assume that each time we apply QuantumSimulation (Algorithm 1) at some x_t with λ_min(H(x_t)) ≤ −√(ρε), we successfully obtain an output ξ with ξᵀHξ/‖ξ‖² ≤ −√(ρε)/3. The error probability of this assumption is bounded later.
Under this assumption, Algorithm 1 can be called for at most (81(f(x₀) − f*)/2)·√(ρ/ε³) ≤ T/4 times, for otherwise the total function value decrease would be greater than f(x₀) − f*, which is impossible. Then the probability that some call to Algorithm 1 fails to indicate a negative curvature direction is upper bounded by a union bound over these calls. Excluding the iterations in which QuantumSimulation is applied, we still have T/2 steps left. They are either large gradient steps, ‖∇f(x_t)‖ ≥ ε, or ε-approximate second-order stationary points. Among them, for each large gradient step, by Lemma 6 the function value decreases by more than ηε²/4 with the probability guaranteed there; hence there can be at most T/4 steps with large gradients, for otherwise the total function value decrease would be greater than f(x₀) − f*, which is impossible. In summary, by the union bound we deduce that with probability at least 1 − δ, at most T/2 of the T steps follow calls to quantum simulation and at most T/4 steps have gradient norm greater than ε. As a result, the remaining T/4 steps must all be ε-approximate second-order stationary points.
The number of queries can be viewed as the sum of two parts: the number of queries needed for gradient computation, denoted T₁, and the number of queries needed for quantum simulation, denoted T₂. With probability at least 1 − δ, T₁ is bounded as stated. As for T₂, with probability at least 1 − δ quantum simulation is called at most 4(f(x₀) − f*)/F times, and by Lemma 2 each simulation takes Õ(T log n · log²(T²/ε)) queries to carry out. As a result, the total query complexity is T₁ + T₂, as claimed.

Numerical Experiments
In this section, we provide numerical results that demonstrate the power of quantum simulation for escaping from saddle points. Due to the limitations of current quantum computers, we simulate all quantum algorithms numerically on a classical computer (with a dual-core Intel Core i5 processor and 8 GB memory). Nevertheless, our numerical results strongly support the quantum speedup at small to intermediate scales. All the numerical results and plots are obtained with MATLAB 2019a.
In the first two experiments, we look at the wave packet evolution on both quadratic and non-quadratic potential fields. Before presenting the numerical results and related discussion, we briefly describe the leapfrog scheme [41], the technique we employ for the numerical integration of the Schrödinger equation. We discretize the Schrödinger equation into a linear system of ordinary differential equations (for details, see Section 2.1.1), i (d/dt) Ψ(t) = HΨ(t), where Ψ : [0, T] → C^N is a vector-valued function of time. We may decompose Ψ(t) = Q(t) + iP(t), with Q, P : [0, T] → R^N being the real and imaginary parts of Ψ, respectively. Plugging this decomposition into the ODE, we obtain a separable N-body Hamiltonian system: Q̇ = HP, Ṗ = −HQ. The integration scheme of choice for solving this Hamiltonian system is a symplectic integrator [41], and we use a second-order leapfrog integrator for separable canonical Hamiltonian systems [32] in this section. In all of our PDE simulations, we fix the spatial domain to be Ω = {(x, y) : |x| ≤ 3, |y| ≤ 3} and the mesh number to be 512 on each edge.
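The scheme can be sketched in one spatial dimension as follows (Python/numpy rather than MATLAB; the grid, time step, inverted-oscillator potential V(x) = −x²/2, and the convention i∂ₜΦ = (−½∂ₓₓ + V)Φ with Dirichlet boundaries are illustrative assumptions, not the exact setup of the 2D experiments):

```python
import numpy as np

# Discretize i dPhi/dt = H Phi with H = -(1/2) d^2/dx^2 + V(x) on [-6, 6].
N = 192
x = np.linspace(-6, 6, N)
dx = x[1] - x[0]
lap = (np.diag(np.full(N - 1, 1.0), -1) - 2 * np.eye(N)
       + np.diag(np.full(N - 1, 1.0), 1)) / dx**2        # Dirichlet Laplacian
H = -0.5 * lap + np.diag(-0.5 * x**2)                    # inverted oscillator

# Initial Gaussian wave packet with position variance sigma0_sq.
sigma0_sq = 0.25
phi0 = np.exp(-x**2 / (4 * sigma0_sq))
phi0 /= np.sqrt(np.sum(np.abs(phi0) ** 2) * dx)

# Writing Phi = Q + iP turns the equation into Qdot = H P, Pdot = -H Q;
# integrate with the symplectic leapfrog (kick-drift-kick) scheme.
Q, P = phi0.copy(), np.zeros(N)
dt, steps = 5e-4, 2000                                   # evolve to t = 1
P -= 0.5 * dt * (H @ Q)
for _ in range(steps):
    Q += dt * (H @ P)
    P -= dt * (H @ Q)
P += 0.5 * dt * (H @ Q)                                  # re-sync P with Q

prob = (Q**2 + P**2) * dx
norm = prob.sum()
var = np.sum(x**2 * prob) / norm
print(norm, var)   # norm stays ~1; the variance grows along negative curvature
```

The leapfrog step is explicit and preserves the symplectic structure, which is why the total probability stays essentially constant over long integration times while the packet visibly disperses.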

Dispersion of the Wave Packet
In Proposition 2, we showed that a centered Gaussian wave packet disperses along the negative curvature direction of the saddle point. In the numerical simulation presented in Figure 1, we take the potential function f₁(x, y) = −x²/2 + 3y²/2 and the initial wave function as described in Proposition 2 (r = 0.5). Each subplot shows the Gaussian wave packet (i.e., the modulus square of the wave function Φ(t, x)) at a specific time. The quantum evolution "squeezes" the wave packet along the x-axis: the variance of the marginal distribution on the x-axis is 0.25, 0.33, 0.68 at times t = 0, 0.5, 1, respectively. In the preceding experiment, we provided a numerical simulation of the dispersion of the Gaussian wave packet on a quadratic potential field. Next, we only require that the function is Hessian Lipschitz near the saddle point. This is enough to guarantee that the second-order Taylor series is a good approximation in a small neighborhood of the saddle point.

Quantum Simulation on Non-quadratic Potential Fields
Now, we explore the behavior of the wave packet on non-quadratic potential fields. It is worth noting that: (1) the wave packet is not necessarily Gaussian during the time evolution; (2) for practical reasons, we truncate the unbounded spatial domain R² to a bounded square Ω and assume Dirichlet boundary conditions (Φ(t, x) = 0 on ∂Ω for all t ∈ [0, T]). Nevertheless, it is still observed that the wave packet is mainly confined to the "valley" on the landscape, which corresponds to the direction of negative curvature.
We run quantum simulation (Algorithm 1) near the saddle point of two non-quadratic potential landscapes. The first one is f(x, y) = (1/12)·(···), which has a saddle point at (0, 0) and two global minima (±√3, 0); the minimal function value is −3/4. This is the landscape used in the next experiment, in which a comparison between the quantum and classical algorithms is conducted. We claim that the wave packet remains (almost) Gaussian at t_e = 1.5. This claim is confirmed by the numerical result illustrated in Figure 2. The wave packet has been "squeezed" along the x-axis, the negative curvature direction. Compared to the uniform distribution in a ball used in PGD, this "squeezed" bivariate Gaussian distribution assigns more probability mass along the x-axis, thus allowing escape from the saddle point more efficiently. The second landscape we explore is g(x, y) = x³ − y³ − 2xy + 6, whose Hessian matrix is ∇²g(x, y) = [6x, −2; −2, −6y]. It has a saddle point at (0, 0) and no global minimum. This objective function has a circular "valley" along the negative curvature direction (1, 1), and a "ridge" along the positive curvature direction (1, −1). We aim to study the long-term evolution of the Gaussian wave packet on the landscape restricted to a square region. The evolution of the wave packet is illustrated in Figure 3. On a small time scale (e.g., t = 1), the wave packet disperses down the valley on the landscape while preserving a bell shape; waves are reflected from the boundary, and an interference pattern can be observed near the upper and left edges of the square. Dispersion and interference coexist in the plot at t = 2, in which the wave packet splits into two symmetric components, each located in a lowland. Since the total energy is conserved in the quantum-mechanical system, the wave packet bounces back at t = 5, but is blurred due to wave interference.
Throughout the evolution over t ∈ [0, 5], the wave packet is confined to the valley area of the landscape (even after bouncing back from the boundary). This evidence suggests that the Gaussian wave packet is able to adapt to more complicated saddle-point geometries. Figure 3: Quantum simulation on landscape 2: g(x, y) = x³ − y³ − 2xy + 6. Parameters: r₀ = 0.5, t_e = 5.
In each subplot, a colored contour plot of the wave packet at a specific time is shown, and the landscape contour is placed on top of the wave packet for quick reference. The average runtime for this simulation is 209.95 seconds.

Comparison Between PGD and Algorithm 2
In addition to the numerical study of the evolution of wave packets, we compare the performance of the PGD algorithm [46] with Algorithm 2 on the test function f₂(x, y) = (1/12)·(···). In this experiment and the last one in this section, we only implement a mini-batch of the whole algorithm (for both classical PGD and PGD with quantum simulation). In fact, a mini-batch is good enough to demonstrate the power of quantum simulation as well as the dimension dependence of both algorithms. A mini-batch in the experiment is defined as follows: • Classical algorithm (PGD) mini-batch [following Algorithm 4 of [47]]: x₀ is uniformly sampled from the ball B₀(r) (saddle point at the origin), and then T_c gradient descent steps are run to obtain x_{T_c}. Record the function value f₂(x_{T_c}). Repeat this process M times. The resulting function values are presented in a histogram.
• Quantum algorithm mini-batch (following Algorithm 2): Run the quantum simulation with evolution time t_e to generate a multivariate Gaussian distribution centered at 0. x₀ is sampled from this multivariate Gaussian distribution; then run T_q gradient descent steps and record the function value f₂(x_{T_q}). Repeat this process M times. The resulting function values are also presented in a histogram, superposed on the results given by the classical algorithm.
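The two mini-batch procedures can be sketched on a pure quadratic saddle as follows (Python/numpy; the stand-in objective f(x, y) = −x²/2 + y²/2, the squeezed covariance, and all hyperparameters are illustrative; in the actual Algorithm 2 the anisotropic Gaussian comes from measuring the simulated wave packet rather than being sampled directly):

```python
import numpy as np

rng = np.random.default_rng(2)
M, T, eta, r = 1000, 20, 0.1, 0.5

def gd_final_value(x0):
    """Run T gradient descent steps on f(x, y) = -x^2/2 + y^2/2."""
    x, y = x0
    for _ in range(T):
        x, y = x + eta * x, y - eta * y     # grad f = (-x, y)
    return -0.5 * x**2 + 0.5 * y**2

# Classical mini-batch: uniform perturbation in the ball B_0(r).
theta = rng.uniform(0, 2 * np.pi, M)
rad = r * np.sqrt(rng.uniform(0, 1, M))
classical = [gd_final_value((rr * np.cos(th), rr * np.sin(th)))
             for rr, th in zip(rad, theta)]

# Quantum-style mini-batch: Gaussian squeezed along the negative curvature
# direction (the x-axis), as produced by the dispersing wave packet.
quantum = [gd_final_value((rng.normal(0, 1.0), rng.normal(0, 0.25)))
           for _ in range(M)]

print(np.mean(classical), np.mean(quantum))
```

Because the squeezed distribution puts far more mass on the escaping direction, the quantum-style batch reaches much lower function values in the same number of GD steps.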
The experimental results from 1000 samples are illustrated in Figure 4. Although the test function is non-quadratic, the quantum speedup is apparent. Left panel of Figure 4: two sampled gradient descent paths after the respective numbers of iterations; note that path 2 approaches the local minimum (−√3, 0), while path 1 is still far away. In PGD, paths 1 and 2 are sampled with equal probability by the uniform perturbation, whereas in Algorithm 2 the dispersion of the wave packet along the x-axis gives a much higher probability of sampling a path like path 2 (one that approaches the local minimum). Right panel: a histogram of function values f₂(x_{T_c}) (PGD) and f₂(x_{T_q}) (Algorithm 2). We set step length η = 0.05, r = 0.5 (ball radius in PGD and r₀ in Algorithm 1), M = 1000, T_c = 50, T_q = 10, t_e = 1.5. Although we run five times as many iterations in PGD, just over 70% of its gradient descent paths arrive at the neighborhood of the local minimum, while just under 70% of the paths in Algorithm 2 approach the local minimum despite the much shorter runs. The average runtime of this experiment is 0.02 seconds.

Dimension Dependence
Recall that n is the dimension of the problem. Classically, it has been shown in [47] that the PGD algorithm requires O(log⁴ n) iterations to escape from saddle points; in contrast, quantum simulation for time O(log n) suffices in our Algorithm 2 by Theorem 3. The following experiment is designed to compare this dimension dependence of PGD and Algorithm 2. We choose the test function h(x) = ½xᵀHx, where H is the n-by-n diagonal matrix H = diag(−ε, 1, 1, ..., 1). The function h(x) has a saddle point at the origin and only one negative curvature direction. Throughout the experiment, we set ε = 0.01. The other hyperparameters are: dimension n ∈ N, radius of perturbation r > 0, classical number of iterations T_c, quantum number of iterations T_q, quantum evolution time t_e, number of samples M ∈ N, and GD step size (learning rate) η. For the sake of comparison, the iteration numbers T_c and T_q are chosen so that the statistics of the classical and quantum algorithms in each category of the histogram in Figure 5 are of similar magnitude. The numerical results are illustrated in Figure 5. The dimension varies drastically from 10 to 1000, while the distribution patterns in all three subplots are similar: setting T_c = Θ(log² n) and T_q = Θ(log n), PGD with quantum simulation outperforms the classical PGD in the sense that more samples escape from the saddle point (they attain lower function values). At the same time, under this choice of parameters, the performance of the classical PGD remains comparable to that of PGD with quantum simulation, i.e., the statistics in each category are of similar magnitude. This numerical evidence might suggest that, for a generic problem, the classical PGD method in [47] needs only Θ(log² n) iterations in practice rather than the O(log⁴ n) iterations of its theoretical guarantee, while Θ(log n) evolution time still suffices quantumly.
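One source of the classical dimension dependence can be sketched directly (Python/numpy; the dimensions and sample count are illustrative): a uniform perturbation in an n-dimensional ball puts only about r²/(n + 2) expected squared mass on the single negative curvature direction, so the component that GD must amplify shrinks as n grows, costing extra iterations that scale with log n, whereas a perturbation squeezed along u₁ keeps this component of constant size.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_sq_first_coord(n, r=1.0, samples=20000):
    """Estimate E[x_1^2] for x uniform in the n-dimensional ball of radius r."""
    g = rng.normal(size=(samples, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)        # uniform on the sphere
    radii = r * rng.uniform(0, 1, samples) ** (1.0 / n)  # inverse radial CDF
    x1 = radii * g[:, 0]
    return float(np.mean(x1 ** 2))

for n in (10, 100, 1000):
    print(n, mean_sq_first_coord(n))   # close to 1/(n+2): shrinks with n
```

The exact value E[x₁²] = r²/(n + 2) follows from the symmetry of the ball, and the Monte Carlo estimates track it closely.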

A Auxiliary Lemmas
In this appendix, we collect all auxiliary lemmas that we use in the proofs.

A.1 Schrödinger Equation with a Quadratic Potential
In this subsection, we prove several results that lay the foundation of the quantum algorithm described in Section 2.

Lemma 1. Suppose a quantum particle is in a one-dimensional quadratic potential field f(x) = (λ/2)x² with initial state Φ₀(x) = (2π)^{−1/4} e^{−x²/4}; in other words, the initial position of this quantum particle follows the standard normal distribution N(0, 1). The time evolution of this particle is governed by (4). Then, at any time t ≥ 0, the position of the quantum particle still follows a normal distribution N(0, σ²(t; λ)), where the variance σ²(t; λ) is given by (5).
Proof. Due to the well-posedness of the Schrödinger equation, if we find a solution to the initial value problem (4), this solution is unique. We take the ansatz (104) with θ(0) = 0, δ(0) = √2. In this ansatz, the probability density p_λ(t, x), i.e., the modulus square of the wave function, is given by (105), where y(t) = δ²(t).
If the ansatz (104) solves the Schrödinger equation, we have conservation of probability, i.e., ‖Φ(t, ·)‖₂ = 1 for all t ≥ 0; in other words, ∫_R p_λ(t, x) dx = 1 for all t ≥ 0. It is now clear that (105) is the density of a Gaussian random variable with zero mean and variance σ²(t; λ) = 1/(2 Re(1/y(t))).
Therefore, it suffices to compute y(t) in order to obtain the distribution of the quantum particle at time t ≥ 0. For simplicity, we will not compute the global phase θ(t), as it does not affect the variance. Substituting the ansatz (104) into (4) with potential function f(x) = (λ/2)x², and introducing the change of variables y(t) = δ²(t), we obtain the system of ordinary differential equations (107). It follows that y(t) admits a closed form, and by Equation (106) the variance follows. Clearly, the sign of λ matters.
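For concreteness, under the convention i∂ₜΦ = −½∂ₓₓΦ + (λ/2)x²Φ (our assumed normalization of (4); the paper's constants may differ), the three sign cases of the variance can be written out via the Heisenberg-picture evolution x̂(t) = x̂(0)cos(√λ t) + p̂(0)sin(√λ t)/√λ, with initial position variance 1 and momentum variance 1/4:

```latex
\sigma^2(t;\lambda) =
\begin{cases}
\cos^2\!\big(\sqrt{\lambda}\,t\big) + \dfrac{\sin^2\!\big(\sqrt{\lambda}\,t\big)}{4\lambda}, & \lambda > 0,\\[1ex]
1 + \dfrac{t^2}{4}, & \lambda = 0,\\[1ex]
\cosh^2\!\big(\sqrt{-\lambda}\,t\big) + \dfrac{\sinh^2\!\big(\sqrt{-\lambda}\,t\big)}{4(-\lambda)}, & \lambda < 0,
\end{cases}
```

so the variance stays bounded for λ > 0, grows quadratically for λ = 0, and grows exponentially, like e^{2√(−λ)t}, for λ < 0; the three cases agree in the limit λ → 0.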
We immediately observe that Eq. (117) is a decoupled system. Again introducing the change of variables y_j(t) = δ_j²(t), we obtain equations precisely the same as the first equation in (107); thus the calculation of the one-dimensional case in Lemma 1 applies directly to (120). Given the ansatz (116), it is clear that the probability density of the quantum particle in Rⁿ is an n-dimensional Gaussian with mean 0 and covariance matrix diag(σ²(t; λ₁), ..., σ²(t; λ_n)). It follows from (106) and (5) that the covariance matrix is given by (115).
Finally, we state the following proposition under a different scaling. Let H be an n-by-n symmetric matrix with diagonalization H = UᵀΛU, where Λ = diag(λ₁, ..., λ_n) and U is an orthogonal matrix. Suppose a quantum particle is in an n-dimensional potential field f(x) = ½xᵀHx with an initial state such that the initial position of the particle follows the multivariate Gaussian distribution N(0, r²I). The time evolution of this particle is governed by (6). Then, at any time t ≥ 0, the position of the quantum particle still follows a multivariate Gaussian distribution N(0, r²Σ(t)), with the covariance matrix Σ(t) as stated. Throughout the discussion, we have only been concerned with the evolution of the wave packet when it happens to be centered at the saddle point. In reality, however, the exact location of the saddle point is rarely known, and the initial Gaussian wave may be slightly off the saddle point. In the following proposition, we investigate this more general situation, in which the potential function is shifted by a distance d. It turns out that the wave packet remains Gaussian with exactly the same rate of dispersion in its variance, while the mean of the Gaussian wave behaves like the trajectory of a classical particle, i.e., it is governed by the Hamiltonian mechanics Ẍ = −∇f(X). Thus, we believe the source of the quantum speedup in our algorithm is the variance dispersion along the negative curvature direction.

Proposition 3. Suppose a quantum particle is in a one-dimensional shifted quadratic potential field f(x) = (λ/2)(x − d)² with initial state Φ₀(x) = (2π)^{−1/4} e^{−x²/4}; in other words, the initial position of this quantum particle follows the standard normal distribution N(0, 1). The time evolution of this particle is governed by (4). Then, at any time t ≥ 0, the position of the quantum particle still follows a normal distribution N(μ(t; λ), σ²(t; λ)), where the mean μ(t; λ) is given below, while the variance σ²(t; λ) is exactly the same as in (5).
Proof. The main idea of the proof is the method of undetermined coefficients, as in the proof of Lemma 1, though we use a different ansatz (126) with more parameters, where a(t), b(t), and c(t) are complex-valued functions. For simplicity, the normalization constant is absorbed into the c(t) term. The probability density p_λ(t, x), i.e., the modulus square of the wave function, is then given accordingly, where A(t), B(t), and C(t) are the real parts of the functions a(t), b(t), and c(t), respectively. One readily observes that p_λ(t, x) is a Gaussian density whose mean and variance are given in (128). It follows that the distribution of the quantum particle is completely determined by the mean μ(t) and variance σ²(t), once we show that the ansatz (126) indeed solves the Schrödinger equation (4) with the potential field f(x) = (λ/2)(x − d)². Substituting the ansatz (126) into the Schrödinger equation (4), we obtain the system of ordinary differential equations (129), subject to the initial conditions a(0) = 1/4, b(0) = 0, and c(0) = −log(2π)/4. The last equation says that c(t) can be directly integrated as long as a(t) and b(t) are known. In other words, c(t) exists once a(t) and b(t) are determined, and we do not care about the exact value of c(t) because it sheds no light on either the mean μ(t; λ) or the variance σ²(t; λ). To prove the proposition, it suffices to calculate a(t) and b(t).
The first equation in the system (129) is a Riccati equation; by the change of variable a = −(i/2)(u̇/u), the Riccati equation is transformed into the second-order linear equation ü + λu = 0. Then, similarly, we discuss the three cases λ = 0, λ > 0, and λ < 0. Here, we only treat the case λ > 0, as the other two cases follow essentially the same procedure.
Before we proceed with the calculation of a(t), we discuss how the change of variable a = −(i/2)(u̇/u) simplifies the second equation in the system (129). The change of variable turns it into iḃ = 2ab − λd, and after proper algebraic manipulation we end up with a nice form whose left-hand side is simply d/dt(ub); hence the function b(t) can be expressed in terms of u(t) as in (131), where C is a constant. Now we are ready to compute both the mean and the variance for the case λ > 0. Suppose u(t) takes the form (132) for a suitable constant c; this particular choice of c gives rise to the function a(t) satisfying the initial condition a(0) = 1/4. Similarly, substituting the solution u(t) of (132) back into the formula (131) for b(t), together with the initial condition b(0) = 0, we can write down the closed form of b(t). The real parts of a(t) and b(t) can then be computed, and the mean μ(t; λ) and variance σ²(t; λ) follow from (128).

A.2 Bounding the deviation from perfect Gaussian in quantum evolution
In what follows, we use ‖·‖_p to denote the L^p-norm of an integrable function g : Ω → R for 1 ≤ p < ∞. For a continuous function g : Ω → R, the L^∞-norm is ‖g‖_∞ = sup_{x∈Ω} |g(x)|. For a finite-dimensional vector v, we simply use ‖v‖ to denote its ℓ₂-norm (the Euclidean norm). For a vector-valued function G : Ω → Rⁿ, we also define its L^p-norm for 1 ≤ p < ∞, where G_j(x) is the j-th component of the function G(x); the L^∞-norm is defined in the same manner: ‖G‖_∞ = max_{1≤j≤n} ‖G_j‖_∞. First, we prove the following vector-norm error bound for quantum simulation. Using ∫₀ᵗ dτ₁ ∫₀^{τ₁} dτ₂ = t²/2, we obtain the desired vector-norm error bound (139). Second, we observe the following fact, where C and α are absolute constants.

Remark 5.
The original Theorem 2 in [12] actually proves logarithmic growth of the Sobolev norm $\|u(t)\|_{H^s}$ for all $s > 0$, while we only cite the special case $s = 1$. The $\|\nabla u(0)\|_2$ term was absorbed into the constant factor in the original statement; we find it necessary to write it out explicitly because it may introduce dependence on $n$ and $r_0$. It is worth noting that the theorem was proven for two-dimensional Schrödinger equations with a quasi-periodic potential field $V(x,t)$, but it is made clear in the context that this result holds in arbitrary dimension when $V$ is periodic. Bourgain also explicitly discussed the periodic-$V$ case in [13].
with periodic boundary conditions and initial condition $\Phi_0(x)$ defined in (7) (i.e., the initial state of the quantum simulation Algorithm 1), then we have the following bound, where $C$ and $\alpha$ are absolute constants.
Proof. Note that the constant $F$ only adds a global phase to the solution, which influences neither $\|\Phi(t)\|_2$ nor $\|\nabla\Phi(t)\|$, and the Schrödinger equation is invariant under spatial translations of $x$; we may therefore assume without loss of generality that $f_q = \frac{1}{2}x^{\top} H x$. Define a new function $u(x,t) = \Phi\big(\frac{r_0 x}{\sqrt{2}}, t\big)$; it is straightforward to verify the equation that $u$ satisfies. Note that the function $f_q(x)$ is quadratic, so the rescaled potential $f_q\big(\frac{r_0 x}{\sqrt{2}}\big) = \frac{r_0^2}{2} f_q(x)$ is a constant multiple of $f_q$. Thus, we may directly invoke Theorem 6, where $\|\nabla u(0)\|_2$ can be calculated directly. Absorbing the $2^{-5/4}$ factor into the absolute constant $C$, we complete the proof.
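The rescaling step in the proof can be verified symbolically. The sketch below checks that any quadratic $f_q(x) = \frac{1}{2}x^{\top} H x$ satisfies $f_q(r_0 x/\sqrt{2}) = \frac{r_0^2}{2} f_q(x)$ (shown here in three dimensions with a generic symmetric $H$):

```python
import sympy as sp

x = sp.Matrix(sp.symbols('x0:3'))
H = sp.Matrix(3, 3, sp.symbols('h0:9'))
H = (H + H.T) / 2                                # symmetric Hessian
r0 = sp.symbols('r_0', positive=True)

fq = lambda y: (y.T * H * y)[0] / 2              # f_q(y) = (1/2) y^T H y
diff = sp.simplify(fq(r0 * x / sp.sqrt(2)) - (r0**2 / 2) * fq(x))
print(diff)                                      # 0: the potential rescales by r0^2/2
```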

A.4 Existing Lemmas
In this subsection, we list existing lemmas from [47,48] that we use in our proofs. First, we use the following lemma for the large-gradient scenario of the gradient descent method: Lemma 13 ([47, Lemma 19]). If $f(\cdot)$ is $\ell$-smooth and $\rho$-Hessian Lipschitz, and $\eta = 1/\ell$, then the gradient descent sequence $\{x_t\}$ satisfies the stated descent bound for any step $t$ in which quantum simulation is not called.
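For intuition, the standard descent inequality behind this kind of guarantee, $f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2$ for $\eta = 1/\ell$, can be checked numerically on an $\ell$-smooth function; the quadratic below is a hypothetical test function, not the setting of [47]:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([4.0, 3.0, 1.0, 0.5, 0.1]) @ Q.T   # f is l-smooth with l = 4
l = 4.0
eta = 1.0 / l

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(5)
for _ in range(50):
    g = grad(x)
    x_next = x - eta * g
    # descent guarantee for eta = 1/l: f drops by at least (eta/2) * ||grad||^2
    assert f(x_next) <= f(x) - (eta / 2) * (g @ g) + 1e-12
    x = x_next
print("descent inequality held on all 50 steps")
```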
The next lemmas are frequently used in the large-gradient scenario of the accelerated gradient descent method, where $\eta = \frac{1}{4\ell}$ as in Theorem 4. Note that this lemma is not exactly the same as Lemma 7 of [48]: to be more specific, they have an extra $\iota^{-5}$ term appearing in $E_\tau$. However, this term only appears when we need to escape from a saddle point using the original AGD algorithm. In large-gradient scenarios where the gradient is larger than $\epsilon$, it makes no difference to ignore this $\iota^{-5}$ term.

Lemma 15 ([48, Lemma 4 and Lemma 5]).
Assume that the function $f$ is $\ell$-smooth. Consider the setting of Theorem 4: for every iteration $\tau$ in which quantum simulation was not called, we have the stated bound, where $E_\tau$ is defined in (189) in Lemma 14.
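To illustrate the mechanism behind these lemmas, the sketch below runs the AGD iteration of [48] (momentum step, then gradient step) on a convex quadratic and checks that the Hamiltonian $E_\tau = f(x_\tau) + \frac{1}{2\eta}\|v_\tau\|^2$ never increases. The momentum parameter $\theta$ and the test function are illustrative assumptions, not the parameters of Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([1.0, 0.3, 0.01])                     # convex quadratic, smoothness l = 1
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

l = 1.0
eta, theta = 1.0 / (4 * l), 0.1                   # eta = 1/(4l); theta is illustrative
x = rng.standard_normal(3)
v = rng.standard_normal(3)
E = f(x) + np.linalg.norm(v)**2 / (2 * eta)       # the Hamiltonian E_tau
for _ in range(100):
    y = x + (1 - theta) * v                       # momentum step
    x_next = y - eta * grad(y)                    # gradient step
    v = x_next - x
    x = x_next
    E_next = f(x) + np.linalg.norm(v)**2 / (2 * eta)
    assert E_next <= E + 1e-12                    # no NCE needed: E is non-increasing
    E = E_next
print("Hamiltonian was non-increasing over 100 iterations")
```

On a convex function no negative curvature is ever encountered, so the decrease comes entirely from the first mechanism described below; the NCE branch is never needed here.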
The correctness of the two lemmas above is guaranteed by two mechanisms. If the function does not have large negative curvature between $x_t$ and $y_t$ in the current iteration, AGD simply makes the Hamiltonian decrease efficiently. Otherwise, the Negative-Curvature-Exploitation procedure in Line 11 of Algorithm 3 is triggered (the same as in [48]) and decreases the Hamiltonian, either by finding the minimum function value in a neighborhood of $x_t$ when $v_t$ is small, or by directly resetting $v_t = 0$ when it is large.
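A minimal sketch of this branching, following the description above (the step size $s$ and the toy function are hypothetical, and this is not the exact routine of Algorithm 3):

```python
import numpy as np

def nce(f, x, v, s):
    """Sketch of Negative-Curvature-Exploitation: reset large momentum, or
    probe +/- s along the momentum direction and keep the best point."""
    if np.linalg.norm(v) >= s:
        return x, np.zeros_like(v)                 # large v_t: just reset it
    d = s * v / max(np.linalg.norm(v), 1e-12)      # small v_t: probe both directions
    x_best = min([x, x + d, x - d], key=f)
    return x_best, np.zeros_like(v)

# toy saddle: f(x) = x_0^2 - x_1^2 has negative curvature along the second axis
f = lambda x: x[0]**2 - x[1]**2
x0 = np.array([0.0, 0.0])
v0 = np.array([0.0, 1e-3])                         # small momentum along that axis
x1, v1 = nce(f, x0, v0, s=0.1)
print(f(x1) < f(x0), np.all(v1 == 0))              # True True: the probe made progress
```

When the momentum aligns with a negative-curvature direction, either probe $x_0 \pm d$ strictly decreases $f$, which is exactly how the procedure guarantees Hamiltonian decrease at a saddle.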