On Quantum Speedups for Nonconvex Optimization via Quantum Tunneling Walks

Classical algorithms are often not effective for solving nonconvex optimization problems where local minima are separated by high barriers. In this paper, we explore possible quantum speedups for nonconvex optimization by leveraging the global effect of quantum tunneling. Specifically, we introduce a quantum algorithm termed the quantum tunneling walk (QTW) and apply it to nonconvex problems where local minima are approximately global minima. We show that QTW achieves quantum speedup over classical stochastic gradient descents (SGD) when the barriers between different local minima are high but thin and the minima are flat. Based on this observation, we construct a specific double-well landscape, where classical algorithms cannot efficiently hit one target well knowing the other well but QTW can when given proper initial states near the known well. Finally, we corroborate our findings with numerical experiments.


Introduction
Nonconvex optimization plays a central role in machine learning because the training of many modern machine learning models, especially those from deep learning, requires optimization of nonconvex loss functions. Among algorithms for solving nonconvex optimization problems, stochastic gradient descent (SGD) and its variants, such as Adam [KB15], Adagrad [DHS11], etc., are widely used in practice. In theory, their provable guarantee has been studied from various perspectives.
In this paper, we adopt the perspective of studying gradient descents via the analysis of their behavior in continuous-time limits as differential equations, following a recent line of work in [SBC16, WWJ16, Jor18, SDJS21]. In particular, given a function $f\colon \mathbb{R}^d \to \mathbb{R}$, the SGD iteration $x_{k+1} = x_k - s\nabla f(x_k) - s\xi_k$ with learning rate $s$ and $k$th-step noise $\xi_k$ can be approximated by a stochastic differential equation (SDE) as follows:
$$dX = -\nabla f(X)\,dt + \sqrt{s}\,dW, \tag{1}$$
where $W$ is a standard Brownian motion. Such an approach enjoys clear intuition from physics. In particular, Eq. (1) is essentially a non-equilibrium thermodynamic process: gradient descent provides driving forces, the stochastic term serves as thermal motion, and the combination of these two ingredients enables convergence to the thermal distribution, also known as the Gibbs distribution. A systematic study of Eq. (1) was conducted in a recent work by [SSJ20]. See more details in Section 2.2.
Nevertheless, algorithms based on gradient descents also have limitations because they only have access to local information about the function. This causes fundamental difficulties on landscapes with intricate local structures such as vanishing gradients [Hoc98], nonsmoothness [KS21], negative curvature [CB21], etc. In terms of optimization, we are mostly interested in points with zero gradients, which can be categorized as saddle points, local optima, and global optima. It is known that variants of SGD can escape from saddle points [GHJY15, JGN+17, AZL18, FLLZ18, FLZ19, JNG+21, ZL21], but one of the most prominent issues in nonconvex optimization is to escape from local minima and reach global minima.
Up to now, theoretical guarantees for escaping from local minima by SGD have only been known for some special nonconvex functions [KLY18]. In general, SGD has to climb over high barriers in landscapes to reach global minima, and this is typically intractable using only gradients that descend the function. In all, fundamentally different ideas, especially those that explore beyond local information, are needed to derive better algorithms for nonconvex optimization in general.
This paper aims to study nonconvex optimization via dynamics from quantum mechanics, which can leverage global information about a function $f\colon \mathbb{R}^d \to \mathbb{R}$. The fundamental rule in quantum mechanics is the Schrödinger equation:
$$i\,\frac{\partial}{\partial t}\Phi(t,x) = \left(-h^2\Delta + f(x)\right)\Phi(t,x), \tag{2}$$
where $i$ is the imaginary unit, $h$ is defined as the quantum learning rate, $\Delta = \sum_{j=1}^{d} \frac{\partial^2}{\partial x_j^2}$ is the Laplacian, and $\Phi(t,x)\colon \mathbb{R}\times\mathbb{R}^d \to \mathbb{C}$ is a quantum wave function satisfying $\int_{\mathbb{R}^d} |\Phi(t,x)|^2\,dx = 1$ for any $t$. When the wave function is measured at time $t$, $|\Phi(t,x)|^2$ is the probability density of finding the particle at position $x$. In Eq. (2), the time evolution of wave functions is governed by the Hamiltonian $H := -h^2\Delta + f$, where $-h^2\Delta$ corresponds to the kinetic energy and $f$ to the potential energy.
In sharp contrast to classical particles, quantum wave functions can tunnel through high potential barriers with significant probability, and this is formally known as quantum tunneling. Take the one-dimensional double-well potential $f\colon \mathbb{R}\to\mathbb{R}$ in Figure 1 as an example, where the goal is to move from the local minimum $x_-$ in the left region to the local minimum $x_+$ in the right region. Classically, the SDE in (1) has to climb over the barrier of height $H_f$, and it can take $\exp(\Theta(H_f/s))$ time to reach $x_+$ (see Section 3.4 of [SSJ20]).
As a result, after time $t$ where $|E_0 - E_1|\,t = \pi$, we have $\Phi(t,x) \propto \Phi_+(x)$ localized near $x_+$. Intuitively, this can be viewed as global evolution and superposition of quantum states, which is capable of acquiring global information about the function $f$. This explains why, for various choices of $f$, the quantum evolution time $t$ is much shorter than its classical counterpart under the SDE, which only takes gradients locally.
It is a natural idea to design quantum algorithms using quantum tunneling. Previously, [FGS+94, MAL16, CH16, BZ18] studied the phenomenon of quantum tunneling in quantum annealing algorithms [FGS+94, FGG+01]. However, most of these results studied Boolean functions, which are essentially different from continuous optimization. In addition, quantum annealing focused on ground state preparation instead of the dynamics of quantum tunneling. Up to now, it is in general unclear when we can design quantum algorithms for optimization by adopting quantum tunneling. Therefore, we ask:
Question 1.1. On what kind of landscapes can we design algorithms efficiently using quantum tunneling?
To answer this question, we need to figure out the specifications of the quantum algorithm, such as the initialization of the quantum wave packet, the landscape's parameters, the measurement strategy, etc. The next question is to understand the advantage of quantum algorithms based on quantum tunneling. A main reason for studying quantum computing is that it can solve various problems with significant speedups compared to classical state-of-the-art algorithms. In optimization, prior quantum algorithms have been devoted to semidefinite programs [BS17, AGGW17, AG19, BKL+19], convex optimization [AGGW20, CCLW20], escaping from saddle points [ZLL21], polynomial optimization [RSW+19, LWG+21], finding negative curvature directions [ZHLT19], etc., but quantum algorithms for nonconvex optimization with provable guarantees in general are widely open as far as we know. Here we ask:
Question 1.2. When do algorithms based on quantum tunneling give rise to quantum speedups?
Contributions. We systematically study quantum algorithms based on quantum tunneling for a wide range of nonconvex optimization problems. Throughout the paper, we consider benign nonconvex landscapes where local minima are (approximately) global minima. We point out that many common nonconvex optimization problems indeed yield objective functions with such benign behavior, for example tensor decomposition [GHJY15, GM20], matrix completion [GLM16, MWCC18], and dictionary learning [QZL+19]. In general, nonconvex problems with discrete symmetry satisfy this assumption; see the surveys by [Ma21, ZQW21].
In this paper, we demonstrate the power of quantum computing for the following main problem: Main Problem. On a landscape whose local minima are (approximately) global minima, starting from one local minimum, find all local minima with similar function values or find a certain target minimum.
Such a problem is crucial for understanding the generalization property of nonconvex landscapes, and in general it also sheds light on nonconvex optimization. First, local minima with similar function values can have dramatically different generalization performance (see Section 6.2.3 of [Sun19]), and solving the Main Problem can be viewed as a subsequent step of optimization for finding the minimum which generalizes the best. Second, the Main Problem implies the mode connectivity of landscapes, which has been applied to understanding the loss surfaces of various machine learning models including neural networks both empirically [
Landscapes whose local minima are (approximately) global significantly facilitate quantum tunneling. Roughly speaking, since the total energy during our quantum evolution (2) is conserved, quantum tunneling can only efficiently send a state from one minimum to another minimum with a similar function value. Consequently, if the quantum wave function is initialized near a local minimum, we can focus on quantum tunneling between the local ground states of the wells, i.e., the tunneling of the particle from the bottom of one well to that of another. To avoid complicated discussions on the value of the quantum learning rate $h$, we further restrict ourselves to functions whose local minima are global, which does not sacrifice intuition. Now, an answer to Question 1.1 can be given as follows:
Theorem 1.1 (Quantum tunneling walks, informal). On landscapes whose local minima are global minima, we have an algorithm called quantum tunneling walks (QTW) which initiates the simulation of Eq. (2) from the local ground state at a minimum, and measures the position at a time chosen uniformly from $[0, \tau]$. To solve the Main Problem we can take
$$\tau = O\!\left(\mathrm{poly}(N)/\Delta E\right), \tag{6}$$
where $N$ is the number of global minima and $\Delta E$ is the minimal spectral gap of the Hamiltonian restricted to a low-energy subspace.
For sufficiently small h, we have where b, S 0 > 0 are constants that depend only on f .
Formal description of the QTW can be found in Section 3.2. Here we highlight two important properties of QTW: Quantum mixing time and quantum hitting time.
Quantum mixing time (Lemma 3.1 in Section 3.3). Since quantum evolutions are unitary, QTW never converges, a fundamental distinction from SGD. Therefore, to study the mixing properties of QTW, we follow quantum walk literature [CCD + 03] by employing the measurement strategy, where we measure at t uniformly chosen from [0, τ ]. The measured results obey a distribution which is a function of τ , and when τ → +∞, the distribution tends to its limit, µ QTW . Quantum mixing time is the minimal τ enabling us to sample from µ QTW up to some small error. Alternatively speaking, the mixing time evaluates how fast the distribution yielded by QTW converges. We prove that µ QTW concentrates near minima, so that sampling from µ QTW repeatedly can give positions of all minima. In addition, µ QTW gives the upper bound on τ in (6).
Quantum hitting time (Lemma 3.5 in Section 3.4). Hitting time is the duration it takes to hit a target region (usually a neighborhood of some minimum). Quantum hitting time is the minimum evolution time needed for hitting the region of interest once. Despite this straightforward intuition, the formal definition of quantum hitting time is very different from that of classical hitting time. Intuitively, repeatedly sampling from µ QTW can ensure the hitting of neighborhoods of particular minima, and thus we can use the mixing time to bound the hitting time. In short, to solve the Main Problem, we bound the quantum mixing and hitting time to obtain Theorem 1.1.
The minimal spectral gap ∆E in Theorem 1.1 is calculated in SI Appendix A.2.3. The quantity S 0 is called the minimal Agmon distance between different wells, formally defined in Definition 2.5, which is related to both the height and width of potential barriers. The smaller h is, the closer the measured results are to the minima (i.e., the more accurate QTW is), but the longer evolution time the Schrödinger equation takes.
As an application of Theorem 1.1 and a justification of the practicability of QTW, we show how to use QTW to solve the orthogonal tensor decomposition problem. This problem asks to find all orthogonal components of a tensor. After transforming into a single optimization problem [CLDA09,Hyv99], the aim is to find all global minima. We present below a bound on the time cost of QTW on decomposing fourth-order tensors and details can be found in Section 3.5.
Proposition 1.1 (Tensor decomposition, informal version of Proposition 3.1). Let $d$ be the dimension of the components of the fourth-order tensor $T \in \mathbb{R}^{d^4}$ satisfying (81), $\delta$ be the expected risk yielded by the limit distribution $\mu_{\mathrm{QTW}}$, and $\epsilon$ be the maximum error between $\mu_{\mathrm{QTW}}$ and the actually obtained distribution (quantified by the $L^1$ norm). For sufficiently small $\epsilon$ and sufficiently small $\delta$, the total time $T_{\mathrm{tot}}$ for finding all orthogonal components of $T$ by QTW satisfies the bound (8).
Next, we explore the advantages of the quantum tunneling mechanism by comparing QTW with SGD and describing landscapes where QTW outperforms SGD. The time cost for SGD to converge to global minima is given in (9), following [SSJ20]. Here, $s$ is the step size or learning rate of SGD. The constants $a > 0$ and $H_f > 0$ depend only on $f$. Interestingly, the running times of QTW and SGD have a similar form. In (7) and (9), there are exponential terms $e^{S_0/h}$ and $e^{2H_f/s}$, respectively. Intuitively, the quantity $H_f$ is the characteristic height of the potential barriers, and the quantity $S_0$ depends not only on the height but also on the width of the potential barriers. For the one-dimensional example in Figure 1, an explicit expression can be derived (proof details are given in Section 3.1). Other terms in the bounds, $\mathrm{poly}(N)/\Delta E$ and $1/\lambda_s$, are referred to as polynomial coefficients. We make the following comparisons:
• Regarding the exponential terms $S_0$ and $H_f$, tall barriers mean that $H_f$ is large, whereas if the barriers are thin enough, $S_0$ can still be small. This is consistent with the long-standing intuition that tall and thin barriers are easy for tunneling but difficult for climbing [CH16].
• Regarding the polynomial coefficients, they are mainly influenced by the distribution or relative positions of the wells. We observe that a symmetric distribution of wells, which lets (the local ground state in) any one well interact with (the local ground states in) the other wells, may reduce the running time of QTW but has no explicit impact on SGD.
• Flatness of wells is another important factor that influences the running time of both QTW and SGD. We propose standards for comparison (see Section 4.1), which study their running times when reaching the same accuracy $\delta$. As with $h$, a smaller learning rate $s$ permits more accurate outputs but makes SGD more time consuming. For sufficiently flat minima, $h$ is larger than $s$, leading to a smaller running time for QTW.
In summary, we illustrate the above observations in Figure 2 and conclude the following:
Main Message (Advantages of the quantum tunneling mechanism, a summary of Section 4.1 and Section 4.2). On landscapes whose local minima are global minima, QTW outperforms SGD on solving the Main Problem if the barriers of the landscape $f$ are high but thin, the wells are distributed symmetrically, and the global minima are flat.
Remark 1.1. As is indicated above, we compare the costs of QTW and SGD under the same accuracy δ. We introduce two definitions of accuracy in Section 4.1: Standard 4.1 concerns the expected risk, and Standard 4.2 concerns the expected distance to some minima. Mathematically, Standard 4.1 and Standard 4.2 establish a relationship between the quantum and classical learning rates h and s, respectively, enabling direct comparisons.
Having introduced the general performance of the quantum tunneling walk, we further investigate Question 1.2 in some specific scenarios of the Main Problem. We focus on the comparison between query complexities, namely the classical query complexity to local information and the quantum query complexity to the evaluation oracle
$$U_f\big(|x\rangle \otimes |0\rangle\big) = |x\rangle \otimes |f(x)\rangle \quad \text{for all } x \in \mathbb{R}^d.$$
This is the standard assumption in the existing literature on quantum optimization algorithms [AGGW20, CCLW20, ZLL21]. Note that $U_f$ here is a unitary transformation and it allows superposed input states, i.e., for a state $\sum_{j=1}^{m} c_j |x_j\rangle$ where $m \in \mathbb{N}$, $c_j \in \mathbb{C}$, and $\sum_{j=1}^{m} |c_j|^2 = 1$,
$$U_f\Big(\sum_{j=1}^{m} c_j |x_j\rangle \otimes |0\rangle\Big) = \sum_{j=1}^{m} c_j |x_j\rangle \otimes |f(x_j)\rangle.$$
If we measure this output state, with probability $|c_j|^2$ we obtain $f(x_j)$. The inputs can be drawn not only from a discrete set but also from a continuous set. Different from classical queries that only learn local information about the landscape of $f$, quantum evaluation queries are essentially nonlocal as they can extract information about $f$ at different locations in superposition. Based on this fundamental difference, we are able to prove that QTW can solve a variant of the Main Problem with exponentially fewer queries than any classical counterpart:
Theorem 1.2. For any dimension $d$, there exists a landscape $f\colon \mathbb{R}^d \to \mathbb{R}$ whose local minima are global minima and on which, with high probability, QTW can hit the neighborhood of an unknown global minimum from the local ground state associated with a known minimum using a number of queries polynomial in $d$, while no classical algorithm knowing the same minimum can hit the same target region with a number of queries subexponential in $d$.
Details of Theorem 1.2 are presented in Section 4.3. Following an idea similar to [JLGJ18], our construction relies on locally non-informative regions. The main structures of the constructed landscape, which has two global minima, are illustrated in Figure 3. $W_-$ and $W_+$ are two symmetric wells, each containing one global minimum, $B_v$ is a plateau connecting $W_-$ and $W_+$, and the remaining region forms a much higher plateau. The region $W_+$ is our target. We show that the landscape satisfies the following properties:
• $S_v$, which is a band $\{x \mid x \in B, |x \cdot v| \le w\}$ with $w$ a constant and $v$ a unit vector, occupies a dominating fraction of the measure of the ball $B$.
• In S v , local queries (see Definition 2.1) do not reveal information about the direction v.
• Local queries outside B do not reveal information about the region inside B.
Restricted to the ball $B$, the first two properties make it intractable for classical algorithms to escape from $S_v$, and thus they cannot hit $W_+$ efficiently. The last property ensures that, even without being restricted to $B$, classical algorithms are still unable to hit $W_+$ efficiently. See Section 4.3.1 for details. Nevertheless, quantum tunneling can be efficient if we carefully design the function values and the parameter $h$. The design of the parameters should establish the following main conditions (see Section 4.3.2 for details):
• The wave function always concentrates in $W_-$ or $W_+$.
• The quantum learning rate h is small such that our theory based on semi-classical analysis is valid.
• Quantum tunneling from W − to W + is always easy (can happen within time polynomial in d).
Organization. Section 2 introduces our assumptions and problem settings, both classical and quantum. In Section 3, we explore QTW in detail and state the formal version of Theorem 1.1. This includes a one-dimensional example, the formal definition of QTW, the mixing and hitting times of QTW, and the example on tensor decomposition (Proposition 1.1). Section 4 covers detailed quantum-classical comparisons. First, we introduce fair criteria for the comparison. Second, we illustrate the advantages of quantum tunneling and give a detailed view of our Main Message. Third, we prove Theorem 1.2. We corroborate our findings with numerical experiments in Section 5. Finally, the paper is concluded with discussions in Section 6.

Notations
Throughout this paper, the space we consider is either $\mathbb{R}^d$ or a $d$-dimensional smooth compact Riemannian manifold denoted $M$. Bold lower-case letters $\mathbf{x}, \mathbf{y}, \ldots$ are used to denote vectors. If there is no ambiguity, we use normal lower-case letters $x, y, \ldots$ to denote these vectors for simplicity. Depending on the context, $dx$ may refer to either the line differential or the volume differential. We use $A_{jj'}$ to denote the element of the matrix $A$ in row $j$ and column $j'$. Conversely, given all matrix elements $A_{jj'}$, we use the notation $(A_{jj'})$ to denote the matrix. Unless otherwise specified, $\|\cdot\|$ is used to denote the $\ell_2$ norm of vectors, the spectral norm of matrices, and the $L^2$ norm of functions. Similarly, $\|\cdot\|_1$ is used to denote the $\ell_1$ norm of vectors and the $L^1$ norm of functions.
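For a function $f$, $\nabla f$ and $\nabla^2 f$ denote the gradient vector and Hessian matrix, respectively. $\Delta := \sum_{j=1}^{d} \frac{\partial^2}{\partial x_j^2}$ denotes the Laplacian. Asymptotic notations such as $O(\cdot)$, $o(\cdot)$, $\Omega(\cdot)$, and $\Theta(\cdot)$ follow common definitions. We also write $f \ll g$ if $f = o(g)$, and $f \sim g$ if $f = \Theta(g)$. The $\tilde{O}$ notation omits poly-logarithmic terms, namely, $\tilde{O}(f) := O(f\,\mathrm{poly}(\log f))$ (in this paper, $\log$ denotes the logarithm with base 2 and $\ln$ denotes the natural logarithm with base $e$). We write $f = O(g^\infty)$ if $f = O(g^N)$ for every $N > 0$. Throughout the paper, we write $f \approx g$ if when g = 0. When $g \to 0$, $f \to 0$ means $f = o(1)$ or, to stress the dependence, $f = o_g(1)$.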
For quantum mechanics, we use the Dirac notation throughout the paper. Quantum states are vectors of unit norm in a Hilbert space. Let $|\phi\rangle$ denote a state vector, and $\langle\phi| = (|\phi\rangle)^\dagger$ denote the dual vector, which equals its conjugate transpose. The inner product of two states can be written as $\langle\psi|\phi\rangle$. In the coordinate representation, for each $x \in \mathbb{R}^d$ we have the wave function $\phi(x) := \langle x|\phi\rangle$, where $|x\rangle$ denotes the state localized at $x$. More basics on quantum mechanics and quantum computing can be found in standard textbooks, for instance [NC10].

Classical preparations
Classical algorithms only have access to local information about the objective function at different sites, which is formalized as follows:
Definition 2.1 (Algorithms based on local queries). Denote a sequence of $T$ points and corresponding queries by $\{x_i, q(x_i)\}_{i=1}^{T}$, where each $q(x_i)$ can include the function value and derivatives of arbitrary order (if they exist). Algorithms based on local queries are those which determine the $j$th point $x_j$ from $\{x_i, q(x_i)\}_{i=1}^{j-1}$.
As an example, the classical algorithm SGD can be mathematically described by
Definition 2.2 (Discrete model of SGD). Given a function $f(x)$, starting from an initial point $x_0$, SGD updates the iterates according to
$$x_{k+1} = x_k - s\nabla f(x_k) - s\xi_k,$$
where $s$ is the learning rate and $\xi_k$ is the noise term at the $k$th step.
The local information used by SGD consists of gradients. Since $s$ is small, if we define the time $t_k = ks$, the points $\{x_k\}$ can be approximated by points on a smooth curve $\{X(t_k)\}$. The curve, which can be regarded as the continuous-time limit of discrete SGD, is determined by a learning-rate-dependent stochastic differential equation (lr-dependent SDE):
$$dX = -\nabla f(X)\,dt + \sqrt{s}\,dW, \tag{16}$$
where $W$ is a standard Brownian motion.
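The solution of (16), $X(t)$, is a stochastic process whose probability density $\rho_{\mathrm{SGD}}(t,\cdot)$ evolves according to the Fokker-Planck-Smoluchowski equation
$$\frac{\partial \rho_{\mathrm{SGD}}}{\partial t} = \nabla\cdot\left(\rho_{\mathrm{SGD}}\nabla f\right) + \frac{s}{2}\,\Delta\rho_{\mathrm{SGD}}.$$
The validity of this SDE approximation has been discussed and verified in previous literature [KY03, CS18, SSJ20, LMA21].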
where ∆ s f is called the Witten-Laplacian, more specifically, Let δ s,1 be the smallest non-zero eigenvalue of ∆ s f , the following convergence guarantee for SGD holds: Proposition 2.1 (Part of Theorem 2.8 in [Mic19] and Lemma 5.5 in [SSJ20]). Under Assumption 2.1, 2.2, and 2.3, for sufficiently small s, where the smallest positive eigenvalues of the Witten-Laplacian ∆ s f associated with f satisfies Here, H f and γ 1 are constants depending only on the function f .
That is, the convergence time of SGD is loosely O(s/δ s,1 ) whose magnitude is largely related to H f . The constant H f is called the Morse saddle barrier, characterizing the largest height of barriers. Rigorous results about eigenvalues of the Witten-Laplacian are reviewed in SI Appendix A.1.
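The correspondence between discrete SGD and its continuous-time limit can be made concrete with a small numerical sketch. The block below assumes that the lr-dependent SDE has the form dX = -∇f(X) dt + √s dW, as in [SSJ20], and that the SGD noise ξ_k is standard Gaussian; the double-well objective and the function names are hypothetical and chosen only for illustration. Under these assumptions, one Euler-Maruyama step of the SDE with time step dt = s reproduces one SGD step exactly.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative double well f(x) = (x**2 - 1)**2 / 4 (an assumption)
    return x * (x**2 - 1.0)

def sgd_path(x0, s, xi):
    # Discrete SGD: x_{k+1} = x_k - s * grad_f(x_k) - s * xi_k
    x = np.empty(xi.size + 1)
    x[0] = x0
    for k in range(xi.size):
        x[k + 1] = x[k] - s * grad_f(x[k]) - s * xi[k]
    return x

def euler_maruyama_path(x0, s, xi):
    # Euler-Maruyama for dX = -grad_f(X) dt + sqrt(s) dW with time step dt = s.
    # The Brownian increments are built from -xi, which has the same law as xi.
    dt = s
    dW = -np.sqrt(dt) * xi
    x = np.empty(xi.size + 1)
    x[0] = x0
    for k in range(xi.size):
        x[k + 1] = x[k] - grad_f(x[k]) * dt + np.sqrt(s) * dW[k]
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(500)
path_sgd = sgd_path(-1.0, 0.01, noise)
path_sde = euler_maruyama_path(-1.0, 0.01, noise)
```

With t_k = ks, the two paths agree step by step, which is one way to see why the time scale of the SDE is measured in units of the learning rate.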

Quantum preparations
Our quantum algorithmic method essentially relies on the simulation of the Schrödinger equation
$$i\hbar\,\frac{\partial}{\partial t}|\Phi(t)\rangle = H\,|\Phi(t)\rangle,$$
where $i$ is the imaginary unit, $\hbar$ is the Planck constant, and $H$ is the Hamiltonian. Physically, we simulate a particle of mass $m$ moving under a potential function $f$. Then, in the coordinate representation, the Schrödinger equation is specified as
$$i\hbar\,\frac{\partial}{\partial t}\Phi(t,x) = \left(-\frac{\hbar^2}{2m}\Delta + f(x)\right)\Phi(t,x).$$
Throughout this paper, we set $h = \hbar/\sqrt{2m}$. The spectrum of the Hamiltonian depends strongly on the variable $h$. More interestingly, by comparing (23) and (29), $h$ plays a role similar to that of the learning rate $s$. Thus we refer to $h$ as the quantum learning rate. In reality, $\hbar$ is a fixed constant. However, since we are simulating quantum evolution on quantum computers, properly rescaling the simulation can equivalently be seen as varying the value of $\hbar$. The value of $\hbar$ affects the evolution time needed. However, rescaling has no impact on quantum query complexity. Therefore, in this paper, we treat $\hbar$ as an unimportant constant and set $\hbar = 1$.
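As a rough illustration of what simulating this evolution means on a classical grid, the following sketch discretizes H = -h²Δ + f in one dimension by second-order finite differences and evolves a wave packet exactly in the eigenbasis of the discretized Hamiltonian (with ħ = 1). The potential, grid, and initial Gaussian packet are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def hamiltonian_1d(f_vals, dx, h):
    # Finite-difference approximation of H = -h^2 d^2/dx^2 + f with Dirichlet ends
    n = f_vals.size
    main = 2.0 * h**2 / dx**2 + f_vals
    off = -h**2 / dx**2 * np.ones(n - 1)
    return np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

def evolve(H, psi0, t):
    # Solve i dPsi/dt = H Psi exactly in the eigenbasis of the real symmetric matrix H
    E, V = np.linalg.eigh(H)
    coeffs = V.T @ psi0
    return V @ (np.exp(-1j * E * t) * coeffs)

x = np.linspace(-3.0, 3.0, 301)
dx = x[1] - x[0]
f_vals = (x**2 - 1.0)**2                      # illustrative double well, minima at +-1
psi0 = np.exp(-(x + 1.0)**2 / 0.2)            # packet centered in the left well
psi0 = psi0 / np.sqrt(np.sum(np.abs(psi0)**2) * dx)

psi_t = evolve(hamiltonian_1d(f_vals, dx, h=0.2), psi0, t=5.0)
norm_t = np.sum(np.abs(psi_t)**2) * dx        # total probability after evolution
```

Because the evolution is unitary, the total probability stays equal to one; only its spatial distribution changes.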
As introduced above, we consider quantum tunneling from the bottom of one well to that of another, in other words, tunneling between local ground states. A local ground state of a well is the local eigenstate of the well with minimum eigenvalue. Technically, several kinds of local eigenstates are defined (see Definition A.7, Definition A.8, and Definition A.10 in SI Appendix A.2.3). Despite the number of definitions, the different kinds of local eigenstates are close to each other and share the same intuition: they are eigenstates of the Hamiltonian restricted to regions that contain only one well. For convenience, unless otherwise specified, local eigenstates stand for the orthonormalized eigenstates defined by Definition A.10. There are also tunneling effects between local excited states. However, excited states are difficult to approximate accurately for general landscapes. Besides, due to interference, tunneling effects between different local excited states may cancel each other out. We restrict our attention to tunneling between local ground states in order to obtain explicit results along with a clear physical picture.
Two local ground states can interact strongly with each other only if the difference between their energies is small relative to h [Hel88] (see also discussions after Proposition A.4 in SI Appendix A.2.3). In other words, this requires the function values between two local minima to be close and there is little resonance between the first (local) excited state in one well and the (local) ground state of the other [Ras12,SCC91]. Therefore, our algorithms based on tunneling between local ground states are essentially restricted on landscapes where local minima are approximately global minima. Note that we can always find small enough h to make two local ground states nonresonant, if the corresponding local minima are not exactly equal. As a result, to avoid more complicated restrictions on h, without loss of generality we assume that local minima are global minima, and they all have function value 0. More precisely: In addition, f has finite number of local minima, and they can be decomposed as follows: Each U j is called a well.
This assumption does not change the explicit form of the convergence time, nor does it reduce the physical insight. To further characterize distances on such landscapes, an important geometric tool we use is the Agmon distance.
Definition 2.5 (Agmon distance). Under Assumption 2.4, the Agmon distance $d(x, y)$ is defined as
$$d(x, y) := \inf_{\gamma} \int_{\gamma} \sqrt{f(z)}\,dz,$$
where $\gamma$ ranges over piecewise $C^1$ paths connecting $x$ and $y$. For a set $U$, $d(x, U) = d(U, x) := \inf_{y \in U} d(x, y)$. For two sets $U_1$ and $U_2$, $d(U_1, U_2) = \inf_{x \in U_1, y \in U_2} d(x, y)$.
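To see how the Agmon distance separates height from width, consider the following one-dimensional sketch. It assumes the standard form of the Agmon distance with min f = 0, namely the infimum over paths of the line integral of √f; in one dimension the straight segment attains the infimum, so the distance reduces to an ordinary integral. The quartic double wells and the function names are hypothetical examples.

```python
import numpy as np

def agmon_distance_1d(f, x, y, n=20001):
    # In 1D, d(x, y) = | integral from x to y of sqrt(f(u)) du | (trapezoidal rule)
    u = np.linspace(x, y, n)
    vals = np.sqrt(np.maximum(f(u), 0.0))
    return abs(np.sum((vals[1:] + vals[:-1]) * np.diff(u)) / 2.0)

def double_well(a, height):
    # f(u) = c * (u^2 - a^2)^2 with minima at +-a and barrier height f(0) = c * a^4
    c = height / a**4
    return lambda u: c * (u**2 - a**2)**2

wide_low = double_well(a=1.0, height=1.0)     # low but wide barrier
thin_tall = double_well(a=0.25, height=4.0)   # four times taller, four times thinner

d_wide = agmon_distance_1d(wide_low, -1.0, 1.0)
d_thin = agmon_distance_1d(thin_tall, -0.25, 0.25)
```

For this family the integral evaluates to (4/3)·a·√height, so the taller barrier has the smaller Agmon distance (about 0.67 versus about 1.33), which matches the intuition that tall but thin barriers favor tunneling.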
The minimal Agmon distance between wells is defined as
$$S_0 := \min_{j \neq k} d(U_j, U_k).$$
We only consider resonant wells by assuming the following for simplicity:
Assumption 2.5 (Informal). The difference between the energies of any two local ground states is of the order $O(h^\infty)$. In addition, for any well $U_j$, there exists another well $U_k$ such that $d(U_j, U_k) = S_0$.
At last, to obtain explicit results we demand the following:
Assumption 2.6. There are a finite number of paths of Agmon length $S_0$ connecting $U_j$ and $U_k$ if $d(U_j, U_k) = S_0$.
Assumptions 2.4, 2.5, and 2.6 are simplified from Assumptions A.5, A.6, and A.7 in SI Appendix A.2, which presents the details needed for analyzing quantum tunneling. Under Assumptions 2.4, 2.5, and 2.6, we state the main results of SI Appendix A.2 as follows. For sufficiently small $h$, the orthonormalized local ground states $|e_j\rangle$, $j = 1, \ldots, N$, almost localize near the wells $U_j$, $j = 1, \ldots, N$, respectively. The space $F$ spanned by $\{|e_j\rangle : j = 1, \ldots, N\}$ is exactly a low-energy invariant subspace of the Hamiltonian $H$. In other words, in the low-energy space $F$, the particle walks between wells by quantum tunneling. The Hamiltonian restricted to $F$, i.e., $H|_F$, determines the strength of the quantum tunneling effect and is called the interaction matrix.
To explore $H|_F$, we use the WKB method to estimate local ground states (SI Appendix A.2.1). Any local ground state function decays exponentially with respect to the Agmon distance to its corresponding well (SI Appendix A.2.2). Consequently, the tunneling effects decay exponentially with respect to $S_0$. Having captured these properties, SI Appendix A.2.3 gives explicit estimates of $H|_F$, namely, Proposition A.5, Proposition A.6 (with $N_+ = N$), and Theorem A.1.
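The exponential sensitivity of tunneling to h can be observed numerically. The sketch below is an illustration under assumptions rather than the paper's estimate: it discretizes H = -h²Δ + f for a hypothetical one-dimensional double well f(x) = (x² - 1)² and reports the gap between the two lowest eigenvalues for a few values of h.

```python
import numpy as np

def spectral_gap(h, n=801, L=3.0):
    # Gap E_1 - E_0 of the finite-difference Hamiltonian for f(x) = (x^2 - 1)^2
    x = np.linspace(-L, L, n)
    dx = x[1] - x[0]
    f_vals = (x**2 - 1.0)**2
    main = 2.0 * h**2 / dx**2 + f_vals
    off = -h**2 / dx**2 * np.ones(n - 1)
    H = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    E = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
    return E[1] - E[0]

gaps = {h: spectral_gap(h) for h in (0.3, 0.2, 0.15)}
```

The gap shrinks quickly as h decreases, which is the qualitative behavior expected when tunneling decays exponentially in S_0/h. Smaller h gives sharper localization near the minima at the price of slower tunneling.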
Finally, we restate a more formal version of our Main Problem: Main Problem (restated). Given an objective function f that satisfies Assumption 2.4, 2.5, and 2.6, starting from one local minimum, find all local minima or find a certain target minimum.
We make the following remarks for clarification: Remark 2.1. Assumptions in Section 2.2 and Section 2.3 are not contradictory. When considering SGD, we naturally add to the Main Problem (restated) that the assumptions in Section 2.2 should also be satisfied.
Remark 2.2. The assumption of starting from one local minimum enables quantum algorithms to prepare a local ground state, or more generally, a state largely in the aforementioned subspace F.

Remark 2.3. Because finding a precise global minimum is impractical in general, it suffices to find points sufficiently close to the minima of interest. Later in Section 4.1, we use two different measures of accuracy: 1. the function value difference; and 2. the distance to one of the minima.

Quantum Tunneling Walks
In this section, we present full details of the quantum tunneling walk (QTW). We begin with a onedimensional example in Section 3.1, and then in Section 3.2 we formally define QTW. In Section 3.3 and Section 3.4, we study the mixing and hitting time of QTW, respectively. As an example, we give full details of applying QTW to tensor decomposition in Section 3.5.

A one-dimensional example
We start the introduction of the quantum algorithm QTW with a one-dimensional example which quantifies the intuitions provided in Section 1. Section 3.1 also serves as a map connecting each step of the analysis to the needed mathematical tools. Later sections can be seen as generalizing results here for high dimensional and multi-well cases. General descriptions and results begin at Section 3.2.
Consider the potential f (x) in Figure 1, which has two global minima, x ± = ±a. For simplicity, we take where is a small number. In this way, the potential satisfies that min f = 0, and f (x) is quadratic near minima. Besides, the symmetry of wells demands can always be made to be smooth.
Near the two minima, $\pm a$, whose local harmonic frequencies are $\omega$, we can solve the Schrödinger equation locally and get two local ground states, $\Phi_\pm(x)$. For instance, around $-a$, if we set $y = x + a$, the local ground state is determined by
$$\left(-h^2\frac{d^2}{dy^2} + \frac{1}{2}\omega^2 y^2\right)\Phi_-(y) = \varepsilon_0\,\Phi_-(y),$$
where $\varepsilon_0 = h\omega/\sqrt{2}$. Physically, the demand of localization is equivalent to $\varepsilon_0 \ll f(0)$, indicating that the particle can hardly pass over the energy barrier. Concrete mathematical definitions and discussions of local ground states can be found in SI Appendix A.2.2.
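The value ε_0 = hω/√2 can be checked numerically. The sketch below assumes the local potential is exactly quadratic, f ≈ ½ω²y², and computes the lowest eigenvalue of a finite-difference discretization of -h² d²/dy² + ½ω²y²; the grid parameters are arbitrary choices for the example.

```python
import numpy as np

def harmonic_ground_energy(h, omega, L=8.0, n=1201):
    # Lowest eigenvalue of -h^2 d^2/dy^2 + (1/2) omega^2 y^2 on [-L, L]
    y = np.linspace(-L, L, n)
    dy = y[1] - y[0]
    main = 2.0 * h**2 / dy**2 + 0.5 * omega**2 * y**2
    off = -h**2 / dy**2 * np.ones(n - 1)
    H = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(H)[0]

h, omega = 0.5, 1.0
eps0_numeric = harmonic_ground_energy(h, omega)
eps0_formula = h * omega / np.sqrt(2.0)
```

The same value follows analytically from the Gaussian ansatz exp(-αy²/2): matching the y² terms gives α = ω/(√2 h), and the remaining constant term gives the energy h²α = hω/√2.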
From a high-level perspective, the main idea of the present paper is to utilize the interaction or tunneling between local ground states to realize algorithmic speedups. As we want to investigate the evolution of states, we need to determine the relationship between local ground states and the eigenstates of the Hamiltonian $H$. We denote the eigenstates of $H$ by $|n\rangle$, $n = 0, 1, \ldots$, with energies $E_0 \le E_1 \le \cdots$, respectively (i.e., $|0\rangle$ is the global ground state, $|1\rangle$ is the first excited state, etc.). The overlap of the states $\Phi_\pm$ is small (namely, $\langle\Phi_+|\Phi_-\rangle \approx 0$), as they are local and separated by a high barrier. Denote the subspace spanned by $\Phi_-$ and $\Phi_+$ by $E$, and that spanned by $|0\rangle$ and $|1\rangle$ by $F$. Both $E$ and $F$ are 2-dimensional and contain states with low energies. It is intuitive that $E \approx F$ (which is guaranteed by Proposition A.3). For the one-dimensional case, we just take $E = F$ for simplicity. In this way, we can represent $|0\rangle$ and $|1\rangle$ by $|\Phi_\pm\rangle$ in the following general way:
$$|0\rangle = \cos\theta\,|\Phi_-\rangle + \sin\theta\,|\Phi_+\rangle, \qquad |1\rangle = \sin\theta\,|\Phi_-\rangle - \cos\theta\,|\Phi_+\rangle.$$
Restricted to the subspace $F$, the two-level system Hamiltonian can be written as
$$H|_F = \begin{pmatrix} \varepsilon_- & -\nu \\ -\nu & \varepsilon_+ \end{pmatrix},$$
where $\nu$ is called the tunneling amplitude, measuring the interaction between wells. Because the $f(x)$ we choose is symmetric, $\varepsilon_- = \varepsilon_+ = \varepsilon_0 = h\omega/\sqrt{2}$. Therefore, we have $\theta = \pi/4$ and the energy gap $\Delta E := E_1 - E_0 = 2\nu$. We will refer to this Hamiltonian restricted to the subspace $F$ as the interaction matrix, indicating that $H|_F$ characterizes the interaction between wells.
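The two-level picture can be checked directly. The sketch below assumes a symmetric interaction matrix with diagonal entries ε_0 and off-diagonal entries -ν (the sign convention is an assumption, chosen so that the ground state is the symmetric combination of the two local ground states); the numerical values are arbitrary.

```python
import numpy as np

def interaction_matrix(eps_minus, eps_plus, nu):
    # Two-level Hamiltonian in the basis (|Phi_->, |Phi_+>); sign convention assumed
    return np.array([[eps_minus, -nu], [-nu, eps_plus]], dtype=float)

eps0, nu = 0.5, 0.01
E, V = np.linalg.eigh(interaction_matrix(eps0, eps0, nu))

gap = E[1] - E[0]                          # energy gap Delta E
ground = V[:, 0] * np.sign(V[0, 0])        # fix the overall sign of the eigenvector
```

For equal diagonal entries the gap equals 2ν and the ground state has equal weight on both wells, which corresponds to a mixing angle of π/4.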
In our setting, we can begin at a local minimum, where the local ground state is easy to prepare (see justifications in SI Appendix B.2). Without loss of generality, let us begin the quantum simulation at the state $\Phi_-$, namely, setting $|\Phi(0)\rangle = |\Phi_-\rangle$. After evolving for time $t$, the state becomes $|\Phi(t)\rangle = e^{-iHt}|\Phi_-\rangle = \frac{1}{\sqrt{2}}\left(e^{-iE_0 t}|0\rangle - e^{-iE_1 t}|1\rangle\right)$, and the probabilities of finding the particle in the right and left wells are given by $P_+(t) = \sin^2(\Delta E\, t/2)$ and $P_-(t) = \cos^2(\Delta E\, t/2)$. The energy gap $\Delta E$ is also called the Rabi oscillation frequency, suggesting that the particle oscillates between the two wells periodically. Our aim is to pass through the barrier and find other local minima (in the case here, to find the other local minimum). Since, for small $h$, the local state $|\Phi_+\rangle$ is distributed in a very convex region near the right local minimum, it suffices to measure the position of the state $|\Phi(t)\rangle$ when $P_+(t)$ is large. However, we may not know $\Delta E$ precisely in advance. So, we apply the method of quantum walks: evolve the system for a time $t$ chosen uniformly at random from $[0, \tau]$, and then measure the position [CCD+03]. The resulting distribution is $\rho_{\mathrm{QTW}}(\tau, x) = \frac{1}{\tau}\int_0^\tau |\langle x | \Phi(t)\rangle|^2\, dt$, where QTW denotes quantum tunneling walk. Define $p_{-\to\pm}(\tau) = \frac{1}{\tau}\int_0^\tau P_\pm(t)\, dt$, which are the probabilities of finding the right and left local ground states, respectively. Since $|\langle x | \Phi_-\rangle|^2$ is small for $x$ near $+a$, the probability of finding the particle near $+a$ is determined by $p_{-\to+}(\tau)$. Therefore, it suffices to study the properties of $p_{-\to\pm}$. In the present case, $p_{-\to\pm}(\tau) = \frac{1}{2}\left(1 \mp \frac{\sin(\Delta E\,\tau)}{\Delta E\,\tau}\right)$, which converges as $\tau \to \infty$. This fact ensures that we can find the right local minimum $+a$ with probability larger than some constant after evolving the system for a sufficiently long time.
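The Rabi oscillation described above can be checked numerically. The following is a minimal sketch, assuming the symmetric two-level interaction matrix has diagonal entries `eps0` and off-diagonal entries `-nu` (the sign convention and the function names are illustrative assumptions, not taken from the paper). It evolves the left local ground state and compares the time-averaged probability of the right well with the closed form $\frac{1}{2}(1 - \sin(\Delta E\,\tau)/(\Delta E\,\tau))$.

```python
import numpy as np

def two_level_probabilities(eps0, nu, t):
    """Return (P_minus, P_plus) at time t for H_F = [[eps0, -nu], [-nu, eps0]],
    starting in the left local ground state (1, 0). Assumes i d/dt psi = H psi."""
    h_f = np.array([[eps0, -nu], [-nu, eps0]], dtype=float)
    energies, vecs = np.linalg.eigh(h_f)
    psi0 = np.array([1.0, 0.0], dtype=complex)
    coeffs = vecs.conj().T @ psi0                      # expand in eigenbasis
    psi_t = vecs @ (np.exp(-1j * energies * t) * coeffs)
    return abs(psi_t[0]) ** 2, abs(psi_t[1]) ** 2

def averaged_right_probability(delta_e, tau):
    """Closed form of (1/tau) * int_0^tau sin^2(delta_e * t / 2) dt."""
    return 0.5 * (1.0 - np.sin(delta_e * tau) / (delta_e * tau))

eps0, nu = 1.0, 0.1          # illustrative values; the gap is delta_e = 2 * nu
delta_e = 2.0 * nu
print(two_level_probabilities(eps0, nu, np.pi / delta_e))  # full transfer to the right well
print(averaged_right_probability(delta_e, 50.0))
```

Measuring at the half period $t = \pi/\Delta E$ would give the right well with certainty, but this requires knowing $\Delta E$; the time average only needs $\tau$ to be large compared with $1/\Delta E$.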
Starting from $|\Phi_-\rangle$, the hitting time for the right well is $T_{\mathrm{hit}} := \inf_{\tau > 0} \tau / p_{-\to+}(\tau)$. Since the probability of successful tunneling in one trial is $p_{-\to+}(\tau)$, we can repeat the trial $1/p_{-\to+}(\tau)$ times to secure one success, and the total evolution time is $\tau / p_{-\to+}(\tau)$. For sufficiently small $\epsilon \ll 1$, if $\frac{1}{\Delta E\,\tau} \le \epsilon$, we get $p_{-\to+}(\tau) \ge \frac{1}{2}(1 - \epsilon)$. Therefore, the hitting time can be bounded by $O(\frac{1}{\Delta E})$. As will be shown later, if we want to find all local minima, the mixing time is a better quantifier. The limiting distribution is $\mu_{\mathrm{QTW}}(x) = \lim_{\tau\to\infty} \rho_{\mathrm{QTW}}(\tau, x)$. The mixing time measures how fast the distribution $\rho_{\mathrm{QTW}}(\tau, x)$ converges to $\mu_{\mathrm{QTW}}(x)$. We define $T_{\mathrm{mix}}$ as the $\epsilon$-close mixing time, after which $\rho_{\mathrm{QTW}}(\tau, \cdot)$ stays $\epsilon$-close to $\mu_{\mathrm{QTW}}$ (see Definition 3.1). For simulating a time-independent Hamiltonian, the number of queries needed is roughly proportional to the total evolution time (as demonstrated in Section 2.3; see details in SI Appendix B.1). The major task left is to calculate the energy gap $\Delta E$, obtain the different evolution times, and compare them with classical results.
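To make the $O(1/\Delta E)$ scaling of the hitting time concrete, here is a small sketch that minimizes $\tau / p_{-\to+}(\tau)$ on a grid, assuming the two-well closed form $p_{-\to+}(\tau) = \frac{1}{2}(1 - \sin(\Delta E\,\tau)/(\Delta E\,\tau))$. The function name and grid parameters are illustrative choices.

```python
import numpy as np

def hitting_time_estimate(delta_e, n_grid=20000, x_max=20.0):
    """Grid estimate of inf_tau tau / p(tau) for the symmetric double well.
    The grid is laid out in the dimensionless variable x = delta_e * tau."""
    xs = np.linspace(1e-3, x_max, n_grid)
    taus = xs / delta_e
    p_right = 0.5 * (1.0 - np.sin(xs) / xs)    # time-averaged success probability
    return np.min(taus / p_right)

for gap in (0.1, 0.01, 0.001):
    t_hit = hitting_time_estimate(gap)
    print(gap, t_hit, t_hit * gap)             # the product stays constant
```

The product of the estimate and the gap does not depend on the gap, which is the numerical counterpart of the bound $O(1/\Delta E)$.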
As is specified in SI Appendix A.2.3, 0 is the boundary of the two wells and the tunneling amplitude $\nu$ can be given by To obtain an explicit result, we need to use the WKB approximation of the local ground states (SI Appendix A.2.1): where $a_0(x)$ is given by which is determined by the transport equation (Eq. (S24) in SI Appendix A.2.1) and the normalization condition. Substituting (51) and (52) into (50), we get where the constants $C$ and $S_0$ are given by Note that if $f(x)$ is quadratic, the factor $C$ will be zero, indicating that $C$ measures the deviation of the landscape from being quadratic. The quantity $S_0$ is called the Agmon distance between the two local minima (see SI Appendix A.2.2). Since we assumed by (36) that $f(x)$ is almost quadratic, we have $f(0) \approx \frac{1}{2}\omega^2 a^2$ and As discussed above, the mixing time and hitting time can be bounded by the following characteristic time Next, we need to find how long it takes for SGD to escape from the left local minimum. Discrete-time SGD with a small learning rate $s$ can be approximated by a learning-rate dependent stochastic differential equation (lr-dependent SDE) [SSJ20]: $dX = -\nabla f(X)\,dt + \sqrt{s}\,dW$, where $W$ is a standard Brownian motion. Before hitting 0, the SDE is almost an Ornstein-Uhlenbeck process: $dX = -\omega^2 (X + a)\,dt + \sqrt{s}\,dW$. The expected time for the Ornstein-Uhlenbeck process to first hit 0 is where $H_f$ is called the Morse saddle barrier and equals $f(0) - f(-a) \approx \frac{1}{2}\omega^2 a^2$ in our case. Although $\mathbb{E}T_0$ is not a precise classical counterpart of either the quantum mixing time or the quantum hitting time, it is instructive to compare $\mathbb{E}T_0$ with $T_c$, since both reflect the time needed to escape from the left well. The forms of $\mathbb{E}T_0$ and $T_c$ are very similar. Two major differences can be observed: 1. the exponential term in $\mathbb{E}T_0$ is determined by the height of the barrier, while that in $T_c$ is related to an integral of $\sqrt{f}$; 2. $T_c \propto 1/\omega^{3/2}$ but $\mathbb{E}T_0 \propto 1/\omega^3$, indicating that the flatness of the wells affects quantum and classical methods differently.
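The two exponents that drive the comparison, $S_0/h$ for tunneling and $2H_f/s$ for thermal climbing, can be evaluated for a toy landscape. The sketch below assumes the piecewise quadratic double well $f(x) = \frac{1}{2}\omega^2(|x| - a)^2$ (an illustrative stand-in with a cusp at 0, whereas the paper smooths the potential there) and takes the Agmon distance to be $S_0 = \int_{-a}^{a}\sqrt{f(x)}\,dx$. All function names and parameter values are assumptions made for illustration.

```python
import numpy as np

def double_well(x, omega, a):
    """Piecewise quadratic double well with minima at +a and -a (toy model)."""
    return 0.5 * omega**2 * (np.abs(x) - a) ** 2

def agmon_distance(omega, a, n_cells=2000):
    """Midpoint-rule estimate of the integral of sqrt(f) between the two minima."""
    edges = np.linspace(-a, a, n_cells + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return np.sum(np.sqrt(double_well(mids, omega, a))) * (2.0 * a / n_cells)

def barrier_height(omega, a):
    """Morse saddle barrier H_f = f(0) - f(-a)."""
    return double_well(0.0, omega, a) - double_well(-a, omega, a)

omega, a, h, s = 1.0, 2.0, 0.1, 0.1   # illustrative parameters
s0 = agmon_distance(omega, a)
quantum_exponent = s0 / h                                  # enters as exp(S_0 / h)
classical_exponent = 2.0 * barrier_height(omega, a) / s    # enters as exp(2 H_f / s)
print(s0, quantum_exponent, classical_exponent)
```

For this toy well $S_0 = \omega a^2/\sqrt{2}$ and $H_f = \frac{1}{2}\omega^2 a^2$, so the integral of $\sqrt{f}$ grows linearly in $\omega$ while the barrier height grows quadratically.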
We will show that for landscapes with multiple wells, the distribution of wells is also an important factor. In general, QTW could be faster than SGD if the barriers between local minima are high but thin, each well is close to many other wells, and wells are flat.
The above comparison is intuitive but not rigorous. Two technical points must be settled before a quantitative discussion. The first is the link between evolution time and query complexity: it is shown in Section 2.3 that a super-polynomial separation between the evolution times of QTW and SGD gives rise to a super-polynomial separation between the quantum and classical query complexities of QTW and SGD, respectively. Therefore, it suffices to compare the evolution times, especially the exponential terms $e^{S_0/h}$ and $e^{2H_f/s}$. The second is that $h$ and $s$ are not constants but variables. The evolution times cannot be quantitatively compared if $h$ and $s$ are independent. In Section 4.1, we develop two natural standards for making fair comparisons between QTW and SGD, which specify $h$ and $s$.

Definition of quantum tunneling walks
We now formally describe the model of a quantum tunneling walk (QTW) on a general objective function $f(x)$ satisfying the assumptions in Section 2.3. The wells are denoted by $U_j = \{x_j\}$ ($j = 1, \ldots, N$). Let $|j\rangle$ be the corresponding orthonormalized local ground state of $U_j$. It is ensured that $\{|j\rangle : j = 1, \ldots, N\}$ spans a low-energy subspace, $\mathcal{F}$, of the Hamiltonian $H = -h^2\Delta + f(x)$.
If one has information about one well $U_j$ and its neighborhood, it should be easy to construct a local ground state that is close to $|j\rangle$, or at least lies almost in the subspace $\mathcal{F}$. We assume the initial state $|\Phi(0)\rangle$ to be in $\mathcal{F}$. The evolution is determined by the Schrödinger equation, $i\frac{\partial}{\partial t}|\Phi(t)\rangle = H|\Phi(t)\rangle$, where $|\langle x|\Phi(t)\rangle|^2$ is the probability density of finding the walker at $x$. The Schrödinger equation exhibits the phenomenon of quantum tunneling since it can be rewritten as $i\frac{d}{dt}\langle j|\Phi(t)\rangle = \sum_{j'} \langle j|H|_{\mathcal{F}}|j'\rangle \langle j'|\Phi(t)\rangle$ (61), given that $|\Phi(0)\rangle$ is in the subspace $\mathcal{F}$. Here, $(\langle j|H|_{\mathcal{F}}|j'\rangle)$ is called the interaction matrix and is calculated by Propositions A.5 and A.6. Once we have $\langle j|\Phi(t)\rangle$ and $\langle x|j\rangle$ for all $j$, we can obtain the probability density $|\langle x|\Phi(t)\rangle|^2$. The overlap $\langle x|j\rangle$ does not depend on $t$. So, we may focus on (61) to investigate the time evolution. As shown by (61), restricted to the low-energy subspace $\mathcal{F}$, the quantum evolution is similar to that of a quantum walk on a graph. The wells correspond to the vertices of the graph, and the quantum tunneling effects between wells determine the graph connectivity. QTW walks among different wells by quantum tunneling, which helps to find all other local minima.
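The reduced dynamics in the low-energy subspace amounts to a linear ODE for the amplitudes $c_j(t) = \langle j|\Phi(t)\rangle$, which can be solved by diagonalizing the interaction matrix. A minimal sketch follows, with a hypothetical three-well interaction matrix whose entries are chosen only for illustration.

```python
import numpy as np

def evolve_amplitudes(h_f, c0, t):
    """Solve i dc/dt = h_f c for a Hermitian matrix h_f via its eigendecomposition."""
    energies, vecs = np.linalg.eigh(h_f)
    return vecs @ (np.exp(-1j * energies * t) * (vecs.conj().T @ c0))

# Hypothetical three-well chain: equal local energies, nearest-neighbour tunneling.
mu, w = 1.0, -0.05
h_f = np.array([[mu, w, 0.0],
                [w, mu, w],
                [0.0, w, mu]])
c0 = np.array([1.0, 0.0, 0.0], dtype=complex)   # walker starts in the first well

probs = np.abs(evolve_amplitudes(h_f, c0, t=40.0)) ** 2
print(probs, probs.sum())                        # well occupations, summing to 1
```

Each well plays the role of a vertex and each nonzero off-diagonal entry the role of an edge, which is the graph-walk picture described above.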
Finally, according to Lemma B.2 in SI Appendix B.1, the quantum query complexity of simulating the Schrödinger equation is directly linked to the evolution time t and is bounded by where Ω is a large region containing all minima of interest and is the precision quantified by the L 2 norm between the target and the obtained wave functions. Loosely speaking, we needÕ(t) quantum queries if the evolution time is t. For SGD, the number of queries needed is at least Ω(t/s) for time t. Thus, as long as there is a super-polynomial separation between QTW and SGD evolution time, there is a super-polynomial separation between quantum queries and classical queries for QTW and SGD, respectively. Conclusions on speedups are essentially based on comparisons of query complexity. However, based on this relationship between evolution time and query complexity, we can focus on comparisons of time.
To sum up, QTW is quantum simulation with the system Hamiltonian being H = −h 2 ∆ + f (x) and the initial state being in a low energy subspace of H, where f (x) is the potential function of a type of benign landscapes (Assumption 2.4, 2.5, and 2.6). QTW can be efficiently implemented on quantum computers.

Mixing time
For a given landscape f (x), the complexity of Hamiltonian simulation mainly depends on the evolution time (see SI Appendix B.1). In this section, we focus on the evolution time needed for fulfilling the tasks of finding all minima. Since quantum evolutions are unitary, different from SGD, QTW never converges. In this case, after running QTW for some time t, we measure the position of the walker. Similar to the quantum walks in [CCD + 03], the evolution time t can be chosen uniformly in [0, τ ]. Later, we will prove that under sufficiently large τ , QTW can find other wells with probability larger than some constant. Note that there can be better strategies to determine the time for measurement t than uniformly sampling from an interval [0, τ ] [AC21]. For simplicity, we only analyze the original strategy of [CCD + 03] in the present paper.
As mentioned earlier, we initialize at a state $|\Phi(0)\rangle \in \mathcal{F}$. In later sections, we may specify $|\Phi(0)\rangle$ to be one of the local ground states. Let the spectral decomposition of $H|_{\mathcal{F}}$ be $H|_{\mathcal{F}} = \sum_{k=1}^{N} E_k |E_k\rangle\langle E_k|$.
Simulating the system for a time $t$ chosen uniformly in $[0, \tau]$, one obtains the probability density of finding the walker at $x$: $\rho_{\mathrm{QTW}}(\tau, x) = \frac{1}{\tau}\int_0^\tau |\langle x|\Phi(t)\rangle|^2\, dt$. The time-averaged probability density leads to a limiting distribution when $\tau \to \infty$: $\mu_{\mathrm{QTW}}(x) = \lim_{\tau\to\infty}\rho_{\mathrm{QTW}}(\tau, x)$. With the strategy of measuring at a time $t$ chosen randomly from $[0, \tau]$, QTW outputs a distribution $\rho_{\mathrm{QTW}}(\tau, x)$ that has a limit. Such a process is regarded as mixing. The quantum mixing time evaluates how fast $\rho_{\mathrm{QTW}}(\tau, x)$ converges to $\mu_{\mathrm{QTW}}(x)$, and is rigorously defined as: Definition 3.1 (Mixing time of QTW). $T_{\mathrm{mix}}$ is called the $\epsilon$-close mixing time, iff for any $\tau \ge T_{\mathrm{mix}}$, we have The following lemma provides a general bound for the QTW mixing time, whose proof is postponed to SI Appendix C.1.1.
and this implies where $\Delta E := \min_{E_k \ne E_{k'}} |E_k - E_{k'}|$ is referred to as the minimal gap of $H|_{\mathcal{F}}$.
The term $O(h^\infty)$ in (67) originates from the integrals $\int |\langle x|j\rangle\langle j'|x\rangle|\,dx$, $j \ne j'$. Intuitively, the states $|j\rangle$ and $|j'\rangle$ localize in different wells, such that $\langle x|j\rangle\langle j'|x\rangle$ is exponentially small with respect to $h$ for any $x$. Lemma 3.1 highlights the dependence of the mixing time on the initial state $|\Phi(0)\rangle$ and the eigenvalue gaps of $H|_{\mathcal{F}}$. Concrete examples will be given in Section 4, where we further illustrate (67) and compare QTW with SGD.
As mentioned, (61) indicates a quantum walk: a well $U_j$ can be seen as a vertex of a graph, and $H|_{\mathcal{F}}$ encodes the graph connectivity (interaction between wells) similarly to the graph Laplacian. The connection between QTW and quantum walks helps to simplify the physical picture of QTW. However, we also point out the difference between quantum walks and QTW. For quantum walks, we only consider the probabilities of finding the walker at vertices, that is, $p(\tau, j) = \frac{1}{\tau}\int_0^\tau |\langle j|\Phi(t)\rangle|^2\, dt$. When $\tau \to \infty$, $p(\tau, j)$ also converges to a limit $p(\infty, j)$. The following results, Lemma 3.2 and Lemma 3.3, show the connection and difference between QTW and quantum walks in a more quantitative way (detailed proofs can be found in SI Appendix C.1.2 and SI Appendix C.1.3).
Lemma 3.2 (Limit distributions). Limit distributions of the QTW and the quantum walk satisfy the following: Definition 3.2 (Mixing time of quantum walks [CLR20]). t mix is called the -close mixing time of the quantum walk, iff for any τ ≥ t mix ,

Lemma 3.3 (Upper bound for the mixing time of quantum walks). The condition (71) is satisfied if
and we have By Lemma 3.2 and the comparison of Lemma 3.1 and Lemma 3.3, we know that for sufficiently small $h$, which implies that local ground states localize sufficiently near their respective wells, QTW can be well characterized by a quantum walk. At a higher level, QTW generalizes quantum walks from walking on discrete graphs to propagating on continuous functions. Moreover, QTW may exhibit new phenomena not shown in quantum walks when the states $|j\rangle$ are poorly localized near $U_j$. On the other hand, QTW under proper conditions can be used to implement quantum walks.

Hitting time
If we aim at finding one particular well (the one with global minimum or the one with the best generalization properties), hitting time instead of mixing time should be of interest. Classically, the hitting time is the expected time required to find some target region or point. For quantum algorithms, we cannot output the position of the walker at all times and the system state would be destroyed by measuring its position. Thus, the definition of the hitting time for quantum algorithms is slightly different. We first see how previous literature defines the quantum walk hitting time: .
To understand this definition, we first refer to the process of evolving the system for a time $t$ chosen uniformly from $[0, \tau]$ as one trial. In one trial, the probability of getting $|j\rangle$ is $p(\tau, j)$. So, repeating the trial $1/p(\tau, j)$ times hits $|j\rangle$ with high probability. In this case, the total evolution time needed is bounded by $\tau/p(\tau, j)$. In the same spirit, we can define and bound the QTW hitting time as follows: Definition 3.4 (Hitting time of QTW). For QTW governed by (61), let the open and $C^2$-bounded region $\Omega$ be the region of interest. Then, starting from the initial state $|\Phi(0)\rangle$, the $\Omega$-hitting time of QTW is defined as follows: Basic results about the hitting time (Lemma 3.4 and Lemma 3.5) are presented as follows. The proof of Lemma 3.4 is in SI Appendix C.2.1 and that of Lemma 3.5 is in SI Appendix C.2.2.
Lemma 3.4 (Upper bound of the quantum walk hitting time). The probability of finding $|j\rangle$ can be bounded as follows As a result, for any $\epsilon < p(\infty, j)$, setting $\tau = 2/(\epsilon\,\Delta E)$, we have If $\epsilon$ in (77) is small enough, $\tau = 2/(\epsilon\,\Delta E)$ permits a good mixing and we may write which suggests that we are using the mixing time to bound the hitting time. For any $\epsilon < \int_{\Omega_j} \mu_{\mathrm{QTW}}(x)\,dx$, let $\tau = 2(1 + |O(h^\infty)|)/(\epsilon\,\Delta E)$; we have The upper bounds we have obtained on the mixing and hitting times are still not explicit, as $H|_{\mathcal{F}}$ is not given. Next, we establish relationships between an objective landscape, the corresponding interaction matrix $H|_{\mathcal{F}}$, and the time cost of the QTW algorithm. This is a main task of the present paper. In later sections, we identify the major geometric properties that affect $H|_{\mathcal{F}}$ on specific landscapes.

Application: Tensor decomposition
After defining QTW and studying its mixing and hitting times, we now use QTW to solve a practical problem, orthogonal tensor decomposition, which is a central problem in learning many latent variable models [AGH+14]. Specifically, we consider a fourth-order tensor $T \in \mathbb{R}^{d^4}$ that has an orthogonal decomposition: $T = \sum_{j=1}^{d} a_j^{\otimes 4}$, where the components $\{a_j\}$ form an orthonormal basis of a $d$-dimensional space ($a_i^\top a_j = \delta_{ij}$). The goal of orthogonal tensor decomposition is to find all components $\{a_j\}$.
Figure 4: The landscape given by (82) for dimension $d = 3$: local minima $a_1$, $a_2$, and $-a_3$ are highlighted by red points (•), and the corresponding labels $\alpha = (j(\alpha), \pi_\alpha)$, namely $(1,+)$, $(2,+)$, and $(3,-)$, are shown.
Following previous popular methods [CLDA09, Hyv99], we try to find all components by a single optimization problem. Concretely, we consider the following landscape [FJK96]: Without loss of generality, we work in the coordinate system specified by $\{a_j\}_{j=1}^{d}$. In particular, let $u = \sum_{j=1}^{d} x_j a_j$ and $x = (x_1, \ldots, x_d)$. Later, we also use $a_j$ to denote the vector $(0, \ldots, 1, \ldots, 0)$ whose only nonzero coordinate, with value 1, appears in the $j$th entry. $f(x)$ has $2d$ local minima $\pm a_1, \ldots, \pm a_d$ uniformly distributed on the unit sphere $S^{d-1} \subset \mathbb{R}^d$. Therefore, finding all the minima solves the orthogonal tensor decomposition problem.
The problem of tensor decomposition has notable symmetries. As a result, the objective function $f$ in (82) is nonconvex, and we can apply QTW to such a landscape. We use the pair $\alpha = (j, \pi_\alpha)$ to denote the local minima and the corresponding wells, where $j = j(\alpha)$ (i.e., $j(\cdot)$ is a function) refers to $a_j$ and $\pi_\alpha \in \{\pm 1\}$ specifies whether it is $+a_j$ or $-a_j$. Figure 4 shows the landscape $f$ for $d = 3$, where some minima are labeled. The local ground state of the well $(j, \pi_\alpha)$ is denoted by $|j, \pi_\alpha\rangle$. In the basis $\{|1,+\rangle, \ldots, |d,+\rangle, |1,-\rangle, \ldots, |d,-\rangle\}$, which spans a low-energy subspace $\mathcal{F}$, the interaction matrix modulo an exponential error has the form Here the quantity $\mu$ stands for the energy of the local ground states, and $w$ is the tunneling amplitude quantifying the interactions between wells. To understand (83), for any $j$, imagine a sphere where $(j,+)$ is the north pole and $(j,-)$ the south pole; then for $j' \ne j$, the wells $(j', \pm)$ are evenly distributed on the equator. The energies of all local ground states are the same because of the symmetry. So, the diagonal elements of $H|_{\mathcal{F}}$ are all $\mu$. The interactions between $(j,+)$ and $(j', \pm)$ for all $j' \ne j$ should be the same as well. However, the interaction between $(j,+)$ and $(j,-)$ is exponentially weaker due to the longer distance between $(j,+)$ and $(j,-)$. As a result, we write $\langle j,+| H|_{\mathcal{F}} |j,-\rangle = 0$ modulo an exponential error and all other off-diagonal elements as $w$.
As is demonstrated in previous sections, the time cost of QTW highly depends on spectral gaps of H |F . The following lemma studies eigenstates and eigenvalues of H |F (see proof in SI Appendix C.3.1).
Lemma 3.6. The eigenstates and corresponding eigenvalues of H |F in (83) are given by where Evolving the system for at least the mixing time, the measured results would be subject to the limit distribution µ QTW . Since µ QTW would concentrate near all minima, we are able to find all components. Combined with results of Section 3.3, we can get the mixing time of the QTW as follows (the proof is postponed to SI Appendix C.3.2): Lemma 3.7. For the landscape (82), starting from a local ground state |α , the distribution ρ QTW (τ, x) converges to the limiting distribution µ QTW obeying the following relation: The -close mixing time is subsequently bounded as To determine the total evolution time for finding all components {a j } (i.e., all global minima), we need to calculate Ω β µ QTW dx, where Ω β is an open set containing the minimum β. According to Lemma 3.5, Ω β µ QTW dx is the probability of finding the particle in a neighborhood of π β a j(β) and can be captured by the probability of finding the system at the state |β . Starting from a local state |α , the probability of hitting |β is given by the following lemma and the proof can be found in SI Appendix C.3.3.
Lemma 3.8. Starting from a local state $|\alpha\rangle$, where $\alpha = (j(\alpha), \pi_\alpha)$, after simulating for a time $t$ chosen uniformly from $[0, \tau]$ with $\tau \to \infty$, the limiting distribution represented by the probability of tunneling to a local state $|\beta\rangle$ is given by Note that the two minima $\pm a_j$ are equivalent, representing one component. Thus, starting from $|\alpha\rangle$, we are able to find a component different from $\pm a_{j(\alpha)}$ if the measured result is in a well $\beta$ where $j(\beta) \ne j(\alpha)$. We can define the probability of a successful trial as $p_{\mathrm{suc}} = \sum_{j(\beta) \ne j(\alpha)} p(\infty, \beta|\alpha) = \frac{d-1}{d^2}$. That is, evolving for time $T_{\mathrm{mix}}$ as described by Lemma 3.7, we are able to approximately sample from the limiting distribution $\mu_{\mathrm{QTW}}$ and then reach another component with probability near $p_{\mathrm{suc}}$. The number of trials needed to find another component is approximately $1/p_{\mathrm{suc}}$, and the time needed to find another component from a known component is approximately $T_{\mathrm{mix}}/p_{\mathrm{suc}}$. Repeating the procedure of looking for one component that is different from a known one, we can obtain all orthogonal components with total time 9 To determine the time specifically, it remains to determine $1/|w|$, which depends exponentially on $d$ and $h$.
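The structure described for the interaction matrix (diagonal entries $\mu$, zero coupling between antipodal wells $(j,+)$ and $(j,-)$, and coupling $w$ between all other pairs) can be explored numerically. The sketch below builds such a matrix from that verbal description, computes its spectrum, and evaluates the limiting vertex distribution as the sum of squared projections onto each eigenspace. The function names and the numerical values of `mu` and `w` are illustrative assumptions.

```python
import numpy as np

def interaction_matrix(d, mu, w):
    """Basis order: |1,+>, ..., |d,+>, |1,->, ..., |d,->."""
    n = 2 * d
    h_f = np.full((n, n), w, dtype=float)
    np.fill_diagonal(h_f, mu)
    for j in range(d):
        h_f[j, j + d] = 0.0      # antipodal wells: coupling neglected
        h_f[j + d, j] = 0.0
    return h_f

def limiting_distribution(h_f, start, tol=1e-9):
    """Infinite-time average of |<beta| exp(-i H t) |start>|^2, computed as
    sum_k |<beta| P_k |start>|^2 over the eigenspace projectors P_k."""
    energies, vecs = np.linalg.eigh(h_f)
    n = h_f.shape[0]
    p = np.zeros(n)
    k = 0
    while k < n:
        m = k
        while m + 1 < n and abs(energies[m + 1] - energies[k]) < tol:
            m += 1                                   # group degenerate levels
        block = vecs[:, k:m + 1]
        amp = block @ block[start, :].conj()         # P_k |start>
        p += np.abs(amp) ** 2
        k = m + 1
    return p

d, mu, w = 4, 1.0, -0.01
h_f = interaction_matrix(d, mu, w)
levels = np.unique(np.round(np.linalg.eigvalsh(h_f), 8))
p_inf = limiting_distribution(h_f, start=0)
p_suc = p_inf.sum() - p_inf[0] - p_inf[d]            # wells with j(beta) != j(alpha)
print(levels, p_suc, (d - 1) / d**2)
```

For this matrix the spectrum consists of three levels, $\mu + (2d-2)w$, $\mu - 2w$, and $\mu$, so the minimal gap is $2|w|$, and the computed success probability agrees with $(d-1)/d^2$.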
We can obtain: Lemma 3.9. For sufficiently small h, the tunneling amplitude w in the interaction matrix (83) satisfies where C 1 and C 2 are constants depending only on the landscape and are independent of the dimension d.
The proof of Lemma 3.9 is in SI Appendix C.3.4. It is intuitive to see from Lemma 3.9 that the smaller $h$ is, the longer it takes to find all components. However, a small $h$ permits more accurate measurement results. A successful tunneling means that we find a point near a new component, but this point may not be the actual minimum. We add the constraint that the expected risk is $\delta$ (i.e., $\mathbb{E}_{x\sim\mu_{\mathrm{QTW}}} f(x) - \min f = \delta$). Subsequently, $h$ can be bounded using $\delta$ and we have the following proposition: Proposition 3.1. For sufficiently small $\epsilon$ (such that the measured positions nearly obey $\mu_{\mathrm{QTW}}$) and sufficiently small expected risk $\delta$ (such that $h$ can be estimated by $\delta$ and Lemma 3.9 is valid), we have $h = \sqrt{2}\,\delta/(d - 1 + o_\delta(1))$ and the total time for finding all orthogonal components of $T$ in (81) by QTW satisfies Remark 3.1. The strategy we adopt here, which is equivalent to repeated sampling from $\mu_{\mathrm{QTW}}$, is straightforward but may not be the optimal one under the framework of QTW. In other words, Proposition 3.1 provides a general upper bound on the total evolution time needed. However, the term $e^{\frac{\sqrt{2}}{2h}}$, which gives the term $e^{\frac{d-1}{2\delta} + o_\delta(1)}$ in (94), describes the essential difficulty of tunneling through a barrier and will not disappear as long as we use quantum tunneling.
To sum up, we provide a scenario in which QTW can be used to solve orthogonal tensor decomposition problems. For a practical landscape, the spectrum of the interaction matrix and the mixing time of QTW are explicitly calculated. Running QTW repeatedly for some time (bounded by the mixing time), we can sample points from a distribution near the limiting distribution and find all tensor components; an upper bound on the total running time of QTW is also derived.

Comparison Between Quantum Tunneling Walks and Classical Algorithms
In this section, we use comparisons between QTW and SGD to explain the advantages of quantum tunneling, resulting in our Main Message. Because of distinctions between quantum and classical algorithms, preparations (i.e., standards for comparisons) in Section 4.1 are needed before specific comparisons in Section 4.2.
With this general understanding of QTW, in Section 4.3 we further exploit the fact that quantum evolution is essentially global whereas classical algorithms rely on local queries: a hitting problem that cannot be solved efficiently by classical algorithms can be tackled by QTW with polynomially many queries when given reasonable initial states.

Criteria of fair comparison
Throughout Section 4, we adopt the assumptions in both Section 2.2 and Section 2.3 for the objective landscape $f(x)$ of interest. We still use $U_j = \{x_j\}$ ($j = 1, \ldots, N$) to denote the wells and $|j\rangle$ the corresponding orthonormalized local ground states. The interaction matrix is $H|_{\mathcal{F}}$, where $H = -h^2\Delta + f(x)$ is the Hamiltonian and $\mathcal{F}$ the low-energy subspace spanned by $\{|j\rangle : j = 1, \ldots, N\}$. As shown in Section 3.1, the hitting time of SGD is determined by the landscape and an adjustable learning rate $s$. Similarly, we can also adjust $h$ in Hamiltonian simulation. Therefore, we need to determine the relationship between $h$ and $s$ in order to compare the time costs of QTW and SGD.
9 The term $O(d \log d)$ appears because our procedure is equivalent to the Coupon Collector's Problem.
Note that both QTW and SGD have limit distributions, namely, µ QTW and µ SGD , respectively (see Section 2.2 and Lemma 3.2 for details). If h (or s) becomes smaller, µ QTW (µ SGD ) will concentrate more closely to global minima, giving more accurate outputs, whereas it would take more time for the QTW (SGD) to converge. Comparing the running time without specifying accuracy is not fair.
In order to establish a relationship between $h$ and $s$, as well as to compare QTW and SGD fairly, we specify the accuracy of the limit distributions. The two variables, $h$ and $s$, are then solved from the accuracy requirement. Hence, the time costs of the different algorithms depend only on the accuracy, the dimension, and some geometric properties of the landscapes.
There are different measures of accuracy we can choose depending on the tasks faced. Here, we introduce two kinds of measures along with the corresponding standards of comparison.
Standard 4.1 (Risk accuracy). Let µ QTW be the limit distribution of QTW, and µ SGD the invariant Gibbs distribution of SGD. Two distributions are demanded to be δ-risk-accurate: (95) Standard 4.1 ensures that two limit distributions yield the same expected risk. Then, it is natural to compare how fast QTW and SGD would converge. The algorithm spending less time is more efficient on finding any one global minimum. Sometimes, the task is to find some target minima or one special minimum. In this case, using risk accuracy cannot emphasize the particularity of the minima of interest and we may need the following standard: Standard 4.2 (Distance accuracy). Let µ QTW be the limit distribution of QTW, and µ SGD be the invariant Gibbs distribution of SGD. The minima of interest are x j k , k = 1, . . . , m, j k ∈ {1, . . . , N }. Let D(·, ·) be any distance function. Two distributions are demanded to be δ-distance-accurate with respect to x j k and D(·, ·): Conditions (95) and (96) can specify h and s. To see this, we first study the expected risk for quadratic functions: Lemma 4.1. Assume the objective function f : R d → R is quadratic and where the last inequality means the Hessian ∇ 2 f (0) is positive definite. Then, we have Lemma 4.1 calculates the expected risks for a landscape with only one minimum whose proof is in SI Appendix D.1.1. For landscapes with multiple minima, the limit distributions concentrate near the global minima and the objective function in a small neighborhood of any minimum can be approximated by a quadratic function based on the assumptions. Hence, we can obtain the following general estimations (the proof is postponed to SI Appendix D.1.2).
Lemma 4.2. If $\delta$ is sufficiently small and the objective function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies the assumptions in Section 2.2 and Section 2.3, then Standard 4.1 gives
Figure 5: A one-dimensional partially periodic function.
That is, we establish a relationship between $h$ and $s$ by Standard 4.1. Similarly, for Standard 4.2, we can have the following result: where the last inequality means the Hessian $\nabla^2 f(0)$ is positive definite. We define the distance function $D(x, y) := \|x - y\|_2^2$, $\forall x, y \in \mathbb{R}^d$. Then, we have The proof of Lemma 4.3 is shown in SI Appendix D.1.3. Similar to the step from Lemma 4.1 to Lemma 4.2, Lemma 4.3 may be generalized to general landscapes. However, the generalization of Lemma 4.3 is quite complicated, as the distance function and the wells of interest are arbitrary. So, we stop at Lemma 4.3. Regardless of the standard used, Lemma 4.2 and Lemma 4.3 convey a similar intuition: the dependence of $h$ on the flatness of the wells is different from that of $s$, which will be shown in the following section to be a source of quantum speedups.
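The different dependence of $h$ and $s$ on the flatness of a well can be illustrated in one dimension. The sketch below is a simplified stand-in for the quadratic case, under two assumptions that are not taken verbatim from the text: the SGD limit is the Gibbs density proportional to $e^{-2f/s}$ associated with the SDE $dX = -\nabla f\,dt + \sqrt{s}\,dW$, and the QTW limit near a single minimum is the squared ground state of $-h^2 \frac{d^2}{dx^2} + \frac{1}{2}\omega^2 x^2$, which is proportional to $e^{-\omega x^2/(\sqrt{2}h)}$. Function names and grid sizes are illustrative.

```python
import numpy as np

def expected_risk_gibbs(omega, s, n=200001, width=12.0):
    """E[f] under the density proportional to exp(-2 f / s), f = omega^2 x^2 / 2."""
    sigma = np.sqrt(s / (2.0 * omega**2))
    x = np.linspace(-width * sigma, width * sigma, n)
    f = 0.5 * omega**2 * x**2
    weight = np.exp(-2.0 * f / s)
    return np.sum(f * weight) / np.sum(weight)

def expected_risk_ground_state(omega, h, n=200001, width=12.0):
    """E[f] under the squared harmonic ground state of -h^2 d^2/dx^2 + f."""
    alpha = omega / (np.sqrt(2.0) * h)
    sigma = np.sqrt(1.0 / (2.0 * alpha))
    x = np.linspace(-width * sigma, width * sigma, n)
    f = 0.5 * omega**2 * x**2
    density = np.exp(-alpha * x**2)
    return np.sum(f * density) / np.sum(density)

for omega in (2.0, 0.5):
    print(omega,
          expected_risk_gibbs(omega, s=0.04),         # close to s / 4, independent of omega
          expected_risk_ground_state(omega, h=0.04))  # close to h * omega / (2 * sqrt(2))
```

In this toy setting the classical risk is $s/4$ regardless of $\omega$, while the quantum risk is $h\omega/(2\sqrt{2})$, so for a fixed risk a flatter well (smaller $\omega$) tolerates a larger $h$.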

Illustrating advantages of quantum tunneling
In this subsection, we compare QTW with SGD for several special landscapes. The goal is to explore geometric properties of the landscapes that affect relative efficiencies of QTW and SGD. Heuristically, the comparison reveals when quantum tunneling can be faster than thermal climbing (climbing over barriers between minima by stochastic motions), which are the two mechanisms behind QTW and many classical algorithms.
For simplicity, we focus on the following kind of landscapes: Definition 4.1 (One-dimensional partially periodic functions). A function f : R → R is partially periodic if it satisfies the assumptions in Section 2.2 and Section 2.3, and all minima {x j : j = 1, . . . , N } are in a bounded interval which is a period of f .
A sketch of the functions in Definition 4.1 is shown in Figure 5. Neglecting an exponentially small error and noting the symmetry of the one-dimensional partially periodic function $f$, the interaction matrix in the basis $\{|j\rangle : j = 1, \ldots, N\}$ is given by where $\mu$ is the energy of one local ground state and $w$ quantifies the tunneling effect between two adjacent wells. The eigenstates and eigenvalues of $H|_{\mathcal{F}}$ are given by the following lemma.
Lemma 4.4. The eigenstates and corresponding eigenvalues of the Hamiltonian (104) are given by To describe w in detail, as shown in Figure 5, we introduce new notations {x • j : j = 1, . . . , N } and {x • j : j = 1, . . . , N − 1} to denote minima and saddle points, respectively. A more general labeling of local minima and saddle points can be found in SI Appendix A.1. The Morse saddle barrier reflecting height of the barrier in the present case can be given by . Using results in SI Appendix A.2.3, we have: Lemma 4.5 (Tunneling amplitude). The tunneling amplitude for the one-dimensional partially periodic function f is given by Now, we can obtain the spectrum of H |F explicitly, and proceed by using Lemma 3.1 to get the quantum mixing time.
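As an illustration of how the spectrum of such an interaction matrix can be obtained, the sketch below assumes the simplest reading of the description above: an open chain of $N$ wells with local energy $\mu$ on the diagonal and amplitude $w$ between adjacent wells only. This tridiagonal form is an assumption made for the example. For it, the eigenvalues are $\mu + 2w\cos\frac{k\pi}{N+1}$, $k = 1, \ldots, N$, a standard result for tridiagonal Toeplitz matrices. Function names are illustrative.

```python
import numpy as np

def chain_interaction_matrix(n_wells, mu, w):
    """Nearest-neighbour chain: mu on the diagonal, w on the two off-diagonals."""
    return mu * np.eye(n_wells) + w * (np.eye(n_wells, k=1) + np.eye(n_wells, k=-1))

def chain_eigenvalues(n_wells, mu, w):
    """Closed-form spectrum of the tridiagonal Toeplitz matrix above."""
    k = np.arange(1, n_wells + 1)
    return mu + 2.0 * w * np.cos(k * np.pi / (n_wells + 1))

def minimal_gap(energies):
    """Smallest spacing between consecutive (sorted) energy levels."""
    return np.min(np.diff(np.sort(energies)))

n_wells, mu, w = 8, 1.0, -0.02
numeric = np.linalg.eigvalsh(chain_interaction_matrix(n_wells, mu, w))
closed = chain_eigenvalues(n_wells, mu, w)
print(np.sort(closed))
print(minimal_gap(numeric))
```

Under this assumption the minimal gap is proportional to $|w|$ and shrinks as the number of wells grows, which is how both the tunneling amplitude and $N$ would enter a mixing-time bound of the form $1/\Delta E$.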
Lemma 4.6 (Quantum mixing time). Starting from the local ground state of one minimum, the $\epsilon$-close mixing time of QTW is given by Regarding SGD, we use the results introduced in Section 2.2 to estimate the classical mixing time. First, Lemma 4.7 (Exponential decay constant). In Proposition 2.1, let $\lambda = \delta_{s,1}/2s$; we have Then, by Corollary 2.1, the following lemma holds.
As concrete examples, we present several specific functions to illustrate the advantages of quantum tunneling. Since the function in the region of interest is periodic, we only need to specify the function values within one period to construct a concrete example. Without loss of generality, we set the interval $[-a, a + 2b]$ to be one period, where $[-a, a]$ is called the well region and $[a, a + 2b]$ the barrier region. The constructed landscape in $[-a, a + 2b]$ is given by Here, to reduce the number of free parameters, we make $f(x)$ differentiable at $a$, the boundary between the well and the barrier, such that Remark 4.1. Note that the function in (112) is not smooth. We need to use the mollifier function $m_r$ (see details in SI Appendix D.3) to smooth it such that the assumptions in Section 2.2 and Section 2.3 are satisfied. Note that as $r \to 0$, the smoothed function tends to $f$, so the following results can be seen as arbitrarily accurate for a smooth function arbitrarily close to (112).
By giving specific a, σ, and ε in (112), we can design landscapes with different properties. Detailed variables, discussions and comparisons are given below.
Substituting the parameters, it is true for both Example 4.2 and Example 4.3 that Compared to the critical case Example 4.1, Example 4.2 has a thicker barrier, which increases $S_0$ and causes difficulty for QTW. However, QTW can perform better in Example 4.2. This is mainly due to the flatter well of Example 4.2. Recall that by Standard 4.1, to ensure $\delta$-risk-accuracy, $h$ and $s$ should be respectively. That is, under the same risk accuracy, $h$ can be much larger than $s$ if the well is flat ($k$ is small), making tunneling easier. Note that there is a trade-off between accuracy and time cost: a smaller $h$ (or $s$) ensures high accuracy but makes tunneling effects (or thermal diffusion) weaker; conversely, a larger $h$ (or $s$) permits faster tunneling (or diffusion) but yields inaccurate results. Discussions of quantum tunneling effects usually focus on properties of the barrier. In the present study, since we aim to find global minima, the precision of the results obtained is an important concern. Therefore, the flatness of the wells, which affects the accuracy of QTW and SGD differently, is a crucial property determining the runtime of QTW and SGD. Loosely speaking, QTW is faster than SGD on landscapes with flat wells. Example 4.3 agrees with the intuition that quantum tunneling is efficient on functions with tall and thin barriers. The wells of Example 4.3 are almost the same as those of the critical case Example 4.1. QTW can be faster in Example 4.3 because we add a sharp barrier between wells. By Lemma 4.8, a high barrier (i.e., large $H_f$) would significantly hinder thermal climbing. However, the tall barrier is sufficiently thin, such that $S_0 = 2\int_0^{a+b} \sqrt{f(x)}\,dx$ can still be small and, by Lemma 4.6, the tunneling effect is strong. Moreover, in high dimensions, the distribution of wells can be very different from being on a line. As shown in SI Appendix D.4, the distribution of wells can largely affect the dependence of time on $N$.
However, such relation between the distribution of wells and running time is not explicitly shown for SGD. Therefore, the distribution of wells can also be a factor of quantum speedups.
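The contrast between barrier height and barrier thickness can be illustrated numerically. The sketch below is not taken from the paper: it assumes a hypothetical quartic double well f(x) = H(1 - (x/c)^2)^2 with wells at ±c, and it assumes that the Agmon length between the wells is the integral of sqrt(f) along the segment joining them. It then compares the tunneling exponent S_0/h with the thermal exponent 2H_f/s, the quantities appearing in the exponential terms e^{S_0/h} and e^{2H_f/s}.

```python
import math


def agmon_length(f, left, right, n=20000):
    """Midpoint-rule estimate of the integral of sqrt(f) over [left, right].

    Assumption: the 1D Agmon length between two wells is this integral.
    """
    dx = (right - left) / n
    total = 0.0
    for i in range(n):
        x = left + (i + 0.5) * dx
        total += math.sqrt(max(f(x), 0.0))  # guard against tiny negative round-off
    return total * dx


def double_well(height, half_width):
    """Hypothetical toy landscape: wells at +-half_width, barrier `height` at x = 0."""
    return lambda x: height * (1.0 - (x / half_width) ** 2) ** 2


def exponents(height, half_width, h, s):
    """Return (S_0 / h, 2 * H_f / s) for the toy landscape."""
    f = double_well(height, half_width)
    s0 = agmon_length(f, -half_width, half_width)
    return s0 / h, 2.0 * height / s


if __name__ == "__main__":
    # Same h and s for both landscapes; only the barrier shape changes.
    for height, half_width in [(1.0, 1.0), (4.0, 0.5)]:
        quantum_exp, classical_exp = exponents(height, half_width, h=0.1, s=0.1)
        print(f"H={height}, c={half_width}: S0/h={quantum_exp:.3f}, 2H/s={classical_exp:.3f}")
```

For this toy family S_0 = 4c·sqrt(H)/3, so making the barrier four times taller and half as wide leaves S_0/h unchanged while 2H_f/s grows fourfold.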
In summary, we can conclude our Main Message.

Efficient quantum tunneling for solving a classically difficult hitting problem
The above examples compare QTW, which is driven by quantum tunneling, with SGD. In this section, we show an exponential separation in query complexity between QTW given initial states and classical algorithms knowing one well, for a specific hitting problem on a constructed landscape.
The landscape f(x) we construct lives in $\mathbb{R}^d$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of vectors, namely, $\|x\| = \sqrt{x \cdot x}$. Let $B(x, r)$ denote the d-dimensional ball centered at x with radius r. A special direction v is randomly chosen from the d-dimensional unit sphere. We define two regions $W_- = B(0, a)$ and $W_+ = B(2bv, a)$ with $b \ge a$. Let R be sufficiently large such that $W_-, W_+ \subset B(0, R)$. We denote the region $\{x \mid x \in B(0, R),\ |x \cdot v| \le w\}$ by $S_v$, where w will be chosen from $[\sqrt{3}a/2, a)$. We denote
Figure 3 illustrates the positions of the newly defined regions. The constructed function f is given by
Here, we define $H_0 = \frac{1}{2}\omega^2 a^2$ and demand that 0
Remark 4.2. The landscape f in (121) is not smooth and should be smoothed into $F_r$ with the help of a mollifier function $m_r$ (see details in SI Appendix D.3) so that the assumptions in Section 2.3 are satisfied. Because $F_r \to f$ as $r \to 0$, we can always find a sufficiently small r for which the following conclusions, derived for f, remain valid for $F_r$.
There are two global minima, 0 and 2bv, of the function f . Given that we know 0 is a minimum, our goal is to find the other one. To avoid complicated justifications, we deal with a simpler problem: Problem 4.1. For the f in (121), given that we only know 0 is a global minimum, find any point in W + .

Classical lower bound
Due to the concentration of measure, for any point $x \in B(0, R)$, the probability of $x \in S_v$ is given by
Intuitively, when restricted to $B(0, R)$, no classical algorithm can escape from $S_v$ efficiently. In $\mathbb{R}^d$, queries outside $B(0, R)$ provide no information about the landscape inside $B(0, R)$ and cannot help to escape from $S_v$. Therefore, classical algorithms cannot solve Problem 4.1 efficiently, with or without being constrained to $B(0, R)$. To prove these intuitions rigorously, we first introduce a mathematical result indicating (122):
Figure 7). We have
The estimation details are presented in SI Appendix D.2.1. Subsequently, it readily follows (see details in SI Appendix D.2.2):
Lemma 4.13. For any randomly chosen point $x \in B(0, R)$, the probability of
Recall that $w \in [a/2, a)$ and R are independent of d, so the measure of the region inside $B(0, R)$ and outside $S_v$ is exponentially small with respect to the dimension d. By Definition 2.1, classical algorithms depend on an adaptive sequence of points. We now need to demonstrate that it is difficult for these points to hit regions beyond $S_v$.
Lemma 4.14. For any classical algorithm (see Definition 2.1), after running T times, we get a sequence of points and corresponding queries
We prove Lemma 4.14 in SI Appendix D.2.3. Now, we can prove that if the number of points and queries is small, then with high probability no classical algorithm can escape from $S_v$. Rigorously, we have
The proof sketch of Proposition 4.1 goes as follows (see proof details in SI Appendix D.2.4). By Lemma 4.14, it suffices to demonstrate that, restricted to the ball $B(0, R)$, classical algorithms cannot escape from $W_-$ and hit $W_+$ efficiently. What remains is to show that queries outside $B(0, R)$ provide no information about $B(0, R)$. Thus, even without being restricted to $B(0, R)$, classical algorithms still cannot hit $W_+$ with subexponentially many queries with high probability.
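A quick numerical check can make the concentration-of-measure statement concrete. The sketch below relies on a standard geometric fact rather than on the paper's lemmas: for x uniform in B(0, R) and a fixed unit vector v, the marginal density of t = x·v is proportional to (R^2 - t^2)^{(d-1)/2}. The function name, the midpoint quadrature, and the parameter values (R = 1, w = 0.5) are illustrative assumptions.

```python
import math


def prob_outside_slab(d, w, R=1.0, n=20000):
    """P(|x . v| > w) for x uniform in the d-ball of radius R and a fixed unit vector v.

    Uses the 1D marginal density of t = x . v, proportional to
    (1 - (t / R)^2)^((d - 1) / 2), integrated with the midpoint rule.
    """
    exponent = (d - 1) / 2.0

    def mass(lo, hi):
        dt = (hi - lo) / n
        total = 0.0
        for i in range(n):
            t = lo + (i + 0.5) * dt
            total += (1.0 - (t / R) ** 2) ** exponent
        return total * dt

    # The density is symmetric in t, so the two-sided tail is a ratio of one-sided masses.
    return mass(w, R) / mass(0.0, R)


if __name__ == "__main__":
    for d in (1, 3, 10, 50, 100):
        print(d, prob_outside_slab(d, w=0.5))
```

With these toy values the mass outside the slab falls from 0.5 at d = 1 to below 10^{-5} at d = 100, which is the qualitative behavior the lower bound exploits.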

Quantum upper bound
We now focus on the time needed for quantum tunneling to solve Problem 4.1. The landscape (121) satisfies Assumption A.6
0 = min f < lim
where $U_- := \{0\}$ and $U_+ := \{2bv\}$ are called wells by definition. The neighborhoods of the two wells are quadratic, so the wells and the corresponding local ground states satisfy (S79) and (S80). Moreover, due to the symmetry of the function (121), the local ground states are also symmetric. Therefore, Assumption A.5 is satisfied. To use Proposition A.6, we only need to verify the conditions in Assumption A.7, which leads to the following lemma.
Lemma 4.15. There exists a unique Agmon geodesic, denoted $\gamma_{-+} : \mathbb{R} \to \mathbb{R}^d$, which links $U_-$ and $U_+$:
and the Agmon distance $S_0 := d(U_-, U_+)$ is
The calculation details of Lemma 4.15 are presented in SI Appendix D.2.5. We are now ready to calculate the interaction matrix explicitly (see details in SI Appendix D.2.6):
Lemma 4.16. Under the two orthonormalized local ground states, the interaction matrix is of the form
and the next-to-leading order formula of w is given by
Using the explicit tunneling amplitude, we can estimate the time needed for quantum tunneling.
Proposition 4.2 (Quantum upper bound). For any dimension d, we can always choose appropriate h, ω, a, b, $H_1$, $H_2$, and w satisfying the previous restrictions such that, given the local ground state associated with $W_-$ under the chosen h as the initial state, QTW can solve Problem 4.1 with high probability $1 - (1 - C)^n$ using only $n \cdot O(\mathrm{poly}(d))$ queries, where 0 < C < 1 is a constant independent of d.
The proof of Proposition 4.2 is deferred to SI Appendix D.2.7; we sketch it briefly here. We take all the adjustable parameters as functions of d and discuss the evolution time as a function of d. First, we have h = Θ(1/d) for sufficiently large d, which eliminates the negative effects of measure concentration caused by the increasing dimension, and we prove that our theory of quantum tunneling walks remains valid in this regime. Thus, the quantum wave is distributed near $W_-$ or $W_+$, and the limit distribution $\mu_{\mathrm{QTW}}$ gives a probability of finding the particle in $W_+$ that is larger than some constant independent of d. Then, based on the results of semi-classical analysis, we can tune the function values in $W_-$, $W_+$, and $B_v$ such that the time needed for tunneling is polynomial in d. As a result, the last three conditions at the end of Section 1 can be satisfied, and the first and third conditions suggest that, with high probability, QTW can hit $W_+$ within an evolution time polynomial in d. Finally, since $\tilde{O}(t)$ quantum queries suffice to evolve QTW for time t, with high probability QTW can hit $W_+$ with a number of queries polynomial in d.
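The success probability 1 - (1 - C)^n is the usual amplification by independent repetition: if a single run succeeds with probability at least C, then n independent runs all fail with probability at most (1 - C)^n. A minimal sketch, with a hypothetical helper name and purely illustrative values of C, of how many repetitions a target failure probability requires:

```python
import math


def repetitions_needed(single_run_success, failure_prob):
    """Smallest n with (1 - C)^n <= failure_prob, where C = single_run_success.

    Hypothetical helper for illustration; the values of C below are not from the paper.
    """
    if not (0.0 < single_run_success < 1.0 and 0.0 < failure_prob < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(failure_prob) / math.log(1.0 - single_run_success))


if __name__ == "__main__":
    for c in (0.5, 0.1, 0.01):
        n = repetitions_needed(c, failure_prob=1e-3)
        print(f"C={c}: n={n}, failure bound={(1.0 - c) ** n:.2e}")
```

Because C does not depend on d, the number of repetitions does not depend on d either, so the total query count stays polynomial in d.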
Combining the results of Proposition 4.2 and Proposition 4.1, we can obtain Theorem 1.2, which is restated in a more rigorous way as follows.
Theorem 4.1. For any dimension d, there exists a landscape of the form (121) such that, with high probability $1 - (1 - A)^n$, QTW can solve Problem 4.1 with $n \cdot O(\mathrm{poly}(d))$ queries given the local ground state associated with $U_-$, whereas, with high probability $1 - e^{-dB}$, no classical algorithm (Definition 2.1) can solve the same problem on the same landscape f with $o(e^{dB})$ queries, where 0 < A < 1 and B > 0 are two constants independent of d.

The significance of proper initial states
The hardness of Problem 4.1 can be abstracted as that of finding an exponentially small cone on a landscape that is isotropic outside the special cone. There can be exponentially many such cones that are mutually disjoint. Therefore, it can be proved that by solving Problem 4.1 in $\mathbb{R}^d$, we can solve an unstructured search problem of size exponential in d.
To show this, we first introduce unstructured search. Say, we are given N data points, only one of which is assigned the value 1 and all other points are assigned a value 0. The goal is to find the point assigned 1 with an oracle outputting the assigned value of the input point. Intuitively, each data point can be mapped to a unique cone in R d and the point assigned 1 should correspond to the cone containing B v and W + . In this case, solving Problem 4.1 can lead to the data point we want to find. Precisely speaking, if there is a quantum algorithm that can solve Problem 4.1 with queries polynomial in d, it can solve an unstructured search whose size N is exponential in d within queries polynomial in d. That is, we can solve an N -size unstructured search within O(poly(log N )) queries with the help of the efficient algorithm for Problem 4.1.
However, it is well known that quantum algorithms have a query complexity lower bound of $\Omega(\sqrt{N})$ for solving unstructured search with N data points [BBBV97]. Therefore, we can conclude
We prove Proposition 4.3 and related claims rigorously in SI Appendix D.2.8. It may seem that Proposition 4.3 contradicts Proposition 4.2. However, there is no paradox, because in Proposition 4.2 QTW does not solve Problem 4.1 as stated: the local ground state $|\Phi_-\rangle$ associated with $U_-$ under a proper quantum learning rate h is given to QTW as prior knowledge. To establish a tunneling effect with only polynomial decay, the state $|\Phi_-\rangle$ has a non-vanishing probability (possibly an inverse polynomial in d) in $W_+$. The state $|\Phi_-\rangle$ thus reveals much about the special direction v, so what QTW does is not equivalent to unstructured search. Indeed, in the same spirit as Proposition 4.3, the state $|\Phi_-\rangle$ cannot be prepared with polynomially many queries; otherwise, we could reach $W_+$ efficiently by measuring $|\Phi_-\rangle$ repeatedly.
We admit that Theorem 4.1 requires the initial quantum state. Note that QTW only uses M copies of the local ground state $|\Phi_-\rangle$ to hit $W_+$ with high probability in polynomial time, where M is a number independent of d. If the possibility of learning about v from sampling tends to 0 as d → ∞, which is likely to be true, then the expected number of queries needed by classical algorithms to hit $W_+$ cannot be subexponential in d. In this case, we have an exponential quantum-classical separation in evaluation queries even if classical algorithms are given a constant number of samples from the initial distribution $|\langle x|\Phi_-\rangle|^2$. Essentially, this is because no classical evolution can make good use of the samples of the initial state.
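The tension between a poly(d)-query algorithm and the Ω(√N) bound can be seen with simple arithmetic. The sketch below assumes, purely for illustration, an instance size N = 2^d and a hypothetical query budget d^k; neither the base 2 nor the exponent k comes from the paper. It finds the dimension beyond which √N exceeds the polynomial budget.

```python
import math


def log_gap(d, degree, base=2.0):
    """log(sqrt(base**d)) - log(d**degree); positive when sqrt(N) exceeds d**degree."""
    return 0.5 * d * math.log(base) - degree * math.log(d)


def crossover_dimension(degree, base=2.0, d_max=10_000):
    """Smallest d such that sqrt(base**d') > d'**degree for every d' in [d, d_max]."""
    last_violation = max(
        (d for d in range(1, d_max + 1) if log_gap(d, degree, base) <= 0.0),
        default=0,
    )
    return last_violation + 1


if __name__ == "__main__":
    for degree in (3, 5, 7):
        print(f"budget d^{degree}: sqrt(2^d) is larger from d = {crossover_dimension(degree)} on")
```

Past the crossover, any algorithm using only d^k queries would beat the unstructured-search bound, which is why the polynomial-query result must rely on extra information such as the given initial state.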

Numerical Experiments
We conduct numerical experiments to examine our theory. All results and plots are obtained by simulations on a classical computer (Dual-Core Intel Core i5 Processor, 16GB memory) via MATLAB 2020b. Details of all numerical settings can be found in SI Appendix E.
QTW is simulated by numerically solving the Schrödinger equation, and SGD is performed with first-order queries, where the noise at each step follows the standard Gaussian distribution.
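As an illustration of the classical side of this setup, the SGD update x_{k+1} = x_k - s∇f(x_k) - sξ_k with standard Gaussian ξ_k can be sketched as follows. The quartic double well, the target threshold, and all parameter values are hypothetical choices for this sketch; they are not the landscapes or settings of SI Appendix E, and the QTW side (a Schrödinger solver) is not shown.

```python
import random


def grad_f(x, height):
    """Gradient of the toy landscape f(x) = height * (1 - x^2)^2 (wells at x = -1 and x = +1)."""
    return -4.0 * height * x * (1.0 - x * x)


def sgd_hitting_steps(s=0.2, height=0.3, target=0.9, max_steps=200_000, seed=0):
    """Run noisy gradient descent from the well at x = -1 until x >= target.

    Returns the number of steps taken, or None if max_steps is exhausted.
    """
    rng = random.Random(seed)
    x = -1.0
    for step in range(1, max_steps + 1):
        xi = rng.gauss(0.0, 1.0)  # standard Gaussian noise for this step
        x = x - s * grad_f(x, height) - s * xi
        if x >= target:
            return step
    return None


if __name__ == "__main__":
    hits = [sgd_hitting_steps(seed=k) for k in range(20)]
    finished = [h for h in hits if h is not None]
    print("runs that hit the target:", len(finished))
    print("mean steps to hit:", sum(finished) / max(len(finished), 1))
```

Each step uses one gradient query, so the step count is also the query count of a run.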
Quantum-classical comparisons. To corroborate our Main Message, we numerically study the performance of QTW and SGD on concrete examples (see details in SI Appendix E.1). The quantum learning rate h and the classical learning rate s are determined under Standard 4.1 which equalizes expected risks yielded by QTW and SGD.
The task is to hit a target neighborhood of one minimum beginning at another designated minimum. In Figure 8, results on classical and quantum hitting time are shown. We examine QTW and SGD on three landscapes, Example 1, 2, and 3 in Figure 8, which correspond to the concrete functions given by Example 4.1, Example 4.2, and Example 4.3 in Section 4.2, respectively. We conduct 1000 experiments for QTW and SGD on each example. For QTW, an experiment denotes a process that repeats trials until one of them hits the target, where each trial initiates QTW once and measures the position at a time t chosen randomly from [0, τ]. For SGD, an experiment begins at a designated minimum and stops when SGD hits the target region. The evolution time of an experiment is the sum of the evolution times of the trials it contains.
We use $T^{\mathrm{QTW}}_{\mathrm{hit}}$ and $T^{\mathrm{SGD}}_{\mathrm{hit}}$ to denote the evolution time of one experiment for QTW and SGD, respectively. In Figure 8, the histograms compare $T^{\mathrm{QTW}}_{\mathrm{hit}}$ with $T^{\mathrm{SGD}}_{\mathrm{hit}}/10$, and all presented examples demonstrate that QTW is faster. The number of quantum queries is approximately $\tilde{O}(\|f\|_\infty T^{\mathrm{QTW}}_{\mathrm{hit}})$ and the number of classical queries is $\Omega(T^{\mathrm{SGD}}_{\mathrm{hit}}/s)$. In addition, in the three examples $\|f\|_\infty \le 0.85$ and s < 0.2, so the quantum advantage also holds in terms of query complexity.
This result matches our theory by and large. For Example 1, we make a direct comparison between the exponential terms $e^{S_0/h}$ and $e^{2H_f/s}$; to remove the coefficients in front of them, we divide $T^{\mathrm{SGD}}_{\mathrm{hit}}$ by 10 so that $T^{\mathrm{SGD}}_{\mathrm{hit}}/10$ has a distribution similar to that of $T^{\mathrm{QTW}}_{\mathrm{hit}}$ for Example 1. In this way, whether $T^{\mathrm{SGD}}_{\mathrm{hit}}/10$ is larger than $T^{\mathrm{QTW}}_{\mathrm{hit}}$ is determined only by $e^{S_0/h}$ and $e^{2H_f/s}$. For Example 2, $T^{\mathrm{QTW}}_{\mathrm{hit}}$ is not much smaller than $T^{\mathrm{SGD}}_{\mathrm{hit}}/10$, which is not completely consistent with our theory. This can be explained as follows: for Example 2, the quantum learning rate h is not small enough, so the prepared initial state does not stay close to the low-energy subspace. Experiments on Example 2 suggest that higher energy may not help quantum tunneling run faster.
For Example 3, the chosen h is small enough (see details in SI Appendix E.1) that a significant quantum speedup is achieved as expected. In Figure 8, $T^{\mathrm{SGD}}_{\mathrm{hit}}/10$ is even several orders of magnitude larger than $T^{\mathrm{QTW}}_{\mathrm{hit}}$.
Figure 8: Quantum-classical comparison between SGD and QTW on three landscapes. Example 1 is the critical case where the exponential terms in QTW and SGD evolution time are equal for sufficiently small accuracy δ. Example 2 has flatter minima but similar barriers compared to Example 1, enabling QTW to be faster. Example 3 possesses the same flatness of minima as Example 1 but is equipped with sharp but thin barriers, enabling larger quantum speedups. We take τ = 288, 800, 600 in the three examples, respectively.
Dimension dependence. Due to the limitations of solving the Schrödinger equation on classical computers, QTW is simulated only in low dimensions (i.e., d = 1 and d = 2). Here we examine Theorem 1.2 by testing SGD against the classical lower bound. The classical lower bound in Theorem 1.2 ensures that, for any s, SGD cannot escape from $S_v$ with subexponentially many queries with high probability. Based on the constructed landscape with parameters specified in SI Appendix E.2, we test SGD with different learning rates (s ∈ [0.1, 1]) in various dimensions (d ∈ [15, 95]). For each dimension and each s, 1000 experiments are conducted. The number of steps spent escaping from $S_v$ in one experiment is denoted by $Q_{\mathrm{esc}}$. Here, we present the relationship between the average $Q_{\mathrm{esc}}$ and the dimension d in Figure 9 (more details are deferred to SI Appendix E.2).
For each fixed learning rate s, we observe that as d increases, the average $Q_{\mathrm{esc}}$ initially remains constant and then increases exponentially with respect to d. Increasing s yields a smaller initial constant but a larger exponential rate. Nevertheless, for all s, $Q_{\mathrm{esc}}$ eventually increases exponentially with respect to d (the triangle in Figure 9 shows the slope 1/256 corresponding to the exponential function $e^{d/256}$, which is a lower bound on the average $Q_{\mathrm{esc}}$), supporting our prediction.
Dependence on the quantum learning rate. In QTW, the quantum learning rate h is one of the most important variables. Theorem 1.1 gives a general relationship between h and the evolution time of QTW. We further test the relationship on the landscape constructed in Theorem 1.2 (dimension d = 2) with specified parameters given in SI Appendix E.3. Since the landscape has two symmetric wells, the time for tunneling from one well to the other, $T_{\mathrm{half}}$, is explicitly linked to ΔE (i.e., $T_{\mathrm{half}} = \pi/\Delta E$). On this concrete landscape, ΔE can be predicted, giving that
where $C_f$ is a constant depending on f and can be explicitly calculated. Starting from one well, we stop when the probability of finding the other well exceeds 90% and record the evolution time as $T_{\mathrm{half}}$. Experiments on $T_{\mathrm{half}}$ are shown in Figure 10. The results match our theory except for a constant difference between the predicted and experimental $\ln T_{\mathrm{half}}$, supporting the predicted dependence $\frac{S_0}{h} - \frac{1}{2}\ln h$. The constant difference emerges because we stop the evolution when the probability of tunneling exceeds 90%, while the theoretical $T_{\mathrm{half}}$ is the time at which the probability is nearly 100%.
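One way to see why stopping at 90% shifts ln T_half by a constant is an idealized two-level picture. Assuming (this is a modeling assumption, not the paper's simulation) that the state stays in the span of the two lowest eigenstates, the probability of finding the particle in the other well is sin^2(ΔE·t/2), which first reaches 1 at t = π/ΔE. The hypothetical helper below computes the first time this probability reaches a threshold p, as a fraction of π/ΔE.

```python
import math


def other_well_probability(t, delta_e):
    """Two-level model: probability of being in the other well at time t (units with hbar = 1)."""
    return math.sin(0.5 * delta_e * t) ** 2


def threshold_time_ratio(p):
    """First time the two-level probability reaches p, divided by pi / delta_e.

    Solving sin^2(delta_e * t / 2) = p gives t = 2 * asin(sqrt(p)) / delta_e,
    so the ratio does not depend on delta_e.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return 2.0 * math.asin(math.sqrt(p)) / math.pi


if __name__ == "__main__":
    ratio = threshold_time_ratio(0.9)
    print("t_90 / T_half:", ratio)
    print("offset in ln T_half:", math.log(ratio))
```

In this idealized model the ratio is about 0.795 for every ΔE, that is, an offset of roughly -0.23 in ln T_half that is independent of h. The offset observed in the experiments need not equal this value, since the simulated state is not exactly two-level.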
To conclude, several aspects of the present theory are well supported by numerical experiments.

Discussion
In this paper, we explore quantum speedups for nonconvex optimization by quantum tunneling. In particular, we introduce the quantum tunneling walk (QTW) and apply it to nonconvex problems where local minima are approximately global minima. We show that QTW achieves quantum speedup over classical stochastic gradient descents (SGD) when the barriers between different local minima are high but thin and the minima are flat. Moreover, we construct a specific nonconvex landscape where QTW given proper initial states is exponentially faster than classical algorithms taking local queries for hitting the neighborhood of a target global minimum. Finally, we conduct numerical experiments to corroborate our theoretical results. We expect our results to have further impacts both classically and quantumly. In optimization theory, previous work has studied several physics-motivated optimization algorithms, including Nesterov's momentum method [SBC16,WWJ16,SDJS21], stochastic gradient descents [SSJ20], symplectic optimization [BJW18,Jor18], etc. We believe that our work can further inspire the design of optimization algorithms. From theory to practice, in this work we analyzed the performance of QTW on tensor decomposition, and we expect QTW to also have decent performance on other practical problems with benign landscapes.
In quantum computing, on the one hand, previous work on continuous optimization only studies convex optimization [AGGW20,CCLW20] or local properties such as escaping from saddle points [ZLL21], and our work significantly extends the range of problems that quantum computers can efficiently solve to global problems in nonconvex optimization. On the other hand, we point out that QTW has the potential to be implemented on near-term quantum computers. In fact, current quantum computers have implemented both quantum simulation [AAB + 20, EWL + 21] and quantum walks [TLF + 18, GWZ + 21] at decent scales. We regard QTW as a potential proposal for demonstrating quantum advantages in the near term.
Our paper also leaves several technical questions for future investigation:
• What is the performance of QTW on more general landscapes? For instance, a wide range of deep neural networks [KHK19] has some (but probably not all) local minima which are approximately global. Future work on weakening the assumptions on landscapes for QTW is preferred.
• Are there more examples with exponential quantum-classical separation? Our construction leverages a special kind of locally non-informative landscapes, and exponential quantum-classical separation can potentially be observed on other landscapes, such as nonsmooth landscapes and landscapes with negative curvature.
• QTW simulates the Schrödinger equation whose potential is set to be the optimization function, and this QTW can be efficiently simulated on quantum computers. In general, are there better PDEs which are more efficient for optimization and can still be efficiently simulated on quantum computers?