Latency considerations for stochastic optimizers in variational quantum algorithms

Variational quantum algorithms, which have risen to prominence in the noisy intermediate-scale quantum setting, require the implementation of a stochastic optimizer on classical hardware. To date, most research has employed algorithms based on the stochastic gradient iteration as the stochastic classical optimizer. In this work we propose instead using stochastic optimization algorithms that yield stochastic processes emulating the dynamics of classical deterministic algorithms. This approach results in methods with theoretically superior worst-case iteration complexities, at the expense of greater per-iteration sample (shot) complexities. We investigate this trade-off both theoretically and empirically and conclude that preferences for a choice of stochastic optimizer should explicitly depend on a function of both latency and shot execution times.


Introduction
Quantum computers have the potential to perform many important calculations faster than their classical counterparts. Disparate domains such as quantum chemistry [1]; nuclear [2], condensed matter [3], and high energy [4] physics; machine learning and data science [5]; and finance [6] are all theorized to benefit from quantum algorithms in various ways. In the near term, the so-called noisy intermediate-scale quantum (NISQ) [7] era, these theoretical benefits are hard to realize, because the canonical quantum algorithms used in many of these domains, such as quantum phase estimation [8], require gate depths that are expected to be realized only with fault-tolerant, error-corrected quantum computers [9]. Variational quantum algorithms (VQAs) attempt to lower gate depth requirements by replacing some of the requirements with the need to perform optimization using classical computers [10]. Such algorithms have shown success on NISQ hardware in eigenvalue estimation [11], dynamical evolution [12,13,14], machine learning [15,16], and many other problems [10]. The variational quantum eigensolver (VQE) is perhaps the best-known example, with many demonstrations performed on a variety of physical and chemical systems [14,17]. One of the key bottlenecks in VQAs is the optimization step: the functions being optimized, as well as their higher-order derivative information, must be estimated via a limited number of samples, necessitating the use of stochastic optimization algorithms. A simple way of quantifying the total cost is through the number of shots: for each function evaluation, the quantum computer must be queried many times to obtain an estimate of the function to be used in a classical optimization routine.
These sampling queries are in addition to the need to run many different sets of parameters along the optimization trajectory, which introduces a "circuit switching" latency. Given that many of today's quantum computers are available in a cloud setting, there can be additional overhead due to network latency [18]. These latencies can also vary among different architectures; for example, a superconducting transmon quantum processor has measurement times on the order of a few microseconds [19], while a trapped ion system can take hundreds of microseconds [20,21]. This is, of course, in addition to gate times and reset times, all of which increase the time to take a single sample. Designing optimization algorithms that take these various times into account is important for minimizing the total amount of potentially expensive quantum computer time used.
A few stochastic gradient methods have been developed specifically for the quantum regime that attempt to minimize the number of shots [22], as well as some that attempt to minimize total wall time [18]. However, stochastic gradient methods, including the popular Adam optimizer [23], are typically deployed in applications where only moderate accuracy is required. In quantum chemistry, where VQE is used extensively, there is a standard level of accuracy, known as "chemical accuracy," that is desirable because it is the minimum accuracy required to allow for chemically relevant predictions [24]. Many stochastic gradient methods also require careful hyperparameter tuning [25,26], particularly in the selection of a step-size parameter. In VQAs, each sample (or observation) can be expensive because quantum computer time is highly limited; extensive hyperparameter sweeps are therefore untenable.

Our Contributions
Given these circumstances, we suggest a new optimization algorithm, SHOt-Adaptive Line Search (SHOALS). In each iteration, SHOALS computes a stochastic estimate of the gradient of the cost function, computes a trial step, and evaluates the cost function at both the current set of parameters and the trial step. If sufficient decrease within a confidence interval is detected, the step is accepted, and the trial step becomes the current set of parameters for the next iteration. In every iteration, SHOALS updates certain accuracy parameters that, when coupled with conservative concentration inequalities, determine sample sizes for computing function and gradient values.
The convergence theory behind SHOALS is derived from ALOE [27], which in turn is an extension of previous work in [28,29,30]. These works in adaptive optimization consider the quantities involved in deterministic methods of optimization (in particular, trust region methods and line search methods) and replace these quantities with random variables to yield an analyzable stochastic process. As a result, one derives stochastic variants of deterministic methods that attain, up to constants and with high probability, the worst-case iteration complexity guarantees of the deterministic methods; SHOALS achieves this by mimicking the dynamics of gradient descent with a line search. These results are significant because the iteration complexities of deterministic methods are of a different order than those of stochastic methods. For instance, in the worst case, given $f$ with Lipschitz continuous gradient, the number of iterations that a deterministic gradient descent method requires to attain $\|\nabla f(\theta^k)\| \leq \epsilon$ is on the order of $1/\epsilon^2$; this bound is known to be tight [31,32].¹ In the same class of problems, the number of iterations for a first-order method based on the classical Robbins–Monro stochastic gradient iteration to achieve the same degree of stationarity is on the order of $1/\epsilon^4$ [36], which is strictly worse.² In the quantum computing setting, adopting an algorithm like SHOALS translates to significantly lower total time, since certain large parts of the latency costs are paid per iteration rather than per shot, at the expense of potentially higher (but dynamic) numbers of shots. We note that a recent paper on another classical optimizer for VQAs employs a Bayesian line search [39], which is fundamentally different from the line search considered in this paper.
We stress that SHOALS is not a new algorithm but rather a practical extension of ALOE [27]. Specifically, SHOALS allows for per-parameter sampling of directional derivatives, enabled by parameter shift gradients. However, this is, to our knowledge, the first work to explicitly consider adaptive stochastic methods for use in VQAs, and we illustrate, in terms of latency cost and shot execution trade-offs, why such algorithms are extremely promising in this setting, even though they were not explicitly designed for it.
We provide numerical experiments demonstrating the efficacy of SHOALS on a variety of chemical molecules. Depending on the specifics of a simulated quantum computing environment, SHOALS can reduce the time needed to reach chemical accuracy by several orders of magnitude compared with other state-of-the-art optimization algorithms, such as iCANS [22] and Adam. In particular, our comparisons employ various measures of latency, showing how circuit switching and network latency costs affect the comparisons. As suggested by the theory on which we elaborate in this paper, we find that SHOALS frequently outperforms popular methods based on the stochastic gradient iteration, especially in settings where latency cost is high compared with shot cost. Our algorithm and numerical experiments highlight the importance of making more efficient use of limited quantum resources, allowing for further exploration of more interesting problems.

¹ In general, the worst-case iteration complexity for any unconstrained deterministic first-order optimization method to attain $\|\nabla f(\theta^k)\| \leq \epsilon$ given a Lipschitz continuous gradient and Hessian is on the order of $\epsilon^{-1.75}$ [33]. Such a theoretically best worst-case rate is indeed attained by accelerated gradient descent [34,35].

² In general, the theoretical best worst-case iteration complexity for any unconstrained stochastic first-order optimization method to attain $\|\nabla f(\theta^k)\| \leq \epsilon$ (the definition of which must be more carefully provided than in this footnote) is on the order of $1/\epsilon^3$ [37]. Such a theoretical best rate is attained, for instance, by SPIDER [38], at the cost of evaluating two stochastic gradients per iteration (as opposed to the one stochastic gradient per iteration required by the classical Robbins–Monro iteration).

Variational Quantum Eigensolver
In the VQE, a standard example of a VQA [10], we seek an approximation of the lowest eigenvalue of a Hermitian matrix, $H$, often known as the Hamiltonian. VQE relies on what is known as the "variational principle," which states that

$$E_0 \leq \langle \psi(\theta) | H | \psi(\theta) \rangle, \qquad (1)$$

where $E_0$ is the true value of the lowest eigenvalue (often known as the ground state energy) and $|\psi(\theta)\rangle$ is any parameterizable function (typically known as an ansatz). The variational principle, Eq. (1), states that an upper bound for the true ground state energy can be obtained by measuring the energy (that is, computing $\langle \psi(\theta)|H|\psi(\theta)\rangle$) for any function $|\psi(\theta)\rangle$. A flexible, expressive wavefunction $|\psi(\theta)\rangle$ and efficient, effective optimization are key to realizing high-quality approximations of the true ground state energy. Because of their natural ability to express quantum mechanics, quantum computers are excellent candidates for preparing various ansatzes. For standard quantum chemistry problems, we can write the Hamiltonian, $H$, after preprocessing through a fermion-to-spin transformation such as the Jordan–Wigner transformation [40], as a sum of terms,

$$H = \sum_{j=1}^{N} a_j P_j, \qquad (2)$$

where $a_j$ is a (typically real-valued) constant prefactor and $P_j$ is a Pauli string. Note that the $N$ here is $O(N_o^4)$, where $N_o$ is the number of orbitals used (and is often equal to the number of qubits). Each Pauli string can be measured efficiently on a quantum computer. The VQE problem can thus be expressed as an optimization problem,

$$\min_{\theta} \; \sum_{j=1}^{N} a_j \langle \psi(\theta) | P_j | \psi(\theta) \rangle. \qquad (3)$$

Each of the terms in the sum in (3) is independently estimated stochastically by repeatedly querying the quantum computer for some number of shots.
For convenience of notation in discussing an optimization algorithm, in the remainder of this paper we will abuse notation and rewrite (3) as

$$f(\theta) = \sum_{j=1}^{N} a_j P_j(\theta), \qquad (4)$$

where $P_j(\theta)$ denotes $\langle \psi(\theta)|P_j|\psi(\theta)\rangle$. By using the parameter shift rule [41], partial derivatives for the types of ansatzes used in this work can be calculated as

$$\partial_i f(\theta) = \frac{f\left(\theta + \tfrac{\pi}{2} e_i\right) - f\left(\theta - \tfrac{\pi}{2} e_i\right)}{2}, \qquad (5)$$

where $e_i$ represents the unit vector in the direction of parameter $i$. We utilize a Trotterized, unitary coupled cluster, singles and doubles (UCCSD) ansatz [42,43] for this work, defined as

$$|\psi(\theta)\rangle = \prod_{k=1}^{d} \prod_{j} e^{\theta_{j,k}\left(\hat{t}_j - \hat{t}_j^\dagger\right)} |\psi_0\rangle, \qquad (6)$$

where $d$ represents what we term depth, which is the number of times to repeat the basic Trotterized UCCSD ansatz with unique parameters (similar to the $k$ parameter in Ref. [42]), and $\hat{t}_j$ is either a singles ($\hat{t}_j = \hat{a}_k^\dagger \hat{a}_l$) or doubles ($\hat{t}_j = \hat{a}_k^\dagger \hat{a}_l^\dagger \hat{a}_m \hat{a}_n$) cluster operator. We note that our theory should extend to any ansatz, such as those generated in qubit coupled cluster [44] or hardware-efficient ansatzes [14], and can be embedded within methods that generate more compact ansatzes, such as ADAPT-VQE [45] and unitary selective coupled cluster [46].
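To make the parameter shift rule (5) concrete, here is a minimal sketch in which a trigonometric toy function stands in for the measured cost $f(\theta)$ (no quantum backend is involved; the prefactors and parameter values are hypothetical, chosen so that each parameter enters through a single rotation, the case in which the shift rule is exact):

```python
import numpy as np

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Parameter-shift gradient: exact when the dependence of f on each
    parameter has the form a*cos(t) + b*sin(t) + c."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0
        grad[i] = (f(theta + shift * e_i) - f(theta - shift * e_i)) / 2.0
    return grad

# Toy stand-in for <psi(theta)|H|psi(theta)>: a prefactor-weighted sum of
# cosines, one rotation per parameter (prefactors a_j are hypothetical).
a = np.array([0.5, -1.2, 0.3])
f = lambda th: float(np.dot(a, np.cos(th)))

theta = np.array([0.1, 0.7, -0.4])
g = parameter_shift_grad(f, theta)
analytic = -a * np.sin(theta)       # exact gradient of the toy model
assert np.allclose(g, analytic)
```

Note that each gradient evaluation costs two evaluations of $f$ per parameter, which is the source of the factor relating the gradient and function circuit counts discussed later.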

Single-Shot Estimators
A gradient-based optimization algorithm for solving (4) will require the ability to compute $f(\theta)$ and $\nabla f(\theta)$. However, because each term $P_j(\theta)$ in (4) is in fact a random variable for which a shot execution on a quantum computer gives an estimate valued in $\{-1, 1\}$, we cannot directly compute $f(\theta)$. Instead, we obtain what we call a single-shot estimator $f(\theta; \xi)$ of $f(\theta)$, which is computed by taking one observation of each Pauli string $P_j(\theta)$ and returning the linear combination specified by the prefactors $\{a_j\}$. We note that although we call $f(\theta; \xi)$ a single-shot estimator, it does not cost one shot to obtain; rather, a single-shot estimator costs one shot per circuit necessary to evaluate every Pauli string, which is typically $O(N_q^4)$, where $N_q$ is the number of qubits used. Using operator grouping, one potentially can reduce this number to $O(N_q^3)$ [47]. The random variable $\xi$ in $f(\theta; \xi)$ is intended to encapsulate all the randomness inherent in estimating $f(\theta)$ with a quantum computer.
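The single-shot estimator can be sketched as follows; the prefactors $a_j$ and expectation values $\langle P_j(\theta)\rangle$ below are hypothetical, and each shot outcome is drawn as $\pm 1$ with the matching expectation, mimicking a Pauli-string measurement:

```python
import numpy as np

def single_shot_estimate(a, expvals, rng):
    """One single-shot estimator f(theta; xi): draw one +/-1 outcome per
    Pauli string (one shot per measurement circuit) and return the
    prefactor-weighted sum."""
    outcomes = np.where(rng.random(len(a)) < (1 + expvals) / 2, 1.0, -1.0)
    return float(np.dot(a, outcomes))

rng = np.random.default_rng(0)
a = np.array([0.4, -0.9, 0.25])       # hypothetical prefactors a_j
expvals = np.array([0.3, -0.5, 0.8])  # hypothetical <P_j(theta)> values
f_true = float(np.dot(a, expvals))

# The estimator is unbiased: its mean converges to f(theta).
samples = [single_shot_estimate(a, expvals, rng) for _ in range(200_000)]
assert abs(np.mean(samples) - f_true) < 0.02
```

The variance of this estimator, not just its mean, is what drives the sample-size rules developed in the following sections.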
Analogously, using the parameter shift gradient, Eq. (5), we can compute a single-shot estimator g i (θ; ξ) of each partial derivative ∂ i f (θ). When we assemble the single-shot estimators of partial derivatives into a full gradient vector, we denote this latter quantity g(θ; ξ).

Shot-Adaptive Line Search
Our method is fundamentally based on a stochastic line search method proposed over several papers [29,30] but is closest to one referred to as Adaptive Linesearch with Oracle Estimations, or ALOE [27]. To disambiguate from the use of the word "oracle" in quantum computing, we instead use the name SHOt-Adaptive Line Search, or SHOALS. We describe this method in Algorithm 1 and discuss key elements of the algorithm.
The core steps of each iteration of Algorithm 1 are as follows:

• Compute an estimate $g^k$ of $\nabla f(\theta^k)$ (in practice, $g^k$ will be assembled coordinate-wise from averages of $N_{g_i,k}$ samples of $g_i(\theta^k; \xi)$; see (7)).
• Set a trial point $s^k \leftarrow \theta^k - \alpha_k g^k$.
• Compute function estimates $f_0^k$ and $f_s^k$ of $f(\theta^k)$ and $f(s^k)$, respectively (in practice, $f_0^k$ and $f_s^k$ will be computed as averages of $N_{f,k}$ samples of $f(\theta^k; \xi)$ and $f(s^k; \xi)$; see (9)).
We note that Algorithm 1 would be a classical line search method with an Armijo condition, with the notable differences that:
• Line 10 would be replaced with $\alpha_k \leftarrow 1$;
• the estimates $g^k$, $f_0^k$, and $f_s^k$ would be replaced with the respective true values $\nabla f(\theta^k)$, $f(\theta^k)$, and $f(s^k)$; and
• $\epsilon_f$ would always be set to 0.
Under particular conditions on the estimates $g^k$, $f_0^k$, and $f_s^k$, as well as on the determination of $\epsilon_f$ in Line 7, Jin et al. [27] demonstrate a special kind of convergence result concerning the iterates of Algorithm 1. We now elaborate on these conditions, which will be enforced in SHOALS.
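The accept/reject and step-size dynamics of Algorithm 1 can be sketched as follows, under the simplifying assumption that exact function and gradient values stand in for the stochastic estimates (the hyperparameter names $\gamma$, $c$, and $\alpha_{\max}$ mirror those used in our implementation; the function name and toy objective are ours):

```python
import numpy as np

def adaptive_line_search(grad_f, f, theta0, alpha0=1.0, alpha_max=1.0,
                         gamma=2.0, c=0.2, eps_f=0.0, iters=200):
    """Skeleton of the SHOALS/ALOE iteration with exact f and grad f in
    place of the stochastic estimates. The relaxation by 2*eps_f is what
    tolerates noisy function estimates; with eps_f = 0 this reduces to a
    classical Armijo line search with an adaptive step size."""
    theta, alpha = np.asarray(theta0, float), alpha0
    for _ in range(iters):
        g = grad_f(theta)
        s = theta - alpha * g                        # trial point
        if f(theta) - f(s) >= c * alpha * np.dot(g, g) - 2 * eps_f:
            theta = s                                # sufficient decrease: accept
            alpha = min(gamma * alpha, alpha_max)    # and grow the step size
        else:
            alpha = alpha / gamma                    # reject: shrink the step size
    return theta

# Toy smooth objective f(theta) = 0.5*||theta||^2, grad f(theta) = theta.
f = lambda th: 0.5 * float(np.dot(th, th))
grad_f = lambda th: th
theta = adaptive_line_search(grad_f, f, [2.0, -1.5])
assert np.linalg.norm(grad_f(theta)) < 1e-6   # reaches a stationary point
```

In SHOALS itself, each call to `f` and `grad_f` above is replaced by an average of adaptively many single-shot estimators, as described next.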

Gradient Estimation
In our setting, a directional derivative estimate (see (5)) is computed by averaging a number $N_{g_i,k}$ of single-shot estimates of the directional derivative, $g_i(\theta^k; \xi)$. As a result, the gradient estimate $g^k$ can be written $g(\theta^k; \xi^k)$, where $\xi^k$ denotes a realization of the random variable $\Xi^k$ representing all possible shot sequences that could be observed when obtaining $N_{g_i,k}$ many single-shot estimates of each $\partial_i f(\theta^k)$ and concatenating the estimates into an $n$-dimensional gradient vector. We stress that in our notation, whenever $\xi$ has a superscript ($k$), there is an implied (set of) sample size(s) $N_{g_i,k}$ to distinguish it from the atomic random variable $\xi$ involved in the single-shot estimator.
We give full reasoning in the appendix, but we ultimately require, for each $i = 1, \ldots, n$, that the sample size satisfy (7) for some $p \in (0, \tfrac{1}{2n})$ and some $\epsilon_g > 0$. In (7), $L_i$ denotes a global Lipschitz constant of $\partial_i f(\theta)$ and $V[g_i(\theta^k; \xi)]$ denotes the variance of the single-shot estimator $g_i(\theta^k; \xi)$. Loosely speaking, enforcing (7) will guarantee that, with high probability ($1 - p$), the error in the estimate $g_i(\theta^k; \xi^k)$ of $\partial_i f(\theta^k)$ is bounded in such a way that it does not interfere with the gradient descent dynamics. As in [22], we note that we can obtain an upper bound on each $L_i$ because of the parameter shift rule and the expression of the cost function as a prefactor-weighted sum of Pauli strings. In particular, given an expression for the non-mixed second partial derivative $\partial_i \partial_i f(\theta^k)$, obtained by applying a parameter shift (5) twice, we can bound each $L_i$ by the sum of the absolute values of the resulting prefactors.
To make practical use of (7), it remains to handle the unknown variance term in the numerator. The variance of the estimator $g_i(\theta^k; \xi)$ is unknowable because computing a population variance requires knowing the expectation $\partial_i f(\theta^k)$. In our implementation we simply compute the sample variance of the sample of size $N_{g_i,k}$ before the end of the $k$th iteration and use it as an estimate of the variance in the $(k+1)$st iteration. It is theoretically reasonable to believe that the variance $V[g_i(\theta^k; \xi)]$ should not change too rapidly in $\theta^k$, justifying this choice; we have also verified this assumption empirically. To summarize, our practical sample size is given by (8), where $s^2_{g_i,k}$ is the sample variance estimate of $g_i(\theta^k; \xi)$ obtained in the previous iteration.

Function Value Estimation
Similarly to the gradient estimation, we average a number $N_{f,k}$ of single-shot estimates of $f(\theta^k)$ and $f(s^k)$ to yield estimates $f(\theta^k; \xi^k)$ and $f(s^k; \xi^k)$, with $N_{f,k}$ chosen to satisfy (9), where $\epsilon_f > 0$ is a parameter and $p \in (\tfrac{1}{2}, 1)$. Formal reasoning is given in the appendix, but intuitively, averaging $N_{f,k}$ many samples guarantees with high probability that the error incurred by $f(\theta^k; \xi^k)$ is bounded by a constant $\epsilon_f$ that dictates the accuracy to which we intend to reduce the optimality gap of our final solution.
In our practical implementation, to avoid oversampling, we additionally employ a criterion resembling past work and the criterion in (8) (see [48,29]), computing the practical sample size by using a sample variance estimate of $f(\theta^k; \xi)$ that would have been computed in the previous iteration.

Theoretical Results concerning SHOALS
In [27], a more formal statement of the following result is proven.
Theorem 2.1 (informal). Suppose the conditions on the estimates described above hold. Then, with high probability, the number of iterations of Algorithm 1, $T$, required to attain $\|\nabla f(\theta^k)\| \leq \epsilon$ satisfies $T \in O(1/\epsilon^2)$. The probability in this result is with respect to the $\sigma$-algebra of the realizations of $\xi^k$ observed over the run of the algorithm.
We remark on some important aspects of Theorem 2.1. First, as indicated by the big-O notation, this result provides a worst-case guarantee; virtually never in practice is a stopping time on the order of $1/\epsilon^2$ iterations realized. However, the result of Theorem 2.1 recovers, up to small constants, the worst-case iteration complexity result for deterministic gradient descent algorithms, which is known to be tight [31].

Comparison with Stochastic Gradient Methods
The result in Theorem 2.1 should be contrasted with known results for the worst-case performance of stochastic gradient methods. We provide a prototypical stochastic gradient method in Algorithm 2.

Algorithm 2: Generic Stochastic Gradient Method
1 Initialization: Choose step size $\alpha$, initial point $\theta^0 \in \mathbb{R}^n$
2 for $k = 0, 1, 2, \ldots$ do
3   Compute an estimate $g^k$ of $\nabla f(\theta^k)$.
4   Set $\theta^{k+1} \leftarrow \theta^k - \alpha g^k$.

Results in Corollary 2.2 of Ref. [36] and Theorem 4.8 of Ref. [49] show essentially that in order to guarantee the bound in (13), one must run a stochastic gradient method for a number of iterations $T$ on the order of $1/\epsilon^4$ (14), provided the step size $\alpha$ is chosen sufficiently small, typically satisfying a bound like $\alpha \leq 1/L_g$. For ease of presentation, we deliberately ignore in (13) and (14) an additional dependence on the variance of $g^k$, which decreases linearly with the sample size or "batch size" $b$. This is a significant omission because in the worst case one requires $b \in \Theta(T)$, with $T$ as in (14), in order for theoretical results such as (13) to hold. This is certainly concerning in settings where accuracy matters (e.g., chemical accuracy in the problems we are considering), but in many machine learning applications it is considerably less concerning. In light of this intentional omission, we remark that the worst-case results we present for SG methods err on the side of optimism. The bound in (13) can be interpreted as saying that, given a fixed number of iterations $T$, the expected value (over the $\sigma$-algebra of all sources of randomness) of the average (over $k = 1, \ldots, T$) stationarity measure $\|\nabla f(\theta^k)\|$ is bounded. As a corollary, one can show that if one selects $R \in \{1, \ldots, T\}$ uniformly at random, then $\mathbb{E}[\|\nabla f(\theta^R)\|] \leq \epsilon$ after $T$ iterations as in (14). These results have been shown to be tight in the general case [50] and improvable to $O(1/\epsilon^{3.5})$ in special cases [51]. Popular variants of SG algorithms like Algorithm 2, such as Adam, have been shown to have virtually the same type and quality of error bounds as in (13); see Theorem 1 of Ref. [52]. Methods such as iCANS [22] are also based on the stochastic gradient iteration and thus are susceptible to the same worst-case performance.
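For contrast with the adaptive loop, the skeleton of Algorithm 2 is only a few lines; the toy objective, noise model, and step size below are our own illustration of its behavior with a fixed small step size:

```python
import numpy as np

def sgd(stoch_grad, theta0, alpha, iters):
    """Algorithm 2 skeleton: fixed step size, one (mini-batched) stochastic
    gradient per iteration, no function evaluations and no accept/reject
    test."""
    theta = np.asarray(theta0, float)
    for _ in range(iters):
        theta = theta - alpha * stoch_grad(theta)
    return theta

rng = np.random.default_rng(1)

# Toy problem: f(theta) = 0.5*||theta||^2 (so L_g = 1) with additive
# gradient noise; the step size respects alpha <= 1/L_g.
stoch_grad = lambda th: th + 0.05 * rng.standard_normal(th.shape)
theta = sgd(stoch_grad, [3.0, -2.0], alpha=0.5, iters=500)
assert np.linalg.norm(theta) < 0.5   # converges to a noise ball around 0
```

Note the contrast with the line search sketch: without increasing batch sizes, the iterates stall in a noise-dominated neighborhood of the solution rather than reaching arbitrary accuracy.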
The trade-off between the $O(1/\epsilon^2)$ worst-case complexity for algorithms of the class in Algorithm 1 and the $O(1/\epsilon^4)$ worst-case complexity for algorithms of the class in Algorithm 2 is that the estimates in Algorithm 1 require more samples (or, in the context of our parameter shift problems, shots) per iteration to satisfy Condition A.1 and Condition A.2. The appropriate metric for comparing these algorithms is the total work, namely, a function of latency and shot acquisition times.
We consider the following simplified setting. We note that similar analyses (see [53] and [49, Section 4.4]) have been done to explain the apparent empirical dominance of stochastic gradient methods in empirical risk minimization settings, but the conclusions we will draw are very different because the quantum computing setting differs markedly from general machine learning settings in terms of overhead costs. We denote the shot acquisition time by $c_1$, the latency time of circuit switching by $c_2$, and the communication cost time by $c_3$. All times vary from architecture to architecture. Shot acquisition time includes the time to reset the circuit to the initial state, the time to run the circuit, and the time to record a single measurement, which can range from around 100 µs in superconducting qubits [54] to around 200 ms in neutral atom systems [55]. Circuit switching time includes the time needed to compile a circuit and then to load it onto the quantum computer's control system. For superconducting qubits, compilation can take 200 ms, while interacting with the arbitrary waveform generator can take around 25 ms [54]. Communication cost includes the time needed to communicate between the computer running the classical optimization algorithm and the actual quantum computer. This can range from nearly zero when the two devices are in the same room to many seconds when instructions need to be sent over the Internet [18]. We note that every iteration of Algorithm 1 requires two communications between the quantum and classical devices: one to obtain $g(\theta^k; \xi^k)$ and a second to obtain the estimates $f(\theta^k; \xi^k)$ and $f(s^k; \xi^k)$. Every iteration of Algorithm 2 requires only one communication, to obtain $g(\theta^k; \xi^k)$.
Thus, denoting by $t_\bullet^f$ and $t_\bullet^g$ the number of circuits required to evaluate the one-shot estimators $f(\theta^k; \xi)$ and $g(\theta^k; \xi)$, respectively, we can summarize the overhead per-iteration latency cost of each iteration as in Table 1. Now denote by $t_s^f$ and $t_s^g$ the number of shots requested per iteration. By Theorem 2.1, if we choose $\epsilon = \epsilon_g < 1$ and $\epsilon_f = \epsilon^2$, then in the worst case, and up to constants, Algorithm 1 will require on each iteration $1/\epsilon^2$ samples of $g(\theta^k; \xi)$ and $1/\epsilon^4$ samples of $f(\theta^k; \xi)$, yielding $t_s^g \leq t_\bullet^g/\epsilon^2$ and $t_s^f \leq t_\bullet^f/\epsilon^4$. On the other hand, each iteration of Algorithm 2 will require only a constant number of samples, $b$, of $g(\theta^k; \xi)$; that is, without loss of generality, and up to constants, $t_s^g = b\, t_\bullet^g$. All of these costs are summarized in Table 1.
We note that SHOALS incurs the communication cost $c_3$ only on the order of $1/\epsilon^2$ times, whereas the cost is incurred on the order of $1/\epsilon^4$ times for SG methods. Additionally, the latency switching cost $c_2$ appears only on the $1/\epsilon^2$ term for SHOALS, whereas it appears on the order of $1/\epsilon^4$ times for SG methods. On the other hand, SHOALS has an unfortunate $1/\epsilon^6$ dependence; however, if $c_1$ is small, that is, if the cost of shot acquisition is small relative to the total latency and communication costs $c_3 + c_2(t_\bullet^f + t_\bullet^g)$, then the $1/\epsilon^6$ term may not dominate.
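The cost accounting above can be sketched numerically. The formulas below follow the Iterations × (Per-Iteration Latency Cost + Per-Iteration Shots Cost) structure of Table 1; the circuit counts and timing constants are hypothetical stand-ins chosen to resemble a fast-shot, high-latency (superconducting-like) regime:

```python
def shoals_cost(eps, c1, c2, c3, t_f, t_g):
    """Worst-case total-cost model for SHOALS (Algorithm 1)."""
    iters = 1 / eps**2                           # O(1/eps^2) iterations
    latency = 2 * c3 + c2 * (t_f + t_g)          # two communications/iteration
    shots = c1 * (t_g / eps**2 + t_f / eps**4)   # adaptive sample sizes
    return iters * (latency + shots)

def sg_cost(eps, c1, c2, c3, t_g, b=1):
    """Worst-case total-cost model for a generic SG method (Algorithm 2)."""
    iters = 1 / eps**4                           # O(1/eps^4) iterations
    latency = c3 + c2 * t_g                      # one communication/iteration
    shots = c1 * b * t_g                         # constant batch size
    return iters * (latency + shots)

# Fast shots (10 us), non-trivial switching (0.1 s) and cloud (4 s) latency;
# t_f and t_g are hypothetical circuit counts with t_g = 2*t_f here.
eps, c1, c2, c3 = 0.04, 1e-5, 0.1, 4.0
t_f, t_g = 100, 200
assert shoals_cost(eps, c1, c2, c3, t_f, t_g) < sg_cost(eps, c1, c2, c3, t_g)
```

Raising $c_1$ (slow shots) in this model eventually flips the inequality, which is exactly the regime dependence the analysis predicts.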
To quantify this trade-off a bit more, we make the simplifying assumptions that $t_\bullet^g = 2n t_\bullet^f$, where $n$ is the dimension of the parameter vector $\theta$, which is likely correct up to constants in the parameter shift setting [41], and that the batch size employed by the SG method is $b = 1$, which is correct up to constants in practice. Then, in this setting and per Table 1, one can derive that a sufficient condition for the total cost of SHOALS being less than the total cost of an SG method is the satisfaction of (16), still assuming without loss of generality that $L_g > 1$.

Table 1: Summary of total worst-case costs, Iterations × (Per-Iteration Latency Cost + Per-Iteration Shots Cost), of SHOALS and a stochastic gradient (SG) method to achieve $\epsilon$-stationarity. The total worst-case cost is computed as the number of iterations to achieve $\epsilon$-stationarity multiplied by the sum of the per-iteration costs. We recall that $t_\bullet^f$ and $t_\bullet^g$ denote the number of circuits needed to compute $f(\theta^k; \xi)$ and $g(\theta^k; \xi)$, respectively. For an SG method, we denote a constant batch size parameter by $b$.
That is, roughly speaking, if the cost of obtaining a single shot is smaller by a factor of $\epsilon^2$ than the overhead latency cost of one iteration of an SG method, normalized by both the number of circuits needed to evaluate $f(\theta^k; \xi)$ and the gradient Lipschitz constant, then SHOALS is expected to yield better worst-case guarantees in terms of total cost than an SG method does. Looking toward the future deployment of VQAs: as $n$ becomes large, $L_g$ ought to increase linearly in $N$ in (4) and hence linearly in $n$; therefore, (16) asymptotically amounts to $c_1 < \epsilon^2 c_2$.
In our experiments we examine this trade-off empirically. To summarize, these theoretical insights suggest that, asymptotically, when the shot acquisition time is less than a factor of $\epsilon^2$ times the latency switching time, we expect SHOALS to outperform methods based on the iteration in Algorithm 2.
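A minimal encoding of this rule of thumb; the timing values are illustrative, loosely inspired by the fast-shot (superconducting-like) and slow-shot (neutral-atom-like) regimes discussed in the text:

```python
def shoals_favored(c1, c2, eps):
    """Asymptotic rule of thumb from the discussion above: SHOALS is
    expected to win when a single shot costs less than eps^2 times the
    circuit switching latency (c1 < eps**2 * c2)."""
    return c1 < eps**2 * c2

eps = 0.04  # eps**2 = 0.0016, the chemical-accuracy scale used later

# Fast shots (10 us) vs. 0.1 s switching: SHOALS-favorable regime.
assert shoals_favored(1e-5, 0.1, eps) is True
# Slow shots (200 ms) vs. 25 ms switching: SG-favorable regime.
assert shoals_favored(200e-3, 25e-3, eps) is False
```

This one-line test is the quantity to estimate for a given hardware platform before choosing between the two algorithm families.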

Numerical Results
We note that, similar to stochastic gradient methods, SHOALS is remarkably simple to implement. It has also been documented in past work that adaptive methods like SHOALS are far less sensitive to changes in hyperparameters (such as step sizes) than stochastic gradient methods are, virtually eliminating the need for tuning hyperparameters. In our implementation of SHOALS, we chose γ = 2, c = 0.2, α max = 1, and α 0 = 1.
In our implementation, when computing sample sizes via (8) and (10) we set $p = 0.1$. Motivated by Theorem 2.1, we ignore constant terms and set $\epsilon_g = \sqrt{\epsilon_f}$, where, in turn motivated by chemical accuracy [24], we set $\epsilon_f = 0.0016$. We note that because $\epsilon \in \Theta(\epsilon_g)$ per Theorem 2.1, the factor of $\epsilon^2$ that played a critical role at the end of our discussion in Section 2.4 is effectively $\epsilon_f$. We first consider a hypothetical, forward-looking superconducting hardware environment outlined in Ref. [18]; in particular, we set $c_1 = 1.0 \times 10^{-5}$ s, $c_2 = 0.1$ s, and $c_3 = 4.0$ s. Before showing any results, we remark that in this setting $c_1 < \epsilon_f c_2$; and so, from our discussion in Section 2.4, the theory suggests that in worst-case settings and for larger $n$, we expect SHOALS to outperform all of the stochastic gradient methods. Using these hardware times, we display in our figures the total simulated time on the x-axis, where $t_s^f$ and $t_s^g$ (the numbers of shots requested by a method per iteration) are the actual values requested in the simulations.
We consider three quantum chemistry problems, for which we give relevant statistics in Table 2. For all molecules we use a parity fermion-to-spin mapping with two-qubit reduction [56] to reduce the number of qubits, as implemented in Qiskit [57]. We simulate H$_2$ with two qubits at an equilibrium distance of 0.74 Å. For LiH, we use an equilibrium distance of 1.595 Å, and we freeze core orbitals (as in [14]), leading to only four qubits. We use a square geometry for H$_4$, with each side having equilibrium length 1.23 Å [42], comprising six qubits.

Table 2: Relevant statistics for tested problems. As in preceding sections, $n$ denotes the number of parameters in the ansatz, $t_\bullet^f$ denotes the number of circuits that must be evaluated to obtain an evaluation of $f(\theta^k; \xi)$, and $t_\bullet^g$ denotes the number of circuits that must be evaluated to obtain an evaluation of $g(\theta^k; \xi)$.

Comparison of SHOALS with Other Solvers
We first compare the performance of SHOALS with two stochastic gradient methods: an implementation of iCANS and an implementation of Adam. For iCANS, we used the same hyperparameter settings described in [22], with a minimum sample size set equal to 30, and implemented what the authors refer to as "iCANS1." For Adam, we employed a constant step size of $1.0/L_g$, momentum hyperparameters of 0.9 and 0.999, and a denominator offset hyperparameter of $10^{-8}$; these settings are standard in practice. We experimented with batch sizes (in the quantum setting, the number of single-shot estimators $g(\theta^k; \xi)$ averaged per iteration) of both 100 and 1000. We remark that one could tune all of the hyperparameters of both iCANS and Adam to yield variations in practical performance, but one intention of these experiments is to demonstrate that standard settings of SHOALS' hyperparameters yield relatively good performance compared with standard settings of other solvers. Furthermore, we remark again that in actual NISQ settings, one would not want to expend budgets performing hyperparameter tuning.
Additionally, we compare the performance of SHOALS with a deterministic first-order method, an implementation of L-BFGS [58]. This implementation is the standard one provided in SciPy and Qiskit and was not designed to handle stochasticity in the function evaluation. Therefore, such a comparison is not particularly fair, since an L-BFGS optimizer has no guarantee of first-order convergence in such a setting. We experiment with L-BFGS employing batch sizes of both 100 and 1000. We remark that there is ongoing work on the development of L-BFGS optimizers capable of handling stochastic estimates of function and gradient values [59,60]. In fact, Paquette and Scheinberg [29] discuss how to incorporate approximate second-order information, such as that provided by L-BFGS techniques, into the adaptive line search framework. We leave the incorporation of such second-order information into SHOALS as future work, since it is not the focus of this paper.
Results are illustrated in Figure 1, and additional statistics are given in Table 3. We see that, in this superconducting setting, the median wall-clock time performance of SHOALS is consistently better than that of any of the compared stochastic gradient and L-BFGS methods. As expected, on all of the tested problems, the median performance of L-BFGS with a fixed batch size fails to attain chemical accuracy. For this reason, we chose not to present that solver for the largest tested molecule, H$_4$.

Figure 2: For each of the tested molecules, we display the ratio of the wall-clock time needed for SHOALS to achieve chemical accuracy over the wall-clock time needed for Adam-100 to achieve chemical accuracy. The dashed line denotes the median value of the ratio, while the transparent band shows the 25th through 75th percentiles of the ratio. We also display in each plot the value of $c_1/c_2$ where the median ratio equals 1, that is, where the wall-clock time of the two solvers is equal.
In light of our discussion in Section 2.4 and the inherent uncertainty in the specific timings of the superconducting setting, we additionally present in Figure 2, for the same molecules, the median (and first and third quartiles) of the ratio of the wall-clock time needed for SHOALS to attain chemical accuracy to the wall-clock time needed for Adam-100 to attain chemical accuracy when we fix c_3 = 0 (i.e., we completely disregard cloud communication time) and vary the ratio c_1/c_2.
As expected from our discussion in Section 2.4, for each tested molecule, as c_1/c_2 tends to 0, the advantage of using SHOALS over a stochastic gradient method is clear. What merits further investigation, but requires testing significantly larger molecules, is whether the value of c_1/c_2 at the break-even point indeed decreases with n. One explanation would be that our predictions are based on worst-case behavior, and the optimization problems we are testing certainly do not exhibit worst-case behavior. In any case, it would be foolhardy to extrapolate any trend from three data points; testing molecules that require orders of magnitude more circuits to compute a single gradient will involve an extensive computational campaign, which is a subject for future work.
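The break-even computation underlying Figure 2 can be sketched directly. We assume the wall-clock model suggested by Section 2.4 and Table 3, time = c_1·(shots) + c_2·(switches) + c_3·(communications), with c_3 = 0 as in the figure; the solver counts below are made-up placeholders, not values from Table 3.

```python
def wall_clock(shots, switches, comms, c1, c2, c3=0.0):
    """Assumed wall-clock model: per-shot, per-circuit-switch, and
    per-communication costs c1, c2, c3, as suggested by Section 2.4."""
    return c1 * shots + c2 * switches + c3 * comms

def breakeven_ratio(shots_a, switches_a, shots_b, switches_b):
    """Value of c1/c2 (with c3 = 0) at which solvers A and B tie.

    Setting c1*shots_a + c2*switches_a = c1*shots_b + c2*switches_b and
    dividing by c2 gives c1/c2 = (switches_b - switches_a) / (shots_a - shots_b).
    """
    return (switches_b - switches_a) / (shots_a - shots_b)

# Hypothetical counts: a line-search-style solver spends more shots but far
# fewer circuit switches than a stochastic gradient solver.
shots_ls, switches_ls = 5.0e7, 2.0e4
shots_sg, switches_sg = 1.0e7, 2.0e5
r = breakeven_ratio(shots_ls, switches_ls, shots_sg, switches_sg)
print(r)  # for c1/c2 below this value, the line-search-style solver wins
```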

Conclusions and Future Work
In this paper we introduce a new optimization algorithm, SHOt-Adaptive Line Search (SHOALS). We provide theoretical arguments and numerical evidence that SHOALS should outperform other state-of-the-art optimization algorithms when the shot acquisition time is much less than the circuit switching time. This is because the dynamics of the deterministic line-search method that SHOALS stochastically approximates require fewer iterations, but more shots per iteration, to reach a given level of stationarity. Standard stochastic gradient methods, on the other hand, can utilize smaller numbers of shots per iteration to make expected improvement in the cost function. The latency regime in which SHOALS is expected to perform better than stochastic gradient methods is realized by today's superconducting qubit quantum computers [18]. However, when the circuit switching time and shot acquisition time are closer in magnitude, as in trapped-ion systems [20], we show, theoretically and numerically, that stochastic gradient methods are expected to outperform SHOALS. Because quantum devices have a non-negligible per-iteration overhead (from the need to switch between many circuits), these latency considerations are important in determining which algorithm to use in variational quantum algorithms, whereas in the optimization of classical functions, such as those encountered in machine learning, these considerations are considerably less important.
In future work we intend to investigate the possibility of tighter worst-case complexities for SHOALS, in terms of the wall-clock time metric, than those presented in Table 1. Indeed, as a simplifying assumption, we assumed that SHOALS requires O(1/ε^2) samples of g(θ_k; ξ) and O(1/ε^4) samples of f(θ_k; ξ) and f(s_k; ξ) per iteration, but this is in fact necessary only when k is near the stopping time T. A more rigorous analysis would consider the total work of SHOALS, such as in the analysis performed in Ref. [61].
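The total-work idea can be made concrete with a toy accounting. If the per-iteration sample requirement scales like 1/max(||g_k||, ε)^2 rather than a flat 1/ε^2, the worst-case cost is paid only once the gradient norm nears ε, i.e., near the stopping time T. The schedule, the constant c, and the gradient-norm trajectory below are all illustrative assumptions, not quantities from the paper.

```python
def adaptive_sample_sizes(grad_norms, eps, c=1.0):
    """Toy total-work schedule: roughly c / max(||g_k||, eps)^2 gradient
    samples at iteration k, so the O(1/eps^2) worst case is paid only when
    ||g_k|| approaches eps (illustrative, not the SHOALS rule)."""
    return [round(c / max(g, eps) ** 2) for g in grad_norms]

# A hypothetical decaying gradient-norm trajectory:
norms = [1.0, 0.5, 0.2, 0.1, 0.05, 0.01]
sizes = adaptive_sample_sizes(norms, eps=0.01)
print(sizes)  # sample sizes grow as the iterate approaches stationarity
```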
We also intend to pursue multiple directions in the practical implementation of SHOALS. In one direction, we are interested in incorporating (approximate) second-order information into SHOALS in order to further decrease the iteration complexity and hence further reduce the total accumulated per-iteration latency (switching) costs. As suggested previously, we are inspired by the work in Refs. [59,60,29]. We also intend to investigate an operator-sampling variant of SHOALS, in the style of Ref. [62], in order to further reduce both latency and shot acquisition costs. In particular, on certain iterations we may choose, based on sample variance estimates, to evaluate only a subset of the Pauli strings in (4), or perhaps only a subset of the directional derivative estimates g_i(θ_k; ξ_k) when forming a gradient estimate g_k. Such approaches would resemble doubly stochastic coordinate descent methods, which have been studied in the optimization literature [63].
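One simple variance-based selection rule of the kind envisioned above can be sketched as follows. The function `select_terms`, its greedy coverage rule, and the per-term variances are hypothetical illustrations of operator sampling, not part of SHOALS or Ref. [62].

```python
def select_terms(variances, budget_fraction=0.5):
    """Illustrative operator-sampling rule: keep the Pauli terms with the
    largest estimated single-shot variances until a fixed fraction of the
    total variance is covered; skip the remaining terms this iteration."""
    order = sorted(range(len(variances)), key=lambda i: -variances[i])
    total = sum(variances)
    kept, covered = [], 0.0
    for i in order:
        if covered >= budget_fraction * total:
            break
        kept.append(i)
        covered += variances[i]
    return kept

# Hypothetical per-term variance estimates for a 6-term Pauli decomposition:
vars_ = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
print(select_terms(vars_, budget_fraction=0.5))  # indices of the kept terms
```

A rule like this concentrates the shot budget on the terms that dominate the estimator variance while skipping circuit switches for the low-variance terms.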
Moreover, and as suggested in our numerical experiments, we intend to conduct a larger computational campaign to look at larger molecules in order to ascertain our belief in the asymptotic performance of SHOALS.

Acknowledgements
This work was supported in part by the U.

A.2 Deriving N_f,k
We now state our condition on f(θ_k; ξ_k). The condition for f(s_k; ξ_k) is analogous.

Condition A.2. Define the estimation error as

e(θ_k; ξ_k) = |f(θ_k; ξ_k) − f(θ_k)|.

The estimation error is a one-sided subexponential-like random variable with mean bounded by a constant ε_f > 0. That is, there exist parameters (ν, β) such that

E[e(θ_k; ξ_k)] ≤ ε_f    (18)

and

E[exp(λ(e(θ_k; ξ_k) − E[e(θ_k; ξ_k)]))] ≤ exp(λ^2 ν^2 / 2) for all λ ∈ [0, 1/β].    (19)

When (18) and (19) are satisfied, we say that e(θ_k; ξ_k) is (ν, β)-subexponential.
We now prove Corollary A.1. We first record a result that was proven in [ ]. Because of the form of the single-shot estimators f(θ_k; ξ) in the parameter-shift setting (a finite weighted sum of Bernoulli random variables), we have that |e(θ_k; ξ)| ≤ C deterministically, where C = 2 Σ_i |a_i|. Thus, we can show the following result.
Because the variance V_ξ[f(θ_k; ξ)] is bounded in the parameter-shift setting (f(θ_k; ξ) can be expressed as a finite weighted sum of Bernoulli random variables), Proposition A.1 and Proposition A.2 together give us Corollary A.1.

Table 3: For each tested molecule and for each solver, we display the 25th, 50th, and 75th percentiles (Q25, Q50, Q75, respectively), as well as the mean, of the numbers of shots, switches, and communications needed to attain chemical accuracy in our experiments. Lowest values are highlighted in boldface font. We note that LBFGS-100 and LBFGS-1000 have undefined or "infinite" values for each summary statistic for the two larger molecules and are hence excluded entirely.
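The boundedness argument used in the proof above can be illustrated numerically: by Hoeffding's lemma, any random variable supported on an interval of width 2C satisfies E[exp(λ(X − E[X]))] ≤ exp(λ^2 C^2 / 2) for every λ, which is the kind of moment bound Condition A.2 requires. The two-point distribution below is a stand-in for the bounded single-shot estimation error, with C and p chosen arbitrarily for illustration.

```python
import math

C, p = 1.0, 0.3                 # bound |X| <= C; P(X = +C) = p (illustrative)
mean = (2 * p - 1) * C          # E[X] for the two-point distribution
results = []
for lam in (0.5, 1.0, 2.0):
    # Exact centered moment generating function of X:
    mgf = p * math.exp(lam * (C - mean)) + (1 - p) * math.exp(lam * (-C - mean))
    # Hoeffding's lemma with range 2C: exp(lam^2 * (2C)^2 / 8) = exp(lam^2 C^2 / 2)
    bound = math.exp(lam * lam * C * C / 2)
    results.append((lam, mgf, bound))
    print(lam, round(mgf, 4), round(bound, 4))
```

In every case the exact moment generating function sits below the Hoeffding bound, consistent with the deterministic bound |e(θ_k; ξ)| ≤ C implying the required subexponential behavior.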