An Adaptive Optimizer for Measurement-Frugal Variational Algorithms



Introduction
Patrick J. Coles: pcoles@lanl.gov

There are various strategies to make use of noisy intermediate-scale quantum (NISQ) computers [1]. One particularly promising strategy is to push most of the algorithmic complexity onto a classical computer while running only a small portion of the computation on the NISQ device. This is the idea behind variational hybrid quantum-classical algorithms (VHQCAs) [2]. VHQCAs employ a quantum computer to efficiently estimate a cost function that depends on the parameters of a quantum gate sequence, and then leverage a classical optimizer to minimize this cost. VHQCAs aim to achieve a quantum advantage with NISQ computers by finding short-depth quantum circuits that at least approximately solve some problem. VHQCAs have been proposed for many applications including ground-state preparation, optimization, data compression, simulation, compiling, factoring, diagonalization, and others.
A concern about VHQCAs is that they might require prohibitively many quantum measurements (shots) in order to achieve convergence of the cost function [25], especially for applications like quantum chemistry that require chemical accuracy [26,27]. In response to this concern, there has been a recent explosion of papers looking to improve the measurement frugality of VHQCAs by simultaneously measuring commuting subsets of the Pauli operators needed for the cost function [28][29][30][31][32][33][34].
Here, we approach the problem from a different direction by aiming to improve the classical optimizer. There have been several recent efforts to improve optimizers for VHQCAs [35][36][37][38][39]. Our approach is different from these works in that the optimizer we propose is specifically constructed to achieve measurement frugality. In particular, we develop an optimizer that is adaptive in two senses: it frugally adjusts the number of shots both for a given iteration and for a given partial derivative.
Our method is inspired by the classical machine learning algorithm named Coupled Adaptive Batch Size (CABS) [40]. For pedagogical reasons, we first directly adapt the CABS algorithm to VHQCA applications and call the resulting algorithm Coupled Adaptive Number of Shots (CANS). In order to achieve greater measurement frugality, we go beyond direct adaptation and modify the optimizer to account for differences in the number of shots needed to estimate individual components of the gradient. We call this method individual-CANS (iCANS).
While iCANS is conceptually simple, it nevertheless performs very well. Using IBM's simulator [41], we implement iCANS and other state-of-the-art optimizers such as Adam [42], SPSA [43], and sequential gate optimization [37,38] for both the variational quantum eigensolver [3] and variational quantum compiling [14][15][16][18]. We find that iCANS on average performs the best. This is especially true for our implementations in the presence of noise, i.e., with IBM's simulator of their NISQ device. This is encouraging since VHQCAs must be able to run in the presence of noise to be practically useful.
Ultimately, one can take a multi-pronged approach to reducing measurements in VHQCAs, e.g., by combining our measurement-frugal classical optimizer with the recent advances on Pauli operator sets in Refs. [28][29][30][31][32][33][34]. However, one can also apply our optimizer to VHQCAs that do not involve the measurement of Pauli operator sets (e.g., the VHQCAs in [7][8][9]). In this sense, our work is relevant to all VHQCAs. In what follows, we first give a detailed review of various optimizers used in the classical machine learning and quantum circuit learning literature. We remark that this lengthy review aims to assist readers who may not have a background in classical optimization, as this article is intended for a quantum-computing audience. (Experienced readers can skip to Section 3.) We then present our adaptive optimizer, followed by the results of our numerical implementations.

Gradient Descent
One standard approach to minimization problems is gradient descent, where the optimizer iteratively steps along the direction in parameter space that is locally "downhill" (i.e., decreasing) for some function f(θ). Mathematically, we can phrase the step at the t-th iteration as

θ^{(t+1)} = θ^{(t)} − α ∇f(θ^{(t)}), (1)

where α is called the learning rate. If one takes a large learning rate, one cannot be sure that one will not go too far and possibly end up at a higher point. For a small learning rate one is more likely to keep making incremental progress (assuming the change in slope is bounded), but it will take much longer to reach a minimum. Knowing an upper bound on the slope is therefore very helpful in determining the appropriate learning rate.
To formalize this discussion, we review the notion of Lipschitz continuous gradients. The gradient of a function f is Lipschitz continuous if there exists some L (called the Lipschitz constant) such that

‖∇f(θ^{(t+1)}) − ∇f(θ^{(t)})‖ ≤ L ‖θ^{(t+1)} − θ^{(t)}‖ (2)

for all θ^{(t+1)} and θ^{(t)}. (We note that in our notation the norm ‖·‖ without a subscript denotes the 2-norm, i.e., the Euclidean norm.) When this holds, the fractional change in the gradient over the course of one step is bounded by αL, meaning that for sufficiently small α we can be sure that we are following the gradient. In fact, the convergence of the basic gradient descent method is guaranteed for deterministic gradient evaluations so long as α < 2/L [40]. In machine learning contexts L is usually unknown, but for VHQCAs it is often possible to determine a good bound. We discuss this alongside an analytic formula for estimating gradients for VHQCAs in the next subsection.
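As a concrete illustration of the role of L, the following NumPy sketch runs plain gradient descent with a learning rate chosen below the 2/L threshold. The quadratic toy objective and all numbers are illustrative choices, not from the paper.

```python
import numpy as np

def gradient_descent(grad_f, theta0, lipschitz, n_iters=100):
    # Any learning rate alpha < 2/L guarantees progress for a
    # deterministic gradient; we pick alpha = 1/L to be safe.
    alpha = 1.0 / lipschitz
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - alpha * grad_f(theta)
    return theta

# Toy quadratic f(theta) = ||theta||^2; its gradient 2*theta is
# Lipschitz continuous with constant L = 2.
theta_min = gradient_descent(lambda th: 2.0 * th, [3.0, -4.0], lipschitz=2.0)
```

With α chosen above 2/L, the same loop would overshoot and diverge on this objective, which is the failure mode the bound protects against.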

Gradient Estimation
Working with the exact gradient is often difficult for two reasons. First, the gradient can depend on quantities that are expensive to estimate with high precision. Second, it might be that no analytic form for the gradient is accessible, and hence the gradient must be approximated by finite differences. In the following we discuss the two scenarios in more detail.

Analytic gradients
If one has sufficient knowledge of the structure of the optimization problem under consideration, it might be possible to find analytic expressions for the gradient of the function. In deep learning this is what is provided via the backpropagation algorithm, which allows one to take analytic derivatives with respect to all parameters [44]. However, these formulas are usually expressed as an average over the full sample one has available in a learning task. To decrease the cost of evaluating the gradient, often only a subset of the full sample, a so-called mini-batch, is used to get an unbiased estimate of the gradient [44]. This introduces a trade-off between the cost of the gradient estimation and its achieved precision.
In VHQCAs there exist similar scenarios where it is possible to analytically compute the gradients [45][46][47]. For example, if the parameters describe rotation angles of single-qubit rotations and the cost function is the expectation value of some operator A, f = ⟨A⟩, partial derivatives can be computed as

∂f/∂θ_i = (1/2) [ f(θ + (π/2) e_i) − f(θ − (π/2) e_i) ], (3)

where e_i is the unit vector along the i-th coordinate; i.e., the partial derivative is determined by the value of the cost function if one shifts the i-th component by ±π/2. However, the value of the cost function can only be estimated from a finite number of measurements, and this number of measurements as well as the noise level of the computation itself determine the precision of the gradient estimates. Therefore it is important to understand how to choose the number of shots, and to keep in mind that for VHQCAs the gradient estimate is always noisy to some extent, even though it is referred to as analytic.
An immediate extension of this is that (3) can be used recursively to define higher derivatives. This result then allows one to determine a usefully small upper bound on L in (2). In particular, we note that for operators with bounded eigenspectra, the largest magnitude of a derivative of any order we can find with (3) is precisely half the difference between the largest and smallest eigenvalues, λ_max and λ_min, respectively. Thus,

L ≤ (λ_max − λ_min)/2. (4)

For the common case where the eigenspectrum is unknown but we know how to decompose A into a weighted sum over tensor products of Pauli matrices, A = Σ_i a_i σ_i, we can bound the highest and lowest eigenvalues in turn by Σ_i |a_i| and −Σ_i |a_i|, respectively, which gives

L ≤ Σ_i |a_i|. (5)

By setting equality in (5) (or (4) when we have more information), we therefore find a useful Lipschitz constant.
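The shift rule (3) and the Pauli-weight bound (5) are both straightforward to implement. Below is a minimal NumPy sketch; the sinusoidal toy cost function stands in for a circuit expectation value and is purely illustrative, as are the function names.

```python
import numpy as np

def parameter_shift_grad(f, theta):
    # Shift rule (3): df/dtheta_i = [f(theta + (pi/2) e_i)
    #                                - f(theta - (pi/2) e_i)] / 2.
    # Exact (up to shot noise) when f is sinusoidal in each angle.
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (f(theta + shift) - f(theta - shift))
    return grad

def lipschitz_bound(pauli_weights):
    # Bound (5): L <= sum_i |a_i| for A = sum_i a_i sigma_i.
    return float(np.sum(np.abs(pauli_weights)))

# Toy "expectation value" with the required sinusoidal structure.
f = lambda th: 0.7 * np.cos(th[0]) * np.sin(th[1])
g = parameter_shift_grad(f, [0.3, 1.1])
```

On a real device each evaluation of f would itself be a sample mean over shots, which is the source of the statistical noise discussed above.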

Finite Differencing
If one does not have access to analytic gradients, one way to approximate the partial derivatives is by taking a finite step δ in parameter space:

∂f/∂θ_i ≈ [ f(θ + δ e_i) − f(θ − δ e_i) ] / (2δ), (6)

where e_i denotes the i-th coordinate unit vector. Again, as in the analytic case, the function values need to be estimated with a finite number of shots, introducing statistical noise. However, as opposed to the analytic case, the estimate (6) is systematically wrong, with an error that scales with δ². Therefore, one might want to decrease the parameter δ during an optimization procedure using such a gradient estimate. Intuitively this makes the optimization harder, as was recently discussed in the context of VHQCAs [48].
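A quick numerical check of the δ² scaling of the systematic error, using a toy one-parameter function (illustrative only):

```python
import numpy as np

def central_difference(f, theta, i, delta):
    # Approximation (6): symmetric finite difference along coordinate i.
    # The systematic error of this estimate scales as delta**2.
    e = np.zeros_like(theta)
    e[i] = delta
    return (f(theta + e) - f(theta - e)) / (2.0 * delta)

# For f = cos, the exact derivative at 0.5 is -sin(0.5); halving delta
# should shrink the systematic error by roughly a factor of four.
f = lambda th: np.cos(th[0])
theta = np.array([0.5])
exact = -np.sin(0.5)
err_big = abs(central_difference(f, theta, 0, 0.2) - exact)
err_small = abs(central_difference(f, theta, 0, 0.1) - exact)
```

Note the tension this creates in the noisy setting: shrinking δ reduces the systematic error but amplifies the effect of shot noise on the difference of two noisy function values.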

Noisy Gradient Descent
For the case where one has noise in one's measurement of the gradient, the analysis of a gradient descent procedure becomes more complicated, as the best one can achieve are statements about the behavior that can be expected on average. However, so long as one's estimates are unbiased (i.e., repeated estimates average to the true gradient) one should still end up near a minimum. This idea is at the heart of all stochastic gradient descent methods, which we discuss now.

Stochastic/Mini-Batch Gradient Descent
In cases such as VHQCAs (as well as some machine learning applications), we cannot access the gradients directly and therefore need to estimate the gradients by sampling from some distribution. A standard approach to this case is to choose some number of samples needed to achieve a desired precision. This method is known as either stochastic or mini-batch gradient descent. (A mini-batch here refers to a collection of samples, usually much smaller than the total population.) The number of samples as well as the learning rate are usually set heuristically, in order to balance the competing interests of efficiency and precision. First, when collecting samples is computationally expensive, it can sometimes be more efficient to take less accurate gradient estimates in order to iterate faster, though doing so can be detrimental if it means that one ends up needing to perform an inordinate number of iterations [49]. Second, it does not make sense to attempt to achieve a precision greater than the intrinsic accuracy of the distribution from which one samples. If there is some error expected in the representation of the distribution one samples the gradients from, there is therefore an upper bound on the number of samples that it is sensible to take based on that accuracy [49]. For the case of VHQCAs, this often means that the upper limit on the number of samples, s_max, depends on the (usually unknown) bias b_noise introduced to the gradient measurements by the noise of the physical quantum device: once the statistical error of an estimate falls below b_noise, additional shots no longer improve the overall accuracy, suggesting a cap of the form

s_max ≈ σ² / b_noise², (7)

with σ the single-shot standard deviation. Since for VHQCAs this bias is a function of the unknown, time-varying device noise for the specific gate sequence, often the best one can do is to make a rough estimate of its order of magnitude and use that in the denominator.
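As a toy illustration of this cap, assume the statistical error of a sample-mean estimate scales as σ/√s; shots then stop helping roughly once σ/√s drops below the bias b_noise. The function name and numbers below are ours, not from the paper.

```python
import math

def max_sensible_shots(sigma, b_noise):
    # Statistical error sigma/sqrt(s) falls below the hardware bias
    # b_noise once s exceeds roughly sigma^2 / b_noise^2; beyond that,
    # additional shots no longer improve the overall accuracy.
    return math.ceil(sigma**2 / b_noise**2)

# Unit single-shot deviation, bias estimated at the 10% level.
s_max = max_sensible_shots(sigma=1.0, b_noise=0.1)
```

Since b_noise enters squared in the denominator, even an order-of-magnitude estimate of the bias pins down the sensible shot budget to within a couple of orders of magnitude.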
Typically, the number of samples as well as the learning rate are heuristically adjusted based on the structure of the cost landscape as well as the error level. When little information is known about the optimization problem, these hyperparameters are tuned either by manual trial and error until an acceptable choice is found or by using a hyperparameter optimization strategy [50].
For a stochastic gradient approach to converge quickly, it is often helpful to decrease the error in the optimization steps during the run of the optimization. This can be done either by decreasing the learning rate α or by reducing the noise in the gradient estimates. The following two subsections introduce two methods from machine learning that respectively take these two strategies.

Adam
Adam is a variant of stochastic gradient descent in which the step taken along each search direction is adapted based on the first and second moments of the gradient [42]. To do this, one takes an exponentially decaying average of the first and second moments (m_t and v_t, respectively) for each component of the gradient individually:

m_t = β₁ m_{t−1} + (1 − β₁) g_t, (8)
v_t = β₂ v_{t−1} + (1 − β₂) g_t², (9)

where the square is understood element-wise, g_t is the gradient estimate at step t, and β₁, β₂ are the constants that determine how slowly the variables are updated. The parameters are then updated with the following rule:

θ^{(t+1)} = θ^{(t)} − α m̂_t / (√v̂_t + ε), (10)

where m̂_t (v̂_t) is an initialization-bias-corrected version of m_t (v_t), and ε is a small constant to ensure stability [42]. One particular feature of Adam is that the adaptation happens individually for each component of the gradient. We also briefly mention that there is a recent modification to Adam that looks promising, called Rectified Adam (RAdam) [51]. RAdam essentially turns on the adaptive learning rate selectively, once the variance in the estimated gradient becomes small enough. While Adam has made a large impact in deep learning, to our knowledge it has not been widely considered in the context of VHQCAs.
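The moment updates and bias-corrected step described above can be sketched in a few lines of NumPy. The quadratic toy objective and the hyperparameter values here are illustrative, not the tuned values used in our experiments.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying averages of the first and second moments,
    # with the square taken element-wise.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Initialization-bias-corrected moments (t starts at 1).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy quadratic ||theta||^2 (gradient 2*theta).
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
```

Because the normalization by √v̂_t happens per component, a parameter with a consistently small gradient still takes steps of roughly size α, which is the per-component adaptation highlighted in the text.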

CABS
Balles et al. analyzed the problem of choosing the sample size in the context of optimizing neural networks by stochastic gradient descent [40]. Their approach is to find the number of samples s that maximizes the expected gain per sample at each iteration.
In the following we denote the i-th component of the estimated gradient by g_i, the empirical variance of the estimate g_i by S_i, the actual gradient by ∇f, and the actual covariance matrix (in the limit of infinite samples or shots) of the gradient estimation by Σ. Balles et al. introduce a lower bound G on the gain (improvement in the cost function) per iteration. Accounting for the finite sampling error, they find that the average value of G satisfies [40]

E[G] ≥ α (1 − Lα/2) ‖∇f‖² − (Lα²/(2s)) Tr(Σ). (11)

As an immediate consequence, they then find that the expected gain at any step has a positive lower bound if

s > Lα Tr(Σ) / [(2 − Lα) ‖∇f‖²]. (12)

By taking a small but fixed α, Balles et al. then maximize the lower bound on the expected gain per sample by taking

s = [2Lα / (2 − Lα)] · Tr(Σ) / ‖∇f‖² (13)

samples [40]. Unfortunately, this formula depends on the quantities Σ and ∇f, which are not accessible. Therefore in CABS, Σ is replaced by an estimator Σ̂ and, specializing to the case where the minimum value of f is known to be zero, ‖∇f‖² is replaced by f/α since the gradient estimator is biased. Since the Lipschitz constant is also often unknown in the machine learning problems they were considering, they also drop the factor of 2Lα/(2 − Lα) [40]. CABS then proceeds as stochastic gradient descent with a fixed learning rate and a number of samples that is selected at each iteration based on (13), with the quantities measured at the previous point, under the assumption that the new point will be similar to the old point.
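For concreteness, the shot count (13) can be computed as follows; all input values in this sketch are hypothetical.

```python
def cabs_shot_count(L, alpha, tr_sigma, grad_sq_norm):
    # Expected-gain-per-sample maximizer (13); valid only for alpha < 2/L.
    assert alpha < 2.0 / L
    return (2.0 * L * alpha / (2.0 - L * alpha)) * tr_sigma / grad_sq_norm

# Hypothetical values: L = 2, alpha = 0.25, Tr(Sigma) = 3, ||grad f||^2 = 1.5.
s = cabs_shot_count(L=2.0, alpha=0.25, tr_sigma=3.0, grad_sq_norm=1.5)
```

Note the qualitative behavior: the recommended sample count grows as the gradient norm shrinks, so the method automatically spends more samples as the optimization approaches a minimum.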
As discussed in the next section, our adaptive optimizer for VHQCAs is built upon the ideas behind CABS (particularly (13)), although our approach differs somewhat.

SPSA
The simultaneous perturbation stochastic approximation (SPSA) algorithm [43] is explicitly designed for a setting with only noisy evaluations of the cost function, where no analytic formulas for the gradients are available. It is also a descent method; however, instead of estimating the full gradient, a random direction is picked and the slope in this direction is estimated. Based on this estimate, a downhill step in the sampled direction is taken:

θ^{(t+1)} = θ^{(t)} − α_t g(θ^{(t)}). (14)

Here g(θ^{(t)}) is the estimated slope in the random direction, estimated as [52]

g(θ^{(t)}) = { [ f(θ^{(t)} + c_t Δ_t) − f(θ^{(t)} − c_t Δ_t) ] / (2 c_t) } Δ_t^{−1}, (15)

where Δ_t is the random direction sampled for the t-th step and Δ_t^{−1} simply denotes the vector of its element-wise inverses. In order to ensure convergence, the finite-difference parameter c_t as well as the learning rate α_t have to be decreased over the optimization run. This is commonly done by using a prefixed schedule [52]. In this approach, we have

α_t = α / (t + 1 + A)^β, c_t = c / (t + 1)^γ. (16)

In the original formulation, the idea is usually to estimate each cost function value in (15) with a single measurement. However, in a quantum setting it seems intuitive to take a larger number of measurements for the estimation, as was done in [53].
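A minimal NumPy sketch of one SPSA iteration, combining the decaying schedules with the slope estimate (15). The schedule constants and the toy objective are illustrative choices, not the tuned values used in our experiments.

```python
import numpy as np

def spsa_step(f, theta, t, rng, a=0.2, c=0.1, A=10.0, beta=0.602, gamma=0.101):
    # Decaying schedules for the learning rate and the finite-difference
    # parameter.
    alpha_t = a / (t + 1 + A) ** beta
    c_t = c / (t + 1) ** gamma
    # Simultaneous perturbation: all parameters shifted at once along a
    # random +/-1 direction; two cost evaluations give the slope (15).
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    slope = (f(theta + c_t * delta) - f(theta - c_t * delta)) / (2.0 * c_t)
    return theta - alpha_t * slope * (1.0 / delta)  # element-wise inverse

# Minimize the toy quadratic ||theta||^2.
rng = np.random.default_rng(42)
theta = np.array([2.0, -1.5])
for t in range(500):
    theta = spsa_step(lambda th: np.sum(th**2), theta, t, rng)
```

The key economy is that each iteration costs only two cost evaluations regardless of the number of parameters, at the price of a much noisier descent direction.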

Sequential Subspace Search
Another approach to optimizing a multivariate cost function is to break the problem into subparts, each of which is easier to handle on its own. The generic idea is to define a sequence of subspaces of parameter space to consider independently. These methods then approach a local minimum by iteratively optimizing the cost function on each subspace in the sequence. Now we discuss two instances of this approach: the famous Powell method [54] as well as a recently proposed method specialized to VHQCAs [37,38].

Powell Algorithm
The Powell algorithm [54] is a very useful gradient-free optimizer that specializes the subspace search to the case of sequential line searches. Specifically, starting with some input set of search vectors V = {v_i} (often the coordinate basis vectors of the parameter space), this method sequentially finds the set of displacements {a_i} along each search vector that minimizes the cost function. Next, the method finds the v_j associated with the greatest displacement, a_j = max_i(a_i). This v_j is then replaced with the total displacement vector for this iteration, namely

v_j → Σ_i a_i v_i, (17)

and then the next iteration begins with this updated set of search vectors. This replacement scheme accelerates the convergence and prevents the optimizer from being trapped in a cyclic pattern. In practice, the displacements a_i are typically found using Brent's method [55], but in principle any gradient-free scalar optimizer could work. (Gradient-based scalar optimizers would make Powell's method no longer "gradient-free.")

Sequential Optimization by Function Fitting
In the special case of VHQCAs where the cost function is expressed as an expectation value of some Hermitian operator and the quantum circuit is expressed as fixed two-qubit gates and variable single-qubit rotations, it is possible to determine the functional form of the cost function along a coordinate axis [37]. After fitting a few parameters, it becomes possible to compute where the analytic minimum should be in order to find the optimal displacement along any given search direction. This can be scaled up to finding the analytic minimum (exact up to distortions from noise) on some subspace that is the Cartesian product of coordinate axes, though this is hampered by the fact that the number of parameters that must be fit scales exponentially with the dimension of the subspace [37]. We will refer to this algorithm as the Sequential Optimization by Function Fitting (SOFF) algorithm. We note that a very similar method was published shortly after SOFF [38]. The primary difference was the incorporation of the Anderson and Pulay convergence acceleration procedures used in computational chemistry [56,57]. We note that, though SOFF and Powell are closely related, SOFF is limited to searching along coordinate axes and cannot take arbitrary search directions, so it is not quite a special case of Powell's method. For VHQCA problems where it is applicable, SOFF has been demonstrated to be highly competitive with or better than other standard optimization schemes like Powell's method [37,38].
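To illustrate the function-fitting idea in its simplest form (a single coordinate, along which such a cost is a sinusoid in the angle), here is a Rotosolve-style closed-form update. This is a simplified sketch of the principle behind SOFF, not the full algorithm, and the helper name is ours.

```python
import numpy as np

def single_angle_minimum(f, theta, i):
    # Along one coordinate the cost has the form A*sin(theta_i + phi) + C,
    # so three evaluations (at offsets 0 and +/- pi/2) determine the phase
    # and hence the exact location of the minimum along that axis.
    e = np.zeros_like(theta)
    e[i] = 1.0
    f0 = f(theta)
    fp = f(theta + (np.pi / 2) * e)
    fm = f(theta - (np.pi / 2) * e)
    phi = np.arctan2(2.0 * f0 - fp - fm, fp - fm)
    new = theta.copy()
    new[i] = theta[i] - np.pi / 2 - phi  # argmin of A*sin(x + phi), A > 0
    return new

# Toy single-angle cost with A = 2, phase 0.7, offset 1; its minimum is -1.
f = lambda th: 2.0 * np.sin(th[0] + 0.7) + 1.0
theta_new = single_angle_minimum(f, np.array([0.3]), 0)
```

In the noisy setting the three evaluations are themselves shot-limited estimates, which is the origin of the precision floor for SOFF discussed later.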

Adaptive Shot Noise Optimizer
As mentioned above, the basic idea behind our approach is similar to that of CABS [40], but we implement those ideas in a different way. Specifically, by implementing different estimates for the inaccessible quantities in (13) that are suitable to the number of shots in a VHQCA (rather than the batch size in a machine learning method), we arrive at a variant of CABS we name Coupled Adaptive Number of Shots (CANS). Recognizing that a different number of shots might be optimal for estimating each component of the gradient in VHQCAs, we further develop this variation into individual-CANS (iCANS), which is our main result. For pedagogical purposes, we first introduce CANS and then present iCANS.

CANS
We now discuss our adaptation of CABS to the setting of VHQCAs. In order to use the number of shots recommended by the CABS method, we need to rewrite (13) using only quantities that are accessible. Making use of the parameter-shift rule (3), we have access to the Lipschitz constant L given by (5). An unbiased estimate of Tr(Σ) is given by Σ_{i=1}^d S_i = ‖S‖₁, i.e., by the sum of the empirical variances of the gradient components. (Here and below d is the number of parameters being optimized.) The naive estimate of ‖∇f‖² is ‖g‖², with g := (g_1, ..., g_d)^T the estimated gradient. This estimator is biased (see Equation (17) of [40]); however, using a bias-corrected version is numerically unstable. With these choices, we then define CANS as the CABS algorithm with (13) replaced by

s = [2Lα / (2 − Lα)] · ‖S‖₁ / ‖g‖². (18)

We note that the learning rate α must be less than 2/L with this formalism. The CANS algorithm is included in Appendix B for completeness. For the remainder of the paper we will focus on iCANS, which we introduce next.

iCANS
The CANS algorithm is inspired by CABS [40], which was designed for applications in deep learning. Therein, for each data point the full gradient is evaluated, and noise arises by considering only a mini-batch of the full sample. In VHQCAs, however, each individual partial derivative is estimated independently. This gives us the freedom to distribute measurements over the estimation of the partial derivatives more effectively. This is the idea behind iCANS, which is shown in Algorithm 1 and described below.
Algorithm 1 Stochastic gradient descent with iCANS1/2. The function iEvaluate(θ, s) evaluates the gradient at θ using s_i shots for the i-th derivative via the parameter-shift rule (3). This function returns the estimated gradient vector g as well as the vector S whose components are the variances of the estimates of the partial derivatives.

iCANS prioritizes the individual partial derivatives rather than the gradient magnitude as in (11). For this purpose, we define G_i as our lower bound on the gain (i.e., the improvement in the cost function) associated with the change in parameter θ_i for a given optimization step. Furthermore, we define γ_i as the expected gain per shot (i.e., the expectation value of G_i divided by the number of shots) as follows:

γ_i = E[G_i] / s_i = [ α (1 − Lα/2) g_i² − (Lα²/(2 s_i)) S_i ] / s_i, (19)

where s_i is the suggested number of shots for the estimation of the i-th partial derivative. Note that (19) is an adaptation of (11) to our setting.
In analogy with the CANS approach (see (18)), we estimate the number of shots that maximizes (19) as

s_i = [2Lα / (2 − Lα)] · S_i / g_i². (20)

As with CANS, we again note that this formalism is only valid if α < 2/L. The idea now is to update each parameter with a gradient-descent step, where each partial derivative is estimated with its individual optimal number of shots. However, empirically those parameters that are close to a locally optimal value (and hence have a small expected gain) require a large number of shots, while parameters that are far from convergence (and hence usually have a large expected gain) require a small number of shots. We therefore restrict our algorithm to not take more shots for any partial derivative than a cap we will call s_max. We take s_max to be the number of shots needed in order to estimate the partial derivative for the parameter θ_{i_max}, where i_max is the index associated with the highest expected gain per shot. In other words,

i_max = argmax_i γ_i, (21)
s_max = s_{i_max}, (22)

and we impose s_i ≤ s_max for all partial derivatives.
We note that the introduction of this cap on the number of shots is a heuristic choice which we find to often be beneficial to shot frugality, but which removes the guarantee that γ_i will be maximized or even positive. In order to preserve this frugality while retaining the guarantee of positive expected gains, one can also introduce a step that verifies that the learning rate to be used is appropriate after the measurements are taken and adapts it if it is not. Motivated by (12), we check the following condition for each component of the gradient:

α < 2 g_i² / [ L (g_i² + S_i / s_i) ]. (23)

When this condition fails to hold for the i-th partial derivative, we temporarily replace α with the right-hand side of (23) for the update along that direction. Adding in this check results in a more conservative procedure, as it takes smaller steps when needed in order to enforce that γ_i > 0, and thus restores the guarantee that E[G] > 0. Below, we will refer to iCANS without this learning rate check as iCANS1 and with it as iCANS2.
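The per-derivative shot recommendation and the s_max cap can be sketched as follows. This is a simplified illustration of the allocation step only (without the moving averages and regularization used in Algorithm 1), and all numbers are hypothetical.

```python
import numpy as np

def icans_shot_counts(g, S, L, alpha):
    # Per-derivative recommendation s_i = 2*L*alpha*S_i / ((2 - L*alpha)*g_i^2),
    # which maximizes the expected gain per shot (19); valid for alpha < 2/L.
    assert alpha < 2.0 / L
    g, S = np.asarray(g, float), np.asarray(S, float)
    s = np.ceil(2.0 * L * alpha * S / ((2.0 - L * alpha) * g**2)).astype(int)
    s = np.maximum(s, 2)  # at least two shots to define a variance
    # Expected gain per shot at the recommended s_i ...
    gain = alpha * (1 - L * alpha / 2) * g**2 - L * alpha**2 * S / (2 * s)
    gamma = gain / s
    # ... and the cap: no derivative gets more shots than the component
    # with the highest expected gain per shot.
    s_max = s[np.argmax(gamma)]
    return np.minimum(s, s_max)

# A near-converged derivative (small g_i) would be recommended many
# shots; the cap set by the well-trained component limits it.
shots = icans_shot_counts(g=[1.0, 0.1], S=[1.0, 1.0], L=2.0, alpha=0.25)
```

In this example the second component, with its small gradient, would be recommended dozens of shots, but the cap from the first component reduces its budget, which is exactly the frugality mechanism described above.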
The distinction between iCANS1 and iCANS2 is made in Algorithm 1 with the conditional statements on lines 10 and 12.
Beyond the core components of iCANS given above, both implementations of iCANS also take two more hyperparameters for increased stability. Since iCANS is intended to be deployed on highly noisy problems, we find that it is beneficial to use smoothed quantities for the gradient and variance when estimating γ_i and s_i. For this reason, we use bias-corrected exponential moving averages χ_i and ξ_i in place of g_i and S_i, respectively, when implementing equations (19) and (20). These exponential moving averages introduce a new parameter, µ, which controls the degree of smoothing and is bounded between 0 and 1. Since the update step is independent of this smoothing, we often find it beneficial to choose µ close to 1 to achieve a steady progression of the s_i's. Finally, we also add a regularizing parameter b to the denominators of lines 13, 16, and 20 of Algorithm 1 for numerical stability. By multiplying b by µ^k and choosing b to be small, the bias from this regularizing parameter begins small and exponentially decays as the algorithm progresses.

Implementations
In order to compare the performance of iCANS1 and iCANS2 to established methods, we consider two optimization tasks: variational quantum compiling with a fixed input state [14][15][16][18] and a variational quantum eigensolver (VQE) [3] for a Heisenberg spin chain.

Figure 1: The quantum circuit diagram for the ansatzes we used to construct the unitary operators U(θ) in our implementations. The angles in each rotation gate (denoted as R_j, where j denotes the axis being rotated about) are varied independently. Panel a shows the ansatz used in the compiling and Heisenberg spin chain VQE tasks, and we note that this is the same ansatz used in Ref. [37]. Panel b shows the ansatz used when doing the size-scaling comparison with the Ising spin chain VQE task.
For our experiments we set the iCANS hyperparameters as α = 0.1, µ = 0.99, and b = 10^{−6}, except for the case of the system-size scaling comparison. For that case, since the Lipschitz constant L grows linearly with the system size, leaving α = 0.1 leads to α > 2/L for larger systems, which is invalid for iCANS. We therefore chose α = 1/L for the different-length Ising spin chains we consider below.
For the other algorithms we compare to, we will denote the number of shots per operator measurement as s. We will denote algorithm A with s shots per operator measurement as A-s (e.g., SOFF with s = 1000 is denoted SOFF-1000). We also note that in the figures and tables below we show the analytical costs and energies that one could achieve with the parameters that the optimizers output, i.e., without hardware noise or shot noise. The optimizers did have to contend with finite statistics and, where indicated, hardware noise to find those parameters.
In addition to the fixed number of shots they use, the other algorithms we compare to also come with other hyperparameters, which were chosen empirically in an attempt to get the best performance from each. For Adam we used a learning rate of α = 0.1 along with the momentum parameter values β₁ = 0.9 and β₂ = 0.999. For SPSA, we found that the default parameters were the best among those that we tried, and thus we set A to be a tenth of the total number of allowed iterations, β = 0.602, and γ = 0.101.

Variational Compiling with a Fixed Input State
For our first optimization task, we follow [37] and consider as a benchmark the optimization of the following cost function:

f(θ) = 1 − |⟨0| U(θ*)† U(θ) |0⟩|², (24)

where θ* is a vector of fixed, randomly selected angles and θ is the vector of angles to be optimized over. For this problem, we construct the parametrized unitary operator U(θ) with the ansatz described in Fig. 1(a), setting n = 3 qubits and D = 6. (As adding depth and thus more parameters increases the difficulty of the optimization task and amplifies the effect of the noise model, D = 6 was chosen to increase the difficulty of the task, although shorter-depth ansatzes would work here.) We then simulate the optimization procedure with one hundred different random seeds (each of which generates a unique random θ* and initial point) and a collection of different optimizers. The results for both the case of a noiseless simulator and the case of a simulator using the noise profile of IBM's Melbourne processor [58] are shown in Fig. 2. For the latter, we emphasize that this noise profile reflects the properties of real, currently available quantum hardware. In addition, the average costs obtained for each optimizer are listed in Tables 1 and 2, with the best value found for each total number of shots expended N shown in bold. Furthermore, see Appendix C for the cumulative probability distributions over cost values, which provide more information than the average cost values.

VQE
For our second optimization task, we follow [53] in considering the Heisenberg spin chain with wrapped boundary conditions and the Hamiltonian

H = J Σ_{⟨i,j⟩} (σ_i^x σ_j^x + σ_i^y σ_j^y + σ_i^z σ_j^z) + B Σ_i σ_i^z, (25)

where the bracket ⟨i,j⟩ denotes nearest-neighbor pairs. For the purpose of our comparison, we fix J = 1 and B = 3 and again consider the ansatz described in Fig. 1(a). Running the comparison with n = 3 qubits in a triangle and D = 6 for the ansatz, we simulate running VQE with one hundred different random seeds and initial points, along with the same set of optimizers as in the benchmark case above. As before, the results for both a noiseless and a noisy simulator (also using the IBM Melbourne processor's noise profile [58]) are shown in Fig. 3. Again, the average energies obtained for each optimizer are listed in Tables 3 and 4, with the best value found for each total number of shots expended N shown in bold. In addition, see Appendix C for the cumulative probability distributions over energy values, which provide more information than the average energy values.

Comparison of Scaling
In order to compare the performance of iCANS to that of other optimizers when one scales up the number of qubits, we now consider VQE applied to Ising spin chains of differing lengths with open boundary conditions and the Hamiltonian

H = −Σ_{⟨i,j⟩} σ_i^z σ_j^z − g Σ_i σ_i^x, (26)

where the bracket ⟨i,j⟩ again denotes nearest-neighbor pairs. In order to generate enough entanglement in the ground state to require a modest depth, we choose g = 1.5 so that we are near but not at the critical point g = 1. For this problem, we used the ansatz shown in Fig. 1(b) with D = 3 (two repetitions of the block shown in braces), as its performance for this problem was significantly better than that of the simple ansatz in Fig. 1(a).
The results for a noiseless simulator for 4, 6, 8, 10, and 12 qubit Ising spin chains are shown in Fig. 4.

Discussion
Here we report on the behavior of the various optimizers we studied. First we consider SOFF, which is the only optimizer studied here other than iCANS that was formulated specifically for VHQCAs. By leveraging analytical knowledge about the optimization landscape, SOFF's gradient-free method of making single-parameter updates allows it to train quickly in low-noise environments. However, the limited precision when fitting the analytical function with a finite number of shots means that SOFF hits a precision floor and cannot improve past that point. Additionally, hardware noise tends to distort the landscape in such a way that the analytical form no longer provides as good a fit, making SOFF struggle more relative to the other optimizers considered. In the optimization tasks we looked at here, we found that SOFF was often competitive with iCANS shortly before hitting its precision floor, with SOFF-100 sometimes doing better for a brief interval early on. For example, SOFF-100 was the best performing optimizer for the compilation task with N = 10³ (noiseless and noisy) and N = 10⁴ (noisy only), as well as for the Heisenberg VQE with N = 10⁴ (noiseless and noisy).

Adam was originally conceived in the context of machine learning and excels at optimizing in noisy environments. However, in our numerical studies we found that Adam suffered from an instability for the hyperparameters we chose and the number of shots we allowed at each partial-derivative evaluation. This instability appears later in the optimization, when we are working with more shots, and it can be seen in the upturn of the curves in Figs. 2-4. For the case of the noisy compilation task, Adam-100 looks like it might just be reaching that instability at the end of the allowed shot budget, and slightly outperformed iCANS1 to be the best on average. We note that, similar to what was seen with SOFF, Adam was usually competitive with the iCANS methods before it reached the point where it stopped improving.
Unlike SOFF and Adam, SPSA did not seem to hit a point at which it stopped improving with shot budget, for the chosen hyperparameters. We note, though, that SPSA is the most sensitive to perturbations of the hyperparameters among the methods studied here and can become very unstable if they are incorrectly chosen. However, if one hits upon the correct hyperparameters, SPSA can be very effective. While in our cases we did not find SPSA outperforming iCANS, we note that for the noiseless Heisenberg VQE task, SPSA-100 was the most competitive with iCANS.

Overall, we find that iCANS performed well on all optimization tasks considered, with either iCANS1 or iCANS2 usually providing the best result for a given total shot budget N. Even when scaling up the system size in the Ising model VQE task (see Fig. 4), we found that iCANS continued to outperform the other optimizers studied. We also note that, empirically, iCANS1 usually outperformed iCANS2. While iCANS2 provides a benefit by reducing the sensitivity to the input learning rate, so long as the learning rate is chosen well we expect that iCANS1 may tend to perform better.
We remark that, while we do not report full results for RAdam [51], our preliminary results indicated that it did not provide a substantial improvement over the simpler Adam algorithm for our use cases. Similarly, we found that SOFF with the Anderson acceleration step proposed in [38] did not noticeably improve upon the performance of basic SOFF, and therefore the curves for this method are not shown.
We finally remark on the different performance of the various fixed-shot optimizers with different numbers of shots (e.g., Adam-10 versus Adam-100). This performance difference can be understood as a trade-off between reducing the statistical uncertainty and achieving more iterations before hitting the limit on the total number of shots. When few shots are used, many more iterations are allowed but the update steps are much noisier, usually meaning that the optimizer can make progress more quickly early on but then potentially hits an effective precision floor. Increasing the number of shots allows more precise updates and thus lowers the precision floor (if present), but means that far fewer iterations can be performed. This trade-off is the idea at the heart of iCANS. iCANS uses few shots early on and so achieves a period of noisy but fast descent, but then slows down and computes with greater and greater precision to continue making progress. This strategy allows for shot frugality as well as, in principle, removing such a precision floor for iCANS.
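This trade-off can be made quantitative with a back-of-the-envelope calculation: given a total budget N, a per-evaluation shot count s, and d parameters, a fixed-shot optimizer gets roughly N/(s·d) gradient steps, each with per-component standard error σ/√s. The sketch below is purely illustrative (the numbers are placeholders, and the constant factor from the two parameter-shift evaluations per partial derivative is ignored for simplicity):

```python
import math

def budget_tradeoff(N, s, d, sigma=1.0):
    """Illustrative trade-off for fixed-shot optimizers: more shots per
    evaluation means less statistical noise per step, but fewer steps
    within a fixed total-shot budget N (d parameters; the factor from
    parameter-shift evaluations is ignored here for simplicity)."""
    iterations = N // (s * d)               # steps affordable under the budget
    step_noise = sigma / math.sqrt(s)       # std error of each gradient component
    return iterations, step_noise

# Few shots: many noisy steps. Many shots: few precise steps.
few = budget_tradeoff(N=100_000, s=10, d=10)
many = budget_tradeoff(N=100_000, s=1000, d=10)
```

An adaptive scheme such as iCANS effectively moves along this trade-off curve during training, starting at the "many noisy steps" end and sliding toward the "few precise steps" end.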

Conclusions
In order to bring about the promise of VHQCAs solving usefully large and complex problems on NISQ devices, one needs a way to perform the requisite optimizations efficiently. As the rate-limiting step of these optimizations will likely be the number of times one must prepare and measure quantum states, it will be important to have optimizers that are frugal in the number of physical measurements performed on a quantum computer.
In this work we introduced two versions of a measurement-frugal, noise-resilient optimizer tailored for VHQCAs. Both of the strategies we propose, iCANS1 and iCANS2, address measurement frugality by dynamically determining the number of measurements needed for each partial derivative at each step of a gradient descent. iCANS1 is the more aggressive version, always taking the same learning rate, while iCANS2 is more cautious and limits the learning rate so that the expected gain of each step is always guaranteed to be positive. Our numerical results indicate that these optimizers may perform comparably to or better than other state-of-the-art optimizers. The performance compares especially well in the presence of realistic hardware noise.
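The cautious behavior of iCANS2 can be sketched from the per-component gain bound of Appendix A: the bound (α − ½Lα²)gᵢ² − (Lα²/(2sᵢ))Sᵢ is positive exactly when α < 2gᵢ²/(L(gᵢ² + Sᵢ/sᵢ)). A minimal sketch of such learning-rate clipping follows; the function name and the small regularizer eps are ours, and this is a paraphrase of the idea rather than the paper's exact update rule:

```python
def clipped_learning_rate(alpha, g_i, S_i, s_i, L, eps=1e-12):
    """Largest learning rate for which the per-component lower bound on
    the expected gain stays positive, capped at the input alpha.
    The bound (alpha - L*alpha^2/2)*g_i^2 - (L*alpha^2/(2*s_i))*S_i
    is positive exactly when alpha < 2*g_i^2 / (L*(g_i^2 + S_i/s_i)).
    eps (our addition) guards against a zero denominator."""
    alpha_max = 2 * g_i**2 / (L * (g_i**2 + S_i / s_i) + eps)
    return min(alpha, alpha_max)
```

For a large, well-resolved gradient component the input learning rate passes through unchanged; for a small, noisy component the rate is clipped so the step cannot be expected to increase the cost.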
iCANS has already found use in the very recent VHQCA literature [18]. Furthermore, after our article was originally posted, a related study of stochastic gradient descent for VHQCAs found that small shot counts can provide rapid improvement in the early stages of training [59], which provides further motivation for iCANS.
One potential direction for future work is exploring the possibility of extending our frugal adaptive approach to non-gradient methods, such as SPSA.
A The Expected Lower Bound on the Gain per Shot

Here we repeat the derivation provided in [40] for the lower bound on the expected gain per shot (given in (11)), and extend it to our expression lower bounding the expected gain per shot per partial derivative (19).
Assuming that the cost function admits a Taylor series representation about the current point in parameter space, to quadratic order we have

f(θ′) ≈ f(θ) + ∇f(θ) · (θ′ − θ) + ½ (θ′ − θ)ᵀ ∇²f(θ) (θ′ − θ).

In this way, we approximate the gain (the change in the cost function) we expect after the update step, with θ′ = θ − αg:

G = f(θ) − f(θ′) ≈ α ∇f(θ) · g − (α²/2) gᵀ ∇²f(θ) g.

If the gradients are Lipschitz continuous, we can achieve a lower bound G on this quantity using the Lipschitz constant L:

G ≥ α ∇f(θ) · g − (Lα²/2) ‖g‖². (29)

Next we assume that the gradient estimates g have mean E[g] = ∇f(θ) and covariance Σ/s, where s is the number of shots used in the estimate. We then have

E[‖g‖²] = ‖∇f(θ)‖² + Tr(Σ)/s.

Plugging this back into (29) then gives us

E[G] ≥ (α − ½Lα²) ‖∇f(θ)‖² − (Lα²/2s) Tr(Σ), (31)

which is (11). Dividing both sides by s then gives the expected lower bound on the gain per shot. In order to arrive at (19), we rewrite this expression as a sum over components:

E[G] ≥ Σᵢ [ (α − ½Lα²) (∂ᵢf)² − (Lα²/(2sᵢ)) Σᵢᵢ ] ≡ Σᵢ E[Gᵢ].

Finally, defining γᵢ = E[Gᵢ]/sᵢ and replacing ∂ᵢf and Σᵢᵢ with their estimators gᵢ and Sᵢ, respectively, gives (19).
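As a concreteness check, the per-component estimator γᵢ can be evaluated numerically. The sketch below uses illustrative placeholder values (not numbers from the paper's experiments):

```python
def gain_per_shot(g_i, S_i, s_i, alpha, L):
    """Estimated lower bound on the expected gain per shot for one
    partial derivative: gamma_i = E[G_i]/s_i, with the estimators
    g_i and S_i plugged in for the true gradient component and
    single-shot variance."""
    expected_gain = (alpha - 0.5 * L * alpha**2) * g_i**2 \
                    - (L * alpha**2 / (2 * s_i)) * S_i
    return expected_gain / s_i

# A component with a large gradient and modest variance is a better
# place to spend shots than a small, equally noisy one.
gamma_big = gain_per_shot(g_i=0.5, S_i=0.2, s_i=100, alpha=0.1, L=1.0)
gamma_small = gain_per_shot(g_i=0.05, S_i=0.2, s_i=100, alpha=0.1, L=1.0)
```

Comparing such γᵢ values across components is what lets the optimizer direct measurements to where they are expected to pay off the most.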

B CANS Algorithm
For the interested reader, we present the CANS (Coupled Adaptive Number of Shots) algorithm in Algorithm 2, which is an adaptation of the CABS algorithm [40] to the VHQCA setting.

C Cumulative probability distributions for 3-qubit implementations
Here we show the cumulative distribution plots of the cost values or energies achieved by the optimizers we studied for the compilation task (Fig. 5) and the Heisenberg spin chain VQE task (Fig. 6) for various shot budgets.
Algorithm 2 Stochastic gradient descent with CANS. The function Evaluate(θ, s) evaluates the gradient at θ, using s measurements for each component of the derivative via the parameter-shift rule (3), and returns the estimated gradient vector g as well as the vector S with the variances of the individual estimates of the partial derivatives.

Input: learning rate α, starting point θ0, minimum number of shots per estimation s_min, total number of shots that can be used N, Lipschitz constant L, running-average constant µ, bias for gradient norm b
1: initialize: θ ← θ0, s_tot ← 0, s ← s_min, χ ← (0, ..., 0)ᵀ, ξ ← 0, k ← 0
2: while s_tot < N do
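Since the listing above is cut off in this copy, here is a minimal Python sketch of the CANS loop. The shot-update rule follows the CABS-style prescription s ∝ 2Lα/(2 − Lα) · Var/‖grad‖², smoothed with running averages; the details of the bias handling are our paraphrase, not a verbatim transcription of Algorithm 2:

```python
import numpy as np

def cans_sgd(evaluate, theta0, alpha, s_min, N, L, mu, b):
    """Sketch of stochastic gradient descent with a coupled adaptive
    number of shots (CANS). `evaluate(theta, s)` must return (g, S):
    the estimated gradient and the per-component single-shot variances,
    as in the paper's Evaluate routine. Assumes L*alpha < 2."""
    theta = np.asarray(theta0, dtype=float)
    s_tot, s, k = 0, s_min, 0
    chi = np.zeros_like(theta)   # running average of the gradient
    xi = 0.0                     # running average of the total variance
    while s_tot + s * theta.size <= N:
        g, S = evaluate(theta, s)
        s_tot += s * theta.size
        theta = theta - alpha * g                   # gradient step
        chi = mu * chi + (1 - mu) * g               # smooth the gradient
        xi = mu * xi + (1 - mu) * float(np.sum(S))  # smooth the variance
        # Recommended shot count for the next iteration; the bias b
        # keeps the denominator away from zero near a minimum.
        s_rec = 2 * L * alpha / (2 - L * alpha) * xi / (np.dot(chi, chi) + b)
        s = max(s_min, int(np.ceil(s_rec)))
        k += 1
    return theta
```

On a toy noisy quadratic cost, this loop exhibits the qualitative behavior described in the text: few shots while gradients are large, growing shot counts as the variance term comes to dominate.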

Figure 2 :
Figure 2: Comparison of performance for the compilation task across one hundred random target states and initial starts. As mentioned in the text, we denote algorithm A with s shots per operator measurement as A-s. Panels a and b show the average cost value attained as a function of the total number of shots (N) expended for the noiseless and noisy cases, respectively.

Figure 3 :
Figure 3: Comparison of performance for the Heisenberg spin chain VQE task across one hundred random starts. Panels a and b show the average ∆E value (i.e., energy above the ground-state energy) attained as a function of the total number of shots (N) expended for the noiseless and noisy cases, respectively.

Figure 4 :
Figure 4: Comparison of performance for the Ising VQE task with different numbers of sites (i.e., qubits) without hardware noise. Each panel shows the average ∆E per site attained as a function of the total number of shots (N) expended for each number of qubits. Each curve represents the average over ten random starts.

Table 1 :
Noiseless Compilation Average Cost Values

Table 2 :
Noisy Compilation Average Cost Values