Stochastic gradient descent for hybrid quantum-classical optimization

,


Introduction
Hybrid quantum-classical optimization with parameterized quantum circuits [1] provides a promising approach for understanding and exploiting the potential of noisy intermediate-scale quantum (NISQ) devices [2]. In this approach, which includes large classes of well studied methods like variational quantum eigensolvers (VQE) [3], the quantum approximate optimization algorithm (QAOA) [4] and quantum classifiers [5,6], a classical optimization scheme is utilized to update the parameters of a hybrid quantum-classical model, and developing and understanding optimization techniques tailored to this setting is of natural importance [7]. While a variety of gradient-free optimization methods have been proposed and studied [8], in this work we will be concerned with gradient descent type optimizers. This focus is motivated by the fact that for highly overparameterized classical models (such as modern neural networks) gradient-based optimization offers many advantages over gradient-free methods [9], and as a result developing and analyzing such methods designed specifically for the quantum-classical setting is crucial, particularly as the computational power of available near-term quantum devices increases.
In this hybrid quantum-classical setting, since part of the hybrid computation is executed by a quantum circuit, the optimized loss is generically a function of the expectation values of quantum observables. While zeroth-order methods can be immediately applied to obtain an approximation to the gradient from evaluations of the loss function [10,11], there also now exist a variety of strategies to directly evaluate the gradient, either via distinct quantum algorithms [12,13], or through the measurement of suitable observables with respect to states generated by the parameterized model [14][15][16]. As an example of the latter, it has recently been shown that the partial derivatives required for gradient descent can be exactly expressed as linear combinations of the same expectation values appearing in the loss function, but with respect to states generated from a shift in the tunable parameters of the parameterized quantum circuit -a strategy called the 'parameter shift rule' [15,16]. However, as an infinite number of measurements is required for the exact evaluation of a particular expectation value, it is not possible to implement exact gradient descent in this hybrid quantum-classical setting, even when the exact gradient can be written as a function of expectation values. As a result, previous approaches have typically used large numbers of measurements to estimate expectation values as accurately as possible [17][18][19], and the necessary circuit repetitions represent a significant overhead for the implementation of gradient based optimizers.
Recently however it has been observed that using a finite number of measurements for the evaluation of gradients effectively results in the implementation of stochastic gradient descent [11]. Stochastic gradient descent (SGD) optimization works by replacing the exact partial derivative at each optimization step with an estimator of the partial derivative, and when the estimator is unbiased it is often possible to prove rigorous convergence guarantees in appropriately simplified settings [20,21]. Additionally, SGD is the method of choice for the vast majority of large-scale machine learning models, where it has been found to offer advantages over exact gradient descent, such as faster evaluation of the gradient, faster convergence, and avoidance of local minima [22][23][24][25]. Given the natural connection between SGD and hybrid quantum-classical optimization, Harrow and Napp [11] provide explicit algorithms for the construction of unbiased estimators for the gradient of typical loss functions arising in the context of hybrid quantum-classical optimization, both by exploiting single shot measurements for the estimation of expectation values, and by importance sampling single terms of linear combinations of expectation values, which arise naturally from the use of local Hamiltonians to construct loss functions. Additionally, using these estimators they are able to generalize a variety of existing upper and lower SGD convergence bounds into the hybrid quantum-classical setting. These bounds then allow them to construct a simple class of optimization problems for which it can be proven that certain first-order hybrid quantum-classical SGD methods converge substantially faster than any zeroth-order method.
In light of these previous results and observations, we explore in this work, both theoretically and numerically, the applicability of these techniques, and a variety of heuristic extensions, in multiple concrete settings. As previously observed by Harrow and Napp [11], one such setting is in the context of VQE, where the cost function is typically a linear combination of expectation values of local Hamiltonian terms, and as such an unbiased estimator of the gradient can be constructed by sampling commuting subsets of these local terms in each optimization step. However, multiple other hybrid quantum-classical optimization settings also facilitate the use of similar strategies. For example, the parameter shift rule re-expresses partial derivatives as linear combinations of expectation values with respect to shifted circuit parameters, and unbiased estimators to these derivatives can therefore similarly be easily obtained by sampling subsets of terms in each optimization step. This insight can be particularly valuable in the context of continuous variable quantum circuits, where an infinite number of parameter shift terms may be required [16]. Additionally, in many data driven settings, such as quantum classifiers for example, loss functions are constructed as sums over data-set instances, and sampling subsets of instances in each optimization step is a well known classical strategy -known as mini-batch SGD -which has already been adopted from classical machine learning [5]. Of particular interest is the fact that these "sampling from linear combination" type SGD schemes can all be combined with each other, and with efficient expectation value estimation, resulting in optimization methods that we refer to as "doubly stochastic" gradient descent. This is in the same spirit as similarly motivated techniques for kernel methods, in which scalable doubly stochastic estimators to the gradient are constructed through the combination of two distinct unbiased approximations to the gradient [26,27].
Similarly to the kernel method setting, the concrete doubly stochastic SGD optimization schemes we formulate here may provide large efficiency gains over existing approaches, crucial for the implementation of variational algorithms on existing devices. For example, if a quantum classifier trained with D data points has a cost function consisting of M observables, which each need K "parameter shift" terms to extract the analytical gradient, and each expectation is an average of N measurement results, the most extreme version of stochastic gradient descent uses only a single measurement sample to estimate the gradient, saving O(DM KN ) measurements in each optimization step. Additionally, from a theoretical perspective, as has been previously mentioned, formalizing these notions allows one to prove convergence guarantees, in suitably restricted settings, for both existing methods and newly proposed SGD optimizers, thereby placing gradientbased hybrid quantum-classical optimization within a rigorous theoretical framework, which can be exploited both to guide the development and analysis of new methods, and to facilitate comparison of existing methods [11].
As suggested by previously obtained convergence bounds, and confirmed here in numerical simulations of various benchmark tasks for which these convergence bounds may not be directly applicable, the increased variance of such extreme-case estimators typically results in more optimization steps being required for convergence, and potentially non-optimal final solutions. However, while more optimization steps may be required, the enhanced efficiency of each each optimization step can result in significant overall savings in the number of circuit executions and measurements which are necessary to achieve convergence. Furthermore, in practice, a promising strategy is to treat the various SGD parameters -such as learning rate, number of measurement shots, and number of linear combination terms sampled -as hyper-parameters which are adjusted appropriately through the course of optimization. As we see from numerical simulations, basic implementations of this strategy suggest one is able to converge to highly accurate solutions, while retaining the efficiency gains of SGD. In fact, very recently the authors of Ref. [28] have also observed that the number of measurement shots used for expectation value estimation can be treated as a hyper-parameter within an SGD framework, and suggested multiple sophisticated heuristic strategies, inspired by state-of-the-art methods from classical optimization, for varying this parameter through the course of optimization. These strategies may well also be applicable within the context of the doubly stochastic gradient descent schemes we consider here, and provide a natural avenue of investigation for future work.
This work is structured as follows: We present the setting and basic idea in Section 2. We then develop a more general framework and body of concrete examples by exploring in detail the settings of VQE, QAOA and quantum classifiers in Sections 3, 4 and 5 respectively. In Section 6, we discuss certain extensions beyond the settings we consider here, before proceeding in Section 7 to prove rigorous convergence guarantees, complimentary to previously obtained bounds, for all considered optimizers. Given this formal framework, we present in Section 8 multiple numerical experiments and benchmarks, before concluding in Section 9 with both a discussion and outlook.

Setting and Idea
Given a model parameterized by θ ∈ R d , and some loss function L : R d → R, stochastic gradient descent algorithms can be viewed as optimization algorithms in which the exact gradient descent update rule is replaced with a stochastic update rule of the form where {g (t) (θ)} is a sequence of random variables -estimators of the gradient -which defines the particular algorithm. Perhaps counter-intuitively, this may offer multiple advantages: Stochasticity can potentially aid in the avoidance of local minima and saddle points [29], and if well designed, g (t) (θ) can be much more efficiently evaluated than the full gradient ∇L [30,31]. While such algorithms are heavily used and studied as heuristics, there is also an increasing body of work seeking to understand and prove their convergence properties. While we postpone a detailed discussion of convergence theorems to Section 7, we note that a fundamental property required for obtaining convergence guarantees is that the estimators {g (t) (θ)} are unbiased -i.e., for all t. In this work we will be concerned with the development of stochastic gradient descent algorithms within the setting of hybrid quantum-classical optimization. In particular, we will consider loss functions L of the form where O i θ is the expectation value of an observable O i with respect to the outcome of a parameterized quantum circuit U (θ) acting on an initial state vector |0 , i.e.
In the following sections, we will examine in detail specific loss functions relevant for current applications, however in order to present some of the basic ideas, let us consider as a first example the simple loss function for some observable O, which we assume can be readily measured. In order to utilize gradient descent optimization it is necessary to obtain expressions for all partial derivatives ∂L(θ)/∂θ i . While zeroth-order methods, such as finite differences, could be used to estimate these partial derivatives from evaluations of the loss function, we will in this work focus on first-order methods, in which one calculates these partial derivatives directly, without necessarily evaluating the loss function as an intermediate step. In particular, for many settings of interest, it has recently been shown that a parameter shift rule can be derived, via which all partial derivatives can be expressed as linear combinations of the same expectation value, but with respect to slightly shifted circuit parameters [15,16]. In this work we primarily restrict ourselves to settings in which such a rule can be derived 1 , which we formalize via the following definition: where θ k,i = θ + c k,i e i , with e i denoting a unit vector in the i'th direction.
In order to facilitate intuition, before continuing let us consider the following example: Example 1 (Parameter shift rule for single qubit generators [16]). Consider a parameterized quantum circuit of the form where each G i is a single qubit Hermitian operator with eigenvalues ±r i . In this case one finds that (8) Now, under the assumption of a K-term parameter shift rule for circuit ansatz U (θ), we see that all partial derivatives ∂L(θ)/∂θ i of our example loss function are the linear combination of K expectation values, as per Eq. (7). As expectation values cannot be evaluated exactly on quantum devices, we see already at this stage that exact gradient descent is not possible in this setting. However, as we show now, estimating expectation values via a finite number of measurement outcomes leads to unbiased estimators for the gradient, and therefore to well motivated stochastic gradient descent schemes. In particular, let us start with the following general definition for the n-shot sample mean estimator of an expectation value: Note that by construction we have that for all n, and that while all n-sample mean estimators have the same expectation value the variance of the estimator decreases with increasing n. Given this unbiased estimator for a single expectation value, it now follows straightforwardly that such estimators can be linearly combined to obtain unbiased estimators for the required partial derivatives, i.e., and therefore the estimator g i (θ) = k γ k,iõ (n) (θ k,i ) is an unbiased estimator for ∂L(θ)/∂θ i , which requires nK measurements to construct. Using this estimator we can then define a well founded SGD optimization algorithm via the update rule where we have implicitly defined g (t) i (θ (t) ) := g i (θ (t) ) for all t -i.e. the same estimator is used for all update steps. Importantly, note that this algorithm is valid even in the extreme case of n = 1. In fact, as exact evaluation of expectation values is not possible, many previous gradient-based approaches to loss functions such as the one considered here have been large n instances of such an SGD algorithm [17][18][19]. It is, however, possible to go further, and define a "doubly stochastic" gradient descent optimizer by not only estimating the expectation values via n measurements, but also sampling subsets of terms from the linear combination in each optimization step, and applying appropriate correction weights. In the extreme case of sampling only single terms, such an optimizer requires only n measurements per optimization step, as opposed to nK measurements.
In order to formalize this notion, we will denote the set of non-negative integers less than or equal to k as [k] := {1, . . . , k}, and let us assume for now that the estimatorõ (n) (θ k,i ) for each term O θ k,i in the linear combination of Eq. (7) is a discrete random variable which can take . We now note that the random variable g i (θ) taking values {(Kγ k,i )λ is an unbiased estimator for the linear combination of Eq. (7), i.e., The important thing to note is that the random variable g i (θ) can be sampled by first sampling a term of the linear combination uniformly at random -i.e., drawing k from [1, . . . , K] with prob(k) = 1/K -and then sampling the random variableõ (n) (θ k,i ) (requiring n measurements) before applying the correction factor Kγ k,i to the outcome. Alternatively, one can also sample with prob(k) = |γ k,i |/( k |γ k,i |), and apply the correction factor γ k,i /prob(k). While for certain settings this may well lead to estimators with lower variance, for ease of presentation we restrict ourselves here to uniform sampling over linear combinations. As g i (θ) is an unbiased estimator for ∂L(θ)/∂θ i one can use this estimator for an SGD optimizer, requiring only n measurements per optimization step, which is summarized via the following algorithms: Construct estimator for ∂L(θ (t) )/∂θ i 6: end for 7: Update the learning rate 8: Update all circuit parameters 9: t ← t + 1 10: end while Algorithm 2 n-shot doubly stochastic partial derivative estimator Given U (θ) satisfying a K-term parameter shift rule, and L(θ) = O θ 1: function PartialDerivativeEstimator(i, θ (t) ) 2: Sample k ∈ {1, . . . , K} with prob(k) = 1/K Sample a parameter shift term 3: k,i ) Construct estimator for ∂L(θ (t) )/∂θ i by applying correction factor 5: return g i (θ (t) ) 6: end function At this stage, the basic idea should be clear: In order to implement stochastic gradient descent one requires estimators for all partial derivatives, and when these partial derivatives can be expressed via linear combinations of expectation values then one can define a simple unbiased estimator for the gradient by using a finite number of measurements to estimate the expectation values occurring in the linear combination. Moreover, one can define a "doubly stochastic" estimator, as illustrated in Algorithm 2, by additionally sampling a subset of the terms of the linear combination in each optimization step. However, as we will see in the following sections, such linear combinations can arise also from loss functions that are a linear combination of observables, as in VQE, or from loss functions which are a sum over data-set instances, as for example found in quantum classification problems. In the latter example, one has to exercise caution as the loss function may not be a simple linear combination of expectation values, but rather a linear combination of non-linear functions of expectation values. Generic settings such as these require care to deal with properly, as discussed in Sections 5 and 6. Finally, before continuing we note that we have not specified the function GetLearningRate which is called by Algorithm 1. While a constant learning rate could be used, In practice a variety of different adaptive strategies are exploited, which can be proven to improve convergence properties in restricted settings [21,32]. Additionally, it should be noted that a wide variety of more sophisticated variations to Algorithm 1 exist, in which both the learning rate and parameter update rule depend explicitly on the history of prior gradient estimates [25]. We will explore and compare such variations to Algorithm 1 in Section 8.

Unbiased estimators for VQE
In this section, we examine how the ideas introduced in Section 2 can be straightforwardly extended and applied to the setting of variational quantum eigensolvers (VQE). In particular, given a local Hamiltonian which encodes a problem of interest, we define the VQE loss function For this simple loss function, we have that We then have that As a result, the simplest unbiased estimator for ∂L(θ)/∂θ i that we can define is just the linear combination of estimators for H j θ k,i -i.e., the random variable is the n-sample mean estimator of H j θ k,i , as per Definition 2. This estimator, which requires nKM measurements per parameter update, can be constructed using the function PartialDerivativeEstimator 1 given in Algorithm 3 below.
However, as we have seen in the previous section, in order to obtain more efficient unbiased estimators for ∂L(θ)/∂θ i , we can sample subsets of terms from either the summation over parameter shift terms, the summation over local Hamiltonian terms, or both. In order to provide an explicit example, consider the function PartialDerivativeEstimator 2 in Algorithm 3, in which only a single local term of the Hamiltonian is sampled, with a cost of nK measurements per parameter update. Note the similarity of the approach taken here with random compiling in digital quantum simulation [33], which in that context offers advantages such as favourable scaling of Trotter error accumulation. While this function explicitly samples a single local term of the Hamiltonian, it is possible to straightforwardly generalize this to an algorithm which samples a subset of local terms of the Hamiltonian for each parameter update. In particular, note that if [H l , H m ] = 0, then the estimatorsh m (θ k,i ) could both be constructed from measurements of n copies of the state vector U (θ k,i )|0 . As a result, by sampling subsets of commuting local terms in each parameter update step, which can be measured simultaneously on the state prepared by a single circuit execution, we are able to decrease the variance of the estimators while conserving the number of circuit executions required per parameter update. This approach allows for previously developed methods for grouping commuting Hamiltonian terms [34], again reminiscent of methods of grouping of terms in digital quantum simulation [35], to be easily applied within this context.
Finally, the function PartialDerivativeEstimator 3 in Algorithm 3 makes clear how in a similar manner one could additionally sample over terms (or subsets of terms) of the parameter shift summation, resulting in an SGD algorithm requiring only n measurements per parameter update. Of course, despite the increase in efficiency per optimization step, using small values of n and sampling over terms of the linear combinations leads to estimators with increased variance, and intuitively one may expect this to lead to slower convergence (in terms of number of update steps required), and to solutions which may be relatively far from global minima. This intuition is indeed valid, and in Sections 7 and 8 we explore these trade-offs between efficiency and accuracy, from both an analytical and numerical perspective.

11:
Sample j ∈ {1, . . . , M } with prob(j) = 1/M Sample a Hamiltonian term 12: for all 1 ≤ k ≤ K do Iterate through parameter shift terms 13: via n measurements 14: end for 15: k,i ) Construct estimator for ∂ H θ (t) /∂θ i via linear combination and correction factor 16: return g i (θ (t) ) 17: end function 18: function PartialDerivativeEstimator 3(i, θ (t) ) 19: Sample a Hamiltonian term 20: Sample k ∈ {1, . . . , K} with prob(k) = 1/K Sample a parameter-shift term 21: k,i ) Construct estimator for ∂ H θ (t) /∂θ i by applying correction factor 23: return g i (θ (t) ) 24: end function 4 Unbiased estimators for QAOA As in the VQE setting, given a problem Hamiltonian whose ground state encodes the solution to some combinatorial problem, the QAOA loss function is once again given simply by L(θ) = H P θ . Unlike VQE however, the QAOA algorithm is defined additionally by a specific parameterized circuit architecture, designed such that the variational circuit implements a discretized adiabatic evolution, from some known initial state, into the desired target state. While the loss function is the same, the nature of this specific ansatz has consequences for the design of SGD optimizers, which we focus our attention on here. To be more precise, when using parameterized circuits in the context of hybrid quantum-classical optimization, one often considers parameterized circuits of the form where each parameterized gate is parameterized by a distinct parameter [7], and for models of this type parameter shift rules can be derived under various different assumptions on the form of the constituent gates [16] (as per Example 1). By contrast, the QAOA parameterized circuit ansatz is defined as where H B = M j=1 H B j is a mixing Hamiltonian whose ground state is easy to prepare. Note that in this case the variational parameters define a Trotterized time evolution, and therefore a discretization of a typical adiabatic protocol. Under the assumption that all local terms of both the mixing and problem Hamiltonian commute -i.e., and one sees that multiple constituent gates are parameterized by the same variational parameter. In order to understand how a parameter shift rule can still be applied in this case, let us consider first a simple parameterized circuit architecture consisting of only a single variational parameter, which parameterizes a single variational gate, i.e., In this case we see that for any observable O the partial derivative of the parameterized expectation value is given by and from this expression, if e −iθG admits a K(G)-term parameter shift rule, then by definition where we have denoted explicitly that the number of terms, linear coefficients and parameter shifts all depend on G. Let us now consider a parameterized circuit with a single variational parameter, parameterizing multiple gates, whose generators G j all commute, i.e., In particular, we note by comparison with Eq. (26) that the QAOA ansatz is precisely of this form. In this case, the partial derivative O θ is given by By comparison with Eq. (28) we therefore see that, under the assumption that all the gates e −iθGj 's still satisfy a parameter shift rule, At this stage we can finally see that in the case where multiple gates are parameterized by the same variational parameter θ -as is the case for the QAOA ansatz -the parameter shift rule we obtain for the partial derivative ∂ O θ /∂θ can be written as a double sum. The first of these summations runs over all gates parameterized by this parameter, while the second runs over the parameter shift terms for a specific gate. As the QAOA loss function is identical to the VQE loss function, all the unbiased estimators of the previous section can be straightforwardly adapted to the QAOA setting, while the additional structure of the parameter shift rule, allows one to further sample from the linear combination over gates parameterized by the same variational parameter.

Unbiased estimators for MSE Quantum classifiers
Given the tools developed in the previous sections it is now relatively straightforward to construct unbiased estimators, and therefore stochastic gradient descent optimizers, for mean squared error (MSE) quantum classifiers. As we will see, the primary obstacle in this case is that the required partial derivatives are not linear combinations of expectation values, but linear combinations of non-linear functions of expectation values. To begin, it is helpful to define the mean squared error loss, as used for example in univariate regression. Specifically, given a target value y ∈ R we define the mean squared error (MSE) loss with respect to the operator O as If one is given a data-set of binary labelled examples, as is the case in classification problems, it is possible to use the average MSE loss over the data-set as the loss function for a classification algorithm. More precisely, given a data- where in this case U (θ, x j ) is the unitary operator corresponding to a quantum circuit parameterized by both θ and x j . The MSE loss with respect to O, over the data-set D, is then defined as We note that typically one may also add a regularization term to the above loss function, which we have omitted for notational clarity, but can be straightforwardly included in our analysis, as will be later shown. In order to construct stochastic gradient descent optimizers for the loss function of Eq. (37), it is convenient to start with the simplified "single instance" MSE loss function of Eq. (35). For this loss function, we have that In order to obtain an unbiased estimator for the partial derivative ∂L MSE (θ)/∂θ i it is sufficient to have independent unbiased estimators for O θ and ∂ O θ /∂θ i . However, for settings in which a parameter shift rule applies, we have already constructed such estimators. Explicitly, the n-sample meanõ (n) (θ) is an unbiased estimator for O θ , while under the assumption of a K-term parameter shift ruled is an unbiased estimator for ∂ O θ /∂θ i , which is independent fromõ (n) (θ) since they are evaluated from measurements of different circuit evaluations. As a result, one can straightforwardly show that the random variabled is an unbiased estimator for ∂L MSE (θ)/∂θ i , which requires (K +1)n measurements to construct. Of course, in the same spirit as previous sections, one can also sample single terms from the summation over the parameter shift terms, giving rise to an estimator which requires only 2n measurements to evaluate. In light of the above it is now straightforward to design unbiased estimators for MSE quantum classifiers, as the loss function of Eq. (37) involves only an additional linear combination over data-set instances. In order to formalize this, let's start with the following definitions: Given these estimators, we have that for the data-set MSE loss function of Eq. (38) and therefore the random variable is an unbiased estimator for ∂L(θ)/∂θ i , which requires M (K + 1)n measurements to construct. However, as in the case of classical mini-batch SGD, one can sample subsets B (batches) of data instances in each optimization step, giving rise to a "doubly stochastic" unbiased estimator which requires only |B|(K + 1)n measurements per parameter update. We present this particular algorithm below, for the case |B| = 1, however given the explicit VQE estimators of Algortithm 3 it should be clear how this can be extended to include sampling over parameter shift terms.

Extensions to Generic Loss Functions
In light of the optimizers presented for the MSE quantum classifier, a natural question is whether or not analogous optimizers can be obtained when regularization terms are included, or when alternative loss functions, such as the cross-entropy, are used. Let us begin with a remark on the inclusion of regularization terms, which can be handled straightforwardly via the following observation: Observation 1. Given some loss function L, let the random variable g(θ) be an unbiased estimator for ∇L(θ) -i.e., E[g(θ)] = ∇L(θ). Then, the random variablẽ is an unbiased estimator for the regularized loss functionL(θ) = L(θ) + R(θ). This can be straightforwardly verified via Unfortunately however an extension to generic loss functions is not as straightforward. To see this let's consider a loss function L, for which the partial derivative ∂L(θ)/∂θ i that we would like to estimate is a non-linear function of an expectation value, i.e., ∂L(θ)/∂θ i = f ( O θ ) for some non-linear function f . Assume now that the random variable X(θ) is an unbiased estimator for O θ (such as the n-sample mean estimator). If f had been a linear function, then we would have that and as such f (X) would be an unbiased estimator for ∂L(θ) and let us define the function h : In addition, for all m ≥ k, let us denote by P k,m the set of all m!/(m − k)! ordered arrangements (i 1 , . . . , i k ) of size k chosen from (1, . . . , m). Now, given a set {X j | j ∈ [m]} of m ≥ k independent copies of the random variable X, one can easily verify that the random variable is an unbiased estimator for f (E(X)), which requires m samples from X to construct. In fact, g (m) is the U-statistic for the kernel h, and as such is the minimum variance unbiased estimator for f (E(X)) [36]. As many relevant loss functions, such as the cross entropy or the log-likelihood, are certainly not polynomials, it is worth noting before continuing that despite the absence of rigorous constructions it is still possible to apply the spirit of the constructions from the previous sections -i.e., utilizing single shot estimations of observables and sampling over linear combinations -to obtain biased estimators for arbitrary loss functions, whose performance on practical tasks can be easily evaluated, and for which one may still be able to prove convergence bounds [37]. In fact, while unbiased estimators facilitate convergence proofs in simplified settings, it is not a-priori clear that naive (potentially biased) estimators would perform badly as heuristics. Such heuristic analysis, for loss functions such as the cross entropy, should be considered in parallel with a development of the theory. Our motivation in this work is to provide a general framework, and to lay the foundations for more sophisticated analysis.

Convergence Guarantees
The construction of convergence guarantees for stochastic gradient descent optimizers in fully realistic settings is currently an active and open area of research, of critical importance for building a theoretically solid foundation for state of the art machine learning algorithms [38]. In particular, as discussed in the previous sections, many prior approaches to gradient descent optimization in the context of hybrid quantum-classical optimization have been large n instances of the algorithms from the previous sections (without sampling over linear combinations), and one of the primary motivations of this work is to place these previously heuristic approaches on a solid formal footing. While convergence guarantees are not yet available for state of the art models, highly non-convex landscapes and optimization algorithms with heuristically adapted learning rates, in certain simplified settings it is indeed possible to obtain convergence theorems, which allow one to build a measure of confidence in these techniques. In order to gain a more formal perspective on the algorithms presented in the previous sections, as well as insight into the requirements for the development of further well motivated stochastic gradient descent algorithms, we examine in this section convergence guarantees for loss functions L : R d → R which satisfy the Polyak-Lojasiewicz (PL) inequality, a slightly generalized notion of strong convexity [21]. These bounds are complimentary to the bounds previously discussed by Harrow and Napp for strongly-convex loss functions [11], and are presented explicitly in order to facilitate discussion around both the technical requirements for such guarantees in the hybrid quantum-classical setting, and the trade-offs inherent in SGD optimization schemes, which are explored numerically in the following section. In particular, a loss function L : R d → R satisfies the PL inequality for some µ > 0 if for all θ ∈ R d it is true that where θ * ∈ R d is the value at which L(θ) attains a global minimum. Under the assumption of the PL inequality, one can state the following convergence theorem, whose proof can be found in Refs. [21,39]: Theorem 1 (Stochastic gradient descent convergence [39]). Let L ∈ C 1 (R d ) satisfy the Polyak-Lojasiewicz inequality in Eq. (52) for some µ > 0, have L-Lipschitz gradient and a global minimum attained at θ * . For any T ∈ N and any θ ∈ R d let g (1) (θ), . . . , g (T ) (θ) be i.

i.d random variables with values in
2. If ∀θ, t : E[||g (t) (θ)|| 2 ] ≤ β 2 ||∇L(θ)|| 2 and α = 1/(Lβ 2 ), then We note that as all previously discussed algorithms have been constructed via unbiased estimators, Theorem 1 applies directly to these algorithms provided one can prove Lipschitz continuity of the gradient ∇L. Additionally, in order to apply the convergence bounds of Theorem 1 in a quantitative manner, it is necessary to obtain an upper bound on the quantity E[||g (t) (θ)|| 2 ], which in principle amounts to calculating the variance of the estimator g (t) (θ), as can be seen from the following straightforward calculation: where we defined the variance of a random vector as Var[X] := E[||X|| 2 ] − E[||X||] 2 . As Theorem 1 only applies in the simplified setting of loss functions which satisfy the PL inequality we will not address these upper bounds explicitly here, as in practice more sophisticated convergence theorems are necessary to make applicable quantitative statements for realistic settings. However, we note the critical role this variance plays in determining the distance between the global optimum and the solution one can be guaranteed to converge to, and in Section 8 we explore this interplay, in realistic settings, through extensive numerical experiments. As such, we focus on the issue of Lipschitz continuity, and provide the following results proving that for a large class of realistic parameterized quantum circuits, under the assumption of a parameter shift rule, the gradient of the loss functions considered in this work are indeed Lipschitz continuous. To be specific, we start with the following theorem (whose proof can be found in Appendix A), which guarantees Lipschitz continuity of expectation values for a specific class of parameterized quantum circuits: Given this result, an upper bound to L can be obtained as follows: One finds where U (θ) ≡ U L (θ)e −iθj Hj U R (θ), and Eq. (59) was obtained from Eq. (58) via the Cauchy-Schwarz inequality. In order to cover more realistic loss functions, we need to extend Theorem 2 to sums and products of parameterized expectation values. For example, we have already seen that for the MSE loss function, under the assumption of a parameter shift rule, we have that where θ k,j = θ + c k,j e j . The first thing to notice is that if O θ is L-Lipschitz continuous via Theorem 2, then so is O θ k,j . We see this by noting that for any constant c k,j , and therefore O θ k,j satisfies the assumptions of the Lemma after absorbing exp(−ic k,j )H j into the set of fixed (non-parameterized) gates. This observation, in conjunction with standard result that if f (θ) and g(θ) are Lipschitz continuous functions bounded on their domains, then f + g and f g are Lipschitz continuous [40], guarantees Lipschitz continuity for the class of functions we are interested in.

Numerical Experiments and Benchmarks
Given the unbiased estimators of Sections 3, 4 and 5, we explore here their performance on benchmark tasks. In particular, while it is clear from the above analysis that sampling over linear combination terms and utilizing small values of n leads to algorithms which require significantly fewer measurements per optimization step than previous approaches, while possessing convergence guarantees in simplified settings, it is not a-priori clear how these algorithms perform on realistic tasks. Specifically, as the convergence properties of the algorithms depend both on the Lipschitz constant of the gradient of the loss, and upper bounds on the variance of the estimator g (t) (θ), it may be that the advantages of both linear combination sampling and small n implementations are outweighed by slow or unstable convergence behavior in practice. Fortunately, as shown in the following sections, at least in the simplified setting of noiseless devices this is not the case, and small n implementations of the previously introduced algorithms indeed offer multiple advantages over large n approaches, which can be further enhanced through the use of heuristics such as decaying learning rate, and momentum based optimizers [41]. All numerical results presented in the following sections were obtained using qradient [42], a custom open-source package for the simulation of hybrid quantum-classical optimization, and all scripts, data and random number seeds are available on request.

Stochastic gradient descent for VQE
We start by exploring, for different values of n, the estimator formalized by the function Par-tialDerivativeEstimator 1 in Algorithm 3 (which we refer to as n-shot SGD), allowing us to gain insight into the effect of efficient expectation value estimation, before introducing the additional stochasticity of Hamiltonian term sampling. In particular, we consider the critical transverse field Ising model with open boundary conditions, given by the Hamiltonian with N = 8, where X j and Z j are the respective Pauli operators on site j. Critical Ising models are good candidates for such an endeavor, as the entanglement structure of such critical systems prohibits the use of classical tensor network methods for large system sizes [43]. The parameterized circuit that we consider consists of a sequence of σ-blocks (with σ ∈ [X, Y, Z]), where a single σblock has the following structure: The central column of figures shows the results for n-shot SGD with learning rate decay, from an initial learning rate of α = 0.005. The right hand column shows the results of n-shot SGD with learning rate decided adaptively by the Adam optimizer [41], starting from an initial learning rate of α = 0.005. For each algorithm, and each value of n, the experiment has been repeated 8 times, and the minimum, mean and maximum at each step is displayed.

A layer of nearest neighbour
CNOT gates with control on qubit j and target on qubit j + 1, for all even j.
3. A layer of nearest neighbour CNOT gates with control on qubit j and target on qubit j + 1, for all odd j.
Note that for one-dimensional circuits consisting of N qubits each block has N free parameters -i.e., the rotation angles on each qubit. The particular architecture that we consider consists of an initial Y -block, with all free parameters set to π/4, followed by k parameterized σ-blocks, alternating between X, Y and Z blocks, in that order. A k block circuit of this type therefore has kN free parameters, and we consider a circuit with 50 blocks, and therefore 400 free parameters, as N = 8. An explicit circuit diagram can be found in Appendix B. We implemented Algorithm 1, calling the estimator PartialDerivativeEstimator 1, with a fixed learning rate α = 0.005, for multiple values of n, as well as two heuristic variations: 1. Learning rate decay: If the energy has not decreased in the last 20 optimization steps, then the learning rate is decreased by a factor of 2.
2. Adam optimizer: A variation of Algorithm 1 which computes adaptive learning rates for each parameter, using the history of prior gradient estimates [41]. Implemented with an initial learning rate of α = 0.005 and hyper-parameters β 1 = 0.9 and β 2 = 0.999.
The results of these experiments can be seen in Fig. 1, and there are a variety of interesting aspects to note. Firstly, from the top left figure one can see that, as one might expect, with a fixed learning rate the quality of the final solution is strongly effected by the value of n. In particular, for very small values of n we expect a large variance in the gradient, even close to a local minima, and the algorithm can therefore only converge to relatively inaccurate solutions. Again as we might expect, the accuracy of the obtained solutions can be successively tightened by increasing the value of n, and in the process decreasing the variance of the gradient between optimization steps. By comparing the top left figure with the central and right figure of the top row we can however see that the quality of the final solution can also be significantly improved, even for very small values of n, by utilizing adaptive or decaying learning rates, and once again this makes intuitive sense. Once one has converged to a fixed solution, decreasing the learning rate scales the gradient estimator, and therefore effectively decreases the variance in the parameter updates, allowing one to move closer to a local minimum, even with a large variance in the gradient [32]. As such, we see that both decaying learning rate and increasing shot number allow one to improve the quality of the final solution that is obtained. We note however, that analogously to SGD in purely classical deep learning where variance can be decreased either through scaling via learning rate decay or directly via the batch size, it is not a-priori clear which strategy is optimal [44]. Additionally, from the top row of results in Fig. 1 it would appear that even with adaptive learning rate decay, one can achieve better quality solutions by utilizing larger shot numbers, and as such it is not clear that there is something to be gained by using small n estimators. In order to explore this more clearly, let us define the number of measurements required per optimization step of an n-shot algorithm as MC n -where MC stands for measurement cost. The bottom row of results in Fig. 1 shows the performance of the algorithms not with respect to optimization steps, but with respect to number of measurements performed, as a multiple of MC 1 (the smallest possible number of measurements per optimization step). With respect to this metric, it is clear from the bottom row of figures that in terms of numbers of measurements, single shot (n = 1) stochastic gradient descent algorithms converge much faster than large n algorithms. However, by comparing the upper and lower rows of results one can see that in this setting, for both standard SGD and SGD with decay (left and central panel), in order to improve the quality of the final solution, eventually it is necessary to increase the value of n. While the situation is less clear for the Adam optimizer, it also seems likely that increasing n at some point is necessary to achieve higher quality solutions in this case. As such, a natural strategy, which will be discussed further in Section 9, would be to combine learning rate decay with increasing shot numbers. This should allow one to exploit the rapid convergence, and therefore enhanced measurement efficiency, of small n estimators, while still allowing one to obtain the solutions which are accessible via large n estimators. In fact, recent work has explored precisely such heuristic optimization strategies, with promising results [28]. Given these insights, we explore in Fig. 2 the performance of the estimator PartialDeriva-tiveEstimator 2 -i.e., n-shot stochastic gradient descent with Hamiltonian sampling -in order to explore the additional effect of sampling over linear combinations. As discussed in Section 3, in practical settings one would not sample single terms of the Hamiltonian per parameter update, but rather commuting sets of Hamiltonian terms which can be easily measured simultaneously, so that the maximum amount of information can be extracted from a single circuit forward pass. However, for illustration purposes, we explore here the performance of this algorithms when a single local term of the Hamiltonian is sampled, as this is clearly an extreme case, whose performance can only be improved by sampling more terms simultaneously. Once again we implemented Algorithm 1 with learning rate decided by the Adam optimizer, with an initial learning rate of α = 0.005. As a result of Theorem 1, and from our numerical experiments with PartialDerivativeEstimator 1, due to the high variance of the estimator in this case, even with large values of n, we would expect this algorithm to converge slowly as a function of the number of optimization steps required, and indeed this is what is observed in Fig. 2. However, the crucial insight is that although more optimization steps are required, each optimization step of Algorithm 1 requires up to a factor of M less circuit executions (where M is the number of local terms of the Hamiltonian), and using Hamiltonian sampling may well facilitate more rapid convergence or initialization, with respect to the number of circuit executions required. As the precise efficiency gains compared to the simple n-shot estimator PartialDerivativeEstimator 1 depend on the strategy via which local Hamiltonian terms are grouped and measured (i.e., MC 1 varies depending on the strategy for obtaining measurements of all local Hamiltonian terms), we have not provided a direct comparison between PartialDerivativeEstimator 1 and PartialDerivativeEstimator 2 in terms of circuit executions and number of measurements, and we leave such comparisons, with state-of-the art strategies for minimizing all involved costs, to future work, emphasizing that this depends on the particular Hamiltonian at hand. Despite this, it is clear from Fig. 2 that the final solutions obtained when using PartialDerivativeEstimator 2 are relatively far from the optimal solutions obtained by large n versions of PartialDerivativeEstimator 2, and as such once again a natural strategy would be to use Hamiltonian sampling as an initialization strategy, with the number of terms (or commuting sets of terms) of the Hamiltonian treated as a hyper-parameter.

Stochastic gradient descent for QAOA
As discussed in Section 4, if one does not sample over linear combinations of local Hamiltonian terms, or over parameter shift terms, then the unbiased estimator formalized via PartialDeriva- tiveEstimator 1 from Algorithm 3 can be directly applied in the context of the QAOA circuit ansatz, by simply re-interpreting the double sum from the parameter shift rule as a single sum over a multi-index. In this section we explore the performance of this estimator, again for different values of n, on various instances of the MaxCut problem [45]. Specifically, given a graph G = (V, E), the MaxCut problem is equivalent to finding the ground state of the Hamiltonian As mentioned in Section 4, the QAOA ansatz alternates application of the problem Hamiltonian with a mixing Hamiltonian H B via For all experiments performed here, a value of d = 100 has been used, with the mixing Hamiltonian Fig. 3 shows the results obtained by using the above QAOA ansatz, in conjunction with the estimator PartialDerivativeEstimator 1, on 20 randomly drawn instances of the MaxCut problem, all with |V | = 8 vertices, and |E| = 16 edges. For all instances, the parameters were initialized such that they realize a simple linear interpolation between H P and H B -i.e., for odd j (parameters of H P terms) we have that θ j = j/d while for even j (parameters of H B terms) we have that θ j = 1 − j/d. This initialization is informed by recent studies which identified global optima for MaxCut problems, which were typically close to linear interpolation solutions [46]. The optimization was carried out using the Adam optimizer, with initial learning rate α = 0.001 and hyper-parameters β 1 = 0.8 and β 2 = 0.999. Once again we are able to draw various insights from the results of Fig. 3, which are similar in nature to the conclusions from our VQE simulations. In particular, by comparing the left and center panels of Fig. 3, which show the convergence behavior with respect to optimization steps and measurement cost respectively for one specific MaxCut instance, we once again see that while 1-shot SGD converges slower in terms of the number of optimization steps required, it converges faster in terms of the number of measurements required (where once again we have plotted the number of measurements in multiples of M C 1 , the measurement cost per optimization step of single shot SGD). The rightmost panel of Fig. 3 confirms that this behavior is indeed generic. Specifically, the right most panel shows the cost obtained by n-shot SGD after 10000M C 1 measurements, on 20 randomly drawn MaxCut instances, and it is clear that for all instances, single shot SGD obtains (often significantly) lower costs with this number of measurements, and hence should clearly be used as an initialization strategy. Given this initialization strategy the second natural question is then whether learning rate decay is sufficient to decrease the variance of the parameter updates such that highly accurate solutions can be obtained, or whether in addition it is necessary to increase the shot number. From the central panel (in which the number of steps is limited by computational resources) the fact that the 9-shot curve intersects the single shot curve suggests that, for this particular example, the 9-shot algorithm is likely to converge to a more accurate solution than the single shot algorithm. As such, if one begins with the single shot algorithm, then it is possible that increasing the number of measurements at some point would allow one to obtain a more accurate solution than one that could be obtained by keeping the number of measurements constant and simply decreasing the learning rate further. Given this, as per the VQE setting, in order to achieve the most accurate solution with the minimum number of measurements, a promising strategy is to combine learning rate decay with an increasing number of measurements.

Stochastic gradient descent for MSE classification
In order to investigate the performance of n-shot doubly stochastic gradient descent for MSE loss functions on a realistic task, we consider the problem of image classification. In particular, we consider a binary classification task defined by a data-set consisting of down-sampled grayscale 3's and 6's from the MNIST data-set. To be specific, each selected MNIST image has been down-sampled from 28 × 28 pixels to 8 × 8 pixels by first removing 6 pixels from all boundaries, and then removing every second row and every second column. Each image was then flattened into a 64 dimensional vector, and the pixel values were normalized so that the 2-norm of each image vector was 1. Each image was then encoded into an 8 qubit state via amplitude encoding [5,47]. The training data-set consisted of 2000 3's (labelled with y = 1) and 2000 6's (labeled with y = −1), while the validation data-set consisted of 200 3's and 200 6's. We have utilized the same parameterized circuit architecture as in the previous section (with 6 qubits and 18 σ-blocks) followed by a measurement of Z 1 , where given an image instance x the model has been defined via The loss function has been as per Eq. (38), with O = Z 1 , and we investigated the performance of the estimator defined in Algorithm 4 -resulting in what we refer to as n-shot doubly stochastic gradient descent (DSGD) -for different fixed learning rates and values of n, as can be seen in Fig. 4. In particular, in order to provide a benchmark the top row of Fig. 4 shows the results for "singly"-stochastic gradient descent -i.e., mini-batch gradient descent with batch size 1 and exact expectation values (which, as discussed, is not practically feasible), while the bottom row of Fig. 4 shows the results for n-shot doubly stochastic gradient descent, also with batch size 1.
An important thing to note is that because of the way the model is defined in Eq. (68), a high confidence (large shot number n) estimate of Z 1 is required in order to make a prediction with the model, even though, as we have discussed at length, such an estimate is not required to implement the doubly stochastic optimization algorithm. Additionally, we note that in order to avoid overfitting, a typical supervised learning procedure involves dividing the optimization into "epochs" -a single pass of the optimization algorithm through the training data set -each of which is followed by an evaluation of the optimization procedure, and a subsequent determination of convergence, by calculating the model accuracy on the validation data-set [9]. In light of the prior observations concerning the shot numbers necessary for high-confidence model predictions, a natural training strategy in this setting is therefore to train the algorithm for a single epoch, using low shot number estimates and without making simultaneous predictions on the training data, before then evaluating the accuracy of the model on the (typically smaller) validation set, using large shot number estimates of the expectation value to make predictions. In order to reflect this strategy, and the information one would then have available in a practical implementation of this algorithm, for the "singly"-stochastic gradient   The results from n-shot doubly stochastic gradient descent (i.e. using the function PartialDerivativeEstimator 4 from Algorithm 4) are shown in the lower row, and the results from exact expectation value stochastic gradient descent are shown in the upper row, both with batch-size = 1. All results were obtained using a fixed learning rate, displayed in the title of the respective subplot. All experiments were repeated 8 times, and the mean, max and min in each step is displayed. A particular experiment was decided to have converged if no improvement in the validation set accuracy was achieved in the previous 5 epochs, and for each shot number and learning rate pair the curve is plotted up until the epoch number which was reached by the slowest experiment to converge. descent, in which exact expectation values are used in training, we have plotted both the training and validation set accuracy after each epoch, while for the n-shot doubly stochastic gradient descent we have plotted only the validation set accuracy, which was evaluated after each training epoch by using exact expectation values (as a proxy for the much higher shot numbers one would use in practice).
Once again, there are a variety of lessons to be taken from the results of Fig. 4. Firstly, as can be seen by comparing the columns, it is clear that as in the purely classical case, the learning rate plays a large role, and care should be taken in estimating this hyper-parameter. In these experiments, we see that for the learning rates α = 0.005 and α = 0.0005, the results from n-shot doubly and exact expectation value SGD do not differ significantly. As in the VQE setting, the n-shot DSGD algorithms require more epochs to converge, but as we have seen in the previous section, as each epoch is significantly cheaper for small n DSGD, it is again clear that single shot DSGD should definitely be used to rapidly find approximate solutions, before possibly increasing n, in conjunction with learning rate decay, to increase the quality of the solution. In this case the enhanced efficiency of small n estimators is amplified by the size of the data-set, as each epoch consists of a single optimization step per data-set instance, and thus the measurement efficiency improvements per optimization step should be multiplied by the size of the data-set.

Discussion and Conclusion
In light of the above analysis and results, it is possible to draw a variety of observations and conclusions. Firstly, in the context of hybrid quantum-classical optimization, in which both loss functions and their gradients can be expressed as functions of expectation values, exact gradient descent is not possible. As such, all optimization schemes which in practice require the estimation of expectation values from a finite number of measurement results should be considered within the framework of stochastic gradient descent. In this work we have formalized this notion for VQE, QAOA and MSE classification problems, from which it is clear that previous approaches based on approximating expectation values with n measurement outcomes can be understood as instances of well defined stochastic gradient descent type algorithms, which are valid for all n, and in particular, even for n = 1. It is therefore not a-priori necessary to perform large numbers of measurements when implementing these algorithms in practice. Additionally, as is done classically in mini-batch stochastic gradient descent, whenever the loss function consists of a linear combination of terms which can be estimated in an unbiased way it is possible to obtain a new "doubly stochastic" optimizer by sampling over terms of the linear combination. In the context of hybrid quantum-classical optimization such linear combinations appear from a variety of sources, such as sums over local Hamiltonian terms, sums over parameter shift terms, or sums over data-set instances. Exploiting this insight has allowed us to define multiple doubly stochastic gradient descent algorithms, for practically relevant settings such as VQE, QAOA and quantum classifiers. Moreover, while convergence guarantees for realistic non-convex landscapes remain an open question, for simplified landscapes convergence guarantees are available for all of the algorithms discussed, which provide a minimum measure of confidence in their construction.
From these convergence guarantees it is clear that even in restricted settings the accuracy of the solution to which a given stochastic gradient descent algorithm will converge is dependent on the variance of the estimator for the gradient. It is natural to expect that both small n implementations of the algorithms we have introduced, as well as algorithms which involve samples over (possibly multiple) linear combinations, may not achieve solutions which are as fine-tuned as those obtained by large n algorithms. From the numerical experiments we have performed in Section 8 we see that this is indeed the case, however there are two important things to note. Firstly, adaptive heuristic learning rate schemes can successfully compensate for the variance of gradient estimators, and significantly improve the quality of solutions, even for very high variance estimators. Secondly, we see that on benchmark practical settings both small n stochastic gradient descent algorithms, and algorithms that sample over linear combinations of terms, can converge orders of magnitude faster, with respect to the total number of measurements required. As a result, even though the solutions converged to are not optimal, it is well advised to utilize small n SGD algorithms -possibly with linear combination sampling -as initializers which are able to rapidly find approximate solutions, using very few measurements relative to the number required by large n algorithms to reach the same point. Once convergence has been achieved for small n with sampling, optimization can be continued with a combination of learning rate decay and slowly increasing values of n, in order to fine tune the solutions.
Given the results presented here, there remain a variety of directions to explore. From an analytical perspective, it is naturally of interest to construct unbiased estimators for loss functions such as the log-likelihood and the cross entropy. However, as previously discussed, despite the absence of rigorous constructions for unbiased estimators for non-polynomial loss functions, one can use both efficient expectation value estimation and sampling over linear combinations to constructed biased estimators for these loss functions, whose performance as heuristic strategies may be worth investigating further, and for which one still may be able to design optimization algorithms with rigorous convergence guarantees [37]. Additionally, It is also desirable to obtain convergence guarantees for settings in which the Polyak-Lojasiewicz inequality is not satisfied, although this is expected to require significantly new techniques, which are sought after by the classical machine learning community. It should also be noted that the results and experiments presented here have all neglected the effects of noise, and understanding how realistic noise influences these results is of natural importance for developing optimization methods for use in conjunction with existing devices. One possible scenario is that realistic device noise will lead to estimators which remain unbiased, but with a larger variance. In this case, the strategies suggested here remain valid, however one may ultimately require smaller learning rates and larger shot numbers for fine-tuning solutions. If however realistic device noise introduces a bias into the estimators, then as mentioned earlier, one could still apply the strategies and optimization algorithms developed here as heuristics, and it may also be possible to design more complicated algorithms which still admit convergence guarantees [37]. In order to understand the role of noise more fully, it would therefore be of interest both to study the performance of these algorithms on existing devices and to integrate realistic noise models into the analysis. In fact, along precisely these lines, very recently the authors of Ref. [48] have shown that counter to intuition noise may in fact be beneficial for stochastic gradient descent optimization, and it is hoped that this work stimulates further future research in these directions. Finally, it would be interesting to combine the algorithms presented here with additional techniques for reducing the number of circuit executions and forward passes. Very recent work has in fact introduced a variety of heuristic hyper-parameter adaptation schemes for measurement shot adaptation [28], and it will be of interest to investigate the extension of these techniques to the additional algorithms we have presented here. It is also worthwhile to further exploit structure such as sparsity in the gradients estimated.
On a higher level, we hope that the rigorous framework provided in this work contributes in a significant fashion to the growing body of analytical work on variational quantum-classical hybrid algorithms and algorithms with applications in quantum-enhanced machine learning, complementing a -to date -still largely empirical field of research. At the same time, we hope that our work provides further perspectives to find practical applications of near-term quantum devices, let this be near-term quantum circuits [49,50] or instances of programmable quantum simulators [51][52][53], beyond showing conceptually interesting quantum advantages or "supremacy" [54][55][56][57][58][59]. Optimized schemes of the kind developed here may help in the search of finding such applications.
Putting together the previous statements we then obtain the following Lemma. Proof. For all j, and for all y ∈ [a, b] M −1 , one can apply Lemma 3 to the function g j,y , from which one obtains that g j,y is L j,y -Lipschitz continuous with L j,y = sup x∈ [a,b] ∂f (y 1 , . . . y j−1 , x, y j , . . . y M −1 ) ∂x j .
The statement then follows from Lemma 4 and the definition of L j .
Finally, given Lemma 5 the proof of Theorem 2 follows straightforwardly: Proof (Theorem 2). For all j we have that the partial derivatives exist and are continuous on R M , and therefore the statement of the theorem follows from Lemma 5.

B Parameterized circuit and optimization details
Fig. 5 below shows the parameterized quantum circuit U (θ) used for both the VQE and MSE quantum classifier experiments, as discussed in Section 8. As can be seen, the circuit is constructed from layers of Pauli rotations, interleaved with CN OT ladders, and consists of 400 free parameters.
|0 Y π C CO 2 Emission Table   The table below