General parameter-shift rules for quantum gradients

Variational quantum algorithms are ubiquitous in applications of noisy intermediate-scale quantum computers. Due to the structure of conventional parametrized quantum gates, the evaluated functions typically are finite Fourier series of the input parameters. In this work, we use this fact to derive new, general parameter-shift rules for single-parameter gates, and provide closed-form expressions to apply them. These rules are then extended to multi-parameter quantum gates by combining them with the stochastic parameter-shift rule. We perform a systematic analysis of quantum resource requirements for each rule, and show that a reduction in resources is possible for higher-order derivatives. Using the example of the quantum approximate optimization algorithm, we show that the generalized parameter-shift rule can reduce the number of circuit evaluations significantly when computing derivatives with respect to parameters that feed into many gates. Our approach additionally reproduces reconstructions of the evaluated function up to a chosen order, leading to known generalizations of the Rotosolve optimizer and new extensions of the quantum analytic descent optimization algorithm.

These algorithms have a common structure: a parametrized circuit is executed and a cost function is composed from expectation values measured in the resulting state.A classical optimization routine is then used to optimize the circuit parameters by minimizing said cost function.Initially, gradientfree optimization methods, such as Nelder-Mead and COBYLA, were common.However, gradient-based optimization provides significant advantages, from convergence guarantees [32] to the availability of workhorse algorithms (e.g., stochastic gradient descent) and software tooling developed for machine learning [33,34,35,36,37].
The so-called parameter-shift rule [16,23,38,39] can be used to estimate the gradient for these optimization techniques, without additional hardware requirements and -in contrast to naïve numerical methods -without bias; the cost function is evaluated at two shifted parameter positions, and the rescaled difference of the results forms an unbiased estimate of the derivative.However, this twoterm parameter-shift rule is restricted to gates with two distinct eigenvalues, potentially requiring expensive decompositions in order to compute hardwarecompatible quantum gradients [40].While various extensions to the shift rule have been discovered, they remain restricted to gates with a particular number of distinct eigenvalues [10,41].
In this manuscript, we use the observation that the restriction of a variational cost function to a single parameter is a finite Fourier series [42,43,44,45]; as a result, the restricted cost function can be reconstructed from circuit evaluations at shifted positions using a discrete Fourier transform (DFT).By analytically computing the derivatives of the Fourier series, we extract general parameter-shift rules for arbitrary quantum gates and provide closed-form expressions to apply them.In the specific case of unitaries with equidistant eigenvalues, the general parameter-shift rule recovers known parameter-shift rules from the literature, including the original two-term parametershift rule.We then generalize our approach in two steps: first from equidistant to arbitrary eigenvalues of the quantum gate, and from there -by making use of stochastic parameter shifts -to more complicated unitaries like multi-parameter gates.This enables us Figure 1: Overview of existing and new parameter-shift rules for first-order univariate derivatives as Venn diagram on the space of quantum gates.Each rule produces the analytic derivative for a set of gates, with more general rules reproducing the more specific ones.For gates of the form U (x) = exp(ixG) the rules are deterministic (left) whereas more involved gates of the form UF = exp(i(xG + F )) require stochastic evaluations of shifted values (right).See Sec.2.2 for a summary of previously known shift rules.The fermionic four-term shift rule in Ref. [41] covers the same gates as the shown four-term rule (purple).
to cover all practically relevant quantum gates.An overview of the existing parameter-shift rules and our new results is shown in Fig. 1.
Afterwards, we perform an extensive resource analysis to compare the computational expenses required by both the general shift rule presented here, and decomposition-based approaches.In particular, we note that evaluating the cost of gradient recipes by comparing the number of unique executed circuits leads to fundamentally different conclusions on the optimal differentiation technique than when comparing the total number of measurements.
Our analysis not only is fruitful for understanding the structure of variational cost functions, but also has several practical advantages.Firstly, second-order derivatives (such as the Hessian [46] and the Fubini-Study metric tensor [47,48]) can be computed with fewer evaluations compared to naïvely iterating the two-term parameter-shift rule.We also show, using the example of the quantum approximate optimization algorithm (QAOA), that the generalized parametershift rule can reduce the number of quantum circuit evaluations required for ansätze with repeated parameters.
Finally, we generalize the quantum analytic descent (QAD) algorithm [49] using the reconstruction of variational cost functions discussed here.We also reproduce the known generalizations of Rotosolve [50,51] from single Pauli rotations to groups of rotations controlled by the same parameter [42,45]; reconstruct-ing functions with arbitrary spectrum extends this algorithm even further.Furthermore, the cost reduction for the gradient we present in the context of QAOA applies to Rotosolve as well.Similarly, future improvements that reduce the cost for gradient computations might improve the efficiency of these model-based algorithms, based on the analysis presented here.
This manuscript is structured as follows.In Sec. 2, we lay out the setting for our results by deriving the general functional form for variational cost functions, followed by a survey of existing parameter-shift rules.In Sec. 3 we show how to fully reconstruct univariate variational cost functions from a finite number of evaluations assuming an equidistant frequency spectrum, and derive parameter-shift rules for arbitraryorder univariate derivatives, including a generalization of the stochastic parameter-shift rule.In Sec. 4 we demonstrate how to compute second-order derivatives, in particular the Hessian and the metric tensor, more cheaply compared to existing methods.In Sec. 5 we discuss applications, applying the new generalized parameter-shift rules to QAOA, and using the full univariate reconstruction to extend existing model-based optimization methods.We end the main text in Sec.6 with a discussion of our work and potential future directions.Finally, in the appendix we summarize some technical derivations (App.A), and extend the results to more general frequency spectra (App.B).The general stochastic parameter-shift rule and details on quantum analytic descent can be found in Apps.C and D.
Related work: In Ref. [42], the functions of VQAs were considered as Fourier series and parameter-shift rules were derived.Regarding the shift rules, the authors of Ref. [42] consider integer eigenvalues and derive a rule with 2R + 1 evaluations for equidistant eigenvalues.In particular, the two-term and fourterm shift rules are reviewed and formulated as special cases with fewer evaluations than the general result presented there.In contrast, our work results in the exact generalization of those shift rules, which requires 2R evaluations.Remarkably, Refs.[42,45] also propose a generalized Rotosolve algorithm prior to its eponymous paper.
In addition, during the final stages of preparation of this work, a related work considering algebraic extensions of the parameter-shift rule appeared online [52].The general description of quantum expectation values in Sec.2.1 of the present work, along with its initial consequences in Sec.3.1, are shown in Sec.II A of this preprint.We present a simpler derivation and further explore the implications this description has.The generalization of the parametershift rule in Ref. [52] is obtained by decomposing the gate generator using Cartan subalgebras, which can yield fewer shifted evaluations than decompositions of the gate itself.In particular, decompositions into non-commuting terms, which do not lead to a gate decomposition into native quantum gates directly, can be used in this approach.
At a similar time, yet another work appeared [53], presenting a derivation similar to Sec. 2.1 and parameter-shift rules for the first order derivative.These rules are based on the ideas discussed here in Secs.3.1 and 3.2.

Background
We start by deriving the form of a VQA cost function of a single parameter for a general single-parameter quantum gate.Then we review known parametershift rules and briefly discuss resource measures to compare these gradient recipes.

Cost functions arising from quantum gates
Let us first consider the expectation value for a general gate U (x) = exp(ixG), defined by a Hermitian generator G and parametrized by a single parameter x.Let |ψ denote the quantum state that U is applied to, and B the measured observable 1 .The eigenvalues of U (x) are given by {exp(iω j x)} j∈ [d] with real-valued {ω j } j∈ [d] where we denote [d] := {1, . . ., d} and have sorted the ω j to be non-decreasing.Thus, we have: where we have expanded B and |ψ in the eigenbasis of U , denoted by b jk and ψ j , respectively.We can collect the x-independent part into coefficients c jk := ψ j b jk ψ k and introduce the R unique pos- Note that the differences are not necessarily equidistant, and that for r = {ω j } j∈ [d] unique eigenvalues of the gate generator, there are at most R ≤ r(r−1) 2 unique differences.However, many quantum gates will yield R ≤ r equidistant differences in- 1 Here we consider any pure state in the Hilbert space; in the context of VQAs, |ψ is the state prepared by the subcircuit prior to U (x).Similarly, B includes the subcircuit following up on U (x). stead; a common example for this is for commuting Pauli words P k (P k P k = P k P k ), which yields the frequencies [P] and thus R = P.
In the following, we implicitly assume a mapping between the two indices j, k ∈ [d] and the frequency index We can then write the expectation value as a trigonometric polynomial (a finite-term Fourier series): with frequencies given by the differences {Ω }, where we defined c =: Since E(x) is a finite-term Fourier series, the coefficients {a } and {b } can be obtained from a finite number of evaluations of E(x) through a discrete Fourier transform.This observation (and variations thereof in Sec. 3) forms the core of this work: we can obtain the full functional form of E(x) from a finite number of evaluations of E(x), from which we can compute arbitrary order derivatives.

Known parameter-shift rules
Parameter-shift rules relate derivatives of a quantum function to evaluations of the function itself at different points.In this subsection, we survey known parameter-shift rules in the literature.
For functions of the form (6) with a single frequency Ω 1 = Ω (i.e., G has two eigenvalues), the derivative can be computed via the parameter-shift rule [16,23,38] where x 1 is a freely chosen shift angle from (0, π) 3 .This rule was generalized to gates with eigenvalues {−1, 0, 1}, which leads to R = 2 frequencies, in Refs.[41,10] in two distinct ways.The rule in Ref. [10] is an immediate generalization of the one above: 3 The position 0 for the derivative is chosen for convenience but the rule can be applied at any position.To see this, note that shifting the argument of E does not change its functional form.
with freely chosen shift angles x 1,2 and corresponding coefficients y 1,2 , requiring four evaluations to obtain E (0).A particularly symmetric choice of shift angles is In contrast, the rule in Ref. [41] makes use of an auxiliary gate to implement slightly altered circuits, leading to a structurally different rule: where E α ± is the measured energy when replacing the gate U (x) in question by U (x ± π/2) exp(∓αi π 4 P 0 ) and P 0 is the projector onto the zero-eigenspace of the generator of U .Remarkably, this structure allows a reduction of the number of distinct circuit evaluations to two if the circuit and the Hamiltonian are real-valued, which is often the case for simulations of fermionic systems and forms a unique feature of this approach.This second rule is preferable whenever this condition is fulfilled, the auxiliary gates exp(±i π 4 P 0 ) are available, and simultaneously the number of distinct circuits is the relevant resource measure.
Furthermore, the two-term parameter-shift rule Eq. ( 7) was generalized to gates with the more complicated gate structure U F (x) = exp(i(xG + F )) via the stochastic parameter-shift rule [39] Here, E ± (t) is the energy measured in the state prepared by a modified circuit that splits U F (x 0 ) into U F (tx 0 ) and U F ((1 − t)x 0 ), and interleaves these two gates with U F =0 (±x 1 ).See Sec.3.6 and App.C for details.The first-order parameter-shift rules summarized here and their relationship to each other is also visualized in Fig. 1.
A parameter-shift rule for higher-order derivatives based on repeatedly applying the original rule has been proposed in Ref. [46].The shift can be chosen smartly so that two function evaluations suffice to obtain the second-order derivative: which like Eq. ( 7) is valid for single-frequency gates.
Various expressions to compute combinations of derivatives with few evaluations were explored in Ref. [54].

Resource measures for shift rules
While the original parameter-shift rule Eq. (7) provides a unique, unbiased method to estimate the derivative E (0) via evaluations of E if it contains a single frequency, we will need to compare different shift rules for the general case.To this end, we consider two resource measures.Firstly, the number of distinct circuits that need to be evaluated to obtain all terms of a shift rule, N eval .This is a meaningful quantity on both, simulators that readily produce many measurement samples after executing each unique circuit once, as well as quantum hardware devices that are available via cloud services.In the latter case, quantum hardware devices are typically billed and queued per unique circuit, and as a result N eval often dictates both the financial and time cost.Note that overhead due to circuit compilation and optimization scale with this quantity as well.Secondly, we consider the overall number N of measurements -or shots -irrespective of the number of unique circuits they are distributed across.To this end, we approximate the physical (one-shot) variance σ 2 of the cost function E to be constant across its domain 4 .For an arbitrary quantity ∆ computed from M values of E via a shift rule, we obtain the variance for the estimate of ∆ as where N µ expresses the number of shots used to measure E(x µ ).For a total budget of N shots, the optimal shot allocation is This can be understood as the number of shots needed to compute ∆ to a tolerable standard deviation ε.
The number of shots N is a meaningful quantity for simulators whose runtime scales primarily with the number of requested samples (e.g., Amazon Braket's TN1 tensor network simulator [1]), and for actual quantum devices when artificial resource measures like pricing per unique circuit and queueing time do not play a role.
In this work we will mostly use N eval to compare the requirements of different parameter-shift rules as it is more accessible, does not rely on the assumption of constant physical variance like N does, and the coefficients y to estimate N are simply not known analytically in most general cases.For the case of equidistant frequencies and shift angles as discussed in Sec.3.4 we will additionally compare the number of shots N in Sec.3.5.

Univariate cost functions
In this section we study how a quantum cost function, which in general depends on multiple parameters, varies if only one of these parameters is changed.
The results of this section will be sufficient to evaluate the gradient as well as the diagonal of the Hessian of a quantum function.We restrict ourselves to functions that can be written as the expectation value of an observable with respect to a state that is prepared using a unitary U (x) = exp(ixG) -capturing the full dependence on x.That is, all parameters but x are fixed and the operations they control are considered as part of the prepared state and the observable.As shown in Sec.2.1, this yields a trigonometric polynomial, i.e., In the following, we will assume the frequencies to be equidistant, i.e., Ω = Ω, and generalize to arbitrary frequencies in App.B. While it is easy to construct gate sequences that do not lead to equidistant frequencies, many conventional gates and layers of gates do yield such a regular spectrum.The equidistant frequency case has two major advantages over the general case: we can derive closed-form parametershift rules (Sec.3.4); and the number of circuits required for the parameter-shift rule scales much better (Sec.3.5).Without loss of generality, we further restrict the frequencies to integer values, i.e., Ω = .For Ω = 1, we may rescale the function argument to achieve Ω = and once we reconstruct the rescaled function, the original function is available, too.

Determining the full dependence on x
As we have seen, the functional form of E(x) is known exactly.We can thus determine the function by computing the 2R + 1 coefficients {a } and {b }.This is the well-studied problem of trigonometric interpolation (see e.g., [55,Chapter X]).
A reasonable choice is x µ = 2πµ 2R+1 , µ = −R, . . ., R, in which case the transform is the usual (uniform) DFT.For this choice, an explicit reconstruction for E follows directly from [55, Chapter X]; we reproduce it in App.A.1.1.

Determining the odd part of E(x)
It is often the case in applications that we only need to determine the odd part of E, For example, calculating odd-order derivatives of E(x) at x = 0 only requires knowledge of E odd (x), since those derivatives of the even part vanish.Note that the reference point with respect to which E odd is odd may be chosen arbitrarily, and does not have to be 0.
The coefficients in E odd can be determined by evaluating E odd at R distinct points x µ with 0 < x µ < π.This gives us a system of R equations which we can use to solve for the R coefficients {b }.Using Eq. (16) we see that each evaluation of E odd can be done with two evaluations of E(x).Thus, the odd part of E can be completely determined with 2R evaluations of E, saving one evaluation compared to the general case.Note however that the saved E(0) evaluation is evaluated regardless in many applications, and may be used to recover the full reconstruction -so, in effect, this saving does not have a significant impact5 .

Determining the even part of E(x)
We might similarly want to obtain the even part of E, which can be used to compute even-order derivatives of E. Determining E even (x) requires R + 1 evaluations of E even , which leads to 2R + 1 evaluations of E for arbitrary frequencies.However, in the case where Ω are integers, R + 1 evaluations of E even can be obtained with 2R evaluations of E(x) by using periodicity: Thus, in this case 2R evaluations of E(x) suffice to determine its even part, saving one evaluation over the general case.In contrast to the odd part, this saving genuinely reduces the required computations as E(0) is also used in the cheaper computation of {a }; therefore, if E(0) is already known, we only require 2R − 1 new evaluations.We note that even though both the odd and the even part of E(x) require 2R evaluations, the full function can be obtained at the price of 2R + 1 evaluations.

Explicit parameter-shift formulas
Consider again the task of determining E odd (E even ) based on its value at the shifted points {x µ } with µ ∈ [R] (µ ∈ [R] 0 ).This can be done by linearly combining elementary functions that vanish on all but one of the {x µ }, i.e., kernel functions, using the evaluation E(x µ ) as coefficients.If we restrict ourselves to evenly spaced points we can choose these functions to be Dirichlet kernels.In addition to a straightforward reconstruction of the odd (even) function this delivers the general parametershift rules, which we derive in App.A.1: We remark that derivatives of higher order can be obtained in an analogous manner, and with the same function evaluations for all odd (even) orders.Furthermore, this result reduces to the known two-term (Eq.( 7)) and four-term (Eq.( 8)) parameter-shift rules for R = 1 and R = 2, respectively, as well as the second-order derivative for R = 1 (Eq.( 11)).
We again note that the formulas above use different evaluation points for the first and second derivatives (2R evaluations for each derivative).Closedform parameter-shift rules that use 2R + 1 shared points can be obtained by differentiating the reconstruction formula Eq. (57).

Resource comparison
As any unitary may be compiled from (single-qubit) Pauli rotations, which satisfy the original parameter-shift rule, and CNOT gates, an alternative approach to compute E (0) is to decompose U (x) into such gates and combine the derivatives based on the elementary gates.As rotation gates about any multiqubit Pauli word satisfy the original parameter-shift rule as well, a more coarse-grained decomposition might be possible and yield fewer evaluations for this approach.
For instance, for the MaxCut QAOA ansatz6 on a graph G = (V, E) with vertices V and edges E, one of the operations is to evolve under the problem Hamiltonian: Eq. ( 26) treats U P (x) as a single operation with at most M = |E| frequencies 1, . . ., R ≤ M , and we can apply the generalized parameter-shift rules of this section.Alternatively, we could decompose U P (x) with Eq. ( 27), apply the two-term parameter-shift rule to each R ZZ rotation, and sum up the contributions using the chain rule.

Number of unique circuits
If there are P gates that depend on x in the decomposition, this approach requires 2P unique circuit evaluations; as a result, the general parameter-shift rule is cheaper if R < P. The evaluations used in the decomposition-based approach cannot be expressed by E directly because the parameter is shifted only in one of the P gates per evaluation, which makes the general parameter-shift rule more convenient and may reduce compilation overhead for quantum hardware, and the number of operations on simulators.
In order to compute E (0) via the decomposition, we need to obtain and sum the full Hessian of all elementary gates that depend on x (see App. A.4.2), which requires 2P 2 −P+1 evaluations, including E(0), and thus is significantly more expensive than the 2R evaluations for the general parameter-shift rule.
While the derivatives can be calculated from the functional form of E odd or E even , the converse is not true for R > 1, i.e., the full functional dependence on x cannot be extracted from the first and second derivative alone.Therefore, the decomposition-based approach would demand a full multivariate reconstruction for all P parametrized elementary gates to obtain this dependence, requiring O(2 P ) evaluations.The approach shown here allows us to compute the dependence in 2R + 1 evaluations and thus is the only method for which the univariate reconstruction is viable.

Number of shots
For equidistant evaluation points, we explicitly know the coefficients of the first and second-order shift rule given in Eqs.(24,25), and thus can compare the variance of the derivatives in the context and under the assumptions of Sec.2.3.
The coefficients satisfy (see App. A.4.1) This means that the variance-minimizing shot allocation requires a shot budget of using the generalized parameter-shift rule for the first and second derivative, respectively.Assuming integer-valued frequencies in the cost function typically means, in the decomposition-based approach, that x enters the elementary gates without any additional prefactors 7 .Thus, optimally all evaluations for the first-order derivative rule are performed with the same portion of shots; whereas the secondorder derivative requires an adapted shot allocation which, in particular, measures E(0) with high precision as it enters E (0) with the prefactor P/2.This yields (see App. A.4.2) Comparing with N genPS, 1 and N genPS, 2 above, we see that the shot budgets are equal at P = R.That is, for both the first and second derivative, the general parameter-shift rule does not show lower shot requirements in general, in contrast to the previous analysis that showed a significantly smaller number of unique circuits for the second derivative.This shows that the comparison of recipes for gradients and higherorder derivatives crucially depends on the chosen resource measure.In specific cases we may be able to give tighter upper bounds on R so that R < P (see Sec. 5.1) and the general shift rule becomes favourable regarding the shot count as well.

General stochastic parameter-shift rule
Next, we will apply the stochastic parameter-shift rule to our general shift rule.For this section we do not assume the frequencies to be equidistant but address arbitrary spectra directly.Additionally we make the reference point x 0 at which the derivative is computed explicit.In Ref. [39], the authors derive the stochastic parameter-shift rule for gates of the form where G is a Hermitian operator with eigenvalues ±1 (so that G 2 = 1), e.g., a Pauli word.F is any other Hermitian operator, which may not necessarily commute with G8 .Key to the derivation of the stochastic rule is an identity relating the derivative of the quan- We may extend this directly to the general parametershift rule for the case when G 2 = 1 is no longer satisfied (see App. C for the derivation): The integration is implemented in practice by sampling values for t for each measurement of E µ (x 0 , t) and E −µ (x 0 , t).
The stochastic parameter-shift rule in combination with the generalized shift rule in Eq. (24) allows for the differentiation of any unitary with equidistant frequencies.As F in U F (x) above is allowed to contain terms that depend on other variational parameters, this includes multi-parameter gates in particular.Furthermore, combining Eq. (33) with the generalized shift rule for arbitrary frequencies in Eq. (90) allows us to compute the derivative of any quantum gate as long as the frequencies of U F =0 (x) are known.We thus obtain an improved rule for U F =0 (x) over the original stochastic shift rule whenever the generalized shift rule is beneficial for U (x) = U F =0 (x), compared to the decomposition-based approach.

Second-order derivatives
As noted in Sec.3.3, higher-order derivatives of univariate functions are easily computed using the even or odd part of the function.In the following sections, we will extend our discussion to multivariate functions E(x), where derivatives may be taken with respect to different variables.Each single parameter dependence is assumed to be of the form Eq. (5), with equidistant (and by rescaling integer-valued) frequencies {Ω for the kth parameter.We may collect the numbers of frequencies in a vector (R) k = R k .It will again be useful in the following to make the reference point x 0 , at which these derivatives are computed, explicit.

Diagonal shift rule for the Hessian
Here we show how to compute the Hessian H of a multivariate function E(x) at some reference point x 0 using the Fourier series representation of E. We allow for single-parameter gates U (x) = exp(ixG) with equidistant frequencies and will use fewer evaluations of E than known schemes.An indication that this may be possible for gates with two eigenvalues was made in [54, Eq. ( 37)].
First, for the kth diagonal entry ) of the Hessian, we previously noted in Sec.3.3 that 2R k evaluations are sufficient as it is the second derivative of a univariate restriction of E. Recall that one of the 2R k evaluations is E(x 0 ); we can reuse this evaluation for all diagonal entries of H, and thus require k E in addition to the gradient, we may reuse the 2 R 1 evaluations computed for the gradient, only requiring a single additional function value, namely E(x 0 ).In this case, we do not make use of the periodicity where v k is the kth canonical basis vector, because this shift is not used in the gradient evaluation (see Sec. 3.2).
Next, for an off-diagonal entry H km = ∂ k ∂ m E(x 0 ), consider the univariate trigonometric function that shifts the two parameters x k and x m simultaneously: where we abbreviated v k,m := v k + v m .We show in App.A.2 that E (km) again is a Fourier series of x with R km = R k + R m equidistant frequencies.This means that we can compute E (km) (0) via Eq.( 25) with R = R km , using 2R km − 1 evaluations of E (as we may reuse E(x 0 ) from the diagonal computation).Note that and that we have already computed the diagonal entries.We thus may obtain H km via the diagonal parameter-shift rule In Fig. 2, we visually compare the computation of H km via the diagonal shift rule to the chained application of univariate parameter-shift rules for x k and x m .
As an example, consider the case when R k = R m = 1 (e.g., where all parametrized gates are of the form

exp(ix
. By setting R = 2 in Eq. (25), we obtain the explicit formula for E (km) (0), which can be combined with Eq. (36) to give an explicit formula for the Hessian.This formula (for R k = R m = 1) was already discovered in [54, Eq. ( 37)].
The computation of H km along the main diagonal in the x k -x m -plane can be modified by making use of the second diagonal as well: ), and compute This means we can replace the dependence on the diagonal elements H kk and H mm by another univariate second-order derivative on the second diagonal.We will not analyze the resources required by this method in detail but note that for many applications it forms a compromise between the two approaches shown in Fig. 2.
We note that an idea similar to the ones presented here can be used for higher-order derivatives, but possibly requires more than one additional univariate reconstruction per derivative.

Resource comparison
For the Hessian computation, we will again look at the number of unique circuit evaluations N eval and the number of shots N , as introduced in Sec.2.3.

Number of unique circuits
In Tab. 1, we summarize the number of distinct circuit evaluations required to compute several combinations of derivatives of E(x), either by decomposing the gate or by using the general parameter-shift rule together with the diagonal shift rule for the Hessian.We also include the generalized case of non-equidistant frequencies covered in App.B.2 for completeness.To obtain the cost for the repeated general shift rule, i.e., without the diagonal shift rule for the Hessian or decomposition, simply replace P by R in the left column.
For equidistant frequencies, the diagonal shift rule for H km requires 2(R k + R m ) − 1 evaluations, assuming the diagonal and thus E(x 0 ) to be known already.Like the gradient, H km may instead be computed by decomposing U k (x k ) and U m (x m ) into P k and P m elementary gates, respectively, and repeating the parameter-shift rule twice [46,56].All combinations of parameter shifts are required, leading to 4P k P m evaluations.Finally, as a third option, one may repeat the general parameter-shift rule in Eq. (24) twice, leading to 4R k R m evaluations 9 .
The repeated general shift rule requires strictly more circuit evaluations than the diagonal shift rule, since Similar to the discussion for the scaling of gradient computations, the optimal approach depends on R k,m and P k,m , but P and R often have a linear relation so that the diagonal shift rule will be significantly cheaper for many cost functions than decomposing the unitaries.

Number of shots
Next we compare the numbers of measurements required to reach a precision ε.While the approach via repeated shift rules uses distinct circuit evaluations for each Hessian entry, the diagonal shift rule in Eq. (36) reuses entries of the Hessian and thus correlates the optimal shot allocations and the statistical errors of the Hessian entries.We therefore consider an error measure on the full Hessian matrix instead of a single entry, namely the root mean square of the Frobenius norm of the difference between the true and the estimated Hessian.This norm is computed in App.A.5 for the three presented approaches, and we conclude the number of shots required to achieve a norm of ε to be In general, the diagonal shift rule for the Hessian is significantly less efficient than the repeated execution of the general parameter-shift rule if the shot count is the relevant resource measure.This is in sharp contrast to the number of unique circuits, which is strictly smaller for the diagonal shift rule.We note that the two resource measures yield incompatible recommendations for the computation of the Hessian.The overhead of the diagonal shift rule reduces to a (to leading and therefore

Metric tensor
The Fubini-Study metric tensor F is the natural metric on the manifold of (parametrized) quantum states, and the key ingredient in quantum natural gradient descent [48].The component of the metric belonging to the parameters x k and x m can be written as or, alternatively, as a Hessian [46]: It follows that we can compute the metric using the same method as for the Hessian, with f (x) as the cost function.We know the value of f without shift as The values with shifted argument can be calculated as the probability of the zero bitstring 0 when measuring the state V † (x)V (x 0 ) |0 in the computational basis, which requires circuits with up to doubled depth compared to the original circuit V (x).Alternatively, we may use a Hadamard test to implement f , requiring an auxiliary qubit, two operations controlled by that qubit as well as a measurement on it, but only Quantity Decomposition Gen. shift rule, equidistant Gen. shift rule halved depth on average (see App. A.3).With either of these methods, the terms for the shift rule in Eq. (36) and thus the metric tensor can be computed via the parameter-shift rule.
The metric can also be computed analytically without parameter shifts via a linear combination of unitaries (LCU) [57,58], which also employs Hadamard tests.As it uses the generator as an operation in the circuit, any non-unitary generator needs to be decomposed into Pauli words for this method to be available on quantum hardware, similar to a gate decomposition.Afterwards, this method uses one circuit evaluation per pair of Pauli words from the kth and mth generator to compute the entry F km .A modification of all approaches that use a Hadamard test is possible by replacing it with projective measurements [56].
Metric entries that belong to operations that commute within the circuit 10 can be computed block-wise without any auxiliary qubits, additional operations or deeper circuits [48].For a given block, we execute the subcircuit V 1 prior to the group of mutually commuting gates and measure the covariance matrix of the generators {G k } of these gates: By grouping the measurement bases of all {G k G m } 10 For example, operations on distinct wires commute in general but not necessarily within the circuit if entangling operations are carried out between them.
and {G k } of the block, the covariance matrix can typically be measured with only a few unique circuit evaluations 11 , making this method the best choice for the block-diagonal.One may then either use the result as an approximation to the full metric tensor, or use one of the other methods to compute the off-blockdiagonal entries; the approximation has been shown to work well for some circuit structures [48], but not for others [59].The methods to obtain the metric tensor and their resource requirements are shown in Tab. 2.
Since we run a different circuit for the metric tensor than for the cost function itself, the 2R k − 1 evaluations at shifted positions needed for the kth diagonal entry cannot reuse any prior circuit evaluations, as is the case for the cost function Hessian.Consequentially, the natural gradient of a (single term) expectation value function E, with ∇E referring to the Euclidean gradient, requires more circuit evaluations than its Hessian and gradient together.
However, the utility of the metric tensor becomes apparent upon observing that it depends solely on the ansatz, and not the observable being measured.This The LCU method (center right) is based on Hadamard tests as well and both these methods can spare the auxiliary qubit and instead employ projective measurements [56].The cheapest method is via measurements of the covariance of generators (right) but it can only be used for the block-diagonal of the tensor, i.e., not for all F km .We denote the depth of the original circuit V by DV and the number of Pauli words in the decomposition of G k and its square with P k and Q k , respectively.The P k Pauli words of G k can be grouped into P k groups of pairwise commuting words; the number of groups of pairwise commuting Pauli words in the product G k Gm similarly is P km .For the covariance-based approach, we overestimate the number of required circuits, as typically many of the measurement bases of the entries in the same block will be compatible.
The number of unique circuits to be evaluated for a diagonal element F kk , an off-diagonal element F km , and the full tensor F is given in terms of the number of frequencies R k and of Q k , P k P k and P km .The entries for N eval in the first and second row of the braces refer to equidistant (main text) and arbitrary frequencies (see App. B.2), respectively.
means that if a cost function has multiple terms, like in VQEs, the metric only needs to be computed once per epoch, rather than once per term, as is the case of the cost function Hessian.Therefore, an epoch of quantum natural gradient descent can be cheaper for such cost functions than an epoch of optimizers using the Hessian of the cost function.In addition, the block-diagonal of the metric tensor can be obtained with few circuit evaluations per block for conventional gates without any further requirements and with reduced average circuit depth.

Applications
In this section, we will present QAOA as concrete application for our general parameter-shift rule, which reduces the required resources significantly when computing derivatives.Afterwards, we use the approach of trigonometric interpolation to generalize the Rotosolve algorithm.This makes it applicable to arbitrary quantum gates with equidistant frequencies, which reproduces the results in Refs.[42,45], and extends them further to more general frequency spectra.In addition, we make quantum analytic descent (QAD) available for arbitrary quantum gates with equidistant frequencies, which previously required a higherdimensional Fourier reconstruction and thus was infeasible.

QAOA and Hamiltonian time evolution
In Eq. (24) we presented a generalized parameter-shift rule that makes use of 2R function evaluations for R frequencies in E. A particular example for singleparameter unitaries with many frequencies are layers of single-or two-qubit rotation gates, as can be found e.g., in QAOA circuits or digitized Hamiltonian time evolution algorithms.The quantum approximate optimization algorithm (QAOA) was first proposed in 2014 by Farhi, Goldstone and Gutmann to solve classical combinatorial optimization problems on near-term quantum devices [8].Since then, it has been investigated analytically [60,61,62], numerically [63,64], and on quantum computers [65,66].
In general, given a problem Hamiltonian H P that encodes the solution to the problem of interest onto N qubits, QAOA applies two types of layers alternatingly to an initial state |+ ⊗N : where p is the number of blocks which determines the depth of the circuit, X k is the so-called mixing layer, and U P (x) = exp(−ixH P ) is the time evolution under H P .The parameters x can then be optimized to try to minimize the objective function Here we focus on the layer U P , and we look at the example of MaxCut in particular.The corresponding problem Hamiltonian for an unweighted graph G = (V, E) with N vertices V and M edges E reads and U P correspondingly contains M two-qubit Pauli-Z rotations R ZZ .
We note that H M has eigenvalues −N, −N + 2, • • • , N , which means the corresponding frequencies (differences of eigenvalues) are 2, • • • , 2N .Thus, treating U M (x 2j ) as a single operation, Eq. (6) implies that E(x) can be considered an N -order trigonometric polynomial in x 2j , and the parameter-shift rules we derive in Sec. 3 will apply with R = N .Similarly, H P has corresponding frequencies in the set [M ], and it will obey the parameter-shift rule for R = M , although we may be able to give better upper bounds λ for R. Thus the unique positive differences {Ω } for those layers, i.e., the frequencies of E(x) with respect to the parameter {x 2j−1 } j∈[p] , take integer values within the interval [0, λ] as well.We may therefore use Eq.(24), with the knowledge that R ≤ λ ≤ M .
Note that knowing all frequencies of E(x) requires knowledge of the full spectrum of H P -and in particular of λ -which in turn is the solution of MaxCut itself.As a consequence, the motivation for performing QAOA becomes obsolete.Therefore, in general we cannot assume to know {Ω } (or even R), but instead require upper bounds ϕ(G) ≥ MaxCut(G) = λ which can be used to bound the largest frequency, and thus the number of frequencies R and subsequently the number of terms in the parameter-shift rule.It is noteworthy that even if the largest frequency λ is known exactly via a tight bound -which restricts the Fourier spectrum to the integers [λ] -not all integers smaller than λ need to be present in the set of frequencies {Ω }, so that the estimate for R may be too large 12 .
One way to obtain an upper bound uses analytic results based on the Laplacian of the graph of interest [67,68], for which automatic bound generating programs exist [69].An alternative approach uses semi-definite programs (SDPs) that solve relaxations of the MaxCut problem, the most prominent being the Goemans-Williamson (GW) algorithm [70] and recent extensions thereof that provide tighter upper bounds [71,72].The largest eigenvalue is guaranteed to be within ∼ 0.878 of these SDP upper bounds. 12A simple example for this is the case of 2k-regular graphs; here, H P only has even eigenvalues, and therefore all frequencies are even as well.Given an upper bound ϕ, we thus know the number of frequencies to satisfy R ≤ ϕ/2.
To demonstrate the above strategy, we summarize the number of evaluations required for the gradient and Hessian of an n-parameter QAOA circuit on N qubits for MaxCut in Tab. 3, comparing the approach via decomposing the circuit, to the one detailed above based on ϕ and the improved Hessian measurement scheme in Sec.4.1.Here, we take into account that half of the layers are of the form U P , and the other half are mixing layers with R = N frequencies.We systematically observe the number of evaluations for the gradient to be cut in half, and the those for the gradient and Hessian together to scale with halved order in N (and k, for regular graphs).
In addition, we display the numbers of circuit evaluations from Tab. 3 together with SDP-based bounds for λ and the true minimal number of evaluations required for the parameter-shift rule in Fig. 3.For this, we sampled random unweighted graphs of the corresponding type and size and ran the GW algorithm as well as an improvement thereof to obtain tighter bounds [71].On one hand we observe the advantage of the generalized parameter-shift rule and the cheaper Hessian method that can be read off already from the scalings in Tab. 3. On the other hand, we find both SDP-based upper bounds to provide an exact estimate of the largest eigenvalue in the N ≤ 20 regime, as can be seen from the cut values obtained from the GW algorithm that coincide with the upper bound.In cases in which the frequencies {Ω } occupy all integers in [R], this leads to an exact estimate of R and the evaluations in the shift rule.For all graph types but complete graphs, the SDP-based upper bounds yield a better estimate for the number of terms than the respective analytic bound ϕ, which improves the generalized shift rule further.
In summary, we find the generalized parametershift rule to offer a constant prefactor improvement when computing the gradient and an improvement of at least O(N ) when computing both the gradient and the Hessian.For certain graph types, knowledge about the structure of the spectrum and tight analytic bounds provide this advantage already, whereas for other graph types the SDP-based bounds reduce the evaluation numbers significantly.

Rotosolve
The Rotosolve algorithm is a coordinate descent algorithm for minimizing quantum cost functions.It has been independently discovered multiple times [42,45,51,50], with [50] giving the algorithm its name but only (along with [51]) considering parametrized Pauli rotations, and [42,45] covering other unitaries with integer-valued generator eigenvalues.
The Rotosolve algorithm optimizes the rotation angles sequentially: for one variational parameter x k at a time, the cost function is reconstructed as a function of that parameter using 2R k +1 evaluations, the mini- Table 3: Evaluation numbers for the gradient, or both the gradient and the Hessian, for QAOA circuits for MaxCut on several types of graphs.Each graph has N vertices and a graph type-specific number M of edges, and the (even) number of parameters is denoted as n.For K-regular graphs, we know M = min{(N 2 − N )/2, KN/2}, and the latter value is used in the displayed evaluation costs; if the former value forms the minimum, the graph is in fact complete.The left column is based on decomposing the circuit, applying the conventional two-term parameter-shift rule per elementary gate and iterating it for the Hessian.The right column employs the generalized parameter-shift rule Eq. ( 24) combined with an upper bound ϕ for the largest eigenvalue λ of the problem Hamiltonian, as well as the reduced number of evaluations for Hessian off-diagonal terms from Sec. 4.1.The bound ϕ for complete graphs can be found in Ref. [67].
mum of the reconstruction is calculated, and then the parameter is updated to the minimizing angle.For the case of Pauli rotation gates this minimum can be found via a closed-form expression.Recent studies have shown such coordinate descent methods to work well on many tasks [73,50,45,74], although there are limited cases where these methods fail [75].While Rotosolve is not gradient-based, our cost reduction for the gradient presented in Sec.5.1 stems from a cost reduction for function reconstruction, and hence is applicable to Rotosolve as well.
As shown in Sec.3.1, the univariate objective function can also be fully reconstructed if the parametrized unitaries are more complicated than Pauli rotations, using the function value itself and the evaluations from the generalized parameter-shift rule.Since the generalized parameter-shift rule also applies for non-equidistant frequencies (see App. B), the reconstruction works in the same way for arbitrary single-parameter gates.This extends our generalization of Rotosolve beyond the previously known integer-frequency case [42,45], although the number of frequencies-and thus the cost-for the reconstruction are typically significantly increased for noninteger frequencies.While the minimizing angle might not be straightforward to express in a closed form as it is the case for a single frequency, the one-dimensional minimization can efficiently be carried out numerically to high precision, via grid search or semi-definite programming [76, Chapter 4.2].

Quantum analytic descent
Quantum analytic descent (QAD) [49] also approaches the optimization problem in VQAs via trigonometric interpolation.In contrast to Rotosolve, it considers a model of all parameters simultaneously and includes second-order derivatives, but this model only is a local approximation of the full cost function.Additionally, QAD has been developed for circuits that exclusively contain Pauli rotations as parametrized gates.
The algorithm evaluates the cost function E at 2n 2 + n + 1 points around a reference point x 0 , and then constructs a trigonometric model of the form13 Here, we introduced A(x) and the element-wise square of a vector v, (v 2 ) k := v 2 k as for the Hessian diagonal.The coefficients E (A/B/C/D) are derived from the circuit evaluations, taking the form of a scalar, two vectors and an upper triangular matrix.More precisely, the expansion basis is chosen such that E (B) = ∇E(x 0 ), E (C) = ∇ 2 E(x 0 ), and E (D) is the strictly upper triangular part of the Hessian.Note that for this model 2n 2 + n + 1 evaluations are used to obtain n 2 /2 + 3n/2 + 1 parameters.In the presence of statistical noise from these evaluations, it turns out that building the model to a desired precision and inferring modelled gradients close to the reference point x 0 has resource requirements similar to measuring the gradient directly [49].
This model coincides with E(x) at x 0 up to second order, and in the vicinity its error scales with the third order of the largest parameter deviation [49].After the construction phase, the model cost is minimized in an inner optimization loop, which only requires classical operations.For an implementation and demonstration of the optimization, we also refer the reader to [77] and [78].
In the light of the parameter-shift rules and reconstruction methods, we propose three (alternative) modifications of QAD.The first change is to reduce the required number of evaluations.As the coeffi- cients E (A/B/C/D) consist of the gradient and Hessian, they allow us to exploit the reduced resource requirements presented in Tab. 1 14 .In the case originally considered by the authors, i.e., for Pauli rotations only, this reduces the number of evaluations from 2n 2 + n + 1 to (3n 2 + n)/2 + 1.
A second, alternative modification of QAD is to keep all evaluations as originally proposed to obtain the full second-order terms, i.e., we may combine the shift angles for each pair of parameters, and use them for coefficients of additional higher-order terms.This extended model (see App. D.1) has the form , where E (F ) is symmetric with zeros on its diagonal and E (G) is a strictly upper triangular matrix.This extended model has 2n 2 +1 degrees of freedom, which matches the number of evaluations exactly.While the QAD model reconstructs the univariate restrictions of E to the coordinate axes correctly, the extended model E does so for the bivariate restrictions to the plane spanned by any pair of coordinate axes.It remains to investigate whether and for which applications the extension yields a better optimization behaviour; for functions in which pairs of parameters yield a good local approximation of the landscape, it might provide an improvement.
The third modification we consider is to generalize the previous, extended QAD model to general singleparameter quantum gates.This can be done via a full trigonometric interpolation to second order, which is detailed in App.D.2, exactly reconstructing the energy function when restricted to any coordinate plane at the price of 2( R 2  1 − R 2 2 + R 1 )+1 evaluations.Using toy model circuits and Hamiltonians, we demonstrate the qualitative difference between the QAD model, its extension E, and the generalization to multiple frequencies in Fig. 4.

Discussion
In this work, we derive interpolation rules to exactly express quantum functions E(x) as a linear combination of evaluations E(x µ ), assuming E(x) derives from parametrized gates of the form U (x) = exp(ixG).Our method relies on the observation that E(x) can be expressed as trigonometric polynomial in x, characterized by a set of R frequencies that correspond to distinct differences in the eigenvalues of G.This observation allows us to derive our results using trigonometric interpolation methods.(53), that includes full second-order terms (center left), and the second-order trigonometric interpolation model (center right), as well as the original expectation value E (right).The original function is generated from toy Hamiltonians in a two-parameter example circuit, with one frequency (top) and two frequencies (bottom) per parameter.The QAD model produces a local approximation to E that deviates away from x0 at a slow rate for R = 1 but faster for R = 2.The extension E reuses evaluations made for the Hessian to capture the full bivariate dependence for a single frequency but is not apt to model multiple frequencies either.Finally, the trigonometric interpolation generalizes E. This means it coincides with E for R = 1, but also reproduces the full bivariate function for R > 1.
In addition to a full reconstruction of E(x), the presented approach offers parameter-shift rules for derivatives of arbitrary order and recipes to evaluate multivariate derivatives more cheaply.Using the concept of the stochastic parameter-shift rule, quantum gates of the form U F (x) = exp(i(xG + F )) can be differentiated as well.
Nevertheless, much remains unknown about the practicality of our new parameter-shift rules.For the common case that we have R equidistant frequencies, Sec.3.5 shows that the scaling of the required resources is similar between naïvely applying our generalized parameter-shift rules, and applying parametershift rules to a decomposition of U (x).This holds for the first derivative and also for the required shot budget when computing the second derivative, whereas the number of unique circuits is significantly smaller for the new, generalized shift rule.
Our observations lead to several open questions: • In which situations can we obtain better bounds on the number of frequencies?We investigated an example for QAOA in Sec.5.1, but are there other examples?
• For general G (e.g., G = j c j P j with real c j and Pauli words P j ), the frequencies will not be equidistant, and in fact R may scale quadratically in the size of U .Naïvely applied, our method would then scale poorly compared to decomposing G. Can we apply an approximate or stochastic parameter-shift rule with a better scaling?
• Would it ever make sense to truncate these parameter-shift rules to keep only terms corresponding to smaller frequencies?This is inspired by the idea of using low-pass filters to smooth out rapid changes of a signal.
• Our work on function reconstruction extends QAD to all gates with equidistant frequencies.Similarly, it allows Rotosolve, which has been shown to work remarkably well on some applications, to be used on all quantum gates with arbitrary frequencies.Is there a classification of problems on which these model-based algorithms work well?And can we reduce the optimization cost based on the above ideas?
• More generally, can we apply the machinery of Fourier analysis more broadly, e.g., to improve optimization methods in the presence of noise?
We hope that this work serves as an impetus for future work that will further apply signal processing methods to the burgeoning field of variational quantum computing.
Similarly to D * , this kernel satisfies Dµ (x µ ) = δ µµ but it's a linear combination of the odd basis functions sin( x), ∈ [R].Following from these two properties, we know that and we thus can reconstruct E odd with the R evaluations E odd (x µ ).
We also can extract from here a closed-form formula for the derivative at x = 0, as it only depends on the odd part of E. We arrive at the general parametershift rule: Similarly, as the higher-order derivatives of Dµ can be computed analytically, we may obtain derivatives of E of higher odd orders.

A.1.3 Even kernels
Next we reconstruct the even part E even again using the kernel D * (x) from above but choosing the R + 1 points x µ = µπ/R for µ ∈ [R] 0 .As the spacing between these points is the same as between the previous {x µ }, we again have D * (x µ −x µ ) = δ µµ ; but note we cannot directly use D * (x − x µ ) as our kernel because D * (x − x µ ) is an even function in x − x µ but not in x.Instead we take the even linear combination Then the Dµ are even functions and satisfy Dµ (x µ ) = δ µµ , leading to The second derivative of D * is and if we take the limit x → 0: This yields the explicit parameter-shift rule for the second derivative: Again, derivatives of E of higher even order can be computed in a similar manner, using the same evaluations E even µπ R .

A.2 Hessian parameter-shift rule
Here we consider the spectrum of the function Without loss of generality, we assume U k to act first within the circuit and set x 0 = 0.As for the univariate case in Sec.2.1, we may explicitly write the cost function as x , where ω (k,m) are the eigenvalues of the generators of U k and U m , respectively, and we denoted the entries of matrices by lowercase letters as before.We may read off the occuring frequencies in this Fourier series in terms of the unique positive differences Ω (k,m) , leading to l2 .We again only collect the positive values as they come in pairs 16 .
In case of integer-valued frequencies, there are For arbitrary frequencies, all {δΩ} might be unique and we obtain up to Rescaling the smallest frequency enforces a small degree of redundancy so that R km = 2R k R m + R k + R m − 2 is always achievable; for some scenarios specific rescaling factors might drastically reduce R km 17 .

A.3 Hadamard tests for the metric tensor
In order to compute the metric tensor as the Hessian of the overlap f (x) = − 1 2 | ψ(x)|ψ(x 0 ) | 2 , we need to evaluate it at shifted positions x = x 0 + xv k,m .This can be done by executing the circuit V (x 0 ) and the adjoint circuit V † (x) at the shifted position, and returning the probability to measure the 0 bitstring in the computational basis.As all operations after the latter of the two parametrized gates of interest cancel between the two circuits, those operations can be spared, but the maximal depth is (almost) the doubled depth of V .
Alternatively, we may use a Hadamard test as derived in the appendix of Ref. [57].
There, it was designed to realize the derivative overlaps Re{ ∂ k ψ(x)|∂ m ψ(x) } for the metric tensor directly, assuming the generator to be a Pauli word and therefore unitary.However, it can also be used to calculate the real or imaginary part of by measuring the auxiliary qubit in the Z or Y basis.The corresponding circuit is shown in Fig. 5.
While the original proposal has to split up the generators into Pauli words and implement one circuit per combination of Pauli words from x k and x m , the number of circuits here is dictated by the number of evaluations in the parameter-shift rule.In order to measure f (x), the real and the imaginary part both have to be measured, doubling the number of circuits.

A.4 Coefficient norms for univariate derivatives via equidistant shifts
The 1 -norm of the coefficients in parameter-shift rules dictates the number of shots required to reach certain precision (see Sec. 2.3).Here, we explicitly compute this norm for both the general and decomposition-based parameter-shift rule for the firstand second-order univariate derivative.For the entire analysis, we approximate the single-shot variance σ 2 to be constant as detailed in the main text.

A.4.1 Norm for general parameter-shift rule
For the case of equidistant shift angles, we can compute the norm of the coefficient vector y (1,2) in the parameter-shift rules in Eqs.(24,25) explicitly, in order to estimate the required shot budget for the obtained derivative.For the first order, we note that the evaluations of E come in pairs, with the same coefficient up to a relative sign.This yields (recalling that x µ = 2µ−1 4R π): which follows from sin −2 (x µ ) = cot 2 (x µ ) + 1 and [80, Formula (445)]: A derivation for Eq. ( 73) can be adapted from Ref.
[81], which we present below for completeness: Here we have applied the binomial theorem, extracted the real part, and divided by (i sin(x µ )) 2R (note that 0 < x µ < π/2).From the last equation above, we see that As g is a polynomial of degree R, we thus know all its roots and may use the simplest of Vieta's formulas: R µ=1 with roots {τ µ } µ of g, and g j the jth order Taylor coefficient of g.Plugging in the known roots and coefficients we get For the second order we may repeat the above computation with small modifications 18 , arriving at

A.4.2 Norm for decomposition
If we compute the first-and second-order derivatives via a decomposition that contains P parametrized elementary gates, we need to apply the original two-term parameter-shift rule to each of these gates separately.
For the first-order derivative, we simply sum all elementary derivatives.For integer-valued frequencies, x typically feeds without prefactor into the gates in the decomposition, so that the decomposition-based shift rule reads (78) where E (k) denotes the cost function based on the decomposition, in which only the parameter of the kth elementary gate is set to the shifted angle x 1 and to 0 in all other gates.To maximize sin(x 1 ), we choose x 1 = π/2, and as a reuslt all 2P coefficients have magnitude 1/2, and therefore y (1) decomp 1 = P. ( Due to all coefficients being equal, the optimal shot allocation is N/(2P) for all terms.
For the second-order derivative, the full Hessian has to be computed from the decomposition as described in Ref. [46] and all elements have to be summed 19 : where E (km) (x 1 , x 2 ) is defined analogously to E (k) but the shift angles put into the kth and mth elementary gate may differ.Fixing the shift angle to π/2 again, we have 2P(P − 1) coefficients of magnitude 18 Recall that the angles differ between the two derivatives. 19Here we do not anticipate the cheaper Hessian evaluation from Sec. 4.1.

A.5 Coefficient norms for the Hessian
Similar to the previous section, we compute the coefficient norms for three methods to compute the Hessian for equidistant frequencies and shifts: We may use the diagonal shift rule in Eq. (36), repeat the general parameter-shift rule, or decompose the circuit and repeat the original parameter-shift rule.For the first approach, the diagonal entries of the Hessianand thus the shifted evaluations for those entries-are reused to compute the off-diagonal ones, whereas the shifted evaluations for the repeated shift rule are distinct for all Hessian entries.This difference makes the cost comparison for a single Hessian entry difficult.We therefore consider the root mean square of the Frobenius norm of the difference between the true and the estimated Hessian as quality measure.The matrix of expected deviations is given by the standard deviations σ km so that we need to compute

A.5.1 Hessian shift rule
The variance for a Hessian diagonal entry H kk is σ 2 R 4 k /N kk if we use N kk shots to estimate it (see Eq. (29)) 20 .For an off-diagonal element H km computed via the diagonal shift rule in Eq. (36), the variance is where we used that R km = R k + R m for equidistant frequencies.Overall, this yields If we allocate N diag shots optimally, that is N km is proportional to the square root of the coefficient of N −1 km , we require shots to estimate H to a precision of ε.

A.5.2 Repeated general parameter-shift rule
Without the diagonal shift rule, we compute H km by executing the univariate general parameter-shift rule in Eq. (24) for x k and x m successively, i.e., we apply the rule for x m to all terms from the rule for x k .This leads to 4R k R m terms with their coefficients arising from the first-order shift rule coefficients by multiplying them together: where we used Eq.(72).Correspondingly, the variance for H km computed by this methods with an optimal shot allocation of The mean square of the Frobenius norm then is and an optimal shot allocation across the entries of the Hessian to achieve a precision of ε will require shots in total.

A.5.3 Decomposition and repeated original shift rule
For the third approach, we only require the observation that again all (unique) Hessian entries are estimated independently and that the coefficients arise from all products of two coefficients from the separate shift rules for x k and x m .This yields 4P k P m coefficients with magnitude 1/4, so that the calculation of ε is the same as for the previous approach, replacing R by P. The required shot budget for a precision of ε is thus

B Generalization to arbitrary spectra
Throughout this work, we mostly focused on cost functions E with equidistant -and thus, by rescaling, integer-valued -frequencies {Ω }.Here we will discuss the generalization to arbitrary frequencies, mostly considering the changed cost.

B.1 Univariate functions
The nonuniform DFT used to reconstruct the full function E in Sec.3.1, and its modifications for the odd and even part in Secs.3.2 and 3.3, can be used straightforwardly for arbitrary frequencies.However, choosing equidistant shift angles {x µ } will no longer make the DFT uniform, as was the case for equidistant frequencies.Correspondingly, the explicit parameter-shift rules for E (0) and E (0) in Eqs.(24,25) do not apply and in general we do not know a closed-form expression for the DFT or the parametershift rules.Symbolically, the parameter-shift rule takes the form Regarding the evaluation cost, the odd part and thus odd-order derivatives can be obtained at the same price of 2R evaluations of E as before, but the even part might no longer be periodic in general; as a consequence, actually may require two evaluations of E, leading to 2R + 1 evaluations overall.If the even part is periodic, which is equivalent to all involved frequencies being commensurable, with some period T , evaluating E even (T /2) allows to skip the additional evaluation.When comparing to the first derivative based on a decomposition into P parametrized elementary gates, the break-even point for the number of unique circuits remains at R = P as for equidistant frequencies, but we note that e.g., a decomposition of the form namely where x is rescaled individually in each elementary gate by some β k ∈ R, in general will result in R = P 2 frequencies of E, making the decompositionbased parameter-shift rule beneficial.For the secondorder derivative, the number of evaluations 2R + 1 might be quadratic in P in the same way, but the decomposition requires 2P 2 − P + 1 as well, so that the requirements are similar if R = P.
Regarding the required number of shots, we cannot make concrete statements for the general case as we don't have a closed-form expression for the coefficients y, but note that for the decomposition approach, rescaling factors like the {β k } in Eq. (93) above have to be factored in via the chain rule, leading to a modified shot requirement.
An example for unitaries with non-equidistant frequencies would be the QAOA layer that implements the time evolution under the problem Hamiltonian (see Eq. ( 26)) for MaxCut on weighted graphs with non-integer weights.
For the stochastic parameter-shift rule in Sec.3.6 we did not restrict ourselves to equidistant frequencies and derive it in App.C for general unitaries of the form U F = exp(i(xG + F )) directly.

B.2 Multivariate functions
While the univariate functions do not differ strongly for equidistant and arbitrary frequencies in E and mostly the expected relation between R and P changes, the shift rule for the Hessian and the metric tensor are affected heavily by generalizing the spectrum.First, the univariate restriction E (km) (x) in Eq. (34) still can be used to compute the off-diagonal entry H km of the Hessian but this may require up to in the equidistant case.Compared to the resource requirements of the decomposition-based approach, 4P k P m , this makes our general parameter-shift rule more expensive if R k P k .
As we use the same method to obtain the metric tensor F, the number of evaluations grows in the same manner, making the decomposition-based shift rule more feasible for unitaries with non-equidistant frequencies.As f (x 0 ) does not have to be evaluated, an off-diagonal element F km requires one evaluation fewer than H km , namely

C General stochastic shift rule
In this section we describe a stochastic variant of the general parameter-shift rule which follows immediately from combining the rule for single-parameter gates in Eq. (90) with the result from Ref. [39].
First, note that any shift rule with coefficients {y µ } and shift angles {x µ } for a unitary U (x) = exp(ixG), implies that we can implement the commutator with G: since the commutator between G and the Hamiltonian directly expresses the derivative of the expectation value E (0) on the operator level, and shift rules hold for arbitrary states.Now consider the extension U F (x) = exp(i(xG + F )) of the above unitary.In the original stochastic parameter-shift rule, the authors show21 where we again denoted the state prepared by the circuit before U F by |ψ and the observable transformed by the circuit following U F by B. By using Eq.(95) to express the commutator, we obtain We abbreviate the interleaved unitaries U F,µ (x 0 , t) := U F (tx 0 )U (x µ )U F (1 − t)x 0 (98) and denote the cost function that uses U F,µ (x 0 , t) instead of U F (x 0 ) as E µ (x 0 , t) := tr B U † F,µ (x 0 , t) |ψ ψ| U F,µ (x 0 , t) .
Rewriting Eq. (97) then yields the generalized stochastic parameter-shift rule It can be implemented by sampling values for the splitting time t, combining the shifted energies E µ (x 0 , t) for each sampled t with the coefficients y µ , and averaging over the results.

D Details on QAD
In this section we provide details on the latter two of the three modifications of the QAD algorithm discussed in Sec.5.3.

D.1 Extended QAD model for Pauli rotations
The QAD model introduced in Ref. [49] contains trigonometric functions up to second (leading) order.The free parameters of the model cannot be extracted with one function evaluation per degree of freedom, because unlike standard monomials in a Taylor expansion, the trigonometric basis functions mix the orders in the input parameters.This leads to the mismatch of 2n 2 + n + 1 (original QAD) or 3n 2 /2 + n/2 + 1 (see above) evaluations to obtain n 2 /2+3n/2+1 model parameters.We note that the QAD model contains full univariate reconstructions at optimal cost, extracting the 2n + 1 model parameters E (A) , E (B) and E (C)  from 2n + 1 function evaluations.The doubly shifted evaluations, however, are used for the Hessian entry only: where E ±± km = E(x 0 ± π 2 v k ± π 2 v m ) and we recall that this QAD model is restricted to Pauli rotations only.
Let us now consider a slightly larger truncation of the cost function than the one presented in App.A 2 in [49]: with A(x) = k cos 2 (x k /2).E (F ) and E (G) have zeros on their diagonals because there are no terms of the form sin 3 (x k /2) or sin 4 (x k /2) in the cost function, and for E (G) we only require the strictly upper triangular entries due to symmetry.The higher-order terms contain at least three distinct variables x k , x l and x m because all bivariate terms are captured in the above truncation.Using and tan ± π 4 = ±1, we now can compute: This means that the 4 function evaluations E ±± km that are used for E The corresponding model is of the form Eq. (101) and therefore includes all terms that depend on two parameters only.Consequentially, the constructed model exactly reproduces the cost function not only on the coordinate axes but also on all coordinate planes spanned by any two of the axes.The number of model parameters is 2n 2 + 1, which matches the total number of function evaluations.

D.2 Trigonometric interpolation for QAD
Both the original QAD algorithm, and the extension introduced above, assume the parametrized quantum circuit to consist of Pauli rotation gates exclusively.In the spirit of the generalized function reconstruction and parameter-shift rule, we would like to relax this assumption and generalize the QAD model.However, there is no obvious unique way to do this, because the correspondence between the gradient and E (B) and between the Hessian and E (C,D) is not preserved for multiple frequencies.Instead, the uni-and bivariate Fourier coefficients of E form the model parameters and the derivative quantities are contractions with the frequencies thereof.There are multiple ways in which we could generalize QAD to multiple frequencies.
The first way to generalize QAD is to compute the gradient and Hessian with the generalized parametershift rule Eq. ( 24) and the shift rule for Hessian entries Eq. (36) and to construct a single-frequency model as in original QAD.Even though we know the original energy function to contain multiple frequencies, this would yield a local model with the correct secondorder expansion at x 0 that exploits the evaluations savings shown in this work.As QAD is supposed to use the model only in the neighbourhood of x 0 , this might be sufficient for the optimization.
As a second generalization we propose a full trigonometric interpolation of E up to second order, similar to the univariate reconstruction in Sec.3.1.First we consider the univariate part of the model: Start by evaluating E at positions shifted in the kth coordinate by equidistant points and subtract E(x 0 ), E (k)  µ := E(x 0 + x µ v k ) − E(x 0 ) (102) Then consider the (shifted) Dirichlet kernels coincides with E(x 0 + xv k ) − E(x 0 ) at 2R k + 1 points and is a trigonometric polynomial with the same R k frequencies. 22One might be wondering why to subtract E(x 0 ) just to add it manually back into the reconstruction now.This is because we need to avoid duplicating this term when adding up the univariate and bivariate terms of all parameters later on.
Similarly, the product kernels D As we constructed the terms such that they do not contain the respective lower order terms, we finally can combine them to the full trigonometric interpolation: This model has as many parameters as function evaluations, namely 2( R 2 1 − R 2 2 + R 1 )+1, and therefore, the trigonometric interpolation is the generalization of the extended QAD model in App.D.1.Indeed, for R k = 1 for all k we get back 2(n 2 − n + n) + 1 = 2n 2 + 1 evaluations and model parameters.
We note that the trigonometric interpolation can be implemented for non-equidistant evaluation points in a similar manner and with the same number of evaluations, although the elementary functions are no longer Dirichlet kernels but take the form . (111)

Figure 2 :
Figure 2: Visual representation of two approaches to compute a Hessian entry H km at the position x0 (red cross).The parameters x k and xm lie on the coordinate axes and the heatmap displays the cost function E(x).We may either combine the general shift rule for x k and xm (grey triangles) or compute the univariate derivative E (km) (0) and extract H km via Eq.(36) (green circles).

Figure 3 :
Figure3: Evaluation numbers N eval for the gradient (left) or both the gradient and the Hessian (right) for n = 6 parameter QAOA circuits for MaxCut on graphs of several types and sizes.Using numerical upper bounds together with our new parameter-shift rule (GW -purple triangles and its generalization -dashed turquoise) reduces the resource requirements for both quantities significantly, compared to the previously available decomposition-based method (solid orange).The rows correspond to the various considered graph types (top to bottom): complete, 5-regular, 6-regular and (up to) 4N randomly sampled edges.The requirements for the decomposition-based approach and the analytic upper bound (dotted blue) correspond to the results in the left and right column of Tab. 3, respectively.The numerical upper bounds both use the minimized objective value of SDPs for relaxations of MaxCut to obtain the bound ϕ, which depends on the graph instance.The GW-based lower bound (pink triangles) is obtained by randomly mapping the output state of the GW algorithm to 10 valid cuts and choosing the one with the largest cut value.Note that K-regular graphs are only defined for N > K and N K mod 2 = 0 and that graphs with κN sampled edges are complete for N ≤ 2κ + 1, leading to a change in the qualitative behaviour in the last row at N = 2κ + 2 = 10.

Figure 4 :
Figure4: The QAD model (left), its extension E, see Eq.(53), that includes full second-order terms (center left), and the second-order trigonometric interpolation model (center right), as well as the original expectation value E (right).The original function is generated from toy Hamiltonians in a two-parameter example circuit, with one frequency (top) and two frequencies (bottom) per parameter.The QAD model produces a local approximation to E that deviates away from x0 at a slow rate for R = 1 but faster for R = 2.The extension E reuses evaluations made for the Hessian to capture the full bivariate dependence for a single frequency but is not apt to model multiple frequencies either.Finally, the trigonometric interpolation generalizes E. This means it coincides with E for R = 1, but also reproduces the full bivariate function for R > 1.

Figure 5 :
Figure 5: Circuits for the Hadamard tests to measure the overlap in Eq. (71), adapted from [57, Fig. 5].The basis rotation in the last operation on the auxiliary qubit determines whether the real (top) or the imaginary (bottom) part of ψ(x0 + xv k,m )|ψ(x0) is calculated.All unitaries without argument are understood as Uj = Uj((x0)j).
km in the original QAD can be recycled to obtain the 3 parameters E

Table 1 :
(36)er of distinct circuit evaluations N eval for measuring combinations of derivatives of a parametrized expectation value function E at parameter position x0.The compared approaches include decomposition of the unitaries together with the original parameter-shift rule (left), and the generalized parameter-shift rule Eq. (24) together with the diagonal shift rule for the Hessian in Eq.(36).The requirements for the latter differ significantly for equidistant (center ) and arbitrary frequencies (right, see App.B.2).A third approach is to repeat the general parameter-shift rule, the cost of which can be read off by replacing P by R in the left column.Here, n is the number of parameters in the circuit, P k is the number of elementary gates with two eigenvalues in the decomposition of the kth parametrized unitary, and R k denotes the number of frequencies for the kth parameter.The asterisk ( * ) indicates that the derivatives ∂ 2 k E and ∂ 2 m E need to be known in order to obtain the mixed derivative at the shown price (see main text).The evaluation numbers take savings into account that are based on using evaluated energies for multiple derivative quantities; hence, they are not additive in general.

Table 2 :
Quantum hardware-ready methods to compute the Fubini-Study metric tensor and their resource requirements.The cost function f (x) (see Eq. (45)) for the parameter-shift rule can be implemented with increased depth by applying the adjoint of the original circuit to directly realize the overlap (left) or with an auxiliary qubit and Hadamard tests (center left, App.A.3).