Measuring Analytic Gradients of General Quantum Evolution with the Stochastic Parameter Shift Rule

Hybrid quantum-classical optimization algorithms represent one of the most promising application for near-term quantum computers. In these algorithms the goal is to optimize an observable quantity with respect to some classical parameters, using feedback from measurements performed on the quantum device. Here we study the problem of estimating the gradient of the function to be optimized directly from quantum measurements, generalizing and simplifying some approaches present in the literature, such as the so-called parameter-shift rule. We derive a mathematically exact formula that provides a stochastic algorithm for estimating the gradient of any multi-qubit parametric quantum evolution, without the introduction of ancillary qubits or the use of Hamiltonian simulation techniques. Our algorithm continues to work, although with some approximations, even when all the available quantum gates are noisy, for instance due to the coupling between the quantum device and an unknown environment.


I. INTRODUCTION
In the near-term [1] quantum computers will be too noisy and the number of operations, or depth of the circuit, will still be too low to reliably implement conventional quantum algorithms that require full quantum error correction [2]. Therefore, alternative algorithms, better suited for exploiting these devices have been proposed, such as the variational quantum eigensolver [3,4], the quantum approximate optimization algorithm [5], quantum autoencoders [6], quantum simulation [7], and quantum classifiers for machine learning [8][9][10]. Because of these applications, several companies involved in the development of quantum computers have released software for the manipulation of parametric quantum states [11][12][13][14][15].
Hybrid quantum-classical optimization algorithms, such as the ones mentioned above, try to overcome the limitations of current quantum computers by pairing them with a classical device. In these hybrid strategies, the "hard" part of the algorithm, which typically involves the manipulation of objects living in a high-dimensional Hilbert space, is done by a quantum computer, which is reset after each measurement. The classical routine then iteratively reprograms the quantum computer in such a way that either the output of quantum measurements or the prepared quantum state have the desired property. These iterative schemes allow the use of shorterdepth circuits that can be implemented within the decoherence time of the device. Typically, the manipulation of the quantum state is performed with parametric quantum gates and the role of the classical routine is to update those parameters either via gradient descent or gradient ascent. Evaluating the gradient of a quantum circuit is as hard as the evaluation of the circuit itself, and therefore it is important to use the quantum computer for estimating it. Several algorithms have been proposed for such purpose, either based on a generalization of the Hadamard test [16,17] or on the so-called parameter shift rule [18][19][20], which have a similar complexity. Nonetheless, both algorithms can only be applied when the parametric gates can be written as e iθ tXt , for parameters θ t , and where the opera-torsX t have certain special properties. In the general case one has to resort to Hamiltonian simulation techniques [21] that increase the complexity of the algorithm.
Here we show that the parameter-shift rule can be generalized to any multi-qubit quantum evolution, without the need to introduce any ancillary system or Hamiltonian simulation techniques. Our generalization is based on a stochastic strategy that is exact in the limit of many repetitions of the quantum measurement. We analyse the number of repetitions needed to achieve a certain precision by studying the variance of our estimation procedure, and numerically observe that it is comparable to that of the standard parameter shift rule. In near-term computers, unitary gates are an approximation to a more complex, noisy evolution that couples the qubit registers to an unknown environment. We show that our estimation procedure can be applied even when the coupling between system and environment cannot be completely suppressed, and when the gates depend on the parameters in a complex way.
Our paper is organized as follows: in Sec. II we set up the problem and the notation; in Sec. III we discuss the main ideas and introduce analytical formulae and algorithms for estimating the gradient in the general case; in Sec. IV we study applications in quantum control and for optimizing noisy gates; Conclusions are drawn in Sec. V. Explicit pseudo-codes for our algorithms are given in Appendix A. The stochastic variance of our algorithms is studied in Appendix B.

arXiv:2005.10299v1 [quant-ph] 20 May 2020
We study the optimization (either maximization or minimization) of the expected value of an observableĈ, taken with respect to |ψ(θ) Several problems can be mapped to the above optimization, such as variational diagonalization and quantum simulation [3,5,22], whereĈ is the Hamiltonian of a many-body system and the task is to variationally approximate its ground state; and quantum state synthesis, whereĈ = |ψ target ψ target |, or some machine-learning classifiers [8,16]. Even quantum control problems [23] or the simulation of gates with time-independent Hamiltonians [24,25] can be written in the form (2). Indeed, consider the task of finding a good approximation of a certain target unitary gateĜ with a parametric unitaryÛ(θ). We may define |ψ(θ) =1 1 ⊗Û(θ) |Φ , where |Φ = d i=1 |i, i / √ d and d is the dimension of the Hilbert space, and similarly |ψ target =1 1 ⊗Ĝ |Φ . Then, from (2) witĥ C = |ψ target ψ target |, we find which is the function normally maximized in quantum control problems [23,24]. Any unitary operator can be expressed as a matrix exponen-tialÛ(θ) = e iX(θ) , whereX(θ) is a Hermitian operator. When the unitaryÛ(θ) is a composition of T simpler gatesÛ t (θ), then we writê where the products are ordered as T t=1Û t :=Û T · · ·Û 1 . The products of Pauli matricesσ ν =σ ν 1 ⊗ · · · ⊗σ ν N form a basis for the space of N-qubit Hermitian operators [26], where ν = (ν 1 , . . . , ν N ) is a multi index, ν j is either {0, x, y, z} andσ 0 :=1 1, σ x ,σ y ,σ z are the Pauli matrices. As such, we may expand the operatorsX t (θ) onto this basis and writê with coefficients x t,ν (θ) = Tr X t (θ)σ ν /2 N . It is common to restrict attention to gates that only have a single element in the expansion (5), i.e. x t,µ (θ) = θ t δ µ,ν(t) and where ν(t) specifies the kind of parametric gate applied at time t. Moreover, most often we consider gates that act on either one-or two-qubit, so at most two Pauli matrices in the product σ ν 1 ⊗ · · · ⊗σ ν N are different from the identity. On the other hand, in this paper we do not restrict ourselves to the case (6) and consider the more general parametrization (4) with (5). Via the Leibniz rule, we may write When the parametrization is such that all gates can be expressed as in Eq. (6), different approaches have been proposed to evaluate the gradient via a carefully designed quantum circuit and classical post-processing, for instance using the Hadamard test [17] or the parameter shift rule [18,19]. In [20] the parameter shift rule was generalized to some particular cases where there are more than one term in the expansion (5). However, finding the gradient in the general case was still an open question. In the next section we show that by mixing standard operator derivative techniques [27] with Monte Carlo strategies, we can define a procedure to measure gradients of any C(θ), as in Eq. (2), with near-term quantum hardware. Thanks to the Leibniz rule (7), we may fix the value of t and ν and study the derivative of C with respect to x t,ν . By repeating the analysis for each possible values of t and ν, from Eq. (7) we may obtain the derivatives with respect to the parameters θ p , and hence the gradient. Therefore, we fix t and ν and, to simplify the notation, we drop the dependence on t and ν to write With a similar spirit, we also define whereÛ t+ = T s=t+1Û s . Thanks to the above simplified notation, we may write the function C in (2) as a function of x ≡ x t,ν for fixed t and ν all the other terms in (10) do not explicitly depend on x ≡ x t,ν . In other words, Eq. (10) is equivalent to Eq. (2), where we have separated the terms that depend on x ≡ x t,ν for fixed t and ν from the others.

III. STOCHASTIC PARAMETER SHIFT RULE
Without loss of generality, we fix t and ν as described in the previous section, and study the derivative of C(x) defined in (10). The derivative with respect to the parameters θ p can be obtained from (7) by repeating the analysis for all t and ν. We remark that in Eq. (10) the state |φ and the operatorsĤ,Â, V explicitly depend on t, ν, and on the other values x t ,ν with either t t or ν ν, but we omit this dependence to simplify the notation. Full algorithms are shown in Appendix A.
The main tool behind our analysis is the following operator identity [27] which is valid for any operator Z. We may rewrite Eq. (10) as Algorithm 1 Parameter Shift Rule 1: initialize the computer in the state |φ , following the preparation routine (9); 2: apply the gate e i(x+ π 4u )V ; 3: measure the observableÂ from (9) and call the result r + . 4: Repeat steps 1 to 3, but on point 2 apply e i(x− π 4u )V rather than e i(x+ π 4u )V ; 5: measureÂ and call the result r − . 6: the sample g t, [18,19,29], only applicable to parametric gates as in Eq. (6) or, more generally, to parametrizations e ixV whereV has two distinct eigenvalues ±u. WhenV is a product of Pauli matrices as in (6), u = 1. In the algorithm we consider the derivative ∂ x C(x) of Eq. (10) [28]. We also introduce the superoperator Now we focus on the exponential e λV with V defined in (13). From series expansion, sinceV is a tensor product of Pauli matrices (8) and, as such,V 2 =1 1, it is simple to show that from which we get WhenĤ ≡ 0, it is Z = xV and we may use the above equation with λ = x to take derivatives in (12). As a result, we get which is the so-called parameter shift rule, described in Fig. 1, often used for training quantum circuits [18,19,29]. Note that, with the formalism of the previous section,Ĥ = 0 corresponds to the use of the simper parametric unitaries of Eq. (6). A more general version of the parameter shift rule can be obtained when the operatorV has only two distinct eigenvalues [18,19]. Indeed, we note that the only property we used in (15) isV 2 =1 1, which is true for any product of Pauli matrices. IfV has only two possible eigenvalues c ± u, then we may writeV = uV + c1 1 wherê V 2 =1 1 and the dependence on c disappears in (13). Therefore, it is straightforward to generalize the above derivation and find ∂ x C(x) = u C t + π 4u − C t − π 4u . The resulting algorithm is described in Fig. 1. Although the parameter shift rule can be made slightly more general, for instance by replacing the operatorσ ν(t) in (6) with another operator that has, Algorithm 2 Stochastic Parameter Shift Rule 1: Sample s from the uniform distribution in [0,1]; 2: initialize the computer in the state |φ , following the preparation routine (9); 3: apply the gate e i(1−s)(Ĥ+xV) , namely where parameters x t,µ for fixed t and all possible values of µ have been rescaled by a factor (1 − s); 4: apply the gate e iπV/4 ≡ e iπσν/4 ; 5: apply the gate e is(Ĥ+xV) , where parameters x t,ν for fixed t and all possible values of ν have been rescaled by a factor s; 6: measure the observableÂ from (9) and call the result r + . 7: Repeat steps 2 to 5, but on point 4 apply e iπσν/4 rather than e −iπσν/4 ; 8: measureÂ and call the result r − . 9: the sample g t, From the above equation, calling we get from (11) and (16) Eqs. (17) Taking the expectation value with respect to the measurement outcomes and with respect to the uniform probability over s we get from Eq. (19) which concludes the proof. For a single measurement, both the Parameter Shift Rule of Fig. 1 and the Stochastic Parameter Shift Rule of Fig. 2 provide a random difference between two eigenvalues ofÂ. Only in the limit over many repetitions of those algorithms does the average over the outcomes converge to the exact value of the gradient. Due to the Chebyshev inequality, the number of repetitions to achieve a certain precision depends on the variance of the random outcomes. In Appendix B we study the variance of the gradient estimator obtained with the Stochastic Parameter Shift Rule and show that it is comparable with that of the standard Parameter Shift Rule.

A. Stochastic optimization
In the previous section we have introduced an algorithm (Fig. 2) to use a quantum computer to sample from a random variable whose average is equal to the gradient of a certain circuit. We say that the output of the Stochastic Parameter Shift Rule provides an unbiased estimator of the gradient, in the sense of Eq. (21).
We now focus on the original problem, namely a parametric unitary (4) with many parameters as in (5). We can use the algorithm of Fig. 2 to sample g t,ν with the property ∂C/∂x t,ν = E[g t,ν ]. By repeating the procedure many times and with all possible values of t and ν, due the linearity of the Leibniz rule (7), we may write where the expectation value E has the same meaning as in Eq. (21). The full algorithm is shown in Appendix A, algorithm 4. The problem with this approach is that we have to repeat algorithm 2 many times, each time resetting the quantum machine, to get a single sample. We now introduce a simpler unbiased estimator of the gradient that requires significantly fewer operations to get a single sample. A similar technique has been developed in [17,29] for parametrizations as in Eq. (6), which was dubbed doubly stochastic gradient descend. Here we generalize that approach to general quantum evolution, as in Eq. (4). We start by defining a probability distribution from the "weights" with N = t,ν |∂ p x t,ν (θ)|. Setting n p,t,ν = Nsign ∂ p x t,ν (θ) we may then write Eq. (22) as where E (t,ν)∼q means that, at each iteration, t, and ν are sampled from the distribution (23). When the functional dependence on the parameters is known, all quantities q p (t, ν) and n p,t,ν can be easily computed at each iteration without having to deal with exponentially large spaces. The above equation (24) allows us to define a simple "doubly stochastic" gradient estimator via the following rule 1: sample t and ν from the distribution (23); 2: use Algorithm 2 to get an estimate g t,ν ; 3: the sample r p,t,ν = g t,ν n p,t,ν is such that ∂C/∂θ p = E[r p,t,ν ]. The full algorithm is shown in Appendix A, algorithm 5. Based on the above equation, in Appendix A we also define an algorithm that can provide an unbiased sample with a single initialization of the quantum device, algorithm 6.
To summarize the results of this section, we can use either (22) or (24) to estimate the gradient of an expectation value (2) with a quantum computer. Once we have an estimate of the gradient, we can optimize C(θ) using stochastic gradient descent (or ascent) algorithms [30], such as Adam [31]. These algorithms are classical, in the sense that, given certain parameters θ and an estimate of the gradient g, the parameters are updated as θ → θ ± η g for a suitably small learning rate η. Therefore, we can use a hybrid quantum-classical approach to optimize C(θ) where the hard calculations, namely the estimation of the gradients, are delegated to a quantum computer, while the update of the parameters is performed classically.

B. Quantum gates with unavoidable drift
Depending on the hardware, the application of the gates e ±iπV/4 in Algorithm 2 might be problematic. Let us consider a quantum computer that can only apply the parametric gateŝ whereĤ 0 is some drift Hamiltonian that cannot be completely switched off, aside from the trivial case t = 0. Such "simple" device is still capable of universal quantum computation, provided that the operatorsĤ 0 andĤ 1 are multi-qubit operators that generate the full Lie algebra [32]. Here though, for simplicity, we consider the case where bothĤ 0 andĤ 1 are tensor products of Pauli operators, as introduced in Sec. II. The parameters in the above gate are θ = (t, b). Using the notation of Eq. (5) we may writê where x 0 = t and x 1 = bt. Employing the above gate in Eq. (2), from (7) we get An estimator of ∂C ∂x j for j = 0, 1 can be obtained with Algorithm 2, whereV is, respectively, eitherĤ 0 orĤ 1 .
Step 3 in the algorithm corresponds toÛ((1− s)t, b) and Step 5 corresponds toÛ(st, b), so both operations can be easily implemented directly in the device.
Step 4 corresponds to the gate U(π/4, 0) when estimating ∂C ∂x 0 , which is again easy to implement. However, Step  apply the gate e is(Ĥ+xV) ; 7: measure the observableÂ and call the result r m . 8: end for 9: An estimate g t,ν of ∂C/∂x t,ν is given by g t,ν = r + − r − . that does not belong to the set of gates (25) and, with our assumptions, cannot be implemented by the device. However, we may substitute that gate with an approximation The error coming from the drift term can be small O( ) if it is possible to set b to a high value O( −1 ). With the above gate, in Fig. 3 we define the approximate Stochastic Parameter Shift Rule. The approximate gate introduces a bias in the gradient estimator, but since such bias can be made small, convergence can still be expected [33]. As a relevant example, we study the cross-resonance gate [19,20] U CR (t, b, c) = exp it σ x ⊗1 1 − bσ z ⊗σ x + c1 1⊗σ x , (29) a natural gate for certain microwave-controlled transmon superconducting qubit architectures [34]. The results are shown in Fig. 4 for different values of t, c and b, where we show that Algorithms 2 and 3 are basically indistinguishable from each other, and very close to the approximated value obtained numerically, without any randomness, using a finite difference approximation. All numerical results are obtained by analytically computing the probabilities (20) and then simulating the quantum measurement via Monte Carlo sampling. The finite difference approximation is obtained as ∂ Note that, although this approximation works fine for numerical approximations using a classical computer, it is not useful for calculating gradients on quantum hardware. Indeed, if we use a quantum device for estimating C(x ± ε), then the estimator of ∂C x has a variance ≈ ε 2 which is very high when ε is small. external pulses andV j the associated operators, the evolution is described by the following time-dependent Hamiltonian where M is the number of pulses andĤ 0 is the drift Hamiltonian that describes the time-evolution of the system when no-pulses are applied. Here we consider M = 1 as the generalization is straightforward, and set λ 1 ≡ λ andV 1 ≡V. By discretizing the control time T into N T = T/∆T steps of width ∆t we getÛ with error ≈ N T ∆T 2 . Pulse design corresponds to the optimization of the parameters θ p := λ(p∆T ) to achieve a desired target evolution [23], for which we can apply the procedure of the section III. An alternative is to expand the pulse a) |ψ 0,1 in the Fourier basis λ(t) = m a m cos(ω k t + φ k ) for some frequencies ω k and tunable amplitudes a m and phases φ k [35]. Therefore, we may use (7) together with the procedure of Sec. III to estimate the gradient with respect to the parameters {θ p } = {a m , φ m }.

B. Parametric circuits with noisy quantum gates
One of the main strengths of our Algorithm 3, and its generalizations in Appendix A, is its ability to work, under reasonable assumptions, even when parametric gates are not perfectly implemented by the device. This is the case in currently available and near-term quantum computers [1].
As a relevant example, consider the quantum circuit of Fig. 5, built from simple parametric gates as in Eq. (6). When the quantum computer can apply the exact gates, then the standard parameter shift rule can be employed. However, quantum devices are always in contact with their surrounding environment, so an exact application of the gate is impossible (without full quantum error correction). More precisely, due to the action of the environment the gate is not unitary but, under some reasonable approximations, can be described by a completely positive map [36,37]. A completely positive map can always be written as a unitary evolution on the register and its environment. For simplicity let us consider a perfect gate as in (6) with fixed t. Physically, the perfect gate (6) means that a control HamiltonianĤ (R) t := −σ ν(t) is switched on for a time θ t , where the index (R) reminds us that the Hamiltonian acts on the registers R only. In realistic implementations the register is coupled with its own environment. If we call H (RE) t the coupling Hamiltonian between register (R) and environment (E), then we may write the non-unitary gate (see also Fig. 5c) as t .
In Eq. (32) there are three main approximations: i) we neglect any initial quantum correlation between register and environment, so that the non-unitary evolution can be modeled as a completely positive map [36], which in turn implies (32); ii) we assume that the (unknown) initial state of the environment does not depend on θ and τ; iii) we assume that it is possible to tune both τ and, to some extent, the relative strength θ. Under these three conditions, it is possible to use Algorithm 3 and its generalizations of Appendix A to compute the gradient with respect to θ. Indeed, without loss of generality, we may consider H (R) t as a product of Pauli matrices acting on the register R. When this is not the case we may employ the Leibniz rule (7). All operations in Algorithm 3 are possible, with the substitutionĤ 0 =Ĥ (RE) t andV =Ĥ (R) t . The rescaled gates correspond to reducing the control time τ by either a factor (1 − s) or s, while the application of the approximate gate (28) can be obtained by making θ large. Note that in a good quantum computer, the factor θ should always be large, as the coupling between register and environment should be small. Therefore, derivatives with respect to θ can be obtained using the same operations available in the device.
On the other hand, derivatives with respect to τ are, in general, not possible. We may always expand the coupling Hamiltonian in the Pauli basis via (5) and use the Leibniz rule (7), but in order to obtain the derivative with our Algorithm (3), we have to approximate a highly tuned gate of type e iπ/4σ (RE) , which couples the system and environment. We believe that for reasonable models of environment, this is not generally possible.
In summary, when the noisy evolution can be written as in Eq. (32), under the approximations defined above, derivatives with respect to θ can be obtained with the same operations available in the machine, while the further parameter τ should only be used to implement the rescaling and not as an optimization parameter.

C. Quantum Natural Gradient
The quantum natural gradient has been proposed in [38,39] as a way to better describe the geometry of parametric quantum states, enabling faster convergence towards local optima. With the quantum natural gradient the update rule becomes θ → θ ± ηF −1 g, where g is the gradient andF the metric tensor. The role of the metric tensor for noisy parametric quantum evolution has been studied first in [40], where it was shown that it provides a method to investigate the convergence time of standard stochastic gradient descend. When us-ing the simple parametric gates of (6), the elements of this tensor can be measured efficiently [22,38]. Moreover, recently the quantum natural gradient has been extended to arbitrary noisy quantum states [39]. In particular, for slightly mixed states it is where κ = 1 for pure states andρ is the state after the parametric unitaries that, for either noiseless or noisy gates, we can write asρ(θ) = E T (θ) • · · · • E 1 (θ)[ρ 0 ] withρ 0 = |ψ 0 ψ 0 |. We focus on F p,p as the parameter κ can be absorbed into the learning rate. The approximation in (33) is valid when the state has a high purity [39] , as it is expected in good NISQ computers. We may measure the matrix in Eq. (33) using a combination of the Stochastic Parameter Shift Rule and the SWAP test. The latter is based on the simple observation that, for anyX andŶ, it is Tr XŶ = Tr Ŝ (X ⊗Ŷ) , whereŜ is the swap operator [22]. Using the SWAP test and Eq. (7) we get Then, thanks to our analysis from section III, we may write ds Tr Ŝ ρ t,ν,s,α ⊗ρ t ,ν ,s ,α , whereρ t,ν,s,± is the state in which the gateÛ t has been substituted by the gate U ± (x t,ν , s) from Eq. (18), or its noisy implementation as in Sec. IV B. Therefore, an estimator of the matrix elements of the Fisher information matrix can be obtained by sampling two real numbers s and s from the uniform distribution, and then measuring the overlaps of all quantum states ρ t,ν,s,α and ρ t ,ν ,s ,α via the swap test. Note that for noiseless gates the overlaps in (35) can be simplified in some cases. For instance, when t = t all the gates in the product (4) with larger t disappears from the overlap. It was found in [38] that a good approximation to the natural gradient can be obtained by using only the diagonal elements of F. Motivated by this, we study what happens when we fix t and ν and call x ≡ x t,ν as in Sec. III. With the notation of Eqs. (9) and (10), using (11) we may write whereρ 0 = |φ φ|,V(s) = e is(Ĥ+xV)V e i(1−s)(Ĥ+xV) and we have defined SinceV is a product of Pauli matricesV(s) is a unitary operator, so both F 2 and F 1 can be measured by first sampling s and s from the uniform distribution, and then measuring the expectation value using the Hadamard test [41].

V. CONCLUSIONS
We have studied the optimization of a cost function defined by taking a quantum measurement on a parametric quantum state, obtained by applying on a fixed reference state a controlled evolution with tunable classical parameters. We have found explicit analytical formulae for the derivatives of the cost function with respect to those classical parameters. Our formulae can be applied to any multi-qubit evolution and generalize the so-called parameter shift rule [18,19] to the general case, without any restriction on the spectrum of the operator, and without the use of ancillary qubits or Hamiltonian simulation techniques [21].
Based on those exact formulae, we have devised both exact and approximate algorithms for estimating the derivatives of the cost via carefully designed quantum circuits. The exact algorithm works when exact applications of the gates are possible, whereas the approximate algorithm is designed to tackle spurious interactions in the system that cannot be completely removed. As such, our algorithm can also be applied, though with some approximations, when the gates implemented by the quantum device are noisy, as it is the case in near-term quantum devices [1].
The main application of our study is to optimize parametric quantum evolution for quantum optimization [3] and machine-learning problems [10]. FIG. 6. Empirical standard deviation of the gradient estimator of ∂ t C(t, b), with the same notation of Fig. 4(b), for c = 0 and different values of b. The Standard Parameter Shift Rule corresponds to Algorithm 1. For each point, the STD is estimated using 10 4 samples. with c = 0. When c = 0 the operator in the exponential has two possible eigenvalues u = ± √ 1 + b 2 , so for computing the derivative ∂ t C we can also apply the standard parameter shift rule, and compare the variance of the resulting estimator with that obtained from the Stochastic Parameter Shift Rule. Note that, unlike our Algorithm 2, the simpler parameter shift rule cannot be applied to estimate ∂ b C.
In Fig. 6 we compare the standard deviation of Algorithms 1, 2, 3. We note that, although Algorithms 2, 3 have extra sampling steps, the resulting variance is comparable with that of Algorithm 1.
We now study how the standard deviation might scale as a function of the number of qubits. In Fig. 7(a) we focus on the Stochastic Parameter Shift Rule, with the following choice of states and operators in (10) Fig. 7(a) we observe a slight non-monotonic increase, while in Fig. 7(b) the results are basically independent on N. We believe that this difference is mostly due to the particular choice of the models, Eqs. (B1) and (B2), that although related are not identical. Therefore, we may conclude that the stochastic parameter shift rule is basically as efficient as the standard parameter shift rule, but it is more general.