Quantum Natural Gradient

A quantum generalization of Natural Gradient Descent is presented as part of a general-purpose optimization framework for variational quantum circuits. The optimization dynamics is interpreted as moving in the steepest descent direction with respect to the Quantum Information Geometry, corresponding to the real part of the Quantum Geometric Tensor (QGT), also known as the Fubini-Study metric tensor. An efficient algorithm is presented for computing a block-diagonal approximation to the Fubini-Study metric tensor for parametrized quantum circuits, which may be of independent interest.

Variational quantum algorithms are examples of stochastic optimization problems, whereby one minimizes the expected value of a random cost function over a set of variational parameters using noisy estimates of the cost and/or its gradient. In the quantum setting, these estimates are obtained by repeated measurements of some Hermitian observables for a quantum state which depends on the variational parameters.
A variety of optimization methods have been proposed in the variational quantum circuit literature for determining optimal variational parameters, including derivative-free (zeroth-order) methods such as Nelder-Mead, finite differencing [12], and SPSA [33]. Recently, the possibility of exploiting direct access to first-order gradient information has been explored; indeed, quantum circuits have been designed to estimate such gradients with minimal overhead compared to objective function evaluations [31].
One motivation for exploiting first-order gradients is theoretical: in the convex case, the expected error in the objective function using the best known zeroth-order stochastic optimization algorithm scales polynomially with the dimension $d$ of the parameter space, whereas Stochastic Gradient Descent (SGD) converges independently of $d$. Another motivation stems from the empirical success of stochastic gradient methods in training deep neural networks, which involve minimization of non-convex objective functions over high-dimensional parameter spaces.
The application of SGD to deep learning suffers from the caveat that successful optimization hinges on careful tuning of the learning rate (step size) and other hyperparameters such as the momentum coefficient. Indeed, a vast literature has developed devoted to step size selection (see e.g. [15]). The difficulty of choosing a step size can be understood intuitively in the simple quadratic bowl approximation, where the optimal step size depends on the maximum eigenvalue of the Hessian, a quantity which is difficult to compute in high dimensions. In practice, the step size selection problem is overcome by using adaptive methods of stochastic optimization such as Adam [18], which have enjoyed wide adoption because of their ability to dynamically select a step size by maintaining a history of past gradients.
Independently of the improvements arising from historical averaging as in Momentum and Adam, it is natural to ask if the geometry of quantum states favors a particular optimization strategy. Indeed, it is well known that the choice of optimization method is intimately linked to the choice of geometry on the parameter space [26]. In the most well-known case of vanilla gradient descent, the relevant geometry is the $\ell_2$ geometry, as can be seen from the following exact rewriting of the iterative update rule:

$$\theta_{t+1} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d} \left[ \langle \theta - \theta_t, \nabla \mathcal{L}(\theta_t) \rangle + \frac{1}{2\eta} \|\theta - \theta_t\|_2^2 \right],$$

where $\mathcal{L}$ is the loss as a function of the variational parameters $\theta \in \mathbb{R}^d$ and $\eta$ is the step size. Thus, vanilla gradient descent moves in the steepest descent direction with respect to the $\ell_2$ geometry. In the deep learning literature, it has been argued that the $\ell_2$ geometry is poorly adapted to the space of weights of deep networks, due to their intrinsic parameter redundancy [26]. The Natural Gradient [1], in contrast, moves in the steepest descent direction with respect to the Information Geometry. Natural gradient descent is advantageous compared to the vanilla gradient because it is invariant under arbitrary re-parametrizations [1] and, moreover, possesses an approximate invariance with respect to over-parametrizations [22], which are typical for deep neural networks.
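For reference, the two steepest-descent updates can be written in closed form (the natural gradient update is stated here with a pseudo-inverse to accommodate degenerate metrics):

$$\theta_{t+1} = \theta_t - \eta\, \nabla \mathcal{L}(\theta_t) \quad (\ell_2), \qquad \theta_{t+1} = \theta_t - \eta\, F^{+}(\theta_t)\, \nabla \mathcal{L}(\theta_t) \quad (\text{Fisher-Rao}),$$

where $F(\theta)$ denotes the Fisher Information Matrix of the parametric family $p_\theta$.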
In a similar spirit, the quantum circuit literature has investigated the impact of geometry on the dynamics of variational algorithms. In particular, it was shown that, under the assumption of strong convexity, the $\ell_2$ geometry is sub-optimal in some situations compared to the $\ell_1$ geometry [13]. The intuitive argument put forth favoring the $\ell_1$ geometry is that some quantum state ansätze can be physically interpreted as a sequence of pulses of Hamiltonian evolution, starting from a fixed reference state. In this particular parametrization, each variational parameter can be interpreted as the duration of the corresponding pulse. This is not the only useful parametrization of quantum states, however, and it is thus desirable to find a descent direction which is not tied to any particular parametrization.
Ref. [13] leaves open the problem of finding the relevant geometry for general-purpose variational quantum algorithms, and this paper seeks to fill that void. The contributions of this paper are as follows:

• We point out that the demand of invariance with respect to arbitrary reparametrizations can be naturally fulfilled by introducing a Riemannian metric tensor on the space of quantum states, and that the implied descent direction is invariant with respect to reparametrizations by construction.
• We note that the space of quantum states is naturally equipped with a Riemannian metric, which differs from the $\ell_2$ and $\ell_1$ geometries explored previously. In fact, in the absence of noise, the space of quantum states is a complex projective space, which possesses a unique unitarily-invariant metric tensor called the Fubini-Study metric tensor. When restricted to the submanifold of quantum states defining the parametric family, the Fubini-Study metric tensor emerges as the real part of a more general geometric quantity called the Quantum Geometric Tensor (QGT).
• We show that the resulting gradient descent algorithm is a direct quantum analogue of the Natural Gradient in the statistics literature, and reduces to it in a certain limit.
• We present a quantum circuit construction which computes a block-diagonal approximation to the Quantum Geometric Tensor, and show that a simple diagonal preconditioning scheme outperforms vanilla gradient descent in terms of the number of iterates required to achieve convergence.

Theory

Quantum Information Geometry
Consider the set of probability distributions on $N$ elements $[N] = \{1, \ldots, N\}$; that is, the set of strictly positive vectors $p \in \mathbb{R}^N$, $p \succ 0$, which are normalized in the 1-norm, $\|p\|_1 = 1$. The following function is easily shown to be a metric (the Fisher-Rao metric) on the probability simplex $\Delta^{N-1}$:

$$d_1(p, q) = \arccos\big( \langle \sqrt{p}, \sqrt{q} \rangle \big), \tag{1}$$

where $\sqrt{p}$ and $\sqrt{q}$ denote the elementwise square roots of the probability vectors $p, q \in \Delta^{N-1}$. Now consider a parametric family of strictly positive probability distributions $p_\theta \succ 0$ indexed by real parameters $\theta \in \mathbb{R}^d$. It can be shown that the infinitesimal squared line element between two members of the parametric family is given by

$$d_1^2(p_\theta, p_{\theta + d\theta}) = \frac{1}{4} \sum_{i,j=1}^d I_{ij}(\theta)\, d\theta^i\, d\theta^j, \tag{2}$$

where $I_{ij}(\theta)$ are the components of a Riemannian metric tensor (with possible degeneracies) called the Fisher Information Matrix. Letting $p_\theta(x)$ denote the component of the probability vector $p_\theta$ corresponding to $x \in [N]$, we have

$$I_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\big[ \partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x) \big]. \tag{3}$$

Now consider an $N$-dimensional complex Hilbert space $\mathbb{C}^N$. Given a vector $\psi \in \mathbb{C}^N$ which is normalized in the 2-norm, $\|\psi\|_2 = 1$, a pure quantum state is defined as the projection $P_\psi = |\psi\rangle\langle\psi| \in \mathbb{CP}^{N-1}$ onto the one-dimensional subspace spanned by the unit vector $\psi$. In direct analogy with the simplex, the following function is easily shown to be a metric (the Fubini-Study metric) on the space of pure states:

$$d_2(\psi, \phi) = \arccos |\langle \psi, \phi \rangle|, \tag{4}$$

where $\psi, \phi \in \mathbb{C}^N$ are unit vectors. Letting $\psi_\theta$ denote a parametric family of unit vectors, the infinitesimal squared line element between two states defined by the parametric family is given by

$$d_2^2(\psi_\theta, \psi_{\theta + d\theta}) = \sum_{i,j=1}^d g_{ij}(\theta)\, d\theta^i\, d\theta^j, \tag{5}$$

where

$$g_{ij}(\theta) = \mathrm{Re}[G_{ij}(\theta)] \tag{6}$$

is the Fubini-Study metric tensor, which can be expressed in terms of the following Quantum Geometric Tensor (see [3,19,34] for a review):

$$G_{ij}(\theta) = \langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle - \langle \partial_i \psi_\theta, \psi_\theta \rangle \langle \psi_\theta, \partial_j \psi_\theta \rangle. \tag{7}$$

If $\{|x\rangle : x \in [N]\}$ denotes an orthonormal basis for $\mathbb{C}^N$, then one can easily verify that for the family of unit vectors defined by

$$|\psi_\theta\rangle = \sum_{x \in [N]} \sqrt{p_\theta(x)}\, |x\rangle \tag{8}$$

we have $G_{ij}(\theta) = \frac{1}{4} I_{ij}(\theta)$. Clearly, not all quantum states are of this form, due to the possibility of complex phases.
Finally, although we have posed the discussion in finite dimensions, all of the above concepts carry over to infinite-dimensional Hilbert spaces by appropriately replacing sums with integrals.
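As a concrete illustration of the correspondence $G_{ij}(\theta) = \frac{1}{4} I_{ij}(\theta)$, the following minimal NumPy sketch checks it numerically for a hypothetical softmax family $p_\theta = \mathrm{softmax}(W\theta)$; the family, the weights $W$, and the finite-difference tolerances are illustrative choices, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
W = rng.normal(size=(N, d))  # weights of an illustrative softmax family

def p(theta):
    z = W @ theta
    e = np.exp(z - z.max())
    return e / e.sum()

def jacobian(f, theta, eps=1e-6):
    """Central finite differences; column i is the partial derivative w.r.t. theta_i."""
    cols = []
    for i in range(len(theta)):
        dt = np.zeros_like(theta)
        dt[i] = eps
        cols.append((f(theta + dt) - f(theta - dt)) / (2 * eps))
    return np.stack(cols, axis=1)  # shape (N, d)

theta = rng.normal(size=d)
pt = p(theta)

# Classical Fisher information: I_ij = E_{x~p}[d_i log p(x) * d_j log p(x)]
dlogp = jacobian(p, theta) / pt[:, None]
I = np.einsum("x,xi,xj->ij", pt, dlogp, dlogp)

# QGT of the real embedding |psi_theta> = sum_x sqrt(p_theta(x)) |x>  (eq. 8)
dpsi = jacobian(lambda t: np.sqrt(p(t)), theta)
psi = np.sqrt(pt)
G = dpsi.T @ dpsi - np.outer(dpsi.T @ psi, psi @ dpsi)

print(np.allclose(G, I / 4, atol=1e-6))  # True, up to finite-difference error
```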

Optimization problem
Consider a parametric family of unitary operators $U_\theta \in \mathrm{U}(N)$ indexed by real parameters $\theta \in \mathbb{R}^d$. Given a fixed reference unit vector $|0\rangle \in \mathbb{C}^N$ and a Hermitian operator $H = H^\dagger$ acting on $\mathbb{C}^N$, we consider the following optimization problem:

$$\min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta), \qquad \mathcal{L}(\theta) := \langle \psi_\theta, H \psi_\theta \rangle, \tag{9}$$

where $\psi_\theta = U_\theta |0\rangle$ and $P_{\psi_\theta} \in \mathbb{CP}^{N-1}$ is the associated projector. In particular, note that $\psi_\theta$ is normalized since $U_\theta$ is unitary. Global optimization of the non-convex objective function $\mathcal{L}(\theta)$ is impractical, so we instead propose to search for local optima by iterating the following discrete-time dynamical system:

$$\theta_{t+1} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d} \left[ \langle \theta - \theta_t, \nabla \mathcal{L}(\theta_t) \rangle + \frac{1}{2\eta} \|\theta - \theta_t\|_{g(\theta_t)}^2 \right], \tag{10}$$

where $\eta > 0$ is the step size, and we have introduced the following notation:

$$\|v\|_{g(\theta)}^2 := \sum_{i,j=1}^d g_{ij}(\theta)\, v^i v^j. \tag{11}$$

The first-order optimality condition corresponding to (10) is

$$g(\theta_t)\,(\theta_{t+1} - \theta_t) = -\eta\, \nabla \mathcal{L}(\theta_t). \tag{12}$$

A solution of the optimization problem (10) is thus provided by the following expression, which involves the pseudo-inverse $g^+(\theta_t)$ of the metric tensor:

$$\theta_{t+1} = \theta_t - \eta\, g^+(\theta_t)\, \nabla \mathcal{L}(\theta_t). \tag{13}$$

In practice, however, we avoid materializing the pseudo-inverse by directly solving the linear system (12), which is both more efficient and more numerically stable. In the continuous-time limit corresponding to vanishing step size $\eta \to 0$, the dynamics (10) is equivalent to imaginary-time evolution within the variational subspace according to the Hamiltonian $H$, as shown in the supplementary material.
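A minimal NumPy sketch of one iterate of the dynamics (10), assuming the gradient and metric tensor are supplied by some oracle (here simply passed in as arguments):

```python
import numpy as np

def qng_step(theta, grad, metric, eta=0.05):
    """One step of (10): solve the linear system (12), g(theta) dtheta = -eta grad.

    Using lstsq applies the pseudo-inverse g^+ of (13) implicitly, which handles
    degenerate (rank-deficient) metric tensors without ever materializing g^+.
    """
    dtheta, *_ = np.linalg.lstsq(metric, -eta * grad, rcond=None)
    return theta + dtheta
```

Setting `metric` to the identity matrix recovers vanilla gradient descent, making explicit that the only difference between the two methods is the choice of geometry.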

Relationship with previous work
Quantum Natural Gradient optimization possesses important differences compared to its classical counterpart because of the form of the objective function. In classical statistical learning, the task is to minimize the relative entropy $D(p \| p_\theta)$ between the unknown data distribution $p$ and the model distribution $p_\theta$, parametrized by $\theta$. Since the data distribution is unknown, the objective function is sometimes chosen to be an empirical estimate of the population negative log-likelihood of the model,

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p}\big[ -\log p_\theta(x) \big].$$

Minimization of the empirical negative log-likelihood asymptotically minimizes the relative entropy $D(p \| p_\theta)$. Under additional assumptions (reviewed in the supplementary material), the Fisher Information Matrix approximates the Hessian of $\mathcal{L}$, and the natural gradient can be viewed as an approximate second-order method. In the quantum optimization problem, however, there is no direct relationship between the quantum Fisher information and the curvature of the objective, and the quantum natural gradient is more naturally interpreted as constrained imaginary-time evolution.
In the variational quantum Monte Carlo literature, the Stochastic Reconfiguration algorithm [32] and the time-dependent variational Monte Carlo [4,5] have been developed for imaginary- and real-time evolution, respectively. These algorithms evolve variational states $\psi_\theta$ by classically sampling from the Born probability distribution. In the quantum computing literature, an associated real-time evolution algorithm which exploits the imaginary part $\mathrm{Im}[G_{ij}(\theta)]$ of the Quantum Geometric Tensor (7) has been developed in [21] and subsequently demonstrated on quantum hardware in [6]. For details on the geometry of the time-dependent variational principle we refer the reader to [20, Proposition 2.4]. Variational imaginary-time evolution on hybrid quantum-classical devices has been previously investigated in [16,17,23]. In these works, the choice of optimization geometry can be shown to correspond to the unit sphere $S^{2N-1} = \{\psi \in \mathbb{C}^N : \|\psi\|_2 = 1\}$, rather than the complex projective space $\mathbb{CP}^{N-1}$ utilized in this paper. Recently, Ref. [36] appeared, which considers general evolution of variational density matrices in both real and imaginary time from a different perspective. By restricting their proposal to pure-state projectors (elements of $\mathbb{CP}^{N-1}$), they find an algorithm equivalent to ours.

Parametric family
In a digital quantum computer, the Hilbert space dimension $N = 2^n$ is exponential in the number of qubits $n \in \mathbb{N}$, and the Hilbert space has a natural tensor product decomposition into two-dimensional factors, $\mathbb{C}^N = \mathbb{C}^{2^n} = (\mathbb{C}^2)^{\otimes n}$. A parametric family of unitaries relevant to variational quantum algorithms consists of decompositions into products of $L \geq 1$ non-commuting layers of unitaries. Specifically, assume that the variational parameter vector is of the form $\theta = \theta_1 \oplus \cdots \oplus \theta_L \in \mathbb{R}^d$, where $\oplus$ denotes the direct sum (concatenation), and consider a unitary operator acting on $n$ qubits of the following form:

$$U_\theta = V_L(\theta_L)\, W_L \cdots V_1(\theta_1)\, W_1, \tag{14}$$

where $V_l(\theta_l)$ and $W_l$ are parametric and non-parametric unitary operators, respectively. In particular, all parametric gates within a given layer are assumed to commute. For later convenience, we introduce the following notation for representing subcircuits between layers $l_1 \leq l_2$:

$$U_{(l_1:l_2]} := V_{l_2}(\theta_{l_2})\, W_{l_2} \cdots V_{l_1+1}(\theta_{l_1+1})\, W_{l_1+1}, \tag{15}$$

so that, for example, $U_\theta = U_{(0:L]}$.
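A small PennyLane sketch of a circuit in the layered family (14); the gate choices, depth, and observable here are illustrative and not the circuits used in the experiments below:

```python
import pennylane as qml
from pennylane import numpy as np

n, L = 4, 2  # qubits, layers
dev = qml.device("default.qubit", wires=n)

@qml.qnode(dev)
def circuit(theta):  # theta has shape (L, n): one angle per qubit per layer
    for l in range(L):
        # non-parametric layer W_l: a 1D ladder of CZ entanglers
        for w in range(n - 1):
            qml.CZ(wires=[w, w + 1])
        # parametric layer V_l(theta_l): commuting single-qubit rotations
        for w in range(n):
            qml.RY(theta[l, w], wires=w)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

theta = np.random.uniform(0, 2 * np.pi, size=(L, n))
print(circuit(theta))
```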

Quantum Circuit Representation of Quantum Geometric Tensor
Computing the Quantum Geometric Tensor corresponding to a parametrized quantum circuit of the form (14) is a challenging task. In this section we show, nevertheless, that block-diagonal components of the tensor can be efficiently computed on a quantum computer, producing an approximation to the QGT of the following block-diagonal form:

$$G \approx \bigoplus_{l=1}^L G^{(l)}, \tag{16}$$

where $G^{(l)}$ is the block corresponding to the parameters $\theta_l$ of the $l$th layer. Consider the $l$th layer of the circuit, parametrized by $\theta_l$, and let $\partial_i$ and $\partial_j$ denote the partial derivative operators with respect to any pair of components of $\theta_l$ (not necessarily distinct). For each layer $l \in [L]$ there exist Hermitian generator matrices $K_i$ and $K_j$ such that

$$\partial_i V_l(\theta_l) = i K_i V_l(\theta_l), \qquad \partial_j V_l(\theta_l) = i K_j V_l(\theta_l), \tag{17}$$

where for notational clarity we have dropped the layer index $l$ from the Hermitian generators, despite the fact that the generators can vary between layers. For simplicity we assume that for all distinct parameters $i \neq j$ within a layer we have $\partial_i K_j = 0$ (this can also serve as the defining property of a layer). The commutativity of the partial derivative operators, combined with (17), then implies $[K_i, K_j] = 0$. Now define the state prepared by the first $l$ layers of the circuit,

$$|\psi_l\rangle := U_{(0:l]}\, |0\rangle, \tag{18}$$

so that $\psi_\theta = U_{(l:L]}\, \psi_l$. Combining (15), (17) and (18) we compute

$$\partial_i \psi_\theta = i\, U_{(l:L]}\, K_i\, \psi_l. \tag{19}$$

Similarly, $\partial_j \psi_\theta = i\, U_{(l:L]}\, K_j\, \psi_l$. It follows from unitarity of the subcircuit $U_{(l:L]}$ and Hermiticity of the generator $K_i$ that

$$\langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle = \langle \psi_l | K_i K_j | \psi_l \rangle. \tag{20}$$

Similarly, the so-called Berry connection (see A.1) is given by

$$A_j(\theta) = -i \langle \psi_\theta, \partial_j \psi_\theta \rangle = \langle \psi_l | K_j | \psi_l \rangle. \tag{21}$$

Combining these expressions we obtain the following form for the $l$th block of the QGT:

$$G^{(l)}_{ij} = \langle \psi_l | K_i K_j | \psi_l \rangle - \langle \psi_l | K_i | \psi_l \rangle \langle \psi_l | K_j | \psi_l \rangle. \tag{22}$$

The operator $K_i K_j$ is Hermitian since $[K_i, K_j] = 0$, and thus the block-diagonal approximation of the QGT coincides with the block-diagonal approximation of the Fubini-Study metric tensor:

$$g^{(l)}_{ij} = \mathrm{Re}\big[ G^{(l)}_{ij} \big] = G^{(l)}_{ij}. \tag{23}$$

The preceding calculation demonstrates the following key facts:

1. The $l$th block of the Fubini-Study metric tensor can be evaluated in terms of quantum expectation values of Hermitian observables.
2. The states $\psi_l$ defining the quantum expectation values are prepared by subcircuits of the full quantum circuit and are thus experimentally realizable.
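Given statevector access to $\psi_l$ and matrix representations of the generators, the $l$th block follows directly from (22); a minimal NumPy sketch, assuming dense generator matrices (feasible only for small $n$):

```python
import numpy as np

def metric_block(psi_l, generators):
    """g^(l)_ij = <K_i K_j> - <K_i><K_j>, evaluated in the state psi_l
    prepared by the subcircuit U_(0:l] acting on |0> (eq. 18)."""
    m = len(generators)
    # <K_i> is real since each K_i is Hermitian
    k_exp = [np.vdot(psi_l, K @ psi_l).real for K in generators]
    g = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            # K_i K_j is Hermitian because [K_i, K_j] = 0, so <K_i K_j> is real
            kk = np.vdot(psi_l, generators[i] @ generators[j] @ psi_l).real
            g[i, j] = kk - k_exp[i] * k_exp[j]
    return g
```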

Observables
Having identified the states for which the quantum expectation values are to be evaluated, we now turn to characterizing the Hermitian observables defining the quantum measurement.
For simplicity of exposition, we focus on one of the most common parametric families encountered in the literature, which consists of tensor products of single-qubit Pauli rotations:

$$V_l(\theta_l) = \bigotimes_{i=1}^n R_{l,i}(\theta_{l,i}). \tag{24}$$

The rotation gates are given by

$$R_{l,i}(\theta_{l,i}) = \exp\!\big( -\tfrac{i}{2}\, \theta_{l,i}\, P_{l,i} \big), \tag{25}$$

where $\theta_{l,i} \in [0, 2\pi)$, and $P_{l,i} \in \{\sigma_x, \sigma_y, \sigma_z\}$ denotes the Pauli matrix which acts on qubit $i$ of layer $l$. The expressive power of this class of circuits was recently investigated in [8]. In this case the generators are easily shown to be

$$K_i = -\tfrac{1}{2}\, \mathbb{1}_{[1,i)} \otimes P_{l,i} \otimes \mathbb{1}_{(i,n]}, \tag{26}$$

where $\mathbb{1}_{[1,i)} = \bigotimes_{1 \leq j < i} \mathbb{1}$ and $\mathbb{1}_{(i,n]} = \bigotimes_{i < j \leq n} \mathbb{1}$. These operators evidently satisfy $[K_i, K_j] = 0$. Since $P_{l,i}^2 = \mathbb{1}$ as a result of the Pauli algebra, it follows that the $l$th block of the QGT requires the evaluation of quantum expectation values $\langle \psi_l | \hat{A} | \psi_l \rangle$, where $\hat{A} \in S_l$ belongs to the following set of operators:

$$S_l = \{ P_{l,i} : 1 \leq i \leq n \} \cup \{ P_{l,i} P_{l,j} : 1 \leq i < j \leq n \}. \tag{27}$$

Furthermore, since every operator in $S_l$ commutes with every other, the number of state preparations is reduced from the naive counting $|S_l| = n(n+1)/2$ to a single measurement setting.
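For instance, for a layer of $R_z$ rotations ($P_{l,i} = \sigma_z$ on every qubit), all elements of $S_l$ are diagonal in the computational basis, so one round of sampling suffices. A PennyLane sketch, where the preparation subcircuit is a hypothetical stand-in for $U_{(0:l]}$:

```python
import pennylane as qml

n = 3
dev = qml.device("default.qubit", wires=n, shots=8192)

# All n(n+1)/2 observables required for the l-th block (eq. 27)
S_l = [qml.PauliZ(i) for i in range(n)]
S_l += [qml.PauliZ(i) @ qml.PauliZ(j) for i in range(n) for j in range(i + 1, n)]
assert len(S_l) == n * (n + 1) // 2

@qml.qnode(dev)
def layer_expectations(phi):
    for w in range(n):
        qml.RY(phi[w], wires=w)  # stand-in for the subcircuit preparing psi_l
    # mutually commuting observables: all estimated from one measurement setting
    return [qml.expval(A) for A in S_l]

print(layer_expectations([0.1, 0.2, 0.3]))
```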

Numerical Experiments
In order to assess the performance of the Quantum Natural Gradient optimizer, we present in this section numerical experiments comparing the oracle complexity of QNG with that of existing optimization techniques such as vanilla gradient descent and Adam, assuming oracle access to local data including the gradient and the Fubini-Study metric tensor. These numerical experiments suggest an improved oracle complexity for QNG. Although the oracle model of complexity is unrealistic because it ignores the added per-iteration complexity of querying the oracle, we provide additional experiments in Sec. A.5 of the supplementary material which demonstrate that the advantage persists when optimizers are compared in terms of both wall time and number of required quantum evaluations. These experiments were performed with the open-source quantum machine learning software library PennyLane [2,31]. New functionality was added for efficiently computing the block-diagonal $g^{(l)}_{ij}$ and diagonal $g_{ii}$ approximations of the Fubini-Study metric tensor for arbitrary $n$-qubit parametrized quantum circuits on quantum hardware. This process involves the following steps:

1. Represent the circuit as a directed acyclic graph (DAG). This allows the parametrized layer structure to be programmatically extracted. Gates which have no dependence on each other (e.g., because they act on different wires) can be grouped together into the same layer.
2. Determine observables. For each layer $l$ consisting of $m$ parameters, the generators $K_i$ for each parametrized gate are determined, and a subcircuit preparing $\psi_l$ is constructed.
3. Calculate the $l$th block of the Fubini-Study metric tensor.
(a) Entire block: The unitary operation which rotates $\psi_l$ into the shared eigenbasis of the generators $\{K_i : 1 \leq i \leq m\}$ is calculated and applied to the subcircuit, and all qubits are measured in the Pauli-Z basis. Classical post-processing is performed to determine $\langle \psi_l | K_i K_j | \psi_l \rangle$, $\langle \psi_l | K_i | \psi_l \rangle$, and $\langle \psi_l | K_j | \psi_l \rangle$ for all $1 \leq i, j \leq m$, and subsequently $g^{(l)}_{ij}$.
(b) Diagonal approximation: The variance $\langle K_i^2 \rangle - \langle K_i \rangle^2$ is computed for all $1 \leq i \leq m$, and subsequently the diagonal approximation $g^{(l)}_{ii}$ of the block-diagonal metric.
Thus, to evaluate the block-diagonal approximation of the Fubini-Study metric tensor on quantum hardware, a single quantum evaluation is performed for each layer in the parametrized quantum circuit. Finally, a Quantum Natural Gradient optimizer was implemented in PennyLane (see [35] for full source code). This optimizer computes the block-diagonal metric tensor $g(\theta)$ at each optimization step ($L$ quantum evaluations), as well as the analytic gradient of the objective function $\nabla \mathcal{L}(\theta)$ via the parameter-shift rule [25] ($2d$ quantum evaluations), and updates the parameter values by classically solving the linear system (12). As a result, each optimization step requires $2d + L$ quantum evaluations.
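The corresponding user-facing workflow is sketched below using PennyLane's `metric_tensor` transform and `QNGOptimizer`; API names are as of recent PennyLane releases and exact signatures may differ across versions, so treat this as a sketch rather than a definitive usage guide:

```python
import pennylane as qml
from pennylane import numpy as np

n, L = 3, 2
dev = qml.device("default.qubit", wires=n)

@qml.qnode(dev)
def cost(theta):  # flat parameter vector of dimension d = L * n
    for l in range(L):
        for w in range(n - 1):
            qml.CZ(wires=[w, w + 1])
        for w in range(n):
            qml.RY(theta[l * n + w], wires=w)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

theta = np.random.uniform(0, 2 * np.pi, size=L * n, requires_grad=True)

# Block-diagonal Fubini-Study metric tensor (L quantum evaluations)
g = qml.metric_tensor(cost, approx="block-diag")(theta)
# Parameter-shift gradient (2d quantum evaluations)
grad = qml.grad(cost)(theta)

# One step of the dynamics (10) via the linear system (12)
eta = 0.05
theta = theta + np.linalg.lstsq(g, -eta * grad, rcond=None)[0]

# Alternatively, the bundled optimizer performs the same update
opt = qml.QNGOptimizer(stepsize=eta, approx="block-diag")
theta = opt.step(cost, theta)
```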
For numerical verification, we considered the circuit of [24], which consists of an initial fixed layer of $R_y(\pi/4)$ gates acting on $n$ qubits, followed by $L$ layers of parametrized Pauli rotations interwoven with 1D ladders of controlled-Z gates; the target Hermitian observable was chosen, as in Ref. [24], to be the two-Pauli operator $Z_1 Z_2$ acting on the first and second qubits, which has a ground-state energy of $-1$. Starting from the same random initialization as Ref. [24], we optimized the parametrized Pauli rotation gates using vanilla gradient descent, the Adam optimizer, and the Quantum Natural Gradient optimizer, with both the block-diagonal and diagonal approximations. The results are shown in Fig. 1 for $n = 7, 9, 11$ qubits, $L = 5$ layers, and with the optimization performed using 8192 samples per expectation value. In all cases vanilla gradient descent fails to find the minimum of the objective function, while Quantum Natural Gradient descent finds the minimum in a small number of iterations, in both the block-diagonal and strictly diagonal approximations. In addition, we present a comparison with the Adam optimizer, which is a non-local averaging method. For this particular circuit, Adam is capable of finding the minimum but requires a larger number of iterations than the Quantum Natural Gradient. Furthermore, the improvement afforded by the Quantum Natural Gradient optimizer appears more significant with increasing qubit number. Note that for $n = 11$, we do not include the block-diagonal approximation, due to the increased classical overhead associated with numerically computing the shared eigenbasis for each parametrized layer. However, this overhead can likely be negated by implementing the techniques of [7] and [11].
To investigate the effects of variable circuit depth, the numerical experiment was repeated with $n = 9$ qubits and parametric quantum circuits with $L = 3, 4, 5, 6$ layers. The results are shown in Fig. 2, highlighting that the Quantum Natural Gradient optimizer retains its advantage with increasing circuit depth.

Discussion
It is instructive to compare our proposal with existing preconditioning schemes such as Adam. Unlike Adam, which preconditions using a history of past gradients, the preconditioning matrix suggested by quantum information geometry does not depend on the specific choice of loss function (Hermitian observable); it is instead a reflection of the local geometry of the quantum state space. In view of these differences, it is natural to expect that the benefits provided by the Quantum Natural Gradient are complementary to those of existing stochastic optimization methods such as Adam. It is therefore of interest to perform a detailed ablative study combining these methods, which we leave to future work.
Finally, this paper only considered the relevant geometry for idealized systems described by pure quantum states. In near-term noisy devices it may be of interest to study the relevant geometry for density matrices. The most promising candidate is the Bures metric, which possesses a number of desirable features. In particular, it is the only monotone metric which reduces to both the Fubini-Study metric for pure states and the Fisher information matrix for classical mixtures [28].

A Supplementary Material
In this appendix we employ the Einstein summation convention where summation over repeated indices is implied.

A.1 Real and imaginary parts of Quantum Geometric Tensor
Partially differentiating both sides of the normalization condition $1 = \|\psi_\theta\|^2$ with respect to $\theta^i$ gives

$$\langle \partial_i \psi_\theta, \psi_\theta \rangle + \langle \psi_\theta, \partial_i \psi_\theta \rangle = 0. \tag{33}$$

Partially differentiating again, with respect to $\theta^j$, gives

$$\langle \partial_j \partial_i \psi_\theta, \psi_\theta \rangle + \langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle + \langle \partial_j \psi_\theta, \partial_i \psi_\theta \rangle + \langle \psi_\theta, \partial_j \partial_i \psi_\theta \rangle = 0. \tag{34}$$

Consider the wavefunction in a neighborhood $\theta + \delta\theta$ of $\theta \in \mathbb{R}^d$. Taylor expanding in the displacement vector $\delta\theta$, we obtain

$$\psi_{\theta+\delta\theta} = \psi_\theta + \partial_i \psi_\theta\, \delta\theta^i + \tfrac{1}{2}\, \partial_i \partial_j \psi_\theta\, \delta\theta^i \delta\theta^j + O(\|\delta\theta\|^3). \tag{35}$$

Taking the inner product with $\psi_\theta$ gives

$$\langle \psi_\theta, \psi_{\theta+\delta\theta} \rangle = 1 + \langle \psi_\theta, \partial_i \psi_\theta \rangle\, \delta\theta^i + \tfrac{1}{2} \langle \psi_\theta, \partial_i \partial_j \psi_\theta \rangle\, \delta\theta^i \delta\theta^j + O(\|\delta\theta\|^3). \tag{36}$$

It follows that the fidelity between $\psi_\theta$ and $\psi_{\theta+\delta\theta}$ is given to quadratic order in the displacement $\delta\theta$ by

$$|\langle \psi_\theta, \psi_{\theta+\delta\theta} \rangle|^2 = 1 - \left[ \tfrac{1}{2}\big( \langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle + \langle \partial_j \psi_\theta, \partial_i \psi_\theta \rangle \big) - \langle \partial_i \psi_\theta, \psi_\theta \rangle \langle \psi_\theta, \partial_j \psi_\theta \rangle \right] \delta\theta^i \delta\theta^j + O(\|\delta\theta\|^3), \tag{37}$$

where we have used (33) and (34). Now use the approximation

$$\arccos\big( \sqrt{1 - x} \big) = \sqrt{x} + O(x^{3/2}), \tag{38}$$

which, applied to the Fubini-Study distance, gives

$$d_2^2(\psi_\theta, \psi_{\theta+\delta\theta}) = 1 - |\langle \psi_\theta, \psi_{\theta+\delta\theta} \rangle|^2 + O(\|\delta\theta\|^4). \tag{39}$$

It follows that the infinitesimal squared distance is given by

$$d_2^2(\psi_\theta, \psi_{\theta+\delta\theta}) = \left[ \tfrac{1}{2}\big( \langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle + \langle \partial_j \psi_\theta, \partial_i \psi_\theta \rangle \big) - \langle \partial_i \psi_\theta, \psi_\theta \rangle \langle \psi_\theta, \partial_j \psi_\theta \rangle \right] \delta\theta^i \delta\theta^j + O(\|\delta\theta\|^3). \tag{40}$$

The term multiplying $\tfrac{1}{2}$ on the right-hand side of (40) is manifestly real. The term multiplying $-1$ is also real because of (33), which implies that $\langle \psi_\theta, \partial_i \psi_\theta \rangle$ is purely imaginary and hence

$$\langle \partial_i \psi_\theta, \psi_\theta \rangle \langle \psi_\theta, \partial_j \psi_\theta \rangle = \overline{\langle \psi_\theta, \partial_i \psi_\theta \rangle}\, \langle \psi_\theta, \partial_j \psi_\theta \rangle \in \mathbb{R}. \tag{41}$$

It follows that the metric tensor is given by the real part of the QGT,

$$g_{ij}(\theta) = \mathrm{Re}\big[ G_{ij}(\theta) \big], \qquad G_{ij}(\theta) = \langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle - \langle \partial_i \psi_\theta, \psi_\theta \rangle \langle \psi_\theta, \partial_j \psi_\theta \rangle. \tag{42}$$

For completeness, the imaginary part of the QGT is given by

$$\mathrm{Im}\big[ G_{ij}(\theta) \big] = \tfrac{1}{2}\big( \partial_i A_j(\theta) - \partial_j A_i(\theta) \big), \tag{43}$$

where $A_i(\theta)$ is the Berry connection,

$$A_i(\theta) = -i \langle \psi_\theta, \partial_i \psi_\theta \rangle. \tag{44}$$

A.3 Relationship with curvature of objective
Let $p_\theta \succ 0$ be a parametric family of probability distributions over $[N]$, indexed by $\theta \in \mathbb{R}^d$. Differentiating both sides of the expression $1 = \mathbb{E}_{x \sim p_\theta}[1]$ we find the identity

$$\mathbb{E}_{x \sim p_\theta}\big[ \partial_i \log p_\theta(x) \big] = 0,$$

and differentiating once again gives

$$\mathbb{E}_{x \sim p_\theta}\big[ \partial_i \partial_j \log p_\theta(x) \big] + \mathbb{E}_{x \sim p_\theta}\big[ \partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x) \big] = 0.$$

The Fisher Information Matrix can thus be expressed as

$$I_{ij}(\theta) = -\mathbb{E}_{x \sim p_\theta}\big[ \partial_i \partial_j \log p_\theta(x) \big].$$

Now suppose that $p \succ 0$ is an unknown probability vector. Recall that the relative entropy between $p$ and $p_\theta$ can be expressed as

$$D(p \| p_\theta) = -S(p) + \mathcal{L}(\theta),$$

where $S(p)$ is the entropy of $p$ and $\mathcal{L}(\theta)$ is the population loss, given by the expected negative log-likelihood of the model,

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim p}\big[ -\log p_\theta(x) \big].$$

The Hessian of the loss is given by

$$\partial_i \partial_j \mathcal{L}(\theta) = \mathbb{E}_{x \sim p}\big[ -\partial_i \partial_j \log p_\theta(x) \big].$$

Introducing the shorthand $f_{ij}(x) = -\partial_i \partial_j \log p_\theta(x)$ and using Hölder's inequality, followed by Pinsker's inequality, we obtain

$$\big| \partial_i \partial_j \mathcal{L}(\theta) - I_{ij}(\theta) \big| = \Big| \sum_{x \in [N]} \big( p(x) - p_\theta(x) \big)\, f_{ij}(x) \Big| \leq \| p - p_\theta \|_1\, \| f_{ij} \|_\infty \leq \sqrt{2\big( \mathcal{L}(\theta) - S(p) \big)}\; \| f_{ij} \|_\infty.$$

Thus the error in the approximation $\partial_i \partial_j \mathcal{L}(\theta) \approx I_{ij}(\theta)$ is controlled by the loss deficit $\mathcal{L}(\theta) - S(p) \geq 0$ and the curvature of the likelihood function.

A.4 Relationship with classical Fisher information
Consider the family of unit vectors defined in (8),

$$|\psi_\theta\rangle = \sum_{x \in [N]} \sqrt{p_\theta(x)}\, |x\rangle.$$

Then by the chain rule,

$$\partial_i |\psi_\theta\rangle = \frac{1}{2} \sum_{x \in [N]} \sqrt{p_\theta(x)}\, \partial_i \log p_\theta(x)\, |x\rangle.$$

Thus the Berry connection for this family of states vanishes,

$$A_i(\theta) = -i \langle \psi_\theta, \partial_i \psi_\theta \rangle = -\frac{i}{2}\, \mathbb{E}_{x \sim p_\theta}\big[ \partial_i \log p_\theta(x) \big] = 0,$$

where we used the orthonormality of the basis, $\langle x | x' \rangle = \delta_{xx'}$, and the identity $\mathbb{E}_{x \sim p_\theta}[\partial_i \log p_\theta(x)] = 0$ derived in A.3. The QGT is thus given by

$$G_{ij}(\theta) = \langle \partial_i \psi_\theta, \partial_j \psi_\theta \rangle = \frac{1}{4}\, \mathbb{E}_{x \sim p_\theta}\big[ \partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x) \big] = \frac{1}{4}\, I_{ij}(\theta).$$

A.5 Additional experiments and figures
In the following section, we present some additional plots comparing the optimization dynamics of the Quantum Natural Gradient to various other optimization strategies, including gradient descent-based (standard or vanilla gradient descent, Adam) and gradient-free (COBYLA, Nelder-Mead) strategies.
In addition, we include in this comparison a version of the Adam optimizer modified to use the natural gradient in its parameter update step. While it remains difficult to make direct comparisons between the (non-local) Adam optimizer and the Quantum Natural Gradient optimizer, it is instructive to compare the behaviour of the Adam optimizer when using the natural gradient as opposed to the standard gradient.
The results of these additional experiments are shown in Fig. 3, highlighting each optimization strategy for a fixed number of shots and increasing circuit depth, and Fig. 4, for fixed circuit depth but varying number of samples used to compute circuit expectation values. In both experiments, the same circuit architecture is used as in Sec. 3. Here, we compare the progress of each optimization strategy against the number of iterations, total computational wall time (note that this includes the wall time required to perform all quantum simulations), and number of quantum evaluations. In particular, we note that:

• The Quantum Natural Gradient continues to outperform both vanilla gradient descent and Adam optimization.
• The diagonal approximation and the block-diagonal approximation to the Quantum Geometric Tensor provide comparable results when used with the Quantum Natural Gradient; however, the diagonal approximation results in significantly reduced overall wall time, comparable to that of vanilla gradient descent, due to the decrease in classical processing overhead.
• Comparison with gradient-free techniques is more difficult; within the same number of iterations, both gradient-free techniques failed to find the local minimum. However, COBYLA and Nelder-Mead required significantly fewer quantum evaluations over these iterations.
• The inclusion of the natural gradient within the Adam optimizer parameter update step appears to provide some benefit, with the modified Adam optimizer converging to the local minimum in fewer iterations than the standard Adam optimizer.