Here comes the SU(N): multivariate quantum gates and gradients

Variational quantum algorithms use non-convex optimization methods to find the optimal parameters for a parametrized quantum circuit in order to solve a computational problem. The choice of the circuit ansatz, which consists of parameterized gates, is crucial to the success of these algorithms. Here, we propose a gate which fully parameterizes the special unitary group $\mathrm{SU}(N)$. This gate is generated by a sum of non-commuting operators, and we provide a method for calculating its gradient on quantum hardware. In addition, we provide a theorem for the computational complexity of calculating these gradients by using results from Lie algebra theory. In doing so, we further generalize previous parameter-shift methods. We show that the proposed gate and its optimization satisfy the quantum speed limit, resulting in geodesics on the unitary group. Finally, we give numerical evidence to support the feasibility of our approach and show the advantage of our gate over a standard gate decomposition scheme. In doing so, we show that not only the expressibility of an ansatz matters, but also how it's explicitly parameterized.


Introduction
Variational quantum computing is a paradigm of quantum computing that uses optimization algorithms to find the optimal parameters for a parameterized quantum circuit [1,2].Crucial for the success of such algorithms is the choice of circuit ansatz, which usually consists of multiple parameterized one and two-qubit gates.Typi-cally, these gates are parameterized unitary matrices generated by single Pauli-string operators that can locally rotate a state around some axis: U (t) = exp{itG}, where t is a gate parameter and G a Pauli string.For a specific family of cost functions, there exist a variety of methods that allow one to obtain the gradient with respect to t [3,4,5,6,7,8,9] on quantum hardware.With these gradients, the cost function can be minimized via any gradient-descent-based algorithm.
Instead of considering a gate generated by a single Pauli string, one can construct more general parameterized gates that can perform an arbitrary rotation in SU(N ), the special unitary group.These general SU(N ) rotations are used in a variety of quantum algorithms [10,11,12,13].In practice, rotations in SU(N ) can be implemented by composing several simple parameterized gates together into a more complicated one.For example, for single and two-qubit gates (where N = 2, 4, respectively), there exist several general decomposition schemes of such gates into products of single-qubit gates and CNOTs [14,15,16,17,18,19].In practice, this compilation comes with hardware-specific challenges, since quantum hardware usually has a set of native gates into which all others have to be decomposed [20,21].
Choosing the right parameterization for a function is important because it can significantly affect the properties of its gradients.Reparameterizing functions to obtain more useful gradients is a well-known method in statistics and machine learning.For example, in restricted maximum likelihood methods one can ensure numerical stability of quasi-Newton methods by decomposing covariance matrices into Cholesky factors [22].In addition, methods like auxiliary lin-ear transformations [23], batch normalization [24] and weight normalization [25] are used to improve the gradients in neural networks.In variational inference, the reparameterization trick [26] is at the core of variational autoencoder approaches and allows for gradients for stochastic back-propagation [27,28].In light of this, it may be worthwhile to investigate alternative parameterizations of quantum gates for variational quantum algorithms.
In this work, we propose a family of parameterized unitaries called SU(N ) gates and provide a method to evaluate their gradients on quantum hardware.In doing so, we generalize the prior literature one step further, since many past schemes can be understood as special cases of our proposal [3,4,5,6,7,8,9].We provide numerical results to support the validity of our approach and give several examples to illustrate the capabilities of the SU(N ) gate.We show that this gate satisfies the quantum speed limit and that it is easier to optimize compared to SU(N ) parameterizations that consist of products of gates.We argue that this is the case because the product of unitaries creates a "bias" in the Lie algebra that deforms the cost landscape.In addition, we highlight the connections between our formalism and the properties of semisimple Lie algebras and establish a bound on the computational complexity of the gradient estimation using tools from representation theory.

SU(N ) gates
A quantum gate is a unitary operation U that acts on a quantum state ρ in a complex Hilbert space.If we ignore a global phase, then a gate U acting on N qubits qubits is an element of the special unitary group SU(N ) (see App. A), where N = 2 N qubits .Note that all of the following works for any N > 1, but here we restrict ourselves to the qubit case.We are interested in constructing a quantum gate that parameterizes all of SU(N ).
To achieve this, we make use of the theory of Lie algebras.We will not be concerned with the formal treatment of this topic, which can be found in many excellent textbooks [29,30,31].
To construct our gate, we realize that SU(N ) is a (semisimple) Lie group and so there exists a unique connection between its elements and the Lie algebra su(N ) via the so-called Lie corre-spondence, or Lie's third theorem [32,31].In particular, each g ∈ SU(N ) can be identified with an A ∈ su(N ) via the exponential map g = exp{A}.For our purposes, we can understand the Lie algebra su(N ) as a vector space of dimension N 2 − 1 that is closed under the commutator, [A, B] = AB − BA ∈ su(N ) for A, B ∈ su(N ).For su(N ), we choose as a basis the tensor products of Pauli matrices multiplied by the imaginary unit i: where σ i ∈ {I, X, Y, Z} and I N qubits = iI ⊗N qubits .
We choose the following parameterization of SU(N ): (2) where θ = (θ 1 , θ 2 , . . ., θ N 2 −1 ) ∈ R N 2 −1 and {G m } ∈ P (N qubits ) .To distinguish between the group and the gate, we call the parameterization in Eq. (2) a SU(N ) gate.The coordinates θ are called the canonical coordinates, which uniquely parameterize U through the Lie algebra su(N ).Since we typically cannot implement the above gate in hardware, we will have to be decompose it via a standard unitary decomposition algorithm [14,15,16,17,18,19].We emphasize here that even though the gate will be decomposed, it is parameterized as an exponential map.Hence we can understand Eq. ( 2) as a change of coordinates from the SU(N ) gate decomposition.
If we do not want to parameterize all of SU(N ), we can instead parameterize a more restricted Hamiltonian by setting some of the parameters θ m to zero.This makes Eq. ( 2) a natural parameterization of several Hamiltonians available on modern quantum hardware platforms.These Hamiltonians often have multiple independently tunable fields which can be active at the same time and do not necessarily commute.One typically has local control on each qubit and access to an interacting Hamiltonian between pairs of qubits, depending on the topology of the quantum device [33].The interacting pair can for example be a ZZ interaction for Josephson flux qubits [34], a Heisenberg interaction for nuclear spins in doped silicon [35] or an XY interaction in quantum dots interacting with a cavity [36].
To use this gate in a gradient-based variational quantum algorithm, we have to be able to obtain partial derivatives of U (θ) with respect to each parameter θ l .Although there exist a variety of works that provide analytical expressions for gradients through quantum circuits via the parameter-shift rule [3,4,5,6,7,8,9,37], these works almost uniformly assume that the gate is of the form U (θ) = exp{iθP }, where P is a Hermitian operator.As far as we are aware, the only methods to obtain gradients of Eq. ( 2) with respect to θ are the stochastic and Nyquist parameter-shift rules of [6] and [10], respectively.The first approach relies on an integral identity for bounded operators that is estimated via Monte Carlo [38], whereas the latter is based on a theorem in Fourier analysis [39].

Obtaining the gradient
Here, we provide a new approach to obtain the gradient of Eq. ( 2) that makes use of differentiable programming, which is efficient for gates acting on a small number of qubits.To start, we note that the partial derivative with respect to a parameter θ l is given by (3) Here, ad X denotes the adjoint action of the Lie algebra given by the commutator ad hence (ad X ) p denotes a nested commutator of p terms.For more details, see App.B. Note that the term on the right of U (θ) in Eq. ( 3) is an element of the Lie algebra, since ∂/∂θ l A(θ) = G l ∈ su(N ) and so the commutator keeps the entire sum in the algebra.For notational clarity we define where Ω l (θ) ∈ su(N ) is a skew-Hermitian operator that generates a unitary, which we call the effective generator.Given that Eq. ( 4) is an infinite series of nested commutators it is not clear how Ω l (θ) ∈ su(N ) can be calculated in practice without truncating the sum.We can think of U as a function U : R N 2 −1 → SU(N ) that we evaluate at the point θ.Since SU(N ) is a differentiable manifold, we can define a set of local coordinates on the group and represent U (x) as a matrix described by N 2 − 1 real numbers.Hence, we can think of our gate as a coordinate transformation between the parameters x and the entries of the matrix representing the unitary.Since U (x) depends smoothly on x l via the matrix exponential, this coordinate transformation comes with a corresponding Jacobian (or more accurately, pushforward) We can obtain this Jacobian by differentiating the elements U nm (x) with respect to x l : To obtain the above matrix function numerically, we rely on the fact that the matrix exponential and its derivative are implemented in differentiable programming frameworks such as JAX [40], PyTorch [41] and Tensorflow [42] through automatic differentiation.Here we make use of the JAX implementation, which provides the matrix exponential through a differentiable Padé approximation [43,44].Continuing, we note that evaluating ∂U (x)/∂x l at a point θ produces an element of the tangent space T U (θ) SU(N ).We can move from the tangent space to the Lie algebra by left multiplying the elementwise derivative of Eq. ( 5) in Eq. ( 3) with U † (θ) (see App. A), which allows us to obtain Ω l (θ) exactly, up to machine precision.We emphasize that these steps can be performed on a classical computer, with a cost that is only dependent on the number of qubits the gate acts on, not the number of qubits in the circuit.We now make the following observation: Ω l (θ) corresponds to a tangent vector on SU(N ) and generates the one-parameter subgroup and We sketch this procedure schematically in Fig. 1.
Figure 1: Schematic depiction of our approach.We move to the Lie algebra from the tangent space by left multiplication with U † (θ) and obtain Ω l (θ).The orbit generated by Ω l (θ) corresponds to the gate we have to insert in the circuit to compute the gradient.
We now consider a typical variational setting, where we are interested in minimizing the following cost function: where H is some Hermitian operator and ρ the initial state of the system.For simplicity we consider a circuit consisting of a single SU(N ) gate.
Differentiating the cost function with respect to θ l gives Then, plugging in Eq. ( 8) we find, where we used the skew-Hermitian property of the tangent vector Ω † l (θ) = −Ω l (θ).Note that Eq. ( 11) corresponds to a new circuit with the gate exp{tΩ l (θ)} inserted before U (θ) (see Fig. 2).The partial derivative with respect to the gate parameter θ l can be obtained by adding a gate to the circuit that is generated by Ω l (θ).Calculating the derivative with respect to t and evaluating at t = 0 then provides one with the correct gradient.
The gradient of this new circuit can be computed on quantum hardware with a generalized parameter-shift rule (GPSR) [8,7,9].In Algorithm 1, we outline the entire algorithm for our gradient estimation and we denote the GPSR subroutine with gpsr.An alternative to the generalized shift rule is to decompose the effective generators and apply the original twoterm parameter-shift rule to the constituents (see App. 3 for details).In [45], the authors proposed the so-called stochastic parameter-shift rule for multivariate gates, which is based on the Monte Carlo approximation of an operator identity.
In Fig. 3 we consider a toy example using a random Hamiltonian on a single qubit and compare the exact derivative of an SU(2) gate with our generalized parameter-shift method (Algorithm 1), the stochastic parameter-shift rule and the central finite difference derivative with shifts ± δ 2 .In particular, we consider the gate U (θ) = exp(iaX + ibY ) with θ = (a, b) and compute the partial derivative with respect to a over the range a ∈ [0, π] for three fixed values of b on a state vector simulator (without shot noise).For the finite difference recipe we use δ = 0.75, which we found to be a reasonable choice for a shot budget of 100 shots per cost function evaluation (see App. 2).We observe that the generalized SU(N ) derivative reproduces the exact value while the finite difference derivative is slightly biased.This is to be expected because the latter is an approximate method.While decreasing the shift size δ reduces the deterministic approximation error, it leads to larger overall estimation errors in shotbased computations like on quantum computers (see App. 2 and e.g., [46]).Finally, the stochastic parameter-shift rule yields an unbiased estimator for the exact derivative but has a finite variance, which we estimated using 100 samples (see App. 1).We stress that this variance is a property of the differentiation method itself and not due to sampling on the quantum computer.All methods require two unique circuits per derivative but the stochastic shift rule needs additional circuits in order to suppress the variance.We provide the code for all our numerical experiments at [47].Our generalized shift rule (dotted) reproduces the exact value (solid), whereas the central finite difference (dashed) is biased and the stochastic shift rule (solid, shaded) comes with a finite statistical error even without shot noise from the quantum measurements.Since we look at a single-qubit operation, Ω a (θ) has a single spectral gap, so we require two shifted circuits to calculate the gradient entry (see App. D for details).The finite difference and the stochastic shift rule require two circuits as well, but additional executions are need for the latter to reduce the shown single-sample error.
In addition, we compare the three methods in the presence of shot noise in Fig. 4. We show the means and single-shot errors estimated with 1000 shots, which we split over 100 samples for the stochastic shift rule.We observe that the generalized SU(N ) shift rule systematically performs best.It is not only unbiased but also has the smallest variance.Note that for smaller parameters b, the SU(N ) shift rule and the stochastic shift rule show very similar variances.This is because U (θ) approaches the gate R X (a) = exp(iaX), which can be differentiated with the original parameter-shift rule, and both rules in-deed reduce to the two-term shift rule for R X .3 but for finitely many shots on quantum hardware.We show the single-shot error for each method, estimated with 1000 shots, which varies with the gate parameters as noted e.g., in [9].Our generalized SU(N ) shift rule systematically outperforms the other methods.For small b, the SU(N ) and the stochastic shift rule approach the singleparameter shift rule and hence behave similarly.The finite difference shift δ = 0.75 is chosen such that the bias and variance are traded off reasonably for 100 shots (see App. 1 and e.g., [46]).For other shot numbers, δ needs to be optimized anew, whereas the parameter-shift rules are known to perform optimally at fixed shifts.
Finally, we note that Eq. ( 11) is closely related to the Riemannian gradient on SU(N ) [48,49].However, instead of a gradient flow on a Lie group, we have defined a flow on the Lie algebra su(N ), which we retract back to the manifold via the exponential map.This subtle difference induces a different flow from the SU(N ) one, as we illustrate in App. C. Algorithm 1: SU(N ) gradients.Input: U (x), ρ, H, θ Obtain the Jacobian function:

Comparison with decomposed unitaries
Previous parameterizations of SU(N ) unitaries consist of products of single-qubit gates and CNOTs [14,15,16,17,18,19].We refer to this parameterization as decomposed SU(N ) gates.On the other hand, Eq. ( 2) describes a general SU(N ) unitary by exponentiating a parameterization of the Lie algebra su(N ).Here, we investigate the effects of this alternative parameterization.

Gate speed limit
First, we investigate a speed limit in terms of the gate time.We slightly modify the definition of Eq. ( 2) for a unitary evolution of the system, where The normalization of Ā(θ) is equivalent to the normalization of θ in Euclidean norm, see Lemma 3 in App.F. The normalization of the Hamiltonian (or, equivalently, θ) means that the total path length covered by the evolution is directly proportional to the evolution time t, since we are effectively setting the speed of the evolution to 1.
The Lie group SU(N ) can be turned into a Riemannian manifold by equipping it with the Hilbert-Schmidt inner product g(x, y) = Tr x † y .The unitary evolution U (θ; t), parameterized by t, is a one-parameter subgroup that gives the geodesic [48,Theorem III.6] from the identity element at time t = 0. Geodesics can be defined as generalizations of straight lines in Euclidean geometry.Using Lemma 4 (App.F), the length of the path after time t is constant for time-independent normalized Hamiltonians with |θ| = 1, In general, there is more than one geodesic between two points on the manifold.For example, two points on the Bloch sphere can be connected by rotations about the same axis moving in opposite directions.Using Lemma 5 (App.F), one of these geodesics must be the curve of the minimal path length.Hence, the minimum time to generate the evolution U (θ; t g ) is t g along the geodesic of the minimal path.For an initial state ρ and final state ρ f , the Fubini-Study metric is used to find a minimum evolution time giving the Mandelstam-Tamm bound for timeindependent normalized Hamiltonians.
In practice, we may only have access to a restricted family of gates within SU(N ), for example due to hardware limitations, in which case we require a decomposition of a desired gate in SU(N ) into gates from this family.Here we want to compute the additional evolution time required by such a decomposition.The simplest gate decomposition is to break the unitary into two terms, U (θ; t g ) = U (ϕ (2) ; t 2 )U (ϕ (1) ; t 1 ).The parameters ϕ (1) and ϕ (2) are also normalized Hamiltonians, i.e., they have the norm |ϕ (1) The following theorem shows that using a decomposed circuit over an SU(N ) gate gives an additional evolution time, which corresponds to longer circuit run times.
Theorem 1.For unitary gates generated by normalized time-independent Hamiltonians, consider a general circuit decomposition of two gates U (ϕ (2) ; t 2 )U (ϕ (1) ; t 1 ).There exists an equivalent evolution with an SU(N ) gate U (θ; t g ) = U (ϕ (2) ; t 2 )U (ϕ (1) ; t 1 ), with evolution time t g , such that The proof of the theorem is in App.F. As expected, a decomposition into two gates gives a longer total evolution time than is possible with an SU(N ) gate due to the normalizations of ϕ (1) , ϕ (2) , and θ.A decomposition into more gates would generally lead to an even greater evolution time.A corollary of Theorem 1 is that any circuit with multiple non-commuting layers of gates cannot be optimal in total time.

Unbiased cost landscapes
An additional advantage of the SU(N ) gate is that it weighs all optimization directions equally.In contrast, a parameterization of SU(N ) in terms of a product of gates will create a bias in the parameter space.We illustrate this point with the following example.Consider the decomposed where R A (θ) = exp{iθA} and A = X, Y, Z.This is the ZYZ decomposition.Using similar techniques as in App.F, we can rewrite V (θ) to be parameterized in terms of the Lie algebra: where σ = (X, Y, Z) and If we look at the components of ϕ, we see that the different directions in the Lie algebra are stretched or compressed as a result of the particular choice of parameterization.Consider the normalization |θ 1 |+|θ 2 |+|θ 3 | = 1 for the ZYZ decomposition and |θ| = 1 for the SU(N ) gate.With each Hamiltonian term normalized to 1, the prefactor gives the evolution time.These choices of norm give equal total evolution times for the ZYZ decomposition and SU(2) gate, T ZYZ = T SU(N ) = √ 2, irrespective of the specific parameters chosen.In Fig. 5, we graphically illustrate the Lie algebra deformation by showing the ϕ surface for both the ZYZ decomposition and SU(2) gate.Note that we have not considered any cost function here; the bias occurs at the level of the parameterization of a specific unitary.
The effect of this bias is demonstrated in Fig. 6 for the simplest case of a single-qubit system with ������ ZYZ decomposition SU(2) (unbiased) an SU(2) gate.The optimal parameters of the circuit are those that produce the state that gives the minimum of the cost function C(θ) = −⟨Y ⟩ (green star).We consider various initial parameters acting on the reference state ρ = |0⟩⟨0|.The corresponding training paths are shown for each initial parameter vector.The training paths for the decomposed ZYZ circuit are depicted in Fig. 6(a).As the initial parameter θ 0 acting on the reference state ρ (purple dots) moves closer to an unstable equilibrium point (orange diamond) the training path becomes increasingly suboptimal.At the unstable equilibrium the only gradient information is directly away from the instability rather than providing information about the direction towards the global minimum.This behavior is further illustrated by the gradient vector field on the Bloch sphere in Fig. 6(c).For the SU(N ) gate, we see in Fig. 6(b) that the optimization trajectories follow a direct path to the minimum.to ρ.Note that for this choice of initial parameters, U (θ 0 ) = V (θ 0 ).The objective function is C(θ) = −⟨Y ⟩, giving the target final state at the green star-the state that gives the global minimum of C(θ).The unstable equilibrium points are given by orange diamonds, at (0, 0, 1) and (0, 0, −1), and the black point is at the maximum of the cost function, (0, 1, 0).(c) shows the gradient vector field of the decomposed ZYZ ansatz.The vector field for the SU(2) gate, shown in (d), coincides with the geodesic flow towards the target final state at all points which satisfies the gate speed limit.

Numerical experiment
To investigate the effect on the performance of a typical optimization, we study how an SU(N ) gate compares with a decomposed gate in a larger circuit.In Fig. 7 we provide a non-trivial example, where we incorporate our gates into a circuit and show that it performs better than a decomposed SU(4) gate on a set of random problem instances.We show the individual optimization trajectories in Fig. 8 which illustrate the faster optimization of SU(N ) gates compared to decomposed gates.Like for the examples in Fig. 3 and Fig. 4, we assume that there is no gate or measurement noise.Additionally, we assume that we can always implement the gate generated by Ω l (θ), and have control over all Pauli operators G m .In practice, we typically only have access to a fixed set of generators span({G m }) < span(su(N )).If this is the case, then we require a decomposition of exp{tΩ l (θ)} in terms of the available operations on the device [14,19].All numerical results were obtained with PennyLane [50], and the SU(N ) gates can be accessed via the qml.SpecialUnitary class.Although we do not explore this here, one could make use of sparse optimization methods such as stochastic optimization [51,52] and frugal optimization [53] for the GPSR subroutine in our algorithm.

Resource estimation
To obtain the partial derivative in Eq. ( 11) in practice we need to estimate the gradient of a circuit that contains a gate generated by Ω l (θ).
As noted in recent works on GPSR rules [8,7,9], the computational cost of estimating this gradient is related to the spectral gaps of Ω l (θ).In particular, if {λ j } is the set of (possibly degenerate) eigenvalues of Ω l (θ), we define the set of unique spectral gaps as Γ = { λ j − λ j ′ } where j ′ > j.Note that for d distinct eigenvalues, the number of unique spectral gaps R is at most R ≤ d(d − 1)/2.The total number of parametershifted circuits is then 2R for a single partial derivative ∂ θ l C(θ) Depending on the generator Ω l (θ), this complexity can be improved.For instance, in [7], a Cartan decomposition is used to improve the number of circuits required from polynomial to linear or even logarithmic in N .Additionally, in [8], the different costs for first-and secondorder gradients are determined for specific varational quantum algorithms like QAOA [54] and RotoSolve [55,56,57,58].Finally, in [9], the computational cost of a variety of different gates is investigated in detail and the variance across the parameter regime is studied.
Instead of focusing on specific instances of the generator Ω l (θ), we make a more general observation about the computational complexity of parameter-shift gradient rules.In general, Ω l (θ) has full support on su(N ), since the consecutive Hamiltonians and various depths.We consider the bricklayer circuit indicated with ℓ = 2 in the inset, with general two-qubit gates acting on the even and odd qubits in each layer.The decomposed gate is the SU(4) parameterization of [16], which is optimal in the number of CNOTs required.For each instance, we sample a Hamiltonian from the Gaussian unitary ensemble and minimize the cost in Eq. ( 9) via standard gradient descent.We show the difference of the relative errors in energy Ē = (E − E min )/(E max − E min ) between the decomposed gates and the SU(N ) gates, that is ∆ Ē = ĒSU(N) − ĒDecomp.The plotted lines are the mean Ē, averaged over 50 random Hamiltonians for each circuit depth ℓ.We see that for all depths ∆ Ē < 0 at all points during the optimization, hence the brick-layer circuit with the SU(N ) gates outperforms the circuit where the two-qubit interactions are parametrized as a gate composition.
applications of ad A(θ) in Eq. (3) typically generate all of su(N ) [59].However, for specific choices of A(θ), the application of ad A(θ) p to ∂ θ l A(θ) closes to form a subalgebra, called the dynamical Lie algebra of A(θ), that is contained in su(N ).These algebras are well-known in the context of quantum optimal control [60,61], and have recently been studied in the context of variational quantum algorithms [62,63].We define the dynamical Lie algebra (DLA) L(A(θ)) as the subalgebra formed under the closure of the non-zero terms in A(θ) under the commutator.Ignoring global phases, this will always result in a subalgebra L(A(θ)) ⊆ su(N ).For example, given

SU(N )
Figure 8: Trajectories from the optimizations in Fig. 7 for 50 random 10-qubit Hamiltonians sampled from the Gaussian unitary ensemble and an ℓ = 5 brick-layer circuit of 2-qubit building blocks.We compare the relative error energy (see Fig. 7 for the definition of Ē) when using a standard gate composition to that when using SU(4) gates as building blocks.The optimization is performed with vanilla gradient descent using a learning rate of η = 10 −3 .The SU(4) gate consistently leads to faster optimization and better approximations of the ground state energy throughout all 10 5 optimization steps.
equals the full Lie algebra su(2).Explicit constructions of DLAs that span so(N ) and sp(N ) are given in [64].In a more recent work, the DLAs of several typical quantum many-body Hamiltonians are studied and their properties are used to prepare efficient time-evolution circuits [65].In one dimension, the DLAs that are generated by Pauli strings have recently been classified [66].Interestingly, if the DLA is maximal, i.e., there exists no smaller non-trivial subalgebra within L(A(θ)), then the roots of the Lie algebra can be related directly to the computational cost of estimating the gradients in Eq. (11).We formally establish this connection with the following theorem: Theorem 2. The number of unique spectral gaps R of Ω l (θ) is upper bounded by the number of roots |Φ| of any maximal semi-simple DLA, We provide the proof of Theorem 2 in App.G.We make use of the fact that any semisimple Lie algebra can be written as a direct sum of its weight spaces, which can be identified with its root system [67].The number of roots |Φ| can then be used to bound the total number of unique spectral gaps of Ω l (θ).We can thus use Theorem 2 to assess the run time of Algorithm 1.We give several examples of SU(N ) gates in App.G together with the corresponding values of R. Depending on the physical system or hardware that we are working with, we have to choose a representation for su(N ), which is a map su(N ) → gl(N, C).In Eq. ( 1) we chose this representation to be the tensor product of the fundamental representation, i.e., Pauli monomials.Note however, that Eq. ( 11) and Theorem 2 hold for any irreducible representation of su(N ).
Additionally, by connecting the spectral gaps to the root system of the DLA, we can make use of a beautiful result in representation theory: the classification of all maximal subalgebras of the classical Lie algebras [68].Each root system can be uniquely identified with a particular subalgebra of a Lie algebra and it can be shown that there exist a finite number of root systems.Since a DLA is a subalgebra of su(N ), we can identify all possible DLAs and by extension all possible families of SU(N ) gates.We provide examples of this procedure in App.G.

Conclusion
We have proposed an alternative parameterization of general SU(N ) gates and a method of optimizing these gates in prototypical variational quantum algorithms.We have shown that our gates are more powerful in toy example settings, and motivated why we believe this is the case based on quantum speed-limit arguments.A natural extension of our work would be to test our method in experimental settings, both on gatebased quantum computers or quantum simulators [69,70,71].With regards to the latter, several methods have been investigated that could provide pulse-level optimization of energy cost functions [72,73].This would obviate the need for a gate-based model of quantum computing to prepare specific states on quantum hardware.Instead, we work on the Hamiltonian level and control the system directly.Our algorithm could be applied to this setting as well, since we're effectively learning the parameters of some fixed Hamiltonian.
We have shown that the SU(N ) gate in a circuit outperforms a decomposed gate.The number of parameters in our proposed gate equals 4 N qubits , hence SU(N ) gates acting on a large number of qubits will be impractical.Additionally, it is not clear for which problems one would rather have a deeper circuit with simple gates as opposed to a shallow circuit with more powerful gates.This also begs another question: will our gates suffer from barren plateaus [74]?It is likely that a circuit of 2-qubit SU(N ) gates that has linear depth in N will lead to a circuit that forms an approximate 2-design, which will suffer from vanishing gradients.However, appropriate choices of the generators A(θ) of our gate could keep the circuit in a polynomially scaling DLA of the entire circuit, which can avoid barren plateaus [62,63].Additionally, we can consider parameter initialization strategies that can improve the optimization [75,76].Finally, we believe that the connections between variational quantum circuits and representation theory merit further investigation.We connected the classification of all SU(N ) gates with the classification of semisimple Lie algebras.However, this could possibly be extended to a classification of all variational quantum circuits based on the DLA of an ansatz.It seems that the tools to provide such a classification are available and could provide one with a method to assess the trainability and expressivity of variational circuits without explicitly referring to specific ansätze.

Acknowledgements
We want to thank Los Alamos National Lab for their hospitality during the Quantum Computing Summer School where the initial stages of this project took place.RW  which maps an element of the Lie algebra to another element of the Lie algebra.This map is called the adjoint representation Ad h : g → g and takes X → hXh −1 .Given X, Y ∈ g, we compute The operator ad : g × g → g is the Lie bracket on g, which for our purposes will be the standard commutator.It now follows that With the boundary condition Ad e tX | t=0 = Id we find that the above differential equation is solved by at t = 1.Consider now the following parameterized matrix function where X(t) is a curve on g.We then find Using that Y (0, t) = 0, we find by integration Estimating the above integral forms the basis of the stochastic parameter-shift rule (see App. 1).Continuing, Hence we see that which gives Eq. ( 3).Note that at this point the Baker-Campbell-Hausdorff formula can be derived with Eq. (B20) by considering the derivative of e Z(t) = e tX e tY , (B21) and subsequent integration of the derivative of Z(t) [77].

C Connection between Riemannian and standard gradient flow on SU(N )
Here, we highlight the connection between a Riemannian gradient flow on SU(N ) with respect and Eq. ( 11) [48,49].We take the Hilbert-Schmidt inner product ⟨A, B⟩ = Tr A † B as a Riemannian metric on SU(N ) and write the cost as The differential of the cost dC(U ) : T U SU(N ) → R, evaluated at the point U Ω, where Ω ∈ su(N ) and U ∈ SU(N ), is then via the chain rule and using that ⟨A, B⟩ = ⟨U A, U B⟩.The compatibility condition for the Riemannian gradient tells us that Hence we can identify with the corresponding gradient flow Next, we follow the results of [78] in our notation.Consider the cost function where X ∈ su(N ).Note that although the minimum of the function is unchanged, the parameterization of a unitary via the Lie algebra changes the resulting gradient flow.To see this, we consider again the differential, where now d(C • exp X) : su(N ) → R and A ∈ su(N ).We can now make use of the result in Eq. (B20), Hence the gradient on su(N ) is We therefore see that only when Φ X = I do we obtain the Riemannian gradient on SU(N ); hence only if X ∈ g where g is abelian.The optimization path followed by optimizing the parameters of an SU(N ) gate is thus different from the one following a Riemannian gradient descent on SU(N ).

D The generalized parameter-shift rule
When using our SU(N ) gate in an application that involves gradient-based optimization, like demonstrated in the numerical experiments in this work, we require calculating the partial derivatives in Eq. ( 11).Here we provide the details for how this can be achieved in practice via the generalized parameter-shift rule (GPSR) [8,7,9].Without loss of generality, we rewrite the cost function in Eq. ( 9) as where we absorbed the rest of the circuit into ρ, H and fixed any other parameters in the circuit.
Computing the derivative of Eq. (D1) with respect to t is equivalent to the problem of finding the gradient in Eq. ( 11) at t = 0.For the numerical experiments in this paper we make use of the particular implementation of the GPSR in [9] as well as the alternative method outlined in App. 3. The skew-Hermitian operator Ω in Eq. (D1) has (possibly degenerate) eigenvalues {iλ j }.We define the set of unique spectral gaps as Γ = { λ j − λ j ′ } where j ′ > j.Note that for d distinct eigenvalues, the number of unique spectral gaps R is bounded via R ≤ d(d − 1)/2.We relabel every unique spectral gap with an integer, i.e. we write ∆ r ∈ Γ, and define the corresponding vector ∆ = (∆ 1 , . . ., ∆ R ).We pick a set of parameter shifts that are equidistant and create a vector of R shifts δ = (δ 1 , . . ., δ R ) where Next, we create the length R cost vector c and the We then calculate the coefficient vector as which finally gives the gradient Since the final gradient is exact, finite shot estimates of all c(δ n )'s will produce an unbiased estimate of dC(x)/dt, where we pulled out ∆ and (M(δ)) −1 since they are constant.The difficulty of obtaining an accurate estimate of the gradient is determined by the variance of this estimator, which is given by where we used ⊙2 to emphasize that the squares are taken elementwise.We assume that the estimates for each shifted circuit obey normal statistics and so since these are independent, we can write . .
where σ 2 (±δ n ) is the variance of the cost for each shifted circuit.If we assume that the dependence of σ on the shifts is mild, i.e., σ(δ) ≈ σ 0 then the total variance will only depend on the prefactor.Setting the estimate E c ⊙2 − E [c] ⊙2 = (σ 2 0 , σ 2 0 , . . ., σ 2 0 ) then finally gives One can minimize this quantity with respect to δ to find the optimal set of shifts for the gradient estimation [9].

E Alternative differentiation of SU(N ) gates
In this section we summarize a number of alternative differentiation techniques that may be applied to the presented SU(N ) gates.In particular, we discuss the stochastic parameter-shift rule, which was created for multi-parameter gates, finite differences as a standard tool in numerical differentiation, as well as an alternative to the general parameter-shift rule above which also exploits the notion of effective generators.

The Stochastic parameter-shift rule
The stochastic parameter-shift rule [6] relies on the following operator identity [38] ∂e for any bounded operator Z(x).We now fix all parameters θ m for m ̸ = l and rewrite the cost in Eq. ( 9) as Then, if we take Z(x) to be the operator Z(x) = i(xG l + A ′ ) we can construct the gradient of Eq. (E2) as where Hence, similar to our method, the gradient evaluation requires adding gates to the circuit and evaluating the new circuit.However, the stochastic parameter-shift rule comes at a significant cost: the evaluation of the integral in Eq. (E3).In practice, one approximates this integral by sampling values of s uniformly in the interval (0, 1) and then calculating the costs C ± (x, s) with a finite-shot estimate.Although this produces an unbiased estimator, we find that the variance of this estimator is larger than ours, see Fig. 4. In addition, this method leads to a bigger number of unique circuits to compute the derivative, increasing the compilation overhead for both hardware and simulator implementations.

Finite differences
Finite differences are widely used to differentiate functions numerically.We briefly discuss this method in the context of variational quantum computation (VQC) and refer the reader to recent works comparing and optimizing differentiation techniques for VQC [79,46].
In particular, we consider the central difference recipe where δ is a freely chosen shift parameter and e j is the jth canonical basis vector.This recipe is an approximation of ∂ θ j C(θ), making the corresponding estimator on a shot-based quantum computer biased.This bias, which depends on δ, has to be traded off against the variance of the estimator, which grows approximately with δ −2 .In classical computations, the numerical precision cutoff plays the role of the variance.Due to the high precision in classical computers, this leads to optimal shifts δ ≪ 1, which allows treating the bias to leading order in δ and thus enables rough estimates of the optimal δ * in advance.On a quantum computer, however, the variance typically is more than ten orders of magnitude larger, leading to a very different δ * , which furthermore depends on the function and derivative values.As a consequence, shifts of O(1) become a reasonable choice, highlighting the similarity of the central difference recipe to the two-term parameter-shift rule [79].
As a demonstration of the above, and in preparation for the numerical experiments shown in Figs. 3  and 4, we compute the central difference gradient for a random single-qubit Hamiltonian, a single SU(2) gate U (θ) = exp(iaX + ibY ) and δ ∈ {0.5, 0.75, 1.0}.For this, we evaluate the mean and standard error 50 times and show the difference to the exact derivative in Fig. E1.As expected, we observe that the bias increases with δ and that the variance is suppressed with larger δ.We determine δ = 0.75 to be a reasonable choice for the purpose of the demonstration in Figs. 3 and 4, but stress that for any other circuit, qubit count, Hamiltonian, and even for a different parameter position θ for this circuit, the optimal shift size needs to be determined anew.and 4. The value of the second parameter again is fixed to b = 0.5, 1.0, 2.0 in the panels (from left to right).The shift parameter δ influences the strengths of bias and variance, leading to a trade-off.For smaller δ, the variance is enhanced due to the coefficients in Eq. (E6) that scale with δ −1 .For larger δ, the bias based on the approximate nature of Eq. (E6) is increased.We find δ = 0.75 to be a reasonable choice for this particular circuit, Hamiltonian and parameter position θ.

Decomposing effective generators for differentiation
In Algorithm 1 we suggest to use the generalized parameter-shift rule [7,8,9] in order to compute the partial derivatives ∂ ∂θ l C(θ) independently.In addition, Theorem 2 bounds the number of frequencies occurring in the univariate auxiliary cost function C(t) = Tr U (θ)e tΩ l (θ) ρe −tΩ l (θ) U † (θ)H and the corresponding number of parameter shifts required during differentiation.
Realizing the shift rule requires us to implement not only U (θ)-which is necessary to compute C(θ) itself-but also the gate e trΩ l (θ) for 2R shift values t r and each Ω l separately.Alternatively, we may follow the approach to decompose all effective generators Ω l and compute the derivative as a linear combination of the derivatives for simpler auxiliary gates, similar to [7].In particular, we again choose the Pauli basis of su(N ) for this decomposition.
Decompose the effective generators Ω l (θ) as Note that the coefficients are purely imaginary due to the skew-Hermiticity of Ω l (θ).The partial derivative we are interested in can then be written as Here we abbreviated ω lm (θ) = 2iω lm (θ) and wrote C Gm (θ, t) for the cost function with a rotation gate with parameter −t/2 about G m inserted before U (θ).This modified cost function can be differentiated with respect to t using the original two-term parameter-shift rule, as the inserted gate is generated by (the multiple of) a Pauli string.The above linear combination of Pauli rotation derivatives can be reused for all partial derivatives, so that the full gradient for one SU(N ) gate is given by (E13) So far we did not discuss the number of Pauli strings occurring in the decomposition of the generators Ω l .As can be seen from Eq. ( 6) and our definition of the DLA in Section 5, this number is bounded by the size of the DLA, and we again remark that this bound will be saturated for most values of θ.As two shifts are required for each Pauli rotation, the gradient ∇C(θ) can thus be computed using 2 dim L(A(θ)) circuits, using Pauli rotations from the DLA and, e.g., shift angles ± π 2 .As we only required a linear decomposition of Ω l , any other basis for the DLA may be used as well, potentially allowing for fewer shifted circuits or different inserted gates that may be more convenient to realize on hardware.

F Gate speed limit
The following Lemmas are used in Section 4.1.for time-independent Hamiltonians.Therefore for all Hamiltonians with fixed norm Tr H 2 , the path distance travelled after time τ is the same regardless of the specific unitary evolution.
Lemma 5.The minimal path between the identity element I and a point V on the Riemannian manifold of SU(N ), with metric g(x, y) = Tr x † y , is a geodesic curve γ(t) = e Xt with X ∈ su(N ).
Proof.From Proposition 3.10 [81], the geodesics of SU(N ) are the one-parameter subgroups, given by γ(t) = e Xt .In general, multiple geodesics curves can give V from the identity.The minimal path is the curve with the minimum length.Since geodesic curves are the extrema of the path length functional [82, Lemma 9.3], there must exist a geodesic, which is of the form γ(t) = e Xt , that is the minimal path to V .

Proof of Theorem 1
We restate Theorem 1 again for convenience.
Proof.The product U (ϕ (2) ; t 2 )U (ϕ (1) ; t 1 ) corresponds to a specific point V on the manifold SU(N ).By Lemma 5, there exists a geodesic between the identity I and V , given by the curve e Xt that is of minimal length.We can parameterize this geodesic as U (ϕ (2) ; t g ) = exp Ā(θ)t g , which is always possible since A(θ) parameterizes an arbitrary point in su(N ) and is an SU(N ) gate.By Lemma 4, the length of this path only depends on the norm of Ā(θ), which is 1, and on t g , which gives Since this path is minimal, we have with equality if ϕ (1) + ϕ (1) = θ.

Special case of SU(2)
In the following we give the additional time for decomposing an optimal SU(2) gate into two gates.
By optimal, we refer to the geodesic along the minimal path length curve -see Lemma 5. We consider the optimal SU(2) gate U (θ; t g ) with geodesic evolution time t g together with a decomposition U (ϕ (2) ; t 2 )U (ϕ (1) ; t 1 ) = U (θ; t g ).The decomposed circuit is given by two unitary evolutions.Each individual evolution U ( φ(ν) ; t ν ) is a U(1) rotation such that only a single basis element is required.With two rotations, the overall evolution is an element of SU(2).The corresponding su(2) algebra is spanned by three basis elements-the three Pauli matrices for example.The two rotations can be represented as lying on a Bloch sphere.A unitary transformation, K ∈ SU(N ), can therefore be applied to the evolution such that U ( φ(ν) ; t where ν = 1, 2, with α ν and β ν parameterizing the rotations.By construction φ(ν) • φ(ν) = 1 is normalised for all parameters α ν and β ν .The same transformation K defines U ( θ; t g ) = KU (θ; t g )K † and gives the same relationship U ( θ; t g ) = U ( φ(2) ; t 2 )U ( φ(1) ; t 1 ).This is straightfoward to show We have where we choose G 1 = iX, G 2 = iY , and G 3 = iZ.These basis elements of su(2) generate the group SU(2).We also define the basis vector G = (G 1 , G 2 , G 3 ).Exponentiation therefore gives the closed-form expression The additional evolution time is ∆t = t d − t g , with t d = t 1 + t 2 the total decomposed unitary evolution time.Due to the invariance of the scalar product, φ(1) • φ(2) = ϕ (1) •ϕ (2) , the additional time ∆t = t d −t g required by the decomposition is then given by ∆t = t d − arccos cos(t 1 ) cos(t 2 ) − ϕ (1) • ϕ (2) sin(t 1 ) sin(t 2 ) ≥ 0.
G Unique spectral gaps of Dynamical Lie Algebras In the following we set g to be a semisimple Lie algebra.A subspace a ⊆ g is called a subalgebra if it is closed under the Lie bracket, i.e., if [a 1 , a 2 ] ∈ a, ∀a 1 , a 2 ∈ a.Since g is a semisimple Lie algebra, it always contains a subalgebra called a Cartan subalgebra [29] (Chapter 7, Definition 7.10).Definition 1.A Cartan subalgebra h of g is a subalgebra that satisfies the following conditions: 1.For all h 1 , h 2 ∈ h, [h 1 , h 2 ] = 0.
2. For all x ∈ g, if [h, x] = 0 for all h ∈ h, then x ∈ h.
The first condition tells us that h is a commutative subalgebra of g, while the second condition says that h is maximal, i.e., there is no larger commutative subalgebra.The first step in proving Theorem 2 is to make use of the following result: and α ∈ h * are functionals on h.That is, a root space is a subspace of g on which the action of the adjoint representation of h is described by a functional (and scalar multiplication).
The above decomposition is called a root space decomposition, which is an essential tool in classifications of Lie algebras [68,67].Since g 0 = {x ∈ g|ad h (x) = 0, ∀h ∈ h}, (G4) we find that h = g 0 and hence We then immediately see that We can thus relate the dimensionality of a Lie algebra to the dimensionality of its Cartan subalgebra and its weight spaces.The second step of the proof relies on identifying the unique spectral gaps of Ω l (θ) with the weight spaces g α .To achieve this, we will construct the linear operator ad h and apply it to the eigenbasis of Ω l (θ) to show that the maps α can be identified with the spectral gaps of Ω l (θ).
Consider an element Ω ∈ g, where g ⊆ su(N ) is a non-trivial subalgebra and h is a Cartan subalgebra of g.Since h is the Lie algebra of a maximally abelian group, we can represent elements of h by diagonal matrices.Since Ω is skew-Hermitian, there exists a unitary V ∈ SU(N ) that diagonalizes Ω, i.e., V † ΩV = h with h ∈ h.Here, V is the matrix with columns equal to the eigenvectors v k of Ω with corresponding eigenvalues λ k .We can thus always choose a basis for g such that Ω is diagonal.If Ω is non-degenerate, then it must be full rank, and therefore an element of h.All Cartan algebras are equivalent up to conjugacy, hence we can choose the matrix h to be the diagonal matrix containing the eigenvalues of Ω to represent the Cartan subalgebra h.We now take E nm to be the matrix with entries (n, m) equal to 1 and all other entries to 0. Define the operator e nm = V † E nm V, (G7) and apply ad h to it: ad h (e nm ) = he nm − e nm h (G8) This means that ad h has the eigenvectors e nm with corresponding eigenvalues α nm (h) = λ n − λ m [83], and so we have identified the eigenvalue differences with the roots of the Lie algebra.We define the set of all roots as We now set g = L(A(θ)).If we take the absolute value of the elements of Φ, we can identify R = |Φ|/2, where the factor 1/2 is to account for double the counting of the spectral gaps.Since Ω can be degenerate in general, we obtain the inequality R ≤ |Φ|/2.With this, the proof of Theorem 2 is completed.

Examples
Here, we give several examples of maximal DLAs and their corresponding value of |Φ|/2.Analogous to the main text, we choose the Pauli representation but these results should hold for any irreducible representation of su(N ).
1. su(2): For a 1-qubit system, there are no non-trivial subalgebras, hence we can only look at the full special unitary Lie algebra su(2).Any A(θ) that consists of two Pauli operators will generate this algebra, e.g., will give L(A(θ)) = su (2).A Cartan subalgebra of su( 2) is given by h = span ({Z}).We therefore find that dim g = 3 and dim h = 1 and so |Φ| = 2. Hence we have R ≤ 1 and need 2R ≤ 2 shifts.This matches the result in [5], where the parameter-shift rule was generalized from single Pauli matrices to Hermitian operators with two unique eigenvalues.

Figure 2 :
Figure2: The partial derivative with respect to the gate parameter θ l can be obtained by adding a gate to the circuit that is generated by Ω l (θ).Calculating the derivative with respect to t and evaluating at t = 0 then provides one with the correct gradient.

Figure 3 :
Figure 3: Gradients of C(θ) for a single SU(2) gate and a random single-qubit Hamiltonian, in the limit of infinitely many shots on quantum hardware.We take A(θ) = iaX + ibY where θ = (a, b) and consider the fixed values b = 0.5, 1.0, 2.0 together with a ∈ [0, π].Our generalized shift rule (dotted) reproduces the exact value (solid), whereas the central finite difference (dashed) is biased and the stochastic shift rule (solid, shaded) comes with a finite statistical error even without shot noise from the quantum measurements.Since we look at a single-qubit operation, Ω a (θ) has a single spectral gap, so we require two shifted circuits to calculate the gradient entry (see App. D for details).The finite difference and the stochastic shift rule require two circuits as well, but additional executions are need for the latter to reduce the shown single-sample error.

Figure 4 :
Figure4: Gradients of C(θ) as in Fig.3but for finitely many shots on quantum hardware.We show the single-shot error for each method, estimated with 1000 shots, which varies with the gate parameters as noted e.g., in[9].Our generalized SU(N ) shift rule systematically outperforms the other methods.For small b, the SU(N ) and the stochastic shift rule approach the singleparameter shift rule and hence behave similarly.The finite difference shift δ = 0.75 is chosen such that the bias and variance are traded off reasonably for 100 shots (see App. 1 and e.g.,[46]).For other shot numbers, δ needs to be optimized anew, whereas the parameter-shift rules are known to perform optimally at fixed shifts.

Figure 7 :
Figure7: Comparison of decomposed gates versus SU(N ) gates in brick-layer circuits for random 10-qubit Hamiltonians and various depths.We consider the bricklayer circuit indicated with ℓ = 2 in the inset, with general two-qubit gates acting on the even and odd qubits in each layer.The decomposed gate is the SU(4) parameterization of[16], which is optimal in the number of CNOTs required.For each instance, we sample a Hamiltonian from the Gaussian unitary ensemble and minimize the cost in Eq. (9) via standard gradient descent.We show the difference of the relative errors in energy Ē = (E − E min )/(E max − E min ) between the decomposed gates and the SU(N ) gates, that is ∆ Ē = ĒSU(N) − ĒDecomp.The plotted lines are the mean Ē, averaged over 50 random Hamiltonians for each circuit depth ℓ.We see that for all depths ∆ Ē < 0 at all points during the optimization, hence the brick-layer circuit with the SU(N ) gates outperforms the circuit where the two-qubit interactions are parametrized as a gate composition.
iZ and successive commutators generate no new contributions.Note that for this example the DLA

Figure E1 :
FigureE1: Error of the central difference gradients with δ = 0.5, 0.75, 1.0 for the single-qubit example from Figs.3 and 4. The value of the second parameter again is fixed to b = 0.5, 1.0, 2.0 in the panels (from left to right).The shift parameter δ influences the strengths of bias and variance, leading to a trade-off.For smaller δ, the variance is enhanced due to the coefficients in Eq. (E6) that scale with δ −1 .For larger δ, the bias based on the approximate nature of Eq. (E6) is increased.We find δ = 0.75 to be a reasonable choice for this particular circuit, Hamiltonian and parameter position θ.

Lemma 3 .
For Hamiltonians of the form H = m θ m G m , where G m are strings of log 2 N Pauli operators, Tr H 2 = N m θ 2 m .Proof.All Pauli strings G m are orthonormal with respect to the trace inner product, Tr G † m G n = δ n,m N .Using this gives Tr H 2 = m,n
[80]unitary evolution of U (θ; t), parameterized by t, corresponds to a smooth curve on SU(N ) with length according to the Riemannian metric, g(x, y) = Tr x † y[80].Integrating over the metric norm through the tangent spaces from t = 0 to final time τ gives the path length, ) Lemma 4. The length of a smooth curve on the Riemannian manifold SU(N ), with metric g(x, y) = Tr x † y , for a time-independent Hamiltonian H after a fixed time τ only depends on the norm of H and τ .Proof.
We restate Theorem 2 here for convenience.The number of unique spectral gaps R of Ω l (θ) is upper bounded by the number of roots |Φ| of any maximal semi-simple DLA, 1 Proof of Theorem 2