Hamiltonian variational ansatz without barren plateaus

Variational quantum algorithms, which combine highly expressive parameterized quantum circuits (PQCs) and optimization techniques in machine learning, are one of the most promising applications of a near-term quantum computer. Despite their huge potential, the utility of variational quantum algorithms beyond tens of qubits is still questioned. One of the central problems is the trainability of PQCs. The cost function landscape of a randomly initialized PQC is often too flat, asking for an exponential amount of quantum resources to find a solution. This problem, dubbed barren plateaus, has gained lots of attention recently, but a general solution is still not available. In this paper, we solve this problem for the Hamiltonian variational ansatz (HVA), which is widely studied for solving quantum many-body problems. After showing that a circuit described by a time-evolution operator generated by a local Hamiltonian does not have exponentially small gradients, we derive parameter conditions for which the HVA is well approximated by such an operator. Based on this result, we propose an initialization scheme for the variational quantum algorithms and a parameter-constrained ansatz free from barren plateaus.


Introduction
Recent experimental progress in controlling quantum systems has demonstrated quantum advantages in sampling tasks [1,2,3], and nearterm quantum computers with hundreds of noisy qubits are emerging [4].Variational quantum algorithms (VQAs) are one of the most promising applications of these near-term quantum computers.By combining highly expressive parameterized quantum circuits (PQCs) and wellestablished parameter optimization techniques from machine learning (ML), VQAs are relevant for many important problems, including combinatorial optimizations [5], finding the ground state of a many-body Hamiltonian [6,7,8,9], and learning probability distributions [10,11,12,13] (see Ref. [14] for a recent review).
VQAs solve a problem by optimizing a cost function typically defined by the expectation value of a target-problem specific observable.However, this optimization task can be challenging since the cost function landscapes are often too flat [15,16].This phenomenon, dubbed barren plateaus, is characterized by the fact that all gradient components are exponentially small with the number of qubits when parameters are randomly sampled.Given that barren plateaus are expected to be prevalent for sufficiently expressive ansätze [17], trainability of PQCs beyond tens of qubits is still an open question.
The issue of vanishing gradients is not entirely new, though.Classical neural networks also suffered a similar vanishing gradient problem, but theoretical and numerical advances have shown that clever neural network architectures [18,19] or better initialization methods [20,21] can sufficiently suppress the problem.Likewise, recent studies explored quantum circuit ansätze without barren plateaus [22,23,24,25], as well as initialization techniques that provide large gradients [26,27,28,29,30].Still, it is unclear how useful barren-plateau-free ansätze are for solving complex problems.Also, proposed initialization methods mostly rely on heuristics and do not provide strong arguments for why such parameters should yield a large gradient.
In this paper, we resolve these issues for the Hamiltonian variational ansatz (HVA) by proposing a novel parameter initialization technique.The HVA [7,9] is widely studied for solving the ground state of a many-body Hamiltonian since it can encode adiabatic evolution.However, the HVA is still subject to the barren plateau problem [31,32].Even though several  initialization methods based on pre-training have been proposed to overcome this problem [29,30], those methods not only require additional classical or quantum resources but rely on heuristics developed based on numerical results for less than 20 qubits.In contrast, our initialization scheme simply adds a constraint to the parameters and is free from additional computational resources.Moreover, we provide a rigorous argument for why this scheme yields large gradients, supported by extensive numerical results up to 28 qubits.We further propose an ansatz that imposes the constraint throughout the optimization process.Such an ansatz is expressive enough for variational time evolution [33,34,35,36,37], with the benefit that the ansatz is free from barren plateaus.
The remainder of the paper is organized as follows.After briefly introducing the problem and related concepts in Sec. 2, we show that the gradient does not decay exponentially when a circuit is described by local Hamiltonian evolution in Sec. 3. In Sec. 4, we find a parameter condition for which the HVA approximates to local Hamiltonian dynamics.We thus prove that a parameter regime for constant gradient magnitudes exists.We then introduce an initialization method based on our proof and numerically compare it to other known parameter initialization techniques in Sec. 5. We summarize our results with concluding remarks in Sec. 6.

Preliminaries
We consider a PQC for N qubits with l total layers, given by where θ θ θ = (θ 1 , • • • , θ l ) is a vector of all parameters and U n (θ) = e −iθGn is a unitary gate generated by G n .In VQAs, a cost function is typically given by where O is a Hermitian operator.The cost function is then optimized with gradient-based methods.Direct computation of the gradient yields where and ρ 0 is the initial state of the circuit.For classes of PQCs, which form a 1-design, the gradient is unbiased for a given parameter set (i.e., E θ θ θ [∂ n C] = 0).In this case, one can use the variance to quantify the magnitudes of gradients, which is given by In the typical barren plateau scenario [15,16], this quantity becomes close to O(1/D 2 )1 , which is the value evaluated under the assumption that U R or U L is a unitary 2-design.Here, D is the total dimension of the Hilbert space, which is 2 N for a system with N qubits.Hence, the variance decays exponentially with the number of qubits, which implies that the gradient is exponentially small for most values of the parameters (can be rigorously proven by Chebyshev's inequality).Even though it is possible to optimize the cost function using a small gradient in principle, running the algorithm in real quantum hardware is extremely inefficient as estimating the gradient requires an exponential number of shots.
Next, we introduce the HVA.The HVA [7,9] is a natural ansatz for solving quantum many-body Hamiltonians.After decomposing a given Hamiltonian into q terms H = q j=1 c j H (j) , where {c j } are real coefficients, the HVA is constructed as where |ψ 0 ⟩ is a quantum state that can be easily prepared.The ansatz consists of p blocks, each containing q layers.Thus, the ansatz has a total of l = pq layers.We also use the notation θ a and U a to denote θ i,j and U a = e −iH (j) θ i,j where a = (i, j), which enables us to interpret the HVA as a PQC given by Eq. (3).
Throughout the paper, we restrict H (j) to be a k-local Hamiltonian in a given lattice for a constant k, i.e., each term in the Hamiltonian acts on at most k geometrically nearby sites in a given lattice.This condition is satisfied for most of the many-particle spin-1/2 Hamiltonians.For example, we decompose the onedimensional transverse-field Ising model This ansatz is powerful for solving the ground state of H, as it can encode the adiabatic evolution of the Hamiltonian [9].Despite the usefulness of the ansatz, however, training the HVA turned out to be non-trivial.Some numerical studies have observed that the gradients decay exponentially with the system size [31,32], although the magnitudes of gradients of the HVA are larger than one expects from a unitary 2design [31].
In this paper, we consider the case where the HVA is well approximated by time evolution under a local Hamiltonian, i.e., there are local Hamiltonians H L , H R such that U L ≈ e −iH L t L and U R ≈ e −iH R t R for some t L , t R ≥ 0 (see Fig. 1).With this assumption, we provide strong analytic and numerical arguments that the gradient magnitudes, Eq. (4), only decay at most polynomially in the number of qubits.Although this assumption seems unrealistic, we later show that the HVA with a certain parameter restriction can satisfy this condition.

Magnitudes of gradients in Hamiltonian dynamics
In this section, we study the scaling of the gradient when the circuit is given by the time evolution under a time-independent local Hamiltonian.We show that gradients in such circuits do not decay exponentially in both extreme regimes: short-and long-time evolution.
For short-time evolution, we prove a rigorous bound on the time that the gradient preserves its initial magnitudes.Thus, a circuit have large gradients for a proper initial state.On the other hand, for long-time evolution, we combine the universality of quantum thermalization [38,39,40] and our numerical results to argue that the gradient does not decay exponentially.

Gradient scaling for short-time evolution
In this subsection, assuming that (1) a circuit with N qubits is given by e −iHt for a local Hamiltonian H and (2) the initial state has a large gradient with a value of Θ(1), we prove that there exists t c = Θ(1/N ) such that the circuit maintains the large gradient when the total evolution time is less than t c .Our main result is the following proposition: Proposition 1 (Quantum speed limit of gradients).For the HVA, the gradient of the cost function is given by where Then We further have ) , O]∥ ≤ 2KC, (10) and where Integrating both sides, we have We obtain the desired inequality as Let us assume that all of H (m) , H L , and H R are k-local Hamiltonians, where each term acts at most k nearby sites in a given lattice for a constant k, and O is a local operator acting on at most a constant number of sites.Under this assumption, which is the case we consider in this paper, we have (1).We prove this fact in the rest of the subsection.
For any k-local Hamiltonian H, we can write where Λ = {1, • • • , N } is the collection of all sites, and h i is an operator supported by k sites centered at i. Formally, we write the support of h i (a set of sites h i acts on) as where dist(i, j) is the distance between two sites in the given lattice.We thus have for a local operator O. Here, is a constant for a given lattice, where dist(i, O) = min j∈supp(O) dist(i, j) and supp(O) ⊂ Λ is the support of O. Given that ∥h i ∥, ∥O∥ are bounded by a constant for a spin system, and s is a constant for a finite-dimensional lattice, we have for any k-local Hamiltonain H.We also note that physical Hamiltonians must have ∥H∥ = Θ(N ), which is a necessary condition to be thermodynamically well-defined.Therefore, we obtain Moreover, we have K = Θ(N ) and C = Θ(1) for Proposition 1, when H R and H L are also klocal, which implies t c = Θ(1/N ).Therefore, the gradient component is Θ(1) for all t ≤ t c .
While we mostly consider geometrically local Hamiltonians in this paper (i.e., local in a finite-dimensional lattice), our result can be extended to Hamiltonians defined on a general (hyper)graph.Such a complex Hamiltonian appears when a fermionic Hamiltonian is translated to a spin Hamiltonian using, e.g., the Jordan-Wigner transformation (see, e.g., Ref. [7]).These Hamiltonians can have smaller t c because C = max{∥[H (m) , O]∥, ∥[H L , O]∥} in Proposition 1 may scale linearly with N .For example, consider a Hamiltonian H, each term of which acts on all sites.Namely, we have H given as where each h i is a Pauli string acting non-trivially on all sites, i.e., h i ∈ {X, Y, Z} ⊗N .We still have ∥H∥ = Θ(N ) for this Hamiltonian.However, for a local operator, O, acting on site i, we can have ∥[O, H]∥ = Θ(N ).Thus, applying Proposition 1 to this Hamiltonian yields t c = Θ(1/N 2 ) instead of Θ(1/N ).

Gradient scaling for long-time evolution
Next, we consider long-time evolution.Following usual arguments for equilibration [41,40,42,43,44], we assume that a Hamiltonian H follows the non-degenerate energy-gap condition: , where E i is the i-th eigenvalue of H with the corresponding eigenvector |E i ⟩.With an additional assumption that the Hamiltonian thermalizes [38,39,40], in the sense that the observable after equilibration gives a similar value to the thermal average, it is known that the second moment of the Hamiltonian evolution behaves differently from a unitary 2-design [45].Precisely, Huang et al. [45] considered the saturated value of the out-of-time correlator (OTOC), a widely used measure for detecting quantum chaos.For local Hermitian operators, O i and O j acting on sites i and j, respectively, the OTOC is defined by Here, ρ 0 is the initial state, and U is a unitary operator determining the time evolution of the system.One often considers the infinite temperature initial state given by ρ 0 = 1 /2 N for a system with N qubits.When our unitary operator U forms a 2-design, we obtain H [45].Such a huge difference mainly comes from the fact that U = e −iHt conserves the energy, i.e., [H, U ] = 0.
Similarly, one might expect that the variance of gradients saturates to a value that scales only inverse polynomially with N for local Hamiltonian evolution.To see that this is the case, we compute the square of the first element of the gradient (for θ 1,1 ) when U L = e −iH L t and the initial state is given as a pure state ρ 0 = |ψ 0 ⟩ ⟨ψ 0 |.Then we obtain a lower bound of (∂ θ 1,1 C) 2 = − ⟨ψ 0 |[H (1) where A proof can be found in Appendix B. Unfortunately, we cannot use techniques to analytically compute the OTOC for a maximally mixed initial state ρ 0 = 1 /2 N [45], as we here consider a pure initial state.Instead, we provide numerical evidence that F H does not decay exponentially for translationally invariant local Hamiltonians with and without the time-reversal symmetry.We especially consider time-reversal symmetric Hamiltonians as they are an important subclass of Hamiltonians widely considered for thermalization [47].
After creating a random k-local Hamiltonian H (see Appendix C for numerical details), we diagonalize H and compute F H using the obtained eigenstates for the initial state |ψ⟩ = |+⟩ ⊗N , G = i Y i , and O = i Z i .As all observables, the Hamiltonian, and the initial state are translationally-invariant, we can compute F H within the translationally-invariant subspace.
We plot the result in Fig. 2 up to N = 16 for random Hamiltonians without (main figure) and with (inset) the time-reversal symmetry.For k = 2, we observe that the lower bound F H does not decay at all in both cases.For k = 3, 4, F H decreases with N for the Hamiltonians without the time-reversal symmetry.Even though it is not conclusive to tell the exact decaying rate of F H from this plot, we strongly believe that it is not exponential from the universality of thermalization dynamics for non-integrable models [38,39,40], i.e., if F H for Hamiltonians with the time-reversal symmetry does not decay exponentially, F H for any thermalizing Hamiltonians also does not decay exponentially.
We also recall Ref. [32], which conjectured that the variance of gradients only scales inverse polynomially with the dimension of the dynamical Lie algebra G spanned by the gate generators, i.e., G = ⟨iG 1 , • • • , iG l ⟩ Lie where ⟨•⟩ Lie is the Lie closure containing all nested commutators of the listed elements.A reason behind using dynamical Lie algebra is that the algebra generates any arbitrary circuit that the given PQC can express, i.e., there is a g ∈ G such that U (θ θ θ) = e g for any θ θ θ.However, for random Hamiltonians, as in our case, it is more natural to use a vector space of the Hamiltonians (which also generates a unitary operation but not a Lie algebra) instead.As the dimension of k-local random Hamiltonians is Θ(N ), the conjecture with a slightly relaxed condition would also indicate that the gradient does not decay exponentially.
We explicitly write down our version of the conjecture, which is also supported by our numerical results, as follows: Conjecture 1.Let V be a vector space of local Hamiltonians.Then for any given initial state |ψ 0 ⟩, G ∈ V , and a local operator O, we have where dν is a proper measure for the vector space.
4 Approximating HVA to local Hamiltonian evolution In the previous section, we argued that a circuit given by the time evolution under a local Hamiltonian does not have barren plateaus.In this section, we find a parameter condition for which the HVA given by Eq. ( 5) is well approximated by local Hamiltonian evolution.We first interpret the HVA as a unitary operator generated by a time-dependent Hamiltonian.Then, we utilize the Floquet-Magnus (FM) expansion to obtain an effective time-independent Hamiltonian that describes the HVA within a small error.We here consider each Hamiltonian H (j) satisfying the following conditions: (C1) H (j) is a sum of commuting Pauli strings, (C2) H (j) is k-local (each term acts on at most k nearby qubits in a given lattice), and (C3) each Pauli string of H (j)  uniquely supports a subset X ⊂ Λ (e.g., H (j) cannot have terms X 1 X 2 and Z 1 Z 2 simultaneously).In other words, we consider H (j) defined as where the summation is over all subsets of sites X ⊆ Λ whose length is ≤ k, and X is a single Pauli string (if there is a term whose support is X) or 0 (otherwise), and [h any non-nearby sites (i.e., if there are a, b ∈ X such that the distance between a and b is larger than k).
We also define parameters for the FM expan-sion.First, we define where we obtained the last equality using the fact that {h (j) X } are commuting Pauli strings.Thus, H max is the maximum number of terms in H (j) .We also introduce a parameter J that upper bounds the local interaction strength As {h X } are Pauli strings, we can use From the locality of Hamiltonians {H (j) }, we have H max = Θ(N ) (see discussion below Proposition 1), and J is upper bounded by the number of vertices whose L 1 distance to the origin is ≤ k, which is a constant for a finite-dimensional lattice.
We further assume that all parameters {θ i,j } in the HVA are larger than 0. Under this setup, the following Proposition shows that the subcircuits U R and U L of the HVA can be approximated by the time evolution under a few-body Hamiltonian when the sum of parameters is small.Proposition 3.For the HVA composed of H (j) , given in Eq. 5, we consider a subcircuit U R = e −iH (j−1) θ i,j−1 • • • e −iH (1) θ 1,1 .We additionally assume that all H (j) satisfy the conditions (C1-C3) defined above.Then, there is a Hamiltonian H We derive the bound and properties of H in Appendix D. Our derivation is based on the truncated FM expansion rigorously proven in Ref. [48].
The above bound tells us that U R,L can be approximated by local Hamiltonian evolution when t R,L are small.For example, when t R = O(1/N ) and for a constant n, the first term in the bound is exponentially small in N and the second term is O(1/N n+1 ).Then, one may further employ Proposition 1 to get a large gradient, which we summarize as the following theorem: Theorem 1.For the HVA [Eq.(5)] and for a local observable O acting on at most O(1) sites, assume that there exists an initial state The proof can be found in Appendix E. We provide two remarks on Theorem 1. First, if there exists any constraint with τ0 such that a gradient component is bounded below by a constant for all ij θ ij ≤ τ0 , then τ0 ≤ π/4.This is because one can easily find the HVA with suitable ρ 0 and O which satisfies ) , O]] = 0 for i,j θ i,j = π/4.We provide such an example in Appendix F. This implies that Theorem 1 can be improved at most τ 0 = Θ(1) in our current set-up.
Second, the theorem implies that for any probability distribution p(θ θ θ) defined for θ i,j ≥ 0 and i,j θ i,j ≤ τ 0 , which is much stronger than the non-exponential decay of gradients.If one considers a different condition, e.g., a polynomially decaying gradients under uniformly generated initial parameters, a parameter constraint may be further relaxed.Namely, we open up the possibility that a constraint with τ 1 = Ω(1) exists such that is satisfied for all θ i,j ≥ 0 and i,j θ i,j ≤ τ 1 .We numerically investigate related scenarios in the following section.

Numerical comparison between initialization methods
Theorem 1 tells us that there is τ 0 = Θ(1/N ) such that the HVA does not have barren plateaus when the sum of all parameters is less than τ 0 .Still, the exact value of τ 0 for the Theorem is difficult to obtain or can be unrealistically small.Thus, in this subsection, we introduce an initialization method based on Theorem 1 and compare it to the small constant initialization considered in Refs.[49,50,51,52].
We numerically test the following three different initialization methods.(1) Random: complete uniformly random initialization, such that all parameters are from U [0,2π] , (2) Constrained: the sum of parameters in each layer is constrained to be T = c/N with a constant c, i.e., j θ i,j = T for all i, and (3) Small: θ i,j ∼ U [0,ϵ] for a small ϵ independent to N [49,50,51].For method (2), we show that a relatively large value of c = π/2 already gives Θ(1) gradient magnitudes.On the other hand, we observe two different scaling behaviors for the constant small initialization [method (3)].There is a value N 0 depending on p and ϵ such that the gradient magnitudes decay exponentially for N < N 0 whereas they only decay polynomially for N > N 0 .This observation suggests that there can be another parameter regime in the HVA that does not have barren plateaus.
We use the HVA for the one-dimensional (1D) and two-dimensional (2D) spin-1/2 Heisenberg-XYZ models where  from the HVA for the 1D Heisenberg-XYZ model with different depths p (see main text for details).Plots show results from the constrained (solid), small (dashed), and completely random (dotted) parameter initializations.We compute (∂ i,j C) 2 for 2 10 parameters samples from each distribution and plot the averaged results over all samples and gradient components (i, j).(b) The same plot as (a) but for the 2D Heisenberg-XYZ model with the lattice size L x × L y .We also see that the results fluctuate for odd and even L y because the 2D Néel state (our initial state) violates the symmetry of the Hamiltonian for odd L y .We also compute the relative standard deviation, r = σ(X)/E[X], where X := ij (∂ i,j C) 2 /(3p) is the squared partial derivatives averaged over all parameters.Here, the standard deviation and the expectation value are taken over the circuit instances.The values of r for the 1D model with p = 64 and N = 24 are given by 0.20, 0.53, and 0.13 for the contained, small, and random initialization, respectively.For the 2D model with p = 64 and L x × L y = 4 × 7, we obtained r ≈ 0.39 (constrained), 1.02 (small), 0.11 (random), respectively.

Scaling of gradients
We compute gradients of the cost function with from different initialization methods.For constrained initialization, random values θi,j are first sampled from the uniform distribution, i.e., θi,j ∼ U [0,2π] , and then parameters are assigned by normalizing them: θ i,j = θi,j × T /( 3 j=1 θi,j ) for all i, j.This method ensures that 3 j=1 θ i,j = T .We here use T = π/(2N ).The results are compared to small parameters θ i,j ∼ U [0,ϵ] with ϵ = 0.2 as well as complete random parameters θ i,j ∼ U [0,2π] .For each set of system parameters (size and depth p) and the initialization method, we compute all gradient components and plot the averaged squared magnitudes (i.e., we averaged The results for the 1D and 2D models are shown in Fig. 3(a) and (b), respectively.The 1D model clearly shows that the magnitudes of gradients do not decay with N , i.e., (∂ i,j C) 2 ≈ Θ(1), for the constrained initialization, which is consistent with Theorem 1.On the other hand, the gradient magnitudes decay exponentially with N when the complete random initialization is used with p ∈ [16,32].However, we could observe an interesting behavior when parameters are initialized to be small (θ i,j ∼ U [0,ϵ] ).In this case, the gradient decays exponentially up to some N 0 , i.e., for N ≤ N 0 ≈ 18, but it decays slower after that.One can already see this signature even for the complete random initialization when p = 16 and N = 24, where the averaged gradient magnitudes are larger than that from p ∈ [32,64].We also see similar behaviors for the 2D model, besides the results from each initialization oscillate for odd and even L y , since our initial state (2D Néel state) is not fully symmetric for odd L y .
As non-exponential decaying gradients from the small constant initialization have not been clearly reported in previous studies, we explore these phenomena more closely here.We compute the averaged squared gradients using the HVA for the 1D XYZ model with p = 16 when the circuit parameters are sampled from U [0,ϵ] for different values of N and ϵ.We plot the result as a function of ϵ in Fig. 4. The results show that the magnitudes of gradients saturate to an exponentially small value as ϵ increases, but the point it saturates, ϵ 0 (N ), also increases with N . .We observe that there is a value ϵ 0 (N ) such that the gradient decays exponentially with N only for ϵ > ϵ 0 (N ) (colored region).Inset shows the same data but as a function of ϵ/ log(N ).We see the gradient magnitudes saturate after the dashed vertical line, which suggests that ϵ 0 (N ) ∝ log N .
For ϵ < ϵ 0 (N ), we observe that the gradient does not decay exponentially.This fact also confirms that there can be another parameter regime beyond the one we mainly considered in this paper, which is also free from barren plateaus.
To see how ϵ 0 (N ) scales with N , we plot the same data but as a function of ϵ/ log(N ) (inset).The plot shows that the gradient magnitudes saturate when ϵ/ log(N ) is larger than a constant, which suggests that ϵ 0 (N ) ∝ log N .In general, we expect that there is a relation between Υ := i,j θ i,j (which is ∝ pϵ in this case) in the HVA for a 2-local Hamiltonian and a random local circuit with depth ∝ Υ.Such a connection explains the observed behavior as a 1D random local circuit requires its depth larger than Θ(log(N )) to show exponential decay of gradient magnitudes when the cost function is given by the expectation value of a local observable [16].
We still note that it is less clear whether a small constant parameter initialization [method (3)] gives the same quantitative behavior for the HVAs with more complex Hamiltonians (e.g., 1D k-local with k ≥ 3 or defined in a highdimensional lattice).In contrast, we expect to have Θ(1) gradient magnitudes regardless of the dimension with our initialization method [method (2)].As a detailed investigation of the relation between the HVA and a local random circuit is out of the scope of the current work, we leave it to future work.

Full simulation of variation quantum eigensolver
We now explore whether our initialization improves learning procedures by fully simulating a variational quantum eigensolver (VQE) using the Heisenberg model (J x = J y = J z = 1).We define the Hamiltonian expectation values as the cost function (C = ⟨ψ(θ θ θ)|H|ψ(θ θ θ)⟩) and train the circuit using the Adam optimizer [53].
We first simulate the VQEs using exact gradients.Quantum hardware cannot compute exact gradients, as each gradient component should be estimated from the measurement outcomes from shots.However, classical quantum circuit simulators support multiple algorithms to obtain exact gradients.For our simulation, we use the adjoint method [54] implemented in PennyLane [55].We present learning curves from different parameter initialization methods for the one-dimensional lattices (with learning rate α = 0.025 and the default values for hyperparameters β 1 = 0.9 and β 2 = 0.999) in Fig. 5 (a), which shows that our initialization scheme outperforms other initialization schemes.The completely random parameter initialization fails to find the ground state.This is an expected behavior from the presence of the barren plateaus.On the other hand, initializing all parameters to π (θ i,j = π for all i, j), which is used in Ref. [31], works but is subject to large initial fluctuations.Generally speaking, such an initialization without randomness is prone to local minima [56].We also found a similar behavior for the 2D Heisenberg model, shown in Fig. 5(b).
We next compare results from different initialization methods when a finite number of shots is used.We solve the 1D Heisenberg model, but gradients are now estimated using n shot shots.
The gradient with a finite number of shots can be obtained as follows.First, we introduce another PQC with the same shape as the HVA, Eq. (33), but all gates have different parameters.The circuit ansatz with p = 24 and the Adam optimizer with learning rate α = 0.005 are used.Results for random and constrained initializations are averaged over 16 different initial parameters.For optimization, we compute the gradient exactly (without shot noise).Shaded regions are barely visible for the random initialization.This is because the loss function, ⟨H⟩, is not trained at all for most instances.Thus, the loss function preserves its initial value ⟨H⟩ ≈ 0, which is from the fact that the circuit forms a 2-design for the random initialization.Such a PQC is written as Compared to Eq. (33), all gates now have independent parameters.Next, we obtain each gradient component using the two-term parametershift rule [57,58].By defining f (α α α, β β β, γ γ γ) = ⟨ψ(α α α, β β β, γ γ γ)|H|ψ(α α α, β β β, γ γ γ)⟩, we obtain its gradient for α i,a,b as follows: where δ δ δ i,a,b is a vector components of which is 1, if the index is (i, a, b), or 0, otherwise.Gradient for β and γ also can be obtained similarly.Shot noise is introduced when we estimate f (α α α, β β β, γ γ γ).(36) we can estimate f = ⟨ψ(α α α, β β β, γ γ γ)|H|ψ(α α α, β β β, γ γ γ)⟩ using the samples of ψ(α α α, β β β, γ γ γ) in the X, Y , and Z bases.For example, let us estimate ⟨X a X b ⟩ using n shot samples.We first obtain bitstrings {x (1) , • • • , x (n shot ) } from the probability distribution p(x) = | ⟨x|H ⊗N |ψ(α α α, β β β, γ γ γ)⟩ | 2 , where each x (i) ∈ {0, 1} N is a bitstring with length N .Then, each ⟨X a X b ⟩ can be estimated using these samples by a is from the fact that X a has the value Note that we can use the same set of samples, {x (1) , • • • , x (n shot ) }, for all ⟨X a X b ⟩.Thus, f = ⟨H⟩ for each set of parameters is estimated using 3n shot samples, and each gradient component is estimated using 6n shot samples.
Finally, the gradient of the original HVA, Eq. (33), which shares the parameters between the gates, can be obtained by summing over the gradient components.Namely, we have Given that the circuit has 3N p gates in total, and each gradient component is estimated from 6n shot samples, 18N pn shot samples are used for each iteration to estimate all gradient components.For the HVA with N = p = 16 [see Eq. ( 33)], we plot the converged energies after 10 3 iterations from the VQE simulations as a function of n shot ∈ [2 7 , 2 9 , 2 11 , 2 13 , 2 15 , 2 17 ] in Fig. 6.While the results show that the best-converged energies from our constrained initialization scheme and θ ij = π are similar, the deviations between instances from our initialization scheme are much smaller.For example, the worst-performing instance for n shot = 2 15 gives ⟨H⟩ − E GS ≈ 3 × 10 −2 when all parameters are initialized with π, but that from our initialization is ≈ 6 × 10 −3 .
Our parameter constraint can also be imposed throughout the optimization steps (the ansatz itself).When imposed on the ansatz, we can slightly change the cost function to ensure the parameters always follow the constraints.This can be simply done by assigning θ j,q = T − q−1 i=1 θ j,i and replacing the cost function where 1(x) is the Heaviside step function.One can see this (piecewise differentiable) cost function (1) restricts all parameters to be larger than 0, and (2) the sum of the parameters in each block is given by T throughout the training.
Still, the constraint-imposed ansatz may not be useful for solving complex problems, as we require pT to be small.Even though the ansatz itself allows a large-depth circuit (e.g., p = Θ(N 2 ) and T = 1/N3 ), the Suzuki-Trotter decomposition [59] tells us that such a circuit always can be approximated by a short-depth circuit, i.e., there is another circuit with depth d = poly(pT ) = poly(1/N ) that can express our constrained HVA with a small error.Using the notion of quantum circuit complexity [60,61,62,63], defined by the minimum number of two-qubit gates in any circuit that implements the given unitary, we can say that this ansatz has a small approximate circuit complexity.

Long-time evolution with repeated parameters
To overcome the problem that a simple parameter-constrained ansatz introduced in the previous subsection is not expressive enough, we propose another ansatz with better expressivity.Our solution is to repeat the circuit multiple times instead of adding free parameters.In this case, the circuit is given as with the constraint j θ i,j = T .Thus, the circuit has a total of pqr layers but only has pq parameters.This ansatz can be approximated by e −iK(prT ) for a local Hamiltonian K with an error O(r(pT ) n+2 ) when pT is inverse polynomial with N (i.e., pT = O(N −γ ) for γ > 0).This fact follows from Proposition 3 and ∥U r 1 − U r 2 ∥ ≤ r∥U 1 −U 2 ∥ which holds for arbitrary U 1 and U 2 2 .We then further expect that the gradients scale polynomially with N from Conjecture 1 when prT is sufficiently large enough to equilibrate the system 3 .We also numerically test the gradient scaling of this ansatz for r = N 2 /4, p = 16, and T = π/(2N ) using the HVA for the onedimensional XYZ model in Fig. 7.The plot shows that the gradient does not decay when the parameters are constrained, whereas it decays exponentially otherwise.
We now argue that the repeated ansatz of Eq. (41) (1) can generate sufficiently complex unitary operators and ( 2) is useful for variational time evolution [33,34,35,36].The complexity of the circuit directly follows from the observation that the circuit approximates to e −iK(prT ) for a Hamiltonian K with a large prT .It is commonly believed that simulating the long-time evolution of a general local Hamiltonian requires a large depth circuit (which is also formally conjectured in Refs.[64,65] in terms of quantum circuit complexity).Next, the given ansatz can express the time evolution of a given Hamiltonian H = p j=1 α j H j .Using the first-order Suzuki-Trotter decomposition, we can write with t 0 = t/r and an error O(N rt 2 0 ).Thus the approximation has an error Θ(1/N ).As the right-hand side is nothing but Eq. ( 41) with T = j α j t 0 , the ansatz with T = Θ(1/N ) can approximate e −iHt .

Conclusion
We studied the scaling behaviors of the gradients in the hamiltonian variational ansatz (HVA) and showed that adding a simple parameter constraint to the ansatz results in large gradients.We demonstrated that the gradient magnitudes scale as Θ(1) when the circuit is given by shorttime evolution and 1/poly(N ) when it is given by long-time evolution.For the short-time regime, we provided a rigorous proof based on the rate of the gradient evolution, while we showed numerical evidence based on quantum thermalization [38,39,40,45] for long-time evolution.We then found the parameter constraints for which the HVA can be approximated by short-time as well as long-time evolution under a local Hamiltonian.We further supported our arguments with extensive numerics for up to 28 qubits, which also consistently showed the correctness of our arguments.
For long-time evolution, our argument is based on the fact that the dynamics generated by thermalizing Hamiltonians are more restricted than unitary 2-designs.Albeit typical Hamiltonians thermalize [44], there are two other important classes of Hamiltonians with different dynamic properties: integrable and many-body localized systems.In contrast to thermalizing systems where information of initial states spreads out through the Hilbert space (but within a subspace preserving the energy), initial information on integrable and many-body localized systems can be easily accessed by simple operators at any time, i.e., their dynamics are even more restrictive than thermalizing Hamiltonians.Given this interesting property, we expect that there could be a different parameter condition that parameterized quantum circuits approximate to integrable or many-body localized systems, which are also free from barren plateaus.
For example, it is known that the out-of-time correlator of many-body localized systems does not decay exponentially with the system size for particular choices of observables and initial states (see, e.g., Refs.[66,67,68]).Following the arguments in Sec.3.2, we hope to find a class of parameterized quantum circuits that approximate to a many-body localized system and have large gradients.On the other hand, Ref. [32] showed that the dynamic Lie algebra G generated by the HVA for the XXZ model (J x = J y ) can have a small dimension, i.e., the Lie algebra ⟨i has a small dimension.As the XXZ model is solvable by the Bethe ansatz (thus integrable), we believe that a fundamental connection exists between the low-dimensional dynamical Lie algebra, the integrability of the system, and large gradients.Such a connection might be studied in future work.
This paper did not consider incoherent noise, which prevails in noisy quantum devices.This type of noise is known to be another source of barren plateaus [69].When the circuit is short enough, we believe that our initialization schemes can help compensate for the vanishing gradients from the incoherent noises.Still, how the strength of the noise, the circuit depth, and the initialization schemes interplay in general is another big question that requires a separate study.Since this is important for practical applications of variational quantum algorithms, further research on the effect of incoherent noises is necessary. where We assume that the Hamiltonian H satisfies the non-degenerate energy-gap condition, i.e., Under this condition, averaging Eq. (B.3) over time yields where As ⟨ψ|[G, Õ]|ψ⟩ is purely imaginary, we obtain the last inequality.The inequality in the main text is then obtained by changing the summation indices.We also note that the eigenstate thermalization hypothesis [38,39,40] suggests that the last term, ⟨ψ|[G, Õ]|ψ⟩ 2 , is exponentially small in N .

C Generating random Hamiltonians
In the main text, we numerically observed the scaling behaviors of gradient magnitudes using random Hamiltonians.Here, we describe detailed steps to generate such random k-local Hamiltonians.For a given k, we first create a set of terms where a k ∈ {0, 1, 2, 3} for all k.We then remove terms duplicated under translation.For example, as X ⊗ I ⊗ I, I ⊗ X ⊗ I, and I ⊗ I ⊗ X generate the same terms under translation, we only keep one of them.We then construct a random Hamiltonian H = s∈S c s N n=1 T n s, where the coefficients c s are samples from the normal distribution N (0, 1) and T is the translation operator (Tσ i a = σ i+1 a ).We also generate random time-reversal symmetric Hamiltonians (H * = H) using the same method but removing purely imaginary operators (that contain odd numbers of Pauli-Y s) from S.

D Approximation of the HVA from the truncated Floquet-Magnus expansion
In this appendix, we prove Proposition 3 using the truncated Floquet-Magnus (FM) expansion.The FM expansion [73] provides a time-independent effective Hamiltonian for a unitary evolution from a time-dependent Hamiltonian.While this expansion diverges for a general many-body Hamiltonian, recent works [48,74] have shown that we can still use the expansion after truncating high-order terms.

D.1 Truncated Floquet-Magnus expansion
Let us introduce the truncated FM expansion following the notation in Ref. [48].We consider a system defined on a lattice with N spins where each spin is labeled by i = 1 to N .The set of all spins is denoted by Λ = {1, • • • , N }.We consider a time-dependent Hamiltonian H(t) defined for 0 ≤ t ≤ τ .We decompose the Hamiltonian into H(t) = H 0 + V (t) where H 0 is the time-independent part and V (t) is the remaining time-dependent part.Both parts have at most k-body interactions (we do not impose geometric locality yet).We then write the Hamiltonian terms as where X is all possible subsets of Λ and |X| is the number of elements in the set.We introduce a parameter J that upper bounds local interaction strength and some additional parameters for the expansion: Under this setting, we are interested in the Floquet Hamiltonian H F defined as where T [•] is the time-ordering operator.One can expand the Floquet Hamiltonian as H F = ∞ n=0 τ n Ω n where the terms {Ω n } ∞ n=0 are given by the FM expansion as follows: where S n+1 is the permutation group on n + 1 letters, ω(σ) = n i=1 1[σ(i + 1) − σ(i)], and 1(x) is the Heaviside step function.In our setup, where the Hamiltonian terms have at most k-body interactions, Ω n has at most (n + 1)k-body interactions.We further have the following upper bound for Ω n (Lemma 1 in Ref. [48]): One can see that the convergence condition ∥Ω n+1 ∥τ n+1 < ∥Ω n ∥τ n only holds up to n ≈ (λτ ) −1 .Indeed, the FM expansion diverges for a many-body Hamiltonian unless τ also scales with N .Even though this fact suggests that the FM expansion might not be useful, it turned out that one can still use a truncated series H However, as n 0 increases with N , if τ ∼ N −α for a positive α, the interaction range of H (n 0 ) F increases with N .As we want strict locality in our Hamiltonian (it must be k ′ -local for a constant k ′ ), a bound for H (n) F for a fixed n should be useful, which is provided by the following Corollary.

D.2 Approximating the HVA
We now consider the HVA given by where we assume that each H (j) is the sum of commuting Pauli strings (products of Pauli operators) acting on at most k geometrically local sites.For example, N i=1 X i X i+1 from the HVA for the transverse-field Ising model satisfies this (both for the periodic and open boundary conditions) with k = 2.
We now interpret the HVA as a time-dependent Hamiltonian given as H (1) for 0 ≤ t ≤ θ 1,1 H (1) for For convenience, we define Υ n,m := n−1 i=1 q j=1 θ i,j + m j=1 θ n,j which is the cumulative sum of {θ}.For any subcircuit of the HVA U b • • • U a , where a = (i, j) and b = (i ′ , j ′ ) are indices for the layers, we consider H(t) = H 0 + V (t) with H 0 = 0 and V (t) = H(t + Υ a−1 ) [Eq. (D.16)] defined for 0 ≤ t ≤ Υ b −Υ a−1 := τ (where we use Υ a−1 to denote the sum of the parameters before layer a = (i, j) and Parameters for the FM expansion can be obtained by writing V (t) as Following the notation in the main text, we have for V 0 defined in Eq. (D.8) and H max defined in Eq. (26).Under this setup, applying Corollary D.1 to U R and U L defined in the main text yields Proposition 3. Precisely, for a given n ≤ n 0 = ⌊1/(32kJt R,L )⌋, there are (n + 1)k-local Hamiltonians H R and H L such that are satisfied (where we put λ = 2kJ from Eq. (D.8)).
Furthermore, H R and H L share any symmetries that {H (j) } have, which follows from the property of the commutator, i.e., . Thus, for example, if all {H (j) } are translationally invariant, the resulting Hamiltonians H R and H L are also translationally invariant.
Obtaining the norm of each term of H R (H L ) is also possible.For convenience, let K be one of where h (j[t σ(1) ]) X acts on at most k sites.Inserting this expression in Eq. (D.10) gives Locality of k X follows from the fact that the multicommutator [H (in) , [H (i n−1 ) , • • • , [H (1) , O] • • • ]] acts on at most (n + 1)k nearby sites for any operator O acting on at most k local sites, and the Hamiltonians H (1) Finally, we obtain a bound of the norm of k X using the inequality and the following lemma.
Lemma 1 (Consequence of Lemma 3 in Ref. [48]).Let {H (j) } be k-local and X:X∋i ∥h (j) X ∥ ≤ J for all j.Then for an arbitrary operator O supported on k local sites, we have We thus have where we use X ∥ = 1 regardless of j as {h (j) X } are Pauli words.As a consequence, we obtain where H max is the maximum number of terms in H X (defined in the main text).

E Proof of Theorem 1
We here provide a detailed proof showing that there exists τ 0 = Θ(1/N ) such that the HVA with i,j θ i,j = t R + t L ≤ τ 0 has large gradient components ∂ n,m C. Here, we consider the cost function C given by the expectation value of a local observable O acting on at most k O sites and an initial state ρ 0 which gives | Tr{ρ 0 [H (m) , O]}| = Θ(1).
Our proof consists of three steps.First, we show that the error approximating the HVA to local Hamiltonian evolution from the FM expansion is O(1/N 2 ).Next, we derive all factors (∥H R ∥, ∥[H L , O]∥, etc.) in Proposition 1 from the FM expansion.We then complete the proof by combining steps to show that there exists τ 0 = Θ(1/N ) such that |∂ n,m C| is lower bounded by a constant.

E.1 Polynomially decaying bound of the error from the truncated FM expansion
Let us first analyze the error term in Proposition 3 for t R , t L ≤ c/N with n = 1.We note that n = 1 requires n 0 = ⌊1/(32kJt R,L )⌋ ≥ 1, which is satisfied for N ≥ N 0 := 32ckJ.In addition, we assume kJ ≥ 1, which is true in our setting [see Eq. (27)].Then, the error from the truncated FM expansion [the RHS of Eq. (29)] is given by where r is a constant such that H max ≤ rN (which is from H max = O(N )).
We now use the following lemma to find N 1 such that the error is O(1/N 2 ) for N ≥ N 1 .
Proof.We first have which completes the proof.

E.2 Condition of the constant for large gradients
We next find an upper bound of c (for τ 0 = c/N ) from the complete expression of time t c in Proposition 1. From Eq. (D.26) with the first order expansion (n = 1), we have where we obtain the second inequality by combining Eq. 16 and Eq.D.25.Here, l = |{X : [k X , O] ̸ = 0}| ≥ 1 is a constant for a given lattice, which follows from the fact that k X acts on at most 2k nearby sites and O is a local operator.
In addition, we have ) , O]∥ ≤ 2s∥O∥ (E.36) where s = |{X : [h X , O] ̸ = 0}| ≥ 1 is also a constant.We used the fact that each h X is a Pauli string to obtain the second inequality.As t R , t L ≤ τ 0 = c/N (from t R , t L ≥ 0 and t R + t L ≤ τ 0 ), we obtain for N ≥ N 0 ≥ 32ckJ, We still note that the current condition implies large gradients only when U R,L are exact time-evolution operators.As there is an approximation error from the FM expansion, we consider this factor in the following subsection.

E.3 Bounding gradient with the FM truncation error
We introduce the following lemma to see how much an error from unitary approximation affects the gradients.
Let us now apply this lemma to and Ã = [H (m) , e iH L t L Oe −iH L t L ]. Denoting ϵ by the error from the FM expansion, i.e., ∥e −iH R t R − U R ∥ ≤ ϵ and ∥e −iH L t L − U L ∥ ≤ ϵ, we obtain is satisfied for all N ≥ N min if t R + t L ≤ c/N .

F Vanishing gradient after a finite time evolution
In the main text and previous Appendix, we argued that there exists τ 0 = Θ(1/N ) such that the HVA with constraints θ i,j ≥ 0 and i,j θ i,j ≤ τ 0 does not have vanishing gradients if | Tr[ρ 0 [H (m) , O]]| ̸ = 0 for some m.In this subsection, we provide an example whose gradient component vanishes where the sum of parameters is a constant.This implies that if there is τ0 such that the gradient is bounded by a constant when ij θ i,h ≤ τ0 , τ0 must be smaller than this constant.

Figure 1 :
Figure 1: We find a parameter constraint such that layers of Hamiltonian evolution in the HVA (left) approximate to the time evolution under a single local Hamiltonian (right).Using the dynamical properties of local Hamiltonians, we argue that the HVA has large gradients.

4 ×Figure 3 :
Figure3: (a) Scaling of the gradient square (∂ i,j C)2  from the HVA for the 1D Heisenberg-XYZ model with different depths p (see main text for details).Plots show results from the constrained (solid), small (dashed), and completely random (dotted) parameter initializations.We compute (∂ i,j C) 2 for 2 10 parameters samples from each distribution and plot the averaged results over all samples and gradient components (i, j).(b) The same plot as (a) but for the 2D Heisenberg-XYZ model with the lattice size L x × L y .We also see that the results fluctuate for odd and even L y because the 2D Néel state (our initial state) violates the symmetry of the Hamiltonian for odd L y .We also compute the relative standard deviation, r = σ(X)/E[X], where X := ij (∂ i,j C) 2 /(3p) is the squared partial derivatives averaged over all parameters.Here, the standard deviation and the expectation value are taken over the circuit instances.The values of r for the 1D model with p = 64 and N = 24 are given by 0.20, 0.53, and 0.13 for the contained, small, and random initialization, respectively.For the 2D model with p = 64 and L x × L y = 4 × 7, we obtained r ≈ 0.39 (constrained), 1.02 (small), 0.11 (random), respectively.

Figure 4 :
Figure 4: Averaged magnitudes of gradients ⟨(∂ i,j C) 2 ⟩ as a function of ϵ.The HVA for the 1D XYZ model with p = 16 is used.All parameters are samples from the uniform distribution, i.e., θ i,j ∼ U [0,ϵ].We observe that there is a value ϵ 0 (N ) such that the gradient decays exponentially with N only for ϵ > ϵ 0 (N ) (colored region).Inset shows the same data but as a function of ϵ/ log(N ).We see the gradient magnitudes saturate after the dashed vertical line, which suggests that ϵ 0 (N ) ∝ log N .

Figure 5 :
Figure 5: (a) Learning curves from different initialization schemes for the 1D Heisenberg model with N = 20.We use the circuit ansatz with p = 20 and the Adam optimizer with the learning rate α = 0.025.For each iteration, the curves for random and constrained initializations show the averaged results over 32 different initial parameters.The shaded regions indicate one standard deviation ([m − σ/2, m + σ/2]).Note that results from the constant initialization (θ i,j = π) do not vary between instances as there is no randomness in the simulation.The ground state energy E GS is obtained using the exact diagonalization.(b) The same result for the 2D Heisenberg model with L x × L y = 4 × 6 lattice.The circuit ansatz with p = 24 and the Adam optimizer with learning rate α = 0.005 are used.Results for random and constrained initializations are averaged over 16 different initial parameters.For optimization, we compute the gradient exactly (without shot noise).Shaded regions are barely visible for the random initialization.This is because the loss function, ⟨H⟩, is not trained at all for most instances.Thus, the loss function preserves its initial value ⟨H⟩ ≈ 0, which is from the fact that the circuit forms a 2-design for the random initialization.

Figure 7 :
Figure 7: Scaling of the averaged squared gradients(∂ i,j C)2 from the repeated ansatz Eq. (41) with (solid) and without (dotted) a parameter constraint.For the parameter-constrained ansatz, we sample parameters under the constraint θ i,1 + θ i,2 + θ i,3 = π/(2N ).In contrast, θ i,j ∼ U [0,2π] is used for the ansatz without the constraint.The HVA is for the 1D XYZ model with O = Y 0 Y 1 with p = 16 and r = N 2 /4.The results are averaged over 2 10 random parameters.
Converged energies from the VQE for the onedimensional Heisenberg model with N = p = 16 as a function of the number of shots.For each number of shots, n shot ∈ [2 7 , 2 9 , 2 11 , 2 13 , 2 15 , 2 17 ], we fully simulate the VQE 16 times.The converged energies ⟨H⟩ for each independent VQE instance are presented.For the initialization θ ij = π, the shot noise is the only source of the randomness.On the other hand, the initial parameters are also random when the constrained parameter initialization is used.
2 /9.The obtained bounds tells us that the U L and U R in Proposition 3 which appear in ∂ n,m C approximate to local Hamiltonian evolution.Precisely, for U ∥H R,L ∥ ≤ H max 1 +