Barren plateaus in quantum tensor network optimization

We analyze the barren plateau phenomenon in the variational optimization of quantum circuits inspired by matrix product states (qMPS), tree tensor networks (qTTN), and the multiscale entanglement renormalization ansatz (qMERA). We consider as the cost function the expectation value of a Hamiltonian that is a sum of local terms. For randomly chosen variational parameters we show that the variance of the cost function gradient decreases exponentially with the distance of a Hamiltonian term from the canonical centre in the quantum tensor network. Therefore, as a function of qubit count, for qMPS most gradient variances decrease exponentially and for qTTN as well as qMERA they decrease polynomially. We also show that the calculation of these gradients is exponentially more efficient on a classical computer than on a quantum computer.


Introduction
Noisy intermediate-scale quantum (NISQ) devices possess just a small number of imperfect qubits [1] but offer unprecedented computational capabilities. Whilst not powerful enough to run paradigm-shifting quantum algorithms with guaranteed quantum advantage, such as Shor's algorithm [2] or Grover search [3], they can already outperform classical computers [4,5].

Figure 1: Summary of the main results. We consider the qMERA with periodic boundary conditions (all gates shown; the top light green gates connect to the bottom ones), the qTTN (dark red gates) and the qMPS (dark red gates in the shaded area). For most gates in these circuits the gradient variance with respect to randomly chosen parameters decreases exponentially with the distance of the cost function's observable from the canonical centre. As a function of qubit count this distance can grow linearly for qMPS and grows logarithmically for both qTTN and qMERA, so that the corresponding gradient variances decrease exponentially and polynomially, respectively.
Variational quantum algorithms are a promising toolbox to work with NISQ devices and achieve a quantum advantage [6][7][8]. The variational approach is characterized by an iterative feedback loop between a quantum and a classical computer during which a parameterized quantum circuit (PQC) is optimized to solve the problem of interest. On the quantum device, the PQC is applied to some initial state to realize the variational wavefunction on which measurements are performed. The measurement results are subsequently processed on the classical device which, e.g., evaluates the cost function, computes gradients and updates the PQC parameters.
The variational optimization of a PQC, however, is hard [27]. One of the difficulties that can be encountered during the optimization is related to the barren plateau phenomenon [28] which manifests itself by a parameter landscape of the cost function that, in simple terms, is flat everywhere except for narrow gorges surrounding local minima. These flat landscapes pose a problem for the optimization of a PQC as they imply that one needs to run the quantum computer and collect samples many times to accurately determine the gradients of the cost function with respect to the variational parameters. The large sampling cost can rule out any quantum advantage one is aiming at with variational quantum algorithms. The severity of the barren plateau problem depends on the cost function [29] and the PQC architecture [28,30,31]. A plethora of proposals exist to avoid barren plateaus in certain cases [29,[31][32][33][34][35][36][37][38][39][40][41].
In this article we study the trainability of quantum tensor networks using the approach of [31] (see also [42]), which is based on the ZX-calculus [43,44]. Tensor networks have proven to be a powerful variational ansatz for the simulation of quantum many-body systems on classical computers [45][46][47][48][49][50][51]. Quantum tensor networks have become popular recently since they can be realized on current NISQ devices [52][53][54][55][56][57][58][59] and have advantages over their classical counterparts [60][61][62][63]. We focus on PQC architectures inspired by matrix product states [64-67] (qMPS), tree tensor networks [60,68-70] (qTTN) and the multiscale entanglement renormalization ansatz [71,72] (qMERA). An important concept in these tensor networks is the canonical centre, which is the first quantum gate of the circuit. We show that the barren plateau phenomenon is fundamentally connected to the distance between the observable of interest and the canonical centre. Figure 1 summarizes our results.
Our analysis is inspired by [31] and extends their results. For the qMPS ansatz considered in [31] we study the barren plateau problem in more detail. In [31] a discriminative qTTN is analyzed, and here we explore the corresponding generative variant [60], which represents the quantum counterpart to standard classical TTN [60,68-70]. Additionally we investigate a qMERA ansatz not considered in [31]. It is worth noting that [31] studies the quantum convolutional neural network (qCNN) ansatz of [73], which can be viewed as the discriminative variant of the qMERA. In [31] it is shown that the discriminative qTTN and qCNN avoid barren plateaus, but their results are fundamentally different from the ones presented here: in the discriminative variants the distance between the observable and the canonical centre is always equal to the number of qubits, whereas in the generative variants this is not the case in general. We also emphasize that the purpose of this work is not generative quantum machine learning per se but the application of classical tensor network techniques in quantum machine learning.
This article is structured as follows. In Sec. 2 we present the necessary background. Section 3 contains the results. Technical details including the proofs are provided in appendices.

Background
We collect background information on VQE in Sec. 2.1, the barren plateau phenomenon in Sec. 2.2 and the ZX-calculus in Sec. 2.3.

Variational quantum eigensolver
Originally introduced in [9], the variational quantum eigensolver (VQE) consists of a training loop that iterates between a quantum and a classical computer and makes use of the variational principle to solve the minimization

min_θ ⟨H⟩_θ = min_θ ⟨ψ(θ)|H|ψ(θ)⟩

for some Hermitian observable H, e.g. a Hamiltonian. During each training iteration the quantum computer prepares the variational wavefunction |ψ(θ)⟩ = U(θ)|0⟩ via a PQC of the form

U(θ) = U_M(θ_M) · · · U_2(θ_2) U_1(θ_1),

where U_j(θ_j) = exp(−iθ_j V_j/2) W_j, θ_j ∈ [−π, π], V_j² = I and W_j is an unparameterized unitary. The quantum computer is also used to compute cost function gradients via the parameter-shift rule

∂_j⟨H⟩_θ = (⟨H⟩_{θ+(π/2)e_j} − ⟨H⟩_{θ−(π/2)e_j})/2,

where e_j is the j-th unit vector [74,75]. The classical computer subsequently updates the parameters θ and then feeds them back to the quantum machine for the next training iteration. The parameters are updated e.g. using the gradient descent procedure

θ → θ − η ∇_θ⟨H⟩_θ,

where η is the learning rate and ∇_θ⟨H⟩_θ denotes the gradient vector. An alternative gradient-based method that has become popular in the context of variational quantum algorithms is the Adam optimizer [76]. A comprehensive review article on VQE is [8].
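The parameter-shift rule above can be checked on a one-qubit toy model. The sketch below (plain NumPy, assuming the standard gate form exp(−iθV/2) with V² = I from Eq. (2)) compares the shifted-expectation gradient against the analytic derivative and takes one gradient-descent step:

```python
import numpy as np

# Toy VQE: cost(theta) = <0| Rx(theta)^dag Z Rx(theta) |0> = cos(theta).
# For gates exp(-i theta V/2) with V^2 = I the parameter-shift rule reads
#   d<H>/d theta = ( <H>(theta + pi/2) - <H>(theta - pi/2) ) / 2 .
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

def rx(theta):
    # exp(-i theta X / 2)
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X

def cost(theta):
    psi = rx(theta) @ np.array([1.0, 0.0])
    return np.real(psi.conj() @ Z @ psi)

def parameter_shift_grad(theta):
    return 0.5 * (cost(theta + np.pi / 2) - cost(theta - np.pi / 2))

theta = 0.7
exact = -np.sin(theta)              # analytic derivative of cos(theta)
assert abs(parameter_shift_grad(theta) - exact) < 1e-12

# One gradient-descent step: theta <- theta - eta * grad
eta = 0.1
theta_new = theta - eta * parameter_shift_grad(theta)
```

For this gate family the parameter-shift gradient is exact, not a finite-difference approximation, which is why the assertion holds to machine precision.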
In this article we focus on k-local Hamiltonians, i.e. sums of observables which act on at most k qubits. One example of a 2-local Hamiltonian is the transverse-field quantum Ising chain

H = −J Σ_⟨i,j⟩ Z_i Z_j − h Σ_i X_i,

where J and h are Hamiltonian parameters, ⟨i, j⟩ represents adjacent qubits and X (Z) is the Pauli X (Z) matrix. Another example is the Heisenberg model

H = J Σ_⟨i,j⟩ (X_i X_j + Y_i Y_j + Z_i Z_j).
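As an illustration, a minimal NumPy construction of such a 2-local Hamiltonian; the sign convention H = −J Σ Z_i Z_{i+1} − h Σ X_i with open boundaries is a common choice assumed here, not fixed by the text:

```python
import numpy as np
from functools import reduce

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
I2 = np.eye(2)

def op_on(op, site, n):
    """Embed a single-qubit operator on `site` (0-indexed) of an n-qubit register."""
    factors = [op if k == site else I2 for k in range(n)]
    return reduce(np.kron, factors)

def ising(n, J, h):
    """Transverse-field Ising chain H = -J sum Z_i Z_{i+1} - h sum X_i (open boundaries)."""
    H = np.zeros((2**n, 2**n))
    for i in range(n - 1):
        H -= J * op_on(Z, i, n) @ op_on(Z, i + 1, n)
    for i in range(n):
        H -= h * op_on(X, i, n)
    return H

H = ising(3, J=1.0, h=0.5)
assert np.allclose(H, H.T)      # Hermitian (real symmetric here)
# With J = 0 the ground-state energy is exactly -n*h (all spins along +x):
assert np.isclose(np.linalg.eigvalsh(ising(3, 0.0, 0.5)).min(), -1.5)
```

Each term acts on at most two qubits, so this H is 2-local in the sense used above.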

Barren plateaus
The barren plateau phenomenon in the variational optimization of quantum circuits was first discussed in [28] and characterized in the following way: let ⟨H⟩_θ of Eq. (1) be a cost function with an associated parameterized ansatz Eq. (2) acting on N qubits; if the ansatz forms a unitary 2-design, then the gradient vanishes in expectation and its variance vanishes exponentially with N (Theorem 1). In simple terms Theorem 1 tells us that the unitary 2-design condition establishes a cost landscape which is nearly flat everywhere (barren plateaus) except for exponentially small regions around minima (narrow gorges). Using Chebyshev's inequality,

P(|∂_k⟨H⟩_θ| > κ) ≤ Var[∂_k⟨H⟩_θ]/κ²,

we see that for randomly chosen parameters the probability of obtaining a gradient of magnitude |∂_k⟨H⟩_θ| > κ vanishes exponentially with qubit count. The barren plateau phenomenon is a problem for the trainability of PQCs since the computation of exponentially small gradients using standard techniques, such as the parameter-shift rule, requires exponentially many measurements on a quantum computer. Because the computational cost of performing these calculations on a classical computer also scales exponentially with qubit count, a classical approach might be more efficient than a quantum one, in which case there is no quantum advantage. While in [28] it is shown that the onset of the unitary 2-design property is caused by large circuit depth, in [29] the authors show that the form of the cost function also affects the depth at which barren plateaus emerge. More specifically, they show that PQC optimization with local cost functions is efficient for depths that scale logarithmically with qubit count and transitions into the barren plateau regime when depths scale as O(poly(log N)). PQC training based on global cost functions, however, is shown to always be subject to barren plateaus, even for shallow O(1)-depth circuits.
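The Chebyshev argument above can be illustrated numerically. The sketch below models the 2-design scaling Var ∼ 2^(−N) with Gaussian surrogate gradients (a modelling assumption, since Theorem 1 only fixes mean and variance) and verifies that the empirical tail probability respects the bound:

```python
import numpy as np

# Chebyshev: for a zero-mean gradient g, P(|g| >= kappa) <= Var[g] / kappa^2.
# If Var[g] ~ 2^(-N), a gradient of fixed magnitude kappa becomes exponentially
# unlikely as the qubit count N grows.
rng = np.random.default_rng(0)
kappa = 0.1
for N in (4, 8, 12):
    var = 2.0 ** (-N)                      # model the 2-design variance scaling
    g = rng.normal(0.0, np.sqrt(var), size=100_000)
    empirical = np.mean(np.abs(g) >= kappa)
    assert empirical <= var / kappa**2     # Chebyshev bound holds empirically
```

Already at N = 12 essentially no sampled gradient exceeds κ = 0.1, matching the exponentially shrinking bound.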
Focusing on local observables, the analysis in [29] suggests that the onset of barren plateaus is related to the entanglement in the causal cone of the observable. This is analysed in detail in [36], where the authors show that sufficiently large amounts of entanglement in the quantum circuit are necessary for the emergence of unitary 2-designs, and claim that entanglement-induced barren plateaus [33,34,77] and barren plateaus for local cost functions are equivalent.
Due to its importance for the field of variational quantum algorithms, the barren plateau problem has been studied in many articles. Some articles identify PQC architectures that avoid barren plateaus [31,78] and others propose ways to mitigate the barren plateau problem: e.g. in [32] the authors propose to initialize the circuit with shallow blocks of unitaries and their adjoints that compose to the identity, in [35] they advocate a layer-wise learning strategy, in [37,41] they propose to initialize the PQC using previously trained PQCs, in [38] they propose to use a previously trained qMPS for the PQC initialization, and in [40] the authors claim that the barren plateau problem is solved by choosing the initial parameters from a particular Gaussian distribution.

ZX-calculus for barren plateau analysis
In [31] Chen Zhao and Xiao-Shan Gao pioneer the use of the ZX-calculus [43,44] to analyse the barren plateau phenomenon. They work under the assumption (Assumption 1) that the PQC of Eq. (2) can be transformed into a graph-like ZX-diagram. They show (Theorem 2): let ⟨H⟩_θ of Eq. (1) be a cost function with associated parameterized ansatz Eq. (2) for N qubits; under Assumption 1 the gradient variance is given by the evaluation of a ZX-diagram in which a_1, . . . , a_M, T_1, T_2 and T_3 are labels defining the ZX-diagram [31].
While Theorem 2 does not immediately tell us whether a specific choice of PQC and cost function leads to barren plateaus, it provides us with a constructive procedure to compute the variance of gradients by evaluating ZX-diagrams. This calculation can be further simplified by turning the ZX-diagram into a tensor network whose contraction directly produces the sought-after variance value. In App. A we explain the ZX-calculus formalism that is relevant for this article and also give a simple example that illustrates step-by-step how one can use this formalism to obtain the tensor network for the gradient variance starting from a PQC and using ZX-diagrams.

Results
We present the results on qMPS in Sec. 3.1, qTTN in Sec. 3.2 and qMERA in Sec. 3.3. In Sec. 3.4 we compare the quantum and classical computational cost of calculating gradients.

Quantum matrix product states
We consider the qMPS ansatz composed of the two-qubit blocks defined in Eq. (9), acting on qubits j and j + 1 for j < N − 1 and on qubits N − 1 and N; cf. App. B for a full circuit diagram. Here U_1^qMPS is the canonical centre of the qMPS.

Theorem 3. Let ⟨X_i⟩_qMPS be the cost function associated with the observable X_i and consider the qMPS ansatz for N qubits defined in Eq. (9), then: where ∂_{j,1}⟨X_i⟩_qMPS refers to the gradient w.r.t. the 1-st parameter in the j-th qubit register.
Theorem 3 tells us that the gradient variance with respect to parameter (j, 1) for j < i is independent of j and depends only on i, i.e. the distance between the observable at site i and the canonical centre. We also learn from Theorem 3 that for j = i, i + 1 the gradient variance has a constant contribution. Note that for j > i + 1 we have Var[∂_{j,k}⟨X_i⟩_qMPS] = 0, since the variational parameter indexed by (j, k) is outside the causal cone of the observable X_i, see e.g. Fig. 4. In other words, the variance w.r.t. the top-left parameter is a lower bound to all other non-zero variances in the qMPS ansatz.
Note that Theorem 3 implies that the qMPS ansatz avoids the barren plateau problem for a Hamiltonian that is a sum of local terms acting on all qubits, e.g. the Hamiltonian H = Σ_{i=1}^N X_i, or the Ising and Heisenberg models. Focusing on the Hamiltonian H = Σ_{i=1}^N X_i, this is because Theorem 3 shows that each term X_i in H leads to non-vanishing gradient variances for parameters in registers i and i + 1. Hence, every parameter in the qMPS will have a non-vanishing contribution to the gradient variance. However, this is not the case for arbitrary Hamiltonians. If we consider a Hamiltonian acting on a single site, for example H = X_N, then Theorem 3 shows that the gradient variances for all parameters in registers j < N vanish exponentially.
Additionally we show: Theorem 4. Let ⟨X_i X_{i+1}⟩_qMPS be the cost function associated with the observable X_i X_{i+1} and consider the qMPS ansatz of Eq. (9), then: where ∂_{1,1}⟨X_i X_{i+1}⟩_qMPS refers to the gradient w.r.t. the 1-st parameter in the 1-st qubit register.
We generalize the results to k-local observables and propose a corresponding conjecture (Conjecture 1, App. B). The cases k = 1 and k = 2 are already shown in Theorem 3 and Theorem 4, and we discuss k > 2 in App. B.

Quantum tree tensor networks
We consider a qTTN ansatz for N = 2^n qubits of the form defined recursively in Eq. (17), for n = 1 and for n > 1; Appendix C contains an example of a full circuit diagram. The top recursion level in Eq. (17) is the canonical centre of the network. Each qubit in the qTTN ansatz is causally connected to n = log N qubits, which allows us to show:

Theorem 5. Let ⟨X_i⟩_qTTN be the cost function associated with the observable X_i and consider the qTTN ansatz defined in Eq. (17), then:

Proof. See App. C, Theorem 10 and Lemma 2.
In summary, Theorem 5 tells us that the bound holds for all pairs of indices (j, k), provided the former variance is not 0. The variance is 0 in the qTTN ansatz when the variational parameter indexed by (j, k) is outside the causal cone of the observable. In contrast to the qMPS ansatz, for qTTN the variance decreases polynomially and independently of the site i being considered, since the distance between the qubit that the observable acts on and the canonical centre is always log N. We conclude that the qTTN ansatz avoids the barren plateau problem.
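The reason a logarithmic distance yields only polynomial decay is the elementary identity b^(log₂ N) = N^(log₂ b). A short numerical check, with an illustrative per-layer suppression factor b that is not a value from the text:

```python
import math

# If each of the log2(N) layers between an observable and the canonical centre
# suppresses the gradient variance by a constant factor b < 1, the overall
# scaling is b**log2(N) = N**log2(b): polynomial in N, not exponential.
b = 0.25                        # illustrative per-layer suppression factor
for N in (4, 16, 64, 256):
    layers = int(math.log2(N))
    assert math.isclose(b**layers, N ** math.log2(b))

# e.g. b = 1/4 gives variance ~ N**(-2):
assert math.isclose(0.25 ** math.log2(256), 256.0 ** (-2))
```

The same identity applied to the qMPS, where the distance grows linearly with N, gives b^N, i.e. exponential decay.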
We extend the results to k-local observables for k ≪ N. In this case the observable is causally connected to O(k log N) qubits. We propose a corresponding conjecture: the case k = 1 is covered by Theorem 5 and we discuss the general case in App. C.

Quantum multiscale entanglement renormalization ansatz
We define the qMERA ansatz for N = 2^n qubits as a product of n layers, each of which is composed of a disentangling (Dis) and a coarse-graining (CG) layer, where the two-qubit gates are given by Eq. (19) and, in the last layer, prior to the measurements, there is an additional R_X R_Z operation on each qubit register. The canonical centre of the qMERA is in the first CG layer. Each qubit is connected to at most 2 log N qubits via the CG and Dis layers. This quantum tensor network is motivated by the MERA in [72].
Theorem 6. Let ⟨X_i⟩_qMERA be the cost function associated with the observable X_i and consider the qMERA ansatz defined in Eq. (18), then the gradient variance obeys upper and lower bounds that decay polynomially with N = 2^n (the explicit bounds, involving the exponent 2n, are given in App. D).
Theorem 6 tells us that the qMERA avoids barren plateaus for 1-local observables. In contrast to qMPS and qTTN, here the lower bound is not tight. In App. D we present a numerical method to calculate the exact variances. Numerically we find that the upper bound scales as O(N^−1.2) and the lower bound as Ω(N^−2.7).
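Scaling exponents like these can be extracted from variance data by a straight-line fit on a log-log scale. A sketch with synthetic data standing in for the qMERA variances (the prefactor 0.3 and the data points are arbitrary illustrations, not the paper's numbers):

```python
import numpy as np

# If Var(N) = c * N**p, then log Var = log c + p * log N, so a straight-line
# fit on log-log axes recovers the exponent p as the slope.
N = np.array([4.0, 8.0, 16.0])
var_upper = 0.3 * N ** (-1.2)        # synthetic data mimicking O(N^-1.2)
slope, intercept = np.polyfit(np.log(N), np.log(var_upper), 1)
assert np.isclose(slope, -1.2)       # fit recovers the power-law exponent
```

For exact power-law data the fit is exact; for real variance estimates the residuals indicate how well a single power law describes the decay.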
We extend these results to k-local observables. In this case the observable is causally connected to O(2k log N ) qubits.

Quantum versus classical computational cost of computing gradients
On a quantum computer we assume that gradients are computed via sampling, which has an error scaling as 1/√M in terms of the sample count M [7]. Therefore, to resolve gradients decreasing exponentially with the distance from the canonical centre, M needs to scale exponentially with that distance.
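A quick back-of-the-envelope version of this argument, with an illustrative per-site decay factor q for the gradient magnitude (q is an assumption for the sketch, not a value derived in the text):

```python
# Sampling error after M shots scales as 1/sqrt(M), so resolving a gradient of
# magnitude g requires roughly M ~ 1/g**2 shots. If g ~ q**d decays
# exponentially with the distance d from the canonical centre (q < 1), the
# required shot count grows exponentially with d.
q = 0.5                                    # illustrative per-site decay factor
shots = {d: (q ** d) ** (-2) for d in (2, 4, 8)}

# Doubling the distance squares the shot budget: exponential growth in d.
assert shots[4] == shots[2] ** 2 and shots[8] == shots[4] ** 2
```

Here already d = 8 demands 4^8 = 65536 times more shots than a unit-magnitude gradient would.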
On a classical computer the computational cost of basic arithmetic operations (addition, subtraction, multiplication and division) scales polynomially with log(1/ε) for error ε [79]. In other words, in classical computing it is efficient to exponentially decrease the error of basic arithmetic operations. For the quantum tensor networks and local observables considered here, gradients can be evaluated on a classical computer via tensor network contraction techniques (see [48] for MPS, [68] for TTN and [72] for MERA). Their computational cost, i.e. the total number of arithmetic operations, scales polynomially with the distance of the observable from the canonical centre and, therefore, the total classical computational cost scales polynomially with that distance.
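The following sketch illustrates the classical route for the MPS case: a transfer-matrix contraction of ⟨ψ|X_i|ψ⟩ whose cost is linear in the number of sites, validated against a dense statevector. The random unnormalised MPS and the tensor shapes are our conventions for the sketch, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
N, chi = 6, 3
X = np.array([[0.0, 1.0], [1.0, 0.0]])

# Random (unnormalised) MPS: A[k] has shape (left bond, physical, right bond).
dims = [1] + [chi] * (N - 1) + [1]
A = [rng.normal(size=(dims[k], 2, dims[k + 1])) for k in range(N)]

def dense_state(A):
    """Contract the MPS into a full 2**N statevector (exponential cost)."""
    psi = A[0]
    for T in A[1:]:
        psi = np.einsum('apb,bqc->apqc', psi.reshape(1, -1, psi.shape[-1]), T)
        psi = psi.reshape(1, -1, T.shape[-1])
    return psi.reshape(-1)

def mps_expectation(A, op, site):
    """<psi| op_site |psi> via left-to-right transfer matrices: cost linear in N."""
    E = np.ones((1, 1))
    for k, T in enumerate(A):
        O = op if k == site else np.eye(2)
        # E(a,b): a is the bra bond, b the ket bond; O[p,q] = <p|O|q>.
        E = np.einsum('ab,apc,pq,bqd->cd', E, T.conj(), O, T)
    return E.item().real

i = 3
psi = dense_state(A)
X_i = np.kron(np.kron(np.eye(2**i), X), np.eye(2**(N - i - 1)))
assert np.isclose(mps_expectation(A, X, i), np.real(psi.conj() @ X_i @ psi))
```

The transfer-matrix loop touches each site once, so the number of arithmetic operations grows linearly with N (and with the distance of the observable from the boundary), in contrast to the exponential cost of the dense check.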

Discussion
In the context of randomly initialized quantum tensor networks we have shown that qMPS suffer from exponentially vanishing gradients whilst qTTN and qMERA avoid this barren plateau problem. Therefore qTTN and qMERA are recommended over qMPS.
Interestingly, any MPS of bond dimension χ can be equivalently represented by a TTN of bond dimension χ² [46,48-51]. Figure 2 illustrates a constructive procedure for transforming an MPS into a TTN (a) and for transforming a qMPS into a qTTN (b) for N = 8. The same procedure can be used for larger values of N and, for the qMPS considered in this article, leads to a qTTN composed of four-qubit quantum gates. Since the qTTN circuit depth is logarithmic in the number of qubits, the resulting qTTN avoids the barren plateau problem [29].
From the perspective of the barren plateau phenomenon, therefore, generalized versions of qTTN and qMERA with larger unitary gates are recommended over qMPS because they can contain qMPS and their depth scales logarithmically with qubit count. We conjecture, however, that the classical computation of gradients for these quantum tensor networks can still be more efficient than their quantum computation, cf. Sec. 3.4.

A ZX-calculus
For the sake of completeness, here we summarise the techniques of [31] that are relevant for our work. Let U(θ) be a PQC satisfying the constraints of Assumption 1 with θ ∈ [−π, π]^M. Then U(θ) = c · G_U(θ), where G_U(θ) is a graph-like ZX-diagram representing the circuit U(θ) and c is the constant obtained in the process of turning U(θ) into G_U(θ).
For example, the graph-like ZX-diagram for the 3-qubit qMPS in Eq. (9) is given in Eq. (20), where the prefactor 1/2^N comes from the identity used when turning each gate into its graph-like form. If we initialise the parameters in the quantum circuit uniformly at random, θ ∈ [−π, π]^M, then the variance of the gradient with respect to parameter j is

Var[∂_j⟨H⟩_θ] = 1/(2π)^M ∫_{[−π,π]^M} dθ (∂_j⟨H⟩_θ)²,

where the integrand is given by the contraction of the corresponding ZX-diagrams with labels a_i ∈ {T_1, T_2, T_3}.

B Quantum matrix product states
The qMPS ansatz of Eq. (9) for N qubits has the form shown in Fig. 3.

Figure 3: The qMPS circuit considered in this article.
We index parameters using the index pair (j, k), which refers to the k-th parameter in qubit register j = 1, . . . , N. Theorem 2 and App. A imply that the gradient variance is given by the contraction of the tensor network in Eq. (36), where the gradient is calculated for the first parameter on the first qubit register by placing the projection P_2 there, and the vectors u_i are related to the observables via Eq. (33). To consider general parameters (j, k) we simply move the projection P_2 to the copy tensor at position (j, k). Using the identities in Eqs. (37) and (38) one obtains the following Lemma, for some constant c.
Proof. For the first equality, note that by Eq. (33) both observables X_i and Z_i yield u_i = 2v_2 and u_{i'} = 2v_13 for i' ≠ i, so that the contraction in Eq. (36) is identical in both cases.
where the vector c_13 v_13 + c_2 v_2 + c_−13 v_−13, with non-negative constants c_13, c_−13, c_2, in the i-th register comes from contracting all registers i' > i in the tensor network in Eq. (36). In particular, note that the right-hand side above no longer carries a v_−13 term on the (i − 1)-th register. Additionally, the two terms on the right-hand side leading with a v_13 on the top register do not contribute to the variance as they will eventually be discarded by the projection. Hence the tensor network is fully determined by the third term on the right-hand side and is therefore equivalent to the one corresponding to the observables X_i and Z_i, up to the constant factor c_2 accrued from contracting the registers i' > i.
This Lemma implies that it suffices to consider the 1-local observable X_i to probe the behaviour of the variance for general 1-local operators. Also, this Lemma trivially generalises to the qTTN and qMERA circuits and, therefore, henceforth we focus solely on observables X_i.

Theorem 7. Let ⟨X_i⟩_qMPS be the cost function associated with the observable X_i and consider the qMPS ansatz for N qubits defined in Eq. (9), then: where ∂_{1,1}⟨X_i⟩_qMPS refers to the gradient w.r.t. the 1-st parameter in the 1-st qubit register.
Proof. Var[∂_{1,1}⟨X_i⟩_qMPS] can be found for the three separate cases by contracting the tensor network in Eq. (36) with u_i = v_2 and u_{i'≠i} = v_13. Given Eqs. (37), (38), this is a straightforward calculation from which we also derive the useful identity in Eq. (42). Computing the gradient variance for a general parameter indexed by (j, k) can be done analogously by moving the projection P_2 in Eq. (36) to the copy tensor at position (j, k). The calculation can be simplified by first identifying the cases in which the triple index (i, j, k) gives Var[∂_{j,k}⟨X_i⟩_qMPS] = 0. Figure 4 illustrates the causal cone corresponding to observable X_i in a qMPS circuit. We observe that in the qMPS the triple index (i, j, k) for which Var[∂_{j,k}⟨X_i⟩_qMPS] = 0 satisfies: j > i + 1 for all k; or j = i + 1 and k > 2; or j < i and k > 4 (k > 2 for j = 1). The definition of the causal cone of a PQC can be extended analogously to apply to variance tensor networks of the form of Eq. (27), for example Eq. (36).

Theorem 8. Let ⟨X_i⟩_qMPS be the cost function associated with the observable X_i and consider the qMPS ansatz defined in Eq. (9), then: where ∂_{j,1}⟨X_i⟩_qMPS refers to the gradient w.r.t. the 1-st parameter in the j-th qubit register.
Proof. This is a straightforward contraction of the tensor network in Eq. (36), but with the projection P_2 replacing the copy tensor indexed by (j, k) and using Eqs. (37), (38), (42). The contraction grows the coefficients of v_13, v_2 and v_−13 monotonically and, therefore, the earlier the projection P_2 is placed, the larger these coefficients become after the contributions of v_13 and v_−13 are removed by P_2. This argument applies analogously to qTTN and qMERA. Now we consider k-local operators of the form X_I := X_{i_1} ⊗ · · · ⊗ X_{i_k} for I = {i_1, . . . , i_k} (w.l.o.g. i_1 < i_2 < . . . < i_k) and use techniques and results from this Appendix to justify Conjecture 1. The proof of Theorem 7 and the causal cone structure in Fig. 4 suggest that Var[∂_{1,1}⟨X_I⟩_qMPS] vanishes exponentially with i_k. Our intuition is that barren plateaus appear when the causal cone of an observable includes a large number of qubits (≈ N), and we know that the causal cone relating to X_I contains at most i_k + 1 qubits for the qMPS ansatz in Fig. 3.
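The index conditions stated in the proof of Theorem 7 above can be encoded directly as a predicate. The sketch below is a literal transcription of those conditions (the k-thresholds come from the text; the function name is ours):

```python
def variance_vanishes(i, j, k):
    """Index conditions from the qMPS causal-cone analysis:
    Var[d_{j,k} <X_i>] = 0 exactly for the (i, j, k) listed below."""
    if j > i + 1:                     # register entirely outside the causal cone
        return True
    if j == i + 1 and k > 2:          # only the first two parameters matter here
        return True
    if j < i and k > (2 if j == 1 else 4):
        return True                   # first register keeps only k <= 2
    return False

assert not variance_vanishes(i=4, j=4, k=4)   # observable's own register contributes
assert variance_vanishes(i=4, j=6, k=1)       # j > i + 1: outside the causal cone
assert variance_vanishes(i=4, j=5, k=3)       # j = i + 1 with k > 2
assert not variance_vanishes(i=4, j=2, k=4)   # j < i with k <= 4 survives
assert variance_vanishes(i=4, j=1, k=3)       # j = 1: only k <= 2 survive
```

Such a predicate is handy for skipping zero-variance parameters before contracting the variance tensor networks.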

Theorem 9.
Let ⟨X_i X_{i+1}⟩_qMPS be the cost function associated with the observable X_i X_{i+1} and consider the qMPS ansatz of Eq. (9), then: where ∂_{1,1}⟨X_i X_{i+1}⟩_qMPS refers to the gradient w.r.t. the 1-st parameter in the 1-st qubit register.
Proof. Var[∂_{1,1}⟨X_i X_{i+1}⟩_qMPS] can be found for the three separate cases by contracting the tensor network in Eq. (36) with u_i = u_{i+1} = v_2 and u_{i'} = v_13 for i' ≠ i, i + 1, using Eqs. (37), (38), (42) in addition to further identities. With the techniques from this Appendix we are ready to discuss the following proposition: the cases k = 1, 2 are covered in Theorem 3 and Theorem 4. For k > 2 we argue as follows: given a k-local operator acting on qubits I = {i_1, . . . , i_k} with i_1 < . . . < i_k, Var[∂_{1,1}⟨X_I⟩_qMPS] corresponds to a tensor network as in Eq. (36) but where all registers below the (i_k + 1)-th qubit do not contribute to the variance. Hence, when contracting the network, we accrue contributions from at most i_k + 1 registers.

Theorem 2 and Appendix A imply that
which we simplify by noting the identities in Eq. (52). If we denote by v_k the resulting vector after the k-th application of the term within the square brackets in Eq. (51), then by using the identities in Eq. (52) we find that any subsequent term is given by the same linear map. Let u_k := (1/4)[α_k, β_k]^T be the coefficient vector associated with v_k; then the transformation v_k → v_{k+1} is determined by the linear map M. M has eigenvalues λ_1 ≈ 0.4313 and λ_2 ≈ 2.3187 with respective eigenvectors w_1, w_2, so that the spectral theorem implies that after the application of the (n − 1) terms in the square brackets we obtain a combination of λ_1^{n−1} w_1 and λ_2^{n−1} w_2. Contracting the rest of the tensor network then gives the variance. We approximate the result by noticing that, for n large enough, λ_1^{n−1}/λ_2^{n−1} ≈ 0 and so α_n, β_n ∈ O(λ_2^{n−1}). In general we obtain the gradient variance corresponding to any observable of the form X_i analytically by contracting the tensor network in Eq. (49) using the identities in Eqs. (51) and (52).
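The spectral mechanism used here, repeated application of a fixed linear map whose largest eigenvalue dominates, can be illustrated with a generic symmetric 2×2 matrix (NOT the map M of Eq. (53), whose entries are set by the contraction identities and are not reproduced here):

```python
import numpy as np

# After n applications of a fixed map M, the iterate is dominated by the
# largest eigenvalue: u_n ~ const * lam_max**n * w_max.
M = np.array([[1.0, 1.0],
              [1.0, 2.0]])          # illustrative symmetric map
lam, W = np.linalg.eigh(M)          # eigenvalues in ascending order
lam_max, w_max = lam[-1], W[:, -1]

u = np.array([1.0, 0.0])
for _ in range(25):
    u = M @ u

# The normalised iterate aligns with the dominant eigenvector ...
assert np.isclose(abs(np.dot(u / np.linalg.norm(u), w_max)), 1.0)
# ... and its norm grows by a factor lam_max per application:
assert np.isclose(np.linalg.norm(M @ u) / np.linalg.norm(u), lam_max)
```

The subdominant contribution decays like (λ_1/λ_2)^n, which is exactly the approximation made in the variance calculation above.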
Figure 6: qMERA circuit considered in this article for 8 = 2^3 qubits and 3 layers. For arbitrary N = 2^n qubits and n layers, the l-th coarse-graining layer (CG in the figure) is as in the qTTN ansatz, whilst the l-th disentangling layer is a composition of the (l − 1)-th disentangling layer with additional R CNOT blocks acting on adjacent pairs of the newly added qubits (j_1, j_2) within that layer (e.g. in the last disentangling operation above, the last CNOT gates act on qubits (2, 4) and (6, 8), which were added on the last layer of the qMERA).
Note that this circuit is equivalent to the one presented in Eq. (18) up to a reordering of the qubits. The qubits in Fig. 6 are arranged so that the coarse-graining operations are equivalent to the ones in the qTTN PQC in Fig. 5. To that end we redefine the qMERA circuit as a product of coarse-graining and disentangling layers. Looking at these results in a log-log plot, we find that the data for N = 4, 8 and 16 lie on straight lines that give us the upper bound scaling like O(N^−1.2) and the lower bound scaling like Ω(N^−2.7). The numerical results showcase a general brute-force approach to calculating the variances for the proposed qMERA for arbitrary 1-local observables.
To make a statement for general N = 2^n qubits we argue as in Eq. (71), using the tools from Lemma 2. Theorem 6 states the bounds with Ñ = 2^{2k log N}, by similar arguments as used at the end of App. C. These bounds are not tight, but as long as k ≪ N they still suggest that the qMERA avoids the barren plateau problem for k-local Hamiltonians. In the limit k ≈ N we obtain exponentially vanishing gradients, as all qubits are in the causal cone of the observable.