Analyzing the barren plateau phenomenon in training quantum neural networks with the ZX-calculus

In this paper, we propose a general scheme to analyze the gradient vanishing phenomenon, also known as the barren plateau phenomenon, in training quantum neural networks with the ZX-calculus. More precisely, we extend the barren plateaus theorem from unitary 2-design circuits to any parameterized quantum circuits under certain reasonable assumptions. The main technical contribution of this paper is representing certain integrations as ZX-diagrams and computing them with the ZX-calculus. The method is used to analyze four concrete quantum neural networks with different structures. It is shown that, for the hardware efficient ansatz and the MPS-inspired ansatz, there exist barren plateaus, while for the QCNN ansatz and the tree tensor network ansatz, there exists no barren plateau.


Introduction
In recent years, hybrid quantum-classical algorithms have been widely used in quantum chemistry [1][2][3][4], combinatorial optimization [5,6], and quantum machine learning [7][8][9][10][11][12]. In these hybrid quantum-classical algorithms, the goal is usually to train parameterized quantum circuits (PQCs) with classical optimizers. The PQC is applied to an initial state and the resulting state is then measured on a quantum device. The classical optimizer updates the parameters of the PQC according to the measurement results. As the PQC can be run on noisy intermediate-scale quantum (NISQ) [13] devices, these algorithms are regarded as near-term practical quantum algorithms with potential quantum advantages.
There exist many methods to train PQCs. Some of these are gradient-based [14][15][16][17] and some are not [17,18]. In quantum machine learning, gradient-based methods are widely used. When using gradient-based methods to train PQCs, one may suffer from the barren plateau (BP) phenomenon, which was first studied in [19]. The BP phenomenon is that the gradient with respect to the parameters of the PQC vanishes exponentially in the system size. It was proved that if the PQCs form unitary 2-designs, then the BP phenomenon exists [19]. This result has been extended in [20] to the case where the PQCs form approximate 2-designs. The BP phenomenon has also been studied in PQCs of various structures. For PQCs with a brick-like structure whose blocks form local 2-designs, the existence of BPs depends on the depth of the circuit and the cost function [21]. Let n be the number of qubits of the PQC. For poly(n)-depth PQCs with a brick-like form, there always exist BPs. For log(n)-depth PQCs with a brick-like form, there exist BPs if the cost function is global, and there exists no BP if the cost function is local. Too much entanglement also induces BPs [22,23]. The BP phenomenon in dissipative quantum neural networks has been studied in [24]. Noise from quantum hardware also causes BPs, which are called noise-induced BPs [25]. Several methods to avoid BPs have been proposed [20,23,26,27].
The above results about the BP phenomenon are obtained under certain assumptions of unitary 2-design and it is still difficult to analyze the BP phenomenon for PQCs besides those containing t-design parts. In this paper, we develop a general scheme to analyze whether there exist BP phenomena when training a concrete PQC. We focus on BP phenomena induced by the structure of PQCs and noise-induced BPs are not considered in this paper. The most important tool used in this paper is the ZX-calculus, a graphical language for describing and reasoning about quantum processes. The ZX-calculus was developed by Coecke and Duncan in [28,29], which has various applications including quantum circuit synthesis [30][31][32][33], measurement-based quantum computing [34,35], quantum error correction [36,37], condensed matter physics [38], quantum machine learning [39], and quantum natural language processing [40]. In the ZX-calculus, the objects under consideration are ZX-diagrams, which consist of two kinds of tensors: Z-spiders and X-spiders. A ZX-diagram can be rewritten with ZX-calculus rules. Moreover, every quantum circuit can be converted into a ZX-diagram.
Let $\theta = (\theta_1, \ldots, \theta_m)$ be a set of parameters. To analyze the gradient of a PQC $U(\theta)$ with respect to a Hamiltonian H, we need to estimate the following expectation and variance,
$$\mathbf{E}\left[\frac{\partial \langle H \rangle}{\partial \theta_j}\right], \qquad \mathrm{Var}\left(\frac{\partial \langle H \rangle}{\partial \theta_j}\right), \tag{1}$$
where $\langle H \rangle$ is the expectation of H with respect to the output state of the PQC. To estimate the expectation and variance in Eq. (1), we first represent them as ZX-diagrams. Since the expectation and the variance are integrals over the parameters, the main technical contribution of this paper is representing these integrals as ZX-diagrams and computing them with the ZX-calculus when the PQC satisfies Assumption 1. More precisely, with the rewriting rules of the ZX-calculus, we prove that Eq. (1) is equal to the contraction of a tensor network whose structure is similar to that of the PQC. Hence, the existence of BPs is fully characterized by the scaling property of this tensor network.
We use these techniques to analyze whether there exist BP phenomena in the hardware-efficient ansatz [2], the QCNN ansatz [41], the tree tensor network ansatz [42], and the MPS-inspired ansatz [43]. We show that there exist BPs in the hardware-efficient ansatz and the MPS-inspired ansatz, and there exists no BP in the QCNN ansatz and the tree tensor network ansatz.
This paper is organized as follows. A brief introduction to the PQC, the BP phenomenon, and the ZX-calculus is given in Section 2. We prove the main result that characterizes Eq. (1) in Section 3. The analysis of four concrete PQCs is given in Section 4.

Hybrid quantum-classical algorithms
In a hybrid quantum-classical algorithm, there is an ansatz, which is a PQC of the form given in Eq. (2). In Eq. (2), $U_j(\theta_j)$, j = 1, . . . , M, are parameterized gates, such as the rotation gates $R_X$, $R_Y$, $R_Z$, and $V_j$ are non-parameterized gates, such as the Hadamard gate H and the CNOT gate. The PQC is applied to an initial state $\rho_0$ and then the state is measured. This procedure, which is the quantum part of the algorithm, is run on quantum processors. Meanwhile, there is a classical part that consists of classical processors to optimize the parameters of the PQC in the quantum part. A cost function $L(\theta)$ is estimated in the classical part based on the measurement results. In many tasks, the expectation of a given Hamiltonian H is taken as the cost function. As demonstrated in Figure 1, the quantum part runs the PQC and obtains the measurement results, and the classical part estimates the cost function and updates the parameters. After several iterations, the cost function may converge to an optimum, and the training is then stopped. This is the main idea of the hybrid quantum-classical algorithm.
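The loop described above can be sketched on a classical simulator. The following is a minimal illustration, not the setup used in this paper: it assumes a one-qubit ansatz $R_X(\theta)$ applied to |0⟩, the Hamiltonian H = Z, exact expectation values instead of sampled measurements, and a finite-difference gradient. The function names `quantum_part`, `classical_part`, and `train` are hypothetical.

```python
import numpy as np

# Pauli matrices used by the toy example.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rx(theta):
    """Rotation gate R_X(theta) = exp(-i * theta * X / 2)."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X

def quantum_part(theta):
    """Simulated quantum processor: prepare R_X(theta)|0> and return <Z>."""
    state = rx(theta) @ np.array([1, 0], dtype=complex)
    return float(np.real(np.vdot(state, Z @ state)))

def classical_part(theta, lr=0.4, eps=1e-4):
    """Simulated classical optimizer: one gradient-descent step.

    The gradient is estimated by a central finite difference, an assumption
    made for brevity; any classical optimizer could be used here.
    """
    grad = (quantum_part(theta + eps) - quantum_part(theta - eps)) / (2 * eps)
    return theta - lr * grad

def train(theta0=0.5, steps=50):
    """Alternate between the quantum and the classical part."""
    theta = theta0
    for _ in range(steps):
        theta = classical_part(theta)
    return theta, quantum_part(theta)

theta_opt, cost_opt = train()
print(theta_opt, cost_opt)
```

In this toy case the cost is cos(θ), so the loop drives θ towards π, where the cost reaches its minimum of −1.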

Barren plateau phenomenon
When the parameterized gates are of the form $U_j(\theta_j) = e^{-i \frac{\theta_j}{2} H_j}$, where $H_j$ satisfies $H_j^2 = I$, the gradient $\partial \langle H \rangle / \partial \theta_j$ can be estimated by the parameter-shift rule without changing the structure of the PQC [14]. Once we obtain the gradient, we can use gradient-based optimization methods, such as gradient descent, to optimize the parameters.
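As a small numerical check of the parameter-shift rule, the sketch below assumes the common convention that a parameterized gate is exp(−iθG/2) with G² = I, and uses a one-qubit circuit $R_X(\theta)$ on |0⟩ with H = Z as a hypothetical example. Under these assumptions the gradient is half the difference of the cost evaluated at θ ± π/2, using the same circuit structure.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def gate(theta, generator):
    """exp(-i * theta * G / 2) for a generator G with G @ G = I."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * generator

def expectation(theta):
    """<H> = <0| U(theta)^dagger Z U(theta) |0> with U = R_X(theta)."""
    state = gate(theta, X) @ np.array([1, 0], dtype=complex)
    return float(np.real(np.vdot(state, Z @ state)))

def parameter_shift(theta):
    """Gradient from two evaluations of the same circuit at shifted angles."""
    return 0.5 * (expectation(theta + np.pi / 2) - expectation(theta - np.pi / 2))

theta = 0.7
shift_grad = parameter_shift(theta)
exact_grad = -np.sin(theta)  # since <H> = cos(theta) for this circuit
print(shift_grad, exact_grad)
```

The two printed values agree up to floating-point error, because ⟨H⟩ = cos(θ) for this circuit.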
Ideally, if the gradient does not vanish too fast as the size of the PQC grows, then the gradient can be estimated efficiently and the PQC can be trained easily. However, the BP phenomenon tells us that in many cases, the gradient vanishes exponentially as the system size grows. When this happens, the PQC is difficult to train. The first rigorous proof of the BP phenomenon is shown below. Hence, when designing the ansatz PQC for a hybrid quantum-classical algorithm, we should analyze whether there exist BP phenomena in it to ensure that it is trainable.
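To make the exponential decay concrete, here is a toy Monte Carlo experiment. It is not one of the ansätze analyzed in this paper: it assumes a product ansatz of $R_X(\theta_i)$ gates on |0...0⟩, the global Hamiltonian H = Z ⊗ ... ⊗ Z, and parameters drawn independently and uniformly from [−π, π]. Under these assumptions ⟨H⟩ = ∏ cos(θ_i), and the variance of the gradient with respect to θ_1 equals 2^{-n}.

```python
import numpy as np

def gradient_samples(n_qubits, n_samples, rng):
    """Samples of d<H>/d(theta_1) for the toy product ansatz.

    For U = R_X(theta_1) x ... x R_X(theta_n) applied to |0...0> and
    H = Z x ... x Z, the cost is <H> = prod_i cos(theta_i), so
    d<H>/d(theta_1) = -sin(theta_1) * prod_{i>1} cos(theta_i).
    """
    thetas = rng.uniform(-np.pi, np.pi, size=(n_samples, n_qubits))
    return -np.sin(thetas[:, 0]) * np.prod(np.cos(thetas[:, 1:]), axis=1)

rng = np.random.default_rng(0)
variances = {}
for n in range(1, 9):
    grads = gradient_samples(n, 200_000, rng)
    variances[n] = float(np.var(grads))
    print(n, variances[n], 2.0 ** (-n))  # Monte Carlo estimate vs. 2^{-n}
```

Each added qubit halves the variance, so the number of measurements needed to resolve the gradient grows exponentially with n in this toy setting.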

The ZX-calculus
We provide a brief introduction to the ZX-calculus. For more details, please refer to [44,45].
In the ZX-calculus, quantum states and their transformations are represented as ZX-diagrams which consist of two kinds of tensors: Z-spiders and X-spiders. A Z-spider is denoted as a green node, and an X-spider is denoted as a red node. They can be written explicitly in the Dirac notation as follows.
For a spider, the edges on the left-hand side are called input and the edges on the right-hand side are called output. The angle θ is called the phase of the spider. For simplicity, we will omit the phase when it is zero. Spiders can be connected with wires. Hence, ZX-diagrams can be regarded as tensor networks generated with Z-spiders and X-spiders. For example, we can use ZX-diagrams to represent the following quantum states and quantum gates.
Here we introduce a new notation, the yellow box, to represent the Hadamard gate. Since the gate set $\{R_Z, R_X, H, \mathrm{CNOT}\}$ is universal for quantum computing, in principle, one can convert every quantum circuit to a ZX-diagram with the equations in Eq. (5).
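The correspondence between spiders and gates can be checked numerically. The sketch below uses the standard definition of spiders, a Z-spider being |0...0⟩⟨0...0| + e^{iθ}|1...1⟩⟨1...1| and an X-spider being the same with |±⟩ in place of |0⟩, |1⟩, which is assumed to match Eq. (4). The helper names `kron_power`, `z_spider`, and `x_spider` are hypothetical.

```python
import numpy as np

H_GATE = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def kron_power(matrix, k):
    """k-fold tensor power; the 0-fold power is the 1x1 identity."""
    out = np.eye(1, dtype=complex)
    for _ in range(k):
        out = np.kron(out, matrix)
    return out

def z_spider(n_in, n_out, phase=0.0):
    """|0..0><0..0| + e^{i phase} |1..1><1..1| as a 2^n_out x 2^n_in matrix."""
    m = np.zeros((2 ** n_out, 2 ** n_in), dtype=complex)
    m[0, 0] += 1.0
    m[-1, -1] += np.exp(1j * phase)
    return m

def x_spider(n_in, n_out, phase=0.0):
    """|+..+><+..+| + e^{i phase} |-..-><-..-|: a Z-spider conjugated by H."""
    return kron_power(H_GATE, n_out) @ z_spider(n_in, n_out, phase) @ kron_power(H_GATE, n_in)

def rz(theta):
    """R_Z(theta) = exp(-i * theta * Z / 2)."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def rx(theta):
    """R_X(theta) = exp(-i * theta * X / 2)."""
    x = np.array([[0, 1], [1, 0]], dtype=complex)
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * x

theta = 0.9
global_phase = np.exp(1j * theta / 2)
# One-input, one-output spiders are rotation gates up to a global phase.
print(np.allclose(z_spider(1, 1, theta), global_phase * rz(theta)))
print(np.allclose(x_spider(1, 1, theta), global_phase * rx(theta)))
# An X-spider with no input and zero phase is sqrt(2) * |0>.
print(np.allclose(x_spider(0, 1), np.sqrt(2) * np.array([[1], [0]])))
```

The scalar factors that appear here (the global phase and the factor √2) are the kind of scalars that are ignored in the basic rewriting rules but matter when a variance is computed.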
Moreover, the ZX-calculus is a powerful tool for reasoning. There are several rewriting rules in the ZX-calculus with which one can rewrite a ZX-diagram into another equivalent form. Figure 2 gives some basic rewriting rules. Note that the ZX-calculus is universal, which means that any linear transformation can be represented as a ZX-diagram. Moreover, the rules in Figure 2 are complete for stabilizer quantum mechanics, where phases can only be multiples of $\pi/2$ [46,47]. That is, if two such ZX-diagrams are equivalent, then there exists a sequence of rewriting rules in Figure 2 that rewrites one into the other.
There are also completeness results for Clifford+T quantum mechanics, where phases can be multiples of $\pi/4$, and for arbitrary ZX-diagrams [48][49][50][51][52]. In this paper, we will focus on a canonical form of the ZX-diagram, the graph-like ZX-diagram, which is defined in [30]. All X-spiders can be rewritten to Z-spiders by using the rule (h) in Figure 2. Connected Hadamard boxes can be canceled with the rule (i2) and normal edges can be canceled with the rule (f). Furthermore, parallel Hadamard edges and self-loops can be canceled with the rules in Figure 3. Hence, every ZX-diagram is equivalent to a graph-like ZX-diagram [30].

Analyzing the BP phenomenon with the ZX-calculus
In this section, we will show how to analyze the BP phenomenon with the ZX-calculus. More precisely, we will show how to estimate the expectation and the variance of the gradient of the cost function of a PQC with respect to a Hamiltonian with the ZX-calculus. The main technique we use is to compute integrals over unitaries with the ZX-calculus.
Scalars are ignored in the rules in Section 2.3. However, to analyze the BP phenomenon, the scalars are necessary. Hence, by using the definition of the Z-spider and the X-spider in Eq. (4), we can obtain the precise rules with scalars in Figure 4.
In this paper, we consider PQCs under the following assumptions.
We remark that if a quantum circuit contains gates not satisfying Assumption 1, but these gates can be represented as compositions of gates that satisfy Assumption 1, then the results of this paper still hold. For example, we will first represent $R_Y$ using $R_Z$ and $R_X$ in Section 4.2.

Representing gradients as ZX-diagrams
Consider a PQC $U(\theta)$ on n qubits and a Hamiltonian H. We assume that the input state is pure. Without loss of generality, we can also assume that we apply this PQC to the initial state $|0\rangle$. Then the expectation of H can be expressed as $\langle H \rangle = \langle 0 | U^\dagger(\theta) H U(\theta) | 0 \rangle$. As shown in Section 2.3, we can convert the PQC $U(\theta)$ to a parameterized graph-like ZX-diagram $G_U(\theta)$ with Eq. (5). Suppose that for a constant c, then $\langle H \rangle$ can also be expressed as a ZX-diagram as demonstrated in the following equation. Here, $U(\theta)$ satisfies Assumption 1.
We remark that in the general case, the input state can be a mixed state ρ. Because the ZX-calculus is universal, we can represent ρ as a ZX-diagram $D_\rho$. Then by replacing the X-spiders representing zero states on the left- and right-hand sides in Eq. (7) with $D_\rho$, we can still obtain a ZX-diagram representing the expectation $\langle H \rangle = \mathrm{Tr}(\rho U^\dagger H U)$. The results in this paper still hold in this case.
If we expand the spider by the definition of the Z-spider, we can prove that the gradient $\partial \langle H \rangle / \partial \theta_j$ can be represented as a ZX-diagram.
Proof. We expand the corresponding Z-spiders according to the definition. There will be four terms on the left-hand side. Two of these terms are constants, and thus they will become 0 after taking the derivative. According to the definition of the X-spider, we can obtain the ZX-diagram on the right-hand side. The complete proof is given in Appendix B.
This theorem also gives a graphical proof of the parameter-shift rule in [14]. We will demonstrate it with the following example. Consider the following ansatz. We can first convert it to an equivalent ZX-diagram.

[Circuit diagram: a two-qubit ansatz on the input |0⟩|0⟩ containing the gate R_X(θ_1).]
And the expectation $\langle H \rangle$ of a Hamiltonian H can be represented as the following ZX-diagram.
Then by Theorem 2, we can obtain the gradient directly.
Then we can use the definition of the X-spider to obtain the parameter-shift rule as shown below.
Proof. Expand the Z-spiders according to their definition. There will be four terms on the left-hand side. By using $\frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i\alpha} \, d\alpha = 0$, only two terms are left. We use the definition of the Z-spider again to obtain the ZX-diagram on the right-hand side. The complete proof is given in Appendix C.
With this lemma, we give a graphical proof of the following theorem, which is also proved in [20].

The variance of gradients
In this section, we will compute the variance in Eq. (9).
Because the gradient $\partial \langle H \rangle / \partial \theta_j$ is a real number and its expectation is 0, by Eq. (9), the variance is the expectation of $\left( \partial \langle H \rangle / \partial \theta_j \right)^2$, which can be represented as follows by Theorem 2.
Proof. We expand all Z-spiders and there will be 16 terms on the left-hand side. Again, using the relation $\frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i\alpha} \, d\alpha = 0$, 6 terms are left. We can use the definition of the Z-spider again to obtain the right-hand side. The complete proof is given in Appendix D.
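The term counts in these proofs can be reproduced by a short enumeration. The sketch below assumes that the parameter α enters each copy of U as a spider with phase +α and each copy of U† as a spider with phase −α; this sign pattern and the function names are illustrative assumptions. Each spider expands into two terms, and a product term survives the average over α only if its net phase is zero.

```python
import itertools
import numpy as np

def surviving_terms(signs):
    """Count expansion terms that survive the average over alpha.

    Each spider with phase s*alpha (s = +1 or -1) expands into two terms,
    one with coefficient 1 and one with coefficient e^{i s alpha}.  A product
    term picks one of the two from every spider and carries e^{i k alpha},
    where k is the sum of the signs of the picked phase factors.  Its average
    over alpha in [-pi, pi] is 1 if k == 0 and 0 otherwise.
    """
    total, kept = 0, 0
    for picks in itertools.product([0, 1], repeat=len(signs)):
        k = sum(s * p for s, p in zip(signs, picks))
        total += 1
        kept += int(k == 0)
    return total, kept

def average_phase(k, grid=1024):
    """Numerical value of (1/2pi) * integral of e^{i k alpha} over [-pi, pi]."""
    alpha = -np.pi + 2 * np.pi * np.arange(grid) / grid
    return np.mean(np.exp(1j * k * alpha))

print(surviving_terms([+1, -1]))          # (4, 2): one copy of U and U^dagger
print(surviving_terms([+1, -1, +1, -1]))  # (16, 6): two copies, as in the variance
print(abs(average_phase(1)), abs(average_phase(0)))
```

The counts 2 out of 4 and 6 out of 16 match the numbers of terms stated in the proofs for the expectation and for the variance.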
There exist three terms after integration. Hence, computing the variance of gradients is much more complicated than computing the expectation. We denote the three ZX-diagrams in Lemma 2 as $T_1$, $T_2$, and $T_3$. And we introduce a new notation $V_U^{a_1, \ldots, a_m}$. Here, $U(\theta_1, \ldots, \theta_m)$ is a PQC with m parameters and $G_U$ is the graph-like ZX-diagram corresponding to U. With this notation, we have the following theorem.

Theorem 4. Under Assumption 1, the following equation holds.
The sum in Theorem 4 seems intractable when m is large. But in many cases, we have simple ways to compute this sum.
Recall that we have converted quantum circuits to graph-like ZX-diagrams. Hence, spiders are connected with Hadamard edges. Let us consider two spiders $W_j$, $W_k$ corresponding to the parameters $\theta_j$, $\theta_k$ in $G_U$. Suppose that $W_j$ and $W_k$ are connected with a Hadamard edge. Then by the following lemma, the Hadamard edge can be removed after integration over $\theta_j$ and $\theta_k$.
Here $M_{a_j, a_k}$ is the element in the $a_j$-th row and $a_k$-th column of the following 3 × 3 matrix.
Proof. By Lemma 2, there will be 9 terms on the left-hand side. We can use the rewriting rules in Figure 4 on each term to remove Hadamard edges and obtain the form on the right-hand side. The complete proof is given in Appendix E.
Applying this lemma to the variance recursively, we can remove all the Hadamard edges connecting two parameterized spiders. The big tensor $V_U^{a_1, \ldots, a_{j-1}, T_2, a_{j+1}, \ldots, a_m}$ is then broken into smaller tensors that are connected with M. The result is a new tensor network whose structure is similar to that of $G_U$. To compute the variance, the only thing we need to do is to contract this new tensor network. Figure 5 demonstrates the above procedure for the case where all spiders in $G_U$ are parameterized and are connected with Hadamard edges.
Figure 5: Computing the variance with the tensor network.
The tensor $\tilde{I}_{a_1, \ldots, a_k}$ is related to the input state, while the tensor $\tilde{H}_{c_1, \ldots, c_m}$ is related to the Hamiltonian H. And $P_2$ is a projection that has only one non-zero entry. Also, note that there is a scalar 2 for each internal copy tensor. This scalar comes from the following equation.
In Appendix A, a simple example is given to illustrate the techniques introduced in this paper.
In conclusion, computing the variance of gradients is reduced to contracting a tensor network corresponding to the circuit. In the next section, we will use these techniques to analyze the BP phenomenon for several concrete PQCs.

Analyzing the BP phenomenon for four PQCs
In this section, we will analyze the BP phenomenon for four PQCs with the techniques introduced in Section 3.

Hardware-efficient ansatz
Consider a hardware-efficient ansatz [2] PQC of the following form.
The first step is to convert it to a graph-like ZX-diagram. Suppose the circuit acts on n qubits. By the conversion rules in Eq. (5) and the rewriting rules in Figure 4, we obtain a graph-like ZX-diagram where most spiders are parameterized.
Here, the Z-spider with ". . . " represents a Z-spider with a parameter.
Here $ET_{a_1, a_2, a_3}$ is an element of the following 3 × 3 × 3 tensor.
Proof. Refer to Appendix F.
Then we can construct a tensor network that is similar to Figure 5 as follows.
If we want to compute the variance $\mathrm{Var}(\partial \langle H \rangle / \partial \theta_j)$, we can simply replace the copy tensor corresponding to $\theta_j$ in the above tensor network with the projection $P_2$.
So far we have represented the variance as a tensor network. We now analyze the scaling property of this tensor network as the number of qubits n and the depth of the circuit grow.
then a layer can be represented as And the whole tensor network in Eq. (17) is .
Hence, the variance will be
We can prove that exactly two eigenvalues of the matrix LT are 1 and the absolute values of the other eigenvalues are less than 1 (for the complete proof, please refer to Appendix G). Moreover, the eigenspace corresponding to the eigenvalue 1 is spanned by two vectors. Hence, $LT^d$ converges to $P_{E_1}$, the projection onto the eigenspace $E_1$, exponentially fast as d → ∞.
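The convergence of matrix powers to a projection can be illustrated numerically. The matrix below is a hypothetical stand-in and not the actual LT of the hardware-efficient ansatz: it is a random symmetric 4 × 4 matrix constructed to have the eigenvalue 1 with multiplicity two and all other eigenvalues strictly inside (−1, 1), which is the spectral property stated above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for LT: a symmetric matrix with eigenvalue 1 of
# multiplicity two and all other eigenvalues strictly inside (-1, 1).
eigenvalues = np.array([1.0, 1.0, 0.5, -0.3])
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthonormal eigenbasis
A = Q @ np.diag(eigenvalues) @ Q.T

# Projection onto the eigenspace E_1 spanned by the first two eigenvectors.
P_E1 = Q[:, :2] @ Q[:, :2].T

def distance_to_projection(d):
    """Spectral-norm distance between A^d and the projection P_E1."""
    return np.linalg.norm(np.linalg.matrix_power(A, d) - P_E1, 2)

for d in (1, 5, 10, 20):
    print(d, distance_to_projection(d), 0.5 ** d)  # distance vs. |lambda_3|^d
```

In this example the distance equals the d-th power of the largest eigenvalue modulus below 1, so for large depth the powers of the matrix can be replaced by the projection with an exponentially small error.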
If we replace $LT^{L_1}$ and $LT^{L_2}$ with the projection $P_{E_1}$, then the resulting expression is exponentially small. That is, the number of qubits n determines an exponentially small limit of the variance. Note that the above analysis can be generalized to any hardware-efficient ansatz whose entangler connects all of the qubits.

Tree tensor network ansatz
The tree tensor network is a special kind of tensor network with a tree structure. The quantum analog of the tree tensor network was developed in [42]. In [53], it was proved that the sum of the variances $\sum_{j=1}^{n} \mathrm{Var}\left( \partial \langle H \rangle / \partial \theta_j \right)$ will not vanish exponentially. In this section, we will prove that not only the sum of the variances but also the variance of each parameter vanishes only polynomially.
Consider the tree tensor network ansatz on n qubits of the following form.
To analyze the BP phenomenon of this ansatz, we first use the gate decomposition to convert the PQC to a ZX-diagram as follows.

The X-spiders with phase ". . . " are spiders with parameters. And the ZX-diagram can be rewritten to a graph-like ZX-diagram as follows.
By using the rewriting rule (lc) in [30], we can remove the spiders with phases $\pm \pi/2$.
We can prove that (for the complete proof, please refer to Appendix G), after integration over the parameters α, β, γ, the building block will become
Here, $T_{TTN}$ is a 3 × 3 × 3 tensor defined as follows. Hence, the variance of $\partial \langle H \rangle / \partial \theta_j$ can be obtained by replacing one of the copy tensors with the projection $P_2$ in the following tensor network.
Now let us analyze the scaling property of this tensor network. Since the Hamiltonian H is a 1-qubit Hermitian operator, it can be expressed as
Note that the building block of this tensor network is $8 \cdot T_{TTN}$. By the definition of $T_{TTN}$, we have
It is a linear function of $\tilde{H}$. Hence, we can analyze each term of $\tilde{H}$ individually.
Since $P_2 v_{1,3} = 0$, the first term $2 k_0^2 v_{1,3}$ in $\tilde{H}$ will become 0. Now let us consider the second term 2(k 2
Expanding it recursively, the variance can be represented as a summation of terms of the following form.
By the definition of $\tilde{I}$, each of the terms satisfies $\tilde{I}(u_1, \ldots, u_n) \geq 0$. Hence, we can obtain a lower bound. Similarly, we have a lower bound for the term $v^-_{1,3}$. For the general case of n qubits, we can prove that the variance has a lower bound.

Theorem 6. For the tree tensor network ansatz shown in (22), if
for some $u_j, w_j \in \{v_{1,3}, v_2, v^-_{1,3}\}$. Here $\tilde{I}(u_1, \ldots, u_n)$ is defined in Eq. (26), and $\tilde{I}$ is a $3^n$-dimensional tensor which only depends on the input state. If the input state is ρ, then $\tilde{I}$ is defined as follows.
Proof. See Appendix G.2.

QCNN
QCNN was developed in [41]. It was proved that there exists no BP in the QCNN ansatz if the sub-blocks form unitary 2-designs [54]. In this section, we will use the ZX-calculus to analyze the BP phenomenon in a QCNN ansatz without the assumption of unitary 2-designs.
Consider a QCNN ansatz as follows.
It can be represented as the following ZX-diagram.
where the Z-spiders with ". . . " are parameterized. Note that this is a graph-like ZX-diagram whose spiders are all parameterized. Hence, by using Lemma 3, the variance can be obtained by replacing one of the copy tensors with the projection P 2 in the following tensor network.
By using Eq. (29), we can expand the variance as a sum of terms of the following form. Each of these terms is non-negative. Hence, similar to the analysis of the tree tensor network ansatz, we can prove that the variance of gradients in the QCNN ansatz has a lower bound, since the QCNN ansatz is of O(log(n)) depth.

Theorem 7. For the QCNN ansatz shown in (28), if $\tilde{I}(u_1, \ldots, u_n) \in \Omega\left( \frac{1}{\mathrm{poly}(n)} \right)$ or $\tilde{I}(w_1, \ldots, w_n) \in \Omega\left( \frac{1}{\mathrm{poly}(n)} \right)$, then there exists no BP in the QCNN ansatz.

MPS-inspired ansatz
The matrix product state (MPS) is a special structure of tensor networks, which is widely used in quantum physics and machine learning [55,56]. There are also PQCs with a structure similar to that of the MPS, which we call MPS-inspired ansätze. It has been shown that the MPS-inspired ansatz can be implemented efficiently on quantum computers with a small number of qubits [43]. We will analyze the BP phenomenon in the MPS-inspired ansatz in this section.
Let us consider the following MPS-inspired ansatz and the Hamiltonian H = I ⊗ I · · · ⊗ I ⊗ X. We will prove that the variance $\mathrm{Var}\left( \partial \langle H \rangle / \partial \theta_1 \right)$ is exponentially small. Here $\theta_1$ is the parameter of the first $R_X$ gate applied to the first qubit.
Firstly, we convert the PQC into a ZX-diagram as follows.
This is a graph-like ZX-diagram whose spiders are all parameterized. We can use Lemma 3 to represent the variance as the following tensor network.
It is exponentially small in the number of qubits n. Hence, there exist BPs in the MPS-inspired ansatz.

Discussion
We developed powerful techniques to analyze the BP phenomenon in training quantum neural networks with the ZX-calculus. The quantum neural networks under consideration are PQCs under certain reasonable assumptions, and the cost function is the expectation $\langle H \rangle$ of the PQC with respect to a given Hamiltonian H. The basic idea of the method is to represent the PQC, the cost function $\langle H \rangle$, and the gradients $\partial \langle H \rangle / \partial \theta_j$ as ZX-diagrams. Computing the expectation and the variance of the gradient of $\langle H \rangle$ then becomes computing the integration of certain ZX-diagrams. We show that these integrations are sums of ZX-diagrams that can be computed explicitly in many cases. As future work, it would be desirable to use the completeness of the ZX-calculus to represent these sums by ZX-diagrams, or more generally, diagrams in other complete graphical calculi.
In principle, these techniques can be applied to any given ansatz under Assumption 1. We remark that these techniques can be used to analyze the BP phenomenon for PQCs which contain t-design sub-blocks, for example, the PQCs considered in [21,54], because the t-design sub-blocks can be replaced with concrete t-design PQCs, to which the techniques proposed in this paper can then be applied. The techniques introduced in this paper can be used in more cases, including circuits with global cost functions and circuits of any depth. To analyze a PQC with a global cost function, one can represent the global Hamiltonian as a ZX-diagram and then use the method introduced in this paper. As shown in Section 4.1, the BP phenomenon for the hardware-efficient ansatz has been analyzed when the depth is O(poly(n)). In conclusion, we extend the BP theorem from unitary 2-design circuits to any parameterized quantum circuits under Assumption 1.
Using the techniques proposed in this paper, we analyzed four kinds of ansätze, including the hardware-efficient ansatz, the tree tensor network ansatz, the QCNN ansatz, and the MPS-inspired ansatz. It is shown that there exist BPs in the hardware-efficient ansatz and the MPS-inspired ansatz, while there exists no BP in the tree tensor network ansatz and the QCNN ansatz.

A A simple example
In this section, we will use the simple example considered in Section 3.1 to illustrate the techniques proposed in this paper. Recall the PQC considered there.
[Circuit diagram: a two-qubit PQC on the input |0⟩|0⟩ with the gate R_X(θ_1) and the parameters θ_2, θ_3, θ_4.]
The gradient with respect to $\theta_1$ can be represented as a ZX-diagram. By Lemma 1, the expectation of this gradient is zero. The variance of the gradient is the following integration.
Using Lemma 3 recursively, we can break the large ZX-diagram into small parts as follows.
Here $M_{a_j, a_k}$ is the element in the $a_j$-th row and $a_k$-th column of the following 3 × 3 matrix.
For $(a_j, a_k) \notin \{(T_2, T_3), (T_3, T_2)\}$, the proof is almost the same as that of Eq. (33).
Now, let us consider the case when $(a_j, a_k) = (T_2, T_3)$. We can use the rules in Figure 4
Here $ET_{a_1, a_2, a_3}$ is an element of the following 3 × 3 × 3 tensor.
That is, $ET[1, 3, \cdot] = \frac{1}{8} (0, 0, 1)$. By now, we have proved that
That is, $ET[1, 3, \cdot] = \frac{1}{8} (0, 0, 0)$. By now, we have proved that
Let us consider the case when $a_1 = T_3$.
When $a_1 = T_3$, we have

G Analysis of PQCs in Section 4

G.1 Hardware-efficient ansatz
We will prove some properties of LT , which are used in the analysis in Section 4.1.
Theorem 8. Suppose that $\lambda_1, \ldots, \lambda_{3^n}$ are the eigenvalues of LT. Then we have
Proof. By the definition of EM, we can compute EM explicitly, and by computation, EM can be diagonalized. Four of its eigenvalues are 1 and the other eigenvalues are in the interval (−1, 1). Moreover, the eigenspace of the eigenvalue 1 is
where LT is an operator on the tensor product of n copies of $\mathbb{R}^3$. We denote the operator EM acting on the i-th and j-th copies of $\mathbb{R}^3$ as $EM_{i,j}$. Then the eigenspace of LT corresponding to the eigenvalue 1 is the intersection of the eigenspaces corresponding to the eigenvalue 1 of $EM_{1,2}, EM_{2,3}, \ldots, EM_{n-1,n}, EM_{n,1}$.

G.2 Tree tensor network ansatz
In Section 4.2, we used Eq. (23). Here we will prove this equation.
Proof of Eq. (23). By Lemma 2, it suffices to prove that