An initialization strategy for addressing barren plateaus in parametrized quantum circuits

Parametrized quantum circuits initialized with random parameter values are characterized by barren plateaus, where the gradient becomes exponentially small in the number of qubits. In this technical note we theoretically motivate and empirically validate an initialization strategy which can resolve the barren plateau problem for practical applications. The technique involves randomly selecting some of the initial parameter values, then choosing the remaining values so that the circuit is a sequence of shallow blocks that each evaluates to the identity. This initialization limits the effective depth of the circuits used to calculate the first parameter update, so that they cannot be stuck in a barren plateau at the start of training. In turn, this makes some of the most compact ansätze usable in practice, which was not possible before even for rather basic problems. We show empirically that variational quantum eigensolvers and quantum neural networks initialized using this strategy can be trained using a gradient-based method.


Introduction
Parametrized quantum circuits have recently been shown to suffer from gradients that vanish exponentially in the number of qubits. This is known as the 'barren plateau' problem and has been demonstrated analytically and numerically [1]. The implication of this result is that for a wide class of circuits, random initialization will cause gradient-based optimization methods to fail. Resolving this issue is critical to the scalability of algorithms such as the variational quantum eigensolver (VQE) [2,3] and quantum neural networks (QNNs) [4,5,6].
In Ref. [7] the author shows that the barren plateau problem is not an issue of the specifically chosen parametrization, but rather extends the result to any direction in the tangent space of a point in the unitary group. In the Appendix we re-derive this result and show that the gradient of the scalar ⟨0| U(α)† H U(α) |0⟩ vanishes in expectation, for any Hermitian operator H and any direction on the unitary group. Similarly, the variance of the gradient decreases exponentially with the number of qubits.
Notably, this does not preclude the existence of a parametrization that would allow for efficient gradient-based optimization. Finding such a parametrization seems a non-trivial, but important, task. Indeed, as argued in Ref. [1], the barren plateau problem affects traditional ansätze such as the unitary coupled cluster, when initialized randomly, even for a small number of orbitals. The authors leave it as an open question whether the problem can be solved by employing alternative hardware-efficient ansätze.
In this technical note, we take an alternative route. Instead of proposing a new ansatz, we present a solution based on a specific way of initializing the parameters of the circuit. The strategy resolves the problem using a sequence of shallow unitary blocks that each evaluates to the identity. This limits the effective depth of the circuits used to calculate the gradient at the first iteration and allows us to efficiently train a variety of parametrized quantum circuits.

A quick recap of the barren plateau problem
Here we briefly recapitulate the original barren plateau problem and its generalization. Details of the derivation can be found in the Appendix. A parametrized quantum circuit can be described by a sequence of unitary operations

U(θ) = ∏_{l=1}^{L} U_l(θ_l) W_l,  (1)

where U_l(θ_l) = exp(−iθ_l V_l), θ_l is a real-valued parameter, V_l is a Hermitian operator, and W_l is a fixed unitary. The objective function of a variational problem can be defined as the expectation

E(θ) = ⟨0| U(θ)† H U(θ) |0⟩,  (2)

where H is a Hermitian operator representing the observable of interest. The partial derivatives take the form

∂_{θ_k} E = i ⟨0| U_−† [V_k, U_+† H U_+] U_− |0⟩,  (3)

where U_− := U_k(θ_k) W_k ⋯ U_1(θ_1) W_1 contains all gates up to and including the k-th rotation, and U_+ := U_L(θ_L) W_L ⋯ U_{k+1}(θ_{k+1}) W_{k+1} contains the remaining ones. If either U_− or U_+ matches the Haar distribution [8] up to the second moment, e.g., 2-designs [9], the expected number of samples required to estimate Eq. (3) is exponential in the system size.
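As a concrete sanity check of the commutator form of the gradient, the following numpy sketch (our own illustration; circuit sizes, gate choices, and all names are ours, not from the paper) builds a small circuit of Pauli rotations interleaved with fixed gates, evaluates E(θ) = ⟨0|U(θ)†HU(θ)|0⟩, and compares ∂_{θ_k}E = i⟨0|U_−†[V_k, U_+†HU_+]U_−|0⟩, with U_− containing all gates up to and including the k-th rotation, against a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 2, 4                      # qubits and layers (illustrative sizes)
N = 2 ** n

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])

def kron_at(op, k):
    """Embed a single-qubit operator on wire k of the n-qubit register."""
    out = np.array([[1.0 + 0j]])
    for w in range(n):
        out = np.kron(out, op if w == k else I2)
    return out

def rot(P, t):
    """exp(-i t P) in closed form, valid because P @ P = I for Pauli-type P."""
    return np.cos(t) * np.eye(N, dtype=complex) - 1j * np.sin(t) * P

# Generators V_l (random single-qubit Paulis) and fixed gates W_l.
Vs = [kron_at([X, Z][int(rng.integers(2))], int(rng.integers(n))) for _ in range(L)]
Ws = [rot(kron_at(X, int(rng.integers(n))), 0.3) for _ in range(L)]
H = kron_at(Z, 0) @ kron_at(Z, 1)        # observable, here Z (x) Z
theta = rng.uniform(0, 2 * np.pi, L)
psi0 = np.zeros(N, dtype=complex); psi0[0] = 1.0

def energy(theta):
    U = np.eye(N, dtype=complex)
    for l in range(L):                   # Eq. (1): apply W_l, then U_l(theta_l)
        U = rot(Vs[l], theta[l]) @ Ws[l] @ U
    return float(np.real(psi0.conj() @ U.conj().T @ H @ U @ psi0))   # Eq. (2)

def grad(theta, k):
    """Eq. (3): i <0| U_-^dag [V_k, U_+^dag H U_+] U_- |0>."""
    Um = np.eye(N, dtype=complex)
    for l in range(k + 1):               # U_-: gates 1..k (0-indexed here)
        Um = rot(Vs[l], theta[l]) @ Ws[l] @ Um
    Up = np.eye(N, dtype=complex)
    for l in range(k + 1, L):            # U_+: the remaining gates
        Up = rot(Vs[l], theta[l]) @ Ws[l] @ Up
    Heff = Up.conj().T @ H @ Up
    comm = Vs[k] @ Heff - Heff @ Vs[k]
    return float(np.real(1j * (psi0.conj() @ Um.conj().T @ comm @ Um @ psi0)))

k, eps = 2, 1e-6
tp = theta.copy(); tp[k] += eps
tm = theta.copy(); tm[k] -= eps
fd = (energy(tp) - energy(tm)) / (2 * eps)
assert abs(grad(theta, k) - fd) < 1e-5   # commutator form matches finite difference
```

Here exp(−iθP) is evaluated in closed form as cos(θ)I − i sin(θ)P, which is valid because every generator used satisfies P² = I.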
In the Appendix we re-derive the result of Ref. [7] for the more general barren plateau problem. Concretely, it is shown that the gradient in a direction Z in the tangent space of the unitary group at U(α), where U(0) = U and ∂_α U|_{α=0} = Z, vanishes in expectation over the Haar measure, i.e.,

E_U[∂_α E(U)] = 0.  (4)

Noting that the direction always takes the form Z = −iUM for some fixed Hermitian matrix M, it is further shown that the variance

Var_U[∂_α E(U)] = 2 ((M²)_00 − (M_00)²) (N Tr(H²) − Tr(H)²) / (N(N² − 1)),  (5)

with M_00 := ⟨0|M|0⟩, (M²)_00 := ⟨0|M²|0⟩, and N = 2^n the Hilbert-space dimension, becomes exponentially small in the number of qubits.
This leads us to believe that the choice of parametrization for quantum circuits is a highly non-trivial task that can determine the success of the variational algorithm. In the next Section, we describe a method which resolves the barren plateau problem for practical purposes. In Section 4 we give numerical evidence on two different use cases.

Initializing a circuit as a sequence of blocks of identity operators
Intuitively, the initialization strategy is as follows: we randomly select some of the initial parameter values and choose the remaining values in such a way that the result is a fixed unitary matrix, i.e., a deterministic outcome such as the identity. Additionally, we initially ensure that when taking the gradient with respect to any parameter, most of the circuit evaluates to the identity, which restricts its effective depth. This initialization strategy is designed to yield a non-zero gradient for most parameters in the first iteration. Obviously, this does not a priori guarantee that the algorithm stays far from the barren plateau. However, numerical results indicate that this is indeed the case and that this initialization strategy allows the circuit to be trained efficiently. This gives an immediate advantage over previously known methods, which generally do not allow any parameter to be trained without an exponential cost incurred through the required accuracy.
Concretely, to ensure that U_− and U_+ do not approach 2-designs, we initialize the circuit via M blocks, where each block consists of L layers followed by their adjoints. The depth L is chosen to be sufficiently small so that the blocks are shallow and cannot approach 2-designs. In the following we will consider any fixed gate, i.e., W_l in Eq. (1), as a parametrized one in order to simplify the presentation. For any m = 1, ..., M, the corresponding block has the form

U_m(θ^m) = U_1(θ^m_{1,2}) ⋯ U_L(θ^m_{L,2}) · U_L(θ^m_{L,1}) ⋯ U_1(θ^m_{1,1}).  (8)

While the initial parameter values θ^m_{l,1} can be chosen at random, the values θ^m_{l,2} are chosen such that U_l(θ^m_{l,2}) = U_l(θ^m_{l,1})†. Each block, and thus the whole circuit, then evaluates to the identity, i.e.,

U(θ_init) = ∏_{m=1}^{M} U_m(θ^m) = I.

It is important to choose each block U_m(θ^m) to be sufficiently deep to allow entanglement as training progresses. Yet each block should be sufficiently shallow so that U_m(θ^m), considered in isolation, does not approach a 2-design to the extent that the gradients would become impractically small.
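A minimal sketch of the identity-block construction (our own illustration; the specific layer structure of random single-qubit Pauli rotations followed by a fixed CZ chain is an assumption modeled on the experiments, and all names are ours): the first half of each block is drawn at random, the second half is its layer-by-layer adjoint, and the assembled circuit multiplies out to the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L, M = 3, 4, 2                        # qubits, layers per half-block, blocks
N = 2 ** n
I2 = np.eye(2, dtype=complex)
IN = np.eye(N, dtype=complex)
paulis = [np.array([[0, 1], [1, 0]], dtype=complex),      # X
          np.array([[0, -1j], [1j, 0]], dtype=complex),   # Y
          np.diag([1.0 + 0j, -1.0])]                      # Z

def kron_at(op, k):
    """Embed a single-qubit operator on wire k of the n-qubit register."""
    out = np.array([[1.0 + 0j]])
    for w in range(n):
        out = np.kron(out, op if w == k else I2)
    return out

def rot(P, t):                           # exp(-i t P), valid since P @ P = I
    return np.cos(t) * IN - 1j * np.sin(t) * P

CZ = IN.copy()                           # fixed entangling layer: CZ on neighbours
for q in range(n - 1):
    for b in range(N):
        if (b >> q) & 1 and (b >> (q + 1)) & 1:
            CZ[b, b] *= -1.0

def random_layer():
    """One layer: random single-qubit Pauli rotations, then the CZ chain."""
    R = IN
    for q in range(n):
        axis = paulis[int(rng.integers(3))]
        R = rot(kron_at(axis, q), float(rng.uniform(0, 2 * np.pi))) @ R
    return CZ @ R

U = IN
for _ in range(M):
    half = [random_layer() for _ in range(L)]
    for g in half:                       # first half: random shallow circuit
        U = g @ U
    for g in reversed(half):             # second half: layer-by-layer adjoints
        U = g.conj().T @ U

assert np.allclose(U, IN)                # the initialized circuit is the identity
```

Because fixed gates are treated as parametrized, taking the adjoint of each whole layer in reverse order realizes the choice U_l(θ^m_{l,2}) = U_l(θ^m_{l,1})† exactly.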
In Ref. [1] it was shown that the sampled variance of the gradient of a two-local Pauli term decreases exponentially as a function of the number of qubits, which also immediately follows from the barren plateau result. Furthermore, the convergence towards this fixed lower value of the variance was numerically shown to be a function of the circuit depth. This implies that for blocks U m (θ m ) of constant depth, the whole circuit U (θ init ) is not in a barren plateau, allowing us to estimate the gradient efficiently for the initial learning iteration.
The intuition behind the identity block strategy is the following: changing a single parameter in one block means that the other blocks still act like the identity. Therefore even if the whole circuit is deep enough to potentially be a 2-design, changing any parameter will yield a shallow circuit. Notably this holds only for the first training parameter update.
We now analyze in more depth the behaviour of the initialization at the level of each block. An interesting property is that the gradients for gates located away from the center of a block, e.g., at the beginning or at the end, will have a larger magnitude. The reason is that the circuits required for the estimation are further from being 2-designs since they are shallower. At initialization the whole circuit evaluates to the identity, so that U_+ = U_−†, and all blocks other than the m-th drop out; for a parameter θ^m_{k,1} in the first half of block m we have U_− = U_k(θ^m_{k,1}) ⋯ U_1(θ^m_{1,1}), and Eq. (3) reduces to

∂_{θ^m_{k,1}} E = i ⟨0| U_−† [V_k, U_− H U_−†] U_− |0⟩,

where we used that U_l(θ^m_{l,2}) = U_l(θ^m_{l,1})†. For a small index k the circuit U_− becomes shallow and hence the gradient is expected to be larger. A similar calculation can be done for the gradient with respect to the second set of parameters, i.e., ∂_{θ^m_{k,2}} U_m(θ^m); in this case the circuit becomes shallow for an index k close to L. To summarize, we expect parameters at the boundaries of the blocks to have gradients larger than those at the center. Notice that the gradient can still be zero if H commutes with the gate, which is a consequence of Eq. (3) and Eq. (4). However, this is generally unlikely, and can be resolved by applying a small deterministic entangling circuit. More concretely, to avoid such cases we can add a shallow entangling layer B to the circuit, i.e., U_M ⋯ U_1 B, where the U_i are the blocks described above. This layer is also important for the training of variational quantum eigensolvers, which we discuss in more detail in the next Section.

Initializing a parametrized quantum circuit
In this experiment we show the scaling of the variance of the gradient as a function of the number of qubits, for both the random initialization and the identity block strategy. For the random circuit, we use the same ansatz as in Ref. [1], Fig. 2, and the same ZZ observable. Their ansatz consists of layers of single-qubit rotations exp(−iθ_l V_l) about randomly selected axes V_l ∈ {X, Y, Z}, followed by nearest-neighbor controlled-Z gates. We used a total of 120 layers of this kind. For the identity block initialization, we employed a single block as described by Eq. (8) with M = 1 and L = 60. This setting also accounts for a total of 2LM = 120 layers. In both cases, initial values for the free parameters were drawn from the uniform distribution unif(0, 2π).
In Fig. 1 (a) we compare the variance of the partial derivative ∂_{θ^1_{1,1}} E, i.e., the gradient with respect to the first element of the first part of the first block, as a function of the number of qubits n, when the circuit is applied to the input state H^{⊗n} |0⟩ obtained by applying a Hadamard gate to every qubit. Each point in the Figure was computed from 200 circuits. When using the random initialization, the variance decreased exponentially with the number of qubits, reproducing the plateau behaviour described in Ref. [1]. In contrast, the variance of the circuit initialized as an identity block was invariant to the system size.
In Fig. 1 (b) we again compare the variance of ∂_{θ^1_{1,1}} E as a function of system size, when circuits are applied to random MNIST images, downsampled and normalized such that they constitute valid quantum states. This type of encoding is known as amplitude encoding and represents a realistic scenario for computer vision tasks performed on a quantum computer. Each point in the Figure was computed from 200 circuits. Similar to the previous experiment, the variance of the gradient vanished with the system size when using the random initialization. In contrast, for circuits using the identity block initialization, the variance did not vanish exponentially with the system size, showing that the plateau was avoided at initialization.
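At toy scale, the comparison can be sketched as follows (our own illustration; sizes, sample counts, and helper names are ours and far smaller than the 120-layer, 200-circuit setting of the text, and for simplicity the purely random circuit here has L layers while the block circuit has 2L). For each trial we draw a fresh random circuit, estimate the first partial derivative by a central finite difference, and accumulate it both for the random circuit and for the identity-block circuit obtained by appending the frozen adjoint half.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, trials, eps = 4, 8, 100, 1e-5      # toy sizes (see caveat in the text above)
N = 2 ** n
I2 = np.eye(2, dtype=complex)
IN = np.eye(N, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
paulis = [X, Y, Z]

def kron_at(op, k):
    out = np.array([[1.0 + 0j]])
    for w in range(n):
        out = np.kron(out, op if w == k else I2)
    return out

def rot(P, t):                           # exp(-i t P), valid since P @ P = I
    return np.cos(t) * IN - 1j * np.sin(t) * P

CZ = IN.copy()                           # chain of nearest-neighbour CZ gates
for q in range(n - 1):
    for b in range(N):
        if (b >> q) & 1 and (b >> (q + 1)) & 1:
            CZ[b, b] *= -1.0

def apply_layers(params, axes):
    """Layers of single-qubit Pauli rotations followed by the CZ chain."""
    U = IN
    for th_l, ax_l in zip(params, axes):
        R = IN
        for q in range(n):
            R = rot(kron_at(paulis[ax_l[q]], q), th_l[q]) @ R
        U = CZ @ R @ U
    return U

H = kron_at(Z, 0) @ kron_at(Z, 1)        # ZZ observable on the first two wires
plus = np.full(N, N ** -0.5, dtype=complex)   # Hadamard layer applied to |0...0>

def expval(U):
    return float(np.real(plus.conj() @ U.conj().T @ H @ U @ plus))

g_rand, g_block = [], []
for _ in range(trials):
    axes = rng.integers(0, 3, size=(L, n))
    th = rng.uniform(0, 2 * np.pi, size=(L, n))
    adj = apply_layers(th, axes).conj().T     # frozen adjoint half: block = identity
    for grads, efn in ((g_rand, lambda p: expval(apply_layers(p, axes))),
                       (g_block, lambda p: expval(adj @ apply_layers(p, axes)))):
        tp = th.copy(); tp[0, 0] += eps
        tm = th.copy(); tm[0, 0] -= eps
        grads.append((efn(tp) - efn(tm)) / (2 * eps))

print(np.var(g_rand), np.var(g_block))
```

Repeating this for growing n reproduces the qualitative picture of Fig. 1: the random-initialization variance shrinks with system size while the identity-block variance does not.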

Training a quantum neural network classifier
We have shown both analytically and empirically that a circuit with identity block initialization does not suffer from the barren plateau problem in the first training iteration. In this experiment we examine the variance of the gradient during training time to test whether a quantum neural network (QNN) classifier for MNIST images approaches the plateau.
We used a 10-qubit circuit with M = 2 identity blocks, each having L = 33 layers, for a total of 2LM = 132 layers. We selected N = 700 MNIST images at random, resized them to 32 × 32, and finally reshaped them into vectors of dimension 2^10 = 1024. We normalized each vector to unit length in order to use it as input to the circuit. Labels were set to y_i = 1 for images of 'even' digits, and y_i = 0 for images of 'odd' digits. For each MNIST example ψ_i, classification was performed by executing the circuit, measuring the observable ZZ, and finally ascribing a predicted probability to each class such that P(even|ψ_i) = (⟨ψ_i| U(θ)† ZZ U(θ) |ψ_i⟩ + 1)/2 and P(odd|ψ_i) = 1 − P(even|ψ_i). The training was performed on 200 different initial circuits, each constituting a different trial. Optimization was performed using the Adam optimizer [10] with a learning rate of 0.001 and a single randomly selected MNIST example used for each update.
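The encoding and read-out just described can be sketched as follows (our own illustration; a random pixel vector stands in for a resized MNIST image, and the circuit is taken at its identity initialization so that U(θ) drops out).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                                   # 2**10 = 1024 amplitudes, as in the text
N = 2 ** n

# Stand-in for a flattened 32x32 image (pixel values in [0, 1]).
pixels = rng.uniform(0.0, 1.0, N)
psi = pixels / np.linalg.norm(pixels)    # amplitude encoding: unit-norm state

# ZZ observable on the first two qubits: diagonal with entries (-1)^(b0 + b1).
b = np.arange(N)
zz = (-1.0) ** (((b >> (n - 1)) & 1) + ((b >> (n - 2)) & 1))

# At the identity initialization U(theta) = I, so the expectation is immediate.
expval = float(np.sum(zz * np.abs(psi) ** 2))
p_even = 0.5 * (expval + 1.0)            # P(even | psi), as defined in the text
p_odd = 1.0 - p_even

assert 0.0 <= p_even <= 1.0 and abs(p_even + p_odd - 1.0) < 1e-12
```

Since the ZZ expectation lies in [−1, 1], the affine map (⟨ZZ⟩ + 1)/2 always yields a valid probability.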
Figure 2 (a) shows the mean accuracy and standard deviation on a binarized MNIST dataset as a function of the training iterations for a circuit initialized using identity blocks, compared with a strategy where all parameters are initially set to zero. While both strategies result in a circuit that initially evaluates to the identity, initializing all parameters to zero made the training much less efficient. Figure 2 (b) shows the variance of the partial derivative for parameters associated with the first three qubits, across different trials and as a function of training iterations, for circuits initialized with identity blocks. From the figures we observe that the model does not get stuck in a barren plateau (red dashed line), and that the variance decreases only as the model converges to a minimum of the objective function.

Figure 2: (a) Training accuracy and standard deviation as a function of the number of training iterations for an MNIST classifying circuit initialized using two identity blocks, compared with a circuit where all parameters are initially set to zero, and (b) variance across trials of the partial derivatives for parameters associated with the first three qubits in the circuit using the identity block initialization method. The circuit initialized using identity blocks trained successfully in all trials and never encountered the barren plateau (red dashed line).

Training a variational quantum eigensolver
In this experiment we use the identity block strategy to train a variational quantum eigensolver (VQE) to find ground state energies. We chose the 7-qubit Heisenberg model on a 1D lattice with periodic boundary conditions and in the presence of an external magnetic field. The corresponding Hamiltonian reads

H = J ∑_{(i,j)∈E} (X_i X_j + Y_i Y_j + Z_i Z_j) + h ∑_{i∈V} Z_i,

where G = (V, E) is the undirected graph of the lattice with 7 nodes, J expresses the strength of the spin-spin interactions, and h corresponds to the strength of the magnetic field in the Z-direction. In the experiments we set J = h = 1.
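For reference, this Hamiltonian can be built explicitly and diagonalized at n = 7 (our own illustration; dense diagonalization gives the exact ground-state energy that the VQE is trained to approximate).

```python
import numpy as np

n, J, h = 7, 1.0, 1.0                    # 7 qubits, J = h = 1 as in the experiment
N = 2 ** n
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])

def embed(ops):
    """Tensor product placing ops[k] (or identity) on wire k."""
    out = np.array([[1.0 + 0j]])
    for w in range(n):
        out = np.kron(out, ops.get(w, I2))
    return out

H = np.zeros((N, N), dtype=complex)
for i in range(n):                       # periodic 1D lattice: edges (i, i+1 mod n)
    j = (i + 1) % n
    for P in (X, Y, Z):                  # spin-spin interaction terms P_i P_j
        H += J * embed({i: P, j: P})
    H += h * embed({i: Z})               # external field in the Z-direction

assert np.allclose(H, H.conj().T)        # the Hamiltonian is Hermitian
e0 = float(np.linalg.eigvalsh(H)[0])     # exact ground-state energy (VQE target)
```

At 7 qubits the matrix is only 128 × 128, so exact diagonalization is cheap and provides the reference value against which the trained VQE energy can be compared.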
We chose this setting because for J/h = 1 the ground state is highly entangled (see also Ref. [3] for VQE simulations on the Heisenberg model).
Notice that the identity block initialization by itself cannot generate an entangled state at the very beginning. This could result in degraded performance for VQE. Hence, as the input state we chose |ψ = B |0 , where B consists of 7 layers of random single-qubit rotations and controlled-Z gates. These gates are never updated during training, and their purpose is to provide some level of entanglement even at the beginning. Training was performed using the Adam optimizer with a learning rate of 0.001, until convergence.
Figure 3 shows the variance of the partial derivatives ∂_θ E across 200 trials for (a) a circuit initialized using M = 2 identity blocks and L = 33 layers per block, and (b) a randomly initialized circuit with the same total number of layers. In (b) we observe a barren plateau for all parameters, while in (a) we observe that most of the variances are well above the plateau. As expected, within each identity block the variance increases with distance from the center.

Conclusion
In this technical note we motivated and demonstrated a practical initialization strategy that addresses the problem of barren plateaus in the energy landscape of parametrized quantum circuits. In the experiments we conducted, the identity block strategy enabled successful training in two tasks: finding ground state energies with a variational quantum eigensolver (VQE), and classifying images with a quantum neural network (QNN).
More work is needed to assess the impact of input states and data encoding methods. In the case of VQEs, the strategy does not initially allow the circuit to generate an entangled state. We resolved this by adding a shallow entangling layer that is fixed throughout training. In the case of QNNs, the encoded input data can already be highly entangled, thereby reducing the depth of the circuit where the plateau problem occurs. From these examples we conclude that there is a problem-dependent trade-off to be analyzed.
Finally, our approach is solely based on the initialization of the parameter values. There are other potential strategies for avoiding barren plateaus, such as layer-wise training, regularization, and imposing structural constraints on the ansatz. Understanding more about the relative merits of these and other approaches is a topic for future work.

We thank Raban Iten and Dominic Verdon for helpful technical discussions. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Appendix
Here we provide a brief derivation of the vanishing gradient problem for the unitary group [7].

Vanishing gradient
For a Hermitian H ∈ C^{N×N} and a normalized state |0⟩ ∈ C^N, we consider the function E(U) = ⟨0| U† H U |0⟩ for U ∈ U(N), where U(N) denotes the unitary group of dimension N. In the following, we calculate the derivative of E(U) at a unitary U in direction Z, where Z lies in the tangent space of the unitary group at the point U. To do so, we choose a path U(α) such that U(0) = U and ∂_α U|_{α=0} = Z. We have

∂_α E(U(α))|_{α=0} = ⟨0| Z† H U + U† H Z |0⟩.

We assume that Z in the tangent space at U has the form Z = iUM for some Hermitian operator M, since it is easy to see that every Z of this form is in the tangent space at U and that every tangent vector at U can be written in this form. Then, we have

∂_α E = i ⟨0| U†HU M − M U†HU |0⟩ = i ⟨0| [U†HU, M] |0⟩.

Now, we would like to calculate the expectation of the gradient over the whole unitary group. For this, we fix the Hermitian matrix M and find

∫ dμ(U) ∂_α E = i ⟨0| [ (Tr(H)/N) I, M ] |0⟩ = 0,

where we have used that for the Haar measure μ(U) on the unitary group it holds that ∫ dμ(U) U O U† = (Tr(O)/N) I, see [8]. Notably, we can initialize the matrix U to be the identity for a fixed H, which could for example be achieved by just taking half the depth of the initial parametrized circuit, U_{1/2}, and then appending the adjoint U_{1/2}†. The full initial circuit U = U_{1/2}† U_{1/2} = I is then always the identity, i.e., constant. Plugging the identity into the gradient, we then obtain

∂_α E |_{U=I} = i ⟨0| [H, M] |0⟩,  (22)

which is zero only when ⟨0|[H, M]|0⟩ vanishes, e.g., whenever the observable H commutes with the generator M of the direction, which is generally not the case. Note that this insight also holds for any other identity initialization, such as the block initialization introduced in the body of the paper.
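The first-moment identity ∫dμ(U) U O U† = (Tr(O)/N) I used above can be checked numerically by averaging over sampled Haar-random unitaries, generated via the QR decomposition of a complex Ginibre matrix with the usual phase correction (our own illustration; the tolerance is chosen loosely to account for statistical error).

```python
import numpy as np

rng = np.random.default_rng(4)
N, samples = 4, 20000

def haar_unitary():
    """Haar-random U(N) sample: QR of a complex Ginibre matrix, phases fixed."""
    G = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(G)
    d = np.diag(R)
    return Q * (d / np.abs(d))           # scale column j by the phase of R_jj

O = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))   # fixed O
avg = np.zeros((N, N), dtype=complex)
for _ in range(samples):
    U = haar_unitary()
    avg += U @ O @ U.conj().T
avg /= samples

target = (np.trace(O) / N) * np.eye(N)   # first-moment identity, see [8]
assert np.linalg.norm(avg - target) < 0.2
```

The phase correction on the QR factors is what makes the distribution exactly Haar rather than merely unitary-valued; without it the sample average would still converge to the same first moment, but the fix matters for higher moments such as Lemma 2.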
Note that trainable gates often take the form exp(−iα_j V_j). If the V_j's are chosen at random from tensor products of Pauli matrices {I, Z, X, Y}^⊗n, then with high probability at least one of the derivatives is non-zero unless H is the identity, see Eq. (22). In light of the initialization strategy, it is worth noting that initializing the circuit as U U† hence does not by itself guarantee that at least one derivative is non-zero.

Vanishing variance
We start with a simple identity.

Lemma 1. For the rank-one projector ρ = |0⟩⟨0| and any matrix A ∈ C^{N×N} it holds that Tr(ρA)² = Tr(ρAρA).

Proof. The proof follows from entry-wise evaluation.
We further need the following identity for the second moments in the proof.

Lemma 2 ([8]). For ∫dμ(U) being the integral over the unitary group with respect to the Haar measure, it holds that

∫ dμ(U) U_{ij} U_{kl} Ū_{i'j'} Ū_{k'l'} = (δ_{ii'} δ_{kk'} δ_{jj'} δ_{ll'} + δ_{ik'} δ_{ki'} δ_{jl'} δ_{lj'}) / (N² − 1) − (δ_{ii'} δ_{kk'} δ_{jl'} δ_{lj'} + δ_{ik'} δ_{ki'} δ_{jj'} δ_{ll'}) / (N(N² − 1)).

First observe that, since the expectation of the gradient vanishes, we can explicitly evaluate the variance as

Var(∂_α E) = ∫ dμ(U) ( i ⟨0| [U†HU, M] |0⟩ )² = −∫ dμ(U) ( ⟨0| U†HU M |0⟩² + ⟨0| M U†HU |0⟩² − 2 ⟨0| U†HU M |0⟩ ⟨0| M U†HU |0⟩ ).

Note that here we used the fact that the square of the trace is the trace of the square (Lemma 1), since ρ = |0⟩⟨0| is a rank-one matrix, i.e., a projector.
We can now proceed by evaluating the expectation of each term individually. As an example we calculate the first term, since the remaining terms can be evaluated in a similar fashion. Using Lemma 2, for arbitrary matrices A, B, C we find

∫ dμ(U) [U†AU B U†CU]_{ij}
= ∑_{k,l,m,n,p,q} A_{nm} B_{pq} C_{lk} [ (δ_{mn} δ_{kl} δ_{pi} δ_{jq} + δ_{ml} δ_{kn} δ_{pq} δ_{ji}) / (N² − 1) − (δ_{mn} δ_{kl} δ_{pq} δ_{ji} + δ_{ml} δ_{kn} δ_{pi} δ_{jq}) / (N(N² − 1)) ]
= [Tr(A) Tr(C) B_{ij} + Tr(AC) Tr(B) δ_{ij}] / (N² − 1) − [Tr(A) Tr(C) Tr(B) δ_{ij} + Tr(AC) B_{ij}] / (N(N² − 1)).  (31)

Plugging A = C = H and B = M |0⟩⟨0| into Eq. (31), and contracting with M, then yields the first term. Doing similar calculations for the other terms (using Eq. (31) with different A, B and C) and canceling and summarizing terms yields the variance

Var(∂_α E) = 2 ((M²)_00 − (M_00)²) (N Tr(H²) − Tr(H)²) / (N(N² − 1)),

where M_00 := ⟨0|M|0⟩ and (M²)_00 := ⟨0|M²|0⟩. For observables with Tr(H) = 0 and Tr(H²) ∈ O(N), such as Pauli strings, the prefactor is of order 1/N = 2^{−n}. This indicates that the variance indeed decreases exponentially with the number of qubits.
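The closed form for the variance can be validated numerically (our own illustration; all names are ours): sample Haar-random unitaries, evaluate the directional derivative ∂_α E = i⟨0|[U†HU, M]|0⟩ for a fixed Hermitian M (direction Z = iUM), and compare the sample variance against the formula.

```python
import numpy as np

rng = np.random.default_rng(5)
N, samples = 4, 40000

def haar_unitary():
    """Haar-random U(N) via QR of a complex Ginibre matrix with phase fix."""
    G = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(G)
    d = np.diag(R)
    return Q * (d / np.abs(d))

def rand_herm():
    A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
    return (A + A.conj().T) / 2

H, M = rand_herm(), rand_herm()
e0 = np.zeros(N, dtype=complex); e0[0] = 1.0

grads = []
for _ in range(samples):
    U = haar_unitary()
    A = U.conj().T @ H @ U               # conjugated observable U^dag H U
    grads.append(float(np.real(1j * (e0.conj() @ (A @ M - M @ A) @ e0))))

M00 = float(np.real(M[0, 0]))
M2_00 = float(np.real((M @ M)[0, 0]))
formula = (2.0 * (M2_00 - M00 ** 2)
           * (N * np.trace(H @ H).real - np.trace(H).real ** 2)
           / (N * (N ** 2 - 1)))

assert abs(np.var(grads) - formula) / formula < 0.1   # statistical tolerance
```

The sample mean of the gradients also comes out near zero, consistent with Eq. (4); the variance check is insensitive to the overall sign convention of the direction Z.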