Effect of barren plateaus on gradient-free optimization

Barren plateau landscapes correspond to gradients that vanish exponentially in the number of qubits. Such landscapes have been demonstrated for variational quantum algorithms and quantum neural networks with either deep circuits or global cost functions. For obvious reasons, it is expected that gradient-based optimizers will be significantly affected by barren plateaus. However, whether or not gradient-free optimizers are impacted is a topic of debate, with some arguing that gradient-free approaches are unaffected by barren plateaus. Here we show that, indeed, gradient-free optimizers do not solve the barren plateau problem. Our main result proves that cost function differences, which are the basis for making decisions in a gradient-free optimization, are exponentially suppressed in a barren plateau. Hence, without exponential precision, gradient-free optimizers will not make progress in the optimization. We numerically confirm this by training in a barren plateau with several gradient-free optimizers (the Nelder-Mead, Powell, and COBYLA algorithms), and show that the number of shots required in the optimization grows exponentially with the number of qubits.

Without effort to avoid barren plateaus, this phenomenon can have a major impact on the scaling of one's algorithm. Specifically, the exponential suppression of the gradient implies that one would need exponential precision to make progress in the optimization, consequently causing one's algorithm to scale exponentially in the number of qubits. The standard goal of quantum algorithms is polynomial scaling in the problem size, in contrast to the exponential scaling of their classical counterparts. Hence, the exponential scaling due to barren plateaus could erase the possibility of a quantum speedup with a parametrized quantum circuit. It is therefore crucial to study barren plateaus in VQAs and QNNs in order to understand when quantum speedup is possible.
This has spawned an important research direction of finding strategies to avoid barren plateaus. Some examples include employing local cost functions [20], modifying the architecture [25,26], pre-training [34], parameter correlation [35], layer-by-layer training [36], and initializing layers to the identity [37]. These strategies are promising. However, more analytical and numerical studies are needed to understand how effective they are in general, for example, as in Ref. [38].
One possible strategy to consider is the choice of optimizer. It is widely believed that gradient-based optimizers will be directly impacted by barren plateaus, for obvious reasons. Moreover, higher-order derivatives are also exponentially suppressed in a barren plateau [23], so optimizers based on such derivatives will also be impacted. Nevertheless, there still remains the question of whether gradient-free optimizers could somehow avoid the barren plateau problem. This is currently a topic of debate [20,28]. The question is naturally made subtle by the fact that gradient-free optimizers can potentially use global information about the landscape, rather than being restricted to using local gradient information.
In this work, we present an analytical argument suggesting that gradient-free approaches will, indeed, be impacted by barren plateaus. Specifically, we show that cost function differences, C(θ_B) − C(θ_A), will be exponentially suppressed in a barren plateau. This holds even when the points θ_A and θ_B are not necessarily close in parameter space. Gradient-free optimizers use such cost function differences to make decisions during the optimization. Hence, our results imply that such optimizers will either need to spend exponentially large resources to characterize cost function differences, or else these optimizers will not make progress in the optimization.
We confirm our analytical results with numerical simulations involving several gradient-free optimizers: Nelder-Mead, Powell, and COBYLA. For each of these optimizers, we attempt to train a deep parametrized quantum circuit, corresponding to the barren plateau scenario in Ref. [19]. In all cases, we find that the number of shots (i.e., the amount of statistics) required to begin to train the cost function grows exponentially in the number of qubits. This is the same behavior that one sees for gradient-based methods, and is a hallmark of the barren plateau phenomenon.

Theoretical Background
Here we provide background needed to understand our results. We first consider the cost function used to train parameterized quantum circuits. Then we consider optimizers that can be used to optimize this cost function, with a specific focus on gradient-free optimizers. Finally, we give background on the barren plateau phenomenon.

Cost function
Consider a parameterized quantum circuit V(θ), whose parameters will be trained by minimizing a cost function C(θ). In this work, we consider a highly general cost function that can be expressed in the form

C(θ) = Σ_{x=1}^{S} f_x(V(θ) ρ_x V(θ)†) .   (1)

Here, θ is a vector of m continuous parameters, {ρ_x}_{x=1}^{S} are n-qubit input quantum states from a training set of size S, and the f_x are functions that encode the problem and can be different for each input state.
To ensure algorithmic efficiency, we assume that the number m of parameters in θ is in O(poly(n)). In addition, we assume that any θ_µ ∈ θ parametrizes a unitary of the form e^{−iθ_µ H_µ}, where H_µ is a Hermitian operator with two distinct non-zero eigenvalues (e.g., H_µ could be a Pauli operator).
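As a small sanity check of this parametrization, the sketch below (an illustration only, with Pauli X standing in for a generator H_µ) verifies the closed form e^{−iθH} = cos(θ)·1 − i·sin(θ)·H, which holds whenever H² = 1:

```python
import numpy as np
from scipy.linalg import expm

# Pauli X as an example generator H_mu (two distinct eigenvalues, +1 and -1)
X = np.array([[0, 1], [1, 0]], dtype=complex)

theta = 0.7
U = expm(-1j * theta * X)                                      # e^{-i theta H}
U_closed = np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * X  # closed form for H^2 = 1

assert np.allclose(U, U_closed)  # the two expressions agree
```

This is also why such parameters can be treated as angles: with eigenvalues ±1, the unitary is 2π-periodic in θ_µ.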
The goal is then to solve the optimization problem

θ_opt = arg min_θ C(θ) .   (2)

This involves choosing an optimizer, which can either be a gradient-based or a gradient-free optimizer. Various gradient-based approaches [41–44] have been proposed for training parameterized quantum circuits, and these will be directly impacted by barren plateaus. Optimizers employing higher-order derivatives are also impacted by barren plateaus [23]. In this work we consider the case when one employs a gradient-free optimization method. In the next section we review some widely used gradient-free optimizers.

Gradient-Free Optimizers
We will refer to any optimization method that only accesses a zeroth-order oracle (i.e., that does not directly access derivative information) as gradient-free. This is a very large class of methods, but all of them depend on being able to distinguish cost function values at different points. Though our analytical results apply to any such optimizer, we now introduce three particular gradient-free optimizers that we will examine numerically: Nelder-Mead, Powell's method, and COBYLA.

Nelder-Mead
One popular gradient-free optimization strategy is the Nelder-Mead algorithm [45]. In this approach, one constructs a simplex in the space to be optimized over. One then modifies it with a sequence of reflect, expand, contract, and shrink operations to move the simplex and then shrink it around the minimum. These operations are chosen based on conditional comparisons of the cost function values at each vertex as well as at proposed new vertices. See Figure 1a for an illustration of these operations. When used in an environment where the errors in those cost function values are large enough to cause mistakes in these comparisons, however, this algorithm is vulnerable to performing shrink operations prematurely, which slows the optimization down and may lead to a false appearance of convergence [46]. Due to this difficulty, one would expect the number of iterations required for Nelder-Mead to converge to be especially bad in limited-precision environments, though we note that there are a number of modifications that attempt to improve the method's robustness to noise [46,47].
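For reference, here is a minimal sketch of running SciPy's Nelder-Mead implementation on a toy two-parameter quadratic (an illustrative stand-in for C(θ), not the quantum cost itself); in the noiseless setting the simplex contracts cleanly around the minimum:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for C(theta): smooth, noiseless, minimized at theta = (1, 1)
def cost(theta):
    return np.sum((theta - 1.0) ** 2)

res = minimize(cost, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
print(res.x)  # close to [1, 1]
```

If shot noise were added to `cost`, vertex comparisons would become unreliable once the noise exceeds the cost differences between vertices, which is precisely the failure mode discussed above.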

Powell's Method
The Powell algorithm [48] is another popular gradient-free optimizer that performs sequential line searches. This method starts with some input set of search vectors V = {v_i}, usually just the coordinate directions in the parameter space. Searching along each of these directions in sequence, the method finds the displacement a_i along the current direction that minimizes the cost when only varying parameters along that direction. Finding the displacements {a_i} is typically done with Brent's parabolic interpolation method [49], though in principle one could use any univariate gradient-free optimizer.
After sweeping through all of the search vectors, the iteration is completed by taking the search vector v_j that corresponds to the greatest displacement, a_j = max_i(a_i), and replacing it with the direction of the net displacement accumulated over the sweep, Σ_i a_i v_i. By making this replacement, convergence is accelerated and the method avoids getting stuck in a cyclic pattern of updates. See Figure 1b for a sketch of two iterations of this method.
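A minimal sketch of Powell's method via SciPy, on an illustrative anisotropic quadratic whose narrow valley is not aligned with the coordinate axes, which is exactly the situation where the direction-replacement step pays off:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative cost with a narrow valley not aligned with the coordinate axes
def cost(theta):
    return (theta[0] - 2.0) ** 2 + 10.0 * (theta[0] + theta[1]) ** 2

res = minimize(cost, x0=np.zeros(2), method="Powell")
print(res.x)  # close to [2, -2]
```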

COBYLA
Constrained Optimization BY Linear Approximation (COBYLA) is another popular gradient-free optimizer by Powell [50]. This algorithm constructs a simplex and uses its m + 1 points in the parameter space, with m being the number of parameters, to define a hyperplane that captures the local slope of the cost function. The algorithm then replaces the point on the simplex with the highest cost function value by stepping from the lowest-cost point along the direction of the slope. The method steps as far as possible along this estimated slope while staying within a lower bound on the radius of the trust region. The lower bound on the size of the trust region is decreased when the algorithm detects that it has stopped making progress, allowing the method to converge. Note, however, that the size of the trust region never increases in COBYLA. An iteration of this method (showing a shrinking trust region) is sketched in Figure 1c.

Figure 1: Graphical depiction of the gradient-free optimizers considered. Panel a shows the different operations that the Nelder-Mead algorithm performs on the initial (grey) simplex: reflection (red), reflection with expansion (blue), reflection with contraction (green), and shrinking (turquoise). Panel b shows two iterations of the Powell method (black for the first iteration and red for the second). Note that the direction of the final step is changed, reflecting a modified search direction. Finally, panel c shows an illustration of a COBYLA step from an initial (grey) simplex. After fitting a plane to the initial simplex, the method steps along the fitted slope to form a new simplex (red). The trust region is shown as a solid blue circle. A smaller trust region, which might be used later in the optimization, is illustrated with the dashed blue circle (though for this particular step the trust region would likely not be contracted).
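For completeness, a short sketch of COBYLA via SciPy on a toy constrained problem (both the cost and the constraint are illustrative, not from the paper); `rhobeg` sets the initial trust-region radius and `tol` the final one:

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta):
    return (theta[0] - 1.0) ** 2 + (theta[1] - 1.0) ** 2

# COBYLA handles inequality constraints of the form fun(theta) >= 0
cons = [{"type": "ineq", "fun": lambda t: 1.0 - t[0] - t[1]}]

res = minimize(cost, x0=np.zeros(2), method="COBYLA",
               constraints=cons, tol=1e-8, options={"rhobeg": 0.5})
print(res.x)  # close to [0.5, 0.5], on the constraint boundary
```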

Barren Plateaus
When the cost function exhibits a barren plateau, the cost function gradient vanishes exponentially with the system size. Without loss of generality we consider here the following generic definition of a barren plateau.

Definition 1 (Barren Plateau). Consider the cost function defined in Eq. (1). This cost exhibits a barren plateau if, for all θ_µ ∈ θ, the expectation value of the cost function partial derivative ∂_µ C(θ) ≡ ∂C(θ)/∂θ_µ vanishes,

E_θ[∂_µ C(θ)] = 0 ,   (3)

and its variance vanishes exponentially with the number of qubits n as

Var_θ[∂_µ C(θ)] ≤ F(n) ,  with F(n) ∈ O(1/b^n) ,   (4)

for some b > 1. As indicated, the expectation values are taken over the parameters θ.
We remark here that, as shown in Definition 1, the barren plateau phenomenon is a probabilistic statement. In fact, from Chebyshev's inequality we know that Var_θ[∂_µ C(θ)] bounds the probability that the cost function partial derivative deviates from its mean of zero as

Pr(|∂_µ C(θ)| ≥ c) ≤ Var_θ[∂_µ C(θ)] / c² ,   (5)

for any c > 0. In practice this means that, by randomly initializing the parameters θ, there is a high probability that one ends up in a flat region of the landscape where the gradients are exponentially suppressed.
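This bound is easy to check numerically. The sketch below uses a zero-mean Gaussian as a stand-in distribution for the sampled partial derivative (the Gaussian is an assumption for illustration only; Chebyshev's inequality holds for any distribution with the given variance):

```python
import numpy as np

rng = np.random.default_rng(0)

var, c = 1e-4, 0.05        # illustrative variance Var[dC] and threshold c
samples = rng.normal(0.0, np.sqrt(var), size=100_000)

empirical = np.mean(np.abs(samples) >= c)   # empirical Pr(|dC| >= c)
chebyshev_bound = var / c**2                # Chebyshev: Pr(|dC| >= c) <= Var / c^2

print(empirical, chebyshev_bound)
assert empirical <= chebyshev_bound
```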
Let us now discuss different mechanisms that can lead to barren plateaus in the cost function landscape. As shown in the seminal work of Ref. [19], deep random unstructured circuits which form 2-designs on n qubits will exhibit barren plateaus. Here we use the term deep when the depth of the ansatz is in O(poly(n)). For instance, as shown in [51–53], local circuits will form 2-designs when their depth is in O(poly(n)).
The barren plateau phenomenon was extended in [20] to a type of shallow depth ansatz known as the layered hardware efficient ansatz, where random local gates act on alternating pairs of neighboring qubits in a brick-like structure. Here it was shown that the locality of the cost function can be linked to its trainability. Specifically, global cost functions (those where one compares operators living in exponentially large Hilbert spaces) exhibit barren plateaus for any circuit depth. On the other hand, it was shown that local cost functions (where one compares operators on an individual qubit level) are trainable when the ansatz depth is in O(log(n)), as here their gradients vanish at worst polynomially (rather than exponentially) with the system size.
Barren plateaus have also been shown to arise in more general QNN architectures [21,28]. In perceptron-based QNNs with hidden and visible layers, connecting a large number of qubits in different layers with random global perceptrons (and hence highly entangling them) can lead to exponentially vanishing gradients. These results have shown that the barren plateau phenomenon is a generic problem that can arise in multiple architectures for quantum machine learning.
Finally, in [22] a noise-induced barren plateau mechanism was found. Here it was proven that the presence of noise acting before and after each unitary layer in a parametrized quantum circuit leads to exponentially vanishing gradients for circuits with linear or super-linear depth. The underlying mechanism here is that the state gets corrupted due to noise, leading to a flattening of the whole cost landscape. This phenomenon is conceptually different from the previous barren plateaus, as here one does not average over the parameters θ. Nevertheless, the noise-induced barren plateau still satisfies Definition 1, which is a weaker condition.

Main Results
In this section we first present our main analytical results in the form of Proposition 1 and Corollary 1. We then discuss the implications for employing gradient-free optimizers in a barren plateau.

Exponentially suppressed cost differences
Here we consider two relevant scenarios in which we analyze how large, on average, the difference ∆C = C(θ_B) − C(θ_A) between two points in the landscape can be. First we consider the case where θ_A and θ_B are not independent, but rather θ_B is obtained from θ_A through a given translation in parameter space. We then analyze the case where θ_A and θ_B are independent.
The following proposition constitutes the main result of our work. The proof is presented in the Appendix.
Proposition 1. Consider the cost function of Eq. (1), and let θ_B = θ_A + L ε̂, where θ_A is a randomly chosen point in parameter space, L is a fixed distance, and ε̂ is a fixed unit vector. If the cost exhibits a barren plateau according to Definition 1, then the expectation value of the difference ∆C = C(θ_B) − C(θ_A) is

E_θA[∆C] = 0 ,   (6)

and the variance is exponentially vanishing with n as

Var_θA[∆C] ≤ G(n) ,   (7)

with

G(n) = m² L² F(n) ,   (8)
and G(n) ∈ O(1/b^n)   (9)

for some b > 1. Here m is the dimension of the parameter space, and F(n) was defined in (4).
Let us here recall that we have assumed that m ∈ O(poly(n)). Similarly, we have that θ µ parametrizes a unitary generated by a Hermitian operator H µ with two distinct non-zero eigenvalues. From the latter it then follows that L is always in O(poly(n)), and hence that G(n) ∈ O(1/b n ).
From the previous results one can readily evaluate the case when θ_B and θ_A are independent. This case is of relevance to global optimizers, such as Bayesian approaches, where initial points on the landscape are chosen independently. This scenario can be analyzed by additionally computing the expectation value over the distance and direction between the points. From Proposition 1, we can derive the following corollary.

Corollary 1. Consider the cost function of Eq. (1). Let θ_A and θ_B be two randomly chosen points in parameter space. Without loss of generality we assume that θ_B = θ_A + L ε̂ for random L and ε̂, so that E_{θA,θB}[···] = E_{θA,L,ε̂}[···]. If the cost exhibits a barren plateau according to Definition 1, then the expectation value of the difference ∆C = C(θ_B) − C(θ_A) is E_{θA,L,ε̂}[∆C] = 0, and the variance is exponentially vanishing with n as

Var_{θA,L,ε̂}[∆C] ≤ Ḡ(n) ,   (10)

with

Ḡ(n) = m² L̄² F(n) .   (11)

Here m is the dimension of the parameter space, F(n) was defined in (4), and L̄ is the average distance between any two points in parameter space.
The proof of Corollary 1 readily follows from Proposition 1 by additionally computing the expectation value over L and ε̂. Moreover, here we can see that Ḡ(n) is exponentially vanishing with the system size, as L̄ ∈ O(poly(n)).
From Proposition 1 we have that, given a set of parameters θ_A and a set θ_B related through a translation in parameter space, the probability that the difference ∆C = C(θ_B) − C(θ_A) is larger than a given c > 0 can be bounded as

Pr(|∆C| ≥ c) ≤ G(n)/c² ∈ O(1/(b^n c²)) ,   (12)

where we have used (5) and Eq. (9) from Proposition 1. Note that a similar result can be obtained for the case when θ_A and θ_B are independent, with G(n) replaced by Ḡ(n) in (12). This implies that, with high probability, the difference ∆C will be exponentially vanishing with the system size, whether θ_A and θ_B are dependent or independent. Moreover, we remark that this is a direct consequence of the fact that the cost exhibits a barren plateau.
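To make this suppression concrete, consider a deliberately simple toy model: a product ansatz of single-qubit RY rotations (much simpler than the 2-design circuits treated above, but with a cost landscape that also flattens exponentially in n), for which C(θ) = 1 − Π_j cos²(θ_j/2). Estimating Var[∆C] for independent random parameter points shows the decay directly:

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(theta):
    # C(theta) = 1 - |<0| (RY(theta_1) x ... x RY(theta_n)) |0>|^2
    #          = 1 - prod_j cos^2(theta_j / 2)   (a toy product-state model)
    return 1.0 - np.prod(np.cos(theta / 2.0) ** 2, axis=-1)

def var_delta_c(n, trials=20_000):
    # Sample independent pairs (theta_A, theta_B) and estimate Var[C(B) - C(A)]
    theta_a = rng.uniform(0.0, 2.0 * np.pi, size=(trials, n))
    theta_b = rng.uniform(0.0, 2.0 * np.pi, size=(trials, n))
    return np.var(cost(theta_b) - cost(theta_a))

print(var_delta_c(2), var_delta_c(10))  # the variance shrinks rapidly with n
```

For this model one can check analytically that Var[∆C] = 2[(3/8)^n − (1/4)^n], so the suppression is exponential in n, in line with Corollary 1.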

Implications for gradient-free optimizers
Let us first recall that, as discussed in the previous section, the capability of distinguishing cost function values at different sets of parameters is at the core of gradient-free optimization methods. Therefore, the precision required to differentiate between candidate points fundamentally limits the scaling of these methods, with smaller differences requiring greater precision. If an optimizer's precision requirements are not met, then each decision the method makes is effectively determined by shot noise, and many optimizers degenerate into random walks or random sampling. The results in Proposition 1 pertain to gradient-free optimizers that compare points that are a given distance and direction apart; simplex-based methods like Nelder-Mead, for example, fall into this category. As we show that cost differences are exponentially suppressed with the system size in a barren plateau, the sampling requirements of such methods scale exponentially with the number of qubits. Exponentially scaling sampling requirements, in turn, hinder the possibility of achieving a quantum speedup with such an algorithm.
Similarly, Corollary 1 tells us that cost function differences between randomly chosen points are also exponentially suppressed. This means that both random search methods and methods that use random initialization, such as Bayesian optimization [54], will also struggle with barren plateau landscapes. Therefore, using randomness in the selection of points cannot evade this exponential scaling result. Let us finally remark that Proposition 1 and Corollary 1 make no assumption about how close (or far) the parameters θ_A and θ_B are in parameter space, other than that L = ||θ_B − θ_A|| ∈ O(poly(n)). Given that for any practical application the number of parameters m should scale no faster than m ∈ O(poly(n)) (or the problem will become untrainable for reasons having nothing to do with barren plateaus), this seems very reasonable. For example, if all of the parameters are single-qubit rotation angles, the parameter space is an m-dimensional torus with unit radius. On that torus, the greatest length of the shortest path between two points is

L_max = π√m ∈ O(poly(n)) .   (13)

This means that our results are valid for both local and global optimizers, as sampling points that are further apart cannot overcome the suppressed slope.

Figure 2: Single layer of the hardware-efficient ansatz employed in our numerical implementations, shown here for n = 4. The U gates are general single-qubit unitaries U(θ_1, θ_2, θ_3).
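The torus bound is straightforward to verify: on an m-dimensional torus of unit radius, each angular coordinate contributes at most π to the geodesic distance, so the diameter is π√m. A short sketch:

```python
import numpy as np

def torus_distance(a, b):
    # Geodesic distance between angle vectors on the unit m-torus
    d = np.abs(a - b) % (2.0 * np.pi)
    d = np.minimum(d, 2.0 * np.pi - d)   # per-coordinate distance is at most pi
    return np.linalg.norm(d)

m = 4
a, b = np.zeros(m), np.full(m, np.pi)    # antipodal in every coordinate
print(torus_distance(a, b), np.pi * np.sqrt(m))  # both equal pi * sqrt(m)
```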

Numerical Implementation
In this section we present numerical results obtained by simulating a variational quantum compiling algorithm [7,12]. Here, one trains a parametrized quantum circuit V(θ) to approximate a target unitary U. Specifically, we consider the toy-model problem where U = 1 and the goal is to train the parameters in V(θ) such that V(θ)|0⟩ = |0⟩, with |0⟩ = |0⟩^{⊗n} the all-zero state. As shown in [7,12], the following local cost function is faithful:

C(θ) = Tr[O_L V(θ)|0⟩⟨0|V(θ)†] ,  with O_L = 1 − (1/n) Σ_{j=1}^{n} |0⟩⟨0|_j ⊗ 1_{j̄} ,   (14)

where |0⟩⟨0|_j denotes the projector onto |0⟩ on qubit j and 1_{j̄} the identity on all other qubits. One can verify that C(θ) ∈ [0, 1], with C(θ) = 0 if and only if V(θ)|0⟩ = |0⟩ (up to a global phase). We remark that there is an efficient quantum circuit to compute C(θ) [12].
For V (θ) we employ a layered hardware efficient ansatz as shown in Fig. 2. Moreover, we recall that Ref. [19] showed that a cost function such as (14) with a layered hardware efficient ansatz that is randomly initialized will exhibit barren plateaus when the depth of V (θ) scales at least linearly in n.
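The sketch below gives a minimal statevector implementation of this setup. It is a simplification, not the exact ansatz: RY rotations and a fixed CZ pattern stand in for the general single-qubit unitaries and brick-like layout of Fig. 2. It checks that the local cost of Eq. (14) vanishes at θ = 0, where V(θ) reduces to the identity:

```python
import numpy as np

def ry(theta):
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])

def apply_1q(state, gate, q, n):
    # Apply a single-qubit gate to qubit q of an n-qubit statevector
    psi = state.reshape([2] * n)
    psi = np.moveaxis(np.tensordot(gate, psi, axes=([1], [q])), 0, q)
    return psi.reshape(-1)

def apply_cz(state, q0, q1, n):
    # Controlled-Z: flip the sign of amplitudes with both qubits in |1>
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q0], idx[q1] = 1, 1
    psi[tuple(idx)] *= -1.0
    return psi.reshape(-1)

def local_cost(thetas, n):
    # C(theta) = 1 - (1/n) sum_j P(qubit j in |0>) for |psi> = V(theta)|0>
    state = np.zeros(2**n)
    state[0] = 1.0
    for layer in thetas:                  # one RY angle per qubit per layer
        for q in range(n):
            state = apply_1q(state, ry(layer[q]), q, n)
        for q in range(0, n - 1, 2):      # simplified fixed entangling pattern
            state = apply_cz(state, q, q + 1, n)
    probs = np.abs(state.reshape([2] * n)) ** 2
    p0 = [probs.sum(axis=tuple(k for k in range(n) if k != j))[0] for j in range(n)]
    return 1.0 - np.mean(p0)

n = 4
print(local_cost(np.zeros((n, n)), n))  # 0.0: at theta = 0, V is the identity
```

For random θ the cost lands somewhere in [0, 1], and for deep circuits at larger n its landscape flattens, which is the regime probed in our simulations.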
In our numerics we simulated the aforementioned quantum compilation task for different numbers of qubits n = 5, 6, . . . , 11. Letting p be the number of layers in the ansatz of Fig. 2, we chose p = n, so that the depth grows linearly in n. This corresponds to the barren plateau scenario of Ref. [19]. For each value of n, we solved the optimization of Eq. (2) by employing the Nelder-Mead, Powell, and COBYLA methods. These simulations were performed using MATLAB (Nelder-Mead) and SciPy (Powell and COBYLA). In all cases we randomly initialized the parameters θ in the ansatz and ran the optimizer until a cost function value of C = 0.4 was achieved or until a maximal total number of shots used throughout the optimization was surpassed. For simplicity we used the default values for hyperparameters not related to optimization termination. We note that we chose a relatively large value (C = 0.4) for the cost threshold because we are interested in the initial stages of the training process, i.e., the question of whether one can get past the barren plateau. With this choice, the computational expense of reaching the threshold does not reflect the difficulty of finding a minimum; it only reflects the burden of the barren plateau.
Since the goal is to heuristically determine the precision (i.e., the number of shots N) needed to minimize the cost, we first ran simulations with different values of N allocated per cost-function evaluation. For each N we simulated 20 optimization instances (runs) with different initial points, and we kept track of the total number of shots N_total used throughout the optimization. The next step was to determine the value of N for which a cost of C = 0.4 could be reached and which minimizes the median N_total computed over the different runs. We analyze the scaling of this median value as a function of n below.
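The precision bookkeeping behind this procedure can be sketched as follows: a cost estimated from N shots behaves like a binomial average with standard error √(C(1−C)/N), so resolving a difference ∆C requires roughly N ≳ C(1−C)/∆C² shots per evaluation (all numbers below are illustrative, not taken from our simulations):

```python
import numpy as np

rng = np.random.default_rng(2)

def estimate_cost(true_cost, n_shots):
    # Each shot is a Bernoulli trial; the estimator's std error is sqrt(C(1-C)/N)
    return rng.binomial(n_shots, true_cost) / n_shots

delta_c, c0 = 1e-3, 0.5                       # illustrative difference and cost value
n_needed = int(c0 * (1.0 - c0) / delta_c**2)  # shots needed to resolve delta_c
print(n_needed)  # 250000; if delta_c ~ 1/b^n, this grows as b^(2n)

estimates = [estimate_cost(c0, n_needed) for _ in range(200)]
print(np.std(estimates))  # comparable to delta_c, as expected
```

This is the mechanism behind the exponential shot counts reported below: exponentially small ∆C forces exponentially large N.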
In Fig. 3 we present our numerical results. Here we can see that the total number of shots scales exponentially with n for the Powell method, and super-exponentially for the Nelder-Mead optimizer. For the COBYLA method, the behavior of N_total as a function of n is less regular, but it is consistent with at least an exponential increase. As a reference point, we also show in Fig. 3 results obtained with a custom gradient-descent optimizer. As expected, the total number of shots also scales exponentially in this case.

Discussion
With a wide range of applications spanning chemistry, optimization, and big data analysis, training parameterized quantum circuits is arguably the leading paradigm for near-term quantum computing. Yet barren plateaus in the training landscape remain an obstacle to making this paradigm scalable. Hence, one of the most important lines of research in this field is developing methods to avoid barren plateaus.
In this work, we consider the question of whether the choice of optimizer could be a potential strategy for avoiding barren plateaus. We focus on gradient-free optimizers, since there has been recent debate in the community about whether barren plateaus affect such optimizers.
Our main result is an analytical argument suggesting that gradient-free optimizers will, indeed, be impacted by barren plateaus. Proposition 1 is relevant to gradient-free optimizers that search through the landscape starting from a (random) initial point. For example, this includes simplex-based optimizers like Nelder-Mead. This proposition asserts that the variance of cost function differences is exponentially suppressed in a barren plateau. This implies that such optimizers will need to expend exponentially large resources in order to make decisions about where to move in the landscape. Corollary 1 considers a slightly different scenario, where both points are randomly and independently chosen. This is relevant to global gradient-free optimizers, such as Bayesian methods, which initially choose multiple random points on the landscape and then proceed from this initial set of points. This corollary implies that these optimizers will also need to utilize exponentially large resources in order to make progress.
We also numerically attempt to train in a barren plateau scenario using several gradient-free optimizers. We ask how many shots are required to begin to train the cost function. In all cases, we find that the required number of shots grows exponentially in the number of qubits. This is consistent with our main result, and demonstrates that barren plateaus can lead to exponential scaling even for gradient-free optimizers. We note that this exponential scaling is a lower bound on the asymptotics. For Nelder-Mead we find super-exponential scaling, likely due to the chance of prematurely shrinking the simplex when it is hard to order the cost values [46]. For COBYLA, there may be a similar effect from prematurely shrinking the radius of the trust region, though this is not clearly demonstrated in our data. Finally, the Powell method appears to show exponential scaling. It is likely that Powell shows better scaling because, unlike with the other optimizers, statistical noise does not have a cumulative effect on the state of the optimizer.
Our work casts doubt on the notion that the choice of optimizer could provide a strategy to avoid barren plateaus. While the asymptotically exponential scaling cannot be avoided, we note that the size limits of trainable problems may be extended by a careful choice of optimization strategy. For example, techniques using neural networks [34] or natural evolutionary strategies [55] may improve the constants multiplying the exponential scaling. However, we emphasize that all such strategies at minimum require the comparison of cost function values at different points and thus are subject to our scaling analysis.
This result highlights the difficult challenge posed by barren plateaus. Future work certainly should continue to develop strategies to avoid them. Additionally, in future work, we hope to develop a unified treatment that covers the impact of barren plateaus on various types of optimizers, gradient-based and gradient-free.

A Proof of Proposition 1
We now prove our main result. We restate the proposition here for convenience.
Proposition 1. Consider the cost function of Eq. (1), and let θ_B = θ_A + L ε̂, where θ_A is a randomly chosen point in parameter space, L is a fixed distance, and ε̂ is a fixed unit vector. If the cost exhibits a barren plateau according to Definition 1, then the expectation value of the difference ∆C = C(θ_B) − C(θ_A) is E_θA[∆C] = 0, and the variance is exponentially vanishing with n as Var_θA[∆C] ≤ G(n), with G(n) = m² L² F(n) and G(n) ∈ O(1/b^n) for some b > 1. Here m is the dimension of the parameter space, and F(n) was defined in (4).
Proof. We begin by noting that finite differences in cost values can be expressed as

∆C = C(θ_B) − C(θ_A) = ∫_{θ_A}^{θ_B} ∇C(θ) · dθ .

Note that, as we are integrating a gradient (which forms a conservative vector field by definition), this integral is path independent. We can therefore choose to integrate along the line segment between θ_A and θ_B, setting θ = θ_A + ℓ ε̂, with ε̂ the unit vector along θ_B − θ_A and ℓ ∈ [0, L], where L = ||θ_B − θ_A||. We then have dθ = ε̂ dℓ. To show that the mean difference is zero, we take the expectation value of this integral with respect to θ_A:

E_θA[∆C] = ∫_0^L E_θA[∇C(θ_A + ℓ ε̂)] · ε̂ dℓ = 0 .

This follows because averaging θ_A over parameter space is equivalent to averaging θ_A + ℓ ε̂ over the same space (for any fixed ℓ ε̂), and the average gradient over parameter space is zero by the definition of a barren plateau (see Definition 1).
Similarly, since the mean is zero, the variance equals the second moment, whose magnitude we can write as

Var_θA[∆C] = E_θA[∆C²] = ∫_0^L ∫_0^L Cov_θA[ ε̂ · ∇C(θ_A + ℓ ε̂), ε̂ · ∇C(θ_A + ℓ′ ε̂) ] dℓ dℓ′ .

We upper bound this covariance by the product of standard deviations using the Cauchy-Schwarz inequality:

|Cov_θA[ ε̂ · ∇C(θ_A + ℓ ε̂), ε̂ · ∇C(θ_A + ℓ′ ε̂) ]| ≤ ( Var_θA[ ε̂ · ∇C(θ_A + ℓ ε̂) ] Var_θA[ ε̂ · ∇C(θ_A + ℓ′ ε̂) ] )^{1/2} .

Next, we split these variances into covariances between individual components of the gradient,

Var_θA[ ε̂ · ∇C(θ_A + ℓ ε̂) ] = Σ_{i,j} ε̂_i ε̂_j Cov_θA[ ∂_i C(θ_A + ℓ ε̂), ∂_j C(θ_A + ℓ ε̂) ] ,

and, as before, bound these covariances by variances using the Cauchy-Schwarz inequality:

|Cov_θA[ ∂_i C(θ_A + ℓ ε̂), ∂_j C(θ_A + ℓ ε̂) ]| ≤ ( Var_θA[ ∂_i C(θ_A + ℓ ε̂) ] Var_θA[ ∂_j C(θ_A + ℓ ε̂) ] )^{1/2} .

Since each of these variances is bounded by F(n) from Definition 1, and since |ε̂_i ε̂_j| ≤ 1, the sum over the m² index pairs is at most m² F(n). Integrating over ℓ and ℓ′ then gives

Var_θA[∆C] ≤ m² L² F(n) = G(n) .

If the cost exhibits a barren plateau, then by Definition 1, F(n) ∈ O(1/b^n) for some b > 1. It then follows that if m ∈ O(poly(n)) and L ∈ O(poly(n)), then G(n) ∈ O(1/b^n), and Proposition 1 holds.