Quantum Lazy Training

In the training of over-parameterized model functions via gradient descent, sometimes the parameters do not change significantly and remain close to their initial values. This phenomenon is called lazy training, and motivates consideration of the linear approximation of the model function around the initial parameters. In the lazy regime, this linear approximation imitates the behavior of the parameterized function whose associated kernel, called the tangent kernel, specifies the training performance of the model. Lazy training is known to occur in the case of (classical) neural networks with large widths. In this paper, we show that the training of geometrically local parameterized quantum circuits enters the lazy regime for large numbers of qubits. More precisely, we prove bounds on the rate of change of the parameters of such a geometrically local parameterized quantum circuit in the training process, and on the precision of the linear approximation of the associated quantum model function; both of these bounds tend to zero as the number of qubits grows. We support our analytic results with numerical simulations.


Introduction
The goal of achieving near-term quantum advantages has put forward quantum machine learning as one of the main applications of Noisy Intermediate-Scale Quantum (NISQ) devices [1]. A main paradigm for achieving quantum advantage in machine learning is via quantum variational algorithms [2]. In this approach, a quantum circuit consisting of parameterized gates is learned in order to fit some training data. However, this learning process, through which optimal parameters of the circuit are found, faces practical challenges [3,4] and calls for thorough exploration.
Gradient descent is one of the main methods for solving optimization problems, particularly for training the parameters of a quantum circuit for machine learning. In this method, the parameters are updated by moving in the opposite direction of the gradient of a loss function to be optimized. This updating of the parameters changes not only the value of the loss function, but also the function modeled by the quantum circuit. Thus, studying the evolution of the loss and model functions during the gradient descent algorithm is crucial in understanding variational quantum algorithms.
Approximating the gradient descent algorithm with its continuous version (gradient flow) [5] provides us with analytical tools for the study of the evolution of a function whose parameters are optimized via gradient descent. Writing down the evolution equation for this continuous approximation, we observe the appearance of a kernel function called the tangent kernel (see Section 2). In short, letting f(Θ, x) be our model function with Θ = (θ_1, . . . , θ_p) as the parameters (weights) of the model and x as a data point on which we evaluate the function, the tangent kernel is defined by

K_Θ(x, x') = ∇_Θ f(Θ, x) · ∇_Θ f(Θ, x').    (1)

Here, ∇_Θ f(Θ, x) is the gradient of f(Θ, x) with respect to Θ, and ∇_Θ f(Θ, x) · ∇_Θ f(Θ, x') is the inner product of the gradient vectors for two data points x, x'. The tangent kernel at some initial point Θ^(0) can be thought of as the kernel associated with the linear approximation of the function given by

f(Θ, x) ≈ f(Θ^(0), x) + (Θ − Θ^(0)) · ∇_Θ f(Θ^(0), x).    (2)

The tangent kernel for (classical) neural networks is called the Neural Tangent Kernel (NTK). It is shown in [6] that although the NTK depends on Θ, which varies during the gradient descent algorithm, when the width of the neural network is large compared to its depth, the NTK remains almost unchanged. In fact, for such neural networks, the parameters Θ remain very close to their initial value Θ^(0). This surprising phenomenon is called lazy training [7].
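To make the tangent kernel concrete, the following minimal sketch computes it for a toy differentiable model via its explicit gradient. The model f(Θ, x) = (1/√p) Σ_j sin(θ_j x) is a hypothetical stand-in, not one of the models studied in the text; the point is only the definition K_Θ(x, x') = ∇_Θ f(Θ, x) · ∇_Θ f(Θ, x').

```python
import numpy as np

# Toy model f(theta, x) = (1/sqrt(p)) * sum_j sin(theta_j * x); a hypothetical
# stand-in for any differentiable parameterized model.
def f(theta, x):
    return np.sum(np.sin(theta * x)) / np.sqrt(len(theta))

def grad_f(theta, x):
    # d f / d theta_j = (1/sqrt(p)) * x * cos(theta_j * x)
    return x * np.cos(theta * x) / np.sqrt(len(theta))

def tangent_kernel(theta, x1, x2):
    # K_Theta(x, x') = <grad f(Theta, x), grad f(Theta, x')>
    return np.dot(grad_f(theta, x1), grad_f(theta, x2))

rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, size=50)

# The kernel is symmetric and K(x, x) >= 0, as required of a valid kernel.
assert abs(tangent_kernel(theta, 0.3, 0.7) - tangent_kernel(theta, 0.7, 0.3)) < 1e-12
assert tangent_kernel(theta, 0.3, 0.3) >= 0
```

Since the kernel is an inner product of feature vectors x ↦ ∇_Θ f(Θ, x), symmetry and positive semidefiniteness hold automatically.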
In the lazy regime, since Θ is close to its initialization Θ (0) , the linear approximation of the function in (2) is accurate. In this case, the behavior of the function under training via gradient descent follows its linear approximation, and is effectively described by the tangent kernel at initialization. We will review these results and related concepts in more detail in Section 2.
Our results: Our main goal in this paper is to develop the theory of lazy training for parameterized quantum circuits as our model function, and to generalize the results of [6] to the quantum case. We prove that when the number of qubits (analogous to the width of a classical neural network) in a quantum parameterized circuit is large compared to its depth, the associated model function can be approximated by a linear model. Moreover, we show that this linear model's behavior is similar to that of the original model under the gradient descent algorithm.
To prove the above results, we need to put some assumptions on the class of parameterized quantum circuits. The results of [6] in the classical case are proven by fixing all layers of a neural network but one, and sending the number of nodes (width) in that layer to infinity. In the quantum case, assuming that we neither introduce fresh qubits nor do we measure/discard qubits in the middle of the circuit, the number of qubits is fixed in all layers. Thus, in the quantum case, unlike [6], we cannot consider layers of the circuit individually and take their width (number of qubits) to infinity independently of other layers. To circumvent this difficulty, we put some restrictions on our quantum circuits: (i) We assume that the circuit is geometrically local and the entangling gates are performed on neighboring qubits. For example, we assume the qubits are arranged on a 1D or 2D lattice and the two-qubit gates are applied only on pairs of adjacent qubits. More generally, we assume that the qubits are arranged on nodes of a bounded-degree graph and that the two-qubit gates can be applied only on pair of qubits connected by an edge. We note that this assumption arguably holds in most proposed hardware architectures of realizable quantum computers.
(ii) We also assume that the observable which is measured at the end of the circuit is a local operator, with its locality being in terms of the underlying bounded-degree graph mentioned above. More precisely, we assume that the observable is a sum of terms, each of which acts only on a constant number of neighboring qubits. We will offer several pieces of evidence showing that our results do not hold without this assumption.
Given the above assumptions, we prove the following: 1. To apply the gradient descent algorithm, we usually choose the initial parameters of the circuit at random. In Theorem 1, we show that when choosing the initial parameters independently at random, the quantum tangent kernel concentrates around its average as the number of qubits tends to infinity. This means that when the number of qubits is large, the tangent kernel at initialization is essentially independent of the starting parameters and is fixed.
2. We also show, in Theorem 2, that when the number of qubits is large, lazy training occurs; meaning that the parameters of the circuit do not change significantly during the gradient descent algorithm and remain almost constant. This means that the tangent kernel is fixed not only at initialization, but also during the training. As a result and as mentioned above, our model function can be approximated by a linear model which shows a behavior similar to that of the original model during the training via gradient descent.
These results show that in order to analyze the training behaviour of parameterized quantum circuits with the aforementioned assumptions, we may only consider the linearized model. We note that the linearized model is determined by the associated tangent kernel, which, assuming that the initial parameters are chosen independently at random, is concentrated around its average. Thus, the eigenvalues of the average tangent kernel determine the training behaviour of such parameterized quantum circuits. Based on this observation, we argue in Remark 3 that if these eigenvalues are far from zero, then the model is trained exponentially fast. We will compare this result with the no-go results about barren plateaus in Section 6.
We also provide numerical simulations to support the above results.

Related works:
The subject of tangent kernels in the quantum case has been previously studied in a few works which we briefly review. A tangent kernel for hybrid classical-quantum networks is considered in [8]. We note, however, that in this work the quantum part of the model is fixed and parameter-free, and only the classical part of the network is trained.
The quantum tangent kernel is considered in [9] for deep parameterized quantum circuits. In this work, a deep circuit is a circuit with a multi-layered data encoding which alternates between data encoding gates and parameterized unitaries. This data encoding scheme increases the expressive power of the model function. It is shown in [9] that as the number of layers increases, the changes in circuit parameters decrease during the gradient descent algorithm (a signature behavior of lazy training), and the training loss vanishes more quickly. It is also shown that the tangent kernel associated to such deep quantum parameterized circuits can outperform conventional quantum kernels, such as those discussed in [10] and [11]. We note that all of these results are based solely on numerical simulations. Moreover, the simulations are performed only for 4-qubit circuits and do not predict the behaviour of the circuits in the large width limit.
Quantum tangent kernel of parameterized quantum circuits (for both optimization and machine learning problems) is also studied in [12]. In this work, without exploring conditions under which lazy training occurs, it is shown that in the lazy training regime (or "frozen limit"), the loss function decays exponentially fast.
Finally, tangent kernel for quantum states is defined in [13], and based on numerical simulations, it is shown that it can be used in the study of the training dynamics of finite-width neural network quantum states.
We emphasize that the ingredient missing from these previous works is a set of explicit conditions on the quantum models under which training provably enters the lazy regime. This gap is addressed in our work.
Note added. After our work was published, [14] and [15] appeared, further exploring lazy training in quantum machine learning.
Outline of the paper: The rest of this paper is organized as follows. In Section 2, we review the notions of tangent kernel and lazy training in more detail. In Section 3, we describe quantum parameterized circuits and their training. We also explain in more detail the assumption of geometric locality mentioned above, and give an explicit example of such quantum circuits. Section 4 is devoted to the proof of our main results regarding quantum lazy training. In Section 5, we support our analytic results with numerical simulations. Concluding remarks are discussed in Section 6.

Tangent Kernel and Lazy Training
In this section we briefly review the notion of a tangent kernel and explain the results of [6] for classical neural networks.
Let f(Θ, x) be a model function which, for any set of parameters Θ, maps R^d to R. Having a training dataset D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, where x^(i) ∈ R^d and y^(i) ∈ R, our goal is to find the best parameters Θ for which the outputs of our model f(Θ, x^(i)) get close to the outputs provided in the dataset y^(i) for all i ∈ {1, 2, . . . , n}. To quantify this, we will need a metric to measure our model's ability to match our dataset. On that account, we make use of a loss function, which in this paper is chosen to be the commonly used mean squared error function:

L(Θ) = (1/2) Σ_{i=1}^n ( f(Θ, x^(i)) − y^(i) )².    (3)

Then, our goal is to find the optimal parameters that minimize the loss function:

Θ* = argmin_Θ L(Θ).    (4)

We use the gradient descent algorithm to solve (4). To this end, we randomly initialize parameters Θ = Θ^(0) and in each step update them by moving in the opposite direction of the gradient of the loss function: Θ^(t+1) = Θ^(t) − η ∇_Θ L(Θ^(t)), where η is a fixed scalar called the learning rate and ∇_Θ L(Θ^(t)) denotes the gradient of the loss function with respect to Θ. This updating of parameters is repeated until a termination condition is satisfied, e.g., the gradient vector ∇_Θ L(Θ^(t)) approaches zero, or the number of iterations reaches a maximum limit.
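The update rule above can be sketched as follows for a linear-in-parameters model; the model and the dataset are illustrative assumptions, and the loss is the (1/2)-scaled squared error used throughout this section.

```python
import numpy as np

# Gradient descent on the squared-error loss for a toy linear model
# f(Theta, x) = Theta . x; update: Theta <- Theta - eta * grad L(Theta).
def loss(theta, X, y):            # L(Theta) = (1/2) * sum_i (f(Theta, x_i) - y_i)^2
    r = X @ theta - y
    return 0.5 * np.dot(r, r)

def grad_loss(theta, X, y):
    return X.T @ (X @ theta - y)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])    # realizable labels, so the loss can reach 0
theta = rng.normal(size=3)

eta = 0.01                            # learning rate
for _ in range(500):
    theta = theta - eta * grad_loss(theta, X, y)

assert loss(theta, X, y) < 1e-6       # the loss is driven (near) zero
```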
In order to analyze the gradient descent algorithm, we consider its continuous approximation. That is, we assume that the parameters are updated continuously via the gradient flow differential equation:

(d/dt) Θ^(t) = −∇_Θ L(Θ^(t)).

Then, the evolution of the model function computed at a data point x is given by

(d/dt) f(Θ^(t), x) = ∇_Θ f(Θ^(t), x) · (d/dt) Θ^(t) = −Σ_{i=1}^n ( f(Θ^(t), x^(i)) − y^(i) ) ∇_Θ f(Θ^(t), x) · ∇_Θ f(Θ^(t), x^(i)).

This computation motivates the definition of the tangent kernel as follows:

K_Θ(x, x') = ∇_Θ f(Θ, x) · ∇_Θ f(Θ, x').

We note that K_Θ(x, x') is a valid kernel function, since it is the inner product of two vectors. Then, we have

(d/dt) f(Θ^(t), x) = −Σ_{i=1}^n ( f(Θ^(t), x^(i)) − y^(i) ) K_{Θ^(t)}(x, x^(i)).    (5)

The tangent kernel alone is enough to determine the evolution of the model function in the training process.
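The evolution equation above can be checked numerically: one small gradient step (a discretization of the gradient flow) changes f at a test point by approximately −η Σ_i (f(Θ, x^(i)) − y^(i)) K_Θ(x, x^(i)). The smooth toy model below is an illustrative assumption.

```python
import numpy as np

# Check that a small gradient step changes f(Theta, x) by the amount the
# tangent kernel predicts:  delta f ~= -eta * sum_i r_i * K_Theta(x, x_i).
def f(theta, x):
    return np.tanh(theta[0] * x) + np.sin(theta[1] * x)

def grad_f(theta, x):
    return np.array([x / np.cosh(theta[0] * x) ** 2,
                     x * np.cos(theta[1] * x)])

X = np.array([0.2, -0.5, 1.1])
y = np.array([0.1, 0.0, -0.3])
theta = np.array([0.7, -0.4])
x_test = 0.9
eta = 1e-5                         # small step approximates gradient flow

resid = np.array([f(theta, xi) for xi in X]) - y
grad_L = sum(r * grad_f(theta, xi) for r, xi in zip(resid, X))

predicted = -eta * sum(r * np.dot(grad_f(theta, x_test), grad_f(theta, xi))
                       for r, xi in zip(resid, X))
actual = f(theta - eta * grad_L, x_test) - f(theta, x_test)
assert abs(actual - predicted) < 1e-8   # agreement up to O(eta^2)
```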
Let us consider the case where f(Θ, x) comes from a neural network as in Figure 1. In this case, for instance, when there is only a single hidden layer, the model function is given by

f(Θ, x) = (1/√m) Σ_{k=1}^m b_k σ( Σ_{j=1}^d a_{kj} x_j ).    (6)

Here, m is the number of nodes in the hidden layer, a_{kj} is the weight of the edge connecting x_j to the k-th node of the hidden layer, and b_k is the weight of the edge connecting the k-th node of the hidden layer to the output node. Moreover, σ(·) is a non-linear activation function. Finally, following [6] we introduce the normalization factor 1/√m in f(Θ, x) since we will consider the limit of this model function as m tends to infinity.
When training such a neural network with a large width, i.e., a large number of nodes in the hidden layers, it is observed that the initial parameters Θ^(0) do not change significantly, and Θ^(t) remains close to Θ^(0) until the gradient vector ∇_Θ L(Θ^(t)) approaches zero. This observation motivates the Taylor expansion of the model function at Θ^(0):

f(Θ, x) ≈ f(Θ^(0), x) + (Θ − Θ^(0)) · ∇_Θ f(Θ^(0), x).    (7)

Observe that the right hand side is linear in Θ (but not in x). Indeed, it is a linear transformation after applying the feature map x ↦ ∇_Θ f(Θ^(0), x). Interestingly, the kernel function associated to this feature map is nothing but the tangent kernel K_{Θ^(0)}(x, x') associated to the neural network, and is called the neural tangent kernel. Based on the above observations, it is proven in [6] that when the width of hidden layers in a neural network tends to infinity, it enters the lazy regime, meaning that Θ^(t) remains close to Θ^(0) during the gradient descent algorithm. Moreover, it is proven that in this case, the linear approximation of the model function as in (7) remains valid not only at initialization, but also during the entire training process. For more details on these results, particularly on the assumptions under which they hold, we refer to the original paper [6]. We also refer to [7] for more details on lazy training.
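This width-dependence can be observed directly in a small experiment: the sketch below trains a one-hidden-layer network with the 1/√m scaling of equation (6) and records the relative change of the parameters. Scalar inputs, tanh activation, and the particular target function are simplifying assumptions for illustration.

```python
import numpy as np

# Lazy training demo: f(Theta, x) = (1/sqrt(m)) * sum_k b_k * tanh(a_k * x).
# The relative parameter change after training shrinks as the width m grows.
def train(m, steps=300, eta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a, b = rng.normal(size=m), rng.normal(size=m)
    a0, b0 = a.copy(), b.copy()
    X = np.linspace(-1, 1, 10)
    y = np.sin(2 * X)                              # arbitrary target labels
    for _ in range(steps):
        H = np.tanh(np.outer(X, a))                # hidden activations, shape (n, m)
        r = H @ b / np.sqrt(m) - y                 # residuals f(Theta, x_i) - y_i
        gb = H.T @ r / np.sqrt(m)                  # dL/db_k
        ga = ((r[:, None] * (1 - H ** 2) * X[:, None]) * b).sum(0) / np.sqrt(m)
        a, b = a - eta * ga, b - eta * gb
    th0 = np.concatenate([a0, b0])
    th = np.concatenate([a, b])
    return np.linalg.norm(th - th0) / np.linalg.norm(th0)

# Relative parameter change shrinks as the width m grows (lazy regime).
changes = [train(m) for m in (10, 100, 1000)]
assert changes[0] > changes[-1]
```

The absolute parameter displacement stays of the same order while the initialization norm grows like √m, so the relative change decays roughly like 1/√m.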

Parameterized Quantum Circuits
Parameterized quantum circuits are considered the quantum counterpart of classical neural networks [16]. Each parameterized quantum circuit amounts to a model function and, similar to neural networks, can be trained to fit some data.
As the name suggests, a parameterized quantum circuit is a circuit in which some gates are not fixed but depend on parameters. Indeed, some gates of the circuit depend on parameters denoted by Θ, and some gates encode the input x. A measurement is performed at the end of the circuit which determines the output of the computation. The measurement itself could also be parameterized, but in this work, for the sake of simplicity, it is assumed to be fixed. See Figure 2 for an example of a parameterized circuit.
Letting U(Θ, x) be the unitary associated to the circuit, and O be the observable measured at the end, the resulting model function is given by

f(Θ, x) = ⟨0|^⊗m U†(Θ, x) O U(Θ, x) |0⟩^⊗m.    (8)

Then, having such a model function and a dataset D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, we may try to find the optimal Θ that minimizes the loss function:

L(Θ) = (1/2) Σ_{i=1}^n ( f(Θ, x^(i)) − y^(i) )².    (9)

To this end, as before, we initialize the parameters Θ independently at random and move towards minimizing the value of this loss function by way of gradient descent. We usually arrange the gates of a parameterized circuit in layers. For instance, the circuit of Figure 2 consists of an encoding layer of single-qubit (Y-rotation) gates and L layers, each of which consists of some single-qubit (X-rotation) gates and some two-qubit (controlled-Z) gates. This layer-wise structure of parameterized circuits is crucial for us since in our results, we are going to fix the number of layers L, and consider the limit of a large number of qubits (m → ∞).
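The model function (8) can be computed explicitly for a tiny statevector example. The 2-qubit circuit below (RY encoding, parameterized RX gates, one controlled-Z, and the normalized local observable (Z_1 + Z_2)/√2) is a hypothetical minimal instance assembled with plain numpy, not the paper's simulation code.

```python
import numpy as np

# f(Theta, x) = <00| U(Theta, x)^dag O U(Theta, x) |00> for a 2-qubit circuit.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def rot(P, angle):
    # Pauli rotation e^{-i * angle * P / 2}
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * P

def model(theta, x):
    psi = np.zeros(4, dtype=complex)
    psi[0] = 1.0                                     # |00>
    U = CZ @ np.kron(rot(X, theta[0]), rot(X, theta[1])) \
           @ np.kron(rot(Y, x), rot(Y, x))           # encode x, then parameterized layer
    psi = U @ psi
    O = (np.kron(Z, I2) + np.kron(I2, Z)) / np.sqrt(2)  # local observable, m = 2
    return float(np.real(psi.conj() @ O @ psi))

val = model(np.array([0.3, -0.7]), 0.5)
# |f| is bounded by the operator norm of O, here sqrt(2).
assert -np.sqrt(2) - 1e-9 <= val <= np.sqrt(2) + 1e-9
```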
In this paper, for the stability of the model, we need to assume that the parameterized gates do not change significantly under a slight change in the parameters Θ. To this end, we assume that each parameterized gate V_j(θ_j) satisfies

‖ (∂/∂θ_j) V_j(θ_j) ‖ ≤ c,    (10)

for some constant c > 0. We note that this assumption holds for most parameterized circuits in the literature, particularly when the parameterized gates are Pauli rotations (see equation (15) below).

Geometrically local circuits:
As mentioned in the introduction, to prove our result we need to restrict the class of circuits to geometrically local ones. To this end, we assume that the qubits are arranged on vertices of a bounded-degree graph (e.g., a 1D or 2D lattice) and the entangling two-qubit gates in the circuit are applied only on pairs of neighboring qubits. For instance, in the circuit of Figure 2, we assume that the qubits are arranged on a cycle, and the controlled-Z gates in each layer are applied only on pairs of neighboring qubits.

Figure 2: An example of a geometrically local parameterized circuit. The angles x̃_1, . . . , x̃_m used in the encoding layer are functions (e.g., coordinates) of x. Next, L layers of parameterized gates are applied. We assume that only the single-qubit gates are parameterized and fix the entangling gates to controlled-Z gates. We assume that the qubits are arranged on a cycle, and the controlled-Z gates in each layer are applied on all pairs of neighboring qubits.
We also assume that the observable O that is measured at the end of the circuit is geometrically local. More precisely, we assume that O is given by

O = (1/√m) Σ_{k=1}^m O_k,    (11)

where m is the number of qubits in the circuit, and O_k is an observable acting on the k-th qubit and possibly on a constant number of qubits in its neighborhood, with ‖O_k‖ ≤ 1. Moreover, as in the classical case (see equation (6)), we introduce the normalization factor 1/√m in O since we are considering the limit of m → ∞. In this case, the model function (8) can be written as

f(Θ, x) = (1/√m) Σ_{k=1}^m f_k(Θ, x),    (12)

where

f_k(Θ, x) = ⟨0|^⊗m U†(Θ, x) O_k U(Θ, x) |0⟩^⊗m.    (13)

We emphasize that the assumption of geometric locality on the quantum circuit described above holds in most quantum hardware architectures. After all, the qubits in the quantum hardware should be arranged on some lattice, and usually the two-qubit gates can only be applied on neighboring qubits. However, the assumption that the observable is geometrically local is not justified by the hardware architecture. Nevertheless, global observables usually result in barren plateaus, and a way of avoiding them is to use local observables [17]. Moreover, as our simulations in Section 5 show, our results do not hold for global observables. Thus, we have to somehow restrict the class of observables.
Example: We finish this section by explaining the example of Figure 2 in more detail, since it will be used as our quantum circuit for simulations.
First, we note that our data points (x^(i), y^(i)) belong to R^d × R, so in the circuit we need to encode each input x in an m-qubit circuit. In the circuit of Figure 2 we assume that we first map x ∈ R^d to some x̃ ∈ R^m and then use the coordinates of x̃ in the encoding layer of the circuit. The mapping x ↦ x̃ is arbitrary and can even be non-linear; for our numerical simulations we fix a specific such map. Then, the coordinates of x̃ are used to encode x in the first layer:

|ψ(x)⟩ = ⊗_{j=1}^m e^{−i x̃_j Y_j} |0⟩^⊗m,    (14)

where Y_j denotes the Pauli-Y matrix acting on the j-th qubit. Next, we apply L parameterized unitaries U(Θ_1), . . . , U(Θ_L) where

U(Θ_l) = ( ∏_{k=1}^m CZ_{k,k+1} ) ∏_{k=1}^m e^{−i θ_{(l−1)m+k} X_k},    (15)

and CZ_{k,k+1} is the controlled-Z gate applied on qubits k, k+1. Here, we assume that the qubits are arranged on a cycle, and the indices are modulo m.
With this specific structure for the parameterized circuit, we have

U(Θ, x) = U(Θ_L) · · · U(Θ_1) ⊗_{j=1}^m e^{−i x̃_j Y_j}.

Nevertheless, we emphasize that in this paper we do not assume that the encoding part of the circuit is restricted to the first layer; our results are valid even if there are gates in the middle of the circuit that encode x, see [18,19]. Finally, we assume that the observable is given by

O = (1/√m) Σ_{k=1}^m Z_k,

where Z_k is the Pauli-Z operator acting on the k-th qubit. Hence, the model function associated to this parameterized circuit is equal to

f(Θ, x) = (1/√m) Σ_{k=1}^m ⟨0|^⊗m U†(Θ, x) Z_k U(Θ, x) |0⟩^⊗m.    (16)

A crucial observation which will be frequently used in our proofs is that each term in the above sum depends only on constantly many parameters (independent of m, the number of qubits). First, note that the last layer of controlled-Z gates does not affect the model function since the controlled-Z gates are diagonal in the Z-basis and commute with the observable. Second, and more importantly, the result of the measurement of the k-th qubit depends only on the light cone of this qubit. To clarify this, let us assume that L = 2. In this case, the result of the measurement of the k-th qubit depends only on the parameters θ_{k−1}, θ_k, θ_{k+1}, θ_{m+k}, see Figure 3. The point is that, when L = 2, we have

U†(Θ, x) Z_k U(Θ, x) = V_k† Z_k V_k,    (17)

where V_k consists only of the gates in the light cone of the k-th qubit and depends only on θ_{k−1}, θ_k, θ_{k+1} and θ_{m+k}. Thus, f(Θ, x) given by (16) with L = 2 is a sum of m terms whose k-th term depends on θ_{k−1}, θ_k, θ_{k+1} and θ_{m+k}, which together make the light cone of the k-th qubit (as depicted in Figure 3).

Figure 3: The light cone of the k-th qubit in the circuit of Figure 2 with L = 2 is depicted in red. This means that in order to compute the result of the k-th measurement Z_k, we only need to compute the red part of the circuit and ignore the rest. We note that only the parameters θ_{k−1}, θ_k, θ_{k+1} and θ_{m+k} appear in this light cone.
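The light-cone claim can be verified numerically. The sketch below simulates a small instance of the circuit above (m = 6 qubits on a ring, L = 2, RY encoding, parameterized RX gates, CZ on neighboring pairs) with a plain statevector; the implementation is a from-scratch sketch, not the paper's simulation code. Perturbing a parameter outside the light cone of qubit 0 leaves ⟨Z_0⟩ unchanged.

```python
import numpy as np
from functools import reduce

# Verify: for L = 2 on a ring, <Z_k> depends only on the layer-1 parameters of
# qubits k-1, k, k+1 and the layer-2 parameter of qubit k (its light cone).
m = 6
I2 = np.eye(2, dtype=complex)
Xp = np.array([[0, 1], [1, 0]], dtype=complex)
Yp = np.array([[0, -1j], [1j, 0]])
Zp = np.diag([1.0, -1.0]).astype(complex)

def op_on(P, k):                       # P acting on qubit k of m
    return reduce(np.kron, [P if j == k else I2 for j in range(m)])

def rx(angle, k):                      # e^{-i angle X_k / 2} on the full register
    return np.cos(angle / 2) * np.eye(2**m) - 1j * np.sin(angle / 2) * op_on(Xp, k)

def cz(k, l):                          # CZ = I - 2 |11><11| on qubits k, l
    Pk = (np.eye(2**m) - op_on(Zp, k)) / 2
    Pl = (np.eye(2**m) - op_on(Zp, l)) / 2
    return np.eye(2**m) - 2 * Pk @ Pl

def expect_Zk(theta, xs, k):
    psi = np.zeros(2**m, dtype=complex)
    psi[0] = 1.0
    for j in range(m):                 # encoding layer of Y-rotations
        psi = (np.cos(xs[j] / 2) * np.eye(2**m)
               - 1j * np.sin(xs[j] / 2) * op_on(Yp, j)) @ psi
    for layer in range(2):             # two parameterized layers
        for j in range(m):
            psi = rx(theta[layer * m + j], j) @ psi
        for j in range(m):             # CZ gates around the ring
            psi = cz(j, (j + 1) % m) @ psi
    return float(np.real(psi.conj() @ op_on(Zp, k) @ psi))

rng = np.random.default_rng(2)
theta = rng.uniform(-np.pi, np.pi, 2 * m)
xs = rng.uniform(-np.pi, np.pi, m)

base = expect_Zk(theta, xs, 0)
theta_far = theta.copy()
theta_far[3] += 1.0                    # layer-1 parameter of qubit 3: outside the light cone of qubit 0
assert abs(expect_Zk(theta_far, xs, 0) - base) < 1e-10
theta_near = theta.copy()
theta_near[1] += 1.0                   # layer-1 parameter of qubit 1: inside the light cone
assert abs(expect_Zk(theta_near, xs, 0) - base) > 1e-6
```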

Main results
This section contains the proofs of our main results. We first show that under certain conditions, when the parameters are initialized independently at random, the tangent kernel is concentrated around its mean.
Theorem 1 Let f(Θ, x) be a model function associated to a geometrically local parameterized quantum circuit on m qubits as in (8), with Θ = (θ_1, . . . , θ_p) satisfying (10). Suppose that the observable O is also geometrically local, given by (11), where O_k acts on the k-th qubit and possibly on a constant number of qubits in its neighborhood, and satisfies ‖O_k‖ ≤ 1.
In this case the model function is given by (12) and (13). Suppose that θ_1, . . . , θ_p are chosen independently at random. Then, for any x, x' ∈ R^d and any ε > 0 we have

Pr[ |K_Θ(x, x') − E[K_Θ(x, x')]| ≥ ε ] ≤ 2 exp( −Ω(ε² m² / p) ).    (18)

Remark 1 We note that usually the number of parameters in each layer of a circuit is linear in the number of qubits. Then, assuming that the number of layers L is constant, p = O(Lm) = O(m). In this case, the right hand side of (18) vanishes exponentially fast in m.
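The concentration phenomenon of Theorem 1 can be illustrated with a toy geometrically local model: a 1/√m-normalized sum of terms f_k, each depending on three neighboring parameters (a stand-in for the light-cone structure; the specific f_k below is an arbitrary smooth choice, not the quantum model). The spread of the tangent kernel over random initializations shrinks as m grows.

```python
import numpy as np

# Toy local model: f(Theta, x) = (1/sqrt(m)) * sum_k sin(theta_{k-1} + x*theta_k
# + theta_{k+1}), indices mod m. Each term touches only 3 neighboring parameters.
def grad_f(theta, x):
    m = len(theta)
    g = np.zeros(m)
    for k in range(m):
        c = np.cos(theta[(k - 1) % m] + x * theta[k] + theta[(k + 1) % m])
        g[(k - 1) % m] += c
        g[k] += x * c
        g[(k + 1) % m] += c
    return g / np.sqrt(m)

def kernel_samples(m, x1=0.3, x2=0.8, trials=200, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(trials):
        theta = rng.uniform(-np.pi, np.pi, m)     # random initialization
        out.append(np.dot(grad_f(theta, x1), grad_f(theta, x2)))
    return np.array(out)

# The standard deviation of K_Theta(x, x') over random Theta shrinks with m.
stds = [kernel_samples(m).std() for m in (8, 64, 512)]
assert stds[0] > stds[-1]
```

This mirrors the bounded-difference structure used in the proof: changing a single θ_j moves the kernel by O(1/m), which is exactly the setting of McDiarmid's inequality.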
As mentioned in the previous section, our main tool in proving this theorem is the geometric locality of the circuit and the observable. Based on this, following similar computations as in (17), we find that each term f k (Θ, x) of the model function depends only on constantly many parameters.
In the proof of this theorem we also use McDiarmid's inequality.

Lemma 1 (McDiarmid's Concentration Inequality [20]) Let X_1, . . . , X_n be independent random variables, each with values in X. Let f : X^n → R be a mapping such that for every i ∈ {1, 2, . . . , n} and every (x_1, . . . , x_n), (x'_1, . . . , x'_n) ∈ X^n that differ only in the i-th coordinate (i.e., x_j = x'_j for all j ≠ i), we have |f(x_1, . . . , x_n) − f(x'_1, . . . , x'_n)| ≤ c_i. Then for any ε > 0,

Pr[ |f(X_1, . . . , X_n) − E[f(X_1, . . . , X_n)]| ≥ ε ] ≤ 2 exp( −2ε² / Σ_{i=1}^n c_i² ).

Proof of Theorem 1: For each k ∈ {1, . . . , m}, let N_k ⊆ {1, . . . , p} be the set of indices j such that f_k(Θ, x) depends on θ_j. In other words, Θ_{N_k} is the set of θ_j's in the light cone of the k-th observable O_k. Then, we have

f(Θ, x) = (1/√m) Σ_{k=1}^m f_k(Θ_{N_k}, x).

We note that by the assumption of geometric locality, we have |N_k| = O(1). Now, by the definition of the tangent kernel we have

K_Θ(x, x') = Σ_{j=1}^p ∂_{θ_j} f(Θ, x) ∂_{θ_j} f(Θ, x') = (1/m) Σ_{k,k'=1}^m Σ_{j ∈ N_k ∩ N_{k'}} ∂_{θ_j} f_k(Θ_{N_k}, x) ∂_{θ_j} f_{k'}(Θ_{N_{k'}}, x'),    (19)

where the last equation follows since ∂_{θ_j} f_k(Θ_{N_k}, x) = 0 for any j ∉ N_k. Let

Γ = { (k, k', j) : j ∈ N_k ∩ N_{k'} }.

We note that since O_k acts only on a constant number of qubits in the neighborhood of the k-th qubit, N_k intersects N_{k'} only if the qubits k and k' are geometrically close to each other (in the underlying graph). Then, since the underlying graph has a bounded degree, N_k intersects only a constant number of N_{k'}'s. On the other hand, the size of N_k is constant. Thus, for each k the number of triples (k, k', j) in Γ is constant, and we have |Γ| = O(m). Next, let

g_{k,k',j}(Θ) = ∂_{θ_j} f_k(Θ_{N_k}, x) ∂_{θ_j} f_{k'}(Θ_{N_{k'}}, x').

Then,

K_Θ(x, x') = (1/m) Σ_{(k,k',j) ∈ Γ} g_{k,k',j}(Θ)

can be thought of as a normalized sum of O(m) terms. Note that these terms are not independent of each other; each parameter θ_j may appear in more than one term. Nevertheless, again by the assumption of geometric locality, each θ_j appears in at most constantly many terms. Therefore, letting Θ, Θ' be two tuples of parameters differing only at the j-th position (i.e., θ_i = θ'_i for all i ≠ j), we get

|K_Θ(x, x') − K_{Θ'}(x, x')| ≤ (1/m) Σ_{(k,k',j') ∈ Γ} |g_{k,k',j'}(Θ) − g_{k,k',j'}(Θ')| = O(1/m),

where in the last line we use (10) and the fact that for each j, the number of triples in Γ whose associated term depends on θ_j is constant. Then, by McDiarmid's concentration inequality we obtain the bound (18), completing the proof.

The above theorem says that even though the parameters are chosen randomly at initialization, the tangent kernel is essentially fixed. This results in an essentially fixed linearized model via (7).
The following theorem states our second main result: the training of geometrically local quantum circuits over a large number of qubits enters the lazy regime and can be approximated by a linear model.
Theorem 2 Let f(Θ, x) be a model function associated with a parameterized quantum circuit satisfying the assumptions of Theorem 1. Suppose that a data set D = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, with x^(i) ∈ R^d and y^(i) ∈ R, is given. Assume that at initialization we choose Θ^(0) = (θ^(0)_1, . . . , θ^(0)_p) independently at random, and apply the gradient flow to update the parameters in time by (d/dt) Θ^(t) = −∇_Θ L(Θ^(t)), where L(Θ) is given in (9). Then, the following hold:

(i) For all t ≥ 0,

‖Θ^(t) − Θ^(0)‖_∞ ≤ O( t √(n L(Θ^(0))) / √m ).

(ii) For any x, x' and all t ≥ 0 we have

|K_{Θ^(t)}(x, x') − K_{Θ^(0)}(x, x')| ≤ O( t √(n L(Θ^(0))) / √m ).

(iii) Let f̄(Θ̄, x) be the function associated to the linearized model, i.e.,

f̄(Θ̄, x) = f(Θ^(0), x) + (Θ̄ − Θ^(0)) · ∇_Θ f(Θ^(0), x).

Suppose that we start with Θ̄^(0) = Θ^(0), and train the linearized model with its associated loss function denoted by L̄(Θ̄^(t)), which results in the gradient flow (d/dt) Θ̄^(t) = −∇_Θ̄ L̄(Θ̄^(t)). Then, letting F^(t) = (f(Θ^(t), x^(1)), . . . , f(Θ^(t), x^(n))) and F̄^(t) = (f̄(Θ̄^(t), x^(1)), . . . , f̄(Θ̄^(t), x^(n))), for all t we have

‖F^(t) − F̄^(t)‖_2 ≤ O( t² n^{3/2} L(Θ^(0)) / √m ).

(iv) With the notation of part (iii), for all t we have

|√(2 L(Θ^(t))) − √(2 L̄(Θ̄^(t)))| ≤ O( t² n^{3/2} L(Θ^(0)) / √m ).

Part (i) of this theorem says that the parameters Θ^(t) do not change significantly during training. Based on this, we expect that the tangent kernel remains close to the initial tangent kernel as well. This is proven in part (ii). Next, since the tangent kernel is almost constant, we expect that our model function behaves like the linearized model in the training process (lazy training). This is formally proven in parts (iii) and (iv).

Remark 2
The bounds of this theorem are effective when the loss function L(Θ (0) ) at initialization is a constant independent of m. While we do not explore the conditions under which this holds, since Θ (0) is chosen at random and f (Θ (0) , x) approaches a Gaussian process, we expect to have L(Θ (0) ) = O(1) with high probability when we learn a bounded function.
Remark 3 Let F̄^(t) = (f̄(Θ̄^(t), x^(1)), . . . , f̄(Θ̄^(t), x^(n))) and Y = (y^(1), . . . , y^(n)). Then, since the kernel associated to the linearized model is time-independent, by (5) we have

F̄^(t) − Y = e^{−t K_{Θ^(0)}} ( F̄^(0) − Y ),

where K_{Θ^(0)} denotes the n × n matrix with entries K_{Θ^(0)}(x^(i), x^(j)). This means that if K_{Θ^(0)} is full-rank and its minimum eigenvalue is far from zero, the loss of the linearized model decays exponentially fast. In this case, the stopping time t in the bounds of parts (iii) and (iv) of the theorem is small. Indeed, under the above assumption on the eigenvalues of the tangent kernel, the parameterized quantum circuit is trained exponentially fast since by part (iv) its behaviour is well-approximated by the linearized model.
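The exponential decay of the linearized residual can be checked numerically: with a time-independent kernel matrix K, the residual is e^{−tK} applied to the initial residual, and its norm decays at least like e^{−λ_min t}. The positive-definite K below is an arbitrary illustrative choice, not a kernel computed from a circuit.

```python
import numpy as np

# Residual dynamics of the linearized model: r(t) = exp(-t K) r(0),
# where r(t) = F(t) - Y and K is the (constant) n x n tangent kernel matrix.
rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
K = A @ A.T + 0.5 * np.eye(5)            # full-rank positive-definite "kernel"
r0 = rng.normal(size=5)                  # initial residual F(0) - Y

w, V = np.linalg.eigh(K)                 # K = V diag(w) V^T

def residual(t):
    return V @ (np.exp(-t * w) * (V.T @ r0))   # exp(-t K) r0

lam_min = w.min()
for t in (0.5, 1.0, 2.0):
    # ||r(t)|| <= exp(-lam_min * t) * ||r(0)||: exponentially fast convergence
    assert np.linalg.norm(residual(t)) <= np.exp(-lam_min * t) * np.linalg.norm(r0) + 1e-12

# Cross-check against explicit Euler integration of dr/dt = -K r up to t = 1.
r, dt = r0.copy(), 1e-3
for _ in range(1000):
    r = r - dt * (K @ r)
assert np.linalg.norm(r - residual(1.0)) < 1e-2
```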
Proof: (i) By the gradient flow equation, for any j we have

(d/dt) θ_j^(t) = −∂_{θ_j} L(Θ^(t)) = −Σ_{i=1}^n ( f(Θ^(t), x^(i)) − y^(i) ) ∂_{θ_j} f(Θ^(t), x^(i)),

and for any j:

∂_{θ_j} f(Θ, x) = (1/√m) Σ_{k : j ∈ N_k} ∂_{θ_j} f_k(Θ_{N_k}, x).

Thus, using (10) and the fact that there are a constant number of N_k's containing j, we find that

|(d/dt) θ_j^(t)| ≤ O( √(n L(Θ^(t))) / √m ) ≤ O( √(n L(Θ^(0))) / √m ).

The desired bound follows once we note that we are moving in the opposite direction of the gradient of L(Θ^(t)) via the gradient flow equation, so L(Θ^(t)) ≤ L(Θ^(0)).
(ii) Using (19) we have

K_{Θ^(t)}(x, x') − K_{Θ^(0)}(x, x') = (1/m) Σ_{(k,k',j) ∈ Γ} ( g_{k,k',j}(Θ^(t)) − g_{k,k',j}(Θ^(0)) ).

By (10), for any i, j, k, k' we have

| ∂_{θ_i} ( ∂_{θ_j} f_k(Θ_{N_k}, x) ∂_{θ_j} f_{k'}(Θ_{N_{k'}}, x') ) | = O(1),

so each difference g_{k,k',j}(Θ^(t)) − g_{k,k',j}(Θ^(0)) is bounded by O(‖Θ^(t) − Θ^(0)‖_∞) times the size of N_k ∪ N_{k'}. Next, recall that |Γ| = O(m), and for any k, k' the size of N_k ∪ N_{k'} is a constant. Thus, the desired bound follows from part (i).
(iii) To prove this part we borrow ideas from [7]. Using (5) we compute

(d/dt) ( F^(t) − F̄^(t) ) = −K_{Θ^(t)} ( F^(t) − Y ) + K_{Θ^(0)} ( F̄^(t) − Y ) = −K_{Θ^(0)} ( F^(t) − F̄^(t) ) − ( K_{Θ^(t)} − K_{Θ^(0)} ) ( F^(t) − Y ),

where K_{Θ^(t)} denotes the n × n matrix with entries K_{Θ^(t)}(x^(i), x^(j)). Next, the fact that K_{Θ^(0)} is positive semidefinite implies

(d/dt) ‖F^(t) − F̄^(t)‖_2 ≤ ‖K_{Θ^(t)} − K_{Θ^(0)}‖ · ‖F^(t) − Y‖_2.

We also note that ‖F^(t) − Y‖_2 = √(2 L(Θ^(t))) ≤ √(2 L(Θ^(0))). Now, using part (ii) and the fact that K_{Θ^(t)} is an n × n matrix, we have

‖K_{Θ^(t)} − K_{Θ^(0)}‖ ≤ O( n t √(n L(Θ^(0))) / √m ).

Therefore,

(d/dt) ‖F^(t) − F̄^(t)‖_2 ≤ O( t n^{3/2} L(Θ^(0)) / √m ),

which gives the desired result by integration.
(iv) Using the triangle inequality for the 2-norm, we have

| √(2 L(Θ^(t))) − √(2 L̄(Θ̄^(t))) | = | ‖F^(t) − Y‖_2 − ‖F̄^(t) − Y‖_2 | ≤ ‖F^(t) − F̄^(t)‖_2,

so the desired bound follows from part (iii). □

Numerical simulations
In this section we present numerical simulations to support our results. To this end, we simulate the parameterized circuit of Figure 2, explained in detail in Section 3. To classically simulate this circuit for a large number of qubits (large m), we again use the idea of light cones (see Figure 3): we evaluate the model function term by term, knowing that each term can be computed by a sub-circuit of constant size (when L is constant). We use PennyLane [21] for our simulations. We choose the data set D = {(x^(i), y^(i)) : i = 1, . . . , n} randomly, where the x^(i)'s are in [−2π, 2π], and the y^(i)'s are in [−1, 1]. We apply the gradient descent algorithm with a learning rate of η = 1 to train the circuit.

Figure 5: The relative change of the parameters as a function of the number of iterations of the gradient descent algorithm. We observe that as the number of qubits m increases, the relative change of the parameters decreases; that is, training enters the lazy regime.
We first verify Theorem 1. We let L = 2, pick two random inputs x, x', and compute K_Θ(x, x') for random choices of the θ_j's in [−2π, 2π]. Figure 4 shows the histogram of these values. This histogram confirms that K_Θ(x, x') is concentrated around its average, which is analytically computed in Appendix A. Next, in order to verify Theorem 2, we plot the relative change of the parameters Θ in the training process. That is, we plot

‖Θ^(t) − Θ^(0)‖_2 / ‖Θ^(0)‖_2,

where t denotes the number of gradient descent iterations. As Figure 5 shows, this relative change decreases as the number of qubits m increases. This is an indicator of the occurrence of lazy training. We also plot the loss functions L(Θ^(t)), L̄(Θ̄^(t)) of both the original quantum model and its linearized version as functions of the number of iterations in Figure 6. We observe that for large numbers of qubits (e.g., m = 100), these two loss functions have almost the same values in every step of the learning process. This confirms our results in Theorem 2. Moreover, we observe that, as suggested in Remark 3, these models converge very quickly. We note that in the plots of Figure 6 the loss functions do not vanish. This is because, as mentioned above, the label y^(i) for each data point x^(i) is chosen randomly, and the quantum parameterized circuit chosen for our simulations is not expressive enough to fit such a random dataset. Alternatively, we can choose our dataset's inputs to be random x^(i)'s as before, and this time, to fix the labels, pick random parameters Θ', feed the input x^(i) to the parameterized circuit with parameters Θ', and let the outputs y^(i) be the labels. In this case, we make sure that our model is expressive enough to fit the dataset, and our simulations show that the loss function converges to zero as the number of iterations increases.
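The laziness diagnostic plotted in Figure 5 can be computed generically as follows; `grad_loss` stands in for the gradient of whatever model is being trained, and the quadratic loss in the usage example is an illustrative assumption.

```python
import numpy as np

# Record the relative change ||Theta(t) - Theta(0)|| / ||Theta(0)|| at each
# gradient-descent iteration, the quantity plotted in Figure 5.
def relative_change_trace(theta0, grad_loss, eta=1.0, steps=50):
    theta, trace = theta0.copy(), []
    for _ in range(steps):
        theta = theta - eta * grad_loss(theta)
        trace.append(np.linalg.norm(theta - theta0) / np.linalg.norm(theta0))
    return trace

# Usage example with the quadratic loss L(Theta) = 0.05 * ||Theta - target||^2,
# whose gradient is 0.1 * (Theta - target).
rng = np.random.default_rng(4)
theta0, target = rng.normal(size=20), rng.normal(size=20)
trace = relative_change_trace(theta0, lambda th: 0.1 * (th - target), eta=1.0)

# The parameters drift monotonically toward the minimizer and then stop moving.
assert all(b >= a - 1e-12 for a, b in zip(trace, trace[1:]))
```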
Nevertheless, no matter how we choose the dataset, the behavior of the loss functions of the original quantum and the linearized models remains the same: they decrease at an exponential rate in the number of iterations.
We also verified our results on the Iris flower dataset. This dataset consists of 50 data points for each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), and each data point has four features. To get a binary classification problem, we picked the data points corresponding to two of these three classes. We considered the same circuit as before with two layers and m = 24 qubits, and left the loss function unchanged. Once again, as the plot of Figure 7a shows, the loss functions of both the original quantum model and its linearized version remain close. We note that in this plot the loss function converges to zero as the number of iterations grows.
In order to justify our assumption that the observable is geometrically local, we also consider the circuit of Figure 2 with a global observable. We observe in Figure 7b that the quantum model with the global observable O = Z_1 Z_2 · · · Z_m is separated from its linearized version. This shows that the assumption of the locality of the observable is necessary for lazy training. Interestingly, we also observe that the linearized version of the quantum model with a global observable does not learn and remains almost constant. This is because, as can be verified by direct computation, the associated tangent kernel is a low-rank matrix, in which case the model function has low expressive power.

Conclusion
In this paper, we proved that the training of parameterized quantum circuits that are geometrically local enters the lazy regime. This means that if the associated model function is rich enough, in which case the tangent kernel is full-rank and its eigenvalues are far from zero, training converges quickly.
We emphasize that although in our explicit example of parameterized quantum circuit the encoding is performed only in the first layer, our results hold for general forms of data encoding including parallel and sequential ones [19].
We proved our results under the assumptions that first, the circuit is geometrically local and second, the observable is a local operator. The first assumption is motivated by common hardware architectures, and numerical simulations suggest that the second assumption is necessary. Nevertheless, it is interesting to investigate other settings in which lazy training occurs in quantum machine learning. In particular, it would be interesting to study lazy training for quantum parameterized circuits whose number of qubits varies in different layers, i.e., fresh qubits are introduced and qubits are measured/discarded in the middle of the circuit [22].
Our results show that as long as the tangent kernel associated to a parameterized quantum circuit satisfying the above assumptions is full-rank and its minimum eigenvalue is far from zero, the quantum model is trained exponentially fast (see Remark 3). This stands in contrast to the barren plateaus occurring in the training of certain quantum parameterized circuits [3]. The point is that the circuits considered in our work are not random, and are geometrically local. Moreover, we consider only local observables, which remedies barren plateaus [17].
In this paper, we fixed the loss function to be the mean squared error, yet most of the results hold for more general loss functions as well. Indeed, for a general loss function only the proofs of parts (iii) and (iv) of Theorem 2 need modification, which can be done, with weaker bounds, based on ideas in [7].
We did not explore the effect of quantum laziness in comparison to its classical counterpart. For instance, how do the eigenvalues of the tangent kernels of classical and quantum models compare to each other? Which of the two models could possibly be better at generalization? We leave these questions for future work.
In the appendix, we explicitly compute the model function as well as the associated tangent kernel corresponding to a two-layer quantum circuit. We believe that such computations are insightful in understanding the expressive power of quantum parameterized circuits and their training properties.

A Explicit computation of E[K Θ (x, x )]
In this Appendix, we explicitly compute E[K Θ (x, x )] for the parameterized circuit of Figure 2 with L = 2 when θ j 's are chosen uniformly at random in [−2π, 2π]. To this end, we first explicitly compute the model function, and then compute its associated tangent kernel.
The other two equations hold by symmetry.