Entanglement entropy production in Quantum Neural Networks

Quantum Neural Networks (QNNs) are considered a candidate for achieving quantum advantage in the Noisy Intermediate-Scale Quantum (NISQ) era. Several QNN architectures have been proposed and successfully tested on benchmark datasets for machine learning. However, quantitative studies of the entanglement generated by QNNs have been carried out only for up to a few qubits. Tensor network methods make it possible to emulate quantum circuits with a large number of qubits in a wide variety of scenarios. Here, we employ matrix product states to characterize recently studied QNN architectures with random parameters up to fifty qubits, showing that their entanglement, measured in terms of entanglement entropy between qubits, tends to that of Haar-distributed random states as the depth of the QNN is increased. We also certify the randomness of the quantum states by measuring the expressibility of the circuits, as well as by using tools from random matrix theory. We show a universal behavior for the rate at which entanglement is created in any given QNN architecture, and consequently introduce a new measure to characterize the entanglement production in QNNs: the entangling speed. Our results characterise the entanglement properties of quantum neural networks and provide new evidence of the rate at which these approximate random unitaries.


Introduction
Nowadays quantum computing is a well-established research field where quantum phenomena like superposition and entanglement are exploited in order to process information, possibly more efficiently than standard classical data processing [1]. The aim of quantum computing is to devise quantum algorithms capable of generating a target quantum state representing the solution of a given problem. In the last decade, the community has put a large effort into the realization of hardware able to perform quantum computation.
Accompanying the rise of quantum computing, another research area, namely Machine Learning (ML), has gained a lot of popularity. We undoubtedly live in the era of big data, where information is collected by the most disparate devices. In this context, ML constitutes a set of techniques for efficiently identifying patterns in huge datasets and for inferring input-output relations in data, even in the case of previously unseen inputs [2,3]. ML proves to be a powerful tool with a wide range of applications: from image classification [4], through devising playing strategies for complex games [5], to controlling nuclear fusion reactors [6].
Inspired by some of these outstanding results, a new interdisciplinary research topic that goes by the name of Quantum Machine Learning (QML) has recently begun to combine quantum computing and machine learning techniques in various ways, with the hope of achieving improvements in both fields [7,8,9,10,11]. As small-scale quantum devices start to be available [12], a new class of quantum procedures called variational quantum algorithms has been developed to take advantage of current and near-term quantum hardware, by trading theoretical guarantees of success for feasibility of execution [9,11,13]. Generally speaking, a variational quantum circuit is a hybrid quantum-classical algorithm employing a quantum circuit U(θ) that depends on a set of parameters θ, which are adjusted in order to minimize a given objective function. While the objective function is evaluated by measuring outcomes of the variational circuit, optimization is performed by a classical iterative optimization algorithm that proposes better candidates for the parameters θ, starting from random (or pre-trained) initial values.
Within the domain of variational quantum circuits, quantum versions of neural networks, often referred to as quantum neural networks (QNNs), represent a promising quantum alternative for classical supervised learning [10,14]. An efficient encoding of input data is key to performing computations in a high dimensional (possibly even infinite) Hilbert space. In fact, it is possible to encode classical inputs x into a quantum state |F(x)⟩ using a parameterized quantum circuit (PQC), a procedure which goes by the name of feature encoding. Thus, the goal of the feature map F is to map classical vectors to the qubits' Hilbert space. This feature map is accompanied by a layered structure of additional variational PQCs, which are trained in order to solve the desired learning task. Recently, QNNs gained a lot of attention after it was shown that they could be more expressive and more efficiently trained than their classical counterparts [14]. However, the dispute on how to achieve quantum advantages in machine learning is still far from being settled [15,16,17]. As for classical neural networks, the type of implementation of parameterized quantum circuits has a profound impact on the QNN performance, both in terms of trainability and classification accuracy [18,19,20,21,22]. Thus, characterizing parameterized quantum circuits in terms of their expressibility and entangling capability is key to selecting a good ansatz, i.e. structure, for a QNN.
Following and expanding the investigation pioneered in [23], in this work we study the entanglement properties of quantum neural networks initialized with random parameters. We employ methods from the tensor network literature, namely Matrix Product States (MPS), to study the entanglement generated in various QNN architectures composed of up to 50 qubits. Since MPS are a very powerful tool for simulating quantum systems with bounded entanglement, if a quantum neural network can only access low entangled states, it can be easily simulated, which spoils any hope of achieving a concrete quantum advantage. Thus, using entanglement entropy among qubits as a figure of merit, we evaluate the entanglement capabilities of some of the most common and promising QNN architectures [14,23]. We consider several QNNs with different combinations of feature maps F and variational forms V and perform an extended numerical analysis varying: (i) the number of qubits n, (ii) the number of layers L in the network, (iii) the entangling topology of the circuit, (iv) the data re-uploading [24,25] structure, being either alternated or sequential. In this respect, we focus our analysis on data re-uploading quantum circuits because, as extensively discussed later, recent results in the quantum machine learning literature highlight the need for such a circuit structure to increase the expressibility of the parametric models implemented by quantum neural networks. Thus, we consider this class of parameterized quantum circuits due to their practical relevance in quantum machine learning tasks. Nonetheless, in Sec. 3.4, we also analyze instances of random quantum circuits where parameters are not shared between layers (hence no data re-uploading is used) and show that, as far as entanglement is concerned, the results presented in this paper depend primarily on the architecture of the parametric quantum circuit, and not on the presence of shared parameters.
A summary of the circuit templates analyzed in this work is shown in Fig. 1.
For all the considered QNNs with nearest neighbour connectivity, as the number of layers L is increased, the entanglement generated inside the circuit grows, eventually reaching a plateau when L ≈ n, where n is the number of qubits. This behavior is associated with the typical entanglement of a random Haar-distributed quantum state. The choice of the entangling topology (nearest neighbors, circular, or all to all) clearly affects the rate of creation of entanglement in the circuit. We also point out that a careless definition of a full, i.e. all to all, connectivity map can effectively result in a linear nearest-neighbors interaction if unparameterized two-qubit gates (Cnots) are used, something apparently overlooked in the recent literature using this type of ansatz [14,26,27]. By bounding the entanglement generated by the circuit, we are able to simulate QNNs with MPS up to n = 50 qubits.

Figure 1 (caption, opening part not recovered): [...] Each sphere is a tensor, representing a qubit q_j. The entanglement entropy between bi-partitions A and B is computed by "cutting" the connecting edge e_j. (c) Circuits analyzed in the manuscript, depicted with a linear entanglement topology, i.e. entangling gates are only applied between nearest neighbors on a line. (d) Different entanglement topologies: circular, with the first and last qubit of the line connected, and full, where the entangling gates are applied between each pair of qubits (see Appendix C for a clear definition and discussion). When using parameterized two-qubit gates, like the controlled rotations in circuit 2, the entanglement maps are generalized to their parameterized version by substituting X gates on the controlled qubits with the corresponding parameterized operation. Note that the circuit templates 2 and ZZFeatureMap are those used in the QNN of [14], and also that circuits 1, 2 and 3 share similarities with circuits 1, 15, and 13 of [23], respectively.
It should be stressed that such simulations are exact up to a given number of layers, after which a truncation of the entanglement via MPS is applied. By appropriately normalizing the entanglement produced we show that all the points for a given QNN architecture follow the same curve, independently from the number of qubits. Thus, we exploit this behavior to define a universal figure of merit given the QNN architecture, the entangling speed. This figure of merit characterizes how fast the entanglement is produced by the QNN, with respect to the number of layers L.
In addition, we evaluate the expressibility measure of the considered QNNs as defined in [23] and argue that the optimality of the QNN introduced in [14] may be related to its good tradeoff between mild entanglement production and high expressibility. Finally, we employ tools from random matrix theory, specifically convergence to the Marčenko-Pastur distribution, to further characterise the resemblance of sufficiently deep quantum neural networks to random unitary matrices. We also note that, differently from [23], which bases its analysis on the Meyer-Wallach entanglement measure [28], in this work we make use of the entanglement entropy among subsystems, which allows for a more careful analysis of the entanglement distribution in the system, and is also readily accessed in an MPS simulation with no computational overhead.
The manuscript is organized as follows. In Sec. 2 we review the basics of tensor networks and MPS, and introduce the Von Neumann entropy as an entanglement measure. We then discuss the entanglement entropy properties of random quantum states. We proceed by discussing the most recent results on parameterized quantum circuits and QNNs, especially on the relation between randomness, trainability, and entanglement found in these circuits. In Sec. 3 we show the results of our analysis for various QNN architectures, and discuss the results in Sec. 4. Finally, we discuss the implications of our work and possible routes for future investigations in Sec. 5.

Tensor Networks and Matrix Product States
An n-qubit quantum state is defined in a Hilbert space H of dimension dim(H) = 2^n. The exponential scaling of dim(H) with n makes the classical description of quantum states an exponentially expensive task. This problem is widely known in many-body quantum physics, and many different techniques have been developed to alleviate the issue, like the Density Matrix Renormalization Group (DMRG) or Tensor Network (TN) techniques [29,30].
In this work, we use Tensor Network methods to efficiently describe the n-qubit state. In particular, we employ Matrix Product States (MPS), which are a specific tensor network ansatz particularly suited to represent 1-dimensional (i.e. like atoms on a chain, as in Fig. 1) quantum states [31]. The power of tensor networks lies in the assumption that we are only interested in a tiny subspace of the entire Hilbert space, namely the states that display a limited amount of entanglement.
An n-qubit pure state |ψ⟩ ∈ H can be written as an MPS as follows [31]

$$|\psi\rangle = \sum_{s_1,\dots,s_n=0}^{1} \; \sum_{\alpha_1,\dots,\alpha_{n+1}} M^{[1],s_1}_{\alpha_1\alpha_2} M^{[2],s_2}_{\alpha_2\alpha_3} \cdots M^{[n],s_n}_{\alpha_n\alpha_{n+1}} \, |s_1,\dots,s_n\rangle. \qquad (1)$$

Each tensor $M^{[i],s_i}_{\alpha_i\alpha_{i+1}}$ is a local description of the i-th site, which allows one to apply a local operator to a certain site without the need to change all the other coefficients. For a fixed $s_i$, $M^{[i],s_i}$ is a χ × χ complex matrix, meaning that Eq. (1) is the sum of basis elements weighted by matrix products. The integer χ is called the MPS bond dimension, and a sufficiently high χ is needed to express a general |ψ⟩ in such form. However, MPS with a lower χ can still encode the states of interest, namely those with a limited amount of entanglement, albeit clearly not all possible states. In particular, to correctly describe any quantum state the bond dimension needed is $\chi = d^{\lfloor n/2 \rfloor}$, where d is the local dimension of the degrees of freedom (d = 2 for qubits). We can also efficiently evolve the state under the application of 2-qubit gates, using an approach known in the literature as time-evolving block decimation [32], and perform measurements. Simulations using MPS are not bounded by the number of qubits in the system, but by the amount of entanglement generated inside it, as we explain in detail in Section 2.2.
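As an illustration of the MPS form discussed above, the following is a minimal NumPy sketch that splits a dense statevector into site tensors through successive singular value decompositions. It is not the simulation code used in this work: the function names `state_to_mps` and `mps_to_state`, the tensor layout (left bond, physical index, right bond), and the optional `chi_max` argument are assumptions made for this example, and a dense statevector is only tractable for small n.

```python
import numpy as np

def state_to_mps(psi, n, chi_max=None):
    """Split a 2**n statevector into n tensors of shape (left, 2, right).

    Hypothetical helper for illustration; chi_max optionally caps every bond.
    """
    tensors = []
    rest = np.asarray(psi, dtype=complex).reshape(1, -1)  # (left bond, remaining sites)
    for _ in range(n - 1):
        left = rest.shape[0]
        rest = rest.reshape(left * 2, -1)          # group (left bond, current qubit)
        u, s, vh = np.linalg.svd(rest, full_matrices=False)
        if chi_max is not None:                    # keep only the largest singular values
            u, s, vh = u[:, :chi_max], s[:chi_max], vh[:chi_max]
        tensors.append(u.reshape(left, 2, -1))
        rest = s[:, None] * vh                     # push the weights to the right
    tensors.append(rest.reshape(rest.shape[0], 2, 1))
    return tensors

def mps_to_state(tensors):
    """Contract all bonds to recover the dense statevector."""
    out = tensors[0]
    for t in tensors[1:]:
        out = np.tensordot(out, t, axes=([-1], [0]))
    return out.reshape(-1)

rng = np.random.default_rng(0)
n = 6
psi = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi /= np.linalg.norm(psi)

mps = state_to_mps(psi, n)
bond_dims = [t.shape[2] for t in mps[:-1]]
print(bond_dims)                                   # bond dimension at each cut
print(np.allclose(mps_to_state(mps), psi))         # exact when no truncation is applied
```

For a generic random state the largest bond sits at the middle cut and equals 2^⌊n/2⌋, in line with the bound quoted in the text; passing a smaller `chi_max` yields an approximate representation.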
Nonetheless, while the use of an MPS simulation imposes some constraints on the maximum entanglement that it is possible to represent, this issue is relevant only for very deep circuits involving many qubits. Indeed, we reliably simulate circuit instances involving up to n = 50 qubits and moderate depth, which is already sufficient to provide clear insights on the entanglement entropy generated in such circuits. Moreover, as explained below in Sec. 2.2, during an MPS simulation one has constant access to the singular values of the quantum state, so the entanglement of the state can be calculated on the fly without any computational overhead. Thus, MPS are an effective tool to study the entanglement properties of quantum circuits, especially in regimes that cannot be easily accessed with a full-scale simulation of the statevector of the system.

Entanglement measure in Matrix Product States
Entanglement in quantum states can be evaluated using the so-called Von Neumann entanglement entropy. Let ρ = |ψ⟩⟨ψ| be the quantum state of a system of n qubits, and consider a bipartition A, B of such system, with $n_A$ and $n_B = n - n_A$ qubits respectively, like the one shown in Fig. 1(b). The entanglement entropy of the subsystem A, having reduced density matrix $\rho_A = \mathrm{Tr}_B[\rho]$, is defined as

$$S(\rho_A) = -\mathrm{Tr}\left[\rho_A \log \rho_A\right], \qquad (2)$$

and quantifies the amount of entanglement shared between the party A and its complement B. Note that throughout the whole manuscript we consider logarithms in natural base e. If A and B are in a product state then $S(\rho_A) = 0$, while if the two subsystems share maximal entanglement one has $S(\rho_A) = n_A \log(2)$ [31]. An important property of Eq. (2) is that the entanglement entropy of the two subsystems is equal, namely $S(\rho_A) = S(\rho_B)$, as can be easily checked using the Schmidt decomposition of the pure global state ρ = |ψ⟩⟨ψ| (see below). It turns out that matrix product states are a natural tool to characterize the entanglement entropy of a quantum system. This can be illustrated by considering the simple case of a state of n = 2 qubits. The statevector can be expressed in the Schmidt decomposition as

$$|\psi\rangle = \sum_{\alpha} \lambda_\alpha \, |\xi_\alpha\rangle_1 |\eta_\alpha\rangle_2, \qquad (4)$$

where the sum runs over the $\chi_s$ Schmidt coefficients, $\chi_s$ is the Schmidt rank, $\lambda_\alpha$ are the Schmidt coefficients, and $\{|\xi_\alpha\rangle_1\}$, $\{|\eta_\alpha\rangle_2\}$ are orthonormal bases in the space of the first and second qubit respectively. Using the decomposition (4) in Eq. (2), the entanglement entropy between the two qubits then amounts to

$$S = -\sum_{\alpha} \lambda_\alpha^2 \log \lambda_\alpha^2 .$$

In an MPS simulation we always have access to a subset of the Schmidt coefficients, since such representation is built by iteratively applying the Singular Value Decomposition (SVD), a procedure equivalent to Schmidt-decomposing a quantum state. The reason why we have access only to a subset of them is that we impose the following conditions on the Schmidt coefficients. Listing the coefficients in descending order, i.e. $\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{\chi_s}$, then:
• Schmidt coefficients whose ratio with $\lambda_0$ is smaller than ϵ are discarded. The value of ϵ in this work is fixed at $\epsilon = 10^{-9}$;
• only the largest $\chi_{max}$ coefficients are retained. The value $\chi_{max}$ is called maximum bond dimension.
The approximation we are performing is the optimal one in terms of the represented entanglement. Then, the measure of entanglement for the MPS becomes the same sum restricted to the retained Schmidt coefficients,

$$S \simeq -\sum_{\alpha \,\in\, \mathrm{retained}} \lambda_\alpha^2 \log \lambda_\alpha^2 .$$

As explained in detail in Appendix F, despite the approximations, the faithfulness of the simulation can be easily monitored. Finally, we remark that since we have constant access to the considered subset of Schmidt coefficients during the state evolution, we are able to compute the entanglement entropy of a quantum state without any computational overhead.
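The truncation rules described above (a relative cutoff ϵ on the Schmidt coefficients and a maximum bond dimension χ_max) and the resulting entropy can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, and renormalizing the retained weights after truncation is an assumption of this example rather than something stated in the text.

```python
import numpy as np

def truncate_schmidt(lam, eps=1e-9, chi_max=None):
    """Apply the two truncation rules to a list of Schmidt coefficients."""
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]  # descending order
    lam = lam[lam / lam[0] >= eps]                     # drop ratios below eps
    if chi_max is not None:
        lam = lam[:chi_max]                            # keep the chi_max largest
    return lam

def entropy_from_schmidt(lam):
    """Von Neumann entropy (natural log) from Schmidt coefficients."""
    p = np.asarray(lam, dtype=float) ** 2
    p = p / p.sum()            # assumption: renormalize after truncation
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Two equal Schmidt coefficients: a maximally entangled pair of qubits.
bell = [2 ** -0.5, 2 ** -0.5]
print(entropy_from_schmidt(truncate_schmidt(bell)), np.log(2))

# A coefficient far below the cutoff is discarded before computing the entropy.
print(len(truncate_schmidt([1.0, 1e-3, 1e-12])))
```

Because the coefficients are already available after each SVD, evaluating this sum adds only a negligible cost on top of the simulation itself.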

Entanglement entropy in random quantum states
In this section we briefly describe the entanglement features of uniformly distributed random pure quantum states, that is quantum states sampled according to the unique unitarily invariant probability distribution induced by the Haar measure. Denoting by U(n) the group of $2^n \times 2^n$ unitary matrices, there is a unique unitarily invariant probability measure µ(U) defined on the group, and such measure is called Haar measure [33,34,35]. Unitary invariance corresponds to the requirement that the measure is invariant under translations in the space of unitary matrices, that is

$$d\mu(U) = d\mu(VU) = d\mu(UV) \quad \text{for every } V \in \mathrm{U}(n).$$

The Haar measure induces a uniform probability distribution in the space of unitary matrices, so that sampling a quantum state according to the Haar measure means randomly picking a state uniformly from the space of quantum states. We denote with P(n) such probability distribution. We are interested in the entanglement features of random quantum states, particularly in the entanglement entropy. Let $|\psi\rangle \in (\mathbb{C}^2)^{\otimes n}$ be a quantum state of n qubits sampled from the uniform distribution |ψ⟩ ∼ P(n), and consider a bipartition of the n-qubit system in two subsystems A and B, of size $n_A$ and $n_B = n - n_A$ respectively. Then, for $n_A \leq n_B$, the expected value of the entanglement entropy (2) corresponding to this cut amounts to the Page value [33,36]

$$\mathbb{E}\left[S(\rho_A)\right] = \sum_{j=d_B+1}^{d_A d_B} \frac{1}{j} - \frac{d_A - 1}{2 d_B} \simeq \log d_A - \frac{d_A}{2 d_B}, \qquad (7)$$

where $d_B = 2^{n_B}$, $d_A = 2^{n_A}$ are the local dimensions of the two subsystems, and the expectation value is over the uniform probability distribution $\mathbb{E}(\cdot) = \mathbb{E}_{|\psi\rangle \sim P(n)}(\cdot)$. One can check that the entanglement is highest whenever the two partitions have equal size $n_A = n_B = n/2$ (for n even, and similarly for n odd, $n_A = \lfloor n/2 \rfloor$ and $n_B = \lceil n/2 \rceil$). From Eq. (7), and since the maximum value of the entanglement entropy for such bipartition is $\log d_A$, obtained if the subsystems A and B share maximal entanglement, one concludes that random states are generally highly entangled. Indeed, in ref. [33] it was shown that the probability that a random pure state has entanglement entropy lower than $\log d_A - d_A/(2 d_B)$ is exponentially small. Thus, with very high probability, random quantum pure states are almost maximally entangled.
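The Page value can be checked numerically on small systems. The sketch below assumes the standard closed form of Page's result, $\sum_{j=d_B+1}^{d_A d_B} 1/j - (d_A-1)/(2d_B)$, taken from the general literature, and samples Haar-random states by normalizing complex Gaussian vectors. Function names and the choice n = 10 are illustrative only.

```python
import numpy as np

def page_entropy(n_a, n_b):
    """Closed-form Page value for n_a <= n_b qubits (natural log)."""
    d_a, d_b = 2 ** n_a, 2 ** n_b
    j = np.arange(d_b + 1, d_a * d_b + 1)
    return float(np.sum(1.0 / j) - (d_a - 1) / (2 * d_b))

def haar_state(n, rng):
    """Haar-random pure state: a normalized complex Gaussian vector."""
    psi = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
    return psi / np.linalg.norm(psi)

def bipartite_entropy(psi, n_a):
    """Entanglement entropy of the first n_a qubits versus the rest."""
    lam = np.linalg.svd(psi.reshape(2 ** n_a, -1), compute_uv=False)
    p = lam ** 2
    p = p[p > 1e-16]
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(1)
n, n_a = 10, 5
samples = [bipartite_entropy(haar_state(n, rng), n_a) for _ in range(200)]
mean_s = float(np.mean(samples))

print("sampled mean :", mean_s)
print("Page value   :", page_entropy(n_a, n - n_a))
print("upper bound  :", n_a * np.log(2))
```

The sampled mean sits close to the Page value and about half a nat below the maximum log d_A for the balanced cut, which is the sense in which random states are almost maximally entangled.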

Quantum Neural Networks as Parameterized Quantum Circuits
Currently available quantum devices are still too small and noisy to run well-known quantum algorithms that require fault tolerance, like Shor's factoring [1,12]. For this reason, recent research has focused on a new paradigm of quantum computation based on so-called variational quantum algorithms (VQAs), which trade theoretical success guarantees for feasibility of execution, and are thought to be the most effective way to reach a quantum advantage in the near term, already with small quantum devices [9,11,37].
Variational quantum algorithms are based on PQCs, which are quantum circuits in which some of the unitary operations are characterized by variational parameters to be adjusted in order to solve an optimization problem. The optimal parameters are found by minimizing a properly chosen cost (or loss) function encoding the task to be solved. Let $U_\theta$ be the unitary evolution implemented by a quantum circuit with tunable parameters θ, and O a Hermitian operator (an observable). The goal of variational quantum algorithms is to optimize the quantum circuit parameters θ in order to minimize the expectation value (or variations thereof)

$$f(\theta) = \mathrm{Tr}\left[O \, U_\theta \, \rho \, U_\theta^\dagger\right], \qquad (8)$$

where ρ is an initial quantum state, generally set to the ground state ρ = |0⟩⟨0|. This is achieved by means of an iterative hybrid quantum-classical approach where the quantum computer is used to estimate the cost function (8), and given such value, the classical computer proposes new variational parameters according to an optimization method, the most common one being gradient descent.
There is freedom in the choice of the gate sequence defining the parameterized unitary $U_\theta$, and a choice of its structure is referred to as a variational ansatz. For example, the unitary could be composed of a layer of Pauli rotations around the X-axis on each qubit, R(θ) = exp(−iθX/2), followed by a layer of Cnots acting on pairs of neighboring qubits. This is in fact the general blueprint of variational quantum circuits, as they are generally created by repeating single-qubit parameterized rotations followed by multi-qubit operations which introduce entanglement into the computation. Examples of parameterized quantum circuits are shown in Fig. 1.
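The example ansatz described above (one X-rotation per qubit followed by Cnots on neighboring qubits) can be sketched with a plain statevector simulation. This is an illustrative sketch, not one of the circuit templates of Fig. 1: the helper names, the qubit ordering (qubit 0 as the most significant index), and the linear Cnot chain 0→1→2… are assumptions of this example.

```python
import numpy as np

def rx(theta):
    """Single-qubit rotation exp(-i * theta * X / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def apply_1q(state, gate, q, n):
    """Apply a 2x2 gate to qubit q of an n-qubit statevector."""
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [q]))
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cnot(state, control, target, n):
    """Flip the target qubit on the control = 1 half of the state."""
    psi = state.reshape([2] * n).copy()
    sel = [slice(None)] * n
    sel[control] = 1
    # Indexing the control axis removes it, shifting later axes by one.
    t_axis = target - 1 if target > control else target
    psi[tuple(sel)] = np.flip(psi[tuple(sel)], axis=t_axis).copy()
    return psi.reshape(-1)

def ansatz_layer(state, thetas, n):
    """One layer: RX on every qubit, then Cnots on nearest neighbors."""
    for q in range(n):
        state = apply_1q(state, rx(thetas[q]), q, n)
    for q in range(n - 1):
        state = apply_cnot(state, q, q + 1, n)
    return state

n = 2
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                                   # start from |00>
state = ansatz_layer(state, [np.pi / 2, 0.0], n)
probs = np.abs(state) ** 2
print(probs)                                     # weight only on |00> and |11>
```

With these angles the single layer already produces a maximally entangled two-qubit state, which shows how the multi-qubit gates introduce entanglement on top of the local rotations.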

Quantum Neural Networks.
As is often the case with learning tasks, either classical or quantum, the goal is to solve a problem given access to a dataset of inputs $X = \{x_i\}_i$, representative of the task to be solved. Whenever data is involved, variational quantum circuits are often referred to as Quantum Neural Networks, as they share the very same idea as their classical counterpart: learn patterns in input data by adjusting tunable parameters through the iterative minimization process known as training. In this case, the quantum circuit of the neural network depends on two sets of parameters x and θ, the former being the input data to be analyzed, and the latter the variational parameters to be adjusted (i.e. the weights of the neural network). In the quantum machine learning jargon, the encoding scheme used to load the input data onto the quantum computer is known as feature map, and consists of a unitary operation parameterized by x. We will denote such feature encoding gate with F(x), where x ∈ X. As with the variational unitary, there is no standard choice for a feature map, and one has to pick a specific ansatz, ideally biasing the choice towards architectures built using knowledge of the problem to be solved [21,22]. Summing up, a general QNN can then be expressed as

$$U_L(x, \theta) = \prod_{i=1}^{L} V(\theta_i)\, F(x), \qquad (9)$$

where F(x) is the feature map ansatz depending on the input data x; $V(\theta_i)$ is a variational ansatz depending on trainable parameters $\theta_i \in \theta = (\theta_1, \cdots, \theta_L)$ with $\theta_i \in \mathbb{R}^p$; and L is the number of repetitions (also referred to as layers) of such layered structure.
It was recently shown that uploading the input data multiple times throughout the circuit is essential for quantum neural networks to model higher-order functions of the inputs [25,38]. Such procedure is now standard practice in QNN-based quantum machine learning, and it is called data re-uploading [39,24]. Notice that the input data in the feature map in Eq. (9) is the same in every layer, while the variational blocks V use a different parameter vector in every layer. In Fig. 1(a) we give a graphical representation of the general structure of QNNs. As for the explicit implementation of F and V, there is no fixed choice and these are usually composed of single-qubit rotations followed by entangling operations, either fixed (e.g. Cnots) or themselves parameterized (e.g. controlled rotations). See Fig. 1(c) for some prototypical examples of parameterized blocks proposed in the literature [14,23], which we will consider throughout the manuscript.

Randomness, Entanglement and Trainability.
One of the hardest theoretical challenges affecting quantum machine learning models is the emergence of so-called barren plateaus (BPs) in their optimisation landscape [40]. BPs are regions in parameter space where the loss function is essentially flat, with no interesting minimising direction, so that it is not possible to train the model efficiently, independently of the optimization method used, be it gradient-based [41] or gradient-free [42]. Different sources can lead to the emergence of barren plateaus, and these can be broadly grouped into three main categories: randomness-induced BPs [40,43,44], BPs induced by global cost functions defined with observables having support on a large number of qubits [19], and finally noise-induced BPs [45].
In this work we are concerned with the first type of barren plateaus, which roughly occur when parameterized quantum circuits, when initialised with random parameters, resemble general random unitaries. Indeed, despite being quite limited in terms of qubit connectivity and gate operations, common instances of parameterized quantum circuits are often found to behave as unitary 2-designs, that is they efficiently approximate the statistics of Haar-random unitaries up to the second moment [46]. In this case, one can prove that the variance of the gradients of any cost function f(θ) defined on the circuit will vanish exponentially with the number of qubits n, namely [43]

$$\mathrm{Var}_{\theta}\left[\partial_k f(\theta)\right] \in O\left(b^{-n}\right) \quad \text{for some } b > 1, \qquad (10)$$

where f(θ) is as in Eq. (8). Specifically, the cost function concentrates around its mean value and stays constant almost everywhere in parameter space [47], which makes training unfeasible. Vanishing gradients are used as a witness to assess whether a parameterized quantum circuit resembles a unitary 2-design. Of course, this is only a necessary but not sufficient condition, as one can easily devise a circuit that is not a 2-design but has vanishing gradients, for example using a global cost with a shallow circuit [19].
In addition to vanishing gradients, another witness of randomness is the entanglement generated inside the circuit [44]. Indeed, as discussed previously in Sec. 2.3, random quantum states are almost maximally entangled, so one can use the maximality of entanglement generated by a parameterized circuit as an indicator of the resemblance to a random unitary evolution. As for vanishing gradients, the presence of large entanglement is however only a necessary but not sufficient condition for randomness, as a simple shallow circuit composed of Hadamards and Cnots can create maximally entangled states (GHZ states), which are clearly not random. As discussed in [48], the so-called entanglement-induced BPs [44,49] provide an alternative yet equivalent description of local cost barren plateaus (circuits with global costs always suffer from vanishing gradients [19], regardless of randomness), as they both stem from the proximity of parameterized quantum circuits to unitary 2-designs.
Indeed, if a circuit is a unitary 2-design, then the average entanglement entropy of any subsystem A of dimension $d_A$ ($d_A \leq d_B$) will already be very close to its maximal value $\log d_A$ [48,50] (Eq. (11)), and approaches the Page value (7) for truly Haar-random states. We provide a proof of Eq. (11) based on the Rényi 2-entropy in Appendix A. Recent investigations quantify the tight connection between trainability and randomness in terms of the expressibility, roughly defined as the ability of a parameterized quantum circuit to address the full unitary space [23], and show that highly expressible ansätze have flatter loss landscapes, hence they are harder to train [43]. We further discuss the expressibility measure in Sec. 4.
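For orientation, a standard Rényi-2 argument of this kind can be sketched as follows. This is a generic derivation under the assumption that the circuit reproduces Haar second moments; it is not taken from Appendix A, and the exact form of the bound stated in the source may differ. It uses the known average purity of a random state, the inequality between the Von Neumann and Rényi-2 entropies, and Jensen's inequality for the convex function −log.

```latex
\begin{align}
\mathbb{E}\left[\mathrm{Tr}\,\rho_A^2\right] &= \frac{d_A + d_B}{d_A d_B + 1}, \\
\mathbb{E}\left[S(\rho_A)\right]
  &\ge \mathbb{E}\left[-\log \mathrm{Tr}\,\rho_A^2\right]
   \ge -\log \mathbb{E}\left[\mathrm{Tr}\,\rho_A^2\right]
   = \log \frac{d_A d_B + 1}{d_A + d_B} \\
  &\ge \log d_A - \log\left(1 + \frac{d_A}{d_B}\right)
   \ge \log d_A - \frac{d_A}{d_B}.
\end{align}
```

Since the purity only involves second moments of the unitary, the first line holds for any exact 2-design, and the resulting lower bound differs from the maximal value log d_A by at most d_A/d_B.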
To summarise, while the presence of entanglement is a necessary ingredient to avoid classical simulability, its uncontrolled growth is likely to signal the emergence of barren plateaus. The evaluation of the entangling capabilities of parameterized quantum circuits is then a valuable diagnostic tool to provide information both on the classical simulability and trainability issues of quantum machine learning models. At last, we note that although various methods have been put forward to mitigate the occurrence of BPs [51,52,53], including proposals based on entanglement control [48,49,54], these remain a bottleneck for scaling up quantum machine learning computations based on variational circuits.

Results
We now proceed to analyze the entanglement production in various quantum neural network architectures with different feature maps and variational ansätze, obtained by composing the circuit blocks shown in Fig. 1. In particular, we take as a prototypical example the QNN introduced in [14], argued to be a good candidate for quantum machine learning applications in terms of capacity and expressibility, possibly achieving an advantage over classical counterparts. Such QNN model uses as feature map F(x) the so-called ZZFeatureMap, first introduced in [55] as a classically-hard map to load classical data on a quantum state in a nonlinear fashion. The variational block V(θ) is instead composed of single-qubit rotations followed by entangling operations. In order to better understand the effect of every single operation in the quantum circuit, we also consider variations of the QNN introduced above, varying the feature map, the variational form, and the entangling topology. All considered circuit blocks are graphically represented in Fig. 1.
Let $U_L(x, \theta)$ be the unitary representing a specific quantum neural network with L layers, input data $x = (x_1, \dots, x_m) \in \mathbb{R}^m$, and variational parameters $\theta = (\theta_1, \dots, \theta_p) \in \mathbb{R}^p$, see Eq. (9). We consider M random instances of such QNN, obtained by sampling both the inputs and the variational parameters according to a uniform distribution. Then, we study the entanglement entropy properties of each of these instances and average the result over the M trials (unless stated otherwise, we take M = 100). Thus, when in the following we refer to the entanglement entropy of a quantum circuit, we are always denoting the average over M realizations of that circuit. In order to evaluate the influence of the depth on the entanglement, we repeat this analysis by increasing the number of layers in the quantum neural network, $L = 1, \dots, L_{max}$.
Note that although the total number of parameters (inputs and variational parameters) depends on the specific feature map and variational form used, for the considered circuits such difference generally amounts to a constant and does not have a relevant impact on the results. In Tab. 1 we report the number of parameters in each circuit template analyzed in this work. We anticipate that while the number of parameters in the considered quantum circuits only scales polynomially with the system size n, these are found to be sufficient to reproduce some entanglement features of random unitaries, which are instead characterized by an exponential number of parameters. This is in agreement with results on random quantum circuits stating that polynomial resources are sufficient to approximate unitary designs [56,57]. We refer to Sec. 4 for an extended discussion.

Alternating vs. Sequential data reuploading
As a first analysis, we study the difference in entanglement growth between a standard QNN using alternated repetitions of feature maps and variational forms (as in Fig. 1), and one in which we have first L repetitions of the feature map followed by L repetitions of the variational form. We call the latter structure sequential. The former leverages an alternated evolution of the quantum state which is typical of quantum neural networks using a data re-uploading scheme [24,25,38]. The latter instead uses an initial data-dependent evolution followed by a trainable unitary, thereby creating an architecture similar to quantum kernel machines [58]. While the two structures (alternated and sequential) may be mapped to each other using ancillary qubits [59], they can have rather different performances, and we hereby show how they also create entanglement in a different way. Specifically, given the two unitary evolutions, namely the fixed input-dependent feature map F(x) and the varying parameterized variational form $V(\theta_i)$, one expects the alternated dynamics to introduce randomness at a faster rate than the sequential process and hence introduce more entanglement in the system. Such intuition is confirmed by the numerical results, and may be understood as a consequence of the universality of the alternating dynamics proved for example for QAOA circuits [60,61].
Here we use F = C zz and V = C 2 , as defined in Fig. 1, both with linear topology. Let S alt and S seq be the entanglement entropy of the bipartition with an equal number of qubits, which is generally the highest, for the alternating and sequential structure, respectively. We define the normalized difference ∆S between S alt and S seq and study its behavior as the depth of the quantum circuit is increased, as shown in Fig. 2.
The metric is always positive and features a maximum, implying that the alternated structure is creating entanglement faster (i.e. with fewer layers) than its non-alternated counterpart. Note that for L = 1 layers the two structures are identical, so the generated entanglement is the same up to the statistical error, which explains why all the curves start around zero. At a high number of repetitions, the two structures tend to the same value, showing a ∆S ≃ 0, which can be understood in light of the results presented in the following sections: as the number of layers of a QNN is increased, the entanglement rapidly converges to that of a Haar-distributed random state, thus the alternated and non-alternated structure eventually converge to the same value. Given the higher entanglement production rate of the alternated structure, in the following analysis, we shall focus on this structure only.

Entanglement distribution across bonds
It is natural to ask how the choice of the feature map, the variational form, and the entangling topology impact the growth of entanglement of the quantum state. In this section, we start to explore this question by studying how entanglement is distributed across all possible ordered bipartitions of the n qubits in the network. That is, given an MPS representation as in Fig. 1(b), we study the entanglement entropy corresponding to each bond in the linear chain. Denoting with e i the bond connecting qubits q i and q i+1 , the entanglement entropy of that bond is (see Eq. (2))

$$S(e_i) = -\mathrm{Tr}\left[\rho_{[1:i]} \log \rho_{[1:i]}\right], \tag{13}$$

where ρ [1:i] is the reduced density matrix of all the qubits up to the i-th one, and ρ is the state obtained from the quantum neural network.

In Fig. 3 we show the entanglement entropy distribution for the case of n = 8 qubits using three different quantum neural network architectures: in panel (a) the one proposed in [14] with feature map F = C zz and variational ansatz V = C 2 , both with linear entanglement; in (b) the same as before but using a circular entanglement topology; and finally in panel (c) a simpler circuit using a tensor product feature map F = C 1 which encodes data independently on each qubit, followed by the same variational ansatz V = C 2 , both again with linear entanglement. For reference, we also show the expectation value of the entanglement entropy for Haar-random quantum states evaluated with Eq. (7), as well as an upper bound given by the highest possible entanglement log(min(d A , d B )), obtained if the two partitions A and B were maximally entangled. Note that while we report only the simulation data for n = 8, the discussion has general validity as identical results hold for all tested numbers of qubits, n = 2, . . . , 20. First of all, the findings agree with the intuition that deeper circuits are able to create more highly entangled states than shallower ones, in accordance with results from [23].
In particular, the entanglement entropy is higher at the center of the chain. Clearly, depending on the specifics of the QNN, the entanglement grows faster in certain architectures with respect to others. Regarding the effect of the entangling topology, comparing panels (a) and (b) we see that circular connections produce greater entanglement compared to the nearest-neighbors interaction and that such entanglement grows at a faster rate as the number of layers is increased. As for the choice of the feature map, since the QNN in panel (c) produces entanglement only through the entangling gate in the variational blocks, its entanglement is lower and also grows slower with respect to the QNN in panel (a), even though it has twice the number of parameters in the feature map.
Interestingly, however, as the number of layers approaches the number of qubits L ≈ n, all investigated QNNs converge to the same values, that is those obtained for random states sampled from the uniform Haar distribution. Deep enough QNNs are then flexible enough to reproduce the same entanglement spectrum as random states, which, as discussed in Sec. 2.3, are very highly entangled. Again, even though the measure of entanglement is different, this is in agreement with the results presented in [23], where the convergence to the Haar distribution is encountered for various parameterized quantum circuits, and also with other results in the literature regarding the properties of random quantum circuits to approximate the Haar distribution [62,63]. We will discuss this in more detail in Sec. 4. A more in-depth analysis of the convergence is the subject of the next section.
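As an illustration of the bond entropies discussed in this section, the following minimal Python sketch computes the entanglement entropy of every ordered bipartition of a state via the Schmidt (singular value) decomposition. It assumes a dense state vector handled with NumPy rather than an MPS, so it is only meant for small n; the function name `bond_entropies` is ours and is not taken from the simulation code used in this work.

```python
import numpy as np

def bond_entropies(psi, n):
    """Von Neumann entropy (natural log) across each bond e_i, i = 1..n-1."""
    psi = np.asarray(psi, dtype=complex)
    psi = psi / np.linalg.norm(psi)
    entropies = []
    for i in range(1, n):
        # Rows index the first i qubits, columns the remaining n - i qubits.
        m = psi.reshape(2**i, 2**(n - i))
        s = np.linalg.svd(m, compute_uv=False)
        p = s**2                  # eigenvalues of the reduced density matrix
        p = p[p > 1e-15]          # drop numerical zeros before taking the log
        entropies.append(float(-np.sum(p * np.log(p))))
    return entropies

n = 4
# Product state |0000>: no entanglement on any bond.
product = np.zeros(2**n)
product[0] = 1.0
# GHZ state: entropy log(2) on every bond.
ghz = np.zeros(2**n)
ghz[0] = ghz[-1] = 1 / np.sqrt(2)

product_S = bond_entropies(product, n)
ghz_S = bond_entropies(ghz, n)
```

The two test states bracket the simplest cases: a product state carries zero entropy on all bonds, while a GHZ state carries log 2 on each of them, far below the Haar-random values discussed above.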

Entanglement scaling with increasing depth
In order to better understand the entanglement scaling properties of QNNs, we introduce a new quantity, the total entanglement entropy S tot created in the MPS chain, defined as the sum of the entanglement entropies of all the ordered bipartitions of the quantum state. We use this global measure to quantify how fast QNNs approach the Haar distribution in terms of overall entanglement production. In particular, we define a new figure of merit, the entangling layers L, as the number of layers needed by an architecture to reach 90% of the total entanglement of a Haar distributed state S Haar tot , namely

$$L = \min\left\{\ell \,:\, S_{\rm tot}(\ell) \geq 0.9\, S^{\rm Haar}_{\rm tot}\right\}. \tag{14}$$

The choice of the 90% threshold allows us to select states that are already very close to the Haar-random value, and avoids the undesired oscillating behaviours obtained when higher thresholds are used, e.g. 99%, which are caused by statistical fluctuations (recall that every QNN is sampled multiple times with different parameters to calculate averages).
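The definition of the entangling layers can be sketched in a few lines of Python. The numbers below are purely illustrative placeholders (they are not simulation results), and the helper names are hypothetical.

```python
def total_entanglement(bond_entropies):
    """S_tot: sum of the entropies of all ordered bipartitions (bonds)."""
    return sum(bond_entropies)

def entangling_layers(s_tot_per_layer, s_tot_haar, threshold=0.9):
    """Smallest number of layers (1-indexed) with S_tot >= threshold * S_tot^Haar.

    Returns None if the threshold is never reached in the data provided.
    """
    for layers, s_tot in enumerate(s_tot_per_layer, start=1):
        if s_tot >= threshold * s_tot_haar:
            return layers
    return None

# Hypothetical averaged values of S_tot for L = 1, ..., 6 (illustrative only).
s_tot_curve = [2.0, 4.1, 6.3, 8.2, 9.1, 9.4]
s_tot_haar = 10.0  # hypothetical Haar reference value
L_90 = entangling_layers(s_tot_curve, s_tot_haar)
```

Because the curve is an average over random parameters, a threshold close to 100% would make the returned depth sensitive to statistical fluctuations, which is the motivation given above for the 90% choice.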
In Fig. 4 we show the behavior of L for four different QNNs as the number of qubits is increased. Note that each QNN is considered with all the three possible entangling topologies (linear, circular and full as defined in Fig. 1). Finally, note that all QNNs leverage the same variational form V = C 2 , while the feature map is changed, as reported in the legend.
First, we observe that the entangling layers display a linear behavior when a linear entanglement topology is used. This means that the number of layers needed to entangle the system scales linearly with the size of the system. The behavior changes abruptly when we move to a circular or full entangling topology. All architectures display a faster entanglement production when passing from a linear to a circular topology, as can be seen from the lower slope of the curves. The all-to-all connectivity speeds up entanglement production only for F = C zz , C 3 , while the circuits F = C 2 , C 1 show essentially the same behavior as in the linear case. We now proceed to discuss these results in more detail.
We start by comparing the entangling capabilities of C zz vs. C 2 . Both with linear and circular entangling topology, C 2 is able to produce entanglement essentially at the same rate as C zz , despite C 2 having a much simpler structure, with half the number of two-qubit gates. However, things change dramatically using a full entangling map, as the QNN with F = C zz reaches the 90% threshold already at L = 1, while C 2 needs more layers, showing the same dependence as with a linear connectivity. While counter-intuitive at first, it is easy to see that the entanglement generated by C 2 with a full architecture is indeed equivalent to the linear one. This is due to a simple circuit identity regarding networks of Cnots reported in Fig. 5. Such circuit identity holds for any number of qubits, which makes the full entangling map as shown in Fig. 1 just a linear entangling map in disguise (in particular, it is the inverse of the linear entangling map). See Appendix C for a more precise statement, discussion and proof. Such circuit identity thus explains the equivalence of the yellow (F = C 2 ) and red (F = C 1 ) curves between the first and last plot of Fig. 4.
Such equivalence clearly does not hold if controlled rotations are used instead of Cnots. Indeed, the feature map F = C 3 uses controlled rotations with independent random parameters, and given that these gates do not cancel out, the entanglement is always increasing going from low to high connectivity. Note that such increase is mainly due to the feature map, as the variational ansatz V = C 2 is the same as other structures, suffering from the Cnots cancellation issue described above.
For comparison, we also show the performance of a QNN with the tensor product feature encoding F = C 1 , using no entangling operations. Interestingly, even if this QNN uses two-qubit interactions only inside the variational blocks, these are sufficient to create entanglement similar to the other considered QNNs, albeit at a slower yet comparable rate.
We report in Appendix E the complete simulation results detailing the evolution of the entanglement with the depth of the circuit, for different numbers of qubits.

Entanglement Speed
So far we have presented numerical evidence for the entanglement production in QNNs up to a maximum of 20 qubits. In the following we extend the analysis leveraging MPS to simulate quantum systems of bigger size up to 50 qubits, with a maximum bond dimension of χ max = 4096. More importantly, we show how the entanglement growth follows a behavior that is specific to each particular QNN architecture and the number of layers considered, but independent of the number of qubits in the circuit. We can thus uniquely assign an entanglement speed value to each QNN, which, we stress again, only depends on the choice of the ansatz, and holds identically for any instantiation of that QNN with arbitrary number of qubits.
Taking into account the entanglement growth discussed in Sec. 3.3, we restrict the analysis to a linear architecture, to increase as much as possible the number of layers we can correctly simulate with tensor network techniques. Indeed, the entanglement production with a circular or full topology is too fast to allow for a convergent simulation with MPS for deep circuits.
Furthermore, we introduce the maximum Haar entanglement entropy, defined as the maximum across all bond entropies for a given number of qubits, as

$$\begin{aligned} S^{\rm Haar}_{\max} &= \max_{i}\, \mathbb{E}_{\rm Haar}\left[S(e_i)\right] && (15)\\ &\approx \frac{n}{2}\log 2 - \frac{1}{2}, && (16) \end{aligned}$$

where the approximation in the second line has an error that scales as O(2^{−n/2}), see Appendix B for its derivation. Thus, for n ≥ 30 qubits, when the exact computation of the Haar entanglement entropy is unfeasible, we employ the approximated Eq. (16). Finally, we define the normalized entanglement entropy S n as

$$S_n = \frac{S}{S^{\rm Haar}_{\max}}, \tag{17}$$

where S is the entanglement entropy generated by the QNN. We stress that S n is normalized to the maximum Haar entanglement for a fixed n, not to the real maximum of the entanglement, which would be S = (n/2) log 2 for the equal size bipartition.

In Fig. 6 we show the evolution of S n versus the normalized number of layers L/n for n ∈ {8, 12, 16, 20, 30, 50} qubits, for the QNN defined with F = C zz , V = C 2 with linear connectivity. We note that all the points, independently of the system size n, follow the same curve: an initial linear growth of the entanglement is followed by a saturation to the Haar-random value for the entanglement entropy (7). In particular, we check this behaviour also at large system sizes with n = 30, 50 qubits and circuits with up to L = 11 layers, and confirm that such scaling is indeed size independent. See Appendix F for a discussion on the errors introduced by truncation in the MPS representation for simulations with n = 30, 50 qubits.
Inspired by the behavior of S n , we introduce a measure for the entanglement production which is specific to a given QNN architecture (feature map plus variational ansatz) and independent of the number of qubits. Borrowing from the literature on random quantum circuits, it is known that the entanglement of a system undergoing random evolution initially grows linearly in time (depth of the circuit) before reaching the plateau of Haar random states [64,65,66,54]. Indeed, as is clear from Fig. 6, we observe the same initial linear growth, and thus we define the entangling speed v s as the slope of this initial linear regime,

$$S_n \simeq v_s\, \frac{L}{n}, \tag{18}$$

restricted to the region below 0.5, a threshold such that the linear behavior holds. The entangling speed can thus be obtained by fitting the curve in Fig. 6 with the linear function (18) in the appropriate range. We report in Tab. 2 the entangling speed for a subset of the inspected architectures, and notice that entanglement is produced at appreciably different rates.
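The fitting procedure can be sketched as follows. This is a minimal illustration under two assumptions of ours: the 0.5 threshold is applied to the normalized entropy S n , and the line is constrained to pass through the origin (a least-squares fit of the slope only). The data points are synthetic and the function name is hypothetical.

```python
def entangling_speed(layers_over_n, s_norm, threshold=0.5):
    """Least-squares slope through the origin of S_n versus L/n.

    Only points with S_n <= threshold are used, where the growth is
    assumed to be linear.
    """
    pairs = [(x, y) for x, y in zip(layers_over_n, s_norm) if y <= threshold]
    if not pairs:
        raise ValueError("no data points below the threshold")
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den

# Synthetic curve: linear growth with slope 1.2, then saturation.
x_data = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1.0]
y_data = [0.12, 0.24, 0.36, 0.48, 0.70, 0.90, 0.98]
v_s = entangling_speed(x_data, y_data)
```

Points above the threshold are discarded on purpose: including the saturating part of the curve would bias the slope towards lower values.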
In agreement with the findings of Sec. 3.3, we see that for a linear topology the circuit C 2 builds the entanglement at the fastest rate. Indeed, fixing the feature map to F = C zz , C 2 produces entanglement 3 times faster than C 3 .

To further characterise the applicability of the entangling speed, we show that the behavior of Fig. 6 evaluated for random circuits also holds when the input data x ∈ R n in the feature map F(x) are not drawn from the uniform distribution, but rather from real-world datasets. In particular, we select two common datasets in the machine learning literature, the wine [67] and breast cancer [68] datasets, and calculate the entanglement generated in the circuit when these data are fed into the feature maps (variational blocks are still populated with random parameters as before). The results presented in Fig. 7 are obtained by rescaling all the features of the datasets to the interval [0, π]. For each sample in the dataset, we average over M = 10 runs with randomly drawn parameters for the variational ansatz. The results shown in the figure are then obtained as the average over the whole dataset. The wine dataset (n = 13 features, hence n = 13 qubits, and 178 samples) follows the theoretical curve perfectly, and the breast cancer dataset (n = 9 features, hence n = 9 qubits, and 286 samples) only slightly deviates from it, producing entanglement at a lower rate. We then conclude that the entangling speed depends primarily on the architecture of the circuit rather than on the actual values of the parameters. Clearly, this holds for reasonably distributed data features, that is excluding pathological cases of values being either zero or concentrating around it.

Finally, to verify that the QNN architecture is ultimately responsible for the entanglement speed, we analyze random circuits where the encoding blocks do not share the parameters, but these are sampled independently for each layer, thus effectively removing the data-reuploading feature.
This case is portrayed in Figure 7 with yellow square markers, each obtained by averaging over M = 2000 realizations of the random circuit, from which it is clear that the normalized entanglement S n again follows the same behavior as in the previous scenarios. Thus, the entangling speed can be used as a good estimate of the entanglement generated in a QNN also in real use cases, especially at the start of optimisation, when trainable parameters are usually initialised at random. For example, one could measure the entangling speed of the architecture of interest on a random quantum circuit of just a few qubits, and then estimate the entanglement generated with the same architecture on an arbitrary number of qubits and circuit layers, especially in regimes where simulations are no longer computationally feasible.
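The extrapolation suggested above can be sketched as a one-line estimate. The snippet assumes the linear relation S n ≈ v s L/n and refuses targets above 0.5, where that relation is no longer expected to hold; the function name and the numerical values are hypothetical.

```python
import math

def layers_for_target(v_s, n_qubits, s_target):
    """Estimated number of layers needed to reach normalized entanglement
    s_target, assuming the linear regime S_n = v_s * L / n."""
    if not 0 < s_target <= 0.5:
        raise ValueError("the linear regime is only assumed for 0 < S_n <= 0.5")
    return math.ceil(s_target * n_qubits / v_s)

# Hypothetical speed measured on a small instance, reused for n = 50 qubits.
estimated_layers = layers_for_target(v_s=1.2, n_qubits=50, s_target=0.3)
```

Such an estimate is only as good as the size independence of v s , which in this work is checked numerically up to n = 50 for a linear topology.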

Expressibility
In addition to entanglement, another useful quantity to characterize parametrized quantum circuits is the expressibility, as defined by the authors in [23]. Such measure quantifies how well the QNN is able to explore the Hilbert space by comparing the distribution of fidelities of states generated by the QNNs with that of Haar-distributed random states (see Appendix D for a formal definition and explanation).
Figure 7 - [...] (i) "Random with data reuploading" indicates the case of random synthetic inputs with data reuploading, as shown also in Fig. 6. (ii) "Random" indicates the case of random synthetic inputs without data reuploading, that is encoding blocks in different layers have different random parameters. (iii) "Wine" and "Breast cancer" indicate inputs drawn from the corresponding real-world dataset, used with data reuploading. In all cases, the parameters in the variational blocks are sampled from the uniform distribution Unif(0, π). Such distribution is used also to sample the synthetic random inputs. Data points for real-world datasets are obtained by first averaging over 10 realizations for each sample in the dataset, and then averaging again over the whole dataset. Results for random inputs without data reuploading (yellow square markers) are obtained by averaging over 2000 realizations of the circuits. The error bars show the standard deviation of the mean. Error bars associated with random inputs with data reuploading (blue curve) are not shown to avoid cluttering but are of comparable size with the other points.

Figure 8 - Expressibility of the QNNs analyzed in Fig. 4, for n = 8 qubits with linear entanglement. The expressibility measures how well a variational circuit is able to address the unitary space (the lower, the better). All QNNs use the same variational form V = C 2 , but with different feature maps. As the number of layers is increased, QNNs become more expressible, eventually reaching a plateau.

Thus, in order to have a comprehensive understanding of the factors at play in the behavior of QNNs, in Fig. 8 we show the expressibility measure for the QNNs analyzed in Fig. 4 with a linear connectivity. As one would expect, the expressibility increases as the number of layers is increased, up until a plateau is reached.
Interestingly, the structure with F = V = C 2 turns out to be the least expressible of all the structures considered, even if it is the one producing entanglement at the fastest rate, in agreement with the results reported in [23], as such QNN is indeed very similar to the parameterized circuit labeled '15' in [23]. On the contrary, the QNN with F = C zz and V = C 2 proposed in [14] is able to reach high expressibility while producing entanglement at a controlled pace. As the presence of high entanglement is correlated with trainability issues [44], this QNN attains an optimal balance of mild entanglement with high expressibility even at low depth, which could be related to its good performance in quantum machine learning tasks [14,55]. However, a similar, yet less favorable, balance is achieved by the other two architectures, so further investigation is needed to discriminate where the optimality comes from.
In this respect, the authors in [18] found the expressibility to be correlated with the classification accuracy of QNNs in supervised learning tasks, while weak correlation was found with the entanglement generated inside the circuit, in line with the observations regarding entanglement-induced barren plateaus [44]. As discussed earlier in Sec. 2.5, both expressibility and high entanglement are related to the resemblance of the circuit to a random unitary, but while the former provides more direct evidence, the latter gives an indirect indication. Indeed, there are cases of circuits having low expressibility but high entanglement, indicating that such circuits selectively explore only some highly-entangled regions of the Hilbert space [23].
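To make the comparison of fidelity distributions concrete, the sketch below estimates an expressibility-like quantity in plain Python. It assumes the definition commonly attributed to [23], namely the Kullback-Leibler divergence between a histogram of sampled fidelities and the Haar fidelity distribution P(F) = (N − 1)(1 − F)^{N−2} for Hilbert space dimension N; the binning, sample sizes and function names are our own choices, and no actual QNN is simulated here.

```python
import math
import random

def haar_state(dim, rng):
    """Haar-random state: complex Gaussian entries, then normalization."""
    v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in v))
    return [a / norm for a in v]

def fidelity(psi, phi):
    return abs(sum(a.conjugate() * b for a, b in zip(psi, phi))) ** 2

def expressibility(fidelities, dim, bins=20):
    """KL divergence between the fidelity histogram and the Haar prediction."""
    counts = [0] * bins
    for f in fidelities:
        counts[min(int(f * bins), bins - 1)] += 1
    kl = 0.0
    for k, c in enumerate(counts):
        if c == 0:
            continue
        p = c / len(fidelities)
        a, b = k / bins, (k + 1) / bins
        # Haar probability of the bin [a, b], integrated analytically.
        q = (1 - a) ** (dim - 1) - (1 - b) ** (dim - 1)
        kl += p * math.log(p / max(q, 1e-300))
    return kl

rng = random.Random(0)
dim = 4  # two qubits
haar_fids = [fidelity(haar_state(dim, rng), haar_state(dim, rng))
             for _ in range(4000)]
expr_haar = expressibility(haar_fids, dim)        # close to zero
expr_fixed = expressibility([1.0] * 4000, dim)    # circuit stuck on one state
```

A circuit that always outputs the same state has all fidelities equal to one and a large divergence, while an ensemble that samples states uniformly gives a value close to zero, consistent with "the lower, the better".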

Distribution of the singular values
The randomness of a quantum state can also be probed using tools from random matrix theory. Specifically, this can be done by studying the distribution of the eigenvalues of the reduced density matrices, which are known to follow the Marčenko-Pastur (MP) law when pure random quantum states are considered [69,70]. In more detail, let |ψ⟩ ∈ H A ⊗ H B be a Haar-random bipartite quantum state with Schmidt decomposition |ψ⟩ = Σ_{i=1}^{r} λ i |u i ⟩ ⊗ |v i ⟩, where r = min(d A , d B ) and d A,B is the dimension of the Hilbert space H A,B . The reduced density matrix ρ A = Tr B [|ψ⟩⟨ψ|] has eigenvalues λ i ^2 given by the square of the singular values, and for large system size their distribution is described by the MP distribution [71,72].
In Figure 9 we show the cumulative distribution function C(λ^2) of the eigenvalues of the reduced state of the first half of the qubits, obtained with a QNN with n = 15 qubits, feature map F = C zz , variational ansatz V = C 2 , and linear connectivity. The distribution of the singular values for the QNN is obtained by running the circuit 10^2 times with different sets of parameters, and storing the singular values corresponding to the central cut. Then, we construct the cumulative distribution from the histogram of all the singular values obtained from the simulations. As the number of layers L is increased, the distribution of the eigenvalues approaches the theoretical MP distribution, eventually matching it when the number of layers is equal to the number of qubits. This behavior is displayed also by other QNN architectures. For completeness, we also show the distribution of the eigenvalues of a truly Haar-random quantum state, generated by sampling its entries independently from a complex normal distribution and then normalising it [69], which, as expected, follows the MP curve perfectly.
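The Haar-random reference curve can be reproduced with a short NumPy sketch. For simplicity it assumes an equal bipartition d A = d B (n = 10 qubits, cut in the middle) rather than the n = 15 case of Fig. 9, because in that case the MP cumulative distribution of the rescaled eigenvalues x = d A λ^2 has a simple closed form on [0, 4]; the function names are ours.

```python
import numpy as np

def haar_state(n_qubits, rng):
    """Haar-random state: complex normal entries, then normalization."""
    dim = 2**n_qubits
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def rescaled_spectrum(psi, n_a, n_qubits):
    """Eigenvalues of rho_A multiplied by d_A, so that their mean is 1."""
    m = psi.reshape(2**n_a, 2**(n_qubits - n_a))
    s = np.linalg.svd(m, compute_uv=False)
    return 2**n_a * s**2

def mp_cdf_equal_cut(x):
    """Marchenko-Pastur CDF for d_A = d_B, supported on [0, 4]."""
    x = np.clip(x, 0.0, 4.0)
    r = np.sqrt(x)
    return (0.5 * r * np.sqrt(4.0 - x) + 2.0 * np.arcsin(r / 2.0)) / np.pi

rng = np.random.default_rng(0)
n, n_a = 10, 5
xs = np.sort(np.concatenate(
    [rescaled_spectrum(haar_state(n, rng), n_a, n) for _ in range(50)]))
empirical_cdf = np.arange(1, xs.size + 1) / xs.size
# Largest vertical distance between the empirical and theoretical CDFs.
max_gap = float(np.max(np.abs(empirical_cdf - mp_cdf_equal_cut(xs))))
```

Replacing `haar_state` with the output state of a parameterized circuit would give the analogue of the QNN curves, with `max_gap` expected to shrink as the depth grows.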

Discussion
Moments of the Haar distribution can be approximated efficiently using local random quantum circuits of sufficient depth. Depending on the connectivity dimension D of the qubits, defined as the number of other qubits that are connected to each qubit, random circuits of depth O(poly(t) · n^{1/D}) are sufficient to create approximate unitary t-designs [62,63,57,56], that is circuits that generate a distribution of unitaries which approximately matches moments of the Haar distribution up to order t [46]. Numerical studies suggest that these results also hold for random parameterized quantum circuits of various forms [40,19,43,23].
We extend these results by showing similar behaviour also for quantum neural networks featuring data re-uploading, both for random instances using random inputs and parameters, and for real-world datasets when these are used as inputs in the feature map. In particular, for a linear connectivity, as the number of layers approaches the number of qubits L ≈ n, QNNs display the same entanglement entropy properties as Haar-distributed random states, a fact which can be taken as a proxy for QNNs approximating unitary designs. Such behaviour was also confirmed by studying the randomness of the circuits with other metrics, namely the expressibility and the convergence to the Marčenko-Pastur distribution of the eigenvalues of the reduced states. In both cases, we find strong evidence of the QNN reproducing the same features of random quantum states as the number of layers approaches the system size using a linear connectivity.
Our analysis also underlines the importance of the entangling operations, as careless use of an all-to-all connection can result in unwanted simplifications, making the effective connectivity identical to a nearest-neighbors one. Parameterized two-qubit interactions can solve the problem, even though they may be challenging to implement on real hardware. A good trade-off is achieved with a circular entangling topology, which is immune to simplifications and shows remarkable entangling capabilities. Indeed, from the results of Fig. 4, we see that such connectivity is able to create high multipartite entanglement between qubits already at shallow depth, and with only minor additional hardware resources compared to the linear connectivity. An all-to-all topology instead reaches typical values for entanglement of random states essentially at a constant number of layers L ∈ O(1) (implying in general O(n^2) entangling operations), independently of the system size and the architecture used (when non-trivial feature maps and variational ansätze are used).
While limiting the entanglement inside a quantum neural network may be necessary to ensure its trainability [44], low entanglement makes the circuit prone to be simulated exactly with an MPS, as discussed in Sec. 3.4. Thus, we envision that a sweet spot should be found in order for QNNs to show signs of quantum advantage: the entanglement should be not so high as to preclude trainability, yet not so low as to make the circuit trivially simulable.
Finally, the entangling speed v s (18) can be used as a figure of merit for the entanglement production of a given QNN, independent of the size of the system. Indeed, the entangling speed can be studied and assigned to an architecture in the simulable regime (low number of qubits n), and then used to estimate the number of layers needed to achieve a given amount of entanglement, for any system size. We also stress that v s characterizes the most interesting interval of layers in a circuit. As discussed earlier, a value of the entanglement that is too high might be connected to barren plateaus, underlining the importance of exploring the regime where the entanglement has not saturated yet and the linear regime still holds.
We now briefly comment on interesting future directions of investigation regarding entanglement and QNNs. The focus of this work was to carefully study the entanglement features of common quantum ansätze, specifically when they are initialized with random parameters and no optimization has yet started. A natural follow-up is to ask whether entanglement plays any role also during the optimization process, which is at the core of variational quantum algorithms. While for some specific variational procedures like QAOA [73] or VQE-based ground state solvers [74] one has some knowledge of the structure of the target solution, and hence can infer the behaviour of the entanglement created in the circuit, this is not the case for quantum machine learning tasks, as they are usually very task-dependent. Indeed, current proposals for QML advocate for the use of constrained quantum ansätze specifically tailored to the problem under investigation [22,21,75], and then one expects the depth of the circuit and the entanglement generated inside it to highly depend on the specific task to be solved, and the dataset to fit, either classical or quantum [76]. Moreover, while the use of deep QNN ansätze (with arguably more entanglement) could offer some optimization advantages due to overparametrization [77,78,79], the emergence of barren plateaus suggests using shallow circuits instead [19,45]. The characterisation of the role played by entanglement in QNNs, and how it may be leveraged to achieve a quantum advantage over classical methods, will be the object of future studies.

Conclusion
In this paper we discussed in detail the entanglement generated by different promising Quantum Neural Networks (QNNs) when these are initialised with random parameters, and showed that they reproduce the same properties of random quantum states under various measures.
We employed a Matrix Product States (MPS) simulation of the quantum circuits, which guarantees an easy computation of the entanglement in the circuits, and lets us study large systems composed of up to n = 50 qubits.
We showed that while all the architectures tend to a Haar entanglement distribution for a sufficiently high number of layers, the speed of convergence strongly depends on the specific circuit ansatz. This result highlights the universal behavior of the normalized entanglement production (17) for a given architecture, so we introduced a new measure to characterize a QNN in terms of its entanglement production: the entangling speed (18).
Finally, we argued that a trade-off between expressibility and entanglement is the key to a better understanding of QNN performances and an auspicious target for the search of quantum advantage. While high entanglement is a necessary condition to avoid classical simulability, too large an entanglement is detrimental to the training procedure due to its tight connection with barren plateaus, as discussed in Sec. 2.2. A promising future direction is to extend the entanglement analysis of QNNs not only at initialization but also during the training procedure [48,54,74]. These tests would help to understand if QNNs really are a suitable platform for proving quantum advantage.

Code availability
All simulations with a number of qubits n ≤ 12 were performed using Qiskit [80], while larger systems were simulated with the Quantum Matcha Tea package [81] available at https://baltig.infn.it/quantum_tea/quantum_tea. The simulations for the training sections were performed using Pennylane [82]. All the code for reproducing the results presented here is available at the repository: https://github.com/mballarin97/mps_qnn.

A Lower bound on entanglement entropy for unitary 2-designs
The presented derivation is a straightforward application of known results on the entanglement of random states and properties of Rényi entropies [50,48]. Rényi α-entropies of a density operator ρ are defined as

$$S_\alpha(\rho) = \frac{1}{1-\alpha}\log \mathrm{Tr}\left[\rho^\alpha\right], \tag{19}$$

where lim α→1 S α (ρ) = S(ρ) is the Von Neumann entropy of Eq. (2), and it holds that S β (ρ) ≤ S α (ρ) for β ≥ α. Of particular interest is the Rényi 2-entropy S 2 (ρ) = − log Tr[ρ^2], which depends on the purity Tr[ρ^2] of the system, is much easier to compute, and can be used to lower bound the Von Neumann entropy via S(ρ) ≥ S 2 (ρ).
Let |ψ⟩ ∈ (C^2)^{⊗n} be the state of a composite system made of subsystems A and B with dimensions d A = 2^{n A} and d B = 2^{n−n A}, respectively. Suppose |ψ⟩ is a random state |ψ⟩ = U |ψ 0 ⟩, where U is sampled from an ensemble of unitaries that constitutes at least a unitary 2-design. Then, the average value of the purity of the reduced density matrix ρ A = Tr B [|ψ⟩⟨ψ|] amounts to [50,48]

$$\mathbb{E}\left[\mathrm{Tr}\,\rho_A^2\right] = \frac{d_A + d_B}{d_A d_B + 1}. \tag{20}$$

By the convexity of Rényi entropies with respect to Tr[ρ^α], and using Jensen's inequality (E[f(X)] ≥ f(E[X]) for convex f), one can lower bound the average Rényi 2-entropy as

$$\mathbb{E}\left[S_2(\rho_A)\right] \geq -\log \mathbb{E}\left[\mathrm{Tr}\,\rho_A^2\right] = \log\frac{d_A d_B + 1}{d_A + d_B}. \tag{21}$$

Then, since S(ρ) ≥ S 2 (ρ) ∀ ρ, taking the expectation value on both sides yields a lower bound on the average Von Neumann entropy of ρ A , namely

$$\mathbb{E}\left[S(\rho_A)\right] \geq \log\frac{d_A d_B + 1}{d_A + d_B}, \tag{22}$$

which is the bound shown in Eq. (11) in the main text. If the state |ψ⟩ is instead a truly Haar-random state, that is U is sampled from the uniform Haar distribution and not just from a 2-design, the entanglement entropy is given by the Page value of Eq. (7) in the main text, which is itself lower bounded by [33]

$$\mathbb{E}_{\rm Haar}\left[S(\rho_A)\right] \geq \log d_A - \frac{d_A}{2 d_B}. \tag{23}$$

Summarising, for d A < d B , putting together the bounds (22) and (23) one has

$$\mathbb{E}_{\text{2-design}}\left[S(\rho_A)\right] \geq \log\frac{d_A d_B + 1}{d_A + d_B}, \qquad \mathbb{E}_{\rm Haar}\left[S(\rho_A)\right] \geq \log d_A - \frac{d_A}{2 d_B}. \tag{24}$$

Alternatively, in the limit when the subsystem B is much larger than A, d B ≫ d A , then by approximating the logarithm log(1 + x) ≈ x in (21) one also has

$$\mathbb{E}\left[S_2(\rho_A)\right] \gtrsim \log d_A - \frac{d_A}{d_B}. \tag{25}$$

Thus, the entanglement entropy of a state sampled from a 2-design is close to that of a truly Haar-random state, with both achieving near-maximal entanglement. Of course, one also expects the Von Neumann entropy of a general t-design to be upper bounded by the Page value, E t-design [S(ρ A )] ≤ E Haar [S(ρ A )], with equality obtained in the limit t ≫ 1.
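As a sanity check of the ingredients of this derivation, the following plain-Python sketch samples Haar-random states of two qubits per subsystem (d A = d B = 4), estimates the average purity of ρ A , and compares it with the standard value (d A + d B )/(d A d B + 1) for states drawn from a 2-design. The sample size, seed and function names are arbitrary choices of ours.

```python
import math
import random

def haar_bipartite_state(d_a, d_b, rng):
    """Haar-random state stored as a d_a x d_b matrix of amplitudes."""
    m = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(d_b)]
         for _ in range(d_a)]
    norm = math.sqrt(sum(abs(a) ** 2 for row in m for a in row))
    return [[a / norm for a in row] for row in m]

def purity(m):
    """Tr[rho_A^2], with rho_A = M M^dagger for the amplitude matrix M."""
    total = 0.0
    for row_i in m:
        for row_j in m:
            overlap = sum(a * b.conjugate() for a, b in zip(row_i, row_j))
            total += abs(overlap) ** 2
    return total

rng = random.Random(1)
d_a = d_b = 4
samples = [purity(haar_bipartite_state(d_a, d_b, rng)) for _ in range(2000)]
mean_purity = sum(samples) / len(samples)

expected_purity = (d_a + d_b) / (d_a * d_b + 1)   # 8/17 for d_a = d_b = 4
renyi2_bound = -math.log(expected_purity)         # lower bound on E[S_2]
```

For this small example the bound evaluates to log(17/8) ≈ 0.75, to be compared with the maximal value log 4 ≈ 1.39.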

B Computation of the Haar entanglement distribution
While Eq. (7) is the theoretical definition of the Haar entanglement entropy, it is not possible to exactly compute it, due to the exponential number of terms in the sum. However, it is possible to exploit the similarity of the sum with the harmonic series to obtain a good approximation. First, we denote with H n the truncated harmonic series:

$$H_n = \sum_{k=1}^{n} \frac{1}{k}.$$

Then, we rewrite in a more convenient way the sum in Eq. (7):

$$\sum_{k=d_B+1}^{d_A d_B} \frac{1}{k} = H_{d_A d_B} - H_{d_B}.$$

Using well-known results for the truncated harmonic series [83]:

$$H_n = \log n + \gamma + \frac{1}{2n} - \epsilon_n,$$

where γ ≃ 0.5772 is the Euler-Mascheroni constant, and 0 ≤ ϵ n ≤ 1/(8n^2). Thus, the correction ϵ n goes to zero as the number of terms in the sum n increases, allowing for a meaningful approximation of the value. Using this technique, we are able to estimate the Haar entanglement entropy of a 50-qubit state with an error of the order 10^{−16}.
We now proceed to compute the maximum and average of the distribution with a fixed number of qubits n. Using Eq. (7) and recalling d A(B) = 2^{n A(B)}, n B = n − n A , n A ∈ [1, n/2] we can write:

$$\mathbb{E}_{\rm Haar}\left[S(n_A)\right] = H_{2^{n}} - H_{2^{n-n_A}} - \frac{2^{n_A} - 1}{2^{\,n-n_A+1}}.$$

It is easy to see that the maximum is achieved for n A = n/2. In this scenario 2^{n A} ≫ 1:

$$S^{\rm Haar}_{\max} = H_{2^{n}} - H_{2^{n/2}} - \frac{2^{n/2} - 1}{2^{\,n/2+1}} \approx \frac{n}{2}\log 2 - \frac{1}{2}.$$

Taking into account that for an n qubit system the maximum of the entanglement entropy is S = (n/2) log 2 we can state that, in the large n limit, a Haar state presents a maximally entangled bond.
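The approximation scheme of this appendix can be checked numerically. The sketch below assumes that Eq. (7) is the Page formula, E[S] = Σ_{k=d B +1}^{d A d B} 1/k − (d A − 1)/(2 d B ) for d A ≤ d B , and compares the exact sum with the harmonic-series approximation and with the leading-order expression (n/2) log 2 − 1/2; function names are ours.

```python
import math

EULER_GAMMA = 0.5772156649015329

def harmonic_exact(n):
    """Truncated harmonic series H_n, summed term by term."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

def harmonic_approx(n):
    """H_n ~ log(n) + gamma + 1/(2n); the neglected term is at most 1/(8 n^2)."""
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)

def page_entropy(n_a, n, exact=False):
    """Average entropy of n_a qubits out of n for a Haar-random state (n_a <= n/2)."""
    d_a, d_b = 2**n_a, 2**(n - n_a)
    h = harmonic_exact if exact else harmonic_approx
    return h(d_a * d_b) - h(d_b) - (d_a - 1) / (2 * d_b)

# Central bond of a 16-qubit chain: exact sum versus approximations.
page_exact = page_entropy(8, 16, exact=True)
page_approx = page_entropy(8, 16)
leading_order = 8 * math.log(2) - 0.5

# For n = 50 only the approximate evaluation is practical.
page_50 = page_entropy(25, 50)
```

The exact evaluation requires summing about 2^n terms, which is why it is only used here for n = 16, while the approximate one has constant cost.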

C Triviality of the full entangling map
The full entangling map, defined as
Algorithm 1 Full entangling map
Input: q_1, ..., q_n, qubits
Output: Quantum circuit
for i = 1 to n do
    for j = i + 1 to n do
        Cnot(q_i, q_j)
    end for
end for
can be shown to be equivalent to a nearest-neighbors entangling map with the gates in reversed order, see Fig. 10. The proof is straightforward and obtained by direct evaluation, making use of some circuit identities for networks of Cnots [84]. In particular, (i) a Cnot can be distributed into four Cnots acting on an additional intermediate qubit, Cnot(q_a, q_c) = Cnot(q_a, q_b) Cnot(q_b, q_c) Cnot(q_a, q_b) Cnot(q_b, q_c). [Circuit diagrams of the identities]
The full entangling map can be highly simplified using these three rules, reducing it to a simple sequence of nearest-neighbors interactions. For example, for n = 3 qubits, using (i) to distribute the long-range Cnot, one obtains a circuit made of Cnot(q_2, q_3) followed by Cnot(q_1, q_2). [Circuit diagram: simplification for n = 3]
The simplification process can be iterated for a higher number of qubits by first commuting long-range Cnots to the end of the circuit to create a final cascade, and then making use of the result for the lower-dimensional case. In Fig. 10 the simplification process for n = 4, 5 qubits is explicitly shown, and it generalizes directly to any number of qubits.
Clearly, these results only hold for networks composed of plain Cnots, and do not apply to general two-qubit interactions made of controlled unitaries.
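The equivalence can also be checked numerically. A circuit made only of Cnots permutes computational basis states and acts linearly on the bit strings over GF(2), so two such circuits implement the same unitary exactly when their binary matrices coincide. The sketch below relies on this observation; the function names are hypothetical, qubits are indexed from 0, and the inner loop is assumed to start at j = i + 1 so that control and target always differ.

```python
import numpy as np

def apply_cnot(matrix, control, target):
    """Update the GF(2) map after CNOT(control -> target): target bit ^= control bit."""
    matrix[target, :] ^= matrix[control, :]

def full_map(n):
    """All-to-all scheme: CNOT(q_i, q_j) for every pair i < j, in loop order."""
    m = np.eye(n, dtype=np.uint8)
    for i in range(n):
        for j in range(i + 1, n):
            apply_cnot(m, i, j)
    return m

def reversed_nearest_neighbour_map(n):
    """Nearest-neighbour CNOTs applied from the last pair to the first."""
    m = np.eye(n, dtype=np.uint8)
    for i in reversed(range(n - 1)):
        apply_cnot(m, i, i + 1)
    return m

def forward_nearest_neighbour_map(n):
    """Nearest-neighbour CNOTs in the usual order, for comparison."""
    m = np.eye(n, dtype=np.uint8)
    for i in range(n - 1):
        apply_cnot(m, i, i + 1)
    return m

for n in range(2, 9):
    same = np.array_equal(full_map(n), reversed_nearest_neighbour_map(n))
    print(n, "qubits: full map equals reversed nearest-neighbour map:", same)
```

In this representation every output bit k > 0 of the full map is the XOR of input bits k and k - 1, which is exactly what the reversed nearest-neighbour cascade produces; the forward-ordered cascade gives a different map for three or more qubits.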

D Expressibility of Parameterized Quantum Circuits
The expressibility introduced in [23] quantifies how well the QNN is able to explore the unitary space by comparing the distribution of fidelities of states generated by the QNN with that of Haar-random states.
Figure 10 - Equivalence of the full entangling map with a nearest-neighbors scheme. Using the circuit identities discussed in the main text, it is straightforward to check that the all-to-all entangling scheme as defined in Alg. 1 is equivalent to a nearest-neighbors interaction.
Let $U(\phi)$ be the unitary operation implemented by a parameterized quantum circuit (PQC) with parameters $\phi$ (in our case, $\phi = (x, \theta)$), and let $|\psi_\phi\rangle = U(\phi)|0\rangle$. Given two realizations of the PQC with parameters $\phi_1$ and $\phi_2$, consider the fidelity $F = |\langle\psi_{\phi_1}|\psi_{\phi_2}\rangle|^2$. By repeatedly sampling two sets of parameters and evaluating the corresponding fidelity $F$, one can construct a histogram approximating the probability distribution $\hat{P}_{\mathrm{PQC}}(F)$ of the fidelity for states generated by the PQC. For Haar-random quantum states, the probability density function of fidelities is known and amounts to $P_{\mathrm{Haar}}(F) = (N - 1)(1 - F)^{N - 2}$, where $N = 2^n$ is the dimension of the Hilbert space [85].
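The histogram comparison can be sketched as follows. The snippet bins sampled fidelities and evaluates a discretised Kullback-Leibler divergence against the Haar density, assumed here in its normalised form (N - 1)(1 - F)^(N - 2). Since no specific PQC is fixed, two stand-in ensembles are used: Haar-random states, and an "idle" circuit that always returns |0...0>. The function names, the number of bins and the sample sizes are all illustrative assumptions.

```python
import numpy as np

def haar_state(dim, rng):
    """Haar-random pure state: a normalised complex Gaussian vector."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def haar_bin_probabilities(dim, edges):
    """Integral of (N-1)(1-F)^(N-2) over each bin, via the CDF 1 - (1-F)^(N-1)."""
    cdf = 1.0 - (1.0 - edges) ** (dim - 1)
    return np.diff(cdf)

def kl_to_haar(fidelities, dim, bins=50):
    """Discretised KL divergence between the fidelity histogram and the Haar one."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    counts, _ = np.histogram(np.clip(fidelities, 0.0, 1.0), bins=edges)
    p = counts / counts.sum()
    q = haar_bin_probabilities(dim, edges)
    mask = p > 0  # empty bins contribute zero (0 log 0 = 0)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def sample_fidelities(sampler, pairs, rng):
    """Fidelity |<psi_1|psi_2>|^2 for independently sampled pairs of states."""
    return np.array([abs(np.vdot(sampler(rng), sampler(rng))) ** 2 for _ in range(pairs)])

rng = np.random.default_rng(1)
dim = 2**2  # two qubits, so that every Haar bin probability is well above zero
haar_f = sample_fidelities(lambda r: haar_state(dim, r), 5000, rng)
idle_f = sample_fidelities(lambda r: np.eye(dim)[0].astype(complex), 5000, rng)
print("Haar-random ensemble:", kl_to_haar(haar_f, dim))
print("idle circuit:", kl_to_haar(idle_f, dim))
```

A divergence close to zero signals a fidelity distribution close to the Haar one, while the idle circuit, whose fidelities all equal one, yields a large value. For larger N the Haar bin probabilities near F = 1 become extremely small, so the number of bins and samples would need care.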
The expressibility is then defined as the Kullback-Leibler divergence $D_{\mathrm{KL}}$ between the estimated fidelity distribution and that of a Haar-distributed ensemble, namely
$$\text{Expressibility} := D_{\mathrm{KL}}\big(\hat{P}_{\mathrm{PQC}}(F)\,\|\,P_{\mathrm{Haar}}(F)\big). \tag{35}$$

E Extensive analysis of the entanglement scaling with the increasing depth
In Fig. 11 we show the behavior of the total entanglement $S_{\mathrm{tot}}$ defined in Eq. (14) for four different QNNs as the depth of the quantum circuit is increased. Note that each QNN is considered with all three possible entangling topologies (linear, circular and full, as defined in Fig. 1), and the results are shown for several numbers of qubits, n = 4, 6, 8, 10, 12. Finally, note that all QNNs use the same variational form $V = C_2$, while the feature map is changed, $F = C_{zz}, C_2, C_3, C_1$ for panels (a), (b), (c) and (d), respectively. See the main text for comments on the results.

F Convergence of MPS simulations
Tensor network methods, and MPS in particular, simulate large systems only approximately; in this work we reach up to n = 50 qubits. However, the error introduced by the approximations can be monitored, so one always has an estimate of the faithfulness of the tensor network simulation [70]. Let $|\psi_{\mathrm{exact}}\rangle$ be the true state of the quantum system after the $i$-th two-qubit gate in the circuit is applied (one-qubit gates do not introduce additional approximation errors), and let $|\psi_{\mathrm{trunc}}\rangle$ denote the truncated quantum state represented by the MPS. The fidelity between these two states evaluated at the $i$-th step of the computation is
$$F_i = |\langle\psi_{\mathrm{exact}}|\psi_{\mathrm{trunc}}\rangle|^2 = \sum_{k=1}^{\chi_s} \lambda_k^2,$$
where we represented the states in the Schmidt decomposition with respect to the bond where the $i$-th two-qubit gate was applied, $\lambda_k$ are the Schmidt coefficients of $|\psi_{\mathrm{exact}}\rangle$ in decreasing order, and $\chi_s$ is the bond dimension of the MPS state. The fidelity $F_t$ of the simulation after application of the $t$-th two-qubit gate is lower bounded by the product of the previous fidelities $F_i$, as [70]
$$F_t \geq \prod_{i=1}^{t} F_i, \tag{38}$$
where we note that the single-step fidelities $F_i$ are readily accessed during the MPS simulation, since one calculates the fidelity before the truncation of the singular values takes place. Equation (38) bounds the error introduced by truncation through a lower bound on the fidelity between the true state and the one evolved using an MPS simulation, so one can control the faithfulness of the simulation at any given time step of the circuit. In Figure 12 we show the infidelity 1 − F of the final state from the circuit for n = 30, 50 with a maximum bond dimension $\chi_s = 4096$. The plotted result is the average over M = 10 realizations of the quantum circuit with different sets of parameters. Defining results as reliable when the infidelity is at most $1 - F = 10^{-4}$, we observe that, for n = 50, we reliably describe circuits up to 11 layers, while for n = 30 we can reach L = 12 layers.
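The single-step truncation fidelity can be illustrated on a small bipartite state. The Python sketch below assumes that the truncated state is renormalised after discarding the smallest singular values, in which case the fidelity equals the sum of the retained squared Schmidt coefficients. The function name, the dimensions and the list of step fidelities are hypothetical and chosen only for illustration.

```python
import numpy as np

def truncation_fidelity(psi, d_left, d_right, chi):
    """Fidelity between a bipartite state and its renormalised rank-chi truncation.

    Returns the value from the retained Schmidt weights and the direct overlap.
    """
    u, s, vh = np.linalg.svd(psi.reshape(d_left, d_right), full_matrices=False)
    kept = s[:chi]  # singular values are returned in decreasing order
    fidelity_from_weights = float(np.sum(kept**2))
    truncated = (u[:, :chi] * kept) @ vh[:chi, :]
    truncated = truncated.reshape(-1) / np.linalg.norm(truncated)
    fidelity_direct = float(abs(np.vdot(psi, truncated)) ** 2)
    return fidelity_from_weights, fidelity_direct

# Random normalised state on an 8 x 8 bipartition, truncated to bond dimension 4.
rng = np.random.default_rng(2)
psi = rng.normal(size=64) + 1j * rng.normal(size=64)
psi /= np.linalg.norm(psi)
f_weights, f_direct = truncation_fidelity(psi, 8, 8, chi=4)
print("sum of kept weights:", f_weights, "direct overlap:", f_direct)

# Hypothetical single-step fidelities F_i, combined into the product bound.
step_fidelities = [0.99999, 0.99998, 0.99995]
lower_bound = float(np.prod(step_fidelities))
print("lower bound on F_t:", lower_bound, "infidelity bound:", 1 - lower_bound)
```

In an actual MPS simulation the retained weights are available at each two-qubit gate, so the running product can be accumulated at negligible extra cost.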