Expressibility of the alternating layered ansatz for quantum computation

The hybrid quantum-classical algorithm is actively examined as a technique applicable even to intermediate-scale quantum computers. To execute this algorithm, the hardware efficient ansatz is often used, thanks to its implementability and expressibility; however, this ansatz has a critical trainability issue, in that it generically suffers from the so-called vanishing gradient problem. This issue can be resolved by limiting the circuit to the class of shallow alternating layered ansatz. However, even though the high trainability of this ansatz has been proved, it is still unclear whether it has rich expressibility in state generation. In this paper, using a definition of expressibility found in the literature, we show that the shallow alternating layered ansatz has almost the same level of expressibility as the hardware efficient ansatz. Hence expressibility and trainability can coexist, giving a new design method for quantum circuits in the intermediate-scale quantum computing era.


Introduction
Recent rapid progress in quantum computing hardware has stimulated researchers to develop new techniques that utilize even noisy intermediate-scale quantum (NISQ) [1] devices for real applications such as machine learning and quantum chemistry. In particular, several hybrid quantum-classical algorithms have been actively examined as a means to reduce the computational cost required on the quantum computing part. The variational quantum eigensolver (VQE) [2], which trains a parameterized quantum circuit via classical computation to decrease a given cost function, is a typical example of such a hybrid algorithm.
The critical point in this strategy is in the difficulty to design a suitable and implementable circuit ansatz that, after the training process, may produce an exact or well-approximating solution of a given problem. The hardware efficient ansatz (HEA) [3] is often used mainly because of the implementability on a hardware; this is a relatively shallow circuit ansatz, whose parameters are embedded in the angles of single-qubit rotation gates. However, it was proven in [4] that, when those parameters are randomly chosen, the gradient of a standard cost function vanishes, meaning that the update of the cost function often gets stuck in the learning process before reaching the minimum.
Several approaches have been proposed to solve this vanishing gradient problem. Reference [5] showed numerically that, with a special initialization method for the HEA, the vanishing gradient problem does not occur in an ansatz with O(1) qubits. Also, a quantum analogue of the natural gradient has been proposed in [6,7]; the original classical version is often used to avoid similar vanishing gradient problems in neural networks. In this paper, we focus on a third approach, given in Ref. [8], which provides a method for devising a specific structure of the HEA, called the Alternating Layered Ansatz (ALT), that provably does not suffer from the vanishing gradient problem. By definition, the class of ALT is included in that of HEA; the difference is that, while an HEA consists of multiple layers of single-qubit rotation gates and entanglers that in principle combine all qubits in each layer, the entangling gates contained in an ALT are restricted to entangle only local qubits in each layer. With this setting, the authors of [8] derived a strict lower bound on the variance of the gradient for an ALT with randomly chosen parameters (more precisely, the ensemble of unitary matrices corresponding to each circuit block is a 2-design) under the condition that the cost function is local (that is, the cost function is composed of local functions of only a small number of local qubits). Using this lower bound, it was also shown that the vanishing gradient problem can be resolved if the number of layers is of order O(poly(log n)), where n is the total number of qubits, or, roughly speaking, if the circuit is shallow.
Then an important question arises; does ALT have a sufficient expressive power (expressibility) for generating a rich class of states, which contains the optimal or a well-approximating state? Because the set of ALTs is a subclass of that of HEA, one might argue that the expressibility of ALT could be much lower than that of the HEA; if this is the case, the ALT circuit may not generate a desired state even though the learning process is smoothly running. Thus, it is worth examining the expressibility of ALT in order to assess the practicality of this ansatz in executing the hybrid quantum-classical algorithm. In this paper, we study this problem using the expressibility measure introduced in [9] and show that, fortunately, the class of shallow ALTs has the same level of expressibility as that of HEAs. Therefore, the expressibility and the trainability of a quantum circuit can coexist, which means that the existing HEA found in the literature can be basically replaced with a simpler ALT without degradation of expressibility while acquiring a better trainability. That is, the ALT might be taken as a new standard ansatz in NISQ computing era.
The structure of this paper is as follows. Section 2 is the preliminary, giving the definition of expressibility and the ansatzes. In Sec. 3, we show both theoretically and numerically that the expressibility of ALT is as high as that of HEA. In Sec. 4, we show how much the value of expressibility is reflected to the result of VQE. Finally, we conclude with some remarks in Sec. 5.

Preliminaries
In this section, we define indicators of the expressibility and introduce some circuit ansatzes.

Indicators of the expressibility
Following Ref. [9], we define the expressibility of a given circuit, by the randomness of states generated from the circuit, in terms of the frame potential and the Kullback-Leibler (KL) divergence.

Frame Potential
To define the expressibility of a given circuit ansatz $C$, let us consider the deviation of the state distribution generated by $C$ from the Haar distribution, as follows:
$$A^{(t)}(C) = \left\| \int_{\mathrm{Haar}} \left(|\psi\rangle\langle\psi|\right)^{\otimes t} d\psi - \int_{\Theta} \left(|\psi_\theta\rangle\langle\psi_\theta|\right)^{\otimes t} d\theta \right\|_{\mathrm{HS}}, \tag{1}$$
where $\int_{\mathrm{Haar}}$ denotes the integration over the state $|\psi\rangle$ distributed with respect to the Haar measure, and $\|\cdot\|_{\mathrm{HS}}$ is the Hilbert-Schmidt distance. Also, $|\psi_\theta\rangle$ is the state generated by the ansatz $C$ characterized by the parameter $\theta \in \Theta$, e.g., $|\psi_\theta\rangle = U_C(\theta)|0\rangle$, $\theta \in \Theta$, where $U_C(\theta)$ is the unitary operator corresponding to $C$ and $|0\rangle$ is an initial state. We say that an ansatz $C$ with smaller $A^{(t)}(C)$ has a higher expressibility. This definition is justified as follows. Because the state $|\psi\rangle$ generated from the Haar distribution can in principle represent an arbitrary state, the condition $A^{(t)}(C) \approx 0$ implies that the ansatz $C$ can generate almost all states, possibly including the optimal solution (e.g., the ground state in VQE); also, in this case the states generated from $C$ are almost equally distributed, which is particularly favorable if little is known about the problem.
To compute $A^{(t)}(C)$, we instead focus on the following $t$-th generalized frame potential [10] of $C$:
$$F^{(t)}(C) = \int_{\Theta}\int_{\Phi} \left|\langle\psi_\theta|\psi_\phi\rangle\right|^{2t} d\theta\, d\phi, \tag{2}$$
where both $\Theta$ and $\Phi$ represent the same set of parameters of $C$. In the present paper, we simply call it the $t$-th frame potential. In particular, the $t$-th frame potential of $N$-dimensional states distributed with respect to the Haar measure is given by $F^{(t)}_{\mathrm{Haar}}(N) = \int_{\mathrm{Haar}}\int_{\mathrm{Haar}} |\langle\psi|\psi'\rangle|^{2t} d\psi\, d\psi'$. The point of introducing the frame potential is that these quantities are linked to $A^{(t)}(C)$ in the following form; that is, for an arbitrary positive integer $t$, it holds that
$$\left(A^{(t)}(C)\right)^2 = F^{(t)}(C) - F^{(t)}_{\mathrm{Haar}}(N) \geq 0. \tag{3}$$
The equality in the last inequality holds if and only if the ensemble of $|\psi_\theta\rangle$ is a state $t$-design [11,12,13]. Thus, an ansatz $C$ with smaller $F^{(t)}(C)$ has a higher expressibility. Also, $F^{(t)}(C)$ is lower bounded by $F^{(t)}_{\mathrm{Haar}}(N)$, meaning that the frame potential can be used as an indicator for quantifying the non-uniformity in the state distribution. In Sec. 3.1, we calculate $F^{(1)}(C)$ and $F^{(2)}(C)$ for several ansatzes $C$.
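The frame potential defined above can be checked numerically for the Haar case. The following is a minimal sketch, not the authors' code: it assumes the standard closed form $F^{(t)}_{\mathrm{Haar}}(N) = t!\,(N-1)!/(t+N-1)!$ and compares it with a Monte Carlo estimate obtained from normalized complex Gaussian vectors. The function names `haar_frame_potential`, `random_states`, and `estimate_frame_potential` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def haar_frame_potential(t: int, dim: int) -> float:
    """Closed form t! (N-1)! / (t+N-1)!, written as a product to avoid huge factorials."""
    value = 1.0
    for k in range(t):
        value *= (k + 1) / (dim + k)
    return value


def random_states(num: int, dim: int, rng: np.random.Generator) -> np.ndarray:
    """Draw `num` Haar-random pure states by normalizing complex Gaussian vectors."""
    z = rng.normal(size=(num, dim)) + 1j * rng.normal(size=(num, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def estimate_frame_potential(states_a: np.ndarray, states_b: np.ndarray, t: int) -> float:
    """Monte Carlo estimate of the t-th frame potential from two independent samples."""
    overlaps = states_a.conj() @ states_b.T  # entry (i, j) is <a_i|b_j>
    return float(np.mean(np.abs(overlaps) ** (2 * t)))
```

Replacing the Haar-random samples with states produced by a parameterized circuit gives an estimate of $F^{(t)}(C)$ for that circuit, which can then be compared with the closed-form Haar value.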

KL-Divergence
Note that the frame potential (2) is the $t$-th moment of the fidelity $F = |\langle\psi_\theta|\psi_\phi\rangle|^2$, where the circuit parameters $\theta \in \Theta$ and $\phi \in \Phi$ are randomly sampled from the circuit ansatz $C$. Hence the probability distribution of $F$, denoted by $P(C, F)$, contains more information for quantifying the randomness of $C$ than $F^{(t)}(C)$, and thereby the following measure can be used to quantify the expressibility of $C$:
$$E(C) = D_{\mathrm{KL}}\left(P(C, F)\,\|\,P_{\mathrm{Haar}}(F)\right),$$
where $D_{\mathrm{KL}}(q\,\|\,p)$ is the KL divergence between $q$ and $p$. Also, $P_{\mathrm{Haar}}(F)$ is the probability distribution of the fidelity $F = |\langle\psi|\psi'\rangle|^2$, where $|\psi\rangle$ and $|\psi'\rangle$ are sampled according to the Haar measure. In Ref. [14], it was shown that $P_{\mathrm{Haar}}(F) = (N-1)(1-F)^{N-2}$, where $N$ is the dimension of the Hilbert space. Because in general $D_{\mathrm{KL}}(q\,\|\,p) = 0$ iff $q = p$, an ansatz $C$ with a smaller value of $E(C)$ has a higher expressibility. Thus, $E(C)$ can also be used as an indicator for quantifying the non-uniformity of an ansatz. Lastly, recall that the $t$-th moments of $P(C, F)$ and $P_{\mathrm{Haar}}(F)$ are $F^{(t)}(C)$ and $F^{(t)}_{\mathrm{Haar}}(N)$, respectively.
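As an illustration of how $E(C)$ can be estimated from sampled fidelities, the sketch below bins the samples into a histogram and compares it with the exact bin probabilities of the Haar fidelity distribution. It assumes $P_{\mathrm{Haar}}(F) = (N-1)(1-F)^{N-2}$, whose cumulative distribution is $1-(1-F)^{N-1}$; the function names and the default bin count are hypothetical choices for this example, not taken from the paper.

```python
import numpy as np


def haar_bin_probabilities(edges: np.ndarray, dim: int) -> np.ndarray:
    """Probability mass of P_Haar(F) = (N-1)(1-F)^(N-2) in each histogram bin."""
    cdf = 1.0 - (1.0 - edges) ** (dim - 1)  # CDF of the Haar fidelity distribution
    return np.diff(cdf)


def kl_expressibility(fidelities: np.ndarray, dim: int, bins: int = 50) -> float:
    """Histogram estimate of D_KL(P(C, F) || P_Haar(F)) from sampled fidelities."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    counts, _ = np.histogram(fidelities, bins=edges)
    p = counts / counts.sum()
    q = haar_bin_probabilities(edges, dim)
    mask = p > 0  # empty bins contribute zero to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

With a finite number of samples the estimate is biased upward, roughly in proportion to the number of bins divided by the number of samples, so values from different ansatzes are best compared at the same sample size and bin count.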

Ansatzes
Here we describe the three types of ansatzes investigated in this paper.

Hardware Efficient Ansatz
The HEA circuit consists of multiple layers of parametrized single-qubit gates and entanglers which entangle all qubits. In the following, let $C^{\ell,n}_{\mathrm{HEA}}$ be the class of HEA with $\ell$ layers, where each layer contains $n$ qubits.

Alternating Layered Ansatz
The ALT introduced in [8] also consists of multiple layers, but the components of each layer differ from those of the HEA. Each layer has several separated blocks, where each block has parametrized single-qubit rotation gates and fixed entanglers that entangle all qubits inside the block. The probability distributions of those angle parameters are independent in all blocks and all layers. In this paper, we further restrict the class of ALT as follows. First, as in HEA, the entire circuit is composed of $\ell$ layers, where each layer contains $n$ qubits. Then, we assume that, in the odd-number-labeled layers, each block contains $m$ qubits, where $m$ is an even number and $n/m$ is an integer. In other words, the odd-number-labeled layers contain $n/m$ blocks which operate on the qubits $\{1, \ldots, m\}$, $\{m+1, \ldots, 2m\}$, ..., and $\{n-m+1, \ldots, n\}$. The even-number-labeled layers contain $n/m + 1$ blocks which operate on the qubits $\{1, \ldots, m/2\}$, $\{m/2+1, \ldots, 3m/2\}$, ..., and $\{n-m/2+1, \ldots, n\}$; that is, the first and the last blocks operate on $m/2$ qubits, while the others operate on $m$ qubits. In the following, we use $C^{\ell,m,n}_{\mathrm{ALT}}$ to denote the class of ALT with the above-defined indices.
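The block layout described above can be made concrete with a small helper. This is only a sketch of the indexing convention stated in the text (1-indexed qubits, even $m$, integer $n/m$); the function name `alt_layer_blocks` is hypothetical.

```python
def alt_layer_blocks(layer: int, n: int, m: int) -> list[list[int]]:
    """Return the qubit blocks (1-indexed) of the given ALT layer (layer = 1, 2, ...)."""
    if m % 2 != 0 or n % m != 0:
        raise ValueError("m must be even and n/m must be an integer")
    if layer % 2 == 1:
        # Odd layers: n/m blocks of m qubits each.
        return [list(range(s, s + m)) for s in range(1, n + 1, m)]
    half = m // 2
    # Even layers: the blocks are shifted by m/2, leaving half-size blocks at both ends.
    blocks = [list(range(1, half + 1))]
    blocks += [list(range(s, s + m)) for s in range(half + 1, n - half + 1, m)]
    blocks.append(list(range(n - half + 1, n + 1)))
    return blocks
```

For example, `alt_layer_blocks(2, 4, 2)` describes the second layer of $C^{3,2,4}_{\mathrm{ALT}}$: one single-qubit block at each end and one two-qubit block in the middle that straddles the two blocks of the odd layers.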
In [8], it is proved that the vanishing gradient problem can be avoided if the following conditions are satisfied: (i) each term of the cost Hamiltonian $H = \sum_k H_k$ is local, meaning that each term $H_k$ is composed of fewer than $m$ neighboring qubits, (ii) the ensemble of unitary matrices in each block is a 2-design, and (iii) the number of layers, $\ell$, is of order O(poly(log n)). Note that each block needs to be deep enough for the distribution of the corresponding unitary matrices to be close to a 2-design.

Tensor Product Ansatz
In addition to HEA and ALT, we here introduce the class of tensor product ansatz (TEN) as a relatively weak ansatz. This ansatz also consists of $\ell$ layers, where each layer contains $n$ qubits, and each layer contains $n/m$ blocks ($n/m$ is assumed to be an integer), which contain single-qubit rotation gates and entanglers combining all qubits in the block. Throughout all the layers, the blocks operate on the qubits $\{1, \ldots, m\}$, $\{m+1, \ldots, 2m\}$, ..., and $\{n-m+1, \ldots, n\}$. Thus, TEN always generates a product state of the form $|\psi_1\rangle \otimes \cdots \otimes |\psi_{n/m}\rangle$, where each state is composed of $m$ qubits. Let $C^{\ell,m,n}_{\mathrm{TEN}}$ be the class of TEN described above.
In Fig. 1, we show examples of the structures of the above defined ansatzes.

Expressibility of the circuit ansatzes
In this section, we give some analytical expressions as well as upper bounds of the first and the second frame potentials of the three ansatzes introduced in Section 2.2, showing that the shallow ALT has almost the same expressibility as that of HEA. This result will be further confirmed by a numerical simulation in terms of the KL-divergence.

Analytical expression of the frame potential of the ansatzes
First of all, to compute the frame potentials of each ansatz, we assume that the ensemble of the unitary matrices corresponding to $C^{\ell,n}_{\mathrm{HEA}}$ is a 2-design. Similarly, the ensembles of the unitary matrices corresponding to each block of $C^{\ell,m,n}_{\mathrm{ALT}}$ and $C^{\ell,m,n}_{\mathrm{TEN}}$ are assumed to be 2-designs. These assumptions are also adopted in the discussions of [4] and [8]. Such an ensemble of unitary matrices can be generated by randomly choosing the parameters of a circuit having a specific structure.
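Before going into the details, we show some integration formulas for random unitary matrices [15]. First, if the ensemble of $n \times n$ unitary matrices $\{U\}$ is a 1-design, the following formula holds:
$$\int_{\mathrm{1design}} dU\, U_{ij} U^{*}_{mk} = \frac{\delta_{im}\delta_{jk}}{n}, \tag{4}$$
where $\int_{\mathrm{1design}} dU$ is the integral over the 1-design ensemble of the unitary matrices. Second, if the ensemble of $n \times n$ unitary matrices $\{U\}$ is a 2-design, the following formula holds:
$$\int_{\mathrm{2design}} dU\, U_{i_1 j_1} U_{i_2 j_2} U^{*}_{i'_1 j'_1} U^{*}_{i'_2 j'_2} = \frac{\delta_{i_1 i'_1}\delta_{i_2 i'_2}\delta_{j_1 j'_1}\delta_{j_2 j'_2} + \delta_{i_1 i'_2}\delta_{i_2 i'_1}\delta_{j_1 j'_2}\delta_{j_2 j'_1}}{n^2 - 1} - \frac{\delta_{i_1 i'_1}\delta_{i_2 i'_2}\delta_{j_1 j'_2}\delta_{j_2 j'_1} + \delta_{i_1 i'_2}\delta_{i_2 i'_1}\delta_{j_1 j'_1}\delta_{j_2 j'_2}}{n(n^2 - 1)},$$
where $\delta$ is the Kronecker delta and $\int_{\mathrm{2design}} dU$ is the integral over the 2-design ensemble of the unitary matrices. These formulas are used to derive the theorems shown below.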

The First Frame Potential
For the first frame potentials, the following theorem holds.
Theorem 1. If the ensemble of the unitary matrices corresponding to $C^{\ell,n}_{\mathrm{HEA}}$ and the ensembles of the unitary matrices corresponding to each block of $C^{\ell,m,n}_{\mathrm{ALT}}$ and $C^{\ell,m,n}_{\mathrm{TEN}}$ are 2-designs, then the following equalities hold:
$$F^{(1)}(C^{\ell,n}_{\mathrm{HEA}}) = F^{(1)}(C^{\ell,m,n}_{\mathrm{ALT}}) = F^{(1)}(C^{\ell,m,n}_{\mathrm{TEN}}) = F^{(1)}_{\mathrm{Haar}}(2^n). \tag{7}$$
The equality $F^{(1)}(C^{\ell,n}_{\mathrm{HEA}}) = F^{(1)}_{\mathrm{Haar}}(2^n)$ can be readily proved from the assumption of the theorem, because, if the ensemble of the unitary matrices corresponding to the circuit is a 2-design, the ensemble of the states generated by the circuit is a state 2-design (and therefore a state 1-design). For the other equalities, the proof is given in the Appendix. Note that, accordingly, the ensembles of the states generated by $C^{\ell,n}_{\mathrm{HEA}}$, $C^{\ell,m,n}_{\mathrm{ALT}}$, and $C^{\ell,m,n}_{\mathrm{TEN}}$ are all 1-designs.

The Second Frame Potential
The second frame potential of the Haar random circuits, $F^{(2)}_{\mathrm{Haar}}(2^n)$, can be computed as
$$F^{(2)}_{\mathrm{Haar}}(2^n) = \frac{1}{(2^n + 1)\,2^{n-1}}. \tag{8}$$
Then, for $C^{\ell,n}_{\mathrm{HEA}}$ and $C^{\ell,m,n}_{\mathrm{TEN}}$, the following theorem holds.
Theorem 2. If the ensemble of the unitary matrices corresponding to $C^{\ell,n}_{\mathrm{HEA}}$ and the ensemble of the unitary matrices corresponding to each block of $C^{\ell,m,n}_{\mathrm{TEN}}$ are 2-designs, then the following equalities hold:
$$F^{(2)}(C^{\ell,n}_{\mathrm{HEA}}) = F^{(2)}_{\mathrm{Haar}}(2^n), \tag{9}$$
$$F^{(2)}(C^{\ell,m,n}_{\mathrm{TEN}}) = \left(F^{(2)}_{\mathrm{Haar}}(2^m)\right)^{n/m} = \frac{1}{\left((2^m + 1)\,2^{m-1}\right)^{n/m}}. \tag{10}$$
The equality (9) is readily proved from the assumption of the theorem, because, as mentioned above, if the ensemble of the unitary matrices corresponding to the circuit is a 2-design, the ensemble of the states generated by the circuit is a state 2-design. For Eq. (10), we give the proof in the Appendix. From this theorem we find that $F^{(2)}(C^{\ell,m,n}_{\mathrm{TEN}})$ is always larger than $F^{(2)}_{\mathrm{Haar}}(2^n)$ for large $n$, meaning that the expressibility of TEN is much lower than that of HEA in the sense of the frame potential.
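To get a feeling for how far TEN is from the Haar value, the ratio $F^{(2)}(C^{\ell,m,n}_{\mathrm{TEN}})/F^{(2)}_{\mathrm{Haar}}(2^n)$ can be evaluated directly. The sketch below rests on two assumptions that should be checked against the theorem statement: that $F^{(2)}_{\mathrm{Haar}}(N) = 2/(N(N+1))$, and that, because TEN produces a product of $n/m$ independent $m$-qubit state 2-designs, its second frame potential factorizes as $\left(F^{(2)}_{\mathrm{Haar}}(2^m)\right)^{n/m}$. All function names are hypothetical.

```python
def haar_second_frame_potential(dim: int) -> float:
    """Second frame potential of Haar-random states in `dim` dimensions: 2 / (N (N + 1))."""
    return 2.0 / (dim * (dim + 1))


def ten_second_frame_potential(m: int, n: int) -> float:
    """Assumed product form for TEN: one Haar factor per m-qubit block."""
    if n % m != 0:
        raise ValueError("n/m must be an integer")
    return haar_second_frame_potential(2 ** m) ** (n // m)


def ten_to_haar_ratio(m: int, n: int) -> float:
    """Ratio F2(TEN) / F2_Haar(2^n); a value of 1 means the ensemble is a state 2-design."""
    return ten_second_frame_potential(m, n) / haar_second_frame_potential(2 ** n)
```

Under these assumptions the ratio equals 1 only when a single block covers all qubits ($m = n$) and grows quickly with the number of blocks $n/m$, which is the sense in which TEN serves as a weak baseline.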
As for ALT, it is difficult to obtain an explicit formula as in the case of HEA and TEN. Hence in Theorem 3 below, we provide a formula for computing the values of $F^{(2)}(C^{2,m,n}_{\mathrm{ALT}})$ and $F^{(2)}(C^{3,m,n}_{\mathrm{ALT}})$; the computation methods for other values of $\ell$ are left for future work. Before stating the theorem, we define a 16-dimensional vector $a(2, m)$, a $16 \times 16$ matrix $B(2, m)$, a 64-dimensional vector $a(3, m)$, and a $64 \times 64$ matrix $B(3, m)$. Given integers $k_a, k_b \in \{1, 2, 3, 4\}$, the $(4(k_a - 1) + k_b)$-th component of the vector $a(2, m)$ is defined by Eq. (11), where $\Delta^{(k_a,k_b)}(P, Q)$ is the function of $m/2 \times m/2$ unitary matrices $P$ and $Q$ given in Eq. (12). Next, given integers $k_a, k_b, k_c$, the components of $a(3, m)$ are defined by Eq. (13), where $\Delta^{(k_a,k_b,k_c)}(P, Q)$ is the function of $m/2 \times m/2$ unitary matrices $P$ and $Q$ given in Eq. (14). Also, given integers $k_a, k_b, k_c, k_d$, the components of $B(2, m)$ are defined by Eq. (15), where $\Delta^{(k_a,k_b,k_c,k_d)}(P, Q)$ is the function of $m \times m$ matrices $P$ and $Q$ given in Eq. (16). For the matrix component $M^{s,t}_{i,j}$, the upper indices correspond to the first $m/2$ qubits and the lower indices correspond to the last $m/2$. Finally, given integers $k_a, k_b, k_c, k_d, k_e, k_f$, the components of $B(3, m)$ are defined by Eq. (17), where $\Delta^{(k_a,k_b,k_c,k_d,k_e,k_f)}(P, Q)$ is the function of $m \times m$ matrices $P$ and $Q$ given in Eq. (18). Now we can give the theorem as follows.
The vectors $a(\ell, m)$ and the matrices $B(\ell, m)$ for $\ell = 2, 3$ are obtained by directly computing Eqs. (11), (13), (15), and (17), which then lead to $F^{(2)}(C^{2,m,n}_{\mathrm{ALT}})$ and $F^{(2)}(C^{3,m,n}_{\mathrm{ALT}})$. Now our interest is in the gap of these quantities from $F^{(2)}_{\mathrm{Haar}}(2^n)$. Fig. 2 shows $F^{(2)}(C^{2,m,n}_{\mathrm{ALT}})/F^{(2)}_{\mathrm{Haar}}(2^n)$ and $F^{(2)}(C^{3,m,n}_{\mathrm{ALT}})/F^{(2)}_{\mathrm{Haar}}(2^n)$ as functions of $n/m$, for several values of $(m, n)$. For comparison, $F^{(2)}(C^{2,m,n}_{\mathrm{TEN}})/F^{(2)}_{\mathrm{Haar}}(2^n)$ and $F^{(2)}(C^{3,m,n}_{\mathrm{TEN}})/F^{(2)}_{\mathrm{Haar}}(2^n)$ are also shown in the figure. Recall that a smaller value of this measure means that the corresponding ansatz has a higher expressibility. The notable points are as follows:
• For any pair $(n, m)$, $F^{(2)}(C^{\ell,m,n}_{\mathrm{ALT}})$ is much smaller than $F^{(2)}(C^{\ell,m,n}_{\mathrm{TEN}})$ for both $\ell = 2, 3$. This means that, as expected, ALT has a much higher expressibility than TEN.
• For any pair $(n, m)$, $F^{(2)}(C^{2,m,n}_{\mathrm{ALT}}) > F^{(2)}(C^{3,m,n}_{\mathrm{ALT}})$ holds, i.e., the expressibility increases as $\ell$ increases.
• For any fixed $n/m$, the ALT with larger $m$ always has a higher expressibility. For instance, the ALT with $(n, m) = (50, 10)$ has a higher expressibility than the ALT with $(n, m) = (20, 4)$. This is simply because, if the structure of the circuit (the number of divisions of each layer into blocks) is the same, then an ALT with larger blocks has a higher expressibility.
• For a fixed $n$, the second frame potential of ALT becomes smaller as $m$ becomes larger. For instance, for $n = 100$, we have $F^{(2)}(C^{2,2,100}_{\mathrm{ALT}}) > F^{(2)}(C^{2,4,100}_{\mathrm{ALT}}) > F^{(2)}(C^{2,10,100}_{\mathrm{ALT}})$. That is, for a limited number of available qubits, the ALT with fewer blocks has a higher expressibility.
• $F^{(2)}(C^{\ell,m,n}_{\mathrm{ALT}})$ is close to $F^{(2)}_{\mathrm{Haar}}(2^n)$ when $m = 10$, for all $n/m$ within the figure and for both $\ell = 2, 3$. Hence the ALT composed of blocks with $m = 10$ qubits in each layer has almost the same expressibility as the HEA, irrespective of the total number of qubits $n$. In other words, for a given HEA with fixed $n$, we can divide each layer into separated 10-qubit blocks to make an ALT, without decreasing the expressibility.
The last point is of particular importance in our scenario. That is, we are concerned with the condition on the number $m$ such that $F^{(2)}(C^{\ell,m,n}_{\mathrm{ALT}}) \simeq F^{(2)}_{\mathrm{Haar}}(2^n)$ holds. The following Theorem 4 and the subsequent Corollary 1, which can be readily derived from the theorem, provide a means for evaluating such $m$.
Recall from Eq. (3) that $F^{(t)}(C) \geq F^{(t)}_{\mathrm{Haar}}(N)$ holds for any ansatz $C$. Therefore, if $m \geq 4 \log n$ and $n$ is large enough, Corollary 1 implies that $F^{(2)}(C^{2,m,n}_{\mathrm{ALT}}) \sim F^{(2)}_{\mathrm{Haar}}(2^n)$ and $F^{(2)}(C^{3,m,n}_{\mathrm{ALT}}) \sim F^{(2)}_{\mathrm{Haar}}(2^n)$. This means that the ensembles of the states generated by $C^{2,m,n}_{\mathrm{ALT}}$ and $C^{3,m,n}_{\mathrm{ALT}}$ are almost 2-designs. Hence in this case, from Theorem 2, the expressibility of ALT is as high as that of HEA. It is worth mentioning that, when $m = O(\log_2 n)$, the vanishing gradient problem does not occur in ALT as long as the cost function is local and $\ell$ is small [8]. More precisely, it was shown there that the variance of the gradient of such a cost function is larger than a value proportional to $O(1/2^m)$; thus, by taking $m = O(\log_2 n)$, the variance decreases only as $O(1/\mathrm{poly}(n))$ as a function of $n$, whereas in the HEA case the same variance decreases exponentially fast as $n$ becomes large. Therefore, expressibility and trainability coexist in the shallow ALT with $m = O(\log_2 n)$.

Expressibility measured by KL divergence
In Subsection 3.1, we have shown that the first two moments of $P(C^{\ell,m,n}_{\mathrm{ALT}}, F)$ and $P(C^{\ell,n}_{\mathrm{HEA}}, F)$ are close to those of $P_{\mathrm{Haar}}(F)$, as long as $m = O(\log_2 n)$ and the block components of the ansatzes are sufficiently random. (Recall that, if every block is completely random, then the set of HEA constitutes the Haar ensemble.) This result suggests that both $P(C^{\ell,m,n}_{\mathrm{ALT}}, F)$ and $P(C^{\ell,n}_{\mathrm{HEA}}, F)$ are close to $P_{\mathrm{Haar}}(F)$ itself, meaning that $P(C^{\ell,m,n}_{\mathrm{ALT}}, F) \simeq P(C^{\ell,n}_{\mathrm{HEA}}, F) \simeq P_{\mathrm{Haar}}(F)$. In this subsection, to support this conjecture, we evaluate the values of the KL-divergence $E(C) = D_{\mathrm{KL}}(P(C, F)\,\|\,P_{\mathrm{Haar}}(F))$ for the cases $C = C^{\ell,m,n}_{\mathrm{ALT}}$ and $C = C^{\ell,n}_{\mathrm{HEA}}$, in addition to $C = C^{\ell,m,n}_{\mathrm{TEN}}$ for comparison, with various sets of $(\ell, m, n)$. In particular, we focus on the relationship between the values of $F^{(2)}(C)$ and $E(C)$, and check whether $F^{(2)}(C) \simeq F^{(2)}(C')$ leads to $E(C) \simeq E(C')$ for a fixed $n$.
The parameters taken for calculating the KL-divergence are summarized in Table 1. Note that the circuits are chosen to be similar to those used in Section 3.1; for TEN and ALT, the depth of the circuit inside each block is set to $m$ so that the ensemble of the unitary matrices corresponding to those circuits becomes close to a 2-design [16,17]; for HEA, $\ell$ is set to $n$ so that the ensemble of the unitary matrices corresponding to the whole circuit becomes close to a 2-design. It is expected that $F^{(2)}(C^{3,2,4}_{\mathrm{ALT}}) \approx F^{(2)}(C^{4,4}_{\mathrm{HEA}})$ in these parameter sets. As an example of the circuits, the whole structures of $C^{3,2,4}_{\mathrm{TEN}}$, $C^{3,2,4}_{\mathrm{ALT}}$, and $C^{4,4}_{\mathrm{HEA}}$ in our settings are shown in Figs. 3a, 3b, and 3c, respectively. As illustrated in the figures, each layer is composed of parametrized single-qubit gates and fixed 2-qubit CNOT gates.
In each trial of computing the KL-divergence, we generate 200 states. When generating a state in each trial, we randomly choose the parameters and the types of the single-qubit gates of the circuit. That is, for the $i$-th single-qubit gate $R_i(\theta_i) = \exp(-i\sigma_{a_i}\theta_i)$ with $a_i \in \{x, y, z\}$ and $\theta_i \in [0, 2\pi]$, all $a_i$ and $\theta_i$ are randomly chosen in each trial. Then 200 fidelity values are computed, which are then used to construct a histogram with 1000 bins to approximate the probability distribution $P(C, F)$. Note that increasing the number of generated states and the number of bins does not affect the following conclusions.
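The sampling procedure above can be imitated in a few lines of NumPy. The sketch below is not the gate-level simulation used in the paper: as a simplifying assumption, every block is replaced by an exactly Haar-random unitary (the idealized limit of a block whose ensemble is a 2-design), and qubits are 0-indexed. The names `haar_unitary`, `apply_block`, `sample_state`, `sample_fidelities`, `ALT_LAYERS`, and `TEN_LAYERS` are hypothetical.

```python
import numpy as np


def haar_unitary(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Haar-random unitary from the QR decomposition of a complex Gaussian matrix."""
    z = (rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    phases = np.diag(r) / np.abs(np.diag(r))
    return q * phases  # rescale column j by the phase of r[j, j]


def apply_block(state, unitary, first, width, n):
    """Apply `unitary` to `width` contiguous qubits starting at 0-indexed qubit `first`."""
    psi = state.reshape(2 ** first, 2 ** width, 2 ** (n - first - width))
    return np.einsum("ab,xby->xay", unitary, psi).reshape(-1)


def sample_state(layers, n, rng):
    """Run one random circuit on |0...0>; `layers` is a list of lists of (first, width)."""
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    for blocks in layers:
        for first, width in blocks:
            state = apply_block(state, haar_unitary(2 ** width, rng), first, width, n)
    return state


def sample_fidelities(layers, n, num_pairs, rng):
    """Fidelities |<psi|psi'>|^2 between independently sampled pairs of states."""
    return np.array([
        abs(np.vdot(sample_state(layers, n, rng), sample_state(layers, n, rng))) ** 2
        for _ in range(num_pairs)
    ])


# Block layouts for n = 4, m = 2, three layers (0-indexed first qubit, block width).
ALT_LAYERS = [[(0, 2), (2, 2)], [(0, 1), (1, 2), (3, 1)], [(0, 2), (2, 2)]]
TEN_LAYERS = [[(0, 2), (2, 2)]] * 3
```

The sampled fidelities can be fed into a histogram-based KL estimate, or their second moment can be divided by $F^{(2)}_{\mathrm{Haar}}(2^n)$ to compare ALT and TEN at small sizes; a gate-level circuit with random rotation axes and angles would replace `haar_unitary` in a more faithful reproduction.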
In this setting, Fig. 4 shows the KL divergences $E(C^{\ell,m,n}_{\mathrm{ALT}})$, $E(C^{\ell,n}_{\mathrm{HEA}})$, and $E(C^{\ell,m,n}_{\mathrm{TEN}})$. As a reference, we also show the values of $F^{(2)}(C)/F^{(2)}_{\mathrm{Haar}}(2^n)$ computed from the second moment of the fidelity distributions. Each data point and its error bar are the average and the standard deviation over 10 trials of computation, respectively. The main observations are as follows:
• For a fixed $n$, $E(C^{\ell,m,n}_{\mathrm{TEN}})$ is always larger than $E(C^{\ell,m,n}_{\mathrm{ALT}})$ and $E(C^{\ell,n}_{\mathrm{HEA}})$.
• As the number of layers $\ell$ increases, the KL-divergence decreases for fixed $(m, n)$.
• For a fixed $n$, the trend of the values of $F^{(2)}(C)/F^{(2)}_{\mathrm{Haar}}(2^n)$ is strongly correlated with that of the KL-divergence.
• $E(C^{3,m,n}_{\mathrm{ALT}})$ is as small as $E(C^{n,n}_{\mathrm{HEA}})$ in the parameter sets where $F^{(2)}(C^{3,m,n}_{\mathrm{ALT}}) \approx F^{(2)}(C^{n,n}_{\mathrm{HEA}})$ is realized, i.e., $(m, n) = (2, 4)$ and $(4, 8)$. This result implies that the state distribution in ALT is also close to that in HEA, in the setting where the second frame potential is close to $F^{(2)}_{\mathrm{Haar}}(2^n)$.
From these observations, we find a strong correlation between $F^{(2)}(C)/F^{(2)}_{\mathrm{Haar}}(2^n)$ and $E(C)$; that is, as $F^{(2)}(C)/F^{(2)}_{\mathrm{Haar}}(2^n)$ approaches 1, $E(C)$ approaches 0. Therefore, combining this with the result obtained in Section 3.1, we obtain clear evidence that, as long as $m = O(\log_2 n)$, $E(C^{2,m,n}_{\mathrm{ALT}}) \approx 0$ and $E(C^{3,m,n}_{\mathrm{ALT}}) \approx 0$ hold. That is, the coexistence of high expressibility and trainability in ALT established in Section 3.1 is supported also in terms of the KL-divergence.
Figure 4: $F^{(2)}(C)/F^{(2)}_{\mathrm{Haar}}(2^n)$ (top) and KL-divergence (bottom) for each ansatz. The sets of points for which $F^{(2)}(C^{3,m,n}_{\mathrm{ALT}}) \approx F^{(2)}(C^{3,n}_{\mathrm{HEA}})$ holds are enclosed by the red rectangles.

Application to VQE
Recall that ALT was originally introduced with the motivation to resolve the vanishing gradient problem in VQE, which has often been observed when using HEA; our concern was then the expressibility of ALT in VQE, namely that the ALT might not offer a chance to reach the optimal solution due to a possible loss of expressibility. We now know that this concern is resolved under some conditions, as concluded in the previous section; that is, expressibility and trainability coexist in the shallow ALT with $m = O(\log_2 n)$. In this section, let us see that this desirable fact indeed holds in a VQE problem. We choose the Hamiltonian of the 4-qubit Heisenberg model on a 1-dimensional lattice with periodic boundary conditions:
$$H = \sum_{i=1}^{4} \left(\sigma^i_x \sigma^{i+1}_x + \sigma^i_y \sigma^{i+1}_y + \sigma^i_z \sigma^{i+1}_z\right),$$
where $\sigma^i_a$ ($a \in \{x, y, z\}$) is the Pauli matrix that operates on the $i$-th qubit and $\sigma^5_a = \sigma^1_a$. The goal of the VQE problem is to find the minimum eigenvalue of $H$ by calculating the mean energy $\langle H \rangle = \langle\psi_\theta|H|\psi_\theta\rangle = \langle 0|U_C(\theta)^\dagger H U_C(\theta)|0\rangle$ via a quantum computer and updating the parameter $\theta \in \Theta$ to decrease $\langle H \rangle$ via a classical computer, in each iteration. As ansatzes, $C^{3,2,4}_{\mathrm{TEN}}$, $C^{3,2,4}_{\mathrm{ALT}}$, and $C^{4,4}_{\mathrm{HEA}}$ are chosen. As indicated in Fig. 4, the values of the KL divergence corresponding to these ansatzes satisfy $E(C^{3,2,4}_{\mathrm{TEN}}) > E(C^{3,2,4}_{\mathrm{ALT}}) \simeq E(C^{4,4}_{\mathrm{HEA}})$. That is, this ALT has expressibility as high as that of HEA, and further, it is expected to enjoy trainability, unlike the HEA.
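The target of the VQE run, the ground energy of the Heisenberg ring, can be obtained classically for four qubits by exact diagonalization. The sketch below assumes unit coupling and the form $H = \sum_i (\sigma^i_x\sigma^{i+1}_x + \sigma^i_y\sigma^{i+1}_y + \sigma^i_z\sigma^{i+1}_z)$ with periodic boundary conditions; the normalization used in the paper may differ, and the helper names are hypothetical.

```python
import numpy as np

PAULIS = {
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def two_site_term(pauli: np.ndarray, i: int, j: int, n: int) -> np.ndarray:
    """Tensor product with `pauli` on 0-indexed sites i and j and identity elsewhere."""
    ops = [np.eye(2, dtype=complex) for _ in range(n)]
    ops[i] = pauli
    ops[j] = pauli
    out = np.array([[1.0 + 0.0j]])
    for op in ops:
        out = np.kron(out, op)
    return out


def heisenberg_ring(n: int = 4) -> np.ndarray:
    """Heisenberg Hamiltonian on an n-site ring (assumed unit coupling, n >= 3)."""
    dim = 2 ** n
    hamiltonian = np.zeros((dim, dim), dtype=complex)
    for i in range(n):
        j = (i + 1) % n  # periodic boundary: site n wraps around to site 0
        for pauli in PAULIS.values():
            hamiltonian += two_site_term(pauli, i, j, n)
    return hamiltonian


def ground_energy(hamiltonian: np.ndarray) -> float:
    """Smallest eigenvalue of a Hermitian matrix."""
    return float(np.linalg.eigvalsh(hamiltonian)[0])
```

Under this assumed normalization, `ground_energy(heisenberg_ring(4))` gives the reference value against which the mean energies $\langle H \rangle$ reached by each ansatz can be compared.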
The simulation results are shown in Fig. 5. The blue lines and the associated error bars represent the average and the standard deviation of the mean energies over 100 trials in total, respectively; in each trial, the initial parameters of the ansatz are randomly chosen, and the optimization to decrease $\langle H \rangle$ in each iteration is performed by using the Adam optimizer with learning rate 0.001 [18]. The green line shows the theoretical minimum energy (i.e., the ground energy) of $H$. In each subfigure, the orange line is constructed by combining the trajectories having the least mean energy at each iteration; in other words, it is the minimum-energy envelope of all 100 trajectories.
The ansatz $C^{3,2,4}_{\mathrm{TEN}}$, which has the lowest expressibility in the sense of both the frame potentials and the KL divergence analysis, clearly gives the worst result; its least mean energy is far above the ground energy. This is simply because the state generated via $C^{3,2,4}_{\mathrm{TEN}}$ cannot represent the ground state for any parameter choice. The result for $C^{4,4}_{\mathrm{HEA}}$ is the second worst; it also does not reach the ground energy, as in the case of TEN. Note that increasing the number of parameters does not change this result; we also executed the simulation with $C^{4,6}_{\mathrm{HEA}}$, which has the same number of parameters as $C^{3,2,4}_{\mathrm{ALT}}$, but did not find a better result than that of $C^{4,4}_{\mathrm{HEA}}$. On the other hand, $C^{3,2,4}_{\mathrm{ALT}}$ succeeds in finding the ground state; in fact, 5 of the total 100 trajectories generated via this ALT reach the ground energy. Even though HEA can represent the ground state by suitably choosing the parameters, it failed to find the optimal parameters because it was trapped in a plateau of the mean-energy landscape, i.e., due to the vanishing gradient problem. In contrast, $C^{3,2,4}_{\mathrm{ALT}}$ can circumvent the plateau, as predicted in [8], and at the same time it can represent the ground state. Therefore, we obtain clear evidence that expressibility and trainability coexist in ALT.

Conclusion
This paper has examined the expressibility of the shallow ALT, which was previously proposed as a solution to the vanishing gradient problem found in various types of hybrid quantum-classical algorithms. Our conclusion is that, in addition to such good trainability, shallow ALTs also have high expressibility; that is, as measured by the frame potential and the KL-divergence, shallow ALTs have almost the same expressibility as HEAs, which are often used in hybrid algorithms but suffer from the vanishing gradient problem. In particular, we have proven that such expressibility holds if the number of entangled qubits in each block is of the order of the logarithm of the total number of qubits, which is consistent with the previous result on the trainability of ALT. We also confirmed that the ALT indeed enjoys both expressibility and trainability in VQE.
Even though our results are limited to the cases $\ell = 2, 3$, we have numerically observed that the ALT acquires even higher expressibility as $\ell$ becomes larger. Therefore, we conjecture that the above conclusion still holds for ALT with $\ell \geq 4$. The rigorous proof is left for future work.

A Proof of Theorems
Notation: For the unitary matrix Ua corresponding to the entire circuit, we denote the unitary matrix corresponding to the i-th layer of the circuit to be Ua (i) and the unitary matrix corresponding to the j-th block in the i-th layer as Ua (i, j).

A.1 Proof of Theorem 1
First, the value of F Next, we provide the proof of F (1) (C ,m,n ALT ) = F Haar (2 n ). Given two final states |φ = Ua|0 and |ψ = U b |0 generated by C ,m,n ALT , we have where k(i) is the number of blocks in the i-th layer and each dU is the average over the ensemble of the unitary matrix U . Because the distribution of each Ua (i, j) is 2-design (and is therefore 1-design), we can apply the formula (4) to the integrals with respect to Ua (i, j). Actually, by integrating k( ) j=1 dUa ( , j) for all α in the last line of (27), we have The other equality in Eq. (7) can be proved in the same manner.

A.3 Proof of Theorem 3
Here we only show the computation of F (2) (C 3,m,n ALT ). The computation of F (2) (C 2,m,n ALT ) can be done in a similar manner. Given two final states |φ = Ua|0 and |ψ = U b |0 , Executing integrals 2design n/m j=1 dUa ( , j) and 2design n/m j =1 dU b ( , j ) for = 1, 3, With α = n/m, let g k α (X, D) be as the set of matrices expressed by α i=1 Ri where Ri = D or X and the number of Xs in {Ri} is k. For example, XDXX ∈ g 3 4 (X, D) and XDDD ∈ g 1 4 (X, D). Then, F (2) (C 3,m,n ALT ) is expanded as where g k αi (i = 1, 2 . . . αCk ) is an element of g k α (X, D). For an arbitrary g ∈ g k α (X, D) with k ≥ 1, Haar (2 n ).