Quantum Vision Transformers

In this work, quantum transformers are designed and analysed in detail by extending the state-of-the-art classical transformer neural network architectures known to be very performant in natural language processing and image analysis. Building upon previous work that uses parametrised quantum circuits for data loading and orthogonal neural layers, we introduce three types of quantum transformers for training and inference, including a quantum transformer based on compound matrices, which guarantees a theoretical advantage of the quantum attention mechanism over its classical counterpart both in terms of asymptotic run time and the number of model parameters. These quantum architectures can be built using shallow quantum circuits and produce qualitatively different classification models. The three proposed quantum attention layers vary on the spectrum between closely following the classical transformers and exhibiting more quantum characteristics. As building blocks of the quantum transformer, we propose a novel method for loading a matrix as quantum states as well as two new trainable quantum orthogonal layers adaptable to different levels of connectivity and quality of quantum computers. We performed extensive simulations of the quantum transformers on standard medical image datasets that showed competitive, and at times better, performance compared to classical benchmarks, including the best-in-class classical vision transformers. The quantum transformers we trained on these small-scale datasets require fewer parameters than standard classical benchmarks. Finally, we implemented our quantum transformers on superconducting quantum computers and obtained encouraging results for experiments with up to six qubits.


Introduction
Quantum machine learning [1] uses quantum computation to provide novel and powerful tools that enhance the performance of classical machine learning algorithms. Some approaches use parametrised quantum circuits to compute quantum neural networks and explore a higher-dimensional optimisation space [2,3,4], while others exploit interesting properties native to quantum circuits, such as orthogonality or unitarity [5,6].
In this work, we focus on transformers, a neural network architecture proposed by [7] which has been applied successfully to both natural language processing [8] and visual tasks [9], providing state-of-the-art performance across different tasks and datasets [10]. While the transformer architecture and attention mechanism were notably popularised by [7], antecedents of these mechanisms can be found in earlier works. Specifically, [11] explored such concepts in the realm of neural machine translation, and earlier recurrent neural network approaches hinted at the underpinnings of attention-like mechanisms [12,13].

Components of the Vision Transformer (1/2): Fig. 1 shows the global architecture of a vision transformer. First, the image is preprocessed using patch division (Fig. 2), and then several transformer layers are applied (see details in Fig. 3 and Fig. 4). The final step consists of a simple fully connected neural network for classification.

At a high level, transformers are neural networks that use an attention mechanism that takes into account the global context while processing the entire input data element-wise. For visual recognition or text understanding, the context of each element is vital, and the transformer can capture more global correlations between parts of the sentence or the image than convolutional neural networks without an attention mechanism [9]. In the case of visual analysis, for example, images are divided into smaller patches, and instead of simply performing patch-wise operations with fixed-size kernels, a transformer learns attention coefficients per patch that weigh the attention paid by each patch to the rest of the image.
In one related work, classical transformer architectures and attention mechanisms have been used to perform quantum tomography [14]. Moreover, a quantum-enhanced transformer for sentiment analysis has been proposed in [15], and a self-attention mechanism for text classification has been used in [16]. These works use standard variational quantum circuits to compute the neural networks, and the attention coefficients are calculated classically. A method for using a natively quantum attention mechanism for reinforcement learning has also been proposed in [17], and [18] performed semiconductor defect detection using quantum self-attention, also with standard variational quantum circuits. We also note the proposals of [2,19] for variational circuits with similarities to convolutional neural networks for general-purpose image classification.
The difference between the above-mentioned approaches and the approach proposed in this work mainly stems from the linear algebraic tools we developed, which make our quantum circuits much more Noisy Intermediate-Scale Quantum (NISQ)-friendly, with proven scalability in terms of run time and model parameters, in contrast to the variational quantum circuit approaches taken in [20,4], which lack proof of scalability [21]. This advantage in scalability of our proposed parametrised quantum circuits is made possible by the use of a specific amplitude encoding for translating vectors into quantum states, and by the consistent use of Hamming-weight-preserving quantum gates instead of a general quantum ansatz. In addition to a quantum translation of the classical vision transformer, a novel and natively quantum method is proposed in this work, namely the compound transformer, which invokes Clifford algebra operations that are hard to compute classically.
While we adapted the vision transformer architecture to ease the translation of the attention layer into quantum circuits and benchmarked our methods on vision tasks, the proposed approaches to the quantum attention mechanism can easily be adapted to other fields of application, for example natural language processing, where transformers have proven to be particularly efficient [8].

Components of the Vision Transformer (2/2): the components of a single transformer layer are outlined in Fig. 3. At its core, the attention mechanism learns how to weigh different parts of the input (Fig. 4), where the trainable matrices are denoted by V and W. This attention mechanism is the focus of our quantum circuits.
The main ingredient in a transformer as introduced by [9] is the attention layer, shown in Fig. 4. This attention layer is also the focus of this work, which seeks to leverage quantum circuits for computational advantages. Given an input image $X \in \mathbb{R}^{n \times d}$, we transform the input data into $n$ patches, each of dimension $d$, and denote patch $i$ by $x_i \in \mathbb{R}^d$. The trainable weight matrix of the linear fully connected layer at the beginning of each attention layer is denoted by $V$. The heart of the attention mechanism, i.e. the attention coefficient that weighs each patch $x_i$ against every other patch, is given by $A_{ij} = x_i^T W x_j$, where $W$ is the second trainable weight matrix.
Based on the architecture shown in Fig. 4, we propose three types of quantum transformers (Sections 3.1, 3.2 and 3.4) and apply these novel architectures to visual tasks for benchmarking. Section 3.3 outlines the approach of combining 3.1 and 3.2 into one circuit to perform inference once the attention coefficients have been trained, while Sections 3.1, 3.2 and 3.4 propose three distinct quantum architectures for training and inference.
The first quantum transformer, introduced in Section 3.1, implements a trivial attention mechanism where each patch pays attention only to itself, while retaining the beneficial property of guaranteed orthogonality of the trained weight matrices [22]. In the second quantum transformer, introduced in Section 3.2 and coined the Orthogonal Transformer, we design a quantum analogue for each of the two main components of a classical attention layer: a linear fully connected layer and the attention matrix that captures the interaction between patches. This approach follows the classical architecture quite closely. In Section 3.4, the Compound Transformer, which takes advantage of the quantum computer to load input states in superposition, is defined. For each of our quantum methods, we provide a theoretical analysis of the computational complexity of the quantum attention mechanisms, which is lower than that of their classical counterparts.
The mathematical formalism behind the Compound Transformer is the second-order compound matrix [23]. The Compound Transformer uses quantum layers to first load all patches into the quantum circuit in uniform superposition and then applies a single unitary to multiply the input vector in superposition with a trainable second-order compound matrix [24]. Here, neither the input vector nor the trainable weight matrix is a simple vector or a simple matrix any longer. Details are given in Sections 3 and 3.4.
The fundamental building blocks for the implementation of these quantum transformers are introduced in the following section.

Quantum Tools
In this work, we will use the RBS gate given in Eq. (1). RBS gates implement the following unitary:
$$RBS(\theta) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & \sin\theta & 0 \\ 0 & -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad (1)$$
This gate can be implemented rather easily, either as a native gate, known as FSIM [25], or using four Hadamard gates, two $R_y$ rotation gates, and two two-qubit CZ gates.
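As an illustrative sketch (not part of the original implementation), the two key properties of the RBS gate, orthogonality and Hamming-weight preservation, can be checked numerically in a few lines of NumPy:

```python
import numpy as np

def rbs(theta: float) -> np.ndarray:
    """4x4 unitary of the RBS gate in the basis |00>, |01>, |10>, |11>."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0,   c,   s, 0.0],
        [0.0,  -s,   c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ])

U = rbs(0.3)
# Orthogonality (the RBS matrix is real, so U U^T = I suffices for unitarity).
assert np.allclose(U @ U.T, np.eye(4))
# Hamming-weight preservation: |00> and |11> are left untouched;
# only the weight-1 subspace {|01>, |10>} is rotated.
assert np.allclose(U[:, 0], [1, 0, 0, 0]) and np.allclose(U[:, 3], [0, 0, 0, 1])
```

Since $RBS(\theta)$ acts as a planar rotation on the weight-1 subspace, composing two RBS gates adds their angles, which is what makes these circuits trainable like classical orthogonal layers.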

Quantum Data Loaders for Matrices
Loading a whole matrix $X \in \mathbb{R}^{n \times d}$ into a quantum state is a powerful technique for machine learning. [26] designed quantum circuits to load input vectors with a unary amplitude encoding: more specifically, the basis used consists of states of Hamming weight 1, where all qubits are in state 0 except one in state 1. The number of gates required to load a vector of dimension $d$ is $d - 1$. In this work, we extend this approach to build a data loader for matrices (Fig. 5) acting on $N = n + d$ qubits, where every row of $X$ is loaded in superposition. Details for this section are provided in Appendix B.1.

Table 1 (fragment): comparison of quantum orthogonal layer circuits by hardware connectivity, depth, and gate count; for example, the Pyramid circuit requires nearest-neighbour connectivity and has depth $2N - 3$.

The number of gates required to load a matrix is $(n-1) + (2n-1)(d-1)$. The resulting state of the matrix loader shown in Fig. 5 is a superposition of the form
$$|X\rangle = \frac{1}{\|X\|} \sum_{i=1}^{n} \|x_i\| \, |e_i\rangle |x_i\rangle,$$
where $|x_i\rangle$ is the unary encoding of row $i$.
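As a small sanity check (an illustrative sketch, not the circuit itself), the amplitudes of the matrix-loader state and the gate count formula above can be reproduced classically; note that the amplitude on basis state $|e_i\rangle|e_j\rangle$ is simply $X_{ij}/\|X\|$:

```python
import numpy as np

def matrix_loader_state(X: np.ndarray) -> np.ndarray:
    """Amplitudes of (1/||X||) sum_i ||x_i|| |e_i>|x_i>, flattened to length n*d.

    The amplitude on |e_i>|e_j> is X_ij / ||X||, so the state is just the
    Frobenius-normalised matrix, read row by row.
    """
    return (X / np.linalg.norm(X)).reshape(-1)

X = np.arange(1.0, 13.0).reshape(4, 3)   # n = 4 patches, d = 3
psi = matrix_loader_state(X)
assert np.isclose(np.linalg.norm(psi), 1.0)   # a valid quantum state

# Gate count of the matrix loader, as stated in the text: (n-1) + (2n-1)(d-1).
n, d = X.shape
assert (n - 1) + (2 * n - 1) * (d - 1) == 17
```

The quantum circuit prepares this state with the stated number of fixed-parameter gates; the classical reshape here only mirrors the resulting amplitudes.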

Quantum Orthogonal Layers
The classical attention layer (Fig. 4) starts with a linear fully connected layer, where each input, i.e. patch x i , is a vector and is multiplied by a weight matrix V. To perform this operation quantumly we generalise the work of [5], where a quantum orthogonal layer is defined as a quantum circuit applied on a state |x⟩ (encoded in the unary basis) to produce the output state |Vx⟩.
More precisely, V is the matrix corresponding to the unitary of the quantum layer, restricted to the unary basis.This matrix is orthogonal due to the unitary nature of quantum operations.
In addition to the already existing Pyramid circuit (Fig. 7) from [5], we define two new types of quantum orthogonal layers with different levels of expressivity and resource requirements: the butterfly circuit (Fig. 8), and the X circuit (Fig. 9).
Looking at Table 1, the X circuit is the most suitable for noisy hardware. It requires a smaller number of gates while maintaining a path from every input qubit to every output qubit. It is also less expressive, with a restricted set of possible orthogonal matrices and fewer trainable parameters.
The butterfly circuit requires logarithmic circuit depth, a linear number of gates, and exhibits a higher level of expressivity. It originates from the classical Cooley-Tukey algorithm [27] used for the Fast Fourier Transform and, when implemented with RBS gates, it performs an operation analogous to the method presented in [28] for classical recurrent neural networks. Note that the butterfly circuit requires the ability to apply gates on all possible qubit pairs.

As shown in [24], quantum orthogonal layers can be generalised to work with inputs which encode a vector on a larger basis. Namely, instead of the unary basis, where all qubits except one are in state 0, a basis of Hamming weight $k$ can be used as well. A basis of Hamming weight $k$ comprises $\binom{N}{k}$ possible states over $N$ qubits, so a vector $x \in \mathbb{R}^{\binom{N}{k}}$ can be loaded as a quantum state $|x\rangle$ using only $N$ qubits. Since quantum orthogonal layers are Hamming-weight-preserving circuits, the output state of such a circuit is a vector encoded in the same basis. Let $V$ be the matrix corresponding to the quantum orthogonal layer in the unary basis, and let $x$ be encoded in the Hamming-weight-$k$ basis; the output state is then no longer $|Vx\rangle$ but $|V^{(k)} x\rangle$, where $V^{(k)}$ is the $k$-th order compound matrix of $V$ [23]. We can see $V^{(k)}$ as the expansion of $V$ in the Hamming-weight-$k$ basis. More precisely, given a matrix $V \in \mathbb{R}^{N \times N}$, the $k$-th order compound matrix $V^{(k)}$ for $k \in [N]$ is the $\binom{N}{k}$-dimensional matrix with entries
$$V^{(k)}_{IJ} = \det(V_{IJ}),$$
where $I$ and $J$ are subsets of rows and columns of $V$ of size $k$.
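The compound matrix definition above can be sketched classically (for illustration only; on hardware this matrix is never materialised, it emerges from the Hamming-weight-preserving circuit):

```python
import numpy as np
from itertools import combinations

def compound(V: np.ndarray, k: int) -> np.ndarray:
    """k-th order compound matrix: entry (I, J) is the minor det(V[I, J])."""
    N = V.shape[0]
    subsets = list(combinations(range(N), k))
    return np.array([[np.linalg.det(V[np.ix_(I, J)]) for J in subsets]
                     for I in subsets])

# A random orthogonal V (via QR); its compound matrix is orthogonal too,
# so Hamming-weight-k encoded vectors stay normalised under the layer.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V2 = compound(V, 2)
assert V2.shape == (10, 10)              # C(5, 2) = 10 basis states
assert np.allclose(V2 @ V2.T, np.eye(10))
```

The orthogonality of $V^{(k)}$ follows from the multiplicativity of compound matrices, $(VW)^{(k)} = V^{(k)} W^{(k)}$, which is why the quantum layer remains norm-preserving on every Hamming-weight sector.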
Recent research supports the trainability of the quantum layers presented in this paper. [29] provides evidence for the trainability and expressivity of Hamming-weight-preserving circuits, indicating that our layers are not prone to the vanishing-gradients problem, commonly referred to as barren plateaus. This assertion is further reinforced by studies in [30,31]. Nonetheless, the existence and implications of exponentially many local minima [32,33] within our framework remain an open question.

Quantum Transformers
The second component of the classical attention layer is the interaction between patches (Fig. 4), where the attention coefficients $A_{ij} = x_i^T W x_j$ are trained by computing $x_i^T W x_j$ for a trainable orthogonal matrix $W$ and all pairs of patches $x_i$ and $x_j$. After that, a non-linearity, for example softmax, is applied to obtain each output $y_i$. Three different approaches for implementing the quantum attention layer are introduced in the next sections, listed in order of increasing complexity in terms of quantum resource requirements, which reflects the degree to which quantum circuits are leveraged to replace the attention layer. A comparison between these quantum methods is provided in Table 2, which applies to both training and inference.
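For reference, the classical attention layer being quantised can be written in a few lines of NumPy; this is an illustrative sketch that follows the text's conventions (column-wise softmax, output $y_i = \sum_j A'_{ij} V x_j$), not the authors' training code:

```python
import numpy as np

def attention_layer(X: np.ndarray, V: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Rows of the output are y_i = sum_j A'_ij V x_j,
    with A_ij = x_i^T W x_j and a column-wise softmax applied to A."""
    A = X @ W @ X.T                       # A_ij = x_i^T W x_j
    A = np.exp(A - A.max(axis=0))         # column-wise softmax (stabilised)
    A_prime = A / A.sum(axis=0)           # each column of A' sums to 1
    return A_prime @ X @ V.T              # row i is sum_j A'_ij (V x_j)

rng = np.random.default_rng(1)
n, d = 4, 8
X = rng.standard_normal((n, d))
V = rng.standard_normal((d, d))
W = rng.standard_normal((d, d))
Y = attention_layer(X, V, W)
assert Y.shape == (n, d)
```

The three quantum architectures below replace different pieces of this computation: the matrix-vector products $V x_j$, the coefficients $A_{ij}$, or the whole weighted sum at once.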
Table 2 lists five key parameters of the proposed quantum architectures, which reflect their theoretical scalability. The number of trainable parameters for a classical vision transformer is $2d^2$ (see Section A), which can be compared directly with the number of trainable parameters of the proposed quantum approaches. The number of fixed parameters per quantum architecture is required for data loading. In this table, the circuit depth represents the combined depth of both the data loader and the quantum layer. Furthermore, the butterfly layer detailed in Fig. 8 and the diagonal data loader illustrated in Fig. 14 are employed, which adds logarithmic depth for loading each vector. The circuit depth together with the number of distinct circuits dictates the overall run time of the quantum architectures, which can be compared to the run time of the classical transformer of $O(nd^2 + n^2 d)$ (listed under the column Circuit Depth). The number of distinct circuits per quantum architecture indicates the potential for each architecture to be processed in parallel, akin to multi-core CPU processing.


Orthogonal Patch-wise Neural Network

The orthogonal patch-wise neural network can be thought of as a transformer with a trivial attention mechanism, where each patch pays attention only to itself. As illustrated in Fig. 10, each input patch is multiplied by the same trainable matrix $V$, using one circuit per patch. Each circuit has $N = d$ qubits, and each patch $x_i$ is encoded in a quantum state with a vector data loader. A quantum orthogonal layer is used to perform the multiplication of each patch with $V$. The output of each circuit is a quantum state encoding $V x_i$, a vector which is retrieved through tomography. Importantly, this tomography procedure deals with states of linear size in the number of qubits, avoiding the exponential complexity often associated with quantum tomography.
The computational complexity of this circuit is calculated as follows: from Section 2.1, a data loader with $N = d$ qubits has a depth of $\log(d)$ steps. For the orthogonal quantum layer, as shown in Table 1, a butterfly circuit takes $\log(d)$ steps, with $\frac{d}{2}\log(d)$ trainable parameters. Overall, the depth is $O(\log d)$ and the number of trainable parameters is $O(d \log d)$. Since this circuit uses one vector data loader, the number of fixed parameters required is $d - 1$.


Quantum Orthogonal Transformer

Looking at Fig. 11, each attention coefficient $x_i^T W x_j$ is calculated by first loading $x_j$ into the circuit with a vector loader, followed by a trainable quantum orthogonal layer $W$, resulting in the vector $W x_j$. Next, an inverse data loader of $x_i$ is applied, creating a state where the probability of measuring 1 on the first qubit is exactly $|x_i^T W x_j|^2$. Note that the square appearing in the quantum circuit is already one type of non-linearity. Using this method, the coefficients of $A$ are always positive, which can still be learned during training, as we show later in Section 4. Additional methods also exist to obtain the sign of the inner product [5]. The estimation of $A_{ij}$ (and of $A'_{ij}$ if needed, by applying a column-wise softmax classically) is repeated for each pair of patches with the same trainable quantum orthogonal layer $W$. The computational complexity of this quantum circuit is similar to the previous one, with one more data loader.
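The measurement probability used by this circuit can be reproduced classically for intuition (an illustrative sketch assuming unit-norm patches, which the data loaders enforce by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
x_i = rng.standard_normal(d); x_i /= np.linalg.norm(x_i)
x_j = rng.standard_normal(d); x_j /= np.linalg.norm(x_j)
W, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal layer

# The circuit prepares |W x_j>; the inverse loader of x_i maps |x_i> back
# to |e_1>, so the amplitude on |e_1> is <x_i| W |x_j>, and the probability
# of measuring 1 on the first qubit is its square -- always non-negative,
# which is the built-in non-linearity mentioned in the text.
p = (x_i @ W @ x_j) ** 2
assert 0.0 <= p <= 1.0
```

Estimating $p$ to precision $\epsilon$ therefore takes $O(1/\epsilon^2)$ measurement shots per pair of patches, which is the usual sampling overhead of this inner-product estimation.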
Putting Figures 10 and 11 together: the quantum circuit presented in Section 3.1 is used to obtain each $V x_j$. At the same time, each attention coefficient $|x_i^T W x_j|^2$ is computed on the quantum circuit and post-processed column-wise with the softmax function to obtain the $A'_{ij}$. The two parts can then be combined classically to compute each $y_i = \sum_j A'_{ij} V x_j$. In this approach, the attention mechanism is implemented by using Hamming-weight-preserving parametrised quantum circuits to compute the weight matrices $V$ and $W$ separately. Computing $|x_i^T W x_j|^2$ requires two data loaders ($2(d-1)$ gates) for $x_i$ and $x_j$, and one quantum orthogonal layer ($d \log d$ gates in the case of the butterfly layer) for $W$. To obtain $V x_j$, we require $d - 1$ gates to load each $x_j$ and a quantum orthogonal layer ($d \log d$ gates in the case of the butterfly layer) for the matrix $V$.
Table 2: Comparison of different quantum methods to perform a single attention layer of a transformer network. $n$ and $d$ stand for the number of patches and their individual dimension, respectively. All quantum orthogonal layers are implemented using butterfly circuits. See Section 3 for details. A quantum orthogonal layer from Section 2.2 is used for $V$.

Direct Quantum Attention
In Section 3.2, the output of the attention layer, $y_i = \sum_j A'_{ij} V x_j$, is computed classically once the quantities $A'_{ij}$ and $V x_j$ have been obtained separately with the help of quantum circuits. During inference, where the matrices $V$ and $W$ have been learnt and the attention matrix $A$ (or $A'$) is stored classically, Direct Quantum Attention implements the attention layer directly on the quantum computer. The matrix data loader from Fig. 5 is used to compute each $y_i = \sum_j A_{ij} V x_j$ with a single quantum circuit.
In Fig. 12, $y_i$, the output patch with index $i$, is computed using a quantum circuit with $N = n + d$ qubits. These qubits are split into two main registers. On the top register ($n$ qubits), the vector $A_i$, the $i$-th row of the attention matrix $A$ (or $A'$), is loaded via a vector data loader, giving $\sum_j A_{ij} |e_j\rangle |0\rangle$.
Next, on the lower register ($d$ qubits), as in Fig. 5, the data loaders for each vector $x_j$ and their respective adjoints are applied sequentially, with CNOTs controlled on each qubit $j$ of the top register. This gives the quantum state $\sum_j A_{ij} |e_j\rangle |x_j\rangle$, i.e. the matrix $X$ is loaded with all rows re-scaled according to the attention coefficients. As for any matrix data loader, this requires $(n-1) + (2n-1)(d-1)$ gates with fixed (non-trainable) parameters.
The last step consists of applying the quantum orthogonal layer $V$, trained beforehand, on the second register of the circuit. As previously established, this operation performs matrix multiplication between $V$ and the vector encoded on the second register. Since the $k$-th element of the vector $V x_j$ can be written as $\sum_q V_{kq} X_{jq}$, and $y_i = \sum_j A_{ij} V x_j$, the $k$-th element of $y_i$ can be written as $y_{ik} = \sum_j A_{ij} \left( \sum_q V_{kq} X_{jq} \right)$. Therefore, the quantum state at the end of the circuit can be written as $|y_i\rangle = \sum_k y_{ik} |\phi_k\rangle |e_k\rangle$ for some normalised states $|\phi_k\rangle$. Performing tomography on the second register yields the output vector $y_i$.
This circuit is a more direct method to compute each $y_i$; each $y_i$ uses a different $A_i$ in the first part of the circuit. As shown in Table 2, compared with the previous method, this method requires fewer circuits to run, but each circuit requires more qubits and greater depth. To analyse the computational complexity: the first data loader on the top register has $n$ qubits and depth $\log n$; the following $2n - 1$ loaders on the bottom register have $d$ qubits, giving depth $(2n-1)\log d$; and the final quantum orthogonal layer $V$, implemented using a butterfly circuit, has a depth of $\log d$ and $O(d \log d)$ trainable parameters.
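The linear-algebra identity the circuit realises is easy to verify classically (an illustrative sketch with random data, not the circuit simulation itself):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 6
X = rng.standard_normal((n, d))          # the n patches, as rows
A = rng.standard_normal((n, n))          # a pre-computed attention matrix
V, _ = np.linalg.qr(rng.standard_normal((d, d)))  # trained orthogonal layer

i = 2
# The circuit prepares sum_j A_ij |e_j>|x_j>, then applies V on the second
# register; tomography on that register yields y_i = sum_j A_ij V x_j.
y_i_circuit = sum(A[i, j] * (V @ X[j]) for j in range(n))
y_i_direct = V @ X.T @ A[i]              # the same sum as one matrix product
assert np.allclose(y_i_circuit, y_i_direct)
```

In other words, each circuit computes one row of $A X V^T$; running $n$ such circuits (one per row $A_i$) reproduces the full attention output.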

Quantum Compound Transformer
Until now, each step of the classical vision transformer has been reproduced closely by quantum linear algebraic procedures.The same quantum tools can also be used in a more natively quantum fashion, while retaining the spirit of the classical transformers, as shown in Fig. 13.
At a high level, the compound transformer first loads all patches in superposition and then applies an orthogonal layer that simultaneously extracts features from each patch and re-weights the patches, so that the output is a weighted sum of the features extracted from all patches. This means that instead of learning two separate weight matrices $V$ and $W$, one for feature extraction and one for weighting, to generate each $y_i = \sum_j A'_{ij} V x_j$ individually, a single operation generates all $y_i$ directly from one circuit. Since a single quantum orthogonal layer is used to generate $Y$, we switch to $V_c$ to denote this orthogonal layer that applies the compound matrix, as explained below. More precisely, the quantum circuit we use has two registers: the top one of size $n$ and the bottom one of size $d$. The full matrix $X \in \mathbb{R}^{n \times d}$ is loaded into the circuit using the matrix data loader from Section 2.1 with $N = n + d$ qubits. This could correspond to the entire image, as every image can be split into $n$ patches of size $d$. Since the encoding basis over the two registers has more than one qubit in state 1, we step out of the unary basis framework; the correct basis to consider is that of Hamming weight 2. Note that, among the $\binom{n+d}{2}$ states of Hamming weight 2, only $n \times d$ of them correspond to states with one 1 among the top qubits and the other 1 among the bottom qubits.
Next, a quantum orthogonal layer $V_c$ is applied on both registers at the same time. Note that this $V_c$ is not the same as in the previous constructions, since it is now applied on a superposition of patches. As explained in Section 2.2 and in [24], the resulting operation in this case is not a simple matrix-vector multiplication $VX$. Instead of $V_c$, the multiplication involves its second-order compound matrix $V_c^{(2)}$ of dimension $\binom{n+d}{2} \times \binom{n+d}{2}$. Similarly, the vector being multiplied is not simply $X$ but a modified version of size $\binom{n+d}{2}$, obtained by padding the added dimensions with zeros.
The resulting state is $|Y\rangle = |V_c^{(2)} X\rangle$, where $V_c^{(2)}$ is the second-order compound matrix of $V_c$, namely the matrix corresponding to the unitary of the quantum orthogonal layer in Fig. 13 restricted to the unary basis. This state has dimension $\binom{n+d}{2}$, i.e. there are exactly two 1s among the $N = n + d$ qubits, but one can post-select on the part of the state with exactly one qubit in state 1 on the top register and the other 1 on the lower register. This way, $n \times d$ output states are generated; in other words, tomography is performed on a state of the form $\sum_{i,j} Y_{ij} |e_i\rangle |e_j\rangle$. Note that in this context, the proposed tomography approach reconstructs vectors of quadratic, not exponential, size relative to the qubit count. Furthermore, a significant fraction of the measurement shots may be discarded by the post-selection step in order to narrow down to the desired $n \times d$ space.
To calculate the computational complexity of this circuit, we consider: the matrix data loader detailed in Fig. 5, which has a depth of $\log n + 2n \log d$; and the quantum orthogonal layer applied on $n + d$ qubits, which, if implemented using the butterfly circuit, has a depth of $\log(n+d)$ and $(n+d)\log(n+d)$ trainable parameters. Since this circuit uses exactly one matrix loader, the number of fixed parameters is $(n-1) + (2n-1)(d-1)$.
To estimate the cost of performing the same operation on a classical computer, consider the equivalent procedure of creating the compound matrix $V_c^{(2)}$ by first computing all determinants of the matrix and then performing a matrix-vector multiplication of dimension $\binom{n+d}{2}$, which takes $O((n+d)^4)$ time. Performing this operation on a quantum computer can provide a polynomial speedup with respect to $n$. More generally, this compound matrix operation on an arbitrary input state of Hamming weight $k$ is quite hard to perform classically, since all determinants must be computed and a matrix-vector multiplication of size $\binom{n+d}{k}$ must be applied. Overall, the compound transformer can replace both the Orthogonal Patch-wise Network (3.1) and the Quantum Orthogonal Transformer layer (3.2) with one combined operation. The use of compound matrix multiplication makes this approach different from the classical transformers, while retaining some interesting properties of its classical counterpart: patches are weighted in their global context, and gradients are shared through the determinants used to generate the compound matrix.
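The full compound-transformer forward pass can be emulated classically at small sizes (an illustrative sketch; the embedding convention, one 1 at top qubit $i$ and one at bottom qubit $n+j$, follows the text, and the explicit construction of $V_c^{(2)}$ is exactly the classically expensive step the circuit avoids):

```python
import numpy as np
from itertools import combinations

def compound2(V: np.ndarray):
    """Second-order compound matrix of V, plus the ordered weight-2 basis."""
    N = V.shape[0]
    S = list(combinations(range(N), 2))
    M = np.array([[np.linalg.det(V[np.ix_(I, J)]) for J in S] for I in S])
    return M, S

rng = np.random.default_rng(4)
n, d = 3, 4
N = n + d
X = rng.standard_normal((n, d))
Vc, _ = np.linalg.qr(rng.standard_normal((N, N)))
V2, S = compound2(Vc)                    # C(7, 2) = 21 dimensional

# Embed X into the Hamming-weight-2 basis: amplitude X_ij / ||X|| on the
# state with a 1 at top qubit i and a 1 at bottom qubit n + j, zeros elsewhere.
idx = {pair: t for t, pair in enumerate(S)}
x_in = np.zeros(len(S))
for i in range(n):
    for j in range(d):
        x_in[idx[(i, n + j)]] = X[i, j]
x_in /= np.linalg.norm(x_in)

y = V2 @ x_in                            # still unit norm: V2 is orthogonal
assert np.isclose(np.linalg.norm(y), 1.0)
# Post-select the n*d block with one 1 in each register (other shots discarded).
Y = np.array([[y[idx[(i, n + j)]] for j in range(d)] for i in range(n)])
assert Y.shape == (n, d)
```

Already at $n + d = 7$ the compound matrix is $21 \times 21$; its dimension grows as $\binom{n+d}{2}$, which is the source of the classical cost the text refers to.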
The Compound Transformer operates in a similar spirit to the MLP-Mixer architecture presented in [34], a state-of-the-art architecture for image classification tasks that exchanges information between the different patches without using convolutions or attention mechanisms.

Experiments
In order to benchmark the proposed methods, we applied them to a set of medical image classification tasks, using both simulations and quantum hardware experiments. MedMNIST, a collection of 12 preprocessed, two-dimensional, open-source medical image datasets from [35,36], annotated for classification tasks and benchmarked with a diverse set of classical techniques, provides the complete training and validation data.

Simulation Setting
The Orthogonal Patch-wise Network from Section 3.1, the Orthogonal Transformer from Section 3.2, and the Compound Transformer from Section 3.4 were trained via simulation, along with two baseline methods. The first baseline is the Vision Transformer from [9], which has been successfully applied to different image classification tasks and is described in detail in Appendix A. The second baseline is the Orthogonal Fully-Connected Neural Network (OrthoFNN), a quantum method without an attention layer that has previously been trained on the RetinaMNIST dataset in [5]. For each of the five architectures, one model was trained on each dataset of MedMNIST and validated using the same validation method as in [35,36].
To ensure comparable evaluations between the five neural networks, similar architectures were implemented for all five.
The benchmark architectures all comprise three parts: preprocessing, feature extraction, and postprocessing. The first part is classical and preprocesses the input image of size 28 × 28 by extracting 16 patches (n = 16) of size 7 × 7. We then map every patch to a 16-dimensional feature space (d = 16) using a fully connected neural network layer. This first feature-extraction component is a single fully connected layer trained in conjunction with the rest of the architecture. For the OrthoFNN network, used as our quantum baseline, one patch of size 16 was extracted from the complete input image using a fully connected neural network layer of size 784 × 16; this fully connected layer is also trained in conjunction with the quantum circuits. The second part of the common architecture transforms the extracted features by applying a sequence of 4 attention layers on the extracted patches, which maintain the dimension of the layer. Moreover, the same gate layout, i.e. the butterfly circuit, is used for all circuits that compose the quantum layers. Finally, the last part of the neural network is classical and linearly projects the extracted features to output the predicted label.
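The classical preprocessing step described above can be sketched as follows (an illustration with a random image and a random, untrained embedding matrix `W_embed`; in the experiments this layer is trained jointly with the rest of the model):

```python
import numpy as np

rng = np.random.default_rng(5)
image = rng.standard_normal((28, 28))

# Split the 28x28 image into 16 non-overlapping 7x7 patches (n = 16), then
# map each flattened 49-dimensional patch to d = 16 features with one
# shared linear layer.
patches = image.reshape(4, 7, 4, 7).transpose(0, 2, 1, 3).reshape(16, 49)
W_embed = rng.standard_normal((49, 16))   # trained jointly in the experiments
features = patches @ W_embed
assert features.shape == (16, 16)         # n = 16 patches, d = 16 features
```

The resulting 16 × 16 feature matrix is exactly the input $X$ consumed by the attention layers of Section 3.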

Simulation Results
A summary of the simulation results is shown in Table 3 where the area under receiver operating characteristic (ROC) curve (AUC) and the accuracy (ACC) are reported as evaluation metrics.A full comparison with the classical benchmark provided by [35] is given in Appendix D, Table 6.
From Table 3, we observe that the Vision Transformer, Orthogonal Transformer, and Compound Transformer architectures outperform the Orthogonal Fully-Connected and Orthogonal Patch-wise neural networks on all 12 tasks. This is likely because the latter two architectures do not contain any attention mechanism that exchanges information across the patches, confirming the effectiveness of the attention mechanism in learning useful features from images. Second, the Orthogonal Transformer and Compound Transformer, which implement nontrivial quantum attention mechanisms, provide very competitive performance compared to the two benchmark methods and outperform them on 7 out of 12 MedMNIST datasets.
A comparison can also be made with regard to the number of trainable parameters used by each architecture. Table 5 presents a per-layer resource analysis for the quantum circuits that were simulated. For example, the Compound Transformer requires 80 trainable parameters, compared to the 512 ($2d^2$) required by the classical Vision Transformer. Note that this resource analysis focuses on the attention layer of each transformer network, and does not include parameters used for preprocessing, other parts of the transformer layer, or the single layer used in the final classification (Fig. 1), which are common to all simulated methods.
Overall, our quantum transformers have reached levels of accuracy comparable to the equivalent classical transformers while using a smaller number of trainable parameters, providing small-scale confirmation of our theoretical predictions. The circuit depth and the number of distinct circuits used for each of the quantum transformers are also listed in Table 5, matching the theoretical resource analysis in Table 2. While the quantum transformers do have a theoretical guarantee on the asymptotic run time of the attention mechanism compared to the classical transformer, this effect is hard to observe given the small data size. The summary of the hardware experiments listed in Table 4 shows very competitive levels of accuracy from the quantum transformers in comparison with the classical benchmarks; details can be found in Appendix C.3.

Conclusion
In this work, three different quantum transformers are presented: the Orthogonal Patch-wise Transformer implements a trivial attention mechanism; the Orthogonal Transformer closely mimics the classical transformers; and the Compound Transformer steps away from the classical architecture with a quantum-native linear algebraic operation that cannot be performed efficiently classically: multiplication of a vector with a higher-dimensional compound matrix. Inside all these quantum transformers are the quantum orthogonal layers, which efficiently apply matrix multiplication to vectors encoded on specific quantum basis states. All circuits implementing orthogonal matrix multiplication can be trained using the backpropagation method detailed in [5].
As shown in Table 2, the proposed quantum circuits offer a potential computational advantage by reducing the complexity of attention layers. This opens the possibility that quantum transformers may match the performance of their classical counterparts while requiring fewer resources in terms of run time and parameter count. On the other hand, while these initial results are promising, they are derived from a limited set of experiments and primarily offer a theoretical viewpoint. The practical realisation of such advantages in quantum machine learning is heavily contingent upon future advancements in quantum hardware, for example in managing quantum noise, improving clock speed, and other critical factors. Therefore, these findings should be regarded as a promising yet preliminary step, necessitating further empirical validation on future quantum hardware.
In addition to the theoretical analysis, we performed extensive numerical simulations and quantum hardware experiments, which show that our quantum circuits can classify the small MedMNIST images as well as, and at times better than, the state-of-the-art classical methods (Table 3) while using fewer parameters, suggesting that these quantum models could mitigate over-fitting thanks to their smaller parameter count.
While the run time of the quantum fully connected layer and the quantum attention mechanism has been theoretically proven to be advantageous, this effect is hard to observe on current quantum computers due to their limited size, high level of noise, and the latency of cloud access. Our hardware experiments show that results from the current hardware become too noisy as soon as the number of qubits or the size of the quantum circuit increases.
Overall, our results are encouraging and confirm the benefit of using trainable quantum circuits to perform efficient linear algebra operations. By carefully designing the quantum circuit to allow much better control over the size of the Hilbert space explored by the model, we are able to provide models that are both expressive and trainable.

A Vision Transformers
Here, the details of the classical Vision Transformer introduced in [9] are outlined. Some slight changes to the architecture have been made to ease the correspondence with the quantum circuits. We also introduce important notation that will be reused in the quantum methods.
The transformer network starts by decomposing an image into patches and pre-processing the set of patches to map each one into a vector, as shown in Fig. 2. The initial set of patches is augmented with an extra vector of the same size as the patches, called the class embedding. This class embedding vector is used at the end of the network, where it feeds into a fully connected layer that yields the output (see Fig. 1). We also include one trainable vector called the positional embedding, which is added to each vector. At the end of this pre-processing step, we obtain a set of n vectors of dimension d, denoted x_i, to be used in the next steps.
Next, feature extraction is performed using a transformer layer [7,9], which is repeated L times, as shown in Fig. 3. Within the transformer layer, we first apply layer normalisation over all patches x_i and then apply the attention mechanism detailed in Fig. 4. After this part, we obtain a state to which we add the initial input vectors from before normalisation, an operation called a residual layer, represented by the blue arrow in Fig. 3, followed by another layer normalisation. After this, we apply a Multi-Layer Perceptron (MLP), which consists of multiple fully connected linear layers applied to each vector, producing same-sized vectors. Again, we add the residual from just before the last layer normalisation, which yields the output of one transformer layer.
After repeating the transformer layer L times, we finally take the vector corresponding to the class embedding, i.e. the vector corresponding to x_0, in the final output and apply a fully connected layer of dimension (d × number of classes) to produce the final classification result (see Fig. 1). It is important to observe that only this first output vector is used in the final fully connected layer to perform the classification (hence the name class embedding).
Looking inside the attention mechanism (see Fig. 4), we start by using a fully connected linear layer with trainable weights V to calculate, for each patch x_i, the feature vector Vx_i. Then, to calculate the attention coefficients, we use another trainable weight matrix W and define the attention given by patch x_i to patch x_j as x_i^T W x_j. Next, for each patch x_i, we obtain the final extracted features as the weighted sum of all feature vectors Vx_j, where the weights are the attention coefficients. This is equivalent to performing a matrix multiplication with a matrix A defined by A_ij = x_i^T W x_j. Note that in the classical transformer architecture, a column-wise softmax is applied to the A_ij, and the resulting attention coefficients A'_ij are used instead. Overall, the attention mechanism makes use of 2d^2 trainable parameters, evenly divided between V and W, each of size d × d.
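As a concrete reference point, this variant of the attention layer can be written in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the column-wise softmax convention follows the description above.

```python
import numpy as np

def attention(X, V, W):
    """One attention layer in the single-matrix variant described above.

    X: (n, d) array whose rows are the patch vectors x_i.
    V, W: (d, d) trainable weight matrices (2 d^2 parameters in total).
    """
    feats = X @ V.T                            # row i holds V x_i
    A = X @ W @ X.T                            # A_ij = x_i^T W x_j
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)       # column-wise softmax -> A'
    return A @ feats                           # row i: sum_j A'_ij V x_j
```

With W = 0 every attention column is uniform, so each output row reduces to the average of the feature vectors Vx_j, which is a quick sanity check of the convention used here.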
In fact, the above description is a slight variant of the original transformers proposed in [7], where the authors used two trainable matrices to obtain the attention coefficients instead of the single matrix W used in this work. This choice was made to simplify the quantum implementation but could be extended to the original proposal using the same quantum tools.
The computational complexity of the classical attention mechanism depends mainly on the number of patches n and their individual dimension d: the first patch-wise matrix multiplication with the matrix V ∈ R^{d×d} takes O(nd^2) steps, while the subsequent multiplication with the large matrix A' takes O(n^2 d). Obtaining A' from W requires O(nd^2) steps as well. Overall, the complexity is O(nd^2 + n^2 d). In the classical deep learning literature, the emphasis is placed on the second term, which is usually the most costly. Note that a recent proposal [37] formulates the attention mechanism as a linear operation with only O(nd^2) computational complexity.
We compare the classical computational complexity with those of our quantum methods in Table 2. These running times have a real impact on both training and inference, as they measure how the time to perform each layer scales with the number and dimension of the patches.

B Quantum Tools (Extended)

B.1 Quantum Data Loaders for Matrices
In order to perform a machine learning task on a quantum computer, classical data (a vector, a matrix) needs to be loaded into the quantum circuit. The technique we choose for this task is called amplitude encoding, which uses the classical scalar components of the data as the amplitudes of a quantum state, here on d qubits. In particular, we build upon previous methods to define quantum data loaders for matrices, as shown in Fig. 5. [26] proposes three different circuits to load a vector x ∈ R^d using d−1 gates, with a circuit depth ranging from O(log(d)) to O(d) as desired (see Fig. 14). These data loaders use unary amplitude encoding, where a vector x is loaded as the state |x⟩ = (1/∥x∥) Σ_{i=1}^{d} x_i |e_i⟩, where |e_i⟩ is the quantum state with all qubits in 0 except the i-th one in state 1 (e.g. |e_1⟩ = |10...0⟩).
The circuit uses RBS gates: a parametrised two-qubit gate given by Eq. 1. The d − 1 parameters θ_i of the RBS gates are classically pre-computed to ensure that the output of the circuit is indeed |x⟩.
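For concreteness, the classical pre-computation of these angles for the sequential (diagonal) loader can be sketched as follows. The RBS sign convention and the assumption of a nonnegative last component are simplifying choices made for illustration; the simulation tracks only the d unary amplitudes rather than the full 2^d-dimensional state.

```python
import numpy as np

def loader_angles(x):
    """Pre-compute the d-1 RBS angles loading the unit vector x in
    unary encoding with a sequential (diagonal) loader. Assumes the
    last component of x is nonnegative (in practice a sign can be
    fixed with an extra gate)."""
    x = np.asarray(x, float) / np.linalg.norm(x)
    thetas, r = [], 1.0          # r: amplitude still to be distributed
    for k in range(len(x) - 1):
        t = np.arccos(np.clip(x[k] / r, -1.0, 1.0))
        thetas.append(t)
        r *= np.sin(t)
    return np.array(thetas)

def simulate_loader(thetas, d):
    """Each RBS gate acts as a 2x2 rotation on two adjacent unary
    amplitudes (one common sign convention)."""
    amp = np.zeros(d)
    amp[0] = 1.0
    for k, t in enumerate(thetas):
        amp[k], amp[k + 1] = np.cos(t) * amp[k], np.sin(t) * amp[k]
    return amp
```

Running the simulated loader on the pre-computed angles reproduces the normalised input vector as the vector of unary amplitudes.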
We also require a loader for matrices. Given a matrix X ∈ R^{n×d}, instead of loading a flattened vector, the rows X_i are loaded in superposition. As shown in Fig. 5, on the top qubit register we first load the vector (∥x_1∥, ..., ∥x_n∥) made of the norms of each row, using a vector data loader, and obtain the state (1/∥X∥) Σ_{i=1}^{n} ∥x_i∥ |e_i⟩. Then, on a lower register, we sequentially load each row X_i ∈ R^d. To do so, we use vector data loaders and their adjoints, as well as CNOTs controlled on the i-th qubit of the top register. The resulting state is a superposition of the form (1/∥X∥) Σ_{i=1}^{n} Σ_{j=1}^{d} X_{ij} |e_i⟩ |e_j⟩. One immediate application of data loaders that construct amplitude encodings is the ability to perform fast inner product computation with quantum circuits. Applying the inverse data loader of x_i after the regular data loader of x_j effectively creates a state of the form ⟨x_i, x_j⟩ |e_1⟩ + |G⟩, where |G⟩ is a garbage state orthogonal to |e_1⟩. The probability of measuring |e_1⟩, which is simply the probability of reading a 1 on the first qubit, is |⟨x_i, x_j⟩|^2. Techniques to retrieve the sign of the inner product have been developed in [5].
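The sampling behaviour of this inner-product estimation can be mimicked classically. The sketch below stands in for the loader/inverse-loader circuit: it draws shots from the known outcome probability |⟨x, y⟩|^2 and returns the estimated magnitude; the sign requires the additional techniques of [5]. Function name and shot counts are illustrative.

```python
import numpy as np

def estimate_inner_product_magnitude(x, y, shots=10_000, seed=0):
    """Simulate the statistics of the loader / inverse-loader circuit:
    the first qubit reads 1 with probability |<x, y>|^2 for unit
    vectors, so finite shots give the magnitude up to sampling noise."""
    x = np.asarray(x, float) / np.linalg.norm(x)
    y = np.asarray(y, float) / np.linalg.norm(y)
    p = float(np.dot(x, y)) ** 2
    ones = np.random.default_rng(seed).binomial(shots, p)
    return np.sqrt(ones / shots)   # magnitude only; sign needs extra circuitry
```

For x = (1, 0) and y = (1, 1)/√2 the true magnitude is 1/√2 ≈ 0.707, and the estimate converges to it as the number of shots grows.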

B.2 Quantum Orthogonal Layers
In this section, we outline the concept of quantum orthogonal layers used in neural networks, which generalises the work in [5]. These layers correspond to parametrised circuits of N qubits made of RBS gates. RBS gates preserve the number of ones and zeros in any basis state: if the input to a quantum orthogonal layer is a vector in unary amplitude encoding, the output will be another vector in unary amplitude encoding. Similarly, if the input quantum state is a superposition of basis states of hamming weight 2 only, so is the output quantum state. This output state is precisely the result of a matrix-vector product, where the matrix is the unitary matrix of the quantum orthogonal layer restricted to the basis used. Therefore, for the unary basis, we consider an N × N matrix W instead of the full 2^N × 2^N unitary. Similarly, for the basis of hamming weight two, we can restrict the unitary to a matrix of size (N choose 2) × (N choose 2). Since the reduced matrix conserves its unitarity and has only real values, these are orthogonal matrices. More generally, we can think of such hamming-weight-preserving circuits on N qubits as block-diagonal unitaries that act separately on N + 1 subspaces, where the k-th subspace is spanned by all computational basis states of hamming weight k and has dimension (N choose k).
There exist many possibilities for building a quantum orthogonal layer, each with different properties. The Pyramid circuit, proposed in [5], is composed of exactly N(N − 1)/2 RBS gates. This circuit requires only adjacent-qubit connectivity, which is available on most superconducting qubit hardware. More precisely, the set of matrices implemented by quantum orthogonal layers with the pyramidal layout is exactly the Special Orthogonal Group, made of orthogonal matrices with determinant equal to +1. Adding a final Z gate on the last qubit allows the layer to also realise orthogonal matrices with determinant −1. The pyramid circuit is therefore fully general and covers all possible orthogonal matrices of size N × N.
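Since, restricted to the unary subspace, each RBS gate acts as a Givens rotation on two adjacent coordinates, the N × N orthogonal matrix implemented by a pyramid layer can be emulated classically by composing N(N − 1)/2 such rotations. The gate ordering and sign convention below are illustrative choices; any such product is orthogonal with determinant +1, matching the Special Orthogonal Group statement above.

```python
import numpy as np

def rbs_unary(N, i, theta):
    """Givens rotation induced on the N unary amplitudes by an RBS
    gate on adjacent qubits (i, i+1) (one sign convention)."""
    G = np.eye(N)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[i, i + 1], G[i + 1, i], G[i + 1, i + 1] = c, s, -s, c
    return G

def pyramid_layer(thetas):
    """Compose N(N-1)/2 adjacent Givens rotations into the N x N
    orthogonal matrix a pyramid circuit implements (illustrative
    gate ordering, sweeping successive diagonals)."""
    # recover N from len(thetas) = N(N-1)/2
    N = int((1 + np.sqrt(1 + 8 * len(thetas))) / 2)
    assert N * (N - 1) // 2 == len(thetas)
    W, k = np.eye(N), 0
    for a in range(1, N):
        for i in range(a - 1, -1, -1):
            W = rbs_unary(N, i, thetas[k]) @ W
            k += 1
    return W
```

Whatever the angles, the resulting matrix satisfies W Wᵀ = I with determinant +1, which is the defining property the restricted circuit guarantees.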
The two new types of quantum orthogonal layers introduced in this work are the butterfly circuit (Fig. 8) and the X circuit (Fig. 9) (see Section 2.2).
There exists a method [5] to compute the gradient of each parameter θ_i in order to update it. This backpropagation method for the pyramid circuit takes time O(N^2), corresponding to the number of gates, and provides a polynomial improvement in run time compared to previously known orthogonal neural network training algorithms [22]. The exact same method developed for the pyramid circuit can be used to perform quantum backpropagation on the new circuits introduced in this paper. The run time again corresponds to the number of gates, which is lower for the butterfly and X circuits. See Table 1 for a full comparison between the three types of circuits. In particular, for the butterfly layer, the complexity of the backpropagation method drops from O(N^2) to O(N log N).

C Medical Image Classification via Quantum Transformers (Extended)

C.1 Datasets
In order to benchmark our models, we used MedMNIST, a collection of 12 pre-processed, two-dimensional open medical image datasets [35,36]. The collection has been standardised for classification tasks on 12 different imaging modalities, each with medical images of 28 × 28 pixels. All three quantum transformers and the two benchmark methods were trained and validated on all 12 MedMNIST datasets. For the hardware experiments, we focused on one dataset, RetinaMNIST. The MedMNIST collection was chosen for our benchmarking efforts because its size is amenable to simulations of the quantum circuits and to hardware experiments, while remaining representative of one important field of computer vision applications: the classification of medical images.

C.2 Simulations
First, simulations of our models are performed on the 2D MedMNIST datasets, demonstrating that the proposed quantum attention architectures reach accuracy comparable to, and at times better than, various standard classical models. Next, the setting of our simulations is described and the results are compared against those reported in the AutoML benchmark performed by the authors of [36].

C.2.1 Simulation setting MedMNIST
The JAX package [38] was used to efficiently simulate the complete training procedure of the five benchmark architectures. The experimental hyperparameters used in [36] were replicated for our benchmark: every model is trained using the cross-entropy loss with the Adam optimiser [39] for 100 epochs, with a batch size of 32 and a learning rate of 10^{-3} that is decayed by a factor of 0.1 after 50 and 75 epochs.
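For concreteness, this step schedule can be written as follows; the helper name and keyword defaults are illustrative, mirroring the stated hyperparameters.

```python
def learning_rate(epoch, base=1e-3, decay=0.1, milestones=(50, 75)):
    """Step learning-rate schedule used for the benchmark: the base
    rate is multiplied by `decay` once per milestone epoch reached."""
    return base * decay ** sum(epoch >= m for m in milestones)
```

So epochs 0-49 train at 10^-3, epochs 50-74 at 10^-4, and epochs 75-99 at 10^-5.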
The 5 different neural networks were trained over 3 random seeds, and the best overall performance for each of them was selected. The evaluation procedure is similar to the AutoML benchmark in [35,36], and the benchmark results are shown in Table 3, where the area under the receiver operating characteristic (ROC) curve (AUC) and the accuracy (ACC) are reported as evaluation metrics. A full comparison with the classical benchmark provided by [35] is given in Appendix D, Table 6.

C.2.2 Simulation results MedMNIST
From Table 3, we observe that the Quantum Orthogonal and Compound Transformer architectures outperform the Orthogonal Fully-Connected and Orthogonal Patch-wise neural networks most of the time. This may be due to the fact that the latter do not rely on any mechanism that exchanges information across the patches. Furthermore, all quantum neural networks provide very competitive performance compared to the AutoML benchmark and outperform their classical counterparts on 7 out of 12 MedMNIST datasets.
Moreover, comparisons can be made with regard to the number of parameters used by each architecture, in particular for feature extraction. Table 5 presents a resource analysis, per layer, for the quantum circuits that were simulated. It includes the number of qubits, the number of gates with trainable parameters, and the number of gates with fixed parameters used for loading the data. The table shows that our quantum architectures have a small number of trainable parameters per layer. The global count for each quantum method is as follows.
• Orthogonal Patch-wise Neural Network: 32 parameters per circuit, with the 16 circuits per layer sharing the same 32 parameters, and 4 layers, for a total of 128 trainable parameters.
• Quantum Orthogonal Transformer: 32 parameters per circuit, with 17 circuits sharing one set of 32 parameters and another 289 circuits sharing a second set of 32 parameters per layer, and 4 layers, for a total of 256 trainable parameters.
• Compound Transformer: 80 parameters per circuit, 1 circuit per layer, and 4 layers, for a total of 320 trainable parameters.
These numbers are to be compared with the number of trainable parameters in the classical Vision Transformer used as a baseline. As stated in Section A, each classical attention layer requires 2d^2 trainable parameters, which in the simulations performed here corresponds to 512. Note again that this resource analysis focuses on the attention layer of each transformer network and does not include the parameters used for the pre-processing of the images (see Section C.2.1), for other parts of the transformer layers (Fig. 3), or for the single layer used in the final classification (Fig. 1), which are common to all cases. More generally, the performance of other classical neural network models provided by the authors of MedMNIST is compared to our approaches in Table 6 in the Appendix. Some of these classical neural networks reach somewhat better levels of accuracy, but are known to use an extremely large number of parameters. For instance, the smallest reported residual network has approximately 10^7 parameters in total, and the automated machine learning algorithms train numerous different architectures in order to reach that performance.
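These counts can be cross-checked against simple gate-count formulas. The sketch below assumes a patch dimension of d = 16, which reproduces the 512 classical parameters; the butterfly formula on 16 qubits likewise yields the 32 trainable parameters per circuit quoted above (that the butterfly layout is the one simulated is our assumption here, not stated in this section).

```python
import math

def classical_attention_params(d):
    """Classical attention layer: V and W, each d x d."""
    return 2 * d * d

def pyramid_params(N):
    """RBS gates (= trainable angles) in a pyramid circuit on N qubits."""
    return N * (N - 1) // 2

def butterfly_params(N):
    """RBS gates in a butterfly circuit on N qubits (N a power of 2)."""
    return (N // 2) * int(math.log2(N))
```

With d = N = 16 this gives 512 classical parameters against 32 per butterfly circuit, in line with the per-layer figures above.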
Based on the results of the simulations in this section, quantum transformers are able to train across a number of different classification tasks, delivering performance that is highly competitive with, and sometimes better than, the equivalent classical methods.

C.3 Quantum Hardware Experiments
Quantum hardware experiments were performed on one specific dataset: RetinaMNIST. It has 1080 images for training, 120 images for validation, and 400 images for testing. Each image contains 28 × 28 RGB pixels and is classified into 1 of 5 classes (ordinal regression).

C.3.1 Hardware Description
The hardware demonstration was performed on two different superconducting quantum computers provided by IBM, with the smaller experiments performed on the 16-qubit ibmq_guadalupe machine (see Fig. 15) and the larger ones on the 27-qubit ibm_hanoi machine. Results are reported here from experiments with four, five, and six qubits; experiments with higher numbers of qubits, which entail higher numbers of gates and greater depth, did not produce meaningful results. Note that the main sources of noise are the device noise and the finite sampling noise. In general, noise is undesirable during computations.
In the case of a neural network, however, noise may not be as troublesome: noise can help escape local minima [40], or act as data augmentation to avoid over-fitting.In classical deep learning, noise is sometimes artificially added for these purposes [41].Despite this, when the noise is too large, we also see a drop in the accuracy.

C.3.2 Hardware Results
Hardware experiments were performed with four, five, and six qubits to push the limits of the current hardware, in terms of both the number of qubits and the circuit depth. Three quantum proposals were run: the Orthogonal Patch-wise network (from Section 3.1), the Quantum Orthogonal Transformer (from Sections 3 and 3.3), and finally the Quantum Compound Transformer (from Section 3.4).
Each quantum model was trained using a JAX-based simulator, and inference was performed on the IBM quantum computers over the entire test dataset of 400 RetinaMNIST images. Regarding the experimental setting on real hardware, the number of shots for the compound setup using 6 qubits was maximised at 32,000. For the other configurations using 4 qubits, 10,000 shots were used.
The first model, the Orthogonal Patch-wise neural network, was trained using 16 patches per image, 4 features per patch, and one 4 × 4 orthogonal layer, using a 4-qubit pyramid as the orthogonal layer. The experiment used 16 different quantum circuits of 9 RBS gates per circuit per image. The result was compared with an equivalent classical (non-orthogonal) patch-wise neural network, and a small advantage in accuracy for the quantum-native method could be reported.
The second model, the Quantum Orthogonal Transformer, used 4 patches per image, 4 features per patch, and an attention mechanism with one 4 × 4 orthogonal layer and trainable attention coefficients. 4-qubit pyramids were used as the orthogonal layers. The experiment used 25 different quantum circuits of 12 RBS gates per circuit per image and 15 different quantum circuits of 9 RBS gates per circuit per image.
The third set of experiments ran the Orthogonal Transformer with the quantum attention mechanism. We used 4 patches per image, 4 features per patch, and a quantum attention mechanism that paid attention only to the neighbouring patch, thereby using a 5-qubit quantum circuit with the X circuit as the orthogonal layer. The experiment used 12 different quantum circuits of 14 RBS gates and 2 CNOTs per circuit per image.
The last two quantum proposals were compared with a classical transformer network with a similar architecture and demonstrated a similar level of accuracy.
Finally, the fourth experiment was performed on the ibm_hanoi machine with 6 qubits, with the Compound Transformer, using 4 patches per image, 4 features per patch, and one orthogonal layer using the X layout. The hardware results were quite noisy with the X layer, so the same experiments were performed with a further reduced orthogonal layer named the "\ Circuit": half of an X circuit (Fig. 9) where only one diagonal of RBS gates is kept, which reduced the noise in the outcomes. The experiment used 2 different quantum circuits of 18 RBS gates and 3 CNOTs per circuit per image.
Note that with the restriction to states of fixed hamming weight, strong error mitigation techniques become available. Indeed, since we expect to obtain only quantum superpositions of unary states (or of states of hamming weight 2 in the case of the Compound Transformer) at every layer, every measurement can be post-processed to discard the outcomes with a different hamming weight, i.e. states with more than one (or two) qubits in state |1⟩. This error mitigation procedure can be applied efficiently to the results of a hardware demonstration and has been used in the results presented in this paper.
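A minimal sketch of this postselection, operating on the bitstring histograms that hardware backends typically return (the function name and counts format are illustrative):

```python
def postselect(counts, weight):
    """Keep only bitstrings whose hamming weight matches the expected
    encoding (1 for unary states, 2 for the compound case) and
    renormalise the surviving counts into probabilities."""
    kept = {b: c for b, c in counts.items() if b.count("1") == weight}
    total = sum(kept.values())
    return {b: c / total for b, c in kept.items()}
```

For example, given counts {"0001": 40, "0010": 30, "0101": 20, "0000": 10} with expected hamming weight 1, only "0001" and "0010" survive and are renormalised over the 70 retained shots.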
The conclusion from the hardware experiments is that all quantum proposals achieve state-of-the-art test accuracy, comparable to classical networks. Looking at the simulation experiments (details in Table 3), the Compound Transformer occasionally achieves superior performance compared to the classical transformer. Note that achieving such a compound implementation in a classical setting incurs a polynomial overhead.

D Extended Performance Analysis
We add our results to the existing results on the MedMNIST datasets [36] in Table 6 below.

Figure 5 :
Figure 5: Data loader circuit for a matrix X ∈ R^{n×d}. The top register uses n qubits and the vector data loader to load the norms of each row, (∥x_1∥, ..., ∥x_n∥), obtaining the state (1/∥X∥) Σ_{i=1}^{n} ∥x_i∥ |e_i⟩.

Figure 10 :
Figure 10: Quantum circuit to perform the matrix multiplication Vx i (fully connected layer) using a data loader for x i and a quantum orthogonal layer for V.

Figure 11 :
Figure 11: Quantum circuit to compute |x T i Wx j | 2 , a single attention coefficient, using data loaders for x i and x j and a quantum orthogonal layer for W.

Figure 12 :
Figure 12: Quantum circuit to directly apply the attention mechanism, given each coefficient in A. The first part of the circuit corresponds to the matrix data loader from Fig. 5, where Load(∥X∥) is replaced by Load(A_i). A quantum orthogonal layer from Section 2.2 is used for V.

Figure 13 :
Figure 13: Quantum circuit to execute one attention layer of the Compound Transformer. We use a matrix data loader for X (equivalent to Fig. 5) and a quantum orthogonal layer for V_c applied on both registers.

Figure 14 :
Figure 14: Three possible data loaders for d-dimensional vectors (d = 8). From left to right, the parallel, diagonal, and semi-diagonal circuits have a circuit depth of log(d), d, and d/2, respectively. The X gate represents the Pauli X gate, and the vertical lines represent RBS gates with tunable parameters.

Table 1 :
Comparison of different quantum orthogonal layer circuits with N qubits.

Table 3 :
Performance analysis using AUC and ACC on each MedMNIST test dataset for our quantum architectures (Orthogonal Patch-wise, Orthogonal Transformer, and Compound Transformer) compared to the classical benchmarks (Vision Transformer).

Table 4 :
Hardware results for RetinaMNIST using various models. Classical (JAX): classical code run with JAX, equivalent to the quantum operations. IBM Simulator: code compiled to run on the actual IBM hardware and executed using their Aer simulator. Note that the "\ Circuit" contains a single diagonal of trainable RBS gates. Details of the experiment are given in Section C.3.2.

Table 5 :
Resource analysis of a single attention layer used for the MedMNIST simulations (Section 4.1). From Table 2, it can be derived that the classical transformer requires 512 trainable parameters. Note that the Orthogonal Transformer uses two different types of circuits per layer.