Quantum Machine Learning with SQUID

In this work we present the Scaled QUantum IDentifier (SQUID), an open-source framework for exploring hybrid Quantum-Classical algorithms for classification problems. The classical infrastructure is based on PyTorch and we provide a standardized design to implement a variety of quantum models with the capability of back-propagation for efficient training. We present the structure of our framework and provide examples of using SQUID in a standard binary classification problem from the popular MNIST dataset. In particular, we highlight the implications that the choice of output for variational quantum models has on the scalability of gradient-based optimization.


Introduction
Quantum Machine Learning (QML) is a rapidly growing, emerging field with a diverse set of ideas and applications. While approaches differ significantly in where Machine Learning and where Quantum Computing are applied, quantum-enhanced machine learning has become one of the dominant subfields [1][2][3][4]. The main benefits of such algorithms are potential quantum speed-ups [5][6][7][8][9] and the potential to recognize statistical patterns that are hard to learn with purely classical schemes [10]. Recent work on QML has started to tackle the problems of trainability [11][12][13][14] and the generalization error [15,16] of quantum models.
However, machine learning algorithms on near-term quantum devices face the issue of constrained resources. While there exist encodings that use qubits efficiently, they still do not allow loading datasets such as MNIST into quantum memory, and they additionally introduce overhead when encoding and decoding information between classical and quantum devices. To counter this issue, researchers have used two approaches.
One was to use synthetic or very small datasets that could be learned efficiently (as in e.g. [17]). This, however, makes any benchmarks artificial and comparison to their classical counterparts hard. As potential performance benefits are the main driver of quantum-enhanced machine learning, we believe that ease of comparison to classical machine learning should be one of the priorities in the field. The second approach is to classically pre-process data so that it fits in the limited space defining the quantum model (as in e.g. [18]). While this approach allows for a direct comparison to classical performance on the same data, it also requires factoring in the impact that pre-processing has on the performance of both algorithms. It requires scientists to carefully prepare experiments so as not to give an unfair advantage to either the quantum or the classical algorithm. Lastly, since no two studies will use the same pre-processing, there is additional overhead when comparing two different quantum-enhanced approaches or when performing a meta-analysis of the field.
To combat these issues, we propose a standardized approach to designing hybrid (quantum and classical) models. Similarly to how TensorFlow [19] and PyTorch [20] changed the classical machine learning field and increased the reproducibility of efforts, we propose the Scaled QUantum IDentifier (SQUID) [21], an extensible framework that can incorporate quantum models. As it is built on top of PyTorch, it has most of the benefits of a mature framework when it comes to purely classical architectures. For quantum models, we provide a standardized model design in which the user has to implement the forward- and back-propagation functions.
With such a design, the pre-processing algorithms can be standardized across applications and approaches, making them more directly comparable. It also reduces the overhead for new researchers, as it significantly reduces the amount of coding required for an experiment. Such a mix of both worlds also resembles quantum-inspired algorithms [1], which benefit from the same points.
The article is organized in the following manner. In Section 2 we outline the framework design and describe the relevant internal details. In Section 3 we show an example application of the model using the MNIST dataset, and study the impact of including information from a single vs. all available output qubits. We describe the use of the SQUID Helpers package and possible future extensions in Section 4. Finally, in Section 5 we provide a summary and perspective.

The SQUID framework
Our main goal when designing SQUID is to propose a framework within which both classical and quantum machine learning can work in concert to solve a classification problem. Properly utilizing classical computing, when possible, is of great importance because quantum and classical models for data will often have different advantages and disadvantages. From an architectural perspective, the key innovation of SQUID is that classical neural networks can be trained globally, in conjunction with a quantum neural network, to build optimal encoders and decoders for the classical inputs and quantum outputs of the hybrid neural network. This ability allows us, in effect, to learn a feature map that not only lets us represent large quantum datasets on near-term devices but also lets us incorporate classical information that may be known a priori into the quantum model. Before proceeding with the detailed description of the model, it is important for work like this, which hybridizes quantum and classical models, to discuss the correspondence between the quantum and classical machine learning models that we implicitly assume. In general, we assume that our training dataset contains both classical vectors and quantum states, taking the form

T = { (v_class[j], |v_quant[j]⟩) : j = 1, ..., N_train }.

Here we assume a classical bit-encoding, meaning that each v_class[j] ∈ R^D. This is the typical setting in machine learning; however, it is also possible to envision that the actual training vectors are distributions over D symbols and the classical values v_class are given by the probabilities of drawing each symbol from the distribution. We further implicitly assume that the quantum data |v_quant[j]⟩ is provided using an amplitude encoding. By this we mean that the values of the training vectors are stored in the amplitudes of the state |v_quant[j]⟩.
We make this choice because it is the most general setting we can assume, as it also subsumes the case where the quantum training vectors correspond to distinct quantum bit strings (otherwise known as a bit-encoding).

Figure 1: In panel A the nodes can be thought of as neurons in classical machine learning, or as quantum state vectors; the edges represent transformations applied to these states. These typically involve trainable weights, but they might, as is often the case for the Quantum Model, perform a set of transformations defined by the incoming data. The "inputs to the Quantum Model" (v_class) in particular can define both the input state to a quantum system (W_E) and the rotations that should be applied to that input state (W_V(θ)). The Quantum Model should also perform a measurement of the state (M_D) and return classical amplitude estimates (v_out). Panel B also contains a planned future extension in which quantum features are allowed to be passed in as well.
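The amplitude encoding assumed above can be sketched in a few lines of numpy; the helper below is illustrative and not part of SQUID. A D-dimensional vector is zero-padded to dimension 2^N and normalized so that its entries form valid state amplitudes.

```python
import numpy as np

def amplitude_encode(v, n_qubits):
    """Encode a classical vector into the amplitudes of an n-qubit state.

    The vector is zero-padded to dimension 2**n_qubits and normalized so
    that the squared amplitudes sum to one (a valid quantum state).
    """
    dim = 2 ** n_qubits
    if len(v) > dim:
        raise ValueError("vector does not fit in the register")
    amps = np.zeros(dim)
    amps[: len(v)] = v
    norm = np.linalg.norm(amps)
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return amps / norm

# A 3-dimensional feature vector needs N >= log2(3), i.e. 2 qubits.
state = amplitude_encode([0.2, 0.4, 0.9], n_qubits=2)
```

Note that the normalization discards the overall scale of the input vector, which is one source of the encoding/decoding overhead mentioned above.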
The SQUID framework was designed with extensibility and simplicity as its core principles, and it generally follows the design principles set by PyTorch [20]. Currently there are many competing quantum SDKs [22][23][24], most of which include Python interfaces. Hence a successful QML package should offer a simple, extensible solution that can be adjusted to any specific SDK. SQUID enables this by providing general classes, similar to nn.Module in PyTorch, one each for Quantum and Classical Models. These satisfy the minimal requirements of the functions used by the backend MainModel to properly propagate the gradient through combinations of the models.

Framework Design
The main component of SQUID is the MainModel, which itself accepts three smaller models. The first and last models are currently required to be classical, while the middle one can be either quantum or classical. In case all three models are classical, the MainModel is equivalent to PyTorch's nn.Sequential with three sub-components.
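For the purely classical case, this equivalence can be sketched directly in PyTorch; the three single-layer models and their sizes below are illustrative stand-ins, not SQUID's actual classes.

```python
import torch
import torch.nn as nn

# When all three sub-models are classical, the MainModel composition is
# equivalent to a plain sequential stack of the three parts.
encoder = nn.Linear(784, 40)   # first (classical) model
middle = nn.Linear(40, 8)      # middle model: classical here, quantum in general
decoder = nn.Linear(8, 2)      # last (classical) model

main_model = nn.Sequential(encoder, middle, decoder)
out = main_model(torch.randn(16, 784))   # one batch of 16 flattened inputs
```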
The complete framework is shown graphically in Fig. 1, and the detailed relations between the models within the ensemble are described in the following subsections.

Propagating through the Main Model
Calling the model, i.e. calling the forward function, works exactly like PyTorch's forward pass. The only difference arises when the middle model is quantum, and a conversion between Tensors and numpy arrays [25] is required. Numpy arrays were chosen as the input to the Quantum Model because many QML packages accept them as input and in fact prefer them even over standard Python lists.
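A minimal sketch of this Tensor/numpy boundary, with a stand-in function in place of a real quantum circuit evaluation:

```python
import numpy as np
import torch

# Hypothetical single-layer encoder (illustrative, not SQUID's API).
encoder = torch.nn.Linear(4, 3)

def quantum_model_forward(x_np: np.ndarray) -> np.ndarray:
    # Stand-in for a quantum circuit evaluation; most quantum SDKs
    # accept and return numpy arrays.
    return np.tanh(x_np)

x = torch.randn(2, 4)
o12 = encoder(x)                       # torch.Tensor, tracked by autograd
q_in = o12.detach().numpy()            # leave the autograd graph
q_out = quantum_model_forward(q_in)    # plain numpy computation
o23 = torch.from_numpy(q_out).float()  # back to a tensor for the decoder
```

Because of the `detach()` call, autograd cannot see through the quantum step, which is exactly why the explicit backward mechanism described below is needed.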
The backward function is the only major modification for the user in comparison to PyTorch, and it must be called explicitly by the user. For classical models, standard PyTorch back-propagation (autograd) is used and exact gradients are calculated. When there is a quantum model in the middle, automatic gradient propagation stops at the end of the second (quantum) model, because of the conversion to/from numpy in the forward pass. SQUID then uses the backward function provided by the implementation of the Quantum Model. This both updates any parameters stored within the model and provides the gradient with respect to the output of the encoder.
The conversion from numpy array to PyTorch tensor in the backward call requires us to create a fake loss, which we then use to propagate the gradients backward using back-propagation. Given the encoder forward output o_12 and the gradient of the global loss with respect to that output, g_12 (provided by the Quantum Model as described above), we define a fake loss

L' = Σ_i (o_12)_i (g_12)_i ,

whose gradient with respect to o_12 is exactly g_12. After the above calculation, a PyTorch backward call is made on L', which propagates the gradient using autograd. Hence, gradients with respect to all of the MainModel's parameters are calculated. For clarity, the process is shown in Algorithm 1.
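The surrogate ("fake") loss construction can be sketched as follows for a linear encoder: since the loss is the elementwise product of o_12 and g_12 summed, autograd's gradient of it with respect to o_12 is exactly g_12, and back-propagation then carries that gradient into the encoder weights. All names here are illustrative.

```python
import numpy as np
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(4, 3, bias=False)
x = torch.randn(5, 4)
o12 = encoder(x)                       # encoder output, in the autograd graph

# Gradient of the global loss w.r.t. o12, as would be returned by the
# quantum model's backward implementation (random stand-in here).
g12 = np.random.default_rng(0).normal(size=(5, 3))
g12_t = torch.from_numpy(g12).float()

# Fake loss: its gradient w.r.t. o12 is exactly g12, so calling
# backward() deposits the externally supplied gradient into the encoder.
fake_loss = (o12 * g12_t).sum()
fake_loss.backward()

# Chain-rule check: for a linear layer y = x @ W^T, dL/dW = g12^T @ x.
expected = g12_t.t() @ x
```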
It is worth mentioning that, thanks to PyTorch's built-in autograd procedure, any PyTorch loss can be used; for example, in Section 3 the cross-entropy loss is used. Similarly, any optimizer can be used. The main caveat is that if any parameters are defined within the Quantum Model, the user has full control over updating them, and the overhead of implementing the optimizer falls onto the user.

Complexity of Gradient Evaluation
A crucial question that needs to be evaluated to assess the practicality of any QML algorithm is the quantum complexity of the gradient calculation. This is especially relevant since the gradients propagated through the quantum model require statistical sampling, or a quantum technique such as amplitude estimation, to evaluate [26][27][28]. Here we provide such a complexity analysis with the aim of bounding the number of quantum operations needed to ensure that the gradients yielded by Algorithm 2 are accurate within a bounded error ε. By ensuring that the error is at most ε we mean that the gradient computation detailed in Algorithm 2 outputs an estimate ĝ of the gradient g such that ‖ĝ − g‖₂ ≤ ε.

Algorithm 2 (sketch): Measure the Q_1 output probabilities p_k from the circuit; use the parameter-shift rule and a total of N_C = min(2NM, 2Q_1(M + 1)) circuits to estimate the Q_1 × M-dimensional gradients g_kw. Output: result = p and grad = g_kw.
Let us consider a Monte-Carlo estimate of the gradient. The algorithm for generating such an estimate involves measuring the expectation value of the gradient, which can be evaluated using Hadamard tests to estimate each component (see Appendix A). Using the empirical frequency of measurements as an unbiased estimate of the probability, we have that if ĝ is the estimate returned by our protocol, then

‖g − E(ĝ)‖₂ = 0. (4)

As there are M different parameters and Q_1 outputs yielded by the quantum model, the estimate ĝ has Q_1 M components, and we further have that

V(ĝ) = Σ_{k=1}^{Q_1} Σ_{w=1}^{M} V(ĝ_kw), (5)

since the variance of the sum is the sum of the variances and we are using the sample mean for our estimates of the probability. This means that if N samples are used per point then

V(ĝ_kw) ≤ 1/N, (6)

because each Hadamard-test outcome is bounded. Therefore the mean square error of ĝ is at most

E(‖ĝ − g‖₂²) ≤ Q_1 M / N; (7)

note that if the variances are small then the number of samples required will be further reduced. Then, from Markov's inequality, the number of samples needed to estimate the gradient within error ε with probability greater than 2/3 is simply given by 3 times the estimate implied by Eq. (7). If a learning rate of λ is used for the gradient ascent, then the error in the quantum parameters (as quantified by the Euclidean distance) is, with probability greater than 2/3, at most λε. We denote by Δ the error introduced by using the noisy estimator ĝ for the parameter update, with ‖·‖₂ the induced Euclidean norm. The generators H_j used in SQUID (see Sec. 2.5) have unit norm, which allows us to bound the error as

‖Δ‖₂ ≤ λ ‖ĝ − g‖₂ ≤ λε.

Therefore, since the resulting error in the output probabilities p scales as O(λε), the value of N needed to ensure a maximum error δ in p_k after a single step of gradient ascent, with probability at least 2/3, obeys

N ∈ O(Q_1 M λ² / δ²).

Finally, an application of the Chernoff bound shows that if we wish the error to be at most δ with probability at least 1 − η, then we can repeat the experiment a logarithmic number of times in 1/η and use majority voting to estimate the updated parameters.
This results in

N_total ∈ Õ(Q_1 M λ² / δ²),

where Õ(·) denotes an asymptotic upper bound with polylogarithmic multiplicative factors suppressed. On a future fault-tolerant quantum device it would also be possible to obtain a quadratic speedup in both the precision and M at the cost of a longer circuit depth (see e.g. [26,29]).
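The parameter-shift rule invoked in Algorithm 2 can be illustrated on a single-qubit example where the expectation value and its exact gradient are known in closed form. This is a noiseless sketch; on hardware each evaluation would itself be a statistical estimate from N shots, which is where the sampling bounds above enter.

```python
import numpy as np

def expectation(theta):
    """<Z> on the state Ry(theta)|0>; analytically equal to cos(theta)."""
    psi = np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])
    Z = np.diag([1.0, -1.0])
    return psi @ Z @ psi

def parameter_shift_grad(theta):
    """Exact gradient from two shifted evaluations of the same circuit."""
    return 0.5 * (expectation(theta + np.pi / 2) - expectation(theta - np.pi / 2))

theta = 0.7
grad = parameter_shift_grad(theta)
# For this circuit the analytic derivative is d/dtheta cos(theta) = -sin(theta).
```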

Available classical models
There are two built-in classical models: Linear and Convolutional. Both accept arguments that specify the number of neurons per layer and the activation functions between them. However, since ClassicalModel is a sub-class of nn.Module from PyTorch, it is straightforward for a user to implement their own model. This is recommended, unless the configuration files from SQUID Helpers are utilized (see Sec. 4.1).
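Because the classical models are ordinary nn.Module subclasses, a custom model is a few lines of standard PyTorch; the class below is an illustrative example, with names and sizes of our choosing, not part of SQUID.

```python
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    """A user-defined classical model: a two-layer feed-forward network."""

    def __init__(self, in_features: int, hidden: int, out_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, x):
        return self.net(x)

# Example: compress flattened 28x28 MNIST images to 8 features.
model = SmallEncoder(784, 40, 8)
out = model(torch.randn(2, 784))
```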

Available quantum models
In this section we provide details on the implementation of quantum models within SQUID. The approach we follow in this preliminary study is to construct the most general models on a given set of qubits by expressing the quantum circuits as layers of structured operations acting on nearest neighbors only. This allows for both generality and a direct connection with real-world implementations on near-term devices with limited connectivity. Despite this choice, the framework is general and can easily be extended to accommodate quantum models with a different structure.
The common construction for a variational quantum classifier (see e.g. [10,17,30]) starts by encoding an input D-dimensional feature vector v ∈ [0, 1]^D into the quantum state of a register containing N ≥ log₂(D) qubits by introducing an encoding unitary operation W_E as

|Ψ⟩ = W_E(v) |0⟩,

with |0⟩ a reference state in the computational basis. The encoded state |Ψ⟩ is then modified by acting with a second unitary W_V defined in terms of a set of N_v variational parameters θ ∈ [0, π)^{N_v}. The final state of the quantum register right before measurement is then

|Φ⟩ = W_V(θ) W_E(v) |0⟩.

The outputs of the quantum models we consider here are the probabilities of measuring each one of the computational basis states in the state |Φ⟩, which can be estimated by collecting statistics over a large number of circuit executions. Given that the number of possible outcomes scales exponentially in the register size, a small subset of probabilities is typically selected in order for the overall scheme to be scalable.
The decomposition of the total unitary operation mapping |0⟩ to |Φ⟩ into an encoding unitary W_E and a variational unitary W_V is, however, artificial and does not necessarily lead to the most efficient scheme. This is especially true in the SQUID framework, where a classical network is devoted to optimally determining an encoding of the classical data into a quantum state. The approach we take in SQUID is instead to consider the M-dimensional output of the classical encoder as the parameters describing a global unitary W on the quantum register, without artificially distinguishing between "encoding" parameters and "variational" parameters. This hybridization of the standard approach described above is still completely general, and the global network can adjust to effectively reproduce a factorized form W = W_V W_E if there is a measurable advantage for the data under analysis.
We note that a generic unitary operation on N qubits can depend on at most (4^N − 1) parameters, which implies that we need to choose M < (4^N − 1) for the output of the classical encoder. In practice this is not a limitation, since unitary operations that can be decomposed efficiently into a polynomial number of one- and two-qubit operations depend on at most a polynomially large number of parameters.
In this first exploratory study we use a simple, but general, parametrization of W in terms of a one-qubit unitary U¹_k(α, β, γ) and a two-qubit unitary U²_jk(θ, φ, η), both parametrized by 3 real parameters taking values in [0, π). The unitary U¹ can be written as

U¹_k(α, β, γ) = e^{iα Y_k} e^{iβ Z_k} e^{iγ Y_k},

and we recognize the parameters (α, β, γ) to be the Euler angles in the YZY decomposition. In the expression above, Z_k (Y_k) denotes the Pauli Z matrix (Y matrix) for qubit k, and we will denote X_k similarly in the following. The two-qubit unitary U²_jk acting on the qubit pair {j, k} is instead defined as

U²_jk(θ, φ, η) = exp[ i ( θ X_j X_k + φ Y_j Y_k + η Z_j Z_k ) ].

The usefulness of these choices comes from the possibility of representing a generic unitary by applying appropriate layers of U¹_k and U²_jk operations. For instance, a general SU(4) transformation on 2 qubits can be exactly represented with a circuit (see [31,32]) requiring 1 application of U²_01 and 4 applications of U¹_k (two on qubit 0 and two on qubit 1), for a total of 15 parameters. This construction can be readily extended to larger systems by interleaving layers of U¹_k with layers of U²_jk applied alternately on even or odd partitions; with 4 qubits, for instance, we consider circuits of this interleaved form. Note that in this construction we do not include the first and last single-qubit operations in the third U¹ layer. This removes redundancy in the parameters, since the product of two U¹ operations is replaced with a single U¹; this simplification results in enhanced stability in training.
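The two building blocks can be checked numerically. The sketch below constructs U¹ and U² as explicit matrices (the signs of the exponents are our assumption) and assembles the 15-parameter SU(4) circuit described above, exploiting the fact that XX, YY and ZZ commute, so the U² exponential factorizes exactly into three closed-form terms.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def exp_i(angle, op):
    """exp(i*angle*op) for any operator with op @ op = identity."""
    dim = op.shape[0]
    return np.cos(angle) * np.eye(dim) + 1j * np.sin(angle) * op

def U1(alpha, beta, gamma):
    """Single-qubit unitary in the YZY Euler-angle decomposition."""
    return exp_i(alpha, Y) @ exp_i(beta, Z) @ exp_i(gamma, Y)

def U2(theta, phi, eta):
    """Two-qubit unitary exp(i(theta XX + phi YY + eta ZZ)); the three
    terms mutually commute, so the exponential factorizes exactly."""
    XX, YY, ZZ = np.kron(X, X), np.kron(Y, Y), np.kron(Z, Z)
    return exp_i(theta, XX) @ exp_i(phi, YY) @ exp_i(eta, ZZ)

# Generic SU(4) element: one U2 sandwiched between four U1's,
# for 4*3 + 3 = 15 real parameters in total.
rng = np.random.default_rng(0)
p = rng.uniform(0, np.pi, size=15)
W = (np.kron(U1(*p[0:3]), U1(*p[3:6]))
     @ U2(*p[6:9])
     @ np.kron(U1(*p[9:12]), U1(*p[12:15])))
```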

Example applications
As an initial application of our framework, we present here results for binary classification on the MNIST database [33] using digits 3 and 7. This is a standard benchmark for classification algorithms and analysis with a quantum model is made possible by the ability of SQUID to compress the input features into data with the appropriate dimensions.
The input feature vectors for this dataset are real vectors of dimension 784 representing a grayscale 28 × 28 pixel picture. In this section we compare results obtained with the architecture displayed in Fig. 2, composed of: a single-layer encoder with M_0 units, a black-box classifier to be specified below, and a single-layer decoder with M_1 units. The classical black-box classifier used in this section consists of a simple single-layer network with 2 units (panel (B) of Fig. 2), and we take M_1 = 2 for its output. In the following subsections we will also consider different implementations of quantum classifiers with the general structure displayed in panel (C) of the same figure.
All of the calculations (classical and quantum) presented in this section were obtained using the Adam optimizer as implemented in PyTorch, with the hyperparameters reported in Tab. 1. In all cases, 48 independent optimization runs were performed in order to estimate the variance in the attained accuracy; in the following we refer to this ensemble as "bootstrap runs". We present the results obtained with the classical network in Fig. 3 (the full data is available in Tab. 2 of Appendix B). We can see from the evolution of the median accuracy (green circles) that the classical network is able to achieve classification accuracies above 99%, but increasing the number of hidden units in the input model does not seem to provide a statistically significant improvement in the final accuracy. The displayed error bars are 68% confidence intervals extracted from our finite population sample.
In the main panel we also show the location of the 90% accuracy percentile, i.e. the boundary value such that 10% of the bootstrap runs achieve a higher accuracy, for both the validation set (red squares) and the training set (blue diamonds). These results are consistent with the expectation that, as the number of parameters K_tot in the model increases, the training set can be described almost exactly by the network, while at the same time the distribution of accuracies for the validation set does not evolve significantly with K_tot. In order to understand this point better, we show in the left panels the estimated histograms of the accuracy reached in our set of 48 bootstrap runs for the smallest classical model (panel (b)) and the largest model (panel (c)). As expected from the results in the main panel, most of the density is in the same location, but for the larger models the tails are more important. Note that the dispersion of the accuracy around the median reported in the main panel of Fig. 3 is relatively small, of the order of ≈ 0.002. This is caused in large part by the simplicity of the classification problem: as we can see in Fig. 4 (obtained for a medium-sized model with M_0 = 40), the accuracy quickly converges to a narrow interval around the mean for both the training set and the validation set. After more than 80 epochs, the accuracy on the training set reaches 100%.
When the inner model is replaced by a quantum subroutine, as depicted in panel (C) of Fig. 2, the output dimension for a quantum circuit over N qubits is bounded by M_1 ≤ 2^N. In the following sections we consider two limiting situations: the maximum possible dimension M_1 = 2^N (indicated as "full" below) and the minimum one, M_1 = 2 (indicated as "min" below), corresponding to the probability of measuring a single basis state (here we choose |0⟩^⊗N).
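The "full" and "min" outputs can be read directly off a simulated statevector. The sketch below assumes the "min" output is the pair (p_0, 1 − p_0), which is one natural reading of an M_1 = 2 restricted output; the helper names are ours.

```python
import numpy as np

def full_output(state):
    """Probabilities of all 2**N computational basis states ("full")."""
    return np.abs(state) ** 2

def min_output(state):
    """Probability of |0...0> and its complement ("min", M_1 = 2)."""
    p0 = np.abs(state[0]) ** 2
    return np.array([p0, 1.0 - p0])

# Random normalized 3-qubit statevector as a stand-in for |Phi>.
rng = np.random.default_rng(1)
amps = rng.normal(size=8) + 1j * rng.normal(size=8)
amps /= np.linalg.norm(amps)

p_full = full_output(amps)   # 2**3 = 8 values passed to the decoder
p_min = min_output(amps)     # only 2 values passed to the decoder
```

The "full" output grows exponentially with N (and requires exponentially many shots to estimate to fixed precision), which is exactly the sample-complexity trade-off discussed in the text.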

Separable Quantum Models
The first class of quantum models we consider here are separable models with a single layer of U¹ unitaries; they are therefore fundamentally classical, in that entanglement plays no role in shaping the output probabilities. In this case, at least for the models with M_1 = 2^N, we can see a mild increase of the final accuracy as a function of the number of parameters/qubits in the model; the largest model is, however, outperformed by classical networks of smaller size (see the blue circles in Fig. 5(a)). The models with a latent space corresponding to the restricted output of the quantum layer (shown as red circles in Fig. 5(a)) instead show a deterioration as we increase the number of qubits/parameters. This effect is especially clear in the histograms of the accuracy achieved in the 48 bootstrap runs, displayed in the right panels of Fig. 5 for the largest model with N = 6 qubits: employing the full output vector from the quantum layer produces accuracies better than 99% for more than half of the runs, while restricting the output to a single probability prevents most runs from reaching this threshold. Strikingly, this is true even for the training set (not shown), where only a single bootstrap run achieved an accuracy above 99%. This is a first clear sign of the importance of supplementing the quantum classifier with a rich decoder, at the possible expense of a larger sample complexity.
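That separable models are "fundamentally classical" can be seen numerically: for a product state the probability of any bit string factorizes into single-qubit probabilities. The illustrative sketch below verifies this for three qubits prepared by independent Ry-style rotations.

```python
import numpy as np

rng = np.random.default_rng(2)
thetas = rng.uniform(0, np.pi, size=3)

# Each qubit independently rotated from |0>: amplitudes (cos t, sin t).
single = [np.array([np.cos(t), np.sin(t)]) for t in thetas]

# The full register state is the tensor product of the one-qubit states.
state = np.kron(np.kron(single[0], single[1]), single[2])

# Probability of bit string 101 computed two ways: from the joint state,
# and as a product of single-qubit probabilities.
p_joint = np.abs(state[0b101]) ** 2
p_product = (single[0][1] ** 2) * (single[1][0] ** 2) * (single[2][1] ** 2)
```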

Quantum models with entanglement
We now turn to more general quantum models that are capable of creating entanglement in the quantum register through the use of the two-qubit unitary U² defined in Sec. 2.5. The resulting median accuracies, shown in panel (a) of Fig. 6, show a similar trend to the simpler separable models above: the accuracy of the quantum model never exceeds the classical accuracy, and there does not seem to be any measurable advantage in increasing the number of parameters. Interestingly, for the models with restricted output size (red circles denoted QM-min in panel (a) of Fig. 6), we see that the optimization procedure struggles to find a good set of parameters for the larger models, and the accuracy decreases almost monotonically with size. It is possible that better results could be obtained by using the accuracy directly as the cost function instead of the cross-entropy.
In order to clarify that the effect we are seeing does not come from over-fitting of the training set, but really from difficulties in efficiently exploring the energy surface, we show in the left panels the evolution of the 90% accuracy percentile as a function of the number of parameters, for both the validation and training sets (panels (b) and (c) respectively), for the 3 networks considered here: the classical feed-forward network considered before (green squares) and the quantum models with entanglement, either with the full output (blue squares) or with the restricted output (red squares). As can be clearly seen in panel (c), the optimizer is not able to find a good parameter set for large models, and the accuracy in training decreases.
These results highlight the importance of supplementing the quantum classifier with a non-trivial output decoder in order to achieve good efficiency, a possibility that is available only if we choose to measure more than a single qubit from the quantum device.

Figure 6: Results for the accuracy achieved on MNIST with the classical model (green points) and the quantum models with entanglement described in the text (red and blue points). Panel (a) shows the accuracy as a function of the number of parameters for the classical model (green solid circles), the full quantum models (blue solid circles) and the quantum models with a single output variable M1 = 1 (red solid circles), indicated by "min". The left panels show the location of the 90% accuracy percentile for the classical (green squares), full quantum (blue squares) and minimal quantum model (red squares) in either the validation set (panel (b)) or the training set (panel (c)).

SQUID Extensions and Future Work
As mentioned in Section 2, SQUID is designed with extensibility in mind. This means that it should be easy to write additional packages on top of it, and that adding new features should involve changing at most a few modules. In the following two sections we describe the SQUID Helpers package, designed to allow the use of SQUID through user-provided configuration files, and comment on possible extensions of the framework.

SQUID Helpers Package
To show how such an extension could function, but also to streamline the workload for common use cases, we provide the SQUID Helpers extension. It allows the user to run SQUID code from YAML configuration files. As a result, bootstrapping multiple runs of multiple test cases and aggregating the results becomes much easier. The conversion from configuration files to SQUID objects is done when a file is first read, and if a bootstrap option is provided, the random seed is changed at every iteration. The change of the seed is deterministic, so all of the results are exactly reproducible. At the initialization step, a single batch of data is passed forward through the models to ensure that the dimensions of their inputs and outputs match. This is done following the fail-first, fail-fast principle: a single forward pass is very cheap compared to the full training and catches dimension mismatches immediately. The code then runs a typical training for-loop, with an additional call to the backward function of the MainModel, as explained in Section 2.2. In the end, all of the results, as well as the configuration, are saved to a single location. If bootstrapping was used then, along with the results of each run, a folder with aggregated results is created for simpler analysis.
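The deterministic per-iteration seed change can be sketched as follows; this is an illustration of the idea, not the actual SQUID Helpers implementation.

```python
import torch

def run_bootstrap(n_runs: int, base_seed: int = 1234):
    """Deterministic per-run seeding: each bootstrap iteration derives its
    seed from the base seed, so the whole ensemble is exactly reproducible.
    (Illustrative sketch; the training loop is replaced by a stand-in.)"""
    results = []
    for i in range(n_runs):
        torch.manual_seed(base_seed + i)      # deterministic change per run
        # ... build models, train, evaluate ...
        results.append(torch.rand(1).item())  # stand-in for a run's accuracy
    return results

# Re-running the whole ensemble reproduces identical numbers.
a = run_bootstrap(3)
b = run_bootstrap(3)
```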
The Helpers extension also provides very basic plotting utilities for the results. However, these are meant as an example of processing the output folders, since plots are highly dependent on the studies performed.

Future Work
There is a vast number of possible extensions to SQUID, some of which can be included directly in the main project, while others can be used as standalone packages. The main advantage of SQUID is that it allows for abstracting communication with a specific backend. To do so, however, custom QuantumModel subclasses interfacing with the backend API will be needed.
Additionally, as of right now SQUID allows only for classification and regression tasks. More advanced scenarios would require changes both in SQUID and, to a larger extent, in SQUID Helpers.
Another, much nearer-term addition to SQUID will be support for various quantum computing frameworks. For example, Qiskit provides a great package for the optimization of quantum circuits [23]. This would fit perfectly into the SQUID ecosystem with a translation layer. Such an addition would allow the user to implement hybrid models without the explicit definition of circuits for training, provided the circuit gradients are correctly passed by the backward function.
There are also other frameworks that either already offer similar behavior or plan optimization packages. The modularity of SQUID allows code to be backend-agnostic and to work uniformly across multiple types of quantum devices.

Conclusions
The great success of classical machine learning algorithms in tasks such as classification, together with the expectation that quantum computers will allow us to explore algorithms in a larger complexity class than their classical counterparts, makes the exploration of the connections between these ideas a fertile ground for the discovery of novel approaches to automated inference.
In this work we have presented SQUID, a computational framework that allows efficient exploration of the possible advantages of quantum computing for machine learning purposes. This is achieved by embedding the quantum algorithm in a more general multi-layer architecture that interfaces classical and quantum networks while enjoying efficient optimization through the automatic differentiation engine provided by PyTorch. While there are similar packages (notably Xanadu's PennyLane [34] and IBM's Qiskit [23]), they do not offer as much flexibility as SQUID. For example, PennyLane offers a much more complete experience, with many examples and models, as well as the ability to run code on a quantum computer rather than only a simulator; however, for the same reason, creating a custom model is much easier in SQUID. A similar argument can be made about Qiskit. This generalized framework provides several advantages over either a purely classical or a purely quantum approach: it allows for a seamless dimensionality reduction of the inference problem, a step that would be necessary to explore high-dimensional datasets on small quantum devices, while at the same time allowing for automatic tuning of the measurement settings needed to extract information from the quantum state produced by the algorithm. This latter feature, implemented in SQUID by using a classical network as a decoder after the central model as shown in Fig. 1, is extremely important in order to reach high precision. The use of an explicit decoder at the output of the quantum model allows for a more careful optimization of the trade-off between measuring only a vanishingly small fraction of the possible output probabilities on one hand, with the drawback that entanglement can start to be detrimental due to information scrambling (see e.g. [13,35]), and a full measurement of the probability distribution over the computational basis states on the other, which would require an exponential number of repetitions.
We used a simple binary classification problem from the MNIST dataset to show the effect of this trade-off on a concrete high-dimensional problem.
Thanks to the generality of the architecture developed in this work, future explorations of algorithms with entanglement but with a classically efficient representation (such as tensor network states with polynomially large bond dimension, see e.g. [36,37]) could be carried out within SQUID with only minimal modifications to the code. We expect the added flexibility in interchanging classical and quantum components in a global classifier to prove valuable in identifying promising datasets and inference problems where the presence of entanglement and quantum correlations can provide important accuracy gains. Once these problems have been identified with a simplified approach such as the one currently implemented in SQUID, a subsequent study of the sample complexity along the lines of the derivation presented in Sec. 2.3 will be needed to assess the practical viability of the algorithm. Extensions to implement the effect of finite statistics, together with more advanced effects such as models of decoherence for a specific target device, can be added efficiently within SQUID, and we plan to explore their impact on classification problems in future work.

A Gradient evaluation for variational quantum models
In this appendix we provide a derivation of the scheme employed in the main text to evaluate gradients with respect to the parameters of the quantum models. Following the discussion in Sec. 2.5 of the main text, a generic variational quantum model is defined by a unitary transformation W(w) depending on M real parameters w taking values in [0, π). The general construction employed in this work consists of decomposing the full unitary W(w) into a combination of one- and two-qubit unitaries, denoted by U1_k(α, β, γ) and U2_jk(θ, φ, η) respectively, where the subscript indices indicate the qubit (or pair of qubits) the unitary acts upon. Given the simple structure of a generic circuit such as the one depicted in Eq. (2.5), one can obtain a closed-form expression for every component of the gradient by looking at the individual gradients of the basic unitaries U1 and U2 directly. Here we will use as a concrete example the generic SU(4) unitary transformation from Eq. (2.5), which we reproduce here for convenience.
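The building blocks above can be made concrete with a small numerical sketch. This assumes a ZYZ Euler convention for U1 (the paper's Eq. (14) may order the Pauli exponentials differently) and, for brevity, uses a separable product of two U1 blocks in place of the generic entangling SU(4) unitary:

```python
import numpy as np

# Pauli matrices
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def rot(pauli, angle):
    """exp(-i * angle * P / 2) for a Pauli P, in closed form."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * pauli

def u1(alpha, beta, gamma):
    """Single-qubit unitary in an (assumed) ZYZ Euler decomposition."""
    return rot(Z, alpha) @ rot(Y, beta) @ rot(Z, gamma)

def u2(u_a, u_b):
    """Two-qubit block acting as u_a (x) u_b; the generic SU(4) U2 of the
    text would additionally contain entangling gates such as CNOTs."""
    return np.kron(u_a, u_b)

W = u2(u1(0.3, 1.1, -0.4), u1(0.7, 0.2, 0.9))
print(np.allclose(W.conj().T @ W, np.eye(4)))  # True: W is unitary
```

Composing such blocks along a circuit reproduces the full W(w), and the gradient of W(w) reduces to gradients of the individual `rot` factors, which is the property exploited below.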
Note that in this case M = 15. In the following we will assume, without loss of generality, that this unitary operation is applied to the initial state |00⟩ of the two-qubit register and denote the resulting state by |ψ(w)⟩ = W(w)|00⟩. The classical output generated by a measurement on the qubit register can be completely characterized by the 4 probabilities to find the system in each one of the possible basis states, p_k(w) = ⟨ψ(w)|Π_k|ψ(w)⟩, where we have introduced explicitly the projectors Π_k = |k⟩⟨k| on the computational basis states. Computing the derivatives with respect to the 12 angles corresponding to the 4 one-qubit SU(2) operations is straightforward by recalling the definition Eq. (14) of U1 in terms of exponentials of Pauli operators. As an example, the derivative with respect to γ_0 of any of the probabilities in Eq. (17) can be written explicitly as in Eq. (19), with I (R) indicating the imaginary (real) part. Note that, for all of the 12 parameters characterizing the single-qubit transformations, the derivative of the full variational circuit unitary W can be expressed in terms of the same parametrized unitary with the appropriate angle shifted by π/2. For the case of γ_0 considered above, for instance, the derivative is again of the form W(w') with a new set of parameters w'. Using the optimal implementation of the more general SU(4) transformation derived in Ref. [32] (see Fig. 6 there), one can show that the same property holds for the two-qubit unitary U2_jk. This property is usually referred to as the parameter-shift rule [38,39].
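The parameter-shift property can be checked numerically on the simplest case: a single Y-rotation of the form exp(-i θ Y/2) applied to |0⟩, for which the probability of measuring |0⟩ is cos²(θ/2). The shift rule states that the exact derivative equals half the difference of the same quantity evaluated at θ ± π/2:

```python
import numpy as np

def prob0(theta):
    """Probability of measuring |0> after exp(-i*theta*Y/2)|0>: cos^2(theta/2)."""
    return np.cos(theta / 2) ** 2

theta = 0.8
# Parameter-shift rule for gates of the form exp(-i*theta*P/2):
# d<M>/dtheta = ( <M>(theta + pi/2) - <M>(theta - pi/2) ) / 2
shift_grad = (prob0(theta + np.pi / 2) - prob0(theta - np.pi / 2)) / 2
exact_grad = -np.sin(theta) / 2  # analytic derivative of cos^2(theta/2)
print(np.isclose(shift_grad, exact_grad))  # True
```

Unlike a finite-difference estimate, the shifted evaluations give the gradient exactly, which is why only shifted-circuit executions are needed in the counting below.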
In order to estimate the expectation values in the last line of Eq. (19) we can employ two strategies. If the required number of output probabilities K is the maximum possible one with n qubits (i.e. K = 2^n), it is convenient to first decompose the projectors onto the computational basis states into a linear combination of K = 2^n diagonal operators, obtained by considering all the possible tensor products of identities and Pauli Z operators, and then to evaluate each one of the resulting expectation values using a single Hadamard test. The total number of separate circuits required for this approach is then KM, with M the total number of parameters.
In the more realistic situation where K ≪ 2^n instead, the strategy just described would still require a number of measurements exponential in the size of the qubit register. A more efficient alternative is to evaluate explicitly the K pairs of expectation values R[⟨k|W(w)|0…0⟩] and I[⟨k|W(w)|0…0⟩], with |k⟩ the computational basis state associated to the projector Π_k. These expectation values can be estimated using a Hadamard test with one additional ancilla qubit and require the execution of 2K independent circuits (one each for the real and imaginary parts).
For each one of the M parameters, we then use an additional 2K Hadamard tests to estimate the corresponding expectation values R[⟨k|W(w)_m|0…0⟩] and I[⟨k|W(w)_m|0…0⟩] for the shifted unitaries, where we used the compact notation W(w)_m to indicate the derivative with respect to the m-th parameter. This requires 2KM additional independent circuit executions, for a total of 2K(M + 1) observables. Since p_k(w) = |⟨k|W(w)|0…0⟩|^2, the gradient can then be computed as ∂p_k/∂w_m = 2 ( R[⟨k|W(w)|0…0⟩] R[⟨k|W(w)_m|0…0⟩] + I[⟨k|W(w)|0…0⟩] I[⟨k|W(w)_m|0…0⟩] ). An alternative approach to reduce the number of independent circuits needed for gradient evaluation is to use expectation values of unitary operators instead of projectors. This extension can be easily implemented within the SQUID framework.
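The circuit counts for the two strategies are easy to tabulate. The following sketch uses illustrative function names and the M = 15 SU(4) example from this appendix:

```python
def full_basis_circuits(K: int, M: int) -> int:
    """First strategy (K = 2**n output probabilities): one Hadamard test per
    diagonal operator and per parameter, i.e. K*M circuits."""
    return K * M

def restricted_output_circuits(K: int, M: int) -> int:
    """Restricted-output strategy: 2K Hadamard tests for the unshifted
    amplitudes (real + imaginary parts) plus 2K per parameter for the
    shifted unitaries, i.e. 2K(M + 1) circuits."""
    return 2 * K * (M + 1)

# SU(4) example from the appendix: n = 2 qubits, M = 15 parameters.
print(full_basis_circuits(K=2**2, M=15))      # 60
print(restricted_output_circuits(K=4, M=15))  # 128

# For a larger register with few outputs the restricted strategy pays off:
# n = 10 qubits, K = 2 output probabilities.
print(full_basis_circuits(K=2**10, M=15))     # 15360
print(restricted_output_circuits(K=2, M=15))  # 64
```

As the comparison shows, the 2K(M + 1) cost is only advantageous when K ≪ 2^n, which is exactly the regime discussed above.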

B Additional information on the MNIST benchmark
We report in Tab. 2 the parameters and results for the classical models used in the MNIST classification discussed in Sec. 3, corresponding to the results presented in Fig. 3 for the training set. The estimated errors correspond to a 68% confidence interval. The parameters of the separable quantum models considered in the main text, together with the results obtained from training on the MNIST classification problem, are presented in Tab. 3. The models with a latent space corresponding to the restricted output for the quantum layer are indicated with a subscript 1 in the table.
The same convention is used in Tab. 4, where we present the parameters and results for the quantum models with entanglement described in Sec. 3.2. We also show in Fig. 7 the evolution of the accuracy for both the training set (left panels) and validation set (right panels). The top two panels are obtained with models with maximal output size (M1 = 16 in this case), while the bottom panels show the results using a restricted output model with M1 = 1.
Finally, we present in Fig. 8 an extension of the results presented in Fig. 6, where in the left panels we show the accuracy histograms for the largest model considered in this work for both the full output model (panel (c)) and the restricted output model (panel (d)).