Structural risk minimization for quantum linear classifiers

Quantum machine learning (QML) models based on parameterized quantum circuits are often highlighted as candidates for quantum computing's near-term "killer application". However, the understanding of the empirical and generalization performance of these models is still in its infancy. In this paper we study how to balance between training accuracy and generalization performance (also called structural risk minimization) for two prominent QML models introduced by Havlíček et al. (Nature, 2019), and Schuld and Killoran (PRL, 2019). Firstly, using relationships to well-understood classical models, we prove that two model parameters (i.e., the dimension of the sum of the images and the Frobenius norm of the observables used by the model) closely control the models' complexity and therefore their generalization performance. Secondly, using ideas inspired by process tomography, we prove that these model parameters also closely control the models' ability to capture correlations in sets of training examples. In summary, our results give rise to new options for structural risk minimization for QML models.


Introduction
After years of effort, the first proof-of-principle quantum computations that arguably surpass what is feasible with classical supercomputers have been realized [3]. As the leap from noisy intermediate-scale quantum (NISQ) devices [4] to full-blown quantum computers may require further decades, finding practically useful NISQ-suitable algorithms is becoming increasingly important. It has been argued that some of the most promising NISQ-suitable algorithms are those that rely on parameterized quantum circuits (also called variational quantum circuits) [5,6]. Such algorithms have been proposed for quantum chemistry [7,8], for optimization [9], and for machine learning [10]. One of the advantages of parameterized quantum circuits is that restrictions of NISQ devices can be hardwired into the circuit. Moreover, families of parameterized quantum circuits can, under widely believed complexity-theoretic assumptions, realize input-output correlations that are intractable for classical computation [11,12]. In this paper we discuss the application of parameterized quantum circuits as machine learning models in hybrid quantum-classical methods. The use of NISQ devices in the context of machine learning is particularly appealing as machine learning algorithms may be more tolerant to noise in the quantum hardware [13,14]. In short, parameterized quantum circuits could yield NISQ-friendly machine learning models that could be used to classify data for which conventional classical machine learning models may struggle.
In machine learning, parameterized quantum circuits can serve as a parameterized family of real-valued functions in a manner similar to neural networks (they are often called quantum neural networks). It has been noted that machine learning models based on parameterized quantum circuits are closely related to linear classifiers, which use hyperplanes to separate classes of data embedded in a vector space. This connection was first established by Havlíček et al. [1], and Schuld & Killoran [2], who both defined two machine learning models based on parameterized quantum circuits that efficiently implement certain families of linear classifiers (an illustration of which can be found in Figure 1). In this paper we further investigate and exploit this relation between machine learning models based on parameterized quantum circuits and standard linear classifiers to investigate how to perform structural risk minimization. More specifically, we study how to tune parameters of quantum machine learning models to optimize their expressivity (i.e., the ability to correctly capture correlations in sets of training examples) while preventing the model from becoming too complex.
Figure 1: An overview of the quantum machine learning models introduced in [1,2]. First, a parameterized quantum circuit is used to encode the data into a quantum state ρ_x. Afterwards, an observable O is measured. If its expectation value lies above d, then we assign the label +1, and −1 otherwise. The goal in training is to find the optimal observable O and threshold d.
In particular, we show that:
1. (a) A measure of complexity called the VC dimension can be controlled by limiting the dimension of the sum of the images of the observables measured by the quantum model. In particular, we provide explicit analytical bounds on the VC dimension in terms of the dimension of the span of the observables, and the dimension of the sum of the images of the observables.
Afterwards, we use this result to devise quantum models for which we can control this VC dimension bound by limiting the ranks of the observables (i.e., they can be regularized by penalizing high-rank observables).
(b) A measure of complexity called the fat-shattering dimension can be controlled by limiting the Frobenius norm of the observables measured by the quantum model. In particular, we provide explicit analytical bounds on the fat-shattering dimension in terms of the Frobenius norm of the observable.
Due to the well-established connection between these complexity measures and upper bounds on the generalization performance [15,16], our results theoretically quantify the effect that adjusting the dimension of the sum of the images, or the Frobenius norm, of the observables measured by the quantum model has on its generalization performance. Furthermore, we show that:
2. (a) Quantum models that use high-rank observables are strictly more expressive than quantum models that use low-rank observables. In particular, we show that i) any set of examples that can be correctly classified using a low-rank observable can also be correctly classified using a high-rank observable, and ii) there exist sets of examples that can only be correctly classified using an observable of at least a certain rank.
(b) Quantum models that use observables with large Frobenius norms can achieve larger margins (i.e., empirical quantities measured on a set of training examples that influence certain generalization bounds) compared to quantum models that use observables with small Frobenius norms. In particular, we show that there exist sets of examples that can only be classified with a given margin using observables of at least a certain Frobenius norm. Since the Frobenius norm controls the fat-shattering dimension, this can actually also have a positive effect on the generalization performance (as discussed in Section 2.2).
To summarize, we show that the rank or Frobenius norm of the observables measured by the quantum model also controls the model's ability to capture correlations in sets of examples.
In addition to the above two points, we also connect quantum machine learning with parameterized quantum circuits to standard structural risk minimization theory and discuss how to use our results to find the best quantum models in practice. In particular, we discuss different types of regularization that are theoretically motivated by our results, which help improve the performance of the quantum models in practice without putting extra requirements on the quantum hardware and are thus NISQ-suitable. Moreover, we find that there exist training methods (i.e., those that penalize high-rank observables) that are theoretically motivated by our results, and for which the resulting quantum model does not necessarily correspond to a kernel method as argued in [17].

Related work
The way the observable in Figure 1 is measured typically consists of multiple steps that involve different parts of the quantum model. For instance, a prominent approach consists of first applying a parameterized quantum circuit to the data encoding state ρ_x, and then performing some fixed measurement. Previous works have focused on showing how complexity measures depend on the different parts of the quantum model that implement the observable measurement, such as the quantum circuit ansatz [18,19,20], or the level of noise in the model [21]. In this work we study the observable measured by the quantum model as a whole due to the 1-1 correspondence with the normal vectors of separating hyperplanes of linear classifiers. By studying the observable as a whole, our results apply to all quantum models that are of the structure described in Figure 1, independent of how the observable measurement is implemented. Moreover, by being agnostic to how the observable is measured, our results are complementary to results that focus on the specifics of a particular implementation of the observable measurement such as those mentioned above. Other related work has focused on showing that quantum machine learning models are remarkably expressive and satisfy generalization bounds based on different complexity measures [22,23,24]. Finally, other related work has studied the generalization performance of quantum machine learning models in order to compare their performance with classical machine learning models [25,26].
Organization. In Section 2, we define the quantum machine learning models studied in this paper, and we provide background on structural risk minimization. In Section 3, we investigate how structural risk minimization can be achieved for the quantum models. First, we determine two capacity measures of the quantum models, which allows us to identify model parameters that control the model's complexity in Subsection 3.1. Afterwards, we investigate the effect of these model parameters on the empirical performance in Subsection 3.2. We end with a discussion of how to implement structural risk minimization of the quantum models in practice in Subsection 3.3.

Background and motivation
In this section we provide the necessary background and we motivate our results. First, we introduce the family of quantum machine learning models that we will study. Afterwards, we introduce the framework of statistical learning theory, which together with our results will provide an approach to optimally tuning the family of quantum models via so-called structural risk minimization.

Quantum linear classifiers
A fundamental family of classifiers used throughout machine learning are those constructed from linear functions. Specifically, these classifiers are constructed from the family of real-valued functions on $\mathbb{R}^\ell$ given by
$$f_w(x) = \langle w, x \rangle, \quad w \in \mathbb{R}^\ell, \tag{1}$$
where $\langle \cdot, \cdot \rangle$ denotes an inner product on the input space $\mathbb{R}^\ell$. These linear functions are turned into classifiers by adding an offset and taking the sign, i.e., the classifiers are given by
$$\mathcal{C}_{\mathrm{lin}} = \big\{ c_{w,d}(x) = \mathrm{sign}\big(\langle w, x \rangle - d\big) \;\big|\; w \in \mathbb{R}^\ell,\ d \in \mathbb{R} \big\}. \tag{2}$$
These linear classifiers essentially use hyperplanes to separate the different classes in the data. While this family of classifiers seems relatively limited, it becomes powerful when introducing a feature map. Specifically, a feature map $\Phi : \mathbb{R}^\ell \to \mathbb{R}^N$ is used to (non-linearly) map the data to a (much) higher-dimensional space, called the feature space, in order to make the data more linearly separable. We let $\mathcal{C}(\Phi) = \{c \circ \Phi \mid c \in \mathcal{C}\}$ denote the family of classifiers on $\mathbb{R}^\ell$ obtained by combining a family of linear classifiers $\mathcal{C} \subseteq \mathcal{C}_{\mathrm{lin}}$ on $\mathbb{R}^N$ with a feature map $\Phi$. If the feature map is clear from the context we will omit it in the notation and just write $\mathcal{C}$. A well-known example of a model based on linear classifiers is the support vector machine (SVM), which aims to find the hyperplane that attains the maximal perpendicular distance to the two classes of points in the two distinct half-spaces (assuming the feature map makes the data linearly separable).
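To make the role of the feature map concrete, the following is a minimal sketch of a linear classifier composed with a feature map. The quadratic feature map, the weight vector and the example points are illustrative assumptions, not taken from the models studied in this paper; the convention that a score equal to the offset receives the label −1 follows the description of Figure 1.

```python
import numpy as np

def feature_map(x):
    """Hypothetical quadratic feature map R^2 -> R^3 (illustrative choice only)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def linear_classifier(w, d, phi=lambda x: np.asarray(x, dtype=float)):
    """Return c(x) = sign(<w, phi(x)> - d), assigning -1 when the score equals d."""
    def c(x):
        return 1 if float(np.dot(w, phi(x))) - d > 0 else -1
    return c

# Points inside and outside the unit circle are not linearly separable in R^2,
# but the hyperplane x1^2 + x2^2 = 1 separates them in the feature space.
c = linear_classifier(w=np.array([1.0, 1.0, 0.0]), d=1.0, phi=feature_map)
labels = [c((0.1, 0.2)), c((2.0, 0.0)), c((-1.5, 1.5))]
```

The same hyperplane-plus-offset structure reappears below with density matrices as feature vectors and observables as normal vectors.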
The linear-algebraic nature of linear classifiers makes them particularly well-suited for quantum treatment. In the influential works of Havlíček et al. [1], and Schuld & Killoran [2], the authors propose a model where the space of n-qubit Hermitian operators, denoted $\mathrm{Herm}(\mathbb{C}^{2^n})$, takes the role of the feature space. Specifically, they view $\mathrm{Herm}(\mathbb{C}^{2^n})$ as a $4^n$-dimensional real vector space equipped with the Frobenius inner product $\langle A, B \rangle = \mathrm{Tr}[A^\dagger B]$. Their feature map maps classical inputs x to n-qubit density matrices $\Phi(x) = \rho_\Phi(x)$ (i.e., positive semi-definite matrices of trace one). Finally, the hyperplanes that separate the states $\rho_\Phi(x)$ corresponding to the different classes are given by n-qubit observables. In short, the family of functions their model uses is given by
$$f_O(x) = \mathrm{Tr}\big[\rho_\Phi(x)\, O\big], \quad O \in \mathrm{Herm}(\mathbb{C}^{2^n}), \tag{3}$$
and the family of classifiers, which we refer to as quantum linear classifiers, is given by
$$c_{O,d}(x) = \mathrm{sign}\big(\mathrm{Tr}\big[\rho_\Phi(x)\, O\big] - d\big), \quad O \in \mathrm{Herm}(\mathbb{C}^{2^n}),\ d \in \mathbb{R}. \tag{4}$$
We can estimate $f_O(x)$ defined in Equation (3) by preparing the state $\rho_\Phi(x)$ and measuring the observable O. In particular, approximating $f_O(x)$ up to additive error $\varepsilon$ requires only $\mathcal{O}(1/\varepsilon^2)$ samples. While the error creates a fuzzy region around the decision boundary, this turns out to not cause major problems in practical settings [10]. Using parameterized quantum circuits, both the preparation of a quantum state that encodes the classical input and the measurement of observables can be done efficiently for certain feature maps and families of observables. We now briefly recap two ways in which parameterized quantum circuits can be used to efficiently implement a family of quantum linear classifiers, as originally proposed by Havlíček et al. [1], and Schuld & Killoran [2]. Both ways use a parameterized quantum circuit to implement the feature map.
Specifically, let $U_\Phi$ be a parameterized quantum circuit, then we can use it to implement the feature map given by
$$\rho_\Phi(x) = |\Phi(x)\rangle\langle\Phi(x)|, \tag{5}$$
where $|\Phi(x)\rangle = U_\Phi(x)|0\rangle^{\otimes n}$. The key difference between the two approaches is which observables they are able to implement (i.e., which separating hyperplanes they can represent) and how the observables are actually measured (i.e., how the functions $f_O$ are evaluated). An overview of how the two approaches implement quantum linear classifiers can be found in Figure 2, and we discuss the main ideas behind the two approaches below.
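The expectation value Tr[ρ_Φ(x) O] and its estimation from a finite number of measurement samples can be illustrated with a small classical simulation. This is only a sketch under stated assumptions: the single-qubit feature map, the choice O = Z, the input value and the number of shots are hypothetical, and the sampling routine merely mimics measurement statistics on a classical computer.

```python
import numpy as np

def density_matrix(x):
    """Toy single-qubit feature map (an assumption): |Phi(x)> = cos(x/2)|0> + sin(x/2)|1>."""
    psi = np.array([np.cos(x / 2), np.sin(x / 2)], dtype=complex)
    return np.outer(psi, psi.conj())

def f_exact(rho, O):
    """Exact expectation value Tr[rho O]."""
    return float(np.real(np.trace(rho @ O)))

def f_estimate(rho, O, shots, rng):
    """Estimate Tr[rho O] by sampling eigenvalues of O with Born-rule probabilities."""
    eigvals, eigvecs = np.linalg.eigh(O)
    # Diagonal of V^dagger rho V gives the probability of each measurement outcome.
    probs = np.real(np.einsum("ij,jk,ki->i", eigvecs.conj().T, rho, eigvecs))
    probs = np.clip(probs, 0.0, None)
    probs = probs / probs.sum()
    outcomes = rng.choice(eigvals, size=shots, p=probs)
    return float(outcomes.mean())

def classify(value, d):
    """Label +1 if the expectation value lies above the threshold d, and -1 otherwise."""
    return 1 if value > d else -1

Z = np.array([[1.0, 0.0], [0.0, -1.0]], dtype=complex)
rho = density_matrix(0.7)
rng = np.random.default_rng(seed=0)
exact = f_exact(rho, Z)
estimate = f_estimate(rho, Z, shots=10_000, rng=rng)
label = classify(exact, d=0.0)
```

Because the statistical error of the sample mean shrinks like the inverse square root of the number of shots, reaching additive error ε takes on the order of 1/ε² samples, in line with the scaling stated above.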
Explicit quantum linear classifier
The observables measured in this approach are implemented by first applying a parameterized quantum circuit $W(\theta)$, followed by a computational basis measurement and postprocessing of the measurement outcome $\lambda : [2^n] \to \mathbb{R}$. Upon closer investigation, one can derive that the corresponding observable is given by
$$O^\lambda_\theta = W^\dagger(\theta)\, \mathrm{diag}\big(\lambda(1), \dots, \lambda(2^n)\big)\, W(\theta). \tag{6}$$
Examples of efficiently computable postprocessing functions $\lambda$ include functions with a polynomially small support (implemented using a lookup table), functions that are efficiently computable from the input bitstring (e.g., the parity of the bitstring, which is equivalent to measuring $Z^{\otimes n}$), or parameterized functions such as neural networks. Note that the postprocessing function $\lambda$ plays an important role in how the measurement of the observable in Eq. (6) is physically realized. Altogether, this efficiently implements the family of linear classifiers, which we refer to as explicit quantum linear classifiers, given by
$$c_{\lambda,\theta,d}(x) = \mathrm{sign}\big(\mathrm{Tr}\big[\rho_\Phi(x)\, O^\lambda_\theta\big] - d\big), \tag{7}$$
with $O^\lambda_\theta$ as in Eq. (6) and $d \in \mathbb{R}$. The power of this model lies in the efficient parameterization of the manifold (inside the $4^n$-dimensional vector space of Hermitian operators on $\mathbb{C}^{2^n}$) realized by the quantum feature map together with the parameterized separating hyperplanes that can be attained by $W(\theta)$ and $\lambda$. Here also lies a restriction of the explicit quantum linear classifier compared to standard linear classifiers, as in the latter all hyperplanes are possible and in the former only the hyperplanes that lie in the manifold parameterized by $W(\theta)$ and $\lambda$ are possible.
Figure 2: Overview of the explicit and implicit quantum linear classifiers defined in Equations (7) and (8), respectively. Note that in the case of the explicit classifier, a universal circuit $W(\theta)$ (specifying the eigenbasis) followed by a computational basis measurement and universal postprocessing $\lambda$ (specifying the eigenvalues) allows one to measure any observable.
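As a small numerical illustration of the explicit construction, the sketch below builds an observable of the form O = W† diag(λ) W, which is the form one obtains when the circuit W is applied before a computational basis measurement (this form, the two-qubit size and the random unitary standing in for W(θ) are assumptions of the example). It checks that parity postprocessing with a trivial circuit reproduces Z⊗Z, and that the Frobenius norm of the observable equals the 2-norm of the postprocessing vector.

```python
import numpy as np
from functools import reduce

def explicit_observable(W, lam):
    """O = W^dagger diag(lam) W for a unitary W and a real eigenvalue vector lam."""
    return W.conj().T @ np.diag(lam) @ W

def random_unitary(dim, rng):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    Q, _ = np.linalg.qr(A)
    return Q

n = 2
# Parity postprocessing: lam(b) = (-1)^(number of ones in the bitstring b).
lam_parity = np.array([(-1) ** bin(b).count("1") for b in range(2**n)], dtype=float)

Z = np.diag([1.0, -1.0])
Z_all = reduce(np.kron, [Z] * n)

# With the trivial circuit W = identity, parity postprocessing measures Z tensor Z.
O_parity = explicit_observable(np.eye(2**n), lam_parity)

# A generic circuit only changes the eigenbasis, not the eigenvalues.
rng = np.random.default_rng(seed=1)
W = random_unitary(2**n, rng)
O = explicit_observable(W, lam_parity)
frobenius = np.linalg.norm(O, "fro")
```

Since conjugation by a unitary preserves the spectrum, bounding the 2-norm of λ is one direct handle on the Frobenius norm of the measured observable.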
Furthermore, explicit quantum linear classifiers can likely not be efficiently evaluated classically, as computing expectation values $\mathrm{Tr}[\rho_\Phi(x)\, O^\lambda_\theta]$ is classically intractable for sufficiently complex feature maps and observables [11,12].

Implicit quantum linear classifier
Another way to implement a linear classifier is by using the so-called kernel trick [27]; the resulting model is also called the quantum kernel estimator [1]. In short, this trick involves expressing the normal vector of the separating hyperplane (i.e., the observable O in the case of quantum linear classifiers) on a set of training examples D as a linear combination of feature vectors, resulting in the expression
$$O_\alpha = \sum_{x' \in D} \alpha_{x'}\, \rho_\Phi(x').$$
Using this expression we can rewrite the corresponding quantum linear classifier as
$$c_{\alpha,d}(x) = \mathrm{sign}\Big(\sum_{x' \in D} \alpha_{x'}\, \mathrm{Tr}\big[\rho_\Phi(x)\, \rho_\Phi(x')\big] - d\Big).$$
This type of linear classifier can also be efficiently realized using parameterized quantum circuits. Using quantum protocols such as the SWAP-test or the Hadamard-test it is possible to efficiently evaluate the overlaps $\mathrm{Tr}[\rho_\Phi(x)\rho_\Phi(x')]$ for the feature map defined in Equation (5). Afterwards, the optimal parameters $\{\alpha_{x'}\}_{x' \in D}$ are obtained on a classical computer, e.g., by solving a quadratic program. Altogether, this allows us to efficiently implement the family of linear classifiers, which we refer to as implicit quantum linear classifiers, given by
$$c_{\alpha,d}(x) = \mathrm{sign}\big(\mathrm{Tr}\big[\rho_\Phi(x)\, O_\alpha\big] - d\big), \quad O_\alpha = \sum_{x' \in D} \alpha_{x'}\, \rho_\Phi(x'),\ \alpha \in \mathbb{R}^{|D|},\ d \in \mathbb{R}. \tag{8}$$
The power of this model comes from the fact that evaluating the overlaps $\mathrm{Tr}[\rho_\Phi(x)\rho_\Phi(x')]$ is likely classically intractable for sufficiently complex feature maps [1], demonstrating that classical computers can likely neither train nor evaluate this quantum linear classifier efficiently. Moreover, any quantum linear classifier that is the minimizer of a loss function that includes regularization of the Frobenius norm of the observable can be expressed as an implicit quantum linear classifier [17]. However, as we indicate later in Section 3.3, this does not mean that we can forego explicit quantum linear classifiers entirely, as in the explicit approach there are unique types of meaningful regularization for which there is no straightforward correspondence to the implicit approach.
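The equivalence between the kernel form and the observable form of the implicit classifier can be checked numerically. In the sketch below, the single-qubit feature map, the training inputs and the coefficients α are hypothetical (the coefficients are fixed by hand rather than obtained from a quadratic program), and the overlaps are computed classically instead of with a SWAP-test.

```python
import numpy as np

def density_matrix(x):
    """Toy single-qubit feature map (an assumption of this example)."""
    psi = np.array([np.cos(x / 2), np.sin(x / 2)], dtype=complex)
    return np.outer(psi, psi.conj())

def kernel(x, x_prime):
    """State overlap Tr[rho(x) rho(x')], computed classically here."""
    return float(np.real(np.trace(density_matrix(x) @ density_matrix(x_prime))))

train_x = np.array([0.1, 0.9, 2.0, 2.8])
alpha = np.array([0.5, 0.3, -0.4, -0.6])  # hypothetical coefficients, not trained
d = 0.0

def kernel_score(x):
    """Sum over training points of alpha_{x'} Tr[rho(x) rho(x')]."""
    return sum(a * kernel(x, xp) for a, xp in zip(alpha, train_x))

# The same classifier expressed through the observable O_alpha.
O_alpha = sum(a * density_matrix(xp) for a, xp in zip(alpha, train_x))

def observable_score(x):
    """Tr[rho(x) O_alpha]."""
    return float(np.real(np.trace(density_matrix(x) @ O_alpha)))

def classify(x):
    return 1 if kernel_score(x) > d else -1

labels = [classify(x) for x in train_x]
```

By linearity of the trace, both scores coincide for every input, which is why the classifier never needs the observable explicitly and only requires the pairwise overlaps.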

Structural risk minimization: generalization bounds and model selection
When looking for the optimal family of classifiers for a given learning problem, it is important to carefully select the family's complexity (also known as expressivity or capacity). For instance, in the case of linear classifiers, it is important to select what kind of hyperplanes one allows the classifier to use. Generally, the more complex the family is, the lower the training errors will be. However, if the family becomes overly complex, then it becomes more prone to worse generalization performance (i.e., due to overfitting). Structural risk minimization is a concrete method that balances this trade-off in order to obtain the best possible performance on unseen examples. Specifically, structural risk minimization aims to saturate well-established upper bounds on the expected error of the classifier that consist of the sum of two inversely related terms: a training error term, and a complexity term penalizing more complex models.
In statistical learning theory it is generally assumed that the data is sampled according to some underlying probability distribution P on X × {−1, +1}. The goal is to find a classifier that minimizes the probability that a random pair sampled according to P is misclassified. That is, the goal is to find a classifier $c_{f,d}(x) = \mathrm{sign}(f(x) - d)$ that minimizes the expected error given by
$$\mathrm{er}_P(c_{f,d}) = \Pr_{(x,y) \sim P}\big[c_{f,d}(x) \neq y\big]. \tag{9}$$
As one generally only has access to training examples $D = \{(x_1, y_1), \dots, (x_m, y_m)\}$ that are sampled according to the distribution P, it is not possible to compute $\mathrm{er}_P$ directly. Nonetheless, one can try to approximate Equation (9) using training errors such as
$$\mathrm{er}_D(c_{f,d}) = \frac{1}{m}\Big|\big\{i : c_{f,d}(x_i) \neq y_i\big\}\Big|, \tag{10}$$
$$\mathrm{er}^\gamma_D(c_{f,d}) = \frac{1}{m}\Big|\big\{i : y_i\big(f(x_i) - d\big) < \gamma\big\}\Big|. \tag{11}$$
Intuitively, $\mathrm{er}_D$ in Equation (10) represents the frequency of misclassified training examples, and $\mathrm{er}^\gamma_D$ in Equation (11) represents the frequency of training examples that are either misclassified or are "within margin γ from being misclassified". In particular, for γ = 0 both training error estimates are identical (i.e., $\mathrm{er}_D = \mathrm{er}^0_D$). When selecting the optimal classifiers from a given model one typically searches for the classifier that minimizes the training error (in practice more elaborate and smooth loss functions are used), which is referred to as empirical risk minimization. The problem that structural risk minimization aims to tackle is how to optimally select a model such that one will have some guarantee that the training error will be close to the expected error.
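The two training errors can be computed directly from the real-valued scores f(x_i). The sketch below assumes that the margin error counts examples with y_i (f(x_i) − d) < γ, which is the usual convention for margin-based bounds; the scores and labels are made-up numbers chosen only to show the difference between the two quantities.

```python
import numpy as np

def training_error(scores, labels, d):
    """Fraction of examples whose predicted label sign(score - d) differs from the true label."""
    predictions = np.where(np.asarray(scores) - d > 0, 1, -1)
    return float(np.mean(predictions != np.asarray(labels)))

def margin_error(scores, labels, d, gamma):
    """Fraction of examples that are misclassified or classified with margin below gamma."""
    signed_margins = np.asarray(labels) * (np.asarray(scores) - d)
    return float(np.mean(signed_margins < gamma))

# Hypothetical scores f(x_i) and labels y_i for four training examples.
scores = [2.0, 0.3, -0.5, -1.5]
labels = [1, 1, 1, -1]

err = training_error(scores, labels, d=0.0)
err_margin = margin_error(scores, labels, d=0.0, gamma=0.5)
err_margin_zero = margin_error(scores, labels, d=0.0, gamma=0.0)
```

In this toy data set the second example is classified correctly but only with margin 0.3, so it counts towards the margin error at γ = 0.5 and not towards the plain training error.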
Structural risk minimization uses expected error bounds (two of which we will discuss shortly) that involve a training error term, and a complexity term that penalizes more complex models. This complexity term usually scales with a certain measure of the complexity of the family of classifiers. A well-known example of such a complexity measure is the Vapnik-Chervonenkis dimension.
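Definition 1 (VC dimension [28]). Let C be a family of functions on X taking values in {−1, +1}. We say that a set of points $\{x_1, \dots, x_m\} \subseteq X$ is shattered by C if for all $y \in \{-1, +1\}^m$ there exists a $c \in C$ such that $c(x_i) = y_i$ for all $1 \le i \le m$. The VC dimension of C, denoted VC(C), is the size of the largest set of points that is shattered by C.
Besides the VC dimension we also consider a complexity measure called the fat-shattering dimension, which can be viewed as a generalization of the VC dimension to real-valued functions. An important difference between the VC dimension and the fat-shattering dimension is that the latter also takes into account the so-called margins that the family of classifiers can achieve. Here the margin of a classifier $c_{f,d}(x) = \mathrm{sign}(f(x) - d)$ on a set of examples $\{x_i\}$ is given by $\min_i |f(x_i) - d|$. Throughout the literature, this is often referred to as the functional margin.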
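Definition 2 (Fat-shattering dimension [29]). Let F be a family of real-valued functions on X. We say that a set of points $\{x_1, \dots, x_m\} \subseteq X$ is γ-shattered by F if there exist $s_1, \dots, s_m \in \mathbb{R}$ such that for all $y \in \{-1, +1\}^m$ there exists an $f \in F$ satisfying $f(x_i) \ge s_i + \gamma$ if $y_i = +1$ and $f(x_i) \le s_i - \gamma$ if $y_i = -1$. The fat-shattering dimension of F at scale γ, denoted $\mathrm{fat}_F(\gamma)$, is the size of the largest set of points that is γ-shattered by F.
We will now state two expected error bounds that can be used to perform structural risk minimization. These error bounds theoretically quantify how an increase in model complexity (i.e., VC dimension or fat-shattering dimension) results in a worse expected error (i.e., due to overfitting). First, we state the expected error bound that involves the VC dimension.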

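Theorem 1 (Expected error bound using VC dimension [30]). Consider a set of functions C on X taking values in {−1, +1}. Suppose $D = \{(x_1, y_1), \dots, (x_m, y_m)\}$ is sampled using m independent draws from P. Then, with probability at least 1 − δ, the following holds for all c ∈ C:
$$\mathrm{er}_P(c) \le \mathrm{er}_D(c) + (\text{complexity term}), \tag{12}$$
where the complexity term depends on k = VC(C).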
Next, we state the expected error bound that involves the fat-shattering dimension. One possible advantage of using the fat-shattering dimension instead of the VC dimension is that it can take into account the margin that the classifier achieves on the training examples. This turns out to be useful since this margin can be used to more precisely fine-tune the expected error bound.
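Theorem 2 (Expected error bound using fat-shattering dimension [16]). Consider a set of real-valued functions F on X. Suppose $D = \{(x_1, y_1), \dots, (x_m, y_m)\}$ is sampled using m independent draws from P. Then, with probability at least 1 − δ, the following holds for all classifiers $c_{f,d}(x) = \mathrm{sign}(f(x) - d)$ with $f \in F$ and $d \in \mathbb{R}$:
$$\mathrm{er}_P(c_{f,d}) \le \mathrm{er}^\gamma_D(c_{f,d}) + (\text{complexity term}), \tag{13}$$
where the complexity term depends on $k = \mathrm{fat}_F(\gamma/16)$.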

Remark(s). If the classifier can correctly classify all examples in D, then the optimal choice of γ in the above theorem is the margin achieved on the examples in D.
Generally, the more complex a family of classifiers is, the larger its generalization errors are. This correlation between a family's complexity and its generalization errors is theoretically quantified in Theorems 1 and 2. Specifically, the more complex the family is the larger its VC dimension will be, which strictly increases the second term in Equation 12 that corresponds to the generalization error.
Note that for the fat-shattering dimension in Theorem 2 this is not as obvious. In particular, a more complex model could achieve a larger margin γ, which actually decreases the second term in Equation 13 that corresponds to the generalization error.
Theorems 1 and 2 establish that in order to minimize the expected error, we should aim to minimize either of the sums on the right-hand side of Equations (12) or (13) (depending on which complexity measure one wishes to focus on). Note that in both cases the first term corresponds to a training error and the second term corresponds to a complexity term that penalizes more complex models. Crucially, the effect that the complexity measure of the family of classifiers has on these terms is inversely related. Namely, a large complexity measure generally gives rise to smaller training errors, but at the cost of a larger complexity term. Balancing this trade-off is precisely the idea behind structural risk minimization. More precisely, structural risk minimization selects a classifier that minimizes either of the expected error bounds stated in Theorem 1 or 2, by selecting the classifier from a family whose complexity measure is fine-tuned in order to balance both terms on the right-hand side of Equations (12) or (13). Note that limiting the VC dimension and fat-shattering dimension does not achieve the same theoretical guarantees on the generalization error, and it will generally give rise to different performances in practice (as also discussed in Section 3.2). An overview of the trade-off in the error bounds stated in Theorems 1 and 2 is depicted in Figure 3.
Figure 3: Illustration of the trade-off in structural risk minimization [15]. Increasing the complexity of the classifier family causes the training error (blue) to decrease, while it increases the complexity term (green). Structural risk minimization selects the classifier minimizing the expected error bound in Eqs. (12) and (13) given by the sum of the training error and the complexity term (red).
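The selection rule behind structural risk minimization can be sketched in a few lines. The complexity term used below, sqrt((8/m)(k ln(2em/k) + ln(4/δ))), is one common textbook form of a VC-type bound and is an assumption of this example; it is not necessarily the exact expression or constants of Theorem 1. The candidate families, their VC dimensions and their training errors are made-up numbers.

```python
import math

def vc_complexity_term(k, m, delta):
    """A standard textbook VC-type complexity term (assumed form, valid for 1 <= k <= m)."""
    return math.sqrt((8.0 / m) * (k * math.log(2 * math.e * m / k) + math.log(4.0 / delta)))

def error_bound(k, train_err, m, delta):
    """Training error plus complexity term: the quantity that is minimized."""
    return train_err + vc_complexity_term(k, m, delta)

def select_model(candidates, m, delta):
    """candidates: list of (vc_dimension, training_error); return the pair with the smallest bound."""
    return min(candidates, key=lambda kt: error_bound(kt[0], kt[1], m, delta))

# Hypothetical nested families: richer families fit the training data better.
candidates = [(5, 0.25), (50, 0.05), (2000, 0.0)]
m, delta = 100_000, 0.05

best_k, best_train_err = select_model(candidates, m, delta)
bounds = [error_bound(k, e, m, delta) for k, e in candidates]
```

With these numbers neither the simplest nor the richest family is selected: the intermediate family achieves the smallest sum of training error and complexity term, which is the trade-off described above.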

Structural risk minimization for quantum linear classifiers
In this section we theoretically analyze and quantify the influence that model parameters of quantum linear classifiers have on the trade-off in structural risk minimization. We first analyze the effect that model parameters have on the complexity term (i.e., the green line in Figure 3) and afterwards we analyze their effect on the training error (i.e., the blue line in Figure 3). Specifically, in Section 3.1 we analyze the complexity term by establishing analytic upper bounds on complexity measures (i.e., the VC dimension and fat-shattering dimension) of quantum linear classifiers. In Section 3.2 we study the influence that model parameters which influence the established complexity measure bounds have on the training error term. Finally, in Section 3.3, we discuss how to implement structural risk minimization of quantum linear classifiers based on the obtained results.

Complexity of quantum linear classifiers: fat-shattering and VC dimension
In this section we determine the two complexity measures defined in the previous section (i.e., the fat-shattering dimension and VC dimension) for families of quantum linear classifiers. As a result, we identify model parameters that allow us to control the complexity term in the expected error bounds of Theorems 1 and 2. In particular, these model parameters can therefore be used to balance the trade-off considered by structural risk minimization, as depicted in Figure 3. Throughout this section we fix the feature map to be the one defined in Equation (5) and we allow our separating hyperplanes to come from a family of observables $\mathcal{O} \subseteq \mathrm{Herm}(\mathbb{C}^{2^n})$ (e.g., the family of observables implementable using either the explicit or implicit realization of quantum linear classifiers). Our goal is to determine analytical upper bounds on complexity measures of the resulting family of quantum linear classifiers.
First, we show that the VC dimension of a family of quantum linear classifiers is upper bounded by the dimension of the span of the observables that it uses. This in turn is upper bounded by the square of the dimension of the space upon which the observables act nontrivially. We remark that while the VC dimension of quantum linear classifiers also has a clear dependence on the feature map, we chose to focus on the observables because the resulting upper bounds give rise to more explicit guidelines on how to tune the quantum model to perform structural risk minimization (as we discuss in more detail in Section 3.3). We defer the proof to Appendix A.1.
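Proposition 3. Let $\mathcal{O} \subseteq \mathrm{Herm}(\mathbb{C}^{2^n})$ be a family of n-qubit observables and let $r = \dim\big(\sum_{O \in \mathcal{O}} \mathrm{Im}\, O\big)$. Then, the VC dimension of the corresponding family of quantum linear classifiers is upper bounded by $\dim(\mathrm{Span}\, \mathcal{O}) + 1 \le r^2 + 1$.
Remark(s). The quantity r in the above proposition is related to the ranks of the observables. Specifically, note that for any two observables $O, O' \in \mathrm{Herm}(\mathbb{C}^{2^n})$ we have that $\dim(\mathrm{Im}\, O + \mathrm{Im}\, O') \le \mathrm{rank}(O) + \mathrm{rank}(O')$.
The above proposition implies the (essentially obvious) result that the VC dimension of a family of implicit quantum linear classifiers is upper bounded by the number of training examples (i.e., the operators $\{\rho_\Phi(x)\}_{x \in D}$ span a subspace of dimension at most $|D|$). We are however more interested in the application of the above proposition to explicit quantum linear classifiers. In this case, we choose to focus on the upper bound $r^2 + 1$ because it has interpretational advantages as to what parts of the model one has to tune from the perspective of structural risk minimization (i.e., recall from Section 3 that one way to perform structural risk minimization is to tune the VC dimension). Moreover, in the case of explicit quantum linear classifiers, the bound $r^2 + 1$ is only quadratically worse than the bound $\dim(\mathrm{Span}\, \mathcal{O}) + 1$. To see this, we consider a family of explicit quantum linear classifiers with observables $O^\lambda_\theta$, and we denote $W(\theta)|i\rangle = |\psi_i(\theta)\rangle$. Next, suppose that $\lambda(j) = 0$ for all $j > L$ and define
$$H = \mathrm{Span}_{\mathbb{C}}\big\{|\psi_i(\theta)\rangle : i \le L,\ \theta\big\}, \qquad V = \mathrm{Span}_{\mathbb{R}}\big\{|\psi_i(\theta)\rangle\langle\psi_i(\theta)| : i \le L,\ \theta\big\}.$$
Then, Proposition 3 states that the VC dimension is upper bounded by
$$\dim(V) + 1 \le \dim(H)^2 + 1.$$
Now, by the following lemma, we indeed find that the bound $r^2 + 1$ is only quadratically worse than the bound $\dim(\mathrm{Span}\, \mathcal{O}) + 1$. We again defer the proof to Appendix A.1.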
Therefore, if we sufficiently limit $r = \dim(H)$, then this also limits $\dim(\mathrm{Span}\, \mathcal{O}) = \dim(V)$. Moreover, even though $\dim(\mathrm{Span}\, \mathcal{O}) + 1$ can provide a tighter bound, it can still be advantageous to study the bound $r^2 + 1$ because it might have interpretational advantages. Specifically, it might be easier to construct cases of ansatzes where the latter bound allows us to identify a controllable hyperparameter that controls the VC dimension (as we discuss in more detail in Section 3.3).
Note that the quantity r defined in the above proposition depends on both the structure of the ansatz W as well as the post-processing function λ. One way to potentially limit r is by varying the rank of the final measurement (i.e., the value L defined above). However, for several ansatzes in the literature, having either a low-rank or a high-rank final measurement will not make a difference in terms of the VC dimension bound $r^2 + 1$. To see this, consider an ansatz consisting of a single layer of parameterized X-rotations on all qubits, where each rotation is given a separate parameter. Already for this simple ansatz even the first columns $\{|\psi_1(\theta)\rangle : \theta \in [0, 2\pi)^n\}$ span the entire n-qubit Hilbert space. In particular, the above proposition gives the same VC dimension upper bound for the cases where the final measurement is of rank L = 1, and where it is of full rank $L = 2^n$ (i.e., we have no guarantee that limiting L limits the VC dimension). This motivates us to design ansatzes for which subsets of columns do not span the entire Hilbert space when varying the variational parameter θ. On the other hand, to exploit the bound $\dim(\mathrm{Span}\, \mathcal{O}) + 1$ one needs to consider the span of the projectors onto the first L columns in the vector space of Hermitian operators. This quantity can be slightly less intuitive than the span of the first L columns in the n-qubit Hilbert space, and in Section 3.3 we show that this latter quantity can already be used to affirm the effectiveness of certain regularization techniques. Specifically, in Section 3.3 we discuss examples of ansatzes for which subsets of columns do not span the entire Hilbert space when varying the variational parameter, and we explain how they allow for structural risk minimization by limiting the rank of the final measurement.
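The claim about the single layer of X-rotations can be verified numerically for a small number of qubits. The sketch below assumes the usual convention RX(θ) = exp(−iθX/2) and takes the first column of W(θ) to be W(θ)|0...0⟩; restricting each angle to the two values 0 and π is a choice made here only to obtain a small, deterministic set of parameter settings.

```python
import numpy as np
from functools import reduce
from itertools import product

def rx(theta):
    """Single-qubit X-rotation exp(-i theta X / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def first_column(thetas):
    """First column of W(theta) = RX(theta_1) x ... x RX(theta_n), i.e. W(theta)|0...0>."""
    W = reduce(np.kron, [rx(t) for t in thetas])
    return W[:, 0]

n = 3
# Collect the first column for every angle assignment in {0, pi}^n.
columns = np.array([first_column(thetas) for thetas in product([0.0, np.pi], repeat=n)])

# Dimension of the complex span of the collected first columns.
rank = np.linalg.matrix_rank(columns)
```

Each angle setting maps |0...0⟩ to a different computational basis state (up to a phase), so already these 2^n settings yield linearly independent first columns, and a rank-one final measurement gives no smaller value of r than a full-rank one for this ansatz.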
Next, we show that the fat-shattering dimension of a family of quantum linear classifiers is related to the Frobenius norm of the observables that it uses. In particular, we show that we can control the fat-shattering dimension of a family of quantum linear classifiers by limiting the Frobenius norm of its observables. We defer the proof to Appendix A.3, where we also discuss the implications of this result in the probably approximately correct (PAC) learning framework.
is upper bounded by
Remark(s). The upper bound in the above proposition matches the result discussed in [26]. This was derived independently by one of the authors of this paper [31], and we include it here for completeness.
The above proposition shows that the fat-shattering dimension of a family of explicit quantum linear classifiers can be controlled by limiting $\|O^{\lambda}_{\theta}\|_F = \sqrt{\sum_{i=1}^{2^n} \lambda(i)^2}$. In particular, it shows that the selection of the postprocessing function $\lambda$ is important when tuning the complexity of the family of classifiers. Furthermore, the above proposition shows that the fat-shattering dimension of a family of implicit quantum linear classifiers can be controlled by limiting $\|\alpha\|_1$, since $\|O_\alpha\|_F \leq \|\alpha\|_1$. It is important to note that the Frobenius norm itself does not fully characterize the generalization performance of a family of quantum linear classifiers. Specifically, plugging Theorem 5 into Proposition 2 we find that the generalization performance bounds depend on both the Frobenius norm and the functional margin on training examples (see footnote 6). Therefore, to optimize the generalization performance bounds one has to minimize the Frobenius norm, while ensuring the functional margin on training examples stays large. Note that one way to achieve this is by maximizing the so-called geometric margin, which on a set of examples $\{x_i\}$ is given by $\min_i |\mathrm{Tr}[O\rho_\Phi(x_i)] - d| / \|O\|_F$.

Expressivity of quantum linear classifiers: model parameters & errors
Having established that the quantity r defined in Proposition 3 and the Frobenius norms of the observables influence the complexity of the family of quantum linear classifiers (i.e., the green line in Figure 3), we will now study the influence of these parameters on the training errors that the classifiers can achieve (i.e., the blue line in Figure 3). First, we study the influence of these model parameters on the ability of the classifiers to correctly classify certain sets of examples. Afterwards, we study the influence of these model parameters on the margins that the classifiers can achieve.
Recall from the previous section that the VC dimension of certain families of quantum linear classifiers depends on the rank of the observables that it uses. For instance, if the observables are such that their images are (largely) overlapping, then the quantity r defined in Proposition 3 can be controlled by limiting the ranks of all observables. In Section 3.3 we use this observation to construct ansatzes for which the VC dimension bound can be tuned by varying the rank of the observable measured on the output of the circuit. Since the VC dimension is only concerned with whether an example is correctly classified (and not what margin it achieves), we choose to investigate the influence of the rank on being able to correctly classify certain sets of examples. In particular, we show that any set of examples that can be correctly classified using a low-rank observable, can also be correctly classified using a high-rank observable. Moreover, we also show that there exist sets of examples that can only be correctly classified using observables of at least a certain rank. We defer the proof to Appendix B.1.
Proposition 6. Let $C^{(r)}_{\mathrm{qlin}}$ denote the family of quantum linear classifiers corresponding to observables of exactly rank $r$, that is,
$$C^{(r)}_{\mathrm{qlin}} = \{ c_{O,b}(\rho) = \mathrm{sgn}(\mathrm{Tr}[O\rho] - b) \mid O \in \mathrm{Herm}(\mathbb{C}^{2^n}),\ \mathrm{rank}(O) = r,\ b \in \mathbb{R} \}.$$
Then, the following statements hold:
(i) Every finite set of examples that is correctly classified by a classifier in $C^{(k)}_{\mathrm{qlin}}$ with $k < r$ is also correctly classified by some classifier in $C^{(r)}_{\mathrm{qlin}}$.
(ii) There exists a finite set of examples that is correctly classified by a classifier $c \in C^{(r)}_{\mathrm{qlin}}$, but which no classifier $c' \in C^{(k)}_{\mathrm{qlin}}$ with $k < r$ can classify correctly.

Note that in the above proposition we define our classifiers in such a way that high-rank classifiers do not subsume low-rank classifiers. In particular, the families of observables that $C^{(r)}_{\mathrm{qlin}}$ and $C^{(k)}_{\mathrm{qlin}}$ use are completely disjoint for $k \neq r$. The construction behind the proof of the above proposition is inspired by tomography of observables. Specifically, we construct a protocol that queries a quantum linear classifier and, based on the assigned labels, checks whether the underlying observable is approximately equal to a fixed target observable of a certain rank. In particular, we can use this to test whether the underlying observable is really of a given rank, as no low-rank observable can agree with a high-rank observable on the labels assigned during this protocol. Note that if we could query the expectation values of the observable, then tomography would be straightforward. However, the classifier only outputs the sign of the expectation value, which introduces a technical problem that we circumvent. Our protocol could be generalized to a more complete tomographic protocol which uses queries to a quantum linear classifier in order to find the spectrum of the underlying observable.
Next, we investigate the effect that limiting the rank of the observables used by a family of quantum linear classifiers has on its ability to implement certain families of standard linear classifiers.
In particular, assuming that the feature map is bounded (i.e., all feature vectors have finite norm), the following proposition establishes the chain of inclusions
$$C_{\mathrm{lin}} \text{ on } \mathbb{R}^{2^n} \subseteq C^{(\leq 1)}_{\mathrm{qlin}} \text{ on } n+1 \text{ qubits} \subseteq \cdots \subseteq C^{(\leq r)}_{\mathrm{qlin}} \text{ on } n+1 \text{ qubits} \subseteq \cdots \subseteq C_{\mathrm{lin}} \text{ on } \mathbb{R}^{4^n},$$
where $C^{(\leq r)}_{\mathrm{qlin}}$ denotes the family of quantum linear classifiers using observables of rank at most $r$. Note that the inclusion $C^{(\leq r)}_{\mathrm{qlin}} \subsetneq C^{(\leq r+1)}_{\mathrm{qlin}}$ is strict due to Proposition 6. We defer the proof to Appendix B.2.

Recall from the previous section that the fat-shattering dimension of a family of linear classifiers depends on the Frobenius norm of the observables that it uses. In the following proposition we show that tuning the Frobenius norm changes the margins that the model can achieve, which gives rise to better generalization performance (as discussed in Section 2.2). In particular, we show that there exist sets of examples that can only be classified with a certain margin by a classifier that uses an observable of at least a certain Frobenius norm. We defer the proof to Appendix B.3.

In conclusion, in Proposition 3 we showed that in certain cases the rank of the observables controls the model's complexity (e.g., if the observables have overlapping images), and in Proposition 6 we showed that the rank also controls the model's ability to achieve small training errors. Moreover, in Proposition 8 we similarly showed that the Frobenius norm not only controls the model's complexity (see Proposition 5), but that it also controls the model's ability to achieve large functional margins. However, note that tuning each model parameter achieves a different objective. Namely, increasing the rank of the observable increases the ability to correctly classify sets of examples, whereas increasing the Frobenius norm of the observable increases the margins that it can achieve.
For example, one can increase the Frobenius norm of an observable by multiplying it with a positive scalar which increases the margin it achieves, but in order to correctly classify the sets of examples discussed in Proposition 6 one actually has to increase the rank of the observable.
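The difference between the two parameters can be made concrete with a small numerical sketch. The observable, offset and scaling factor below are arbitrary illustrative choices, not taken from the paper: multiplying the observable and the offset by a positive scalar enlarges the Frobenius norm and the functional margins, but leaves both the assigned labels and the rank untouched.

```python
import numpy as np


def expectation(O, rho):
    """Expectation value Tr[O rho]."""
    return float(np.real(np.trace(O @ rho)))


def label(O, b, rho):
    """Label sign(Tr[O rho] - b) assigned by a quantum linear classifier."""
    return 1 if expectation(O, rho) - b > 0 else -1


# Illustrative rank-2 observable on two qubits and an arbitrary offset.
O = np.diag([1.0, -1.0, 0.0, 0.0])
b = 0.5
states = [np.diag(e) for e in np.eye(4)]  # computational basis states

scale = 10.0
O_scaled, b_scaled = scale * O, scale * b

labels = [label(O, b, rho) for rho in states]
labels_scaled = [label(O_scaled, b_scaled, rho) for rho in states]
margins = [abs(expectation(O, rho) - b) for rho in states]
margins_scaled = [abs(expectation(O_scaled, rho) - b_scaled) for rho in states]

print("labels unchanged:", labels == labels_scaled)
print("Frobenius norms:", np.linalg.norm(O, "fro"), np.linalg.norm(O_scaled, "fro"))
print("ranks:", np.linalg.matrix_rank(O), np.linalg.matrix_rank(O_scaled))
print("smallest functional margins:", min(margins), min(margins_scaled))
```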

Structural risk minimization for quantum linear classifiers in practice
Having established how certain model parameters of quantum linear classifiers influence both the model's complexity and its ability to achieve small training errors, we now discuss how to use these results to implement structural risk minimization of quantum linear classifiers in practice. In particular, we will discuss a common approach to structural risk minimization called regularization. In short, what regularization entails is instead of minimizing only the training error E train , one simultaneously minimizes an extra term h(ω), where h is a function that takes larger values for model parameters ω that correspond to more complex models. In this section, we discuss different types of regularization (i.e., different choices of the function h) that can be performed in the context of quantum linear classifiers based on the results of the previous section. These types of regularization help improve the performance of quantum linear classifiers in practice, without putting more stringent requirements on the quantum hardware and are thus NISQ-suitable.
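As a minimal sketch of what such a regularized objective can look like, the snippet below adds a penalty $h$ to a training error. Both concrete choices are assumptions made for illustration only: the training error is taken to be the average hinge loss, and $h$ is taken to be a multiple of the squared Frobenius norm of the observable.

```python
import numpy as np


def hinge_training_error(O, d, states, labels):
    """Average hinge loss of the classifier sign(Tr[O rho] - d) (assumed loss)."""
    scores = np.array([np.real(np.trace(O @ rho)) - d for rho in states])
    return float(np.mean(np.maximum(0.0, 1.0 - np.asarray(labels) * scores)))


def regularized_objective(O, d, states, labels, strength):
    """E_train + h(omega), with the assumed choice h = strength * ||O||_F^2."""
    penalty = strength * np.linalg.norm(O, "fro") ** 2
    return hinge_training_error(O, d, states, labels) + penalty


# Toy single-qubit data: |0><0| labeled +1 and |1><1| labeled -1.
Z = np.diag([1.0, -1.0])
states = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
labels = [1, -1]

full = regularized_objective(Z, 0.0, states, labels, strength=0.1)
shrunk = regularized_objective(0.5 * Z, 0.0, states, labels, strength=0.1)
print(full, shrunk)
```

Shrinking the observable lowers the penalty but raises the hinge loss, which is the trade-off that the regularization strength is meant to balance.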
To illustrate how Proposition 3 can be used to implement structural risk minimization in the explicit approach, consider the setting where we have a parameterized quantum circuit $W(\theta)$ (with $\theta \in \mathbb{R}^p$) followed by a fixed measurement that projects onto the first $\ell$ computational basis states. To use the bound $r^2 + 1$ from Proposition 3 one has to compute the quantity
$$r = \dim \mathrm{Span}_{\mathbb{C}} \{ |\psi_i(\theta)\rangle : i = 1, \dots, \ell,\ \theta \in \mathbb{R}^p \}, \qquad (23)$$
where $|\psi_i(\theta)\rangle$ denotes the $i$th column of $W(\theta)$. To use the other bound $\dim(\mathrm{Span}(\mathcal{O})) + 1$ from Proposition 3 one has to compute the quantity
$$\dim \mathrm{Span}_{\mathbb{R}} \{ |\psi_i(\theta)\rangle\langle\psi_i(\theta)| : i = 1, \dots, \ell,\ \theta \in \mathbb{R}^p \}. \qquad (24)$$
Although both are of course possible, in some cases it is slightly easier to see how the quantity in Eq. (23) scales with respect to $\ell$. Specifically, utilizing the quantity in Eq. (23) already leads to interesting ansatzes that allow for structural risk minimization by limiting $\ell$. As discussed below Proposition 3, setting $\ell$ to be either large or small does not by itself influence the upper bound on the VC dimension; whether it does depends on the structure of the parameterized quantum circuit ansatz $W$. The proposition therefore motivates the design of ansatzes whose first $\ell$ columns, when varying the variational parameter, define a manifold that is contained in a relatively low-dimensional linear subspace. Specifically, in this case Proposition 3 results in nontrivial bounds on the VC dimension that we aim to control by varying $\ell$. We now give three examples of ansatzes that allow one to control the upper bound on the VC dimension by varying $\ell$.
In particular, these ansatzes allow structural risk minimization to be implemented by regularizing with respect to the rank of the final measurement.
• For the first example, split the qubits into a "control register" of size $c$ and a "target register" of size $t$ (i.e., $n = t + c$). Next, let $C\text{-}U_i(\theta_i)$ denote the controlled gate that applies the $t$-qubit parameterized unitary $U_i(\theta_i)$ to the target register if the control register is in the state $|i\rangle$. Finally, consider the ansatz
$$W(\theta) = \prod_{i=0}^{2^c - 1} C\text{-}U_i(\theta_i).$$
Note that the matrix of $W(\theta)$ is given by the block matrix
$$W(\theta) = \mathrm{diag}\big(U_0(\theta_0),\ U_1(\theta_1),\ \dots,\ U_{2^c - 1}(\theta_{2^c - 1})\big).$$
For this choice of ansatz, if the final measurement projects onto $\ell = m 2^t$ ($m < 2^c$) computational basis states, then by Proposition 3 the VC dimension is at most $\ell^2 + 1$. Note that $t$ is a controllable hyperparameter that can be used to tune the VC dimension. In particular, we can set it such that the resulting VC dimension is not exponential in $n$. Let us now consider the other bound $\dim(\mathrm{Span}(\mathcal{O})) + 1$ from Proposition 3. For this choice of ansatz, computing the quantity in Eq. (24) is also straightforward due to the block structure of the unitary. Moreover, for this choice of ansatz the inequalities in Lemma 4 are strict, which shows why being able to compute the quantity in Eq. (23) does not always imply that we can also compute the quantity in Eq. (24) (i.e., one is not simply the square of the other).
• For the second example, consider an ansatz that is composed of parameterized gates of the form $U(\theta) = e^{i\theta P}$ for some Pauli string $P \in \{X, Y, Z, I\}^{\otimes n}$. Specifically, consider the ansatz
$$W(\theta) = \prod_{j=1}^{d} U_j(\theta_j), \qquad U_j(\theta_j) = e^{i\theta_j P_j}.$$
By the bound $r^2 + 1$ from Proposition 3, for this choice of ansatz if the final measurement projects onto $\ell$ computational basis states the VC dimension is at most $r^2 + 1$, where $r = \ell \cdot 2^d$. This bound is obtained by computing the quantity in Eq. (23), which can be done by noting that a column of the unitary $U(\theta)$ spans a subspace of dimension at most 2 when varying the variational parameter $\theta$ (since $e^{i\theta P} = \cos(\theta) I + i \sin(\theta) P$). Moreover, each subsequent layer of $U(\theta)$ increases the dimension of the span of a column by at most a factor 2. Thus, when applying $U(\theta)$ a total of $d$ times, the dimension of the span of any $\ell$ columns of $W(\theta)$ is at most $r = \ell \cdot 2^d$. Also in this construction we note that $d$ is a controllable hyperparameter that can be used to tune the VC dimension. In particular, we can set it such that the resulting VC dimension is not exponential in $n$. For this particular choice of ansatz, computing the quantity in Eq. (24) might also be possible, but it is more involved and not necessary for our main goal: the bound $r^2 + 1$ from Proposition 3 already suffices to establish that $\ell$ is a tunable hyperparameter that controls the VC dimension.
• For the third example, we use symmetry considerations as a tool to control the VC dimension. First, partition the $n$-qubit register into disjoint subsets $I_1, \dots, I_k$ of size $|I_j| = m_j$ (i.e., $\sum_j m_j = n$). Next, consider "permutation-symmetry preserving" parameterized unitaries on these partitions, which are defined as
$$S^{+}_{I_j}(\theta) = e^{i\theta \sum_{i \in I_j} P_i} \quad \text{and} \quad S^{\otimes}_{I_j}(\theta) = e^{i\theta \bigotimes_{i \in I_j} P_i},$$
where we have, say, $P_i = X_i$, $P_i = Y_i$, $P_i = Z_i$ or $P_i = I$ for all $i \in I_j$ (i.e., the same operator acting on all qubits in the partition $I_j$). Note that if we apply these operators to a permutation-invariant state on the $m_j$ qubits in the $j$th partition, then it remains permutation invariant (independent of $\theta$). From these symmetric parameterized unitaries we construct parameterized layers $U(\theta_1, \dots, \theta_k) = \bigotimes_{j=1}^{k} S^{+/\otimes}_{I_j}(\theta_j)$, from which we construct the ansatz as
$$W(\theta) = \prod_{i=1}^{d} U(\theta^{i}_1, \dots, \theta^{i}_k).$$
By the bound $r^2 + 1$ from Proposition 3, for this choice of ansatz if the final measurement projects onto $\ell$ computational basis states the VC dimension is at most $r^2 + 1$, where
$$r = \ell \cdot \prod_{j=1}^{k} (m_j + 1).$$
This bound is obtained by computing the quantity in Eq. (23), which can be done by noting that if we apply a layer $U$ to an $n$-qubit state that is invariant under permutations that only permute qubits within each partition, then it remains invariant under these permutations (i.e., independent of the choice of $\theta$). In other words, the first column of $W(\theta)$ is always contained in the space of $n$-qubit states that are invariant under permutations that only permute qubits within each partition. Next, note that the dimension of this space is equal to $\prod_{j=1}^{k} (m_j + 1)$. Finally, note that any other column of $W(\theta)$ spans a space whose dimension is at most that of the first column of $W(\theta)$ when varying $\theta$. Thus, any $\ell$ columns of $W(\theta)$ span a space of dimension at most $r = \ell \cdot \prod_{j=1}^{k} (m_j + 1)$ when varying $\theta$. As in the example above, for this particular choice of ansatz, computing the quantity in Eq. (24) might also be possible, but it is again more involved and not necessary for our main goal: the bound $r^2 + 1$ from Proposition 3 again already suffices to establish that $\ell$ is a tunable hyperparameter that controls the VC dimension.
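The span dimensions claimed in the second and third examples above can be probed numerically for small instances. The sketch below samples random angles, collects the first column of the circuit, and computes the rank of the collected columns. The specific Pauli strings and layer orderings are arbitrary, and the generators used for the symmetric layers (a sum or a tensor product of one Pauli over the whole register, i.e., a single partition with $k = 1$) are an assumed reading of the permutation-symmetry preserving unitaries; all function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def kron_all(ops):
    """Tensor product of a list of single-qubit operators."""
    out = np.eye(1, dtype=complex)
    for op in ops:
        out = np.kron(out, op)
    return out


def local_sum(P, n):
    """Sum over qubits i of P acting on qubit i (assumed generator of S^+)."""
    return sum(kron_all([P if j == i else I2 for j in range(n)]) for i in range(n))


def rotation(theta, generator):
    """exp(i * theta * G) for a Hermitian generator G, via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(generator)
    return (eigvecs * np.exp(1j * theta * eigvals)) @ eigvecs.conj().T


def column_span_dim(generators, num_columns, num_samples=200, seed=0):
    """Rank of the first `num_columns` columns of W(theta) over random angles."""
    rng = np.random.default_rng(seed)
    dim = generators[0].shape[0]
    columns = []
    for _ in range(num_samples):
        W = np.eye(dim, dtype=complex)
        for G in generators:
            W = rotation(rng.uniform(0.0, 2.0 * np.pi), G) @ W
        columns.append(W[:, :num_columns])
    return np.linalg.matrix_rank(np.hstack(columns), tol=1e-8)


n = 3

# Second example: d = 2 Pauli rotations, one column, bound 1 * 2^d = 4 < 2^n = 8.
pauli_layers = [kron_all([X, Z, I2]), kron_all([Z, X, X])]
pauli_dim = column_span_dim(pauli_layers, num_columns=1)

# Third example: one partition of size 3, one column, bound (m_1 + 1) = 4.
symmetric_layers = [local_sum(X, n), kron_all([Z, Z, Z]), local_sum(Z, n), local_sum(X, n)]
symmetric_dim = column_span_dim(symmetric_layers, num_columns=1)

print(pauli_dim, symmetric_dim, 2 ** n)
```

Random sampling only gives a lower estimate of the true span dimension, so this is a sanity check of the upper bounds rather than a proof.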
In all of the above cases we see that we can control the upper bound on the VC dimension by varying the rank $\ell$ of the final measurement. It is worth noting that in these cases the regularized explicit quantum linear classifiers will generally give rise to a different model than the implicit approach, without any theoretical guarantee regarding which will do better, because the standard relationship between the two models [17] no longer holds (i.e., the regularized explicit model does not necessarily correspond to a kernel method anymore).
Secondly, recall that by tuning the Frobenius norms of the observables used by a quantum linear classifier, we can balance the trade-off between its fat-shattering dimension and its ability to achieve large margins. In particular, this shows that we can implement structural risk minimization of quantum linear classifiers with respect to the fat-shattering dimension by regularizing the Frobenius norms of the observables. Again, it is important to note that the Frobenius norm itself does not fully characterize the generalization performance, since one also has to take into account the functional margin on training examples. In particular, to optimize the generalization performance one has to minimize the Frobenius norm, while ensuring that the functional margin on training examples stays large. As mentioned earlier, one way to achieve this is by maximizing the geometric margin, which on a set of examples $\{x_i\}$ is given by $\min_i |\mathrm{Tr}[O\rho_\Phi(x_i)] - d| / \|O\|_F$. As before, for explicit quantum linear classifiers, we can estimate the Frobenius norm by sampling random computational basis states and computing the average of $\lambda(i)^2$ on them in order to estimate $\|O^{\lambda}_{\theta}\|_F = \sqrt{\sum_{i=1}^{2^n} \lambda(i)^2}$ (note that in some cases the Frobenius norm can be computed more directly). On the other hand, for implicit quantum linear classifiers, we can regularize the Frobenius norm by regularizing $\|\alpha\|_1$, as $\|O_\alpha\|_F \leq \|\alpha\|_1$. However, if the weights are obtained by solving the usual quadratic program [1,2], then the resulting observable is already (optimally) regularized with respect to the Frobenius norm [17].
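The two quantities in this paragraph are easy to compute for small systems. The sketch below evaluates the geometric margin of an observable on a set of density matrices and checks that the Frobenius norm of an explicit observable $W^\dagger \mathrm{diag}(\lambda) W$ depends only on the values of $\lambda$. The random unitary, the values of $\lambda$ and the toy states are illustrative assumptions.

```python
import numpy as np


def geometric_margin(O, d, states):
    """min_i |Tr[O rho_i] - d| / ||O||_F over the given density matrices."""
    functional = [abs(np.real(np.trace(O @ rho)) - d) for rho in states]
    return min(functional) / np.linalg.norm(O, "fro")


# Toy single-qubit example: the geometric margin is invariant under rescaling.
Z = np.diag([1.0, -1.0])
states = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
margin = geometric_margin(Z, 0.0, states)
margin_rescaled = geometric_margin(10.0 * Z, 0.0, states)

# Explicit observable W^dagger diag(lambda) W with an arbitrary unitary W.
rng = np.random.default_rng(1)
W, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
lam = np.array([1.0, -1.0, 0.5, 0.0])  # postprocessing values lambda(i)
O_lam = W.conj().T @ np.diag(lam) @ W
frobenius = np.linalg.norm(O_lam, "fro")

print(margin, margin_rescaled, frobenius, np.sqrt(np.sum(lam ** 2)))
```

Because the geometric margin is scale invariant, maximizing it avoids the trivial strategy of inflating the functional margin by rescaling the observable.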
Besides the types of regularization for which we have established theoretical evidence of the effect on structural risk minimization, there are also other types of regularization that are important to consider. For instance, for explicit quantum linear classifiers, one could regularize the angles of the parameterized quantum circuit [32]. Theoretically analyzing the effect that regularizing the angles of the parameterized quantum circuit has on structural risk minimization would constitute an interesting direction for future research. Another example is regularizing circuit parameters such as depth, width and number of gates for which certain theoretical results are known [19,18]. Finally, it turns out that one can also regularize quantum linear classifiers by running the circuits under varying levels of noise [21]. For these kinds of regularization the relationships between the regularized explicit and regularized implicit quantum linear classifiers are still to be investigated.

A Proofs of Section 3.1

A.1 Proofs of Proposition 3 and Lemma 4
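Proposition 3. Let $\mathcal{O} \subseteq \mathrm{Herm}(\mathbb{C}^{2^n})$ be a family of $n$-qubit observables with $r = \dim\big(\sum_{O \in \mathcal{O}} \mathrm{Im}\, O\big)$ (see footnote 8). Then, the VC dimension of $C^{\mathcal{O}}_{\mathrm{qlin}}$ satisfies
$$\mathrm{VC}(C^{\mathcal{O}}_{\mathrm{qlin}}) \leq \dim(\mathrm{Span}(\mathcal{O})) + 1 \leq r^2 + 1.$$

Proof. Define $V = \sum_{O \in \mathcal{O}} \mathrm{Im}\, O \subset \mathbb{C}^{2^n}$ and let $P_V$ denote the orthogonal projector onto $V$. Let $\Phi : \mathcal{X} \to \mathrm{Herm}(\mathbb{C}^{2^n})$ denote the feature map of $C^{\mathcal{O}}_{\mathrm{qlin}}$ and define $\Phi' = P_V \Phi P_V$. Since every $O \in \mathcal{O}$ satisfies $O = P_V O P_V$, we have $\mathrm{Tr}[O\Phi(x)] = \mathrm{Tr}[O\Phi'(x)]$ for all $x$, so the classifiers in $C^{\mathcal{O}}_{\mathrm{qlin}}$ act as linear classifiers on $\mathrm{Herm}(V)$. It is known that the VC dimension of linear classifiers on $\mathbb{R}^\ell$ is $\ell + 1$, and it is clear that $\mathrm{Herm}(V) \simeq \mathrm{Herm}(\mathbb{C}^r) \simeq \mathbb{R}^{r^2}$. Also, note that $\mathrm{Span}(\mathcal{O})$ is a subspace of $\mathrm{Herm}(V)$. We therefore conclude that
$$\mathrm{VC}(C^{\mathcal{O}}_{\mathrm{qlin}}) \leq \dim(\mathrm{Span}(\mathcal{O})) + 1 \leq \dim(\mathrm{Herm}(V)) + 1 = r^2 + 1.$$

Lemma 4. Let $H = \mathrm{Span}_{\mathbb{C}}\{|\psi_i(\theta)\rangle : i = 1, \dots, L,\ \theta \in \mathbb{R}^m\}$ and $V = \mathrm{Span}_{\mathbb{R}}\{|\psi_i(\theta)\rangle\langle\psi_i(\theta)| : i = 1, \dots, L,\ \theta \in \mathbb{R}^m\}$. Then $\dim(H) \leq \dim(V) \leq \dim(H)^2$.

Proof. First, we note that $V$ is contained in the space of Hermitian operators on $H$. Since the dimension of the space of Hermitian operators on $H$ is equal to $\dim(H)^2$, this implies that $\dim(V) \leq \dim(H)^2$. Next, we fix a basis of $H$ which we denote $\{|\psi_k\rangle\}_{k=1}^{\dim(H)}$, where each $|\psi_k\rangle$ is of the form $|\psi_i(\theta)\rangle$ for some $i \in \{1, \dots, L\}$ and $\theta \in \mathbb{R}^m$. To show that $\dim(V) \geq \dim(H)$, we will show that the operators $\{|\psi_k\rangle\langle\psi_k|\}_{k=1}^{\dim(H)} \subset V$ are linearly independent. We do so by contradiction, i.e., we assume they are not linearly independent and show that this leads to a contradiction. That is, we assume that there exist a $k \in \{1, \dots, \dim(H)\}$ and $\{\alpha_{k'}\}_{k' \neq k} \subset \mathbb{R}$ such that
$$|\psi_k\rangle\langle\psi_k| = \sum_{k' \neq k} \alpha_{k'} |\psi_{k'}\rangle\langle\psi_{k'}|.$$
Applying both sides to the unit vector $|\psi_k\rangle$ gives $|\psi_k\rangle = \sum_{k' \neq k} \alpha_{k'} \langle\psi_{k'}|\psi_k\rangle\, |\psi_{k'}\rangle$, which implies that $\{|\psi_k\rangle\}_{k=1}^{\dim(H)}$ are not linearly independent. This clearly contradicts the assumption that $\{|\psi_k\rangle\}_{k=1}^{\dim(H)}$ is a basis of $H$. We therefore conclude that the operators $\{|\psi_k\rangle\langle\psi_k|\}_{k=1}^{\dim(H)} \subset V$ are linearly independent, which shows that $\dim(V) \geq \dim(H)$.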
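For a finite family of observables, both quantities in Proposition 3 can be computed with standard linear algebra, as the following sketch shows. The helper names and the three example observables are hypothetical; the sum of the images is obtained as the column space of the observables placed side by side, and the real span is obtained by flattening each Hermitian matrix into a real vector.

```python
import numpy as np


def sum_of_images_dim(observables):
    """r = dim(sum_O Im O): rank of the observables stacked side by side."""
    return np.linalg.matrix_rank(np.hstack(observables), tol=1e-9)


def real_span_dim(observables):
    """dim Span(O) as a real vector space of Hermitian matrices."""
    vectors = [np.concatenate([M.real.ravel(), M.imag.ravel()]) for M in observables]
    return np.linalg.matrix_rank(np.array(vectors), tol=1e-9)


def projector(vector):
    """Rank-1 projector onto the normalized vector."""
    v = np.asarray(vector, dtype=complex)
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())


# Three rank-1 observables on two qubits whose images overlap in a 2-dim subspace.
family = [
    projector([1, 0, 0, 0]),
    projector([0, 1, 0, 0]),
    projector([1, 1, 0, 0]),
]

r = sum_of_images_dim(family)
span_dim = real_span_dim(family)
print("VC bounds:", span_dim + 1, "<=", r ** 2 + 1)
```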

A.2 Relationship between VC dimension bound and ranks of the observables
In this section we discuss one possible way to relate the quantity r in Proposition 3 with the ranks of the observables by considering the overlaps of the images of the observables. Specifically, consider a family of observables {O i } n i=1 , where each observable is of rank R 10 . Next, define the quantities and In Lemma 9 below we can thus w.l.o.g. consider the case where the family of observables is finite.

Lemma 9. Consider a family of observables
Proof. The proof is basically a repeated application of the formula Specifically, by repeatedly applying the above formula we find that The results in this section hold more generally for families with varying ranks, though for simplicity (and to more closely relate it to Proposition 6) we assume all observables have some fixed rank R (from which it should be clear how to adapt it to the case where the observables can have different ranks).

A.3 Proof of Proposition 5
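Proposition 5. Let $\mathcal{O} \subseteq \mathrm{Herm}(\mathbb{C}^{2^n})$ be a family of $n$-qubit observables with $\eta = \max_{O \in \mathcal{O}} \|O\|_F$. Then, the fat-shattering dimension of $C^{\mathcal{O}}_{\mathrm{qlin}}$ is upper bounded by
$$\mathrm{fat}_{C^{\mathcal{O}}_{\mathrm{qlin}}}(\gamma) \leq \min\{9\eta^2/\gamma^2,\ 4^n + 1\} + 1.$$

Proof. Due to the close relationship to standard linear classifiers, we can utilize previously obtained results in that context. In particular, for our approach we use the following proposition.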
Proposition 10 (Fat-shattering dimension of linear functions [33]). Consider the family of real-valued functions on the ball of radius $R$ inside $\mathbb{R}^N$ given by
$$F_{\mathrm{lin}} = \{ f_w(x) = \langle w, x \rangle \mid w \in \mathbb{R}^N,\ \|w\| = 1 \}.$$
The fat-shattering dimension of $F_{\mathrm{lin}}$ can be bounded by
$$\mathrm{fat}_{F_{\mathrm{lin}}}(\gamma) \leq \min\{9R^2/\gamma^2,\ N + 1\} + 1.$$

The context in the above proposition is closely related to, yet slightly different from, that of quantum linear classifiers. Firstly, $n$-qubit density matrices lie within the ball of radius $R = 1$ inside $\mathrm{Herm}(\mathbb{C}^{2^n})$ equipped with the Frobenius norm. However, as in our case the hyperplanes arise from the family of observables $\mathcal{O}$, whose Frobenius norms are upper bounded by $\eta$, we cannot directly apply the above proposition. We therefore adapt the above proposition by exchanging the role of $R$ with the upper bound on the norms of the observables in $\mathcal{O}$, resulting in the following lemma.

Lemma 11. Consider the family of real-valued functions on the ball of radius $R = 1$ inside $\mathbb{R}^N$ given by
$$F^{\leq \eta}_{\mathrm{lin}} = \{ f_w(x) = \langle w, x \rangle \mid w \in \mathbb{R}^N,\ \|w\| \leq \eta \}.$$
The fat-shattering dimension of $F^{\leq \eta}_{\mathrm{lin}}$ can be upper bounded by
$$\mathrm{fat}_{F^{\leq \eta}_{\mathrm{lin}}}(\gamma) \leq \min\{9\eta^2/\gamma^2,\ N + 1\} + 1.$$

Proof. Let us first determine the fat-shattering dimension of the family of linear functions with norm precisely equal to $\eta$ on points that lie within the ball of radius $R = 1$, i.e.,
$$F^{=\eta}_{\mathrm{lin}} = \{ f_w(x) = \langle w, x \rangle \mid w \in \mathbb{R}^N,\ \|w\| = \eta \}.$$
Suppose $F^{=\eta}_{\mathrm{lin}}$ can $\gamma$-shatter a set of points $\{x_1, \dots, x_k\}$ that lie within the ball of radius $R = 1$. Because $\langle w, x_i \rangle = \langle w/\eta, \eta x_i \rangle$, we find that $F^{=1}_{\mathrm{lin}}$ can $\gamma$-shatter the set of points $\{\eta x_1, \dots, \eta x_k\}$ that lie within the ball of radius $R = \eta$. By Proposition 10 we have $k \leq \min\{9\eta^2/\gamma^2, N + 1\} + 1$. Thus, the fat-shattering dimension of $F^{=\eta}_{\mathrm{lin}}$ on points within the ball of radius $R = 1$ is upper bounded by $\mathrm{fat}_{F^{=\eta}_{\mathrm{lin}}}(\gamma) \leq \min\{9\eta^2/\gamma^2, N + 1\} + 1$. To conclude the desired result, note that this bound is monotonically increasing in $\eta$, and thus allowing hyperplanes with norm $\|w\| < \eta$ will not increase the fat-shattering dimension.
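To get a feeling for how this bound behaves, the expression $\min\{9\eta^2/\gamma^2, N + 1\} + 1$ from the proof above can be evaluated directly. The function name below is hypothetical, and the snippet only evaluates the upper bound; it does not compute the fat-shattering dimension itself.

```python
def fat_shattering_bound(eta, gamma, dim):
    """Upper bound min{9 * eta^2 / gamma^2, dim + 1} + 1 from the proof of Lemma 11."""
    if eta < 0 or gamma <= 0:
        raise ValueError("eta must be non-negative and gamma must be positive")
    return min(9.0 * eta ** 2 / gamma ** 2, dim + 1) + 1


# For n qubits the ambient dimension is N = 4^n (Hermitian operators on C^(2^n)).
n = 2
for eta in (0.25, 0.5, 1.0, 2.0):
    print(eta, fat_shattering_bound(eta, gamma=1.0, dim=4 ** n))
```

For small $\eta/\gamma$ the norm-dependent term dominates, while for large $\eta/\gamma$ the bound saturates at the dimension-dependent term $N + 2$.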
From the above lemma we can immediately infer an upper bound on the fat-shattering dimension of quantum linear classifiers by identifying that, as vector spaces, $\mathrm{Herm}(\mathbb{C}^{2^n}) \simeq \mathbb{R}^{4^n}$.
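One concrete way to realize this identification is sketched below: a Hermitian matrix is mapped to a real vector built from its diagonal and the rescaled real and imaginary parts of its upper triangle, so that the trace inner product becomes the Euclidean inner product and the Frobenius norm becomes the Euclidean norm. This particular coordinate choice is an assumption made for illustration; any orthonormal basis of the Hermitian matrices would serve.

```python
import numpy as np


def herm_to_real_vector(A):
    """Map a d x d Hermitian matrix to R^(d^2) such that Tr[A B] = <vec(A), vec(B)>."""
    d = A.shape[0]
    upper = np.triu_indices(d, k=1)
    return np.concatenate([
        A.diagonal().real,
        np.sqrt(2.0) * A[upper].real,
        np.sqrt(2.0) * A[upper].imag,
    ])


def random_hermitian(d, rng):
    """Random Hermitian matrix, used only to exercise the map."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (G + G.conj().T) / 2


rng = np.random.default_rng(0)
A, B = random_hermitian(4, rng), random_hermitian(4, rng)
vec_A, vec_B = herm_to_real_vector(A), herm_to_real_vector(B)

print(vec_A.shape)                               # dimension 4^n for n = 2 qubits
print(np.real(np.trace(A @ B)), vec_A @ vec_B)   # the two inner products agree
```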

A.3.1 Sample complexity in the PAC-learning framework
Besides being related to generalization performance, the fat-shattering dimension is also related to the so-called sample complexity in the probably approximately correct (PAC) learning framework [29]. The sample complexity captures the number of classifier queries required to find another classifier that with high probability agrees with the former classifier on unseen examples. By plugging the upper bound of Proposition 5 into previously established theorems on the sample complexity of families of classifiers [34,35], we derive the following corollary, which can be viewed as a dual of the result of [36].
Then, with probability at least $1 - \delta$ over $P$, we have that

Proof. Follows directly from plugging the upper bound of Proposition 5 into Corollary 2.4 of [36].

B.1 Proof of Proposition 6

For every $0 < \varepsilon < \delta$ we have that $O' = O + \varepsilon P$ has $\mathrm{rank}(O') = r$. What remains to be shown is that $c_{O',b} \in C^{(r)}_{\mathrm{qlin}}$ correctly classifies $D$. To do so, first let $x \in D^{+}$ (i.e., labeled $+1$) and note that which shows that indeed $c_{O',b}(x) = +1$. Next, let $x \in D^{-}$ (i.e., labeled $-1$) and note that which shows that indeed $c_{O',b}(x) = -1$.

(ii): We will describe a protocol that queries a classifier $c_{O,b}$ and based on its outcomes checks whether $O$ is approximately equal to a fixed target observable $T$ of rank $r$. We will show that if the queries to $c_{O,b}$ are labeled in a way that agrees with the target classifier that uses the observable $T$, then the spectrum of $O$ has to be point-wise within distance $\varepsilon$ of the spectrum of $T$. In particular, this will show that the rank of $O$ has to be at least $r$ if we make $\varepsilon$ small enough. Consequently, if the rank of $O$ is less than $r$, then at least one query made during the protocol has to be labeled differently by $c_{O,b}$ than by the target classifier. In the end, the queries made to the classifier during the protocol will therefore constitute the set of examples described in the proposition.
Let us start with some definition. For a classifier c O,b (ρ) = sgn Tr Oρ − b we define its effective observable O eff = O − bI which we express in the computational basis as O eff = (O ij ). Next, we define our target classifier to be c T ,−1 where the observable T is given by and we define its effective observable T eff = T + I which we express in the computational basis as T eff = (T ij ). Rescaling O eff with a positive scalar does not change the output of the corresponding classifier. Therefore, to make the protocol well-defined, we define O eff to be the unique effective observable whose first diagonal element is scaled to be equal to O 00 = −(r + 1).
Our approach is as follows. First, we query $c_{O,b}$ in such a way that if the outcomes agree with the target classifier $c_{T,-1}$, then the absolute values of the off-diagonal entries in the first row and column of $O_{\mathrm{eff}}$ must be close to zero (i.e., approximately equal to those of $T_{\mathrm{eff}}$). Afterwards, we again query $c_{O,b}$ but now in such a way that if the outcomes agree with the target classifier $c_{T,-1}$, then the diagonal elements of $O_{\mathrm{eff}}$ must be approximately equal to those of $T_{\mathrm{eff}}$. In the end, we query $c_{O,b}$ one final time but this time in such a way that if the outcomes agree with the target classifier $c_{T,-1}$, then the absolute values of the remaining off-diagonal elements of $O_{\mathrm{eff}}$ must be close to zero (i.e., again approximately equal to those of $T_{\mathrm{eff}}$). Finally, we use Gershgorin's circle theorem to show that the spectrum of $O_{\mathrm{eff}}$ has to be point-wise close to the spectrum of $T_{\mathrm{eff}}$. We remark that this procedure could be generalized to a more complete tomography approach, where one uses queries to the classifier $c_{O,b}$ in order to reconstruct the entire spectrum of $O_{\mathrm{eff}}$.
First, we query the quantum states |i for i = 0, . . . , 2 n − 1. Without loss of generality, we can assume that the classifiers c O,b and c T ,−1 agree on the label, i.e., In order to show that the absolute value of the off-diagonal elements of the first row and column of O eff must be close to zero and that the diagonal elements of O eff must be close to those of T eff , we consider the quantum states given by Its expectation value with respect to O eff is given by and its expectation value with respect to T eff is given by Crucially, by Equation (27) we know that the label of |γ θ (α) goes from −1 to +1 as α goes 0 → 1. Note that the expectation value of |γ θ (α) with respect to T eff is independent from the phase θ.
To determine that O 0j is smaller than δ > 0, we query the classifier c O,b on the states |γθ(α) for allθ in a ζ-mesh of [0, 2π) and for allα in a ξ-mesh of [0, 1] and we suppose they are labeled the same as the target classifier c T ,−1 would label them. Using these queries we can find estimatesα O eff cross (θ) that are ξ-close to the unique α O eff cross (θ) = α that satisfies by finding the smallestα where the label has gone from −1 to +1. We refer to the α satisfying Equation (31) as the crossing point at phase θ. Because the label assigned by c T ,−1 does not depend on the phase θ, and since all states |γθ(α) were assigned the same label by c O,b and c T ,−1 , we find that the crossing point estimateα O eff cross (θ) is the same for allθ. In particular, this implies that the actual crossing points α O eff cross (θ) have to be within ξ-distance of each other for allθ. Before we continue, we first show that if c O,b assigns the same labels as c T ,−1 , then O jj is bounded above by a quantity that only depends on n. Fixθ to be any point inside the ζ-mesh such that Cθ ≤ 0, and define the function ). Therefore, if c O,b and c T ,−1 agree on the entire ξ-mesh for a small enough ξ, then it must hold that α O eff cross (θ) ∈ ( 1 2 , 2 n +1 2 n +2 ). By the mean value theorem there exists an α ∈ (α O eff cross (θ), 2 n +1 2 n +2 ) such that After some rewriting, we can indeed conclude from the above equation that O jj is bounded above by a quantity that only depends on n.
To determine that O jj is within distance δ > 0 of T jj we again query the classifier c O,b but this time on the states |γ 0 (α) for allα in a ξ -mesh of [0, 1] and we suppose they are labeled the same as the target classifier c T ,−1 would. Using these queries we can find estimatesα O eff cross (0),α T eff cross (0) that are ξ -close to the corresponding actual crossing point. As we assumed that all queries are labeled the same by c O,b and c T ,−1 , the crossing point estimateα O eff cross (0) has to be equal to the crossing point estimate α T eff cross (0). In particular, this implies that the actual crossing points α O eff cross (0) and α T eff cross (0) have to be within ξ -distance of each other. Next, define g(α, C) to be the unique coefficient O ∈ R ≥0 that satisfies It is clear that g is a continuous function in α and C that is independent from c O,b , and that T jj = g(α T eff cross (0), 0) and O jj = g(α O eff cross (0), C 0 ). Finally, we let δ > 0 and ξ > 0 be small enough such that if In conclusion, to determine that O jj is within distance δ > 0 of T jj we first do the required queries to determine that C 0 = O 0j < δ, after which we do the required queries to determine that α O eff cross (0) − α T eff cross (0) < ξ , which together indeed implies that O jj is within distance δ > 0 of T jj . In order to show that the absolute value of the remaining off-diagonal elements of O eff must be close to zero (i.e., close to those of T eff ) we consider the quantum states given by Its expectation value with respect to O eff is given by where C θ := Re e iθ (O 0j + O ij ) , and its expectation value with respect to T eff is given by Crucially, by our choice of T we know that the label of |µ θ (α) goes from −1 to +1 as α goes 0 → 1. Note that the expectation value of |µ θ (α) with respect to T eff is independent from the phase θ.
To determine that O ij is smaller than δ > 0 for i, j ≥ 1 and i = j, we query the classifier c O,b on the states |γθ(α) for allθ in a ζ -mesh of [0, 2π) and for allα in a ξ -mesh of [0, 1] and we suppose they are labeled the same as the target classifier c T ,−1 would. Using these queries we can find estimateŝ α O eff cross (θ) that are ξ-close to the unique α O eff cross (θ) = α that satisfies by finding the smallestα where the label has gone from −1 to +1. Because the label assigned by c T ,−1 does not depend on the phase θ, and since all states |µθ(α) were assigned the same label by c O,b and c T ,−1 , we find that the crossing point estimateα O eff cross (θ) is the same for allθ. In particular, this implies that the actual crossing points α O eff cross (θ) have to be within ξ -distance of each other for allθ. Subsequently, write O 0j + O ij = O 0j + O ij e iφ with φ ∈ [0, 2π), letθ abs denote the point in the ζ -mesh of [0, 2π) that is closest to 2π − φ, and letθ 0 denote the point in the ζ -mesh of [0, 2π) that is closest to π/2 − φ modulo 2π. By our previous discussion we know that α O eff cross (θ abs ) − α O eff cross (θ 0 ) < ξ , which implies where h is a continuous function (independent from c O,b and c T ,−1 ) with h(ξ ) → 0 as ξ → 0. Moreover, using the inequality cos(ζ ) ≥ 1 − λζ , where λ ≈ 0.7246 is a solution of λ π − arcsin(λ) = 1 + √ 1 − λ 2 , together with the inequality cos(π/2 − ζ ) ≤ ζ , we can derive that Finally, by combining Equation (39) with Equation (40) we can conclude that which for ξ and ζ small enough shows that O 0j + O ij < δ /2 (i.e., the fineness of both meshes ξ and ζ will depend on the choice of δ ). In conclusion, to determine that O ij is smaller than δ > 0 we first do the required queries to determine that O 0j < δ /2, after which we do the required queries to determine that O 0j + O ij < δ /2, which together indeed implies that O ij < δ . 
All in all, we have described a (finite) set of states such that if the label assigned by $c_{O,b}$ agrees with the label assigned by $c_{T,-1}$, then the absolute values of the off-diagonal elements of the first row of $O_{\mathrm{eff}}$ have to be smaller than $\delta$, the diagonal elements of $O_{\mathrm{eff}}$ have to be within $\delta'$-distance of those of $T_{\mathrm{eff}}$, and the absolute values of the remaining off-diagonal elements of $O_{\mathrm{eff}}$ have to be smaller than $\delta''$. Finally, we choose $\delta, \delta', \delta'' = 1/2^{n+1}$ and use the above protocol to establish that for $1 \leq i \leq r - 1$ the Gershgorin discs $D_i$ of $O_{\mathrm{eff}}$ (i.e., with center $O_{ii}$ and radius $\sum_{j \neq i} |O_{ij}|$) have to be contained in the disks $\widehat{D}_i$ with center $i + 1$ and radius $1/2$. Moreover, we establish that the Gershgorin disc $D_0$ has to be contained in the disk $\widehat{D}_0$ with center $-r + 1$ and radius $1/2$. Since the disks $\widehat{D}_i$ are disjoint, so are the Gershgorin discs $D_i$, which implies that $O_{\mathrm{eff}}$ must have at least $r$ distinct eigenvalues, and thus that $\mathrm{rank}\, O \geq r$. Consequently, if $\mathrm{rank}\, O < r$, then $c_{O,b}$ must disagree with $c_{T,-1}$ on the label of at least one of the states queried during the protocol.
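The Gershgorin argument can be illustrated numerically. The sketch below builds a toy Hermitian matrix whose diagonal is close to $-r+1, 2, \dots, r$ and whose off-diagonal entries are small, and then checks that its Gershgorin discs sit inside the disks of radius $1/2$ around those targets, that every disc holds exactly one eigenvalue, and that the matrix has rank $r$. The size $r = 4$, the tolerance `delta`, the random perturbation, and the helper `gershgorin_discs` are assumptions made for this example; they are not the matrices of the proof.

```python
import numpy as np

def gershgorin_discs(A):
    """Centers A_ii and radii sum_{j != i} |A_ij| of the Gershgorin discs of A."""
    centers = np.real(np.diag(A))
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return centers, radii

r = 4
delta = 1.0 / 2 ** 5  # hypothetical tolerance for this toy example
targets = np.array([-r + 1] + [i + 1 for i in range(1, r)], dtype=float)

# Hermitian perturbation whose entries all have magnitude below delta.
rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, (r, r)) + 1j * rng.uniform(-1, 1, (r, r))
P = (P + P.conj().T) / 2
P *= 0.9 * delta / np.max(np.abs(P))
O_eff = np.diag(targets).astype(complex) + P

centers, radii = gershgorin_discs(O_eff)
# Each Gershgorin disc lies inside the disk of radius 1/2 around its target.
contained = np.abs(centers - targets) + radii < 0.5

# Disjoint discs each hold exactly one eigenvalue (both lists are ascending).
eigs = np.linalg.eigvalsh(O_eff)
in_own_disc = np.abs(eigs - centers) <= radii + 1e-12

print(contained.all(), in_own_disc.all(), np.linalg.matrix_rank(O_eff))
```

In this toy instance none of the target disks contains $0$, so all $r$ eigenvalues are distinct and non-zero, which is what forces full rank.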

B.2 Proof of Proposition 7
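Proposition 7. Let $\mathcal{C}_{\mathrm{lin}}(\Phi)$ denote the family of linear classifiers that is equipped with a feature map $\Phi$. Also, let C

(iii) For every quantum feature map $\Phi : \mathbb{R}^{\ell} \to \mathrm{Herm}(\mathbb{C}^{2^n})$, there exists a classical feature map $\Phi' : \mathbb{R}^{\ell} \to \mathbb{R}^{4^n}$ such that the families of linear classifiers satisfy $\mathcal{C}_{\mathrm{qlin}}(\Phi) = \mathcal{C}_{\mathrm{lin}}(\Phi')$.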
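Proof. (i): First, we define the feature map $\Phi' : \mathbb{R}^{\ell} \to \mathbb{R}^{N+1}$ which maps

where $e_{N+1}$ denotes the $(N + 1)$-th standard basis vector. Note that this feature map indeed satisfies $\|\Phi'(x)\| = 1$ for all $x \in \mathbb{R}^{\ell}$. Next, for any classifier $c_{w,b} \in \mathcal{C}_{\mathrm{qlin}}(\Phi)$ we define $w' = w$ and $b' = b/M$, and we note that for any $x \in \mathbb{R}^{\ell}$ we have

(ii): First, we define the feature map $\tilde{\Phi} : \mathbb{R}^{\ell} \to \mathbb{R}^{N+1}$ which maps $x \mapsto \Phi(x) + e_{N+1}$, where $\Phi(x)$ is embedded in the first $N$ coordinates and $e_{N+1}$ denotes the $(N + 1)$-th standard basis vector. Next, for any classifier $c_{w,b} \in \mathcal{C}_{\mathrm{lin}}(\Phi)$ we define $\tilde{w} = w - b\, e_{N+1}$, and we note that for all $x \in \mathbb{R}^{\ell}$ we have $c_{\tilde{w},0}(\tilde{\Phi}(x)) = \mathrm{sign}\left(\langle \tilde{\Phi}(x), \tilde{w} \rangle\right) = \mathrm{sign}\left(\langle \Phi(x), w \rangle - b\right) = c_{w,b}(\Phi(x))$.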
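The bias-absorption identity at the start of part (ii) can be checked with a few lines of Python. The sketch assumes that the lifted feature map appends a constant coordinate equal to $1$ (i.e., adds $e_{N+1}$ after embedding into $\mathbb{R}^{N+1}$), which is one map consistent with the identity above. The helpers `lift`, `absorb_bias`, and `linear_label`, as well as the random feature vectors standing in for $\Phi(x)$, are hypothetical.

```python
import numpy as np

def lift(features):
    """Embed a feature vector in R^{N+1} and add e_{N+1} (append the constant 1)."""
    return np.append(features, 1.0)

def absorb_bias(w, b):
    """Weight vector w - b * e_{N+1} in R^{N+1}."""
    return np.append(w, -b)

def linear_label(features, w, b):
    """Label sign(<features, w> - b) of a linear classifier."""
    return 1 if np.dot(features, w) - b >= 0 else -1

rng = np.random.default_rng(1)
N = 5
w = rng.normal(size=N)
b = 0.3
samples = rng.normal(size=(200, N))  # stand-ins for Phi(x) on 200 inputs

w_tilde = absorb_bias(w, b)
agree = all(
    linear_label(lift(v), w_tilde, 0.0) == linear_label(v, w, b)
    for v in samples
)
print(agree)
```

The bias-free classifier on the lifted features reproduces every label of the original classifier, so the rest of the argument only needs to handle the case $b = 0$.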
Therefore, it suffices to show that we can implement any linear classifier on $\mathbb{R}^{N+1}$ with $b = 0$ as a quantum linear classifier on $n = \lceil \log(N + 1) \rceil + 1$ qubits. To do so, we define the quantum feature map $\Phi' : \mathbb{R}^{\ell} \to \mathrm{Herm}(\mathbb{C}^{2^n})$ which maps

where $|0\rangle$ is a vector that does not lie in the support of $\tilde{\Phi}$ (note that such a vector exists since we have chosen $n$ large enough). Finally, for any linear classifier $c_{w,0} \in \mathcal{C}_{\mathrm{lin}}(\tilde{\Phi})$ on $\mathbb{R}^{N+1}$ we define $b' = \|w\|^2/2$ and $O' = |w'\rangle\langle w'|$, where $|w'\rangle = |w\rangle + \|w\|\, |0\rangle$, and we note that for all $x \in \mathbb{R}^{\ell}$ we have

qlin whose observable is given by

We remark that $c_{O,0}$ can indeed classify the set of examples $D_r$ with margin $\eta/\sqrt{m}$. Now suppose $c_{O',b'} \in \mathcal{C}^{\eta'}_{\mathrm{qlin}}$ with $\eta' < \eta$ can classify $D_m$ with margin $\gamma$, that is

By combining these two inequalities we find that

Finally, by the Cauchy-Schwarz inequality we find that

Combining Equations (42) and (43) we find that

from which we can conclude that $\gamma < \eta/\sqrt{m}$.