The complexity of quantum support vector machines

Quantum support vector machines employ quantum circuits to define the kernel function. It has been shown that this approach offers a provable exponential speedup compared to any known classical algorithm for certain data sets. The training of such models corresponds to solving a convex optimization problem either via its primal or dual formulation. Due to the probabilistic nature of quantum mechanics, the training algorithms are affected by statistical uncertainty, which has a major impact on their complexity. We show that the dual problem can be solved in $O(M^{4.67}/\varepsilon^2)$ quantum circuit evaluations, where $M$ denotes the size of the data set and $\varepsilon$ the solution accuracy compared to the ideal result from exact expectation values, which is only obtainable in theory. We prove under an empirically motivated assumption that the kernelized primal problem can alternatively be solved in $O(\min \{ M^2/\varepsilon^6, \, 1/\varepsilon^{10} \})$ evaluations by employing a generalization of a known classical algorithm called Pegasos. Accompanying empirical results demonstrate these analytical complexities to be essentially tight. In addition, we investigate a variational approximation to quantum support vector machines and show that their heuristic training achieves considerably better scaling in our experiments.


Introduction
Finding practically relevant problems where quantum computation offers a speedup compared to the best known classical algorithms is one of the central challenges in the field. Quantifying a speedup requires a provable convergence rate of the quantum algorithms, which restricts us to studying algorithms that can be analyzed rigorously. The impressive recent progress on building quantum computers gives us a new possibility: We can use heuristic quantum algorithms that can be run on current devices to demonstrate the speedup empirically. This, however, requires a hardware-friendly implementation, i.e., a moderate number of qubits and shallow circuits.
In recent years, more and more evidence has been found supporting machine learning tasks as good candidates for demonstrating quantum advantage [1][2][3][4]. In particular, the so-called supervised learning setting, where in the simplest case the goal is to learn a binary classifier of classical data, received much attention. The reasons are manifold: (i) The algorithms only require classical access to data. This avoids the common assumption that data is provided in the amplitudes of a quantum state, which is hard to justify [5]. It has been observed that if classical algorithms are provided with analogous sampling access to data, claimed quantum speedups may disappear [6,7].
(ii) Kernelized support vector machines offer a framework that can be analyzed rigorously. In addition, this approach can be immediately lifted to the quantum setting by using a quantum circuit parametrized by the classical data as the feature map. This then defines a quantum support vector machine (QSVM) [2,4].
(iii) It is known that, for certain (artificially constructed) data sets, QSVMs can offer a provable exponential speedup compared to any known classical algorithm [4]. Note, however, that according to current knowledge such QSVMs require deep circuits and, hence, rely on a fault-tolerant quantum computer.
Despite the growing interest in QSVMs, a rigorous complexity-theoretic treatment comparing the dual and primal approaches to training models employing arbitrary quantum kernels is missing (the analysis presented in [8] considers kernel functions defined by quantum states, but these are limited to classical polynomial and RBF kernels). In contrast to the classical case, where the kernel function can be computed exactly, the runtime analysis of QSVMs is complicated by the fact that all kernel evaluations are unavoidably subject to statistical uncertainty stemming from finite sampling, even on a fully error corrected quantum computer. This is the case since the quantum kernel function is defined as an expectation value, which in practice can only be approximated as a sample mean over a finite number of measurement shots. Furthermore, it has recently been shown that quantum kernels can suffer from exponential concentration if one is not careful with the construction of the feature map [9]. Then, an exponential number of measurements is necessary to distinguish the kernel evaluations. However, in this work, we focus on the training of the QSVM and, hence, assume that the feature map which embeds the data through a quantum circuit has been chosen reasonably (see the assumption on noisy halfspace learning in [4, Lemma 14, Definition 15]). Being able to successfully train these models is a necessary prerequisite for the ultimate goals of quantum advantage and good generalization, which is why we think the training stage of the algorithms warrants a detailed investigation of its own. The computational complexity of QSVMs is then defined as the total number of quantum circuit evaluations necessary to find a decision function which, when evaluated on the training set, is ε-close to the solution employing an infinite number of measurement shots. Note that this definition solely in terms of training is legitimate since fitting the models is computationally dominant compared to evaluating the final classifier.
The beauty of QSVMs is that they allow the classification task to be characterized by an efficiently solvable convex optimization problem. Classically, the problem is especially simple in its dual form, where the kernel trick can be used to reduce the optimization to solving a quadratic program. The main drawback of the dual method is that the entire M × M kernel matrix needs to be calculated. When the kernel entries are affected by finite sampling noise, we find that the scaling is even steeper: in order to find an ε-accurate classifier, $O(M^{4.67}/\varepsilon^2)$ quantum circuit evaluations are found to be necessary in the setting of noisy halfspace learning [4, Lemma 14, Definition 15]. Here and throughout the manuscript, ε denotes an upper bound on the absolute difference between two fully trained decision functions, one obtained with a finite number of measurement shots and one with exact expectation values.

Results: Our findings are summarized in Table 1, where additional empirical scalings are included which show that the analytical bounds are essentially tight for practical data sets. In Section 2 we give a brief overview of QSVMs, the optimization problems underlying their training, and the heuristic model designated approximate QSVM. The derivation of the analytical complexity statements for the dual and primal methods can be found in Sections 3.1 and 3.2, respectively. In Section 4, we present the numerical experiments justifying the empirical complexity statements.

Support vector machines
The machine learning task considered in this paper is the supervised training of binary classifiers on classical data using support vector machines (SVMs) [12][13][14]. Given an unknown probability distribution P(x, y) for data vectors x ∈ R^r and class membership labels y ∈ {−1, 1}, we draw a set of training data $X = \{x_1, \ldots, x_M\}$ with corresponding labels $y = \{y_1, \ldots, y_M\}$. Using this set, the SVM defines a classification function c : R^r → {−1, 1} implementing the trade-off between accurately predicting the true labels and maximizing the orthogonal distance (called margin) between the two classes. Once trained, the classification function can be applied to classify previously unseen data drawn from the same probability distribution. While this generalization performance of the models is crucial when solving a concrete machine learning task, the sole focus of this work lies elsewhere, namely in the training of the models. Therefore, throughout this work only a training set and not a test set is considered.
We start by introducing classical kernelized support vector machines. The quantum case then constitutes a straightforward extension where the feature maps are defined by quantum circuits. To conclude the section, we present variational quantum circuits and interpret them as approximate quantum support vector machines.

Kernelized support vector machines
Support vector machines can implement nonlinear decision boundaries in the original data space by transforming the input vectors with a nonlinear feature map. Here, we consider the finite-dimensional case where the feature map is given by φ : R^r → R^s. The decision boundary implemented by some w ∈ R^s is then linear in feature space. We define the hyperplane w⋆ as the solution to the primal optimization problem (1) posed by support vector machines, in which a term containing λ > 0 provides regularization and the second term is a sum over the hinge losses of the data points contained in the training set. Instead of solving the primal optimization problem, modern implementations [15] usually optimize its dual (2). The dual problem is typically favored because the kernel trick can be straightforwardly applied by defining the symmetric, positive semidefinite kernel function
$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ (3)
as the inner product between feature vectors. The solutions of the primal and dual optimization problems are connected via the Karush-Kuhn-Tucker condition $w^\star = \sum_{i=1}^M \alpha^\star_i y_i \phi(x_i)$ [4], and hence the classification function can be rewritten as
$c_{\mathrm{SVM}}(x) = \mathrm{sign}\big( \textstyle\sum_{i=1}^M \alpha^\star_i y_i\, k(x_i, x) \big)$, (4)
from which it is apparent that both the dual optimization problem and the classification function only depend on the kernel function; the feature vectors need not be explicitly computed at any point. From (2), it follows that solving the dual requires the evaluation of the full kernel matrix K ∈ R^{M×M} with entries
$K_{ij} = k(x_i, x_j)$. (5)
Note that given K, the dual optimization can be restated as a convex quadratic program (see (29) in Appendix A.2) and hence solved in polynomial time [16].
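For readers who prefer code, the following minimal classical sketch illustrates the kernel trick in the dual: the solver only ever receives the precomputed kernel matrix. The toy feature map, the data, and the use of scikit-learn's SVC (whose penalty C plays a role analogous to the inverse of the regularization λ) are illustrative assumptions, not the setup used in our experiments.

import numpy as np
from sklearn.svm import SVC

# Toy nonlinear feature map phi: R^2 -> R^3 (an illustrative assumption).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel_matrix(X1, X2):
    """Gram matrix with entries <phi(x_i), phi(x_j)> between two data sets."""
    F1 = np.array([phi(x) for x in X1])
    F2 = np.array([phi(x) for x in X2])
    return F1 @ F2.T

rng = np.random.default_rng(0)
M = 40
X = rng.uniform(-1, 1, size=(M, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5, 1.0, -1.0)   # labels in {-1, +1}

# Solve the dual problem from the precomputed kernel matrix only (kernel trick):
K_train = kernel_matrix(X, X)
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y)

# Prediction likewise only needs kernel evaluations k(x, x_i) against the training set.
X_new = rng.uniform(-1, 1, size=(5, 2))
K_new = kernel_matrix(X_new, X)
print(clf.predict(K_new))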

Primal Estimated sub-GrAdient SOlver (Pegasos)
Alternatively, we can solve the primal problem with an algorithm called Pegasos [17], which arises from the application of stochastic sub-gradient descent to the objective function f(w) of the primal optimization problem (1). Starting with initial weights w_0 = 0, we iteratively optimize w_t for t ∈ {1, . . ., T}. In the following, we analyse how the weights are updated in every step.
As is standard in stochastic gradient descent, we start optimization step t by uniformly sampling a random index i_t ∈ {1, . . ., M}. We then define the partial objective f_t as the hinge loss for the chosen datum x_{i_t} ∈ X ⊂ R^r added to the regularization term,
$f_t(w) = \tfrac{\lambda}{2}\|w\|^2 + \max\{0,\, 1 - y_{i_t} \langle w, \phi(x_{i_t})\rangle\}$,
where φ : R^r → R^s denotes the feature map and λ is the regularization parameter. In order to find the direction of steepest descent in the loss landscape defined by the partial objective f_t, we calculate the (sub-)gradient of f_t with respect to w as
$\frac{\partial f_t}{\partial w} = \lambda w - \mathbb{1}\big[y_{i_t}\langle w, \phi(x_{i_t})\rangle < 1\big]\, y_{i_t}\, \phi(x_{i_t})$.
Next, we update the weights $w_t = w_{t-1} - \eta_t \,\frac{\partial f_t}{\partial w}\big|_{w = w_{t-1}}$, where η_t > 0 is the learning rate. To further facilitate the calculation, we expand $w_{t-1} = \frac{1}{\lambda(t-1)}\sum_{j=1}^M \alpha^{t-1}_j y_j \phi(x_j)$. We can now fix the learning rate η_t = 1/(λt) to simplify the expression as
$w_t = \frac{1}{\lambda t}\sum_{j=1}^M \alpha^{t}_j y_j \phi(x_j)$,
where the coefficients α^t are defined as α^t_j = α^{t−1}_j for j ≠ i_t and
$\alpha^t_{i_t} = \alpha^{t-1}_{i_t} + \mathbb{1}\Big[\tfrac{y_{i_t}}{\lambda(t-1)} \sum_{j=1}^M \alpha^{t-1}_j y_j\, k(x_{i_t}, x_j) < 1\Big]$.
Crucially, the feature map is only accessed via the kernel entries, i.e. the Pegasos algorithm allows for a kernelization of the primal optimization problem. While in theory we calculate the gradient of f_t with respect to w, in practice w is never explicitly calculated. Instead, it suffices to calculate and store the integer coefficients α^t both for training and prediction.
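A compact NumPy sketch of this kernelized update reads as follows; the synthetic kernel matrix and labels are placeholders, and the references to lines 14 and 15 of Algorithm 1 follow the discussion above.

import numpy as np

def pegasos_kernel(K, y, lam, T, rng):
    """Kernelized Pegasos sketch: stochastic sub-gradient descent on the primal.

    K   : (M, M) kernel matrix with entries K[i, j] = k(x_i, x_j)
    y   : (M,) labels in {-1, +1}
    lam : regularization parameter lambda > 0
    T   : number of iterations
    """
    M = len(y)
    alpha = np.zeros(M, dtype=int)        # integer coefficients alpha^t
    for t in range(1, T + 1):
        i = rng.integers(M)               # uniformly sampled index i_t
        if t > 1:
            # margin of datum i_t under w_{t-1} = 1/(lam (t-1)) sum_j alpha_j y_j phi(x_j)
            margin = y[i] * np.dot(alpha * y, K[i]) / (lam * (t - 1))
        else:
            margin = 0.0                  # w_0 = 0, so the hinge loss is active
        if margin < 1:                    # condition checked in line 14 of Algorithm 1
            alpha[i] += 1                 # single-coefficient update (line 15)
    return alpha

# Toy usage with a synthetic positive semidefinite kernel matrix.
rng = np.random.default_rng(1)
F = rng.normal(size=(30, 5))
K = F @ F.T
y = np.where(rng.normal(size=30) > 0, 1, -1)
alpha = pegasos_kernel(K, y, lam=0.1, T=500, rng=rng)
# Decision value for a training point x_m: h(x_m) = sum_j alpha_j y_j K[m, j] / (lam * T)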

Quantum support vector machines
The classical SVM formulation can be straightforwardly adapted to the quantum case by choosing a feature map ψ : R^r → S(2^q), x ↦ |ψ(x)⟩⟨ψ(x)|, where S(2^q) denotes the space of density matrices on q qubits [2]. The kernel function is then given by the Hilbert-Schmidt inner product
$k(x_i, x_j) = \mathrm{Tr}\big[\,|\psi(x_i)\rangle\langle\psi(x_i)|\,|\psi(x_j)\rangle\langle\psi(x_j)|\,\big] = |\langle\psi(x_i)|\psi(x_j)\rangle|^2$ (6)
and can be estimated with the help of a quantum computer as illustrated in Figure 1 [4,11].
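For intuition, the following minimal sketch evaluates (6) directly from statevectors; the random states standing in for |ψ(x_i)⟩ and |ψ(x_j)⟩ are purely illustrative.

import numpy as np

def quantum_kernel(psi_i, psi_j):
    """Fidelity kernel k(x_i, x_j) = |<psi(x_i)|psi(x_j)>|^2 from statevectors."""
    return float(np.abs(np.vdot(psi_i, psi_j)) ** 2)

# Random 2-qubit statevectors standing in for |psi(x_i)>, |psi(x_j)> (illustrative only).
rng = np.random.default_rng(0)
def random_state(q=2):
    v = rng.normal(size=2 ** q) + 1j * rng.normal(size=2 ** q)
    return v / np.linalg.norm(v)

psi_i, psi_j = random_state(), random_state()
print(quantum_kernel(psi_i, psi_i))     # equals 1 for identical states
print(quantum_kernel(psi_i, psi_j))     # lies in [0, 1]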
The major difference between the classically computed kernel in (3) and the quantum one in (6) is that the latter expression can only be evaluated approximately with a finite number of measurement shots, due to the probabilistic nature of quantum mechanics. Hence, to prove a complexity statement for QSVMs, it is crucial to understand the robustness of the primal and dual optimization problems with respect to noisy kernel evaluations. A detailed analysis is included in Section 3.
Figure 1: Quantum kernel estimation [4,11]: Let E(x_i) denote a parametrized unitary fixed by the datum x_i, which defines the feature map |ψ(x_i)⟩ = E(x_i)|0⟩^⊗q. By applying E†(x_j)E(x_i) to |0⟩^⊗q and then measuring all of the qubits in the computational basis, a bit string z ∈ {0, 1}^q is determined. When this process is repeated R times, the frequency of the all-zero outcome approximates the kernel value k(x_i, x_j) in (6).
Note that while such noisy kernel evaluations are necessary for both training and prediction, the complexity of the prediction step is O(S²/ε²) [11] and subdominant compared to training (see Table 1), as S ≤ M and often even S ≪ M for the number of support vectors S [22]. Therefore, our present analysis is restricted to fitting the support vector machines to a training set.

Approximate quantum support vector machines
Compared to the analytically tractable QSVM formulations introduced in Section 2.3, the heuristic model we denote approximate QSVM, defined in the following, potentially offers a favorable computational complexity. The main idea is that, in addition to the feature map encoding the data, |ψ(x)⟩ = E(x)|0⟩^⊗q, analogously to the QSVM, we implement a variational unitary W(θ), whose parameters θ ∈ R^d are trained to solve the classification problem. Figure 2 shows a quantum circuit implementing this architecture.
We define the decision function as the expectation value of the q-fold Pauli-Z operator [2],
$h_\theta(x) = \langle \psi(x) |\, W^\dagger(\theta)\, Z^{\otimes q}\, W(\theta)\, | \psi(x) \rangle$, (7)
which, together with the introduction of a learnable bias b ∈ R, is used to classify the data according to $\hat{y} = \mathrm{sign}\big(h_\theta(x) + b\big)$. Given a loss function L, we use a classical optimization algorithm to minimize the empirical risk on the training data X,
$\min_{\theta,\, b}\; \tfrac{1}{M} \textstyle\sum_{i=1}^M L\big(y_i,\, h_\theta(x_i) + b\big)$.
The resulting model implements a linear decision boundary in the feature space accessed through the feature map circuit E and in this sense approximates a QSVM employing the same feature map (see [2] and [11, Section 2.3.1]). Note that the model above is just one possible definition to perform binary classification with the circuit depicted in Figure 2.
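As a structural illustration of the model, the sketch below evaluates h_θ(x) and the classification rule with tiny placeholder circuits; the single-qubit rotations used for E(x) and W(θ), as well as the hinge loss chosen for L, are assumptions made only for this example.

import numpy as np

q = 2
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
Zq = np.kron(Z, Z)                                   # q-fold Pauli-Z observable

def ry(a):
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def E(x):            # data-encoding circuit (placeholder)
    return np.kron(ry(x[0]), ry(x[1]))

def W(theta):        # variational circuit (placeholder)
    return np.kron(ry(theta[0]), ry(theta[1]))

def h(theta, x):
    """Decision function: expectation of Z^(q) in the state W(theta) E(x) |0...0>."""
    psi0 = np.zeros(2 ** q); psi0[0] = 1.0
    psi = W(theta) @ E(x) @ psi0
    return float(psi @ Zq @ psi)

def empirical_risk(theta, b, X, y, loss=lambda m: max(0.0, 1.0 - m)):
    """Average loss over the training set; the hinge loss is one possible choice of L."""
    return np.mean([loss(yi * (h(theta, xi) + b)) for xi, yi in zip(X, y)])

theta, b = np.array([0.3, -0.7]), 0.0
X_toy = np.array([[0.1, 0.4], [1.0, -0.2]])
y_toy = np.array([1, -1])
print(empirical_risk(theta, b, X_toy, y_toy))
print(np.sign(h(theta, X_toy[0]) + b))               # predicted class label in {-1, +1}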
Figure 2: The classical datum x is encoded with a feature map circuit E(x) analogous to the one in Figure 1. However, the resulting state is then acted upon by a variational circuit W(θ), after which a measurement in the computational basis is performed to determine the expectation value in (7).
One promising alternative is considering a local instead of a global observable. For both cases the model has to be chosen carefully to avoid unfortunate properties such as barren plateaus [23].

The approximate QSVM model introduced above is also referred to as a quantum neural network [3,24] or as a variational quantum circuit [2,25] in the literature.

Analytical complexity
Having introduced the models and associated optimization problems considered in this work, we now turn to analyzing their complexity. The goal of this section is to justify the analytical part of Table 1. For the scaling of the dual approach we are able to provide a mathematical proof. The scaling of the primal approach using Pegasos can also be analytically derived as long as an additional assumption on the data and feature map is fulfilled. While this assumption is not mathematically proven, we provide empirical evidence to support it.

Dual optimization
To solve the dual problem (2), the full kernel matrix K, whose elements are given as expectation values which are calculated on a quantum computer (see Figure 1), needs to be evaluated. As we can only perform a finite number of measurement shots in practice, the estimated kernel entries will always be affected by statistical errors, even when a fault-tolerant quantum computer is employed. More precisely, we approximate each kernel entry by taking the sample mean over R i.i.d. realizations of the random variable $k^{(l)}_{ij}$, which is defined to be equal to 1 if the original all-zero state |0⟩^⊗q is recovered in the measurement, and equal to 0 for all other measurement outcomes. These entries define an approximate kernel matrix K_R, which in the limit R → ∞ converges to the true kernel matrix K by the law of large numbers. To prove a complexity statement for the dual approach, we need to quantify the speed of this convergence.
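The estimator K_R just described can be emulated classically; the following sketch (the same emulation is used for the experiments in Section 4.2) draws each entry's sample mean from a binomial distribution with the exact kernel value as success probability. The stand-in exact kernel below is an illustrative assumption.

import numpy as np

def noisy_kernel_matrix(K_exact, R, rng):
    """Emulate the shot-based estimate K_R of an exact quantum kernel matrix.

    Each entry K_ij is the probability of the all-zero measurement outcome, so the
    sample mean over R shots follows a (scaled) binomial distribution.
    """
    M = K_exact.shape[0]
    K_R = np.eye(M)                                  # diagonal entries k(x_i, x_i) = 1
    for i in range(M):
        for j in range(i + 1, M):
            estimate = rng.binomial(R, K_exact[i, j]) / R
            K_R[i, j] = K_R[j, i] = estimate         # symmetric matrix
    return K_R

# Example: operator-norm error ||K - K_R||_2 for increasing shot counts R.
rng = np.random.default_rng(0)
M = 32
F = rng.normal(size=(M, 4)); F /= np.linalg.norm(F, axis=1, keepdims=True)
K = (F @ F.T) ** 2                                   # stand-in kernel with entries in [0, 1]
for R in (128, 1024, 8192):
    err = np.linalg.norm(K - noisy_kernel_matrix(K, R, rng), ord=2)
    print(R, err)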
A detailed analysis, which is deferred to Appendix A.1 for improved readability, shows that the expected error on the kernel matrix when using R samples per entry scales as
$\mathbb{E}\,\|K - K_R\|_2 = O\big(\sqrt{M/R}\big)$. (10)
Note that throughout this work, the notation ∥·∥ denotes the Euclidean vector norm and ∥·∥₂ the operator norm it induces.
In a next step, we analyze how this error propagates through the optimization problem (2) to the decision function (4). To make this step mathematically precise we need to employ an assumption on our kernel and data set.
Assumption 1. Our kernel and data set satisfy the noisy halfspace learning assumption defined in [4, Lemma 14].
Note that Assumption 1 is necessary to rigorously analyze the error propagation as done in Appendix A.2. Furthermore, as shown in [4] the assumption is reasonable for some kernels.
Denote by α⋆_i and α⋆_{R,i} the solutions to the optimization problem utilizing an exact and a noisy kernel matrix, respectively. These solutions lead to the decision functions h(x) and h_R(x), defined as the terms inside the sign operator in the classification function (4), where x is an arbitrary input vector. We find that the deviation |h(x) − h_R(x)| can be controlled by the number of measurement shots spent per kernel entry (see Lemma 5 in Appendix A.2). Thus, in order to achieve an error of |h(x) − h_R(x)| ≤ ε, we need R = O(M^{8/3}/ε²) shots per entry and, as the full symmetric kernel matrix consists of O(M²) independent kernel entries, a total of
$R_{\mathrm{tot}} = O\big(M^{4.67}/\varepsilon^2\big)$ (11)
measurement shots. Once the approximated kernel matrix K_R has been calculated, the corresponding quadratic problem can be solved on a classical computer in time O(M^{7/2} log(1/ε)). Hence, estimating the kernel matrix is the computationally dominant part of the algorithm and the result in (11) defines the overall complexity for the dual approach.
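Spelled out, the total count in (11) is simply the product of the number of independent kernel entries and the shots spent per entry (a bookkeeping sketch with constants suppressed):

\[
R_{\mathrm{tot}}
  = \underbrace{O(M^{2})}_{\text{independent entries}} \cdot
    \underbrace{O\!\left(\tfrac{M^{8/3}}{\varepsilon^{2}}\right)}_{\text{shots per entry}}
  = O\!\left(\tfrac{M^{14/3}}{\varepsilon^{2}}\right)
  \approx O\!\left(\tfrac{M^{4.67}}{\varepsilon^{2}}\right).
\]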

Primal optimization via Pegasos
In order to analyze the computational complexity of solving the primal optimization problem (1), let f(w) denote its objective function. Classically, where the kernels are known exactly, the runtime of the kernelized Pegasos algorithm scales as O(M/δ), where $\delta := f(w_P) - f(w^\star)$ (12) is the difference in the objective value between the result of Pegasos w_P and the true optimizer w⋆ [17]. When using quantum kernels, however, we have to consider the inherent finite sampling noise afflicting their evaluation. In order to carry out the complexity analysis for such noisy kernels, we next impose an assumption on Algorithm 1, for which we provide empirical evidence in Appendix B.1.
Assumption 2. The convergence of Algorithm 1 is unaffected if the sum in line 14 of Algorithm 1 is only δ-accurate for a sufficiently small δ > 0.
In other words, this means that above a certain approximation accuracy, Algorithm 1 is robust to the statistical uncertainty inherent to quantum kernels which have been evaluated using a finite number of measurement shots.
Thus, we now analyze how many measurement shots are required for the sum in line 14 to be δ-accurate in terms of the standard deviation. First, note that initially α¹_j = 0 for all j and in every iteration at most one component of α becomes non-zero (in line 15). At the t-th iteration, the sum therefore contains at most t terms. In order for the sum to be δ-accurate, every individual term should thus be (δ/√t)-accurate, so we need t/δ² measurement shots per term and t²/δ² shots to evaluate the whole sum. Bearing in mind that the total number of iterations T is bounded by T = O(1/δ) [17], this results in a total of O(T³/δ²) = O(1/δ⁵) quantum circuit evaluations to train a QSVM with Pegasos. Alternatively, the number of non-zero terms in the sum conditioned on in line 14 can be bounded by M, leading to a total of O(M²T/δ²) = O(M²/δ³) evaluations. To directly compare this scaling with (11), we need to clarify the relation between the error ε of the decision function and the quantity δ defined in (12), where h denotes the ideal decision function and h_P the decision function resulting from the noisy Pegasos algorithm. As the SVM optimization problem is λ-strongly convex, with λ the regularization constant, this connection can be derived as
$|h(x) - h_P(x)| = |\langle w^\star - w_P, \phi(x)\rangle| \le \|w^\star - w_P\|\,\|\phi(x)\| \le \sqrt{2\delta/\lambda}$, (14)
where the last step uses Corollary 8 in Appendix B.2 and the fact that the quantum feature vectors satisfy ∥φ(x)∥ ≤ 1; hence δ = O(ε²) suffices for an ε-accurate decision function (treating λ as a constant). Combining these results, the number of measurement shots required to achieve ε-accurate decision functions is bounded by
$O\big(\min\{ M^2/\varepsilon^6,\; 1/\varepsilon^{10} \}\big)$.
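For reference, the counting argument above condenses into a single chain (a sketch with constants and the λ-dependence suppressed; the last step uses (14)):

\[
R_{\mathrm{tot}}
  = \sum_{t=1}^{T} O\!\left(\frac{\min\{t, M\}^{2}}{\delta^{2}}\right)
  = O\!\left(\frac{\min\{T^{3},\, M^{2} T\}}{\delta^{2}}\right)
  \overset{T = O(1/\delta)}{=}
    O\!\left(\min\left\{\frac{1}{\delta^{5}},\, \frac{M^{2}}{\delta^{3}}\right\}\right)
  \overset{\delta = O(\varepsilon^{2})}{=}
    O\!\left(\min\left\{\frac{1}{\varepsilon^{10}},\, \frac{M^{2}}{\varepsilon^{6}}\right\}\right).
\]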

Empirical complexity
In this section we provide experiments justifying the empirical part of Table 1. In Section 4.1, we define the training data used in the following experiments. In Sections 4.2 and 4.3, we respectively show that the analytical complexities for the dual and primal methods are confirmed in our experiments as close to being tight. Additionally, we provide an empirical scaling for the approximate QSVMs in Section 4.4. Table 2 includes the results of the experiments performed in this section for the two settings described in Section 4.1.

Table 2: Empirical complexities of QSVMs, i.e. the total number of quantum circuit evaluations required to achieve an ε-accurate solution for the decision function resulting from the dual, primal, and approximate QSVM approach. We consider a data set of size M.

Training data
The training data used throughout the following experiments is artificially generated according to Algorithm 2 in the Appendix with respect to a fixed feature map of the architecture illustrated in Figure 3. Unless declared otherwise, the experiments are run on 8 qubits and the data thus consists of 8 features. In particular, we consider two settings: data sets that are linearly separable in feature space and ones where there is an overlap between the classes. One realization per setting is illustrated in Figure 4.

Dual optimization
In addition to the complexity-theoretic scaling derived in Section 3.1, we provide an empirical scaling for the dual optimization problem based on numerical experiments. Using shot-based noisy kernels, we first determine the ε-dependence of the runtime. In a separate experiment, we analyze the M-dependence, which the theory indicates to pose the main limitation of the method.
For the ε-dependence, we fix the data size to M = 256. First, the exact kernel is calculated with a statevector simulator [18]. The quadratic program (29) is solved with the help of the quadprog Python library [26] in order to find a reference decision function h_∞(x) corresponding to an infinite number of measurement shots. In the next step, we emulate the shot-based kernel entries. We know that for the exact kernel entries K_ij, the probability of measuring the all-|0⟩ outcome is equal to |⟨ψ(x_i)|ψ(x_j)⟩|² = K_ij. We can thus emulate a sample mean of R measurement shots using the binomial distribution B(R, K_ij). This procedure is repeated for a wide range of values of R to find the noisy decision functions h_R(x). The error ε is then defined in (15) by comparing h_R(x) to the reference decision function h_∞(x) on the training set. Analyzing the plots in Figure 5, we find that the bound R = O(1/ε²) predicted in Section 3.1 is tight and in precise agreement with the experiments.
In the second experiment, we analyze how the runtime of the dual optimization problem scales with M. To this end, the dual problem is solved for different data sizes M and shots per kernel evaluation R to calculate the corresponding error ε(M, R) according to (15). We then fix some ε_0 > 0 and define R_{ε_0}(M) as the smallest R such that ε(M, R) < ε_0, i.e. R_{ε_0}(M) is the minimal number of shots needed to achieve an ε_0-accurate solution. In Figure 6, a log-log plot of R_tot = R_{ε_0}(M) · M(M+1)/2 as a function of M for different ε_0 is included. From a linear least squares fit in this plot the empirical scaling is determined as R_tot = O(M^{4.8±0.4}) for separable data and R_tot = O(M^{4.5±0.3}) for overlapping data.

Primal optimization via Pegasos
In order to determine the empirical scaling of the Pegasos algorithm, we observe how the error ε of the decision function changes when the number of shots per evaluation is varied. Again employing the feature map and training data described in Section 4.1 (with size M = 100), we first train a reference decision function h_∞(x) by running Pegasos for T = 1000 iterations with a statevector simulator. This number of steps is sufficient for convergence in our case, as can be seen in Figure 8. In a next step, QASM-simulators [18] with a fixed number of shots per kernel evaluation R are used to run Pegasos. After every iteration t, we calculate the hinge loss of the decision function h^t_R(x) on the training set, where y_i denote the true labels. This procedure is repeated for T iterations until the convergence criterion (16), governed by a tolerance that is fixed to τ = 10^{−4} for this experiment, is met. Finally, the decision function after convergence, h^T_R, is compared to the reference decision function, defining an error $\varepsilon := \max_{x \in X} |h^T_R(x) - h_\infty(x)|$. (17) Figure 7 shows the relationship between ε and R. Figure 8 further suggests that the number of iterations needed until convergence is independent of the number of shots R per kernel evaluation for sufficiently large R. Thus, we can assume the total number of shots to scale as R_tot = T · R for some constant T > 0. Under this assumption, the experiments yield the empirical ε-scaling listed in Table 2.
In addition to the ε-dependence of the complexity of Pegasos, we should also remark on the M-dependence. Figure 8 shows that for large enough R, the number of steps needed to converge is independent of R. Thus, it suffices to analyze how fast Pegasos converges for varying M when a statevector simulator, corresponding to an infinite number of measurement shots, is used. Figure 9 shows that there is no discernible correlation between the number of steps until convergence of Pegasos and the data size M, as expected from the theoretical analysis in Section 3.2.

Heuristic training of approximate QSVMs
Before we tackle the empirical scaling of the non-convex optimization problem that the training of approximate QSVMs poses, we analyze how the expectation value h_θ(x) defined in (7) is affected by finite sampling statistics. Let Q ∈ {−1, 1} denote the random measurement outcome; then (1 + Q)/2 ∼ Bernoulli(p) follows a Bernoulli distribution, where p is the probability that Q = 1. For an i.i.d. sample of size R, the standard deviation of the sample mean is σ_Q/√R, where σ_Q denotes the standard deviation of the distribution from which the Q_i are sampled. With σ²_Bernoulli = p(1 − p) ≤ 1/4 we find σ²_Q = 4σ²_Bernoulli ≤ 1 and, thus, if we require the standard deviation to be bounded by σ_Q/√R = O(ε) in order to get a confidence interval of width O(ε), we need to perform R = O(1/ε²) measurement shots per expectation value. Assuming that the training converges in O(1/ε) steps, the expected total number of shots required to fit the model to the training set then scales as
$R_{\mathrm{tot}} = O(1/\varepsilon^3)$. (18)
Because the optimization problem is non-convex, a rigorous proof of the conjectured scaling in (18) remains an unsolved problem. Instead, we perform numerical experiments to determine an empirical scaling. The circuit employed to implement the approximate QSVM is of the form shown in Figure 2, where the unitaries E(x) and W(θ) are ZZFeatureMap and RealAmplitudes circuits, respectively, as provided by the Qiskit Circuit Library [18]. The experiment is performed on both separable and overlapping data generated as outlined in Section 4.1, with a size of M = 100. To train the parameters θ, we minimize the cross-entropy loss using a gradient-based method. Calculating the full gradient scales linearly with both the training set size M and the number of parameters d. To avoid this, we use SPSA [28], which ensures that the scaling is independent of d. In addition, we use stochastic gradient descent to approximate the gradient on a batch of 5 data points instead of the whole training set, removing the M-dependence. Figure 10 includes empirical evidence in support of this claim.
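The following NumPy sketch shows a generic SPSA update of the kind used above; the gain schedules and constants are standard illustrative defaults rather than the exact hyperparameters of our experiments, and the toy quadratic loss merely stands in for the (noisy) mini-batch cross-entropy of the approximate QSVM.

import numpy as np

def spsa_minimize(loss, theta0, iters=200, a=0.1, c=0.1, rng=None):
    """Minimal SPSA sketch: two loss evaluations per step, independent of dim(theta)."""
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    for k in range(1, iters + 1):
        ak, ck = a / k ** 0.602, c / k ** 0.101          # standard SPSA gain schedules
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # random +/-1 perturbation
        # simultaneous-perturbation gradient estimate from two evaluations of the loss
        grad_est = (loss(theta + ck * delta) - loss(theta - ck * delta)) / (2 * ck) * delta
        theta -= ak * grad_est
    return theta

# Usage on a toy quadratic loss standing in for the empirical risk.
rng = np.random.default_rng(0)
target = np.array([0.5, -1.0, 2.0])
noisy_loss = lambda th: np.sum((th - target) ** 2) + 0.01 * rng.normal()
print(spsa_minimize(noisy_loss, np.zeros(3), rng=rng))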
To analyze the effect of finite sampling noise, the training is performed with Qiskit's shot-based QASM simulator [18] using R shots per expectation value. The optimization of the models is first performed for 1000 training steps, resulting in the noisy decision function h_{θ_R}(x). To enable a direct comparison to the ideal case, we minimize the loss using full gradient descent (i.e. all parameters and all data in the training set are considered in the gradient evaluations) and a statevector simulator in an additional experiment. In this way, the noiseless decision function h_{θ_∞}(x) is determined. Here, θ_∞ are the parameters that the statevector optimization (R → ∞) converged to given the initial parameters θ_R. This procedure is repeated multiple times with different random initializations of the trained weights and varying R. For the error defined in (19), a least squares fit with respect to the number of shots is performed to conclude the experiment. The result is presented in Figure 11. From that experiment, the empirical complexity of the measurement shots needed for an ε-accurate solution is found to approximately follow the conjectured scaling R_tot = O(1/ε³).

Conclusion
The results summarized in Table 1 show that QSVMs can be trained in polynomial time with multiple training algorithms that differ in their computational complexity with respect to the size of the data set M and the accuracy ε of the decision function. Both the dual and primal QSVM solvers benefit from the convexity of the optimization problem, which guarantees convergence to the global optimum. For the dual optimization we find that the complexity scales as O(M^{4.67}/ε²), which entails a significant overhead over training a classical SVM due to the shot noise that is inherent in any quantum measurement. Conversely, the Pegasos algorithm provides a scaling that is independent of M by solving the primal optimization problem using stochastic gradient descent. While this provides an advantage for training the classifier on large data sets, the algorithm scales worse with respect to ε: the cost of training with quantum kernels increases to O(1/ε^{10}) from O(1/ε⁴) when using classical kernels. While this scaling looks daunting, it is important to remark that ε denotes the error between the noisy and noiseless decision function evaluated on the training set. However, when solving a classification task in practice, typically the generalization performance of the models on some test set is the quantity of interest. The relationship between ε and the generalization error is an open question and a promising direction for future research.
The best empirical scaling for training the QSVM optimization problem is found to be given by the approximate QSVM, as the cost is also independent of M and the dependence on ε is reduced to O(1/ε³). However, by choosing the approximate QSVM, we lose any guarantees on finding the global optimum of the optimization problem, as the loss landscape now becomes non-convex. In addition to ensuring that the quantum kernel does not suffer from exponential concentration, we now also need to make sure that the loss function and the initial choice of parameters do not lead to trainability problems such as barren plateaus. Nevertheless, the efficient scaling might render the approximate QSVM the only feasible variant in practical settings.
Finally, we note that in this work we set the focus on the complexity of training QSVMs. We show that the optimization problem can be solved efficiently, but the effect of statistical noise inherent to the quantum kernel introduces a polynomial overhead over the classical equivalent. It remains an open question whether quantum feature maps that provide a practical advantage over classical machine learning methods for real-life classification tasks exist.
A.1 Justification of (10)

The goal is now to bound the operator distance between the ideal K and the obtainable K_R.
To do so, define the error
$E_R := K_R - K$ (21)
and observe that $\mathbb{E}[(E_R)_{ij}] = 0$, (22)
where the linearity of the expectation value and (20) have been used. The second moment of E_R can be calculated as
$\mathbb{E}\big[(E_R)_{ij}^2\big] = \frac{1}{R^2} \sum_{l,m=1}^{R} \mathbb{E}\big[k^{(l)}_{ij} k^{(m)}_{ij}\big] - K_{ij}^2$, (23)
where the sum inside the square is expanded into two terms, collecting the cases in which both samples are identical and in which they differ with l ≠ m. The expression can be further simplified knowing that $\mathbb{E}\big[(k^{(l)}_{ij})^2\big] = K_{ij}$, because $k^{(l)}_{ij}$ is Bernoulli distributed, and that $\mathbb{E}\big[k^{(l)}_{ij} k^{(m)}_{ij}\big] = K_{ij}^2$ for l ≠ m, because the $k^{(\cdot)}_{ij}$ are independently and identically distributed. Plugging these results back into (23) yields
$\mathbb{E}\big[(E_R)_{ij}^2\big] = \frac{K_{ij}(1 - K_{ij})}{R} = O(1/R)$, (24)
where the last step uses K_ij ∈ [0, 1]. Analogously, it can be shown that the fourth moment scales as
$\mathbb{E}\big[(E_R)_{ij}^4\big] = O(1/R^2)$. (25)
Knowing this, the following result from random matrix theory can be harnessed:

Theorem 3 (Latala's theorem [31, Theorem 2] and [32, Theorem 5.37]). For any matrix X ∈ R^{n×n} whose entries are independent random variables x_{ij} of zero mean, there exists a constant c > 0 such that
$\mathbb{E}\,\|X\|_2 \le c\left( \max_i \sqrt{\textstyle\sum_j \mathbb{E}[x_{ij}^2]} + \max_j \sqrt{\textstyle\sum_i \mathbb{E}[x_{ij}^2]} + \Big(\textstyle\sum_{i,j} \mathbb{E}[x_{ij}^4]\Big)^{1/4} \right)$,
where ∥·∥₂ = σ_max(·) denotes the operator norm induced by the Euclidean vector norm.
The matrix E_R satisfies all of the assumptions in Theorem 3, which together with (22), (24) and (25) yields
$\mathbb{E}\,\|E_R\|_2 = \mathbb{E}\,\|K - K_R\|_2 = O\big(\sqrt{M/R}\big)$.
With this, a bound on the accuracy of the kernel matrix is established.
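For completeness, plugging the moment bounds (24) and (25) into the three terms of Theorem 3 gives (ignoring that only the upper triangle of E_R consists of independent entries, which affects constants only):

\[
\mathbb{E}\,\|E_R\|_2
 \le c\left(
    \max_i \sqrt{\sum_j \mathbb{E}\big[(E_R)_{ij}^2\big]}
  + \max_j \sqrt{\sum_i \mathbb{E}\big[(E_R)_{ij}^2\big]}
  + \Big(\sum_{i,j} \mathbb{E}\big[(E_R)_{ij}^4\big]\Big)^{1/4}
 \right)
 = c\left( 2\sqrt{M \cdot O(1/R)} + \big(M^{2} \cdot O(1/R^{2})\big)^{1/4} \right)
 = O\!\left(\sqrt{\tfrac{M}{R}}\right).
\]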

A.2 Justification of (11)
The next question is how the solution to the SVM optimization problem is affected by perturbations to the ideal K. The following treatment is based on [4, Appendix C].
We show that we need O(M^{8/3}/ε²) measurement shots per kernel entry to ensure that |h_R(x) − h(x)| ≤ ε. Since the kernel matrix is symmetric, only O(M²) unique entries need to be calculated, implying (11). As a reminder, the goal is to solve the dual optimization problem (2), which can be expressed in terms of matrices. Define Q as the matrix with entries
$Q_{ij} = y_i y_j K_{ij}$ (28)
for the vector of training labels y ∈ R^M; the factor diag(y) can be absorbed into the vector v in the definition of positive semidefiniteness of a matrix, $v^\top Q v \ge 0 \;\forall v \in \mathbb{R}^M \setminus \{0\}$. The kernel matrix K is always positive semidefinite [33, Section 2.2], implying that Q must be too. The dual problem in (2) can be written as the quadratic program (29) [4, Eq. (D19)], whose quadratic form is given by the matrix Q + λ id, where α denotes the vector with components α_i, 1 ∈ R^M the all-one vector and id ∈ R^{M×M} the identity matrix. Because the identity commutes with all matrices, Q and λ id can be simultaneously diagonalized. Since Q is positive semidefinite, its eigenvalues are all greater than or equal to zero. So when Q and λ id are expressed in a basis where both matrices are diagonal, adding the two yields a diagonal matrix with entries greater than or equal to λ, implying that
$\mu \ge \lambda$ (30)
for the smallest eigenvalue µ of the matrix (Q + λ id). With this, the following theorem can be invoked to prove the robustness of the dual QSVM optimization when the kernel function is only evaluated up to some finite precision.
Theorem 4 (Daniel's Theorem [34, Theorem 2.1], [4, Lemma 16]). Let x_0 be the solution to the quadratic program (31), where K is positive definite with smallest eigenvalue µ > 0 and the dimensions of the vectors c, g, d and matrices G, D are such that all operations are well-defined. Let K′ be a symmetric matrix such that ∥K′ − K∥₂ ≤ ε < µ (positive definiteness of K′ is then implied). Let x′_0 be the solution to (31) with K replaced by K′. Then
$\|x'_0 - x_0\| \le \frac{\varepsilon\, \|x_0\|}{\mu - \varepsilon}$. (32)

From the ground up, classification with QSVMs can be divided into two subsequent steps, training and prediction. For training, the quantum kernel matrix K defined in (5) is initially evaluated on a quantum computer. Then, the QSVM is fit by running the dual program in (29) on a classical computer, yielding a solution vector α⋆. For prediction, a new datum x ∈ R^d is assigned a class membership ŷ ∈ {−1, +1} via the classification function c_SVM(x) in (4). To this end, the quantum computer has to be employed again to determine the quantum kernel value k(x, x_i) for all x_i in the training set X.
From these two steps, it is clear that there are two separate instances in the QSVM algorithm where quantum kernels are evaluated and the statistical uncertainty inherent to quantum expectation values unavoidably enters, because even on a fault-tolerant quantum computer only an approximate kernel function k_R(x, x_i), as defined in (9), is available. Consequently, for training, K has to be replaced by its approximation K_R, in turn leading to a noisy Q_R analogous to (28). The solution to the dual (29) when Q is replaced by Q_R is then denoted as α⋆_R. In the same way, the kernel evaluations in the decision function are replaced by evaluations of k_R(x, x_i) in the prediction step. Putting all of this together, a statement on the robustness of the overall QSVM with respect to measurement uncertainty can be made on the level of the ideal and noisy versions of the term inside the sign function in the classification function (4). The goal is to show the robustness of QSVMs with respect to finite sampling noise by bounding the difference between these terms with high probability.
Lemma 5. Let ε > 0 and suppose R = O(M^{8/3}/ε²) measurement shots are taken for each quantum kernel estimation circuit. Furthermore, assume the setting of noisy halfspace learning (see Assumption 1). Then, with fixed probability p > 1/2, where the probability stems from the choice of random training samples and the uncertainty coming from the finite sampling statistics, for every x ∈ R^d we have |h_R(x) − h(x)| ≤ ε.
Proof. Start by considering the operator distance between the noisy matrix Q_R, resulting from quantum kernel evaluations with a finite number of measurement shots R per entry, and the ideal matrix Q. Making use of the connection between Q and K in (28), we find ∥Q_R − Q∥₂ ≤ ∥K_R − K∥₂, where the inequality step uses |y_i| = 1 and the triangle inequality. Then, (10) implies
$\mathbb{E}\,\|Q_R - Q\|_2 = O\big(\sqrt{M/R}\big)$. (33)
For the setting of noisy halfspace learning as defined in [4, Lemma 14, Definition 15], it can be shown [4, Remark 2] that the norm of α⋆ satisfies the bound (34), where α⋆ is the solution to (29). With Markov's inequality the statements (33) and (34) in expectation can be transformed into statements in probability. To this end, choose the constants such that the resulting bounds, with η denoting the probabilistic bound on ∥Q_R − Q∥₂, hold each with probability greater than or equal to p′. Now, Theorem 4 can be leveraged, as Q + λ id is positive definite because there is a lower bound on its smallest eigenvalue µ in (30). Then, for δ_i := α⋆_{R,i} − α⋆_i and with probability at least p′², the perturbation bound (32) yields a bound (35) on ∥δ∥ in terms of η, µ and ∥α⋆∥, where R has to be chosen such that η < µ, which is always feasible due to the lower bound on µ.
In a next step, denote by ν_i := k_R(x, x_i) − k(x, x_i) for i = 1, 2, ..., M the difference between the obtainable and the ideal kernel function evaluated for pairs of the datum to be classified, x, and the elements x_i of the training set. As ν_i is a special case of the components of E_R in (21), E[ν_i] = 0 is implied by (22) and E[ν_i²] = O(1/R) by (24). By making use of Chebyshev's inequality and Var[X] = E[X²] − E[X]², the above expectation values can be converted into the statement
$\|\nu\| = O\big(\sqrt{M/R}\big)$, (36)
which holds with probability at least p′ > 2^{−1/3}, where the √M scaling comes from the Euclidean norm and the fact that the vector ν has M components ν_i.
Returning to the ideal and noisy decision functions, the bounds (35) and (36) can be combined to give |h_R(x) − h(x)| ≤ ε, which holds with probability at least p := (p′)³ > 1/2 for any x ∈ R^d; the combination uses the Cauchy-Schwarz inequality and the property k(x, x_i) ∈ [0, 1]. Hence, the desired accuracy |h_R(x) − h(x)| ≤ ε for the QSVM decision function is obtained with probability greater than or equal to p for
$R = O\big(M^{8/3}/\varepsilon^2\big)$ (37)
measurement shots per quantum kernel evaluation.
As remarked at the beginning of this appendix, this result justifies (11) in the main text.

Remark 6. We note that this analysis is an improvement of the results derived in [4, Lemma 19], where it was shown that R_tot = O(M^6/ε²) shots are necessary under the same assumptions.

B Extended analysis of the primal approach
In this appendix, we provide justifications for the steps discussed in Section 3.2. We first provide experiments supporting Assumption 2. Then, we present a rigorous mathematical derivation of the connection between ε and δ in (14) in the main text.

B.1 Empirical justification of Assumption 2
Assumption 2 claims that Pegasos converges independently of the sampling noise when the number of measurement shots is large enough. To provide numerical evidence for this statement, we compare the evolution of the algorithm for different numbers of shots per kernel evaluation R.
For the experiments presented in Figure 12, every optimization step of the Pegasos algorithm is performed with a noisy kernel value k_R (see (9)) in line 14 of the algorithm. The accuracy plotted is then computed according to (4), evaluated with exact kernel values, which facilitates a direct comparison of the different optimization runs. We observe that for R sufficiently large, the algorithm converges to a solution which is close to the one obtained by performing an infinite number of quantum circuit evaluations. Furthermore, visually the rate of convergence is not significantly slower for smaller but sufficiently large R. Taken together, these observations suggest that Assumption 2 is fulfilled.
In Figure 13, we additionally present experiments probing how the coefficients behave under noisy kernel evaluations. The integer coefficients from Pegasos are scaled by a factor of λ/t, similar to the sum in line 14. Unsurprisingly, the error is quite large in the first few iterations, as t in the denominator is small. However, after around t = 40 iterations the error stabilizes and in some cases even decreases. This provides further evidence that the convergence of Pegasos is not strongly affected by finite sampling noise.

B.2 Justification of (14)
In the following we denote by f(w) the objective function of the primal optimization problem (1), where w⋆ is the optimizer.

Figure 3: Feature map circuit: This quantum circuit (scaled to 8 qubits with nearest-neighbour entanglement) is employed as the feature map in the QSVMs throughout Section 4. H denotes a Hadamard gate, R_z(θ) a single-qubit rotation about the z-axis, and ZZ(θ) a parametric 2-qubit Z ⊗ Z interaction (maximally entangling for θ = π/2). The feature components x_1 and x_2 (where x = (x_1, x_2) is a classical input datum) provide the angles for the rotations. This feature map was chosen due to its previous use in the literature [2,3].

Figure 4: Artificial data: The training data is generated using Algorithm 2 with the feature map from Figure 3 and a sample size M = 100. The linearly separable data is achieved by setting the margin positive (µ = 0.1 in this case), while the margin for the overlapping data is negative (µ = −0.1). The colouring in the background corresponds to the fixed classifier used to generate the data (note that this is not necessarily the ideal classifier for the generated data points). To allow visualisation, the data displayed here is generated for 2 qubits according to the same algorithm.

Figure 5: ε-scaling of the dual: Using noisy kernel evaluations drawn from a binomial distribution to simulate the number of shots R, the kernel matrix K_R is constructed and the dual optimization problem solved. The number of shots is plotted as a function of ε on a doubly logarithmic scale, where ε is calculated according to (15). A linear fit inside the log-log plot is then used to determine the empirical exponent. The experiment is repeated for the different regularization parameters λ = 1/10 and λ = 1/1000 and run on n = 100 different realizations of the training data (see Section 4.1). The markers shown are the means over the different runs and the horizontal error bars correspond to the interval between the 15.9 and 84.1 percentile.

Figure 6: M-scaling of the dual: It is plotted how the minimal number of total shots necessary to achieve an ε_0-accurate decision function depends on M. The experiment is repeated for the different regularization parameters λ = 1/10 and λ = 1/1000. The markers shown are the means and the error bars indicate the interval between the 15.9 and 84.1 percentile.

Figure 7: ε-scaling of Pegasos: Using QASM-simulators with 16 to 8192 shots per evaluation, the Pegasos algorithm is run for 750 iterations to ensure convergence. The number of shots is plotted as a function of ε on a doubly logarithmic scale, where ε is calculated as defined in (17). A linear fit is used to determine the empirical exponent. The experiment is repeated for the different regularization parameters λ = 1/10 and λ = 1/1000. Every data point corresponds to the mean over 50 random runs of the Pegasos algorithm and the error bars indicate the standard deviation.

Figure 8: Convergence of Pegasos: Like in Figure 7, QASM-simulators ranging from 16 to 8192 shots per evaluation are employed to run Pegasos until convergence as defined in (16). The number of iterations until convergence is plotted as a function of the number of shots per kernel evaluation. The experiment is repeated for the different regularization parameters λ = 1/10 and λ = 1/1000. For every plotted data point, 50 runs are performed using different random seeds for the choice of indices in the Pegasos algorithm. The markers shown are the means of the respective steps until convergence and the error bars mark the standard deviation.

Figure 9: M-scaling of Pegasos: Pegasos is applied on the artificial data introduced in Section 4.1 for different training set sizes M. The number of iterations until convergence using a statevector simulator is plotted as a function of M. The experiment is repeated for the two regularization parameters λ = 1/10 and λ = 1/1000. Every data point is the mean over 10 different runs of the probabilistic Pegasos algorithm and the error bars mark the interval between the 15.9 and 84.1 percentile.

Figure 10: M- and d-scaling of the approximate QSVM model: The approximate QSVM is trained using a combination of the SPSA and SGD optimization algorithms and a statevector simulator. Convergence is defined as the iteration T where the parameters fulfill ∥θ_T − θ_{T−1}∥/d < 10^{−4}. The experiment is repeated for different (a) sizes M of the 2-dimensional data set (with d = 8 fixed) and (b) numbers of trainable parameters d (with M = 256 fixed). The total number of circuit evaluations for SPSA with a batch size of 5 is then given as R_tot = 5 · 2 · T. Every experiment is repeated n = 10 times, such that the markers correspond to the means and the error bars to the interval between the 15.9 and 84.1 percentile over these runs.

Lemma 7. The objective function f(w) is λ-strongly convex in w.

Proof. The hinge loss $L_{\mathrm{hinge}}(y_i, w^\top x + b) = \max\{0,\, 1 - y_i(w^\top x + b)\}$ is convex and thus the sum $g(w) = \sum_{i=1}^M L_{\mathrm{hinge}}(y_i, w^\top x_i + b)$ is convex as well. Now f is λ-strongly convex if and only if $f(w) - \tfrac{\lambda}{2}\|w\|^2$ is convex. But $f(w) - \tfrac{\lambda}{2}\|w\|^2 = g(w)$ is convex and, thus, f is indeed strongly convex.

Corollary 8. Let ε > 0 and w ∈ R^s be such that f(w) ≤ f(w⋆) + ε. Then, $\|w - w^\star\| \le \sqrt{2\varepsilon/\lambda}$.
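A short proof sketch of Corollary 8: since w⋆ minimizes the λ-strongly convex objective f (Lemma 7), for every w we have

\[
f(w) \ge f(w^\star) + \tfrac{\lambda}{2}\,\|w - w^\star\|^2,
\qquad\text{so}\qquad
f(w) \le f(w^\star) + \varepsilon
\;\Longrightarrow\;
\tfrac{\lambda}{2}\,\|w - w^\star\|^2 \le \varepsilon
\;\Longrightarrow\;
\|w - w^\star\| \le \sqrt{\tfrac{2\varepsilon}{\lambda}}.
\]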

Table 1: Complexity of QSVM training, i.e. the asymptotic number of quantum circuit evaluations required to achieve an ε-accurate decision function on a training set of size M.


Figure 11: ε-scaling of the approximate QSVM model: QASM-simulators with shots per expectation value ranging from 2 to 8192 are employed. The number of shots R is plotted as a function of ε as defined in (19) for the training procedure detailed in the main text. A linear fit inside the log-log plot is used to determine the empirical exponent realized in our experimental setup. We analyze the case of 2-dimensional data with 8 trainable parameters (blue) and of 8-dimensional data with 16 trainable parameters (orange), both for separable and overlapping data. Every experiment has been repeated n = 10 times and the markers shown are the means of the resulting errors, while the error bars mark the interval between the 15.9 and 84.1 percentile.