A quantum extension of SVM-perf for training nonlinear SVMs in almost linear time

We propose a quantum algorithm for training nonlinear support vector machines (SVM) for feature space learning where classical input data is encoded in the amplitudes of quantum states. Based on the classical SVM-perf algorithm of Joachims [1], our algorithm has a running time which scales linearly in the number of training examples m (up to polylogarithmic factors) and applies to the standard soft-margin ℓ1-SVM model. In contrast, while classical SVM-perf has demonstrated impressive performance on both linear and nonlinear SVMs, its efficiency is guaranteed only in certain cases: it achieves linear m scaling only for linear SVMs, where classification is performed in the original input data space, or for the special cases of low-rank or shift-invariant kernels. Similarly, previously proposed quantum algorithms either have super-linear scaling in m, or else apply to different SVM models such as the hard-margin or least squares ℓ2-SVM, which lack certain desirable properties of the soft-margin ℓ1-SVM model. We classically simulate our algorithm and give evidence that it can perform well in practice, and not only for asymptotically large data sets.


Introduction
Support vector machines (SVMs) are powerful supervised learning models which perform classification by identifying a decision surface which separates data according to their labels [2,3]. While classifiers based on deep neural networks have increased in popularity in recent years, SVM-based classifiers maintain a number of advantages which make them an appealing choice in certain situations. SVMs are simple models with a smaller number of trainable parameters than neural networks, and thus can be less prone to overfitting and easier to interpret. Furthermore, neural network training may often get stuck in local minima, whereas SVM training is guaranteed to find a global optimum [4]. For problems such as text classification which involve high dimensional but sparse data, linear SVMs - which seek a separating hyperplane in the same space as the input data - have been shown to perform extremely well, and training algorithms exist which scale efficiently, i.e. linearly in the number of training examples m [5][6][7], or even independently of m [8].
In more complex cases, where a nonlinear decision surface is required to classify the data successfully, nonlinear SVMs can be used, which seek a separating hyperplane in a higher dimensional feature space. Such feature space learning typically makes use of the kernel trick [9], a method enabling inner product computations in high or even infinite dimensional spaces to be performed implicitly, without requiring the explicit and resource intensive computation of the feature vectors themselves.
While powerful, the kernel trick comes at a cost: many classical algorithms based on this method scale poorly with m. Indeed, storing the full kernel matrix K in memory itself requires O(m²) resources, making subquadratic training times impossible by brute-force computation of K. When K admits a low-rank approximation, though, sampling-based approaches such as the Nyström method [10] or incomplete Cholesky factorization [11] can be used to obtain O(m) running times, although it may not be clear a priori whether such a low-rank approximation is possible. Another special case corresponds to so-called shift-invariant kernels [12], which include the popular Gaussian radial basis function (RBF) kernel, where classical sampling techniques can be used to map the high dimensional data into a random low dimensional feature space, which can then be trained by fast linear methods. This method has empirically competed favorably with more sophisticated kernel machines in terms of classification accuracy, at a fraction of the training time. While such a method seems to strike a balance between linear and nonlinear approaches, it cannot be applied to more general kernels. In practice, advanced solvers employ multiple heuristics to improve their performance, which makes rigorous analysis difficult. Nevertheless, methods like SVM-Light [13], SMO [14], LIBSVM [15] and SVMTorch [16] still empirically scale approximately quadratically with m for nonlinear SVMs.
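As an illustration of the random-feature idea for shift-invariant kernels, the following is a minimal classical sketch (our own naming and parameter choices, not the construction of [12] verbatim): random Fourier features z(x) are drawn so that ⟨z(x), z(y)⟩ approximates the Gaussian RBF kernel, after which a fast linear SVM can be trained on the transformed data.

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
    """Map data X (m x d) to a random feature space z(X) (m x n_features)
    whose inner products approximate the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    # Frequencies sampled from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Example: inner products in the random feature space approximate the kernel.
X = np.random.randn(200, 10)
Z = random_fourier_features(X, n_features=2000, gamma=0.5)
K_approx = Z @ Z.T                      # O(m * n_features) work, no m x m kernel trick
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.abs(K_approx - K_exact).max())  # small approximation error
```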
The state-of-the-art in terms of provable computational complexity is the Pegasos algorithm [8]. Based on stochastic sub-gradient descent, Pegasos has constant running time for linear SVMs. For nonlinear SVMs, Pegasos has O(m) running time, and is not restricted to low-rank or shift-invariant kernels. However, while experiments show that Pegasos does indeed display outstanding performance for linear SVMs, for nonlinear SVMs it is outperformed by other benchmark methods on a number of datasets. On the other hand, the SVM-perf algorithm of Joachims [1] has been shown to outperform similar benchmarks [17], although it does have a number of theoretical drawbacks compared with Pegasos. SVM-perf has O(m) scaling for linear SVMs, but its efficiency for nonlinear SVMs depends either on heuristics or on the presence of a low-rank or shift-invariant kernel, in which case linear scaling in m can also be achieved. However, given the strong empirical performance of SVM-perf, it serves as a natural starting point for further improvements, with the aim of overcoming the restrictions in its application to nonlinear SVMs.
Can quantum computers implement SVMs more effectively than classical computers? Rebentrost and Lloyd were the first to consider this question [18], and since then numerous other proposals have been put forward [19][20][21][22][23]. While the details vary, at a high level these quantum algorithms aim to bring benefits in two main areas: i) faster training and evaluation time of SVMs or ii) greater representational power by encoding the high dimensional feature vectors in the amplitudes of quantum states. Such quantum feature maps enable high dimensional inner products to be computed directly and, by sidestepping the kernel trick, allow classically intractable kernels to be computed. These proposals are certainly intriguing, and open up new possibilities for supervised learning. However, the proposals to date with improved running time dependence on m for nonlinear SVMs do not apply to the standard soft-margin ℓ1-SVM model, but rather to variations such as least squares ℓ2-SVMs [18] or hard-margin SVMs [23]. While these other models are useful in certain scenarios, soft-margin ℓ1-SVMs have two properties - sparsity of weights and robustness to noise - that make them preferable in many circumstances.
In this work we present a method to extend SVM-perf to train nonlinear soft-margin ℓ1-SVMs with quantum feature maps in a time that scales linearly (up to polylogarithmic factors) in the number of training examples, and which is not restricted to low-rank or shift-invariant kernels. Provided that one has quantum access to the classical data, i.e. quantum random access memory (qRAM) [24,25], quantum states corresponding to sums of feature vectors can be efficiently created, and then standard methods employed to approximate the inner products between such quantum states. As the output of the quantum procedure is only an approximation to a desired positive semi-definite (p.s.d.) matrix, it is not itself guaranteed to be p.s.d., and hence an additional classical projection step must be carried out to map on to the p.s.d. cone at each iteration.
Before stating our result in more detail, let us make one remark. It has recently been shown by Tang [26] that the data-structure required for efficient qRAM-based inner product estimation would also enable such inner products to be estimated classically, with only a polynomial slow-down relative to quantum, and her method has been employed to de-quantize a number of quantum machine learning algorithms [26][27][28] based on such data-structures. However, in practice, polynomial factors can make a difference, and an analysis of a number of such quantum-inspired classical algorithms [29] concludes that care is needed when assessing their performance relative to the quantum algorithms from which they were inspired. More importantly, in this current work, the quantum states produced using qRAM access are subsequently mapped onto a larger Hilbert space before their inner products are evaluated. This means that the procedure cannot be de-quantized in the same way.

Background and Results
Let Φ : R^d → H be a feature map, where H is a real Hilbert space (of finite or infinite dimension) with inner product ⟨·, ·⟩, and let K : R^d × R^d → R be the associated kernel function defined by K(x, y) := ⟨Φ(x), Φ(y)⟩. Let R = max_i ‖Φ(x_i)‖ denote the largest ℓ2 norm of the feature mapped vectors. In what follows, ‖·‖ will always refer to the ℓ2 norm, and other norms will be explicitly differentiated.

Support Vector Machine Training
Training a soft-margin ℓ1-SVM with parameter C > 0 corresponds to solving the optimization problem OP 1 below. Note that, following [1], we divide Σ_i ξ_i by m to capture how C scales with the training set size. The trivial case Φ(x) = x corresponds to a linear SVM, i.e. a separating hyperplane is sought in the original input space. When one considers feature maps Φ(x) in a high dimensional space, it is more practical to consider the dual optimization problem OP 2, which is expressed in terms of inner products, and hence the kernel trick can be employed.
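For reference, a standard way of writing the primal and dual with the C/m normalization described above is the following; the labels OP 1 and OP 2 follow the surrounding text, but the precise typesetting is ours rather than a verbatim quote of [1].

```latex
% OP 1 (primal, soft-margin l1-SVM):
\min_{\mathbf{w},\,\xi_i \ge 0} \ \frac{1}{2}\langle \mathbf{w},\mathbf{w}\rangle
  + \frac{C}{m}\sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad
  y_i\,\langle \mathbf{w}, \Phi(\mathbf{x}_i)\rangle \ge 1 - \xi_i
  \quad \forall i \in \{1,\dots,m\}.

% OP 2 (dual):
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{m}\alpha_i
  - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j\, y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j)
\quad \text{s.t.} \quad
  0 \le \alpha_i \le \frac{C}{m} \quad \forall i.
```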
This is a convex quadratic program with box constraints, for which many classical solvers are available, and which requires time polynomial in m to solve. For instance, using the barrier method [30] a solution can be found to within ε_b in time O(m⁴ log(m/ε_b)). Indeed, even the computation of the kernel matrix K takes time Θ(m²), so obtaining subquadratic training times via direct evaluation of K is not possible.

Structural SVMs
Joachims [1] showed that an efficient approximation algorithm - with running time O(m) - for linear SVMs could be obtained by considering a slightly different but related model known as a structural SVM [31], which makes use of linear combinations of label-weighted feature vectors Ψ_c (Definition 1). With this notation, the structural SVM primal and dual optimization problems are OP 3 and OP 4, where J_{cc'} = ⟨Ψ_c, Ψ_{c'}⟩ and ‖·‖_1 denotes the ℓ1-norm.
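For concreteness, the label-weighted combinations and the structural SVM pair can be written as follows; this is a statement consistent with [1] and with the notation J_{cc'} = ⟨Ψ_c, Ψ_{c'}⟩ used above, with the 1/m normalization treated as an assumption rather than a verbatim quote.

```latex
% Definition 1 (label-weighted feature combinations), for c in {0,1}^m:
\Psi_{\mathbf{c}} \;=\; \frac{1}{m}\sum_{i=1}^{m} c_i\, y_i\, \Phi(\mathbf{x}_i).

% OP 3 (structural SVM, primal):
\min_{\mathbf{w},\,\xi \ge 0} \ \frac{1}{2}\langle \mathbf{w},\mathbf{w}\rangle + C\xi
\quad \text{s.t.} \quad
\langle \mathbf{w}, \Psi_{\mathbf{c}}\rangle \;\ge\; \frac{\|\mathbf{c}\|_1}{m} - \xi
\quad \forall\, \mathbf{c} \in \{0,1\}^m.

% OP 4 (dual):
\max_{\boldsymbol{\alpha}\ge 0} \ \sum_{\mathbf{c}} \frac{\|\mathbf{c}\|_1}{m}\,\alpha_{\mathbf{c}}
 - \frac{1}{2}\sum_{\mathbf{c},\mathbf{c}'} \alpha_{\mathbf{c}}\,\alpha_{\mathbf{c}'}\, J_{\mathbf{c}\mathbf{c}'}
\quad \text{s.t.} \quad \sum_{\mathbf{c}} \alpha_{\mathbf{c}} \le C .
```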
Whereas the original SVM problem OP 1 is defined by m constraints and m slack variables ξ_i, the structural SVM OP 3 has only one slack variable ξ but 2^m constraints, corresponding to each possible binary vector c ∈ {0,1}^m. In spite of these differences, the solutions to the two problems are equivalent in the following sense.

Theorem 1 (Joachims [1]). Let (w*, ξ*_1, ..., ξ*_m) be an optimal solution of OP 1, and let ξ* = (1/m) Σ_{i=1}^m ξ*_i. Then (w*, ξ*) is an optimal solution of OP 3 with the same objective function value. Conversely, for any optimal solution (w*, ξ*) of OP 3, there is an optimal solution (w*, ξ*_1, ..., ξ*_m) of OP 1 satisfying ξ* = (1/m) Σ_{i=1}^m ξ*_i, with the same objective function value.
While elegant, Joachims' algorithm can achieve O(m) scaling only for linear SVMs - as it requires explicitly computing a set of vectors {Ψ_c} and their inner products - or for shift-invariant or low-rank kernels, where sampling methods can be employed. For high dimensional feature maps Φ not corresponding to shift-invariant kernels, computing Ψ_c classically is inefficient. We propose instead to embed the feature mapped vectors Φ(x) and linear combinations Ψ_c in the amplitudes of quantum states, and to compute the required inner products efficiently using a quantum computer.

Our Results
In Section 3 we will formally introduce the concept of a quantum feature map. For now it is sufficient to view this as a quantum circuit which, in time T_Φ, realizes a feature map Φ : R^d → H, with maximum norm max_x ‖Φ(x)‖ = R, by mapping the classical data into the state of a multi-qubit system.
Our first main result is a quantum algorithm with running time linear in m that generates an approximately optimal solution for the structural SVM problem. By Theorem 1, this is equivalent to solving the original soft-margin 1 -SVM.
Quantum nonlinear SVM training: [See Theorems 6 and 7] There is a quantum algorithm that, with probability at least 1 − δ, outputs α̂ and ξ̂ such that if (w*, ξ*) is the optimal solution of OP 3, then ŵ = Σ_c α̂_c Ψ_c satisfies P(ŵ, ξ̂) − P(w*, ξ*) ≤ min{Cε/2, ε²/(8R²)}, and (ŵ, ξ̂ + 3ε) is feasible for OP 3. The running time is linear in m up to polylogarithmic factors; here T_Φ is the time required to compute feature map Φ on a quantum computer and Ψ_min is a term that depends on both the data as well as the choice of quantum feature map.
Here and in what follows, the tilde big-O notation Õ hides polylogarithmic terms. In the Simulation section we show that, in practice, the running time of the algorithm can be significantly faster than the theoretical upper-bound. The solution α̂ is a t_max-sparse vector of total dimension 2^m. Once it has been found, a new data point x can be classified according to y_pred = sgn(⟨Σ_c α̂_c Ψ_c, Φ(x)⟩), where y_pred is the predicted label of x. This is a sum of O(m t_max) inner products in feature space, which classical methods require time O(m t_max) to evaluate in general. Our second result is a quantum algorithm for carrying out this classification with running time independent of m.
Quantum nonlinear SVM classification: [See Theorem 8] There is a quantum algorithm which, in time Õ((C R³ log(1/δ)/(ε Ψ_min)) t_max T_Φ), estimates ⟨Σ_c α̂_c Ψ_c, Φ(x)⟩ to within accuracy ε. The sign of the output is then taken as the predicted label.

Methods
Our results are based on three main components: Joachims' linear time classical algorithm SVM-perf, quantum feature maps, and efficient quantum methods for estimating inner products of linear combinations of high dimensional vectors.

SVM-perf: a linear time algorithm for linear SVMs
On the surface, the structural SVM problems OP 3 and OP 4 look more complicated to solve than the original SVM problems OP 1 and OP 2. However, it turns out that the solution α* to OP 4 is highly sparse and, consequently, the structural SVM admits an efficient algorithm. Joachims' original procedure is presented in Algorithm 1.
The main idea behind Algorithm 1 is to iteratively solve successively more constrained versions of problem OP 3. That is, a working set of indices W ⊆ {0,1}^m is maintained such that, at each iteration, the solution (w, ξ) is only required to satisfy the constraints corresponding to indices in W. The inner for loop then finds a new index c* which corresponds to the maximally violated constraint in OP 3, and this index is added to the working set. The algorithm proceeds until no constraint is violated by more than ε. It can be shown that each iteration must improve the value of the dual objective by a constant amount, from which it follows that the algorithm terminates in a number of rounds independent of m.
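The following is a minimal classical sketch of this working-set loop for linear SVMs. It is illustrative only: the restricted problem is solved here in the primal with scipy rather than via the dual used by SVM-perf, and all function and parameter names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def svm_perf_linear(X, y, C=1.0, eps=0.01, max_iter=100):
    """Working-set (cutting-plane) training of a linear structural SVM.
    X: (m, d) data, y: (m,) labels in {-1, +1}."""
    m, d = X.shape
    W, rhs = [], []               # working set of constraint vectors psi_c and targets ||c||_1/m
    w, xi = np.zeros(d), 0.0
    for _ in range(max_iter):
        # Most violated constraint: c_i = 1 iff x_i lies inside the margin.
        c = (y * (X @ w) < 1.0).astype(float)
        psi_c = (c * y) @ X / m
        target = c.sum() / m
        if target - w @ psi_c <= xi + eps:    # no constraint violated by more than eps
            break
        W.append(psi_c)
        rhs.append(target)
        # Restricted problem: min 1/2 ||w||^2 + C*xi over the constraints in W only.
        def obj(z):
            return 0.5 * z[:d] @ z[:d] + C * z[d]
        cons = [{"type": "ineq",
                 "fun": (lambda z, p=p, t=t: z[:d] @ p - t + z[d])}
                for p, t in zip(W, rhs)]
        bounds = [(None, None)] * d + [(0.0, None)]   # xi >= 0
        res = minimize(obj, np.r_[w, xi], constraints=cons, bounds=bounds)
        w, xi = res.x[:d], res.x[d]
    return w, xi

# Toy usage on linearly separable-ish data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
w, xi = svm_perf_linear(X, y, C=10.0, eps=1e-2)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```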

The following guarantee from [1] then holds. Let R = max_i ‖Φ(x_i)‖ be the largest norm of the training set vectors. For any training set S and any ε > 0, if (w*, ξ*) is an optimal solution of OP 3, then Algorithm 1 returns a point (w, ξ) that has a better objective value than (w*, ξ*), and for which (w, ξ + ε) is feasible in OP 3.
In terms of time cost, each iteration t of the algorithm involves solving the restricted optimization problem (w, ξ) = argmin_{w, ξ≥0} ½⟨w, w⟩ + Cξ subject to the constraints indexed by W, which is done in practice by solving the corresponding dual problem, i.e. the same as OP 4 but with summations over c ∈ W instead of over all c ∈ {0,1}^m. This involves computing the inner products J_{cc'} = ⟨Ψ_c, Ψ_{c'}⟩ for all c, c' ∈ W. In what follows we will not consider any sparsity restrictions.
For nonlinear SVMs, the feature maps Φ(x_i) may be of very large dimension, which precludes explicitly computing the vectors Ψ_c. Instead, each inner product ⟨Ψ_c, Ψ_{c'}⟩ must be expanded into a sum of O(m²) inner products ⟨Φ(x_i), Φ(x_j)⟩, which are then each evaluated using the kernel trick. This rules out the possibility of an O(m) algorithm, at least using methods that rely on the kernel trick to evaluate each ⟨Φ(x_i), Φ(x_j)⟩. Noting that w = Σ_c α_c Ψ_c, the inner products ⟨w, Φ(x_i)⟩ are similarly expensive to compute directly classically if the dimension of the feature map is large.

Quantum feature maps
We now show how quantum computing can be used to efficiently approximate the inner products ⟨Ψ_c, Ψ_{c'}⟩ and ⟨Σ_c α_c Ψ_c, Φ(x_i)⟩, where the high dimensional Ψ_c can be implemented by a quantum circuit using only a number of qubits logarithmic in the dimension. We first assume that the data vectors x_i ∈ R^d are encoded in the state of an O(d)-qubit register |x_i⟩ via some suitable encoding scheme, e.g. given an integer k, x_i could be encoded in n = kd qubits by approximating each of the d values of x_i by a length k bit string which is then encoded in a computational basis state of n qubits. Once encoded, a quantum feature map maps this information into a larger space by preparing, in time T_Φ, the normalized state |Φ(x)⟩ whose amplitudes are proportional to the entries of the feature vector Φ(x). Note that the states |Φ(x)⟩ are not necessarily orthogonal. Implementing such a quantum feature map could be done, for instance, through a controlled parameterized quantum circuit.
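As an illustration of the amplitude-encoding convention assumed here, the following classical sketch pads a feature vector to the next power of two and normalizes it, so that a D-dimensional feature vector occupies only ⌈log₂ D⌉ qubits. Names are ours.

```python
import numpy as np

def amplitude_encode(phi):
    """Return the normalized state |Phi(x)> whose amplitudes are proportional
    to the feature vector phi, padded to the next power of two."""
    D = len(phi)
    n_qubits = int(np.ceil(np.log2(D)))
    state = np.zeros(2 ** n_qubits)
    state[:D] = phi
    norm = np.linalg.norm(state)
    return state / norm, norm          # (|Phi(x)>, ||Phi(x)||)

phi = np.array([0.3, -1.2, 0.5, 2.0, 0.7])   # a 5-dimensional feature vector
state, norm = amplitude_encode(phi)
print(len(state), norm)                       # 8 amplitudes -> 3 qubits
print(np.dot(state, state))                   # 1.0: a normalized state
```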
We also define the quantum state analogue of Ψ_c from Definition 1: |Ψ_c⟩ = Ψ_c/‖Ψ_c‖, i.e. the normalized quantum state whose amplitudes are proportional to the entries of Ψ_c.

Quantum inner product estimation
Let real vectors x, y ∈ R^d have corresponding normalized quantum states |x⟩ = (1/‖x‖) Σ_j x_j |j⟩ and |y⟩ = (1/‖y‖) Σ_j y_j |j⟩. The following result shows how the inner product ⟨x, y⟩ = ‖x‖ ‖y‖ ⟨x|y⟩ can be estimated efficiently on a quantum computer.

Theorem 3 (Robust Inner Product Estimation [32], restated). Let |x⟩ and |y⟩ be quantum states with real amplitudes and with bounded norms ‖x‖, ‖y‖ ≤ R. If |x⟩ and |y⟩ can each be generated by a quantum circuit in time T, and if estimates of the norms are known to within ε/(3R) additive error, then one can perform the mapping |x⟩|y⟩|0⟩ → |x⟩|y⟩|s⟩ where, with probability at least 1 − δ, s satisfies |s − ⟨x, y⟩| ≤ ε.
Thus, if one can efficiently create the quantum states |Ψ_c⟩ and estimate the norms ‖Ψ_c‖, then the corresponding J_{cc'} = ⟨Ψ_c, Ψ_{c'}⟩ = ‖Ψ_c‖ ‖Ψ_{c'}‖ ⟨Ψ_c|Ψ_{c'}⟩ can be approximated efficiently. In this section we show that this is possible with a quantum random access memory (qRAM), which is a device that allows classical data to be queried efficiently in superposition. That is, if x ∈ R^d is stored in qRAM, then a query to the qRAM implements the unitary Σ_j α_j |j⟩|0⟩ → Σ_j α_j |j⟩|x_j⟩. If the elements x_j of x arrive as a stream of entries (j, x_j) in some arbitrary order, then x can be stored in a particular data structure [33] in time Õ(d) and, once stored, |x⟩ = (1/‖x‖) Σ_j x_j |j⟩ can be created in time polylogarithmic in d. Note that when we refer to real-valued data being stored in qRAM, it is implied that the information is stored as a binary representation of the data, so that it may be loaded into a qubit register.
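A simplified classical sketch of the binary-tree idea behind the data structure of [33]: each internal node stores the sum of squared amplitudes of the leaves below it, so entries arriving in arbitrary order can be inserted in O(log d) time and the norm read off at the root. This is an illustration under our own naming, not the exact construction.

```python
import numpy as np

class KPTree:
    """Binary tree over a length-2^k vector: leaves hold x_j^2 (sign stored
    separately); each internal node holds the sum of its subtree."""
    def __init__(self, k):
        self.k = k
        self.sums = [np.zeros(2 ** level) for level in range(k + 1)]
        self.signs = np.ones(2 ** k)

    def update(self, j, xj):
        """Insert/overwrite entry (j, x_j) in time O(k) = O(log d)."""
        delta = xj ** 2 - self.sums[self.k][j]
        self.signs[j] = np.sign(xj) if xj != 0 else 1.0
        idx = j
        for level in range(self.k, -1, -1):
            self.sums[level][idx] += delta
            idx //= 2

    def norm(self):
        return np.sqrt(self.sums[0][0])        # root stores ||x||^2

tree = KPTree(k=3)                             # stores an 8-dimensional vector
for j, xj in enumerate([0.3, -1.2, 0.5, 2.0, 0.7, 0.0, 0.0, -0.4]):
    tree.update(j, xj)
print(tree.norm())                             # equals the Euclidean norm of the vector
```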
Theorem 4 (informal). Suppose that the training data is stored in qRAM and that estimates of the norms ‖Ψ_c‖ and ‖Ψ_{c'}‖ are known. Then, with probability at least 1 − δ, an estimate s_{cc'} satisfying |s_{cc'} − ⟨Ψ_c, Ψ_{c'}⟩| ≤ ε can be computed efficiently on a quantum computer. A similar result (Theorem 5) applies to estimating inner products of the form ⟨Σ_c α_c Ψ_c, y_i Φ(x_i)⟩.

Linear Time Algorithm for Nonlinear SVMs
The results of the previous section can be used to generalize Joachims' algorithm to quantum feature-mapped data. Let S_n^+ denote the cone of n × n positive semi-definite matrices. Given X ∈ R^{n×n}, its projection onto S_n^+ is the closest positive semi-definite matrix to X in Frobenius norm. Define IP_{ε,δ}(x, y) to be a quantum subroutine which, with probability at least 1 − δ, returns an estimate s of the inner product of two vectors x, y satisfying |s − ⟨x, y⟩| ≤ ε. As we have seen, with appropriate data stored in qRAM, this subroutine can be implemented efficiently on a quantum computer.
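The classical projection step can be written directly: zeroing out the negative eigenvalues of a symmetric matrix yields its Frobenius-nearest positive semi-definite matrix. A minimal numpy sketch (names are ours):

```python
import numpy as np

def project_psd(X):
    """Project a symmetric matrix onto the p.s.d. cone S_n^+ (Frobenius-nearest)."""
    Xs = (X + X.T) / 2.0                      # symmetrize against numerical noise
    vals, vecs = np.linalg.eigh(Xs)
    return (vecs * np.maximum(vals, 0.0)) @ vecs.T

# A noisy estimate of a Gram matrix need not be p.s.d.; its projection is.
J_noisy = np.array([[1.0, 0.9], [0.9, 0.7]]) + 0.05 * np.random.randn(2, 2)
print(np.linalg.eigvalsh(project_psd(J_noisy)))   # all eigenvalues >= 0
```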
Our quantum algorithm for nonlinear structural SVMs is presented in Algorithm 2. At first sight, it appears significantly more complicated than Algorithm 1, but this is due in part to the more detailed notation used to aid the analysis later. The key differences are (i) the matrix elements J_{cc'} = ⟨Ψ_c, Ψ_{c'}⟩ are only estimated to precision ε_J by the quantum subroutine; (ii) as the corresponding matrix is not guaranteed to be positive semi-definite, an additional classical projection step must therefore be carried out to map the estimated matrix onto the p.s.d. cone at each iteration; (iii) in the classical algorithm, the values of c*_i are deduced from c*_i = max(0, 1 − y_i ⟨w, x_i⟩), whereas here we can only estimate the inner products ⟨w, Φ(x_i)⟩ to precision ε, and w is known only implicitly according to w = Σ_{c∈W} α_c Ψ_c. Note that apart from the quantum inner product estimation subroutines, all other computations are performed classically.

Theorem 6. Let t_max be a user-defined parameter and let (w*, ξ*) be an optimal solution of OP 3. If Algorithm 2 terminates in at most t_max iterations then, with probability at least 1 − δ, it outputs α̂ and ξ̂ such that ŵ = Σ_c α̂_c Ψ_c satisfies P(ŵ, ξ̂) − P(w*, ξ*) ≤ min{Cε/2, ε²/(8R²)}, and (ŵ, ξ̂ + 3ε) is feasible for OP 3. The running time depends on Ψ_min = min_{c∈W_{t_f}} ‖Ψ_c‖, where t_f ≤ t_max is the iteration at which the algorithm terminates.
Proof. See Appendix C.
The total number of outer-loop iterations (indexed by t) of Algorithm 2 is upper-bounded by the choice of t_max. One may wonder why we do not simply set t_max = max{4/ε, 16CR²/ε²}, as this would ensure that, with high probability, the algorithm outputs a nearly optimal solution. The reason is that t_max also affects the precision ε_J and failure probability δ_J required of each inner product estimation, and hence the cost of each estimate, which grows with log(1/δ_J)/ε_J. The choice of parameters which, as we show in the Simulation section, leads to good classification performance on the datasets we consider, corresponds to t_max = 1.6 × 10⁹ and log(1/δ_J)/ε_J ≥ 1.6 × 10¹³. In practice, we find that this upper-bound on t_max is very loose and the situation is far better: the algorithm can terminate successfully in very few iterations with much smaller values of t_max. In the examples we consider, the algorithm terminates successfully before t reaches t_max = 50, corresponding to log(1/δ_J)/ε_J ≤ 3.7 × 10⁸. The running time of Algorithm 2 also depends on the quantity Ψ_min, which is a function of both the dataset as well as the quantum feature map chosen. While this can make Ψ_min hard to predict, we will again see in the Simulation section that in practice the situation is optimistic: we empirically find that Ψ_min is neither too small, nor does it scale noticeably with m or the dimension of the quantum feature map.

Classification of new test points
As is standard in SVM theory, the solution α̂ from Algorithm 2 can be used to classify a new data point x according to y_pred = sgn(⟨Σ_{c∈W} α̂_c Ψ_c, Φ(x)⟩), where y_pred is the predicted label of x. From Theorem 5, and noting that |W| ≤ t_max, we obtain the following result.

Theorem 8. Let α̂ be the output of Algorithm 2, and let x be stored in qRAM. There is a quantum algorithm that, with probability at least 1 − δ, estimates the inner product ⟨Σ_c α̂_c Ψ_c, Φ(x)⟩ to within error ε in time Õ((C R³ log(1/δ)/(ε Ψ_min)) t_max T_Φ). Taking the sign of the output then completes the classification.
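For comparison, the same decision rule evaluated classically is a sum of |W| ≤ t_max kernel-space inner products. The sketch below assumes our reconstructed normalization Ψ_c = (1/m) Σ_i c_i y_i Φ(x_i) and a user-supplied kernel function; both the normalization and the RBF kernel choice are illustrative assumptions.

```python
import numpy as np

def classify(x_new, alphas, working_set, X_train, y_train, kernel):
    """Predict the label of x_new from the sparse structural-SVM solution,
    using <Psi_c, Phi(x)> = (1/m) * sum_i c_i * y_i * kernel(x_i, x_new)."""
    m = len(y_train)
    score = 0.0
    for a_c, c in zip(alphas, working_set):
        idx = np.flatnonzero(c)
        score += a_c * sum(y_train[i] * kernel(X_train[i], x_new) for i in idx) / m
    return int(np.sign(score))

# Tiny usage example with an illustrative classical RBF kernel.
rbf = lambda u, v: np.exp(-0.5 * np.linalg.norm(np.asarray(u) - np.asarray(v)) ** 2)
X_tr = np.array([[0.0], [1.0], [2.0]]); y_tr = np.array([1, -1, -1])
W_set = [np.array([1, 1, 0]), np.array([1, 0, 1])]     # two working-set vectors c
print(classify(np.array([0.2]), [0.6, 0.4], W_set, X_tr, y_tr, rbf))
```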

Simulation
While the true performance of our algorithm for large m and high dimensional quantum feature maps would require a fault-tolerant quantum computer to evaluate, we can gain some insight into how it behaves by performing smaller scale numerical experiments on a classical computer. In this section we empirically find that the algorithm can have good performance in practice, both in terms of classification accuracy as well as in terms of the parameters which impact running time.

Data set
To test our algorithm we need to choose both a data set as well as a quantum feature map. The general question of what constitutes a good quantum feature map, especially for classifying classical data sets, is an open problem and beyond the scope of this investigation. However, if the data is generated from a quantum problem, then physical intuition may guide our choice of feature map. We therefore consider the following toy example which is nonetheless instructive. Let H_N be the Hamiltonian of a generalized Ising model on N spins, given in (3), where J, ∆, Γ are vectors of real parameters to be chosen, and Z_j, X_j are Pauli Z and X operators acting on the j-th qubit in the chain, respectively. We generate a data set by randomly selecting m points (J, ∆, Γ) and labelling them according to whether the ground state expectation value of the total magnetization squared exceeds some cut-off value µ_0 (condition (4)), i.e. the points are labelled depending on whether the average total magnetism squared is above or below µ_0. In our simulations we consider a special case of (3) where J_j = J cos(k_J π(j−1)/N), ∆_j = ∆ sin(k_∆ πj/N) and Γ_j = Γ, with J and ∆ the parameters that vary across the data set and k_J, k_∆, Γ fixed constants.
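A small classical sketch of how such a data set could be generated by exact diagonalization. The Hamiltonian form below (nearest-neighbour ZZ couplings plus local Z and X fields on an open chain) and the cut-off value used in the example are assumptions made for illustration; they are our reading of the generalized Ising model described above rather than the exact equation (3).

```python
import numpy as np
from functools import reduce

I2 = np.eye(2); Z = np.diag([1.0, -1.0]); X = np.array([[0.0, 1.0], [1.0, 0.0]])

def op_on(site, op, N):
    """Embed a single-qubit operator op on the given site of an N-spin chain."""
    mats = [I2] * N
    mats[site] = op
    return reduce(np.kron, mats)

def ising_hamiltonian(Jv, Dv, Gv):
    """Assumed form: H = sum_j J_j Z_j Z_{j+1} + sum_j (Delta_j Z_j + Gamma_j X_j)."""
    N = len(Dv)
    H = np.zeros((2 ** N, 2 ** N))
    for j in range(N - 1):
        H += Jv[j] * op_on(j, Z, N) @ op_on(j + 1, Z, N)
    for j in range(N):
        H += Dv[j] * op_on(j, Z, N) + Gv[j] * op_on(j, X, N)
    return H

def label_point(Jv, Dv, Gv, mu0):
    """Label = +1 if <(sum_j Z_j)^2> in the ground state exceeds the cut-off mu0."""
    N = len(Dv)
    _, vecs = np.linalg.eigh(ising_hamiltonian(Jv, Dv, Gv))
    gs = vecs[:, 0]                               # ground state (lowest eigenvalue)
    M = sum(op_on(j, Z, N) for j in range(N))
    m2 = gs @ (M @ M) @ gs
    return 1 if m2 > mu0 else -1

# One random data point for N = 4 spins; parameter ranges follow the text, mu0 is illustrative.
N, kJ, kD, Gamma, mu0 = 4, 1, 1, 0.5, 3.0
Jval, Dval = np.random.uniform(-2, 2, size=2)
Jv = Jval * np.cos(kJ * np.pi * np.arange(N) / N)          # J_j = J cos(k_J pi (j-1)/N)
Dv = Dval * np.sin(kD * np.pi * np.arange(1, N + 1) / N)   # Delta_j = Delta sin(k_D pi j/N)
Gv = Gamma * np.ones(N)
print(label_point(Jv, Dv, Gv, mu0))
```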

Quantum feature map
For the quantum feature map we choose the state given in (5), constructed from |ψ_GS⟩, the ground state of (3); as it is a normalized state, it has corresponding value R = 1.
We compute such feature maps classically by explicitly diagonalizing H_N. In a real implementation of our algorithm on a quantum computer, such a feature map would be implemented by a controlled unitary for generating the (approximate) ground state of H_N, which could be done by a variety of methods, e.g. by digitized adiabatic evolution or methods based on imaginary time evolution [34,35], with running time T_Φ dependent on the degree of accuracy required. The choice of (5) is motivated by noting that condition (4) is equivalent to determining the sign of ⟨W, Ψ⟩, where W is a vector which depends only on µ_0, and not on the choice of parameters in H_N (see Appendix E). By construction, W defines a separating hyperplane for the data, so the chosen quantum feature map separates the data in feature space. As the Hamiltonian is real, it has a set of real eigenvectors and hence |Ψ⟩ can be defined to have real amplitudes, as required.

Numerical results
We first evaluate the performance of our algorithm on data sets S_{N,m} for N = 6 and increasing orders of m from 10² to 10⁵.
• The values of µ_0, k_J, k_∆, Γ were fixed and chosen to give roughly balanced data, i.e. the ratio of +1 to −1 labels is no more than 70:30 in favor of either label.
• The values C = 10⁴ and ε = 10⁻² were selected to give classification accuracy competitive with classical SVM algorithms utilizing standard Gaussian radial basis function (RBF) kernels, with hyperparameters trained using a subset of the training set of size 20% used for hold-out validation. Note that the quantum feature maps do not have any similar tunable parameters, and a modification of (5), for instance to include a tunable weighting between the two parts of the superposition, could be introduced to further improve performance.
• The quantum IP_{ε,δ} inner product estimations in the algorithm were approximated by adding Gaussian random noise to the true inner product, such that the resulting inner product was within ε of the true value with probability at least 1 − δ (a sketch of this calibration follows below). Classically simulating quantum IP_{ε,δ} inner product estimation with inner products distributed according to the actual quantum procedures underlying Theorems 4 and 5 was too computationally intensive in general to perform. However, these were tested on small data sets and quantum feature vectors, and found to behave very similarly to adding Gaussian random noise. This is consistent with the results of the numerical simulations in [32].
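A minimal sketch of one way to calibrate such noise so that the estimate deviates from the true inner product by more than ε with probability at most δ; the two-sided Gaussian quantile used below is our choice, not a prescription from the paper.

```python
import numpy as np
from scipy.stats import norm

def simulated_ip(x, y, eps, delta, rng=np.random.default_rng()):
    """Return <x, y> plus Gaussian noise whose magnitude exceeds eps with
    probability at most delta."""
    sigma = eps / norm.ppf(1.0 - delta / 2.0)
    return float(np.dot(x, y)) + rng.normal(scale=sigma)

x, y = np.random.randn(8), np.random.randn(8)
errs = [abs(simulated_ip(x, y, eps=0.01, delta=0.1) - np.dot(x, y)) for _ in range(10000)]
print(np.mean(np.array(errs) > 0.01))   # empirically close to (at most) 0.1
```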
Note that the values of C, ε, δ chosen correspond to max{4/ε, 16CR²/ε²} > 10⁹. This is an upper-bound on the number of iterations t_max needed for the algorithm to converge to a good solution. However, we find empirically that t_max = 50 is sufficient for the algorithm to terminate with a good solution across the range of m we consider.
The results are shown in Table 1. We find that (i) with these choices of C, ε, δ, t_max our algorithm has high classification accuracy, competitive with standard classical SVM algorithms utilizing RBF kernels with optimized hyperparameters; (ii) Ψ_min is of the order 10⁻² in these cases, and does not scale noticeably over the range of m from 10² to 10⁵. If Ψ_min were to decrease polynomially (or worse, exponentially) in m then this would be a severe limitation of our algorithm. Fortunately this does not appear to be the case.
We further investigate the behaviour of Ψ_min by generating data sets S_{N,m} for fixed m = 1000 and N ranging from 4 to 8. For each N, we generate 100 random data sets S_{N,m}(µ_0, J, k_J, ∆, k_∆, Γ), where each data set consists of 1000 points (J, ∆) sampled uniformly at random in the range [−2, 2]², and random values of k_J, k_∆, Γ chosen to give roughly balanced data sets as before, with µ_0 = 3N. Unlike before, we do not divide the data into training and test sets. Instead, we perform training on the entire data set, using parameters (C, ε, δ, t_max) = (10⁴, 10⁻², 10⁻¹, 50), and record the value of Ψ_min in each instance. The results are given in Table 2 and show that across this range of N (i) the average value of Ψ_min is of order 10⁻² and (ii) the spread around this average is fairly tight, and the minimum value of Ψ_min in any single instance is of order 10⁻³. These results support those of the first experiment, and indicate that the value of Ψ_min may not adversely affect the running time of the algorithm in practice.

Conclusions
We have proposed a quantum extension of SVM-perf for training nonlinear soft-margin ℓ1-SVMs in time linear in the number of training examples m, up to polylogarithmic factors, and given numerical evidence that the algorithm can perform well in practice as well as in theory. This goes beyond classical SVM-perf, which achieves linear m scaling only for linear SVMs or for feature maps corresponding to low-rank or shift-invariant kernels, and brings the theoretical running time and applicability of SVM-perf in line with the classical Pegasos algorithm which - in spite of having best-in-class asymptotic guarantees - has empirically been outperformed by other methods on certain datasets. Our algorithm also goes beyond previous quantum algorithms which achieve linear or better scaling in m for other variants of SVMs, which lack some of the desirable properties of the soft-margin ℓ1-SVM model. Following this work, it is straightforward to propose a quantum extension of Pegasos. An interesting question to consider is how such an algorithm would perform against the quantum SVM-perf algorithm we have presented here. Another important direction for future research is to investigate methods for selecting good quantum feature maps and associated values of R for a given problem. While work has been done on learning quantum feature maps by training parameterizable quantum circuits [19,36,37,38], a deeper understanding of quantum feature map construction and optimization is needed. In particular, the question of when an explicit quantum feature map can be advantageous compared to the classical kernel trick - as implemented in Pegasos or other state-of-the-art algorithms - needs further investigation. Furthermore, in classical SVM training, typically one of a number of flexible, general purpose kernels such as the Gaussian RBF kernel can be employed in a wide variety of settings. Whether similar, general purpose quantum feature maps can be useful in practice is an open problem, and one that could potentially greatly affect the adoption of quantum algorithms as a useful tool for machine learning.
|Ψ_c⟩ can then be created by the following procedure. Discarding the |0⟩ register, and applying the Hadamard transformation H|j⟩ = (1/√m) Σ_k (−1)^{j·k} |k⟩ to the first register, gives a state of the form (‖Ψ_c‖/η_c) |0⟩|Ψ_c⟩ + |0^⊥, junk⟩, where |0^⊥, junk⟩ is an unnormalized quantum state in which the first qubit is orthogonal to |0⟩. The state (‖Ψ_c‖/η_c) |0⟩|Ψ_c⟩ + |0^⊥, junk⟩ can therefore be created in time T_{U_c} + 2T_{U_x} + T_Φ = Õ(T_Φ). By quantum amplitude amplification and amplitude estimation [39], given access to a unitary operator U acting on k qubits such that U|0⟩^⊗k = sin(θ)|x, 0⟩ + cos(θ)|G, 0^⊥⟩ (where |G⟩ is arbitrary), sin²(θ) can be estimated to additive error ε in time O(T(U)/ε), and |x⟩ can be generated in expected time O(T(U)/sin(θ)), where T(U) is the time required to implement U. Amplitude amplification applied to the unitary creating the state in (6) allows one to create |Ψ_c⟩ in expected time Õ((η_c/‖Ψ_c‖) T_Φ). Similarly, amplitude estimation can be used to obtain a value s approximating ‖Ψ_c‖²/η_c². Outputting the estimate defined by ‖Ψ̃_c‖² = η_c² s then yields a norm estimate satisfying |‖Ψ̃_c‖ − ‖Ψ_c‖| ≤ ε/(3R).
Proof. From Lemma 1, the states |Ψ_c⟩ and |Ψ_{c'}⟩ can be created in time Õ((R/min{‖Ψ_c‖, ‖Ψ_{c'}‖}) T_Φ), and estimates of their norms to ε/(3R) additive error can be obtained in time Õ((R³/ε) T_Φ). From Theorem 3 it follows that an estimate s_{cc'} satisfying |s_{cc'} − ⟨Ψ_c, Ψ_{c'}⟩| ≤ ε can be found with probability at least 1 − δ in the stated time. With the above data in qRAM, an almost identical analysis to that in Theorem 4 can be applied to deduce that, for any c ∈ W, with probability at least 1 − δ/|W|, an estimate t_{ci} of ⟨Ψ_c, y_i Φ(x_i)⟩ with error at most ε/C can be computed in time T_{ci}, and the total time required to estimate all |W| terms (i.e. t_{ci} for all c ∈ W) is thus |W| T_{ci}. The probability that every term t_{ci} is obtained to ε/C accuracy is therefore at least (1 − δ/|W|)^{|W|} ≥ 1 − δ. In this case, the weighted sum Σ_{c∈W} α_c t_{ci} can be computed classically, and, since Σ_{c∈W} α_c ≤ C, it satisfies |Σ_{c∈W} α_c t_{ci} − ⟨Σ_{c∈W} α_c Ψ_c, y_i Φ(x_i)⟩| ≤ ε.

B Proof of Theorem 6
The analysis of Algorithm 2 is based on [13,31], with additional steps and complexity required to bound the errors due to inner product estimation and projection onto the p.s.d. cone.

Theorem 6. Let t_max be a user-defined parameter and let (w*, ξ*) be an optimal solution of OP 3. If Algorithm 2 terminates in at most t_max iterations then, with probability at least 1 − δ, it outputs α̂ and ξ̂ such that ŵ = Σ_c α̂_c Ψ_c satisfies P(ŵ, ξ̂) − P(w*, ξ*) ≤ min{Cε/2, ε²/(8R²)}, and (ŵ, ξ̂ + 3ε) is feasible for OP 3. If t_max ≥ max{4/ε, 16CR²/ε²} then the algorithm is guaranteed to terminate in at most t_max iterations.

Lemma 2. When Algorithm 2 terminates successfully after at most t_max iterations, the probability that all inner products are estimated to within their required tolerances throughout the duration of the algorithm is at least 1 − δ.
Proof. Each iteration t of the Algorithm involves:
• O(t²) inner product estimations IP_{ε_J,δ_J}(Ψ_c, Ψ_{c'}), for all pairs c, c' ∈ W_t. The probability of successfully computing all t² inner products to within error ε_J is at least (1 − δ_J)^{t²}.
• m inner product estimations of the form ⟨Σ_{c∈W_t} α_c Ψ_c, y_i Φ(x_i)⟩, one for each training example, in order to determine the most violated constraint. The probability of all of these estimates lying within error ε is bounded analogously.
Since the algorithm terminates successfully after at most t_max iterations, the probability that all the inner products are estimated to within their required tolerances is at least 1 − δ, where the last bound follows from Bernoulli's inequality. By Lemma 2 we can analyze Algorithm 2, assuming that all the quantum inner product estimations succeed, i.e. each call to IP_{ε,δ}(x, y) produces an estimate of ⟨x, y⟩ within error ε. In what follows, let J_{W_t} be the |W_t| × |W_t| matrix with elements ⟨Ψ_c, Ψ_{c'}⟩ for c, c' ∈ W_t, and let Ĵ_{W_t} be the estimated matrix after projection onto the p.s.d. cone, as defined in Algorithm 2. Then ‖Ĵ_{W_t} − J_{W_t}‖_σ ≤ ‖Ĵ_{W_t} − J_{W_t}‖_F ≤ 1/(C t_max), where ‖·‖_σ is the spectral norm.
Proof. The relation between the spectral and Frobenius norms is elementary. We thus prove the upper-bound on the Frobenius norm. By assumption, all matrix elements J̃_{cc'} satisfy |J̃_{cc'} − ⟨Ψ_c, Ψ_{c'}⟩| ≤ ε_J = 1/(C t t_max). Thus ‖Ĵ_{W_t} − J_{W_t}‖_F ≤ ‖J̃_{W_t} − J_{W_t}‖_F ≤ |W_t| ε_J ≤ t ε_J = 1/(C t_max), where the first inequality holds because projecting J̃_{W_t} onto the p.s.d. cone (which, by the definition of Ĵ_{W_t} in Algorithm 2, yields Ĵ_{W_t}) cannot increase its Frobenius norm distance to a p.s.d. matrix J_{W_t}, the second inequality because each of the |W_t|² matrix elements differs by at most ε_J, and the third inequality because the size of the index set W_t increases by at most one per iteration.
To proceed, let us introduce some additional notation. Given an index set W, let D_W denote the dual objective of OP 4 restricted to indices c ∈ W, and let D*_W denote its optimal value. Next we show that each iteration of the algorithm increases the working set W such that the optimal value of the restricted problem, D*_W, increases by a certain amount. Note that we do not explicitly compute D*_W, as it will be sufficient to know that its value increases each iteration.
For any 0 ≤ β ≤ C, the vector α + βη is entrywise non-negative by construction, and increases the dual objective for any η satisfying η^T ∇D(α) > 0 (see Appendix D). We now show that this condition holds for the η defined above, by bounding the relevant terms of the gradient of D. From Lemmas 5 and 6 we obtain the bounds (11) and (12), and since Σ_{c∈W_t} α_c = Σ_{c∈W_t} α̂_c ≤ C it follows that (13) holds. We also note that J_{cc'} = ⟨Ψ_c, Ψ_{c'}⟩ ≤ max_{c∈{0,1}^m} ‖Ψ_c‖² ≤ R². Combining (11), (12) and (13), this increase is at least C/t_max. D*_{W_t} is upper-bounded by D*, the optimal value of OP 4 which, by Lagrange duality, is equal to the optimum value of the primal problem OP 3, which is itself upper bounded by C (corresponding to the feasible solution w = 0, ξ = 1). Thus, the algorithm must terminate after at most t_max iterations.
We now show that the outputs α̂ and ξ̂ of Algorithm 2 can be used to define a feasible solution to OP 3.
We are now in a position to prove Theorem 6.
Theorem 6. Let t_max be a user-defined parameter and let (w*, ξ*) be an optimal solution of OP 3. If Algorithm 2 terminates in at most t_max iterations then, with probability at least 1 − δ, it outputs α̂ and ξ̂ such that ŵ = Σ_c α̂_c Ψ_c satisfies P(ŵ, ξ̂) − P(w*, ξ*) ≤ min{Cε/2, ε²/(8R²)}, and (ŵ, ξ̂ + 3ε) is feasible for OP 3. If t_max ≥ max{4/ε, 16CR²/ε²} then the algorithm is guaranteed to terminate in at most t_max iterations.