Quantum Alphatron: quantum advantage for learning with kernels and noise

At the interface of machine learning and quantum computing, an important question is what distributions can be learned provably with optimal sample complexities and with quantum-accelerated time complexities. In the classical case, Goel and Klivans discussed the \textit{Alphatron}, an algorithm to learn distributions related to kernelized regression, which they also applied to the learning of two-layer neural networks. In this work, we provide quantum versions of the Alphatron in the fault-tolerant setting. In a well-defined learning model, this quantum algorithm is able to provide a polynomial speedup for a large range of parameters of the underlying concept class. We discuss two types of speedups, one for evaluating the kernel matrix and one for evaluating the gradient in the stochastic gradient descent procedure. We also discuss the quantum advantage in the context of learning of two-layer neural networks. Our work contributes to the study of quantum learning with kernels and from samples.


Introduction
Machine learning is highly successful in a variety of applications using heuristic approaches, even though the methods being used are often without strong guarantees on their learning performance. Important questions are why common machine learning algorithms such as stochastic gradient descent and kernel methods [28] work well and what is the best way to interpret the results. Computational learning theory addresses some of the fundamental theoretical questions and provides a systematic framework to discuss provable learning of probability distributions and machine learning architectures (such as neural networks). In a variety of settings and architectures, further assumptions on the underlying distribution can rule out hard instances and lead to provable and fast learning algorithms. Such guarantees have been given for generalized linear models, Ising models, and Markov random fields [20], for example. The Alphatron developed by Goel and Klivans [15] is a gradient-descent-like algorithm for isotonic regression with the inclusion of kernel functions, which provably learns a kernelized, non-linear concept class of functions with a bounded noise term. As a consequence, it can be employed to learn two-layer neural networks, where one layer of activation functions feeds into a single activation function.
Quantum gradient computation has been considered in [12]. Many algorithms are envisioned for near-term quantum computers [24,23,5]. Kernel methods extend linear learning tasks to non-linear learning tasks and, similarly, quantum kernel methods [29,18] use a high-dimensional Hilbert space of a quantum system to encode features of data. Some algorithms are similar in spirit to the use of heuristic methods in classical machine learning and often cannot obtain provable guarantees for the quality of learning and for the run time. An interesting avenue for quantum algorithms for machine learning is therefore to take provable classical algorithms for learning and study provable quantum speedups which retain the guarantees of the classical algorithms [7,2,21].
In this work, we provide quantum speedups for the Alphatron and its application to non-linear classes of functions and two-layer neural networks. First, we consider the simple idea of pre-computing the kernel matrix used in the algorithm. Our setting is one where the samples are given via quantum query access. Using this access, we can harness quantum subroutines to estimate the entries of the kernel matrices used in the algorithm. The quantum subroutines we use are adaptations of the amplitude estimation algorithm. We show that the learning guarantees can be preserved despite the erroneous estimation of the kernel matrices. In a subsequent step, we also quantize the Alphatron algorithm itself. To this end, we require the storage of intermediate values in a quantum-accessible memory [14,13,3]. In particular, we show that there are estimations inside the main loop of the algorithm which can be replaced with quantum subroutines, while keeping the main loop intact. We carefully study the regimes where the algorithms allow for a quantum speedup. We are again able to show that the other parameters of the algorithm remain stable under these estimations. Our main result is that we obtain a quantum algorithm for learning the original concept class, where a quantum speedup is obtained for a large parameter regime of the concept class. We consider a previously defined class of two-layer neural networks and demonstrate that these networks are outside the regime for quantum advantage. We define a different neural network architecture which can exhibit a quantum advantage for learning.
The paper is organized as follows. Section 2 discusses the mathematical preliminaries, the weak p-concept learning setting, and kernel methods, and introduces the Alphatron algorithm with a run time analysis. Section 3 discusses the kernel matrix estimation in the context of the Alphatron using both classical sampling and quantum estimation. Section 4 discusses the main loop of the Alphatron and the corresponding quantum run time. Section 5 summarizes the results in terms of all relevant parameters and discusses the regime where a quantum speedup is obtained. Finally, Section 6 considers the application to learning two-layer neural networks, where we give a neural network architecture that may achieve a quantum speedup.

Preliminaries and Alphatron algorithm

Notations
The vectors are denoted by bold-face x, and their elements by x_j. We leave in plain font the α vector (and all other vectors denoted with Greek symbols). The standard vector space of reals and the unit ball of dimension n are denoted by R^n and B^n, respectively. The ℓ_p-norm of vectors in R^n is denoted by ∥·∥_p. Moreover, the max norm is denoted by ∥x∥_max = max_i |x_i|. We use a • b to denote the standard inner product in R^n. We use the notation Õ(·) to omit any poly-logarithmic factors in the arguments. When we write g + Õ(...), we mean g + f for some f ∈ Õ(...). We use a := b to define a in terms of b. Given two random variables X and Y, we denote by E[Y|X] the conditional expectation value.

Cost of arithmetic operations
We use the following arithmetic model for classical computation. We represent the real numbers with a sufficiently large number of bits. We assume that the number of bits is large enough to make the numerical errors negligible in the correctness and run time proofs of the algorithms under consideration. The implication is that we can ignore numerical errors of arithmetic operations (e.g., addition, subtraction, multiplication, and so on) with respect to truncation or rounding. Hence, we assume all real numbers cost O(1) space and the basic arithmetic operations between them cost O(1) time. While the accumulated error can be important, a proper error analysis would require a substantial deviation from the main purpose of this paper.
For the quantum algorithms, we keep track of the number of (quantum) bits used for storing real numbers. We use a standard fixed-point encoding of real numbers.
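As an illustration (ours, not from the paper), a standard fixed-point encoding with c fractional bits has truncation error at most 2^{-(c+1)}, so the number of bits grows only logarithmically in the target accuracy:

```python
# Illustrative sketch of a fixed-point encoding of reals in [0, 1)
# with c fractional bits; function names are ours.

def fixed_point_encode(x: float, c: int) -> int:
    """Round x in [0, 1) to the nearest multiple of 2**-c, as an integer."""
    assert 0.0 <= x < 1.0
    return round(x * (1 << c))

def fixed_point_decode(bits: int, c: int) -> float:
    return bits / (1 << c)

# The rounding error is at most 2**-(c+1), so c grows only
# logarithmically in the desired accuracy.
x = 0.3141592
for c in (8, 16, 32):
    err = abs(fixed_point_decode(fixed_point_encode(x, c), c) - x)
    assert err <= 2 ** -(c + 1)
```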

The learning model
We consider the standard "probabilistic concept" (p-concept) learning model (Ref. [19]) in our paper. Let X be the input space and Y be the output space. A concept class C is a class of functions mapping the input space to the output space, i.e., C ⊆ Y^X. We define here weak learnability with a fixed lower bound for the error, in contrast to the standard definition of p-concept learnability for all ϵ_0 > 0.
Definition 2 (Weak p-concept learnable). For ϵ_0 > 0, a concept class C is "weak p-concept learnable up to ϵ_0" if there exists an algorithm A such that for every δ > 0, c ∈ C, and distribution D over X × Y with E_y[y|x] = c(x), we have that A, given access to samples drawn from D, outputs a hypothesis h : X → Y such that with probability at least 1 − δ,
ε(h) := E_x[(h(x) − E_y[y|x])²] ≤ ϵ_0.
The quantity ε(h) is the generalization error of hypothesis h. Moreover, if we have m samples (x_i, y_i) drawn from the distribution D, we define the empirical error of h as
ε̂(h) := (1/m) Σ_{i=1}^{m} (h(x_i) − E_y[y|x_i])².
For convenience, we also define another similar function as
err(h) := E_{x,y}[(h(x) − y)²].
Note that E_y[y|x] is independent of the choice of h. Hence, for hypotheses h_1 and h_2, we have err(h_1) − err(h_2) = ε(h_1) − ε(h_2). Thus, we may use err() instead of ε() for comparing hypotheses. Moreover, by using the empirical version of the err(h) function, even without knowing the probability distribution D, we are still able to evaluate the quality of the hypothesis h given m samples (x_i, y_i) ∼ D as
êrr(h) := (1/m) Σ_{i=1}^{m} (h(x_i) − y_i)².
By the Chernoff bound, we may bound the generalization error err() in terms of the empirical error êrr() with high probability.
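As a small illustration (ours, assuming the squared-loss definitions of this section), the empirical error êrr(h) is directly computable from samples, without knowledge of D:

```python
# Minimal sketch (assuming the squared-loss definitions of this section):
# the empirical error can be evaluated from samples alone.
import random

def emp_err(h, samples):
    """êrr(h) = (1/m) * sum_i (h(x_i) - y_i)^2 over m samples (x_i, y_i)."""
    return sum((h(x) - y) ** 2 for x, y in samples) / len(samples)

# Toy check: for h(x) = E[y|x] with y ~ Bernoulli(0.5), every term is
# (0.5 - y)^2 = 0.25, the conditional variance of y.
random.seed(0)
samples = [(x, float(random.random() < 0.5)) for x in range(1000)]
h = lambda x: 0.5  # the conditional mean E[y|x]
assert emp_err(h, samples) == 0.25
```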
To learn a good hypothesis, on the one hand, we prefer to assume a relatively simple concept class (e.g., a concept class consisting only of linear functions). Then it is easy to design an algorithm for finding the best hypothesis in that class. On the other hand, real-world data distributions are often complicated and cannot be captured by a hypothesis from a simple concept class. The kernel trick is widely used to turn a simple linear concept class and a given learning algorithm into a non-linear concept class and a corresponding learning algorithm, usually without changing the algorithm much. In the kernel method, we use a more general function to measure the similarity between two vectors instead of the linear inner product. The kernel function K : X × X → R is a (usually non-linear) similarity measure on the input space and it is defined via a feature map. Let V be a vector space equipped with an inner product ⟨·, ·⟩. The feature map ψ : X → V maps any input vector into this space (also called the feature space). For vectors x, y ∈ X, we define K(x, y) = ⟨ψ(x), ψ(y)⟩.
For our purpose, we use the multinomial kernel function to allow the learning of non-linear concepts. Consider formal polynomials over n variables of total degree d. There are n_d := Σ_{i=0}^{d} n^i monomials, which can be uniquely indexed by the tuples (j_1, …, j_i) ∈ [n]^i for 0 ≤ i ≤ d. We consider from now on the input space X = B^n ⊆ R^n and the feature space V = R^{n_d}. We consider the standard Euclidean metric on the feature space R^{n_d}. We define a normalized feature map ψ_d : B^n → R^{n_d} whose component indexed by the tuple (j_1, …, j_i) is proportional to the monomial x_{j_1} ⋯ x_{j_i}. When i = 0, we have the empty tuple and the corresponding component is the constant term. Note that for n ≥ 2, we have both the components x_1 x_2 and x_2 x_1. This redundancy can be avoided by the use of ordered multisets but will not influence our discussion. For x, y ∈ B^n, we can compute the inner product as
⟨ψ_d(x), ψ_d(y)⟩ = (1/(d+1)) Σ_{i=0}^{d} (x • y)^i.
Observe that by definition, for all x, y ∈ B^n, we have that ⟨ψ_d(x), ψ_d(y)⟩ ≤ 1. This can be seen from (1/(d+1)) Σ_{i=0}^{d} (x • y)^i ≤ (1/(d+1)) Σ_{i=0}^{d} 1 = 1, where we have used that max_{x∈B^n} ∥x∥_2² = 1 for the unit ball B^n. Throughout this paper, our definition of the normalized multinomial kernel function with degree d is
K_d(x, y) := (1/(d+1)) Σ_{i=0}^{d} (x • y)^i.
With these definitions, let us consider the following concept class.
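A short sketch (ours) of evaluating this kernel; the 1/(d+1) normalization is our reconstruction of the normalized kernel and should be checked against the original definition. The geometric-series form needs only O(log d) arithmetic operations once the inner product is known:

```python
# Sketch of the normalized multinomial kernel (the 1/(d+1) factor is our
# assumed normalization, chosen so that K_d(x, y) <= 1 on the unit ball).

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def multinomial_kernel(x, y, d):
    """K_d(x, y) = (1/(d+1)) * sum_{i=0}^{d} <x, y>^i."""
    r = inner(x, y)
    if r == 1.0:
        return 1.0
    # geometric series: sum_{i=0}^{d} r^i = (r^{d+1} - 1) / (r - 1),
    # computable with O(log d) multiplications via fast exponentiation
    return (r ** (d + 1) - 1) / (r - 1) / (d + 1)

x = [0.6, 0.8]          # unit-norm vectors in B^2
y = [0.8, 0.6]
K = multinomial_kernel(x, y, d=3)
assert 0.0 < K <= 1.0
assert abs(K - sum(inner(x, y) ** i for i in range(4)) / 4) < 1e-12
```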

Definition 3 (Concept class and distribution). Let K_d be the normalized multinomial kernel function corresponding to the feature map ψ_d. For B > 0 and an L-Lipschitz, non-decreasing function u : R → [0, 1], let the concept class be
C := { x ↦ u(⟨w, ψ_d(x)⟩) : w ∈ R^{n_d}, ∥w∥_2 ≤ B }.
Consider a distribution D on B^n × [0, 1] for which c ∈ C exists such that the distribution satisfies
E_y[y|x] = c(x) + ξ(x),
where E_y[y|x] is the conditional expectation value and ξ : B^n → [−ζ, ζ] is a noise function with E_x[ξ(x)²] ≤ ϵ.
The ϵ in this definition motivates Definition 2, as we will see that we cannot learn the concept class for all ϵ_0 > 0. The intrinsic error ϵ will define a lower bound for the error. The learning guarantee for the Alphatron and all algorithms in this work is proven for this concept class.

Alphatron algorithm
We review the classical Alphatron algorithm of [15]. Alphatron is the first provably efficient algorithm for learning neural networks with two nonlinear layers without further assumptions on the structure of the neural network. It can also be used for many problems like Boolean learning and multiple instance learning. The key idea is to combine isotonic regression with kernel methods, where isotonic regression is to find a non-decreasing function for predicting sequences of observations. More explicitly, isotonic regression is to find
arg min_{u non-decreasing} Σ_{i=1}^{m} (u(x_i) − y_i)²,
where the x_i ∈ R form a sequence of m data points. The setting considered for the Alphatron algorithm includes a kernel function in the problem, as we can see in the following.
The setting is as follows. We split the data into two parts, a training set and a validation set. The training data set contains the samples (x_i, y_i)_{i=1}^{m_1}, where m_1 is the size of the training data. The validation data set contains the samples (a_i, b_i)_{i=1}^{m_2}. Let m := m_1 + m_2 be the total size of the data set. Then, since m_1, m_2 ∈ O(m), we can use O(m) as an upper bound on the size of the data. In the Alphatron algorithm, we first build several hypotheses from the training set. Then we use the validation set to evaluate each hypothesis and select the optimal one.

Algorithm 1: Alphatron
1 Input Training set (x_i, y_i)_{i=1}^{m_1}, validation set (a_i, b_i)_{i=1}^{m_2}, function u : R → [0, 1], number of iterations T, degree of the multinomial kernel d, learning rate λ
2 α^0 := 0 ∈ R^{m_1}
3 for t = 0 to T − 1 do
4   h_t(x) := u(Σ_{j=1}^{m_1} α_j^t K_d(x_j, x))
5   for i = 1 to m_1 do
6     α_i^{t+1} := α_i^t + (λ/m_1)(y_i − h_t(x_i))
7 Output α^{t_out}, where t_out := arg min_{t∈{0,…,T}} (1/m_2) Σ_{i=1}^{m_2} (h_t(a_i) − b_i)²

The algorithm has a number of iterations T, which will be related to the other input quantities via the learning guarantees. In each iteration, the algorithm generates a new vector α^{t+1} and a new hypothesis h_{t+1} from the old ones. The output of the algorithm is a vector α^{t_out} ∈ R^{m_1} describing the hypothesis h_{t_out} : B^n → [0, 1], which has some p-concept error according to the input data. First, we are interested in the general run time complexity, before we discuss the guarantees for the weak p-concept learning of the specific concept class.
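The main loop can be made concrete with a short sketch; this is our minimal reimplementation of the update rule of Ref. [15] with a placeholder kernel and u, not the paper's own code:

```python
# Runnable sketch of the Alphatron main loop; the update
# alpha_i += (lambda/m1) * (y_i - h_t(x_i)) follows Goel and Klivans [15].
# The kernel K and the non-decreasing function u below are placeholders.

def alphatron(train, val, K, u, T, lam):
    xs, ys = zip(*train)
    m1 = len(xs)
    h = lambda a, z: u(sum(aj * K(xj, z) for aj, xj in zip(a, xs)))
    alpha, history = [0.0] * m1, []
    for t in range(T + 1):
        history.append(list(alpha))
        preds = [h(alpha, x) for x in xs]          # h_t on the training set
        alpha = [aj + (lam / m1) * (y - p)         # the alpha update
                 for aj, y, p in zip(alpha, ys, preds)]
    # select t_out minimizing the empirical validation error
    val_err = lambda a: sum((h(a, x) - y) ** 2 for x, y in val) / len(val)
    return min(history, key=val_err)

# Toy usage: learn y = u(<w, x>) with a linear kernel and a clipped-linear u.
u = lambda z: min(max(z, 0.0), 1.0)                 # 1-Lipschitz, non-decreasing
K = lambda x, y: sum(a * b for a, b in zip(x, y))   # degree-1 (linear) kernel
data = [((0.9, 0.1), 0.9), ((0.1, 0.9), 0.1), ((0.5, 0.5), 0.5)]
alpha = alphatron(data, data, K, u, T=200, lam=1.0)
h_out = lambda z: u(sum(aj * K(xj, z) for aj, (xj, _) in zip(alpha, data)))
assert abs(h_out((0.9, 0.1)) - 0.9) < 0.01
```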

Theorem 1 (Run time of Alphatron). Algorithm 1 has a run time of O(Tm²(n + log d)).
Proof. For computing the multinomial kernel function K_d(x, y), we first need to compute the inner product ⟨x, y⟩ in O(n) time trivially. For all r ∈ R∖{1}, since Σ_{0≤i≤d} r^i = (r^{d+1} − 1)/(r − 1), it costs O(log d) time to compute the multinomial kernel function from the inner product r. Thus, by the definition of h_t(x) in line 4 of the algorithm, h_t(x) is computed in O(m(n + log d)) time for a given x. In the first part, the training phase, line 6 is executed O(Tm) times. Hence, this part costs O(Tm²(n + log d)). Similarly, in the second part, the validation phase (line 7), a number O(Tm) of calls to the function h_t(a) is used. Hence, the algorithm costs O(Tm²(n + log d)) in total.
For obtaining a learning guarantee in the setting from Definition 3, it is supposed that the following relations between the parameters hold.

Definition 4 (Parameter definitions and relations). Consider the setting in Definition 3, which defines the distribution D and the parameters (B, L, ζ, ϵ), and consider Algorithm 1, which uses the parameters (m_1, m_2, T, λ). Define the following additional parameters and fix the following relationships between the parameters.

1. Equate the L-Lipschitz non-decreasing function from the concept class with the function u used in the algorithm.

2. Let the training set (x_i, y_i)_{i=1}^{m_1} be sampled iid from D.

3. Let C > 0 be a large enough constant and set T = CBL √(m_1 / log(1/δ)).

4. Let C′ > 0 be a large enough constant and set m_2 = C′ m_1 log(T/δ), and let the validation set (a_i, b_i)_{i=1}^{m_2} be sampled iid from D.

5. Define the error quantity A_2, which combines the intrinsic-noise term L√ϵ with terms that vanish as m_1 grows, and let C′′ > 0 be a large enough constant.
The following learning guarantee was proven in Ref. [15].

Theorem 2 (Learning guarantee of Alphatron, same as Ref. [15]). Given the learning setting in Definition 3 and the parameters defined in Definition 4, Algorithm 1 outputs α^{t_out} which describes the hypothesis h_{t_out}(x) := u(Σ_{j=1}^{m_1} α_j^{t_out} K_d(x_j, x)) such that, with probability at least 1 − δ, ε(h_{t_out}) ≤ C′′ A_2.

Next, we discuss a regime where Theorem 2 achieves weak p-concept learnability. This result was implicit in Ref. [15].

Proof. Recall that weak p-concept learnability up to ϵ_0 means that ε(h) ≤ ϵ_0 with probability 1 − δ. Hence, we desire that C′′A_2 ≤ ϵ_0. Note the trivial case when ϵ_0 < C′′L√ϵ, which means that the intrinsic error of the concept class is too large and we fail to achieve learnability. Hence, we can only prove the case when ϵ_0 ≥ C′′L√ϵ, and we prove the theorem only for ϵ_0 = 2C′′L√ϵ, where we use a factor 2 to leave room for the other terms in A_2. It follows that we would like to set m_1 such that the remaining terms of A_2 are sufficiently small. Here, Eq. (7) can be achieved by making each term smaller than ϵ_0/4 (alternatively, we could solve a quadratic equation, which leads to more complicated expressions). This means that both of the corresponding conditions on m_1 have to be true. Hence, we take m_1 greater than the maximum of the right-hand side expressions. Employing the lower bound 2C′′L√ϵ ≤ ϵ_0, we find that m_1 ≥ max{m_1′, m_1″} leads to weak p-concept learnability up to 2C′′L√ϵ.
Pre-computation and approximation of the kernel matrix

One bottleneck of the Alphatron algorithm is the repeated inner product computation when evaluating the function h_t(x). In Algorithm 1, at every step t out of T steps, we need to evaluate O(m²) inner products for the kernel function. This evaluation is redundant because the inner products do not change for different t. A simple pre-computing idea helps to reduce the time complexity to some extent. We improve Algorithm 1 as follows. Given input data (x_i, y_i)_{i=1}^{m_1} and (a_i, b_i)_{i=1}^{m_2} and d, we define the two matrices K with entries K_{ij} = K_d(x_i, x_j) and K′ with entries K′_{ij} = K_d(a_i, x_j). If these two matrices are given by an oracle, then we are able to rewrite Algorithm 1 as below. Algorithm 2 will be used as a subroutine several times in the remainder of this paper.

Algorithm 2: Alphatron_with_Kernel
1 Input Function u : R → [0, 1], number of iterations T, learning rate λ, query access to the kernel matrices K and K′

With equivalent input, Algorithm 2 produces the same output as Algorithm 1, which can be easily checked as follows. Fix the input for Algorithm 1. From these fixed training examples, compute the kernel matrices K and K′. Note that even if we do not explicitly define the hypothesis h_t in Algorithm 2, in the analysis we still use the same notation h_t for the t-th generated hypothesis as in Algorithm 1.

Theorem 4 (Alphatron_with_Kernel). Algorithm 2 runs in time O(Tm²).

Proof. Since each entry of the matrices K and K′ is accessible in O(1) time, the run time of the algorithm is O(Tm²).
We now discuss the pre-computation, i.e., we prepare the matrices K and K′ by evaluating the kernel function for the training and validation data. We present the following algorithm, which performs the pre-computation and then runs Algorithm 2. On the same input, this algorithm produces exactly the same output as Algorithm 1.
Algorithm 3: Alphatron_with_Pre
1 Input Training set (x_i, y_i)_{i=1}^{m_1}, validation set (a_i, b_i)_{i=1}^{m_2}, function u : R → [0, 1], number of iterations T, degree of the multinomial kernel d, learning rate λ
2 Compute K_{ij} = K_d(x_i, x_j) for all i, j ∈ [m_1]
3 Compute K′_{ij} = K_d(a_i, x_j) for all i ∈ [m_2] and j ∈ [m_1]
4 Run Algorithm 2 (Alphatron_with_Kernel) with all inputs as above and K_{ij} and K′_{ij}
5 Output α^{t_out}

Theorem 5 (Alphatron_with_Pre). Algorithm 3 generates the same output as Algorithm 1, and runs in time O(m²(n + log d) + Tm²).

Proof. First of all, it is straightforward to see that Algorithm 3 behaves in the same way as Algorithm 1: by the definition of h_t, the generated hypotheses are exactly the same.

For the time complexity, we have O(m²) inner products to be evaluated. For each of them we need time O(n + log d), as we showed in the proof of Theorem 1. Hence it costs O(m²(n + log d)) time to pre-compute the results of all K_d(x, y). By Theorem 4, the time complexity of Alphatron_with_Kernel is O(Tm²). In total, the algorithm runs in time O(m²(n + log d) + Tm²). By the pre-computation, we evaluate each kernel function only once, with the corresponding memory cost of storing the values. Compared with the O(Tm²(n + log d)) time used for Algorithm 1, Algorithm 3 achieves a significant speedup.

Classical inner-product estimation
We can hope to obtain a further speedup by approximating the inner products instead of computing them exactly. Next, we discuss this inner product approximation, where the approximations rely on sampling data structures, which are discussed in Appendix C. These data structures, when given a vector, allow one to sample an index with probability proportional to the components of the vector, as described in Facts 1 and 2. We call them ℓ_1 and ℓ_2 sampling data structures. Here, we use the ℓ_2 case (Fact 2), while the second part of this work uses the ℓ_1 case. Based on these data structures, elementary results can be provided to estimate inner products between two vectors. These are described in Lemmas 9 and 10 in Appendix C, of which we need Lemma 10 here. Our version of the Alphatron algorithm with approximate pre-computation is given in Algorithm 4. We use the inner product estimation of Lemma 10 to improve the run time complexity of Algorithm 3.
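Lemmas 9 and 10 are relegated to Appendix C and not reproduced here; the following is our sketch of the standard ℓ_2-sampling estimator of this kind, with sample counts chosen via a Chebyshev plus median-of-means argument (the function names and constants are ours):

```python
# Sketch of an l2-sampling inner product estimator: sample index k with
# probability x_k^2 / ||x||_2^2, use the unbiased single-sample estimate
# y_k * ||x||_2^2 / x_k, and boost accuracy by median-of-means.
import math, random, statistics

def l2_inner_estimate(x, y, eps, delta):
    norm2 = sum(v * v for v in x)
    weights = [v * v / norm2 for v in x]
    def one_sample():
        k = random.choices(range(len(x)), weights=weights)[0]
        return y[k] * norm2 / x[k]            # expectation equals <x, y>
    n_mean = max(1, math.ceil(6.0 / eps**2))  # variance <= ||x||^2 ||y||^2
    n_med  = max(1, math.ceil(9 * math.log(1 / delta)))
    means = [statistics.fmean(one_sample() for _ in range(n_mean))
             for _ in range(n_med)]
    return statistics.median(means)           # median-of-means boost

random.seed(1)
x = [0.6, 0.8]                                # unit vectors, <x, y> = 0.96
y = [0.8, 0.6]
est = l2_inner_estimate(x, y, eps=0.05, delta=0.01)
assert abs(est - 0.96) < 0.05
```

Note that only the sampling structure for x is needed, while y is accessed entry-wise, matching the asymmetric access pattern of such estimators.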

Algorithm 4: Alphatron_with_Approx_Pre
1 Input Training set (x_i, y_i)_{i=1}^{m_1}, validation set (a_i, b_i)_{i=1}^{m_2}, error tolerance parameter ϵ_K, failure probability δ_K, function u : R → [0, 1], number of iterations T, degree of the multinomial kernel d, learning rate λ
2 for i ← 1 to m_1 do
3   Prepare sampling data structure for x_i according to Fact 2
4 for i ← 1 to m_2 do
5   Prepare sampling data structure for a_i according to Fact 2
6 Estimate the kernel matrices via Lemma 10 and run Algorithm 2 with all inputs as above

Theorem 6 (Alphatron_with_Approx_Pre). Algorithm 4 runs in time Õ(mn + m²d²/ϵ_K² + Tm²) and provides the kernel matrices K̄ and K̄′ such that, with probability 1 − δ_K, all entries are ϵ_K-close to the corresponding entries of K and K′.

Proof. For all vectors x_i and a_j, the sampling data structures are prepared in total time O(mn). There are O(m²) inner products to be estimated between these vectors. Hence, by Lemma 10, each estimation of an inner product with additive accuracy ϵ_K/(3d) and success probability 1 − δ_K/O(m²) costs Õ(d²/ϵ_K²) time, where we ignore the logarithmic factor under the tilde notation compared to m². Again, O(log d) extra time is needed to compute each multinomial kernel function K_d from the inner product. However, we also ignore the log d factor under the tilde notation. By Lemma 3 of the Appendix, the Lipschitz constant of the multinomial kernel as a function of the inner product justifies the accuracy ϵ_K/(3d). The last step of calling Algorithm 2 costs O(Tm²) again, as the matrices are accessible in O(1).
Since only m sampling data structures are prepared, which allow the inner products to be approximated in advance, Algorithm 4 improves the run time complexity of Algorithm 3. However, as the inner products are approximated, we may lose the correctness of Algorithm 3. In Ref. [15], a theoretical upper bound is proven for the sample complexity of Algorithm 1 in the problem setting of Definition 3. We now show that with approximate pre-computation, under the same problem and parameter settings as in Ref. [15], the p-concept error of the output hypothesis does not increase too much.
Theorem 7 (Correctness of Alphatron_with_Approx_Pre). If Definitions 3 and 4 hold, then by setting δ_K = δ, with probability 1 − 3δ, Algorithm 4 outputs α^{t_out} which describes the hypothesis h_{t_out}(x) := u(Σ_{j=1}^{m_1} α_j^{t_out} K̄_d(x_j, x)) such that ε(h_{t_out}) ≤ C′′A_2, where A_2 is the error quantity of Definition 4. We prove this theorem in Appendix B.

Quantum Pre-computation
In the previous subsection, we classically estimate the inner products that constitute the kernel matrices. Now, given quantum access to the training data, we replace this estimation with a quantum subroutine and obtain a quantum speedup. This section presents the quantum algorithm for pre-computing the kernel matrices used in the Alphatron. We assume quantum access to the training data, which includes classical access and also superposition queries to the data.
Definition 5 (Quantum query access). Let c and n be two positive integers and let u ∈ ({0, 1}^c)^n be a vector of bit strings. Define element-wise quantum access to u for j ∈ [n] by the operation |j⟩|z⟩ ↦ |j⟩|z ⊕ u_j⟩ on O(c + log n) qubits. We denote this access by QA(u, n, c).
Data Input 1. For all i ∈ [m_1] and j ∈ [m_2], let x_i and a_j be the input vectors with ∥x_i∥_2 = ∥a_j∥_2 = 1, and let x_ik and a_jk be the entries of the vectors. Assume c = O(1) bits are sufficient to store x_ik and a_jk. Assume that we are given QA(x_i, n, c) and QA(a_j, n, c).

Our first quantum algorithm is constructed in a straightforward manner. We replace the classical approximation of the kernel matrix inner products with a quantum estimation. For the quantum estimation of inner products, refer to Lemma 12 in Appendix D, which requires quantum query access similar to Data Input 1. The run time of Lemma 12 depends on the ℓ_2-norms of the input vectors, which here are 1. The result is Algorithm 5. The run time analysis and the guarantees for the output hypothesis are similar to the classical algorithm. We state them below as a corollary.

Algorithm 5: Alphatron_with_Q_Pre

Corollary 1 (Run time of Alphatron_with_Q_Pre). Assume that for all i ∈ [m_1] and j ∈ [m_2], we have quantum query access to the vectors x_i and a_j via Data Input 1. Lines 2−9 of Algorithm 5 have a run time of Õ(m²√n d/ϵ_K) and provide the kernel matrices K̄ and K̄′ such that all entries are ϵ_K-close to the corresponding entries of K and K′ with probability at least 1 − δ_K.

Proof. For ϵ_K ∈ (0, 1), the run time of each invocation of Lemma 12 is Õ(√n d/ϵ_K), using that the input vectors are in the unit ball. All probabilistic steps in Lines 2−9 of the algorithm succeed with probability 1 − δ_K using a union bound.
Corollary 2 (Guarantee for Alphatron_with_Q_Pre). Let δ > 0. Assume that for all i ∈ [m_1] and j ∈ [m_2], we have quantum query access to the vectors x_i and a_j via Data Input 1. Let Definitions 3 and 4 hold. If A_2 ≤ 1, and we set ϵ_K = L√ϵ/T and δ_K = δ, then Algorithm 5, with probability at least 1 − 3δ, outputs α^{t_out} which describes the hypothesis h_{t_out}(x) := u(Σ_{j=1}^{m_1} α_j^{t_out} K̄_d(x_j, x)) such that ε(h_{t_out}) ≤ O(A_2), with the run time of Corollary 1 for this choice of ϵ_K.

Proof. The proof is analogous to the proof of Theorem 7, where we use Corollary 1 for the run time of the inner product estimation.

Quantum Alphatron
Up to this point, we have been discussing improvements in the pre-computation step of the Alphatron. We always use the same Alphatron_with_Kernel algorithm once we prepare the kernel matrices K and K′. If the data dimension n is much larger than the other parameters, the quantum pre-computation costs asymptotically more time than Alphatron_with_Kernel. Hence, we do not benefit much from optimizing Alphatron_with_Kernel if the cost of preparing the data of size n is taken into account. However, if we assume that the pre-computation was already done for us, it makes sense to discuss quantum speedups for Alphatron_with_Kernel, which is what the remainder of this work is about. In other words, we consider the following scenario.

Data Input 2. Let there be given the two data sets (x_i, y_i)_{i=1}^{m_1} and (a_i, b_i)_{i=1}^{m_2} and the corresponding kernel matrices K and K′. Let each entry K_ji and K′_ji be specified by O(1) bits. We assume that we have query access to each entry in O(1) time.
The bottleneck of the computation in Alphatron_with_Kernel is the cost of the about O(Tm) inner product evaluations. By sampling techniques and quantum estimation, we may speed them up.

Main loop with approximated inner products
We employ the classical sampling of inner products in the Alphatron_with_Kernel algorithm. The result is Algorithm 6. For the kernel matrices K_ji and K′_ji, define K_max as an upper bound for |K_ji| and |K′_ji|.

Algorithm 6: Alphatron_with_Kernel_and_Sampling
1 Input Error parameter ϵ_I and failure probability δ, function u : R → [0, 1], number of iterations T, degree of the multinomial kernel d, learning rate λ, query access to K_ji and K′_ji, the upper bound K_max for both
2 In each iteration t, prepare a sampling data structure for α^t via Fact 1 and estimate r_j^t ≈ α^t • K_j and s_j^t ≈ α^t • K′_j via Lemma 9

Theorem 8. We assume query access via Data Input 2 to the kernel matrices K and K′ with known K_max. Let ϵ_I, δ ∈ (0, 1). If Definitions 3 and 4 hold, Algorithm 6 outputs α^{t_out} which describes the hypothesis h_{t_out} such that ε(h_{t_out}) ≤ O(A_2 + Lϵ_I), where A_2 is defined in Definition 4. The run time of this algorithm is Õ(Tm + T³mK_max²/(L²ϵ_I²)). Moreover, if A_2 ≤ 1 and we set ϵ_I = √ϵ, then we obtain the guarantee ε(h_{t_out}) ≤ O(A_2) and have a run time of Õ(Tm + T³mK_max²/(L²ϵ)).

Proof. By the definition of r_j^t, we have |r_j^t − α^t • K_j| ≤ ϵ_I, and by the definition of s_j^t, we have |s_j^t − α^t • K′_j| ≤ ϵ_I. By Definition 10 and by the Lipschitz condition of u, we obtain that |u(r_j^t) − g(α^t, K, j)| ≤ Lϵ_I and |u(s_j^t) − g(α^t, K′, j)| ≤ Lϵ_I. Consider the cases in Eqns. (61) and (62) in the proof of Theorem 7 for the sequence of α^t generated by Algorithm 6. Similarly, there exists t* such that Case 2 holds. Then, by Lemma 7 with ω = α^{t*} and the known bound for η, we obtain a bound on the empirical error of the corresponding hypothesis. Again, by the Rademacher analysis in the proof of Theorem 7, this bound transfers to the generalization error. By Lemma 8 with Γ_j^t = u(s_j^t), at Line 13 in Algorithm 6, we obtain h_t such that the validation step selects a good hypothesis. As in the proof of Theorem 7, by the Chernoff bound and the choice of m_2, and by relating the above inequalities (15), (16), and (17), we obtain the stated guarantee for the output hypothesis h_{t_out}.

For the run time complexity, the total time for preparing the sampling data structures for α^t is O(Tm), because we prepare O(T) such structures and preparing each of them costs O(m). By Lemma 5, we have the upper bound max_t ∥α^t∥_1 ≤ T/L. Hence, the run time of each estimation r_j^t and s_j^t via Lemma 9 is bounded by Õ(T²K_max²/(L²ϵ_I²)), and we need to estimate O(Tm) inner products. Setting ϵ_I = √ϵ yields the run time stated in the theorem.
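Lemma 9 (the ℓ_1 case) is stated in Appendix C and not reproduced here; as an illustration, the following is our sketch of the natural ℓ_1-sampling estimator whose cost matches the stated dependence on ∥α∥_1 K_max / ϵ_I. All names and the Hoeffding-based sample count are ours:

```python
# Sketch of an l1-sampling estimator for alpha . K_j: sample index i with
# probability |alpha_i| / ||alpha||_1 and average the unbiased estimate
# sign(alpha_i) * ||alpha||_1 * K[j][i]; the single-sample range is
# [-||alpha||_1 * K_max, +||alpha||_1 * K_max], so Hoeffding gives the count.
import math, random, statistics

def l1_dot_estimate(alpha, col, eps, delta, k_max):
    l1 = sum(abs(a) for a in alpha)
    probs = [abs(a) / l1 for a in alpha]
    def one():
        i = random.choices(range(len(alpha)), weights=probs)[0]
        return math.copysign(l1, alpha[i]) * col[i]   # E[.] = alpha . col
    n = math.ceil(2 * (l1 * k_max / eps) ** 2 * math.log(2 / delta))
    return statistics.fmean(one() for _ in range(n))

random.seed(2)
alpha = [0.5, -0.25, 0.25]
col = [0.2, 0.4, 0.8]                                 # one kernel column K_{j,.}
true = sum(a * c for a, c in zip(alpha, col))         # exact value: 0.2
est = l1_dot_estimate(alpha, col, eps=0.05, delta=0.01, k_max=1.0)
assert abs(est - true) < 0.05
```

The quadratic dependence on ∥α∥_1 K_max / ϵ here is what the quantum estimation of the next subsection improves to a linear dependence.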
Now, we replace the classical sampling of the inner product with the quantum estimation of the inner product. With Lemma 13 in Appendix D, we can remove the explicit dimension dependence of the inner product estimation, at the expense of using a QRAM, see Definition 7 in the next section.

Quantum speedup for the main loop
For the quantum algorithm, we assume quantum query access to the kernel matrices K and K′. Recall the definition of quantum query access (Definition 5).

Data Input 3. Assume Data Input 2 for the training data and the kernel matrices. For all j ∈ [m_1], define K_j as the j-th row vector of K, and for all j ∈ [m_2], define K′_j as the j-th row vector of K′. Assume the availability of the quantum access QA(K_j, m_1, O(1)) for all j ∈ [m_1], and the quantum access QA(K′_j, m_1, O(1)) for all j ∈ [m_2].

Based on this input, a simple circuit prepares query access to the non-negative versions of the vectors.

Lemma 1. Assume Data Input 3 and define the non-negative vectors (K_j)_+ and (K_j)_−, with K_j = (K_j)_+ − (K_j)_−, and the non-negative vectors (K′_j)_+ and (K′_j)_−, with K′_j = (K′_j)_+ − (K′_j)_−. Quantum query access to these vectors for all j can be provided with two queries to the respective inputs and a constant-depth circuit of quantum gates.
For our quantum version of the main loop of the Alphatron algorithm, we will also require a dynamic quantum data structure for the α vector which allows us to obtain efficient quantum sample access.

Definition 6 (Quantum sample access). Let c_1, c_2, and n be positive integers and let v′ ∈ ({0, 1}^{c_1})^n and v″ ∈ ({0, 1}^{c_2})^n be vectors of bit strings. Define quantum sample access to a vector v via the operation |0⟩ ↦ Σ_{i∈[n]} √(v_i/∥v∥_1) |i⟩ on O(log n) qubits. We denote this access by QS(v, n, c_1, c_2). For the sample access to a vector v which approximates a vector with components in [0, 1], we use a shorthand notation omitting the bit-size parameters.

One way to obtain such an access is via quantum random access memory (QRAM) [14,13,3]. Such a device stores the data in (classical) memory cells, but allows for superposition queries to the data. If all the partial sums are also stored, then QRAM can provide quantum sample access via the Grover-Rudolph procedure, see Ref. [16]. This costs resources proportional to the length of the vector to set up, but can then provide the superposition state in a run time logarithmic in the length of the vector. Based on Data Input 3 and Definition 7, Lemma 13 in Appendix D allows us to estimate the inner products between α^t and K_j more efficiently than the equivalent estimation in Algorithm 6. We have the following algorithm. Define the non-negative vectors (K_j)_+ and (K_j)_−, with K_j = (K_j)_+ − (K_j)_−.

6
From query access to K j provide query access to From query access to K j provide query access to Store in QRAM (see Definition 7) the non-negative vectors (α t ) + , (α t ) − , where α t = (α t ) + − (α t ) − , where each element of the vector is stored using , and (α t ) − • (K ′ j ) − via Statement (iii) of Lemma 13 (using w t and q max j ), each to additive accuracy ϵ I /8 with success probability 1 − δ I /(16T m 2 ) Theorem 9 (Quantum Alphatron).We assume quantum query access to the vectors K j and K ′ j via Data Input 3. Again, let K max be maximum of all entries in K and K ′ .Let δ ∈ (0, 1).Given Definitions 3 and 4 and δ I = δ, Algorithm 7 outputs α tout such that the hypothesis where A 2 is defined in Theorem 7. The run time of this algorithm is If A 2 ≤ 1 and we set ϵ I = √ ϵ then we further obtain the guarantee and the run time is Proof.First, consider the numerical error from truncating the α vectors.Recall that in the classical steps of the algorithm we work in the arithmetic model where all the steps occur at infinite precision.Let α t ∈ R m 1 be the vector given to infinite precision (arithmetic model) with known 0 < α max ≤ λT /m 1 (Lemma 5).Set c 1 ≥ ⌈log(λT /m 1 )⌉ and c 2 ≥ log 2Kmaxm 1 ϵ I . Let α ∈ {0, 1} c 1 +c 2 be the element-wise c 1 + c 2 bit approximation of α t (stored in QRAM).Note that Aside from the estimation of the inner products, the remaining part of Algorithm 7 is the same as Algorithm 6.Compared to Algorithm 6, we change the accuracy of the inner product estimation to ϵ I /2, hence we achieve that with the stated success probabilities.Using Eq. 
(21) for the numerical error, we obtain the overall estimates with the same success probabilities. Hence, the same accuracy guarantees hold as in the proof of Theorem 8, including the bound for the output hypothesis h_{t_out}. For the run time complexity, there are three terms. From Line 2 to Line 10, we perform O(m_1 + m_2) = O(m) quantum maximum findings. The run time of a single run of quantum maximum finding is bounded by O(√m log(1/δ)) [11]. Hence, this part of the algorithm takes O(m^{1.5} log(1/δ)) time. In Line 12, the time for storing all α_t in QRAM is O(Tm), because we have O(T) vectors and storing each of them costs O(m) time. For each step t, the run time of the estimations r_j^t and s_j^t depends on the norm ∥α_t∥_1 ≤ T/L. Hence, the run time of each estimation r_j^t and s_j^t via (iii) of Lemma 13 is bounded by O((T K_max)/(L ϵ_I) · log(1/δ)), and we need to estimate O(Tm) inner products. These three terms together give the overall run time. If we set ϵ_I = √ϵ, we obtain ε(h) ≤ O(A_2) and the corresponding run time.

Discussion
In this section, we summarize the results from the previous sections and discuss the improvement in run time complexity obtained by pre-computation and quantum estimation. In Section 3, we introduced Alphatron_with_Pre, Alphatron_with_Approx_Pre, and Alphatron_with_Q_Pre, which improve the original Alphatron. The scenario is that the dimension of the data n is much larger than the other parameters, a situation relevant for many practical applications. Without the pre-computation, the run time is O(T m² n), compared to a run time with pre-computation of O(m² n + m² log d + T m²). The n-dependent term thus loses a factor of T, a modest improvement. Moreover, by quantum amplitude estimation, we gain a quadratic speedup in the dimension n. We list the results of Section 3 in Table 1 for comparison.
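The factor-of-T saving can be made concrete: compute the Gram matrix of the multinomial kernel once, then reuse it in every iteration of the main loop. The sketch below assumes the multinomial kernel has the form K_d(x, y) = Σ_{j=0}^{d} (x · y)^j, as in Goel and Klivans [15]; the function names are ours.

```python
import numpy as np

def multinomial_kernel_gram(X, Y, d):
    """Gram matrix for the degree-d multinomial kernel, assumed here to be
    K_d(x, y) = sum_{j=0}^{d} (x . y)^j as in Goel and Klivans [15]."""
    G = X @ Y.T                          # all pairwise inner products: O(m^2 n) work
    return sum(G ** j for j in range(d + 1))

rng = np.random.default_rng(0)
m, n, d = 8, 1000, 3
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # put rows on the unit sphere of B_n
K = multinomial_kernel_gram(X, X, d)            # pre-compute once ...
# ... then each of the T main-loop iterations reuses K at a cost
# independent of n, which is where the factor-of-T saving comes from.
```

Without the pre-computation, each of the T iterations would redo the O(m² n) inner-product work, giving the O(T m² n) bound quoted above.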

Name Pre-computation Main loop Proved in
Table 1: Comparison of the first set of algorithms in Section 3. We separate the pre-computation of the multinomial kernel function from the main loop and also estimate the training-set inner products instead of computing them exactly, which can improve the time complexity of the computation of the kernel function.
For all algorithms, we indicate the general result without using the learning setting in Definition 3. For Alphatron_with_Approx_Pre and Alphatron_with_Q_Pre, the relevant kernel functions are estimated to accuracy ϵ_K with failure probability δ_K. To obtain the weak p-concept learning result of Theorem 3 for all these algorithms, take the concept class defined in Definition 3 and the parameter settings for the algorithms of Definition 4. Also, set ϵ_K = L√ϵ/T and δ_K = δ. We do not further evaluate the formulas (using, e.g., the expressions for T and m_1), as the main focus of this table is on the dependency on n, which dominates all other parameters. Section 4 introduced Alphatron_with_Kernel_and_Sampling and Quantum_Alphatron. In this scenario, we assume constant-time query access to the kernel matrices (i.e., to the result of the pre-computation), while the quantum version requires quantum query access. These algorithms only concern the main-loop part of the Alphatron. Hence, they can be viewed as improvements of Alphatron_with_Kernel. We list the results of Section 4 in Table 2. For comparison, we also list the time complexity of Alphatron_with_Kernel.
For Table 2, it is not obvious that the quantum algorithm has a speedup compared to Alphatron_with_Kernel. As mentioned in the preliminaries, K_d(x, y) ≤ 1 for all x, y ∈ B_n; hence we can use K_max ≤ 1. From Theorem 9, we then obtain the quantum run time, and for the classical run time we simplify the bound of Theorem 8 accordingly. This analysis of the two cases is summarized in Table 3 and allows us to state our final theorem.
Theorem 10 (Quantum p-concept learnability via the Quantum_Alphatron). Let the concept class and distribution be defined by Definition 3. For this concept class, let 4ζ⁴ > B²ϵ. A note on the condition 4ζ⁴ > B²ϵ for the speedup: by Definition 3, ζ determines the range of the noise function, while ϵ is an upper bound on the variance of the noise function. For any function, the variance will be smaller than or equal to the range. Hence, the condition 4ζ⁴ > B²ϵ is reasonably easy to satisfy, and we may obtain a quantum advantage for a broad concept class of functions. In Appendix E, we combine the algorithms for the kernel matrix estimation and the inner-loop estimations, both for the classical and quantum cases.
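The speedup condition is a single inequality in the parameters of Definition 3, so checking it for a given concept class is a one-liner. The sketch below is a direct transcription of the condition from Theorem 10; the function name is ours.

```python
def quantum_speedup_possible(zeta, B, eps):
    """Direct transcription of the condition 4*zeta^4 > B^2 * eps from
    Theorem 10; parameter names follow Definition 3."""
    return 4.0 * zeta ** 4 > B ** 2 * eps

# A small noise-variance bound eps relative to the noise range zeta
# satisfies the condition easily:
ok = quantum_speedup_possible(zeta=0.5, B=2.0, eps=0.01)
not_ok = quantum_speedup_possible(zeta=0.1, B=10.0, eps=0.1)
```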

Applications for learning two-layer neural networks
In this section, we describe how to use the algorithms described above to learn neural networks with two nonlinear layers in the p-concept model. Following previous works [15, 27], first consider a one-layer neural network with k units, where x ∈ R^n, b ∈ B_k, a_i ∈ B_n for i ∈ {1, ..., k}, and σ: R → R is the activation function.
In the following, we consider the sigmoid function σ(x) = 1/(1 + e^{−x}) as the activation function. Subsequently, we define a neural network with two nonlinear layers with one unit in the second layer, where σ′: R → R is an L-Lipschitz non-decreasing function. Ref. [15] proved the following lemma, which states that such neural networks are efficiently p-concept learnable.
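The architecture just described can be sketched in a few lines: k sigmoid units feed a single L-Lipschitz output unit. The names A (stacking the rows a_i) and b are ours, not the paper's; we also reuse the sigmoid as the outer function σ′, which is one admissible choice since it is Lipschitz and non-decreasing.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_net(x, A, b, sigma_prime):
    """f(x) = sigma'( b . sigma(A x) ): k sigmoid units in the first layer
    feeding one L-Lipschitz non-decreasing unit in the second layer."""
    hidden = sigmoid(A @ x)          # first nonlinear layer, k units
    return sigma_prime(b @ hidden)   # single output unit

rng = np.random.default_rng(1)
k, n = 4, 10
A = rng.normal(size=(k, n)); A /= np.linalg.norm(A, axis=1, keepdims=True)  # rows a_i in B_n
b = rng.normal(size=k); b /= np.linalg.norm(b)                              # b in B_k
x = rng.normal(size=n); x /= np.linalg.norm(x)
y = two_layer_net(x, A, b, sigma_prime=sigmoid)   # output lies in (0, 1)
```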

] is a known L-Lipschitz non-decreasing function and σ is the sigmoid function.
There exists an algorithm that outputs a hypothesis h such that, with probability 1 − δ, the stated error bound holds, with a sample complexity scaling with log²(1/δ). The algorithm runs in time polynomial in m and n.
The statement of the previous work mentions the training sample complexity m_1 instead of the total sample complexity m, where the latter depends on log²(1/δ). Here, we focus on the total sample complexity.
We assume corresponding query access to the kernel matrices, both for the classical and the quantum scenario. We focus on the comparison of the time complexity of Alphatron_with_Kernel and Quantum_Alphatron. The sigmoid function can be uniformly approximated by low-degree polynomials. By Lemmas 8 and 12 in Ref. [15], there exists a constant C_sig and a v ∈ H_d with bounded norm ∥v∥ such that the approximation holds for every x ∈ B_n; hence, in this case the relevant parameters follow. Based on the discussion in Section 5, we can compare the values of 4ζ⁴ and B²ϵ to determine the possibility of a quantum speedup. As k ≥ 1, 0 < ϵ_0 < 1, and C_sig > 2 [22], in general the inequality will not hold, and hence we will not achieve a quantum speedup for learning two-layer neural networks with the sigmoid function.
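The claim that the sigmoid is uniformly well approximated by low-degree polynomials is easy to see numerically. The sketch below uses a Chebyshev interpolant on [−1, 1] as a stand-in; it only illustrates low-degree approximability, and is not the construction of Lemmas 8 and 12 of [15].

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def uniform_error(deg):
    """Sup-norm error on [-1, 1] of a degree-`deg` Chebyshev interpolant of
    the sigmoid (an illustration only; the paper's approximation differs)."""
    p = Chebyshev.interpolate(sigmoid, deg, domain=[-1, 1])
    grid = np.linspace(-1.0, 1.0, 2001)
    return float(np.max(np.abs(p(grid) - sigmoid(grid))))

errs = [uniform_error(d) for d in (1, 3, 5, 7)]   # error shrinks rapidly with degree
```

Because the sigmoid is analytic on a neighborhood of [−1, 1], the uniform error decays geometrically in the degree, which is what keeps the degree d of the multinomial kernel small in this application.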

A Lipschitz condition for multinomial kernel function
Note that throughout this paper, it always holds that z_0 = 1. In this case, the Lipschitz constant is bounded by O(d).

B Proof of Theorem 7
We first introduce several definitions and lemmas needed to prove the theorem. Given the coefficients α_i, we generate a hypothesis vector v(α) in the feature space by taking a linear combination of the vectors ψ_d(x_i).

Definition 9 (Auxiliary definitions). In the setting of Definitions 3 and 4, define the generated hypothesis mapping
In addition, define β ∈ R^{m_1} (using v from the concept class) and ∆ := v(β), with the norm η := ∥∆∥_2. Finally, define ρ as the average quadratic noise over the input data.
We note the subtle difference between the symbols v(α) and v, but emphasize that v(α) always carries the parentheses with the input value, while v is static and fixed by the element of the concept class. To adapt to the matrices K_{ij} (of dimension m_1 × m_1) and K′_{ij} (of dimension m_2 × m_1), Definition 10 and Lemma 4 are stated in general terms.

Definition 10 (Hypothesis function). Let u : R → [0, 1] be an L-Lipschitz function. Define the hypothesis function g
In Lemma 4, we show that if we have a good enough estimate M̃ of the matrix M, then the estimated result g(α, M̃, i) is not too far from the exact value g(α, M, i), with a dependence on ∥α∥_1.

Lemma 4. Let
(by the L-Lipschitz condition of u) (by assumption) In the following lemma, we show that if we update α_t according to the Alphatron algorithm, then the max-norm of α_t can be bounded in terms of T, λ, and m_1.

Lemma 5. For arbitrary
We prove the statement by induction. The base case is obviously true. Note that y is in [0, 1] and the range of g is also [0, 1]. Hence, the inductive step follows. We state the convergence result from the original work, Ref. [15], in Lemma 6 before proving our own Lemma 7. Intuitively, these lemmas show that each iteration makes progress towards the target. The norm ∥v(ω) − v∥₂² measures the distance between the current vector and the target. If we show that this value decreases as we run the algorithm, and if we lower bound this value in terms of the empirical error ε of the current vector, then either we have made progress, or the quality of the current vector is already good enough.
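The max-norm bound of Lemma 5 follows because each Alphatron update moves every coordinate by at most λ/m_1. The sketch below runs the update rule with stand-in prediction values in [0, 1] (our proxy for g(α^t, K, i), which in the real algorithm depends on α^t through the kernel) and confirms the bound numerically.

```python
import numpy as np

def run_updates(y, g_rows, lam):
    """Alphatron-style updates alpha <- alpha + (lam/m1) * (y - g^t).
    g_rows[t] stands in for the values g(alpha^t, K, i) in [0, 1]; we feed
    arbitrary values in [0, 1] rather than real kernel predictions."""
    m1 = len(y)
    alpha = np.zeros(m1)
    for g in g_rows:
        alpha += (lam / m1) * (y - g)
    return alpha

rng = np.random.default_rng(2)
m1, T, lam = 6, 50, 0.8
y = rng.uniform(size=m1)               # labels in [0, 1]
g_rows = rng.uniform(size=(T, m1))     # per-step predictions in [0, 1]
alpha = run_updates(y, g_rows, lam)
# Each coordinate moves by at most lam/m1 per step, so after T steps
# the max-norm is at most lam*T/m1, matching Lemma 5.
```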

Lemma 6 (Convergence of Algorithm 1 from [15]). Consider Definitions 3 and 4 for the setting and the algorithm parameters, as well as the training labels y ∈ [0, 1]^{m_1} and the kernel matrix
Let h be the hypothesis function defined as h(x) = u(⟨v(ω), ψ_d(x)⟩), and recall from Definition 9 that η = ∥∆∥_2. If ∥v(ω) − v∥_2 ≤ B for B > 1 and η < 1, then the convergence bound holds, with A_1 as defined there. We claim the following modified convergence result for our Algorithm 4. The difference from the previous lemma is the appearance of a term −ϵ_I² in the convergence bound.
Lemma 7. Consider Definitions 3 and 4 for the setting and the algorithm parameters, as well as the training labels y ∈ [0, 1]^{m_1} and the kernel matrix. For any vector ω ∈ R^{m_1}, let ω′ ∈ R^{m_1} be the vector defined by the exact update and ω̃ ∈ R^{m_1} the vector defined by the approximate update. Let h be the hypothesis function defined as h(x) = u(⟨v(ω), ψ_d(x)⟩). Let A_1 be defined as in Lemma 6.
(expand the definitions of v, ω′, and ω̃) (by the triangle inequality) Therefore, using the triangle inequality, we can deduce the stated bound. Note that except for the vector ω̃ and the related conditions, Lemma 7 has the same setting as Lemma 6. Thus, by the conclusion of Lemma 6, we can lower bound the corresponding norm difference. Together with equation (49), we obtain the required bound.
In the last step of Algorithm 2, we pick the hypothesis with the minimum êrr value. We also lose accuracy here, as we use entries from the approximation of the matrix K′. In Lemma 8, we show that these errors are acceptable.

Lemma 8. Consider the training data samples
We define the hypothesis functions h_t as h_t(x) = u(⟨v(α_t), ψ_d(x)⟩). Proof. By Definition 10 and the assumption on the kernel matrix K′, we obtain the stated estimate. We have m_2 samples for validation. Recall the definition of the empirical error êrr: (by the empirical error with respect to (a_i, b_i) and equation (51)) (the maximum is at least the average) (by assumption). Hence, we have the bound both when t = t̃ and when t = t′ in (57). And by the minimization over t, from the last three equations, by transitivity, we deduce the required êrr(h_t̃) − êrr(h_t′) ∈ O(Lϵ_I).
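The selection step that Lemma 8 analyzes is simply "evaluate every candidate hypothesis on the fresh validation samples and keep the minimizer". A minimal sketch, assuming the empirical error has the squared-error form êrr(h) = (1/m_2) Σ_i (h(a_i) − b_i)²:

```python
import numpy as np

def select_hypothesis(hypotheses, a, b):
    """Return the index minimizing the empirical validation error
    err_hat(h) = mean((h(a_i) - b_i)^2) (squared-error form assumed),
    mirroring the last step of the Alphatron."""
    errs = [float(np.mean((h(a) - b) ** 2)) for h in hypotheses]
    return int(np.argmin(errs)), errs

a = np.linspace(0.0, 1.0, 20)      # validation inputs
b = a ** 2                         # validation labels
hypotheses = [lambda x: x, lambda x: x ** 2, lambda x: 0.5 * np.ones_like(x)]
t_out, errs = select_hypothesis(hypotheses, a, b)
```

Lemma 8 then says that running this selection with approximate kernel entries perturbs each êrr value by at most O(Lϵ_I), so the chosen index remains near-optimal.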
Now we prove the main theorem. As in the original work, the theorem requires a generalization bound which involves the Rademacher complexity of the function class considered here. The required result is Theorem 12 in Appendix F.
Proof of Theorem 7. We first consider the success probability of the approximation part. According to Algorithm 4, we estimate each inner product with success probability 1 − δ_K/(m_1² + m_1 m_2). A union bound over these m_1² + m_1 m_2 estimations gives a total success probability of at least 1 − δ_K = 1 − δ. It then suffices to show that the main body itself succeeds with probability at least 1 − δ and indeed produces a good enough hypothesis.
The remaining proof consists of three parts. In the first part, we show that there exists t* ∈ [T] such that the empirical error of h_{t*} is good enough. In the second part, we show, using the Rademacher complexity, that for any specific hypothesis in the concept class we introduced, the generalization error is not far from the empirical error. In the third part, we show that by using m_2 additional samples to validate all generated hypotheses, as done in the algorithm, we are able to find a hypothesis with error similar to that of h_{t*}.
First, we show that there exists a good enough hypothesis according to the empirical error. Recall the notations ∆ and ρ in Definition 9, and let η = ∥∆∥_2. Ref. [15] bounds η by Hoeffding's inequality. Since η and ρ only depend on the setting in Definition 3, the modification made in Algorithm 4 compared to Algorithm 3 keeps the bounds for η and ρ the same. Hence, we use the same bounds in this proof.
In Algorithm 2, which is used in Algorithm 4, assume we are presently at iteration t, computing the vector α̃_{t+1} from α̃_t. In this proof, we use the tilde above the α to emphasize that we indeed construct a different sequence (α̃_t)_{t∈[T]} compared to the sequence (α_t)_{t∈[T]} of Algorithm 2 with the exact kernel matrices. One of the following two cases is satisfied. Let t* be the first iteration where Case 2 holds. We show that such an iteration exists. Assume the contrary, that is, that Case 2 fails in every iteration. Since ∥v(α̃_0) − v∥₂² = ∥0 − v∥₂² ≤ B² by assumption, Case 1 can only hold for a bounded number of consecutive iterations. Hence, in at most BL/η iterations, Case 1 will be violated and Case 2 will have to be true. By Assumption 4 and the bound on η, we have T ≥ BL/η, and then t* ∈ [T] must exist such that Case 2 is true. For all t ∈ [T], define the hypothesis function h_t as h_t(x) = u(⟨v(α̃_t), ψ_d(x)⟩). By Theorem 6, we have the corresponding maximum bound. Define the shorthand Γ_i := g(α̃_{t*}, K̃, i), with the hypothesis function from Definition 10. Then, with Lemma 4, we obtain the analogous maximum bound. Then, by Lemma 7 with ω = α̃_{t*} and ϵ_I = ϵ_K ∥α̃_{t*}∥_1, we obtain the convergence bound. Note that Case 2 holds for the iteration t*. Together with the upper bound in Eq. (62), it holds by transitivity that the empirical error is bounded, which implies the claimed estimate. Recall the definition of A_1. Using the known bounds for η and ρ, we bound the error terms; the last term can be bounded using ∥α̃_{t*}∥_1 ≤ T/L from Lemma 5. As a next step, we would like to bound ε(h_{t*}) in terms of the empirical error of h_{t*}. An argument based on the Rademacher complexity gives us the same bound as in the original work [15]. Define the function class Z = {x → u(⟨z, ψ_d(x)⟩) − E_y[y|x] : ∥z∥_2 ≤ 2B}. Ref. [15] bounds the Rademacher complexity of this class. Let us show that ∥v(α̃_{t*})∥_2 satisfies the norm bound 2B, the same as z in the class Z. Note that in the first t* − 1 iterations, Case 1 holds. By Eq. (61), the distance between v(α̃_t) and v decreases as t increases in [t* − 1]. Thus, we conclude the bound, and hence, by the triangle inequality, ∥v(α̃_{t*})∥_2 ≤ 2B. The above proof shows the existence of a good hypothesis h_{t*} for some t* ∈ [T]. We define the index of the best hypothesis as t′, which immediately implies the corresponding bound. In the last part of this proof, we show that at Line 6 of Alphatron_with_Kernel, we indeed find and output a good enough hypothesis (though this hypothesis may differ from the hypothesis derived from the output of Alphatron_with_Kernel). Our goal is to find a hypothesis h_t which minimizes ε(·). However, ε(·) is hard to compute directly from its definition. From Eq. (3), we have a relation for arbitrary hypotheses; hence, we may find the best hypothesis by minimizing err(·) instead of ε(·). As we do not know the distribution D, we are unable to compute err(·). However, it is possible to compute the empirical version êrr(·). Let t̃ be the index of the hypothesis output by Algorithm 4. As before, with the exact kernel matrix replaced by its estimate, the upper bound on the estimation error |êrr(h_t) − êrr(h_t′)| shown in Lemma 8 gives êrr(h_t̃) − êrr(h_t′) ≤ Lϵ_I. (79) We have ∥α_t∥_1 ≤ T/L; thus ϵ_I = ϵ_K ∥α_t∥_1 ≤ Tϵ_K/L. From Eq. (77) and Eq. (79), we obtain the combined bound, and with Eq. (74), the final error bound. The union bound over the probabilistic steps of estimating the full kernel matrix, the Rademacher generalization bound, and the Chernoff bound leads to a total success probability of 1 − 3δ.
Since by definition ε(h) ≤ 1 for any hypothesis h, it is reasonable to assume that A_2 ≤ 1 if we want a useful bound. Then, with the stated choice of T, Algorithm 4 outputs, with probability at least 1 − 3δ, a vector α_{t_out} which describes the hypothesis h_{t_out}(x) := u(Σ_{i=1}^{m_1} α_{t_out,i} K_d(x, x_i)), with the stated run time.

C Classical sampling
The next facts discuss the construction of a data structure for sampling from a vector, and the subsequent lemmas discuss the approximation of the inner product of two vectors by sampling. Both the ℓ_1 and ℓ_2 cases are required in this work. The label SQ can be understood as "sample query". The arithmetic model allows us to assume infinite-precision storage of real numbers.
Fact 1 (ℓ_1-sampling [31, 32]). Given an n-dimensional vector u ∈ R^n, there exists a data structure for sampling an index j ∈ [n] with probability |u_j|/∥u∥_1, which can be constructed in time O(n). One sample can be obtained in time O(log n). We call this data structure SQ1(u, n).
Fact 2 (ℓ_2-sampling [31, 32]). Given an n-dimensional vector u ∈ R^n, there exists a data structure for sampling an index j ∈ [n] with probability u_j²/∥u∥₂², which can be constructed in time O(n). One sample can be obtained in time O(log n). We call this data structure SQ2(u, n).
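One standard realization of these data structures is a prefix-sum array with binary search: O(n) construction, O(log n) per sample. Refs. [31, 32] use an equivalent binary-tree layout; the class below sketches the ℓ_1 case, and the ℓ_2 case is identical with u_j² weights.

```python
import bisect
import itertools

class SQ1:
    """l1-sampling structure of Fact 1: O(n) construction via prefix sums,
    O(log n) per sample via binary search over the prefix array."""
    def __init__(self, u):
        self.prefix = list(itertools.accumulate(abs(x) for x in u))
    def norm1(self):
        return self.prefix[-1]
    def sample(self, r):
        """Map r uniform in [0, ||u||_1) to an index j drawn with
        probability |u_j| / ||u||_1."""
        return bisect.bisect_right(self.prefix, r)

u = [0.5, -1.5, 1.0, 2.0]      # ||u||_1 = 5, prefix sums [0.5, 2.0, 3.0, 5.0]
sq = SQ1(u)
```

Feeding `sample` a uniform random number in [0, ∥u∥_1) gives the distribution of Fact 1; the deterministic inputs in the test below just exercise the index boundaries.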
Next, we show the estimation of inner products via sampling. The number of samples scales as 1/ϵ² classically, in contrast to quantum amplitude estimation, which scales as 1/ϵ. Lemma 9 is adapted from [30] and Lemma 10 is taken directly from [30]. Proof. Define a random variable Z with outcome sgn(u_j)∥u∥_1 v_j with probability |u_j|/∥u∥_1. Note that E[Z] = u • v, with the variance controlled by ∥u∥_1² ∥v∥²_max. For the ℓ_1 case, take the median of 6 log(1/δ) evaluations of the mean of 9∥u∥_1²∥v∥²_max/(2ϵ²) samples of Z; for the ℓ_2 case, take the median of 6 log(1/δ) evaluations of the mean of 9∥u∥₂²∥v∥₂²/(2ϵ²) samples.
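The median-of-means estimator of Lemma 9 can be sketched directly: sample indices from the ℓ_1 distribution, average the unbiased variable Z, and take a median over independent repetitions. The constants follow the 6 log(1/δ) × 9∥u∥_1²∥v∥²_max/(2ϵ²) pattern in the text; everything else (function name, seed handling) is ours.

```python
import math
import random
import statistics

def inner_product_l1(u, v, eps, delta, seed=0):
    """Estimate u . v as in Lemma 9: draw j with prob |u_j|/||u||_1, use the
    unbiased variable Z = sign(u_j) * ||u||_1 * v_j, and take a median of
    means over 6 log(1/delta) repetitions."""
    rng = random.Random(seed)
    norm1 = sum(abs(x) for x in u)
    weights = [abs(x) / norm1 for x in u]
    vmax = max(abs(x) for x in v)
    per_mean = max(1, math.ceil(9 * norm1 ** 2 * vmax ** 2 / (2 * eps ** 2)))
    reps = max(1, math.ceil(6 * math.log(1 / delta)))
    means = []
    for _ in range(reps):
        idx = rng.choices(range(len(u)), weights=weights, k=per_mean)
        z = [(1.0 if u[j] >= 0 else -1.0) * norm1 * v[j] for j in idx]
        means.append(sum(z) / per_mean)
    return statistics.median(means)

est = inner_product_l1([1.0, 1.0], [0.3, 0.7], eps=0.2, delta=0.1)  # true value 1.0
```

The 1/ϵ² sample count is visible in `per_mean`; the quantum routine of Lemma 13 replaces this with an O(1/ϵ) number of amplitude-estimation steps.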

D Quantum subroutines
First, we recall the quantum access used for vectors.
Definition 11 (Quantum query access). Let c and n be two positive integers and u ∈ ({0, 1}^c)^n a vector of bit strings. Define element-wise quantum access to u for j ∈ [n] via an operation on O(c + log n) qubits. We denote this access by QA(u, n, c).
For the following part of this appendix, recall Definition 1 regarding the fixed-point encoding of real numbers. In addition, we define quantum sample access to a normalized semi-positive vector v/∥v∥_1, which is a fixed-point approximation of a real semi-positive vector. Each component of the vector v is represented with c_1 bits before the decimal point and c_2 bits after the decimal point.
Definition 12 (Quantum sample access). Let c_1, c_2, and n be positive integers and v′ ∈ ({0, 1}^{c_1})^n and v″ ∈ ({0, 1}^{c_2})^n be vectors of bit strings. Define quantum sample access to a vector v via an operation on O(log n) qubits. We denote this access by QS(v, n, c_1, c_2). For sample access to a vector v which approximates a vector with components in [0, 1], we use the shorthand notation QS(v, n, c_2) := QS(v, n, 1, c_2).
As stated in [26], we have the following lemma for estimating the ℓ_1-norm of a vector with entries in [0, 1] and for preparing states encoding the square roots of the vector elements.
Lemma 11 (Quantum state preparation and norm estimation). Let c and n be two positive integers and u ∈ ({0, 1}^{c+1})^n. Assume quantum access to u via QA(u, n, c + 1). Let max_j Q(u_j) = 1. Then: 1. There exists a quantum circuit that prepares the corresponding state.

Lemma 12 (Quantum inner product estimation with additive accuracy). Let ϵ, δ ∈ (0, 1). Let c and n be two positive integers. Let two non-zero vectors of bit strings be u, v ∈ ({0, 1}^{c+2})^n, which leaves one bit for the sign of each component, one bit for the number before the decimal point, and c bits for the number after the decimal point. Let there be given quantum access to u and v as QA(u, n, c + 2) and QA(v, n, c + 2), respectively. Let the norms ∥Q(u)∥_2 and ∥Q(v)∥_2 be known. Then, there exists a quantum algorithm which provides an estimate Ĩ of the inner product.

Proof. Define the vectors u_+ and u_− element-wise from the positive and negative parts of u. It is easy to see that Q(u) = Q(u_+) − Q(u_−). Define the vectors v_+ and v_− in a similar way. Define two more vectors of bit strings z_+ and z_−. In the following, we use the standard ± notation to denote that a statement holds for both the + and the − case. Determine the index of z±_max := ∥Q(z_±)∥_max with the quantum maximum finding algorithm with success probability 1 − δ/4, using O(√n log(1/δ)) queries and O(√n log(1/δ)) quantum gates [11]. In case z±_max = 0, we infer that z_± = 0, and if both are zero we return the estimate 0.
Otherwise, for non-zero z_±, we apply Statement 2 of Lemma 11 to the vectors of bit strings corresponding to Q(z_±)/z±_max, respectively. These vectors of bit strings can be computed efficiently from the query access and the result of the maximum finding. We obtain estimates Γ_+ and Γ_−, each with success probability at least 1 − δ/4, where the first inequality follows from Cauchy-Schwarz; ∥Q(z_−)∥ is bounded similarly. Hence, we obtain the combined estimate. Since ∥Q(u)∥_2 and ∥Q(v)∥_2 are given, choosing ϵ′ = ϵ/(4∥Q(u)∥_2∥Q(v)∥_2) leads to the result. In the absence of more knowledge about the vectors, we take the bound z±_max/∥Q(z_±)∥_1 ≤ 1. Then the run time is O(∥Q(u)∥_2∥Q(v)∥_2 √n/ϵ · log(1/δ)). Combining these resource bounds with the resource bounds for maximum finding leads to the stated result.
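The sign decomposition that drives this proof is a purely classical identity: splitting each signed vector into non-negative parts reduces the signed inner product to inner products of non-negative vectors, which is the form the quantum subroutines can handle. A minimal numerical check:

```python
import numpy as np

def split(u):
    """Element-wise split u = u_plus - u_minus into non-negative parts."""
    return np.maximum(u, 0.0), np.maximum(-u, 0.0)

rng = np.random.default_rng(3)
u = rng.normal(size=16)
v = rng.normal(size=16)
up, um = split(u)
vp, vm = split(v)

lhs = u @ v
# Four inner products of non-negative vectors recombine to the signed value:
rhs = (up @ vp + um @ vm) - (up @ vm + um @ vp)
```

In the lemma, the four non-negative products are packed into the two vectors z_+ and z_−, so only two estimations are needed instead of four.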
With the following lemma, we can remove the explicit dimension dependence of the inner product estimation. For this lemma, we suppose that one vector is given via quantum query access as before, and that the other vector is given via access to a quantum subroutine that prepares an amplitude encoding of the vector. In our work, the quantum sampling access is provided via QRAM, see Definition 7. The vectors in this lemma are considered to be fixed-point approximations of real vectors with elements restricted to [0, 1].
Lemma 13 (Inner product estimation with quantum sampling and query access). Let c and n be two positive integers. Let u ∈ ({0, 1}^{c+1})^n be a non-zero vector of bit strings, and let v ∈ ({0, 1}^{c+1})^n be another vector of bit strings. Assume quantum query access to u via QA(u, n, c + 1) and quantum sample access to v via QS(v, n, c + 1). Then: (i) If max_j Q(u_j) = 1, then there exists a quantum circuit that prepares the corresponding state with three queries and O(c + log n) additional gates.
(iii) Let ϵ, δ ∈ (0, 1). Let the norm ∥Q(v)∥_1 and j_max := arg max_j Q(u_j) be known. There is a quantum algorithm, similar to (ii), which provides the corresponding estimate. Proof. For (i), with the quantum sample access and the quantum query access, perform the following steps. The first step consists of an oracle query to the vector v on the first register. The second step consists of an oracle query to the vector u, which puts the vector component in the second register depending on the index in the first register. The last step consists of a controlled rotation. The rotation is well-defined, as Q(u_j) ≤ max_j Q(u_j) = 1, and can be implemented with O(c) gates. Then we uncompute the data register |u_j⟩ with another oracle query. For (ii), define a unitary U = U_1 (I − 2|0⟩⟨0|) U_1†, where U_1 is the unitary obtained in (i). Define another unitary V = I − 2 I ⊗ |0⟩⟨0|. Using K applications of U and V, amplitude estimation [8] allows us to provide an estimate ã of the quantity a = Q(v) • Q(u)/∥Q(v)∥_1 to the required accuracy. Note that 0 ≤ a = Q(v) • Q(u)/∥Q(v)∥_1 ≤ max_j Q(u_j) = 1. Set K > 3π/ϵ. Then we obtain the stated accuracy. Performing a single run of amplitude estimation with K steps requires O(K) = O(1/ϵ) queries to the oracles and O(1/ϵ) gates, and succeeds with probability 8/π². The success probability can be boosted to 1 − δ with O(log(1/δ)) repetitions of amplitude estimation.
For (iii), from the index j_max = arg max_j Q(u_j) we can obtain the bit string u_{j_max} and its corresponding value Q(u_{j_max}). This allows us to prepare the quantum circuit for the transformation from the original query access to u, together with basic arithmetic quantum circuits for the division. Then we run the same steps as in (ii) with the vector Q(u)/Q(u_{j_max}), quantum sample access to the vector v, and error parameter ϵ/(∥Q(v)∥_1 Q(u_{j_max})). We obtain an estimate Γ from (ii); then, multiplying both sides of (93) by ∥Q(v)∥_1 Q(u_{j_max}), we obtain the required estimate Γ′ = Γ ∥Q(v)∥_1 Q(u_{j_max}).

E Combined algorithm
In this section, we combine the algorithm for the kernel matrix pre-computation with the gradient estimation for the learning process. Note that the kernel matrix estimation step introduces an additional error. The classical algorithm is as follows, and the corresponding guarantee and run time are given by the following result. Proof. The proof is the same as for Theorem 8, except for the following inequalities. Here, we use Alphatron_with_Approx_Pre to estimate the kernel matrices to precision ϵ_K; denote the estimates by K̃_ij and K̃′_ij, respectively. Therefore, with Alphatron_With_Kernel_And_Sampling, we estimate the inner product α_t • K̃_j instead of α_t • K_j, to additive accuracy ϵ′_I. Let the estimated value be r_j^t. To guarantee the required accuracy, it suffices to take ϵ′_I = ϵ_I/2 and ϵ_K = ϵ_I/(4λT). Recall that by Lemma 5, 0 < α_max ≤ λT/m_1. For the run time, storing K̃_ij and K̃′_ij takes time O(m²), which is included in the quoted run time. Setting ϵ_I = √ϵ, we obtain the run time from Theorem 8.
We continue with the corresponding quantum algorithm. For the result, we slightly modify the proof of Theorem 9.
Corollary 5 (Quantum Alphatron combined). We assume quantum query access to the vectors x_i and a_i via Data Input 3. Let K_max be the maximum of all entries in K and K′, the corresponding kernel matrices. Let δ ∈ (0, 1). Given Definitions 3 and 4 with A_2 ≤ 1 and ϵ_I = √ϵ, Algorithm 9 outputs α_{t_out} such that the hypothesis h_{t_out} = u(Σ_j α_{t_out,j} ψ(x_j) • ψ(x)) satisfies, with probability 1 − 4δ, ε(h_{t_out}) ∈ O(A_2).

Definition 7 (Quantum RAM). Let c and m be positive integers. Let v be a vector of dimension m, where each element of v is a bit string of length c, i.e., v ∈ ({0, 1}^c)^m. Quantum RAM is defined such that with a one-time cost of O(c m) we can construct quantum query and sampling access QA(v, m, c) and QS(v, m, c), see Definitions 5 and 6 in Appendix D. Each query costs O(c poly log m).

Algorithm 7: Quantum_Alphatron
1 Input: training data (x_i, y_i)_{i=1}^{m_1}, testing data (a_j, b_j)_{j=1}^{m_2}, error tolerance parameters ϵ_I and δ_I, function u: R → [0, 1], number of iterations T, degree of the multinomial kernel d, learning rate λ, quantum query access to K_j for all j ∈ [m_1] and K′_j for all j ∈ [m_2] via Data Input 3
2 α_0 ← 0 ∈ R^{m_1}
3 for j ← 1 to m_1 do
4   p^max_j ← max_i |K_ji| via quantum maximum finding with success probability 1 − δ_I/(4m_1)
5 ...
The concept class is learnable to accuracy ∼ L√ϵ by the Quantum_Alphatron algorithm, with a run time that shows an advantage by a factor ∼ ζ²/(B√ϵ) over the classical algorithm given the same input.

Corollary 3. In the same setting as Theorem 7, if L√ϵ ≤ 1 and we set ϵ_K = L√ϵ/T and δ_K = δ, then we can simplify the right-hand side of Eq. (81) to O(A_2 + ϵ_K T). This corollary follows from the run-time analysis in Theorem 6 and the accuracy analysis in Theorem 7.
Use these matrices and the other inputs of Algorithm 1 to fix the input of Algorithm 2. The sequences (α_t) are then generated as in Algorithm 2; the total number of samples is m = m_1 + m_2, and we have m_2 validation samples.

Table 2: Comparison of the second set of algorithms, Alphatron_with_Kernel_and_Sampling and Quantum_Alphatron, which are discussed in Section 4, to Alphatron_with_Kernel. These algorithms change the main-loop part by using an inner product estimation. The inner product estimation is performed to accuracy ϵ_I, and the total success probability of the algorithm is 1 − δ. Here, we indicate the general result without the learning setting in Definition 3.

Table 3: Comparison of the algorithms Alphatron_with_Kernel and Quantum_Alphatron for the learning setting in Definition 3. Only in the first case is a quantum advantage obtained.
2, we use a fresh sample set (a_i, b_i) of size m_2 as the validation data set, and we compute the empirical error êrr(h_t) for each h_t on this data set. Let ϵ′ = 1/√m_1. For fixed t, since êrr(h_t) is in [0, 1], by a Chernoff bound on m_2 ∈ O(log(T/δ)/(ϵ′)²) samples, with probability 1 − δ/T we obtain |err(h_t) − êrr(h_t)| ≤ ϵ′ for all t ∈ [T]. However, in Algorithm 4, to find the hypothesis with the minimum empirical error, we use the estimated kernel matrix K̃′ instead of the exact inner products. Thus, we have additional errors in computing êrr(h_t). Lemma 9 (Inner product with ℓ_1-sampling). Let ϵ, δ ∈ (0, 1). Given query access to v ∈ R^n and SQ1(u, n) access to u ∈ R^n, take 9∥u∥_1²∥v∥²_max/(2ϵ²) samples of Z per mean; then, by the Chebyshev and Chernoff inequalities, we obtain an ϵ additive-error estimate of u • v with probability at least 1 − δ. Lemma 10 (Inner product with ℓ_2-sampling). Let ϵ, δ ∈ (0, 1). Given query access to v ∈ R^n and SQ2(u, n) access to u ∈ R^n, we can determine u • v to additive error ϵ with success probability at least 1 − δ, using 9∥u∥₂²∥v∥₂²/(2ϵ²) samples of Z per mean; again, by the Chebyshev and Chernoff inequalities, we obtain an ϵ additive-error estimate of u • v with probability at least 1 − δ. By Fact 2 and Lemma 10, given vectors u, v ∈ R^n, the sampling data structure for u can be constructed in O(n) time, and an estimate of u • v with ϵ additive error can be obtained with probability at least 1 − δ at a run-time cost of O(∥u∥₂²∥v∥₂²/ϵ² · log(1/δ)).

Algorithm 8: Alphatron_combined
1 Input: training data (x_i, y_i)_{i=1}^{m} and testing data (a_i, b_i)_{i=1}^{N}, error tolerance parameter ϵ_I, failure probability δ, function u: R → [0, 1], number of iterations T, degree of the multinomial kernel d, learning rate λ
2 K̃_ij, K̃′_ij ← Call Alphatron_with_Approx_Pre (Algorithm 4) with all inputs as above.
3 Prepare query access to K̃_ij, K̃′_ij as Data Input 2.
4 α_{t_out} ← Call Alphatron_With_Kernel_And_Sampling (Algorithm 6) with all inputs as above and K_j, K′_j via Data Input 2 as prepared.
5 Output α_{t_out}

Corollary 4 (Alphatron combined). Given training data (x_i, y_i)_{i=1}^{m} and testing data (a_i, b_i)_{i=1}^{N}, let K_max be the maximum of all entries in K and K′, the corresponding kernel matrices. Let ϵ_I, δ ∈ (0, 1). If Definitions 3 and 4 hold, A_2 ≤ 1, and ϵ_I = √ϵ, then Algorithm 8 outputs α_{t_out}, which describes the hypothesis h_{t_out}(x) := u(Σ_{i=1}^{m_1} α_{t_out,i} K_d(x, x_i)), such that with probability 1 − 4δ, ε(h_{t_out}) ∈ O(A_2).