Towards understanding the power of quantum kernels in the NISQ era

A key problem in the field of quantum computing is understanding whether quantum machine learning (QML) models implemented on noisy intermediate-scale quantum (NISQ) machines can achieve quantum advantages. Recently, Huang et al. [Nat Commun 12, 2631] partially answered this question through the lens of quantum kernel learning. Namely, they exhibited that quantum kernels can learn specific datasets with a lower generalization error than the optimal classical kernel methods. However, most of their results are established in the ideal setting and ignore the caveats of near-term quantum machines. A crucial open question is therefore: does the power of quantum kernels still hold in the NISQ setting? In this study, we fill this knowledge gap by exploring the power of quantum kernels when quantum system noise and sample error are taken into account. Concretely, we first prove that the advantage of quantum kernels vanishes for a large dataset size, a small number of measurements, and a large amount of system noise. With the aim of preserving the superiority of quantum kernels in the NISQ era, we further devise an effective method based on indefinite kernel learning. Numerical simulations accord with our theoretical results. Our work provides theoretical guidance for designing advanced quantum kernels that attain quantum advantages on NISQ devices.


Introduction
Kernel methods provide a powerful framework for nonlinear and nonparametric learning, owing to their universality and interpretability [9,22,45]. During the past decades, kernel methods have been broadly applied to image processing, translation, and data mining tasks [2,44,47]. As shown in Figure 1, the general rule of kernel methods is to embed each input x^(i) ∈ R^d into a high-dimensional feature space, i.e., φ(·) : R^d → R^q with q ≫ d, so that different classes of data points become readily separable. Note that explicitly manipulating φ(x^(i)) becomes computationally expensive for large q. To retain efficiency, kernel methods instead construct a kernel matrix K ∈ R^{n×n} to accomplish the learning task in the feature space, with n being the number of training examples. Specifically, the elements of K represent inner products of feature maps, K_ij = K_ji = ⟨φ(x^(i)), φ(x^(j))⟩ for all i, j ∈ [n], where each inner product can be evaluated by a positive definite function κ(x^(i), x^(j)) in O(d) runtime. The performance of kernel methods heavily depends on the employed embedding function φ(·), or equivalently the function κ(·, ·) [8,18]. To this end, various kernels such as the radial basis function (RBF) kernel, the Gaussian kernel, the circular kernel, and the polynomial kernel have been proposed to tackle different tasks [1,28]. Moreover, a recent study showed that the evolution of neural networks during training can be described by the neural tangent kernel [25].
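As a concrete classical reference point, the kernel trick described above can be sketched in a few lines; the RBF kernel and the toy data below are illustrative choices, not taken from the paper.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel trick: K_ij = exp(-gamma * ||x_i - x_j||^2), evaluated
    directly from the inputs without forming the feature map phi."""
    sq = np.sum(X ** 2, axis=1)                       # ||x_i||^2 per row
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))       # clamp tiny negative round-off

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel_matrix(X)
# K is symmetric positive semi-definite with unit diagonal, as required
# of a valid kernel matrix.
assert np.allclose(K, K.T) and np.allclose(np.diag(K), 1.0)
```

The Gram matrix is built from pairwise distances in O(n²d) time, never touching the (here infinite-dimensional) feature space explicitly.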
Quantum machine learning (QML) aims to effectively solve certain learning tasks that are challenging for classical methods [7,39,50]. Theoretical studies have demonstrated that many QML algorithms, e.g., the quantum perceptron [27], the quantum support vector machine [33], and quantum differentially private sparse learning [15], outperform their classical counterparts in terms of runtime complexity. Despite these runtime advantages, the resources required to implement such algorithms are expensive and even unaffordable for noisy intermediate-scale quantum (NISQ) machines [3,39]. Meanwhile, experimental studies have confirmed the feasibility of using near-term devices to accomplish various QML tasks such as classification [20], image generation [23,41], and drug design [32]. However, theoretical results guaranteeing quantum advantages for these NISQ-based QML algorithms are lacking. Therefore, an open question in the field of QML is: 'What QML algorithms can be executed on NISQ devices with evident advantages?'
A possible solution to the above question is quantum kernels [42]. As shown in Figure 1, there is a close correspondence between classical and quantum kernels: the feature map φ(·) coincides with the preparation of quantum states via variational quantum circuits U_E(x^(i)) ∈ C^{2^N × 2^N} [6,10,13], i.e., |ϕ(x^(i))⟩ = U_E(x^(i)) |0⟩, which maps the input data into a high-dimensional Hilbert space described by N qubits that cannot be accessed efficiently by classical means; the kernel function κ(·, ·) coincides with applying measurements to the prepared quantum states, i.e., |⟨ϕ(x^(j))|ϕ(x^(i))⟩|², which enables the efficient collection of information from the feature space. Due to the flexibility of variational quantum circuits, quantum kernels have been experimentally implemented on different platforms such as superconducting, optical, and nuclear magnetic resonance quantum chips to resolve classification tasks [5,20,29,52]. As indicated by [43], quantum kernels can achieve advantages when the prepared quantum states are classically intractable. Following the same routine, Huang et al. [24] recently proved the prediction advantages of quantum kernels: for appropriate datasets, quantum kernels assure a lower generalization error bound than that of classical kernels [49].
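The quantum kernel entry |⟨ϕ(x^(j))|ϕ(x^(i))⟩|² can be mimicked with a plain statevector simulation; the data-seeded random unitary below is a toy stand-in for a real encoding circuit U_E(x), not the circuit used in the paper.

```python
import numpy as np

N = 3                    # number of qubits
dim = 2 ** N

def encoding_unitary(x):
    """Toy stand-in for U_E(x): a deterministic random unitary seeded by
    the datum x (a real encoding would use parameterized gates)."""
    seed = abs(hash(tuple(np.round(x, 6)))) % (2 ** 32)
    g = np.random.default_rng(seed)
    A = g.normal(size=(dim, dim)) + 1j * g.normal(size=(dim, dim))
    Q, _ = np.linalg.qr(A)
    return Q

def quantum_kernel(x_i, x_j):
    """W_ij = |<phi(x_j)|phi(x_i)>|^2 with |phi(x)> = U_E(x)|0...0>."""
    zero = np.zeros(dim, dtype=complex)
    zero[0] = 1.0
    phi_i = encoding_unitary(x_i) @ zero
    phi_j = encoding_unitary(x_j) @ zero
    return float(np.abs(np.vdot(phi_j, phi_i)) ** 2)

rng = np.random.default_rng(42)
x, y = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(quantum_kernel(x, x), 1.0)   # identical states overlap fully
assert 0.0 <= quantum_kernel(x, y) <= 1.0      # overlaps are probabilities
```

On hardware this overlap is not read out from the statevector but estimated from measurement outcomes, which is precisely where the finite-shot error discussed below enters.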
Despite these promising achievements, most of the theoretical results in [24] are established in the ideal setting. In particular, they assume that the number of measurements is infinite and that the exploited quantum system is noiseless, both of which are impractical for NISQ devices. The quantum kernel returned by NISQ machines, affected by system noise and a finite number of measurements, may be indefinite and therefore does not obey the results claimed in [24]. Driven by the attractive merits of quantum kernel methods and the deficiencies of near-term quantum machines, a crucial question is: Does the power of quantum kernels still hold in the NISQ era? An affirmative answer to this question would not only enable a wide range of machine learning tasks to gain prediction advantages but could also help establish quantum deep learning theory.
A central theoretical contribution of this paper is exhibiting that a larger data size n, a higher system noise p, and a smaller number of measurements m make the generalization advantage of quantum kernels inconclusive. This result indicates a negative conclusion for using quantum kernels implemented on NISQ devices to tackle large-scale learning tasks with evident advantages, and it contradicts the claim of the study [24] that a larger data size n promises a better generalization error. Moreover, we show that quantum system noise is a fatal factor that can collapse any superiority provided by quantum kernels. These observations provide crucial guidance for designing powerful quantum kernels that earn quantum advantages in the NISQ era.
Our second contribution is empirically demonstrating that, in the NISQ setting, the advantages of quantum kernels may be preserved by suppressing the estimation error of the kernel. Concretely, we adopt advanced spectral transformation techniques, developed in indefinite kernel learning, to alleviate the negative effects induced by the system noise and the finite number of measurements. Numerical simulation results demonstrate that the performance of noisy quantum kernels can be improved by 14%. Our work opens up a promising avenue for combining classical indefinite kernel learning methods with quantum kernels to attain quantum advantages in the NISQ era.
2 Quantum kernels in the NISQ scenario

Before elaborating on our main results, let us first follow the study [24] to formulate quantum kernel learning in the NISQ scenario. In particular, suppose that both the training and test examples are sampled from the same domain X × Y. The training dataset is denoted as D = {(x^(i), y^(i))}_{i=1}^n, where x^(i) ∈ R^d and y^(i) ∈ R refer to the i-th feature vector and its corresponding label, respectively.

Figure 1: The paradigm of classical and quantum kernels. Both classical and quantum kernels embed data points from the data space X into a high-dimensional space and then compute the kernel as the inner product of feature maps. The quantum kernel leverages variational quantum circuits to achieve this goal, as indicated by the blue color. In the ideal scenario, quantum kernels promise better performance than classical kernels on certain datasets.

The prepared quantum state for the i-th example yields |ϕ(x^(i))⟩ = U_E(x^(i)) |0⟩^{⊗N}, where U_E(·) is the specified encoding quantum circuit and N is the number of qubits. The relation between x^(i) and y^(i) is y^(i) = f(x^(i)) := Tr(O U(θ*) ρ(x^(i)) U(θ*)†), where O, U(θ*), and ρ(x^(i)) = |ϕ(x^(i))⟩⟨ϕ(x^(i))| ∈ C^{2^N × 2^N} represent the measurement operator, a specified quantum neural network [14], and the density operator of the encoded quantum datum, respectively. The aim of quantum kernel learning is to use the quantum kernel W ∈ R^{n×n} with entries W_ij = Tr(ρ(x^(i)) ρ(x^(j))) (Eqn. (1)) to infer a hypothesis h(x^(i)) = ⟨ω*, ϕ(x^(i))⟩ with a low generalization error, where ω* refers to the optimal parameter with ω* = arg min_ω λ⟨ω, ω⟩ + Σ_{i=1}^n (⟨ω, ϕ(x^(i))⟩ − y^(i))² (see Appendix B for details). The generalization error of quantum kernels is quantified by E_{x,W} |h(x) − y|, where the randomness is taken over the dataset and the quantum kernel method.
We note that the quantum kernel W in Eqn. (1) corresponds to the ideal setting. Nevertheless, NISQ machines are prone to errors and only support a finite number of measurements [39]. In the worst scenario, the system noise can be modeled by the quantum depolarization channel N_p, i.e., N_p(ρ) = (1 − p)ρ + p I_{2^N}/2^N, where p refers to the depolarization rate and I_{2^N} is the identity. Note that the effective rate p̃ = 1 − (1 − p)^{L_Q} depends on the quantum circuit depth L_Q and the depolarization rate p of each layer (see Appendix C for details). To this end, the element W_ij of the quantum kernel changes to W̃_ij = Tr(N_p̃(ρ(x^(i)) ρ(x^(j)))) for all i, j ∈ [n]. In addition, when the number of measurements applied to estimate each entry is m, the estimated kernel element yields W̄_ij = (1/m) Σ_{k=1}^m V_k, where V_k ∼ Ber(W̃_ij) is the output of a quantum measurement and X ∼ Ber(p) refers to the Bernoulli distribution with Pr(X = 1) = p and Pr(X = 0) = 1 − p. The generalization error bound under the noise setting described above is summarized in the following theorem, whose proof is provided in Appendix C.
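The measurement model above can be sketched as follows; the depolarized mean (1 − p̃)W_ij + p̃/2^N is a simplifying assumption consistent with the form of N_p, and all numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_qubits, L_Q, p_layer = 4, 6, 0.01
p_eff = 1.0 - (1.0 - p_layer) ** L_Q       # effective rate over L_Q layers

def noisy_kernel_entry(W_ij, m, p=p_eff, N=N_qubits):
    """One NISQ estimate of a kernel entry: depolarization biases the
    expectation, and m shots replace it with a Bernoulli average."""
    mean = (1.0 - p) * W_ij + p / 2 ** N    # assumed depolarized expectation
    shots = rng.binomial(1, mean, size=m)   # V_k ~ Ber(tilde W_ij)
    return shots.mean()

W_ij = 0.8
ests = [noisy_kernel_entry(W_ij, m=10) for _ in range(200)]
# The shot averages scatter around the biased mean, not around W_ij itself.
print(np.mean(ests))
```

Both error sources are visible here: the bias from p (which no number of shots removes) and the variance from finite m.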
Notably, when depolarization noise is considered, the generalization error of the noisy quantum kernel, E_{x,W̄} |h(x) − y|, always contains a term scaling with n^{1/4}. The results in Theorem 1 indicate that the generalization error bound is nearly saturated in the NISQ setting, in the sense that the lower and upper bounds are separated only by a factor of n^{1/4}; in other words, the generalization error bound of noisy kernels must contain a term that grows with n. This observation implies that the generalization bound derived in [24] fails to explain the generalization ability of noisy quantum kernels, since their result gives an upper bound of Õ(√(c_1/n)) in the noiseless setting and claims that the error continuously vanishes as n grows. To further elucidate the separation between noisy and ideal quantum kernels, we conduct numerical simulations that exhibit how the generalization error varies with the size of the training dataset n under the ideal and noisy scenarios, respectively. The simulation results indicate that the prediction accuracy of the noisy kernel begins to decline once n exceeds a certain threshold, which supports Eqn. (5). We also theoretically show that the upper bound in Eqn. (5) is nearly saturated. Refer to Appendix D for more details.
Besides that, Theorem 1 provides the following two insights.
• The performance of quantum kernels in the NISQ era heavily depends on the number of measurements m. When m = O(n³) and p̃ is small, the generalization error of noisy quantum kernels is competitive with the ideal case. When m < n, the advantage of quantum kernels entirely vanishes.
• The term c_2 in Eqn. (5) reflects the negative role of the system noise: even when m is sufficiently large, the generalization error bound can still be very large due to p̃. Moreover, the generalization error bound in Eqn. (5) becomes infinite once p̃ > 1/(n c_W (1 + 1/2^{N+1})).
The above insights can serve as guidance for designing powerful quantum kernels in the NISQ era. In particular, the size of the training dataset n should be kept small; a possible solution is constructing a coreset [4]. In addition, the number of measurements m should be set on the order of n³ to pursue potential quantum advantages. Last, since the effective noise rate p̃ is determined by the circuit depth L_Q and the per-layer depolarization rate p as shown in Eqn. (3), shallow encoding circuits are preferable.
Remark. The results in Theorem 1 are established under depolarization noise; they can easily be extended to more general noise channels.
We next conduct extensive numerical simulations to validate the correctness of Theorem 1. In particular, following the same routine as the study [24], we adopt Fashion-MNIST to exhibit how noise affects the superiority of quantum kernels. The data preprocessing stage contains four steps. First, we clean the dataset and only keep images with labels '0' and '3', which correspond to 'T-shirt/top' and 'dress', respectively. In other words, our simulation focuses on a binary classification task. Second, we sample n (n_Te) examples from the filtered data to construct the training dataset D (test dataset D_Te). Third, a feature reduction technique, i.e., principal component analysis (PCA) [26], is exploited to project each example in D ∪ D_Te (an image with feature dimension 28×28) into a low-dimensional feature vector x^(i) ∈ R^N, where N refers to the number of qubits. Last, we reassign the data labels in D ∪ D_Te to saturate the geometric difference between quantum and classical kernels, following the method proposed by [24].
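The PCA step of this preprocessing pipeline can be sketched with an SVD; the random array below merely stands in for the filtered Fashion-MNIST images.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the filtered Fashion-MNIST images (flattened 28 x 28 = 784
# features); real code would load the dataset and keep labels 0 and 3 only.
images = rng.normal(size=(120, 784))
N = 8                                     # number of qubits = target dimension

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components (SVD)."""
    Xc = X - X.mean(axis=0)               # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                  # scores in the k-dim subspace

features = pca_project(images, N)
assert features.shape == (120, N)
```

Each 784-dimensional image is thus compressed to one feature per qubit before being fed into the encoding circuit.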
Once the data preprocessing is completed, we apply quantum kernel methods to learn these modified datasets under both the noiseless and noisy settings.
Furthermore, to understand whether noisy quantum kernels retain any superiority, we introduce an advanced classical kernel, the radial basis function (RBF) kernel, as a reference for learning the modified datasets. Note that the RBF kernel is optimized by tuning its hyper-parameter and regularization parameter to make full use of its power (see Appendix F.3 for details). The hyper-parameter settings used in the numerical simulations are as follows. The size of the training dataset D is set as n ∈ {5, 100}. The size of the test dataset D_Te is 100. The number of measurement shots is set as m ∈ {10, 100, 500, 1000}. The number of qubits, or equivalently the dimension of the projected feature x^(i), is set as N ∈ {2, 8}. Notably, we adopt two types of noise models to quantify the performance of noisy quantum kernels: the depolarization channel N_p in Eqn.
(3) with p ∈ {0.001, 0.1}, and the noise model extracted from the real quantum hardware IBMQ-Melbourne. Since the optimization of indefinite kernels is intractable, we adopt the nearest-projection method [21] to project the indefinite kernel onto the positive semi-definite matrix space. Please refer to Appendix F for more simulation details.
The simulation results are illustrated in Figure 2. Specifically, the left panel exhibits the simulation results under depolarization noise with N = 2. When n = 100, the test accuracy continuously approaches the 95% baseline as the number of measurements m increases. For instance, when m = 10 and m = 1000, the test accuracy of the noisy quantum kernel with p = 0.05 is 61.2% and 91.2%, respectively. Similar observations hold for the case of n = 5. Moreover, the performance for n = 5 is competitive with, and even better than, that of n = 100 when the number of measurements is small. Notably, for both the noisy and ideal quantum kernels under all settings, the performance is superior to that of the RBF kernel. The middle panel shows the simulation results under depolarization noise with N = 8. Concretely, an increased number of measurements m and a decreased noise rate p improve the performance of noisy quantum kernels. Meanwhile, the noisy quantum kernels always outperform the RBF kernel once m > 500. All of the above observations echo Theorem 1. The right panel depicts the performance of noisy quantum kernels under the specific noise model extracted from the real quantum hardware IBMQ-Melbourne. The simulation results suggest that the rules implied by Theorem 1 can serve as guidance for describing the behavior of noisy quantum kernels in realistic settings.
3 Enhancing the performance of noisy quantum kernels

Let us revisit the consequences of noisy quantum kernels as indicated in Theorem 1. Specifically, to preserve the superiority of quantum kernels carried out on NISQ machines, we can either slim the size of the training dataset or suppress the effects of noise. The solution to the first issue is apparent, i.e., the construction of a coreset [4], while the strategy for the second issue remains obscure. In the following, we investigate how to mitigate the negative influence of quantum noise to further enhance the performance of noisy quantum kernels.
Theorem 1 describes how the estimation error ‖W^{-1} − W̄^{-1}‖_F induced by the quantum noise influences the performance of noisy quantum kernels (see the proof of Theorem 1). In other words, an effective strategy that shrinks the distance between the noiseless and noisy quantum kernels leads to a better generalization error. Mathematically, if there exists a matrix Ŵ satisfying ‖Ŵ − W‖_F ≤ ‖W̄ − W‖_F, then Ŵ can be used to infer a better hypothesis h(·) than W̄. Seeking such a target matrix Ŵ has been extensively investigated in indefinite kernel learning [11,35,37,51]. The key idea is to use certain spectral transformations to convert W̄ into Ŵ. To facilitate the discussion, in the rest of the paper we denote the spectral decompositions of W and W̄ by W = Σ_{i=1}^n λ_i v_i v_i^⊤ and W̄ = Σ_{i=1}^n λ̄_i u_i u_i^⊤, where λ_i and λ̄_i refer to the eigenvalues and v_i and u_i to the corresponding eigenvectors. Without loss of generality, we assume λ̄_1 ≥ · · · ≥ λ̄_r ≥ 0 ≥ · · · ≥ λ̄_n. Here we explore three advanced spectral transformation techniques to acquire Ŵ and prove their theoretical guarantees. The first approach clips all negative eigenvalues of W̄ to zero [51], i.e., Ŵ_c = Σ_{i=1}^r λ̄_i u_i u_i^⊤. The second approach flips the sign of the negative eigenvalues [19], i.e., Ŵ_f = Σ_{i=1}^n |λ̄_i| u_i u_i^⊤. The third approach shifts all eigenvalues by a positive constant [40], i.e., Ŵ_s = Σ_{i=1}^n (λ̄_i − λ̄_n) u_i u_i^⊤, where λ̄_n is the minimum (non-positive) eigenvalue of W̄. The following lemma exhibits that the modified quantum kernels {Ŵ_c, Ŵ_f, Ŵ_s} can achieve a lower estimation error than the noisy quantum kernel W̄.
Lemma 1. Let W and W̄ be the ideal and noisy quantum kernels in Eqns. (1) and (4), respectively. Applying the spectral transformation techniques to W̄, the obtained kernel Ŵ ∈ {Ŵ_c, Ŵ_f, Ŵ_s} yields ‖Ŵ − W‖_F ≤ 2‖W̄ − W‖_F, where ‖·‖_F refers to the Frobenius norm.
The proof of Lemma 1 is given in Appendix E. The relation is ensured by the fact that Ŵ can be viewed as an (approximate) orthogonal projection of W̄ onto the positive semi-definite (PSD) cone; the triangle inequality combined with the non-expansiveness of this projection immediately implies Eqn. (7). The result in Lemma 1 suggests that spectral regularization may be a fundamental tool for improving the generalization of quantum kernels in the NISQ era.
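A minimal sketch of the three spectral transformations (clipping, flipping, shifting) applied to an indefinite symmetric kernel estimate:

```python
import numpy as np

def clip_flip_shift(W_bar):
    """Clip, flip, and shift spectral transformations for an indefinite
    symmetric kernel estimate W_bar = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(W_bar)
    W_c = (U * np.clip(lam, 0.0, None)) @ U.T     # clip negatives to zero
    W_f = (U * np.abs(lam)) @ U.T                 # flip their sign
    shift = min(lam.min(), 0.0)                   # most negative eigenvalue
    W_s = (U * (lam - shift)) @ U.T               # shift the whole spectrum
    return W_c, W_f, W_s

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
W_bar = (A + A.T) / 2.0                           # indefinite symmetric matrix
for M in clip_flip_shift(W_bar):
    # every transformed kernel is positive semi-definite
    assert np.min(np.linalg.eigvalsh(M)) >= -1e-9
```

Clipping is exactly the nearest-PSD projection in Frobenius norm, which is why it plays the central role in the projection argument behind Lemma 1.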
We conduct the following numerical simulations to demonstrate that spectral transformation techniques enhance the performance of noisy quantum kernels. Specifically, for all settings with n ∈ {5, 50, 100, 200}, the generalization performance of noisy quantum kernels in the original case is no better than 75% when m = 10, as shown in the upper left panel of Figure 3. By contrast, the training performance of noisy quantum kernels is dramatically improved when the spectral transformation techniques are adopted. Notably, the shifting method makes the training performance of the noisy quantum kernels competitive with that of the ideal quantum kernels, as shown in the lower right panel of Figure 3. These simulation results provide a strong indication that the superiority of quantum kernels may be preserved by designing advanced spectral transformation techniques. Please refer to Appendix F for the omitted details.

Conclusion
In this study, we investigate the generalization performance of quantum kernels in the NISQ setting. We theoretically exhibit that a large training dataset, a small number of measurement shots, and a large amount of quantum system noise can destroy the superiority of quantum kernels. In addition, we demonstrate that the generalization error bound in Theorem 1 is nearly saturated; tightening this upper bound is left for future work. To improve the performance of quantum kernels in the NISQ era, we further prove that effective spectral transformation techniques have the potential to maintain the advantage of quantum kernels. Besides the theoretical results, we empirically demonstrate that spectral transformation techniques improve the performance of noisy quantum kernels under both depolarization noise and noise extracted from real quantum hardware (IBMQ-Melbourne). The results achieved in this study fuel the exploration of quantum kernels assisted by other advanced calibration methods to accomplish practical tasks with advantages in the NISQ era.
The organization of the appendix is as follows. In Appendix A, we unify the notations used throughout the appendix. In Appendix B, we review the results of [24] established in the noiseless setting. In Appendix C and Appendix D, we present the proof of Theorem 1 and compare the generalization error bounds of ideal and noisy quantum kernels. In Appendix E, we prove Lemma 1, which shows that applying a spectral transformation to the noisy quantum kernel reduces the kernel estimation error and thus narrows the generalization bound. Finally, in Appendix F, we elaborate on the numerical simulation details.

A The summary of notation
We unify the notations throughout the whole paper. A matrix is denoted by a capital letter, e.g., W ∈ R^{n×n}. A vector is denoted by a lower-case bold letter, e.g., x ∈ R^d. We denote ⟨·, ·⟩ as the inner product in a Hilbert space. The notation [m] refers to the set {1, 2, ..., m}. A random variable X that follows the Bernoulli distribution is denoted as X ∼ Ber(p), i.e., Pr(X = 1) = p and Pr(X = 0) = 1 − p. We denote ‖·‖₂ as the ℓ₂-norm and ‖·‖_F as the Frobenius norm.

B The results of quantum kernels under the ideal setting
In this section, we recap the main results of the study [24] to facilitate readers' understanding. Specifically, we first present the generalization error bound of quantum kernels in the ideal setting. We then explain how to construct a specific dataset with quantum advantages in the measure of generalization.
Generalization error bound of ideal quantum kernels. The study [24] focuses on the following setting. Given n feature vectors {x^(i)}_{i=1}^n sampled from the domain X, the corresponding labels yield y^(i) = Tr(O U(θ*) ρ(x^(i)) U(θ*)†), where O, U(θ*), and ρ(x^(i)) = |ϕ(x^(i))⟩⟨ϕ(x^(i))| refer to the measurement operator, a specified quantum neural network, and the density operator of the encoded datum, respectively. In other words, all labels {y^(i)}_{i=1}^n can be well described by quantum circuits.
Define the set of hypothesis functions as {h(·) = ⟨ω, ϕ(·)⟩ | ω ∈ Ω}, where Ω is the parameter space and |ϕ(·)⟩ refers to the quantum feature map. The aim of quantum machine learning is to seek the optimal parameters ω* that minimize the empirical risk. This task is achieved by optimizing the loss function L(ω, x) = λ⟨ω, ω⟩ + Σ_{i=1}^n (⟨ω, ϕ(x^(i))⟩ − y^(i))², where λ ≥ 0 is a hyper-parameter. Namely, the optimal parameters yield ω* = arg min_{ω∈Ω} L(ω, x).
The explicit form of ω* in Eqn. (10) satisfies ω* = Σ_{i=1}^n [(W + λI)^{-1} Y]_i ϕ(x^(i)), where Y = [y^(1), ..., y^(n)]^⊤ refers to the vector of labels and W is the quantum kernel defined in Eqn.
(1). Note that we always assume that W + λI is non-singular, even in the limit λ → 0; this is a general assumption broadly employed in the study of quantum kernels.
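The closed-form solution reviewed above is ordinary kernel ridge regression; a minimal sketch, with a toy RBF kernel standing in for the quantum kernel W:

```python
import numpy as np

def fit_predict(W, Y, W_test, lam=1e-6):
    """Kernel ridge regression: alpha = (W + lam*I)^{-1} Y, and a point is
    scored as h(x) = sum_i alpha_i k(x, x^(i)) via test-train kernel rows."""
    alpha = np.linalg.solve(W + lam * np.eye(len(Y)), Y)
    return W_test @ alpha

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 2))
Y = np.sign(X[:, 0])                                    # toy +/-1 labels
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1)   # pairwise sq. distances
W = np.exp(-d2)                                         # toy PSD kernel matrix
preds = fit_predict(W, Y, W)                            # score training points
assert np.mean(np.sign(preds) == Y) > 0.9               # near-interpolation
```

With the quantum kernel, W would instead be assembled from the measured overlaps; the linear-algebra step is identical.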
The key conclusion in [24] is exhibiting that the generalization error bound of quantum kernels is quantified by ω * .
Theorem 2 (Theorem 1, [24]). Define the given training dataset as {x^(i), y^(i) = Tr(O U(θ*) ρ(x^(i)) U(θ*)†)}_{i=1}^n and the corresponding quantum kernel as W in Eqn. (1). With probability at least 1 − δ, quantum kernel methods can learn a hypothesis h(·) with generalization error E_x |h(x) − y| ≤ Õ(√(c_1/n)), where the randomness comes from sampling the training data from X, c_1 ≡ ‖ω*‖² = Y^⊤ W^{-1} Y, and the notation Õ hides logarithmic terms.
The construction of datasets with quantum advantages. Recall that Theorem 2 indicates that the generalization error bound depends on the kernel matrix W, the labels Y, and the size of the training dataset n. When the training examples {x^(i)}_{i=1}^n are fixed but the labels Y can be modified, the study [24] shows that the advantage of quantum kernels can be achieved by maximizing the geometric difference g(K‖W) = √(‖√W K^{-1} √W‖_∞), where W and K refer to the quantum and classical kernels, respectively. Ensured by Theorem 2, a large geometric difference indicates that quantum kernel methods can achieve a better generalization error than classical kernel models.
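Assuming the form g = √(‖√W K^{-1} √W‖_∞) with ‖·‖_∞ the spectral norm (the definition used by Huang et al.; the small ridge term is our own numerical safeguard), the geometric difference can be computed as:

```python
import numpy as np

def psd_sqrt(M):
    """Matrix square root of a PSD matrix via its eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T

def geometric_difference(W, K, reg=1e-9):
    """g = sqrt(|| sqrt(W) K^{-1} sqrt(W) ||_inf) for quantum kernel W and
    classical kernel K; 'reg' adds a small ridge so K is invertible."""
    rW = psd_sqrt(W)
    M = rW @ np.linalg.inv(K + reg * np.eye(len(K))) @ rW
    return float(np.sqrt(np.linalg.norm(M, 2)))    # spectral norm

n = 8
W = np.eye(n)          # toy quantum kernel
K = np.eye(n)          # identical classical kernel => no geometric advantage
assert np.isclose(geometric_difference(W, K, reg=0.0), 1.0)
```

When the two kernels coincide, g = 1; labels engineered to saturate a large g are the ones on which the quantum kernel enjoys the better bound.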

C Proof of Theorem 1
Here we present the proof of Theorem 1, especially the derivation of Eqn. (5). Note that we defer the proof that the generalization error bound of the noisy quantum kernel always contains an unavoidable n^{1/4} term to Appendix D.
Before elaborating on Theorem 1, we first simplify the depolarization noise model applied to the quantum kernels as described in the main text. In particular, for an L_Q-layer quantum circuit, the noise channels N_p applied separately at each circuit layer can be compressed into a single depolarization channel Ñ_p̃ acting at the last layer.

Lemma 2. Consider an L_Q-layer quantum circuit U(θ) = Π_{l=1}^{L_Q} U_l(θ) in which the depolarization channel N_p acts after each layer. There always exists a depolarization channel Ñ_p̃ with p̃ = 1 − (1 − p)^{L_Q} such that the noisy output state equals Ñ_p̃(U(θ) ρ U(θ)†), where ρ is the input quantum state.
The proof of the above lemma is deferred to Appendix C.1. Besides Lemma 2, the proof of Theorem 1 employs the following lemmas, where the corresponding proofs are deferred to Appendix C.2 and Appendix C.3, respectively.

Lemma 3. Define the given training set as {x^(i), y^(i) = Tr(O U(θ*) ρ(x^(i)) U(θ*)†)}_{i=1}^n and the noisy kernel as W̄ in Eqn. (4). With probability at least 1 − δ, the noisy quantum kernel can learn a hypothesis h(·) with generalization error E_{x,W̄} |h(x) − y| ≤ Õ(√(c̄/n)), where the randomness comes from sampling the training data and from the noise in the NISQ scenario, c̄ = |Y^⊤ W̄^{-1} Y|, and the notation Õ hides logarithmic terms.

Lemma 4. Suppose the system noise is modeled by the depolarization channel N_p in Eqn. (3). Define the ideal quantum kernel as W with entries W_ij = Tr(ρ(x^(i)) ρ(x^(j))) and the noisy quantum kernel as W̄ with entries W̄_ij = (1/m) Σ_{k=1}^m V_k, where V_k ∼ Ber(W̃_ij) and W̃_ij = Tr(N_p̃(ρ(x^(i)) ρ(x^(j)))). With probability at least 1 − δ/2, we have ‖W^{-1} − W̄^{-1}‖₂ ≤ c_2, where the randomness is taken over W̄, c_2 = max{0, n c_W² B / (1 − n c_W B)} with B := √(log(4n²/δ)/(2m)) + p̃(1 + 1/2^{N+1}), and c_W = ‖W^{-1}‖₂ is the spectral norm of W^{-1}.
We are now ready to exhibit the proof of Theorem 1.
Proof of Theorem 1. Following the result of Lemma 3, with probability at least 1 − δ/2, the generalization error of the noisy quantum kernel W̄ yields E_{x,W̄} |h(x) − y| ≤ Õ(√(c̄/n)), where the randomness is taken over the sampling of the training data and c̄ = |Y^⊤ W̄^{-1} Y|. The term c̄ is upper bounded as c̄ ≤ |Y^⊤ W^{-1} Y| + |Y^⊤ (W̄^{-1} − W^{-1}) Y| ≤ c_1 + n ‖W^{-1} − W̄^{-1}‖₂. Together with the fact that √(a + b) ≤ √a + √b for any a, b ≥ 0, the generalization error of the noisy quantum kernel W̄ is upper bounded by Õ(√(c_1/n) + √(‖W^{-1} − W̄^{-1}‖₂)). In the following, we bound the term ‖W^{-1} − W̄^{-1}‖₂. Specifically, by leveraging Lemma 4, with probability at least 1 − δ/2, we have ‖W^{-1} − W̄^{-1}‖₂ ≤ c_2, where the randomness is over W̄. Combining the two bounds, we achieve, with probability at least 1 − δ, the generalization error bound of the noisy quantum kernel W̄, i.e., E_{x,W̄} |h(x) − y| ≤ Õ(√(c_1/n) + √(c_2)), where the randomness is over W̄ and the sampling of the training data, c_1 = Y^⊤ W^{-1} Y, and c_2 = max{0, n c_W² B / (1 − n c_W B)} with B := √(log(4n²/δ)/(2m)) + p̃(1 + 1/2^{N+1}).

C.1 Proof of Lemma 2
Proof of Lemma 2. Denote ρ^(k) = (Π_{l=1}^k U_l(θ)) ρ (Π_{l=1}^k U_l(θ))†. Applying N_p to ρ^(1) yields N_p(ρ^(1)) = (1 − p) ρ^(1) + p I_D/D, where D refers to the dimension of the Hilbert space on which N_p acts. By induction, suppose that at the k-th step the generated state satisfies (1 − p)^k ρ^(k) + (1 − (1 − p)^k) I_D/D. Then applying U_{k+1}(θ) followed by N_p yields (1 − p)^{k+1} ρ^(k+1) + (1 − (1 − p)^{k+1}) I_D/D. According to the formula of the depolarization channel, an immediate observation is that the noisy quantum circuit is equivalent to applying a single depolarization channel Ñ_p̃ with p̃ = 1 − (1 − p)^{L_Q} at the last circuit layer, i.e., the output state equals (1 − p̃) ρ^{(L_Q)} + p̃ I_D/D.
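The compression argument of Lemma 2 can be checked numerically; the random unitaries below stand in for the circuit layers U_l(θ):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, L_Q = 2, 0.1, 4
dim = 2 ** N
Imix = np.eye(dim) / dim                     # maximally mixed state I/2^N

def depolarize(rho, q):
    return (1.0 - q) * rho + q * np.trace(rho).real * Imix

def rand_unitary():                          # stand-in for a layer U_l(theta)
    A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    Q, _ = np.linalg.qr(A)
    return Q

rho = np.zeros((dim, dim), dtype=complex)
rho[0, 0] = 1.0                              # |0...0><0...0|
layers = [rand_unitary() for _ in range(L_Q)]

noisy = rho                                  # layer-wise noise model
for U in layers:
    noisy = depolarize(U @ noisy @ U.conj().T, p)

clean = rho                                  # same circuit, noise compressed
for U in layers:
    clean = U @ clean @ U.conj().T
compressed = depolarize(clean, 1.0 - (1.0 - p) ** L_Q)

assert np.allclose(noisy, compressed)        # Lemma 2's claim, numerically
```

The agreement holds because the maximally mixed state is invariant under every unitary layer, exactly as the induction step exploits.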

C.2 Proof of Lemma 3
The proof of Lemma 3 mainly employs a basic theorem in statistics and learning theory as presented below.
Theorem 3 (Theorem 3.3, [36]). Let G be a family of functions mapping from a set X to [0, 1]. Then for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of n samples x^(1), ..., x^(n) from X, we have, for all g ∈ G, E[g(x)] ≤ (1/n) Σ_{i=1}^n g(x^(i)) + 2 E_σ[sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(x^(i))] + 3√(log(2/δ)/(2n)), where σ_1, ..., σ_n are independent uniform random variables over {±1}.
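For a finite function class, the empirical Rademacher complexity appearing in Theorem 3 can be estimated by Monte-Carlo sampling of the sign variables σ; this sketch is illustrative and not part of the original proof:

```python
import numpy as np

def empirical_rademacher(G_values, n_draws=2000, seed=0):
    """Monte-Carlo estimate of R_S(G) = E_sigma[sup_g (1/n) sum_i sigma_i
    g(x_i)] for a finite class, given G_values[k, i] = g_k(x_i)."""
    rng = np.random.default_rng(seed)
    _, n = G_values.shape
    sups = [np.max(G_values @ rng.choice([-1.0, 1.0], size=n)) / n
            for _ in range(n_draws)]
    return float(np.mean(sups))

rng = np.random.default_rng(1)
G = rng.uniform(0.0, 1.0, size=(10, 50))     # 10 functions on 50 samples
R = empirical_rademacher(G)
assert 0.0 <= R <= 1.0                       # g takes values in [0, 1]
```

In the kernel setting the supremum runs over the hypothesis ball rather than a finite list, and it is bounded analytically instead of sampled.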
A discussion of the above theorem in [24] gives a modified version: supported by Talagrand's lemma, the generalization error can be bounded with probability at least 1 − δ, where ⌈u⌉ = inf{v ∈ Z | u ≤ v} denotes the ceiling function and h_α(·) = Σ_{j=1}^n α_j κ(·, x^(j)) with κ(·, ·) being the indefinite kernel function related to W̄. According to the theory of indefinite kernel learning [37], κ can be uniquely decomposed into the difference of two positive definite kernels κ₊ and κ₋. The sum of κ₊ and κ₋ gives rise to a positive definite kernel κ* that possesses the smallest reproducing kernel Hilbert space corresponding to the Krein space of κ [37]. We are now ready to present the proof of Lemma 3.

C.3 Proof of Lemma 4
The proof of Lemma 4 leverages the following two lemmas, whose proofs are given in Subsections C.4 and C.5, respectively.

Lemma 5. Suppose the system noise is modeled by the depolarization channel N_p in Eqn. (3). The noisy quantum kernel W̄ and the ideal quantum kernel W satisfy, for each entry, Pr(|W̄_ij − W̃_ij| ≥ ε) ≤ 2 exp(−2mε²) and |W̃_ij − W_ij| ≤ p̃(1 + 1/2^{N+1}).

Lemma 6. Let ‖·‖ be a given matrix norm and suppose A, B ∈ R^{n×n} are nonsingular and satisfy ‖A − B‖ ≤ ε/(‖A^{-1}‖(ε + ‖A^{-1}‖)); then ‖A^{-1} − B^{-1}‖ ≤ ε.

We are now ready to present the proof of Lemma 4.
Proof of Lemma 4. By Lemma 6 with A = W and B = W̄, in order to achieve ‖W^{-1} − W̄^{-1}‖₂ ≤ δ' it suffices to guarantee ‖W − W̄‖₂ ≤ δ'/(c_W(δ' + c_W)), where c_W = ‖W^{-1}‖₂. In the following, we leverage concentration inequalities to quantify the probability that ‖W − W̄‖₂ exceeds this threshold. Specifically, the first step uses ‖W − W̄‖₂ ≤ ‖W − W̄‖_F, the second step expands the Frobenius norm entrywise, the third step applies the union bound over the n² entries, and the last step exploits the results of Lemma 5. Note that to use Lemma 5 we require the threshold to exceed the systematic bias induced by the noise, i.e., δ'/(c_W(δ' + c_W)) > n p̃(1 + 1/2^{N+1}). Setting δ'/(c_W(δ' + c_W)) = n(√(log(4n²/δ)/(2m)) + p̃(1 + 1/2^{N+1})) and solving for δ', we obtain that, with probability at least 1 − δ/2, ‖W^{-1} − W̄^{-1}‖₂ ≤ c_2. We remark that the condition δ'/(c_W(δ' + c_W)) > n p̃(1 + 1/2^{N+1}) is guaranteed for a sufficiently large number of measurements m. Finally, note that the physical meaning of δ' is a distance, which requires δ' > 0; the maximum operation in the definition of c_2 naturally secures this condition.

C.4 Proof of Lemma 5
Proof of Lemma 5. Supported by the Chernoff-Hoeffding bound [49], the discrepancy between W̄_ij and W̃_ij yields Pr(|W̄_ij − W̃_ij| ≥ ε) ≤ 2 exp(−2mε²), where m represents the number of measurements. Moreover, the distance between the ideal entry W_ij and the expectation value W̃_ij = Tr(N_p̃(ρ(x^(i)) ρ(x^(j)))) under the depolarization noise follows |W̃_ij − W_ij| ≤ p̃(1 + 1/2^{N+1}), where p̃ refers to the effective depolarization rate derived from Eqn. (3).
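The Chernoff-Hoeffding step can be verified empirically; the entry value, shot count, and tolerance below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 0.6                  # depolarized expectation of one kernel entry
m = 100                   # measurement shots per estimate
trials = 5000

# Empirical deviation of the m-shot average from its mean.
devs = np.abs(rng.binomial(m, mu, size=trials) / m - mu)

# Hoeffding: Pr(|average - mu| >= eps) <= 2 * exp(-2 * m * eps^2).
eps = 0.1
empirical = float(np.mean(devs >= eps))
bound = 2.0 * np.exp(-2.0 * m * eps ** 2)    # = 2 e^{-2} ~ 0.27 here
assert empirical <= bound
```

The observed tail probability sits well below the bound, as expected for Bernoulli shot averages.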

D The comparison of the generalization error bound between ideal and noisy quantum kernels
In this section, we demonstrate that the achieved upper bound for noisy quantum kernels in Theorem 1 is non-trivial, in the sense that the generalization error bound of ideal quantum kernels [24] cannot be used to analyze the generalization ability of noisy quantum kernels. In the following, we provide both theoretical and numerical evidence to support our claim.

D.1 Numerical evidence for the saturation
Here we conduct numerical simulations to exhibit how the generalization error (i.e., prediction accuracy) varies with respect to the size of the training dataset n under the ideal and noisy scenarios, respectively. The hyper-parameter settings are as follows. The size of the training dataset and the number of measurements are set as n ∈ {5, 50, 100, 200} and m ∈ {10, 100, 500, 1000, ∞}, respectively. The number of qubits used to establish quantum kernels is set as N = 2. When the noisy scenario is considered, the depolarization rate is set as p = 0.05. The simulation results of the ideal quantum kernel are shown in Figure 4, highlighted by the solid blue line (i.e., p = 0 and m = ∞). Specifically, the prediction accuracy for the ideal kernel keeps increasing from 79%, 94%, 95% to 96% as the size of the training dataset n increases from 5, 50, 100, to 200. The achieved prediction accuracy indicates that a larger training dataset ensures better generalization ability. These observations exactly echo the conclusion of Ref. [24].
Different from the ideal setting (i.e., p = 0 and m = ∞), the prediction accuracy achieved by noisy kernels, highlighted by the dotted lines, contrasts with Huang's upper bound Õ(c_1/n) in Eqn. (12). In particular, for the case of m ∈ {10, 100, 500} (resp. m = 1000), the prediction accuracy for the noisy kernel reaches its peak at n = 50 (resp. n = 100) and then begins to decline, while Huang's result claims that the maximum prediction accuracy should be attained at n = 200. The obtained empirical results accord with our results in Eqn. (5) but contradict Huang's result in Eqn. (12), in the sense that increasing n for the noisy kernel may decrease the prediction accuracy, or equivalently increase the generalization error. In other words, the achieved generalization error bound for noisy kernels is by no means trivially above Huang's upper bound.

D.2 Theoretical evidence for saturation
Envisioned by the aforementioned simulation results, we now theoretically show that the generalization error bound of noisy kernels must contain a term that is proportional to n, e.g., Ω(c/n + (1/c)·n/√m). This is equivalent to exploring whether the generalization error bound in Theorem 1 is saturated. Notably, according to Eqn. (5), the achieved upper bound is constituted by two terms, i.e., c_1/n and n/(c_2√m). The former quantifies the generalization error in the ideal setting, and the latter originates from the kernel approximation error between W̄ and W. In this regard, we separately analyze the saturation of the achieved generalization error bound in the ideal and NISQ settings in the following.

The saturation in the ideal setting. The term c_1/n refers to the generalization error upper bound in the ideal setting, i.e., the quantum system is noiseless and the number of measurements is infinite (m → ∞). As shown in [24] (see Section D 2, Page 17), the generalization error bound of quantum kernels is obtained by deriving the upper bound of the empirical Rademacher complexity R_D(H), where H = {h ∈ H' : ‖h‖_{H'} ≤ Λ} is the hypothesis set, H' is the reproducing kernel Hilbert space (RKHS) of the quantum kernel function of W, h is a quantum kernel based hypothesis, and Λ refers to a bounded constant. Notably, the term c_1/n refers to the upper bound of R_D(H), i.e.,

The study [36] has proven that the term c_1/n is saturated, supported by the following theorem.
The results in Theorem 4 indicate that the upper and lower bounds of the empirical Rademacher complexity R_D(H) follow the same scaling behavior, i.e., c_1/n. Therefore, we conclude that the first term c_1/n in Theorem 1 is saturated.
The saturation in the noisy setting. We then analyze the saturation of the achieved generalization error bound in the NISQ scenario; this part also serves as the proof of Theorem 1. Notably, compared with the ideal case, there is an additional term n/(c_2√m) introduced by the quantum system noise and the sample error. In other words, understanding the saturation of the achieved generalization error bound for noisy quantum kernels is equivalent to quantifying the saturation of n/(c_2√m). Recall that this term originates from the kernel approximation error between W̄^{-1} and W^{-1}, i.e.,

Under the above observation, we now examine the saturation of the kernel approximation error between W̄^{-1} and W^{-1}, especially its dependence on √n. For ease of understanding, we only consider the sampling error (i.e., a finite number of measurements m) and set the depolarization rate to p = 0. In other words, the element of the noisy quantum kernel satisfies

In this scenario, with probability δ' > 0, there exists a noisy kernel satisfying

where ε > 0 refers to the random error. An immediate observation is that the spectral norm of the difference between W̄^{-1} and W^{-1} satisfies

where the first inequality employs ‖·‖_2 ≥ ‖·‖_F/√n. In conjunction with Eqn. (51) and Eqn. (52), we obtain

The achieved lower and upper bounds in the above equation provide two implications. First, the generalization error bound of noisy kernels must contain a term that is proportional to n. Second, the second term in Theorem 1 is nearly saturated, where the lower bound and our upper bound are separated by a scaling factor n^{1/4}. To summarize, the generalization error bound in Theorem 1 is exactly saturated in the ideal case and nearly saturated in the NISQ setting, where the lower and upper bounds are separated by a factor of n^{1/4}.
In other words, the bound in Eqn. (12) fails to explain the generalization ability of noisy quantum kernels.
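The step that pulls the √n factor out of the Frobenius norm above relies on the elementary inequality ‖A‖_F/√n ≤ ‖A‖_2 ≤ ‖A‖_F for A ∈ ℝ^{n×n}. A quick numpy check of both directions (the random matrix and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric, like a kernel perturbation

spec = np.linalg.norm(A, 2)            # spectral norm ||A||_2 (largest singular value)
frob = np.linalg.norm(A, 'fro')        # Frobenius norm ||A||_F

# ||A||_F / sqrt(n) <= ||A||_2 <= ||A||_F: the left inequality is what
# converts an entry-wise (Frobenius) bound into a spectral-norm lower bound.
lower = frob / np.sqrt(n)
```

The looser the matrix's spectrum (many comparable eigenvalues), the closer ‖A‖_2 sits to the ‖A‖_F/√n end, which is exactly the regime exploited in the lower-bound argument.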

E Proof of Lemma 1
In this section, we introduce three spectral transformation techniques, i.e., the clipping, flipping, and shifting methods, which are utilized to enhance the generalization performance of noisy quantum kernels. Moreover, we provide theoretical evidence that these transformation methods enable better generalization performance.

E.1 Spectrum clipping method
The construction rule of the noisy quantum kernel implies that W̄ in Eqn. (4) is symmetric, and thus it has an eigenvalue decomposition W̄ = U Λ̄ U^T, where U is an orthogonal matrix and Λ̄ is a diagonal matrix of real eigenvalues with Λ̄ = diag(λ̄_1, ..., λ̄_n). Without loss of generality, we assume that λ̄_1 ≥ ... ≥ λ̄_r ≥ 0 ≥ λ̄_{r+1} ≥ ... ≥ λ̄_n. The mechanism of the spectrum clipping method is clipping all negative eigenvalues of W̄ to zero. Intuitively, the negative eigenvalues of W̄ are regarded as the result of noise disturbance, and setting them to zero is treated as a denoising step [51]. Define the clipped eigenvalues as

Λ̄_c = diag(λ̄_1, ..., λ̄_r, 0, ..., 0). (55)

The calibrated quantum kernel yields W̄_c = U Λ̄_c U^T. The following lemma exhibits that the discrepancy between the calibrated quantum kernel and the ideal quantum kernel is lower than that between the original noisy quantum kernel and the ideal quantum kernel.
Proof of Lemma 7. Supported by the definition of the Frobenius norm, an equivalent way of achieving Eqn. (57) is

After simplification, Eqn. (58) can be rewritten as

To achieve Eqn. (59), we only need to prove Tr(W̄_c^2) ≤ Tr(W̄^2) and Tr(W W̄_c) ≥ Tr(W W̄) separately. Following the notation in the main text, the linear property of the trace operation allows us to achieve Tr(W W̄_c) ≥ Tr(W W̄), i.e.,

where the last equality uses the fact that λ̄_i ≥ 0 for all i ∈ [r] and λ̄_j ≤ 0 for all j ∈ [n]\[r].
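The clipping rule of Eqn. (55) and the distance comparison of Lemma 7 can be checked numerically. The sketch below uses a synthetic PSD stand-in W (an RBF Gram matrix) for the ideal quantum kernel and a symmetric perturbation for W̄; since clipping is exactly the Frobenius projection onto the PSD cone, and projections onto convex sets are non-expansive toward points in the set, the final inequality holds for any PSD W:

```python
import numpy as np

def spectrum_clip(W_bar):
    """Clip all negative eigenvalues of the symmetric matrix W_bar to zero."""
    vals, vecs = np.linalg.eigh(W_bar)
    return (vecs * np.clip(vals, 0, None)) @ vecs.T

rng = np.random.default_rng(2)
n = 20
# Synthetic "ideal" PSD kernel W (Gaussian Gram matrix) and a noisy version W_bar.
X = rng.standard_normal((n, 5))
W = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
noise = rng.standard_normal((n, n)) * 0.2
W_bar = W + (noise + noise.T) / 2      # symmetric, generally indefinite

W_c = spectrum_clip(W_bar)
# Lemma 7: clipping never increases the Frobenius distance to the ideal kernel.
dist_clipped = np.linalg.norm(W_c - W, 'fro')
dist_noisy = np.linalg.norm(W_bar - W, 'fro')
```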

E.2 Spectrum flipping method
The spectrum flipping method was proposed by [38]. Different from the clipping method [51], which interprets the negative eigenvalues as caused by noise, Refs. [30] and [31] showed that the negative eigenvalues in W̄ may encode useful information about data features or categories. Following the notation in Subsection E.1, the spectrum flipping method flips the sign of the negative eigenvalues of W̄ to obtain the calibrated quantum kernel W̄_f = U Λ̄_f U^T, where Λ̄_f = diag(λ̄_1, ..., λ̄_r, −λ̄_{r+1}, ..., −λ̄_n).
Moreover, the relation between Tr(W W̄) and Tr(W W̄_f) follows
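A minimal sketch of the flipping rule (the example matrix is hypothetical). Note that flipping preserves the singular-value spectrum of W̄, which is one way to see that it retains the magnitude information carried by the negative eigenvalues:

```python
import numpy as np

def spectrum_flip(W_bar):
    """Replace each negative eigenvalue of symmetric W_bar by its absolute value."""
    vals, vecs = np.linalg.eigh(W_bar)
    return (vecs * np.abs(vals)) @ vecs.T

W_bar = np.array([[1.0, 0.9], [0.9, 0.5]])   # indefinite (det < 0)
W_f = spectrum_flip(W_bar)
# W_f is PSD, and its singular values coincide with those of W_bar,
# since for a symmetric matrix the singular values are |eigenvalues|.
```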

E.3 Spectrum shifting method
The mechanism of the spectrum shifting method is shifting the spectrum of the noisy quantum kernel W̄ by the absolute value of its minimum eigenvalue |λ̄_n|. Mathematically, the calibrated quantum kernel yields W̄_s = W̄ + |λ̄_n| I_n. Note that the spectrum shifting method ensures that any indefinite matrix W̄ can be calibrated to be PSD. Compared with the clipping and flipping methods, the spectrum shifting method only enhances all self-similarities by the amount |λ̄_n| and does not change the relative similarity between any two different samples.
where the second equality employs Tr(W̄) = n.
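The shifting rule admits an equally short sketch (example matrix hypothetical). The check at the end verifies the property stated above: only the self-similarities on the diagonal change:

```python
import numpy as np

def spectrum_shift(W_bar):
    """Shift the spectrum of symmetric W_bar by |lambda_min| so the result is PSD."""
    lam_min = np.linalg.eigvalsh(W_bar)[0]       # eigvalsh returns ascending order
    shift = max(0.0, -lam_min)                   # zero if W_bar is already PSD
    return W_bar + shift * np.eye(W_bar.shape[0])

W_bar = np.array([[1.0, 0.9], [0.9, 0.5]])       # indefinite noisy kernel
W_s = spectrum_shift(W_bar)
# Off-diagonal entries (relative similarities between samples) are untouched;
# only the diagonal self-similarities grow by |lambda_min|.
```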

F More details about numerical simulations
In this section, we append more implementation details and simulation results that were omitted in the main text. Specifically, we first introduce the construction rule of the employed dataset in Subsection F.1. Next, we explain the detailed implementation of the quantum kernel to achieve quantum advantages in Subsection F.2. Last, in Subsection F.4, we conduct comprehensive simulations to demonstrate how the quantum system noise, the number of measurements, and the dataset size influence the performance of quantum kernels, and how spectral transformation techniques improve the performance of quantum kernels under NISQ settings.

Figure 6: Performance of RBF kernels with varied λ and γ. The heat-map presents the prediction accuracy (higher is better) of RBF kernels with varied λ and γ given in Eqn. (75) and Eqn. (74). The feature dimension is d = 8 and the sample size is n = 100.

Figure 6 depicts the simulation results of RBF kernels with varied γ and λ. Specifically, compared with the unregularized RBF kernels with fixed γ, RBF kernels with tuned γ and λ achieve a better performance. For example, the prediction accuracy of RBF kernels is improved by 10% (i.e., from 43% to 53%) when the size of the training data is n = 100 and the feature dimension is d = 8. These results reflect that the accuracy of the RBF kernel can be improved by properly tuning γ and λ.
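The regularized RBF pipeline behind Figure 6 can be sketched as kernel ridge regression, where gamma plays the role of the kernel width in Eqn. (74) and lam the regularization strength in Eqn. (75). The data, labels, and function names below are illustrative stand-ins, not the paper's actual setup:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF Gram matrix K_ij = exp(-gamma * ||x_i - y_j||^2)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def fit_krr(X, y, gamma, lam):
    """Kernel ridge regression: solve (K + lam * I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_test, gamma):
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# Tiny usage example with synthetic data (d = 8 features, n = 100, as in Figure 6).
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 8))
y = np.sign(X[:, 0])                           # toy binary labels
alpha = fit_krr(X, y, gamma=0.1, lam=1e-2)
acc = np.mean(np.sign(predict(X, alpha, X, gamma=0.1)) == y)
```

Sweeping gamma and lam over a grid and recording acc on held-out data reproduces the kind of heat-map shown in Figure 6.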

F.4 Numerical simulation results
In the main text, we present the core results to indicate how the performance of quantum kernels is influenced by the imperfection of NISQ machines and how spectral transformation techniques can address this issue. For completeness, here we illustrate more simulation results to support our theoretical claims.
Performance of noisy quantum kernels. We first benchmark the performance of quantum kernels under depolarization noise. The hyper-parameter settings are as follows. The number of qubits is set as N ∈ {2, 8}, the size of the training dataset is set as n ∈ {5, 50, 100}, and the depolarization rate is set as p = 0.001. The simulation results are demonstrated in the left and middle panels of Figure 7, where the collected simulation results corroborate Theorem 1. Statistically, the generalization performance degrades as the data size n increases and the number of measurements m decreases. We note that when N = 8 and n ∈ {5, 50}, an increased number of measurements may not improve the performance of noisy quantum kernels. We suspect that this phenomenon is caused by the spectrum transformation losing some important information and by the randomness of measurements.
We next benchmark the performance of quantum kernels based on the noise model extracted from real quantum hardware, i.e., IBMQ-Melbourne. The number of qubits is set as N = 8 and the size of the training dataset is set as n ∈ {5, 50, 100}. The simulation results are demonstrated in the right panel of Figure 7, which also accords with Theorem 1; namely, the generalization error becomes worse when the size of the training set is enlarged. We last compare the performance of noisy quantum kernels against classical kernels, i.e., RBF kernels, under different measurement shots and depolarization rates. The achieved results are exhibited in Figure 8. In particular, when the size of the training set is restricted to n = 100, the noisy quantum kernel outperforms the classical RBF kernel when the depolarization rate satisfies p < 0.05, for both N = 2 and N = 8. These results provide strong evidence for using quantum kernels to attain quantum advantages in the NISQ era.
Performance of noisy quantum kernels with spectral transformation methods. Before elaborating on the appended simulation details, we first address the nearest projection technique employed in the main text. This technique facilitates optimization, since indefinite kernels lead to a non-convex optimization problem. An intuitive approach is to compute a nearest positive definite matrix in terms of the Frobenius norm by increasing the eigenvalues less than δ to δ, with the threshold δ > 0.

Figure 9: The comparison of quantum kernels under the IBMQ-Melbourne noisy settings with different spectral transformation methods. The simulation results of noisy quantum kernels calibrated by the nearest projection (labeled 'Original') and by the spectral transformation methods, i.e., the clipping, flipping, and shifting methods, are shown in order from left to right. The meaning of the different labels is the same as explained in Figure 2.

We now explore how the spectral transformation methods introduced in the main text improve the performance of quantum kernels based on the noise model extracted from real hardware, i.e., IBMQ-Melbourne. The hyper-parameter settings are as follows. The qubit count is set as N = 8, the number of measurements is set as m ∈ {10, 100, 500, 1000}, and the size of the training set is n ∈ {5, 50, 100}. The simulation results are shown in Figure 9. For all settings, the shifting method dramatically improves the performance of the noisy quantum kernels. These results provide strong evidence for exploring advanced spectral transformation techniques to enhance the capabilities of quantum kernels in the NISQ era. To further support our theoretical claims, we conduct extensive numerical simulations by adding more measurement-shot settings to the comparison of noisy quantum kernels with different calibration methods. Specifically, there are eight settings for the allowable number of measurements m, i.e., {10, 100, 300, 500, 800, 1000, 1500, 2000}, while all other hyper-parameter settings are identical to those introduced in the main text.
The simulation results are shown in Figure 10. Under such a fine-grained setting, the three calibration methods still improve the prediction accuracy of the noisy kernel compared with the original method. Moreover, the shifting method attains the best performance compared with the clipping and flipping methods. These results echo our theoretical results in the sense that suppressing the kernel approximation error can improve the generalization performance.
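The nearest projection technique described above (raising all eigenvalues below the threshold δ to δ) can be sketched as follows; the function name and example matrix are hypothetical:

```python
import numpy as np

def nearest_projection(W_bar, threshold=1e-6):
    """Raise every eigenvalue of symmetric W_bar below `threshold` to `threshold`.
    This is the Frobenius-norm projection onto the set of symmetric matrices
    whose spectrum lies above the threshold, hence positive definite."""
    vals, vecs = np.linalg.eigh(W_bar)
    return (vecs * np.maximum(vals, threshold)) @ vecs.T

W_bar = np.array([[1.0, 0.9], [0.9, 0.5]])   # indefinite noisy kernel
W_pd = nearest_projection(W_bar, threshold=1e-6)
# W_pd is positive definite, so kernel-based optimization becomes convex again.
```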