Generalization despite overfitting in quantum machine learning models

The widespread success of deep neural networks has revealed a surprise in classical machine learning: very complex models often generalize well while simultaneously overfitting training data. This phenomenon of benign overfitting has been studied for a variety of classical models with the goal of better understanding the mechanisms behind deep learning. Characterizing the phenomenon in the context of quantum machine learning might similarly improve our understanding of the relationship between overfitting, overparameterization, and generalization. In this work, we provide a characterization of benign overfitting in quantum models. To do this, we derive the behavior of a classical interpolating Fourier features model for regression on noisy signals, and show how a class of quantum models exhibits analogous features, thereby linking the structure of quantum circuits (such as data-encoding and state-preparation operations) to overparameterization and overfitting in quantum models. We intuitively explain these features in terms of the ability of the quantum model to interpolate noisy data with locally "spiky" behavior, and provide a concrete demonstration of benign overfitting.


Introduction
A long-standing paradigm in machine learning is the trade-off between the complexity of a model family and the model's ability to generalize: more expressive model classes contain better candidates to fit complex trends in data, but are also prone to overfitting noise [1, 2]. Interpolation, defined for our purposes as choosing a model with zero training error, was hence long considered bad practice [3]. The success of deep learning (machine learning in a specific regime of extremely complex model families with vast numbers of tunable parameters) seems to contradict this notion; here, consistent evidence shows that among some interpolating models, more complexity tends not to harm the generalisation performance, a phenomenon described as "benign overfitting" [4].
In recent years, a surge of theoretical studies has reproduced benign overfitting in simplified settings with the hope of isolating the essential ingredients of the phenomenon [4, 5]. For example, Ref. [6] showed how interpolating linear models in a high-complexity regime (more dimensions than datapoints) could generalize just as well as their lower-complexity counterparts on new data, and analyzed the properties of the data that lead to the "absorption" of noise by the interpolating model without harming the model's predictions. Ref. [7] showed that there are model classes of simple functions that change quickly in the vicinity of the noisy training data, but recover a smooth trend elsewhere in data space (see Figure 1). Such functions have also been used to train nearest-neighbor models that perfectly overfit training data while generalizing well, thereby directly linking "spiking models" to benign overfitting [8]. Recent works try to recover the basic mechanism of such spiking models using the language of Fourier analysis [9, 10, 11].
In parallel to these exciting developments in the theory of deep learning, quantum computing researchers have proposed families of parametrised quantum algorithms as model classes for machine learning (e.g. Ref. [12]). These quantum models can be optimised similarly to neural networks [13, 14] and have interesting parallels to kernel methods [15, 16] and generative models [17, 18]. Although researchers have taken some first steps to study the expressivity [19, 20, 21, 22], trainability [23, 24] and generalisation [25, 26, 27, 28] of quantum models, we still know relatively little about their behaviour. In particular, the interplay of overparametrisation, interpolation, and generalisation that seems so important for deep learning is as yet largely unexplored.
In this paper we develop a simplified framework in which questions of overfitting in quantum machine learning can be investigated. Essentially, we exploit the observation that quantum models can often be described in terms of Fourier series, where well-defined components of the quantum circuit influence the selection of Fourier modes and their respective Fourier coefficients [29, 30, 31]. We link this description to the analysis of spiking models and benign overfitting by building on prior works analyzing these phenomena using Fourier methods. In this approach, the complexity of a model is related to the number of Fourier modes in its Fourier series representation, and overparametrised model classes have more modes than needed to interpolate the training data (i.e., to have zero training error). In the generalization error we derive for such model classes, these "superfluous" modes lead to spiking models, which have large oscillations around the training data while keeping a smooth trend everywhere else. However, large numbers of modes can also harm the recovery of an underlying signal, and we therefore balance this trade-off to produce an explicit example of benign overfitting in a quantum machine learning model.
The mathematical link described above allows us to probe the impact of important design choices for a simplified class of quantum models on this trade-off. For example, we find that a measure of redundancy in the spectrum of the Hamiltonian that defines standard data-encoding strategies strongly influences this balance; in fact, to an extent that is difficult to counterbalance by other design choices of the circuit.
The remainder of the paper proceeds as follows. We will first review the classical Fourier framework for the study of interpolating models and develop explicit formulae for the error in these models to produce a basic example of benign overfitting (Sec. 2). We will then construct a quantum model with analogous components to the classical model, and demonstrate how each of these components is related to the structure of the corresponding quantum circuit and measurement (Sec. 3). We then analyze specific cases that give rise to "spikiness" and benign overfitting in these quantum models (Sec. 4).

Interpolating models in the Fourier framework
In this section we provide the essential tools to probe the phenomenon of overparametrized models that exhibit the spiking behaviour of Figure 1 using the language of Fourier series. We will review and formalize the problem setting and several examples from Refs. [9, 10, 11], before extending their framework by incorporating standard results from linear regression to derive closed-form error behavior and examples of benign overfitting.

Setting up the learning problem
We are interested in functions g defined on a finite interval that may be written as a linear combination of Fourier basis functions or modes e^{i2πkx} (each describing a complex sinusoid with integer-valued frequency k) weighted by their corresponding Fourier coefficients ĝ_k:

g(x) = Σ_{k∈Z} ĝ_k e^{i2πkx}.    (1)

We restrict our attention to well-behaved functions that are sufficiently smooth and continuous to be expressed in this form. We will now set up a simple learning problem whose basic components, the model and the target function to be learned, can be expressed as Fourier series with only a few non-zero coefficients, and define concepts such as overparametrization and interpolation.
Consider a machine learning problem in which data is generated by a target function of the form

g(x) = Σ_{k∈Ω_{n0}} ĝ_k e^{i2πkx},    (2)

which only contains frequencies in the discrete, integer-valued spectrum

Ω_{n0} = {−(n0 − 1)/2, ..., −1, 0, 1, ..., (n0 − 1)/2}    (3)

for some odd integer n0. We call functions of this form band-limited with bandwidth n0/2 (that is, |k| < n0/2 for all frequencies k ∈ Ω_{n0}). The bandwidth limits the complexity of a function, and will therefore be important for exploring scenarios where a complex model is used to overfit a less complex target function. We are provided with n noisy samples y_j = g(x_j) + ϵ_j of the target function g evaluated at points x_j = j/n spaced uniformly on the interval [0, 1] (we assume input data has been rescaled to [0, 1] without loss of generality), where n > n0 and we require n to be odd for simplicity.
The model class we consider likewise contains band-limited Fourier series,

f(x) = Σ_{k∈Ω_d} α_k √ν_k e^{i2πkx},    (4)

and since we are interested in interpolating models, we always assume that they have enough modes to interpolate the noisy training data, namely d ≥ n. Similarly to Eq. (3) we define the spectrum Ω_d = {−(d − 1)/2, ..., (d − 1)/2} for odd d. Following Ref. [9], this model class has two components: the set of weighted Fourier basis functions √ν_k e^{i2πkx} describes a feature map applied to x for some set of feature weights ν_k ∈ R_+, while the trainable weights α_k ∈ C are optimized to ensure that f interpolates the data. The theory of trigonometric polynomial interpolation [32] ensures that f can always interpolate the training data for some choice of trainable weights α_k under these conditions. In the following, we will therefore usually consider the α_k as being determined by the data and the interpolation condition, while the ν_k serve as our "tuning knobs" to create different settings in which to study spiking properties and benign overfitting. We call the model class described in Eq. (4) overparameterized when the degree of the Fourier series is much larger than the degree of the target function, d ≫ n0, in which case the model has many more frequencies available for fitting than are present in the underlying signal.
Note that one can rewrite the Fourier series of Eq. (4) in the linear form

f(x) = ⟨α, ϕ(x)⟩,    (5)

where

ϕ(x)_k = √ν_k e^{i2πkx}, k ∈ Ω_d.    (6)

From this perspective, optimizing f amounts to learning trainable weights α by performing regression on observations ϕ(x) sampled from a random Fourier features model [33], for which ν_k and ϕ(x)_k are precisely the eigenvalues and eigenfunctions of a shift-invariant kernel [34].
To complete the problem setup, we have to impose one more constraint. Consider that exp(i2πkx_j) with frequency k < n/2 and uniformly spaced points x_j is equal to exp(i2πk′x_j) for any choice of alias frequency k′ = k mod n. The presence of these aliases means that the model class described in Eq. (6) contains many interpolating solutions in the overparameterized regime. Motivated by prior work exploring benign overfitting for linear features [6], Fourier features [9, 35], and other nonlinear features [36, 37], we will study the minimum-ℓ2-norm interpolating solution,

α_opt = argmin_α ∥α∥_2 subject to ⟨α, ϕ(x_j)⟩ = y_j for all j = 0, ..., n − 1.    (9)

Minimizing the ℓ2 norm is a typical choice for imposing a penalty on complex functions (regularization) in the underparameterized regime, though we will see that this intuition does not carry over to the overparameterized regime. The remainder of this section explores how this learning problem results in a trade-off in interpolating Fourier models: overparameterization introduces alias frequencies that increase the error in fitting simple target functions, but can also reduce error by absorbing noise into high-frequency modes with spiking behavior.
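As an illustrative numerical sketch (our own construction, not part of the original analysis; all function names and parameter values below are arbitrary choices), the minimum-ℓ2-norm interpolating solution of this problem can be computed with the Moore-Penrose pseudoinverse of the feature matrix:

```python
import numpy as np

def min_norm_fourier_fit(y, freqs, nu):
    """Minimum-l2-norm alpha with y_j = sum_k alpha_k sqrt(nu_k) e^{i 2 pi k x_j}."""
    n = len(y)
    x = np.arange(n) / n                         # uniformly spaced samples on [0, 1)
    Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs))  # n x d feature matrix
    return np.linalg.pinv(Phi) @ y               # pinv yields the min-norm interpolant

def model(alpha, freqs, nu, x):
    """Evaluate f(x) = <alpha, phi(x)> on an array of points x."""
    return np.exp(2j * np.pi * np.outer(x, freqs)) @ (np.sqrt(nu) * alpha)

# sanity check: the fitted model interpolates noisy data exactly
rng = np.random.default_rng(0)
n, d = 11, 33                                    # n odd samples, d > n modes
freqs = np.arange(d) - d // 2                    # Omega_d = {-16, ..., 16}
nu = np.ones(d) / d                              # uniform feature weights
y = rng.normal(size=n)
alpha = min_norm_fourier_fit(y, freqs, nu)
print(np.max(np.abs(model(alpha, freqs, nu, np.arange(n) / n) - y)))  # ~1e-13
```

For an underdetermined but consistent linear system, the pseudoinverse returns exactly the minimum-norm solution, which is why no explicit regularizer appears in the fit.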

Two extreme cases to understand generalization
To better understand the trade-off that overparametrization (in our case, a much larger number of Fourier modes than needed to interpolate the data) introduces between fitting noise and generalization error, we revisit two extreme cases explored in Ref. [9], involving a pure-noise signal and a noiseless signal.

Case 1: Noise only
The first case demonstrates how alias modes can help to fit noise without disturbing the (here trivial) signal. We set g = 0 and consider n observations y_j = ϵ_j of zero-mean noise with known variance E[ϵ²] = σ². After making n uniformly spaced observations, we compute the discrete Fourier transform of the observations as the sequence of values ε̂_k satisfying

ε̂_k = (1/n) Σ_{j=0}^{n−1} ϵ_j e^{−i2πkj/n},    (10)

which characterizes the frequency content of the noisy signal that can be captured and learned from using only n evenly spaced samples. Suppose that the degree of the model (controlling the model complexity) is given by d = n(m + 1) for some even integer m, and that ν_k = 1 for every mode, so that there are exactly m equally weighted aliases for each frequency in the spectrum of the Fourier series for g. Then the optimal (i.e., the minimum-ℓ2-norm, interpolating) trainable weight vector α_opt has entries

α^{opt}_{k+nℓ} = ε̂_k / (m + 1)    (11)

for ℓ = −m/2, ..., m/2, with all other entries being zero (see Appendix A.2). From Eq. (11), the minimum-∥α∥_2 solution distributes the noise Fourier coefficients ε̂_k evenly into the many alias frequencies k + nℓ, while enforcing that the sum of trainable weights α_{k+nℓ} over all of these aliases is ε̂_k to guarantee interpolation. As shown in Figure 2, the higher-frequency aliases suppress the optimal model f_opt(x) = ⟨α_opt, ϕ(x)⟩ to near zero at every point away from the interpolated points, resulting in a test error of O(σ²/m) that decreases monotonically with the complexity of the model. As m increases, the optimal model f_opt remains close to the true signal y = 0 while becoming "spiky" near the noisy samples. By conforming to the true signal everywhere except in the vicinity of noise, this behavior embodies the mechanism by which overparameterized models can absorb noise into high-frequency modes. In this case the generalization error, measuring how close the model is to the target function on average, decreases with increasing complexity of the model class.
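This noise-absorption mechanism can be reproduced numerically. The following sketch (our own, with arbitrary parameter choices and the 1/n DFT convention) checks that the pseudoinverse solution splits each DFT coefficient evenly across its aliases and that the resulting test error is of order σ²/(m + 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 11, 8, 1.0                 # n odd, m even, so d = n(m+1) is odd
d = n * (m + 1)
x = np.arange(n) / n
y = rng.normal(scale=sigma, size=n)      # pure-noise observations, g = 0
freqs = np.arange(d) - d // 2            # Omega_d, with nu_k = 1 for all k
Phi = np.exp(2j * np.pi * np.outer(x, freqs))
alpha = np.linalg.pinv(Phi) @ y          # minimum-norm interpolant

# each alias k + n*l of a base frequency k receives eps_hat_k / (m + 1)
eps_hat = np.fft.fft(y) / n              # DFT with the 1/n convention used here
k = 3
aliases = np.where((freqs - k) % n == 0)[0]
print(np.allclose(alpha[aliases], eps_hat[k] / (m + 1)))    # True

# the model is near zero away from the samples: test error ~ sigma^2 / (m + 1)
xt = np.linspace(0, 1, 2000, endpoint=False)
f = np.exp(2j * np.pi * np.outer(xt, freqs)) @ alpha
print(np.mean(np.abs(f) ** 2))           # roughly sigma^2 / (m + 1)
```

Increasing m while keeping n fixed shrinks the second printed value, matching the "spiky but small" behavior described above.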

Case 2: Signal only
While the above case shows how overparametrization can help to absorb noise and reduce error without harming the signal, the second case illustrates how alias frequencies in the overparameterized model can harm the model's ability to learn the target function: the signal g "bleeds" into higher frequencies, resulting in higher error in the overparameterized model [9, 11].
To demonstrate this, we now consider a noiseless, single-mode signal g(x) := ĝ_p e^{i2πpx} of frequency p ≤ n0/2. The data is hence of the form

y_j = ĝ_p e^{i2πpj/n}.    (12)

Once again we choose d = n(m + 1), and for simplicity we assume an unweighted model, ν_k = 1 for k ∈ Ω_d. By orthonormality of the Fourier basis functions, the interpolation condition requires that only the modes of f with frequencies that alias p (i.e., of the form p + nℓ) are retained. The interpolation constraint can then be rewritten as

Σ_{ℓ=−m/2}^{m/2} α_{p+nℓ} = ĝ_p.    (13)

The choice of trainable weights α_k that satisfies Eq. (13) while minimizing the ℓ2 norm is

α_k = ĝ_p / (m + 1)    (14)

for k = p + nℓ, and α_k = 0 otherwise (see Appendix A.2). Eq. (13) distributes the Fourier coefficient ĝ_p among the trainable weights α_{p+nℓ} corresponding to frequencies p + nℓ. Therefore, minimizing ∥α∥_2 in this case "bleeds" the target function into higher-frequency aliases and results in the opposite effect compared to fitting a noisy signal (see Fig. 2b): the generalization error of the overparameterized model now increases with the number of aliases m, and the complexity of the model harms its ability to fit a noiseless target function.
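This bleeding effect is easy to verify numerically. The sketch below (our own, with arbitrary choices of n and p) measures the generalization error of the minimum-norm interpolant of a single noiseless mode and recovers the growth implied by splitting ĝ_p equally across its m + 1 aliases, namely |ĝ_p|² m/(m + 1):

```python
import numpy as np

def single_mode_test_error(n, m, p=2, ghat=1.0):
    """Error of the min-norm interpolant of a noiseless mode ghat * e^{i 2 pi p x}."""
    d = n * (m + 1)
    x = np.arange(n) / n
    y = ghat * np.exp(2j * np.pi * p * x)        # noiseless samples of g
    freqs = np.arange(d) - d // 2
    Phi = np.exp(2j * np.pi * np.outer(x, freqs))
    alpha = np.linalg.pinv(Phi) @ y              # spreads ghat over aliases p + n*l
    xt = np.linspace(0, 1, 4000, endpoint=False)
    f = np.exp(2j * np.pi * np.outer(xt, freqs)) @ alpha
    g = ghat * np.exp(2j * np.pi * p * xt)
    return np.mean(np.abs(f - g) ** 2)

# generalization error grows with the number of aliases as m/(m + 1)
for m in [0, 2, 8]:
    print(m, single_mode_test_error(11, m))      # ~0, then 2/3, then 8/9
```

At m = 0 the model spectrum equals Ω_n and the mode is recovered exactly; every additional pair of aliases moves the error closer to |ĝ_p|².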
In order to recover a trade-off in generalization error for more general cases, we will need to consider more interesting distributions of feature weights ν k (instead of ν k = 1) that provide finer control over fitting the target function with low-frequency modes while spiking in the vicinity of noise with high-frequency aliases.

Generalization trade-offs and benign overfitting
The opposing effects of higher-frequency modes in overparameterized models in the cases discussed above hint at a trade-off in model performance that depends on the underlying signal and the feature weights of the Fourier feature map. Returning to the more general case of input samples y_j = g(x_j) + ϵ_j, in Appendix A we show that the task of fitting uniformly spaced samples using weighted Fourier features may be transformed into a linear regression problem, thereby generalizing the results of [9] to derive the following general solution to the minimum-ℓ2 interpolating problem of Eq. (9):

α^{opt}_{k′} = (√ν_{k′} / Σ_{k″∈S(k)} ν_{k″}) ŷ_k for k′ ∈ S(k),    (15)

where ŷ_k is the discrete Fourier transform of the samples y_j and k ∈ Ω_n, and where

S(k) = {k′ ∈ Ω_d : k′ = k mod n}    (16)

denotes the set of alias frequencies of k appearing in the overparameterized model with spectrum Ω_d. The optimal model is then expressed as

f_opt(x) = Σ_{k∈Ω_n} ŷ_k (Σ_{k′∈S(k)} ν_{k′} e^{i2πk′x}) / (Σ_{k′∈S(k)} ν_{k′}).    (17)

Recalling that our model f is trained on n noisy samples (y_0, ..., y_{n−1}) of the target function g, we are interested in the squared error of the model f averaged over the noise and over the input domain,

L(f) = E_ϵ [ ∫_0^1 |f(x) − g(x)|² dx ],    (18)

and we call L the generalization error of f, as it captures the behavior of f with respect to g over the entire input domain x ∈ [0, 1] instead of just the uniformly spaced training points x_j. In Appendix A we derive the decomposition

L(f_opt) = bias² + var,    (19)

where bias² collects the terms depending on the target coefficients ĝ_k and var collects the terms proportional to the noise variance σ². We use this generalization error now to explore two interesting behaviors of the interpolating model in our setting: the trade-off between noise absorption and signal recovery exemplified by the cases in Sec. 2.2, and the ability of an overparameterized Fourier features model to benignly overfit the training data.
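As a numerical sanity check (our own, not from the paper) that the minimum-norm solution acts independently on each alias class S(k), the alias-weighted closed form can be compared against a pseudoinverse fit with random positive feature weights:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 7, 4
d = n * (m + 1)                          # 35 modes, 5 per alias class
freqs = np.arange(d) - d // 2
nu = rng.uniform(0.1, 1.0, size=d)       # arbitrary positive feature weights
x = np.arange(n) / n
y = rng.normal(size=n)

Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs))
alpha_pinv = np.linalg.pinv(Phi) @ y

# closed form: within each alias class S(k), alpha_k' = sqrt(nu_k') yhat_k / sum(nu)
yhat = np.fft.fft(y) / n
alpha_cf = np.zeros(d, dtype=complex)
for k in range(n):                       # one representative per class (mod n)
    S = np.where(freqs % n == k)[0]
    alpha_cf[S] = np.sqrt(nu[S]) * yhat[k] / nu[S].sum()
print(np.allclose(alpha_pinv, alpha_cf))   # True
```

The agreement reflects the fact that at uniformly spaced samples the interpolation constraints decouple into one linear condition per residue class modulo n.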
The first behavior involves a trade-off in the generalization error L(f_opt) between absorbing noise (reducing var) and capturing the target function signal (reducing bias²), which recovers and generalizes the behavior of the two cases in Sec. 2.2. This trade-off is controlled by three components: the noise variance σ², the input signal Fourier coefficients ĝ_k, and the distribution of feature weights √ν_k. As described in the two cases above, when σ² → 0 (signal only) the variance term var vanishes, and the model is biased for any choice of nonzero ν_k with k > n. Conversely, when ĝ → 0 (noise only) the bias term bias² vanishes, and the variance term is minimized by choosing uniform ν_k for all k ∈ Ω_d.
The second interesting behavior occurs when the generalization error of the overparameterized model decreases at a nearly optimal rate as the number of samples n increases, known as benign overfitting.Prior work on benign overfitting in linear regression studied scenarios where the distribution of input data varied with the dimensionality of data and size of the training set in such a way that the excess generalization error of the overparameterized model (compared to a simple model) vanished [6].However, since the dimensionality of the input data for our model is fixed, we instead consider sequences of feature weights that vary with respect to the problem parameters (n 0 , n, d) in a way that results in bias 2 and var vanishing as n → ∞.In this case, by fitting an increasing number of samples n using such a sequence of feature weights, the overparameterized model both perfectly fits the training data and generalizes well for unseen test points, and therefore exhibits benign overfitting.
These behaviors are exemplified by a straightforward choice of feature weights that incorporates some prior knowledge of the spectrum Ω_{n0} available to the target function g. For all k ∈ Ω_{n0}, let ν_k = c/n0 for some positive c, and normalize the feature weights so that Σ_{k∈Ω_d} ν_k = 1. We show in Appendix A.3.1 that both error terms of L(f_opt) then vanish as n grows, provided the model dimension grows quickly enough. Thus, as long as the dimension of the overparameterized Fourier features model grows strictly faster than n (i.e., d = ω(n)), the model exhibits benign overfitting. In Appendix A.3.2 we demonstrate how this simple example actually captures the mechanisms of benign overfitting for much more general choices of feature weights. Fig. 3 summarizes this behavior and provides an example of the bias-variance trade-off that occurs for overparameterized models. In particular, Fig. 3a exemplifies the setting in which benign overfitting occurs, wherein the feature weights of the Fourier features model are strongly concentrated over frequencies in Ω_{n0} but extend over a large range of alias frequencies for each k ∈ Ω_{n0}. The generalization behavior described here is fundamentally different from many generalization guarantees typically found in statistical learning theory. While prior work has derived guarantees for the generalization of quantum models by constructing bounds on the complexity of the model class [25], Eqs. (20)-(21) demonstrate that generalization may occur as the complexity (i.e., dimension) of a model grows arbitrarily large.
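A toy version of this construction can be run directly (our own sketch; the target function, weight profile, and all parameter values are arbitrary choices): weights concentrated on Ω_{n0} with a long uniform tail are compared against fully uniform weights on a noisy low-frequency target.

```python
import numpy as np

def fit_and_error(nu, n, d, sigma, rng):
    """Generalization error of the min-norm weighted Fourier interpolant."""
    x = np.arange(n) / n
    g = lambda t: np.cos(2 * np.pi * t)              # target supported on Omega_{n0}
    y = g(x) + rng.normal(scale=sigma, size=n)
    freqs = np.arange(d) - d // 2
    Phi = np.sqrt(nu) * np.exp(2j * np.pi * np.outer(x, freqs))
    alpha = np.linalg.pinv(Phi) @ y                  # interpolates the noisy samples
    xt = np.linspace(0, 1, 3000, endpoint=False)
    f = np.exp(2j * np.pi * np.outer(xt, freqs)) @ (np.sqrt(nu) * alpha)
    return np.mean(np.abs(f - g(xt)) ** 2)

rng = np.random.default_rng(2)
n, d, n0, sigma = 25, 225, 3, 0.5
freqs = np.arange(d) - d // 2
spiked = np.where(np.abs(freqs) <= n0 // 2, 1.0, 0.01)   # mass on Omega_{n0}, long tail
spiked /= spiked.sum()
uniform = np.ones(d) / d
e_spiked = fit_and_error(spiked, n, d, sigma, rng)
e_uniform = fit_and_error(uniform, n, d, sigma, rng)
print(e_spiked, e_uniform)    # concentrated weights interpolate yet generalize better
```

Both models have zero training error; only the concentrated-plus-tail weight profile keeps the generalization error small, illustrating the benign-overfitting regime described above.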
So far, we have reviewed the Fourier perspective on fitting periodic functions in a classical setting and extended the analysis to characterize benign overfitting. However, if we can link the basic components of quantum models to the terms appearing in the error of Eq. (19), then we will be able to study a similar trade-off in the error of overparameterized quantum models and the conditions necessary for benign overfitting. The remainder of this work is devoted to showing that analogous mechanisms exist in certain quantum machine learning models, and to studying the choices of feature weights for which quantum models can exhibit trade-offs in generalization error and benign overfitting.


Benign overfitting in single-layer quantum models

In the previous section we have seen that the feature weights √ν_k balance the trade-off between absorbing the noise and hurting the signal of overparametrized models. To understand how different design choices of quantum models impact this balance, we need to link their mathematical structure to the model class defined in Eq. (4), and in particular to the feature weights, which is what we do now.
The type of quantum models we consider here are simplified versions of parametrized quantum classifiers (also known as quantum neural networks) that have been heavily investigated in recent years [13, 38, 39]. They are represented by quantum circuits that consist of two steps: first, we encode a datapoint x ∈ [0, 1] into the state of the quantum system by applying a d-dimensional unitary operator V(x), and then we measure the expectation value of some d-dimensional (Hermitian) observable M. This gives rise to a general class of quantum models of the form

f(x) = ⟨0| V†(x) M V(x) |0⟩.    (22)

To simplify the analysis, we will consider a quantum circuit V(x) = S(x)U that consists of a data-independent unitary U and a diagonal data-encoding unitary S(x) = e^{i2πxH} generated by a d-dimensional Hermitian operator H, which includes a large class of quantum models commonly studied but excludes schemes involving data re-uploading [40, 41]. Defining U|0⟩ = |Γ⟩ = Σ_{j=1}^{d} γ_j |j⟩, the output of this quantum model becomes

f(x) = ⟨Γ| S†(x) M S(x) |Γ⟩,    (23)

where |Γ⟩ can be treated as an arbitrary input quantum state. We call the corresponding quantum circuit for this model single-layer in the sense that it contains a single diagonal data-encoding layer in which all data-dependent circuit operations could theoretically be executed simultaneously (though in general the operation U and measurement M may require significant depth to implement). Applying insights from Refs. [29, 30], quantum models of this form can be expressed in terms of a Fourier series

f(x) = Σ_{k∈Ω} e^{i2πkx} Σ_{(ℓ,m)∈R(k)} γ*_ℓ γ_m M_{ℓm},    (25)

where the spectrum Ω as well as the partitions R(k) depend on the eigenspectrum λ(H) of the data-encoding Hamiltonian H:

Ω = {λ_m − λ_ℓ : ℓ, m ∈ [d]},    (26)
R(k) = {(ℓ, m) : λ_m − λ_ℓ = k}.    (27)

Comparing Eq. (25) to Eq.
(4) we see that the quantum model may be expressed as a linear combination of weighted Fourier modes, but it is not yet clear how the input state amplitudes γ_j and the trainable observable M of the quantum model correspond to feature weights ν_k for each Fourier mode. To reveal this correspondence, we will first find the minimum-norm interpolating observable that solves the optimization problem

M_opt = argmin_M ∥M∥_F subject to f(x_j) = y_j for all j = 0, ..., n − 1,    (28)

where ∥M∥_F = √(Tr(M†M)) denotes the Frobenius norm of M. Solving Eq. (28) is analogous to minimizing the ℓ2 norm of α in the classical optimization problem of Eq. (9), and serves a role similar to the regularization commonly applied to quantum models by introducing a penalty term proportional to ∥M∥²_F [26, 42, 43]. In Appendix B we prove that, subject to the condition that γ_i > 0, the minimum-∥·∥_F interpolating observable that solves Eq. (28) is given by

M^{opt}_{ℓm} = γ_ℓ γ_m ŷ_k / (Σ_{k″∈S(k)} ν^{opt}_{k″})    (29)

for all (ℓ, m) ∈ R(k′) with k′ ∈ S(k), and the corresponding optimal quantum model is

f_opt(x) = Σ_{k∈Ω_n} ŷ_k (Σ_{k′∈S(k)} ν^{opt}_{k′} e^{i2πk′x}) / (Σ_{k′∈S(k)} ν^{opt}_{k′}),    (30)

where S(k) denotes the set of aliases of k appearing in Ω, in analogy to Eq. (16). By comparison to the optimal classical model of Eq. (17) we have identified the feature weights of the optimized quantum model as

ν^{opt}_k = Σ_{(ℓ,m)∈R(k)} |γ_ℓ|² |γ_m|².    (31)

Interestingly, while there was initially no clear way to separate the building blocks of the quantum model in Eq. (25) into trainable weights α_k and feature weights ν_k, this separation has now appeared after solving for the optimal observable M_opt. Furthermore, the optimal quantum model depends only on the magnitudes |γ_i| and is independent of the phases associated with the amplitudes γ_i (an effect that stems from using only a single data-encoding layer S(x) in the quantum model).
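The Fourier-series structure of the single-layer model is straightforward to verify numerically. The sketch below (our own construction; the eigenvalues, dimension, and sign convention S(x) = e^{i2πxH} are arbitrary assumptions) builds Ω and R(k) from the eigenvalues of a diagonal H and checks that the expectation value matches its Fourier expansion:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
d = 4
lam = np.array([0, 1, 2, 4])                     # eigenvalues of a diagonal H
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
M = A + A.conj().T                               # random Hermitian observable
Gamma = rng.normal(size=d) + 1j * rng.normal(size=d)
Gamma /= np.linalg.norm(Gamma)                   # input state amplitudes

# frequencies are eigenvalue differences; R(k) collects the index pairs realizing k
R = defaultdict(list)
for l in range(d):
    for m_ in range(d):
        R[int(lam[m_] - lam[l])].append((l, m_))
coeff = {k: sum(Gamma[l].conj() * Gamma[m_] * M[l, m_] for l, m_ in pairs)
         for k, pairs in R.items()}

x = 0.37                                         # arbitrary test point
v = np.exp(2j * np.pi * x * lam) * Gamma         # S(x)|Gamma>, with S(x) diagonal
fx = np.vdot(v, M @ v).real                      # <Gamma| S^dag M S |Gamma>
fs = sum(c * np.exp(2j * np.pi * k * x) for k, c in coeff.items()).real
print(np.isclose(fx, fs), sorted(R.keys()))      # True, Omega = [-4, ..., 4]
```

Grouping the double sum over matrix elements by eigenvalue difference is exactly what produces the partitions R(k) and the spectrum Ω.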
From Eq. (31) it is clear that the partitions R(k) of Eq. (27), arising from the choice of data-encoding unitary S(x), have a strong relationship with the feature weights ν_k of the quantum model. We will now consider a simplified quantum model to highlight this relationship, thereby identifying a trade-off between noise absorption and target-signal recovery and the possibility of observing benign overfitting in quantum models.

Simplified quantum model
To explicitly highlight the role of R(k) in controlling the feature weights ν^{opt}_k of the optimized quantum model, we will now simplify the model by using an equal-superposition input state |Γ⟩ = (1/√d) Σ_{j=1}^{d} |j⟩ and by restricting the set of observables considered during optimization. If we fix every entry of the observable with respect to elements in a partition R(k) to be proportional to some complex constant M(k),

M_{ℓm} = M(k) for (ℓ, m) ∈ R(k),    (32)

then we can simplify the quantum model of Eq. (25) to

f(x) = (1/d) Σ_{k∈Ω} |R(k)| M(k) e^{i2πkx}.    (33)

Comparing Eq. (33) to Eq. (4) we identify a direct correspondence between the trainable weights α_k in the classical model and M(k), as well as a correspondence between the feature weights √ν_k and the degeneracy |R(k)| of the quantum model. Making the substitution ∥M∥²_F = Σ_{k∈Ω} |R(k)| |M(k)|² for this restricted choice of M, the solution to the optimization of Eq. (28) is essentially the same as that of the classical problem in Eq. (15). The crucial property of the simplified model is that the degeneracy |R(k)|, and hence the combinatorial structure introduced by the data-encoding Hamiltonian's eigenvalues, completely controls the trade-off in the generalization error (Eq. (19)). We can hence study different types of partitions R(k) to show a direct effect of the data-encoding unitary S(x) on the fitting and generalization error behaviors for this simplified, overparameterized quantum model.
To study these behaviors we will now consider specific families of H, which we call encoding strategies, since the choice of H completely determines how the data is encoded into the quantum model. While R(k) and Ω may be computed for an arbitrary S(x) using brute-force combinatorics, some encoding strategies lead to particularly simple solutions. We have derived a few such examples of simple degeneracies and spectra for different encoding strategies in Appendix C and present the results in Table 1. These choices highlight the extreme variation in Ω resulting from minor changes to S(x), for example |Ω| ∝ n_q for the "Hamming" encoding strategy compared to |Ω| ∝ 3^{n_q} for the "Ternary" encoding strategy. These examples also highlight the limitations in constructing Hamiltonians with specific properties such as uniform |R(k)| or evenly spaced frequencies in Ω.
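Under our reading of these encoding strategies (the per-qubit prefactors below are our assumptions; Table 1 fixes the exact conventions), the scaling of |Ω| can be checked by brute force over computational basis states:

```python
import numpy as np
from collections import Counter
from itertools import product

def omega_and_degeneracies(lam):
    """All frequencies k = lam_m - lam_l together with their degeneracies |R(k)|."""
    return Counter(b - a for a in lam for b in lam)

nq = 4
bits = list(product([0, 1], repeat=nq))
# a separable H = sum_i c_i Z_i has eigenvalues sum_i c_i b_i (up to shifts/scales)
hamming = [sum(b) for b in bits]                                     # c_i = 1
binary = [sum(bi * 2**i for i, bi in enumerate(b)) for b in bits]    # c_i = 2^i
ternary = [sum(bi * 3**i for i, bi in enumerate(b)) for b in bits]   # c_i = 3^i

sizes = {}
for name, lam in [("Hamming", hamming), ("Binary", binary), ("Ternary", ternary)]:
    sizes[name] = len(omega_and_degeneracies(lam))
    print(name, sizes[name])   # |Omega|: 2*nq+1 = 9, 2^(nq+1)-1 = 31, 3^nq = 81
```

The Ternary prefactors make every difference of eigenvalues a distinct balanced-ternary integer, which is why the spectrum saturates at 3^{n_q} for a separable Hamiltonian.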
Since the feature weights ν_k of the Fourier modes are fixed by the choice of the data-encoding unitary, we can understand a choice of S(x) as providing a structural bias of a quantum model towards different overfitting behaviors; conversely, the choices of feature weights available to quantum models are limited and particular to the structure of the associated quantum circuit. Figure 4 shows distributions for feature weights arising

Table 1 (columns: encoding strategy, example Hamiltonian, spectrum Ω, degeneracy |R(k)|): Spectra Ω and degeneracies R(k) computed for various data-encoding Hamiltonians H defined for either d dimensions or n_q qubits. The Hamming, Binary, and Ternary data-encoding strategies are realized on n_q qubits using a separable Hamiltonian consisting of Pauli-Z operators with different prefactor schemes. The Ternary encoding strategy results in the largest |Ω| possible for a separable data-encoding Hamiltonian (see also Ref. [44]), while the Golomb encoding strategy (named in reference to Golomb rulers, e.g. [45]) results in the largest |Ω| possible for any choice of d-dimensional data-encoding Hamiltonian. Note that the spectrum is preserved under permutations and additive shifts of the diagonal of the Hamiltonian, and so we use "∼" to denote equivalence up to these operations. The function T converts an integer to a signed ternary string, as defined in Eq. (234) (see Appendix C).
from the example encoding strategies presented in Table 1, and demonstrates a broad connection between the degeneracies |R(k)| of the model (giving rise to feature weights ν^{opt}_k) and the generalization error L(f_opt).
In Appendix B we further show that, averaged over choices of the unitary U, the optimal feature weights ν^{opt}_k are proportional to the degeneracies |R(k)| (the resulting expression involves the Kronecker delta δ_i^j). Furthermore, in Appendix B we observe that the variance of ν^{opt}_k around its average tends to be small for the encoding strategies considered in this work, for instance scaling like O(d^{−3}) for the Binary encoding strategy and O(d^{−4}) for the Golomb encoding strategy. This demonstrates that the feature weights of generic quantum models (i.e., ones for which U is randomly sampled) will be dominated by the degeneracy |R(k)| introduced by the data-encoding operation S(x).
Despite the behavior of ν_k being dominated by |R(k)| in an average sense, there are specific choices of U for which the feature weights deviate significantly from this average. We will now use one such choice of U to provide a concrete example of an interpolating quantum model that exhibits benign overfitting. Suppose we choose U such that the elements of |Γ⟩ follow a piecewise-constant profile in j ∈ [d], determined by integers 0 < c_1 < c_2 < d and amplitudes a, b subject to normalization (see Appendix D for the explicit form). We show that, given a band-limited target function g with access to spectrum Ω_{n0}, there is a specific choice of c_1, c_2 depending on d and n_0 for which the interpolating quantum model f_opt of Eq. (30) also benefits from vanishing generalization error in the limit of many samples; namely, we show in Appendix D that L(f_opt) → 0 as n → ∞. Thus, by perfectly fitting the training data and exhibiting good generalization in the n → ∞ limit, the quantum model exhibits benign overfitting. This behavior is outlined in Figure 5, which highlights the role that |Γ⟩ plays in concentrating the feature weights ν_k within the spectrum Ω_{n0} of g while preserving a long tail that provides the model with low-variance "spiky" behavior in the vicinity of noisy samples. In contrast, the feature weights for the Binary encoding strategy with a uniform input state have little support on Ω_{n0}, resulting in a large bias error.
The above discussion shows how the input state amplitudes γ i provide additional degrees of freedom with which the feature weights ν opt k can be tuned in order to modify the generalization behavior of the interpolating quantum model, and to exhibit benign overfitting in particular.It is therefore worthwhile to consider what other kinds of feature weights ν k might be prepared by some choice of input state |Γ⟩.We may use simple counting arguments to demonstrate the restrictions in designing particular distributions of feature weights.Suppose we define Ω + = {k : k ∈ Ω, k > 0} containing the positive frequencies of a quantum model.Then the introduction of an arbitrary input state |Γ⟩ provides us with 2 nq − 1 free parameters with which to tune |Ω + |-many terms in the distribution of ν opt k (subject to ν k = ν −k and k ν k = 1).Clearly, there are distributions of feature weights ν opt k that can not be achieved for models where Conversely, the condition |Ω + | < 2 nq does not necessarily mean that we can thoroughly explore the space of possible feature weights by modifying the input state |Γ⟩.For example, consider the Hamming encoding strategy for which the number of free parameters controlling the distribution of feature weights ν opt k is |Ω + | = n q , which is exponentially smaller the number of parameters in |Γ⟩.While this might suggest significant freedom in adjusting ν opt k , the opposite is true: For any choice of input state |Γ⟩, there is another state of the form that achieves exactly the same distribution of feature weights ν k .In Eq. (37), |Φ i ⟩ describes a uniform superposition over all computational basis state bitstrings with weight i, and so the distribution of ν opt k actually only depends on n q + 1 real parameters ϕ i , i = 0, 1, . . 
., n_q, and the feature weights are invariant under any operations in U that preserve |Φ_i⟩ (see Appendix B). An example of such operations are the particle-conserving unitaries well known in quantum chemistry, which preserve the number of set bits (i.e., the Hamming weight) when each bit represents a mode occupied by a particle in second quantization [46,47]. This example demonstrates how symmetry in the data-encoding Hamiltonian (e.g., Refs. [48,49]) can have a profound influence on the ability to prepare specific distributions of feature weights ν_k^opt, and consequently affect the generalization and overparameterization behavior of the associated quantum models.

Conclusion
In this work we have taken a first step towards characterizing the phenomenon of benign overfitting in a quantum machine learning setting. We derived the error for an overparameterized Fourier features model that interpolates the (noisy) input signal with minimum ℓ_2-norm trainable weights, and connected the feature weights associated with each Fourier mode to a trade-off in the generalization error of the model. We then demonstrated an analogous simplified quantum model for which the feature weights are induced by the choice of data-encoding unitary S(x). Finally, we discussed how introducing an arbitrary state-preparation unitary U gives rise to effective feature weights in the optimized general quantum model, presenting the possibility of connecting U and S(x) to benign overfitting in more general quantum models.
Our discussion of interpolating quantum models presents an interpretation of overparameterization (i.e., the size of the model spectrum Ω) that departs from other measures of quantum circuit complexity discussed in the literature [19,50,51], as even the simplified quantum models studied here are able to interpolate training data using a fixed circuit U and optimized measurements. We also reemphasize that, unlike much of the quantum machine learning literature, we do not consider a setting where the model is optimized with respect to a trainable circuit, as the model of Eq. (30) is constructed to exhibit zero training error (and can therefore not be improved via optimization). Finding the input state |Γ⟩ that will result in a specific distribution of feature weights ν_k^opt generally requires solving a |Ω_+|-dimensional system of equations that is second order in the 2^{n_q} real parameters |γ_i|^2 (i.e., inverting a map of the form R^{2^{n_q}} → R^{|Ω_+|} in Eq. (31)), or otherwise performing a variational optimization that will likely fail due to the familiar phenomenon of barren plateaus [23,24,52,53].
While we have shown an example of benign overfitting by a quantum model in a relatively restricted context, future work may lead to more general characterizations of this phenomenon. Similar behavior likely exists for quantum kernel methods and may complement existing studies on these methods' generalization power [54]. An exciting possibility would be to demonstrate benign overfitting in quantum models trained on distributions of quantum states which are hard to learn classically [55,56], thereby extending the growing body of statistical learning theory for quantum learning algorithms [27,28,57].

Code availability
by the Province of Ontario through the Ministry of Colleges and Universities.Circuit simulations were performed in PennyLane [58].

A Solution for the classical overparameterized model
In this section we derive the optimal solution and generalization error for the classical overparameterized weighted Fourier features model. We then discuss the conditions under which benign overfitting may be observed and construct examples of the phenomenon.

A.1 Linearization of overparameterized Fourier model
We first show that the classical overparameterized Fourier features model may be cast as a linear model under an appropriate orthogonal transformation. We are interested in learning a target function of the form with the additional constraint that the Fourier coefficients satisfy ĝ_k = ĝ*_{−k} such that g is real. The spectrum of Eq. (38) for odd n_0 only contains integer frequencies, and we accordingly call g a bandlimited function with bandlimit n_0/2 − 1. To learn g, we first sample n equally spaced datapoints on the interval [0, 1], where [n] = {0, 1, 2, . . ., n − 1} and we assume n is odd, and we then evaluate g(x_j) with additive error. This noisy sampling process yields n observations of the form y_j = g(x_j) + ϵ with E[ϵ^2] = σ^2. We will fit the observations y_j using overparameterized Fourier features models of the form with α ∈ C^d, where we have introduced weighted Fourier features ϕ : R → C^d defined elementwise as In Eq. (41), Ω_d describes the set of frequencies available to the model for any choice of d ≥ n ≥ n_0. We are interested in the case where f interpolates the observations y_j, i.e., f(x_j) = y_j for all j = 0, . . ., n − 1. To this end, we define an n × d feature matrix Φ whose rows are given by ϕ(x_j)†: The interpolation condition may then be stated in matrix form as where (y)_j = y_j is the vector of noisy observations. Ω_d contains alias frequencies of Ω_{n_0}, and so the choice of α that satisfies Eq. (44) is not unique. Here we will focus on the minimum ℓ_2-norm interpolating solution,

A.1.1 Fourier transform into the linear model

We will now show that Eq. (45) with uniformly spaced datapoints x_j can be solved using methods from ordinary linear regression under a suitable choice of transformation. Defining the n-th root of unity as ω = e^{i2π/n}, the j-th row of the LHS of Eq. (44) is equivalent to where S(k) = {j : j ∈ Ω_d, j mod n = k} is the set of alias frequencies of k appearing in Ω_d, i.e. 
the set of frequencies k + ℓn with ℓ ∈ Z that obey e^{i2πkx_j} = e^{i2πkj/n} = e^{i2π(k+ℓn)j/n} = e^{i2π(k+ℓn)x_j}, for p, q ∈ [n]. This implies Σ_{k∈Ω_n} ω^{k(p−q)} = ω^{min(Ω_n)(p−q)} n δ_p^q = n δ_p^q, for p, q ∈ Ω_n. Defining the discrete Fourier transform ŷ_k of y according to and using the identity of Eq. (51), we evaluate the j-th row of Eq. (44) as Inspecting this final line yields a new matrix equation: let X be an n × d matrix with elements where the conditional operator 1{Z} evaluates to 1 if the predicate Z is true, and 0 otherwise. Then we may express Eq. (56) for all p ∈ Ω_n as a matrix equation where (ŷ)_j = ŷ_j. We have shown that Eq. (44) is exactly equivalent to Eq. (58) for uniformly spaced inputs, and as α is unchanged between these two representations, this implies that the solution to Eq. (45) is also given by Therefore, the minimum ℓ_2-norm solution interpolating the input signal y_j using weighted Fourier functions is exactly the same as the minimum ℓ_2-norm solution for an equivalent linear regression problem on the matrix X with targets ŷ. Furthermore, this linear regression problem is related to the original problem via a Fourier transform. Let F be the (nonunitary) discrete Fourier transform defined on C^n elementwise as We may similarly recover (1/n) F y = ŷ, showing that the coefficients ŷ_k are given by a discrete Fourier transform of y_j.
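This equivalence can be checked numerically. The following sketch is our own illustration (not code from the paper): the sizes n, d, the weights ν_k, and the one-sided spectrum Ω_d = {0, . . ., d − 1} are arbitrary simplifying choices. It builds the feature matrix Φ on uniformly spaced inputs, computes the minimum ℓ_2-norm interpolator via the pseudoinverse, and verifies that it coincides with the minimum-norm solution of the DFT-domain system Xα = ŷ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
d = n * (m + 1)                      # d = n(m+1): each base mode has m aliases
Omega_d = np.arange(d)               # one-sided spectrum, for simplicity
nu = rng.uniform(0.5, 1.5, size=d)   # feature weights nu_k > 0 (arbitrary here)
x = np.arange(n) / n                 # uniformly spaced inputs on [0, 1)
y = rng.normal(size=n)               # noisy observations

# Feature matrix with entries sqrt(nu_k) e^{i 2 pi k x_j}
Phi = np.sqrt(nu)[None, :] * np.exp(1j * 2 * np.pi * np.outer(x, Omega_d))
alpha_direct = np.linalg.pinv(Phi) @ y       # minimum-norm interpolating solution

# DFT-domain system: y_hat_p = (1/n) sum_j y_j e^{-i 2 pi p j / n}
y_hat = np.fft.fft(y) / n
X = np.zeros((n, d))
for p in range(n):
    for k in range(d):
        if k % n == p:               # k is an alias of p
            X[p, k] = np.sqrt(nu[k])
alpha_dft = np.linalg.pinv(X) @ y_hat

assert np.allclose(Phi @ alpha_direct, y)    # interpolation holds
assert np.allclose(alpha_direct, alpha_dft)  # the two minimum-norm solutions agree
```

The two systems are related by the invertible row transform F/n, so their solution sets (and hence their minimum-norm elements) coincide.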

A.1.2 Error analysis of the linear model
Having shown that the discrete Fourier transform F relates the original system of trigonometric equations Φα = y to the system of linear equations Xα = ŷ (an ordinary least squares problem, where we treat the rows of X as observations in R^d), we now derive the error of the problem in the Fourier representation. Standard treatment for ordinary least squares (OLS) gives the minimum ℓ_2-norm solution to Eq. (59) as in which case the optimal interpolating overparameterized Fourier model is Once we have trained a model on noisy samples y of g using uniformly spaced values of x, we would like to evaluate how well the model performs for arbitrary x in [0, 1]. Given some function f = ⟨ϕ(x), α⟩, the mean squared error E_x[(f(x) − g(x))^2] may be evaluated with respect to the interval [0, 1]. We define the generalization error of the model as the expected mean squared error evaluated with respect to y: We decompose the generalization error of the model as Because of the orthonormality of the Fourier basis functions, the cross terms cancel, resulting in a decomposition of the generalization error into the standard bias and variance terms: We now evaluate var and bias^2 using the linear representation developed in the previous section. Beginning with the variance, conditional on constructing Φ from the set of uniformly spaced points x_j, we apply the discrete Fourier transformation to yield X and compute α_opt using Eq. 
(64): Letting ϵ := (y − E_y y), we have simplified the above using the following: where we have used the fact that the errors ϵ are independent and zero mean, E_y[ϵ_p ϵ_q] = E_y[ϵ_p^2] δ_p^q. We have defined the feature covariance matrix as Σ_ϕ = E_x[ϕ(x)ϕ(x)†], which may be computed elementwise using the orthonormality of Fourier features on [0, 1]: The following may be computed directly: In line (84) we have used the identity since ℓ ∈ S(k) ⇒ k = ℓ mod n and therefore ℓ ∈ S(j) ⇒ j = ℓ mod n = k. We now compute the variance as To evaluate the bias^2, we first rewrite g(x) in terms of its Fourier representation, Noting that (Xα_0)_k = ĝ_k implies Xα_0 = E_y ŷ, we evaluate While Eq. (95) already completely characterizes the bias error in terms of the choice of feature weights and input data, it may be greatly simplified by taking advantage of the sparseness of X. We have where in line (97) we have used for k, j ∈ Ω_d and ℓ ∈ Ω_n, which follows from similar reasoning as Eq. (84). And so, writing Letting Q_k := Σ_{m∈S(k)} ν_m, the bias term of the error evaluates to With Eqs. (88) and (108) we have recovered a closed-form expression for the generalization error in terms of the feature weights ν_k, the target function Fourier coefficients ĝ_k, and the noise variance σ^2. This was possible in part because of the orthonormality of the Fourier features ϕ(x)_k and the choice to sample x_j uniformly on [0, 1], resulting in diagonal Σ_ϕ and XX^T respectively. This simplicity is more advantageous for studying scenarios where benign overfitting may exist compared to prior works [6,59]. We will now analyze choices of feature weights ν_k that may give rise to benign overfitting for overparameterized weighted Fourier features models.
We remark that the results of this section may also be derived using the methods of Ref. [9], though we have opted here to use a language more reminiscent of linear regression to highlight connections between the analysis of weighted Fourier features models and ordinary least squares errors.

A.2.1 Noise-only minimum norm estimator
We now derive the cases considered in Sec. 2.2 of the main text. We first consider the case where the target function is given by g = 0 and we attempt to predict on a pure noise signal with E[ϵ^2] = σ^2. We can recover an unaliased, "simple" fit to this pure noise signal by reconstructing ϵ from the DFT using Eq. (52): Setting all weights ν_k = 1, an immediate choice for an interpolating f with access to frequencies in Ω_d is found by setting where we have assumed d = n(m + 1). Eq. (111) is the minimum ℓ_2-norm estimator and evenly distributes the weight of each Fourier coefficient ε̂_k over the m alias frequencies in S(k). The effect of higher-frequency aliases is to reduce f to near-zero everywhere except at the interpolated training data. We can directly compute the generalization error of the interpolating estimator as Using d = n(m + 1) as the dimensionality of the feature space, we have recovered the lower-bound scaling for overparameterized models derived in Ref. [9]. In line (113) we have used the independence of ϵ from x_k and y_k, and in line (114) we have used the orthonormality of Fourier basis functions on [0, 1]. Line (116) uses Parseval's relation for the Fourier coefficients. Figure 2a of the main text shows the effect of the number of cohorts m on the behavior of f, which interpolates a pure noise signal with very little bias. As the number of cohorts increases, the function deviates very little from the true "signal" y = 0, and becomes very "spiky" in the vicinity of noisy samples.

A.2.2 Signal-only minimum norm estimator
Now we study the opposite situation, in which the pure tone is noiseless and aliases in the spectrum of f interfere in the prediction of f. In this case, we set σ = 0 and interpolate target labels with −n/2 < p < n/2. When we set d = n(m + 1) and predict on y, there are exactly |S(p)| − 1 = m aliases for the target function with frequency p. We again assume all feature weights are equal, ν_k = 1, and by orthonormality of Fourier basis functions, only the components of f with frequency in S(p) are retained: which will interpolate the training points for any choice of α satisfying Σ_{k∈S(p)} α_k = ĝ_p. The choice of trainable weights α_k that satisfies Eq. (120) while minimizing the ℓ_2-norm is The problem with minimizing the ℓ_2-norm in this case is that it spreads the true signal into higher-frequency aliases: The generalization error of this model is We see that this model generally fails to generalize. This poor generalization was described as "signal bleeding" in Ref. [9]: using n samples, there is no way to distinguish the signal due to an alias of p from the signal due to the true frequency p, so the coefficients α_k become evenly distributed over aliases, with very little weight allocated to the true Fourier coefficient ĝ_p in the model f. Fig. 2b in the main text shows the effect of "signal bleed" for learning a pure tone in the absence of noise.
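The following sketch (our own illustration, with arbitrary sizes) reproduces signal bleed numerically: with uniform weights, the minimum-norm interpolator of a noiseless pure tone spreads the unit coefficient of frequency p evenly over the m + 1 frequencies in S(p).

```python
import numpy as np

n, m, p = 15, 9, 2
d = n * (m + 1)
k = np.arange(d)
x = np.arange(n) / n
y = np.exp(1j * 2 * np.pi * p * x)        # noiseless pure tone at frequency p
Phi = np.exp(1j * 2 * np.pi * np.outer(x, k))
alpha = np.linalg.pinv(Phi) @ y           # minimum l2-norm interpolator

aliases = np.where(k % n == p)[0]         # S(p) = {p, p+n, ..., p+mn}
assert np.allclose(alpha[aliases], 1 / (m + 1))       # coefficient spread evenly
assert np.allclose(np.delete(alpha, aliases), 0)      # all other modes vanish
```

The true Fourier coefficient receives only a 1/(m + 1) fraction of the signal, which is exactly the "bleed" into aliases described above.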

A.3 Conditions for Benign overfitting
The behavior of the error of the overparameterized Fourier model (Eq. (19) of the main text) depends on an interplay between the noise variance σ^2, the signal Fourier coefficients ĝ_k, the feature weights ν_k, and the size of the model d. A desirable property of the feature weights ν_k is that they should result in a model that interpolates the sampled data while also achieving good generalization error in the limit n → ∞. For our purposes we will consider cases where lim_{n→∞} L(f_opt) = 0, though this condition could be relaxed to allow for more interesting or natural choices of ν_k. We now analyze the error arising from a simple weighting scheme to demonstrate benign overfitting using the overparameterized Fourier models discussed in this work.

A.3.1 A simple demonstration of benign overfitting
Here we demonstrate a simple example of benign overfitting when the feature weights ν_k are chosen with direct knowledge of the spectrum Ω_{n_0} of g. For some n_0 < n < d, fix c ∈ (0, 1) and use the feature weights given as for all k ∈ Ω_d. For simplicity, suppose d = n(m + 1) such that |S(k)| = m + 1 for all k ∈ Ω_n. Defining the signal power as we can directly evaluate var of Eq. (88) and bias^2 of Eq. (108): Fixing n_0, we can bound the generalization error in the asymptotic limit n → ∞ as: Therefore, by setting d = ω(n) the model perfectly interpolates the training data and also achieves vanishing generalization error in the limit of a large number of samples. A similar example was considered in Ref. [9], though a rigorous error analysis (and relationship to benign overfitting) was not considered there. This benign overfitting behavior is entirely due to the feature weights of Eq. (123): as d, n → ∞ with d = ω(n), the feature weights concentrate on Ω_{n_0} (suppressing bias) while becoming increasingly small and evenly distributed over all aliases of Ω_{n_0} (suppressing var).
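A weighting scheme in this spirit can be tested numerically. The sketch below is our own illustration; the constants and the exact form of the weights are ad hoc choices rather than Eq. (123) itself. It places weight 1 − c on the target spectrum, spreads weight c over all remaining frequencies, and checks that the resulting interpolating model generalizes better than the uniform-weight model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, n0, c, sigma = 31, 15, 5, 0.05, 0.5
d = n * (m + 1)
k = np.arange(d) - d // 2                 # symmetric spectrum
in_spec = np.abs(k) <= n0 // 2            # the n0 frequencies of the target g
nu = np.where(in_spec, (1 - c) / in_spec.sum(), c / (~in_spec).sum())

# Random bandlimited target with spectrum restricted to in_spec
g_hat = rng.normal(size=in_spec.sum()) + 1j * rng.normal(size=in_spec.sum())
def g(x):
    return np.exp(1j * 2 * np.pi * np.outer(x, k[in_spec])) @ g_hat

x_train = np.arange(n) / n
x_test = rng.uniform(0, 1, 2000)
y = g(x_train) + sigma * rng.normal(size=n)   # noisy training labels

def mse(weights):
    """Test error of the minimum-norm interpolator with given feature weights."""
    Phi = np.sqrt(weights) * np.exp(1j * 2 * np.pi * np.outer(x_train, k))
    alpha = np.linalg.pinv(Phi) @ y
    Phi_t = np.sqrt(weights) * np.exp(1j * 2 * np.pi * np.outer(x_test, k))
    return np.mean(np.abs(Phi_t @ alpha - g(x_test)) ** 2)

assert mse(nu) < mse(np.ones(d))          # concentrated weights generalize better
```

Both models interpolate the same noisy data; only the feature weighting differs, mirroring the bias/variance mechanism described above.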

A.3.2 More general conditions for benign overfitting
We have derived closed-form solutions for the bias^2 and var terms that determine the total generalization error of an interpolating model f, and in the previous section we provided a concrete example of a model that achieves benign overfitting in this setting. We now discuss conditions under which a more general choice of model can exhibit benign overfitting.
We begin by showing that the variance of Eq. (88) splits naturally into an error due to a (simple) prediction component and a (spiky, complex) interpolating component. Following Refs. [4,6], we split the variance into components corresponding to eigenspaces of Σ_ϕ with large and small eigenvalues respectively. Let S_{≤p} denote the set of indices for the largest p eigenvalues of Σ_ϕ (i.e., the largest p values of ν_k), and let S_{>p} = [d]\S_{≤p} be its complement. Define P_{≤p} : R^d → R^d as the projector onto the subspace of R^d spanned by basis vectors labelled by indices in S_{≤p} (P_{>p} = I_d − P_{≤p} is defined analogously). Then, letting (ν)_k = ν_k be the vector of feature weights and assuming p ≤ n, we may rewrite the variance of Eq. (88) as where we have used P_{≤p} P_{S(k)} ê_j = δ_j^k 1{j ≤ p} for p ≤ n, and we have introduced an effective rank for the alias cohort of k, Since p = n_0 is a relevant choice for our problem setup, we define n_0(Σ_ϕ) and focus on the bound The first term of the decomposition of Eq. (135) corresponds to the variance of a (noninterpolating) model with access to n_0 Fourier modes, while the second term corresponds to the excess error introduced by high-frequency components of f. Given a sequence of experiments with increasing n (while g and σ^2 remain fixed), we would like to understand the choices of feature weights ν_k for which var vanishes as n → ∞. Given that |{k ∈ Ω_n}| = n, a sufficient condition for a sequence of feature weight distributions to be benign is that R(k) = ω(1) for all k, while a necessary condition is that there is no k for which R(k) = O(1/n). These conditions are not difficult to satisfy: intuitively, they require only that Σ_ϕ changes with increasing n in such a way that the values of ν_k in Ω_n\Ω_{n_0} continue to "flatten out" as n increases. This is precisely the behavior engineered in the example of Sec. A.3.1. We now proceed to bound the bias term of Eq. 
(95). Observe that P_⊥ := (I_d − X^T (XX^T)^{−1} X) is a projector onto the subspace of R^d orthogonal to the rows of X, therefore satisfying ∥P_⊥∥ ≤ 1 and P_⊥ X^T X = 0. The bias^2 is then bounded as The term X^T X can be interpreted as a finite-sample estimator for Σ_ϕ in the sense that X^T X = n^{−1} Φ^T Φ = Σ_ϕ for n = d. However, we cannot apply standard results on the convergence of sample covariance matrices (e.g., Ref. [60]), since the uniform spacing requirement for the training data x violates the standard assumption of i.i.d. input data. To proceed, we will make a number of simplifying assumptions about the feature weights. First, to control for the possibility that large feature weights ν_k concentrate within a specific S(k), we will assume that the feature weights corresponding to any set of alias frequencies of Ω_n are close to their average. Letting d = (m + 1)n, we define for ℓ = 0, 1, . . ., m. We will impose that |ν_{k+nℓ} − η_ℓ| ≤ t for all k ∈ Ω_n, ℓ = 1, . . ., m, for some positive number t. We further assume normalization, Σ_{k∈Ω_d} ν_k = 1 = n Σ_{ℓ=0}^{m} η_ℓ. Under these assumptions we can bound the first term of (138) as where in line (140) we have used the Gershgorin circle theorem. Meanwhile, defining ζ := Σ_{k∈Ω_{n_0}} ν_k, by the Cauchy-Schwarz inequality we have that where P is the signal power of Eq. 
(124). A necessary condition for producing benign overfitting in overparameterized Fourier models is that ζ remains relatively large as n, d → ∞. If this is accomplished, then a small enough t guarantees that all feature weights associated with frequencies k ∈ Ω_d\Ω_{n_0} are uniformly suppressed. For instance, if ζ is lower bounded by a constant while t = 0, then combining Eqs. (143) and (144) yields a bound of Although the analysis of Sec. A.3.1 yields a significantly tighter bound, this demonstrates that the mechanisms behind that simple demonstration of benign overfitting are somewhat generic. In particular, normalization and lower-bounded support of the feature weights on Ω_{n_0} are almost sufficient to control the bias term of the generalization error.

B Solution for the quantum overparameterized model
We now derive the solution to the minimum-norm interpolating quantum model, We will use the following notation and definitions: Here, a_k and b_k characterize the number of positive and negative frequency aliases of k appearing in Ω (i.e., k + nℓ ∈ Ω for all a_k ≤ ℓ ≤ b_k), assuming that |Ω| is odd, a requirement for any quantum model. Let L(d) be the space of linear operators acting on d × d matrices.
Define the linear operator for any X ∈ L(d). Importantly, P_k is not necessarily Hermiticity-preserving. Denoting Γ := |Γ⟩⟨Γ| for brevity, we may rewrite the Fourier coefficients of f(x) as where in the last line we have used the Hermiticity of M. Applying f(x_j) = y_j ∀ j ∈ [n] and substituting into Eq. (147), we find the interpolation condition where P_{S_p} := Σ_{q=a_k}^{b_k} P_{p+nq}. The equality follows from the fact that R(j) and R(k) are disjoint sets for any j ≠ k. Following the technique of Ref. [9], we apply the Cauchy-Schwarz inequality to find with equality if and only if P_{S_p}(M) is proportional to P_{S_p}(Γ). Saturating this lower bound by setting M_{ℓm} = c γ_ℓ γ*_m and solving for the proportionality constant c using Eq. (160), we find This indicates an additional requirement for interpolation, namely that γ_ℓ, γ_m > 0 for some pair (ℓ, m) ∈ R(k) whenever ỹ_k ≠ 0, and so for simplicity we will require that γ_ℓ > 0 for all ℓ = 1, . . ., d. Within each set of indices R(S_k), the elements of the optimal observable are defined piecewise with respect to that partition R: Minimization of ∥M∥_F subject to the interpolation constraint is equivalent to minimization of ∥P_{p+nq}(M)∥_F for all q ∈ [a_k, b_k], k ∈ Ω, and so solving the constrained optimization over all distinct subspaces in ∪_{k=0}^{n−1} R(S_k) = {0, 1}^n × {0, 1}^n we recover the optimal observable We now verify that this matrix is Hermitian and therefore a valid observable. We will use the following: Eq. (164) follows from our assumption that y_j ∈ R ∀ j ∈ [n]. And so In line (167) we have used Eq. (164), while in line (169) we have observed that Eq. (162) holds only with respect to any fixed partition; M_{ℓm} and M_{mℓ} must be computed with respect to distinct partitions. The optimal model may now be rewritten in terms of base frequencies as Recall that the optimal classical model derived in Eq. (65) is given by Then, despite Eq. 
(147) not having a clear decomposition into scalar feature weights and trainable weights, we can identify the feature weights of the optimized quantum model as which recovers the same form as the optimal classical model of Eq. (65). This means that the behavior of the feature weights of the (optimal) general quantum model is strongly controlled by the degeneracy sets R(k). The generalization error of the quantum model is also described by Eq. (19) of the main text under the identification of Eq. (172), and therefore exhibits a tradeoff that is predominantly controlled by the degeneracies R(k) of the data-encoding Hamiltonian.
We can now substitute γ_ℓ = 1/√d to recover the optimal observable for the simplified model derived by other means in Sec. 3.1 of the main text, namely

B.1 Computing feature weights of typical quantum models

In Sec. 3.1 we introduced a simple model with a uniform-amplitude state |Γ⟩ as input and demonstrated that the feature weights of this simple quantum model are completely determined by the sets R(k) induced by the encoding strategy. We now wish to extend the intuition that the behavior of the optimized general quantum model is strongly influenced by the distribution of the degeneracies |R(k)|. We do so by evaluating the optimal quantum models with respect to an "average" state preparation unitary U. We can compute the average value of |γ_i|^2 |γ_j|^2 for U sampled uniformly from the Haar distribution using standard results from the literature [61] (note that since ν_k^opt is invariant with respect to γ_i → e^{iϕ} γ_i, a spherical measure would suffice here). Since (i, i) ∈ R(0), we can then compute the feature weights of the optimal model according to (It is implied that we compute the optimal ν with respect to each distinct U sampled independently and uniformly with respect to the Haar measure; without optimizing M with respect to each U we would find the trivial result.)
From Eq. (177) we see that the feature weights of a quantum model optimized with respect to a random U are completely determined by the degeneracies R(k). This expected value is useful but does not fully characterize the behavior of an encoding strategy. To demonstrate that this average behavior is meaningful, we would further like to verify that the feature weights corresponding to a random U concentrate around the mean of Eq. (177). We characterize this by computing the variance where we have dropped the superscript on ν_k^opt for brevity. This computation requires significantly more counting arguments dependent on the structure of R(k). When k ≠ 0, we identify cases for which i ≠ j whenever (i, j) ∈ R(k): The expected values of these terms are evaluated using the observation that the vector (|γ_1|^2, . . ., |γ_d|^2) is distributed uniformly on the d-simplex, leading to simple expressions for the following expected values [61]: where p, q, r, s = 1, . . ., d with p ≠ q ≠ r ≠ s, and D := (d + 3)(d + 2)(d + 1)d. In computing the expected value of Eq. (180), by linearity the terms of each sum become constants, and only the number of terms in each sum is relevant. The first sum of Eq. (180) contains |R(k)|-many terms and the total number of terms is |R(k)|^2, and so we only need to compute the number of elements in the two middle sums of Eq. (180) (which contain an equal number of terms due to the symmetry R(k) = R(−k)^T). These computations may be carried out by brute-force combinatorics and are summarized in Table 2 for a few of the models studied in this work.
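The fourth moments used in this computation can be verified by Monte Carlo. The sketch below is our own check: it samples Haar-random states by normalizing complex Gaussian vectors and confirms E[|γ_i|^2 |γ_j|^2] = (1 + δ_ij)/(d(d + 1)), i.e., 2/(d(d + 1)) for i = j and 1/(d(d + 1)) for i ≠ j.

```python
import numpy as np

rng = np.random.default_rng(3)
d, samples = 4, 200_000
# A Haar-random state is a normalized complex Gaussian vector
z = rng.normal(size=(samples, d)) + 1j * rng.normal(size=(samples, d))
gamma = z / np.linalg.norm(z, axis=1, keepdims=True)
p = np.abs(gamma) ** 2          # (|g_1|^2, ..., |g_d|^2): uniform on the d-simplex

m_iijj = np.mean(p[:, 0] * p[:, 1])     # i != j case
m_iiii = np.mean(p[:, 0] ** 2)          # i = j case
assert abs(m_iijj - 1 / (d * (d + 1))) < 2e-3
assert abs(m_iiii - 2 / (d * (d + 1))) < 2e-3
```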
For the Binary encoding strategy we compute Summing over all such p with p + 2k ≤ n_q yields the desired result; however, this computation does not admit a clear closed-form expression, and so we have omitted the corresponding scaling of Var(ν_k^opt) for the Binary encoding strategy. Taking d = 2^{n_q} corresponding to n_q qubits, while the mean decays exponentially in the number of qubits n_q, the variance decays exponentially faster. For the Golomb encoding strategy, the calculation is comparatively straightforward, yielding for k ≠ 0, with the variance again decaying significantly faster in d than the mean. Figure 6 shows the average and variance of ν_k^opt for the general quantum model with U sampled uniformly with respect to the Haar measure. We find tight concentration of ν_k^opt around its average in each of these cases. Why is it that the feature weights ν_k^opt of the quantum model given by Eq. (172) only appear after deriving the optimal observable? One explanation is that while the classical Fourier features model utilizes random Fourier features such that the components of ϕ(x) are mutually orthogonal on [0, 1], the quantum model does not. Consider the operator where p : X → R is the probability density function describing the distribution of data in X, and Vec : C^{d×d} → C^{d^2} is the vectorization map that acts by stacking the transposed rows of a square matrix into a column vector. Σ is analogous to a classical covariance matrix, and here determines the correlation between components of ρ(x) (with the second equality holding when ρ(x) is a pure state for all x). From Eq. (8) describing the classical Fourier features, it is straightforward to compute E_{x∼[0,1]}[ϕϕ†] = I_d, demonstrating that the classical Fourier features are indeed orthonormal. However, under the identification of the feature vector ϕ(x) → Vec(ρ(x)) in the quantum case, the same is not true for the quantum model:

Thus Σ is not diagonal in general, as many components of S(x)|Γ⟩ each contribute to a single frequency k. Nor will it be possible in general to construct a quantum feature map via unitary operations that acts as an orthogonal Fourier features vector. As Σ is positive semidefinite by construction, there exists a spectral decomposition Σ = W D W† with diagonal D and d^2 × d^2 unitary W. However, the linear operator Φ ∈ L(d) acting on d-dimensional states according to Vec(Φ(ρ)) = W† Vec(ρ(x)) will not be unitary in general, and thus it may not be possible to prepare ρ in such a way that the elements of Vec(ρ(x)) consist of orthonormal Fourier features.

B.2 The effect of state preparation on feature weights in the general quantum model
We now discuss how the state preparation unitary U may affect the feature weights ν_k^opt in a quantum model, and what choices one can make to construct a U that gives rise to a specific distribution of feature weights ν_k^opt. As an example, we consider the general quantum model using the Hamming encoding strategy and an input state |Γ⟩. Computing the feature weights ν_k^opt for the Hamming encoding strategy depends on the amplitudes γ_i and γ_j for which the weights of the indices i, j differ by k. We have We will now show how the feature weight ν_k^opt may be computed with respect to a rebalanced input state which distributes each amplitude γ_i among all computational basis states with index having weight w(i). We define this rebalanced state on n qubits as where W(i) = {j : j ∈ {0, 1}^n, w(j) = i} is the set of weight-i indices and |W(i)| = (n choose i).
Observing that the amplitudes of Therefore, the feature weights computed from |Γ⟩ = U|0⟩ and from the rebalanced state |Γ′⟩ are identical. We can emphasize the significance of this observation by rewriting |Γ′⟩ of Eq. (191) as where |Φ_i⟩ = Σ_{j∈W(i)} |j⟩ is an (unnormalized) superposition of bitstrings with weight i.
The state |Γ′⟩ of Eq. (199) has only n + 1 real parameters ϕ_i, and is invariant under operations that are restricted to act within the subspace spanned by the components of |Φ⟩. This invariance greatly reduces the class of unitaries U that affect the distribution of feature weights ν_k^opt and enables some degree of tuning for these parameters.

B.2.1 Vanishing gradients in preparing feature weights
We briefly remark on the possibility of training the feature weights ν_k^opt of the general quantum model variationally. For example, one could consider defining a desired distribution of feature weights as ϕ_k ∈ R^{|Ω|} and then attempting to tune the parameters of U in order to minimize a cost function such as We will now demonstrate that for a sufficiently expressive class of state preparation unitaries U, such an optimization problem will be difficult on average. For simplicity, we follow the original formulation of barren plateaus presented in Ref. [23]: let U be defined with respect to a set of parameters θ ∈ R^L as with U(θ_ℓ) = exp(iθ_ℓ V_ℓ), V_ℓ being a d-dimensional Hermitian operator and W_ℓ being a d-dimensional unitary operator. Pick a parameter θ_m, define U_+ = Π_{ℓ=m}^{L} U(θ_ℓ)W_ℓ and U_− = Π_{ℓ=1}^{m} U(θ_ℓ)W_ℓ, and observe that We can compute the derivative of ν_k with respect to θ_m using the chain rule: where ρ_− = U_−|0⟩⟨0|U_−† and H_x = U_+†|x⟩⟨x|U_+, and we have used the equality We now show that each term in this sum vanishes by following Ref. [23] in letting U_− be sampled from a distribution that forms a unitary 2-design: where in line (209) we have used a common expression for E_{U∼U(d)}[ρ ⊗ ρ] in terms of the projector onto the symmetric subspace (e.g., [62]). Substituting this result into Eq. (205), we find By extension, the gradient of a loss function of the form of Eq. (200) will vanish for expressive enough state-preparation unitaries U, suggesting that solving for a choice of U to induce a specific distribution of feature weights ν_k^opt will be infeasible in practice.

C Determining the degeneracy of quantum models
Here we develop a theoretical framework for manipulating the degeneracy (and therefore the feature weights) of quantum models. We begin by choosing the data-encoding This introduces many degeneracies into the spectrum; using T_n r = k, we immediately recover the result of Ref. [30] that the unique elements of k are given by Ω = {−n, −(n − 1), . . ., 0, . . ., n − 1, n}, with |Ω| = 2n + 1. To recover the degeneracy of each frequency we compute It is also possible to construct a nondegenerate model. This can be achieved by setting the diagonal of the data-encoding Hamiltonian to the elements of a Golomb ruler [45,63]. The resulting spectrum is nondegenerate for all nonzero frequencies, though it is straightforward to prove that one cannot achieve uniform spacing in the spectrum (i.e., a perfect Golomb ruler) for d ≥ 5. As a result, the corresponding spectrum of this model generally exhibits gaps between frequencies. Further exploration of the connections to concepts from radio engineering [64,65] and classical coding theory [66] may enrich investigations into the spectral properties of quantum models.
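The Golomb-ruler construction is easy to check directly. The sketch below (our own illustration) uses an optimal order-5 Golomb ruler as the diagonal of the data-encoding Hamiltonian and verifies that all nonzero frequencies are nondegenerate while the spectrum exhibits gaps.

```python
from itertools import combinations

ruler = [0, 1, 4, 9, 11]                   # optimal order-5 Golomb ruler; H = diag(ruler)
diffs = [b - a for a, b in combinations(ruler, 2)]

# Golomb property: every positive pairwise difference (frequency) occurs once
assert len(set(diffs)) == len(diffs)

# The full spectrum contains 0 and the +/- differences, but has gaps:
spectrum = sorted(set(diffs) | {0} | {-x for x in diffs})
assert len(spectrum) == 2 * len(diffs) + 1    # 21 frequencies in total
assert 6 not in diffs                         # e.g. frequency 6 is absent
```

Here the spectrum covers only 21 of the 23 integers in [−11, 11], illustrating the gaps that arise because no perfect Golomb ruler exists at this order.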

D Demonstration of benign overfitting in the general quantum model
In this section we demonstrate an example of benign overfitting in the general quantum model by explicitly constructing a sequence of state preparation unitaries U for a particular choice of data-encoding unitary S(d).
for k ∈ Ω, and we will require d to be even for simplicity. Suppose the target function spectrum has size n_0 < n < d for some integer n and satisfies (n_0 + 1) mod 4 = 0. We define the relevant constants, choose a constant a ∈ [0, 2/(n_0 + 1)], and prepare |Γ⟩ with elements given by Eq. (241), where normalization fixes the remaining amplitudes. For this encoding strategy, the optimal feature weights of the quantum model corresponding to positive frequencies k ∈ Ω_+ follow from counting arguments and algebraic simplification. It follows that the minimum-∥·∥_F interpolating quantum model with state-preparation unitary U satisfying Eq. (241) (up to permutations), with data encoded using the binary encoding strategy, will exhibit benign overfitting as long as the dimensionality of the model scales as d = ω(n) (i.e., the number of qubits satisfies n_q = ω(log n)), in which case Eqs. (248) and (256), which characterize the generalization error of the model, both vanish in the large-n limit.
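The qualitative behavior of this construction can be mimicked with a classical Fourier-features sketch: interpolate noisy samples of a band-limited target with a minimum-norm model whose feature weights are heavy on the target band and carry a thin 1/d tail elsewhere. The weights below are an illustrative stand-in for the pattern of ν_k^opt, not the exact values of Eq. (244):

```python
import numpy as np

def min_norm_fit(x_train, y_train, freqs, weights):
    """Minimum-l2-norm interpolation with weighted Fourier features
    phi_k(x) = sqrt(w_k) * exp(2*pi*i*k*x)."""
    Phi = np.sqrt(weights) * np.exp(2j * np.pi * np.outer(x_train, freqs))
    coef = np.linalg.pinv(Phi) @ y_train  # minimum-norm interpolating coefficients
    return lambda x: np.real(
        (np.sqrt(weights) * np.exp(2j * np.pi * np.outer(x, freqs))) @ coef)

rng = np.random.default_rng(0)
n, n0, d = 31, 5, 501                     # samples, target band, model size (d >> n)
freqs = np.arange(-(d // 2), d // 2 + 1)
g = lambda x: np.sin(2 * np.pi * 2 * x)   # band-limited target inside |k| <= n0

x_tr = np.arange(n) / n
y_tr = g(x_tr) + 0.3 * rng.standard_normal(n)

# Engineered feature weights: heavy on the target band, thin 1/d tail elsewhere.
w = np.where(np.abs(freqs) <= n0, 1.0, 1.0 / d)

f = min_norm_fit(x_tr, y_tr, freqs, w)
x_te = rng.uniform(0, 1, 2000)
print("train err:", np.mean((f(x_tr) - y_tr) ** 2))  # ~0: the model interpolates
print("test  err:", np.mean((f(x_te) - g(x_te)) ** 2))
```

The model reaches (numerically) zero training error on the noisy samples, yet its test error against the clean target stays far below the noise variance: the heavy in-band weights recover g while the thin tail absorbs the noise via localized spikes, the classical analogue of the benign overfitting induced by |Γ⟩.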
A convenience of this demonstration of benign overfitting in a quantum model is that the state |Γ⟩ of Eq. (241) incorporates knowledge of the target function (namely its bandlimited spectrum Ω_{n_0}). This could be considered a limitation of the approach, as it imposes an inductive bias on the resulting interpolating model f^opt, in contrast to other examples of benign overfitting that are more agnostic to the underlying distribution [6,59]. Future work could reveal choices of |Γ⟩ that are more data-independent (with no explicit dependence on n_0) but give rise to feature weights ν_k^opt with the same desirable properties as Eq. (244). As described in Appendix A.3.2, these desirable properties include placing more weight on all ν_k with k ∈ Ω_{n_0} and leaving a long, thin tail of feature weights for all other k ∈ Ω_d \ Ω_{n_0}.

Figure 1: Intuition behind the phenomenon of benign overfitting with spiking models. a Training a model typically involves a trade-off between fitting noisy training data well and recovering an underlying target function. b Traditional learning theory associates interpolating models (which reach zero training error by fitting every data point) with low generalization capability (high test error), since they fail to recover a simple target function. c "Spiking" models that change quickly in the vicinity of training data but otherwise exhibit simple behavior explain how both kinds of error may be kept low.

Figure 2: Comparison of overparameterized and simple models that interpolate data from different target functions. a Noise only: The overparameterized model (n = 7, d = 35, blue) is more effective at recovering the target function g(x) = 0 when provided with noisy data, while a band-limited model (red) cannot do so effectively. Inset: The trainable weights of the minimum-∥α∥_2 model are distributed among many aliases, suppressing the error of the optimized model (proportional to Σ_j |f̂_j|²) in this case. b Signal only: The opposite occurs in the case of noiseless input, for which a band-limited model (n_0 = 5) perfectly recovers g(x) = sin(2π(2x)) while the overparameterized model fails to capture the behavior of the input signal. Inset: Minimizing the ℓ_2 norm of α distributes the Fourier coefficient ĝ_2 among the aliases f̂_{2+ℓn} (and ĝ_{−2} among the aliases f̂_{−2+ℓn}); this bleeding of the signal g into higher frequencies results in higher error in the overparameterized model [9,11].
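The aliasing mechanism behind panel b is easy to verify numerically: on n uniformly spaced samples, frequencies k and k + ℓn are indistinguishable, and the minimum-ℓ_2-norm interpolant splits a pure tone's coefficient equally among its aliases. A minimal sketch (illustrative, not the paper's code):

```python
import numpy as np

n = 8                                     # number of uniform samples
x = np.arange(n) / n

# Frequencies k and k + n are indistinguishable on the sample grid:
assert np.allclose(np.exp(2j * np.pi * 2 * x), np.exp(2j * np.pi * (2 + n) * x))

# Minimum-l2 interpolation of a pure tone splits its coefficient among aliases.
freqs = np.arange(0, 3 * n)               # d = 24 > n uniform-weight features
Phi = np.exp(2j * np.pi * np.outer(x, freqs))
y = np.exp(2j * np.pi * 2 * x)            # signal g(x) = e^{2*pi*i*2x}
c = np.linalg.pinv(Phi) @ y               # minimum-norm interpolating coefficients
print(np.round(np.abs(c), 3))             # weight 1/3 at k = 2, 10, 18; 0 elsewhere
```

Because the three alias columns are identical on the grid, the minimum-norm solution has no way to prefer the true frequency k = 2; this is exactly the "bleeding" of ĝ_2 into f̂_{2+ℓn} shown in the figure inset.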

Figure 3: The feature weights ν_k may be engineered to exhibit a bias-variance tradeoff and benign overfitting. a A demonstration of feature weights ν_k that account for prior knowledge of the target function g, with large ν_k for k ∈ Ω_{n_0} and small ν_k for k ∈ Ω_d \ Ω_{n_0}. b After sampling a bandlimited target function (n_0 = 15) at a fixed number of uniformly spaced points (n = 31), the generalization error of the minimum-ℓ_2-norm model f^opt consists of a tradeoff between decreasing var and increasing bias² for larger d. Note the peaking behavior of var at d/n = 1. c This choice of feature weights ν_k benignly overfits the input signal, such that lim_{n,d→∞} L(f^opt) = 0 for any scaling d = n^α with α > 1, but fails to generalize on the interval [0, 1] when α ≤ 1.

Figure 5: Benign overfitting for the general quantum model. a For the Hamiltonian H = diag(0, 1, . . ., d − 1) we construct a simple input state |Γ_1⟩ (Eq. (35)) that depends on n_0 and d (see Appendix D), compared to the uniform state |Γ_0⟩ = (1/√d) Σ_k |k⟩. b |Γ_1⟩ induces feature weights ν_k^opt in the optimized model that are heavily concentrated on the target function spectrum Ω_{n_0} (gray band) and decay like 1/d and 1/d² elsewhere for k ∈ Ω_d \ Ω_{n_0}, in contrast to the feature weights induced by |Γ_0⟩, which have similar magnitude over all of Ω_d. c Plotting the ∥M∥_F-minimizing quantum model f_1^opt with input state |Γ_1⟩ for a sample signal (d = 128, n = 31) highlights how |Γ_1⟩ and H work together to interpolate the y_j via high-frequency "spiky" behavior near the sampled points while otherwise closely matching the underlying signal g(x). d The resulting quantum model benignly overfits the band-limited target function: as long as d = ω(n), the generalization error L(f_1^opt) of the minimum-∥M∥_F interpolating quantum model vanishes for large enough sample size n.
which follows from e^{i2πℓk} = 1 for k ∈ Z. The operator P_{S(k)}: C^d → C^d is the projector onto the set of standard basis vectors of C^d with indices in S(k), and (w)_k := √ν_k. The roots of unity are orthonormal since (1/n) Σ_{k=0}^{n−1} e^{i2πk(j−j′)/n} = δ_{jj′}.
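The orthonormality of the roots of unity is simply the statement that the normalized DFT matrix is unitary, which can be sanity-checked in a few lines of numpy:

```python
import numpy as np

n = 7
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = np.exp(2j * np.pi * j * k / n) / np.sqrt(n)  # normalized roots-of-unity matrix

# Orthonormality: (1/n) * sum_k e^{2*pi*i*k*(j-j')/n} = delta_{j j'}
assert np.allclose(F @ F.conj().T, np.eye(n))
```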

Table 2:
The size of the subset of R(k) × R(k) with a single repeated index, |{(i, j, ℓ, m) : m ≠ i, ℓ = j, (i, j), (ℓ, m) ∈ R(k)}|, computed for different encoding strategies, together with the corresponding scaling of the variance Var(ν_k^opt) of the feature weights (all entries assume k ≠ 0). The computation for the Hamming encoding strategy works as follows: each bitstring i with weight p (of which there are C(n, p)) is paired with C(n, p + k) bitstrings j of weight p + k. Taking ℓ = j, there are then C(n, p + 2k) bitstrings m with weight (p + k) + k = p + 2k.
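The Hamming-encoding count described in the caption can be cross-checked by brute force. Assuming (as an illustrative reading, not a definition from the paper) that R(k) collects ordered pairs of n-bit basis states whose Hamming weights differ by k, the count equals the binomial sum Σ_p C(n, p) C(n, p + k) C(n, p + 2k):

```python
from itertools import product
from math import comb

def brute_count(n, k):
    """|{(i, j, l, m) : l = j, m != i, (i, j), (l, m) in R(k)}| by enumeration,
    where R(k) = ordered pairs of n-bit strings with wt(j) - wt(i) = k."""
    wt = [sum(s) for s in product((0, 1), repeat=n)]
    R = [(i, j) for i in range(2 ** n) for j in range(2 ** n) if wt[j] - wt[i] == k]
    Rset = set(R)
    return sum(1 for (i, j) in R for m in range(2 ** n)
               if m != i and (j, m) in Rset)

def formula(n, k):
    # math.comb returns 0 when the lower index exceeds n, so the sum truncates.
    return sum(comb(n, p) * comb(n, p + k) * comb(n, p + 2 * k)
               for p in range(n + 1))

print(brute_count(4, 1), formula(4, 1))  # both give 144
```

Note that for k ≠ 0 the condition m ≠ i is automatic, since wt(m) = wt(i) + 2k ≠ wt(i), which is why the binomial product needs no correction term.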