Encoding-dependent generalization bounds for parametrized quantum circuits

A large body of recent work has begun to explore the potential of parametrized quantum circuits (PQCs) as machine learning models, within the framework of hybrid quantum-classical optimization. In particular, theoretical guarantees on the out-of-sample performance of such models, in terms of generalization bounds, have emerged. However, none of these generalization bounds depend explicitly on how the classical input data is encoded into the PQC. We derive generalization bounds for PQC-based models that depend explicitly on the strategy used for data-encoding. These imply bounds on the performance of trained PQC-based models on unseen data. Moreover, our results facilitate the selection of optimal data-encoding strategies via structural risk minimization, a mathematically rigorous framework for model selection. We obtain our generalization bounds by bounding the complexity of PQC-based models as measured by the Rademacher complexity and the metric entropy, two complexity measures from statistical learning theory. To achieve this, we rely on a representation of PQC-based models via trigonometric functions. Our generalization bounds emphasize the importance of well-considered data-encoding strategies for PQC-based models.


Introduction
Recent years have witnessed a surge of interest in the question of whether and how quantum computers can meaningfully address computational problems in machine learning [1,2]. This development has been largely driven by two factors. On the one hand, there is evidence that some quantum machine learning algorithms may lead to an increased performance over classical algorithms for the analysis of classical data with respect to important figures of merit [3][4][5][6][7]. On the other hand, the increasing availability of quantum computational devices provides significant stimulus. While these "noisy intermediate-scale quantum" (NISQ) devices are still a far cry from full-scale fault-tolerant quantum computers, there exists growing evidence that they may be able to out-perform classical computers on some highly-tailored tasks [8]. Given the inherent limitations of NISQ devices, most current approaches to near-term quantum-enhanced machine learning fall under the umbrella of hybrid quantum-classical algorithms [9]. Of particular prominence are variational quantum algorithms in which a parametrized quantum circuit (PQC) is used to define a machine learning model which is then updated via a classical optimizer [10][11][12].
There is a wealth of architectural choices for PQC-based machine learning models. These include the width and depth of the quantum circuit, the precise layout and structure of trainable gates, as well as the mechanism via which classical data is encoded into the quantum circuit. The flexibility in design choices for PQCs is often only perceived strongly in terms of the structure and layout of the trainable gates [13,14]. However, when using a PQC to define a machine learning model for classical data, the data-encoding strategy becomes a necessary architectural choice for a given problem.

Structure of this work
This work is structured as follows: Section 2 gives a pedagogical introduction to statistical learning theory, explains the importance of generalization bounds, and discusses the structural risk minimization principle. After establishing these concepts, we formulate the main questions addressed in this work. In Section 3, we begin by introducing the PQC-based learning models used in this work. We then present a detailed discussion of the approach of Ref. [33], which demonstrates how the functions implemented by a PQC-based model can be represented by generalized trigonometric polynomials. In particular, we emphasize how the data encoding strategy of the PQC-based model translates to the accessible frequencies of the generalized trigonometric polynomials. Section 4 then provides a detailed review of prior work on generalization in quantum machine learning. In Section 5, we establish generalization bounds for classes of generalized trigonometric polynomials in terms of the number of accessible frequencies. We present one approach via the Rademacher complexity (Section 5.1) and another via covering numbers (Section 5.2). Section 6 then expands upon Section 3 by deriving upper bounds on the number of accessible frequencies, in the generalized trigonometric polynomial representation of the PQC-based models associated with different data-encoding strategies. This analysis allows us to use the results from Section 5 to state explicitly encoding-dependent generalization bounds for PQC-based models, and to compare different encoding strategies from a generalization perspective. We discuss the implications of our results in Section 7. In particular, we emphasize how our results are complementary to many prior works, but also describe how the different approaches can be combined. Additionally, we sketch some directions for future research. Section 8 contains a short summary of our work. The logical flow of this manuscript is visualized in Figure 1.

Motivation: Generalization bounds, sample complexities and model selection
To motivate the content of this work and to define the setting, we start with a brief and selective introduction to the framework of statistical learning theory. Interested readers are referred to Refs. [20] and [34] for a more detailed presentation. Within this framework, any supervised learning problem is defined by a domain $\mathcal{X}$, a co-domain $\mathcal{Y}$, a probability distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. We assume that $\mathcal{X}$, $\mathcal{Y}$ and $\ell$ are known, while $P$ is unknown. We will denote the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$ as $\mathcal{Y}^{\mathcal{X}}$. To gain intuition, it is useful to think of the situation in which there exists a deterministic rule for assigning predictions to domain elements. We can model this in the framework outlined above with an unknown target function $f \in \mathcal{Y}^{\mathcal{X}}$, as well as some unknown probability distribution $P_{\mathcal{X}}$ over $\mathcal{X}$, such that samples from $P$ are obtained by first drawing a domain element $x \in \mathcal{X}$ from $P_{\mathcal{X}}$, and then outputting the tuple $(x, f(x))$, i.e.,
$$P(x, y) = \begin{cases} P_{\mathcal{X}}(x) & \text{if } y = f(x), \\ 0 & \text{otherwise.} \end{cases}$$
In general, however, it may be the case that there exist $y_1 \neq y_2$ for which both $P(x, y_1) > 0$ and $P(x, y_2) > 0$, i.e., that the underlying process for labeling data points is not deterministic. Additionally, we are given a training data set
$$S = \{(x_i, y_i)\}_{i=1}^{m}$$
of $m$ tuples drawn independently from (the unknown distribution) $P$, and our goal is to design a learning algorithm $\mathcal{A}$ which, given $S$ as input, outputs a hypothesis $h \in \mathcal{Y}^{\mathcal{X}}$ that achieves a sufficiently small risk
$$R(h) := \mathbb{E}_{(x, y) \sim P}\left[\ell(y, h(x))\right].$$
Informally, we often refer to the risk $R(h)$ as characterizing the out-of-sample performance of the hypothesis $h$, as it is this quantity which tells us how well we can expect the hypothesis $h$ to perform on (possibly previously unseen) future data drawn from $P$. It is critical to note, however, that as the underlying probability distribution $P$ is unknown, given a hypothesis $h \in \mathcal{Y}^{\mathcal{X}}$, one cannot directly evaluate $R(h)$. In light of this, a natural alternative is to evaluate the empirical risk of $h$ with respect to $S$, which is defined as the average loss over the training samples
$$\hat{R}_S(h) := \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h(x_i)).$$
In contrast to the risk $R(h)$, the empirical risk $\hat{R}_S(h)$ characterizes the in-sample performance of $h$ with respect to the data set $S$, which has been sampled from $P$. Naively, one might hope to be able to construct learning algorithms which could in principle output any $h \in \mathcal{Y}^{\mathcal{X}}$. However, the "no-free-lunch" theorem rules out the possibility of meaningful learning in this case [35], and therefore we typically consider learning algorithms whose range is some subset $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$. We then refer to $\mathcal{F}$ as the hypothesis class associated with the learning algorithm which is, by assumption, also known to the learning algorithm. To gain some intuition, one could think of $\mathcal{F}$ as the set of all functions realizable by neural networks of some fixed width and depth, or, as we describe in Section 3, as the set of all functions realizable by a parametrized quantum circuit model with some fixed architecture.
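As a concrete illustration of the distinction between the risk and the empirical risk, the following minimal sketch evaluates both for a toy problem. The target function, the hypothesis, the squared loss and the uniform data distribution are all assumptions chosen purely for illustration; the true risk is only approximated here by a Monte Carlo estimate on a large fresh sample, which is possible only because the toy distribution is known.

```python
import numpy as np

def empirical_risk(h, xs, ys, loss):
    """Average loss of hypothesis h over the sample {(x_i, y_i)}."""
    return float(np.mean([loss(y, h(x)) for x, y in zip(xs, ys)]))

def squared_loss(y, y_pred):
    return (y - y_pred) ** 2

rng = np.random.default_rng(0)

# Assumed toy setting: deterministic target f(x) = sin(x), P_X uniform on [0, 2*pi).
xs_train = rng.uniform(0.0, 2 * np.pi, size=50)
ys_train = np.sin(xs_train)

# A hypothesis that slightly underestimates the target amplitude.
h = lambda x: 0.9 * np.sin(x)

# In-sample performance on the 50 training points.
emp = empirical_risk(h, xs_train, ys_train, squared_loss)

# Monte Carlo proxy for the true risk, using a large fresh sample.
xs_fresh = rng.uniform(0.0, 2 * np.pi, size=100_000)
risk_est = empirical_risk(h, xs_fresh, np.sin(xs_fresh), squared_loss)

print(f"empirical risk: {emp:.5f}, estimated true risk: {risk_est:.5f}")
```

For this toy choice the true risk can be computed by hand as $0.01 \cdot \mathbb{E}[\sin^2(x)] = 0.005$, so the printed estimate should lie close to that value, while the empirical risk on 50 points fluctuates around it.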
With respect to this setting, the following natural question arises: Suppose we have a learning algorithm $\mathcal{A}$ with hypothesis class $\mathcal{F}$, which has been run on a randomly drawn data set of $m$ samples $S \sim P^m$ and outputs some hypothesis $h \in \mathcal{F}$, as well as some "training log" which we denote by $\mathrm{hist}(\mathcal{A}, S)$. Given the achieved empirical risk $\hat{R}_S(h)$, can we put an upper bound on the true risk $R(h)$, which holds with high probability over the randomly drawn data set $S$? More specifically, can we make a statement of the form: For all $\delta \in (0, 1)$, with probability $1 - \delta$ over $S \sim P^m$, for all $h \in \mathcal{F}$ we have that
$$R(h) \le \hat{R}_S(h) + g(\mathcal{F}, h, m, S, \mathcal{A}, \mathrm{hist}(\mathcal{A}, S), \delta). \tag{5}$$
We refer to such a statement as a generalization bound, and note that the function $g$ appearing in Eq. (5) provides a (probabilistic) upper bound on the quantity $R(h) - \hat{R}_S(h)$, which we call the generalization gap (of $h$ with respect to $S$). Such bounds are desirable because they allow us to leverage the information we have access to (i.e., the empirical risk, and properties of the learning algorithm, data set and optimization procedure) to upper bound $R(h)$, which is the quantity we do not have access to, but are ultimately interested in. In general, as indicated explicitly in Eq. (5), the upper bound $g$ on the generalization gap could depend on properties of the achieved hypothesis $h$, properties of the data set $S$, properties of the learning algorithm $\mathcal{A}$, and details of the optimization that led to $h$. However, in this work we will focus on uniform generalization bounds of the form: for all $\delta \in (0, 1)$, with probability $1 - \delta$ over $S \sim P^m$, we have for all $h \in \mathcal{F}$ that
$$R(h) \le \hat{R}_S(h) + g(\mathcal{F}, m, \delta).$$
To be specific, we focus on generalization bounds for which the upper bound on the generalization gap (i.e., the function $g$) depends only on properties of the hypothesis class $\mathcal{F}$, the data set size $m$ and the desired probability $\delta$. We note that the term "uniform" is used when describing such generalization bounds to indicate that, with respect to a fixed data set size $m$ and probability threshold $\delta$, the upper bound on the generalization gap will be the same (i.e., uniform) for all $h \in \mathcal{F}$. While it is known that there exist scenarios in which uniform generalization bounds are not tight [36,37], we postpone a discussion of these issues to Section 7. As motivated above, given a uniform generalization bound for a hypothesis class $\mathcal{F}$, one typical application is as follows: Given a data set $S$ sampled from $P$, with $|S| = m$, run some learning algorithm to obtain a hypothesis $h \in \mathcal{F}$, evaluate its empirical risk $\hat{R}_S(h)$, and then use the generalization bound to place a (probabilistic) upper bound on the true risk $R(h)$.
However, we can also often straightforwardly use such a generalization bound to answer the following natural question: Given some $\epsilon > 0$ and some $\delta \in (0, 1)$, what is the minimum size of $S$ sufficient to ensure that, with probability $1 - \delta$, for all $h \in \mathcal{F}$, the generalization gap satisfies $R(h) - \hat{R}_S(h) \le \epsilon$? To see this, note that if we have a uniform generalization bound, then by setting
$$g(\mathcal{F}, m, \delta) = \epsilon \tag{7}$$
and solving for $m$, it is often possible to find some function $f(\epsilon, \delta, \mathcal{F})$ such that, with probability $1 - \delta$ over $S \sim P^m$,
$$m \ge f(\epsilon, \delta, \mathcal{F}) \implies R(h) - \hat{R}_S(h) \le \epsilon \quad \text{for all } h \in \mathcal{F}. \tag{8}$$
As the generalization bound may not be tight, we therefore see that $f(\epsilon, \delta, \mathcal{F})$ provides an upper bound on the minimum size of $S$ sufficient to probabilistically guarantee a generalization gap less than $\epsilon$ for all $h \in \mathcal{F}$. Finally, apart from the fundamental applications of allowing us to bound the out-of-sample performance of a hypothesis, or upper bound the minimum sample size sufficient to guarantee a certain generalization gap, generalization bounds also allow us to address the issue of model selection, via the framework of structural risk minimization [20]. Importantly, we note that one cannot use the upper bound $g$ on the generalization gap alone for model selection: A trivial learning model, which outputs the same hypothesis independently of the input data, has a vanishing generalization gap bound, but cannot achieve good prediction performance on interesting tasks. Structural risk minimization thus suggests combining a generalization bound with an empirical risk evaluation on a specific data set to choose the model with the smallest upper bound on the true risk. More specifically, let us assume that our hypothesis class depends on some "architectural hyper-parameter" $k$, with some notion of ordering such that
$$k_1 \le k_2 \implies \mathcal{F}_{k_1} \subseteq \mathcal{F}_{k_2}. \tag{9}$$
For example, $\mathcal{F}_k$ could be the set of all neural networks of fixed width and depth $k$. Given this, how should we choose the hypothesis class (or model complexity) that we use for a given learning problem?
As illustrated in Figure 2, generalization bounds, when combined with empirical risk evaluations, can allow us to answer this question. In particular, assume that we have a uniform generalization bound of the form: For all $\delta \in (0, 1)$, with probability $1 - \delta$ over $S \sim P^m$, for all $h \in \mathcal{F}_k$,
$$R(h) \le \hat{R}_S(h) + g(k, m, \delta),$$
where $g(k, m, \delta)$ is non-decreasing with respect to $k$. Here, we have written $g(k, m, \delta)$ rather than $g(\mathcal{F}_k, m, \delta)$ to emphasize the assumption that the hyper-parameter $k$ is the only property of $\mathcal{F}_k$ on which the generalization bound depends explicitly.

Figure 2: Illustration of structural risk minimization (adapted from Ref. [20]). Increasing the complexity of a hypothesis class typically allows one to obtain hypotheses with decreasing empirical risk. However, in many cases increasing the complexity of a hypothesis class also leads to a larger upper bound on the generalization gap. Structural risk minimization aims to identify a hypothesis with the smallest upper bound on the true risk that quantifies the out-of-sample performance by combining an evaluation of the empirical risk of candidate hypotheses with an upper bound on the generalization gap of the relevant hypothesis class.

While increasing $k$ increases the expressivity of the hypothesis class and therefore typically leads to smaller empirical risk, it also increases the upper bound $g(k, m, \delta)$ on the generalization gap and may therefore lead to hypotheses with worse out-of-sample performance. As such, a natural strategy to find an optimal hypothesis, in the sense of having the smallest probabilistic upper bound on the true risk, is as follows:
1. For $k$ in $\{k_1, \ldots, k_n\}$, run the learning algorithm $\mathcal{A}_k$, with hypothesis class $\mathcal{F}_k$, and obtain the hypothesis $h_k$.
2. For each $h_k$, evaluate the upper bound $\hat{R}_S(h_k) + g(k, m, \delta)$ on the true risk.
3. Output the hypothesis $h_{k_{\mathrm{opt}}}$, where $k_{\mathrm{opt}}$ minimizes $\hat{R}_S(h_k) + g(k, m, \delta)$ over $k \in \{k_1, \ldots, k_n\}$.
We refer to such a procedure as structural risk minimization, and contrast this with empirical risk minimization, which simply outputs the hypothesis minimizing the empirical risk. In light of the above discussion, we note that, given a family of hypothesis classes $\{\mathcal{F}_k\}$, each specified by some architectural hyper-parameter $k$ and satisfying the condition of Eq. (9), we would ideally like to obtain an upper bound on the generalization gap $g(k, m, \delta)$ which grows slowly with respect to $k$. In particular, we can now understand this from two different but complementary perspectives: Firstly, from the structural risk minimization (or model selection) perspective, we see from Figure 2 that slow growth of $g(k, m, \delta)$ is indicative of our ability to exploit the expressivity of more complex hypothesis classes, i.e., those with larger $k$, without risking poor generalization performance due to overfitting. More specifically, under the assumption of monotonically decreasing empirical risk, the slower $g(k, m, \delta)$ grows, the longer we can expect the quantity $\hat{R}_S(h_k) + g(k, m, \delta)$ to decrease before reaching a minimum, and therefore the smaller we can expect our ultimate upper bound on the true risk of the optimal hypothesis $h_{k_{\mathrm{opt}}}$ to be. In contrast, if $g(k, m, \delta)$ grows too fast with respect to $k$, then even if we can achieve very small empirical risk by increasing model complexity, we do not expect to be able to achieve a sufficiently small upper bound on the true risk of the optimal hypothesis $h_{k_{\mathrm{opt}}}$.
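Structural risk minimization, i.e., training one hypothesis per value of $k$ and returning the one that minimizes the sum of empirical risk and generalization gap bound, can be sketched as follows. This is only an illustrative sketch under stated assumptions: the hypothesis classes are taken to be real trigonometric polynomials of degree at most $k$ fitted by least squares, and `g_placeholder` is a hypothetical stand-in with a generic square-root form. It is not one of the bounds derived in this work, and the data set is synthetic.

```python
import numpy as np

def features(x, k):
    """Real trigonometric features 1, cos(jx), sin(jx) for j = 1..k."""
    cols = [np.ones_like(x)]
    for j in range(1, k + 1):
        cols += [np.cos(j * x), np.sin(j * x)]
    return np.stack(cols, axis=1)

def fit_erm(x, y, k):
    """Empirical risk minimization over F_k (squared loss, least squares)."""
    A = features(x, k)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    emp_risk = float(np.mean((A @ coeffs - y) ** 2))
    return coeffs, emp_risk

def g_placeholder(k, m, delta):
    """Hypothetical generalization gap bound, non-decreasing in k.
    Placeholder only: NOT a bound derived in the text."""
    return float(np.sqrt((2 * k + 1 + np.log(1.0 / delta)) / m))

def structural_risk_minimization(x, y, ks, delta=0.05):
    m = len(x)
    best = None
    for k in ks:
        coeffs, emp = fit_erm(x, y, k)              # obtain h_k
        bound = emp + g_placeholder(k, m, delta)    # upper bound on true risk
        if best is None or bound < best[1]:
            best = (k, bound, coeffs)
    return best                                     # hypothesis with smallest bound

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2 * np.pi, size=200)
y = np.sin(x) + 0.5 * np.cos(2 * x) + 0.1 * rng.normal(size=200)

k_opt, bound_opt, _ = structural_risk_minimization(x, y, ks=range(0, 9))
print(k_opt, round(bound_opt, 3))
```

In this synthetic example the empirical risk stops decreasing appreciably once $k$ reaches the degree of the underlying signal, so the penalty term dominates for larger $k$ and the procedure selects a moderate model complexity, in contrast to empirical risk minimization, which would always prefer the largest $k$.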
Secondly, from the sample complexity perspective, let us denote by $f(\epsilon, \delta, k)$ the complementary upper bound on the minimum sample size $m$ sufficient to probabilistically ensure a generalization gap less than $\epsilon > 0$, which typically follows from $g(k, m, \delta)$ (as we recall from the discussion around Eqs. (7) and (8)). As we naturally expect $g(k, m, \delta)$ to be decreasing with increasing $m$, slow growth of $g(k, m, \delta)$ with respect to $k$ typically implies slow growth of $f(\epsilon, \delta, k)$ with respect to $k$. In other words, slow growth of $g(k, m, \delta)$ typically implies slow growth, with respect to model complexity, of the minimum amount of data one has to use before being able to probabilistically guarantee a certain generalization gap for all output hypotheses. As generating data (i.e., sampling from the distribution $P$) may be expensive or difficult, and as the run-time of learning algorithms typically scales with respect to the data set size, slow growth of $g(k, m, \delta)$ therefore facilitates the process of learning with models of higher complexity.
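To make the passage from a generalization gap bound to a sample complexity bound concrete, the following sketch inverts a bound for $m$. The functional form $g(k, m, \delta) = \sqrt{(k + \ln(1/\delta))/m}$ is a hypothetical placeholder chosen because it can be inverted in closed form; it is not one of the bounds proven in this work.

```python
import math

def g(k, m, delta):
    """Hypothetical generalization gap bound (placeholder form)."""
    return math.sqrt((k + math.log(1.0 / delta)) / m)

def sample_complexity(eps, delta, k):
    """Smallest integer m with g(k, m, delta) <= eps, from solving g = eps for m."""
    return math.ceil((k + math.log(1.0 / delta)) / eps ** 2)

eps, delta = 0.1, 0.05
for k in (1, 10, 100):
    m = sample_complexity(eps, delta, k)
    print(k, m, round(g(k, m, delta), 4))
```

For this placeholder the sufficient sample size grows only linearly in $k$, which illustrates how slow growth of the gap bound in $k$ translates into slow growth of the data requirement.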
Given the above observations, we can finally understand the motivation of this work in an informal way. In particular, in the following section we will see that parametrized quantum circuits (PQCs) naturally give rise to hypothesis classes with multiple architectural hyper-parameters, each reflecting a different aspect of the circuit architecture, such as circuit depth, circuit width, the total number of gates or the total number of data-encoding gates of a particular type. In Section 4 we will then see that a body of previous work has resulted in a collection of generalization bounds for PQC-based models, each of which depends explicitly on some subset of architectural hyper-parameters, but not on others. As of yet, however, there exist no generalization bounds which depend explicitly on hyper-parameters associated with the data-encoding strategy, despite the important role such strategies play in determining the expressive power of PQC-based hypothesis classes [33]. As such, the questions which we address in this work are as follows:
(a) Can we derive generalization bounds for PQC-based hypothesis classes which depend explicitly on hyper-parameters associated with the data-encoding strategy?
(b) Can we use such bounds to identify data-encoding strategies for which the upper bounds on the generalization gap grow polynomially with respect to the architectural hyper-parameter relevant to the encoding strategy?
As will be discussed in Section 7, apart from filling a gap in our understanding of the manner in which the data-encoding influences generalization, such bounds would also complement existing works, in that they would allow one to perform structural risk minimization with respect to multiple architectural hyper-parameters simultaneously. With this motivation in mind, before proceeding it is worth briefly mentioning how (uniform) generalization bounds are typically obtained. Intuitively, one might expect that the generalization performance of a hypothesis class is related to how complex (or how expressive) the hypothesis class is, and thus one might hope for the existence of a complexity measure for hypothesis classes from which generalization bounds follow. This intuition is indeed correct, and in fact a large amount of work in statistical learning theory has resulted in a variety of suitable complexity measures (such as the VC dimension [38], Rademacher complexity [39], pseudo-dimension [40] and metric entropy, amongst others), all of which directly give rise to generalization bounds [20,34,35]. As a result, given a hypothesis class $\mathcal{F}_k$, one typically proves a uniform generalization bound for $\mathcal{F}_k$, which depends explicitly on the architectural hyper-parameter $k$, by first characterizing the dependence of a suitable complexity measure $C$ on $k$ (i.e., by writing/bounding $C(\mathcal{F}_k)$ explicitly in terms of $k$), and then writing down the known generalization bound which follows from $C(\mathcal{F}_k)$. We also follow such a strategy in this work by first characterizing both the Rademacher complexity and metric entropy of PQC-based models in terms of architectural hyper-parameters related to the data-encoding strategy and then presenting generalization bounds in terms of these complexity measures. At this stage it is hopefully clear both why generalization bounds are desirable and how (at least intuitively) one might obtain such bounds.
Given this, we proceed in the following section to define more precisely the PQC-based hypothesis classes considered in this work.

Parametrized quantum circuit based model classes
Parametrized quantum circuits (PQCs) are ubiquitous in the field of near-term quantum computing [9][10][11] and can be used to construct quantum machine learning models [12]. We will consider qubit-based quantum systems. The focus of this work lies on variational quantum machine learning models that are constructed from a PQC $U_\theta(x)$ that depends on trainable parameters $\theta \in \Theta$ and on data inputs $x \in \mathcal{X}$. A prediction in the co-domain $\mathcal{Y} = \mathbb{R}$ is then obtained by evaluating the expectation value of a fixed observable $M$, which can be efficiently evaluated, as
$$f_\theta(x) := \langle 0 | U_\theta^\dagger(x) \, M \, U_\theta(x) | 0 \rangle.$$

Figure 3: Circuit model considered in this work. We assume that the circuit consists of gates which are parametrized either by the data $x$ (data-encoding gates), or the trainable parameters $\theta$ (trainable gates). The data-encoding gates are assumed to implement the time evolution of a data-encoding Hamiltonian, with evolution time given by some data coordinate $x^{(i)} = e^{(i)} \cdot x$. The model output is then given by the expectation value of an observable $M$.

In the following, we assume that the data inputs are $d$-dimensional real-valued vectors with entries in the interval $[0, 2\pi)$, i.e., $\mathcal{X} = [0, 2\pi)^d$. This choice is somewhat arbitrary, as data can always be rescaled to fit into a particular interval. However, $[0, 2\pi)$ is a natural choice because quantum gates available on actual hardware are usually parametrized in terms of angles. As will become apparent later, we need not make any assumptions on the nature of the trainable parameters, but in most cases they will also be angles, i.e., $\Theta = [0, 2\pi)^p$, where $p$ is the number of trainable parameters. We also make some assumptions on the structure of the circuit $U_\theta(x)$. Our model is motivated by the actual quantum circuits that can be executed on NISQ devices. These devices usually only allow fixed gates and parametrized evolutions under device-specific Hamiltonians [41][42][43].
In our model, the data inputs $x$ and the trainable parameters $\theta$ enter the circuit through different gates. The unitaries parametrized by $\theta$, denoted by $\{W_i(\theta)\}$, constitute the trainable part of the model. Fixed unitaries can be absorbed into the trainable unitaries.
We assume that the gates through which the data enters the circuit are time evolutions under some Hamiltonian, where the "evolution time" is given by one of the data coordinates $x^{(i)}$. We denote the $j$-th gate that encodes the data coordinate $x^{(i)}$ as
$$S_j^{(i)}(x) = e^{-i x^{(i)} H_j^{(i)}} = e^{-i (e^{(i)} \cdot x) H_j^{(i)}},$$
where we rewrote the encoding gate in terms of the input data vectors by recognizing that $x^{(i)} = e^{(i)} \cdot x$, where $e^{(i)}$ is a standard basis vector. It is of course possible to consider more general dependencies of the evolution time on the input data, for example in terms of linear combinations or even non-linear functions of the data coordinates. However, we choose not to include models with such classical pre-processing of the data, in order to isolate the part of the model which is truly quantum. Indeed, if one allowed for arbitrary pre-processing, then one could just use a very complicated neural network to find suitable evolution times for good predictions, but that would miss the point of using a quantum learning model at all. We note though that our definition still encompasses such approaches after a suitable reparametrization of the inputs, which will usually result in a larger number of input coordinates. For our analysis, no restriction on the placement of the trainable gates and the data-encoding gates in the circuit is necessary. Thus, we assume that they can be arranged arbitrarily, as depicted in Figure 3. However, we will refer to the choice of data-encoding Hamiltonians per data coordinate as $\mathcal{D}^{(i)} = \{H_j^{(i)}\}$ and call the union of these sets over all data coordinates the data-encoding strategy
$$\mathcal{D} = \bigcup_{i=1}^{d} \mathcal{D}^{(i)}.$$
The total number of encoding gates per data coordinate is $N^{(i)} = |\mathcal{D}^{(i)}|$ and the total number of data-encoding gates is $N = \sum_{i=1}^{d} N^{(i)}$. A data-encoding strategy $\mathcal{D}$ together with a fixed circuit structure and a choice of trainable gates defines a parametrized quantum circuit $U_\theta(x)$. We denote the fact that this circuit uses the encoding strategy $\mathcal{D}$ as $U_\theta(x) \sim \mathcal{D}$.
When we fix an observable $M$ to generate the predictions, this defines a function class
$$\mathcal{F}_{\Theta, \mathcal{D}, M} := \left\{ x \mapsto f_\theta(x) = \langle 0 | U_\theta^\dagger(x) \, M \, U_\theta(x) | 0 \rangle \;\middle|\; \theta \in \Theta,\ U_\theta(x) \sim \mathcal{D} \right\},$$
which is obtained by considering all possible parametrizations $\theta \in \Theta$ of the trainable gates. This function class depends explicitly on the parametrization of the trainable parts of the circuit and on the data-encoding strategy. As we ultimately want to obtain generalization bounds that depend on the hyper-parameters associated with the encoding strategy, such as the number of encoding gates $N$, it will be helpful for us to reformulate the function class in a way that makes it more amenable to the analyses in the following sections. To this end, we draw on the results of Refs. [15,33], which show that the nature of the data-encoding gates as Hamiltonian evolutions allows us to expand the model output as a generalized trigonometric polynomial (GTP). A GTP "generalizes" the notion of a trigonometric polynomial by allowing arbitrary frequencies as in
$$f_\theta(x) = \sum_{\omega \in \Omega(\mathcal{D})} c_\omega \, e^{-i \omega \cdot x}.$$
While the GTP's coefficients $\{c_\omega\}$ depend on the particular parametrization and observable, the set of frequencies $\Omega(\mathcal{D})$ depends solely on the chosen data-encoding strategy $\mathcal{D}$, in particular on the spectra of the Hamiltonians $\{H_j^{(i)}\}$. We describe the procedure for obtaining such a GTP representation in more detail below. The fact that the expectation value is always real is reflected by $c_\omega = c^*_{-\omega}$ and by the observation that $\omega \in \Omega(\mathcal{D})$ implies that also $-\omega \in \Omega(\mathcal{D})$. Additionally, we note that the absolute value of any expectation value obtained from measuring $M$ is upper bounded by its operator norm $\|M\|_\infty$, and therefore, if we assume that $\|M\|_\infty \le B$, then
$$\mathcal{F}_{\Theta, \mathcal{D}, M} \subseteq \mathcal{F}^{B}_{\Omega} := \left\{ x \mapsto f(x) = \sum_{\omega \in \Omega} c_\omega \, e^{-i \omega \cdot x} \;\middle|\; \{c_\omega\}_{\omega \in \Omega} \text{ such that } \|f\|_\infty \le B \text{ and } c_\omega = c^*_{-\omega} \text{ for all } \omega \in \Omega \right\},$$
where $\Omega = \Omega(\mathcal{D})$. We have thus defined a function class that solely depends on the data-encoding strategy. We stress that this function class subsumes all possible ways to parametrize the trainable parts of a circuit with fixed data-encoding strategy $\mathcal{D}$ and fixed observable $M$, but also goes beyond this by allowing all possible choices of observable $M$ such that $\|M\|_\infty \le B$.
Therefore, it also contains models where not only the parameters of the trainable gates, but also the measurement itself is subject to optimization. In going from $\mathcal{F}_{\Theta, \mathcal{D}, M}$ to $\mathcal{F}^{B}_{\Omega}$, we effectively allow for a universal trainable part and observable, which enables us to focus on the encoding strategy. Studying intermediate classes between $\mathcal{F}_{\Theta, \mathcal{D}, M}$ and $\mathcal{F}^{B}_{\Omega}$ could constitute a path towards tighter generalization bounds that depend on both the data-encoding and the trainable part of the PQC-based model.
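In Section 5, we will first prove generalization bounds for $\mathcal{F}^{B}_{\Omega}$, which depend explicitly on properties of $\Omega$, before exploring in detail in Section 6 how these relevant properties of $\Omega$ depend on the data-encoding strategy $\mathcal{D}$. Exploiting the fact that, for a given data-encoding strategy $\mathcal{D}$ and any observable $M$ with $\|M\|_\infty \le B$, we have $\mathcal{F}_{\Theta, \mathcal{D}, M} \subseteq \mathcal{F}^{B}_{\Omega(\mathcal{D})}$, then automatically yields explicitly encoding-dependent generalization bounds for $\mathcal{F}_{\Theta, \mathcal{D}, M}$.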
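As the connection between the data-encoding strategy $\mathcal{D}$ and the set $\Omega(\mathcal{D})$ plays a crucial role, we illustrate this connection for a generic data-encoding strategy here. We first consider the action of a single encoding evolution $S(x)$ in the density matrix picture, where it acts via the quantum channel
$$S(x)[\rho] = e^{-i (e \cdot x) H} \, \rho \, e^{i (e \cdot x) H},$$
where the Hamiltonian $H$ takes the role of any of the above Hamiltonian terms $H_j^{(i)}$ and $e$ can be any basis vector. We can expand $\rho$ in the eigenbasis of the Hamiltonian, $H |\lambda_k\rangle = \lambda_k |\lambda_k\rangle$, and obtain
$$S(x)[\rho] = \sum_{k, l} e^{-i (\lambda_k - \lambda_l)(e \cdot x)} \, \langle \lambda_k | \rho | \lambda_l \rangle \, |\lambda_k\rangle\langle\lambda_l|.$$
We see that the differences of the eigenvalues $\lambda_k$ of the Hamiltonian $H$ determine the frequencies with which the different elements of the expansion of $\rho$ are multiplied. We can combine the different frequencies with the weight vector $e$ to obtain the set of all available frequencies
$$\Omega(H) = \left\{ (\lambda_k - \lambda_l) \, e \;\middle|\; \lambda_k, \lambda_l \text{ eigenvalues of } H \right\}.$$
With this notation, we can simplify our expression for $S(x)[\rho]$ to obtain
$$S(x)[\rho] = \sum_{\omega \in \Omega(H)} e^{-i \omega \cdot x} \, \rho_\omega,$$
where the operators $\rho_\omega$ are given by collecting the terms in the above sum for which the frequency differences are the same, i.e.,
$$\rho_\omega = \sum_{k, l \,:\, (\lambda_k - \lambda_l) e = \omega} \langle \lambda_k | \rho | \lambda_l \rangle \, |\lambda_k\rangle\langle\lambda_l|.$$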
As $\rho$ is Hermitian, we have that $\rho_\omega = \rho^*_{-\omega}$. The frequency structure carries over if we measure the expectation value of an arbitrary observable $M$ for the state $S(x)[\rho]$ to obtain a prediction
$$f(x) = \mathrm{Tr}\{ S(x)[\rho] \, M \} = \sum_{\omega \in \Omega(H)} \mathrm{Tr}\{\rho_\omega M\} \, e^{-i \omega \cdot x}.$$
As a result, we obtain a GTP with coefficients $c_\omega = \mathrm{Tr}\{\rho_\omega M\}$. Note that, as $\rho_\omega = \rho^*_{-\omega}$, we have that $c_\omega = c^*_{-\omega}$, which ensures that $f(x)$ is real-valued as expected. The coefficients of this series could depend intricately on the circuit that was used to construct $\rho$ and on the specific observable $M$, but a profound understanding of this relation is an open question. However, this does not pose an obstacle for us, as only the set $\Omega$ is relevant for our study.
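The single-gate expansion can be checked numerically. The sketch below is an illustration only: it assumes a single data coordinate (so that the basis vector reduces to the number 1), the encoding convention $e^{-ixH}$, and randomly generated $H$, $M$ and pure state $\rho$, none of which come from the text. It compares the directly evaluated expectation value with the sum over eigenvalue differences, where the coefficient of the pair $(k, l)$ is $\langle \lambda_k | \rho | \lambda_l \rangle \langle \lambda_l | M | \lambda_k \rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

def random_hermitian(n):
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

H = random_hermitian(dim)          # assumed encoding Hamiltonian
M = random_hermitian(dim)          # assumed observable
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())    # assumed (pure) input state

lam, V = np.linalg.eigh(H)         # H = V diag(lam) V^dagger

def f_direct(x):
    """Tr{ S(x)[rho] M } with S(x) = exp(-i x H), evaluated directly."""
    U = V @ np.diag(np.exp(-1j * x * lam)) @ V.conj().T
    return np.trace(M @ U @ rho @ U.conj().T)

# Matrix elements in the eigenbasis of H.
rho_e = V.conj().T @ rho @ V       # rho_e[k, l] = <lam_k| rho |lam_l>
M_e = V.conj().T @ M @ V           # M_e[l, k]  = <lam_l| M  |lam_k>

def f_gtp(x):
    """Sum over eigenvalue differences omega = lam_k - lam_l."""
    total = 0.0
    for k in range(dim):
        for l in range(dim):
            omega = lam[k] - lam[l]
            total += rho_e[k, l] * M_e[l, k] * np.exp(-1j * omega * x)
    return total

for x in (0.0, 0.7, 2.5):
    print(x, f_direct(x).real, f_gtp(x).real)
```

For a generic Hamiltonian all non-zero eigenvalue differences are distinct, so each pair $(k, l)$ with $k \neq l$ contributes its own frequency; for degenerate differences, the terms with equal frequency would be grouped into a single coefficient.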
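We have just derived the frequency structure for one encoding gate, but for more complicated circuits we have to understand the action of multiple encoding gates, potentially interleaved with some trainable unitaries. The intermediary unitaries, however, will only result in a basis change, not affecting the set of combined frequencies. We can therefore ignore them and just consider the repeated action of two distinct encoding gates with Hamiltonians $H_1$ and $H_2$, resulting in
$$S_2(x)\big[S_1(x)[\rho]\big] = \sum_{\omega \in \Omega(H_1)} \sum_{\omega' \in \Omega(H_2)} e^{-i (\omega + \omega') \cdot x} \, (\rho_\omega)_{\omega'},$$
where $(\rho_\omega)_{\omega'}$ denotes the component of $\rho_\omega$ associated with the frequency $\omega'$ in the eigenbasis of $H_2$. At this point, we precisely understand that the application of the second gate results in new frequencies that encompass all possible sums of the different frequencies. We can again consolidate this if we consider the sumset (or Minkowski sum) of the two sets of frequencies $\Omega(H_1)$ and $\Omega(H_2)$, defined as
$$\Omega(H_1) + \Omega(H_2) = \left\{ \omega + \omega' \;\middle|\; \omega \in \Omega(H_1),\ \omega' \in \Omega(H_2) \right\}.$$
With that we have
$$S_2(x)\big[S_1(x)[\rho]\big] = \sum_{\omega \in \Omega(H_1) + \Omega(H_2)} e^{-i \omega \cdot x} \, \rho_\omega.$$
Note again that the values of specific components $\rho_\omega$ depend on the specific initial state $\rho$ and possible intermediate unitaries, but, in this work, we are only interested in $\Omega$ itself. We can apply the same logic recursively to see that the set of accessible frequencies for any encoding strategy $\mathcal{D}$ is given by the sumset of all the individual sets of frequencies $\Omega(H_j^{(i)})$ for each gate:
$$\Omega(\mathcal{D}) = \sum_{i=1}^{d} \sum_{j=1}^{N^{(i)}} \Omega\big(H_j^{(i)}\big).$$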
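The sumset construction of the accessible frequencies is straightforward to compute. The following sketch assumes a single data coordinate, so that frequencies are real numbers, and uses the Hamiltonian $Z/2$ as an assumed example of an encoding gate. The function names are hypothetical, and frequencies are rounded to a fixed number of decimals so that numerically equal values are merged.

```python
import numpy as np

def frequencies(H, decimals=9):
    """Omega(H): all pairwise differences of eigenvalues of H (one data coordinate)."""
    eigvals = np.linalg.eigvalsh(H)
    return {round(float(a - b), decimals) for a in eigvals for b in eigvals}

def sumset(A, B, decimals=9):
    """Minkowski sum of two frequency sets."""
    return {round(a + b, decimals) for a in A for b in B}

def accessible_frequencies(hamiltonians):
    """Sumset of Omega(H) over all encoding Hamiltonians in the strategy."""
    omega = {0.0}
    for H in hamiltonians:
        omega = sumset(omega, frequencies(H))
    return omega

Z = np.diag([1.0, -1.0])

omega_1 = frequencies(Z / 2)                     # one encoding gate
omega_3 = accessible_frequencies([Z / 2] * 3)    # three repetitions of the same gate

print(sorted(omega_1))   # [-1.0, 0.0, 1.0]
print(sorted(omega_3))   # integers from -3.0 to 3.0
```

In this example, repeating the same single-qubit encoding gate three times yields 7 accessible frequencies, so the number of frequencies grows linearly with the number of encoding gates for this particular choice of Hamiltonians.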

Prior and related work
Before presenting our explicitly encoding-dependent generalization bounds for PQC-based models in the next two sections, we discuss how our results compare to prior work. While there is a massive amount of prior and ongoing work on the generalization capacity of classical models, see for example the survey in Ref. [37], such results have only recently begun to emerge for PQC-based models. Here, we focus on a comparison with these latter results. Additionally, while the following paragraphs constitute a detailed review of existing generalization bounds for PQC-based models, we stress that no knowledge of these prior works is necessary to understand our proofs and results. In particular, the presentation here is intended to establish context for our work and to place prior works in relation to each other, but the remainder of this manuscript can safely be read independently of the review presented here. Given the discussions in the previous two sections, we note that, at a high level, all prior work on generalization bounds for PQC-based models can be classified via the following three criteria:
1. Which restrictions (if any) are placed on the architecture/structure of the PQCs generating the model class considered?
2. In terms of which architectural hyper-parameters, or experimentally accessible quantities, are the generalization bounds expressed?
3. Via which complexity measure are the generalization bounds derived?
Given this, we will use the above questions as guidelines for understanding and relating existing results. Throughout this discussion, keep in mind that, as explained in Section 1, all prior works are restricted to encoding-first models, whereas we allow for data re-uploading. Additionally, while some of the following works study the same complexity measures as the ones examined here (namely, Rademacher complexity and covering numbers), all of them differ from ours in both the restriction to encoding-first PQC-based models and in a lack of explicit dependence on the data-encoding strategy. Given this, we split our survey into two parts. First, in Section 4.1, we discuss those prior works which derive encoding-independent generalization bounds. In Section 4.2, we then discuss existing works deriving generalization bounds which depend on the data-encoding strategy, but with a dependence which is implicit, and not necessarily clear a priori.

Encoding-independent complexity and generalization bounds
Ref. [24] is an early study of the complexity and generalization capacity of quantum circuit based models, which presents encoding-independent bounds on the pseudo-dimension of function classes associated with encoding-first 2-local (unitary or CPTP) PQCs, polynomial in the size (number of gates) and depth of the trainable part of the circuit (in which all gates were considered trainable).
Such pseudo-dimension bounds then yield generalization bounds, which also depend polynomially on the size and depth of the trainable circuit. Ref. [44] has extended the generalization bounds of Ref. [24] to the agnostic setting. In a similar vein, Ref. [29] has recently derived encoding-independent covering number bounds for encoding-first PQC-based models, which depend explicitly on the number of gates in the PQC, and the operator norm of the measured observable. Once again, using standard tools from statistical learning theory, the authors of Ref. [29] are then able to use these covering number bounds to provide an encoding-independent generalization bound.
Working from the perspective of kernel methods, Ref. [32] has recently investigated the complexity of encoding-first PQC-based models in terms of properties of the parametrized measurement which follows data-encoding. More specifically, they interpret the entire parametrized circuit following the data-encoding as a parametrized measurement, and provide bounds for the VC-dimension of the model class in terms of the rank of the parametrized observable, and for the fat-shattering dimension in terms of the Frobenius norm of the parametrized observable. These bounds on standard complexity measures then allow them to prove generalization bounds which depend explicitly on either the rank or the Frobenius norm of the accessible observables. However, similarly to the perspective we advocate in this work, the authors of Ref. [32] stress the application of generalization bounds for model selection, via structural risk minimization.
Finally, Ref. [27] has recently initiated a resource-theoretic approach by providing encoding-independent bounds on both the Rademacher and Gaussian complexity of encoding-first PQC-based models, in terms of the number of repetitions of resource channels allowed in the PQC. These Rademacher and Gaussian complexity bounds have then been used to derive generalization bounds, which depend on the same quantities, and therefore provide an encoding-independent resource-theoretic perspective on generalization in encoding-first PQC-based models.

Encoding-dependent complexity and generalization bounds
We proceed by discussing prior work deriving generalization bounds which do depend on the data-encoding strategy. While the dependence on the data-encoding could take various forms, in this manuscript we aim to derive generalization bounds which depend explicitly on architectural hyper-parameters related to the data-encoding strategy (such as the number of encoding gates of a specific type), and therefore facilitate the straightforward implementation of model selection via structural risk minimization. This is in contrast to all of the prior encoding-dependent generalization bounds, which are written in terms of some quantity which depends on the data-encoding strategy, but with an implicit dependence which is not a priori clear, and needs to be assessed experimentally. Given this fundamental difference between our generalization bounds and those of the prior works we discuss here, a natural open question is whether the implicitly encoding-dependent quantities used in the following works can be written explicitly in terms of architectural hyper-parameters related to the data-encoding strategy. If possible, this would immediately provide explicitly encoding-dependent generalization bounds comparable to those we derive in this work.
With this in mind, we begin our survey of implicitly encoding-dependent generalization bounds with Ref. [25], which has suggested a complexity measure based on the classical Fisher information, called the effective dimension, and demonstrated that one can indeed state generalization bounds in terms of the effective dimension. Utilizing the empirical Fisher information as a tool for approximating the effective dimension, Ref. [25] presented numerical experiments which demonstrate a clear dependence of the effective dimension on the encoding-strategy. However, the explicit dependence of the effective dimension on the encoding strategy is not clear and needs to be evaluated experimentally. Additionally, Ref. [25] also provided a comparison between the effective dimension of PQC-based models and comparable classical models, and demonstrated that PQC-based models can exhibit a higher effective dimension. While not discussed explicitly in Ref. [25], we stress, however, that one should not use model complexity (e.g., effective dimension) as the sole criterion for model selection, since model classes with higher effective dimension may have worse generalization behavior than models with a lower effective dimension. Instead, as we advocate in this work, one should ideally use a framework such as structural risk minimization to select a model with the smallest upper bound on out-of-sample performance.
Also working from an information theoretic perspective, and with a focus on the role of data-encoding, Ref. [31] has recently presented generalization bounds for PQC-based models in terms of information-theoretic quantities describing a notion of mutual information between the post-encoding quantum state ρ(x) and the classical data. While these generalization bounds have a strong implicit dependence on the data-encoding strategy, it is once again not immediately clear, apart from in a few special cases, how to explicitly express the suggested complexity measure in terms of architectural hyper-parameters related to the data-encoding strategy.
From a resource theoretic perspective, and complementing Ref. [27], the series of works [26,28] have further studied the Rademacher complexity of encoding-first PQC-based models. However, unlike in Ref. [27], the Rademacher complexity bounds of Refs. [26,28] are given in terms of quantities that exhibit an implicit dependence on the data-encoding strategy. More specifically, Ref. [28] provides Rademacher complexity bounds in terms of the size, depth and amount of magic available as a resource. Additionally, Ref. [26] also studies noisy PQC-based models and provides Rademacher complexity bounds in terms of either the Rademacher complexity of the associated noiseless circuit or the free-robustness of the model.
Recently, Ref. [45] has studied generalization for PQC-based models using a hardware efficient ansatz with a specific choice of data-encoding. For this setting, they proved VC-dimension bounds that scale polynomially with the minimum of the number of qubits and the number of trainable layers. In their proofs, they combine light cone arguments with a trigonometric function representation for functions implemented by their ansatz.
Finally, we mention Ref. [30] which has developed techniques for evaluating the potential advantages of quantum kernels over classical kernels. These results are of relevance to this work due to the close relationship between PQC-based models and kernel methods [16]. In a first step, the authors of Ref. [30] suggest the evaluation of a geometric quantity which depends on the chosen quantum feature map and the available training data instances. If the quantum machine learning model passes this first test, a model complexity parameter, which now depends on the quantum encoding and the training data (both instances and labels), should be computed. While these complexity measures can be classically computed in time polynomial in the training data size, analytically determining their exact dependence on the data-encoding can be challenging. This is in contrast to our model complexity bounds, which depend straightforwardly on hyper-parameters associated with the data-encoding strategy, such as the number of encoding gates of a specific type.

Generalization bounds for generalized trigonometric polynomials
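We recall (from Section 3) that we can prove generalization bounds on $\mathcal{F}_{\Theta,\mathcal{D},\mathcal{M}}$, the hypothesis class of interest for a given PQC-based model, by proving generalization bounds on $\mathcal{F}^B_\Omega$. Recall that $\mathcal{F}^B_\Omega$ has been defined as the class of generalized trigonometric polynomials (GTPs) with frequencies in $\Omega$ and infinity-norm bounded by $B$ as

$$\mathcal{F}^B_\Omega = \Big\{ [0,2\pi)^d \ni x \mapsto f(x) = \sum_{\omega\in\Omega} c_\omega e^{-i\omega x} \;:\; c_{-\omega} = c_\omega^* \ \text{for all } \omega\in\Omega,\ \|f\|_\infty \le B \Big\}. \qquad (31)$$

In order to prove generalization bounds for $\mathcal{F}^B_\Omega$, it will be convenient to work with the cosine and sine representation of the complex exponential, and with the norm of the vector of coefficients instead of the norm of the function. Note that, since we have observed in Section 3 that $c_{-\omega} = c_\omega^*$, we can define, for every $\omega \in \Omega$, the real coefficients

$$a_\omega := c_\omega + c_{-\omega} = 2\,\mathrm{Re}(c_\omega), \qquad b_\omega := i\,(c_{-\omega} - c_\omega) = 2\,\mathrm{Im}(c_\omega).$$

With these, it further follows that

$$c_\omega e^{-i\omega x} + c_{-\omega} e^{i\omega x} = a_\omega \cos(\omega x) + b_\omega \sin(\omega x),$$

which allows us to rewrite the sum in Eq. (31) as a sum of real terms only. If we were only considering frequencies given by real numbers, then it would suffice to sum over the non-negative frequencies in the real sum representation. However, we are dealing with frequency vectors. As this is the case, we start by removing the zero vector from the set of frequencies to obtain $\Omega^* := \Omega\setminus\{0\}$.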
Note that this is meaningful as 0 ∈ Ω for any Ω of the form introduced in Section 3. Next, we divide Ω * into two disjoint parts Ω * = Ω + ∪ Ω − , with Ω + ∩ Ω − = ∅, such that for every ω ∈ Ω + we have that −ω ∈ Ω − . We again note that this is possible due to the specific form of the sets Ω discussed in Section 3. In particular, we then have |Ω| = 2|Ω + | + 1. Additionally, we make use of a shorthand notation for the vectors (a ω ) ω∈Ω+ and (b ω ) ω∈Ω+ : We keep the indices outside of the parentheses, but remove the indexing set. Namely we write (a 0 , (a ω ) ω , (b ω ) ω ) in place of (a 0 , (a ω ) ω∈Ω+ , (b ω ) ω∈Ω+ ). We only explicitly write the indexing set at certain points to avoid confusion.
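With these notational points in mind, we can rewrite the hypothesis class $\mathcal{F}^B_\Omega$ as

$$\mathcal{F}^B_\Omega = \Big\{ [0,2\pi)^d \ni x \mapsto f(x) = \frac{a_0}{2} + \sum_{\omega\in\Omega_+} \big(a_\omega \cos(\omega x) + b_\omega \sin(\omega x)\big) \;:\; \|f\|_\infty \le B \Big\},$$

and we define the class $\mathcal{H}^B_\Omega$ via

$$\mathcal{H}^B_\Omega := \Big\{ [0,2\pi)^d \ni x \mapsto \frac{a_0}{2} + \sum_{\omega\in\Omega_+} \big(a_\omega \cos(\omega x) + b_\omega \sin(\omega x)\big) \;:\; \|(a_0,(a_\omega)_\omega,(b_\omega)_\omega)\|_2 \le 2(2\pi)^{d/2} B \Big\}, \qquad (36)$$

where the 2-norm is given by

$$\|(a_0,(a_\omega)_\omega,(b_\omega)_\omega)\|_2 = \sqrt{a_0^2 + \sum_{\omega\in\Omega_+} \big(a_\omega^2 + b_\omega^2\big)}.$$

We note that, by construction, $\mathcal{F}^B_\Omega \subseteq \mathcal{H}^B_\Omega$ holds true. To see this, note that for a function $f \in \mathcal{F}^B_\Omega$ given by $f(x) = \sum_{\omega\in\Omega} \exp(-i\omega x)\,c_\omega = a_0/2 + \sum_{\omega\in\Omega_+} (a_\omega \cos(\omega x) + b_\omega \sin(\omega x))$, we obtain

$$\|(a_0,(a_\omega)_\omega,(b_\omega)_\omega)\|_2 \le 2(2\pi)^{d/2} B. \qquad (38)$$

As a consequence of the fact that $\mathcal{F}^B_\Omega \subseteq \mathcal{H}^B_\Omega$, generalization bounds uniform over $\mathcal{H}^B_\Omega$ imply generalization bounds uniform over $\mathcal{F}^B_\Omega$. Therefore, we focus on proving generalization bounds for $\mathcal{H}^B_\Omega$. Our bounds focus on the dependence of generalization on the frequency spectrum $\Omega$. We obtain these bounds from bounds on the complexity of $\mathcal{H}^B_\Omega$, measured in terms of two complexity measures from classical learning theory, namely the Rademacher complexity and the metric entropy. We first recall the definitions of these important quantities and then give an overview of our results and proof strategy.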
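As a small numerical sanity check of the change of representation described above, the following Python sketch evaluates a GTP once via complex exponentials and once via the cosine and sine form. The frequency set `omega_plus`, the random coefficients, and the helper names are illustrative assumptions; the relations $a_\omega = 2\,\mathrm{Re}(c_\omega)$ and $b_\omega = 2\,\mathrm{Im}(c_\omega)$ used here are the ones that follow from $c_{-\omega} = c_\omega^*$ together with the convention $f(x) = \sum_\omega c_\omega \exp(-i\omega x)$.

```python
import cmath
import math
import random


def dot(w, x):
    """Inner product of a frequency vector with a data point."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gtp_complex(x, coeffs):
    """f(x) = sum_w c_w exp(-i w.x); coeffs maps frequency tuples to complex c_w."""
    return sum(c * cmath.exp(-1j * dot(w, x)) for w, c in coeffs.items())


def gtp_real(x, a0, ab):
    """a0/2 + sum over Omega_+ of a_w cos(w.x) + b_w sin(w.x); ab maps w -> (a_w, b_w)."""
    return a0 / 2 + sum(
        a * math.cos(dot(w, x)) + b * math.sin(dot(w, x)) for w, (a, b) in ab.items()
    )


random.seed(0)
# Hypothetical Omega_+: one representative of each +/- pair of nonzero frequencies.
omega_plus = [(1, 0), (0, 1), (1, 1), (1, -1)]

# Build a symmetric coefficient set with c_{-w} = conj(c_w); c_0 is then real.
coeffs = {(0, 0): complex(random.gauss(0, 1), 0.0)}
for w in omega_plus:
    c = complex(random.gauss(0, 1), random.gauss(0, 1))
    coeffs[w] = c
    coeffs[tuple(-wi for wi in w)] = c.conjugate()

# Real coefficients: a_w = 2 Re(c_w), b_w = 2 Im(c_w), and a_0 = 2 c_0.
a0 = 2 * coeffs[(0, 0)].real
ab = {w: (2 * coeffs[w].real, 2 * coeffs[w].imag) for w in omega_plus}

x = [random.uniform(0, 2 * math.pi) for _ in range(2)]
f_c = gtp_complex(x, coeffs)
f_r = gtp_real(x, a0, ab)
print(abs(f_c.imag), abs(f_c.real - f_r))
```

Both printed numbers should be at the level of floating-point round-off: the complex sum is real-valued, and it agrees with the real representation.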
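Definition 1 ((Empirical) Rademacher complexity). Let $Z$ be some data space, $\mathcal{F} \subseteq \mathbb{R}^Z$ a function class, and $S = (z_1, \dots, z_m) \in Z^m$. The empirical Rademacher complexity of $\mathcal{F}$ with respect to $S$ is defined as

$$\hat{R}_S(\mathcal{F}) := \mathbb{E}_\sigma\Big[\sup_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\Big],$$

where $\sigma_1, \dots, \sigma_m$ are i.i.d. Rademacher random variables, i.e., each $\sigma_i$ is uniformly distributed on $\{-1, +1\}$.

For later use, we note that, if $\mathcal{F} \subseteq \mathcal{G} \subseteq \mathbb{R}^Z$, then, for any $S \in Z^m$ we have $\hat{R}_S(\mathcal{F}) \le \hat{R}_S(\mathcal{G})$. Next, we introduce our second complexity measure:

Definition 2 (Covering nets, covering number, and metric entropy). Let $(X, d)$ be a (pseudo-)metric space. Let $K \subseteq X$ and let $\varepsilon > 0$. We call $N \subseteq K$ an (interior) $\varepsilon$-covering net of $K$ if for all $x \in K$ there exists a $y \in N$ such that $d(x, y) \le \varepsilon$. The covering number $N(K, d, \varepsilon)$ is defined as the smallest possible cardinality of an (interior) $\varepsilon$-covering net of $K$. Finally, we define the metric entropy $\log_2 N(K, d, \varepsilon)$ via a logarithm of the covering number.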
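To make Definition 1 concrete, the following Python sketch estimates an empirical Rademacher complexity by Monte Carlo for a class of GTPs whose coefficient vector lies in a 2-norm ball. It assumes the common normalization with a factor $1/m$ in front of the sum over data points; the frequency set, the radius, and the function names are illustrative assumptions. For a 2-norm ball the supremum over coefficients has a closed form, which keeps the example short.

```python
import math
import random


def features(x, omega_plus):
    """Feature vector (1/2, cos(w.x)..., sin(w.x)...) of a GTP in real form."""
    phases = [sum(wi * xi for wi, xi in zip(w, x)) for w in omega_plus]
    return [0.5] + [math.cos(p) for p in phases] + [math.sin(p) for p in phases]


def empirical_rademacher(points, omega_plus, radius, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||v||_2 <= radius} (1/m) sum_i sigma_i <v, phi(x_i)>.

    For a 2-norm ball the supremum equals (radius/m) * ||sum_i sigma_i phi(x_i)||_2.
    """
    rng = random.Random(seed)
    phis = [features(x, omega_plus) for x in points]
    m, dim = len(phis), len(phis[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        s = [sum(sigma[i] * phis[i][k] for i in range(m)) for k in range(dim)]
        total += radius / m * math.sqrt(sum(v * v for v in s))
    return total / n_draws


rng = random.Random(1)
omega_plus = [(1, 0), (0, 1), (1, 1), (1, -1)]  # hypothetical Omega_+
m, radius = 50, 1.0
points = [(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)) for _ in range(m)]

estimate = empirical_rademacher(points, omega_plus, radius)
# Jensen: E||X|| <= sqrt(E||X||^2), and ||phi(x)||_2^2 = 1/4 + |Omega_+| for every x.
jensen_bound = radius * math.sqrt((0.25 + len(omega_plus)) / m)
print(estimate, jensen_bound)
```

The Jensen upper bound already displays the square-root scaling in the number of frequencies divided by the sample size that the overview below emphasizes.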
For our purposes, the relevant covering numbers are those of $\mathcal{H}^B_\Omega$ with respect to the pseudometrics induced by the data-dependent semi-norms $\|\cdot\|_{2,S|_x}$, which, given training data $S|_x = (x_1, \dots, x_m)$, are defined as

$$\|f\|_{2,S|_x} := \sqrt{\frac{1}{m}\sum_{i=1}^m f(x_i)^2}.$$

In Section 5.1, we prove Rademacher complexity bounds for $\mathcal{H}^B_\Omega$. We do so by understanding $\mathcal{H}^B_\Omega$ as (a subset of) a class of functions implemented by a simple classical neural network (NN) with a single hidden layer and with sinusoidal activation functions in the hidden layer. For such NN architectures, we can then apply already known Rademacher complexity bounds. This strategy leads to

$$\hat{R}_{S|_x}(\mathcal{H}^B_\Omega) \le \tilde{O}\Big(\sqrt{\frac{|\Omega|}{m}}\Big)$$

for a training data set $S$ of size $m$, with data instances $S|_x$. Here, the $\tilde{O}$ refers to the asymptotic behavior as $|\Omega|, m \to \infty$ and hides a logarithmic dependence on $|\Omega|$. (As we are most interested in the dependence on $|\Omega|$, we also hide the dependence on $B$ here.) With these Rademacher complexity bounds at hand, we can then derive generalization guarantees for $\mathcal{H}^B_\Omega$, and thus $\mathcal{F}^B_\Omega$, using a standard generalization bound in terms of the Rademacher complexity. We obtain that for a bounded Lipschitz loss function, with probability $\ge 1 - \delta$, the generalization error is at most

$$\tilde{O}\Big(\sqrt{\frac{|\Omega|}{m}} + \sqrt{\frac{\log(1/\delta)}{m}}\Big)$$

uniformly over $f \in \mathcal{H}^B_\Omega$ for training data $S$ of size $m$. Again, we emphasize the leading-order dependence on $|\Omega|$ and hide other parameters. We note that, without further assumptions, as in classical agnostic learning scenarios, we do not expect a better scaling with respect to $m$ than the $1/\sqrt{m}$ obtained here.

In Section 5.2, we bound the covering number and metric entropy of $\mathcal{H}^B_\Omega$, and thus of $\mathcal{F}^B_\Omega$. We achieve this by constructing a covering net for $\mathcal{H}^B_\Omega$ from a suitable (finer-grained) covering net of the allowed vectors of Fourier coefficients. Here, we crucially use that $|\Omega|$ determines the dimension of the space in which we have to take these covering nets. With this reasoning, we obtain a metric entropy bound of order $\tilde{O}\big(|\Omega| \log(1/\varepsilon)\big)$, where the $\tilde{O}$ hides logarithmic dependencies on $B$ and $|\Omega|$.
Given these metric entropy bounds, we then use the chaining method to derive empirical Rademacher complexity bounds. Again assuming a bounded Lipschitz loss function, this method yields, with probability $\ge 1 - \delta$, a generalization error bound of

$$\tilde{O}\Big(\sqrt{\frac{|\Omega|}{m}} + \sqrt{\frac{\log(1/\delta)}{m}}\Big)$$

simultaneously for all $f \in \mathcal{F}^B_\Omega \subseteq \mathcal{H}^B_\Omega$, assuming training data of size $m$ and hiding both logarithmic terms and dependencies on $B$, the Lipschitz constant, and the bound on the loss. While we see that, with the above definition of $\mathcal{F}^B_\Omega$ and $\mathcal{H}^B_\Omega$, the strategies of Sections 5.1 and 5.2 lead to the same generalization bound in leading order, we nevertheless present both approaches because they yield different results if the assumption on the Fourier coefficients appearing in $\mathcal{F}^B_\Omega$ or $\mathcal{H}^B_\Omega$ is changed from a 2-norm bound to a general p-norm bound.
In the light of the discussion in Section 3, these generalization bounds for classes of generalized trigonometric polynomials imply generalization bounds for PQCs. As we have focused on the dependence on the frequency spectrum in the former, we obtain a focus on the encoding-dependence in the latter. We provide and discuss these results in Section 6.

Generalization bounds for generalized trigonometric polynomials via Rademacher complexity
We begin our analysis by stating our Rademacher complexity bound for $\mathcal{H}^B_\Omega$. As we will see, this bound is obtained by combining two partial results, and will lead directly to a generalization bound. For ease of notation, we write $K_i := \max_{\omega\in\Omega_+}\{|\omega_i|\}$ for $i \in \{1, \dots, d\}$ and $K := \sum_{i=1}^d K_i$.

Lemma 3 (Rademacher complexity bounds for GTPs
Ω be as defined in Eq. (36). The empirical Rademacher complexity of H B Ω with respect to S| x := (x 1 , . . . , x m ) can be upper-bounded aŝ In order to prove Lemma 3 we state and show two partial results, namely Lemmas 4 and 5. These two Lemmata have slightly different proof strategies, but both are motivated by thinking of generalized trigonometric polynomials as being realized by certain neural network architectures.

Lemma 4 (Empirical Rademacher complexity of H B Ω -Version 1). Let d, m, S| x , and H B Ω be as in Lemma 3. Then, the empirical Rademacher complexity of H B Ω with respect to S| x can be upper-bounded asR
Proof. We prove this statement by constructing a function class that contains H B Ω and whose empirical Rademacher complexity we are able to upper bound by viewing it as arising from a simple layered neural network (NN) architecture. More specifically, we consider the following class of functions which can be realized by a NN with a single hidden layer of neurons with sine activation functions, and a linear activation at the output neuron. Here, again (d ω ) ω stands for the vector (d ω ) ω∈Ω+ . Also, note that for every ω ∈ Ω + , α ω is a d-dimensional vector and γ ω a real number. We claim that H B Ω ⊆ G B Ω . We can prove this inclusion directly by finding the corresponding parameters (d 0 , (d ω ) ω ), (γ ω ) ω and (α ω ) ω for each element f ∈ H B Ω , specified by the corresponding (a 0 , (a ω ) ω , (b ω ) ω ). We can find a valid assignment term by term. We start by noting d 0 = a 0 . Next, we spell out the term corresponding to the frequency vector ω with the well-known angle sum trigonometric identity Now, for any given (a ω ) ω and (b ω ) ω , we can set d ω := a 2 ω + b 2 ω , α ω := ω, and γ ω := arctan(b ω /a ω ).
At this point, it is important to confirm that the assignment is valid within the restrictions imposed in Eq. (47). To begin with, we note that the 2-norm bound from Eq. (38), i.e. $\|(a_0, (a_\omega)_\omega, (b_\omega)_\omega)\|_2 \le 2(2\pi)^{d/2} B$, carries over to $(d_0, (d_\omega)_\omega)$, since $d_0^2 + \sum_{\omega\in\Omega_+} d_\omega^2 = a_0^2 + \sum_{\omega\in\Omega_+} (a_\omega^2 + b_\omega^2)$. Additionally, one can also see that the components of $\alpha_\omega$ are nothing but the frequencies $\omega_i$ for each data coordinate, which fall in the interval $[-K_i, K_i]$ by construction. Finally, as the arctangent can be taken to output any angle, choosing the branch $[-\pi, \pi)$ is valid. With these, we reach $\mathcal{H}^B_\Omega \subseteq \mathcal{G}^B_\Omega$, which has been our goal. As $\mathcal{G}^B_\Omega$ arises from a NN whose activation functions are 1-Lipschitz, continuous and antisymmetric, we can use Lemma 16 (stated in the Appendix). For that, we require upper bounds for the 1-norm of the weight vector going into each neuron and for the moduli of the biases. For every neuron in the hidden layer, there are $d$ incoming weights, one for each data dimension, corresponding to the $d$ input neurons. Each component of those weight vectors ($\alpha_\omega$ in Eq. (47)) takes values in $[-K_i, K_i]$ for some $i \in \{1, \dots, d\}$, so the 1-norm of such a weight vector is upper bounded by $K$.
At the output neuron, there are |Ω + | incoming weights (d ω in Eq. (47)) and we have a bound on the 2-norm of this weight vector. Therefore, Hölder's inequality applied to the 2-norm gives the With that, we now know that the 1-norm of any weight vector in the NN is upper bounded by max{K, 2(2π) d 2 B |Ω + |}. Next, we note that the modulus of the biases is at most π in the hidden layer, and 2(2π) where the O notation refers to the scaling in |Ω|. As G B Ω contains H B Ω as a subset, this bound directly impliesR which completes the proof.
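The amplitude and phase assignment used in the proof above can be checked numerically. The following Python sketch assumes that each hidden neuron computes a sine of the form $\sin(\omega x + \gamma_\omega)$; under that assumption the matching phase satisfies $\sin\gamma_\omega = a_\omega/d_\omega$ and $\cos\gamma_\omega = b_\omega/d_\omega$, which the two-argument arctangent returns without case distinctions when $a_\omega$ or $b_\omega$ vanish. Writing the same term as a cosine instead shifts the phase by $\pi/2$ and swaps the roles of $a_\omega$ and $b_\omega$ inside the arctangent, so the precise expression depends on the chosen convention. The function name is hypothetical.

```python
import math
import random


def amplitude_phase(a, b):
    """Return (d, gamma) with a*cos(t) + b*sin(t) == d*sin(t + gamma) for all t."""
    d = math.hypot(a, b)        # d = sqrt(a^2 + b^2), so the 2-norm is preserved
    gamma = math.atan2(a, b)    # sin(gamma) = a/d, cos(gamma) = b/d
    return d, gamma


random.seed(0)
max_err = 0.0
for _ in range(1000):
    a, b = random.uniform(-3, 3), random.uniform(-3, 3)
    t = random.uniform(-10, 10)
    d, gamma = amplitude_phase(a, b)
    lhs = a * math.cos(t) + b * math.sin(t)
    rhs = d * math.sin(t + gamma)
    max_err = max(max_err, abs(lhs - rhs))

print(max_err)
```

Since $d_\omega^2 = a_\omega^2 + b_\omega^2$, the reparametrization leaves the 2-norm of the coefficient vector unchanged, which is why the norm constraint transfers to the output weights.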
In the proof of Lemma 4, we do not bound the empirical Rademacher complexity of H B Ω directly, rather we embed it into a larger class G B Ω whose complexity we then bound. However, whereas only a discrete set of frequencies is used in H B Ω , the class G B Ω allows for a continuum of frequencies. In Lemma 5, we modify the idea of the previous proof to avoid this overcounting of frequencies.

Lemma 5 (Empirical Rademacher complexity of H B Ω -Version 2). Let d, m, S| x , and H B Ω be as in Lemma 3. Then, the empirical Rademacher complexity of H B Ω with respect to S| x can be upper-bounded asR
Proof. Analogously to the proof of Lemma 4, we provide an empirical Rademacher complexity upper bound for a larger function classH B Ω . Along the way, we see that the inclusion H B Ω ⊆H B Ω holds, so that the uniform bound we derive for the larger set is immediately inherited for the smaller one. We start by defining an auxiliary set of functions: let M Ω be the set of generalized trigonometric monomials over R d with frequency values in Ω + , defined as Now, recalling that |Ω| = 2|Ω + | + 1, we can define the function class of our current interest as where we use the notation ·, · for the standard inner product. Notice howH B Ω can be seen as a class of functions implemented by a single neuron with identity activation and 2-norm bounded weights, where the input signals have been pre-processed by functions from the specified class M Ω . With this, we note the inclusion H B Ω ⊆H B Ω . Next, we use Lemma 15 (stated in the Appendix). To use the result, we note that the activation function of the neuron is the identity x → x (which is a 1-Lipschitz, anti-symmetric function); that M Ω contains the 0-function; that the modulus of the bias is upper bounded by 2(2π) d 2 B; and that we can again use Hölder's inequality applied to the 2-norm to upper bound the 1-norm of the weight vector as (b 0 , w) 1 Hence, in order to proceed we need to find an upper bound for the empirical Rademacher complexity of M Ω . We apply Massart's Lemma (which we recall as Lemma 17 in the Appendix for completeness) for this last step. Let A be the set of generalized trigonometric monomials with frequencies in Ω + , evaluated on every element of S| x = (x 1 , . . . , x m ), i.e., Note that, by Hölder's inequality, again applied to the 2-norm, and since sine and cosine take values in [−1, 1], we have that A ⊆ B √ m (0), where B r (c) is the ball of radius r in 2-norm centered at c. 
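Now, we can rewrite the empirical Rademacher complexity and apply Massart's lemma (Lemma 17) to get

$$\hat{R}_{S|_x}(\mathcal{M}_\Omega) = \mathbb{E}_\sigma\Big[\sup_{a\in A} \frac{1}{m}\sum_{i=1}^m \sigma_i a_i\Big] \le \frac{\sqrt{m}\,\sqrt{2\log|A|}}{m} = \sqrt{\frac{2\log|A|}{m}}.$$

Plugging this into Eq. (58), we obtain

Recalling again that $\mathcal{H}^B_\Omega \subseteq \tilde{\mathcal{H}}^B_\Omega$ then yields the claimed bound.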
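The use of Massart's lemma in this proof can be illustrated with a short Python sketch. It builds the finite set of sample evaluations of cosine and sine monomials for a hypothetical frequency set, estimates its empirical Rademacher complexity by Monte Carlo, and compares the estimate with the Massart bound $r\sqrt{2\ln|A|}/m$, where $r$ is the largest 2-norm in the set. The frequencies, sample size, and function names are assumptions made for illustration only.

```python
import math
import random


def massart_bound(vectors):
    """Massart's lemma: E_sigma max_a (1/m) sum_i sigma_i a_i <= r * sqrt(2 ln|A|) / m."""
    m = len(vectors[0])
    r = max(math.sqrt(sum(v * v for v in a)) for a in vectors)
    return r * math.sqrt(2 * math.log(len(vectors))) / m


def rademacher_finite(vectors, n_draws=5000, seed=0):
    """Monte Carlo estimate of E_sigma max_a (1/m) sum_i sigma_i a_i over a finite set."""
    rng = random.Random(seed)
    m = len(vectors[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(sigma, a)) for a in vectors) / m
    return total / n_draws


rng = random.Random(2)
omega_plus = [(1, 0), (0, 1), (1, 1), (1, -1)]  # hypothetical Omega_+
m = 40
points = [(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)) for _ in range(m)]

# A: every cosine and sine monomial evaluated on the whole sample.
A = []
for w in omega_plus:
    phases = [w[0] * x[0] + w[1] * x[1] for x in points]
    A.append([math.cos(p) for p in phases])
    A.append([math.sin(p) for p in phases])

estimate = rademacher_finite(A)
bound = massart_bound(A)
print(estimate, bound, math.sqrt(2 * math.log(len(A)) / m))
```

Because sine and cosine take values in $[-1, 1]$, every vector in the set has 2-norm at most $\sqrt{m}$, so the Massart bound is itself at most $\sqrt{2\ln|A|/m}$, the third number printed.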

Proof of Lemma 3. This follows directly from combining Lemmas 4 and 5.
With this Rademacher complexity bound at hand, we can make use of standard tools from classical statistical learning theory to derive a generalization bound.

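Theorem 6 (Generalization bound for GTPs-Version 1). Let $d, m \in \mathbb{N}$. Let $\mathcal{H}^B_\Omega$ be as defined in Eq. (36). Let $\ell : \mathbb{R} \times \mathbb{R} \to [0, c]$ be a bounded loss function such that $\mathbb{R} \ni z \mapsto \ell(y, z)$ is $L$-Lipschitz for all $y \in \mathbb{R}$. For any $\delta \in (0, 1)$ and for any probability measure $P$ on $[0, 2\pi)^d \times \mathbb{R}$, with probability $\ge 1 - \delta$ over the choice of i.i.d. training data $S$ of size $m$, for every $f \in \mathcal{H}^B_\Omega$, the generalization error can be upper-bounded as

(66)

Proof. The proof of this theorem consists in combining the standard generalization bound in terms of Rademacher complexity with the Rademacher complexity bounds from Lemma 3. More precisely, we define $\mathcal{G} \subseteq [0, c]^{[0,2\pi)^d \times \mathbb{R}}$ to be the class of functions that can be obtained by post-composing elements of $\mathcal{H}^B_\Omega$ with the loss function, i.e., we define

$$\mathcal{G} := \big\{ (x, y) \mapsto \ell(y, f(x)) \;:\; f \in \mathcal{H}^B_\Omega \big\}.$$

We then have the following generalization bound (see, e.g., Theorem 3.3 in Ref. [20] or Theorem 1.15 in Ref. [35]): For any probability measure $P$ on $[0, 2\pi)^d \times \mathbb{R}$ and for any $\delta > 0$, with probability $\ge 1 - \delta$ over the choice of i.i.d. training data $S = ((x_i, y_i))_{i=1}^m$, we have, for every $g \in \mathcal{G}$,

$$\mathbb{E}_{(x,y)\sim P}[g(x, y)] \le \frac{1}{m}\sum_{i=1}^m g(x_i, y_i) + 2\hat{R}_S(\mathcal{G}) + 3c\sqrt{\frac{\log(2/\delta)}{2m}}. \qquad (68)$$

Note that, when writing $g \in \mathcal{G}$ as $g(x, y) = \ell(y, f(x))$ for some $f \in \mathcal{H}^B_\Omega$, we directly have

$$\mathbb{E}_{(x,y)\sim P}[g(x, y)] = \mathbb{E}_{(x,y)\sim P}[\ell(y, f(x))] \quad\text{and}\quad \frac{1}{m}\sum_{i=1}^m g(x_i, y_i) = \frac{1}{m}\sum_{i=1}^m \ell(y_i, f(x_i)),$$

which are the expected and the empirical risk of $f$, respectively. That is, Eq. (68) indeed provides a high-probability bound on the generalization error. Therefore, we now upper-bound the empirical Rademacher complexity $\hat{R}_S(\mathcal{G})$. To this end, we use Talagrand's Lemma (going back to Ref. [46]) and our bounds for the empirical Rademacher complexity of $\mathcal{H}^B_\Omega$. As we assume that $\mathbb{R} \ni z \mapsto \ell(y, z)$ is $L$-Lipschitz for all $y \in \mathbb{R}$, we can apply Talagrand's Lemma (Lemma 18) and Lemma 3 to obtain

$$\hat{R}_S(\mathcal{G}) \le L\,\hat{R}_{S|_x}(\mathcal{H}^B_\Omega),$$

with the right-hand side bounded as in Lemma 3, where we have denoted by $S|_x := \{x_i\}_{i=1}^m$ the set of unlabeled training data points. Inserting this bound into Eq. (68) now gives the stated generalization error bound.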
The generalization bound of Theorem 6 can be rewritten as an upper bound on the number of labeled training examples that suffice to guarantee small generalization error.

Corollary 7 (Number of labeled training examples sufficient for a small generalization error-Version 1).
For any $\varepsilon, \delta \in (0, 1)$ and for any probability measure $P$ on $[0, 2\pi)^d \times \mathbb{R}$, a training data size

suffices to guarantee that, with probability $\ge 1 - \delta$ over the choice of i.i.d. training data $S$ of size $m$, the generalization error is at most $\varepsilon$ for every $f \in \mathcal{H}^B_\Omega$.

Proof. We set the upper bound on the generalization error proven in Theorem 6 equal to $\varepsilon$ and solve for $m$.

Remark 8. The proof strategy for obtaining Rademacher complexity bounds of generalized trigonometric polynomials presented here easily extends beyond the case in which the 2-norm of the vector of Fourier coefficients is assumed to be bounded. Namely, if we consider, for $1 \le p \le \infty$, the class $\mathcal{H}^{B,p}_\Omega$ of GTPs with Fourier coefficients of a bounded p-norm, we obtain, with essentially the same proof, an empirical Rademacher complexity bound of

where $q \in [1, \infty]$ is the Hölder conjugate of $p$, i.e., $1/p + 1/q = 1$, and the $\tilde{O}$ hides a logarithmic dependence on $|\Omega|$. This, in turn, leads (for $c$-bounded $L$-Lipschitz loss) to a generalization error bound of

which holds with probability $\ge 1 - \delta$ uniformly over $\mathcal{H}^{B,p}_\Omega$, for training data of size $m$. These bounds based on p-norms might be of independent interest. For example, depending on the structure of the trainable part of the PQC, a detailed analysis might lead to additional structural properties (such as sparsity) of the set of admissible Fourier coefficients, which could then lend themselves to an analysis in terms of p-norms for $p \ne 2$.
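The step that changes when passing from the 2-norm to a general p-norm is the conversion of a p-norm bound on the weight vector into a 1-norm bound via Hölder's inequality, $\|v\|_1 \le n^{1/q}\|v\|_p$ with $1/p + 1/q = 1$. The following Python sketch checks this inequality on random vectors; the dimension, the values of $p$, and the function names are illustrative assumptions.

```python
import random


def p_norm(v, p):
    """p-norm of a vector, with p = inf giving the maximum norm."""
    if p == float("inf"):
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def one_norm_factor(n, p):
    """Hoelder factor n**(1/q) = n**(1 - 1/p) in ||v||_1 <= n**(1/q) * ||v||_p."""
    return float(n) if p == float("inf") else n ** (1.0 - 1.0 / p)


random.seed(0)
n = 9  # plays the role of the number of coefficients
worst_ratio = {}
for p in (1.0, 1.5, 2.0, 4.0, float("inf")):
    ratios = []
    for _ in range(2000):
        v = [random.gauss(0, 1) for _ in range(n)]
        ratios.append(p_norm(v, 1.0) / (one_norm_factor(n, p) * p_norm(v, p)))
    worst_ratio[p] = max(ratios)

print(worst_ratio)
```

All ratios stay at or below one, and the conversion factor grows from $1$ at $p = 1$ to $n$ at $p = \infty$, with $\sqrt{n}$ at $p = 2$, which is the factor appearing in the 2-norm case treated above.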

Generalization bounds for generalized trigonometric polynomials via covering numbers
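Similarly to Section 5.1, we first prove a bound on a complexity measure for the hypothesis class $\mathcal{F}^B_\Omega$ and then derive a generalization bound from it. This subsection differs from the previous one in that we discuss a different complexity measure, covering numbers, and that we do not need to resort to the larger hypothesis class $\mathcal{H}^B_\Omega$, but rather study $\mathcal{F}^B_\Omega$ directly.

Lemma 9 (Covering number bound for GTPs). Let $d \in \mathbb{N}$ and $\varepsilon > 0$. Let $\mathcal{F}^B_\Omega$ be as defined in Eq. (16). The $\varepsilon$-covering number of $\mathcal{F}^B_\Omega$ with respect to $\|\cdot\|_\infty$ can be upper-bounded as

(79)

Therefore, the corresponding metric entropy can be upper-bounded as

Proof. As discussed after introducing the class $\mathcal{H}^B_\Omega$, we have $\mathcal{F}^B_\Omega \subseteq \mathcal{H}^B_\Omega$. Therefore, according to the approximate monotonicity of covering numbers (see, e.g., Exercise 4.2.10 in [47]), we have, for every $\varepsilon > 0$,

$$N(\mathcal{F}^B_\Omega, \|\cdot\|_\infty, \varepsilon) \le N(\mathcal{H}^B_\Omega, \|\cdot\|_\infty, \varepsilon/2).$$

Thus, it remains to prove a covering number bound for $\mathcal{H}^B_\Omega$. Let $N_{\tilde\varepsilon}$ be an $\tilde\varepsilon$-covering net of the ball

$$\mathcal{B} := \big\{ v \in \mathbb{R}^{|\Omega|} \;:\; \|v\|_2 \le 2(2\pi)^{d/2} B \big\}$$

with respect to the metric induced by $\|\cdot\|_2$ on $\mathbb{R}^{|\Omega|}$. By definition of $\mathcal{H}^B_\Omega$, to every $f \in \mathcal{H}^B_\Omega$ we can associate a point $(a_0, (a_\omega)_{\omega\in\Omega_+}, (b_\omega)_{\omega\in\Omega_+}) \in \mathcal{B}$. Given such a vector of coefficients (which, again, for the sake of notational ease, we write as $(a_0, (a_\omega)_\omega, (b_\omega)_\omega)$, omitting the $\Omega_+$ everywhere), we can find an element $(\tilde a_0, (\tilde a_\omega)_\omega, (\tilde b_\omega)_\omega) \in N_{\tilde\varepsilon}$ of the cover that is $\tilde\varepsilon$-close in 2-norm to the coefficients of $f$, i.e., such that

$$\big\|(a_0 - \tilde a_0, (a_\omega - \tilde a_\omega)_\omega, (b_\omega - \tilde b_\omega)_\omega)\big\|_2 \le \tilde\varepsilon.$$

Define $\tilde f$ as the function specified by these new coefficients,

$$\tilde f(x) := \frac{\tilde a_0}{2} + \sum_{\omega\in\Omega_+} \big(\tilde a_\omega \cos(\omega x) + \tilde b_\omega \sin(\omega x)\big).$$

We now bound the infinity norm distance between $f$ and $\tilde f$ in terms of the 2-norm distance between the corresponding coefficients as

$$\|f - \tilde f\|_\infty \le \frac{|a_0 - \tilde a_0|}{2} + \sum_{\omega\in\Omega_+} \big(|a_\omega - \tilde a_\omega| + |b_\omega - \tilde b_\omega|\big) \le \big\|(a_0 - \tilde a_0, (a_\omega - \tilde a_\omega)_\omega, (b_\omega - \tilde b_\omega)_\omega)\big\|_1 \le \sqrt{|\Omega|}\,\big\|(a_0 - \tilde a_0, (a_\omega - \tilde a_\omega)_\omega, (b_\omega - \tilde b_\omega)_\omega)\big\|_2 \le \sqrt{|\Omega|}\,\tilde\varepsilon.$$

Here, we have used the triangle inequality and the fact that sine and cosine can only take values in $[-1, 1]$, as well as (in the last step) Hölder's inequality with respect to the 2-norm. That means, if we denote by $N_{\mathcal{F}}$ the set of GTPs whose coefficients come from the cover $N_{\tilde\varepsilon}$, i.e.,

$$N_{\mathcal{F}} := \Big\{ x \mapsto \frac{\tilde a_0}{2} + \sum_{\omega\in\Omega_+} \big(\tilde a_\omega \cos(\omega x) + \tilde b_\omega \sin(\omega x)\big) \;:\; (\tilde a_0, (\tilde a_\omega)_\omega, (\tilde b_\omega)_\omega) \in N_{\tilde\varepsilon} \Big\},$$

and if we fix $\tilde\varepsilon$ to be $\tilde\varepsilon = \varepsilon/\sqrt{|\Omega|}$, then $N_{\mathcal{F}}$ is an $\varepsilon$-covering net of $\mathcal{H}^B_\Omega$ with respect to $\|\cdot\|_\infty$.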
Thus, to finish the proof, it remains to upper bound the cardinality |N F | ≤ |Nε|. To obtain such a bound, we recall that we only require Nε to be anε-cover of a 2-norm ball of radius 2(2π) d 2 B in R |Ω| with respect to the 2-norm. A simple volumetric argument (presented, e.g., in section 4 of Ref. [47]) shows that there exists such aε-cover Nε of B with cardinality All in all, we have proven that there exists an ε-covering net of H B Ω with respect to · ∞ whose cardinality is bounded by (93) This is exactly the claimed upper bound on the ε-covering number of H B Ω , thus completing the proof.
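The key step of the covering argument is that closeness of coefficient vectors in 2-norm controls closeness of the corresponding GTPs in supremum norm, at the cost of a factor $\sqrt{|\Omega|}$. The following Python sketch checks this on a grid for one randomly chosen GTP and a perturbed copy of its coefficients; the frequency set, the perturbation size, the grid resolution, and the helper names are illustrative assumptions.

```python
import math
import random


def gtp(x, a0, coeffs):
    """a0/2 + sum_w (a_w cos(w.x) + b_w sin(w.x)); coeffs is a list of (w, a_w, b_w)."""
    total = a0 / 2
    for w, a, b in coeffs:
        phase = sum(wi * xi for wi, xi in zip(w, x))
        total += a * math.cos(phase) + b * math.sin(phase)
    return total


random.seed(3)
omega_plus = [(1, 0), (0, 1), (1, 1), (1, -1)]  # hypothetical Omega_+
size_omega = 2 * len(omega_plus) + 1            # |Omega| = 2|Omega_+| + 1

a0 = random.gauss(0, 1)
coeffs = [(w, random.gauss(0, 1), random.gauss(0, 1)) for w in omega_plus]

# Perturbed coefficients, playing the role of the nearest point in the net.
a0_t = a0 + random.uniform(-0.05, 0.05)
coeffs_t = [
    (w, a + random.uniform(-0.05, 0.05), b + random.uniform(-0.05, 0.05))
    for w, a, b in coeffs
]

# 2-norm distance between the two coefficient vectors.
coef_dist = math.sqrt(
    (a0 - a0_t) ** 2
    + sum((a - at) ** 2 + (b - bt) ** 2
          for (_, a, b), (_, at, bt) in zip(coeffs, coeffs_t))
)

# Supremum distance between the two functions, approximated on a grid.
grid = [2 * math.pi * k / 60 for k in range(60)]
sup_dist = max(
    abs(gtp((x1, x2), a0, coeffs) - gtp((x1, x2), a0_t, coeffs_t))
    for x1 in grid for x2 in grid
)
print(sup_dist, math.sqrt(size_omega) * coef_dist)
```

The first printed value never exceeds the second, which is why a net of resolution $\varepsilon/\sqrt{|\Omega|}$ in coefficient space yields a net of resolution $\varepsilon$ in function space.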
The covering number bound just established implies a generalization bound for GTPs.
Proof. The proof consists of three steps. First, we use the chaining technique from random process theory to upper bound the (empirical) Rademacher complexity in terms of an integral over the square root of the uniform empirical metric entropy. Second, we show that the metric entropy with respect to · ∞ upper-bounds the uniform empirical metric entropy, so we can use the bound in Lemma 9 to upper-bound the (empirical) Rademacher complexity of generalized trigonometric polynomials. Third, we again use the standard generalization bound based on empirical Rademacher complexities.
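Similarly to the proof of Theorem 6, we define

$$\mathcal{G} := \big\{ (x, y) \mapsto \ell(y, f(x)) \;:\; f \in \mathcal{F}^B_\Omega \big\}.$$

Again, since we assume that $\mathbb{R} \ni z \mapsto \ell(y, z)$ is $L$-Lipschitz for all $y \in \mathbb{R}$, Talagrand's Lemma (Lemma 18 in the Appendix) tells us that

$$\hat{R}_S(\mathcal{G}) \le L\,\hat{R}_{S|_x}(\mathcal{F}^B_\Omega),$$

where we have denoted by $S|_x := \{x_i\}_{i=1}^m$ the unlabeled training data points. Next, Dudley's Theorem (which we recall as Theorem 19 in the Appendix) yields

where $\|\cdot\|_{2,S|_x}$ is the (data-dependent) semi-norm on $\mathbb{R}^{\mathbb{R}^d}$ defined as

$$\|f\|_{2,S|_x} := \sqrt{\frac{1}{m}\sum_{i=1}^m f(x_i)^2}.$$

Since $\|\cdot\|_{2,S|_x} \le \|\cdot\|_\infty$, we can combine Eq. (97) with our covering number bound from Lemma 9 and further upper bound

where we have used the integral

with the error function defined as

$$\operatorname{erf}(x) := \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,\mathrm{d}t.$$

At this point, we again have a bound on the empirical Rademacher complexity at our disposal. So, just like in the proof of Theorem 6, we can now apply the standard Rademacher complexity generalization bound. This then tells us that, for any probability measure $P$ on $[0, 2\pi)^d \times \mathbb{R}$ and for any $\delta > 0$, with probability $\ge 1 - \delta$ over the choice of an i.i.d. training data set $S$ of size $m$, we have, for every $f \in \mathcal{F}^B_\Omega$,

as claimed.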
Also for this generalization bound, we provide the reformulation in terms of a bound on the sample size sufficient to guarantee small generalization error.
suffices to guarantee that, with probability ≥ 1 − δ over the choice of i.i.d. training data S ∈ Proof. We set the upper bound on the generalization error proven in Theorem 10 equal to ε and solving for m.

Remark 12.
Our metric entropy bounds of trigonometric polynomials presented here again extend beyond the case of bounded 2-norm of the vector of Fourier coefficients to a general bounded p-norm. However, if we again consider, for 1 ≤ p ≤ ∞, the class H B,p Ω defined in Remark 8, our proof strategy here yields essentially (i.e., to leading order in |Ω|) the same metric entropy and generalization bounds as for p = 2. The reason is that the dimension of the space in which we take covering nets in the proof of Lemma 9 remains proportional to |Ω|, independently of p. We only see improvements for 1 ≤ p < 2 in the terms depending logarithmically on |Ω|. Therefore, while the proof strategies of Sections 5.1 and 5.2 give essentially the same generalization guarantees for p = 2, the approach of Section 5.1 adapts nicely to the case p < 2, whereas the reasoning of Section 5.2 is typically preferable for p > 2.
Remark 13. The proof of Theorem 10 extends beyond Lipschitz loss functions. For example, suppose that R z → (y, z) is α-Hölder continuous with Hölder coefficient A > 0 for all y ∈ R, where α ∈ (0, 1). Then, with the notation of the above proof, We can thus apply Dudley's Theorem to upper bound Now, we again observe that · 2α,S|x ≤ · ∞ and upper bound the covering number integral, using our result from Lemma 9. The parameters of the Hölder continuity enter the final Rademacher complexity bound via a term scaling with log(A) /α.

Encoding-dependent generalization bounds for parametrized quantum circuits
We are finally in a position to answer the questions posed in Section 2. Recall that our first goal was to derive generalization bounds for PQC-based models which depend explicitly on architectural hyper-parameters related to the data-encoding strategy. We showed in Section 3 how PQC-based model classes can be viewed as a subset of generalized trigonometric polynomials (GTPs), whose set of frequencies Ω is determined solely by the data-encoding strategy D. We then derived complexity and generalization bounds for GTPs in terms of the number of different frequencies |Ω(D)|. In order to provide explicitly encoding-dependent generalization bounds for PQC-based models, it remains to express |Ω(D)| in terms of the relevant architectural hyper-parameters associated with different data-encoding strategies.
To do so, we recall that the data-encoding strategy of a PQC-based model class is defined as a collection of lists of data-encoding Hamiltonians $D^{(i)} = \{H^{(i)}_j\}$ associated with each coordinate $x^{(i)}$. We distinguish different data-encoding strategies according to the different assumptions made on the structure of the data-encoding Hamiltonians $H \in D^{(i)}$. Given a particular assumption, for example that all $H$ are tensor products of Pauli operators or at most $\kappa$-local, the natural hyper-parameter associated with the data-encoding strategy is the number $N = \sum_{i=1}^d |D^{(i)}|$ of data-encoding Hamiltonians of the assumed type. Hence, our goal in this section is to derive, for different data-encoding strategies, upper bounds on $|\Omega(D)|$ that depend on $N$ as well as on other relevant properties of the data-encoding Hamiltonians (such as, e.g., the locality $\kappa$). By substituting these upper bounds on $|\Omega|$ into the GTP generalization bounds of the previous section, we then obtain generalization bounds for PQC-based model classes which depend explicitly on properties of the data-encoding strategy.
We first recall the definition of $\Omega$ from Eq. (30). If we denote the Hamiltonians of the data-encoding strategy associated with $x^{(i)}$ as $\{H^{(i)}_j\}$, we can group the frequencies associated with each data coordinate into a separate sumset $\Omega^{(i)}$:
The frequencies belonging to the different coordinates $\{x^{(i)}\}$ are linearly independent because they were defined to be multiples of different standard basis vectors $e^{(i)}$. This implies that the cardinality of the full set is equal to the product of the individual cardinalities, $|\Omega| = \prod_{i=1}^{d} |\Omega^{(i)}|$, thus allowing us to multiply bounds on the cardinalities obtained for the separate data-encoding strategies, $|\Omega^{(i)}|$, to obtain a bound on $|\Omega|$.
As the underlying frequencies in $\Omega^{(i)}$ are all scalar multiples of the same basis vector $e^{(i)}$, the analysis of $\Omega^{(i)}$ comes down to the different frequencies generated by the Hamiltonians that are used to encode $x^{(i)}$. For a given single Hamiltonian $H$, we denote this set by
where $e$ is the basis vector associated to the respective coordinate. Next, we derive some bounds on $|\Omega^{(i)}|$ for different assumptions on the underlying Hamiltonians.
Worst-case upper bounds. We first derive the worst-case limits of $|\Omega^{(i)}|$ for $\kappa$-local encoding Hamiltonians. A $\kappa$-local Hamiltonian $H$ has local dimension $2^\kappa$ and the number of possible differences of eigenvalues in the spectrum is thus upper bounded as
One can in principle construct a Hamiltonian that saturates this bound by choosing $\mathrm{spec}(H_{\max}) = \{0, 3, 9, \ldots, 3^{2^\kappa}\}$, but this is a rather synthetic example that we do not expect to encounter on real hardware. Eq. (114) implies that repeating $N^{(i)}$ $\kappa$-local Hamiltonians will, in the case where there are no duplicates in the frequency set, imply a cardinality of at most
Again, this bound can be saturated by choosing Hamiltonians with ever-larger spectra, namely by choosing $H$:
We can reformulate this by counting how often the $2T + 1$ different elements of $\Omega^{(i)}_0$ are present in a particular instance of the above sum, and get
To bound the size of this set, we exploit the symmetry of the underlying frequencies $\delta_j \in \Delta(H^{(i)})$. Let us outline the idea: We will first count how we can distribute the number $N^{(i)}$ of repetitions over the different non-negative frequencies $\delta_j$ and then multiply this with the number of different frequencies that can be created by repeating $\delta_j$ and $-\delta_j$. To improve the scaling we get at the end, we will resort to a small trick and actually group the frequency $0$, which we know to always be present in the spectrum, with the first other frequency, therefore considering the combinatoric problem of distributing $N^{(i)}$ "balls" over $T$ distinguishable "bins" where some bins can be empty. The different possible ways to achieve this task are given by counting the weak compositions of $N^{(i)}$ into $T$ parts, $C(N^{(i)}, T)$. The number of such weak compositions is
We will denote such a composition as (N
where we have used the arithmetic-geometric mean inequality to obtain the second inequality.
From this inequality, we see that by repeating the same Hamiltonian for an encoding, we obtain a polynomial scaling in the number of repetitions whose exponent depends on the number of different frequencies generated by the repeated Hamiltonian.
Pauli encodings. Encodings performed with Hamiltonians that are a tensor product of Pauli operators, $H = \bigotimes_{k=1}^{n} P^{(k)}$ where $P^{(k)} \in \{I, X, Y, Z\}$, have been analyzed in Ref. [33]. Therein, it was shown that $N^{(i)}$ repetitions of such encodings of arbitrary dimension will result in $|\Omega^{(i)}| = 2N^{(i)} + 1$.
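The counting arguments of this section can be checked numerically on small instances. The following Python sketch is illustrative only: the helper names are hypothetical, frequencies are treated as scalars (the common basis vector is dropped), and the synthetic worst-case spectrum is assumed to consist of exactly $2^\kappa$ eigenvalues $0, 3, 9, \ldots$, for which all nonzero eigenvalue differences are distinct, so that a $d$-dimensional spectrum yields $d(d-1)+1$ differences. It also checks the $2N^{(i)}+1$ count for a non-identity Pauli string (eigenvalues $\pm 1$) and the standard closed form $\binom{N+T-1}{T-1}$ for the number of weak compositions of $N$ into $T$ parts.

```python
from itertools import product
from math import comb


def eigenvalue_differences(spectrum):
    """All pairwise differences of the eigenvalues, including 0."""
    return {lam - mu for lam in spectrum for mu in spectrum}


def repeated_frequencies(spectrum, repetitions):
    """Sumset obtained by adding `repetitions` copies of the difference set."""
    differences = eigenvalue_differences(spectrum)
    frequencies = {0}
    for _ in range(repetitions):
        frequencies = {f + d for f in frequencies for d in differences}
    return frequencies


def count_weak_compositions(total, parts):
    """Brute-force count of tuples of `parts` non-negative integers summing to `total`."""
    return sum(1 for c in product(range(total + 1), repeat=parts) if sum(c) == total)


if __name__ == "__main__":
    # Synthetic worst case (assumed spectrum with 2**kappa eigenvalues: 0, 3, 9, ...).
    for kappa in (1, 2, 3):
        dim = 2 ** kappa
        spectrum = [0] + [3 ** k for k in range(1, dim)]
        print(kappa, len(eigenvalue_differences(spectrum)), dim * (dim - 1) + 1)

    # A non-identity Pauli string has eigenvalues -1 and +1.
    for n in range(1, 6):
        print(n, len(repeated_frequencies([-1, 1], n)), 2 * n + 1)

    # Weak compositions of N into T parts versus the closed form C(N + T - 1, T - 1).
    print(count_weak_compositions(4, 3), comb(4 + 3 - 1, 3 - 1))
```

In each printed row the brute-force count and the closed-form expression should agree; the brute-force enumeration is only feasible for small parameters and is meant as a sanity check, not as a way to evaluate the bounds.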
Summary. We can easily connect the different upper bounds on $|\Omega^{(i)}|$ to upper bounds on $|\Omega|$ via the arithmetic-geometric mean inequality, i.e.,
and by noting that, for $q \geq 1$,
Table 1 summarizes the different upper bounds on $|\Omega^{(i)}|$ for individual parameters $x^{(i)}$ derived in this section as well as the associated bounds on $|\Omega|$.
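As a worked instance of the arithmetic-geometric mean step, consider the Pauli encoding. The sketch below assumes that $N^{(i)} = |D^{(i)}|$ is the number of repetitions used for coordinate $x^{(i)}$, so that $N = \sum_{i=1}^{d} N^{(i)}$; it is an illustration under this assumption rather than a reproduction of the displayed equations.

```latex
% Sketch only: assumes N^{(i)} = |D^{(i)}| and N = \sum_i N^{(i)}.
\begin{align*}
|\Omega| \;=\; \prod_{i=1}^{d} |\Omega^{(i)}|
         \;=\; \prod_{i=1}^{d} \big( 2N^{(i)} + 1 \big)
         \;\leq\; \Big( \frac{1}{d} \sum_{i=1}^{d} \big( 2N^{(i)} + 1 \big) \Big)^{d}
         \;=\; \Big( \frac{2N}{d} + 1 \Big)^{d} .
\end{align*}
```

For example, with $d = 2$, $N^{(1)} = 1$ and $N^{(2)} = 3$ one finds $|\Omega| = 3 \cdot 7 = 21 \leq (8/2 + 1)^2 = 25$, with equality in general exactly when all $N^{(i)}$ coincide.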
Given these results, we are finally in a position to provide a concrete answer to the first question posed in Section 2. More specifically, by substituting the upper bounds on $|\Omega|$ given in Table 1 into the generalization bounds for GTPs given in Section 5, we can obtain generalization bounds for PQC-based model classes which depend explicitly on architectural hyper-parameters associated with the data-encoding strategy. Recall that we denoted the function class associated with a particular set of parameters $\Theta$, an encoding strategy $D$ and an observable $M$, as $\mathcal{F}_{\Theta,D,M}$. We then obtain from Theorems 6 and 10 the following Corollary:
(b) if $D$ denotes any data-encoding strategy consisting of the same single Hamiltonian per data coordinate with $T$ frequencies,
(c) if $D$ denotes any data-encoding strategy consisting of the same single $\kappa$-local Hamiltonian per data coordinate,
While we consider only four specific data-encoding strategies in this corollary, the generalization bounds from Theorems 6 and 10 can in principle be applied to PQC-based models with any data-encoding strategy. To use the bounds, the corresponding $|\Omega(D)|$ has to be identified, which can then be readily combined with our generalization bounds for GTPs.

Comparison of data-encoding strategies from a generalization perspective
The results of the previous subsection give a concrete answer to the first question posed in Section 2, namely explicitly encoding-dependent generalization bounds for PQC-based models. However, recall from Section 2 that we also aimed to use such bounds to identify data-encoding strategies which give rise to a slow (polynomial) growth of model complexity with respect to increasingly complex data-encoding strategies, and therefore facilitate meaningful model selection via structural risk minimization. The results of the previous section now allow us to address this additional goal.
Given an assumption or constraint on the structure of the data-encoding Hamiltonians in a possible data-encoding strategy, the most natural data-encoding hyper-parameter for structural risk minimization is the number $N$ of encoding Hamiltonians. We see that using either repeated Pauli Hamiltonians, a repeated (but fixed) $\kappa$-local Hamiltonian, or the repetition of a fixed Hamiltonian with $2T + 1$ frequencies, leads to a complexity bound and generalization bound that scale polynomially with $N$. However, using $N$ different $\kappa$-local data-encoding Hamiltonians can lead, in the worst case, to complexity upper bounds which scale exponentially with respect to $N$. In the latter case we stress, however, that these worst-case bounds are constructed using Hamiltonians designed to saturate the maximum possible number of frequency differences, and in many cases the complexity scaling with respect to $N$ may be much slower. Additionally, while the polynomial generalization bounds we obtain for the first three data-encoding strategies give us hope in the possibility of meaningful structural risk minimization with respect to the number of data-encoding gates, our upper bounds on the generalization gap are not necessarily tight. Hence, we cannot rule out the possibility of better bounds for strategies consisting of many different Hamiltonians, which would facilitate the use of structural risk minimization.
Additionally, while increasing the complexity of a data-encoding strategy by increasing $N$ is a natural (and experimentally feasible) strategy, in principle one might also consider increasing either the locality $\kappa$ or the number of frequencies $T$ of the repeated data-encoding Hamiltonian. This would be particularly relevant in the realistic scenario where experimental constraints severely limit the number of data-encoding gates which can be used. However, apart from the potential experimental obstacles one would face in doing so, we note that while our complexity bounds are polynomial with respect to $N$ (when keeping $\kappa$ and $T$ fixed), they are exponential (or doubly-exponential) with respect to $\kappa$ and $T$ respectively (when keeping $N$ fixed). As such, given the generalization bounds we have obtained in this work, from the generalization and structural risk minimization perspective it makes the most sense to systematically increase the complexity of the data-encoding strategy by keeping $\kappa$ and/or $T$ constant, and increasing the number of data-encoding gates.

Discussion
As discussed in Section 2, the results from the previous section can be applied in a variety of ways. In particular, apart from the straightforward application of (probabilistically) bounding the generalization gap of an output hypothesis, or bounding the number of data samples required to guarantee an output hypothesis with a sufficiently small generalization gap, our results also facilitate the use of structural risk minimization with respect to architectural hyper-parameters related to the data-encoding strategy. We reiterate that the results obtained here should be viewed as complementary to many of the prior results discussed in Section 4. In particular, our results complement those which derive generalization bounds applicable to the same PQC-based hypothesis classes, but with explicit dependencies on architectural hyper-parameters which do not appear in our generalization bounds, such as depth, width, and total number of trainable gates.
More specifically, the generalization bounds of Section 6 allow one to use structural risk minimization to find the optimal setting for data-encoding hyper-parameters (in the sense of yielding an output hypothesis with the smallest upper bound on true risk). However, they do not give any guidance as to how one should choose the remaining architectural hyper-parameters, and in particular those related to the trainable parts of the PQC. As such, a natural (and recommended) strategy is to use different available and applicable generalization bounds to perform "multi-dimensional structural risk minimization": one can vary all architectural hyper-parameters for which one has a generalization bound, and evaluate each hyper-parameter setting with respect to an upper bound on the true risk obtained from a union bound over all existing applicable bounds. To make this more concrete, assume that we have a family of hypothesis classes $\{\mathcal{F}^{(k_1,k_2)}\}$, parametrized by two architectural hyper-parameters $k_1$ and $k_2$ (for example $k_1$ could be the number of encoding gates, and $k_2$ could be the number of trainable gates in a PQC-based model). Additionally, let us assume that we have derived two different generalization bounds, one depending on $k_1$, the other depending on $k_2$. More concretely, assume that we have a function $g_1(k_1, m, \delta)$ and a function $g_2(k_2, m, \delta)$ such that, for all $i \in \{1, 2\}$, for all $\delta \in (0, 1)$, with probability $1 - \delta$ over $S \sim P^m$, for all $h \in \mathcal{F}^{(k_1,k_2)}$ we have that the generalization gap of $h$ is at most $g_i(k_i, m, \delta)$. Using a union bound, we can then straightforwardly combine these two results to obtain the following generalization bound: For all $\delta \in (0, 1)$, with probability $1 - \delta$ over $S \sim P^m$, for all $h \in \mathcal{F}^{(k_1,k_2)}$ we have that the generalization gap of $h$ is at most $\min_i [g_i(k_i, m, \delta/2)]$. We see that we can perform structural risk minimization by varying both $k_1$ and $k_2$ and using $\min_i [g_i(k_i, m, \delta/2)]$ to calculate an upper bound on the true risk of the candidate hypothesis.
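The selection rule just described can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the two bound functions `g_encoding` and `g_trainable` are hypothetical placeholders with an arbitrary functional form (they are not the bounds derived in this work or elsewhere), and the empirical risks are made-up numbers. Only the combination step, which assigns failure probability $\delta/2$ to each bound and takes the smaller value, follows the text.

```python
import math


def g_encoding(k1, m, delta):
    """Hypothetical gap bound depending on the number of encoding gates k1."""
    return math.sqrt((k1 * math.log(m) + math.log(1.0 / delta)) / m)


def g_trainable(k2, m, delta):
    """Hypothetical gap bound depending on the number of trainable gates k2."""
    return math.sqrt((k2 * math.log(k2 + 1) + math.log(1.0 / delta)) / m)


def combined_gap_bound(k1, k2, m, delta):
    """Union bound: each bound gets failure probability delta/2, keep the smaller."""
    return min(g_encoding(k1, m, delta / 2), g_trainable(k2, m, delta / 2))


def select_hyperparameters(empirical_risks, m, delta):
    """Return the (k1, k2) minimizing empirical risk plus the combined gap bound."""
    def risk_bound(setting):
        k1, k2 = setting
        return empirical_risks[setting] + combined_gap_bound(k1, k2, m, delta)
    return min(empirical_risks, key=risk_bound)


if __name__ == "__main__":
    # Made-up empirical risks of trained hypotheses for three settings (k1, k2).
    risks = {(1, 1): 0.30, (2, 4): 0.10, (8, 16): 0.09}
    print(select_hyperparameters(risks, m=1000, delta=0.05))
```

Note that this sketch only combines the two bounds for each fixed setting; if the resulting guarantee is meant to hold simultaneously for all candidate settings, the failure probability would additionally have to be distributed over the candidates.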
The above argument can clearly be generalized to an arbitrary number of architectural hyper-parameters, and thereby yields a methodology for exploiting multiple existing generalization bounds for "multi-dimensional structural risk minimization." While the approach we have just discussed certainly allows us to exploit existing complementary generalization bounds depending on different architectural hyper-parameters, it is an interesting open question whether one can derive generalization bounds which depend simultaneously on multiple architectural hyper-parameters. In particular, it is of interest to understand whether one can in this way obtain generalization bounds, depending on multiple architectural hyper-parameters, which are tighter than the bounds obtained by taking a union bound over existing bounds, each of which depends only on a single hyper-parameter. A potential strategy for obtaining such bounds would be to better understand the effect of structural assumptions on the trainable part of a PQC architecture on the structure of the coefficients of the associated GTP representation. More concretely, while in this work we have focused on the frequency spectra of the GTPs, which are fully determined by the data-encoding strategy, the coefficients of the GTPs are determined by both the data-encoding strategy and the trainable part of the circuit. If one can characterize the implications of different circuit architectures on the structure of GTP coefficients, one could plausibly use refinements of the techniques presented in Section 5 to derive generalization bounds for the relevant GTPs that depend simultaneously on both the data-encoding strategy and complementary parameters of the circuit architecture. For example, certain PQC architectures may lead to GTP coefficients with a specific sparsity structure, or a constrained upper bound on a specific norm.
Such a norm-specific bound may allow us to exploit the general p-norm extensions of our GTP bounds, mentioned in Remarks 8 and 12, to derive generalization bounds which also depend on the trainable circuit architecture.
Finally, we recall the potential shortcomings of uniform generalization bounds. In particular, in Ref. [36], the authors have shown both experimentally and analytically that sufficiently complex neural networks can achieve zero empirical risk for classification tasks with randomly assigned labels. As the true risk for such a learning problem can be no better than what would be achieved by random guessing, any uniform generalization bound for such a hypothesis class cannot offer any meaningful information in this complexity regime. More specifically, as uniform generalization bounds hold, by definition, for all hypotheses in the hypothesis class, and as there exist hypotheses which can achieve zero empirical risk even when generalization is not possible (i.e., when labels are selected randomly), such uniform bounds must be trivial.
It is, however, critical to emphasize that this finding applies only to sufficiently complex hypothesis classes. More specifically, it applies to models capable of achieving zero empirical risk even for completely unstructured data, which typically requires that the number of model parameters is at least as large as the number of elements in the training data set. As the number of parameters in a NISQ-regime PQC-based model is typically orders of magnitude less than the size of training data sets associated with "real-world" learning problems, it is unlikely that these known issues with uniform generalization bounds hinder the application of our uniform bounds to the analysis of currently available and near-term PQC-based hypothesis classes.
Despite this, it is important to keep these concerns in mind as the complexity of available PQC-based models increases. Consequently, there are a variety of natural open questions for future research: Firstly, can one replicate both the experimental and analytical aspects of Ref. [36] for PQC-based model classes? This would help to determine whether (or when) it is necessary to move beyond uniform generalization bounds for PQC-based models. In particular, from an experimental perspective, can one demonstrate the ability of a (sufficiently complex) PQC-based model class to achieve zero risk for a randomly-relabeled real-world classification task? Secondly, can one put an analytical bound on what is "sufficiently complex", i.e., how many model parameters are sufficient to ensure that for any training data set of size m, there always exists a hypothesis in the hypothesis class which can achieve zero empirical risk? Additionally, the shortcomings of uniform generalization bounds exposed in Ref. [36] have stimulated an explosion of research on non-uniform generalization bounds for highly complex neural network models [37]. It would be of interest to understand whether or how one can obtain non-uniform generalization bounds for PQC-based models, which would tighten the bounds obtained in this work in the future regime of high complexity.

Conclusion
In this work, we have derived Rademacher complexity and metric entropy bounds for PQC-based model classes. These depend explicitly on architectural hyper-parameters associated with the data-encoding strategy and are applicable to PQC-based models incorporating data re-uploading. By exploiting tools and techniques from statistical learning theory, we have then used these complexity bounds to obtain uniform generalization bounds, which allow one to place a probabilistic upper bound on the out-of-sample performance of any hypothesis, given its performance on the data. Additionally, we have used the obtained generalization bounds to compare data-encoding strategies from a generalization perspective and have discussed how, for certain data-encoding strategies, our generalization bounds may be used for model selection via structural risk minimization. We have stressed how the encoding-dependent generalization bounds obtained in this work should be viewed as complementary to existing complexity and generalization bounds for PQC-based models, which depend explicitly on architectural hyper-parameters to which our bounds are insensitive. More specifically, we have sketched in Section 7 how the combination of our bounds with existing works facilitates model selection via multi-dimensional structural risk minimization. Finally, as discussed in Section 7, it is important to acknowledge that the bounds we have obtained here are expected to be useful for PQC-based models in the "moderate-complexity" regime, i.e., for models parametrized by fewer parameters than the number of available data samples. However, in analogy with known results for classical model classes, these bounds may cease to be meaningful as the complexity of PQC-based models increases into an over-parametrized regime. Given this, we have also sketched in Section 7 a variety of open questions and directions for future research.
Foundation (Einstein Research Unit on quantum devices) and by the EU's Horizon 2020 research and innovation programme under grant agreement No. 817482 (PASQuanS). M.C.C. gratefully acknowledges support from the TopMath Graduate Center of the TUM Graduate School at the Technical University of Munich, Germany, from the TopMath Program at the Elite Network of Bavaria, and from the German Academic Scholarship Foundation (Studienstiftung des deutschen Volkes).

A Auxiliary results from statistical learning theory
In this appendix, we collect some well-known results from classical statistical learning theory that we make use of in our proofs.
Lemma 15 (Rademacher complexity progression (Theorem 2.15 in Ref. [35])). Let $a, b \in \mathbb{R}$ and $\sigma : \mathbb{R} \to \mathbb{R}$ an $L$-Lipschitz function and assume $\mathcal{F}_0 \subseteq \mathbb{R}^{\mathcal{X}}$ is a set of functions that includes the $0$ function. Also, let $\mathcal{F}$ be the following function class
Then, the empirical Rademacher complexity of $\mathcal{F}$ with respect to any point $x \in \mathcal{X}^m$ can be bounded in terms of the one of $\mathcal{F}_0$
The $2$ factor can be dropped if $\mathcal{F}_0 = -\mathcal{F}_0$.
Lemma 16 (Rademacher complexity of layered network (Corollary 2.11 in Ref. [35])). Let $a, b > 0$ and $\mathcal{X} := \{x \in \mathbb{R}^d \mid \|x\|_\infty \leq C\}$. Consider a neural network architecture with $\delta$ hidden layers that implements $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, and such that
1. The activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is $L$-Lipschitz and anti-symmetric.

For every neuron, the modulus of the bias is upper-bounded by $a$.
Then, the empirical Rademacher complexity of $\mathcal{F}$ with respect to any point $x \in \mathcal{X}^m$ can be upper-bounded as
Lemma 17 (Massart's Lemma [48]). Let $N \in \mathbb{N}$. Let $A \subset \mathbb{R}^N$ be a finite set contained in a Euclidean ball of radius $r > 0$. Then
where the expectation is with respect to i.i.d. Rademacher random variables $\sigma_1, \ldots, \sigma_N$.
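Massart's Lemma can be checked numerically for small finite sets. The following Python sketch assumes the common unnormalized form of the statement, $\mathbb{E}_\sigma[\max_{a \in A} \sum_{i} \sigma_i a_i] \leq r \sqrt{2 \log |A|}$, with the ball centered at the origin; the normalization used in the lemma above may differ by a factor depending on $N$. The function names and the example set are hypothetical.

```python
from itertools import product
from math import log, sqrt


def expected_supremum(vectors):
    """Exact E_sigma[max_{a in A} sum_i sigma_i a_i] by enumerating all sign patterns."""
    n = len(vectors[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(s * a for s, a in zip(signs, vec)) for vec in vectors)
    return total / 2 ** n


def massart_bound(vectors):
    """r * sqrt(2 log |A|), with r the largest Euclidean norm of a vector in A."""
    radius = max(sqrt(sum(a * a for a in vec)) for vec in vectors)
    return radius * sqrt(2 * log(len(vectors)))


if __name__ == "__main__":
    # Arbitrary example set A in R^4 (made-up numbers).
    A = [(1, 0, 0, 0), (0, 1, 0, 0), (1, 1, 1, 1), (-2, 1, 0, 3)]
    print(expected_supremum(A), massart_bound(A))
```

The exact enumeration over all $2^N$ sign patterns is only practical for small $N$; for larger instances one would replace it by Monte Carlo sampling of the Rademacher variables.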
Theorem 19 (Dudley's Theorem ([49]; see also Theorem 8.1.2 in Ref. [47] or Theorem 1.19 in Ref. [35])). For a fixed vector $z \in \mathcal{Z}^m$ let $\mathcal{G}$ be a subset of the pseudo-metric space $(\mathbb{R}^{\mathcal{Z}}, \|\cdot\|_{2,z})$ and let $\gamma_0 := \sup_{g \in \mathcal{G}} \|g\|_{2,z}$. Then the empirical Rademacher complexity $\hat{R}_z(\mathcal{G})$ of $\mathcal{G}$ with respect to $z$ can be upper-bounded as