On the Hardness of PAC-learning Stabilizer States with Noise

We consider the problem of learning stabilizer states with noise in the Probably Approximately Correct (PAC) framework of Aaronson (2007) for learning quantum states. In the noiseless setting, an algorithm for this problem was recently given by Rocchetto (2018), but the noisy case was left open. Motivated by approaches to noise tolerance from classical learning theory, we introduce the Statistical Query (SQ) model for PAC-learning quantum states, and prove that algorithms in this model are indeed resilient to common forms of noise, including classification and depolarizing noise. We prove an exponential lower bound on learning stabilizer states in the SQ model. Even outside the SQ model, we prove that learning stabilizer states with noise is in general as hard as Learning Parity with Noise (LPN) using classical examples. Our results position the problem of learning stabilizer states as a natural quantum analogue of the classical problem of learning parities: easy in the noiseless setting, but seemingly intractable even with simple forms of noise.


Introduction
A fundamental task in quantum computing is that of learning a description of an unknown quantum state ρ. Traditionally this is formalized as the problem of quantum state tomography, where we are granted the ability to form multiple copies of ρ and take arbitrary measurements, and must learn a state σ that is close to ρ in trace distance. In an influential work, Aaronson [1] introduced the "Probably Approximately Correct" (PAC) framework from computational learning theory [43] as an alternative perspective on this problem. Here the key innovation is that instead of learning ρ in an absolute metric (such as trace distance), we only wish to learn it with respect to a pre-specified distribution on measurements. This requirement is considerably weaker than that of full tomography.
Concretely, let E denote the space of two-outcome n-qubit measurements E, where E (corresponding to the POVM {E, I − E}) accepts a state ρ with probability Tr(Eρ) and rejects it otherwise. Let D be a distribution on E. We are given the ability to sample measurements E ∼ D together with the corresponding random measurement outcomes on ρ, and must learn ρ with respect to D. The Statistical Query (SQ) model of Kearns [32] has several properties that make it a natural candidate here:
• It is the most realistic learning model for which strong, unconditional lower bounds are known for many basic classes. Indeed, there is a considerable literature on this topic, with lower bounds usually proven using the so-called SQ-dimension and its generalizations [15,22,40].
• SQ algorithms are naturally implementable in a way that satisfies differential privacy of the training data, and indeed are the main examples of realistic differentially private learning algorithms [17,21].
Given all of these properties, it is natural to wonder whether the SQ model has something to bring to quantum learnability, with a particular eye towards noise tolerance. In this work we show (among other results) that for stabilizer states this approach cannot work: SQ-learning stabilizer states is exponentially hard, and in general, learning stabilizer states with noise is as hard as the well-known Learning Parity with Noise (LPN) problem.
Define a parity measurement to be a Pauli measurement of the form $E_x = \frac{P_x + I}{2}$ for some $x \in \{0,1\}^n$, where $P_x = \sum_{y \in \{0,1\}^n} \chi_x(y)|y\rangle\langle y|$ and $\chi_x(y) = (-1)^{x \cdot y \bmod 2}$. They are so named since for any computational basis state $|y\rangle\langle y|$, $\mathrm{Tr}(E_x |y\rangle\langle y|) = \frac{1 + \chi_x(y)}{2}$, which encodes the parity $x \cdot y \bmod 2$ (up to a relabeling of the two outcomes). The following theorems hinge on the observation (stated as Proposition 4.10) that parities can be very naturally embedded within the problem of learning stabilizer states under distributions on parity measurements. (These theorems are formally stated as Corollaries 4.7, 4.11 and 4.12 respectively.)
Our results position the problem of learning stabilizer states as a quantum analogue of the important classical problem of learning parities. In both cases there are simple "algebraic" learning algorithms for the noiseless setting, and the problem seems to become intractable with even the simplest kinds of noise. The algorithm of Rocchetto [41] thus joins a small class of PAC algorithms that do not fall into the SQ model, and hence do not admit any straightforward adaptations to noisy settings. In our view, this frames learning stabilizer states with noise as one of the more compelling problems on the frontier of learning quantum states with noise.
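To make this embedding concrete, here is a small numerical sketch (using numpy; all function names are ours, not the paper's) checking that, under the definitions above, the expected ±1-valued outcome of the parity measurement $E_x$ on a computational basis state $|y\rangle\langle y|$ is exactly the ±1-valued parity $\chi_x(y)$:

```python
import numpy as np
from itertools import product

n = 3

def P_x(x):
    # P_x = sum_y (-1)^{x.y} |y><y|, a diagonal n-qubit Pauli in {I, Z}^{⊗n}
    diag = [(-1) ** (np.dot(x, y) % 2) for y in product([0, 1], repeat=n)]
    return np.diag(diag).astype(float)

def E_x(x):
    # the associated two-outcome measurement E_x = (P_x + I)/2
    return (P_x(x) + np.eye(2 ** n)) / 2

def f(rho, E):
    # expected ±1-valued outcome: f_rho(E) = 2 Tr(E rho) - 1
    return 2 * np.trace(E @ rho) - 1

ok = True
for x in product([0, 1], repeat=n):
    for y in product([0, 1], repeat=n):
        ket = np.zeros(2 ** n)
        ket[int("".join(map(str, y)), 2)] = 1.0
        rho = np.outer(ket, ket)                # the basis state |y><y|
        parity = (-1) ** (np.dot(x, y) % 2)     # chi_x(y)
        ok &= np.isclose(f(rho, E_x(x)), parity)
```

In the ±1 outcome convention, the expected outcome of $E_x$ on $|y\rangle\langle y|$ is deterministic and equal to the parity, which is exactly the sense in which parities embed into this learning problem.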
Another interpretation of our results is that they highlight limitations of the PAC framework of Aaronson [1]: insofar as this framework reduces the problem of learning quantum states to an essentially classical problem, it also inherits longstanding problems from classical learning theory.
We also hope that our introduction of the SQ model to quantum state learning will be of independent interest and help spur new ideas in this area.
We now detail the rest of our contributions and lay out the organization of this paper:
• In Section 2, we formally define the problem of SQ-learning quantum states and extend the notion of the SQ-dimension to this setting, building on recent work that formally analyzed the SQ-dimension as applicable to the p-concept setting [24].
• In Section 3, we show that SQ algorithms for learning quantum states are indeed resistant to mild forms of noise, including classical classification noise as well as quantum channels with bounded noise (such as depolarizing noise).
• In Section 4, we give exponential SQ lower bounds on learning stabilizer states. Under the uniform distribution on Pauli measurements, we show (Corollary 4.7) that it requires exponentially many queries in order to improve on the maximally mixed state's performance. Under a different natural distribution on Pauli measurements, namely the uniform distribution over parity measurements, we show (Corollaries 4.11, 4.12) that learning stabilizer states with noise is as hard as learning parities with noise.
• In Section 5, by way of positive results, we give SQ algorithms for the simple setting of learning product states. We describe an SQ algorithm for learning product states under Haar-random single-qubit measurements, and show that it allows one to perform tomography on the individual qubits.
• In Section 6, we relate SQ learning to a form of differential privacy for quantum state learners. This form of differential privacy has recently been studied by [11].

Related work
The problem of learning quantum states via state tomography has a long history in quantum computing, culminating in the celebrated optimal algorithms of O'Donnell and Wright [36,37] and Haah et al. [28]. We operate in the alternative PAC framework introduced by Aaronson [1]. In recent years, this framework has been extended to the online setting [6] as well as verified in experimental setups [42]. To our knowledge, the only known computationally efficient PAC learners for supervised learning of a commonly-considered class of states are the algorithm of Rocchetto [41] for learning stabilizer states, as well as that of Yoganathan [45] for other classes of states whose generating circuits can be efficiently classically simulated and inverted, including low Schmidt rank states. While the focus of this paper is on stabilizer states, we remark that Yoganathan's algorithm for low Schmidt rank states also involves solving a system of polynomial equations in the examples, and hence would also not admit any straightforward SQ implementation. Cheng et al. [19] frame the problem of PAC-learning unknown quantum measurements under a distribution of states as a dual problem to PAC-learning an unknown state, and are able to recover Aaronson's main sample complexity bound using a classical proof.
Other examples of efficient learning algorithms not lying strictly within the PAC formalism include Aaronson and Grewal [4] in the unsupervised setting for non-interacting fermions and Montanaro [35] for improved tomography of stabilizer states. Lai and Cheng [34] extend Montanaro's ideas for stabilizer states and use it to learn outputs of Clifford circuits with a single layer of T gates.
Recent work by Arunachalam et al. [11] extends work by Bun et al. [18] to the quantum setting, and relates differentially private (DP) learning of quantum states to one-way communication, online learning, and other models. We show in Section 6 that our notion of SQ learnability implies their notion of DP learnability, and hence by their results also implies finite sequential fat-shattering dimension, online learnability, and "quantum stability." We note that the problem of PAC-learning quantum states is very different from the problem of PAC-learning Boolean functions using quantum representations of data, as considered in a recent active line of work [8,9]. In particular, the model of SQ-learning that we introduce is unrelated to a recent notion of SQ-learning of Boolean functions using quantum representations [10]. When one is given quantum samples of Boolean or integer-valued functions, there have been important results on learning in the presence of noise, showing that both Learning Parity with Noise (LPN) [20] and Learning With Errors (LWE) [27] are tractable in this setting.

Preliminaries
Notation and terminology. We use ρ to refer to the density matrix of an n-qubit quantum mixed state, representable as a $2^n \times 2^n$ PSD operator of trace 1. (The number of qubits will be n throughout and suppressed from the notation.) A pure state is a quantum state with rank 1. Let E denote the space of two-outcome n-qubit measurements E (corresponding to the POVM {E, I − E}), which accept a state ρ with probability Tr(Eρ). We will view the measurement outcomes themselves as {−1, 1}-valued, so that the outcome of measuring ρ using E is a random variable Y that is 1 with probability Tr(Eρ) and −1 otherwise. Define $f_\rho : \mathcal{E} \to [-1, 1]$ by $f_\rho(E) = 2\,\mathrm{Tr}(E\rho) - 1$, the expected value of this outcome. We will often identify a state ρ with its behavior with respect to two-outcome measurements, namely with the function $f_\rho$, and use the notation $Y \sim f_\rho(E)$ to mean that $Y \in \{-1, 1\}$ is the random measurement outcome satisfying $\mathbb{E}[Y \mid E] = f_\rho(E)$. In learning theoretic terms, this means $f_\rho$ describes a probabilistic concept, or p-concept, on the space E. A p-concept on a domain X is a classification rule that assigns random {−1, 1}-valued labels to each point in X according to a fixed conditional mean function; we always identify the p-concept with its conditional mean function. Given a set F of quantum states, we use F to also mean the class of associated p-concepts, with the meaning clear from context. Given a distribution D over E, we will often regard functions $f_\rho, f_\sigma : \mathcal{E} \to \mathbb{R}$ as members of the space $L^2(D, \mathcal{E})$, with the inner product given by $\langle f, g \rangle_D = \mathbb{E}_{E \sim D}[f(E)\,g(E)]$. We use [n] to refer to the set of indices {1, . . . , n}. Given a set S ⊆ [n], we will use $\chi_S$ to refer to the parity on S, defined as a function from $\{-1,1\}^n \to \{-1,1\}$ by $\chi_S(x) = \prod_{i \in S} x_i$. Given $x, y \in \{0,1\}^n$, we will sometimes also use $\chi_x(y) = (-1)^{x \cdot y \bmod 2}$.
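As a concrete illustration of the p-concept view (a minimal numpy sketch; the names f_rho and sample_outcome are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_rho(E, rho):
    # conditional mean of the ±1-valued outcome: f_rho(E) = 2 Tr(E rho) - 1
    return 2 * np.real(np.trace(E @ rho)) - 1

def sample_outcome(E, rho):
    # Y = +1 with probability Tr(E rho), else -1, so that E[Y | E] = f_rho(E)
    return 1 if rng.random() < np.real(np.trace(E @ rho)) else -1

# single-qubit example: rho = |0><0| measured with E = |+><+|
rho = np.array([[1, 0], [0, 0]], dtype=complex)
plus = np.array([1, 1]) / np.sqrt(2)
E = np.outer(plus, plus).astype(complex)

mean = f_rho(E, rho)      # Tr(E rho) = 1/2, so the conditional mean is 0
emp = np.mean([sample_outcome(E, rho) for _ in range(20000)])
```

Here the empirical mean of the sampled ±1 outcomes concentrates around the conditional mean $f_\rho(E) = 0$.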

Learning models
We begin by formally defining the problem of PAC-learning a quantum state.
Definition 2.1 (PAC-learnability of quantum states, [1]). Let F be a class of n-qubit quantum states. Let D be a distribution over E. We say F is PAC-learnable up to squared loss ε with respect to D if there exists a learner that, given sample access to labeled examples (E, Y) for E ∼ D, Y ∼ $f_\rho(E)$ for an unknown ρ ∈ F, is able to output a state σ satisfying $\mathbb{E}_{E \sim D}[(f_\sigma(E) - f_\rho(E))^2] \le \varepsilon$. The number of examples used by the learner is called its sample complexity.
We note that this is a slight modification of the original definition in [1], stated directly in terms of squared loss since this is the view that will be convenient for us. PAC learners are also allowed to fail with some probability δ, but for simplicity we will ignore this in this paper. A learner that only succeeds with some constant probability can easily be amplified to succeed with probability 1 − δ using standard confidence amplification procedures.
With PAC learners, one may speak of both computational efficiency (overall running time) and statistical or information-theoretic efficiency (sample complexity). The original result of Aaronson [1] described a computationally inefficient algorithm for learning arbitrary states that nevertheless had O(n) sample complexity. An efficient PAC learner is one that is computationally efficient, i.e. runs in polynomial time, and hence also draws at most polynomially many examples (each draw is considered as taking one unit of time).
We now introduce the following natural extension of these definitions to the SQ setting. In both cases, we operate in the so-called distribution-specific setting, where the learner is assumed to have knowledge of the distribution D.
Definition 2.2 (SQ-learnability of quantum states). Let F be a class of n-qubit quantum states. Let D be a distribution over E. An SQ oracle for an unknown state ρ ∈ F is an oracle that accepts a query and a tolerance, (ϕ, τ), where ϕ : E × {−1, 1} → [−1, 1] and τ > 0, and responds with any value y such that $|y - \mathbb{E}_{E \sim D,\, Y \sim f_\rho(E)}[\varphi(E, Y)]| \le \tau$. We say F is SQ-learnable up to squared loss ε if there is a learner that, given only queries to the SQ oracle for an unknown ρ ∈ F, is able to output a state σ satisfying $\mathbb{E}_{E \sim D}[(f_\sigma(E) - f_\rho(E))^2] \le \varepsilon$. The number of queries used by the learner is called its query complexity.
An SQ learner is considered efficient if it uses polynomially many queries and its queries all have tolerance τ ≥ 1/ poly(n).
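A simulated SQ oracle along the lines of Definition 2.2 can be sketched as follows (numpy; this Monte Carlo implementation, its sample count, and all names are our own illustrative choices, and its tolerance guarantee holds only with high probability):

```python
import numpy as np

rng = np.random.default_rng(1)

def sq_oracle(phi, tau, sample_measurement, rho):
    # Answers a statistical query (phi, tau): returns a value y with
    # |y - E_{E~D, Y~f_rho(E)}[phi(E, Y)]| <= tau (w.h.p., via a
    # Hoeffding-style empirical mean -- a simulation, not a real oracle).
    m = int(np.ceil(16 / tau ** 2))
    total = 0.0
    for _ in range(m):
        E = sample_measurement()                  # E ~ D
        p = np.real(np.trace(E @ rho))            # Pr[Y = +1] = Tr(E rho)
        Y = 1 if rng.random() < p else -1
        total += phi(E, Y)
    return total / m

# toy setup: D uniform over {|0><0|, |+><+|}, target rho = |0><0|
rho = np.array([[1, 0], [0, 0]], dtype=complex)
plus = np.array([1, 1]) / np.sqrt(2)
measurements = [rho.copy(), np.outer(plus, plus).astype(complex)]
sample = lambda: measurements[rng.integers(2)]

# the query phi(E, Y) = Y has true expectation E_D[f_rho(E)] = (1 + 0)/2 = 1/2
ans = sq_oracle(lambda E, Y: Y, 0.1, sample, rho)
```

Note that the learner only ever sees aggregate expectations like `ans`, never individual examples; this is exactly what makes the model noise-tolerant in Section 3.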

SQ lower bounds for p-concepts
One of the chief features of the classical SQ model is the possibility of proving unconditional lower bounds on learning a class C in terms of its so-called statistical dimension. The quantum setting that we work in, where we identify a state ρ with the p-concept f ρ , becomes a special case of the SQ model for learning p-concepts. Building on recent work [24] that formally proved SQ lower bounds for p-concepts, we extend this framework to the quantum setting. Let X denote an arbitrary domain (for us, X will be E, while in the classical setting, X is usually R n ).

Definition 2.3 (Statistical dimension). Let D be a distribution on X, and let C be a class of functions from X to R. The average (un-normalized) correlation of C is defined to be $\rho_D(C) = \frac{1}{|C|^2} \sum_{c, c' \in C} |\langle c, c' \rangle_D|$. The statistical dimension on average of C at threshold γ, denoted $\mathrm{SDA}_D(C, \gamma)$, is the largest integer d such that every subset C′ ⊆ C of size $|C'| \ge |C|/d$ satisfies $\rho_D(C') < \gamma$.
Theorem 2.4 ([24], Cor. 4.6). Let D be a distribution on X, and let C be a p-concept class on X. Say our queries are of tolerance τ, the final desired squared loss is ε, and that the functions in C satisfy $\|c\|_D^2 \ge \beta$ for all c ∈ C. For technical reasons, we require τ ≤ ε and ε ≤ β/3. Then learning C up to squared loss ε (we may pick ε as large as β/3) requires at least $\mathrm{SDA}_D(C, \tau^2)$ queries of tolerance τ.
We remark that the way to interpret such a lower bound is as follows: if the SQ learner's queries have tolerance at least τ , then at least SDA D (C, τ 2 ) queries are required. That is, one must either use small tolerance or many queries.
The following lemma will be convenient in order to bound the SDA when we have bounds on pairwise correlations.

The problem of learning parities
One of the most basic problems in classical learning theory is that of learning the concept class of parity functions. Let the domain be X = {−1, 1}^n, and for any subset S ⊆ [n], define $\chi_S(x) = \bigoplus_{i \in S} x_i = \prod_{i \in S} x_i$ to be the parity on S. Here $x_i \oplus x_j = x_i x_j$ is simply the XOR when bits are represented by {−1, 1}. Let D be any distribution on X. We say a learner is able to learn parities under D if, given access to labeled examples (x, χ_S(x)) where x ∼ D and S ⊆ [n] is unknown (or, in the SQ setting, given access to the corresponding SQ oracle), for any error parameter ε it is able to output a function h such that $\Pr_{x \sim D}[h(x) \ne \chi_S(x)] \le \varepsilon$.
The problem of learning parities displays a striking phase transition in going from the noiseless to the noisy setting. Given noiseless labeled examples, recovering the right parity is simply a question of solving linear equations over $\mathbb{F}_2$, and can be done using Gaussian elimination by a PAC learner using only Θ(n) examples. With just a little noise, however, the problem seems to become intractable. Perhaps the simplest noise model one can consider is the classification noise model, where every example has its label flipped with some constant probability η (known as the noise rate). Learning parities under classification noise is the basis of the famous Learning Parity with Noise (LPN) problem. Formally, the search version of LPN with noise rate η is precisely the problem of learning parities under the uniform distribution on X with classification noise at rate η. Usually one also has the additional knowledge that the true target χ_S (the "secret") is picked uniformly at random from the set of all parities. This problem is widely conjectured to be hard, including for quantum algorithms, and is even used as a basis for cryptography (see [38] for a survey). The best-known algorithm in the PAC setting runs in slightly subexponential time [16].
Since SQ learners are naturally tolerant of classification noise, one would expect that there are no SQ learners for parities under the uniform distribution, and indeed, this is one of the foundational results in the SQ literature. Thus we see that simple Gaussian elimination is an example of an efficient PAC learner that is not SQ. This establishes a characteristic limitation of SQ algorithms: while they include a wide range of common algorithms, they do not include algorithms that depend entirely on "algebraic" structure.
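The "algebraic" noiseless learner discussed above can be sketched as follows (a hypothetical minimal implementation in numpy; solve_gf2 and all other names are ours): given noiseless examples, Gaussian elimination over $\mathbb{F}_2$ recovers a parity consistent with all of them.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
secret = rng.integers(0, 2, n)       # the unknown parity ("secret")

# noiseless labeled examples (a, a . secret mod 2)
m = 4 * n
A = rng.integers(0, 2, (m, n))
b = A @ secret % 2

def solve_gf2(A, b):
    # Gaussian elimination over F_2; returns a solution of A s = b (mod 2)
    A = A.copy() % 2
    b = b.copy() % 2
    m, n = A.shape
    pivots = []
    row = 0
    for col in range(n):
        piv = next((r for r in range(row, m) if A[r, col]), None)
        if piv is None:
            continue                  # free column
        A[[row, piv]] = A[[piv, row]]
        b[[row, piv]] = b[[piv, row]]
        for r in range(m):
            if r != row and A[r, col]:
                A[r] ^= A[row]        # XOR rows = addition over F_2
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    s = np.zeros(n, dtype=int)
    for r, col in enumerate(pivots):  # free variables set to 0
        s[col] = b[r]
    return s

recovered = solve_gf2(A, b)
```

With noiseless labels this succeeds outright, yet a single flipped label can derail the elimination entirely, which is the qualitative point of the phase transition described above.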
It is worth emphasizing that this discussion has considered learning parities with a classical representation of the data. When given a quantum representation of the data, as in the quantum "example state" $|\psi\rangle = 2^{-n/2} \sum_{x \in \{0,1\}^n} |x, \chi_S(x)\rangle$ (taking the distribution over the domain to be uniform), the task becomes easy even with noise [20]. This is because we can now use Hadamard gates to implement a Boolean Fourier transform à la the famous Bernstein–Vazirani algorithm [14].

Noise-tolerant SQ learning
One of the prime features of classical SQ learning is its inherent noise tolerance. From an intuitive standpoint, certain common stochastic noise models are systematic enough that their effects in expectation can be predicted in advance, and hence either be corrected for or bounded. Slightly more precisely, the query expectations of a noisy state are often related in simple ways to the query expectations on a noiseless state, so that the latter can be recovered from the former. We mainly consider three such noise models here: (a) classical classification noise and malicious noise, (b) quantum depolarizing noise, and (c) more general quantum channels with bounded noise.

Classification and malicious noise
Classification noise [7] and malicious noise [33,44] are two classical Boolean noise models that SQ algorithms are able to handle. In the classification noise model, every example's label is flipped with probability η (known as the noise rate). The malicious noise model is a stronger form of noise where for any given example, with probability 1 − η, the label is reported correctly, but with probability η both the point and its label may be arbitrary (and adversarially selected based on the learner's behavior so far). We note that these models are well-defined even in the p-concept setting and hence for quantum states, and simply introduce further randomness into the label. The following results were originally stated for Boolean functions but readily extend to p-concepts.

Theorem 3.1 ([32]). Let C be a p-concept class learnable under distribution D in the SQ model up to error ε using q queries of tolerance τ. Then for any constant 0 < η < 1/2, even with respect to an SQ oracle with classification noise at rate η (i.e., one that computes expectations with classification noise), C is learnable up to ε using O(q) queries of tolerance O(τ(1 − 2η)). If the learner is given noisy training examples as opposed to access to a noisy SQ oracle, then $\tilde{O}(q / \mathrm{poly}(\tau(1 - 2\eta)))$ noisy examples suffice.
Theorem 3.2 ([12]). Let C be a p-concept class learnable under distribution D in the SQ model up to error ε using q queries of tolerance τ. An SQ oracle with malicious noise at rate η is one that computes query expectations with respect to a distribution (1 − η)f(D) + ηQ, where f(D) denotes the true labeled distribution on pairs (x, y) for x ∼ D, y ∼ f(x) (f being the unknown target p-concept), and Q is an arbitrary and adversarially selected distribution on X × {−1, 1}. If $\eta = \tilde{O}(\varepsilon)$ and η < τ, then even with respect to an SQ oracle with malicious noise at rate η, C is learnable up to ε using O(q) queries of tolerance τ − η. If the learner is given noisy training examples as opposed to access to a noisy SQ oracle, then C is learnable (with constant probability) using $\tilde{O}(q / \mathrm{poly}(\tau - \eta))$ noisy examples. (More efficient implementations are also available in some special cases.)
The proofs of both theorems are similar: one first relates the noisy query expectations to the true expectations, and then argues that when using a suitably small tolerance (or sufficiently many examples) the effects of the noise can be corrected for (within information theoretic limits).
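The flavor of these corrections can be seen in a toy simulation (numpy; the p-concept and all names are our own choices): for a query of the form φ(x, Y) = Y · g(x), classification noise at rate η scales the expectation by (1 − 2η), so dividing by (1 − 2η) recovers the clean expectation up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(3)
eta = 0.2                  # known classification noise rate
m = 200_000

# a toy p-concept on the real line: E[Y | x] = tanh(x)
x = rng.normal(size=m)
y = np.where(rng.random(m) < (1 + np.tanh(x)) / 2, 1, -1)

# flip each label independently with probability eta
flip = rng.random(m) < eta
y_noisy = np.where(flip, -y, y)

g = np.sign               # any bounded query of the form phi(x, Y) = Y * g(x)
clean = np.mean(y * g(x))
corrected = np.mean(y_noisy * g(x)) / (1 - 2 * eta)  # undo the (1-2*eta) shrinkage
```

The shrinkage factor (1 − 2η) is exactly why the usable tolerance in Theorem 3.1 degrades to O(τ(1 − 2η)).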

Depolarizing noise
Depolarizing noise acts on quantum states by shifting them closer to the maximally mixed state. One can consider a setting where it acts on an entire n-qubit state at once, as well as one where it acts independently on each individual qubit. We will consider the former.
Definition 3.3 (Depolarizing noise). Let ρ be an arbitrary n-qubit state. Then depolarizing noise at rate η (0 < η < 1) acts on this state by transforming it into $\Lambda_\eta(\rho) = (1 - \eta)\rho + \eta \cdot I/2^n$.
Theorem 3.4. Let 0 < η < 1 be any constant, and let $\Lambda_\eta$ denote the depolarizing channel at noise rate η. Let C be a class of n-qubit quantum states and D be a distribution on E, the space of two-outcome measurements on such states. Let L be an SQ learner capable of learning C under D using q queries of tolerance τ. Then there exists a learner L′ such that for any ρ ∈ C, L′ is capable of learning ρ under D using q queries of tolerance τ(1 − η), given only SQ access to $\Lambda_\eta(\rho)$ as well as sampling access to D.
Proof. For simplicity, we will assume that we know the noise rate η exactly. (So long as we have an upper bound on η, then by a standard "grid search" argument due to [32], we can estimate η sufficiently closely simply by trying out many different values. Briefly: if we try out η = 0, δ, 2δ, . . . , 1 (1/δ values in all), then one of these will be within δ/2 of the true η. The algorithm, when run with this guess for η, will produce a good hypothesis. By taking δ = O(τ(1 − η)²) and testing all 1/δ hypotheses produced by our guesses for η on a sufficiently large validation set, we can ensure the best one will perform and generalize well.) Let ρ ∈ C be the unknown target. Observe that for any E ∈ E, by linearity,
$$\mathrm{Tr}(E\,\Lambda_\eta(\rho)) = (1 - \eta)\,\mathrm{Tr}(E\rho) + \eta\,\mathrm{Tr}(E)/2^n,$$
so that $f_{\Lambda_\eta(\rho)}(E) = (1 - \eta) f_\rho(E) + \eta f_{I/2^n}(E)$. The term involving $f_{I/2^n}$ does not depend on ρ and can be estimated using our sampling access to D, so queries about ρ can be simulated, up to a (1 − η) rescaling of the tolerance, by queries about $\Lambda_\eta(\rho)$. It is worth stressing that we are able to handle any constant noise rate η ∈ (0, 1), and the price we pay is requiring the tolerance to scale as τ(1 − η).
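The identity at the heart of this proof can be checked numerically (numpy; the random state and measurement are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4                      # two qubits, d = 2^n with n = 2
eta = 0.3

# a random pure state rho and a random measurement operator 0 <= E <= I
v = rng.normal(size=d) + 1j * rng.normal(size=d)
v /= np.linalg.norm(v)
rho = np.outer(v, v.conj())
H = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = (H + H.conj().T) / 2
_, U = np.linalg.eigh(H)                      # a random eigenbasis
E = U @ np.diag(rng.random(d)) @ U.conj().T   # eigenvalues in [0, 1]

# Tr(E Lambda_eta(rho)) = (1 - eta) Tr(E rho) + eta Tr(E)/2^n
noisy_rho = (1 - eta) * rho + eta * np.eye(d) / d
noisy = np.real(np.trace(E @ noisy_rho))
clean = np.real(np.trace(E @ rho))

# invert the affine map to recover the noiseless expectation exactly
recovered = (noisy - eta * np.real(np.trace(E)) / d) / (1 - eta)
```

Because the correction is exact given η, the only cost of the noise is the (1 − η) rescaling of the tolerance.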

Quantum channels with bounded noise
We can also consider more general kinds of quantum channels with bounded noise. As long as the queries are bounded, small amounts of noise cannot alter query expectations too much, and so can be "absorbed" into the tolerance. This is similar to classical malicious noise: since classical malicious noise at rate η can only change query expectations by η (recall that the queries are bounded by 1), a noisy query of tolerance τ − η is able to simulate a noiseless query of tolerance τ. Unlike with depolarizing noise, this means we cannot handle arbitrary η; this is an artifact of the fact that more general kinds of noise do not permit the kind of systematic correction we were able to perform for depolarizing noise.
For concreteness, here we consider a noisy quantum channel Λ such that $\|\Lambda - \mathbb{1}_n\|_\diamond \le \eta$, where $\mathbb{1}_n$ is the identity map on n-qubit states and the norm is the diamond norm. We do not define this norm here, but its chief property for our purposes is that for any n-qubit state ρ and two-outcome measurement E, $|\mathrm{Tr}(E(\rho - \Lambda(\rho)))| \le \eta$. Similar theorems can be proven with respect to other distance measures such as fidelity.
Theorem 3.5. Let Λ be a quantum channel such that $\|\Lambda - \mathbb{1}_n\|_\diamond \le \eta$, as above. Let C be a class of n-qubit quantum states learnable under distribution D using q queries of tolerance τ > 2η. Then C is still learnable under noise Λ (i.e. when our queries are answered not with respect to ρ but Λ(ρ)) using q noisy queries of tolerance τ − 2η.

General noise for distribution-free learning
So far, we have only considered distribution-specific learning, where the learner is only required to succeed with respect to a pre-specified distribution D. In the distribution-free case, where the learner is required to succeed no matter what D is, we now give a simple proof that any SQ algorithm for a concept class can also handle any kind of quantum noise on the state, as long as the noise is known. This is unsurprising, and at a high level, the approach simply boils down to off-loading the noise from the state onto the measurements. Learning under the resulting noisy set of measurements is then handled by the distribution-free learning algorithm.
Given a quantum operation Λ, its adjoint Λ† is the map satisfying Tr(E · Λ(ρ)) = Tr(Λ†(E) · ρ) for all ρ, and it always exists (see [39] for details on how to prove this folklore result). Let D be the distribution over measurements under which we are trying to learn the concept class C using statistical queries, and let Λ be the noise applied to the quantum state. We can then define D† to be the distribution of Λ†(E) for E drawn from D; by definition, the traces (and thus the statistical queries) are the same when applied to ρ under D† and to Λ(ρ) under D. Also by definition, a distribution-free learner for C is able to learn under the distribution D†.
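For a concrete instance of this adjoint relation, the depolarizing channel is a convenient example (numpy; we use the Hilbert–Schmidt adjoint formula for this specific channel, which follows from linearity of the trace; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4                      # two qubits
eta = 0.25

def Lambda(rho):
    # depolarizing noise on the state (Schrödinger picture)
    return (1 - eta) * rho + eta * np.trace(rho) * np.eye(d) / d

def Lambda_dag(E):
    # its adjoint acting on measurements (Heisenberg picture)
    return (1 - eta) * E + eta * np.trace(E) * np.eye(d) / d

v = rng.normal(size=d) + 1j * rng.normal(size=d)
v /= np.linalg.norm(v)
rho = np.outer(v, v.conj())
E = np.diag(rng.random(d)).astype(complex)    # a diagonal POVM element

lhs = np.trace(E @ Lambda(rho))               # Tr(E Λ(ρ))
rhs = np.trace(Lambda_dag(E) @ rho)           # Tr(Λ†(E) ρ)
```

The two traces agree exactly, which is the "off-loading" of noise from the state to the measurement used in the argument above.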
Lower bounds on learning stabilizer states with noise

Stabilizer formalism
The stabilizer states are a popular class of quantum states that are used throughout quantum information, in areas such as quantum error correction, classical simulation of quantum mechanics, and quantum communication. If we let P = {I, X, Y, Z} be the Pauli matrices, then we can define a generalization of the Pauli matrices to an n-qubit system as $\mathcal{P}_n = \{\pm 1\} \cdot \{I, X, Y, Z\}^{\otimes n}$. As an example, $-X \otimes Y \otimes Z \otimes Y \otimes Z \in \mathcal{P}_5$, though in the future we will simply opt to write this Pauli matrix as −XYZYZ. It is not hard to show that for every $P \in \mathcal{P}_n \setminus \{\pm I^{\otimes n}\}$, half of the eigenvalues are 1 and the other half are −1.
We say that a pure state ρ = |ψ⟩⟨ψ| with $|\psi\rangle \in \mathbb{C}^{2^n}$ is stabilized by $P \in \mathcal{P}_n$ if $P|\psi\rangle = |\psi\rangle$. In other words, |ψ⟩ must be an eigenvector of P with eigenvalue 1. The set of pure states that are stabilized by a subset $S \subseteq \mathcal{P}_n$ are then the states that lie in the intersection of the eigenvalue-1 subspaces. It is known that if S is an Abelian group that does not contain $-I^{\otimes n}$, then this intersection has dimension $2^n/|S|$ [29]. Due to the nature of the outer product, any vector $e^{i\theta}|\psi\rangle$ drawn from this space results in the same density matrix ρ = |ψ⟩⟨ψ|.
Definition 4.1 (Stabilizer states). Let $S \subseteq \mathcal{P}_n \setminus \{-I^{\otimes n}\}$ be an Abelian group of order $2^n$ (note that $\mathcal{P}_n$ is not itself a group under matrix multiplication). The unique density matrix ρ = |ψ⟩⟨ψ| corresponding to the one-dimensional subspace that is stabilized by all elements of S is then defined to be a stabilizer state. We then say that S stabilizes ρ.
The stabilizer states are then the set of all such quantum pure states that are stabilized by Abelian groups of order 2 n formed from P n \ {−I ⊗n }.
Proof. Because $|\psi\rangle\langle\psi| \ne |\phi\rangle\langle\phi|$, we have $S \ne S'$. We also know that $S \cap S'$ is an abelian group not containing $-I^{\otimes n}$, so $|S \cap S'| < 2^n$. Since $2^n/|S \cap S'|$ is the dimension of the space stabilized by this group [29], it must be an integer. By the prime factorization of $2^n$, $|S \cap S'| = 2^m$ for some integer $0 \le m < n$, of which the largest possible value is m = n − 1.
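As a small worked example of Definition 4.1 (numpy; the choice of state and group is ours): the Bell state $(|00\rangle + |11\rangle)/\sqrt{2}$ is stabilized by the order-$2^2$ abelian group {I⊗I, X⊗X, −Y⊗Y, Z⊗Z}.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)    # (|00> + |11>)/sqrt(2)

# candidate stabilizer group S = {I⊗I, X⊗X, -Y⊗Y, Z⊗Z}
group = [np.kron(I2, I2), np.kron(X, X), -np.kron(Y, Y), np.kron(Z, Z)]

# every element fixes the state, as Definition 4.1 requires
stabilizes = all(np.allclose(P @ bell, bell) for P in group)

# and the group is abelian
abelian = all(np.allclose(P @ Q, Q @ P) for P in group for Q in group)
```

Note the sign on −Y⊗Y: Y⊗Y sends the Bell state to its negation, so it is the signed Pauli −Y⊗Y that belongs to the stabilizer group.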

Difficulty of beating the maximally mixed state on uniform Pauli measurements
Let C be the class of all n-qubit stabilizer pure states. If $P \in \mathcal{P}_n$ is a Pauli operator, then the two-outcome measurement associated with P is (I + P)/2, and is referred to as a Pauli measurement. We will first examine the natural distribution D given by the uniform distribution over Pauli measurements, $\mathcal{E}_{\mathrm{Pauli}} = \{(I + P)/2 \mid P \in \mathcal{P}_n\}$.
In doing so, we will show that performing better than the trivial algorithm of always outputting the maximally mixed state I/2 n is difficult. We use the following folklore lemma (see e.g. [41] Lemma 1 for a simple proof).
Simple algebraic manipulations then tell us that Tr(Pρ) can only take on the values {1, 0, −1}, with Tr(Pρ) being 1 or −1 if and only if P or −P, respectively, is in the stabilizer group of ρ. With this result, we can compute bounds on $|\langle f_\rho, f_{\rho'} \rangle_D|$ for stabilizer states ρ and ρ′ by counting how many matrices lie in the intersection of their stabilizer groups or the negations of their stabilizer groups.

Lemma 4.4. Let C be the concept class of n-qubit stabilizer pure states, and let D denote the uniform distribution on n-qubit Pauli measurements. Then for any stabilizer state ρ we have $\langle f_\rho, f_\rho \rangle_D = 2^{-n}$, while for any distinct stabilizer states ρ, ρ′ we have $|\langle f_\rho, f_{\rho'} \rangle_D| \le 2^{-(n+1)}$.
Proof. Let ρ, ρ′ ∈ C. Let S and S′ be the stabilizer groups for ρ and ρ′ respectively, and also let −S = {−P | P ∈ S} and −S′ = {−P | P ∈ S′}. For any Pauli measurement (I + P)/2, we have $f_\rho\big(\frac{I+P}{2}\big) = 2\,\mathrm{Tr}\big(\frac{I+P}{2}\rho\big) - 1 = \mathrm{Tr}(P\rho)$. Thus the correlation of the two p-concepts becomes
$$\langle f_\rho, f_{\rho'} \rangle_D = \frac{1}{|\mathcal{P}_n|} \sum_{P \in \mathcal{P}_n} \mathrm{Tr}(P\rho) \cdot \mathrm{Tr}(P\rho').$$
The only nonzero terms are those with $P \in (S \cup -S) \cap (S' \cup -S')$, each contributing ±1, out of $|\mathcal{P}_n| = 2 \cdot 4^n$ terms in total. If ρ = ρ′, there are exactly $2 \cdot 2^n$ such terms, each contributing +1. If ρ ≠ ρ′, the terms with $P \in \pm(S \cap S')$ contribute +1 while those with $P \in \pm(S \cap -S')$ contribute −1, and the sum has magnitude at most $2|S \cap S'| \le 2 \cdot 2^{n-1}$, using the bound $|S \cap S'| \le 2^{n-1}$ from above.
With this result, we can use Lemma 2.5 to compute the SDA and by extension prove a lower bound on the number of statistical queries needed to learn this concept class under this distribution.
Proof. By Proposition 4.5, $|C| = 2^{\Theta(n^2)}$. Using Lemma 2.5 with $\kappa = 2^{-n}$ and $\gamma = 2^{-(n+1)}$ as calculated from Lemma 4.4, and setting the correlation threshold to $2^{-(n+1)}$, we obtain the claimed bound on the SDA.
Proof. Simply apply Theorem 2.4, with $\beta = 2^{-n}$.
Since the norms of our p-concepts are exponentially small (namely $2^{-n/2}$), we only get hardness for error on the order of $2^{-O(n)}$. But as we now show, the squared p-concept norm corresponds almost exactly to the squared loss achieved by the maximally mixed state. Our results therefore show that doing significantly better than the maximally mixed state requires $2^{\Omega(n^2)}$ statistical queries, even when the tolerance is exponentially small.
Proof. In essence, this is simply because the p-concept $f_{I/2^n}$ is almost always zero. Specifically, for all $E \in \mathcal{E}_{\mathrm{Pauli}} \setminus \{0, I\}$, $f_{I/2^n}(E) = 2\,\mathrm{Tr}(E/2^n) - 1 = 0$, since $\mathrm{Tr}(E) = 2^{n-1}$ for all such E. (This is because E = (I + P)/2 for some n-qubit Pauli matrix $P \in \mathcal{P}_n$, and Tr(P) = 0 for all $P \in \mathcal{P}_n \setminus \{\pm I^{\otimes n}\}$.) As for $E \in \{0, I\}$, we note that $f_\rho(E) = f_{I/2^n}(E)$. Thus
$$\mathbb{E}_{E \sim D}\big[(f_\rho(E) - f_{I/2^n}(E))^2\big] = \mathbb{E}_{E \sim D}\big[f_\rho(E)^2\,\mathbb{1}\{E \notin \{0, I\}\}\big] = \|f_\rho\|_D^2 - \frac{2}{|\mathcal{P}_n|} = 2^{-n} - 4^{-n}.$$
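As a sanity check on these calculations, here is a small numpy sketch (all names ours) verifying the correlation bounds of Lemma 4.4 for n = 2, using $|00\rangle\langle 00|$ and the Bell state:

```python
import numpy as np
from itertools import product

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# the signed two-qubit Paulis, |P_2| = 2 * 4^2 = 32
paulis = []
for a, b in product([I2, X, Y, Z], repeat=2):
    P = np.kron(a, b)
    paulis += [P, -P]

def f(rho):
    # p-concept values on the Pauli measurements: f_rho((I+P)/2) = Tr(P rho)
    return np.array([np.real(np.trace(P @ rho)) for P in paulis])

ket00 = np.zeros(4); ket00[0] = 1.0
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho1 = np.outer(ket00, ket00).astype(complex)
rho2 = np.outer(bell, bell).astype(complex)

inner = lambda u, v: np.mean(u * v)     # <f, g>_D under the uniform D
self_corr = inner(f(rho1), f(rho1))     # should equal 2^-n = 1/4
cross_corr = inner(f(rho1), f(rho2))    # magnitude at most 2^-(n+1) = 1/8
```

For this particular pair the cross-correlation in fact meets the $2^{-(n+1)}$ bound with equality.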

Lower bounds via a direct reduction from learning parities
To get around this norm issue, we look at a subset of stabilizer states for which we can produce p-concepts with norm 1. Recall that the Pauli measurements are the set of all projectors onto the eigenvalue-1 space of some Pauli matrix P, i.e. $\{\frac{P+I}{2} \mid P \in \mathcal{P}_n\}$. We define a subset of the Pauli measurements called the parity measurements, and show the hardness of SQ-learning stabilizer states under the uniform distribution on such measurements. This is via a simple equivalence, holding essentially by construction, with the problem of learning parities under the uniform distribution. As a further consequence, we obtain that learning stabilizer states with noise is at least as hard as Learning Parity with Noise (LPN). This holds for general PAC-learning, even outside the SQ model.

Definition 4.9 (Parity measurements). For all $x \in \{0,1\}^n$, let $P_x = \sum_{y \in \{0,1\}^n} \chi_x(y)|y\rangle\langle y|$. Since the set of all $P_x$ is exactly $\{I, Z\}^{\otimes n}$, the corresponding measurement $E_x = \frac{P_x + I}{2}$ is by definition a Pauli measurement. We will refer to such measurements as parity measurements.
Proposition 4.10. Let D be any distribution on $\{0,1\}^n$, and let D′ be the corresponding distribution over parity measurements $E_x$ for x ∼ D. Then learning the class of computational basis states under D′ is equivalent to learning parities under D, and this equivalence continues to hold in the presence of classification noise. In particular, learning stabilizer states under D′ is at least as hard as learning parities under D.
Proof. If the unknown state ρ is a computational basis state |y⟩⟨y|, then the value $\mathrm{Tr}(E_x|y\rangle\langle y|) = \frac{1 + \chi_x(y)}{2}$ encodes the parity x · y mod 2 (up to a relabeling of the two outcomes), i.e. the parity of x over the subset specified by y (represented using {0, 1} instead of {−1, 1}). In the PAC setting, this is equivalent to receiving the sample (E_x, x · y mod 2). Accordingly, let us define D′ simply as the distribution over $E_x$ for x ∼ D. It is clear that these are different representations of the same problem, so that a learning algorithm for one implies a learning algorithm for the other. We note that this relationship holds even in the presence of classification noise. Finally, note that computational basis states are a subset of the stabilizer states, so any learner for stabilizer states implies a learner for the computational basis states as well. This implies that learning stabilizer states under D′ is at least as hard as learning parities under D, even in the presence of classification noise.
Proof. By Proposition 4.10, SQ-learning stabilizer states under the uniform distribution on parity measurements is at least as hard as SQ-learning parities under the uniform distribution. Applying Theorem 2.6, we get the exponential lower bound.
Proof. Proposition 4.10 directly implies that learning computational basis states under the uniform distribution on parity measurements and with classification noise is equivalent to LPN.

An SQ learner for product states
Turning to positive results, we now give SQ algorithms for some simple concept classes, namely the computational basis states and, more generally, products of n single-qubit states. The distribution on measurements that we consider corresponds to a natural scheme for these classes: pick a qubit at random and measure it using a Haar-random unitary.
Concretely, let D' be the distribution of single-qubit measurements formed by projecting onto a Haar-random single-qubit state (i.e. U|0⟩⟨0|U† where U is a Haar-random unitary), and let D be the distribution on n-qubit measurements that corresponds to picking a qubit at random and measuring it with a measurement drawn from D'. That is, D = (1/n) Σ_{i=1}^n I^{⊗(i−1)} ⊗ D' ⊗ I^{⊗(n−i)}. Let C be the concept class of product states ρ = ⊗_{i=1}^n ρ_i. Of course, this class includes the computational basis states. The main result of this section is a simple O(n)-query SQ algorithm for learning C under the distribution D.
We remark that our algorithm's guarantee actually extends trivially to learning arbitrary (not just product) states under the above distribution D of single-qubit Haar-random measurements. This is simply because such measurements only ever inspect each qubit individually, so that a product state ⊗_i ρ_i is indistinguishable, under D, from a more general mixed state ρ whose reduced density matrix on qubit i is ρ_i for every i.⁴ Yet since this distribution on measurements is fundamentally not very interesting for anything other than product states, we state the results in this section only for product states.
Tr(E|ψ⟩⟨ψ|) = Tr[(1/2) U(I + cos φ sin θ X + sin φ sin θ Y + cos θ Z)U† |ψ⟩⟨ψ|] = Tr[(1/2)(I + cos φ sin θ X + sin φ sin θ Y + cos θ Z)|0⟩⟨0|] = (1 + cos θ)/2, where the second equality uses |ψ⟩ = U|0⟩ and the cyclicity of the trace. We can do the same for ρ:

Tr(Eρ) = λ · (1 + cos θ cos 2θ′ + cos φ sin θ sin 2θ′)/2 + (1 − λ) · (1 − cos θ cos 2θ′ − cos φ sin θ sin 2θ′)/2 = (1 + (2λ − 1)(cos θ cos 2θ′ + cos φ sin θ sin 2θ′))/2.

This allows us to perform a spherical integral over θ and φ to obtain the expectation.

Our algorithm for learning product states will work by learning each qubit in the Pauli basis. This results in a 3n-query algorithm, corresponding to the 3n parameters that it takes to define a product state. We first recall the definition of trace distance, which is the quantum generalization of total variation distance. The following lemma will then be needed to relate the trace distance of the states to the squared loss in learning under this distribution.
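As a quick numerical sanity check of the first identity above (a hypothetical illustration, stdlib only), the sketch below assembles the 2×2 operator (1/2)(I + cos φ sin θ X + sin φ sin θ Y + cos θ Z) from the Pauli matrices and confirms that its overlap with |0⟩⟨0| equals (1 + cos θ)/2 for random angles:

```python
import math
import random

# Pauli matrices as 2x2 nested lists (complex entries where needed)
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

def lincomb(coeffs, mats):
    """Entrywise linear combination sum_k coeffs[k] * mats[k] of 2x2 matrices."""
    return [[sum(c * m[i][j] for c, m in zip(coeffs, mats)) for j in range(2)]
            for i in range(2)]

def trace_with_ket0(m):
    """Tr(M |0><0|) = M[0][0] for a 2x2 matrix M."""
    return m[0][0]

random.seed(0)
for _ in range(1000):
    theta = random.uniform(0, math.pi)
    phi = random.uniform(0, 2 * math.pi)
    # E = (1/2)(I + cos(phi)sin(theta) X + sin(phi)sin(theta) Y + cos(theta) Z)
    E = lincomb(
        [0.5,
         0.5 * math.cos(phi) * math.sin(theta),
         0.5 * math.sin(phi) * math.sin(theta),
         0.5 * math.cos(theta)],
        [I2, X, Y, Z],
    )
    # only the I and Z terms contribute on the diagonal, giving (1 + cos(theta))/2
    assert abs(trace_with_ket0(E) - (1 + math.cos(theta)) / 2) < 1e-12
```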
We define ξ_i = Tr_{¬i}(ξ) = ρ_i − σ_i to be the reduced density matrix of ξ on the i-th qubit. Noting that each ξ_i is traceless, by diagonalizing we can write its eigenvalues as ±λ_i for some λ_i ≥ 0, so that (1/2) Tr|ξ_i| = λ_i is the trace distance between the reduced density matrices ρ_i and σ_i.
We now show how to use Lemma 5.1 to learn each qubit of the product state, allowing us to then apply Lemma 5.4 to get our learning result.

Theorem 5.5. Let D be the distribution on measurements and let C be the concept class of product states defined earlier. There is an SQ learner that learns C under D up to squared loss ε using 3n queries of tolerance √ε/n.
Proof. Let the unknown ρ ∈ C be given by ρ = ⊗_i ρ_i. If we define P_1 = X, P_2 = Y, and P_3 = Z, then our queries will be ϕ_{i,j} for i ∈ [n] and j ∈ {1, 2, 3}. The query ϕ_{i,j} corresponds to taking the projection of the i-th qubit along the Pauli P_j, as we now show. Here the third equality exploits the definition of D as (1/n) Σ_{i=1}^n I^{⊗(i−1)} ⊗ D' ⊗ I^{⊗(n−i)} (only the i-th term yields a nonzero expectation), and the fourth equality is Lemma 5.1.
Any specific qubit ρ_i can be written in Bloch sphere coordinates as ρ_i = (1/2)(I + x_i X + y_i Y + z_i Z). We can estimate x_i = Tr(P_1 ρ_i) up to error √ε using a single query of tolerance √ε/n, and the same holds for y_i and z_i. Using these estimates to construct σ = ⊗_i σ_i, Proposition 5.3 bounds the trace distance on each qubit, and Lemma 5.4 then gives squared loss at most ε. We note that if an estimated point lies outside the Bloch sphere, we can simply normalize it to the surface of the sphere, and this never increases the error. To quickly sketch the proof, consider the estimated point q lying outside the sphere and the true point p lying inside the Bloch sphere. The normalized point q' lies on the segment from q to the origin, and the plane perpendicularly bisecting the segment between q and q' separates the points closer to q from those closer to q'. Since the entire Bloch sphere lies on the side closer to q', and the true point p lies inside the Bloch sphere, p is always closer to q' than to q.
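The geometric claim that projecting an outside estimate back onto the sphere never hurts can be spot-checked numerically. The sketch below (a hypothetical Monte-Carlo check, stdlib only) draws random true points inside the unit ball and random estimates outside it, and confirms the distance never increases:

```python
import math
import random

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def dist(a, b):
    return norm([x - y for x, y in zip(a, b)])

def normalize_to_sphere(q):
    """Project a point outside the unit ball back onto the unit sphere."""
    r = norm(q)
    return [c / r for c in q]

rng = random.Random(0)
for _ in range(10000):
    # true point p: uniform random direction, radius at most 1 (inside the Bloch ball)
    d = [rng.gauss(0, 1) for _ in range(3)]
    p = [c * rng.random() / norm(d) for c in d]
    # estimated point q: uniform random direction, radius in [1, 2) (outside the ball)
    e = [rng.gauss(0, 1) for _ in range(3)]
    q = [c * (1 + rng.random()) / norm(e) for c in e]
    # normalizing q onto the sphere never increases the distance to p
    assert dist(p, normalize_to_sphere(q)) <= dist(p, q) + 1e-12
```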
We can simplify this algorithm if we know in advance that ρ is a computational basis state. In that case, each qubit ρ_i is either (I + Z)/2 or (I − Z)/2, and so we only need to make the n queries ϕ_{i,3}, one for each i. Moreover, we only need to identify the coordinate z_i to within an accuracy of 1 in order to distinguish the z_i = 1 and z_i = −1 cases, so that our tolerance need only scale as O(1/n) in order to learn ρ perfectly (i.e. with ε = 0).
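The rounding step in this special case can be sketched as follows (a hypothetical illustration with made-up parameters): as long as each simulated query answer deviates from the true z_i by strictly less than 1, rounding to the nearest of ±1 recovers the basis state exactly.

```python
import random

rng = random.Random(1)
n = 8
# hidden computational basis state: z_i = +1 for (I+Z)/2 and -1 for (I-Z)/2
z = [rng.choice([-1, 1]) for _ in range(n)]

# simulated SQ answers: each true z_i corrupted by adversarial noise of magnitude
# strictly below 1, which is what a query of suitable tolerance guarantees
answers = [zi + rng.uniform(-0.99, 0.99) for zi in z]

# rounding to the nearest of {-1, +1} recovers the state exactly
decoded = [1 if a >= 0 else -1 for a in answers]
assert decoded == z
```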

Connections to differential privacy
A PAC learning algorithm L can be viewed as a randomized algorithm that takes as input a training dataset (i.e. a set of labeled examples (x, y) sampled from a distribution) and outputs a hypothesis that with high probability has low error over the distribution. That is, if S is a training dataset, then L(S) describes a probability distribution over hypotheses (where the randomness arises from the internal randomness of the learner). Intuitively, differential privacy requires L to satisfy a kind of stability: on any two inputs S and S' that are close, the distributions L(S) and L(S') must be close as well.

Definition 6.1 (Differential privacy, [21]). Call two datasets neighbors if they differ in only one entry. A learner L (understood in the sense just discussed) is said to be α-differentially private (or α-DP for short) if for any S and S' that are neighbors, the distributions L(S) and L(S') are close in the sense that for any hypothesis h, P[L(S) = h] ≤ e^α · P[L(S') = h].

A well-known property of SQ algorithms is that they can readily be made differentially private [17, 21]. Since differential privacy is a notion that is well-defined only in the PAC setting, where the input is a set of training examples (as opposed to access to an SQ oracle), such a statement is necessarily of the form "any SQ learner yields a PAC learner that satisfies differential privacy."

Theorem 6.2 (see e.g. [13]). Let C be a concept class learnable up to error ε by an SQ learner L using q queries of tolerance τ. Then it is also learnable up to error ε in the PAC setting by an α-DP learner L' with sample complexity Õ(q/(ατ) + q/τ²) (with constant probability).

The proof is standard and proceeds by simulating each of L's queries using empirical estimates over a sample of size roughly 1/τ², and then using the Laplace mechanism to add some further noise.
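A minimal sketch of that simulation (hypothetical helper names, stdlib only): a single [0, 1]-valued statistical query is answered from m samples by an empirical mean plus Laplace noise. Changing one sample moves the mean by at most 1/m (the sensitivity), so noise of scale 1/(mα) makes this one answer α-DP.

```python
import math
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) by inverse-CDF from a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_sq_answer(samples, query, alpha, rng):
    """One alpha-DP answer to a statistical query with range [0, 1].

    The empirical mean has sensitivity 1/m to changing a single sample,
    so Laplace noise of scale 1/(m * alpha) suffices for alpha-DP.
    """
    m = len(samples)
    mean = sum(query(s) for s in samples) / m
    return mean + laplace(1.0 / (m * alpha), rng)

rng = random.Random(42)
# toy dataset: biased coin flips with true mean 0.7
data = [1 if rng.random() < 0.7 else 0 for _ in range(20000)]
ans = private_sq_answer(data, lambda s: s, alpha=1.0, rng=rng)
assert abs(ans - 0.7) < 0.05  # still accurate despite the privacy noise
```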
One can extend this notion to the quantum setting. One natural and direct way of doing so is simply by replacing the classical dataset of labeled pairs (x_i, y_i) by one of measurement-outcome pairs (E_i, Y_i); the rest remains exactly analogous. Theorem 6.2 then carries over verbatim to our notion of quantum SQ learnability. This form of quantum differential privacy was recently studied by Arunachalam et al. [11], who were able to relate it to online learning, one-way communication complexity, and shadow tomography of quantum states, extending ideas of Bun et al. [18]. Since our notion of quantum SQ learnability implies quantum DP learnability, it also fits into their framework. In particular, by the chain of implications established in that work, efficient quantum SQ learnability of a class of states implies DP PAC learnability, which implies finite sequential fat-shattering (sfat) dimension, which in turn implies online learnability, gentle shadow tomography, and "quantum stability." In fact, in the classical setting, some of the main examples of realistic DP learners are SQ (even though technically the inclusion is known to be strict) [17, 31], and one might expect the same to hold in the quantum setting as well.
We remark that a somewhat different kind of quantum differential privacy, where privacy is with respect to copies of the unknown state, may also be defined as follows. View a quantum state learner L as an algorithm that takes in multiple copies ρ^⊗m of some unknown state ρ, is allowed to sample and perform random measurements from a distribution D, and outputs another state σ that is required, with high probability, to be close to ρ with respect to D. If the random measurements are viewed as the internal randomness of the learner, then this is similar to the view we took of a classical learner earlier. We can now define a notion of differential privacy for quantum state learners by requiring that L(ρ^⊗m) and L(ρ^⊗(m−1) ⊗ ρ') (where ρ' ≠ ρ, so that ρ^⊗m and ρ^⊗(m−1) ⊗ ρ' are neighbors) are α-close as distributions over states (in the natural way). This can also be seen as a stylized kind of tolerance to noise or corruptions. The following analogue of Theorem 6.2 can then be proven using almost exactly the same proof; essentially, we are only replacing classical examples with copies of quantum states.

Theorem 6.3. Let C be a class of quantum states learnable up to error ε by an SQ learner L using q queries of tolerance τ. Then it is also learnable up to error ε in the PAC setting by an α-DP learner L' (in the specific sense just described) with copy complexity Õ(q/(ατ) + q/τ²) (with constant probability).
Note that these notions are different from those of [5], which defined differential privacy for quantum measurements. There, two n-qubit states are considered neighbors if it is possible to reach one from the other by a quantum operation (sometimes called a superoperator) on a single qubit. In particular, two product states ρ = ⊗_i ρ_i and σ = ⊗_i σ_i are neighbors if ρ_i = σ_i for all but one i.

Definition 6.4 (Quantum differential privacy for measurements, [5]). A measurement M is said to be α-DP if for any n-qubit neighbor states ρ, σ and any outcome y, P[M(ρ) = y] ≤ e^α · P[M(σ) = y].
The authors show that this definition can be related to the notion of a "gentle quantum measurement," and this connection can be carefully exploited to perform shadow tomography [2]. However, this kind of quantum DP is not applicable in a natural way to a PAC or SQ learner, since such a learner is an algorithm rather than just a single measurement.

Discussion and open problems
Statistical vs. query complexity. Conceptually, the contrast between our SQ model and the original PAC model of [1] is interesting. Apart from defining an elegant model, Aaronson's main insight was to characterize learnability in a purely statistical sense, showing bounds on sample complexity via an analysis of the so-called fat-shattering dimension of quantum states. In learning-theoretic terms, this took advantage of a separation of concerns that the PAC model encourages: (a) empirical performance, i.e. a learner achieving low error with respect to the training data, and (b) generalization, i.e. this performance actually generalizing to the true distribution. The SQ model, however, does not naturally accommodate such a separation. SQ algorithms are instead primarily characterized by the number of queries required; generalization is built in. The closest analogue to a notion of sample complexity is the role played by the tolerance, and the closest thing to studying generalization on its own might be to show a phase transition in what different regimes of the tolerance are able to accomplish. The formal statements of our SQ lower bounds do have such a flavor: "either use small tolerance or many queries."

Suitable classes and distributions for PAC-learning. It is notable that the algorithms of [41] for learning stabilizer states and [45] for low Schmidt rank states are essentially the only known positive results in the framework of [1]. Both these algorithms are "algebraic" and involve solving a system of polynomial equations, something that SQ cannot handle. A longstanding question in this area is: what other interesting classes can be learned, and under what distributions on measurements? And can they also be learned in the SQ setting?
A major issue in picking suitable distributions on measurements is that under many natural distributions, the maximally mixed state actually performs quite well, so that the learning problem becomes essentially trivial. Even in this work, we obtained lower bounds for learning stabilizer states under the uniform distribution on Pauli measurements only for learning up to exponentially small squared loss. This was because the norms of the p-concepts are themselves exponentially small; in other words, the maximally mixed state already achieves exponentially small loss. We were able to get around this and obtain an Ω(2^n) lower bound via a direct reduction from learning parities (by considering parity measurements). Can we do better than just 2^n? Is there an ω(2^n)-sized (e.g., 4^n or 2^(n²)) subset of stabilizer states such that there exists a distribution over Pauli measurements inducing norms that are only polynomially small yet an exponentially small average correlation? That is, is there an ω(2^n)-sized set of stabilizer states and an accompanying distribution over Pauli measurements such that the maximally mixed state does not do well?
Other forms of noise. Can we extend the noise tolerance of SQ algorithms to more forms of noise, or improve the parameters of the noise tolerated? One such interesting form of noise would be depolarizing noise that acts on individual qubits (as opposed to acting directly on the whole state).
Noise-tolerant learning beyond SQ. The best-known PAC algorithm for learning parities with noise is due to [16] and runs in slightly subexponential time. Interestingly, this means it beats the exponential SQ lower bound and is hence essentially the only known example of a noise-tolerant PAC algorithm that is not SQ (although it cannot handle noise arbitrarily close to the information-theoretic limit). Can we similarly hope for a noise-tolerant but non-SQ learner for stabilizer states that runs in subexponential time?