Two new results about quantum exact learning

We present two new results about exact learning by quantum computers. First, we show how to exactly learn a $k$-Fourier-sparse $n$-bit Boolean function from $O(k^{1.5}(\log k)^2)$ uniform quantum examples for that function. This improves over the bound of $\widetilde{\Theta}(kn)$ uniformly random \emph{classical} examples (Haviv and Regev, CCC'15). Additionally, we provide a possible direction to improve our $\widetilde{O}(k^{1.5})$ upper bound by proving an improvement of Chang's lemma for $k$-Fourier-sparse Boolean functions. Second, we show that if a concept class $\mathcal{C}$ can be exactly learned using $Q$ quantum membership queries, then it can also be learned using $O\left(\frac{Q^2}{\log Q}\log|\mathcal{C}|\right)$ \emph{classical} membership queries. This improves the previous-best simulation result (Servedio and Gortler, SICOMP'04) by a $\log Q$-factor.


Quantum learning theory
Both quantum computing and machine learning are hot topics at the moment, and their intersection has been receiving growing attention in recent years as well. On the one hand there are particular approaches that use quantum algorithms like Grover search [18] and the Harrow-Hassidim-Lloyd linear-systems solver [19] to speed up learning algorithms for specific machine learning tasks (see [1,8,16,28,33] for recent surveys of this line of work). On the other hand there have been a number of more general results about the sample and/or time complexity of learning various concept classes using a quantum computer (see [3] for a survey). This paper presents two new results in the latter line of work. In both cases the goal is to exactly learn an unknown target function with high probability; for the first result our access to the target function is through quantum examples for the function, and for the second result our access is through membership queries to the function. In the early days of quantum computing, Bshouty and Jackson [10] generalized this learning setting by allowing coherent quantum examples. A quantum example for a concept $c$ w.r.t. distribution $D$ is the following $(\log N + 1)$-qubit state:

$\sum_{x} \sqrt{D(x)}\,|x, c(x)\rangle.$
Clearly such a quantum example is at least as useful as a classical example, because measuring this state yields a pair $(x, c(x))$ where $x \sim D$. Bshouty and Jackson gave examples of concept classes that can be learned more efficiently from quantum examples than from classical random examples under specific $D$. In particular, they showed that the concept class of DNF-formulas can be learned in polynomial time from quantum examples under the uniform distribution, something we do not know how to do classically (the best classical upper bound is quasi-polynomial time [32]). The key to this improvement is the ability to obtain, from a uniform quantum example, a sample $S \sim \widehat{c}(S)^2$ distributed according to the squared Fourier coefficients of $c$. This Fourier sampling, originally due to Bernstein and Vazirani [7], is very powerful. For example, if $\mathcal{C}$ is the class of $\mathbb{F}_2$-linear functions on $\{0,1\}^n$, then the unknown target concept $c$ is a character function $\chi_S(x) = (-1)^{x\cdot S}$; its only non-zero Fourier coefficient is $\widehat{c}(S)$, hence one Fourier-sample gives us the unknown $S$ with certainty. In contrast, learning linear functions from classical uniform examples requires $\Theta(n)$ examples. Another example where Fourier sampling is proven powerful is in learning the class of $\ell$-juntas on $n$ bits. Atıcı and Servedio [5] showed that $(\log n)$-juntas can be exactly learned by a quantum learner under the uniform distribution in time polynomial in $n$. Classically it is a long-standing open question whether a similar result holds when the learner is given uniform classical examples (the best known algorithm runs in quasi-polynomial time [23]). These cases (and others surveyed in [3]) show that uniform quantum examples (and in particular Fourier sampling) can be more useful than classical examples.

In this paper we consider the concept class of $n$-bit Boolean functions (with domain $\{0,1\}^n$ and range $\{-1,1\}$) that are $k$-sparse in the Fourier domain: $\widehat{c}(S) \neq 0$ for at most $k$ different $S$'s. This is a natural generalization of the above-mentioned case of learning linear functions, which corresponds to $k = 1$. It also generalizes the case of learning $\ell$-juntas on $n$ bits, which are functions of sparsity $k = 2^{\ell}$. Variants of the class of $k$-Fourier-sparse functions have been well-studied in the area of sparse recovery, where the goal is to recover a $k$-sparse vector $x \in \mathbb{R}^N$ given a low-dimensional linear sketch $Ax$ for a so-called "measurement matrix" $A \in \mathbb{R}^{m\times N}$. See [20,22] for some upper bounds on the size of the measurement matrix that suffice for sparse recovery. Closer to the setting of this paper, there has also been extensive work on learning the concept class of $n$-bit real-valued functions that are $k$-sparse in the Fourier domain. In this direction Cheraghchi et al. [14] showed that $O(nk(\log k)^3)$ uniform examples suffice to learn this concept class, improving upon the works of Bourgain [9], Rudelson and Vershynin [26], and Candès and Tao [11].
In this paper we focus on exactly learning the target concept from uniform examples, with high success probability. So $D(x) = 1/2^n$ for all $x$, $\varepsilon = 0$, and $\delta = 1/3$. Haviv and Regev [21] showed that for classical learners $O(nk\log k)$ uniform examples suffice to learn $k$-Fourier-sparse functions, and $\Omega(nk)$ uniform examples are necessary. In Section 3 we study the number of uniform quantum examples needed to learn $k$-Fourier-sparse Boolean functions, and show that it is upper bounded by $O(k^{1.5}(\log k)^2)$. For $k \ll n^2$ this quantum bound is much better than the number of uniform examples used in the classical case. Proving the upper bound is done in two phases. In the first phase we use the fact that a uniform quantum example allows us to Fourier-sample the target concept and, with some Fourier analysis of $k$-Fourier-sparse functions, we learn the Fourier span using $O(rk)$ examples, where $r$ is the Fourier dimension of the target concept (see Section 2 for the definition of Fourier dimension). In the second phase, we reduce the number of variables to the dimension $r$ of the Fourier support, and then invoke the classical learner of Haviv and Regev to learn the target function from $O(rk\log k)$ classical examples. Since it is known that $r = O(\sqrt{k}\log k)$ [27], the two phases together imply that $O(k^{1.5}(\log k)^2)$ uniform quantum examples suffice to exactly learn the target with high probability. We also prove a (non-matching) lower bound of $\Omega(k\log k)$ uniform quantum examples, using techniques from quantum information theory.
We believe that the sample complexity for Phase 1 of our learning algorithm is actually $\widetilde{O}(k)$. Towards that end, we propose a possible way to improve the sample complexity of Phase 1 to $\widetilde{O}(k)$. The first step in Phase 1 of our algorithm is to obtain an $S \neq 0^n$ such that $\widehat{c}(S) \neq 0$, where $c$ is the $k$-Fourier-sparse target concept. It follows from Chang's lemma [13], a central result in additive combinatorics, that in expectation $O(k\sqrt{\log k}/\sqrt{r})$ Fourier-samples are sufficient to obtain one such $S$. In Section 3.3 we present an improvement of Chang's lemma for the case of $k$-Fourier-sparse Boolean functions. Using this improvement we can show that in expectation $O((k\log k)/r)$ Fourier-samples are sufficient to obtain an $S \neq 0^n$ such that $\widehat{c}(S) \neq 0$. We conjecture (Conjecture 1) a generalization of our improvement of Chang's lemma which, if true, would imply that Phase 1 of our algorithm can be done with an expected number of $\widetilde{O}(k)$ samples. Our improvement of Chang's lemma and the techniques used therein might be of independent interest.

Exact learning from quantum membership queries
Our second result is in a model of active learning. The learner still wants to exactly learn an unknown target concept $c : [N] \to \{-1,1\}$ from a known concept class $\mathcal{C}$, but now the learner can choose which points of the truth-table of the target it sees, rather than those points being chosen randomly. More precisely, the learner can query $c(x)$ for any $x$ of its choice. This is called a membership query. Quantum algorithms have the corresponding unitary query operation available, which for a $\pm 1$-valued concept maps $|x, b\rangle \mapsto |x, b\cdot c(x)\rangle$ (equivalently, a phase query $|x\rangle \mapsto c(x)|x\rangle$). For some concept classes, quantum membership queries can be much more useful than classical ones. Consider again the class $\mathcal{C}$ of $\mathbb{F}_2$-linear functions on $\{0,1\}^n$. Using one query to a uniform superposition over all $x$ and doing a Hadamard transform, we can Fourier-sample and hence learn the target concept exactly. In contrast, $\Theta(n)$ classical membership queries are necessary and sufficient for classical learners. As another example, consider the concept class of point functions on $[N]$, where each concept takes the value $-1$ on exactly one point of the domain and $1$ elsewhere. Elements from this class can be learned using $O(\sqrt{N})$ quantum membership queries by Grover's algorithm, while every classical algorithm needs to make $\Omega(N)$ membership queries.
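As a concrete toy illustration of this quantum speedup (a minimal numpy simulation of ours, not part of the paper): for a linear target $c(x) = (-1)^{S\cdot x}$, one phase query to the uniform superposition followed by Hadamard transforms puts all amplitude on $|S\rangle$, so a single measurement reveals $S$.

```python
import numpy as np
from itertools import product

n = 4
rng = np.random.default_rng(1)
S = rng.integers(0, 2, n)                      # hidden string defining c(x) = (-1)^{S.x}

xs = np.array(list(product([0, 1], repeat=n)))  # all 2^n inputs, row = bits of x
c = (-1.0) ** (xs @ S % 2)                      # truth table of the linear concept

# One quantum membership query, used as a phase oracle on the uniform superposition.
state = np.full(2 ** n, 1 / np.sqrt(2 ** n))    # uniform superposition over |x>
state = c * state                               # phase query: |x> -> c(x)|x>

# Hadamard transform on all n qubits.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Hn = np.array([[1.0]])
for _ in range(n):
    Hn = np.kron(Hn, H)
state = Hn @ state

probs = state ** 2                              # measurement distribution
recovered = xs[np.argmax(probs)]
print("hidden S:   ", S)
print("recovered S:", recovered, " with probability", round(probs.max(), 6))
```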
For a given concept class $\mathcal{C}$ of $\pm 1$-valued functions on $[N]$, let $D(\mathcal{C})$ denote the minimal number of classical membership queries needed for learners that can exactly identify every $c \in \mathcal{C}$ with success probability 1 (such learners are deterministic without loss of generality). Let $R(\mathcal{C})$ and $Q(\mathcal{C})$ denote the minimal number of classical and quantum membership queries, respectively, needed for learners that can exactly identify every $c \in \mathcal{C}$ with error probability $\le 1/3$. Servedio and Gortler [29] showed that these quantum and classical measures cannot be too far apart. First, using an information-theoretic argument they showed
$Q(\mathcal{C}) \ge \Omega\!\left(\frac{\log|\mathcal{C}|}{\log N}\right).$
Intuitively, this holds because a learner recovers roughly $\log|\mathcal{C}|$ bits of information, while every quantum membership query can give at most $O(\log N)$ bits of information. Note that this is tight for the class of linear functions, where the left- and right-hand sides are both constant. Second, using the so-called hybrid method they showed
$Q(\mathcal{C}) \ge \Omega\!\left(\sqrt{1/\gamma(\mathcal{C})}\right)$
for some combinatorial parameter $\gamma(\mathcal{C})$ that we will not define here (but which is $1/N$ for the class $\mathcal{C}$ of point functions, hence this inequality is tight for that $\mathcal{C}$). They also noted the following upper bound:
$D(\mathcal{C}) \le O\!\left(\frac{\log|\mathcal{C}|}{\gamma(\mathcal{C})}\right).$
Combining these three inequalities yields the following relation between $D(\mathcal{C})$ and $Q(\mathcal{C})$:
$D(\mathcal{C}) \le O\!\left(Q(\mathcal{C})^2\log|\mathcal{C}|\right) \le O\!\left(Q(\mathcal{C})^3\log N\right). \qquad (1)$
This shows that, up to a $\log N$-factor, quantum and classical membership query complexities of exact learning are polynomially close. While each of the three inequalities that together imply (1) can be individually tight (for different $\mathcal{C}$), this does not imply that (1) itself is tight. Note that Eq. (1) upper bounds the membership query complexity of deterministic classical learners. We are not aware of a stronger upper bound for bounded-error classical learners. However, in Section 4 we tighten that bound further by a $\log Q(\mathcal{C})$-factor:
$R(\mathcal{C}) \le O\!\left(\frac{Q(\mathcal{C})^2}{\log Q(\mathcal{C})}\log|\mathcal{C}|\right). \qquad (2)$
This inequality is tight both for the class of linear functions and the class of point functions.
Our proof combines the quantum adversary method [2,6,30] with an entropic argument to show that we can always find a query whose outcome (no matter whether it is 1 or $-1$) will shrink the concept class by a factor $\le 1 - \frac{\log Q(\mathcal{C})}{Q(\mathcal{C})^2}$. While our improvement over the earlier bounds is not very large, we feel our usage of entropy to save a log-factor is new and may have applications elsewhere.

Preliminaries
Notation. Let $[n] := \{1,\ldots,n\}$, and let $e_1,\ldots,e_n$ denote the standard basis vectors of $\mathbb{F}_2^n$. For a Boolean function $f : \{0,1\}^n \to \{-1,1\}$ and an invertible matrix $B \in \mathbb{F}_2^{n\times n}$, let $f^B : \{0,1\}^n \to \{-1,1\}$ denote the function defined by $f^B(x) := f((B^{-1})^T x)$, where the matrix-vector product is over $\mathbb{F}_2$. Throughout this paper, the rank of a matrix $B \in \mathbb{F}_2^{n\times n}$ will be taken over $\mathbb{F}_2$. Let $B_1, \ldots, B_n$ be the columns of $B$.
Fourier analysis on the Boolean cube. We introduce the basics of Fourier analysis here, referring to [25,34] for more. Define the inner product between functions $f, g : \{0,1\}^n \to \mathbb{R}$ as $\langle f, g\rangle := \mathbb{E}_x[f(x)\,g(x)]$, where the expectation is uniform over all $x \in \{0,1\}^n$. For $S \in \{0,1\}^n$, the character function corresponding to $S$ is given by $\chi_S(x) := (-1)^{S\cdot x}$. For every $j \in [n]$, we use the notation $\chi_j$ to denote the function $\chi_{\{j\}}$. Observe that the set of functions $\{\chi_S\}_{S\in\{0,1\}^n}$ forms an orthonormal basis for the space of real-valued functions over the Boolean cube. Hence every $f : \{0,1\}^n \to \mathbb{R}$ can be written uniquely as $f = \sum_{S\in\{0,1\}^n}\widehat{f}(S)\chi_S$, where $\widehat{f}(S) := \langle f, \chi_S\rangle$ is the Fourier coefficient of $f$ at $S$. The Fourier support of $f$ is $\mathrm{supp}(\widehat{f}\,) := \{S : \widehat{f}(S) \neq 0\}$, the Fourier sparsity of $f$ is $|\mathrm{supp}(\widehat{f}\,)|$, and the Fourier dimension of $f$, denoted $\mathrm{Fdim}(f)$, is the dimension of the span of $\mathrm{supp}(\widehat{f}\,)$ as a subspace of $\mathbb{F}_2^n$. We now state a number of known structural results about Fourier coefficients and dimension.
Lemma 2. Let $f : \{0,1\}^n \to \mathbb{R}$ and let $B \in \mathbb{F}_2^{n\times n}$ be invertible. Then $\widehat{f^B}(Q) = \widehat{f}(BQ)$ for every $Q \in \{0,1\}^n$.

Proof. Write out the Fourier expansion of $f^B$:
$f^B(x) = \sum_{S\in\{0,1\}^n}\widehat{f}(S)\,\chi_S\big((B^{-1})^T x\big) = \sum_{S\in\{0,1\}^n}\widehat{f}(S)\,(-1)^{\langle B^{-1}S,\,x\rangle} = \sum_{Q\in\{0,1\}^n}\widehat{f}(BQ)\,\chi_Q(x),$
where the second equality used $\langle S, (B^{-1})^T x\rangle = \langle B^{-1}S, x\rangle$ and the last used the substitution $S = BQ$. Hence $\widehat{f^B}(Q) = \widehat{f}(BQ)$.
The following lemma (Lemma 3) easily follows by applying Lemma 2 with an invertible linear map $B$ that maps $e_i$ to $B_i$, for every $i \in [r]$.

Lemma 3. For every $f : \{0,1\}^n \to \{-1,1\}$ with Fourier dimension $r$, there exists an invertible matrix $B \in \mathbb{F}_2^{n\times n}$ such that $f^B$ depends only on the variables $x_1,\ldots,x_r$.
Here is the well-known fact, already mentioned in the introduction, that one can Fourier-sample from uniform quantum examples.

Lemma 4. There exists a procedure that uses one uniform quantum example and satisfies the following: with probability 1/2 it outputs an $S$ drawn from the distribution $\{\widehat{f}(S)^2\}_{S\in\{0,1\}^n}$, otherwise it rejects.
Proof. Using a uniform quantum example $\frac{1}{\sqrt{2^n}}\sum_{x\in\{0,1\}^n}|x, f(x)\rangle$, first convert the last register from $f(x) \in \{-1,1\}$ to the bit $(1 - f(x))/2 \in \{0,1\}$ unitarily, then apply the Hadamard transform to the last qubit and measure it. With probability 1/2 we obtain the outcome 0, in which case our procedure rejects. Otherwise the remaining state is $\frac{1}{\sqrt{2^n}}\sum_x f(x)|x\rangle$. Apply Hadamard transforms to all $n$ qubits to obtain $\sum_S \widehat{f}(S)|S\rangle$. Measuring this quantum state gives an $S$ with probability $\widehat{f}(S)^2$.
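To make the Fourier-sampling distribution concrete, here is a small classical simulation (our own illustration, not part of the paper): it computes the Fourier coefficients of a toy Boolean function by brute force and then samples $S$ with probability $\widehat{f}(S)^2$.

```python
import numpy as np
from itertools import product
from collections import Counter

n = 3
xs = list(product([0, 1], repeat=n))

# Example target: the 2-junta f(x) = (-1)^(x1 AND x2), which is 4-Fourier-sparse.
f = np.array([(-1.0) ** (x[0] & x[1]) for x in xs])

# Fourier coefficients: f_hat(S) = E_x[f(x) * (-1)^(S.x)].
def fourier_coefficients(f, n):
    coeffs = {}
    for S in product([0, 1], repeat=n):
        chi = np.array([(-1.0) ** (sum(s * x for s, x in zip(S, xvec)) % 2) for xvec in xs])
        coeffs[S] = float(np.mean(f * chi))
    return coeffs

fhat = fourier_coefficients(f, n)
support = {S: c for S, c in fhat.items() if abs(c) > 1e-9}
print("Fourier support (sparsity %d):" % len(support), support)

# Fourier-sampling: draw S with probability f_hat(S)^2 (these squares sum to 1 by Parseval).
rng = np.random.default_rng(0)
Ss = list(fhat.keys())
p = np.array([fhat[S] ** 2 for S in Ss])
samples = [Ss[i] for i in rng.choice(len(Ss), size=1000, p=p)]
print(Counter(samples).most_common())
```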
Information theory. We refer to [15] for a comprehensive introduction to classical information theory, and here just remind the reader of the basic definitions. The entropy of a probability distribution $P$ is $H(P) := -\sum_i P(i)\log P(i)$; for a random variable $A$ we write $H(A)$ for the entropy of its distribution, $H(A \mid B)$ for the conditional entropy of $A$ given $B$, and $I(A : B) := H(A) - H(A \mid B)$ for the mutual information between $A$ and $B$. For $p \in [0,1]$, $H(p)$ denotes the binary entropy $-p\log p - (1-p)\log(1-p)$. If $\rho$ is a density matrix (i.e., a trace-1 positive semi-definite matrix), then its singular values form a probability distribution $P$, and the von Neumann entropy of $\rho$ is $S(\rho) := H(P)$. We refer to [24, Part III] for a more extensive introduction to quantum information theory.

Exact learning of k-Fourier-sparse functions
In this section we consider exactly learning the concept class $\mathcal{C}$ of $k$-Fourier-sparse Boolean functions:
$\mathcal{C} := \{c : \{0,1\}^n \to \{-1,1\} \mid |\mathrm{supp}(\widehat{c}\,)| \le k\}.$
The goal is to exactly learn $c \in \mathcal{C}$ given uniform examples from $c$ of the form $(x, c(x))$, where $x$ is drawn from the uniform distribution on $\{0,1\}^n$. Haviv and Regev [21] considered learning this concept class and showed the following results.
Theorem 2 (Corollary 3.6 of [21]). For every $n > 0$ and $k \le 2^n$, the number of uniform examples that suffice to learn $\mathcal{C}$ with constant success probability is $O(nk\log k)$.

Theorem 3 (Theorem 3.7 of [21]). For every $n > 0$ and $k \le 2^n$, the number of uniform examples necessary to learn $\mathcal{C}$ with constant success probability is $\Omega(k(n - \log k))$.
Our main results in this section are about the number of uniform quantum examples that are necessary and sufficient to exactly learn the class $\mathcal{C}$ of $k$-Fourier-sparse functions. A uniform quantum example for a concept $c \in \mathcal{C}$ is the quantum state
$\frac{1}{\sqrt{2^n}}\sum_{x\in\{0,1\}^n}|x, c(x)\rangle.$
Our first theorem of this section (Section 3.1) gives an upper bound on the number of uniform quantum examples that are sufficient to learn $\mathcal{C}$ by giving a learning algorithm.

Theorem 4. For every $n > 0$ and $k \le 2^n$, $O(k^{1.5}(\log k)^2)$ uniform quantum examples suffice to learn $\mathcal{C}$ with constant success probability.

The learning algorithm has two phases: Phase 1 is described in Section 3.1.1 and Phase 2 is discussed in Section 3.1.2.
In the theorem below (Section 3.2) we prove the following (non-matching) lower bound on the number of uniform quantum examples necessary to learn C.
Theorem 5. For every $n > 0$, constant $c \in (0,1)$ and $k \le 2^{cn}$, the number of uniform quantum examples necessary to learn $\mathcal{C}$ with constant success probability is $\Omega(k\log k)$.
In Section 3.3 we give a possible direction to prove an improved sample complexity for Phase 1 of our learning algorithm.

Upper bound on learning k-Fourier-sparse Boolean functions
We split our quantum learning algorithm into two phases. Suppose $c \in \mathcal{C}$ is the unknown concept, with Fourier dimension $r$. In the first phase the learner uses samples from the distribution $\{\widehat{c}(S)^2\}_{S\in\{0,1\}^n}$ (Fourier-samples) to learn the Fourier span of $c$; in the second phase it learns $c$ itself, restricted to that span, from classical examples obtained by measuring quantum examples. Theorem 6 below shows that $O(rk\log k)$ uniform quantum examples suffice for the two phases together. The learner may not know the exact Fourier dimension $r$ in advance, but Theorem 1 gives an upper bound $r = O(\sqrt{k}\log k)$, so our Theorem 4 follows immediately from Theorem 6.
Before we prove Theorem 6, we first give a "trivial" algorithm for learning the Fourier support of Fourier-sparse functions quantumly. Gopalan et al. [17] showed that every $k$-Fourier-sparse Boolean function is "$2^{-\log k}$-granular", i.e., every Fourier coefficient of a $k$-Fourier-sparse Boolean function $c$ is either 0 or an integer multiple of $2^{-\log k}$. Using this observation, if one is allowed to Fourier-sample from $c$, then each $S$ with non-zero $\widehat{c}(S)$ will be observed with probability $\Omega(1/k^2)$, and using a coupon-collector argument, we obtain the entire Fourier support using $O(k^2\log k)$ Fourier-samples. Our main contribution in Theorem 6 is to use the Fourier dimension in order to improve this trivial quantum algorithm. In particular, observe that for functions with Fourier dimension $\log k$ (such as $(\log k)$-juntas), the bound of Theorem 6 scales as $O(k\log^2 k)$, which is better than the trivial algorithm by a factor of nearly $k$.

Phase 1: Learning the Fourier span
In this phase of the algorithm our goal is to learn the $r$-dimensional Fourier span of the $k$-Fourier-sparse target concept $c$, using $O(rk)$ Fourier-samples. The algorithm is very simple: Fourier-sample more and more $S$'s and keep track of their span; stop when we reach dimension $r$. The key is the following technical lemma, which says that if our current span $V'$ does not yet equal the full Fourier span $V$, then there is significant Fourier weight outside of $V'$. This implies that a small expected number of additional Fourier-samples will give us an $S \in V\setminus V'$, which will grow our current span. After $r$ such grow-steps we have learned the full Fourier span.

Lemma 5. Let $c : \{0,1\}^n \to \{-1,1\}$ be $k$-Fourier-sparse with Fourier span $V$ of dimension $r$, and let $V' \subsetneq V$ be a subspace with $\dim(V') < r$. Then $\sum_{S\in V\setminus V'}\widehat{c}(S)^2 \ge 1/k$.

Proof. Let us assume the worst case, which is that $\dim(V') = r - 1$. Because we can do an invertible linear transformation on $c$ as in Lemma 2, we may assume without loss of generality that the one "missing" dimension corresponds to the variable $x_r$ (i.e., $V = \mathrm{span}(V' \cup \{e_r\})$). Let $g$ be the (not necessarily Boolean-valued) part of $c$ with Fourier coefficients in $V'$:
$g := \sum_{S\in V'}\widehat{c}(S)\chi_S.$
Suppose, towards a contradiction, that the Fourier weight $W := \sum_{S\in V\setminus V'}\widehat{c}(S)^2$ is $< 1/k$. This implies that $c$ and $g$ have the same sign on every $x \in \{0,1\}^n$, as follows (using Cauchy-Schwarz): for every $x$,
$|c(x) - g(x)| = \Big|\sum_{S\in V\setminus V'}\widehat{c}(S)\chi_S(x)\Big| \le \sum_{S\in V\setminus V'}|\widehat{c}(S)| \le \sqrt{k\sum_{S\in V\setminus V'}\widehat{c}(S)^2} = \sqrt{kW} < 1,$
and since $|c(x)| = 1$, the sign of $g(x)$ equals $c(x)$. Since $c$ depends on the variable $x_r$, there exists an $x \in \{0,1\}^n$ where $x_r$ is influential, i.e., $c(x) \neq c(x\oplus e_r)$. But $g$ is independent of $x_r$, which implies $c(x) = \mathrm{sign}(g(x)) = \mathrm{sign}(g(x\oplus e_r)) = c(x\oplus e_r)$, a contradiction. Hence $W \ge 1/k$.
We now conclude Phase 1 by presenting a quantum learning algorithm that learns the Fourier span of an unknown $r$-dimensional $c \in \mathcal{C}$, given uniform quantum examples for $c$. This quantum learner can actually run forever, but if we know the Fourier dimension $r$ of $c$, or an upper bound $r'$ on the actual Fourier dimension (e.g., by Theorem 1), then we can stop the learner after processing $6r'k$ examples; now, by Markov's inequality, with probability $\ge 2/3$ the last subspace will be the Fourier span of $c$. Formally, we show that an expected number of at most $2rk$ uniform quantum examples suffices to learn the Fourier span.
Proof. In order to learn the Fourier span of $c$, the quantum learner simply takes Fourier-samples until they span an $r$-dimensional space. Since we can generate a Fourier-sample from an expected number of 2 uniform quantum examples (by Lemma 4), the expected number of uniform quantum examples needed is at most twice the expected number of Fourier-samples. If our current sequence of Fourier-samples spans an $r'$-dimensional space $V'$, with $r' < r$, then Lemma 5 implies that the next Fourier-sample has probability at least $1/k$ of yielding an $S \notin V'$. Hence an expected number of at most $k$ Fourier-samples suffices to grow the dimension of $V'$ by at least 1. Since we stop at dimension $r$, the overall expected number of Fourier-samples is at most $rk$, and hence the expected number of uniform quantum examples is at most $2rk$.
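The following Python sketch (ours; the helper names `rank_gf2`, `learn_fourier_span` and `fourier_sample` are hypothetical) simulates Phase 1 classically: it repeatedly draws Fourier-samples and keeps those that grow the span over $\mathbb{F}_2$, stopping once the span reaches the target dimension $r$.

```python
import numpy as np

def rank_gf2(rows):
    """Rank over F_2 of a list of 0/1 vectors, by Gaussian elimination."""
    M = [r.copy() for r in rows]
    rank, n = 0, len(rows[0]) if rows else 0
    for col in range(n):
        pivot = next((i for i in range(rank, len(M)) if M[i][col]), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for i in range(len(M)):
            if i != rank and M[i][col]:
                M[i] = (M[i] + M[rank]) % 2
        rank += 1
    return rank

def learn_fourier_span(fourier_sample, r):
    """Phase 1: grow a basis of the Fourier span from Fourier-samples."""
    basis, used = [], 0
    while rank_gf2(basis) < r:
        S = fourier_sample()           # one sample S ~ f_hat(S)^2
        used += 1
        if rank_gf2(basis + [S]) > rank_gf2(basis):
            basis.append(S)            # S lies outside the current span: keep it
    return basis, used

# Toy target: f(x) = (-1)^(x0 AND x1) on n = 6 bits (Fourier dimension r = 2).
n, r = 6, 2
rng = np.random.default_rng(2)
support = [np.zeros(n, int),
           np.eye(n, dtype=int)[0],
           np.eye(n, dtype=int)[1],
           (np.eye(n, dtype=int)[0] + np.eye(n, dtype=int)[1])]
weights = np.array([0.25, 0.25, 0.25, 0.25])   # squared Fourier coefficients

def fourier_sample():
    return support[rng.choice(len(support), p=weights)]

basis, used = learn_fourier_span(fourier_sample, r)
print("Fourier-samples used:", used)
print("basis of the Fourier span:\n", np.array(basis))
```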

Phase 2: Learning the function completely
In the above Phase 1, the quantum learner obtains the Fourier span of $c$, which we will denote by $T$; let $B \in \mathbb{F}_2^{n\times n}$ be an invertible matrix whose first $r$ columns $B_1,\ldots,B_r$ span $T$. Using this, the learner can restrict to the concept class of $k$-Fourier-sparse functions on $r$ bits: there is a $c' : \{0,1\}^r \to \{-1,1\}$ of Fourier sparsity at most $k$ such that $c(x) = c'(z)$, where $z$ consists of the first $r$ bits of $B^T x$. Measuring a uniform quantum example yields a uniform classical example $(x, c(x))$ for $x \in \{0,1\}^n$; the quantum learner knows $B$ and hence can obtain a uniform example $(z, c'(z))$ for $c'$ by letting $z$ be the first $r$ bits of $B^T x$ and $c'(z) = c(x)$. Invoking the classical learner of Theorem 2 (with $n$ replaced by $r$), $O(rk\log k)$ such examples suffice to learn $c'$, and hence $c$, with high probability.
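A minimal sketch of this Phase-2 reduction (our own illustration; `B` is assumed to be an invertible matrix whose first $r$ columns span $T$): each uniform $n$-bit example $(x, c(x))$ is converted into a uniform $r$-bit example $(z, c'(z))$ by taking the first $r$ bits of $B^T x$ over $\mathbb{F}_2$.

```python
import numpy as np

def restrict_example(x, label, B, r):
    """Map a uniform n-bit example (x, c(x)) to an r-bit example (z, c'(z)),
    where z consists of the first r bits of B^T x over F_2."""
    z = (B.T @ x) % 2
    return z[:r], label

# Toy setup: n = 5, Fourier span T spanned by the first two columns of B.
n, r = 5, 2
B = np.eye(n, dtype=int)
B[:, 0] = [1, 1, 0, 0, 0]      # B_1
B[:, 1] = [0, 1, 1, 0, 0]      # B_2  (B stays invertible over F_2)

rng = np.random.default_rng(3)
x = rng.integers(0, 2, n)
label = 1                      # placeholder for c(x); in the algorithm this comes from a measured quantum example
z, lab = restrict_example(x, label, B, r)
print("x =", x, " ->  z =", z, " label:", lab)
```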

Lower bound on learning k-Fourier-sparse Boolean functions
In this section we show that $\Omega(k\log k)$ uniform quantum examples are necessary to learn the concept class of $k$-Fourier-sparse Boolean functions.

Proof. Assume for simplicity that $k$ is a power of 2, so $\log k$ is an integer. We prove the lower bound for the following concept class, which was also used for the classical lower bound of Haviv and Regev [21]: let $\mathcal{V}$ be the set of distinct subspaces in $\{0,1\}^n$ of dimension $n - \log k$, and let
$\mathcal{C} := \{c_V : V \in \mathcal{V}\}, \quad\text{where } c_V(x) := -1 \text{ if } x \in V \text{ and } c_V(x) := 1 \text{ otherwise}.$
Note that every function in $\mathcal{C}$ has Fourier sparsity at most $k$, $|\mathcal{C}| = |\mathcal{V}|$, and each $c_V \in \mathcal{C}$ evaluates to 1 on a $(1 - 1/k)$-fraction of its domain.
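As a quick sanity check of this construction (our own illustration, using the concrete subspace $V = \{x : \text{the last } \log k \text{ coordinates of } x \text{ are } 0\}$), the snippet below builds $c_V$ and verifies by brute force that it takes the value $-1$ on a $1/k$-fraction of inputs and has Fourier sparsity at most $k$.

```python
import numpy as np
from itertools import product

n, k = 6, 8                       # k = 2^3, so the subspace V has dimension n - log k = 3
logk = int(np.log2(k))

xs = np.array(list(product([0, 1], repeat=n)))

# Take V = {x : the last log k coordinates of x are 0}, a subspace of dimension n - log k.
in_V = np.all(xs[:, n - logk:] == 0, axis=1)
cV = np.where(in_V, -1.0, 1.0)    # c_V(x) = -1 on V, +1 elsewhere

# Fourier coefficients via explicit characters; count the non-zero ones.
sparsity = 0
for S in product([0, 1], repeat=n):
    chi = (-1.0) ** (xs @ np.array(S) % 2)
    if abs(np.mean(cV * chi)) > 1e-9:
        sparsity += 1

print("fraction of inputs where c_V = -1:", in_V.mean(), "(equals 1/k =", 1 / k, ")")
print("Fourier sparsity of c_V:", sparsity, " (at most k =", k, ")")
```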
We prove the lower bound for $\mathcal{C}$ using a three-step information-theoretic technique. A similar approach was used in proving classical and quantum PAC learning lower bounds in [4]. Let $A$ be a random variable that is uniformly distributed over $\mathcal{C}$. Suppose $A = c_V$, and let $B = B_1\ldots B_T$ be $T$ copies of the quantum example for $c_V$. The random variable $B$ is a function of the random variable $A$. The following upper and lower bounds on $I(A : B)$ are similar to [4, proof of Theorem 12] and we omit the details of the first two steps here.

1. $I(A : B) \ge \Omega(\log|\mathcal{V}|)$, because $B$ allows one to recover $A$ with high probability.

2. $I(A : B) \le T\cdot I(A : B_1)$, using a chain rule for mutual information.

3. $I(A : B_1) \le O(n/k)$: intuitively, since each $c_V$ evaluates to 1 on a $(1 - 1/k)$-fraction of its domain, a single quantum example for $c_V$ is close to the concept-independent state $\frac{1}{\sqrt{2^n}}\sum_x|x, 1\rangle$ and hence carries little information about $V$.

Combining these three steps gives $T = \Omega(k(\log|\mathcal{V}|)/n)$. It remains to lower bound $\log|\mathcal{V}|$.

Claim 1. The number of distinct $d$-dimensional subspaces of $\mathbb{F}_2^n$ is at least $2^{\Omega((n-d)d)}$.
Proof. We can specify a $d$-dimensional subspace by giving $d$ linearly independent vectors in it. The number of distinct sequences of $d$ linearly independent vectors is exactly $(2^n - 1)(2^n - 2)(2^n - 4)\cdots(2^n - 2^{d-1})$, because once we have the first $t$ linearly independent vectors, with span $S_t$, there are $2^n - 2^t$ vectors that do not lie in $S_t$. However, we are double-counting certain subspaces in the argument above, since there will be multiple sequences of vectors yielding the same subspace. The number of sequences yielding a fixed $d$-dimensional subspace can be counted in a similar manner as above, and equals $(2^d - 1)(2^d - 2)(2^d - 4)\cdots(2^d - 2^{d-1})$. So the total number of subspaces is
$\prod_{t=0}^{d-1}\frac{2^n - 2^t}{2^d - 2^t} \ge \prod_{t=0}^{d-1}2^{n-d} = 2^{(n-d)d}.$
Combining this claim (with $d = n - \log k$) and $T = \Omega(k(\log|\mathcal{V}|)/n)$ gives $T = \Omega(k\log k)$.
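For small parameters the count in Claim 1 can be checked by brute force; the snippet below (ours, not from the paper) enumerates all $d$-dimensional subspaces of $\mathbb{F}_2^n$ and compares the result with the exact formula and the lower bound $2^{(n-d)d}$.

```python
import numpy as np
from itertools import product, combinations

n, d = 4, 2

def span_gf2(vecs):
    """The F_2-span of a tuple of 0/1 vectors, as a frozenset of 0/1 tuples."""
    vecs = np.array(vecs)
    combos = np.array(list(product([0, 1], repeat=len(vecs))))
    return frozenset(tuple(v) for v in combos @ vecs % 2)

nonzero = [v for v in product([0, 1], repeat=n) if any(v)]

# Enumerate all d-dimensional subspaces: spans of d-tuples whose span has size 2^d.
subspaces = set()
for vecs in combinations(nonzero, d):
    sp = span_gf2(vecs)
    if len(sp) == 2 ** d:            # the d chosen vectors were linearly independent
        subspaces.add(sp)

exact = np.prod([(2 ** n - 2 ** t) / (2 ** d - 2 ** t) for t in range(d)])
print("subspaces found by brute force:", len(subspaces))
print("formula prod (2^n - 2^t)/(2^d - 2^t):", exact)
print("lower bound 2^((n-d)d):", 2 ** ((n - d) * d))
```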

A potential direction to prove an improved sample complexity for Phase 1
In this section we give a potential direction to prove that in expectation $\widetilde{O}(k)$ Fourier-samples are sufficient for Phase 1 of our learning algorithm presented in Section 3.
Thus an expected number of $O((k\sqrt{\log k})/\sqrt{r})$ Fourier-samples is sufficient to obtain an $S \in \mathrm{supp}(\widehat{c}\,)$ such that $S \neq 0^n$ in Phase 1. This is already an improvement over what Lemma 5 guaranteed.
In this section we give an improvement of Chang's lemma for $k$-Fourier-sparse Boolean functions (Theorem 9 below). We remark that in a follow-up paper [12], a subset of the authors gave a refinement of Theorem 9.
Before giving a proof of Theorem 9, let us first discuss how this theorem improves the analysis of Phase 1 of our learning algorithm. Theorem 9 implies that for a $k$-Fourier-sparse Boolean function $c : \{0,1\}^n \to \{-1,1\}$ of Fourier dimension $r$,
$\sum_{S\neq 0^n}\widehat{c}(S)^2 \ge \Omega\!\left(\frac{r}{k\log k}\right).$
This is a better lower bound on the Fourier weight of $c$ on the set $\{0,1\}^n\setminus\{0^n\}$ than the one obtained from Chang's lemma (Equation 3). Thus an expected number of $O((k\log k)/r)$ Fourier-samples is sufficient to obtain an $S \in \mathrm{supp}(\widehat{c}\,)$ such that $S \neq 0^n$.
We suspect that Theorem 9 can in fact lead to an $\widetilde{O}(k)$ learning algorithm for Phase 1. Towards that end we make the following conjecture, which can be viewed as a generalization of Theorem 9.
If the above conjecture is true then it would imply an $\widetilde{O}(k)$ learning algorithm for Phase 1. Let $c : \{0,1\}^n \to \{-1,1\}$ be a $k$-Fourier-sparse function of Fourier dimension $r$. Assuming Conjecture 1 to be true, we have that whenever the span $V'$ of the Fourier-samples obtained so far has dimension $r' < r$, the Fourier weight outside $V'$ is at least $\Omega\!\left(\frac{r - r'}{k\log k}\right)$. So the expected number of samples to increase the dimension by 1 is $\le \frac{k\log k}{r - r'}$. Accordingly, the expected number of Fourier-samples needed to learn the whole Fourier span of $c$ is at most
$\sum_{r'=0}^{r-1}\frac{k\log k}{r - r'} = k\log k\sum_{i=1}^{r}\frac{1}{i} = O(k\log k\cdot\log r),$
where the final inequality used $\sum_{i=1}^{r}\frac{1}{i} = O(\log r)$. Since $r = O(\sqrt{k}\log k)$, this is $\widetilde{O}(k)$. We now proceed to the proof of Theorem 9.

Proof of Theorem 9
We first define the following notation. For $U \subseteq [r]$, let $f^{(U)}$ be the function obtained by fixing the variables $\{x_i\}_{i\in U}$ in $f$ to $x_i = (1 + \mathrm{sign}(\widehat{f}(i)))/2$ for all $i \in U$. Note that fixing variables cannot increase Fourier sparsity. For singletons and pairs we abbreviate $f^{(i)} := f^{(\{i\})}$ and $f^{(i,j)} := f^{(\{i,j\})}$. In this proof, for an invertible matrix $B \in \mathbb{F}_2^{n\times n}$, we will often treat its columns as a basis for the space $\mathbb{F}_2^n$, and we let $f^{B,(i)}$ be the function obtained by fixing $x_i = (1 + \mathrm{sign}(\widehat{f^B}(i)))/2$ in the function $f^B$. The core idea in the proof of the theorem is the following structural lemma (Lemma 7), which says that there is a particular $x_i$ that we can fix in the function $f^B$ without decreasing the Fourier dimension very much. We defer the proof of the lemma to later and first conclude the proof of the theorem assuming the lemma. Consider the matrix $B$ defined in Lemma 7. Using Lemma 3 it follows that $f^B$ has only $r$ influential variables, so we can write it as a function on $\{0,1\}^r$. Also, $\widehat{f^B}(0^r) = \widehat{f}(0^n) = 1 - 2\alpha$, where $\alpha := \Pr_x[f(x) = -1]$. For convenience, we abuse notation and abbreviate $f = f^B$. It remains to show the claimed bound for every such $f : \{0,1\}^r \to \{-1,1\}$. We prove this by induction on $r$.

Induction step. Let $i \in [r]$ be the index from Lemma 7. Note that $f^{(i)}$ is still $k$-Fourier-sparse and $\widehat{f^{(i)}}(0^{r-1}) = 1 - 2\alpha + |\widehat{f}(i)|$. Since $|\widehat{f}(i)| \ge 1/k$ (by Lemma 1), we have $\widehat{f^{(i)}}(0^{r-1}) \ge 1 - 2\alpha + 1/k$. Since $r - \log k \le \mathrm{Fdim}(f^{(i)}) \le r - 1$, we can use the induction hypothesis on the function $f^{(i)}$ to conclude the claimed bound for $f$. This concludes the proof of the induction step and the theorem. We now prove Lemma 7.
Proof of Lemma 7. In order to construct B as in the lemma statement, we first make the following observation.
In particular, the relevant Fourier coefficients of $f^B$ are non-zero. We defer the proof of this observation to the end, and proceed to prove the lemma assuming the observation. Note that Property 3 gives the following simple corollary: $(f^B)^{(1)}$ is a function of $x_{t+1},\ldots,x_r$ and independent of $x_2,\ldots,x_t$ (and hence $(f^B)^{(1)} = (f^B)^{(\{1,i\})}$ for every $i \in \{2,\ldots,t\}$).
Proof. Before proving the claim we first make the following observation. Consider an assignment $(x_1, x_{t+1},\ldots,x_r) = z$ in $f^B$ and assume that the resulting function $f^{B,z}$ is independent of $x_i$ for some $i \in \{2,\ldots,t\}$. Assign $x_i = (1 + \mathrm{sign}(\widehat{f^B}(i)))/2$ in $f^{B,z}$ and call the resulting function $f^{B,z,(i)}$. Note that $f^{B,z,(i)}$ could alternatively have been obtained by first fixing $x_i = (1 + \mathrm{sign}(\widehat{f^B}(i)))/2$ in $f^B$ and then fixing $(x_1, x_{t+1},\ldots,x_r) = z$. In this case, by Claim 2, after the fixing of $x_2,\ldots,x_t$ and after fixing $(x_1, x_{t+1},\ldots,x_r) = z$, $f^{B,z}$ is a constant. This in particular shows that if there exists a $z$ such that $f^{B,z}$ is independent of $x_i$ for some $i \in \{2,\ldots,t\}$, then $f^{B,z}$ is also independent of $x_2,\ldots,x_t$.
Towards a contradiction, suppose that for every assignment of $(x_1, x_{t+1},\ldots,x_r) = z$ to $f^B$, the resulting function $f^{B,z}$ is independent of $x_i$ for some $i \in \{2,\ldots,t\}$. Then by the argument in the previous paragraph, for every assignment $z$, $f^{B,z}$ is also independent of $x_j$ for every $j \in \{2,\ldots,t\}$. This, however, contradicts the fact that $x_2,\ldots,x_t$ have non-zero influence on $f^B$ (since $B$ was chosen such that $\widehat{f^B}(j) \neq 0$ for every $j \in [r]$ in Lemma 7). This implies the existence of an assignment $(x_1, x_{t+1},\ldots,x_r) = (a_1, a_{t+1},\ldots,a_r)$ such that the resulting function depends on all the variables $x_2,\ldots,x_t$.
We now argue that the assignment in Claim 3 results in a function which resembles the AND function on $x_2,\ldots,x_t$, and hence has Fourier sparsity $2^{t-1}$.

Proof. By Claim 3, $g$ depends on all the variables $x_2,\ldots,x_t$. This dependence is such that if any one of the variables $\{x_i : i \in \{2,\ldots,t\}\}$ is set to $x_i = (1 + \mathrm{sign}(\widehat{f^B}(i)))/2$, then by Claim 2 the resulting function $g^{(i)}$ is independent of $x_2,\ldots,x_t$. Hence, $g^{(i)}$ is some constant $b_i \in \{-1,1\}$ for every $i \in \{2,\ldots,t\}$. Note that these $b_i$'s are all the same value $b$, because first fixing $x_i$ (which collapses $g$ to the constant $b_i$) and then $x_j$ gives the same function as first fixing $x_j$ (which collapses $g$ to $b_j$) and then $x_i$. Additionally, by assigning $x_i = (1 - \mathrm{sign}(\widehat{f^B}(i)))/2$ for every $i \in \{2,\ldots,t\}$ in $g$, the resulting function must evaluate to $-b$, because $g$ is non-constant (it depends on $x_2,\ldots,x_t$). Therefore $g$ equals (up to possible negations of input and output bits) the $(t-1)$-bit AND function.
We now conclude the proof of Lemma 7. Let $f : \{0,1\}^n \to \{-1,1\}$ be such that $\mathrm{Fdim}(f) = r$. Let $B$ be as defined in Observation 1. Consider the assignment of $(x_{t+1},\ldots,x_r) = (a_{t+1},\ldots,a_r)$ to $f^B$ as in Claim 4, and call the resulting function $f'_B$. From Claim 4, observe that by setting $x_1 = a_1$ in $f'_B$, the resulting function is $g(x_2,\ldots,x_t)$, and by setting $x_1 = 1 - a_1$ in $f'_B$, the resulting function is a constant. Hence $f'_B$ can be written in terms of $g(x_2,\ldots,x_t)$ and constants $b_{a_1,a_{t+1},\ldots,a_r} \in \{-1,1\}$ (note that each such constant is independent of $x_2,\ldots,x_t$ by Corollary 1). In the corresponding computation, the third equality used $C_1 = e_1$, and $\widehat{f^D}(1) = 0$ follows from the definition of $D$. We next prove the following fact, which we use to verify the remaining three properties.
From the choice of $D$, observe that the required inequality holds for all $i \in \{2,\ldots,t\}$, since it follows directly from the definition of $D$.
On transforming $g^{(1)}$ using the basis $C$, we obtain the Fourier expansion of $(g^{(1)})^C$ (Eq. (5)). Consider the function $g^C$. The Fourier expansion of $g^C$ is $g^C(y) = \sum_{S\in\{0,1\}^n}\widehat{g}(CS)\chi_S(y)$, and the Fourier expansion of $(g^C)^{(1)}$ can be written out accordingly (Eq. (6)). Using Eqs. (5) and (6), we conclude that $(g^{(1)})^C = (g^C)^{(1)}$, concluding the proof of the fact.
This concludes the proof of the observation.
This concludes the proof of the theorem.

Quantum vs classical membership queries
In this section we assume we can access the target function using membership queries rather than examples. Our goal is to simulate quantum exact learners for a concept class $\mathcal{C}$ by classical exact learners, without using many more membership queries. A key tool here will be the ("nonnegative" or "positive-weights") adversary method. This was introduced by Ambainis [2]; here we will use the formulation of Barnum et al. [6], which is called the "spectral adversary" in the survey [30]. Let $\mathcal{C} \subseteq \{0,1\}^N$ be a set of strings. If $N = 2^n$ then we may view such a string $c \in \mathcal{C}$ as (the truth-table of) an $n$-bit Boolean function, but in this section we do not need the additional structure of functions on the Boolean cube and may consider any positive integer $N$. Suppose we want to identify an unknown $c \in \mathcal{C}$ with success probability at least 2/3 (i.e., we want to compute the identity function on $\mathcal{C}$). The required number of quantum queries to $c$ can be lower bounded as follows. Let $\Gamma$ be a $|\mathcal{C}|\times|\mathcal{C}|$ matrix with real, nonnegative entries and 0s on the diagonal (called an "adversary matrix"). Let $D_i$ denote the $|\mathcal{C}|\times|\mathcal{C}|$ 0/1-matrix whose $(c, c')$-entry is $[c_i \neq c'_i]$. Then it is known that at least (a constant factor times) $\|\Gamma\|/\max_{i\in[N]}\|\Gamma\circ D_i\|$ quantum queries are needed, where $\|\cdot\|$ denotes operator norm (largest singular value) and '$\circ$' denotes entrywise product of matrices. Let
$\mathrm{ADV}(\mathcal{C}) := \max_{\Gamma}\frac{\|\Gamma\|}{\max_{i\in[N]}\|\Gamma\circ D_i\|},$
where the maximization is over adversary matrices as above, denote the best-possible lower bound on $Q(\mathcal{C})$ that can be achieved this way.
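To make these quantities concrete, here is a small numerical illustration (ours, not from the paper) for the class of point functions on $[N]$: the adversary matrix $\Gamma = J - I$ (all-ones minus identity) certifies a lower bound of order $\sqrt{N}$.

```python
import numpy as np

N = 32
# Concept class of point functions on [N]: concept j has value -1 at position j, +1 elsewhere.
C = np.ones((N, N))
np.fill_diagonal(C, -1)            # row j is the truth table of the j-th point function

# Adversary matrix: all-ones off the diagonal, zero on the diagonal.
Gamma = np.ones((N, N)) - np.eye(N)

def opnorm(M):
    return np.linalg.norm(M, 2)    # largest singular value

# D_i has (c, c')-entry 1 iff the two concepts differ on input i.
norms = []
for i in range(N):
    Di = (C[:, i][:, None] != C[:, i][None, :]).astype(float)
    norms.append(opnorm(Gamma * Di))

bound = opnorm(Gamma) / max(norms)
print("adversary lower bound:", round(bound, 3), " vs sqrt(N) =", round(np.sqrt(N), 3))
```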
The key to our classical simulation is the next lemma. It shows that if Q(C) (and hence ADV(C)) is small, then there is a query that splits the concept class in a "mildly balanced" way.
Lemma 8. Let $\mathcal{C} \subseteq \{0,1\}^N$, and let $\mathrm{ADV}(\mathcal{C})$ be the nonnegative adversary bound for the exact learning problem corresponding to $\mathcal{C}$. Let $\mu$ be a distribution on $\mathcal{C}$ such that $\max_{c\in\mathcal{C}}\mu(c) \le 5/6$. Then there exists an $i \in [N]$ such that $\min(\mu(C_i = 0), \mu(C_i = 1)) \ge \frac{1}{36\,\mathrm{ADV}(\mathcal{C})^2}$, where $C$ denotes a concept drawn from $\mu$ and $C_i$ its $i$-th bit.
Proof. Define the unit vector $v \in \mathbb{R}^{|\mathcal{C}|}_{+}$ by $v_c = \sqrt{\mu(c)}$, and the adversary matrix $\Gamma := vv^* - \mathrm{diag}(\mu)$, where $\mathrm{diag}(\mu)$ is the diagonal matrix that has the entries of $\mu$ on its diagonal. This $\Gamma$ is a nonnegative matrix with 0 diagonal (and hence a valid adversary matrix for the exact learning problem), and $\|\Gamma\| \ge \|vv^*\| - \|\mathrm{diag}(\mu)\| \ge 1 - 5/6 = 1/6$. Abbreviate $A = \mathrm{ADV}(\mathcal{C})$. By definition of $A$, we have for this particular $\Gamma$ that there is an $i \in [N]$ with $\|\Gamma\circ D_i\| \ge \|\Gamma\|/A \ge \frac{1}{6A}$. Write $v = v_0 + v_1$, where the entries of $v_0$ are the ones corresponding to $c$'s where $c_i = 0$, and the entries of $v_1$ are the ones where $c_i = 1$. Then $\Gamma\circ D_i = v_0v_1^* + v_1v_0^*$. It is easy to see that $\|\Gamma\circ D_i\| = \|v_0\|\cdot\|v_1\| = \sqrt{\mu(C_i = 0)\,\mu(C_i = 1)}$, hence
$\frac{1}{36A^2} \le \mu(C_i = 0)\,\mu(C_i = 1) = \min(\mu(C_i = 0), \mu(C_i = 1))\cdot\max(\mu(C_i = 0), \mu(C_i = 1)) \le \min(\mu(C_i = 0), \mu(C_i = 1)),$
where the last inequality used $\max(\mu(C_i = 0), \mu(C_i = 1)) \le 1$.
Note that if we query the index $i$ given by this lemma and remove from $\mathcal{C}$ the strings that are inconsistent with the query outcome, then we reduce the size of $\mathcal{C}$ by a factor $\le 1 - \Omega(1/\mathrm{ADV}(\mathcal{C})^2)$. Repeating this $O(\mathrm{ADV}(\mathcal{C})^2\log|\mathcal{C}|)$ times would reduce the size of $\mathcal{C}$ to 1, completing the learning task. However, we will see below that analyzing the same approach in terms of entropy gives a somewhat better upper bound on the number of queries: for every concept class $\mathcal{C} \subseteq \{0,1\}^N$, there exists a classical learner for $\mathcal{C}$ using $O\!\left(\frac{\mathrm{ADV}(\mathcal{C})^2}{\log\mathrm{ADV}(\mathcal{C})}\log|\mathcal{C}|\right)$ membership queries that identifies the target concept with probability $\ge 2/3$.

Proof.
Fix an arbitrary distribution $\mu$ on $\mathcal{C}$. We will construct a deterministic classical learner for $\mathcal{C}$ with success probability $\ge 2/3$ under $\mu$. Since we can do this for every $\mu$, the "Yao principle" [35] then implies the existence of a randomized learner that has success probability $\ge 2/3$ for every $c \in \mathcal{C}$. Consider the following algorithm, whose input is an $N$-bit random variable $C \sim \mu$:

1. Choose an $i$ that maximizes $H(C_i)$ and query that $i$. (Querying this $i$ will give a fairly "balanced" reduction of the size of $\mathcal{C}$ irrespective of the outcome of the query. If there are several maximizing $i$'s, then choose the smallest one to make the algorithm deterministic.)
2. Update C and µ by restricting to the concepts that are consistent with the query outcome.
The queried indices are themselves random variables, and we denote them by $I_1, I_2, \ldots$. We can think of $t$ steps of this algorithm as generating a binary tree of depth $t$, where the different paths correspond to the different queries made and their binary outcomes. Let $P_t$ be the probability that, after $t$ queries, our algorithm has reduced $\mu$ to a distribution that has weight $\ge 5/6$ on one particular $c$:
$P_t := \sum_{b\in\{0,1\}^t}\Pr[C_{I_1}\cdots C_{I_t} = b]\cdot\big[\exists c\in\mathcal{C} : \mu(c \mid C_{I_1}\cdots C_{I_t} = b) \ge 5/6\big].$
Because restricting $\mu$ to a subset $\mathcal{C}' \subseteq \mathcal{C}$ cannot decrease probabilities of individual $c \in \mathcal{C}'$, this probability $P_t$ is non-decreasing in $t$. Because $N$ queries give us the target concept completely, we have $P_N = 1$. Let $T$ be the smallest integer $t$ for which $P_t \ge 5/6$. We will run our algorithm for $T$ queries, and then output the $c$ with highest probability under the restricted version of $\mu$ we now have. With $\mu$-probability at least 5/6, that $c$ will have probability at least 5/6 (under $\mu$ conditioned on the query results). The overall error probability under $\mu$ is therefore $\le 1/6 + 1/6 = 1/3$. It remains to upper bound $T$. To this end, define the following "energy function" in terms of conditional entropy:
$E_t := H(C \mid C_{I_1}\cdots C_{I_t}).$
Because conditioning on a random variable cannot increase entropy, $E_t$ is non-increasing in $t$. We will show below that as long as $P_t < 5/6$, the energy shrinks significantly with each new query. Let $C_{i_1}\cdots C_{i_t} = b$ be such that there is no $c \in \mathcal{C}$ s.t. $\mu(c \mid C_{i_1}\cdots C_{i_t} = b) \ge 5/6$ (note that this event happens in our algorithm with $\mu$-probability $1 - P_t$). Let $\mu'$ be $\mu$ restricted to the class $\mathcal{C}'$ of concepts $c$ where $c_{i_1}\cdots c_{i_t} = b$. The nonnegative adversary bound for this restricted concept class is $A' = \mathrm{ADV}(\mathcal{C}') \le \mathrm{ADV}(\mathcal{C}) = A$. Applying Lemma 8 to $\mu'$, there is an $i_{t+1} \in [N]$ with $p := \min(\mu'(C_{i_{t+1}} = 0), \mu'(C_{i_{t+1}} = 1)) \ge \frac{1}{36A'^2} \ge \frac{1}{36A^2}$. Note that $H(p) \ge \Omega(\log(A)/A^2)$. Hence, as long as $P_t < 5/6$,
$E_t - E_{t+1} = H(C_{I_{t+1}} \mid C_{I_1}\cdots C_{I_t}) \ge (1 - P_t)\cdot\Omega(\log(A)/A^2) \ge \Omega(\log(A)/A^2).$
Since $E_0 \le \log|\mathcal{C}|$ and $E_t \ge 0$ for all $t$, it follows that $T = O\!\left(\frac{A^2}{\log A}\log|\mathcal{C}|\right)$, which completes the proof.
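A compact classical sketch of the resulting learner (our own illustration; the concept class is given as an explicit list of truth tables and `learn` is a hypothetical helper): at each step it queries the index $i$ maximizing $H(C_i)$ under the current distribution and then discards inconsistent concepts.

```python
import numpy as np

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def learn(concepts, target, mu=None):
    """Exactly identify `target` (an index into `concepts`) with membership queries.
    `concepts` is a list of 0/1 truth tables; `mu` is a prior over them (uniform if None)."""
    concepts = [np.array(c) for c in concepts]
    alive = list(range(len(concepts)))
    mu = np.ones(len(concepts)) if mu is None else np.array(mu, float)
    queries = 0
    while len(alive) > 1:
        w = mu[alive] / mu[alive].sum()
        N = len(concepts[0])
        # Choose the query index i maximizing H(C_i) under the current (restricted) distribution.
        ent = [binary_entropy(sum(w[t] for t, j in enumerate(alive) if concepts[j][i] == 1))
               for i in range(N)]
        i = int(np.argmax(ent))
        answer = concepts[target][i]          # one membership query
        queries += 1
        alive = [j for j in alive if concepts[j][i] == answer]
    return alive[0], queries

# Toy class: the 8 point functions on a domain of size 8 (bit j of concept j is 1).
concepts = [list(np.eye(8, dtype=int)[j]) for j in range(8)]
found, q = learn(concepts, target=5)
print("identified concept", found, "using", q, "membership queries")
```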

Future work
Neither of our two results is tight. As directions for future work, let us state two conjectures, one for each model:

• $k$-Fourier-sparse functions can be learned from $O(k\cdot\mathrm{polylog}(k))$ uniform quantum examples.
• For all concept classes C of Boolean-valued functions on a domain of size N we have: