Quantum algorithms and approximating polynomials for composed functions with shared inputs

We give new quantum algorithms for evaluating composed functions whose inputs may be shared between bottom-level gates. Let $f$ be an $m$-bit Boolean function and consider an $n$-bit function $F$ obtained by applying $f$ to conjunctions of possibly overlapping subsets of $n$ variables. If $f$ has quantum query complexity $Q(f)$, we give an algorithm for evaluating $F$ using $\tilde{O}(\sqrt{Q(f) \cdot n})$ quantum queries. This improves on the bound of $O(Q(f) \cdot \sqrt{n})$ that follows by treating each conjunction independently, and our bound is tight for worst-case choices of $f$. Using completely different techniques, we prove a similar tight composition theorem for the approximate degree of $f$. By recursively applying our composition theorems, we obtain a nearly optimal $\tilde{O}(n^{1-2^{-d}})$ upper bound on the quantum query complexity and approximate degree of linear-size depth-$d$ AC$^0$ circuits. As a consequence, such circuits can be PAC learned in subexponential time, even in the challenging agnostic setting. Prior to our work, a subexponential-time algorithm was not known even for linear-size depth-3 AC$^0$ circuits. As an additional consequence, we show that AC$^0 \circ \oplus$ circuits of depth $d+1$ require size $\tilde{\Omega}(n^{1/(1- 2^{-d})}) \geq \omega(n^{1+ 2^{-d}} )$ to compute the Inner Product function even on average. The previous best size lower bound was $\Omega(n^{1+4^{-(d+1)}})$ and only held in the worst case (Cheraghchi et al., JCSS 2018).


Introduction
In the query, or black-box, model of computation, an algorithm aims to evaluate a known Boolean function $f : \{0,1\}^n \to \{0,1\}$ on an unknown input $x \in \{0,1\}^n$ by reading as few bits of $x$ as possible. One of the most basic questions one can ask about query complexity, or indeed any complexity measure of Boolean functions, is how it behaves under composition. Namely, given functions $f$ and $g$, and a method of combining these functions to produce a new function $h$, how does the query complexity of $h$ depend on the complexities of the constituent functions $f$ and $g$?
The simplest method for combining functions is block composition, where the inputs to $f$ are obtained by applying the function $g$ to independent sets of variables. That is, if $f : \{0,1\}^m \to \{0,1\}$ and $g : \{0,1\}^k \to \{0,1\}$, then the block composition $(f \circ g) : \{0,1\}^{m \cdot k} \to \{0,1\}$ is defined by $(f \circ g)(x_1, \ldots, x_m) = f(g(x_1), \ldots, g(x_m))$, where each $x_i$ is a $k$-bit string. In most reasonable models of computation, one can evaluate $f \circ g$ by running an algorithm for $f$, and using an algorithm for $g$ to compute the inputs to $f$ as needed. Thus, the query complexity of $f \circ g$ is at most the product of the complexities of $f$ and $g$.$^{1}$ For many query models, including those capturing deterministic and quantum computation, this is known to be tight. In particular, letting $Q(f)$ denote the bounded-error quantum query complexity of a function $f$, it is known that $Q(f \circ g) = \Theta(Q(f) \cdot Q(g))$ for all Boolean functions $f$ and $g$ [HLŠ07, Rei11]. This result has the flavor of a direct sum theorem: when computing many copies of the function $g$ (in this case, as many as are needed to generate the necessary inputs to $f$), one cannot do better than just computing each copy independently.
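As a purely classical illustration of the definition of block composition, the following minimal sketch builds the composed function from two Boolean functions given as Python callables. The helper names (`block_compose`, `parity`, `and_all`) are hypothetical and chosen only for this example; nothing here models query complexity.

```python
from typing import Callable, Sequence

BoolFn = Callable[[Sequence[int]], int]

def block_compose(f: BoolFn, g: BoolFn, m: int, k: int) -> BoolFn:
    """Return h(x_1, ..., x_m) = f(g(x_1), ..., g(x_m)), each x_i a k-bit block."""
    def h(x: Sequence[int]) -> int:
        if len(x) != m * k:
            raise ValueError("input must have exactly m * k bits")
        # Split the input into m disjoint blocks of k bits each.
        blocks = [x[i * k:(i + 1) * k] for i in range(m)]
        return f([g(block) for block in blocks])
    return h

def parity(bits: Sequence[int]) -> int:
    return sum(bits) % 2

def and_all(bits: Sequence[int]) -> int:
    return int(all(bits))

# Illustrative instance: PARITY on 3 bits composed with AND on 2-bit blocks.
h = block_compose(parity, and_all, m=3, k=2)
```

Here the three AND gates read disjoint blocks, which is exactly what distinguishes block composition from the shared-input setting studied in this paper.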

Quantum algorithms for shared-input compositions
While we have a complete understanding of the behavior of quantum query complexity under block composition, little is known for more general compositions. What is the quantum query complexity of a composed function where inputs to f are generated by applying g to overlapping sets of variables? We call these more general compositions shared-input compositions. Not only does answering this question serve as a natural next step for improving our understanding of quantum query complexity, but it may lead to more unified algorithms and lower bounds for specific functions of interest in quantum computing. Many of the functions that have played an influential role in the study of quantum query complexity can be naturally expressed as compositions of simple functions with shared inputs, including k-distinctness, k-sum, surjectivity, triangle finding, and graph collision.
In this work, we study shared-input compositions between an arbitrary function $f$ and the function $g = \mathrm{AND}$. If $f : \{0,1\}^m \to \{0,1\}$, then we let $h : \{0,1\}^n \to \{0,1\}$ be any function obtained by generating each input to $f$ as an AND over some subset of (possibly negated) variables from $x_1, \ldots, x_n$, as depicted in Figure 1.
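To make the shared-input setting concrete, here is a small classical sketch in which each bottom gate is an AND over a list of literals and different gates may reuse the same variable. The representation of a literal as a pair `(index, negated)` and the names `shared_input_compose` and `or_all` are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Literal = Tuple[int, bool]  # (variable index, True if the variable is negated)

def shared_input_compose(f: Callable[[Sequence[int]], int],
                         gates: List[List[Literal]]) -> Callable[[Sequence[int]], int]:
    """Return h(x) = f(AND_1(x), ..., AND_m(x)); gates may overlap on variables."""
    def h(x: Sequence[int]) -> int:
        gate_values = [
            # A literal is satisfied when the bit is 1 and unnegated, or 0 and negated.
            int(all((x[i] == 1) != negated for i, negated in gate))
            for gate in gates
        ]
        return f(gate_values)
    return h

def or_all(bits: Sequence[int]) -> int:
    return int(any(bits))

# Three AND gates over n = 3 variables; every variable feeds two gates.
gates = [
    [(0, False), (1, False)],   # x_0 AND x_1
    [(1, False), (2, True)],    # x_1 AND (NOT x_2)
    [(0, False), (2, False)],   # x_0 AND x_2
]
h = shared_input_compose(or_all, gates)
```

With the top function chosen as OR, this particular `h` is a small DNF; any other top function can be passed in the same way.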
Of course, one can compute the function $h$ by ignoring the fact that the AND gates depend on shared inputs, and instead regard each gate as depending on its own set of copies of the input variables. Using the quantum query upper bound for block compositions, together with the fact that $Q(\mathrm{AND}_n) = \Theta(\sqrt{n})$ [Gro96, BBBV97], one obtains
$$Q(h) = O(Q(f) \cdot \sqrt{n}). \tag{1}$$
Observe that this bound on $Q(h)$ is non-trivial only if $Q(f) \ll \sqrt{n}$. A priori, one may conjecture that this bound is tight in the worst case for shared-input compositions. After all, if the variables overlap in some completely arbitrary way with no structure, it is unclear from the perspective of an algorithm designer how to use the values of already-computed AND gates to reduce the number of queries needed to compute further AND gates. It might even be the case that every pair of AND gates shares very few common input bits, suggesting that evaluating one AND gate yields almost no information about the output of any other AND gate. This intuition even suggests a path for proving a matching lower bound: using a random wiring pattern, combinatorial designs, etc., construct the set of inputs to each AND gate so that evaluating any particular gate leaks almost no useful information that could be helpful in evaluating the other AND gates.
In this work, we show that this intuition is wrong: the overlapping structure of the AND gates can always be exploited algorithmically (so long as $Q(f) \ll n$).
Results. Our main result shows that a shared-input composition between a function $f$ and the AND function always has substantially lower quantum query complexity than the block composition $f \circ \mathrm{AND}_n$. Specifically, instead of having quantum query complexity which is the product $Q(f) \cdot \sqrt{n}$, a shared-input composition has quantum query complexity which is, up to logarithmic factors, the geometric mean $\sqrt{Q(f) \cdot n}$ between $Q(f)$ and the number of input variables $n$. This bound is nontrivial whenever $Q(f)$ is significantly smaller than $n$.
Theorem 1. Let $h : \{0,1\}^n \to \{0,1\}$ be computed by a depth-2 circuit whose top gate is a function $f : \{0,1\}^m \to \{0,1\}$ and whose bottom-level gates are AND gates on subsets of the input bits and their negations (as depicted in Figure 1). Then we have
$$Q(h) = \tilde{O}(\sqrt{Q(f) \cdot n}). \tag{2}$$
Note that Theorem 1 is nearly tight for every possible value of $Q(f) \in [n]$.$^{2}$ For a parameter $t \leq n$, consider the block composition (i.e., the composition with disjoint inputs) $\mathrm{PARITY}_t \circ \mathrm{AND}_{n/t}$. Since $Q(\mathrm{PARITY}_t) = \lceil t/2 \rceil$ [BBC+01], this function has quantum query complexity matching the upper bound provided by Theorem 1 up to log factors. This shows that Theorem 1 cannot be significantly improved in general.
The proof of Theorem 1 makes use of an optimal quantum algorithm for computing $f$ and Grover's search algorithm for evaluating AND gates. Surprisingly, it uses no other tools from quantum computing. The core of the argument is entirely classical, relying on a recursive gate and wire-elimination argument for evaluating AND gates with overlapping inputs.
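The tightness claim for the parity example can be spelled out in one line. The following worked calculation is a sketch that combines only facts stated in the text: the block-composition identity $Q(f \circ g) = \Theta(Q(f) \cdot Q(g))$, the bound $Q(\mathrm{AND}_k) = \Theta(\sqrt{k})$, and $Q(\mathrm{PARITY}_t) = \Theta(t)$.

```latex
\begin{align*}
Q\bigl(\mathrm{PARITY}_t \circ \mathrm{AND}_{n/t}\bigr)
  &= \Theta\bigl(Q(\mathrm{PARITY}_t) \cdot Q(\mathrm{AND}_{n/t})\bigr) \\
  &= \Theta\bigl(t \cdot \sqrt{n/t}\bigr)
   = \Theta\bigl(\sqrt{t \cdot n}\bigr), \\
\text{while Theorem 1 with } Q(f) = \Theta(t) \text{ gives }\quad
Q(h) &= \tilde{O}\bigl(\sqrt{Q(f) \cdot n}\bigr) = \tilde{O}\bigl(\sqrt{t \cdot n}\bigr).
\end{align*}
```

The two expressions agree up to logarithmic factors for every $t \leq n$, which is the sense in which the upper bound cannot be improved in general.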
At a high level, the algorithm in Theorem 1 works as follows. The overall goal is to query enough input bits such that the resulting circuit is simple enough to apply the composition upper bound $Q(f \circ g) = O(Q(f) Q(g))$. To apply this upper bound and obtain the claimed upper bound in Theorem 1, we require $Q(g)$ to be $O(\sqrt{n/Q(f)})$. Since $g$ is just an AND gate on some subset of inputs, this means we want the fan-in of each AND gate in our circuit to be $O(n/Q(f))$. If we call AND gates with fan-in $\omega(n/Q(f))$ "high fan-in" gates, then the goal is to eliminate all high fan-in gates. Our algorithm achieves this by judiciously querying input bits that would eliminate a large number of high fan-in gates if they were set to 0.
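To see why the fan-in threshold $n/Q(f)$ is the right target, the short calculation below plugs it into the block-composition bound. The concrete values $n = 10^6$ and $Q(f) = 100$ are illustrative assumptions, and constants and logarithmic factors are ignored.

```latex
\[
\text{fan-in} \le \frac{n}{Q(f)}
\;\Longrightarrow\;
Q(g) = O\Bigl(\sqrt{n/Q(f)}\Bigr)
\;\Longrightarrow\;
Q(f) \cdot Q(g) = O\Bigl(Q(f) \cdot \sqrt{n/Q(f)}\Bigr) = O\Bigl(\sqrt{Q(f) \cdot n}\Bigr).
\]
% Illustrative numbers (assumed): n = 10^6 and Q(f) = 100.
\[
\frac{n}{Q(f)} = 10^{4}, \qquad
\sqrt{n/Q(f)} = 10^{2}, \qquad
Q(f) \cdot \sqrt{n/Q(f)} = 10^{4} = \sqrt{100 \cdot 10^{6}},
\qquad \text{versus} \qquad
Q(f) \cdot \sqrt{n} = 10^{5}.
\]
```

In this example the geometric-mean bound saves a factor of 10 over the bound obtained by treating each AND gate as a separate copy of $\mathrm{AND}_n$.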
Besides the line of work on the quantum query complexity of block compositions, our result is also closely related to work of Childs, Kimmel, and Kothari [CKK12] on read-many formulas. Childs et al. showed that any formula on $n$ inputs consisting of $G$ gates from the de Morgan basis {AND, OR, NOT} can be evaluated using $O(G^{1/4} \cdot \sqrt{n})$ quantum queries. In the special case of DNF formulas, our result coincides with theirs by taking the top function $f$ to be the OR function. However, even in this special case, the result of Childs et al. makes critical use of the top function being OR. Specifically, their result uses the fact that the quantum query complexity of the OR function is the square root of its formula size. Our result, on the other hand, applies without making any assumptions on the top function $f$. This level of generality is needed when using Theorem 1 to understand circuits (rather than just formulas) of depth 3 and higher, as discussed in Section 1.3.

Approximate degree of shared-input compositions
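We also study shared-input compositions under the related notion of approximate degree. For a Boolean function $f : \{0,1\}^n \to \{0,1\}$, an $\varepsilon$-approximating polynomial for $f$ is a real polynomial $p$ satisfying $|p(x) - f(x)| \leq \varepsilon$ for all $x \in \{0,1\}^n$. The $\varepsilon$-approximate degree of $f$, denoted $\widetilde{\deg}_\varepsilon(f)$, is the least degree among all $\varepsilon$-approximating polynomials for $f$. We use the term approximate degree without qualification to refer to the choice $\varepsilon = 1/3$, and denote it $\widetilde{\deg}(f) = \widetilde{\deg}_{1/3}(f)$.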
A fundamental observation due to Beals et al. [BBC+01] is that any $T$-query quantum algorithm for computing a function $f$ implicitly defines a degree-$2T$ approximating polynomial for $f$. Thus, $\widetilde{\deg}(f) \leq 2Q(f)$. This relationship has led to a number of successes in proving quantum query complexity lower bounds via approximate degree lower bounds, constituting a technique known as the polynomial method in quantum computing. Conversely, quantum algorithms are powerful tools for establishing the existence of low-degree approximating polynomials that are needed in other applications to theoretical computer science. For example, the deep result that every de Morgan formula of size $s$ has quantum query complexity, and hence approximate degree, $O(\sqrt{s})$ [FGG08, CCJYM09, ACR+10, Rei11] underlies the fastest known algorithm for agnostically learning formulas [KKMS08, Rei11] (see Section 1.4 and Section 5 for details on this application). It has also played a major role in the proofs of the strongest formula and graph complexity lower bounds for explicit functions [Tal17].
$^{2}$ Theorem 1 is not tight for every function $f$, of course. For example, if $f$ is an AND on many inputs, the composed function will have quantum query complexity $O(\sqrt{n})$, but the upper bound of Theorem 1 can be larger than this.

Results.
We complement our result on the quantum query complexity of shared-input compositions with an analogous result for approximate degree.
Note that our result for approximate degree is incomparable with Theorem 1, even for bounded error, since both sides of the equation include the complexity measure under consideration.
Like Theorem 1, Theorem 2 can be shown to be tight by considering the block composition $\mathrm{PARITY}_t \circ \mathrm{AND}_{n/t}$.
Our proof of Theorem 2 abstracts and generalizes a technique introduced by Sherstov [She18], who very recently proved an $O(n^{3/4})$ upper bound on the approximate degree of an important depth-3 circuit of nearly quadratic size called Surjectivity. Despite the similarity between Theorem 2 and Theorem 1, and the close connection between approximating polynomials and quantum algorithms, the proof of Theorem 2 is completely different from that of Theorem 1, making crucial use of properties of polynomials that do not hold for quantum algorithms.$^{3}$ In our opinion, this feature of the proof of Theorem 2 makes Theorem 1 for quantum algorithms even more surprising.
We remark that a different proof of the $O(n^{3/4})$ upper bound for the approximate degree of Surjectivity was discovered in [BKT18], who also showed a matching lower bound. It is also possible to prove Theorem 2 by generalizing the techniques developed in that work, but the techniques of [She18] lead to a shorter and cleaner analysis.

Application: Evaluating and approximating linear-size AC$^0$ circuits
The circuit class AC$^0$ consists of constant-depth, polynomial-size circuits over the de Morgan basis {AND, OR, NOT} with unbounded fan-in gates. The full class AC$^0$ is known to contain very hard functions from the standpoint of both quantum query complexity and approximate degree. The aforementioned Surjectivity function is in depth-3 AC$^0$ and has quantum query complexity $\Omega(n)$ [BM12, She15], while for every positive constant $\delta > 0$, there exists a depth-$O(\log(1/\delta))$ AC$^0$ circuit with approximate degree $\Omega(n^{1-\delta})$ [BT17].
Nevertheless, AC$^0$ contains a number of interesting subclasses for which nontrivial quantum query and approximate degree upper bounds might still hold. Here, we discuss applications of our composition theorem to understanding the subclass LC$^0$, consisting of AC$^0$ circuits of linear size.
The class LC$^0$ is one of the most interesting subclasses of AC$^0$. It has been studied by many authors in various complexity-theoretic contexts, ranging from logical characterizations [KLPT06] to faster-than-brute-force satisfiability algorithms [CIP09, SS12]. LC$^0$ turns out to be a surprisingly powerful class. For example, the $k$-threshold function that asks if the input has Hamming weight greater than $k$ is clearly in AC$^0$ for constant $k$, by computing the OR of all $\binom{n}{k}$ possible certificates. But this yields a circuit of size $O(n^k)$, which one might conjecture is optimal. However, it turns out that $k$-threshold is in LC$^0$ even when $k$ is as large as $\mathrm{polylog}(n)$ [RW91]. Another surprising fact is that every regular language in AC$^0$ can be computed by an AC$^0$ circuit of almost linear size (e.g., size $O(n \log^* n)$ suffices) [Kou09].
By recursively applying Theorem 1, we obtain the following sublinear upper bound on the quantum query complexity of depth-$d$ LC$^0$ circuits, denoted by LC$^0_d$:
Theorem 3. For all constants $d \geq 0$ and all functions $h : \{0,1\}^n \to \{0,1\}$ in LC$^0_d$, we have $Q(h) = \tilde{O}(n^{1-2^{-d}})$.
Our upper bound is nearly tight for every depth $d$, as shown in [CKK12].
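The exponent $1 - 2^{-d}$ can be traced through a simple recurrence. The sketch below rests on an assumption about the shape of the recursion rather than on the proof itself: each application of Theorem 1 to a linear-size circuit turns a bound of $n^{e}$ for depth $d-1$ into $\sqrt{n^{e} \cdot n}$ for depth $d$, with exponent $0$ at depth $0$. The function name `exponent` is hypothetical, and logarithmic factors are ignored.

```python
from fractions import Fraction

def exponent(depth: int) -> Fraction:
    """Exponent e_d in the bound n^{e_d}, assuming e_0 = 0 and e_d = (1 + e_{d-1}) / 2."""
    e = Fraction(0)
    for _ in range(depth):
        # sqrt(n^e * n) = n^{(1 + e) / 2}
        e = (1 + e) / 2
    return e

# The recurrence reproduces 1 - 2^{-d} for each depth.
table = {d: exponent(d) for d in range(5)}
```

Under this reading, depth 1 gives exponent $1/2$ (a single AND or OR via Grover search) and depth 2 gives $3/4$, which agrees with the $O(n^{3/4})$ upper bound for linear-size DNFs mentioned later in the paper.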
Theorem 4 (Childs, Kimmel, and Kothari). For all constants $d \geq 0$, there exists a function $h : \{0,1\}^n \to \{0,1\}$ in LC$^0_d$ such that $Q(h) = n^{1-2^{-\Omega(d)}}$.
By recursively applying Theorem 2, we obtain a similar sublinear upper bound for the $\varepsilon$-approximate degree of LC$^0_d$, even for subconstant values of $\varepsilon$.
Theorem 5. For all constant d ≥ 0, and any ε > 0, and all functions h : For constant , we prove a lower bound of the same form with quadratically worse dependence on the depth d.
A lower bound of $\widetilde{\deg}(h) = n^{1-2^{-\Omega(d)}}$ was already known for general AC$^0$ functions $h$ [BT17, BKT18], but the AC$^0$ circuits constructed in these prior works are not of linear size. Previously, for any $\ell \geq 1$, [BKT18] exhibited a circuit $C : \{0,1\}^n \to \{0,1\}$ of depth at most $3\ell$, size at most $n^2$, and approximate degree $\widetilde{\deg}(C) \geq \Omega(n^{1-2^{-\ell}})$. We show how to transform this quadratic-size circuit $C$ into a linear-size circuit $C'$ of depth roughly $\ell^2$, whose approximate degree is close to that of $C$. Our transformation adapts that of [CKK12], but requires a more intricate construction and analysis. This is because, unlike quantum query complexity, approximate degree is not known to increase multiplicatively under block composition.
For a given accuracy parameter $\varepsilon$, the goal of the learner is to produce a hypothesis $h$ such that $\mathrm{err}_D(h) \leq \min_{c \in C} \mathrm{err}_D(c) + \varepsilon$.
Very few concept classes $C$ are known to be agnostically learnable, even in subexponential time. For example, the best known algorithm for agnostically learning disjunctions runs in time $2^{\tilde{O}(\sqrt{n})}$ [KKMS08].$^{4}$ Moreover, several hardness results are known. Proper agnostic learning of disjunctions (where the output hypothesis itself must be a disjunction) is NP-hard [KSS94]. Even improper agnostic learning of disjunctions is at least as hard as PAC learning DNF [LBW95], which is a longstanding open question in learning theory.
The best known general result for more expressive classes of circuits is that all de Morgan formulas of size $s$ can be learned in time $2^{\tilde{O}(\sqrt{s})}$ [KKMS08, Rei11] (Section 5.1 contains a detailed overview of prior work on agnostic and PAC learning). Both of the aforementioned results make use of the well-known linear regression framework of [KKMS08] for agnostic learning. This algorithm works whenever there is a "small" set of "features" $F$ (where each feature is a function mapping $\{0,1\}^n$ to $\mathbb{R}$) such that each concept in the concept class $C$ can be approximated to error $\varepsilon$ in the $\ell_\infty$ norm by a linear combination of features in $F$. (See Section 5 for details.) If every function in a concept class $C$ has approximate degree at most $d$, then one obtains an agnostic learning algorithm for $C$ with running time $2^{\tilde{O}(d)}$ by taking $F$ to be the set of all monomials of degree at most $d$. Applying this algorithm using the approximate degree upper bound of Theorem 5 yields a subexponential-time algorithm for agnostically learning LC$^0_d$.
Prior to our work, no subexponential-time algorithm was known even for agnostically learning LC$^0_3$. Moreover, since our upper bound on the approximate degree of LC$^0$ circuits is nearly tight, new techniques will be needed to significantly surpass our results, and in particular, to learn all of LC$^0$ in subexponential time. (Note that standard techniques [She11a] automatically generalize the lower bound of Theorem 6 from the feature set of low-degree monomials to arbitrary feature sets. See Section 5.2 for details.)
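The feature set of low-degree monomials is easy to write down explicitly. The sketch below is an illustration only: it enumerates the monomial features of an input and counts them, and it does not implement the regression step of [KKMS08]. The function names are hypothetical.

```python
from itertools import combinations
from math import comb, prod
from typing import List, Sequence

def monomial_features(x: Sequence[int], d: int) -> List[int]:
    """Values of all monomials prod_{i in S} x_i with |S| <= d, ordered by |S|."""
    n = len(x)
    return [
        prod(x[i] for i in subset)      # the empty product is 1 (constant feature)
        for size in range(d + 1)
        for subset in combinations(range(n), size)
    ]

def num_monomials(n: int, d: int) -> int:
    """Number of multilinear monomials of degree at most d on n variables."""
    return sum(comb(n, i) for i in range(d + 1))
```

Since the number of features is at most roughly $n^d = 2^{d \log n}$, any procedure that is polynomial in the number of features runs in time $2^{\tilde{O}(d)}$, which is where the running time quoted above comes from.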

Application: New Circuit Lower Bounds
An important frontier problem in circuit complexity is to show that the well-known Inner Product function cannot be computed by AC$^0 \circ \oplus$ circuits of polynomial size. Here, AC$^0 \circ \oplus$ refers to AC$^0$ circuits augmented with a layer of parity gates at the bottom (i.e., closest to the inputs). Servedio and Viola [SV12] identified this open problem as a first step toward proving matrix rigidity lower bounds, itself a notorious open problem in complexity theory, and Akavia et al. [ABG+14] connected the problem to the goal of constructing highly efficient pseudorandom generators.$^{5}$ Average-case versions of this question have also been posed, even just for DNFs with a layer of parity gates at the bottom [CS16, ER21]. Unfortunately, the best known lower bounds against AC$^0 \circ \oplus$ circuits computing Inner Product are quite weak. The state-of-the-art result [CGJ+16] for any constant depth $d > 4$ is that Inner Product cannot be computed by any depth-$(d+1)$ AC$^0 \circ \oplus$ circuit of size $O(n^{1+4^{-(d+1)}})$. We show that Theorem 5 implies an improved (if still unsatisfying) lower bound of $\tilde{\Omega}(n^{1/(1-2^{-d})}) = n^{1+2^{-d}+\Omega(1)}$. More significantly, unlike prior work, our lower bound holds even against circuits that compute the Inner Product function on slightly more than half of all inputs. Below, when we refer to the depth of an AC$^0 \circ \oplus$ circuit, we count the layer of parity gates toward the depth. For example, we consider a DNF of parities to have depth 3.
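The comparison between the new exponent $1/(1-2^{-d})$ and $1 + 2^{-d}$ follows from elementary algebra, worked out below for constant $d \geq 1$.

```latex
\begin{align*}
\frac{1}{1 - 2^{-d}}
  &= \frac{2^{d}}{2^{d} - 1}
   = 1 + \frac{1}{2^{d} - 1}, \\
\frac{1}{2^{d} - 1} - 2^{-d}
  &= \frac{2^{d} - (2^{d} - 1)}{2^{d}\,(2^{d} - 1)}
   = \frac{1}{2^{d}\,(2^{d} - 1)} > 0, \\
\text{so}\qquad
\frac{1}{1 - 2^{-d}}
  &= 1 + 2^{-d} + \frac{1}{2^{d}\,(2^{d} - 1)} .
\end{align*}
```

For constant $d$ the last term is a positive constant, which accounts for the $\Omega(1)$ in the exponent $1 + 2^{-d} + \Omega(1)$; it also exceeds the earlier exponent $1 + 4^{-(d+1)}$, since $2^{-d} > 4^{-(d+1)}$.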
Theorem 8. For any constant integer $d \geq 4$, any depth-$(d+1)$ AC$^0 \circ \oplus$ circuit computing the Inner Product function on $n$ bits on greater than a $1/2 + n^{-\log n}$ fraction of inputs has size $\tilde{\Omega}(n^{1/(1-2^{-d})})$.
This application is new and does not appear in the conference version of this paper [BKT19]. The idea of our proof is to use the approximate degree upper bound for LC$^0_d$ circuits of Theorem 5 to show that any small AC$^0 \circ \oplus$ circuit has non-trivial (i.e., $\gg 2^{-n}$) correlation under the uniform distribution with some parity function. Yet it is well known that the Inner Product function has correlation at most $2^{-n}$ with any parity function. As we show, this rules out the possibility that a small AC$^0 \circ \oplus$ circuit computes the Inner Product function, even on slightly more than half of all inputs.

Discussion and future directions
Summarizing our results, we established shared-input composition theorems for quantum query complexity (Theorem 1) and approximate degree (Theorem 2), roughly showing that for compositions between an arbitrary function $f$ and the function $g = \mathrm{AND}$, it is always possible to leverage sharing of inputs to obtain algorithmic speedups. We applied these results to obtain the first sublinear upper bounds on the quantum query complexity and approximate degree of LC$^0_d$.
Generalizing our composition theorems. Although considering the inner function $g = \mathrm{AND}$ is sufficient for our applications to LC$^0$, an important open question is to generalize our results to larger classes of inner functions. The proof of our composition theorem for approximate degree actually applies to any inner function $g$ that can be exactly represented as a low-weight sum of ANDs (for example, it applies to any strongly unbalanced function $g$, meaning that $|g^{-1}(1)| = \mathrm{poly}(n)$). Extending this further would be a major step forward in our understanding of how quantum query complexity and approximate degree behave under composition with shared inputs.
While our paper considers the composition scenario where the top function is arbitrary and the bottom function is AND, the opposite scenario is also interesting. Here the top function is $\mathrm{AND}_m$ and the bottom functions are $f_1, \ldots, f_m$, each acting on the same set of $n$ input variables. Now the question is whether we can do better than the upper bound obtained using results on block composition that treat all the input variables as being independent. More concretely, for such a function $F$, the upper bound that follows from block composition is $Q(F) = O(\sqrt{m} \cdot \max_i Q(f_i))$. However, this upper bound cannot be improved in general, because the Surjectivity function is an example of such a function. Here the bottom functions $f_i$ check if the input contains a particular range element $i$, and the upper bound obtained from this argument is $O(n)$, which matches the lower bound [BM12, She15]. Surprisingly, this lower bound only holds for quantum query complexity, as we know that the approximate degree of Surjectivity is $\tilde{\Theta}(n^{3/4})$. We do not know if the upper bound obtained from block composition can be improved for approximate degree.
Quantum query complexity of LC$^0$ and DNFs. For quantum query complexity, we obtain the upper bound $Q(\mathrm{LC}^0_d) = \tilde{O}(n^{1-2^{-d}})$, which nearly matches the lower bound of [CKK12]. However, the bounds do not match for any fixed value of $d$. The lack of matching lower bounds can be attributed to the fact that the Surjectivity function, which is known to have linear quantum query complexity, is computed by a quadratic-size depth-3 circuit, rather than a quadratic-size depth-2 circuit (i.e., a DNF). If one could prove a linear lower bound on the quantum query complexity of some quadratic-size DNF, the argument of [CKK12] would translate this into an $\tilde{\Omega}(n^{1-2^{-d}})$ lower bound for LC$^0_d$, matching our upper bound. Unfortunately, no linear lower bound on the quantum query complexity of any polynomial-size DNF is known; we highlight this as an important open problem (the same problem was previously posed by Troy Lee with different motivations [Lee12]).
Open Problem 1. Is there a polynomial-size DNF with Ω(n) quantum query complexity?
The quantum query complexity of depth-2 LC$^0$, or linear-size DNFs, also remains open. The best upper bound is $O(n^{3/4})$, but the best lower bound is $\Omega(n^{0.555})$ [CKK12]. Any improvement in the lower bound would also imply, in a black-box way, an improved lower bound for the Boolean matrix product verification problem. Improving the lower bound all the way to $\Omega(n^{3/4})$ would imply optimal lower bounds for all of LC$^0$ using the argument in [CKK12]. We conjecture that there is a linear-size DNF with quantum query complexity $\Omega(n^{3/4})$, matching the known upper bound.
Approximate degree of LC$^0$ and DNFs. For approximate degree, we obtain the upper bound $\widetilde{\deg}(\mathrm{LC}^0_d) = \tilde{O}(n^{1-2^{-d}})$, and prove a new lower bound of $\widetilde{\deg}(\mathrm{LC}^0_d) \geq n^{1-2^{-\Omega(\sqrt{d})}}$. The reason our approximate degree lower bound approaches $n$ more slowly than the quantum query lower bound from [CKK12] is that, while the quantum query complexity of AC$^0$ is known to be $\Omega(n)$, such a result is not known for approximate degree. This remains an important open problem.

Open Problem 2. Is there a problem in AC$^0$ with approximate degree $\Omega(n)$?
Our lower bound argument would translate, in a black-box manner, any linear lower bound on the approximate degree of a general AC$^0$ circuit into a nearly tight lower bound for LC$^0_d$. Alternatively, it would be very interesting if one could improve our approximate degree upper bound for LC$^0_d$. Even seemingly small improvements to our upper bound would have significant implications. Specifically, standard techniques (see, e.g., [CR96]) imply that for any constant $\delta > 0$, there are approximate majority functions$^{6}$ computable by depth-$(2d+3)$ circuits of size $O(n^{1+2^{-d}+\delta})$.$^{7}$ This means that, for sufficiently large constant $d$, if one could improve our upper bound on the approximate degree of LC$^0$, one would obtain a sublinear upper bound on the approximate degree of some total function computing an approximate majority. This would answer a question of Srinivasan [FHH+14], and may be considered a surprising result, as approximate majorities are currently the primary natural candidate AC$^0$ functions that may exhibit linear approximate degree [BKT18].

Paper organization and notation
This paper is organized so as to be accessible to readers without familiarity with quantum algorithms. Section 2 assumes the reader is somewhat familiar with quantum query complexity and Grover's algorithm [Gro96], but only uses Grover's algorithm as a black box. In Section 2 we show our main result on the quantum query complexity of shared-input compositions (Theorem 1). Section 3 proves our result about the approximate degree of shared-input compositions (Theorem 2). Section 4 uses the results of these sections (in a black-box manner) to upper bound the quantum query complexity and approximate degree of LC$^0$ circuits, and proves related lower bounds. Section 5 uses the results of Section 4 to obtain algorithms to agnostically PAC learn LC$^0$ circuits. Section 6 derives our average-case lower bounds on the size of AC$^0 \circ \oplus$ circuits computing the Inner Product function. This section is new and does not appear in the conference version of this paper [BKT19].
In this paper we use the $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ notation to suppress logarithmic factors. More formally, $f(n) = \tilde{O}(g(n))$ means there exists a constant $k$ such that $f(n) = O(g(n) \log^k g(n))$, and similarly $f(n) = \tilde{\Omega}(g(n))$ means there exists a constant $k$ such that $f(n) = \Omega(g(n) / \log^k g(n))$. For a string $x \in \{0,1\}^n$, we use $|x| = \sum_i x_i$ to denote the Hamming weight of $x$, i.e., the number of entries in $x$ equal to 1. For any positive integer $n$, we use $[n]$ to denote the set $\{1, \ldots, n\}$. For non-negative integers $n$ and $k$, we use $\binom{n}{\leq k}$ to denote $\sum_{i=0}^{k} \binom{n}{i}$. A basic fact is that $\binom{n}{\leq k} \leq n^k$ whenever $n, k \geq 2$.
2 Quantum algorithm for composed functions

Preliminaries
As described in the introduction, our quantum algorithm only uses variants of Grover's algorithm [Gro96] and is otherwise classical. To make this section accessible to those without familiarity with quantum query complexity, we only state the minimum required preliminaries to understand the algorithm. Furthermore, we do not optimize the logarithmic factors in our upper bound to simplify the presentation. For a more comprehensive introduction to quantum query complexity, we refer the reader to the survey by Buhrman and de Wolf [BdW02].
In quantum or classical query complexity, the goal is to compute some known function $f : \{0,1\}^n \to \{0,1\}$ on some unknown input $x \in \{0,1\}^n$ while reading as few bits of $x$ as possible. Reading a bit of $x$ is also referred to as "querying" a bit of $x$, and hence the goal is to minimize the number of queries made to the input.
For example, the deterministic query complexity of a function f is the minimum number of queries needed by a deterministic algorithm in the worst case. A deterministic algorithm must be correct on all inputs, and can decide which bit to query next based on the input bits it has seen so far. Another example of a query model is the bounded-error randomized query model. The bounded-error randomized query complexity of a function f , denoted R(f ), is the minimum number of queries made by a randomized algorithm that computes the function correctly with probability greater than or equal to 2/3 on each input. In contrast to a deterministic algorithm, such an algorithm has access to a source of randomness, which it may use in deciding which bits to query.
The bounded-error quantum query complexity of $f$, denoted $Q(f)$, is similar to bounded-error randomized query complexity, except that the algorithm is now quantum. In particular, this means the algorithm may query the inputs in superposition. Since quantum algorithms can also generate randomness, for all functions we have $Q(f) \leq R(f)$.
An important example of the difference between the two models is provided by the $\mathrm{OR}_n$ function, which asks if any of the input bits is equal to 1. We have $R(\mathrm{OR}_n) = \Theta(n)$, because intuitively if the algorithm only sees a small fraction of the input bits and they are all 0, we do not know whether or not the rest of the input contains a 1. However, Grover's algorithm is a quantum algorithm that solves this problem with only $O(\sqrt{n})$ queries [Gro96]. The algorithm is also known to be tight, and we have $Q(\mathrm{OR}_n) = \Theta(\sqrt{n})$ [BBBV97]. There are several variants of Grover's algorithm that solve related problems and are sometimes more useful than the basic version of the algorithm. Most of these can be derived from the basic version of Grover's algorithm (and this sometimes adds logarithmic overhead).
In this work we need a variant of Grover's algorithm that finds a 1 in the input faster when there are many 1s. Let the Hamming weight of the input $x$ be $t = |x|$. If we know $t$, then we can use Grover's algorithm on a randomly selected subset of the input of size $O(n/t)$, and one of the 1s will be in this set with high probability. Hence the algorithm will have query complexity $O(\sqrt{n/t})$. With some careful bookkeeping, this can be done even when $t$ is unknown, and the algorithm will have expected query complexity $O(\sqrt{n/t})$. More formally, we have the following result of Boyer, Brassard, Høyer, and Tapp [BBHT98].
Lemma 9. Given query access to a string $x \in \{0,1\}^n$, there is a quantum algorithm that, when $t = |x| > 0$, always outputs an index $i$ such that $x_i = 1$ and makes $O(\sqrt{n/t})$ queries in expectation. When $t = 0$, the algorithm does not terminate.
Note that because we do not know $t = |x|$, we only have a guarantee on the expected query complexity of the algorithm, not the worst-case query complexity. Note also that this variant of Grover's algorithm is a zero-error algorithm in the sense that it always outputs a correct index $i$ with $x_i = 1$ when such an index exists.
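The subsampling step can be checked with a short classical calculation. The sketch below computes the exact probability that a uniformly random subset of a given size contains at least one of the $t$ ones; it does not simulate Grover's algorithm, and the function name `hit_probability` is hypothetical.

```python
from math import comb, exp

def hit_probability(n: int, t: int, s: int) -> float:
    """Probability that a uniformly random s-subset of [n] contains one of t marked positions."""
    # The subset misses every marked position iff it lies inside the n - t unmarked ones.
    miss = comb(n - t, s) / comb(n, s)
    return 1.0 - miss

# With s = n / t the miss probability is at most (1 - t/n)^s <= 1/e.
p = hit_probability(100, 10, 10)
```

A constant success probability of at least $1 - 1/e$ per sample is what a subset of size exactly $n/t$ guarantees; taking the subset a constant or logarithmic factor larger, as allowed by the $O(n/t)$ in the text, drives the failure probability down further.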
In our algorithm we use an amplified version of the algorithm of Lemma 9, which adds a log factor to the query complexity and always terminates after $O(\sqrt{n} \log n)$ queries.
Proof. The algorithm is straightforward. We run $O(\log n)$ instances of the algorithm of Lemma 9 in parallel and halt if any one of them halts. If we reach our budget of $O(\sqrt{n} \log n)$ queries, then we halt and output "$|x| = 0$". Let us argue that the algorithm has the claimed properties. First, since the algorithm of Lemma 9 does not terminate when $|x| = 0$, our algorithm will correctly output "$|x| = 0$" at the end for such inputs. When $|x| > 0$, each instance of the algorithm of Lemma 9 finds an index $i$ with $x_i = 1$ with constant probability after $O(\sqrt{n})$ queries. The probability that none of the $O(\log n)$ copies finds such an $i$ is exponentially small in the number of copies, and hence polynomially small in $n$. Finally, our algorithm makes only $O(\sqrt{n} \log n)$ queries when $|x| = 0$ by construction. When $|x| > 0$, we know that the algorithm of Lemma 9 terminates after an expected $O(\sqrt{n/|x|})$ queries, and hence halts with probability at least $2/3$ after $O(\sqrt{n/|x|})$ queries by Markov's inequality. The probability that none of the $O(\log n)$ copies of the algorithm halts after making $O(\sqrt{n/|x|})$ queries each is again inverse polynomially small in $n$.
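The amplification step in the proof above can be made quantitative as follows. The constants $3$ and $c$ are illustrative choices, not taken from the paper; $T$ denotes the number of queries made by one run of the algorithm of Lemma 9 on an input with $|x| > 0$.

```latex
\begin{align*}
\Pr\bigl[T \ge 3\,\mathbb{E}[T]\bigr] &\le \tfrac{1}{3}
  && \text{(Markov's inequality)}, \\
\Pr\bigl[\text{all } c \log_2 n \text{ independent copies exceed } 3\,\mathbb{E}[T] \text{ queries}\bigr]
  &\le \Bigl(\tfrac{1}{3}\Bigr)^{c \log_2 n}
   = n^{-c \log_2 3}.
\end{align*}
```

Since $\mathbb{E}[T] = O(\sqrt{n/|x|})$, the parallel procedure halts within $O(\sqrt{n/|x|} \cdot \log n)$ total queries except with probability polynomially small in $n$.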

Quantum algorithm
We are now ready to present our main result for quantum query complexity, which we restate below.

Theorem 1. Let $h : \{0,1\}^n \to \{0,1\}$ be computed by a depth-2 circuit where the top gate is a function $f : \{0,1\}^m \to \{0,1\}$ and the bottom-level gates are AND gates on subsets of the input bits and their negations (as in Figure 1). Then we have

$$Q(h) = O\left(\sqrt{Q(f) \cdot n}\, \log^2(mn)\right). \qquad (2)$$

While Theorem 1 allows the bottom AND gates to depend on negated variables, it will be without loss of generality in the proof to assume that all input variables are unnegated. This is because we can instead work with the function $h : \{0,1\}^{2n} \to \{0,1\}$ obtained by treating the positive and negative versions of a variable separately, increasing our final quantum query upper bound by a constant factor.

We now define some notation that will aid with the description and analysis of the algorithm. We know that our circuit $h$ has $m$ AND gates and $n$ input bits $x_i$. We say an AND gate has high fan-in if the number of inputs to that AND gate is greater than or equal to $n/Q(f)$. Note that if our circuit $h$ has no high fan-in gates, then we are done, because we can simply use the upper bound for block composition, i.e.,

$$Q(h) = O\left(Q(f) \cdot \sqrt{n/Q(f)}\right) = O\left(\sqrt{Q(f) \cdot n}\right).$$

Our goal is to reduce to this simple case. More precisely, we will start with the given circuit $h$, make some queries to the input, and then simplify the given circuit to obtain a new circuit $h'$. The new circuit will have no high fan-in gates, but will still have $h'(x) = h(x)$ on the given input $x$. Note that $h$ and $h'$ have the same output only for the given input $x$, and not necessarily for all inputs.
For any such circuit $h$, let $S \subseteq [m]$ be the set of all high fan-in AND gates, and let $w(S)$ be the total fan-in of $S$, which is the sum of fan-ins of all gates in $S$. In other words, it is the total number of wires incident to the set $S$. Since the set $S$ only has gates with fan-in at least $n/Q(f)$, we have $w(S) \ge |S| \cdot n/Q(f)$.

We now present our first algorithm, which is a subroutine in our final algorithm. This algorithm's goal is to take a circuit $h$, with $|S|$ high fan-in gates and $w(S)$ wires incident on $S$, and reduce $w(S)$ by a factor of 2. Ultimately we want to have $|S| = w(S) = 0$, and hence if we can halve $w(S)$, we can repeat this procedure logarithmically many times to get $|S| = w(S) = 0$.

Proof of Lemma 11. The overall structure of the claimed algorithm is the following: We query some well-chosen input bits, and on learning the values of these bits, we simplify the circuit accordingly. If an input bit is 0, then we delete all the AND gates that use that input bit. If an input bit is 1, we delete all outgoing wires from that input bit since a 1-input does not affect the output of an AND gate.
Since the circuit will change during the algorithm, let us define $S_0$ to be the initial set of high fan-in AND gates in $h$ (i.e., gates with fan-in $\ge n/Q(f)$).
We also define the degree of an input $x_i$, denoted $\deg(i)$, to be the number of high fan-in AND gates that it is an input to. Note that this is not the total number of outgoing wires from $x_i$, but only those that go to high fan-in AND gates, i.e., gates in the set $S$. With this definition, note that $\sum_{i \in [n]} \deg(i) = w(S)$ for any circuit. We say an input bit $x_i$ is high degree if $\deg(i) \ge |S_0|/(2Q(f))$. This value is chosen since it is at most half the average degree of all $x_i$ in the initial circuit $h$ (the average degree is $w(S_0)/n \ge |S_0|/Q(f)$). As the algorithm progresses, the circuit will change, and some inputs that were initially high degree may become low degree, but a low-degree input will never become high degree. Note, however, that the definition of a high-degree input bit does not change, since it only depends on $S_0$ and $Q(f)$, which are fixed for the duration of the algorithm.
Finally, we call an input bit $x_i$ marked if $x_i = 0$. We are now ready to describe our algorithm by the following pseudocode (see Algorithm 1).

Algorithm 1 The algorithm of Lemma 11.
  repeat
    Let M be the set of inputs i that are high degree and marked, i.e., deg(i) ≥ |S_0|/(2Q(f)) and x_i = 0
    Use Grover search (Lemma 10) to look for an i ∈ M
    if we find such an i then
      Delete all AND gates that use x_i as an input
    end if
  until Grover search fails to find an i ∈ M
  Delete all remaining high-degree inputs and all outgoing wires from these inputs

In more detail, we repeatedly use the version of Grover's algorithm in Lemma 10 to find a high-degree marked input, which is an input $x_i$ such that $x_i = 0$ and $\deg(i) \ge |S_0|/(2Q(f))$. If we find such an input, we delete all the AND gates that use $x_i$ as an input, and repeat this procedure. Note that when we repeat this procedure, the circuit has changed, and hence the set of high-degree input bits may become smaller. The algorithm halts when Grover's algorithm is unable to find any high-degree marked inputs. At this point, all the high-degree inputs are necessarily unmarked with very high probability, which means they are set to 1. We can now delete all these input bits and their outgoing wires because AND gates are unaffected by input bits set to 1.
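The circuit-simplification logic of this subroutine can be checked classically. The following is a minimal sketch, not the quantum algorithm itself: the Grover search is replaced by an exact linear scan, so it illustrates only the bookkeeping. The representation (a dictionary from gate ids to sets of input indices), the names `halve_wires`, `high_fanin`, and `degrees`, and the integer parameter `qf` standing in for $Q(f)$ are all assumptions made for this illustration.

```python
import random

def high_fanin(gates, n, qf):
    """Ids of gates whose fan-in is at least n / qf."""
    return {g for g, ins in gates.items() if len(ins) * qf >= n}

def degrees(gates, S, n):
    """deg(i): number of gates in S that input i feeds into."""
    deg = [0] * n
    for g in S:
        for i in gates[g]:
            deg[i] += 1
    return deg

def halve_wires(gates, x, qf):
    """Classical stand-in for the subroutine; returns (new_gates, searches).

    A gate missing from new_gates has been deleted, meaning it evaluates to 0 on x.
    """
    n = len(x)
    gates = {g: set(ins) for g, ins in gates.items()}
    S0 = high_fanin(gates, n, qf)
    if not S0:
        return gates, 0
    cutoff = len(S0) / (2 * qf)          # fixed high-degree threshold
    searches = 0
    while True:
        deg = degrees(gates, high_fanin(gates, n, qf), n)
        searches += 1                    # one call to the (simulated) search
        hit = next((i for i in range(n) if x[i] == 0 and deg[i] >= cutoff), None)
        if hit is None:
            break
        for g in [g for g, ins in gates.items() if hit in ins]:
            del gates[g]                 # a 0-input forces these AND gates to 0
    # Every remaining high-degree input is unmarked, i.e. equal to 1.
    deg = degrees(gates, high_fanin(gates, n, qf), n)
    ones = {i for i in range(n) if deg[i] >= cutoff}
    for ins in gates.values():
        ins.difference_update(ones)      # 1-inputs do not affect an AND gate
    return gates, searches
```

On any input, the total fan-in of the surviving high fan-in gates should be at most half of what it was, and the number of simulated searches should be at most $2Q(f) + 1$.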
Let us now argue that this algorithm is correct. Let $S'$ denote the set of high fan-in AND gates in the new circuit $h'$ obtained at the end of the algorithm, and let $w(S')$ be the total fan-in of gates in $S'$. Note that when the algorithm terminates, there are no high-degree inputs (marked or unmarked). Hence every input bit that has not been deleted has $\deg(i) < |S_0|/(2Q(f))$. Since there are at most $n$ input bits, we have

$$w(S') = \sum_{i \in [n]} \deg(i) \le n \cdot \frac{|S_0|}{2Q(f)}.$$

But we also know that we started with $w(S) \ge n|S_0|/Q(f)$, since each gate in $S_0$ has fan-in at least $n/Q(f)$. Hence $w(S') \le w(S)/2$, which proves that the algorithm is correct.

We now analyze the query complexity of this algorithm. Let the loop in the algorithm execute $r$ times. It is easy to see that $r \le 2Q(f)$ because each time a high-degree marked input is found, we delete all the AND gates that use it as an input, which is at least $|S_0|/(2Q(f))$ gates. Since there were at most $|S_0|$ gates to begin with, this procedure can only repeat $2Q(f)$ times.
When we run Grover's algorithm to search for a high-degree marked input bit $x_i$ in the first iteration of the loop, suppose there are $k_1$ high-degree marked inputs. Then the variant of Grover's algorithm in Lemma 10 finds a marked high-degree input and makes $O(\sqrt{n/k_1} \log n)$ queries with probability $1 - 1/\mathrm{poly}(n)$. In the second iteration of the loop, the number of high-degree marked inputs, $k_2$, has decreased by at least one. It can also decrease by more than 1 since we deleted several AND gates, and some high-degree inputs can become low-degree. In this iteration, our variant of Grover's algorithm (Lemma 10) makes $O(\sqrt{n/k_2} \log n)$ queries, and we know that $k_1 > k_2$. This process repeats and we have $k_1 > k_2 > \cdots > k_r$. Since there was at least one high-degree marked input in the last iteration, $k_r \ge 1$. Combining these facts we have, for all $j \in [r]$, $k_j \ge r - j + 1$. Thus the total expected query complexity is

$$\sum_{j=1}^{r} O\left(\sqrt{n/k_j}\, \log n\right) \le \sum_{j=1}^{r} O\left(\sqrt{\frac{n}{r-j+1}}\, \log n\right) = O\left(\sqrt{nr}\, \log n\right) = O\left(\sqrt{Q(f) \cdot n}\, \log n\right),$$

where we used $\sum_{k=1}^{r} 1/\sqrt{k} \le 2\sqrt{r}$ and $r \le 2Q(f)$.

We now have a quantum query algorithm that satisfies the conditions of the lemma with probability at least $1 - 1/\mathrm{poly}(n)$. We are now ready to prove Theorem 1.
Proof of Theorem 1. We start by applying the algorithm in Lemma 11 to our circuit as many times as needed to ensure that the set $S$ is empty. Since each run of the algorithm reduces $w(S)$ by a factor of 2, and $w(S)$ can start off being as large as $m \cdot n$, where $m$ is the number of AND gates and $n$ is the number of inputs, we need to run the algorithm $\log(mn)$ times. Since the algorithm of Lemma 11 is correct with probability $1 - 1/\mathrm{poly}(n)$, we do not need to boost the success probability of the algorithm. The total number of queries needed to ensure $S$ is empty is $O(\sqrt{Q(f) \cdot n}\, \log(n) \log(mn))$.

Now we are left with a circuit $h'$ with no high fan-in AND gates. That is, all AND gates have fan-in at most $n/Q(f)$. We now evaluate $h'$ using the standard composition theorem for disjoint sets of inputs, which has query complexity

$$O\left(Q(f) \cdot \sqrt{n/Q(f)}\right) = O\left(\sqrt{Q(f) \cdot n}\right).$$

The total query complexity is $O(\sqrt{Q(f) \cdot n}\, \log(n) \log(mn)) = O(\sqrt{Q(f) \cdot n}\, \log^2(mn))$.
Note that we have not attempted to reduce the logarithmic factors in this upper bound. We believe it is possible to make the quantum upper bound match the upper bound for approximate degree with a more careful analysis and slightly different choice of parameters in the algorithm.
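To get a feel for when the new bound beats the block-composition bound $O(Q(f) \cdot \sqrt{n})$, the two expressions can be compared numerically. The sketch below is only indicative: it drops all leading constants and arbitrarily takes logarithms base 2, and the function names are hypothetical.

```python
import math

def block_bound(qf, n):
    """Q(f) * sqrt(n): each conjunction treated independently (constants dropped)."""
    return qf * math.sqrt(n)

def shared_input_bound(qf, n, m):
    """sqrt(Q(f) * n) * log^2(mn): the bound of Theorem 1 (constants dropped)."""
    return math.sqrt(qf * n) * math.log2(m * n) ** 2
```

Under these assumptions the shared-input bound is smaller exactly when $\sqrt{Q(f)}$ exceeds $\log^2(mn)$, so the improvement shows up for top functions $f$ of large query complexity, while for very small $Q(f)$ the logarithmic factors dominate.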

Approximating polynomials for composed functions

Preliminaries
We now define the various measures of Boolean functions and polynomials that we require in this section. Since we only care about polynomials approximating Boolean functions, we focus without loss of generality on multilinear polynomials, as any polynomial over the domain $\{0,1\}^n$ can be converted into a multilinear polynomial (since it never helps to raise a Boolean variable to a power greater than 1).
The approximate degree of a Boolean function, commonly denoted $\widetilde{\deg}(f)$, is the minimum degree of a polynomial that entrywise approximates the Boolean function. It is a basic complexity measure and is known to be polynomially related to a host of other complexity measures such as decision tree complexity, certificate complexity, and quantum query complexity [BdW02,BT21]. We also use another complexity measure of polynomials, which is the sum of absolute values of all the coefficients of the polynomial. This is the query analogue of the so-called $\mu$-norm used in communication complexity [LS09, Definition 2.7]. We now formally define these measures.
We use the following standard relationship between the two measures in our results. $(1 + y_i)$ to obtain a multilinear polynomial $p(y_1, \ldots, y_n) = \sum_{s \in \{0,1\}^n} \beta_s y_1^{s_1} \cdots y_n^{s_n}$. In this representation, a coefficient $\beta_s$ is simply the expectation over the hypercube of the product of $p$ and a parity function, and hence is at most $O(1)$ in magnitude. Since there are only This shows that $\log \mu(p)$ is at most $\deg(p)$ (up to log factors). However, $\log \mu(p)$ may be much smaller than $\deg(p)$, as evidenced by the polynomial $p(x) = x_1 \cdots x_n$. Similarly, $\log \mu(f)$ may be much smaller than $\widetilde{\deg}(f)$, as evidenced by the AND function on $n$ bits, which has $\widetilde{\deg}(\mathrm{AND}_n) = \Theta(\sqrt{n})$ [NS94], but $\mu(\mathrm{AND}_n) \le 1$.

Polynomial upper bound
In this section we prove Theorem 2, which follows from the following more general composition theorem.  Figure 1). Then

Proof. Let us first fix some notation. We will use $x \in \{0,1\}^n$ to refer to the input of the full circuit $h$. Let $p$ be a polynomial that $\varepsilon$-approximates $f$, written as $p(y_1, \ldots, y_m) = \sum_{s \in \{0,1\}^m} \alpha_s y_1^{s_1} \cdots y_m^{s_m}$, where $\mu_\varepsilon(f) = \sum_{s \in \{0,1\}^m} |\alpha_s|$, and each $y_i$ is the AND of some subset of bits in $x$. Since the product of ANDs of variables is just an AND of all the variables involved in the product, for each $s \in \{0,1\}^m$, there is a subset $T_s \subseteq [n]$ such that $y_1^{s_1} \cdots y_m^{s_m} = \prod_{i \in T_s} x_i$. Using this we can replace all the $y$ variables in the polynomial $p$, to obtain

$$q(x) = \sum_{s \in \{0,1\}^m} \alpha_s \prod_{i \in T_s} x_i.$$

Since $p$ was an $\varepsilon$-approximation to $f$, $q$ is an $\varepsilon$-approximation to $h$. Now we can replace every occurrence of $\prod_{i \in T_s} x_i$ with a low-error approximating polynomial for the AND of the bits in $T_s$. We know that the approximate degree of the AND function to error $\delta$ is $O(\sqrt{n \log(1/\delta)})$ [BCdWZ99]. If we approximate each AND to error $\delta = \varepsilon/\mu_\varepsilon(f)$, then by the triangle inequality the total error incurred by this approximation is at most $\sum_{s \in \{0,1\}^m} |\alpha_s| \, \varepsilon/\mu_\varepsilon(f) = \varepsilon$. With this choice of $\delta$, each AND is approximated by a polynomial of degree $O(\sqrt{n \log(1/\delta)}) = O\left(\sqrt{n \log \mu_\varepsilon(f)} + \sqrt{n \log(1/\varepsilon)}\right)$. Hence the resulting polynomial has this degree and approximates the function $h$ to error $2\varepsilon$. By standard error reduction techniques [BNRdW07], we can make this error smaller than $\varepsilon$ at a constant factor increase in the degree. This establishes the first equality in (16), and the second equality follows from Lemma 13.

Applications to linear-size AC$^0$ circuits

Preliminaries
A Boolean circuit is defined via a directed acyclic graph. Vertices of fan-in 0 represent input bits, vertices of fan-out 0 represent outputs, and all other vertices represent one of the following logical operations: a NOT operation (of fan-in 1), or an unbounded fan-in AND or OR operation. The size of the circuit is the total number of AND and OR gates. The depth of the circuit is the length of the longest path from an input bit to an output bit.
For any constant integer $d > 0$, AC$^0_d$ refers to the class of all such circuits of polynomial size and depth $d$. AC$^0$ refers to $\bigcup_{d=1}^{\infty}$ AC$^0_d$. Similarly, LC$^0_d$ refers to the class of all such circuits of size $O(n)$ and depth $d$, while LC$^0$ refers to $\bigcup_{d=1}^{\infty}$ LC$^0_d$. We will associate any circuit $C$ with the function it computes, so for example $\widetilde{\deg}(C)$ denotes the approximate degree of the function computed by $C$.
It will be convenient to assume that any AC$^0_d$ circuit is layered, in the sense that it consists of $d$ levels of gates which alternate between being comprised of all AND gates or all OR gates, and all negations appear at the input level of the circuit. Any AC$^0_d$ circuit of size $s$ can be converted into a layered circuit of size $O(d \cdot s)$, and hence making this assumption does not change any of our upper bounds.

Quantum query complexity
Applying our composition theorem for quantum algorithms (Theorem 1) inductively, we obtain a sublinear upper bound on the quantum query complexity of LC$^0_d$ circuits.

Theorem 3. For all constants $d \ge 0$ and all functions $h : \{0,1\}^n \to \{0,1\}$ computed by LC$^0_d$ circuits, we have $Q(h) = \tilde{O}(n^{1-2^{-d}})$.
Proof. We prove this for depth-$d$ LC$^0$ circuits by induction on $d$. The base case is $d = 1$, where the function is either AND or OR on $n$ variables, both of which have quantum query complexity $O(\sqrt{n})$ [Gro96]. Now consider a function $h$, which is a layered depth-$d$ AC$^0$ circuit of size $O(n)$. It can be written as a depth-2 circuit (as in Theorem 1) where the top function is an LC$^0$ circuit $f$ of depth $d - 1$ on at most $O(n)$ inputs, and the bottom layer has only AND gates. (If the bottom layer has OR gates we can consider the negation of the function without loss of generality, since the quantum query complexity of a function and its negation is the same.) By the induction hypothesis we know that the quantum query complexity of any depth-$(d-1)$, size-$O(n)$ AC$^0$ circuit with $O(n)$ inputs is $\tilde{O}(n^{1-2^{-(d-1)}})$. Invoking Theorem 1, we have that the quantum query complexity of the depth-$d$ function $h$ is $\tilde{O}\left(\sqrt{n^{1-2^{-(d-1)}} \cdot n}\right) = \tilde{O}(n^{1-2^{-d}})$, where the polylogarithmic factors from Theorem 1 are absorbed into the $\tilde{O}$ notation.
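The exponent arithmetic behind this induction is easy to verify mechanically. The sketch below tracks only the exponent of $n$, ignoring logarithmic factors: the base case has exponent $1/2$, and each application of Theorem 1 maps an exponent $e$ to $(e+1)/2$, since $\sqrt{n^{e} \cdot n} = n^{(e+1)/2}$. The function name is hypothetical.

```python
from fractions import Fraction

def query_exponent(depth):
    """Exponent of n in the query bound for depth >= 1, ignoring log factors."""
    e = Fraction(1, 2)            # depth 1: AND/OR cost about n^(1/2)
    for _ in range(depth - 1):
        e = (e + 1) / 2           # sqrt(n^e * n) = n^((e + 1) / 2)
    return e
```

Using exact fractions makes it straightforward to confirm that the recursion solves to $1 - 2^{-d}$ for each depth $d$.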

Approximate degree upper bound
We can now prove Theorem 5, restated below for convenience: Theorem 5. For all constant $d \ge 0$, and any $\varepsilon > 0$, and all functions $h$ : This follows from a more general result: In particular, for any $h \in$ LC$^0_d$, we have $\widetilde{\deg}(h) = \tilde{O}(n^{1-2^{-d}})$.

Proof. We prove this for depth-$d$ AC$^0$ circuits by induction on $d$. The base case is $d = 1$, where the function is either AND or OR on $n$ variables, both of which have $\varepsilon$-approximate degree $O(\sqrt{n \log(1/\varepsilon)})$ [BCdWZ99]. Now consider a function $h$, which is a general depth-$d$ AC$^0$ circuit of size $s$. It can be written as a depth-2 circuit (as in Theorem 2) where the top function is a size-$s$ AC$^0$ circuit $f$ of depth $d - 1$ on at most $s$ inputs, and the bottom layer has only AND gates. If the bottom layer has OR gates we can consider the negation of the function without loss of generality, since the $\varepsilon$-approximate degree of a function and its negation is the same.
In the first case, if $\varepsilon \le 2^{-s}$, then for any function $f : \{0,1\}^s \to \{0,1\}$ there is a polynomial of degree $s$ and sum of coefficients at most $2^s$ that exactly equals $f$ on all Boolean inputs. Hence we can apply Theorem 2 to get that $\widetilde{\deg}_\varepsilon(h) = O(\sqrt{ns} + \sqrt{n \log(1/\varepsilon)}) = O(\sqrt{n \log(1/\varepsilon)})$. In the second case, if $\varepsilon > 2^{-s}$, by the induction hypothesis we know that the $\varepsilon$-approximate degree of any depth-$(d-1)$, size-$O(s)$ AC$^0$ circuit with $s$ inputs is $\tilde{O}(s^{1-2^{-(d-1)}} (\log(1/\varepsilon))^{2^{-(d-1)}})$. Invoking Theorem 2, we have that the approximate degree of the depth-$d$ function is $\tilde{O}\left(\sqrt{n}\, s^{1/2-2^{-d}} (\log(1/\varepsilon))^{2^{-d}}\right)$.

Approximate degree lower bound
In this section we prove our lower bound on the approximate degree of LC$^0_d$, restated below for convenience. Before proving the theorem, we will need to introduce several lemmas. The first lemma follows from the techniques of [ABO84] (see [Kop13] for an exposition).

Lemma 16. There exists a Boolean circuit $C$ with $n$ inputs, of depth 3, and size $O(n^2)$ satisfying the following two properties:
• $C(x) = 0$ for all $x$ of Hamming weight at most $n/3$.
• $C(x) = 1$ for all $x$ of Hamming weight at least $2n/3$.
We refer to the function computed by the circuit C of Lemma 16 as GAPMAJ, short for a gapped majority function (such a function is sometimes also called an approximate majority function).
The following lemma of [BCH+17] says that if $f$ has large $\varepsilon$-approximate degree for $\varepsilon = 1/3$, then block-composing $f$ with GAPMAJ on $O(\log n)$ bits yields a function with just as high $\varepsilon'$-approximate degree, with $\varepsilon'$ very close to 1/2.

The following lemma says that if $f$ has large $\varepsilon$-approximate degree for $\varepsilon$ very close to 1/2, then block-composing any function $g$ with $f$ results in a function of substantially larger approximate degree than $g$ itself.

We are now ready to prove Theorem 6, which is restated at the beginning of this section.
Proof of Theorem 6. Let $\ell \ge 1$ be any constant integer to be specified later (ultimately, we will set $\ell = \Theta(\sqrt{d})$, where $d$ is as in the statement of the theorem).
[BKT18] exhibit a circuit family $C^* : \{0,1\}^n \to \{0,1\}$ of depth at most $3\ell$, size at most $n^2$, and approximate degree satisfying $\widetilde{\deg}(C^*) \ge D$ for some $D \ge \Omega(n^{1-2^{-\ell}})$. We need to transform this quadratic-size circuit into a circuit $C$ of linear size, without substantially reducing its approximate degree, or substantially increasing its depth (in particular, the depth of $C$ should be at most $d$).
To accomplish this, we apply the following iterative transformation. At each iteration $i$, we produce a new circuit $C^i : \{0,1\}^n \to \{0,1\}$ of linear size, such that $\widetilde{\deg}(C^i)$ gets closer and closer to $\widetilde{\deg}(C)$ as $i$ grows. Our final circuit will be $C := C^\ell$.
$C^1$ is defined to simply be $\mathrm{OR}_n$, which is clearly in LC$^0_1$. The transformation from $C^{i-1}$ into $C^i$ works as follows. $C^i$ feeds $\sqrt{n}$ copies of $C^{i-1}_{\sqrt{n}/(10 \log n)}$ into the circuit $C^*_{\sqrt{n}} \circ \mathrm{GAPMAJ}_{10 \log n}$. Here, $C^{i-1}_k$ denotes the function $C^{i-1}$ constructed in the previous iteration, and defined on $k$ inputs; similarly, $C^*_k : \{0,1\}^k \to \{0,1\}$ refers to the function $C^*$ constructed by [BKT18], defined on $k$ inputs. That is:

$$C^i := C^*_{\sqrt{n}} \circ \mathrm{GAPMAJ}_{10 \log n} \circ C^{i-1}_{\sqrt{n}/(10 \log n)}. \qquad (20)$$

Observe that $C^i$ is a function on $\sqrt{n} \cdot 10 \log n \cdot (\sqrt{n}/(10 \log n)) = n$ bits. We now establish the following two lemmas about $C^i$.
Lemma 20. $C^i$ is computed by a circuit of depth at most $(3\ell + 3) \cdot i$, and size at most $2 \cdot i \cdot n$.
Proof. Clearly this is true for $i = 1$, since $C^1$ is computed by a circuit of size and depth 1. Assume by induction that it is true for $i - 1$. Recalling that $\mathrm{GAPMAJ}_{10 \log n}$ is computed by a circuit of size $O(\log^2 n)$ and depth 3, and $C^*_{\sqrt{n}}$ is computed by a circuit of size $n$ and depth $3\ell$, it is immediate from Equation (20) that $C^i$ is computed by a circuit satisfying the following properties:
• The depth is at most $3\ell + 3 + (3\ell + 3)(i - 1) = (3\ell + 3)i$.
Setting $i = \ell$, we obtain a circuit $C : \{0,1\}^n \to \{0,1\}$ with the following properties:
• By Lemma 20, $C$ has size at most $2\ell n$ and depth at most $d := 2\ell^2$.
Hence, for any constant value of $d = 2\ell^2$, we have constructed a circuit of depth $d$, size $O(n)$, and approximate degree at least $\Omega(n^{1-2^{-\Omega(\sqrt{d})}})$, as required by the theorem.

Sublinear-size circuits of arbitrary depth
Theorem 1 and Theorem 2 also allow us to prove sublinear quantum query complexity and approximate degree upper bounds for arbitrary circuits of sublinear size.

Applications to agnostic PAC learning
Our new upper bounds on the approximate degree of LC$^0$ circuits yield new subexponential time learning algorithms in the agnostic model. In this section, we provide background for, and the proof of, our main learning result restated below.

Since the learning algorithm does not know $D$ and is required to work for all $D$, this model is also called the distribution-independent (or distribution-free) PAC model. Unfortunately, in the distribution-free setting, very few concept classes are known to be PAC learnable in polynomial time or even subexponential time (i.e., time $2^{n^{1-\delta}}$ for some constant $\delta > 0$).
Kearns, Schapire, and Sellie [KSS94] then proposed the more general (and challenging) agnostic PAC learning model, which removes the assumption that examples are determined by a function at all, let alone a function in the concept class C. The learner now knows nothing about how examples are labeled, but is only required to learn a hypothesis h that is at most ε worse than the best possible classifier from the class C.
We now describe the agnostic PAC model more formally. Let $D$ be any distribution on We say that $C$ is agnostically learnable in time $T(n, \varepsilon, \delta)$ if there exists an algorithm which takes as input $n$ and $\delta$ and has access to an example oracle $\mathrm{EX}(D)$, and satisfies the following properties. It runs in time at most $T(n, \varepsilon, \delta)$, and with probability at least $1 - \delta$, it outputs a hypothesis $h$ satisfying $\mathrm{err}_D(h) \le \mathrm{opt} + \varepsilon$. We say that the learning algorithm runs in subexponential time if there is some constant $\eta > 0$ such that for any constants $\varepsilon$ and $\delta$, the running time $T(n, \varepsilon, \delta) \le 2^{n^{1-\eta}}$ for sufficiently large $n$.
The agnostic model is able to capture a range of realistic scenarios that do not fit within the standard PAC model. In many situations it is unreasonable to know exactly that f belongs to some class C, since f may be computed by a process outside of our control. For example, the labels of f may be (adversarially) corrupted by noise, resulting in a function that is no longer in C. Alternatively, f may be "well-modeled," but not perfectly modeled, by some concept in C. In fact, the agnostic learning model even allows the input sample to not be described by a function f at all, in the sense that the distribution over the sample may have both (x, 0) and (x, 1) in its support. This is also realistic when the model being used does not capture all of the variables on which the true function depends.

Related work
Since the agnostic PAC model generalizes the standard PAC model, it is (considerably) harder to learn a concept class in this model. Consequently, even fewer concept classes are known to be agnostically learnable, even in subexponential time. For example, as mentioned in Section 1.4, the best known algorithm for agnostically learning the simple concept class of disjunctions, which are size-1, depth-1 Boolean circuits, runs in time $2^{\tilde{O}(\sqrt{n})}$ [KKMS08]. In contrast, they can be learned in polynomial time in the PAC model [Val84]. Meanwhile, several hardness results are known for agnostically learning disjunctions, including NP-hardness for proper learning [KSS94], and that even improper learning is as hard as PAC learning DNF [LBW95].
While it is an important and interesting problem to agnostically learn more expressive classes of circuits in subexponential time, relatively few results are known. The best known general result is that all de Morgan formulas (formulas over the gate set of AND, OR, and NOT gates) of size $s$ can be learned in time $2^{O(\sqrt{s})}$ [KKMS08,Rei11]. In particular, linear-size formulas (i.e., $s = \Theta(n)$) can be learned in time $2^{\tilde{O}(\sqrt{n})}$, which is the same as the best known upper bound for disjunctions.
Even in the relatively easier PAC model, only a small number of circuit classes are known to be learnable in subexponential time. For the well-studied class of polynomial-size DNFs, or depth-2 AC$^0$ circuits, we have an algorithm running in time $2^{O(n^{1/3})}$ [KS04], and we know that new techniques will be needed to improve this bound [RS10]. Little is known about larger subclasses of AC$^0$, other than a recent paper that studied depth-3 AC$^0$ circuits with top fan-in $t$, giving a PAC learning algorithm of runtime $2^{\tilde{O}(t\sqrt{n})}$ [DRG17], which is only subexponential when $t \ll \sqrt{n}$. Given the current state of affairs, a subexponential-time algorithm to learn all of AC$^0$ in the standard PAC model would represent significant progress. Indeed, for $d > 2$, the fastest known PAC learning algorithm for depth-$d$ AC$^0$ circuits runs in time $2^{n - \Omega(n/\log^{d-1} n)}$ [ST17], which is quite close to the trivial runtime of $2^n$.
We view our new results for learning LC$^0$ and sublinear-size AC$^0$ circuits as intermediate steps toward this goal. We clarify that our results are incomparable to the known results about agnostically learning de Morgan formulas. A simple counting argument [Nis11] shows that there are linear-size DNFs that are not computable by formulas of size $o(n^2/\log n)$, so one cannot learn even depth-2 LC$^0$ in subexponential time via the learning algorithm for de Morgan formulas. On the other hand, there are linear-size de Morgan formulas (of superconstant depth) that are not in LC$^0$, or even AC$^0$.
Motivated by the lack of positive results in the distribution-free PAC learning model, [ST17] study algorithms for learning various circuit classes, with the goal of "only" achieving a non-trivial savings over trivial $2^n$-time algorithms. By achieving non-trivial savings, [ST17] mean a runtime of $2^{n-o(n)}$; prior work had already connected non-trivial learning algorithms to circuit lower bounds [KKO13,OS17]. The subexponential runtimes we achieve in our work are significantly faster than the $2^{n-o(n)}$-time algorithms of [ST17]; in addition, our algorithms work in the challenging agnostic setting, rather than just the PAC setting. On the other hand, the algorithms of [ST17] apply to more general circuit classes than LC$^0$.
As mentioned previously, [KS04] gave a $2^{O(n^{1/3})}$-time algorithm for PAC learning polynomial-size DNF formulas; their algorithm is based on an $O(n^{1/3})$ upper bound on the threshold degree of such formulas. In unpublished work, [Tal18] has observed that the argument in [KS04, Theorem 4] can be generalized to show that for constant $d \ge 2$, any depth-$d$ LC$^0$ circuit has threshold degree at most $O\left(n^{1-1/(3 \cdot 2^{d-3})}\right)$. This in turn yields a PAC learning algorithm for LC$^0$ running in time $\exp\left(O\left(n^{1-1/(3 \cdot 2^{d-3})}\right)\right)$. Note that this is in the standard PAC model, not the agnostic PAC model. As mentioned in Section 1, prior to our work, no subexponential-time algorithm was known for agnostically learning even LC$^0_3$.

Linear regression and the proof of Theorem 7
Our learning algorithm applies the well-known linear regression framework for agnostic learning that was introduced by [KKMS08]. The algorithm of [KKMS08] works whenever there is a "small" set of "features" $\mathcal{F}$ (where each feature is a function mapping $\{0,1\}^n$ to $\mathbb{R}$) such that each concept in the concept class $C$ can be approximated to error $\varepsilon$ in the $\ell_\infty$ norm via a linear combination of the features in $\mathcal{F}$. Roughly speaking, given a sufficiently large sample $S$ from an (unknown) distribution over $\{0,1\}^n \times \{0,1\}$, the algorithm finds a linear combination $h$ of the features of $\mathcal{F}$ that minimizes the empirical $\ell_1$ loss, i.e., $h$ minimizes ( Then there is an algorithm that takes as input a sample $S$ of size $|S| = \mathrm{poly}(n, |\mathcal{F}|, 1/\varepsilon, \log(1/\delta))$ from an unknown distribution $D$, and in time $\mathrm{poly}(|S|)$ outputs a hypothesis $h$ such that, with probability at least $1 - \delta$ over $S$,

A feature set $\mathcal{F}$ that is commonly used in applications of Lemma 23 is the set of all monomials whose degree is at most some bound $d$. Indeed, an immediate corollary of Lemma 23 is the following.
Corollary 24. Suppose that for every $c \in C$, the $\varepsilon$-approximate degree of $c$ is at most $d$. Then for every $\delta > 0$, there is an algorithm running in time $\mathrm{poly}(n^d, 1/\varepsilon, \log(1/\delta))$ that agnostically learns $C$ to error $\varepsilon$ with respect to any (unknown) distribution $D$ over $\{0,1\}^n \times \{0,1\}$.
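The feature set behind Corollary 24 is simple to write down explicitly. The sketch below, with hypothetical function names, enumerates the values of all multilinear monomials of degree at most $d$ on a Boolean input and counts them; it covers only the feature map and leaves out the $\ell_1$ regression step, which would need a linear programming solver.

```python
from itertools import combinations
from math import comb, prod

def monomial_features(x, d):
    """Values at x of all multilinear monomials of degree at most d."""
    n = len(x)
    return [
        prod(x[i] for i in subset)        # the empty product is the constant 1
        for k in range(d + 1)
        for subset in combinations(range(n), k)
    ]

def num_features(n, d):
    """Number of multilinear monomials of degree at most d on n variables."""
    return sum(comb(n, k) for k in range(d + 1))
```

The feature count is at most $(n+1)^d$, which is where the $\mathrm{poly}(n^d, \ldots)$ running time in the corollary comes from.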
The best known algorithms for agnostically learning disjunctions and de Morgan formulas of linear size [KKMS08,Rei11] combine Corollary 24 with known approximate degree upper bounds for disjunctions and de Morgan formulas of bounded size. We use the same strategy: our results for agnostic learning (Theorem 7) follow from combining Corollary 24 with our new approximate degree upper bounds. Specifically, Theorem 5 shows that the $\varepsilon$-approximate degree of any LC$^0_d$ circuit is at most $\tilde{O}(n^{1-2^{-d}} \log^{2^{-d}}(1/\varepsilon))$, yielding our new result for agnostically learning LC$^0$ circuits. Theorem 15 shows that AC$^0$ circuits of size $s$ have $\varepsilon$-approximate degree $\tilde{O}(\sqrt{n}\, s^{1/2-2^{-d}} (\log(1/\varepsilon))^{2^{-d}})$, giving our new result for learning sublinear-size AC$^0$.

Furthermore, since our upper bound on the approximate degree of LC$^0$ circuits is nearly tight, new techniques will be needed to significantly surpass our results. In particular, new techniques will be needed to agnostically learn all of LC$^0$ in subexponential time. Theorem 6 implies that if $\mathcal{F}$ is the set of all monomials of at most a given degree $d$, then one cannot use Corollary 24 to learn LC$^0_d$ in time less than $2^{n^{1-2^{-\Omega(\sqrt{d})}}}$. However, standard techniques [She11a] automatically generalize the lower bound of Theorem 6 from the feature set of low-degree monomials to arbitrary feature sets. Specifically, we obtain the following theorem.
For completeness, we provide the proof of Theorem 25 below.
Proof. For a matrix $F \in \{0,1\}^{N \times N}$, the $\varepsilon$-approximate rank of $F$, denoted $\mathrm{rank}_\varepsilon(F)$, is the least rank of a matrix $A \in \mathbb{R}^{N \times N}$ such that where the expression $\mathrm{rank}_{1/3}(F)$ views $F$ as a $2^{4n} \times 2^{4n}$ matrix. Let $\mathcal{F}^*$ be a feature set satisfying the hypothesis of Theorem 25, i.e., for every function for all $x \in \{0,1\}^{4n}$. We claim that this implies that Theorem 25 then follows by combining Equation (23) Let $M$ denote the $2^{4n} \times |\mathcal{F}^*|$ matrix whose $(i, j)$'th entry is $\alpha_{i,j}$. And let $R$ denote the $|\mathcal{F}^*| \times 2^{4n}$ matrix whose $(j, x)$'th entry is $\phi_j(x)$, where we associate $x$ with an input in $\{0,1\}^{4n}$. Then Equation (24) implies that $|(M \cdot R)_{ij} - F_{ij}| \le 1/3$ for all $(i, j) \in [2^{4n}] \times [2^{4n}]$. Since $M \cdot R$ is a matrix of rank at most $|\mathcal{F}^*|$, Equation (23) follows.

Circuit Lower Bounds (Proof of Theorem 8)
In this section, we view Boolean functions as mapping domain $\{-1,1\}^n$ to $\{-1,1\}$. Recall that $\mathrm{IP}(x, y) = \bigoplus_{i=1}^{n} (x_i \wedge y_i)$ denotes the Boolean inner product on $2n$ bits. As a warmup, we start by establishing a worst-case version of Theorem 8.

Proposition 26. The Inner Product function cannot be computed by any depth-$(d+1)$ AC$^0 \circ \oplus$ circuit of size $s$ unless $s \ge \tilde{\Omega}(n^{1/(1-2^{-d})})$.
Proof. Theorem 5 shows that any depth-$d$ AC$^0$ circuit of size $s \ge n$ on $n$ inputs has approximate degree at most $D = \tilde{O}(s^{1-2^{-d}})$. Clearly, the approximating polynomial has at most $\binom{s}{\le D} \le s^D$ many monomials.
From this, one can conclude that any depth-$(d+1)$ AC$^0 \circ \oplus$ circuit $C$ on $n$ inputs of size $s \ge n$ can be approximated by a polynomial $p$ over $\{-1,1\}^n$ with at most $s^D$ many monomials. To see why, let us write $C(x, y) = C'(h_1(x, y), \ldots, h_N(x, y))$, where $N \le s$, $C'$ is an AC$^0$ circuit of depth $d$ and size at most $s$, and each $h_i$ is a parity function. Since $C'$ is an AC$^0$ circuit of depth $d$ and size at most $s$ on $N \le s$ inputs, it has approximate degree at most $D$. Accordingly, let $q$ be a polynomial of degree at most $D$ that point-wise approximates $C'$ to error at most 1/3. Now obtain $p$ by replacing the $i$'th input to $q$ with the corresponding parity gate, namely $h_i$, of $C$. This yields a polynomial $p$ that point-wise approximates $C$ to error at most 1/3, i.e., $|p(x, y) - C(x, y)| \le 1/3$ for all $(x, y) \in \{-1,1\}^n \times \{-1,1\}^n$. Since $q$ is defined over domain $\{-1,1\}^N$, replacing any number of inputs to $q$ with parity functions does not increase the number of monomials of $q$.
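The fact that substituting parities into a polynomial over $\{-1,1\}$ cannot create new monomials is easy to see in code. In the sketch below (an illustration with hypothetical names and data layout), a polynomial is a dictionary from monomials, stored as frozensets of variable indices, to coefficients, and each parity $h_i$ is the frozenset of variables it depends on. Because every variable squares to 1, a product of parities is the parity of the symmetric difference of their supports, so each monomial of $q$ maps to a single monomial of $p$.

```python
def substitute_parities(q, parities):
    """Replace the i-th variable of q by the parity over parities[i]."""
    p = {}
    for mono, coeff in q.items():
        support = frozenset()
        for i in mono:
            # Product of +-1 parities = parity of the symmetric difference.
            support = support ^ parities[i]
        p[support] = p.get(support, 0) + coeff   # distinct monomials may merge
    return p

def evaluate(poly, z):
    """Evaluate a polynomial at a point z with entries in {-1, +1}."""
    total = 0
    for mono, coeff in poly.items():
        term = coeff
        for v in mono:
            term *= z[v]
        total += term
    return total
```

Since the output dictionary has at most one key per monomial of `q`, the monomial count can only stay the same or shrink.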
On the other hand, it is known that any polynomial $p$ over $\{-1,1\}^n \times \{-1,1\}^n$ that point-wise approximates the Inner Product function to any error strictly less than 1 requires $2^{\Omega(n)}$ many monomials [BS92].
Combining the above two facts means that $s^D$ must be at least $2^{\Omega(n)}$, which means that $s$ must be at least $\tilde{\Omega}(n^{1/(1-2^{-d})})$.
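For completeness, the last implication can be spelled out as follows. This is a sketch of the calculation under the assumption that $s \le \mathrm{poly}(n)$ (otherwise the claimed bound holds trivially), so that factors polylogarithmic in $s$ are also polylogarithmic in $n$.

```latex
\begin{align*}
s^{D} \ge 2^{\Omega(n)}
  &\iff D \log s \ge \Omega(n) \\
  &\implies \tilde{O}\!\left(s^{1-2^{-d}}\right) \cdot \log s \ge \Omega(n)
     && \text{since } D = \tilde{O}\!\left(s^{1-2^{-d}}\right) \\
  &\implies s^{1-2^{-d}} \ge \tilde{\Omega}(n) \\
  &\implies s \ge \tilde{\Omega}\!\left(n^{1/(1-2^{-d})}\right).
\end{align*}
```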
We now prove Theorem 8, restated here for convenience.  [Tal16] shows that bipartite de Morgan formulas of size $s$ cannot compute the Inner Product function on more than a $1/2 + n^{-\log n}$ fraction of inputs unless they have size at least roughly $n^2$. The only property of de Morgan formulas of size $n^2$ that Tal uses is that they have sublinear approximate degree.
Similarly, Theorem 5 shows that an AC$^0$ circuit of size $s$ and depth $d$ on $n$ inputs, for which $n \le s \ll n^{1/(1-2^{-d})}$, has sublinear approximate degree. Any parity function is an example of a bipartite function of size $O(1)$, meaning that the parity function applied to some subset of an input $(x, y) \in \{-1,1\}^n \times \{-1,1\}^n$ is computable by a constant-sized circuit with leaves computing a function of only $x$ or $y$. Hence, Tal's argument applies with cosmetic changes not only to sub-quadratic size bipartite de Morgan formulas, but also to AC$^0 \circ \oplus$ circuits of size $s \ll n^{1/(1-2^{-d})}$. We remark that the entire argument (and hence the lower bound of Theorem 8 itself) applies not only to AC$^0 \circ \oplus$ circuits, but more generally to depth-$d$ AC$^0$ circuits augmented with a layer of low-communication gates above the inputs; we omit this extension for brevity.
Suppose that $q \geq 1/2 + \varepsilon$. Our goal is to show that $s$ must be large, even for negligible values of $\varepsilon$.
Let $N \leq s$ denote the number of parity gates in $C$, with the $i$th parity gate denoted by $h_i \colon \{-1,1\}^n \times \{-1,1\}^n \to \{-1,1\}$. Then we may write $C(x, y) = C'(h_1(x, y), \dots, h_N(x, y))$, where $C'$ is an AC$^0$ circuit on at most $s$ inputs, of depth $d$ and size at most $s$. By Theorem 5, there exists a polynomial $p$ of degree at most $D \leq \tilde{O}\bigl(s^{1-2^{-d}} \log^{2^{-d}}(1/\varepsilon)\bigr)$ such that, for all $w \in \{-1,1\}^N$, $|p(w) - C'(w)| \leq \varepsilon$.
We claim that with strictly positive probability, this circuit $C_{i+1}$ computes $\mathrm{AMAJ}_{n,p_{i+1},q_{i+1}}$. To see this, first fix an input $x$ with Hamming weight at most $p_{i+1} \cdot n$, so that the expected number of 1-inputs to any bottom $\mathrm{AMAJ}_{m,p_i,q_i}$ circuit is at most $\mu := p_{i+1} \cdot m$. Note that $p_i \cdot m > (1 + 1/(10d))\mu$. If any $\mathrm{AMAJ}_{m,p_i,q_i}$ circuit "makes an error" on $x$ (i.e., evaluates to 1 on $x$), then at least $p_i \cdot m > (1 + 1/(10d)) \cdot \mu$ of the randomly chosen inputs to the gate are 1. By a Chernoff bound, for each of the bottom $\mathrm{AMAJ}_{m,p_i,q_i}$ gates, this happens on input $x$ with probability at most $\exp(-\mu/(3(10d)^2)) = \exp(-\mu/(300d^2)) \leq \exp(-p_i m/(600d^2))$.
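For reference, the Chernoff step can be written out as follows. This is a sketch assuming the standard multiplicative form $\Pr[X \geq (1+\gamma)\mu] \leq \exp(-\gamma^2 \mu/3)$ for $0 < \gamma \leq 1$, applied with $\gamma = 1/(10d)$; the names $X$ (the number of 1-inputs to a fixed bottom gate) and $\gamma$ are introduced here for illustration. The final inequality uses $\mu = p_{i+1} m \geq p_i m / 2$, which is what the last inequality in the text implicitly requires.

```latex
\begin{align*}
\Pr\bigl[X \ge p_i m\bigr]
  &\le \Pr\Bigl[X \ge \bigl(1 + \tfrac{1}{10d}\bigr)\mu\Bigr] \\
  &\le \exp\Bigl(-\frac{(1/(10d))^2 \mu}{3}\Bigr)
   = \exp\Bigl(-\frac{\mu}{300 d^2}\Bigr) \\
  &\le \exp\Bigl(-\frac{p_i m}{600 d^2}\Bigr)
   \qquad \text{since } \mu = p_{i+1} m \ge p_i m / 2 .
\end{align*}
```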
The probability that more than $(700d^2/p_i)m \leq p_i \cdot M$ of these circuits make an error is at most $2^M \cdot (\exp(-p_i m/(600d^2)))^{(700d^2/p_i)m} \leq \exp(-m^2)$. Thus, with probability at least $1 - \exp(-m^2)$, the circuit $C_{i+1}$ outputs 0 on input $x$.
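The exponent in this bound simplifies as shown below. The identity is plain arithmetic: the factors of $d^2$ and $p_i$ cancel. How the leftover factor $\exp(-m^2/6)$ absorbs the union-bound term $2^M$ depends on the relation between $M$ and $m$ fixed earlier in the proof, which this sketch does not restate.

```latex
\[
\Bigl(\exp\Bigl(-\frac{p_i m}{600 d^2}\Bigr)\Bigr)^{(700 d^2 / p_i)\, m}
  = \exp\Bigl(-\frac{700}{600}\, m^2\Bigr)
  = \exp\Bigl(-\tfrac{7}{6}\, m^2\Bigr)
  = \exp(-m^2) \cdot \exp\bigl(-m^2/6\bigr).
\]
```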
An analogous argument holds for inputs $x$ with Hamming weight at least $q_{i+1} \cdot n$, so by a union bound over the at most $2^n$ inputs to $C_{i+1}$ with Hamming weight at most $p_{i+1} \cdot n$ or at least $q_{i+1} \cdot n$, with strictly positive probability $C_{i+1}$ computes $\mathrm{AMAJ}_{n,p_{i+1},q_{i+1}}$.
The circuit $C_{i+1}$ has $m^2$ inputs and has size at most where recall that $k_{i+1} = (1 + k_i)/2$. Equation (26) implies that the top and bottom layers of $C_{i+1}$ consist of AND gates, with $C_{i+1}$ inheriting this property directly from $C_i$ and $C_0$. Moreover, by collapsing the bottom layer of $C_0$ with the top layer of each copy of $C_i$ (which is possible because $C_0$ is monotone), we find that the depth of $C_{i+1}$ is at most $3 + (2i + 3) - 1 = 2(i + 1) + 3$. This completes the proof of the lemma.
Let $p_0, q_0$ be as in Claim 28, and let $p = p_0/e$ and $q = 1 - (1 - q_0)/e$. Theorem 27 follows by iteratively applying Lemma 29 $d$ times (starting with $i = 0$; the assumptions of the lemma are satisfied for this value of $i$ by Claim 28) to conclude that $\mathrm{AMAJ}_{n,p,q}$ is computable by a circuit of depth $2d + 3$ and size $O_d(n^{1+2^{-d}+\delta})$.
Proof of Claim 28. The main idea of the (probabilistic) construction is to have an AND-OR-AND circuit $C$, where the top AND gate has fan-in $t_1 := m$, the middle-layer OR gates all have fan-in $t_2 := m^{1+\delta}$, and the bottom-layer AND gates all have fan-in $t_3 := \log_2(m)$. Each bottom AND gate is connected to $t_3$ randomly chosen inputs.
Consider any $m$-bit input $x$ with Hamming weight at most $p \cdot m$. Then for any fixed AND gate at the bottom layer of $C$, the probability the AND gate evaluates to 1 is at most $p^{t_3} < 1/(2m^{1+\delta})$. By a union bound, this implies that for any fixed OR gate at the middle layer of $C$, the probability the OR gate outputs 1 on $x$ is at most $t_2 \cdot 1/(2m^{1+\delta}) \leq 1/2$. This implies that the probability the top AND gate outputs 1 on $x$ is at most $(1/2)^{t_1} = 2^{-m}$. Now consider any $m$-bit input $x$ with Hamming weight at least $q \cdot m$. Then for any fixed AND gate at the bottom layer of $C$, the probability the AND gate evaluates to 1 is at least $q^{t_3} > 1/m^{\delta}$. This implies that for any fixed OR gate at the middle layer of $C$, the probability the OR gate outputs 1 on $x$ is at least $1 - (1 - 1/m^{\delta})^{t_2} \geq 1 - e^{-m} \geq 1 - 1/(m 2^m)$. This implies that the probability the top AND gate outputs 1 on $x$ is at least $1 - 2^{-m}$. By a union bound over all the at most $2^m$ inputs $x$ to $C$, we conclude that with positive probability $C$ computes $\mathrm{AMAJ}_{m,p,q}$.
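The random construction in this proof can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the function names `sample_circuit` and `evaluate_circuit` are hypothetical, fan-ins are rounded up to integers, and each bottom AND gate samples its inputs uniformly with replacement, which is an assumption since the text only says the inputs are randomly chosen. For small $m$ the probability bounds from the proof are not meaningful, so the example only exercises the two extreme inputs, on which the output is determined regardless of the wiring.

```python
import math
import random


def sample_circuit(m, delta, rng):
    """Sample the wiring of a random AND-OR-AND circuit on m inputs.

    Returns a list of t1 OR gates; each OR gate is a list of t2 bottom
    AND gates; each AND gate is a list of t3 input indices chosen
    uniformly at random (with replacement, an assumption of this sketch).
    """
    t1 = m                                 # fan-in of the top AND gate
    t2 = math.ceil(m ** (1 + delta))       # fan-in of each OR gate
    t3 = max(1, math.ceil(math.log2(m)))   # fan-in of each bottom AND gate
    return [
        [[rng.randrange(m) for _ in range(t3)] for _ in range(t2)]
        for _ in range(t1)
    ]


def evaluate_circuit(circuit, x):
    """Evaluate the AND of ORs of ANDs on a 0/1 input list x."""
    return int(all(
        any(all(x[j] for j in and_gate) for and_gate in or_gate)
        for or_gate in circuit
    ))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
m = 16
circuit = sample_circuit(m, delta=0.5, rng=rng)
low = evaluate_circuit(circuit, [0] * m)   # Hamming weight 0
high = evaluate_circuit(circuit, [1] * m)  # Hamming weight m
```

The sampled circuit has $t_1 \cdot t_2 \approx m^{2+\delta}$ bottom gates, matching the fan-ins in the proof, and it is monotone because it contains no negations.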