Clifford Circuits can be Properly PAC Learned if and only if RP = NP

Given a dataset of input states, measurements, and probabilities, is it possible to efficiently predict the measurement probabilities associated with a quantum circuit? Recent work of Caro and Datta [19] studied the problem of PAC learning quantum circuits in an information theoretic sense, leaving open questions of computational efficiency. In particular, one candidate class of circuits for which an efficient learner might have been possible was that of Clifford circuits, since the corresponding set of states generated by such circuits, called stabilizer states, are known to be efficiently PAC learnable [44]. Here we provide a negative result, showing that proper learning of CNOT circuits with 1 / poly( n ) error is hard for classical learners unless RP = NP , ruling out the possibility of strong learners under standard complexity theoretic assumptions. As the classical analogue and subset of Clifford circuits, this naturally leads to a hardness result for Clifford circuits as well. Additionally, we show that if RP = NP then there would exist efficient proper learning algorithms for CNOT and Clifford circuits. By similar arguments, we also find that an efficient proper quantum learner for such circuits exists if and only if NP ⊆ RQP . We leave open the problem of hardness for improper learning or O (1) error to future work.


Introduction
The goal of efficiently learning quantum states, and the circuits that act on them, is to predict the outcomes of various measurements with some degree of accuracy. For example, given a quantum state ρ and a two-outcome measurement M, can we predict the probability that the measurement accepts?
Naively, one can try to learn everything there is to know about the system, a technique known as tomography, with versions for quantum states [29,41,42] and quantum processes [4,21]. However, this requires time exponential in the number of qubits for information-theoretic reasons related to the exponential dimension of the system. This exponential bound remains when trying to find a state close in trace distance, as shown by a combination of Flammia et al. [24] and Holevo's bound. To address this, one can restrict the type of information one wants to learn, which led to the ideas of shadow tomography [2] and classical shadows [31]. By only needing to predict the values of M observables $\{O_i\}$, one can use a number of measurements that is polynomial in the number of qubits and polylogarithmic in M. In a similar vein, Aaronson [1] proposed the idea of PAC learning quantum states: learning relative to some distribution over measurements, while only being given samples from that same distribution (see Section 2.5 for details). Aaronson was then able to give a generalization theorem for this problem, showing that if one could find a hypothesis h with small training error on O(n) samples, then h would also perform well on future samples. However, the problem of efficiently finding h was left open.
An alternative direction was to restrict the class of objects being learned, but allow one to choose what kind of measurements are taken. Montanaro [40] was able to learn stabilizer states and later Low [39] learned an unknown Clifford circuit. Lai and Cheng [38] built on these results in the case of actually recovering the circuit, as well as limited learning in the presence of a small number of non-Clifford gates. Stabilizer states and Clifford circuits are of particular interest to the quantum information community because many quantum communication protocols and well-known quantum query algorithms [10-13, 26, 46] utilize these states and circuits. Gottesman and Knill [27] (with improvements by Aaronson and Gottesman [3]) were able to give an efficient classical simulation of these objects, showing this class of objects to be seemingly much simpler than the set of all quantum states or circuits. Combined with the fact that stabilizer states are good approximations to Haar random states [37,50], we get a set of circuits and states that are highly quantum with many interesting uses, but still have enough exploitable structure to be classically simulable, making them a prime candidate for learning. Rocchetto [44] was able to combine the ideas of PAC learning with the structure provided by restricting to stabilizer states to give an efficient PAC algorithm for learning stabilizer states. Caro and Datta [19] extended the ideas of PAC learning to quantum circuits, giving a generalization theorem analogous to that of Aaronson [1]. As with Aaronson [1], the problem of efficiently finding such a good hypothesis was left open. A natural follow-up was whether or not Clifford circuits could be efficiently PAC learned in a way analogous to stabilizer states. Here, we are given inputs of the form $\left(\rho, \frac{I^{\otimes n}+P}{2}\right)$ for some stabilizer state ρ and Pauli matrix P, with labels $\mathrm{Tr}\left[\frac{I^{\otimes n}+P}{2}\, C\rho C^\dagger\right]$ corresponding to an unknown Clifford circuit C, and are asked to predict future labels.
It is worth noting that we have slightly altered the definition of PAC learning a quantum circuit from that of Caro and Datta [19] to a setting we find more comparable to Aaronson's original PAC learning result for quantum states [1]. In the setting introduced by Caro and Datta, the measurements were limited to being rank 1 projectors with product structure, rather than the rank $2^{n-1}$ projectors we use in our proof.
When one attempts to create a PAC learning algorithm, a natural first step is to try an elimination method, i.e., eliminating options that don't match the given training data and then outputting some option that does match the data well. Such algorithms are known as proper learning algorithms (see Section 2.5 for more details) and were the only kind of learning algorithms considered when the idea of PAC learning was first introduced by Valiant [47]. And while the learning theory community now considers things like improper learning algorithms, the original proper learning algorithms generally remain the most natural class of learning algorithms to consider first. We note for instance that the algorithm of Rocchetto [44] is a proper learning algorithm, as are learning algorithms for parities and other well-known learning problems [35]. To that extent, we show in this paper that an efficient proper learner for Clifford circuits that achieves 1/poly(n) error exists if and only if RP = NP, effectively ruling out "straightforward" learning algorithms for Clifford circuits. More generally, these results apply to any proper learner that achieves arbitrary error (ϵ, δ) with runtime $\mathrm{poly}(n, \epsilon^{-1}, \delta^{-1})$, which is known as a strong learner. Furthermore, this is true even just for a learner of a subset of Clifford circuits called CNOT circuits. This subset essentially restricts to the set of Clifford circuits that map computational basis states to other computational basis states, and these circuits are closely related to the complexity class ⊕L (see Section 2.4). We leave open the problem of showing O(1) hardness for proper learners using complexity theoretic means, such as in Guruswami and Raghavendra [28].
One can also imagine that the learning algorithm has access to a quantum computer. Since there exist problems like factoring [45] for which we have an efficient quantum algorithm but not an efficient classical algorithm, this learner may be able to efficiently learn more expressive concept classes. We also give results for this setting by relating NP to RQP, the quantum analogue of RP. We now informally state our main theorems regarding CNOT and Clifford circuits. The proofs of these main results start by realizing that finding a CNOT circuit with zero training error requires finding a full rank matrix in an affine subspace of matrices under matrix addition (so as to differentiate from a coset of a matrix group using matrix multiplication). This is known as the NonSingularity problem [17] and is NP-complete. While this may seem like a backwards reduction, it turns out that the matrix affine subspaces used to show that NonSingularity can solve 3SAT are a subset of the ones needed to learn CNOT circuits with zero training error. Thus, there exists a set of samples such that a CNOT circuit with zero training error exists if and only if the SAT instance is satisfiable. Finding such a CNOT circuit is what is known as the search version of the consistency problem, and in turn the decision version of the consistency problem is also NP-complete.
To show that an efficient proper learner for CNOT circuits implies RP = NP, we follow the same proof structure as similar results for NP-hardness of the consistency problem for 2-clause CNF, 3-DNF, or the intersection of two halfspaces [14,15,30]. First, let S be some sample from the decision version of the consistency problem for CNOT circuits. Using the uniform distribution over each element in S, we will sample every element of S with high probability given enough queries. Since S contains at most a polynomial number of samples, we are able to show that an efficient learner with arbitrary 1/ poly(n) error would necessarily also solve the consistency problem with high enough probability to create a solution in RP.
Completing the proof in the other direction, if RP = NP we utilize search-to-decision reductions for NP-complete problems to get an efficient algorithm for the search problem of minimizing training error. We can treat this search algorithm as our means of generating a hypothesis circuit C with low training error. By the generalization theorem provided by Caro and Datta [19], assuming we have enough samples, this C will properly generalize and have low true error, thus completing the proof. The quantum forms of the proof essentially come for free by replacing RP with RQP everywhere and using learners capable of doing quantum computation.

Related Work
We also note that we are dealing with the problem of classically PAC learning a classical function (i.e., classical labels) derived from a quantum system. This is as opposed to quantum PAC learning of a classical function as in Arunachalam and de Wolf [6], Arunachalam and de Wolf [7], Arunachalam et al. [8], Quek et al. [43], where instead of a distribution over samples we receive access to copies of a quantum state. This state results in the same distribution classically when measured in the computational basis but can be measured in other bases to get different results. There is also the attempt to directly learn a quantum process with quantum labels, as in Caro [18], Chung and Lin [22]. There, the output state is not measured, and the samples have the form (ρ, M(ρ)) for a quantum process M. Other related quantum learning works, some of which are outside the PAC model, include Cheng et al. [20], Low [39], Yoganathan [49].

Quantum States and Circuits
A quantum state ρ on n qubits is a $2^n \times 2^n$ PSD matrix with trace 1. If the matrix is rank 1 then we refer to ρ as a pure state, since it can be decomposed as ρ = |ψ⟩⟨ψ| where |ψ⟩ is a $2^n$-dimensional column vector with norm 1 and ⟨ψ| is its conjugate transpose. A two-outcome measurement E is then a projector (i.e., $E^2 = E$) such that the probability of a '1' outcome is Tr[Eρ] and the probability of a '0' outcome is 1 − Tr[Eρ], leaving the expectation value as simply Tr[Eρ].
A quantum process is how one evolves a quantum state, and therefore it must preserve the trace 1 and the PSD condition. We will be primarily interested in quantum circuits, which are the subset of quantum processes that map pure states only to other pure states. These are constrained to be unitary operations, such that after acting on ρ with the circuit C, the state that we are left with is $C\rho C^\dagger$, where $C^\dagger$ is the conjugate transpose of C.
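As a small numerical illustration of the quantities just defined, the following sketch evaluates Tr[E CρC†] for explicit matrices. The function name `acceptance_probability` and the constants `ZERO`, `PLUS`, and `HADAMARD` are illustrative choices, not notation from the text, and the dense-matrix approach is only practical for a handful of qubits.

```python
import numpy as np

def acceptance_probability(E, C, rho):
    """Return Tr[E C rho C^dagger] for a projector E, unitary C, and state rho."""
    E, C, rho = (np.asarray(a, dtype=complex) for a in (E, C, rho))
    evolved = C @ rho @ C.conj().T  # state after the circuit acts on rho
    return float(np.real(np.trace(E @ evolved)))

# Illustrative single-qubit objects (names are assumptions of this sketch).
ZERO = np.array([[1, 0], [0, 0]])                 # |0><0|
PLUS = np.array([[1, 1], [1, 1]]) / 2             # |+><+|
HADAMARD = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
```

For instance, `acceptance_probability(PLUS, HADAMARD, ZERO)` evaluates the probability that the state |0⟩, after a Hadamard gate, is accepted by the projector onto |+⟩.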

Paulis and Stabilizer States/Groups
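We will start by giving the following matrices, known as the Pauli matrices:
\[ I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \]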
Noting that these are all unitaries that act on a single qubit, we can generalize to n qubits.
Definition 2.1. Let $P_n = \{\pm 1, \pm i\} \times \{I, X, Y, Z\}^{\otimes n}$ be the matrix group consisting of all n-qubit Paulis with phase ±1 or ±i.
We'll also introduce some shorthand notation:
Definition 2.2. Let $X_i$ and $Z_i$ be the Pauli acting only on the i-th qubit with X or Z respectively and the identity matrix on all other qubits.
Note that $Z^v \cdot Z^w = Z^{v+w}$, assuming the dimensions of v and w match. It is easy to see that $v \neq w$ also implies that $Z^v \neq Z^w$.
A stabilizer state ρ is any state that can be written as $\frac{1}{2^n} \sum_{g \in G} g$, where G is an abelian subgroup $G \subset P_n \setminus \{-I^{\otimes n}\}$ without the negative identity. G is known as the stabilizer group of ρ. As it turns out, if G is of order $2^n$ then ρ will be a pure state. This leads to the alternative (and more popular) definition where ρ = |ψ⟩⟨ψ| is the unique state that is stabilized by G. That is, for all g ∈ G, g|ψ⟩ = |ψ⟩. This definition shows why $-I^{\otimes n}$ isn't allowed to be in G, since $-I^{\otimes n}$ stabilizes nothing. It also shows why one must restrict the entries of G to only have real phase.
Proof. Given a Pauli with an imaginary phase, its square would be equal to $-I^{\otimes n}$, so a group containing it would also contain $-I^{\otimes n}$. This is a contradiction.
One of the reasons stabilizer states are so important is this bijection between the stabilizer group of a stabilizer state and the state itself; by simply knowing the generators of the group, one can easily reconstruct the state. And since there are at most n generators, if one can efficiently write down the generators themselves then there is a polynomial size representation of a stabilizer state. We now show how one can write down any member of a stabilizer group as follows. Given $P \in P_n$ with real phase such that $P = \pm \bigotimes_i P_i$, define a function $N : P \to \{0, 1\}^2$ for each qubit by N(I) = 00, N(X) = 10, N(Y) = 11, N(Z) = 01, and concatenate to make $N(P) = (N(P_1), N(P_2), \ldots, N(P_n))$. Additionally, keep an extra bit recording whether the sign is −1 or 1. This results in a (2n + 1)-bit string for each generator, so a stabilizer state can be written down classically using only $O(n^2)$ bits.
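The encoding of a signed Pauli string into 2n + 1 bits can be sketched as follows. This is a minimal sketch, not the paper's code: it assumes the common convention in which each two-bit code reads as (X-bit, Z-bit), so that Y sets both bits, and the names `PAULI_BITS` and `encode_pauli` are hypothetical.

```python
from typing import List

# Assumed per-qubit code, read as (X-bit, Z-bit); Y has both bits set.
PAULI_BITS = {"I": (0, 0), "X": (1, 0), "Y": (1, 1), "Z": (0, 1)}

def encode_pauli(sign: int, paulis: str) -> List[int]:
    """Encode a real-phase Pauli string such as -XZI as 2n + 1 bits."""
    if sign not in (1, -1):
        raise ValueError("only real phases +1 and -1 are supported")
    bits: List[int] = []
    for p in paulis:
        bits.extend(PAULI_BITS[p])       # two bits per qubit
    bits.append(0 if sign == 1 else 1)   # trailing sign bit
    return bits
```

Encoding each of the at most n generators of a stabilizer group this way gives n strings of length 2n + 1, matching the quadratic bit count stated above.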

Clifford Circuits
Informally, a Clifford circuit maps stabilizer states to other stabilizer states.

Definition 2.5.
A Clifford circuit is a unitary U such that $U P_n U^\dagger = P_n$, while ignoring global phase on the unitary. More formally, consider the normalizer $N(P_n) = \{U \in U(2^n) \mid U P_n U^\dagger = P_n\}$, and let $C_n = N(P_n)/U(1)$ be the Clifford group.
Like stabilizer states, generators are an important part of how we deal with Clifford circuits. How a given Clifford circuit U acts on the generators of the Pauli matrices completely characterizes the unitary U [39]. To borrow the notation of Koenig and Smolin [36], this relationship can be efficiently described via Eq. (1), which expresses the images of the generators $X_j$ and $Z_j$ under conjugation by U in terms of values $p_i$, $q_i$, $\alpha_{ij}$, $\beta_{ij}$, $\gamma_{ij}$, and $\theta_{ij}$ that are all in {0, 1}. It will sometimes be useful to view $\alpha_{ij}$, $\beta_{ij}$, $\gamma_{ij}$, and $\theta_{ij}$ as the n × n boolean matrices A, B, Γ, and Θ respectively. This gives us a simple upper bound on the number of Clifford circuits.
Proof. The total number of bits we use to represent p, q, A, B, Γ, and Θ is $4n^2 + 2n = O(n^2)$. There can then be at most $2^{O(n^2)}$ Clifford circuits.
However, because commutation relations are preserved, not all possible values of α, β, γ, θ are allowed (the p and q values can be arbitrary). This leads us to the idea of symplectic matrices. We note that a Clifford circuit can be encoded as a (2n + 1) × 2n boolean matrix S where column 2j − 1 is equal to $(\alpha_{1j}, \beta_{1j}, \cdots, \alpha_{nj}, \beta_{nj}, p_j)$ and column 2j is equal to $(\gamma_{1j}, \theta_{1j}, \cdots, \gamma_{nj}, \theta_{nj}, q_j)$. We will call this the full encoding of the Clifford circuit.
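Definition 2.7. A symplectic matrix over $\mathbb{F}_2^{2n}$ is a 2n × 2n matrix S with entries in $\mathbb{F}_2$ such that
\[ S^T \Lambda(n) S = \Lambda(n), \tag{2} \]
where Λ(n) is the 2n × 2n matrix defining the symplectic inner product. These matrices form the symplectic group $\mathrm{Sp}(2n, \mathbb{F}_2)$.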
The symplectic matrices preserve the symplectic inner product $\omega(v, w) = v^T \Lambda(n) w$ on $\mathbb{F}_2^{2n}$. It turns out that if we consider the submatrix defined by the first 2n rows of our full encoding S, a necessary and sufficient condition to preserve the commutation relations of the generators is for this submatrix to be symplectic, as $\{X_i\} \cup \{Z_i\}$ form what is known as a symplectic basis. Formally, $C_n / P_n \cong \mathrm{Sp}(2n, \mathbb{F}_2)$.
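The symplectic condition is easy to test numerically. The sketch below assumes that Λ(n) is the block-diagonal matrix with n copies of the 2 × 2 swap block, which matches the interleaved (X-bit, Z-bit) ordering of the full encoding; that explicit form of Λ(n), and the function names, are assumptions of this example rather than something stated in the text.

```python
import numpy as np

def symplectic_form(n):
    """Assumed Lambda(n): n diagonal copies of [[0, 1], [1, 0]]."""
    block = np.array([[0, 1], [1, 0]], dtype=int)
    return np.kron(np.eye(n, dtype=int), block)

def is_symplectic(S):
    """Check S^T Lambda(n) S = Lambda(n) with arithmetic over F_2."""
    S = np.asarray(S, dtype=int) % 2
    rows, cols = S.shape
    if rows != cols or rows % 2 != 0:
        return False
    lam = symplectic_form(rows // 2)
    return bool(np.array_equal((S.T @ lam @ S) % 2, lam))

# Columns are the images of X1, Z1, X2, Z2 under a CNOT from qubit 1 to 2,
# each written in the interleaved order (x1, z1, x2, z2).
CNOT_12 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 0, 0, 1]])
```

Applied to the first 2n rows of a full encoding, `is_symplectic` checks exactly the commutation-preserving condition described above, under the stated assumption on Λ(n).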

CNOT circuits and ⊕L
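It is a well known fact that every Clifford circuit can be generated using only H, P, and CNOT gates as defined below:
\[ H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad P = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}, \quad \mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \]
We note that $X = HP^2H$. If we restrict to the subset of circuits that are generated by only X and CNOT, we get what are known as CNOT circuits [3], which are a clear subset of Clifford circuits.

Definition 2.8 (Aaronson and Gottesman [3]). The complexity class ⊕L is the class of problems that reduce to simulating a polynomial-size CNOT circuit.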
A perhaps more familiar definition for complexity theorists is the class of problems that are solvable by a nondeterministic logarithmic-space Turing machine that accepts if and only if the total number of accepting paths is odd. Let us now consider the set of all Clifford circuits that map computational basis states to other computational basis states, thereby stabilizing the subgroup $\{\pm 1\} \times \{I, Z\}^{\otimes n}$. Very briefly, we will call these classical Clifford circuits, as we will now prove that they are largely equivalent to CNOT circuits. The following lemmas will be useful.
Lemma 2.9. Let Θ be the matrix form of the $\theta_{ij}$ from Eq. (1). Any CNOT circuit C must have Θ be full rank.
Proof. Let us first consider what happens to a computational basis state when acted upon by C and let S be the full encoding of C. Referencing Eq. (1), the $\gamma_{ij}$ must be 0 for all i and j. Since every member of $\mathrm{Sp}(2n, \mathbb{F}_2)$ is full rank, the even columns of S must be as well. Since the γ terms are all zero, the even columns of S are full rank if and only if Θ is full rank.
Lemma 2.10. Let Θ be the matrix form of the $\theta_{ij}$ from Eq. (1) for some Clifford circuit C. If Θ is full rank then there exists a CNOT circuit with the same Θ.
Proof. One can verify that the Θ matrix of the circuit that does nothing, which is a valid CNOT circuit as well, is the identity matrix. We note that a CNOT from qubit i to qubit j performs the rowsum operation of adding row j to row i of Θ. Thus it is possible to efficiently construct a circuit with matching Θ using rowsum operations via CNOT gates.
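Both lemmas hinge on whether Θ is full rank over $\mathbb{F}_2$, which Gaussian elimination with XOR row additions decides. The following is a minimal sketch of that rank computation; the function names are hypothetical, and it does not output the CNOT sequence itself.

```python
def gf2_rank(rows):
    """Rank over F_2 of a 0/1 matrix given as a list of equal-length rows."""
    m = [list(r) for r in rows]
    rank = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        # Find a row at or below `rank` with a 1 in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                # Row addition over F_2 (the "rowsum" operation).
                m[r] = [a ^ b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def is_full_rank(theta):
    """True when a square 0/1 matrix is invertible over F_2."""
    return gf2_rank(theta) == len(theta)
```

Each row addition used during elimination is the same kind of rowsum operation that a single CNOT gate performs on Θ, which is why full rank suffices for a CNOT circuit to exist.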
We can now prove our desired goal leveraging these two lemmas.
Proposition 2.11. Let C be an arbitrary classical Clifford circuit. It can be efficiently generated using solely X, Z, and CNOT gates. Moreover, its effect on the computational basis states can be entirely simulated using only X and CNOT.
Proof. Let us first consider what happens to a computational basis state when acted upon by C. Referencing Eq. (1), the $\gamma_{ij}$ must be 0 for all i and j, and we will essentially ignore $\alpha_{ij}$, $\beta_{ij}$, and $p_j$ for now, leaving us with $\theta_{ij}$ and $q_j$. By Lemma 2.9, Θ must be full rank. By Lemma 2.10, there exists a CNOT circuit that achieves the same Θ as well. To get a matching $q_j$, one can simply apply an X gate at the beginning of each qubit that has $q_j = 1$, since XZX = −Z, and the CNOT gates that follow will not themselves introduce any negative phases. From here, we have already proved the moreover statement.
To prove the full result, we return to the $\alpha_{ij}$ and $\beta_{ij}$. We will show that there exists a single unique solution. Similar to Θ we will define the corresponding n × n matrices A and B for the $\alpha_{ij}$ and $\beta_{ij}$ respectively. Based on Eq. (2), to form a symplectic basis we find that $A^T \Theta = I$ and $A^T B = 0$, since $\gamma_{ij} = 0$. Clearly $A^T = \Theta^{-1}$, which is guaranteed to exist, and B = 0 since A will also be full rank. To match the $p_j$ values we simply place Z gates in front of the qubits where $p_j = 1$, similar to the X gates for $q_j$.
This means we do not lose any kinds of interactions by only considering CNOT circuits, since the only differentiating factors (i.e., the Z gates) do not actually affect the outcome when fed with a computational basis state. As such, all given results will be given in terms of simply CNOT circuits.
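On computational basis states, a circuit built from X and CNOT gates is just an affine map over $\mathbb{F}_2$ on the bit string. A minimal sketch of such a simulation is below; the gate tuples `("X", q)` and `("CNOT", control, target)` and the function name are an illustrative format, not notation from the text.

```python
def apply_cnot_circuit(gates, bits):
    """Apply X and CNOT gates, in order, to a computational basis state."""
    state = list(bits)
    for gate in gates:
        if gate[0] == "X":
            state[gate[1]] ^= 1                 # bit flip
        elif gate[0] == "CNOT":
            _, control, target = gate
            state[target] ^= state[control]     # target += control (mod 2)
        else:
            raise ValueError(f"unsupported gate: {gate[0]}")
    return state
```

Because every gate is an XOR update, the output on the bitwise sum a ⊕ b equals f(a) ⊕ f(b) ⊕ f(0), which is the affine structure the later reductions exploit.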

PAC Learning
The goal of PAC learning is to learn a function relative to a certain distribution of inputs, rather than in an absolute sense. Let's say we want to learn an arbitrary f from some concept class C. If a hypothesis function h matches the true function f on many of the high probability inputs, then we can say that we have approximately learned f . If we can do this with high probability for arbitrary f , then we probably approximately (PAC) learned C.

Definition 2.12.
Let Ω be some domain of inputs and let C be a set of functions f : Ω → [0, 1]. We say that C is (ϵ, δ)-PAC-learnable if there exists a learner that, when given samples of the form (x, f(x)) for x ∼ D for arbitrary f ∈ C and unknown distribution D, outputs with probability at least 1 − δ, over both the samples and the learning algorithm, a hypothesis h whose error with respect to f and D is at most ϵ. The number of samples used is referred to as the sample complexity, and we refer to the learner as efficient if it can find such an h in time $\mathrm{poly}(n, \epsilon^{-1}, \delta^{-1})$ for arbitrary ϵ and δ.
From here, one can define two types of learning, based on where h comes from. If h is allowed to be any function that meets the PAC constraints, we refer to this as improper learning. If instead h ∈ C, we get what is known as proper learning, which will be the focus of this paper. With proper learning, we can then begin to talk about the consistency problem formally.
Definition 2.13. Let S be a set of labeled samples such that |S| ≤ s. Let ConsistentSearch(C, s) be the problem of finding a function h ∈ C that is consistent with all of S (i.e., for all (x, f(x)) ∈ S, f(x) = h(x)) if such an h exists, and rejecting otherwise.

Generalization
Intuitively, given a set of samples, the best one can really hope to do is find an h that gets zero training error and hope that the true error for h is also low. This leads to the idea of generalization, which aims to show that doing well on a large enough set of training data (i.e., the consistency problem) allows one to give the PAC guarantee as well with high probability. In terms of computational efficiency, this effectively reduces the problem of proper learning to the consistency problem, or an approximation of the consistency problem. The most common approach to guarantee generalization is to bound the "expressiveness" of the concept class, such as with the VC-dimension [16]. Since VC-dimension is defined for {0, 1} labels, we will now give a generalization of VC-dimension in the regression setting.

Definition 2.14.
Let Ω be some domain of inputs, η > 0, and let C be a set of functions f : Ω → [0, 1]. We say that a set of inputs $\{x_1, x_2, \cdots, x_m\} \subseteq \Omega$ is η-fat-shattered by C if there exists a set $y_1, y_2, \cdots, y_m \in [0, 1]$ such that for any vector $b \in \{\pm 1\}^m$ there is an $f_b \in C$ with $f_b(x_i) \geq y_i + \eta$ whenever $b_i = 1$ and $f_b(x_i) \leq y_i - \eta$ whenever $b_i = -1$.
Definition 2.15. The η-fat-shattering dimension of a concept class C is the size of the largest set of inputs that is η-fat-shattered by C. We denote this as $\mathrm{fat}_C(\eta)$.
At a high level, the η parameter provides a buffer such that the value of the shattering function at each $x_i$ is robustly bounded away from $y_i$ by η in the appropriate direction. We now give a result saying that bounded fat-shattering dimension implies generalization from the training data to the actual learning task.
The following folklore bound on fat-shattering dimension is very loose, but still sufficient for our purposes of complexity-theoretic hardness in Section 6.
Proof. Assume for the sake of contradiction that C η-fat-shatters the set of points $\{x_1, x_2, \cdots, x_m\}$ for $m > \log_2 |C|$. Then C must be able to properly match each $b \in \{\pm 1\}^m$. There are $2^m > |C|$ possible b vectors, and only one f ∈ C can be used per shattering attempt, since no f can ever satisfy two different b vectors. This is a contradiction since we don't have enough f ∈ C to go around to satisfy every b vector.

Decision Problems
One can also define the decision version of the consistency problem, which is deciding if there even exists an h ∈ C that is consistent with all of S. We show that the existence of efficient learning algorithms can imply efficient one-sided error algorithms for the decision version of the consistency problem.
Definition 2.18. Let ConsistentDecide(C, s) be the decision version of the consistency problem for C using at most s samples.
| is the minimum non-zero error any hypothesis function can make on a single input.
Proof. For every set of samples S such that |S| ≤ s, we can define $D_S$ to be the uniform distribution over all x ∈ χ such that (x, f(x)) ∈ S. By the coupon collector argument, if we draw O(s log s) many samples then with probability at least $1 - \frac{1}{s}$ we will have drawn every item from S. Now imagine that there exists some hypothesis h ∈ C that is not consistent with S. Then its error must be at least $\frac{\alpha^2}{s^2}$ by the definition of α. Now assume we have some efficient randomized (ϵ, δ) proper learning algorithm for $\epsilon < \frac{\alpha^2}{s^2}$ and $\delta < \frac{1}{2} + \frac{1}{2s}$. When running the learner on an arbitrary $D_S$, it will see all of the samples in S with probability at least $1 - \frac{1}{s}$. To get error less than $\frac{\alpha^2}{s^2}$ the learner must then be able to solve the search version of the consistency problem with some probability p. Solving the resulting inequality for p, we find $p \geq \frac{1}{2}$ on accepting instances.
This gives rise to the following algorithm in RP for solving ConsistentDecide(C, s). Given samples S with |S| ≤ s, we can run our learning algorithm and pretend that S is what we sampled from $D_S$ to get hypothesis h. If h is consistent with S then accept, otherwise reject. On an accepting instance h will be consistent with probability at least $\frac{1}{2}$, while on rejecting instances it will never be consistent, so the algorithm will always reject.
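The two ingredients of this argument can be sketched in code: a draw count that makes the coupon-collector failure probability at most 1/s, and the one-sided decision procedure that accepts only when the learner's hypothesis fits every sample. This is a hedged sketch: the constant 2 in `draws_to_see_all` comes from a union bound chosen for this example, and `parity_learner` is a hypothetical brute-force stand-in for a proper learner over a toy class of parities on two bits.

```python
import math
from itertools import product

def draws_to_see_all(s):
    """A draw count m for which s * (1 - 1/s)**m <= 1/s (union bound)."""
    return 1 if s <= 1 else math.ceil(2 * s * math.log(s))

def miss_probability_bound(s, m):
    """Union bound on missing at least one of s items in m uniform draws."""
    return min(1.0, s * (1.0 - 1.0 / s) ** m)

def consistent_decide(samples, learner):
    """Accept only if the learner's hypothesis agrees with every sample."""
    h = learner(samples)
    return all(h(x) == y for x, y in samples)

def parity_learner(samples):
    """Hypothetical proper learner: brute force over parities on 2 bits."""
    for mask in product((0, 1), repeat=2):
        h = lambda x, m=mask: (m[0] * x[0] + m[1] * x[1]) % 2
        if all(h(x) == y for x, y in samples):
            return h
    return lambda x: 0  # arbitrary fallback when nothing is consistent
```

Since acceptance requires an explicit consistency check, the procedure can never accept when no consistent hypothesis exists, which is the one-sided error that places it in RP.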
Informally, if there exists enough structure on the concept class, it can be possible to go the other way and show that an efficient algorithm for ConsistentDecide(C, s) implies an efficient proper learner for C. Namely, if a search-to-decision reduction exists for the consistency problem on C and $\mathrm{fat}_C$ is finite then we can also expect to show that an efficient algorithm for the decision problem would imply an efficient proper learner for C. Of particular interest are NP-complete problems, which always admit search-to-decision reductions [32]. We can now give a formal proof of this commonly used technique to show proper PAC learning if RP = NP.

Lemma 2.20. Let C be a concept class and let
Proof. Because search-to-decision reductions exist for all NP-complete problems [32], a zero-error oracle for ConsistentDecide(C, s) can be used to efficiently solve ConsistentSearch(C, s). Let us run our algorithm for ConsistentSearch(C, s) on a sample S such that s ≥ |S| ≥ m. We now have an h ∈ C with zero training error on S, and so by Theorem 2.16 its true error exceeds ϵ with probability at most δ over the samples. Finally, let γ be the number of calls to ConsistentDecide(C, s) used in the search-to-decision reduction. In order for the reduction to be efficient, γ = O(poly(n)). Since ConsistentDecide(C, s) is in NP and therefore RP, we have an efficient one-sided constant error algorithm A for ConsistentDecide(C, s). Using O(c + log γ) = O(poly(n)) many calls to A and taking the majority, we can get error at most $\frac{1}{\gamma \cdot 2^c}$. Call this new algorithm A′ and use it in place of the zero-error oracle for ConsistentDecide(C, s). By the union bound over all γ calls to A′, the probability that any query to A′ differs from the zero-error oracle is at most $\frac{1}{2^c}$. By the union bound over both the samples and the error in A′, the total error probability is at most $\delta + \frac{1}{2^c}$.
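The repetition count in this amplification step can be made concrete. The sketch below assumes the base algorithm A has one-sided error at most 1/2, and it accepts when any run accepts; that OR rule is a simplification swapped in for this example in place of the majority vote mentioned in the proof. The function names are hypothetical.

```python
def repetitions_needed(c, gamma):
    """Smallest k = c + ceil(log2(gamma)) with 2**-k <= 1 / (gamma * 2**c)."""
    if c < 0 or gamma < 1:
        raise ValueError("need c >= 0 and gamma >= 1")
    return c + (gamma - 1).bit_length()  # bit_length gives ceil(log2(gamma))

def amplified(run_once, k):
    """Accept if any of k independent runs of a one-sided test accepts."""
    return any(run_once() for _ in range(k))
```

With k repetitions each query errs with probability at most $2^{-k}$, so a union bound over the γ queries of the reduction leaves total oracle error at most $2^{-c}$.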

PAC Learning Applied to Clifford Circuits
Because of the works of Rocchetto [44] and Lai and Cheng [38], Clifford circuits are a prime candidate for an efficiently PAC-learnable class of circuits. We give a very loose bound on the fat-shattering dimension of Clifford circuits that is sufficient for our purposes. Because CNOT circuits are a subset of Clifford circuits, we can also upper-bound the fat-shattering dimension of CNOT circuits by $O(n^2)$.

Consistency Problem of Clifford Circuits
We now turn to the consistency problem. Noting that each Pauli matrix is Hermitian, a very natural way to measure a stabilizer state is in a product basis where we measure each qubit with respect to a Pauli.

Definition 3.2.
If $P \in P_n$ is a Pauli operator, then the two-outcome measurement associated with P is $\frac{I^{\otimes n} + P}{2}$, and is referred to as a Pauli measurement. A critical part of Rocchetto [44] was noting that the measurement results with Pauli measurements could only have three distinct values:
Lemma 3.4 (Rocchetto [44] Lemma 1). Let $E_P = \frac{I^{\otimes n} + P}{2}$ be a Pauli measurement associated to a Pauli operator $P \in P_n$ and ρ be an n-qubit stabiliser state. Then $\mathrm{Tr}[E_P C\rho C^\dagger]$ can only take on the values $\{0, \frac{1}{2}, 1\}$, and:
$\mathrm{Tr}[E_P C\rho C^\dagger] = 1$ iff P is a stabilizer of $C\rho C^\dagger$;
$\mathrm{Tr}[E_P C\rho C^\dagger] = 1/2$ iff neither P nor −P is a stabilizer of $C\rho C^\dagger$;
$\mathrm{Tr}[E_P C\rho C^\dagger] = 0$ iff −P is a stabilizer of $C\rho C^\dagger$.
What information does a single sample tell us? Let $G_i$ be the stabilizer group of $\rho_i$. From this, we can gather that if $\mathrm{Tr}[E_P C\rho_i C^\dagger] = 1$ then $C^\dagger P C \in G_i$, and if $\mathrm{Tr}[E_P C\rho_i C^\dagger] = 0$ then $C^\dagger P C \in -G_i$. If the measurement $E_P$ appears multiple times across multiple samples, we can gather further information. For instance, consider the set of all samples such that $E_P$ is the measurement taken, indexed by i, and let $G_i$ be the stabilizer group of each $\rho_i$. Based on each label $\mathrm{Tr}[E_P C\rho_i C^\dagger]$, we know that $C^\dagger P C$ must lie in $H_i$, which is one of $G_i$, $-G_i$, or the complement of $G_i \cup -G_i$. We then deduce that $C^\dagger P C$ must lie in $\bigcap_i H_i$. To actually be a Clifford circuit, we must also add the constraint that $C^\dagger P C \neq I^{\otimes n}$, giving us the requirement that $C^\dagger P C$ lie in $\bigcap_i H_i$ with the identity removed.
The problem of finding a Clifford circuit with zero training error then reduces to the search problem of finding a set of α, β, γ, θ, p, q from Eq. (1) representing a $C^\dagger$ that is consistent with all of these constraints while remaining symplectic according to Eq. (2). Let C be the set of Clifford circuits. We will call this problem CliffordSearch(s) = ConsistentSearch(C, s).
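The three possible label values can be checked by brute force on small examples. The sketch below builds a stabilizer state as the product of the projectors (I + g)/2 over its generators and evaluates Tr[E_P ρ] with dense matrices; the circuit C is taken to be already applied to ρ, the helper names are hypothetical, and the approach only scales to a few qubits.

```python
import numpy as np
from functools import reduce

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli(label, sign=1):
    """Dense matrix of a signed Pauli string, e.g. pauli("XZ", -1)."""
    return sign * reduce(np.kron, (PAULIS[c] for c in label))

def stabilizer_state(generators):
    """rho = product over generators g of (I + g) / 2."""
    dim = generators[0].shape[0]
    rho = np.eye(dim, dtype=complex)
    for g in generators:
        rho = rho @ (np.eye(dim) + g) / 2
    return rho

def pauli_measurement_probability(P, rho):
    """Tr[E_P rho] with E_P = (I + P) / 2."""
    E = (np.eye(P.shape[0]) + P) / 2
    return float(np.real(np.trace(E @ rho)))
```

For the Bell state with generators XX and ZZ, the group also contains −YY, so measuring −YY, YY, and XI exercises the three cases of Lemma 3.4.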
Due to Gottesman-Knill [3,27] showing that Clifford circuits are classically simulable, the act of verifying that we have a circuit that has zero training error is efficient, meaning that the decision version CliffordDecide(s) = ConsistentDecide(C, s) of the problem is in NP.
Proof. Given a set of α, β, γ, θ, p, q, it is easy to check that the α, β, γ, θ form a symplectic matrix by checking with respect to Λ(n). Checking that they are consistent with the samples in S can be done by iterating through S, since the trace can be computed efficiently using Gottesman-Knill [3].
Knowing this, we find that CliffordSearch(poly(n)) ∈ FNP. This property extends to the analogous problems for CNOT circuits, CNOTSearch(poly(n)) and CNOTDecide(poly(n)), since one can also efficiently verify that $\gamma_{ij} = 0$ and $p_j = 0$ for all i and j.

Generating Samples with Certain Constraints
We will now show how we can use samples from PAC learning to generate certain kinds of constraints. It will suffice to only consider CNOT circuits with computational basis state inputs and measurements of the form $\{I, Z\}^{\otimes n}$. The net effect of this is that from a PAC learning standpoint, for an unknown CNOT circuit C we only need to figure out a set of $C^\dagger Z_i C$ that is consistent with the samples as described in Section 3. Since we will never be tested on a measurement with some component of $X_i$ involved, this is equivalent to finding the $\theta_{ij}$ and $q_j$ values from Eq. (1) of $C^\dagger$. We will again choose to view the $\theta_{ij}$ as the matrix Θ, such that Θ must be full rank. be the stabilizer state that is formed from that stabilizer group.
The following observation will make the theorem statements and proofs that follow notationally easier to read.

Observation 4.2.
Any one-dimensional affine subspace v + ⟨w⟩ can be represented as {v, v + w} and any set of two vectors/matrices {v, w} represents the one-dimensional affine subspace v + ⟨v + w⟩. Thus we can freely move between the two representations.
Proof. Let $(v, w, v_3, \cdots, v_n)$ be an arbitrary basis for $\{0, 1\}^n$ containing v and w. This can be found with O(n) random vectors and the use of Gaussian elimination. Recalling Definition 2.3, let us start by creating a sample that limits $C^\dagger P C$ to be in $\{I, Z\}^{\otimes n}$ with positive phase. We can then create a set of samples such that, by construction, $C^\dagger P C$ cannot have any component of $Z^{v_3}$ because of the first sample of this set, nor any $Z^{v_i}$ for i > 3 due to the remaining samples. This leaves $C^\dagger P C$ to be one of $Z^v$, $Z^w$, or $Z^{v+w}$ (since it cannot be identity). To remedy this, we can introduce a final sample which then eliminates $Z^w$ (and identity, due to the negative sign). The total number of samples is n, and the whole process of finding the basis and creating said samples takes time polynomial in n.
We can easily extend this to the 0-dimensional case by simply treating w as v 2 , using an extra sample to remove the last dimension. More importantly, let's say we've constrained C † Z x C to lie in {Z v , Z v+w }. The effect of this on Θ is that if we sum the columns i where x i = 1 then the sum must lie in {v, v + w}.
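The column-sum condition on Θ described above can be checked mechanically. The following is a minimal Python sketch, assuming Θ is stored as a list of rows with 0/1 entries and that x, v and w are 0/1 lists; the helper names `column_sum` and `satisfies_constraint` and the example values are hypothetical and only serve as illustration.

```python
def column_sum(theta, x):
    """XOR together the columns i of theta (rows over F_2) for which x[i] == 1."""
    return [sum(row[i] for i in range(len(x)) if x[i]) % 2 for row in theta]


def satisfies_constraint(theta, x, v, w):
    """Return True if the selected column sum lies in the set {v, v + w}."""
    s = column_sum(theta, x)
    v_plus_w = [(a + b) % 2 for a, b in zip(v, w)]
    return s == v or s == v_plus_w


# Illustrative 3 x 3 example (values chosen by hand, not taken from the paper).
theta = [[1, 1, 0],
         [0, 1, 0],
         [0, 0, 1]]
x = [1, 1, 0]   # select columns 1 and 2
v = [0, 1, 0]
w = [1, 0, 0]
print(column_sum(theta, x), satisfies_constraint(theta, x, v, w))
```

Here the sum of the first two columns equals v, so the constraint holds, while selecting only the third column would violate it.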

Corollary 4.4. Let
be a one-dimensional affine subspace of n × k matrices over {0, 1} such that for all i, v i ̸ = w i and v i , w i ̸ = 0. Finally, let Θ ′ be an arbitrary n × k submatrix of Θ. Then there exists a set of (2k − 1)n samples that constrain Θ ′ to only have consistent solutions lying in {v, v + w} for CNOT circuit C. Furthermore these samples can be efficiently generated.
Proof. WLOG, we will let the set of k different columns we choose for Θ ′ to be columns 1 through k. We will use induction on k to prove this corollary, with the base case covered by Lemma 4.3. Now let us assume that we have samples that constrain columns 2 through k to be either The goal will be to generate constraints such that if column 2 is v 2 then column 1 must be v 1 . Otherwise, if column 2 is v 2 + w 2 then column 1 is constrained to be v 1 + w 1 .
To start us off, we can use Lemma 4.3 to constrain the sum of columns 1 and 2 to be either v 1 + w 1 or v 1 + w 1 + v 2 + w 2 . If we focus on columns 1 and 2, the solutions to this constraint lie in an affine subspace defined by: for arbitrary vector u. We then apply Lemma 4.3 again to constrain column 1 to be either v 1 or v 1 + w 1 . Thus the first two columns must either be Finally, to lie in the intersection from the inductive hypothesis, we note that if the second column is v 2 or v 2 + w 2 then columns 3 through k must be respectively. Collectively, we achieve our goal of constraining the entire solution to lie in v + ⟨w⟩. We used n samples at the first step and 2n for every inductive step after (one set of n samples for each call of Lemma 4.3), giving us a total number of samples of 2n(k − 1) + n = 2kn − 2n + n = (2k − 1)n. Since each step was efficient, the whole process takes polynomial in n time to generate all of the samples.

The high level idea of the proof is to first reduce a 3SAT instance over variables {x_i} to solving an arithmetic formula F. The formula is then turned into a weighted directed graph whose adjacency matrix M(x) has a determinant that is equal to the formula F, where M(x) has entries from {0, 1} ∪ {x_i}, and can thus be viewed as an affine subspace over F_2^{(|F|+2)×(|F|+2)}. While we will not prove the correctness of this statement, we will want to ascertain exactly what kind of M_i are formed through the reduction. We now describe the construction of the graph (see Fig. 1 and Fig. 2 for relevant illustrations):
• For each atomic formula F′, create vertices s and t.
• For each constant c create a unique node v_c with a path from s to v_c with weight c and a path from v_c to t with weight 1.
• For each variable x_i create a unique node v_{x_i} with a path from s to v_{x_i} with weight x_i and a path from v_{x_i} to t with weight 1.
• For multiplication of F_i and F_j, place the graphs of F_i and F_j in series.
• For addition of F_i and F_j, place the graphs of F_i and F_j in parallel.
• Once all of this is done, create a path of weight 1 from the global t vertex to the global s vertex.
• Create self loops at every vertex besides the global s vertex.
Let M be the resulting adjacency matrix of this graph. For every entry that is a constant, we can assign that to M 0 . Then for each variable x i , we can set M i to be the matrix that is zero everywhere except where x i appears in M . As an example, for a matrix For a more succinct reduction later in Section 6, we want to isolate the kinds of matrix affine subspaces over F 2 that are hard to solve (i.e., are used in the reduction from 3SAT).
The following notation will be beneficial for that.
Definition 5.3. For an n × n matrix M over a field F, let NZ(M) ⊆ {1, 2, ..., n} be the set of columns of M that are non-zero (i.e., are not the all-zeros vector).
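The split of M(x) into a constant part M_0 and one 0/1 matrix M_i per variable, together with the set NZ(M), can be illustrated with a short Python sketch. It assumes entries are either the integers 0 and 1 or a variable name given as a string; the function names `decompose` and `nz` and the 3 × 3 example matrix are hypothetical and are not taken from the reduction itself.

```python
def decompose(M, variables):
    """Split M into (M0, {x: M_x}): constants go to M0, M_x marks where x appears."""
    n = len(M)
    M0 = [[0] * n for _ in range(n)]
    parts = {x: [[0] * n for _ in range(n)] for x in variables}
    for r in range(n):
        for c in range(n):
            entry = M[r][c]
            if isinstance(entry, str):   # a variable such as "x1"
                parts[entry][r][c] = 1
            else:                        # a constant, reduced mod 2
                M0[r][c] = entry % 2
    return M0, parts


def nz(M):
    """NZ(M): the 1-indexed set of columns of M that are not all zeros."""
    return {c + 1 for c in range(len(M[0])) if any(row[c] for row in M)}


# Hand-made example in which every column contains at most one variable.
M = [[1, "x1", 0],
     [0, 1, "x2"],
     [1, 0, 1]]
M0, parts = decompose(M, ["x1", "x2"])
print(M0, nz(parts["x1"]), nz(parts["x2"]))
```

In this example NZ(M_1) = {2} and NZ(M_2) = {3} are disjoint, and every column of M_0 is non-zero.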

Definition 5.4.
Let M and W be n × n matrices over a field F and let k : Proof. Rather than solve NonSingularity using Modified-NonSingularity, we merely show that the reduction from 3SAT naturally leads to instances of Modified-NonSingularity.
Let G be the graph produced by the reduction from 3SAT with adjacency matrix M (x). By the construction of M (x) given by Fig. 1, we see that every instance of a variable will create its own unique subgraph such that the instance of each variable connects to a unique vertex. Because this unique vertex can never be used as an s or t vertex from Fig. 1, that vertex also necessarily has in-degree 1 so that no other edges connect to it. Because these are the only vertices that have incoming edges assigned with variable weight, this conversely means that the columns of M (x) contain at most one variable.
Recall that for variable x_i with i > 0, the matrix M_i we form from the decomposition of M(x) consists of the entries associated with x_i. Since each column contains at most one x_i, NZ(M_i) and NZ(M_j) must be disjoint for all distinct i, j > 0. Due to the constraint being met, we know that the R_{M_i}(M) is confined to a one-dimensional affine subspace. As such, it can be represented as: since the weight of an edge will never be x_i + 1 then an entry of w_j being one implies the corresponding entry of v_j is zero. This implies v_j ≠ w_j for all j. Finally, to ensure that v_j ≠ 0, we note that each vertex besides s receives a self loop with weight 1. The vertex s instead receives an edge from t with weight 1. These self loops and the edge from t to s ensure that each column of M_0 has at least one entry equal to 1, so that v_j ≠ 0.
We have now shown that every matrix affine subspace produced from the reduction in Buss et al. [17] also meets the requirements for Modified-NonSingularity, thus showing that this problem is also NP-complete.
If M(x) is an accepting instance of Modified-NonSingularity then there must be a full rank Θ that is consistent with S. Since Lemma 2.10 ensures that CNOT circuits can instantiate any full rank Θ, there must also exist a CNOT circuit consistent with S. Alternatively, if M(x) does not contain a non-singular matrix, then there does not exist a full rank Θ that is consistent with S. By Lemma 2.9 there cannot exist a CNOT circuit that is consistent with the data. This gives us that M(x) contains a non-singular matrix if and only if there is a CNOT circuit consistent with S, which can be produced efficiently.

PAC Learning CNOT Circuits and NP
We now count the number of samples used. For all 1 ≤ i ≤ m, we use n samples to constrain the first column of NZ(M_i). For each remaining column, we use 2n samples if that column is contained in some NZ(M_j), and otherwise n + 1 samples from the generalization of Lemma 4.3. Since 2n ≥ n + 1 for n ≥ 1, we use at most 2n(n − m) samples for the remainder, giving us at most 2n(n − m) + nm = 2n^2 − mn ≤ 2n^2 samples. This shows CNOTDecide(2n^2) is NP-hard. Combined with Proposition 3.5, we find CNOTDecide(2n^2) to be NP-complete.
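The counting argument above can be sanity-checked numerically. The sketch below is only an illustration: the function name `total_samples` and the parameter `shared` (the number of remaining columns that lie in some NZ(M_j)) are hypothetical names introduced here.

```python
def total_samples(n, m, shared):
    """Samples used: n per first column of each NZ(M_i), then 2n or n + 1 per remaining column."""
    remaining = n - m
    if not 0 <= shared <= remaining:
        raise ValueError("shared must be between 0 and n - m")
    return n * m + 2 * n * shared + (n + 1) * (remaining - shared)


# Exhaustively confirm the bound 2n^2 - mn <= 2n^2 for small parameters.
for n in range(1, 11):
    for m in range(0, n + 1):
        for shared in range(0, n - m + 1):
            assert total_samples(n, m, shared) <= 2 * n * n - m * n <= 2 * n * n

print(total_samples(4, 2, 1))
```

The worst case is reached when every remaining column lies in some NZ(M_j), which gives exactly 2n(n − m) + nm samples.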
Since CNOTDecide(2n^2) ⊂ CNOTDecide(poly(n)), CNOTDecide(poly(n)) is also NP-hard.
Proof. We will start by proving the NP ⊆ RP version for classical randomized learners. The quantum version will follow trivially by replacing the learner with a quantum algorithm and therefore RP with RQP. The only change then is that NP ⊆ RQP does not necessarily imply NP = RQP like it does with RP.

Corollary 6.3 (Formal Statement of Corollary 1.2).
There exists an efficient randomized (ϵ, δ) proper PAC learner for Clifford circuits with arbitrary ϵ and δ as arbitrary 1/poly(n)
4 We again abuse notation to signify that ϵ is a value less than 1 4n 4 and likewise for δ <
Proof. Since CNOT circuits are also a form of Clifford circuits, CNOTDecide(2n^2) ⊂ CliffordDecide(2n^2), and so CliffordDecide(2n^2) is also NP-hard. Combined with Proposition 3.5, we get that it is NP-complete. The proof of Theorem 6.2 continues but with α = 1/2 instead due to Lemma 3.4. This leads to slightly different constants, but the proof ideas all follow without major change.

Discussion and Open Problems
In this work, we prove a negative result in proper learning of one of the best candidates for efficient PAC learning of quantum circuits. However, it should be noted that in many cases there exist improper learners even in the case where proper learning is NP-hard, such as 2-clause CNF, 3-DNF, and intersection of half spaces [14,15,30]. This immediately leaves the problem of whether or not an improper learner exists for Clifford circuits. One way of showing hardness would be to leverage cryptographic hardness such as in Arunachalam et al. [9], Kharitonov [33]. Another approach would be to assume the hardness of random k-DNF, such as the work of Daniely and Shalev-Shwartz [23]. For upper bounds, the work of Caro and Datta [19] can also be used to get agnostic generalization results, providing a possible pathway to answering research questions in that direction.
Another thing worth considering is that the hardness results only apply for small errors (roughly 1/ poly(n)). And while this is sufficient to give complexity-theoretic hardness for the kinds of PAC learners (i.e., strong proper learners) originally introduced by Valiant [47], it would be nice to get hardness results for larger errors as in the work by Guruswami and Raghavendra [28]. This work involved using PCP/hardness-of-approximation ideas to show that even constant training error was NP-hard.
We also note that a single output bit of a CNOT circuit is simply an XOR, which is easy to learn efficiently if there is no noise. However, because we are dealing with reversible computation, each output bit has to be a linearly independent XOR such that each input bit is recoverable. Finding a CNOT circuit that matches a single XOR function f can be done by sampling an expected O(n) random XORs until we get n linearly independent XORs with one of them being f (see Appendix B.1). Thus, the entire difficulty of proper learning CNOT circuits is this linear independence of the output bits. As such, even though AC^0 ⊆ TC^0 ⊆ NC^1 ⊆ L ⊆ ⊕L with the lower classes having improper hardness results based on cryptographic hardness [33], one cannot directly give an improper learning result for CNOT circuits despite the fact that simulating CNOT circuits is complete for ⊕L.
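The sampling procedure mentioned above, extending a single XOR f to n linearly independent XORs, can be sketched in Python as follows. This is an illustrative sketch rather than the construction of Appendix B.1: parities are encoded as integer bitmasks, and the names `rank_f2` and `extend_to_basis` are hypothetical.

```python
import random


def rank_f2(rows):
    """Rank over F_2 of vectors encoded as integer bitmasks (Gaussian elimination)."""
    basis = {}  # highest set bit -> reduced vector with that leading bit
    for r in rows:
        while r:
            h = r.bit_length() - 1
            if h not in basis:
                basis[h] = r
                break
            r ^= basis[h]  # clear the leading bit and keep reducing
    return len(basis)


def extend_to_basis(f, n, rng=random):
    """Extend the non-zero parity f to n linearly independent parities on n bits."""
    if f == 0:
        raise ValueError("f must be a non-zero parity")
    rows = [f]
    while len(rows) < n:
        candidate = rng.getrandbits(n)
        # Keep the candidate only if it is independent of the rows found so far.
        if rank_f2(rows + [candidate]) == len(rows) + 1:
            rows.append(candidate)
    return rows


rows = extend_to_basis(0b101, 4, random.Random(0))
print([format(r, "04b") for r in rows])
```

Each uniformly random candidate is independent of the current rows with probability at least 1/2, so the loop needs O(n) candidates in expectation. The resulting rows form an invertible matrix over F_2 whose first row is f.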
As noted previously, our PAC learning framework is slightly different from that of Caro and Datta [19], in that we use Pauli matrices, rather than rank 1 projectors, as measurements. To the author's knowledge, there exists no proof showing that one framework is necessarily harder than the other. The author also does not see an obvious way of proving an analogous hardness theorem in the specific framework of Caro and Datta [19] for Clifford or CNOT circuits.
Finally, with everything from the input states to the circuits involved being classical, it is entirely possible to prove the technical results about CNOT circuits only talking about bit strings and parity functions. Namely, one can replace the entire problem with samples of the form (x, s, s^T Cx mod 2) where (x, s) ∼ D are n-bit strings and C is a CNOT circuit. Since the stabilizer group of a computational basis state always lies in {±1} × {I, Z}^{⊗n}, we can uniquely define it by the subgroup that has positive phase. This is equivalent to the orthogonal complement of x, which is the subspace W_x = {w ∈ {0, 1}^n | w · x = 0}. From there, a sample of the form (x, s, 0) simply says that Cx ∈ W_s, and one can get an analogous proof by copying the lemmas and theorems in Section 4 and Section 6. However, this proof is not any more intuitive than the one given using stabilizer groups, and in fact is probably less intuitive to the average reader due to the lack of established formalism from stabilizers and Paulis. It would be interesting if a more intuitive purely classical proof could be made to show hardness of learning CNOT circuits under this model.
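The purely classical sample format (x, s, s^T Cx mod 2) can be made concrete with a small Python sketch. It assumes a CNOT circuit is given as an ordered list of (control, target) pairs acting on a list of bits; the names `apply_cnots` and `make_sample` and the example circuit are hypothetical.

```python
def apply_cnots(gates, x):
    """Apply CNOT gates, given as (control, target) index pairs, in order to the bits x."""
    y = list(x)
    for control, target in gates:
        y[target] ^= y[control]
    return y


def make_sample(gates, x, s):
    """Return the labelled sample (x, s, s^T C x mod 2) for the circuit C given by gates."""
    y = apply_cnots(gates, x)
    label = sum(a & b for a, b in zip(s, y)) % 2
    return (tuple(x), tuple(s), label)


gates = [(0, 1), (1, 2)]          # illustrative 3-bit circuit
print(make_sample(gates, [1, 0, 0], [0, 0, 1]))
print(make_sample(gates, [1, 0, 0], [1, 1, 0]))
```

With this encoding, a label of 0 for the pair (x, s) is exactly the statement that Cx lies in the orthogonal complement W_s.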

A Generalization of Caro and Datta [19]
We generalize the main results of Caro and Datta [19], which are themselves analogous to those of Goldberg and Jerrum [25], to allow for projective measurements beyond rank 1, such as in the settings used in this paper. While not necessary for the main result of this paper, we hope to give this proof in a way that is also more black-box accessible for future work.

A.1 Quantum Circuits as Polynomials
The end goal will be to show that the outputs of our concept class can be described as a set of polynomials with bounded degree. Combined with an upper bound on the number of polynomials in that set, we can later arrive at an upper bound on the pseudo-dimension, which itself is an upper-bound on fat-shattering dimension.
We now show a more terse version of Lemma 3 from Caro and Datta [19].
Proof. Every 2-qudit gate U can be naïvely expressed as the d^4 complex values that make up the d^2 × d^2 unitary. By splitting up the complex values into a real and imaginary part, we get 2d^4 real values to describe each 2-qudit gate. If |ψ⟩ = Σ_{s∈{0,1}^n} α_s |s⟩ then applying a 2-qudit unitary to |ψ⟩ leaves us with the amplitudes of this new state being a polynomial of degree 1 in terms of the entries of U. Note that the α_s, along with the circuit structure, are what determine the coefficients of this polynomial. By applying all γ 2-qudit gates that comprise C, the amplitudes of C|ψ⟩ can be described as a polynomial of degree γ in 2γd^4 variables. Finally, since we can write |φ⟩ = Σ_{s∈{0,1}^n} β_s |s⟩, the inner product ⟨φ|C|ψ⟩ is some weighted linear combination of the amplitudes of C|ψ⟩, which is again a polynomial with degree at most γ. To get Tr[C|ψ⟩⟨ψ|C†|φ⟩⟨φ|] = |⟨φ|C|ψ⟩|^2 we note that the degree at most doubles when we multiply a polynomial by itself. This leaves us with p_{(|ψ⟩,|φ⟩)}(c_1, c_2, ..., c_k) as a polynomial of degree at most 2γ and k = 2γd^4.
Corollary A.2. Consider a quantum circuit C with a fixed circuit structure (i.e., the locations of the 2-qudit gates are fixed, though the gates themselves can be arbitrary) comprised of at most γ 2-qudit gates. Such a circuit can be described using variables c_1, c_2, ..., c_k ∈ R such that k = 2γd^4. Then for every pair of a quantum state |ψ⟩ and a projector Π there exists a polynomial p_{(|ψ⟩,Π)}(c_1, c_2, ..., c_k) := Tr[C|ψ⟩⟨ψ|C†Π] with degree at most 2γ.
Proof. We note that Π = Σ_i |φ_i⟩⟨φ_i|. By linearity of the trace, Tr[C|ψ⟩⟨ψ|C†Π] = Σ_i Tr[C|ψ⟩⟨ψ|C†|φ_i⟩⟨φ_i|]. By Lemma A.1, this is the sum of real polynomials in 2γd^4 variables with degree at most 2γ. Since the sum does not increase the degree, we are done.
Since we fixed the circuit structure, we will want to know how many circuit structures there are, because this directly bounds the number of polynomials we need to consider. The following result was the main ingredient in the proof of Lemma 2 from Caro and Datta [19].
Lemma A.3 (Caro and Datta [19] Lemma 2). There are at most (γ! δ^{γ−δ} / (γ−δ)!) (n!)^δ ways to structure 2-qudit circuits with size γ and depth δ.

A.2 Pseudo-Dimension of Concept Classes Described via Polynomials
The following is a generalization of Goldberg and Jerrum [25], which used the degree of polynomials to bound the pseudo-dimension of concept classes that could be defined using polynomials in the parameter space of the concepts.
Definition A.4. The pseudo-dimension of a concept class C is the limit of the fat-shattering dimension parameter η as η goes to zero. Formally, the pseudo-dimension is lim_{η→0+} fat_C(η).
Because fat-shattering dimension increases as η decreases, fat-shattering dimension is always upper-bounded by the pseudo-dimension for all values of η > 0.
Definition A.5. Let {p 1 , p 2 , · · · , p m } ⊆ R k → R be a set of m polynomials on k variables. For η > 0, the η-sign assignment of {p 1 , p 2 , · · · , p m } on the input (x 1 , x 2 , · · · , x k ) ∈ R k is the vector b ∈ {−1, 0, 1} m such that Lemma A.6 is a stronger notion than pseudo-dimension, since it upper bounds the number of sign assignments over arbitrarily large sets of inputs. Since pseudo-dimension requires that C can achieve all (i.e., an exponential number of) sign assignments on some large set of samples, we can show that the pseudo-dimension cannot be too large. We formalize that notion here. Note that the polynomials in question in the following proof are over the parameters of the concept class, not the inputs. The intuition is that if the output of the concept is some bounded-degree polynomial in the parameter space, there cannot be too many sign assignments.
Theorem A.7 (Generalization of Goldberg and Jerrum [25] Theorem 2.2). Let C ⊆ Ω → [0, 1] be a concept class such that every element of C can be described via k different real variables c_1, c_2, ..., c_k ∈ R, as well as an index l ∈ [s] for s ≥ 1. Furthermore, for every f_{c_1,c_2,...,c_k,l} ∈ C and x ∈ Ω, let f_{c_1,c_2,...,c_k,l}(x) = p_{x,l}(c_1, c_2, ..., c_k) where p_{x,l} is one of s polynomials each with degree at most d for d ≥ 1. Then the pseudo-dimension of C is at most 2k log_2(8eds).
Proof. Let (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) ⊆ R be the largest set of points pseudo-shattered by C. If ms < k, then there is no issue because the largest shattered set is smaller than k, which is smaller than 2k log_2(8eds). Now assume that ms ≥ k. By Definition A.4, there must exist some points y_1, y_2, ..., y_m ∈ R and some (potentially arbitrarily small) value } is a set of ms polynomials that must be able to create all 2^m different sign assignments that define b. However, we know from Lemma A.6 that the number of different sign assignments is at most (8edms/k)^k as long as ms ≥ k, which we have assumed to be true. Therefore, 2^m ≤ (8edms/k)^k. Taking the logarithm of both sides, m ≤ k log_2(8eds) + k log_2(m/k). We divide the situation into two cases based on which of these two logarithms is bigger: 8eds ≥ m/k and 8eds < m/k. The first case is easy to analyze, since if 8eds ≥ m/k, then we directly get m ≤ 2k log_2(8eds) via substitution on the right-hand side. The other case leads to m < 2k log_2(m/k), also via substitution on the right-hand side. Solving this with the Lambert W-function tells us that if k > 0 then m < k e^{−W_{−1}(−(ln 2)/2)} = 4k. Because d ≥ 1 and s ≥ 1, log_2(8eds) ≥ log_2(4e) > 2, so m < 4k < 2k log_2(8eds) for this other case as well.
Proof. We want to apply Theorem A.7 to the concept class of quantum circuits. We know from Corollary A.2 that a circuit with fixed circuit structure, γ gates and depth δ can be described as a polynomial with degree at most 2γ in the 2γd^4 real variables that describe the entries of the circuit. Furthermore, Lemma A.3 tells us that there are at most (γ! δ^{γ−δ} / (γ−δ)!) (n!)^δ different circuit structures. We then apply Theorem A.7 with k = 2γd^4, degree 2γ (in place of d in the theorem statement) and s = (γ! δ^{γ−δ} / (γ−δ)!) (n!)^δ. As a result we get that the pseudo-dimension is at most 4γd^4 log_2(16eγ · γ! δ^{γ−δ} (n!)^δ / (γ−δ)!). (3)

A.3 Pseudo-Dimension for Quantum Circuits
We will now focus on giving an upper bound for the logarithmic term by showing that Splitting up the logarithm into sums and applying Stirling's approximation to each factorial, we arrive at Due to the definition of circuit structure, we know that δ ≤ γ. WLOG, we can also assume that every qubit has been acted upon by at least one gate (even if it is just the identity gate) such that n ≤ γ. Together, we find that the logarithmic term is at most O(δγ log γ).
Since we have achieved our goal of upper-bounding the logarithmic term, from Eq. (3) we immediately get that the pseudo-dimension is at most O(d 4 δγ 2 log γ).
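To get a feel for the size of this bound, one can evaluate 2k log_2(8eds) from Theorem A.7 numerically. The sketch below assumes the circuit-structure count of Lemma A.3 reads γ! δ^{γ−δ} (n!)^δ / (γ−δ)!, which is an interpretation of the flattened formula in the text, and uses hypothetical function names. Logarithms of factorials are computed with `math.lgamma` to avoid huge integers.

```python
import math


def log2_num_structures(n, gamma, delta):
    """log2 of gamma! * delta**(gamma - delta) / (gamma - delta)! * (n!)**delta (delta >= 1)."""
    ln_s = (math.lgamma(gamma + 1) - math.lgamma(gamma - delta + 1)
            + (gamma - delta) * math.log(delta)
            + delta * math.lgamma(n + 1))
    return ln_s / math.log(2)


def pseudo_dim_bound(n, gamma, delta, d):
    """Evaluate 2k * log2(8 * e * degree * s) with k = 2*gamma*d**4 and degree = 2*gamma."""
    k = 2 * gamma * d ** 4
    degree = 2 * gamma
    log2_term = math.log2(8 * math.e * degree) + log2_num_structures(n, gamma, delta)
    return 2 * k * log2_term


# Smallest non-trivial case: one 2-qubit gate on n = 2 qubits (d = 2).
print(pseudo_dim_bound(n=2, gamma=1, delta=1, d=2))
# A larger illustrative case.
print(pseudo_dim_bound(n=10, gamma=50, delta=10, d=2))
```

The function returns the explicit value of the bound rather than its asymptotic form, so it can be compared against O(d^4 δγ^2 log γ) for concrete parameters.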
We now state the generalization of the main result of Caro and Datta [19] to projective measurements of arbitrary rank.
Proof. We combine Theorem A.9 with Theorem 2.16, along with the fact that for all η > 0 the η-fat-shattering dimension is upper-bounded by pseudo-dimension.
As shown by Theorem 4 of Caro and Datta [19], a similar thing can be done with n-qudit quantum processes by simply changing the d^4 to d^8 in Lemma A.1 and Corollary A.2. This is because a quantum process is still a linear operation, but now contains d^8 entries in parameter space. This propagates to Theorem A.9 and Corollary A.10 by again replacing every appearance of d^4 with d^8.

B Special Cases with Efficient Proper Learners
Despite the results given, there still exist situations where it is possible to efficiently proper learn Clifford circuits and CNOT circuits. We give brief proof sketches of some of them here.

B.1 CNOT Circuits for a Distribution with Support over a Single Measurement
Let us try to learn CNOT circuits with regard to a distribution D such that there exists some Pauli P ∈ {I, Z}^{⊗n} with Pr_{(ρ,E)∼D}[E = (I^{⊗n} + P)/2] = 1. Because we are dealing with CNOT circuits, the labels will always be 0 or 1, so by Lemma 3.4 each label will tell us an affine subspace that C†PC lies in. We can efficiently compute the intersection of these subspaces using Gaussian elimination with the generators to find a P′ that is consistent with all of the labels. From there, let P, Q_2, Q_3, ..., Q_n be a set of Paulis whose span is {I, Z}^{⊗n}. Let P′, Q′_2, Q′_3, ..., Q′_n also be a set of Paulis whose span is {I, Z}^{⊗n}. It is clear that if we define our CNOT circuit such that C†PC = P′ and C†Q_iC = Q′_i then we have a valid CNOT circuit. Finding such {Q_i} and {Q′_i} only takes O(n) expected samples of random Paulis in {I, Z}^{⊗n} and so can be done efficiently. Appealing to both Proposition 2.6 and Theorem 2.16 completes the proof.

B.2 Clifford circuits with the Uniform Distribution over Pauli Measurements
We note that if the distribution D entails the measurements being uniform over the Paulis then the problem is trivially easy to properly learn with ϵ < 1/exp(n) and δ = 0 by just outputting a random Clifford circuit. This is because the probability that a random Pauli is in a given state's stabilizer group is 2^n/4^n = 2^{−n}, so we will almost always see the label 1/2 regardless of the hypothesis circuit we choose.

B.3 CNOT Circuits with the Uniform Distribution over {I, Z} ⊗n
Let Θ and Q be the matrix/vector forms of the θ_ij and q_j values from Eq. (1). We note that if we have enough independent samples that Θ ⊕ Q is confined to an O(log n)-dimensional affine subspace then we can simply iterate through all possible Θ and Q combinations to find one with a non-singular Θ in poly(n) time. Let us say that we have restricted Θ ⊕ Q to lie in a d-dimensional affine subspace. Let M′ be the true value of Θ, and let M ≠ M′ be another arbitrary matrix. Likewise let x′ be the true value of Q and x some arbitrary vector. Now let Z_w ∈ {I, Z}^{⊗n} be a Pauli selected uniformly at random, and ρ = |v⟩⟨v| a uniformly random computational basis state. The pair M and x will give the same label as the true label on the input (ρ, I ⊗n +Z w Because w is uniformly random, as long as v^T(M + M′) + x^T + (x′)^T is not the zero vector over F_2 then this will only be 0 at most half the time. Since at least one of M ≠ M′ or x ≠ x′ is true, (M + M′)^T v = x + x′ will only be true with probability at most 1/2 as well. So the probability that any arbitrary M and x have different labels is at least 1/4. Thus with O(n) expected samples uniformly drawn from arbitrary basis states and Z_w we will have constrained our system to something we can brute-force to find a full rank Θ and corresponding q_j values that are consistent with all samples. From there we again apply both Proposition 2.6 and Theorem 2.16 to generalize with zero training error as long as the number of samples is also at least the parameter m from the theorem statement.
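The brute-force step, iterating over a low-dimensional affine subspace until a non-singular Θ is found, can be sketched in Python as follows. This is a simplified illustration that ignores the phase vector Q and only searches over Θ; matrices are encoded as lists of integer bitmask rows, and the names `rank_f2` and `find_full_rank` are hypothetical.

```python
from itertools import product


def rank_f2(rows):
    """Rank over F_2 of vectors encoded as integer bitmasks (Gaussian elimination)."""
    basis = {}  # highest set bit -> reduced vector with that leading bit
    for r in rows:
        while r:
            h = r.bit_length() - 1
            if h not in basis:
                basis[h] = r
                break
            r ^= basis[h]
    return len(basis)


def find_full_rank(base, directions):
    """Search base + span(directions) over F_2 for a non-singular matrix, or return None."""
    n = len(base)
    for coeffs in product((0, 1), repeat=len(directions)):
        candidate = list(base)
        for c, direction in zip(coeffs, directions):
            if c:
                candidate = [a ^ b for a, b in zip(candidate, direction)]
        if rank_f2(candidate) == n:
            return candidate
    return None


# 2 x 2 example: the base matrix is singular, but adding the direction fixes it.
print(find_full_rank([0b10, 0b10], [[0b00, 0b11]]))
# A subspace containing no non-singular matrix.
print(find_full_rank([0b11, 0b11], [[0b11, 0b11]]))
```

The loop visits 2^d candidates for a d-dimensional subspace, which is polynomial in n exactly when d = O(log n).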

B.4 Clifford Circuits for a Distribution with Support over a Single State
In the converse of an earlier situation, let us try to learn Clifford circuits with regard to a distribution D such that there exists some stabilizer state σ with Pr_{(ρ,E)∼D}[ρ = σ] = 1. This situation effectively reduces to that of Rocchetto [44]. If we run that algorithm we will find a state σ′ that is consistent with all of the labels. Let {g_i} be the generators of σ and {g′_i} the generators of σ′. If we let C g_i C† = g′_i we define the first part of a Clifford circuit that maps σ to σ′ as desired. We can then run the algorithm from Van Den Berg [48] to fill in the remainder of the Clifford circuit. Appealing to both Proposition 2.6 and Theorem 2.16 once again completes the proof.