Fast simulation of planar Clifford circuits

A general quantum circuit can be simulated classically in exponential time. If it has a planar layout, then a tensor-network contraction algorithm due to Markov and Shi has a runtime exponential in the square root of its size, or more generally exponential in the treewidth of the underlying graph. Separately, Gottesman and Knill showed that if all gates are restricted to be Clifford, then there is a polynomial time simulation. We combine these two ideas and show that treewidth and planarity can be exploited to improve Clifford circuit simulation. Our main result is a classical algorithm with runtime scaling asymptotically as $n^{\omega/2}<n^{1.19}$ which samples from the output distribution obtained by measuring all $n$ qubits of a planar graph state in given Pauli bases. Here $\omega$ is the matrix multiplication exponent. We also provide a classical algorithm with the same asymptotic runtime which samples from the output distribution of any constant-depth Clifford circuit in a planar geometry. Our work improves known classical algorithms with cubic runtime. A key ingredient is a mapping which, given a tree decomposition of some graph $G$, produces a Clifford circuit with a structure that mirrors the tree decomposition and which emulates measurement of the corresponding graph state. We provide a classical simulation of this circuit with the runtime stated above for planar graphs and otherwise $nt^{\omega-1}$ where $t$ is the width of the tree decomposition. Our algorithm incorporates two subroutines which may be of independent interest. The first is a matrix-multiplication-time version of the Gottesman-Knill simulation of multi-qubit measurement on stabilizer states. The second is a new classical algorithm for solving symmetric linear systems over $\mathbb{F}_2$ in a planar geometry, extending previous works which only applied to non-singular linear systems in the analogous setting.


Introduction and Overview
Enormous resources are now being marshalled by quantum information scientists towards the goal of building a quantum computer: demonstration of a computation for which there are no known efficient classical algorithms [1]; quantum machines that are accessible to the public [2]; and validation of certain building blocks of quantum error correction [3,4,5]. A by-product of the progress towards building quantum machines has been a renewed interest in the classical simulation of quantum computers [6,7,8]. This is partly due to the fact that a direct comparison with classical simulation algorithms is an important measure of the performance of today's quantum computers [9,10,11]. A quantum computer in the real world is limited to a specific architecture, which may restrict both the native gate set and the number of qubits while also being subjected to noise from its environment. While in some cases it is possible to leverage quantum advantage from small quantum devices with limited capabilities, one can also exploit these limitations in classical simulation algorithms. As a result, the search for a quantum advantage with near-term quantum computers is grounded in our understanding of the power of restricted models of quantum computing. Such models include constant-depth quantum circuits [12], IQP circuits [13], BosonSampling [14], circuits with few non-Clifford gates [15,16,8], and others. Their study reveals bright spots of quantum advantage even in the presence of harsh restrictions on the operation of quantum devices.
Here we focus on the restricted family of Clifford circuits, which are expressed as a product of gates from the set Hadamard, S = diag (1, i), and CZ = diag(1, 1, 1, −1). Gottesman and Knill showed that Clifford circuits can be simulated in polynomial time on a classical computer [15]. The quantum state at any point during such a Clifford circuit is a so-called stabilizer state that can be specified using a tableau of O(n 2 ) classical bits, see Section 2.1 for details. Updating this representation after the action of a gate H, S, CZ can be performed by modifying O(n) bits of the tableau in a simple way. Measurements are more costly: the best known algorithm for simulating the outcome of a single-qubit measurement uses O(n 2 ) time [17].
Despite the fact that they can be simulated in polynomial time, it has been shown that quantum computations based on Clifford circuits still enjoy a certain type of quantum advantage [18,19,20,21]. The computational tasks considered in Refs. [18,20] can be viewed as special cases of the problem of simulating measurements on a graph state. For any graph G = (V, E), the graph state |G is defined as the n := |V | qubit state satisfying Here X v , Z v denote the Pauli X and Z operators which act nontrivially on qubit v. Equivalently, the state |G is prepared by a Clifford circuit with the following simple form: CZ uv H ⊗n |0 n .
In this paper we are interested in the computational problem of simulating measurement of quantum graph states. In the simplest version of the problem, we are given a graph G = (V, E) with n = |V | vertices and a single-qubit Pauli measurement basis P v ∈ {X, Y, Z} for each vertex v ∈ V , and we are asked to sample from the probability distribution obtained by measuring each qubit of |G in the given bases. In particular, letting U v be a single-qubit Clifford unitary that maps the P v basis to the computational (Z) basis (i.e., U v P v U † v = Z for each v ∈ V ), we are asked to sample a string z ∈ {0, 1} n from the probability distribution We also consider a more general problem where we allow some of the qubits to be postselected. That is, in addition to the graph and measurement bases, we are given a subset of qubits P ⊆ V and measurement outcomes m ∈ {0, 1} P as input. The problem is to sample the remaining qubits S := V \P after having postselected on measurement outcomes m for the qubits of P. In particular, let be the probability of obtaining outcomes m for the qubits in P. If p P (m) is nonzero, then the output is a binary string x ∈ {0, 1} |S| sampled from the conditional distribution p S (x|m) over measurement outcomes in S after postselecting on outcomes m for the qubits in P: Otherwise, the output is a flag indicating that p P (m) = 0, since it is meaningless to postselect on an impossible outcome.
Graph state simulation problem. The input to the problem is a graph G = (V, E) with n = |V | vertices, a measurement basis P v ∈ {X, Y, Z} for each vertex v ∈ V , a partition [n] = S ∪ P of the vertices, and a binary string m ∈ {0, 1} |P| . If the marginal probability p P (m) from Eq. (2) is zero, the output is an error flag. Otherwise, the output is a binary string x ∈ {0, 1} |S| sampled from the conditional distribution p S (x|m) defined in Eq. (3).
We shall refer to the special case P = ∅ as the graph state simulation problem without postselection. We note that although this special case without postselection is of particular interest, the classical algorithms we develop in this paper are capable of solving the more general graph state simulation problem with the same asymptotic runtime.
From a complexity perspective, the graph state simulation problem is closely connected to the tasks considered in Refs. [18,19,20] which demonstrate a quantum advantage for certain special cases. From an algorithms perspective, it is also well known that simulation of general Clifford circuits can be reduced to graph state simulation using, e.g., techniques from measurement-based quantum computing [22]. For a non-exhaustive list of applications of this problem, we refer the reader to Section 1.1.
Let us now compare quantum and classical algorithms for the graph state simulation problem. When there is no postselection, a quantum computer can solve the problem by preparing the graph state |G and then measuring all qubits in the given bases. The total runtime of this procedure is linear in the size of the graph (number of edges plus number of vertices) and is therefore upper bounded as O(n 2 ) in the general case, and as O(n) if G is sparse. The graph state simulation problem can also be solved efficiently on a classical computer via the Gottesman-Knill theorem since all unitaries and measurements to be simulated are Clifford operations. To our knowledge the best previously known algorithm for the graph state simulation problem is based on the standard Gottesman-Aaronson simulation method [17] and has runtime O(n 3 ). Our first contribution is to reduce the graph state simulation problem to matrix multiplication, improving the runtime in the general case to O(n ω ) where ω < 2.3728639 [23] is the matrix multiplication exponent. Theorem 1. There is a classical algorithm with runtime O(n ω ) which solves the graph state simulation problem. 1 Guan and Regan [24] have recently established a similar result for strong simulationthat is, given a binary string z ∈ {0, 1} n , compute the output probability p(z) in Eq. (1) in O(n ω ) time. (The term strong simulation, while widely used, can be misleading-the result of Guan and Regan does not provide a classical algorithm for other simulation tasks such as the ones considered in this paper). The proof of Theorem 1 is obtained from a general algorithm for simulating multi-qubit measurements on stabilizer states which may be of independent interest, see Appendix A for details.
Theorem 1 leaves open the possibility that quantum computers may have a polynomial speedup in terms of the total runtime (number of one-and two-qubit gates) required to solve the graph state simulation problem without postselection. For sparse graphs the comparison is more dramatic-the linear time quantum algorithm beats Theorem 1 by a quadratic factor even if the matrix multiplication exponent is shown to meet its lower bound ω ≥ 2. Unfortunately we lack a suitable technique for proving that classical algorithms are more costly in terms of total runtime. On the other hand, if we turn our attention to other metrics such as circuit depth, Refs. [18,20] prove that quantum computers beat classical machines in solving the graph state simulation problem even for the special case of a (subgraph of a) 2D grid graph. These works establish a quantum advantage in the following sense: Quantum advantage for relation tasks ( [18,19]) Let G be a subgraph of the ℓ × ℓ grid, and consider a relaxation of the graph state simulation problem (without postselection) in which it suffices to output any string of measurement outcomes z ∈ {0, 1} n that occurs with nonzero probability. (Equivalently, the problem can be expressed in terms of properties of binary quadratic forms, without reference to quantum graph states, see Ref. [18] for details). This problem can be trivially solved by a constantdepth quantum circuit since since the graph has constant degree. In contrast, it cannot be solved by any constant-depth classical circuit composed of unbounded fan-in AND, OR, and NOT gates (i.e., AC 0 circuits) [18,19].
Quantum advantage for interactive tasks ( [20]) Simulation of low-depth quantum circuits can be shown to be even harder for classical devices when they are required to answer the graph state simulation problem in two rounds. Again here we consider some subgraph G = (V, E) of the ℓ × ℓ grid. The first round input is a set of Pauli measurement bases for each of the qubits in some given subset P of the qubits, and the desired output is any measurement outcome m ∈ {0, 1} |P| such that p P (m) = 0. The second round input is a set of measurement bases for the remaining qubits in S = V \ P, and the desired outcome is any string x such that p S (x|m) = 0. Of course, this problem can be solved by a constantdepth quantum computation which prepares the graph state and in each round measures the corresponding subset of qubits in the given Pauli bases. Nevertheless, any classical algorithm which succeeds at this interactive task could be used to solve any problem in the class ⊕L. This provides a strong lower bound on the type of classical circuits that can solve the interactive task, as the class ⊕L is known to be strictly more powerful than the constant-depth circuit family AC 0 . 2 Motivated by the above results, we first observe that Theorem 1 can be improved for the special case of 2D grid graphs.
Theorem 2 (Special case of Theorem 3). There is an O(n ω/2 )-time classical algorithm which solves the graph state simulation problem for any subgraph of a √ n × √ n grid.
The seemingly innocuous quadratic improvement for the graph state simulation problem on grid graphs provides a striking conclusion about the problems used for quantum advantage with low-depth circuits [18,19,20]. Although quantum circuits based on measuring grid graph states are superior in terms of depth, their advantage in terms of gate complexity is on a shakier footing. Indeed, if ω = 2, then the advantage disappears almost entirely.
Although Theorem 2 follows directly from Theorem 3 (discussed below), we now present a simpler, self-contained algorithm for grid graphs which serves as a warm-up for the more general case. Consider graph state simulation on the ℓ × ℓ grid. For simplicity we consider the case without postselection (P = ∅) although the algorithms we consider below work in the more general case with only superficial modifications. We shall also ignore fast matrix multiplication for the moment. Our starting point is the very simple classical simulation of the process of preparing the graph state-by starting with the all-zeros computational basis state, applying Hadamards to each qubit, and applying CZ gates between qubits at vertices connected by an edge-and then sequentially measuring the ℓ 2 vertices in the given Pauli bases. Simulating all O(ℓ 2 ) gates in the circuit uses a runtime of O(ℓ 4 ). Since the state has O(ℓ 2 ) qubits, simulating each single-qubit measurement also uses a runtime of O(ℓ 4 ) [17], and the total runtime of this algorithm is O(ℓ 6 ). An improvement was noticed in Ref. [18], and is based on simulating a quantum algorithm that "sweeps" from the left to the right, processing at most 2ℓ qubits at a time. The key idea is that the measurements on any given column commute with the CZ gates two columns away. Concretely, one can prepare all qubits in the first two columns of the grid, measure the qubits in the first column in the given Pauli basis, and then initialize the qubits in the third column and apply the CZ gates between the second and third column. Continuing in this way, the algorithm sweeps from left to right, storing an active set of at most 2ℓ qubits at any given time. Since we are now measuring a stabilizer state of only O(ℓ) qubits, the cost of each single-qubit measurement is reduced to O(ℓ 2 ). Thus, the classical simulation algorithm runs in O(ℓ 4 ) time. Perhaps surprisingly, this is not the ordering of CZ gates and measurements that leads to the most efficient classical algorithm. In the box containing Figure 2, we describe a recursive method for ordering the measurements in which only one active set of size O(ℓ) dominates the runtime, leading to the bound given by Theorem 2. We stress that these advantages are not merely theoretical. Figure 1 compares the runtimes of software implementations of the naïve, left-to-right sweep, and recursive algorithms using the CHP++ framework [25]. The results show a clear asymptotic advantage for the recursive algorithm, and an absolute advantage beyond a 50 × 50 grid (where the algorithm finished in under a hundredth of a second 2 We note that the postselected graph state simulation problem was already known to be ⊕L-hard [17]. While the interactive task described in Ref. [20] has a similar hardness guarantee, it can be solved by a constant-depth quantum computation.  [17,25]. See Appendix B for the implementation details. anyway). Using the recursive algorithm we were able to simulate a maximum grid size of 400 million qubits in about 19 hours on a laptop computer.
Our recursive algorithm for graph state simulation on the 2D grid employs a well-worn divide-and-conquer strategy that is commonly exploited in problems involving grid graphs. The method has its origins in the nested dissection algorithm proposed by George for solving positive semidefinite linear systems of equations with a sparsity pattern described by a grid graph [26]. This technique was later generalized to obtain fast algorithms for solving linear equations on planar graphs [27,28]. Our main result in this work establishes that such speedups are attainable for the graph state simulation problem on general planar graphs. Theorem 3. There is an O(n ω/2 )-time classical algorithm which solves the graph state simulation problem for planar graphs.
While the simulation algorithm from Theorem 3 does use a divide-and-conquer strategy, we stress that it is not simply obtained using the active set approach-based on a clever ordering of measurements and CZ gates-which we used in the case of the 2D grid. To understand one reason why, note that an n-vertex planar graph may have vertices of degree Ω(n), and so the active set would need to be at least this large at some point during the classical simulation algorithm. Even measuring a single qubit of a stabilizer state on Ω(n) qubits uses a runtime Θ(n 2 ), which exceeds the runtime bound from Theorem 3.
Instead, the divide-and-conquer algorithm from Theorem 3 is based on a data structure known as the tree decomposition of the graph. The tree decomposition consists of nodeseach representing a subset of vertices of the original graph-which are connected together in a certain way to form a tree. Loosely speaking, the more your original graph already resembled a tree, the smaller you can make the subsets at each node (for a formal definition of the tree decomposition, see Section 2.2). The size of the largest subset of vertices at a node in the tree decomposition is referred to as the treewidth. Planar graphs are special due to the fact that they admit tree decompositions of width O( √ n) that can be computed in linear time using the planar separator theorem [29]. More generally, we ask how the runtime of algorithms for graph state simulation scale with the treewidth of the graph. Here we take our inspiration from pioneering work of Markov and Shi who showed that a general (universal) quantum circuit can be simulated on a classical computer in time exponential in the treewidth of a graph derived in a simple way from the circuit [6]. The simulation is based on viewing the quantum circuit as a tensor network and choosing an ordering to contract the tensor network based on a tree decomposition. Tensor-network contraction based algorithms along these lines are especially well-suited to the simulation of near-term quantum computers with a two-dimensional qubit architecture [30,7,31,32]. Of course, our setting is somewhat different as we are interested in polynomial-time algorithms and the special case of Clifford circuits. We describe a parameterized algorithm for the graph state simulation problem for low-treewidth graphs with the following runtime guarantee.
Theorem 4. There is a classical algorithm which, given an instance of graph state simulation problem along with a tree decomposition T of G of treewidth t and at most O(n) nodes, outputs a solution in O(nt ω−1 ) time.
Due to the known difficulty of computing (even approximately) the treewidth of a graph, it may be prohibitively expensive to find a tree decomposition in general [33]. 3 We expect Theorem 4 to be most useful for families of graphs where such a decomposition can be computed efficiently.
Theorem 3 and Theorem 4 are both obtained as special cases of a general classical simulation algorithm, see Section 3 for details. At the core of this algorithm is a tree-decompositionto-circuit mapping which takes a graph G along with a tree decomposition T and outputs a Clifford circuit that solves the graph state simulation problem on G and has a structure similar to that of T . This structure allows us to simulate the circuit in a lazy way, only initializing qubits and applying gates as needed. Just like in the active set approach described above, this reduces the (dominant) cost of measurements in the classical simulation algorithm, such that when we measure a given qubit, the stabilizer state we are measuring has as few qubits as possible. The circuit begins with a quantum state which is a tensor product of few-qubit graph states, one for each leaf node of the tree decomposition. Since a given vertex of the graph may appear in multiple nodes of its tree decomposition, this initial state may have several replica qubits corresponding to each data qubit of the graph state to be simulated. This replication of data qubits is an essential difference between our approach and the active set approach. We will exploit the fact that it is possible to replicate data qubits and operate on the replicas before merging them later in the simulation. During the course of the circuit, gates are applied, qubits are measured, and replica qubits are merged together as needed using a gadget composed of Clifford gates and postselection. We exhibit a classical simulation of this Clifford circuit with a runtime which is a simple function of the tree decomposition, see Section 3 for details. We show that this runtime is upper bounded by Theorem 4 in the general case, and we also provide a sharper upper bound as stated in Theorem 3 for planar graphs.
There are several open questions related to our work. As far as we know, the graph state simulation problem without postselection on general bounded-degree graphs remains a compelling candidate for a polynomial quantum speedup. Can the classical algorithms for the latter problem be improved, or conversely is it possible to strengthen the evidence for a quantum speedup? Another direction for future work is to understand whether our techniques can be incorporated into the stabilizer-rank based classical simulation algorithms for universal quantum circuits [35,16,8]. Roughly speaking, such algorithms are based on representing the quantum state of a given non-Clifford circuit as a superposition of stabilizer states which can each be simulated using the Gottesman-Knill theorem. For this application one requires a stronger form of classical simulation-going beyond the standard tableau representation [17]-which keeps track of the global phase of the stabilizer states, since they appear in superposition. Such phase-sensitive Clifford simulators are described in Refs. [36,16,8].
It is an open question whether their runtime can be improved using (a) fast matrix multiplication methods such as the ones presented in Appendix A, and (b) tree decompositions/planarity. The remainder of the paper is organized as follows. In Section 1.1 we describe applications of Theorem 3 to the simulation of constant-depth Clifford circuits on planar graphs and an extension to more general Clifford tensor networks. In Section 2 we discuss subroutines for the classical simulation of stabilizer states using the standard tableau representation [17], and we describe improved subroutines for multi-qubit measurements using fast matrix multiplication. This section also reviews relevant background information regarding tree decompositions and treewidth. We describe the tree-decomposition-to-circuit mapping and prove Theorem 4 in Section 3. Finally, in Section 4 we give a sharper analysis of the runtime of our algorithm when applied to tree decompositions that are computed using the planar separator theorem [29] and we prove Theorem 3. The proof of Theorem 1 is provided in Appendix A.
Example: recursive algorithm for 2D grid graphs Here we describe a simple recursive classical algorithm which attains the runtime bound claimed in Theorem 2 for the graph state simulation problem on an ℓ×ℓ grid graph. The main subroutine measures (or postselects) qubits in the interior of the subgrid in the given bases, and outputs the measurement outcomes along with the post-measurement state on an "active set" consisting of the 4ℓ − 4 perimeter vertices. The key insight is that the main subroutine can be performed by recursively computing the outcomes and post-measurement states for 4 copies of the ⌊ℓ/2⌋ − 1 × ⌊ℓ/2⌋ − 1 grid which are completely contained in the interior of the ℓ × ℓ grid, separately preparing the stabilizer state on the O(ℓ) remaining qubits (including the perimeter), applying the O(ℓ) CZ gates which connect them together, and then measuring all qubits except those on the perimeter. The algorithm is illustrated schematically in Figure 2. Letting T (ℓ) denote the runtime of this algorithm, we have T (1), T (2) = O(1) and T (ℓ) = 4T (ℓ/2) + O(ℓ ω ) for ℓ ≥ 3. Here the first term contains the runtime of the four recursive calls, while the second term contains the runtime required for the remaining steps of the algorithm which only involve preparation and measurement of stabilizer states of O(ℓ) qubits. The solution to the recurrence is T (ℓ) = O(ℓ ω ) = O(n ω/2 ). In the very last stage of the algorithm the qubits on the perimeter are measured, incurring an additional O(ℓ ω ) cost.

Applications
In this section, we show that solutions to the graph state simulation problem can be leveraged to solve other more traditional circuit simulation tasks. In the most general setting, we prove an analog of the Markov and Shi tensor network contraction algorithm for Clifford tensor networks with planar geometry: Theorem 5 (informal). A random non-zero element from a planar Clifford tensor network can be sampled in O(n ω/2 ) time.
As one might expect, Clifford tensor networks are extremely expressive, being able to capture unitary operations (such as composition of Clifford gates) and non-unitary operations (such as projection onto Clifford states). We formalize both the precise definition of a Clifford tensor and the Clifford tensor simulation problem in Section 1.1.2. In fact, our theorem for Clifford tensor networks immediately gives an algorithm for the simulation of planar Clifford circuits: Theorem 6 (informal). Let C be an n-qubit, depth-d Clifford circuit with qubits located at the vertices of a planar graph and gates acting on edges of the graph. A random n-bit measurement outcome of the state C |0 n can be sampled in time O(n ω/2 d ω ).
While this result can be obtained as a corollary to Theorem 5, in Section 1.1.3 we provide a proof which directly reduces the Clifford circuit simulation task to the graph state simulation problem without postselection. Although our algorithms for graph state simulation with and without postselection have the same asymptotic running time, their practical running times are likely to be quite different due to the increased complexity associated with postselection. Indeed, in Section 3 we will provide a self-contained algorithm for graph state simulation without postselection which avoids many of the linear algebra subroutines used in the more general case. This allows one to simulate constant-depth Clifford circuits in a planar geometry in a much more straightforward manner.
We note that the two theorems described above are not an exhaustive list of applications of graph state simulation. There are many possible variations of Clifford simulation problems one might want to solve that would follow from tweaking the proofs of the theorems above. For example, Theorem 6 applies when the qubits of a Clifford circuit are arranged on the vertices of a planar graph, but one could get a similar result for circuits where the gates are placed at the vertices of a planar graph with edges between gates if the output of one gate is fed into another gate (e.g., a Clifford circuit with gates acting only on neighboring qubits). 4 Furthermore, planar graphs are not the only graphs which admit efficiently computable tree decompositions. For example, the graph state sampling problem for the n   3 ω ) time 5 using a tree decomposition for the cube of width O(n 2/3 ), and so we could obtain similar applications for Clifford circuits where the gates/qubits are arranged in a 3D grid. 4 In more detail, the underlying graph is as follows: each gate/wire of the circuit is a vertex; and there is an edge between two vertices when the wire is an input/output of the gate. 5 In fact, such an algorithm follows from the same sort of recursive algorithm used in the proof of Theorem 2: split the cube into 8 identical subcubes, recursively measure their interiors, and then measure the boundary. This leads to the recursion T (n) = 8T (n/8) + O(n

Technical ingredients
Let us describe two additional technical ingredients that are common to the applications above. First, we show that the graph state simulation algorithm for planar graphs still suffices when the graph is close to planar in some sense. To this end, we introduce the notion of coarse-graining a graph (see Figure 3). We say that a graph G ′ can be coarsegrained to another graph G if there exists a map f : We say that f is an r-coarse-graining from G ′ to G if each vertex of G is the image of at most r vertices of G ′ . Figure 3: A 5-coarse-graining from a nonplanar graph to a planar graph.

−→
The following theorem, which we prove in Section 4, shows that the graph state sampling algorithm can be extended to graphs which can be coarse-grained to planar.
Theorem 7 (Extension of Theorem 3). Let G ′ be a graph with n vertices and let G be a planar graph. Suppose we are given an r-coarse-graining from G ′ to G. There is a classical algorithm with runtime upper bounded as O(n ω/2 r ω ) which solves the graph state simulation problem for G ′ .
Next, we will need the well-known fact that measurements of quantum states can induce certain unitary transformations on the remaining qubits. This is the basis of measurementbased quantum computation [22] and the heuristic algorithm for Clifford simulation due to Anders and Briegel [37]. Bravyi has established a linear-time reduction from constant-depth circuit simulation to graph state simulation using the tableau representation of stabilizer states [38]. To keep things relatively self-contained, here we shall use a "Hadamard gadget" [13] which allows us to implement Hadamard gates using X-basis measurements (see Figure 4). We note that a similar technique was used in the recent paper of Guan and Regan [24]. Figure 4: The Hadamard gadget implements either H or XH on the single-qubit input state ψ depending on the measurement outcome. By linearity, the gadget can also be used to implement a Hadamard gate on a single-qubit of a multi-qubit state.

Clifford tensor network simulation
A Clifford tensor T of rank k is an array of 2 k complex numbers T i 1 i 2 ...i k ∈ C indexed by k binary variables i 1 , i 2 , . . . , i k ∈ {0, 1}, with the additional property that the k-qubit state is proportional to a stabilizer state. Such a tensor can be represented pictorially as in Figure 5a where each "leg" represents one of the k binary variables. A leg of tensor T (1) and a leg of tensor T (2) can be contracted by summing the corresponding binary variable as depicted in Figure 5b. A Clifford tensor network consists of a set T (1) , T (2) , . . . , T (n) of "elementary" Clifford tensors along with a set E of pairs of legs that are to be contracted. Write T for the tensor obtained from the given collection of tensors by contracting all pairs of legs in E. We shall abuse notation and use T to refer both to the tensor network and the tensor that it evaluates to. In particular, T is a Clifford tensor with rank equal to the number of uncontracted legs. If it is nonzero, then it defines a stabilizer state (up to normalization) as in Eq. (4) and we consider the problem of simulating measurement of all qubits of this state.
The underlying graph of the tensor network T is the bipartite graph with a vertex for each elementary tensor T (i) , a vertex for each pair of contracted legs e = {ℓ 1 , ℓ 2 } ∈ E (i.e., an edge/wire of the network), and edges {e, ℓ 1 } and {e, ℓ 2 } between tensor vertices and contracted leg vertices which are incident in the network. Figure 5: (a) Graphical representation of rank-3 tensor T with indices i 1 , i 2 , and i 3 . (b) Contracting rank-3 tensor T (1) with rank-4 tensor T (2) on index k gives a rank-5 tensor Planar Clifford tensor network simulation problem. The input is a Clifford tensor network T which has a planar underlying graph, and each elementary tensor has constant rank. If T is the zero tensor then the output is a flag indicating this; otherwise the output is a binary string sampled from the probability distribution obtained by measuring each qubit of |T in the standard basis.
Theorem 5. There is a classical algorithm which solves the planar Clifford tensor network simulation problem with runtime O(n ω/2 ), where n is the number of elementary tensors.
Proof. Let T be specified by elementary tensors T (1) , T (2) , . . . , T (n) and pairs of contracted legs E. Each elementary tensor defines a stabilizer state |T (i) via Eq. (4). The state defined by the tensor network |T is then given by The state |T (1) ⊗ . . . ⊗ |T (n) appearing on the right hand side has one qubit for each leg of each elementary tensor. It is an m-qubit state, where m = O(n) is the total number of legs in the tensor network. We have then used the fact that contraction of two legs is equivalent to projecting the two corresponding qubits onto the Bell pair 00| + 11|.
We now use the following lemma to express each of the elementary tensors |T (i) as a graph state with local unitaries: Lemma 8 (Theorem 1 from Ref. [39]). A k-qubit stabilizer state |φ can be expressed as |φ = C 1 ⊗ C 2 ⊗ . . . ⊗ C k |G where G is a graph state and C 1 , C 2 , . . . , C k are single-qubit Clifford unitaries.
Each elementary tensor has constant rank and therefore we may compute the single-qubit Cliffords and graph state for each elementary tensor in constant time on a classical computer. In this way, we compute graphs G 1 , G 2 , . . . , G n and single-qubit Cliffords C 1 , C 2 , . . . , C m such that Our goal is to sample a binary string from the probability distribution obtained by measuring this state in the computational basis (or determine if T = 0). Below we show that this can be recast as an instance of the graph state simulation problem with a graph G that admits an O(1)-coarse-graining to a planar graph, and then Theorem 7 completes the proof. Consider a pair of qubits e = {i, j} ∈ E and the corresponding state which appears (up to normalization) in Eq. (5). Noting that where D = C † j C i H is a single-qubit Clifford, and therefore can be written as a product of Hadamard and phase gates [36] D for some a, c ∈ {0, 1, 2, 3} and b ∈ {0, 1}. If b = 0 then for single-qubit stabilizer states |φ i = |+ and |φ j = S a+c |+ . On the other hand, if b = 1 then we can use the Hadamard gadget from Figure 4 to get where |φ i = |+ , |φ j = S a |+ , |φ e = S c |+ are single-qubit stabilizer states. For every pair of contracted legs e = {i, j} ∈ E in the tensor network we can compute b ∈ {0, 1} and the decomposition given by either Eq. (7) or Eq. (8). Let us partition the In Eq. (9) we have a single-qubit stabilizer state φ i | for each contracted leg and one φ e | for each e ∈ E 1 . Letting L ⊆ [m] be the set of all contracted legs, we have where G T is a graph with m + |E 1 | vertices, defined as follows. Start with a graph with m vertices that consists of a copy of each of the graphs G 1 , G 2 , . . . , G n . Next add an edge between every pair e = {i, j} ∈ E 0 . Finally, add a new vertex labeled e for each e = {i, j} ∈ E 1 and edges {i, e} and {e, j}.
In light of Eq. (10), to solve the planar Clifford tensor network simulation problem defined by T , it suffices to simulate single-qubit Pauli measurements on the graph state |G T and postselect on the subset of measurement outcomes corresponding to the qubits in L ∪ E 1 . This is an instance of the graph state simulation problem with the graph G T .
To complete the proof, we construct an r-coarse-graining from G T to a planar graph, derived from the underlying planar graph of the tensor network. For all 1 ≤ i ≤ n, let the coarse-graining map all vertices of G i to the vertex for T (i) in the underlying graph. Map the additional vertex created for a pair of contracted legs e = {k, ℓ} ∈ E 1 to the vertex labelled e in the underlying graph of the tensor network. If G T has no vertex for a pair of contracted legs e ∈ E, then let us remove that vertex from the underlying graph by contracting either edge incident to it; clearly the modified graph is still planar. At worst, the coarse-graining collapses all of G i to a single vertex (corresponding to T (i) ), but this is at most the rank of the tensors T (i) , which is constant.

Clifford circuit simulation
Because Clifford circuit simulation is a such a common task, this section is devoted to proving an independent reduction from graph state simulation that does not require the use of postselection. We now state a formal version of the theorem we'd like to prove: Theorem 6. Consider n qubits arranged at the vertices of a planar graph G. Let C be an n-qubit, depth d quantum circuit composed of one-and two-qubit Clifford gates such that each two-qubit gate acts nontrivially on an edge of G. There is a classical algorithm with runtime upper bounded as O(n ω/2 d ω ) which produces a sample z ∈ {0, 1} n from the output distribution of C.
Before we give a new proof of this theorem, let's sketch why it follows from Theorem 5. Notice that an n-qubit Clifford circuit can be viewed as a Clifford tensor network obtained by contracting n rank-1 tensors representing the initial single-qubit states |0 , with rank-2 and rank-4 Clifford tensors corresponding to 1-and 2-qubit Clifford gates. The underlying graph of this tensor network is not necessarily planar, but is O(d)-coarse-grained planar as can be seen by grouping all the gates acting on the same qubit. This suffices to prove the claimed bound on the running time.
Proof of Theorem 6. Suppose without loss of generality that all gates are either H, S, or CZ. We will massage the circuit into a certain canonical form along the lines of Ref. [13]. Consider a circuit C ′ which is obtained from C by inserting two Hadamard gates acting on each qubit at the very beginning of the circuit, and also at the very end of the circuit. The unitaries U C , U C ′ implemented by these circuits are the same, since Moreover, the depth of C ′ is d + 4 = O(1). Let us say that a Hadamard gate in C ′ is a middle Hadamard gate if it is in the subcircuit H ⊗n U C H ⊗n containing all gates except the initial and final layer of Hadamards. It will be convenient to write h for the total number of middle Hadamard gates in C ′ . For any binary string z ∈ {0, 1} h we also define C ′ (z) to be the quantum circuit obtained from C ′ by replacing the jth middle Hadamard gate by X z j H, for each j ∈ [h] (here we fix some arbitrary ordering of the middle Hadamards).
Now the circuit C ′ is composed of O(n) one-and two-qubit Clifford gates, and the circuit C ′ (z) is obtained from C ′ by inserting O(n) Pauli X gates in locations determined by z. It follows that, given any binary string z ∈ {0, 1} h we may in O(n)-time compute an n-qubit Pauli P (z) such that In particular, P (z) is obtained by pushing all Pauli corrections to the end of the circuit using the fact that every gate in the circuit is Clifford and therefore maps Paulis to Paulis under conjugation. We now construct a new quantum circuit by replacing all middle Hadamard gates in C ′ with the Hadamard gadget depicted in Figure 4. Each Hadamard gadget introduces one ancilla qubit, so the resulting circuit has n + h qubits in total. Let us now partition these n + h qubits as , the set A j contains qubit j as well each ancilla qubit from a gadget replacing a Hadamard gate acting on the jth qubit in C ′ . Note that |A j | ≤ d + 4 for all j ∈ [n] since C ′ has depth at most d + 4.
Since all gates in C ′ were either H, S or CZ gates, after inserting the Hadamard gadgets we are left with a circuit of the form CZ ij H ⊗n+h for some subset M ⊆ [n + h] and graph G ′ with vertices labeled by [n + h]. Moreover, there exists a (d + 4)-coarse-graining from G ′ to G. Indeed, such a coarse-graining is defined by f : [n + h] → [n] where f (k) is the unique j ∈ [n] satisfying k ∈ A j . As noted above, |A j | ≤ d + 4 for all j. One can also straightforwardly verify that We may therefore use the algorithm from Theorem 7 to obtain x ∈ {0, 1} n and z ∈ {0, 1} h such that x| ⊗ z|V |0 n+h = 0. (Note that here we only need to solve an instance of the graph state simulation problem with P = ∅.) Using the identities in Fig. 4 we then have Finally, as noted above, we may compute (in O(n) time) an n-qubit Pauli P (z) such that Letting g ∈ {0, 1} n be such that P (z)|x ∝ |g we see that g|U C |0 n = 0. Finally, we select a random s ∈ {0, 1} n and let Note that Q(s) is a random stabilizer of U C |0 n and can be computed in linear time since U C has O(n) gates. By proposition 9, the binary string y such that |y ∝ Q(s)|g is a random sample from the output distribution of C. This y is the output of the algorithm. The total runtime is O(n ω/2 d ω ) since all steps except for the graph state simulation are linear time.

Preliminaries
In this section we define stabilizer states and tree decompositions and describe some of their properties. In Section 2.1 we review the tableau representation of stabilizer states and describe the runtime of classical algorithms for updating this representation under Clifford operations. In Section 2.2 we review the definition of a tree decomposition of a graph, and we describe simple algorithms which convert a given tree decomposition into one with certain desirable features.

Stabilizer states
For a ∈ {0, 1} n , let X a be the n-qubit Pauli operator X a 1 ⊗ · · · ⊗ X an . Define Z a similarly. A general n-qubit Pauli operator P is of the form i α (−1) β X a Z b for some n-bit strings a, b and bits α, β ∈ {0, 1} describing the global phase.
Recall that an n-qubit stabilizer state ψ can be specified (up to a global phase) by a set of independent, commuting n-qubit Pauli operators {P 1 , P 2 , . . . , P n } such that P j |ψ = |ψ . Here independence means that none of the operators P k is proportional to a product of the other ones. Equivalently, an n-qubit stabilizer state is of the form |ψ = Q|0 n where Q is an n-qubit Clifford unitary, i.e., a unitary which can be expressed as a product of single-qubit Hadamard, phase S = diag(1, i), and two-qubit CZ = diag(1, 1, 1, −1) gates.
The Gottesman-Knill theorem establishes that stabilizer states can be stored and manipulated efficiently on a classical computer [15]. Throughout this work, we use a classical representation of stabilizer states and Clifford operations which is a variant of the tableau representation described by Aaronson and Gottesman [17]. In particular, we represent both the n-qubit stabilizer state |ψ = Q|0 n (up to a global phase) and the Clifford unitary Q as a tableau that lists the images of X e j and Z e j under conjugation by Q for all j, where e j ∈ {0, 1} n is the n-bit vector with a 1 in position j and 0's everywhere else. Let We arrange this conjugation information into a 2n×2n matrix M, a phase vector p ∈ {0, 1} 2n , and a sign vector s ∈ {0, 1} 2n : where A, B, C, and D are the n × n binary matrices whose elements correspond to the Pauli decompositions. We will refer to this as the tableau representation of the state/circuit. It is well known that the tableau representation can be updated efficiently if we apply a Clifford gate or measure a qubit of the stabilizer state. For example, one can easily see that the runtime of updating the tableau under the application of a 1-or 2-qubit Clifford gate is O(n). Aaronson and Gottesman [17] provide an O(n 2 ) algorithm for measuring a qubit in the computational basis, improving upon a naïve O(n 3 ) algorithm. In this work we are interested in measuring batches of qubits all at once. An application of Aaronson-Gottesman would upper bound the runtime of measuring k qubits in the computational basis as O(n 2 k). We now describe how this can be improved.
The Aaronson-Gottesman simulation algorithm can be viewed as an application of linear algebra over F 2 . In fact, Aaronson and Gottesman show that the problem of Clifford circuit simulation is ⊕L-complete, putting it in the same class as matrix inversion, iterated matrix multiplication, determinant, and other linear algebra tasks over F 2 [40]. Since the discovery of fast matrix multiplication in the 1960s, many linear algebra tasks have been sped up via a reduction to matrix multiplication. The (asymptotically) fastest currently known algorithm for matrix multiplication has a runtime O(n ω ) for ω < 2.3728639 [23]. It seems quite natural to conjecture that simulating a depth-n Clifford circuit on n qubits, and then measuring all n qubits could be accomplished in O(n ω ) time. Yet surprisingly, we find no proof of this fact in the literature. To this end, in Appendix A we will give algorithms for fast stabilizer circuit composition (Theorem 37) and state measurement (Theorem 39) using their tableau representations. These operations are described in more detail in Table 1, where we summarize all the subroutines for manipulating stabilizer states which are used in this work.
We will also need the following simple fact about stabilizer states: Claim 9. Suppose ψ is an n-qubit stabilizer state and y ∈ {0, 1} n satisfies y|ψ = 0. If R is a uniformly random stabilizer of ψ, then the bit string z ∈ {0, 1} n such that |z ∝ R|y is drawn from the distribution p(z) = | z|ψ | 2 .
Proof. Let Stab(ψ) denote the stabilizer group of ψ. The probability distribution of the random variable z is, by definition,

Classical representation of stabilizer states
An n-qubit stabilizer state |ψ is represented up to a global phase as a tableau Eq. (12) of 4n(n + 1) bits. This representation can be manipulated as follows: Append a qubit: |ψ |0 can be computed in time O(n).
Apply a gate: Any 1-or 2-qubit Clifford gate can be applied to |ψ in time O(n).
Measure: Measure a given subset of k qubits (in the Z basis) in time O(k ω−2 n 2 ).
Apply CZ gates: Apply an arbitrary collection of CZ gates in O(n ω ) time. Now suppose P ∈ Ω, so y|P = α P z| for some α P ∈ {±1, ±i}. Then y|P |ψ = y|ψ = α P z|ψ . This shows that α P does not in fact depend on P and in particular Therefore Using Eq. (14) gives |Ω| = 2 n | z|ψ | 2 and combining with Eq. (13) completes the proof.
Finally, we note that Claim 9 has the following direct corollary: Claim 10. Suppose ψ is an n-qubit stabilizer state and x, y ∈ {0, 1} n satisfy x|ψ = 0 and y|ψ = 0. Then there exists a stabilizer P of ψ such that P |x ∝ |y .

Tree Decompositions
One of our goals in this work is to give a parameterized algorithm for the graph state simulation problem, where the parameter is treewidth. Let us introduce tree decompositions and treewidth.
Definition 11. Let G = (V, E) be a graph. A tree decomposition of G is a tree T such that each node 6 is associated with a bag of vertices B i ⊆ V such that the following hold: Figure 6: (a) Planar graph on 8 vertices. (b) A tree decomposition of width 2 [41].
The width of a tree decomposition is defined as t = max i |B i | − 1, the size of the largest bag minus one. The treewidth of a graph is the smallest t for which there exists a tree decomposition of width t. Figure 6 shows an example of a tree decomposition for a graph of treewidth 2.
Although the treewidth is a natural parameter for many graph algorithms, the performance of our algorithms are more precisely captured by the sum of the bag sizes raised to some power. We introduce the notation for the p-norm of the bag sizes of a tree decomposition T . Notice that treewidth is a special case: t = T ∞ − 1.
We will often assume that the tree decomposition has a particularly nice form. We caution that while the definition below is convenient for our purposes, it may diverge from other works.

Definition 12.
A nice tree decomposition is a rooted tree decomposition where every node is either an introduce node, a forget node, or a merge node: Introduce: Nodes whose bag strictly contains its child's bag (unless it does not exist). These nodes are either leaves, or have one child.
Forget: Nodes that have exactly one child and whose bag is strictly contained in their child's bag.
Merge: Nodes that have exactly two children and whose bag is the union of its children's bags.
The the remainder of this section is devoted to proving the following theorem, which will allow us to work with nice tree decompositions containing relatively few nodes whenever it is convenient.
Theorem 13. Let G = (V, E) be a graph with n = |V | vertices, and let T be a tree decomposition of G of width t. We can compute a nice tree decomposition T ′ of G with O(n/t) nodes and width O(t) in O( T 1 ) time.
Proof. First we show that the number of nodes can be reduced to n + 1 in linear time: Given an arbitrary tree decomposition T of G, we can construct a new tree decomposition T ′ of G with the same treewidth and at most n + 1 nodes in O( T 1 ) time.
Proof. Root the tree. Contract any edge where the parent bag contains the child bag. Call the new tree T ′ .
For any edge in T ′ , there is some vertex v in the child bag that does not appear in the parent bag, otherwise we would contract the edge. This is the unique edge where v appears in the child but not the parent, since otherwise the nodes containing v would be disconnected. It follows that there are at most |V | edges, and therefore |V | + 1 nodes.
Next we show that we can convert the tree decomposition to a nice tree decomposition.
Lemma 15. An arbitrary rooted tree decomposition T can be converted to a nice tree decomposition with the same treewidth and at most a constant factor more nodes, in O( T 1 ) time.
Proof. We shall describe the transformation in two steps.
In the first step, for each node with bag B and children with bags C 1 , . . . , C k , add nodes B ∩ (C 1 ∪ · · · ∪ C k ) and B ∩ C 1 , . . . , B ∩ C k as shown in Figure 7. Contract edges where the parent and child have the same bag.
It is clear that for each edge in the tree, the bags of the endpoints are comparable, either ⊆ or ⊇. Classify nodes with zero or one children as introduce nodes or forget nodes by the direction of this containment.
The second step of the transformation is as follows. For any node with more than two children, note that it is the union of its children. It will eventually become a merge node, but first we need to reduce the number of children. To do this, we pick two children arbitrarily, give them a new merge node as a parent, and make it a child of the original parent. Of course, the new merge node bag will be the union of its children's bags (as is required for a merge node). Repeat until all nodes have at most two children. Figure 7 depicts an example of this transformation. Note that the transformation only creates nodes that are intersections of others, so the maximum bag size cannot increase. When we create merge nodes from two children, we take the union of two nodes, but both are already contained in their parent, so the new bag cannot be bigger than its parent. Thus, treewidth does not increase.
Suppose there are m bags in the tree decomposition, and thus m − 1 edges. The first step of the transformation creates at most one node per edge, and an additional node per multi-child internal node. There are at most m − 1 edges, and at most m−1 2 multi-child internal nodes (since each one is associated with at least two edges), so we add at most 3 2 m nodes in that step.
If a node of the tree has d > 2 children, then in the second step of the transformation we create new merge nodes which blow it up into a binary tree with d leaves. The leaves and root of the binary tree already exist, but we create d − 2 additional nodes. The total number of children across all nodes is m − 1, so we add at most m − 3 additional nodes.
In total, the process introduces at most 5 2 m nodes. Finally, we show how to shave off a factor of t in the number of nodes in the nice tree decomposition.
Proof. The algorithm is as follows. Recursively process all children first, then for each child that has ≤ 2t vertices in its bag, contract that child into the current node. Note that contracting an edge promotes former grandchildren to children, but we do not contract these children, only (at most) the two original children of the node.
In the original tree T , each node has at most two children because T is nice. Any bag in T ′ consists of ≤ t vertices from the original bag, plus ≤ 2t vertices from each child, for ≤ 5t total vertices. Now suppose while processing some node P , we choose not to contract an edge to a child C of P . It follows that C will appear in T ′ (since there is never another opportunity to contract C with anything) and that C has > 2t vertices in its bag. Observe that in T ′ , there can be at most t vertices in common between C and its parent P ′ , specifically the vertices in the original parent P , since otherwise some vertex would appear in disconnected bags (i.e., in two subtrees of P , but not P itself). We conclude that across each edge in T ′ , we forget at least t + 1 vertices (i.e., they appear in the child but not the parent). Since we can forget each vertex at most once, there can be at most n/t edges in the tree, and hence O(n/t) nodes.
Suppose that ϕ : G → H is a coarse-graining from a graph G to another graph H. The following Lemma allows us to compute a tree decomposition of G, given one of H. In particular, suppose T H is a tree decomposition of H. Define a tree decomposition T G := ϕ −1 (T H ) to have the same tree as T H but each bag B in T H is replaced with its preimage We now prove that T G is indeed a tree decomposition of G.
Lemma 17 (Tree Decompositions are closed under inverse homomorphism). Let ϕ : G → H be a coarse-graining from a graph G = (V, E) to graph H. Given a tree decomposition T H of the graph H, the preimage Proof. We check the defining properties of a tree decomposition explicitly: 3. Take an arbitrary vertex v ∈ V . It appears in nodes of T G corresponding to nodes of T H that contain ϕ(v). Since this is a connected subtree of T H (by the tree decomposition property of T H ), it is also connected in T G .

Algorithm for graph state simulation
In this section we describe our algorithm for the graph state simulation problem which makes use of a tree decomposition of the graph. We first recall the definition of the problem: Graph state simulation problem. The input to the problem is a graph G = (V, E) with n = |V | vertices, a measurement basis P v ∈ {X, Y, Z} for each vertex v ∈ V , a partition [n] = S ∪ P of the vertices, and a binary string m ∈ {0, 1} |P| . If the marginal probability p P (m) from Eq. (2) is zero, the output is an error flag. Otherwise, the output is a binary string x ∈ {0, 1} |S| from the conditional distribution p S (x|m) defined in Eq. (3).
Our goal in this section is to prove the following theorem: There is a classical algorithm which takes as input an instance of the graph state simulation problem along with a nice tree decomposition T of the given graph G. The algorithm outputs a solution in time O( T ω ω ). Theorem 4 from the introduction (restated below) follows immediately. Later, in Section 4 we will show how the graph state sampling algorithm for planar graphs (Theorem 3) also follows from Theorem 18. Thus, Theorem 18 can be viewed as our main result in this work.
Algorithm overview Let us first describe the high level structure of the algorithm described in Theorem 18. It is based on a quantum circuit which is constructed from the given tree decomposition. The circuit, which we denote by C, is obtained by replacing each node of T with a gadget based on the node type, i.e., introduce, forget, or merge. The circuit acts on a total of n t := n + n a many qubits, where n = |V | is the number of data qubits and n a is the number of ancilla qubits. All qubits are initially prepared in the state |+ ≡ 2 −1/2 (|0 + |1 ). The ancillas are all associated with the merge gadgets and so we refer to them as merge ancillas. We will see that, if we apply the circuit and then project all merge ancillas into the all-zeros state, the resulting state of the n data qubits is the graph state to be simulated: Here the right-hand side is of course independent of the tree decomposition. To solve the graph state simulation problem it therefore suffices to classically simulate the following procedure: apply C, project all merge ancillas onto the all-zeros state, project all data qubits in the set P according to the given measurement bases/outcomes, and measure all data qubits in the set S in the given Pauli bases. Our algorithm is a slight variant of this and is based on (a classical simulation of) of two steps: a sampling subroutine, followed by a correction subroutine.
In the first step, which we call the sampling subroutine, we apply C, measure all merge ancillas in the computational basis, and measure all data qubits in the given Pauli bases {P v } v∈V . This produces a bit string y ∈ {0, 1} nt drawn from a distribution p(y) = | y|(U bases ⊗ I)C|+ nt | 2 .
Here we write U bases for the n-qubit Clifford unitary which locally changes the basis of each data qubit to the given Pauli bases {P v }, such that In particular, U bases is a tensor product of single-qubit identity gates, Hadamard gates, and which map the computational basis to itself, the X basis, and the Y basis respectively. A crucial feature of C, as we will see in more detail below, is that its structure mirrors that of the tree decomposition. We show how to classically simulate the circuit in a "lazy" way whereby gates are applied from the leaves up to the root of the tree and qubits are measured as soon as possible (after all operations on them have been applied). At any point in the simulation the state is separable under a partition reflecting the structure of the tree decomposition; we will show how this leads to a runtime of O( T ω ω ) which can be far less than the naïve simulation based on the Gottesman-Knill theorem.
The output y ∈ {0, 1} nt of the sampling subroutine will typically fail to have all the merge ancillas in the all-zeros state, and thus will not directly provide a solution to the graph state simulation problem (since Eq. (15) will not be satisfied). In addition, y may disagree with the given measurement outcomes for the data qubits in the set P to be postselected. We will fix both issues in the correction subroutine of the algorithm. The goal of the correction subroutine is to sample a uniformly random n t -qubit Pauli P cor such that P cor (U bases ⊗ I)C|+ nt ∝ (U bases ⊗ I)C|+ nt (17) and P cor |y ∝ |z ⊗ |0 na z| P = m.
Here the notation z| P = m means that z ∈ {0, 1} n matches the desired postselected outcomes (i.e., z agrees with m on the qubits in the set P). The runtime used in our algorithm to compute P cor is O( T ω ω ). For the special case where postselection on the data qubits is not needed (i.e., P = ∅), we also provide a simplified algorithm with a faster O(n t + |E|) runtime.
Plugging Eqs. (17,18) into Eq. (16) and using Eq. (15) we see that Therefore, the binary string z defined by P cor |y ∝ |z is a valid measurement outcome for the given graph state simulation instance, i.e., it occurs with nonzero probability if the graph state is measured in the given bases, and matches the specified postselection outcomes. To show that z is sampled from the correct probability distribution, which is uniform over all valid measurement outcomes, we appeal to Claim 10. Suppose z ′ is another valid measurement outcome for this given graph state simulation instance. Claim 10 states that there exists a Pauli P in the stabilizer group of (U bases ⊗ I)C|+ nt such that P |z ∝ |z ′ . Since P cor is a uniformly random stabilizer subject to Eqs. (17,18), the stabilizer P P cor occurs with equal probability. By definition, sampling P cor in the correction step leads to output z, whereas sampling P P cor leads to output z ′ . Therefore any two valid measurement outcomes z, z ′ are equally likely and we have shown that the output of the correction step is uniformly sampled from the set of valid measurement outcomes.
In the remainder of this section we fill in the details of the algorithm described above. In Section 3.1 we describe the circuit C and establish Eq. (15). In Section 3.2 we describe the sampling subroutine, show that it outputs a bit string y sampled according to Eq. (16), and establish the runtime bound O( T ω ω ). Finally, in Section 3.3 we describe the correction subroutine, and show that it outputs a random Pauli P cor satisfying equations (17) and (18) using a runtime O( T ω ω ).

The quantum circuit C
Given a nice tree decomposition T of a graph G, let us consider a quantum computation obtained by replacing each node with a gadget based on its type-introduce, forget, or merge. The gadgets depend on both the node and its children, so let A be bag of the current node, and let B 1 , B 2 be the bags of its children, or just B for a single child, or B = ∅ if no children.
The gadgets are depicted in Figure 8 and described below: Figure 8: Examples of the three gadgets for translating tree decomposition T to circuit C.
Introduce: Add a qubit in state |+ for each vertex in A\B, and pass everything else through.
Forget: Set aside qubits B\A (i.e., do not pass them to the parent), but first, we apply CZ gates for any edges in B incident to B\A.
Merge: For each vertex v ∈ B 1 ∩ B 2 , apply a CNOT from the qubit v ∈ B 2 to qubit v ∈ B 1 , then measure the latter qubit, a merge ancilla, in the Z-basis.
We will always assume that the root bag of T is empty. See Figure 9 for an example of the entire construction. We define the quantum circuit C to be the unitary part of the circuit constructed in this way, which contains only CZ and CNOT gates and does not contain any measurements. So the quantum computation described above and depicted in Figure 9 consists of preparing all qubits in the |+ na state, applying C and then measuring all merge ancillas in the computational basis. Note that the total number of qubits in the circuit is upper bounded as Total number of qubits in C = n t = n + n a ≤ T 1 .
The circuit contains a CZ gate for every edge in the graph and a CNOT gate for each merge ancilla. Therefore the total number of gates in C is Total number of gates in C = |E| + n a .
Theorem 19. Let G be a graph with a nice tree decomposition T . If all the merge ancillas are measured in the state |0 , then the state on the remaining qubits is precisely |G .
Proof. We will prove the following more general statement: Claim 20. Let T ′ be a subtree of T , and let C ′ be the portion of C corresponding to T ′ . If all merge ancillas in C ′ are measured in the state |0 , the state of the remaining qubits is a graph state |G ′ , where G ′ = (V ′ , E ′ ), V ′ is the set of all vertices of G that appear in any

DGH DG
Introduce node Forget node Merge node bag of T ′ , and E ′ is the set of edges of G between vertices in V ′ with at least one endpoint not in the root bag of T ′ .
The proof proceeds by induction on the root node of the subtree. If the root is an introduce node, the proof is trivial. In this case the gadget initializes some qubits in the |+ state. Appending some qubits in the |+ state to any graph state is equivalent to adding some disconnected vertices to the graph. This also covers the base case, since all leaves are introduce nodes.
If the root is a forget node, then the set of vertices in the root bag is reduced from child bag to parent bag. By the inductive hypothesis, the graph state prepared by the circuit for the child subtree is only missing edges where one or both endpoints are forgotten, and both endpoints are in the child bag. These are precisely the CZ gates we apply in the gadget.
For merge nodes, we will argue that CZ gates can be pushed through the gadget. Consider the following two circuit identities describing how CNOT moves past CZ.

= =
Putting these together, we see that we can push CZ gates through the merge gadget when we postselect on the merge ancilla being measured in state |0 .
By the inductive hypothesis, the circuits for the two child subtrees produce graph states |G 1 and |G 2 , which are built by applying CZ gates to |+ states. The argument above shows that instead of applying CZ gates and then the merge gadget, we can perform the merge gadget first and then the CZ gates (between the same pairs of vertices). Note that applying the merge gadget to |+ states produces a |+ state, effectively deleting one of the two duplicate qubits:

|+ |+
To complete the proof, we establish that the result is a graph state |G ′ which contains all edges of G which do not have both endpoints in the root bag of T ′ . To justify this claim, let us define A to be the root bag of T ′ with children B 1 and B 2 . Let G 1 = (V 1 , E 1 ) such that E 1 does not contain any edges with both endpoints in B 1 , and similarly, let G 2 = (V 2 , E 2 ). Clearly V ′ = V 1 ∪ V 2 by definition. We claim that E 1 ∩ E 2 = ∅. Then E ′ = E 1 ∪ E 2 follows from the above discussion , i.e., the fact that the CZ gates can be pushed through the merge gadgets.
We now show that E 1 ∩ E 2 = ∅. Suppose to reach a contradiction that there is an edge {u, v} ∈ E 1 ∩ E 2 . Since every vertex in a tree decomposition induces a connected subtree, we must have u, v ∈ A. Since u, v themselves are both contained in V 1 and V 2 , the connectivity property also implies that u, v ∈ B 1 and u, v ∈ B 2 . This is a contradiction as E 1 does not contain edges with both endpoints in B 1 .
Finally, we show that E ′ = E 1 ∪ E 2 is precisely the set of edges of G that appear in First, we claim that E 1 ∪ E 2 contains all such edges. Suppose, to reach a contradiction, that there is an edge {u, v} with u, v ∈ V 1 ∪ V 2 , where vertices u, v are not both contained in A, but {u, v} / ∈ E 1 ∪ E 2 . Without loss of generality, there are two cases. In the first case we suppose u, v ∈ V 1 . Then {u, v} ∈ E 1 unless u, v ∈ B 1 . But, u, v ∈ B 1 implies u, v ∈ A and we have reached a contradiction. In the second case we suppose u ∈ V 1 \V 2 and v ∈ V 2 \V 1 . Then u, v ∈ A by the subtree connectivity property and the fact that u and v must appear together in at least one bag of the tree decomposition, and again we have reached a contradiction. Therefore E ′ contains all edges {u, v} of G with both endpoints in V ′ and such that at least one endpoint is not in A.
Let us now prove the other direction. Suppose to reach a contradiction that there is an edge {u, v} ∈ E ′ = E 1 ∪ E 2 with u, v ∈ V ′ and such that both endpoints u, v are in A. Then by the subtree connectivity property and the fact that u, v are contained in both V 1 and V 2 , we have u, v ∈ B 1 and u, v ∈ B 2 . However, u, v ∈ B 1 implies {u, v} ∈ E 1 , and similarly for E 2 . Therefore, {u, v} / ∈ E 1 ∪ E 2 , and we have reached a contradiction, completing the proof.

The sampling subroutine
The goal of the sampling subroutine is to (classically simulate) the following process: initialize n t = n+n a qubits in the |+ state, apply the circuit C (described in the previous section), measure all n data qubits in the given Pauli bases {P v } v∈V , and measure all n a merge ancillas in the computational basis. The output of the sampling subroutine is a binary string y ∈ {0, 1} nt sampled from the distribution Eq. (16).
The sampling subroutine is a recursive algorithm which is described in Algorithm 1. In particular, the desired output y ∈ {0, 1} nt is obtained by running Algorithm 1 with the given graph G, and tree decomposition T (recall it has an empty root bag). The algorithm should be interpreted as a classical algorithm in which all stabilizer states are represented using tableaux and all operations are performed using the update rules summarized in Table 1.
Proof. Algorithm 1 recursively simulates C gadget by gadget up the tree decomposition, so it suffices to upper bound the runtime for simulating each gadget using the stabilizer simulation algorithms summarized in Table 1. Let A be the bag of the current node, and B 1 , B 2 be the bags of its children (or B if there is only one child).
Introduce node: Initializing |A\B| many qubits in the |+ state uses a runtime O(|A| 2 ).
Forget node: First we apply all CZ gates for edges which touch forgotten vertices in B, which takes O(|B| ω ) time. We then measure qubits at vertices in B\A, requiring   if T is a single node then 3: if root(T ).type = Introduce then 5: |ψ ← SampleGraph(S) 6: return |ψ ⊗ v∈A\B |+ v

9:
Apply CZ to |ψ for each edge {u, v} ∈ E(G) with u ∈ B and v ∈ B\A

10:
For each v ∈ B\A, measure corresponding qubit of |ψ in P v basis 11: return post-measurement state 12: if root(T ).type = Merge then 13:

16:
For each such v, measure the corresponding qubit of |ψ 1 in P v basis 17: return post-measurement state Merge node: We apply CNOT gates between O(|B 1 ∩ B 2 |) pairs of qubits at cost O(|B 1 ∩ B 2 | ω ). Then we measure a qubit from each pair, at a cost In each of the above cases, we can conservatively upper bound the cost as the sum of the ωth power of all the bags involved. Since each bag is involved in at most two gadgets (once as a child, once as a parent), the cost for the entire circuit is at most O( T ω ω ) as claimed.

The correction subroutine
The sampling subroutine from the previous section outputs a binary string y ∈ {0, 1} nt sampled from the distribution Eq. (16). The correction subroutine takes this y as input and computes a uniformly random n t -qubit Pauli P cor such that Eqs. (17,18) hold, or reports that no such Pauli exists. In the former case, the output of the correction subroutine is the binary string z ∈ {0, 1} n from Eq.(18) which solves the graph state simulation problem, i.e., it is sampled from the conditional distribution obtained by measuring all qubits of the graph state |G in the given Pauli bases {P v } v∈V and postselecting on the given outcomes m for qubits in P. In Section 3.3.1 we provide a description of the correction subroutine, which completes our description of the classical algorithm for graph state simulation and the proof of Theorem 18. In some applications, such as the Clifford circuit simulation task described in Section 1.1.3, one may be interested in the special case of graph state simulation without postselection. For this special case we provide a simplified correction subroutine with an improved runtime, in Section 3.3.2.

Correction subroutine: general case
There are two types of qubits for which our correction algorithm has strict requirements: postselected qubits (i.e., those qubits in the set P) and merge ancillas. For the merge ancillas, we enforce that their measurement outcomes are 0. For postselected qubits, we enforce that their measurement outcomes coincide with some desired values specified by the input to the problem. Nevertheless, in our analysis below we will treat each type of qubit equally since our correction subroutine will not need any properties that distinguish the two cases. Our goal is to compute a uniformly random Pauli P cor that stabilizes the state (U bases ⊗ I)C|+ nt (up to a sign) and whose X-component either flips or leaves unchanged all bits of y corresponding to merge ancillas and postselected qubits, depending on its desired measurement result (see Eq. (18)).
Let us represent n-qubit Pauli operators using the homomorphism mapping αX a Z b to the 2n-bit vector (a, b). The phase α is not needed because it will not affect measurement. Under this homomorphism, the Pauli group maps to F 2n 2 , a subgroup of the Paulis maps to a linear subspace of F 2n 2 , and a coset of a Pauli subgroup maps to an affine subspace of F 2n 2 . When searching for a Pauli stabilizer with particular X-and Z-components, we represent the desired components by a pattern π = (π (X) , π (Z) ) ∈ {0, 1, * } 2n , where * indicates that the X-or Z-component can be either 0 or 1. We associate the pattern π with the set of all allowable outcomes We say that a Pauli X a Z b respects π if (a, b) ∈ Π. The rest of this section is devoted to proving the following theorem: Theorem 22. Let a nice tree decomposition T of some n-vertex graph G be given. Let Stab ⊆ F 2nt 2 be the linear subspace describing the stabilizer group of (U bases ⊗ I)C|+ nt , and let Π ⊆ F 2nt 2 be the affine subspace describing a given pattern π. There is an O( T ω ω ) time algorithm that samples a uniformly random element of the intersection Stab ∩ Π, or reports that no such element exists.
While our algorithm will make use of the structure of the tree decomposition, most of the required operations are in fact independent of T . Consider then any n-qubit stabilizer state |ψ . We will define subsets of the stabilizer group of |ψ that are given by affine 1} n×N , vectors a, b ∈ {0, 1} n , and N ≤ 2n (note that if the latter condition did not hold then some of the columns of ( A B ) would be linearly dependent). Here, A and a correspond to the Xcomponent of the stabilizer, while B and b correspond to the Z-component. For example, the affine subspace for the entire stabilizer group of |ψ is such that each column of ( A B ) is the bit representation of a stabilizer generator of |ψ and a, b are both zero. We say that A is an affine stabilizer subspace.
To find the elements of an affine stabilizer subspace A that also respect the pattern π we must find all solutions x to We can write this more conventionally as the rectangular system of linear equations where A ′ , B ′ , a ′ , b ′ and π ′ are obtained by removing row i from ( A B ), ( a b ), and π whenever π i = * . This system can be solved using the following lemma: Lemma 23. Let C ∈ {0, 1} k×ℓ and d ∈ {0, 1} k be given for some k and ℓ. There is an O(k ω + ℓ ω )-time algorithm which determines if the system of linear equations Cx = d has a solution. If so, the algorithm reports its affine solution space, that is, a matrix E ∈ {0, 1} ℓ×ℓ and vector f ∈ {0, 1} ℓ such that Proof. First construct a generalized inverse 7 C g of C in O(k ω−1 ℓ + kℓ ω−1 ) time using an algorithm of Ibarra, Moran, and Hui [42]. This algorithm can be used directly if k ≤ ℓ; otherwise, if k > ℓ, we may find a generalized inverse of C T and then use the fact that C T (C T ) g C T = C T implies C((C T ) g ) T C = C. We have that Cx = d has a solution iff C g d = d, and furthermore, if a solution exists, then the solution space is given by [43] {(I − C g C)z + C g d : z ∈ {0, 1} ℓ }.
Setting E := I − C g C and f := C g d, which can be computed in time O(k ω + ℓ ω ) and O(ℓ 2 ), respectively, gives the lemma.
We arrive at the following important fact: enforcing the pattern π onto an affine stabilizer subspace returns another affine stabilizer subspace. In fact, taking tensor products and applying unitaries both map affine stabilizer subspaces to affine stabilizer subspaces, and tracing out qubits maps affine stabilizer subspaces to affine subspaces. For two states |ψ 1 and |ψ 2 with affine stabilizer subspaces A 1 and A 2 , respectively, the Paulis P 1 ⊗ P 2 that stabilize |ψ 1 ⊗ |ψ 2 (up to a sign) and also satisfy P 1 ∈ A 1 , P 2 ∈ A 2 are given by the direct product Notice that if A 1 and A 2 describe the stabilizers of |ψ 1 and |ψ 2 that respect some patterns π 1 and π 2 , then A 1 ⊕ A 2 contains exactly the stabilizers of |ψ 1 ⊗ |ψ 2 respecting π 1 ⊕ π 2 .
Recall that stabilizers of |ψ are mapped to stabilizers of U |ψ under conjugation by U: Therefore, an affine stabilizer subspace A is mapped to the affine stabilizer subspace UAU † for the state U |ψ by "conjugating" the columns of ( A B ) and the displacement vector ( a b ), i.e., associating each vector with a Pauli element, conjugating by U, and then reverting back to the bit representation. By Lemma 35, this can be done in time O(n ω ) by multiplying ( A B ) T by the tableau for U. Furthermore, notice that if elements of A respect the pattern π and the unitary U only touches qubits for which π i = * , then the Paulis in UAU † still respect π.
Suppose now that we want to restrict A to some subset of the qubits in |ψ . To remove the ith qubit from A, we simply remove the ith row of A, B, a, and b. After eliminating ℓ such qubits, we obtain the affine space: for matrices A ′ , B ′ ∈ {0, 1} (n−ℓ)×N and vectors a ′ , b ′ ∈ {0, 1} n−ℓ . Unfortunately, N may now be quite large in comparison to the total number of qubits n − ℓ. If N > 2(n − ℓ) we apply another algorithm of Ibarra, Moran, and Hui to find a maximal set of independent columns of A ′ B ′ in time O(n ω ), so that our representation of A ′ becomes manageable once again [42]. Notice that unitaries may still be applied after tracing out qubits, so long as they act only on the remaining qubits. Similarly, patterns may be enforced if they are indexed only by the remaining qubits. We can also take tensor products after tracing out qubits, though the resulting affine space will only represent the restriction of A 1 ⊕ A 2 to a subset of qubits.
We summarize the discussion above in the following claim: Claim 24. Let |ψ 1 and |ψ 2 be n 1 -and n 2 -qubit stabilizer states. Let A 1 ⊆ {0, 1} 2k 1 and A 2 ⊆ {0, 1} 2k 2 be affine subspaces for the stabilizers of |ψ 1 and |ψ 2 respecting the patterns π 1 ∈ {0, 1, * } 2n 1 and π 2 ∈ {0, 1, * } 2n 2 , and restricted to subsets of k 1 and k 2 qubits. There are algorithms to calculate the affine subspace A arising from each of the following operations: • Enforcing pattern π ∈ {0, 1, * } k 1 on A 1 : A ⊆ A 1 contains exactly those Paulis that are both in A 1 and respect pattern π. (Runtime O(k ω 1 ).) • Applying a 2 k 1 -qubit unitary U to qubits in A 1 : U † AU = A 1 . Furthermore, A respects pattern π 1 if U only acts on * -entries of π 1 . (Runtime O(k ω 1 ).) • Restricting to qubits in some subset S ⊆ [k 1 ] from A 1 : A is the set of Paulis i∈S P i for which there exists a Pauli i∈[k 1 ] P i in A 1 . (Runtime O(k ω 1 ).) • Taking tensor products: A = A 1 ⊕ A 2 contains exactly those Paulis which are tensor products of elements of A 1 and We are now ready to prove the main theorem for this section.
Proof of Theorem 22. Let C ′ = (U bases ⊗ I)C for ease of notation. Recall that our goal is to randomly sample an element of Stab ∩ Π (or report that the intersection is empty). A naïve way to do this would be to compute Stab in its entirety, enforce π using Lemma 23, and randomly sample an element of the solution space. This O(n ω t )-time approach is far too slow. Instead, we use the tree structure of C ′ to compute the stabilizers of C ′ |+ nt while simultaneously enforcing π. In Appendix C we work through an example of the algorithm described below.
We begin by defining an affine space A B for each bag B of the given tree decomposition T . Consider the subtree T B of T rooted at B. Let us write C| B and C| T B to denote the portions of C corresponding to B and T B , respectively. For example, for the root node R of the tree decomposition in Figure 9, C| R consists of three wires and three CZ gates; and for the leftmost merge node M, the circuit C| T M consists of six wires, four CZ gates, and one CNOT gate. Define C ′ | B and C ′ T B analogously. Let |ψ B = C ′ | T B |+ nt . First consider an affine space Y B which describes the stabilizers of |ψ B whose X-and Z-components agree with π on all qubits that appear only in C ′ | T B . These are the qubits which are merged or forgotten in C ′ | T B and are therefore untouched by all operations that occur later in the circuit C ′ . Now define A B to be the restriction of Y B to coordinates of qubits in C ′ | B . For example, Y R = Stab ∩ Π and A R ⊆ F 6 2 . We construct A B inductively, working our way up the tree. Each node inherits an affine subspace from its children (unless it is a leaf node), and the first step will always be to restrict that affine subspace to the qubits of C ′ | B . For an introduce node in which N qubits are initialized, we create the affine space I = {( I 0 )x + ( 0 0 ) : x ∈ {0, 1} N } for the space of stabilizers of the |+ N state. If B has a child C, then A B = A C ⊕ I; otherwise, A B = I. For a forget node, we apply the CZ gates and local operators of C ′ | B , and then enforce π on all qubits being forgotten in B. For a merge node we take the direct product of the affine spaces for each child, apply the CNOT gates of C ′ | B , and enforce π on the merged qubits.
We continue up the tree until we reach the root node R. If at any point during this process we encounter an affine space that is empty, we report that a stabilizer matching π does not exist. Otherwise by definition A R contains the restriction of such a stabilizer to coordinates of qubits in C ′ | R . All that remains is to reconstruct the remaining coordinates using the other affine spaces. To do so, we process the bags in reverse order. Let C be the child of R. By definition, a stabilizer P cor of C ′ |+ nt respects π iff its restriction to qubits in C ′ | R is in A R . With this in mind, we choose a uniformly random element (a, b) from A R and set X a Z b to be the tensor element of P cor corresponding to qubits of R. To reconstruct the remaining components of P cor , we push (a, b) backward through ; randomly sample an element (e, f ) of A C respecting the pattern (c, d); and set the jth tensor element of P cor to be X e j Z f j , for each qubit j being merged or forgotten in C ′ | C . We continue in this way down the tree, always randomly sampling an element of the affine stabilizer subspace of the child node that preserves choices made higher up in the tree.
The number of qubits that we handle at any given moment is never more than the size of the bag being operated upon plus that of its children. The only operations needed to construct A B are those in Claim 24. The time needed for each of these operations is at most the size(s) of the bag(s) involved raised to the ωth power. All bags contribute to the runtime at most twice. Therefore all affine spaces can be constructed in time O( T ω ω ). Similarly, the reconstruction of P cor from (a, b) also takes time O( T ω ω ). Finally, we must show that P cor is chosen uniformly from Stab ∩ Π. Let us start with a general fact about sampling affine spaces. Consider some affine space B ⊆ F n 2 and let B S ⊆ F (1) S 2 ; and so on. Let z be obtained by combining all the z i into a single string of length n. Since each z i term was chosen such that it never conflicted with z j for j < i, the choice of z is unambiguous.
We claim that z is a uniformly random element of B. After the first iteration, we only want to sample strings of B where the coordinates in S 1 are restricted to z 1 . As we've seen before, this imposes a linear constraint on B, and so this new space (B (1) ) is still affine by Lemma 23. In fact, we see from that lemma that the size of the solution space is independent of our choice of z 1 (provided, of course, that z 1 ∈ B S 1 ). In the second iteration, we sample uniformly from B 1 restricted to the coordinates in S 2 . And so on. It follows that the number of different z j that occur with nonzero probability in the jth step is independent of the specific choices of z 1 , . . . , z j−1 . Therefore, the final string z chosen from B has the same probability of being chosen as any other string z ′ ∈ B.
We now only have left to show that this is actually what is occurring in our correction protocol. At the root node, we sample uniformly from the affine space A R which, by construction, is equal to Stab ∩ Π restricted to qubits which are forgotten in the root bag. Further down in the tree, when nodes are forgotten or merged, we only sample strings which are consistent with measurements made higher up in the tree. Recall that the qubits that are measured in a particular bag are never touched again higher in the tree. Therefore, the affine space for a bag B (i.e., the space A B ) restricted to the qubits that are measured in that bag is equal to the affine space for the entire circuit Stab ∩ Π restricted to the qubits that are measured in that bag. For the qubits in A B that are not measured, we restrict them to the measurement results sampled higher in the tree. In some cases, these measurement results are obtained after applying some unitary. However, the application of that unitary is equivalent to applying an invertible linear transformation on the unmeasured qubits of A B . Applying the inverse of this transformation to the outcome sampled higher in the tree gives the unique string over the unmeasured qubits of A B consistent with higher measurements. Since we sample uniformly from A B restricted to this string, the theorem follows.

Simplified correction subroutine for graph state simulation without postselection
The correction procedure for the graph state simulation problem without postselection follows almost immediately from the following structural theorem about the circuit C: In particular, the Pauli P cor that corrects the output y ∈ {0, 1} nt of the sampling routine is computed as P cor = (U bases ⊗ I) Q(y) U † bases ⊗ I .
Since Q(y) stabilizes C|+ nt , we get that P cor stabilizes (U bases ⊗I)C|+ nt , that is, P cor satisfies Eq. (17). Furthermore, notice that the X-component of P cor on the merge ancillas is equal to y on the merge ancillas, that is, P cor satisfies Eq. (18). Therefore, P cor is a valid correcting Pauli, computable in O(|E| + n t ) time.
Unlike the more general postselection procedure, we have not generated a uniformly random correcting Pauli. Of course, the measurement outcome z ∈ {0, 1} n obtained in this way is still valid. So, by Claim 9, we can obtain a uniformly random sample in O(n) time by choosing a random stabilizer R of the state |ψ = U bases |G . A random stabilizer R of ψ can be computed straightforwardly in time O(n + |E|). 8 Proof of Theorem 25. Since we wish to construct Q(a) to be a stabilizer of C|+ nt , we have that C † Q(a)C will be a stabilizer of |+ nt . All stabilizers of the latter state have trivial Z-component, so for some p ∈ {0, 1} nt . Now recall that C contains only CZ and CNOT gates, which both have the property that when we conjugate a Pauli by them, the Z-component going in does not affect the X-component coming out (contrast with say, Hadamard, which interchanges X and Z). Using this fact in Eq. (25) we get which determines p. Finally, we may compute Thus, to compute Q(a) we need to conjugate two Paulis by the circuit C (cf. Eqs. (26,27)). The runtime of this algorithm is proportional to the number of gates in the circuit which, using Eq. (21), is equal to |E| + n a .

Improved algorithm for planar graphs
In this section we describe an improved algorithm for the graph state simulation problem in the special case of planar graphs. In particular, we prove Theorem 3 and its slight generalization Theorem 7, restated below.
Theorem 3. There is an O(n ω/2 )-time classical algorithm which solves the graph state simulation problem for planar graphs.
Theorem 7 (Extension of Theorem 3). Let G ′ be a graph with n vertices and let G be a planar graph. Suppose we are given an r-coarse-graining from G ′ to G. There is a classical algorithm with runtime upper bounded as O(n ω/2 r ω ) which solves the graph state simulation problem for G ′ .
The above results are obtained as a consequence of our more general algorithm for graph state simulation (Theorem 18). That result allows us to solve the graph state simulation problem for G, given a tree decomposition T of G, using a runtime O( T ω ω ). It is known that planar graphs have treewidth t = O( √ n), and there are linear-time algorithms to construct a width-O( √ n) tree decomposition of a planar graph. This fact, combined with Theorem 18, leads directly to a worst-case O(nt ω−1 ) = O(n (ω+1)/2 ) algorithm for planar graph state simulation. In this section we present a more careful analysis which shows that on the tree decompositions that we construct (recursively), the algorithm from Theorem 18 actually runs in O(n w ′ /2 ) for any w ′ > w. The intuition is that any subgraph of a planar graph is again planar, so an m-vertex subgraph of an n-vertex graph has treewidth O( √ m), not O( √ n). Thus, we expect bags to get smaller as we descend the tree. This leads to a better upper bound on the runtime O( T ω ω ). We shall use tree decompositions which are constructed recursively from planar graph separators.
Definition 26. Given a graph G = (V, E), a separator is a subset of the vertices S ⊆ V which divides the remaining vertices V \S into two disconnected parts, A and B.
It is known that any planar graph has a separator of size |S| = O( √ n), and Lipton and Tarjan showed how to find one in linear time.
Theorem 27 (Lipton and Tarjan [29]). Given an n-vertex planar graph G, there is an O(n) algorithm to compute a separator S of size |S| = β √ n that divides the remaining vertices into two disconnected parts A and B, of size |A|, |B| ≤ αn, where α = 2 3 and β = 2 √ 2.
It is well known that one can compute a tree decomposition of width O( √ n) using a runtime O(n log n) by recursively applying these separators, as in Algorithm 2 (ComputeTD). This can be improved to linear time with clever data structures [44], though we will not use this improvement here.
. . . The rest of this section is devoted to analyzing the runtime of ComputeTD, the properties of the tree it creates, and the performance of our graph state simulation algorithm when given this tree.
Algorithm 2 Separator to Tree Decomposition Algorithm. Makes use of a subroutine Separator which takes as input a planar graph and outputs a planar separator with parameters α, β.
Input: Graph G, subset U ⊆ V (G) Output: A tree decomposition of G with U contained in the root bag return forget node U above a merge node S ∪ U with children T A and T B We begin by verifying that ComputeTD is correct, and upper bounding the width of its output.
Lemma 28. Given a planar graph G and a subset U of its vertices, the tree T computed by ComputeTD(G, U) is a tree decomposition of G with root bag U and width O(|U| + √ n).
Proof. We claim by induction that ComputeTD(G, U) produces a valid tree decomposition where the root bag is U. We leave the base case for the reader to check. In the recursive case, the tree decomposition has the form depicted in Figure 10 and we see that • every vertex appears in some bag, since every vertex belongs to A ∪ S or B ∪ S, and by induction the two subtrees contain those vertices, • every edge appears in some bag, since all the edges in A ∪ S appear in T A , all the edges in B ∪ S appear in T B , and there are no edges between A and B because of the separator property, • nodes containing v ∈ V form a connected component because they are connected in T A and T B , if v appears in both T A and T B , then it follows that v ∈ S and appears in the merge node above T A and T B , and if v belongs to the merge node S ∪ U, then it also appears at the root of T A (or T B ) if it appears in that subtree at all.
It is also clear by construction that U is the root node of the new tree, finishing the induction. Notice that the recursive calls are on graphs G| A∪S and G| B∪S of size where we used n 0 as specified in the algorithm, and the fact that α > 1/2. It follows that the graph input to a recursive call at depth ℓ will have at most α ℓ (1 + ǫ) ℓ n vertices. Note that our choice Eq. (28) ensures that α(1 + ǫ) < 1.
The bag size can only increase from parent to child by at most |S|, so a bag in the ℓth level of recursion has size which has increased by at most the sizes of ℓ planar separators. Since the size of each planar separator is at most β times the square root of the number of vertices in the graph, the number of elements gained from ℓ levels of separators is therefore at most We conclude that every bag has at most O(|U| + √ n) vertices.
We now bound the runtime O( T ω ω ) which appears in Theorem 18, for tree decompositions T produced by ComputeTD.
Lemma 29. Let G be a planar graph with n vertices, and let T be the tree decomposition computed by ComputeTD(G, U) where |U| = ℓ. For constant c > 2, Proof. It is a property of p-norms that T p ≤ T q whenever p ≥ q > 0. In particular, since c > 2 we have T c ≤ T 2 . It suffices to show that T 2 as desired. The crux of the proof is to reduce to the case where U = ∅. That is, consider the tree T ′ ← ComputeTD(G, ∅) that we get by passing ∅ instead of U as the second argument. The tree structure T and T ′ will be identical except for the content of the nodes, since they will get the same separators, recurse on the same graphs, and stop at the same base cases (in other words, the control flow of ComputeTD does not depend on U). However, every node in T ′ will be some subset of the corresponding nodes of T , as if some vertices of T are missing from some nodes T ′ . We say T and T ′ disagree on these vertices at these nodes.
We will upper bound T 2 2 − T ′ 2 2 as follows: Recall from Lemma 28 that each bag is at most O(|U| + √ n) = O(ℓ + √ n), which upper bounds the second factor. It remains to bound the first factor (essentially the total number of missing vertices across all nodes), and use the resulting bound on T 2 2 − T ′ 2 2 to draw a conclusion about T 2 2 . Note that T and T ′ will disagree on a vertex u ∈ U at the root, and continue to disagree as we descend the tree until either • u stops appearing in T , because it is in the other half of a separator, i.e., this subtree is about a subgraph of G that does not include u anymore, or • u is in the separator at some level of the recursion, and since T and T ′ have the same separators, it will appear in both trees.
In either case, if T and T ′ agree on a vertex u at node X, then they also agree on u in all descendants of X. Furthermore, T and T ′ cannot disagree about u in both children, Y 1 and Y 2 , of some node X. This is because either u belongs to the separator at node X, so it will be included in Y 1 and Y 2 in both T and T ′ , or u is in one half of the separator or the other and thus will be absent (in both T and T ′ ) from one of the subtrees. The take away is that u ∈ U can be missing from at most one node per level of the tree. Otherwise, if u is missing from nodes Z 1 and Z 2 in the same level, then it is also missing from their ancestors (since if T and T ′ agree on an ancestor of Z i , they agree on all its descendants, including Z i ), and in particular from their least common ancestor Z and its two children. Since T and T ′ cannot disagree on both children, this is a contradiction. We conclude that any particular u ∈ U is missing from at most O(depth) = O(log n) nodes, and since |U| = ℓ, there are a total of O(ℓ log n) missing nodes across the whole tree. That is, i |B i | − |B ′ i | = O(ℓ log n), and thus We will show that T ′ 2 2 = O(n) to complete the proof. Let F (n) be the maximum value of T ′ 2 2 for a tree T ′ ← ComputeTD(G, ∅) over all graphs G on at most n vertices. In the recursive case (where n > n 0 ), we have two recursive calls which are unfortunately not of the form ComputeTD(G ′ , ∅); they have non-empty second arguments. However, by the same argument as above, setting the second argument from U ′ to ∅ changes the squared 2-norm by at most O(|U ′ |(|U ′ | + √ n) log n) = O(n log n), since here |U ′ | ≤ |S| = O( √ n). It follows that we can upper bound F (n) by F (n A ) + O(n log n) and F (n B ) + O(n log n) for the two subtrees, plus O(|S| 2 ) = O(n) for the root node, giving where n A , n B ≥ αn and n A + n B ≤ n + β √ n. It can be shown that F (n) = O(n log 2 n) by an inductive proof of the hypothesis F (n) ≤ k 0 n + k 1 n log n + k 2 n log 2 n.
The running time for the algorithm follows immediately, as long as ComputeTD is not too expensive. Let us bound that next.
Lemma 30. Let G be a planar graph with n vertices. ComputeTD(G, U) runs in time O(n log n) given that Separator runs in linear time.
Proof. Let T (n) be the worst-case running time of ComputeTD on graphs of at most n vertices. Using the recursive structure of the algorithm, we see that for sufficiently large n, T (n) ≤ T (n 1 ) + T (n 2 ) + Kn for some n 1 , n 2 such that n 1 + n 2 ≤ n + β √ n and (1 − α)n ≤ n 1 , n 2 ≤ αn + β √ n, and for some constant K > 0.
Since this term is multiplied by n + β √ n = O(n), it dominates the smaller c 1 β √ n log 2 n + βc 2 √ n terms. Thus, for sufficiently large n, we have T (n) ≤ nc 1 log 2 n + c 2 n.
We are now ready to prove Theorem 3.
Proof of Theorem 3. We will apply the algorithm from Theorem 18 to the tree T output by ComputeTD(G, ∅). The graph state simulation algorithm runs in O( T ω ω ) by Theorem 18, and T ω ω is O(n ω/2 ) by Lemma 29. The cost of constructing the tree decomposition is O(n log n) by Lemma 30. Thus, the total running time is O(n ω/2 ).
We also immediately get Theorem 7.
Proof of Theorem 7. Let ϕ : G ′ → G be the mapping witnessing that G ′ is r-coarse-grained planar. That is, G is planar and ϕ maps at most r vertices in G ′ to one vertex in G. To solve an instance of the graph state simulation problem for G ′ , we run the following steps: 1. Construct T G ← ComputeTD(G, ∅), using the planar tree decomposition.
2. Take the preimage T G ′ ← ϕ −1 (T G ). That is, let T G ′ have the same tree structure, and replace a bag B ⊆ V (G) with the preimage set ϕ −1 (B) : 3. Run the graph state simulation algorithm from Theorem 18 for G ′ with tree decomposition T G ′ .
Note that T G ′ is a tree decomposition for G ′ by Lemma 17, and therefore the algorithm is correct.
The running time will be dominated by Step 3 since constructing T G is O(n log n) time by Lemma 30, and computing preimages is O(n) time. Recall that the graph state simulation algorithm runs in time O( T G ′ ω ω ) by Theorem 18. Each bag T G ′ is at most r times bigger than T G , so it is clear that Lemma 29. Putting these facts together, we see that the sampling algorithm is O(n ω/2 r ω ) time.
For example, consider the k × k × k cubic grid G ′ . There is clearly a k-coarse graining onto the k × k grid G. Since the k × k grid is only n = k 2 qubits, sampling only takes O((k 2 ) ω/2 k ω ) = O(k 2ω ) time. This matches (up to log factors) the natural analogue of the recursive 2D grid algorithm, and beats the O(k 3ω ) runtime of direct Clifford simulation.

Acknowledgments
D. Gosset and D. Grier

A Stabilizer operations in matrix multiplication time
The goal of this section is to show that all stabilizer operations are computable in time which scales polynomially with the matrix multiplication constant ω < 2.3728639 [23]. We show how to perform each stabilizer operation in Table 1 in the given time. In particular, we will show that given the tableaux for Clifford operations on n and m qubits, the tableau for their composition can be computed in O(min(nm ω−1 , mn ω−1 )) time. Furthermore, given the tableaux for a Clifford state on n qubits, we simulate a measurement on k qubits in O(k ω−2 n 2 log(n/k)) time.
Most of the results in this section follow from the work of Dehaene and De Moor [49] which describes the composition of Clifford operations as the product of binary matrices. 9 We begin with a short summary of their techniques. First, recall that every Pauli P can be represented uniquely as for α, β ∈ {0, 1} and a, b ∈ {0, 1} n .
where all operations are over F 2 .
where all operations are over F 2 .
Let us now fix our tableau representation for an n-qubit Clifford circuit Q. 10 The tableau is simply a list of the images of X e j and Z e j under conjugation by Q for all j, where e j ∈ {0, 1} n is the n-bit vector with a 1 in position j and 0's everywhere else.
We arrange this conjugation information into a 2n × 2n binary matrix M, a phase vector p ∈ {0, 1} 2n , and a sign vector s ∈ {0, 1} 2n : and D = {d i,j } are the n × n binary matrices whose elements correspond to the Pauli decompositions. That is, a i is the ith row of A, and so on. ). As you can see from the above fact, we don't actually need to store p since it is encoded into M, but it will helpful to keep it around anyway.
Let us now turn the computational question of how much time is required to perform various Clifford operations. Since an n-qubit state is represented by O(n 2 ) many bits, it will often be computationally inefficient to write out an entirely new tableau resulting from a Clifford operation. For this reason, we will adopt the conventional view of Clifford operations as updating the existing tableau (e.g. modifying entries and appending rows/columns). Therefore, operations such as composing a single-qubit gate with an n-qubit Clifford circuit only require O(n) time if the tableau for the n-qubit Clifford is given as input (see Appendix A.1 for a detailed explanation of composition).
We conclude this section with a simple fact that tensor products of Clifford operations are represented by the direct sum of their tableau.
Fact 34. Let (M 1 , p 1 , s 1 ) and (M 2 , p 2 , s 2 ) be the tableaux for Clifford operations C 1 and C 2 , respectively. Then C 1 ⊗ C 2 is represented by tableau Therefore, we get that appending a single-qubit stabilizer state to an n-qubit stabilizer state takes time O(n) and appending an ℓ-qubit state takes time O((n + ℓ) 2 ).

A.1 Composition
This section is dedicated to showing fast matrix multiplication speedups for the composition of Clifford operations. Before we do so, let us first describe the action of a single Clifford operation on a Pauli string. For any square matrix A, we will define lower(A) to be the strictly lower triangular matrix of A (i.e., with the diagonal and above diagonal entries replaced with 0).
C D ]. We will write A i to index the ith row of matrix A (and similarly for B, C, and D). Notice that the inner products between the X and Z parts of M can be expressed with the Λ = ( 0 I 0 0 ) operator. For example, A i · B j = M i ΛM T j for i, j ∈ {1, . . . , n}. Ignoring the phase for the moment, we have is simply the product of the Pauli elements indexed by the rows of the tableau, so by Corollary 32 we get Adding the global phase back in gives the lemma.
where • denotes the Hadamard product 11 and "diag" selects the elements along the diagonal.
and similarly for the images of Z e j . Thus, we arrive at the theorem statement by applying Lemma 35 with Q 2 and every Pauli specified by the rows of the tableau for Q 1 .
From the tableaux composition theorem, it should be clear that the tableau for the product of two Clifford operations on n-qubits can be computed in time O(n ω ). Furthermore, when the two Clifford operations are of different sizes, it may be possible to obtain a significant savings. We know from standard techniques that if Q 2 is a one or two-qubit gate, then the tableau for Q 1 can be updated in O(n) time.
The theorem below interpolates between these extremes. Concretely, composing an n-qubit operation Q 1 with an m-qubit operation Q 2 for m ≤ n, means computing the product (Q 2 ⊗ I n−m )Q 1 , and similarly Q 2 (Q 1 ⊗ I m−n ) when m > n.
Theorem 37. Let Q 1 , Q 2 be n and m-qubit Clifford operators, respectively, given by their tableaux. The tableau of their product can be computed in time O(min(m ω−1 n, n ω−1 m)).
Proof. When m = Θ(n), the theorem is trivial. Let's consider the case where m = o(n). Without loss of generality, let's assume that Q 2 is applied to the first m qubits of Q 1 . On n qubits, the tableau of Q 2 is The Hadamard product of two n × m matrices A = {a i,j } i,j and B = {b i,j } i,j is their entrywise product where M 2 ∈ {0, 1} n×n and A, B, C, D ∈ {0, 1} m×m . Matching the dimensions of the matrices above, let's write the tableau of Q 1 as follows: Therefore, according to Theorem 36, we have that the matrix M 12 for the tableaux of There are two types of matrix products we require. Terms such as , so we will focus on computing the second more complicated term. We have from which we can derive the condition along the diagonal-that is, we have that the diagonal of where ⊕ is used here to indicate direct sum. Since A, B, C, D are m × m matrices, we can explicitly compute lower(AB T ) and lower(CD T ) in O(m ω ) time. Consider a term such as A 21 lower(AB T )A T 21 where we are multiplying matrices with dimension (n − m) × m, m × m, and m × (n − m). Dividing each matrix into blocks of size m, we get matrices of dimensions roughly n/m × 1, 1 × 1, and 1 × n/m where multiplication of blocks takes time O(m ω ). Since we only need to calculate the n/m blocks along the diagonal of the product matrix, we can compute the entire diagonal in time O((n/m)m ω ). Since all other terms follow in a similar manner, we can calculate the entire sign vector in time O(nm ω−1 ). Thus, for m = o(n), the entire composition takes time O(nm ω−1 ).
When n = o(m), the calculation is extremely similar to the one above, so we omit it.
The above theorem is more general than needed for the purposes of this paper. Nevertheless, we can use it to show how to multiply any n-qubit stabilizer state by an arbitrary collection of CZ gates in time O(n ω ) as stated in Table 1. First, let us assume that no two CZ gates in our collection are identical since they would simply cancel. Then, let A be the n × n binary symmetric matrix whose (i, j) entry is 1 iff there is a CZ gate between qubits i and j in our collection. The tableau for the entire collection of CZ gates is Since we can explicitly write down the tableau for the collection of CZ gates, we can straightforwardly apply Theorem 37.

A.2 Measurement
The for all j ∈ {1, . . . , n}. That is, the lower half of the tableau describes the Pauli operations that fix or "stabilize" the state, which is why these Pauli operations are often referred to as stabilizers. Nevertheless, the upper half of the tableau (consisting of the "destabilizers" of the state |ψ ) is still useful in speeding up measurement computations [17]. In this section, we show how to combine those techniques with fast matrix multiplication. Throughout this section, it will be useful to compute the images of several Pauli operators at once. We will encounter two situations in particular: Furthermore, C ′ , α ′ , and β ′ can be computed in O(n 2 k ω−2 ) time when ℓ = n, and in O(n 2 ℓ ω−2 log(n/ℓ)) time when ℓ ≤ k = n.
Proof. Correctness follows from Lemma 35, so we focus only on running time. Once again, we will only show how to compute the most complicated term-diag(C lower(MΛM T )C T )of the sign vector β ′ , and the rest of the terms follow by similar (or simpler) arguments. First, let us write M as comprised of the matrices A 11 ∈ {0, 1} k×k , A 12 ∈ {0, 1} k×(n−k) , A 22 ∈ {0, 1} (n−k)×(n−k) , and so on as in Theorem 37.
Following the same sort of elementary algebra, we arrive at the following expression for C lower(MΛM T )C T : First, let's consider the case where ℓ = n. To compute lower(·), we need to compute a k × (n − k) by (n − k) × k matrix product. Dividing into k × k blocks, such a multiplication takes time O(nk ω−1 ), after which we can trivially compute the lower triangle matrix in O(k 2 ) time. Finally, we must multiply matrices with dimensions n × k, k × k, and k × n. Once again, we divide into k × k blocks, to compute the entire product in O((n/k) 2 k ω ) time. Now suppose that k = n. Notice that computing lower(·) now requires n × n matrix products, so there is not obvious way to compute it in O(n 2 ℓ ω−2 ) time. For the B(C 11 B T 11 )A T term, we can choose the order in which we compose the matrices: first compute BC 11 which takes time O(n 2 ℓ ω−2 ) and leaves another ℓ × n matrix, and so on. Unfortunately, such a strategy does not automatically work for the other terms.
We are now ready to prove the main measurement theorem. We state the theorem in terms of a Z-basis measurement on k qubits. That said, the cost of this measurement dominates the cost of applying an arbitrary k-qubit Clifford unitary, so the theorem still holds for any choice of basis.
Theorem 39. Let any n-qubit stabilizer state (represented by its tableau) and a subset of k qubits be given. There exists an O(min(n 2 k ω−2 log(n/k), kn 2 )) time algorithm that returns 1. Z-basis measurement outcomes for each qubit in the subset, and 2. The Clifford tableau for the (n − k)-qubit post-measurement state.
Furthermore, the same algorithm can be used to postselect 12 on a particular k-bit outcome.
Proof. Without loss of generality, let us assume we measure the first k qubits. We will also only be concerned with the O(n 2 k ω−2 log(n/k)) measurement algorithm since one can trivially obtain a O(kn 2 ) algorithm by applying the single-qubit measurement scheme of Aaronson and Gottesman k times [17].
Let us write the tableau for our state represented by Clifford Q as , and so on. We will perform the measurement in two steps, following the Aaronson-Gottesman measurement procedure [17]-first focusing on those outcomes which are random and then on those which are determinate. 13 We will assume correctness from the Aaronson-Gottesman procedure and will focus primarily on running times.
Step 1: Random measurement outcomes To determine the random measurements, we first need the Pauli-X parts of the stabilizers, i.e., the n × k submatrix C := C 11 C 21 , to be in reduced row echelon form. To do this, we compute the LSP factorization (see Ibarra, Moran, and Hui [42]) of C T in O(k ω−1 n) time.
That is, C T can be written as the product of three matrices LSP , where L is an k × k lower triangular matrix with 1's on the main diagonal, S is an k × n matrix which reduces to an upper triangular matrix with 1's along the diagonal when the zero rows are deleted, and P is an n × n permutation matrix. Therefore, we get C = P T S T L T .
We assume without loss of generality that P is the identity transformation, since we can rearrange of the rows of the tableau according to P without affecting the underlying state. Now, let us rearrange the columns of S T by permutation Π such that only the first r := rank(C) columns are non-zero. That is, we write C = (S T Π)(Π T L T Π)Π T . 12 In addition to the subset of k qubits to be measured, a string of k preferred measurement outcomes is given as input. The algorithm either returns that the postselected event has probability 0 or returns the (n − k)-qubit state assuming the postselection was successful. 13 Note that the choice of which measurements are random and which are determinate is not unique. For example, consider the Bell state |00 +|11 √ 2 . Measuring the first qubit, determines the second, and vice versa.
The final matrix in this decomposition, Π T , is a permutation on the first k qubits of the state, so let us drop it in the following discussion since we can simply permute those columns of the tableau accordingly. We can now write C = S ′ U ′ , where S ′ is the n × r matrix obtained by removing the last k − r columns from S T Π and U ′ is the r × k matrix obtained by removing the last k − r rows from Π T L T Π. Notice that the first r × r submatrix of U ′ is upper triangular with 1's on the main diagonal. Let's also write S ′ = L ′ A ′ where L ′ is an r × r lower triangular matrix with 1's on the diagonal, and A ′ is arbitrary.
Finally, let us construct the matrix L that will reduce C to row echelon form, i.e., LC = [ U ′ 0 ]. One can check that L = (L ′ ) −1 0 A ′ (L ′ ) −1 I . We will view the application of L as directly multiplying the stabilizer generators of Q. For example, if the ith row of L has 1's in the first, second, and fourth columns, we will combine the first, second, and fourth stabilizers in the tableau (i.e., compute QZ e 1 Z e 2 Z e 4 Q † . Since L is only dense in the first r columns, we can use Lemma 38 to construct the resulting tableau in O(n 2 k ω−2 ) time (since r < k). Since we are only taking linear combinations of generators, the underlying state does not change. To preserve the fact that the tableau is symplectic, we can apply Lemma 38 again with applied to the destabilizers. 14 Because L −T is now upper triangular, we must use the other setting of Lemma 38, which takes O(n 2 k ω−2 log(n/k)) time.
To recap, the original submatrix C of our tableau is now where U ′′ is an r × r upper triangular matrix with ones along the diagonal. Therefore, we can compute (U ′′ ) −1 explicitly in time O(r ω ), and once again apply Lemma 38 so that each of the first r stabilizers has exactly one Pauli-X component (i.e., the submatrix is in reduced row echelon form). We then apply (U ′′ ) T to the destabilizers in the same fashion.
In the next step, we multiply the stabilizer generators into the destabilizer generators, so that no destabilizer has a Pauli-X component within the first r qubits. 15 Once again, this can be accomplished by Lemma 38. Then, we replace the first r destabilizers with the first r stabilizers.
Since these first r qubits will be the qubits for which the measurement outcome is random, we can now output a uniformly random measurement for each (or, in the case of postselection, the specified outcome). Let this measurement be z ∈ {0, 1} r . For i ∈ {1, . . . , r}, we replace the ith stabilizer with (−1) z i Z e i .
Unfortunately, the tableau we just constructed is only valid for the post-measurement state in tensor product with the measured qubits. Since we would like a tableau for just the (n − k)-qubit post-measurement state, additional work is required. To recap, the state of the tableau is now To see that the resulting tableau will be symplectic, it suffices to show that the matrix L −T 0 0 L is symplectic. This follows from Fact 33. 15 A note of warning: this is the first operation that we apply to the tableau which is not symplectic. For a more in-depth justification of this procedure, we refer the reader to the work of Aaronson/Gottesman [17].
where A 12 ∈ {0, 1} n×n−r , and so on. To clarify, this is a slight abuse of notation since we've subdivided the matrix in terms of r instead of k. In order to safely remove the first r qubits from the tableau, we need to ensure that the tableau will still be symplectic afterwards. In some sense, we need A 12 , B 11 , B 12 , B 21 and D 12 all to be 0. As before, we can zero out D 12 using Lemma 38 with the transformation L = I 0 D 21 I on the stabilizers and L −T = I D T 21 0 I on the stabilizers. We can now construct the tableau for the state on n − r qubits by simply removing the first r rows/columns from each submatrix: It may not be the case that A 12 , B 11 , B 12 , and B 21 are all zero, but one can check that the tableau is symplectic using the fact that it was symplectic before the rows/columns were removed.
Step 2: Determinate measurement outcomes Since we have already removed the r qubits which had random outcomes in Step 1, let us assume without loss of generality that we are measuring the first k qubits, all of which have determinate outcomes. That is, for i ∈ {1, . . . , k}, we have ǫ i Z e i is a stabilizer of the state for some ǫ i ∈ {±1}. The goal is to determine the ǫ i . To start, let us put the Pauli X part of the destabilizers-i.e., the submatrix A 11 In fact, because M is symplectic, we have that C 12 = 0, D 11 = I, and D 12 = 0, so we can immediately report the measurement outcome of the ith qubit as (ǫ 1 ) i . In the case of postselection, we check if these outcomes match the specified outcomes. If so, then there is nothing to be done; otherwise, we report that the postselected event has probability 0. Finally, we repeat the procedure from the end of Step 1: row reducing so that D 21 = 0, and then returning the tableau with the first k rows/columns removed from each submatrix. This completes the proof. Theorem 1. Let |ψ be any n-qubit stabilizer state given by its tableau, and let z = z 1 . . . z k be any k-bit string of postselected outcomes. There is an O(n ω ) algorithm to obtain a uniformly random measurement of the state ( z| ⊗ I) |ψ in the computational basis. 16 Proof. The idea will be to invoke Theorem 39 twice-first, for the postselected qubits, and then for the remaining qubits. Therefore, we get a running time of O(n 2 k ω−2 log(n/k)) + O((n − k) ω ). 16 If ω = 2, we pick up an additional log factor in the solution of the recurrence, i.e., a running time of O(n 2 log n).
When k = o(n), we have k = n/f (n) for some f (n) = ω(1), and so n 2 k ω−2 log(n/k) + (n − k) ω ≤ n ω f (n) ω−2 log(f (n)) + n ω = O(n ω ) where we use that log(f (n))/f (n) ǫ = o(1) for all constants ǫ > 0. Therefore, in the case of ω = 2, we actually get a bound of O(n 2 log n) We note again that the choice of measurement bases in Theorem 1 is arbitrary since single-qubits gates can be applied in O(n) time.

B Software Implementation
This section is about our software implementation of the graph state simulation problem for the n 1 2 × n 1 2 grid graph. The goal was to show that the asymptotic speedup we predict for the grid in Theorem 2 is actually practical, i.e., that it is feasible to implement and that the increased complexity of the algorithm does not hurt the runtime too much for small grids. We do not implement fast matrix multiplication for simplicity, so in the theorem we may set ω → 3 and avoid some extra log factors to obtain an O(n 3/2 )-time algorithm.
Recall that this running time is achieved by a divide-and-conquer recursive algorithm described in the example on page 9. We compare the running time of this algorithm with the naïve algorithm and the left-to-right sweep algorithm in Figure 1. In each case, for a given n, the algorithm was translated into a circuit (on n qubits for the naïve algorithm, O( √ n) for the others), and then fed to the same tableau-based Clifford simulator (CHP++ [25]) running on a modern laptop computer. 17 We make one concession to translate the divide-and-conquer algorithm to a circuit: each gate includes information about which qubits (represented as a interval) could be currently entangled with its inputs. In this way, we can represent the multiple subtableaux (demanded by a divide-and-conquer strategy) as separate intervals of qubits within the full tableau, without the performance hit that would come from scanning the entire tableau for each operation. Finally, in our tests, the measurement basis (X or Y) for each qubit is chosen at random; we expect (and conjecture) that this is induces near worst-case performance in the three algorithms.
In the log-log plot shown in Figure 11, the line of best fit for all three algorithms suggests a better runtime than we predict. Namely, the naïve algorithm scales as n 2.6 whereas the theoretical guarantee is O(n 3 ); the left-to-right sweep algorithm scales as n 1.6 with theoretical guarantee only O(n 2 ); and the recursive algorithm scales as n 1.3 with theoretical guarantee only O(n 1.5 ). Nevertheless, we believe this is simply due to small n effects.
In an effort to improve the performance of the recursive algorithm as much as possible, our software implementation differs from that given in the example in two important ways: shows the empirical scaling with number of qubits.
1. In each recursive step, we split the grid into two (almost) identical halves rather than four quarters as in the example. Of course, in the next step we split those halves again in the opposite direction (the rule is to split the longer dimension), so we end up with four quarters anyway.
2. We eliminate the frame qubits between the two halves. The recursive calls return the state of the perimeter qubits for each half, connect the two perimeters directly with CZs at the seam (rather than to a separating frame), and measure qubits that are no longer part of the perimeter. This leads to fewer active qubits at the top-most levels of the recursion, which is important because that part of the computation dominates the runtime.
Similarly, we use a tableau-based Clifford simulator loosely based on Aaronson's CHP [17], but it differs in two important ways: 1. When a qubit is measured, we update the tableau to be a direct sum, to reflect that the qubit is no longer entangled with the rest. This requires more work, but helps clearly separate the past interactions of the qubit from the future, facilitating reuse.
2. Additional gates have been added: CZ because it is essential to graph states, singlequbit gates to translate Y -basis measurement into standard basis measurement, and SWAP gates.

C Correction subroutine example
In this section we work through an explicit example of the correction subroutine described in Section 3.3.1. The input is shown in Figure 12. We shall consider the case in which all four data qubits are to be measured in the X basis (i.e., U bases = H ⊗4 ). The quantum circuit C ′ is shown in Figure 13. Note that vertices of the graph are labelled by uppercase letters while qubits of C ′ are labelled by lowercase letters, with the data qubit and merge ancilla for vertex B being labelled b and b ′ , respectively. Let us take π = ( * 1 * * 1, * * * * * ) as the desired pattern, where qubits are ordered ab ′ bcd. That is, we would like P cor to have nonzero X-components acting on qubits b ′ and d.
We first compute the affine spaces A 1 , A 2 , . . . , A 5 . By symmetry, A ′ 1 ∼ = A 1 and A ′ 2 ∼ = A 2 , so we omit them. The leaf node containing A and B is an introduce node, so we set corresponding to the stabilizer generators X a and X b ′ of |+ a ⊗|+ b ′ . Throughout this example, we will order the rows of matrices first by grouping X-and Z-components together, and then in the same descending order as in π and C ′ . In the matrix above, the first and third rows correspond to qubit a while the second and fourth correspond to b ′ . Next we conjugate A 1 by (H ⊗I)CZ. Since qubit a is forgotten, we would in general need to enforce π (X) a and π (Z) a , but since π (X) a = π (Z) a = * , there is nothing to be done. Therefore we have x : x ∈ F 2 2 }.
To compute A 3 , we first remove the rows of the matrix in A 2 corresponding to a (i.e. the first and third) to obtain {( 0 1 1 0 )x : x ∈ F 2 2 }. By symmetry, we would obtain the same set after removing qubit c from A ′ 2 . We then take the direct product x : x ∈ F 4 2 }.
We have rearranged the right hand side matrix so that the X-components form the first two rows and the Z-components the last two, in contrast with Eq. 23. Next, we apply the CNOT gate to obtain x : x ∈ F 4 2 }. Figure 13: Example circuit C ′ . Each affine space is associated with the bag immediately to the left of its dashed line. Qubits are ordered in the same way as π, from top to bottom.
Finally, we must enforce the pattern on the merge ancilla b ′ . Since π (X) b ′ = 1, this means finding the subset of elements in Eq. (29) that have the form (1, * , * , * ) T . Basic linear algebra tells us that x must take the form x = e 4 + y 1 e 1 + y 2 (e 2 + e 4 ) + y 3 e 3 , where e j is the usual standard basis vector and y 1 , y 2 , y 3 ∈ {0, 1} are free. The desired subset of Eq. (29) therefore is : y ∈ F 3 2 }.
We then restrict A 3 to coordinates corresponding to b to obtain {( 0 1 0 1 0 1 )x + ( 1 0 ) : x ∈ F 3 2 }. Since the matrix has more columns than rows, we search for a maximum linearly independent subset of columns. By inspection, we can just remove the third column to get {( 0 1 1 0 )x + ( 1 0 ) : x ∈ F 2 2 }. Taking a direct product with the affine space I for the stabilizer generator of |+ d gives us To compute A 5 , we conjugate the affine space by the CZ and Hadamard gates to obtain We are now ready to construct P cor . We first choose a random element from A 5 , say, (0, 1, 1, 0) T . This fixes two tensor elements of the Pauli correction: P cor = ? ⊗ ? ⊗ Z ⊗ ? ⊗ X (with the question marks to be determined below).