Magic State Distillation: Not as Costly as You Think

Despite significant overhead reductions since its first proposal, magic state distillation is often considered to be a very costly procedure that dominates the resource cost of fault-tolerant quantum computers. The goal of this work is to demonstrate that this is not true. By writing distillation circuits in a form that separates qubits that are capable of error detection from those that are not, most logical qubits used for distillation can be encoded at a very low code distance. This significantly reduces the space-time cost of distillation, as well as the number of qubits. In extreme cases, it can cost less to distill a magic state than to perform a logical Clifford gate on full-distance logical qubits.

erators, such that $Z_{\pi/8}$ corresponds to a T gate. If a magic state is available, a logical n-qubit $P_{\pi/8}$ gate is performed by measuring the logical Pauli product $P \otimes Z$ acting on the n qubits and the magic state, see Fig. 1.
The problem with magic states is that, with surface codes, only faulty magic states can be prepared. They are faulty in the sense that they are initialized with an error probability proportional to the physical error rate $p_{\rm phys}$, regardless of the code distance. If these states are used to perform logical $P_{\pi/8}$ rotations, one out of every $\sim 1/p_{\rm phys}$ logical gates is expected to be faulty. Faulty gates spoil the outcome of a computation, and classically intractable quantum computations with a useful computational result typically involve more than $10^8$ T gates [10], so low-error magic states are required to execute gates with a low error probability. One possibility to generate low-error magic states is via a magic state distillation protocol. These protocols are short error-detecting quantum computations that use multiple high-error magic states to generate fewer low-error states. Many such protocols [8,11-21] have been developed since magic state distillation was first proposed [9], gradually decreasing the cost of distillation. Even though state-of-the-art protocols are orders of magnitude more efficient than the earliest proposals, magic state distillation is still often described as a costly procedure and the leading contributing factor to the overhead of fault-tolerant quantum computing, which is the primary motivation for research into alternatives to magic state distillation [22-27].
In this work, we reduce the cost of distillation by another order of magnitude. On the level of circuits, none of the distillation protocols discussed in this work are new. Rather, the circuits are written in a way that the number of qubits is low and the circuit depth is high. The overhead reduction is achieved by finding surface-code implementations of these protocols in which the code distance of each surface-code patch is never higher than required to achieve a specific output error probability, as was previously proposed in Ref. [21]. This yields protocols that not only have a low space-time cost, but also a small qubit footprint.
Results. Table 1 shows the space-time costs of the protocols that are constructed in the following sections. These protocols generate states with different output error probabilities $p_{\rm out}$, assuming physical circuit-level error rates $p_{\rm phys}$ of $10^{-3}$ and $10^{-4}$. The more T gates need to be executed, the lower $p_{\rm out}$ needs to be. Each protocol is characterized by the space cost in terms of physical qubits (including ancilla qubits) and the time cost in terms of code cycles, where a code cycle corresponds to measuring all surface-code check operators exactly once. These numbers can be multiplied to obtain the space-time cost in qubit-cycles. This is a meaningful figure of merit that should be minimized. It is more meaningful than only the space cost or only the time cost, since distillation protocols can be straightforwardly parallelized, using twice as many qubits to distill states twice as fast.
Even though this is not necessarily a meaningful quantity, we report the space-time cost in terms of the full distance d for two different choices of d in the last two columns of Tab. 1. While the smallest classically intractable quantum computations require ∼100 qubits, more complicated quantum algorithms use thousands of qubits, such as factoring 2048-bit numbers using Shor's algorithm. The lower and higher values of d are chosen such that they are sufficient for a 100-qubit and 10,000-qubit computation with at most $1/p_{\rm out}$ T gates, respectively. The reported costs in terms of the full distance are in terms of (physical data qubits)×(code cycles), i.e., they do not consider physical measurement ancillas and are therefore smaller by a factor of 2. This is done to more easily compare the numbers to the cost of storing a d × d surface-code patch for d code cycles, which is $d^3$ in terms of (physical data qubits)×(code cycles).

Figure 2: A sequence of 16 π/8 rotations on 5 qubits that is non-trivially equivalent to the identity.
How to interpret the cost. Table 1 shows protocols that generate one magic state, 4 magic states, or one $|CCZ\rangle$ state that can be used to execute a Toffoli gate. For protocols that generate multiple magic states, the space-time cost and output error are per magic state. Our protocols feature order-of-magnitude overhead reductions compared to the previous state of the art for all parameter regimes. One example is the $(15\text{-to-}1)_{9,3,3}$ protocol, where the subscripts label the code distances used in the protocol, as explained in Sec. 3. For $p_{\rm phys} = 10^{-4}$, it generates magic states with $p_{\rm out} = 9.3 \times 10^{-10}$, sufficiently low for classically intractable 100-qubit computations with $10^8$ T gates. In a quantum computer that can execute one T gate every d code cycles, 231 logical qubits at d = 13 would be used to store the 100 qubits with a low error rate [28], taking into account the routing overhead. A space-time cost of $4.71d^3$ for distillation implies that a footprint equivalent to 4.71 full-distance qubits would be able to distill one magic state every d code cycles. In this example, ≈2% of the approximately 80,000 physical qubits are used for distillation. The numbers become even more extreme for the example of a 10,000-qubit computation with $10^8$ T gates. Here, the 10,000 data qubits are stored using ∼20,000 logical qubits with d = 15, which means that the space-time cost of distillation is $3.07d^3$ per magic state. For a quantum computation on more qubits or with a lower overall error probability, distance-17 data qubits might be required, reducing the cost to $2.11d^3$. In this example, the cost to distill a magic state would be lower than the space-time cost of a full-distance logical CNOT gate, which is $3d^3$ per qubit [6], demonstrating that the cost of magic state distillation is not very high, and that space-time costs that are quantified in units of $d^3$ are of limited usefulness.
These numbers are admittedly a bit contrived, but even in the more realistic case of a 100-qubit computation with $p_{\rm phys} = 10^{-3}$ and $p_{\rm out} \approx 10^{-10}$, only ≈10% of all physical qubits are used for distillation.
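The unit conversions in this paragraph are simple arithmetic and can be re-derived as a sanity check. The sketch below is illustrative only: the function names are hypothetical, and it assumes roughly $2d^2$ physical qubits per d × d patch (data qubits plus measurement ancillas), in line with the factor of 2 mentioned earlier.

```python
# Figures quoted in the text; helper names are illustrative, not from the paper.
COST_AT_D13 = 4.71        # distillation cost in units of d^3 at d = 13
LOGICAL_QUBITS = 231      # logical patches storing the 100-qubit computation

def cost_in_units_of_d3(cost, d_old, d_new):
    """Re-express a fixed space-time cost, given in d_old^3, in units of d_new^3."""
    return cost * d_old**3 / d_new**3

def physical_qubits(patches, d):
    """Assumed footprint: about 2*d^2 physical qubits per d x d patch."""
    return patches * 2 * d**2

storage = physical_qubits(LOGICAL_QUBITS, 13)
distillation = physical_qubits(COST_AT_D13, 13)
fraction = distillation / (storage + distillation)

print(f"cost at d=15: {cost_in_units_of_d3(COST_AT_D13, 13, 15):.2f} d^3")
print(f"cost at d=17: {cost_in_units_of_d3(COST_AT_D13, 13, 17):.2f} d^3")
print(f"storage qubits: {storage}, distillation share: {fraction:.1%}")
```

The same absolute cost shrinks when expressed in units of a larger $d^3$, which is why costs quoted in units of $d^3$ depend on the computation they are compared against.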
The main message is that magic state distillation is not the dominant cost in a surface-code-based quantum computer. Rather, the large overhead of surface codes is due to their low encoding rate, which implies that a large number of qubits is required to simply store all data qubits of the computation.
Overview. In the following sections, we discuss how the protocols in Tab. 1 are constructed. We start in Sec. 1 by reviewing how distillation circuits work and how their performance is quantified. Distillation protocols require faulty T gates on the level of logical qubits, which are usually performed via state injection and measurement. In Sec. 2, we introduce additional protocols for faulty logical T gates based on shrinking patches and faulty T measurements, which avoid Clifford corrections and use fewer qubits and cycles. Next, in Sec. 3, we go through the construction of the low-cost 15-to-1 protocol. In Sec. 4, we construct two-level protocols, where 15-to-1-distillation output states are fed into a second level of 15-to-1 or 20-to-4. In Sec. 5, we discuss synthillation protocols, i.e., the distillation of resource states that perform entire layers of π/8 rotations. Specifically, we show the example of $|CCZ\rangle$ state distillation, which can replace four T-gate magic states for the execution of a controlled-controlled-Z gate. For the protocols in Tab. 1, the distillation costs of $|CCZ\rangle$ states are lower than the cost of four T-gate magic states with a similar $p_{\rm out}$, indicating that synthillation can lower the cost compared to the distillation of T-gate magic states. Finally, in Sec. 6, we discuss how protocols with a higher space-time cost, but smaller qubit footprint can be constructed. The examples shown in Tab. 1 reduce the error rate from $10^{-3}$ or $10^{-4}$ to $\sim 10^{-9}$, but use only as few as 762 or 7,780 physical qubits.

Distillation circuits
Magic state distillation protocols can be understood in terms of quantum error-correcting codes with transversal T gates [9,11], but it is conceptually simpler to explain them in terms of circuits [17]. When writing quantum circuits as sequences of Pauli product rotations $P_\varphi = e^{-iP\varphi}$, specifically π/8 rotations $P_{\pi/8}$, certain sequences are equivalent to the identity. While some of these sequences are trivial, e.g., $P_{\pi/8}$ followed by $P_{-\pi/8}$, there also exist non-trivial sequences. One such sequence of 16 rotations on 5 qubits is shown in Fig. 2. In general, such sequences are described by triorthogonal matrices [11,17]. The equivalent concept of phase-polynomial identities is used in the context of circuit optimization [29].
If we multiply the circuit in Fig. 2 by a single-qubit rotation $Z_{-\pi/8}$ on the first qubit, the first rotation will be cancelled and the remaining circuit will consist of 15 rotations, as in Fig. 3. Since the 16-rotation circuit is equivalent to the identity, the 15-rotation circuit is equivalent to a single $Z_{-\pi/8}$ rotation on the first qubit. In other words, if the initial state is $|+\rangle^{\otimes 5}$, where $|+\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$, then the circuit prepares the state $|\overline{m}\rangle \otimes |+\rangle^{\otimes 4}$. Here, $|\overline{m}\rangle = (|0\rangle + e^{-i\pi/4}|1\rangle)/\sqrt{2}$ is a state that can be used to perform π/8 rotations in the same way as $|m\rangle$, but the outcome of the $P \otimes Z$ measurement in Fig. 1 needs to be interpreted differently, i.e., this state is a magic state.
Because every rotation in Fig. 3 acts non-trivially on at least one of qubits 2-5, these qubits can be used to detect errors. If the circuit is executed without errors, qubits 2-5 are initialized in the $|+\rangle$ state and returned to the $|+\rangle$ state, i.e., they yield an outcome of +1 upon X measurement. Errors are detected if any of these measurement outcomes is −1, in which case the protocol fails and the state is discarded.
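The identity property can be checked numerically for a concrete sequence. The sketch below does not transcribe Fig. 2; as an assumption, it uses the 16 Z-type Pauli strings of odd weight on 5 qubits, which form one sequence with the stated property. Because all rotations are diagonal, it is enough to multiply their diagonals.

```python
import numpy as np
from itertools import product

N = 5
BASIS = list(product((0, 1), repeat=N))            # bit strings, qubit 1 first
# Assumed sequence: all 16 odd-weight Z strings (not read off from Fig. 2).
ODD_STRINGS = [v for v in BASIS if sum(v) % 2 == 1]

def rotation_diagonal(support, angle):
    """Diagonal of exp(-i * angle * Z^support) in the computational basis."""
    signs = np.array([(-1) ** sum(b * s for b, s in zip(bits, support))
                      for bits in BASIS])
    return np.exp(-1j * angle * signs)

def sequence_diagonal(strings):
    """Product of pi/8 rotations about the given Z strings."""
    diag = np.ones(2**N, dtype=complex)
    for support in strings:
        diag *= rotation_diagonal(support, np.pi / 8)
    return diag

# All 16 rotations together act as the identity.
full = sequence_diagonal(ODD_STRINGS)

# Dropping the single-qubit rotation on qubit 1 leaves 15 rotations.
fifteen = sequence_diagonal([v for v in ODD_STRINGS if v != (1, 0, 0, 0, 0)])
state = fifteen / np.sqrt(2**N)                     # applied to |+>^5

# Ideal output: (|0> + e^{-i pi/4}|1>)/sqrt(2) on qubit 1, |+> on qubits 2-5.
ideal = np.array([1 if bits[0] == 0 else np.exp(-1j * np.pi / 4)
                  for bits in BASIS]) / np.sqrt(2**N)
fidelity = abs(np.vdot(ideal, state)) ** 2
print(np.allclose(full, 1), round(fidelity, 12))
```

In this assumed sequence, the 15 remaining rotations act on every non-empty subset of qubits 2-5 exactly once, which is the property that makes single faulty rotations detectable.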
The 15-to-1 protocol [9] is sometimes characterized as having an output error probability of $35p^3$. This assumes that every $P_{\pi/8}$ rotation generates a Pauli error $P = P_{\pi/2}$ with a probability of p. Since these are Z-type Pauli errors, they will flip all X measurement outcomes of the qubits that they act on. Therefore, any one faulty $P_{\pi/8}$ gate can be detected. Furthermore, there is no combination of two faulty gates that can go undetected. However, some combinations of three faulty gates, e.g., rotations 5, 11 and 14, will cause a Z Pauli error on the output state, but will not trigger any flipped X measurement outcomes. Since there are 35 such combinations, the probability to generate an undetected error is $35p^3$ to leading order.
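The counting argument above can be reproduced by brute force. The following sketch rests on two assumptions that are not read off from Fig. 3: rotation k (k = 1, ..., 15) acts on the subset of qubits 2-5 given by the binary digits of k, and an undetected set of faulty rotations leaves a Z error on the output qubit exactly when it contains an odd number of rotations (which holds for the odd-weight sequence assumed here). All function names are illustrative.

```python
def undetected_counts():
    """counts[w] = number of sets of w faulty rotations that flip no X outcome."""
    counts = [0] * 16
    for mask in range(1 << 15):
        syndrome, weight = 0, 0
        for k in range(15):
            if (mask >> k) & 1:
                syndrome ^= k + 1      # Z errors flip the X outcomes of qubits 2-5
                weight += 1
        if syndrome == 0:
            counts[weight] += 1
    return counts

def output_error(p, counts):
    """Exact output error and failure probability for independent Z errors."""
    terms = [c * p**w * (1 - p) ** (15 - w) for w, c in enumerate(counts)]
    accepted = sum(terms)              # probability that no error is detected
    harmful = sum(terms[1::2])         # odd-weight sets: Z error on the output
    return harmful / accepted, 1 - accepted

COUNTS = undetected_counts()
p_out, p_fail = output_error(1e-4, COUNTS)
print(COUNTS[1:4], p_out, p_fail)
```

Under these assumptions, no single or pair of faulty rotations goes undetected, and the number of undetected triples gives the leading-order coefficient; the higher-weight terms and the normalization by the acceptance probability supply the subleading corrections.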
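To compute the subleading corrections to the output error, this process can be simulated numerically. Starting with the initial state $\rho_{\rm init} = |+\rangle\langle +|^{\otimes 5}$, each of the 15 rotations is applied by mapping
$$\rho \rightarrow (1-p)\, P_{\pi/8}\, \rho\, P_{\pi/8}^\dagger + p\, P_{5\pi/8}\, \rho\, P_{5\pi/8}^\dagger . \tag{1}$$
The output state is determined by projecting the resulting state $\rho'$ into the subspace with the correct measurement outcomes using the projectors $\Pi_X = (\mathbb{1} + X)/2$, i.e.,
$$\rho_{\rm out} = \frac{(\mathbb{1} \otimes \Pi_X^{\otimes 4})\, \rho'\, (\mathbb{1} \otimes \Pi_X^{\otimes 4})}{1 - p_{\rm fail}} , \tag{2}$$
where
$$p_{\rm fail} = 1 - \mathrm{tr}\left[ (\mathbb{1} \otimes \Pi_X^{\otimes 4})\, \rho' \right] \tag{3}$$
is the failure probability of the protocol. The output error probability is computed by comparing the ideal output state $\rho_{\rm ideal} = |\overline{m}\rangle\langle \overline{m}| \otimes |+\rangle\langle +|^{\otimes 4}$ to the actual output state $\rho_{\rm out}$. This is done by computing the infidelity
$$1 - F(\rho_{\rm ideal}, \rho_{\rm out}) = 1 - \left( \mathrm{tr} \sqrt{ \sqrt{\rho_{\rm ideal}}\, \rho_{\rm out} \sqrt{\rho_{\rm ideal}} } \right)^2 = 1 - \mathrm{tr}\left( \rho_{\rm ideal}\, \rho_{\rm out} \right) , \tag{4}$$
where the last equality holds because $\rho_{\rm ideal}$ is a pure state. The infidelity corresponds to the probability that a faulty magic state that is used to perform a gate in a quantum circuit will lead to an error of this circuit's outcome [30]. Notably, in the examples that we consider, the trace distance $\mathrm{tr}\sqrt{(\rho_{\rm ideal} - \rho_{\rm out})^2}/2$ yields identical or at least similar results. For the example of $p = 10^{-4}$, the approximate output error probability is $35p^3 = 3.5 \times 10^{-11}$, whereas the exact result is $p_{\rm out} = 3.501 \times 10^{-11}$.

Random Pauli errors. If faulty $P_{\pi/8}$ rotations are performed by preparing faulty magic states and using the circuit in Fig. 1, then the output error depends on the error model for the preparation of faulty magic states. In particular, if the faulty magic state is affected by a random Pauli error with probability p, i.e., by an X, Y or Z error with probabilities p/3, respectively, then this translates into a probability of p/3 of performing either a $P_{-\pi/8}$, $P_{3\pi/8}$ or $P_{5\pi/8}$ rotation instead of a $P_{\pi/8}$ rotation. In other words, after each rotation, the state is mapped to
$$\rho \rightarrow (1-p)\, P_{\pi/8}\, \rho\, P_{\pi/8}^\dagger + \frac{p}{3} \left( P_{-\pi/8}\, \rho\, P_{-\pi/8}^\dagger + P_{3\pi/8}\, \rho\, P_{3\pi/8}^\dagger + P_{5\pi/8}\, \rho\, P_{5\pi/8}^\dagger \right) , \tag{5}$$
so there is a p/3 probability of either a $P_{-\pi/4}$, $P_{\pi/4}$ or $P_{\pi/2}$ error. These first two errors are more forgiving than a proper $P_{\pi/2}$ Pauli error, since they effectively only lead to a Pauli error with 50% probability.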
As a consequence, we expect each of the 35 combinations of three faulty rotations to contribute to the output error with $(8/27)p^3$ instead of $p^3$: Out of the 27 combinations in $\{P_{-\pi/4}, P_{\pi/4}, P_{\pi/2}\}^3$, there is one combination with three $P_{\pi/2}$'s, which leads to an undetected error. There are 6 combinations with two $P_{\pi/2}$'s leading to an error with a 50% probability, 12 combinations with one $P_{\pi/2}$ leading to an error with a 25% probability, and 8 combinations with no $P_{\pi/2}$'s, leading to an error with a 12.5% probability. Therefore, the output error should be $p_{\rm out} = 35 \cdot \frac{8}{27} p^3 \approx 10.3704p^3$ to leading order. Indeed, a numerical treatment of the full density matrix for $p = 10^{-4}$ yields $p_{\rm out} = 1.03724 \times 10^{-11}$.
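The factor of 8/27 follows from a short enumeration, sketched below with exact fractions. The labels and the function name are illustrative; the only input taken from the text is that a $P_{\pm\pi/4}$ error acts as a Pauli error with 50% probability, while a $P_{\pi/2}$ error always does.

```python
from fractions import Fraction
from itertools import product

# Probability that each type of error acts as a Pauli error on the state.
PAULI_WEIGHT = {
    "-pi/4": Fraction(1, 2),
    "pi/4": Fraction(1, 2),
    "pi/2": Fraction(1),
}

def triple_statistics():
    """Tally error triples by their number of pi/2 errors and average their weight."""
    tally = {}
    total = Fraction(0)
    for combo in product(PAULI_WEIGHT, repeat=3):
        n_half_pi = combo.count("pi/2")
        tally[n_half_pi] = tally.get(n_half_pi, 0) + 1
        weight = Fraction(1)
        for error in combo:
            weight *= PAULI_WEIGHT[error]
        total += weight
    return tally, total / 27           # each combination has probability 1/27

tally, factor = triple_statistics()
coefficient = 35 * factor
print(tally, factor, float(coefficient))
```

The tally groups the 27 combinations by how many $P_{\pi/2}$ errors they contain, and the averaged weight multiplied by the 35 undetected triples gives the leading-order coefficient of $p^3$.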
Coherent errors. The previous two error models randomly applied Pauli errors with a certain probability. One might object that, for physical qubits, this is not necessarily a realistic error model. A more realistic error model would take coherent errors into account, such as systematic under- and over-rotation. Distillation circuits can also detect these errors, but their performance is indeed worse than for incoherent errors. For example, consider the map
$$\rho \rightarrow P_{\pi/8+\phi}\, \rho\, P_{\pi/8+\phi}^\dagger , \tag{6}$$
which systematically over-rotates each gate by an excess angle ϕ. A gate that over-rotates by an angle ϕ = arcsin(1/100) has the same gate fidelity as a gate that applies a Z error with a probability of $10^{-4}$. However, the infidelity of the output magic state $p_{\rm out} = 1.22 \times 10^{-9}$ is higher by almost two orders of magnitude compared to the incoherent case. In our resource analysis, we will be working with incoherent circuit-level Pauli noise, applying errors according to Eq. (5), but with three different probabilities for the three different errors. Still, we comment on how coherent errors might affect the output error in Sec. 4.

20-to-4 distillation. Distillation protocols can output more than one magic state. If the 16-rotation circuit in Fig. 2 is multiplied by two $Z_{-\pi/8}$ rotations, one on the first and one on the second qubit, a 14-rotation circuit is obtained that outputs a $|\overline{m}\rangle \otimes |\overline{m}\rangle \otimes |+\rangle^{\otimes 3}$ state, i.e., two magic states. Similarly, using a 24-rotation circuit that non-trivially corresponds to the identity, a 20-rotation circuit that outputs a $|\overline{m}\rangle^{\otimes 4} \otimes |+\rangle^{\otimes 3}$ state can be obtained. This is the 20-to-4 protocol [11] shown in Fig. 4. With a Z-Pauli error model, there are 22 pairs of rotations that can lead to an output error. Therefore, the probability of an output error is $22p^2$ to leading order. However, since four states are produced, one should interpret this as $p_{\rm out} = 5.5p^2$ per magic state.
In other words, the probability that the resource state $|\overline{m}\rangle^{\otimes 4}$ will cause an error in a circuit is $22p^2$, but, since this resource state executes four π/8 rotations, this translates into a $5.5p^2$ error probability per gate. In a numerical simulation, the output error per state is determined via the infidelity between the projected output state and the ideal output state $|\overline{m}\rangle\langle \overline{m}|^{\otimes 4} \otimes |+\rangle\langle +|^{\otimes 3}$ divided by four. For $p = 10^{-4}$, this yields $p_{\rm out} = 5.505 \times 10^{-8}$ per output state.
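Returning to the coherent over-rotation example for the 15-to-1 protocol earlier in this section, the quoted output infidelity can be estimated without a full density-matrix simulation, because all rotations are diagonal and commute. The sketch below is built on assumptions rather than on Fig. 3: rotation k acts on the subset of qubits 2-5 given by the binary digits of k, and undetected error terms with an odd number of Z strings act as Z on the output qubit. Amplitudes of terms with the same net effect are added coherently.

```python
import math

def undetected_counts():
    """counts[w] = number of sets of w rotations whose Z strings cancel on qubits 2-5."""
    counts = [0] * 16
    for mask in range(1 << 15):
        syndrome, weight = 0, 0
        for k in range(15):
            if (mask >> k) & 1:
                syndrome ^= k + 1
                weight += 1
        if syndrome == 0:
            counts[weight] += 1
    return counts

def coherent_output_error(phi, counts):
    """Output infidelity when every rotation over-rotates by the angle phi."""
    c, s = math.cos(phi), math.sin(phi)
    # Each over-rotation expands as cos(phi) * 1 - i sin(phi) * Z^support.
    amp_ok = sum(n * c ** (15 - w) * (-1j * s) ** w
                 for w, n in enumerate(counts) if w % 2 == 0)
    amp_err = sum(n * c ** (15 - w) * (-1j * s) ** w
                  for w, n in enumerate(counts) if w % 2 == 1)
    return abs(amp_err) ** 2 / (abs(amp_ok) ** 2 + abs(amp_err) ** 2)

phi = math.asin(1 / 100)
p_coherent = coherent_output_error(phi, undetected_counts())
print(math.sin(phi) ** 2, p_coherent)
```

Because the 35 undetected triples add in amplitude rather than in probability, the leading term scales as $(35 \sin^3\phi)^2$ instead of $35 \sin^6\phi$, which accounts for the gap to the incoherent case.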

Faulty logical T gates
We use the notation of Ref. [28] to draw arrangements of logical surface-code qubits, where logical qubits are d × d surface-code patches whose dashed and solid edges represent X and Z boundaries, respectively. Logical operations are performed by measuring products of logical Pauli operators via lattice surgery [6-8]. A naive layout for the 15-to-1 protocol is shown in Fig. 5a, where the five qubits of the 15-to-1 circuit are placed next to each other with their Z boundaries facing up and down. A 5d × d ancillary space above and below these five qubits can be used to measure Pauli product operators between these qubits to perform π/8 rotations.
The code distance determines the logical error rate of the encoded qubits, which also depends on the underlying error model. Here, we consider circuit-level noise, where each physical gate, state initialization and measurement outcome is affected by a Pauli error with probability $p_{\rm phys}$. Using a minimum-weight perfect matching decoder for such a noise model, the logical error rate per code cycle [8] can be approximated as
$$p_L(p_{\rm phys}, d) = 0.1\,(100\, p_{\rm phys})^{(d+1)/2} . \tag{7}$$
Since a failure to decode X or Z syndromes correctly leads to logical Z or X errors, respectively, we will assume that logical X and Z errors each occur with a probability of $0.5\, p_L(p_{\rm phys}, d)$ per code cycle. Not all errors are equally harmful in the context of distillation protocols. Consider X and Z errors that affect one of the five qubits during the 15-to-1 protocol. Z errors affecting the first qubit (i.e., the output qubit) are always detrimental, since they cannot be detected and contribute to the overall output error of the protocol. The effect of X errors on any of the five qubits is to turn all previous $P_{\pi/8}$ rotations that acted on this qubit into $P_{-\pi/8}$ rotations. For instance, consider an X error on the third qubit after rotation 7 in Fig. 3. This X error can be commuted to the beginning of the circuit and absorbed into the initial $|+\rangle$ state. The commutation turns rotations 2, 5 and 6 into −π/8 rotations, since X and Z anti-commute. As errors on multiple rotations can lead to undetected errors, X errors should also be avoided.

Figure 5: A naive arrangement (a) of logical qubits could consist of five d × d patches initialized in the $|+\rangle$ state and two additional 5d × d ancilla regions for Pauli product measurements. The arrangement that we consider (b) consists of one $d_X \times d_X$ patch, four $d_Z \times d_X$ patches, and two ancilla regions with a width $d_X$ for Pauli product measurements.
Z errors on qubits 2-5, on the other hand, are less damaging. They are detectable, as they have the same effect as Z errors that affect rotations 1-4. Therefore, it is not necessary to encode the logical Z operators of qubits 2-5 with the same distance as their logical X operators. Instead, we encode these qubits using rectangular $d_Z \times d_X$ patches, as in Fig. 5b. For such a patch, the probability of a logical X error is $0.5(d_Z/d_X) \cdot p_L(p_{\rm phys}, d_X)$ per code cycle, since the X distance is $d_X$, but the number of possible X error strings is lower by a factor of $(d_Z/d_X)$ compared to a square patch. Correspondingly, the probability of Z errors is $0.5(d_X/d_Z) \cdot p_L(p_{\rm phys}, d_Z)$, since the Z distance is $d_Z$, but the number of Z error strings is higher by a factor of $(d_X/d_Z)$ compared to a square patch. We also fix the distance used in the ancillary region to $d_X$. Finally, there is a third distance $d_m$ which determines the number of code cycles used in lattice surgery. This affects the error of the Pauli product measurements used for logical gates, which can be detected by the distillation protocol.
In total, we end up with the arrangement shown in Fig. 5b that is characterized by three code distances $d_X$, $d_Z$ and $d_m$, where $d_X$ and $d_Z$ are spatial distances, and $d_m$ is the temporal distance. Before we construct a surface-code implementation of the 15-to-1 protocol, we first discuss two different ways of performing faulty logical π/8 rotations with surface codes: the traditional method based on state injection, and a protocol based on faulty T measurements.
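The storage error rates of rectangular patches described above can be tabulated with a few lines of code. The sketch below assumes the commonly used approximation $p_L(p_{\rm phys}, d) = 0.1\,(100\,p_{\rm phys})^{(d+1)/2}$ for the logical error rate per code cycle; the function names are illustrative.

```python
def p_logical(p_phys, d):
    """Assumed per-cycle logical error rate of a distance-d square patch."""
    return 0.1 * (100 * p_phys) ** ((d + 1) / 2)

def storage_errors(p_phys, d_x, d_z):
    """Per-cycle logical (X, Z) error probabilities of a d_z x d_x patch."""
    # X errors: distance d_x, with fewer error strings than in a square patch.
    p_x = 0.5 * (d_z / d_x) * p_logical(p_phys, d_x)
    # Z errors: distance d_z, with more error strings than in a square patch.
    p_z = 0.5 * (d_x / d_z) * p_logical(p_phys, d_z)
    return p_x, p_z

# Example values: d_X = 5, d_Z = 3 at a physical error rate of 1e-4.
p_x, p_z = storage_errors(1e-4, d_x=5, d_z=3)
print(f"X error per cycle: {p_x:.2e}, Z error per cycle: {p_z:.2e}")
```

For a square patch ($d_X = d_Z$), both expressions reduce to $0.5\,p_L$, so the rectangular case trades a higher, but detectable, Z error rate for fewer physical qubits.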

State injection
The standard method to perform faulty T gates with topological codes is via state injection and measurement. State injection is a protocol that prepares an arbitrary logical state $|\psi_L\rangle$ from a corresponding arbitrary physical state $|\psi\rangle$. Several such state-injection protocols exist [5,6,31-34], but none of them are fault-tolerant, i.e., the error probability of $|\psi_L\rangle$ is always proportional to $p_{\rm phys}$. The simplest protocol [6] starts with a physical state $|\psi\rangle$, i.e., a 1 × 1 surface-code patch, and then grows it into a d × 1 patch, and finally into a d × d patch. This is not a very efficient protocol, since growing patches involves measuring stabilizers for d code cycles. The qubit, therefore, spends many cycles in a distance-1 state, which increases the error probability.
More sophisticated state-injection protocols use postselection [32,34] to decrease the error. If the error rate of single-qubit operations is significantly lower than the error rate of two-qubit gates, the error due to state injection can even be lower than $p_{\rm phys}$. In circuit-level noise, a single number $p_{\rm phys}$ characterizes all gates. However, physical systems typically feature significantly better single-qubit operations than two-qubit gates. In state-of-the-art superconducting-qubit [35] and ion-trap [36] architectures, for instance, the fidelities of single-qubit and two-qubit gates differ by up to almost two orders of magnitude. Since two-qubit gates are typically the lowest-fidelity operations, and syndrome-readout circuits of surface codes mostly consist of two-qubit gates, the characteristic error rate $p_{\rm phys}$ in circuit-level noise will be largely determined by the error rate of two-qubit gates. If the two-qubit error rate is $p_{\rm phys}$, but the single-qubit error rate is $p_{\rm phys}/10$, state injection can produce magic states with an error as low as $\frac{13}{30} p_{\rm phys}$ [34] in just two code cycles. However, there is a certain failure rate of the protocol due to postselection, which increases the length of the protocol.
While state injection can be used to prepare faulty magic states, it cannot be used to directly execute $P_{\pi/8}$ rotations. Instead, state injection is used indirectly by preparing a faulty magic state and measuring $P \otimes Z$ via lattice surgery, as shown in Fig. 1. With a 50% probability, a $P_{\pi/4}$ correction is required. Performing this correction operation either requires extra time or extra space. In any case, this Clifford correction has an effect on the distillation protocol and, therefore, needs to be performed, increasing the space-time cost of the protocol. For this reason, we will avoid state injection, and instead construct a protocol that executes faulty $P_{\pi/8}$ rotations without the need for Clifford corrections.

Figure 7: A faulty $P_{\pi/8}$ rotation corresponds to a $P \otimes Z$ measurement involving a $|+\rangle$ ancilla, followed by a faulty T measurement of the ancilla.

Faulty T measurements
When a T gate is performed on a qubit $|\psi\rangle$ via state injection, a faulty magic state is prepared, entangled with $|\psi\rangle$, and measured, as shown in Fig. 6a. The faulty preparation can be treated as a T gate applied on a $|+\rangle$ state followed by a random X, Y or Z Pauli error. The idea of faulty T measurements is to avoid the Clifford correction by reversing the order of the entangling operation and the faulty T gate, as shown in Fig. 6b.
Here, $|\psi\rangle$ is first entangled with a $|+\rangle$ qubit. Next, a sequence of a random Pauli error, a T gate and an X measurement is performed, which we refer to as a faulty T measurement. Now, the correction operation in response to the X measurement is no longer a Clifford gate, but a Pauli Z operation, which requires no additional hardware operations. X, Y and Z errors lead to $S^\dagger$, S and Z errors on $|\psi\rangle$, respectively. Thus, a $P_{\pi/8}$ rotation can be performed by measuring $P \otimes Z$ involving an ancilla qubit initialized in the $|+\rangle$ state, followed by a faulty T measurement of the ancilla qubit, as shown in Fig. 7. With surface codes, protocols for faulty T measurements are identical to protocols for state injection, except that the order of operations is reversed. Here, we describe a simplified protocol to demonstrate the working principle of faulty T measurements. Similarly to the case of state injection, one can construct significantly more sophisticated protocols, as we discuss in Appendix A.
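The working principle can be checked on two qubits in the absence of noise. The sketch below is a toy state-vector calculation, not a surface-code simulation: it measures Z ⊗ Z between a data qubit and a $|+\rangle$ ancilla, applies a T gate to the ancilla, measures it in the X basis, and applies a Pauli Z to the data qubit if the outcome is −1. As an assumption of this sketch, a −1 outcome of the Z ⊗ Z measurement is handled by a Pauli X on the ancilla before the T gate. The function name is hypothetical.

```python
import numpy as np

T = np.diag([1, np.exp(1j * np.pi / 4)])
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1, -1]).astype(complex)
I2 = np.eye(2, dtype=complex)
PLUS = np.array([1, 1], dtype=complex) / np.sqrt(2)
MINUS = np.array([1, -1], dtype=complex) / np.sqrt(2)

def t_gate_via_t_measurement(psi, zz_outcome, x_outcome):
    """Return the data-qubit state for a given pair of measurement outcomes."""
    state = np.kron(psi, PLUS)                         # data (x) ancilla
    projector = (np.eye(4) + zz_outcome * np.kron(Z, Z)) / 2
    state = projector @ state                          # Z (x) Z measurement
    if zz_outcome == -1:
        state = np.kron(I2, X) @ state                 # assumed Pauli fix-up
    state = np.kron(I2, T) @ state                     # T gate on the ancilla
    bra = PLUS if x_outcome == +1 else MINUS
    data = state.reshape(2, 2) @ bra.conj()            # X measurement of ancilla
    if x_outcome == -1:
        data = Z @ data                                # Pauli Z correction
    return data / np.linalg.norm(data)

rng = np.random.default_rng(0)
psi = rng.normal(size=2) + 1j * rng.normal(size=2)
psi /= np.linalg.norm(psi)
target = T @ psi
overlaps = [abs(np.vdot(target, t_gate_via_t_measurement(psi, zz, x)))
            for zz in (+1, -1) for x in (+1, -1)]
print(np.round(overlaps, 12))
```

In every branch, the data qubit ends up in $T|\psi\rangle$ up to a global phase, so only Pauli operations depend on the measurement outcomes.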
One particularly simple state-injection protocol is performed by growing a physical qubit (a 1 × 1 patch) into a d × 1 patch and then into a d × d patch [6].
The corresponding faulty-T-measurement protocol can be performed by shrinking patches. Suppose that a logical qubit is encoded in a d × d patch, as in Fig. 8. First, the patch is shrunk to a d × 1 patch by measuring the green qubits, where the logical Z operator is $Z^{\otimes d}$, and the logical X operator corresponds to the X operator on any of the d qubits. Next, the d × 1 patch is shrunk to a 1 × 1 patch by measuring all red qubits in the Z basis. In fact, the red and green measurements can be performed simultaneously. The product of all Z measurement outcomes (and also preceding stabilizer measurements) determines an X Pauli correction on the remaining qubit, which now stores the logical information in its physical Pauli operators. Finally, a physical T gate is applied to the remaining qubit, before it is measured in the X basis.
Much like state injection, faulty T measurements are not fault-tolerant protocols, in the sense that their error rate is proportional to the physical error rate and does not decrease with the code distance. For a Pauli error model, these error rates can be understood as the probabilities of the Pauli-error operations in the dashed boxes in Fig. 6. For simplicity, we will assume that faulty T measurements have a Pauli error rate of $p_{\rm phys}$, meaning that, effectively, the blue qubit is affected by an X, Y or Z error with a probability of $p_{\rm phys}/3$ for each Pauli. When used to execute a $P_{\pi/8}$ rotation, this implies that this gate will have a $P_{-\pi/4}$, $P_{\pi/4}$ or $P_{\pi/2}$ error with a probability of $p_{\rm phys}/3$ for each error. This assumption is actually very inaccurate for the protocol shown in Fig. 8, since, for this protocol, the error scales with the code distance d, as any single-qubit X or Y error on the red qubits translates into a logical error in the faulty T measurement. This can be avoided by changing the measurement pattern, similar to the state-injection protocols of Refs. [31,33]. Moreover, the error rate can be significantly suppressed by adding "pre-selection" to the faulty-T-measurement protocol, the analogous operation to post-selection in the state-injection protocol of Ref. [32], wherein a faulty T measurement is deferred until the stabilizer checks surrounding the sensitive blue qubit report no syndromes for two code cycles. Because such a faulty-T-measurement protocol is identical to the state-injection protocol of Ref. [32], apart from the reversed order of operations, its error rate can be lower than $p_{\rm phys}$ if single-qubit operations have a significantly higher fidelity than two-qubit operations.

[Figure panel labels: Step 1: 0 code cycles; Step 2: $d_m$ code cycles; Step 3: $d_m$ code cycles.]

Figure 10: Shrinking the $|+\rangle$ patch in the top right corner to a $d_m \times 1$ patch produces a stabilizer configuration that is topologically equivalent to the one shown in Fig. 9. This does not reduce the code distance compared to Fig. 9.
In any case, the specific choice of faulty-T-measurement protocol (or state-injection protocol) will not matter for the distillation protocols that we construct in the following sections, which is why the more sophisticated T-measurement protocols are discussed in Appendix A. Moreover, we show in Appendix A that, for many relevant distillation protocols, a higher error rate for faulty T measurements has only a small effect on the performance of the distillation protocols, even if the error rate of faulty T measurements is assumed to be higher by an order of magnitude, $10p_{\rm phys}$ instead of $p_{\rm phys}$. In our following resource estimates, we will use the simplified assumption of an error rate of $p_{\rm phys}$ for faulty T measurements.
We now show how to use faulty T measurements to perform $P_{\pi/8}$ rotations in a surface-code arrangement similar to Fig. 5b. Suppose that we want to execute a $P_{\pi/8} = (Z_1 \otimes Z_4 \otimes Z_5)_{\pi/8}$ rotation on three qubits, i.e., rotation 9 in Fig. 3. In the first step in Fig. 9, we start with the five qubits of the 15-to-1 protocol, $(d_X + 4d_Z) \times d_X$ unused qubits in the ancillary region and an additional $d_m \times d_m$ unused qubits in the top right corner. In this example, $d_X = 5$, $d_Z = 3$ and $d_m = 3$. Next, following the circuit in Fig. 7, we initialize a $d_m \times d_m$ patch in the $|+\rangle$ state and use multi-patch lattice surgery [8,28] to measure $Z_1 \otimes Z_4 \otimes Z_5 \otimes Z_a$, where $Z_i$ is the Z operator of qubit i and $Z_a$ is the Z operator of the ancilla in the top right corner. The multi-patch lattice-surgery operation [8] is performed by initializing the physical qubits in the ancilla region in the $|+\rangle$ state and measuring the new stabilizers for $d_m$ code cycles, after which the qubits in the ancilla region are measured in the X basis. The outcome of the $Z_1 \otimes Z_4 \otimes Z_5 \otimes Z_a$ measurement is encoded in the measurement outcomes of the Z stabilizers in the ancilla region. Finally, in the last step, a faulty T measurement is performed on the $d_m \times d_m$ patch.
Next, we explain how we perform the error estimate for this protocol. For a protocol that performs a P π/8 rotation, we consider Z and X storage errors on qubits 1-5 and P π/2 , P π/4 and P −π/4 errors in the application of the rotation. As explained previously, the incorrect decoding of X and Z syndromes on qubits 1-5 leads to Z and X storage errors. These occur with a probability of 0.5(d X /d H ) · p L (p phys , d H ) and 0.5(d H /d X )·p L (p phys , d X ), respectively, where d H = d X for qubit 1, and d H = d Z for qubits 2-5. The incorrect decoding of the X syndrome in the ancilla region causes Z error strings connecting the X boundaries in the ancilla region. Depending on the exact location of these error strings, this can cause Z errors on one or multiple qubits. For instance, the error string highlighted in red and labeled '(2)' in Fig. 9 is topologically equivalent to error string '(1)', and therefore causes a Z error on qubit 1. Error string '(3)', on the other hand, is less harmful, since it is equivalent to a Z 1 ⊗ Z 4 error affecting qubits 1 and 4. Still, as a simplified pessimistic estimate, we will assume that any such error directly contributes to the Z error of the output qubit, i.e., qubit 1. This happens with a probability of 0.5(l/d X ) · p L (p phys , d X ) · d m , where l is the length of the ancilla patch. In our example, l = d X + 4d Z .
The incorrect decoding of the Z syndrome in the ancilla region causes X error strings. While there are no Z boundaries in the ancilla region for the X errors to connect to, in a space-time picture, these errors can also condense at the temporal boundaries set by the initializations of the physical qubits in the |+⟩ state and their measurement in the X basis. Because the lattice-surgery stabilizers are measured for d m code cycles, the probability of this error is governed by d m . Since 0.5p L (p phys , d m ) · d m is the X error rate of a d m × d m patch stored for d m code cycles, but the lattice surgery takes place over an area of l × d X , we estimate the probability of such an error as 0.5 · l · d X /(d m )² · p L (p phys , d m ) · d m . Such an error leads to an incorrect interpretation of the lattice-surgery measurement outcome. As shown in Fig. 7, incorrectly interpreting the outcome of the P ⊗ Z measurement causes an X error on the |+⟩ qubit, which turns the Z π/8 rotation into a Z −π/8 rotation, causing an overall P −π/4 error.
Z and X storage errors affecting the |+⟩ qubit in the top right corner each occur with a probability of 0.5p L (p phys , d m ) · d m and cause P π/2 and P −π/4 errors, respectively, as explained by inserting X and Z errors on the |+⟩ qubit in Fig. 7. Finally, the single-qubit X, Y and Z errors with a probability of p phys /3 during the faulty T measurement of Fig. 8 contribute to the P −π/4 , P π/4 and P π/2 error, respectively.
In order to save qubits, we can shrink the d m × d m patch in the top right corner to a d m × 1 patch. As shown in Fig. 10, this produces a stabilizer configuration that is topologically equivalent to the configuration of Fig. 9 and maintains the code distance. In principle, the Z storage error on the |+⟩ qubit is now reduced, but we will still use the error estimate explained in the previous paragraph.
To summarize our error estimate, in addition to the usual storage errors, a Z error affects the output qubit with a probability of 0.5(l/d X ) · p L (p phys , d X ) · d m . P π/2 , P −π/4 and P π/4 errors each occur with a probability of p phys /3. Furthermore, a P −π/4 error occurs with a probability of 0.5l · d X /d 2 m · p L (p phys , d m ) · d m . Moreover, additional P −π/4 and P π/2 errors occur with a probability of 0.5p L (p phys , d m ) · d m . Since we are not running an actual decoder to obtain these error probabilities, our error estimate should not be interpreted as a rigorous simulation, but as a pessimistic ballpark estimate.
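The summary above can be turned into a short numerical sketch. The Python snippet below adds up the listed contributions, to leading order, for one fast faulty T measurement. It assumes the heuristic logical error rate p L (p, d) = 0.1(100p)^((d+1)/2) per code cycle, which is not stated in this section, and all function and key names are purely illustrative.

```python
def p_logical(p_phys, d):
    """Assumed heuristic logical error rate per code cycle (not given in this section)."""
    return 0.1 * (100 * p_phys) ** ((d + 1) / 2)


def faulty_t_error_budget(p_phys, d_x, d_z, d_m):
    """Leading-order, pessimistic error budget of one fast faulty T measurement."""
    l = d_x + 4 * d_z  # length of the ancilla patch in the example layout
    # Z error strings in the ancilla region, all attributed to the output qubit.
    z_output = 0.5 * (l / d_x) * p_logical(p_phys, d_x) * d_m
    # X error strings ending on temporal boundaries flip the measurement outcome.
    x_ancilla = 0.5 * l * d_x / d_m**2 * p_logical(p_phys, d_m) * d_m
    # Z or X storage error on the |+> patch, stored for d_m code cycles.
    plus_storage = 0.5 * p_logical(p_phys, d_m) * d_m
    return {
        "Z_output": z_output,
        "P_pi/2": p_phys / 3 + plus_storage,
        "P_pi/4": p_phys / 3,
        "P_-pi/4": p_phys / 3 + x_ancilla + plus_storage,
    }


budget = faulty_t_error_budget(p_phys=1e-4, d_x=7, d_z=3, d_m=3)
for name, value in budget.items():
    print(f"{name}: {value:.3e}")
```

Summing the individual probabilities is only valid while all of them are small, which is in line with the ballpark character of the estimate.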

15-to-1 distillation
Our implementation of the 15-to-1 protocol of Fig. 3 is shown in Fig. 11. It starts by initializing qubits 2-4 in the |+⟩ state. In the first step, rotations 1-3 and 5 are performed. The single-qubit rotations are performed by initializing a d Z × d m surface-code patch in the |+⟩ state, measuring Z ⊗ Z, and performing a faulty T measurement of the d Z × d m patch. Multi-qubit rotations are performed via fast faulty T measurements. In step 2, qubit 1 is initialized in the |+⟩ state and rotations 6 and 7 are performed. In step 3, qubit 5 is initialized in the |+⟩ state and rotations 4, 8 and 9 are performed. In steps 4-6, rotations 10-15 are performed. Finally, in step 7, qubits 2-5 are measured in the X basis. If all measurement outcomes are +1, qubit 1 is a distilled magic state which can be used to execute a low-error P π/8 rotation. The distillation block can now be used to distill the next magic state.
Figure 12: Quantum circuit for delayed-choice P π/8 rotations. The choice of measurement basis for the single-qubit |+⟩ measurement decides whether a P π/8 rotation is performed, or no operation at all.
In order to prevent the output state from blocking the space reserved for qubit 1 for d X code cycles, the consumption of the output state for the execution of a P π/8 rotation can already be initiated in step 5, since this process takes d X code cycles. In the protocol shown in Fig. 11, steps 5-7 and step 1 of the subsequent distillation can be used to measure qubit 1, i.e., a total of 3d m code cycles. If d X > 3d m , the output state will block the space reserved for qubit 1 and slow down the protocol. This can be prevented by reordering the rotations. In any case, we only consider cases with d X ≤ 3d m .
One may be concerned that, when consuming the output state in step 5, we still do not know if the distillation protocol will succeed, i.e., whether the output state is faulty or not. This problem can be solved by using the circuit in Fig. 12, where an additional ancilla qubit initialized in the |+⟩ state is used. To execute a P π/8 rotation, the operator P ⊗ Z ⊗ Z between the qubits, the magic state and the additional ancilla state is measured. Depending on whether the |+⟩ state is measured in the Z or X basis, the P π/8 rotation is applied or not. This can be used to consume a magic state before the outcome of its distillation is known. If the distillation protocol for this magic state fails, the |+⟩ state can be measured in the X basis to prevent the faulty magic state from being used.
In total, this 15-to-1 protocol has a space cost of 2 · (d X + 4d Z ) · 3d X + 4d m physical qubits, taking physical measurement ancillas into account. The time cost is 6d m /(1 − p fail ) code cycles, where p fail is the failure probability of the protocol. As discussed in Sec. 1, p fail and the output error p out are determined numerically. To this end, the 5-qubit density matrix is simulated, taking into account errors from storage and faulty T measurements. All P π/8 rotations have a P π/2 , P π/4 and P −π/4 error of p phys /3 and additional errors due to the fast faulty T measurement protocol, as discussed in Sec. 2, with the exception that, for rotations that do not involve qubit 1, Z errors in the ancilla region do not cause Z errors on qubit 1. For single-qubit rotations, X errors on the d Z × d m qubit cause P −π/4 errors and occur with a probability of 0.5d Z p L (p phys , d m ), whereas Z errors spread to the adjacent qubit with a probability of 0.5(d 2 m /d Z )p L (p phys , d Z ). After each step (apart from step 7), d m code cycles worth of X and Z storage errors are applied to all five qubits. In addition, Z or X errors on the d X × d X ancilla patch used to consume the output state are added as Z or X storage errors to the output state for d X code cycles. Finally, the output error probability p out is computed as the infidelity 1 − F (ρ ideal , ρ out ).
We refer to a 15-to-1 protocol characterized by the code distances d X , d Z and d m as a (15-to-1) d X ,d Z ,dm protocol. For p phys = 10 −4 , we find that the protocol (15-to-1) 7,3,3 has a p out = 4.4 × 10 −8 , where we round the output error to two significant digits. It has a space cost of 810 qubits and a time cost of 18.1 code cycles. These two numbers can be multiplied to obtain a spacetime cost of 14,600 qubitcycles to three significant digits. We also report the space-time cost in terms of the full distance d full . Consider a 100-qubit quantum computation with a T -gate count of n T , where n T < 1/p out . Using the construction from Ref. [28], the entire computation can be finished in n T · d full code cycles, if 231 distance-d full surface-code patches are used to store the 100 qubits. The probability that any of these qubits is affected by a storage error that can spoil the outcome of the computation is p storage error = 231 · n T · d full · p L (p phys , d full ) .
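The space and time cost formulas stated above are easy to check numerically. The following Python sketch evaluates them for the (15-to-1) 7,3,3 protocol. The failure probability used here (p fail = 0.004) is not a value reported in the text. It is an illustrative choice that is consistent with the rounded numbers quoted above, and the function name is hypothetical.

```python
def fifteen_to_one_cost(d_x, d_z, d_m, p_fail):
    """Space (physical qubits), time (code cycles) and space-time cost of 15-to-1."""
    # Footprint including physical measurement ancillas, as stated in the text.
    space = 2 * (d_x + 4 * d_z) * 3 * d_x + 4 * d_m
    # Six steps of d_m code cycles each, repeated on failure.
    time = 6 * d_m / (1 - p_fail)
    return space, time, space * time


# p_fail = 0.004 is an assumed, illustrative value (see lead-in).
space, time, spacetime = fifteen_to_one_cost(d_x=7, d_z=3, d_m=3, p_fail=0.004)
print(space)                 # number of physical qubits
print(round(time, 1))        # code cycles
print(round(spacetime, -2))  # qubitcycles, rounded to the nearest hundred
```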
The storage error can be higher, if the computation does not exclusively consist of π/8 rotations, but also involves Pauli product measurements, as this increases the length of the computation. Using magic states with p out for each T gate, there is a probability of n T · p out that a faulty T gate will spoil the outcome of the computation. If we demand that storage errors lead to a relative increase of the error probability by 1% in the best case, then d = d full needs to satisfy
$$231 \cdot d \cdot p_L(p_{\rm phys}, d) < 0.01 \cdot p_{\rm out} .$$
Evidently, d does not depend on n T , but only on the number of qubits in the computation. For a 10,000-qubit computation, the qubits need to be stored using 20,284 distance-d qubits, and the condition changes to
$$20{,}284 \cdot d \cdot p_L(p_{\rm phys}, d) < 0.01 \cdot p_{\rm out} .$$
We pick the smallest odd-integer d that satisfies the condition. For p out = 4.4×10 −8 , d = 11 in the 100-qubit case and d = 13 in the 10,000-qubit case. The space-time cost reported in the last two columns of Tab. 1 is in terms of (physical data qubits)×(code cycles), i.e., 5.49d 3 or 3.33d 3 . These numbers are remarkably low, as the cost to explicitly perform, e.g., a logical CNOT gate on two distance-d qubits [6] is 3d 3 . Still, the only truly meaningful number quantifying the space-time cost is the cost in terms of qubitcycles.
Figure 13: (a) Qubit arrangement for the (15-to-1) × (15-to-1) protocol.
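The choice of the full distance can be illustrated with a few lines of Python. The sketch below searches for the smallest odd d for which storage errors add at most 1% to the T-gate error, using the patch counts quoted above (231 and 20,284). It assumes the heuristic p L (p, d) = 0.1(100p)^((d+1)/2), which is not stated in this section, and the function names are illustrative.

```python
def p_logical(p_phys, d):
    """Assumed heuristic logical error rate per code cycle (not given in this section)."""
    return 0.1 * (100 * p_phys) ** ((d + 1) / 2)


def full_distance(p_out, n_patches, p_phys, rel_increase=0.01):
    """Smallest odd d with n_patches * d * p_L(p_phys, d) < rel_increase * p_out."""
    d = 3
    while n_patches * d * p_logical(p_phys, d) >= rel_increase * p_out:
        d += 2  # surface-code distances are taken to be odd
    return d


print(full_distance(p_out=4.4e-8, n_patches=231, p_phys=1e-4))     # 100-qubit case
print(full_distance(p_out=4.4e-8, n_patches=20_284, p_phys=1e-4))  # 10,000-qubit case
```

Under the assumed error model, this reproduces the distances quoted in the text for p out = 4.4 × 10 −8 .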
The other protocols reported in Tab. 1 for p phys = 10 −4 are (15-to-1) 9,3,3 and (15-to-1) 11,5,5 , reducing the error to p out = 9.3 × 10 −10 and p out = 1.9 × 10 −11 , respectively. For a higher physical error rate of p phys = 10 −3 , (15-to-1) 17,7,7 has a p out = 4.5 × 10 −8 . Interested readers can verify these numbers using the Python script or Mathematica notebook provided in the Supplementary Material [37], where they can also try out other parameters. The Supplementary Material contains the resource cost calculations for all protocols considered in this paper.
15-to-1 protocols cannot generate arbitrarily good output states, as p out is limited by ∼10p 3 phys . In order to distill higher-fidelity magic states, we turn to two-level protocols.

Two-level protocols
The idea of two-level protocols is to use distilled magic states (level-1 states) to perform a second round of distillation. We first discuss (15-to-1) × (15-to-1) protocols, where 15-to-1 output states are used for a second level of 15-to-1 distillation. The qubit arrangement used for this protocol is shown in Fig. 13a. It is described by three additional code distances d X2 , d Z2 and d m2 , and the number of level-1 distillation blocks n L1 , where n L1 is an even integer. The central region consists of a (d X2 + 4d Z2 ) × 3d X2 array of qubits where the second level of distillation takes place. To the left and right are the level-1 distillation blocks that feed level-1 states into the upper and lower ancilla region of the level-2 block, respectively. Each of these two level-1 regions consists of n L1 /2 level-1 blocks. These blocks are characterized by the level-1 distances d X , d Z and d m , such that each level-1 region produces one magic state every 6d m /(1 − p fail )/(n L1 /2) code cycles.
In addition, each level-1 region has a 3d m2 × 4d m2 array of qubits that separates the level-1 blocks from the level-2 block. As level-1 states are produced, they are transported into this intermediate region. For this purpose, each level-1 region has two spots that are reserved for level-1 output states. While one of these spots is being filled with a newly generated level-1 state, the magic state in the other spot can be consumed to execute a P π/8 rotation in the level-2 block, as shown in Fig. 13b. These rotations are performed using the auto-corrected π/8 rotation [28] shown in Fig. 13c. Here, the operator P ⊗ Z between the level-2 qubits and a level-1 magic state |m⟩ is measured. Simultaneously, Z ⊗ Y between |m⟩ and an ancillary |0⟩ state is measured. Depending on the outcome of the P ⊗ Z measurement, the |0⟩ qubit is either read out in the X or Z basis, essentially performing a Clifford correction or not.
As shown in Fig. 13b, this kind of measurement can be performed in d m2 code cycles using a configuration similar to a faulty T measurement. While one magic state in each level-1 region is used to execute a level-2 π/8 rotation, a second level-1 state is being transported to be used for the subsequent level-2 rotation. In the top and bottom ancilla region, one such level-2 rotation can be performed every
$$t_{L1} = \max\left(d_{m2},\ \frac{6 d_m}{(1 - p_{\rm fail})\,(n_{L1}/2)}\right)$$
code cycles. If level-1 states are produced slowly, t L1 is determined by the output rate of the level-1 factories. If these states are produced fast, t L1 will be limited by d m2 , the duration of the auto-corrected π/8 rotations.
Figure 14 panel labels: Step n: n·t L1 cycles, for n = 1, ..., 8.
The entire (15-to-1) × (15-to-1) protocol is shown in Fig. 14, focusing on the level-2 region. In steps 1 and 2, the first four rotations of the 15-to-1 circuit are performed. Since these are single-qubit rotations, |0⟩ ancillas are not required, as Clifford corrections correspond to Pauli corrections for these first four rotations. In steps 3-7, rotations 5-14 are performed. The consumption of the output state is initiated in step 7. In step 8, qubit 4 is measured in the X basis and rotation 15 is performed in the bottom ancilla region. Since the space reserved for qubit 4 is now empty, the top ancilla region can be used to perform the first rotation of the subsequent distillation round, such that the next round of distillation will only take 7t L1 instead of 8t L1 code cycles. For this reason, the time cost of this distillation block is 7.5t L1 code cycles. Again, the consumption of the output state will slow down the protocol, if it takes longer than 3d m2 code cycles, so the distances should be chosen such that d X2 ≤ 3d m2 .
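The interplay between the level-1 output rate and the duration of a level-2 rotation can be sketched as follows. The snippet takes t L1 to be the larger of d m2 and the level-1 production period 6d m /(1 − p fail )/(n L1 /2), in line with the two limiting cases described above, and uses the time cost of 7.5t L1 stated in the text. The parameter values and function names are hypothetical and chosen only to show both regimes.

```python
def t_l1(d_m, d_m2, n_l1, p_fail_l1):
    """Code cycles per level-2 rotation in one ancilla region."""
    # Each level-1 region holds n_l1 / 2 blocks, each finishing every 6 d_m / (1 - p_fail).
    level1_period = 6 * d_m / (1 - p_fail_l1) / (n_l1 / 2)
    # The rotation itself (auto-corrected pi/8 rotation) takes d_m2 code cycles.
    return max(d_m2, level1_period)


def two_level_15_to_1_time(d_m, d_m2, n_l1, p_fail_l1):
    """Time cost of the (15-to-1) x (15-to-1) block in code cycles, 7.5 t_L1."""
    return 7.5 * t_l1(d_m, d_m2, n_l1, p_fail_l1)


# Hypothetical parameters: with few level-1 blocks the factories are the bottleneck,
# with many blocks the rotation time d_m2 is.
print(t_l1(d_m=3, d_m2=5, n_l1=4, p_fail_l1=0.0))  # limited by level-1 output
print(t_l1(d_m=3, d_m2=5, n_l1=8, p_fail_l1=0.0))  # limited by d_m2
```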
Error analysis. The level-1 blocks output level-1 states with an output error of p out = p L1 . This error contributes to the probability of a P π/2 error, when this state is used for a P π/8 rotation. Furthermore, the level-1 state accumulates additional errors as it is moved to the level-2 block. As shown in Fig. 13b, it traverses a region of width d m2 and a maximum length of n L1 /4 · (d X + 4d Z ) + 2d m2 for d m2 code cycles, before ending up in a 2d m2 × 2d m2 region, where it stays for another d m2 code cycles. Therefore, we define l move as the length of the ancilla region that increases the storage error of the level-1 state. In this sense, the X and Z error of the level-1 state are each increased by 0.5l move · p L (p phys , d m2 ), contributing to the P π/2 and P −π/4 error, respectively. The error analysis of the level-2 block is analogous to faulty T measurements. For an ancilla of length l, where l can be up to d X2 + 4d Z2 + d m2 , X errors lead to a P −π/4 error with a probability of 0.5(l · d X2 /d m2 ) · p L (p phys , d m2 ). Moreover, if qubit 1 is part of the rotation, it is affected by an additional Z storage error with a probability of 0.5(l · d m2 /d X2 ) · p L (p phys , d X2 ). Finally, the X and Z error of the output state increases by d X2 · p L (p phys , d X2 ) while it is being consumed.
Figure 15 panel labels: Step n: n·t L1 cycles, for n = 1, ..., 10.
20-to-4 distillation. For output states with a desired error rate that is lower than what is possible with one level of 15-to-1, but higher than what can be achieved with two levels of 15-to-1, it can be more efficient to use a level-2 protocol that is cheaper, but features less error suppression. One such protocol is the 20-to-4 protocol of Fig. 4. The implementation of a (15-to-1)×(20-to-4) protocol is shown in Fig. 15 and is very similar to a (15-to-1)×(15-to-1) protocol. The main difference is that the length of the level-2 region increases to 4d X2 + 3d Z2 , as the 20-to-4 circuit acts on seven qubits, four of which are output states.
For protocols that generate multiple output states, it is particularly important to pick a suitable order in which the rotations are performed, in order to avoid congestion. If four output states are generated at the end of the protocol, but the computation demands that they are consumed one after the other, then they will block the level-2 factory for many cycles. In step 1 of our protocol, rotations 1 and 2 are performed. In step 2, qubit 1 is initialized in the |+⟩ state and rotations 4 and 5 are performed. Simultaneously, the measurement of qubit 1 as an output state can be initiated. In steps 3-6, rotations 3 and 6-12 are performed. In step 7, rotations 13 and 14 are performed and the consumption of output state 2 can be initiated. The snapshots in Fig. 15 are drawn in a way that assumes that it takes 3t L1 to consume a magic state, but this depends on d X2 . In steps 8-9, rotations 15-18 are performed and the consumption of qubit 3 can be initiated. In step 10, the last two rotations are performed and the protocol finishes after 10t L1 code cycles. The first three steps of the subsequent round of distillation can be used to initiate the measurement of output state 4 and finish the measurements of all remaining output states. If output states are consumed one after the other, e.g., to perform π/8 rotations one after the other, this protocol allocates up to 2.5t L1 code cycles for each output state, which is sufficient for the parameters that we consider.
The error analysis of protocols labeled as (15-to-1) n L1 d X ,d Z ,dm × (20-to-4) d X2 ,d Z2 ,dm2 is identical to two-level 15-to-1 protocols, albeit with a different space and time cost, and the extra step of dividing the space-time cost and output error by four. Moreover, if multiple output qubits are part of a rotation, the additional Z error due to Z-error strings in the ancilla region with a probability of 0.5(l · d m2 /d X2 ) · p L (p phys , d X2 ) is applied to all output qubits that are part of the rotation. For p phys = 10 −4 , we find that the (15-to-1) … For p phys = 10 −3 , the protocols (15-to-1) …
What about coherent errors? As discussed in Sec. 1, the performance of distillation protocols can be worse, if the underlying error model features coherent errors. In a circuit-level study of surface codes, coherent errors are difficult to analyze. If the logical error rate of surface codes can be maintained at p L (p phys , d) even in the presence of coherent errors, and the only effect of coherent errors is to under- or over-rotate the physical T gate used in faulty T measurements, then one can argue that the effect of coherent errors is not too significant. These errors would then increase the minimum achievable output error rate of the first level of distillation, albeit by an error that is governed by the single-qubit error rate. While this can be a problem for single-level distillation schemes, this is less detrimental for two-level distillation schemes, as only the first level is affected. Since level-1 blocks typically output states that have a fidelity that is much lower than the maximum achievable level-1 fidelity, the overall output fidelity would be barely affected. For instance, the level-1 block of the (15-to-1) 4 13,5,5 × (20-to-4) 27,13,15 protocol outputs states with p out ≈ 10 −6 , whereas the lowest possible error of the 15-to-1 protocol for p phys = 10 −3 is p out ≈ 10 −8 . Still, a more careful treatment of coherent errors is necessary, but is beyond the scope of this work.
What about feed-forward? One possible limiting factor in quantum computers is the feed-forward time, i.e., the time it takes to react to measurement outcomes, deciding which operation should be performed next. In our protocols, some qubits need to be measured in the X or Z basis depending on previous measurement outcomes, which is used to avoid Clifford corrections. In Fig. 13a, these are the qubits in the intermediate region between the level-1 blocks and the level-2 block. If the feed-forward time is a bottleneck in a given architecture, these qubits need to be stored for some additional time before being read out. A slowdown due to feed-forward can be avoided by using additional ancilla qubits in this ancilla region. In any case, long feed-forward times increase the overall space-time cost.
The constructions discussed in the previous sections can be used to implement any distillation protocol that can be expressed as a sequence of Z-type π/8 rotations, e.g., all protocols that are based on triorthogonal matrices [11,19]. A very similar class of protocols are synthillation protocols, whose implementation we discuss in the following section.

Synthillation
Synthillation [17,38] is a portmanteau of the words (gate) synthesis and distillation. The idea of synthillation is to generate resource states that do not execute single T gates or π/8 rotations, but entire layers of commuting π/8 rotations. The simplest example is the |CCZ⟩ resource state, which is prepared by applying a controlled-controlled-Z (CCZ) gate to a |+⟩^⊗3 state. This state can be used to perform a CCZ gate, which can be written as a sequence of seven commuting π/8 rotations [28,39]. However, it is also possible to execute CCZ gates using four T gates [40], or even only two T gates, if these CCZ gates are part of a compute-uncompute circuit [41].
Synthillation circuits can be obtained the same way as ordinary distillation circuits. We start with a nontrivial representation of the identity in Fig. 16a as a sequence of 15 π/8 rotations on 4 qubits. Next, we cancel the first three and last four rotations by multiplying the entire circuit with the corresponding +π/8 and −π/8 rotations, i.e., the 7 rotations on the right-hand side of Fig. 16b. These 7 rotations happen to correspond to the decomposition of the CCZ gate into 7 π/8 rotations. In other words, the circuit on the left-hand side of Fig. 16b prepares a |CCZ⟩ state on the first three qubits and acts trivially on the fourth qubit. Therefore, the fourth qubit can be used to detect errors. If any one of the 8 rotations fails, the X measurement outcome will flip. However, any pair of errors will go undetected. Therefore, the output error to leading order is 28p 2 under Z-Pauli noise.
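The prefactor of 28 is simply the number of ways to pick two faulty rotations out of eight. As a minimal sketch, the Python snippet below enumerates these pairs and evaluates the resulting leading-order estimate. It assumes, as the statement above implies, that each pair of failures contributes p² and that higher-order terms are neglected. The names are illustrative.

```python
from itertools import combinations

N_ROTATIONS = 8  # rotations in the 8-to-CCZ circuit

# Every unordered pair of failed rotations goes undetected by the X measurement.
undetected_pairs = list(combinations(range(N_ROTATIONS), 2))


def leading_order_output_error(p):
    """Leading-order output error: (number of undetected pairs) * p^2."""
    return len(undetected_pairs) * p**2


print(len(undetected_pairs))             # number of undetected error pairs
print(leading_order_output_error(1e-3))  # estimate for a rotation error rate of 1e-3
```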
Since the 8-to-CCZ circuit is a sequence of Z-type π/8 rotations, it can be implemented the same way as a two-level distillation protocol, as shown in Fig. 17. The level-2 block has a width of 3d X2 + d Z2 . Performing two rotations every t L1 code cycles, the protocol finishes after 4t L1 code cycles. In our numerical error analysis, we obtain the output error as the infidelity between the output state and the ideal state ρ ideal = |CCZ⟩⟨CCZ| ⊗ |+⟩⟨+|. Labelling our protocols as (15- … , we find that, for p phys = 10 −3 , the protocol (15-to-1) 6 13,7,7 × (8-to-CCZ) 25,15,15 generates |CCZ⟩ states with an output error of p out = 5.2 × 10 −11 and a space-time cost of 2,820,000 qubitcycles. If the execution of a CCZ gate using ordinary magic states requires four of these states, they need to have a quarter of this output error to achieve the same gate error. In comparison, a (15-to-1) × (20-to-4) protocol can generate one T -gate magic state with p out = 2.6 × 10 −11 using 1,840,000 qubitcycles per state. In other words, four such states would cost 7,360,000 qubitcycles, more than twice as expensive as a |CCZ⟩ state.
Figure 17 panel labels: Step n: n·t L1 cycles, for n = 1, ..., 4.
While this implies that synthillation protocols can reduce the distillation cost, this ignores the fact that the consumption of the output state can congest the distillation block. In the construction of Fig. 17, there are 2t L1 code cycles for the consumption of each state. If all states can be consumed simultaneously, then this might be sufficient. However, if the rest of the quantum computer is such that the three output states need to be consumed one after the other, the synthillation protocol may be slowed down significantly, increasing the overall space-time cost. This can be avoided by using additional qubits to temporarily store the output state, although this does not prevent an increase in the space-time cost. Another possibility is to slow down the protocol to increase the time available for the consumption of the output states, e.g., by using slow measurements instead of fast measurements.
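The cost comparison above amounts to a one-line calculation, sketched here in Python with the numbers quoted in the text. The variable names are illustrative, and the comparison deliberately ignores the difference in output error between the two options.

```python
ccz_cost = 2_820_000      # qubitcycles per |CCZ> state via 8-to-CCZ synthillation
t_state_cost = 1_840_000  # qubitcycles per T-gate magic state via (15-to-1) x (20-to-4)
t_states_per_ccz = 4      # T gates needed for one CCZ gate, as assumed in the text

four_t_states_cost = t_states_per_ccz * t_state_cost
ratio = four_t_states_cost / ccz_cost

print(four_t_states_cost)  # cost of four T-gate magic states in qubitcycles
print(round(ratio, 2))     # cost relative to one |CCZ> state
```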
For p phys = 10 −4 , we find that a (15-to-1) 4 7,3,3 × (8-to-CCZ) 15,7,9 generates output states with p out = 7.2 × 10 −14 using 447,000 qubitcycles, which is also cheaper compared to the cost to distill ordinary magic states with the same fidelity. Such synthillation protocols can be used to generate resource states that execute any arbitrary sequence of π/8 rotations. Schemes to obtain such protocols are found in Ref. [17]. However, note that the problem of output states congesting the distillation block becomes more severe, if the generated output state consists of many qubits.

Small-footprint protocols
So far, our protocols have focused on minimizing the space-time cost. In this section, we outline how protocols can be designed to minimize qubit footprint, i.e., the space cost, by sacrificing space-time overhead. We discuss the example of (15-to-1) and (15-to-1)×(15-to-1) protocols with a target output error of p out ≈ 10 −9 .
15-to-1. The footprint can be straightforwardly reduced by using only one region of ancilla qubits instead of two. This reduces the footprint to 4(d X + 4d Z )d X + 2d m physical qubits, as shown in Fig. 18. Since only one ancilla region is used, the time cost doubles to 12d m compared to the protocol of Fig. 11. The error estimate is identical to the ordinary 15-to-1 protocol. We find that, for p phys = 10 −4 , the small-footprint (15-to-1) 9,3,3 protocol produces magic states with p out = 1.5 × 10 −9 using 762 qubits. However, since the protocol takes 36.2 code cycles, the space-time cost of 27,600 qubitcycles is higher than for comparable protocols with a similar output error.
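The footprint and time cost of the small-footprint 15-to-1 block can be checked with a short Python sketch. The failure probability p fail = 0.005 is not reported in the text. It is an illustrative value that is consistent with the rounded time and space-time cost quoted above, and the function name is hypothetical.

```python
def small_footprint_15_to_1_cost(d_x, d_z, d_m, p_fail):
    """Space, time and space-time cost of the single-ancilla-region 15-to-1 block."""
    # One ancilla region instead of two, measurement ancillas included.
    space = 4 * (d_x + 4 * d_z) * d_x + 2 * d_m
    # Twice the 6 d_m of the two-region layout, repeated on failure.
    time = 12 * d_m / (1 - p_fail)
    return space, time, space * time


# p_fail = 0.005 is an assumed, illustrative value (see lead-in).
space, time, spacetime = small_footprint_15_to_1_cost(d_x=9, d_z=3, d_m=3, p_fail=0.005)
print(space)                 # physical qubits
print(round(time, 1))        # code cycles
print(round(spacetime, -2))  # qubitcycles, rounded to the nearest hundred
```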
Two-level 15-to-1 distillation. For two levels of 15-to-1 with a small footprint, we use the arrangement shown in Fig. 19a, which consists of a (d X2 + 4d Z2 ) × 2d X2 level-2 block, a (d X + 4d Z ) × 3d X level-1 block, and an intermediate region of size 2d m2 × 2d m2 + d m2 × d X2 . The level-1 block outputs level-1 states every ∼6d m code cycles. In principle, it is possible to use the small-footprint level-1 block of Fig. 18, but in the interest of not increasing the space-time cost by too much, we use the 15-to-1 block introduced in Sec. 3. When a level-1 state is generated, it can be consumed for a level-2 rotation in 2d m2 code cycles. An example of a (Z 1 ⊗ Z 3 ⊗ Z 5 ) π/8 rotation is shown in Fig. 19b. In the first d m2 code cycles, a level-1 state is transported from the level-1 block to the intermediate region. In the second d m2 code cycles, a Pauli product measurement is performed, executing a P π/8 rotation according to the circuit in Fig. 13c. Transporting the level-1 state into the intermediate region increases its X and Z error by 5d m2 · p L (p phys , d m2 ). If we define the time that it takes to execute a level-2 rotation as
$$t_{L1} = \max\left(\frac{6 d_m}{1 - p_{\rm fail}},\ 2 d_{m2}\right),$$
then the distillation protocol takes 15t L1 to finish. In our numerical analysis, we find that, for p phys = 10 −3 , a small-footprint (15-to-1) 9,5,5 × (15-to-1) 21,9,11 protocol generates magic states with p out = 6.1 × 10 −10 using only 7,780 physical qubits. With a time cost of 469 cycles, the space-time cost is 3,650,000 qubitcycles per output state. While the space cost is very low, this protocol sacrifices space-time cost, as an ordinary (15-to-1) × (20-to-4) protocol can generate states with p out = 1.4 × 10 −10 for only 1,420,000 qubitcycles.
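As a rough consistency check of the quoted footprint, the three regions of the arrangement can be added up in Python. The sketch assumes that the total number of physical qubits is twice the number of data qubits, i.e., one measurement ancilla per data qubit, in analogy to the single-level count, and it omits any additional small patches. Under these assumptions, which are ours and not stated explicitly in the text, the result agrees with the reported value to three significant digits.

```python
def small_footprint_two_level_qubits(d_x, d_z, d_m2, d_x2, d_z2):
    """Physical-qubit estimate for the small-footprint (15-to-1) x (15-to-1) layout."""
    level2 = (d_x2 + 4 * d_z2) * 2 * d_x2             # level-2 block
    level1 = (d_x + 4 * d_z) * 3 * d_x                # level-1 block
    intermediate = 2 * d_m2 * 2 * d_m2 + d_m2 * d_x2  # intermediate region
    # Assumed factor of 2: one measurement ancilla per data qubit.
    return 2 * (level2 + level1 + intermediate)


qubits = small_footprint_two_level_qubits(d_x=9, d_z=5, d_m2=11, d_x2=21, d_z2=9)
print(qubits)                  # estimated physical qubits
print(round(7_780 * 469, -4))  # reported space x reported time, in qubitcycles
```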

Conclusion
We have constructed magic state distillation protocols that reduce the space-time cost by approximately 90% compared to the previous state of the art. Since our results were not obtained by simulating entire surface-code patches and running an actual decoder, but via a careful error analysis, the numbers reported in Tab. 1 should be taken with a grain of salt. The protocols discussed in this paper should rather be regarded as a proof of principle, demonstrating that the overhead of distillation can be reduced significantly by carefully tuning the code distances of the different qubits that are part of the distillation protocol. In any case, exact numbers will have a strong dependence on the hardware-specific error parameters and the decoding procedure.
There is still plenty of room for optimization. For one, we only considered very simple distillation protocols, i.e., the 15-to-1 distillation protocol, a 20-to-4 block-code protocol and the synthillation of |CCZ⟩ states. Perhaps, more sophisticated distillation circuits can further reduce the cost. While the 20-to-4 protocol is part of an entire family of (3k + 8)-to-k protocols, it seems unlikely that a higher k will decrease the cost, since the space cost is governed by k + 3, and the time cost is governed by 3k + 8. Therefore, the space-time cost per output state is governed by (3k+8)(k+3)/k, which happens to have minima at k = 2 and k = 4 for even integer k. Still, since it is possible to generate arbitrarily many such protocols based on triorthogonal codes [11,19], one could look for protocols that minimize the space-time cost in this manner. For two-level distillation, it remains unclear which combination of protocols reduces the cost. It also remains unclear whether the space-time cost can be decreased by using protocols that reduce the number of π/8 rotations in the distillation circuit by adding Clifford gates [3], or protocols that employ catalyst states that need to be stored at a higher code distance [21]. One could also construct protocols with more than two levels of distillation.
The resource requirements for fault-tolerant surface-code-based quantum computing can be daunting. Hopefully, this work helps demonstrate that this is not mainly due to the overhead of magic-state distillation, but rather due to the low encoding rate of topological codes, implying that thousands of physical qubits are required to encode a single logical qubit.
Figure 20: Protocol of a faulty T measurement in the spirit of Ref. [33]. It can be thought of as a d × d patch being shrunk to a (d − 2) × (d − 2) patch, then a (d − 4) × (d − 4) patch, and so on, until a physical T gate is performed on the remaining 1 × 1 patch (the blue qubit), before it is measured in the X basis. Since all these shrinking operations can be performed simultaneously, the T -measurement protocol corresponds to just the last step of the figure, where all red qubits are measured in the Z basis, all green qubits in the X basis, and a faulty T measurement is performed on the blue qubit.
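The claim about the minima of (3k+8)(k+3)/k is quickly verified by tabulating the expression for even k. The following Python sketch does this for k up to 12. The range and the function name are arbitrary choices made for illustration.

```python
def relative_cost(k):
    """Space-time cost per output state of a (3k+8)-to-k protocol, up to a constant."""
    return (3 * k + 8) * (k + 3) / k


costs = {k: relative_cost(k) for k in range(2, 13, 2)}
for k, cost in costs.items():
    print(k, cost)

best = min(costs.values())
print([k for k, cost in costs.items() if cost == best])  # even k with minimal cost
```

For even k, the expression takes the same value at k = 2 and k = 4 and grows for larger k.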

A Faulty T measurements
This appendix discusses faulty T measurements in more detail. Specifically, we discuss approaches to decrease their error rate to justify the use of a faulty-T-measurement error rate of p phys in the main text and the potential implications of a higher error rate.
The T -measurement protocol discussed in Sec. 2.2 is simple, but the operation of shrinking a d × d patch to a d × 1 patch is associated with an error that scales with the code distance d. While this might not be problematic for small code distances, it implies that this specific T -measurement protocol will not work for larger code distances. A more advanced version of this protocol is shown in Fig. 20. This protocol essentially corresponds to the state-injection protocol of Ref. [33] in reverse. A d × d patch is shrunk to a (d − 2) × (d − 2) patch by measuring physical qubits along the boundary in the X or Z basis. Next, the patch is shrunk to a (d − 4) × (d − 4) patch, a (d − 6) × (d − 6) patch, etc., until we end up with a 1 × 1 patch, i.e., a single physical qubit corresponding to the logical qubit. This qubit is then measured in the T basis. Since the qubit is in a distance-1 state only at the very end of the protocol, the error rate of this T -measurement protocol is not proportional to the code distance d.
The shrinking operations can all be performed simultaneously, such that the faulty T measurement corresponds to measuring all red qubits in the last step of Fig. 20 in the Z basis, and all green qubits in the X basis. A physical T gate is applied to the remaining blue qubit, after which it is measured in the X basis. This protocol is almost identical to the T -measurement protocol of Sec. 2.2, with the main difference being the position of the blue qubit. In the protocol of Sec. 2.2, the blue qubit is in the corner of the patch, whereas in Fig. 20, the blue qubit is in the center of the patch.
In order to examine the error sources of this protocol, we consider the space-time diagram of the T -measurement protocol of Fig. 20, which is shown in Fig. 21. It is identical to the space-time diagram of the state-injection protocol of Ref. [33] shown in Ref. [42]. Such space-time diagrams are useful to visualize the effect of physical Pauli errors and stabilizer measurement errors on the encoded logical information. Strings of errors connecting the red boundaries in the space-time diagram lead to logical Z errors, whereas error strings connecting the green boundaries lead to logical X errors. In our usual estimates of idling distance-d logical qubits, we take into account the weight-d error strings that precede the faulty T measurement, i.e., error strings similar to the ones labelled (1) and (2) in Fig. 21, which have a probability of p L (p phys , d). During state injection or faulty T measurements, there are some additional low-weight errors corresponding to error strings that connect two time-like boundaries during the T measurement, i.e., single-qubit errors on the blue qubit (3) and strings similar to the one labelled (4) in Fig. 21.
Pre-selection. These low-weight errors are the dominant contribution to the error rate of such state-injection or faulty-T-measurement protocols. Specifically, for circuit-level Pauli noise, single-qubit errors affecting the qubits close to the blue qubit and failures of two-qubit gates involved in the syndrome readout of check operators close to the blue qubit contribute to these low-weight errors. In the case of state-injection protocols, these errors can be suppressed through the addition of post-selection [32], particularly if the error rate of two-qubit gates is significantly higher than the error rate of single-qubit operations. An injected magic state is only accepted if the stabilizer checks in a certain region around the sensitive blue qubit report no syndrome for two code cycles. Since at least two circuit-level two-qubit errors are required to generate an unreported error, the dominant contribution to the error rate of the state-injection protocol will be governed by single-qubit errors on the sensitive physical qubits. However, this post-selection adds a certain failure probability to the state-injection protocol, as some instances of state injection will be rejected. The protocol in Ref. [32] reduces the error rate of state injection to ∼2p_phys/3 for two-qubit error rates of p_2 = p_phys and one-qubit error rates of p_1 = p_phys/10 with a rejection rate of around 50% for p_phys = 10^{−3}. In Ref. [32], a distance-7 region around the sensitive qubit was used for post-selection, leading to a high rejection rate. Preliminary numerical results by Lao et al. [34] indicate that by post-selecting a distance-3 region around the sensitive qubit, the error rate of state injection still stays below p_phys, while the rejection rate drops to ∼4% for p_phys = 10^{−3}.
Figure 22 (panels: Step 1, Step 2, Step 3): The finite rejection rate of pre-selection can be prevented from significantly increasing the time cost of distillation protocols by using an additional d_m × d_m patch during the execution of P_{π/8} rotations. While one d_m × d_m patch is idling until the pre-selection condition is satisfied, i.e., until the stabilizers in a region surrounding the sensitive physical qubit have not reported any syndromes for two code cycles, a second d_m × d_m patch can be used to execute the next P_{π/8} rotation.
Since faulty T measurements are identical to state injection, apart from the time direction being reversed, the same approach can be used to reduce the error rate of faulty-T-measurement protocols. The analogous operation is pre-selection: The single-qubit measurements in the protocol of Fig. 21 are delayed until the check operators in a region around the blue qubit have reported no syndrome for two consecutive code cycles. This justifies the use of a faulty-T-measurement error rate of p_phys in the main text, but also adds a time cost to the faulty-T-measurement protocol through the rejection rate. While one might be worried that, particularly for high p_phys, the finite rejection rate in this faulty-T-measurement protocol increases the time it takes to execute a faulty P_{π/8} rotation, this does not need to be the case if a few extra qubits are added to the protocol, as shown in Fig. 22. Here, in step 1, a P_{π/8} rotation is performed using the protocol of Fig. 9. The d_m × d_m patch corresponding to the qubit that needs to be read out using a faulty T measurement is then left idling until the pre-selection condition is satisfied and the qubit can be measured. While this patch is idling, a second d_m × d_m patch is used to perform the next P_{π/8} rotation in step 2. With a high probability, the d_m × d_m patch left idling in step 1 will have been read out by the beginning of step 3, such that the freed-up space can be used to initialize a d_m × d_m patch for the next rotation. In rare cases, the d_m × d_m patch left idling in step 1 will have failed the pre-selection after, in this example, 2d_m code cycles, in which case it needs to keep idling, increasing the overall time cost of the distillation protocol. If this happens sufficiently rarely, the additional time cost is negligible.
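The pre-selection condition itself is a simple rule on the syndrome record of the monitored region. The sketch below illustrates it under the assumption that the record has already been reduced to one Boolean per code cycle (True if any check in the region reported a syndrome). The function name first_accept_cycle and this data layout are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Optional


def first_accept_cycle(syndromes: Iterable[bool], window: int = 2) -> Optional[int]:
    """Return the 1-based code cycle after which pre-selection is satisfied.

    Each entry of `syndromes` is True if any check in the monitored region
    reported a syndrome in that cycle (a hypothetical, simplified record).
    The condition is met once `window` consecutive cycles are syndrome-free.
    Returns None if this never happens within the given record.
    """
    clean_run = 0
    for cycle, flagged in enumerate(syndromes, start=1):
        # A reported syndrome resets the count of consecutive clean cycles.
        clean_run = 0 if flagged else clean_run + 1
        if clean_run >= window:
            return cycle
    return None


# Example: syndromes in cycles 1 and 3 delay the measurement to cycle 5.
print(first_accept_cycle([True, False, True, False, False]))
```

The single-qubit measurements of the faulty T measurement would be triggered at the returned cycle; a return value of None corresponds to a patch that has to keep idling.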
The time required to perform a faulty T measurement with pre-selection depends on the rejection rate. In the above example, if, e.g., the probability of failing the pre-selection is 50% every two code cycles and d_m = 5, then the probability of the d_m × d_m patch having failed the pre-selection after 2d_m code cycles is ∼1%, implying that the overall time cost of level-1 distillation protocols increases by ∼1% compared to the numbers presented in Tab. 1. This increase is even lower under the assumption of a rejection rate of ∼4%. However, the numbers in Tab. 1 do not take into account the increased space cost due to the additional d_m × d_m patches that are used to avoid the time-cost increase. If, as in Fig. 22, one uses two d_m × d_m patches for each measurement region, then the space cost for, e.g., a (15-to-1)_{11,5,5} protocol increases by ∼9% compared to the numbers in Tab. 1. Note that, in two-level protocols, the increased space cost only affects the first level of distillation, but not the second, such that, e.g., for a (15-to-1)^6_{11,5,5} × (15-to-1)_{25,11,11} protocol, the space cost increases by less than 2%. Furthermore, the error estimate in the main text does not take into account that, in cases where the d_m × d_m patch is left idling for many code cycles, the error rate of the corresponding P_{π/8} rotations is increased. However, with pre-selection, the error rate of faulty T measurements can be lower than the p_phys assumed in the main text if single-qubit operations are significantly less noisy than two-qubit operations.
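The order of magnitude of the ∼1% figure can be checked with a toy model. The source does not state which model underlies its estimate, so the following rests on assumptions made here: code cycles are treated as independent, a single cycle is syndrome-free with a probability q chosen such that a two-cycle window passes with probability 1 − (rejection rate), and the patch is accepted as soon as any two consecutive cycles are clean. The function name prob_still_waiting is illustrative.

```python
def prob_still_waiting(n_cycles: int, p_reject_window: float, window: int = 2) -> float:
    """Probability that no run of `window` clean cycles occurs within n_cycles.

    Modelling assumptions (not from the source): cycles are independent, and a
    single cycle is clean with probability q chosen so that q**window equals
    1 - p_reject_window.
    """
    q = (1.0 - p_reject_window) ** (1.0 / window)
    # state[k] = probability that the patch is still waiting and the current
    # run of consecutive clean cycles has length k (with k < window).
    state = [1.0] + [0.0] * (window - 1)
    for _ in range(n_cycles):
        new = [0.0] * window
        for k, pk in enumerate(state):
            new[0] += pk * (1.0 - q)   # a reported syndrome resets the run
            if k + 1 < window:
                new[k + 1] += pk * q   # run grows but is still too short
            # Otherwise the run reaches `window`: the patch is accepted and
            # its probability weight leaves the "still waiting" states.
        state = new
    return sum(state)


# Example from the text: 50% rejection per two cycles, d_m = 5, 2*d_m cycles.
d_m = 5
print(prob_still_waiting(2 * d_m, 0.5))
```

Under these assumptions the probability of still waiting after 2d_m = 10 code cycles comes out slightly above 1%, in line with the ∼1% quoted above. A model with non-overlapping two-cycle windows would give a somewhat larger value, so the number should be read as an estimate.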
Replacing faulty T measurements with state injection. After shrinking the d × d patch to a physical qubit, the preceding stabilizer measurement outcomes need to be decoded before a T (or T†) gate and an X measurement can be applied to the physical qubit. If the time required for this decoding operation is long compared to the decoherence time of idling physical qubits, the error rate of faulty T measurements will be significantly increased. In this case, one should use state injection instead of faulty T measurements, executing P_{π/8} rotations via auto-corrected π/8 rotations [28], the same way P_{π/8} rotations are executed in level-2 distillation protocols (see Fig. 13). This has a higher space cost compared to faulty T measurements.
What if faulty T measurements have a higher error rate? If one uses a simple faulty-T-measurement or state-injection protocol without post-selection or pre-selection, one might be worried that an increased T-measurement error rate could adversely affect the performance of the distillation protocols discussed in the main text. Specifically, we can consider the effect on the output error rate p_out if the T-measurement Pauli error rate is 10p_phys instead of p_phys. We find that, for some of the distillation protocols shown in Tab. 1, an increased T-measurement error rate has only a small effect on the performance.
A collection of protocols is shown in Tab. 2. The main effect of an increased faulty-T-measurement error rate is an increase of the lowest achievable output error rate p_out for a given family of distillation protocols. For instance, the lowest achievable p_out for a one-level 15-to-1 protocol increases from 10.37p^3 to 10370p^3. As a consequence, for p_phys = 10^{−4}, a one-level 15-to-1 protocol can no longer produce magic states with p_out ≈ 10^{−11} and, instead, a (15-to-1) × (20-to-4) protocol needs to be used, significantly increasing the space-time cost to produce these states. On the other hand, the (15-to-1)^4_{9,3,3} × (20-to-4)_{15,7,9} protocol produces states with a p_out that is far away from the lowest achievable p_out of a (15-to-1) × (20-to-4) protocol with p_phys = 10^{−4}, and is therefore only mildly affected. Compared to Tab. 1, its output error rate increases from p_out = 2.4 × 10^{−15} to p_out = 6.6 × 10^{−15}. Due to an increased failure probability, there is also a very small increase in space-time cost by 0.8%. Finally, (15-to-1) × (15-to-1) protocols can no longer produce magic states with an output error below p_out ≈ 10^{−22}.
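The shift of the one-level floor can be made concrete with a short calculation. The sketch below uses only the coefficient 10.37 quoted above and assumes that p stands for p_phys and that the floor is cubic in the faulty-T-measurement error rate, which is what the quoted jump from 10.37 to 10370 for a tenfold rate implies. The function name one_level_floor is illustrative.

```python
def one_level_floor(p_phys: float, t_error_factor: float = 1.0) -> float:
    """Lowest achievable output error of a one-level 15-to-1 protocol.

    Uses the coefficient 10.37 quoted in the text. Assumptions: p is
    identified with p_phys, and the floor is cubic in the
    faulty-T-measurement error rate t_error_factor * p_phys.
    """
    return 10.37 * (t_error_factor * p_phys) ** 3


# Compare the floor for the nominal and the tenfold T-measurement error rate.
for p in (1e-4, 1e-3):
    print(p, one_level_floor(p), one_level_floor(p, 10.0))
```

For p_phys = 10^{−4}, the floor moves from about 1.0 × 10^{−11} to about 1.0 × 10^{−8}, which is why a target of p_out ≈ 10^{−11} falls out of reach of a one-level protocol in this scenario.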
For p_phys = 10^{−3}, we find a similar trend. A one-level protocol can no longer be used to produce magic states with p_out ≈ 10^{−8} and a more expensive two-level protocol needs to be used. For p_out = 2.5 × 10^{−11}, a (15-to-1) × (20-to-4) protocol needs to be replaced by a (15-to-1) × (15-to-1) protocol, increasing the space-time cost by 37.5%. Again, we find protocols which are only mildly affected by the increased error rate, such as the (15-to-1)^6_{11,5,5} × (15-to-1)_{25,11,11} protocol, which previously produced magic states with an output error of p_out = 2.7 × 10^{−12}, but now produces magic states with an output error of p_out = 6.4 × 10^{−12} with a space-time cost increase of 3.5% due to the increased failure probability. The lowest achievable output error rate of (15-to-1) × (15-to-1) protocols increases to p_out ≈ 10^{−14}, making the production of magic states with error rates close to this limit very costly. For such states and states with a lower error rate, a three-level protocol might need to be used.