A Game of Surface Codes: Large-Scale Quantum Computing with Lattice Surgery

Given a quantum gate circuit, how does one execute it in a fault-tolerant architecture with as little overhead as possible? In this paper, we discuss strategies for surface-code quantum computing on small, intermediate and large scales. They are strategies for space-time trade-offs, going from slow computations using few qubits to fast computations using many qubits. Our schemes are based on surface-code patches, which not only feature a low space cost compared to other surface-code schemes, but are also conceptually simple, simple enough that they can be described as a tile-based game with a small set of rules. Therefore, no knowledge of quantum error correction is necessary to understand the schemes in this paper, but only the concepts of qubits and measurements. As an example, assuming a physical error rate of $10^{-4}$ and a code cycle time of 1 $\mu$s, a classically intractable 100-qubit quantum computation with a $T$ count of $10^8$ and a $T$ depth of $10^6$ can be executed in 4 hours using 55,000 qubits, in 22 minutes using 120,000 qubits, or in 1 second using 330,000,000 qubits.

The field of quantum computing is fuelled by the promise of fast solutions to classically intractable problems, such as simulating large quantum systems or factoring large numbers. Already ∼100 qubits can be used to solve useful problems that are out of reach for classical computers [1,2]. Despite the exponential speedup, the actual time required to solve these problems is orders of magnitude above the coherence times of any physical qubit. In order to store and manipulate quantum information on large time scales, it is necessary to actively correct errors by combining many physical qubits into logical qubits using a quantum error-correcting code [3-5]. Of particular interest are codes that are compatible with the locality constraints of realistic devices such as superconducting qubits, which are limited to operations that are local in two dimensions. The most prominent such code is the surface code [6,7].
Working with logical qubits introduces additional overhead to the computation. Not only is the space cost drastically increased as physical qubits are replaced by logical qubits, but the time cost also increases due to the restricted set of accessible logical operations. Surface codes, in particular, are limited to a set of 2D-local operations, which means that arbitrary gates in a quantum circuit may require several time steps instead of just one. To keep the cost of surface-code quantum computing low, it is important to find schemes that translate quantum circuits into surface-code layouts with a low space-time overhead. This is also necessary to benchmark how well quantum algorithms perform in a surface-code architecture.
There exist several encoding schemes for surface codes, among others, defect-based [7], twist-based [8] and patch-based [9] encodings. In this work, we focus on the latter. Surface-code patches have a low space overhead compared to other schemes, and offer low-overhead Clifford gates [10,11]. In addition, they are conceptually less difficult to understand, as they do not directly involve braiding of topological defects. Designing computational schemes with surface-code patches only requires the concepts of qubits and measurements. To this end, we describe the operations of surface-code patches as a tile-based game. This is helpful to design protocols and determine their space-time cost.
Surface codes as a game. The game is played on a board partitioned into a number of tiles. An example of a 5 × 2 grid of tiles is shown in Fig. 1. The tiles can be used to host patches, which are representations of qubits. We denote the Pauli operators of each qubit as X, Y and Z. Patches have dashed and solid edges representing Pauli operators. We consider two types of patches: one-qubit and two-qubit patches. One-qubit patches represent one qubit and consist of two dashed and two solid edges. Each of the two dashed (solid) edges represents the qubit's X (Z) operator. While the square patch in Fig. 1a only occupies one tile, a one-qubit patch can also be shaped to, e.g., occupy three tiles (b). A two-qubit patch (c) consists of six edges and represents two qubits. The first qubit's Pauli operators $X_1$ and $Z_1$ are represented by the two top edges, while the second qubit's operators $X_2$ and $Z_2$ are found in the two bottom edges. The remaining two edges represent the operators $Z_1 \cdot Z_2$ and $X_1 \cdot X_2$.
In the following, we specify the operations that can be used to manipulate the qubits represented by patches. Some of these operations take one time step to complete (a cost of 1 time step), whereas others can be performed instantly (a cost of 0 time steps). The goal is to implement quantum algorithms using as few tiles and time steps as possible. There are three types of operations: qubit initialization, qubit measurement and patch deformation.

I. Qubit initialization:
- One-qubit patches can be initialized in the [...]

II. Qubit measurement:
- Single-patch measurements: The qubits represented by patches can be measured in the X or Z basis. For two-qubit patches, the two qubits must be measured simultaneously and in the same basis. This measurement removes the patch from the board, freeing up previously occupied tiles. (Cost: 0 time steps)
- Two-patch measurements: If edges of two different patches are positioned in adjacent tiles, the product of the operators of the two edges can be measured. For example, the product Z ⊗ Z between two neighboring square patches can be measured, as highlighted in step 2 of Fig. 2a by the blue rectangle. If the edge of one patch is adjacent to multiple edges of the other patch, the product of all involved Pauli operators can be measured. For instance, if qubit A's Z edge is adjacent to both qubit B's X edge and Z edge, the operator $Z_A \otimes Y_B$ can be measured (see step 3 of Fig. 2d), since Y = iXZ. (Cost: 1 time step)
- Multi-patch measurements: An arbitrarily shaped ancilla patch can be initialized. The product of any number of operators adjacent to the ancilla patch can be measured. The ancilla patch is discarded after the measurement. (Cost: 1 time step)

III. Patch deformation:
- Edges of a patch can be moved to deform the patch. If the edge is moved onto a free tile to increase the size of the patch, this takes 1 time step to complete. If the edge is moved inside the patch to make the patch smaller, the action can be performed instantly.
- Corners of a patch can be moved along the patch boundary to change its shape, as shown in Fig. 2b. (Cost: 1 time step)

Figure 2: Panels (a) Bell state preparation, (b) moving corners, (c) qubit movement, (d) Y basis measurement.

To illustrate these operations, we go through three short example protocols in Fig. 2a/c/d. The first example (a) is the preparation of a Bell pair. Two square patches are initialized in the $|+\rangle$ state. Next, the operator Z ⊗ Z is measured. Before the measurement, the qubits are in the state $|+\rangle \otimes |+\rangle = (|00\rangle + |01\rangle + |10\rangle + |11\rangle)/2$. If the measurement outcome is +1, the qubits end up in the state $(|00\rangle + |11\rangle)/\sqrt{2}$. For the outcome −1, the state is $(|01\rangle + |10\rangle)/\sqrt{2}$. In both cases, the two qubits are in a maximally entangled Bell state. This protocol takes 1 time step to complete. The second example (c) is the movement of a square patch into a different tile. For this, the square patch is enlarged by patch deformation, which takes 1 time step, and then made smaller again at no time cost. The third example (d) is the measurement of a square patch in the Y basis. For this, the patch is deformed such that the X and Z edge are on the same side of the patch. An ancillary patch is initialized in the $|0\rangle$ state and the operator Z ⊗ Y between the ancilla and the qubit is measured. The ancilla is discarded by measuring it in the Z basis.
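The Bell-pair protocol can be checked with a small state-vector calculation. The following is a minimal sketch rather than part of the framework: the helper `project_zz` and the amplitude ordering |00⟩, |01⟩, |10⟩, |11⟩ are assumptions made for illustration.

```python
import math

def project_zz(state, outcome):
    """Project a two-qubit state onto the Z (x) Z = outcome eigenspace and renormalize.

    `state` lists the amplitudes of |00>, |01>, |10>, |11> (assumed ordering).
    """
    zz_eigenvalues = [1, -1, -1, 1]  # eigenvalue of Z (x) Z on each basis state
    projected = [a if ev == outcome else 0.0
                 for a, ev in zip(state, zz_eigenvalues)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in projected))
    return [a / norm for a in projected]

# |+> (x) |+> has equal amplitude 1/2 on all four basis states.
plus_plus = [0.5, 0.5, 0.5, 0.5]
even = project_zz(plus_plus, +1)  # outcome +1
odd = project_zz(plus_plus, -1)   # outcome -1
```

Under these assumptions, `even` has equal weight on |00⟩ and |11⟩, and `odd` has equal weight on |01⟩ and |10⟩, which matches the two Bell states named in the text.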
Translation to surface codes. Protocols designed within this framework can be straightforwardly translated into surface-code operations. The exact correspondence between our framework and surface-code patches is specified in Appendix A, but it is not crucial to the understanding of this paper. Essentially, patches correspond to surface-code patches with dashed and solid edges as rough and smooth boundaries. Thus, for surface codes with a code distance d, each tile corresponds to $d^2$ physical data qubits. Two-patch and multi-patch measurements correspond to (twist-based) lattice surgery [9,11] and multi-qubit lattice surgery [12], respectively. Since both these measurements require d code cycles, 1 time step corresponds to d code cycles. Qubit initialization has no time cost, since, in case of X and Z eigenstates, it can be done simultaneously with the subsequent lattice surgery [9,13]. For arbitrary states, initialization corresponds to state injection [13,14]. Its time cost does not scale with d. Similarly, single-qubit measurements in the X or Z basis correspond to the simultaneous measurement of all physical data qubits in the corresponding basis and some classical error correction, which does not scale with d either. Patch deformation is code deformation, which requires d code cycles, unless the patch becomes smaller in the process, in which case it corresponds to single-qubit measurements. Note that not all surface-code operations are covered by this framework. An extended set of rules is discussed in Appendix B.
In essence, the framework can be used to estimate the space-time cost of a computation. The leading-order term of the space-time cost (the term that scales with $d^3$) of a protocol that uses s tiles for t time steps is $st \cdot d^3$ in terms of (physical data qubits)·(code cycles). The space cost is $s \cdot d^2$ physical data qubits. Determining the exact time cost requires special care. In some protocols, the subleading contributions due to state injection and classical processing may need to be taken into account. For these protocols, we will show how they can be adapted to prevent such contributions from increasing the time cost beyond $t \cdot d$ code cycles.
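The two cost formulas translate directly into code. The sketch below is only an illustration of the leading-order expressions; the function names and the example protocol (6 tiles, 2 time steps, d = 13) are hypothetical.

```python
def space_cost(tiles, d):
    """Physical data qubits occupied by `tiles` tiles at code distance d (s * d^2)."""
    return tiles * d ** 2

def spacetime_cost(tiles, time_steps, d):
    """Leading-order cost s * t * d^3 in (physical data qubits) x (code cycles)."""
    return tiles * time_steps * d ** 3

# Hypothetical protocol: 6 tiles used for 2 time steps at code distance 13.
qubits = space_cost(6, 13)
volume = spacetime_cost(6, 2, 13)
```

Subleading contributions such as state injection and classical processing are deliberately left out, as in the leading-order estimate described above.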

Overview
Having established the rules of the game and the correspondence of our framework to surface-code operations, our goal is to implement arbitrary quantum computations. In this work, we discuss strategies to tackle the following problem: Given a quantum circuit, how does one execute it as fast as possible on a surface-code-based quantum computer of a certain size? This is an optimization problem that was shown to be NP-hard [15], so the focus is on heuristics rather than a general solution. The content of this paper is outlined in Fig. 3.
The input to our problem is an arbitrary gate circuit corresponding to the computation. We refer to the qubits that this circuit acts on as data qubits. As we review in Sec. 1, the natural universal gate set for surface codes is Clifford+T, where Clifford gates are cheap and T gates are expensive. In fact, Clifford gates can be treated entirely classically, and T gates require the consumption of a magic state $|0\rangle + e^{i\pi/4}|1\rangle$. Only faulty (undistilled) magic states can be prepared in our framework. To generate higher-fidelity magic states for large-scale quantum computation, a lengthy protocol called magic state distillation [16] is used.
It is therefore natural to partition a quantum computer into a block of tiles that is used to distill magic states (a distillation block) and a block of tiles that hosts the data qubits (a data block) and consumes magic states. The speed of a quantum computer is governed by how fast magic states can be distilled, and how fast they can be consumed by the data block.
In Sec. 2, we discuss how to design data blocks. In particular, we show three designs: compact, intermediate and fast blocks. The compact block uses 1.5n + 3 tiles to store n qubits, but takes up to 9 time steps to consume a magic state. Intermediate blocks use 2n + 4 tiles and require up to 5 time steps per magic state. Finally, the fast block uses $2n + \sqrt{8n} + 1$ tiles, but requires only 1 time step to consume a magic state. The compact block is an option for early quantum computers with few qubits, where the generation of a single magic state takes longer than 9 time steps. The fast block has a better space-time overhead, which makes it more favorable on larger scales.
Data blocks need to be combined with distillation blocks for universal quantum computing. In Sec. 3, we discuss designs of distillation blocks. Since magic state distillation is the main operation of a surface-code-based quantum computer, it is important to minimize its space-time cost. We discuss distillation protocols based on error-correcting codes with transversal T gates, such as punctured Reed-Muller codes [16,17] and block codes [18-20]. In comparison to braiding-based implementations of distillation protocols, we reduce the space-time cost by up to 90%.
A data block combined with a distillation block constitutes a quantum computer in which T gates are performed one after the other. At this stage, the quantum computer can be sped up by increasing the number of distillation blocks, effectively decreasing the time it takes to distill a single magic state, as we discuss in Sec. 4. In order to illustrate the resulting space-time trade-off, we consider the example of a 100-qubit computation with $10^8$ T gates, which can already be used to solve classically intractable problems [2]. Assuming an error rate of $p = 10^{-4}$ and a code-cycle time of 1 µs, a compact data block together with a distillation block can finish the computation in 4 hours using 55,000 physical qubits.¹ Adding 10 more distillation blocks increases the qubit count to 120,000 and decreases the computational time to 22 minutes, using 1 time step per T gate.

Figure 3: Overview of the content of this paper. To illustrate the space-time trade-offs discussed in this work, we show the number of physical qubits and the computational time required for a circuit of $10^8$ T gates distributed over $10^6$ T layers. We consider physical error rates of $p = 10^{-4}$ and $p = 10^{-3}$, for which we need code distances d = 13 and d = 27, respectively. We assume that each code cycle takes 1 µs.

For further space-time trade-offs in Sec. 5, we exploit that the T gates of a circuit are arranged in layers of gates that can be executed simultaneously. This enables linear space-time trade-offs down to the execution of one T layer per qubit measurement time, effectively implementing Fowler's time-optimal scheme [21]. If the $10^8$ T gates are distributed over $10^6$ layers, and measurements (and classical processing) can be performed in 1 µs, up to 1500 units of 220,000 qubits can be run in parallel, where each unit is responsible for the execution of one T layer. This way, the computational time can be brought down to 1 second using 330 million qubits. While this is a large number, the units do not necessarily need to be part of the same quantum computer, but can be distributed over up to 1500 quantum computers with 220,000 qubits each, and with the ability to share Bell pairs between neighboring computers.

¹ We will assume that the total number of physical qubits is twice the number of physical data qubits. This is consistent with superconducting qubit platforms, where the use of measurement ancillas doubles the qubit count. If a platform does not require the use of ancilla qubits, the total qubit count is reduced by 50% compared to the numbers reported in this paper.
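Two of the quoted figures can be reproduced with simple arithmetic. The sketch below assumes that each time step lasts exactly d code cycles and ignores every other overhead; the function name `runtime_seconds` is hypothetical, while the inputs ($10^8$ T gates, d = 13 for $p = 10^{-4}$, 1 µs code cycles, 1500 units of 220,000 qubits) are taken from the text.

```python
def runtime_seconds(t_count, steps_per_t_gate, d, cycle_time_s):
    """Run time if every T gate costs `steps_per_t_gate` time steps of d code cycles each."""
    return t_count * steps_per_t_gate * d * cycle_time_s

# One time step per T gate, d = 13, 1 microsecond per code cycle.
seconds = runtime_seconds(10**8, 1, 13, 1e-6)
minutes = seconds / 60  # roughly 22 minutes

# Fully parallelized scheme: 1500 units of 220,000 qubits each.
total_qubits = 1500 * 220_000
```

The first estimate gives about 1300 seconds, consistent with the quoted 22 minutes, and the second reproduces the 330 million qubits of the time-optimal scheme.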
In Sec. 6, we discuss further space-time trade-offs that are beyond the parallelization of Clifford+T circuits. In particular, we discuss the use of Clifford+ϕ circuits, i.e., circuits containing arbitrary-angle rotations beyond T gates. These require the use of additional resources, but can speed up the computation. We also discuss the possibility of hardware-based trade-offs by using higher code distances, but in turn shorter measurements with a decreased measurement fidelity. Ultimately, the speed of a quantum computer is limited by classical processing, which can only be improved upon by faster classical computing.
Finally, we note that while the number of qubits required for useful quantum computing is orders of magnitude above what is currently available, a proof-of-principle two-qubit device demonstrating all necessary operations using undistilled magic states can be built with 48 physical data qubits, see Appendix C.

Clifford+T quantum circuits
Our goal is to implement full quantum algorithms with surface codes. The input to our problem is the algorithm's quantum circuit. The universal gate set Clifford+T is well-suited for surface codes, since it separates easy operations from difficult ones. Often, this set is generated using the Hadamard gate H, phase gate S, controlled-NOT (CNOT) gate, and the T gate. Instead, we choose to write our circuits using Pauli product rotations $P_\varphi$ (see Fig. 5), because it simplifies circuit manipulations. Here, $P_\varphi = \exp(-iP\varphi)$, where P is a Pauli product operator (such as Z, Y ⊗ X, or X ⊗ 𝟙 ⊗ X) and ϕ is an angle. In this sense, $S = Z_{\pi/4}$, $T = Z_{\pi/8}$, and $H = Z_{\pi/4} \cdot X_{\pi/4} \cdot Z_{\pi/4}$. The CNOT gate can also be written in terms of Pauli product rotations as $\mathrm{CNOT} = (Z \otimes X)_{\pi/4} \cdot (\mathbb{1} \otimes X)_{-\pi/4} \cdot (Z \otimes \mathbb{1})_{-\pi/4}$. In fact, we can more generally define $P_1$-controlled-$P_2$ gates as $C(P_1, P_2) = (P_1 \otimes P_2)_{\pi/4} \cdot (\mathbb{1} \otimes P_2)_{-\pi/4} \cdot (P_1 \otimes \mathbb{1})_{-\pi/4}$. The CNOT gate is the specific case C(Z, X).
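Decompositions of this kind can be checked numerically from the convention $P_\varphi = \exp(-iP\varphi)$. The sketch below uses plain Python lists as matrices; all helper names are hypothetical, and the specific products tested for H and CNOT are assumptions to be verified rather than definitions. Equality is tested only up to a global phase, since the rotations differ from the textbook gates by such a phase.

```python
import math

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rotation(p, phi):
    """P_phi = exp(-i P phi) = cos(phi) 1 - i sin(phi) P, valid because P*P = 1."""
    n = len(p)
    return [[math.cos(phi) * (i == j) - 1j * math.sin(phi) * p[i][j]
             for j in range(n)] for i in range(n)]

def equal_up_to_phase(a, b, tol=1e-9):
    """True if a = phase * b for some complex number `phase` of modulus 1."""
    n = len(b)
    # Use the first non-zero entry of b to fix the candidate phase.
    i, j = next((r, c) for r in range(n) for c in range(n) if abs(b[r][c]) > tol)
    phase = a[i][j] / b[i][j]
    if abs(abs(phase) - 1) > tol:
        return False
    return all(abs(a[r][c] - phase * b[r][c]) < tol
               for r in range(n) for c in range(n))

q = math.pi / 4
r2 = 1 / math.sqrt(2)
S = [[1, 0], [0, 1j]]
H = [[r2, r2], [r2, -r2]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control = first qubit

s_ok = equal_up_to_phase(rotation(Z, q), S)
h_ok = equal_up_to_phase(
    matmul(rotation(Z, q), matmul(rotation(X, q), rotation(Z, q))), H)
cnot_ok = equal_up_to_phase(
    matmul(rotation(kron(Z, X), q),
           matmul(rotation(kron(I2, X), -q), rotation(kron(Z, I2), -q))), CNOT)
```

All three flags evaluate to true, which supports reading S, H and CNOT as products of π/4 Pauli product rotations in this convention.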
Getting rid of Clifford gates. Clifford gates are considered to be easy, because, by definition, they map Pauli operators onto other Pauli operators [22]. This can be used to simplify the input circuit. A generic circuit is shown in Fig. 4, consisting of Clifford gates, $Z_{\pi/8}$ rotations and Z measurements. If all Clifford gates are commuted to the end of the circuit, the $Z_{\pi/8}$ rotations become Pauli product rotations. The rules for moving $P_{\pi/4}$ rotations past $P'_\varphi$ gates are shown in Fig. 4a: If P and P′ commute, $P_{\pi/4}$ can simply be moved past $P'_\varphi$. If they anticommute, $P'_\varphi$ turns into $(iPP')_\varphi$ when $P_{\pi/4}$ is moved to the right. Since $C(P_1, P_2)$ gates consist of π/4 rotations, similar rules can be derived as shown in Fig. 4b: If P anticommutes with $P_1$, $P_\varphi$ turns into $(PP_2)_\varphi$ after commutation. If P anticommutes with $P_2$, $P_\varphi$ turns into $(PP_1)_\varphi$. If P anticommutes with both $P_1$ and $P_2$, $P_\varphi$ turns into $(PP_1P_2)_\varphi$.
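The rule for moving a π/4 rotation past another rotation is purely classical bookkeeping on Pauli strings. The following sketch illustrates it under the assumption that Pauli products are written as strings over the letters I, X, Y, Z; the function names and the (sign, string) return format are hypothetical.

```python
# Single-qubit products of distinct Paulis: a * b = phase * c.
_PRODUCT = {
    ("X", "Y"): (1j, "Z"), ("Y", "Z"): (1j, "X"), ("Z", "X"): (1j, "Y"),
    ("Y", "X"): (-1j, "Z"), ("Z", "Y"): (-1j, "X"), ("X", "Z"): (-1j, "Y"),
}

def multiply(p, q):
    """Return (phase, string) such that p * q = phase * string."""
    phase, out = 1, []
    for a, b in zip(p, q):
        if a == "I":
            out.append(b)
        elif b == "I":
            out.append(a)
        elif a == b:
            out.append("I")
        else:
            factor, c = _PRODUCT[(a, b)]
            phase *= factor
            out.append(c)
    return phase, "".join(out)

def commutes(p, q):
    """Two Pauli strings commute iff they differ on an even number of shared sites."""
    return sum(a != "I" and b != "I" and a != b for a, b in zip(p, q)) % 2 == 0

def move_clifford_right(p, p_prime):
    """Axis of the rotation P'_phi after P_{pi/4} has been moved to its right.

    Returns (sign, pauli_string): P' itself if P and P' commute, else i * P * P'.
    """
    if commutes(p, p_prime):
        return 1, p_prime
    phase, product = multiply(p, p_prime)
    sign = 1j * phase  # real (+1 or -1) whenever P and P' anticommute
    return int(round(sign.real)), product
```

For example, `move_clifford_right("Z", "X")` yields the axis −Y, i.e. an $X_\varphi$ rotation becomes $Y_{-\varphi}$ once a $Z_{\pi/4}$ rotation has been moved to its right. A negative sign can equivalently be absorbed into the rotation angle.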
After moving the Clifford gates to the right, the resulting circuit consists of three parts: a set of π/8 rotations, a set of π/4 rotations, and Z measurements. Because Clifford gates map Pauli operators onto other Pauli operators, the Clifford gates can be absorbed by the final measurements, turning Z measurements into Pauli product measurements. The commutation rules of this final step are shown in Fig. 4c and are similar to the commutation of Clifford gates past rotations.

Figure 6: Clifford+T circuits can be written as a number of consecutive π/8 rotations. These gates are grouped into layers of mutually commuting rotations. A simple greedy algorithm can be used to reduce the number of layers, i.e., the T depth.
T count and T depth. Thus, every n-qubit circuit can be written as a number of consecutive π/8 rotations and n final Pauli product measurements, as shown in Fig. 6. We refer to the number of π/8 rotations as the T count. An important part of circuit optimization is the minimization of the T count, for which there exist various approaches [23-26]. The π/8 rotations of a circuit can be grouped into layers. All π/8 rotations that are part of a layer need to mutually commute. The number of π/8 layers of a circuit is strictly speaking not the same quantity as the T depth, but we will still refer to it as the T depth and to π/8 layers as T layers. Note that, in the usual definition, only up to n T gates can be part of a layer, whereas in our case, there is no limit.
When partitioning π/8 rotations into layers, the naive approach often yields more layers than are necessary. For instance, a naive partitioning of the first 6 T gates of Fig. 6 yields 4 layers. A few commutations can bring the number down to 2 layers. There are a number of algorithms for the optimization of the T depth [27-29].
Here, we use a simple greedy algorithm to reduce the number of layers:

repeat
    for each layer i do
        for each rotation j in layer i + 1 do
            if (rotation j commutes with all rotations in layer i) then
                Move rotation j from layer i + 1 to layer i;
            end
        end
    end
until the partitioning no longer changes;

Note that when a reordering puts two equal π/8 rotations into the same layer, they can be combined into a π/4 rotation that is commuted to the end of the circuit, thereby decreasing the T count. As we discuss in Sec. 6, this kind of algorithm can not only be used with π/8 rotations, but, in principle, with arbitrary Pauli product rotations. The reduction of the circuit depth in terms of non-π/8 rotations can be useful when going beyond Clifford+T circuits.
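The greedy procedure can be written out as a short program. The version below is a sketch under a few assumptions: rotations are represented only by their Pauli strings over I, X, Y, Z (angles and signs are ignored), emptied layers are dropped after each pass, and the merging of equal π/8 rotations into π/4 rotations is not implemented. The function names are hypothetical.

```python
def commutes(p, q):
    """Two Pauli strings commute iff they differ on an even number of shared sites."""
    return sum(a != "I" and b != "I" and a != b for a, b in zip(p, q)) % 2 == 0

def greedy_reduce_layers(layers):
    """Repeatedly move rotations into the preceding layer whenever they commute with it."""
    layers = [list(layer) for layer in layers]  # work on a copy
    changed = True
    while changed:
        changed = False
        for i in range(len(layers) - 1):
            for rotation in list(layers[i + 1]):
                if all(commutes(rotation, other) for other in layers[i]):
                    layers[i + 1].remove(rotation)
                    layers[i].append(rotation)
                    changed = True
        layers = [layer for layer in layers if layer]  # drop emptied layers
    return layers

# Naive partitioning: one rotation per layer.
naive = [["ZI"], ["IZ"], ["XI"], ["IX"]]
reduced = greedy_reduce_layers(naive)
```

In this toy example the four single-rotation layers collapse into two layers, one containing the Z-type rotations and one containing the X-type rotations. Each move shifts a rotation to an earlier layer, so the loop terminates.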

Pauli product measurements
When implementing circuits like Fig. 6 with surface codes, one obstacle is that π/8 rotations are not directly part of the set of available operations. Instead, one uses magic states [16] as a resource. These states are π/8-rotated Pauli eigenstates $|m\rangle = |0\rangle + e^{i\pi/4}|1\rangle$. They can be consumed in order to perform $P_{\pi/8}$ rotations. The corresponding circuit [30] is shown in Fig. 7. A $P_{\pi/8}$ rotation corresponds to a P ⊗ Z measurement involving the magic state. If the measurement outcome is P ⊗ Z = −1, then a corrective $P_{\pi/4}$ operation is necessary. Since this is a Clifford gate, it can be simply commuted to the end of the circuit, changing the axes of the subsequent π/8 rotations. Finally, in order to discard the magic state, it is disentangled from the rest of the system by an X measurement. Here, an outcome X = −1 prompts a $P_{\pi/2}$ correction.

In essence, if magic states are available, the only operations required for universal quantum computing are Pauli product measurements. In our framework, such operations can be performed in 1 time step via multi-patch measurements, corresponding to multi-qubit lattice surgery. An example is shown in Fig. 8, where a $(Z \otimes Y \otimes \mathbb{1} \otimes X)_{\pi/8}$ rotation on four qubits stored in four two-tile one-qubit patches is performed. Using the circuit identity in Fig. 7, this is done by measuring $Z_{|q_1\rangle} \otimes Y_{|q_2\rangle} \otimes X_{|q_4\rangle} \otimes Z_{|m\rangle}$ between the four qubits and a magic state.
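The classical side of magic state consumption reduces to a small decision rule on the two measurement outcomes. The sketch below only records which corrective rotations about P are prompted; the function name and the string labels for the angles are hypothetical.

```python
def follow_up_rotations(pz_outcome, x_outcome):
    """Corrective rotations about P after consuming one magic state.

    pz_outcome: result (+1 or -1) of the P (x) Z measurement with the magic state.
    x_outcome:  result (+1 or -1) of the final X measurement of the magic state.
    """
    angles = []
    if pz_outcome == -1:
        angles.append("pi/4")  # Clifford correction, commuted to the end of the circuit
    if x_outcome == -1:
        angles.append("pi/2")  # P_{pi/2} is proportional to the Pauli operator P
    return angles
```

Neither correction has to be executed as a physical gate in this picture: the π/4 rotation changes the axes of later rotations, and the π/2 rotation is a Pauli operator.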
Summary. Clifford+T circuits can be written in terms of π/8 rotations, π/4 rotations and measurements. To convert input circuits into a standard form, π/4 rotations can be commuted to the end of the circuit and absorbed by the final measurements. Thus, any quantum computation can be written as a sequence of π/8 rotations grouped into layers of mutually commuting rotations. The number of rotations is the T count and the number of layers is the T depth. Each rotation can be performed by consuming a magic state via a Pauli product measurement. These measurements can be implemented in our framework in 1 time step.

Data blocks
Since Clifford+T circuits are a sequence of π/8 rotations, each requiring the consumption of a magic state, it is natural to partition a quantum computer into a set of tiles that are used for magic state distillation (distillation blocks) and a set of tiles that host data qubits and consume magic states via Pauli product measurements (data blocks). In this section, we discuss designs for the latter. In principle, the structure shown in Fig. 8 is a data block, where each qubit is stored in a two-tile patch and magic states can be consumed once per time step. However, this sort of design uses 3n tiles to host n data qubits, which is a relatively large space overhead.

Figure 9: A compact block stores n data qubits in 1.5n + 3 tiles. The consumption of a magic state can take up to 9 time steps.

Compact block
The first design that we discuss uses only 1.5n + 3 tiles. This compact block is shown in Fig. 9, where each data qubit is stored in a square patch. This lowers the space cost, but restricts the operators that are accessible by Pauli product measurements, as only the Z operator is free to be measured. Using 3 time steps, patches may also be rotated (see Fig. 11a), such that the X operator becomes accessible instead of the Z operator. The problematic operators are Y operators, which are the reason why the consumption of a magic state can take up to 9 time steps.
The worst-case scenario is a π/8 rotation involving an even number of Y operators, such as the one shown in Fig. 10. One possibility to replace Y operators by X or Z operators is via π/4 rotations, since $Y_{\pi/4} = Z_{\pi/4} X_{\pi/4} Z_{-\pi/4}$. Rotations with an even number of Y's require two π/4 rotations, while an odd number of Y's can be handled by one rotation. Only the left two π/4 rotations in Fig. 10 need to be performed explicitly. The right two rotations can be commuted to the end of the circuit, changing the subsequent π/8 rotations. Similar to a π/8 rotation, a $P_{\pi/4}$ rotation can be executed using a resource state $|Y\rangle = |0\rangle + i|1\rangle$, as shown in Fig. 11b. However, even though this state is a Pauli eigenstate, it cannot be readily prepared in our framework. Instead, we use a $|0\rangle$ state and Y measurements, such that a $P_{\pi/4}$ rotation is performed by a P ⊗ Y measurement between the qubits and the $|0\rangle$ state. Afterwards, the $|0\rangle$ state is measured in X. If the −P ⊗ Y and X measurements in Fig. 11b yield different outcomes, a Pauli correction is necessary.

In Fig. 11, we go through the steps necessary to perform the $(Y \otimes \mathbb{1} \otimes Y \otimes Z \otimes Y \otimes Y)_{\pi/8}$ rotation of Fig. 10. In step 1, we start with a 12-tile data block storing 6 qubits in the blue region. The orange region is not part of the data block, but is part of the adjacent distillation block, i.e., it is the source of the magic states. In steps 2-5, we perform the two π/4 rotations that are necessary to replace the Y operators with X's, i.e., the first two π/4 rotations in the circuit of Fig. 10. In step 6, we first rotate patches in the upper row, and then, in step 7, in the lower row. Finally, in step 8, we measure the Pauli product involving the magic state.
This general procedure can be used for any π/8 rotation. First, up to two π/4 rotations are performed in 2 time steps. Next, patches in the upper and lower row are rotated, which takes 3 time steps per row. Finally, the Pauli product is measured in 1 time step, requiring a total of 9 time steps. While this is very slow compared to Fig. 8, the compact block is a valid choice for small quantum computers where the distillation of a magic state takes longer than 9 time steps.
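The 9-time-step budget can be retraced from the Pauli string of the rotation. The sketch below is a pessimistic estimate, not a scheduler: it assumes that each π/4 rotation costs 1 time step, that both rows always need to be rotated, and that a rotation without Y operators needs no π/4 rotation. The last point and both function names are assumptions made for illustration.

```python
def pi4_rotations_needed(pauli):
    """Number of explicit pi/4 rotations used to remove the Y operators."""
    n_y = pauli.count("Y")
    if n_y == 0:
        return 0  # assumption: nothing to replace if no Y operator is present
    return 1 if n_y % 2 == 1 else 2

def compact_block_worst_case(pauli):
    """Pessimistic time steps per magic state in a compact block."""
    row_rotation = 3  # time steps to rotate the patches of one row
    measurement = 1   # final Pauli product measurement
    return pi4_rotations_needed(pauli) + 2 * row_rotation + measurement

# Rotation axis of the example in the text: Y (x) 1 (x) Y (x) Z (x) Y (x) Y.
steps = compact_block_worst_case("YIYZYY")
```

For the example rotation, which contains four Y operators, the estimate reproduces the worst case of 9 time steps.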

Intermediate block
One possibility to speed up compact blocks is to store all qubits in one row instead of two. This is the intermediate block shown in Fig. 13a, which uses 2n + 4 tiles to store n qubits. By eliminating one row, all patch rotations can be done simultaneously. In addition, one can save 1 time step by moving all patches to the other side, thereby eliminating the need to move patches back to their row after the rotation. An example is shown in Fig. 12. Suppose we have 5 qubits and need to prepare them for a Z ⊗ X ⊗ Z ⊗ Z ⊗ X measurement. The first, third and fourth qubit are moved to the other side, which takes 1 time step. Simultaneously, the second and fifth qubit are rotated, which takes 2 time steps. Therefore, the total number of time steps to consume a magic state is at most 5, where 2 are used for up to two π/4 rotations, 2 for the patch rotations, and 1 for the Pauli product measurement consuming the magic state.

Fast block
The disadvantage of square patches is that only one Pauli operator is adjacent to the data block's ancilla region, i.e., available for Pauli product measurements at any given time. Two-tile one-qubit patches as in Fig. 8, on the other hand, allow for the measurement of any Pauli operator, but use two tiles for each qubit.
In order to have both compact storage and access to all Pauli operators, we use two-qubit patches for our fast blocks in Fig. 13b. These patches use two tiles to represent two qubits (see Fig. 1), where the first qubit's Pauli operators are in the left two edges, and the second qubit's operators are in the right two edges. Therefore, the example in Fig. 13b is a fast block that stores 18 qubits.
Since all Pauli operators are accessible, the Pauli product measurement protocol of Fig. 8 can be used to consume a magic state once per time step. Here, n qubits occupy a square arrangement of tiles with a side length of $2\sqrt{n/2} + 1$, i.e., a total of $2n + \sqrt{8n} + 1$ tiles. Even if $\sqrt{n/2}$ is not an integer, one should keep the block as square-shaped as possible by picking the closest integer as a side length and shortening the last column. While the fast block uses more tiles compared to the compact and intermediate blocks, it has a lower space-time cost, making it more favorable for large quantum computers for which the distillation of a magic state takes less than 5 time steps.
Note that if undistilled magic states are sufficient, then any data block can already be used as a full quantum computer. A proof-of-principle two-qubit device in the spirit of Ref. [31] that constitutes a universal two-qubit quantum computer with undistilled magic states and can demonstrate all the operations that are used in our framework can be realized with six tiles, as shown in Appendix C. This proof-of-principle device uses $(3d - 1) \cdot 2d$ physical data qubits, i.e., 48, 140, or 280 data qubits for distances d = 3, 5 or 7. If ancilla qubits are used for stabilizer measurements, the number of physical qubits roughly doubles, but it is still within reach of near-term devices.
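The quoted qubit counts follow directly from the formula $(3d - 1) \cdot 2d$. The short sketch below evaluates it for the three distances mentioned; the function name is hypothetical.

```python
def proof_of_principle_data_qubits(d):
    """Physical data qubits of the six-tile two-qubit device at code distance d."""
    return (3 * d - 1) * 2 * d

counts = {d: proof_of_principle_data_qubits(d) for d in (3, 5, 7)}
```

The resulting dictionary maps d = 3, 5, 7 to 48, 140 and 280 data qubits. Counting measurement ancillas as well would roughly double these numbers.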
Summary. Data blocks store the data qubits of the computation and consume magic states. Compact blocks use 1.5n + 3 tiles for n qubits and require up to 9 time steps to consume a magic state. Intermediate blocks use 2n + 4 tiles and take up to 5 time steps per magic state. Fast blocks use $2n + \sqrt{8n} + 1$ tiles and take 1 time step per magic state. Data blocks need to be combined with distillation blocks for large-scale quantum computation.
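The three designs can be compared side by side with a small helper. This is a sketch that simply evaluates the formulas of the summary; the function name and the dictionary layout are hypothetical, and no rounding is applied when a formula gives a non-integer tile count.

```python
import math

def data_block_costs(n):
    """Map each design to (tiles, worst-case time steps per magic state) for n data qubits."""
    return {
        "compact": (1.5 * n + 3, 9),
        "intermediate": (2 * n + 4, 5),
        "fast": (2 * n + math.sqrt(8 * n) + 1, 1),
    }

# n = 18 is the size of the fast-block example in the text.
costs = data_block_costs(18)
```

For n = 18 the fast block occupies 49 tiles, i.e. a 7 × 7 square, compared with 30 tiles for the compact block and 40 tiles for the intermediate block.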

Distillation blocks
In this section, we discuss designs of tile blocks that are used for magic state distillation. This is necessary, because with surface codes, the initialization of non-Pauli eigenstates is prone to errors, which means that π/8 rotations performed using these states may lead to errors. In order to decrease the probability of such an error, magic state distillation [16] is used to convert many low-fidelity magic states into fewer higher-fidelity states. This requires only Clifford gates (i.e., Pauli product measurements), so, in principle, any of the data blocks discussed in the previous section can be used for this purpose. However, magic state distillation is repeated extremely often for large-scale quantum computation, so it is worth optimizing these protocols.
Here, we discuss a general procedure that can be applied to any distillation protocol based on an error-correcting code with transversal T gates, such as punctured Reed-Muller codes [16,17] or block codes [18-20]. To show the general structure of such a protocol, we go through the example of 15-to-1 distillation [16], i.e., a protocol that uses 15 faulty magic states to distill a single higher-fidelity state.

15-to-1 distillation
The 15-to-1 protocol is based on a quantum error-correcting code that uses 15 qubits to encode a single logical qubit with code distance 3. The reason why this can be used for magic state distillation is that, for this code, a physical T gate on every physical qubit corresponds to a logical T gate (actually $T^\dagger$) on the encoded qubit, which is called a transversal T gate. The general structure of a distillation circuit based on a code with transversal T gates is shown in Fig. 14 for the example of 15-to-1. It consists of four parts: an encoding circuit, transversal T gates, decoding and measurement.
The circuit begins with 5 qubits initialized in the |+ state and 10 qubits in the |0 state.Qubits 1-4, 5 and 6-15 are associated with the four X stabilizers, the logical X operator, and the ten Z stabilizers of the code.The first five operations are multi-target CNOTs that correspond to the code's encoding circuit.They map the X Pauli operators of qubits 1-4 onto the code's X stabilizers, the X Pauli of qubit 5 onto the logical X operator and the Z operators of qubits 6-15 onto the code's Z stabilizers.Because we start out with +1-eigenstates of X and Z, this circuit prepares the simultaneous stabilizer eigenstate corresponding to the logical |+ L state.
Next, a transversal T gate is applied, transforming the logical state to T L |+ L (actually to T † L |+ L ).Note that the 15 Z π/8 rotations are potentially faulty.Finally, the encoding circuit is reverted, shifting the logical qubit information back into qubit 5, and the information about the X and Z stabilizers into qubits 1-4 and 6-15.If no errors occurred, qubit 5 is now a magic state T |+ (actually T † |+ ).In order to detect whether any of the 15 π/8 rotations were affected by an error, qubits 1-4 and 6-15 are measured in the X and Z basis, respectively, effectively measuring the stabilizers of the code.Since the code distance is 3, up to two errors can be detected, which will yield a -1 measurement outcome on some stabilizers.If any error is detected, all qubits are discarded and the distillation protocol is restarted.This way, if the error probability of each of the 15 T gates is p, the error probability of the output state is reduced to 35p 3 to leading order.In other words, this protocol takes 15 magic states with error probability p as an input, and outputs a single magic state with an error of 35p 3 .
Simplifying the circuit.Using the commutation rules of Fig. 4b, we can commute the first set of multitarget CNOTs to the right.This maps the Z π/8 rotations onto Z-product π/8 rotations.Since controlled-Pauli gates satisfy C(P 1 , P 2 ) = C(P 1 , P 2 ) † , the multitarget CNOTs of the encoding circuit precisely cancel the multi-target CNOTs of the decoding circuit, leaving a circuit of 15 Z-type π/8 rotations in Fig. 14.
Note that qubits 6-15 in this circuit are entirely redundant.They are initialized in a Z eigenstate, are then part of a Z-type rotation, and are finally measured in the Z basis, trivially yielding the outcome +1.Since they serve no purpose, they can simply be removed to yield the five-qubit circuit in Fig. 15, where we have absorbed the single-qubit π/8 rotations into the initial |+ states and rearranged the remaining 11 rotations.
This kind of circuit simplification is equivalent to the space-time trade-offs mentioned in Ref. [17] and can be applied to any protocol that is based on a code with transversal $T$ gates. In general, a code with $m_x$ $X$ stabilizers that uses $n$ qubits to encode $k$ logical qubits yields a circuit of $n - m_x$ $\pi/8$ rotations on $m_x + k$ qubits. Each of the $m_x + k$ qubits is associated either with an $X$ stabilizer or with one of the $k$ logical qubits. For each of the $n$ qubits of the code, the circuit contains one $\pi/8$ rotation with an axis that has a $Z$ on each stabilizer or logical $X$ operator that this qubit is part of. In order to more easily determine the $n - m_x$ rotations, it is useful to write down a matrix with $m_x + k$ rows and $n$ columns that shows the $X$ stabilizers and logical $X$ operators of the code. For 15-to-1, such a matrix is given in Eq. (1). Each of the first four rows describes one of the four $X$ stabilizers of the code, where 0 stands for $\mathbb{1}$ and 1 stands for $X$. For instance, the first row specifies the first $X$ stabilizer of this 15-qubit code as the product of $X$ operators on all qubits whose column contains a 1 in that row. The rows below the horizontal bar (in this case the last row) show the logical $X$ operators of the code. The circuit in Fig. 15 is then obtained by placing a $|+\rangle$ state for each row and a $\pi/8$ rotation for each column, with the axis of rotation determined by the entries in the column: a $\mathbb{1}$ for each 0 and a $Z$ for each 1. Note that, in Fig. 15, the first four rotations (columns) of Eq. (1) are absorbed by the initial states.
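The row-and-column reading rule can be illustrated with a short sketch. The matrix built here is an assumption: it uses all 15 nonzero 4-bit strings as the columns of the four stabilizer rows and one particular representative of the logical $X$ operator, so its column ordering need not coincide with that of Eq. (1). The helper name `rotation_axes` is hypothetical.

```python
from itertools import product

# Assumed column ordering: all nonzero 4-bit strings, weight-1 strings first.
cols = sorted((c for c in product((0, 1), repeat=4) if any(c)), key=sum)

# Four X-stabilizer rows: row r holds bit r of every column label.
stabilizer_rows = [[c[r] for c in cols] for r in range(4)]
# Logical X representative: all-ones operator multiplied by all four stabilizers,
# which leaves a 1 exactly on the columns of even weight.
logical_row = [(1 + sum(c)) % 2 for c in cols]
matrix = stabilizer_rows + [logical_row]

def rotation_axes(matrix):
    """One pi/8 rotation per column: 'Z' where the column has a 1, '1' otherwise."""
    n_cols = len(matrix[0])
    return ["".join("Z" if row[j] else "1" for row in matrix) for j in range(n_cols)]

axes = rotation_axes(matrix)
single_qubit = [a for a in axes if a.count("Z") == 1]  # absorbed into initial states
multi_qubit = [a for a in axes if a.count("Z") > 1]    # remaining rotations
print(len(single_qubit), len(multi_qubit))
```

With this choice, four columns give single-qubit rotations and the remaining eleven give multi-qubit $Z$-type rotations on five qubits, matching the counts quoted for the simplified circuit.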

Triorthogonal codes
The aforementioned circuit translation can be applied to any code with transversal $T$ gates. One particularly versatile and simple scheme to generate such codes is based on triorthogonal matrices [17,18], which we briefly review in this section. The first step is to write down a triorthogonal matrix $G$, such as the one in Eq. (2). Triorthogonality refers to three criteria: i) The number of 1s in each row is a multiple of 8. ii) For each pair of rows, the number of entries where both rows have a 1 is a multiple of 4. iii) For each set of three rows, the number of entries where all three rows have a 1 is a multiple of 2. In other words,

$$\sum_j G_{a,j} \equiv 0 \pmod 8 \;\; \forall a, \qquad \sum_j G_{a,j}G_{b,j} \equiv 0 \pmod 4 \;\; \forall a \neq b, \qquad \sum_j G_{a,j}G_{b,j}G_{c,j} \equiv 0 \pmod 2 \;\; \forall a \neq b \neq c \neq a. \tag{3}$$

A general procedure based on classical Reed-Muller codes to obtain such matrices is described in Ref. [17].
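The three criteria translate directly into a check on row weights and row overlaps. The following is a minimal sketch; the $5 \times 16$ example matrix is an assumption (a Reed-Muller-style generator chosen for illustration, not necessarily the matrix of Eq. (2)), and `is_triorthogonal` is a hypothetical helper name.

```python
from itertools import combinations

def is_triorthogonal(G):
    """Check the three criteria: weights mod 8, pair overlaps mod 4, triple overlaps mod 2."""
    rows = [tuple(r) for r in G]
    weights_ok = all(sum(r) % 8 == 0 for r in rows)
    pairs_ok = all(
        sum(a & b for a, b in zip(r1, r2)) % 4 == 0
        for r1, r2 in combinations(rows, 2)
    )
    triples_ok = all(
        sum(a & b & c for a, b, c in zip(r1, r2, r3)) % 2 == 0
        for r1, r2, r3 in combinations(rows, 3)
    )
    return weights_ok and pairs_ok and triples_ok

# Assumed example matrix (5 rows, 16 columns).
G = [[int(ch) for ch in s] for s in (
    "1111111100000000",
    "1111000011110000",
    "1100110011001100",
    "1010101010101010",
    "1111111111111111",
)]

print(is_triorthogonal(G))                       # True for this example
print(is_triorthogonal([[1, 1, 1, 1, 0, 0, 0, 0]]))  # False: row weight 4 is not a multiple of 8
```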
After obtaining a triorthogonal matrix, such as the one in Eq. (2), the second step is to put it in a row echelon form by Gaussian elimination, which yields the matrix in Eq. (4). The last step is to remove one of the columns that contains a single 1, i.e., one of the first five columns, which is also called puncturing. Puncturing a triorthogonal matrix with $a$ columns and $b$ rows $k$ times yields a code encoding $k$ logical qubits with $m_x = b - k$ and $n = a - k$. The rows of the matrix after puncturing that contain an even number of 1s describe $X$ stabilizers, whereas the rows with an odd number of 1s describe $X$ logical operators. In terms of distillation protocols, a code described by such a matrix can be used for $n$-to-$k$ distillation. Indeed, if we puncture the matrix in Eq. (4) once by removing the first column, we retrieve the 15-to-1 protocol of Eq. (1). We can also puncture it twice by removing the first two columns. This yields a matrix that describes a 14-to-2 protocol. The corresponding circuit can be simply read off from this matrix. It is almost identical to the 15-to-1 protocol of Fig. 15, except that the fourth qubit is initialized in the $|+\rangle$ state and is not measured at the end of the circuit, but instead outputs a second magic state. However, because the code of 14-to-2 has a code distance of 2, the output error probability is higher, namely $7p^2$ [18]. Puncturing the matrix $G$ any further would yield codes with a distance lower than 2, precluding them from detecting errors and improving the quality of magic states. In fact, the minimum number of qubits in triorthogonal codes was shown to be 14 [32].
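The echelon-form and puncturing steps can be sketched in a few lines. This is an illustration under assumptions: the starting matrix is a Reed-Muller-style example rather than the matrix of Eq. (2), the reduction is a plain Gauss-Jordan elimination over GF(2), and the pivot columns it finds need not be the first five columns. All function names are hypothetical.

```python
def rref_gf2(rows):
    """Reduced row echelon form over GF(2); returns (matrix, pivot column indices)."""
    m = [list(r) for r in rows]
    pivots, r = [], 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c]), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c]:
                m[i] = [a ^ b for a, b in zip(m[i], m[r])]  # add pivot row mod 2
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    return m, pivots

def puncture(rows, columns_to_remove):
    """Remove the given columns from every row."""
    drop = set(columns_to_remove)
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]

def code_parameters(punctured):
    """Return (n, k, m_x): odd-weight rows are logical X, even-weight rows are X stabilizers."""
    n = len(punctured[0])
    k = sum(1 for row in punctured if sum(row) % 2 == 1)
    m_x = sum(1 for row in punctured if sum(row) % 2 == 0)
    return n, k, m_x

# Assumed example matrix (5 rows, 16 columns).
G = [[int(ch) for ch in s] for s in (
    "1111111100000000",
    "1111000011110000",
    "1100110011001100",
    "1010101010101010",
    "1111111111111111",
)]

reduced, pivots = rref_gf2(G)
print(code_parameters(puncture(reduced, pivots[:1])))  # parameters of a 15-to-1 code
print(code_parameters(puncture(reduced, pivots[:2])))  # parameters of a 14-to-2 code
```

Removing a pivot column turns exactly one row from even to odd weight, which is why each puncturing step converts one $X$ stabilizer into a logical $X$ operator.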

Surface-code implementation
Having outlined the general structure of distillation protocols, we now discuss their implementation with surface codes. Distillation protocols are particularly simple quantum circuits, since they exclusively consist of $Z$-type $\pi/8$ rotations. Therefore, we can use a construction similar to the compact data block, and still only require 1 time step per rotation. We first discuss the example of 15-to-1 distillation.
Because the distillation circuit is relatively short, it is useful to avoid the Clifford corrections of Fig. 7 that may be required with 50% probability after a magic state is consumed. These corrections slow down the protocol, because they change the final $X$ measurements to Pauli product measurements. Instead, we use a circuit which consumes a magic state and automatically performs the Clifford correction. It is based on the selective $\pi/4$ rotation circuit in Fig. 17a. To perform a $P_{\pi/4}$ rotation according to the circuit in Fig. 11b, a $|0\rangle$ state is initialized and $P \otimes Y$ is measured, which takes 1 time step. However, the $\pi/4$ rotation is only performed if the $|0\rangle$ qubit is measured in $X$ afterwards. If, instead, it is measured in $Z$, the qubit is simply discarded without performing any operation. In other words, the choice of measurement basis determines whether a $P_{\pi/4}$ or a $\mathbb{1}$ operation is performed. This can be used to construct the circuit in Fig. 17b. Here, the first step to perform a $P_{\pi/8}$ gate is to measure $P \otimes Z$ between the qubits and a magic state $|m\rangle$, and $Z \otimes Y$ between $|m\rangle$ and $|0\rangle$. These two measurements commute and can be performed simultaneously. If the outcome of the first measurement is $+1$, no Clifford correction is required and $|0\rangle$ is read out in $Z$. If the outcome is $-1$, $|0\rangle$ is measured in $X$, yielding the required Clifford correction.
This can be used to implement the 15-to-1 protocol of Fig. 15 in 11 time steps using 11 tiles, as shown in Fig. 17c. Four qubits are initialized in $|m\rangle$, and a fifth in $|+\rangle$. A $2 \times 2$ block of tiles to the left is reserved for the $|m\rangle$ and $|0\rangle$ qubits of the auto-corrected $\pi/8$ rotations. Two additional tiles are used for the ancilla of the multi-patch measurement. In step 2, the first $\pi/8$ rotation is performed by measuring $P \otimes Z$ and $Z \otimes Y$. In step 3, depending on the measurement outcome of step 2, the $|0\rangle$ ancilla is read out in the $X$ or $Z$ basis. This is repeated 11 times, once for each of the 11 rotations in Fig. 15. Finally, in step 23, qubits 1-4 are measured in $X$. If all four outcomes are $+1$, the distillation protocol yields a distilled magic state in tile 5. Since 11 tiles are used for 11 time steps, the space-time cost is $121d^3$ in terms of (physical data qubits)$\cdot$(code cycles) to leading order.
Caveat. Even though our leading-order estimate of the time cost of $11d$ code cycles is correct, the full time cost also contains contributions that do not scale with $d$. The two processes that may require special care in the magic state distillation protocol are state injection and classical processing. Every time step requires the initialization of a magic state and a short classical computation to determine whether the $|0\rangle$ state needs to be measured in $X$ or $Z$. While neither of these processes scales with $d$, they can slow down the distillation protocol, depending on the injection scheme and the control hardware that is used. This slowdown can be avoided by using additional $2 \times 2$ blocks of $|0\rangle$-$|m\rangle$ pairs, as shown in Fig. 18 for one additional block. Here, the left and right block can be used in an alternating fashion, i.e., the left block for rotations 1, 3, 5, ... and the right block for rotations 2, 4, 6, ... While one block is being used for a rotation, the other one can be used to prepare a new magic state and to process the measurement outcomes of the previous rotation.
General space-time cost. The scheme of Fig. 17 can be used to implement any protocol based on a triorthogonal code. For an $n$-qubit code with $k$ logical qubits and $m_x$ $X$ stabilizers, the protocol uses $1.5(m_x + k) + 4$ tiles for $n - m_x$ time steps. In this time, it distills $k$ magic states with a success probability of $\sim(1 - p)^n$, since any error will result in failure. Therefore, such a protocol distills $k$ magic states on average every $(n - m_x)/(1 - p)^n$ time steps. Thus, the space-time cost per magic state is

$$\frac{\left(1.5(m_x + k) + 4\right)\cdot(n - m_x)}{k\cdot(1 - p)^n}\cdot d^3. \tag{8}$$

In order to minimize the space-time cost for distillation in our framework, one should pick a distillation protocol that minimizes this quantity for a given input and target error rate.
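The cost per magic state follows from the quantities in this paragraph: tiles times time steps, divided by the number of outputs and the success probability. The sketch below assumes exactly that combination; the function name and the optional `tiles` override are illustrative, the latter because concrete layouts (such as the 11-tile 15-to-1 block) may differ slightly from the leading-order tile formula.

```python
def distillation_cost_per_state(n, k, m_x, p, tiles=None):
    """Space-time cost per output state in units of d^3 (hypothetical helper).

    n, k, m_x: code size, logical qubits, number of X stabilizers.
    p: input error probability.
    tiles: optional tile count of a concrete layout; defaults to 1.5(m_x + k) + 4.
    """
    if tiles is None:
        tiles = 1.5 * (m_x + k) + 4
    time_steps = n - m_x
    success_probability = (1 - p) ** n
    return tiles * time_steps / (k * success_probability)

# 15-to-1 with the 11-tile layout and negligible failure probability: 121 d^3.
print(distillation_cost_per_state(15, 1, 4, 0.0, tiles=11))
# The same block at p = 1e-4, including the cost of failed runs.
print(distillation_cost_per_state(15, 1, 4, 1e-4, tiles=11))
```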
20-to-4 protocol. The previous estimate is only valid for triorthogonal codes. With semi-triorthogonal codes, additional time steps may be necessary to perform the final measurements. The example of the 20-to-4 protocol is shown in Fig. 19. Because the three qubits that are measured are discarded at the end of the protocol in Fig. 16, the three Pauli products at the end of the circuit can be measured in 2 time steps instead of 3, as shown in Fig. 20. For this, the operator $Z \otimes Z \otimes X$ is measured in the first step. In the second step, $X \otimes \mathbb{1} \otimes \mathbb{1}$ and $\mathbb{1} \otimes Z \otimes Z$ are measured simultaneously. Their product yields one of the required measurements. Finally, qubits 2 and 3 are measured in $X$ at no time cost. Multiplying these two results with the $X$ measurement in the previous step yields the final $X \otimes X \otimes X$ measurement. Thus, the 20-to-4 protocol requires 17 time steps for the $\pi/8$ rotations and 2 time steps for the final measurements. With a space cost of 14 tiles, the total space-time cost is $266d^3$.

Benchmarking
We can use the previously described 15-to-1 and 20-to-4 schemes to benchmark our implementations. In Ref. [35], these schemes were implemented with lattice surgery and their cost compared to implementations based on braiding of hole defects. In addition, the 7-to-1 scheme was considered, which is a scheme to distill $|Y\rangle$ states. The distillation of these states is not necessary in our framework, but for benchmarking purposes we show the 7-to-1 protocol in Appendix D. It can be implemented using 7 tiles for 4 time steps, i.e., with a space-time cost of $28d^3$.
We summarize the leading-order space-time costs of the three protocols in Table 1. The comparison shows drastic reductions in space-time cost compared to schemes based on braiding of hole defects and compared to other approaches to optimizing lattice surgery. Compared to the braiding-based scheme, the space-time cost of 7-to-1, 15-to-1 and 20-to-4 is reduced by 60%, 84% and 89%, respectively.

Higher-fidelity protocols
So far, we have only explicitly discussed protocols that reduce the input error to $\sim p^2$ or $\sim p^3$. There are two strategies to obtain protocols with a higher output fidelity: concatenation and higher-distance codes.
Concatenation. In the 15-to-1 protocol, we use 15 undistilled magic states to obtain a distilled magic state with an error rate of $35p^3$. If we perform the same protocol, but use 15 distilled magic states from previous 15-to-1 protocols as inputs, the output state will have an error rate of $35(35p^3)^3 = 1500625p^9$. This corresponds to a 225-to-1 protocol obtained from the concatenation of two 15-to-1 protocols. It is also possible to concatenate protocols that are not identical. Strategies to combine high-yield and low-yield protocols are discussed in Ref. [18].
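The effect of concatenation on the output error can be checked numerically. This is a small sketch assuming only the leading-order $35p^3$ rule for each 15-to-1 level; the function names are illustrative.

```python
import math

def fifteen_to_one_error(p: float) -> float:
    """Leading-order output error of one 15-to-1 round."""
    return 35 * p**3

def concatenated_error(p: float, levels: int) -> float:
    """Output error after feeding each level's outputs into the next 15-to-1 level."""
    for _ in range(levels):
        p = fifteen_to_one_error(p)
    return p

def raw_inputs(levels: int) -> int:
    """Number of undistilled magic states consumed per output state."""
    return 15**levels

print(concatenated_error(1e-3, 1))   # one level at p = 1e-3
print(concatenated_error(1e-3, 2))   # two levels, i.e. 225-to-1
print(raw_inputs(2))                 # 225
```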
In Fig. 21, we show an unoptimized block that can be used for 225-to-1 distillation. It consists of 11 15-to-1 blocks that are used for the first level of distillation. Since each of these 11 blocks takes 11 time steps to finish, they can be operated such that exactly one of these blocks finishes in every time step. Therefore, in every time step, one first-level magic state can be used for second-level distillation by moving it into one of the two level-2 $|m\rangle$-$|0\rangle$ blocks via the blue ancilla. The qubits that are used for the second level are highlighted in red. Note that since, for the second level, the single-qubit $\pi/8$ rotations require distilled magic states, the 15-to-1 protocol of Fig. 15 requires 15 rotations instead of just 11. Therefore, the entire protocol finishes in 15 time steps using 176 tiles with a total space-time cost of $2640d^3$.
Higher-distance codes. Alternatively, we can use a code that produces higher-fidelity states. In Ref. [17], several protocols based on punctured Reed-Muller codes are discussed. One of these protocols is a 116-to-12 protocol based on a code with $n = 116$, $k = 12$ and $m_x = 17$. It yields 12 magic states which each have an error rate of $41.25p^4$. According to Eq. (8), this protocol can be implemented using 44 tiles for 99 time steps with a space-time cost of $363d^3$ per output state and a success probability of $(1 - p)^{116}$. For protocols with a high space cost such as 116-to-12, the space-time cost can be slightly reduced by introducing additional ancilla space, such that two operations can be performed simultaneously. One possible configuration is shown in Fig. 22. This increases the space cost to 81 tiles, but reduces the time cost to 50 time steps, with a total space-time cost of $337.5d^3$ per output state.
Output-to-input ratio is not everything. A popular figure of merit when comparing $n$-to-$k$ distillation protocols is the ratio $k/n$. One of the protocols in Ref. [17] is a 912-to-112 protocol with $n = 912$, $k = 112$ and $m_x = 64$, which yields 112 output states, each with an error rate of $10.63p^6$. While the output fidelity is not as high as for 225-to-1, the output-to-input ratio is much higher. For $p = 10^{-3}$, the output error of 225-to-1 is $\sim 1.5 \times 10^{-21}$, while it is only $\sim 10^{-17}$ for 912-to-112. Therefore, if output-to-input ratio were a good figure of merit, we would expect the 912-to-112 protocol to be considerably less costly compared to 225-to-1. If we use an implementation in the spirit of Fig. 22, the space cost is roughly $2.5(m_x + k)$ tiles and the protocol takes $(n - m_x)/2$ time steps. Thus, 912-to-112 uses 440 tiles for 424 time steps. This would put the space-time cost per state at $1665d^3$, which is indeed lower than that of 225-to-1. However, the success probability of 912-to-112 for $p = 10^{-3}$ is only at $\sim$40%, which more than doubles the actual space-time cost. On the other hand, the space-time cost of 225-to-1 is barely affected by the success probability, as each of the level-1 15-to-1 blocks finishes with 98.5% success probability. This means that, with 1.5% probability, a time step of 225-to-1 is skipped, since the necessary level-1 state is missing. This only increases the space-time cost from $2640d^3$ to $2680d^3$, indicating that the output-to-input ratio is not a good figure of merit in our framework.
Summary. The class of magic state distillation protocols that are based on an $n$-qubit error-correcting code with $m_x$ $X$ stabilizers and $k$ logical qubits can be implemented using $1.5(m_x + k) + 4$ tiles and $n - m_x$ time steps. Such protocols output $k$ magic states with a success probability of $(1 - p)^n$. Therefore, if the input fidelity and desired output fidelity are known, the distillation protocol should minimize the cost function given in Eq. (8).
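The comparison in this paragraph can be reproduced with a few lines of arithmetic. The sketch assumes the tile counts and time steps quoted above and models failure simply by dividing the cost by the success probability; `per_state_cost` is a hypothetical helper.

```python
def per_state_cost(tiles, time_steps, outputs, success_probability=1.0):
    """Space-time cost per output state in units of d^3, including failed runs."""
    return tiles * time_steps / (outputs * success_probability)

p = 1e-3

# 912-to-112 in the spirit of Fig. 22: 2.5(m_x + k) tiles, (n - m_x)/2 time steps.
tiles_912 = 2.5 * (64 + 112)
steps_912 = (912 - 64) / 2
ideal_912 = per_state_cost(tiles_912, steps_912, 112)
actual_912 = per_state_cost(tiles_912, steps_912, 112, (1 - p) ** 912)

# 225-to-1: 176 tiles for 15 time steps; a step is skipped when a level-1 block fails.
ideal_225 = per_state_cost(176, 15, 1)
actual_225 = per_state_cost(176, 15, 1, (1 - p) ** 15)

print(round(ideal_912), round(actual_912))
print(round(ideal_225), round(actual_225))
```

Under these assumptions, 912-to-112 looks cheaper before the success probability is included and more expensive afterwards, which is the point made in the text.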

Trade-offs limited by T count
Having discussed data blocks and distillation blocks in the previous two sections, we are now ready to piece them together into a full quantum computer. In order to illustrate the steps that are necessary to calculate the space and time cost of a computation and to trade off space against time, we consider an example computation with a $T$ count of $10^8$ and a $T$ depth of $10^6$. We consider error rates of $p = 10^{-3}$ and $p = 10^{-4}$. This error rate is assumed to be the physical error rate per code cycle of every physical qubit, as well as the error rate of undistilled magic states. To calculate concrete numbers, we assume that the quantum computer can perform a code cycle every 1 $\mu$s. We want to perform the $10^8$-$T$-gate computation in a way that the probability of any one of the $T$ gates being affected by an error stays below 1%. In addition, we require that the probability of an error affecting any of the logical qubits encoded in surface-code patches stays below 1%. This results in a 2% chance that the quantum computation will yield a wrong result. In order to exponentially increase the precision of the computation, it can be repeated multiple times or run in parallel on multiple quantum computers.

Step 1: Determine distillation protocol
The first step is to determine which distillation protocol is sufficient for the computation. In order to stay below 1% error probability with $10^8$ $T$ gates, each magic state needs to have an error rate below $10^{-10}$. For $p = 10^{-4}$, the 15-to-1 protocol is sufficient, since it yields an output error rate of $35p^3 = 3.5 \cdot 10^{-11}$. For $p = 10^{-3}$, 15-to-1 is not enough. On the other hand, two levels of 15-to-1, i.e., 225-to-1, yield magic states with an error rate of $1.5 \cdot 10^{-21}$, which is many orders of magnitude better than what is required. A less costly protocol is 116-to-12, which yields output states with an error rate of $41.25p^4 = 4.125 \cdot 10^{-11}$, which suffices for our purposes.
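This selection step amounts to comparing each protocol's output error with the per-gate error budget. The following sketch assumes the three leading-order error formulas quoted in the text; the dictionary and function names are illustrative.

```python
# Leading-order output error of each protocol as a function of the input error p.
OUTPUT_ERROR = {
    "15-to-1": lambda p: 35 * p**3,
    "116-to-12": lambda p: 41.25 * p**4,
    "225-to-1": lambda p: 35 * (35 * p**3) ** 3,
}

def sufficient_protocols(p, t_count, total_error_budget=0.01):
    """Protocols whose output error stays below the budget per T gate."""
    target = total_error_budget / t_count
    return [name for name, error in OUTPUT_ERROR.items() if error(p) < target]

print(sufficient_protocols(1e-4, 1e8))  # 15-to-1 already suffices
print(sufficient_protocols(1e-3, 1e8))  # 15-to-1 drops out
```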

Step 2: Construct a minimal setup
In order to determine the necessary code distance, we first construct a minimal setup, i.e., a configuration of tiles that can be used for the computation and uses as little space as possible. The reason why this is useful to determine the code distance is that the initial space-time trade-offs that we discuss significantly improve the overall space-time cost. Therefore, the minimal setup can be used to comfortably upper-bound the required code distance. For $p = 10^{-4}$, a minimal setup consists of a compact data block and a 15-to-1 distillation block, see Fig. 23a. The compact block stores 100 qubits in 153 tiles and requires up to 9 time steps to consume a magic state. The 15-to-1 distillation block uses 11 tiles and outputs a magic state every 11 time steps with 99.9% success. To ensure that the tile of the distillation block that is occupied by qubit 5 is not blocked during the first time step of the distillation protocol, the first $\pi/8$ rotation of the protocol should be chosen such that it does not involve qubit 5, e.g., the fourth rotation of Fig. 15. In total, this minimal setup uses 164 tiles and performs a $T$ gate every 11 time steps, i.e., finishes the computation in $11 \cdot 10^8$ time steps.
For $p = 10^{-3}$, a minimal setup consists of a compact data block and a 116-to-12 distillation block, as shown in Fig. 23b. For the minimal setup, we do not use the larger and faster distillation block shown in Fig. 22, but instead a block in the spirit of the 15-to-1 block. This 116-to-12 distillation block uses 44 tiles and distills 12 magic states in 99 time steps with 89% success probability, i.e., on average one state every 9.27 time steps. Because this distillation protocol outputs magic states in bursts, i.e., 12 at the same time, these states need to be stored before being consumed. Therefore, we introduce additional storage tiles (green tiles in Fig. 23b). Here, we choose the 12 output states to be qubits 6, 8, 10, ..., 26 and 27. In the last step of the protocol these states are moved into the green space, where they are consumed by the data block one after the other. This minimal setup uses 153 tiles for the data block, 44 tiles for the distillation block and 13 tiles for storage. In total, it uses 210 tiles and finishes the computation in $9.27 \cdot 10^8$ time steps.
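The average distillation time per magic state used for both minimal setups can be computed as follows. This is a sketch assuming a success probability of $(1 - p)^n$ per run; the function name is hypothetical, and small differences to the quoted 9.27 come from rounding the success probability to 89%.

```python
def average_steps_per_state(time_steps, outputs, n_inputs, p):
    """Average time steps per distilled state, counting failed runs."""
    success_probability = (1 - p) ** n_inputs
    return time_steps / (outputs * success_probability)

# 15-to-1 at p = 1e-4: one state per 11-step run, almost always successful.
print(average_steps_per_state(11, 1, 15, 1e-4))
# 116-to-12 at p = 1e-3: twelve states per 99-step run.
print(average_steps_per_state(99, 12, 116, 1e-3))

# Tile counts of the two minimal setups (data + distillation + storage).
print(153 + 11, 153 + 44 + 13)
```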

Step 3: Determine code distance
Since each tile corresponds to $d \times d$ physical data qubits and each time step corresponds to $d$ code cycles, 164 encoded logical qubits need to survive for $(11 \cdot 10^8)\,d$ code cycles for the minimal setup with $p = 10^{-4}$. The probability of a single logical error on any of these 164 qubits needs to stay below 1% at the end of the computation. The logical error rate per logical qubit per code cycle can be approximated [12] as

$$p_L(p, d) = 0.1\,(100p)^{(d+1)/2}$$

for circuit-level noise. Therefore, the condition to determine the required code distance is

$$164 \cdot 11 \cdot 10^8 \cdot d \cdot p_L(10^{-4}, d) < 0.01.$$

For distance $d = 11$, the final error probability is at 19.8%, which is too high. Distance $d = 13$ is sufficient, with a final error probability of 0.2%. The number of physical qubits used in the minimal setup can be calculated as the number of tiles multiplied by $2d^2$, which counts the measurement qubits in addition to the $d^2$ data qubits of each tile. Thus, the minimal setup uses $164 \cdot 2 \cdot 13^2 \approx 55{,}400$ physical qubits and finishes the computation in $13 \cdot 11 \cdot 10^8$ code cycles, which amounts to roughly 4 hours.
For $p = 10^{-3}$, the condition changes to

$$210 \cdot 9.27 \cdot 10^8 \cdot d \cdot p_L(10^{-3}, d) < 0.01,$$

which is satisfied for $d = 27$ with a final error probability of 0.5%. The final error probability for $d = 25$ is at 4.9%. Thus, the minimal setup uses $210 \cdot 2 \cdot 27^2 \approx 306{,}000$ physical qubits and finishes the computation in $27 \cdot 9.27 \cdot 10^8$ code cycles, which amounts to roughly 7 hours. Note that, in principle, a success probability of less than 50% would be sufficient to reach arbitrary precisions by repeating computations or running them in parallel. This means that the code distances that we consider may be higher than what is necessary.
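The search for the code distance can be written as a short loop. The sketch assumes the approximation $p_L(p, d) \approx 0.1\,(100p)^{(d+1)/2}$ for the logical error rate per qubit per code cycle, which reproduces the percentages quoted in this step; the function names and the choice to scan odd distances starting at 3 are illustrative.

```python
def logical_error_rate(p, d):
    """Assumed logical error rate per logical qubit per code cycle."""
    return 0.1 * (100 * p) ** ((d + 1) / 2)

def final_error(tiles, time_steps, p, d):
    """Probability of a logical error on any tile over time_steps * d code cycles."""
    return tiles * time_steps * d * logical_error_rate(p, d)

def required_distance(tiles, time_steps, p, budget=0.01):
    """Smallest odd distance that keeps the final error below the budget."""
    d = 3
    while final_error(tiles, time_steps, p, d) >= budget:
        d += 2
    return d

print(required_distance(164, 11e8, 1e-4))    # minimal setup for p = 1e-4
print(required_distance(210, 9.27e8, 1e-3))  # minimal setup for p = 1e-3
print(final_error(164, 11e8, 1e-4, 11))      # roughly 0.198 at d = 11
```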

Step 4: Add distillation blocks
Only a small fraction of the tiles of the minimal setup is used for magic state distillation, i.e., 6.7% for $p = 10^{-4}$ and 21% for $p = 10^{-3}$. On the other hand, adding one additional distillation block doubles the rate of magic state production, potentially doubling the speed of computation. Therefore, in order to speed up the computation and decrease the space-time cost, we add additional distillation blocks to our setup. For $p = 10^{-4}$, adding one more distillation block reduces the time that it takes to distill a magic state to 5.5 time steps per state. However, the compact block can only consume magic states at 9 time steps per state. In order to avoid this bottleneck, we can use the intermediate data block instead, which occupies 204 tiles, but consumes one magic state every 5 time steps. With 22 tiles for distillation (see Fig. 24), this setup uses 226 tiles and finishes the computation after $5.5 \cdot 10^8$ time steps. This increases the number of qubits to 76,400, but reduces the computational time to 2 hours.
For $p = 10^{-3}$, the addition of a distillation block reduces the distillation time to 4.64 time steps. At this point, one should switch to the more efficient 116-to-12 block of Fig. 22, which uses 81 tiles and distills a magic state on average every 4.68 time steps. The intermediate data block cannot keep up with this distillation rate, but we can still use it to consume one magic state every 5 time steps instead of 4.68. Such a configuration uses 228 data tiles, 81 distillation tiles and 13 storage tiles, i.e., a total of 322 tiles corresponding to approximately 469,000 physical qubits. The computational time reduces to $5 \cdot 10^8$ time steps, i.e., 3.75 hours. Note that in Fig. 24b, the 12 output states of the 116-to-12 protocol should be chosen as 1, 3, 5, ..., 25. They can be moved into the green storage space in the last step of the protocol, since the space denoted as ancilla 2 in Fig. 22 is not being used in the last step.
Trade-offs down to 1 time step per $T$ gate. Adding additional distillation blocks can reduce the time per $T$ gate down to 1 time step. For $p = 10^{-4}$, 11 distillation blocks produce 1 magic state every time step. To consume these magic states fast enough, we need to use a fast data block. This fast block uses 231 tiles and the 11 distillation blocks together with their storage tiles use $11 \cdot 12 = 132$ tiles, as shown in Fig. 25a. With a total of 363 tiles, this setup uses 123,000 qubits and finishes the computation in $10^8$ time steps, i.e., in 21 minutes and 40 seconds.
For $p = 10^{-3}$, parallelizing 5 distillation blocks produces a magic state every 0.936 time steps. This is faster than the fast block can consume the states, but allows for the execution of a $T$ gate every time step. With 231 tiles for the fast block, 405 distillation tiles and 60 storage tiles, the total space cost is 696 tiles. The setup shown in Fig. 25b contains four unused tiles to make sure that all storage lines are connected to the data block. Storage lines need to be connected to the ancilla space of the data block either directly, via other storage lines or via unused tiles. In any case, this corresponds to roughly 1,020,000 physical qubits. The computation finishes after 45 minutes.
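The qubit counts and run times quoted for the various setups all follow from the tile count, the number of time steps and the code distance. The sketch below assumes $2d^2$ physical qubits per tile and $d$ code cycles per time step, with a 1 $\mu$s code cycle; the function names are illustrative.

```python
def physical_qubits(tiles: int, d: int) -> int:
    """Physical qubits of a layout, assuming 2 d^2 qubits per tile."""
    return 2 * tiles * d**2

def runtime_seconds(time_steps: float, d: int, cycle_time_us: float = 1.0) -> float:
    """Wall-clock time, assuming d code cycles per time step."""
    return time_steps * d * cycle_time_us * 1e-6

# Fast setup for p = 1e-4: 363 tiles, d = 13, 1e8 time steps.
print(physical_qubits(363, 13), runtime_seconds(1e8, 13) / 60)
# Fast setup for p = 1e-3: 696 tiles plus 4 unused tiles, d = 27, 1e8 time steps.
print(physical_qubits(700, 27), runtime_seconds(1e8, 27) / 60)
# Intermediate setup for p = 1e-4: 226 tiles, 5.5e8 time steps.
print(physical_qubits(226, 13), runtime_seconds(5.5e8, 13) / 3600)
```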
Avoiding the classical overhead. Every consumption of a magic state corresponds to a Pauli product measurement, the outcome of which determines whether a Clifford correction is required. This correction is commuted past the subsequent rotations, potentially changing the axis of rotation. Therefore, the computation cannot continue before the measurement outcome is determined. This involves a small classical computation to process the physical measurements (i.e., decoding and feed-forward), which could slow down the quantum computation. In order to avoid this, the magic state consumption can be performed using the auto-corrected $\pi/8$ rotations of Fig. 17b. Here, the classical computation merely determines whether the ancilla qubit, which we refer to as the correction qubit $|c\rangle$, is measured in the $X$ or $Z$ basis. While this classical computation is running, the magic state for the subsequent $\pi/8$ rotation can be consumed, as the auto-corrected rotation involves no Clifford correction. This means that distillation blocks should output $|m\rangle$-$|c\rangle$ pairs, for which we construct modified distillation blocks in the following section. If the classical computation is, on average, faster than 1 time step (i.e., $d$ code cycles), then classical processing does not slow down the quantum computation in the $T$-count-limited schemes.
Summary. Data blocks combined with distillation blocks can be used for large-scale quantum computing. The first step is to determine a sufficiently high-fidelity distillation protocol. Next, one constructs a minimal setup from a compact data block and a single distillation block to upper-bound the required code distance. Finally, one can trade off space against time by using fast data blocks and adding more distillation blocks. This can reduce the time per $T$ gate down to 1 time step. In our example, the trade-off also reduces the space-time cost compared to the minimal setup by a factor of 5 for $p = 10^{-4}$ and by a factor of 2.8 for $p = 10^{-3}$. In order to fully exploit the space-time trade-offs discussed in this section, the input circuit should be optimized for $T$ count.

Trade-offs limited by T depth
In the previous section, we parallelized distillation blocks to finish computations in a time proportional to the $T$ count. In this section, we combine the previous constructions of data and distillation blocks into what we refer to as units. By parallelizing units, we exploit the fact that, in our example, the $10^8$ $T$ gates are arranged in $10^6$ layers of 100 $T$ gates to finish the computation in a time proportional to the $T$ depth. We first slightly increase the space-time cost compared to the previous section, in order to speed up the computation down to one measurement per $T$ layer. In this sense, we implement Fowler's time-optimal scheme [21].

T layer parallelization
The main concept used to parallelize $T$ layers is quantum teleportation. The teleportation circuit is shown in Fig. 26a. It starts with the generation of a Bell pair $(|00\rangle + |11\rangle)/\sqrt{2}$ by the $Z \otimes Z$ measurement of $|+\rangle \otimes |+\rangle$. An arbitrary gate $U$ is performed on the second half of the Bell pair. Next, a qubit $|\psi\rangle$ and the first half of the Bell pair are measured in the Bell basis, i.e., in $X \otimes X$ and $Z \otimes Z$. After the measurement, the first two qubits are discarded and $|\psi\rangle$ is teleported to the third qubit through the gate $U$. This means that the output state is $U|\psi\rangle$ if the teleportation is successful. However, it is only successful if both Bell basis measurements yield a $+1$ outcome. In the other three cases, the teleported state is $UX|\psi\rangle$, $UY|\psi\rangle$ or $UZ|\psi\rangle$. Note that the correction operation to recover the state $U|\psi\rangle$ is not a Pauli operation $P$, but instead $UPU^\dagger$, which, in general, is as difficult to perform as $U$ itself.
If $U$ is a $P_{\pi/8}$ rotation, as in Fig. 26b, the Pauli errors change $P_{\pi/8}$ to $P_{-\pi/8}$ up to a Pauli correction. Since it is only after the Bell basis measurement that we know whether we should have performed a $P_{\pi/8}$ or a $P_{-\pi/8}$ gate, we use post-corrected $\pi/8$ rotations in Fig. 27b, which are similar to the auto-corrected rotations of Fig. 17b. The post-corrected rotation uses a ...
The time-optimal circuit. This can be used to execute multiple $T$ layers simultaneously. If $U$ is a product of mutually commuting $\pi/8$ rotations, i.e., a $T$ layer, the teleportation corrections replace all $\pi/8$ rotations with post-corrected rotations. An example is shown in Fig. 27 for a three-qubit computation of three $T$ layers, where all three $T$ layers are executed simultaneously. The reason why we can only group up $T$ gates that are part of the same layer is that otherwise the Pauli corrections of the post-corrected rotation would not commute with the other rotations. The time-optimal circuit consists of three steps: the preparation of Bell pairs for each $T$ layer, the application of $T$ gates, and a set of final Bell measurements. At this point, the computation is not finished, as we still need to measure the correction qubits of the post-corrected rotations. Because these involve potential Pauli corrections, the correction qubits of the different $T$ layers need to be measured one after the other. Thus, every $T$ layer is executed one after the other, where each execution requires the time that it takes to measure the correction qubits and perform the classical processing to determine the next set of measurements from the Pauli corrections. We refer to this time as $t_m$. In other words, any Clifford+$T$ circuit consisting of $n_L$ $T$ layers can be executed in $n_L \cdot t_m$, independent of the code distance, which is the main feature of the time-optimal scheme [21].
Figure 28: An example of a time-optimal circuit using four units. In this case, each unit consists of six qubits, i.e., it is a three-qubit quantum computation, where three T layers can be executed simultaneously.
The circuit in Fig. 27c naively requires $2n \cdot n_L$ qubits for an $n$-qubit computation, which scales with the length of the computation. Since we only have a finite number of qubits at our disposal, our goal is to implement the circuit in Fig. 28 instead. Here, the qubits form groups of $2n$ qubits. We refer to each of these groups as a unit. Using $n_u$ units, $n_u - 1$ layers of T gates can be performed at the same time. In the circuit, the steps of Bell state preparation (BP), post-corrected T layer execution (T) and Bell basis measurement (BM) are performed repeatedly until the end of the computation. We refer to the block of operations (BP-T-BM) as unit preparation. Every time that unit preparation is finished, all qubits except for the correction qubits (not shown in Fig. 28) and half of the qubits of the last unit are discarded. At this point, the next set of unit preparations begins. Simultaneously, the correction qubits of the recently finished units are measured one after the other, which has a time cost of $(n_u - 1) \cdot t_m$. This means that the number of units can be increased to speed up the computation, until $(n_u - 1) \cdot t_m$ reaches the time $t_u$ that it takes to prepare a unit. At this maximum number of units $n_{\max} = t_u/t_m + 1$, a T layer is executed every $t_m$ and the computation cannot be sped up any further in the Clifford+T framework.
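The dependence of the runtime on the number of units can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that unit preparation ($t_u$) and the sequential measurement of correction qubits ($t_m$ per layer) are the only two bottlenecks.

```python
def max_units(t_u: float, t_m: float) -> float:
    """Number of units at which time optimality is reached: t_u / t_m + 1."""
    return t_u / t_m + 1


def time_per_layer(n_u: int, t_u: float, t_m: float) -> float:
    """Average time per T layer when n_u >= 2 units are used.

    n_u - 1 layers are prepared every t_u, but the correction qubits of
    consecutive layers are measured one after the other, taking t_m each.
    """
    if n_u < 2:
        raise ValueError("at least two units are needed to parallelize T layers")
    return max(t_u / (n_u - 1), t_m)


def total_time(n_layers: int, n_u: int, t_u: float, t_m: float) -> float:
    """Total runtime for n_layers T layers, in the time unit of t_u and t_m."""
    return n_layers * time_per_layer(n_u, t_u, t_m)
```

For instance, with $t_u = 1469\,\mu$s and $t_m = 1\,\mu$s, `max_units(1469.0, 1.0)` gives 1470 units, and `total_time(10**6, 1470, 1469.0, 1.0)` gives $10^6\,\mu$s, i.e., one second for $10^6$ T layers.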
Note that the first and last unit differ from the other units. While all other units need to execute $n_T$ T gates every $t_u$, the first and last unit need to execute $n_T$ T gates only every $2t_u$, where $n_T$ is the number of T gates per layer. Furthermore, the other blocks need to be able to store up to $2n_T$ correction qubits, since, after the end of a unit preparation, $n_T$ correction qubits are stored, and may need to remain stored until the end of the next unit preparation. For the first and last block, on the other hand, the required storage space is halved.
In the following, we will show how to prepare units in our framework. We find that, for our examples, unit preparation takes 113 time steps. If $t_m = 1\,\mu$s, then $n_{\max}$ is ∼1500 for $p = 10^{-4}$ and ∼3000 for $p = 10^{-3}$. Independently of the error rate, the computational time drops to one second.

Units
Units differ from the fast setups in Fig. 25 in three aspects. First, the number of qubits stored in the data block is doubled. Secondly, the distillation protocols are modified to output $|m\rangle$-$|c\rangle$ pairs, instead of just magic states $|m\rangle$. Thirdly, in order to store correction qubits $|c\rangle$, additional space is required. Contrary to magic-state storage tiles, correction-qubit storage tiles do not need to be connected to the data block's ancilla region.
Modified distillation blocks. In order to have distillation blocks output $|m\rangle$-$|c\rangle$ pairs, extra tiles and operations are required. We show the necessary modifications for the example of 15-to-1 and 116-to-12 distillation. A modified 15-to-1 block is shown in Fig. 29a. Apart from the standard 11 distillation tiles (orange) and one magic-state storage tile (green), it also contains 19 correction-qubit storage tiles (purple) and an additional tile (gray) that is used for neither distillation nor storage. The additional steps that modify the protocol are shown in Fig. 29c, which zooms into the highlighted region of Fig. 29a. In step 1 of the shown protocol, the distillation has just finished after 11 time steps. The patch of the output state is deformed in step 2, and an additional qubit $|c\rangle$ is initialized in the $|0\rangle$ state. The $Y \otimes Z$ operator between $|c\rangle$ and $|m\rangle$ is measured in step 3. In step 4, the correction qubit is sent to storage. Finally, in step 5, the magic state $|m\rangle$ is moved to its storage tile. This operation blocks one of the orange tiles that is used for the distillation protocol for 4 time steps. Still, this does not slow down 15-to-1 distillation, since the first 4 rotations of the protocol in Fig. 15 can be chosen such that the output qubit is not needed. Therefore, the modified distillation block outputs one $|m\rangle$-$|c\rangle$ pair every 11 time steps.
For 116-to-12 distillation, a modified block is shown in Fig. 29b. We arrange the qubits such that the 12 output states are found in the positions shown in step 1 of [...] Finally, the patches are deformed back to square patches and all magic states are sent to the green storage, while all correction qubits are sent to the purple storage. This adds 3 time steps to the protocol, meaning that this block outputs 12 $|m\rangle$-$|c\rangle$ pairs every 53 time steps with a success probability of $(1-p)^{116}$. For $p = 10^{-3}$, this corresponds to one output every 4.96 time steps. As mentioned in Sec. 4, modified distillation blocks can also be used with setups in which T gates are performed one after the other, in order to deal with slow classical processing. In this case, only one correction-qubit storage tile per magic state is required.
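The quoted output rate follows from dividing the protocol length by the expected number of outputs per attempt. The short Python sketch below reproduces this arithmetic; the function name is hypothetical, and it assumes that a failed distillation round yields no outputs and is simply repeated.

```python
def time_steps_per_output(steps_per_round: float, outputs_per_round: int,
                          n_inputs: int, p: float) -> float:
    """Average number of time steps per distilled output.

    Assumes a round succeeds with probability (1 - p) ** n_inputs and that a
    failed round produces nothing and is restarted from scratch.
    """
    success_probability = (1.0 - p) ** n_inputs
    expected_outputs = outputs_per_round * success_probability
    return steps_per_round / expected_outputs


# Modified 116-to-12 block: 12 outputs every 53 time steps, p = 1e-3.
rate = time_steps_per_output(53, 12, 116, 1e-3)
print(round(rate, 2))
```

With these inputs, the result rounds to 4.96 time steps per output, in agreement with the value stated above.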
Units. Modified distillation blocks together with fast data blocks are what we refer to as units. The units for our example computation for $p = 10^{-3}$ and $p = 10^{-4}$ are shown in Fig. 31a-b. They both consist of a 200-qubit fast data block, 200 correction-qubit storage tiles, and a number of distillation blocks. Since we will show that unit preparation takes 113 time steps in our case, the number of distillation blocks is chosen such that at least 100 $|m\rangle$-$|c\rangle$ pairs can be distilled in 113 time steps. A full time-optimal quantum computer consists of a row of multiple units, see Fig. 31c. The units shown in the figure contain some unused tiles. This gives the units a rectangular profile, even though this is not necessarily required.
In our case, the units have a footprint of 54 × 21 and 37 × 21 tiles, respectively. Note that the first and last unit of a time-optimal setup are smaller, as they only require 100 correction-qubit storage tiles and half the number of distillation blocks.
Unit preparation. In order to implement the time-optimal circuit of Fig. 28 with the setup of Fig. 31, we show protocols that can be used for the BP-T-BM operations. The data blocks of every unit store $2n$ qubits in $n$ two-qubit patches. We arrange the qubits in such a way that the final Bell measurements (BM) are $Z \otimes Z$ and $X \otimes X$ measurements of the two qubits of every two-qubit patch. This Bell measurement can be done in 2 time steps, as shown in Fig. 30. This arrangement of qubits implies that, for every two-qubit patch, one of the qubits needs to be part of a Bell state preparation (BP) with the neighboring unit to the top, and the other with a neighboring unit to the bottom. For an $n$-qubit quantum computation, this Bell state preparation can be performed in $\sqrt{n}+1$ time steps, as we show in Fig. 32 for the example of $n = 9$. For this, every qubit is initialized in the $|+\rangle$ state. The Bell state preparation requires a series of $Z \otimes Z$ measurements. The protocol in Fig. 32 shows that, since an $n$-qubit computation implies that the number of rows of the data block is $\sqrt{n}$, these measurements require a total of $\sqrt{n}+1$ time steps. In total, the unit preparation of an $n$-qubit computation with $n_T$ T gates per layer requires $\sqrt{n}+1$ time steps for the Bell state preparation, $n_T$ time steps for the execution of the T layer, and 2 time steps for the Bell basis measurement, i.e., a total of $n_T + \sqrt{n} + 3$ time steps. In our example, this amounts to 113 time steps, which corresponds to $t_u = 1469\,\mu$s for $p = 10^{-4}$ and $t_u = 3051\,\mu$s for $p = 10^{-3}$. Thus, time optimality is reached with 1470 units for $p = 10^{-4}$ and 3052 units for $p = 10^{-3}$.
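The unit preparation time can be tallied step by step, as in the Python sketch below. The function names are hypothetical. The sketch assumes that $n$ is a perfect square, that one time step lasts $d$ code cycles of $1\,\mu$s, and that the code distances are $d = 13$ for $p = 10^{-4}$ and $d = 27$ for $p = 10^{-3}$; these distances are not stated in this section but are inferred from $1469/113 = 13$ and $3051/113 = 27$.

```python
import math


def unit_preparation_steps(n: int, n_t: int) -> int:
    """Time steps for BP-T-BM: (sqrt(n) + 1) + n_T + 2."""
    rows = math.isqrt(n)
    if rows * rows != n:
        raise ValueError("this sketch assumes that n is a perfect square")
    bell_preparation = rows + 1   # Z (x) Z measurements between neighboring units
    t_layer = n_t                 # one T gate per time step
    bell_measurement = 2          # Z (x) Z and X (x) X on every two-qubit patch
    return bell_preparation + t_layer + bell_measurement


def unit_preparation_time_us(n: int, n_t: int, d: int, cycle_us: float = 1.0) -> float:
    """Unit preparation time t_u, assuming one time step equals d code cycles."""
    return unit_preparation_steps(n, n_t) * d * cycle_us


for d in (13, 27):  # assumed code distances, see the note above
    t_u = unit_preparation_time_us(100, 100, d)
    print(d, t_u, t_u / 1.0 + 1)  # code distance, t_u in microseconds, units needed
```

Under these assumptions, the sketch reproduces the 113 time steps, the two values of $t_u$, and the unit counts quoted above.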

Figure 32: Bell state preparation (BP) for a 9-qubit computation (18 qubits per unit) in 4 time steps. All two-qubit patches are initialized in the $|+\rangle^{\otimes 2}$ state. Each measurement ancilla is used for a $Z \otimes Z$ measurement between two qubits in different units. For $n$-qubit computations, this requires $\sqrt{n}+1$ time steps.
Space-time trade-offs. Of course, it is also possible to use fewer units than required for time optimality. Using $n_u$ units means that $n_T \cdot (n_u - 1)$ T gates are performed every $t_u$. In our example, $100 \cdot (n_u - 1)$ T gates are performed every 113 time steps. With three units, the computational time drops to 56.5% of the computational time of the fast setup in Fig. 25. With ten units, it drops to 11%. The number of qubits per unit is ∼260,000 for $p = 10^{-4}$ and ∼1,650,000 for $p = 10^{-3}$, so going from the fast setup to parallelized units is, initially, not a favorable space-time trade-off. Since the space-time cost has increased compared to the fast setup, it is also useful to check whether the code distance needs to be readjusted. If we use three units (ignoring that the first and last unit are, in principle, smaller), the space-time cost is still below the space-time cost of the minimal setup in both cases. Adding more units significantly improves the space-time cost. It is also a prescription to linearly speed up the quantum computer down to the time-optimal limit.
[...] between different superconducting chips [39-41].

Distributed quantum computing
For the time-optimal scheme, quantum computers may be arranged in a circle as shown in Fig. 33a, with the ability to share Bell pairs between neighboring quantum computers. This effectively implements the circuit that is schematically drawn in Fig. 33b. Note that in this circuit, there is no first and last unit. Here, every unit performs $n_T$ $\pi/8$ rotations every $t_u$. Therefore, time optimality is reached with one fewer unit, and each unit only needs to store $n_T$ correction qubits instead of $2n_T$. With only 100 correction-qubit storage tiles and ignoring the unused tiles, the qubit count of the units in Fig. 31 drops to ∼220,000 for $p = 10^{-4}$ and ∼1,470,000 for $p = 10^{-3}$, which are the numbers that we report in Fig. 3. Thus, if nearest-neighbor communication between quantum computers is feasible, already fewer than 2 million physical qubits per quantum computer can be used to implement the full time-optimal scheme with 1500-3000 quantum computers.
Entanglement distillation increases the qubit count. Note that it does not slow down the computation, as Bell pairs do not need to be distilled instantly. Entanglement distillation can take up to $t_u$ to distill the $n_T$ Bell pairs required per entanglement distillation block.
Summary. In order to speed up an $n$-qubit quantum computation beyond 1 time step per T gate, we parallelize T layers using units. With an average of $n_T$ T gates per layer, a unit consists of $4n + 4\sqrt{n} + 1$ tiles for the data block, $2n_T$ storage tiles for the correction qubits, and enough distillation blocks to distill $n_T$ $|m\rangle$-$|c\rangle$ pairs in the time it takes to prepare a unit, which is $n_T + \sqrt{n} + 3$ time steps. If the unit preparation time is $t_u$ and the time for single-qubit measurements and classical processing is $t_m$, a time-optimal setup consists of $t_u/t_m + 1$ units, executing one T layer every $t_m$. Using fewer units results in a linear space-time trade-off. With $n_u$ units, $n_T \cdot (n_u - 1)$ T gates are performed in $t_u$. A circular arrangement of units can be used for distributed quantum computing. This also reduces the number of correction-qubit storage tiles to $n_T$ and the number of units in a time-optimal setup to $t_u/t_m$. In order to fully exploit the space-time trade-offs discussed in this section, the input circuit should be optimized for T depth.

Trade-offs beyond Clifford+T
Under the assumption that measurements and feed-forward can be done in $1\,\mu$s, we described how to perform a $10^8$-T-gate computation in just 1 second. A more conservative assumption would be a measurement and feed-forward time of $10\,\mu$s, which increases the computation time to 10 seconds. Although this seems fast, many quantum computations have T counts that are significantly higher than $10^8$. While the T count of Hubbard model simulations [2] is indeed in this range, quantum chemistry simulations can be more demanding. In particular, the simulation of FeMoco [1], a structure that plays an important role in nitrogen fixation, can have a T count of up to $10^{15}$. With a serial execution of one T gate every $10\,\mu$s, the computation takes 317 years to finish. Even if the gates are grouped into 100 T gates per layer, the computation still takes over 3 years.
While Clifford+T is a gate set that is very well suited for surface codes, it is often not the gate set which is natural to the quantum computations in question. In particular, quantum simulation based on Trotterization consists of many small-angle rotations. In the Clifford+T framework, each small-angle rotation is translated into a series of T gates via gate synthesis. Depending on the desired precision, this can require ∼100 T gates for each rotation [42], which must be executed in series. In order to speed up computations beyond their T count or T depth, it is therefore constructive to consider additional resources for gates other than T gates.
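The runtimes quoted above follow from multiplying the number of T layers by the time per layer. The Python sketch below redoes this arithmetic; the helper names are hypothetical, and a year is taken to be 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def runtime_seconds(t_count: float, gates_per_layer: float, layer_time_us: float) -> float:
    """Runtime if one layer of gates_per_layer T gates finishes every layer_time_us."""
    n_layers = t_count / gates_per_layer
    return n_layers * layer_time_us / 1_000_000


def runtime_years(t_count: float, gates_per_layer: float, layer_time_us: float) -> float:
    return runtime_seconds(t_count, gates_per_layer, layer_time_us) / SECONDS_PER_YEAR


print(runtime_seconds(1e8, 100, 10))   # 1e8 T gates, 100 per layer, 10 us per layer
print(runtime_years(1e15, 1, 10))      # 1e15 T gates executed serially
print(runtime_years(1e15, 100, 10))    # 1e15 T gates, 100 per layer
```

The three calls give about 10 seconds, about 317 years, and a little over 3 years, respectively.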

Clifford+ϕ circuits
Instead of requiring an input circuit that consists of Clifford gates and $\pi/8$ rotations, we consider circuits that consist of Clifford gates and arbitrary $\phi$ rotations, which we call Clifford+$\phi$ circuits. Using the procedure in Sec. 1, Clifford gates can be commuted to the end of the circuit, such that we end up with a circuit like the one in Fig. 34. Rotations that mutually commute can be grouped up into layers. The algorithm of Sec. 1 can be used to reduce the number of layers. It can even reduce the number of rotations, since, if two rotations $P_{\phi_1}$ and $P_{\phi_2}$ with the same axis of rotation are moved into the same layer, they can be combined into a single rotation $P_{\phi_1+\phi_2}$. Clifford+$\phi$ circuits are characterized by their rotation count (or $\phi$ count) and rotation depth (or $\phi$ depth), rather than T count and T depth. Each $\phi$ rotation can be performed using a $|\phi\rangle = |0\rangle + e^{i(2\phi)}|1\rangle$ resource state. When this state is consumed to perform a $P_\phi$ rotation, there is a 50% chance that a $P_{-\phi}$ rotation is performed instead. For $\pi/8$ rotations, this is not very problematic, since the correction operation is a $\pi/4$ rotation, which can simply be commuted to the end of the circuit. For general $P_{-\phi}$, the correction is a $P_{2\phi}$ rotation, which requires the use of a $|2\phi\rangle$ state. If this fails, the next correction is a $P_{4\phi}$ rotation requiring a $|4\phi\rangle$ state and so on. Thus, a wide variety of resource states is required to execute arbitrary-angle rotations. These can either be pieced together from ordinary magic states $|m\rangle$, or, more efficiently, distilled using specialized protocols [34,43].
All the schemes discussed in this work can be used with Clifford+$\phi$ circuits by replacing magic state distillation blocks by distillation blocks that produce resource states for arbitrary-angle rotations. In order to consume these states in a systematic way similar to the post-corrected $\pi/8$ rotations in Fig. 27b, we can use the post-corrected version of $\phi$ rotations shown in Fig. 35. First, the $n$ resource states are entangled with the data qubits via a $C(P, Z^{\otimes n})$ gate. Just like magic state consumption, this can be done every time step, since the data qubits are only part of one measurement in the measurement circuit in Fig. 35b. Next, the $|\phi\rangle$ state is measured in $Z$. If the outcome of this measurement is $+1$, then the rotation is successful and all other resource states are discarded by measuring them in $X$. If, instead, the outcome is $-1$, the $|2\phi\rangle$ state is measured in $Z$. If the outcome of this $Z$ measurement is $+1$, the correction is successful, and the remaining resource states are discarded by $X$ measurements. For $-1$, the corrections continue with a $Z$ measurement of $|4\phi\rangle$. Note that, in most cases, this cascade of measurements finishes in the second step. Therefore, on average, it takes $2t_m$ to perform these measurements. However, sufficiently many resource states are required in order to be prepared for the most unlikely situations, in which many measurement steps are required. The probability to require $n$ measurement steps (i.e., $n$ resource states down to $|2^n\phi\rangle$) is exponentially low, $2^{-n}$. Therefore, the number of resource states that need to be generated for each $\phi$ rotation scales logarithmically with the rotation count of the circuit, if one wants to stay below a certain probability that any of these rotations is slowed down by a missing resource state. If $|\pi/2^k\rangle$ states are used, the cascade of measurements terminates after $k$ steps. This technique of cascading resource state measurements is also referred to as programmable ancilla rotations [44]. Note that the cascade of measurements can also be postponed to a later point, such that the post-corrected $\phi$ rotations can be used in the time-optimal scheme.
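The statistics of the measurement cascade can be made explicit with a short Python sketch. It assumes that every $Z$ measurement independently fails with probability 1/2, as stated above; the function names and the use of a union bound to budget resource states are illustrative choices and not part of the scheme itself.

```python
import math


def prob_unresolved_after(k: int) -> float:
    """Probability that the cascade still needs a correction after k measurements."""
    return 2.0 ** -k


def expected_cascade_steps(max_k: int = 60) -> float:
    """Mean number of measurement steps: sum over k of k * 2**-k (truncated at max_k)."""
    return sum(k * 2.0 ** -k for k in range(1, max_k + 1))


def states_per_rotation(rotation_count: int, failure_budget: float) -> int:
    """Smallest k with rotation_count * 2**-k <= failure_budget (union bound)."""
    return math.ceil(math.log2(rotation_count / failure_budget))


print(expected_cascade_steps())
print(states_per_rotation(10**8, 0.01))
```

The mean evaluates to 2 steps, i.e., $2t_m$ on average. For a hypothetical circuit with $10^8$ rotations and a 1% budget for running out of resource states anywhere in the circuit, the bound asks for 34 states per rotation, which illustrates the logarithmic scaling.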
Using the T-count-limited scheme of Sec. 4, we can execute a $\phi$ rotation every time step. For 100 T gates per $\phi$ rotation, this speeds up the computation by a factor of 100. Also, the time-optimal setting of Sec. 5 can be used with Clifford+$\phi$ circuits. However, the execution of a $\phi$ layer can take more than $2t_m$, as the measurement cascades for all rotations in the layer need to terminate. For instance, for 100 rotations per layer, each layer execution takes, on average, $8t_m$. For 100 T gates per rotation, $\phi$ layer parallelization reduces the computational time by a factor of 12.5 compared to T layer parallelization, i.e., from over 3 years to 3 months. In the specific case of quantum chemistry simulations, their T count can be reduced significantly by using more advanced algorithms [45-47], which also profit from arbitrary-angle rotations. Thus, if distributed quantum computing is feasible, Clifford+$\phi$ circuits such as the ones used for quantum chemistry can be executed with qubit counts per quantum computer not far above the numbers reported in Fig. 3. The only difference to Clifford+T units is that larger distillation blocks are required to produce and store the $|\phi\rangle$ resource states.
Multi-controlled Pauli gates. Other gates that are used extensively in quantum algorithms are multi-controlled Paulis, such as Toffoli or CCZ gates. In Fig. 5, we have shown how $C(P_1, P_2)$ gates can be written in terms of $\pi/4$ rotations. A similar decomposition is possible for multi-controlled Pauli gates. In Fig. 36, we show how a $C(P_1, P_2, P_3)$ gate is a product of 7 $\pi/8$ rotations. For instance, $C(Z, Z, X)$ is the Toffoli gate. From the circuit, it is evident that the T depth of $C(P_1, P_2, P_3)$ gates is one [28]. In principle, these doubly-controlled Pauli gates can be written with just four T gates [48], but this increases the number of layers and a similar effect can be obtained by cancelling $\pi/8$ rotations from pairs of doubly-controlled gates in a circuit. Reducing the T count by increasing the circuit depth [49] can still be a useful circuit manipulation for T-count-limited setups. We also note that the T count can be reduced by combining gate synthesis and magic state distillation (synthillation) [50,51].
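The average layer time can be estimated as the expected maximum of the cascade lengths in a layer. The Python sketch below evaluates this expectation; it assumes that the cascades of the rotations in a layer are independent and that each measurement fails with probability 1/2. The function name is hypothetical.

```python
def expected_layer_steps(rotations_per_layer: int, max_k: int = 200) -> float:
    """Expected number of measurement steps until all cascades in a layer terminate.

    For independent cascades, P(longest cascade > k) = 1 - (1 - 2**-k) ** m,
    and the expectation is the sum of these tail probabilities over k >= 0.
    """
    m = rotations_per_layer
    return sum(1.0 - (1.0 - 2.0 ** -k) ** m for k in range(max_k))


steps = expected_layer_steps(100)
print(steps)        # average layer time in units of t_m
print(100 / steps)  # speedup over 100 serial T gates per rotation
```

For 100 rotations per layer, this gives close to 8 measurement steps and a speedup of roughly 12.5, in line with the estimates above. For a single rotation, it reduces to the mean of 2 steps.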
$C(P_1, P_2, P_3, P_4)$ gates, i.e., triply-controlled Pauli gates, can be written as 15 $\pi/16$ rotations, as shown in Fig. 37. While the T depth of this circuit is no longer 1, the rotation depth is. In fact, any multi-controlled Pauli gate $C(P_1, \dots, P_n)$ acting on $n$ Pauli operators, i.e., with $n - 1$ controls, can be constructed from $2^n - 1$ $P_{\pi/2^n}$ rotations by following the pattern shown in Figs. 5, 36 and 37. The rotation depth of all these gates is 1. Multi-controlled gates can also be pieced together from $C(P_1, P_2, P_3)$ gates, but this increases the circuit depth. By using small-angle rotations, any multi-controlled Pauli gate can be executed in one step.
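The counting pattern of Figs. 5, 36 and 37 can be checked numerically for the $Z$-type case. The Python sketch below multiplies one rotation $\exp(\pm i (\pi/2^n) Z_S)$ for every non-empty subset $S$ of $n$ qubits and compares the result with the multi-controlled $Z$ gate up to a global phase. The alternating sign $(-1)^{|S|}$ and the restriction to $C(Z, \dots, Z)$ are assumptions of this sketch; gates with other Pauli operators are related to it by single-qubit basis changes.

```python
import cmath
from itertools import combinations, product


def rotation_product_diagonal(n: int) -> list:
    """Diagonal of the product of 2**n - 1 commuting Z-type rotations of angle pi/2**n."""
    theta = cmath.pi / 2 ** n
    subsets = [s for r in range(1, n + 1) for s in combinations(range(n), r)]
    assert len(subsets) == 2 ** n - 1
    diagonal = []
    for bits in product((0, 1), repeat=n):
        phase = 0.0
        for s in subsets:
            z_eigenvalue = (-1) ** sum(bits[j] for j in s)  # eigenvalue of Z_S
            sign = (-1) ** len(s)                           # assumed sign pattern
            phase += sign * theta * z_eigenvalue
        diagonal.append(cmath.exp(1j * phase))
    return diagonal


def is_multi_controlled_z(n: int, tol: float = 1e-9) -> bool:
    """True if the rotation product equals C(Z, ..., Z) up to a global phase."""
    diagonal = rotation_product_diagonal(n)
    target = [1.0] * (2 ** n - 1) + [-1.0]  # only the all-ones state acquires -1
    global_phase = diagonal[0]
    return all(abs(d - global_phase * t) < tol for d, t in zip(diagonal, target))


print([is_multi_controlled_z(n) for n in (2, 3, 4)])
```

The check passes for $n = 2$, 3 and 4, i.e., for 3 rotations of angle $\pi/4$, 7 of angle $\pi/8$ and 15 of angle $\pi/16$. Since all rotations are diagonal, they mutually commute, which is why the rotation depth is 1.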

Shorter measurements
If the bottleneck of slow classical processing can be overcome, then the only hardware-based restriction to the speed of quantum computation is the time it takes to measure a physical qubit. In the time-optimal scheme, the execution time of each rotation layer is governed by the measurement time. This measurement time only needs to be long if the measurement fidelity is required to be high. In order to speed up the computation, one can use shorter qubit measurements. This exponentially decreases the measurement fidelity. On the other hand, the measurement fidelity of encoded surface-code qubits increases exponentially with the number of qubits comprising the logical qubit. Thus, by using twice as many physical qubits to encode the measured logical qubit, the measurement time can be decreased by a factor of two, doubling the computational speed of the quantum computer. In fact, not all qubits need to use a higher code distance. Only the correction qubits that are measured to execute each rotation layer need to be larger, and only right before they are measured. The physical qubit measurement does not need to be a quantum non-demolition measurement, but can be a destructive measurement. Ultimately, however, the speed of quantum computation is limited by the speed of classical computation. Exploring superconducting logic [52] to speed up classical computation may be a viable route to speed up quantum computers.
Summary. All the schemes discussed in this paper can not only be used with Clifford+T circuits, but also with Clifford+$\phi$ circuits. The only difference is that more and different resource states are required. Their distillation and storage requires more space than ordinary magic state distillation, but their use can speed up the computation by several orders of magnitude.

Conclusion
In this work, we described how full quantum computations can be performed in surface-code-based architectures of different sizes. Previous works on the translation of quantum computations into surface-code schemes [35,53-55] attempted to optimize the logical qubit arrangement via algorithms that take a quantum circuit as an input. Here, we took a different approach by discussing computational schemes that do not require any prior knowledge about the input circuit. This has the advantage that a resource count with our schemes only requires the T count and T depth of the input circuit, and that the schemes consist of modular blocks that can be optimized independently of each other. In addition, the space-time cost is lower compared to earlier works [20,35].
Big quantum computers are fast. Starting from the minimal setup in Fig. 23 that consists of a compact data block and a single distillation block, we traded off space versus time, increasing the size of the quantum computer and, in return, decreasing the computational time. For the example of a computation with a T count of $10^8$ and a T depth of $10^6$ with an error rate of $p = 10^{-4}$, the minimal setup consists of 164 tiles and executes one T gate every 11 time steps, corresponding to a computational time of 4 hours with 55,400 physical qubits. From here, the space-time cost is drastically reduced by adding more distillation blocks, as shown in Fig. 38 and Tab. 2. With this strategy, the computational time is reduced to 1 time step per T gate, where the computational cost of a circuit is governed by its T count.
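The numbers for the minimal setup can be reproduced with a rough estimate, sketched below in Python. The estimate rests on assumptions that are not stated in this paragraph: a code distance of $d = 13$ for $p = 10^{-4}$, a time step of $d$ code cycles of $1\,\mu$s each, and $2d^2$ physical qubits per tile (the $d^2$ data qubits plus roughly as many measurement qubits). The function name is hypothetical.

```python
def minimal_setup_estimate(t_count: float, tiles: int, steps_per_t_gate: int,
                           d: int, cycle_us: float = 1.0):
    """Rough physical-qubit count and runtime (hours) of a serial T-gate setup."""
    physical_qubits = tiles * 2 * d ** 2            # assumed 2 d^2 qubits per tile
    runtime_us = t_count * steps_per_t_gate * d * cycle_us
    runtime_hours = runtime_us / 1_000_000 / 3600
    return physical_qubits, runtime_hours


qubits, hours = minimal_setup_estimate(1e8, tiles=164, steps_per_t_gate=11, d=13)
print(qubits, hours)
```

Under these assumptions, the sketch gives 55,432 physical qubits and just under 4 hours, consistent with the rounded values quoted above.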
For further space-time trade-offs, we parallelized T layers using units. This is an increase in space-time cost, especially for linear arrangements of units (dashed line in Fig. 38), but enables further space-time trade-offs. Linearly trading off space versus time, the computational time can be reduced to one measurement per T layer. Units are well-suited for distributed quantum computing, as the sharing of Bell pairs between neighboring units is part of the parallelization scheme. This exhausts the space-time trade-offs that are possible within the Clifford+T framework. Switching to Clifford+$\phi$ circuits can provide further trade-offs, as additional resources are introduced for arbitrary-angle rotations. This can be used to execute circuits in a time proportional to their rotation depth, as opposed to their T depth. We have not investigated how this trade-off affects the space-time cost in our scheme.
Room for optimization. In our T-count-limited schemes and for the preparation of units, one T gate is performed after the other. If the input circuit is known, it is reasonable to assume that qubits can be arranged in a way that allows for the parallel execution of multiple T gates in the same data block. Furthermore, there is a strict separation between tiles used for magic state distillation and tiles used for data blocks in our schemes. By sharing tiles between blocks, the space overhead may be reduced. Moreover, we have only considered a handful of distillation protocols. It would be interesting to see which distillation protocols can be used to optimize the cost function of Eq. (8). Finally, concrete tile layouts that can be used to distill and consume the additional resources necessary for Clifford+$\phi$ computing are still missing.
Beyond surface codes. Even though we designed our schemes with surface codes in mind, they can, in principle, be applied to other toric-code-based patches, such as Majorana surface-code patches [11] or color-code patches [13,56]. Color codes can reduce the number of physical qubits due to more compact encoding, but require more elaborate hardware to measure the higher-weight check operators. The space cost is reduced by replacing all surface-code patches by color-code patches, with the exception of Pauli product measurement ancillas. In order to keep the space cost low, measurement ancillas should remain surface-code patches and color-to-surface code lattice surgery [57] should be used during the Pauli product measurement protocol, as described in Ref. [58].
Outlook. If the number of qubits continues to double every 8 months [59], the 60,000-300,000 physical qubits necessary for classically intractable Hubbard [...]

A Surface-code qubits and lattice-surgery operations
To illustrate the translation of protocols in our framework into surface-code patches, we show how the patches and protocols of Figs. 1 and 2 are implemented with surface codes.
Surface-code patches. Each patch corresponds to a surface-code patch with code distance $d$. Therefore, each tile corresponds to $d^2$ physical data qubits, as shown in Fig. 39 for $d = 5$. In our surface-code patches, physical qubits are placed on the vertices, bright faces correspond to $Z$ stabilizers and dark faces to $X$ stabilizers. Solid and dashed boundaries correspond to $X$ and $Z$ boundaries (also called rough and smooth boundaries). For one-qubit patches, the product of all $d$ physical $X$ ($Z$) operators along any of the $X$ ($Z$) boundaries is the logical $X$ ($Z$) operator of the encoded qubit. For two-qubit patches with six boundaries, the string operators located at the boundaries correspond to the logical operators shown in Fig. 1, i.e., going clockwise, $X_1$, $Z_1$, $X_1 \cdot X_2$, $Z_2$, $X_2$, and $Z_1 \cdot Z_2$. Note that, in principle, the width of two-tile patches can be $2d - 1$ instead of $2d$, potentially reducing the space cost [11]. Furthermore, the correspondence between solid and dashed, and $X$ and $Z$ boundaries is interchangeable.
Bell state preparation. To demonstrate the implementation of the available set of operations in our framework with surface-code patches, we now show how the protocols of Fig. 2 [...] operator. The measurement outcome is the product of the newly introduced $Z$ stabilizers highlighted in red, as the product of these stabilizers corresponds to the product of the logical $Z$ operators encoded in the two surface-code $Z$ boundaries. To account for measurement errors, this measurement is repeated for $d$ code cycles. Finally, the patch is split into two patches again, leaving the two logical surface-code qubits in an entangled Bell state.
Moving corners. The movement of corners of a surface-code patch is shown in Fig. 40b. It corresponds to a change of boundary stabilizers. In order to account for measurement errors of the newly measured stabilizers, this requires $d$ code cycles. The top left physical qubit in the second step of Fig. 40b is removed from the patch via an $X$ measurement.
Moving boundaries. The protocol to move patches is similar to the first protocol. It is shown in Fig. 40c. Extending the patch via its $Z$ boundary in the second step is the same operation as a $Z \otimes Z$ lattice surgery between the patch and a rectangular $|+\rangle$ ancilla qubit to the right. This needs to be done for $d$ code cycles to account for measurement errors. Finally, the patch is shortened again by measuring the left two thirds of physical qubits in the $X$ basis.
Y measurements. The fourth protocol is shown in Fig. 40d. First, a patch is deformed to a wider patch by initializing physical qubits in the $X$ basis and measuring the new stabilizers, which takes $d$ code cycles. Below the wide patch, a rectangular ancilla patch is initialized in the $|0\rangle$ state. A column of physical qubits in the center is missing, so that, in the next step, the ancilla can be used for twist-based lattice surgery [11], measuring the $Y$ operator. The product of the operators highlighted in red in the third step corresponds to the logical $Y \otimes Z$ operator between the two logical qubits. The lattice surgery in the third step involves dislocation operators and a five-qubit twist defect. Even though these stabilizers are irregular, they can still be measured in a square lattice of physical qubits with nearest-neighbor couplings, as we show in Fig. 41. For the measurement of twist operators and wide $X$ and $Z$ stabilizers, up to three measurement ancillas can be used.
Multi-patch measurements. For a multi-patch measurement in Fig. 42, all physical qubits located in the region of the ancilla patch are initialized in the $|+\rangle$ state. Next, new check operators are introduced. The newly introduced $X$-type stabilizers all yield trivial outcomes, since they are products of physical qubits initialized in an $X$ eigenstate and previously measured check operators. The nontrivial operators are highlighted by a red dot in Fig. 42. Their product is equivalent to the desired operator, i.e., $Y_{|q_1\rangle} \otimes X_{|q_3\rangle} \otimes Z_{|q_4\rangle} \otimes X_{|q_5\rangle}$. The new check operators are measured for $d$ code cycles to account for measurement errors. This procedure corresponds to the multi-body lattice surgery protocol introduced in Ref. [12]. It can be used to measure any product of surface-code-boundary Pauli operators by initializing physical qubits in the $|+\rangle$ state in an ancilla region of width $d$, and then measuring new check operators, where the product of the nontrivial operators yields the outcome of the desired multi-patch measurement. The ancilla region of width $d$ is required to ensure that the code distance of the stabilizer configuration during the multi-body lattice surgery remains $d$.

B Extended ruleset
Some surface-code operations are not covered by the rules discussed in the introduction. In particular, we only consider patches with 4 or 6 corners, where we refer to the points where two edges meet as corners. In general, one could also consider patches with a higher number of corners. A patch with $2N + 2$ corners represents $N$ qubits, as shown in Fig. 43. The simplest case is a four-corner patch (a/b) representing a single qubit. Six-corner patches (c) are two-qubit patches. The general rule that assigns the operators of $N$ qubits to the edges of a $(2N + 2)$-corner patch is given in Fig. 43d. Going clockwise, the dashed boundaries correspond to $X_1$, $X_1 X_2$, $X_2 X_3$, ..., $X_{N-1} X_N$ and $X_N$. Starting to the right of $X_1$, the solid edges correspond to $Z_1$, $Z_2$, ..., $Z_N$ and the product $Z_1 Z_2 \cdots Z_N$.
One can also consider patches with shortened edges, such that they occupy fewer tiles. The drawback of this is that in every time step, an error corresponding to the Pauli operator represented by the shortened edge will occur with a certain probability $p_{\mathrm{err}}$. An example of a six-corner patch with two shortened $X$ edges is shown in Fig. 44, meaning that this six-corner patch is susceptible to $X$ errors. In the surface-code implementation, this corresponds to a patch with boundaries that are shorter than $d$ physical data qubits, effectively reducing the code distance of the logical operators encoded by the shortened edges. Note that patches with shortened edges may occupy more than $d^2$ physical data qubits per tile.
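The edge-labelling rule can be written down programmatically. The Python sketch below lists the clockwise edge operators of a $(2N+2)$-corner patch and runs two consistency checks that we add for illustration: neighboring edges anticommute while all other pairs commute, and the product of all edges of one type is the identity. The function names are hypothetical, and the sketch assumes that $X$ and $Z$ edges alternate and that the last $Z$ edge carries the product of all $Z_i$, as in the two-qubit patch of Fig. 1.

```python
from functools import reduce


def patch_edges(n_qubits: int) -> list:
    """Clockwise edges of a (2N+2)-corner patch as (Pauli type, qubit support) pairs."""
    n = n_qubits
    x_edges = ([frozenset({1})]
               + [frozenset({i, i + 1}) for i in range(1, n)]
               + [frozenset({n})])
    z_edges = ([frozenset({i}) for i in range(1, n + 1)]
               + [frozenset(range(1, n + 1))])
    edges = []
    for x_support, z_support in zip(x_edges, z_edges):
        edges.append(("X", x_support))
        edges.append(("Z", z_support))
    return edges


def anticommute(edge_a, edge_b) -> bool:
    """X- and Z-type operators anticommute iff they overlap on an odd number of qubits."""
    (type_a, support_a), (type_b, support_b) = edge_a, edge_b
    return type_a != type_b and len(support_a & support_b) % 2 == 1


def corners_consistent(edges) -> bool:
    """True if exactly the cyclically neighboring edges anticommute."""
    m = len(edges)
    for i in range(m):
        for j in range(i + 1, m):
            adjacent = (j - i) in (1, m - 1)
            if anticommute(edges[i], edges[j]) != adjacent:
                return False
    return True


def product_is_identity(edges, pauli_type: str) -> bool:
    """True if all edges of the given type multiply to the identity."""
    supports = [s for t, s in edges if t == pauli_type]
    return reduce(lambda a, b: a ^ b, supports, frozenset()) == frozenset()
```

For $N = 2$, `patch_edges(2)` returns the sequence $X_1$, $Z_1$, $X_1 X_2$, $Z_2$, $X_2$, $Z_1 Z_2$ of the six-corner patch. That the $Z$ edges multiply to the identity is the property that makes the product of edge measurements on a $|+\rangle^{\otimes N}$ ancilla independent of the unknown $Z_i$.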
With $(2N + 2)$-corner patches, the set of operations needs to be modified. The initialization rule for such patches is:
- Qubits can be initialized in the $X$ and $Z$ eigenstates $|+\rangle$ and $|0\rangle$. All qubits that are part of one patch must be initialized in the same state. (Cost: 0 time steps)
Similarly, the single-patch measurement rule is modified to:
- Qubits can be measured in the $X$ or $Z$ basis. All qubits that are part of the same patch are measured simultaneously and in the same basis. This measurement removes the patch from the board. (Cost: 0 time steps)
Pauli product measurements. Using multi-corner patches with shortened boundaries, the multi-patch measurement rule is, in principle, redundant. For instance, the Pauli product measurement of Fig. 8 can be equivalently performed in 1 time step via the protocol shown in Fig. 45. An 8-corner ancilla patch is initialized in the $|+\rangle^{\otimes 3}$ state. The shape of this patch is chosen such that each of the four $Z$ edges is adjacent to one of the four operators that are part of the measurement. Note that this means that some of the $X$ edges are shortened, such that the qubits are susceptible to $X$ errors. In this case, this is not a problem, since the qubits are initialized in $X$ eigenstates and random $X$ errors will cause no change to the states. Next, in step 3, we measure the four Pauli products, each consisting of one of the four operators of the measurement and the adjacent $Z$ edge of the ancilla patch ($Z_1$, $Z_2$, $Z_3$ or $Z_1 Z_2 Z_3$). Because the ancilla is initialized in an $X$ eigenstate, the operators $Z_1$, $Z_2$ and $Z_3$ are unknown, and the outcome of each of the four aforementioned measurements is entirely random. However, multiplying the four measurement outcomes cancels the ancilla operators, since $Z_1 \cdot Z_2 \cdot Z_3 \cdot (Z_1 Z_2 Z_3) = 1$, and yields precisely the operator $Z_{|q_1\rangle} \otimes Y_{|q_2\rangle} \otimes X_{|q_4\rangle} \otimes Z_{|m\rangle}$ that we wanted to measure. Finally, to discard the ancilla patch, we measure its three qubits in the $X$ basis. Again, $X$ errors will have no effect, as they commute with the measurement basis. Measurement outcomes of $X_i = -1$ prompt a Pauli correction. If in the previous step, the $Z_i$ edge was measured together with a Pauli operator $P$, the correction is a $P_{\pi/2}$ gate. For instance, if in Fig. 8 the final measurements yield $X_2 = -1$ and $X_3 = -1$, the corrections are a $Y_{\pi/2}$ rotation on $|q_2\rangle$ and a $Z_{\pi/2}$ rotation on $|m\rangle$.
This type of protocol can be used to measure any product of $n$ Pauli operators. An ancilla patch needs to be initialized in the $|+\rangle^{\otimes n}$ state with $Z$ edges adjacent to the $n$ operators that are part of the measurement. The surface-code implementation of this protocol is identical to the surface-code implementation of multi-patch measurements in Fig. 42.
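The bookkeeping of this ancilla-based measurement can be illustrated with a short sketch. The pairing of $Y_{|q_2\rangle}$ with $Z_2$ and of $Z_{|m\rangle}$ with $Z_3$ follows from the correction example in the text; the assignment of $Z_1$ and $Z_1 Z_2 Z_3$ to the remaining two operators is an assumption made only for illustration, as are the names `PAIRINGS`, `ancilla_support` and `corrections`. The sketch checks that the ancilla $Z$ operators cancel in the product of the four measurements and derives the Pauli corrections from the final $X$ measurements.

```python
# (data qubit, Pauli) measured together with an ancilla Z edge, where the edge
# is given as the set of ancilla qubits it acts on.
# Assumption: Z_1 is paired with q1 and Z_1 Z_2 Z_3 with q4; only the pairings
# for q2 and m are fixed by the correction example in the text.
PAIRINGS = [
    (("q1", "Z"), frozenset({1})),
    (("q2", "Y"), frozenset({2})),
    (("m", "Z"), frozenset({3})),
    (("q4", "X"), frozenset({1, 2, 3})),
]


def ancilla_support(pairings):
    """Ancilla qubits on which the product of all measured operators acts."""
    support = set()
    for _, edge in pairings:
        support ^= edge
    return support


def corrections(pairings, x_outcomes):
    """P_{pi/2} corrections for ancilla outcomes X_i = -1.

    x_outcomes maps the ancilla index i to the outcome (+1 or -1) of X_i.
    Only single-qubit Z_i edges are consulted, following the rule in the text.
    """
    result = []
    for (qubit, pauli), edge in pairings:
        if len(edge) == 1:
            (i,) = edge
            if x_outcomes[i] == -1:
                result.append((qubit, pauli))
    return result


print(ancilla_support(PAIRINGS))                    # empty: ancilla operators cancel
print(corrections(PAIRINGS, {1: +1, 2: -1, 3: -1}))
```

With $X_2 = -1$ and $X_3 = -1$, the sketch returns a $Y$ correction on $|q_2\rangle$ and a $Z$ correction on $|m\rangle$, in agreement with the example above.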
While multi-corner patches and shortened edges increase the number of surface-code operations that are covered by the framework, there are still rules that can be added to the ruleset to account for more operations, such as the movement of corners inside a patch [10]. Also, for the initialization of non-Pauli eigenstates, error models other than random Pauli errors can be considered.

C Proof-of-principle device
Here, we discuss how $(3d - 1) \cdot 2d$ physical data qubits can be used to build a proof-of-principle device that is a universal two-qubit error-corrected quantum computer that uses undistilled magic states and can demonstrate all the operations required for large-scale quantum computing. We go through the example of a computation that starts with three π/8 rotations around $Z \otimes Z$, $Y \otimes X$ and $Y \otimes Y$ in Fig. 46. For the first rotation, we need to measure $Z_1 \otimes Z_2 \otimes Z_{|m\rangle}$. A magic state is initialized in a long patch in step 2, which is equivalent to initializing a magic state and measuring $X \otimes X$ between the magic state and neighboring $|0\rangle$ ancillas. This effectively encodes the magic state in a three-qubit repetition code with a logical $Z$ operator $Z_L = Z \otimes Z \otimes Z$. To consume the magic state for the $Z \otimes Z$ rotation, $Z_1 \otimes Z_2 \otimes Z_L$ is measured in step 3.
The next rotation is a $Y \otimes X$ rotation. Here, we first need to deform $|q_1\rangle$, such that both the $X$ and $Z$ boundaries of the qubit are accessible. Qubit $|q_2\rangle$ is rotated in steps 5-8 using the protocol in Fig. 11a. In step 9, again, a magic state is initialized in a two-qubit repetition code with $Z_L = Z_{a_1} \otimes Z_{a_2}$. In step 10, the magic state is consumed via a $Y_1 \otimes Z_{a_1}$ and an $X_2 \otimes Z_{a_2}$ measurement.
This kind of protocol consisting of patch deformations and patch rotations can be used to perform any π/8 rotation with the exception of $(Y \otimes Y)_{\pi/8}$, since there is not enough space to make both $Y$ operators accessible for lattice surgery. For this rotation, we first explicitly execute a Clifford gate to change $(Y \otimes Y)_{\pi/8}$ to any other rotation. Any Clifford gate that does not commute with $Y \otimes Y$ will suffice. In our example, we choose a $Z_{\pi/4}$ rotation. It is performed by initializing a $|0\rangle$ state in step 13, and measuring $Z_1 \otimes Y$ between $|q_1\rangle$ and the ancilla, following the protocol of Fig. 11b.
This demonstrates that a proof-of-principle experiment can be built with 48 physical data qubits. In general, this requires $6d^2 - 2d$ qubits, i.e., 48 for $d = 3$, 140 for $d = 5$ and 280 for $d = 7$. If measurement qubits are required for syndrome readout, the number of physical qubits roughly doubles.
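The qubit counts quoted above follow directly from $(3d - 1) \cdot 2d = 6d^2 - 2d$. The following short sketch reproduces them; the function name `data_qubits` is ours, and the factor of two for measurement qubits is only the rough estimate stated in the text, not an exact count.

```python
def data_qubits(d):
    """Physical data qubits of the proof-of-principle device: (3d - 1) * 2d."""
    return (3 * d - 1) * 2 * d


for d in (3, 5, 7):
    n_data = data_qubits(d)
    # Rough estimate only: measurement qubits approximately double the count.
    print(f"d = {d}: {n_data} data qubits, roughly {2 * n_data} physical qubits")
```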

D Implementation of the 7-to-1 protocol
Even though the distillation of $|Y\rangle = |0\rangle + i|1\rangle$ states has no use in our framework, we show how to implement the 7-to-1 distillation protocol for benchmarking purposes in Fig. 47. The protocol is based on the 7-qubit Steane code. Its $X$ stabilizers are the faces shown in Fig. 47a, and its logical $X$ operator can be chosen as the $X \otimes X \otimes X$ operator with support on the three qubits drawn in red.
Following the procedure in Sec. 3, the distillation circuit is obtained by initializing $m_x + k = 4$ qubits in the $|+\rangle$ state, where the first three qubits are associated with the three $X$ stabilizers, and the last qubit is associated with the logical $X$ operator. For each qubit of the Steane code, the circuit contains a π/4 rotation with $Z$'s on each stabilizer and logical operator that the qubit is part of. The three qubits in the corners of the triangle are each part of only a single stabilizer and no logical operator; they therefore contribute single-qubit $Z_{\pi/4}$ rotations, which can be absorbed into the initial state. The remaining four rotations are shown in Fig. 47c. A distillation block that can be used for this protocol is shown in Fig. 47b. Since the consumption of $|Y\rangle$ resource states requires no Clifford correction, this block consists of only 7 tiles. With four rotations, the leading order of the space-time cost of this protocol is $7d^2 \cdot 4d = 28d^3$.
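The construction of the rotations can be sketched as follows. The qubit labels (corners `c1`-`c3`, edge midpoints `m12`, `m23`, `m13`, center `o`) and the choice of the logical $X$ operator on the three edge midpoints are assumptions made for illustration. They are chosen to be consistent with the statement that the corner qubits are part of no logical operator, but they need not match the labelling of Fig. 47. Each rotation axis is written as a four-character string over the three stabilizer qubits and the logical-operator qubit.

```python
# Hypothetical labelling of the 7 Steane-code qubits on a triangle.
QUBITS = ["c1", "c2", "c3", "m12", "m23", "m13", "o"]
X_STABILIZERS = [
    {"c1", "m12", "m13", "o"},
    {"c2", "m12", "m23", "o"},
    {"c3", "m23", "m13", "o"},
]
LOGICAL_X = {"m12", "m23", "m13"}  # assumed weight-3 logical X operator


def rotation_axes():
    """Axis of the pi/4 rotation contributed by each Steane-code qubit.

    Position j of the string holds 'Z' if the qubit is part of the j-th
    operator (three X stabilizers, then the logical X), and '1' otherwise.
    """
    operators = X_STABILIZERS + [LOGICAL_X]
    return {q: "".join("Z" if q in op else "1" for op in operators) for q in QUBITS}


def spacetime_cost(d, tiles=7, steps=4):
    """Leading-order space-time cost: tiles * d^2 qubits for steps * d cycles."""
    return tiles * d**2 * steps * d


axes = rotation_axes()
single = {q: a for q, a in axes.items() if a.count("Z") == 1}
multi = {q: a for q, a in axes.items() if a.count("Z") > 1}
print("absorbed into initial state:", single)
print("remaining rotations:", multi)
print("space-time cost for d = 3:", spacetime_cost(3))
```

Under these assumptions, the three corner qubits yield single-qubit rotations and the remaining four qubits yield the four multi-qubit rotations, matching the count in the text.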

Figure 1: Examples of one-qubit (a/b) and two-qubit (c) patches in a 5 × 2 grid of tiles.

Figure 2: Examples of short protocols. (a) Preparation of a two-qubit Bell state in 1 time step. (b) Moving corners of a four-corner patch to change its shape in 1 time step. (c) Moving a square-patch qubit over long distances in 1 time step. (d) Measurement of a square-patch qubit in the $Y$ basis using an ancilla qubit and 2 time steps. (e) A multi-qubit $Y_{|q_1\rangle} \otimes X_{|q_3\rangle} \otimes Z_{|q_4\rangle} \otimes X_{|q_5\rangle}$ measurement in 1 time step.

Figure 4: A generic circuit consists of π/4 rotations (orange), π/8 rotations (green) and measurements (blue). The Pauli product in each box specifies the axis of rotation or the basis of measurement. If the Pauli operator is $-P$ instead of $P$, a minus sign is found in the corner of the box, such that, e.g., $Z_{-\pi/4}$ corresponds to an $S^\dagger$ gate. Using the commutation rules in (a/b), all Clifford gates can be moved to the end of the circuit. Using (c), the Clifford gates can be absorbed by the final measurements.

Figure 5: Clifford+T gates in terms of Pauli rotations.(a) Single-qubit Clifford gates are π/4 rotations, and the T gate is a π/8 rotation.(b/c) P1-controlled-P2 gates are Clifford gates, where C(Z, X) is the CNOT gate.

Figure 7: Circuit to perform a π/8 rotation by consuming a magic state.

Figure 10: For compact blocks, the worst-case scenario is a Pauli product measurement involving an even number of $Y$ operators, e.g., the measurement required for a $(Y \otimes 1 \otimes Y \otimes Z \otimes Y \otimes Y)_{\pi/8}$ gate. Such measurements require two explicit π/4 rotations (left), and two π/4 rotations that are commuted to the end of the circuit (right).

Figure 11: (a) Patches can be rotated in 3 time steps to change whether the $X$ or $Z$ operator is adjacent to the compact block's ancilla region. (b) A $P_{\pi/4}$ gate can be performed explicitly via a $P \otimes Y$ measurement with a $|0\rangle$ ancilla qubit. (c) Six-step protocol to perform the rotation of Fig. 10 in a compact block. The magic state is consumed in 9 time steps, where steps 2-5 are the two π/4 rotations in Fig. 10, steps 6 and 7 are patch rotations, and step 8 is the Pauli product measurement consuming the magic state.

Figure 12: Patch rotations in preparation of a Z ⊗ X ⊗ Z ⊗ Z ⊗ X measurement with an intermediate block.

Figure 13: (a) Intermediate blocks store $n$ data qubits in $2.5n + 4$ tiles and require up to 5 time steps per magic state. (b) Fast blocks use $2n + \sqrt{8n} + 1$ tiles and require 1 time step per magic state.

Figure 14: Encode-$T$-decode circuit of the 15-to-1 distillation protocol. The multi-target CNOTs (orange) can be commuted past the $T$ gates, such that they cancel and leave 15 $Z$-type Pauli product rotations.

Figure 17: Implementation of the 15-to-1 distillation protocol in our framework. Each time step in (c) corresponds to an auto-corrected π/8 rotation (b), which in turn is based on selective π/4 rotations (a).

Figure 18: Two 2 × 2 ancilla blocks can be used to prevent state injection and classical processing from slowing down the 15-to-1 protocol.

Figure 19: Implementation of the 20-to-4 protocol in our framework. Steps 1-35 are used for the 17 rotations in the circuit of Fig. 16. The final measurements in steps 36-38 are the measurements shown in Fig. 20.

Figure 20: The three final measurements in the circuit of Fig. 16 can be done in 2 time steps instead of 3.

Figure 21: 176-tile block that can be used for 225-to-1 distillation. The qubits highlighted in red are used for the second level of the distillation protocol. The blue ancilla is used to move level-1 magic states into the two $|m\rangle$-$|0\rangle$ blocks of the level-2 distillation.

Figure 22: 81-tile block that can be used for the 116-to-12 protocol. Here, two π/8 rotations can be performed at the same time, where one rotation uses the ancilla space denoted as ancilla 1, and the other one uses ancilla 2.

Figure 23: Minimal setups using compact data blocks for $p = 10^{-4}$ (with 15-to-1 distillation) and $p = 10^{-3}$ (with 116-to-12 distillation). Blue tiles are data block tiles, orange tiles are distillation block tiles, green tiles are used for magic state storage and gray tiles are unused tiles.

Figure 26: (a) Circuit for quantum teleportation of $|\psi\rangle$ through a gate $U$. Only if both Bell-basis measurements yield +1 is the teleported state $U|\psi\rangle$. If $Z \otimes Z = -1$, the state is $UX|\psi\rangle$. If $X \otimes X = -1$, the state is $UZ|\psi\rangle$. If both measurements yield -1, the state is $UY|\psi\rangle$. (b) If $U$ is a π/8 rotation, the corrective Paulis change $P_{\pi/8}$ to $P_{-\pi/8}$.

Figure 27: Time-optimal implementation of a three-qubit quantum computation consisting of 9 $T$ gates in 3 $T$ layers. Post-corrected π/8 rotations (b) can be used to decide at a later point whether the performed operation was a $P_{\pi/8}$ or a $P_{-\pi/8}$ rotation.

Figure 29: Modified 15-to-1 distillation blocks (a) output a $|m\rangle$-$|c\rangle$ pair every 11 time steps. After the end of the distillation protocol, four additional steps (c) are necessary. The modified 116-to-12 distillation block (b) finishes after 53 time steps, due to the three additional steps in (d).

Fig. 29d. Using 2 time steps, correction qubits are prepared and $Y \otimes Z$ operators are measured. Finally, the patches are deformed back to square patches and all magic states are sent to the green storage, while all correction qubits are sent to the purple storage. This adds 3 time steps to the protocol, meaning that this block outputs 12 $|m\rangle$-$|c\rangle$ pairs every 53 time steps with a success probability of $(1 - p)^{116}$. For $p = 10^{-3}$, this corresponds to one output every 4.96 time steps. As mentioned in Sec. 4, modified distillation blocks can also be used with setups in which $T$ gates are performed one after the other, in order to deal with slow classical processing. In this case, only one correction-qubit storage tile per magic state is required.
Units. Modified distillation blocks together with fast data blocks are what we refer to as units. The units for our example computation for $p = 10^{-3}$ and $p = 10^{-4}$ are shown in Fig. 31a-b. They both consist of a 200-qubit fast data block, 200 correction-qubit storage tiles, and a number of distillation blocks. Since we will show that unit preparation takes 113 time steps in our case, the number of distillation blocks is chosen such that at least 100 $|m\rangle$-$|c\rangle$ pairs can be distilled in 113 time steps. A full time-optimal quantum computer consists of a row of multiple units, see Fig. 31c. The units shown in the figure contain some unused tiles. This gives the units a rectangular profile, even though this is not necessarily required.
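The output rate quoted above for the modified 116-to-12 block can be reproduced with a short calculation. We read "one output every 4.96 time steps" as the block duration divided by the expected number of output pairs per run, i.e., $53 / (12 \cdot (1 - p)^{116})$; this reading is our assumption, although it matches the quoted number. The function name `steps_per_output` is ours.

```python
def steps_per_output(p, n_inputs=116, n_outputs=12, steps=53):
    """Expected time steps per distilled output pair.

    A run takes `steps` time steps and succeeds with probability
    (1 - p)**n_inputs, in which case it yields `n_outputs` pairs.
    """
    success_probability = (1 - p) ** n_inputs
    return steps / (n_outputs * success_probability)


print(round(steps_per_output(1e-3), 2))
```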

Figure 31: Units consist of fast data blocks, modified distillation blocks and storage tiles. (a) The unit for $p = 10^{-3}$ consists of 54 × 21 = 1134 tiles. (b) For $p = 10^{-4}$, the number of tiles is 37 × 21 = 777. (c) A time-optimal setup consists of a row of multiple units, which means that the space to the bottom and top of the fast data blocks needs to remain free.

Figure 33: Scheme for distributed quantum computing in a circular arrangement of quantum computers with the ability to share Bell pairs between nearest neighbors. If the Bell-pair fidelity is low, entanglement distillation (ent. dist.) can be used to increase the fidelity. This scheme effectively implements the circular time-optimal circuit drawn schematically in (b).

Figure 34: Clifford+ϕ circuit.The first two rotation layers (ϕ layers) with three rotations per layer are shown.

Figure 35: (a) A post-corrected ϕ rotation can be used to decide at a later point whether the performed operation was a $P_{\phi}$ or a $P_{-\phi}$ gate. (b) A $C(P_1, P_2)$ gate can be performed explicitly using a $|+\rangle$ ancilla and Pauli product measurements.

Figure 38: Space-time, space, and time cost of the schemes discussed in this paper for the example of a 100-qubit quantum computation with $T$ count $10^8$ and $T$ depth $10^6$, under the assumption of a 1 µs code cycle time, and a 1 µs measurement and classical processing time. The solid and dashed lines in M-P are for circular (solid) and linear (dashed) arrangements of units.

Figure 39: Surface-code implementation of the patches shown in Fig. 1. Physical qubits are placed on vertices. Bright faces correspond to $Z$ stabilizers and dark faces to $X$ stabilizers.

Figure 40: Surface-code implementation of the protocols in Fig. 2a-d.

Figure 41: Twist-based lattice surgery in a square lattice of qubits with nearest-neighbor couplings.The black dots are physical data qubits and the white dots are physical measurement qubits.

Figure 42: Surface-code implementation of the multi-patch measurement in Fig. 2e. The measurement outcome is the product of all check operators with a red dot.
Figure 43: Four-, six- and eight-corner patches.

Figure 47: The Steane code (a) is the basis of 7-to-1 distillation (c). In our framework, the corresponding distillation block (b) uses 7 tiles for 4 time steps.

Table 2: Space and time cost of the schemes plotted in Fig. 38. The numbers in parentheses are for linear arrangements of units (dashed lines in Fig. 38).