Efficient magic state factories with a catalyzed $|CCZ\rangle$ to $2|T\rangle$ transformation

We present magic state factory constructions for producing $|CCZ\rangle$ states and $|T\rangle$ states. For the $|CCZ\rangle$ factory we apply the surface code lattice surgery construction techniques described by Fowler et al. to the fault-tolerant Toffoli. The resulting factory has a footprint of $12d \times 6d$ (where $d$ is the code distance) and produces one $|CCZ\rangle$ state every $5.5d$ surface code cycles. Our $|T\rangle$ state factory uses the $|CCZ\rangle$ factory's output and a catalyst $|T\rangle$ state to exactly transform one $|CCZ\rangle$ state into two $|T\rangle$ states. It has a footprint 25% smaller than the factory of Fowler et al. but outputs $|T\rangle$ states twice as quickly. We show how to generalize the catalyzed transformation to arbitrary phase angles, and note that the case $\theta=22.5^\circ$ produces a particularly efficient circuit for producing $|\sqrt{T}\rangle$ states. Compared to using the $12d \times 8d \times 6.5d$ $|T\rangle$ factory of Fowler et al., our $|CCZ\rangle$ factory can quintuple the speed of algorithms that are dominated by the cost of applying Toffoli gates, including Shor's algorithm and the chemistry algorithm of Babbush et al. Assuming a physical gate error rate of $10^{-3}$, our CCZ factory can produce $\sim 10^{10}$ states on average before an error occurs. This is sufficient for classically intractable instantiations of the chemistry algorithm, but for more demanding algorithms such as Shor's algorithm the mean number of states until failure can be increased to $\sim 10^{12}$ by increasing the factory footprint by ~20%.


Introduction
In fault-tolerant quantum computation based on the surface code (a likely component of future error corrected quantum computers due to the surface code's comparatively high threshold and planar connectivity requirements [3,11,29,30,13]), the cost of a quantum algorithm is well approximated by the number of non-Clifford operations. This is due to the fact that non-Clifford operations are performed via magic state distillation [5], and the cost of state distillation is large. For example, the spacetime volume (qubit-seconds) of the T state factory from [15] is two orders of magnitude larger than the volume of a CNOT operation between adjacent qubits [19]. The non-Clifford gate count will likely be particularly significant for the earliest error corrected quantum computers, which will not have enough space to distill magic states in parallel.
Craig Gidney: craiggidney@google.com
Over the past decade, thanks to techniques such as block codes [4,16], bridge compression [14], and many others [19,8,9,25], the cost of magic state distillation has steadily decreased. This paper adds catalyzed phasing to the pile of known techniques, continuing the tradition of gradually chipping away at the convenient approximation that magic states are the dominant cost in error-corrected quantum computation.
Note that, in this paper, we focus on optimizing the cost of distillation in the single-factory regime. For example, we do not investigate whether there are block code factories that can use catalyzation. We focus on the single-factory regime because we are interested in estimating the minimum number of physical qubits needed to run classically intractable instances of various quantum algorithms at a reasonable rate, and the single-factory regime is the relevant one for these kinds of estimates. In Figure 1 and Figure 2, we give a high-level view of this paper's improvements in footprint and spacetime volume over previous factories in the single-factory regime.
The paper is organized as follows. Section 1 provides an overview of the paper, and explains the various notation and diagram conventions we will be using. In Section 2, we explain how to construct an efficient $|CCZ\rangle$ factory by applying the techniques of [15] to the construction of [21,12]. In Section 3, we construct a circuit which can transform a $|CCZ\rangle$ state into two $|T\rangle$ states if a catalyst $|T\rangle$ state is present. In Section 4, we show that this catalyzed circuit generalizes to other phase angles and note that this generalized circuit can produce two $|\sqrt{T}\rangle$ states using only five $|T\rangle$ states. In Section 5, we combine constructions from the previous sections into an efficient $|T\rangle$ factory. Finally, in Section 6, we discuss applications of our constructions, summarize our contributions, and point towards future work.
Figure 2: Size comparison of various factories producing two magic states, including output error rates. Error rates are computed assuming a physical gate error rate of $10^{-3}$, and include topological errors from the surface code itself. Includes the $15|T\rangle \xrightarrow{35\epsilon^3} |T\rangle$ factory construction using braids [14] (left) and lattice surgery [15] (middle left), as well as our $8|T\rangle \xrightarrow{28\epsilon^2} |CCZ\rangle$ factory construction (middle right) and our $|T\rangle$-catalyzed $8|T\rangle \xrightarrow{28\epsilon^2} 2|T\rangle$ factory (right). $|T\rangle$ output events are indicated with red cubes. $|CCZ\rangle$ output events are indicated with a triplet of orange cubes. The braided factory has been scaled to account for the fact that it uses the unrotated surface code instead of the rotated surface code [19]. The braided T factory's error rate is significantly higher because it uses an older injection technique, resulting in the level 0 T gates having an error rate of $10^{-2}$ instead of $2 \cdot 10^{-3}$. The error rate of the catalyzed T factory has an asterisk because its errors are correlated: if one error occurs it can poison the catalyst state and cause many more errors. This means that this factory should be used in contexts where a single error is already considered a complete failure (e.g. at the level of an entire algorithm, not as an input to further distillations).
Figure 3: Screenshot of the resource estimation spreadsheet included in the supplementary materials of this paper (file name "calculator-CCZ-2T-resources.ods"), with various interesting cases pre-entered. Assuming a physical gate error rate of $10^{-3}$, and minimal code distances, the $|CCZ\rangle$ factory is unlikely to fail when producing on the order of $10^{10}$ states. This is sufficient to run classically intractable chemistry algorithms [1], but not quite sufficient to factor a 1024 bit number with a 50% success rate (assuming that factoring an $n$ bit number requires $12n^3$ Toffoli gates and $3n$ space [32]). However, if the physical gate error rate is improved slightly or (more plausibly) the factory is made slightly larger by increasing the level 1 code distance from 15 to 19, then the number of states that can be produced increases to be on the order of $10^{12}$. This allows 4096 bit numbers to be factored (though we do not recommend using a single factory for this task, since it would take 5 years to produce the necessary magic states assuming a surface code cycle time of 1 microsecond).
Figure 4: Comparison of the to-scale diagram style from [15] with the exaggerated-spacing diagram style used by this paper. The to-scale diagram style emphasizes how things fit together and is ideal when reasoning geometrically. The exaggerated-spacing diagram style emphasizes how things connect together and is ideal when reasoning topologically. An even more abstract diagram style for lattice surgery is the ZX calculus [10]. In Figure 9 we show how to translate a topological diagram into a ZX calculus graph.
In this paper we will refer to factories using the notation "$|\text{In}\rangle \xrightarrow{f(\epsilon)} |\text{Out}\rangle$ factory", with $\epsilon$ the error rate of the input states. The left hand side is the state input into the factory, the right hand side is the state output from the factory, and the function above the arrow indicates the amount of error suppression up to leading terms (i.e. the $f(\epsilon)$ above the arrow is shorthand for the true suppression $f(\epsilon) + O(\epsilon f(\epsilon))$). For example, we will refer to the $|T\rangle$ state distillation based on the 15-qubit Reed-Muller code [5] as the $15|T\rangle \xrightarrow{35\epsilon^3} |T\rangle$ factory.
We use three main types of diagram in this paper: circuit diagrams, time slice diagrams, and 3D topological diagrams. The circuit diagrams demonstrate the functionality that the 3D topological diagrams are supposed to be implementing, and the time slice diagrams are a sequence of slices through the 3D topological diagrams, showing boundary information and which patches are being merged or split. We often provide multiple diagrams of the same construction, with common labelling between the diagrams. For example, Figure 5, Figure 7, and Figure 8 are all diagrams of our CCZ factory. Discerning readers can use the labels common to all three diagrams to verify that they agree with each other. In particular, those three diagrams all have three output qubits labelled 1 through 3, eight ancillae qubits labelled a through h, and four "stabilizer qubits" labelled by the stabilizer measurement they correspond to.
To make it possible to see the internal topological structure of the 3D topological diagrams, we have chosen to significantly exaggerate the amount of space between events. We draw operations as if they had linear $O(d)$ separation (where $d$ is the code distance), but on actual hardware the operations have a constant $O(1)$ separation. This exaggeration of separation does not change the topology, so interpreting the figures as if they were to scale will still produce the correct computation. But it is important to account for the distortion when computing the footprint or depth of the computation. Figure 4 shows a comparison between the old to-scale diagram style and our new exaggerated-spacing diagram style.
We will sometimes refer to multi-qubit stabilizers using a concatenated-subscript notation such as $Z_{123}$. Each subscript refers to a separate qubit, i.e. $Z_{123} = Z_1 Z_2 Z_3$.
We will often refer to $|T\rangle$ states as having a particular "level", e.g. "a level 1 $|T\rangle$ state" or equivalently "a $|T_1\rangle$ state". The level refers to the number of distillation steps used to produce the state. We will also refer to factories by the level of their output. For example, our starting point is level 0 $|T\rangle$ states produced using the post-selected state injection of Li [24]. These $|T_0\rangle$ states are then distilled by the level 1 T factory from [15] into $|T_1\rangle$ states, which we can then feed into our $|CCZ\rangle$ factory.
Lastly, we wish to point out the useful supplementary materials included with this paper. First, because it is significantly easier to understand 3D diagrams when one is able to move the camera, the supplementary materials include SketchUp files storing the models shown in the 3D topological diagrams. Second, the supplementary materials include a spreadsheet (file name "calculator-CCZ-2T-resources.ods") that can compute the overhead of computations that use our factories. Interested readers can estimate the running time and number of physical qubits required by their algorithms by entering into the spreadsheet the number of T and Toffoli gates performed by the algorithm, how many qubits the algorithm uses, and an error budget. Figure 3 shows a screenshot of the spreadsheet.
2 Lattice surgery construction of the $8|T\rangle \xrightarrow{28\epsilon^2} |CCZ\rangle$ factory
A key technique introduced in [15] is a single-layer stabilizer measurement involving an arbitrary number of qubits. We use this technique in order to quickly measure the 4 stabilizers of the error-detecting Toffoli distillation protocol [21,12]. See Figure 5 for a circuit diagram of the CCZ-distillation process. The operations in the circuit are chosen in a way that trivially translates into lattice surgery. In Figure 6 we show time slices of one possible translation of the circuit into lattice surgery (with matching qubit labels and operation labels), and then in Figure 7 show the time slices of our CCZ factory (corresponding to two interleaved translations of the circuit). We also provide an annotated 3D topological diagram of the CCZ factory (see Figure 8).
Our $|CCZ\rangle$ factory has a naive depth of 4 (stabilizer measurements) + 1.5 (T state injections) + 1 (X or Y basis measurement, depending on T injection measurements) + 2 (detect errors) = 8.5. We use the same technique as in [15] to partially overlap executions of the factory, resulting in an effective depth of 5.5. The T state injections take $1.5d$ layers because they are performed at half code distance and it takes $0.5d$ layers to move the black side into position for a parity measurement, then $0.5d$ layers to perform the parity measurement, then $0.5d$ layers to return the black side to its original position. It is acceptable to inject at half code distance because the incoming T states have an error rate larger than the topological error incurred from an injection at this distance.
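The depth bookkeeping above can be checked with a few lines of arithmetic. The following is a minimal sketch, in units of $d$ surface code cycles; the dictionary keys and variable names are labels chosen here for illustration, and the overlap figure is simply the difference between the two depths quoted in the text.

```python
from fractions import Fraction

# Depth contributions quoted in the text, in units of d surface code cycles.
steps = {
    "stabilizer measurements": Fraction(4),
    "T state injections": Fraction(3, 2),
    "X or Y basis measurement": Fraction(1),
    "detect errors": Fraction(2),
}

naive_depth = sum(steps.values())

# Effective depth quoted in the text after partially overlapping executions.
effective_depth = Fraction(11, 2)

# Amount of depth hidden by the overlap (derived here, not quoted in the text).
overlap = naive_depth - effective_depth

# The injection is three moves of 0.5d each: position, parity measurement, return.
injection = 3 * Fraction(1, 2)

print(float(naive_depth), float(effective_depth), float(overlap), float(injection))
```

Exact fractions are used so that the half-integer depths compare without floating point rounding.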
Figure 5: [...] The box with blue circles in the top right is a state display from the online simulator Quirk, with each circle representing an amplitude (the radius of the colored circle indicates the amplitude's magnitude, and the angle of the line rooted at the center of the circle indicates the phase). The state display is showing that the output state is a $|CCZ\rangle$ state. The small circled pluses in the circuit are X-axis controls (equivalent to a normal control surrounded by Hadamard gates); whenever one of these controls directly precedes a measurement the measurement corresponds to a Pauli product measurement. The post-selection operation represents the classical control software determining if an error was detected; if it fails the output must be discarded. Pauli operations and classically-controlled Pauli operations appear here, but not in Figure 8, because they are performed entirely within classical control software. The circuit can be opened in Quirk by following this link. Discerning readers can follow the link and edit the circuit in order to confirm that adding a single Z error by any T gate is caught by the post-selection, and also that all possible pairs of Z errors escape detection.
Figure 6: Time slices of lattice surgery activity during production of a single $|CCZ\rangle$ state. Each red square corresponds to a qubit, and the label inside the red square identifies the qubit from Figure 5 that the square corresponds to. Gray rectangles correspond to X stabilizer measurements between sets of qubits. The red arrows labelled "T" correspond to a noisy T state entering the system. It is possible to double the throughput shown here by interleaving the production of two states (shown in Figure 7).
Figure 7: [...] $\rightarrow |CCZ\rangle$ factory. To maximize utilization, two states are produced concurrently. Each red or blue square corresponds to a qubit, and the label inside the square identifies the qubit from Figure 5 that the square corresponds to. Gray rectangles correspond to X stabilizer measurements between sets of qubits. The red arrows labelled "T" correspond to a noisy T state entering the system. Blue squares correspond to qubits involved in producing one of the states, and red squares correspond to qubits involved in producing the other state. The red squares in each step are exactly identical to the red squares shown in the matching step of Figure 6. See Figure 8 for a 3D topological diagram corresponding to the time slices.
Figure 8: [...] The green boxes atop the columns are performing either an X or Y basis measurement at half code distance, as described in [15], by including or omitting an S gate performed using twists [6]. Using half code distance is acceptable because, at the location in the factory where these operations are performed (i.e. after the T injections), individual errors are detected as distillation failures. The labels along the right hand side indicate the stabilizer measurements occurring at each time. The red/blue coloring of labels matches the red/blue coloring of Figure 7. Note that inserting the $|T_1\rangle$ state has a depth of 1.5, unlike the other steps which have depth 1. Each horizontal bar linking several vertical poles is a stabilizer measurement of a product of logical X observables. The groups of three qubits highlighted orange and exiting left are the $|CCZ\rangle$ states being output (note that the middle pole of each $|CCZ\rangle$ state is rotated with respect to the others, with white on top instead of black on top). The two instances of the factory that are shown differ slightly. Their qubits have been permuted so that each factory's top layer fits into a void at the bottom of the following factory, saving a layer of depth.
Figure 9: A substitution procedure (left) for translating our 3D topological diagrams into (nearly) the ZX calculus [10], as well as an example translation of one of the CCZ factories from Figure 8 (right). We use black (white) nodes instead of green (red) nodes (the usual notation for the ZX calculus) so that the node colors match the boundary colors in the 3D topological diagrams. Pieces with two ports are translated into edges or degree-2 nodes of either color. Pieces with three or more ports are translated into a node of matching color. The $Z \otimes Z$ measurement of a qubit vs a $|T\rangle$ state followed by measuring the qubit in the X or Y basis depending on the outcome of the parity measurement is translated into a (non-standard) red node. The red node can be expanded into a proper ZX calculus construction, but we do not attempt to do so. The ZX calculus graph is more amenable to verification than the 3D diagram. The reverse translation, from ZX calculus graph to 3D topological diagram, is often more difficult.
Our $|CCZ\rangle$ factory produces magic states fast enough that algorithms will bottleneck on routing instead of magic state production unless special care is taken. For example, suppose there are several Toffoli operations to perform on qubits all placed in a common area; a common area with exactly one entrance capable of allowing exactly one qubit to enter or leave every $d$ cycles. Because a new $|CCZ\rangle$ state is produced every $5.5d$ cycles, and each such state involves three qubits, the entrance will be occupied for $3d$ out of every $5.5d$ cycles moving magic state qubits into the common area to meet target qubits (or vice versa). This leaves only $2.5d$ cycles for other work requiring the entrance. Furthermore, the $|CCZ\rangle$ teleportation process requires classically controlled CZ and CNOT operations. If these operations also block the entrance, and are not done in a way that minimizes depth, they will use up the remaining $2.5d$ cycles and cause a routing bottleneck.
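The entrance-occupancy argument in the routing example can be written out as a small calculation. This is a sketch under the stated scenario only (one entrance, one qubit per $d$ cycles); the function name `entrance_budget` and its parameters are hypothetical names introduced here.

```python
from fractions import Fraction

def entrance_budget(period, qubits_per_state, transit_per_qubit=Fraction(1)):
    """Return (busy, free, utilization) for a single shared entrance.

    All times are in units of d surface code cycles. `period` is the time
    between magic states; each state moves `qubits_per_state` qubits through
    the entrance, one per `transit_per_qubit`.
    """
    busy = qubits_per_state * transit_per_qubit
    free = period - busy
    return busy, free, busy / period

# One |CCZ> state (three qubits) every 5.5d cycles, as in the text.
busy, free, utilization = entrance_budget(Fraction(11, 2), 3)
print(float(busy), float(free), float(utilization))
```

With these inputs the entrance is busy for slightly more than half of each period, which is why any additional blocking work at the entrance quickly becomes the bottleneck.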
We see three ways for algorithms to avoid bottlenecking on routing and keep up with our $|CCZ\rangle$ factory:
1. Increase the amount of space dedicated to routing. Play it safe; do not have areas with narrow entrances or hallways that can only accommodate one qubit per $d$ cycles. This strategy is simple and effective, but costly.
2. Carefully distribute logical qubits across multiple disjoint areas with the goal of ensuring that Toffolis rarely target multiple qubits in the same area. This avoids the bottleneck by having the magic state qubits pass through multiple different entrances, instead of one common entrance. This strategy will not work for all algorithms, but it will work for some algorithms.
3. Use generalized CCZ operations capable of targeting arbitrary stabilizers instead of individual qubits, and move Clifford work into the classical control system. The generalized CCZ is performed in the same way that [25] performs generalized T gates targeting arbitrary stabilizers. The gate teleportation process is modified, replacing each $Z_t \otimes Z_m$ parity measurement between a target qubit $t$ and the magic state qubit $m$ with a many-body stabilizer measurement $P \otimes Z_m$, where $P$ is a vector of Pauli operations possibly involving every logical data qubit in the computation. The main drawback of this approach is that there is a 2x space overhead associated with ensuring it is always fast to access the X, Y, and Z observable of every qubit. This can likely be avoided by interleaving single-qubit work between the Toffoli operations, but requires careful algorithm-by-algorithm consideration.
Note that our CCZ factory's footprint includes an unused 2x4 area, adjacent to where the $|CCZ\rangle$ state exits the factory (see Figure 1). This area can be used to hold target qubits waiting for a Toffoli operation, which helps with the routing overhead. Our overhead spreadsheet assumes this space will be used in this manner.
In order to produce a $|CCZ\rangle$ state every $5.5d$ cycles, we need enough level 1 T factories to create 8 $|T\rangle$ states every $5.5d$ cycles. The half-code-distance level 1 T factory from [15] produces a $|T\rangle$ state every $3.25d$ cycles, except when distillation errors are detected. Assuming a physical gate error rate of $10^{-3}$ and a level 1 code distance of 15, distillation errors will be detected approximately 3% of the time (the $|T_0\rangle$ states have $\sim 10^{-3}$ error when injected and gain $\sim 10^{-3}$ error while the level 0 T gates are performed at distance 7; there are fifteen of them, and the most likely case is that a single one fails: $2 \cdot 10^{-3} \cdot 15 = 3\%$). These failures reduce the effective output rate to a $|T\rangle$ state every $3.35d$ cycles, so five of these factories will produce $\sim 8.2$ $|T\rangle$ states every $5.5d$ cycles, which is sufficient to keep up with the $|CCZ\rangle$ factory. We accumulate a buffer of surplus level 1 $|T\rangle$ states in the small hallways between the $|CCZ\rangle$ factory and the level 1 $|T\rangle$ factories so that a single level 1 T factory failure does not delay the entire $|CCZ\rangle$ factory. As shown in Figure 1, the five level 1 factories are placed to either side of the $|CCZ\rangle$ factory. Note that it is occasionally necessary to route the fifth factory's output to the opposite side, and that there is enough contiguous unused volume in the factory to do this when needed.
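The supply-versus-demand reasoning in this paragraph can be reproduced numerically. The sketch below assumes a simple retry model in which a detected failure costs one full factory period; that model, and the names `effective_period`, `failure_rate`, and `supply`, are assumptions made here for illustration rather than something the text specifies.

```python
def effective_period(base_period, failure_rate):
    """Mean time between good outputs when a fraction of attempts is discarded.

    Assumes each attempt takes `base_period` and succeeds independently with
    probability (1 - failure_rate).
    """
    return base_period / (1 - failure_rate)

# Fifteen level 0 T gates, each with roughly 2e-3 error (injection plus gate).
failure_rate = 2e-3 * 15

# Level 1 T factory: one state every 3.25d cycles before accounting for failures.
period = effective_period(3.25, failure_rate)

# States delivered by five such factories during one 5.5d CCZ factory period.
supply = 5 * 5.5 / period
demand = 8

print(round(failure_rate, 3), round(period, 2), round(supply, 1), supply >= demand)
```

Under this model four factories would deliver fewer than eight states per period, so five is the smallest count that keeps up.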
We compute the error rate of the $|CCZ\rangle$ states being produced by our factory in two different regimes: the large code distance regime where the factory is distillation limited, and the minimal code distance regime where the factory may be limited by topological errors in the surface code. We assume a physical gate error rate of $10^{-3}$ in both cases, and assume that the post-selected state injection of Li [24] creates $|T_0\rangle$ states with approximately this probability of error. In the distillation limited regime, we run these states through the $15|T\rangle \xrightarrow{35\epsilon^3} |T\rangle$ factory and then through our $8|T\rangle \xrightarrow{28\epsilon^2} |CCZ\rangle$ factory, producing intermediate $|T_1\rangle$ states with error rate $\sim 3.5 \cdot 10^{-8}$ and then $|CCZ\rangle$ states with error rate $\sim 3.4 \cdot 10^{-14}$. In the minimal code distance regime, we must account for topological error introduced while performing T gates and the Clifford operations making up the factory. For example, we assume that the error rate of the $|T_0\rangle$ states doubles while performing a level 0 T gate at distance 7. This increases the effective error of the $|T_1\rangle$ states, but this contribution is overshadowed by the large size and proportionally small code distance of the level 1 T factory operating on these states. The factory adds approximately $10^{-6}$ error to the output error, which is three to four times more than the distillation error. We sum the two error rates, resulting in an estimated error rate for $|T_1\rangle$ states of $\sim 1.4 \cdot 10^{-6}$. This is forty times more error than in the distillation limited case. The CCZ factory has a code distance large enough that we are distillation limited, and the error rate of the final $|CCZ\rangle$ states is correspondingly $\sim 5.3 \cdot 10^{-11}$.
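The two regimes can be followed with the leading-order suppression formulas from the factory notation. This is a rough sketch: the function names are chosen here, only leading-order terms are kept, and the minimal code distance inputs are the rounded values quoted in the text, so the results match the quoted $\sim 1.4 \cdot 10^{-6}$ and $\sim 5.3 \cdot 10^{-11}$ only approximately.

```python
def t_factory_error(eps):
    """Leading-order output error of the 15|T> -> |T> factory."""
    return 35 * eps ** 3

def ccz_factory_error(eps):
    """Leading-order output error of the 8|T> -> |CCZ> factory."""
    return 28 * eps ** 2

physical_error = 1e-3

# Distillation limited regime: no topological error added.
t1_limited = t_factory_error(physical_error)
ccz_limited = ccz_factory_error(t1_limited)

# Minimal code distance regime: level 0 error doubles, and the level 1
# factory adds roughly 1e-6 of topological error (rounded value from the text).
t1_distillation = t_factory_error(2 * physical_error)
t1_topological = 1e-6
t1_minimal = t1_distillation + t1_topological
ccz_minimal = ccz_factory_error(t1_minimal)

print(f"{t1_limited:.2e} {ccz_limited:.2e}")
print(f"{t1_distillation:.2e} {t1_minimal:.2e} {ccz_minimal:.2e}")
```

The ratio `t1_topological / t1_distillation` falls between three and four, in line with the statement that topological error dominates the level 1 output in the minimal code distance regime.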
As shown in Figure 3, the minimal distance factory causes errors in more than 50% of runs when attempting to factor a 1024 bit number, but can comfortably run classically intractable chemistry algorithms. However, if one increases the level 1 code distance from 15 to 19 (increasing the footprint of the factory by roughly 20%), then the level 1 error improves so much that it's possible to factor 4096 bit numbers.

The $|T\rangle$-catalyzed $|CCZ\rangle \rightarrow 2|T\rangle$ factory
In [21], it is shown how to perform a Toffoli gate by using Clifford operations, measurement, and four T gates. That circuit can be rewritten into an inline circuit that transforms three $|+\rangle$ states into a $|CCZ\rangle$ state via Clifford operations and four T gates [17]. Then, by diagonalizing that circuit's stabilizer table, said circuit can be rewritten into a form where three of the T gates apply directly to an input $|+\rangle$ state. Those three T gates can then be replaced by three $|T\rangle$ state inputs, resulting in a circuit that maps $|T\rangle^{\otimes 3}$ to $|CCZ\rangle$ using Clifford gates and one T gate. This circuit contains no measurement, and therefore can be inverted. The inverse circuit (shown in Figure 10) maps a $|CCZ\rangle$ state to three $|T\rangle$ states using Clifford gates and one T gate.
Because $|T\rangle$ states can be used to perform T gates, the T gate used to transform the $|CCZ\rangle$ into three $|T\rangle$ states can be powered by a $|T\rangle$ state output from a previous iteration of the circuit. If we keep feeding a $|T\rangle$ state output from iteration $k$ into iteration $k + 1$, then we effectively have a circuit that takes a $|CCZ\rangle$ state and outputs two $|T\rangle$ states. Under this interpretation of the circuit, the third $|T\rangle$ state is an ancillary state that is necessary for the transformation to be possible, but is not consumed by the transformation. Thus, in keeping with terminology for ancillary states that enable LOCC communication tasks without being consumed [20], and previous work [7], we refer to the third $|T\rangle$ as a catalyst. We refer to the circuit as a whole as the $|T\rangle$-catalyzed $|CCZ\rangle \rightarrow 2|T\rangle$ factory, or "C2T factory" for short.
Beware that, although the catalyst $|T\rangle$ state is not consumed by the C2T factory, it does accumulate noise from the incoming $|CCZ\rangle$ states. If a catalyst $|T\rangle$ has cycled through $n$ iterations of the C2T factory, and there is a probability $\epsilon$ of each $|CCZ\rangle$ containing an error, then there is a $\Theta(n\epsilon)$ chance that the catalyst has been poisoned and is causing the factory to produce bad outputs. However, because every error in the catalyst ultimately traces back to an error in a $|CCZ\rangle$ state, the chance of there being any error grows like $\Theta(n\epsilon)$, instead of $\Theta(n^2\epsilon)$ as would be expected from a naive calculation assuming uncorrelated errors.
Distillation protocols usually require inputs with uncorrelated errors, so it is important that we only use the C2T factory as the last step in a distillation chain. In a sense, because of how we use the C2T factory, the correlation between errors is beneficial to us instead of detrimental. It means that when we run an algorithm many times there will be a small number of runs with many errors, instead of many runs with a small number of errors. We experience quadratically fewer whole-algorithm failures than would be expected from the fact that the expected number of errors is growing like $\Theta(n^2\epsilon)$. For other examples of correlation between errors being beneficial, we recommend reviewing hat guessing games [28].
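The difference between the two scalings can be illustrated with a toy model. The sketch below assumes each $|CCZ\rangle$ state fails independently with probability $\epsilon$ and, as a pessimistic simplification made here, that once the catalyst is poisoned every later iteration is bad; the function names and the sample values of $n$ and $\epsilon$ are illustrative only.

```python
import math

def any_error_probability(n, eps):
    """Chance that at least one of n |CCZ> inputs carried an error."""
    # Computed as 1 - (1 - eps)^n in a numerically stable way.
    return -math.expm1(n * math.log1p(-eps))

def expected_bad_iterations(n, eps):
    """Expected number of bad iterations if a poisoned catalyst never recovers."""
    # Iteration k is bad when any of the first k inputs carried an error.
    return sum(any_error_probability(k, eps) for k in range(1, n + 1))

n, eps = 1000, 1e-9
p_any = any_error_probability(n, eps)
bad = expected_bad_iterations(n, eps)

# p_any tracks n * eps, while the expected error count tracks n^2 * eps / 2.
print(p_any, n * eps)
print(bad, eps * n * (n + 1) / 2)
```

In this model the whole-algorithm failure probability is `p_any`, which is far smaller than the expected number of bad iterations would suggest if the errors were spread independently across runs.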
The C2T factory circuit shown in Figure 10 is compact, but not in an ideal form for embedding into lattice surgery. Figure 11 fixes this by providing an equivalent circuit that, although it appears much more complicated, trivially translates into lattice surgery. We show the result of this translation in Figure 12, which has time slices of the lattice surgery operations occurring as the factory operates. And finally Figure 13 shows an annotated 3D topological diagram of the process.
Figure 10: A circuit that transforms a $|CCZ\rangle$ state into three $|T\rangle$ states by applying Clifford operations and a single T gate. By using one of the outputs to fuel the next iteration, the circuit can be re-interpreted as a circuit that turns one $|CCZ\rangle$ into two $|T\rangle$ states when catalyzed by one $|T\rangle$ state. The boxes with blue circles are state displays from the online simulator Quirk, with each circle representing an amplitude (the radius of the colored circle indicates the amplitude's magnitude, and the angle of the line rooted at the center of the circle indicates the phase). The state displays are showing that the input state is a $|CCZ\rangle$ and the output states are $|T\rangle$ states. The small circled pluses are X-axis controls (equivalent to a normal control surrounded by Hadamard gates). The circuit can be opened in Quirk by following this link.
Figure 11: A circuit for catalyzed $|T\rangle$ state production, specialized for lattice surgery. Given a $|CCZ\rangle$ state (first three qubits) and a $|T\rangle$ state (fourth qubit), produces three $|T\rangle$ states. Red areas correspond to a product-of-Paulis measurement. The blue area happens entirely within classical control software. The S ancilla is preparing an $|S\rangle$ state that can be used to correct the T gate teleportation used to perform the $Z^{-1/4}$ gate from Figure 10. The B ancilla is being used to perform the $X^{-1/2}$ gate from Figure 10. The A ancilla is being used to perform the multi-target CNOT from Figure 10. The boxes with blue circles, at the beginning and end of the circuit, are state displays from the online simulator Quirk. Each circle represents an amplitude (the radius of the colored circle indicates the amplitude's magnitude, and the angle of the line rooted at the center of the circle indicates the phase). The state displays are showing that the input and output states are $|CCZ\rangle$ and $|T\rangle$ states as described. The small circled pluses in the circuit are X-axis controls (equivalent to a normal control surrounded by Hadamard gates); whenever one of these controls directly precedes a measurement the measurement corresponds to a Pauli product measurement. The circuit can be opened in Quirk by following this link.
Figure 12: Time slices of lattice surgery activity during transformation of a $|CCZ\rangle$ state (orange qubits labelled 1, 2, 3) into three $|T\rangle$ states (shown in red in last slice), catalyzed by a $|T\rangle$ state (bottom right qubit in red). Black and dark gray bars correspond to stabilizer measurements. Ancillae qubits are shown in blue. The code distance of the ancillae qubits is doubled when single-qubit Clifford operations are being applied, to ensure there is sufficient suppression of errors. The light gray "(CCZ)" box to the left will be used by the CCZ factory producing $|CCZ\rangle$ states to be transformed. See Figure 13 for a 3D topological diagram corresponding to the time slices. Every step being performed can be matched up with a step from Figure 11, and the qubit labels shown here correspond to the qubit labels there.
Figure 13: 3D topological diagram of a lattice surgery circuit transforming a $|CCZ\rangle$ state (orange-tipped inputs at bottom) and a $|T\rangle$ state (bottom right red-tipped input) into three $|T\rangle$ states (red-tipped outputs at top). We conservatively assume that the green boxes are large enough to perform any single-qubit Clifford with negligible error. See Figure 8 for details about how to interpret the diagram.

Arbitrary-Angle Phase Catalysis
The catalysis technique used in the C2T factory from the previous section generalizes to phasing angles other than the T gate's $45^\circ$. In Figure 14, we show a generalization of Figure 10 that works for an arbitrary angle $\theta$. This circuit performs two $Z^{\theta}$ operations by performing cheap stabilizer operations, performing one Toffoli gate, performing one $Z^{2\theta}$ operation, and being catalyzed by one $Z^{\theta}|+\rangle$ state. Contrast with gate teleportation [18], which consumes a previously prepared $Z^{\theta}|+\rangle$ state in order to perform one $Z^{\theta}$ operation, with a 50% chance of requiring a fixup $Z^{-2\theta}$ operation.
One way to discover the generalized phase catalysis circuit is to start from the phase-gradient-via-addition circuit [22,17,27], which performs a series of rotations $Z$, $S$, $T$, $\sqrt{T}$, $\sqrt[4]{T}$, etc. by adding a register containing the target qubits into a phase gradient catalyst state. Include a carry bit input in the addition of the phase-gradient-via-addition circuit, truncate the circuit after the first ripple-carry step by using the correct fixup operation, and the result is a phase catalysis circuit for an angle $\theta = \pi/2^k$ which trivially generalizes to arbitrary angles. The catalysis circuit can likely also be derived from synthillation parity-check circuits [9], which use similar magic states and have a similar structure but are used to perform distillation of existing states instead of producing additional states.
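The addition-based view suggests a quick numerical check of the catalysis identity. The sketch below is not the circuit of Figure 14, which is not reproduced in the text; it is an illustration built here on the full-adder identity $a + b + c = (a \oplus b \oplus c) + 2\,\mathrm{maj}(a, b, c)$. It assumes the convention $Z^{\theta} = \mathrm{diag}(1, e^{i\theta})$ with $\theta$ in radians, and it applies the $Z^{2\theta}$ phase directly to the majority bit instead of computing that bit with a Toffoli gate and uncomputing it afterwards. All function names are hypothetical.

```python
import cmath
import math
import random

def catalyzed_double_phase(targets, theta):
    """Phase two target qubits by theta using a Z^theta|+> catalyst.

    `targets` maps (a, b) basis labels to amplitudes. The result maps
    (a, b, c) labels to amplitudes, where c is the catalyst qubit.
    """
    s = 1 / math.sqrt(2)
    out = {}
    for (a, b), amp in targets.items():
        for c in (0, 1):
            # Catalyst amplitude: (|0> + e^{i theta}|1>) / sqrt(2).
            joint = amp * s * cmath.exp(1j * theta * c)
            # Two CNOTs XOR both targets into the catalyst.
            c2 = c ^ a ^ b
            # Majority of (a, b, c2) is the carry of a + b + c2; obtaining
            # it is the step that would cost one Toffoli gate.
            carry = (a & b) | (a & c2) | (b & c2)
            # One Z^{2 theta} rotation applied to the carry bit.
            out[(a, b, c2)] = joint * cmath.exp(2j * theta * carry)
    return out

def max_deviation(theta, seed=0):
    """Largest amplitude difference from the ideal catalyzed outcome."""
    rng = random.Random(seed)
    targets = {(a, b): complex(rng.gauss(0, 1), rng.gauss(0, 1))
               for a in (0, 1) for b in (0, 1)}
    got = catalyzed_double_phase(targets, theta)
    s = 1 / math.sqrt(2)
    worst = 0.0
    for (a, b), amp in targets.items():
        for c in (0, 1):
            # Ideal result: both targets phased by theta, catalyst unchanged.
            want = amp * cmath.exp(1j * theta * (a + b)) * s * cmath.exp(1j * theta * c)
            worst = max(worst, abs(got[(a, b, c)] - want))
    return worst

for theta in (math.pi / 4, math.pi / 8, 0.3):
    print(theta, max_deviation(theta))
```

The deviations stay at the level of floating point rounding for every angle tried, which is consistent with the claim that the construction works for arbitrary $\theta$ and returns the catalyst unchanged.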
Specializing the generalized phase catalysis circuit to $\theta = 22.5^\circ$, i.e. to the $\sqrt{T}$ gate, produces the circuit shown in Figure 15. This specialized circuit creates two $|\sqrt{T}\rangle$ states by performing one Toffoli operation and one T gate. This is significantly more efficient than previous techniques we were able to find and adapt to the task of producing $|\sqrt{T}\rangle$ states [23,2,26,22,17,27], assuming a physical gate error rate of $10^{-3}$ and a target error rate of $10^{-10}$. For example, according to figure 5 of [2], repeat-until-success circuits use $\approx 45$ T gates to approximate a $\sqrt{T}$ gate to within precision $\epsilon = 10^{-10}$. As another example, according to table III of [26], direct synthesis of $|\sqrt{T}\rangle$ states uses $\approx 25$ times more volume than direct synthesis of $|T\rangle$ states (though this ratio improves as the physical gate error rate improves). A final example: the phase-gradient-via-addition operation described in [22,17] can perform a $\sqrt{T}$ gate with a 4-bit adder (which requires 3 $|CCZ\rangle$ states). Phase-gradient-via-addition is the closest to competing with phase catalysis, which is perhaps not surprising since phase catalysis is an optimized form of this technique. Other techniques appear to be very far behind, requiring an order of magnitude more spacetime volume.

5 Lattice surgery construction of the $8|T\rangle \xrightarrow{28\epsilon^2} 2|T\rangle$ factory
We now combine the $|CCZ\rangle$ factory from Section 2 with the C2T factory from Section 3, producing a $|T\rangle$-catalyzed T factory that transforms eight noisy $|T\rangle$ states into two $|T\rangle$ states with quadratically less noise. Note that this means we achieve a 4:1 ratio of input $|T\rangle$ states to output $|T\rangle$ states, which is competitive with the 3:1 ratio of block codes [4]. This is surprising, because normally one has to work with a larger number of $|T\rangle$ states in order to achieve good ratios. Note that we do not use exactly the same CCZ factory as in Section 2. We re-order the stabilizer measurements and place the output qubits in a different location, so that it fits into the C2T factory from Section 3. Furthermore, we no longer interleave the factory with itself. There would be no point: running at the rate achieved by interleaving would require five $|T_1\rangle$ factories, but we now have only four (recall Figure 1).
The details of the combined factory are covered in Figure 17, which shows the parallel operation of the C2T factory and the CCZ factory. Recall that the qubit labels can be matched up with Figure 5 for verification that the correct stabilizers are being measured (though in a different order). Our penultimate figure, Figure 18, shows a 3D topological diagram of the factory. Note that the figure omits the level 1 T factories feeding in noisy $|T_1\rangle$ states, and exaggerates the spacing between qubits in order to make internal structures visible, but is otherwise complete.
To bootstrap the factory, an initial catalyst $|T\rangle$ state is made "the hard way", using some less efficient $|T\rangle$ factory that can output $|T\rangle$ states with error no higher than the error rate of the $|CCZ\rangle$ factory. Bootstrapping occurs once at the start of the computation, and any time the catalyst $|T\rangle$ state is lost. Specifically, note that the $|CCZ\rangle$ state produced by the CCZ part of the factory is consumed before it is known whether it contained a distillation error. Therefore, when a detected distillation error does occur, the $|T\rangle$ state catalyst must be discarded. This has a negligible effect on the effective depth of the factory, because it occurs so rarely (approximately once per hundred thousand distillations). There is a space towards the top right of the factory where a spare $|T\rangle$ catalyst could be placed, to be used as a backup when the main catalyst is lost.
Figure 14: Generalized phase catalysis circuit. Given a $Z^\theta|+\rangle$ catalyst, two $Z^\theta$ operations can be applied via stabilizer gates, one AND computation gate (notation from [17]), and one $Z^{2\theta}$ gate.
Figure 15: Using a catalyst $|\sqrt{T}\rangle$ state to create 2 additional $|\sqrt{T}\rangle$ states using cheap stabilizer operations, one T gate, and one AND computation. Has a T-cost of 5 [21,17], implying the T-cost of a $|\sqrt{T}\rangle$ state is at most 2.5.
The primary bottleneck on the output of this factory is the rate at which $|T_1\rangle$ states are produced. As shown in Figure 1, we assume there are four $|T_1\rangle$ factories present (one beside each pair of qubits requiring $|T_1\rangle$ states). When functioning perfectly, each of these factories produces a pair of noisy $|T_1\rangle$ states every 6.5d cycles, which is just enough to feed the catalyzed T factory and keep it producing a pair of $|T_2\rangle$ states every 6.5d cycles. Of course, the $|T_1\rangle$ factories do not always function perfectly. They discard their output roughly 3% of the time due to detecting an error (computed in Section 2). In order to actually achieve a depth of 6.5d for the $8|T\rangle \xrightarrow{28\epsilon^2} 2|T\rangle$ factory, it is necessary to increase the $|T_1\rangle$ factory output rate by more than 3% to compensate. There are many ways to achieve such a small gain, and Figure 16 sketches one way to do so. Therefore the $|T_1\rangle$ factories can keep up with the catalyzed T factory producing a pair of $|T_2\rangle$ states every 6.5d cycles.
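The compensation argument can be checked with a few lines of arithmetic. The Python sketch below uses only the roughly 3% discard probability quoted above and the depth improvement from 13d/2 to 12.5d/2 quoted in the Figure 16 caption; the variable names are illustrative, and the discard probability is treated as exactly 3% for simplicity.

```python
# Numbers quoted in the text and in the Figure 16 caption.
discard_probability = 0.03   # level 1 T factory output discarded ~3% of the time
old_depth = 13.0             # original level 1 factory depth, in units of d/2
new_depth = 12.5             # rearranged level 1 factory depth, in units of d/2

# Rate gain needed so that the surviving outputs still arrive once per 6.5d.
required_rate_gain = 1 / (1 - discard_probability)       # about 1.031

# Rate gain obtained by shortening the factory.
achieved_rate_gain = old_depth / new_depth               # 1.04

# Surviving output rate relative to what the catalyzed T factory consumes.
net_supply_ratio = achieved_rate_gain * (1 - discard_probability)  # about 1.009

print(f"required gain:    {required_rate_gain:.4f}")
print(f"achieved gain:    {achieved_rate_gain:.4f}")
print(f"net supply ratio: {net_supply_ratio:.4f}")
```

Under these assumptions the net supply ratio is slightly above 1, which agrees with the claim that four level 1 factories are sufficient, though with little margin.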

Conclusions
In this paper we presented two factories: a $|CCZ\rangle$ factory and a catalyzed $|T\rangle$ factory. We compiled these factories all the way down to 3D topological diagrams (see Figure 19) and gave detailed estimates of their spacetime volume, footprint, and error rates. We also showed how to generalize the phase catalysis technique used by our $|T\rangle$ factory to apply to arbitrary angles, including the particularly efficient angle of $\theta = 22.5^\circ$. Finally, we slightly improved the output rate of the level 1 T factories from [15], gave a simple procedure for transforming topological diagrams into ZX calculus graphs, provided a resource estimator spreadsheet, and gave working simulator links for verifying most of our circuit constructions.
Because it takes four $|T\rangle$ states to perform a Toffoli gate, but only one $|CCZ\rangle$ state to do the same, algorithms dominated by applying Toffolis, such as Shor's algorithm and the chemistry algorithm in [1], run five times as fast when using our $|CCZ\rangle$ factory instead of the $|T\rangle$ factory from [15]. However, we caution that it is often necessary to rework these algorithms' circuits to account for the much faster Toffoli rate. Assuming that such a reworking is possible for [1], the runtimes at classically intractable sizes would be reduced from $\sim 10$ hours (see table VII of [1]) to $\sim 2$ hours. For algorithms dominated by performing T gates, our catalyzed T factory provides a more modest 2× speedup.
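The quoted speedups follow from the factory output periods. The Python sketch below assumes a single factory of each kind, ignores routing and differences in footprint, and uses the periods stated in this paper: one $|T\rangle$ every 6.5d cycles for the factory from [15], one $|CCZ\rangle$ every 5.5d cycles for our CCZ factory, and a pair of $|T\rangle$ states every 6.5d cycles for the catalyzed T factory. The variable names are illustrative.

```python
# Output periods in units of d surface code cycles.
t_period_baseline = 6.5      # one |T> per 6.5d from the T factory of [15]
ccz_period = 5.5             # one |CCZ> per 5.5d from the CCZ factory
t_pair_period = 6.5          # two |T> per 6.5d from the catalyzed T factory
t_states_per_toffoli = 4     # T states consumed per Toffoli gate

# Toffoli-dominated algorithms: time per Toffoli with each factory.
toffoli_period_baseline = t_states_per_toffoli * t_period_baseline  # 26d
toffoli_speedup = toffoli_period_baseline / ccz_period              # about 4.7

# T-dominated algorithms: time per |T> with each factory.
t_speedup = t_period_baseline / (t_pair_period / 2)                 # 2.0

# Effect on the ~10 hour runtime quoted from table VII of [1].
reworked_runtime_hours = 10 / toffoli_speedup                       # about 2.1

print(f"Toffoli speedup: {toffoli_speedup:.2f}x")
print(f"T gate speedup:  {t_speedup:.2f}x")
print(f"runtime:         {reworked_runtime_hours:.1f} hours")
```

Under these assumptions the Toffoli speedup is about 4.7, which rounds to the factor of five quoted above.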
We believe it is possible to further decrease the volume of our factories. For example, we suspect that the level 1 T state injection at the end of each factory can be partially merged with that factory's final stabilizer measurement. If that is true, then the depth of the factories could be reduced by 1d. However, the effect of this optimization on the topological error rate is difficult to predict and we will use simulation to check the optimization's correctness before claiming it.
Another possible optimization is to eagerly route $|CCZ\rangle$ qubits emerging from the CCZ factory to their final destination (in preparation for a parity measurement), instead of holding them next to the factory until they are verified. Removing the output-holding area reduces the CCZ factory's footprint by over 20%, which is a large gain, but it is important to keep in mind that this is not a true reduction in volume but rather a reclassification of some of the factory's volume as routing volume.
Yet another possible optimization would be to carefully analyze how topological errors within the surface code propagate through the factory. At any location where an error chain between two boundaries would result in a detected failure, the boundaries can be moved closer together.
A final idea that should be investigated is estimating logical error probabilities from the observed pattern of detection events produced by the surface code's stabilizer measurements. For example, if there were a sudden burst of detection events crossing between two boundaries during the execution of a factory, the factory's output could be cautiously discarded even if the logical measurement results indicate there is no problem. Assuming there is some metric, derivable from the raw detection events, that reliably correlates with the true failure probability, this would allow us to reduce the number of false negatives (where an undetected error escapes the factory) at the cost of increasing the number of false positives (where a run with no error is discarded).
In this paper we focused on making a low-volume factory in the single-factory regime, but it is also important to consider factories optimized to have a tiny footprint. Early quantum computers will have limited space; it's worth sacrificing depth if it means the factory actually fits on the machine. By combining techniques from this paper and low-footprint distillation techniques mentioned in [25], it should be possible to create factories covering fewer qubits but with roughly the same volume as ours.
Another interesting avenue to explore is the high-footprint / multi-factory regime, where factories based on block codes become possible. Block factories should be able to outperform the efficiency of our factory, assuming enough states are being distilled in parallel. But this raises the question of whether block factories can also be improved by catalysis; are there catalyzed block factories? We don't know the answer to this question.
We suspect that the space of quantum circuits contains many other gems akin to the catalyzed phasing circuit. We consider finding these circuits to be important, because they can be surprisingly efficient at their tasks. It would be particularly useful to have a general framework for finding catalyzed circuits, to better understand what makes them efficient, and to understand the connection with related constructions such as distillation via parity-checks [9], synthillation [8], and phase gradient kickbacks [22,17,27].
Figure 16: 3D topological diagram of a re-arrangement of the level 1 T factory from [15]. Quantities are quoted in units of d/2 instead of d because the factory is performed at half code distance. Improves the depth from 13d/2 to 12.5d/2, increasing the output rate by roughly 4%, which ensures four level 1 T factories are sufficient to feed our T-catalyzed factory. There are two variants of the factory: the one shown with an output on the left (A) and the one shown with an output on the right (B). The qubits of A and B have been permuted so that their first stabilizer measurement involves qubits that are all on the same side, allowing the stabilizer measurement to be performed without using a central bar. The second measurement of B (the first measurement using the central bar) is over the back 8 qubits and the last measurement of A is over the front 8 qubits. This allows B to be lowered by half of d/2, so that B rests on the level 0 $|T\rangle$ injections to the right of A. The transition back from B to A cannot be lowered quite as far, because the top of B would intersect the first central bar used in A. Overall this optimization saves 0.5d/2 depth relative to the interleaving technique used in [15].
[...] Figure 17, and the circuit in Figure 5 combined with the circuit in Figure 10. Single-qubit Clifford gates that would affect the catalyst $|T\rangle$ if they failed are performed with extremely conservative code distance (large green boxes).
Our guess as to the nature of the connection between these constructions is that there are a small number of circuit identities underlying all these related but different techniques, and that each technique is rewriting and interpreting the underlying circuit identities in a different way. If the connection is actually of this form, then perhaps it is possible to write code that takes a circuit using one of these techniques, derives the identity the circuit is using, and then produces a whole related family of interesting circuits (perhaps including circuits that use catalysis).