Optimization of the surface code design for Majorana-based qubits

The surface code is a prominent topological error-correcting code exhibiting high fault-tolerance accuracy thresholds. Conventional schemes for error correction with the surface code place qubits on a planar grid and assume native CNOT gates between the data qubits and nearest-neighbor ancilla qubits. Here, we present surface code error-correction schemes using only Pauli measurements on single qubits and on pairs of nearest-neighbor qubits. In particular, we provide several qubit layouts that offer favorable trade-offs between qubit overhead, circuit depth and connectivity degree. We also develop minimized measurement sequences for syndrome extraction, enabling reduced logical error rates and improved fault-tolerance thresholds. Our work applies to topologically protected qubits realized with Majorana zero modes and to similar systems in which multi-qubit Pauli measurements rather than CNOT gates are the native operations.

I. INTRODUCTION
Fault tolerance is widely believed to be necessary to run viable applications on a quantum computer. Errors occurring during the computation must be corrected at regular intervals and faster than they accumulate. The design of a fault-tolerant quantum computer is constrained by the limitations of quantum hardware. For instance, at present it remains extremely challenging to produce a large number of high-quality qubits. Moreover, quantum chips generally offer only limited qubit connectivity, often restricted to nearest-neighbor interactions.
Given these constraints, the surface code [1-3] has proven to be one of the leading candidates for error correction in a quantum computer. Two crucial properties make the surface code very attractive for a first generation of fault-tolerant quantum computers: (i) error correction with the surface code can be implemented on a planar grid of qubits using only single-qubit operations and nearest-neighbor gates, and (ii) the surface code tolerates qubits and elementary operations affected by relatively high error rates [4,5]. These properties have been established for qubits equipped with CNOT gates, e.g., superconducting qubits; however, it is unclear whether similar results hold for other types of qubits.
In this article, we consider the performance of the surface code for measurement-based qubits. These qubits do not possess a native CNOT; instead, they are equipped with single-qubit and two-qubit Pauli measurements. These two sets of operations, based on CNOT gates or on Pauli measurements, are equivalent in the sense that they can simulate each other in polynomial time. In particular, the surface code can be implemented with measurement-based qubits up to a polynomial overhead. However, for practical purposes, a polynomial overhead can have dramatic consequences. A naive translation of the CNOT-based implementation of surface code error correction into a measurement-based circuit leads to a blow-up of the qubit overhead. Five times as many ancilla qubits are required, since each CNOT gate, when implemented as a sequence of measurements, requires an extra ancilla qubit. More qubits also incur more potential fault locations, which results in a significant reduction of the surface code performance and may negate property (ii).
In this work, we propose implementations of surface code error correction with measurement-based qubits that retain both of the positive properties (i) and (ii) described above, and that (iii) consume the same number of ancilla qubits as the CNOT-based implementation. Property (iii) is valuable, since the extra ancillas required for emulating CNOT gates are one of the main potential drawbacks of measurement-based qubits. Our implementations rely on two main ingredients. First, we design planar qubit layouts for the surface code in which the ancilla qubits can be recycled, via only local measurements, both for emulating the CNOT gates and for revealing the syndrome bits. Second, we optimize the decomposition of the CNOT-based circuit into measurements by reducing the circuit depth, enabling a shorter error-correction cycle. Because this reduces the number of locations at which faults can occur in each error-correction cycle, it also leads to a lower logical error rate.
We numerically simulate the error-correction schemes, which combine the optimized layouts and syndrome-extraction circuits, using the Union-Find decoder [6,7]. Under a circuit-level error model in which each location experiences depolarizing noise, we observe empirical error-rate thresholds as high as 2.37 × 10⁻³.
One potential application of our measurement-based surface code designs is to quantum computers built from Majorana zero modes [8], where physical qubits are encoded into an even number of Majoranas with fixed parity. Both the storage and the manipulation of Majorana-based qubits are topologically protected, i.e., robust to local perturbations. In particular, reliable measurements of qubit Pauli operators, which can be realized by gathering the relevant constituent Majoranas and measuring their joint parities, are amenable to our schemes. Thanks to the topological protection, one should be able to manufacture high-quality Majorana qubits with error rates far below the thresholds required to implement our schemes. Let us remark that there have been studies on implementations of the Bacon-Shor code [9], Majorana surface codes [10] and Majorana color codes [11], where plaquette operators of weight four or six are measured directly without the usual need for ancilla qubits. Here, we instead allow the use of ancillas and restrict to Pauli measurements of weight at most two, mainly to avoid harmful correlated noise or a reduction of the effective distance when considering circuit-level errors. It is also likely that achieving high-fidelity measurements of more than two qubits will prove much more difficult in practice.
Section II introduces notions about measurement-based qubits along with a simplified noise model. Section III presents a windmill-like qubit layout which has the same qubit overhead as the standard CNOT-based surface code error correction. In addition, Section III A gives two alternative layouts that feature favorable circuit depth and qubit connectivity, respectively. Section IV explains how to measure the weight-four plaquette operators using fewer time steps than the naive translation from CNOT gates. Section V presents the results of the numerical simulations and gives threshold estimates. We introduce a mapping of error distributions which significantly speeds up the sampling of errors in simulation. The resulting error distribution, called the inclusive error model and explained in Appendix E, is equivalent to the conventional error model (e.g., depolarizing noise or bit-flip noise) in all important regimes and may be of independent interest.

II. MEASUREMENT-BASED QUBITS
We consider a set of qubits equipped with single-qubit measurements of the Pauli matrices X, Y and Z. The only available entangling operations are joint measurements, i.e., measurements of two-qubit Pauli operators acting on connected qubits. A T gate or another non-Clifford gate must be added to the gate set in order to achieve universality. The optimization of the production of non-Clifford operations is not considered in this work. We focus on the design of error-correction schemes that depend only on the Clifford part of the gate set, namely single-qubit measurements and joint measurements.
A graph, whose edges support joint measurements, describes the qubit connectivity. In order to make the chip design possible, it is often necessary to restrict ourselves to low connectivity (low degree) and to graphs that can be embedded in a plane with a small number of crossing edges. Planar connectivity graphs are optimal in that regard.
In addition to single-qubit and joint measurements, measurement-based qubits are equipped with an additional operation that we call a Pauli update, which can be implemented in the classical control device without any physical action on the qubits. In general, each measurement M_P is followed by a Pauli update U_Q, which applies the Pauli operation Q to the system if and only if the outcome of the measurement of the Pauli operator P is non-trivial.
FIG. 1. A CNOT gate with control qubit 1 and target qubit 3 using qubit 2 as an ancilla. The CNOT gate is implemented as a sequence of two single-qubit measurements and two joint measurements. Each measurement M_P (represented by a Pauli in blue squares, with joint measurements connected by vertical lines) is followed by a Pauli update U_Q (represented by a Pauli in rounded pink squares, connected to the measurement by a thick horizontal line). The update U_Q applies the Pauli operation Q if the outcome of the preceding measurement is non-trivial.
Sequences of Pauli measurements and Pauli updates can generate arbitrary Clifford circuits. A CNOT gate implementation for measurement-based qubits is given in Fig. 1. An ancilla qubit is necessary in order to implement this two-qubit gate.
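To illustrate how Pauli measurements and Pauli updates can reproduce a CNOT, the NumPy sketch below simulates one standard two-single-qubit, two-joint-measurement sequence: prepare the ancilla with M_X, then measure Z₁Z₂, X₂X₃ and finally Z₂. This particular sequence and its update Paulis are our assumption and may differ in order or bases from Fig. 1; the helper names (`pauli`, `run_cnot_gadget`, `check_all_branches`) are hypothetical. For every combination of measurement outcomes, the state left on qubits 1 and 3 is compared with CNOT applied to the input.

```python
import itertools

import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
# CNOT with the first qubit as control, basis order |00>, |01>, |10>, |11>.
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)


def pauli(ops):
    """Three-qubit operator; ops maps index (0=control, 1=ancilla, 2=target) -> 2x2 matrix."""
    out = np.eye(1, dtype=complex)
    for q in range(3):
        out = np.kron(out, ops.get(q, I2))
    return out


# (measured Pauli P, update Pauli Q applied iff the outcome is non-trivial).
# Assumed sequence; Fig. 1 may order the steps or choose bases differently.
SEQUENCE = [
    (pauli({1: X}), pauli({1: Z})),        # M_X on ancilla, U_Z: ancilla ends in |+>
    (pauli({0: Z, 1: Z}), pauli({1: X})),  # joint M_ZZ on control and ancilla
    (pauli({1: X, 2: X}), pauli({0: Z})),  # joint M_XX on ancilla and target
    (pauli({1: Z}), pauli({2: X})),        # M_Z on ancilla
]


def run_cnot_gadget(psi_ct, outcomes):
    """Follow one outcome branch; return (branch probability, control-target state)."""
    # Ancilla starts in |0>; state-vector qubit order is (control, ancilla, target).
    ancilla = np.array([1, 0], dtype=complex)
    state = np.einsum("ct,a->cat", psi_ct.reshape(2, 2), ancilla).reshape(8)
    prob = 1.0
    for (P, Q), s in zip(SEQUENCE, outcomes):
        state = (np.eye(8) + (-1) ** s * P) @ state / 2  # project onto eigenvalue (-1)**s
        norm2 = np.vdot(state, state).real
        prob *= norm2  # conditional probability of this outcome
        if norm2 < 1e-12:
            return 0.0, None
        state /= np.sqrt(norm2)
        if s == 1:
            state = Q @ state  # Pauli update
    # The final Z measurement leaves the ancilla in |s>; keep the other two qubits.
    return prob, state.reshape(2, 2, 2)[:, outcomes[-1], :].reshape(4)


def check_all_branches(psi_ct):
    """Return (probability, |<CNOT psi|output>|) for every outcome branch."""
    expected = CNOT @ psi_ct
    results = []
    for outcomes in itertools.product((0, 1), repeat=len(SEQUENCE)):
        prob, out = run_cnot_gadget(psi_ct, outcomes)
        overlap = abs(np.vdot(expected, out)) if out is not None else 0.0
        results.append((prob, overlap))
    return results


rng = np.random.default_rng(seed=1)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)
for prob, overlap in check_all_branches(psi):
    print(f"branch probability {prob:.4f}, overlap with CNOT|psi> {overlap:.6f}")
```

Each of the 16 branches occurs with probability 1/16 and yields CNOT|ψ⟩ up to a global phase, so the Pauli updates alone, with no physical correction, complete the gate.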

A. Noise model
We assume a circuit-level noise model where each elementary operation in the error-correction circuit is afflicted by a fault, chosen according to some distribution, from a certain finite set specified as follows.
• Single-qubit identity gate faults: {I, X, Y, Z}.
FIG. 2. A distance-five surface code encoding 1 logical qubit into 25 physical qubits. Black nodes represent physical qubits and each colored region corresponds to the measurement of a syndrome bit. Green and yellow plaquettes support X-type and Z-type measurements, respectively. With CNOT-based qubits, a measurement is implemented locally inside a plaquette using one ancilla qubit connected to the plaquette qubits by CNOT gates.
Here, "flip" or "no flip" indicates whether or not the measurement outcome is incorrectly flipped, that is, whether or not an erroneous Pauli update is introduced. This is essentially equivalent to a stochastic Pauli error model. In all the numerical simulations that we have performed (explained in Section V), each elementary operation is faulty independently with the same probability p. When faulty, an operation is affected by a fault chosen uniformly at random from the set of all possible non-trivial faults. Although our noise model resembles the depolarizing error model typically assumed for CNOT-based circuits, it is difficult to establish a meaningful quantitative comparison between the two models, and thus between the corresponding error-correction schemes.
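The sampling rule above (each location faulty independently with probability p, and a faulty location drawing uniformly among its non-trivial faults) can be sketched in a few lines of Python. Only the identity-gate fault set is taken from the text; the measurement fault set below, pairing a Pauli error with a "flip"/"no flip" flag, is a hypothetical placeholder, and the function and location names are illustrative.

```python
import random

PAULIS = ("I", "X", "Y", "Z")

# Fault set given in the text for a single-qubit identity gate.
IDENTITY_FAULTS = PAULIS
IDENTITY_TRIVIAL = "I"

# Hypothetical fault set for a single-qubit measurement: a Pauli error paired with
# an outcome flag. The exact measurement fault sets used in the paper may differ.
MEASUREMENT_FAULTS = tuple((p, flag) for p in PAULIS for flag in ("no flip", "flip"))
MEASUREMENT_TRIVIAL = ("I", "no flip")


def sample_fault(faults, trivial, p, rng):
    """With probability p, return a non-trivial fault chosen uniformly; else the trivial one."""
    if rng.random() < p:
        return rng.choice([f for f in faults if f != trivial])
    return trivial


rng = random.Random(2024)
circuit = ["identity", "measurement", "identity"]  # hypothetical sequence of locations
fault_sets = {
    "identity": (IDENTITY_FAULTS, IDENTITY_TRIVIAL),
    "measurement": (MEASUREMENT_FAULTS, MEASUREMENT_TRIVIAL),
}
# Every location is faulty independently with the same probability p.
faults = [sample_fault(*fault_sets[loc], p=1e-3, rng=rng) for loc in circuit]
print(faults)
```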

III. SURFACE CODE LAYOUT
The surface code [1-3] encodes one logical qubit into a grid of d × d physical qubits, as shown in Fig. 2. Additional qubits are consumed by the implementation of the correction scheme. Error correction is based on measurements of plaquette operators of the form X_q1 X_q2 X_q3 X_q4, acting on the four qubits of green plaquettes, or Z_q1 Z_q2 Z_q3 Z_q4, acting on yellow plaquettes. Side plaquettes involve only two qubits. A round of stabilizer measurements produces an outcome bit for each plaquette, the so-called syndrome bits. Based on the syndrome, the decoder estimates the errors that have occurred. A number of efficient decoding algorithms have been proposed for the surface code [2]. In this work, we use the Union-Find decoder for its speed [6,7].
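A syndrome bit is the parity of the overlap between a plaquette and the support of an error of the opposite type (Z-type plaquettes detect X errors, and vice versa). The Python sketch below builds the plaquettes of a distance-d rotated surface code under an assumed orientation (weight-two X plaquettes on the top and bottom edges, weight-two Z plaquettes on the left and right edges), which may be mirrored or rotated relative to Fig. 2; the function names are hypothetical.

```python
from itertools import product


def rotated_surface_code(d):
    """Plaquettes of a distance-d rotated surface code (d odd), as sets of data qubits.

    Data qubits sit at (r, c) with 0 <= r, c < d. A plaquette at corner (r, c),
    0 <= r, c <= d, touches whichever of (r-1, c-1), (r-1, c), (r, c-1), (r, c) exist.
    """
    x_plaquettes, z_plaquettes = [], []
    for r, c in product(range(d + 1), repeat=2):
        support = frozenset(
            (r + dr, c + dc)
            for dr, dc in product((-1, 0), repeat=2)
            if 0 <= r + dr < d and 0 <= c + dc < d
        )
        is_x = (r + c) % 2 == 0
        bulk = 0 < r < d and 0 < c < d
        top_or_bottom = r in (0, d) and 0 < c < d  # weight-two X plaquettes
        left_or_right = c in (0, d) and 0 < r < d  # weight-two Z plaquettes
        if bulk or (is_x and top_or_bottom) or (not is_x and left_or_right):
            (x_plaquettes if is_x else z_plaquettes).append(support)
    return x_plaquettes, z_plaquettes


def syndrome(error_support, plaquettes):
    """One bit per plaquette: parity of its overlap with the support of the error."""
    return [len(plaq & error_support) % 2 for plaq in plaquettes]


x_plaq, z_plaq = rotated_surface_code(3)
print(len(x_plaq) + len(z_plaq))   # 8 syndrome bits, i.e., d*d - 1
print(syndrome({(1, 1)}, z_plaq))  # X error on the central qubit: [0, 1, 1, 0]
```

A single X error in the bulk flips exactly the two neighboring Z-type syndrome bits, which is the pattern the decoder works from.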
Qubits equipped with native CNOT gates consume exactly one ancilla qubit per syndrome bit extracted during a round of syndrome measurement. Figure 2 shows the locations of ancilla qubits and the connectivity of the CNOT gates used inside a plaquette. Overall, a square grid connectivity is enough to implement the surface code with CNOT-based qubits. Figure 3 compares the CNOT connectivity graph with the connectivity required for measurement-based qubits. A naive solution is to simulate each CNOT by a sequence of measurements. This costs one extra ancilla qubit per CNOT as we can see in Fig. 3(b). The number of ancillas jumps by a factor of five for large minimum distance. The implementation of the smallest surface code (with distance three) would require 33 qubits instead of 17 qubits.
The extra ancilla required for simulating a CNOT with measurement-based qubits cannot be omitted, but it can be shared between multiple CNOT gates in the same plaquette and between neighboring plaquettes. This leads to the windmill layout described in Fig. 4. A plaquette is measured using two ancillas: one that stores the measurement outcome and a second one that supports the CNOT gates between the first ancilla and the four plaquette qubits. This layout is particularly advantageous when the first priority is to minimize the qubit overhead, that is, the number of physical qubits per logical qubit. This priority entails two slight differences from the CNOT-based syndrome-extraction circuit. First, the chip must allow for degree-five ancilla connectivity. This remains reasonable, perhaps at the price of a small increase in the measurement error rate, and it is encouraging that the data qubits remain degree-four. Second, since two connected ancillas are used together for a single plaquette measurement (see Fig. 4(d)), green and yellow plaquettes cannot be measured simultaneously. A complete syndrome measurement round must therefore be implemented in two consecutive stages, one for each type of stabilizer.
FIG. 4. A single ancilla per plaquette is sufficient. Ancilla qubits have degree-five connectivity and data qubits remain degree-four. To implement a plaquette measurement, we need a pair of ancillas: the ancilla in the center of the plaquette and its neighboring ancilla (linked by a blue edge). A complete round of measurements is done in two steps, measuring all green plaquettes together and then all yellow plaquettes. Panel (d) shows the connectivity used during the measurement of a green plaquette; the neighboring yellow ancilla is required.

A. Alternative layouts for syndrome extraction
We describe alternative layouts for syndrome extraction that may be useful in different regimes.
The windmill layout is designed to minimize the number of ancilla qubits. In the regime where it is easy to fabricate a large number of qubits, one may consider the double ancilla layout represented in Fig. 5(a), which uses two ancilla qubits per plaquette. This reduces the time required for a complete error-correction cycle by a factor of two, but it costs twice as many ancillas as the windmill layout.
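The total qubit counts of these layouts follow from the number of plaquettes, d² − 1 for a distance-d surface code with d² data qubits. The Python sketch below assumes one ancilla per plaquette for the CNOT-based scheme and the windmill layout, and two ancillas per plaquette, including the weight-two boundary plaquettes, for the double ancilla layout; the function name is hypothetical.

```python
def qubit_counts(d):
    """Total physical qubits (data + ancilla) per logical qubit for a distance-d code."""
    data = d * d
    plaquettes = d * d - 1  # (d-1)^2 weight-four plus 2(d-1) weight-two plaquettes
    return {
        "cnot_based": data + plaquettes,          # one ancilla per syndrome bit
        "windmill": data + plaquettes,            # one ancilla per plaquette, shared for CNOTs
        "double_ancilla": data + 2 * plaquettes,  # assumed: two ancillas per plaquette
    }


for d in (3, 5, 7):
    print(d, qubit_counts(d))
```

For d = 3 this gives 17 qubits for both the CNOT-based scheme and the windmill layout, and 25 qubits for the double ancilla layout under the stated assumption.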
In the alternative layout depicted in Fig. 5(b), every physical qubit is connected to at most four neighboring qubits. This layout is useful when a qubit cannot be connected to five or more other qubits. It uses approximately 2.5 ancillas per plaquette of the surface code.

IV. OPTIMIZATION OF THE SYNDROME-EXTRACTION CIRCUIT
Here we consider how to implement the weight-four X^⊗4 and Z^⊗4 stabilizer measurements required for the surface code using measurement-based qubits. The weight-two stabilizer measurements at the boundary of the lattice are implemented similarly. This is a special case of a more general scheme for measuring arbitrary Pauli operators that we present in Appendix A.
First let us review the approach already known for CNOT-based (rather than measurement-based) qubits [3]. In that case, it is standard to include a single extra ancilla qubit, which is entangled using CNOT gates with the four plaquette qubits that are to be measured jointly. To implement an X^⊗4 joint measurement with CNOT-based qubits, one prepares the ancilla in |+⟩, sequentially applies a CNOT controlled by the ancilla and targeted on each of the four qubits involved in the measurement, and finally measures the ancilla in the X basis. This circuit propagates errors in a non-fault-tolerant manner. For example, an XX fault on the third CNOT gate will propagate through the final CNOT and result in an X error on each of the last two qubits in the measurement. However, the circuit can be incorporated into a fault-tolerant error-correction protocol provided the errors resulting from single faults are sufficiently benign given the structure of the error-correction scheme. In the case of the surface code, one can choose the ordering of the CNOT gates so that the weight-two error just described (an example of a hook error [2]) is orthogonal to the minimum-weight logical operators, thereby behaving effectively as a weight-one error for the purposes of error correction [12].
Now let us turn to the case of measurement-based qubits. The simplest approach is to use precisely the same technique as is employed with CNOT-based qubits, but to decompose each CNOT in the circuit into measurements using the gadget shown in Fig. 1. This results in the circuit shown in Fig. 6(a). We remark that the circuit of Fig. 6(a) can be compressed by merging consecutive single-qubit X or Z measurements and changing the subsequent measurement bases accordingly. The compressed circuit behaves the same as that in Fig. 6(a) in the absence of faults; however, it has malignant hook errors which the uncompressed circuit does not have.
FIG. 7. Syndrome-extraction circuits for (a) the windmill layout and (b) the double ancilla layout. Bulk plaquettes are measured using gadgets as in Fig. 6(b), while boundary plaquettes are measured using gadgets as in Fig. 11(a). Qubits of the X-type (green) and Z-type (yellow) stabilizers are acted upon in Z and N orders, respectively. The two types of stabilizers are measured in separate stages in (a), and simultaneously in (b).
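The hook error in the CNOT-based X^⊗4 circuit can be traced with Pauli-frame propagation, using the conjugation rules of a CNOT: an X on the control spreads to the target, and a Z on the target spreads to the control. The Python sketch below (hypothetical function name) propagates a fault inserted after the k-th CNOT, with the ancilla as qubit 0 controlling data qubits 1 to 4 in that order, and reports the residual Pauli error on the data qubits.

```python
def propagate_fault(n_data, fault_after, x_fault=(), z_fault=()):
    """Propagate a Pauli fault through CNOT(0 -> 1), ..., CNOT(0 -> n_data).

    Qubit 0 is the ancilla. The fault (X on qubits x_fault, Z on qubits z_fault) is
    inserted right after CNOT number fault_after (0 means before the first CNOT).
    Returns the residual error on the data qubits as a string such as "IIXX".
    """
    x = [0] * (n_data + 1)
    z = [0] * (n_data + 1)

    def inject():
        for q in x_fault:
            x[q] ^= 1
        for q in z_fault:
            z[q] ^= 1

    if fault_after == 0:
        inject()
    for target in range(1, n_data + 1):
        x[target] ^= x[0]  # CNOT copies an X from the control onto the target
        z[0] ^= z[target]  # and a Z from the target onto the control
        if target == fault_after:
            inject()
    letters = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
    return "".join(letters[(x[q], z[q])] for q in range(1, n_data + 1))


# An XX fault on the third CNOT (ancilla and data qubit 3) spreads to qubit 4.
print(propagate_fault(4, fault_after=3, x_fault=(0, 3)))  # IIXX
# An X fault on the ancilla after the first CNOT gives IXXX, which equals X_1 times
# the stabilizer XXXX, i.e., an effectively weight-one error.
print(propagate_fault(4, fault_after=1, x_fault=(0,)))  # IXXX
```

The weight-two output IIXX is the hook error; whether it is harmful depends on how the last two qubits in the CNOT ordering sit relative to the minimum-weight logical operators.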
By searching over sequences of single-qubit and joint measurements, we have found small circuits that implement two- and three-target controlled-NOT gates CXX and CXXX (see Appendix A). Iteratively using these circuits as modular components, one can measure a Pauli operator of arbitrary weight n ≥ 2 using two ancillas and either n + 2 single-qubit measurements and 3n/2 joint measurements when n is even, or n + 4 single-qubit measurements and (3n + 1)/2 joint measurements when n is odd. Relevant to surface code error correction is the special case n = 4, depicted in Fig. 6(b) for measuring bulk plaquettes. This sequence is significantly shorter and involves fewer measurements than the naive circuit of Fig. 6(a) built from CNOT gadgets, and is therefore expected to perform better.
Note that in the gadget of Fig. 6(b), the two consecutive Z measurements on qubit 5 in the middle may seem redundant, but they are necessary to keep a single measurement error from propagating: the latter Z measurement on qubit 5 and the subsequent X correction ensure that qubit 5 is set to the state |0⟩, even if the former Z measurement was faulty. Furthermore, even though measuring the weight-two plaquettes on the boundaries is a native operation for our measurement-based qubits, they should be measured using the gadget of Fig. 11(a), rather than by direct joint measurements, to make hook errors benign.
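Why the second Z measurement helps can be checked with a tiny classical model, which is our own simplification: qubit 5 is in a computational basis state |b⟩, a Z measurement reports b (possibly flipped by a fault), and an X update is applied whenever the reported outcome is non-trivial. The function name is hypothetical.

```python
def measure_and_reset(b, flipped):
    """Z measurement on |b> (outcome possibly flipped) followed by an X update if it reads 1."""
    reported = b ^ flipped
    return b ^ reported  # the X update flips the qubit iff the reported outcome is 1


for b in (0, 1):
    once = measure_and_reset(b, flipped=1)      # faulty first measurement
    twice = measure_and_reset(once, flipped=0)  # fault-free second measurement
    print(f"b={b}: after faulty reset |{once}>, after second reset |{twice}>")
```

In this model a single reset leaves the qubit in |1⟩ whenever its measurement was flipped, whereas the extra fault-free Z measurement and X update always return it to |0⟩.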

V. NUMERICAL RESULTS
In this section, we present numerical results of Monte Carlo simulations of our measurement-based surface code error-correction schemes. The first in-depth numerical study of the surface code, using CNOT-based qubits, is by Dennis et al. [2]. Simulations of surface code performance and circuit-level optimizations have been carried out in [4,5,12], providing numerical estimates of the surface code threshold.

A. Methods
We estimate the logical error rate with a circuit-level simulation for two layouts: the windmill layout as in Fig. 4 and the double ancilla layout as in Fig. 5(a), whose syndrome-extraction circuits are fully specified by Fig. 7(a) and Fig. 7(b), respectively. Weight-four plaquettes in the bulk are measured using the gadgets in Fig. 6(b), with the joint measurements on data qubits scheduled in Z or N order. Weight-two plaquettes on the boundaries are measured using the gadgets in Fig. 11(a). For the windmill layout, X- and Z-type stabilizers are measured in separate stages, whereas for the double ancilla layout all stabilizers are measured simultaneously.
Given a specific syndrome-extraction circuit, we simulate the error process according to the model as in Section II A, and calculate corrections using the Union-Find decoder. Below we remark on the numerical methods of detecting logical errors.
In Ref. [5], the surface code of distance d was considered with time boundary conditions such that initially the physical qubits are in a code state without error, syndrome bits are extracted for T rounds by a noisy circuit, and then a final round of syndrome bits is obtained with a noiseless circuit. The value of T is increased until a logical error is observed. This setting makes it easy to detect any logical error introduced by errors and their correction, but we were unable to find an operational meaning for these time boundary conditions. In particular, the noiseless syndrome measurement at the end could result in an underestimation of the logical error rate because, in principle, a decoder may exploit the information from the last perfect syndrome measurement. A potential justification could be that ultimately (for instance, at the end of a quantum computation) each qubit in the surface code will be measured out individually, allowing for a more reliable syndrome readout than usual. However, when modeling the performance of the surface code far from this final readout, it seems important to ensure that the model is not sensitive to this artificial step.
We have observed that these time boundary conditions underestimate the logical error rate by a factor of ≈ 1.5, independent of the code distance, when the physical error rate is 10⁻³; see Appendix C. Based on this observation, we continue to use the same time boundary conditions as in [5] with d rounds of noisy syndrome measurement. That is, we start with a code state, perform d rounds of noisy syndrome measurements, and then finish with one round of noiseless syndrome measurement. The whole history is passed to the Union-Find decoder, and we define the storage error rate p_L to be the probability that this procedure results in a nontrivial logical operator.
We believe that in future work it is desirable to have more operationally meaningful estimations of the failure rate of logical operations.
In Appendix D, we argue that the Union-Find decoder succeeds in error correction as long as there are at most (d − 1)/2 faults in the syndrome-extraction circuits outlined above, thereby maintaining the effective distance of the surface code.
FIG. 9. Logical error rates p_L for distance d = 3, 5, 7, 9, 11, 13 in the low-p regime, using (a) the windmill layout as in Fig. 7(a) and (b) the double ancilla layout as in Fig. 7(b).
In Appendix E, we introduce a simulation technique based on the inclusive error model, which has been used throughout our simulations. Instead of randomly generating faults and tracing them through the circuit to determine a history of syndromes, we obtain the syndromes directly by sampling edges on a decoding graph. Our technique is exactly equivalent to the conventional simulation procedure of sampling circuit faults under the depolarizing noise model considered here, but it provides a substantial speedup.
Figure 8 plots the logical error rate p_L of our error-correction schemes for surface codes with odd distance d = 3, 5, …, 41 at relatively high physical error rate p. Each dot is obtained from 10⁶ trials of Monte Carlo simulation; error bars indicate 95% statistical confidence. We observe empirical thresholds p_th = 1.54 × 10⁻³ for the windmill layout and 2.37 × 10⁻³ for the double ancilla layout.
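The text does not state how the 95% error bars are computed. As one common choice, the Python sketch below (hypothetical function name) computes a Wilson score interval for a logical error rate estimated from k failures in N Monte Carlo trials; the counts in the example are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(failures, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion such as the logical error rate p_L."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = failures / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: 1,234 logical failures out of 10^6 trials.
low, high = wilson_interval(1234, 10**6)
print(f"p_L = {1234 / 10**6:.3e}, 95% interval [{low:.3e}, {high:.3e}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and remains sensible when very few logical failures are observed, which matters for small p_L.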

B. Results
Figures 9(a) and 9(b) plot p_L for surface codes with distance d = 3, 5, 7, 9, 11, 13 in the low-p regime. Each dot is obtained from 10⁶ trials of Monte Carlo simulation; diamonds are obtained from importance sampling, which we explain in detail in Appendix F. Following the heuristic in [3], for each d, we fit the data points in the relatively low error regime (p ≤ 10⁻⁴) to the model

$$p_L = c(d)\left(\frac{p}{p_{\mathrm{th}}}\right)^{(d+1)/2}, \tag{1}$$

where c(d) is a constant that only depends on d, and p_th = 1.54 × 10⁻³. Since our schemes can correct up to (d − 1)/2 faults, a logical failure requires at least (d + 1)/2 faults, so (1) should be a reasonable heuristic provided p is small. The fitting parameters c(d) are listed in Tab. 9(c), and the fitting curves are depicted as dashed lines in corresponding colors in Fig. 9(a) and Fig. 9(b).
Since the values of c(d) are comparable to each other, we continue by fitting all the low-p regime data with different d to a uniform heuristic

$$p_L = c\left(\frac{p}{p_{\mathrm{th}}}\right)^{(d+1)/2}, \tag{2}$$

where c and p_th are both fitted, independent of d.
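Fitting a power law p_L = c (p/p_th)^{(d+1)/2} reduces to linear least squares after taking logarithms: log p_L − k log p = log c − k log p_th with k = (d + 1)/2. The NumPy sketch below (hypothetical function name) fits c and p_th jointly and is checked on synthetic points generated from the model itself, not on the data of Fig. 9.

```python
import numpy as np


def fit_uniform_heuristic(d, p, p_logical):
    """Fit c and p_th in p_L = c * (p / p_th) ** ((d + 1) / 2) by least squares in log space."""
    d, p, p_logical = map(np.asarray, (d, p, p_logical))
    k = (d + 1) / 2
    # Unknowns (log c, log p_th); each data point gives log c - k * log p_th = y.
    design = np.column_stack([np.ones_like(k), -k])
    y = np.log(p_logical) - k * np.log(p)
    (log_c, log_p_th), *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.exp(log_c), np.exp(log_p_th)


# Synthetic check: generate noiseless points from known parameters and refit them.
true_c, true_p_th = 0.05, 2.0e-3
ds, ps = np.meshgrid([3, 5, 7, 9], [1e-5, 3e-5, 1e-4])
ds, ps = ds.ravel(), ps.ravel()
pls = true_c * (ps / true_p_th) ** ((ds + 1) / 2)
c_fit, p_th_fit = fit_uniform_heuristic(ds, ps, pls)
print(f"c = {c_fit:.4f}, p_th = {p_th_fit:.3e}")
```

Data from at least two distances are needed, since a single value of k cannot separate c from p_th.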
The fitting parameters c and p_th are given in the bottom row of Tab. 9(c), and the fitting curves are depicted by red lines in Fig. 9(a) and Fig. 9(b).
Figure 10(a) plots the pseudothresholds p_pseudo for d = 3, 5, …, 41 as dots, and fits them with solid curves using the heuristic

$$\log\left(p_{\mathrm{th}} - p_{\mathrm{pseudo}}\right) = \log a + b \log d, \tag{3}$$

where p_th = 1.54 × 10⁻³ or 2.37 × 10⁻³, and a, b are fitted, independent of d. We choose the heuristic (3) because we find that the relation between log(p_th − p_pseudo) and log d is close to linear. Observe that, in general, the scheme using the double ancilla layout has a lower logical error rate and a higher threshold and pseudothreshold than the scheme using the windmill layout. This better performance is consistent with the smaller space-time volume of the double ancilla layout compared with that of the windmill layout.
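Because log(p_th − p_pseudo) is close to linear in log d, the parameters a and b can be obtained from a straight-line fit in log-log space, assuming the relation p_pseudo = p_th − a·d^b. The sketch below (hypothetical function name) uses `numpy.polyfit` and is checked on synthetic values, not on the pseudothresholds of Fig. 10(a).

```python
import numpy as np


def fit_pseudothreshold(d, p_pseudo, p_th):
    """Fit a, b in log(p_th - p_pseudo) = log(a) + b * log(d) with a straight line."""
    d = np.asarray(d, dtype=float)
    p_pseudo = np.asarray(p_pseudo, dtype=float)
    b, log_a = np.polyfit(np.log(d), np.log(p_th - p_pseudo), deg=1)
    return np.exp(log_a), b


# Synthetic check using the double ancilla threshold and made-up a, b.
p_th = 2.37e-3
ds = np.arange(3, 42, 2)
true_a, true_b = 3.0e-3, -0.8
p_pseudo = p_th - true_a * ds**true_b
a_fit, b_fit = fit_pseudothreshold(ds, p_pseudo, p_th)
print(f"a = {a_fit:.3e}, b = {b_fit:.3f}")
```

`numpy.polyfit` with `deg=1` returns the slope first and the intercept second, hence the order of the unpacked values.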

VI. CONCLUSION
We have described several surface code error-correction schemes that are tailored for measurement-based qubits, i.e., hardware equipped with Pauli measurements on single qubits and pairs of nearest-neighbor qubits. Instead of directly translating from the canonical CNOT-based scheme, our schemes feature a hardware-efficient qubit layout and an optimized syndrome-extraction circuit, together giving rise to reasonable error thresholds. We have also designed alternative surface code layouts and measurement circuits for general Pauli operators, which might be of independent interest.
It remains to develop systematic methods for constructing more efficient measurement-based syndrome-extraction circuits. More work is needed to investigate the trade-offs among circuit depth, qubit connectivity and ancilla overhead.

ACKNOWLEDGMENTS
This paper is dedicated to David Poulin, a friend, a mentor, and an inspiration to the quantum computing community. R. C. thanks Microsoft Quantum for hospitality during his internship; Qian Yu for helpful discussions; NSF grant CCF-1254119, ARO grant W911NF-12-1-0541 and MURI Grant FA9550-18-1-0161 for partial support.
FIG. 11. (a) A circuit which implements a 2-target controlled-NOT gate, with qubit 4 as the control, qubits 1 and 2 as the targets, and qubit 3 as an ancilla. (b) The 2-target controlled-NOT gate can be bootstrapped to implement an arbitrary weight-n measurement, where n is even. The dangling thick line indicates that the given measurement outcome encodes the overall measurement outcome of X^⊗n.

Appendix A: Optimizing general Pauli measurement circuits
Here we consider optimizations of the circuit built from single-qubit and joint measurements to measure a general n-qubit Pauli operator P . These can be used to measure the stabilizer generators of any stabilizer code, including LDPC codes, surface codes, color codes etc., and also the gauge generators of any Pauli subsystem code.
Our goal is to minimize the number of ancilla qubits and measurements required. We assume here that any single-qubit Pauli measurement is possible on any qubit, and that any joint Pauli measurement is possible on any pair of qubits. First, we note that any circuit which measures a Pauli P is equivalent to a circuit that measures X^⊗n, since one can move between the two circuits using single-qubit Clifford operations. Therefore, we focus on measuring X^⊗n; this can be straightforwardly converted into a circuit for measuring any other weight-n Pauli operator with the same number of ancilla qubits, the same connectivity, and the same numbers of single-qubit and joint measurements (albeit in different measurement bases).
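The reduction to X^⊗n can be made concrete by choosing, for each data qubit j, a single-qubit Clifford C_j with C_j X C_j† = P_j; every measurement basis B on that qubit in the X^⊗n circuit is then replaced by C_j B C_j†, which is again a Pauli up to a sign that flips the interpretation of the outcome. The NumPy sketch below uses H for target Z and S for target Y; this choice of Cliffords and the function name are ours.

```python
import numpy as np

PAULI = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.diag([1, 1j])
# Assumed choice of Clifford C with C X C^dagger equal to the target Pauli.
CLIFFORD_FOR = {"X": np.eye(2, dtype=complex), "Y": S, "Z": H}


def translate_basis(basis, target):
    """Return (sign, new_basis) such that C basis C^dagger = sign * new_basis."""
    C = CLIFFORD_FOR[target]
    conjugated = C @ PAULI[basis] @ C.conj().T
    for name, mat in PAULI.items():
        for sign in (1, -1):
            if np.allclose(conjugated, sign * mat):
                return sign, name
    raise ValueError("conjugation did not yield a Pauli")


for target in "XYZ":
    print(target, {b: translate_basis(b, target) for b in "XYZ"})
```

A sign of −1 means the corresponding measurement outcome is read with its meaning inverted, so only the classical post-processing changes, not the number of measurements.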
Our general approach is to split the n relevant qubits into subsets of sizes m_i, with n = Σ_i m_i, then to prepare an ancilla in |+⟩, and sequentially apply m_i-target controlled-NOT gates from that ancilla to the subsets of m_i qubits, before finally measuring the ancilla in the X basis to read off the measurement outcome. We can then separately optimize the modular component of each m_i-target controlled-NOT gate. The trivial case is m_i = 1 for all i, where we break the measurement up into a sequence of n CNOT gates, each implemented as in Fig. 1. This would require 2 ancilla qubits and 4n + 2 measurements (2n + 2 single-qubit measurements and 2n joint measurements).
We now focus on m_i = 2, i.e., on optimizing the 2-target controlled-NOT gate CXX. By numerically searching over measurement-based circuits, we have found the circuit shown in Fig. 11(a). When n is even, we can use this approach to construct a circuit which measures X^⊗n using two ancilla qubits and 5n/2 + 2 measurements (n + 2 single-qubit measurements and 3n/2 joint measurements), as shown in Fig. 11(b). Also note that when n = 4 this recovers the circuit described in Fig. 6(b), which we use for the implementation of the surface code with measurement-based qubits.
Further reduction in the number of measurements is possible; for example, note that in Fig. 11(b) Z is measured at the end of C_1X_3X_4, and then again immediately afterward at the beginning of C_1X_5X_6. One of these can clearly be removed; however, it is worth noting that this removal affects how errors propagate within the circuit, and may result in a less robust measurement of X^⊗n with regard to faults.
FIG. 12. (a) A circuit which implements the X^⊗3 measurement on qubits 1, 2 and 3, using qubits 4 and 5 as ancillas. This is more efficient than using a 2-target controlled-NOT gate followed by a single CNOT gate. Note that the measurement outcome of X^⊗3 is obtained from the parity of the outcomes of a pair of measurements, indicated by a dangling junction of thick lines. (b) By adding this (slightly modified) circuit to the end of a sequence of 2-target controlled-NOT gates, we obtain a general scheme for measuring X^⊗n for odd n. Notice that the first X measurement on the bottom ancilla qubit is omitted in going from (a) to (b). Also note that the Z^⊗2 measurement on the two ancilla qubits 1 and 2 in the X_n X_{n+1} X_{n+2} block has a Pauli update which is supported on all but the top three data qubits.
Suppose now that n is odd. The most naive strategy is to use the circuit obtained from n CNOT gates, each implemented as in Fig. 1, which would require 2 ancilla qubits and 4n + 2 measurements (2n + 2 single-qubit measurements and 2n joint measurements). A better strategy is to use what we have found above to implement (n − 1)/2 2-target controlled-NOT CXX gates followed by a single CNOT gate. This would require 2 ancilla qubits and (5n + 9)/2 measurements (n + 4 single-qubit measurements and (3n + 1)/2 joint measurements). However, there is an even better way: consider the gadget shown in Fig. 12(a), which measures X^⊗3 directly. We can use a slightly modified version of this X^⊗3 measurement circuit in combination with the circuit in Fig. 11(a), applied (n − 3)/2 times for the remaining n − 3 qubits in the measurement; see Fig. 12(b). This requires 2 ancilla qubits and (5n − 3)/2 measurements (n single-qubit measurements and 3(n − 1)/2 joint measurements).
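The measurement counts stated in this appendix can be tabulated directly. The Python sketch below (hypothetical function name) encodes the formulas given above for the naive CNOT-based circuit, the CXX chain for even n, the CXX chain followed by one CNOT for odd n, and the construction built on the X^⊗3 gadget of Fig. 12, returning (single-qubit, joint) counts.

```python
def measurement_counts(n):
    """(single-qubit, joint) measurement counts for measuring X^n with two ancillas."""
    counts = {"naive CNOT": (2 * n + 2, 2 * n)}
    if n % 2 == 0:
        counts["CXX chain"] = (n + 2, 3 * n // 2)
    else:
        counts["CXX chain + CNOT"] = (n + 4, (3 * n + 1) // 2)
        counts["CXX chain + X^3 gadget"] = (n, 3 * (n - 1) // 2)
    return counts


for n in (3, 4, 5, 6):
    for name, (single, joint) in measurement_counts(n).items():
        print(f"n={n:2d} {name:24s} single={single:3d} joint={joint:3d} total={single + joint:3d}")
```

For the surface code case n = 4, this gives 12 measurements for the CXX-based circuit against 18 for the naive CNOT-based translation.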
We also include an efficient implementation of the swap circuit in Appendix B, which might be of independent interest.

Appendix B: Efficient measurement-based swap circuit
In Fig. 13, we show a swap circuit which uses one ancilla qubit and 5 measurements (2 single-qubit measurements and 3 joint measurements). The naive implementation is built from 3 CNOT gates as in Fig. 1 and requires 12 measurements (6 single-qubit measurements and 6 joint measurements).

FIG. 13. An optimized version of the swap circuit, which swaps qubits 1 and 2 using qubit 3 as an ancilla.

Appendix C: Time boundary conditions for logical error rate estimation
Here we continue the discussion of time boundary conditions begun in Section V A. The question is how to measure the logical error rate of the surface code with noisy circuits.
Ideally, we would measure the probability $p^{\text{ideal}}_{\text{logical}}$ that errors and the correction operator together form a nontrivial logical operator in a given unit time window. This assumes that the memory has existed, and will exist, for an indefinite period of time. The scenario is practically irrelevant but mathematically sound, and it poses a problem for numerics, because no decoding algorithm can take the infinite history of syndrome measurements as input. However, given a reasonable decoding algorithm, errors and the corresponding correction operator have exponentially decaying correlations in time [2]. It should therefore suffice to consider a finite time segment to measure $p^{\text{ideal}}_{\text{logical}}$. Let $p_{\text{logical}}(T)$ be the probability of a logical error in the setting of [5] when there are $T$ rounds of noisy syndrome measurement and one additional round of noiseless syndrome measurement. Errors and correction operators have short time correlations with the Union-Find decoder (and with the minimum-weight matching decoder). We may therefore expect $p_{\text{logical}}$ to be a reasonable proxy for $p^{\text{ideal}}_{\text{logical}}$ in our setting. We believe that, for any fixed physical error rate $p$ and code distance $d$, $p_{\text{logical}}(T)$ converges for large $T$ to a linear function of $T$.
Then $p^{\text{ideal}}_{\text{logical}}$ can be identified with the coefficient $\alpha$ times the unit memory time, which can be taken to be $d$. We have confirmed (C1) for $p = 10^{-3}$ (physical error rate) with the double ancilla layout as depicted in Fig. 5(a); see Fig. 14. We observe that $\alpha \approx 1.5$, independent of $d$, for $p = 10^{-3}$.

FIG. 14. Logical error rates within varying time windows for distance $d = 3, 5, 7, 9$ with physical error rate $p = 10^{-3}$, using the double ancilla layout as in Fig. 7(b). Each dot is obtained from $10^6$ trials of Monte Carlo simulation. Each trial starts with a code state with no error, extracts syndromes with noisy circuits for $T = 1, 2, \ldots, 20$ rounds, and ends with a noiseless measurement round. Rescaled dots with $0.6 \le T/d \le 1.5$ collapse to the black solid line, suggesting (C1) with $\alpha \approx 1.5$. See Appendix D and Appendix E for more simulation details.
All the logical error rates we report in this paper use this estimate of $p^{\text{ideal}}_{\text{logical}}$ as the storage error rate of our surface code.
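The Python sketch below shows one way the identification above could be carried out numerically. It assumes that the large-$T$ behaviour has the form $p_{\text{logical}}(T) \approx aT + b$ and fits this line over the window $0.6 \le T/d \le 1.5$ quoted in the caption of Fig. 14. The function names are hypothetical, and the example data are synthetic, not simulation results. The first return value, $a d$, is the fitted slope times the unit memory time $d$. The second is the slope of the rescaled curve $p_{\text{logical}}(T)/p_{\text{logical}}(d)$ plotted against $T/d$.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ≈ a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def storage_error_rate(p_of_T, d, lo=0.6, hi=1.5):
    """Estimate the logical error rate per d rounds from p_logical(T).

    p_of_T maps the number T of noisy rounds to the estimated p_logical(T);
    it must contain T = d. Only T with lo <= T/d <= hi enter the fit.
    """
    window = sorted(T for T in p_of_T if lo <= T / d <= hi)
    if len(window) < 2:
        raise ValueError("need at least two values of T in the fit window")
    a, _ = fit_line(window, [p_of_T[T] for T in window])
    return a * d, a * d / p_of_T[d]


# Synthetic, exactly linear data (illustration only).
data = {T: 0.001 * T + 0.002 for T in range(1, 21)}
print(storage_error_rate(data, d=5))
```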

Appendix D: Union-Find decoder
Here we briefly explain how to use the Union-Find decoder [6,7] to correct errors in the surface code given a fixed qubit layout and syndrome-extraction circuit.
As explained in Section V A and Appendix C, a single trial of the Monte Carlo simulation for a distance-$d$ surface code proceeds in three steps. We start with all qubits free of errors. We then repeat the syndrome extraction with faulty circuits for $d$ rounds. Finally, we perform one additional round with a noiseless circuit.
Among the popular surface code decoders, we choose the Union-Find decoder for its simplicity and speed. As is typical for CSS codes, X-type and Z-type errors on the surface code can be handled separately. For simplicity, we decode only the X stabilizer syndromes throughout the simulations. We evaluate the logical error rate $p_L$ as the probability of a logical Z error. The circuit used in the simulation still extracts both X and Z stabilizer syndromes.

FIG. 15. The decoding graph for the distance-five surface code using the windmill layout and the optimized syndrome-extraction circuit specified in Fig. 7(a). Time proceeds vertically upwards. The vertices within each of the six layers correspond to the changes between the X syndrome bits extracted in that round and those extracted in the previous round. All the dangling edges on the spatial boundaries are connected to the same vertex $b$ (not depicted). An edge joins two vertices whenever a single fault can flip the triviality of exactly that pair. The decoding graph for the circuit in Fig. 7(b) with the double ancilla layout is similar.
We use the simplest version of the Union-Find decoder [7], without the weighted-growth rule that grows small clusters first. We have not tried to improve the Union-Find decoder by exploiting the correlations between the two types of syndromes [13,14]. Nor have we tried the recent optimized versions of the Union-Find decoder [15,16]. These might lead to better performance at the price of a slightly more complex decoding algorithm.
A useful way to analyze the decoding algorithm is to view the space-time error-correction circuit as a three-dimensional decoding graph $G = (V, E)$; see Fig. 15. Its vertex set is
$$ V = \{b\} \sqcup V_1 \sqcup V_2 \sqcup \cdots \sqcup V_{d+1}, $$
where the $V_\tau$ are all identical to one another as sets. This reflects the fact that we repeatedly measure the same set of stabilizers with our noisy circuit. A syndrome bit measured in round $\tau = 1, 2, \ldots, d$ corresponds to a vertical edge that connects two vertices, one in $V_\tau$ and the other in $V_{\tau+1}$. A measurement error on that bit flips both endpoints. Suppose we are given $d + 1$ rounds of observed syndrome bits. A vertex $v \in V_\tau$, $\tau = 1, \ldots, d + 1$, is called nontrivial, and is assigned bit 1, if and only if the corresponding syndrome bit in round $\tau$ differs from that in round $\tau - 1$. (By definition, all syndrome bits in round 0 are zero.) In other words, a vertex of $V_\tau$ records a change in the syndrome bit. We assign bit 1 to the vertex $b$ if and only if the number of nontrivial vertices in $V \setminus \{b\}$ is odd.
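The rule for assigning bits to the vertices of $G$ amounts to taking differences between consecutive syndrome rounds. The Python sketch below assumes that the syndrome bits of each round are given as a list of 0/1 integers in a fixed stabilizer order. The function name `detection_events` is hypothetical.

```python
def detection_events(syndrome_rounds):
    """Vertex bits of the decoding graph from d + 1 rounds of syndrome bits.

    syndrome_rounds[tau - 1] lists the X syndrome bits measured in round tau.
    Returns (layers, b_bit): layers[tau - 1] holds the bits of V_tau, and
    b_bit is the bit assigned to the boundary vertex b.
    """
    previous = [0] * len(syndrome_rounds[0])  # round 0 is all zeros by definition
    layers = []
    for bits in syndrome_rounds:
        # a vertex is nontrivial iff its syndrome bit changed since the last round
        layers.append([cur ^ prev for cur, prev in zip(bits, previous)])
        previous = bits
    # b is nontrivial iff the number of nontrivial vertices in V \ {b} is odd
    b_bit = sum(sum(layer) for layer in layers) % 2
    return layers, b_bit


print(detection_events([[0, 1, 0], [0, 1, 1], [0, 0, 1]]))
```

By construction, the total number of nontrivial vertices, including $b$, is always even.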
Denote by $\mathcal{F}$ the union of all possible faults (see Section II A) that afflict individual elementary operations in our optimized space-time syndrome-extraction circuit (see Fig. 7). Our circuit has been designed so that any fault in $\mathcal{F}$ does one of two things: it causes only trivial syndrome bits, or it flips the triviality of exactly two vertices joined by an edge in $E$. (There are two decoding graphs: one for X syndromes, which is $G$, and one for Z syndromes. A single fault may flip more than two vertices in total, but within each decoding graph the number of flipped vertices is always either zero or two.) In particular, our circuit induces a surjection from $\mathcal{F}$ onto $E$. With a slight abuse of notation, we write this as a $\mathbb{Z}_2$-linear map $\phi : \mathbb{Z}_2^{\mathcal{F}} \to \mathbb{Z}_2^{E}$. The Union-Find decoder is fully specified by the decoding graph $G$, which is itself determined by the distance $d$ and the syndrome-extraction circuit.

In a trial of the simulation, the fault configuration is represented by a subset $F \subseteq \mathcal{F}$. The input to the Union-Find decoder is then the 0-boundary of the 1-chain $\phi(F)$, that is, the subset of nontrivial vertices in $V$. The decoder finds a subset $C \subseteq E$ whose 0-boundary coincides with the input, in time almost linear in $|V|$. We then project $C$ onto a 1-chain on the two-dimensional spatial plane. Each link of this 1-chain corresponds to a (weight-1 or weight-2) Pauli operator supported on the data qubits. The product of all these Pauli operators constitutes the final Pauli correction, which is used only by the classical control device via Pauli frame tracking.

The decoding succeeds if and only if $\phi(F) + C$ is homologically trivial, i.e., has even overlap with each side of the boundary. The shortest homologically nontrivial loop in $G$ has length $d$. It follows from [6,7] that $\phi(F) + C$ is trivial as long as $2|\phi(F)| < d$. Since $|\phi(F)| \le |F|$, decoding is guaranteed to succeed whenever the number of faults satisfies $|F| \le (d-1)/2$.
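The Python sketch below makes the decoding step concrete. It implements the unweighted Union-Find decoder on a generic decoding graph with unit edge weights: all odd clusters grow by half an edge per round, and the result is then peeled. It is our own illustration, not the implementation used for the simulations. We assume that vertices are integers and that edges are index pairs. The boundary vertex $b$ is treated as an ordinary vertex, which works because its bit makes the number of nontrivial vertices even. The names `union_find_decode` and `boundary` are hypothetical. The growth loop rescans cluster members in every round, so the sketch does not achieve the almost-linear running time of the actual algorithm.

```python
from collections import defaultdict, deque


def boundary(edges, chain):
    """0-boundary of a 1-chain: vertices incident to an odd number of its edges."""
    odd = set()
    for e in chain:
        for v in edges[e]:
            odd ^= {v}
    return odd


def union_find_decode(num_vertices, edges, defects):
    """Return a set C of edge indices whose 0-boundary equals `defects`."""
    incident = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        incident[u].append(i)
        incident[v].append(i)
    parent = list(range(num_vertices))
    parity = [1 if v in defects else 0 for v in range(num_vertices)]
    members = {v: [v] for v in range(num_vertices)}  # keyed by cluster root
    support = [0] * len(edges)  # grown half-edges per edge: 0, 1 or 2

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if len(members[ra]) < len(members[rb]):
            ra, rb = rb, ra  # union by size
        parent[rb] = ra
        parity[ra] ^= parity[rb]
        members[ra].extend(members.pop(rb))

    # Growth: every odd cluster grows by half an edge per round.
    while True:
        odd_roots = [r for r in members if parity[r] == 1]
        if not odd_roots:
            break
        fully_grown, progressed = [], False
        for r in odd_roots:
            frontier = {e for v in members[r] for e in incident[v] if support[e] < 2}
            for e in frontier:
                support[e] += 1
                progressed = True
                if support[e] == 2:
                    fully_grown.append(e)
        if not progressed:
            raise ValueError("an odd cluster cannot grow; defect parity is inconsistent")
        for e in fully_grown:
            union(*edges[e])

    # Peeling: BFS spanning forest of fully grown edges, processed leaves first.
    grown_adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        if support[i] == 2:
            grown_adj[u].append((v, i))
            grown_adj[v].append((u, i))
    bit = [1 if v in defects else 0 for v in range(num_vertices)]
    visited = [False] * num_vertices
    correction = set()
    for root in range(num_vertices):
        if visited[root]:
            continue
        visited[root] = True
        order, tree_edge, queue = [], {}, deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w, i in grown_adj[u]:
                if not visited[w]:
                    visited[w] = True
                    tree_edge[w] = (u, i)
                    queue.append(w)
        for w in reversed(order[1:]):  # each vertex after all its descendants
            if bit[w]:
                u, i = tree_edge[w]
                correction.add(i)
                bit[w] ^= 1
                bit[u] ^= 1
    return correction
```

On a toy graph, the decoder input for a fault 1-chain `phi_F` is `boundary(edges, phi_F)`. The returned correction always reproduces that input as its 0-boundary. Checking homological triviality of `phi_F` plus the correction would additionally require the boundary structure of $G$, which this sketch omits.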

Appendix E: Inclusive Error Model
We continue explaining our numerical simulation using the notations introduced in Appendix D.
Conventional Monte Carlo simulation of error correction uses fault sampling. For each elementary circuit operation, one fault is chosen at random from a finite set, and the chosen faults together form the fault configuration $F$. Each trial takes time $O(d^3)$ overall. Because of the way the Union-Find decoder works, we instead adopt edge sampling: in each trial we sample the 1-chain $\phi(F)$ directly by sampling edges in $E$. Each edge $e \in E$ is picked independently, with probability equal to the sum of the probabilities of the faults that flip $e$. Edge sampling still takes time $O(d^3)$. However, its constant factor is almost two orders of magnitude smaller, because many faults map onto the same edge.
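The Python sketch below illustrates edge sampling, without claiming to be the simulation code. It assumes a hypothetical dictionary `fault_to_edge` that maps each fault to the edge it flips in the decoding graph, or to `None` for faults with trivial syndrome, together with a dictionary of fault probabilities. It first aggregates the probabilities per edge and then includes each edge independently.

```python
import random
from collections import defaultdict


def edge_probabilities(fault_to_edge, fault_prob):
    """Probability assigned to each edge: the sum over the faults that flip it."""
    prob = defaultdict(float)
    for f, e in fault_to_edge.items():
        if e is not None:  # faults with trivial syndrome flip no edge
            prob[e] += fault_prob[f]
    return dict(prob)


def sample_edges(prob, rng):
    """One trial of edge sampling: include each edge independently."""
    return {e for e, p in prob.items() if rng.random() < p}


probs = edge_probabilities(
    {"f1": 0, "f2": 0, "f3": 1, "f4": None},
    {"f1": 0.25, "f2": 0.25, "f3": 0.125, "f4": 0.5},
)
print(probs, sample_edges(probs, random.Random(7)))
```

The per-edge probabilities are computed once, so each trial costs one random number per edge instead of one per circuit operation.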
Below we justify edge sampling, starting with the inclusive and exclusive error models. The exclusive model is the standard stochastic Pauli error model. For example, consider a single-qubit gate that is affected by a random Pauli fault $f \in \{I, X, Y, Z\}$ with probability $Q(f)$. Note that $\sum_{f \in \{I,X,Y,Z\}} Q(f) = 1$, meaning that different faults occur exclusively. Alternatively, one can adopt an inclusive model. The gate is first afflicted by $X$ with some probability $P(X)$. Then, independently, it is afflicted by a subsequent $Y$ with probability $P(Y)$, and then by $Z$ with probability $P(Z)$. The values $P(X), P(Y), P(Z)$ lie in $[0, 1]$ and are otherwise unconstrained; in particular, they are independent. Since $YZ$, $XZ$, $XY$ and $XYZ$ are proportional to $X$, $Y$, $Z$ and $I$, respectively, $P$ and $Q$ satisfy the following relations:
$$
\begin{aligned}
Q(I) &= \bar P(X)\bar P(Y)\bar P(Z) + P(X)P(Y)P(Z),\\
Q(X) &= P(X)\bar P(Y)\bar P(Z) + \bar P(X)P(Y)P(Z),\\
Q(Y) &= \bar P(X)P(Y)\bar P(Z) + P(X)\bar P(Y)P(Z),\\
Q(Z) &= \bar P(X)\bar P(Y)P(Z) + P(X)P(Y)\bar P(Z),
\end{aligned}
\tag{E1}
$$
where $\bar P(f) = 1 - P(f)$. These definitions of exclusive and inclusive models for a single-qubit gate extend to multi-qubit gates or measurements, as long as their possible faults form a group isomorphic to $\mathbb{Z}_2^n$ for some integer $n \ge 2$. For example, the models with $n = 2, 3, 5$ describe, respectively, the single-qubit identity gate, the single-qubit measurement and the joint measurement; see Section II A.
More generally, there is a relation between an exclusive model $Q$ and an inclusive model $P$ that is analogous to (E1). For a general gate or measurement, denote the set of its nontrivial faults by $E = \mathbb{Z}_2^n \setminus \{0^n\}$ for some integer $n \ge 2$. An inclusive model $P$ is essentially an arbitrary function $P : E \to [0, 1]$. Then $P$ induces a probability distribution $Q_P$ over $\mathbb{Z}_2^n$ such that
$$ Q_P(g) = \sum_{\substack{S \subseteq E \\ \sum_{f \in S} f = g}} \prod_{f \in S} P(f) \prod_{f \in E \setminus S} \bar P(f) \qquad \text{for all } g \in \mathbb{Z}_2^n. $$
That is, $Q_P$ is the exclusive model induced by $P$. A natural question is then the following: given a general exclusive model $Q$, is there an equivalent inclusive model $P$, i.e., one that induces $Q$? We claim that for any $n \ge 2$ there exists an exclusive model not induced by any inclusive model. It suffices to consider $n = 2$, because any distribution with $n = 2$ is the marginal distribution of some distribution with $n \ge 2$. Solving (E1) for $P$, we have
$$ P(f_1) = \frac{1}{2} \pm \frac{1}{2}\sqrt{\frac{\bigl(Q(I) + Q(f_2) - Q(f_1) - Q(f_3)\bigr)\bigl(Q(I) + Q(f_3) - Q(f_1) - Q(f_2)\bigr)}{Q(I) + Q(f_1) - Q(f_2) - Q(f_3)}}, $$
where $\{f_1, f_2, f_3\} = \{X, Y, Z\}$. For some choices of $Q$, this solution is not even real valued.
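The distribution $Q_P$ induced by an inclusive model can be computed without enumerating subsets. Adding the independent faults one at a time amounts to a convolution over $\mathbb{Z}_2^n$. In the Python sketch below, group elements are encoded as $n$-bit integers with XOR as addition; for $n = 2$ the encoding is I = 0, X = 1, Z = 2, Y = 3. The function names are hypothetical. The second function inverts (E1) for $n = 2$, restricted to the branch with all $P(f) < 1/2$, which is the relevant one for weak noise. That branch exists exactly when the three quantities $Q(I) + Q(f) - Q(f') - Q(f'')$ are all positive, where $f', f''$ are the other two nontrivial Paulis. For the uniform model with $Q(I) = 0$ and $Q(X) = Q(Y) = Q(Z) = 1/3$, the expression under the square root is negative, so no real $P$ exists at all.

```python
import math


def induced_exclusive(P, n):
    """Exclusive model Q_P on Z_2^n induced by the inclusive model P.

    P maps each nonzero n-bit integer f to the probability that fault f occurs.
    """
    Q = [0.0] * 2 ** n
    Q[0] = 1.0  # no fault yet
    for f, p in P.items():
        # fault f occurs independently with probability p; faults compose by XOR
        Q = [(1 - p) * Q[g] + p * Q[g ^ f] for g in range(2 ** n)]
    return Q


def inclusive_from_exclusive_n2(Q):
    """Return P with all P(f) < 1/2 inducing Q (n = 2), or None if none exists."""
    nontrivial = (1, 2, 3)
    # lam[f] = (1 - 2P(f'))(1 - 2P(f'')) for the two Paulis f', f'' other than f
    lam = {f: Q[0] + Q[f] - sum(Q[h] for h in nontrivial if h != f) for f in nontrivial}
    if any(v <= 0 for v in lam.values()):
        return None
    P = {}
    for f in nontrivial:
        j, k = (h for h in nontrivial if h != f)
        # lam[j] * lam[k] / lam[f] = (1 - 2P(f))^2
        P[f] = 0.5 - 0.5 * math.sqrt(lam[j] * lam[k] / lam[f])
    return P


Q = induced_exclusive({1: 0.1, 3: 0.2, 2: 0.3}, 2)
print(Q, inclusive_from_exclusive_n2(Q))
print(inclusive_from_exclusive_n2([0.0, 1 / 3, 1 / 3, 1 / 3]))
```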
However, when $Q(f)$ is small and uniform over all nonzero $f$, there always exists a corresponding $P$ that induces $Q$.

Claim 1. Given $n \ge 2$, let $Q$ be an exclusive model such that $Q(f) = q \le 2^{-n}$ for all $f \ne 0^n$, for some constant $q$. Consider the inclusive model $P$ defined by $P(f) \equiv \frac{1}{2} \pm \frac{1}{2}(1 - 2^n q)^{2^{1-n}}$ for all $f \ne 0^n$. Then the inclusive model $P$ induces $Q$.
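Claim 1 is easy to check numerically. The Python sketch below builds the inclusive model of Claim 1 with either sign and computes the exclusive model it induces by the same XOR convolution over $\mathbb{Z}_2^n$ as above. It also illustrates that the minus sign gives $P(f) \approx q$ for small $q$. The function names are hypothetical.

```python
def induced_exclusive(P, n):
    """Exclusive model on Z_2^n (n-bit integers, XOR) induced by inclusive P."""
    Q = [0.0] * 2 ** n
    Q[0] = 1.0
    for f, p in P.items():
        Q = [(1 - p) * Q[g] + p * Q[g ^ f] for g in range(2 ** n)]
    return Q


def claim1_inclusive(n, q, sign=-1):
    """Inclusive model of Claim 1 for a uniform exclusive rate q <= 2^-n."""
    if not 0 <= q <= 2.0 ** -n:
        raise ValueError("Claim 1 requires 0 <= q <= 2^-n")
    value = 0.5 + sign * 0.5 * (1 - 2 ** n * q) ** (2.0 ** (1 - n))
    return {f: value for f in range(1, 2 ** n)}


for n in (2, 3, 5):
    Q = induced_exclusive(claim1_inclusive(n, 0.02), n)
    print(n, Q[0], max(abs(Q[f] - 0.02) for f in range(1, 2 ** n)))
```

Both signs work for $n \ge 2$ because each nontrivial character of $\mathbb{Z}_2^n$ is odd on $2^{n-1}$ group elements, an even number, so the sign cancels.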
Note that when $q = o(1)$, $P(f)$ can be chosen to match $q$ to first order.
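Lemma 2 (MacWilliams identity [17]). For any binary linear code $C$ on $N$ bits, define
$$ W_C(x, y) = \sum_{c \in C} x^{N - \mathrm{wt}(c)}\, y^{\mathrm{wt}(c)}, $$
where wt denotes the Hamming weight. Then
$$ W_{C^\perp}(x, y) = \frac{1}{|C|}\, W_C(x + y,\, x - y), $$
where $C^\perp$ is the dual code of $C$.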
It is easy to see that the right-hand side of (E3) equals $W_C(1 - p, p)$, where $C$ is the $[2^n - 1, 2^n - 1 - n, 3]$ Hamming code. The dual of the Hamming code has uniform Hamming weight: every nonzero codeword has weight $2^{n-1}$. Using this fact, one easily obtains
Consider any $S \in A^k_f$. There must exist $g \in S$ such that $g + \Delta \notin S$. Otherwise, $f = \sum_{v \in S} v$ would be either zero or $\Delta$, which is impossible since $f \neq 0$ and $f \neq \Delta$. Fix any total ordering on $E$, and choose for each $S$ the least element $g_S \in S$ such that $g_S + \Delta \notin S$. Replacing $g_S$ in $S$ by $g_S + \Delta$ gives a new subset $S' = (S \setminus \{g_S\}) \cup \{g_S + \Delta\}$, which has exactly $k$ elements. We thus have a map $S \mapsto S'$ from $A^k_f$ into the family of $k$-element subsets whose elements sum to $f + \Delta$. This map is clearly one-to-one. The map in the opposite direction is defined similarly.
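The two coding-theory facts used above can be checked by brute force for small $n$: the MacWilliams identity, and the constant weight of the dual of the Hamming code. The Python sketch below uses hypothetical function names. It builds the $[2^n - 1, 2^n - 1 - n, 3]$ Hamming code as the kernel of the parity-check matrix whose columns are all nonzero $n$-bit vectors, and its dual as the row space of that matrix.

```python
from itertools import product


def hamming_and_dual(n):
    """Return (C, C_perp): the [2^n - 1, 2^n - 1 - n, 3] Hamming code and its dual."""
    cols = list(range(1, 2 ** n))  # columns of H: all nonzero n-bit vectors
    N = len(cols)
    code = []
    for bits in product((0, 1), repeat=N):
        syndrome = 0
        for b, col in zip(bits, cols):
            if b:
                syndrome ^= col  # accumulate H @ bits over F_2
        if syndrome == 0:
            code.append(bits)
    # Row i of H has entry j equal to bit i of cols[j]; C_perp is their span.
    rows = [[(col >> i) & 1 for col in cols] for i in range(n)]
    dual = set()
    for coeffs in product((0, 1), repeat=n):
        dual.add(tuple(sum(a * r[j] for a, r in zip(coeffs, rows)) % 2 for j in range(N)))
    return code, sorted(dual)


def weight_enumerator(code, x, y):
    """W_C(x, y) = sum over codewords c of x^(N - wt(c)) * y^wt(c)."""
    N = len(code[0])
    return sum(x ** (N - sum(c)) * y ** sum(c) for c in code)


C, D = hamming_and_dual(3)
print(len(C), len(D), sorted({sum(c) for c in D}))
print(16 * weight_enumerator(D, 3, 2), weight_enumerator(C, 5, 1))
```

Integer arguments keep the MacWilliams check exact. For $n = 3$, both sides equal 83376.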
For a syndrome-extraction circuit equipped with an exclusive model $Q$, the events that individual edges in $E$ of the corresponding decoding graph are flipped are generally not independent. Indeed, different faults at the same operation are mutually exclusive and may flip different edges. In this case, sampling edges independently is not faithful.
However, if Q admits an equivalent inclusive model P , then the events of individual edges being flipped are mutually independent under P . Specifically, an edge is flipped if and only if an odd number of different faults corresponding to that edge have occurred.
Recall that $\mathcal{F}$ is the set of all nontrivial circuit faults in our scheme. Our starting noise model in Section II A is an exclusive model $Q$ on $\mathcal{F}$, in which the nontrivial faults of a given operation occur uniformly at random. By Claim 1, we convert this model into the corresponding inclusive model $P$. For each $e \in E$, the probability that $e$ is flipped, i.e., that an odd number of the independent faults flipping $e$ occur, is
$$ F_e = \frac{1}{2}\Bigl(1 - \prod_{f:\ f \text{ flips } e} \bigl(1 - 2P(f)\bigr)\Bigr). $$
However, for ease of simulation, we instead use the edge weight
$$ W_e = \sum_{f:\ f \text{ flips } e} P(f). $$
Provided that the error rates $P$ are relatively small, independent edge sampling with weights $W$ is a linear approximation of conventional fault sampling, correct up to corrections of second order in $P$.
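The Python sketch below shows the gap between the exact flip probability $F_e$ and the linear weight $W_e$ for a single edge. Here `ps` is a hypothetical list of the inclusive-model probabilities $P(f)$ of the faults that flip that edge.

```python
import math


def exact_flip_probability(ps):
    """F_e: probability that an odd number of the independent faults occur."""
    return 0.5 * (1 - math.prod(1 - 2 * p for p in ps))


def linear_edge_weight(ps):
    """W_e: first-order approximation, the sum of the fault probabilities."""
    return sum(ps)


ps = [0.01, 0.02]
print(exact_flip_probability(ps), linear_edge_weight(ps))
```

For two faults, the difference $W_e - F_e$ equals $2P(f_1)P(f_2)$, which is second order in the error rates.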
Appendix F: Importance sampling
As the code distance increases and the physical error rate $p$ decreases, logical failures become so rare that direct Monte Carlo simulation is no longer feasible. In this section, we explain how we use importance sampling to reliably estimate the logical error rate $p_L$, i.e., the numerical data in Fig. 9(a) and Fig. 9(b). We use the notation introduced in Appendix D and Appendix E.
For simplicity, we consider only odd distances $d = 2t + 1$. By the argument about the decoding graph $G = (V, E)$ in Appendix D, our scheme succeeds whenever at most $t$ edges are flipped. Therefore, only configurations with more than $t$ flipped edges contribute to $p_L$. Our importance sampling method for estimating $p_L$ goes as follows.
• For each $w \in I$ do:
  - For $i = 1, 2, \ldots, N$, sample from $E$ a subset $S_i$ of $w$ edges uniformly at random.
where I is the indicator function.
It is easy to verify that $\hat{A}_w$ and $\hat{B}_w$ converge to $A_w$ and $B_w$, respectively, as $N$ goes to infinity. In addition, the estimate $\hat{p}_L$ should be reasonably faithful, since $I$ includes the typical values of $w$ with the largest probabilities.
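The Python sketch below shows the stratified structure of the estimate. It does not reproduce the estimators for $A_w$ and $B_w$ used in the paper. Instead, it assumes for illustration that every edge flips independently with the same probability $p$. Then the probability of exactly $w$ flipped edges is binomial and, conditioned on $w$, the flipped set is a uniform $w$-subset, so only the failure probability needs to be sampled. The predicate `fails` is a hypothetical stand-in for decoding (Appendix D) followed by a check for a logical error.

```python
import math
import random


def stratified_logical_error_rate(num_edges, p, weights, fails, N, rng):
    """Estimate sum over w in `weights` of Pr[w edges flipped] * Pr[fail | w].

    Assumes i.i.d. edge flips with probability p (illustrative assumption).
    """
    estimate = 0.0
    for w in weights:
        # exact probability that exactly w of the num_edges edges are flipped
        prob_w = math.comb(num_edges, w) * p ** w * (1 - p) ** (num_edges - w)
        # conditioned on w flips, the flipped set is a uniform w-subset of edges
        failures = sum(
            bool(fails(frozenset(rng.sample(range(num_edges), w)))) for _ in range(N)
        )
        estimate += prob_w * failures / N
    return estimate


rng = random.Random(0)
print(stratified_logical_error_rate(10, 0.1, range(3, 11), lambda S: len(S) > 2, 100, rng))
```

In a realistic run, `weights` would contain values $w > t$ with the largest probabilities, since configurations with at most $t$ flipped edges never fail.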
Because $B_w$ vanishes with increasing $t$, we have only managed to perform the above importance sampling procedure for $t$ up to 6. It would be interesting to develop more efficient sampling algorithms for rare events, e.g., by extending the methods in [18] to the Union-Find decoder.