A High Performance Compiler for Very Large Scale Surface Code Computations

We present the first high performance compiler for very large scale quantum error correction: it translates an arbitrary quantum circuit to surface code operations based on lattice surgery. Our compiler offers an end-to-end error correction workflow implemented by a pluggable architecture centered around an intermediate representation of lattice surgery instructions. Moreover, the compiler supports customizable circuit layouts, can be used for quantum benchmarking, and includes a quantum resource estimator. The compiler can process millions of gates using a streaming pipeline at a speed geared towards real-time operation of a physical device. We compiled, within seconds, 80 million logical surface code instructions, corresponding to a high precision Clifford+T implementation of the 128-qubit Quantum Fourier Transform (QFT). Our code is open-sourced at \url{https://github.com/latticesurgery-com}.


Introduction
Applying surface quantum error correcting codes (QECCs) efficiently to large computations is challenging in terms of the classical computing resources necessary for the compilation process. Compilers tailored for QECC are only starting to appear, often with significant limitations with respect to the scale of the circuits that can be handled or the compilation time.

George Watkins: invio.george@gmail.com
Hoang Minh Nguyen: hoangminh98@gmail.com
Keelan Watkins: keelan w@outlook.com
Steven Pearce: stevenp@sfu.ca
Hoi-Kwan Lau: hklau.physics@gmail.com
Alexandru Paler: alexandrupaler@gmail.com
Large scale QECC compilation is a necessity: practical algorithms, like Shor's and Grover's, assume high-quality qubits with a very low error rate [1], but we are unlikely to obtain hardware (physical) qubits with such fidelity in the near future [2]. QECCs solve this issue by using a large number of error prone physical qubits to encode higher fidelity logical qubits. For example, a quantum factoring algorithm needs roughly 1000 qubits and millions of gates to factor a 1000-bit number [3,4]. Consequently, practical algorithms require very large scale quantum computers, while only some carefully crafted examples of problems where quantum hardware has an advantage with small devices exist [5].
Surface codes are a family of QECCs that create good logical (computational) qubits [6,7,8] while requiring only degree-four nearest neighbour connectivity between physical qubits and tolerating a reasonably high hardware error rate (between 0.1% and 1%). These properties make them a promising option for error corrected devices with a couple hundred logical qubits. Physical devices with compatible layouts have already been built or proposed, albeit at a small scale [5,9,10,11,12]. Examples of larger scale quantum circuits protected by surface QECCs were compiled manually in [13,14]. Optimising surface code circuits has been shown to be NP-hard [15,16].
We present and demonstrate the extremely high scalability of our efficient QECC compiler. This is a step forward for quantum software: we create a streaming pipeline and a compilation environment for the compilation and optimisation of very large scale QECCs. Our high performance pipeline makes it possible to process extremely large circuits (ones that would not fit in memory). We can compile directly, in a streaming process, by reading and writing to mass storage. Streaming enables the real-time operation of our compiler, meaning that this tool may be integrated in the classical control software necessary to operate quantum computers [17]. This paper is organised as follows: in Sec. 2 we introduce the concepts necessary for presenting the compilation methods and workflow. Sec. 3 describes the two-stage compilation pipeline that consists of gate level processing (Sec. 3.2) and logical operation routing (Sec. 3.3). The latter also includes a fast method to perform state vector simulation that takes into account the entangling and disentangling action of the lattice surgery operations. Finally, Sec. 4 illustrates the performance of our compiler. We compile within seconds a high-precision 128-qubit Quantum Fourier Transform (QFT) [18]. To the best of our knowledge, this is the largest-scale compilation of this kind.

Figure 1: Each slice has an associated time instant at which it takes place. The slices are obtained after mapping a circuit to a layout, where patches end up being used for holding error-corrected logical qubits (brown), distillation procedures (pink), routing the merge and split operations of patches (blue), or are left unused (white). 1) A merge and split operation (blue) takes place between logical qubits 0 and 3. 2) Qubits 11 and 13 are measured. 3) A logical operation between qubits 4 and 9, a logical operation between qubits 5 and 6, and the bottom right distillation region outputs a distilled state. 4) Qubit 9 is measured, a merge and split happens between qubits 1 and 2, and the rightmost distillation region outputs a distilled state.

Background
This section introduces the background necessary for describing the compilation process. The application of error correction to quantum circuits resembles the process of program compilation, well known to classical computer scientists: the compiler reads code in a programming language (higher level quantum gates) and outputs machine instructions (lattice surgery quantum gates).
We opted for flexibility and developed a compiler with a well-defined intermediate representation to separate circuit pre-processing from surface code instruction layout. Generating surface code instructions for large scale computations is presently interesting for at least two purposes: one is being able to produce reliable resource estimates, and the other is to start preparing for when we will have such devices, so that hardware engineers can design devices with error correction instruction sets in mind.
We assume the reader is familiar with the basic concepts of quantum computing and quantum information [18,19]. We assume the conventional meaning for common quantum gates (phase gates S and T, Hadamard gate H, CNOT, Toffoli) and the Pauli matrices (I, X, Y, Z). By the phase rotation gate R_Z(θ) we mean R_Z(θ) = exp(−iθZ/2). We will frequently use Pauli product rotations, for which we assume the following: given an axis P (which may be a Pauli matrix or a tensor product of Pauli matrices), we denote P(θ) = exp(−iθP) = cos(θ)I − i sin(θ)P. Note that under this convention, R_P(θ) = P(θ/2) for P = X, Z. Also, when the Pauli matrices appear with sub-indices, e.g. Z1Z2Z3Z4 in Fig. 2, we mean the tensor product Z ⊗ Z ⊗ Z ⊗ Z of the Pauli matrices applied to qubits 1 through 4. Similarly, we use the gate R_X(θ) = H R_Z(θ) H.
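The convention above can be checked numerically. The following sketch (not part of the compiler) builds P(θ) = cos(θ)I − i sin(θ)P for single-qubit axes and verifies that Z(π/8) reproduces the T gate up to a global phase:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
Z = np.diag([1, -1]).astype(complex)

def pauli_rotation(P, theta):
    """P(theta) = exp(-i theta P) = cos(theta) I - i sin(theta) P."""
    return np.cos(theta) * np.eye(P.shape[0], dtype=complex) - 1j * np.sin(theta) * P

def r_z(theta):
    """R_Z(theta) = exp(-i theta Z / 2), i.e. Z(theta/2) in the convention above."""
    return pauli_rotation(Z, theta / 2)

# Z(pi/8) equals the T gate diag(1, e^{i pi/4}) up to a global phase:
T = np.diag([1, np.exp(1j * np.pi / 4)])
Zpi8 = pauli_rotation(Z, np.pi / 8)
phase = T[0, 0] / Zpi8[0, 0]
assert np.allclose(phase * Zpi8, T)
```

The same check with θ = π/4 recovers the S gate, matching the identity R_P(θ) = P(θ/2).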

Surface Codes
A major challenge with the current generation of quantum computers is the occurrence of errors while performing computations. Errors may occur because of control system faults or stray interactions with the environment. A proposed solution for avoiding errors is Quantum Error Correcting Codes (QECCs). These codes add some degree of fault tolerance to computations by using many physical qubits to form fewer but more reliable abstract logical qubits [22]. Surface codes are a family of QECCs that aim at improving computational fidelity by entangling physical qubits in a physical lattice [23,18]. This kind of code, with topological properties, was first theorized in terms of exotic particles known as "anyons" [24]. Surface codes are appealing because they are well understood and feature a high error threshold. In the near future, quantum computing hardware with thousands of qubits might be realized [25,26,12], which would be able to operate a surface code cycle on a lattice of qubits.

Figure 2: The white circles are data qubits protected from errors by measuring stabilizers around them. The squares in the lighter and darker shades of yellow represent stabilizer measurements. For example, the squares marked with Z and X represent the Z1Z2Z3Z4 and X5X6X7X8 stabilizer measurements respectively. If an error occurs on a data qubit, such as a phase flip on qubit 9, the X stabilizers around it will pick it up by a change of outcome (syndromes, highlighted in purple). There are advanced methods to decode sets of errors (e.g. [6,20,21]). Errors can either be corrected on the spot or tracked classically by inverting later readouts. This cycle of detecting, decoding and correcting is referred to as the surface code cycle.
The key step of surface code error detection is stabilizer measurement, shown as the shaded squares in Fig. 2. These measurements act as parity checks on bit flips or phase flips of a square lattice of data qubits. The surface code and its cycle (the sequence of quantum gates applied for enforcing the code constraints) only tell us how to protect a lattice from errors. The surface code distance indicates how much error is tolerated [27].
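The parity-check behaviour is simple to sketch: a Z stabilizer "fires" (reports syndrome 1) when an odd number of its data qubits suffered a bit flip. The plaquette indices below are illustrative, not those of Fig. 2:

```python
# Each Z stabilizer checks the parity of bit flips (X errors) on the data
# qubits it touches; an odd number of flips changes the measured outcome.
def syndrome(stabilizer_qubits, x_errors):
    """stabilizer_qubits: data-qubit indices of one plaquette.
    x_errors: set of data qubits that suffered a bit flip."""
    return sum(q in x_errors for q in stabilizer_qubits) % 2

# Two hypothetical plaquettes sharing data qubit 3:
z_plaquettes = [(1, 2, 3, 4), (3, 4, 5, 6)]
errors = {3}  # a single bit flip on data qubit 3
# Both neighbouring stabilizers fire, localizing the error:
assert [syndrome(p, errors) for p in z_plaquettes] == [1, 1]
```

A decoder's job is precisely to invert this map: from the set of fired stabilizers back to a likely set of physical errors.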

Logical Qubits and Logical Operations
Logical (computational) qubits are encoded by "cutting out" portions of a device's physical lattice into patches, which are cluster states error corrected by the surface code cycle. This encoding of logical qubits is known as the planar code [27,28]. Patches have boundaries outside of which they do not interact, except when performing certain logical operations (Sec. 2.2). Fig. 3 outlines how patches relate to the surface code.

Figure 3: Abstracting physical qubits to patches. We omit the details of the stabilizers and data qubits that make up patches, and instead represent distance-independent features. It is always possible to compute back these details about stabilizers from the output format and code distance. The picture on the left shows how this abstract representation relates to the physical implementation; on the right there is a fully abstract patch, which has its own logical state. The different stabilizers on the boundaries yield two different kinds of boundaries, often referred to as rough and smooth.
The patch-based approach has been shown to be a resource-efficient choice for quantum error correction [29,30,31].
We will be looking at square patches with two kinds of boundaries that encode a single qubit. Patch size is proportional to code distance and depends on the performance of the decoding algorithm (e.g. [6,21]). For our purposes, it suffices to know that the size of the patches will depend on the physical error rate of the device, the length of the computation and the desired success rate of the logical computation. In Sec. 4 we estimate the resources [32] necessary to execute the compiled output.
Having obtained logical qubits, we require a method to perform operations between them. Table 1 offers an overview of all the surface code operations supported by our compiler at the logical level. Some logical operations are performed directly on patches: the Pauli X and Z [33] and Hadamard [29] gates can be implemented this way and are called transversal operations. It is also possible to directly initialize a patch in the |0⟩ or |+⟩ states and to measure in the X or Z basis [31]. For the remaining operations needed to complete a universal gate set we use lattice surgery [29]. This protocol achieves entangling multi-body measurements by merging and splitting patches.
We use these measurements, along with prepared ancilla states (and corresponding patches), to implement the CNOT as shown in [29], and the S and T gates (Fig. 12). The T gate consumes magic states, which in the surface code cannot be initialized directly with a high fidelity. These states have to be prepared by distillation. There are several protocols for magic state distillation [34,35], but for our compilation purposes it suffices to acknowledge that these distillations occupy some amount of space on the device's lattice and have a certain duration in time: distillation regions are described by their bounding box, which includes a time axis for how long it will take to produce the next magic state.

Figure 4: Merging two patches by activating the stabilizer measurements over the data qubits between them (blue regions). This operation causes the two patches to become one, hence losing a degree of freedom and projecting the logical state into a subspace. After stopping the stabilizer measurements and measuring the mediating data qubits, the patches are split. Overall, this operation is equivalent to a logical multi-body measurement [29]. The observable depends on the boundaries: rough for X and smooth for Z. The figure shows measurements of the observables Z ⊗ Z (top) and Z ⊗ X (bottom).

Related Work
Compilers for surface codes have been previously presented in the literature, and most of the time the compilation problem has been decoupled from the challenges of optimising the resulting circuits. In general, automatic optimisation is performed by implementing heuristic algorithms for the efficient layout of the logical operations, which includes parallelizing as many operations as possible, using fewer patches for routing, etc. Surface code computations can be implemented through braiding (e.g. Surfbraid [36]) or through lattice surgery like the tool presented herein (e.g. OpenSurgery [37] or the compilers from [38,39,40]).
Our compiler is distinct from the others in the following ways. Its source and target are similar to OpenSurgery's, but it improves on the compilation time, offers new optimizations, adds the ability to customize layouts, and handles parallel magic state distillation.
The compilers from [39,40] focus specifically on routing long range surface code interactions. While we do tackle this problem, as it is necessary for our overall compilation goal, the focus of this project is broader in scope: we organise our compilation into a very modular, highly efficient pipeline which can handle both short and long range interactions.
Our compiler supports very large scale layouts through a layout specification and the compiler can automatically map large-scale circuits to the layouts.In contrast, the compiler from [38] is a small scale procedure that explores the trade offs of different layouts for mapping algorithms onto surface code architectures.
Our compiler is modular and can incorporate manual optimisation techniques, for example by replacing existing gate decompositions or reconfiguring the bounding boxes of the distillation subcircuits. For completeness, manually obtained surface code layouts with techniques such as the AutoCCZ for optimizing ripple carry adders were presented, for example, by [14]. Finally, one last approach to quantum compilers worth mentioning is variational compilers [41], which share with our project the challenges of circuit pre-processing.
Compared to existing surface code compilers, our tool extends the state of the art by including at least the following novelties: • support for an intermediate language for compiling high-level circuits from different languages (e.g. Q#, Cirq, Qiskit) and descriptions (e.g. Clifford+T, multi-body measurements); • highly configurable layouts for qubits, routing space and multiple parallel distillation procedures;

Distillation in dedicated regions [35]

Table 1: The list of logical surface code operations supported by the compiler. The operations are formalized into logical lattice instructions (LLI), which serve as the central intermediate representation of our compiler. The LLI decouple the pre-processing into surface code instructions from laying them out on an abstract lattice.
• very high-speed, configurable routing heuristics which can easily be replaced with more sophisticated approaches, including ones based on machine learning; • pipelined, modular design that is compatible with distributed computing platforms, such that compilations and optimisations can be performed on multi-core/parallel computers.

Methods
We address the problem of taking a circuit specified in a machine readable format and converting it to the surface code operations outlined in Table 1. For small circuits it is easy enough to perform this conversion by hand, but automation is necessary for large scale circuits.
Our compiler is a computer program that reads text in a formal source language and outputs machine code in a target language. In our case the source is a quantum circuit in a subset of OpenQASM 2.0 [43] (Sec. 5.1), while the target is a JSON encoding of the logical operation instructions (Sec. 2.2 and Fig. 1).
We implemented our compiler and open-sourced the code at https://github.com/latticesurgery-com. To improve the readability of this paper, we keep the engineering and implementation details to a minimum and point the interested reader to the open-sourced code. The latter is written with modern C++ features which aid comprehension of the code's functionality.
The compiler is continuously tested and verified for functional correctness with modern continuous integration, and practical performance plays a significant role in its design. Our compiler offers a wide range of configuration options, ranging from optimization heuristics and intermediate representations of the computations to flexible layouts.

The Compilation Pipeline
The compiler operates a two-stage pipeline (Fig. 5): 1) a pre-processing stage, and 2) a layout and routing stage. The two stages communicate through an intermediate representation we refer to as logical lattice instructions (LLI, Table 1). The LLI contain all the information about the logical operations happening on the lattice, but none about the physical locations of the patches or about routing and distillation regions. The physical qubit lattice will be operated according to the LLI.
The first stage, the gate level processing stage, operates mostly at the logical circuit level. We resort to a universal gate set based on surface code operations and gradually process the input circuit's gates to align with our surface code instructions. Once the circuit is in a suitable format (only Clifford+T gates or certain Pauli rotations), it maps 1-to-1 onto surface code operations and is written down as LLI.
The second stage is the slicer. Herein, the LLI are combined with a layout specification in space and time (Fig. 7). The LLI language is circuit layout agnostic, meaning that the mapping of the logical qubits to the physical lattice may have a great impact on the efficiency of the compiled circuit. The result of these steps is a sequence of slices of the physical lattice. The slices depict the state of the computation at each point in time, as shown in Fig. 1. We offer two such slicers: one written in Python, geared towards the verification of small scale circuits (Sec. 3.3.1), and a high performance one written in C++ for large scale circuits (Sec. 3.3.2).

Gate Level Processing
The first stage takes a logical circuit specified in our own minimal dialect of OpenQASM 2.0. We offer two ways to pre-process the circuit: 1) with Pauli rotations and Pauli product measurements, and 2) directly with higher level quantum gates such as Toffoli gates. In both cases, we first parse the circuit into a list of gates, using Qiskit [44], PyZX [45], a custom parser, or a combination of the three depending on the circuit.
The gate list expression of the input circuit might use gates which are not supported by the error-correction procedure. In this step we reduce the gate set so that it easily translates to LLI. Our custom parser is able to break down very small angle rotations, such as Z(π/2^128), by symbolic processing of the argument. These rotations are needed to compile, for example, a 128-qubit quantum Fourier transform (QFT) circuit. After parsing, the list of gates is passed through the pipeline to the next stage.
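The benefit of symbolic angle processing can be sketched with exact fractions of π. The decomposition structure below (three Z rotations of half the controlled angle, with illustrative signs) is an assumption modelled on the standard controlled-phase identity, not the compiler's exact parser:

```python
from fractions import Fraction

# A minimal sketch of symbolic angle handling: represent a rotation angle as
# an exact multiple of pi, so angles like pi/2**128 can be halved and compared
# exactly instead of being manipulated as lossy floats.
def controlled_phase_angles(k):
    """Expanding a controlled Z(pi/2**k) into CNOTs and single qubit rotations
    yields three Z rotations of half the angle (signs are illustrative)."""
    half = Fraction(1, 2 ** (k + 1))   # i.e. pi / 2**(k+1)
    return [half, -half, half]

angles = controlled_phase_angles(127)            # smallest rotation in a 128-qubit QFT
assert angles[0] == Fraction(1, 2 ** 128)        # exact, no rounding
```

Keeping angles as `Fraction`s of π also makes it trivial to recognize which rotations are native multiples of π/8 and which must be sent to the Clifford+T approximation step.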
First, controlled gates are broken down into CNOTs and single qubit rotations using the identity in Fig. 13. The circuit then only contains single qubit Clifford gates, CNOTs and single qubit rotations. In the last pre-processing step of the gate model, single qubit rotations of angle smaller than π/4 are approximated by single qubit Clifford+T gates.
It is possible to convert controlled-rotation gates to Clifford operations plus some small angle Z(θ) rotations (Figure 13 in the Appendix). The latter are not Clifford+T and are difficult to perform in a fault-tolerant way [46]. We achieve arbitrary Z(θ) rotations by approximating them with Clifford+T gates, for which we leverage the Gridsynth package [47], which outputs approximations consisting of sequences of H, X, Z, S and T gates. The T gates are performed by consuming magic states, which are prepared in dedicated distillation regions [33,35].
We use two methods to convert the Gridsynth approximation to LLI. The first is to directly apply the gates with the methods of Table 1: H, X and Z transversally, S with a twist, and T as a Z(π/8) rotation as shown in Fig. 12. The second approach, which we refer to as Pauli rotation compression, is shown in Fig. 6 and consists of interpreting the gate sequences returned by the Gridsynth approximations as a sequence of Pauli rotations of varying angles.
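The idea behind Pauli rotation compression can be sketched as follows. This is a hedged toy version, not the compiler's algorithm: it handles only H/S/T/Z strings (Gridsynth's X gates would need an extra rule), folding diagonal gates into one rotation on the current axis and swapping the axis between Z and X on each H:

```python
# Fold a Clifford+T string into single-qubit Pauli rotations. Angles are in
# units of pi/8, up to global phase: T = 1, S = 2, Z = 4; each H conjugates,
# swapping the Z and X axes for the gates that follow.
def compress(gates):
    units = {"T": 1, "S": 2, "Z": 4}
    axis = "Z"
    rotations = []
    acc = 0

    def flush():
        nonlocal acc
        if acc % 8:                       # a multiple of pi is a global phase
            rotations.append((axis, acc % 8))
        acc = 0

    for g in gates:
        if g == "H":
            flush()
            axis = "X" if axis == "Z" else "Z"
        else:
            acc += units[g]
    flush()
    return rotations

# HSSSTH: three S and a T conjugated by H -> a single X(7*pi/8) rotation
assert compress("HSSSTH") == [("X", 7)]
```

This matches the observation in the Appendix that both HSSSTH and a lone T compress to a single Pauli rotation.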
The direct application of gates is simpler and results in the same Clifford corrective terms for every rotation. With Pauli rotation compression the Clifford corrective terms change for every angle, so more complex classical control would be required by a downstream stage. In the Appendix we present an algorithm for Clifford gate optimization.

Slices and Routing
To overcome the logistical challenges of structuring a computation on a surface code device, we arrange the computation in space and time. Space structure is given by partitioning the physical lattice into square cells. A cell may or may not hold a patch, be part of a distillation region, or be used for routing, but patches, distillation regions and routing areas are always placed in accordance with cell boundaries (Figs. 1, 3 and 7).
Time structure is given by thinking of the computation in terms of slices. Surface code computations can be viewed as 3D structures in spacetime [48,37,14], and a slice is a plane through the structure at a fixed time value (Fig. 1). In a nutshell, a slice is a temporally discretized partition of the computation (clock timesteps in Litinski [31] or moments in Google Cirq [49] terminology, for example). Each slice represents a snapshot of the LLIs that are happening simultaneously on the lattice; slice duration is given by the duration of the slowest LLI.

Figure 6 caption (fragment): For instance, the sequence HSHTSHX would be split as HSH, TS, H, X and would become the sequence ...
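A minimal sketch of slice formation, under the assumption (ours, for illustration) that LLIs are packed greedily in program order and that two instructions can share a slice only if they touch disjoint sets of patches:

```python
# Greedily pack LLIs into slices: instructions in one slice act on disjoint
# patches and therefore happen "simultaneously"; a patch conflict closes the
# current slice and opens the next one.
def slice_up(instructions):
    slices, current, busy = [], [], set()
    for op, patches in instructions:
        if busy & set(patches):          # conflict: start a new slice
            slices.append(current)
            current, busy = [], set()
        current.append((op, patches))
        busy |= set(patches)
    if current:
        slices.append(current)
    return slices

lli = [("MultiBodyMeasure", (0, 3)),
       ("MeasureSinglePatch", (11,)),
       ("MeasureSinglePatch", (13,)),
       ("MultiBodyMeasure", (3, 5))]    # reuses patch 3 -> forces a new slice
assert len(slice_up(lli)) == 2
assert len(slice_up(lli)[0]) == 3
```

The real slicer additionally accounts for routing space and distillation regions when deciding what fits in a slice.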
Routing is the problem of deciding how the cells of a slice are allocated to patches or reserved for other purposes. Finding good layouts has a great impact on the depth of the computation. Different layouts can, for example, be used to trade off space for time [38]. Layouts may need to change depending on the task during an algorithm: for example, the oracle in Grover's search algorithm may be very different from the implementation of the diffusion operator [50]. We defined our own configurable layout specification (Fig. 7). The compiler reads a text file containing the layout specification and produces slices with patches arranged accordingly.

On-the-fly, Functionally Verified Slicer
Our first slicer supports real-time, on-the-fly functional verification for correctness. This slicer can be used as a preliminary verification of smaller scale lattice surgery circuits. The slicer and the simulation operate on an array of patches of variable length and assume that all magic states have been prepared ahead of time. The verified slicer is very powerful when it comes to understanding the details of a small computation, and we used it in the development of the compiler.
The simulator, called the Lazily Tensored State-vector Simulator (LTSvS), has the major feature of being able to simulate patch states at the LLI instruction level, e.g. simulating multi-body measurements and Pauli operator gates. LTSvS tensors at the matrix level only when strictly required; otherwise it just tracks the fact that the global state is given by a tensor product of sub-vectors. LTSvS offers a great performance advantage over naive state vector simulation: our simulator does not expand the full state vector of all logical qubits on the lattice. In particular, qubits that are known to not be entangled, because they were just initialized or measured, are automatically tracked in separate sub-state vectors. Qubits may be entangled within a sub-state vector. Examples of unentangled qubits are the array of magic states waiting to be used, or ancilla patches.
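The bookkeeping behind lazy tensoring can be sketched in a few lines. This is a minimal illustration of the idea, not the LTSvS implementation: each qubit starts in its own sub-state vector, and two groups are tensored together only when an entangling operation spans both:

```python
import numpy as np

class LazyState:
    """Track the global state as a product of sub-state vectors."""

    def __init__(self, n):
        self.groups = [[q] for q in range(n)]                       # qubit ids per group
        self.vecs = [np.array([1, 0], dtype=complex) for _ in range(n)]  # all |0>

    def _merge(self, a, b):
        """Tensor groups a and b into one (only when strictly required)."""
        self.vecs[a] = np.kron(self.vecs[a], self.vecs[b])
        self.groups[a] += self.groups[b]
        del self.vecs[b], self.groups[b]

    def entangle(self, q1, q2):
        """Record that an entangling LLI (e.g. a merge) touched q1 and q2."""
        a = next(i for i, g in enumerate(self.groups) if q1 in g)
        b = next(i for i, g in enumerate(self.groups) if q2 in g)
        if a != b:
            self._merge(min(a, b), max(a, b))

state = LazyState(4)
state.entangle(0, 2)
# qubits 1 and 3 keep their own 2-entry vectors; 0 and 2 now share a 4-entry one
assert sorted(len(v) for v in state.vecs) == [2, 2, 4]
```

Memory stays proportional to the largest entangled group rather than to 2^n for all n logical qubits, which is what makes simulation of many idle magic states and ancillae cheap.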
Methodologically, the LTSvS simulator is very similar to matrix-product-state (MPS) simulation techniques [51], which are efficient on circuits with low counts of entangling gates. Compared to MPS simulators, e.g. the one from Qiskit [44], ours is fine tuned for computations with many ancillae and measurements, can handle classical control, and can be executed in parallel with the compilation process.

High Performance Slicer
The main goal of our compiler is to handle very large scale circuits with thousands of logical qubits and millions of LLI. At this scale every CPU clock cycle and every byte is precious. Our high performance compiler is written in C++ because it offers zero-cost abstractions [52].
The first step of the slicer is to read a layout file (Fig. 7) in order to create an abstract representation describing the device layout. The layout is used to initialize a slice template which is reused for the routing; the template is recomputed only when the layout dictates it. Our implementation of slice processing keeps memory usage to a minimum, because only O(1) slices are ever kept in memory by the slicer itself. Moreover, this representation is stored in a high performance data structure based on bitstreams and hash-maps. The representation is used for computing routes with a variant of Dijkstra's algorithm.
The slicer streams LLI from text files or standard input, updating the slice with each instruction and evolving the slices over time. Since the slicer can stream-read from standard input and write to standard output, it is possible to implement external programs (e.g. Python scripts or other command line tools) that visit slices by reading from standard input. Given the capability of evolving the lattice state, the slice processing functionality is implemented by defining a C++ functor that visits all slices.
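An external visitor of this kind is easy to sketch. We assume, for illustration only, that the slicer emits one JSON object per line and that a slice carries a hypothetical `used_cells` field; the real wire format is defined by the compiler:

```python
import json

# Visit slices streamed one JSON object per line, holding O(1) slices in
# memory; in real use `lines` would be sys.stdin at the end of a pipe.
def visit_slices(lines, visitor):
    for line in lines:
        line = line.strip()
        if line:
            visitor(json.loads(line))

# Collect a simple statistic: cells in use per slice (field name is assumed).
usage = []
fake_stream = ['{"used_cells": 12}', '{"used_cells": 17}']
visit_slices(fake_stream, lambda s: usage.append(s["used_cells"]))
assert usage == [12, 17]
```

Such a script drops straight into a POSIX pipeline after the C++ slicer, in the spirit of the pipe-based setup shown in Figure 10.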
The streamed evolution of slices includes managing distillation, queuing magic states [48,53], initializing ancillae, and executing LLI operations. A user may collect statistics on slices, such as magic state queue and routing space usage, in seconds, without having to store terabytes of slices that would slow down the processing. At the same time, this functor approach has the advantage of hiding the implementation details from the client, who can focus on the processing functionality.
To place routing regions, we use our own implementation of Dijkstra's algorithm, which operates in place, so that our tool can search the lattice without constructing a graph of it. Our implementation has close to zero overhead with respect to the memory and CPU instructions needed to translate back and forth between the lattice layout and the graph needed for performing Dijkstra's algorithm. To further speed up routing, we employ a cached routing technique where previously computed routes are saved and reused later.
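The in-place idea can be sketched as follows: neighbours are generated on the fly from cell coordinates, so no explicit graph object is ever built. This is an illustrative sketch in the spirit of the routing search, with blocked cells standing in for patches and distillation regions:

```python
import heapq

# Dijkstra directly on the lattice grid: cells are (row, col) pairs, and the
# four neighbours of a cell are computed arithmetically instead of being
# looked up in a prebuilt adjacency structure.
def route(start, goal, rows, cols, blocked):
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:               # reconstruct the route
            path = [goal]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist[(r, c)]:
            continue                     # stale heap entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in blocked:
                if d + 1 < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = d + 1
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (d + 1, (nr, nc)))
    return None  # no free route on this slice

path = route((0, 0), (2, 2), 3, 3, blocked={(1, 1)})
assert path is not None and len(path) == 5  # four unit steps around the blocked cell
```

With unit edge weights this reduces to breadth-first search; Dijkstra's structure is kept because weighted variants (e.g. penalizing congested cells) slot in without code changes.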

Results
We present results for compiling very large quantum circuits, and focus on scalability and resource estimation.

128-Qubit QFT
To validate the performance of our compiler and high performance slicer, we took a circuit that has widespread use and presents technical fault-tolerant execution challenges. The Quantum Fourier Transform (QFT) is a crucial component providing quantum speedup to algorithms such as Shor's algorithm and quantum phase estimation [18]. The fault-tolerant implementation of the QFT is challenging because of the presence of small angle controlled rotations. For the QFT to retain the desired level of precision, these have to be approximated by long sequences of Clifford+T gates, which results in a very long computation. We set Gridsynth's precision to 10^−41 for the Clifford+T approximations of the small angle rotations, which results in thousands of gates per rotation. This number was chosen because it is three orders of magnitude smaller than the smallest rotation angle in our circuit, π/2^128 ≈ 10^−38, after expanding out the controlled rotations (Sec. 3.2).
The number of controlled rotations increases quadratically with the number of qubits the QFT is applied to. Thus, at 128 qubits and after small angle rotation approximation, the QFT circuit has more than 80 million LLI without gate-to-Pauli compression. The number of LLI includes the Clifford corrective terms that are to be applied depending on measurement outcomes. Thanks to concurrent magic state distillation, there are no idle slices waiting for magic states to be produced. We used the high performance slicer to compile the 128-qubit QFT: laying out the slices for the roughly 80 million LLI takes less than 15 minutes on an ordinary laptop, while the generation of the LLI takes negligible time (under 10 s). Fig. 8 illustrates the performance of the C++ slicer for the QFT128 circuit.
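The precision bookkeeping above is a one-liner to check:

```python
import math

# The smallest rotation angle after expanding the controlled rotations of a
# 128-qubit QFT is pi/2**128 ~ 1e-38; the chosen Gridsynth precision of 1e-41
# sits roughly three orders of magnitude below it.
smallest_angle = math.pi / 2 ** 128
assert 1e-39 < smallest_angle < 1e-38
assert smallest_angle / 1e-41 > 100      # precision well below the smallest angle
```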

Resource Estimation
A challenging problem in the fault-tolerant quantum computing community is determining the amount of physical resources necessary to carry out a logical computation with a certain degree of precision. Such resources are often quantified by physical qubits over time, a quantity often called a space-time volume. The depth of the circuit and the required magic state fidelity affect the code distance, which in turn affects the number of physical qubits required. Moreover, the degree of parallelization achieved at the routing stage affects the computation depth.
Our compiler includes a prototypical resource estimator for surface code computations. We use the Qentiana [54] software to estimate such values, and computed code distances for randomized circuits of H, T and CNOT gates (Fig. 9).
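As a back-of-the-envelope illustration of the quantities involved (this is a textbook approximation, not Qentiana's model): a rotated surface code patch of distance d uses d² data qubits plus d²−1 syndrome qubits, so the physical qubit count scales with the number of patches times 2d²−1:

```python
# Rough physical-qubit footprint of a layout: one rotated surface code patch
# of distance d uses 2*d*d - 1 physical qubits (d*d data + d*d - 1 syndrome).
def physical_qubits(patches, distance):
    return patches * (2 * distance ** 2 - 1)

assert physical_qubits(1, 3) == 17            # the well-known distance-3 patch
assert physical_qubits(100, 25) == 124_900    # a hundred-patch layout at d = 25
```

Multiplying such a count by the number of surface code cycles gives the space-time volume that resource estimators ultimately report.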

Conclusion
We introduced and described a compiler for lattice surgery quantum circuits and showcased some of the results achieved with it. We motivated the design choices behind our two-stage pipeline. The first stage covered how input circuits are parsed, pre-processed, reduced to Clifford+T and viewed as Pauli rotations. The second stage focuses on laying out circuits on physical devices, which presents substantial performance challenges.
We demonstrated the compiler's performance by compiling a 128-qubit QFT. We believe this is a notable achievement: despite the QFT's widespread appearance in algorithms, to the best of our knowledge no other surface code compiler is able to handle circuits of this scale. We also showcased the compiler's ability to estimate resource requirements, in particular patch code distance, which is promising from the perspective of quantum benchmarking.
Our project is laying the foundation for a full-stack quantum circuit compilation framework.

Compiler Pipeline and Operating Systems
It is possible to accept a wider range of circuits with additional processing by using Qiskit and PyZX. For example, we have successfully processed Grover circuits with multi qubit gates and Toffoli based adders by decomposing gates such as mcx, ccx and rccx, which are not in the natively supported set. This kind of gate decomposition is done on a case by case basis. Figure 10 is a diagram of how the C++ slicer can take advantage of the operating system's ability to broker messages by using POSIX pipes, together with an example shell command to run such a pipeline.

Circuit Simulation and Gate Decomposition
For verification purposes, we require a state-vector snapshot of the logical state of the lattice computation at every time-step. Conversions of gates and Pauli rotations to multi-body measurements and other LLI are presented in Fig. 12; other angles with the same denominator are also possible by adjusting the classical logic controlling the corrective terms that follow. The π/8 Pauli rotations consume a distilled magic state, while the π/4 rotations consume a positive Y eigenstate, which is prepared by applying a twist-based measurement [42].
As illustrated in Figure 13, controlled rotations are common in circuits such as the QFT. The first step towards converting them to fault-tolerant instructions is breaking them down into single-qubit rotations and CNOTs [56]. Single-qubit rotations by angles finer than π/8 have to be approximated, while CNOTs are implemented with lattice surgery [29].
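This first step can be sketched as follows (our own illustrative code, not the compiler's implementation): a controlled phase rotation, as used in the QFT, reduces to single-qubit R_Z rotations and two CNOTs, up to a global phase. The remaining R_Z angles are then what the Gridsynth approximation step handles.

```python
import cmath, math

def rz_diag(theta):
    """Diagonal of Rz(theta) = diag(e^{-i theta/2}, e^{i theta/2})."""
    return [cmath.exp(-1j*theta/2), cmath.exp(1j*theta/2)]

def apply_rz(state, theta, q):
    """Apply Rz(theta) to qubit q of a 2-qubit state (qubit 0 = MSB)."""
    d = rz_diag(theta)
    return [a * d[(i >> (1 - q)) & 1] for i, a in enumerate(state)]

def apply_cx(state, c, t):
    """Apply a CNOT with control c and target t on a 2-qubit state."""
    cbit, tbit = 1 << (1 - c), 1 << (1 - t)
    return [state[i ^ tbit] if i & cbit else state[i] for i in range(4)]

def controlled_phase(state, theta):
    """CPhase(theta) from single-qubit Rz rotations and two CNOTs
    (equal to diag(1,1,1,e^{i theta}) up to a global phase)."""
    s = apply_rz(state, theta/2, 0)   # Rz on the control
    s = apply_rz(s, theta/2, 1)       # Rz on the target
    s = apply_cx(s, 0, 1)
    s = apply_rz(s, -theta/2, 1)
    s = apply_cx(s, 0, 1)
    return s
```

Checking on the uniform superposition confirms that only the |11⟩ amplitude picks up the relative phase e^{iθ}, as a controlled phase should.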
Fig. 14 illustrates the efficiency of the rotation compression technique described in Fig. 6 of Sec. 3.2. The exact amount of compression depends on the types of gate sequences appearing in the approximations (e.g. both HSSSTH and a lone T compress to a single Pauli rotation). Testing on rotations from R_Z(π/2^8) to R_Z(π/2^128), we observed a factor of ≈ 2.5 fewer LLI per small-angle rotation; this is an illustration of our optimisation heuristics.
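A minimal sketch of this grouping heuristic (our own simplification, not the compiler's implementation; angles are tracked in units of π/8) shows how a Clifford+T string collapses into fewer Pauli rotations: runs of Z-axis gates are summed into one Z rotation, and a run flanked by H on both sides becomes a single X rotation, since HZ_θH = X_θ.

```python
# Z-axis rotation angles in units of pi/8: T = Z_{pi/8}, S = Z_{pi/4}, Z = Z_{pi/2}
ANGLE = {"T": 1, "S": 2, "Z": 4}

def compress(seq):
    """Group a Clifford+T string (chars H, S, T, X, Z) into Pauli rotations.

    Returns a list of (basis, angle) pairs; angle is in units of pi/8,
    and an uncompressed H is emitted as ("H", None).
    """
    out, i, n = [], 0, len(seq)
    while i < n:
        g = seq[i]
        if g in ANGLE:  # run of Z-axis rotations: sum the angles
            a = 0
            while i < n and seq[i] in ANGLE:
                a = (a + ANGLE[seq[i]]) % 16
                i += 1
            out.append(("Z", a))
        elif g == "H":
            # H <z-rotations> H  ->  single X rotation (conjugation by H)
            j, a = i + 1, 0
            while j < n and seq[j] in ANGLE:
                a = (a + ANGLE[seq[j]]) % 16
                j += 1
            if j < n and seq[j] == "H" and j > i + 1:
                out.append(("X", a))
                i = j + 1
            else:
                out.append(("H", None))
                i += 1
        else:  # X gate: a rotation by pi about X, up to global phase
            out.append(("X", 8))
            i += 1
    return out
```

On the Fig. 6 example, compress("HSHTSHX") yields X_{π/4}, Z_{3π/8}, H, X: four Pauli-frame instructions instead of seven gates.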

Clifford Elimination
According to the Gottesman-Knill theorem, it is possible to efficiently simulate circuits which contain only a particular set of gates, known as Clifford gates [18,57]. It is natural to ask whether classical computing can be leveraged to reduce the load on the QPU when processing such circuits. Litinski [31] outlined an algorithm that removes the Clifford part of the circuit at compile time, when all we care about are measurement outcomes (i.e. we are not compiling the circuit to be a state preparation routine). We call this algorithm the Litinski Transform (LT) and implement it in the following manner.
The first step of the LT is to convert each gate of the input circuit into a sequence of rotations by π/2, π/4 or π/8, or multi-body measurements with Pauli product observables. Of these blocks, only the π/8 rotations are not Clifford, so we apply Litinski's commutation rules to bring them all to the front of the circuit. Next, the π/2 and π/4 rotations are commuted past the end-of-circuit measurements. Since in this case we only care about measurement outcomes, the Clifford rotations that now come after the measurements can be discarded.
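The commutation machinery can be sketched as follows (illustrative code of our own, assuming the standard Pauli algebra rather than our compiler's data structures): when a π/8 rotation with Pauli product P' is pushed past a π/4 Clifford rotation with Pauli product P, it is unchanged if the two commute, and is otherwise replaced by the product P·P', with the accumulated power of i absorbed into the rotation's sign.

```python
def mul1(a, b):
    """Single-qubit Pauli product: returns (phase as a power of i, Pauli)."""
    if a == "I": return 0, b
    if b == "I": return 0, a
    if a == b: return 0, "I"
    table = {"XY": (1, "Z"), "YZ": (1, "X"), "ZX": (1, "Y"),
             "YX": (3, "Z"), "ZY": (3, "X"), "XZ": (3, "Y")}
    return table[a + b]

def commute(p, q):
    """Pauli products commute iff they anticommute on an even number of qubits."""
    anti = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return anti % 2 == 0

def push_past(clifford, rotation):
    """Push a pi/8 rotation past a pi/4 Clifford rotation (both Pauli strings).

    Returns (phase as a power of i, new Pauli string) for the pi/8 rotation.
    """
    if commute(clifford, rotation):
        return 0, rotation
    phase, out = 0, []
    for a, b in zip(clifford, rotation):
        ph, r = mul1(a, b)
        phase = (phase + ph) % 4
        out.append(r)
    return phase, "".join(out)
```

For example, pushing a Z_{π/4} rotation on one qubit past an anticommuting X_{π/8} rotation on the same qubit turns the latter into a Y_{π/8} rotation, exactly as Litinski's rules prescribe.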

Layout File Format
The Structure. The purpose of our layouts is to define the structure of a lattice intended for lattice surgery operations, abstracting the details of the physical implementation. The layout is stored as an ASCII plain text file, making it easy for humans to view and edit, and ensuring good portability (i.e. no special tools are needed to edit it). The layout is specified by a grid of ASCII characters, where each cell has a specific meaning based on the character it contains. For example, the text QrQ represents two logical qubits separated by an inactive routing space. The layout is always assumed to be rectangular. If a layout file is not rectangular in content, the compiler assumes that the bounding box of its contents is available and pads with empty routing space.
Qubits Q: Represents a patch holding a logical qubit encoded using the surface code (planar code). The boundaries are assumed to be rough north-south and smooth east-west, so it is the user's responsibility to ensure that connectivity between qubits is possible.
Routing r: In their default, quiescent state, these cells are inactive. This means they do not actively participate in quantum computations. However, when required, the contents of these routing cells can be "activated" to facilitate long-range merges and splits between distant Q cells, as shown in Fig. 4.
Ancilla A: These cells are reserved for allocating new ancilla patches. Ancilla patches are auxiliary qubits used in quantum computations to assist in the construction of measurement-based gates. These patches can be in states like the |+⟩ state, which mediates CNOT operations, or they can hold the Y eigenstates used by π/4 Pauli rotations.
Distillation regions, numbers 0-9: Distillation regions are specialized areas designated for the production of magic states, as defined in Sec. 2.2. These regions are represented by areas with the same number in the ASCII layout. The extents of these regions are identified by running a connected components search on areas with matching numbers. This means that contiguous cells with the same number are considered part of the same distillation region.
The magic states produced by these regions are output to a neighboring r cell. Distillation time is assumed to be the same for every distillation region. The compiler makes no assumptions about the internal operations or processes occurring within distillation regions. It is the user's responsibility to ensure that the size and configuration of a region is correct.
Example 1. The layout below supports two logical qubits, and routing. No two-qubit gates can be applied immediately, because there are no A patches.

QrQ
rrr
Example 2. Seven qubits and four distillation regions can operate on the layout below. The A ancilla can be used for performing CNOTs between the logical qubits, for example.

rrrrrrr444
rQQrQQr444
rQQrQAr444
rrrrrrrrrr
r111222333
r111222333
r111222333

Planned future extensions of the layout file include: a) support for multi-cell patches; b) qubit indexing enhancements; c) initialization of square patches with alternate boundary configurations; d) an in-browser editor equipped with syntax highlighting.
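To make the format concrete, here is a sketch of how such a layout file could be parsed (illustrative code of our own, not the compiler's implementation): pad the grid to its bounding box with routing space, then recover the distillation regions by a connected components search over matching digits.

```python
from collections import deque

def parse_layout(text):
    """Read an ASCII layout, padding non-rectangular content with routing 'r'."""
    rows = text.strip("\n").split("\n")
    width = max(len(r) for r in rows)
    return [list(r.ljust(width, "r")) for r in rows]

def distillation_regions(grid):
    """Find distillation regions: connected components of same-digit cells
    (4-connectivity), as in the layout specification."""
    seen, regions = set(), []
    h, w = len(grid), len(grid[0])
    for y in range(h):
        for x in range(w):
            ch = grid[y][x]
            if ch.isdigit() and (y, x) not in seen:
                component, queue = [], deque([(y, x)])
                seen.add((y, x))
                while queue:
                    cy, cx = queue.popleft()
                    component.append((cy, cx))
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and grid[ny][nx] == ch and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                regions.append((ch, component))
    return regions
```

Running this on the layout of Example 2 recovers the seven Q cells and the four distillation regions labelled 4, 1, 2 and 3.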

Figure 1: Example output from the compiler: a sequence of discrete time steps, called slices, of a surface code computation. Each slice has an associated time instant at which it takes place. The slices are obtained after mapping a circuit to a layout, where patches end up being used for holding error-corrected logical qubits (brown), distillation procedures (pink), routing merge and split operations of patches (blue), or not used (white). 1) A merge and split operation (blue) takes place between logical qubits 0 and 3. 2) Qubits 11 and 13 are measured. 3) There is a logical operation between qubits 4 and 9 and another between 5 and 6, and the bottom right distillation region outputs a distilled state. 4) Qubit 9 is measured, a merge and split occurs between 1 and 2, and the rightmost distillation region outputs a distilled state.

Figure 2: A graphical depiction of a surface code layout. The white circles are data qubits protected from errors by measuring stabilizers around them. The squares in the lighter and darker shades of yellow represent stabilizer measurements. For example, the squares marked with Z and X represent the Z1Z2Z3Z4 and X5X6X7X8 stabilizer measurements, respectively. If an error occurs in a data qubit, such as a phase flip on 9, the X stabilizers around it will pick it up by changing outcome (syndromes, highlighted in purple). There are advanced methods to decode sets of errors (e.g. [6,20,21]). Errors can either be corrected on the spot or tracked classically by inverting later readouts. This cycle of detecting, decoding and correcting is referred to as the surface code cycle.

Figure 3: Abstracting physical qubits to patches. We omit the details of the stabilizers and data qubits that make up patches, and instead represent distance-independent features. It is always possible to compute back these stabilizer details from the output format and the code distance. The picture on the left shows how this abstract representation relates to the physical implementation; on the right there is a fully abstract patch, which has its own logical state. The different stabilizers on the boundaries yield two different kinds of boundaries, often referred to as rough and smooth.

Figure 4: Lattice surgery of patches. Patches are merged by activating the stabilizer measurements involving the data qubits between them (blue regions). This operation causes the two patches to become one, hence losing a degree of freedom and projecting the logical state into a subspace. After stopping the stabilizer measurements and measuring the mediating data qubits, the patches are split. Overall, this operation is equivalent to a logical multi-body measurement [29]. The observable depends on the boundaries: rough for X and smooth for Z. This figure shows measurements of the observables Z ⊗ Z (top) and Z ⊗ X (bottom).

Figure 5: The pipeline as implemented in the compiler.

Figure 6: Pauli rotation compression of Clifford+T approximations of small-angle rotations, e.g. R_Z(π/2^128): gate sequences obtained from Gridsynth are interpreted in the Pauli frame by breaking them into subsequences. For instance, the sequence HSHTSHX would be split as HSH, TS, H, X and would become the sequence X_{π/4}, Z_{3π/8}, H, X.

Figure 7: ASCII specification for patch layouts. Q indicates a patch holding a logical qubit; r marks cells that are reserved for routing (the cyan "snakes" of Fig. 1). Numbers 0 to 9 are used to identify distillation regions. The boundaries of the distillation regions are computed by a connected components search over matching numbers, so it is possible to have more than 10 distillation regions. Magic states produced by these regions are queued in the r cells neighbouring a distillation block. Finally, A marks cells reserved for allocating new ancilla patches in states such as the |+⟩ states that mediate CNOTs, or the places for the Y eigenstates used by π/4 Pauli rotations. The layout file format is described in the Appendix.

Figure 8: Time taken to compile a QFT with the C++ slicer on a laptop (Intel i5350U, 8GB RAM), and the number of LLI instructions for different QFT sizes. The Clifford+T implementation of the QFT requires thousands of gates for each controlled rotation, to retain the rotation accuracy we set (10^-41).

Figure 9: A snapshot of the resource requirement landscape for random H, T and CNOT circuits. The horizontal axes show circuit width (number of qubits) and depth (number of gates). The vertical axis shows the code distance required to execute the desired circuit with a success rate of 99%. The colour scale represents the spacetime volume of the computation, which relates closely to code distance.

Figure 10: qft_to_stdout.py is our tool to stream the LLI for a high-precision QFT; the 128 argument indicates the number of qubits. lsqecc_slicer -f json is our C++ slicer, which lays out the LLI instructions and outputs a stream of JSON slices to stdout; the -l 12by12.txt flag tells it which layout to use. Finally, jq [55] is a streaming JSON processor that can extract information from the sequence of slices, here combined with some POSIX utilities to count the total number of routing cells.

Figure 11: Simulating a deep lattice surgery computation is challenging, because patches are constantly being entangled and measured out, but at any given time few are actually entangled. Each state is represented by a certain number of complex variables (c.v.).

Figure 12: This figure only shows rotations by π/8 and π/4, but other angles with the same denominator are also possible by adjusting the classical logic controlling the corrective terms that follow. The π/8 Pauli rotations consume a distilled magic state, while the π/4 rotations consume a positive Y eigenstate, which is prepared by applying a twist-based measurement [42].

Figure 14: Benchmarking the decomposition techniques for arbitrary rotations to surface code instructions, with and without Pauli rotation compression. The benchmark circuit is a single R_Z(π/2^n) rotation. It is possible to see how grouping gates that rotate in the same basis (as shown in Fig. 6) drastically reduces the number of required surface code instructions.
The layout determines how computation elements are placed on the lattice. Herein we describe the technical details, limitations, examples and future enhancements of the layout format introduced in Sec. 3.3. The online source code repository includes more layout examples.
|+⟩, |m⟩ and display them as such. Figure 11 is a graphical depiction of a circuit with intermediate states in a Lazily Tensored Statevector Simulation (LTSvS).
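A minimal sketch of the idea behind LTSvS (an illustration under our own simplified assumptions, not the simulator's actual code): unentangled patches are kept as separate statevectors, and two groups are tensored together only when an operation entangles them, so the number of complex variables tracks the entanglement structure rather than growing as 2^n.

```python
def kron(a, b):
    """Tensor product of two statevectors."""
    return [x * y for x in a for y in b]

class LazyTensorState:
    """Keep unentangled patch groups as separate statevectors; tensor lazily."""
    def __init__(self, n):
        # every patch starts in its own single-qubit statevector |0>
        self.groups = [([q], [1.0 + 0j, 0.0 + 0j]) for q in range(n)]

    def group_of(self, q):
        for i, (qubits, _) in enumerate(self.groups):
            if q in qubits:
                return i
        raise KeyError(q)

    def entangle(self, qa, qb):
        """An entangling operation (e.g. a multi-body measurement) between
        qa and qb merges their groups into one statevector."""
        i, j = self.group_of(qa), self.group_of(qb)
        if i == j:
            return  # already in the same entangled group
        (qs_i, v_i), (qs_j, v_j) = self.groups[i], self.groups[j]
        merged = (qs_i + qs_j, kron(v_i, v_j))
        self.groups = [g for k, g in enumerate(self.groups) if k not in (i, j)]
        self.groups.append(merged)

    def complex_variables(self):
        """Total storage cost, versus 2**n for a monolithic statevector."""
        return sum(len(v) for _, v in self.groups)
```

For instance, with 10 patches of which only three become entangled, this representation holds 8 + 7×2 = 22 complex variables instead of the 1024 a full statevector would require.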