Quantum Circuit Compiler for a Shuttling-Based Trapped-Ion Quantum Computer



Introduction
The current rapid maturation of quantum information processing platforms [1] brings meaningful scientific and commercial applications of quantum computing within reach. While it seems unlikely that fault-tolerant devices [2,3] will scale to sufficiently large numbers of logical qubits in the near future, noisy intermediate-scale quantum (NISQ) devices are predicted to lead the way into the era of applied quantum computing [4]. The quantum compiler stack of such platforms will crucially determine their capabilities and performance: First, fully automated, hardware-agnostic front-ends will enable access for non-expert users from various scientific disciplines and industry. Second, optimizations performed at the compilation stage will allow overcoming limitations due to noise and limited qubit register sizes, thereby increasing the functionality of a given NISQ platform.
As quantum hardware scales to larger qubit register sizes and deeper gate sequences, the platforms become increasingly complex and require tool support and automation. It is no longer feasible to manually design quantum circuits and use fixed decomposition schemes to convert the algorithm input into a complete set of qubit operations which the specific quantum hardware can execute, referred to as its native gate set. Dedicated quantum compilers are required for the optimized conversion of large circuits into low-level hardware instructions.
In this work, we present such a circuit compiler developed for a shuttling-based trapped-ion quantum computing platform [5][6][7][8]. This platform encodes qubits in long-lived states of atomic ions stored in segmented microchip ion traps, as shown in Fig. 1. To increase potential scalability, subsets of the qubit register (a set of qubits that belong together) are stored at different trap segments. Shuttling operations, performed by changing the voltages applied to the trap electrodes, dynamically reconfigure the register between subsequent gate operations. Gates are executed at specific trap sites and are driven by laser or microwave radiation. While this approach provides all-to-all connectivity within the register and avoids crosstalk errors, the shuttling operations incur substantial timing overhead, resulting in rather slow operation timescales on the order of tens of microseconds per operation [9]. This can lead to increased error rates because the reduced operation speed aggravates dephasing. Furthermore, the shuttling operations can lead to reduced gate fidelities due to shuttling-induced heating of the qubit ions. Therefore, besides compiling a given circuit into a native gate set, the main task of the gate compiler stage of a shuttling-based platform is to minimize the required number of shuttling operations. This is achieved by minimizing the overall gate count and by arranging the execution order of the gates in a favorable way. A subsequent Shuttling Compiler stage, which is beyond the scope of this work, handles the task of generating schedules of shuttling operations based on the compilation result [10][11][12][13][14]. This paper focuses on taking into account the properties of the shuttling-based quantum computing hardware when optimizing the circuit. It provides insights into how the hardware architecture can be exploited to further improve the fidelity of the compiled circuit. Since we use many state-of-the-art transformations and algorithms as the basis for our circuit compiler, much of this work is also applicable to more general quantum circuit compilers.
The structure of this paper is as follows: Sec. 2 reviews existing circuit optimization techniques. Sec. 3 defines the representation of quantum circuits used in this work. This is followed in Sec. 4 by a detailed description of all circuit transformation algorithms used. Parameterized circuits and their compilation are discussed in Sec. 5. An evaluation of the methods is presented in Sec. 6 and shows the benefits of our circuit compiler.

Background
Due to the increasing size and complexity of quantum circuits, automatic circuit compilation is required to execute quantum circuits on different platforms. For this purpose, powerful frameworks for quantum computing [15][16][17] have been developed. Although their features vary widely, all frameworks provide some kind of built-in optimization. While the simplest form of circuit compilation replaces gates with predefined sequences of other gates (often referred to as decomposition), more advanced techniques minimize the number of gates. A common strategy is to reduce the overall gate count, with a particular focus on expensive two-qubit gates [18][19][20]. One such approach uses a different circuit representation called ZX-calculus [21], which allows simplifications at the functional level. Another algorithm searches for common circuit patterns, called templates, and replaces them with shorter or otherwise preferable but functionally identical gate sequences [20].
When compiling quantum circuits, the qubit mapping is often considered as well. Ideally, each qubit can interact with any other qubit, allowing two-qubit gates to be executed between any pair of qubits. However, for existing platforms, interactions are limited to a nearest-neighbor topology or to full connectivity within subsets of limited size. Mapping the qubits from the algorithmic circuit to physical qubits subject to these hardware constraints is called the routing problem [15]. To make arbitrary two-qubit gates executable on the quantum hardware, SWAP gates must be inserted into the circuit [22][23][24][25]. In the case of ion trap quantum computers, ion positions can be physically swapped to establish dynamic all-to-all connectivity. Consequently, no computational SWAP gates need to be inserted at this stage.
The Pytket framework [15] provides a wide variety of circuit transformation algorithms, and therefore we use it as the operational basis for the custom circuit compiler described in this paper. Functionality such as the removal of redundancies and the rebasing of arbitrary gates into the native gate set is mainly realized using Pytket's built-in functions. Since Pytket is designed for superconducting architectures, we have additionally developed and implemented some specific functionalities for trapped-ion quantum computers. These include concatenating multiple local rotations into global rotations, restricting gate parameters to a fixed set of values, and improving gate ordering.
Previous approaches to quantum circuit compilers have focused on different architectures such as photonic [26] and superconducting quantum computers [27]. These compilers share similarities with our approach, such as the use of the ZX-calculus [28] to optimize the circuits. There are also several Pytket extensions for different quantum devices [29]. However, the hardware platforms differ inherently in the kind of parallelism they offer, and a compiler should exploit the parallelism of its target platform to obtain the best results. The same applies to the native operations (like the physical ion swap in our case).

Graph description of the quantum circuit
This section describes a quantum circuit as a directed acyclic graph (DAG), which is the data structure on which Pytket and our custom subroutines operate. The first subsection defines the DAG, and the second subsection constructs it. Such a graph is depicted in Fig. 2. At the end of this section, we describe the native gate set of our platform.

Graph definition
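We consider a quantum circuit consisting of a set $\mathcal{Q} = \{q_0, \dots, q_{n-1}\}$ of $n$ qubits. The circuit is represented as a directed acyclic graph $C = (\mathcal{V} \cup \mathcal{G} \cup \mathcal{W}, \mathcal{E})$ with sets of vertices $\mathcal{V}$, $\mathcal{G}$ and $\mathcal{W}$ which are pairwise disjoint and defined as follows:
• $\mathcal{V}$ is the set of input vertices. For each qubit $q_i \in \mathcal{Q}$ there is exactly one vertex $v_i \in \mathcal{V}$, so $|\mathcal{V}| = n$ holds. Each vertex $v_i$ has exactly one outgoing edge.
• $\mathcal{G}$ is the set of inner vertices. Each vertex $G \in \mathcal{G}$ represents one quantum gate of the circuit.
• $\mathcal{W}$ is the set of output vertices. For each qubit $q_i \in \mathcal{Q}$ there is exactly one vertex $w_i \in \mathcal{W}$, so $|\mathcal{W}| = n$ holds. Each vertex $w_i$ has exactly one incoming edge.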

Graph construction
Each quantum circuit $C$ starts with the input vertices $v_0, \dots, v_{n-1} \in \mathcal{V}$, and each qubit $q_i \in \mathcal{Q}$ is assigned to its vertex $v_i$. The outgoing edge $(v_i, G) \in \mathcal{E}$ leads to the vertex $G \in \mathcal{G}$ which represents the quantum gate to be executed first on $q_i$. If the circuit does not contain a quantum gate to be executed on $q_i$, the outgoing edge goes directly to the output vertex $w_i \in \mathcal{W}$, so $(v_i, w_i) \in \mathcal{E}$ holds.

Fig. 2: Example of a quantum circuit graph with four qubits and seven gates. The four qubits are depicted on both sides, and each qubit $q_i$ is assigned to its input vertex $v_i$. In the middle are the gates, which are executed on the qubits. Each gate has as many subvertices as the number of qubits it affects. To the right are the output vertices for each qubit. The directed edges show the order in which the gates are executed on each qubit, represented by the different edge colors. No gate is executed on $q_3$.
All vertices in $\mathcal{G}$ represent the inner vertices of the circuit $C$. If a gate $G_j \in \mathcal{G}$ is executed directly before a gate $G_k \in \mathcal{G}$ on the same qubit $q_i$, a directed edge $(G_j, G_k) \in \mathcal{E}$ connects $G_j$ and $G_k$. The output vertices $w_0, \dots, w_{n-1} \in \mathcal{W}$ form the end of the circuit $C$. Each qubit $q_i \in \mathcal{Q}$ is assigned to its vertex $w_i$. The incoming edge $(G, w_i) \in \mathcal{E}$ comes from the vertex $G \in \mathcal{G}$ which represents the last quantum gate executed on $q_i$. If the circuit does not contain a quantum gate executed on $q_i$, the incoming edge comes directly from the input vertex $v_i \in \mathcal{V}$.
In contrast to the input and output vertices, which all have exactly one outgoing or one incoming edge, the inner vertices have $m \ge 1$ incoming and outgoing edges. To manage these edges, each inner vertex $G_j \in \mathcal{G}$ consists of $m$ subvertices $s_{j,0}, \dots, s_{j,m-1}$. Each of these subvertices has exactly one incoming and one outgoing edge. In the following we denote the set of subvertices of $G_j$ as $S_{G_j}$. Assume that $G_j$, $G_k$ and $G_l$ with $G_j \in \mathcal{V} \cup \mathcal{G}$, $G_k \in \mathcal{G}$ and $G_l \in \mathcal{G} \cup \mathcal{W}$ are executed sequentially on a qubit $q_i \in \mathcal{Q}$, so that the edges $e_{in} = (G_j, G_k)$ and $e_{out} = (G_k, G_l)$ with $e_{in}, e_{out} \in \mathcal{E}$ exist. Instead of connecting the incoming edge $e_{in}$ directly to $G_k$, it is connected to a subvertex $s_{k,\alpha} \in S_{G_k}$. The outgoing edge $e_{out}$ starts at the same subvertex $s_{k,\alpha}$. A special kind of gates acting on $m > 1$ qubits are controlled gates, where $\mu$ qubits with $1 \le \mu < m$ act as control qubits and $m - \mu$ qubits act as target qubits. The gate is executed on the target qubits if and only if all $\mu$ control qubits are in the state $|1\rangle$; the control qubits may also be in a superposition. If $G_k$ is a controlled gate, the $\mu$ control qubits are connected to the subvertices $s_{k,0}, \dots, s_{k,\mu-1}$ and the $m - \mu$ target qubits are connected to the subvertices $s_{k,\mu}, \dots, s_{k,m-1}$. Since only $q_i$ is connected to the subvertex $s_{k,\alpha}$, the subvertex can be denoted as $s_{k,\alpha}^{i}$. After constructing the complete graph, there are $n$ disjoint, well-defined paths in $C$, each starting at an input vertex $v_i \in \mathcal{V}$ and ending at an output vertex $w_i \in \mathcal{W}$. Each path from $v_i$ to $w_i$ describes the order in which the gates are applied to the qubit $q_i \in \mathcal{Q}$. An example of a graph representation of a quantum circuit is shown in Fig. 2.
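The construction above can be sketched in a few lines of Python. This is only an illustration of the data structure, not Pytket's internal representation: the `Gate` record and the helper names `build_paths` and `edges` are hypothetical, and the position of a qubit inside `Gate.qubits` stands in for the subvertex index.

```python
from collections import namedtuple

# Hypothetical minimal gate record: a name and the qubits it acts on.
# The position of a qubit in `qubits` plays the role of the subvertex index.
Gate = namedtuple("Gate", ["name", "qubits"])


def build_paths(n_qubits, gates):
    """Return, for each qubit, its path: input vertex, gate subvertices, output vertex.

    A gate entry on a path is (gate_index, subvertex_index), mirroring the
    subvertices s_{j,0}, ..., s_{j,m-1} of an inner vertex G_j.
    """
    paths = {q: [("in", q)] for q in range(n_qubits)}
    for j, gate in enumerate(gates):
        for sub, q in enumerate(gate.qubits):
            paths[q].append((j, sub))
    for q in range(n_qubits):
        paths[q].append(("out", q))
    return paths


def edges(paths):
    """Directed edges of the DAG: consecutive entries on every qubit path."""
    return [(path[k], path[k + 1])
            for path in paths.values()
            for k in range(len(path) - 1)]


# Three gates on four qubits; no gate acts on qubit 3.
gates = [Gate("H", (0,)), Gate("CNOT", (0, 1)), Gate("ZZ", (1, 2))]
paths = build_paths(4, gates)
```

Because every subvertex has exactly one incoming and one outgoing edge, the paths of different qubits never share an entry, and a qubit without gates (qubit 3 here) gets the direct edge from its input to its output vertex.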

Native gate set
The state of an $n$-qubit register is commonly represented by a normalized complex-valued vector with $2^n$ entries, corresponding to the probability amplitudes of the logical basis states. The gates then act as unitary transformations, represented by unitary matrices of dimension $2^n \times 2^n$, on the state. Products of unitary operators or their corresponding matrices represent the serial execution of gates on different qubits, where the products are read from right to left. Note that we use a less strict notation throughout the paper, where operator products are written as simple products, even though the operators may act on different subsets of the qubit register, and tensor products are not always written explicitly. Since global phases of the quantum states do not affect the measurement outcomes, equality of unitaries and states means equality up to a global phase.
To execute the circuit $C$ on a given hardware platform, it must be transformed into an equivalent circuit consisting only of gates from a native gate set. Our platform [9] implements the native gate set $\mathcal{M} = \{R(\theta, \phi), Rz(\phi), ZZ(\theta)\}$, where each gate is parameterized by up to two rotation angles $\theta$ and $\phi$ with $0 \le \theta, \phi < 2\pi$. Due to their meaning for the actual operation, they are referred to as the pulse area and the phase, respectively. The gates from $\mathcal{M}$ are defined in terms of the Pauli operators X, Y and Z as follows:

The gates R and Rz are single-qubit gates, while ZZ is a two-qubit gate. This set is complete, so any quantum algorithm can be decomposed into a sequence of these operations [30,31]. Note that some trapped-ion platforms do not allow native ZZ gates, but instead use XX gates generated by bichromatic radiation fields [32]. Our compilation scheme is still valid for such architectures, since ZZ gates can be generated from XX gates using local wrapper rotations. Furthermore, the identity I, the Pauli gates X, Y, Z, and the rotations around X and Y, Rx and Ry, are special forms of the R and Rz gates and thus also part of the native gate set; for example, $Rx(\theta) = R(\theta, 0)$ and $Ry(\theta) = R(\theta, \pi/2)$. SWAP gates, which exchange the states of two qubits, are required to establish full connectivity and can be realized using the logical gates contained in $\mathcal{M}$. Since our platform can store a maximum of two ions at a trap segment [9], we remove the SWAP gates from the circuits at compile time and reintroduce them at a later compilation stage, which we discuss in Sec. 4.1. This is advantageous because, instead of laser-driven SWAP gates, it allows us to use physical ion swapping to reconfigure the qubit registers, which does not require the manipulation of the internal qubit states and can therefore be executed at unit fidelity [33][34][35].
On our platform, laser pulses realize all gate operations in (3), and the rotation angle parameters $\theta$ correspond to pulse areas, i. e. integrals of intensity over time. To perform gate operations at high fidelities, these pulse areas must be carefully calibrated. We therefore restrict the set of available gates to rotation angles equal to the precalibrated pulse areas $\theta = \pi$ and $\theta = \pi/2$ for R gates and $\theta = \pi/2$ for ZZ gates. Note that there is no restriction on the rotation angle $\phi$ for the Rz gates. The concatenated use of R gates allows for all pulse areas that are multiples of $\pi/2$. The phases $\phi$ of the gates can be chosen arbitrarily with a resolution limited only by the hardware capabilities. These considerations lead to a restriction of the gate set to $\mathcal{N} = \{R(\pi, \phi), R(\pi/2, \phi), Rz(\phi), ZZ(\pi/2)\}$. The following sections describe the individual transformations which lead to a circuit consisting entirely of elements from $\mathcal{N}$.
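As a concrete reference for the gate sets discussed in this section, the following sketch writes the three gate types as explicit matrices and checks membership in the restricted set. It assumes the common convention $R(\theta,\phi) = \exp(-i\frac{\theta}{2}(\cos\phi\, X + \sin\phi\, Y))$, $Rz(\phi) = \exp(-i\frac{\phi}{2} Z)$ and $ZZ(\theta) = \exp(-i\frac{\theta}{2} Z \otimes Z)$, which is consistent with $Rx(\theta) = R(\theta, 0)$ and $Ry(\theta) = R(\theta, \pi/2)$ but may differ in sign conventions from the platform's definition. The function names and the `is_native` helper are hypothetical.

```python
import cmath
import math


def r_gate(theta, phi):
    """Assumed convention: exp(-i theta/2 (cos(phi) X + sin(phi) Y)) in closed form."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s * cmath.exp(-1j * phi)],
            [-1j * s * cmath.exp(1j * phi), c]]


def rz_gate(phi):
    """Assumed convention: exp(-i phi/2 Z), a diagonal matrix."""
    return [[cmath.exp(-0.5j * phi), 0], [0, cmath.exp(0.5j * phi)]]


def zz_gate(theta):
    """Assumed convention: exp(-i theta/2 Z x Z), returned as its four diagonal entries."""
    minus, plus = cmath.exp(-0.5j * theta), cmath.exp(0.5j * theta)
    return [minus, plus, plus, minus]


# Precalibrated pulse areas as stated in the text: pi/2 and pi for R, pi/2 for ZZ.
NATIVE_PULSE_AREAS = {"R": (math.pi / 2, math.pi), "ZZ": (math.pi / 2,)}


def is_native(name, theta=0.0, tol=1e-9):
    """True if a gate with pulse area `theta` lies in the restricted set.

    Rz gates are always allowed because their angle is unrestricted.
    """
    if name == "Rz":
        return True
    return any(abs(theta - area) < tol for area in NATIVE_PULSE_AREAS.get(name, ()))
```

With these definitions, `r_gate(theta, 0)` reproduces the usual Rx matrix and `r_gate(theta, math.pi / 2)` the usual Ry matrix, while `is_native("ZZ", math.pi)` is false, which is why ZZ gates with other angles need further treatment in later compilation steps.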

Transformations
The overall goal of a quantum compiler is to modify and rearrange the gates in a given quantum circuit in order to obtain an equivalent circuit with a reduced total gate count after mapping to the native gate set, and more favorable operations in terms of execution resources, fidelity, and runtime. The following section describes the transformations to convert the input circuit $C$ into a circuit consisting only of gates from the allowed gate set $\mathcal{N}$, which minimizes the execution overhead on a shuttling-based platform. For some of the following transformations, we use the built-in functions of the quantum programming framework Pytket [15]. An overview of all compilation steps and the corresponding set of gates is shown in Fig. 3. In the same order, each compilation step is described in a subsection. In general, transformations which affect the circuit structure on a large scale are applied earlier, while local adjustments are made later to preserve optimized structures from previous steps. Throughout this section we assume that the input circuit $C$ consists only of quantum gates defined within Pytket [36]. This means that all high-level subcircuits, e. g., a Quantum Fourier Transform [37], have been replaced by Pytket-defined quantum gates before the following transformations are applied. An uncompiled circuit, used as an example throughout this section, is depicted in Fig. 4. In addition to Pytket's built-in algorithms, we take into account the characteristics of the segmented ion trap architecture. Most importantly, gates are always executed simultaneously on all ions stored at the laser interaction zone. This allows the parallel execution of two local qubit rotations R.

Elimination of SWAP gates
Our shuttling-based ion trap quantum computer natively supports the physical swap of ions [33][34][35], which is required to establish full connectivity between the qubits. Thus, in contrast to other gates, SWAP gates are executed by physically swapping ions. In the first compilation step, our compiler replaces the function of SWAP gates by renaming the qubits for all succeeding operations. It is the task of the Shuttling Compiler further downstream in the software stack to generate the corresponding reconfiguration operations. The process of SWAP gate elimination on the circuit from Fig. 4 is shown in Fig. 5.
To eliminate the SWAP gates, the elimination algorithm iterates over the gates $G \in C$ in their execution order. If $G$ is a $SWAP^{k}_{i,j}$ gate acting on the qubits $q_i$ and $q_j$, $G$ has two subvertices $s_{k,0}^{i}$ and $s_{k,1}^{j}$, respectively. The algorithm exchanges the outgoing edges of these two subvertices. This means that all gates on the path from the input vertex $v_i$ to $SWAP^{k}_{i,j}$ are applied to $q_i$. Since the outgoing edges of $s_{k,0}^{i}$ and $s_{k,1}^{j}$ have been exchanged, after $SWAP^{k}_{i,j}$ the qubit $q_i$ follows the path originally taken by $q_j$. The same holds for $q_j$, which after $SWAP^{k}_{i,j}$ follows the path originally taken by $q_i$. Consequently, all gates after $SWAP^{k}_{i,j}$ which were originally executed on $q_j$ are now executed on $q_i$ and vice versa. In this way, the swap is passed through all gates succeeding $SWAP^{k}_{i,j}$ and the algorithm can eliminate $SWAP^{k}_{i,j}$ from the circuit.

Fig. 5: (a) The SWAP gate elimination is applied to the first SWAP gate (the red box). Compared to Fig. 4, the outgoing edges are exchanged. This means that after the SWAP gate, $q_1$ follows the green path and $q_2$ the blue one. Consequently, the qubits on which the $CNOT_{2,0}$, $CRy_{3,1}(\pi)$, $SWAP_{2,0}$, and $ZZ_{1,2}(\pi)$ gates from Fig. 4 act must be adapted to the new qubit. Since all gates on the path of $q_1$ and $q_2$ after the SWAP gate are adapted to the new qubit following the path, the algorithm can eliminate the SWAP gate from the circuit.
(b) The SWAP gate elimination is also applied to the second SWAP gate (the right red box). Consequently, the outgoing edges are exchanged compared to (a). This means that after the SWAP gate, $q_0$ follows the green path and $q_1$ the red one.
(c) The circuit after applying the SWAP gate elimination. The orange gates now operate on different qubits than in Fig. 4.
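The relabeling idea can be illustrated on a flat gate list instead of the DAG. The sketch below is a simplified stand-in for the edge exchange described above, not the compiler's implementation; the function name `eliminate_swaps` and the tuple format `(name, qubits)` are assumptions made for this example.

```python
def eliminate_swaps(n_qubits, gates):
    """Remove SWAP gates by relabeling the qubits of all succeeding gates.

    `gates` is a list of (name, qubits) tuples in execution order, where the
    qubit indices refer to the wires of the original circuit. Returns the
    SWAP-free gate list and, for every original wire, the qubit that carries
    its state at the end of the circuit.
    """
    qubit_on = list(range(n_qubits))  # qubit_on[w]: qubit currently following wire w
    result = []
    for name, wires in gates:
        if name == "SWAP":
            i, j = wires
            # Exchange the paths: from here on, the two qubits trade wires.
            qubit_on[i], qubit_on[j] = qubit_on[j], qubit_on[i]
        else:
            # Every later gate acts on whichever qubit now follows its wire.
            result.append((name, tuple(qubit_on[w] for w in wires)))
    return result, qubit_on


circuit = [("H", (0,)), ("SWAP", (0, 1)), ("CNOT", (0, 2)),
           ("SWAP", (1, 2)), ("X", (1,))]
swap_free, final_map = eliminate_swaps(3, circuit)
```

In this example the CNOT after the first SWAP ends up on qubits 1 and 2, and the final X on qubit 2. The returned map records which qubit holds each output, which is the information a downstream stage would need in order to schedule physical ion swaps.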

Repeated removal and commutation of gates
Our compiler uses Pytket's RemoveRedundancies pass to remove redundant gates or gate sequences from the circuit. Additionally, our compiler executes Pytket's CommuteThroughMultis pass to commute gates. When applying commutations, single-qubit gates are commuted through two-qubit gates whenever possible. This may again introduce redundant gates, which can then be removed. The process of commuting and removing gates is repeated until the overall gate count is no longer reduced. Our compiler executes this transformation step first after the elimination of the SWAP gates, and also applies it after some of the later transformation stages. The transformation preserves the property that a gate sequence is in the gate set $\mathcal{M}$ or in the gate set $\mathcal{N}$ up to multiples of $\pi/2$.
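The repeat-until-no-improvement loop can be sketched with two toy passes in plain Python. These passes only mimic the roles of RemoveRedundancies and CommuteThroughMultis on a tiny gate vocabulary (self-inverse gates and Z commuting with CZ); they are hypothetical stand-ins and do not reproduce Pytket's behavior.

```python
SELF_INVERSE = {"X", "Z", "H", "CZ"}


def remove_redundancies(gates):
    """Cancel pairs of identical self-inverse gates with nothing in between on their qubits."""
    out = []
    for gate in gates:
        name, qubits = gate
        # Find the most recent kept gate that shares a qubit with `gate`.
        k = len(out) - 1
        while k >= 0 and not set(out[k][1]) & set(qubits):
            k -= 1
        if k >= 0 and out[k] == gate and name in SELF_INVERSE:
            del out[k]  # G followed by G is the identity
        else:
            out.append(gate)
    return out


def commute_through_multis(gates):
    """Move a Z gate in front of a directly preceding CZ gate acting on the same qubit."""
    out = list(gates)
    for k in range(1, len(out)):
        prev, cur = out[k - 1], out[k]
        if cur[0] == "Z" and prev[0] == "CZ" and cur[1][0] in prev[1]:
            out[k - 1], out[k] = cur, prev  # Z and CZ are both diagonal, so they commute
    return out


def simplify(gates):
    """Repeat commutation and removal until the gate count stops decreasing."""
    best = remove_redundancies(gates)
    while True:
        candidate = remove_redundancies(commute_through_multis(best))
        if len(candidate) >= len(best):
            return best
        best = candidate


example = [("Z", (0,)), ("CZ", (0, 1)), ("Z", (0,)), ("H", (1,)), ("H", (1,))]
reduced = simplify(example)
```

Here the two H gates cancel immediately, while the two Z gates cancel only after one of them has been commuted through the CZ gate, which is the reason the two passes are alternated.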

Macro Matching
Although Pytket provides a transformation for arbitrary gates into the native gate set M, which is used in Sec. 4.
Due to the angle restrictions of the gates in $\mathcal{N}$, this macro is only applied if $\theta = \ell\pi$ with $\ell \in \mathbb{Z}$ holds. Otherwise, the original CRy gate remains in the circuit and will be replaced by the transformations in Sec. 4.4. To simplify the circuit, our compiler executes the repeated removal and commutation of gates from Sec. 4.2 again. The application of the macro matching to the circuit from Fig. 5c is depicted in Fig. 6.

Transformation to the native gate set
The predefined gate decomposition into the native gate set by macro matching is followed by the conversion of all remaining gates into the native gate set.Our compiler offers four different approaches for this transformation.While they all start differently, they all end with the RebaseCustom pass to convert the remaining non-native gates.
The first approach applies Pytket's CliffordSimp pass to the quantum circuit $C$. This pass contains simplifications similar to those of Duncan and Fagan [38]. After applying the CliffordSimp pass to a circuit, the resulting circuit consists only of Pytket's universal single-qubit TK1 gates and two-qubit CNOT gates, defined as follows:

Since the CliffordSimp pass also converts ZZ gates with $\theta \in \{\pi/2, \pi, 3\pi/2\}$ already contained in $\mathcal{G}$ to a sequence of several TK1 and CNOT gates, our compiler does not apply it directly to the entire circuit $C$, but executes it on subcircuits of $C$ in such a way that ZZ gates with $\theta \in \{\pi/2, \pi, 3\pi/2\}$ are preserved. This is advantageous because the $ZZ(\pi/2)$ gates are already part of our trapped-ion native gate set $\mathcal{N}$ and thus would not become shorter. Similarly, $ZZ(\pi)$ as well as $ZZ(3\pi/2)$ gates can be transformed into smaller gate sequences, see Sec. 4.11. The result of applying the CliffordSimp pass to the example from Fig. 6 can be seen in Fig. 7a.
Sec. 6 shows that the runtime of the CliffordSimp pass scales nonlinearly with the number of gates. To make even large circuits transformable in a reasonable time, a second approach to convert an arbitrary gate sequence to the given native gate set is to execute Pytket's SquashTK1 pass, which simplifies single-qubit gate sequences. Since this pass does not modify the two-qubit gates, and the built-in rebasing routine used later does not allow restricting the rotation angles to those contained in our trapped-ion native gate set $\mathcal{N}$, all ZZ gates in $\mathcal{G}$ with $\theta \notin \{\pi/2, \pi, 3\pi/2\}$ must be replaced manually. To do this, we use the following decomposition:

After the substitution, our compiler executes the repeated removal and commutation of gates from Sec. 4.2 again to simplify the circuit. Then we apply Pytket's SquashTK1 pass, which converts each sequence of single-qubit gates into exactly one TK1 gate. Applying these substitutions to the example from Fig. 6 results in the circuit shown in Fig. 7b.
Besides these two compilation strategies, there are two other approaches. Both are passes which come with the Pytket package and generally perform well, as shown in Sec. 6. The first of these uses Pytket's KAKDecomposition pass and performs the KAK decomposition [39] on $C$. The second uses Pytket's FullPeepholeOptimise pass, which executes Clifford simplifications, commutes single-qubit gates, and squashes subcircuits of up to three qubits [40]. Our compiler executes both approaches on the entire circuit $C$, so that the full potential of these optimizations can be exploited on deep circuits.

Fig. 7 (a): Example from Fig. 6 after applying Pytket's CliffordSimp pass. All sequences of single-qubit gates have been replaced by exactly one TK1 gate. Moreover, the $CRy_{3,1}(0.5\pi)$ gate in Fig. 6 has been transformed into the two orange $CNOT_{3,1}$, the $TK1_{1}(0.5\pi, 0.25\pi, 1.5\pi)$ and the $TK1_{1}(0.5\pi, 1.75\pi, 1.5\pi)$ gates. Since our compiler executes the CliffordSimp pass so that ZZ gates with $\theta \in \{\pi/2, \pi, 3\pi/2\}$ are not converted, $ZZ_{3,2}(0.5\pi)$ and $ZZ_{2,0}(\pi)$ remain in the circuit.
Fig. 7 (b): Example from Fig. 6 after applying Pytket's SquashTK1 pass as described in Sec. 4.4. Again, all sequences of single-qubit gates have been replaced by exactly one orange TK1 gate. In contrast to the CliffordSimp pass, the green two-qubit gates remain unchanged.
Fig. 7 (c): Circuit after applying Pytket's RebaseCustom pass as described in Sec. 4.4 to the circuits from (a) and (b). In this case, both circuits result in the same transformed circuit containing only gates from the gate set $\mathcal{M}$. Each CNOT gate has been transformed into exactly one orange ZZ gate and eleven Rz and Rx gates. Additionally, each TK1 gate has been replaced by exactly three Rz and Rx gates. Since the green ZZ gates are still an element of the gate set $\mathcal{M}$ and have an angle $\theta \in \{\pi/2, \pi, 3\pi/2\}$, they are the only gates which remain unchanged.
Regardless of the approach used, the gates in $\mathcal{G}$ are then transformed into the gates in the set {Rx, Rz, ZZ}. This process is called rebasing. For this transformation we use Pytket's RebaseCustom pass. We found smaller gate counts when excluding the Ry gate, so we excluded Ry from the set. Since $\mathcal{G}$ may contain two-qubit gates which are not ZZ gates when using the approach with Pytket's SquashTK1 pass, the rebasing first replaces all two-qubit gates which are not ZZ gates with sequences of TK1 and CNOT gates. Then the rebasing replaces the TK1 gates with the definition in (8a) and the CNOT gates with the following decomposition:

The decomposition guarantees that the rebasing introduces only ZZ gates with an angle $\theta = \pi/2$ into $\mathcal{G}$, so that after applying this transformation all ZZ gates have an angle $\theta \in \{\pi/2, \pi, 3\pi/2\}$. After applying the RebaseCustom pass to the circuits from Fig. 7a and Fig. 7b, the circuit in Fig. 7c results, which is the same circuit for both approaches in this example.

Fig. 8 (b): Circuit after applying Pytket's RebaseCustom pass to the circuit from (a). Each TK1 gate has been transformed into the TK1 decomposition in (8a), which consists of three gates. Since the angles of some gates from the decomposition are zero, and thus these gates are equal to the identity gate, some decompositions have fewer than three gates.
Fig. 8 (c): The circuit from (b) after the Rz gates have been commuted through the ZZ gates and the resulting redundancies have been removed. Consequently, there are at most two single-qubit gates between two ZZ gates on each qubit $q_i$ and between the last gate executed on $q_i$ and the output vertex of $q_i$. Only between the input vertex of $q_i$ and the first gate executed on $q_i$ are three single-qubit gates possible.
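To make plausible that a CNOT gate needs only a single $ZZ(\pi/2)$ gate plus single-qubit rotations, the following sketch numerically checks one textbook decomposition. It is not the decomposition used by the compiler (which uses only Rx and Rz gates), and it assumes the conventions $Rz(\phi) = \exp(-i\frac{\phi}{2} Z)$, $Ry(\theta) = \exp(-i\frac{\theta}{2} Y)$ and $ZZ(\theta) = \exp(-i\frac{\theta}{2} Z \otimes Z)$; all helper names are hypothetical.

```python
import cmath
import math


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product; the first factor acts on the first (control) qubit."""
    return [[a[i][j] * b[k][l] for j in range(len(a)) for l in range(len(b))]
            for i in range(len(a)) for k in range(len(b))]


def rz(phi):
    return [[cmath.exp(-0.5j * phi), 0], [0, cmath.exp(0.5j * phi)]]


def ry(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def zz(theta):
    minus, plus = cmath.exp(-0.5j * theta), cmath.exp(0.5j * theta)
    return [[minus, 0, 0, 0], [0, plus, 0, 0], [0, 0, plus, 0], [0, 0, 0, minus]]


I2 = [[1, 0], [0, 1]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control = first qubit


def cnot_from_single_zz():
    """Ry(-pi/2) on the target, ZZ(pi/2), Rz(-pi/2) on both qubits, Ry(pi/2) on the target."""
    cz_like = matmul(kron(rz(-math.pi / 2), rz(-math.pi / 2)), zz(math.pi / 2))
    first = kron(I2, ry(-math.pi / 2))   # applied first (rightmost factor)
    last = kron(I2, ry(math.pi / 2))     # applied last (leftmost factor)
    return matmul(last, matmul(cz_like, first))


def equal_up_to_phase(a, b, tol=1e-9):
    """Compare two matrices of equal size while ignoring a global phase factor."""
    n = len(b)
    for i in range(n):
        for j in range(n):
            if abs(b[i][j]) > tol:
                phase = a[i][j] / b[i][j]
                return abs(abs(phase) - 1) < tol and all(
                    abs(a[r][c] - phase * b[r][c]) < tol
                    for r in range(n) for c in range(n))
    return False
```

Calling `equal_up_to_phase(cnot_from_single_zz(), CNOT)` confirms the identity up to a global phase. The negative angles can be mapped into the range $0 \le \phi < 2\pi$ by adding $2\pi$, which again only changes the global phase.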

Building Rx-Rz sequences
At this point, the quantum circuit contains only R and Rz gates with arbitrary angle parameters as single-qubit gates and ZZ gates with an angle parameter of $\ell\pi/2$ as two-qubit gates. This makes the gate set of the quantum circuit compatible with the native gate set $\mathcal{M}$. Since exactly one TK1 gate can compactly represent any arbitrary sequence of single-qubit gates, the first step is to reduce any sequence of concatenated single-qubit gates to one TK1 gate. Such sequences start and end either at two-qubit gates or at the input or output vertices $v_i$, $w_i$. The algorithm is implemented by Pytket's SquashTK1 pass. Fig. 8a depicts how this algorithm is applied to the circuit from Fig. 7c. Then we use Pytket's RebaseCustom pass with (8a) to transform the TK1 gates into gates of the set $\mathcal{M}$. The circuit unitary reduced to all gates acting along the path of the qubit $q_i$ has the following form:

In this expression, $\omega_i$ is the number of ZZ gates executed on $q_i$ and $j(\lambda)$ stands for the qubit $q_{j(\lambda)}$ on which the ZZ gate $\lambda$ also acts. The circuit from Fig. 8a after applying the transformation is shown in Fig. 8b. Each path in this circuit has the above structure. Each ZZ gate $ZZ^{k}_{i,j} \in \mathcal{G}$ is now sandwiched by two Rz gates, which commute with the $ZZ^{k}_{i,j}$ gate on both qubits $q_i$ and $q_j$. So we can commute the succeeding Rz gates through the ZZ gates and merge them with the preceding Rz gates:

Repeating this procedure, combined with the removal of redundant gates until no further reduction in the number of gates is possible, as described in Sec. 4.2, results in a circuit $C$ in which the disjoint paths of each qubit $q_i \in \mathcal{Q}$ have the following form:

Consequently, on each qubit $q_i$ at most one Rx and one Rz gate are executed between two ZZ gates. Let $\omega = \frac{1}{2} \sum_{i} \omega_i$ be the number of ZZ gates in $\mathcal{G}$. Thus $\mathcal{G}$ consists of exactly $\omega$ two-qubit gates and at most $4\omega + 3n$ single-qubit gates (at most three before the first ZZ gate on each of the $n$ qubits and at most two after each ZZ gate on each of the two qubits it acts on). Fig. 8c shows how this transformation is applied to the circuit from Fig. 8b.

(a) Example from Fig. 8c after applying the transformation of single-qubit gates into our trapped-ion native gate set. The two $Rx_{1}(0.25\pi)$ gates in Fig. 8c are transformed into the two orange gate sequences, and their angles are included in the phase of the $Rz_{1}(1.25\pi)$ gates. All single-qubit gates of the circuit are in our trapped-ion native gate set $\mathcal{N}$.
(b) Circuit from (a) after the Rz gates have been commuted through the ZZ gates and the resulting redundancies have been removed. Since the transformation is applied directly after the Rx-Rz sequences have been built, there are at most four single-qubit gates between two ZZ gates on each qubit $q_i$ and between the last gate executed on $q_i$ and the output vertex of $q_i$. Only between the input vertex of $q_i$ and the first gate executed on $q_i$ are five single-qubit gates possible.
Fig. 9: The process of transforming the single-qubit gates into our trapped-ion native gate set $\mathcal{N}$.
Rx and Rz with arbitrary angles are the only remaining single-qubit gates in the circuit, and both are also part of $\mathcal{M}$. Since all remaining two-qubit gates are also part of this set, all gates $G_i$ of the circuit now satisfy $G_i \in \mathcal{M}$.
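The commute-and-merge step of this subsection can be written out explicitly. The following is a sketch under the assumed conventions $Rz(\phi) = \exp(-i\frac{\phi}{2} Z)$ and $ZZ(\theta) = \exp(-i\frac{\theta}{2} Z \otimes Z)$, with products read from right to left; it is meant as an illustration and need not match the notation of the original equations.

```latex
% Rz_i and ZZ_{i,j} are both diagonal in the computational basis, so they commute.
\begin{align}
  Rz_i(\phi_2)\, ZZ_{i,j}(\theta)\, Rz_i(\phi_1)
    &= ZZ_{i,j}(\theta)\, Rz_i(\phi_2)\, Rz_i(\phi_1) \\
    &= ZZ_{i,j}(\theta)\, Rz_i(\phi_1 + \phi_2)
\end{align}
% Gate count: a path with \omega_i ZZ gates carries at most 3 + 2\omega_i
% single-qubit gates, and every ZZ gate lies on two paths, hence
\begin{equation}
  \sum_{i=0}^{n-1} \left( 3 + 2\omega_i \right) = 3n + 4\omega .
\end{equation}
```

The same argument applies to the second qubit $q_j$ of the ZZ gate, so after the merge only one Rz gate per qubit remains next to each ZZ gate.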

Transforming single-qubit gates into our trapped-ion native gate set
Since all gates in the circuit are now part of $\mathcal{M}$, the next step is to make the angles conform to our trapped-ion native gate set $\mathcal{N}$, which has restrictions on the allowed rotation angles. Therefore, all Rx gates must have $\theta \in \{\pi/2, \pi\}$. We can use two trivial conversions for $\theta = 0$ and $\theta = 3\pi/2$. In the first case, the gate is equal to the I gate and can be eliminated. In the second case, we can replace the gate by $Rx(\pi/2)\, Rx(\pi)$. Rotations Rx with any other values of $\theta$ must be converted into a sequence of Rx gates with allowed rotation angles and Rz gates with freely variable rotation angles. We use the decomposition

After this substitution, the restrictions of our trapped-ion native gate set $\mathcal{N}$ are satisfied for the single-qubit gates. Fig. 9a shows how the circuit from Fig. 8c is transformed. Since our compiler applies this transformation directly after building the Rx-Rz sequences in Sec. 4.5, the unitaries corresponding to the disjoint paths of each qubit $q_i$ have the following form:

By again commuting Rz gates through ZZ gates and combining successive Rz gates according to (12) using the procedures from Sec. 4.2, we can simplify the expression to

Consequently, on each qubit $q_i$ there are at most two Rx gates and two Rz gates between two ZZ gates. Let $\omega$ be the number of ZZ gates in $\mathcal{G}$ as defined in (14). Thus $\mathcal{G}$ consists of exactly $\omega$ two-qubit gates and at most $8\omega + 5n$ single-qubit gates. Fig. 9b depicts how this transformation is applied to the circuit from Fig. 9a. Since all Rx gates now satisfy $\theta \in \{\pi/2, \pi\}$, all single-qubit gates are now part of our trapped-ion native gate set $\mathcal{N}$.
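A decomposition of this kind can be checked numerically. The sketch below verifies the identity $Rx(\theta) = Rz(\pi/2)\, Rx(\pi/2)\, Rz(\theta + \pi)\, Rx(\pi/2)\, Rz(\pi/2)$ up to a global phase, assuming $Rx(\theta) = \exp(-i\frac{\theta}{2} X)$ and $Rz(\phi) = \exp(-i\frac{\phi}{2} Z)$. This form is an assumption chosen to agree with the $Rz_{1}(1.25\pi)$ gates reported for $Rx_{1}(0.25\pi)$ in Fig. 9a; it may differ from the decomposition actually used. The helper names are hypothetical.

```python
import cmath
import math


def rx(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


def rz(phi):
    return [[cmath.exp(-0.5j * phi), 0], [0, cmath.exp(0.5j * phi)]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def decompose_rx(theta):
    """Gate sequence (in execution order) using only Rx(pi/2) and free Rz angles."""
    half_pi = math.pi / 2
    return [("Rz", half_pi), ("Rx", half_pi), ("Rz", theta + math.pi),
            ("Rx", half_pi), ("Rz", half_pi)]


def unitary(sequence):
    """Multiply the gates of a sequence; later gates are applied from the left."""
    u = [[1, 0], [0, 1]]
    for name, angle in sequence:
        gate = rx(angle) if name == "Rx" else rz(angle)
        u = matmul(gate, u)
    return u


def overlap(a, b):
    """|tr(a^dagger b)| / 2, which is 1 exactly when two unitaries agree up to a phase."""
    return abs(sum(a[i][j].conjugate() * b[i][j]
                   for i in range(2) for j in range(2))) / 2
```

For instance, `overlap(unitary(decompose_rx(0.25 * math.pi)), rx(0.25 * math.pi))` evaluates to 1 within numerical precision. The outer Rz gates are the ones that can subsequently be commuted through neighboring ZZ gates and merged, which leads to the bound of two Rx and two Rz gates between two ZZ gates.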

Phase tracking
We now use the fact that an Rz gate followed by an R gate is equivalent to a single R gate with a phase shifted by the rotation angle of the Rz gate:

Combined with the fact that Rz gates commute through ZZ gates, this allows a virtual execution of all Rz gates using phase tracking [7,41]. This technique avoids any physical execution of Rz gates and is therefore beneficial in terms of overall runtime and fidelity. All R gates contained in G must be modified with respect to their phase arguments according to the procedure described in the following. The algorithm initializes a tracking phase b_i = 0 for each qubit q_i ∈ Q and follows the path from the input vertex v_i ∈ V to the output vertex w_i ∈ W. It applies the following rules to each gate G_i ∈ G it encounters:

To correct the final accumulated phase, the algorithm inserts an additional gate Rz_i(b_i) into C directly before w_i. Since the Rz gate only changes the phase of the qubit, the measurement result in the computational basis as performed on the hardware would not be affected. However, adding the Rz gate still has advantages when working with the unitary matrix and when using the circuit as a building block for even larger circuits. An example can be seen in Fig. 10.

Fig. 10: Example from Fig. 9b after applying the phase tracking algorithm to the circuit. For each Rx gate, the graph contains an R gate depending on the angle of the original gate and the value of the tracking phase b_i when the algorithm is applied to the qubit q_i. The R gates can be simplified to Rx gates if b_i = 0 or Ry gates if b_i = π/2. The graph contains no more Rz gates except between the last R gate executed on each qubit q_i and the output vertex of q_i. Consequently, there are at most two R gates between two ZZ gates on each qubit q_i and between the input vertex of q_i and the first ZZ gate executed on q_i. Only between the last gate executed on q_i and the output vertex of q_i two R gates and one Rz gate are possible. Since the values of the tracking phases b_1 and b_3 are zero after applying the algorithm to q_1 and q_3, no Rz gate is placed before the output vertices of q_1 and q_3.

Since our compiler executes the phase tracking algorithm immediately after the transformation in Sec. 4.6, the disjoint paths of each qubit q_i ∈ Q have the following form:

Note that the subsequent R gates generally cannot be merged because they have different phase parameters. Thus, on each qubit, at most two R gates are executed between ZZ gates. Hence, G consists of exactly ω two-qubit gates and at most 4ω + 3n single-qubit gates. Consequently, phase tracking can almost halve the number of single-qubit gates. Since phase tracking only changes the phases and not pulse areas, all single-qubit gates are still in N.
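The phase tracking pass described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tuple-based gate format and the name `track_phases` are hypothetical, and the sign of the phase update depends on the gate convention. Here R(θ, φ) = exp(−iθ/2 (cos φ X + sin φ Y)) and Rz(α) = exp(−iαZ/2) are assumed, for which "Rz(α), then R(θ, φ)" equals "R(θ, φ − α), then Rz(α)". The rules used in this work may differ in sign.

```python
def track_phases(gates, num_qubits):
    """Remove all Rz gates by absorbing them into the phases of later R gates.

    gates (execution order) are tuples in a hypothetical format:
      ("Rz", q, angle), ("R", q, theta, phi), ("ZZ", q1, q2, theta)
    """
    b = [0.0] * num_qubits                 # one tracking phase per qubit
    out = []
    for g in gates:
        if g[0] == "Rz":
            _, q, angle = g
            b[q] += angle                  # virtual gate: only update b_q
        elif g[0] == "R":
            _, q, theta, phi = g
            out.append(("R", q, theta, phi - b[q]))  # assumed sign convention
        else:
            out.append(g)                  # ZZ commutes with Rz: unchanged
    for q in range(num_qubits):
        if b[q] != 0.0:                    # final correction before the output
            out.append(("Rz", q, b[q]))
    return out


circuit = [("Rz", 0, 0.25), ("R", 0, 1.5, 0.0), ("ZZ", 0, 1, 1.5),
           ("Rz", 0, 0.5), ("R", 0, 3.0, 0.125)]
tracked = track_phases(circuit, 2)
```

In the result, the only remaining Rz gate sits at the end of qubit 0, while qubit 1 receives none because its tracking phase stays zero.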

Block aggregations
In this section, we describe further optimization steps which are specific to our shuttling-based architecture.
In our architecture, the ions are stored in a linear segmented trap, with qubit subsets stored at different trap segments. Shuttling operations can rearrange the qubit sets between gate operations [6,9]. Laser beams directed to a fixed location, the laser interaction zone, perform all gate operations. There, each gate operation is executed simultaneously on all stored ions. Hence, identical single-qubit gates can be executed on multiple qubits in parallel. We use this property to reduce the number of gates. As an additional constraint, we keep shuttling operations to a minimum. Thus, we parallelize single-qubit operations only on qubits which are already in the same segment, i. e. before or after a two-qubit gate is executed on them.
The optimization described in this section deals with the aggregation of single-qubit gates.It uses the property that there is a specific structure around certain ZZ gates.
Directly before and/or directly after a ZZ gate ZZ^k_{i,j} there can be sequences of preceding single-qubit gates p = (p_0, …, p_{α−1}), p_0, …, p_{α−1} ∈ C, and/or succeeding single-qubit gates s = (s_0, …, s_{β−1}), s_0, …, s_{β−1} ∈ C, which are identical on both qubits q_i, q_j ∈ Q on which the gate ZZ^k_{i,j} acts. This means that in these sequences the gates as well as the angle parameters match exactly on q_i and q_j. The advantage of these sequences is that the gates p_0, …, p_{α−1}, s_0, …, s_{β−1} can be executed simultaneously for both qubits q_i and q_j instead of sequentially for each qubit, which reduces the execution time and the required shuttling overhead. Each ZZ gate, together with the single-qubit gates contained in its sequences p and s, forms a block A. We denote the block belonging to ZZ^k_{i,j} as A_{ZZ^k_{i,j}} in the following.

Iterating over all ZZ gates in their execution order, the algorithm checks for each gate ZZ^k_{i,j} whether q_i and q_j undergo an identical single-qubit gate G_p directly before ZZ^k_{i,j}. If this is the case and G_p does not already belong to another block on q_i or q_j, the algorithm adds G_p to the predecessor sequence p of the block A_{ZZ^k_{i,j}} and repeats the procedure with the gate directly before G_p. Otherwise p cannot be extended. The same procedure is repeated with the gates directly after ZZ^k_{i,j} to build the sequence s of A_{ZZ^k_{i,j}}. Since the algorithm traverses the ZZ gates in their execution order, it is guaranteed that candidate gates for s do not already belong to another block. If it is not possible to build p and s, the block consists only of ZZ^k_{i,j}. An example of the block building process is depicted in Fig. 11.

Fig. 11: An example circuit with the blocks built by the algorithm from Sec. 4.8. Since the graph contains three ZZ gates, it also contains three blocks, represented by the purple boxes around the ZZ gates. Each block A_{ZZ^k_{i,j}} can have a predecessor sequence p and/or a successor sequence s. The block A_{ZZ^2_{0,2}} contains neither a sequence p nor a sequence s and consists only of the ZZ gate. Besides the three blocks, the graph consists of the seven blockless sequences b_0 to b_6.

After building such blocks around each ZZ gate ZZ^k ∈ G, there may exist gate sequences b_0, …, b_{ℓ−1} which do not belong to a block. Each of these blockless sequences consists only of single-qubit gates and belongs to exactly one qubit. The goal is to minimize the number ℓ of blockless sequences. To minimize the number and length of blockless single-qubit sequences, the blocks can be rearranged and single-qubit gates can be split. These two approaches are discussed in the following two subsections.
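The block building procedure can be sketched as follows. This is a simplified illustration, not the compiler's code: the per-qubit "timeline" representation, the tuple gate format, and the function names are assumptions introduced for this example. Each qubit holds its gates in execution order, a two-qubit gate `("ZZ", k)` appears in the timelines of both qubits, and single-qubit gates are compared for exact equality of type and angles.

```python
def grow(timelines, qi, qj, zi, zj, step, taken):
    """Collect identical single-qubit gates on qi and qj, walking away
    from the ZZ gate (step = -1: predecessors, step = +1: successors)."""
    seq, n = [], 1
    while True:
        a, b = zi + step * n, zj + step * n
        if not (0 <= a < len(timelines[qi]) and 0 <= b < len(timelines[qj])):
            break
        ga, gb = timelines[qi][a], timelines[qj][b]
        if ga[0] == "ZZ" or ga != gb or (qi, a) in taken or (qj, b) in taken:
            break                          # mismatch, or gate already in a block
        seq.append(ga)
        taken.update({(qi, a), (qj, b)})
        n += 1
    return seq


def build_blocks(timelines, zz_order):
    """zz_order: list of (k, (qi, qj)) in execution order.

    Returns {k: (p, s)} and the positions (qubit, index) left blockless.
    """
    taken, blocks = set(), {}
    for k, (qi, qj) in zz_order:
        zi = timelines[qi].index(("ZZ", k))
        zj = timelines[qj].index(("ZZ", k))
        taken.update({(qi, zi), (qj, zj)})
        p = grow(timelines, qi, qj, zi, zj, -1, taken)[::-1]
        s = grow(timelines, qi, qj, zi, zj, +1, taken)
        blocks[k] = (p, s)
    free = sorted((q, i) for q, tl in timelines.items()
                  for i in range(len(tl)) if (q, i) not in taken)
    return blocks, free


timelines = {
    0: [("R", 1.5, 0.0), ("ZZ", 0), ("R", 3.0, 0.0)],
    1: [("R", 1.5, 0.0), ("ZZ", 0), ("R", 1.5, 0.0), ("ZZ", 1)],
    2: [("ZZ", 1)],
}
blocks, free = build_blocks(timelines, [(0, (0, 1)), (1, (1, 2))])
```

In this toy input, the block of ZZ gate 0 obtains a one-gate predecessor sequence, while the two differing gates after it remain blockless.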

Rearrangement of blocks
A block rearrangement is possible if before a sequence p of a block A_α both qubits q_i and q_j undergo the same gate G, but for one qubit, e. g., q_j, G already belongs to a preceding block A_β and for q_i it belongs to a blockless sequence b_µ. Assume A_β operates on the qubit q_j and a third qubit q_k. If the entire blockless sequence b_µ is equal to the end s_e of the sequence s of the block A_β, the gates of b_µ and s_e acting on the qubit q_j are added to the front of the sequence p of A_α. Consequently, the blockless sequence b_µ is eliminated and the gates of q_j in s_e are removed from s_e, see Fig. 12a and Fig. 12b. Since only the gates of the qubit q_j in s_e are removed from s_e, the gates of the qubit q_k in s_e result in a new blockless sequence b_ν. To eliminate b_ν, it is necessary that b_ν is adjacent to a third block A_γ operating on the qubit q_k and a qubit q_l. Here the case q_i = q_l is possible. If before the sequence p of A_γ there is a blockless sequence b_ξ on the qubit q_l whose end b_ξe equals the sequence b_ν, the procedure adds b_ν and b_ξe to the front of p of A_γ and the number of blockless sequences in the circuit is reduced by one. This can be seen in Fig. 12b and Fig. 12c. If the number of blockless sequences cannot be reduced because a matching sequence is missing at some step, the blocks are not rearranged. We repeat the rearranging procedure for all blocks on which a rearrangement is possible.

Angle splitting
After the rearrangement, the number of blockless sequences might be further reduced by splitting the rotation angles of certain gates. Assume that directly before a sequence p or directly after a sequence s of a block A_{ZZ^k_{i,j}} both qubits q_i and q_j undergo rotations R_i(θ_i, ϕ) and R_j(θ_j, ϕ) with the same phase angle but different rotation angles. Then it is possible to split the gate with the larger rotation angle into two rotations, merge the resulting two identical gates into the respective sequence s or p of the block, and eventually reduce the number of blockless sequences. The phase angle is omitted from the notation in the following. We now assume that G_i(θ_i) belongs to the blockless sequence b_i and G_j(θ_j) belongs to the blockless sequence b_j, and that both b_i and b_j either directly precede or directly succeed block A. We also assume without loss of generality that θ_i < θ_j. The gate G_j(θ_j) can be split into two consecutive gates G_j(θ_i) G_j(θ_j − θ_i), so that a simultaneous gate G(θ_i) can be merged into either p or s. In the former case, the procedure adds G(θ_i) to the front of p, while in the latter case, it appends G(θ_i) to the end of s. In both cases, the procedure removes G_i(θ_i) from b_i, and within b_j it replaces G_j(θ_j) with G_j(θ_j − θ_i). This reduces the blockless sequence b_i by one gate and eliminates b_i if it is now empty. If this is not the case and b_i cannot be eliminated, the procedure rejects the transformation step and keeps the initial gates G_i(θ_i) and G_j(θ_j). We repeat the procedure until no block A can be extended further. The execution of the angle splitting approach on the circuit from Fig. 12c is shown in Fig. 12d.

(a) The same gate sequences b_µ and s_e are executed on q_1 and q_3 before the block A_α. On q_1 the gate sequence b_µ is a blockless sequence, while on q_3 the gate sequence s_e belongs to a different block A_β. To eliminate the blockless sequence b_µ, b_µ and the gates on q_3 from s_e can be added as a predecessor sequence to A_α.
(b) b_µ and the gates on q_3 from s_e have been added as predecessor sequence p to A_α. Consequently, the gates on q_2 from s_e now build a new blockless sequence b_ν. This blockless sequence is identical to the blockless sequence b_ξe on q_0. Since b_ν and b_ξe are both directly in front of a third block A_γ, they can be added as predecessor sequence to A_γ.
(c) b_ν and b_ξe have been added as predecessor sequence p to A_γ. By rearranging the blocks, the graph now has one blockless sequence less than in (a).
(d) In (c) the same gates are executed before the block of gate ZZ^0_{0,1} and after the block of gate ZZ^2_{0,2}, but with different pulse areas θ. Using the angle splitting approach, the gate with the larger angle is replaced by the two orange gates. Since the orange gate next to the ZZ gate and the blue gate have the same angle, these two gates can be added to the block. This eliminates two blockless sequences in this example. The two R gates before the block of gate ZZ^1_{2,3} cannot be used for angle splitting because the phase ϕ is different.
Fig. 12: The process of rearranging blocks and splitting angles.
Afterwards, our compiler transforms all blockless sequences which cannot be completely eliminated according to Sec. 4.4 so that they consist of a minimal number of native gates from N. The same transformations are applied to the sequences p and s of each block A to minimize the number of gates used. Note that before angle splitting the circuit consists only of gates with rotation angles θ ∈ {π/2, π}, so angle splitting preserves the property that all gates are contained in the set of allowed gates N.
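The acceptance rule of the angle splitting step can be sketched as follows for two blockless sequences that directly precede a block. This is a minimal illustration with hypothetical names: gates are `(theta, phi)` tuples in execution order, so the last entry is the gate adjacent to the block, and the split is only accepted if it eliminates the sequence holding the smaller angle. With native angles θ ∈ {π/2, π}, the only non-trivial split is π = π/2 + π/2.

```python
import math


def try_angle_split(b_i, b_j, tol=1e-12):
    """Try to move a shared rotation from b_i and b_j into the block.

    Returns (shared_gate, new_b_i, new_b_j), or None if the step is rejected.
    """
    if not b_i or not b_j:
        return None
    (theta_i, phi_i), (theta_j, phi_j) = b_i[-1], b_j[-1]
    if abs(phi_i - phi_j) > tol or abs(theta_i - theta_j) < tol:
        return None                        # different phase, or nothing to split
    if theta_i > theta_j:                  # let b_i hold the smaller angle
        res = try_angle_split(b_j, b_i, tol)
        return None if res is None else (res[0], res[2], res[1])
    if len(b_i) != 1:
        return None                        # b_i would not be eliminated: reject
    shared = (theta_i, phi_i)              # executed on both qubits at once
    new_b_j = b_j[:-1] + [(theta_j - theta_i, phi_j)]
    return shared, [], new_b_j


result = try_angle_split([(math.pi / 2, 0.0)], [(math.pi, 0.0)])
```

Here the π rotation is split, the shared π/2 rotation joins the block, and a single π/2 rotation remains in the second sequence.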

Commutations of blocks and blockless sequences on disjoint sets of qubits
The previous compilation steps ensure that the gates acting on a certain qubit are executed in a function-preserving order. However, these compilation steps are not optimized for the shuttling overhead, and this generally results in an unfavorable order of blocks and blockless sequences. We now describe how the execution order of blocks and blockless sequences can be rearranged to reduce the number of shuttling operations required. We exploit the property that the execution order of two blocks or blockless sequences can be swapped if they act on disjoint sets of qubits. The approach then tries to commute each blockless sequence so that it is executed immediately before or after a block applied to the same qubit. An example of such a rearrangement is shown in Fig. 13.

Now assume that two blockless sequences b_i and b_j are executed after a block A_α and that A_α acts on the qubits q_i, q_j ∈ Q, b_i is applied to q_i, and b_j is applied to q_j. The algorithm now iterates over the initial execution order of the circuit, searching for the next block A_β which acts on either q_i or q_j after A_α. If none exists, the order of b_i and b_j may be arbitrary. If A_β is applied to q_i and not to q_j, then b_j is executed before b_i. The analogous rule holds if A_β acts on q_j and not on q_i. Moreover, if A_β acts on q_i and q_j, the algorithm commutes A_β so that it is executed immediately after A_α, with only b_i and b_j in between. In this way, blocks which operate on exactly the same qubits are executed successively if possible.

(a) The gates are executed in a semi-random order, taking care only that on each qubit q_i a gate G is executed after all gates placed before G on q_i have been executed.
(b) The commutations have been applied. Blockless sequences of a qubit q_i are always executed directly before or after a block acting on q_i. Moreover, since the Blocks 8 and 11 act on the same qubits, they are executed successively with only the blockless sequences 9 and 10 in between. Since the Blocks 3 and 14 have only q_1 in common, blockless sequence 5 is executed after blockless sequence 4. Additionally, blockless sequence 13 is executed after blockless sequence 12 because the Blocks 11 and 14 have only q_2 in common. The order of the blockless sequences in all other pairs executed before or after a block is arbitrary. E. g., the order of the blockless sequences 1 and 2 can be reversed.
Fig. 13: Example of applying the commutation of blocks and blockless sequences on disjoint sets of qubits to a circuit. The purple boxes represent blocks and the yellow boxes represent blockless sequences. The numbers inside the boxes indicate the order in which the gates are executed, i. e. the gate with the number 1 is executed first, then the gate with the number 2, and so on.
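The ordering rule for the two blockless sequences behind a block can be sketched as a small decision function. This is only an illustration under simplifying assumptions: blocks are reduced to their qubit pairs, the function name and return format are hypothetical, and the actual reordering of the circuit is omitted.

```python
def order_after_block(block, later_blocks):
    """Decide how b_i and b_j are ordered after the block A_alpha.

    block: (q_i, q_j) of A_alpha.
    later_blocks: qubit pairs of the blocks following A_alpha in the
    initial execution order.
    Returns (order, pull), where order lists the qubits whose blockless
    sequences are executed first and second, and pull is the index of a
    later block to be commuted directly behind them (or None).
    """
    q_i, q_j = block
    for idx, nxt in enumerate(later_blocks):
        on_i, on_j = q_i in nxt, q_j in nxt
        if on_i and on_j:
            return (q_i, q_j), idx         # same qubits: execute it right after
        if on_i:
            return (q_j, q_i), None        # b_j first, b_i next to A_beta
        if on_j:
            return (q_i, q_j), None        # b_i first, b_j next to A_beta
    return (q_i, q_j), None                # no later block: order is arbitrary


order, pull = order_after_block((1, 2), [(0, 3), (2, 3)])
```

Blocks skipped during the search act on neither q_i nor q_j, which is why the block found can be commuted past them.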

Block commutations with correction unitaries
In the last subsection we have commuted blocks and blockless sequences acting on disjoint sets of qubits. In this subsection, we present an additional algorithm which commutes superblocks with the following properties: Each superblock consists of at least one two-qubit gate and can additionally contain several single-qubit gates. All gates of the superblock must operate on the same two qubits and must be executed consecutively on these qubits. For the commutation, the algorithm needs two superblocks with exactly one qubit in common, so that the commutation affects gates on three qubits. Between the two superblocks, no gate may affect the common qubit. The commutation of the superblocks is performed in such a way that afterwards subsequent superblocks in the execution order operate on the same qubits or have at least one qubit in common. Where possible, these commutations increase the number of such sequences and thus reduce the shuttling overhead.

The order of blocks with unitaries of size ν × ν with ν ∈ ℕ cannot simply be swapped, because the corresponding matrix multiplications are noncommutative. Consequently, the commutation of the unitaries associated with the superblocks introduces an error into the circuit. To correct the error, a correction unitary is inserted into the circuit after commutation. It is placed behind the superblocks and operates on the same three qubits as the two superblocks. However, the correction unitary can insert two-qubit gates between all three qubits, breaking the structure that the gates operate only between the qubits of their superblock. To prevent this, we show that under certain conditions the correction unitary can be factorized into two correction unitaries. While one of the unitaries is applied only to one of the qubits which is not the common qubit, the other unitary is applied to the other two qubits, which are acted on by the same superblock. This guarantees that the correction unitaries introduce at most two-qubit gates between these two qubits. If the correction unitary does not factorize, the presented algorithm rejects the commutation of the superblocks. An example is shown in Fig. 14.

In Sec. 4.10.1 the detailed theory of the block commutation with correction unitaries is described. Afterwards, in Sec. 4.10.2 an algorithm for the execution of the block commutation with correction unitaries is presented. This is followed by a heuristic in Sec. 4.10.3 to measure the impact of the block commutation with correction unitaries on the shuttling operations independently of a concrete implementation of a Shuttling Compiler.

(a) … or not to the adjacent superblocks. Since no gates are allowed on the common qubit q_2 between the two superblocks, blockless sequence 13 must be added completely to B_β or B_γ or divided between them.
(b) The superblock B_β has been commuted before B_γ and the correction unitaries U′_{q1} and U′_{q2 q3} (the dark gray boxes) have been inserted into the circuit. The brown, red, green, and light gray colors indicate to which blockless sequences the corresponding blockless sequences from (a) can be transformed, depending on whether they have been added completely, partially, or not to a superblock, and how they are ordered in the new blockless sequences. The commutation leads to a step-shaped arrangement of the gates, which further reduces the number of shuttling operations.
Fig. 14: Example of applying the block commutation with correction unitaries to a circuit. While the purple boxes represent the blocks built during block building in Sec. 4.8, the blue boxes represent the superblocks to be swapped, and the dark gray boxes are the correction unitaries. All other boxes are blockless sequences. Analogous to Fig. 13, the numbers inside the boxes indicate the order in which the gates are executed.

Theory of the block commutation with correction unitaries
We consider two consecutive superblocks B_β acting on the qubits q_i, q_j ∈ Q and B_γ acting on q_j, q_k ∈ Q, resulting in the three-qubit unitary

where U_{q_i q_j} is the unitary associated with B_β and U_{q_j q_k} is the unitary associated with B_γ. Consecutive in this context means that no gates may be executed on the common qubit q_j between the two superblocks.
It can be beneficial to reverse the execution order of the superblocks to reduce the shuttling overhead. In this section, we show how to determine the conditions under which the execution order can be reversed. In general, the reversal requires a correction unitary U′ to obtain the same total unitary:

The three-qubit correction unitary is given by

Fig. 15: The two superblocks represented by the unitaries U^(β)_{q_i q_j} and U^(γ)_{q_j q_k} are commuted. Since in general it is not guaranteed that both unitaries commute, a correction unitary U′ is inserted, which corrects the error introduced by the commutation of the superblocks. Under certain conditions U′ can be factorized into two correction unitaries U′_{q_i} and U′_{q_j q_k}.

Given U^(β)_{q_i q_j} and U^(γ)_{q_j q_k} in matrix form, we need to check if U′ factorizes into a q_i/q_j q_k separable form:

We explicitly compute U′ and use the set of Pauli operators for a single qubit, P_q = {I_q, X_q, Y_q, Z_q}, to decompose it into single-qubit Pauli operators acting on q_i and two-qubit Pauli operators acting on q_j and q_k:

Note that the P_µ are two-qubit unitaries from the tensor product set P_{q_j} ⊗ P_{q_k}. The decomposition results in a 4 × 16 matrix M with the entries

which has the singular value decomposition

with the 4 × 4 and 16 × 16 unitary matrices U and V. The 4 × 16 matrix S consists of a 4 × 4 diagonal matrix on the left, whose non-zero entries are the singular values of M, and a 4 × 12 zero matrix on the right. If rank S = 1, only S_11 ≠ 0 and U′ factorizes according to (23). The resulting correction unitaries are given by

The final unitary in favorable order is

which contains U_{q_i q_j} and U_{q_j q_k} in the commuted order compared to (20). A visualization of the commutation is given in Fig. 15.
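The factorization test described in this subsection can be sketched numerically with NumPy. The sketch rests on several assumptions that are not taken from the text: B_γ is executed before B_β in the original circuit, the correction unitary multiplies the commuted product from the left (i.e. it is executed last), q_i is the most significant qubit in the Kronecker ordering, and the Pauli coefficients are normalized as M_{νµ} = Tr[(P_ν ⊗ P_µ)† U′] / 8. The function names are hypothetical.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS_1Q = [I2, X, Y, Z]
PAULIS_2Q = [np.kron(a, b) for a, b in itertools.product(PAULIS_1Q, repeat=2)]


def correction_unitary(u_beta, u_gamma):
    """U' such that U' @ (commuted order) equals the original order."""
    big_beta = np.kron(u_beta, I2)         # acts on (q_i, q_j)
    big_gamma = np.kron(I2, u_gamma)       # acts on (q_j, q_k)
    original = big_beta @ big_gamma        # assumed: gamma first, then beta
    commuted = big_gamma @ big_beta        # beta first, then gamma
    return original @ commuted.conj().T


def factorize(u_prime, tol=1e-9):
    """Return (U'_qi, U'_qjqk) if U' is q_i / q_j q_k separable, else None."""
    m = np.array([[np.trace(np.kron(p, q).conj().T @ u_prime) / 8
                   for q in PAULIS_2Q] for p in PAULIS_1Q])   # 4 x 16
    u, s, vh = np.linalg.svd(m)
    if np.sum(s > tol) != 1:               # more than one singular value
        return None
    a = sum(u[n, 0] * PAULIS_1Q[n] for n in range(4))
    b = s[0] * sum(vh[0, k] * PAULIS_2Q[k] for k in range(16))
    return a, b


zz = np.diag(np.exp(-1j * np.pi / 4 * np.array([1, -1, -1, 1])))
cnot = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
```

Two ZZ superblocks commute, so their correction unitary is the identity and factorizes trivially. For two CNOT gates sharing one qubit, the correction unitary is a CNOT between q_i and q_k, which is not separable, so the commutation would be rejected.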

Algorithm for executing the block commutation with correction unitaries
In the following, we describe the algorithm for performing the block commutation with correction unitaries on an input circuit C. While for the theory in Sec. 4.10.1 we assumed that the superblocks are already known, in practice the algorithm must first determine the superblocks on the circuit. To this end, a block from Sec. 4.8 forms the basis of each superblock, which is enlarged in a second step. The blocks are chosen to reduce the number of shuttling operations required. The algorithm iterates over the blocks and blockless sequences in their execution order. It searches for three blocks A_α, A_γ, and A_β, where A_γ and A_β are the base blocks of the superblocks B_γ and B_β, respectively. A_α is the block executed directly before B_γ in the execution order and has at least one qubit in common with B_β. Thus, commuting B_γ and B_β places B_β between A_α and B_γ regarding the execution order and ensures that after the commutation, A_α and B_β are executed successively with at least one qubit in common. Consequently, this may lead to better locality during shuttling and reduce the shuttling overhead. At the beginning, the algorithm searches for the block A_α, which is the first block found in the execution order during the iteration. The block following A_α in the execution order is used as A_γ acting on the qubits q_j and q_k and as the basis for the superblock B_γ. If A_α and A_γ act on exactly the same qubits, A_γ becomes the new A_α and the algorithm searches for a new A_γ. This ensures that the locality of acting on common qubits between both blocks is not broken.
If A_α and A_γ act on at least one different qubit, the algorithm searches for the block A_β as the basis for the superblock B_β. To perform the commutation as described in Sec. 4.10.1, A_β needs exactly one common qubit, q_j, with A_γ. Since B_β immediately follows A_α in the execution order after commutation, A_β must also have at least q_i in common with A_α to reduce the shuttling overhead. When searching for A_β, it need not be the direct successor block of A_γ in the execution order. It is possible that there are other blocks between A_γ and A_β. We use the intermediate blocks which act on q_j and q_k later to extend B_γ, as long as there is no other block which acts only on q_j or q_k in the execution order. The other intermediate blocks must not have a qubit in common with A_β to guarantee that A_β commutes with them. If A_α has exactly one qubit in common with A_β and A_γ, a commutation of B_β between A_α and B_γ does not cause A_α to be followed by a block which has more qubits in common with A_α than A_γ. Nevertheless, the commutation is performed in this case because it may allow B_β to be commuted with another commutation before A_α, which may lead to better locality in shuttling. If a block A_β cannot be found, A_γ becomes the new A_α and the algorithm continues to search for a new A_γ.
After our algorithm has determined the base gates of the superblocks, there are two ways to extend them. First, a neighboring block from the blocks built during block aggregation can be merged into a superblock if it operates on the same qubits as the superblock and there is no other block between it and the superblock which operates on only one of the qubits. If another block is added to a superblock, all blockless sequences operating on the same qubits between the added block and the superblock also become part of the superblock.
The second way is to extend the superblocks by their neighboring blockless sequences or parts of them which act on a qubit of the corresponding superblock.
While in general we can add blockless sequences completely or only some gates of them, there are two exceptions where the algorithm always adds blockless sequences completely to the adjacent superblock. Both exceptions have in common that the blockless sequence is executed directly before or after B_β on q_i or B_γ on q_k in the execution order. However, in the first case, the block before or after the blockless sequence in the execution order does not act on the same qubit as the blockless sequence. On the other hand, in the second case, the blockless sequence contains the first or last gates of the corresponding qubit in the circuit. Adding these blockless sequences to the superblocks leads to better locality during shuttling. All other blockless sequences which do not fall under these exceptions can be divided into two parts. When the algorithm divides a blockless sequence, only the gates closer to the superblock are added, while the other part remains outside the superblock. One of the two parts may be empty, which means that, depending on the part, the entire blockless sequence may or may not be added to the superblock. An exception is the blockless sequence on the common qubit q_j between the two superblocks. Since no gate is allowed at this position, this blockless sequence must be added to one of the superblocks. Therefore, it can be added completely to either B_β or B_γ, or it can be divided so that one part is added to B_β and the other part to B_γ.
To find the best partitioning of the dividable blockless sequences, our algorithm determines all possible partitionings of these sequences. Afterwards, for both superblocks, each partitioning of each dividable blockless sequence is combined with each partitioning of the other blockless sequences. In this way, we build several candidate superblocks for B_β and B_γ.
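The enumeration of partitionings can be sketched with `itertools.product`. This is an illustration with hypothetical names and data layout: each dividable blockless sequence is a list of gates in execution order together with its position relative to the superblock, and a partitioning is a pair (inside, outside), where "inside" holds the gates closer to the superblock.

```python
from itertools import product


def cuts(seq, side):
    """All ways to cut seq into (inside, outside).

    side == "before": seq precedes the superblock, so its last gates are
    the ones closer to the superblock; side == "after": its first gates are.
    """
    n = len(seq)
    if side == "before":
        return [(seq[k:], seq[:k]) for k in range(n + 1)]
    return [(seq[:k], seq[k:]) for k in range(n + 1)]


def candidate_partitionings(dividable):
    """Combine every cut of every dividable sequence with all others.

    dividable: list of (seq, side) pairs. A sequence of length n has
    n + 1 cuts, so the number of candidates is the product of (n + 1).
    """
    return list(product(*(cuts(seq, side) for seq, side in dividable)))


candidates = candidate_partitionings([(["a", "b"], "before"), (["c"], "after")])
```

For one sequence of two gates and one sequence of a single gate this yields 3 × 2 = 6 candidates, including the cases where a sequence is added completely or not at all.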
Then we compute the unitary matrices of each candidate superblock of B_β and B_γ, and the procedure in Sec. 4.10.1 is executed for each possible combination of a candidate superblock of B_β with a candidate superblock of B_γ. Due to the generally intertwined structure of the circuit achieved by the previous compilation steps, there are only a few combinations to execute. Thus, checking all combinations is computationally feasible in typical cases. Fig. 16a depicts how our algorithm builds superblocks in the example from Fig. 10 and how they can be extended.

Fig. 16 (a) Example from Fig. 10 after applying block aggregation and the commutation of blocks and blockless sequences on disjoint sets of qubits. The purple boxes represent the blocks built by block aggregation. The left-to-right order of the gates determined by the commutation of blocks and blockless sequences on disjoint sets of qubits describes the execution order of the gates. When the algorithm in Sec. 4.10.2 is executed, it first finds the block A_α acting on q_0 and q_1, and A_γ acting on q_2 and q_3. Since the third block, A_β, has q_1 in common with A_α and q_3 in common with A_γ, it can be used for the block commutation with correction unitaries. The blocks A_γ and A_β are the basis for the superblocks B_γ and B_β, respectively. Expanding the base superblocks with other blocks and the blockless sequences, which are always added to the superblocks, results in the displayed blue superblocks B_γ and B_β. However, it is not guaranteed that the blue blocks are commutable. It is possible to further enlarge B_γ with the yellow gate and B_β with the red gates. In the case of B_β, it is possible to add only the right red gate or both red gates. The value of the heuristic in Sec. 4.10.3 for this circuit is one.
Fig. 16 (b) Circuit from (a) after commuting B_β before B_γ. Since U′_{q1} and U′_{q3 q2} are identity matrices, no additional gates have been added to the circuit. The commutation has been performed without adding the red and yellow gates to their corresponding superblocks. While adding the right red gate or both red gates to B_β would have produced the same result as without adding, adding the yellow gate to B_γ would have made the commutation unexecutable regardless of how many red gates were added to B_β. The value of the heuristic in Sec. 4.10.3 for this circuit is 4/3. In contrast to the circuit in (a), where the successive blocks A_α and A_γ act on completely different qubits, after commutation each block has at least one qubit in common with its predecessor block, making the operations more local.

For all combinations where U′ factorizes, our algorithm transforms the unitaries U′_{q_i} and U′_{q_j q_k} into circuits using Pytket. We then optimize these circuits using our compilation flow up to phase tracking and using the FullPeepholeOptimise pass in Sec. 4.6. If there are multiple combinations for which U′ factorizes, the algorithm determines which combination to use for the commutation. For this purpose, several heuristics can be used, depending on the optimization goal, such as minimizing the overall gate count or the two-qubit gate count. We present three such heuristics in the following. The first heuristic prioritizes the reduction of the overall gate count over the minimization of the two-qubit gate count as the optimization goal. The heuristic is applied top-down. If multiple combinations satisfy a criterion, the next criterion is used for further selection:
1. Select the combination(s) with the smallest total number of gates in the circuits of U′_{q_i} and U′_{q_j q_k}.
2. Select the combination(s) with the lowest number of two-qubit gates in U′_{q_j q_k}.
3. Select the combination(s) with the lowest number of blockless sequences.
4. Select one combination randomly.
The second heuristic reverses the first two criteria, giving priority to reducing the two-qubit gate count over minimizing the overall gate count. A middle ground between the two heuristics is the third heuristic, which uses the same order of the criteria as the first heuristic. However, when counting the number of gates in U′_{q_i} and U′_{q_j q_k}, each two-qubit gate is counted as ξ > 1 gates. This way the two-qubit gates have a higher weight in the gate count, prioritizing combinations with a lower number of two-qubit gates, but without completely ignoring the single-qubit gates in the first criterion as in the second heuristic.
We then commute B_β and B_γ with respect to the selected combination and insert the gates from U′_{q_i} and U′_{q_j q_k} into the circuit C. In the following, our algorithm treats each two-qubit gate contained in the circuit of U′_{q_j q_k} as a block containing only the corresponding gate, while it combines the inserted single-qubit gates into blockless sequences. As an example, Fig. 16b shows the circuit from Fig. 16a with commuted superblocks.
To further determine possible superblocks for commutation, the algorithm selects the new A_α in such a way that A_β can serve as A_β again and can be commuted again. If this is not possible, the algorithm uses the first block in the execution order of the circuit as the new A_α. This allows a recursive commutation of blocks, which leads to a nonlinear runtime of the block commutation with correction unitaries. To prevent the blocks from being commuted cyclically, we check that the same two blocks are not commuted twice. If the algorithm cannot find any more commutations and has performed at least one commutation, we optimize the circuit with the commuted blocks. To this end, we remove the block aggregations of Sec. 4.8. Due to the commutations, it is possible that more than two R gates act on the same qubit between two ZZ gates. We apply Pytket's SquashTK1 pass and the RebaseCustom pass to these sequences. Then we rerun our compilation flow from the transformation pass in Sec. 4.6. This removes those Rx gates, introduced by the circuits of U′_{q_i} and U′_{q_j q_k} as well as by the rebasing pass, which do not have an angle of π/2 or π. Furthermore, phase tracking eliminates the newly introduced Rz gates and the block aggregation builds new blocks considering the new optimizations. The new blocks and blockless sequences acting on disjoint sets of qubits are then commuted as described in Sec. 4.9.

Fig. 17: Example from Fig. 16b after applying the transformation of two-qubit gates into our trapped-ion native gate set. The ZZ_{2,0}(π) gate in Fig. 16b is transformed into the two orange ZZ_{2,0}(0.5π) gates. All gates of the circuit are in our trapped-ion native gate set N.
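The three selection heuristics can be sketched as lexicographic sort keys, since applying the criteria top-down is equivalent to taking the minimum of a tuple. The dictionary fields, the function name, and the default value of ξ are hypothetical choices for this illustration.

```python
import random


def select_combination(candidates, heuristic="gate_count", xi=2.0, rng=random):
    """Pick one factorizing combination.

    Each candidate is a dict with hypothetical fields:
      n1q: single-qubit gates in the circuits of U'_qi and U'_qjqk
      n2q: two-qubit gates in the circuit of U'_qjqk
      blockless: number of blockless sequences after the commutation
    """
    def key(c):
        total = c["n1q"] + c["n2q"]
        if heuristic == "gate_count":      # first heuristic
            return (total, c["n2q"], c["blockless"])
        if heuristic == "two_qubit":       # second: first two criteria swapped
            return (c["n2q"], total, c["blockless"])
        if heuristic == "weighted":        # third: two-qubit gates count xi > 1
            return (c["n1q"] + xi * c["n2q"], c["n2q"], c["blockless"])
        raise ValueError(f"unknown heuristic: {heuristic}")

    best = min(key(c) for c in candidates)
    ties = [c for c in candidates if key(c) == best]
    return rng.choice(ties)                # last criterion: random choice


cands = [{"n1q": 4, "n2q": 0, "blockless": 3},
         {"n1q": 1, "n2q": 2, "blockless": 3}]
```

For these two made-up candidates, the first heuristic prefers the second entry (three gates in total), while the second heuristic prefers the first entry (no two-qubit gates). The weighted variant switches between them depending on ξ.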

Heuristic evaluation of the commutation
The algorithm presented above puts the circuit into a favorable structure by commuting superblocks. However, the algorithm does not measure the impact of the commutations on the shuttling operations. To allow our compiler to be used independently of a concrete implementation of a Shuttling Compiler, we present a heuristic for evaluating the impact of the commutations. We calculate this heuristic for the circuit before and after applying the block commutation with correction unitaries. The circuit with the higher heuristic value is used for the further compilation. To compute the heuristic, the algorithm iterates over the entire circuit in the execution order, looking at each pair of consecutive blocks built in Sec. 4.8. For each pair, the algorithm determines how many qubits the blocks have in common. These numbers are added and divided by the number of pairs. The result is a value between zero and two. The heuristic can only reach a value of zero if each block affects two different qubits compared to its immediate predecessor block in the execution order. This means that the blocks are in the most unfavorable order. In contrast, the heuristic can only reach a value of two if all blocks operate on the same two qubits. Consequently, reaching a value of two is not possible for circuits where the block commutation with correction unitaries has been applied, because it is only applicable to circuits where gates act on at least three different qubits. Using the block commutation with correction unitaries should increase the value of the heuristic. The higher the value, the more blocks following each other in the execution order will have qubits in common and the more local the operations will be. As a result, the number of shuttling operations will decrease. See Fig. 16 for an example of using the heuristic.
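The heuristic can be written down directly. The following sketch represents each block only by the pair of qubits it acts on. The function name and the two block lists are made up for illustration; they are not read off Fig. 16 and were merely chosen so that the resulting values resemble the 1 and 4/3 quoted there.

```python
from fractions import Fraction


def locality(blocks):
    """Average number of shared qubits between consecutive blocks.

    blocks: qubit pairs of the blocks in execution order.
    The result lies between 0 and 2.
    """
    pairs = list(zip(blocks, blocks[1:]))
    if not pairs:
        raise ValueError("need at least two blocks")
    shared = sum(len(set(a) & set(b)) for a, b in pairs)
    return Fraction(shared, len(pairs))


before = [(0, 1), (2, 3), (1, 3), (1, 3)]   # hypothetical execution order
after = [(0, 1), (1, 3), (1, 3), (2, 3)]    # same blocks after a commutation
keep_commuted = locality(after) > locality(before)
```

Using exact fractions avoids rounding issues when the two values are compared to decide which circuit is kept.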

Transformation of two-qubit gates into our trapped-ion native gate set
The transformation in Sec. 4.4 has converted the ZZ gates so that they have a phase $\theta \in \{\pi/2, \pi, 3\pi/2\}$. Since our trapped-ion native gate set N contains only the gate $ZZ(\pi/2)$, the ZZ gates in G with a phase of $\pi$ or $3\pi/2$ must be transformed to satisfy the condition. Therefore, these gates are decomposed into $ZZ(\pi/2)$ gates. This transformation is applied to the circuit from Fig. 16b in Fig. 17, where the $ZZ_{2,0}(\pi)$ gate is replaced by two $ZZ_{2,0}(0.5\pi)$ gates.
Note that this transformation could also have been applied earlier. However, not only would there be no benefit, but it would also worsen the runtimes of the block aggregation and the commutations because there are more two-qubit gates to consider. It should also be noted that the single-qubit gates are already part of our trapped-ion native gate set N, and since the two-qubit gates are as well, the circuit C is now executable on the target quantum computing architecture.
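As a numerical sanity check of the idea that ZZ phases on the same qubit pair add up, the sketch below compares products of $ZZ(\pi/2)$ gates with $ZZ(\pi)$ and $ZZ(3\pi/2)$. It assumes the convention $ZZ(\theta) = \exp(-i\,\theta/2\; Z \otimes Z)$, which is not stated in this section, and it only verifies angle additivity; the decomposition actually used by the compiler for $3\pi/2$ may differ. The helper names are hypothetical.

```python
import cmath
import math

# Eigenvalues of Z (x) Z on |00>, |01>, |10>, |11>.
PARITY = (1, -1, -1, 1)


def zz_diag(theta: float) -> tuple:
    """Diagonal of ZZ(theta) = exp(-i * theta / 2 * Z(x)Z) (assumed convention)."""
    return tuple(cmath.exp(-0.5j * theta * s) for s in PARITY)


def compose(*diags: tuple) -> tuple:
    """Product of diagonal gates, taken entry by entry."""
    out = [1 + 0j] * 4
    for d in diags:
        out = [o * x for o, x in zip(out, d)]
    return tuple(out)


def close(a: tuple, b: tuple, tol: float = 1e-12) -> bool:
    return all(abs(x - y) < tol for x, y in zip(a, b))


half = zz_diag(math.pi / 2)
two_halves_equal_pi = close(compose(half, half), zz_diag(math.pi))
three_halves_equal_3pi2 = close(compose(half, half, half), zz_diag(3 * math.pi / 2))
```

Because the gate is diagonal in the computational basis, comparing the four diagonal entries is sufficient and no matrix library is needed.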

Parameterized Circuits
Important applications of quantum processors in the NISQ era include quantum machine learning [42] and Variational Quantum Eigensolver (VQE) algorithms [43]. These use cases require a parameterized circuit to be executed multiple times with only small variations in some gate parameters, i.e., phases and angles of the gates. A classical optimizer then compares and adjusts the results of these executions. To make the circuits comparable, it is important that their structure is identical. However, different gate parameters can lead to different circuits after applying the compilation steps of our paper. Introducing the compilation of parameterized circuits mitigates this problem. Whenever unresolved parameters are present in the circuit to be compiled, these parameters are preserved throughout the compilation and only generally applicable optimizations are performed. Pytket natively supports such parameterized compilation and generates parameter-dependent expressions for all its transformations. When compiling parameterized circuits, our compiler still removes all SWAP gates (Sec. 4.1) and the gates that cancel out (Sec. 4.2), but only if the cancellation is independent of the current parameters. Additionally, macro matching is performed (Sec. 4.3) only if all values of the parameters satisfy the macro condition. The same scheme is also applied to all following steps. The final circuit may contain gates with angles and phases defined by non-trivial mathematical expressions, mainly introduced when transforming the circuit into our native gate set (Sec. 4.4). While compiling parameterized circuits does not restrict phase tracking (Sec. 4.7), block aggregation (Sec. 4.8) and the block commutation with correction unitaries (Sec. 4.10) lose much of their impact. The latter is not performed for parameterized circuits for performance reasons.
As an example, we used [44] to generate a parameterized circuit representing a Unitary Coupled Cluster ansatz [45], which is needed when executing variational algorithms such as VQE [46]. The circuit has four qubits, 48 single-qubit and 32 two-qubit gates, and contains two parameters. When we compiled the circuit together with the parameters into our trapped-ion native gate set, we got a circuit with 69 single-qubit and 32 two-qubit gates. In this circuit, the parameters can be replaced by numeric values to get executable circuits. This allows us to execute the circuit with different values after compiling it once. If the values are known before compilation and the circuit only needs to be executed with these values, the parameters can be replaced before compilation. In this case, we got a circuit with 61 single-qubit and 32 two-qubit gates. The small increase factor of 1.13 in the number of single-qubit gates for the parameterized compilation compared to the compilation with known parameters shows that our compiler also works well for parameterized circuits.
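The rule that a cancellation is only applied if it holds for every parameter value can be illustrated with a small, self-contained sketch. It does not use Pytket; the `Angle` class (a linear expression in named parameters), the `merge_rz` helper, and the single-qubit gate list are hypothetical constructs chosen for this example only.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Angle:
    """Linear angle expression: const + sum(coeff * parameter)."""
    const: float = 0.0
    coeffs: Tuple[Tuple[str, float], ...] = ()

    def __add__(self, other: "Angle") -> "Angle":
        merged: Dict[str, float] = dict(self.coeffs)
        for name, coeff in other.coeffs:
            merged[name] = merged.get(name, 0.0) + coeff
        kept = tuple(sorted((n, c) for n, c in merged.items() if c != 0.0))
        return Angle(self.const + other.const, kept)

    def is_identically_zero(self) -> bool:
        # True only if the angle vanishes for every parameter value.
        return self.const == 0.0 and not self.coeffs

    def bind(self, values: Dict[str, float]) -> float:
        return self.const + sum(c * values[n] for n, c in self.coeffs)


Gate = Tuple[str, Angle]  # (gate name, rotation angle) on one qubit


def merge_rz(gates: List[Gate]) -> List[Gate]:
    """Merge neighboring Rz gates; drop a merged gate only if it is zero for all parameters."""
    out: List[Gate] = []
    for name, angle in gates:
        if name == "Rz" and out and out[-1][0] == "Rz":
            total = out.pop()[1] + angle
            if not total.is_identically_zero():
                out.append(("Rz", total))
        else:
            out.append((name, angle))
    return out


a = Angle(coeffs=(("a", 1.0),))
minus_a = Angle(coeffs=(("a", -1.0),))
b = Angle(coeffs=(("b", 1.0),))
quarter_turn = Angle(const=math.pi / 2)

circuit = [("Rz", a), ("Rz", minus_a), ("Rx", quarter_turn), ("Rz", a), ("Rz", b)]
compiled = merge_rz(circuit)  # compiled once, with symbolic angles
bound = [(name, angle.bind({"a": 0.25, "b": 0.5})) for name, angle in compiled]
```

Here Rz(a) and Rz(-a) cancel for every value of a, so they are removed, whereas Rz(a) and Rz(b) are only merged into Rz(a + b). The compiled gate list therefore has the same structure for every later binding of the parameters.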

Evaluation
This section presents the evaluation results of our compiler. We tested the compiler on a library of 153 quantum circuits [47], which had been previously used for benchmarks [22][23][24][25]. Each circuit in the library has between 3 and 16 qubits and 5 to 207,775 gates. In addition, we benchmarked our compiler for the algorithms Quantum Approximate Optimization (QAOA) [48], Quantum Fourier Transform (QFT) [37], Supremacy [49,50], Sycamore [1], and Quantum Volume Estimation (QV) [51] for up to 200 qubits. While we generated the circuits for the first four algorithms using the quantum circuit generator from [52], we generated the circuits for QV using Qiskit [17].
We only evaluated QFT up to 49 qubits, because for a higher number of qubits the angles of some gates in the initial circuits are so small that they were considered as zero in the calculation. For Supremacy and Sycamore, we only used circuits where the qubits can be arranged in a square lattice with a depth of 1,000. We executed the evaluations on the Mogon II cluster of the Johannes Gutenberg University. Each evaluation was executed with one core of an Intel 2630v4 CPU and 16 GB RAM. We analyzed the impact of the different compilation stages on the result. Additionally, we compared our compiler with standard Pytket [15] and Qiskit [17] passes, using Pytket version 1.11.1 and Qiskit version 0.39.5. When measuring the runtime, we always averaged it over 10 executions of a given circuit. The resulting single-qubit and two-qubit gate counts are given in App. A.
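The runtime measurement described above (averaging over 10 executions) could be organized roughly as follows. This is only a sketch: `time_compilation` and the dummy workload are hypothetical, and the use of the sample standard deviation relative to the mean is an assumption, since the text does not specify how the deviation was computed.

```python
import statistics
import time
from typing import Callable, Tuple


def time_compilation(compile_fn: Callable[[], object], repeats: int = 10) -> Tuple[float, float]:
    """Return (mean runtime in seconds, standard deviation relative to the mean)."""
    if repeats < 2:
        raise ValueError("need at least two repetitions for a standard deviation")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        compile_fn()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    rel_std = statistics.stdev(samples) / mean if mean > 0 else 0.0
    return mean, rel_std


# Stand-in workload; a real measurement would call the compiler on one circuit.
mean_runtime, relative_std = time_compilation(lambda: sum(i * i for i in range(10_000)))
```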
We executed all the circuits using the compilation flow depicted in Fig. 3. For the macro matching in Sec. 4.3, we used only the CRy macro given in (7). As a heuristic for selecting the combination of blocks used for the block commutation with correction unitaries in Sec. 4.10.2, we used the first heuristic, which prioritizes the minimization of the overall gate count as the first criterion. The reason for this is that we compared our results with those of standard Pytket and Qiskit passes, and both tools do not explicitly prioritize one type of gate over another. Since our compiler offers four alternative ways to transform the circuit into the native gate set, as discussed in Sec. 4.4, we compared these approaches. While the first scheme performs additional optimizations using Pytket's CliffordSimp pass, the second scheme transforms based on the ZZ decomposition in (9) and Pytket's SquashTK1 pass without further optimizations. The third scheme performs optimizations achieved by the KAK decomposition [39], while the fourth scheme combines various optimizations using Pytket's FullPeepholeOptimise pass. The four different schemes are visualized in Fig. 3, where each scheme represents a different control path in the uppermost blue box. In the following, we refer to these four schemes as the CliffordSimp, SquashTK1, KAKDecomposition, and FullPeepholeOptimise approaches, respectively.
After calculating the results of our complete compilation flow, we computed the shuttling schedules for the circuits using a Shuttling Compiler [10]. To analyze the influence of the commutations described in Sec. 4.9 and Sec. 4.10 on the shuttling, we compiled all circuits two additional times: once without both types of commutations and a second time only without the block commutation with correction unitaries. This allowed us to determine the impact of the different commutations by comparing the translation, separation/merge, and ion swap counts of the circuits.
To calculate the shuttling schedules, we assumed a linear trap with 1,401 segments, each containing a maximum of two ions, and the laser interaction zone placed in the center of the trap. This configuration ensured that there was enough space for all ions and that there was no reconfiguration overhead due to lack of space.
In the following, we analyze first the impact of the different transformations and then the impact of the different commutations on the shuttling. Finally, we compare our compiler with standard compilation passes of Pytket and Qiskit.

Analysis of the impact of the different transformations
When analyzing the results of the circuit library, the FullPeepholeOptimise approach produced the circuits with the lowest overall gate count for 72 % of the circuits, followed by the CliffordSimp approach with 65 %, the KAKDecomposition approach with 40 %, and the SquashTK1 approach with 39 %. Since for some circuits multiple approaches calculated a result with the lowest gate count, the percentages add up to more than 100 %. When comparing the results of the different approaches pairwise, the gate counts differ by factors up to 1.15. Additionally, the FullPeepholeOptimise approach produced the lowest single-qubit gate count for 71 % of the circuits, followed by the CliffordSimp approach with 61 %, the KAKDecomposition approach with 42 %, and the SquashTK1 approach with 41 %. The single-qubit gate counts of the different approaches vary by factors up to 1.18.
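The reason the percentages exceed 100 % in total can be seen from a short sketch of how such shares might be counted. The function `best_share` and the two-circuit data set are invented for illustration and do not reproduce any value from the evaluation.

```python
from typing import Dict

APPROACHES = ("CliffordSimp", "SquashTK1", "KAKDecomposition", "FullPeepholeOptimise")


def best_share(counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Percentage of circuits for which each approach attains the lowest gate count."""
    wins = {approach: 0 for approach in APPROACHES}
    for per_approach in counts.values():
        best = min(per_approach.values())
        for approach, n_gates in per_approach.items():
            if n_gates == best:  # ties count for every tied approach
                wins[approach] += 1
    return {approach: 100.0 * w / len(counts) for approach, w in wins.items()}


# Invented gate counts for two hypothetical circuits.
example_counts = {
    "circuit_a": {"CliffordSimp": 10, "SquashTK1": 12, "KAKDecomposition": 12, "FullPeepholeOptimise": 10},
    "circuit_b": {"CliffordSimp": 20, "SquashTK1": 20, "KAKDecomposition": 20, "FullPeepholeOptimise": 20},
}
shares = best_share(example_counts)
```

Since every tied approach is credited with the circuit, the shares in this toy data set sum to 300 %.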
For 83 % of the circuits, the FullPeepholeOptimise approach also determined the results with the lowest two-qubit gate count, followed by the CliffordSimp approach with 59 %, the KAKDecomposition approach with 49 %, and the SquashTK1 approach with 47 %. The two-qubit gate counts of the different approaches differ by factors up to 1.21. The smallest differences in the overall, single-qubit, and two-qubit gate counts are between the SquashTK1 and KAKDecomposition approaches, whose counts vary by factors up to 1.04.
For QAOA, all four approaches produced the same results independently of the number of qubits. Also for QFT, all four approaches calculated the same results for the circuits up to 25 qubits, while for the larger circuits only the FullPeepholeOptimise approach determined the results with the lowest number of single- and two-qubit gates. For QV, the CliffordSimp and SquashTK1 approaches, as well as the KAKDecomposition and FullPeepholeOptimise approaches, each calculated the same results. However, the single- and two-qubit gate counts of the KAKDecomposition and FullPeepholeOptimise approaches are about 1.25 times lower for the five-qubit circuit. As the number of qubits increases, these factors decrease and are only slightly greater than one for 200 qubits. The reason for the better results of these approaches is the specialization of the KAK decomposition to circuits with a structure like the QV circuits. Note that the KAK decomposition is also part of Pytket's FullPeepholeOptimise pass. Moreover, for Supremacy, the CliffordSimp and FullPeepholeOptimise approaches, as well as the SquashTK1 and KAKDecomposition approaches, each determined the same gate counts. The gate counts of the CliffordSimp and FullPeepholeOptimise approaches are slightly lower than the gate counts of the other two approaches. For Sycamore, the CliffordSimp and FullPeepholeOptimise approaches calculated the same results for the circuits up to 49 qubits. The two-qubit gate counts are always the same for the CliffordSimp and FullPeepholeOptimise approaches, as well as for the SquashTK1 and KAKDecomposition approaches. While the single-qubit gate counts are up to 1.27 times lower for the SquashTK1 or KAKDecomposition approaches, the two-qubit gate counts are slightly lower for the other two approaches.
Regardless of the approach, 98 % of the compiled circuits in the circuit library have higher overall and single-qubit gate counts than the original circuits. The only circuits with lower overall and single-qubit gate counts are the ising_model circuits, which contain several Rz gates and thus benefit greatly from phase tracking. On the other hand, the number of circuits with a higher two-qubit gate count depends on the compilation approach used. Specifically, the SquashTK1 and KAKDecomposition approaches have a higher two-qubit gate count for 46 % of the circuits, the CliffordSimp approach for 40 %, and the FullPeepholeOptimise approach for only 25 %. The single-qubit gate counts increased because all transformed circuits must satisfy our trapped-ion native gate set N. However, the reason for the increase in the two-qubit gate counts is the block commutation with correction unitaries, which can insert two-qubit gates through the circuit described by $U'_{q_j q_k}$. In the compilation steps before the block commutation with correction unitaries, the two-qubit gate counts can only decrease because the original circuits contain some redundant CNOT gates, which can be removed after the single-qubit gates have been commuted through them. Then the decomposition in (10) replaces each remaining CNOT gate by exactly one ZZ gate. In the analysis below, we will see that for circuits with increasing two-qubit gate count, the block commutation with correction unitaries should be executed only if minimizing the overall gate count or the amount of shuttling operations is the optimization goal.
As depicted in Fig. 18, the compiled circuits of QAOA, QFT, and QV have higher overall gate counts compared to the original circuits. While for QAOA and QV the two-qubit gate counts remained the same or decreased, for QFT the two-qubit gate counts at most doubled during compilation. This is because the original circuits contain CPhase gates as two-qubit gates, each of which was decomposed into two ZZ gates. However, some of the ZZ gates became redundant and could be removed after commuting the single-qubit gates through them. For Supremacy and Sycamore, the overall gate counts in the compiled circuits are lower than in the original circuits. Here, the circuits contain several T and Z gates, which phase tracking removed. The two-qubit CZ gates of the original circuits were replaced by one ZZ and two Rz gates where phase tracking could also remove the latter. However, for 69 % of the circuits, the two-qubit gate counts increased slightly due to the block commutation with correction unitaries, while the single-qubit gate counts are up to 2.20 times lower. For all five algorithms, the number of qubits has no effect on the gate count relation.
Fig. 18: The overall gate counts of the five algorithms QAOA (blue), QFT (red), QV (green), Supremacy (orange), and Sycamore (purple) depending on different numbers of qubits. For each algorithm, the gate counts of the original (triangles) and compiled (circles) circuits are depicted. For the latter, the rebasing approach with the lowest gate count is used.
The runtimes of the four approaches versus the number of gates in the original circuits for the circuit library are shown in Fig. 19a. All four approaches had nonlinear compile times. While the compile times of the CliffordSimp, SquashTK1, and KAKDecomposition approaches were nearly the same for the different circuits, the compile time of the FullPeepholeOptimise approach was on average 1.11 times higher. The standard deviation for all four approaches was on average about 6.53 % of the average runtimes. The evaluation of the runtimes for the five algorithms showed the same runtime behavior without any influence of the number of qubits. Compiling the circuits without the block commutation with correction unitaries achieved the compile times depicted in Fig. 19b. For small circuits, the SquashTK1 and KAKDecomposition approaches resulted in compile times of about 4 ms per gate of the original circuits. In contrast, the CliffordSimp and FullPeepholeOptimise approaches showed a nonlinear growth of the compile times with the number of gates of the original circuits, with the FullPeepholeOptimise approach having on average 2.09 times higher compile times than the CliffordSimp approach.
Fig. 19: Runtimes of our compiler for the different rebasing approaches depending on the number of gates. Each color represents one of the rebasing approaches: CliffordSimp (dark blue), SquashTK1 (red), KAKDecomposition (green), and FullPeepholeOptimise (orange). Each dot represents one of the 153 circuits in the circuit library. (b) Runtimes of our compiler when executed without the block commutation with correction unitaries. In this case, the CliffordSimp and FullPeepholeOptimise approaches lead to a nonlinear scaling of the runtimes, while the SquashTK1 and KAKDecomposition approaches behave linearly with the number of gates.
The reasons for the nonlinear growth are the additional optimizations executed by Pytket's CliffordSimp and FullPeepholeOptimise passes, which led to nonlinear runtime scaling. As mentioned in Sec. 4.10.2, the block commutation with correction unitaries has a nonlinear runtime because it is applied to the blocks recursively. This also led to nonlinear compile times of the SquashTK1 and KAKDecomposition approaches in the cases where the block commutation with correction unitaries was executed. In contrast, the other transformations executed in the SquashTK1 and KAKDecomposition approaches only iterate a constant number of times over each gate, resulting in linear runtimes of these two approaches when the block commutation with correction unitaries was not applied. The factors by which the compile times increase when executing the block commutation with correction unitaries, compared to the compilation without it, are shown in Fig. 20. On average, the runtimes with the block commutation with correction unitaries were 53.75 times higher for the KAKDecomposition approach, followed by the SquashTK1 approach with 48.92 times, the CliffordSimp approach with 41.49 times, and the FullPeepholeOptimise approach with 21.45 times. While the factors for the SquashTK1 and KAKDecomposition approaches are approximately constant in a certain range, the factors for the CliffordSimp and FullPeepholeOptimise approaches are approximately constant up to a circuit size of 10,000 gates, and decrease for higher numbers of gates.
In the following, we compare the impact of the different compilation stages on the results of the circuits. To compare the gate counts, for each circuit we used the approach (CliffordSimp, SquashTK1, KAKDecomposition, or FullPeepholeOptimise) which produced the lowest overall gate count after executing our entire compiler flow as the baseline in all evaluations. If more than one approach had the same overall gate count, we preferred the result with the lowest two-qubit gate count. The impact of the different compilation stages is depicted in Fig. 21a for the circuit library and in Fig. 21b for the five algorithms. As a naive compilation, the gates of the original circuit were replaced using Pytket's RebaseCustom pass as described in Sec. 4.4 and the ZZ decomposition given in (9) and (29). Afterwards, the circuit was transformed into our trapped-ion native gate set N.
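The baseline selection rule (lowest overall gate count, ties resolved by the lowest two-qubit gate count) amounts to a lexicographic minimum. The sketch below uses invented gate counts; `Counts` and `pick_baseline` are hypothetical names, and the behavior on a remaining tie (first approach in insertion order) is an assumption not covered by the text.

```python
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    single: int  # single-qubit gate count
    two: int     # two-qubit gate count


def pick_baseline(results: Dict[str, Counts]) -> str:
    """Lowest overall gate count first, then lowest two-qubit gate count."""
    return min(results, key=lambda a: (results[a].single + results[a].two, results[a].two))


# Invented counts for one hypothetical circuit.
example = {
    "CliffordSimp": Counts(single=8, two=4),
    "SquashTK1": Counts(single=9, two=3),
    "KAKDecomposition": Counts(single=10, two=3),
    "FullPeepholeOptimise": Counts(single=9, two=4),
}
baseline = pick_baseline(example)
```

In this example CliffordSimp and SquashTK1 both reach 12 gates overall, and SquashTK1 is preferred because it has fewer two-qubit gates.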
Comparing this naive transformation with our compiler for the circuit library, the resulting circuits from our compiler have 2.50 to 5.38 times fewer single-qubit gates and up to 1.38 times fewer two-qubit gates. However, for 33 % of the circuits, our compiler produced results with up to 1.27 times more two-qubit gates. This is due to the additional gates inserted during the block commutation with correction unitaries to reduce the amount of shuttling operations. On average, our compiler produced circuits with 3.29 times fewer overall gates. For the evaluated circuits, this naive transformation produced results with about 1.10 times lower single-qubit gate counts when the Ry gate was not excluded from the native gate set before executing the RebaseCustom pass. This is because the naive transformation need not take care of commutations, and including Ry gates allows the gates to be replaced by shorter gate sequences.
Fig. 20: The runtime increases of our complete compilation flow compared to our compiler when executed without the block commutation with correction unitaries for the different rebasing approaches CliffordSimp (dark blue), SquashTK1 (red), KAKDecomposition (green), and FullPeepholeOptimise (orange) depending on the number of gates. Each dot represents one of the 153 circuits in the circuit library. For all circuits with an increase factor greater than one (the horizontal black line), our complete compilation flow has a longer runtime than our compiler when executed without the block commutation with correction unitaries.
Comparing the naive compilation with our compiler for the five algorithms shows that for QAOA and QV, the overall and single-qubit gate count reduction factors first decrease as the number of qubits increases. Then the overall gate count reduction factors converge to 3.13 for QAOA and 5.39 for QV, while the single-qubit gate count reduction factors converge to 3.43 for QAOA and 6.48 for QV. While the overall gate count reduction factors for Supremacy and Sycamore increase only slightly, the single-qubit gate count reduction factors increase from 9.85 to 11.16 for Supremacy and from 9.92 to 11.13 for Sycamore as the number of qubits increases. The reason for these large increases is that the substitution of the CZ gates contained in the circuits used by the naive compilation requires 15 single-qubit gates, while our compiler requires only two single-qubit gates for a CZ decomposition. Moreover, our compiler can remove the z-rotations contained in the circuit using phase tracking. While for QAOA, Supremacy, and Sycamore the gate count reduction factors for two-qubit gates are around one, meaning that the circuits calculated by our compiler have approximately as many two-qubit gates as the naive compilation, for QV the reduction factor decreases from 1.25 to a factor slightly above one. For QFT, the gate count reduction factors also increase with the number of qubits. While our compiler required 4.46 times fewer single-qubit gates and the same number of two-qubit gates compared to the naive compilation for five qubits, our compiler required 5.94 times fewer single-qubit gates and 1.27 times fewer two-qubit gates for 49 qubits. For all five algorithms, our compiler generated lower single-qubit gate counts when the Ry gate was not excluded from the native gate set before the RebaseCustom pass was executed. In this case, the single-qubit gate counts are on average 1.04 times lower for QAOA and QV, 1.10 times lower for QFT, 1.14 times lower for Supremacy, and 1.16 times lower for Sycamore compared to a compilation without Ry as a native gate. The higher factors for Supremacy and Sycamore are due to the fact that these circuits already contain several Ry gates, which do not need to be replaced when the Ry gate is added to the native gate set.
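The reduction factors quoted throughout this section can be read as the ratio of a reference gate count to the gate count of the complete compilation flow. The helper below is a hypothetical illustration of that reading; the only numbers taken from the text are the 15 and two single-qubit gates needed per CZ substitution.

```python
def reduction_factor(reference_count: int, compiled_count: int) -> float:
    """Ratio of the reference gate count to the gate count of the full flow.

    Values above one mean the full flow needs fewer gates; values below one
    mean it needs more gates than the reference compilation.
    """
    if compiled_count <= 0:
        raise ValueError("compiled gate count must be positive")
    return reference_count / compiled_count


# Per CZ gate: 15 single-qubit gates (naive) versus 2 (our compiler), as stated above.
per_cz_factor = reduction_factor(15, 2)
```

Under this reading, a statement such as "1.27 times more two-qubit gates" corresponds to a reduction factor of 1/1.27, i.e., a value below one.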
Comparing our compiler to the naive approach, but additionally removing trivial redundancies using Pytket's RemoveRedundancies pass, the compiled circuits in the circuit library have between 2.50 and 4.29 times fewer single-qubit gates and up to 1.13 times fewer two-qubit gates. For the same reason as for the naive approach, our compiler produced results with up to 1.27 times more two-qubit gates for 34 % of the circuits. The overall gate counts of our compiler are on average 2.86 times lower than the overall gate counts of the improved naive transformation. These high factors show the potential of the optimizations applied by our compiler. In the improved naive transformation, for 98 % of the circuits our compiler calculated results with a lower or equal gate count when the Ry gate was excluded from the native gate set before executing the RebaseCustom pass. This is because excluding gates from the native gate set reduces the number of different gates. Consequently, gates of the same type are more likely to be located next to each other, allowing for better redundancy removal.
The comparison of the improved naive compilation for QAOA, QFT, QV, and Supremacy shows the same trends as the naive compilation with lower gate count reduction factors. For example, the overall gate count reduction factors converge to 2.00 for QAOA and 3.72 for QV as the number of qubits increases. The only exception is Sycamore, whose overall gate count reduction factor is 5.83 for four qubits and converges to 5.64 for a higher number of qubits. While the overall gate count reduction factor decreases with the number of qubits, the single-qubit gate count reduction factor increases from 8.47 to 8.89, and the two-qubit gate count reduction factors are around one as in the naive compilation. For all algorithms except QAOA, the improved naive compilation produced better results when the Ry gate was excluded from the native gate set before executing the RebaseCustom pass for the reason mentioned above. For QAOA, the results of the improved naive compilation have about 1.06 times fewer single-qubit gates when including the Ry gate. The reason for this is that the circuits benefit more from the shorter CNOT decomposition which can be used when including the Ry gate than from the additional redundancy removal possible when excluding the Ry gate.
Fig. 21: The overall gate count reductions of our complete compilation flow compared to the following less optimized compilations: (1) a naive compilation which replaces gates only with respect to our trapped-ion native gate set N, but without any optimization (dark blue dots), (2) the naive compilation with additional removal of trivial redundancies (red dots), (3) the compilation as depicted in Fig. 3, but only up to the transformation described in Sec. 4.6 (green dots), (4) the compilation up to phase tracking (orange dots), (5) the compilation up to block aggregation but without phase tracking (purple dots), and (6) the compilation up to the commutation of blocks and blockless sequences on disjoint sets of qubits (light blue dots). For each circuit, the best compilation result of the four different rebasing approaches is used. For all circuits with a reduction factor greater than one (the horizontal black line), our complete compilation flow produces a circuit with a lower gate count than the less optimized compilations.
For the next evaluation, our compiler was executed as depicted in Fig. 3, but skipping phase tracking, block aggregation, and the block commutation with correction unitaries. Comparing our entire compiler flow with the flow skipping these three optimizations for the circuit library, our entire compiler calculated circuits with 1.38 to 2.05 times fewer single-qubit gates. For 47 % of the circuits, our entire compiler produced results with up to 1.27 times more two-qubit gates. Since phase tracking and block aggregation do not change the two-qubit gates, the block commutation with correction unitaries inserted the additional two-qubit gates to reduce the amount of shuttling operations. The overall gate counts of our entire compiler flow are on average 1.65 times lower than those of the reduced compiler.
For QAOA, the single-qubit reduction factors remain constant at 1.71 independently of the number of qubits. In contrast, for QFT and QV, the single-qubit gate count reduction factors increase with the number of qubits from 1.80 to 1.97 for QFT and from 1.94 to 2.00 for QV, while the factors decrease from 1.97 to 1.75 for Supremacy and from 2.05 to 2.01 for Sycamore. The overall gate count reduction factors are similar to those for single-qubit gates, but with smaller factors. For all algorithms except Sycamore, the compilation without phase tracking, block aggregation, and the block commutation with correction unitaries produces the same number of two-qubit gates as our entire compilation flow. Only for Sycamore, our entire compiler flow has a slightly higher two-qubit gate count than the compilation without the skipped optimizations.
In the following, phase tracking was also executed, but block aggregation and the block commutation with correction unitaries were not executed. In this case, for the circuit library, the overall gate counts of our entire compilation flow are on average only 1.07 times and the single-qubit gate counts are up to 1.20 times lower than the gate counts of the flow without block aggregation and the block commutation with correction unitaries. For the ising_model circuits, the single-qubit gate counts of our entire compiler flow are up to 1.15 times higher. The reason for this is that these circuits contain several Rz gates, which phase tracking removed. However, the block commutation with correction unitaries introduced new gates to reduce shuttling operations. Using this reduced compilation flow for QAOA and QV, the gate counts are exactly the same as for our entire compiler flow. This means that block aggregation and the block commutation with correction unitaries do not affect the gate counts of these circuits when executing the transformations after phase tracking. For QFT, the single-qubit gate count reduction factor is 1.02 for five qubits and decreases to one for 30 and more qubits. While for Supremacy the single-qubit gate count reduction factor decreases from 1.15 to 1.10 as the number of qubits increases, for Sycamore the single-qubit gate count reduction factor starts at 1.17 for the four-qubit circuit and converges to a factor of 1.41.
Compared to the compiler which did not perform phase tracking and the block commutation with correction unitaries, but performed block aggregation, the overall gate counts of our entire compiler flow for the circuit library are on average 1.45 times lower, and the single-qubit gate counts are up to 1.77 times lower. Since the reduction factors are higher than in the previous case of executing phase tracking instead of block aggregation, the impact of block aggregation is less than the impact of phase tracking. However, an exception is the circuit graycode6_47, which has a lower single-qubit gate count when not using phase tracking and the block commutation with correction unitaries. The reason for this is the construction of this circuit, which consists only of CNOT gates. Consequently, this circuit directly benefits from the symmetric structure of the CNOT decomposition in (10), which favors block aggregations.
For QAOA, the single-qubit gate counts of our entire compiler are about 1.57 times lower than the counts of the compilation without phase tracking and the block commutation with correction unitaries. However, these factors grow from 1.76 to 1.97 for QFT, from 1.72 to 1.75 for QV, and from 1.57 to 1.66 for Supremacy as the number of qubits increases. Only for Sycamore, they decrease slightly from 1.48 to 1.46.
For QFT, the single-qubit gate counts are the same as in the compilation even without block aggregation, except for the five-qubit circuit. This shows that QFT does not benefit from block aggregation when executed without prior phase tracking. On the other hand, since the reduction factors for QAOA, QV, Supremacy, and Sycamore are lower than in the compilation even without block aggregation, these algorithms benefit from block aggregation in this case. However, because the reduction factors are higher than in the previous case of executing phase tracking instead of block aggregation, the impact of block aggregation is less pronounced than the impact of phase tracking.
Evaluating the angle splitting in Sec. 4.8.2 shows that our compiler applied this technique to 63 % of the circuits in the circuit library. On these circuits, our compiler performed angle splitting on 0.08 % to 4.65 % and on average on 0.72 % of the single-qubit gates in the original circuits. Of the five algorithms, only Supremacy and Sycamore benefit from angle splitting. The number of single-qubit gates of the original circuit which were split by angle splitting is shown in Fig. 22. This number tends to grow with the number of single-qubit gates.
As a final case, we compare our entire compiler flow to the flow without the block commutation with correction unitaries. For the circuit library, our entire compilation flow has on average 1.02 times lower overall gate counts and up to 1.09 times lower single-qubit gate counts than the compilation without the block commutation with correction unitaries. The three ising_model circuits and the circuit decod24-v0_38 have up to 1.15 times lower single-qubit gate counts when not executing the block commutation with correction unitaries. As mentioned above, for 47 % of the circuits, performing the block commutation with correction unitaries results in up to 1.27 times higher two-qubit gate counts. The increase in the gate counts is due to the additional gates inserted during the block commutation with correction unitaries. However, the next subsection shows that the insertion of these additional gates results in a reduction of shuttling operations. Although 47 % of the circuits have a higher two-qubit gate count, only 3 % of the circuits have a higher overall gate count after executing the block commutation with correction unitaries, showing its potential to reduce single-qubit gates.
Compiling the circuits of the five algorithms without the block commutation with correction unitaries produced exactly the same results for QAOA, QFT, and QV as compiling even without block aggregation. Consequently, block aggregation has no effect on these three algorithms when executed after phase tracking. Additionally, as mentioned above, the block commutation with correction unitaries has no effect on the circuits of QAOA and QV, since the compilation without the block commutation with correction unitaries produced the same results as our entire compiler flow. However, the single-qubit gate count reduction factor for QFT decreases from 1.02 for the five-qubit circuit to one for the circuits with 30 and more qubits. This shows that for QFT, the block commutation with correction unitaries slightly reduces the single-qubit gate counts for the circuits up to 25 qubits. For Supremacy and Sycamore, our entire compiler flow produces about 1.04 and 1.14 times fewer single-qubit gates, respectively, than the compilation without the block commutation with correction unitaries, independent of the number of qubits. Only for Sycamore does our entire compiler flow have a slightly higher number of two-qubit gates than the compilation without the block commutation with correction unitaries. This means that the block commutation with correction unitaries inserted some two-qubit gates to reduce the number of shuttling operations.

Fig. 23: Number of block commutations with correction unitaries applied, depending on the number of two-qubit gates in the original circuit. For each circuit, the best compilation result of the four different rebasing approaches is used. Each dark blue dot represents at least one circuit in the circuit library, where the block commutation with correction unitaries was performed on 148 of the 153 circuits. While the block commutation with correction unitaries was applied to every circuit of QFT (red dots), it was not performed on the smallest circuit of Supremacy (orange dots) and Sycamore (purple dots). Since the block commutation with correction unitaries was not applied to any circuit of QAOA and QV, these circuits are not depicted.
Evaluating the number of commutations executed, we found that our compiler applied commutations to 97 % of the circuits in the circuit library. Up to 14,345 commutations were executed on these circuits. The number of executed commutations ranges from 0.65 % to 21.88 % and averages 12.28 % of the number of two-qubit gates in the original circuits. For the five algorithms, our compiler only executed commutations on the circuits of QFT, Supremacy, and Sycamore. While the number of commutations grows with the number of two-qubit gates for the circuits in the circuit library, Supremacy, and Sycamore, for QFT our compiler executed two commutations on the five-qubit circuit and a constant nine commutations on each circuit with a higher number of qubits.
The number of commutations depending on the number of two-qubit gates is depicted in Fig. 23.

Impact on the shuttling
In the following, we analyze the impact of the commutation of blocks and blockless sequences on disjoint sets of qubits and of the block commutation with correction unitaries on the shuttling by comparing the required number of translate, separation/merge, and physical ion swap operations. To this end, we arranged as many ions in a linear chain as the circuit to be compiled has qubits. The chain was modeled as a linear graph with each vertex representing an ion. To map the logical qubits of the circuit to the ions, we used two different Pytket passes, the GraphPlacement pass and the LinePlacement pass, which use different heuristics. For each circuit, we executed both passes and obtained a mapping for each pass. Afterwards, we used Pytket's RoutingPass on both mappings, which inserts SWAP gates so that two-qubit gates are only executed on ions whose vertices are adjacent. The Shuttling Compiler executed all SWAP gates inserted by Pytket as physical ion swaps. Since the two mappings yield two results, we selected the result with the lower number of SWAP gates for submission to the Shuttling Compiler [10].
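The selection step described above (route two candidate placements, then keep the one that needs fewer SWAP gates) can be sketched in plain Python. This is only an illustration and not Pytket's GraphPlacement, LinePlacement, or RoutingPass: the function names `count_adjacent_swaps` and `pick_best_placement`, the naive routing heuristic, and the example gate list are all assumptions made for this sketch.

```python
def count_adjacent_swaps(placement, gates):
    """Count SWAPs a naive router needs on a linear chain.

    placement: list where placement[pos] is the logical qubit at chain position pos.
    gates: list of (a, b) two-qubit gates in execution order.
    Hypothetical heuristic: move qubit a towards qubit b by adjacent swaps
    until the two are neighbors, then execute the gate.
    """
    order = list(placement)  # work on a copy so the input is not modified
    swaps = 0
    for a, b in gates:
        pa, pb = order.index(a), order.index(b)
        step = 1 if pb > pa else -1
        while abs(pb - pa) > 1:
            order[pa], order[pa + step] = order[pa + step], order[pa]
            pa += step
            swaps += 1
    return swaps


def pick_best_placement(placements, gates):
    """Return the candidate placement with the lowest SWAP count."""
    return min(placements, key=lambda p: count_adjacent_swaps(p, gates))


# Hypothetical three-qubit example with two candidate placements.
gates = [(0, 1), (1, 2), (0, 2)]
candidate_a = [0, 1, 2]
candidate_b = [0, 2, 1]
best = pick_best_placement([candidate_a, candidate_b], gates)
```

In this toy example the first candidate needs one SWAP and the second needs three, so the first one would be passed on to the next stage.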
When both types of commutations were executed, 97 % of the circuits in the circuit library required fewer or the same number of translations as when both types of commutations were not executed. The commutations reduced the number of translations by factors up to 3.24, and on average by a factor of 1.33. Compared to a compiler which executed the commutation of blocks and blockless sequences on disjoint sets of qubits, but not the block commutation with correction unitaries, our compiler calculated results which lead to up to 1.33 times and on average 1.06 times fewer translations for 89 % of the circuits.
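The reduction factors quoted in this subsection can be read as the ratio of an operation count without the commutations to the count with them, so that a factor greater than one means fewer operations. The following sketch shows how the maximum, the average, and the share of circuits that are not worse could be computed under that assumption. The counts are made-up numbers, not results from this evaluation, and averaging over all circuits is an assumption of the sketch.

```python
from statistics import mean


def reduction_factor(baseline_count, optimized_count):
    """Ratio baseline / optimized; values above one mean fewer operations."""
    return baseline_count / optimized_count


# Hypothetical (baseline, optimized) translation counts for four circuits.
counts = [(120, 100), (81, 25), (50, 50), (90, 100)]

factors = [reduction_factor(b, o) for b, o in counts]
max_factor = max(factors)
mean_factor = mean(factors)
# Share of circuits needing fewer or the same number of translations.
share_not_worse = sum(f >= 1 for f in factors) / len(factors)
```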
The impact of the commutations on the number of translations for the circuits in the circuit library is depicted in Fig. 24a. The differences in the factors show that the commutation of blocks and blockless sequences on disjoint sets of qubits has a higher impact on the translation counts. However, the commutation of blocks and blockless sequences on disjoint sets of qubits has no effect on the number of separation/merge and ion swap operations. Using the block commutation with correction unitaries also reduced the number of separation/merge operations for 94 % of the circuits by factors up to 1.20 and on average by 1.04. Additionally, it reduced the number of ion swaps for 97 % of the circuits by factors up to 1.67 and on average by 1.12.

Fig. 24: The translation count reductions of our complete compilation flow compared to (1) the compilation without the commutation of blocks and blockless sequences on disjoint sets of qubits as well as without the block commutation with correction unitaries (blue dots) and (2) only without the block commutation with correction unitaries (red dots). For each circuit, the translations for the best compilation result of the four different rebasing approaches were calculated. For all circuits with a reduction factor greater than one (the horizontal black line), our complete compilation flow produces a circuit with a lower translation count than the less commuted compilation flows.

The impact of both types of commutations on the five algorithms is shown in Fig. 24b. As mentioned in the last subsection, the compiler did not execute the block commutation with correction unitaries on the circuits of QAOA and QV. Moreover, as for the circuit library, the commutation of blocks and blockless sequences on disjoint sets of qubits did not affect the number of separation/merge and ion swap operations for any of the five algorithms. Consequently, for QAOA and QV, only the impact of the commutation of blocks and blockless sequences on disjoint sets of qubits on the translation count needs to be considered. For QV, the result of our entire compiler flow requires 2.63 times fewer translations for ten qubits compared to a compilation without the commutation of blocks and blockless sequences on disjoint sets of qubits. This translation count reduction factor decreases to 1.11 for 200 qubits. In contrast, for QAOA, the translation count reduction factor is 1.13 for five qubits and increases to 2.12 for 100 qubits. From there, it drops sharply to 1.58 for 105 qubits and increases again to 2.13 for 200 qubits. The reason for the sharp drop is Pytket's placement pass, which we used to map the logical qubits to ions. The qubits in the QAOA algorithm have a nearest-neighbor interaction. This means that the qubit $q_0$ interacts only with the qubit $q_1$, the qubit $q_{n-1}$ only with the qubit $q_{n-2}$, and all other qubits $q_i$ only with the qubits $q_{i-1}$ and $q_{i+1}$. This allows the logical qubits to be mapped to the ion chain in ascending order, i.e., $q_0$ is mapped to the leftmost ion of the ion chain, $q_1$ is mapped to the ion next to $q_0$ on the right, and so on until $q_{n-1}$ is mapped to the rightmost ion of the chain. Up to the 100-qubit circuit, Pytket's placement pass could map the logical qubits exactly according to this scheme. However, when the circuits used more than 100 qubits, the placement pass mapped $q_{n-1}$ to the leftmost ion, $q_{n-2}$ to the ion next to $q_{n-1}$ on the right, and so on until the qubit $q_{100}$ was mapped. To the ion to the right of $q_{100}$ it mapped the qubit $q_0$, and from there it mapped all the qubits in ascending order to the ions until it reached the qubit $q_{99}$. Since $q_{99}$ also interacts with $q_{100}$, which the placement pass mapped 100 ions to the left, $q_{99}$ had to be swapped with 99 intermediate ions to execute the gate with $q_{100}$. The placement pass placed the ions according to this scheme, regardless of whether commutations had been used. While the additional ion swap and separation/merge operations did not affect the reduction factors of these operations, the additional translations resulted in a decreasing translation count reduction factor between the 100- and 105-qubit circuits.
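The two QAOA placements described above can be reproduced with a few lines of Python to check the distance between the interacting qubits. The sketch below is a reconstruction from the description in the text, not Pytket's placement code; the function names and the `pivot` parameter are assumptions introduced for illustration.

```python
def ascending_placement(n):
    """Chain order q_0, q_1, ..., q_{n-1} from left to right."""
    return list(range(n))


def split_placement(n, pivot=100):
    """Chain order q_{n-1}, ..., q_pivot, followed by q_0, ..., q_{pivot-1}."""
    return list(range(n - 1, pivot - 1, -1)) + list(range(pivot))


def neighbor_distances(order):
    """Chain distance between q_i and q_{i+1} for every nearest-neighbor pair."""
    pos = {q: p for p, q in enumerate(order)}
    return {(i, i + 1): abs(pos[i] - pos[i + 1]) for i in range(len(order) - 1)}


n = 105
asc = neighbor_distances(ascending_placement(n))
split = neighbor_distances(split_placement(n))

# The only pair that is no longer adjacent under the split placement.
worst_pair = max(split, key=split.get)
intermediate_ions = split[worst_pair] - 1
```

For the 105-qubit case, every interacting pair is adjacent in the ascending placement, whereas in the split placement the pair (q_99, q_100) ends up 100 positions apart with 99 ions in between, matching the description above.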
For QFT, our entire compilation required between 1.67 and 2.93 times fewer translations than the compilation without both types of commutations. When executing the commutation of blocks and blockless sequences on disjoint sets of qubits, but not the block commutation with correction unitaries, the compiler required up to 1.52 times fewer translations and up to 1.22 times fewer separation/merge operations for the circuits with 10 and more qubits compared to our entire compilation flow. Moreover, the number of ion swaps is up to 1.64 times higher for our entire compilation flow.
Furthermore, for both Supremacy and Sycamore, our entire compiler flow calculated circuits requiring fewer translations than a compilation without both types of commutations. For Supremacy, our entire compiler required between 1.05 and 1.44 times fewer translations, while for Sycamore it required between 1.02 and 1.78 times fewer. In this range, the factors alternate between the different numbers of qubits. When we compare our entire compiler flow with a compiler which used the commutation of blocks and blockless sequences on disjoint sets of qubits, but not the block commutation with correction unitaries, our compiler generated results with a lower number of translations for 54 % of the Supremacy circuits and 77 % of the Sycamore circuits. Overall, our entire compiler calculated results with up to 1.11 times fewer and up to 1.08 times more translations for Supremacy, and with up to 1.15 times fewer and up to 1.07 times more translations for Sycamore. For both algorithms, the factors alternate between the different numbers of qubits in these ranges. Regarding the number of separation/merge and ion swap operations, our entire compiler flow calculated results with fewer separation/merge and ion swap operations for half of the circuits of Supremacy and Sycamore. While the reduction factors for separation/merge range from 0.95 to 1.08 for Supremacy and from 0.96 to 1.03 for Sycamore, the reduction factors for ion swap range from 0.93 to 1.09 for Supremacy and from 0.95 to 1.04 for Sycamore. For all circuits where our entire compiler flow produced a result with a lower number of separation/merge operations, the result also has a lower number of ion swaps. The same applies to the case where our entire compiler flow calculated a result with a higher number of separation/merge and ion swap operations.

Comparison to other compilers
We benchmarked our compiler against built-in Pytket passes, first using its FullPeepholeOptimise pass.
Then we rebased the circuits to {Rx, Rz, ZZ} using Pytket's RebaseCustom pass with the TK1 decomposition in (8a) and the CNOT decomposition in (10).
There are exceptions for the circuits graycode6_47, ex1_226, and xor5_254, as well as all QAOA circuits and the Sycamore circuits with nine and more qubits, which we rebased to {Rx, Ry, Rz, ZZ} because Pytket calculated significantly better results for these circuits with the Ry gate than without it. After rebasing, we applied the repeated removal and commutation of gates based on Pytket's built-in procedures from Sec. 4.2. Since Pytket does not allow the gates of the native gate set to be constrained to certain angles, it cannot directly transform the gates to our trapped-ion native gate set N. To make the gate counts comparable, we applied the transformation from Sec. 4.6 to the circuits.
For the Qiskit compilation, we first applied the ZZ decomposition given in (9) and (29) to the circuits, because Qiskit cannot transform ZZ gates with $\theta \neq \pi/2$ already in the circuits into ZZ gates with $\theta = \pi/2$ and a set of single-qubit gates. Afterwards, we applied Qiskit's BasisTranslator pass with the target basis {Rx, Ry, Rz, ZZ}. The only exceptions are the circuits graycode6_47, ex1_226, and xor5_254, for which the Ry gate was removed from the target basis because Qiskit calculated better results for these circuits without the Ry gate than with it. After rebasing, we executed the transpile function of the Qiskit compiler module with optimization level three. Like Pytket, Qiskit cannot transform the gates to our trapped-ion native gate set N, so we applied the transformation from Sec. 4.6 to obtain comparable gate counts.
The gate count reductions achieved by our compiler compared to Pytket and Qiskit are summarized in Fig. 25a for the circuit library and in Fig. 25b for the five algorithms. Compared to standard Pytket, our compiler always computed results with lower gate counts. For the circuit library, there are between 2.14 and 3.24 times fewer gates. A more detailed examination shows that for all circuits our compiler produced results with 2.50 to 4.20 times fewer single-qubit gates, while the two-qubit gate counts are lower (by factors of up to 1.08) or equal for only 51 % of the circuits. The main reason for this small fraction is the block commutation with correction unitaries, which added two-qubit gates to the circuit to reduce shuttling operations. Without the block commutation with correction unitaries, our compiler calculated results with fewer two-qubit gates for 84 % of the circuits. For QAOA, the single-qubit gate count reduction factor relative to Pytket decreases from 2.23 to 2.14 as the number of qubits increases. In contrast, the reduction factors increase with the number of qubits from 3.80 to 3.99 for QFT and from 3.26 to 3.50 for QV. For all three algorithms, the results of our compiler have as many two-qubit gates as Pytket's results, independently of the number of qubits. The single-qubit gate count reduction factor for Supremacy increases from 7.02 for 16 qubits to 7.42 for 196 qubits. For Sycamore, our compiler calculated circuits with 5.49 to 6.20 times fewer single-qubit gates than Pytket, with the factor increasing with the number of qubits. The two-qubit gate count reduction factors for both algorithms are the same as for the naive compilation in Sec. 6.1, showing that Pytket calculated the same number of two-qubit gates as the naive compilation. Thus, for the Sycamore circuits and half of the Supremacy circuits, Pytket produced results with slightly lower two-qubit gate counts. Again, the reason for the higher two-qubit gate counts of our compiler is the block commutation with correction unitaries. Without it, the two-qubit gate counts of our compiler and Pytket are the same for all five algorithms, independently of the number of qubits.

Fig. 25: The overall gate count reductions of our compiler compared to Pytket (dark blue dots) and Qiskit (red dots) standard passes. As the result for our compiler, the compilation result with the lowest gate count of the four different rebasing approaches is used for each circuit. For all circuits with a reduction factor greater than one (the horizontal black line), our compiler produces a circuit with a lower gate count than either Pytket or Qiskit.
The main reason for Pytket's higher single-qubit gate counts is that it cannot apply phase tracking. Due to this limitation, the circuits contain several Rz gates, which in our compilation flow can only appear at the end of the circuit. Additionally, the Rz gates are often sandwiched by Rx gates, which prevents the Rx gates from being merged and removed when their angle is $\theta = 0$. Since we had to execute all four approaches of our compiler to find the best result, and since one of our approaches also includes Pytket's FullPeepholeOptimise pass, the runtimes of our compiler were always longer than Pytket's runtimes.
The comparison with standard Qiskit shows for the circuit library that our compilation leads to 1.27 to 2.00 times fewer gates. Furthermore, the number of single-qubit gates is between 1.40 and 2.34 times lower for our compiler, and the two-qubit gate count reduction factor ranges from 0.79 to 1.25, with our compiler requiring a higher number of two-qubit gates for 33 % of the circuits. As in the Pytket comparison, the reason for the higher number of two-qubit gates is the block commutation with correction unitaries. Without executing it, our compiler needed fewer two-qubit gates for all circuits, with reduction factors up to 1.25. For QAOA, our compiler calculated circuits with 1.71 times fewer single-qubit gates and the same number of two-qubit gates compared to Qiskit, independently of the number of qubits. In contrast, for QFT, the single-qubit gate count reduction factor increases from 1.93 to 2.01 with the number of qubits. Regarding the number of two-qubit gates, our compiler and Qiskit produced results with the same number of two-qubit gates for the circuits up to 25 qubits, while for 30 and more qubits the results of our compiler have between 1.01 and 1.14 times fewer two-qubit gates, with the factor increasing with the number of qubits. For the other three algorithms, the single-qubit gate count reduction factors decrease from 2.43 to 2.01 for QV, from 2.21 to 1.74 for Supremacy, and from 2.29 to 1.94 for Sycamore as the number of qubits increases. The two-qubit gate count reduction factors for all three algorithms are the same as for the naive compilation comparison in Sec. 6.1. Thus, Qiskit's results have the same number of two-qubit gates as the naive approach. This means that for QV, the number of two-qubit gates of our compiler is between 1.01 and 1.25 times lower than Qiskit's, with a decreasing factor as the number of qubits increases. Moreover, for Supremacy and Sycamore, Qiskit's results have exactly the same number of two-qubit gates as Pytket. This means that for the Sycamore circuits and half of the Supremacy circuits, Qiskit produced results with slightly lower two-qubit gate counts for the same reason mentioned above. The main reason for the single-qubit gate count reductions achieved by our compiler is the same as in the Pytket comparison. Regarding the runtime, Qiskit was always faster than our compiler. However, Qiskit's faster runtime comes at the cost of a larger number of gates.
In summary, our compiler achieved larger reductions compared to Pytket, with an average reduction factor of 2.80 for the circuit library, while the average reduction factor compared to Qiskit is only 1.67. Compared to both standard passes, our compiler always computed circuits with lower overall and single-qubit gate counts.
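The remark in the Pytket comparison that Rz gates sandwiched between Rx gates prevent the Rx gates from being merged can be illustrated with a toy merging pass on a single qubit. The sketch below is not the phase tracking of our compiler and not a Pytket routine; the function name `merge_rotations` and the gate-list representation are assumptions made for illustration.

```python
import math


def merge_rotations(gates, tol=1e-9):
    """Merge adjacent same-axis rotations on one qubit and drop identities.

    gates: list of (axis, angle) tuples in execution order, e.g. ("rx", 0.5).
    Rotations by multiples of 2*pi are dropped (identity up to a global phase).
    """
    merged = []
    for axis, angle in gates:
        if merged and merged[-1][0] == axis:
            # Same axis as the previous gate: the angles simply add up.
            angle += merged.pop()[1]
        angle = math.remainder(angle, 2 * math.pi)
        if abs(angle) > tol:
            merged.append((axis, angle))
    return merged


# Two Rx gates with opposite angles cancel when they are adjacent ...
adjacent = merge_rotations([("rx", math.pi / 2), ("rx", -math.pi / 2)])

# ... but not when an Rz gate sits between them.
sandwiched = merge_rotations(
    [("rx", math.pi / 2), ("rz", math.pi / 4), ("rx", -math.pi / 2)]
)
```

Here `adjacent` is empty, while `sandwiched` still contains all three gates, which is the situation that arises when Rz gates cannot be moved out of the way.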

Conclusion and outlook
We have presented a quantum circuit compiler for a shuttling-based ion trap quantum computer that implements a number of optimization techniques such as phase tracking, block aggregation, and a block commutation with correction unitaries. These reduce the number of gates and shuttling operations when compiling a quantum circuit into the native gate set. Compared to other compilation algorithms, our compiler reduces the gate counts by factors of up to 5.11 for Pytket and up to 2.22 for Qiskit. In the future, we plan to extend the functionality of our compiler, increase its performance using more advanced compilation techniques, and adapt it to more powerful architectures. The next architectural step will be to adapt it to a platform which allows simultaneous addressed manipulation of larger subsets of commonly confined trapped-ion qubits, where gates can be executed in parallel. This will allow low-resource quantum computation on sub-registers, interleaved with time-consuming register reconfiguration steps. Future multiplexed laser addressing units [53] will also allow the execution of multi-qubit gates, so that, e.g., three-qubit Toffoli gates can be realized within a single laser interaction sequence [54] instead of being decomposed into two-qubit gates. This can massively increase the quantum computational power of a trapped-ion platform, but also requires an advanced compilation layer.

A Detailed evaluation results
This appendix presents the detailed results of the evaluations described in Sec. 6.1 and Sec. 6.3. For the 153 quantum circuits in the circuit library, Tab. 1 shows the results for the circuits with three to five qubits, and Tab. 2 shows the results for the circuits with six to sixteen qubits. We took all these circuits from [47]. For QAOA, QFT, and QV, Tab. 3 shows the results for various numbers of qubits, while Tab. 4 shows the results for Supremacy and Sycamore. While we generated the circuits for QAOA, QFT, Supremacy, and Sycamore using the quantum circuit generator from [52], we generated the QV circuits using Qiskit [17].
Tab. 1: Results of the evaluations for the circuits in the circuit library with three to five qubits. The first column shows the name, number of qubits (q), single-qubit gate count (1qg), and two-qubit gate count (2qg) of the original circuit. The next four columns show the single-qubit and two-qubit gate counts for our CliffordSimp, SquashTK1, KAKDecomposition, and FullPeepholeOptimise approaches. The green markers show which approaches have the lowest single-qubit and two-qubit gate counts for each circuit. If multiple approaches have the same single-qubit or two-qubit gate count, the approach with the lowest overall gate count is marked. The last two columns show the results of the standard Pytket and Qiskit passes. In each of these columns, the first two entries show the single-qubit and two-qubit gate counts. The last entry in each column shows the factors by which the best approach of our compiler reduces the overall gate count compared to Pytket or Qiskit (rg).
Tab. 2: Results of the evaluations for the circuits in the circuit library with six to sixteen qubits. The meaning of the columns is the same as in Tab. 1.

Fig. 1: Linear shuttling-based segmented ion trap architecture: The ions are stored in small groups at different segments of the architecture. The lasers performing the gate operations (here on the red ions) are directed only to a specific segment, the laser interaction zone (LIZ, purple segment). The following operations reconfigure the ion positions (from left to right): merging ions into a new group (green ions), physically swapping ions (blue ions), splitting a group (purple ions), and translating ions between different segments (yellow ion).

Fig. 3: The compilation flow of our compiler. Each yellow box represents a transformation step of the compilation process. The boxes contain either only the name of the transformation step or the name of the transformation step on the left side and the gate set of which the circuit consists after the transformation on the right side. The blue boxes bundle several transformation steps into larger logical steps, each of which is described in a separate section.

Fig. 4: Example of a graph representation of a quantum circuit with four qubits. The circuit contains gates which are not part of our trapped-ion native gate set N.

Fig. 5: Applying the SWAP gate elimination to the circuit from Fig. 4.

Fig. 6: Example from Fig. 5c with the macro matching applied. Since only the gate $CRy_{3,2}(\pi)$ of the CRy gates in Fig. 5c satisfies the condition $\theta = \ell\pi$ with $\ell \in \mathbb{Z}$, only this gate is replaced by the orange gates.

Fig. 7: The process of transforming a circuit into the gate set M.
(a) Example from Fig. 7c after applying Pytket's SquashTK1 pass. Each single-qubit gate sequence has been transformed into exactly one orange TK1 gate.
Example from Fig. 13b with the superblocks $B_\beta$ and $B_\gamma$. The non-yellow blockless sequences can be added completely, partially

Fig. 16: The process of applying the block commutation with correction unitaries.

(a) Runtimes of our complete compilation flow. All four compilation approaches behave nonlinearly with the number of gates.
Fig. 22: Number of splits performed during angle splitting in Sec. 4.8.2, depending on the number of single-qubit gates in the original circuit. For each circuit, the best compilation result of the four different rebasing approaches is used. Each dark blue dot represents at least one circuit in the circuit library, with splits applied to 97 of the 153 circuits. While angle splitting was performed on every circuit of Sycamore (purple dots), it was not applied to the 25-qubit circuit of Supremacy (orange dots). Since angle splitting was not performed on any circuit of QAOA, QFT, and QV, these circuits are not depicted.
The overall gate count reductions for the 153 circuits in the circuit library. The circuits are sorted by the number of qubits used.
Tab. 4: Results of the evaluations for the algorithms Supremacy and Sycamore for different numbers of qubits. The meaning of the columns is the same as in Tab. 1.