Scaling of the quantum approximate optimization algorithm on superconducting qubit based hardware

Quantum computers may provide good solutions to combinatorial optimization problems by leveraging the Quantum Approximate Optimization Algorithm (QAOA). The QAOA is often presented as an algorithm for noisy hardware. However, hardware constraints limit its applicability to problem instances that closely match the connectivity of the qubits. Furthermore, the QAOA must outpace classical solvers. Here, we investigate swap strategies to map dense problems into linear, grid and heavy-hex coupling maps. A line-based swap strategy works best for linear and two-dimensional grid coupling maps. Heavy-hex coupling maps require an adaptation of the line swap strategy. By contrast, three-dimensional grid coupling maps benefit from a different swap strategy. Using known entropic arguments we find that the gate error required for dense problems lies deep below the fault-tolerant threshold. We also provide a methodology to reason about the execution time of QAOA. Finally, we present a QAOA Qiskit Runtime program and execute the closed-loop optimization on cloud-based quantum computers with transpiler settings optimized for QAOA. This work highlights some obstacles that must be overcome to make QAOA competitive, such as gate fidelity, gate speed, and the large number of shots needed. The Qiskit Runtime program gives us a tool to investigate such issues at scale on noisy superconducting qubit hardware.


Introduction
Daniel J. Egger: deg@zurich.ibm.com
Gate-based quantum computers process information by applying unitary operations on information stored in qubits. Such computers may provide an advantage for complex computational tasks in chemistry [1][2][3], finance [4][5][6] and combinatorial optimization [7,8]. We focus on the Quantum Approximate Optimization Algorithm (QAOA) [7][8][9] which maps combinatorial optimization problems, for instance quadratic unconstrained binary optimization (QUBO) problems with $n$ variables, to the problem of finding the ground state of an Ising Hamiltonian $H_C$ [7]. Here, $H_C$ is constructed by mapping each of the $n$ decision variables to a qubit by the relation $x_i = (1 - z_i)/2$ and replacing $z_i$ by a Pauli spin operator $Z_i$ [10,11]. Two qubits $i, j$ thus only interact through $Z_i Z_j$ if the corresponding quadratic term $\Sigma_{i,j}$ is not zero. The QAOA first creates an initial state which is the ground state of a mixer Hamiltonian $H_M$. A common choice of $H_M$ and initial state is $-\sum_{i=0}^{n-1} X_i$ and $|+\rangle^{\otimes n}$, which is easy to prepare. Here, $X_i$ are Pauli $X$ operators. Next, a depth-$p$ QAOA circuit creates the trial state $|\psi(\beta, \gamma)\rangle$ for vectors $\gamma, \beta \in \mathbb{R}^p$ by applying $\exp(-i\beta_k H_M)\exp(-i\gamma_k H_C)$ at each layer $k = 1, ..., p$, implemented by $R_X(\beta) = \exp(-i\beta X/2)$ and $R_{ZZ}(\gamma) = \exp(-i\gamma ZZ/2)$ gates. A classical optimizer seeks the optimal values of $\beta$ and $\gamma$ to create a trial state which minimizes the energy of $H_C$. The potential for a quantum advantage of QAOA and its variants over highly-optimized classical solvers, such as CPLEX [12], must be explored empirically. Such benchmarks must be two-dimensional: both the quality of the proposed solution and the time needed to reach it matter [13][14][15][16].
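The substitution x_i = (1 − z_i)/2 can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the QUBO objective is taken to be x^T Σ x, and the function names and the example matrix are hypothetical and not taken from the paper's software.

```python
from itertools import product


def qubo_to_ising(sigma):
    """Rewrite x^T Sigma x with x_i in {0, 1} as
    offset + sum_i h_i z_i + sum_{i<j} J_ij z_i z_j with z_i in {-1, +1},
    using the relation x_i = (1 - z_i) / 2."""
    n = len(sigma)
    offset, h, J = 0.0, [0.0] * n, {}
    for i in range(n):
        # Diagonal term: x_i^2 = x_i = (1 - z_i) / 2.
        offset += sigma[i][i] / 2
        h[i] -= sigma[i][i] / 2
        for j in range(i + 1, n):
            w = sigma[i][j] + sigma[j][i]
            if w == 0:
                continue  # vanishing quadratic term: no Z_i Z_j interaction
            # Off-diagonal term: x_i x_j = (1 - z_i - z_j + z_i z_j) / 4.
            offset += w / 4
            h[i] -= w / 4
            h[j] -= w / 4
            J[(i, j)] = w / 4
    return offset, h, J


def qubo_value(sigma, x):
    """Evaluate the QUBO objective x^T Sigma x."""
    n = len(x)
    return sum(sigma[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def ising_value(offset, h, J, z):
    """Evaluate the Ising energy for a spin configuration z."""
    linear = sum(hi * zi for hi, zi in zip(h, z))
    quadratic = sum(w * z[i] * z[j] for (i, j), w in J.items())
    return offset + linear + quadratic


# Hypothetical three-variable example; Sigma[0][2] = 0, so qubits 0 and 2 do not interact.
sigma = [[1, -2, 0], [0, 3, 4], [0, 0, -1]]
offset, h, J = qubo_to_ising(sigma)
```

Only the non-zero entries of Σ produce a Z_i Z_j term, which is why the density of Σ determines how many two-qubit gates the cost layer needs.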
Business-relevant problems often require budget or capacity constraints, and thus, Σ tends to be dense [17] and the corresponding interaction graph non-planar [18]. Implementing such problems on superconducting qubit [19,20] platforms is hindered by the limited qubit connectivity, expressed by the coupling map, and therefore requires SWAP gates [14]. By contrast, cold-atomic architectures based on the Rydberg blockade [21] and trapped ions [22] may overcome this issue [23] but in turn suffer from low repetition rates which limit the speed at which shots are gathered [24].
In this work we discuss all aspects relevant to scaling QAOA on superconducting qubits. First, in Sec. 2, we discuss strategies to map dense problems into linear, grid [14] and heavy-hex [25] coupling maps. Next, in Sec. 3, we estimate the quantum hardware requirements needed to solve such problems. Here, we estimate gate fidelity requirements for problems of varying density in Sec. 3.1 and present a methodology to reason about the run-time of QAOA in Sec. 3.2. Indeed, QAOA can only provide an advantage by yielding better solutions than classical optimizers or comparable solutions in a shorter time. In Sec. 4 we present a QAOA Qiskit Runtime program to explore the QAOA scaling on noisy hardware. Finally, we discuss these results and conclude in Sec. 5.

Effects of limited device connectivity
The qubit connectivity in, e.g., superconducting qubit devices, is limited by engineering constraints and the need to avoid unwanted effects like cross-talk [26][27][28]. Typically, qubits are arranged in a planar graph, called the coupling map and two-qubit gates can only be applied to adjacent qubits. Therefore, additional SWAP gates are inserted into the circuits to make them hardware-compatible, a task known as qubit routing [29,30]. The number of gates after transpilation to a hardware device thus depends on the problem, the coupling map, and the routing algorithm. We propose a set of hardware dependent routing algorithms that perform particularly well on dense circuits of commuting operators, such as the cost operator in a QAOA on a complete graph. We investigate the resulting circuit depth and gate count for linear, grid [14], and heavy-hex [25] coupling maps.

Hardware-optimized transpiler pass
Swap transpiler passes typically divide quantum circuits into layers of simultaneously executable gates on the coupling map [29]. They transition between the qubit mappings of different layers, i.e., a positioning of logical to physical qubits, by inserting SWAP gates consistent with the coupling map. Here, a logical qubit is a qubit in an algorithm and a physical qubit is a hardware qubit such as a transmon [31]. Mapping circuits to hardware is a hard optimization problem with combinatorial scaling, even on grid coupling maps [32]. A variety of heuristic algorithms have therefore been introduced [33] and different coupling maps studied [34,35]. Application-specific transpilers leverage the mathematical properties of the application to reduce circuit depth. For example, the 2QAN transpiler exploits the flexibility of permuting Trotter operators in two-local Hamiltonians [36]. Swap synthesis is simpler when the considered gates commute, as in the cost layer of QAOA [37]. Here, various strategies have been developed such as applying gates according to some ranking or stitching layers of R ZZ gates together [38]. Transpiler passes that do not consider commutativity yield sub-optimal gate counts for circuits with a high number of commuting gates.
We develop a transpiler pass that first identifies the subset of all device qubits to run on and then applies a corresponding swap strategy. The swap strategy exploits gate commutativity by reordering commuting gates and inserting layers of SWAP gates from a set of predefined swap layers S = {S 0 , ..., S K }. Throughout this work a layer is a set of simultaneously executable gates on the coupling map and has thus depth one. Therefore, for a given coupling map, a swap strategy is a series of swap layers S k1 , S k2 , ... of length L S applied in a predefined order and chosen from S, i.e. k i ∈ {0, ..., K}. A swap strategy applies the following steps:
1. Split the circuit into sequential sets of commuting gates T 1 , T 2 , ..., and choose the first one, i.e. i = 1, as the current set.
2. Repeat the following steps (a) to (d) until all gates in the current set T i are applied, see Fig. 1. Set j = 1, and
(a) select all remaining gates E j ⊆ T i from the current set that are executable given the current qubit mapping and remove them from T i .
(b) Partition the selected gates E j into subsets of simultaneously executable gates G 1 , G 2 , ... either by sorting them according to a provided edge coloring of the coupling map or by greedily building the sets.
(c) Iterate through the subsets G 1 , G 2 , ..., e.g., in decreasing set size, to simultaneously apply all gates in each set.
(d) Apply a single swap layer S f (i,j) to alter the current qubit mapping. Here, f (i, j) is the order in which we apply the swap layers. Increment j and move to step (a) if T i is not empty.
3. Remove superfluous SWAP gates at the end of the circuit and continue to Step 2 with the next set of commuting gates T i+1 , or terminate if all gates are applied.
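Step 2(b) mentions greedily building sets of simultaneously executable gates. A minimal sketch of such a greedy partition is given below, assuming each gate is represented by the pair of qubits it acts on; the function name and the example are hypothetical and this is not the paper's transpiler pass.

```python
def greedy_layers(gates):
    """Greedily partition two-qubit gates into sets whose gates act on
    disjoint qubits, so that each set can be executed as a single layer."""
    layers = []
    for gate in gates:
        for layer in layers:
            # A gate fits into a layer if it shares no qubit with the gates in it.
            if all(set(gate).isdisjoint(other) for other in layer):
                layer.append(gate)
                break
        else:
            layers.append([gate])  # no existing layer fits: open a new one
    # Step 2(c): iterate through the sets, e.g., in decreasing set size.
    return sorted(layers, key=len, reverse=True)


# Gates that are executable on a five-qubit line with the identity mapping.
line_gates = [(0, 1), (1, 2), (2, 3), (3, 4)]
layers = greedy_layers(line_gates)
```

On a line the greedy partition recovers the two-coloring of the edges. An edge coloring of the coupling map, the other option named in step 2(b), would give the same sets here.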
We call a swap strategy optimal if it leads to full connectivity with the least possible number of swap layers. Our task is thus to find a good set of swap layers S, the order in which to apply them f (i, j), and the initial qubit mapping for a given problem and coupling map. For example, in QAOA, the first set T 1 creates the initial state which is trivial to apply as it is made of single-qubit gates. Next, T 2 corresponds to the cost operator which requires SWAP gates. We note that for QAOA the order in which the gates are applied in Step 2(c) is chosen to leverage gate cancellations between R ZZ gates in E j and SWAP gates in S f (i,j) . In sub-sections 2.2 and 2.3 we present the swap strategies and the scaling of their gate count and depth with problem size and density, respectively.
Figure 1: Transpilation of a five-qubit exp(−iγHC ) circuit (left) to a line coupling map using S = {S0, S1}. The swap layers alternate between the sets S0 = {SWAP0,1, SWAP2,3} and S1 = {SWAP1,2, SWAP3,4}. In the transpiled circuit (right) a redundant SWAP0,1 gate is removed from the last layer. The resulting qubit mapping is highlighted in green.

Swap strategies
The number of swap layers L S required to implement a quantum circuit U of commuting two-qubit gates under a fixed swap strategy S depends on its density D and structure. Here, D is the number of two-qubit terms normalized by its maximum possible number n(n − 1)/2. We describe U with an interaction graph G int where vertices correspond to qubits and edges to two-qubit gates. Let G L S be the graph of all possible qubit interactions implementable after L S swap layers of S. The circuit can be implemented with L S swap layers of S if G int can be embedded in G L S . The graphs G L S therefore describe the structure of a potential U that can be implemented after L S swap layers. The density of G L S bounds the possible density of U from above. Vice versa, we obtain a lower bound L S (D) on the number of swap layers required to implement a given U with a density D using a particular swap strategy. Indeed, a G L S that achieves the same density as U does not necessarily have the required structure to implement U . We now discuss swap strategies that reach full connectivity, i.e. D = 1.

Linear coupling map
Transpiling arbitrary quantum circuits to a line coupling map has been studied in Ref. [39] and when all the gates commute it can also be done optimally [14,37]. For a line coupling map with $n$ qubits the swap strategy which alternates between two swap layers $S_0$ and $S_1$, which apply SWAP gates on all even and odd numbered edges (footnote 1), respectively, is provably optimal, see Appendix A. This strategy requires $L_S = n - 2$ swap layers and is illustrated in Figs. 1 and 2. For this strategy the minimum number of swap layers needed to reach a density $D$ is $L_S(D) = (n - 2)D$.
Footnote 1: We call an edge $(i, j)$ of a line graph even if $i = 0 \mod 2$ and odd otherwise.
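The n − 2 layer count can be checked numerically for small lines. The sketch below is an illustration rather than the proof of Appendix A: it starts from the identity mapping, alternates swap layers on the even and odd edges, and counts how many layers are applied until every pair of logical qubits has been adjacent at least once. The function name is hypothetical.

```python
from itertools import combinations


def swap_layers_to_full_connectivity(n):
    """Count the alternating even/odd swap layers needed on an n-qubit line
    until every pair of logical qubits has been adjacent at least once."""
    order = list(range(n))  # order[pos] is the logical qubit at physical position pos
    missing = set(combinations(range(n), 2))

    def mark_adjacent_pairs():
        for pos in range(n - 1):
            a, b = order[pos], order[pos + 1]
            missing.discard((min(a, b), max(a, b)))

    mark_adjacent_pairs()  # gates executable before any swap layer
    layers = 0
    while missing:
        start = layers % 2  # S0 acts on even edges, S1 on odd edges
        for pos in range(start, n - 1, 2):
            order[pos], order[pos + 1] = order[pos + 1], order[pos]
        layers += 1
        mark_adjacent_pairs()
    return layers


layer_counts = {n: swap_layers_to_full_connectivity(n) for n in range(2, 9)}
```

For the five-qubit example of Fig. 1 this gives three swap layers, i.e. S0, S1, S0.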

Grid coupling map
We adapt the line graph strategy to the two- and three-dimensional nearest-neighbour grid coupling maps to create strategies that reach full connectivity after $n/2 + O(\sqrt{n})$ and $n/4 + O(n^{2/3})$ swap layers, respectively. For the two-dimensional case we consider square grids with $x$ rows and columns, i.e. $n = x^2$. The swap strategy has four layers $S_0, ..., S_3$ and repeats two steps until full connectivity is reached:
1. Apply $x - 1$ steps of the line swap strategy to each row. Importantly, in the same swap layer, the SWAP gates in one of two neighboring rows are applied on even edges while in the other row they are applied on odd edges, see $S_0$ and $S_1$ in Fig. 3.
2. Swap rows by applying two steps of the line swap strategy to each column in parallel, see $S_2$ and $S_3$ in Fig. 3. Applying $S_2$ and $S_3$ directly after each other guarantees that each row has, up to edge effects, two new neighboring rows, one above and one below itself.
Figure 3: The solid colored edges indicate the swap layer of the swap strategy S = {S0, S1, S2, S3}. S0 and S1 are repeatedly applied to reach full connectivity in each row and between neighbouring rows (step 1). Next, swap layers S2 and S3 are each applied once such that each row becomes adjacent to two different rows (step 2).
Step 1 creates full connectivity within each row and between adjacent rows. Note that while x − 2 layers are sufficient to reach full connectivity in each row, see Sec. 2.2.1, we need x − 1 swap layers to connect all qubits of adjacent rows.
Step 2 swaps rows such that all rows are adjacent to one another at some point in the swap process. Each iteration of steps 1 and 2 requires $x + 1$ swap layers. After repeating both steps $x/2$ times, full connectivity is reached. In total, the number of swap layers is
$$L_S = \frac{x}{2}(x + 1) = \frac{n}{2} + O(\sqrt{n}).$$
This strategy generalizes to grids of higher dimension: for $\eta$-dimensional grid coupling maps a problem density $D$ requires at least $L_S(D) = nD/2^{\eta-1} + O(n^{1-1/\eta})$ swap layers. Details are in Appendix A.
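As a rough illustration of how the layer count trades off against the gate cost per layer, the sketch below multiplies the leading 1/2^(η−1) scaling of the number of swap layers by the 4η − 1 CNOT gates per combined swap and R ZZ layer that the text quotes further below. Treating this product as a proxy for the total CNOT count is an assumption made here for illustration, and the function names are hypothetical.

```python
def leading_swap_layers(n, density, eta):
    """Leading-order number of swap layers, n * D / 2**(eta - 1),
    on an eta-dimensional grid (lower-order terms are dropped)."""
    return n * density / 2 ** (eta - 1)


def relative_cnot_cost(eta):
    """Proxy for the total CNOT count per variable at D = 1:
    (number of layers per variable) x (CNOT gates per combined layer)."""
    return (4 * eta - 1) / 2 ** (eta - 1)


# eta = 1 is the line, eta = 2 and 3 are the two- and three-dimensional grids.
costs = {eta: relative_cnot_cost(eta) for eta in (1, 2, 3)}
```

Under this proxy the line (3) beats the two-dimensional grid (3.5), while the three-dimensional grid (2.75) is cheapest, in line with the ordering discussed in the text.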

Heavy-hex coupling map
IBM Quantum systems have a heavy-hex coupling map, i.e., a regular hexagonal lattice with additional nodes inserted on each edge [25]. A simple strategy applies the optimal line graph strategy to the longest continuous line embedded in the heavy-hex graph. This strategy cannot reach full connectivity since a single continuous line does not include all qubits. However, a swap strategy that applies the line strategy to the longest line in the heavy-hex graph and periodically swaps qubits positioned in the line with qubits that are not part of the line does reach full connectivity. Details, including an upper bound on the required number of swap layers and a proof, are given in Appendix A. A lower bound on the number of swap layers to implement a problem with density $D$ is almost linear and we approximate it by $nD$, see Fig. 4 and Tab. 1.
... three-dimensional grid is the best. This results from two competing scalings. On one hand, the number of swap layers required to reach full connectivity scales as $1/2^{\eta-1}$ with $\eta$, i.e., higher-dimensional grids require fewer swap layers. On the other hand, the number of CNOT gates per swap and $R_{ZZ}$ layer combined scales as $4\eta - 1$, i.e., higher-dimensional grids have fewer $R_{ZZ}$ and CNOT cancellations. Therefore, the additional connectivity of the two-dimensional grid over the line is not useful for reaching full connectivity. On two-dimensional grids it is better to use a line swap strategy on the longest line in the grid instead of the grid swap strategy. By contrast, three-dimensional grids, such as cold atomic lattices [40], lead to a smaller gate count than a line and a two-dimensional grid. We now benchmark the swap strategies of Sec. 2.2 against t|ket⟩ [41], 2QAN [36], the QAOA Compiler [38], SabreSwap [42], and a commutation-aware version of SabreSwap that we implemented and describe in Appendix C.
We consider depth-one QAOA circuits of MaxCut [18,43], formally introduced in Appendix D, for graphs chosen uniformly at random from the set of all graphs with $n$ nodes and $D\,n(n-1)/2$ edges (footnote 2). We compute the number of layers of simultaneous CNOT gates and the CNOT gate count after each transpiler pass for heavy-hex coupling maps with a variable number of rows and columns, with $D \in \{0.25, 0.5, 0.75, 1\}$; details are in Appendix E. Commutation-aware SabreSwap results in fewer CNOT gates but deeper circuits for problems with $D = 0.25$ and $0.5$. 2QAN has the least number of CNOT gates but results in deeper circuits than the swap strategies. The swap strategies are clearly advantageous for dense graphs, see Fig. 5. This is expected since the swap strategies are tailored for dense problems and may perform unnecessary SWAP gates on qubits that do not need to be connected in incomplete graphs. For hardware subject to finite $T_1$ and $T_2$ times shallow circuits are advantageous as idling qubits accumulate errors. We find that the swap strategies result in the shallowest circuits as shown by the number of CNOT layers in Fig. 5. Additionally, the time taken to transpile with the swap strategies is significantly lower than the time needed by t|ket⟩, 2QAN, the QAOA Compiler, and commutation-aware SabreSwap and is comparable to SabreSwap, see Appendix E.
Footnote 2: The graphs are generated with the gnm random graph method from the networkx package.
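The benchmark graphs are drawn uniformly from all graphs with a fixed number of nodes and edges. The sketch below reproduces this sampling model with the Python standard library only, as a stand-in for the networkx routine named in the text; the function name, the rounding of the edge count, and the seed are assumptions made for illustration.

```python
import random
from itertools import combinations


def random_graph_with_density(n, density, seed=None):
    """Sample a graph uniformly at random among all graphs with n nodes and
    round(density * n * (n - 1) / 2) edges, returned as a sorted edge list."""
    rng = random.Random(seed)
    all_edges = list(combinations(range(n), 2))
    num_edges = round(density * len(all_edges))
    # Sampling without replacement makes every edge set of this size equally likely.
    return sorted(rng.sample(all_edges, num_edges))


sparse_edges = random_graph_with_density(10, 0.2, seed=7)
complete_edges = random_graph_with_density(10, 1.0, seed=7)
```

At D = 1 the sample is always the complete graph, which is the case the swap strategies are tailored to.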

Hardware requirements
We now discuss how gate fidelity and gate duration impact QAOA in Sec. 3.1 and Sec. 3.2, respectively.

Gate fidelity
The error-prone unitary gates limit performance. Entropic inequalities help bound the maximum circuit depth for QAOA [44,45]. Following Proposition 2 of Ref. [45], the depth of a QAOA circuit with a fraction $f_1$ of single-qubit gate layers and a fraction $f_2$ of two-qubit gate layers with depolarizing noise with probability $p_1$ and $p_2$, respectively, is bounded by a maximum depth $L_{\max}$, Eq. (2). For circuits deeper than $L_{\max}$ there exists a polynomial-time classical algorithm that finds a Gibbs state that we can classically sample from with the same energy up to an error $\epsilon \|H_C\|$ of the noisy QAOA state.
Here, $\epsilon$ controls the precision with which we approximate the energy. Reference [45] argues that $\epsilon$ should range from $10^{-1}$ to $10^{-2}$ since most optimization algorithms require a number of shots with an inverse polynomial scaling in $\epsilon$ [46,47]. This implies that going beyond $\epsilon \sim 10^{-2}$ incurs a significant sampling cost. Since the CNOT gate is the dominant source of error and QAOA circuits for denser problems are dominated by two-qubit gate layers, we further simplify Eq. (2) to $L_{\max} \approx \ln(\epsilon^{-1})/(2p_2)$. The CNOT fidelity $F_{cx}$ quoted by IBM Quantum systems is the probability of a depolarizing channel since it is measured with randomized benchmarking [48][49][50]. Each CNOT gate layer in a QAOA circuit transpiled with the swap strategies presented in Sec. 2.2 will on average have $l_{cx}$ CNOT gates, see Tab. 1. We make the simplifying assumption that the depolarizing probability of a layer of CNOT gates is $1 - F_{cx}^{l_{cx}}$. This is an optimistic assumption since effects such as crosstalk may degrade the performance of gates applied in parallel [51,52]. A QAOA with depth $p$ using $L_{cx}(n, D)$ CNOT layers per application of $\exp(-i\gamma H_C)$ for a graph with $n$ nodes and density $D$ must satisfy
$$p\,L_{cx}(n, D) \leq \frac{\ln(\epsilon^{-1})}{2\left(1 - F_{cx}^{l_{cx}}\right)}, \qquad (3)$$
otherwise there is a corresponding Gibbs state which can be sampled from classically in polynomial time [45]. Following Ref. [45] there is therefore little chance that running a quantum circuit that requires more CNOT layers than the bound in Eq. (3) will lead to a quantum advantage, i.e., we assume there is little hope for a quantum advantage if there exists a polynomial-time classical approximation.
We calculate the bound in Eq. (3) for a heavy-hex coupling map with 485 qubits as a function of $D$ and $F_{cx}$. The density-dependent upper bound on the gate error is one to three orders of magnitude lower than current hardware capabilities, see Fig. 6. The data indicate that non-hardware native optimization problems will require gate fidelities above error correction thresholds which typically range from 99% to 99.99% [25,53,54]. When such fidelities are reached it may still be advantageous to run QAOA on noisy hardware due to the large qubit overhead imposed by error correction and the potentially lower execution times.
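The simplified depth bound ln(ε^-1)/(2 p_2) is easy to evaluate numerically. The sketch below assumes that the depolarizing probability of a CNOT layer with l_cx gates is 1 − F_cx^l_cx, which is the reading of the simplifying assumption adopted here; the function name is hypothetical. The example values are those quoted for ibm nairobi in the hardware results section (98.88% gate fidelity, layers of two CNOT gates).

```python
import math


def max_cnot_layers(epsilon, cnot_fidelity, gates_per_layer):
    """Simplified depth bound ln(1/epsilon) / (2 * p_layer), where p_layer is the
    assumed depolarizing probability of one layer of parallel CNOT gates."""
    p_layer = 1 - cnot_fidelity ** gates_per_layer
    return math.log(1 / epsilon) / (2 * p_layer)


# Values quoted in the text for ibm nairobi.
bound_coarse = max_cnot_layers(1e-1, 0.9888, 2)
bound_fine = max_cnot_layers(1e-2, 0.9888, 2)
```

Rounded to the nearest integer, the two bounds are 52 and 103 CNOT layers, matching the numbers quoted for that device.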

Execution-time analysis
We now estimate the execution time of QAOA as the product of the number of iterations of the classical solver $N_\text{iter}$ and the time taken to gather the data at each iteration,
$$\tau_\text{QAOA} = N_\text{iter}\left(N_\text{shots}\,\tau_\text{shot} + \tau_\text{init}\right).$$
Here, the number of measurements per iteration is $N_\text{shots}$ and the duration of a single shot is $\tau_\text{shot} = \tau_\text{circ} + \tau_\text{delay}$. The time taken to run all the gates, measurement, and reset instructions is captured in $\tau_\text{circ}$ while $\tau_\text{delay}$ is a fixed delay after each measurement used to improve the qubit reset [55,56]. At each iteration the control hardware must be set up to gather the next shots and therefore incurs a time penalty $\tau_\text{init}$. This decomposition is similar to the Circuit Layer Operations per Second (CLOPS) benchmark [55]. We estimate the execution time $\tau_\text{QAOA}$ for different problem sizes $n$ and on different coupling maps. We focus on problems with $D = 1$ since they upper bound the $D < 1$ instances. Substituting $D < 1$ in the following equations may underestimate the execution time depending on the graph structure.

Duration of a single-shot
On cross-resonance based hardware [59,60] the duration of a single shot is determined by the duration of the CNOT gate $\tau_{cx}$, the QAOA depth $p$, the problem density $D$, and the coupling map as discussed in Sec. 2.2. Since QAOA at sub-logarithmic depth is not expected to outperform classical solvers [61][62][63], we assume that $p$ scales at least logarithmically with the number of variables $n$. We therefore choose $p = \log_2(n)$ for our run-time analysis. With the number of CNOT layers $L_{cx}(n, D)$ we estimate that a single shot lasts at least
$$\tau_\text{shot} \geq p\,L_{cx}(n, D)\,\tau_{cx} = \log_2(n)\,L_{cx}(n, D)\,\tau_{cx}.$$
Since $L_{cx}(n, D)$ scales as $\Omega(Dn)$ the duration of a single QAOA shot scales at least as fast as $\tau_\text{shot} = \Omega(Dn \log_2 n)$. With a 400 ns CNOT gate and a heavy-hex coupling map, i.e. $L_{cx} = 9nD$, the duration of a shot is significant, see Fig. 7 which only includes the CNOT gate time. Here, measurement and reset instructions, which can last up to a few microseconds [56,64,65] and typically only appear once in a quantum circuit, are neglected. With current cross-resonance gate durations of 200 to 400 ns, delays can also be neglected since $\tau_\text{delay} \approx 250$ µs for current hardware [55], which is two orders of magnitude shorter than the circuit duration. The CNOT duration is thus currently the main driver of execution time on noisy quantum hardware. For example, the circuit of a complete interaction graph with 485 variables has a single-shot duration of 14.9 ms on a heavy-hex coupling map. Optimal control schemes show that it is in principle possible to reduce the duration of the single- and two-qubit gates by an order of magnitude [57,58], see Tab. 2. This reduces the QAOA run-time by an order of magnitude and will make the fixed delay $\tau_\text{delay}$ after each shot more relevant.
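The single-shot estimate can be evaluated with a few lines of Python. The sketch below assumes that the shot duration is the product of the depth p = log2(n), the number of CNOT layers per cost layer, and the CNOT duration, with the approximation L_cx = 9nD quoted for heavy-hex coupling maps; measurement, reset and delays are neglected as in the text. The function and parameter names are hypothetical.

```python
import math


def shot_duration(n, density=1.0, tau_cx=400e-9, layers_per_variable=9):
    """Lower-bound estimate of the single-shot duration in seconds:
    p * L_cx * tau_cx with p = log2(n) and L_cx = layers_per_variable * n * density."""
    depth = math.log2(n)
    cnot_layers = layers_per_variable * n * density
    return depth * cnot_layers * tau_cx


tau_complete = shot_duration(485)             # complete graph, 400 ns CNOT
tau_fast = shot_duration(485, tau_cx=40e-9)   # CNOT gate ten times faster
```

With these approximations a complete graph on 485 variables takes roughly 15 to 16 ms per shot, the same order as the 14.9 ms quoted above; the exact figure depends on the layer counts of Tab. 1, which this sketch does not reproduce.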

Number of shots required
The classical optimizer has to minimize the objective function which it can only stochastically access [66]. A simulation of a variational algorithm must therefore consider this effect. For example, QAOAs simulated in Qiskit [67] always perform better with the state-vector simulator, which does not have sampling noise, than with the shot-based QASM simulator, see Fig. 8. A large number of shots thus helps QAOA converge [15,23] but increases its execution time. Zeroth-order methods optimize by directly estimating $E(\theta)$ [47]. They prepare and measure each trial state $N_\text{shots}$ times. Measurement $k$ randomly projects $|\psi(\theta)\rangle$ onto a basis state with an energy $E_k$, thereby estimating $E(\theta)$ by
$$E(\theta) \approx \frac{1}{N_\text{shots}} \sum_{k=1}^{N_\text{shots}} E_k.$$
Reference [47] shows that for 1-local Hamiltonians with $n$ qubits the total number of shots over the course of the optimization, i.e., $N_\text{iter} N_\text{shots}$, needed to reach a precision $\epsilon$ within the vicinity of the optima is lower bounded by $\Omega(n^3/\epsilon^2)$. By contrast, first-order methods optimize by taking measurements that correspond to the gradient $\partial_\theta E(\theta)$ and are sometimes referred to as analytical gradient measurements [68,69]. Furthermore, for 1-local Hamiltonians Ref. [47] shows that the total number of shots required by first-order methods to reach a precision $\epsilon$ scales as $\Theta(n^2/\epsilon)$. In this setting, first-order methods therefore converge faster but still require a large number of shots. Recently, optimizers that scale the number of shots based on the magnitude of the gradient have been developed to reduce the shot cost [70]. The $\Theta(n^2/\epsilon)$ scaling is a significant shot cost. However, since optimal QAOA parameters concentrate for similar problem instances [71] this cost may be amortized when solving many problem instances originating from the same reasonable distribution.
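The shot-based energy estimate is a sample mean over the measured bitstrings. The sketch below illustrates it for an Ising cost function given as a dictionary of couplings. It also includes the CVaR aggregation used in the hardware runs later in the text, which keeps only the best fraction α of the shots. The data layout (a counts dictionary whose bitstring character i belongs to qubit i), the function names, and the toy data are assumptions made for illustration.

```python
import math


def ising_energy(bits, couplings):
    """Energy sum_{(i, j)} w_ij z_i z_j of one measured bitstring, with z = 1 - 2 * bit."""
    z = [1 - 2 * int(b) for b in bits]
    return sum(w * z[i] * z[j] for (i, j), w in couplings.items())


def shot_energies(counts, couplings):
    """Expand a counts dictionary into one energy per shot."""
    energies = []
    for bits, count in counts.items():
        energies.extend([ising_energy(bits, couplings)] * count)
    return energies


def mean_energy(counts, couplings):
    """Sample-mean estimate of E(theta) from N_shots measurements."""
    energies = shot_energies(counts, couplings)
    return sum(energies) / len(energies)


def cvar_energy(counts, couplings, alpha):
    """Average over the fraction alpha of shots with the lowest energy."""
    energies = sorted(shot_energies(counts, couplings))
    keep = max(1, math.ceil(alpha * len(energies)))
    return sum(energies[:keep]) / keep


# Toy data: two qubits, one Z_0 Z_1 term, eight shots.
toy_counts = {"00": 2, "01": 6}
toy_couplings = {(0, 1): 1.0}
```

The statistical error of the sample mean shrinks only as the inverse square root of the number of shots, which is why the shot count weighs so heavily on the execution time.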

Number of iterations
We further investigate the number of iterations of the classical optimizer empirically with COBYLA by running QAOA simulations on Qiskit's QASM simulator. We set the maximum number of iterations to a high value (100,000) so that COBYLA terminates before reaching this threshold. The number of completed iterations is recorded for Sherrington-Kirkpatrick graphs with size $n$ from 5 to 10, with edge weights randomly chosen from {−1, 1}, resulting in a total of 30 graphs, i.e., five per size. We initialize $\beta$ and $\gamma$ for QAOA depths $p \in \{1, 4, 8, 12, 16, 20, 24\}$ with the Trotterized Quantum Annealing (TQA) protocol, i.e. a discretized annealing schedule, with a time-step of 0.75 [72] as it performs better than random guesses. We observe that the number of iterations grows linearly with $p$ for all simulated graphs, see Fig. 9. At fixed $p$ we did not see a change due to the graph size. The mean cut value shows a noticeable improvement when increasing the shots from $10^4$ to $4 \cdot 10^4$ and the number of iterations increases with the number of shots since the optimizer is able to resolve finer details in the optimization landscape. Since we use $p = \Omega(\log_2 n)$ we therefore approximate $N_\text{iter} \approx 25 \log_2(n)$. This estimate is obtained with noiseless simulations. In practice experimental noise will make it harder to converge to a good solution [73] and advances in optimizers for variational quantum algorithms may help speed up convergence [74]. Furthermore, if $N_\text{iter} N_\text{shots}$ scales as $\Theta(n^2/\epsilon)$, then a logarithmic scaling of $N_\text{iter}$ with $n$ implies that many shots are needed at each iteration, further confirming that hardware initialization times can be neglected, i.e., $N_\text{shots}\tau_\text{shot} \gg \tau_\text{init}$.

Total QAOA execution time
We now combine the results from the previous three sections to estimate the execution time of QAOA as $\tau_\text{QAOA} \approx N_\text{iter} N_\text{shots} \tau_\text{shot}$, where the hardware initialization time is neglected, $N_\text{iter} \approx 25 \log_2(n)$, and $\tau_\text{shot}$ is the single-shot duration estimated above. Here, the linear dependence on $p$ of the single-shot duration gives a factor $\log_2(n)$. The required number of shots is the largest source of uncertainty in the estimation of the execution time. We require at least $N_\text{shots}(n) \geq 10^3$ for small problems in noiseless conditions. If $N_\text{iter} N_\text{shots}$ scales as $\Theta(n^2/\epsilon)$, as suggested by 1-local Hamiltonians [47], the impact on the execution time will be significant even for problems with a few hundred variables, see Fig. 10. With $10^4$ shots we estimate that the execution time of a complete graph with 485 nodes on a heavy-hex processor is 9.7 hours, see Fig. 10. With the same number of shots a sparse graph with $D = 0.1$ would require at least one hour to execute. Furthermore, these estimates show that decreasing the CNOT gate duration is crucial. We have not taken into account the cost of error mitigation strategies. For example, the expectation value can be extrapolated to the zero-noise level by measuring the energy at different noise levels [2,75]. This multiplies $N_\text{shots} \cdot \tau_\text{shot} + \tau_\text{init}$ by the number of noise levels measured. Less noisy energy evaluations may also reduce the number of iterations needed.
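The execution-time estimate can be reproduced approximately with a short script. The sketch below assumes N_iter = 25 log2(n) iterations, a single-shot duration of log2(n) · 9nD · τ_cx for a heavy-hex coupling map with a 400 ns CNOT gate, and a negligible initialization time. These are the approximations stated in the text, while the function and parameter names are hypothetical.

```python
import math


def qaoa_execution_time(n, density=1.0, n_shots=10_000, tau_cx=400e-9, optimize=True):
    """Estimated QAOA execution time in seconds on a heavy-hex coupling map.

    With optimize=False the classical optimization loop is skipped, i.e. the
    factor N_iter is removed and a single batch of shots is taken."""
    depth = math.log2(n)                       # p = log2(n)
    tau_shot = depth * 9 * n * density * tau_cx
    n_iter = 25 * math.log2(n) if optimize else 1
    return n_iter * n_shots * tau_shot


hours_complete = qaoa_execution_time(485) / 3600
hours_sparse = qaoa_execution_time(485, density=0.1) / 3600
seconds_no_loop = qaoa_execution_time(500, optimize=False)
```

Under these assumptions the complete 485-node graph takes between nine and ten hours, the D = 0.1 graph about one hour, and a 500-node graph without the optimization loop less than three minutes, consistent with the figures quoted in this section.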
Care must be taken when comparing quantum-based optimizers to classical solvers [76] but with current quantum technology the estimated execution time appears to be significant [77]. However, a possible advantage of QAOA over classical optimizers could lie in quickly generating good yet sub-optimal solutions by foregoing the classical optimization algorithm and initializing $\beta$ and $\gamma$ from a known good schedule, such as an annealing schedule [72]. Furthermore, as optimal QAOA parameters tend to concentrate, optimizing one problem may be sufficient for a family of similar problems [71,[78][79][80][81]. We account for this possibility by removing the factor $N_\text{iter} = 25 \log_2(n)$ in the execution time and using a fixed number of shots $N_\text{shots} = 10^4$. Under these assumptions candidate solutions for a graph with 500 nodes can be generated in under three minutes, see the dashed-dotted line in Fig. 10.

Qiskit Runtime hardware results
The current execution model on cloud-based quantum computers sends a set of circuits as a job through the entire stack and queue. Circuit transpilation and result analysis are done on the client side. This is particularly inefficient for variational algorithms. The Qiskit Runtime allows users to run an entire program in a containerized service close to the backend to avoid latencies between the user and the backend at each iteration. This enables a significantly faster execution of variational algorithms like QAOA.
We first demonstrate the QAOA Runtime program with a seven-variable weighted maximum cut optimization problem with a graph $G_{10}$ with 10 unique Pauli $Z_i Z_j$ terms and depths $p \in \{2, 3, 4\}$. Each edge $(i, j)$ has a weight $\omega_{i,j}$ of −1 or 1 with a 50% probability, see Appendix F. $G_{10}$ was constructed such that it can be implemented with one swap layer on the seven-qubit ibm_nairobi system. Since the energy $E$ is related to the cut value $C$ by $E = -2C + \sum_{(i,j)} \omega_{i,j}$ we minimize the energy to maximize the cut. For $G_{10}$ we have $E = -2C$. We run QAOA with SPSA due to the noisy environment [82,83] and measure $2^{14}$ shots at each energy evaluation. The 0.005 learning rate and 0.01 perturbation of SPSA were chosen by calibrating it on a depth-one landscape. The initial $\gamma$ and $\beta$ values come from TQA initialization [72]. First, we run $G_{10}$ three times with $p = 2$ with the default transpiler setting of Qiskit which does not use swap strategies. Here, we do not observe any convergence, see the red data in Fig. 11. We attribute this to the 0.83% average CNOT gate error of ibm_nairobi and the high CNOT gate count of the circuits. Indeed, the circuits have 89, 127, and 173 CNOT gates for depth $p = 2$, 3, and 4, respectively.
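The relation between the energy and the cut value can be verified by brute force on a small weighted graph. The sketch below uses a hypothetical four-node graph, not the graph G 10 of the paper, with ±1 weights that sum to zero so that E = −2C holds as it does for G 10. The energy is taken to be the sum of ω z_i z_j over the edges, which is an assumption about the normalization.

```python
from itertools import product

# Hypothetical weighted graph; the weights sum to zero.
edges = {(0, 1): 1, (1, 2): -1, (0, 2): 1, (2, 3): -1}
total_weight = sum(edges.values())


def cut_value(x, edges):
    """Sum of the weights of the edges whose end points lie on different sides."""
    return sum(w for (i, j), w in edges.items() if x[i] != x[j])


def energy(x, edges):
    """Ising energy sum_{(i, j)} w_ij z_i z_j with z_i = 1 - 2 x_i."""
    z = [1 - 2 * xi for xi in x]
    return sum(w * z[i] * z[j] for (i, j), w in edges.items())


assignments = list(product([0, 1], repeat=4))
best = min(assignments, key=lambda x: energy(x, edges))
```

Because the identity holds for every assignment, the minimum-energy bitstring is also a maximum cut.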
To improve convergence we rerun the QAOA Runtime program with optimized settings. First, we use the swap strategies discussed previously. Second, after the parameters are bound we run a pulse-efficient transpiler pass [84] at each QAOA iteration to remove any unnecessary single-qubit gates and to minimize the cross-resonance gate usage. The gains from these transpiler passes are summarized in Tab. 3. Third, we employ CVaR optimization with an $\alpha$ of 0.5, i.e. we retain only the best 50% of the measured shots at each iteration [85]. Finally, we use readout error mitigation to reduce measurement errors [86,87]. We observe a significant reduction in energy for $G_{10}$ for depth $p = 2$ as a function of the iteration number. Depths 3 and 4 would require 72 and 96 CNOT gates, respectively, without pulse-efficient transpilation and are most likely noise limited. Crucially, the jobs that did see convergence manage to increase the probability of sampling a good cut when compared to random sampling, compare the histograms in Fig. 11 with the dashed grey line. Interestingly, the hardware has a higher probability of sampling the maximum cut which may be due to noise such as $T_1$-induced errors. Crucially, each run of the variational algorithm required only one hour on the cloud-based quantum computer. In addition, we evaluate the criterion in Eq. (3). A single layer of $H_C$ has eleven CNOT layers with two CNOT gates each and two CNOT layers with a single CNOT gate each. We therefore approximate this as 12 layers with two gates. The average gate fidelity on ibm_nairobi is 98.88% which results in a maximum depth of 52 and 103 layers for $\epsilon$ values of $10^{-1}$ and $10^{-2}$, respectively. Equivalently, for the depth-four QAOA, which has 48 layers, there is a Gibbs state which can be sampled from classically in polynomial time which approximates the ...
We also investigate QAOA on the 27 qubit device ibmq_mumbai.
Here, we only use a hardware native graph with random edge weights chosen from {−1, 1} since such graphs already require 56 CNOT gates per QAOA layer and ibmq mumbai has an average CNOT gate error of 0.66 ± 0.14%. A hardware native graph requires six CNOT layers with on average 28/3 CNOT gates per layer. Following Eq. (3) the maximum number of layers with = 10 −1 is thus 13 when cross-talk effects from applying gates in parallel are neglected. Here, we set the learning rate and perturbation of SPSA to 0.005 and 0.01, respectively. Depth-one QAOA (which can be classically simulated efficiently) distinguishes noise from signal   Fig. 12(b). The last three columns indicate the probability of a single-shot producing a cut of the given size with uniform sampling and the QAOA distributions of Fig. 12(b).
and we observe an improved average cut value compared to random sampling. Depth-two QAOA shows signs of convergence but is impacted by noise which causes large energy jumps. However, it still produces a lower energy state than depth-one QAOA. Details are given in Appendix F. Evaluating the cut value of each of the 2 27 possible solutions is still numerically feasible. There are 212, 12, and 2 solutions with a cut value of ten, eleven and twelve (the max-cut), respectively. Therefore, the probability of measuring a cut with a value of ten or more by random sampling is 1.6 · 10 −6 . Out of the 2 14 cuts sampled from the optimal depth-two state 7 cuts had a maximum value which corresponds to a probability of 0.043%, see Tab. 4. Here, a cut with a value of ten corresponds to an approximation ratio of 83% which is close to, but lower than, the Goemans-Williamson approximation ratio of ∼ 88% [88]. According to the criterion in Eq. (3) with = 10 −1 depth-two is just within reach.
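The sampling probabilities and the approximation ratio quoted above follow from simple counting. The short Python sketch below recomputes them from the numbers given in the text; the variable names are illustrative, and the random-sampling probability agrees with the quoted value only up to rounding in the last digit.

```python
# Numbers quoted in the text for the 27-node graph on ibmq_mumbai.
n_qubits = 27
solutions = {10: 212, 11: 12, 12: 2}  # cut value -> number of solutions
max_cut = 12

# Probability that a uniformly random bitstring gives a cut of ten or more.
p_random = sum(solutions.values()) / 2**n_qubits

# Fraction of the 2**14 hardware shots that produced the maximum cut.
p_qaoa_max = 7 / 2**14

# Approximation ratio of a cut with value ten.
ratio = 10 / max_cut

print(f"random: {p_random:.2e}")
print(f"QAOA max-cut shots: {100 * p_qaoa_max:.3f}%")
print(f"approximation ratio: {100 * ratio:.0f}%")
```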

Discussion and Conclusion
We have investigated swap strategies to implement dense circuits built from commuting two-qubit gates on linear, grid, and heavy-hex coupling maps. For QAOA, higher connectivity is not always synonymous with lower gate count due to simplifications between $R_{ZZ}$ and SWAP gates. In particular, a line swap strategy is better on a two-dimensional grid than the grid strategy we put forward. However, our swap strategies on grid coupling maps with dimension three or higher reduce the circuit depth compared to a linear strategy. Furthermore, the heavy-hex coupling map is almost identical to a linear coupling map. We note that these strategies may not be optimal and that better strategies may exist, especially for low-density problems. Crucially, the ability to move logical variables through the physical coupling map means that digital quantum computers do not incur an embedding overhead in the number of qubits as do quantum annealers [89].
The fidelity estimates in Sec. 3.1 show that dense optimization problems will almost certainly require gates with an error rate far below fault-tolerance thresholds. Furthermore, evaluating the depth criterion in Eq. (3) with gate fidelities measured in isolation may yield a depth bound that is too optimistic. This also shows a need for application-tailored hardware benchmarks [90,91].
In Sec. 3.2 we provided a methodology to estimate the execution time of QAOA on noisy hardware, which we found to be significant. We caution the reader that these numbers contain a large amount of uncertainty and can be impacted by noise [92]. This methodology can guide the development of the control hardware. For example, for the large number of shots ($> 10^3$) that QAOA needs, the initialization time of the classical control hardware [55] is negligible if kept below a second per iteration, since the duration of a single shot is likely larger than 1 ms for problems of practical interest, see Fig. 7. Nevertheless, the duration of the two-qubit gate and the number of shots needed to estimate the energy significantly impact the execution time. Pulse-efficient gate implementations may help reduce the execution time of QAOA circuits by leveraging the $R_{ZX}$ gate instead of the CNOT gate [84,93,94]. The execution time estimation also depends on the variant of QAOA. For example, counteradiabatic driving reduces $p$ by adding counteradiabatic gates and an extra parameter at each layer [95,96], while Recursive-QAOA [62] will also produce different execution time estimates. However, the extra gates and different number of parameters to optimize may impact the execution time as well. By contrast, fault-tolerant architectures require a different execution time estimation methodology, as discussed in Ref. [97], which also found that faster error correcting codes are needed to make heuristics for combinatorial optimization competitive. This also suggests that algorithmic improvements to QAOA [95,98], such as warm-start methods [99][100][101], will be required to get a quantum advantage in combinatorial optimization with heuristic algorithms.
We demonstrated a Qiskit Runtime program for QAOA on a cloud-based quantum computer. Using QAOA-tailored transpiler methods we significantly reduced the gate count and duration of the underlying schedules. This resulted in cut distributions biased towards high-value cuts when compared to random sampling. Here, we caution that Goemans-Williamson randomized rounding and related procedures are a more meaningful benchmark for large problems [88].
Ultimately, our results are limited by the fidelity of the cross-resonance gate and decoherence. CVaR aggregation made it possible to observe convergence at depth-two with 27 qubits. Deeper QAOA circuits may yet be possible, often at the expense of more shots, using advanced error mitigation methods [102] such as Pauli-Twirling [103], Probabilistic Error Cancelation [104], M3 readout error mitigation [105], and Zero-Noise Extrapolation [2,75].

Acknowledgements
The authors acknowledge Sergey Bravyi, Giacomo Nannicini, and Libor Caha for useful discussions. This work was also supported by the Hartree National Centre for Digital Innovation program, funded by the Department for Business, Energy and Industrial Strategy, United Kingdom. We acknowledge the use of IBM Quantum services for this work. Code availability: The swap strategies have been implemented in Qiskit as transpiler passes for blocks of commuting two-qubit gates 4 . The views expressed are those of the authors, and do not reflect the official policy or position of IBM or the IBM Quantum team.
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. The current list of IBM trademarks is available at https://www.ibm.com/legal/copytrade.

A SWAP Strategy Details
Here, we discuss details of the swap strategies summarized in Sec. 2.2 of the main text.
Here, $G_0 = (V, E_0)$ is the graph of a coupling map and $N(e)$ the set of edges adjacent to edge $e$. Then, any swap layer $S_i$ executable on the coupling map is defined by a subset of edges $S_i \subseteq E_0$ that satisfy $e \notin N(e')\ \forall\, e, e' \in S_i$. A swap strategy on $G_0$ is then a series of swap layers $\{S_i\}$. We now discuss swap strategies for the line, grid and heavy-hex coupling maps.
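The condition that a swap layer contains no two adjacent edges can be checked directly. The following Python sketch is an illustration only: the function name `is_valid_swap_layer` and the edge-list representation of the coupling map are assumptions made here, not part of the Qiskit implementation.

```python
def is_valid_swap_layer(layer, coupling_edges):
    """Return True if `layer` is a set of pairwise non-adjacent coupling-map edges.

    Edges are given as pairs of node indices and are treated as undirected.
    """
    allowed = {frozenset(edge) for edge in coupling_edges}
    used_nodes = set()
    for edge in layer:
        if frozenset(edge) not in allowed:
            return False  # the edge is not in the coupling map
        if used_nodes & set(edge):
            return False  # the edge shares a node with another edge in the layer
        used_nodes |= set(edge)
    return True


# Line coupling map with five nodes.
line_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(is_valid_swap_layer([(0, 1), (2, 3)], line_edges))  # even-numbered edges
print(is_valid_swap_layer([(0, 1), (1, 2)], line_edges))  # adjacent edges
```

The two layers used by the line strategy below, i.e. all even-numbered edges and all odd-numbered edges, both pass this check.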

A.1 Line
We consider the line graph of size n with vertices numbered from 0 to n − 1 and the swap strategy shown in Fig. 2 of the main text.

Lemma 1.
For the line graph of size n the swap strategy that alternates between two swap layers, one on all odd numbered edges, and one on all even numbered edges reaches full connectivity in n − 2 layers and is optimal.
Proof. Let $q_i$ denote the $i$-th qubit in the line. Under the strategy of Lemma 1, each qubit moves continuously in one direction until reaching a line end where it reverses direction. Further, starting with the swap layer applied to the even numbered edges, the odd and even numbered qubits begin by moving left and right, respectively. The position of qubit $q_i$ starting at node $i$ after applying $k \le n$ swap layers is . . , n}. Setting $k = n$ in the equation above, it holds that $p_n(q_i) = n - 1 - i$ for all $i \in \{0, \ldots, n - 1\}$. Then, after $n$ steps, the line is fully reversed, implying that any two qubits were positioned next to each other at some point during the process. This shows that the strategy reaches full connectivity after $n$ steps.
It turns out that we already reach full connectivity after $n - 2$ steps. To prove this, note that after $n - 2$ steps the position of $q_i$ is given by Eq. (6). First, consider qubit $q_0$, i.e. case $i = 0$ in Eq. (6), which starts in the leftmost position. Its final position is the second rightmost node in the line. Therefore, during the process $q_0$ passes all nodes but the rightmost one. It follows that $q_0$ must have been positioned next to every other qubit at some point of the swap process. The qubit initially at $i \ge n - 2$ with odd $i$, i.e. the fourth case in Eq. (6), arrives in position 0 or 1 after $n - 2$ SWAP layers, i.e. $p_{n-2}(q_i) \in \{0, 1\}$. It was therefore positioned next to every other qubit at some point of the SWAP strategy. Now consider the second and third cases in Eq. (6). When $i_1, i_2 \in \{1, \ldots, n - 1\}$, such that qubit $q_{i_1}$ is to the left of qubit $q_{i_2}$, i.e. $i_1 < i_2$, and neither $i_1$ nor $i_2$ is larger than or equal to $n - 2$ and odd, then, after $n - 2$ steps of the swap strategy, the positions satisfy Eq. (7). In the first two cases $p_{n-2}(q_{i_1}) - p_{n-2}(q_{i_2}) > 0$, i.e. $q_{i_1}$ is now to the right of $q_{i_2}$ in the line, since by assumption $i_1 < i_2$. Hence, the corresponding qubits have switched their order within the line and must have been adjacent to each other at some point during the strategy. If $i_2$ is even and $i_1$ is odd, it suffices to consider the case where $i_1 < i_2 - 1$, since otherwise the corresponding qubits are initially adjacent. Then $i_2 - i_1 \ge 3$ and Eq. (7) implies $p_{n-2}(q_{i_1}) - p_{n-2}(q_{i_2}) \ge -1$. Hence, the corresponding qubits either end up adjacent to one another or have switched their order after $n - 2$ steps. This proves that we reach full connectivity after $n - 2$ swap layers. To prove optimality, consider qubit $q_0$ starting in the leftmost position of the line. At any point of the process, all qubits left of $q_0$ have been adjacent to $q_0$ at some previous point. In particular, this holds for the adjacent qubit to the left of $q_0$.
By this argument, and as every node in the line graph has at most degree two, $q_0$ can only become connected to at most one additional qubit after every additional swap layer, namely by the edge to its right. Since $q_0$ is initially only connected to $q_1$ and needs to be connected to $n - 2$ additional qubits to reach full connectivity, it follows that no swap strategy with fewer than $n - 2$ layers can lead to full connectivity in the line graph.
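The statement of Lemma 1 can be checked numerically for small lines. The Python sketch below simulates the alternating even/odd swap layers, starting with the even-numbered edges as in the proof, and counts the layers needed until every pair of qubits has been adjacent at least once. The function name and the convention that edge $j$ connects nodes $j$ and $j + 1$ are assumptions made for this illustration.

```python
def line_swap_layers_needed(n: int) -> int:
    """Number of alternating swap layers until all qubit pairs were adjacent."""
    order = list(range(n))  # order[p] is the qubit sitting on node p
    seen = {frozenset((order[p], order[p + 1])) for p in range(n - 1)}
    target = n * (n - 1) // 2  # number of distinct qubit pairs
    layers = 0
    while len(seen) < target:
        first_edge = layers % 2  # even-numbered edges first, then odd, ...
        for p in range(first_edge, n - 1, 2):
            order[p], order[p + 1] = order[p + 1], order[p]
        layers += 1
        seen.update(frozenset((order[p], order[p + 1])) for p in range(n - 1))
    return layers


for n in range(3, 8):
    print(n, line_swap_layers_needed(n))
```

For the sizes checked here the simulation needs exactly $n - 2$ layers, in agreement with the lemma.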

A.2 Grid
The swap strategy for the two-dimensional grid is an extension of the line swap strategy.

Lemma 2.
For the two-dimensional grid of size $n$, there exists a swap strategy that reaches full connectivity in $n/2 + O(\sqrt{n})$ layers. For the three-dimensional grid of size $n$ a strategy with depth $n/4 + O(n^{2/3})$ exists.
Proof. First, consider two equally long adjacent horizontal lines of qubits, where each qubit in the lower line is connected to the corresponding qubit in the upper line. We apply the line strategy in Lemma 1 to both lines, where we begin by applying SWAP gates to even numbered edges in one line and odd numbered edges in the other, see Fig. 13. Since both lines are reversed after $n$ steps, any two nodes in the upper and lower line were adjacent at some point. Crucially, this is only possible because SWAP gates for odd edges on one line are executed simultaneously with SWAP gates on even edges of the other line and vice versa. Thus, full connectivity is reached after at most $n$ steps. Additionally, since both lines fully reverse, no new connections are obtained in the last step and full connectivity is already reached after $n - 1$ steps.
Second, consider the square grid with n = x 2 nodes divided into x rows and columns. The grid swap strategy, shown in Fig. 3 of the main text, repeats two steps.
1. Apply x − 1 steps of the line swap strategy separately to each row. Importantly, on two adjacent rows the SWAP gates must never simultaneously be on edges with the same parity.
2. Swap the rows by applying exactly two steps of the line swap strategy to each column in parallel.
The double line example shows that after executing the first step a qubit in a row is connected to all the other qubits in its row and the neighboring rows. The second step of the strategy is executed with two swap layers. It swaps rows such that every row is now positioned next to two different rows. Thus, step 1 connects qubits of adjacent rows and step 2 shuffles the order of rows such that all rows are adjacent at some point. Since we perform two vertical swap layers in each iteration we reach full connectivity after repeating both steps $(x - 2)/2$ times and step 1 one additional time at the end. We thus need
$$\frac{x - 2}{2}\,(x + 1) + (x - 1) = \frac{n}{2} + \frac{\sqrt{n}}{2} - 2$$
swap layers, which proves the two-dimensional case.
The proof for the three-dimensional case is similar.
Figure 13: Swap strategy in a graph with two connected lines each with five vertices. The blue SWAP gates show the first swap layer. The qubit order in each line fully reverses after five steps and full connectivity is reached after four steps.
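The layer count of the two-dimensional grid strategy follows from the step counts given in the proof: $x - 1$ layers for step 1, two layers for step 2, $(x - 2)/2$ repetitions of both, and one final application of step 1. The Python sketch below adds these up and compares the result with $n/2 + \sqrt{n}/2 - 2$; the function name is hypothetical and an even side length $x$ is assumed so that $(x - 2)/2$ is an integer.

```python
def grid_swap_layers(x: int) -> int:
    """Swap layers used by the two-step grid strategy on an x-by-x grid."""
    if x < 2 or x % 2:
        raise ValueError("this sketch assumes an even side length x >= 2")
    row_layers = x - 1           # step 1: line strategy on every row
    column_layers = 2            # step 2: two line-strategy layers on the columns
    repetitions = (x - 2) // 2   # both steps are repeated (x - 2) / 2 times
    return repetitions * (row_layers + column_layers) + row_layers


for x in (4, 6, 8, 10):
    n = x * x
    print(n, grid_swap_layers(x), n // 2 + x // 2 - 2)
```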

A.3 Heavy-hex
A heavy-hex coupling map has a mixture of degree-two and degree-three nodes. Its qubits are placed on the edges and vertices of hexagons [25]. Each hexagon therefore has 12 qubits. Here, we focus on $i \times j$ heavy-hex graphs which have $i$ rows and $j$ columns of hexagons, as exemplified by the $3 \times 3$ heavy-hex graph in Fig. 14. The total number of qubits is $n = 5ij + 4(i + j) - 1$ and the length of the longest line in the graph is $l_\text{max} = 4(ij + i + j) + 1$. The length of this line is related to the total number of qubits by $l_\text{max} = 4[n(i, j) + i + j + 1]/5 + 1$ which, to leading order, scales as $4n/5$. Furthermore, $l_\text{max}$ is bounded from above when the grid of hexagons is approximately square, i.e. $i \sim j$. These preliminaries allow us to formulate the following Lemma for the heavy-hex swap strategy.
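The two counting formulas and the relation between them can be evaluated for concrete heavy-hex sizes. The Python sketch below does this with hypothetical helper names; it only restates the expressions for $n$ and $l_\text{max}$ given in the text.

```python
def heavy_hex_qubits(i: int, j: int) -> int:
    """Total number of qubits n of an i-by-j heavy-hex graph."""
    return 5 * i * j + 4 * (i + j) - 1


def heavy_hex_longest_line(i: int, j: int) -> int:
    """Length l_max of the longest line in an i-by-j heavy-hex graph."""
    return 4 * (i * j + i + j) + 1


i, j = 3, 3
n = heavy_hex_qubits(i, j)
l_max = heavy_hex_longest_line(i, j)
# Relation quoted in the text: l_max = 4 [n + i + j + 1] / 5 + 1.
assert 5 * (l_max - 1) == 4 * (n + i + j + 1)
print(n, l_max, l_max / n)
```

For the $3 \times 3$ example this gives $n = 68$ and $l_\text{max} = 61$, and the ratio $l_\text{max}/n$ approaches $4/5$ as the graph grows.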

Lemma 3.
For the approximately square heavy-hex grid of size $n$, there exists a swap strategy that reaches full connectivity in less than $n + \sqrt{n} + 61$ swap layers.
Proof. We prove Lemma 3 by unfolding the heavy-hex graph along a line of length $l$ where $l \bmod 4 = 0$, see Fig. 14. To unfold, we delete one edge connected to the nodes not in the longest line. The result is a line graph with an optional additional node connected to every fourth node, see Fig. 15. The graph additionally has a tail of 5 vertices on one side and a tail of $t$ vertices on the other side, such that $t = 2 \bmod 4$, see Fig. 14. Here, $t$ depends on the width of the heavy-hex graph. We will prove that Lemma 3 holds for any graph of this kind. The proof applies a line swap strategy on the unfolded heavy-hex graph modified such that at any time some qubits remain in the positions of the additional nodes without moving along the line. We divide the process into five iterations. In each iteration a qubit either moves along 1/4 of the line or waits in one of the adjacent nodes. If every qubit only enters a waiting state once during the complete process, all qubits will have completed $l$ steps of the simple line strategy after five iterations, leading to full connectivity. The difficulty then lies in ensuring that every qubit is swapped into a waiting position at most once. We divide the additional nodes into two sets $A$ and $B$ spaced apart by eight nodes in the line, see Fig. 15. We number the edges in the line by their position and define four swap layers to reach full connectivity.
• S 1 : All odd-numbered edges in the line.
• S 2 : All even-numbered edges in the line.
• S 3 : All edges connected to vertices in group A.
• S 4 : All edges connected to vertices in group B.
We claim that the swap strategy in which we
1. alternate between $S_1$ and $S_2$ $k - 7$ times starting with $S_1$,
2. apply $S_4$ once,
3. alternate between $S_1$ and $S_2$ 7 times starting with $S_2$,
4. apply $S_3$ once, and
5. repeat Steps 1-4 five times,
reaches full connectivity on the unfolded heavy-hex if $k$ satisfies Eq. (9). Here, $k$ is chosen such that $k \bmod 8 = 2$ and $k > l/4$ so that every qubit travels the full line with four iterations of steps 1. and 3. Steps 2. and 4. swap qubits in and out of nodes $A$ and $B$, respectively. Steps 1. to 4. require $k + 2$ swap layers so that after five iterations the total number of swap layers is $5k + 10$, which is thus bounded from above by $n + \sqrt{n} + 61$ as seen by injecting Eq. (8) in Eq. (9) and conservatively assuming $l/4 \bmod 8 = 0$.
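The sequence of swap layers described in the steps above can be written out explicitly. The following Python sketch is an illustration under stated assumptions: the layer labels are plain strings, the helper names are hypothetical, and $k$ is taken as the smallest integer with $k \bmod 8 = 2$ and $k > l/4$.

```python
def smallest_valid_k(l: int) -> int:
    """Smallest k with k > l / 4 and k mod 8 == 2."""
    k = l // 4 + 1
    while k % 8 != 2:
        k += 1
    return k


def heavy_hex_schedule(k: int, iterations: int = 5) -> list:
    """List of swap-layer labels for the heavy-hex strategy."""
    if k % 8 != 2 or k < 10:
        raise ValueError("this sketch assumes k mod 8 == 2 and k >= 10")
    line_layers = ("S1", "S2")
    schedule = []
    for _ in range(iterations):
        # Step 1: alternate S1 and S2 for k - 7 layers, starting with S1.
        schedule += [line_layers[m % 2] for m in range(k - 7)]
        # Step 2: swap with the nodes of group B.
        schedule.append("S4")
        # Step 3: alternate S1 and S2 for 7 layers, starting with S2.
        schedule += [line_layers[(m + 1) % 2] for m in range(7)]
        # Step 4: swap with the nodes of group A.
        schedule.append("S3")
    return schedule


k = smallest_valid_k(61)
schedule = heavy_hex_schedule(k)
print(k, len(schedule))
```

Because $k - 7$ is odd, the line layers $S_1$ and $S_2$ keep alternating across the interleaved $S_3$ and $S_4$ layers, and the schedule has $5k + 10$ layers in total.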
As argued above, it suffices to show that any particular qubit will remain in a position corresponding to groups A or B for at most one iteration. We assign each node along the line to one of eight evenly-spaced groups V i with i ∈ {0, . . . , 7}. The set of all vertices is thus partitioned into ten sets, A, B and V i , see Fig. 15. We now examine how the qubits switch groups during steps 1. to 4. Note that during each iteration k steps of the line strategy are executed and k was chosen, such that k mod 8 = 2.
We first ignore groups $A$ and $B$. Since $k$ is even and we begin by applying SWAP gates to odd numbered edges in the line, i.e. $S_1$, qubits moving towards the left will switch from group $V_i$ to $V_{i-2}$ and qubits moving towards the right will move from $V_i$ to $V_{i+2}$. If a qubit reaches the end of the line during an iteration it will either
• switch direction from left to right and switch groups from $V_i$ to $V_{7-i}$, or
• switch direction from right to left and switch groups from $V_i$ to $V_{7-i+4}$.
This behavior results from $k \bmod 8 = 2$. The movement pattern thus undergone by individual qubits is depicted in Fig. 16.
Figure 15: The unrolled heavy-hex graph with numbered nodes. The nodes are divided into ten subsets $A$ (in red), $B$ (in green) and $V_i$ for $i \in \{0, \ldots, 7\}$ (from white to blue).

Figure 16: Qubit movement pattern during the heavy-hex swap strategy, with one panel for qubits moving right and one for qubits moving left. Each node shows one of the ten categories of nodes defined in Fig. 15. The arrows show how the qubits change group after an iteration of steps 1. through 4. Solid and dashed lines show qubits that were and were not reflected at line ends during the iteration, respectively. Note that a dashed arrow from $V_6$ to $V_5$ is not shown.
We now consider groups A and B. During one iteration qubits positioned in groups A and B will switch to groups V 0 and V 5  In either case, a qubit can only arrive in groups A or B if it was previously positioned in group V 6 or V 7 . All possible movements of qubits among groups from one iteration to the next are thus captured by Fig. 16 and it is clear that within five iterations each qubit will only visit A or B at most once. This concludes the proof.

B Circuit depth and CNOT gate count
We now investigate how the number of CNOT layers and gates in QAOA circuits, transpiled using the pass described in Sec. 2.1, scale with problem size.
Here, $G_0 = (V, E_0)$ is either a heavy-hex or an $\eta$-dimensional grid coupling map and $\{S_i\}$ with $i \in \mathbb{N}$ is a swap strategy compatible with $G_0$. The transpiled circuit has alternating layers of $R_{ZZ}$ and SWAP gates applied on edge sets $E_i$ and $S_i$, respectively, see Fig. 17. We determine the number of CNOT gates and layers by counting gates and layers in $E_i$ and $S_i$, taking into account gate cancellations across layers.
In every layer $E_i$ we apply $R_{ZZ}$ gates on the edges in $E_i$. The number of edges in $E_i$ thus determines the number of $R_{ZZ}$ gates. Here, $E_i$ is the set of edges in the hardware coupling map (i.e. $E_i \subseteq E_0$) which give new qubit connections after applying the swap layer $S_{i-1}$. Edges which may give new qubit connections after $S_{i-1}$ are in the neighbourhood of swapped edges [Eq. (10)], as exemplified in Fig. 18. Since the SWAP gates in a swap layer $S_i$ are executed in parallel, $S_i$ never contains two neighboring edges, i.e. $e \notin N(e')\ \forall\, e, e' \in S_i$.
Carefully positioned CNOT gates across $E_i$ and $S_i$ may cancel. We therefore position all $R_{ZZ}$ gates of $E_i$ that are applied across edges contained in $S_i$ at the end of $E_i$. This allows us to simplify two CNOT gates and implement $\mathrm{SWAP} \cdot R_{ZZ}(\theta)$ with three CNOTs and an $R_Z(\theta)$ gate, see Fig. 19. The number of CNOT layers $L_\text{cx}(E_i, S_i)$ required to implement $E_i$ and $S_i$ is thus bounded by the edge chromatic number $\chi'(G_i)$ as $L_\text{cx}(E_i, S_i) \le 2\chi'(G_i) + 3$. Here, the $R_{ZZ}$ gates in $E_i \setminus S_i$ are divided into $\chi'(G_i)$ sets of gates to execute in parallel, each with two CNOT gates. The extra three CNOT layers come from SWAP gates that may have absorbed $R_{ZZ}$ gates. Similarly, the number of CNOT gates in $E_i$ and $S_i$ follows by noting that a swap layer will always require three CNOT layers even after absorbing $R_{ZZ}$ gates from $E_i$.
In the line graph, every SWAP layer contains half of the edges (on average in case the number of edges is odd). More generally, the number of swap gates in the $\eta$-dimensional grid strategy is evaluated on average. Combining Eq. (13) and (14) yields the bound on the total number of CNOT gates required to implement $\exp(-i\gamma H_C)$. These bounds are summarized in Tab. 1 of the main text as $\frac{4\eta - 1}{2} n L_S$. Furthermore, since the swap layers in the grid strategy for the $\eta$-dimensional grid graph exactly correspond to an edge coloring with $2\eta$ colors, the second requirement from Lemma 4, i.e. the existence of the edge coloring, also holds. Together with the $2(2\eta - 1)$ CNOT layers required to implement the final layer $E_{L_S}$, this bounds the total number of CNOT layers $L_\text{cx}$ of a cost layer with $L_S$ swap layers. Plugging in $\eta = 1, 2, 3$ gives the numbers in Tab. 1 of the main text.
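The claim that $\mathrm{SWAP} \cdot R_{ZZ}(\theta)$ needs only three CNOT gates and one $R_Z(\theta)$ gate can be checked with explicit $4 \times 4$ matrices. The Python sketch below uses one possible gate ordering (CNOT, $R_Z$ on the target, reversed CNOT, CNOT); whether this matches the exact circuit of Fig. 19 is an assumption, and the basis ordering $|q_0 q_1\rangle$ and the helper names are chosen only for this illustration.

```python
import cmath


def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[r][m] * b[m][c] for m in range(4)) for c in range(4)]
            for r in range(4)]


def diag(values):
    """Diagonal 4x4 matrix."""
    return [[values[r] if r == c else 0 for c in range(4)] for r in range(4)]


# Basis states |q0 q1> are indexed by 2 * q0 + q1.
CX01 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control q0
CX10 = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control q1
SWAP = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]


def swap_rzz_direct(theta):
    """SWAP multiplied by R_ZZ(theta) = exp(-i theta ZZ / 2)."""
    a, b = cmath.exp(-0.5j * theta), cmath.exp(0.5j * theta)
    return matmul(SWAP, diag([a, b, b, a]))


def swap_rzz_three_cnots(theta):
    """Same unitary built from three CNOTs and one R_Z(theta) on q1."""
    a, b = cmath.exp(-0.5j * theta), cmath.exp(0.5j * theta)
    rz_on_q1 = diag([a, b, a, b])
    # Circuit order: CX01, R_Z on q1, CX10, CX01 (matrices act right to left).
    return matmul(CX01, matmul(CX10, matmul(rz_on_q1, CX01)))


def max_abs_difference(theta):
    direct, compiled = swap_rzz_direct(theta), swap_rzz_three_cnots(theta)
    return max(abs(direct[r][c] - compiled[r][c])
               for r in range(4) for c in range(4))


print(max_abs_difference(0.37))
```

A naive implementation would use two CNOT gates for $R_{ZZ}(\theta)$ and three for the SWAP; the two adjacent identical CNOT gates cancel, which leaves three.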

B.0.2 Heavy-hex coupling maps
We now consider heavy-hex graphs with the same number of rows and columns and with n qubits. This graph has |E 0 | = 6n/5 + O ( √ n) edges and a longest line with length l max = 4n/5+O ( √ n), as exemplified in Fig. 14

C Commutation aware SabreSwap
Quantum circuits can be represented as directed acyclic graphs (DAG) in which nodes are instructions and edges are qubits. The SabreSwap algorithm traverses the DAG. It first builds up a front layer with the gates in the DAG that are first executed on the qubits. Next, the algorithm inserts SWAPs attempting to minimize the distance between qubits that interact in the front layer. When two qubits become adjacent the gate from the front layer is inserted and the front layer is updated with the next gates from the DAG.
In the commutation aware version of SabreSwap, the front layer is adapted to contain all upcoming commuting gates obtained from the dependency DAG. The DAG dependency is built by taking commutation relations into account [106]. Here, two gate nodes are only connected by an edge if the corresponding gates do not commute.
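The effect of the commutation-aware front layer can be illustrated with a toy model. The Python sketch below is not the Qiskit implementation: gates are plain `(name, qubits)` tuples, and the commutation rule (gates commute if they act on disjoint qubits or are both diagonal $R_Z$ or $R_{ZZ}$ gates) is a simplifying assumption made here. A gate belongs to the front layer if it commutes with every gate that precedes it.

```python
DIAGONAL_GATES = {"rz", "rzz"}  # gates diagonal in the computational basis


def commute(gate_a, gate_b):
    """Simplified commutation check between two (name, qubits) gates."""
    (name_a, qubits_a), (name_b, qubits_b) = gate_a, gate_b
    if not set(qubits_a) & set(qubits_b):
        return True  # gates on disjoint qubits always commute
    return name_a in DIAGONAL_GATES and name_b in DIAGONAL_GATES


def front_layer(gates):
    """Indices of gates with no non-commuting predecessor."""
    return [
        idx for idx, gate in enumerate(gates)
        if all(commute(gate, earlier) for earlier in gates[:idx])
    ]


cost_layer = [("rzz", (0, 1)), ("rzz", (1, 2)), ("rzz", (0, 2))]
print(front_layer(cost_layer))                    # all gates commute
print(front_layer(cost_layer + [("rx", (1,))]))   # the mixer gate must wait
```

For a QAOA cost layer all $R_{ZZ}$ gates commute, so the whole layer sits in the front layer and the router is free to pick whichever gate is cheapest to execute next.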

D Maximum cut
In the weighted maximum cut problem we are tasked to partition the set of nodes $V$ of a given graph $G = (V, E)$ in two such that the sum of the edge weights $\omega_{i,j}$ with $(i, j) \in E$ traversed by the cut is maximum [88]. This can be formulated as a QUBO problem in which the binary variable $z_i$ indicates on which side of the cut node $i$ is. In Sec. 4 we consider weighted maximum cut instances in which the weights can be negative as well as positive [107].
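As a small illustration of this definition, the Python sketch below evaluates the value of a cut and finds the maximum cut of a toy graph by brute force. The three-node graph with weights from $\{-1, 1\}$ is a made-up example and the function names are hypothetical; brute force is only practical for small instances.

```python
from itertools import product


def cut_value(weights, assignment):
    """Sum of the weights of the edges whose end nodes lie on different sides."""
    return sum(w for (i, j), w in weights.items()
               if assignment[i] != assignment[j])


def brute_force_max_cut(weights, num_nodes):
    """Return the best cut value and one assignment that reaches it."""
    best = max(product((0, 1), repeat=num_nodes),
               key=lambda z: cut_value(weights, z))
    return cut_value(weights, best), best


# Toy triangle with one negative weight.
weights = {(0, 1): 1, (1, 2): 1, (0, 2): -1}
print(brute_force_max_cut(weights, 3))
```

With a negative weight on edge $(0, 2)$ the best cut separates node 1 from nodes 0 and 2, which avoids cutting the negative edge.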

E Swap strategy benchmark
This section provides additional details on the swap strategy benchmark presented in the main text. For SabreSwap, commutative aware SabreSwap, the QAOA Compiler, 2QAN, and t|ket⟩ we generate the QAOA circuit using $R_{ZZ}$ gates based on the problem graph. For the swap strategies we create the QAOA circuit as a single instruction built from the PauliSumOp in Qiskit which the transpiler must identify as being made of commuting two-qubit Pauli terms. In all cases the transpiler must map the instructions to hardware native CNOT gates. The CNOT circuit depth shown in the main text is computed using the Qiskit function

    circuit.depth(lambda x: isinstance(x.operation, CXGate))

We now list the key passes with which we benchmark the different transpilers. We obtain the execution time of each transpiler with time as exemplified for t|ket⟩. The t|ket⟩ transpiler steps are listed below.

The QAOA Compiler, described in Ref. [38], performs an initial mapping of QAOA and routes the resulting circuit to the hardware using an external compiler. The external compiler is called on partial circuits and the results are then stitched together. We run the QAOA Compiler with Qiskit on optimization level 3 as external compiler and use the Variation-aware Incremental Compilation (VIC) setting described in Ref. [38]. We therefore create an instance of CompileQAOAQiskit and call its run_incr_c method with variation_aware=True and initial_layout_method="qaim". The configuration used to initialize the CompileQAOAQiskit instance is

    config = {
        "Backend": "qiskit",
        "Target_p": "1",
        "Packing_Limit": "10e10",
        "Route_Method": "sabre",
        "Trans_Seed": "0",
        "Opt_Level": "3",
    }

To benchmark the swap strategies in Qiskit we use a pass manager in which the swap strategy contains the information on the coupling map and a double Decompose replaces the routed two-qubit Pauli evolution operators by CNOT gates. The edge color, exemplified in Fig. 20, is used to prioritize which $R_{ZZ}$ gates should be applied first at each step of the swap strategy. This increases the number of $R_{ZZ}$ gates that are followed by SWAP gates such that the CXCancellation pass can cancel consecutive CNOT gates as shown in Fig. 19.

In addition to the circuit depth and CNOT gate count presented in Fig. 5 of the main text we also compute the time it takes to transpile the corresponding circuits. We find that the commutative aware version of SabreSwap and 2QAN have the runtime with the worst scaling while the swap strategies and SabreSwap are up to two orders of magnitude faster, see Fig. 21.

Figure 20: Heavy-hex graph with two rows and four columns. The color of the edges shows the edge coloring used to prioritize the $R_{ZZ}$ gates which can be simultaneously applied. The order of the applied gates is red, green, blue, and black. Furthermore, red and green edges form the longest line.

Figure 21: Logarithmic transpiler runtime (in seconds) for QAOA circuits of graph instances with different size and density after transpilation to a heavy-hex coupling map using SabreSwap, commutative aware SabreSwap, t|ket⟩, 2QAN, and the heavy-hex swap strategy. Each transpiler was allowed a maximum runtime of ten minutes. Each data point is an average over ten random graphs. The lines show the average and the error bars show the standard deviation.

F Experiment details
The seven node graph used in the Qiskit Runtime example can be embedded with one swap layer $S_0 = \{(0, 1), (3, 5)\}$ on the seven qubit ibm_nairobi system. The weights of the graph were chosen at random from $\{-1, 1\}$. The seven node graph, given as sums of Pauli-Z operators, has a single maximum cut with a value of three, see Fig. 22. The properties of the hardware and the gates as reported by ibm_nairobi on the date the data were acquired are given in Tab. 5 and 6.

Table 6: Two-qubit gate error and length in % and ns, respectively, reported by ibm_nairobi on 22.12.2021.

The 27 node graph is native to the coupling map of ibmq_mumbai, see Fig. 23. The results presented in the main text use CVaR aggregation with $\alpha = 0.5$, i.e., the expectation value of $H_C$ is computed from the best 50% of the shots. Without CVaR aggregation we observe a noisy convergence at depth-one and no convergence at depth-two, see Fig. 24. We allow SPSA 100 iterations at depth-one and 80 iterations at depth-two to avoid exceeding the maximum allowed time for a Qiskit Runtime program. When SPSA terminates, Qiskit returns the last values of $\beta$ and $\gamma$ and samples the state produced by the corresponding circuit. For the 27 qubit problem instance with depth-two the optimizer does not converge properly. We therefore resample the circuit five times, each with $2^{14}$ shots, at the $\beta$ and $\gamma$ values that yielded the minimum energy. The average energy $-16.57 \pm 0.16$ of the five runs is shown as a star in Fig. 12(a) and the best counts distribution is shown in Fig. 12(b) of the main text.

The seven qubit problem on ibm_nairobi calibrates a readout-error mitigation matrix with $2^7$ circuits. We therefore do not use readout error mitigation on the 27 qubit problem and leave it up to future work to investigate the effect of scalable readout error mitigation such as M3 on QAOA [105]. The properties of the hardware and the gates as reported by ibmq_mumbai on the date the data were acquired are given in Tab. 7 and 8.
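The CVaR aggregation used for these results can be sketched in a few lines. The Python function below is a minimal illustration, not the Qiskit Runtime implementation: it assumes one energy value per shot, treats lower energies as better since the energy of $H_C$ is minimized, and rounds the number of retained shots up, which is a choice made here.

```python
import math


def cvar_energy(shot_energies, alpha=0.5):
    """Average energy of the best (lowest-energy) fraction alpha of the shots."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(shot_energies)  # lowest energies first
    keep = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:keep]) / keep


shots = [3.0, -1.0, 0.0, -2.0]
print(cvar_energy(shots, alpha=0.5))  # mean of the two lowest energies
print(cvar_energy(shots, alpha=1.0))  # plain expectation value
```

With $\alpha = 1$ the function reduces to the ordinary sample mean, while $\alpha = 0.5$ discards the worse half of the shots before averaging.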