Fast quantum circuit cutting with randomized measurements

We propose a method to extend the size of a quantum computation beyond the number of physical qubits available on a single device. This is accomplished by randomly inserting measure-and-prepare channels to express the output state of a large circuit as a separable state across distinct devices. Our method employs randomized measurements, resulting in a sample overhead that is $\tilde{O}(4^k/\varepsilon^2)$, where $\varepsilon$ is the accuracy of the computation and $k$ the number of parallel wires that are "cut" to obtain smaller sub-circuits. We also show an information-theoretic lower bound of $\Omega(2^k/\varepsilon^2)$ for any comparable procedure. We use our techniques to show that circuits in the Quantum Approximate Optimization Algorithm (QAOA) with $p$ entangling layers can be simulated by circuits on a fraction of the original number of qubits with an overhead that is roughly $2^{O(p\kappa)}$, where $\kappa$ is the size of a known balanced vertex separator of the graph which encodes the optimization problem. We obtain numerical evidence of practical speedups using our method applied to the QAOA, compared to prior work. Finally, we investigate the practical feasibility of applying the circuit cutting procedure to large-scale QAOA problems on clustered graphs by using a 30-qubit simulator to evaluate the variational energy of a 129-qubit problem as well as carry out a 62-qubit optimization.


Introduction
In this work, we consider combining measurement outcomes from multiple quantum circuits to estimate the expectation value at the output of a larger quantum circuit. Computing such expectation values is a key step in many proposals for quantum algorithms such as the QAOA [23,25,30], VQE [48], or in the estimation of output probabilities, for example, for quantum classifiers [52,36].
Given a circuit comprising $m$ gates, the standard procedure to accomplish this task is to repeatedly run the desired circuit and measure in some basis, for a total quantum runtime on the order of $m/\varepsilon^2$. Alternatively, one may perform a full classical simulation of the circuit, which is believed to require a runtime exponential in the number of qubits $n$ without strong assumptions about the structure of the circuit, for example, a bounded treewidth [39], low stabilizer-rank [7,11,9], or low negativity [45]. Based on current hardware limitations, both approaches become infeasible for moderately large $n$: the time cost of the classical simulation is prohibitive, while current quantum devices are limited to small numbers of qubits. This motivates considering a hybrid approach, where results obtained on smaller quantum devices are used to solve the estimation task.
For instance, Ref. [9] suggests constructing "virtual qubits" which enable the simulation of an otherwise intractable computation. Similarly, Ref. [46] presented a framework for simulating clustered quantum circuits. A third framework, presented in Refs. [43,42], focuses on removing entangling gates between sub-circuits. For these methods, as well as in follow-up work [49,47,60,3,4,40], one takes advantage of large, disconnected components of the quantum circuit which may be obtained by removing a subset of wires or gates. We refer to such methods collectively as quantum circuit cutting. Related work analyzes hybrid approaches to solving 3SAT [17] and quantum simulation by combining tensor network methods with small quantum devices [5,61].
Our main result is a fast quantum circuit cutting method based on separating weakly entangled circuits through randomized measurements. Compared to the approach in [46] in which single-qubit Pauli measurements are performed, our method offers a quadratic improvement in the overhead in certain cases of interest. As a consequence, the number of wires that can be cut under a fixed budget on the runtime is approximately doubled using this approach. Furthermore, our method likely outperforms all other proposed circuit cutting methods for simulating quantum computation using circuits with appropriate structure (see Table 1).
Our approach combines intuition which stems from quantum tomography as well as the notion of a matrix-product state (MPS). Specifically, the algorithm can be viewed as an attempt to classically simulate the entanglement between qudits in an MPS with low bond dimension using repeated mid-circuit measurements. This measurement procedure is described by measure-and-prepare channels (equivalent to entanglement-breaking channels [31]), enabling the expression of the post-measurement state as a separable state across distinct devices. The measurements we choose for this task are based on unitary 2-designs, which are related to the sample-optimal measurements in various quantum learning settings [28,32,14,18].
A point of departure from previous work is that our procedure uses synchronized measurements and preparations. For circuits with an MPS-like structure (as considered in [33], for example), this requirement may be satisfied by repeatedly executing circuits on a single device. However, if the so-called communication graph [46] of the sub-circuits contains a cycle, then our method genuinely requires multiple circuits, separated in space rather than time.
To understand the source of our speedup, we turn to the form of the equation we employ in our proposal. In any dimension $d > 0$, one may consider a collection of linear maps $\Phi_i : L(\mathbb{C}^d) \to L(\mathbb{C}^d)$ and real coefficients $a_i$ such that
$$\mathrm{id} = \sum_i a_i \Phi_i, \tag{1}$$
where $\mathrm{id} : L(\mathbb{C}^d) \to L(\mathbb{C}^d)$ is the identity map. Throughout, we let $L(\mathbb{C}^d)$ denote the set of linear operators acting on $\mathbb{C}^d$. See Appendix A for a comprehensive set of definitions. As suggested in Ref. [46], we choose an identity of this form such that the substitution of any individual map $\Phi_i$ in a suitable location within the circuit enables the execution of a quantum computation which would otherwise be intractable. We refer to this as wire cutting. A similar identity is employed to decompose 2-qubit operations into product operators in Refs. [9,42,43,49], and we refer to this approach as gate cutting. In either case, each application of the identity results in a multiplicative overhead in the runtime scaling with the 1-norm of the coefficients, e.g., $\sum_i |a_i|$ in the case where the identity takes the form in Eq. (1). This is similar to the effect of the negativity quantity associated with the fully classical simulation method of [45] based on quasiprobability sampling.
As previously observed in [43,60,49], the overhead in circuit cutting depends on the choice of linear maps used in the decomposition. In particular, Ref. [49] shows a lower bound on the overhead for gate cutting in several cases, while Ref. [60] proposed a simulated-annealing-based search for solutions to Eq. (1) to attempt to minimize the above 1-norm quantity. Inspired by recent work demonstrating the utility of randomized measurements [18], we give an explicit decomposition of the form in Eq. (1) whose 1-norm scales with the dimension of the subspace upon which the channel acts. We argue that this bound results in our method outperforming the state-of-the-art for a natural problem (see Table 1), and are able to prove that it is at most (roughly) quadratically worse than the best possible wire cutting method by deriving an information-theoretic lower bound.
We then apply our observations to derive sufficient conditions for the efficient hybrid simulation of circuits which arise in the Quantum Approximate Optimization Algorithm (QAOA) [23,24,25]. This continues a line of research where the goal is to identify restricted families of circuits which can be simulated using smaller devices than might naively be expected. For instance, Ref. [9] proposed a notion of sparse circuits for which a small number of qubits could be efficiently traded for classical computation, while Refs. [46,15] show that weakly interacting, clustered Hamiltonians can be simulated efficiently for short times through wire cutting. We show that $p$-layer QAOA circuits can be efficiently simulated in a hybrid manner up to depth $p = O(\log n)$, provided the underlying graph encoding the optimization problem has a small balanced vertex separator. This addresses an open problem from Ref. [46] which asked for good ways to partition circuits based on the structure of the problem at hand. Our observations should also carry over to Trotter-Suzuki-based circuits for the Hamiltonian simulation of 2-local Hamiltonians whose interaction graph has small vertex separators, as opposed to the edge separators considered in [46]. We then show how to sample from the output distribution of a QAOA circuit to recover bitstrings encoding optimal solutions by combining measurement outcomes obtained from the smaller circuits.
Figure 1: Example of our quantum circuit cutting scheme. One of two measure-and-prepare channels ($z = 0, 1$) is applied at random. If $z = 0$, a randomized measurement based on a unitary 2-design (e.g., a random Clifford) $U$ is performed. If $z = 1$, the qubits are traced out, and a random basis state $|y\rangle$ is prepared. The output is an unbiased estimator of the expectation value at the output of the original circuit.
We remark that prior work [35] proposed utilizing balanced vertex separators to split instances of the Max-Cut problems into smaller sub-problems for the QAOA, without considering circuit cutting methods. In addition, Refs. [50,55] consider applying the circuit cutting method proposed in [46] to solve a modified version of the QAOA. However, these algorithms require estimating the measurement distribution of the subcircuits as a subroutine, which limits the practicality of such proposals.
We demonstrate practical speedups by classically simulating our quantum circuit cutting method applied to small instances of the QAOA and comparing to the procedure in [46]. These numerical experiments are enabled by leveraging open-source circuit cutting functionality available in PennyLane [6]. We then show that a circuit cutting procedure can be performed on large-scale QAOA problems. Using the tensor-contraction-based method introduced in [46] and available in PennyLane, we break the full QAOA circuit into multiple circuit fragments of at most 30 qubits and execute them on a cluster of NVIDIA A100 40GB GPUs, acting as simulator substitutes for practical quantum hardware devices. As a result, the variational energy of a 2-layer QAOA circuit of 129 qubits is evaluated, as well as a full 1- and 2-layer QAOA optimization of a 62-qubit circuit.

Fast circuit cutting with randomized measurements
We first provide details for the computational model considered in this work. A quantum circuit is a directed acyclic network of gates (vertices) connected by wires (edges) which represent the qubits upon which the gates act. In general, gates represent quantum channels, i.e., completely positive and trace-preserving linear maps. The overall action of the circuit is some quantum channel acting on the input qubits, which throughout this paper we take to be $\rho_0^{\otimes n} := (|0\rangle\langle 0|)^{\otimes n}$ for a circuit with $n$ input qubits. The size of the circuit is the total number of gates. As in Peng et al. [46], we adopt a model of quantum computation in which the goal is to estimate the expectation value of some diagonal observable $O_f := \sum_{x \in \{0,1\}^n} f(x)\,|x\rangle\langle x|$, where $f : \{0,1\}^n \to [-1, 1]$. More specifically, given an input circuit on $n$ qubits representing the action of some channel $\mathcal{N}$, the task is to estimate the value $\mathrm{Tr}(O_f\, \mathcal{N}(\rho_0^{\otimes n}))$ to within additive error $\varepsilon$. This is a more general version of the computational model adopted in [9], and (up to single-qubit rotations prior to measurement) encompasses the "quantum mean-value problem" defined in [8]. The model is motivated by the fact that measurement in an alternative, efficiently implementable basis can be absorbed into the unitary operation of the circuit, and an observable whose spectral norm is larger by a polynomial factor only incurs a polynomial overhead in the runtime for the estimation task. In the remainder of this paper we further assume that $f$ can be evaluated in constant time, for succinctness in the statement of our results.
We now define the two different measure-and-prepare channels which are used in our scheme. Let $\{|j\rangle\}_{j=1}^{d}$ denote the standard basis for a $d$-dimensional complex Euclidean space. In any dimension $d > 0$ we may consider a random POVM $\{U|j\rangle\langle j|U^\dagger\}_{j=1}^{d}$ comprising rank-1 measurement operators, where $U \in \mathrm{U}(\mathbb{C}^d)$ is a random unitary operator (matrix-valued random variable) which forms a unitary 2-design (see Appendix A). For a fixed choice of such a random POVM, the first channel we consider is $\Psi_0 : L(\mathbb{C}^d) \to L(\mathbb{C}^d)$, defined by
$$\Psi_0(X) := \mathbb{E}_U \sum_{j=1}^{d} \langle j|U^\dagger X U|j\rangle \; U|j\rangle\langle j|U^\dagger \tag{2}$$
for every $X \in L(\mathbb{C}^d)$. This mapping is a measure-and-prepare channel, i.e., it describes a measurement performed on the $d$-dimensional register (using the random POVM described above) followed by preparing a new register in the state corresponding to the measurement outcome that was observed. It would also suffice for our purposes to consider a fixed rank-1 POVM whose measurement operators are based on state 2-designs, since the resulting measure-and-prepare channel would be identical. However, we focus on randomized basis measurements to make explicit how such a channel could be implemented in practice.
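The second measure-and-prepare channel we consider is the completely depolarizing channel $\Psi_1 : L(\mathbb{C}^d) \to L(\mathbb{C}^d)$, defined by
$$\Psi_1(X) := \mathrm{Tr}(X)\,\frac{\mathbb{1}}{d} \tag{3}$$
for every $X \in L(\mathbb{C}^d)$. The following lemma implies that measure-and-prepare channels of the above form can be used to develop an improved approach to circuit cutting.
Lemma 2.1. Let $\Psi_0$ and $\Psi_1$ be the channels defined in Eqs. (2) and (3). Then
$$\mathrm{id} = (d+1)\,\Psi_0 - d\,\Psi_1. \tag{4}$$
The claim follows straightforwardly by evaluating the action of the channel $\Psi_0$, as done previously in [28,32,18], for example. A proof is provided in Appendix B for completeness.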
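The identity behind this lemma can be checked numerically in the smallest case. The sketch below is an illustration only, not the paper's code: it assumes the decomposition takes the form id = (d+1)Ψ_0 − dΨ_1 (the standard consequence of the 2-design property) with d = 2, and it realizes Ψ_0 through a fixed POVM built from the six single-qubit stabilizer states, which form a state 2-design. The names `psi0`, `psi1` and `STATES` are hypothetical.

```python
import numpy as np

d = 2
s = 1 / np.sqrt(2)
# Six single-qubit stabilizer states (eigenstates of Z, X, Y): a state 2-design.
STATES = [
    np.array([1, 0], dtype=complex),
    np.array([0, 1], dtype=complex),
    s * np.array([1, 1], dtype=complex),
    s * np.array([1, -1], dtype=complex),
    s * np.array([1, 1j], dtype=complex),
    s * np.array([1, -1j], dtype=complex),
]

def psi0(X):
    """Measure with POVM elements (d/N)|v><v|, then prepare the observed |v><v|."""
    weight = d / len(STATES)
    return sum(weight * (v.conj() @ X @ v) * np.outer(v, v.conj()) for v in STATES)

def psi1(X):
    """Completely depolarizing channel: X -> Tr(X) * I / d."""
    return np.trace(X) * np.eye(d) / d

rng = np.random.default_rng(0)
X = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))  # arbitrary operator
reconstructed = (d + 1) * psi0(X) - d * psi1(X)
print(np.allclose(reconstructed, X))
```

The coefficients have 1-norm (d+1) + d = 2d+1, which is the quantity that governs the sampling overhead of a single cut in this form.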
Eq. (4) suggests a sampling-based approach to wire cutting by randomly inserting channels on the wires to be "cut". This idea is depicted in Fig. 1, and is realized by the procedure described in Algorithm 1 in Appendix B.2. The following theorem then bounds the overhead incurred from using this method to perform wire cutting on a natural example of a clustered circuit. Once again, we ignore the classical time to compute the post-processing function $f$ in the statement of this theorem.
Theorem 2.2 (Bipartitioning circuits). Let $C$ be a size-$m$ quantum circuit acting on $n$ qubits which is a composition of circuits $C_A$, $C_B$ acting nontrivially on sets of qubits $A, B \subseteq [n]$, respectively. If $|A \cap B| \le k$, then quantum computation using $C$ can be simulated to within accuracy $\varepsilon$ in time $O(4^k (m + k^2)/\varepsilon^2)$ by a quantum circuit acting on at most $\max\{|A|, |B|\}$ qubits using the procedure described in Algorithm 1.
The proof of the theorem is given in Appendix B.3. The theorem follows from Lemma 2.1 by using the fact that random $k$-qubit Clifford operators: (i) form a unitary 3-design [34,59,62], (ii) can be sampled efficiently [56], and (iii) can be implemented using poly-depth circuits [2]. We remark that while random Clifford operations on many qubits can be challenging to implement in practice, we expect they may be feasible for small values of $k$, which is the regime suitable for efficient circuit cutting. Additionally, though we appeal to Hoeffding's inequality to bound the sample complexity in our proof, it may be possible to achieve a better scaling in terms of $k$ for certain applications where the variance of the observables is non-trivially bounded, which we leave as a possible direction for future work.
[Table 1; recoverable entries: "Wire cutting", $4^k$, "LOCC".]
Table 1: Upper bounds on the runtime of circuit cutting methods applied to circuits of the form given in Fig. 1, omitting polynomial factors in $n$, $k$, $D$, and $\varepsilon$. It is assumed that the task is to simulate a quantum computation with this circuit using a device comprising at most $n/2 + k$ qubits ("doubling" the number of qubits). The "Type" column identifies whether the procedure exploits decompositions of gates or of the identity operation applied to wires. Here $D$ is a sparseness parameter, denoting the assumption that the qubits being cut participate in at most $D$ multi-qubit gates. Note that in this case, classical communication may be achieved by recycling qubits on a single quantum device.
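To make the sampling procedure concrete, the following sketch simulates cutting a single qubit wire (d = 2) and estimating the expectation of Pauli Z on the state that crosses the cut. It is a toy stand-in for Algorithm 1 under stated assumptions rather than a reproduction of it: the decomposition is taken to be id = (d+1)Ψ_0 − dΨ_1, the channel index z is drawn with probability d/(2d+1) of being 1, and a uniformly random choice among the Z, X and Y bases stands in for a random single-qubit Clifford. The function `one_shot` and the test state are hypothetical.

```python
import numpy as np

d = 2  # one qubit is cut
s = 1 / np.sqrt(2)
# Columns are the basis states U|0>, U|1> for the Z, X and Y bases; picking one
# uniformly at random stands in for a random single-qubit Clifford.
BASES = [
    np.array([[1, 0], [0, 1]], dtype=complex),
    s * np.array([[1, 1], [1, -1]], dtype=complex),
    s * np.array([[1, 1], [1j, -1j]], dtype=complex),
]

def one_shot(rho, rng):
    """Return one sample of the estimator for Tr(Z rho) across a single cut."""
    z = rng.random() < d / (2 * d + 1)            # z = 1 with probability d/(2d+1)
    if z:
        prepared = BASES[0][:, rng.integers(d)]   # discard input, random basis state
    else:
        U = BASES[rng.integers(len(BASES))]
        probs = np.real([U[:, j].conj() @ rho @ U[:, j] for j in range(d)])
        j = rng.choice(d, p=probs / probs.sum())  # randomized measurement outcome
        prepared = U[:, j]                        # re-prepare the observed state
    p_plus = abs(prepared[0]) ** 2                # final Z measurement: P(+1)
    outcome = 1.0 if rng.random() < p_plus else -1.0
    return (2 * d + 1) * (-1.0 if z else 1.0) * outcome

theta = 0.7
psi = np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)
rho = np.outer(psi, psi.conj())
rng = np.random.default_rng(7)
estimate = np.mean([one_shot(rho, rng) for _ in range(20000)])
exact = np.cos(theta)  # Tr(Z rho) for this state
print(estimate, exact)
```

Each single-shot value is ±(2d+1) = ±5, so the sample mean converges to the exact value at a rate set by that magnitude; this is the source of the multiplicative overhead in the number of shots.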
When $|A| \approx |B| \approx n/2$ and $k$ is a small constant, this technique effectively doubles the number of qubits which can be simulated given a quantum device. In Table 1 we compare the overheads of several circuit cutting methods assuming tightness of the upper bounds shown in each respective work (see Sec. 3.3 for numerical evidence that this is approximately the case for the wire cutting method in Ref. [46]). In the table, the setting being considered is a special case of that in Theorem 2.2: we have a composition of two circuits acting on wires belonging to overlapping sets $A$ and $B$ with $|A| = |B| = n/2 + k$, and the goal is to run the quantum computation using a device limited to $n/2 + k$ wires. The scenario is depicted in the left-hand side of Fig. 1. To enable the comparison with gate-cutting methods, it is necessary that we assume each circuit satisfies a certain sparseness property, introduced in Ref. [9]. Namely, we assume that the qubits in $A \cap B$ participate in at most $D$ multi-qubit gates. Otherwise, gate-cutting methods may incur an unbounded overhead to accomplish the simulation task.
A natural question is whether the method described in Algorithm 1 saturates the best possible overhead for wire cutting. For example, one might wonder whether a subexponential runtime, or a runtime on the order of $2^{ck}$ for some $0 < c < 1$, can be achieved. We rule out these possibilities using an information-theoretic lower bound on the sample complexity of a task that reduces to wire cutting, which we describe in more detail in Appendix C. Roughly speaking, a procedure for wire cutting enables one to succeed at a difficult quantum state discrimination task by cutting $k$ wires, similar to the lower bounds on classical shadows which appear in [32].

Theorem 2.3 (Informal).
Any procedure for bipartitioning the circuit $C$ through wire cutting necessarily incurs a worst-case overhead of $\Omega(2^k/\varepsilon^2)$.
We now explain how the procedure generalizes to multiple applications of the identity in Lemma 2.1. Consider a depth-$L$, $n$-qubit quantum circuit comprising rounds $1, \dots, L$ of commuting quantum gates which implements the channel $\mathcal{N}$. We say that a subset of the wires in the circuit are parallel if and only if there exists an $i \in \{1, \dots, L-1\}$ such that each wire connects a gate applied in some round $j \le i$ to a gate in some round $j' \ge i+1$. In a circuit diagram, this corresponds to a set of wires which may be bisected by a single vertical line. Suppose that the circuit may be separated into disconnected sub-circuits by removing $\ell$ subsets of parallel wires, and let $k_1, \dots, k_\ell$ denote the number of wires in each subset. In Fig. 2, for example, a depth-4 circuit has $\ell = 2$ subsets of $k_1$ and $k_2$ parallel wires, respectively, and is partitioned into three disconnected sub-circuits by removing these wires (depicted as inserting a measure-and-prepare operation). Then by repeated application of Lemma 2.1, we have
$$\mathrm{Tr}\big(O_f\, \mathcal{N}(\rho_0^{\otimes n})\big) = \mathbb{E}_{z_1,\dots,z_\ell}\Big[\prod_{j=1}^{\ell} \big(2^{k_j+1}+1\big)(-1)^{z_j}\; \mathrm{Tr}\big(O_f\, \mathcal{N}_{z_1,\dots,z_\ell}(\rho_0^{\otimes n})\big)\Big]. \tag{5}$$
Here, $z_1, \dots, z_\ell$ are independent Bernoulli random variables, with $z_j$ equal to 1 with probability $2^{k_j}/(2^{k_j+1}+1)$, and $\mathcal{N}_{z_1,\dots,z_\ell}$ is the modified quantum channel which is implemented by substituting the channel $\Psi_{z_j}$ for the identity map at the $j$-th location in the original circuit, for each $j \in [\ell]$. For fixed $z_1, \dots, z_\ell$, an unbiased estimator of the quantity $\mathrm{Tr}(O_f\, \mathcal{N}_{z_1,\dots,z_\ell}(\rho_0^{\otimes n}))$ can be obtained by measuring the output wires of the modified circuit in the computational basis. Furthermore, if removal of the wires in each of the subsets separates the circuit into $r$ disconnected sub-circuits (fragments), then the circuit can be executed using $r$ quantum devices, each possibly with fewer qubits than the original. This setting introduces the following drawbacks: (i) the overhead is comparatively large when $k_j = 1$ for every $j \in [\ell]$ (as opposed to when $\ell$ is small and each $k_j$ is large) and (ii) classical communication between multiple separate devices is required.
The latter issue is avoided in Algorithm 1 since the first circuit in the sequence of circuits has no dependence on the results from the second circuit, and so its output wires could be measured simultaneously with the wires to be cut. These measurement results could then be stored and used at a later time to construct the estimator, while qubits on that same device are recycled in order to execute the second circuit. However, consider the communication graph [46], which represents the flow of communication between different circuits: this recycling procedure is not possible whenever the graph contains a cycle, which may be the case in general. In this case, one circuit sends its partial measurement results to another, and has to wait until receiving a measurement outcome before proceeding with its remaining gates.
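Drawback (i) can be quantified with a few lines of arithmetic. The sketch below assumes that cutting a group of k_j parallel wires contributes a factor 2^(k_j+1) + 1 to the 1-norm (that is, 2d+1 with d = 2^(k_j)), that single-shot estimates are bounded in magnitude by the product of these factors, and that the number of shots is chosen via Hoeffding's inequality. The helper names and the failure probability `delta` are hypothetical.

```python
import math

def one_norm(cut_sizes):
    """Product of (2^(k_j+1) + 1) over the groups of parallel wires being cut."""
    return math.prod(2 ** (k + 1) + 1 for k in cut_sizes)

def hoeffding_shots(cut_sizes, eps, delta=0.01):
    """Shots so that the sample mean is within eps with probability >= 1 - delta,
    for single-shot estimates bounded in [-gamma, gamma]."""
    gamma = one_norm(cut_sizes)
    return math.ceil(2 * gamma ** 2 * math.log(2 / delta) / eps ** 2)

k = 4
together = one_norm([k])      # one group of 4 parallel wires: 2^5 + 1 = 33
separate = one_norm([1] * k)  # four single-wire cuts: 5^4 = 625
print(together, separate)
print(hoeffding_shots([k], eps=0.1), hoeffding_shots([1] * k, eps=0.1))
```

For the same total number of wires, cutting them as one parallel group gives a 1-norm of roughly 2^(k+1), whereas cutting them one at a time gives 5^k, and the shot count scales with the square of either quantity.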

Circuit cutting for QAOA
In this section, we combine the above results with observations related to the structure of QAOA circuits, and perform numerical simulations of quantum circuit cutting applied to the QAOA.

Structure of QAOA circuits for circuit cutting
Here, we establish sufficient conditions for the efficient application of quantum circuit cutting to QAOA circuits for the Max-Cut problem. A key aspect to consider is the wires which must be cut to separate a single layer of the QAOA circuit for Max-Cut into multiple smaller fragments. For a given input graph $G = (V, E)$ encoding an instance of the Max-Cut problem, the cost operator takes the form given in Eq. (6), where $Z_i$ is the single-qubit Pauli-Z operator acting on the $i$-th qubit. The Max-Cut solution corresponds to the state with the minimal expectation value of this cost operator. One may alternatively view this operator as a 2-local Hamiltonian whose interaction graph is precisely the input graph. The QAOA consists of applying alternating layers of mixing unitaries based on single-qubit rotations and cost unitaries based on the above cost operator. Since the mixing unitary comprises single-qubit gates which do not influence qubit connectivity, this part of the circuit can safely be ignored for the purposes of this section. We may therefore focus on the cost unitary, which is implemented by a product of commuting two-qubit ZZ rotations, one for each edge in $E$. The fact that these operations commute also implies that the cost unitary can be implemented using multiple different circuits, corresponding to permutations of the terms in the product. This observation allows us to make the simplifying assumption that gates in the QAOA circuit are applied in an order which respects a chosen partition of the gates for the purpose of circuit cutting, i.e., all gates in a given partition of the edges are applied contiguously in the circuit.
In the specific case of circuit bipartitions, we make use of the idea of a balanced vertex separator of an undirected graph $G$, which is a subset of vertices whose removal leaves two sets of disconnected vertices of roughly equal size, for example, of size at most $2|V|/3$.
Theorem 3.1 (Imprecise, see restatement in Appendix D). Suppose the graph $G = (V, E)$ with $|V| = n$ has some known balanced vertex separator $S \subseteq V$ such that $|S| = \kappa$. Then quantum computation with a $p$-layer QAOA circuit for Max-Cut on $G$ can be simulated to within accuracy $\varepsilon$ using a pair of (non-entangled) quantum devices with $n - \Omega(n)$ qubits in time $O(2^{\alpha p \kappa}/\varepsilon^2)$ for some universal constant $\alpha$.
We once again ignore classical time to compute the objective function of a given bit-string in the statement of this theorem. The proof is in Appendix D. The theorem follows by combining observations about the structure of single-layer QAOA circuits with Eq. (5) as well as an intermediate result concerning how the number of wires to be cut scales with the number of layers $p$. We are thus able to introduce $\Omega(n)$ "virtual qubits" in polynomial time on up to log-depth circuits, so long as the underlying graph for the QAOA has a certain structure. In particular, this is a regime for which certain negative results on the overall effectiveness of the QAOA do not apply. For example, it is known that for $o(\log n)$-depth QAOA both the average- and worst-case performance is limited compared to the best classical algorithms [10,22,21].
Based on the above observations, choosing suitable instances of the Max-Cut problem for circuit cutting is related to finding graphs with small vertex separators. Concretely, we should pick a class of graphs whose vertices are clustered, such that the graph can be (vertex-)separated into roughly equal-sized components. The number of vertices in the separator relates to the number of cuts required in the overall circuit, and thus the overhead in performing circuit cutting. We therefore suggest the class of graphs depicted in Fig. 3 for benchmarking QAOA circuit cutting. These graphs are defined to be those which can be written as r sets of n vertices connected through r − 1 groups of k = O(1) vertices. These connecting groups of vertices form a vertex separator of size k(r − 1), and the overhead for computing expectation values scales exponentially with this quantity.
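The clustered graph family can be sketched in plain Python to confirm that the connecting groups act as a vertex separator. The text fixes only the sizes (r clusters of n vertices, r − 1 connecting groups of k vertices); the edge pattern used below, with each cluster a clique and each connector vertex joined to every vertex of its two neighbouring clusters, is an assumption made purely for illustration, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def clustered_graph(r, n, k):
    """r clusters of n vertices chained through r-1 connector groups of k vertices.
    Assumption: clusters are cliques; connectors touch both neighbours fully."""
    clusters = [list(range(c * n, (c + 1) * n)) for c in range(r)]
    base = r * n
    connectors = [list(range(base + g * k, base + (g + 1) * k)) for g in range(r - 1)]
    edges = set()
    for cluster in clusters:
        edges.update(combinations(cluster, 2))
    for g, group in enumerate(connectors):
        for v in group:
            edges.update((u, v) for u in clusters[g] + clusters[g + 1])
    return clusters, connectors, edges

def components(vertices, edges):
    """Connected components of the subgraph induced on `vertices` (BFS)."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in vertices and v in vertices:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        queue, comp = deque([start]), {start}
        seen.add(start)
        while queue:
            for w in adj[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps

clusters, connectors, edges = clustered_graph(r=3, n=25, k=1)
separator = {v for group in connectors for v in group}
all_vertices = {v for c in clusters for v in c} | separator
remaining = all_vertices - separator
print(len(all_vertices), len(separator), len(components(remaining, edges)))
```

With r = 3, n = 25 and k = 1 the graph has 77 vertices, and removing the k(r − 1) = 2 connector vertices leaves the three clusters disconnected from one another.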
We may also consider the more general case in which there is no prior information about the structure of the graphs available. While the problem of finding the minimal balanced vertex separator is known to be NP-hard [12], there is a polynomial-time classical algorithm for finding an $O(\sqrt{\log \kappa})$ approximation to the optimal solution based on semi-definite programming relaxation [26]. Here, $\kappa$ is the size of the minimal vertex separator. Hence, even without prior knowledge of the minimal vertex separator, one may employ such classical algorithms to search for approximate solutions on the given input graph, and the overhead in Theorem 3.1 holds with $\kappa$ the size of the minimal vertex separator, up to a factor of $O(\sqrt{\log \kappa})$ in the exponent.

Sampling from cut circuits
In this section, we establish that it is possible to recover optimal solutions from circuit fragments which have been used to perform the QAOA. We emphasize that this sampling aspect of the algorithm has not been addressed in any previous works on circuit cutting known to the authors. Our analysis leads to a simple suggestion that helps bridge the gap between procedures for circuit cutting and their application to problems of practical interest.
In the Max-Cut problem, the objective function maps bitstrings (partitions of vertices) to their cost (number of edges cut), $f : \{0,1\}^n \to \{0, 1, \dots, M\}$. Here and throughout, we consider the graph $G = (V, E)$ with $n = |V|$ and $M = |E| \le n^2$. The goal of the QAOA is to return a bitstring $x$ such that $f(x)$ is close to maximal, with high probability. The optimized QAOA circuit produces the state $|\gamma, \beta\rangle$ according to parameters $\gamma, \beta$. This results in the measurement distribution given by $q(x) := |\langle x|\gamma, \beta\rangle|^2$, whose expected cost $\mathbb{E}_{x \sim q}\, f(x)$ is ideally close to the maximum. Let us denote this expectation by $\mu$ from now on, and note that $\mu \in [0, M]$.
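As a small illustration of the objective function, the sketch below evaluates the cut value f(x) for every bitstring of a hypothetical 4-vertex graph and computes the expected cost under the uniform distribution, which plays the role of μ for a circuit whose output distribution carries no information. The graph and the helper `cut_value` are made up for this example.

```python
from itertools import product

def cut_value(x, edges):
    """Number of edges whose endpoints receive different labels in bitstring x."""
    return sum(x[i] != x[j] for i, j in edges)

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # hypothetical 4-vertex example
n, M = 4, len(edges)
values = {x: cut_value(x, edges) for x in product((0, 1), repeat=n)}
best = max(values.values())
mu_uniform = sum(values.values()) / len(values)  # expected cost under uniform q
print(best, mu_uniform)
```

Every value lies in {0, ..., M}, so any expectation of f, including μ for the optimized QAOA distribution, lies in [0, M].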
Given access to samples from $q$, it is straightforward to show the bound in Eq. (8). In other words, after around $M$ trials one obtains a bit string $x$ which has cost at least $\mu$. Now consider the case where the QAOA circuit is being partitioned into disconnected fragments by removing $k$ wires in total. Using Eq. (5) in the case where $\ell = k$ (individual wires are being cut), this expression implies
$$q(x) = 5^k\, \mathbb{E}_{z_1,\dots,z_k}\big[(-1)^{z_1 + \cdots + z_k}\, q(x|z_1, \dots, z_k)\big] \tag{10}$$
for every $x \in \{0,1\}^n$, where $z_1, \dots, z_k$ are independent Bernoulli random variables which determine the settings in a modified quantum circuit, as described in Sec. 2, and $q(x|z_1, \dots, z_k)$ is the probability of obtaining the outcome $x$ from the modified circuit conditioned on fixed settings $z_1, \dots, z_k$. Considering the absolute value of each of the terms in the expectation, the expectation in the right-hand side of Eq. (10) is at most
$$\tilde{q}(x) := \mathbb{E}_{z_1,\dots,z_k}\big[q(x|z_1, \dots, z_k)\big].$$
This is the marginal probability of observing the outcome $x \in \{0,1\}^n$ in the modified circuit. Therefore, it holds that
$$q(S) \le 5^k\, \tilde{q}(S) \quad \text{for every } S \subseteq \{0,1\}^n,$$
where $q(S) := \sum_{x \in S} q(x)$ and likewise for the distribution $\tilde{q}$. Hence, using Eq. (8), the probability of sampling a bit string with $f(x) \ge \mu$ from $\tilde{q}$ is at least $5^{-k}$ times the corresponding probability under $q$, which in turn implies one would have to wait at most $5^k$ times as long in expectation to sample a bit string $x$ such that $f(x) \ge \mu$ using the smaller circuit fragments. This is analogous to the overhead incurred for expectation values: in both cases, the cost of circuit cutting is a need to repeatedly execute the circuit by an amount that grows exponentially in the number of wires cut. We remark that repeating the above analysis using the wire cutting procedure derived in the proof of Theorem 1 in Ref. [46] results in an improved overhead of $4^k$ for the sampling task, and as mentioned previously does not require synchronizing preparations to measurement outcomes in the modified circuits.

Numerical comparison for circuit bipartitions
Here we numerically benchmark our circuit cutting method against the methods given in Ref. [46], and compare to the theoretical bounds. The supporting code used to generate these numerical results can be found in [1]. We focus on a family of graphs described in Sec. 3.1, with specific examples depicted in Fig. 4. These graphs can later be generalized, as seen in Fig. 3. The cost variances for these graphs are examined at QAOA depth $p = 1$ and $p = 2$ as functions of the number of required device shots. The results can be seen in Fig. 4. All reference cost values have been calculated numerically exactly, which is possible because the test graphs are relatively small (at most 13 qubits). For all graphs considered, QAOA parameters [23] $\gamma = (\gamma_1, \dots, \gamma_p)$ and $\beta = (\beta_1, \dots, \beta_p)$ have been fixed to their optimal values $\gamma^*, \beta^* = \arg\min_{\gamma, \beta} \langle \gamma, \beta | H_C | \gamma, \beta \rangle$, where $|\gamma, \beta\rangle$ is the QAOA output state for the given parameters. One of the main advantages of the circuit cutting method proposed in this work is the faster convergence to the correct value of the target observable. To meet the requirement of the cost function being in the interval $[-1, 1]$, we rescale the expression given in Eq. (6) by $M = |E|$, the number of edges in the graph. Simulations were performed using PennyLane [6], an open-source Python software framework for quantum differentiable programming, for both cutting methods.
As discussed in Appendix D, a QAOA circuit with $p$ layers will generally require at least $p$ separate cuts, one per QAOA layer. For $p \ge 2$, the measure-and-prepare nature of both methods introduces mid-circuit measurements that are not commonly supported on qubit simulators. To accommodate mixed quantum states that would be generated by circuit cutting on a real quantum device, one can either employ a mixed-state simulator or introduce auxiliary qubits. Our implementation of the randomized channel method uses the former, while the PennyLane implementation of the Pauli-based method of Peng et al. [46] uses the latter.
The randomized method introduced in Sec. 2 outperforms the method based on Pauli measurements [46] in terms of convergence speed and accuracy for all considered graph sizes, cut sizes, and QAOA depths. Improvements range from marginal to orders of magnitude. For larger cut sizes $k$, and especially at QAOA depth $p = 2$, at $10^6$ shots, the randomized method offers reliable cost estimates, while the Pauli method still exhibits standard deviations larger than the target interval $[-1, 1]$, as shown in Fig. 4. We observe that the randomized channel approach converges faster in all QAOA instances checked, as shown in Fig. 5.

Large-scale simulations
Having demonstrated the faster convergence of our circuit cutting method over existing approaches for small-sized QAOA problems, we now switch focus toward showing the applicability of a circuit cutting procedure to larger problem sizes of above 50 qubits, a scale that is challenging for direct execution on near-term quantum hardware devices. Our objective is to provide a proof-of-principle evaluation and optimization of a large-scale QAOA problem when cut into smaller fragments of at most 30 qubits. To overcome the limitations of existing 30-qubit quantum hardware, we use a cluster of GPU-based simulators as a substitute. This section uses the circuit cutting method introduced in Theorem 2 of Peng et al. [46,47], which involves performing process tomography for each circuit fragment and contracting the resulting tensors. The tensor-contraction-based method is fully supported in PennyLane [6] and is compatible with both simulator- and hardware-based devices. When performed using simulation, this method is analogous to existing tensor network methods that aim to optimize the order of contractions in the network [27].
As before, we focus on instances from the class of graphs introduced in Sec. 3.1 for the Max-Cut problem, consisting of r clusters of n nodes connected in a chain by smaller clusters of k nodes, as exemplified in Fig. 3. The choice of problem parameters r, n and k as well as the number of QAOA layers p determines the total number of unique sub-circuits N required for circuit cutting as well as the maximum number of qubits m among all sub-circuits, as discussed in Appendix E.
The simulations detailed here were performed on the NERSC Perlmutter supercomputer using a cluster of NVIDIA A100 40GB GPUs running PennyLane's cuQuantum-enabled [19] Lightning-GPU simulator device. Each GPU node in Perlmutter is equipped with 4 NVIDIA A100 40GB GPUs, an AMD EPYC 7763 CPU, and networked with the HPE Slingshot-10 interconnect. Each GPU supports executing circuits of up to m = 30 qubits. The N circuit fragments generated due to cutting were distributed within the cluster using the Ray parallel execution library [44]. Supporting utility functions, data and methods used for this section are available in [1].
We first detail the execution of QAOA circuits with parameters p = 2, n = 25 and k = 1 for a range of cluster numbers r ∈ {3, 4, 5}, resulting in circuits with 77, 103, and 129 qubits, respectively. Such circuits could not be executed directly on a 30-qubit device, but by using circuit cutting we can break each circuit up into N(r) smaller circuit fragments, where N(r) depends on the choice of r. In this case, N(3) = 1812, N(4) = 3540 and N(5) = 5268, growing linearly with r as can be seen in Appendix E. To confirm the linear scaling numerically, we used simulation-based circuit cutting to evaluate the expectation value of the QAOA circuit with a randomly selected set of circuit parameters γ and β for all choices of r and recorded the execution time, resulting in the plot shown in Fig. 6. An analysis of the classical efficiency of our distributed GPU-based simulation is provided in Appendix F.
We also investigate the ability to optimize a large-scale QAOA problem using simulated circuit cutting. Given the ability to evaluate a circuit, it is straightforward to extend access to gradients using methods like parameter shift [53] or finite difference, which require repeated circuit evaluation. We optimize a 62-qubit QAOA problem using the gradient descent algorithm, providing gradients by combining circuit-cutting-inspired parallelization with the multi-parameter finite-difference method. Using the class of clustered graphs with parameters n = 20, r = 3 and k = 1, we consider QAOA circuits with p = 1 and p = 2 layers.
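The multi-parameter finite-difference step parallelizes naturally because every shifted circuit evaluation is independent. The following is a minimal sketch only: `toy_cost` is a hypothetical stand-in for the cut-circuit expectation value (a simple quadratic, not a QAOA energy), and a thread pool stands in for the GPU cluster; the step size, shift `h`, and iteration count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fd_gradient(cost, params, h=1e-3, executor=None):
    """Central finite-difference gradient; all shifted evaluations are independent."""
    params = np.asarray(params, dtype=float)
    shifted = []
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = h
        shifted += [params + e, params - e]
    # Evaluate every shifted point, in parallel if an executor is supplied.
    mapper = executor.map if executor is not None else map
    values = list(mapper(cost, shifted))
    return np.array(
        [(values[2 * i] - values[2 * i + 1]) / (2 * h) for i in range(params.size)]
    )


def gradient_descent(cost, params, step=0.1, iters=100, executor=None):
    """Plain gradient descent driven by finite-difference gradients."""
    params = np.asarray(params, dtype=float)
    for _ in range(iters):
        params = params - step * fd_gradient(cost, params, executor=executor)
    return params


def toy_cost(x):
    # Hypothetical placeholder for the circuit-cutting estimate of the cost.
    return float(np.sum((x - np.array([0.3, -0.7])) ** 2))


with ThreadPoolExecutor(max_workers=4) as pool:
    opt = gradient_descent(toy_cost, [0.0, 0.0], executor=pool)
```

In a setting like the one described here, each call to the cost function would itself fan out into many fragment executions, so the two levels of parallelism compose.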
In the case of p = 1, an exact expression for the QAOA cost is available [57,41], from which we are able to obtain the analytic optimal value to use as a baseline comparison. It is shown in Fig. 7 that an optimization using simulated circuit cutting is able to achieve this optimal cost value. For p = 2 there is no analytic expression known for the cost function, making the global minimum harder to find. Nevertheless, the cutting-based optimization results in a decreasing cost function and tends towards a value of −0.51. Although the result does not achieve a global minimum, since the p = 1 case is able to reach a lower value of −33.8, this provides a demonstration of the ability to perform optimization using circuit cutting methods at larger depths p.

For p = 2, the optimization was run over 10 GPU nodes and took around 12 hours. In this case, a local minimum is obtained, although this may be improved upon by careful choice of initial circuit parameters [38].

Conclusion and outlook
In this work, we presented a circuit cutting method based on randomized measurements that provides a quadratic runtime improvement over the current state-of-the-art for circuits where multiple neighbouring wires are cut simultaneously. Our method requires classical communication between circuit fragments to coordinate measurement outcomes and state preparation. For circuits with matrix-product-state structure, the synchronization between circuit fragments can be carried out by repeatedly executing smaller circuits, even on a single device. For general circuits, our algorithm requires multiple circuit executions across many devices, separated in space rather than time.
Our results raise the question of whether additional improvements may be possible in this setting. With this in mind, we derived an information-theoretic lower bound that is quadratically smaller than the sample overhead of our method. It is an open question whether such a lower bound is in fact achievable or whether it can be further tightened.
We chose the QAOA algorithm as a testbed for the new method. An emphasis was put on the practical and operational aspects for real devices: we give an algorithm based on randomized measure-and-prepare channels as well as a way to sample the cut circuit. To the best of our knowledge, the latter has not been addressed in any previous works, which instead focus on estimating expectation values. Numerical simulations indicate a large speedup over the previous state of the art for circuit cutting.
Finally, as a more general exploration of circuit cutting methods, we give results on large-scale simulations of QAOA circuits over many GPUs acting as proxies for individual ∼30-qubit devices. We performed forward passes of up to 129 qubits and full optimization procedures for 62-qubit circuits. These results demonstrate that our software implementation of circuit cutting methods, built on top of PennyLane, indeed enables small-scale quantum devices to successfully emulate the results of a large circuit. Although additional difficulties may arise when employing quantum hardware instead of simulators, these numerical experiments are a testament to the practicality of large-scale circuit cutting workflows.

A Preliminaries
We explicitly define random variables throughout, including matrix-valued random variables where relevant. We let $x \in A$ denote a random variable $x$ which takes values in the set $A$. For any positive integer $N$ we let $[N]$ denote the set $\{1, \dots, N\}$. We use sans-serif font to denote subspaces of complex finite-dimensional Euclidean vector spaces, or operators acting on these spaces. For any positive integer $d$ we let $\mathrm{L}(\mathbb{C}^d)$ denote the set of square linear operators acting on $\mathbb{C}^d$, $\mathrm{H}(\mathbb{C}^d)$ the subset of operators in $\mathrm{L}(\mathbb{C}^d)$ which are Hermitian, $\mathrm{Psd}(\mathbb{C}^d)$ the subset of operators in $\mathrm{H}(\mathbb{C}^d)$ which are positive semidefinite, and $\mathrm{D}(\mathbb{C}^d)$ the subset of operators in $\mathrm{Psd}(\mathbb{C}^d)$ which have unit trace (quantum states). We also let $\mathrm{U}(\mathbb{C}^d)$ denote the set of unitary operators acting on $\mathbb{C}^d$ and $\mathrm{S}(\mathbb{C}^d)$ the set of unit-norm vectors in $\mathbb{C}^d$ (vector representation of pure states). We distinguish between identity operators acting on $\mathbb{C}^d$ and on $\mathrm{L}(\mathbb{C}^d)$ by adopting the notation $\mathbb{1}$ for the former and $\mathrm{id}$ for the latter, using subscripts to denote the space on which the operators act if this is not clear from context.

Our main theorem will make use of structured POVMs based on unitary t-designs (in the case where t = 2), which we define below.

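Definition A.1 (Unitary t-design). For positive integers $t, d > 0$, a unitary t-design is a random unitary operator $U \in \mathrm{U}(\mathbb{C}^d)$ satisfying
$$\mathbb{E}\big[U^{\otimes t} X (U^{\dagger})^{\otimes t}\big] = \int d\mu(V)\, V^{\otimes t} X (V^{\dagger})^{\otimes t}$$
for every $X \in \mathrm{L}((\mathbb{C}^d)^{\otimes t})$, where $\mu$ is the Haar measure on $\mathrm{U}(\mathbb{C}^d)$.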
In the case where $U$ in the above definition is a discrete random variable taking values $U_1, U_2, \dots, U_m$ with probabilities $p_1, p_2, \dots, p_m$, respectively, one may associate with this unitary t-design the ensemble of unitaries $\{(p_i, U_i)\}_{i=1}^{m}$.

We denote by $\mathrm{Tr}_j(X)$ the partial trace over the $j$-th space, where it should be made clear from context which subspace the index $j$ corresponds to. We also use the notation $\mathrm{Tr}_S(\cdot)$ for a set $S \subset \mathbb{Z}_+$ to denote tracing out a subspace corresponding to multiple qubits on a quantum circuit.

B Proof of main result

B.1 Proof of Lemma 2.1
The lemma is restated for convenience. Recall that we have for every X ∈ L(C d ).
Lemma 2.1. Let $d$ be a positive integer and $\Psi_0$, $\Psi_1$ be the channels defined in Eqs. (2) and (3), respectively, acting on $d$-dimensional states. Define the Bernoulli random variable $z \in \{0, 1\}$ to be equal to 1 with probability $d/(2d + 1)$. It holds that

Proof. Since we assume that $U \in \mathrm{U}(\mathbb{C}^d)$ is a unitary 2-design, for any $|v\rangle \in \mathrm{S}(\mathbb{C}^d)$ we have where $W \in \mathrm{U}((\mathbb{C}^d)^{\otimes 2})$ is the swap operator, and the second line is a well-known identity that follows by exploiting the permutation-invariance of Haar integrals of this form (cf. Eq. (7.179) in [58]). Using Eq. (18) as well as the linearity of trace, we may write the action of the first channel $\Psi_0$ on a linear operator $X \in \mathrm{L}(\mathbb{C}^d)$ as where the final line follows from the identity $\mathrm{Tr}_1((A \otimes \mathbb{1})W) = A$ for any square linear operator $A$. Also, $\Psi_1(X) = \mathrm{Tr}(X)\,\mathbb{1}/d$ for every $X \in \mathrm{L}(\mathbb{C}^d)$ by definition. Substituting into Eq. (22), we have which, upon rearranging, gives for every $X \in \mathrm{L}(\mathbb{C}^d)$. This proves the claim.
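For reference, the chain of identities described in this proof can be sketched as follows. This is a reconstruction from the surrounding text rather than a verbatim copy of the numbered equations, and it assumes that $\Psi_0$ is the measure-and-prepare channel obtained by rotating with a random 2-design unitary $U$, measuring in the computational basis, and preparing $U|y\rangle$ (the actual definition is Eq. (2) of the main text).

```latex
\begin{align}
\mathbb{E}_U\!\left[\big(U|v\rangle\langle v|U^\dagger\big)^{\otimes 2}\right]
  &= \int d\mu(V)\,\big(V|v\rangle\langle v|V^\dagger\big)^{\otimes 2}
   = \frac{\mathbb{1}\otimes\mathbb{1} + W}{d(d+1)}, \\
\Psi_0(X)
  &= \mathbb{E}_U \sum_{y=1}^{d} \langle y|U^\dagger X U|y\rangle\; U|y\rangle\langle y|U^\dagger
   = d\,\operatorname{Tr}_1\!\left[(X\otimes\mathbb{1})\,
       \frac{\mathbb{1}\otimes\mathbb{1}+W}{d(d+1)}\right]
   = \frac{\operatorname{Tr}(X)\,\mathbb{1} + X}{d+1}, \\
\Psi_1(X) &= \operatorname{Tr}(X)\,\frac{\mathbb{1}}{d}, \\
(2d+1)\,\mathbb{E}_z\!\left[(-1)^z\,\Psi_z(X)\right]
  &= (d+1)\,\Psi_0(X) - d\,\Psi_1(X) = X .
\end{align}
```

The last line uses $\Pr[z = 1] = d/(2d+1)$ and $\Pr[z = 0] = (d+1)/(2d+1)$, so the prefactor $2d + 1$ is exactly the normalization that appears in the estimator of Algorithm 1.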

B.2 Bipartitioning pseudocode
We describe in full detail the circuit cutting procedure used in Theorem 2.2 using the pseudocode below, in Algorithm 1. As in the statement of the theorem, $C$ is an $n$-qubit circuit composed of sub-circuits $C_A$, $C_B$ acting on $A, B \subseteq [n]$ respectively, and $|A \cap B| = k$. Recalling our notation for the model of quantum computation introduced in Sec. 2, for any function f : The overall action of the quantum circuit $C$ is to implement a quantum channel which we denote by $\mathcal{N}$, and the goal is to estimate $\mathrm{Tr}(O_f\, \mathcal{N}(\rho_0^{\otimes n}))$. Note also that we assume that $A \cup B = [n]$ for simplicity.

B.3 Proof of Theorem 2.2
We first restate the result for convenience.

Theorem 2.2. Let $C$ be a size-$m$ quantum circuit acting on $n$ qubits which is a composition of circuits $C_A$, $C_B$ acting non-trivially on sets of qubits $A, B \subseteq [n]$, respectively. If $|A \cap B| \le k$, then quantum computation using $C$ can be simulated to within accuracy $\varepsilon$ in time $O(4^k (m + k^2)/\varepsilon^2)$ by a quantum circuit acting on at most $\max\{|A|, |B|\}$ qubits using the procedure described in Algorithm 1.

Algorithm 1
5: V ← random Clifford operator on k qubits
6: Apply circuit V† to wires A ∩ B
7: y ← measurement of wires A ∩ B
8: else
9: y ← Unif({0, 1}^k)
10: end if
11: x_A ← measurement of wires [n]\B
12: Initialize wires B to |y⟩_{A∩B} |0⟩_{[n]\A}
13: Apply circuit C_B to wires B
14: x_B ← measurement of wires B
15: x
To prove the claim regarding time and space resources required, first note that the random variable $Y = (2^{k+1} + 1)(-1)^z f(x)$ in Algorithm 1 is bounded in magnitude by $|Y| \le 2d + 1$ with certainty, where $d = 2^k$. Hence by Hoeffding's Inequality the sample mean of $N = O(d^2/\varepsilon^2) = O(4^k/\varepsilon^2)$ iterations of Algorithm 1 suffices to estimate the expectation value of $Y$ to within accuracy $\varepsilon$ with high probability. Now, suppose we have a device comprising $\max\{|A|, |B|\}$ qubits. If $z = 0$, the procedure up to and including Line 11 can be performed on this device in time $O(m + k^2)$. This follows from the fact that there exists an efficient procedure to sample a depth-$O(k \log k)$ circuit which implements a random Clifford operator, running in time $O(k^2)$ [56]. If $z = 1$ then no sampling of random Cliffords is required, and the procedure yields $x_A$ and $y$ in time $O(m)$. In Line 12, the same device may then be re-initialized to the state $|y\rangle|0\rangle$. Finally, the application of the circuit $C_B$ takes time $O(m)$, for a total runtime of $O(m + k^2)$ to produce a single sample.
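To make the Hoeffding step concrete, the sketch below computes a sufficient number of iterations for a target accuracy. The failure probability `delta` is an assumption introduced here for illustration (the text only says "with high probability"), and the constant follows from the two-sided Hoeffding bound for a variable confined to $[-(2d+1),\, 2d+1]$.

```python
import math


def hoeffding_samples(k, eps, delta=0.01):
    """Sufficient sample count so that |mean - E[Y]| < eps with prob. >= 1 - delta.

    Uses |Y| <= 2d + 1 with d = 2**k, i.e. a range of width 2 * (2d + 1), and
    P(|mean - E[Y]| >= eps) <= 2 * exp(-2 * N * eps**2 / width**2).
    """
    d = 2 ** k
    bound = 2 * d + 1
    return math.ceil(2 * bound ** 2 * math.log(2 / delta) / eps ** 2)


# The count grows roughly like 4**k / eps**2, matching the O(4^k / eps^2) scaling.
counts = {k: hoeffding_samples(k, eps=0.1) for k in range(1, 6)}
```

For large k the ratio between consecutive entries of `counts` approaches 4, which is the per-wire overhead quoted in the theorem.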
It remains to show that $Y$ is an unbiased estimator of $\mathrm{Tr}(O_f\, \mathcal{N}(\rho_0^{\otimes n}))$, where $\mathcal{N}$ is the channel implemented by the circuit $C$. Consider the channels $\Psi_0$ and $\Psi_1$ defined in Eqs. (2) and (3), respectively. Lines 5-7 and 12 perform the action of the channel $\Psi_0$ on qubits in $A \cap B$. This follows from the fact that the random Clifford operators comprise a 3-design [34,62,59], and hence also a 2-design. Similarly, Lines 9 and 12 perform the action of the channel $\Psi_1$ on qubits in $A \cap B$. Deferring the measurement in Line 11, the procedure therefore has the following action on the initial state for a fixed $z \in \{0, 1\}$: Here, $\mathcal{U}_A$ and $\mathcal{U}_B$ are unitary channels acting on wires (qubits) $A$ and $B$ respectively, such that the overall action of the circuit is given by Let $z$ now be the random variable defined in Line 1 of Algorithm 1. By Lemma 2.1 and the linearity of channels we have Therefore, by the linearity of trace it holds that as required.
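The identity of Lemma 2.1 that underlies this unbiasedness argument can be checked numerically for a single cut wire (d = 2). The sketch below assumes that $\Psi_0$ is the channel "apply $U^\dagger$, measure in the computational basis, prepare $U|y\rangle$" averaged over the single-qubit Clifford group, and that $\Psi_1$ is the completely depolarizing channel; the helper names are hypothetical and the group is generated here from H and S up to global phase.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]], dtype=complex)


def clifford_group():
    """Single-qubit Clifford group modulo global phase, generated by H and S."""
    group = [np.eye(2, dtype=complex)]
    frontier = [group[0]]
    while frontier:
        U = frontier.pop()
        for G in (H, S):
            V = G @ U
            # Two 2x2 unitaries agree up to a phase iff |Tr(W^dagger V)| = 2.
            if not any(abs(abs(np.trace(W.conj().T @ V)) - 2) < 1e-9 for W in group):
                group.append(V)
                frontier.append(V)
    return group


def psi0(X, group):
    """Measure-and-prepare channel averaged over the given unitary ensemble."""
    d = X.shape[0]
    out = np.zeros((d, d), dtype=complex)
    for U in group:
        for y in range(d):
            v = U[:, y]  # the state U|y>
            out += (v.conj() @ X @ v) * np.outer(v, v.conj())
    return out / len(group)


def psi1(X):
    """Completely depolarizing channel Tr(X) * 1 / d."""
    d = X.shape[0]
    return np.trace(X) * np.eye(d) / d


d = 2
group = clifford_group()
rng = np.random.default_rng(0)
X = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))

p1 = d / (2 * d + 1)  # probability that z = 1
recon = (2 * d + 1) * ((1 - p1) * psi0(X, group) - p1 * psi1(X))
```

Here `recon` agrees with `X` up to floating-point error, which is the statement that the signed, rescaled mixture of the two measure-and-prepare channels acts as the identity on the cut wire.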

C An information-theoretic lower bound
Note that a successful procedure for this task using $N \le O(\kappa^2/\varepsilon^2)$ iterations is implied by the existence of a decomposition of the identity of the form for some collection of measure-and-prepare channels $\Phi_i : \mathrm{L}(\mathsf{H}_B) \to \mathrm{L}(\mathsf{H}_B)$ and real $a_i$ satisfying This follows from the fact that such a decomposition could be used to perform the wire cutting task described above, resulting in an unbiased estimator of the expectation value which is bounded in magnitude by $\kappa$. Hence, our lower bound on $N$ will also imply a lower bound on the quantity $\kappa$.
To prove our lower bound, let us consider the case where $d_A = d_C = 1$. Then we must be able to determine $\mathrm{Tr}(O_f\, \mathcal{U}_2(\rho))$ to within error $\varepsilon$ using a procedure of the form described above. Namely, we measure $\rho \in \mathrm{D}(\mathsf{H}_B)$ using some POVM $\{M_y\}$ and prepare the state $\tau_y \in \mathrm{D}(\mathsf{H}_B)$ conditioned on receiving the outcome $y$. We then apply $\mathcal{U}_2$ on this state and measure it in the computational basis, receiving outcome $z \in [d_B]$, and this procedure is repeated $N$ times using $N$ identical copies of the state $\rho$.
Since there is only one non-trivial register we will drop the subscript B from now on. Define where These states will allow us to construct a difficult state discrimination problem which reduces to a successful procedure for wire cutting. The lower bound on the number of iterations required will then follow by bounding the information gained from the measurement statistics in a single iteration.
To accomplish this, we make use of elementary facts from information theory (see, for example, Ref. [16]). We adopt standard notation for the mutual information and conditional mutual information between two random variables, as well as the entropy and conditional entropy of a random variable. We let $D(P \,\|\, Q)$ denote the relative entropy between two distributions $P$, $Q$ such that $\mathrm{supp}(P) \subseteq \mathrm{supp}(Q)$. We also make use of the $\chi^2$-divergence between two discrete distributions $P$, $Q$ over the sample space $\mathcal{X}$ defined through In addition to more standard facts from information theory, we require the following two results, which we state without proof. The first is a special case of Lemma 6 in Ref. [13], while the second is a looser version of Eq. (5) in Ref. [51].
Lemma C.1. Let $P_{X,Y}$ be a discrete distribution over the sample space $\mathcal{X} \times \mathcal{Y}$, and let $P_X$, $P_Y$ be the marginal distributions over $\mathcal{X}$, $\mathcal{Y}$, respectively. For any pair of discrete distributions $Q_1$ over $\mathcal{X}$ and $Q_2$ over $\mathcal{Y}$, it holds that

Lemma C.2. Let $P$, $Q$ be discrete distributions over the sample space $\mathcal{X}$ such that $\mathrm{supp}(P) \subseteq \mathrm{supp}(Q)$. It holds that

Let $U \in \mathrm{U}(\mathsf{H})$ be a Haar-random unitary operator and $x \in \{0, 1\}$ be uniformly random. Suppose we are given as input to our problem the unitary channel $\mathcal{U} : X \mapsto U^\dagger X U$, observable $\Pi_0 \in \mathrm{H}(\mathsf{H})$, and the state $U \rho_x U^\dagger \in \mathrm{D}(\mathsf{H})$. Then the wire cutting procedure produces standard basis outcomes $z_1, \dots, z_N \in [d]$ which are independent given $x$ and $U$, as well as intermediate measurement results $y_1, \dots, y_N$ which are independent given $x$ and $U$. The final output of the procedure after the $N$ iterations enables one to correctly determine the value of $x \in \{0, 1\}$ with high probability. This follows since without loss of generality we can assume the output will be an $\varepsilon/3$-accurate estimate of the quantity $\mathrm{Tr}(\Pi_0 U^\dagger U \rho_x U^\dagger U) = \mathrm{Tr}(\Pi_0 \rho_x)$, with high probability. Now, because such an $\varepsilon/3$-accurate estimate of $\mathrm{Tr}(\Pi_0 \rho_x)$ allows one to infer the value of $x$ with certainty. By Fano's Inequality [20], the mutual information between the random variables obtained from identical iterations of this procedure $z := (z_1, \dots, z_N)$ and the choice of state $x$ satisfies On the other hand, we can upper bound the left-hand side of the above as follows.

where the second-to-last line follows because conditioning reduces entropy, and $z_i$ is independent of $z_{i-1}, \dots, z_1$ given $x$ and $U$ by the assumptions made in our criteria for what constitutes a wire cutting procedure. Hence, it remains to bound the individual mutual information terms.
Conditioning on $U$ we find that $x \to y_i \to z_i$ forms a Markov chain, using the fact that for fixed $y_i$ and $U$ we defined $z_i$ to be the outcome obtained from measuring the state $U^\dagger \tau_{y_i} U$ in the computational basis. By the Data-Processing Inequality we have

Define $\tilde{y}$ to be the random variable $y_i$ conditioned on $U$, which has conditional distribution given by $P_{\tilde{y}|x=x'}(y) = \mathrm{Tr}(M_y U \rho_{x'} U^\dagger)$ for each $x' \in \{0, 1\}$ and marginal distribution $P_{\tilde{y}}(y) = \tfrac{1}{2}\sum_{x'=0}^{1} P_{\tilde{y}|x=x'}(y)$. Also, let $P_x = (\tfrac{1}{2}, \tfrac{1}{2})$, let $P_{x,\tilde{y}}$ be the joint distribution, and define a distribution $Q$ given by $Q(y) = \mathrm{Tr}(M_y)/d$. Then the right-hand side of the above is (by Lemma C.1)

In the second line we made use of the unitary invariance of the Haar measure, which implies that the two terms in the sum in Eq. (53) are equal by the fact that there exists a unitary operator $V \in \mathrm{U}(\mathsf{H})$ such that $\rho_1 = V \rho_0 V^\dagger$. We then have

The Haar integral in the numerator of each of the terms above is evaluated implicitly in previous arguments for lower bounds of this type [29,32], and explicitly as Lemma 3.2.6 in [37], by which it holds that for every outcome $y$, and hence

In summary, we have for every $i \in [N]$ so that, by Eq. (50) and the lower bound in Eq. (43) we have which holds if and only if $N = \Omega(d/\varepsilon^2)$. For a system comprising $k$ qubits, this is $\Omega(2^k/\varepsilon^2)$. This proves the bound in Theorem 2.3.

It holds that p ≥ Ω(√d).

D Proof of Theorem 3.1
In this appendix we offer proofs for Theorem 3.1 as well as some related results in the form of lemmas. We state a notational convenience at the outset: in order not to confuse undirected input graphs for the Max-Cut problem with the (multi-)graphs representing quantum circuits, we reserve the term fragments to refer to subgraphs of the latter, obtained by removing a subset of wires. Recall that we are interested in these fragments since applying the identity in Lemma 2.1 (or in the proof of Theorem 1 of Ref. [46]) to each of the removed wires allows us to simulate the circuit using smaller devices corresponding to the fragments.

The QAOA ansatz with p layers is a circuit whose structure is repeated p times in succession. To aid our discussion on the relationship between problem instances and the complexity of circuit cutting, it will be helpful to reduce the scope of our analysis to the case of a single layer. The following observation, which is not specific to the QAOA ansatz, motivates this choice. For an $n$-qubit circuit, we define the support of a set of its gates $A$ to be the subset of qubits upon which they act non-trivially, $\mathrm{supp}(A) \subseteq [n]$. Suppose this circuit is separated into fragments $F_1, \dots, F_r$ upon removing a subset of wires. Then we refer to the support of the subset of gates belonging to the fragment $F_j$ as the support of the fragment itself, denoted $\mathrm{supp}(F_j) \subseteq [n]$, for each $j \in [r]$. The size of this set is the number of qubits in the corresponding sub-circuit formed in the circuit cutting procedure.
Lemma D.1. If a circuit can be separated into r fragments $F_1, \dots, F_r$ by removing sets of at most $\kappa$ parallel wires (see Sec. 2), then a composition of p layers of that circuit can be separated into r fragments $F'_1, \dots, F'_r$ by removing at most $(2p - 1)$ sets of at most $\kappa$ parallel wires. Furthermore, $\mathrm{supp}(F'_j) = \mathrm{supp}(F_j)$ for all $j \in [r]$.

Proof. First consider the (intact) single-layer circuit, where each gate is labelled by the fragment to which it belongs when it is separated. That is, we assign a label $j$ to each gate if it is in the fragment $F_j$ for some $j \in [r]$. We may construct a new directed multi-graph by contracting all the edges (wires) incident on gates with the same label. This is the communication graph in Ref. [46], so we adopt this terminology here as well. The edges in the communication graph represent those edges which separate the original circuit into r fragments when removed. We then choose an identical partition of the gates in each of the p repeated layers, resulting in a sequence of p identical circuits whose gates are labelled using the labels $1, \dots, r$. The partition of the gates in the p-layer circuit is then defined by these labels, i.e., the fragment $F'_j$ contains all gates labelled by $j$, for each $j \in [r]$. It is clear that $\mathrm{supp}(F'_j) = \mathrm{supp}(F_j)$ for all $j \in [r]$.
Let us now turn to the bound on the number and size of sets of parallel wires in the p-layer communication graph. There is a contribution of p sets of $\kappa$ parallel wires in the communication graph since each layer adds sets of $\kappa$ inter-fragment wires. It remains to bound the number of wires in the communication graph connecting gates in two different layers. Suppose the layers are labelled $1, \dots, p$. For any $i \in [p - 1]$, the wires corresponding to the $a$-th qubit connect gates with different labels between layers $i$ and $i + 1$ if and only if the final gate acting upon the $a$-th qubit in the first layer has a different label from the first gate acting upon the $a$-th qubit. This happens only if there is a wire in the single-layer communication graph which corresponds to the $a$-th qubit. Repeating this reasoning for each qubit leads to a contribution to the total number of multi-graph edges that is bounded from above by $\kappa$ for each $i \in [p - 1]$. Furthermore, the wires connecting gates in layer $i$ to layer $i + 1$ are parallel by definition, and so these can be arbitrarily partitioned into at most sets of at most $\kappa$ wires for each $i \in [p - 1]$. In total, we have at most $p + (p - 1) = 2p - 1$ sets of at most $\kappa$ parallel wires which, upon removal, yield the desired fragments $F'_1, \dots, F'_r$.

Lemma D.2. Suppose there exists a partition of the edge set $E(G)$ into r disjoint subsets $E_1, \dots, E_r$ and let $g_1, \dots, g_r$ denote the subgraphs of $G$ corresponding to these subsets of edges, respectively. Also, let $1 \le \gamma \le \binom{r}{2}$ denote the number of pairs of indices $i, j \in [r]$, $i < j$, such that $V(g_i) \cap V(g_j) \neq \emptyset$. Then there exists a single-layer QAOA circuit for Max-Cut on $G$ that can be separated into fragments $F_1, \dots, F_r$ by removing $\gamma$ sets of at most $\kappa$ parallel wires, where $\kappa := \max_{i \neq j} |V(g_i) \cap V(g_j)|$ and $\mathrm{supp}(F_j) = V(g_j)$ for all $j \in [r]$.

Figure: The graph on the left is the input to the Max-Cut problem, which has a balanced vertex separator given by the vertices in yellow. The two-qubit gates are applied in an order which respects the partition of the edges in the graph. We apply the resolution of the identity for circuit cutting on the two wires indicated by the dashed line.
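The quantities in Lemma D.2 are straightforward to compute from an edge partition. The sketch below is illustrative only: it assumes that $\gamma$ counts the pairs of subgraphs sharing at least one vertex and that $\kappa$ is the largest such overlap, and the function name and example graph are hypothetical.

```python
from itertools import combinations


def cut_parameters(edge_subsets):
    """Return (gamma, kappa, support sizes) for a partition of a graph's edges.

    edge_subsets: list of edge lists, one per subgraph g_j, each edge a (u, v) pair.
    """
    vertex_sets = [{v for edge in edges for v in edge} for edges in edge_subsets]
    overlaps = [len(a & b) for a, b in combinations(vertex_sets, 2)]
    gamma = sum(1 for size in overlaps if size > 0)  # pairs sharing vertices
    kappa = max(overlaps, default=0)                 # widest set of cut wires
    return gamma, kappa, [len(v) for v in vertex_sets]


# Two triangles joined at a single separator vertex (vertex 2).
g1 = [(0, 1), (1, 2), (0, 2)]
g2 = [(2, 3), (3, 4), (2, 4)]
gamma, kappa, supports = cut_parameters([g1, g2])
```

For this example a single wire is cut once per layer and each fragment acts on three qubits, so the five-qubit problem can be handled by three-qubit sub-circuits.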
By Lemma D.2 a single layer of the QAOA circuit ansatz can be separated into fragments $F_1, F_2$ by removing a single set of at most $\kappa$ parallel wires, and $|\mathrm{supp}(F_j)| = |V(g_j)| \le 5n/6$ is a bound on the number of qubits in the support of fragment $j$. Also, by Lemma D.1, p layers of this circuit ansatz can be separated into fragments $F'_1, F'_2$ by removing at most $2p - 1$ sets of $\kappa$ parallel wires, with $|\mathrm{supp}(F'_j)| \le 5n/6$ for $j = 1, 2$. Then, since the cost operator $C$ for the Max-Cut problem is a diagonal observable, the expected cost $\langle C \rangle$ can be written as by Eq. (5). Here, $\langle C \rangle_{z_1, \dots, z_{2p-1}}$ is the expected value of a suitably modified circuit in which the channel $\Psi_{z_i}$ is applied to the $i$-th group of parallel wires for each $i \in [2p - 1]$, and $z_1, \dots, z_{2p-1}$ are Bernoulli random variables as described in Lemma 2.1. The modified circuits can be executed on two quantum devices with at most $\max_j |\mathrm{supp}(F'_j)| \le 5n/6$ qubits and classical communication.
Using the random Clifford strategy described in the proof of Theorem 2.2, each of these circuits has at most $\mathrm{poly}(p\kappa|E|)$ gates. Measuring in the computational basis allows one to construct an unbiased estimator of the expectation value $\langle C \rangle$ in a manner analogous to Line 7 in Algorithm 1. This estimator is bounded in magnitude by $(2^{\kappa+1} + 1)^{2p-1}$ with certainty, so by Hoeffding's Inequality $(2^{\kappa+1} + 1)^{2(2p-1)}/\varepsilon^2 \le 4^{(2p-1)(\kappa+2)}/\varepsilon^2$ repetitions of this procedure suffice to estimate $\langle C \rangle$ to within additive error $\varepsilon$, with high probability.
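The two repetition counts quoted above can be compared directly. This is a small sketch with constant factors and the failure probability omitted, as in the text; the function name is hypothetical.

```python
def qaoa_cut_repetitions(p, kappa, eps):
    """Return (tight, loose) repetition counts for p layers and separator size kappa."""
    estimator_bound = (2 ** (kappa + 1) + 1) ** (2 * p - 1)
    tight = estimator_bound ** 2 / eps ** 2
    # Loose form, using 2**(kappa + 1) + 1 <= 2**(kappa + 2).
    loose = 4 ** ((2 * p - 1) * (kappa + 2)) / eps ** 2
    return tight, loose


tight, loose = qaoa_cut_repetitions(p=2, kappa=1, eps=0.1)
```

Both expressions are of the form $2^{O(p\kappa)}/\varepsilon^2$; the loose form is simply easier to read as an exponent in $p\kappa$.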

E Circuit cutting complexity for QAOA with clustered graphs
We consider the execution of a multi-layered circuit resulting from encoding a problem graph of the type introduced in Sec. 3.1 for Max-Cut using QAOA. There are two key components of the simulation to consider: the number of unique sub-circuits $N$ required for circuit cutting, and the number of qubits $m$ in each fragment circuit. For a circuit of p layers and a problem graph of r clusters, n nodes within each cluster, and k vertex separators, we find:

$$N = N(p, r, k) = 3^{pk}\, 4^{(p-1)k} + (r - 2)\, 12^{(2p-1)k} + 3^{(p-1)k}\, 4^{pk}.$$

The number of qubits required to simulate any fragment is upper bounded by $m = n + (3p - 1)k$ (including additional auxiliary qubits due to mid-circuit measurements). In Sec. 3.4, by treating a GPU as a simulator analog of a quantum hardware device, we aim for one GPU per fragment execution, i.e., $m = n + (3p - 1)k \le 30$.

Figure 9: Execution time and relative speed-up vs. the number of GPU nodes for an execution of a 79-qubit QAOA circuit. The input problem graph parameters are fixed at p = 1, r = 3, n = 25, and k = 2, while the number of GPU nodes used in the distributed execution of the fragment circuits is varied. The ideal linear scaling relationship is also provided with reference to the single-node performance, and illustrates the divergence from ideal strong-scaling behaviour at large node numbers.
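As a quick sanity check, the counting formulas of this appendix can be evaluated for the parameter choices used in the large-scale simulations. The function names below are hypothetical, and `total_qubits` encodes an assumption about the graph family (r clusters of n nodes chained by r − 1 separators of k nodes) that is consistent with the circuit sizes quoted in the text.

```python
def num_subcircuits(p, r, k):
    """Number of unique sub-circuits N(p, r, k) for the clustered-graph family."""
    return (
        3 ** (p * k) * 4 ** ((p - 1) * k)
        + (r - 2) * 12 ** ((2 * p - 1) * k)
        + 3 ** ((p - 1) * k) * 4 ** (p * k)
    )


def fragment_qubits(p, n, k):
    """Upper bound m on qubits per fragment, including auxiliary qubits."""
    return n + (3 * p - 1) * k


def total_qubits(r, n, k):
    """Assumed size of the uncut circuit: r clusters plus (r - 1) separators."""
    return r * n + (r - 1) * k


for r in (3, 4, 5):
    print(r, total_qubits(r, n=25, k=1), num_subcircuits(p=2, r=r, k=1))
```

With p = 2, n = 25 and k = 1 this reproduces the 77-, 103- and 129-qubit circuits with 1812, 3540 and 5268 fragments, each fitting within the 30-qubit limit of a single GPU, and it makes the linear growth of N in r explicit: each additional cluster adds $12^{(2p-1)k}$ sub-circuits.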

F Analyzing the efficiency of simulated circuit cutting
This appendix considers how the simulation run time scales for performing circuit cutting on QAOA circuits, for the problem graphs introduced in Sec. 3.1, as the number of classical resources is varied.
Increasing the number of GPUs used for the parallelized execution of the N fragment sub-circuits demonstrates reasonable strong-scaling behaviour [54], as it decreases the overall runtime for a QAOA circuit execution. However, given that the number of circuit executions per GPU reduces, we reach approximately 15-times speed-up using 32 GPU-nodes as depicted in Fig. 9, yielding an efficiency of approximately 47% at our largest scale evaluation. These inefficiencies are to be expected since increasing the number of GPU nodes also increases the time needed for scheduling, transmission, device setup overheads, and serial execution components which all contribute to the overall runtime.
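The efficiency figure quoted above is the usual strong-scaling ratio of achieved speed-up to node count. A minimal sketch, with hypothetical function names and using only the numbers stated in the text:

```python
def speedup(t_single, t_parallel):
    """Relative speed-up with respect to the single-node runtime."""
    return t_single / t_parallel


def parallel_efficiency(achieved_speedup, nodes):
    """Strong-scaling efficiency: achieved speed-up divided by the ideal one."""
    return achieved_speedup / nodes


eff = parallel_efficiency(15, 32)  # approximately 0.47
```

An efficiency of 1 would correspond to the ideal linear scaling line in Fig. 9; the gap reflects the scheduling, transmission, and setup overheads listed in the text.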