Well-conditioned multi-product formulas for hardware-friendly Hamiltonian simulation

Simulating the time-evolution of a Hamiltonian is one of the most promising applications of quantum computers. Multi-Product Formulas (MPFs) are well suited to replace standard product formulas since they scale better with respect to time and approximation errors. Hamiltonian simulation with MPFs was first proposed in a fully quantum setting using a linear combination of unitaries. Here, we analyze and demonstrate a hybrid quantum-classical approach to MPFs that classically combines expectation values evaluated with a quantum computer. This has the same approximation bounds as the fully quantum MPFs, but, in contrast, requires no additional qubits, no controlled operations, and is not probabilistic. We show how to design MPFs that do not amplify the hardware and sampling errors, and demonstrate their performance. In particular, we illustrate the potential of our work by theoretically analyzing the benefits when applied to a classically intractable spin-boson model, and by computing the dynamics of the transverse field Ising model using a classical simulator as well as quantum hardware. We observe an error reduction of up to an order of magnitude when compared to a product formula approach by suppressing hardware noise with Pauli Twirling, pulse efficient transpilation, and a novel zero-noise extrapolation based on scaled cross-resonance pulses. The MPF methodology reduces the circuit depth and may therefore represent an important step towards quantum advantage for Hamiltonian simulation on noisy hardware.


Introduction
A major goal of quantum mechanics is to accurately describe how a given quantum system evolves with time. This was the original incentive argued by Feynman in 1981 to develop quantum computers [37]. More specifically, Hamiltonian simulation seeks to implement the operator e −iHt as accurately as possible with an efficient quantum algorithm, which allows to study the time dynamics of a corresponding quantum system. Here, H is a Hermitian operator and efficient means that the running time should be polynomial in the size of the corresponding system, i.e., polylog(N ) for a N × N Hamiltonian. For example, in the context of the spin-boson model this would enable the study of decoherence models at scale [60,66].
Hamiltonian simulation can be done with system specific machines in which the hardware natively implements H, as exemplified by cold atoms [43], or with general purpose gate-based quantum computers. System specific machines allow large-scale simulation but only for the subset of Hamiltonians they natively support. Accurate Hamiltonian simulation on gate-based quantum computers is however challenging since e −iHt has to be implemented on hardware with a native gate set and limited connectivity [87].
Many quantum algorithms have been proposed to simulate Hamiltonians since Feynman's talk [66]. The choice of the algorithm depends on the structure of the Hamiltonian [19,20], e.g., whether it is sparse or dense, time-dependent [5,8] or time-independent, and whether the time evolution is real or imaginary [35,54].
Here, we focus on real-time simulations of a time-independent sparse Hamiltonian. Then, the two main approaches are variational quantum algorithms (VQA) [28,61,89,91] and product formulas (PF) [2,3,7,21,24,63]. VQAs do not implement the operator e −iHt but instead create a parameterized variational ansatz |ψ(θ)⟩ which approximates the state |Φ(t)⟩ governed by the Schrödinger's equation [61]. The quantum computer is then used to determine the coefficients in an equation of motion (EOM) that is classically integrated. The depth and structure of the circuit determine how well the variational approach can approximate the true state evolution. A priori, this is not clear, and we have to rely on error bounds that can be evaluated a posteriori to improve the circuit design [91]. Further, VQAs must compute the coefficients in the EOM to high accuracy. This requires a large number of circuit executions rendering the VQA approach in its current implementation unlikely to lead to a quantum advantage for timedependent problems [66]. First steps towards overcoming this computational cost could include approximating the Quantum Fisher Information matrix using simultaneous perturbation stochastic approximation techniques [40]. In addition, the cost of estimating complex observables can be reduced with positive operator valued measures [38,41] or classical shadows [44,51]. In contrast, PFs are analytical techniques which perform well in practice and implement e −iHt as an operator by breaking the Hamiltonian into sub-Hamiltonians, each simulated as a circuit for a small time-step. The circuit depth grows polynomially with the inverse approximation error and the simulation time. Interestingly, the existing error bounds for PFs can be several orders of magnitude larger than measured errors, even for small systems [25,26,46,59], which suggests that further investigation is needed.
A promising candidate to replace PFs are multi-product formulas (MPF) [10,23], which can be seen as a form of Richardson extrapolation [13,71,74]. Here, we show how to use MPFs to mitigate algorithmic errors in PFs by classically combining expectation values instead of implementing a quantum Linear Combination of Unitaries (LCU) [23], which was the original quantum algorithm proposed in 2012. We run a PF several times on a gate-based quantum computer with a different number of Trotter steps to estimate the same expectation value. Then, we linearly combine these estimates on a classical computer into a better estimate of the expectation value, which would otherwise only be obtainable with a large number of Trotter steps, i.e., significantly deeper circuits.
Our work makes both an algorithmic and a hardware contribution to Hamiltonian simulation with MPFs. We now describe how we build on existing references. The idea to classically combine expectation values to mitigate algorithmic errors was first introduced in 2018 in Ref. [34] applied to Hamiltonian simulation. Next, Ref. [83] showed that such a classical combination reduces the resources needed to implement the HHL algorithm [47]. In particular, Ref. [34] noticed that Richardson extrapolation can mitigate both physical [61,79] and algorithmic errors, and showed promise in numerical experiments performed with two and three extrapolation points. In 2019 Ref. [46] noticed that a non-careful choice of sequences of Trotter steps leads to exponentially small probabilities of success for MPFs implemented as an LCU. Our work generalizes the classical combination which Ref. [83] applied to tridiagonal Toeplitz symmetric matrices to arbitrary Hamiltonians. We further improve upon Ref. [83], which focused on the HHL algorithm and thus a Hamiltonian simulation embedded in the quantum phase estimation algorithm, by showing that the classical combination is applicable to observables arising in a general Hamiltonian simulation. We also solve the vanishing success probability issue raised by Ref. [46]. We also formalize the classical combination of expectation values for Hamiltonian simulation of Ref. [34], our contribution is an algorithm to create well-conditioned MPFs implemented as a classical combination of expectation values. Here, we explain how ill-conditioning, presented in Ref. [46], increases non-algorithmic errors when classically combining expectation values. The result of our work is a hardware-friendly MPF. Its circuit depth is quadratically lower than MPFs implemented as an LCU [46] since the individual product formulas in the MPF are independently run on the hardware. The ill-conditionning is avoided by a careful choice of Trotter exponents. We compare the different algorithms in Table 1.
Crucially, we implement for the first time Hamiltonian simulation by MPF on quantum hardware. We observe an algorithmic error reduction of up to an order of magnitude while suppressing hardware noise [57] with Pauli Twirling, pulse efficient transpilation [32], and a novel zero-noise extrapolation (ZNE) [61,79] based on scaled cross-resonance pulses [76]. This last method is similar to digital ZNE [31,42,49] but allows for a continuous range of stretch factors without the need for costly pulse-calibration. We illustrate our work by computing the dynamics in simulation and on quantum hardware of the transverse field Ising model; a model ubiquitous in condensed matter and statistical mechanics [72]. The Ising Hamiltonian is where J denotes the interaction strength between nearest neighboring spins ⟨i, j⟩, h the transverse magnetic field, and X and Z are Pauli operators. This model is often studied on both gatebased [17,57] and analogue quantum simulations [6,12,90]. Additionally, we extend the scaling predictions made by Ref. [66] of the resources needed to simulate a classically intractable spinboson model by a first-order PF and VQA. We show that in this setting our method requires exponentially less resources with respect to the system size than a first-order PF.
The main limitation of PF based approaches, including this work, is that the number of gates depends linearly on the number of terms L in the Hamiltonian, which can grow very large for chemistry applications [86]. Using randomization, the qDRIFT protocol [15] addresses this problem by trading accuracy for an L-independent algorithm. Hybrid approaches that interpolate between qDRIFT and PFs improve the asymptotic complexity with respect to both methods [45,68]. However, they do not improve the scaling with respect to the simulation time nor the precision. Randomizing the implementation of the LCU has also been explored. For example, Ref. [36] shows a drastic reduction in circuit depth by implementing the LCU on average and increasing the number of circuits. However, this method still requires two controlled PFs per circuit. Alternatively, Ref. [85] presents a randomized phase estimation algorithm based partially on the LCU approach and has an L-independent complexity. Although only one ancilla is required and most of the complexity is handled by the sampling, the circuits still require O(ϵ −2 ) controlled Pauli rotations, where ϵ is the desired precision. The MPF approach we present here is suitable for simulating general dynamics, in contrast to Ref. [85], and for applications that require higher precision, since we achieve more accurate results by increasing the number of shallow circuits we run. This paper is structured as follows. In Sec. 2 we review the theory of PFs and then show how to implement well-conditioned MPFs by classically combining the outcome of multiple PFs. Further, in Sec. 3 we update the resource estimate for spinboson models in Ref. [66] with our MPF approach to illustrate that we can achieve the same approximation accuracy as a first-order PF using shallower circuits. In Sec. 4, we discuss error mitigation strategies for quantum hardware and then illustrate our MPF method on fixed-frequency superconducing transmon qubits by simulating an Ising model. We conclude our work in Sec. 5 and provide suggestions for further research. Table 1: Second-order PF compared to different MPFs. The last row corresponds to the algorithm presented in this paper. l denotes the number of extrapolation points, k 1 < . . . < k l to the Trotter exponents, and ε ′ to the nonalgorithmic errors such as physical or sampling noise. We assume that the MPFs are built from S 2 , that l ≤ 15, and that the simulation time satisfies t ≤ 1. The harmonic Trotter sequence corresponds to [k 1 , k 1 + 1, . . . , k 1 + l − 1] and exemplifies a generic ill-conditioned sequence that may arise from a non-careful choice of Trotter exponents, such as proposed in Refs. [23,34], where no particular sequences were specified. The constant factor of the circuit depth scaling inside the O(·)-notation for LCU implementations can be very large once the circuit is transpiled to the hardware due to the limited connectivity and the many controlled gates.

Algo. Combination
Trotter seq. #circuits Max. circ. depth Success prob. Approx. error ∥⃗ a∥ 1 well-conditioned [46] O l 2 log(l) When the duration t of the time evolution is long we approximate e −iHt by splitting t into k equal Trotter steps of length t/k and apply a product formula S χ in each interval such that The last equality follows from the triangle inequality. As the order χ of the PF increases the number of Trotter steps k required to reach a target error decreases. However, the depth of the quantum circuit scales exponentially with χ [23,78]. Therefore, choosing the optimal formula for a given problem is a non-trivial task further complicated by the lack of tight analytical bounds [25]. Moreover, even if one could find the PF that minimizes the resources needed for a given Hamiltonian, it would still require deep circuits, rendering the PF approach very challenging to implement on noisy quantum hardware.
Multi-product formulas, which can be seen as a form of Richardson extrapolation, are a promising candidate to replace product formulas [10,23]. MPFs only require a sum of polynomially many unitary operations to achieve the same accuracy as PFs with exponentially many unitaries would. They achieve this by linearly combining PFs of the same order χ but varying Trotter exponents k to cancel lower order error terms. Definition 2.2 (Multi-product formula). For l ∈ N, we define the Trotter exponents as ⃗ k = (k 1 , . . . , k l ) where the k i 's are distinct natural numbers, and the extrapolation weights as ⃗ a = (a 1 , . . . , a l ) with a i ∈ R. Then, the multi-product formula M l,χ (t) is where χ, S χ are as in Def. 2.1. To cancel the first l − 1 error terms, the extrapolation weights must satisfy l j=1 a j = 1, and where η = χ+2n for symmetric product formulas and η = χ + n otherwise, for n = 0, ..., l − 2. For more details, see Appendix A.

Circuit implementation
Ref. [23] introduced MPFs as a quantum algorithm to approximate Hamiltonian simulation. The linear combination in Eq. (5) is translated to a quantum circuit by the Linear Combination of Unitaries (LCU) method, illustrated in Fig. 1.
Asymptotically, this method with k j = j, i.e. the minimal sequence with l different natural numbers, implements O l 2 Trotter steps and approximates the Hamiltonian better than PFs with as many Trotter steps. More precisely, MPFs require exponentially fewer Trotter steps [27] to achieve the same accuracy. However, the LCU implementation is probabilistic with a success probability that decreases exponentially as e −Ω(l) . Furthermore, even the shallowest circuits are too deep to implement on noisy hardware since the LCU method sums Trotter steps of varying order as controlled unitaries, requiring additional qubits and controlled operations. For example, simulating a five linearly connected spin Ising model with an MPF with ⃗ k = [k 1 , k 2 , k 3 ] implemented by an LCU requires 108(k 1 +k 2 )+40k 3 +4 CNOT gates, see Appendix B. While Eq. (5) is a linear combination of operators, in practice we measure operator expectation values after a quantum circuit. Crucially, as we show here and illustrate in Fig. 1, the linear combination in MPFs can thus be done classically. This allows to parallelize the circuits implementing the PF for different Trotter steps, resulting in shallow circuits with O(l) Trotter steps, which is better for noisy hardware. We therefore implement each S k j χ in Eq. (5) as an independent quantum circuit. As shown in Appendix C we can substitute the S χ 's in Eq. (5) by their expectation values. More precisely, computing an expectation value E after approximately simulating e −iHt with the PF S k j 2χ (t/k j ) yields an approximation E j of the expectation value E for each Trotter exponent k j with an approximation error related to χ and k j by Eq. (4). Finally, combining the E j 's as gives an estimation of E where the remaining asymptotic error depends on the base PF, see Appendix A.

Choosing Trotter exponents
MPFs can achieve exponentially lower algorithmic errors than PFs while requiring similar circuit depths. However, a poor choice of the Trotter exponents and extrapolation weights amplifies other error sources such as physical or sampling noise so that the error in the final quantity is larger than the largest algorithmic error in the MPF. This choice-of-sequences issue was raised in Ref. [46] where the authors noted that choosing the arithmetic progression for the Trotter exponents k j = j, which gives the shortest circuits, LCU: Our method: ... Figure 1: Implementation of MPFs. The LCU circuit probabilistically implements the linear combination j S kj 1 (t/k j ) and is successful if the measurement outcome of the top two qubits is 0. Our method implements each of the PFs as a separate quantum circuit and calculates the linear combination classically. Running the PFs separately instead of in sequence circumvents the need of auxiliary qubits and significantly reduces the circuit depth, especially as the number of terms in the linear combination increases.
leads to an ill-conditioned MPF, i.e. ∥⃗ a∥ 1 = e Ω(l) where ∥⃗ a∥ 1 := l j |a j |. When implemented as a quantum circuit as an LCU this ill-conditioning translates into an exponentially small success probability scaling as O ∥⃗ a∥ −2 1 . Since our approach is not probabilistic, i.e., we classically compute the linear combination in Eq. (7) rather than implementing M l,χ as an LCU, the condition number ∥⃗ a∥ 1 amplifies any error that is not the Trotter error by a factor of ∥⃗ a∥ 1 . More precisely, suppose that in Eq. (7) we calculated the expectation values E j with a quantum computer and we have several error sources. If ε(k j ) denotes the algorithmic error of the Trotter exponent k j and other sources arising from, e.g., sampling or physical noise are labelled ε ′ , then Now, the result of the MPF is Eq. (10) shows that if ∥⃗ a∥ 1 is large, then ε ′ is amplified and as a result, the sum in Eq. (9) can be a worse approximation to E than the E j 's. Appendix D shows a toy example illustrating the asymptotic behaviour, where ε ′ is the sampling noise of a Bernoulli variable. Here, the increase in non-algorithmic errors results in an increase in the sampling overhead.
Ref. [46] provides choices of ⃗ k for which ∥⃗ a∥ 1 remains "small" to avoid exponentially low success probabilities, i.e., ill-conditioned MPFs. The authors minimize either ∥⃗ a∥ 1 ∥ ⃗ k∥ 1 or ∥⃗ a∥ 1 and provide the corresponding values for χ = 2 and 4. Their method requires O l 2 log l Trotter steps to implement the LCU instead of O l 2 for k j = j. By contrast, our classical combination of expectation values needs only O(l log l) Trotter steps for the longest circuit. Importantly, the condition number decreases exponentially from ∥⃗ a∥ 1 = e Ω(l) to ∥⃗ a∥ 1 = O(log l). Furthermore, Tab. I in Ref. [46] shows that for an MPF based on S 2 and a practical value of l ≤ 15, the condition number ∥⃗ a∥ 1 < 1.7 and is therefore negligible. To avoid increasing non-algorithmic errors, see Eq. (10), we chose the Trotter exponents by minimizing ∥⃗ a∥ 1 under the constraints in Eq. (6) for a given possible range of the k j 's. The algorithm is described in Appendix E.
We illustrate how to choose the Trotter exponents with the Ising model for five linearly connected spins, see Eq.
Therefore, e −iH 1 t and e −iH 2 t are implemented with layers of two-qubit R ZZ and single-qubit R X rotations, respectively. As benchmark we compute the exact local magnetization ⟨Z 0 ⟩ * for h = 1, J = 0.5 and t = 0.5 by numerically exponentiating H 1 + H 2 . Next, we compute the local magnetization ⟨Z 0 (k j )⟩ with a first-order PF and k j Trotter steps. We calculate the relative error |⟨Z 0 (k j )⟩ − ⟨Z 0 ⟩ * |/|⟨Z 0 ⟩ * | which we fit to k −1 j , see the blue dots and dashed line in Fig. 2. The condition number of an MPF based on S 1 grows faster than for S 2 , see Appendix D. Therefore, we empirically set the threshold defining a well-conditioned MPF to ∥⃗ a∥ 1 ≤ 3 instead of 1.7. Thus, we compute well-and ill-conditioned MPFs, with ∥⃗ a∥ 1 ≤ 3 and ∥⃗ a∥ 1 > 3, respectively, with different Trotter exponents by classically combining the expectation values ⟨Z 0 (k j )⟩.
The values of ⃗ a are given in Appendix E. In the absence of non-algorithmic errors, i.e., ε ′ = 0, all MPFs reduce the error compared to the benchmark, see triangles in Fig. 2. Next, we artificially add a fixed perturbation ε ′ = 10 −3 fol-7URWWHUVWHSVk lowing ⟨Z 0 (k j )⟩ → ⟨Z 0 (k j )⟩ + sign(a j )ε ′ before performing the linear combination. The wellconditioned MPFs still reduce the error compared to ill-conditioned sequences which are worse than the benchmark, see stars in Fig. 2. For example, the noisy S 1 -based MPF [2, 4] whose deepest circuit requires k 2 = 4 Trotter steps has a lower error than the PF with k = 8 Totter steps. This is a factor of two reduction in circuit depth.

Spin-boson model
The spin-boson model [60] is a model of quantum dissipation. It gives insights into the transition between coherent and incoherent behaviour as well as a phase transition suppressing quantum tunneling. Its dynamics can be simulated with analogue simulators [33,52] or gate-based quantum computers [66]. In Ref. [66] the authors compare the performance of VQAs based on McLachlan's variational principle [65] against a first-order PF to simulate the dynamics of the spin-boson model. The authors made scaling predictions for a classically intractable system for both, VQA and first-order PFs. They conclude that VQA in its current form is unlikely to lead to a quantum advantage due to the large number of circuit executions, although it is assumed to have shallower circuits than first-order PFs. Further, they show that first-order PFs that achieve the desired accuracy lead to very deep circuits, which might render them infeasible as well. We now extend their results by adding second-order PFs and MPFs to the comparison, showing a significant circuit depth reduction. More precisely, the spin-boson Hamiltonian 11) describes a two-level system, such as an atom, coupled to a bath of M bosons [30,39]. Here, the bosonic operators a † k (a k ) create (annihilate) harmonic basis states with eigenfrequencies ω k . The Pauli matrices X, Z act on the state of the spin with eigenfrequency ω s and tunneling rate ∆. The coupling strength between the spin and the k th bosonic mode is g k . As in Ref. [66] we consider the resonant case ω k ≡ ω, g k ≡ g and ultrastrong coupling, i.e. g/ω = 0.5 with ω s = −1 and ∆ = 0. We consider different spin-boson models in which the M bosonic modes are truncated at different occupation numbers n max .
As in Ref. [66], we are interested in t = 10. Since we suspected that the algorithm would also work for sequences with t/k 1 > 1, we first computed several sequences of Trotter exponents with l = 2, 3 and 4. These sequences and their corresponding extrapolation weights satisfied ∥⃗ a∥ 1 ≤ 2 to ensure that errors arising from noise or sampling are amplified by at most a factor of 2. Next, we computed the approximation error ε for each MPF and selected the one which required less resources to achieve the target accuracy ε t . We observed that this method works well in practice even when t/k 1 > 1, confirming that the analytical error bounds on PFs are not tight. Furthermore, we found that the second-order PF always requires shallower circuits than the PFs S 4 and S 6 when used in an MPF. We also found that the PF S 2 alone always requires shallower circuits than S 1 . For small spin-boson systems we see a reduction of up to a factor of eight in circuit depth, see Appendix F.
We derive the scaling of the MPF approach taking into account the dependence of the er-ror on the number of qubits N q and the number of different Trotter exponents l. First, the circuit depth of the spin-boson model grows linearly with N q [66]. Next, as a worst case estimate we set α = t such that t/k j ≤ 1 for j = 1, · · · , l so that we can bound the error in the MPF by O t 2l+1 /(2l + 1)! [46,83]. A short calculation detailed in Appendix F shows that our hybrid scheme requires the deepest circuit to have Trotter steps instead of the O √ t 3 N 2 q /ε t required by S 2 . A more careful choice of Trotter exponents and α may allow better error bounds. Although the sampling overhead of ε −2 t , not shown in Eq. (12), cannot be avoided in either approach, the depth of the MPF circuits grows logarithmically instead of polynomially in ε −1 t . Our method is advantageous for large systems as it reduces the linear dependency of errors on N q to a logarithmic one. Ref. [66] concluded that first-order Trotter formulas are a better approach than VQA to simulate spin-boson models of intractable system sizes. However, S 1 exhibits a scaling linear in N q , making such a simulation challenging to implement on noisy hardware. MPFs implemented as a linear combination of expectation values are therefore promising to simulate classically intractable quantum systems.

Ising model computation on hardware
Quantum computing hardware has increased in scale and quality [53]. However, gate errors on cross-resonance based hardware are nonnegligible. For example, a typical CNOT gate on IBM Quantum hardware has an average error ranging from 0.5% to 1%. This makes it challenging to implement MPFs on noisy quantum hardware as hardware-related errors ε ′ must be kept comparable to or smaller than the total Trotter error which is close to 10%, see Eq. (10) and Fig. 2. We can however suppress hardware errors in the measured observables with noise mitigation methods which we describe in Sec. 4.1 -4.3 in the same order as we apply them as Qiskit transpiler passes [70]. Hardware results are in Sec. 4.4. They show a factor of six reduction in circuit depth of an MPF implementation over a PF.

Pauli Twirling
Pauli twirling (PT) converts an arbitrary noise channel into a stochastic Pauli error channel, suppressing off-diagonal coherent error contributions [57,84]. We first apply PT to the R ZZ gates, the dominant error source, to convert their noise channels into a stochastic Pauli error channel. PT is implemented by sandwiching each R ZZ gate between randomly sampled Pauli gates chosen so that the net operation of the circuit is unaffected in the absence of noise. The gate noise is reduced to a stochastic form by averaging the results of N PT ∈ N logically equivalent random circuits [14,17,48,57,84]. As in Ref. [57], we randomly sample twirling gates from the set . We apply the same Pauli gates before and after the R ZZ (θ) gate, see Fig. 3(a) and (b). Each quantum circuit is executed with N/N PT shots on that hardware where N is the total shot budget.

Pulse efficient transpilation and calibration
Second, we use a pulse-efficient circuit transpilation which replaces the R ZZ (θ) gates by R ZX gates with echos exposed in the circuit as presented in Ref. [32,76], see Fig. 3(c). This avoids the onerous double CNOT implementation of R ZZ (θ). The R ZX gates are implemented by scaling the hardware native cross-resonance (CR) pulses [73,77] found in the calibrated schedules of the CNOT gates. Since we run a Trotter simulation where the rotation angles θ are small the flat-top of the square pulses with Gaussian flanks vanish. We must therefore implement R ZX (θ) with Gaussian CR pulses of varying amplitude. To overcome non-linear effects we calibrate the amplitude of the CR pulses, see Appendix H.

Zero-noise extrapolation with scaled crossresonance gates
The noise in quantum computers can be mitigated with zero-noise extrapolation (ZNE) which runs logically equivalent circuits with different noise levels [61,79]. Ideally, such circuits are implemented by reducing the amplitude and increasing the duration of all the underlying pulses to increase the noise [56]. This requires extensive gate calibration, which is impractical and reduces the availability of a cloud-based quantum computing system. To overcome this limitation digital ZNE replaces each CNOT gate with 2n + 1 CNOT gates [16,31,75]. This, however, limits the possible stretch factors to c = 2n + 1. Similarly, the Mitiq package [58] inserts pairs of CNOT gates but allows for fractional stretch factors by repeating selected gate layers instead of repeating all CNOT gates.
Here, we implement ZNE by adding the gate sequence R ZX (−θ c )R ZX (θ c ) after the first layer of R ZZ (θ) gates, see Fig. 3(d). Crucially, the rotation angle θ c ̸ = θ which allows us to create a continuous range of stretch factors by linearly scaling the width w of the echoed CR pulses in the R ZX (±θ c ) gates as discussed in Ref. [32,76], see Fig. 3(e). Finally, we calculate the effective stretch factor c of each circuit as the fraction between the duration of the stretched circuit and the duration of the original circuit, both without measurements. The smallest possible stretch factor c min we can implement, indicated in Fig. 4(b), occurs when the flat-top w of the CR pulses vanishes, see Fig. 3(e). By scaling only the R ZX gates rather than each gate in the circuit, it is possible to use a wider range of stretch factors for the same T 1 times of the hardware. Each observable we measure is then extrapolated to the zero-noise limit with an exponential fit ae −bc + d where c is the stretch factor and a, b, and d are fit parameters, see Fig. 4(b). We compare our ZNE method to the Mitiq package in Appendix I.

Results
We compute the dynamics of the Ising model with five linearly connected spins and J = 0.5, h = 1, and an evolution time of t = 0.5. To select five good qubits on ibmq_montreal we perform quantum process tomography [9,67] of scaled-CR R ZX (θ c )R ZX (−θ c ) gates and retain the five linearly connected qubits 13,14,16,19, and 20 as their gate fidelity decay visually resembles their T 1 decay, see Appendix J. We calculate the global magnetization ⟨Z⟩ by measuring the local magnetization ⟨Z i ⟩ of each qubit and take the average ⟨Z⟩ = 1 Fig. 4(a). As benchmark we compute the first-order Trotter formula S 1 with exponents k j ∈ {1, ..., 10} on a noiseless state-vector simulator and compare it to the We also ran the corresponding circuits with and without dynamical decoupling but did not see a meaningful effect. We elaborate on this in Appendix G. For each Trotter step we create eight random PT circuits by dressing the R ZZ gates with gates uniformly drawn from G ZZ . Each of these eight circuits is executed 20 times with different stretch-factors θ c linearly spaced in the interval [θ min c , θ max c ]. Here, θ max c is chosen such that the pulse-schedule of ten Trotter steps does not exceed the shortest T 1 time. For every θ c each of the eight circuits is evaluated with 10 5 /8 shots to keep the physical and sampling noise below ε ′ = 10 −2 . This process is repeated ten times to gather statistics. Therefore, a zero-noise extrapolated global magnetization ⟨Z(k j )⟩, indicated as green stars in Fig. 4(b) and (c), is computed with a total of 10 × 20 × 10 5 = 2 · 10 7 shots. Before computing each ⟨Z(k j )⟩ we calibrate the R ZX gates using the procedure in Appendix H. All the measured expectation values are corrected for readout errors with a tensor product of individual qubit calibration matrices [11,57]. The noise mitigation significantly increases the accuracy in the measured ⟨Z(k j )⟩, see Fig. 4(b). For up to four Trotter steps the measured ⟨Z(k j )⟩ agree well with the noiseless benchmark. At Trotter steps k j > 4 the ⟨Z(k j )⟩ have a lower error than predicted by the noiseless benchmark which is a coincidence.
We build the MPF estimation of the zero-noise extrapolated values of ⟨Z⟩ using Eq. (7). We select the Trotter exponents k j to combine and the corresponding weights a j following the optimization algorithm described in Appendix E, which also contains the values of ⃗ k and ⃗ a, with the threshold ∥⃗ a∥ 1 ≤ 3. Crucially, we observe that the MPF significantly reduces the error. For example, the combinations [1,2], [1,3], and [2,4] all have the same accuracy as a first-order formula with 24 Trotter steps, see green crosses in Fig. 4(c). This is a reduction in circuit depth by a factor of twelve to six over the product formula.

Conclusion
We present a hybrid quantum-classical approach to MPFs that requires neither additional qubits nor controlled operations as do previous approaches based on an LCU [23,46]. Our method approximates a Hamiltonian evolution with the same accuracy as PFs or MPFs implemented with an LCU but with significantly shallower circuits by implementing each unitary in the LCU as a standalone circuit. Therefore, even though the total number of matrix exponentials in both approaches is the same, the reduction in gate count for the deepest circuit can be substantial as illustrated by the l = 3 example for which our approach has 22 times less two qubit gates than an LCU implementation when ⃗ k = [1,2,7]. Re- ducing the gate and qubit count makes it possible to execute MPFs on noisy hardware that achieve accurate results, such as those shown in Fig. 4, that would not be possible if implemented as an LCU.
The fast convergence of MPFs to the exact operator e −iHt also means that the approximation error soon becomes smaller than the noise levels on the hardware. Error mitigation is thus crucial to suppress hardware noise. As the hardware and error mitigation techniques improve, the advantages of our MPF method will become increasingly noticeable. To mitigate errors we used PT and pulse-efficient transpilation and introduced an implementation of ZNE that provides a continuous range of stretch factors while simultaneously avoiding onerous pulse calibration. Crucially, this method overcomes the c ∈ {1, 3, 5, ...} limitation of the digital ZNE which rapidly becomes infeasible for deep circuits. With these error mitigation methods we demonstrated on hardware that MPFs can achieve the same accuracy as a PF but with up to a factor of 12 reduction in circuit depth for a five spin Ising Hamiltonian.
Our approach has shallower circuits than MPFs based on an LCU for t ≤ 1. Future work may investigate whether there is a regime for large values of t in which the LCU circuits are shallower than the classical combination of expectation values we propose. For example, since an MPF implemented as an LCU is one quantum circuit, it may be possible to repeat t/∆t times the circuit for a time ∆t. This may use less Trotter steps than a classical combination of expectation values. In addition, tighter bounds could be derived, since as mentioned in Sec. 3, our experiments showed that the scheme still works even when t/k 1 > 1. Regarding ZNE as presented in Sec. 4.3: a deeper study of the optimal number of stretch factors is warranted. Furthermore, we empirically found that inserting the gate sequence R ZX (−θ c )R ZX (θ c ) after the gates in the first layer of R ZZ gates allowed us to mitigate errors. However, the choices of the sequence and where to insert them are not unique. While we tried several possibilities, a more rigorous study is left as further work. We obtained our ZNE expectation values with an exponential fit using the effective stretch factors as the independent variable. These choices could be either studied empirically in more detail or backed up with theory.
We conjecture that our scheme is optimal when used with S 2 , where optimal means requiring the shallowest circuits to achieve a given approximation error. In summary, the linear combinations of expectation values makes it possible to use the MPF approach on noisy hardware. This allows us to reduce circuit depth by orders of magnitude which we believe is crucial to reach a quantum advantage for Hamiltonian simulation on noisy quantum hardware.

Acknowledgements
Here, the A n are matrices that depend only on commutators of the H j . In particular, they are independent of the time t and the Trotter exponents k j . Therefore, the lower-order error terms can be cancelled out by a careful choice of extrapolation weights a j that are independent of the Hamiltonian. The A n are thus irrelevant in subsequent calculations. For the MPF M l,1 (t) to approximate e −iHt without bias, the extrapolation weights must satisfy l j=1 a j = 1. Furthermore, to cancel the first l − 1 error terms in S 1 we impose the constraint l−1 j=1 a j /k 1+n j = 0 for n = 0, ..., l − 2. Therefore, the resulting error is given by

B Comparison with LCU
Here, we compare the circuit depth of MPFs implemented by classically combining expectation values to MPFs implemented as LCUs with l = 3 extrapolation points as example. The LCU method implements an operation proportional to the linear combination κU + V for some unitary gates U, V and κ ≥ 0. A linear combination of three unitaries is implemented recursively as Similarly, we can implement an operation proportional to 3 j=1 a j S k j 1 (t/k j ) by choosing the appropriate coefficients α 1 , α 2 ≥ 0, shown in Fig. 5. We count the number of CNOT gates for the Ising model described in Sec. 4.4 required by each approach. The first-order Trotter formula implementing a five qubit Ising model requires four R ZZ gates and five R X gates. Thus, the deepest circuit of a classical combination will contain 4k 3 R ZZ gates. The number of different gates required by an LCU implementation is given in Table 2. The table also shows the number of CNOT gates required to implement each Table 2: Number of times each gate is required by an MPF with l = 3 implemented as an LCU or as a classical linear combination. The second column shows the number of CNOT gates required to implement each gate. Here we do not include pulse-efficient transpilation. Here, we do not explicitly include pulse-efficient transpilation but treat the cost of the R Z Z gate as a single CNOT gate.
S k3 1 t k3 Figure 5: Circuit implementing an MPF with l = 3 as an LCU. For a five spin Ising model the LCU method therefore requires seven qubits for l = 3. Here, each first-order PF S kj 1 requires 4k j R ZZ gates.

C Extrapolation properties
A crucial step in our scheme is to compute expectation values with the S k χ 's in Eq. (5) instead of M l,χ , since it allows us to directly compute the extrapolated observable from the measured expectation values without measuring cross-terms in the MPF. Here, cross-terms correspond to ⟨ψ| S k l χ OS k j χ |ψ⟩ for k j ̸ = k l when computing the expectation of an observable O with Eq. (5). The proof included below for completeness follows similar arguments as those from Refs. [34,83].
Let S k j χ = U + E j , where U = e −iHt is the matrix exponential we want to approximate via a PF and E j = i α η i t η i +1 k η i j is the error incurred in the approximation, where α k denotes matrix commutators. Let O denote the observable of interest and a 1 , ..., a l the extrapolation weights that satisfy Eq. (6). We will prove that if where η l−1 < n ∈ Z, then l j=1 Eq. (16) shows that by ignoring cross-terms we still obtain an approximation with an O(k −n 1 ) error.
With S Next, the first order in E j error terms is where the last equality comes from combining Eq. (6), i.e. the first l − 1 error terms cancel out, and Eq. (15), i.e. α n t n+1 k n j = O(k −n 1 ). A similar result can be shown for ⟨ψ| E † j OU |ψ⟩ and ⟨ψ| E † j OE j |ψ⟩. Thus, Eq. (16) holds and extrapolation can be applied directly to the measured observables.

D Sampling noise of a Bernoulli variable
We illustrate the asymptotic behaviour of the error term O (∥⃗ a∥ 1 ε ′ ) from Eq. (9) restated below l j=1 We consider as example the task of sampling a Bernoulli discrete random variable X and computing the probability p that X = 1 by taking M samples. We thus have and ε(k j ) = 0 for all j. The number of samples allows us to control the sampling noise as ε ′ (M ) = 1/ √ M on average. We use the extrapolation weights ⃗ a of an ill-conditioned MPF built from the PF S 2 and by taking k j = j as the sequence of Trotter exponents. Indeed, this setting uniquely defines the sequences ⃗ a by Eq. (6) whose 1-norm scales exponentially with the number of extrapolation points l, i.e., ∥⃗ a∥ 1 ∈ e Ω(l) [46], see Fig. 6(a). We expect the sampling error to behave as l j=1 i.e. ill-conditioned sequences have an error that grows exponentially with l. Adding extrapolation points therefore increases the error, see Fig. 6(b).
Here, each line corresponds to a different l and we estimate E j (M ) := E j + ε ′ (M ) for j = 1, · · · , l by sampling M times from X . Thus, each point in the plot corresponds to the value This example illustrates how a non-careful choice of Trotter exponents exponentially increases the sampling error of a quantum algorithm. Note that the example above can be seen as a weighted average of l times M samples with weights ⃗ a. A simple average is recovered with a j = l −1 .

E Optimal Trotter exponents
We obtain sequences of Trotter exponents by solving an optimization problem in Python that minimizes ∥⃗ a∥ 1 . We set Eq. (6) as the constraints for the optimization problem. Then, from all the possible sequences of Trotter exponents we accept those for which the condition number returned by the optimization algorithm is below a set threshold. All the results presented here satisfy ∥⃗ a∥ 1 ≤ 3. The Trotter exponents and extrapolation weights used in the Ising simulation experiments are given in Table 3. They were obtained with DOCPLEX [80] using the code shown below.

F Quantum circuit scaling of the spinboson model
Mapping the spin-boson Hamiltonian to a qubitlattice shows that the approximation error of PFs grows linearly with the number of qubits N q as O N q t 2 /k j [18,22,66]. Now, we calculate for the spin-boson model the scaling of the deepest circuit needed by an MPF that classically combines expectation values. We start from the Taylor expansion of a symmetric PF where the A n are matrices that depend on the commutators of the H j [27,46]. Using this notation we bound the MPF error by O ∥A 2l+1 ∥t 2l+1 (2l+1)! [46,83]. Ref. [22] shows that ∥A n ∥ ≤ N q by the triangle inequality and by grouping the Hamiltonian terms H j in an evenodd pattern. We therefore update the MPF error Table 3: Trotter exponents and extrapolation weights used to build the S 1 -based MPFs to approximate the Ising Hamiltonian. We define well-conditioned MPFs those that satisfy ∥⃗ a∥ 1 ≤ 3.
(1,1) 48 11 (2) Table 5: Total number of repetitions of the different circuits in the deepest Trotter circuit needed to simulate spinboson systems of different sizes (M, n max ) to achieve the target accuracy ε t ∈ {10 −2 , 10 −3 , 10 −4 }. The setting is the same as in Table 4 but here we compare only different PFs. the asymptotic scaling of l satisfies the inequality For simplicity, we write y := N q /ε t and x := (2l + 1) such that Eq. (22) reads x! > y. By Stirling's approximation, i.e. x! ∼ √ 2πx(x/e) x , the inequality x! > y holds if we set y = (x/e) x . Rearranging this equation yields which has the solution where W (·) denotes the Lambert function [29].
For z > e it holds that ln z − ln ln z < W (z) < ln z [50]. We can thus substitute W (e −1 ln y) by ln e −1 ln y − ln ln e −1 ln y = O(ln ln y) to find that the number of extrapolation points l scales as Equation (25) holds for t ≤ 1. For an arbitrary time, similarly to the main text, we rescale ⃗ k by α = t to ensure that t/k j ≤ 1. Furthermore, for t < 1 the number of Trotter steps of the deepest circuit k l is bounded by l 2 for the sequences in Ref. [46] which becomes k l ≤ αl 2 when we rescale ⃗ k to accommodate for t > 1. We thus conclude that k l scales as as presented in the main text. We further emphasize that Eq. (26) is a worst case upper bound which may yet be improved, for instance, by better choices of α.

G Dynamical Decoupling
Dynamical decoupling (DD) applies carefully chosen gate sequences on idle qubits to suppress, e.g., decoherence [69] and cross-talk [82]. These sequences are chosen to cancel systemenvironment interactions to a given order in timedependent perturbation theory [62,69]. Since, in practice, the duration of the two-qubit crossresonance gates on IBM hardware varies across qubit pairs and because the circuit structure has layers of one-and two-qubit gates there are often periods where some qubits idle and therefore suffer decoherence and energy relaxation. The simplest DD sequence replaces sufficiently long idle periods of length τ with the gate sequence τ ′ 4 -X -τ ′ 2 -X -τ ′ 4 , as done in Ref. [17,53]. Here, the idle time τ ′ is the idle time without DD less the duration of the X gate. We did not see a significant impact of DD on the hardware results and therefore we did not use it. We conjecture this is because single-qubit gates have a 4·10 −4 error and that the idle times are small. Indeed, the duration of the average delay which can accommodate the DD sequence in the circuit without the added ZNE gates is 252 ns. Under a T 1 time of 125 µs, the average T 1 across ibmq_montreal, this delay can be seen as an error of 1−e −0.292/125 ≈ 2·10 −3 , i.e. comparable to twice the single-qubit gate error. If the idle times in the circuit were longer we believe that it would have been advantageous to use DD.

H Calibration of R ZX pulses
Computing the dynamics of a quantum system with a Trotter expansion requires many high accuracy small rotations. To avoid decoherence the R ZX gates are built from scaled crossresonance gates following the linear methodology in Ref. [32,76]. Scaling calibrated CNOT gates to implement R ZX (θ) with small values of θ, i.e. up to ∼ 50 mrad, results in Gaussian cross-resonance pulses without a flat-top and a θ-dependent amplitude. Since the rotation angle θ implemented by the cross-resonance drive depends non-linearly on the pulse amplitude [4,32,64], we mitigate rotation errors with a fine amplitude calibration experiment [81]. Here, the gate sequence [R ZX (θ)] n |00⟩ is repeated for a variable n and a desired θ after which the target qubit is measured and both qubits are reset, see Fig. 7(a). The underlying pulse sequences, implemented by linearly scaling the cross-resonance amplitude, creates an effective rotation [R ZX (dθ + θ)] n in the measured population, see Fig. 7(b) which we fit to the function y(n) = A 2 e −n/τ cos (2πn[dθ + θ] + π) + B. (27) We chose the maximum n as 3/θ to have approximately three full oscillations. The small rotation angles we are dealing with thus require pulse sequences with a duration which can exceed the T 1 times of the qubits. To produce good fits, as measured by χ 2 , we add the damping factor e −n/τ in Eq. (27). We run the fine amplitude calibration for all qubit pairs and 20 values of θ linearly spaced between 0 and the largest θ for which the scaled CR gates have no flat-top. The measured dθ yields a correction factor θ/(dθ +θ) that we multiply with the amplitude of the scaled CR gates, see Fig. 8. We observe that this amplitude calibration improves the value of the measured expectation values, see Fig. 9.

I Comparison with Mitiq
Ref. [17] simulates an Ising model using a similar approach to error mitigation, that is, Pauli Twirling, a pulse efficient transpilation of the R ZZ gates and dynamical decoupling. By contrast, we do not apply dynamical decoupling and implement ZNE by linearly scaling crossresonance pulses. Ref. [17] uses the Mitiq package [58] to implement random gate folding, i.e. repeating random layers of R ZZ gates. Here, we use the example described in Sec. 4.4 as a benchmark to compare both approaches in Fig. 10. We observe that ZNE implemented by folding gates improves the data with respect to the unmitigated results in some cases. Instead, implementing ZNE by scaling CR pulses improves the results in almost all cases.    Figure 9: Comparison of the local ZNE magnetization ⟨Z i ⟩ using amplitude calibrated (blue markers) versus uncalibrated (red markers) R ZX gates. That data is for the five-spin Ising model described in Sec. 4.4 simulated on ibmq_montreal. Pannels (a) and (b) correspond to qubit 16 and 19, respectively. Each point is the average of ten different runs and the error bars show twice the standard deviation.

J Hardware properties
Error mitigation techniques can significantly decrease the physical noise in the measured observables. However, their efficacy is limited and error rates are mostly hardware related. In particular, choosing good qubits is a simple but worthwhile strategy towards achieving accurate results. Here good qubits refers to low error rates in the readout, single-and two-qubit gates.
Most of these values are provided by the IBM Quantum backends [1]. However, our circuits comprise mostly R ZX (θ) gates, which are the main source of noise. The fidelity of these gates is not provided by the backends as they are built from scaled-CR gates [32]. Therefore, we perform quantum process tomography of scaled-CR R ZX (θ)R ZX (−θ) gates using Qiskit Experiments [55] on ibmq_montreal, see Fig. 11. For each qubit pair in the device we calculate the fidelity of the gate sequence for 20 values of θ ∈ (0.5, 20), shown as dots in Fig. 12, and plot it together with the T 1 decay of the two qubits, represented as solid lines. We then inspect the plots and define as good qubit pairs those for which the gate fidelity decay resembles the T 1 decay. Finally, we select 4 good pairs that form five linearly connected qubits, depicted in Fig. 11. The properties of ibmq_montreal are shown in Table 6. qubits (q 1 , q 2 ) T 1 decay of q 1 T 1 decay of q 2 Figure 12: Fidelity of R ZX (θ)R ZX (−θ) on ibmq_montreal on May 29 th 2022 for different rotation angles θ ∈ (0.5, 21). Each panel corresponds to a different qubit pair (q 1 , q 2 ). The dots represent the fidelity of the gates, while the solid lines the T 1 decay of the qubits. Here we chose to work with qubits {13, 14, 16, 19, 20} and the corresponding panels are highlighted in blue.