Importance sampling for stochastic quantum simulations

Simulating many-body quantum systems is a promising task for quantum computers. However, the depth of most algorithms, such as product formulas, scales with the number of terms in the Hamiltonian, and can therefore be challenging to implement on near-term as well as early fault-tolerant quantum devices. An efficient solution is given by the stochastic compilation protocol known as qDrift, which builds random product formulas by sampling from the Hamiltonian according to the coefficients. In this work, we unify the qDrift protocol with importance sampling, allowing us to sample from arbitrary probability distributions while controlling both the bias and the statistical fluctuations. We show that the simulation cost can be reduced while achieving the same accuracy by considering the individual simulation cost during the sampling stage. Moreover, we incorporate recent work on composite channels and compute rigorous bounds on the bias and variance, showing how to choose the number of samples, experiments, and time steps for a given target accuracy. These results lead to a more efficient implementation of the qDrift protocol, both with and without the use of composite channels. Theoretical results are confirmed by numerical simulations performed on a lattice nuclear effective field theory.

Given a Hamiltonian H written as the sum of L multi-qubit operators, typically expressed as Pauli strings, the solution of the time-dependent Schrödinger equation is obtained through the exponentiation of the Hamiltonian in question. One of the most popular and straightforward techniques to compute this matrix exponential is given by product formulas, such as Trotterization and higher-order Trotter-Suzuki decompositions [30][31][32][33], due to their simplicity and high performance in practice, which is usually much better than the worst-case analytical error bounds [4,34].
However, one main drawback of product formulas is their relatively high cost-to-accuracy ratio: their gate count increases proportionally to the number of summands L in the Hamiltonian. Even if the asymptotic scaling is favorable, the pre-factor might be significant enough [27,35,36] to create problems in practice. The resulting deep circuits usually exceed the coherence time of NISQ devices, and would be problematic for early fault-tolerant quantum hardware as well. Following [37], we define (N)ISQ devices as (noisy) intermediate-scale quantum devices, composed of a few hundred qubits and equipped with error correction (error mitigation) protocols that can correct up to a limited amount of error. Hence, both NISQ and ISQ devices are limited in size and depth, making it a priority to improve the complexity of current quantum algorithms. We refer to [38] and [39,40] for informative reviews, where the former is theoretical and the latter focus on practical applications on NISQ devices.
Randomization has proven to be an important tool to improve the accuracy and efficiency of product formulas. Childs et al. [41] achieved better gate complexity by randomizing the ordering of the terms in the Hamiltonian, Faehrmann et al. [42] increased the order of the product formula by averaging over different time slices, while Wan et al. [43] proposed a randomized quantum phase estimation procedure independent of L. More recently, Cho et al. [44] doubled the order of product formulas by introducing random corrections. Indeed, stochasticity can turn coherent errors into incoherent ones, making the error accumulate as a random walk [45][46][47][48]. Even if these improvements are considerable, none of these methods really addresses the scaling with L, which can be a large pre-factor in many relevant situations.
For this reason, Campbell [49] introduced qDrift, a protocol to build product formulas by sampling over the coefficients, whose length does not depend explicitly on L, but on the square of the simulation time t² and on the strength λ of the Hamiltonian. Chen et al. [50] improved the error bound on the bias and showed that one experiment of the qDrift protocol converges exponentially fast towards its expectation value, while Ouyang et al. [17] combined the advantages of qDrift and the first-order Trotter formula to simulate the Hamiltonian through sparsification. Finally, Nakaji et al. [51] proposed a higher-order qDrift protocol, known as qSwift, by adding correction terms to the standard qDrift approach.
The main contributions of this paper are the generalization of the qDrift protocol to arbitrary sampling distributions and the extension of the results from Ref. [50] to multiple qDrift executions and to composite channels with multiple Trotter steps. Sampling from arbitrary distributions has numerous benefits, such as allowing for a direct reduction of the actual simulation cost in terms of native gates, or extending qDrift to situations where it might be difficult to sample directly from the coefficients. For instance, we propose an alternative sampling distribution that decreases the simulation cost, measured e.g. by the total CNOT count. Moreover, this paper gives a rigorous understanding of the behavior of qDrift and composite channels with multiple experiments. We show that qDrift can be efficiently parallelized on multiple devices, and we give a rigorous formula for choosing the number of qDrift samples N, experiments M, and Trotter steps r, for a given accuracy ε and simulation time t. We hope this paper will serve as a starting point for building qDrift protocols tailored to specific applications and computing devices.
Even if we restrict ourselves to time-independent problems, we note that the present work can be directly applied to time-dependent situations through the continuous qDrift [52] protocol. We note that alternative and more refined techniques exist, using extra ancillary qubits and complex gadgets, such as qubitization/quantum signal processing [53,54], and linear combination of unitaries [55][56][57]. They usually offer better asymptotic scaling, but are more challenging to implement in practice and fall outside this paper's scope.
The relevant background and notations are covered in Section 2, with Section 2.2 recalling the qDrift protocol. We introduce importance sampling for stochastic quantum simulations in Section 3.1, while rigorous bounds on the bias, variance, and fluctuation are shown in Section 3.2. Section 3.3 unifies the importance sampled qDrift with composite channels, mixing Trotter and qDrift product formulas. Section 4 focuses on practical applications that benefit from this general framework, such as cost reduction and Hamiltonian partitioning. Finally, numerical simulations are performed in Section 5. Rigorous proofs of the stated theorems can be found in Appendices A and B for the results and applications sections, respectively.

Preliminaries
In this section, we introduce the background and notations adopted throughout this paper. We denote by ∥·∥ the spectral norm, by ∥·∥₁ the trace norm, and by

∥U − E∥⋄ := max_ρ ∥(U ⊗ I_k)[ρ] − (E ⊗ I_k)[ρ]∥₁   (1)

the diamond norm distance between two quantum channels U and E, where the maximum is taken over all (n + k)-qubit states ρ, with n the number of qubits E acts on and k ≥ 1. In the remainder of this paper, we consider a time-independent n-qubit Hamiltonian, in the form of a 2ⁿ × 2ⁿ Hermitian matrix with the decomposition

H = Σ_{l=1}^{L} h_l H_l   (2)

into L summands, with ∥H_l∥ = 1 and h_l > 0. We denote by λ = Σ_l h_l the strength of the Hamiltonian and Λ := max_l h_l.

Deterministic Trotter product formulas
Given a Hamiltonian H, the first-order Trotter product formula is built by exponentiating all the individual terms as

U(t) = Π_{l=1}^{L} e^{−i t h_l H_l}.

The usual technique for long-time simulations is to split the time into r fragments and to repeatedly apply U(t/r); these repetitions are known as Trotter steps. Analytical work [41] shows that the Trotter error is upper bounded by a quantity scaling as O(λ²t²/r). Better error scaling can be achieved by considering higher-order product formulas, which typically require a symmetric extension or randomization, leading to deeper circuits, while tighter error bounds with commutator scaling have been found by Childs et al. [58]. Despite their simplicity, Trotter formulas perform surprisingly well, often much better than the predicted bounds, making them the default choice in many situations, including early fault-tolerant quantum hardware. Since we will be dealing with quantum channels, it is useful to recall how time evolution is performed in the density matrix formalism. Given a unitary operator U(t) and a density matrix ρ, the time evolution is computed as

ρ(t) = U(t)[ρ] = U(t) ρ U(t)†,

where U(t) is the unitary channel of U(t).
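As a minimal numerical illustration of the formulas above (the single-qubit toy Hamiltonian H = X + 0.5 Z and all numerical values are our own choices, not taken from the paper), one can check that the first-order Trotter error shrinks as the number of steps r grows:

```python
import numpy as np

# Pauli matrices
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_exp(theta, P):
    """e^{-i theta P} for a Hermitian P with P^2 = I (any Pauli string)."""
    return np.cos(theta) * I2 - 1j * np.sin(theta) * P

def exact_evolution(H, t):
    """e^{-iHt} via eigendecomposition of the Hermitian matrix H."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

def trotter1(terms, t, r):
    """First-order Trotter formula: (prod_l e^{-i h_l H_l t/r})^r."""
    step = I2
    for h, P in terms:
        step = step @ pauli_exp(h * t / r, P)
    U = I2
    for _ in range(r):
        U = U @ step
    return U

terms = [(1.0, X), (0.5, Z)]          # toy H = X + 0.5 Z
H = sum(h * P for h, P in terms)
t = 1.0
err = lambda r: np.linalg.norm(exact_evolution(H, t) - trotter1(terms, t, r), 2)
```

For this toy model the spectral-norm error decreases roughly like 1/r, in line with the O(λ²t²/r) scaling quoted above.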

The qDrift protocol
One of the main drawbacks of Trotter-Suzuki decompositions is their gate count: every term in the Hamiltonian, see Eq. (2), must be simulated sequentially, leading to deep circuits in many relevant use cases. The resulting quantum circuits are therefore heavily affected by noise on NISQ devices and are time-consuming on fault-tolerant hardware, motivating the search for more efficient algorithms. Campbell [49] remarked that the gate count can be significantly reduced in some regimes by considering the relative importance of each term H_j in the Hamiltonian, given by the corresponding coefficient h_j. The qDrift protocol builds an approximate channel E(t; N, M), which is randomly constructed by sampling terms from the Hamiltonian according to the magnitude of the coefficients. We call t the simulation time, N the number of qDrift samples, and M the number of qDrift experiments. In practice, a qDrift experiment is sampled from the product distribution p_N(j) = λ^{−N} Π_{k=1}^{N} h_{j_k}, where each term is sampled with probability p(j) = h_j/λ and j = (j_1, j_2, ..., j_N) is a multi-index, leading to the product unitary

V_j = Π_{k=1}^{N} e^{−i τ_{j_k} H_{j_k}}

for appropriately chosen time steps τ_{j_k}. The qDrift channel is ultimately built as the arithmetic average of the M individual experiments, as summarised in Algorithm 1, with the sampling distribution q(j) = p(j) := h_j/λ. We note that the superscript m of the bold multi-index j^m refers to the multi-index j of the m-th experiment, while j^m_k refers to its corresponding k-th component. We note that we use the notation q(j) and q_j interchangeably throughout the paper.
In the asymptotic limit of infinite experiments, this converges to the average qDrift channel

Ē(t; N)[ρ] = E_{j∼p_N}[ V_j ρ V_j† ],

where E_p[f(x)] denotes the expectation value of an arbitrary function f(x) over the sampling distribution p(x). The special case with N = 1 leads to

Ē(t; 1)[ρ] = Σ_j p_j e^{−iτ H_j} ρ e^{iτ H_j},

which we will call the deterministic qDrift channel. The strength of each unitary is fixed to a constant τ_j := τ = tλ/N, which is chosen such that the qDrift channel is equal to the first-order Trotter product formula up to first order in the Taylor expansion, with an upper bound on the diamond norm distance scaling as O(λ²t²/N) [49]; this bound has subsequently been improved in [50]. The deterministic channel can be used to parallelize a Trotter product formula for small time scales, where every H_l is simulated simultaneously and independently on different quantum devices, or by using different qubits of the same device. This scheme essentially trades the circuit's depth for measurements.
Expectation values can then be obtained by post-processing. However, since the number of circuits grows exponentially with the number of slicing steps r, this application is impractical for long time scales or large L, but could benefit NISQ devices, which often fail in this regime due to noise.
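The standard protocol is short enough to sketch directly (the two-term single-qubit Hamiltonian and all numerical values below are our own toy choices): sample N terms with probability h_j/λ, evolve each for the fixed time step τ = tλ/N, and average M experiments.

```python
import numpy as np

rng = np.random.default_rng(7)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_exp(theta, P):
    """e^{-i theta P} for P with P^2 = I."""
    return np.cos(theta) * I2 - 1j * np.sin(theta) * P

def qdrift_experiment(hs, Ps, t, N):
    """One qDrift experiment: N terms drawn with p_j = h_j / lambda,
    each evolved for the fixed time step tau = t * lambda / N."""
    lam = sum(hs)
    js = rng.choice(len(hs), size=N, p=[h / lam for h in hs])
    V = I2
    for j in js:
        V = pauli_exp(t * lam / N, Ps[j]) @ V
    return V

hs, Ps = [0.8, 0.2], [X, Z]
t, N, M = 0.5, 200, 400
rho = np.array([[1, 0], [0, 0]], dtype=complex)     # |0><0|

# average the M experiment channels applied to rho
rho_avg = np.zeros((2, 2), dtype=complex)
for _ in range(M):
    V = qdrift_experiment(hs, Ps, t, N)
    rho_avg += V @ rho @ V.conj().T
rho_avg /= M

# compare with the exact evolution e^{-iHt} rho e^{iHt}
H = hs[0] * X + hs[1] * Z
w, Vecs = np.linalg.eigh(H)
U = Vecs @ np.diag(np.exp(-1j * w * t)) @ Vecs.conj().T
trace_dist = 0.5 * np.abs(np.linalg.eigvalsh(rho_avg - U @ rho @ U.conj().T)).sum()
```

With this budget the averaged channel reproduces the exact state to within a few parts in a thousand, consistent with the bias and concentration scalings discussed below.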

Importance sampling
Importance sampling is a useful technique to compute expectation values when the distribution p(x) is difficult to sample from, or as a way to reduce the variance. Frequently used in Monte Carlo integration, the trick is to sample from an alternative, in some cases considerably easier, distribution q(x) > 0 and re-weight accordingly:

E_p[f(x)] = E_q[ω(x) f(x)],

where ω(x) := p(x)/q(x) is the re-weighting factor. We can easily see that this gives an unbiased estimator of the expectation value of interest, while the variance depends on the choice of q(x). With an adequate choice, the variance can be significantly reduced, leading to less expensive calculations. We guide the reader to [59] for an informative review on the topic.
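A self-contained toy example (the integrand x⁴ and the proposal q(x) = 5x⁴ on [0, 1] are our own illustrative choices) shows the re-weighting at work; since q is proportional to f itself, the weighted estimator here has essentially zero variance:

```python
import random

random.seed(0)

def f(x):
    return x ** 4

# target: E_p[f] with p uniform on [0, 1]; exact value is 1/5
M = 10_000
plain = [f(random.random()) for _ in range(M)]

# importance sampling: q(x) = 5 x^4 on [0, 1], sampled via the
# inverse CDF u -> u^(1/5); re-weight with w(x) = p(x)/q(x) = 1/(5 x^4)
def sample_q():
    return random.random() ** 0.2

weighted = [f(x) / (5 * x ** 4) for x in (sample_q() for _ in range(M))]

est_plain = sum(plain) / M
est_is = sum(weighted) / M
var_plain = sum((v - est_plain) ** 2 for v in plain) / M
var_is = sum((v - est_is) ** 2 for v in weighted) / M
```

Each weighted sample equals 1/5 exactly, so the importance-sampled estimator is unbiased with (numerically) zero variance, while the plain estimator carries the full Monte Carlo noise.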

Results
The first results presented in this section are the application of importance sampling to the qDrift protocol and the computation of the corresponding bias, variance, and fluctuation bounds. The second set of results unifies this framework with composite channels. We begin by introducing the importance sampled qDrift.

Importance sampled qDrift
To better understand the paradigm shift, we adopt the Liouvillian representation of the unitary channel e^{−iHt}, with

e^{−iHt} ρ e^{iHt} = e^{tL}[ρ],   L[ρ] := −i[H, ρ].

We first write the qDrift channel, sampled from an arbitrary distribution q(j), as

Ē_q(t; N)[ρ] = E_{j∼q_N}[ V_j ρ V_j† ].

We then choose τ_j so that we match the ideal channel to linear order, that is, requiring

Σ_j q_j τ_j L_j = (t/N) L = (t/N) Σ_j h_j L_j,

with L_j[ρ] := −i[H_j, ρ] the generator of the individual term. Since the bias at second order in t/N cannot be matched with any choice of q(j), we focus on the part which is linear in time. Matching the two sides term by term, and using the explicit expression p_j = h_j/λ, we find

τ_j = t h_j/(N q_j) = tλ ω(j)/N.

We note that, contrary to the standard qDrift, the strength of each unitary τ_j now depends on j.
The procedure is summarised in Algorithm 1, which is a generalisation of the regular qDrift protocol to arbitrary sampling distributions.
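A hedged sketch of the generalised sampler (the three-term toy Hamiltonian and the cost-aware distribution are our own choices; the j-dependent time step τ_j = t h_j/(N q_j) = tλω(j)/N implements the linear-order matching condition) shows that the averaged channel remains close to the exact evolution even for q ≠ p:

```python
import numpy as np

rng = np.random.default_rng(11)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_exp(theta, P):
    return np.cos(theta) * I2 - 1j * np.sin(theta) * P

def is_qdrift_experiment(hs, Ps, q, t, N):
    """One importance-sampled qDrift experiment: terms drawn from an
    arbitrary distribution q, with j-dependent time steps
    tau_j = t * h_j / (N * q_j)."""
    js = rng.choice(len(hs), size=N, p=q)
    V = I2
    for j in js:
        V = pauli_exp(t * hs[j] / (N * q[j]), Ps[j]) @ V
    return V

hs, Ps = [0.7, 0.2, 0.1], [X, Y, Z]
costs = [4.0, 1.0, 1.0]                    # toy per-term gate costs
w = [h / c for h, c in zip(hs, costs)]     # q_c(j) proportional to h_j / C_j
q = [v / sum(w) for v in w]

t, N, M = 0.4, 150, 400
rho = np.array([[1, 0], [0, 0]], dtype=complex)
rho_avg = np.zeros((2, 2), dtype=complex)
for _ in range(M):
    V = is_qdrift_experiment(hs, Ps, q, t, N)
    rho_avg += V @ rho @ V.conj().T
rho_avg /= M

H = sum(h * P for h, P in zip(hs, Ps))
wE, VE = np.linalg.eigh(H)
U = VE @ np.diag(np.exp(-1j * wE * t)) @ VE.conj().T
trace_dist = 0.5 * np.abs(np.linalg.eigvalsh(rho_avg - U @ rho @ U.conj().T)).sum()
```

The expensive first term is now sampled less often, while its larger time step τ_j compensates so that the channel is still matched to linear order.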

Bias, variance and fluctuation bounds
In this section, we compute the bias, concentration, and fluctuation bounds for the importance sampled qDrift channel, as functions of the sampling distribution q(j), the simulation time t, the number of samples N, and the number of experiments M. We remark that for q(j) = p(j) we recover the usual bounds, meaning that our framework is a natural extension of the standard qDrift implementation. For ease of notation, all minimum values of N, M and r, which are by definition integers, are given as functions of ε, t and λ, and may therefore not be integers themselves; in practical situations they should be rounded up to the next integer. Borrowing the proof strategy from [50, Proposition 3.2], we now provide an upper bound on the error of the bias. We refer to Appendix A for the complete proofs of the presented results.
Theorem 1 (Bias error bound). Let U(t) be the unitary channel of a first-order Trotter product formula, E q (t; N ) an average qDrift channel with importance sampling and ω(j) = p(j)/q(j) the re-weighting factor. The diamond norm distance between these two channels for N = 1 is then upper bounded by leading to the following result for an arbitrary N .
We are now able to see that importance sampling can lead to an increase in the number of qDrift samples N at fixed accuracy ε; in fact, the bound implies N_q ≥ N_p. Therefore, the standard qDrift channel will always require a smaller number of samples; however, as we will argue in the next section, the total simulation cost can still be reduced by using importance sampling, without sacrificing accuracy, since we can favor the sampling of cheaper circuits. Now that we have generalized the error bound on the bias, we need to understand how a finite importance sampled qDrift channel concentrates around its expectation value. This will provide an estimate of M and N for a given accuracy ε and simulation time t.
Theorem 2 (Concentration bound). Let E_q(t; N, M) be a finite importance sampled qDrift channel on n qubits and V_j instances of the N·M unitaries that make up the channel. Their concentration around their expectation value can then be upper bounded ∀ε ∈ [0, 4tλ] as follows

In order to guarantee an approximation error ε/2 with probability at least 1 − δ, it is then sufficient to take N M large enough that the above tail probability is at most δ. This theorem gives us two important pieces of information. First, we learn that we can distribute the resource budget across M and N, and that, for fixed accuracy ε, the channel converges exponentially fast towards the deterministic qDrift in their product N M. Moreover, the qDrift channel can be efficiently simulated in parallel, since it concentrates exponentially fast in M. This trade of circuit depth for an increased number of measurements has some advantages, such as reducing hardware errors on NISQ devices and shortening real-time simulations on fault-tolerant ones. Secondly, we now know that the qDrift results present in the literature also hold for any distribution q(j), as long as the two distributions remain close, i.e., if ω(j) ≈ 1. This enables the design of alternative distributions which, for example, produce less expensive circuits or are easier to sample from. For instance, these results can be directly transported to the continuous qDrift protocol [52], where the continuous distribution is replaced with an easier one. In [52], the authors already proposed to use a more readily available distribution based on norm upper bounds instead of norms. Our results make it easier to characterize both the bias and the variance change induced by such a choice. We will see in the next section how to choose q(j) to obtain a guaranteed reduction in the simulation cost.
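The freedom to distribute the budget between N and M can be checked numerically on a toy model (our own two-term single-qubit Hamiltonian; all values illustrative): different splits of a fixed product N·M reach comparable accuracy.

```python
import numpy as np

rng = np.random.default_rng(5)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_exp(theta, P):
    return np.cos(theta) * I2 - 1j * np.sin(theta) * P

hs, Ps = [0.8, 0.2], [X, Z]
lam = sum(hs)
H = hs[0] * X + hs[1] * Z
t = 0.3
w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T
rho = np.array([[1, 0], [0, 0]], dtype=complex)
target = U @ rho @ U.conj().T

def qdrift_error(N, M):
    """Trace distance between the M-experiment qDrift average and the
    exact state, for N samples per experiment."""
    avg = np.zeros((2, 2), dtype=complex)
    for _ in range(M):
        js = rng.choice(len(hs), size=N, p=[h / lam for h in hs])
        Vj = I2
        for j in js:
            Vj = pauli_exp(t * lam / N, Ps[j]) @ Vj
        avg += Vj @ rho @ Vj.conj().T
    avg /= M
    return 0.5 * np.abs(np.linalg.eigvalsh(avg - target)).sum()

# same budget N*M = 1200, split differently between depth and repetitions
errors = {(N, M): qdrift_error(N, M) for N, M in [(10, 120), (120, 10)]}
```

The deep/few-shot and shallow/many-shot configurations land in the same accuracy range, illustrating why the protocol parallelizes well across devices.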
Finally, we compute a bound on the expected fluctuations around the true evolution, following the strategy of [50, Proposition 3.4].

Corollary 1 (Fluctuation bound).
Let H be an n-qubit Hamiltonian, q(j) an arbitrary distribution, t the simulation time, N a fixed number of qDrift samples, and M a fixed number of qDrift experiments.
with α being a numerical constant depending on H.
We now have a better understanding of how to choose N and M for a particular distribution q(j) and a desired expected accuracy ε in diamond norm. In particular, if we first choose N to control the bias to ε/κ, with κ > 1, we then need to choose M to control the remaining statistical fluctuations. Using this choice for N, we arrive at a condition on M. The parameter κ can be chosen to reduce N as much as possible while ensuring that M does not diverge. In general, for a given κ and a fixed choice of the distribution q(j), the resulting asymptotic scaling shows that, thanks to the concentration bound in Theorem 2 depending on the product N M, the number of qDrift experiments scales better than the O(1/ε²) one would naively expect from shot noise.
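A small helper sketches how such a budget would be chosen in practice. It is entirely illustrative: the bias bound 2(λt)²/N and the sub-Gaussian tail constant used below are stand-ins of our own choosing, not the paper's explicit expressions.

```python
import math

def choose_budget(bias_bound, eps, kappa, tail_constant, delta):
    """Pick the smallest N with bias_bound(N) <= eps/kappa, then M from an
    assumed sub-Gaussian tail exp(-N*M*eps^2 / tail_constant) <= delta.
    bias_bound and tail_constant are placeholders for the expressions in
    Theorem 1 and Theorem 2."""
    N = 1
    while bias_bound(N) > eps / kappa:
        N += 1
    M = math.ceil(tail_constant * math.log(1.0 / delta) / (N * eps ** 2))
    return N, max(M, 1)

# assumed standard-qDrift-style bias bound 2 (lambda t)^2 / N
lam, t = 2.0, 1.0
bias = lambda N: 2 * (lam * t) ** 2 / N
N, M = choose_budget(bias, eps=0.1, kappa=2.0,
                     tail_constant=8 * (lam * t) ** 2, delta=0.01)
```

Larger κ shifts the budget from samples N towards experiments M, which is exactly the trade-off discussed above.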

Composite channels
We recall that the main goal of this paper is to reduce the actual implementation cost of random product formulas when running on quantum hardware. We remark that our framework can be naturally embedded in the context of composite channels [60][61][62], using the deterministic nature of Trotter product formulas to further reduce the cost. Following [60], we will first introduce composite channels, unify them with our importance sampling scheme, and compute the relevant bounds on the bias, variance, and fluctuation. Given a partition of the Hamiltonian, a composite channel is a composition of channels used to simulate the different terms in the partition. For simplicity, we will only consider the case where H is split into two parts, H = A + B, with A = Σ_{i∈I_A} a_i A_i and B = Σ_{i∈I_B} b_i B_i, where ∥A_i∥ = ∥B_i∥ = 1 and I_A and I_B are two sets of indices, and use a deterministic formula for A and a stochastic one for B. To better understand this paradigm, we first perform an outer first-order Trotter decomposition

U_H(t) = U_A(t) ∘ U_B(t) + E_{A,B},

where E_{A,B} is the error term. We then take U_A(t) to be an approximation to the unitary channel U_A(t)[ρ] = e^{−itA} ρ e^{itA} performing the evolution under A (we consider here a first-order product formula) and E^B_q(t; N, M) the importance sampled qDrift channel for the B term. If we define the composite channel Λ_q(t; N, M) as their composition, and its corresponding average channel Λ̄_q(t; N), one can show that its diamond norm distance from the ideal channel U_H can be bounded as in [60]. If we split the total time into r segments and use the union bound as usual, we find the corresponding bound for the full evolution. Apart from the use of a general importance sampled qDrift, and the use of the improved bound obtained in [50] for the error in the qDrift channel, this is the same result already obtained in [60]. Where we depart from their scheme is in considering the possibility that evolution under different terms in the expansion could have a different gate cost.
If we denote by C^A_j and C^B_j the cost of the term j in either A or B, we can bound the total cost of the composite channel accordingly. The average cost can instead be estimated in terms of C^A_tot = Σ_{l∈I_A} C^A_l, the total cost for A, and of Γ^{A,B}_comm, the contribution containing the commutators among the A terms and between the A and B partitions.
As was already pointed out in [60], the number of qDrift samples N_B per segment is now a completely free parameter. Following the same strategy adopted there, namely explicitly finding the minimum of the cost, one obtains an optimal number of samples N_B, from which the expected cost follows directly. Unfortunately, we cannot directly use the result for the concentration bound in Theorem 2, since r compositions of the channel Λ_q(t; N, M) would require M^r experiments to be implemented. It is thus convenient to introduce another channel, Ω_q(t; N, M, r), defined as the average of M experiments of the full r-segment protocol. In order to guarantee an approximation error ε/2 with probability at least 1 − δ, it is then sufficient to take M and N according to the new version of the concentration bound, Theorem 3, which also allows us to provide an estimate for the expected error of the composite channel.
where the parameter contains the dependence on commutators.
If we now want to ensure the expected error to be less than ε, we can take a number of steps r as prescribed, for some κ > 1, together with N = N_B from Eq. (40) and a corresponding number of experiments M, where the dependence on q(j) is absorbed into a single parameter. A similar derivation to the one carried out here could be used for the more general cases, already discussed in [60], where one employs a higher-order Trotter formula for the outer breakup in Eq. (35) or in the simulation of the evolution under A, as well as for protocols where the latter is implemented using qubitization. Further improvements obtained by using randomization [41,42,50] could also be accommodated.
Having unified the composite channel framework from [60] with the previous result about importance sampling described in this paper, we will now look into more concrete applications.
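A minimal sketch of such a composite channel on a toy model (single qubit, deterministic part A = X evolved exactly, and a weak stochastic part B handled by qDrift; all values our own) illustrates the construction:

```python
import numpy as np

rng = np.random.default_rng(3)
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_exp(theta, P):
    return np.cos(theta) * I2 - 1j * np.sin(theta) * P

def expm_h(H, t):
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

# partition: deterministic part A, stochastic part B with small lambda_B
A = 1.0 * X
bs, Bs = [0.10, 0.05], [Z, Y]
lamB = sum(bs)
H = A + sum(b * P for b, P in zip(bs, Bs))

def composite_experiment(t, r, N):
    """r segments; in each, evolve exactly under A, then apply one
    qDrift experiment (N samples) for B."""
    V = I2
    UA = expm_h(A, t / r)
    for _ in range(r):
        js = rng.choice(len(bs), size=N, p=[b / lamB for b in bs])
        VB = I2
        for j in js:
            VB = pauli_exp(t * lamB / (r * N), Bs[j]) @ VB
        V = VB @ UA @ V
    return V

t, r, N, M = 0.5, 2, 20, 300
rho = np.array([[1, 0], [0, 0]], dtype=complex)
rho_avg = np.zeros((2, 2), dtype=complex)
for _ in range(M):
    V = composite_experiment(t, r, N)
    rho_avg += V @ rho @ V.conj().T
rho_avg /= M
Uex = expm_h(H, t)
trace_dist = 0.5 * np.abs(np.linalg.eigvalsh(rho_avg - Uex @ rho @ Uex.conj().T)).sum()
```

Because λ_B is small, the stochastic part contributes little bias and few fluctuations, and the residual error is dominated by the outer Trotter splitting between A and B.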

Reduction of the simulation cost
The main idea of this paper is to obtain a computational advantage by choosing the probability distribution q(j) so as to reduce the total simulation cost. For instance, we assign a weight C_j > 0 to each term H_j, representing the number of resources required for its simulation, and choose

q_c(j) := h_j/(λ_c C_j),   λ_c := Σ_l h_l/C_l.

We will denote this sampling strategy with the subscript c, while remarking that the standard qDrift protocol is recovered when the cost is constant, C_j = C. This framework can be subsumed under the umbrella of importance sampling, where the goal is a reduction in the integrated computational cost instead of the variance. The choice of the cost C_j is then dictated by the restrictions of the hardware, and is particularly advantageous if the distribution of the cost has a large variance. We first show that the expected cost of one qDrift sample is always lower when sampling from q_c(j) than from p(j), using Jensen's inequality. The complete derivation of the following results can be found in Appendix B.
Lemma 1 (Jensen's inequality [63]). Let X be an integrable random variable and ϕ(x) : R → R a convex function. We then have the inequality E[ϕ(X)] ≥ ϕ(E[X]).
Corollary 3. The expected cost of an importance sampled qDrift channel with N = 1 samples and q(j) = q_c(j) is always lower than for the standard qDrift protocol, i.e., E_{q_c}[C] ≤ E_p[C].
Even if, on average, one qDrift sample is cheaper when sampling from this alternative distribution, this alone might not be enough to claim a reduction in the total simulation cost. This is because, due to Theorem 1, the standard qDrift channel needs fewer samples at fixed accuracy than an importance sampled one. However, we can show that the cost reduction holds in general, as formulated in the following theorem.
Theorem 4 (Cost reduction - pure qDrift). Let N_p(ε, t) and N_{q_c}(ε, t) be the number of qDrift samples for the two distributions p(j) = h_j/λ and q_c(j) = h_j/(λ_c C_j) for a given target precision ε and propagation time t. The expected cost of the importance sampled qDrift channel is then always smaller than the cost of the standard one. The number of experiments is instead increased, but remains independent of the total evolution time t.
We therefore see that, for any simulation using pure qDrift, the importance sampling procedure described in this work guarantees a saving in the cost at the price of increasing the number of independent experiments. A similar saving can also be shown to hold when employing composite channels.
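The content of Corollary 3 is easy to verify numerically. For any toy choice of coefficients h_j and costs C_j (ours below), the expected cost under q_c, which equals λ/λ_c in closed form, never exceeds the expected cost under p, in agreement with Jensen's inequality:

```python
# coefficients h_j and per-term simulation costs C_j (toy values)
hs = [0.5, 0.3, 0.15, 0.05]
Cs = [10.0, 2.0, 1.0, 1.0]

lam = sum(hs)
p = [h / lam for h in hs]                       # standard qDrift distribution

lam_c = sum(h / C for h, C in zip(hs, Cs))      # normalization of q_c
qc = [h / (lam_c * C) for h, C in zip(hs, Cs)]  # cost-aware distribution

cost_p = sum(pj * C for pj, C in zip(p, Cs))    # E_p[C]
cost_qc = sum(qj * C for qj, C in zip(qc, Cs))  # E_qc[C] = lam / lam_c
```

Here the expensive first term dominates E_p[C], while q_c down-weights it and more than halves the expected per-sample cost.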

Theorem 5 (Cost reduction - composite channel).
Let C_p(ε, t) and C_{q_c}(ε, t) be the expected costs to implement the composite channels Ω_p(t; N, M, r) and Ω_{q_c}(t; N, M, r) using the two distributions p(j) = h_j/λ and q_c(j) = h_j/(λ_c C_j) for a given target precision ε and propagation time t. Then C_{q_c}(ε, t) ≤ C_p(ε, t). The number of experiments is instead increased, M_{q_c}(ε) ≥ M_p(ε), but retains the same scaling with the error ε and system size n, and is also independent of t.
We note that the definition of the cost is specific to each application and hardware. For example, on NISQ devices the cost is dominated by entangling two-qubit gates, and we might then take their number to define the cost C. This choice automatically takes into account the structure of the device, such as its connectivity, and the particular application. For example, when simulating physical systems in second quantization, the importance sampling scheme proposed here will give a lower probability to terms with long Jordan-Wigner strings [64]. On the other hand, for applications involving error-corrected devices, one would be tempted to choose the number of T gates instead to define the cost C, as they typically take time to be fabricated and are the main bottleneck in the fault-tolerant regime [65]. The situation is, however, less straightforward in this case, since the decomposition in terms of T gates can depend, in general, on the choice of the sampling probability q(j). For our cost reduction scheme to work, we have made the common assumption that the time evolution under the individual terms H_j can be fast-forwarded; a typical example is when the H_j are tensor products of Pauli operators. In these cases, the T cost is directly associated with the implementation of a single-qubit z-rotation with an angle determined by the time step τ_j, and therefore by q(j) itself (cf. Eq. (21)), making it difficult to obtain a good candidate for the coefficients C_j. One possibility would be to empirically optimize the sampling distribution q(j), for example using a Monte Carlo approach or genetic optimization [66], in order to make as many time steps τ_j as possible equal to integer multiples of π/8, while preventing the average weight E_p[ω(j)] from growing too much. There is, however, a different setup where the importance sampling strategy from Eq. (50) can be employed directly to reduce the overall T count.
One can take a decomposition where the terms H_j forming the Hamiltonian are still fast-forwardable but admit a decomposition into Clifford+T gates whose complexity does not depend on the time (see e.g. [67] for possible candidates). In this case, sampling from q_c(j) will also guarantee a cost reduction.
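As a concrete, and necessarily hypothetical, cost model of the kind discussed above (actual counts depend on connectivity and transpilation), the standard CNOT-ladder circuit for exponentiating a Pauli string of weight w uses 2(w − 1) CNOTs, so long Jordan-Wigner strings are automatically penalised:

```python
def cnot_cost(pauli):
    """CNOT count of the common ladder circuit for exp(-i theta P):
    single-qubit basis changes plus a CNOT ladder onto one qubit and
    back, i.e. 2 * (w - 1) CNOTs for a string of Pauli weight w."""
    w = sum(1 for c in pauli if c in "XYZ")
    return 0 if w <= 1 else 2 * (w - 1)

# illustrative strings: a bare rotation, a weight-2 term, a weight-4 term
costs = {s: cnot_cost(s) for s in ["IZII", "XZII", "XYZZ"]}
```

Feeding such a C_j into q_c(j) = h_j/(λ_c C_j) biases the sampler towards the cheap low-weight terms, as in the numerical experiments of Section 5.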

Choice of the partitioning
Composite channels offer great versatility by combining deterministic and random product formulas, but their performance greatly depends on the adopted scheme for partitioning the Hamiltonian. By inspecting the optimized expression for the expected cost in Eq. (41), one can already see that a composite channel might be helpful in situations where λ_B is very small (as already noticed in [60]). For instance, even in the particular case where the terms in B all commute with each other, if λ_B is sufficiently small, the cost depends directly only on the cost of implementing the terms in A, with the same pre-factor we would have for a direct first-order Trotter simulation. Thanks to our ability to directly take into account the cost of individual terms, we can also see that another use case is whenever E_p[C^B] ≪ C^B_tot, which is possible in situations where the costs of individual terms within B vary significantly. In addition, in these situations the distribution q_c(j) from Eq. (50) guarantees a further reduction of the expected cost at the expense of an increase in the required number M of experiments that need to be carried out.
These properties frequently arise in simulations of effective field theories, where higher-order corrections to the interaction are suppressed. In this context, a particularly convenient situation is when an accidental symmetry forces some low-energy constants to take unnaturally small values. In nuclear physics, for instance, typical examples of this phenomenon are the SU(4) Wigner symmetry in systems composed of neutrons and protons [68,69] and its generalization to SU(16) in the presence of hyperons [70]. For simulations of low-energy reactions with nuclei, one could then consider the SU(4)-symmetric potential in the deterministic part (as worked out e.g. in [27]) and add symmetry-breaking terms in a stochastic manner using a qDrift channel. Another example would be the inclusion of effective-range effects, absent in purely contact interactions, to improve the accuracy in bulk neutron matter [71] or to provide the required stability to medium-mass nuclei [72,73]. For situations where the physically relevant value of λ_B is not small enough to allow for considerable savings in cost, it could also be possible to rely on extrapolation techniques, e.g., eigenvector continuation [74], to study the system at reduced values of λ_B, followed by an extrapolation.
More generally, one can also employ more direct optimizations of the splitting by attempting to minimize the cost directly, possibly at the same time as the optimization of the weights in the importance sampling distribution to be used for the stochastic portion of the algorithm. To this end, schemes like the probabilistic partitioning scheme introduced in Ref. [60] could prove extremely valuable in enabling substantial savings for any given Hamiltonian one is interested in.

Numerical Simulation
This section presents an application of importance sampling and composite channels to the quantum simulation of a model inspired by a pionless lattice effective field theory [75], in particular a simple toy model for a triton introduced in [27]. We take A = 2 dynamical nucleons together with a static one (infinite mass) fixed on the first site of a 2 × 2 lattice with periodic boundary conditions. We consider the static nucleon to be the proton, while the two dynamical ones are neutrons in two different spin states (N_f = 2). This model is simple enough to be easily simulated, yet contains much of the leading-order contributions to the interaction and can thus provide valuable information about light nuclei and their response functions [76]. The model is equivalent to a two-dimensional (d = 2) Hubbard model with a kinetic hopping term together with two- and three-body interactions, with T the hopping coefficient, u the two-body interaction strength, and v the three-body one. We recall that the fermionic operator c_{i,f} destroys a particle of the species f on site i, c†_{i,f} is the corresponding creation operator, and n_{i,f} = c†_{i,f} c_{i,f} is the number operator. The Hamiltonian becomes particularly simple in first quantization because of the small size of the lattice. By using two qubits to encode the position of each nucleon with the encoding strategy

|1⟩ ≡ |00⟩, |2⟩ ≡ |01⟩, |3⟩ ≡ |10⟩, |4⟩ ≡ |11⟩,

this model can be expressed in the Pauli basis, where X_k, Y_k, Z_k are the corresponding Pauli matrices acting on qubit k. More details about the conversion can be found in the original work [27]. Instead of using realistic coefficients from experiments, we split the first-quantized Hamiltonian into two parts A and B, where the subscript j denotes different splitting strategies and I_X a set of indices for X.
We choose uniform coefficients inside each term of the partition in order to have better control over the error bounds, and we consider two different Hamiltonian models, the first defined through separate contributions to A and B, and the second given analogously. For simplicity, we set a = 1 and express everything, i.e., the coefficient b and the simulation time t, in units of a. The first Hamiltonian follows the perturbation-theory approach, where A is the simpler model H in the u = −4v configuration, in which most of the coefficients are zero, and B describes a small perturbation making the system more realistic. A deterministic first-order Trotter product formula is used to simulate the bulk of the system A, while a qDrift channel handles the contribution from the perturbation B.
The second Hamiltonian is instead chosen to minimize the expected cost of each circuit and exemplifies the role of importance sampling. The A part contains the terms that require few resources to simulate on the quantum device, i.e., which have a small cost C in terms of the number of two-qubit CNOT gates, while B is composed of the most difficult terms, together with some easy single-qubit rotations added in order to diminish the expected cost.
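As an illustration of this splitting strategy, the sketch below partitions a set of hypothetical (h_j, C_j) pairs by a simple cost threshold. The term names and values are invented for illustration (the actual generators and costs are listed in Table 1), and the text above additionally moves some cheap single-qubit rotations into B, which a bare threshold does not capture.

```python
# Hypothetical (h_j, C_j) pairs for illustration only; the paper's actual
# generators and their costs are listed in Table 1.
terms = {
    "Z1": (1.0, 0.1), "Z2": (0.8, 0.1), "X1": (0.4, 0.1),
    "X1X2": (0.5, 2.1), "Z1Z2Z3": (0.3, 4.1),
}

def partition(terms, cost_threshold=1.0):
    """Split the Hamiltonian into a cheap part A (simulated deterministically
    with Trotter) and an expensive part B (simulated stochastically with
    qDrift), based on the per-term implementation cost C_j."""
    A = {k: v for k, v in terms.items() if v[1] <= cost_threshold}
    B = {k: v for k, v in terms.items() if v[1] > cost_threshold}
    return A, B

A, B = partition(terms)
lam_B = sum(h for h, _ in B.values())  # lambda_B, which controls the qDrift cost
```

The threshold is a free parameter: lowering it pushes more terms into the stochastic part, trading deterministic circuit depth for sampling overhead.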
We assume a constant cost of 0.1 for one-qubit operations and of 1 for every CNOT gate appearing in the multi-qubit terms, after transpilation. We assume a linear connectivity 1423 and neglect any compilation optimization obtained through gate cancellation from neighboring operations for individual sampled circuits. The cost of each generator is displayed in Table 1, while the expected cost for the two different systems for Trotter, plain, and importance-sampled composite channels can be seen in Table 2.
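Under this cost model, the cost of each generator can be estimated from the weight of its Pauli string, assuming the standard CNOT-ladder compilation in which a weight-w Pauli rotation uses 2(w − 1) CNOTs and one single-qubit rotation. A minimal sketch (the example strings are hypothetical, not the model's actual generators):

```python
def pauli_weight(pauli: str) -> int:
    """Number of non-identity factors in a Pauli string such as 'XIZY'."""
    return sum(1 for op in pauli if op != "I")

def term_cost(pauli: str, cnot_cost: float = 1.0, rot_cost: float = 0.1) -> float:
    """Cost model from the text: 0.1 per one-qubit rotation, 1 per CNOT.

    Without gate cancellation, a weight-w Pauli rotation exp(-i theta P)
    compiles into a CNOT ladder with 2*(w - 1) CNOTs plus one single-qubit
    rotation, giving a cost of 2*(w - 1)*cnot_cost + rot_cost.
    """
    w = pauli_weight(pauli)
    if w == 0:
        return 0.0  # identity term: only a global phase, no gates
    return 2 * (w - 1) * cnot_cost + rot_cost

# Hypothetical strings; the actual generators of the model are in Table 1.
costs = {p: term_cost(p) for p in ["ZIII", "XXII", "ZZZI", "YYYY"]}
```

A restricted connectivity would add SWAP overhead on top of this count; the sketch ignores that, matching the no-cancellation assumption above.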
More precisely, we compute the deterministic cost, the expectation value of the cost per step, and the total cost at fixed accuracy ε and time t. We notice that importance sampling reduces the total simulation cost by a factor of two compared to plain qDrift, and that the use of a qDrift channel allows a reduction of an order of magnitude in cost compared to Trotterization. We perform quantum simulations of both models j = 0 and j = 1 on a noiseless simulator for two different simulation times t = 0.05a and t = 0.1a and different values of the strength coefficient b ∈ [0.005, 0.5], using a composite channel Ω q (t; N = 1, M, r = 1). Part A is evolved using one step of the first-order Trotter formula, while part B is evolved using a qDrift channel with N = 1 step. We consider two different sampling distributions: the standard one q = p and q = q c with the cost defined as above. The diamond norm, see Eq. (1), is displayed in Figure 1 against b/a, where the top row shows the computations for the model j = 0 and the bottom one for the model j = 1.
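The sampling stage can be sketched as follows, assuming hypothetical coefficients h_j and costs C_j: step indices are drawn from either p or q_c, each sampled term j is evolved for the reweighted time τ_j = t h_j /(N q(j)) so that the average channel remains unbiased, and the cost of a sampled circuit is the sum of the C_j it contains.

```python
import numpy as np

rng = np.random.default_rng(7)

h = np.array([1.0, 0.5, 0.25])  # hypothetical coefficients h_j > 0
C = np.array([4.1, 2.1, 0.1])   # hypothetical per-term costs C_j
lam = h.sum()                   # lambda = sum_j h_j

p = h / lam                     # standard qDrift distribution p(j)
q_c = (h / C) / (h / C).sum()   # cost-aware distribution q_c(j) ~ h_j / C_j

def sample_qdrift_circuit(q, t, N):
    """Sample N step indices from q. Each sampled term j is evolved for the
    reweighted time tau_j = t * h_j / (N * q(j)), which keeps the average
    channel unbiased (cf. omega(j) = h_j / (lambda * q(j)))."""
    idx = rng.choice(len(q), size=N, p=q)
    taus = t * h[idx] / (N * q[idx])
    return idx, taus

idx, taus = sample_qdrift_circuit(q_c, t=0.1, N=100)
circuit_cost = C[idx].sum()        # implementation cost of this realization
expected_cost_p = float(p @ C)     # E_p[C]
expected_cost_qc = float(q_c @ C)  # E_qc[C] = lambda / lambda_c
```

With q = p the reweighting reduces to the familiar uniform qDrift angle λt/N per step; with q = q_c cheap terms are sampled more often and carry proportionally smaller angles.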
The diamond distance to the ideal simulation is displayed in dashed black for the full Trotterized channel, in red for the deterministic qDrift E(t; N = 1) channel, in blue (yellow) dashed lines for the standard (q = p) composite channel with M = 1 (M = 10), and in violet (green) dots for the importance-sampled (q = q c ) composite channel with M = 1 (M = 10). We perform statistics over R = 50 channels, but since the standard deviations are of order 10 −4 , they are not visible. We observe that the errors of the different composite channels are close to each other and match the full Trotter channel, except at b ≈ 0.05, where Trotterization is slightly better. This means that the different channels have approximately the same precision, and we can now look at the cost needed to implement them, see panels (c) and (f) in the last column. We show the histograms of the cost over R = 50 qDrift channels from the sampled circuits obtained with a standard (importance-sampled) composite qDrift channel with M = 1 or M = 10 experiments, while the error bars correspond to a 95% confidence interval obtained over 50 different cost histograms. For the sake of readability, contiguous bars of different colors, which are displayed with a small shift along the cost axis, have the same cost written underneath. We additionally observe that only a discrete set of bars is occupied, corresponding to each possible cost under the selected partition. Moreover, the dashed and dotted lines correspond to the expectation value of the standard and importance-sampled qDrift channel, respectively, which is the same for M = 1 and M = 10. This emphasizes that the cost of the importance-sampled channel is close to the expected cost, while the cost of the standard channel is more heterogeneous. However, the most important figure of merit is the distribution of the cost.
Hence, the importance-sampled qDrift is mainly composed of low-cost circuits, since the probability of sampling a cheap circuit is high, while the standard qDrift samples terms independently of the cost, leading to a more homogeneous cost distribution. Most importantly, we obtain a reduction in the required amount of resources by a factor of 1.8 (2) for the model j = 0 with the standard (importance-sampled) composite channel and of 3.8 (5) for the model j = 1, without sacrificing precision, compared to the full Trotterization. In fact, the cost for the simulation of the B part alone is reduced by one (two) orders of magnitude with the plain (importance-sampled) qDrift channel compared to direct first-order Trotterization, see also Table 2.

Conclusions
This work generalizes past results on concentration bounds for random product formulas [50], allowing for the introduction of a generic importance sampling scheme. We provide a rigorous characterization of the protocol's bias and statistical fluctuations, extending previous results on qDrift to quantify the algorithm's sample complexity. In particular, we showed that a qDrift channel concentrates exponentially fast in N M around its expectation value, where N is the number of qDrift samples and M the number of experiments, thus allowing an efficient allocation between quantum resources (controlled by N ) and classical ones (controlled by M ). These results allow for parallel, controllable simulations of a qDrift channel on multiple quantum devices while keeping each circuit shallow enough to mitigate the noise and the run time in the NISQ and fault-tolerant eras, respectively. Moreover, by incorporating the individual implementation cost of the evolution under each Hamiltonian term into a suitable sampling distribution q c (j), we show that importance sampling achieves a guaranteed total cost reduction in terms of hardware-native costs, such as the number of CNOT gates, leading the way to a more straightforward implementation of qDrift in the near term. Under reasonable assumptions, similar cost savings can also be obtained when one wants to reduce the number of T gates, which is beneficial for error-corrected devices.
In addition, we extend our results to composite channels [60], where the Hamiltonian is partitioned into A and B, which are simulated separately with a deterministic method (such as a Trotter-Suzuki product formula) and a qDrift channel, respectively.
We show that the same importance sampling distribution q c (j) can also be employed in these cases to reduce the quantum resources required for the implementation. In the typical situation where evolution under different terms of the total Hamiltonian incurs different implementation costs, the explicit inclusion of this information in our construction opens the possibility of further improving the savings achievable with composite channels by optimizing the partitioning schemes [60]. In general, it could be profitable to handle Hamiltonian terms that are expensive but have small norms in a stochastic way using qDrift. We propose different concrete applications within nuclear physics that may benefit from such an approach from an Effective Field Theory perspective.
Finally, the theoretical results are illustrated through numerical simulations of a simple model of a triton on a (2×2) lattice in first quantization. We find that a significant cost reduction (5×) can be obtained using composite channels and importance sampling without sacrificing accuracy. The approach is robust for different strength values between the two parts of the partition. Due to the quadratic scaling with simulation time of the qDrift part of the scheme, the protocol is particularly well suited for applications that require relatively short evolution times, e.g., protocols to measure observables by signal processing [77][78][79].
The example importance sampling strategy described in this work has the advantage of being simple and of providing a guaranteed cost reduction, but it might not be optimal for some specific problems. We leave for future work the exploration of more direct numerical optimizations of the sampling distribution and of the partitioning scheme used to construct composite channels. In addition, other variance reduction techniques, such as particle filters/sequential Monte Carlo (see, e.g., [80,81]) or de-randomization (see, e.g., [82]), could also be used to improve the efficiency of stochastic quantum algorithms like the one described in this work.

A Proofs of the main results
In this section, we present rigorous proofs of the theorems stated in the main results Section 3.2 and Section 3.3. Before providing them, we first need to state two lemmas from Ref. [50].

Lemma 2.
Let U(ρ) = U ρU † and V(ρ) = V ρV † be unitary channels. We then have ‖U − V‖ ⋄ ≤ 2 ‖U − V ‖, where ‖·‖ denotes the spectral norm. The result carries over to ensembles (p k , V k ) of unitary channels with weights p k ≥ 0 for which Σ k p k = 1.
Lemma 3. Let X be Hermitian. We then have the zeroth-order bound ‖exp{iX} − I‖ ≤ ‖X‖ and the first-order bound ‖exp{iX} − I − iX‖ ≤ ‖X‖ 2 /2. We are now in a position to prove Theorem 1, which we recall for the ease of the reader.
Theorem 1 (Bias error bound). Let U(t) be a first-order Trotter product channel, E q (t; N ) an average qDrift channel with importance sampling, and ω(j) = p(j)/q(j) the re-weighting factor. The diamond norm distance between these two channels for N = 1 is then upper bounded, leading to the following result. Proof of Theorem 1. We first note that the Hamiltonian H can be written as the expectation value H = λ E j∼q [ω(j)H j ], with ω(j) = h j /(λq(j)), and therefore with X j (t) = (t h j /q(j)) H j . By noting the displayed identity, we obtain the following bound, together with the subsequent one. If we denote V (t) = exp{−iX(t)} and the corresponding channel as V(t)[ρ] = V (t)ρV (t) † , we can express the deterministic qDrift channel with importance sampling as in Eq. (9) of the main text. We now observe the stated identity, which is the first result we set out to show. The result for the general average channel with N > 1 follows by noting that E q (t; N ) can be obtained from N compositions of E q (t/N ; 1). Following [50], we then decompose the total evolution time t into N steps of duration t/N to find the claimed bound, where we used Lemma 2 for the second line, the union bound for the third, and Eq. (75) for the last step.
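The bias bound can be checked numerically on a small example. The sketch below builds, for a hypothetical single-qubit Hamiltonian H = h 1 X + h 2 Z, the superoperators of the exact evolution and of the average qDrift channel E q (t; N ) (each step applies exp(−i(t/N )(h j /q j )H j ) with probability q j ), and uses the spectral norm of the superoperator difference as a simple proxy for the diamond distance; the bias shrinks as N grows.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

h = np.array([0.8, 0.6])        # hypothetical h_j with ||H_j|| = 1
H_terms = [X, Z]
lam = h.sum()
H = h[0] * X + h[1] * Z

def expm_herm(A, t):
    """exp(-i t A) for Hermitian A via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

def channel(U):
    """Column-stacking superoperator of rho -> U rho U^dagger."""
    return np.kron(U.conj(), U)

def avg_qdrift(q, t, N):
    """Superoperator of the average qDrift channel E_q(t; N): each step
    applies exp(-i (t/N) (h_j/q_j) H_j) with probability q_j."""
    step = sum(qj * channel(expm_herm(Hj, (t / N) * hj / qj))
               for qj, hj, Hj in zip(q, h, H_terms))
    return np.linalg.matrix_power(step, N)

t = 0.5
exact = channel(expm_herm(H, t))
q = h / lam                      # standard distribution; any valid q is unbiased
err = {N: np.linalg.norm(avg_qdrift(q, t, N) - exact, 2) for N in (1, 4, 16)}
```

The superoperator 2-norm is not the diamond norm, but for this small mixed-unitary example it displays the same O(t²/N) decay of the bias.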
Now that we have generalized the bias error bound, we need to understand how a finite importance-sampled qDrift channel concentrates around its expected value. This will provide us with an estimate of M and N for a given accuracy ε. We will use the martingale formalism and refer to Ref. [83] for a more in-depth treatment.

Definition 1 (martingale). Consider a filtration F 0 ⊆ F 1 ⊆ · · · of the master sigma algebra. A sequence of random variables (B k ) k≥0 is a martingale with respect to this filtration if, for every k, (i) B k is F k -measurable (causality) and (ii) E[B k+1 |F k ] = B k (status quo).
The intuition is to think of k as a time index and of F k as containing all events determined by the past up to time k. The causality requirement states that the present B k may only depend on the past (B k−1 , . . . , B 0 ), and the status quo condition formalizes that, on average, today is the same as yesterday.
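As a concrete illustration of Definition 1, a random walk with zero-mean increments is the canonical martingale; the snippet below checks the status quo condition empirically on simulated paths (all values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)

# A random walk with zero-mean increments is the canonical martingale:
# B_k = sum_{i <= k} xi_i with E[xi_i] = 0, so E[B_{k+1} | F_k] = B_k.
paths, steps = 20000, 50
xi = rng.choice([-1.0, 1.0], size=(paths, steps))  # fair-coin increments
B = np.cumsum(xi, axis=1)

# Status quo condition: the mean increment vanishes at every time index
# (here even unconditionally, since increments are independent of the past).
max_bias = np.abs(np.diff(B, axis=1).mean(axis=0)).max()
```

Causality is built in: B_k depends only on the increments drawn up to step k.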
Before proving the theorem, we need to state another result from Ref. [50,Corollary 3.4].
In order to show Theorem 2, we will generalize the construction from Ref. [50] of a suitable interpolating martingale. Let us start by introducing the unitaries V j = e −iτj Hj , for which E q [V j ] = E q [V ] independently of j, and consider the situation where we take a set of M separate samples of the N indices forming the product formula, resulting in M distinct martingales of the form The generalized interpolating martingale needed for our construction can then be defined for j ∈ {0, 1, . . . , N M } as follows where we denote by a/b the integer division of a by b and by a%b the remainder, i.e., a modulo b. It is straightforward to see that the sum of two independent martingales is also a martingale, making D j a valid martingale. We have that the first element is given by while the last element, corresponding to j = N M , is given instead by For the special case M = 1 we recover the construction presented in Ref. [50]. The elements of the associated difference sequence are given, for j ∈ {1, 2, . . . , N M }, by where, for ease of notation, we have introduced a j = j/N + 1 and b j = j%N in the second line. Since both V k and E q [V ] are bounded (‖V k ‖ ≤ 1 almost surely and ‖E q [V ]‖ ≤ 1) we can find the following bound where we used the triangle inequality in the second line and Jensen's inequality in the last. Furthermore, since almost surely, using Lemma 3, it also holds almost surely that Finally, in order to control the variance, we use We will now use this construction to show Theorem 2, which is reformulated below.
In order to guarantee an approximation error ε/2 with probability at least 1 − δ, it is then sufficient to take Proof of Theorem 2. Using the results in Eq. (87) and Eq. (88), we see that the parameters R and v from Corollary 4 can be chosen as Using Corollary 4, we can show that As in Ref. [50], for τ ≤ N R we consider the simpler bound A looser sufficient condition τ ≤ 2λt can be obtained by noticing that max k ω(k) ≥ 1, where the equality only holds when all the weights are the same. Substituting τ = ε/2, using 32/3 ≤ 11 and Lemma 2, we obtain the theorem statement.
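The concentration statement can also be illustrated empirically: averaging M sampled qDrift products and comparing against the exact average channel E q (t; N ), the deviation shrinks as M grows. The sketch below uses a hypothetical two-term single-qubit Hamiltonian and the spectral norm of the vectorized channels as a proxy for the diamond norm.

```python
import numpy as np

rng = np.random.default_rng(1)

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
h = np.array([0.8, 0.6])        # hypothetical coefficients
H_terms = [X, Z]
q = h / h.sum()                 # sampling distribution (standard qDrift)

def expm_herm(A, t):
    w, V = np.linalg.eigh(A)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

def channel(U):
    return np.kron(U.conj(), U)  # column-stacking superoperator of U . U^dag

def sampled_qdrift(t, N):
    """One realization of an N-step sampled qDrift product formula."""
    U = np.eye(2, dtype=complex)
    for j in rng.choice(len(q), size=N, p=q):
        U = expm_herm(H_terms[j], (t / N) * h[j] / q[j]) @ U
    return channel(U)

def mean_deviation(t, N, M, reps=20):
    """Average distance between an M-sample empirical channel and the exact
    average channel E_q(t; N), over `reps` independent repetitions."""
    step = sum(qj * channel(expm_herm(Hj, (t / N) * hj / qj))
               for qj, hj, Hj in zip(q, h, H_terms))
    mean_chan = np.linalg.matrix_power(step, N)
    devs = [np.linalg.norm(sum(sampled_qdrift(t, N) for _ in range(M)) / M
                           - mean_chan, 2) for _ in range(reps)]
    return float(np.mean(devs))

t, N = 0.5, 4
dev_small, dev_large = mean_deviation(t, N, M=5), mean_deviation(t, N, M=500)
```

The roughly 1/√M shrinkage of the empirical deviation is the practical face of the exponential concentration in N M established above.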
Finally, these results can be used to compute the expected fluctuation bound, see Corollary 1.
Proof. The result can be shown by extending the martingales introduced for the proof of the concentration bound, Theorem 2. We start by noticing that E q [W j ] = E q [W ] independently of j. For each of the M experiments we consider r sets of N indices for the unitaries forming the product and denote by W m sj the j-th unitary in the s-th block for the m-th experiment. We can then use them to generalize the martingale from Eq. (79) to for k ∈ {0, 1, . . . , N r}. Here we have defined the indices as s k = (k − 1)/N + 1 and j 0 = 0, while for k > 0, j k = (k − 1)%N + 1. The different definition of j 0 is required in order to accommodate the edge case k = 0. We also require that B 0 k = 0, as well as B m k = 0 for m > M , as was done before for the B m k martingale. The causality condition in Definition 1 is automatically satisfied since, for every m, B m k is completely determined by the random samples W m 11 , . . . , W m s k j k obtained up to the k-th step. The second condition can be checked explicitly: for k/N = (k − 1)/N , implying that s k+1 = s k and j k+1 = j k + 1, we have while, for k/N = (k − 1)/N + 1, i.e., k = nN for some integer n and therefore s k+1 = s k + 1 together with j k = N and j k+1 = 1, we have the following by noting that N − j k = 0. Finally, the new interpolating martingale for the different experiments takes the following form with the special cases for j = 0 and j = N M r given, similarly to before, by Since between consecutive indices the martingale D j changes by only a single unitary, the difference sequence C j = D j − D j−1 has properties similar to those of C j in Eq. (84) above. In particular, we obtain the analogous bound using the same strategy employed to arrive at Eq. (87), and similarly for the variance. The result then follows by taking the parameters R and v from Corollary 4 as before and using Corollary 4 as was done to show Theorem 2.
We are now in a position to show an upper bound for the expected error of a composite channel, where the parameter contains the dependence on the commutators.
Proof. We prove the corollary in the same way as Corollary 1, by relating the diamond norm distance to the operator norm for ensembles of unitary channels, see Lemma 2, and then using the triangle inequality. The first term can be bounded using Theorem 1 and the error bounds for composite channels from Eq. (37), with R = (tλ B /(N r)) (1 + max k ω(k)) from above. As in Ref. [50], the integral is evaluated by splitting it into two parts. The first, with a contribution of nearly one, arises when the denominator in the exponent is bigger than the numerator, i.e., τ ≤ max

B Proofs for the cost reduction
In this section, we recall and prove the results for the particular distribution q c (j) = h j /(λ c C j ), where C j is the implementation cost of the corresponding term and λ c = Σ j h j /C j . The expected per-step cost then satisfies E qc [C] = Σ j q c (j) C j = λ/λ c = 1/E p [1/C] ≤ E p [C], where we used in the last step Jensen's inequality with ϕ(C) = 1/C, which is convex for real positive numbers C > 0.
This result shows that, on average, unitaries sampled from q(j) = q c (j) are cheaper to implement than those sampled from q(j) = p(j). It remains to be shown whether the total implementation cost at fixed accuracy is also reduced with this choice of distribution.
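This per-step claim can be verified numerically: under q c the expected cost equals the harmonic mean 1/E p [1/C], which Jensen's inequality bounds by the arithmetic mean E p [C]. The check below uses random hypothetical coefficients and costs.

```python
import numpy as np

rng = np.random.default_rng(3)

def expected_costs(h, C):
    """Expected per-step cost under p(j) = h_j/lambda and under the
    cost-aware distribution q_c(j) = h_j/(lambda_c C_j)."""
    p = h / h.sum()
    q_c = (h / C) / (h / C).sum()
    return float(p @ C), float(q_c @ C)

# E_qc[C] equals the harmonic mean 1/E_p[1/C], which Jensen's inequality
# (phi(C) = 1/C is convex for C > 0) bounds by the arithmetic mean E_p[C].
for _ in range(100):
    h = rng.uniform(0.1, 2.0, size=8)
    C = rng.uniform(0.1, 6.0, size=8)
    Ep, Eqc = expected_costs(h, C)
    p = h / h.sum()
    assert abs(Eqc - 1.0 / float(p @ (1.0 / C))) < 1e-9
    assert Eqc <= Ep + 1e-12
```

Equality E qc [C] = E p [C] is attained only when all C j coincide, in which case importance sampling reduces to plain qDrift.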
Theorem 4 (Cost reduction -pure qDrift). Let N p and N qc be the numbers of qDrift samples for the two distributions p(j) = h j /λ and q c (j) = h j /(λ c C j ) for a given target precision ε. The expected cost of the importance-sampled qDrift channel is then always smaller than that of the standard one The number of experiments is instead increased as and, notably, is independent of the total evolution time t.
Proof of Theorem 4. The factor between N p and N qc , see Eq. (24), can be expanded as follows Using this result, together with Corollary 3 and Theorem 1, we find that a sufficient choice of N qc to guarantee an error ε over a total time t is at least which translates into an average total cost of On the other hand, using regular qDrift we have In order to guarantee a cost reduction we then need or equivalently This is always satisfied, as one can easily show using Jensen's inequality. The result on the increase in the number of experiments follows instead directly from the definition of the distribution q c (j) and the sufficient condition Eq. (32) derived from Corollary 1.
We will finally show that the cost reduction is also retained when considering composite channels.
Theorem 5 (Cost reduction -composite channel). Let C p (ε, t) and C qc (ε, t) be the expected costs to implement the composite channels Ω p (t; N, M, r) and Ω qc (t; N, M, r) using the two distributions p(j) = h j /λ and q c (j) = h j /(λ c C j ) for a given target precision ε and propagation time t. Then the following holds The number of experiments is instead increased, M qc (ε) ≥ M p (ε), but it retains the same scaling with the error ε and system size n, and is also independent of the total evolution time t.
Proof. As per Eq. (41), we know that the cost of the composite channel can be written as Since the first term is independent of the choice of the sampling distribution, we only need to consider the second one, which also appears in Theorem 4 and can therefore be bounded in the same way. For the number of experiments, using Eq. (48) and Eq. (49), for the p(j) distribution we have while for the importance-sampled distribution q c (j) we find M qc (ε) = n 2α 2 κ (κ − 1) 2 where the second line is obtained by remarking that, due to Jensen's inequality, To get to the third line we used the stated inequality, and the last inequality follows from 1 + E p [1/C B ] max j C B j ≥ 2.