Time-marching based quantum solvers for time-dependent linear differential equations

The time-marching strategy, which propagates the solution from one time step to the next, is a natural strategy for solving time-dependent differential equations on classical computers, as well as for solving the Hamiltonian simulation problem on quantum computers. For more general homogeneous linear differential equations $\frac{d}{dt}|\psi(t)\rangle = A(t)|\psi(t)\rangle$, $|\psi(0)\rangle = |\psi_0\rangle$, a time-marching based quantum solver can suffer from an exponentially vanishing success probability with respect to the number of time steps and is thus considered impractical. We solve this problem by repeatedly invoking a technique called uniform singular value amplification, and the overall success probability can be lower bounded by a quantity that is independent of the number of time steps. The success probability can be further improved using a compression gadget lemma. This provides a path for designing quantum differential equation solvers that is alternative to those based on quantum linear systems algorithms (QLSA). We demonstrate the performance of the time-marching strategy with a high-order integrator based on the truncated Dyson series. The complexity of the algorithm depends linearly on the amplification ratio, which quantifies the deviation from unitary dynamics. We prove that the linear dependence on the amplification ratio attains the query complexity lower bound and thus cannot be improved in the worst case. This algorithm also surpasses existing QLSA based solvers in three aspects: (1) the coefficient matrix $A(t)$ does not need to be diagonalizable; (2) $A(t)$ can be non-smooth, and only needs to be of bounded variation; (3) it can use fewer queries to the initial state $|\psi_0\rangle$. Finally, we demonstrate the time-marching strategy with a first-order truncated Magnus series, which simplifies the implementation compared to the high-order truncated Dyson series approach, while retaining the aforementioned benefits. Our analysis also raises some open questions concerning the differences between time-marching and QLSA based methods for solving differential equations.


Introduction
In modern science and engineering, mathematical descriptions of "real-world" problems often lead to differential equations, which play a prominent role in many disciplines such as physics, chemistry, biology and economics. Efficient simulation of large-scale differential equations has therefore served as one of the core tasks in many scientific applications. Recent advances in quantum algorithms indicate that quantum computers may significantly accelerate the simulation of such differential equations, particularly for those defined in high-dimensional spaces. Let $N$ be the number of degrees of freedom (e.g., the number of discretized points in a high-dimensional space), which can be very large. For certain tasks, quantum computers can store and manipulate vectors of size $N$ at a cost that scales only as $\mathrm{polylog}(N)$, which leads to potentially significant advantages over classical computers.
In this paper, we focus on the initial value problem for the system of homogeneous linear ordinary differential equations (ODEs)
$$\frac{d}{dt}|\psi(t)\rangle = A(t)|\psi(t)\rangle, \quad |\psi(0)\rangle = |\psi_0\rangle, \quad (1)$$
where $t \in [0, T] \subset \mathbb{R}_+$ is the independent variable with $T$ as the final time, the vector $|\psi(t)\rangle \in \mathbb{C}^N$ is the dependent variable, the coefficient matrix $A(t) \in \mathbb{C}^{N \times N}$ is a matrix-valued function of $t$, and $|\psi_0\rangle \in \mathbb{C}^N$ is the initial condition. We assume the differential equation has a well-posed solution, which is implied by, e.g., $A(t)$ being piecewise continuous. The coefficients can have jump discontinuities and are not required to be smooth in $t$. More precisely, we only need to assume that $A(t)$ is of bounded variation, i.e., the total variation $V_0^T(A)$ (see Eq. (10) for the definition) is finite. The task for quantum differential equation solvers is to prepare a quantum state that is proportional to the final solution $|\psi(T)\rangle$ to a certain precision.
One prominent example is the Hamiltonian simulation problem, which is a homogeneous linear differential equation governed by $A(t) = -iH(t)$, where $H(t)$ is a Hermitian matrix. Recent years have witnessed remarkable progress on designing new algorithms as well as establishing improved theoretical complexity estimates of existing algorithms, for both time-independent Hamiltonian simulation [9-13, 15, 20, 23, 27-30, 53, 55, 57, 60, 69] and time-dependent ones [3,4,7,24,47,57,66,67]. These quantum algorithms can be applied when the underlying dynamics is unitary or can be converted into a unitary dynamics (see e.g., [33]). Compared to the Hamiltonian simulation problem, whose dynamics is unitary, quantum algorithms for general linear differential equations are considerably less explored. Such algorithms produce a quantum state representing the solution of the differential equation. This is different from having the solution stored in classical memory, but we can still extract information from the quantum state that may not be efficiently obtainable from classical computation (see e.g., [26]). Unless otherwise specified, we do not consider the special case when $A(t) = A$ is time-independent, in which setting we may directly encode the propagator (i.e., the matrix function $e^{AT}$) using the quantum singular value transformation (QSVT) [38] when $A$ is Hermitian or anti-Hermitian, or using a contour integral based strategy for a general $A$ (see [59,61], as well as Section 5.3).
The time-marching strategy is a natural strategy for solving time-dependent differential equations, and is adopted by nearly all classical differential equation solvers (see e.g., [42,43]). The idea is to divide the entire time interval $[0, T]$ into a number of short segments separated by the temporal mesh points $0 = t_0 < t_1 < \cdots < t_L = T$, and to calculate the solution at the next time step using the solution from the previous $k$ time steps. Methods with $k = 1$ are called one-step methods (e.g., Runge-Kutta methods). Methods with $k > 1$ are called multi-step methods (e.g., Adams methods). Multi-step methods require multiple copies of the solution from previous steps, and cannot be directly implemented on quantum computers due to the obstruction of the no-cloning theorem. For non-unitary dynamics, even one-step methods lead to severe challenges in the design of quantum algorithms, mainly due to a potentially diminishing success probability. In Section 1.3, we use an illustrative example with the simplest one-step integrator (the forward Euler method) to demonstrate such challenges. As a result, the time-marching strategy has not been viewed as a practical route for designing quantum differential equation solvers beyond the Hamiltonian simulation problem.
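As a point of reference, the classical one-step time-marching loop is straightforward to state in code. The sketch below (with a hypothetical $2\times 2$ example, not from the paper) applies the forward Euler update $|\psi_{l+1}\rangle = (I + \Delta t\, A(t_l))|\psi_l\rangle$ step by step:

```python
import numpy as np

def time_march_euler(A, psi0, T, L):
    """Classical forward-Euler time marching for d/dt |psi> = A(t) |psi>.

    A: callable mapping t to an (N, N) ndarray; psi0: length-N initial vector.
    Divides [0, T] into L uniform steps and propagates one step at a time.
    """
    dt = T / L
    psi = psi0.astype(complex)
    for l in range(L):
        # One-step update: psi_{l+1} = (I + dt * A(t_l)) psi_l
        psi = psi + dt * (A(l * dt) @ psi)
    return psi

# Example: A(t) = -iH with a constant Hermitian H, i.e., unitary dynamics.
H = np.array([[1.0, 0.5], [0.5, -1.0]])
A = lambda t: -1j * H
psi0 = np.array([1.0, 0.0])
psi_T = time_march_euler(A, psi0, T=1.0, L=4000)
```

For unitary dynamics the propagated vector approximately preserves its norm, with $\mathcal{O}(\Delta t)$ global error, which is the classical behavior the quantum challenges below should be contrasted with.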

Related works
The prevailing strategy for designing quantum differential equation solvers to date is to construct a large linear system recording the states during the entire history of the evolution, and then to apply quantum linear systems algorithms (QLSA) [1,5,25,32,44,49,58] to solve the resulting linear systems of equations. One may perform certain amplification procedures to boost the success probability of obtaining the final solution. This QLSA-based strategy was first proposed by Berry [8], and it successfully avoids the pitfall described in Section 1.3. It has been adopted by various quantum linear differential equation solvers [14,26,48,50], and has been applied to solve nonlinear differential equations using linearization techniques [2,34,36,45,46,51,52,64].
The work [8] uses multi-step integrators, and the analysis is applicable to a time-independent $A$ that is diagonalizable as $A = V D V^{-1}$ with eigenvalues $\lambda_j = D_{jj}$ satisfying
$$|\arg(-\lambda_j)| \le \theta_0, \quad 0 < \theta_0 < \pi/2, \quad \forall j. \quad (2)$$
The query complexity of the algorithm scales polynomially in $T$, the spectral norm $\|A\|$, the inverse precision $\epsilon^{-1}$, and the condition number of the eigenvector matrix $\kappa_V := \|V\|\,\|V^{-1}\|$.
For time-independent $A$, the work [14] combines a QLSA-based solver with the truncated Taylor series method, and the assumption in Eq. (2) was relaxed to $A$ being dissipative (i.e., $\mathrm{Re}(\lambda_j) \le 0$, $\forall j$). The complexity of this algorithm is also improved, depending on $d$, $\kappa_V$, and $q$, where $d$ is the sparsity of the matrix $A$, $\kappa_V$ is the condition number of $V$, and $q = \sup_{t \in [0,T]} \||\psi(t)\rangle\| / \||\psi(T)\rangle\|$. For time-dependent linear differential equations, [26] combines a QLSA-based solver with a Chebyshev pseudospectral method. It assumes that $A(t)$ is diagonalizable as $A(t) = V(t) D(t) V(t)^{-1}$ and dissipative for all $t \in [0, T]$, and that the underlying solution is sufficiently smooth in $t$; the resulting complexity is stated in [26, Theorem 1]. The smoothness of the solution implicitly requires that $A(t)$ be sufficiently smooth (e.g., analytic in $t$).
When the coefficient matrix $A(t)$ is not sufficiently smooth, the polylogarithmic dependence on $g$ no longer holds, and the complexity of the algorithm depends polynomially on $\epsilon^{-1}$.
In all the analyses above, the complexity depends explicitly on the condition number of the eigenvector matrix $\kappa_V$, which can be difficult to estimate and may lead to a significant overestimation of the cost. The recent work by Krovi [48] replaces the dependence on $\kappa_V$ by a dependence on $\sup_{t \in [0,T]} \|e^{At}\|$. This is a significant improvement due to the inequality $\sup_{t \in [0,T]} \|e^{At}\| \le \kappa_V$ for dissipative $A$, and the gap between the two sides can be arbitrarily large: for instance, a nearly defective dissipative matrix can have an arbitrarily large $\kappa_V$ while $\sup_{t \in [0,T]} \|e^{At}\|$ remains bounded. However, the technique in [48] is only applicable when $A$ is time-independent.
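The gap between $\kappa_V$ and $\sup_t \|e^{At}\|$ can be checked numerically. The sketch below (a hypothetical example, not taken from [48]) uses a nearly defective dissipative matrix: $\kappa_V$ blows up as the two eigenvalues merge, while $\sup_t \|e^{At}\|$ stays of order $M/e$:

```python
import numpy as np

def kappa_V(A):
    """Condition number ||V|| * ||V^{-1}|| of the eigenvector matrix of A."""
    _, V = np.linalg.eig(A)
    return np.linalg.cond(V)

def sup_propagator_norm(A, T=50.0, n_grid=5001):
    """Estimate sup_{t in [0, T]} ||e^{A t}|| on a uniform time grid."""
    w, V = np.linalg.eig(A)
    Vinv = np.linalg.inv(V)
    return max(
        np.linalg.norm(V @ np.diag(np.exp(w * t)) @ Vinv, 2)
        for t in np.linspace(0.0, T, n_grid)
    )

# Nearly defective, dissipative matrix: eigenvalues -1 and -1 - eps.
M, eps = 100.0, 1e-6
A = np.array([[-1.0, M], [0.0, -1.0 - eps]])
kv = kappa_V(A)                   # ~ M / eps: enormous
supnorm = sup_propagator_norm(A)  # ~ M / e: moderate
```

Here `kv` is of order $10^8$ while `supnorm` is a few tens, so a $\kappa_V$-based bound would overestimate the actual transient growth by many orders of magnitude.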

Contribution
In this work, we propose that the time-marching method can become an efficient strategy for solving linear differential equations. Our main technical tool (Theorem 4) is a method to implement a sequence of non-unitary operations without incurring an exponential overhead. Theorem 4 has two main ingredients. The first is the uniform singular value amplification procedure, which was first developed in [54] and refined in [38]. It allows us to amplify the success probability in each time step in a way that is oblivious to the quantum state. We find that in our context, the degree of the polynomial used by the uniform singular value amplification procedure in [38] exhibits a very large preconstant. We use a convex optimization based method to reduce the preconstant by orders of magnitude. Theorem 4 also proposes a useful tool called the compression gadget (which improves upon the result developed in [57]) to coherently combine short-time evolution operators into a long-time one without duplicating ancilla qubits. Combined with amplitude amplification [17], this leads to a further quadratic speedup in terms of the query complexity. Consider a temporal mesh $0 = t_0 < t_1 < \cdots < t_L = T$. Using the time-marching strategy and the high-order truncated Dyson series algorithms [12,47,57] for the short-time evolution, we propose an algorithm to solve Eq. (1), which requires $\mathcal{O}(Q (\alpha T)^2 \log(\epsilon^{-1}))$ queries to the coefficient matrix $A(t)$, and $\mathcal{O}(Q)$ queries to the initial state $|\psi_0\rangle$ (Theorem 8). Here $Q$, the amplification ratio defined in Eq. (4), is a central quantity of this work. It quantifies the non-unitarity of the dynamics (e.g., $Q = 1$ for unitary dynamics). Compared to the state-of-the-art results based on the QLSA in [26] and [48], our algorithm has several advantages, which we summarize in Table 1. First, our method does not require the diagonalizability of $A(t)$, and the complexity is independent of the condition number $\kappa_V$.
This generalizes the result of [48] to equations with time-dependent matrix coefficients.
Second, our algorithm also has lower regularity requirements for the coefficient matrix $A(t)$ and the solution $|\psi(t)\rangle$. In [26] the query complexity depends on the high-order derivatives of the solution ($g = \max_{t \in [0,T]} \max_{n \in \mathbb{N}} \||\psi^{(n+1)}(t)\rangle\|$ in [26, Theorem 1]), and if the solution is not smooth, the spectral method loses the desirable exponential accuracy. Remarkably, in our method, exponential accuracy is achieved even when the coefficient matrix $A(t)$ is not smooth, thus allowing $|\psi(t)\rangle$ to be non-smooth. The results in both Theorem 8 and Theorem 14 are insensitive to roughness in the coefficients $A(t)$. Note that $A(t)$ being of bounded variation is a very weak regularity condition, and in particular $V_0^T(A) = \int_0^T \|A'(t)\|\,dt$ when $A$ is differentiable. This also allows for the existence of jump discontinuities in the coefficients.
Third, our algorithm may use fewer queries to the initial state, which is advantageous if the initial state preparation is an important cost factor. To simplify the discussion, let us focus on the dependence on $T$ and assume all other quantities, such as $\alpha$ and $Q$, to be constants. In order to achieve the scaling in [26, Theorem 1] (for the other QLSA based differential equation solvers discussed in Section 1.1, the situation is similar), the quantum solver must employ a quantum linear system solver with near-optimal query complexities. In other words, the complexity of the linear system solver should scale as $\mathcal{O}(\kappa\,\mathrm{polylog}(\epsilon^{-1}))$, where $\kappa$ is the condition number of the linear system. Moreover, the near-optimal quantum algorithms also need to query the initial state $\mathcal{O}(\kappa)$ times. Such a dependence is most clearly seen from the perspective of adiabatic based near-optimal quantum linear system solvers [5,32,49]. This is because the construction of the adiabatic Hamiltonian corresponding to the linear system uses the initial state, and thus each query to the adiabatic Hamiltonian also queries the initial state. As a result, the quantum differential equation solvers in both [26,48] need to query the initial state $\mathcal{O}(\kappa) = \mathcal{O}(T)$ times. The relation between $\kappa$ and $T$ is rooted in the no-fast-forwarding theorem and cannot be generally improved. Our time-marching based algorithm does not rely on such adiabatic constructions, and the query complexity to the initial state does not explicitly depend on $T$. In the case where the time evolution is almost unitary, i.e., $Q = \mathcal{O}(1)$, and where $T$ is large, this feature can offer a significant advantage.
Our method (Theorem 8) also has two drawbacks. The first is that the $T$ dependence in the number of queries to $A(t)$ is sub-optimal. The direct reason is that the uniform singular value amplification procedure becomes increasingly costly as $T$ increases. The second is that the cost of our algorithm depends on $Q$ defined in Eq. (4), while previous QLSA based algorithms depend on $q = \max_{t \in [0,T]} \||\psi(t)\rangle\| / \||\psi(T)\rangle\|$, which satisfies $q \le Q$. We prove in Theorem 10 that the $\mathcal{O}(Q)$ dependence attains the query lower bound and cannot be improved in the worst case. There exist instances where $Q$ can significantly (even exponentially) overestimate $q$, as will be discussed at the end of Section 4. However, $q\kappa_V$ and $Q$ may not be directly related in general.
Finally, the implementation of the high-order truncated Dyson series algorithm requires complicated quantum control logic for handling time-ordering operators. To simplify the implementation, we combine the time-marching strategy with a first-order truncated Magnus series, whose implementation does not require such complex quantum control logic. Though the cost of the resulting algorithm depends on higher powers of $T$ and $\epsilon^{-1}$ compared to the high-order integrators, it retains the aforementioned advantages over QLSA based solvers. Interestingly, this algorithm also exhibits a commutator scaling for differential equations in the high-precision limit (see Theorem 14), which can be desirable when the norm of the commutator $\alpha_{\mathrm{comm}} = \sup_{s,\tau \in [0,T]} \|[A(s), A(\tau)]\|$ is much smaller than $\alpha$.

Table 1: Comparison with state-of-the-art high-order algorithms. We compare the high-order algorithms in terms of whether they require a smooth $A(t)$, whether they have a $\kappa_V$ dependence, and the $T$ scaling in the number of queries to $A(t)$ and to the initial state in the case when $A(t)$ is smooth.
Here $|\psi^{(n+1)}(t)\rangle$ denotes the $(n+1)$-th derivative of $|\psi(t)\rangle$ and $\mathbb{N}$ is the set of natural numbers. The algorithm in [48, Theorem 10] on the last row is designed only for the time-independent case.

Challenges in designing time marching based quantum solvers
The most straightforward way of solving the ODE (1) is arguably the (forward) Euler method. For simplicity, we consider the time-independent case, i.e., $A(t) = A$, and choose uniform time step sizes: $t_l - t_{l-1} = T/L$. We further assume $A$ is a normal matrix and can be unitarily diagonalized. Starting from $|\psi_0\rangle$, at each time step $l$, we go from $|\psi_{l-1}\rangle \approx |\psi(t_{l-1})\rangle$ to $|\psi_l\rangle \approx |\psi(t_l)\rangle$ via $|\psi_l\rangle = (I + A(t_l - t_{l-1})) |\psi_{l-1}\rangle$.
We will show that a direct implementation of this method on a quantum computer leads to severe challenges despite its simple appearance. Let us first look at how we should implement $\widetilde{\Xi}_l = I + A(t_l - t_{l-1})$ on a quantum computer. Generally we can assume that $A$ is given through a block encoding (see Appendix B for a short introduction to block encoding and QSVT), with which we can construct a block encoding of $\widetilde{\Xi}_l$, denoted by $U_l$, through a linear combination of unitaries. This construction, if performed directly using [38, Lemma 29], involves a subnormalization factor of $1 + \|A\|(t_l - t_{l-1}) = 1 + \|A\| T/L$. Going from $|\psi_{l-1}\rangle$ to $|\psi_l\rangle$, we apply the block encoding $U_l$ and measure the ancilla qubits. The success of the procedure depends on the measurement outcome, and the success probability is
$$\frac{1}{(1 + \|A\| T/L)^2} \frac{\||\psi_l\rangle\|^2}{\||\psi_{l-1}\rangle\|^2},$$
where the first factor comes from the subnormalization factor discussed above. Since the success at each time step is independent, the total success probability of implementing Euler's method for $L$ steps is
$$\frac{1}{(1 + \|A\| T/L)^{2L}} \frac{\||\psi_L\rangle\|^2}{\||\psi_0\rangle\|^2}.$$
For this method to yield a meaningful result, we need $|\psi_L\rangle \approx |\psi(T)\rangle$, and consequently $\||\psi_L\rangle\| \approx \||\psi(T)\rangle\|$. The success probability is therefore approximately $e^{-2\|A\|T} \||\psi(T)\rangle\|^2 / \||\psi(0)\rangle\|^2$, and it takes $e^{2\|A\|T} \||\psi(0)\rangle\|^2 / \||\psi(T)\rangle\|^2$ trials for the procedure to succeed with $\Omega(1)$ probability. To see why this is not a reasonable scaling, let us consider the case where $A$ is anti-Hermitian, which yields the Schrödinger equation, and we have $\||\psi(T)\rangle\| = \||\psi(0)\rangle\|$. The number of trials needed is therefore $e^{2\|A\|T}$, despite the fact that the usual Hamiltonian simulation algorithms generally succeed in one run!
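The exponential decay of the success probability can be checked numerically. The sketch below (with a hypothetical $2\times 2$ anti-Hermitian $A$) accumulates the per-step factor $(1 + \|A\|T/L)^{-2}$ and compares the result with $e^{-2\|A\|T}$:

```python
import numpy as np

# Anti-Hermitian A = -iH: the exact dynamics is unitary, so ||psi(T)|| = ||psi(0)||
# and the success probability is governed entirely by the subnormalization.
H = np.array([[1.0, 0.5], [0.5, -1.0]])
A = -1j * H
normA = np.linalg.norm(A, 2)

T, L = 5.0, 10_000
dt = T / L

# Each step succeeds with probability ||psi_l||^2 / ((1 + ||A|| dt)^2 ||psi_{l-1}||^2);
# for unitary dynamics the state norms cancel and the product telescopes:
p_success = (1.0 + normA * dt) ** (-2 * L)
```

Even for this modest example the total success probability is around $10^{-5}$, matching $e^{-2\|A\|T}$, and it degrades exponentially as $T$ grows.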
The problem becomes even worse if we want to implement the time step $e^{AT/L}$ with high accuracy, rather than approximating it with $I + AT/L$. For example, we may consider implementing $e^{AT/L}$ through QSVT, if $A$ is either Hermitian or anti-Hermitian. But the subnormalization factor that comes from QSVT has an extra factor of 2, becoming $2\|e^{AT/L}\|$, due to the summation of the even and odd parts [37, Theorem 56] or the real and imaginary parts [37, Theorem 58]. In the end the number of trials required becomes $4^L \|e^{AT}\|^2 \||\psi(0)\rangle\|^2 / \||\psi(T)\rangle\|^2$, which increases exponentially with respect to the number of segments $L$ even for a finite $T$. In the anti-Hermitian case, we can use oblivious amplitude amplification (OAA) [38, Theorem 15] to solve this problem, or use the algorithm in [55] to avoid this problem entirely, but both methods work only because of the unitarity of the exact time evolution. For non-unitary dynamics, as OAA is not applicable, we need a different strategy to implement a time-marching based method.

Organization
The rest of the paper is organized as follows: in Section 2 we provide an overview of the method, presenting our method (Theorem 4) to link up short-time evolutions into a long-time evolution while keeping the success probability from decaying faster than necessary. In Section 3 we discuss how to implement the short-time evolution using the truncated Dyson series algorithm, leading to our first algorithm in Theorem 8. The amplification ratio $Q$ dependence of this algorithm is shown to be optimal in Section 4. In Section 5 we demonstrate the performance of the time-marching based method with a first-order truncated Magnus series, which simplifies the implementation and also exhibits a commutator scaling in the high-precision limit.

Overview of the method

Main idea
Let us first revisit the challenge of implementing the Euler method in Section 1.3. The reason that we end up with the exponential overhead $e^{2\|A\|T}$ is that each time step involves a subnormalization factor $1 + \|A\| T/L$. Now let us consider: what if the subnormalization factor were $\|I + AT/L\|$ rather than $1 + \|A\| T/L$? The subnormalization factor cannot be better than this, because we need to encode $I + AT/L$ into a unitary matrix that has spectral norm 1. Again assume $A$ is a normal matrix. If the subnormalization factor is indeed $\|I + AT/L\|$, then the $e^{2\|A\|T}$ overhead is replaced by approximately $\|e^{AT}\|^2$. This is a far more reasonable scaling. For instance, if $A$ is anti-Hermitian, then $\|e^{TA}\| = 1$, rather than exponentially growing with time.
From the above discussion, we can see that the seemingly subtle difference between the subnormalization factors $1 + \|A\| T/L$ and $\|I + AT/L\|$ is in fact crucial. Implementing $(I + AT/L) |\psi_{l-1}\rangle$ using a linear combination of unitaries as discussed in Section 1.3 involves the success probability described in (6), which can be much smaller than the intrinsic success probability
$$\frac{1}{\|I + AT/L\|^2} \frac{\||\psi_l\rangle\|^2}{\||\psi_{l-1}\rangle\|^2}.$$
The ratio between the intrinsic success probability and Eq. (6) is
$$\left(\frac{1 + \|A\| T/L}{\|I + AT/L\|}\right)^2,$$
which comes from excessive subnormalization due to the construction of the block encoding. This excessive factor can be removed through a technique called uniform singular value amplification [38, Theorem 17]. In a nutshell, the uniform singular value amplification uses an odd polynomial $P(x)$ to approximate the linear function $\gamma x$ on an interval around the origin, subject to the norm constraint $|P(x)| \le 1$ for all $x \in [-1, 1]$. The norm constraint is a crucial requirement for applying QSVT with the polynomial $P$. The effect of the uniform singular value amplification is that it approximately multiplies a factor $\gamma$ into the encoded operator, which nearly exactly cancels the excessive subnormalization factor. One important feature of the uniform singular value amplification is that it is oblivious to the quantum state we want to act on, i.e., $|\psi_{l-1}\rangle$. This means we do not need to repeatedly prepare quantum states from previous time steps in this amplification procedure, and this is the key to avoiding an exponential overhead. After repeated use of the uniform singular value amplification, the subnormalization factor scales as $\|e^{TA}\|$ instead of $e^{T\|A\|}$. The same technique also solves the problem of the subnormalization factor being much larger than 1, as discussed in Section 1.3. As discussed above, for the time-independent case, assuming $\||\psi(0)\rangle\| = 1$, we need to run the algorithm $\|e^{AT}\|^2 / \||\psi(T)\rangle\|^2$ times to achieve an $\Omega(1)$ overall success probability. In the time-dependent case, this factor takes a slightly more complicated form involving the product of the norms of the short-time propagators over the temporal mesh $0 = t_0 < t_1 < \cdots < t_L = T$, which is determined by our choice of the mesh.
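The following numerical sketch (using the same kind of hypothetical anti-Hermitian $A$ as in Section 1.3) contrasts the two accumulated subnormalization factors: the scalar factor $(1 + \|A\|T/L)^L$ grows like $e^{\|A\|T}$, while the operator-norm factor $\|I + AT/L\|^L$ stays close to $\|e^{AT}\| = 1$:

```python
import numpy as np

H = np.array([[1.0, 0.5], [0.5, -1.0]])
A = -1j * H                 # anti-Hermitian: ||e^{AT}|| = 1
T, L = 5.0, 2000
dt = T / L

# Excessive subnormalization from the naive LCU construction: ~ e^{||A|| T}.
scalar_factor = (1.0 + np.linalg.norm(A, 2) * dt) ** L

# Intrinsic subnormalization: ||I + A dt|| = sqrt(1 + (sigma * dt)^2) = 1 + O(dt^2)
# for each singular value sigma of A, so the product stays bounded as L grows.
operator_factor = np.linalg.norm(np.eye(2) + A * dt, 2) ** L
```

For these parameters the scalar factor is in the hundreds while the operator-norm factor remains within a percent of 1, which is exactly the gap that uniform singular value amplification removes.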
This gives the definition of $Q$ in Eq. (4). Because the $Q$ dependence comes from the number of trials needed to complete the whole procedure successfully with high probability, a natural question is whether the dependence can be improved from $\mathcal{O}(Q^2)$ to $\mathcal{O}(Q)$ using amplitude amplification [17]. However, a direct application of amplitude amplification comes at a price. As mentioned above, at each time step we need to perform measurements on the ancilla qubits, and we need to ensure that the measurement results from all the time steps are correct. A direct application of amplitude amplification requires different ancilla qubits for each time step, so the number of ancilla qubits needed would scale linearly with the number of time steps. To reduce the number of ancilla qubits, we employ a simplified version of the compression gadget introduced in [57] to coherently record the success or failure of each time step. This method ensures that the number of ancilla qubits needed to implement amplitude amplification scales only logarithmically in the number of time steps.

Input Model
To solve the ODE (1), we need to store the information of the coefficient matrix $A(t)$ and the initial state $|\psi(0)\rangle$ in the quantum computer. We assume that we have access to a unitary circuit $U_{\mathrm{init}}$ that prepares the initial state, i.e., $U_{\mathrm{init}} |0^n\rangle = |\psi(0)\rangle$, where $n$ is the number of qubits and $2^n = N$. For $A(t)$ this requires some explanation, in particular because we need not just a single matrix but a family of matrices on the time interval $[0, T]$.
A similar problem is encountered in the setting of the time-dependent Hamiltonian simulation problem, where a time-dependent matrix encoding was proposed in [57] to encode the time-dependent Hamiltonian. We adopt essentially the same idea, and extend it to the non-Hermitian case.
Definition 1 (Time-dependent matrix encoding). An $(n_q, m, a, b, \alpha, \epsilon)$-MAT is a unitary that acts on three registers, containing $n_q$, $m$, and $n$ qubits respectively. It satisfies
$$\left(\langle 0^m| \otimes I\right) \mathrm{MAT} \left(|0^m\rangle \otimes I\right) = \sum_{k=0}^{2^{n_q}-1} |k\rangle\langle k| \otimes \tilde{A}_k, \quad \text{where} \quad \left\|\tilde{A}_k - \frac{A(t_k)}{\alpha}\right\| \le \epsilon,$$
and the $t_k \in [a, b]$ are the quadrature points. Here $n$ is the number of qubits corresponding to the system ($2^n = N$), $m$ is the number of ancilla qubits for the block encoding, and the $n_q$ qubits are used to store the index of the quadrature points. There are $2^{n_q}$ quadrature points in total.
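To make the select structure of Definition 1 concrete, the sketch below assembles a toy MAT matrix in numpy: for each quadrature index $k$ it places a unitary dilation of $A(t_k)/\alpha$ (a $(1, 1, 0)$-block encoding) on the diagonal, so that projecting the block-encoding ancilla onto $|0\rangle$ recovers $\sum_k |k\rangle\langle k| \otimes A(t_k)/\alpha$. The matrix $A(t)$ and the parameters here are hypothetical placeholders, not from the paper:

```python
import numpy as np

def psd_sqrt(P):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(P)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def dilate(M):
    """Unitary dilation of M with ||M|| <= 1: a (1, 1, 0)-block encoding."""
    I = np.eye(M.shape[0])
    return np.block([
        [M,                            psd_sqrt(I - M @ M.conj().T)],
        [psd_sqrt(I - M.conj().T @ M), -M.conj().T],
    ])

# Hypothetical time-dependent coefficient matrix and parameters.
alpha = 2.0
A = lambda t: np.array([[-0.3, 0.8 * np.cos(t)], [0.1, -0.5]])
nq = 2                                 # 2^nq quadrature points on [0, 1]
ts = np.linspace(0.0, 1.0, 2**nq)

# MAT is block diagonal over the time-index register (a "select" unitary).
MAT = np.zeros((4 * len(ts), 4 * len(ts)), dtype=complex)
for k, t in enumerate(ts):
    MAT[4 * k:4 * (k + 1), 4 * k:4 * (k + 1)] = dilate(A(t) / alpha)
```

Reading off the top-left $2\times 2$ corner of each diagonal block (i.e., projecting the one block-encoding ancilla onto $|0\rangle$) returns $A(t_k)/\alpha$ exactly, and the full matrix is unitary.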
The total variation of $A$ on $[a, b]$, denoted $V_a^b(A)$, is defined as
$$V_a^b(A) = \sup_{a = s_0 < s_1 < \cdots < s_J = b} \sum_{j=1}^{J} \|A(s_j) - A(s_{j-1})\|,$$
where the supremum is taken over all partitions of $[a, b]$. The set of all functions of bounded variation is denoted by $\mathrm{BV}([a, b])$. Our algorithm only requires that the total variation $V_0^T(A)$ be finite. We need to choose $n_q$ to achieve the required accuracy for performing numerical quadrature, and this will be discussed in detail in Section 3.1. We note that for the sparse matrix input model, which is used in other quantum differential equation solvers [8,14,26,48] and time-dependent Hamiltonian simulation algorithms [7], one can construct an efficient time-dependent matrix encoding. Thus, our algorithms also apply to the case of sparse matrices, which will be discussed in more detail in Section 3.4.

Uniform singular value amplification
For a given temporal mesh $0 = t_0 < t_1 < \cdots < t_L = T$, the short-time integrator at step $l$ can be abstractly written as $|\psi_l\rangle = \widetilde{\Xi}_l |\psi_{l-1}\rangle$, where $\widetilde{\Xi}_l$ is an operator approximating the exact evolution operator $\Xi_l = \mathcal{T} e^{\int_{t_{l-1}}^{t_l} A(s)\,ds}$. We require that $\widetilde{\Xi}_l$ be consistent with $\Xi_l$ to precision $\epsilon_l \|\Xi_l\|$, i.e.,
$$\|\widetilde{\Xi}_l - \Xi_l\| \le \epsilon_l \|\Xi_l\|. \quad (13)$$
We assume that $\widetilde{\Xi}_l$ is implemented with its $(\alpha_l, m_l, 0)$-block encoding, denoted by $U_l$. Due to Eq. (13), $U_l$ can also be viewed as an $(\alpha_l, m_l, \epsilon_l \|\Xi_l\|)$-block encoding of $\Xi_l$. In this section we discuss how to approximately implement a series of non-unitary operations $\Xi_1, \Xi_2, \ldots, \Xi_L$ sequentially, and how to boost the success probability using uniform singular value amplification, developed in [38, Theorem 17]. The idea is to linearly amplify the singular values of $\widetilde{\Xi}_l / \alpha_l$ by a factor that is approximately $\gamma = \alpha_l / \|\widetilde{\Xi}_l\|$. For simplicity, we first assume $\widetilde{\Xi}_l = \Xi_l$ (i.e., $U_l$ block encodes the exact short-time integrator), and study how this technique can help us boost the success probability of a single operation.
Applying QSVT with an odd polynomial gives us a block encoding $\widetilde{U}$ of a new matrix, which up to rescaling is $\widetilde{\Xi} = W \widetilde{\Sigma} V^\dagger$, where $\widetilde{\Sigma} = \mathrm{diag}(\widetilde{\sigma}_1, \widetilde{\sigma}_2, \ldots, \widetilde{\sigma}_N)$ and $\Xi = W \Sigma V^\dagger$ is the singular value decomposition of $\Xi$. We choose the odd polynomial in the same way as in [38, Theorem 17], and we choose $\gamma = \alpha(1-\delta)/\|\Xi\|$. By these choices all singular values of $\widetilde{\Xi}$ are contained in the interval $[0, 1-\delta]$. If we choose the rescaling factor of $\widetilde{\Xi}$ to be $\|\Xi\|/(1-\delta)$, then by [38, Theorem 17], $\widetilde{U}$ is an $(\|\Xi\|/(1-\delta), m+1, \epsilon\|\Xi\|)$-block encoding of $\Xi$, using $\mathcal{O}\!\left(\frac{\alpha}{\|\Xi\|\delta}\log\frac{1}{\epsilon}\right)$ applications of (controlled-)$U$ and its inverse.
We can now use Lemma 2 to address the problem discussed in Section 1.3, neglecting for now all errors in the block encodings. The error analysis with the block encoding errors taken into account will be given in Theorem 4. If we directly apply the $U_l$ and post-select on the measurement results, through the procedure described in Figure 1 without the uniform singular value amplification, then the success probability will be
$$\frac{\|\Xi_L \cdots \Xi_2 \Xi_1 |\psi_0\rangle\|^2}{(\alpha_1 \alpha_2 \cdots \alpha_L)^2\, \||\psi_0\rangle\|^2}. \quad (15)$$
For each $l$, $\|\Xi_l\| \le \alpha_l$, and the cumulative difference between $\alpha_1 \alpha_2 \cdots \alpha_L$ and $\|\Xi_L\| \cdots \|\Xi_2\| \|\Xi_1\|$ can be quite significant, as discussed for the case of Euler's method in Section 1.3.
With the block encodings $\widetilde{U}_l$ given by the uniform singular value amplification, we can implement $\Xi_L \cdots \Xi_2 \Xi_1$ as depicted in Figure 1. We apply each $\widetilde{U}_l$ sequentially, measuring the ancilla qubits after each application, proceeding only when the measurement result is all 0, and otherwise aborting the procedure. The operator $\widetilde{\Xi}_L \cdots \widetilde{\Xi}_2 \widetilde{\Xi}_1$ thus implemented approximates the operator $\Xi_L \cdots \Xi_2 \Xi_1$, which is our goal. With the choice $\delta = 1/L$, we have $(1-\delta)^{-2L} = \mathcal{O}(1)$. Therefore the success probability is, up to a constant factor,
$$\frac{\|\Xi_L \cdots \Xi_2 \Xi_1 |\psi_0\rangle\|^2}{(\|\Xi_1\| \|\Xi_2\| \cdots \|\Xi_L\|)^2\, \||\psi_0\rangle\|^2}. \quad (16)$$
By turning the success probability from (15) into (16) using Lemma 2, we address the problem of the vanishing success probability discussed in Section 1.3.

Figure 1: Implementing $\Xi_L \cdots \Xi_2 \Xi_1$. After we apply each $\widetilde{U}_l$, we measure the ancilla qubits, and proceed only when the measurement result is all 0, otherwise aborting the procedure.
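A classical emulation (plain linear algebra, not a quantum circuit) illustrates the gap between the two success probabilities: for anti-Hermitian generators, the naive product of subnormalization factors $\alpha_l = e^{\|A_l\|\Delta t}$ drives the (15)-style probability to zero, while the amplified, norm-based version stays $\Theta(1)$. The random generators $A_l$ here are hypothetical:

```python
import numpy as np

def expm_eig(A, t=1.0):
    """e^{A t} via diagonalization (assumes A is diagonalizable)."""
    w, V = np.linalg.eig(A)
    return V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)

rng = np.random.default_rng(0)
L, dt = 50, 0.1
psi = np.array([1.0, 0.0], dtype=complex)
p_naive, p_amplified = 1.0, 1.0

for _ in range(L):
    G = rng.standard_normal((2, 2))
    A_l = -1j * (G + G.T) / 2          # anti-Hermitian generator: ||Xi_l|| = 1
    Xi = expm_eig(A_l, dt)             # short-time propagator Xi_l
    alpha_l = np.exp(np.linalg.norm(A_l, 2) * dt)   # naive LCU subnormalization
    nrm_old = np.linalg.norm(psi)
    psi = Xi @ psi
    nrm_new = np.linalg.norm(psi)
    p_naive *= nrm_new**2 / (alpha_l * nrm_old) ** 2                    # (15)-style
    p_amplified *= nrm_new**2 / (np.linalg.norm(Xi, 2) * nrm_old) ** 2  # (16)-style
```

Since every $\Xi_l$ here is unitary, the amplified probability stays at 1 while the naive probability decays exponentially in the total evolution time.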
The uniform singular value amplification procedure was first proposed in [54, Theorem 5]. However, it only constructs a $(2\|\Xi\|, m+1, \epsilon\|\Xi\|)$-block encoding of $\Xi$, and the extra factor of 2 means that the overall success probability of the procedure in Fig. 1 would still decrease exponentially, as $\mathcal{O}(4^{-L})$. This is not acceptable in the context of this paper. Therefore we adopt the version in [38, Theorem 17], which refines the analysis so that the subnormalization factor is $\|\Xi\|/(1-\delta)$.
In the abstract form, one needs a polynomial that approximates $\gamma x$ on the interval $[-\gamma^{-1}, \gamma^{-1}]$. Such a polynomial has been constructed in [38, Theorem 17], which we examine in more detail. It first constructs an even polynomial $q(x)$ that approximates an even "rectangular function", which equals 1 on $[-\gamma^{-1}, \gamma^{-1}]$ and vanishes outside. The desired odd polynomial is $p(x) = \gamma x\, q(x)$, which approximates $\mathrm{XRect}(x) := \gamma x \cdot \mathrm{Rect}(x)$ and agrees with the linear function $\gamma x$ on the interval $[-\gamma^{-1}, \gamma^{-1}]$. The polynomial $p(x)$ should also satisfy the norm constraint $|p(x)| \le 1$ for all $x \in [-1, 1]$, due to the requirement of QSVT (see [38, Corollary 5]). However, the function $\mathrm{XRect}(x)$ is discontinuous at $x = \gamma^{-1}$, and therefore a polynomial approximation to $\mathrm{XRect}(x)$ exhibits the Gibbs phenomenon: $p(x)$ always overshoots around $x = \gamma^{-1}$, even as the polynomial degree increases to infinity (see e.g., [39] and [62, Chapter 9]). An example of the Gibbs phenomenon is shown in the inset of Fig. 2 (a), where the polynomial approximation overshoots $\mathrm{XRect}(x)$. As a result, instead of approximating $\mathrm{XRect}(x)$, we can only find a polynomial that approximates $(1-\delta)\mathrm{XRect}(x)$ in order to satisfy the norm constraint. It is worth mentioning that although the construction above leads to the desired asymptotic scaling in Lemma 2, the preconstant can be quite large, which leads to high polynomial degrees even for moderate values of the parameters $\gamma$, $\delta$, $\epsilon$. Fig. 2 (a) shows that for $\gamma = 5$, $\delta = 0.05$, $\epsilon = 0.01$, the required polynomial degree is already as large as 2001. Reducing to a smaller value $\delta = 0.01$ would require a polynomial degree of around $10^4$, which may be too large to be practically useful.
To address this issue, we notice that the uniform singular value amplification only requires us to find a polynomial p(x) that approximates γx on the desired interval I. Outside this interval, the value of p(x) can be arbitrary, as long as the norm constraint is satisfied. In particular, p(x) does not need to approximate XRect(x), which vanishes outside I. This allows us to construct a convex optimization based procedure to numerically identify a near-optimal polynomial approximation for the uniform singular value amplification. The procedure is detailed in Appendix C. For the same parameter setting γ = 5, δ = 0.05, ϵ = 0.01, the polynomial approximation is given in Fig. 2 (b), and the polynomial degree is merely 21. Fig. 2 (e) further shows that, fixing γ and δ, both methods converge exponentially with respect to the polynomial degree. However, the convergence rate of the convex optimization based method is significantly faster, which reduces the number of queries to U_l by orders of magnitude.

Amplitude amplification using compression gadget
Since the main concern of running the algorithm in Fig. 1 is its success probability, it is natural to consider using amplitude amplification [17] to reduce the number of repetitions needed to obtain a successful outcome. With the current procedure in Fig. 1, however, this results in a large space overhead. Directly applying the U_l (and hence Ξ_l) sequentially involves intermediate measurements to determine whether each Ξ_l is applied successfully. We need to record the measurement outcome of each of the L steps, and this means we need to duplicate the ancilla register L times to implement amplitude amplification. To avoid this overhead, we replace the procedure with a fully coherent one, with measurement performed only at the end. This allows us to reduce the Q dependence from O(Q^2) to O(Q).
Let us first formulate the problem in a more abstract way. We have unitaries V_1, V_2, . . . , V_L, each of which is an (α_l, m_l, 0)-block encoding of a potentially non-unitary operator Γ_l. The goal is to implement Γ_L · · · Γ_2 Γ_1 with amplitude amplification, and without duplicating the ancilla registers.
This goal can be achieved using the compression gadget in Fig. 3, following the idea in Ref. [57]. In fact we use a simplified version of the compression gadget in Ref. [57], as the problem we are trying to solve is in some ways easier. The main idea is to use a counter register to keep track, coherently, of how many Γ_l's have been applied successfully. This allows us to post-select on the counter register to ensure that all Γ_l's have been applied successfully. This result is summarized in the following lemma; its proof is given in Appendix D.
Lemma 3 (Compression gadget). Suppose we are given unitaries V_1, V_2, . . . , V_L, each of which is an (α_l, m_l, 0)-block encoding of Γ_l. Then we can construct an (α_comp, m_comp, 0)-block encoding of Γ_L · · · Γ_2 Γ_1, where α_comp = ∏_{l=1}^{L} α_l and m_comp = max_l m_l + ⌈log_2(L)⌉ + 1.

Figure 3: The simplified compression gadget for coherently applying Γ_L · · · Γ_2 Γ_1. The counter register, containing ⌈log_2(L)⌉ + 1 qubits, is used for keeping track of whether each Γ_l has been applied successfully; the ancilla register, containing max_l m_l qubits, holds the ancilla qubits needed in the V_l's; the state register stores the quantum state on which we want to apply Γ_L · · · Γ_2 Γ_1. ADD implements addition by 1 modulo the smallest power of 2 that is larger than or equal to 2L. Each controlled ADD† is controlled on the state |0^{m_max}⟩.
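The algebraic content of Lemma 3 can be checked numerically: composing exact block encodings multiplies both the encoded operators and the subnormalization factors. Below is a minimal numpy sketch using the standard unitary-dilation construction for each V_l; the matrices Γ_l are random examples, not taken from the paper.

```python
import numpy as np

def dilate(B):
    """Exact (1, 1, 0)-block encoding of B (||B|| <= 1) via unitary dilation:
    U = [[B, sqrt(I - B B^+)], [sqrt(I - B^+ B), -B^+]]."""
    n = B.shape[0]
    def msqrt(M):                      # principal square root of a PSD matrix
        w, V = np.linalg.eigh(M)
        return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T
    I = np.eye(n)
    return np.block([[B, msqrt(I - B @ B.conj().T)],
                     [msqrt(I - B.conj().T @ B), -B.conj().T]])

rng = np.random.default_rng(0)
n, L = 4, 3
alphas, Gammas, Vs = [], [], []
for _ in range(L):
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    a = 1.1 * np.linalg.norm(G, 2)     # subnormalization alpha_l > ||Gamma_l||
    alphas.append(a); Gammas.append(G); Vs.append(dilate(G / a))

for V, G, a in zip(Vs, Gammas, alphas):
    assert np.allclose(V.conj().T @ V, np.eye(2 * n), atol=1e-8)   # V_l is unitary
    assert np.allclose(V[:n, :n], G / a)                           # encodes Gamma_l/alpha_l

# Composition: the top-left blocks multiply, so the product encodes
# Gamma_L ... Gamma_1 with subnormalization alpha_L ... alpha_1 (as in Lemma 3).
prod_block = np.eye(n)
for V in Vs:
    prod_block = V[:n, :n] @ prod_block
target = (Gammas[2] @ Gammas[1] @ Gammas[0]) / np.prod(alphas)
assert np.allclose(prod_block, target)
```

The gadget circuit itself additionally arranges that a single final measurement of the counter register certifies that every step succeeded; the sketch above only verifies the encoded product and its subnormalization.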
We can now use Lemma 3 to obtain the following result, taking into account the uniform singular value amplification procedure.

Theorem 4 (Coherent implementation of long-time integrator). Suppose we are given Ξ_l through its (α_l, m_l, ϵ_l‖Ξ_l‖)-block encoding U_l, for l = 1, 2, . . . , L, with ∑_l ϵ_l ≤ 1/2. Then for any 0 < δ < 1 and 0 < ϵ ≤ 1/(2L), we can construct an (α_comp, m_comp, ϵ_comp)-block encoding of Ξ_L · · · Ξ_2 Ξ_1, using O((1/δ) log(1/ϵ)) applications of each (controlled-) U_l and its inverse.
Proof. We denote by Ξ̃_l the matrix that is exactly encoded in each U_l as in Eq. (14). Using Lemma 2, we first construct an amplified block encoding Ū_l of Ξ̃_l, which uses U_l O((1/δ) log(1/ϵ_l)) times. Here we have used the fact that ‖Ξ̃_l‖ = Θ(‖Ξ_l‖) because of Eq. (13). We denote by Ξ̄_l the matrix that is exactly encoded in Ū_l. Next, we use Lemma 3 to combine the Ū_l's into a block encoding of Ξ̄_L · · · Ξ̄_2 Ξ̄_1. Directly applying Lemma 3 yields an (α_comp, m_comp, 0)-block encoding of Ξ̄_L · · · Ξ̄_2 Ξ̄_1, which we denote by U_comp. To bound the error between Ξ̄_L · · · Ξ̄_2 Ξ̄_1 and Ξ_L · · · Ξ_2 Ξ_1, we use Eqs. (13) and (19), together with ∏_{l=1}^{L}(1 + ϵ_l) ≤ e^{∑_l ϵ_l} ≤ e^{1/2} and (1 + ϵ)^L ≤ e^{ϵL} ≤ e^{1/2}. Therefore we can choose ϵ_comp as in the statement of the theorem.
With a coherent implementation of the long-time integrator given as a block encoding in Theorem 4, we can readily apply the standard amplitude amplification to boost the success probability and yield a quadratic speedup in terms of the query complexity.

High-order truncated Dyson series approach
In this section, we analyze the method described in Section 2 when the short time integrator is implemented using the high-order truncated Dyson series. In Section 3.1 we discuss how to implement the short time integrator developed in Ref. [57]. Then we use the tools developed in Sections 2.3 and 2.4 to link different segments of short time evolution into a long time evolution in Section 3.2. We analyze the success probability in Section 3.3. Our time-marching based strategy can be combined with any input model, such as the sparse matrix model [7,26,48,57] and the linear combination of unitaries (LCU) model [7,57]. We discuss in particular the implementation with a sparse matrix input model in Section 3.4.

Short time evolution
The truncated Dyson series method implements the time-ordered evolution operator through its Dyson series, T e^{∫_a^b A(s) ds} = ∑_{k=0}^{∞} (1/k!) ∫_a^b · · · ∫_a^b T[A(s_k) · · · A(s_1)] ds_1 · · · ds_k. This infinite series can be truncated at order K, and the error is upper bounded by (∫_a^b ‖A(s)‖ ds)^K / K!. Therefore if we choose a and b so that ∫_a^b ‖A(s)‖ ds = O(1), we can achieve high accuracy with only a few terms.
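As a classical sanity check of this truncation, the sketch below assembles the Dyson series through the iterated integrals M_k(t) = ∫_a^t A(s) M_{k−1}(s) ds (computed with Riemann sums) and compares the truncated sum against a fine time-ordered product; the 2 × 2 matrix A(t) is a hypothetical example, not from the paper.

```python
import numpy as np
from scipy.linalg import expm

def A(t):                                   # hypothetical non-commuting A(t)
    return np.array([[0.0, 0.6 + 0.4 * t],
                     [-0.3 * np.sin(t), -0.2]])

a, b, M = 0.0, 1.0, 2000
ts = np.linspace(a, b, M + 1)
h = (b - a) / M

ref = np.eye(2)                             # reference time-ordered exponential,
for j in range(M):                          # built from a fine product of short steps
    ref = expm(A(ts[j] + h / 2) * h) @ ref

def dyson(K):
    """Truncated Dyson series: I + sum_{k=1}^K M_k(b), where the iterated
    integrals M_k(t) = int_a^t A(s) M_{k-1}(s) ds use left Riemann sums."""
    Mk = [np.eye(2) for _ in ts]            # M_0(t) = I
    total = np.eye(2)
    for _ in range(K):
        nxt = [np.zeros((2, 2)) for _ in ts]
        acc = np.zeros((2, 2))
        for j in range(M):
            acc = acc + A(ts[j]) @ Mk[j] * h
            nxt[j + 1] = acc
        Mk = nxt
        total = total + Mk[-1]
    return total

errs = [np.linalg.norm(dyson(K) - ref, 2) for K in (1, 2, 4, 6)]
assert errs[0] > errs[1] > errs[2]          # rapid decay with the truncation order K
assert errs[-1] < 1e-2                      # a few terms suffice when int ||A|| = O(1)
```

Since ∫_a^b ‖A(s)‖ ds is order one here, the error indeed decays roughly factorially in K until the quadrature error of the Riemann sums takes over.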
In order to be robust against error in the coefficient matrix, we need an additional step in the error analysis of the algorithm. If A(t) is accessed through an (n_q, m, a, b, α, ϵ)-MAT as discussed in Section 2.2, then the above approach will yield a block encoding of a different time evolution operator Ξ̃ ≈ T e^{∫_a^b Ã(t) dt}, where Ã(t) is the time-dependent matrix encoded in MAT exactly, as defined in Eq. (9). By Lemma 16, the difference between this operator and the exact one can be bounded by O(ϵ(b − a)).
With this additional step, using the same algorithm as in [57, Theorem 3], we can encode Ξ̃ with the costs stated there. In [57, Theorem 3], n_q is chosen according to a regularity condition involving ∫_a^b ‖Ȧ(t)‖ dt. The discussion of the numerical quadrature error in Appendix F shows that we can further relax this regularity condition, so that ∫_a^b ‖Ȧ(t)‖ dt can be replaced by the total variation V_a^b(A) defined in Eq. (10).

Block encoding of the long time evolution operator
We choose the t_l's so that t_l − t_{l−1} ≤ (2α)^{−1}. Consequently the total number of segments is L = Θ(αT). For each segment [t_{l−1}, t_l], we construct a time-dependent matrix encoding of A(t) using the sparse matrix oracles in (27), and then implement the short time evolution Ξ_l = T e^{∫_{t_{l−1}}^{t_l} A(t) dt}. By Lemma 5, this procedure yields an (α_l, m_l, ϵ_l‖Ξ_l‖)-block encoding of Ξ_l, which we denote by U_l. We need O(m + max_l log(1/ϵ_l)) ancilla qubits; the ancilla register starts in state |0^m⟩, is returned to |0^m⟩ at each time step, and can therefore be reused. We introduce the shorthand notation P. In this block encoding we use each U_l O((1/δ) log(1/ϵ)) times. We now choose δ = 1/(2L), ϵ_l = O(ϵ_comp/(LP)), and ϵ = O(ϵ_comp/(LP)). Also recall that L = O(αT). With this choice of parameters, each U_l is used O(αT log(αT P/ϵ_comp)) times. Given the fact that each U_l uses O(log(ϵ_l^{−1})/ log log(ϵ_l^{−1})) queries to MAT, and that there are L = O(αT) of them, the total number of queries to MAT is the product of these factors. We summarize the result of the construction of the block encoding for the long-time evolution using the truncated Dyson series as follows.
Theorem 6 (Long-time evolution with truncated Dyson series). We can construct an (α_comp, m_comp, ϵ_comp)-block encoding of T e^{∫_0^T A(t) dt}, where α_comp, m_comp and ϵ_comp are given in Eqs. (21) to (23), respectively. This block encoding uses: • the number of queries to MAT given by Eq. (24); • additional ancilla qubits that start in, and are returned to, |0^m⟩.

Success probability and main result for Dyson series approach
We can now apply the block encoding to an initial state |ψ(0)⟩ to get the final state |ψ(T)⟩ = T e^{∫_0^T A(t) dt} |ψ(0)⟩. We assume that ‖|ψ(0)⟩‖ = 1. Directly applying the block encoded time evolution operator will introduce an error, and we want to control the resulting error in the final normalized state. This can be done through Lemma 7, which bounds the distance between the normalized versions of two states |ψ⟩ and |φ⟩ in terms of ‖|ψ⟩ − |φ⟩‖. We first use Theorem 6 to construct an (α_comp, m_comp, ϵ_comp)-block encoding of Ξ = T e^{∫_0^T A(t) dt}, which we denote by W. Let Ξ̃ be the time evolution operator that is exactly encoded in W. If we successfully prepare the state |ψ̃(T)⟩ = Ξ̃ |ψ(0)⟩ by applying the block encoding and measuring the ancilla qubits, the resulting error satisfies ‖|ψ̃(T)⟩ − |ψ(T)⟩‖ ≤ ϵ_comp. Therefore, by Lemma 7, in order to ensure that the error in the normalized state is upper bounded by ϵ, it suffices to choose ϵ_comp = O(ϵ ‖|ψ(T)⟩‖). Upon measuring the ancilla qubits, if all measurement outcomes are 0, then we have successfully prepared the state |ψ̃(T)⟩ that approximates the exact solution |ψ(T)⟩. This happens with probability at least ‖|ψ̃(T)⟩‖^2 / α_comp^2. Using the scaling of α_comp in Eq. (22), and the fact that ‖|ψ̃(T)⟩‖ = (1 + O(ϵ)) ‖|ψ(T)⟩‖, the success probability is Ω(Q^{−2}), where Q is the same as that defined in Eq. (4). Suppose that the initial state |ψ(0)⟩ can be prepared using a unitary circuit U_init; then we can boost the success probability to 2/3 with O(Q) rounds of amplitude amplification. With the above analysis, we can determine the cost of our algorithm, which we state in the following corollary.
queries to all MAT_l, and O(Q) applications of (controlled-) U_init and its inverse. Here Q is defined in Eq. (4). In total we use O(n + m + polylog(V_0^T(A) αT Q ϵ^{−1})) qubits, and O(α^2 T^2 Q (m + polylog(V_0^T(A) αT Q ϵ^{−1}))) additional elementary gates. Success is flagged by the measurement result of a qubit.

Application to sparse matrix input model
In this section, we discuss the complexity of our time-marching strategy using the sparse matrix input model. The same input model has also been used in other QLSA-based differential equation solvers [8,14,26,48] as well as time-dependent Hamiltonian simulation algorithms [7]. We assume that A(t) is a d-sparse matrix with ‖A(t)‖ ≤ 1, and that the locations of the non-zero elements are time-independent. The information of A(t) is given through the following oracles: U_row |j, s⟩ = |j, row(j, s)⟩, U_col |j, s⟩ = |j, col(j, s)⟩, together with an oracle U_val that returns the value of A_{jk}(t) in binary. Here row(j, s) is the row index of the s-th nonzero element in the j-th column, and col(j, s) is the column index of the s-th nonzero element in the j-th row. We can use [37, Lemma 48] to construct the MAT in Definition 1 from the sparse matrix oracles in Eq. (27). To construct an (n_q, m, a, b, α, ϵ)-MAT we need a single query to both U_row and U_col, and two queries to U_val. The precision parameter ϵ can be made arbitrarily small, but to keep the error below ϵ we need O(n + log^{5/2}(d ϵ^{−1})) additional elementary gates and O(n_b + log^{5/2}(d ϵ^{−1})) additional ancilla qubits, where n_b is the number of bits used to encode A_{jk}(t) in binary. These additional ancilla qubits can be reused.
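A classical model of this access pattern may clarify the roles of the three oracles; the d-sparse A(t) below is a hypothetical example with a time-independent sparsity pattern.

```python
import numpy as np

# Hypothetical classical model of the sparse-access oracles U_row, U_col, U_val
# for a d-sparse A(t) whose sparsity pattern does not depend on t.
d, n = 2, 3
N = 2 ** n

def A_fun(t):
    """Example d-sparse A(t): nonzero diagonal and (cyclic) superdiagonal."""
    M = np.zeros((N, N))
    for j in range(N):
        M[j, j] = 0.2 * np.cos(t)            # entries vary in t, the pattern does not
        M[j, (j + 1) % N] = 0.3
    return M

def col(j, s, t=0.0):
    """Column index of the s-th nonzero element in the j-th row (oracle U_col)."""
    return int(np.flatnonzero(A_fun(t)[j, :])[s])

def row(j, s, t=0.0):
    """Row index of the s-th nonzero element in the j-th column (oracle U_row)."""
    return int(np.flatnonzero(A_fun(t)[:, j])[s])

def val(j, k, t=0.0):
    """The entry A_jk(t), modeled here as a float (oracle U_val)."""
    return A_fun(t)[j, k]

# Sanity checks: the three oracles together reconstruct A(t).
t = 0.7
B = np.zeros((N, N))
for j in range(N):
    for s in range(d):
        k = col(j, s, t)
        B[j, k] = val(j, k, t)
assert np.allclose(B, A_fun(t))
assert row(1, 0, t) == 0                     # A_{01}(t) is the first nonzero of column 1
```

On a quantum computer the same lookups are performed in superposition, which is what allows [37, Lemma 48] to assemble a block encoding from O(1) oracle queries.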

Corollary 9 (Time-marching based solver using truncated Dyson series with sparse matrix input model). Under the same assumptions as in Theorem 8, together with the assumption that A(t) is a d-sparse coefficient matrix given by the unitaries in Eq. (27), we can prepare the solution state using
O(d^2 T^2 Q log(dT Q ϵ^{−1}) / log log(dT Q ϵ^{−1})) applications of (controlled-) U_row, U_col, and U_val and their inverses, and O(Q) applications of (controlled-) U_init and its inverse. In total we use O(n + polylog(V_0^T(A) dT Q ϵ^{−1})) qubits, and O(d^2 T^2 Q (n + polylog(V_0^T(A) dT Q ϵ^{−1}))) additional elementary gates. Success is flagged by the measurement result of a qubit.

Optimality of the query complexity with respect to Q
In this section we show that the time-marching based solver using the high-order truncated Dyson series achieves the nearly optimal query complexity with respect to the amplification ratio Q. The optimality is guaranteed by the following lower bound result. Although we state this lower bound result in terms of a time-independent ODE, it automatically provides a lower bound for the harder problem of solving a time-dependent ODE in Eq. (1). The block encoding of the coefficient matrix A also automatically provides a time-dependent matrix encoding needed in Theorem 8.

Theorem 10.
For any given N, there exists a matrix A ∈ C^{N×N} that can be accessed through its (1, m, 0)-block encoding U_A, a target time T = O(log(N)), and an initial state |ψ(0)⟩ ∈ C^N with ‖|ψ(0)⟩‖ = 1, such that the following statement holds: Let |ψ(t)⟩ be the solution of the ODE (d/dt)|ψ(t)⟩ = A|ψ(t)⟩ with 0 ≤ t ≤ T, i.e., |ψ(T)⟩ = e^{AT}|ψ(0)⟩. Then for any θ > 0, there is no quantum algorithm that can prepare a quantum state ρ (allowed to be a mixed state) with D(|ψ(T)⟩, ρ) ≤ 1/2 using O(Q^{1−θ} poly(T)) queries to U_A. Here D(·, ·) denotes the trace distance between two quantum states, and Q is the amplification ratio defined in Eq. (28). Proof. We assume towards contradiction that such an algorithm exists, and apply this hypothetical algorithm to solve the unstructured search problem, in which we are asked to find a marked binary string targ of length n. We denote N = 2^n. The target targ is marked by the oracle U_targ, defined through U_targ |targ⟩ = −|targ⟩ and U_targ |x⟩ = |x⟩, where x is any binary string of length n that is different from targ. We let A = −U_targ; then −U_targ is a (1, 0, 0)-block encoding of A. Starting from the uniform superposition |ψ(0)⟩ = N^{−1/2} ∑_x |x⟩ and solving the ODE up to time T yields the (unnormalized) state e^{AT}|ψ(0)⟩ = N^{−1/2}(e^T |targ⟩ + e^{−T} ∑_{x ≠ targ} |x⟩). Note that in this quantum state, the amplitude corresponding to the marked element targ is amplified, while the amplitudes corresponding to all other elements are suppressed. Consequently, the probability of getting targ when measuring the state in the computational basis increases with time. At time T, this probability is e^{2T}/(e^{2T} + (N − 1)e^{−2T}). If we set T = (1/4) log(3(N − 1)), then the above probability equals 3/4. Suppose we can prepare a state ρ with D(ρ, |ψ(T)⟩) ≤ 1/2; then Pr_ρ(targ) ≥ Pr_{|ψ(T)⟩}(targ) − D(ρ, |ψ(T)⟩) ≥ 3/4 − 1/2 = 1/4. As a result, once we prepare a copy of ρ, we can measure in the computational basis, and with probability at least 1/4 we will get the marked element targ. We can use one application of U_targ to check whether the measurement output is targ. Therefore, in order to find targ with probability 2/3, we only need O(1) copies of ρ.
Let us then look at the cost of preparing ρ. With the hypothetical algorithm for solving the ODE, we can prepare ρ using O(Q^{1−θ} poly(T)) queries to U_A. Here T = (1/4) log(3(N − 1)) as chosen above, and in this scenario Q = Θ(√N). Therefore, we need O(N^{(1−θ)/2} polylog(N)) queries to U_targ to find the marked element, which violates the Ω(√N) query lower bound for unstructured search. It is worth noting that the lower bound result in Theorem 10 does not imply that the dependence on Q cannot be improved for specific instances of A. For example, consider a 2 × 2 non-diagonalizable matrix such as the Jordan block A = [[0, 1], [0, 0]]. Direct calculation shows that e^{AT} grows only linearly in T, and therefore so does ‖|ψ(T)⟩‖. However, the quantity in Eq. (30) grows exponentially in T, and so does Q in Eq. (28). Applying the result of [48], we know that the cost of the optimal solver should grow as poly(T) for any T. Therefore our time-marching based algorithm is suboptimal in this case. Note that if A is a normal matrix, the quantity in Eq. (30) is equal to ‖e^{AT}‖, and this issue does not arise.
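The probabilities appearing in this proof can be verified numerically for a small N (the marked index below is arbitrary):

```python
import numpy as np
from scipy.linalg import expm

n = 6
N = 2 ** n
targ = 13                                   # arbitrary marked element

U_targ = np.eye(N)
U_targ[targ, targ] = -1.0                   # U_targ flips the sign of |targ>
A = -U_targ                                 # so A|targ> = |targ>, A|x> = -|x> otherwise

T = 0.25 * np.log(3 * (N - 1))              # T = (1/4) log(3(N-1))
psi0 = np.ones(N) / np.sqrt(N)              # uniform superposition |psi(0)>
psiT = expm(A * T) @ psi0                   # unnormalized solution e^{AT}|psi(0)>

p_targ = abs(psiT[targ]) ** 2 / np.linalg.norm(psiT) ** 2
assert abs(p_targ - 0.75) < 1e-10           # the success probability is exactly 3/4
```

Indeed p = e^{2T}/(e^{2T} + (N − 1)e^{−2T}) = 1/(1 + (N − 1)e^{−4T}), which equals 3/4 precisely when e^{4T} = 3(N − 1).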

Simplified implementation and first-order truncated Magnus series
The time-marching strategy can be paired with any reasonable short-time integrator. The implementation of the high-order truncated Dyson series algorithm requires complicated quantum control logic for handling time-ordering operators. In this section, we investigate the performance of time-marching based algorithms with low-order integrators that can be implemented without the explicit treatment of time-ordering operators. The simplest example is a first-order truncated Dyson series algorithm, Ξ ≈ I + ∫_{t_j}^{t_{j+1}} A(s) ds, where the integral can be approximated using numerical quadrature as in the implementation of the high-order truncated Dyson series. A closely related first-order integrator takes the form Ξ ≈ exp(∫_{t_j}^{t_{j+1}} A(s) ds), i.e., we perform the matrix exponentiation of the time-independent matrix ∫_{t_j}^{t_{j+1}} A(s) ds directly, without further Taylor expansion. This is the first-order truncated Magnus series for approximating the time-ordered integral. In the context of Hamiltonian simulation, this strategy gives rise to the quantum highly oscillatory protocol (qHOP) [4], which exhibits commutator scaling in the high-precision limit, can be insensitive to the norm and the variation of A(t), and can lead to second-order convergence (i.e., superconvergence) for certain A(t). The rest of the section analyzes the scheme (32) for solving the ODE (1). We verify that this scheme also exhibits the desired commutator scaling. An extreme example where this is useful is when A(t) commutes with itself at all times: then (32) becomes exact, which leads to improved accuracy and poly-logarithmic dependence of the query complexity on the precision. For the general non-commuting case, we can use a higher order truncation of the Magnus series to improve the accuracy dependence while keeping the commutator scaling, although the implementation of the quantum circuit can be more complicated.

Short-time evolution description
For each short time evolution, the time-ordered evolution operator is approximated by truncating its Magnus expansion at first order as in (32), which is equivalent to directly ignoring the time-ordering operator. The integral can be further approximated using a standard Riemann sum with M quadrature points, as in Eq. (33). The short-time first-order Magnus evolution operator, denoted Ξ̃, is thus given as the exponential of this Riemann sum. The implementation of the short-time first-order Magnus evolution operator proceeds in two steps. The first step is to construct the block encoding of the Riemann sum (33) by applying HAD^{⊗n_q} on the n_q qubits, where HAD denotes the single qubit Hadamard gate, applying MAT, and then uncomputing, as in Eq. (35). The quantum circuit implementing Eq. (35) is described in Fig. 4. The second step is to implement the time-independent matrix exponential via a contour integral formulation combined with QSVT, as detailed in Appendix G. The observation is that for general A that are not normal, the singular value transformation no longer agrees with the eigenvalue transformation. Nevertheless, the matrix exponential can be applied by exploiting the contour integral formulation proposed in [61]. The idea is to express e^A as a contour integral, e^A = (1/2πi) ∮_Γ e^z (z − A)^{−1} dz, where Γ is a contour in the complex plane that contains all the eigenvalues of A in its interior. We then discretize the integral using numerical quadrature. This procedure turns the matrix exponential into a linear combination of the resolvents (z_j − A)^{−1}, which can be computed as matrix inversion problems that do not require the matrix to be normal. Here the z_j's are constants determined by the numerical quadrature, which is detailed in Section 5.3. We also remark that there are certain cases of A for which e^A admits a simpler implementation using QSVT via polynomial approximations of the exponential function [37, Corollary 64]. Ref. 
[59] also developed a method to apply a more general class of matrix functions using the contour integral, but they did not consider how to construct a block encoding, instead focusing on preparing the output state.
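The two properties claimed for the first-order Magnus scheme, exactness for self-commuting A(t) and a commutator-dominated error otherwise, can be checked classically; the matrices below are hypothetical examples.

```python
import numpy as np
from scipy.linalg import expm

def tordered(A, a, b, steps=4000):
    """Reference time-ordered evolution: fine product of short-step exponentials."""
    h = (b - a) / steps
    U = np.eye(A(a).shape[0])
    for j in range(steps):
        U = expm(A(a + (j + 0.5) * h) * h) @ U
    return U

def magnus1(A, a, b, M=4000):
    """First-order truncated Magnus step: exponential of a Riemann sum of A,
    as in (32)-(33) (a midpoint rule is used here as the quadrature choice)."""
    s = np.linspace(a, b, M, endpoint=False) + (b - a) / (2 * M)
    return expm(sum(A(t) for t in s) * (b - a) / M)

B = np.array([[0.0, 1.0], [-1.0, 0.3]])     # hypothetical example matrices
C = np.array([[0.2, 0.0], [0.5, -0.1]])

# Commuting case A(t) = f(t) B: the scheme is exact up to quadrature error.
Ac = lambda t: np.cos(3 * t) * B
err_c = np.linalg.norm(magnus1(Ac, 0, 0.5) - tordered(Ac, 0, 0.5), 2)
assert err_c < 1e-5

# Non-commuting case: the error is dominated by the commutator term.
An = lambda t: np.cos(3 * t) * B + np.sin(2 * t) * C
e1 = np.linalg.norm(magnus1(An, 0, 0.4) - tordered(An, 0, 0.4), 2)
e2 = np.linalg.norm(magnus1(An, 0, 0.2) - tordered(An, 0, 0.2), 2)
assert e1 > e2                              # shorter steps reduce the Magnus error
```

When A(t) commutes with itself at all times, the only residual error is that of the Riemann sum, which is why the commuting assertion passes at much tighter tolerance than the non-commuting one.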

Time discretization error
We now analyze the time-discretization error for the short-time evolution.
Proof. The errors come from two parts: one from dropping the time-ordering operator, and the other from the numerical quadrature. To estimate the first, we denote the exact evolution from time a to t by V(t, a), and the evolution after dropping the time-ordering operator by Ṽ(t, a). For any t ∈ [a, b], differentiating Ṽ(t, a) with respect to t shows that the difference between V and Ṽ satisfies a differential equation, which the variation of parameters formula [18] solves in terms of an integral of a remainder g(τ); applying the fundamental theorem of calculus then bounds this remainder. Note that the exponential factor in this bound is quite loose; one can obtain a sharper bound using the logarithmic norm, but for our purposes it suffices to estimate at the level of ‖A(s)‖. Another useful fact, shown by applying Gronwall's inequality, is a uniform bound on ‖Ṽ(t, a)‖. We then estimate the error induced by the numerical quadrature using the total variation. Following the result in Appendix F, the quadrature error is bounded by a term proportional to V_a^b(A), where we recall that V_a^b(A) is the total variation of A(t) on the interval [a, b]. Furthermore, the variation of parameters formula gives, for square matrices B and E, the bound ‖e^{B+E} − e^B‖ ≤ ‖E‖ e^{‖B‖+‖E‖}. Combining the two estimates (43) and (46) yields the desired result.

Algorithm for implementing e A
In order to discuss the complexity of implementing the short-time first-order Magnus evolution operator, we first briefly discuss in this section how to implement a matrix exponential of a general time-independent matrix, namely to implement f (A) = e A , when A is given through its (α, m, 0)-block encoding. The details are laid out in Appendix G.
The main idea is to use the contour integral representation of the matrix exponential, e^A = (1/2πi) ∮_Γ e^z (z − A)^{−1} dz, where Γ is a circle of radius β: Γ = {z = βe^{iθ} : θ ∈ R}. This contour integral can be discretized as e^A ≈ (1/K) ∑_{k=0}^{K−1} e^{z_k} z_k (z_k − A)^{−1} for z_k = βe^{i2πk/K}. The error that comes from this discretization can be analyzed using Lemma 18, which is modified from [63, Theorem 18.1] and [59, Proposition 5]. We choose β = 2α and R = 4α, and apply Lemma 18 to control the discretization error. Our goal is to construct a block encoding of e^A using (48). We proceed as follows: we first construct a block encoding that encodes the shifted matrices z_k − A, with z_k = βe^{i2πk/K}. Inverting this operator using QSVT gives us the resolvents (z_k − A)^{−1}. Now suppose we can prepare quantum states |COEF_int⟩, |COEF'_int⟩ whose amplitudes encode the quadrature coefficients; then we obtain the desired block encoding as a linear combination of the resolvents. Here, for a complex number z ∈ C, √z can be any w ∈ C such that w² = z. To construct this block encoding we need additional ancilla qubits. The entire circuit acts on five different registers, as shown in Figure 6. The first register contains one qubit and is used for matrix inversion through QSVT. The second register contains ⌈log₂(K)⌉ qubits to encode the K coefficients e^{z_k} z_k in the amplitudes. The third register contains one qubit and is used in the block encoding of each z_k − A. The fourth register contains m qubits; this is the ancilla register used in the block encoding U_A. The fifth register contains n qubits, the qubits that A acts on. We use I_r to denote the identity operator acting on r qubits.
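The discretized contour formula can be tested classically before considering its block-encoded implementation; the sketch below uses a random non-normal matrix and the choice β = 2α (the K values are illustrative).

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
n = 5
A = 0.3 * rng.standard_normal((n, n))       # generic non-normal example matrix
alpha = np.linalg.norm(A, 2)                # a valid block-encoding subnormalization
beta = 2 * alpha                            # contour radius beta = 2 alpha

def contour_exp(A, beta, K):
    """e^A ~ (1/K) sum_k e^{z_k} z_k (z_k - A)^{-1},  z_k = beta e^{2 pi i k / K}."""
    n = A.shape[0]
    F = np.zeros((n, n), dtype=complex)
    for k in range(K):
        z = beta * np.exp(2j * np.pi * k / K)
        F += np.exp(z) * z * np.linalg.inv(z * np.eye(n) - A)
    return F / K

err = [np.linalg.norm(contour_exp(A, beta, K) - expm(A), 2) for K in (8, 16, 48)]
assert err[0] > err[1] > err[2]             # the trapezoid rule converges geometrically
assert err[-1] < 1e-6
```

The factor e^{z_k} z_k comes from parametrizing z = βe^{iθ}, dz = iz dθ, which turns the contour integral into a periodic trapezoid rule; for analytic integrands this rule converges geometrically in K, matching the assertion above.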
The cost of this implementation can be summarized in the following lemma, whose proof is laid out in Appendix G.

Short-time complexity of the first-order integrator
The cost of constructing a block encoding of the short-time first-order Magnus evolution operator is given by the following lemma. Proof. MAT is a time-dependent matrix encoding of A(t) with precision ϵ, and is hence a time-dependent matrix encoding of Ã(t) with precision 0 by Definition 1. We first consider the cost of constructing the block encoding of the Riemann sum S̃. Let n_q = ⌈log₂ M⌉. As depicted in (35), one can construct an (α(b − a), m + n_q, 0)-block encoding of S̃, using 1 query to MAT and n_q additional one-qubit gates.
We then use the QSVT based contour integral formulation, Lemma 12, to implement e^{S̃}, obtaining a block encoding with subnormalization O(α(b − a)). This uses O(α(b − a) + log(1/ϵ)) queries to the block encoding of Ã, and hence O(α(b − a) + log(1/ϵ)) queries to MAT; the count of additional elementary gates follows from Lemma 12. We now control the error between e^{S̃} and Ξ = T e^{∫_a^b A(t) dt}. Denote S = ∫_a^b Ã(t) dt. We first consider the error between e^{S̃} and e^S; thanks to Lemma 11, this error can be bounded in terms of the total variation of A. By our choice of a and b, the right-hand side is bounded by 4β₀. Therefore, we obtain the block encoding with the parameters stated in the lemma.

Block encoding of the long-time evolution operator
For simplicity, we choose a uniform temporal mesh. Let t_l = lT/L so that 0 = t_0 < t_1 < · · · < t_l < · · · < t_L = T, where L is the number of time steps. We perform the short-time first-order Magnus evolution on each interval [t_l, t_{l+1}] and multiply all the evolutions together. By Lemma 13, each short-time first-order Magnus evolution on [t_l, t_{l+1}] yields an (α_l, m_l, ϵ_l‖Ξ_l‖)-block encoding of Ξ_l, where β_l is an upper bound of the Magnus truncation error on [t_l, t_{l+1}]. The construction uses O(α(t_{l+1} − t_l) + log(1/δ)) queries to MAT_l and O(α_l(t_{l+1} − t_l) n_{q,l} + n_{q,l} log(1/δ)) additional elementary gates, with n_{q,l} = O(log(1/ϵ_l)). Applying Theorem 4, we get an (α_comp, m_comp, ϵ_comp)-block encoding of T e^{∫_0^T A(t) dt}. Here m_comp = ⌈log₂(L)⌉ + m + max_l n_{q,l} + max_l O(log(α_l(t_{l+1} − t_l)) + log log(1/ϵ_l)), and ϵ_comp = e^{1/2} Lϵ + Lδ + 4β_comp P, (53) where β_comp = ∑_l β_l. In this block encoding we use O((1/δ) log(1/ϵ)) applications of each U_l.
Denote by α an upper bound of the time-averaged L¹ scaling (1/T) ∫_0^T ‖A(s)‖ ds. Thanks to our choice of the t_l's, each β_l can be upper bounded in terms of the commutator scaling α_comm and the step size. With this choice of parameters, the number of uses of each U_l can be bounded accordingly. Given the fact that each U_l uses O(α(t_{l+1} − t_l) + log(max{α_comm, α} T P/ϵ_comp)) queries to MAT_l, and that there are L of them, the total number of queries to all MAT_l follows. Note that when α_comm = 0, this first-order truncated series implementation in fact has the same complexity scaling as the high-order truncated Dyson series. We remark that it is also possible to obtain the L¹ scaling ᾱ = (1/T) ∫_0^T ‖A(s)‖ ds by varying the time step sizes in the propagation according to the average norm of A(t), as in [4,7]: let 0 = t_0 < t_1 < · · · < t_l < · · · < t_L = T, where L is the number of time steps and t_1, · · · , t_{L−1} are chosen such that ∫_{t_l}^{t_{l+1}} ‖A(s)‖ ds is the same for each l. We shall not discuss further details here.

Success probability and main result paired with first-order integrator
The success probability of the first-order Magnus method follows from a similar argument as for the Dyson series approach, detailed in Section 3.3. It suffices to choose ϵ_comp = O(ϵ ‖|ψ(T)⟩‖), and the success probability is at least Ω(Q^{−2}). Suppose that the initial state |ψ(0)⟩ can be prepared using a unitary circuit U_init; then we can boost the success probability to 2/3 with O(Q) rounds of amplitude amplification. We conclude with the following result for pairing the time-marching strategy with the first-order Magnus-type integrator for long time evolution. Suppose A(t) has total variation V_0^T(A) on [0, T] as defined in (10). We can prepare, with probability at least 2/3, a quantum state |ψ̃(T)⟩ approximating |ψ(T)⟩, using a total number of queries to all MAT_l as derived above, and O(Q) applications of (controlled-) U_init and its inverse. In particular, in the high-precision limit, given a non-zero α_comm, the total number of queries to all MAT_l can be simplified accordingly. In total we use O(n + m + polylog(V_0^T(A) αT Q ϵ^{−1})) qubits. Here Q is defined as in (4), and α_comm is defined in Eq. (54). Success is flagged by the measurement result of a qubit.

Discussion
We revisit the time-marching strategy for solving the ODE (1), and demonstrate that it can be a useful alternative to QLSA based quantum differential equation solvers. On a technical level, our time-marching based algorithm achieves the following tasks for the first time: (1) It has provable performance guarantees for non-diagonalizable, and time-dependent dynamics.
(2) It retains high-order accuracy for non-smooth A(t). (3) It can use fewer queries to the initial state. For inhomogeneous linear differential equations (d/dt)|ψ(t)⟩ = A(t)|ψ(t)⟩ + |b(t)⟩, we may use the variation of constants formula to express the solution as |ψ(T)⟩ = T e^{∫_0^T A(s) ds} |ψ(0)⟩ + ∫_0^T T e^{∫_{t'}^T A(s) ds} |b(t')⟩ dt'. After discretizing the integration with respect to t', this reduces to a series of homogeneous linear differential equations that can be implemented by the time-marching method. The analysis for the inhomogeneous case can thus be carried out similarly using the tools developed in this paper. At a high level, time-marching based methods and QLSA based methods simply amount to two different ways of rearranging the same identities of the form |ψ_l⟩ = Ξ̃_l |ψ_{l−1}⟩, l = 1, . . . , L. However, there are many differences between these two classes of algorithms, and we do not know whether the gaps can be closed with further improvements, or are intrinsic to each type of method. For instance, it is possible to refine the complexity analysis of QLSA based algorithms so that the cost does not directly depend on the condition number κ_V, or to improve the algorithm so that it achieves high-order accuracy for non-smooth A(t) [16]. However, it is unclear whether the number of queries to the initial state of any QLSA based algorithm can be made independent of T. For time-marching based algorithms, though the dependence on the precision ϵ is near-optimal and scales as polylog(1/ϵ), the dependence of the query complexity on the input matrix is O(T²). We do not know whether the O(T²) scaling in the query complexity to the matrix can be improved, or whether the dependence on Q can be weakened. It is also possible that another class of algorithms is needed to provide a unifying perspective on the questions above, and to achieve near-optimal complexity with respect to all parameters.
We use the following asymptotic notations besides the usual O notation: we write f = Ω(g) if g = O(f), and f = Θ(g) if both f = O(g) and f = Ω(g). For two quantum states |x⟩ and |y⟩, we sometimes write |x, y⟩ to denote |x⟩|y⟩. The notations regarding each short time evolution can be summarized as follows. Note that the same U_l can be viewed as a block encoding of both Ξ and Ξ̃; the difference is only in the error of the block encoding.

B Block encoding and quantum singular value transformation
We call a matrix A ∈ C^{2^n × 2^n} an n-qubit matrix or an n-qubit operator. In this work we extensively use the technique of block encoding [21,38,56], which is a way of embedding an arbitrary matrix as a sub-matrix of a larger unitary matrix. Using a unitary matrix U_A to encode A as a sub-matrix means that there exists a subnormalization factor α > 0 such that U_A = [A/α, ·; ·, ·], where · denotes arbitrary matrix blocks of proper sizes. In general, the matrix that we block encode may only approximate, and need not exactly equal, A/α. We use the following notation to describe such encodings.
Definition 15 (Block encoding, [38]). An (m + n)-qubit unitary operator U_A is called an (α, m, ϵ)-block encoding of an n-qubit operator A if ‖A − α(⟨0^m| ⊗ I_n) U_A (|0^m⟩ ⊗ I_n)‖ ≤ ϵ. Here m is the number of ancilla qubits for the block encoding, and α is called the subnormalization factor. With an (α, m, 0)-block encoding U_A, we can perform a quantum singular value transformation (QSVT) of A on a quantum computer using the technique developed in [38]. More precisely, suppose A has a singular value decomposition A = UΣV†; then for any odd polynomial f(x) with degree d such that |f(x)| ≤ 1 for x ∈ [−1, 1], we can construct a block encoding of U f(Σ/α) V†, and this block encoding uses U_A d times. This is the main tool for implementing the uniform singular value amplification in Lemma 2. QSVT requires finding a sequence of phase factors corresponding to the polynomial we want to implement. There are classical algorithms capable of doing this in polynomial time [41], and later works developed more efficient methods for high-degree polynomials with high precision requirements [22,35,65,68].
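As a classical illustration of what QSVT produces, the following numpy sketch applies an odd polynomial to the singular values of an encoded matrix; the circuit-level phase-factor machinery is omitted, and the matrix, polynomial, and subnormalization are hypothetical examples.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))             # hypothetical example matrix
alpha = 1.25 * np.linalg.norm(A, 2)         # any valid subnormalization factor

U, s, Vh = np.linalg.svd(A / alpha)         # A/alpha = U diag(s) Vh, all s in [0, 1]

def qsvt_result(f):
    """What QSVT produces for an odd polynomial f: U f(Sigma/alpha) V^dagger."""
    return U @ np.diag(f(s)) @ Vh

# f(x) = x reproduces A/alpha itself
assert np.allclose(qsvt_result(lambda x: x), A / alpha)

# an odd polynomial with |f| <= 1 on [-1, 1] that amplifies small singular values
f = lambda x: 3 * x - 4 * x ** 3            # = -T_3(x), slope 3 at the origin
B = qsvt_result(f)
assert np.linalg.norm(B, 2) <= 1 + 1e-9     # the norm constraint is preserved
sv = np.linalg.svd(B, compute_uv=False)
assert np.allclose(np.sort(sv), np.sort(f(s)))
```

The polynomial −T₃ above acts like a crude singular value amplifier with γ = 3 near the origin, which is the same mechanism Lemma 2 exploits with carefully optimized polynomials.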

C Convex optimization based method for uniform singular value amplification
In order to approximate an odd target function using an odd polynomial of degree d, we can express the target polynomial as a linear combination of Chebyshev polynomials with unknown coefficients {c_k}: F(x) = ∑_{k=0}^{(d−1)/2} c_k T_{2k+1}(x). To formulate this as a discrete optimization problem, we first discretize [−1, 1] using M grid points (e.g., the Chebyshev points x_j = −cos(jπ/(M − 1)), j = 0, . . . , M − 1). We define the coefficient matrix A_{jk} = T_{2k+1}(x_j), k = 0, . . . , (d − 1)/2. Then the coefficients of the polynomial used for uniform singular value amplification can be found by solving the following optimization problem: minimize over {c_k} the maximum of |F(x_j) − γ x_j| over the grid points x_j lying in the amplification interval, subject to |F(x_j)| ≤ c for all j. This is a convex optimization problem and can be solved using software packages such as CVX [40]. The norm constraint |F(x)| ≤ 1 is relaxed to |F(x_j)| ≤ c to account for the fact that the constraint can only be imposed on the sampled points, and the values of |F(x)| may slightly overshoot on [−1, 1]\{x_j}. The effect of this relaxation is negligible when we choose c sufficiently close to 1 (for instance, c can be max{0.9999, 1 − 0.1δ}). Since Eq. (63) approximately solves a min-max problem, it achieves a near-optimal solution (in the sense of the L^∞ norm) by definition, both in the asymptotic and pre-asymptotic regimes. Once the polynomial F(x) is found, its Chebyshev coefficients can be used as the input to find the phase factors using QSPPACK 1 with an optimization based method [35].
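A sketch of this optimization, using scipy.optimize.linprog in place of CVX; the grid size, the relaxed bound c = 0.9999, and the interval convention |x_j| ≤ (1 − δ)γ^{−1} are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

gamma, delta, d = 5.0, 0.05, 21             # amplification 5x, odd degree-21 polynomial
M = 500
xs = -np.cos(np.arange(M) * np.pi / (M - 1))    # Chebyshev-type grid on [-1, 1]
on_I = np.abs(xs) <= (1 - delta) / gamma        # assumed amplification interval I

ks = np.arange((d + 1) // 2)                # F(x) = sum_k c_k T_{2k+1}(x)
T = np.cos(np.outer(np.arccos(np.clip(xs, -1.0, 1.0)), 2 * ks + 1))

nv = len(ks) + 1                            # variables [c_0, ..., c_K, t]; minimize t
cost = np.zeros(nv)
cost[-1] = 1.0
A_ub, b_ub = [], []
for j in np.flatnonzero(on_I):              # |F(x_j) - gamma x_j| <= t on I
    A_ub.append(np.append(T[j], -1.0)); b_ub.append(gamma * xs[j])
    A_ub.append(np.append(-T[j], -1.0)); b_ub.append(-gamma * xs[j])
for j in range(M):                          # |F(x_j)| <= c = 0.9999 everywhere
    A_ub.append(np.append(T[j], 0.0)); b_ub.append(0.9999)
    A_ub.append(np.append(-T[j], 0.0)); b_ub.append(0.9999)

res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nv, method="highs")
assert res.status == 0
assert res.x[-1] < 0.05                     # the paper reports error ~0.01 at degree 21
```

The min-max problem is linear in ({c_k}, t), so a linear programming solver suffices here; CVX solves the same program in its own modeling language.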

D The compression gadget
In this section we discuss how to construct the compression gadget used in Section 2.4 and prove Lemma 3. Suppose we are given unitaries $V_1, V_2, \ldots, V_L$, each of which is an $(\alpha_l, m_l, 0)$-block encoding of a potentially non-unitary operator $\Gamma_l$. Our goal is to construct a block encoding of $\Gamma_L \cdots \Gamma_2 \Gamma_1$ without duplicating the ancilla registers for each $V_l$.
To do this we introduce a counter register that tracks how many times the $\Gamma_l$ have been applied successfully. This register contains $\lceil \log_2(L) \rceil + 1$ qubits, and its state encodes an integer in binary. We introduce a unitary operator $\mathrm{ADD}$ on this register defined through
$$\mathrm{ADD}\, |c\rangle = |(c+1) \bmod 2^{\lceil \log_2(L) \rceil + 1}\rangle.$$
This operator performs addition modulo $2^{\lceil \log_2(L) \rceil + 1}$, and its inverse performs subtraction. We initialize the counter register in the state $|L\rangle$ and subtract from the number it encodes by applying $\mathrm{ADD}^\dagger$ each time $\Gamma_l$ is applied successfully. We keep track of success or failure by applying a controlled $\mathrm{ADD}^\dagger$, so that the whole process remains coherent. In the end, if all steps are successful, the counter register is in the state $|0\rangle$.
The circuit construction for this coherent procedure is depicted in Figure 3. We denote $m_{\max} = \max_l m_l$ and regard each $V_l$ as acting on two registers: an ancilla register containing $m_{\max}$ qubits and the state register. If $m_l < m_{\max}$, we simply let $m_{\max} - m_l$ qubits in the ancilla register remain idle. The circuit in Figure 3 is in fact a block encoding of $\Gamma_L \cdots \Gamma_2 \Gamma_1$. In this way we prove Lemma 3, which we restate here.
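The counter logic can be checked numerically on small instances. The sketch below is an illustrative classical simulation (not the paper's circuit): it builds exact $(1, 1, 0)$-block encodings of two $2 \times 2$ contractions by unitary dilation, applies $\mathrm{ADD}^\dagger$ controlled on the block-encoding ancilla being $|0\rangle$ after each step, and verifies that projecting the counter onto $|0\rangle$ (starting from $|L\rangle$) leaves exactly the block $\Gamma_2 \Gamma_1$.

```python
import numpy as np

def psqrt(M):
    """Principal square root of a real PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(M)
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

def dilate(G):
    """Exact (1, 1, 0)-block encoding of a contraction G (||G|| <= 1)."""
    I = np.eye(G.shape[0])
    return np.block([[G, psqrt(I - G @ G.T)],
                     [psqrt(I - G.T @ G), -G.T]])

rng = np.random.default_rng(1)
Gs = []
for _ in range(2):
    R = rng.standard_normal((2, 2))
    Gs.append(R / (2 * np.linalg.norm(R, 2)))   # contraction with norm 1/2

L, n_ctr = len(Gs), 4            # counter register: ceil(log2 L) + 1 = 2 qubits

ADD = np.roll(np.eye(n_ctr), 1, axis=0)          # ADD |c> = |c+1 mod 4>
P0 = np.diag([1.0, 0.0])                         # projector onto ancilla |0>

# One step: apply V_l, then decrement the counter iff the ancilla is |0>.
W = np.eye(n_ctr * 2 * 2)
for G in Gs:
    step = np.kron(np.eye(n_ctr), dilate(G))
    cadd = (np.kron(ADD.T, np.kron(P0, np.eye(2)))
            + np.kron(np.eye(n_ctr), np.kron(np.eye(2) - P0, np.eye(2))))
    W = cadd @ step @ W

# Input: counter |L>, ancilla |0>; output projection: counter |0>, ancilla |0>.
e = np.eye(n_ctr)
anc0 = np.array([[1.0], [0.0]])
B_in = np.kron(e[:, [L]], np.kron(anc0, np.eye(2)))
B_out = np.kron(e[:, [0]], np.kron(anc0, np.eye(2)))
block = B_out.T @ W @ B_in
```

Only the branch in which every step succeeded misses no decrement, so it alone reaches counter value $0$; this is why a single counter register can replace $L$ separate ancilla registers.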

E Bounding the error due to the coefficient matrix
In this section, we show that small errors in the coefficient matrix lead to correspondingly small errors in the solution, and can thus be controlled.

F Bounding numerical integration error using the total variation
In this section, we bound the numerical integration error. Standard numerical quadrature results typically bound the error by the derivative $\dot{A}$ of the matrix $A$, and hence $A$ needs to be differentiable (see, e.g., [57, Eq. (A11)] or the textbook [19]). Here we instead bound the numerical integration error by the total variation, in order to account for non-differentiable cases.
For a large interval $[0, T]$, we can partition the interval into segments $0 = x_0 < x_1 < \cdots < x_K = T$ with $x_k - x_{k-1} = \Delta$, and the difference between the integral and the Riemann sum can be bounded through
$$\left\| \int_0^T A(s)\, ds - \Delta \sum_{k=1}^{K} A(x_{k-1}) \right\| \le \sum_{k=1}^{K} \int_{x_{k-1}}^{x_k} \| A(s) - A(x_{k-1}) \|\, ds.$$
Now we obtain similar bounds using the total variation. First, for a short time interval $[x_{k-1}, x_k]$,
$$\int_{x_{k-1}}^{x_k} \| A(s) - A(x_{k-1}) \|\, ds \le \Delta \, \mathrm{Var}_{[x_{k-1}, x_k]}(A),$$
where the inequality is because $\| A(s) - A(x_{k-1}) \| \le \mathrm{Var}_{[x_{k-1}, s]}(A) \le \mathrm{Var}_{[x_{k-1}, x_k]}(A)$ for every $s \in [x_{k-1}, x_k]$, where we have used the fact that the total variation on an interval is monotone with respect to inclusion of intervals. Summing over $k$ then gives
$$\left\| \int_0^T A(s)\, ds - \Delta \sum_{k=1}^{K} A(x_{k-1}) \right\| \le \Delta \sum_{k=1}^{K} \mathrm{Var}_{[x_{k-1}, x_k]}(A) \le \Delta \, \mathrm{Var}_{[0, T]}(A).$$
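As an illustration (a scalar toy example, not from the paper), the total-variation bound can be checked numerically for a non-differentiable integrand of bounded variation, here a step function with total variation $2$:

```python
import numpy as np

def A(t):
    """Piecewise-constant integrand: not differentiable, total variation 2."""
    return np.where(t < 0.5, 1.0, -1.0)

T, tv = 1.0, 2.0
exact = 0.0                          # integral over [0, 1] is 0.5 - 0.5 = 0

for K in [3, 7, 20, 101]:
    delta = T / K
    grid = delta * np.arange(K)      # left endpoints x_0, ..., x_{K-1}
    riemann = delta * np.sum(A(grid))
    # Left Riemann-sum error is bounded by Delta * Var_{[0,T]}(A)
    assert abs(riemann - exact) <= delta * tv + 1e-12
```

The error here decays like $\Delta$ even though $\dot{A}$ does not exist, which is exactly the regime the total-variation bound is designed for.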

G Circuit construction of the contour integral formulation
In this section, we construct the circuit implementing $e^A$ using QSVT and the contour integral formulation. Appendix G.1 and Appendix G.2 are devoted to preliminary discussions of the block encoding of the inverse of a matrix and of the quadrature discretization errors of the contour integral formulation, respectively. Appendix G.3-Appendix G.5 discuss the circuit construction in detail and prove Lemma 12.
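The contour integral representation behind this construction, $e^A = \frac{1}{2\pi i} \oint_\Gamma e^z (zI - A)^{-1}\, dz$ with $\Gamma$ enclosing the spectrum of $A$, can be sanity-checked classically. The sketch below is an illustrative toy (not the quantum circuit): it discretizes a circular contour with the trapezoidal rule, which converges geometrically for periodic analytic integrands, and compares against `scipy.linalg.expm`; the matrix size, contour radius, and node count are assumptions.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2
A /= np.linalg.norm(A, 2)            # eigenvalues inside the unit disk

# Circle of radius R centered at 0 strictly encloses the spectrum of A
R, N = 2.0, 64
theta = 2 * np.pi * np.arange(N) / N
z = R * np.exp(1j * theta)

# (1/(2*pi*i)) oint e^z (zI - A)^{-1} dz with the trapezoidal rule in theta:
# the factor (1/(2*pi*i)) * (2*pi/N) * i*R*e^{i*theta_j} simplifies to z_j / N.
approx = sum((zj / N) * np.exp(zj) * np.linalg.inv(zj * np.eye(4) - A)
             for zj in z)
```

On the quantum side, the same formula reduces implementing $e^A$ to a linear combination of resolvents $(z_j I - A)^{-1}$, each of which is block encoded via QSVT-based matrix inversion.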