Parallel Quantum Algorithm for Hamiltonian Simulation

We study how parallelism can speed up quantum simulation. We propose a parallel quantum algorithm for simulating the dynamics of a large class of Hamiltonians with good sparse structures, called uniform-structured Hamiltonians, which includes various Hamiltonians of practical interest such as local Hamiltonians and Pauli sums. Given oracle access to the target sparse Hamiltonian, the running time of our parallel quantum simulation algorithm, measured by the quantum circuit depth in both query and gate complexity, has a doubly (poly-)logarithmic dependence $\operatorname{polylog}\log(1/\epsilon)$ on the simulation precision $\epsilon$. This is an exponential improvement over the dependence $\operatorname{polylog}(1/\epsilon)$ of the previous optimal sparse Hamiltonian simulation algorithm without parallelism. To obtain this result, we introduce a novel notion of parallel quantum walk, based on Childs' quantum walk. The target evolution unitary is approximated by a truncated Taylor series, which is obtained by combining these quantum walks in a parallel way. A lower bound $\Omega(\log \log (1/\epsilon))$ is established, showing that the $\epsilon$-dependence of the gate depth achieved in this work cannot be significantly improved. Our algorithm is applied to simulating three physical models: the Heisenberg model, the Sachdev-Ye-Kitaev model and a quantum chemistry model in second quantization. By explicitly calculating the gate complexity for implementing the oracles, we show that on all these models, the total gate depth of our algorithm has a $\operatorname{polylog}\log(1/\epsilon)$ dependence in the parallel setting.


Introduction
Simulating the quantum Hamiltonian dynamics is a fundamental problem in computational physics. Despite its ubiquity and importance, the problem is believed to be intractable for classical computers. Quantum computers were originally proposed to efficiently solve this problem [48]. The first algorithm for solving this problem was given by Lloyd for local Hamiltonians [71], and has been followed by many remarkable results over the past twenty years. Moreover, these results have found diverse applications in other quantum algorithms (e.g., [37,6,46,32]) beyond quantum simulation.
While the state-of-the-art has achieved an optimal quantum algorithmic solution to simulating a large class of Hamiltonians [73], it remains open whether the quantum simulation algorithms can be parallelized in order to provide further speed-up. In this work, we identify a class of Hamiltonians that can be more efficiently simulated in parallel, called uniform-structured Hamiltonians. Then we introduce the notion of parallel quantum walk within Childs' framework [35,19,22,37]. Based on it, we propose a parallel quantum simulation algorithm for uniform-structured Hamiltonians.
Hamiltonian simulation. Simulating the time evolution of a quantum system governed by a time-independent Hamiltonian $H$ for a time $t$ is essentially approximating the unitary $e^{-iHt}$ to some precision $\epsilon$, according to the Schrödinger equation. In this paper, we focus on digital quantum simulation (rather than analog quantum simulation [69]), that is, simulating the Hamiltonian with a fault-tolerant universal quantum computer, given some oracle access to the Hamiltonian $H$. The performance of a quantum simulation algorithm depends on several factors: the simulation time $t$, the precision $\epsilon$, the number of queries to oracles, and other parameters of the target Hamiltonian (e.g., size, matrix norm, and sparsity of $H$).
In the literature, there are basically three approaches to simulating a Hamiltonian:
• The product formula approach is conceptually the simplest and does not introduce ancilla qubits. Early works [71,3,18,97,36,41] on product-formula-based algorithms often had a poor complexity dependence on the precision $\epsilon$, which was later improved [75] by techniques borrowed from other simulation approaches. This approach has regained attention in recent years due to a number of new results [40,75,30,39,33,45] and its potential to be implemented in the near term.
• The quantum walk approach [35,19,68] spawned the groundbreaking work [20,21] that improves the complexity dependence on $\epsilon$ from $\operatorname{poly}(1/\epsilon)$ to $\operatorname{polylog}(1/\epsilon)$. This approach mainly applies to simulating a $d$-sparse Hamiltonian $H$, which is a Hermitian matrix with at most $d$ nonzero entries in each row. The Hamiltonian $H$ is accessed by two oracles: an oracle $O_H$ giving the entry $H_{jk}$ according to the index pair $(j, k)$, and an oracle $O_L$ giving the column index of the $t$-th nonzero entry in row $j$ according to $t$ and $j$. By approximating $e^{-iHt}$ with a linear combination of Childs' quantum walk operators [22] (commonly known as the Linear-Combination-of-Unitaries, i.e., LCU algorithm [41]), the quantum walk approach achieves a nearly optimal [22] query complexity for simulating sparse Hamiltonians accessed by $O_H$ and $O_L$. This technique has also been applied to solving the quantum linear systems problem [60], which was originally solved in [37] by phase estimation.
• The lower bound of query complexity for sparse Hamiltonian simulation was finally reached by the quantum signal processing approach [77,74,73], which provides a new way to transform the eigenvalues of a unitary by manipulating a single ancilla qubit without performing phase estimation. The input model was also generalized beyond sparse matrices by subsequent works on qubitization and block-encoding [73,32,50].
Later works [74,59,76,72] make further improvements on the complexity dependence on other parameters. We particularly note that all of the above approaches for Hamiltonian simulation are sequential.
Parallel quantum computation. The aim of this paper is to study parallel quantum simulation of Hamiltonians. The computational model that we adopt is the quantum circuit model where the running time of a quantum algorithm is measured by the depth of its circuit implementation, with both gates and oracle queries being allowed to be performed in parallel.
The research on parallel quantum computation is not restricted to the circuit model. For example, in measurement-based quantum computing, it was observed that parallelism can provide more benefits than in the circuit model [64,27,28]. Another parallel model closer to the current quantum hardware is distributed quantum computing, which can efficiently simulate the quantum circuit model with low depth overhead [15]. Parallelism is also studied at more abstract levels like quantum programming [99,100].

Main Results
Our main result is a parallel quantum simulation algorithm for uniform-structured Hamiltonians, which will be formally defined in Sections 3.2 and 3.3. These Hamiltonians include local Hamiltonians, Pauli sums and other Hamiltonians of interest. Roughly speaking, a uniform-structured Hamiltonian $H$ has the form $H = \sum_{w \in [m]} H_w$, where for each $w$, $H_w$ is a sparse Hamiltonian specified by a parameter $s_w$. We adopt the sparse matrix input model for these Hamiltonians, that is, a target Hamiltonian $H$ is accessed by two oracles: an oracle $O_H$ giving an entry $H_{jk}$ by the mapping $|j, k, 0\rangle \mapsto |j, k, H_{jk}\rangle$, and an oracle $O_P$ giving the parameter $s_w$ by the mapping $|w, 0\rangle \mapsto |w, s_w\rangle$. Here, $O_P$ might be different for various types of Hamiltonians. Formal details of the input model will be described in Section 2.
Throughout this paper we assume the target Hamiltonian $H$ is normalized such that $\|H\|_{\max} = 1$, where $\|H\|_{\max} := \max_{j,k} |H_{jk}|$. Then our main result can be stated as follows:
Theorem 1.1 (Informal version of Theorem 5.5). Any uniform-structured Hamiltonian $H = \sum_{w \in [m]} H_w$ acting on $n$ qubits, with each $H_w$ being $d$-sparse, can be simulated for time $t$ to precision $\epsilon$ by a quantum circuit of depth $\operatorname{poly}(\log\log(1/\epsilon), \log n, m, d, t)$ and size $\operatorname{poly}(\log(1/\epsilon), n, m, d, t)$.
Here, the running time of our algorithm, i.e., the quantum circuit depth, has a doubly (poly-)logarithmic dependence on the precision $\epsilon$. To the best of our knowledge, this is the first Hamiltonian simulation algorithm that achieves such dependence on $\epsilon$.
Applying this theorem to simulating local Hamiltonians, we have:
Corollary 1.2. Let $H = \sum_{w \in [m]} H_w$ be an $l$-local Hamiltonian acting on $n$ qubits, where each $H_w$ acts on a subsystem of $l$ qubits whose positions are indicated by the bits equal to 1 in an $n$-bit string $s_w$. Suppose the oracle $O_P$ has access to $s_w$ such that $O_P|w, 0\rangle = |w, s_w\rangle$. Then $H$ can be simulated for time $t$ to precision $\epsilon$ by a quantum circuit with
• $O(\tau \log \gamma)$-depth and $O(\tau \gamma)$-size of queries to $O_H$,
• $O(\tau \log \gamma)$-depth and $O(m \tau \gamma)$-size of queries to $O_P$, and
• $O\left(\tau \log^2 \gamma \cdot (\log^2 n + \log^3 \gamma)\right)$-depth and $O\left(\tau \gamma^2 \cdot (m n^4 + \gamma^3)\right)$-size of gates,
where $\tau := m 2^l \cdot t$, $\gamma := \log(\tau/\epsilon)$, and a gate refers to a one- or two-qubit gate.
The best known algorithm for this task is obtained by applying the optimal sparse Hamiltonian simulation of [74,73] (note that $H$ in Corollary 1.2 is $m 2^l$-sparse), which requires a query complexity $O(\tau + \gamma / \log \gamma)$ and a gate complexity $O(n + \gamma \operatorname{polylog} \gamma)$. By introducing parallelism, our algorithm exponentially improves the dependence on $\epsilon$ and $n$, in the depth of both queries and gates.
It is worth noting the difference between the oracle $O_P$ used in Corollary 1.2 for local Hamiltonians and the oracle $O_L$ in previous works [3,35,19,22,74] for generic sparse Hamiltonians. Oracle $O_L$ computes a function $L(j, t)$ denoting the column index of the $t$-th nonzero entry in row $j$ of $H$. In practice, oracle $O_P$ is a more natural choice than $O_L$, because if one wants to exploit the local structure of the Hamiltonian to be simulated, knowing the locality parameter $s_w$ given by $O_P$ is intuitively the minimal requirement. As evidence, when we apply our algorithm to the Heisenberg model in Section 7.1, the oracle $O_P$ can be implemented with an efficient gate complexity. Note that in the local Hamiltonian case, a query to the oracle $O_L$ can be achieved by at most $m 2^l$ queries to $O_P$.
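As a purely classical illustration of the access model for generic sparse Hamiltonians, the sketch below stores a small sparse Hermitian matrix as a dictionary of rows and answers the same questions as $O_H$ and $O_L$. The names `SPARSE_H`, `oracle_H`, `oracle_L` and `row_entries`, the 0-based indexing, and the example matrix are assumptions made for illustration only; they are not part of the input model defined in this paper.

```python
# Hypothetical 4x4, 2-sparse Hermitian matrix with max-norm 1, stored row by row:
# SPARSE_H[j][k] = H_jk for the nonzero entries only.
SPARSE_H = {
    0: {1: 1j, 2: 0.5},
    1: {0: -1j, 3: 0.25},
    2: {0: 0.5, 3: -1.0},
    3: {1: 0.25, 2: -1.0},
}


def oracle_H(j: int, k: int) -> complex:
    """Classical analogue of O_H: return the entry H_jk (0 if it is not stored)."""
    return SPARSE_H[j].get(k, 0)


def oracle_L(j: int, t: int) -> int:
    """Classical analogue of O_L: column index of the t-th nonzero entry in row j.

    Assumption of this sketch: t is 0-based and columns are taken in increasing order.
    """
    return sorted(SPARSE_H[j])[t]


def row_entries(j: int, d: int):
    """Recover the nonzero part of row j from d queries to each oracle."""
    return [(oracle_L(j, t), oracle_H(j, oracle_L(j, t))) for t in range(d)]
```

For instance, `row_entries(1, 2)` lists the two nonzero entries of row 1 together with their column indices. A structure oracle such as $O_P$ replaces the per-row lookup in `oracle_L` by a description shared by many rows.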
Lower bounds. It was shown in [74,73] that sparse Hamiltonian simulation requires $\Omega\left(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right)$-size of queries and gates. We are able to further prove a lower bound on the gate depth for Hamiltonian simulation:
Theorem 1.3 ($\epsilon$-dependence depth lower bound for Hamiltonian simulation). Any quantum algorithm for sparse Hamiltonian simulation to precision $\epsilon$ requires $\Omega(\log\log(1/\epsilon))$-depth of gates. The same holds even for uniform-structured Hamiltonian simulation.
This theorem implies that our parallel quantum algorithm given in Theorem 1.1 for simulating uniform-structured Hamiltonians cannot be significantly improved in the $\epsilon$-dependence.
Applications. Our algorithm is applied in Section 7 to simulating three quantum dynamical models of practical interest in physics and chemistry:
• The Heisenberg model for studying self-thermalization and many-body localization [82,78,38];
• The Sachdev-Ye-Kitaev (SYK) model for studying the simplest AdS/CFT duality [90,66,80,49]; and
• A second-quantized molecular model for studying the electronic structure of a molecule [101,14,9].
We explicitly calculate the gate cost for implementing the oracles mentioned above and the total gate complexity for the simulation. Table 1 shows a comparison of our algorithm with the previous best known algorithms on the same tasks. From it, one can see that by introducing parallelism, our algorithm achieves an exponential speed-up in the $\epsilon$-dependence for simulating all these models.
Later improvements (e.g. [12,13,67,10,23])
Table 1: A comparison of our algorithm (Theorem 5.5) with the previous best algorithms in simulating three physical models. Here, the parameter $n$ is the size of the system to be simulated, $t$ is the simulation time, and $\epsilon$ is the precision of simulation. The complexity of an algorithm is measured by the depth of gates, where for readability the dependence on different parameters is split. The notation $\tilde{O}(\cdot)$ denotes an asymptotic upper bound suppressing poly-logarithmic factors. For the Heisenberg model, we follow the convention of taking $t = n$.

High-level Overview of the Algorithm
Our algorithm is based on the quantum walk approach to Hamiltonian simulation [35,19,68,22]. The basic idea of this approach is to approximate the target unitary $e^{-iHt}$ by expressing it as a Chebyshev series $e^{-iHt} \approx \sum_r \alpha_r T_r(H)$, where for each $r$, $\alpha_r \in \mathbb{C}$ is some appropriate coefficient, and $T_r(x)$ is the degree-$r$ Chebyshev polynomial. Each $T_r(H)$ can be obtained by $r$ steps of Childs' quantum walk [35,19]. Then a linear combination of these quantum walks is performed by the Linear-Combination-of-Unitaries (LCU) technique [41,68,22]. Essentially, this approach is sequential due to the fact that $r$ steps of quantum walk require $r$ sequential queries, and to achieve a total precision $\epsilon$ of the simulation, $r$ should be as large as $\Theta(\log(1/\epsilon))$, inducing a logarithmic precision dependence.
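To make the Chebyshev series concrete on a single eigenvalue $x \in [-1, 1]$ of $H$, the following sketch checks numerically the standard Jacobi-Anger expansion $e^{-ixt} = J_0(t) + 2\sum_{r \ge 1} (-i)^r J_r(t)\, T_r(x)$, with $J_r$ the Bessel functions of the first kind. Using this particular expansion for the coefficients $\alpha_r$ is an assumption of the sketch, since the coefficients are not specified above; the function names are hypothetical.

```python
import cmath
import math


def bessel_j(k: int, t: float, terms: int = 40) -> float:
    """Bessel function J_k(t) for integer k >= 0, evaluated by its power series."""
    return sum(
        (-1) ** m / (math.factorial(m) * math.factorial(m + k)) * (t / 2) ** (2 * m + k)
        for m in range(terms)
    )


def chebyshev_T(r: int, x: float) -> float:
    """Degree-r Chebyshev polynomial of the first kind, for x in [-1, 1]."""
    return math.cos(r * math.acos(x))


def chebyshev_exp(x: float, t: float, order: int) -> complex:
    """Truncated series J_0(t) + 2 * sum_{r=1}^{order} (-i)^r J_r(t) T_r(x)."""
    total = complex(bessel_j(0, t))
    for r in range(1, order + 1):
        total += 2 * (-1j) ** r * bessel_j(r, t) * chebyshev_T(r, x)
    return total


def truncation_error(x: float, t: float, order: int) -> float:
    """Distance between the truncated series and the exact value exp(-i x t)."""
    return abs(chebyshev_exp(x, t, order) - cmath.exp(-1j * x * t))
```

Evaluating `truncation_error(0.3, 2.0, order)` for increasing `order` shows the error falling quickly once `order` exceeds $t$. In the quantum setting each extra term costs one more sequential walk step, which is the source of the sequential depth discussed above.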
In this work, we introduce a parallel quantum walk which is implementable with only a constant depth of parallel queries, for a large class of Hamiltonians with good sparse structures, namely uniform-structured Hamiltonians. The parallel quantum walk is not a direct parallelization of Childs' quantum walk; instead, it implements a monomial of $H$. We express the unitary $e^{-iHt}$ as a Taylor series $e^{-iHt} \approx \sum_r \beta_r H^r$ (as in the previous work [20]), where each degree-$r$ monomial $H^r$ can be obtained (with a proper scaling factor) by an $r$-parallel quantum walk. These parallel quantum walks are then linearly combined in parallel by a technique described in Section 4, which exploits parallelism in the LCU algorithm to combine $R$ terms with $\operatorname{polylog}(R)$ depth. Since there are about $O(\log(1/\epsilon))$ terms in the LCU to achieve a total precision $\epsilon$ of the simulation, the query depth of our parallel algorithm is roughly $\operatorname{polylog}\log(1/\epsilon)$, achieving a doubly (poly-)logarithmic precision dependence.
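The claim that about $O(\log(1/\epsilon))$ Taylor terms suffice can be checked with a short calculation. The sketch below assumes that the spectral norm of $Ht$ is at most a scalar `t`, so that the truncation error after order $R$ is bounded by the scalar tail $\sum_{r > R} t^r / r!$; this assumption and the function names are illustrative and not taken from the paper.

```python
import math


def taylor_tail(t: float, order: int, extra_terms: int = 200) -> float:
    """Numerical value of sum_{r > order} t**r / r!, the scalar tail of exp(t)."""
    term = t ** (order + 1) / math.factorial(order + 1)
    tail = 0.0
    for r in range(order + 1, order + 1 + extra_terms):
        tail += term
        term *= t / (r + 1)  # turn t**r / r! into t**(r+1) / (r+1)!
    return tail


def truncation_order(t: float, eps: float) -> int:
    """Smallest R whose Taylor tail is at most eps."""
    order = 0
    while taylor_tail(t, order) > eps:
        order += 1
    return order
```

Calling `truncation_order(1.0, eps)` for `eps` in `(1e-3, 1e-6, 1e-12)` shows the number of terms growing slowly as the precision shrinks. Combining $R$ such terms in $\operatorname{polylog}(R)$ depth then gives the doubly logarithmic dependence on $1/\epsilon$.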

Parallel quantum walk
The main ingredient in our parallel simulation algorithm is the parallel quantum walk. For an intuition of the algorithm, let us consider a very special case of uniform-structured Hamiltonians as an example: a tensor product of Pauli matrices, that is, $H = \bigotimes_{l \in [n]} \sigma_l$ with each $\sigma_l \in \{\mathbb{1}, X, Y, Z\}$ being a Pauli matrix. Although $H$ is a simple 1-sparse Hamiltonian, it suffices for an illustration of the main idea in our algorithm. To begin with, one can think of $H$ as a weighted adjacency matrix of a 1-sparse graph $\mathrm{H}$. Then a step of Childs' quantum walk without parallelism is a "superposition version" of a classical random walk: it performs $|j\rangle \mapsto H_{jk}|k\rangle$ for every vertex $j$ in $\mathrm{H}$, where in this special case $k$ is the unique neighbor of vertex $j$, and $H_{jk} \in \{1, i, -1, -i\}$. This step requires $O(1)$ queries to determine $j$'s neighbor and the entry value $H_{jk}$.
The parallelism comes from observing the special structure of the graph $\mathrm{H}$: it is a 1-sparse graph consisting of $2^{n-1}$ pairs of vertices. Furthermore, for every vertex $j$, there is a "uniform" way to determine its pairing neighbor $k$. More precisely, consider an $n$-bit string $s$, whose $l$-th bit is 0 if $\sigma_l \in \{\mathbb{1}, Z\}$ is a diagonal Pauli matrix, and 1 if $\sigma_l \in \{X, Y\}$ is an off-diagonal Pauli matrix. Then it is easy to see that $j, k$ are neighbors in $\mathrm{H}$ if and only if $j \oplus k = s$, with $\oplus$ the bit-wise XOR operator. Assuming $s$ is given by an oracle $O_P$, a step of Childs' quantum walk $|j\rangle \mapsto H_{jk}|k\rangle$ can be performed by first querying $O_P$ to calculate $k = j \oplus s$, and then querying $O_H$ to add the phase $H_{jk}$. Now $r$ steps of such a walk are essentially the mapping $|j\rangle \mapsto (H_{jk})^r |j \oplus s^{\oplus r}\rangle$, where $(H_{jk})^r$ and $s^{\oplus r}$ can be computed in parallel by first querying $r$ copies of the oracle $O_P$ (and $O_H$) simultaneously to compute $r$ copies of $s$ (and $H_{jk}$), and then classically computing $s^{\oplus r}$ (and $(H_{jk})^r$) in a binary tree of depth $O(\log r)$, using the associativity of XOR (and multiplication). By a standard technique in reversible computation [16], the above parallel classical computation can be easily converted to parallel quantum computation, inducing the $O(1)$-depth of queries and $O(\log r)$-depth of gates.
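The XOR structure described above can be verified directly on small instances. The sketch below builds the matrix of a Pauli tensor product from nested lists and checks that the only nonzero entry of row $j$ sits in column $j \oplus s$. The label convention (a string such as `"XYZ"`, first factor being the most significant bit) and all function names are assumptions of this illustration.

```python
from functools import reduce

PAULI = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}


def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def pauli_tensor(label: str):
    """Matrix of the tensor product of the Paulis named in a label like 'XYZ'."""
    return reduce(kron, (PAULI[c] for c in label))


def xor_mask(label: str) -> int:
    """The string s as an integer: a bit is 1 for an off-diagonal Pauli (X or Y)."""
    s = 0
    for c in label:
        s = (s << 1) | int(c in "XY")
    return s


def neighbors(matrix, j: int):
    """Column indices of the nonzero entries in row j."""
    return [k for k, entry in enumerate(matrix[j]) if entry != 0]
```

For the label `"XYZ"`, `xor_mask` returns the bit string 110 and every row `j` of `pauli_tensor("XYZ")` has the single neighbor `j ^ 0b110`, so one value of $s$ describes the neighbor of every vertex at once.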
The above special example captures the idea of parallelism in our algorithm. More generally, denote by $H$ the Hamiltonian to be simulated and by $\mathrm{H}$ its corresponding graph. Then an $r$-parallel quantum walk consists of two stages. The first stage is called pre-walk which, roughly speaking, prepares a superposition over all paths of length $r$ generated from $r$ steps of unweighted random walk on the graph $\mathrm{H}$. This stage can be done efficiently in parallel, with only a constant depth of queries to $O_P$, provided that the Hamiltonian $H$ is uniform-structured. The second stage is called re-weight, which adjusts the weights (i.e., the quantum amplitudes) of the state prepared by pre-walk, according to the entry values $H_{jk}$ given by the oracle $O_H$. This stage does not depend on the structure of $H$, and can be done efficiently in parallel with a constant depth of queries to $O_H$. Thus, combined with other techniques as in Childs' quantum walk, we can implement the monomial operator $H^r$ (with a proper scaling factor, see Section 3 for details).
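The split into pre-walk and re-weight has a simple classical counterpart: an entry of $H^r$ is the sum, over all length-$r$ paths between two vertices, of the product of the entries along the path. The sketch below is only this classical analogue, with hypothetical function names; it enumerates paths explicitly, whereas the quantum algorithm prepares them in superposition.

```python
def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(h, r):
    """r-th power of h by repeated multiplication (reference value)."""
    n = len(h)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(r):
        result = mat_mul(result, h)
    return result


def pre_walk(h, r):
    """All paths (j_0, ..., j_r) in the graph of h; only the sparsity pattern is used."""
    n = len(h)
    paths = [(j,) for j in range(n)]
    for _ in range(r):
        paths = [p + (k,) for p in paths for k in range(n) if h[p[-1]][k] != 0]
    return paths


def re_weight(h, path):
    """Product of the entries along a path; only the entry values are used."""
    weight = 1
    for a, b in zip(path, path[1:]):
        weight *= h[a][b]
    return weight


def power_from_paths(h, r):
    """Assemble h**r by summing the path weights between each pair of endpoints."""
    n = len(h)
    total = [[0] * n for _ in range(n)]
    for path in pre_walk(h, r):
        total[path[0]][path[-1]] += re_weight(h, path)
    return total
```

On a small Hermitian matrix such as `[[0, 1j, 2], [-1j, 0, 0], [2, 0, 1]]`, `power_from_paths(h, 3)` agrees with `mat_pow(h, 3)`. The structure-dependent step (`pre_walk`) and the value-dependent step (`re_weight`) never need each other's oracle, which mirrors the separation between queries to $O_P$ and to $O_H$.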

Parallel LCU for Hamiltonian series
To approximate the evolution unitary $e^{-iHt}$ by a truncated Taylor series, the final step of our algorithm is to linearly combine the monomials $H^r$ obtained from the parallel quantum walks discussed above. The ordinary LCU algorithm [22,37] implementing a linear combination of $R$ unitaries requires depth $\Theta(R)$. As pointed out in [37], if these unitaries are powers of a single unitary, then the LCU can be done in a parallel way analogous to phase estimation [83] with depth $O(\log R)$. We slightly generalize this result to implementing a linear combination of block-encoded (see Definition 2.8) powers of a Hamiltonian (called a Hamiltonian power series) in parallel. Since the LCU requires a state corresponding to the coefficients in the linear combination, we also present a parallel quantum algorithm for this state-preparation procedure, based on standard results in quantum sampling [57].
To summarize, the whole algorithm is visualized in Figure 1.
[Figure 1: overview of the algorithm, starting from the target unitary $e^{-iHt}$ (with small $t$).]

Related Works
Parallelism is employed in our Hamiltonian simulation algorithm to reduce the complexity dependence on the precision $\epsilon$ from $\operatorname{polylog}(1/\epsilon)$ to $\operatorname{polylog}\log(1/\epsilon)$. Is it possible to reduce the dependence on other parameters, like the simulation time $t$, by parallelization? Atia and Aharonov [8] studied the fast-forwarding of Hamiltonians (which is further explored in a recent work [58]), that is, the ability to simulate a Hamiltonian by a quantum circuit with depth significantly less than the simulation time $t$ (e.g., $\operatorname{polylog}(t)$), which is essentially the possibility to reduce the complexity dependence on $t$ in the parallel setting. They show that the fast-forwarding of generic Hamiltonians is impossible unless BQP = PSPACE.
Jeffery, Magniez and de Wolf [62] studied the parallel query complexity of the element-distinctness and $k$-sum problems. The upper bounds in their results are obtained by what they called "parallelized quantum walk algorithms". It should be noted, however, that their algorithm is developed in the framework of finding marked elements via quantum walks [79]; in particular, it is a modified quantum walk on multiple copies of Johnson graphs corresponding to a specific function (element-distinctness or $k$-sum), with parallel queries to that function. In contrast, our parallel quantum walk is defined in the framework of quantum walks for Hamiltonians [35,19,22,37]; in particular, it is a quantum walk that implements a monomial $H^r$ (with a proper scaling factor) by $r$ parallel queries to the oracles accessing a Hamiltonian $H$.

Structure of the Paper
For convenience of the reader, some preliminaries are presented in Section 2. In Section 3, we introduce a parallel quantum walk for Hamiltonians. More concretely, we revisit the framework of Childs' quantum walk in Section 3.1, give a parallelization of it and show how to implement this parallel quantum walk in Section 3.2, and analyze the complexity in Sections 3.2.1 and 3.2.2. Specifically, in Section 3.2.1, we define uniform-structured Hamiltonians, for which the parallel quantum walk can be performed efficiently. In Section 3.3 we present an extension of the parallel quantum walk to the case of a sum of sparse Hamiltonians, where we also extend the class of uniform-structured Hamiltonians to include more Hamiltonians of interest. In Section 4, we show how to implement a parallel LCU for a Hamiltonian power series. Section 5 assembles the above results to simulate a Hamiltonian by combining parallel quantum walks. In Section 6, we prove an $\epsilon$-dependence lower bound on the gate depth for Hamiltonian simulation. In Section 7, we apply our algorithm to simulate three concrete physical models.
where the complex number $H_{jk}$ is stored in a $b$-bit string that contains its real part and imaginary part (each with $b/2$ bits, assuming $b$ is even), and $b$ will be determined by the precision $\epsilon$ of the algorithm. When we are just referring to the oracle we may omit the superscript $b$.
• A sparse structure oracle $O_P$ that gives parameters about the sparse structure of $H$ such that $O_P|x, y\rangle = |x, P(x, y)\rangle$ for $x \in \mathcal{X}$, $y \in \mathcal{Y}$, where $\mathcal{X}, \mathcal{Y}$ are sets of integers, and $P : \mathcal{X} \times \mathcal{Y} \to \mathcal{Y}$ is a function determined by the sparse structure of $H$ such that for all $x \in \mathcal{X}$, $P(x, \cdot)$ is a bijection on $\mathcal{Y}$.
In many previous works [3,35,19,22,37], the sparse structure is instead accessed by an oracle $O_L$ that computes the column index $L(j, t)$ of the $t$-th nonzero entry in row $j$ of $H$. We note that $O_L$ can be expressed as a special case of $O_P$ by taking $\mathcal{X} = \mathcal{Y} = [N]$ and $P = L$. However, compared to these works, we adopt a more general oracle $O_P$ because the Hamiltonians investigated in this paper have different sparse structures, which can be better exploited by different concrete forms of $O_P$. Moreover, the implementations of these $O_P$'s turn out to be very efficient, compared to the much costlier implementations of the oracle $O_H$, as shown in Section 7 when we calculate the total gate complexity of our algorithm in simulating practical physical models.
As in previous works, we allow using controlled versions of these oracles, and will not explicitly distinguish between the controlled and uncontrolled versions.

Complexity model
In this paper, we will consider both query complexity and gate complexity, where a gate refers to a one- or two-qubit gate. For a parallel quantum algorithm represented by a quantum circuit, we will measure its cost by its depth and size. When considering the gate complexity, each oracle query is temporarily counted as one gate. The depth of gates in a quantum circuit is defined as the length of the longest path composed of gates and wires from the circuit inputs to outputs, where the length is the total number of gates on this path. The size of gates is defined as the total number of gates in the entire quantum circuit.
We adopt the definition of parallel query complexity in previous works (see for example [62]). The query complexity is calculated for each oracle separately. When considering the query complexity of an oracle $O$, all gates and queries to other oracles are ignored. We allow queries to multiple copies of $O$ in parallel; that is, we can perform the mapping $O^{\otimes r}$ in a single time-step for some $r$. From the viewpoint of a quantum circuit, these parallel queries to $O$ act as "gates" in the same circuit layer. The depth of queries to $O$ in a quantum circuit is defined as the length of the longest path composed of queries to $O$ and wires from the circuit inputs to outputs, where the length is the total number of queries to $O$ on this path. The size of queries to $O$ is then defined as the total number of queries to $O$ in the entire quantum circuit.
We often combine the depth and size complexity with the phrase "$\alpha$-depth and $\beta$-size of gates (queries)", meaning that in the quantum circuit, the depth of gates (queries) is $\alpha$ and the size of gates (queries) is $\beta$. If a quantum circuit does not involve queries, then its complexity implicitly refers to the gate complexity; for example, "an $\alpha$-depth and $\beta$-size quantum circuit" refers to a quantum circuit composed of $\alpha$-depth and $\beta$-size of gates.
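The depth and size measures defined above can be computed mechanically from a gate list. In the sketch below a circuit is a list of `(name, qubits)` pairs in program order; this representation, the function name `depth_and_size`, and the `counted` predicate are assumptions made for illustration. Gates that are not counted still propagate dependencies along their wires, which matches the longest-path definition of query depth.

```python
def depth_and_size(gates, counted=lambda name: True):
    """Return (depth, size) of a circuit, counting only gates accepted by `counted`.

    gates: iterable of (name, qubits) pairs in program order, qubits being a tuple.
    """
    frontier = {}  # for each wire, the number of counted gates on its longest path so far
    depth = 0
    size = 0
    for name, qubits in gates:
        weight = 1 if counted(name) else 0
        level = weight + max((frontier.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            frontier[q] = level
        depth = max(depth, level)
        size += weight
    return depth, size


# Hypothetical circuit: two dependent queries to O_H and one independent query.
circuit = [
    ("O_H", (0, 1)),
    ("CNOT", (1, 2)),
    ("O_H", (2, 3)),
    ("O_H", (4, 5)),
]
gate_cost = depth_and_size(circuit)
query_cost = depth_and_size(circuit, counted=lambda name: name == "O_H")
```

Here `gate_cost` treats every entry, including each query, as one gate, while `query_cost` ignores the CNOT when counting but still respects the dependency it creates between the first two queries.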
Error model. A pure state $|\psi\rangle$ is said to be approximated by $|\tilde{\psi}\rangle$ to precision $\epsilon$ if they are close in the $l_2$-norm such that $\||\psi\rangle - |\tilde{\psi}\rangle\| \le \epsilon$. A unitary $U$ is said to be implemented to precision $\epsilon$ if a unitary $\tilde{U}$ is actually implemented such that $\|U - \tilde{U}\| \le \epsilon$, where the norm is the spectral norm. The terms "$\epsilon$-approximation" and "$\epsilon$-close" will also be used interchangeably.

Parallel Quantum Circuit
Now we review some basic techniques for constructing parallel quantum circuits. Although the results are known in the previous literature, we provide their proofs in order to illustrate basic ideas in designing parallel quantum circuits.
Lemma 2.1 (Parallel copying [43,81]). Let $\mathrm{COPY}_b$ be a unitary that creates $b$ copies of a bit (including the original copy); that is, $\mathrm{COPY}_b |x\rangle |0\rangle^{\otimes (b-1)} = |x\rangle^{\otimes b}$ for $x \in \{0, 1\}$. Then $\mathrm{COPY}_b$ can be implemented by an $O(\log b)$-depth and $O(b)$-size quantum circuit.
Proof. Suppose $b$ is a power of two w.l.o.g. Then as shown in Figure 2, the gate $\mathrm{COPY}_b$ can be inductively constructed from $\mathrm{COPY}_{b/2}$, with $\mathrm{COPY}_1$ being the identity operator. It is easy to check the depth and size of this quantum circuit by induction.
Lemma 2.1 can be easily extended from copying a single bit to copying an $m$-bit string, with the circuit depth and size multiplied by $m$. We use $\mathrm{COPY}^m_b$ to denote such a circuit. For the case $b = 2$, we will omit the subscript and write $\mathrm{COPY}^m$.
[43,81]). Let $C_b$-$R_Z$ be a unitary that performs a phase shift on a single qubit controlled by $b$ qubits; that is, The cases of $X$- and $Y$-rotations can be easily proved by combining the above proof and the identities R , where Had stands for the Hadamard gate.
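Returning to the inductive construction of $\mathrm{COPY}_b$ in Lemma 2.1, the doubling idea can be sketched classically as a schedule of CNOT layers in which every existing copy fans out to one fresh zero bit. The schedule below is one possible realization under that assumption, not necessarily the circuit of Figure 2; the function names are hypothetical. It uses $\lceil \log_2 b \rceil$ layers and $b - 1$ CNOT gates.

```python
def copy_schedule(b: int):
    """Layers of CNOT (control, target) pairs that spread bit 0 to b positions."""
    layers = []
    have = 1  # number of positions that already hold a copy
    while have < b:
        new = min(have, b - have)
        # Controls 0..new-1 and targets have..have+new-1 are disjoint,
        # so all CNOTs of this layer can act in parallel.
        layers.append([(c, have + c) for c in range(new)])
        have += new
    return layers


def apply_copy(bit: int, b: int):
    """Run the schedule on the classical register [bit, 0, ..., 0]."""
    register = [bit] + [0] * (b - 1)
    for layer in copy_schedule(b):
        for control, target in layer:
            register[target] ^= register[control]
    return register
```

For `b = 8` the schedule has three layers containing 1, 2 and 4 CNOTs, so the number of copies doubles with every layer.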
The following lemma accommodates a classical parallel computing technique into the quantum setting.
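Lemma 2.3 (Sequence of associative operators). Let $\circ$ be an associative operator. Suppose we are given a unitary $U_\circ$ that performs $|x, y, z\rangle \mapsto |x, y, z \oplus (x \circ y)\rangle$, together with its inverse $U_\circ^\dagger$. Then for all $m \in \mathbb{N}$, the mapping $|x_1, \ldots, x_m, z\rangle \mapsto |x_1, \ldots, x_m, z \oplus (x_1 \circ \cdots \circ x_m)\rangle$ can be implemented by a quantum circuit with $O(\log m)$-depth and $O(m)$-size of $U_\circ$ and its inverse, where the additional gate complexity is often negligible compared to $U_\circ$ and thus omitted.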
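Proof. Let $S_\circ(l, r) := x_l \circ \cdots \circ x_r$, so that $S_\circ(l, r) = S_\circ(l, p) \circ S_\circ(p + 1, r)$, where $p = (l + r - 1)/2$. Assuming $m$ is a power of two w.l.o.g., the computation of $S_\circ(1, m)$ forms a tree with $O(\log m)$-depth and $O(m)$-size of $U_\circ$, where the root is $S_\circ(1, m)$ and the leaves are $x_1, \ldots, x_m$. After $S_\circ(1, m)$ is computed into an ancilla space, apply a COPY gate to copy the result into $|z\rangle$, then clean all the garbage partial sums by reverse computation with $U_\circ^\dagger$.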
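The tree evaluation used in the proof of Lemma 2.3 can be mimicked classically. The sketch below combines a sequence layer by layer with an associative operator and reports how many layers and how many operator applications were used; the function name `tree_reduce` is hypothetical, and the uncomputation step with $U_\circ^\dagger$ is not modeled.

```python
import operator


def tree_reduce(op, items):
    """Combine a non-empty sequence with an associative op in a balanced tree.

    Returns (result, depth, size): depth is the number of layers and size the
    number of applications of op. Operand order is preserved, so op does not
    need to be commutative.
    """
    layer = list(items)
    depth = 0
    size = 0
    while len(layer) > 1:
        # Pair up neighbours; all pairs of one layer are independent of each other.
        nxt = [op(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])  # an unpaired last element moves up unchanged
        size += len(layer) // 2
        layer = nxt
        depth += 1
    return layer[0], depth, size


xor_all, xor_depth, xor_size = tree_reduce(operator.xor, [0b101] * 8)
```

With eight operands the tree has depth 3 and uses 7 applications of the operator. The same routine covers the XOR of $r$ copies of $s$ and the product of $r$ copies of $H_{jk}$ from the Pauli example, since both operations are associative.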
Corollary 2.4 (Parallel addition of a sequence [43]). The addition of a sequence, i.e., the mapping
Proof. By Lemma 2.3, combined with the classical techniques of the three-two adder, pairwise representation and the carry-lookahead adder. A detailed proof can be found in [43].
The following lemma translates the parallel classical results of computing elementary arithmetic functions [87] to the quantum case using the technique of reversible computing [16].
Lemma 2.6 (Parallel quantum circuit for elementary arithmetics [87]). Let $f$ be one of the following elementary arithmetic functions: addition, subtraction, multiplication, division, cosine, sine, arctangent, exponentiation, logarithm, maximum, minimum, factorial. Then the unitary that performs ] can be implemented by an $O(\log^2 b)$-depth and $O(b^4)$-size quantum circuit, where $\tilde{x}, \tilde{y}$ are floating-point representations of $x, y$ on suitable intervals, and $\tilde{f}(\tilde{x}, \tilde{y})$ is $2^{-b}$-close to $f(x, y)$. (For a unary function the second operand $y$ is omitted, e.g., $f(x, y) = \cos(x)$.) In particular, for $f$ being addition, subtraction or multiplication, the depth can be $O(\log b)$.
In the remainder of this paper, the name "elementary arithmetic function" may also refer to a composition of a constant number of the arithmetic functions in Lemma 2.6. As elementary arithmetic operations are frequently used in this paper, in some cases we will measure the efficiency of computation with respect to these building blocks. More specifically we have the following definition for arithmetic-depth-efficient computation.
can be implemented by an arithmetic-depth-efficient quantum circuit.

Block-encoding
Block-encoding is a recently introduced fundamental tool for arithmetic operations on matrices represented as a block of a unitary. It has been developed through a line of research in quantum algorithms [60,22,37,73,6,32,50].
In this paper, we will slightly abuse the terminology in such a way that if the condition is then for simplicity U is also called an (α, a, ǫ)-block-encoding of A.

Parallel Quantum Walk
In this section, we define a parallel quantum walk within the framework of Childs' quantum walks [35,19,22,37]. In Section 3.1, we revisit Childs' quantum walk. Then we propose a parallelization of it in Section 3.2 and show how to implement the parallel quantum walk in two stages: pre-walk and re-weight, whose complexities are analyzed in Section 3.2.1 and Section 3.2.2 respectively. In particular, in Section 3.2.1 we define uniform-structured Hamiltonians for which the parallel quantum walk can be implemented efficiently. The Hamiltonians considered in Section 3.2 are $d$-sparse. In Section 3.3, we further consider a sum of $d$-sparse Hamiltonians and extend the parallel quantum walk and the class of uniform-structured Hamiltonians.

A Quantum Walk for Hamiltonians
Let $H$ be a $d$-sparse $N \times N$ Hamiltonian acting on $n$ qubits. By analogy with the classical Markov chain, the Hermitian $H$ can be seen as a transition matrix with "complex probability" on a $d$-sparse undirected graph whose adjacency matrix is given by replacing each nonzero entry in $H$ with 1.
Following [36], this graph is called the graph of the Hamiltonian, which we will often denote as $\mathrm{H}$ in the serif font throughout this paper. We write $(j, k) \in \mathrm{H}$ if an undirected edge $(j, k)$ exists in the graph $\mathrm{H}$. Following Childs' extension [35,19,68,22,37] of Szegedy's quantum walk [93], we define for all $j \in [N]$ the post-transition state of $|j\rangle$: as a generalization of the classical random walk, where the square root where $|\psi_j\rangle$ is defined in (2). Let $T : \mathcal{H} \to \mathcal{H}$ be any unitary such that The following lemma from [37] shows that we can implement a polynomial of $H$ by multiple steps of quantum walk. More precisely, $r$ iterations of $Q$ block-encode a degree-$r$ Chebyshev polynomial of $H$. 4) is performed precisely, where $T_r(x)$ is the degree-$r$ Chebyshev polynomial (of the first kind).
The proof of Lemma 3.2 involves some interesting techniques, which were later used in [73] to develop qubitization. Here we only give a proof for the special case when $r = 1$ (thus $T^\dagger Q T = T^\dagger S T$), which provides a basis for our generalization to parallel quantum walks in Section 3.2.
We first write the state $|\Psi_j\rangle$ in (3) into two parts $|\Psi_j\rangle = |\Phi_j\rangle + |\Phi_j^\perp\rangle$, where the subnormalized states Now we briefly illustrate how to implement one step of quantum walk by a quantum circuit. For the sake of simplicity we only consider query complexity, and assume each step is performed precisely. It suffices to show how to implement the unitary $T$ (and thus 3. Query the oracle $O_H$ to compute $H_{jk}$ in an ancilla space, rotate the state in $\mathcal{H}^B$ conditioned on it, then uncompute $H_{jk}$ by reverse computation.
That is,

Parallelization
The algorithm for the quantum walk described in Section 3.1 is highly sequential, because $r$ steps of quantum walk need $r$ iterations of $Q$, which in total requires $\Theta(r)$ sequential queries to $O_H$ and $O_L$. Now we define a parallel quantum walk, which can be implemented with only a constant depth of parallel queries for uniform-structured Hamiltonians (to be defined in Section 3.2.1). Slightly abusing the notation, we denote $\mathbf{j} \in \mathrm{H}^r$ if a path $\mathbf{j} = (j_0, \ldots, j_r)$ of length $r + 1$ exists in the graph $\mathrm{H}$.
The parallel quantum walk defined above naturally generalizes the original quantum walk in Definition 3.1. The key idea is that we extend the state $|\Psi_j\rangle$ in (3), which is a superposition of one step of walk $(j, k) \in \mathrm{H}$, to the state $|\Psi^{(r)}_{j_0}\rangle$ in (6), which is a superposition of $r$ steps of walk (i.e., a path) $\mathbf{j} \in \mathrm{H}^r$. (For technical reasons, the state space is enlarged to $(\mathbb{C}^{2N})^{\otimes (2r+2)}$ instead of $(\mathbb{C}^{2N})^{\otimes (r+1)}$; and the walk operator $Q^{(r)}$ is an extension of $T^\dagger Q T$ instead of $Q$.) As shown later in Lemma 3.8, the walk operator $Q^{(r)}$ becomes a block-encoding of the monomial $(H/d)^r$, a generalization of the $H/d$ obtained from the original quantum walk. It should be emphasized that an $r$-parallel quantum walk is not equivalent to $r$ sequential steps of the original quantum walk, which instead block-encodes a Chebyshev polynomial $T_r(H/d)$.
Remark 3.6. The term "parallel quantum walk" comes from the fact that, as proved later, the walk operator $Q^{(r)}$ can be performed by a parallel quantum circuit with a constant query depth if the Hamiltonian $H$ is uniform-structured (see Definition 3.10 in Section 3.2.1). This result is non-trivial, because the state $|\Psi^{(r)}_{j_0}\rangle$ in Definition 3.5 contains a dependence chain (which has a sequential nature) induced by the path $j \in H^r$, where $j_{s+1}$ depends on $j_s$ for all $s \in [r]$. This difficulty is resolved by observing that queries to the oracle $O_H$ can actually be separated from the dependence chain (see Section 3.2.2), while queries to the oracle $O_P$ can be parallelized if the graph $H$ has a good structure (see Section 3.2.1).

Now we illustrate why $Q^{(r)}$ is a block-encoding of $(H/d)^r$. Similar to the proof of Lemma 3.3, we can write $|\Psi^{(r)}_{j_0}\rangle = |\Phi^{(r)}_{j_0}\rangle + |\Phi^{(r)\perp}_{j_0}\rangle$, where the subnormalized state $|\Phi^{(r)}_{j_0}\rangle$ represents the "good" part of $|\Psi^{(r)}_{j_0}\rangle$. The following lemma shows some orthogonality relations between these subnormalized states in the context of $S^{(r)}$.

Lemma 3.7. For all $j, k \in [N]$, we have:

Proof. We prove the two cases separately.
1. Let j 0 := j and k 0 := k.By straightforward calculation we have: Then using the self-adjointness of H we obtain:

Recall that our state space is H
Proof. We first prove the claim for precise $T^{(r)}$. This part is similar to the proof of Lemma 3.3, and it suffices to show (10) for all $j, l \in [N]$, where $|\Psi_j\rangle$ is defined as above. Equation (10) can be obtained by splitting and then applying Lemma 3.7. For approximated $T^{(r)}$ with precision $\epsilon/2$, by linearity of error bound propagation, the LHS of (9) is approximated to precision $\epsilon$.
In order to implement an $r$-parallel quantum walk $Q^{(r)}$, we only need to focus on the $T^{(r)}$ part, since $S^{(r)}$ can be trivially implemented by a constant depth of SWAP gates. The outline of an implementation of $T^{(r)}$ is presented in Figure 4. Note that the implementation consists of two parts: pre-walk and re-weight.

[Figure 4: Outline of an implementation of $T^{(r)}$, where $H_{A_s} = H_{B_s} = \mathbb{C}^{2N}$ for all $s$.
Input: any state $|j_0\rangle \otimes |0\rangle \otimes \cdots \otimes |0\rangle$ for $j_0 \in [N]$ (due to the definition of $T^{(r)}$ in (7)), with $|j_0\rangle, |0\rangle \in \mathbb{C}^{2N}$.
Output: the state $|\Psi^{(r)}_{j_0}\rangle$ defined in (6).
Pre-walk: 1. Prepare in the subspace $H_A$ a pre-walk state $|p^{(r)}_{j_0}\rangle$.
Re-weight: 2. Copy the computational basis states in $H_A$ to $H_B$; that is, apply COPY$_{(r+1)\cdot(n+1)}$.]

Pre-walk and Uniform-structured Hamiltonians
Now we give a detailed description of pre-walk.At the same time, we introduce a class of Hamiltonians -uniform-structured Hamiltonians, for which the pre-walk can be conducted in a parallel fashion.The state |p (r) (11) earns the name "pre-walk state" because it is a superposition of all paths generated by r steps of unweighted random walk on the graph H starting from the vertex j 0 .We call the process of generating |p (r) j 0 an r-pre-walk.For simplicity, we assume |p (r) , that is, |j s ∈ N for all |j s .For the pre-walk, we only need to focus on the graph H and the oracle O P that characterizes its sparse structure, as |p (r) j 0 does not involve any weight H jk .One remaining question is what kind of oracle O P to be used in our algorithm.Since the complexities that we consider are measured in terms of the query complexity of O P and gate complexity, for practical reasons, O P should be reasonably efficiently implementable.Conversely, if O P is powerful enough (thus hard to implement), then intuitively the pre-walk can be done with only a few queries to O P .For instance, given an oracle O P = O path that directly gives the path generated from walks according to a sequence of choices, as shown in the following lemma, then the query complexity is O(1).Recall that L(j, t) denotes the column index of the t th nonzero entry in row j of H, i.e., the t th neighbor of vertex j in the graph H. Lemma 3.9 (Pre-walk with a strong path oracle O path ).Let O P = O path give a path generated by r steps of walk starting from j 0 , according to the sequence of choices t ∈ [d] r ; that is, take • For all r ∈ AE, the corresponding L (r) can be expressed as where the function f, g and the operator • with input/output lengths O(n) satisfy that: f , • and the mapping |0 → • There exists an "inverse" function and L −1 is arithmetic-depth-efficiently computable with O(1) queries to O P .
Remark 3.11. One might notice that the expression $g(t_0) \circ \cdots \circ g(t_{r-1})$ is ready to be computed in parallel by Lemma 2.3. This point is in fact the key ingredient to implement a parallel pre-walk. We also point out that if the function $g$ and its inverse are both arithmetic-depth-efficiently computable, then the mapping |0 →

Proof of Lemma 3.12. Note that for band Hamiltonians we do not need to query $O_P$. The lemma is proved by verifying the conditions in Definition 3.10. Note that $L(j, t) = j + t - (d - 1)/2$, where the addition and subtraction are in $\mathbb{Z}_N$ (although in this case the nonzero entries are not necessarily ordered, the correctness of the algorithm is unaffected).
• We have , and • to be addition in (12).
-By Lemma 2.6, these functions and operators are arithmetic-depth-efficiently computable.
As in Remark 3.11, the mapping |0 → 1 -The addition is obviously associative.
• Take the inverse function to be L −1 (j, k) = k − j, which is arithmetic-depth-efficiently computable by Lemma 2.6.
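The band-Hamiltonian case can be illustrated classically. The following is a minimal sketch, assuming an odd band width `d` and arithmetic modulo `N`; the function names are hypothetical and only mirror $L$, $g$ and $L^{-1}$ from the proof above. It shows that every vertex on a path depends only on $j_0$ and a prefix sum of offsets, which is what removes the sequential dependence chain.

```python
def g(t, d):
    """Offset contributed by choice t in [d] for a d-band Hamiltonian (d assumed odd)."""
    return t - (d - 1) // 2


def L(j, t, d, N):
    """Column index of the t-th nonzero entry in row j: j + t - (d-1)/2 (mod N)."""
    return (j + g(t, d)) % N


def L_inv(j, k, N):
    """Recover the offset g(t) modulo N from an edge (j, k)."""
    return (k - j) % N


def path_sequential(j0, choices, d, N):
    """Walk step by step: j_{s+1} = L(j_s, t_s)."""
    path = [j0]
    for t in choices:
        path.append(L(path[-1], t, d, N))
    return path


def path_from_prefix_sums(j0, choices, d, N):
    """Same path, but each j_s is j0 plus a prefix sum of offsets.

    Prefix sums of an associative operation can be evaluated by a
    logarithmic-depth circuit; here they are accumulated in a loop
    only for simplicity.
    """
    path, total = [j0], 0
    for t in choices:
        total += g(t, d)
        path.append((j0 + total) % N)
    return path
```

Both functions return the same path, so the vertices $j_1, \ldots, j_r$ can be obtained without iterating the neighbor function along the path.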
The second example is a tensor product of Pauli matrices. Recall that any Hamiltonian can be expressed as a sum of (scaled) tensor products of Pauli matrices, which form a basis for the Hermitian space. In Section 3.3, we will further show that this Pauli sum is also uniform-structured (according to the extended definition).

Lemma 3.14 (Tensor product of Pauli matrices). Let $H$ be a (scaled) tensor product of Pauli matrices, that is, $H = \alpha \bigotimes_{k \in [n]} \sigma_k$ with $\sigma_k \in \{\mathbb{1}, X, Y, Z\}$ and $\alpha$ a constant. Let $O_P$ give an $n$-bit string $s$ characterizing the Pauli string $(\sigma_k)_{k \in [n]}$; in particular, take X = [1], Y = [N] and $P(x, y) = y \oplus s$, where the $k$-th bit of $s$ is determined by $\sigma_k$. Then $H$ is uniform-structured.
Proof.Note that the sparsity d = 1 for H, because all Pauli matrices are 1-sparse.The lemma is proved by verifying the conditions in Definition 3.10.It holds that L(j, t) = j ⊕ s where ⊕ is the bit-wise XOR operator.
-These functions and operators are obviously arithmetic-depth-efficiently computable, while the mapping |0 → -The XOR ⊕ is associative.
• Take the inverse function to be L −1 (j, k) = s, which can be computed by a single query to O P .
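The relation $L(j, t) = j \oplus s$ can be checked numerically for small Pauli strings. The sketch below assumes that the $k$-th bit of $s$ is 1 exactly when $\sigma_k \in \{X, Y\}$ (the Paulis that flip a computational basis state), and that the first tensor factor is the most significant bit; both are assumptions of this illustration, and the helper names are hypothetical.

```python
import numpy as np

PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def flip_mask(pauli_string):
    """Integer s whose k-th bit (most significant first) is 1 iff sigma_k is X or Y."""
    s = 0
    for p in pauli_string:
        s = (s << 1) | (1 if p in "XY" else 0)
    return s


def tensor(pauli_string):
    """Dense matrix of the tensor product, first factor most significant."""
    m = np.eye(1, dtype=complex)
    for p in pauli_string:
        m = np.kron(m, PAULI[p])
    return m


def neighbor_is_xor(pauli_string):
    """Check that row j has its single nonzero entry in column j XOR s."""
    s = flip_mask(pauli_string)
    H = tensor(pauli_string)
    for j in range(H.shape[0]):
        cols = np.flatnonzero(np.abs(H[j]) > 0)
        if list(cols) != [j ^ s]:
            return False
    return True
```

For example, `neighbor_is_xor("XZY")` confirms that the matrix is 1-sparse with the nonzero entry of row $j$ in column $j \oplus s$ for `s = flip_mask("XZY")`.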
The third example is a local Hamiltonian clause, which acts non-trivially on a subsystem of l qubits, whose positions are indicated by the l bits of 1 in an n-bit string s.The sum of many local Hamiltonian clauses is a local Hamiltonian, which will be investigated in Section 3.3 as a uniform-structured Hamiltonian (according to the extended definition).

Lemma 3.15 (Local Hamiltonian clause).
Let $H$ be an $l$-local Hamiltonian clause; that is, $H = H_s \otimes \mathbb{1}_{\bar{s}}$, where $H_s$ is a Hamiltonian acting on the subsystem of $l$ qubits whose positions are indicated by the bits equal to 1 in the $n$-bit string $s$, and $\mathbb{1}_{\bar{s}}$ is the identity operator on the subsystem of the remaining $n - l$ qubits.
Let O P give the parameter s, in particular set X = [1], Y = [N ] and P (x, y) = y ⊕ s.Then H is uniform-structured.• We have that is, take f (j, k) = j ⊳ s k, g(t) = t↾ s , and • to be ⊳ s in (12).
-To compute f and •, it suffices to compute the operator ⊳ s , which can be computed by elementary gates according to (13)  -The associativity of ⊳ s is easy to verify.
• Take the inverse function to be $L^{-1}(j, k) = k \wedge s$, where $\wedge$ is the bit-wise AND operator. This is arithmetic-depth-efficiently computable with $O(1)$ queries to $O_P$.

Now we are ready to present the parallel pre-walk subroutine for uniform-structured Hamiltonians. As aforementioned, the parallelism relies on the structure of $L^{(r)}$, which will be shown to be computable by a quantum circuit whose depth is logarithmic in $r$. The goal is achieved through several lemmas and corollaries.

Proof. Since $\circ$ is associative and arithmetic-depth-efficiently computable, by Lemma 2.3 we can first evaluate $g(t_0) \circ \cdots \circ g(t_{r-1})$ by $O(\log r \cdot \log^2 n)$-depth and $O(rn^4)$-size of gates. Also, $f$ is arithmetic-depth-efficiently computable with $O(1)$ queries to $O_P$, thus $L^{(r)}(j, t) = f(j, g(t_0) \circ \cdots \circ g(t_{r-1}))$ can be computed by additional $O(\log^2 n)$-depth and $O(n^4)$-size of gates with $O(1)$ queries. Finally, apply COPY$_n$ to copy the result into $|z\rangle$, followed by garbage cleaning using reverse computation. The total complexity follows from summing these complexities up.
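The logarithmic depth in $r$ comes from evaluating an associative operator along a balanced tree. A minimal classical sketch of that reduction, counting rounds instead of circuit depth, is given below; `tree_reduce` is a hypothetical helper and not part of the construction in Lemma 2.3.

```python
def tree_reduce(op, items):
    """Combine a non-empty list with an associative op in ceil(log2(len)) rounds.

    All pairs within one round are independent of each other, so a round
    corresponds to one layer of parallel evaluations of `op`.
    Returns (result, number_of_rounds). Operand order is preserved, so
    `op` only needs to be associative, not commutative.
    """
    level = list(items)
    rounds = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired last element moves up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds
```

With XOR as the operator (the Pauli case) or modular addition (the band case), $r$ operands are combined in $\lceil \log_2 r \rceil$ rounds.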
Note that for different s ∈ [r + 1], the function L (s) can be computed in parallel, then we obtain the following corollary.in Lemma 2.1, which has O(log r)-depth and O r 2 n -size.Then apply Lemma 3.17 to each of the r copies, for the s th copy taking the input (j, g(t 0 ), . . ., g(t s−1 )) to compute L (s) (j, t 0 , . . ., t s−1 ), which can be done in parallel for s = 1 to r. Finally apply COPY n in parallel to store the result, followed by garbage cleaning using reverse computation.The final complexity comes from summing up these complexities.
With the help of the inverse function L −1 , we can generate a path state |j from the state |j 0 , g(t) by erasing the redundant information |g(t) in ( 14), as shown in the following corollary.Proof.Since j s = L (s) (j 0 , t 0 , . . ., t s−1 ) for s = 1 to r, to perform the required mapping we only need to clean the |g(t) in ( 14) after applying Corollary 3.18 to |j 0 , g(t) |0 ⊗r .Recall that L −1 (j s , j s+1 ) = g(t s ) for s ∈ [r], so we can compute L −1 by taking inputs |j s , j s+1 to clean g(t s ) first on odd s then on even s, thereby for different s the computation can be done in parallel.Since L −1 is arithmetic-depth-efficiently computable with O(1) queries to O P , the cleaning process can be done by O log 2 n -depth and O rn 4 -size of gates with O(1) queries.The final complexity follows from summing these complexities up.
A combination of the above results gives a parallel pre-walk algorithm for uniform-structured Hamiltonians.
Lemma 3.20 (Pre-walk on uniform-structured Hamiltonians).Let H be a uniform-structured Hamiltonian, then an r-pre-walk on its graph H, i.e., preparing the state |p (r) j 0 , can be implemented by a quantum circuit with • O(1)-depth and O(r)-size of queries to O P , and

Re-weight
Intuitively, the re-weight procedure in the implementation of T (r) in Figure 4 adjusts the "weight" of each path |j in the pre-walk state |p (r) j 0 according to entries in H, given by the oracle O H .As we will see in the following, the re-weight analysis is in fact simpler than the pre-walk because there is no requirement on the sparse structure of H. H will be determined in the gate complexity analysis below.
• For gate complexity, in Step 2 the COPY (r+1)•(n+1) gate can be performed by O(1)-depth and O(rn)-size of gates.In Step 3, one needs to apply r controlled rotations conditioned on some H jk ; that is, perform the mapping To achieve a total precision ǫ of T (r) , each rotation in (15)  Following [19], to satisfy the condition H * jk H jk * = H * jk one should be careful in choosing the sign of H * jk for H jk < 0. This problem is addressed by adding an O 2 −b disturbance on the imaginary part of H jk to force it to be nonzero (thereby forcing H jk to be complex) for those H jk < 0, with the total precision unchanged up to a constant factor 10 .Now for , which is also arithmetic-depth-efficiently computable by Finally, combining the pre-walk complexity (Lemma 3.20 in Section 3.2.1), the re-weight complexity (Lemma 3.21 in the above), and the negligible complexity of implementing S (r) gives the total complexity of the parallel quantum walk for uniform-structured Hamiltonians.

Extension: A Parallel Quantum Walk for a Sum of Hamiltonians
As shown in the previous section, the parallel quantum walk can be efficiently implemented in parallel for the class of uniform-structured Hamiltonians, which are however somewhat restricted in applications.Now we extend the framework in Section 3.2 to a parallel quantum walk for a sum of Hamiltonians, that is, a Hamiltonian of the form H = w∈[m] H w , where H w are d-sparse Hamiltonians and m = poly(n).In this extended framework, we generalize the class of uniformstructured Hamiltonians to include more Hamiltonians of practical interest, like Pauli sums and local Hamiltonians.
Recall that in Section 3.2 the good sparse structure of a Hamiltonian $H$ is a key to efficiently implementing the parallel quantum walk. The intuition behind the extended framework in this section is then: if some Hamiltonians have the same type of good sparse structure, then a sum of them is still structured well enough for exploiting parallelism. The organization of this section is similar to Section 3.2. For readability, we only provide the essential definitions and lemmas here, and leave more details to Appendix A.
Let us first define the extended parallel quantum walk for a sum of Hamiltonians. Recall that the state $|\Psi^{(r)}_{j_0}\rangle$ in (6), which is a superposition of paths $j = (j_0, \ldots, j_r)$ in the graph $H$, is a key ingredient in the parallel quantum walk in the previous section. Now in the case of a sum Hamiltonian $H = \sum_w H_w$, the graph $H$ can be "decomposed" into a sum of subgraphs $H_w$, so in a path $j \in H$ each edge $(j_s, j_{s+1})$ belongs to at least one subgraph. Thus, we can define an extended state $|\Psi^{(r,m)}_{j_0}\rangle$, which is still a superposition of paths $j \in H$, but each tensored with a corresponding string $w$, such that $(j_s, j_{s+1}) \in H_{w_s}$ for all $s \in [r]$. In this way, the extended parallel quantum walk can better exploit the sum structure of the Hamiltonian $H = \sum_{w \in [m]} H_w$, as shown later.
Definition 3. 23 ((r, m)-parallel quantum walk).Given the Hamiltonian H as above.Let H = H W ⊗ H A ⊗ H B be the walk space, where H W = ( m ) ⊗r and H A = H B = 2N ⊗r+1 .For each where j ∈ H w denotes (j s , j s+1 ) ∈ H ws for all s ∈ [r], and Hjk := H jk /c(j, k) with c(j, k) := 11 the number of subgraphs containing the edge (j, k) ∈ H. Let • T (r,m) : H → H be any unitary operator such that • S (r,m) : H → H inverse the order in the subspace H A ⊗ H B ; that is, S (r,m) = ½ W ⊗ S (r) , where S (r) : H A ⊗ H B → H A ⊗ H B is the inverse order operator in Definition 3.5.
Then a step of (r, m)-parallel quantum walk for H is defined as Q (r,m) := T (r,m) † S (r,m) T (r,m) .Remark 3.24.Note that in (16), the amplitudes are determined by a new Hamiltonian H, which is entry-wise rescaled from H. The rescaling factor for each entry H jk is c(j, k), i.e., the number of overlapping subgraphs H w on the edge (j, k) ∈ H.For later implementation of the extended parallel quantum walk, we will consider a new oracle O H giving an entry of H such that O H |j, k, z = |j, k, z ⊕ Hjk .Note that if c(j, k) can be efficiently computed, O H can be easily constructed from O H , with the total precision scaled by at most m for the construction, as c(j, k) The overhead caused by this precision scaling will be shown to be negligible later.
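The role of the rescaling factor $c(j, k)$ can be seen in a small numerical sketch. It assumes that the subgraph of $H_w$ is given by the nonzero pattern of the matrix $H_w$ (an assumption of this illustration), and checks that summing the rescaled entry over all subgraphs containing an edge recovers the original entry $H_{jk}$. The function names are hypothetical.

```python
import numpy as np


def overlap_counts(terms, tol=1e-12):
    """c(j, k): number of terms H_w whose (j, k) entry is nonzero."""
    return sum((np.abs(Hw) > tol).astype(int) for Hw in terms)


def rescaled_hamiltonian(terms):
    """Return (H, c, H_tilde) with H_tilde[j, k] = H[j, k] / c(j, k) on edges."""
    H = np.asarray(sum(terms), dtype=complex)
    c = overlap_counts(terms)
    H_tilde = np.zeros_like(H)
    mask = c > 0
    H_tilde[mask] = H[mask] / c[mask]
    return H, c, H_tilde


def recovered(terms, H_tilde, tol=1e-12):
    """Sum H_tilde over every subgraph that contains the edge (j, k)."""
    return sum((np.abs(Hw) > tol) * H_tilde for Hw in terms)
```

An edge shared by $c(j,k)$ subgraphs is visited once per subgraph in the extended walk, so dividing by $c(j,k)$ keeps the total weight of the edge equal to $H_{jk}$.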
The extended parallel quantum walk defined above also block-encodes a monomial of H, as shown in the following lemma.

Proof. Postponed to Appendix A.
As a straightforward generalization of the implementation of T (r) in Figure 4, an implementation of T (r,m) is shown in Figure 5.We call the procedure of preparing the (extended) pre-walk state |p (r,m) j 0 in (17) an (r, m)-pre-walk.Now we will introduce the notion of m-uniform-structured Hamiltonians as an extension of the uniform-structured Hamiltonians in Section 3.2.We first redefine the function L : • For all r ∈ AE, its corresponding L (r) can be expressed as where the function f, g and the operator • with input/output lengths O(n) satisfy that: f , • and the mapping |w |0 → |w  (16).T (r,m) can be implemented in the following ways.Pre-walk Re-weight Copy the computational basis states in H A to H B ; that is, apply COPY (r+1)   • There exists an inverse function L −1 such that L −1 (w, j, L(w, j, t)) = g(w, t) for all w ∈ [m], j ∈ [N ], t ∈ [d], and L −1 is arithmetic-depth-efficiently computable with O(1) queries to O P .
• The function [[(j, k) ∈ H w ]] can be arithmetic-depth-efficiently computed with O(1) queries to Note that the first two conditions in Definition 3.26 are naturally generalized from Definition 3.10, while the third condition is set to guarantee the efficiency of computing c(j, k) Proof.Postponed to Appendix A. Proof.Postponed to Appendix A.
Remark 3.29.Although any Hamiltonian can be represented as a Pauli sum due to the fact that tensor products of Pauli matrices form a basis for the Hermitian space, the number of summands m can be large.Since the complexity of the algorithm depends on m, only those Pauli sums with small m are of practical interest.The same difficulty exists when one tries to represent any Hamiltonian as a local Hamiltonian, because the parameter l can be large.
Following the same line of analysis as in Section 3.2, we have the following theorem that the (extended) parallel quantum walk for uniform-structured Hamiltonians in Definition 3.26 can be efficiently implemented in parallel.for r = polylog(1/ǫ).
Note that when m = 1, the above theorem reduces to Theorem 3.22.For readability, the proof of this theorem is postponed to Appendix A.

Parallel LCU for Hamiltonian Series
In this section, we show how to implement a linear combination of block-encoded powers of a Hamiltonian (i.e., a Hamiltonian power series) by a parallel quantum circuit.The method is based on the Linear-Combination-of-Unitaries (LCU) algorithm developed through [92,41,68,21,22,37,50].Here we adopt the block-encoding version of LCU in [50].The results in this section will be used to implement a linear combination of parallel quantum walks to approximate the evolution unitary e −iHt in Section 5.
We first recall two technical lemmas from [50].Although they apply to more general matrices, here for our purpose we restrict them to the Hamiltonians.Lemma 4.1 (Product of block-encoded Hamiltonian powers [50]).Let K be any Hamiltonian.
For n, m ∈ AE, if U is an (α, a, δ)-block-encoding of K n and V a (β, b, ǫ)-block-encoding of K m , then (½ a ⊗ V )(½ b ⊗ U ) is an (αβ, a + b, αǫ + βδ)-block-encoding of K n+m , where ½ s is the identity operator acting on the proper subsystem composed of the s qubits.Definition 4.2 (State preparation unitary).Let a ∈ R and α := a 1 , where √ a r |r with |0 ∈ R , where a r is the r th entry of a.
This definition is a special case of state preparation pair in [50], as shown in Appendix B.
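Lemma 4.1 can be checked numerically on small matrices. The sketch below uses exact block-encodings with a single ancilla qubit each ($\alpha = \beta = 1$, $\delta = \epsilon = 0$), built by a standard unitary dilation; this dilation and the helper names are choices made for the illustration, not the construction used in [50].

```python
import numpy as np


def block_encode(A):
    """One-ancilla unitary [[A, S], [S, -A]] with S = sqrt(1 - A^2).

    Requires A Hermitian with spectral norm at most 1. A and S commute,
    so the result is unitary and its top-left block equals A.
    """
    w, v = np.linalg.eigh(A)
    S = (v * np.sqrt(np.clip(1 - w**2, 0, None))) @ v.conj().T
    return np.block([[A, S], [S, -A]])


def product_block_encoding(U, V, N):
    """(1_a ⊗ V)(1_b ⊗ U) on the space ancilla_a ⊗ ancilla_b ⊗ system.

    U acts on (ancilla_a, system) and V on (ancilla_b, system);
    each ancilla is a single qubit and the system has dimension N.
    """
    I2 = np.eye(2)
    # Insert an identity on ancilla_b between U's ancilla and system indices.
    U_ext = np.einsum("aibj,cd->acibdj", U.reshape(2, N, 2, N), I2)
    U_ext = U_ext.reshape(4 * N, 4 * N)
    V_ext = np.kron(I2, V)  # identity on ancilla_a
    return V_ext @ U_ext
```

Taking `U = block_encode(K @ K)` and `V = block_encode(K @ K @ K)` for a Hermitian `K` with norm at most 1, the top-left $N \times N$ block of the product equals $K^5$, using two ancilla qubits in total.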
Lemma 4.3 (LCU for Hamiltonian series [50]).Let a ∈ R be an R-dimensional vector, K be a Hamiltonian, and a r K r be a power series of K; • V be an (α, δ)-state-preparation-unitary of a; Combining Lemma 4.1 and Lemma 4.3, we can obtain the following: Corollary 4.4.Let a ∈ R be an R-dimensional vector, K be a Hamiltonian, s := ⌈log R⌉, and a r K r be a power series of K; • V be an (α, δ)-state-preparation-unitary of a; • W r be a (1, b r , ǫ)-block-encoding of K r , and where b := j∈[s] b 2 j and r j is the j th bit of r.
Then by Lemma 4.3 we reach the conclusion.Now we present two lemmas showing how to implement V and W in Corollary 4.4 by parallel quantum circuits.The idea of Lemma 4.5 is similar to a data structure for matrix sampling in [65], but here we consider complex amplitudes (rather than real amplitudes in [65]) and explicitly compute the gate complexity.This lemma is based on previous works on quantum sampling [57].Lemma 4.5 (Parallel state preparation).Let a ∈ R be an R-dimensional vector such that for each r ∈ [R], a r is arithmetic-depth-efficiently computable given r as input, and let α := a 1 .An (α, ǫ)-state-preparation-unitary V of a can be implemented by an O log 3 log(1/ǫ) -depth and O log 5 (1/ǫ) -size quantum circuit for R = O(log(1/ǫ)).
Proof.Assume R = 2 s w.l.o.g., because one can enlarge the dimension of a by appending enough 0 entries.Let a r = e iθr |a r | and partial sum S(j, l) := l−1 r=j |a r |.Recall that our goal is to perform the mapping √ a r |r .The state preparation V consists of s steps, and in the k th step we perform the mapping for all x ∈ [2 k−1 ] and k ∈ [s], where we define A proof of the correctness of this procedure can be found in [65] and [57], except that here we have an additional accumulative phase β x to handle the complex amplitudes.Let us compute the gate complexity.To achieve a total precision ǫ, each step of (20) needs to be (ǫ/s)-precise.Each step of (20) consists of two controlled rotations, C b -R Y and C b -R Z , which are conditioned on elementary arithmetic functions of γ x and β x respectively, where b = O(log(s/ǫ)) = O(log(1/ǫ)) is the number of bits for the required precision.While β x can be easily computed by an arithmetic-depth-efficient quantum circuit, as it is an elementary arithmetic function of a u and a v , which are arithmetic-depth-efficiently computable from x; γ x needs to be computed from the more complicated S(u, v).
Note that one can compute all S(j, l) required at once in an ancilla space.This is done by • first computing 2 −b /R -precise a r for all r by arithmetic-depth-efficient circuits on the input |0, 1, . . ., R − 1 , due to the assumption that a r is arithmetic-depth-efficiently computable; • then computing S(j, l) in an inductive way analogous to the proof of Lemma 2.3, except that one needs to create a constant number of copies of each S(j, l) by COPY b+s for future computation.The next lemma generalizes Lemma 8 in [37] to the block-encoding case.
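The splitting procedure in the proof of Lemma 4.5 can be followed classically. The sketch below assumes $R$ is a power of two and that the target amplitudes are $\sqrt{a_r/\alpha}$ with the principal square root; it tracks only the amplitudes, and applies the phases $e^{i\theta_r/2}$ at the end rather than accumulating them step by step as in the proof. The function name is hypothetical.

```python
import numpy as np


def prepare_amplitudes(a):
    """Follow the s = log2(R) splitting steps and return the final amplitudes.

    At each step, the amplitude of a block of indices is split between its
    left and right halves in proportion to the partial sums of |a_r|,
    which is what the controlled Y-rotations implement.
    """
    a = np.asarray(a, dtype=complex)
    R = len(a)
    assert R > 0 and R & (R - 1) == 0, "R is assumed to be a power of two"
    mags = np.abs(a)
    amps = np.array([1.0])
    width = R
    while width > 1:
        half = width // 2
        new = []
        for x, amp in enumerate(amps):
            lo = x * width
            total = mags[lo:lo + width].sum()
            left = mags[lo:lo + half].sum()
            p = left / total if total > 0 else 1.0
            new.append(amp * np.sqrt(p))
            new.append(amp * np.sqrt(max(0.0, 1.0 - p)))
        amps = np.array(new)
        width = half
    # Phases handled in one shot here (the Z-rotations in the proof).
    return amps * np.exp(0.5j * np.angle(a))
```

For the Taylor coefficients $a_r = (-i)^r/r!$, the squared output amplitudes multiplied by $\alpha = \|a\|_1$ reproduce $a_r$.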

Parallel Hamiltonian Simulation
Now we are ready to present a parallel quantum simulation algorithm for uniform-structured Hamiltonians by assembling the techniques developed in the previous sections.Following [22,21,37], we simulate the evolution unitary e −iHt by first splitting the time interval t into small segments each of length ∆t, then approximating the evolution within each segment by the truncated Taylor series In our setting, ∆t should be chosen such that the monomial (H∆t) r can be obtained from the parallel quantum walk in Section 3. Using the results of Section 4 to combine these walk operators, we get a block-encoding of e −iH∆t , then we apply these e −iH∆t sequentially on the initial state.
In this way we are able to simulate e −iHt .To guarantee a constant success amplitude after each application of e −iH∆t , a technique introduced in [20,21,22] called robust oblivious amplitude amplification will be used.
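The segment-and-truncate strategy can be illustrated with dense matrices. This is a classical sketch only: it assumes $\tau = t/\Delta t$ is an integer and compares the product of truncated Taylor segments against the exact evolution obtained from an eigendecomposition. The function names are hypothetical.

```python
import numpy as np


def exact_evolution(H, t):
    """exp(-i H t) for Hermitian H via eigendecomposition."""
    w, v = np.linalg.eigh(H)
    return (v * np.exp(-1j * w * t)) @ v.conj().T


def taylor_segment(H, dt, R):
    """Truncated Taylor series sum_{r=0}^{R-1} (-i H dt)^r / r!."""
    N = H.shape[0]
    term = np.eye(N, dtype=complex)
    total = np.zeros((N, N), dtype=complex)
    for r in range(R):
        total += term
        term = term @ (-1j * dt * H) / (r + 1)
    return total


def segmented_simulation(H, t, dt, R):
    """Apply the truncated segment tau = t/dt times (tau assumed an integer)."""
    tau = round(t / dt)
    return np.linalg.matrix_power(taylor_segment(H, dt, R), tau)
```

Because the truncation error of a segment decays like $(\|H\|\Delta t)^R / R!$, a truncation order $R$ that grows only logarithmically in $1/\epsilon$ suffices per segment.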
Proof. In Corollary 4.4, take $R = \lceil \log(1/\epsilon) \rceil$, $a_r = (-i)^r / r!$, $K = H/(md)$ and $W_r = Q^{(r,m)}$. Then $\alpha = \sum_{r \in [R]} |a_r| < e$ is a constant, where each $a_r$ is arithmetic-depth-efficiently computable by Lemma 2.6. Assume $R = 2^s$ w.l.o.g. To achieve a total $\epsilon$-precision of $U_{\Delta t}$, in Corollary 4.4, take the precision of $V$ to be $\epsilon/(2\alpha R)$, and the precision of each $W_r$ (i.e., the precision of $Q^{(r,m)}$ in Theorem 3.30) to be $\epsilon/(2\alpha s)$. By Lemma 4.5, the state preparation unitary $V$ requires $O(\log^3 \log(1/\epsilon))$-depth and $O(\log^5(1/\epsilon))$-size of gates. The final complexity follows from summing up these complexities and the assumption $m = \operatorname{poly}(n)$.
Note that the number of ancilla qubits required for the block-encoding is not a tight upper bound.
To achieve a constant success amplitude after a sequence of $e^{-iH\Delta t}$, we need to amplify the success amplitude to a constant after each application of $e^{-iH\Delta t}$ with robust oblivious amplitude amplification. The following lemma is a special case of Lemma 6 in [22].

Proof. Let $\Delta t := 1/(md)$. First consider the case when $\tau = t/\Delta t$ is an integer. To achieve a total precision $\epsilon$, applying Lemma 5.2 followed by Corollary 5.4 with precision $\epsilon/\tau$ gives a $(1, O(\gamma(n + \log m)), O(\epsilon/\tau))$-block-encoding of $e^{-iH\Delta t}$. Repeat the above procedure $\tau$ times; that is, using Lemma 4.1 to multiply these block-encoded $e^{-iH\Delta t}$, we obtain a $(1, O(\tau\gamma(n + \log m)), O(\epsilon))$-block-encoding of $e^{-iHt}$. By properly scaling the precision $\epsilon$ we can remove the constant factor in $O(\epsilon)$ and implement $e^{-iHt}$ to precision $\epsilon$ with the same overhead up to a constant factor. The final complexity follows from summing up these complexities.
For the case when $\tau$ is not an integer, that is, $\tilde{t} := t - \lfloor t/\Delta t \rfloor \Delta t \neq 0$, we can independently simulate the last segment for time $\tilde{t}$. This can be done by simulating $\tilde{H} := H\tilde{t}/\Delta t$ instead for time $\Delta t$, where the oracle $O_{\tilde{H}}$ for $\tilde{H}$ is easy to construct from $O_H$, with at most $O(\log^2 \gamma)$-depth and $O(\gamma^4)$-size of overhead for the required precision by Lemma 2.6. The final complexity is unchanged.

Lower Bounds
In this section, we prove Theorem 1.3 in Section 1.1, which gives a lower bound on the gate depth of simulating a uniform-structured Hamiltonian and implies that the $\operatorname{polylog}\log(1/\epsilon)$ factor in the gate depth in Theorem 5.5 cannot be significantly improved to $o(\log\log(1/\epsilon))$. Our proof is based on the proof of Theorem 1.2 in [20], which gives a lower bound that simulating any sparse Hamiltonian to precision $\epsilon$ requires $\Omega\!\left(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right)$-size of queries, as an extension of the "no-fast-forwarding theorem" in [18]. Their proof basically reduces the problem of computing the parity of $N$ bits (with unbounded error, i.e., with success probability strictly greater than 1/2) to simulating a 2-sparse $2N \times 2N$ Hamiltonian (to a high precision). Our lower bound is achieved by two simple observations: the Hamiltonian used there is 6-band, which is actually uniform-structured as shown in Lemma 3.12; and computing the parity of $N$ bits with unbounded error requires $\Omega(\log N)$-depth of gates.
Proof of Theorem 1.3.We will show that there exists a uniform-structured Hamiltonian H such that simulating H to precision ǫ requires Ω(log log(1/ǫ))-depth of gates.Following [20], consider a 2N × 2N Hamiltonian H determined by an N -bit string x = x 0 . . .
, where $\oplus$ stands for the XOR operator. Note that here $H$ is 6-band, and thus uniform-structured by Lemma 3.12. Also, we have $\|H\|_{\max} \le 1$, which can be normalized to 1 with a constant overhead. In [20] it is shown that: 2. if $N = \Theta\!\left(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right)$, then there is an unbounded-error algorithm to compute PARITY($x$), by simulating $H$ for a constant time $t$ to precision $\epsilon$ on the input $|0, 0\rangle$, followed by a computational basis measurement.
Take $N = \Theta\!\left(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)}\right)$. To finish the proof, it suffices to show that computing PARITY($x$) with unbounded error requires $\Omega(\log N) = \Omega(\log\log(1/\epsilon))$ depth of gates. This is trivial because PARITY($x$) depends on all $x_j$ for $j \in [N]$, while gates of depth $o(\log N)$ only cover $o(N)$ input qubits. More precisely, for the sake of contradiction, suppose there exists an $o(\log N)$-depth quantum circuit that takes an input $|x\rangle$ and outputs PARITY($x$) by a measurement on the first output qubit with success probability $> 1/2$. Then there must be an input qubit holding $|x_k\rangle$ for some $k \in [N]$ that is not connected to the output qubit by a path of gates and wires. However, $x_k = 0$ and $x_k = 1$ yield different values of PARITY($x$). This gives a contradiction.
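The counting behind this light-cone argument can be made explicit. The sketch below assumes that every gate acts on at most `fan_in` qubits (a constant, taken to be 2 by default); under that assumption an output qubit of a depth-$D$ circuit can be influenced by at most `fan_in**D` input qubits. The function names are hypothetical.

```python
def max_light_cone(depth, fan_in=2):
    """Upper bound on the number of inputs that can influence one output qubit."""
    return fan_in ** depth


def min_depth_for_full_dependence(N, fan_in=2):
    """Smallest depth whose light cone can reach all N input qubits."""
    depth = 0
    while max_light_cone(depth, fan_in) < N:
        depth += 1
    return depth
```

Since PARITY depends on every input bit, any circuit computing it needs depth at least `min_depth_for_full_dependence(N)`, which grows as $\log N$.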

Applications
It seems that the parallel quantum algorithm for Hamiltonian simulation developed above can be applied to a wide range of simulation tasks in physics and chemistry.As a concrete illustration, three examples of physical interest are presented in this section.For each example, we explicitly calculate the total gate complexity of the parallel quantum simulation algorithm for it and compare with the prior art.In particular, we calculate the gate cost for implementing the oracle O H and O P in the algorithm.Since our choices of O P turn out to be efficiently implementable by quantum gates in these examples, compared to the commonly chosen oracle O L , we believe that our definition of the oracle O P is more reasonable in these applications.The results in this section were already summarized in Subsection 1.1 as Table 1.

Simulation of the Heisenberg Model
Many-body localization (MBL) is an intriguing phenomenon in the long-time behaviour of a closed quantum system with disorder and interactions [82,2,4]. Contrary to the conventional assumption in quantum statistical mechanics that a system coupled to a bath (i.e., a large environment) will, after a long time, reach a thermal equilibrium which erases the initial condition, the MBL system as an isolated many-body quantum system resists such thermalization: in a local subsystem the information of the initial state is remembered forever. While a theoretical understanding of MBL has remained challenging since its introduction by Anderson in 1958 [5], a tremendous amount of numerical work on various systems has been conducted in recent years (for example, see [82,4,1] for reviews). A typical example for numerical studies of MBL is the one-dimensional Heisenberg model [82,78,38]. Due to the difficulty of simulating many-body dynamics classically, quantum simulations can investigate properties of MBL in larger systems intractable for classical computers.
Following [38], we consider the problem of simulating the one-dimensional nearest-neighbor Heisenberg model with a random magnetic field in the z direction.More concretely, we will simulate an n-qubit Hamiltonian where h w ∈ [−1, 1] is chosen uniformly at random, the subscript w indicates the qubit w that the Pauli matrix acts on, and w = n is equivalent to w = 0 by assuming the periodic boundary conditions.First observe that H is a ( ]] is arithmetic-depth-efficiently computable for uniform-structured H, and (H w ) jk is easy to determine given a copy of h w , therefore the above procedure can be implemented by an O log 2 n + log b -depth and O n 5 + nb -size quantum circuit (in a way analogous to computing c(j, k) in the proof of Lemma A.3). Thus, we obtain the following: In [38] the performance of different quantum simulation algorithms on this task are compared, amongst which the asymptotically best one is based on quantum signal processing [74,73], which achieves a gate complexity O n 3 log n + log(1/ǫ) log log(1/ǫ) • n log n .Later a quantum algorithm to simulate lattice Hamiltonians [59] shows a better dependence on n in the gate complexity, consisting of O(n polylog(n/ǫ))-depth and O n 2 polylog(n/ǫ) -size of gates.We note that in the gate depth all these works have a polylog(1/ǫ) dependence, while Corollary 7.1 only contains a polylog log(1/ǫ) factor.

Simulation of the Sachdev-Ye-Kitaev Model
The Sachdev-Ye-Kitaev (SYK) model [90,66], a simple but important exactly solvable many-body system, has drawn increasing interest in the condensed matter physics and high energy physics communities due to its many striking properties [90,66,80,86,88] and its potential to have an interesting holographic dual [66,80]. Like the MBL problem in Section 7.1, numerical studies of a larger SYK model enabled by quantum simulation could extend our understanding of its features and dual interpretation.
Following [49,11], we consider the problem of simulating the SYK model evolving under a Hamiltonian

$$H = \frac{1}{4 \cdot 4!} \sum_{p,q,r,s=0}^{2n-1} J_{pqrs}\, \gamma_p \gamma_q \gamma_r \gamma_s,$$

where each $J_{pqrs} \sim \mathcal{N}(0, \sigma^2)$ is chosen randomly from a normal distribution with variance $\sigma^2 = \frac{3! J^2}{(2n)^3}$ ($J$ is assumed to be a constant), and $\gamma_p$ are Majorana fermion operators such that $\{\gamma_p, \gamma_q\} = 2[[p = q]]\mathbb{1}$. The Majorana operator can be expressed as a tensor product of Pauli matrices by the Jordan-Wigner transformation

$$\gamma_p \mapsto \begin{cases} Z_0 \cdots Z_{p/2-1} X_{p/2} & p \text{ is even} \\ Z_0 \cdots Z_{(p-3)/2} Y_{(p-1)/2} & p \text{ is odd} \end{cases} \tag{24}$$

for $p \in [2n]$ (unlike in [49], our index starts from 0), where as usual the subscript of a Pauli matrix indicates the qubit it acts on. Now $H$ can be expressed as a Pauli sum on $n$ qubits:

$$H = \alpha \sum_{w \in [m]} J_w H_w, \tag{25}$$

where $\alpha$ is a constant, $J_w$ is chosen randomly from a normal distribution, and each $H_w$ is a tensor product of Pauli matrices. It can be seen by Lemma 3.27 that the SYK Hamiltonian in (25) is $m$-uniform-structured with $m = (2n)^4$, thus we can apply Theorem 5.5 to simulate it. Using $J_w \sim \mathcal{N}(0, \sigma^2)$ with $\sigma^2 = O(1/n^3)$, simulating $H$ for time $t$ is equivalent to simulating its normalized Hamiltonian with max norm $\le 1$ for a rescaled time $\tilde{t} := \|H\|_{\max} t = O(n^{2.5} t)$. We can determine the parameters in Theorem 5.5: $\tau = O(n^{6.5} t)$, $\gamma = O(\log(nt/\epsilon))$ and $b = O(\log(nt/\epsilon))$.
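The Jordan-Wigner mapping in (24) can be verified numerically for small $n$. The sketch below builds each $\gamma_p$ as a dense matrix and checks the anticommutation relation $\{\gamma_p, \gamma_q\} = 2[[p = q]]\mathbb{1}$; placing qubit 0 as the leftmost tensor factor is a convention chosen for this illustration, and the function names are hypothetical.

```python
from functools import reduce

import numpy as np

PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def jw_string(p, n):
    """Pauli string of gamma_p on n qubits following (24).

    Z acts on qubits 0 .. p//2 - 1, then X (p even) or Y (p odd) acts on
    qubit p//2, and the identity acts on the remaining qubits.
    """
    q = p // 2
    return "Z" * q + ("X" if p % 2 == 0 else "Y") + "I" * (n - q - 1)


def jw_matrix(p, n):
    """Dense 2^n x 2^n matrix of gamma_p."""
    return reduce(np.kron, [PAULI[c] for c in jw_string(p, n)])


def satisfies_majorana_algebra(n):
    """Check {gamma_p, gamma_q} = 2 [[p = q]] 1 for all p, q in [2n]."""
    dim = 2 ** n
    gammas = [jw_matrix(p, n) for p in range(2 * n)]
    for p, gp in enumerate(gammas):
        for q, gq in enumerate(gammas):
            expected = 2 * np.eye(dim) if p == q else np.zeros((dim, dim))
            if not np.allclose(gp @ gq + gq @ gp, expected):
                return False
    return True
```

For instance, `jw_string(5, 4)` gives `"ZZYI"`, and `satisfies_majorana_algebra(3)` confirms the algebra for all six Majorana operators on three qubits.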
For the total complexity of the algorithm, we start by defining $\mathcal{P} : [2n]^4 \to \{1, i, -1, -i\} \times \{\mathbb{1}, X, Y, Z\}^n \simeq [4]^{n+1}$ to be the function that maps $w \in [m]$ to the Pauli string (with a global phase) of $H_w$. For example, if $H_w = -iX \otimes Z \otimes Z$ then $\mathcal{P}(w) = (-i, X, Z, Z)$. One can also write the Jordan-Wigner transformation in (24) as a function $\mathcal{J} : [2n] \to \{\mathbb{1}, X, Y, Z\}^n \simeq [4]^n$, which can be computed by an $O(\log n)$-depth and $O(n \log n)$-size quantum circuit. To see this, that is, to perform the mapping $|p\rangle|0\rangle \mapsto |p\rangle|\mathcal{J}(p)\rangle$, we can first prepare the state $|0, \ldots, n-1\rangle$ in the second register, then make $n$ copies of $|p\rangle$, and finally compute the $q$th entry of $\mathcal{J}(p)$:

$$\mathcal{J}(p)_q := \begin{cases} 3[[q < p/2]] + [[q = p/2]] & p \text{ is even} \\ 3[[q < (p-1)/2]] + 2[[q = (p-1)/2]] & p \text{ is odd} \end{cases} \in [4]$$

Following [9], we consider the problem of simulating a molecular electronic structure Hamiltonian in the second-quantized form. In this form, we will simulate a Hamiltonian

$$H = \sum_{p,q \in [n]} h_{pq}\, a_p^\dagger a_q + \frac{1}{2} \sum_{p,q,r,s \in [n]} h_{pqrs}\, a_p^\dagger a_q^\dagger a_r a_s \tag{26}$$

where $n$ represents the number of spin orbitals, $h_{pq}, h_{pqrs}$ are one-electron and two-electron integrals, and $a_p^\dagger, a_p$ are fermionic creation and annihilation operators satisfying the relations $\{a_p^\dagger, a_q\} = [[p = q]]\mathbb{1}$ and $\{a_p^\dagger, a_q^\dagger\} = \{a_p, a_q\} = 0$.
As in the "database" algorithm [9], we assume that these $b$-bit precise $h_{pq}$ and $h_{pqrs}$ are precomputed and stored in a database; for example, an $O((nb)^4)$-size quantum-read/classical-write RAM (QCRAM), to which one quantum access can be performed by $O(\log(nb))$-depth and $O((nb)^4)$-size of gates. One can apply the Jordan-Wigner transformation to the creation and annihilation operators $a_p^\dagger, a_p$ to obtain Hamiltonians acting on qubits: for $p \in [n]$, where as usual the subscript of a Pauli matrix indicates the qubit it acts on. Note that each resulting qubit Hamiltonian in (27) and (28) is a sum of two tensor products of Pauli matrices; hence by splitting them and applying the transformation to (26) we obtain $H$ as a Pauli sum on $n$ qubits:

$$H = \sum_{w \in [m]} h_w H_w \tag{29}$$

where $m = O(n^4)$, each $h_w$ is some $h_{pq}$ or $h_{pqrs}$ in (26) (up to a constant factor), and each $H_w$ is a tensor product of Pauli matrices. Here we omit the explicit mapping from the indices $p, q, r, s$ in (26) to the index $w$ in (29), but mention that the mapping can be efficiently performed by $O(1)$-size of gates. Similar to the SYK model, we see by Lemma

• The construction of $O_P$ is similar to the one in Section 7.2. Here we omit the details and claim that it can be implemented by an $O(\log n)$-depth and $O(n \log n)$-size quantum circuit.
• To implement $O_H^b$, that is, to perform the mapping $|j, k, 0\rangle \mapsto |j, k, H_{jk}\rangle$, we sum up those $(H_w)_{jk}$ with $(j, k) \in H_w$ to obtain $H_{jk}$. Recall that $[[(j, k) \in H_w]]$ is arithmetic-depth-efficiently computable for uniform-structured $H$. Also, $(H_w)_{jk}$ is easy to compute given: a copy of $h_w$, which can be read out from the database by $O(\log(nb))$-depth and $O((nb)^4)$

Prior work [9] gives a quantum simulation algorithm for molecular Hamiltonians with gate complexity $O\left(n^8 t \frac{\log(nt/\epsilon)}{\log\log(nt/\epsilon)}\right)$. Later algorithmic improvements (e.g., [12,13,67,10,23]) focus on parameters other than the precision $\epsilon$, and all these works have a poly-logarithmic dependence on $\epsilon$, as the Hamiltonian simulation subroutines used there have such a dependence. By allowing parallelism, Corollary 7.3 exponentially improves the dependence to $\operatorname{polylog}\log(1/\epsilon)$ with respect to the depth.

Discussion
In this paper, we proposed a parallel quantum algorithm for Hamiltonian simulation that achieves a doubly (poly-)logarithmic precision dependence in the depth complexity. Thus, for the first time it is shown that parallelism can provide a significant speed-up in the Hamiltonian simulation problem. This result is achieved by introducing a novel notion of parallel quantum walk and identifying a class of Hamiltonians, called uniform-structured Hamiltonians, for which the simulation can be executed efficiently in parallel. We believe that the techniques developed in this paper can be applied to design other parallel quantum algorithms.
In our work, the dependence on the simulation precision $\epsilon$ is improved from $\operatorname{polylog}(1/\epsilon)$ to $\operatorname{polylog}\log(1/\epsilon)$ by parallelization. One open question, as mentioned in Section 1.3, is whether parallelization can improve the dependence on the simulation time $t$; more precisely, what kind of Hamiltonians can be simulated for time $t$ by a quantum circuit with depth $\operatorname{polylog}(t)$, by allowing parallel queries?
Intuitively, the power of parallelism is provided by the use of ancilla qubits. We have not explicitly counted the number of ancillae used in our parallel quantum algorithm, but it obviously depends on the precision $\epsilon$. Therefore, another question for future research is how many ancillae are required for a significant parallel speed-up. More generally, the trade-off between the circuit depth and the number of ancillae will be an interesting issue in the study of further parallel quantum algorithms.

Figure 1: An outline of the parallel quantum algorithm for Hamiltonian simulation.

Figure 2: Inductive construction of the gate $\mathrm{COPY}_b$.

any of the $b$ entangled qubits will add to the relative phase, the whole state becomes $|\gamma\rangle\left(\alpha|0\rangle^{\otimes b} + e^{i 2\pi\gamma \cdot 2^{-b}} \beta|1\rangle^{\otimes b}\right)$ after applying C-$R_Z$ in parallel. The final state is obtained by reverse computation with $\mathrm{COPY}_b^\dagger$. It is easy to check the depth and size of the quantum circuit by Lemma 2.1.
can be implemented by an $O(\log m + \log b)$-depth and $O(mb)$-size quantum circuit, where the addition is in $\mathbb{Z}_{2^b}$.

Corollary 2.5 (Parallel controlled Z gate). Let $C^b$-$Z$ be a unitary that performs a $Z$ gate on a single qubit controlled by $b$ qubits; that is,

$$C^b\text{-}Z\, |x\rangle|\psi\rangle = \begin{cases} |x\rangle Z|\psi\rangle & x = 2^b - 1 \\ |x\rangle|\psi\rangle & \text{o.w.} \end{cases}$$

for all $x \in [2^b]$ and $|\psi\rangle \in \mathbb{C}^2$. Then $C^b$-$Z$ can be implemented by an $O(\log b)$-depth and $O(b)$-size quantum circuit.

Proof. Write $x$ as a bit string $x = x_0 \ldots x_{b-1}$. Since the control condition $x = 2^b - 1$ holds exactly when all bits of $x$ are 1, the $C^b$-$Z$ gate can be implemented as follows. First take $\bullet$ to be the AND gate in Lemma 2.3 to compute $x_0 \wedge \ldots \wedge x_{b-1}$ in the ancilla space by $O(\log b)$-depth and $O(b)$-size of gates, then apply a C-$Z$ gate to $|\psi\rangle$ conditioned on the result, and finally clean the garbage by reverse computation.
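The logarithmic depth in Corollary 2.5 comes from evaluating an associative Boolean operation over $b$ bits as a balanced binary tree. The classical sketch below counts the layers and two-input gates of such a tree; the helper names `tree_reduce` and `control_fires` are hypothetical and not part of the paper. The control condition $x = 2^b - 1$ (all bits equal to 1) corresponds to taking AND as the operation.

```python
from operator import and_


def tree_reduce(bits, op):
    """Combine bits pairwise, layer by layer.

    Returns (result, depth, gate_count), where depth is the number of
    layers and gate_count the number of two-input gates used.
    """
    layer = list(bits)
    depth = 0
    gates = 0
    while len(layer) > 1:
        nxt = [op(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        gates += len(nxt)
        if len(layer) % 2 == 1:
            nxt.append(layer[-1])  # an unpaired element passes through unchanged
        layer = nxt
        depth += 1
    return layer[0], depth, gates


def control_fires(x, b):
    """True iff all b control bits of x are 1, i.e. x == 2**b - 1."""
    bits = [(x >> i) & 1 for i in range(b)]
    return tree_reduce(bits, and_)[0] == 1
```

For $b$ input bits the tree has $\lceil \log_2 b \rceil$ layers and $b - 1$ gates, matching the $O(\log b)$-depth and $O(b)$-size scaling.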
and $z = 0$, any state o.w. (4) for all $j, z \in [2N]$. Let $S : \mathcal{H} \to \mathcal{H}$ be the SWAP operator such that $S|a, b\rangle = |b, a\rangle$ for all $a, b \in [2N]$. Then a step of quantum walk for $H$ is defined as $Q$ only the operator $T$ requires oracle queries. As the ordinary quantum walk does not assume any sparse structure of $H$, we take $O_P = O_L$ here.

Lemma 3.4. The unitary $T$ can be implemented by a quantum circuit with $O(1)$ queries to $O_H$ and $O_L$.

Proof. Let $\mathcal{H}^A \otimes \mathcal{H}^B$ be the state space with $\mathcal{H}^A = \mathcal{H}^B = \mathbb{C}^{2N}$. As in the definition (4) of $T$, we only consider its action on the initial state $|j\rangle \otimes |0\rangle$ for $j \in [N]$, with $|j\rangle, |0\rangle \in \mathbb{C}^{2N}$. Then $T$ can be implemented in the following way:

1. Prepare a superposition over computational basis states of size $d$ in $\mathcal{H}^B$.
2. Query the oracle $O_L$ to obtain in $\mathcal{H}^B$ a superposition over nonzero entries in row $j$.
${}^{\otimes r+1}$. Let us focus on the space $\mathcal{H}^A$. Since $S^{(r)}$ is the reverse order operator, every computational basis component of the state $S^{(r)}|\Phi_k^{(r)\perp}\rangle$ has at least one subsystem $s \in [r+1]$ of the form $|k_s + N\rangle \in \mathbb{C}^{2N}$, while every computational basis component of $|\Phi_j^{(r)}\rangle$ or $|\Phi_j^{(r)\perp}\rangle$ has all subsystems of the form $|j_t\rangle$ for $t \in [r+1]$. The orthogonality statement immediately follows from $\langle j_t | k_s + N \rangle = 0$ for any $j_t, k_s \in [N]$.

Lemma 3.8.
$r$, and $P(j_0, t) = (j_1, \ldots, j_r)$ with $j_{s+1} := L(j_s, t_s)$ for $s \in [r]$. Then the $r$-pre-walk can be implemented by a quantum circuit with
• $O(1)$ queries to $O_P$, and
• $O(1)$-depth and $O(r \log d)$-size of gates.

Proof. Assume $d$ is a power of two w.l.o.g. Starting from the initial state $|j_0\rangle|0\rangle^{\otimes r}$, the pre-walk can be implemented in the following way. First prepare a superposition $\frac{1}{\sqrt{d^r}} \sum_{t \in [d]^r} |t\rangle$ in the second register by $O(1)$-depth and $O(r \log d)$-size of Hadamard gates, then query $O_P$ to compute $P(j_0, t) = (j_1, \ldots, j_r)$ and thus obtain the goal state $|p^{(r)}_{j_0}\rangle$.

In general, it might be expensive to implement $O_{\text{path}}$. For example, given the oracle $O_L$ that computes the function $L$, the straightforward way to implement the strong oracle $O_{\text{path}}$ requires $r$ sequential queries to $O_L$. However, for a special class of Hamiltonians, generating a path according to a sequence of choices can be done more efficiently by exploiting parallelism in computing compositions of the function $L$. This forms the basic idea of uniform-structured Hamiltonians. Let the function $L^{(r)} : [N] \times [d]^r \to [N]$ be inductively defined as

$$L^{(r)}(j, t_0, \ldots, t_{r-1}) := L\left(L^{(r-1)}(j, t_0, \ldots, t_{r-2}),\, t_{r-1}\right)$$

for $j \in [N]$, $t \in [d]^r$, with $L^{(1)} := L$. Note that $L^{(r)}$ gives the destination of $r$ steps of walk according to a sequence of choices. Uniform-structured Hamiltonians are those Hamiltonians for which the function $L^{(r)}$ can be computed efficiently in parallel.

Definition 3.10 (Uniform-structured Hamiltonian). A $d$-sparse Hamiltonian $H$ with the associated oracle $O_P$ is uniform-structured if:

Example 3.16. Let $H := A \otimes \mathbb{1} \otimes B \otimes \mathbb{1}$ be a $16 \times 16$ Hamiltonian with $A$ and $B$ being $2 \times 2$ Hamiltonians, and $\mathbb{1}$ being the $2 \times 2$ identity matrix. Then $H$ is a 2-local Hamiltonian clause $H_s \otimes \mathbb{1}_{\bar{s}}$, with $H_s = A \otimes B$ and $s = 1010$.

Proof of Lemma 3.15. The lemma is proved by verifying the conditions in Definition 3.10. We use the superscript $i$ to denote the $i$th bit of a number. Note that the sparsity $d = 2^l$ and $L(j, t) = j \triangleleft_s (t{\upharpoonright}_s)$, where the operator ${\upharpoonright}_s : [d] \to [N]$ lifts an $l$-bit string to an $n$-bit string according to $s$, defined as $(b{\upharpoonright}_s)^i := s^i \cdot b^{s^0 + \ldots + s^i}$, and the operator $\triangleleft_s : [N] \times [N] \to [N]$ overwrites an $n$-bit string by another according to $s$, defined as

$$(a \triangleleft_s b)^i := a^i (1 - s^i) + b^i s^i \tag{13}$$

for all $a, b \in [N]$, $i \in [n]$. For instance, $101{\upharpoonright}_{01011} = 01001$ and $10011 \triangleleft_{01011} 01001 = 11001$.
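The lift and overwrite operators can be made concrete on bit strings. The sketch below is a classical illustration with hypothetical function names; strings are read left to right, and the $k$th bit of $b$ is placed at the $k$th position where $s$ has a 1, an indexing convention chosen here so that the worked instances above are reproduced.

```python
def lift(b, s):
    """Spread the bits of b over the positions where s has a 1; all other positions get 0."""
    assert len(b) == s.count("1")
    remaining = iter(b)
    return "".join(next(remaining) if flag == "1" else "0" for flag in s)


def overwrite(a, b, s):
    """Take the bit of b where s has a 1 and the bit of a elsewhere, as in (13)."""
    assert len(a) == len(b) == len(s)
    return "".join(bi if flag == "1" else ai for ai, bi, flag in zip(a, b, s))


def neighbor(j, t, s):
    """L(j, t): overwrite row index j by the lifted choice t on the support of s."""
    return overwrite(j, lift(t, s), s)
```

With $s = 01011$, `lift("101", s)` gives `"01001"` and `overwrite("10011", "01001", s)` gives `"11001"`, in agreement with the instances stated in the proof.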

Corollary 3.18. For a uniform-structured Hamiltonian $H$, the mapping

$$|j\rangle|g(t)\rangle|z\rangle \mapsto |j\rangle|g(t)\rangle \bigotimes_{s=1}^{r} \left|z_{s-1} \oplus L^{(s)}(j, t_0, \ldots, t_{s-1})\right\rangle \tag{14}$$

for all $j \in [N]$, $t \in [d]^r$, $z \in [N]^r$, can be implemented by a quantum circuit with
• $O(1)$-depth and $O(r)$-size of queries to $O_P$, and
• $O(\log r \cdot \log^2 n)$-depth and $O(r^2 n^4)$-size of gates.

Proof. First make $r$ copies of $|j\rangle|g(t)\rangle$ by applying parallel COPY O(rn) r

Corollary 3.19. For a uniform-structured Hamiltonian $H$, the mapping $|j_0, g(t)\rangle \mapsto |j\rangle$ for all $j_0 \in [N]$, $t \in [d]^r$, where $j \in H^r$ has $j_{s+1} := L(j_s, t_s)$ for $s \in [r]$, can be implemented by a quantum circuit with
• $O(1)$-depth and $O(r)$-size of queries to $O_P$, and
• $O(\log r \cdot \log^2 n)$-depth and $O(r^2 n^4)$-size of gates.
$\sum_{j \in H^r} |j\rangle$. Each mapping $|0\rangle \mapsto \frac{1}{\sqrt{d}} \sum_{t \in [d]} |g(t)\rangle$ in Step 1 is arithmetic-depth-efficient with $O(1)$ queries to $O_P$, due to the definition of uniform-structured Hamiltonians. Combined with Corollary 3.19, the final complexity is obtained.

Lemma 3.21 (Re-weight). Re-weight of $|p^{(r)}_{j_0}\rangle$, i.e., performing the mapping |p defined in (6), can be implemented to precision $\epsilon$ by a quantum circuit with
• $O(1)$-depth and $O(r)$-size of queries to $O_H^b$ with $b = O(\log(1/\epsilon))$, and
• $O(\log^2\log(1/\epsilon))$-depth and $O(r(n + \log^4(1/\epsilon)))$-size of gates,
for $r = \operatorname{polylog}(1/\epsilon)$.

Proof. We analyze separately the query complexity and gate complexity of the re-weight stage, including Step 2 and Step 3 in Figure 4.
• For query complexity, only Step 3 involves queries to $O_H$. As $\mathcal{H}^A_s \otimes \mathcal{H}^B_{s+1}$ are disjoint for $s \in [r]$, these $r$ queries are independent, and thus can be done in parallel with $O(1)$-depth and $O(r)$-size. The oracle $O_H^b$ needs to be $O(\epsilon/r)$-precise, which requires $H_{jk}$ given by the oracle to have $b = O(\log(r/\epsilon)) = O(\log(1/\epsilon))$ bits of precision. To perform the rotation, first compute $(1 - |H_{jk}|)/H^*_{jk}$ and its arctangent arithmetic-depth-efficiently by Lemma 2.6, then apply the $C^b$-$R_Y$ gate of Corollary 2.2.

Lemma 2.6.
From Corollary 2.2, a $C^b$-$R_Y$ gate can be implemented by an $O(\log b)$-depth and $O(b)$-size quantum circuit. The total complexity follows by summing up the complexities of $r$ rotations.

Definition 3.26 ($m$-uniform-structured Hamiltonian). A sum Hamiltonian $H = \sum_{w \in [m]} H_w$ with the associated oracle $O_P$ is $m$-uniform-structured if:

Figure 5: Implementation of $T^{(r,m)}$.

which is used in the construction of the modified oracle from $O_H$, as mentioned in Remark 3.24.

Now we present two important examples of $m$-uniform-structured Hamiltonians: Pauli sums and local Hamiltonians, which are generalizations of tensor products of Pauli matrices and local Hamiltonian clauses, respectively.

Lemma 3.27 (Pauli sum). Let $H$ be a Pauli sum, that is, $H = \sum_{w \in [m]} H_w$ where $H_w$ are (scaled) tensor products of Pauli matrices defined in Lemma 3.14. Let $O_P$ give an $n$-bit string $s(w)$ characterizing the Pauli string of $H_w$ for each $w$; in particular, take $X = [m]$, $Y = [N]$ and $P(w, y) = y \oplus s(w)$, where the $k$th bit of $s(w)$ is defined as

Lemma 3.28 (Local Hamiltonian). Let $H$ be an $(l, m)$-local Hamiltonian, that is, $H = \sum_{w \in [m]} H_w$ where $H_w$ are $l$-local Hamiltonian clauses defined in Lemma 3.15. Let $O_P$ give an $n$-bit string $s(w)$ characterizing the locality of $H_w$, i.e., the positions of the $l$ qubits $H_w$ acts on, for each $w$. In particular, set $X = [m]$, $Y = [N]$ and $P(w, y) = y \oplus s(w)$. Then $H$ is $m$-uniform-structured.

Theorem 3.30 (Parallel quantum walks for $m$-uniform-structured Hamiltonians). For an $m$-uniform-structured Hamiltonian, the $(r, m)$-parallel quantum walk $Q^{(r,m)}$ can be performed to precision $\epsilon$ by a quantum circuit with
• $O(1)$-depth and $O(r)$-size of queries to $O_H^b$ with $b = O(\log(m/\epsilon))$,
• $O(1)$-depth and $O(rm)$-size of queries to $O_P$, and
• $O(\log r \cdot \log^2 n + \log^2\log(m/\epsilon))$-depth and $O(m r^2 n^4 + r \log^4(m/\epsilon))$-size of gates.
The input state $|0, 1, \ldots, R-1\rangle$ can be prepared by $O(1)$-depth and $O(Rs)$-size of gates, while the inductive summation procedure requires $O(s \log^2(b + s))$-depth and $O(R(b + s)^4)$-size of gates by Lemma 2.6, Lemma 2.1 and the proof of Lemma 2.3. Thus all $S(j, l)$ required can be computed by an $O(\log^3\log(1/\epsilon))$-depth and $O(\log^5(1/\epsilon))$-size quantum circuit.

By Lemma 2.2 the controlled rotations in each step of (20) can be implemented by an $O(\log b)$-depth and $O(b)$-size quantum circuit, once the rotation angles, some elementary arithmetic functions of $\gamma_x, \beta_x$, are computed by an $O(\log^2 b)$-depth and $O(b^4)$-size quantum circuit. Summing up the $s$ steps gives the final complexity.

Lemma 4.6 (Parallel implementation of $W$). The unitary $W$ in (19) can be implemented by a quantum circuit with $\sum_{j \in [s]} \mathrm{Dep}(W_{2^j})$-depth and $\sum_{j \in [s]} \mathrm{Siz}(W_{2^j})$-size, where Dep and Siz refer to the depth and size cost of a subroutine.

Proof. A parallel quantum circuit implementation of $W$ is shown in Figure 6, where we omit the identity operator $\mathbb{1}_{b - b_{2^j}}$ on the ancilla space for simplicity of notation. Since the controlled version of $W_{2^j}$ has the same complexity as $W_{2^j}$ up to a constant factor, the final complexity then follows easily.

Figure 6: A parallel quantum circuit for W .

Lemma 5.1. Assume $R := \lceil\log(1/\epsilon)\rceil \ge 4$ w.l.o.g.; we have $e^z -$ The RHS of the above equation can be bounded by the Taylor remainder $\frac{e^{\xi} |z|^R}{R!} \le \frac{1}{2^R} \le \epsilon$, where $0 \le \xi \le |z| \le 1/2$.

Let us analyze the complexity of implementing a block-encoding of $e^{-iH\Delta t}$.

Lemma 5.2. For any $m$-uniform-structured Hamiltonian $H$, there exists a unitary $U_{\Delta t}$ that forms an $(\alpha, R(2\lceil\log m\rceil + 4n), \epsilon)$-block-encoding of $e^{-iH\Delta t}$, and can be implemented by a quantum circuit with
• $O(\log R)$-depth and $O(R)$-size of queries to $O_H^b$ with $b = O(R + \log m)$,
• $O(\log R)$-depth and $O(mR)$-size of queries to $O_P$, and
• $O(\log^2 R \cdot \log^2 n + \log^3 R)$-depth and $O(m R^2 n^4 + R(R + \log m)^4)$-size of gates.
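The Taylor remainder bound of Lemma 5.1 can be checked numerically. The sketch below rests on three assumptions that are not confirmed by the text above: the logarithm is taken to base 2, the truncated series keeps the terms $z^k/k!$ for $k < R$, and the worst case is probed only on sampled points of the circle $|z| = 1/2$. All function names are hypothetical.

```python
import cmath
import math


def truncation_order(eps):
    """R = ceil(log2(1/eps)), forced to be at least 4 (base 2 is an assumption)."""
    return max(4, math.ceil(math.log2(1.0 / eps)))


def taylor_exp(z, R):
    """Partial sum of exp(z) keeping the terms z**k / k! for k = 0, ..., R - 1."""
    total = 0j
    term = 1 + 0j
    for k in range(R):
        total += term
        term *= z / (k + 1)
    return total


def worst_truncation_error(eps, samples=64):
    """Largest |exp(z) - partial sum| over sampled points with |z| = 1/2."""
    R = truncation_order(eps)
    worst = 0.0
    for i in range(samples):
        z = 0.5 * cmath.exp(2j * math.pi * i / samples)
        worst = max(worst, abs(cmath.exp(z) - taylor_exp(z, R)))
    return worst
```

For moderate precisions such as $\epsilon = 10^{-3}$ or $10^{-6}$, the sampled error stays below $\epsilon$, consistent with a truncation order that grows only logarithmically in $1/\epsilon$.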
$(1/\epsilon)$-size of gates. Combining Lemma 4.6 and Theorem 3.30, the implementation of $W$ requires
• $O(s)$-depth and $O(R)$-size of queries to $O_H^b$ with $b = O(\log(sm/\epsilon))$,
• $O(s)$-depth and $O(mR)$-size of queries to $O_P$, and
• $O(s(\log R \cdot \log^2 n + \log^2\log(sm/\epsilon)))$-depth and $O(m R^2 n^4 + R \log^4(sm/\epsilon))$-size of gates.

Corollary 7.1 (Parallel simulation of the Heisenberg model). The Heisenberg Hamiltonian defined in (22) can be simulated for time $t = n$ to precision $\epsilon$ by $O(n^3 \log^3 n \cdot \log^3\log(1/\epsilon))$-depth and $O(n^8 \log^5(n/\epsilon))$-size of quantum gates.
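To make the sparse structure of the Heisenberg Hamiltonian concrete, the following classical sketch enumerates its nonzero entries row by row for a small $n$. It assumes the local clause $X_w X_{w+1} + Y_w Y_{w+1} + Z_w Z_{w+1} + h_w Z_w$ with periodic boundary, identifies qubit $w$ with bit $w$ of the basis index (an arbitrary convention), and uses hypothetical function names. It relies on the facts that $Z_w Z_{w+1}$ and $Z_w$ are diagonal and that $X_w X_{w+1} + Y_w Y_{w+1}$ swaps an anti-aligned pair of bits with amplitude 2.

```python
def locality_string(w, n):
    """s(w)_u = 1 iff the clause H_w acts on qubit u (periodic boundary)."""
    return [1 if u == w or u == (w + 1) % n else 0 for u in range(n)]


def heisenberg_rows(n, h):
    """Return {j: {k: H_jk}} listing the nonzero entries of each row j."""
    rows = {}
    for j in range(2 ** n):
        row = {}
        diag = 0.0
        for w in range(n):
            v = (w + 1) % n
            bw = (j >> w) & 1
            bv = (j >> v) & 1
            diag += 1.0 if bw == bv else -1.0   # Z_w Z_v contribution
            diag += h[w] * (1 - 2 * bw)          # h_w Z_w contribution
            if bw != bv:
                # X_w X_v + Y_w Y_v swaps the two differing bits with amplitude 2
                k = j ^ (1 << w) ^ (1 << v)
                row[k] = row.get(k, 0.0) + 2.0
        row[j] = diag
        rows[j] = row
    return rows
```

For $n \ge 3$ every row has at most $n + 1$ nonzero entries and every entry has magnitude at most $2n$, in line with the bound $\|H\|_{\max} = O(n)$ used to rescale the simulation time.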
3.27 that the molecular Hamiltonian $H$ in (29) is $m$-uniform-structured with $m = O(n^4)$. According to [9], we have $\sum_{w \in [m]} |h_w| = O(n^4)$. Thus the max norm of $H$ is bounded by $\|H\|_{\max} \le \sum_{w \in [m]} \|h_w H_w\|_{\max} \le \sum_{w \in [m]} |h_w| = O(n^4)$. Simulating $H$ for time $t$ is equivalent to simulating its normalized Hamiltonian for a rescaled time $\tilde{t} := \|H\|_{\max} t = O(n^4 t)$. The parameters in Theorem 5.5 can be determined as follows: $\tau = O(n^8 t)$, $\gamma = O(\log(nt/\epsilon))$ and $b = O(\log(nt/\epsilon))$. Now let us compute the gate complexities for implementing the oracles $O_H^b$ and $O_P$:

-size of gates; together with a string characterizing the Pauli string of $H_w$ (like the function $\mathcal{P}(w)$ for the SYK model) computable by $O(\log n)$-depth and $O(n \log n)$-size of gates. Therefore the above procedure can be implemented by an $O(\log^2 n + \log b)$-depth and $O(n^8 b^4)$-size quantum circuit (in a way analogous to the proof of Lemma A.3). As an application of Theorem 5.5, we have the following:

Corollary 7.3 (Parallel simulation of molecular Hamiltonians). The molecular Hamiltonian defined in (26) can be simulated for time $t$ to precision $\epsilon$ by $O(n^8 \log^3 n \cdot t \log^3\log(t/\epsilon))$-depth and $O(n^{16} t \log^5(nt/\epsilon))$-size of gates.
Sparse matrix input model

Let $H$ be a $d$-sparse $N \times N$ Hamiltonian acting on $n$ qubits with $\|H\|_{\max} = 1$. $H$ is accessed by: on sparse Hamiltonians, instead of $O_P$, an oracle $O_L$ is given such that $O_L|j, t\rangle = |j, L(j, t)\rangle$, where $L(j, t) \in [N]$ gives the column index of the $t$th nonzero entry in row $j$ of $H$ for $t \in [d]$ and gives 0 for $t \notin [d]$.
The factor $1/\sqrt{d}$ and the garbage states $|k + N\rangle$ are introduced to keep $|\psi_j\rangle$ a normalized state. Now we are ready to define the quantum walk, a unitary acting on the extended space $\mathbb{C}^{2N} \otimes \mathbb{C}^{2N}$ for the $N \times N$ Hamiltonian $H$.
Definition 3.1 (Quantum walk for Hamiltonians). Given a Hamiltonian $H$ as above, let $\mathcal{H} = \mathbb{C}^{2N} \otimes \mathbb{C}^{2N}$ be the state space. For each $j \in [N]$, define

Similar to the proof of Lemma 3.4, for each query we compute $H_{j_s j_{s+1}}$ in a temporary ancilla space, conditioned on which we rotate the state in $\mathcal{H}^B_{s+1}$, then uncompute $H_{j_s j_{s+1}}$. Finally we obtain the goal state |Ψ r |j |j $\in \mathcal{H}^A \otimes \mathcal{H}^B$.

3. Query $r$ copies of the oracle $O_H$ in parallel, each in the subspace $\mathcal{H}^A_s \otimes \mathcal{H}^B_{s+1}$ for $s \in [r]$.
For a better understanding of the quite involved Definition 3.10, we show three examples of uniform-structured Hamiltonians. The first example is a band Hamiltonian, which has its nonzero entries concentrated within a band around the diagonal.

Lemma 3.12 (Band Hamiltonian). Assume $d \in [N]$ is odd. Let $H$ be a $d$-band Hamiltonian, i.e., $H_{jk} = 0$ if $k \notin B^d_j$, where $B^d_j := \{j + t - (d-1)/2 : t \in [d]\}$ with the addition and subtraction in $\mathbb{Z}_N$. Let $O_P$ be an empty oracle, that is, take $X = Y = \emptyset$ and $P$ to be undefined. Then $H$ is uniform-structured.

Example 3.13. The $4 \times 4$ Hamiltonian $H$ with matrix form

|g(t) can be performed by first querying $O_P$ to obtain $|s\rangle$ in an ancilla space, then conditioned on it performing $n$ controlled Hadamard gates in parallel followed by garbage cleaning, assuming $d$ is a power of two w.l.o.g. This is arithmetic-depth-efficient with $O(1)$ queries.
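For the band Hamiltonian of Lemma 3.12, the composed function $L^{(r)}$ has a closed form, which is what allows a whole path to be generated without $r$ sequential steps. The sketch below is a classical illustration under the assumption that the $t$th candidate entry of row $j$ sits in column $j + t - (d-1)/2 \bmod N$; the function names are hypothetical. It compares the sequential walk with a prefix-sum formula, and prefix sums are the kind of quantity a logarithmic-depth circuit can produce.

```python
from itertools import accumulate


def band_neighbor(j, t, d, N):
    """Assumed column function L(j, t) of a d-band Hamiltonian (d odd)."""
    return (j + t - (d - 1) // 2) % N


def path_sequential(j, choices, d, N):
    """Apply L once per choice; this mirrors r sequential oracle calls."""
    path = []
    for t in choices:
        j = band_neighbor(j, t, d, N)
        path.append(j)
    return path


def path_closed_form(j, choices, d, N):
    """Compute every L^(s)(j, t_0, ..., t_{s-1}) directly from prefix sums of the choices."""
    shift = (d - 1) // 2
    return [
        (j + prefix - s * shift) % N
        for s, prefix in enumerate(accumulate(choices), start=1)
    ]
```

For example, with $N = 16$, $d = 5$, $j = 3$ and choices $(0, 4, 2, 1)$, both functions return the path $[1, 3, 3, 2]$.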
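and $O(r^2 n^4)$-size of gates.

Proof. Let $\bigotimes_{s \in [r+1]} \mathcal{H}_s$ be the state space with $\mathcal{H}_s = \mathbb{C}^N$. The process of preparing |p from the initial state $|j_0\rangle|0\rangle^{\otimes r}$ with $|j_0\rangle, |0\rangle \in \mathbb{C}^N$.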
2. Applying Corollary 3.19, we obtain the goal state |p

Query $r$ copies of the modified oracle $O_H$ in parallel, each in the subspace $\mathcal{H}^A_s \otimes \mathcal{H}^B_{s+1}$ for $s \in [r]$. For each query we compute $H_{j_s j_{s+1}}$ in a temporary ancilla space, conditioned on which we rotate the state in $\mathcal{H}^B_{s+1}$, then uncompute $H_{j_s j_{s+1}}$ with another query. Finally we obtain the goal state |Ψ

[38])-local Hamiltonian with local clause $H_w := X_w X_{w+1} + Y_w Y_{w+1} + Z_w Z_{w+1} + h_w Z_w$ for $w \in [n]$. The locality of $H_w$ is indicated by an $n$-bit string $s(w)$ with the $u$th bit defined as $s(w)_u := [[w = u \vee w + 1 = u]]$ for all $u \in [n]$, given by the oracle $O_P$ as in Lemma 3.28. Then by Lemma 3.28, $H$ is $m$-uniform-structured with $m = n$, thus we can apply Theorem 5.5 to simulate $H$. For comparison, take the simulation time $t = n$ as in [38]. Since $\|H\|_{\max} \le \sum_{w \in [n]} \|H_w\|_{\max} = O(n)$, simulating $H$ for time $t$ is equivalent to simulating a normalized $H$ with max norm $\le 1$ for a rescaled time $\tilde{t} := \|H\|_{\max} t = O(n^2)$. Thus we can determine the parameters in Theorem 5.5: $\tau = O(n^3)$, $\gamma = O(\log(n/\epsilon))$ and $b = O(\log(n/\epsilon))$. For the total complexity of the algorithm, one should also compute the gate complexities for implementing the oracles $O_H^b$ and $O_P$.
• To implement $O_P$, that is, to perform the mapping $|w\rangle|0\rangle \mapsto |w\rangle|s(w)\rangle$, one can first prepare the state $|0, \ldots, n-1\rangle$ in the second register, then make $n$ copies of $|w\rangle$ by applying COPY, and finally calculate each Boolean function $s(w)_u$ for $u \in [n]$ in parallel followed by garbage cleaning. Together with Lemma 2.6, this can be done by $O(\log n)$-depth and $O(n \log n)$-size of gates.
• To implement $O_H^b$, one needs to first generate $n$ uniform random $h_w \in [-1, 1]$ to $b$-bit precision in the preprocessing. This can be done by $O(1)$-depth and $O(nb)$-size of Hadamard gates followed by single-qubit measurements. The preprocessing only needs to be done once. Later, each time an $h_w$ is required, we can apply a $\mathrm{COPY}_b$ gate to prepare a new copy of it. To perform the mapping $|j, k, 0\rangle \mapsto |j, k, H_{jk}\rangle$, one can calculate an entry $H_{jk}$ by summing up those $(H_w)_{jk}$ with $(j, k) \in H_w$. Recall that $[[(j, k) \in H_w$