On solving classes of positive-definite quantum linear systems with quadratically improved runtime in the condition number

Quantum algorithms for solving the Quantum Linear System (QLS) problem are among the most investigated quantum algorithms of recent times, with potential applications including the solution of computationally intractable differential equations and speed-ups in machine learning. A fundamental parameter governing the efficiency of QLS solvers is $\kappa$, the condition number of the coefficient matrix $A$, as it has been known since the inception of the QLS problem that for worst-case instances the runtime scales at least linearly in $\kappa$ [Harrow, Hassidim and Lloyd, PRL 103, 150502 (2009)]. However, in the case of positive-definite matrices classical algorithms can solve linear systems with a runtime scaling as $\sqrt{\kappa}$, a quadratic improvement compared to the indefinite case. It is then natural to ask whether QLS solvers may admit an analogous improvement. In this work we answer the question in the negative, showing that solving a QLS entails a runtime linear in $\kappa$ also when $A$ is positive definite. We then identify broad classes of positive-definite QLS where this lower bound can be circumvented and present two new quantum algorithms featuring a quadratic speed-up in $\kappa$: the first is based on efficiently implementing a matrix-block-encoding of $A^{-1}$, the second constructs a decomposition of the form $A = L L^\dagger$ to precondition the system. These methods are widely applicable and both allow one to efficiently solve BQP-complete problems.


Introduction
Quantum computation is described using the formalism of linear algebra, suggesting that quantum methods may be intrinsically well-suited to performing linear algebraic tasks. Algorithms solving linear systems of equations, in particular, are a cornerstone of linear algebra [2, Chapter 2], having many direct applications and playing a pivotal role in several computational methods [3]. In the seminal work of Harrow, Hassidim, and Lloyd (HHL) the so-called Quantum Linear System (QLS) problem was introduced and a quantum algorithm was presented that allows solving the QLS exponentially faster than classical algorithms solving classical linear systems [1]. In subsequent works, several new algorithms have been put forward that solve the QLS with further increased efficiency in comparison to the original HHL algorithm, improving the runtime dependence on the condition number [4], on the precision [5] and on the sparsity [6]. Recently, a new approach inspired by adiabatic quantum computation has introduced a significantly simpler quasi-optimal solving algorithm [7,8], significantly narrowing the gap with experimental implementations [9], making the algorithm compatible with Noisy Intermediate-Scale Quantum (NISQ) devices [10,11] and leading to the development of the presently most efficient QLS solvers [12].
A key idea underpinning the possibility of achieving large quantum speed-ups in linear algebra tasks is the fact that an exponentially large complex vector can be compactly encoded in the amplitudes of a pure quantum state; e.g., an $n$-qubit state is described via $2^n$ amplitudes. This intuition is indeed correct for the QLS problem, which has been proven to be BQP-complete [1]: any quantum computation can be re-formulated as a QLS with only a polynomial overhead and therefore there exist families of QLS problems that afford super-polynomial speed-ups compared to classical solution methods (unless BPP = BQP). While this reduction shows that almost certainly there exist families of QLS problems that allow an exponential speed-up compared to all classical methods, the crucial question is whether there are natural problems that can be directly formulated and solved as QLS. The ubiquity of linear systems seems to suggest that their quantum variant should be broadly applicable as well, but this is not guaranteed, since in the QLS setting further significant constraints have to be met to obtain exponential speed-ups [13].
A prominent fundamental bottleneck of QLS solvers is that, to efficiently obtain the solution, it is not sufficient that the coefficient matrix $A$ of the system $Ax = b$ is invertible: it also has to be robustly invertible, that is, well-conditioned. It is required that the condition number of the matrix $A$, defined as the ratio between the largest and the smallest singular values, is small. In fact, solving a QLS necessarily entails a runtime scaling at least linearly in the condition number (unless BQP = PSPACE), as was already proven in Ref. [1]. Therefore, an exponential quantum speed-up for QLS solving is achievable only when the condition number scales polylogarithmically with the system size and, unfortunately, it seems rather difficult to find natural examples of matrix families that exhibit such mild growth of the condition number [14]. However, polynomial quantum speed-ups for linear system solving should be rather broadly achievable and could still provide a quantum advantage if the degree of the polynomial is large enough [15]. In this view, obtaining a further quadratic improvement in the dependence on the condition number for some restricted classes of matrices that are of wide practical interest could be of the utmost importance for broadening the impact of QLS solving algorithms. A previous publication showed that, in the context of quantum algorithms for solving certain Markov chain problems, the use of specialised QLS solvers for positive-definite matrices provides an improvement in the condition number dependence [16]. Exploring the general positive-definite QLS problem is the main focus of this work.
In the rest of the Introduction we motivate why quadratic speed-ups in the condition number in positive-definite QLS may be expected (Section 1.1), review some related results present in the literature (Section 1.2) and then proceed to give high-level overviews of our two algorithms (Section 1.3 and Section 1.4). In Section 2 we fix the notation and give the main definitions. In Section 3 we prove that QLS solvers require, even when restricting to positive-definite matrices, a runtime scaling linearly in the condition number. We then move to the main results of this work, that is, achieving improved runtime scaling for solving certain classes of positive-definite QLS: in Section 4 we show a method based on an efficient implementation of a matrix-block-encoding of $A^{-1}$ and in Section 5 a method based on decomposing the coefficient matrix as $A = LL^\dagger$ to effectively precondition the system. Finally, an outlook of possible future research directions is given in Section 6.

Positive definite linear systems and quadratic speed-ups
In this work, we investigate the efficiency of QLS solving algorithms specialised to the case where the coefficient matrix A ∈ C N ×N is Hermitian and positive definite (PD). We will call this sub-class the positive-definite Quantum Linear System (PD-QLS) problem.
A first reason to focus specifically on the PD case is that in the classical setting several problems of practical relevance are formulated as linear systems involving PD coefficient matrices. A second important reason is that the few fully worked-out examples in the literature providing strong evidence of polynomial quantum speed-ups for "natural" QLS problems involve PD coefficient matrices [14,17]. In particular, the discretization of a partial differential equation (PDE) having a positive-definite kernel (such as, e.g., Poisson's equation) results in a large linear system where the coefficient matrix is positive definite. Montanaro and Pallister perform in Ref. [14] a detailed

Table 1: Summary of the main results of this work. The results provided in the second column of the table hold under the conditions specified in the third column: the $\Omega(\kappa)$ query complexity lower bound of Proposition 6, proven via a reduction to the majority problem, assumes access to $A$ via a matrix-block-encoding or via a sparse-oracle access; the runtime upper bounds assume that $A$ is the sum of PD local Hamiltonians, that $b$ is sparse and that $1/\gamma$ in Eq. (91) is small. The big-$\Omega$ notation is used for runtime lower bounds (in query complexity), the big-$O$ notation for runtime upper bounds (exhibiting an explicit solving algorithm).
analysis of how QLS may be employed to solve the finite-element formulation of a PDE and show that the linear dependence on the condition number is the main bottleneck to obtaining large quantum speed-ups. In fact, the discretization of PDEs for functions defined on $\mathbb{R}^D$ typically results in positive-definite linear systems where the condition number scales as $O(N^{2/D})$ [14].
We highlight that one might have reasonably conjectured that it is possible to have a quadratically better scaling in $\kappa$, the condition number of $A$, in the PD case. First, note that the runtime lower bound of Ref. [1] is proven using a special family of matrices that are indefinite (neither positive nor negative definite) by construction, hence it is not directly applicable to the PD case. A standard method allows one to transform an indefinite linear system into a PD one, but with a quadratically larger condition number; hence, the lower bound in Ref. [1] directly yields only a $\sqrt{\kappa}$ runtime lower bound for PD-QLS solvers. Second, the conjugate gradient (CG) descent method is the most efficient classical algorithm for solving PD linear systems and requires only $O(\sqrt{\kappa}\log(1/\varepsilon))$ iterations to converge to the correct solution, up to $\varepsilon$ approximation. Each iteration consists of updating some vectors having $N$ entries, where $N$ is the dimension of the linear system, thus resulting in a total runtime in $O(N\sqrt{\kappa}\log(1/\varepsilon))$ [18]. Then, it might seem plausible that a quantization of the CG method could yield a quantum algorithm having runtime in $O(\mathrm{polylog}(N)\sqrt{\kappa}\log(1/\varepsilon))$. This conjectured quadratic speed-up is however not always achievable since, as we prove in Section 3, PD-QLS solvers have runtime scaling linearly in $\kappa$ in worst-case problem instances. But this no-go result can be used as guidance to understand what is preventing us from achieving a better runtime scaling and, conversely, what additional conditions have to be imposed in order to achieve a quadratic speed-up in $\kappa$ for PD-QLS solvers.
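The $\sqrt{\kappa}$ iteration count of the CG method is easy to check numerically. The following sketch (a minimal NumPy implementation of classical CG, included only to illustrate the scaling claim; the matrix, random seed and tolerances are arbitrary illustrative choices) solves a PD system whose spectrum spans $[1/\kappa, 1]$ and verifies that roughly $\frac{1}{2}\sqrt{\kappa}\ln(2/\varepsilon)$ iterations suffice:

```python
import numpy as np

def conjugate_gradient(A, b, num_iters):
    """Minimal conjugate gradient method for a PD system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(num_iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
N, kappa, eps = 200, 100.0, 1e-6
# PD matrix with spectrum spread over [1/kappa, 1] in a random eigenbasis.
eigs = np.geomspace(1.0 / kappa, 1.0, N)
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
A = Q @ np.diag(eigs) @ Q.T
x_true = rng.standard_normal(N)
b = A @ x_true

# Classical CG error bound: ~ (1/2) * sqrt(kappa) * ln(2/eps) iterations.
num_iters = int(np.ceil(0.5 * np.sqrt(kappa) * np.log(2.0 / eps)))
x = conjugate_gradient(A, b, num_iters)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
```

With $\kappa = 100$ the bound gives a few tens of iterations, far fewer than the $N = 200$ iterations that exact-arithmetic CG would need at most.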

Previous related results
In the context of general QLS solvers, i.e. solvers applicable also to indefinite or non-Hermitian matrices, the best algorithms have a runtime scaling quasi-linearly in $\kappa$, i.e. scaling as $O(\kappa\,\mathrm{polylog}(\kappa))$, thus almost saturating the linear lower bound [1]. Note that the original HHL algorithm has a worse performance, with a runtime scaling as $O(\kappa^2/\varepsilon)$, where $\varepsilon$ is the target precision. The first algorithm to achieve a quasi-linear scaling in $\kappa$ was proposed by Ambainis in Ref. [4], which introduces a technique called Variable-Time Amplitude Amplification (VTAA) and employs it to optimize the HHL algorithm. Subsequently, Childs, Kothari and Somma [5] introduced polynomial approximations of $A^{-1}$ to exponentially improve the runtime dependence on the approximation error, obtaining a runtime in $O(\kappa^2\,\mathrm{polylog}(\kappa/\varepsilon))$; they show, furthermore, that the VTAA-based optimization can be used also for this algorithm, thus yielding an $O(\kappa\,\mathrm{polylog}(\kappa/\varepsilon))$ runtime. Later, Chakraborty et al. showed that also the pseudo-inversion problem, whereby the matrix $A$ may be non-invertible and even non-square, can be solved with a runtime in $O(\kappa\sqrt{\gamma}\,\mathrm{polylog}(\kappa/\varepsilon))$, where $\gamma$ parametrises the overlap of $b$ with the subspace where $A$ is non-singular [19]. Finally, the current state of the art for general QLS solving is given in Ref. [12], which does not rely on VTAA but instead is based on ideas stemming from adiabatic quantum computation [7], resulting in conceptually simpler algorithms and in a significant improvement of the polylogarithmic factors. Furthermore, several specialized quantum algorithms have been introduced with the aim of more efficiently solving QLS for particular sub-classes of matrices. A few works, e.g. [17,20,21], have investigated the use of preconditioning to speed up the solution.
The main idea, which is well-known in classical linear system solving, is to look for an invertible matrix $B$, a so-called preconditioner, such that the matrix $BA$ has a smaller condition number than $A$, and subsequently solve the equivalent linear system $BAx = Bb$. An algorithm based on a sparse preconditioning matrix was introduced in Ref. [17], but it comes with few formal guarantees of performance improvement. Another method, based on circulant preconditioners, was presented in Ref. [20], for which it is clearer how to assess when a runtime improvement can be achieved. Runtime improvements have been obtained in Ref. [21] by applying new preconditioning methods to Hamiltonians arising in many-body physics. An entirely different approach, based on hybrid classical-quantum algorithms, has been explored in Ref. [22], which yields runtime speed-ups for the case where the rows or columns of the coefficient matrix can be prepared with polylogarithmic-depth quantum circuits. We also mention the result of Ref. [23], showing that it is possible to solve QLS for the special class of tridiagonal Toeplitz matrices with a runtime that scales polylogarithmically in all parameters, condition number included; note, however, that matrices of this class can be fully specified with just two real parameters. Finally, the PD-QLS problem has been previously considered in Ref. [16], where it is suggested that positive-definite systems could be solved more efficiently than indefinite ones.
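As a small classical illustration of the preconditioning idea (here with a simple Jacobi, i.e. diagonal, preconditioner applied to a diagonally dominant matrix; the specific matrix is an arbitrary example, not one taken from the cited works):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
# Diagonally dominant matrix: large, widely varying diagonal entries
# plus small symmetric off-diagonal noise.
diag = np.geomspace(1.0, 1e4, N)
S = rng.uniform(-0.01, 0.01, (N, N))
A = np.diag(diag) + (S + S.T) / 2
np.fill_diagonal(A, diag)

# Jacobi preconditioner: B = diag(A)^(-1); solve B A x = B b instead of A x = b.
B = np.diag(1.0 / diag)
cond_A = np.linalg.cond(A)
cond_BA = np.linalg.cond(B @ A)
```

Here $BA$ is close to the identity, so its condition number is close to 1 while $\kappa(A)$ is of order $10^4$; the hard part in general, classical or quantum, is finding a preconditioner $B$ that can be applied cheaply.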

Overview of the method based on a matrix-block-encoding of A −1
Our first method for solving PD-QLS with improved runtime is based on implementing as a quantum circuit a unitary matrix $U_{A^{-1}}$ that encodes in a sub-block a matrix proportional to $A^{-1}$, i.e. a so-called matrix-block-encoding [24,25]. The solution state $|A^{-1}b\rangle$ can be subsequently obtained via matrix-vector multiplication, achieved by applying the circuit encoding $A^{-1}$ and projecting onto the correct sub-block. This method is analogous to the one introduced by Childs, Kothari and Somma in Ref. [5], where it is shown that exponentially precise polynomial approximations of the inverse function can be constructed, which then allow one to implement a matrix-block-encoding of $A^{-1}$ up to exponentially small error and finally solve the QLS problem via matrix-vector multiplication. Here, we show that if $A$ is a PD matrix, an equally good approximation of $A^{-1}$ can be obtained with polynomials having a quadratically smaller degree, leading to the possibility of a quadratic speed-up in PD-QLS solving.
More in detail, Ref. [5] considers a Hermitian matrix $A$ whose spectrum is by assumption contained in the domain $D_\Delta := [-1, -\Delta] \cup [+\Delta, +1]$, with $\Delta \le 1/\kappa$. Then, families of real polynomials $P(x)$ are constructed such that $|P(x) - 1/x| \le \varepsilon$ on the domain $D_\Delta$ and have a degree in $O(\kappa \log(1/\varepsilon))$, see panel (a) of Figure 1 for an illustrative example. Using either the Linear Combination of Unitaries (LCU) [26,27] or the Quantum Signal Processing (QSP) [28,24,25] method, it is possible to implement a matrix-block-encoding of $P(A)$, up to some rescaling factor $K > 0$ such that $P(A)/K$ can be encoded as a sub-block of a unitary matrix; we then have $P(A) \approx A^{-1}$, up to error $\varepsilon$, on the relevant spectral domain. The query complexities of LCU and QSP scale at least linearly in the degree of the polynomial, hence the use of polynomials of low degree is crucial to construct efficient QLS solvers. Moreover, the normalisation factor $K$ of the matrix-block-encoding of $P(A)$ scales linearly in $\kappa$ and enters multiplicatively in the cost of the matrix-vector multiplication step necessary to produce $|A^{-1}b\rangle$.
In case $A$ is a PD matrix, we can exploit the knowledge that its spectrum is contained in $D^+_\Delta := [\Delta, 1]$ to perform the following trick: we assume to have access to a normalised matrix-block-encoding of $B := I - \eta A$, for some constant $\eta \in (0, 2]$, so that the spectrum of $B$ is always contained in the interval $D_B = [-1, 1 - \eta\Delta]$. We then construct a polynomial $P(x)$ that approximates the function $f(x) := 1/(1-x)$ on the domain $D_B$ up to $\varepsilon$ distance. Using the QSP method to implement a matrix-block-encoding of the matrix $P(B)$ we then have $P(B) \approx (I - B)^{-1} = \frac{1}{\eta} A^{-1}$, which is the required matrix inverse, up to a $1/\eta$ rescaling factor. Importantly, our polynomial approximation of $1/(1-x)$ has a degree in $O\big(\sqrt{\kappa/\eta}\,\log(\kappa/(\eta\varepsilon))\big)$, a quadratically better dependence on $\kappa$ with respect to the approximation of $1/x$ given in Ref. [5], see panel (b) of Figure 1. Moreover, the normalisation factor $K$ scales again linearly in $\kappa$, which is the best dependence achievable. From a mathematical standpoint, the possibility of a quadratic reduction of the degree of the approximating polynomial can be interpreted as a consequence of Bernstein's inequality [29]. This inequality states that in the class of real polynomials $P(x)$ of degree $\ell$ such that $|P(x)| \le 1$ for all $x \in [-1, +1]$, the derivative $P'(x)$ satisfies $|P'(x)| \le \ell\,(1 - x^2)^{-1/2}$ for all $-1 < x < 1$, while we have $|P'(x)| \le \ell^2$ for $x = 1$ and $x = -1$ (and these last bounds are saturated by Chebyshev polynomials). Note that polynomial approximations of $1/x$ and of $1/(1-x)$ have large derivatives close to the singularities, respectively at $x = 0$ and at $x = 1$, and because of Bernstein's inequality the latter case allows good polynomial approximations having a quadratically lower degree.
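The quadratically smaller degree can be observed numerically. The sketch below uses Chebyshev interpolation as a convenient stand-in for the polynomials $P(x)$ constructed in the text (the degrees and tolerances are illustrative choices): it approximates $f(x) = 1/(1-x)$ on $D_B = [-1, 1-\eta\Delta]$ and checks that a degree of order $\sqrt{\kappa}\log(1/\varepsilon)$ already achieves a small uniform error, while a much lower degree does not.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

kappa, eta = 100.0, 1.0
delta = eta / kappa              # distance of the domain from the pole at x = 1
domain = [-1.0, 1.0 - delta]     # the spectrum of B = I - eta*A lies in here
f = lambda x: 1.0 / (1.0 - x)

def max_error(deg):
    """Uniform error of Chebyshev interpolation of f on the domain."""
    p = Chebyshev.interpolate(f, deg, domain=domain)
    xs = np.linspace(domain[0], domain[1], 20001)
    return np.max(np.abs(p(xs) - f(xs)))

err_low = max_error(30)      # degree well below sqrt(kappa)*log(...): poor
err_high = max_error(150)    # degree ~ sqrt(kappa)*log(kappa/eps): good
```

The pole of $f$ sits at distance $\eta\Delta$ outside the interpolation interval, which is exactly what makes a degree of order $\sqrt{\kappa}$ sufficient; a pole inside the interval (as for $1/x$ on $D_\Delta$) admits no such approximation.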
We next need to perform matrix-vector multiplication to obtain the state $|A^{-1}b\rangle$ and thus solve the QLS; this is obtained by applying the quantum circuit that encodes the matrix $P(B) \propto A^{-1}$ to a quantum state of the form $|0\rangle|b\rangle$ and then post-selecting the ancilla register to be in $|0\rangle$ or, more efficiently, using amplitude amplification [30]. The amplitude amplification step implies a multiplicative overhead of order $\kappa$ in the worst case (the success amplitude can be as small as $1/\kappa$), yielding a total runtime in $O(\kappa^{3/2}\log(\kappa/\varepsilon))$ for this PD-QLS solver. However, the efficiency of the matrix-vector multiplication depends on the post-selection success probability and thus on the specific choice of the vector $b$. In a best-case scenario the post-selection success probability is constant and thus the overall runtime of the PD-QLS solver is $O(\sqrt{\kappa}\log(\kappa/\varepsilon))$, while in the same high-success-probability scenario the QLS solver of Ref. [5] has a runtime in $O(\kappa\log(\kappa/\varepsilon))$, since this is the cost of implementing a polynomial approximation of $A^{-1}$ for indefinite matrices.
We remark that there is a technical assumption that has to be satisfied to allow the realisation of our method, namely, that it is possible to implement a normalised matrix-block-encoding of $B = I - \eta A$. This is a non-trivial task: standard methods could allow us to prepare, for instance, a normalised matrix-block-encoding of $B/d$ where $d \ge 2$ is the sparsity of $B$, i.e. the maximum number of non-zero entries in each column of $B$ [5]. Note, however, that we would then need to implement an approximation of the function $g(x) := 1/(1 - dx)$ to obtain $g(B/d) = \frac{1}{\eta} A^{-1}$; the function $g(x)$ has a singularity at $x = 1/d$, which is in the interior of the domain $[-1, +1]$, and thus Bernstein's inequality precludes us from achieving good approximations with low-degree polynomials for this function.
We will show that it is possible to implement normalised matrix-block-encodings of $B$ for two special classes of QLS problems: the first is when $A$ is a diagonally dominant matrix; the second, when $A$ is given as the sum of local PD Hamiltonian terms, where by "local" we mean that each term acts non-trivially only on a small number of qubits. In these two cases it is therefore possible to implement our improved PD-QLS solver. We leave as an open research question whether normalised block-encodings can be prepared in broader generality.

Overview of the method based on the A = LL † decomposition
Our second method for solving PD-QLS with improved runtime is based on finding a decomposition $A = LL^\dagger$, akin to the classical Cholesky decomposition [31] which exists for all PD matrices, and then using $L^{-1}$ as an efficient and effective preconditioner. Note, in fact, that $L^\dagger x = b'$ with $b' := L^{-1}b$ is a linear system equivalent to the original one, but the decomposition $A = LL^\dagger$ immediately implies $\kappa(L) = \kappa(L^\dagger) = \sqrt{\kappa(A)}$ and thus the new system provably has a quadratically smaller condition number. This decomposition is similar to a spectral gap amplification of $A$ [32].
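The relation $\kappa(L) = \sqrt{\kappa(A)}$ holds because the singular values of $L$ are the square roots of the eigenvalues of $A = LL^\dagger$. A quick numerical check, using the standard Cholesky factor of a random PD matrix as the decomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
M = rng.standard_normal((N, N))
A = M @ M.T + 0.1 * np.eye(N)    # a random PD matrix
L = np.linalg.cholesky(A)        # A = L L^T (Cholesky decomposition)

cond_A = np.linalg.cond(A)       # ratio of extreme singular values
cond_L = np.linalg.cond(L)       # equals sqrt(cond_A) up to rounding
```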
The method thus involves two main steps: in the first one we use classical computation to efficiently obtain a description of a matrix $L$ such that $LL^\dagger = A$ and such that it is possible to efficiently find, using only classical computation, a description of the vector $b' := L^{-1}b$; in the second one, we use an efficient quantum algorithm, having runtime quasi-linear in $\kappa$, to solve the preconditioned system $L^\dagger x = b'$. Note that the classical descriptions of $L^\dagger$ and of $b'$ should also allow the efficient compilation of the quantum algorithm used to solve the preconditioned QLS.
The picture is not yet complete, since we actually use a matrix $L$ that is non-square and thus singular. As a result, the inversion operation has to be substituted by a pseudo-inversion and the condition number by the effective condition number, the ratio between the largest and the smallest non-zero singular value; when $A$ is invertible the effective condition number of $L$ is quadratically smaller than the condition number of $A$. We also use two different pseudo-inverses in the classical and in the quantum part of the computation: in the quantum step the Moore-Penrose pseudo-inverse $(L^\dagger)^+$ is employed and in the classical preconditioning a generalised pseudo-inverse $L^g$ chosen such that $(L^\dagger)^+ L^g = A^{-1}$; thus, they yield the desired solution $|x\rangle = |A^{-1}b\rangle$ when applied in sequence. These extensions to non-square matrices and to different pseudo-inverses are made to give leeway in the design of the classical part of the algorithm, allowing us to meet the efficiency requirements mentioned above. Finally, we will employ the QLS solver of Ref. [19], which can tackle pseudo-inversion problems and has a runtime quasi-linear in the effective condition number.
We show that a fully suitable decomposition of the form $A = LL^\dagger$ can be constructed for the Sum-QLS problem, i.e., when $A$ is a sum of local PD Hamiltonian terms. In this case, the matrix $L$ is formed by several blocks, each constructed from a single Hamiltonian term, while the pseudo-inverse $L^g$ is obtained by inverting the individual blocks, operations involving only small matrices and thus classically feasible. We also require that the vector $b$ is sparse, containing only polynomially many non-zero entries, thus allowing one to efficiently compute the description of $b' = L^g b$.
As a final technical caveat, we note that because of the mismatch between the pseudo-inverses used in the classical and in the quantum parts of the algorithm ($L^g$ and $(L^\dagger)^+$), the vector $b'$ is not entirely contained in the support of $(L^\dagger)^+$ and thus amplitude amplification of the component in the correct subspace is required. This incurs a multiplicative overhead of a factor $1/\sqrt{\gamma}$, where $\sqrt{\gamma} > 0$ is a known lower bound on the amplitude of the "good" component of $b'$. This method thus has a provable runtime improvement over competing QLS solvers whenever $1/\sqrt{\gamma}$ is sufficiently small.
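For intuition, this construction can be mimicked on small dense matrices. The sketch below is an illustrative toy version (not the exact construction of Section 5): it takes $A = H^{(1)} + H^{(2)}$ with PD terms, builds $L$ by stacking the matrix square roots of the terms so that $LL^\dagger = A$, and picks a block-wise generalised pseudo-inverse $L^g$ with $LL^g = I$, which implies $(L^\dagger)^+ L^g = (LL^\dagger)^{-1} L L^g = A^{-1}$.

```python
import numpy as np

def sqrtm_pd(H):
    """Matrix square root of a PD symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.sqrt(w)) @ V.T

rng = np.random.default_rng(3)
N, J = 8, 2
# Two random PD "Hamiltonian" terms summing to A.
terms = []
for _ in range(J):
    M = rng.standard_normal((N, N))
    terms.append(M @ M.T + np.eye(N))
A = sum(terms)

# L = [sqrt(H1) | sqrt(H2)] is N x (J*N) and satisfies L L^T = A.
blocks = [sqrtm_pd(H) for H in terms]
L = np.hstack(blocks)

# Block-wise generalised pseudo-inverse: L @ Lg = I.
Lg = np.vstack([np.linalg.inv(B) / J for B in blocks])

# Quantum step uses (L^T)^+, classical step uses Lg; composed they give A^{-1}.
lhs = np.linalg.pinv(L.T) @ Lg
```

The same computation also confirms that the effective condition number of $L$ is the square root of $\kappa(A)$, since the non-zero singular values of $L$ are the square roots of the eigenvalues of $A$.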

Notation and definitions
We assume knowledge of the main quantum computation concepts, as given for instance in Ref. [33]. A quantum computation is described using a Hilbert space of dimension $2^n$ for some $n$, corresponding to a system of $n$ qubits, with a distinguished computational basis.

Linear algebra and asymptotic notation
We consistently denote with $N \in \mathbb{N}$ the dimension of the QLS we aim to solve and we define $n := \lceil \log_2 N \rceil$, so that a vector in $\mathbb{C}^N$ (possibly padded with zeroes at the end if $N$ is not a power of two) can be described as a pure state of $n$ qubits. For any complex matrix $A \in \mathbb{C}^{N \times M}$ having $N$ rows and $M$ columns we write its Singular Value Decomposition (SVD) as $A = V \Sigma W^\dagger$, where $V$ and $W$ are unitary matrices of size $N \times N$ and $M \times M$ respectively and $\Sigma$ is a real non-negative matrix of size $N \times M$ which is uniquely determined up to reordering of the diagonal entries and contains the singular values of $A$ on the main diagonal. A Hermitian matrix $A$ is positive definite if $\langle v| A |v\rangle > 0$ for all $|v\rangle$ and is positive semi-definite if $\langle v| A |v\rangle \ge 0$. The eigenvalues of $A$ are real, positive (in the definite case) or non-negative (in the semi-definite case) and coincide with its singular values. For a general $A$, $\varsigma_{\min}$, $\varsigma_{\max}$ and $\lambda_{\min}$, $\lambda_{\max}$ denote the minimum and maximum singular values and eigenvalues, respectively. The Moore-Penrose pseudo-inverse of $A$ is obtained by applying to the singular values $\varsigma_i$ of $A$ the function $f: \mathbb{R} \to \mathbb{R}$ defined as $f(x) = 1/x$ for $x \ne 0$ and $f(0) = 0$. More precisely, for a diagonal matrix $\Sigma$ we define $\Sigma^+ = f(\Sigma)$, while for a general matrix $A = V \Sigma W^\dagger$ the pseudo-inverse is given as $A^+ = W \Sigma^+ V^\dagger$. In this work we employ the $\ell_2$-norm for vectors, $\|v\|_2 := \sqrt{\sum_i |v_i|^2}$. For a singular matrix $A$ we define the effective condition number to be $\kappa_{\mathrm{eff}}(A) := \|A\|\,\|A^+\|$, which is equal to the ratio between the largest and the smallest non-zero singular value of $A$. A Hermitian matrix $A$ is diagonally dominant if $\sum_{j \ne i} |A_{ij}| \le |A_{ii}|$ for all $i$; note that $|A_{ii}| = A_{ii} > 0$ when $A$ is positive definite. We use the standard big-O and small-o notations for asymptotic scaling, together with the following definitions: $f(x) \in \Omega(g(x))$ if and only if $g(x) \in O(f(x))$, which is used to give lower bounds, and $\Theta(g(x)) = O(g(x)) \cap \Omega(g(x))$.
We also use the soft-O and soft-Ω notations, where $\widetilde{O}(g(x)) := O\big(g(x)\,\mathrm{polylog}(g(x))\big)$ and similarly for $\widetilde{\Omega}(g(x))$, which are used to give more coarse-grained bounds.
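The SVD-based definition of the Moore-Penrose pseudo-inverse given above can be checked directly against NumPy's built-in pinv (a sketch; np.linalg.svd returns the factors of $A = V\Sigma W^\dagger$ under the names U, s, Vh, and the rank cutoff is an arbitrary tolerance):

```python
import numpy as np

rng = np.random.default_rng(4)
# A rank-deficient 5 x 7 matrix (rank 3).
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 7))

# SVD: A = V Sigma W^dagger (NumPy names: V -> first factor, W^dagger -> Wh).
V, s, Wh = np.linalg.svd(A, full_matrices=False)

# Apply f(x) = 1/x for x != 0 and f(0) = 0 to the singular values.
tol = 1e-10 * s.max()
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
A_plus = Wh.conj().T @ np.diag(s_plus) @ V.conj().T
```

The same run also checks the identity $\kappa_{\mathrm{eff}}(A) = \|A\|\,\|A^+\|$ against the ratio of the largest to the smallest non-zero singular value.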

Definition of the Quantum Linear System problem
In this section we introduce the main definitions that are relevant in the context of the QLS problem, which is a quantum analogue of the classical linear algebra problem of solving the system of equations $Ax = b$, having solution $x = A^{-1}b$ when $A$ is invertible.
We use pure quantum states to encode $N$-dimensional complex vectors. A vector $v$ enclosed in a bra or in a ket is always assumed to be normalised; that is, $|v\rangle := v/\|v\|$. In particular we have $|b\rangle = b/\|b\|$ and $|x\rangle = A^{-1}b/\|A^{-1}b\|$. We now give a general definition of the QLS problem. The formulation is similar to the one provided in Ref. [7] and intentionally does not specify the access models employed for the coefficient matrix $A$ and the known-term vector $b$, for the sake of generality.

Definition 1 (Quantum Linear System). Suppose we have access to a vector $b \in \mathbb{C}^N \setminus \{0\}$ and to a non-singular matrix $A \in \mathbb{C}^{N \times N}$ (access is given via quantum oracles, or some kind of implicit or explicit description). We are given two real positive parameters $\varsigma_*$ and $\varsigma^*$ such that $\varsigma_* \le \varsigma_{\min}(A)$ and $\varsigma_{\max}(A) \le \varsigma^*$, i.e. the singular values of $A$ are contained in the interval $D_A = [\varsigma_*, \varsigma^*]$; equivalently, we are given two parameters $\kappa > 1$ and $\alpha > 0$ that provide the upper bounds $\kappa(A) \le \kappa$ and $\|A\| \le \alpha$. We are also given a target precision $\varepsilon > 0$.
The QLS problem then consists in preparing a density matrix $\rho_x$ which is $\varepsilon$-close in trace distance to the solution vector $|x\rangle = |A^{-1}b\rangle$; that is, we require that $\big\|\rho_x - |x\rangle\langle x|\big\|_{\mathrm{Tr}} \le \varepsilon$, where the trace norm is defined as $\|X\|_{\mathrm{Tr}} := \frac{1}{2}\operatorname{Tr}\sqrt{X^\dagger X}$.

Figure 2: Different access models for QLS algorithms. Panel (a) illustrates the case where a classical description of $A$ and $b$ (and the target precision $\varepsilon$) is provided and then used to compile a quantum algorithm $\mathcal{A}$, solving the QLS for the given $A$ and $b$. The description of $A$ and $b$ does not need to be fully explicit: it is sufficient that it allows one to efficiently compute the sequence of elementary quantum gates of $\mathcal{A}$. Panel (b) illustrates the case of a relativising quantum algorithm; in this case, the sequence of elementary quantum gates only depends on a few parameters (e.g., $\alpha$, $\kappa$, $\varepsilon$ as in Definition 1) while two fixed sub-routines specify $A$ and $b$ and these sub-routines can be treated as black-boxes.

Definition 2 (Positive-definite Quantum Linear System). A PD-QLS problem is a QLS problem, as given in Definition 1, where the coefficient matrix A is Hermitian and positive definite.
We note that a more commonly employed definition requires that the QLS solver outputs a state $|\tilde x\rangle$ such that $\||\tilde x\rangle - |x\rangle\| \le \varepsilon$. We prefer to use the trace distance since it is operationally motivated, as it quantifies the optimal probability of distinguishing two quantum states when optimizing over all possible measurements, see [33, Section 9.2.1]. The trace distance then also bounds the maximum relative error that can be introduced when estimating the expectation value $\operatorname{Tr}(|x\rangle\langle x| M)$ for a given observable $M$, and computing expectation values was the end goal of Ref. [1]. Finally, for pure normalised states $|\psi\rangle$ and $|\phi\rangle$ the trace distance simplifies to $d_{\mathrm{Tr}}(\psi, \phi) := \big\||\psi\rangle\langle\psi| - |\phi\rangle\langle\phi|\big\|_{\mathrm{Tr}} = \sqrt{1 - |\langle\psi|\phi\rangle|^2}$, and using $|\langle\psi|\phi\rangle| \ge 1 - \frac{1}{2}\big\||\psi\rangle - |\phi\rangle\big\|^2$ we obtain the inequality $d_{\mathrm{Tr}}(\tilde x, x) \le \big\||\tilde x\rangle - |x\rangle\big\|$; thus the trace distance between $|\tilde x\rangle$ and $|x\rangle$ is at most as large as their $\ell_2$ distance. It is customary to assume that $A$ has been rescaled by a factor $\alpha \ge \|A\|$, so that we take, without loss of generality, $\|A\| \le 1$ and the only parameter that needs to be specified is an upper bound to the condition number. This will also be our convention, unless otherwise specified.
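For pure states, the identities above are straightforward to verify numerically (a small sketch; the trace norm here includes the factor $1/2$ as in the definition of this section):

```python
import numpy as np

rng = np.random.default_rng(5)

def ket(v):
    return v / np.linalg.norm(v)

def trace_norm(X):
    # ||X||_Tr = (1/2) Tr sqrt(X^dagger X) = half the sum of singular values.
    return 0.5 * np.sum(np.linalg.svd(X, compute_uv=False))

psi = ket(rng.standard_normal(8) + 1j * rng.standard_normal(8))
phi = ket(rng.standard_normal(8) + 1j * rng.standard_normal(8))

d_tr = trace_norm(np.outer(psi, psi.conj()) - np.outer(phi, phi.conj()))
overlap = abs(np.vdot(psi, phi))
```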

Oracles for quantum linear system solving
We now define and discuss a few different access models for $A$ and $b$, since the results we present for the PD-QLS solvers crucially depend on which access model is assumed. A fundamental distinction is between oracular and non-oracular (a.k.a. relativising and non-relativising) algorithms, see Figure 2. For instance, in oracular settings it is often possible to establish unconditional query complexity lower bounds, while in non-oracular settings non-trivial runtime lower bounds typically can be proven only under some (reasonable) complexity theory assumptions, such as $\mathrm{P} \neq \mathrm{NP}$. Note that most of the literature on QLS assumes oracular access to $A$ and $b$ [1,4,5,6,7,12,19].
We start by defining the access model for $b$ that we assume throughout this paper, except where differently specified.

Definition 3 (State preparation oracle, as in [1]). Given a vector $b \in \mathbb{C}^N$ we say that we have quantum access to a state preparation oracle for $b$ if there is a unitary operator $U_b$ such that $U_b |0\rangle^{\otimes n} = |\bar b\rangle$, where $\bar b$ is obtained by padding $b$ with zeroes until its size is a power of 2.
As already noted by HHL, a general setting where it is possible to efficiently implement $U_b$ is the one presented by Grover and Rudolph [34]; another possibility is that $U_b$ is directly encoded in a qRAM [35], as may be required in quantum machine learning contexts.
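As a toy classical analogue of Definition 3 (a plain linear-algebra construction via QR factorisation, not the Grover-Rudolph method and not an efficient quantum circuit), one can build a unitary whose action on $|0\rangle^{\otimes n}$ is the padded, normalised vector:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 6                                    # vector size, not a power of two
b = rng.standard_normal(N) + 1j * rng.standard_normal(N)

n = int(np.ceil(np.log2(N)))             # number of qubits
b_pad = np.zeros(2**n, dtype=complex)
b_pad[:N] = b
b_ket = b_pad / np.linalg.norm(b_pad)    # |b>: padded and normalised

# Build a unitary U_b with U_b |0...0> = |b>: put b_ket as the first column
# of a matrix and orthonormalise the remaining columns via QR.
M = np.eye(2**n, dtype=complex)
M[:, 0] = b_ket
Q, R = np.linalg.qr(M)
U_b = Q.copy()
U_b[:, 0] *= R[0, 0] / abs(R[0, 0])      # undo the phase chosen by QR

e0 = np.zeros(2**n)
e0[0] = 1.0
```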
Next, we define two models for access to $A$, which we denote as $P_A$ and $U_A$ and which correspond to a sparse-matrix-access and a matrix-block-encoding, respectively.

Definition 4 (Sparse-matrix-access, as in [5]). Given a Hermitian matrix $A$ which is $d$-sparse (i.e., has at most $d$ non-zero entries in each row and column) a quantum sparse-matrix-access $P_A$ is given by a pair of oracular functions $P_A^{\mathrm{pos}}$ and $P_A^{\mathrm{val}}$, which specify the positions of the (potentially) non-zero entries of $A$ and the values of those entries, respectively, i.e.

$P_A^{\mathrm{pos}} |i, k\rangle = |i, \nu(i, k)\rangle, \qquad P_A^{\mathrm{val}} |i, j, z\rangle = |i, j, A_{i,j} \oplus z\rangle,$

where $\nu(i, k)$ denotes the column index of the $k$-th (potentially) non-zero entry in the $i$-th row and $A_{i,j} \oplus z \in \{0,1\}^*$ denotes a bit string of arbitrary length that encodes the value $A_{i,j} \in \mathbb{C}$.
In order to keep the presentation simple, we assume here and throughout the paper that numeric representations of complex numbers can be specified exactly or with a sufficiently high number of digits of precision.
Definition 5 (Matrix block encoding, as in [28,25]). A unitary operator $U_A$ acting on $n + a$ qubits is called an $(\alpha, a, \varepsilon)$-matrix-block-encoding of an $n$-qubit operator $A$ if

$\big\| A - \alpha\,(\langle 0|^{\otimes a} \otimes I)\, U_A\, (|0\rangle^{\otimes a} \otimes I) \big\| \le \varepsilon,$

which, in the exact case $\varepsilon = 0$, can be expressed also as:

$U_A = \begin{pmatrix} A/\alpha & * \\ * & * \end{pmatrix},$

where the asterisks ($*$) denote arbitrary matrix blocks of appropriate dimensions. We call $\alpha$ the normalization factor of the matrix-block-encoding and we say, in the special case where $\alpha = 1$, that $U_A$ is a normalised matrix-block-encoding of $A$.
A technique introduced by Childs allows one to implement a $(d, 1, 0)$-matrix-block-encoding of $A$, where the normalisation constant $d$ is equal to the sparsity of $A$, using only a constant number of accesses to $P_A$ and $O(\mathrm{poly}(n))$ extra elementary gates, see Ref. [5] and references therein. In short, we have the reduction $P_A \rightarrow U_A$, where the arrow means that having access to an oracle of the first type allows one to efficiently implement an oracle of the second type. We also note that other access models for $A$ have been considered in the literature; for example, in Ref. [36] it is assumed that it is possible to efficiently prepare quantum states that are proportional to each one of the columns of $A$.
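Definition 5 can be illustrated numerically with the standard unitary-dilation construction for a contraction (a generic linear-algebra fact, distinct from the Childs technique mentioned above): any matrix $A$ with $\|A\| \le \alpha$ fits in the top-left block of a unitary acting on one extra qubit.

```python
import numpy as np

def sqrtm_psd(H):
    """Square root of a positive semi-definite Hermitian matrix."""
    w, V = np.linalg.eigh(H)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

rng = np.random.default_rng(7)
N = 8
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
alpha = 1.1 * np.linalg.norm(A, 2)       # normalisation factor, ||A/alpha|| < 1
C = A / alpha                            # a contraction

# Unitary dilation: U = [[C, sqrt(I - C C^dag)], [sqrt(I - C^dag C), -C^dag]]
I = np.eye(N)
U = np.block([
    [C,                              sqrtm_psd(I - C @ C.conj().T)],
    [sqrtm_psd(I - C.conj().T @ C), -C.conj().T],
])
```

This is an $(\alpha, 1, 0)$-matrix-block-encoding in the sense of Definition 5, although as a dense $2N \times 2N$ matrix rather than an efficient circuit.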

Quantum linear systems in non-oracular settings
We consider in Section 5 (and also briefly in Section 4.3) a case that we call the Sum-of-Hamiltonians QLS (Sum-QLS) problem, which is not formulated in an oracular setting but is based, instead, on a classical description of A and b. In order to obtain efficient Sum-QLS solving algorithms, it is thus necessary that the descriptions of A and b are compact, depending at most on $O(\mathrm{poly}(n))$ real or complex parameters.
For the known-term vector b, we will simply assume that it is a sparse vector in the computational basis, with at most $O(\mathrm{poly}(n))$ non-zero entries, whose positions are also known. Hence, a fully explicit classical description of b can be provided and this also enables efficiently implementing a state preparation circuit $U_b$.
For the coefficient matrix A we give a more implicit description: the entries of A are not specified one-by-one (which would be inefficient, as the matrix size $N \in \Theta(2^n)$ is assumed to be very large) but rather can be computed from a relatively small set of parameters, scaling polynomially in the number of qubits. Specifically, we assume that A is given as the sum of polynomially many local Hamiltonian terms:
$$A = \sum_{j=1}^{J} H^{(j)},$$
where the number of terms is $J \in O(\mathrm{poly}(n))$ and each Hamiltonian $H^{(j)}$ acts on a small number of qubits, namely, on at most $O(\log(n))$ qubits. This case has been previously considered in Ref. [16].
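For concreteness, such a coefficient matrix can be assembled numerically; the following sketch is our own illustration (the helper `embed` and the chosen subsets $S_j$ are assumptions for the example, not part of the algorithm).

```python
import numpy as np

def embed(h, Sj, n):
    """Embed an operator h acting on qubit subset Sj into the full n-qubit space."""
    s = len(Sj)
    full = np.kron(h, np.eye(2 ** (n - s)))      # h (x) I, with Sj qubits listed first
    perm = list(Sj) + [q for q in range(n) if q not in Sj]
    axes = list(np.argsort(perm))                 # inverse permutation of the qubits
    T = full.reshape((2,) * (2 * n))
    T = np.transpose(T, axes + [n + a for a in axes])
    return T.reshape(2 ** n, 2 ** n)

n = 3
rng = np.random.default_rng(0)
terms = []
for Sj in [(0, 1), (1, 2), (0, 2)]:               # illustrative local supports
    X = rng.standard_normal((4, 4))
    terms.append(embed(X @ X.T, Sj, n))           # PSD local terms h^(j) = X X^T
A = sum(terms)                                    # A = sum_j H^(j), an 8 x 8 PD matrix
assert np.allclose(A, A.T)
assert np.min(np.linalg.eigvalsh(A)) >= 0
```

The full matrix is never materialised in the quantum algorithm; here it is built explicitly only to make the sum-of-local-terms structure tangible.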

Query complexity lower bound
In this section we prove an Ω(κ) lower bound on the runtime of QLS solvers which is alternative to the ones given by HHL in Ref. [1]. The main innovation we introduce is that our lower bound applies also to the PD-QLS case, while the proofs given by HHL only yield an Ω(√κ) lower bound when specialised to PD matrices. More precisely, we have the following result. The proof of these lower bounds is rather technical and can be found in Appendix A. We now present a weaker result that, however, can be easily proven as a consequence of the optimality of Grover search [37]; namely, we show that PD-QLS solving has a runtime scaling linearly in κ for all κ ≤ √N.

Proposition 7. Consider a PD-QLS problem as presented in Definition 2 and suppose that access to A is given by a sparse-matrix oracle $P_A$ (Definition 4), with no assumption on the access model for b. Then, a quantum algorithm that solves the QLS up to any constant precision ε < 1 must use Ω(κ) queries to $P_A$, for any κ ≤ √N.
Proof. Consider the search problem of finding an element j ∈ S, where S ⊆ {1, . . . , N} is a subset containing M elements. The membership of an element j in S is encoded in a quantum oracle $P_S$ which flips an ancilla qubit if j ∈ S and leaves the ancilla unchanged if j ∉ S. Consider, next, a matrix A that is diagonal (and thus 1-sparse) having entries
$$A_{jj} = \begin{cases} \alpha & \text{if } j \notin S \\ \beta & \text{if } j \in S \end{cases} \qquad \text{with} \quad \alpha = 1, \quad \beta = \sqrt{\tfrac{M}{N-M}}. \qquad (13)$$
The sparse-matrix oracle $P_A = (P_A^{\mathrm{pos}}, P_A^{\mathrm{val}})$ for this diagonal matrix A can be implemented with exactly one access to the membership oracle $P_S$. In fact, $P_A^{\mathrm{pos}}$ can be implemented without any access to $P_S$, since it is known that the non-zero entries are on the diagonal, while $P_A^{\mathrm{val}}$ can be implemented with a single access to $P_S$, assuming that M and N are known: by definition we have $P_S |j, 0\rangle = |j, 0\rangle$ if j ∉ S and $P_S |j, 0\rangle = |j, 1\rangle$ if j ∈ S, thus it is sufficient to apply on the ancilla system the transformation $|0\rangle \mapsto |\alpha\rangle$ and $|1\rangle \mapsto |\beta\rangle$, where the quantum register contains a binary representation of the numbers α and β, to implement $P_A^{\mathrm{val}}$. Now notice that A is a matrix having condition number
$$\kappa = \frac{\alpha}{\beta} = \sqrt{\frac{N-M}{M}}.$$
Solving the QLS for the known-term vector $|b\rangle = |1_N\rangle := \frac{1}{\sqrt{N}} \sum_{j=1}^{N} |j\rangle$ yields the solution state
$$|x\rangle \;\propto\; A^{-1} |1_N\rangle \;\propto\; \frac{\sqrt{N-M}}{\alpha}\, |S^\perp\rangle + \frac{\sqrt{M}}{\beta}\, |S\rangle \;=\; \frac{1}{\sqrt{2}}\big( |S^\perp\rangle + |S\rangle \big),$$
where in the last line we have introduced the normalised vectors $|S\rangle := \frac{1}{\sqrt{M}} \sum_{j \in S} |j\rangle$ and $|S^\perp\rangle := \frac{1}{\sqrt{N-M}} \sum_{j \notin S} |j\rangle$. Measuring |x⟩ in the computational basis therefore solves the search problem with probability 1/2.
Suppose now that there exists a quantum algorithm A that solves the QLS problem exactly (ε = 0) for PD matrices making o(κ) queries to the oracle $P_A$. Applying A to the diagonal matrix A and $|b\rangle = |1_N\rangle$ would then require only $o(\sqrt{N/M})$ calls to $P_A$, and hence to $P_S$, in order to produce |x⟩. This means that using A we can solve an unstructured search problem using $o(\sqrt{N/M})$ queries to $P_S$, violating the optimality of Grover search.
Next, suppose that the algorithm A is an approximate PD-QLS solver, i.e. that it produces a state $\rho_x$ such that $\|\rho_x - |x\rangle\langle x|\|_{\mathrm{Tr}} \leq \varepsilon$ for some constant ε < 1/2. Apply a projective measurement to $\rho_x$ that projects it onto the space spanned by $\{|j\rangle\}_{j \in S}$ with probability p and onto the orthogonal subspace with probability q = 1 − p; in the ideal case (ε = 0) we would have p = q = 1/2. By the operational definition of the trace distance, the probability distribution (p, q) can be at most ε-distinguishable from (1/2, 1/2), i.e. max{|p − 1/2|, |q − 1/2|} ≤ ε. Thus, the success probability is a constant p ≥ 1/2 − ε > 0.
Finally, this argument can be extended to any constant precision ε < 1: it is sufficient to define a new diagonal matrix A by changing the values α and β in Eq. (13), so that the probability p of finding an element j ∈ S when measuring |x⟩ in the computational basis becomes sufficiently close to 1. Notice that in the proof the vector $|b\rangle = |1_N\rangle$ is fixed and easy to produce, hence the access model for |b⟩ plays no role in our reduction. We also remark that this proof can be straightforwardly modified to prove that the operation of quantum matrix-vector multiplication (i.e., obtaining a state proportional to A|b⟩) must also have a cost linear in κ. Moreover, since a sparse oracle access $P_A$ allows one to efficiently implement a matrix-block-encoding $U_A$ [5], the same reduction immediately rules out oracular algorithms that use o(κ) accesses to $U_A$ (for κ ≤ √N). Finally, a simple argument shows that an Ω(κ) lower bound holds also for the $U_b$-query complexity: an initial small difference between two input states b and b′ can be magnified κ-fold in the corresponding outputs $A^{-1}b$ and $A^{-1}b'$, which is impossible unless one uses at least Ω(κ) accesses to $U_b$ [38].
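The counting in the reduction above can be checked numerically. The sketch below is our own illustration, using one consistent choice of the diagonal entries (α = 1, β = √(M/(N−M))) which yields success probability 1/2 for the search problem.

```python
import numpy as np

N, M = 1024, 16
rng = np.random.default_rng(6)
S = rng.choice(N, size=M, replace=False)     # marked set of the search problem
alpha, beta = 1.0, np.sqrt(M / (N - M))      # diagonal entries (beta on S)
d = np.full(N, alpha)
d[S] = beta
kappa = alpha / beta                          # condition number = sqrt((N-M)/M)
b = np.ones(N) / np.sqrt(N)                   # |b> = |1_N>, uniform superposition
x = b / d                                     # A^{-1} b for a diagonal matrix
x /= np.linalg.norm(x)                        # normalised QLS solution |x>
p_S = np.sum(x[S] ** 2)                       # probability of measuring some j in S
assert np.isclose(p_S, 0.5)
assert np.isclose(kappa, np.sqrt((N - M) / M))
```

Thus a single measurement of |x⟩ finds a marked element with constant probability, which is exactly what makes the QLS solver a search subroutine.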
In this section we first review the QSP method, which allows implementing polynomials of a matrix given as a matrix-block-encoding (Section 4.1), and then we provide the explicit definition of the approximating polynomials of the inverse function (Section 4.2). Next (Section 4.3) we discuss two cases where we can implement a matrix-block-encoding of B = I − η A, as required to achieve a quadratic reduction in the degree of the approximating polynomials. Finally (Section 4.4) we discuss the cost of matrix-vector multiplication and summarise the cost of PD-QLS solving via this approach.

The quantum signal processing method
We employ QSP as a tool to implement $U_{A^{-1}}$, a matrix-block-encoding of $A^{-1}$, assuming that we have access to a normalised matrix-block-encoding $U_B$ of
$$B = I - \eta A \qquad (18)$$
for some η > 0. We assume that the spectrum of B is contained in the interval $D_B = [-1,\, 1 - \eta/\kappa]$, where κ is an upper bound on the condition number of A. The QSP method can be stated, already specialised to our case of interest, as follows [25, Theorem 56].
Importantly, there are explicit classical algorithms that can efficiently compute a parametrisation of $U_{P(B/\beta)}$ for any polynomial P and then compile an explicit quantum circuit that implements it; see Refs. [39,40] for the current state of the art.
We now further motivate the need to choose β = 1, i.e., that the matrix-block-encoding of B is normalised; equivalently, the part proportional to the identity in the definition (18) must not be rescaled. We recall that, as presented in Section 1.3, our goal is to implement a polynomial of B/β approximating $A^{-1}$, that is
$$P(B/\beta) \;\approx\; \eta\, (I - B)^{-1} \;=\; A^{-1}.$$
Note, however, that in this expression we actually have $P(x) \approx \frac{\eta}{1 - \beta x}$, a function that has a singularity at x = 1/β ≤ 1. As a consequence of Bernstein's inequality [29] this function may admit polynomial approximations with quadratically better degree only when the singularity is at x = 1, i.e. when we have β = 1.

Polynomial approximation of 1/(1 − x)
In this section we give analytical polynomial approximations of the function f(x) = 1/(1 − x), so that we can use the QSP method to implement it for the matrix B = I − η A as in Eq. (18). To keep the notation simple we assume η = 1 and that the spectrum of A is contained in $D_A = [\frac{1}{\kappa}, 2]$, while any value η < 1 can be accounted for simply by rescaling κ to κ/η. Consequently, it is only necessary for the polynomial P(x) to be a good approximation of our target function in the domain
$$D_B = \big[-1,\; 1 - \tfrac{1}{\kappa}\big].$$
Our starting point is the polynomial $\hat{T}_{\ell,\kappa}(x)$, a shifted and rescaled version of $T_\ell(x)$, the ℓ-degree Chebyshev polynomial of the first kind, which is the solution of the following minimax problem (see Ref. [3, Theorem 6.25]): among all degree-ℓ polynomials p satisfying $p(y_0) = 1$ for a fixed point $y_0 > 1$, the polynomial $p(y) = T_\ell(y)/T_\ell(y_0)$ minimises $\max_{y \in [-1,+1]} |p(y)|$. Chebyshev polynomials satisfy the property that $|T_\ell(x)| \leq 1$ for all ℓ ∈ ℕ and all x ∈ [−1, +1], while $T_\ell(1+\delta) \geq \frac{1}{2} e^{\ell \sqrt{\delta}}$ for 0 ≤ δ ≤ 1/6, see e.g. [12, Lemma 13]. Using the change of variables $y(x) = \frac{2x + 1/\kappa}{2 - 1/\kappa}$, which maps $D_B$ onto [−1, +1], we define
$$\hat{T}_{\ell,\kappa}(x) \;:=\; \frac{T_\ell(y(x))}{T_\ell(y(1))}. \qquad (20)$$
Then, the denominator satisfies $T_\ell(y(1)) = T_\ell(1 + \delta) \geq \frac{1}{2} e^{\ell\sqrt{\delta}}$ with $\delta \in \Theta(1/\kappa)$, and therefore $|\hat{T}_{\ell,\kappa}(x)| \leq 2 e^{-\ell\sqrt{\delta}}$ for all $x \in D_B$. We then use the following (2ℓ − 1)-degree polynomial as our approximation of $f(x) = \frac{1}{1-x}$:
$$P_{2\ell-1,\kappa}(x) \;:=\; \frac{\big[1 - \hat{T}_{\ell,\kappa}(x)\big]^2}{1 - x}. \qquad (24)$$
To see that this is indeed a polynomial, note that $[1 - \hat{T}_{\ell,\kappa}(x)]^2$ has a twofold root in x = 1, thus it is exactly divisible by 1 − x; moreover, $P_{2\ell-1,\kappa}(1) = 0$. This last property is useful because it allows us to implement the pseudo-inverse $A^+$ for a singular matrix A; i.e., in the case in which A has some eigenvalues that are equal to 0 (equivalently, B = I − A has eigenvalues equal to 1) these will be mapped to 0 and, if moreover all the non-zero eigenvalues are separated from zero by a gap 1/κ, the polynomial in Eq. (24) is a close approximation of the mapping $x \mapsto \frac{1}{1-x}$ for all $x \in D_B$. Choosing $\ell \in O(\sqrt{\kappa} \log(1/\varepsilon))$ we have $|\hat{T}_{\ell,\kappa}(x)| \leq \varepsilon/3$ on $D_B$, and thus we obtain from Eq. (24)
$$\Big| P_{2\ell-1,\kappa}(x) - \frac{1}{1-x} \Big| \;\leq\; \frac{\varepsilon}{1-x} \qquad \text{for all } x \in D_B,$$
that is, we have an ε-close (in relative error) polynomial approximation on the interval $D_B$. Finally, the polynomial in Eq. (24) has to be normalised so that it becomes compatible with the QSP method. Therefore we define
$$\hat{P}_{2\ell-1,\kappa}(x) \;:=\; \frac{P_{2\ell-1,\kappa}(x)}{K}, \qquad K \in \Theta(\kappa),$$
where K is chosen so that $|\hat{P}_{2\ell-1,\kappa}(x)| \leq 1$ for all x ∈ [−1, +1]; see Appendix B for the proof of this bound.
In conclusion, the QSP method allows us to implement a $(K, b+2, \varepsilon)$-matrix-block-encoding of $A^{-1}$ with $K \in \Theta(\kappa/\eta)$, using $2\ell - 1 \in O(\sqrt{\kappa/\eta}\, \log(1/\varepsilon))$ accesses to $U_B$, where b is the number of ancilla qubits required in the block-encoding of B.
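The quadratic degree reduction can be verified numerically. The sketch below is our own check (the choice ℓ ≈ √κ log(1/ε) is illustrative): it builds the shifted Chebyshev polynomial, forms the candidate approximation of 1/(1 − x), and measures the relative error over the domain.

```python
import numpy as np

kappa, eps = 100.0, 1e-3
ell = int(np.ceil(np.sqrt(kappa) * np.log(1.0 / eps)))  # degree in O(sqrt(kappa) log(1/eps))

def T(l, t):
    """Chebyshev polynomial T_l, valid also for |t| > 1 (hyperbolic branch)."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1,
                    np.cos(l * np.arccos(np.clip(t, -1, 1))),
                    np.cosh(l * np.arccosh(np.clip(np.abs(t), 1, None))) * np.sign(t) ** l)

y = lambda x: (2 * x + 1 / kappa) / (2 - 1 / kappa)     # maps D_B onto [-1, +1]
That = lambda x: T(ell, y(x)) / T(ell, y(1.0))          # shifted, rescaled Chebyshev
P = lambda x: (1 - That(x)) ** 2 / (1 - x)              # candidate approx of 1/(1-x)

xs = np.linspace(-1, 1 - 1 / kappa, 20001)              # grid over D_B
rel_err = np.max(np.abs(P(xs) * (1 - xs) - 1))          # relative error vs 1/(1-x)
assert rel_err < 3 * eps
assert abs(P(0.5) - 2.0) < 0.1                          # spot check: 1/(1-0.5) = 2
```

Note that the degree used here (≈ 70 for κ = 100) is far below the Θ(κ log(1/ε)) needed when the singularity sits strictly outside [−1, +1], which is the quadratic saving discussed above.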

Implementing normalised matrix-block-encodings of B = I − ηA
In this subsection we show two methods that, under different assumptions, allow us to efficiently implement a normalised matrix-block-encoding of B = I − η A. Preliminarily, we remark that this is a non-trivial task, as we argue with the following three considerations.
First, any Hermitian matrix $M \in \mathbb{C}^{N \times N}$ satisfying ‖M‖ ≤ 1 can be implemented as a normalised sub-block of a unitary, since an explicit construction is given by [28]
$$U_M = \begin{pmatrix} M & \sqrt{I - M^2} \\ \sqrt{I - M^2} & -M \end{pmatrix}.$$
Implementing the circuit corresponding to $U_M$ in general requires up to $O(N^2)$ elementary quantum gates [42] and is thus inefficient; however, efficient constructions are possible in specialised cases.
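This unitary dilation is easy to verify numerically; the following sketch (the random choice of M is our own illustration) builds $U_M$ via an eigendecomposition and checks both unitarity and the encoded block.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
M = (X + X.T) / 2
M /= 1.1 * np.linalg.norm(M, 2)          # Hermitian with ||M|| < 1
w, V = np.linalg.eigh(M)
S = (V * np.sqrt(1 - w ** 2)) @ V.T      # matrix square root of I - M^2
U = np.block([[M, S], [S, -M]])          # the dilation U_M
assert np.allclose(U @ U.T, np.eye(8))   # U_M is unitary (real orthogonal here)
assert np.allclose(U[:4, :4], M)         # upper-left block encodes M exactly
```

Unitarity follows because S commutes with M and $M^2 + S^2 = I$; the expensive part in practice is not this algebra but compiling $U_M$ as a circuit, as noted above.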
Second, standard quantum methods allow us to efficiently implement sub-normalised matrix-block-encodings of B. As a first example, Childs' quantum walk operator uses O(1) accesses to a sparse-matrix oracle $P_B$ to implement $U_M$ for M = B/d, where d is the column-sparsity of B [5]. A second example is to assume that we have access to $U_A$, a block-encoding of A with ‖A‖ ≤ 1, and then use the LCU lemma [26] to implement a linear combination of I and $-U_A$, yielding a normalised block-encoding of p I − (1 − p)A ≡ p B, for some p ∈ (0, 1). However, any "black-box" method that aims at amplifying a block-encoding of B/β (with β > 1) to a normalised block-encoding of B is in general inefficient. This can be proven, for instance, by applying the lower bound of Ref. [25, Theorem 73] to the function f(x) = β x.
Third, it is currently an open problem whether it is possible to implement a normalised matrix-block-encoding $U_M$ given access to a sparse-matrix oracle $P_M$ with ‖M‖ ≤ 1. In the absence of general results of this kind, we turn to developing specialised methods that efficiently implement a normalised block-encoding $U_B$, with B = I − η A, in the cases where (i) A is diagonally dominant or (ii) A is the sum of positive semi-definite local Hamiltonians.

Diagonally-dominant coefficient matrix
In this section, we implement a normalised block-encoding of B = I − A employing the method described in Ref. [25, Lemma 47], which we report here as Lemma 9. Our construction requires the preparation of some families of states $\{|\psi_i\rangle\}_i$, $\{|\phi_j\rangle\}_j$ that are well-defined only when A is diagonally dominant, while attempts at extending the method to B = I − η A for non-diagonally-dominant PD matrices result in non-normalisable states for any η > 0. We also remark that the problem of solving linear systems involving diagonally dominant PD matrices (which include the noteworthy class of Laplacian matrices [43]) is well studied in classical linear algebra: for these classes of matrices there are classical algorithms that can solve a linear system substantially faster than what is possible for more general matrices [44].
Lemma 9. Suppose that we have access to two "state preparation" unitaries $U_L$ and $U_R$ (left and right) acting on a + s qubits such that
$$U_L\, |0^a\rangle |i\rangle = |\psi_i^*\rangle, \qquad U_R\, |0^a\rangle |j\rangle = |\phi_j\rangle,$$
for any $i, j \in \{1, \ldots, 2^s\}$ and for some families of normalised states $\{|\psi_i\rangle\}_i$ and $\{|\phi_j\rangle\}_j$, where $|\psi_i^*\rangle$ is the complex conjugate of the state $|\psi_i\rangle$ w.r.t. the computational basis. Then $U_L^\dagger U_R$ is a normalised matrix-block-encoding of the matrix with entries $\langle \psi_i^* | \phi_j \rangle$.

For a diagonally dominant matrix A we choose the families of states
$$|\psi_i\rangle = \sum_{j \in \mathrm{supp}(A_i)} \sqrt{B_{ij}}\; |i\rangle|j\rangle + \sqrt{r_i}\, |i\rangle|N{+}1\rangle, \qquad |\phi_j\rangle = \sum_{i \in \mathrm{supp}(A_j)} \sqrt{B_{ij}}\; |i\rangle|j\rangle + \sqrt{r_j}\, |N{+}1\rangle|j\rangle,$$
where $B_{ij} = \delta_{ij} - A_{ij}$, the square roots are taken with the principal branch, supp(A_i) are the positions of the (at most) d non-zero entries of the column vector $A_i$ and the value $0 \leq r_i \leq 1$ can be computed so that the states are normalised, since we have
$$\sum_j |B_{ij}| + r_i = (1 - A_{ii}) + \sum_{j \neq i} |A_{ij}| + r_i = 1, \qquad \text{i.e.,} \qquad r_i = A_{ii} - \sum_{j \neq i} |A_{ij}| \,\geq\, 0,$$
where we have used $A_{ii} \leq 1$ and the diagonal dominance of A. Then, $U_L^\dagger U_R$ is a normalised encoding of the matrix B = I − A, as one can verify using $A_{ji}^* = A_{ij}$:
$$\big(\langle 0^a|\langle i|\big)\, U_L^\dagger U_R\, \big(|0^a\rangle|j\rangle\big) = \langle \psi_i^* | \phi_j \rangle = \sqrt{B_{ij}}\,\sqrt{B_{ij}} = \delta_{ij} - A_{ij} = B_{ij}.$$
The quantum circuit $U_L$ can be implemented efficiently (and similarly for $U_R$), as we now show. We use d calls to $P_A^{\mathrm{pos}}$ to recover the values $j_\nu \equiv j(i, \nu)$ for $\nu \in \{1, \ldots, d\}$, i.e., the positions of all the (potentially) non-zero entries of $A_i$. This corresponds to implementing the following isometry (i.e., a unitary circuit plus the possibility of adding ancillas):
$$|i\rangle \;\mapsto\; |i\rangle\, |j_1, \ldots, j_d\rangle.$$
Next, we use d calls to $P_A^{\mathrm{val}}$ to recover all the values $A_{i,j_\nu}$, i.e.:
$$|i\rangle\, |j_1, \ldots, j_d\rangle \;\mapsto\; |i\rangle\, |j_1, \ldots, j_d\rangle\, |A_{i,j_1}, \ldots, A_{i,j_d}\rangle.$$
We then use reversible (classical) computation to calculate the numerical values of all the amplitudes of $|\psi_i\rangle$ from the retrieved entries. Then, we use a general state-preparation quantum circuit which, given a classical description of the amplitudes of a (d + 1)-dimensional quantum state, effectively prepares the corresponding state:
$$|0\rangle \;\mapsto\; \sum_{\nu=1}^{d+1} \psi_{i,\nu}\, |\nu\rangle.$$
Next, we use a single call to $P_A^{\mathrm{pos}}$ in quantum superposition to map $|i, \nu\rangle \mapsto |i, j(i,\nu)\rangle = |i, j_\nu\rangle$; adopting the definition $j(i, d+1) := N + 1$ and uncomputing the intermediate registers, on the right-hand side we have obtained the desired state. We can now estimate the query and gate cost of implementing $U_L$ (the cost of implementing $U_R$ is the same).
Going through the derivation, we see that 4d + 1 oracle calls to $P_A = (P_A^{\mathrm{pos}}, P_A^{\mathrm{val}})$ are used to implement $U_L$, including the uncomputation of the intermediate registers. Regarding gate complexity, the reversible computation of the amplitudes requires $O(d\, \mathrm{poly}(p))$ Toffoli gates (which are universal for classical reversible computation) to perform the computation up to p digits of precision; moreover, the (d + 1)-dimensional state preparation can be performed using O(d) CNOTs and single-qubit rotations employing the methods of Ref. [42]. In conclusion, treating the number of digits of precision as a constant and the single-qubit rotations as exact, both the query and the gate complexities are in O(d) and we thus obtain the following result. We finally remark that, in some cases, it might be possible to implement $U_L$ and $U_R$ with a query and gate complexity in O(√d) instead of linear in d. First, we may assume that we have direct access to an oracle $P_\psi$ that returns the amplitudes $\psi_{i,\nu}$, instead of needing to compute these values from the non-zero entries of $A_i$. Second, one can use an algorithm that generalises Grover search to prepare the state $|\psi_i^*\rangle$ using O(√d) accesses to $P_\psi$ [45]; an even more efficient implementation can be realised using the method of Ref. [46], which avoids synthesising arithmetical operations and brings about an improvement of two orders of magnitude over prior works for realistic levels of precision.
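The diagonal-dominance construction can be sanity-checked classically. The sketch below uses one concrete amplitude convention (square roots of the entries of B = I − A on a doubled register, plus a residual flag amplitude √r_i; this specific convention is our own illustrative choice) and verifies the Gram identity $\langle\psi_i^*|\phi_j\rangle = B_{ij}$.

```python
import numpy as np

N = 4
rng = np.random.default_rng(1)
A = rng.uniform(-0.1, 0.1, (N, N)); A = (A + A.T) / 2
off = np.abs(A).sum(axis=1) - np.abs(np.diag(A))
np.fill_diagonal(A, off + 0.2)                 # diagonally dominant, with A_ii <= 1
B = np.eye(N) - A
r = 1 - np.abs(B).sum(axis=1)                  # residuals, >= 0 by diagonal dominance
assert np.all(r >= -1e-12) and np.all(np.diag(A) <= 1)

D = N + 1                                      # extra level plays the role of |N+1>
psi = np.zeros((N, D * D), dtype=complex)      # |psi_i> lives on a register |i>|j>
phi = np.zeros((N, D * D), dtype=complex)
sqB = np.sqrt(B.astype(complex))               # principal-branch square roots
for i in range(N):
    for j in range(N):
        psi[i, i * D + j] = sqB[i, j]
    psi[i, i * D + N] = np.sqrt(r[i])          # flag component |i>|N+1>
for j in range(N):
    for i in range(N):
        phi[j, i * D + j] = sqB[i, j]
    phi[j, N * D + j] = np.sqrt(r[j])          # flag component |N+1>|j>

assert np.allclose((np.abs(psi) ** 2).sum(axis=1), 1)   # states are normalised
assert np.allclose((np.abs(phi) ** 2).sum(axis=1), 1)
G = psi @ phi.T                                # <psi*_i|phi_j> (conjugation cancels)
assert np.allclose(G, B)                       # encoded matrix is B = I - A
```

The flag components never overlap between the two families, which is what allows the residual weight $r_i$ to restore normalisation without perturbing the encoded matrix.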

Sum of positive semi-definite Hamiltonians
We now consider the case where $A \in \mathbb{C}^{N \times N}$ is given by the sum of positive semi-definite Hamiltonian terms; i.e., we consider the case
$$A = \sum_{j=1}^{J} H^{(j)}, \qquad H^{(j)} \succeq 0,$$
which is similar to the case presented in Section 5, but here we allow the Hamiltonian terms to have eigenvalue zero. We assume that the number J of Hamiltonian terms scales polynomially in $n = \log_2 N$ and that each Hamiltonian term $H^{(j)}$ is local, i.e., that it acts upon a small number of qubits; specifically, we require that the set $S_j \subseteq \{1, \ldots, n\}$ of qubits upon which $H^{(j)}$ acts non-trivially satisfies $|S_j| \leq s \in O(\log n)$ for all j. Each $H^{(j)}$ can thus be expressed as⁸
$$H^{(j)} = h^{(j)} \otimes I_{\neg S_j}, \qquad (43)$$
where $h^{(j)}$ is a positive semi-definite matrix of dimension $2^s \times 2^s$, which can be fully specified with $2^{2s} \in O(\mathrm{poly}\, n)$ real parameters. After a rescaling we may assume $\|h^{(j)}\| \leq 1$, and we define
$$w^{(j)} := \mathbb{1} - h^{(j)}$$
and note that it is a small matrix, of at most $O(\mathrm{poly}\, n)$ size, with $\|w^{(j)}\| \leq 1$. Then, we can rapidly compute, with classical algorithms, a unitary extension $u^{(j)}$ for each $w^{(j)}$, each requiring only one ancilla qubit. Specifically, we define
$$u^{(j)} := \begin{pmatrix} w^{(j)} & \sqrt{\mathbb{1} - (w^{(j)})^2} \\ \sqrt{\mathbb{1} - (w^{(j)})^2} & -w^{(j)} \end{pmatrix}, \qquad (44)$$
and then we (implicitly) define $W^{(j)} \in \mathbb{C}^{N \times N}$ and the unitary $U^{(j)} \in \mathbb{C}^{2N \times 2N}$ as
$$W^{(j)} = w^{(j)} \otimes I_{\neg S_j}, \qquad U^{(j)} = u^{(j)} \otimes I_{\neg S_j},$$
where the interpretations are the same as in Eq. (43). Note that each $U^{(j)}$ can be efficiently compiled as a quantum circuit: it is sufficient to determine the gate decomposition of the (s + 1)-qubit matrix $u^{(j)}$ and then embed the circuit in an (n + 1)-qubit quantum register. We assume that the ancilla qubit used for the extension given in Eq. (44) is one and the same, shared across all circuits $U^{(j)}$. We then employ the LCU lemma [26] as follows. Given access to the controlled version of each circuit $U^{(j)}$, it is possible to efficiently implement the multi-controlled unitary
$$U_{\mathrm{Select}} := \sum_{j=1}^{J} |j\rangle\langle j|_c \otimes U^{(j)},$$
where the subscript c denotes the control register. Then, defining a unitary Had that acts as $\mathrm{Had}\, |0\rangle_c = \frac{1}{\sqrt{J}} \sum_{j=1}^{J} |j\rangle_c$, we obtain that $(\mathrm{Had}^\dagger \otimes I)\, U_{\mathrm{Select}}\, (\mathrm{Had} \otimes I)$ block-encodes the average $\frac{1}{J} \sum_j W^{(j)}$. In conclusion,
$$U_B := (\mathrm{Had}^\dagger \otimes I)\, U_{\mathrm{Select}}\, (\mathrm{Had} \otimes I) \quad \text{is a normalised matrix-block-encoding of} \quad B = I - \eta A,$$
where η = 1/J, with J scaling, by assumption, polynomially in n. The gate complexity of $U_B$ can be estimated as follows.

⁸ The correct expression for an operator $H^{(j)}$ acting on a set $S_j \subseteq \{1, \ldots, n\}$ of s qubits is the tensor product of $h^{(j)}$, acting on the qubits in $S_j$, with the identity acting on the remaining qubits; in Eq. (43) we leave the implicit reordering of the qubits understood.
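The LCU construction above can be checked end-to-end on small matrices. In the sketch below (our own illustration) we simplify by letting each PSD term act on the full register, so the kron-embedding of local terms is omitted; "Had" is realised as any unitary whose first column is the uniform vector.

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 2, 3
N = 2 ** n

def psd_term():
    X = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
    h = X @ X.conj().T
    return h / np.linalg.norm(h, 2)            # PSD with ||h|| <= 1

hs = [psd_term() for _ in range(J)]
A = sum(hs)

def dilate(w):
    # one-ancilla unitary extension [[w, s], [s, -w]] with s = sqrt(I - w^2)
    lam, V = np.linalg.eigh(w)
    s = (V * np.sqrt(np.clip(1 - lam ** 2, 0, None))) @ V.conj().T
    return np.block([[w, s], [s, -w]])

Us = [dilate(np.eye(N) - h) for h in hs]       # u^(j) extends w^(j) = 1 - h^(j)
assert np.allclose(Us[0] @ Us[0].conj().T, np.eye(2 * N))

Sel = np.zeros((J * 2 * N, J * 2 * N), dtype=complex)  # Select = sum_j |j><j| (x) U^(j)
for j, U in enumerate(Us):
    Sel[j * 2 * N:(j + 1) * 2 * N, j * 2 * N:(j + 1) * 2 * N] = U
M = np.eye(J); M[:, 0] = 1
Had, _ = np.linalg.qr(M)
if Had[0, 0] < 0:
    Had = -Had                                 # first column = (1, ..., 1)/sqrt(J)

UB = np.kron(Had.conj().T, np.eye(2 * N)) @ Sel @ np.kron(Had, np.eye(2 * N))
block = UB[:N, :N]                             # control |0>, ancilla |0> block
assert np.allclose(block, np.eye(N) - A / J)   # normalised block-encoding of B = I - A/J
```

The assertion confirms that the post-selected block is exactly B = I − ηA with η = 1/J, i.e., the encoding is normalised despite the sum containing J terms.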
Each $u^{(j)}$ requires $O(2^{2s})$ elementary gates to be implemented [42] and thus the gate complexity of $U^{(j)}$ is also in $O(2^{2s})$, assuming that two-qubit gates can be applied among arbitrary pairs of qubits. Then, $U_{\mathrm{Select}}$ has gate complexity scaling as the sum of the complexities of the individual $U^{(j)}$ [27] and is thus in $O(J\, 2^{2s})$. The complexity of Had is O(log J), a sub-leading additive term that can be neglected in the asymptotic gate complexity of $U_B$. Hence we have the following result.

From matrix inversion to solving the quantum linear system problem
Suppose now that we have a matrix-block-encoding of $A^{-1}$, i.e., a unitary $U_{A^{-1}}$ of the form
$$U_{A^{-1}} = \begin{pmatrix} A^{-1}/K & * \\ * & * \end{pmatrix},$$
where $K \in \Theta(\kappa/\eta)$ is the normalisation factor of the matrix-block-encoding (recall the rescaling of κ to κ/η), upper bounded by $O(\kappa/\eta)$ as we show in Appendix B, and where the left-upper block corresponds to having the a ancilla qubits in $|0^a\rangle$. Then, one can directly solve a QLS by applying the unitary quantum circuit $U_{A^{-1}}$ to the vector $|0^a\rangle|b\rangle$ and then post-selecting the outcome $|0^a\rangle$ on the ancilla system; however, post-selection might introduce large overheads, since we have
$$U_{A^{-1}}\, |0^a\rangle|b\rangle = \frac{1}{K}\, |0^a\rangle \otimes A^{-1}|b\rangle + |\Psi^\perp\rangle,$$
where $|\Psi^\perp\rangle$ is a state perpendicular to all states of the form $|0^a, \psi\rangle$. Therefore, the probability of successfully obtaining the state proportional to $A^{-1}|b\rangle$ when post-selecting on the ancilla measurement is
$$p_{\mathrm{succ}} = \frac{\| A^{-1}|b\rangle \|^2}{K^2}.$$
A PD-QLS solver that prepares a matrix-block-encoding of $A^{-1}$ and obtains the solution via post-selection thus uses $O(1/p_{\mathrm{succ}})$ accesses to $U_b$ and $O((2\ell-1)/p_{\mathrm{succ}})$ accesses to $U_B$. The query complexities can be quadratically improved to $O(1/\sqrt{p_{\mathrm{succ}}})$ and $O((2\ell-1)/\sqrt{p_{\mathrm{succ}}})$, respectively, using amplitude amplification [30]. Having implemented an approximate matrix-block-encoding of $A^{-1}$ that is ε-close to $A^{-1}$ (in the sense of Definition 5), the output state is close to the exact solution up to an error $O(\varepsilon / \|A^{-1}|b\rangle\|)$ in ℓ₂-norm. The same inequality holds in trace distance because of (5) and it is true both when using post-selection and when using amplitude amplification. In conclusion, we have the following results. First, using the QSP method of Theorem 8 and the polynomial approximation given in Eq. (24), it is possible to implement a $(K, b + 2, \varepsilon)$-matrix-block-encoding of $A^{-1}$, where $K \in \Theta(\kappa/\eta)$, and the method has a query complexity of $2\ell - 1 \in O(\sqrt{\kappa/\eta}\, \log(1/\varepsilon))$ accesses to $U_B$. We now proceed to a worst-case, average-case, and best-case scenario analysis of a PD-QLS solver as given in the previous Proposition.

Worst-case scenario:
In the worst case we have $\|A^{-1}|b\rangle\| \in O(1)$ and consequently the post-selection success probability is $p_{\mathrm{succ}} \in \Omega(1/\kappa^2)$. Alternatively, one can use O(κ) rounds of amplitude amplification to reach a constant success probability. Using amplitude amplification, the overall query complexity is in $O(\kappa^{3/2})$, which is an improvement compared to the $O(\kappa^2)$ runtime achieved by the HHL algorithm, but still falls short of the O(κ) runtime that can be achieved using more advanced methods such as Variable-Time Amplitude Amplification (VTAA) [4] or eigenpath traversal [7].

Average-case scenario:
We now look at the distribution of runtimes that arises when using a randomly chosen vector |b⟩. As observed in the work by Subaşı et al., when |b⟩ is sampled from the Haar measure its squared overlaps with the eigenvectors of A follow a Porter–Thomas distribution, and one finds that $\|A^{-1}|b\rangle\| \in \widetilde{\Theta}(\sqrt{\kappa})$ holds almost surely in the regime 1 ≪ κ ≪ N. This implies that a randomly sampled PD-QLS problem (according to the specified distribution) can be solved almost surely with a query complexity in O(κ), employing amplitude amplification in the post-selection step. This method then matches (actually, improves by a polylog(κ) factor) the asymptotic runtime of more sophisticated methods (such as those that employ VTAA or adiabatic evolution) when considering "typical" instances of PD-QLS.
Best-case scenario: The largest value that $\|A^{-1}|b\rangle\|$ can reach is κ. In such a case the post-selection probability is constant and the overall query complexity is in O(√κ), even without employing amplitude amplification. This is a quadratic improvement for PD-QLS solving over competing methods working for indefinite QLS, since implementing a block-encoding of $A^{-1}$ for indefinite matrices already requires O(κ) oracle calls to $U_A$ [5]. We note that these best-case problems almost never occur under the probabilistic model described before, but real-world problems have intrinsic structure that could make them depart from the Porter–Thomas distribution and thus have $\|A^{-1}|b\rangle\| \gg \sqrt{\kappa}$: for instance, this is the case if |b⟩ has constant overlap with the eigenvector corresponding to the largest eigenvalue of $A^{-1}$. Note, finally, that it is not required that we know in advance how large the success probability is, since by definition $|0^a\rangle$ heralds success.

Optimization using Variable-Time Amplitude Amplification
In Appendix C we show how to use VTAA to obtain a PD-QLS solver with improved asymptotic query complexities. We proceed as in Ref. [5, Section 5] and Ref. [19, Section 3]: first we reformulate our algorithm as a variable-stopping-time quantum algorithm and afterwards we apply the VTAA optimisation to improve its runtime. There is, however, a technical hurdle to overcome: all previous VTAA-based QLS solvers use a phase estimation subroutine having an O(κ) runtime; its use would then preclude us from achieving a runtime sub-linear in κ. The main new idea we introduce is to replace phase estimation with efficient "windowing functions" whose implementation via QSP requires only O(√κ) accesses to $U_B$.
More precisely, in Ref. [5] the so-called Gapped Phase Estimation (GPE) method is introduced, with the purpose of reliably selecting eigenvalues of A that are larger than some value δ. For these eigenvalues it is possible to implement an approximate inverse at a reduced cost, scaling as O(1/δ) instead of O(κ). Then, a sequence of increasingly small values of δ is considered, until δ ≤ 1/κ, and VTAA is employed to enhance the success probability. Since GPE has a query complexity in Ω(1/δ) and the required precision is δ ≤ 1/κ, its complexity is in Ω(κ).
Instead, we introduce a "windowing function" $W_{\epsilon,\delta}(\lambda)$ to select eigenvalues of B = I − ηA that satisfy λ ∈ [−1 + 2δ, 1 − 2δ] and to reject eigenvalues λ ∈ [−1, −1 + δ] ∪ [1 − δ, +1], except for a small error probability ε. Thus, $W_{\epsilon,\delta}(x)$ is a polynomial ε-close to 1 in the centre of the interval [−1, +1] and ε-close to 0 near the edges of the interval, with a steep fall around the points ±(1 − 1.5δ). The intervals where the derivative of the function is large (of order 1/δ) are very close to the extrema of the interval [−1, +1]. According to Bernstein's inequality [29] it is not prohibited that such a windowing function could be implemented with a polynomial having a degree ℓ ∈ O(1/√δ), a quadratically smaller degree compared to the case where the large derivative is near the centre of the interval. In Appendix C we then show, with an explicit construction, that windowing polynomials with degree ℓ ∈ O(1/√δ) can indeed be implemented. We defer to the Appendix for further details. The end result is summarised in the following Proposition.
Moreover, the algorithm is gate efficient.
Now we discuss the runtime of this VTAA PD-QLS solver.
, thus the U B -complexity is the dominant factor.
• Compared to Proposition 12, the $U_b$ query complexity increases here only by an additive log(κ) factor, while the $U_B$ complexity typically (i.e., for almost all values of $\Gamma_{A,b}$) has a polynomial improvement. To prove it, note that $\|A^{-1/2}|b\rangle\|^2 = \langle b| A^{-1} |b\rangle \leq \kappa$ and thus
• Compared to the general VTAA-based QLS solver of Ref. [5], our PD-QLS solver has a polynomial speed-up for almost all values of $\Gamma_{A,b}$, see the right plot in Figure 3. This is a consequence of Lemma 20, where we prove that

Method based on a quadratic reduction of the condition number via matrix decomposition
In this section we first give some preliminary considerations on the approach that we are going to present (Section 5.1) and then give a formal statement of the Sum-QLS problem that we solve (Section 5.2). Next, we describe a classical pre-processing step that quadratically reduces the condition number (Section 5.3) and then present the quantum algorithm solving the pseudo-inversion problem that originates from the preconditioning (Section 5.4). Finally, we estimate the gate complexity of the resulting Sum-QLS solver (Section 5.5).

General considerations
We now present the general features of any algorithm that solves PD-QLS exploiting a decomposition of the form A = LL†, as already summarised in Section 1.4. Note that such an L exists for any PD matrix A and that it may not be unique, especially if we allow L to be non-square. The key property is that any L such that A = LL† satisfies $\kappa_{\mathrm{eff}}(L) = \sqrt{\kappa(A)}$, hence a system of the form L†x = b′ (for any b′) is quadratically better conditioned than the original system Ax = b. The method we introduce is based on finding matrices $L \in \mathbb{C}^{N \times M}$ and $L_g \in \mathbb{C}^{M \times N}$, with M > N, such that the decomposition A = LL† holds and $L_g$ satisfies $L L_g = I$, i.e., it is a right pseudo-inverse; moreover, $L_g$ is rectangular with more rows than columns and thus $L_g L \neq I$, i.e., $L_g$ cannot be a left pseudo-inverse. We then make the following observations.
1. $L_g L L^\dagger x = L_g b$ is a linear system equivalent to the original one, having the vector $x = A^{-1}b$ as the unique solution, but with no guarantee that the condition number of $L_g L L^\dagger$ is small.
2. $L^\dagger x = L_g b$ is an over-constrained linear system that may be inequivalent to the original one (since $L_g L \neq I$) and typically has no proper solution x.
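The key conditioning property is easy to verify numerically; the following sketch (our own illustration, using a Cholesky factor as a concrete choice of L) checks that the effective condition number of L is exactly the square root of that of A.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 6))
A = X @ X.T + 0.1 * np.eye(6)     # positive-definite coefficient matrix
L = np.linalg.cholesky(A)         # one valid decomposition A = L L^T
kA = np.linalg.cond(A)            # condition number of A
kL = np.linalg.cond(L)            # effective condition number of L
assert np.isclose(kL, np.sqrt(kA))
```

This holds because the singular values of L are the square roots of the eigenvalues of A, so any factor L with A = LL† inherits the quadratically smaller condition number.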

Finding $\mathrm{argmin}_x \, \| L^\dagger x - L_g\, b \|$
The goal is thus to convert the linear system Ax = b into the linear regression problem $\mathrm{argmin}_x \|L^\dagger x - L_g b\|$, having solution $x = (L^\dagger)^+ L_g\, b$. This is a non-trivial task, as we need, given access to A via a sparse-matrix oracle or via some succinct description, to construct a suitable access to L† and, moreover, given access to b, to construct a suitable access to $b' := L_g b$. The latter requirement seems particularly worrisome, since it involves a pseudo-inversion of the exponentially large matrix L. We show, however, that for the Sum-QLS problem, whereby A is provided as a sum of local PD terms, one can find a suitable $L_g$ for which a compact classical description can be efficiently computed.
We also remark that a pseudo-inversion problem can be interpreted as a regular matrix inversion on the subspace where the matrix L† is full-rank; thus, solving pseudo-inversion entails a larger runtime compared to solving a standard QLS, since the appropriate subspace has to be selected via projection or via amplitude amplification. More formally, the operator $(L^\dagger)^+ : \mathbb{C}^M \to \mathbb{C}^N$ (with M > N) has rank equal to N and thus we have the orthogonal decomposition
$$\mathbb{C}^M = \mathrm{supp}\big((L^\dagger)^+\big) \oplus \ker\big((L^\dagger)^+\big),$$
where the support is by definition the subspace orthogonal to the kernel. Then, calling Π and $\Pi^\perp = I - \Pi$ the orthogonal projectors on the support and on the kernel of $(L^\dagger)^+$, respectively, we obtain the identity
$$(L^\dagger)^+\, b' = (L^\dagger)^+ \big(\Pi + \Pi^\perp\big)\, b' = (L^\dagger)^+\, \Pi\, b',$$
since by definition we have $(L^\dagger)^+ \Pi^\perp = 0$. It is then evident that only the component Π b′, which is in general a sub-normalised quantum state, plays a role in the pseudo-inversion algorithm, while the orthogonal component Π⊥ b′ can be arbitrary. Therefore, any quantum pseudo-inversion algorithm implicitly requires the amplification of the Π b′ component, which entails a gate complexity in O(1/√γ), for some known lower bound $\sqrt{\gamma} \leq \|\Pi\, b'\|$.

Problem statement
In the Sum-QLS we assume that the coefficient matrix $A \in \mathbb{C}^{N \times N}$ is given by an explicit classical description, rather than via oracular access. This allows us to evade the lower bounds given in Section 3. Specifically, A is given as
$$A = \sum_{j=1}^{J} h^{(j)} \otimes I_{\neg S_j},$$
where each $h^{(j)}$ is an operator acting on a subset $S_j \subseteq \{1, \ldots, n\}$ of at most s qubits, corresponding to a matrix of size at most $2^s \times 2^s$. We assume that J, the number of Hamiltonian terms, and s are "small", i.e. we take $J \in O(\mathrm{poly}\, n)$ and $s \in O(\log n)$, and that we have a complete classical description of each operator $h^{(j)}$ and of each subset $S_j$. The matrix A is then fully specified with at most $J 2^{2s}$ real parameters and with $Jn$ boolean values (defining the sets $S_j$), i.e. the number of parameters is $J 2^{2s} + Jn \in O(\mathrm{poly}\, n)$. We require moreover that the known-term vector b is sparse, containing at most $d_b$ non-zero entries in the computational basis, where $d_b$ also scales polynomially in n. This implies that a preparation circuit $U_b$ can be given as an explicit small quantum circuit. This leads to the following definition of the Sum-QLS problem. In Appendix D we show that the Sum-QLS problem is BQP-hard, by adapting a proof given in HHL [1]. That is, we show that any polynomial-time quantum computation (in the BQP class) can be re-formulated as a Sum-QLS problem for some artfully constructed coefficient matrix A and known-term vector b; therefore no polynomial-time classical probabilistic algorithm (in the BPP class) can solve the Sum-QLS problem¹¹ (unless BPP = BQP).

Classical pre-processing step
In this section we describe the classical pre-processing step, providing a quadratic improvement of the condition number. We decompose each matrix $h^{(j)}$ as
$$h^{(j)} = l^{(j)}\, l^{(j)\dagger},$$
which can be accomplished for example via the Cholesky decomposition [31], in which case $l^{(j)}$ is a lower-triangular matrix. Notice that each $h^{(j)}$ is a small matrix of size $2^s \times 2^s$, with $2^s \in O(\mathrm{poly}\, n)$, and thus the Cholesky decomposition can be performed numerically on a classical computer using $O(\mathrm{poly}\, n)$ operations; specifically, the Cholesky factorisation of an m × m matrix requires m³/3 arithmetic operations, m³/6 additions and m³/6 multiplications [31]. The total number of Hamiltonian terms is $J \in O(\mathrm{poly}\, n)$, implying that the total runtime for performing the Cholesky decomposition of all Hamiltonian terms is $O(J\, 2^{3s})$, which also is polynomial in n under our assumptions. We save the Cholesky factors $l^{(j)}$ in a classical memory, storing $O(J\, 2^{2s}) = O(\mathrm{poly}\, n)$ complex values, for later use. These decompositions implicitly define the operators
$$L_{(j)} = l^{(j)} \otimes I_{\neg S_j},$$
where each $L_{(j)}$ is of size $2^n \times 2^n$ and where the interpretation of this equation is the same as in Eq. (66). We now introduce the rectangular matrix $L \in \mathbb{C}^{N \times JN}$, with $N = 2^n$, given by
$$L = \big( L_{(1)},\; L_{(2)},\; \ldots,\; L_{(J)} \big),$$
and thus we have the required decomposition
$$L\, L^\dagger = \sum_{j=1}^{J} L_{(j)} L_{(j)}^\dagger = \sum_{j=1}^{J} h^{(j)} \otimes I_{\neg S_j} = A.$$
We can efficiently implement quantum circuits $P_L$ ($P_{L^\dagger}$) that provide sparse-matrix access to L (L†). To this end, it is sufficient to convert the classical random access memory that stores the positions and values of the entries of L (L†) into a qRAM [35], yielding a gate complexity in $O(n J 2^{2s})$ per access. Since $h^{(j)}$ is by assumption non-singular, each $l^{(j)}$ is a non-singular lower-triangular matrix and its inverse $l_{(j)}^{-1}$ is also a lower-triangular matrix which we can efficiently compute and store in a classical memory using a polynomial amount of space. These matrices implicitly define operators $L_{(j)}^{-1} = l_{(j)}^{-1} \otimes I_{\neg S_j}$ such that $L_{(j)} L_{(j)}^{-1} = I$. We can then define the matrix
$$L_g = \frac{1}{J} \begin{pmatrix} L_{(1)}^{-1} \\ \vdots \\ L_{(J)}^{-1} \end{pmatrix},$$
which is a generalised right pseudo-inverse of $L$, i.e. $L_g$ satisfies $L L_g = \frac{1}{J}\sum_j L^{(j)} L^{-1}_{(j)} = I$; thus, using $(L^\dagger)^+ = A^{-1} L$, we get the required relation $(L^\dagger)^+ L_g = A^{-1} L L_g = A^{-1}$. Using a qRAM, the gate complexity of $P_{L_g}$ is equal to that of $P_L$ and is thus in $O(n J 2^{2s})$. We finally introduce the quantum state $|b'\rangle \propto L_g\, b$. By assumption, $b$ is sparse with $d_b \in O(\mathrm{poly}\, n)$ non-zero entries, hence the vector $b' = L_g b$ has at most $J 2^s d_b \in O(\mathrm{poly}\, n)$ non-zero entries. Employing the classical pre-processing described up to now, we can efficiently implement quantum circuits $P_{L^\dagger}$ and $U_{b'}$ that act as a sparse access to $L^\dagger$ and as a state preparation circuit for $|b'\rangle$, respectively; importantly, the gate complexities of these unitaries are independent of $\kappa$. We thus have at hand the necessary tools to implement the quantum pseudo-inversion algorithm that we present in the upcoming Section 5.4.
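The whole pre-processing step can be mimicked numerically. The sketch below (taking, purely for brevity, $S_j$ to be all qubits so that $L^{(j)} = l^{(j)}$, and real matrices; these simplifications are ours) factors each term, juxtaposes the factors into $L$, builds $L_g$ from the inverse factors, and checks the identities $A = LL^\dagger$, $LL_g = I$ and $(L^\dagger)^+ L_g = A^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 8, 3
Ls = []
for _ in range(J):
    m = rng.normal(size=(N, N))
    h = m @ m.T + 0.5 * np.eye(N)        # positive-definite term h^(j)
    Ls.append(np.linalg.cholesky(h))     # lower-triangular factor l^(j)

L = np.hstack(Ls)                                    # L in C^{N x JN}
A = L @ L.T                                          # A = L L† = sum_j h^(j)
Lg = np.vstack([np.linalg.inv(l) for l in Ls]) / J   # generalised right inverse

assert np.allclose(L @ Lg, np.eye(N))                # L L_g = I
# (L†)^+ = A^{-1} L, hence (L†)^+ L_g = A^{-1} L L_g = A^{-1}
assert np.allclose(np.linalg.pinv(L.T) @ Lg, np.linalg.inv(A))
```

All of this runs in time polynomial in the block sizes, matching the $O(2^{3s}J)$ classical cost stated above.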

Efficient pseudo-inversion quantum algorithm
In this section we describe a quantum algorithm for the linear regression problem whose solution is $|x\rangle = (L^\dagger)^+ |b'\rangle = A^{-1}|b\rangle$ (up to normalisation). We employ the quantum pseudo-inversion algorithm of [19, Corollary 31], which we report here for completeness.

Proposition 15 (Complexity of pseudo-inverse state preparation). Suppose $\bar\kappa \geq 2$, $\mathcal{L} \in \mathbb{C}^{N \times N}$ is a Hermitian matrix whose non-zero eigenvalues are contained in the domain $[-1, -1/\bar\kappa] \cup [1/\bar\kappa, 1]$, and $\varepsilon$ is the target precision. Assume that we have access to $U_{\mathcal{L}}$, an $(\alpha, a, \delta)$-matrix-block-encoding of $\mathcal{L}$ with $\delta \in o\big(\varepsilon/(\bar\kappa^2 \log^3 \frac{\bar\kappa}{\varepsilon})\big)$ and $a \in \Omega(\log N)$, and to a state preparation oracle $U_v$ for a vector $v$ such that $\|\Pi_{\mathcal{L}} |v\rangle\| \geq \sqrt{\gamma}$, where $\Pi_{\mathcal{L}}$ is the orthogonal projector onto the support of $\mathcal{L}^+$ and $\gamma$ is a known positive parameter. Then, there is a (VTAA-based) quantum algorithm that produces a state $\varepsilon$-close to $|\mathcal{L}^+ v\rangle$, with a runtime quasi-linear in $\bar\kappa$ and in $1/\sqrt{\gamma}$.

We will apply Proposition 15 using as coefficient matrix $\mathcal{L}$ the Hermitian extension of $L^\dagger$ and $\bar\kappa \equiv \sqrt{\kappa}$. That is, we consider a Hermitian matrix $\mathcal{L} \in \mathbb{C}^{(J+1)N \times (J+1)N}$ and vector $v \in \mathbb{C}^{(J+1)N}$ which, after pseudo-inversion, result in a vector encoding the quantum state $|J\rangle \otimes (L^\dagger)^+ |b'\rangle$; the solution is then obtained by discarding the $(J+1)$-level ancilla system in the state $|J\rangle$.

Runtime estimation
In this section we describe methods for estimating the parameters $\alpha$, $\kappa$ and $\gamma$ (i.e., the normalisation of the block-encoding of $L^\dagger$, the condition number, and the overlap with the support space) that determine the runtime of the pseudo-inversion algorithm of Proposition 15, and thus the overall complexity of the Sum-QLS solver. First, we can implement the sparse-matrix accesses $P_L$ and $P_{L^\dagger}$ using the information stored in a qRAM, as previously explained. Childs' walk operator [5] then allows us to realise a block-encoding $U_L$ of $L$ using $O(1)$ accesses to $P_L$ and $P_{L^\dagger}$. Note that $U_L$ is also a block-encoding of $L^\dagger$ (after a swap of the position of the block). The normalisation factor of this block-encoding is $\alpha = J 2^s \in O(\mathrm{poly}\, n)$, equal to the sparsity of $L$.
Second, we can explicitly bound the condition number of $A$ as follows. Positive-definiteness of the Hamiltonian terms $H^{(j)}$ implies positive-definiteness of $A$; moreover, the smallest and largest eigenvalues of $A$ satisfy $\lambda_{\min}(A) \geq \sum_{j=1}^J \lambda_{\min}(h^{(j)})$ and $\lambda_{\max}(A) \leq \sum_{j=1}^J \lambda_{\max}(h^{(j)})$. Since each $h^{(j)}$ can be efficiently diagonalised, these values can be computed classically, yielding the explicit upper bound $\kappa(A) \leq \sum_j \lambda_{\max}(h^{(j)}) \big/ \sum_j \lambda_{\min}(h^{(j)})$. Tighter bounds on $\kappa(A)$ could also be obtained via more computationally intensive numerical methods, e.g. by first summing together groups of Hamiltonian terms and then diagonalising each sum. We now move on to lower-bounding the overlap parameter $\gamma \leq \|\Pi_L |b'\rangle\|^2$, where $\Pi_L$ is the projector onto the support of $L^+$, and note that the supports of $L^+$ and $(L^\dagger)^+$ are equal. Using the identity $\Pi_L = L^+ L$, the overlap can be expressed in terms of the normalisation factor of $|b'\rangle$, leading to Eq. (87). We now suppose that a value $\gamma > 0$ such that $\|\Pi_L |b'\rangle\| \geq \sqrt{\gamma}$ is known, and we recall that the quantum pseudo-inversion algorithm has a runtime quasi-linear in $1/\sqrt{\gamma}$. We comment extensively on the values that $\gamma$ can take, in order to understand in which cases the Sum-QLS solver yields an advantage over competing methods.
1. We have $\|\Pi_L |b'\rangle\| \leq 1$, and the inequality is saturated when $H^{(j)} = A/J$ for all $j$.

2. The bound in Eq. (88) holds, where $\lambda_* := \min_j \lambda_{\min}(H^{(j)})$. Assuming that $\lambda_* \in \Omega(\lambda_{\min}(A)/J)$, i.e. that no Hamiltonian term has a minimum eigenvalue significantly smaller than the average minimum eigenvalue, the bound simplifies accordingly.

3. The numerator in Eq. (87) can be explicitly calculated, while the denominator is in general difficult to compute. However, assuming $\|A\| \leq 1$, explicit two-sided bounds can be given.

4. Importantly, the expression $\langle b| A^{-1} |b\rangle$ appears in the denominator, so that a more "ill-conditioned" vector $b$ results in a larger overlap and thus in a faster Sum-QLS solver.
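The eigenvalue-sum bound on $\kappa(A)$ mentioned above is cheap to evaluate classically. A minimal numerical check (with random PD terms of our own choosing, acting on the full space for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
N, J = 16, 4
terms = []
for _ in range(J):
    m = rng.normal(size=(N, N))
    terms.append(m @ m.T + 0.1 * np.eye(N))   # PD Hamiltonian terms h^(j)
A = sum(terms)

# lambda_min(A) >= sum_j lambda_min(h^(j)) and
# lambda_max(A) <= sum_j lambda_max(h^(j)) give a cheap upper bound on kappa(A)
lo = sum(np.linalg.eigvalsh(h)[0] for h in terms)    # eigvalsh sorts ascending
hi = sum(np.linalg.eigvalsh(h)[-1] for h in terms)
kappa_true = np.linalg.cond(A)
kappa_bound = hi / lo
assert kappa_true <= kappa_bound + 1e-9
```

Diagonalising each small term costs only $O(J\, 2^{3s})$ classically, while diagonalising $A$ itself would cost $O(N^3)$.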

As in the analysis of Section 4.4, we can study the runtime in an average-case scenario. For randomly chosen $A$ and $b$ (sampled according to suitable probability distributions) we have $\|A^{-1}|b\rangle\| \in \Theta(\sqrt{\kappa(A)})$ almost surely; under the same assumptions, we also have $\|A^{-1/2}|b\rangle\| \in \Theta(\kappa(A)^{1/4})$ almost surely. Inserting these estimates in Eq. (88), the resulting runtime bound holds almost surely for an average-case Sum-QLS problem. Summarising, we have the following result.
The second part of the algorithm consists of using the quantum pseudo-inversion algorithm given in Proposition 15, with the unitaries $U_{L^\dagger}$ and $U_{b'}$ as sub-routines, in order to produce an $\varepsilon$-close approximation of the ideal output $|x\rangle = A^{-1}|b\rangle$. This algorithm requires the knowledge of a value $\sqrt{\gamma}$ that lower-bounds $\|\Pi_L |b'\rangle\|$, where $\Pi_L$ is the projector onto the support of $L$.

Inserting the previously given expressions for the relevant parameters in Eqs. (75–76) results in a runtime scaling, up to polylogarithmic factors, as $\sqrt{\kappa}/\sqrt{\gamma}$ times factors polynomial in $J$, $2^s$ and $d_b$, assuming that $J$, $2^s$ and $d_b$ have polynomial dependence on $n = \log_2 N$. A quadratic speed-up in $\kappa$ (up to polylogarithmic factors) is thus achieved over general QLS solvers when $\gamma \in \Omega(1)$.
We remark that the family of Sum-QLS instances where all the parameters in Proposition 16 scale polynomially (with the promise, in particular, that Eq. (91) holds for some $\gamma$ with $1/\gamma \in O(\mathrm{poly}\, n)$) can be solved in polynomial time on a quantum computer, as was already shown in Ref. [16]. This means that the subset of problems having a polynomial scaling of the parameters, which we denote Sum-QLS$_{\mathrm{poly}}$, is contained in BQP. Moreover, the reduction in Appendix D can map polynomial-sized quantum circuits onto an instance of Sum-QLS$_{\mathrm{poly}}$, thus showing that Sum-QLS$_{\mathrm{poly}}$ is also BQP-hard. These two inclusions then show that Sum-QLS$_{\mathrm{poly}}$ is BQP-complete.

Discussion and outlook
In this work we have presented two algorithms for solving QLS problems in the case where the coefficient matrix is positive definite, having (for certain problem instances) a runtime in $O(\sqrt{\kappa})$, a quadratic improvement over what can be obtained using general QLS solvers. This improvement has the potential to greatly expand the classes of problems where quantum computation can provide a quantum speed-up. For instance, the discretisation of partial differential equations in $D$ dimensions results in a PD linear system with $\kappa \in O(N^{2/D})$ [14], and thus improving the runtime from $O(\kappa)$ to $O(\sqrt{\kappa})$ is crucial to yield a quantum speed-up in the physically relevant cases $D = 2$ and $D = 3$. As a second example, it is possible to estimate the hitting time of a Markov chain by solving a QLS for the matrix $A = I - S$, where $S$ is related to the discriminant matrix of the Markov chain [16]; since $A$ is positive definite and decomposable as a sum of PD local Hamiltonian terms [16, Appendix A], our second algorithm could be applicable to this problem.
In the spirit of finding real-world applications of quantum algorithms in the near future, we note that there is considerable interest in the possibility of realising QLS solvers on Noisy Intermediate-Scale Quantum (NISQ) devices [10,11], and we argue that some of our results might be implementable on NISQ devices too. In particular, the crux of our first algorithm is to find a good polynomial approximation of $A^{-1}$ with a degree in $O(\sqrt{\kappa})$ and then implement it with the quantum signal processing method [28,25]. The quadratic reduction in the degree of the polynomials makes their realisation more compatible with the next generation of quantum processors.
We note that many further improvements and extensions of our algorithms may be possible. Regarding the first algorithm (Section 4), it would be important to extend the classes of matrices for which a normalised matrix-block-encoding of $B = I - \eta A$ can be efficiently implemented, so as to make the method more generally applicable. Regarding the second algorithm (Section 5), we note that the specific choice of the generalised pseudo-inverse $L_g$ in the classical step results in an $O(1/\sqrt{\gamma})$ multiplicative overhead in the runtime, where $\gamma$ is given in Eq. (91); an open question is whether a different choice of pseudo-inverse could reduce, or eliminate altogether, this overhead. We also mention the possibility that the decomposition of $A$ as a sum of local PD Hamiltonians could be computed on-the-fly by the solver, instead of being given as an external input. We note that if the sparsity pattern of $A$ satisfies certain conditions (its associated graph is chordal), a decomposition $A = LL^\dagger$ that does not increase the sparsity of $A$ exists, and the characterisation given in Ref. [50, Theorem 2.6] could be employed to compute it. We finally mention an open research direction that may be worth investigating. The eigenpath traversal method has been used in several recent algorithms to solve the QLS problem with time complexity in $O(\kappa)$ [7,8,10,11,12]; this method is simpler than the Variable-Time Amplitude Amplification (VTAA) method and also results in (marginally) improved runtimes; however, it is not directly applicable to pseudo-inversion problems. It would be interesting to adapt the eigenpath traversal method to the case where the coefficient matrix is singular. As a by-product, it could then replace the algorithm of Ref. [19] as the subroutine used in our second algorithm to solve the pseudo-inversion problem, making the latter more practical.

Acknowledgments
This work was supported by the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037). Some ideas present in the paper originated from discussions with Anirban Chowdhury while DO was visiting the Center for Quantum Information and Control (CQuIC). The authors acknowledge discussions with Markus Mieth and Andreas Spörl, and thank András Gilyén for reading the manuscript and for giving us several insightful comments and suggestions.

Appendices

A Proof of the query complexity lower bound

In this Appendix we give the proof of the query complexity lower bound presented in Section 3, in a more extended form: the query complexity is in $\Omega(\min(\kappa, N))$, independently from the access model for $b$, when $A$ is a matrix with constant sparsity. More precisely, we have:

Proposition 17 (Query complexity lower bound). Consider oracular quantum algorithms that solve the PD-QLS problem as presented in Definition 2 for different access models to $A$ and $b$. Namely, access to $b$ is given via a state preparation oracle $U_b$ (Definition 3), while access to $A$ is given either via a sparse-matrix oracle $P_A$ (Definition 4) or via a matrix-block-encoding $U_A$ (Definition 5). Then, PD-QLS solving algorithms reaching a constant precision $\varepsilon \in O(1)$ have query complexities $Q[U_b]$, $Q[U_A]$, $Q[P_A]$ all in $\Omega(\min(\kappa, N))$.
In the main text we prove a weaker result using a reduction to the quantum search problem, which has a query complexity in $\Omega(\sqrt{N/M})$. We assume that we have access to $y$ via a quantum oracle $P_y$ that acts as $P_y |i, z\rangle = |i, z \oplus y_i\rangle$ for all $i \in [N]$ and $z \in \{0, 1\}$. We also recall that the two-sided bounded-error quantum query complexity $Q_2$ of a boolean function is defined as the minimum number of accesses to the input of the function (i.e., to $P_y$) that are necessary to correctly output the value of the function with probability at least $2/3$, both for positive and for negative instances. Then we have the following Lemma: $Q_2(\mathrm{PromiseMajority}_M) \in \Omega(N/M)$. We now show that (relativising) PD-QLS solving algorithms can be used to compute $\mathrm{PromiseMajority}_M$; the lower bound on the query complexity of $\mathrm{PromiseMajority}_M$ then directly translates into a lower bound on the query complexity of the PD-QLS solvers. We will prove separately the three cases of Proposition 17, with each proof building upon the previous ones.

Proof. Case 1.
We assume that $y \in \{0,1\}^N$ is in the domain of $\mathrm{PromiseMajority}_M$, i.e. $y$ contains either exactly $N/2 + M/2$ zeros or exactly $N/2 + M/2$ ones, and we define $b \in \mathbb{C}^{N+1}$, where the entry $b_{N+1}$ is fixed to provide a "phase reference" and avoid ambiguity in the global sign. We have $|b\rangle = b/\sqrt{2N + M}$, and $|b\rangle$ can be implemented by first preparing a state proportional to $(1, \ldots, 1, \sqrt{N+M})$ and then applying the correct phases; this can be done with one oracle call to each of $P_y$ and $P_y^\dagger$, via the transformations in Eq. (A.2), extended by linearity to superpositions. Next, we introduce the vector $\mathbf{1}_N := (1, \ldots, 1)^T$ containing $N$ ones and define $K$ as a matrix of size $(N+1) \times (N+1)$ satisfying $K^2 = K$, and finally $A := I - (1 - \epsilon)K$, where $\epsilon$ is a small parameter that we will define shortly. The matrix $A$ can be used as a coefficient matrix for a PD-QLS solver since $A$ is positive definite and $\|A\| = 1$; moreover, the condition number of $A$ is exactly $\kappa(A) = 1/\epsilon$. Next, we expand $A^{-1} = \sum_{t \geq 0} \big((1-\epsilon)K\big)^t$, where the summation converges since $\|(1-\epsilon)K\| < 1$, and apply $A^{-1}$ to $b$. We then perform a projective measurement for which one of the possible outcomes reveals the value $f = \mathrm{PromiseMajority}_M(y)$: the cases $f = 0$ and $f = 1$ in Eq. (A.11) can be distinguished with constant advantage. Note that the two cases can still be distinguished with constant probability if we replace $|x\rangle = A^{-1}|b\rangle$ with any approximation $\rho_x$ which is sufficiently close to it. To summarise, suppose we have to solve a $\mathrm{PromiseMajority}_M$ problem and we can exploit as a subroutine an oracular quantum algorithm $\mathcal{A}$ that, given access to $U_b$, prepares the state $|A^{-1} b\rangle$ with sufficiently high precision; suppose moreover that $\mathcal{A}$ has a query complexity $Q[U_b] = g(\kappa)$, for some function $g : \mathbb{R}_+ \to \mathbb{N}$. Then, $\mathcal{A}$ can be used to prepare the state in Eq. (A.11) and solve $\mathrm{PromiseMajority}_M$ with constant distinguishing advantage, at a $P_y$-query cost proportional to $g(\kappa)$. The lower bound $Q_2(\mathrm{PromiseMajority}_M) \in \Omega(N/M)$ then directly implies $g(\kappa) \in \Omega(\min(\kappa, N))$.
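The spectral claims used in this construction are easy to verify numerically. The sketch below uses the simplified $N \times N$ projector $K = \frac{1}{N}\mathbf{1}\mathbf{1}^T$ (dropping the phase-reference row and the sign pattern of $y$, which do not affect the spectrum):

```python
import numpy as np

N, eps = 64, 0.01
K = np.ones((N, N)) / N                   # rank-one projector, K^2 = K
A = np.eye(N) - (1 - eps) * K             # positive definite, ||A|| = 1
assert np.allclose(K @ K, K)

# Spectrum of A is {eps (once), 1 (N-1 times)}, so kappa(A) = 1/eps exactly
evals = np.linalg.eigvalsh(A)
assert np.isclose(evals[0], eps) and np.isclose(evals[-1], 1.0)
assert np.isclose(np.linalg.cond(A), 1 / eps)

# Geometric series: A^{-1} = sum_t ((1-eps) K)^t = I + ((1-eps)/eps) K
Ainv = np.eye(N) + ((1 - eps) / eps) * K
assert np.allclose(Ainv @ A, np.eye(N))
```

The inverse strongly amplifies the uniform component of $b$, which is exactly what makes the majority value readable from $|A^{-1}b\rangle$.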

Proof. Case 2.
We modify the construction given in the previous proof and encode the input $y$ in the entries of the coefficient matrix, with the goal of showing that $Q[U_A] \in \Omega(\min(\kappa, N))$. To this end, we define the vector $u \in \mathbb{R}^{N+1}$ and a diagonal matrix $D \in \mathbb{R}^{(N+1)\times(N+1)}$ such that $b = Du$. It is possible to implement exactly (i.e., ideally with zero error) a normalised matrix block-encoding of $A$ using at most 4 calls to $P_y$. First, we consider the unitary $\mathrm{Had}$ that prepares the state $|\mathbf{1}\rangle \propto (1, 1, \ldots, 1, 0)^T$, that is, $\mathrm{Had}\,|0\rangle = |\mathbf{1}\rangle$. Then, one constructs a normalised matrix-block-encoding $U_{A'}$ of $A' = I - (1 - \epsilon)K'$. Note that the matrix in the centre of this construction can be interpreted as the $|0\rangle\langle 0|$-controlled version of the Pauli-X rotation $e^{i\theta X}$ (with $\cos\theta = -(1 - \epsilon)$) and is thus efficiently implementable. The operations in Eq. (A.2) correspond to a unitary quantum circuit that can be written as $D \oplus U$, for some unitary $U$, and finally we obtain that $U_A := (D \oplus U)\, U_{A'}\, (D \oplus U^\dagger)$ is a matrix-block-encoding of $A$. We now consider the linear system $A'x = u$ and a quantum algorithm that prepares the corresponding solution state $|D A^{-1} b\rangle$. The state $|D A^{-1} b\rangle$ can be transformed into $|A^{-1} b\rangle$ using the steps given in Eq. (A.2), which only require two extra accesses to $P_y$. This state allows solving $\mathrm{PromiseMajority}_M$ with constant probability, and thus the same considerations made in the preceding proof yield the result $Q[U_A] \in \Omega(\min(\kappa, N))$.

Proof. Case 3.
We start by proving a lower bound on the query complexity $Q[U_b]$, as in Case 1, but for a PD-QLS where the coefficient matrix $A$ is sparse; then, we use the method from the proof of Case 2 to convert it into a lower bound on $Q[P_A]$.
Given $y \in \{0,1\}^N$ satisfying the $\mathrm{PromiseMajority}_M$ condition, we introduce the known term $b \in \mathbb{R}^{N+1}$, where $c_0$ is a positive constant that we will fix later, and we have $|b\rangle = b/\sqrt{N(1 + c_0^2)}$. Next, we define $B' \in \mathbb{R}^{(N+1)\times(N+1)}$ in terms of $B \in \mathbb{R}^{N \times N}$, a symmetric sparse matrix having $d$ non-zero entries in each row and column, all equal to $\frac{1}{d}$, for some constant $d$. Since each row and column of $B$ sums to one, $B$ can be interpreted as the transition matrix of a Markov chain on a $d$-sparse graph. We require that the graph corresponding to $B$ is not bipartite and that the spectral gap of $B$ is large (i.e., the Markov chain is ergodic and rapidly mixing). These properties guarantee that $B^t$ quickly converges to $K = \frac{1}{N}\mathbf{1}_N \mathbf{1}_N^T$ for $t \to \infty$. Since $B$ is symmetric, its spectrum is real and, because of the Perron–Frobenius theorem [52], the spectrum is contained in the interval $[-1, +1]$ and includes an eigenvalue $\lambda = 1$ with multiplicity one; the fact that $B^t$ converges to $\frac{1}{N}\mathbf{1}_N \mathbf{1}_N^T$ implies that $-1$ cannot be an eigenvalue of $B$. We then define the spectral gap $\delta(B)$ as the positive parameter $\delta(B) := \min_{\lambda \neq 1} \{1 - |\lambda|\}$, where the minimum is taken over the eigenvalues of $B$.

There are families of so-called expander graphs for which both the sparsity and the spectral gap are constant, see [53, Chapter 21]. We assume that $B$ belongs to one of these expander families; in particular, there is a (known) positive constant $c_1$ bounding the inverse spectral gap for all sizes $N \in \mathbb{N}$. Moreover, $\mathbf{1}_N$ is the unique $+1$ eigenvector and hence, from the spectral decomposition of $B$, we can write $B = K + R$ with $\|R\| = 1 - \delta(B)$, $K^2 = K$ and $KR = RK = 0$. We then define the coefficient matrix $A$ in terms of $B'$ and a small parameter $\epsilon > 0$ that we will fix shortly. The matrix $A$ can be used as a coefficient matrix in a PD-QLS solver since it is positive definite and has unit norm; moreover, $A$ is $(d+1)$-sparse (where $d$, the sparsity of $B$, is constant) and the condition number of $A$ is exactly $\kappa(A) = 1/\epsilon$. Expanding $A^{-1}$ as a geometric series and applying $A^{-1} = I + \tilde B$ to $b$, we obtain Eq. (A.19). Then, fixing the constant $c_0$ as $c_0 = 100 c_1$ and using the triangle inequality, we obtain an upper bound (assuming $c_1 \geq 1$) and a matching lower bound, which together allow distinguishing the two cases $f = \mathrm{PromiseMajority}_M(y) \in \{0,1\}$. Next, we proceed as in the proof of Case 2 and define an equivalent PD-QLS where the vector $y$ is encoded in the entries of the coefficient matrix. We thus define the vector $u \in \mathbb{R}^{N+1}$ and the diagonal matrix $D \in \mathbb{R}^{(N+1)\times(N+1)}$ such that the identity $b = Du$ holds. We then introduce $A'$, obtained by conjugating with $D$ the matrix $A$ of Eq. (A.23). The matrix $D$ is unitary and self-inverse, hence $\kappa(A') = \kappa(A)$. Notice that the positions of the non-zero entries of $A'$ are the same as in $A$, and thus independent of $y$, while the sign of a non-zero entry, $A'_{i,j} = (-1)^{y_i + y_j} A_{i,j}$, can be obtained by querying $P_y$ once with input $|i\rangle$ and once with input $|j\rangle$. These queries can be performed in quantum superposition, and thus two accesses to $P_y$ are sufficient to implement a quantum sparse-matrix-access $P_{A'}$.
Therefore it is possible to prepare, with the same $P_y$-complexity as discussed previously, the state $|D A^{-1} b\rangle$. Finally, we can obtain $|A^{-1} b\rangle$ using two extra accesses to $P_y$ by applying the transformations given in Eq. (A.2). This proves that a quantum algorithm that solves the PD-QLS having access to $P_A$ necessarily has a query complexity $Q[P_A] \in \Omega(\min(\kappa, N))$.
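The mixing behaviour $B^t \to K$ invoked in this proof is easy to illustrate. The toy below uses the simple random walk on the complete graph, which has a constant spectral gap but growing degree, so it illustrates only the convergence, not the constant-sparsity requirement of a true expander family:

```python
import numpy as np

# Symmetric, doubly-stochastic B: random walk on the complete graph K_N
N = 32
B = (np.ones((N, N)) - np.eye(N)) / (N - 1)
K = np.ones((N, N)) / N                     # rank-one limit (1/N) 1 1^T

evals = np.sort(np.linalg.eigvalsh(B))      # spectrum: 1 (once), -1/(N-1) (N-1 times)
gap = 1 - np.abs(evals[:-1]).max()          # spectral gap delta(B), excluding lambda = 1
assert np.isclose(evals[-1], 1.0) and gap > 0.9

Bt = np.linalg.matrix_power(B, 10)
assert np.linalg.norm(Bt - K) < 1e-10       # rapid mixing: B^t -> K
```

A large gap $\delta(B)$ is what makes the geometric series for $A^{-1}$ separate cleanly into the $K$ and $R$ contributions.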

B Scaling of the normalisation factor of the matrix-block-encoding
In this Appendix we consider the polynomial $P_{2\ell-1,\kappa}(x)$ defined in Eq. (26). We prove that the normalisation factor $K := 2 \max_{x \in [-1,+1]} P_{2\ell-1,\kappa}(x)$ satisfies $K \in \Theta(\kappa)$, provided that $\ell \geq c\sqrt{\kappa}$ for some constant $c$ that we will determine later. Notice that by construction $P_{2\ell-1,\kappa}(1) = 0$ and the polynomial is positive for $x \in [-1, +1)$.
To study the properties of the local maxima of $P_{2\ell-1,\kappa}(x)$ in the interval $[-1, +1]$ we compute its derivative using the property $\partial_x T_\ell(x) = \ell\, U_{\ell-1}(x)$, where $U_{\ell-1}$ is a Chebyshev polynomial of the second kind. We set the derivative equal to $0$ and simplify the resulting expression assuming $1 - x \neq 0$ and $1 - T_{\ell,\kappa}(x) \neq 0$. Then we use the change of variables $y(x)$ and $\delta(\kappa)$ given in Eqs. (22), together with the inverses $\kappa(\delta) = \frac{1}{\delta} + \frac{1}{2}$ and $x(y) = \frac{y - \delta/2}{1 + \delta/2}$, to rewrite the stationarity condition as Eq. (B.6). From (B.6) we obtain the inequality $\frac{1}{2}\big(e^{\ell\sqrt{\delta}} - 1\big) > 2\ell^2\delta$, which is satisfied for $\ell \geq 4.36/\sqrt{\delta}$. Since, moreover, $P_{2\ell-1,\kappa}(1) = 0$, the function $P_{2\ell-1,\kappa}(x)$ does not have any local maximum in $\big[-1, 1 - \frac{1}{\kappa}\big]$ and must have one or more local maxima $x_* \in \big(1 - \frac{1}{\kappa}, +1\big)$; equivalently, Eq. (B.6) must have at least one solution $y_* \in (1, 1+\delta]$. Substituting $T_\ell(y_*)$ in the definition (B.1) then shows that any local maximum $x_*$ satisfies the bound in Eq. (B.9). In summary, it suffices to show $1 + \delta - y_* \in \Omega(\delta)$ to prove that the normalisation constant satisfies $K = 2\max_{\{x_*\}} |P_{2\ell-1,\kappa}(x_*)| \in O(\kappa)$, where the maximisation is over the set of (potentially multiple) local maxima $x_* \in \big(1 - \frac{1}{\kappa}, +1\big)$. To this end, we rewrite Eq. (B.6) once more: since both the first and the second derivative of $T_\ell(y)$ are positive for $y \geq 1$ (i.e., the derivative of $T_\ell(y)$ is monotonically increasing), the equation can be satisfied only if the condition in Eq. (B.14) holds; equivalently, defining $1 + \delta_* := y_*$ and using the explicit expression of a Chebyshev polynomial of the second kind, we obtain the bound (B.18). (In this appendix we employ the notation $\overset{!}{\leq}$ to mark inequalities that we still need to prove, and $\leq$ for inequalities that have already been proven.)

C VTAA optimization improving the runtime values of the first algorithm
In this Appendix we use VTAA to speed up the asymptotic runtime of the algorithm based on polynomial approximations of $1/(1-x)$. Preliminarily, we introduce a Lemma showing that this VTAA algorithm, as summarised in Proposition 13, does provide an asymptotic query complexity improvement (compared to general QLS solvers) for most values of $\Gamma_{A,b}$, see the right plot in Figure 3.
under the usual assumption that the spectrum of $A$ is contained in $[1/\kappa, 1]$.

C.1 Variable-time amplitude amplification review
In this Section we review the general VTAA method, mainly following Ref. [19,Section 3].
By construction, a variable-stopping-time quantum algorithm produces a quantum state of the form $\mathcal{A}|0\rangle_{\mathrm{all}} = \sum_{j=0}^{m} \alpha_j |1_j\rangle_C |\Psi_j\rangle_{F,S}$, where $|1_j\rangle_C$ is a state having the clock qubit $C_j$ in $|1\rangle$ and the other clock qubits in $|0\rangle$, while the $|\Psi_j\rangle_{F,S}$ are normalised quantum states in $\mathcal{H}_F \otimes \mathcal{H}_S$ with $\sum_{j=0}^m |\alpha_j|^2 = 1$. Definition 22 (Stopping times). We introduce a sequence of stopping times $t_{\min} \equiv t_0 < t_1 < t_2 < \ldots < t_m \equiv t_{\max}$, where each $t_j \in \mathbb{N}$ is the complexity of the sub-algorithm $\mathcal{A}_{\leq j} := \mathcal{A}_j \cdots \mathcal{A}_0$. The probability $p_j$ of stopping at time $t_j$ (i.e., after the execution of algorithm $\mathcal{A}_{\leq j}$) and the $\ell^2$-average stopping time $t_{\mathrm{avg}}$ are defined accordingly, together with projectors onto the "bad" and "maybe good" subspaces after the application of $\mathcal{A}_{\leq j}$. The "maybe good" subspace at step $j+1$ is contained in the "maybe good" subspace at step $j$. In the definition above, the $t_j$ can represent any complexity measure (e.g., query complexity, gate complexity, circuit depth). In this Appendix we only compute the query complexity $Q$ of the sub-algorithms $\mathcal{A}_{\leq j}$, but we remark that all the algorithms we consider here are gate-efficient, i.e., the gate complexity is in $O\big(Q\,\mathrm{poly}(\log Q, \log N)\big)$.
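Assuming the standard VTAA definition $t_{\mathrm{avg}} = \big(\sum_j p_j t_j^2\big)^{1/2}$ (the displayed equation is lost in our copy, so this is an assumption), the quantities of Definition 22 can be tabulated as follows; the stopping-time profile is invented for illustration:

```python
import numpy as np

# Stopping times t_0 < ... < t_m and stopping probabilities p_j (sum to 1)
t = np.array([1, 2, 4, 8, 16, 32], dtype=float)
p = np.array([0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
assert np.isclose(p.sum(), 1.0)

# l2-average stopping time (assumed definition, see lead-in)
t_avg = np.sqrt((p * t ** 2).sum())

# VTAA pays roughly t_max + t_avg / sqrt(p_succ) rather than t_max / sqrt(p_succ):
# a win whenever most of the amplitude stops early, i.e. t_avg << t_max
assert t_avg < t[-1]
```

For this profile $t_{\mathrm{avg}} \approx 5.6$ while $t_{\max} = 32$, which is exactly the regime where VTAA pays off.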

Definition 23 (Variable-Time Amplitude Amplification). Given a variable-stopping-time algorithm
$\mathcal{A} = \mathcal{A}_m \cdots \mathcal{A}_0$ as in Definition 21 and given a sequence of $m+1$ non-negative integers $(k_0, k_1, \ldots, k_m)$, we recursively define a variable-time amplification $\mathcal{A}' = \mathcal{A}'_m \cdots \mathcal{A}'_0$ as follows. Setting $\mathcal{A}'_{-1} := I$, each $\mathcal{A}'_j$ implements a standard $k_j$-step amplitude amplification that uses $\mathcal{A}_j \mathcal{A}'_{j-1}$ and its inverse a total of $2k_j + 1$ times, where the input state is $|\psi_j^{\mathrm{in}}\rangle = \mathcal{A}_j \mathcal{A}'_{j-1} |0\rangle_{\mathrm{all}}$ and the target state is $\Pi_j^{\mathrm{mg}} |\psi_j^{\mathrm{in}}\rangle$. That is, $\mathcal{A}'_j$ begins by preparing the input state and then uses $k_j$ accesses to the reflections $R_j^{\mathrm{out}} := I - 2\Pi_j^{\mathrm{mg}}$ and $R_j^{\mathrm{in}} := I - 2|\psi_j^{\mathrm{in}}\rangle\langle\psi_j^{\mathrm{in}}|$. Note that if we set $k_0 = \ldots = k_m = 0$ we have $\mathcal{A}' \equiv \mathcal{A}$. Also, by the recursive structure of VTAA, the first sub-algorithm $\mathcal{A}_1$ is used a total of $\prod_{j=0}^m (2k_j + 1)$ times in $\mathcal{A}'$, which grows exponentially in $m$ if $k_j \geq 1$. Nonetheless, VTAA can provide a speed-up when the amplification parameters $(k_0, k_1, \ldots, k_m)$ are chosen appropriately. More precisely, the following result can be derived from Ref. [19, Lemma 22].
Proposition 24 (Result of VTAA). Using the notation of the previous definitions, let $\mathcal{A}'$ be a variable-time amplification such that each $\mathcal{A}'_j$ uses $k_j$ steps of amplitude amplification, with the $k_j$ satisfying the conditions in Eq. (C.10). Then success is heralded by $|1\rangle_F$, the success probability satisfies $p_{\mathrm{succ}} \in \Theta(1)$, and the global query complexity is given in Eq. (C.12). We can compare Eq. (C.12) with standard amplitude amplification, which has a query complexity $Q \in O(t_{\max}/\sqrt{p_{\mathrm{succ}}})$. In our algorithm, $m$ will scale logarithmically in $t_{\max}$, so the total runtime is $O(t_{\max} + t_{\mathrm{avg}}/\sqrt{p_{\mathrm{succ}}})$; hence, VTAA can provide a speed-up when the average runtime is much shorter than the maximum runtime. We remark that a sequence of good values $(k_0, \ldots, k_m)$ satisfying the conditions in Eq. (C.10) can be obtained efficiently by means of an iterative classical-quantum pre-processing algorithm. Specifically, suppose that some values $(k_0, \ldots, k_j)$ satisfying (C.10) have already been found; then, one can compile the corresponding VTAA algorithm $\mathcal{A}'_j$ and use phase estimation on the state $|\psi_{j+1}^{\mathrm{in}}\rangle = \mathcal{A}_{j+1}\mathcal{A}'_j |0\rangle_{\mathrm{all}}$ to obtain an estimate of $\sin(\theta_{j+1})$ up to constant multiplicative precision; this allows finding a value $k_{j+1}$ which satisfies (C.10) with probability $1 - p_{\mathrm{fail}}$ at a cost $O\big(\frac{1}{\theta_{j+1}} \log(1/p_{\mathrm{fail}})\big)$ [30], which is asymptotically equal to the query complexity of $\mathcal{A}'_{j+1}$, apart from a multiplicative $\log(1/p_{\mathrm{fail}})$ overhead. Once $k_{j+1}$ has been precomputed, one can directly compile $\mathcal{A}'_{j+1}$, without the need to perform phase estimation "online". The total failure probability is $p^{\mathrm{tot}}_{\mathrm{fail}} \leq m\, p_{\mathrm{fail}}$, and the query cost of the hybrid classical-quantum pre-processing has only an $O(\log(1/p_{\mathrm{fail}}))$ multiplicative overhead compared to the expression in Eq. (C.12).
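The building block of each $\mathcal{A}'_j$ is standard $k$-step amplitude amplification, which in the two-dimensional (bad, good) subspace acts as a rotation by $2\theta$ per step. A minimal numerical check of the $\sin((2k+1)\theta)$ amplification law:

```python
import numpy as np

theta = 0.05                                      # initial good amplitude sin(theta)
psi0 = np.array([np.cos(theta), np.sin(theta)])   # (bad, good) components

refl_target = np.diag([1.0, -1.0])                # flips the sign of the good part
refl_init = 2 * np.outer(psi0, psi0) - np.eye(2)  # reflection about |psi0>

k = 7
state = psi0.copy()
for _ in range(k):                                # each step rotates by 2*theta
    state = refl_init @ (refl_target @ state)

# good amplitude after k steps is sin((2k+1) * theta)
assert np.isclose(state[1], np.sin((2 * k + 1) * theta))
```

The $2k+1$ preparation count in Definition 23 is exactly the cost of $k$ such reflection pairs plus the initial preparation.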

C.2 Efficient implementation of windowing polynomials
In this Section we introduce the "windowing functions" that we use to replace the Gapped Phase Estimation (GPE) subroutine, introduced in Ref. [5]. Using GPE gives an additive O(κ) runtime overhead, which is too costly for our purposes.
A family of polynomials satisfying the inequalities in (C.13) is already given in Ref. [25, Lemma 29], but those polynomials have a degree $\ell \in O(\delta^{-1})$. We seek to achieve the same result with a quadratically smaller degree, $\ell \in O(\delta^{-1/2})$. Also, it is crucial that the polynomial approximation has even parity: using QSP one can implement matrix polynomials when the polynomial $P$ satisfies $|P(x)| \leq 1$ for $|x| \leq 1$ and $P$ has definite parity, while for a polynomial $P$ without definite parity the more restrictive constraint $|P(x)| \leq 1/2$ has to be satisfied (see Theorem 8).
Proof. The proof proceeds in two steps: first, we find an analytic function $f(x)$ that satisfies the inequalities in (C.13) within error $\epsilon/2$; then we show that the Chebyshev expansion converges to it so quickly that choosing a degree $\ell \in O(\delta^{-1/2})$ is sufficient to be within $\epsilon/2$-distance from $f(x)$ for all $x$.
Second part: We consider the Chebyshev series associated to $W_{\sigma,\delta}(x)$. Note that the function $W_{\sigma,\delta}(x)$ has even parity, hence the associated Chebyshev series also has even parity [54]. Truncating the Chebyshev series at degree $\ell$, we get a sequence of polynomials $[S_\ell W_{\sigma,\delta}](x)$ converging uniformly to $W_{\sigma,\delta}(x)$ for $\ell \to \infty$. More precisely, we have the following result [54, Theorem 5.16].

Lemma 26. Suppose $f(x)$ can be extended to an analytic function on the ellipse $E_r$ (with foci $\pm 1$ and parameter $r > 1$), on which it is bounded in absolute value by $M$; then the degree-$\ell$ Chebyshev truncation has uniform error on $[-1,1]$ in $O(M\, r^{-\ell})$.
We apply this theorem to $f(x) = W_{\sigma,\delta}(x)$, which is analytic in the whole complex plane; the main technical hurdle is to upper bound $M$. We choose $r$ close to $1$, with $a$ and $b$ denoting the semi-axes of the ellipse $E_r$ along the real and imaginary axes, respectively, and bound $M = \sup_{E_r} |W_{\sigma,\delta}(z)| \leq M_{\mathrm{half}}^2$; from (C.22), it then suffices to maximise $-x^2 + y^2$ over the ellipse. Substituting $\cos\theta = u$ and $\sin^2\theta = 1 - u^2$, and using $\delta \in \Theta(\sigma \log \epsilon^{-1})$, we obtain $M_{\mathrm{half}} \in e^{O(\sqrt{\log \epsilon^{-1}})} \subset O(\epsilon^{-1})$ and thus also $M \leq M_{\mathrm{half}}^2 \in O(\epsilon^{-2})$. We then set $\ell \in \Theta\big(\frac{1}{\sqrt{\sigma}} \log(\epsilon^{-3}\sigma^{-1/2})\big)$ and compute the resulting uniform error bound, where $C$ and $D$ are positive constants. We recall that $\sigma \in \Theta(\delta/\log(\epsilon^{-1}))$ and thus $\ell \in O(\delta^{-1/2}\,\mathrm{polylog}(\epsilon^{-1}))$.
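The rapid Chebyshev-coefficient decay exploited in this proof can be observed numerically. As a stand-in for $W_{\sigma,\delta}$ (whose exact form is not reproduced here) we use a Gaussian window, which is likewise entire and even; the parameters are our own:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

sigma = 0.1
f = lambda x: np.exp(-x ** 2 / (2 * sigma ** 2))   # entire, even window function

coeffs = C.chebinterpolate(f, 80)                  # degree-80 Chebyshev interpolant
# even parity: all odd Chebyshev coefficients vanish (up to rounding)
assert np.max(np.abs(coeffs[1::2])) < 1e-12
# super-geometric coefficient decay: the top coefficient is already negligible
assert abs(coeffs[0]) > 1e-2 and abs(coeffs[-1]) < 1e-10

xs = np.linspace(-1.0, 1.0, 2001)
err = np.max(np.abs(C.chebval(xs, coeffs) - f(xs)))
assert err < 1e-9                                  # near machine-precision uniform error
```

A modest degree thus already achieves a uniform error far below any reasonable $\epsilon$, which is the quantitative content of the lemma.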

C.3 Polynomial approximations on increasingly larger domains
We now explain how the windowing functions can be used to select a domain on which a low-degree polynomial is a good approximation of the function $1/(1-x)$, when applied to eigenvalues contained in that domain. As usual, the spectrum of $A$ is contained in $D_A = [\frac{1}{\kappa}, 1]$ and, correspondingly, the spectrum of $B = I - \eta A$ is contained in $[1 - \eta, 1 - \eta/\kappa]$. We set $m := \lceil \log_2 \kappa \rceil + 1$, $\delta_j := \eta\, 2^{-j}$ for $j \in \{1, \ldots, m\}$, and fix $\tilde\epsilon > 0$, a parameter related to the target precision $\varepsilon$. According to Eq. (25) and Eq. (C.13), we can find polynomials $P_{\tilde\epsilon,\delta_j}(x)$ and $W_{\tilde\epsilon,\delta_j}(x)$ approximating the inverse and the window, respectively. We also define a normalisation factor, which can be set to coincide with the factor $K$ defined in Eq. (26), and introduce the shorthands $P_j := P_{\tilde\epsilon,\delta_j}$ and $W_j := W_{\tilde\epsilon,\delta_j}$. The windowing function $W_j(x)$ is used to select an interval $[-1 + 2\delta_j, 1 - 2\delta_j]$ where $P_j(x)$ is a good approximation of the inverse. For any eigenvalue $\lambda$ of $B$ we have $P_j(\lambda) \approx \frac{1}{K}\frac{1}{1-\lambda}$ when $\lambda \leq 1 - \delta_j$ and $W_j(\lambda) \approx 0$ when $\lambda \geq 1 - \delta_j$; more precisely, Eq. (C.38) holds, where the $\Delta_j^{\tilde\epsilon}$ are arbitrary matrices with operator norm smaller than or equal to $\tilde\epsilon$. Note that $W_j(B)$ and $P_j(B)$ commute, since they are both polynomial functions of $B$. Moreover, we can set without loss of generality $W_m(B) \equiv I$, since we already have $P_m(B) \approx A^{-1}/(\eta K)$ on the entire domain of $A$. Finally, we have $K \in \Theta(\kappa/\eta)$, as derived in the main text.

C.4 Variable-stopping-time PD-QLS solver: definition
We now proceed to reformulate our PD-QLS solver as a variable-stopping-time algorithm. For notational convenience, from now on we often drop the dependence of a polynomial on its variable and simply write $P$ instead of $P(x)$.
with the polynomial approximations $P_j(B)$ given in Eq. (C.38).

The algorithm $B = B_m \cdots B_1 B_0$ acts on the same Hilbert space as $\mathcal{A}$. The initialisation step is skipped, i.e., we set $B_0 = I$, and for all $j \geq 1$ we define $B_j$ as the application of the unitary $U_j$ given in Eq. (C.41), controlled on the clock qubits $C_1, \ldots, C_{j-1}$ being in $|0\rangle^{\otimes (j-1)}$; otherwise, the unitaries $U_j$ are not applied.
The complete PD-QLS solving algorithm is then defined as follows.
1. In an initial pre-processing step (which needs to be run only once), a sequence of integers $(k_0, k_1, \ldots, k_m)$ satisfying Eq. (C.10) is determined using a phase estimation algorithm.

2. The main part of the algorithm consists in running $\mathcal{A}'$, the VTAA version of $\mathcal{A}$.
3. The final post-processing consists in applying $B^\dagger$ to the output of $\mathcal{A}'$ in order to un-compute the clock register. We then measure the flag qubit and, when the result is $|1\rangle_F$ (which happens with constant probability), we output the $S$ register (which contains a state close to $|A^{-1} b\rangle$).

C.5 Variable-stopping-time PD-QLS solver: correctness
We will now analyse the correctness of Algorithm 27, and we begin by introducing a Lemma.

Lemma 28. The state after the application of $\mathcal{A}_k$ defined in Algorithm 27 is given by the following expression, for any input state $|\varphi\rangle_S$ and $f \in \{0, 1\}$.

Using that $P_j$, $M_j$ and $W_j$ are all functions of $B$ and thus commute, the expression at step $k+1$ follows from the one at step $k$; Equation (C.45) can be verified similarly.
Next, we consider the output state of $\mathcal{A}'$, the VTAA version of $\mathcal{A}$, which features an amplification of the amplitude of the $|1\rangle_F$ component; the success probability is constant, $p_{\mathrm{succ}} \in \Theta(1)$. We now claim that, for an appropriate choice of the algorithm parameters, the error in $|\psi_{\mathrm{succ}}\rangle_{C,S}$ is upper bounded by $O(\varepsilon)$, and we will prove this claim in the next section. Thus we arrive at Eq. (C.51), where $O(x)$ denotes an arbitrary vector with 2-norm bounded by $O(x)$. The final step of the PD-QLS algorithm consists in applying $\mathcal{B}^\dagger$ to the state in Eq. (C.51). Using again Eq. (C.45) we obtain the resulting state; measuring the flag in $|1\rangle_F$ then yields, with constant success probability, a vector that has $O(\tilde\varepsilon)$ distance from the ideal output $A^{-1} b$.

C.6 Variable-stopping-time PD-QLS solver: error bound
In order to derive Eq. (C.54) we use Eq. (C.40), writing the output error in terms of the normalisation factor $N$ defined below. To prove the last step, recall that $\sum_j |1_j\rangle_C\, M_{j-1} W_j A^{-1} |b\rangle_S$ is a normalised vector. As proven in the main text, $K \in O(\kappa/\eta)$. Thus we have $N \in \Omega(1/\kappa^2)$ and, choosing $\tilde\varepsilon$ small enough, we also have $\tilde\varepsilon\, m / N \in O(\varepsilon)$. This condition is sufficient to guarantee that the output state $|\psi_{\mathrm{succ}}\rangle$ is within $\varepsilon$-distance from the ideal output.

D Proof that the Sum-QLS problem is BQP-hard
In this Appendix we prove that the Sum-QLS problem is BQP-hard; therefore, no efficient classical algorithm can solve it (unless BPP = BQP). The initial part of the proof follows the reduction presented by HHL: it is possible to construct a QLS problem $M x = e_1$ (where $e_1$ is the canonical vector with a one in the first position) which is BQP-hard for a class of sparse indefinite matrices $M$ that are easily constructible. This can be converted into the equivalent QLS problem $M^\dagger M\, x = M^\dagger e_1$, where the matrix $A = M^\dagger M$ is by construction PD and $M^\dagger e_1$ is easy to prepare. What remains to be proven is that $A$ admits an explicit Sum-QLS structure, i.e., that one can construct a decomposition of $A$ as a sum of PD local Hamiltonian terms and that the overlap with the support (bounded by the parameter $\gamma$) scales polynomially. Preliminarily, we introduce two useful definitions: the specific quantum circuit model (universal for quantum computation) that we aim to "simulate" as a Sum-QLS, and the Feynman-Kitaev clock construction.

Quantum circuit model:
We describe an arbitrary quantum computation $C$ on $n$ qubits as the application of $T$ elementary gates $\{U_0, U_1, \ldots, U_{T-1}\}$ to an initial state $|0\rangle^{\otimes n}$; the output of the computation is the quantum state $U_{T-1} \cdots U_1 U_0 |0\rangle^{\otimes n}$. We assume that the elementary gate set consists of gates acting on either one or two qubits, e.g., arbitrary single-qubit rotations and CNOTs. Hence, the $t$-th elementary gate is described by a unitary $U_t \in \mathbb{C}^{N \times N}$ with $N = 2^n$, which can be written as $U_t = u_t \otimes I_{\neg S_t}$, where $u_t$ acts on a subset $S_t$ of the qubits and is either a $2 \times 2$ ($|S_t| = 1$) or a $4 \times 4$ ($|S_t| = 2$) unitary matrix.
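To make the notation $U_t = u_t \otimes I_{\neg S_t}$ concrete, here is a minimal sketch (the helper `embed_gate` is hypothetical, not from the paper) that embeds a one- or two-qubit gate acting on contiguous qubits into the full $2^n \times 2^n$ unitary; non-contiguous targets would additionally require a qubit permutation:

```python
import numpy as np

def embed_gate(u, k, n):
    """Sketch: return u embedded as I ⊗ u ⊗ I, so the q-qubit gate u acts
    on the contiguous qubits k, ..., k+q-1 of an n-qubit register
    (big-endian ordering). Non-contiguous targets would need a permutation."""
    q = int(np.log2(u.shape[0]))
    return np.kron(np.kron(np.eye(2 ** k), u), np.eye(2 ** (n - k - q)))

# example: a CNOT acting on qubits (1, 2) of a 4-qubit register
cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
U_full = embed_gate(cnot, 1, 4)
assert U_full.shape == (16, 16)
assert np.allclose(U_full @ U_full.T, np.eye(16))  # still unitary
```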

Feynman-Kitaev clock:
We introduce the following unitary matrix acting on $\mathbb{C}^{3T} \otimes \mathbb{C}^{2^n}$, based on the Feynman-Kitaev clock construction (see e.g. [55] and references therein): $W := \sum_{t=0}^{3T-1} |t+1\rangle\langle t|_c \otimes U_t$, where we have defined $U_t = I$ for $T \le t \le 2T - 1$ and $U_t = U^\dagger_{3T-t-1}$ for $2T \le t \le 3T - 1$, and the sums in the clock register (denoted by the subscript $c$) are taken modulo $3T$.
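A small sketch of this construction (the function name and the dense-matrix representation are assumptions of this illustration): build $W = \sum_t |t+1 \bmod 3T\rangle\langle t|_c \otimes U_t$ as a $3T\,2^n$-dimensional matrix, check that it is unitary, and check that $T$ applications map $|0\rangle_c|\psi\rangle$ to $|T\rangle_c \otimes U_{T-1}\cdots U_0|\psi\rangle$:

```python
import numpy as np

def clock_unitary(gates):
    """Sketch of the Feynman-Kitaev clock unitary for the 3T-step schedule
    in the text: T forward gates, T identities, then the T reversed daggers."""
    T = len(gates)
    N = gates[0].shape[0]
    Us = list(gates) + [np.eye(N)] * T + [g.conj().T for g in reversed(gates)]
    dim = 3 * T * N
    W = np.zeros((dim, dim), dtype=complex)
    for t, Ut in enumerate(Us):
        tp = (t + 1) % (3 * T)  # clock increment, modulo 3T
        W[tp * N:(tp + 1) * N, t * N:(t + 1) * N] = Ut
    return W

# toy example: a 1-qubit, 2-gate circuit (H then X)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
W = clock_unitary([H, X])
assert np.allclose(W @ W.conj().T, np.eye(12))  # W is unitary

# two applications map |0>_c |0> to |2>_c ⊗ X H |0>
v = np.zeros(12, dtype=complex); v[0] = 1.0
v = W @ (W @ v)
assert np.allclose(v[4:6], X @ H @ np.array([1.0, 0.0]))
```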
Proposition 29 (Sum-QLS is BQP-hard). Let $C$ be a given $n$-qubit, $T$-gate quantum circuit. Then, there is a Sum-QLS problem (as in Definition 14), with parameters scaling polynomially in $n$ and $T$ (points (1) to (6)), that is equivalent to $C$, i.e., solving this Sum-QLS problem (up to a small constant error $\varepsilon$) allows one to obtain the output state of $C$ with constant probability and constant precision. Consider the QLS problem $M x = e_1$ of Eq. (D.9), where $e_1 \in \mathbb{C}^{3T 2^n}$ is the unit vector with a one in the first position (i.e., it is the vector representation of the state $|0\rangle_c |0\rangle^{\otimes n}$). Using the state $|x\rangle = M^{-1} e_1$, the output state of the quantum computation is obtained whenever a measurement of the clock register returns a time $t \in [T : 2T - 1]$, which happens with probability $e^{-2}/(1 + e^{-2} + e^{-4}) \ge 0.11$. Therefore, any quantum circuit $C$ having $T$ gates can be restated as a QLS and solved (using e.g. the HHL algorithm) with a gate complexity scaling as $\mathrm{poly}(n, T)$. If the circuit $C$ has $T \in O(\mathrm{poly}\, n)$, then the QLS solver also has gate complexity in $O(\mathrm{poly}\, n)$, showing that the QLS problem is BQP-hard.
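The stated post-selection probability can be sanity-checked numerically:

```python
import math

# probability of measuring a clock time t in [T, 2T-1], as stated above
p = math.exp(-2) / (1 + math.exp(-2) + math.exp(-4))
assert p >= 0.11  # p ≈ 0.117
```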
We now consider the PD linear system $A x = M^\dagger e_1$ with $A = M^\dagger M$, which is equivalent to the system in Eq. (D.9) and can be cast as a Sum-QLS problem. Notice, first, that $b = M^\dagger e_1$ is a sparse vector, with $d_b \le 3$: if the elementary gate set consists of CNOTs and single-qubit rotations, then the first row of $M$ has at most 3 non-zero entries.
The matrix $A$ has size $3T 2^n \times 3T 2^n$ and can be written as a sum of terms $H^{(t)}$, so that the decomposition in Eq. (D.12) holds. Note that the matrix $\begin{pmatrix} I & -U_t \\ -U_t^\dagger & I \end{pmatrix}$ has eigenvalues $\lambda = 0$ and $\lambda = 2$, and hence is positive semi-definite, while each $H^{(t)}$ has smallest eigenvalue equal to $\delta$. Moreover, each $H^{(t)}$ is a local Hamiltonian. Specifically, we have $H^{(t)} = h^{(t)} \otimes I_{\neg S_t}$, where $h^{(t)}$ acts on a set $S_t$ consisting of $s = \lceil \log_2 3T \rceil + q$ qubits, with $q = 1$ if $U_t$ corresponds to a single-qubit gate and $q = 2$ if it corresponds to a two-qubit gate. The explicit form of $h^{(t)}$ is given in Eq. (D.13), in terms of the matrix $u_t$ specifying the $t$-th elementary quantum gate, with $U_t = u_t \otimes I_{\neg S_t}$.
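The claim that $\begin{pmatrix} I & -U_t \\ -U_t^\dagger & I \end{pmatrix}$ has only the eigenvalues 0 and 2 holds for any unitary $U_t$: the vectors $(v, \mp U_t^\dagger v)$ are eigenvectors with eigenvalue $1 \pm 1$. A quick numerical check with a random unitary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# random unitary from the QR decomposition of a complex Gaussian matrix
Z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
U, _ = np.linalg.qr(Z)

M = np.block([[np.eye(n), -U], [-U.conj().T, np.eye(n)]])
ev = np.sort(np.linalg.eigvalsh(M))  # M is Hermitian
# n eigenvalues at 0 and n eigenvalues at 2
assert np.allclose(ev[:n], 0.0) and np.allclose(ev[n:], 2.0)
```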
We now need to estimate a lower bound $\gamma$ for the overlap parameter, as in Eq. (91). Carrying out the computation and using $J = 3T$, we finally obtain the bound in Eq. (D.20). In conclusion, given the sequence of gates $u_t$ and the sets of qubits $S_t$ to which they are applied in the quantum circuit $C$, we can efficiently compute the values and the positions of the non-zero entries of $H^{(t)}$, as given in Eq. (D.13). This explicitly describes $A$ as a sum of PD local Hamiltonian terms, as required by the definition of the Sum-QLS problem. Going through the derivation, we see that the relevant parameters scale as stated in points (1) to (6) in the statement of the Proposition. Using Eq. (92), it follows that the algorithm presented in Section 5 solves this problem with a gate complexity in $O(\mathrm{poly}(n, T)) = O(\mathrm{poly}\, n)$ in the case where the original circuit is itself a polynomial-time quantum circuit (i.e., $T \in O(\mathrm{poly}\, n)$). Finally, note that the $\varepsilon$-error of the Sum-QLS solver is amplified at most by a constant factor when post-selecting the clock register on a time $t \in [T : 2T - 1]$. Thus, it is possible to obtain the output of $C$ with an error bounded by a constant, as required by the definition of the BQP class. This proves that the Sum-QLS problem is BQP-hard; at the same time, it proves that Sum-QLS$_{\mathrm{poly}}$, the sub-class of problems having polynomially scaling parameters, is solvable in quantum polynomial time and is thus BQP-complete.