Faster quantum mixing for slowly evolving sequences of Markov chains

In many applications, e.g. in physics, the cost of the computation reduces to the mixing time, i.e., the time required to reach the steady state of the Markov chain, which scales as δ^{-1}, the inverse of the spectral gap.


Introduction
Davide Orsucci: davide.orsucci@uibk.ac.at
Hans J. Briegel: hans.briegel@uibk.ac.at
Vedran Dunjko: v.dunjko@liacs.leidenuniv.nl

Markov chains (MCs) are central in computational approaches to physics [1], in computer science [2], and machine learning [3], and they form the crux of the ubiquitous Markov Chain Monte Carlo methods [4]. In MC-based approaches the underlying objective is to produce samples from the steady state, i.e., the stationary distribution of a given MC. The MC is constructed so that this distribution encodes the solution of the problem at hand. The solution can then be reached by "mixing", i.e., by applying the MC transitions many times. For some problems, mixing processes constitute the fastest known classical solving algorithms, and play a vital role, e.g., in the Metropolis-Hastings methods [5], periodic Gibbs sampling [6], and Glauber dynamics [7].
The fundamental parameter governing the time complexity of MC-based algorithms is thus the mixing time, that is, the number of steps required to attain stationarity. In most applications the MC is ergodic, i.e., has a unique stationary distribution, and time-reversible, i.e., satisfies detailed balance [8,9]. The mixing time is tightly related to the spectral gap δ of the MC and is bounded by Ω(δ^{-1}) [10].
Oftentimes direct mixing can be computationally prohibitive and thus heuristic methods, such as simulated annealing [11,12], are employed. Here one constructs a sequence of Markov chains which, for instance, encode the Gibbs (thermal) distributions at gradually decreasing values of the temperature, where the target distribution is specified by the final MC, i.e., the final temperature. Intuitively, this process increases efficiency by avoiding local minima, although the performance is typically not guaranteed. In simulated annealing, the neighbouring chains in the sequence are similar, in other words, the sequence is slowly evolving.
Figure 1: Schematic depiction of the scenarios of sequences of slowly evolving MCs. Panels (a,b) and (c,d) depict classical and quantum sampling tasks, respectively. Panels (a,c) and (b,d) respectively delineate finite and continuing sequences, the latter having step-wise outputs. This work is predominantly concerned with (b,d). Although panel (d) allows quantum states to be carried from one time-step to another, our algorithm actually works by forwarding just classical information, without sacrificing efficiency. That is, no quantum memory from one time-step to the next is required.

The emergence of quantum computation offers a new possibility to utilize quantum effects to achieve guaranteed mixing more rapidly. In particular, it has been conjectured that run-times of Õ(√(δ^{-1}))
should be possible [13] for the mixing problem. Such quadratic speed-ups have been demonstrated for various special cases of MCs [13-18], mostly relying on quantum walk [19,20] approaches. Quantum walks have also been utilized to speed up simulated annealing [21-23], which often leads to the best run-times in practice. However, considering provable results for guaranteed mixing of general Markov chains, the best quantum algorithms achieve Õ(√(δ^{-1}) √N), which falls short of the conjectured quadratic speed-up, as it introduces a dependence on the system size N. Avoiding the Õ(√N) dependence seems to be challenging, which further motivates investigating settings with relaxed constraints, e.g., by restricting the MC family [13,18,24].
In this work, we obtain improved Õ(√(δ^{-1}) N^{1/4}) run-times for mixing problems not by restricting the Markov chain family, but rather by relying on additional context. In particular, we consider the settings where we are tasked to sequentially produce independent samples from a sequence of slowly evolving Markov chains. This setting is natural in statistical and quantum physics, e.g., when studying phase boundaries, which requires many independent samples from near-by points in the parameter space [25]. Another motivation for studying this setting is in the context of machine learning (ML), appearing both in reinforcement learning [26] and in the training of generative models [27], as we discuss later in the paper.
Our setting is similar to simulated annealing in that it considers a sequence of pair-wise similar Markov chains, and indeed our methods are similar to those in [21,22]. However, our setting brings about a key distinction: in annealing the goal is to produce a sample from the final MC, and the intermediary chains have only an auxiliary role; in our case the goal is to produce independent samples from each MC in the sequence. (Throughout, Õ denotes an extension of the O notation where polylogarithmic multiplicative factors are neglected.) Further, in principle the sequence can be exponentially long, or have a length which is not specified a priori. This is schematically illustrated in Fig. 1.

Problem and methods
We now specify the setting more precisely and introduce the required notation. We consider finite-space MCs of size N, where a distribution over the space is specified by a vector π := (π(1), π(2), . . . , π(N))^T of non-negative entries summing to one. The problem of sampling from this distribution corresponds to producing a single element x ∈ {1, 2, . . . , N} according to π. Due to the methods used, our algorithms will actually produce a quantum (or coherent) sample, i.e., the quantum state |π⟩ := Σ_x √π(x) |x⟩, which is called the coherent encoding of π. As we elaborate later, coherent encodings have substantial advantages over classical samples, although they are in general computationally more difficult to prepare.
In abstract terms our sampling problem can be formulated as follows. We consider an infinite sequence of ergodic time-reversible Markov chains {MC_t}_{t=1}^∞. In our approach, we will at each time-step t generate a coherent sample, that is, the state |π_t⟩ corresponding to the stationary distribution π_t of MC_t. The measurement of |π_t⟩ in the computational basis yields a classical sample from π_t, thus preparation of |π_t⟩ allows for both classical and quantum sampling. We say that the sequence of MCs is slowly evolving if the stationary distributions of consecutive MCs in the sequence are sufficiently close, specifically, if at every time step t we have |⟨π_{t+1}|π_t⟩|^2 ≥ η for some constant 0 < η < 1.
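As an illustration of the slowly evolving condition, the following numerical sketch (ours, not part of the original construction) builds Metropolis chains encoding Gibbs distributions at slowly increasing inverse temperature and checks that the consecutive overlaps |⟨π_{t+1}|π_t⟩|² stay near one; all function names and parameters are illustrative.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a column-stochastic matrix P (P[y, x] = Pr[x -> y])."""
    w, v = np.linalg.eig(P)
    pi = np.abs(np.real(v[:, np.argmax(np.real(w))]))  # eigenvector for eigenvalue 1
    return pi / pi.sum()

def metropolis_chain(E, beta):
    """Column-stochastic Metropolis chain for energies E at inverse temperature beta."""
    N = len(E)
    P = np.zeros((N, N))
    for x in range(N):
        for y in range(N):
            if y != x:
                P[y, x] = min(1.0, np.exp(-beta * (E[y] - E[x]))) / (N - 1)
        P[x, x] = 1.0 - P[:, x].sum()
    return P

E = np.array([0.0, 0.3, 0.7, 1.0])             # toy energy landscape
betas = np.linspace(0.5, 3.0, 26)              # slowly increasing inverse temperature
pis = [stationary(metropolis_chain(E, b)) for b in betas]

# slowly evolving condition |<pi_{t+1}|pi_t>|^2 >= eta for the coherent encodings
overlaps = [np.dot(np.sqrt(p), np.sqrt(q)) ** 2 for p, q in zip(pis, pis[1:])]
```

For small temperature steps the minimum of `overlaps` stays close to 1, so a constant η exists for the whole sequence.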
Our techniques rely on Szegedy-type quantum walks [28] to perform the above sequential sampling task with a Õ(√(δ^{-1}) N^{1/4}) time complexity in the context of slowly evolving MCs. We thus briefly introduce the properties of the Szegedy construction for the convenience of the reader, and provide in App. A more background on MC theory.

Szegedy quantum walk
Each MC is specified by a stochastic matrix P which specifies the transition probabilities in a single step of the chain. For a given transition matrix P of an ergodic time-reversible MC one can construct the corresponding Szegedy quantum walk operator W(P); this is a unitary operator having the crucial property that |π⟩ is the unique +1-eigenstate of W(P), with all other eigenstates of W(P) having an eigenphase which is at least quadratically larger than the spectral gap δ of P. In other words, if |θ⟩ is such that W(P)|θ⟩ = e^{iθ}|θ⟩ with θ ≠ 0, then |θ| ∈ Ω(√δ); see App. B for details and the construction.
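A minimal numerical sketch of this construction (our own illustrative code, following the standard two-reflection form detailed in App. B): the walk operator is built as the product of a reflection about the span of the states |x⟩|p_x⟩ and its swapped counterpart, and its eigenphase gap can be compared against √δ directly.

```python
import numpy as np

def szegedy_walk(P):
    """Szegedy walk operator W(P) = ref_B . ref_A for column-stochastic P; returns (W, T)."""
    N = P.shape[0]
    T = np.zeros((N * N, N))              # isometry |x> -> |x> (x) sum_y sqrt(P[y,x]) |y>
    for x in range(N):
        for y in range(N):
            T[x * N + y, x] = np.sqrt(P[y, x])
    ref_A = 2 * T @ T.T - np.eye(N * N)   # reflection about span{|x>|p_x>}
    S = np.zeros((N * N, N * N))          # swap of the two registers
    for x in range(N):
        for y in range(N):
            S[y * N + x, x * N + y] = 1.0
    return (S @ ref_A @ S) @ ref_A, T

# lazy random walk on a 4-cycle: time-reversible, uniform stationary distribution
N = 4
shift = np.roll(np.eye(N), 1, axis=0)
P = 0.5 * np.eye(N) + 0.25 * (shift + shift.T)
W, T = szegedy_walk(P)

pi = np.full(N, 1.0 / N)
psi = T @ np.sqrt(pi)                     # |pi> lifted to the walk space: +1 eigenstate

delta = 1.0 - np.sort(np.linalg.eigvalsh(P))[-2]     # spectral gap (P symmetric here)
phases = np.angle(np.linalg.eigvals(W))
phase_gap = min(abs(t) for t in phases if abs(t) > 1e-8)
# quadratic relation: the smallest nonzero eigenphase is at least ~sqrt(delta)
```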
These properties allow us to realize useful quantum subroutines with a run-time which is quadratically smaller than the classical mixing time. Namely, the Szegedy walk operator can be used in conjunction with the phase detection algorithm [29], a simple variant of phase estimation [30], to (approximately) distinguish |π⟩ from all other eigenstates of W(P). The run-time is in Õ(√(δ^{-1})) and has only logarithmic dependence on the approximation error [29-32]. In turn, the capacity to identify |π⟩ can be leveraged to implement an approximate projective measurement onto |π⟩⟨π|, by measuring whether the quantum register containing the phase estimate is zero. Similarly, applying a Pauli-Z rotation to the qubit that specifies whether the phase value is zero, we obtain an approximation of the reflection operator R(π) := I - 2|π⟩⟨π|.

Amplitude amplification
This brings us to our key subroutine. Using the reflection R(π), we employ amplitude amplification [33,34] to rotate an initial state |ψ_in⟩ to an approximation of |π⟩. (All algorithms we consider are approximate, but the dependence on the target error ε is at most log²(ε^{-1}) and is thus always ignored in the Õ notation.) Equivalently, we can also use the fixed-point amplitude amplification algorithm of Yoder et al. [36], an algorithm that enjoys the same quadratic speed-up as standard amplitude amplification. We refer the reader to App. C for further details on fixed-point amplitude amplification and to App. D for an analysis of how the run-time depends on the target precision.
With these key subroutines defined, we can describe a straightforward algorithm that allows one to prepare coherent encodings |π⟩ with Õ(√(δ^{-1}) √N) run-time. One simply utilizes amplitude amplification to rotate the uniform superposition |u⟩ := (1/√N) Σ_x |x⟩ to the target state |π⟩. Since the target states are encodings of probability distributions, all amplitudes are real and non-negative and, thus, the fidelity γ := |⟨u|π⟩|² always satisfies γ ≥ 1/N. This lower bound is saturated by distributions approaching a Kronecker delta.
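The rotation of |u⟩ into |π⟩ can be simulated classically with real vectors, since both reflections act within a two-dimensional subspace. A sketch (ours, assuming for the moment that the overlap ⟨u|π⟩ is known, as in the text):

```python
import numpy as np

def amplitude_amplify(src, tgt, k):
    """Apply k Grover iterations G = (2|src><src| - I)(I - 2|tgt><tgt|) to |src>."""
    R_src = 2 * np.outer(src, src) - np.eye(len(src))
    R_tgt = np.eye(len(tgt)) - 2 * np.outer(tgt, tgt)
    v = src.copy()
    for _ in range(k):
        v = R_src @ (R_tgt @ v)
    return v

N = 64
pi = np.exp(-0.5 * np.arange(N)); pi /= pi.sum()   # a clumped target distribution
tgt = np.sqrt(pi)                                  # coherent encoding |pi>
src = np.full(N, 1.0 / np.sqrt(N))                 # uniform superposition |u>

gamma = float(src @ tgt)                           # overlap <u|pi>, always >= 1/sqrt(N)
theta = np.arcsin(gamma)
k = int(round(np.pi / (4 * theta) - 0.5))          # ~ O(1/gamma) iterations suffice
fidelity = float(amplitude_amplify(src, tgt, k) @ tgt) ** 2
```

After k ≈ π/(4θ) iterations the fidelity with |π⟩ is close to one, mirroring the standard sin((2k+1)θ) amplitude amplification analysis.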
We remark that amplitude amplification requires the ability to reflect both around the source state |u⟩ and around the target state, i.e., to implement both R(π) and R(u) := I - 2|u⟩⟨u|. In this work we restrict our attention to cases in which the preparation of |u⟩ is efficient and, therefore, R(u) can also be easily implemented. Notice that preparing the uniform distribution is not always simple [17]; this happens, e.g., when it is computationally difficult to decide whether an element x belongs to the space of the MC. However, assuming that the space of the MC is {1, . . . , N}, the uniform distribution can be prepared with quantum circuits of depth O(log(N)) [37]; this case finds application in quantum machine learning problems [38]. Even more simply, when N = 2^n the uniform distribution is obtained from |0⟩ via the Hadamard transform, that is, |u⟩ = H^{⊗n} |0⟩. Consequently, our methods can be applied to spin-1/2 systems, where the preparation of the uniform superposition of all configurations of n spins is trivial, yet producing Gibbs distributions at low temperature for certain classical Hamiltonians is NP-hard [39].

Preparation from uniform distribution and from samples
To speed up the basic algorithm described in the previous section, the idea is to eliminate the worst-case preparation scenario. Specifically, in the case when the distribution is highly clumped, one should attempt the preparation from an element having high probability in the target distribution π. However, we still have to choose the candidate element to start from, which alone would lead to a Ω(√N) run-time (by the optimality of Grover's search [40]). We will first show that this issue can be resolved when one has access beforehand to a few classical samples from the target distribution. This seems to require that the solutions we are looking for are already provided as input. But we will utilize the slowly evolving context to ensure such samples are available, so that samples for the subsequent step can be prepared without the necessity of back-tracking in the sequence.

(The fixed-point property mentioned above means that the output state converges to the ideal target state for increasing run-times [35]. As a concrete example of a MC space over which preparing the uniform distribution is hard, consider the following MC inspired by the Graph Isomorphism problem: the space of the MC consists of all graphs that can be obtained from a given initial graph via permutation of its vertices, and a transition in the MC is obtained by randomly selecting two vertices of a graph and swapping them. Then, deciding if a graph belongs to the space is equivalent to solving the Graph Isomorphism problem.)
To utilize these ideas, we first show how to prepare the coherent encoding |π⟩ by choosing a suitable initial state |ψ_in⟩ and then amplitude amplifying |ψ_in⟩ to obtain |π⟩. Specifically, the initial state is either the uniform superposition, |ψ_in⟩ ≡ |u⟩, or a classical sample x_j taken from a small set of classical samples x = {x_1, . . . , x_c} that are available beforehand, |ψ_in⟩ ≡ |x_j⟩. We will call these subroutines PrepareFromUniform and PrepareFromSamples, respectively, and simply Prepare whenever the distinction is not relevant.
As mentioned, PrepareFromUniform is efficient in the extreme case where π is very close to uniform, while the procedure can require up to O(√N) operations in the opposite extreme case where π has support over only one element x; this last case corresponds, in fact, to a standard Grover search for the element x. However, when most of the "weight" (probability) of π is concentrated on a few elements (which need not be nearby), these elements must have a large overlap with |π⟩, which is a sufficient condition to efficiently perform amplitude amplification; that is, running a search algorithm in reverse (un-searching) from one of these elements allows for a fast re-preparation of |π⟩ [24]. Then, a classical sample drawn from π will likely come from elements having large "weight", and thus the PrepareFromSamples subroutine will be efficient. The main idea of our algorithms is to discover which of the two Prepare algorithms is the most efficient and then use it for state preparation. The worst regime is for distributions which are neither too uniform nor too clumped, where both algorithms have a Õ(N^{1/4}) scaling in the number of reflections. Before continuing with the complete description of the Prepare subroutines, we remark that we use amplitude amplification starting from |ψ_in⟩ to produce |π⟩ and thus we require the ability to perform reflections both around |ψ_in⟩ and around |π⟩. Both choices for the initial state can be prepared efficiently: |x_j⟩ is simply a classical state, while |u⟩ can be prepared easily when the MC space is explicitly known. Thus, reflectors around them can also be efficiently implemented. A reflection around |π⟩ is instead approximated with O(√(δ^{-1})) operations using the Szegedy operator. Therefore, the total gate cost of Prepare is Õ(γ^{-1/2} √(δ^{-1})), where γ is a lower bound on the fidelity |⟨ψ_in|π⟩|². We are then left with the task of estimating this lower bound γ.

Preparing from uniform
This subroutine is the straightforward algorithm we mentioned earlier. Suppose for the moment that the value of |⟨u|π⟩| is known. Then we can amplitude amplify |u⟩ to |π⟩, an operation having a gate cost proportional to |⟨u|π⟩|^{-1} = √N / f(π), in which we have introduced the notation f(π) := Σ_x √π(x). By norm inequalities we get 1 ≤ f(π) ≤ √N, where the lower and upper bounds are saturated by a Kronecker delta and by the uniform distribution, respectively.
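A few lines suffice to check the identity ⟨u|π⟩ = f(π)/√N and the saturation of the bounds on f(π) (illustrative code of ours; the three test distributions are arbitrary):

```python
import numpy as np

def f(pi):
    """f(pi) = sum_x sqrt(pi(x)); PrepareFromUniform costs ~ sqrt(N)/f(pi) reflections."""
    return float(np.sqrt(pi).sum())

N = 256
u = np.full(N, 1.0 / np.sqrt(N))                 # uniform superposition |u>

kron = np.zeros(N); kron[0] = 1.0                # Kronecker delta: f = 1, cost sqrt(N)
unif = np.full(N, 1.0 / N)                       # uniform: f = sqrt(N), cost 1
mid = np.exp(-np.arange(N)); mid /= mid.sum()    # an intermediate, clumped case

# identity <u|pi> = f(pi)/sqrt(N) holds for every distribution pi
overlaps = {name: float(u @ np.sqrt(pi))
            for name, pi in (("kron", kron), ("unif", unif), ("mid", mid))}
costs = {name: np.sqrt(N) / f(pi)                # number of amplification rounds
         for name, pi in (("kron", kron), ("unif", unif), ("mid", mid))}
```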
If the value of |⟨u|π⟩| is not known we proceed as follows. We arbitrarily choose a value χ̃ ≡ √N/χ as a tentative estimate of √N/f(π) (i.e., χ is a tentative estimate of f(π)), so that amplitude amplification, run with O(χ̃) reflections, produces an approximation of |π⟩ whenever |⟨u|π⟩| ≥ χ̃^{-1} holds. However, we do not know whether the initial overlap is large enough, and thus preparation of |π⟩ is not guaranteed. To amend this, we subsequently apply to the output of amplitude amplification a projective measurement onto |π⟩⟨π| (or onto the orthogonal complement) which, if successful, heralds the correct preparation of |π⟩. As we mentioned, this projective measurement can be implemented with Szegedy quantum walks with run-time Õ(√(δ^{-1})), and therefore its run-time is independent of the initial overlap |⟨u|π⟩|. This constitutes a heralded preparation of |π⟩ from |u⟩.
PrepareFromUniform is summarized in Alg. 1, and in App. E further details and error analysis are provided. The run-time for preparing c copies is in Õ(c χ̃ √(δ^{-1})), with an exponentially decaying failure probability when χ̃ ≥ √N/f(π).

Algorithm 1 PrepareFromUniform
Input: quantum access to the transition matrix P; c, the number of copies to be produced; χ̃, the current guess for the run-time parameter.
Output: a bit signalling success; in case of success, c quantum samples (i.e., c copies of the state |π⟩).
1. Repeat O(c) times: prepare the uniform superposition |u⟩, amplitude amplify it towards |π⟩ using O(χ̃) reflections, and apply the approximate projective measurement onto |π⟩⟨π| to herald whether |π⟩ has been prepared.
2. If at least c successful preparations have been heralded in step 1., output a bit signalling success, together with the quantum states obtained in c successful runs of heralded preparation of |π⟩. Else, return a bit signalling failure.

Preparing from samples
The second subroutine we will utilize, named PrepareFromSamples, requires extra inputs, namely a set of c samples x = {x_1, . . . , x_c} from the desired target distribution π, and is based on amplitude amplification of |x_j⟩ to |π⟩. Later we will show how these samples can be efficiently obtained in a slowly evolving sequence. The run-time of amplitude amplification scales as |⟨x_j|π⟩|^{-1} = 1/√π(x_j), assuming, for the time being, that the value of |⟨x_j|π⟩| is known. We thus introduce a random variable X distributed according to π, i.e., X takes the value x with probability π(x). The run-time of amplitude amplification is then also a random variable, proportional to |⟨X|π⟩|^{-1} = π(X)^{-1/2}. The average run-time thus scales as E[π(X)^{-1/2}] = Σ_x π(x) π(x)^{-1/2} = Σ_x √π(x) = f(π). Note that we have bounded only the expected run-time of our algorithm and not that of a specific instance of the algorithm, i.e., for a particular choice of x_j: the sampling procedure could in fact return a sample x_j for which the run-time factor π(x_j)^{-1/2} is much larger than its average value.
However, this can be prevented by using a few samples, since the run-time when starting from a randomly sampled element x_j is, with constant probability, close to the average run-time. To formalize this we use Markov's inequality: Pr[π(X)^{-1/2} ≥ a f(π)] ≤ a^{-1} for any a > 0. We then consider the case a = 2 and proceed similarly as we did for PrepareFromUniform. Namely, we guess an estimate χ for f(π) and amplitude amplify |x_j⟩ to |π⟩ assuming that |⟨x_j|π⟩| ≥ 1/(2χ) holds (that is, we use O(2χ) reflections), and then repeat for all samples in x. Subsequently we apply a projective measurement onto |π⟩⟨π| to herald the successful preparation of |π⟩.
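The identity E[π(X)^{-1/2}] = f(π) and the Markov-inequality bound can be checked by direct Monte Carlo (a sketch of ours, with an arbitrary test distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 512
pi = rng.random(N) ** 4; pi /= pi.sum()        # a moderately clumped distribution
f_pi = np.sqrt(pi).sum()                        # = E[pi(X)^(-1/2)], the mean AA cost

samples = rng.choice(N, size=20000, p=pi)       # classical samples X ~ pi
costs = pi[samples] ** -0.5                     # per-sample amplification cost
mean_cost = costs.mean()                        # concentrates around f(pi)

# Markov's inequality with a = 2: at most half the sampled starting points
# are more than twice as costly as the average
frac_bad = np.mean(costs >= 2 * f_pi)
```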
Suppose that χ ≥ f(π). Since the c samples are independent, the probability that PrepareFromSamples fails for all the samples in x is then exponentially small in c (for instance, PrepareFromSamples can fail if |⟨x_j|π⟩|^{-1} > 2f(π) holds for all x_j). A formal specification of PrepareFromSamples for preparing c new samples is given in Alg. 2; its run-time is in Õ(c² χ √(δ^{-1})). The failure probability again goes down exponentially if χ ≥ f(π). See App. E for details and error analysis.

Algorithm 2 PrepareFromSamples
Input: quantum access to the transition matrix P; a set x = {x_1, . . . , x_c} of classical samples from π; χ, the current guess for the run-time parameter.
Output: a bit signalling success; in case of success, c quantum samples (i.e., c copies of the state |π⟩).
1. For each sample x_j in x (repeating O(c) times overall): prepare |x_j⟩, amplitude amplify it towards |π⟩ using O(χ) reflections, and apply the approximate projective measurement onto |π⟩⟨π| to herald whether |π⟩ has been prepared.
2. If at least c successful preparations have been heralded in step 1., output a bit signalling success, together with the quantum states coming from c successful runs of heralded preparation of |π⟩. Else, return a bit signalling failure.

Combined algorithm
Now we will put together the two Prepare subroutines in a single combined algorithm. In the case in which the value f (π) is known one simply runs whichever of the two Prepare algorithms is faster.
To deal with the situation when f(π) is not known, we modify the algorithm in a manner similar to how Grover's search is adapted to work without an estimate on the number of marked elements [40]. Essentially, one runs both preparation algorithms one after another, starting from χ = 1 and χ̃ = 1 for PrepareFromSamples and PrepareFromUniform, respectively; then, in each iteration the values of χ and χ̃ are doubled, terminating either when c copies of the coherent encoding are produced or when both χ and χ̃ exceed 2 N^{1/4}. The total number of reflections required then scales as C(π) := min{f(π), √N/f(π)}, which is at most N^{1/4} (the worst case), and the global failure probability is, again, exponentially small.
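The doubling control loop can be sketched classically; here the two quantum subroutines are replaced by stubs that succeed exactly when the guess dominates the true cost, which is what the heralded preparations guarantee with high probability (function names and the stub behaviour are ours):

```python
import numpy as np

def combined_prepare(pi, N):
    """Doubling-guess control loop of the combined algorithm (quantum parts stubbed)."""
    f_pi = np.sqrt(pi).sum()
    samples_ok = lambda chi: chi >= f_pi                    # PrepareFromSamples stub
    uniform_ok = lambda chi_u: chi_u >= np.sqrt(N) / f_pi   # PrepareFromUniform stub

    chi = chi_u = 1.0
    reflections = 0.0
    while chi <= 2 * N ** 0.25:
        reflections += chi + chi_u            # one round of each subroutine
        if samples_ok(chi) or uniform_ok(chi_u):
            return reflections                # total ~ O(C(pi)) by the geometric sum
        chi, chi_u = 2 * chi, 2 * chi_u
    return None                               # cannot happen: C(pi) <= N^(1/4)

C = lambda pi, N: min(np.sqrt(pi).sum(), np.sqrt(N) / np.sqrt(pi).sum())

N = 256
kron = np.zeros(N); kron[0] = 1.0             # clumped: C = 1
mid = np.zeros(N); mid[:16] = 1.0 / 16        # worst case here: C = N^(1/4) = 4
```

Because the guesses double, the total reflection count is a geometric sum dominated by the last round, hence proportional to C(π).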

Application in the context of slowly evolving sequences
The algorithm given in the preceding paragraphs can be employed also for a stand-alone MC, i.e., for a chain not coming from a slowly evolving sequence. However, it comes with the unrealistic requirement that c samples drawn from the stationary distribution π are available beforehand: it seems that, paradoxically, the output of the algorithm is also required as input. Nonetheless, the result is non-trivial even for stand-alone MCs because of the following two observations. First, the initial classical samples can be re-used to prepare multiple coherent copies and this, in turn, allows us to prepare an arbitrary number of fresh independent samples, given only a small number (c) of seed samples. Second, our algorithm outputs a coherent encoding of the stationary distribution, allowing quantum information post-processing to be applied.
Going back to our primary objective, we now show how these initial samples can be made available in the context of slowly evolving sequences.
We proceed inductively. We suppose that at time step t we have at hand c samples from π_t. This allows us to produce c copies of |π_t⟩, i.e., the state |π_t⟩^{⊗c}, using the combined Prepare algorithm. Next, using a Szegedy quantum walk we can implement reflections both around |π_t⟩ and around |π_{t+1}⟩. This allows us to use amplitude amplification (or its fixed-point variant) to approximately map each of the c copies of |π_t⟩ to a copy of |π_{t+1}⟩. In turn, these copies of |π_{t+1}⟩ can be measured in the computational basis to obtain c samples from π_{t+1}, allowing us to proceed iteratively with the state preparation along the sequence.
By the slowly evolving assumption, the overlap between |π_t⟩ and |π_{t+1}⟩ is constant, and amplitude amplification to |π_{t+1}⟩ for c copies in parallel has gate complexity in Õ(c √(δ̄_t^{-1})), where δ̄_t := min{δ_t, δ_{t+1}}. To simplify the statement of the result we can assume that δ_t and δ_{t+1} are multiplicatively close, that is, (1/κ) δ_t ≤ δ_{t+1} ≤ κ δ_t for some constant κ > 1.

Hence the time complexity of this final amplitude amplification is in Õ(c √(δ̄_t^{-1})), which is dominated by the run-time necessary to initially prepare |π_t⟩^{⊗c}. Consequently, the overall run-time of this process is Õ(c² C(π_t) √(δ̄_t^{-1})) per time step. We highlight that the entire procedure only requires classical memory between consecutive time steps in the sequence, in the form of c classical samples stored in memory. This is without loss of generality since, as we show later on, Ω(C(π)) reflections are needed even if one allows for quantum memory. Moreover, assuming that the classical memory is devoid of errors, the observation that no quantum memory is required shows that approximation errors do not accumulate along the slowly evolving sequence. In fact, the quantum algorithm performed at step t+1 does not receive any quantum state as input from step t, but only classical information. Of course, each step t still entails a finite failure probability, albeit exponentially small in c.
If quantum memory is available, the algorithm can be made slightly more efficient in terms of how many samples (classical or quantum) are required: instead of c classical samples, one needs to store only one coherent sample. The basis of this is a simple near-deterministic cloning algorithm producing two copies of |π⟩ from one, which may be of independent interest. Details of this quantum-memory algorithm are provided in App. F.

Application in quantum machine learning
We now consider a modification of the Prepare algorithm, which finds application, e.g., in the quantization of the reflective Projective Simulation (rPS) model. For the reader interested in quantum ML, details about the rPS can be found in [38]. Here it is sufficient to point out that the outputs in the rPS model are not samples from {π_t}_t but come from restricting to a subset of "marked elements" M ⊆ {1, 2, . . . , N} and, typically, the number of marked elements M = |M| is much smaller than N. That is, we want to sample from π_M, the (normalized) probability distribution obtained by restricting π to M, i.e., π_M(x) := π(x)/µ for x ∈ M and π_M(x) := 0 otherwise, where µ := Σ_{x∈M} π(x); correspondingly, |π_M⟩ = (1/√µ) Σ_{x∈M} √π(x) |x⟩. The set of marked elements is specified by two black-box unitary maps, a membership oracle P_M and a sparse oracle Q_M. The former, given an element x, specifies whether x ∈ M or not; the latter is a quantum accessible memory that, upon input of a number ν ∈ {1, . . . , M}, outputs the ν-th element of M (i.e., x_ν ∈ M).
Accessing the oracle P_M twice, one can implement a reflection over the subspace of marked elements. This allows one to run amplitude amplification, as done, e.g., in the context of rPS [38], to rotate |π⟩ to |π_M⟩. This operation has a run-time of Õ(√(δ^{-1} µ^{-1})), yielding a quadratic improvement in both mixing time and hitting time with respect to classical methods. This can be done provided that an initial copy of |π⟩ is available. We now explain how, in this context, the state |π⟩ can sometimes be made available more cheaply.
We consider two modified algorithms for the preparation of |π⟩: these are equal to the algorithms specified before in Alg. 1 and Alg. 2, except for the choice of the initial states, which are now chosen to have support on the marked elements. Specifically, the initial state is either |u_M⟩ := (1/√M) Σ_{x∈M} |x⟩, the uniform superposition over the marked elements, or a classical sample |x_j⟩, where now x is a set of c samples drawn previously from π_M. These modified state preparation algorithms require that the new input states |ψ_in⟩ can be efficiently produced. This is obviously true for classical samples, while |u_M⟩ can be prepared with one access to the Q_M oracle, since |u_M⟩ = Q_M Σ_{ν=1}^{M} (1/√M) |ν⟩.

This is then sufficient to perform amplitude amplification of |ψ_in⟩ to |π⟩, for both choices of |ψ_in⟩. Using similar reasoning as done previously, we see that the amplitude amplification has run-time scaling as √(M µ^{-1})/f(π_M) when trying to prepare |π⟩ from |u_M⟩, and as the expected value f(π_M) √(µ^{-1}) when starting from the available samples x. Again, combining these two state preparation algorithms (which start from initial states having support on the marked subspace M) into a single procedure, we obtain a run-time in Õ(c² C(π_M) √(µ^{-1} δ^{-1})) for preparing c copies of |π⟩. The preparation from M is then more efficient whenever C(π_M) √(µ^{-1}) < C(π); notice that C(π_M) ≤ M^{1/4}, since π_M has support on just M elements.
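The bookkeeping behind this comparison can be sketched numerically (variable names ours; µ, f and C are the quantities defined above, and the marked set is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1024
pi = rng.random(N) ** 6; pi /= pi.sum()

marked = np.arange(40)                      # the marked set M, with |M| << N
mu = pi[marked].sum()                       # mu = sum_{x in M} pi(x)
pi_M = pi[marked] / mu                      # restricted, renormalized distribution

f = lambda p: np.sqrt(p).sum()
C = lambda p: min(f(p), len(p) ** 0.5 / f(p))

direct = C(pi)                              # reflections: prepare |pi> from scratch
via_M = C(pi_M) * mu ** -0.5                # reflections: start from states inside M
# the marked-start route wins whenever C(pi_M)/sqrt(mu) < C(pi)
```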
For the problem of sampling from marked elements, being in a slowly evolving sequence of MCs allows us to make the initial set of c samples available. In this case, we again use the state preparation for MC_t to prepare |π_t⟩^{⊗c} and then map these copies to |π_{t+1}⟩^{⊗c}. The final step consists in running the algorithm of [38] (namely, an amplitude amplification into the marked subspace) in order to obtain c samples from π_{M,t+1}. This then allows us to proceed inductively with the sampling along the sequence. The final projection has Õ(c √(µ^{-1} δ^{-1})) gate complexity and is thus dominated by the cost of preparing |π_{t+1}⟩^{⊗c}.
We notice, finally, that the method just presented can also be used directly to produce new copies of |π_M⟩, thus solving the problem considered in [38]: this is realized by running either of the algorithms for preparing |π⟩, followed by amplitude amplification into the subspace of marked elements.

Optimality analysis
To begin with, notice that preparing coherent encodings |π⟩ is in general a difficult task, even when sampling from π can be done efficiently. Consider a randomized algorithm that produces an outcome x with probability π(x) and which makes a number of binary random choices, selecting a computational branch b_1 or b_2 with probabilities p and 1-p. One can "purify" this algorithm to a unitary quantum circuit by substituting every random choice with a controlled dependence on a pure qubit prepared in the state √p |b_1⟩ + √(1-p) |b_2⟩. The resulting output state then has the form |π̃⟩ = Σ_x √π(x) |x⟩ |φ(x)⟩, where |φ(x)⟩ contains residual information of all the choices. Starting from |π̃⟩ one cannot directly obtain |π⟩, since there is, in general, no efficient deterministic method that allows one to erase the information contained in the second register. In fact, the possibility to efficiently produce a coherent encoding |π⟩ for all probability distributions which can be efficiently sampled would imply SZK ⊆ BQP [41], that is, Statistical Zero Knowledge problems (including, e.g., Graph Isomorphism) could be solved in quantum polynomial time. While the inclusion SZK ⊆ BQP is not impossible, it is expected that specific structures of the problems would have to be exploited (e.g., graph-theoretic properties in Graph Isomorphism), while the methods based on MC mixing are oblivious to such problem structures. Consequently, it is highly unlikely that any quantum algorithm can prepare states |π⟩ encoding stationary distributions of time-reversible MCs in polylog(N) time, not even when the classical mixing process is fast (i.e., when the MC mixes in polylog(N) time).
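This obstruction can be made concrete in a few lines: with mutually orthogonal garbage states, the sample register of |π̃⟩ is left in the mixed state diag(π), which reproduces the measurement statistics of π exactly but has only fidelity Σ_x π(x)² with |π⟩ (illustrative code of ours, taking |φ(x)⟩ = |x⟩ as the worst case):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8
pi = rng.random(N); pi /= pi.sum()

# purified sampler output |pi~> = sum_x sqrt(pi(x)) |x>|phi(x)>, with
# orthogonal garbage states |phi(x)> = |x>
M = np.diag(np.sqrt(pi))           # amplitude matrix M[x, g] = <x|<g|pi~>
rho = M @ M.T                      # reduced state of the sample register: diag(pi)

probs = np.diag(rho)               # measuring still samples exactly from pi ...
coh = np.sqrt(pi)                  # ... but the coherent encoding |pi> is far away:
fidelity = float(coh @ rho @ coh)  # <pi|rho|pi> = sum_x pi(x)^2, generically << 1
```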
We prove in App. G that our algorithm is strictly optimal in the class of sampling algorithms which utilize oracle access to reflections about |π⟩ (and do not use other properties of the transition matrix P of the MC), as is the case for many algorithms based on the Szegedy quantum walk [42-46]. Specifically, we show that if we start from c copies of |π⟩ and the goal is to obtain c+1 classical samples from π (for some constant c), then Ω(N^{1/4}) accesses to the reflection oracle are required (more tightly, we can prove a Ω(C(π)) lower bound). Our proof relies on the "inner-product adversary" method developed in the context of quantum money [47], a so-called computational no-cloning theorem.
This Ω(N^{1/4}) lower bound actually applies to any MC, also outside of the context of slowly evolving sequences of MCs. Any algorithm that uses quantum walks just to realize the reflection around |π⟩, and then subsequently uses such reflections in a black-box fashion, cannot avoid a Ω(N^{1/4}) dependence in its run-time. In particular, algorithms of this type cannot generically achieve the conjectured quadratic speed-up for sampling from stationary distributions of time-reversible MCs [13]. Hence, other techniques are needed.
We finally point out, however, that in our algorithms we have full access to the transition matrix P and, thus, we are not restricted to using reflections around |π⟩. In particular, we can implement a classical random walk as well. If the MC is rapidly mixing then, by definition, the random walk allows one to efficiently sample from π, while achieving the same goal with access only to reflections around |π⟩ and some initial copies of |π⟩ could take an exponentially longer time.

Discussion
We have presented quantum algorithms for generating samples from stationary distributions of a sequence of Markov chains which achieve a quadratic improvement over previous approaches that can guarantee the generation of the correct output, and work for all time-reversible chains. To achieve this improvement we do not assume special properties of the chain (except detailed balance) but rather we have considered settings where the chains come in a context, namely in a slowly evolving sequence. This result thus has application to all MCs where this framework is natural.
An important domain of application includes statistical physics and material science, where the slowly evolving context, and the need for independent samples, arise when studying phase transitions [25].
A second important family of applications occurs in machine learning (ML), both in the reinforcement learning case [26] and in the context of generative models [27]. To briefly comment on this domain: as mentioned earlier, in reinforcement learning settings [26] where the learner's distribution over actions is specified by MCs, the MCs are sequentially updated as the system learns [38,48-50]. The other facet involves the training of certain generative models (used, e.g., for unsupervised learning), such as Boltzmann machines [51]. Here one encounters the need for producing samples from stationary distributions (e.g., Gibbs states) which are themselves slowly modified as the model is updated [52,53]. We remark that, in ML, the subsequent Markov chains in the sequence are generated according to a training algorithm which depends on the external outputs of previous Markov chains. Whenever this is the case, the methods developed for quantum-enhanced annealing become unsuitable, as they need to keep coherence throughout the protocol steps [22,23].
We conclude by observing that, as a feature of our protocol, at each time step we output not just a classical sample from the target stationary distribution, but a coherent encoding of this distribution. This is not a guaranteed characteristic of quantum mixing protocols [13], and it makes our approach suitable for combination with other quantum protocols which start from such a coherent encoding [29,38,44,54].

A Markov chain notions
Here we review the fundamental notions of Markov chain theory and refer to [8,9] for further details.

Transition matrices and probability distributions:
We deal with discrete-time Markov chains having a finite number N of states. A MC is thus associated with a left-stochastic N × N matrix P (a matrix with non-negative entries which add up to one in every column), where the entry P_{y,x} specifies the transition probability from state x to state y. Correspondingly, a probability distribution over the state space is described by a non-negative (column) vector π with Σ_x π(x) = 1. A MC is then specified by the transition matrix P together with an initial probability distribution π_in. We stick to the convention of left-stochastic matrices, which act from the left on column vectors π representing probability distributions, that is, π′ = P π. This convention is not customary in the MC literature (where the usage of right-stochastic matrices prevails), but it matches the one adopted in the quantum information community.
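The left-stochastic convention can be made concrete with a minimal numpy sketch (the matrix entries below are arbitrary illustrative values):

```python
import numpy as np

# A left-stochastic 3x3 transition matrix: every COLUMN sums to one,
# and the entry P[y, x] is the probability of jumping from state x to state y.
P = np.array([
    [0.5, 0.2, 0.3],
    [0.3, 0.6, 0.3],
    [0.2, 0.2, 0.4],
])
assert np.allclose(P.sum(axis=0), 1.0)   # column sums equal one

# A probability distribution is a non-negative column vector summing to one.
pi_in = np.array([1.0, 0.0, 0.0])        # start deterministically in state 0

# One step of the chain acts from the left: pi' = P pi.
pi_1 = P @ pi_in
assert np.allclose(pi_1, P[:, 0])        # starting from state 0 yields column 0
assert np.isclose(pi_1.sum(), 1.0)       # probability is preserved
```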

Ergodic MCs:
An N-state MC is irreducible if from each state x it is possible to reach any other state y in a finite number of steps with non-zero probability. The period of a state x is the largest positive integer such that any return to x can occur only at multiples of that integer. If the period of every state is 1, the MC is said to be aperiodic. If P is irreducible and aperiodic, then there exists a unique stationary distribution π, such that P π = π, and moreover π has support on all the elements of the MC. This also implies that, after a sufficiently large number of steps, any initial probability distribution π_in converges to the unique stationary distribution, lim_{k→∞} P^k π_in = π. This convergence process is called mixing and, since MCs mix if and only if they are irreducible and aperiodic, such chains are called ergodic Markov chains.
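Convergence to stationarity is easy to verify numerically; a small sketch using power iteration on an illustrative chain:

```python
import numpy as np

# All entries positive: irreducible and aperiodic, hence ergodic.
P = np.array([
    [0.5, 0.2, 0.3],
    [0.3, 0.6, 0.3],
    [0.2, 0.2, 0.4],
])

# Mixing: repeated application of P drives ANY initial distribution
# towards the unique stationary distribution, lim_k P^k pi_in = pi.
pi_a = np.array([1.0, 0.0, 0.0])
pi_b = np.array([0.0, 0.0, 1.0])
for _ in range(200):
    pi_a = P @ pi_a
    pi_b = P @ pi_b

assert np.allclose(pi_a, pi_b)       # the limit does not depend on the start
assert np.allclose(P @ pi_a, pi_a)   # stationarity: P pi = pi
assert np.all(pi_a > 0)              # pi has full support
```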

Time reversal:
The time reversal P̃ of a Markov chain P having stationary distribution π is defined entrywise as P̃_{x,y} := π(x) P_{y,x} / π(y), and a MC is said to be time-reversible if it is equal to its time-reversed version, P̃ = P. Equivalently, a time-reversible MC is one that satisfies the detailed balance equation P_{y,x} π(x) = P_{x,y} π(y) for all x, y. We can also write the time-reversed MC in matrix form as P̃ = D(π) P^T D(π)^{-1}, where D(π) is the diagonal matrix D(π) := diag( π(1), . . . , π(N) ). This implies that if P is time-reversible then D(π)^{-1/2} P D(π)^{1/2} is a symmetric matrix and, consequently, the spectrum of P is real. In the following, we will always consider ergodic and time-reversible MCs.
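These identities can be checked numerically; a sketch using a chain built with the Metropolis rule, which enforces detailed balance by construction (the target π is an arbitrary illustrative choice):

```python
import numpy as np

# A time-reversible chain via the Metropolis rule with respect to pi:
# propose y uniformly, accept with probability min(1, pi[y]/pi[x]).
N = 3
pi = np.array([0.5, 0.3, 0.2])
P = np.zeros((N, N))
for x in range(N):
    for y in range(N):
        if y != x:
            P[y, x] = min(1.0, pi[y] / pi[x]) / N
    P[x, x] = 1.0 - P[:, x].sum()

# Detailed balance: P[y, x] pi[x] == P[x, y] pi[y], i.e. the flow matrix is symmetric.
flow = P * pi[None, :]
assert np.allclose(flow, flow.T)

# Matrix form of the time reversal: P~ = D(pi) P^T D(pi)^{-1}; reversibility means P~ = P.
D = np.diag(pi)
P_rev = D @ P.T @ np.linalg.inv(D)
assert np.allclose(P_rev, P)

# Reversibility implies a real spectrum (P is similar to a symmetric matrix).
assert np.allclose(np.linalg.eigvals(P).imag, 0.0)
```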
Mixing times:
Obviously, not all mixing processes of ergodic MCs are equally fast. We use the total variation distance, defined as d(π, π′) := ½ Σ_x |π(x) − π′(x)|, to assess the speed of mixing (the total variation distance exactly matches the trace distance in the quantum information context). We then define d(k) := max_σ d(P^k σ, π), the distance between the distribution obtained after k walk steps starting from the worst-case initial distribution σ and the stationary distribution π of P. The mixing time t_mix(ε) is then defined as the smallest time necessary to bring any initial distribution within distance ε from the stationary distribution, d(t_mix(ε)) ≤ ε. We then set t_mix := t_mix(1/4). It can be shown that the convergence of an ergodic MC is exponentially fast in terms of the mixing time, that is, d(k · t_mix) ≤ 2^{−k}. The mixing time often plays the critical role in the computational complexity of MC-based algorithms. There are many techniques that can be employed for upper and lower bounding the mixing time, but one of the most useful characterizations is the following. By the Perron–Frobenius theorem, all eigenvalues of a left-stochastic matrix P are smaller than or equal to 1 in modulus. If P is ergodic, then its stationary distribution π is the only eigenvector of P having eigenvalue +1; that is, all other eigenvectors have eigenvalues λ with |λ| < 1.
Let σ(P) be the spectrum of a time-reversible Markov chain P; we define the spectral gap δ of P as δ := 1 − max{ |λ| : λ ∈ σ(P), λ ≠ 1 }, i.e., the minimum of 1 − |λ| over the eigenvalues of P which differ from one. The spectral gap is a rather tight estimate for the inverse of the mixing time, since (δ^{−1} − 1) ln( (2ε)^{−1} ) ≤ t_mix(ε) ≤ δ^{−1} ln( (ε π_min)^{−1} ) holds for all time-reversible MCs (where π_min is the smallest probability in π). In short, we have t_mix ∈ Õ(δ^{−1}) and δ^{−1} ∈ O(t_mix), giving asymptotic upper and lower bounds on the mixing time.
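A numerical sanity check of these bounds on a small Metropolis chain (illustrative values; by convexity the maximisation over initial distributions σ can be restricted to point masses):

```python
import numpy as np
from itertools import count

# Reversible 3-state Metropolis chain (illustrative target distribution pi).
N = 3
pi = np.array([0.5, 0.3, 0.2])
P = np.zeros((N, N))
for x in range(N):
    for y in range(N):
        if y != x:
            P[y, x] = min(1.0, pi[y] / pi[x]) / N
    P[x, x] = 1.0 - P[:, x].sum()

# Spectral gap: delta = 1 - max{|lambda| : lambda in spec(P), lambda != 1}.
moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
delta = 1.0 - moduli[1]

# Empirical t_mix(eps): smallest k with max_sigma TV(P^k sigma, pi) <= eps,
# where the maximum runs over point-mass initial distributions.
eps = 0.25
Pk = np.eye(N)
for k in count(0):
    if max(0.5 * np.abs(Pk[:, x] - pi).sum() for x in range(N)) <= eps:
        t_mix = k
        break
    Pk = P @ Pk

# Spectral bounds: (1/delta - 1) ln(1/(2 eps)) <= t_mix <= (1/delta) ln(1/(eps pi_min)).
lower = (1 / delta - 1) * np.log(1 / (2 * eps))
upper = (1 / delta) * np.log(1 / (eps * pi.min()))
assert lower <= t_mix <= upper
```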

B Szegedy quantum walk
Here we review the basics of Szegedy quantum walks [28]. For further details see [29,44] and references therein.

Szegedy walk operator:
The Szegedy walk operator W(P) can be implemented for any transition matrix P, not only for those associated with ergodic and time-reversible MCs (but W(P) has nice spectral properties only if P is ergodic and time-reversible). The basic building block used to define W(P) is the diffusion operator U_P, which acts on two quantum registers of N states each and is (partially) defined as follows:

U_P |x⟩|0⟩ = |x⟩ Σ_y √(P_{y,x}) |y⟩ .    (16)

By measuring the second register in the computational basis a step of the classical random walk is obtained, hence U_P is a natural way of defining a quantum extension of the classical MC. When we say that we have quantum access to P, we specifically mean that we have access to a diffusion operator of the form (16). The diffusion operator U_P can be realized efficiently, for instance, when P is a sparse transition matrix. We can then define the Szegedy walk operator as the unitary

W(P) := Swap · U_P Z₂ U_P† ,

where Z₂ := 2|0⟩⟨0|₂ − I acts on the second register and Swap interchanges the first and second register. W(P) acts non-trivially only on the subspace A + B, where A := span{ U_P |x, 0⟩ } and B := Swap · A.

Spectral properties of W(P): When P is ergodic and time-reversible, the space A + B has dimension 2N − 1 and the intersection A ∩ B contains only the state U_P |π, 0⟩ = Swap U_P |π, 0⟩, as one can verify using the detailed balance equation for P. The state U_P |π, 0⟩ is the only +1-eigenstate of W(P) Π_{A+B}, where Π_{A+B} is a projector on the invariant subspace A + B. The remaining eigenvalues of W(P) on this subspace come in pairs e^{±iθ_j}, where cos θ_j = λ_j and the λ_j are the eigenvalues of P that are different from one (remember, the spectrum of P is real for time-reversible MCs). With this notation the phase gap ∆ and the spectral gap δ are given by ∆ := min_j |θ_j| and δ = 1 − max_j |λ_j|, and therefore the phase gap is quadratically larger than the spectral gap, ∆ ≥ √(2δ).
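These spectral properties can be checked directly on a small example. The sketch below completes the diffusion operator to a full unitary using one Householder reflection per column of √P (an arbitrary choice of completion, which the spectral statements do not depend on), builds W(P), and verifies that the eigenphases obey cos θ = λ and ∆ ≥ √(2δ):

```python
import numpy as np

# Reversible 3-state Metropolis chain with stationary distribution pi.
N = 3
pi = np.array([0.5, 0.3, 0.2])
P = np.zeros((N, N))
for x in range(N):
    for y in range(N):
        if y != x:
            P[y, x] = min(1.0, pi[y] / pi[x]) / N
    P[x, x] = 1.0 - P[:, x].sum()

def householder(c):
    """Real orthogonal matrix whose first column is the unit vector c."""
    w = np.eye(len(c))[:, 0] - c
    if np.linalg.norm(w) < 1e-12:
        return np.eye(len(c))
    w /= np.linalg.norm(w)
    return np.eye(len(c)) - 2.0 * np.outer(w, w)

# Diffusion operator U_P = sum_x |x><x| (x) V_x with V_x|0> = sum_y sqrt(P[y,x]) |y>.
U = np.zeros((N * N, N * N))
for x in range(N):
    U[x * N:(x + 1) * N, x * N:(x + 1) * N] = householder(np.sqrt(P[:, x]))

# Z2 = 2|0><0| - I on the second register; Swap exchanges the two registers.
e0 = np.eye(N)[:, 0]
Z2 = np.kron(np.eye(N), 2 * np.outer(e0, e0) - np.eye(N))
Swap = np.zeros((N * N, N * N))
for x in range(N):
    for y in range(N):
        Swap[x * N + y, y * N + x] = 1.0

W = Swap @ U @ Z2 @ U.T          # Szegedy walk operator W(P)
ew = np.linalg.eigvals(W)

# Every eigenvalue lambda != 1 of P lifts to an eigenphase pair e^{+-i theta}
# of W(P) with cos(theta) = lambda.
lam = np.sort(np.linalg.eigvals(P).real)[:-1]    # drop the eigenvalue 1
for l in lam:
    assert np.min(np.abs(ew - np.exp(1j * np.arccos(l)))) < 1e-8

# Phase gap vs spectral gap: Delta = arccos(1 - delta) >= sqrt(2 delta)
# (here all nontrivial eigenvalues of P are positive).
delta = 1.0 - np.max(np.abs(lam))
Delta = np.arccos(1.0 - delta)
assert Delta >= np.sqrt(2.0 * delta)
```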

C Subroutines based on quantum walks
Here we show how to use the Szegedy walk operator within the phase detection algorithm to implement projective measurements onto |π⟩ and partial reflections around |π⟩, as originally done in [29]; we also show how to use fixed-point amplitude amplification to deterministically map a given input state to |π⟩.

Phase estimation and phase detection:
The phase detection algorithm applied to the Szegedy walk operator W(P) is illustrated in Fig. 2 and its action is as follows. Call {|θ⟩} the eigenvectors of W(P) having eigenvalue e^{iθ}, W(P)|θ⟩ = e^{iθ}|θ⟩; in particular, |θ = 0⟩ ≡ |π⟩. Then, given an input state of the form |ψ_in⟩ = Σ_θ ψ_θ |θ⟩, the phase detection algorithm outputs an approximation of the state

ψ₀ |π⟩|1⟩ + Σ_{θ≠0} ψ_θ |θ⟩|0⟩ .    (19)

That is, the second register contains a bit signalling whether |θ⟩ is the +1-eigenvector or not. The phase detection algorithm produces a state within trace distance ε from the state in Eq. (19) using O(∆^{−1} log ε^{−1}) ⊆ O(δ^{−1/2} log ε^{−1}) calls to W(P); in particular, the cost has only a logarithmic dependence on the error, see e.g. [29,31,32].
Projective measurement onto |π⟩: The phase detection algorithm allows one to directly implement the projective measurement given by the projectors { |π⟩⟨π| , I − |π⟩⟨π| } applied to an arbitrary input state |ψ_in⟩: it is realized by measuring the second register of the state in Eq. (19) in the computational basis. The gate and oracle complexity of this projective measurement is thus the same as that of the phase detection algorithm, O(δ^{−1/2} log ε^{−1}) calls to W(P). The success probability of the measurement, applied to an input pure state |ψ_in⟩, is approximately |⟨ψ_in|π⟩|². Moreover, the classical outcome of this projective measurement is a bit that signals whether the projection onto |π⟩⟨π| was successful or not. This allows one, e.g., to repeat the preparation-and-measurement process until the algorithm succeeds in obtaining the target state |π⟩.
Partial reflections around |π⟩: Next, the phase detection algorithm can be used to approximately implement the partial reflection

R_φ(π) := I − (1 − e^{iφ}) |π⟩⟨π| ,

where φ is a tunable parameter. Notice that for φ = 180° the partial reflection becomes a standard reflection around |π⟩, R(π) = I − 2|π⟩⟨π|. A partial reflection can be implemented using once the circuit in Fig. 2 and once its inverse: with the first call an input state |ψ_in⟩ is mapped to a state as in Eq. (19); then, a phase e^{iφ} is applied selectively on the ancilla qubit being in the |1⟩ state; finally, the phase detection algorithm is run in reverse to uncompute the bit contained in the second register. In summary, R_φ(π) ≈ PD^{-1} ( I ⊗ e^{iφ|1⟩⟨1|} ) PD, where PD denotes the phase detection circuit. Notice that in the operation given above we apply in sequence a phase detection and its inverse, so their costs add while their actions cancel out. In conclusion, the oracle and gate cost of approximating a partial reflection is also O(δ^{−1/2} log ε^{−1}).

Fixed-point amplitude amplification: Partial reflections are fundamental for implementing fixed-point amplitude amplification (FPAA). This is a variant of amplitude amplification whereby the output state can get arbitrarily close to the ideal target state. Standard amplitude amplification suffers from the "soufflé problem" [35]: if the rotation in the amplitude amplification process is not stopped at the right moment, the fidelity with the target state starts decreasing again; moreover, even using the optimal number of reflections, only a constant fidelity between the output state and the ideal target is reached. In contrast, FPAA gets exponentially close to the target state |ψ_out⟩, allowing an almost exact preparation of this state, with only a logarithmic dependence of the run-time on the approximation error. Details follow.
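The soufflé problem is visible already in the textbook amplitude formula: after k rounds of standard amplitude amplification the success probability is sin²((2k+1)θ), with sin θ the initial overlap, so iterating past the optimum rotates the state past the target (the overlap 0.05 below is an arbitrary illustrative value):

```python
import numpy as np

# Success probability after k rounds of standard amplitude amplification.
theta = np.arcsin(0.05)        # sin(theta) = initial overlap with the target

def fidelity(k):
    return np.sin((2 * k + 1) * theta) ** 2

k_opt = int(np.pi / (4 * theta))   # optimal number of Grover iterations

assert fidelity(k_opt) > 0.99      # stopping at the right moment: near-unit fidelity
assert fidelity(2 * k_opt) < 0.05  # overshooting: the fidelity collapses again
```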
An FPAA algorithm takes a single copy of an input state |ψ_in⟩ and maps it to a state ε-close to |ψ_out⟩ ≡ Π_out |ψ_in⟩ / ∥Π_out |ψ_in⟩∥, where Π_out is a projector onto a target subspace (after the process the input state |ψ_in⟩ is no longer available). More precisely, three ingredients are required in order to implement FPAA:
1. a single copy of an input quantum state |ψ_in⟩;
2. the ability to implement partial reflections around the input state;
3. the ability to implement partial reflections around the target subspace.
The FPAA algorithm of Yoder et al. [36] can be implemented provided that conditions 1–3 hold, and it has both the quadratic speedup of Grover search and the fixed-point property. The algorithm is parametric, depending on two input parameters ε ∈ [0, 1] and γ ∈ (0, 1): if |⟨ψ_out|ψ_in⟩| ≥ √γ holds, then the output of the FPAA algorithm is a state within trace distance ε from the ideal |ψ_out⟩. The number of calls to R_φ(ψ) and R_φ(Π_out) is in O(γ^{−1/2} log ε^{−1}).
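The fixed-point mechanism can be illustrated with Grover's earlier "π/3" algorithm, which shares the fixed-point property of the Yoder et al. scheme (though not its quadratic speedup): the recursion U_{m+1} = U_m R_s(π/3) U_m† R_t(π/3) U_m maps a failure probability ε to ε³ at every round. A two-dimensional simulation suffices, since the dynamics is confined to span{|ψ_in⟩, |ψ_out⟩}:

```python
import numpy as np

w = np.exp(1j * np.pi / 3)                 # the pi/3 phase of both partial reflections

t = np.array([1.0, 0.0], dtype=complex)    # target state |psi_out>
a = np.sqrt(0.5)                           # initial overlap <t|s> (illustrative)
s = np.array([a, np.sqrt(1 - a**2)], dtype=complex)   # input state |psi_in>

R_t = np.eye(2) - (1 - w) * np.outer(t, t.conj())     # partial reflection around t
R_s = np.eye(2) - (1 - w) * np.outer(s, s.conj())     # partial reflection around s

# Three rounds of the pi/3 recursion: eps -> eps^3 -> eps^9 -> eps^27.
U = np.eye(2, dtype=complex)
for _ in range(3):
    U = U @ R_s @ U.conj().T @ R_t @ U

eps0 = 1 - a**2                            # initial failure probability
fail = 1 - abs(t.conj() @ U @ s) ** 2
assert np.isclose(fail, eps0 ** 27)        # monotone convergence to the target
```

Note the contrast with standard amplitude amplification: here further iterations can only decrease the failure probability, never increase it.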

Heralded state preparation:
We finally show how to use FPAA followed by a projective measurement to implement, in an efficient way, a heralded preparation of |π⟩. If we start from an initial state |ψ_in⟩ and directly apply the projective measurement { |π⟩⟨π| , I − |π⟩⟨π| }, the success probability is |⟨π|ψ_in⟩|²; this success probability can be increased by preceding the measurement with a round of amplitude amplification. On average, the procedure using amplitude amplification has a quadratically smaller run-time for producing a copy of |π⟩. More precisely, the heralded state preparation works as follows. We first run the optimal FPAA algorithm [36] using reflections around the initial state |ψ_in⟩ and around the target state |π⟩; the algorithm is run setting ε as the target approximation error and some value γ > 0 as the overlap parameter. This means that, if the inequality |⟨π|ψ_in⟩|² ≥ γ holds, then FPAA is guaranteed to output a state within distance ε from |π⟩; if instead |⟨π|ψ_in⟩|² < γ holds, the output state can be arbitrarily far from |π⟩. To ameliorate this issue, after the FPAA we apply a projective measurement onto |π⟩ (or onto the orthogonal subspace). When the measurement succeeds, the preparation of (an approximation of) |π⟩ is guaranteed, independently of the initial overlap |⟨π|ψ_in⟩|. We also note that this final measurement succeeds almost deterministically (with probability at least 1 − ε) when |⟨π|ψ_in⟩|² ≥ γ holds. The number of reflections needed in the heralded preparation of |π⟩ is then O(γ^{−1/2} log ε^{−1}) and the total run-time is in O(γ^{−1/2} δ^{−1/2} log²(ε^{−1})), as we will prove in the next Appendix.
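To see the quadratic advantage concretely, compare the expected number of reflections with and without the FPAA round (schematic counting that ignores constant factors; the values of γ and ε are arbitrary illustrative choices):

```python
import math

gamma, eps = 1e-6, 1e-3   # overlap parameter and target error (illustrative)

# Repeat-until-success with bare projective measurements: success probability
# ~ gamma per attempt, hence ~ 1/gamma attempts on average.
cost_direct = 1 / gamma

# Heralded preparation: one FPAA round of ~ gamma^{-1/2} log(1/eps) reflections,
# after which the final measurement succeeds with probability ~ 1 - eps.
cost_heralded = math.sqrt(1 / gamma) * math.log(1 / eps)

assert cost_heralded < cost_direct   # quadratically better in 1/gamma
```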

D Analysis of imperfect reflection operators
Here we consider the propagation of errors when the partial reflections used within FPAA are approximate, and how the run-time is affected. For this section only, we assume that the relevant parameters in the soft-O notation are δ, γ and log ε^{−1}; namely, we keep log ε^{−1} terms and discard log log ε^{−1} dependencies.
We suppose that √γ is (a lower bound to) the overlap between |ψ_in⟩ and |π⟩, and that the final targeted error is ε. FPAA entails the use of O(γ^{−1/2} log ε^{−1}) perfect reflections in order to achieve the desired accuracy goal. However, the same can be achieved with imperfect reflections, provided that each reflection has an error smaller than ε divided by the number of steps: by the triangle inequality the total accumulated error is then upper bounded by ε. That is, we need to implement each partial reflection with an accuracy ε′ ∈ Θ( ε √γ / log ε^{−1} ), at a cost of O(δ^{−1/2} log ε′^{−1}) calls to W(P) per reflection. Hence the total gate cost of the heralded state preparation procedure (nesting approximate reflections within FPAA) is given by O(γ^{−1/2} log ε^{−1}) · O(δ^{−1/2} log ε′^{−1}) = Õ(γ^{−1/2} δ^{−1/2} log²(ε^{−1})). Thus, heralded preparation of |π⟩ within final precision ε has an overall gate complexity scaling as log²(ε^{−1}). We also remark that errors do not propagate from one time step to the next in the slowly evolving sequence, since at each time step we freshly prepare new copies of |π⟩. This is possible because we have access to projectors onto the required states, which allows approximation errors to be suppressed.

E Failure probabilities of preparation from uniform and from samples
We show here that the Prepare subroutines are not overly sensitive to small imperfections and that their failure probabilities decrease exponentially with c, the number of classical samples carried over at each step of the slowly evolving sequence.
Preparation from uniform distribution: For PrepareFromUniform, as given in Alg. 1, the analysis is simple, since the input state |u⟩ carries no error. The algorithm succeeds when at least c out of 2c heralded preparations of |π⟩ starting from |u⟩ are successful. In the case in which χ ≥ √N / f(π), the global probability of failure is 2^{−O(c)}. In fact, the 2c runs have independent outcomes; we can therefore apply the Chernoff bound

Pr[ X ≥ (1 + δ)μ ] ≤ ( e^δ / (1 + δ)^{1+δ} )^μ ,   μ := 2cε ,

where X is the number of failed runs, ε is an upper bound to the failure probability in each preparation of |π⟩, and δ > 0 is a free parameter (not to be confused with the spectral gap). Choosing 1 + δ = 1/(2ε) we get

Pr[ X ≥ c ] ≤ (2eε)^c ≤ 2^{−c} ,

where the second inequality holds for sufficiently small ε.
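The 2^{−O(c)} decay can be checked against the exact binomial tail (a small sketch; the per-run failure probability ε = 0.05 is an illustrative value):

```python
import math

def fail_prob(c, eps):
    """P[at least c failures among 2c runs], each failing independently w.p. eps."""
    return sum(math.comb(2 * c, k) * eps**k * (1 - eps)**(2 * c - k)
               for k in range(c, 2 * c + 1))

eps = 0.05
tails = [fail_prob(c, eps) for c in (2, 4, 8)]

assert tails[0] > tails[1] > tails[2]                  # exponential decay in c
assert all(fail_prob(c, eps) <= (2 * math.e * eps) ** c # Chernoff-derived bound
           for c in (2, 4, 8))
```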

Preparation from samples:
A similar analysis holds for PrepareFromSamples, as given in Alg. 2. Notice that the input samples x = {x₁, . . . , x_c} are not distributed exactly according to π, but according to a distribution π̃ which is ε-close to π in, say, total variation distance. Considering a random variable X̃ distributed as π̃ we have, for any v > 0,

Pr[ π^{−1/2}(X̃) ≥ v ] ≤ Pr[ π^{−1/2}(X) ≥ v ] + ε ≤ E[ π^{−1/2}(X) ] / v + ε ,

where X is distributed according to π: in the first inequality we have applied the definition of total variation distance and in the second Markov's inequality. Setting v = 2 E[π^{−1/2}(X)] = 2 f(π), we obtain Pr[ π^{−1/2}(X̃) ≥ 2 f(π) ] ≤ 1/2 + ε. Thus, with high probability at least one sample x* ∈ x satisfies π^{−1/2}(x*) < 2 f(π); namely, this happens with probability at least 1 − (1/2 + ε)^c = 1 − 2^{−O(c)}. Then, we run the heralded state preparation of |π⟩ starting from |x*⟩ 2c times. If χ ≥ f(π), the analysis proceeds exactly as the one performed for PrepareFromUniform and, hence, with probability 1 − 2^{−O(c)} at least c of these 2c runs succeed in producing approximations of |π⟩. The global failure probability of PrepareFromSamples is thus in 2^{−O(c)}.

Combined algorithm:
The algorithms PrepareFromUniform and PrepareFromSamples are used as subroutines of the combined state preparation algorithm, as specified in the main text. In this algorithm the value of χ is doubled until it exceeds 2 N^{1/4}, at which point either χ ≥ f(π) or χ ≥ √N / f(π) is satisfied (since f(π) · √N/f(π) = √N implies min{ f(π), √N/f(π) } ≤ N^{1/4}): the algorithm then has to succeed, except with probability 2^{−O(c)}.

G Lower bound on the oracle cost of sampling
Here we prove a lower bound on the number of reflections around |π⟩ that are needed to produce two classical samples drawn from π starting from a single copy of |π⟩. The lower bound also applies when c + 1 classical samples have to be produced from c copies of |π⟩, and thus we directly prove this more general case.
The state preparation algorithms presented in this work allow one to prepare |π⟩ using O(C(π)) ⊆ O(N^{1/4}) reflections around |π⟩, together with at most logarithmically many copies of |π⟩. A result of Aaronson and Christiano [47] shows that there exists a class of states with all positive real amplitudes (effectively, coherent encodings of probability distributions) which require, on average, Ω(N^{1/4}) accesses to the reflection oracle in order to be duplicated. This already shows that our algorithms have essentially optimal worst-case performance.
We strengthen this result in two ways. First, we prove an Ω(C(π)) lower bound on the number of oracle accesses needed, thus matching the O(C(π)) oracle complexity attained by our algorithms for all values of C(π). Second, we show that the same lower bound also holds for classical sampling problems. Namely, suppose that we have c initial copies of |π⟩ and the ability to implement controlled reflections around |π⟩, while the goal is to obtain c + 1 classical samples distributed according to π. We show that in order to accomplish this task Ω(C(π)/c²) controlled reflections around |π⟩ are required. This means that our algorithm is asymptotically optimal (up to polylogarithmic factors) in the number of queries to a reflection oracle.
The proof of this lower bound hinges upon the "inner-product adversary" method [47]. We can condense the results of Section 4.2 and Appendix B of [47] into the following theorem.

Theorem 1. Suppose that we have access to reflection oracles U_ψ = I − 2|ψ⟩⟨ψ| (and to their controlled versions c-U_ψ), where the states |ψ⟩ come from a subset Z of the entire Hilbert space H. Moreover, we require that on these states there is a symmetric binary relation R ⊆ Z × Z such that

∀ |ψ⟩ ∈ Z : (ψ, ψ) ∉ R ,
∀ |ψ⟩ ∈ Z, ∃ |φ⟩ ∈ Z : (ψ, φ) ∈ R .
Suppose, next, that for all |ψ⟩ ∈ Z and for all |η⟩ ∈ H that are orthogonal to |ψ⟩ the inequality

E_{φ ∈ Z : (ψ,φ) ∈ R} [ |⟨φ|η⟩|² ] ≤ γ     (36)

holds for some γ ∈ ℝ₊. Then, consider a quantum circuit Q_ψ consisting of a fixed set of unitary operations Q* that make oracle calls to c-U_ψ (and similarly, Q_φ is obtained when Q* calls c-U_φ). Suppose that for all (ψ, φ) ∈ R the quantum states |Ψ^ψ_in⟩ and |Ψ^φ_in⟩ are two input states such that |⟨Ψ^φ_in|Ψ^ψ_in⟩| ≥ α, while the output states |Ψ^ψ_out⟩ = Q_ψ |Ψ^ψ_in⟩ and |Ψ^φ_out⟩ = Q_φ |Ψ^φ_in⟩ have to satisfy |⟨Ψ^φ_out|Ψ^ψ_out⟩| ≤ β. Then Q* must make

Ω( (α − β) / √γ )     (37)

accesses to c-U_ψ or c-U_φ to obtain these output states.
This theorem can be applied to our case as follows. The input state consists of c copies of |π⟩, while the output state consists of c + 1 classical samples from π. Namely, we consider a quantum circuit Q_π which aims at producing these c + 1 classical samples: Q_π consists of a sequence of CPTP maps making oracle calls to c-U_π, the controlled reflection around |π⟩. Purifying the maps of Q_π to unitary operations, we obtain a circuit Q* which employs the same number of oracle calls to c-U_π. The output of Q* then has the form

Q* |π⟩^{⊗c} |0⟩ = Σ_{x₁,...,x_{c+1}} √( π(x₁) · · · π(x_{c+1}) ) |x₁, . . . , x_{c+1}⟩ |φ(x₁, . . . , x_{c+1})⟩ ,   (38)

where |φ(x₁, . . . , x_{c+1})⟩ is a state containing all the residual information. The output state in Eq. (38) also generalizes other tasks; e.g., choosing |φ(x₁, . . . , x_{c+1})⟩ = |0⟩ corresponds to preparing c + 1 copies of |π⟩. Next, we consider a set Z of quantum states which are coherent encodings of specific probability distributions; on these states we impose a relation R which is suitable for computing a bound as in Eq. (36). In turn, using Eq. (37), this provides a lower bound on the number of reflections required by the quantum circuit Q_π or, equivalently, by its purification Q*.