Efficient Bayesian phase estimation using mixed priors

We describe an efficient implementation of Bayesian quantum phase estimation in the presence of noise and multiple eigenstates. The main contribution of this work is the dynamic switching between different representations of the phase distributions, namely truncated Fourier series and normal distributions. The Fourier-series representation has the advantage of being exact in many cases, but suffers from increasing complexity with each update of the prior. This necessitates truncation of the series, which eventually causes the distribution to become unstable. We derive bounds on the error in representing normal distributions with a truncated Fourier series, and use these to decide when to switch to the normal-distribution representation. This representation is much simpler, and was proposed in conjunction with rejection filtering for approximate Bayesian updates. We show that, in many cases, the update can be done exactly using analytic expressions, thereby greatly reducing the time complexity of the updates. Finally, when dealing with a superposition of several eigenstates, we need to estimate the relative weights. This can be formulated as a convex optimization problem, which we solve using a gradient-projection algorithm. By updating the weights at exponentially scaled iterations we greatly reduce the computational complexity without affecting the overall accuracy.


Introduction
Phase estimation is an important building block in quantum computing, with applications ranging from ground-state determination in quantum chemistry, to prime factorization and quantum sampling [2,15,18]. In the ideal setting we assume that the quantum system can be initialized to an eigenstate |φ⟩ of a known unitary U, such that U|φ⟩ = e^{iφ}|φ⟩.
The goal of quantum phase estimation (QPE) is then to estimate the phase φ. Some of the existing approaches include quantum-Fourier-based phase estimation [1,6], iterative phase estimation [11,17], and other methods including robust phase estimation, time-series analysis, and integral kernels [10,14,16]. In practice, there are several factors that complicate the problem. First, it may not be possible to initialize the state exactly to a single eigenstate. This could be because the state is perturbed by noise, or simply because the eigenstate is unknown. The latter case arises, for instance, in the ground-state determination of molecules in quantum chemistry, where the desired eigenstate can only be approximated. Regardless of the cause, phase estimation may need to deal with an initial state that is a superposition of eigenstates: |Ψ⟩ = Σ_j α_j|φ_j⟩. Second, practical phase-estimation algorithms may also need to deal with different sources of noise present in current and near-term quantum devices. Bayesian phase estimation [17] has been shown to be particularly well suited for dealing with noise [19] and the presence of multiple eigenstates [13].

Figure 1: The multi-round quantum circuit used in the experiments.
In this paper we describe an efficient implementation of Bayesian phase estimation. In Section 2 we describe the Bayesian approach to quantum phase estimation along with an explanation of techniques used to implement it. In Section 3 we review some of the difficulties encountered in existing techniques and provide a detailed description of the proposed algorithm. This is followed by numerical experiments in Section 4, and conclusions in Section 5.

Bayesian phase estimation
We consider Bayesian phase estimation with measurements obtained using the quantum circuit depicted in Figure 1, as proposed in [13]. The circuit consists of a series of rounds, each parameterized by an integer exponent k_r and phase shift β_r. The different rounds in the circuit make up a single experiment and result in a binary measurement vector m. Suppose we are given a quantum circuit of R rounds parameterized by vectors β ∈ [0, 2π)^R and k ∈ N^R. We can then denote by P_{k,β}(m | φ, w) the probability of observing measurements m for given phases φ and weight vector w with entries |α_j|². We can define a prior distribution P(φ, w) over the phases φ ∈ Φ^R = [0, 2π)^R and weights w ∈ Δ, where Δ denotes the unit simplex {w ∈ R^R | w ≥ 0, Σ_j w_j = 1}. The posterior distribution then follows from Bayes' theorem, and is given by

P(φ, w | m) = P_{k,β}(m | φ, w) · P(φ, w) / P_{k,β}(m),    (1)

where

P_{k,β}(m) = ∫_{Φ^R} ∫_Δ P_{k,β}(m | φ, w) · P(φ, w) dw dφ.    (2)

Note that throughout the text, the probability density functions are distinguished by parameter names to keep the notation uncluttered. By updating the prior after each set of measurements, we can express the posterior distribution resulting from the ℓ-th experiment, with parameters k = k^{(ℓ)} and β = β^{(ℓ)}, and measurements m = m^{(ℓ)}, as

P^{(ℓ+1)}(φ, w) = P_{k,β}(m | φ, w) · P^{(ℓ)}(φ, w) / P^{(ℓ)}_{k,β}(m),    (3)

which implicitly depends on the parameters and measurements from previous experiments.
In the special case where |Ψ⟩ = |φ⟩ is a single eigenstate with eigenphase φ, the probability of observing a certain measurement vector m is given by the single-phase likelihood P_{k,β}(m | φ) in (4). In the general case, where |Ψ⟩ = Σ_j α_j|φ_j⟩ is a superposition of eigenstates, this changes to the mixture

P_{k,β}(m | φ, w) = Σ_j w_j · P_{k,β}(m | φ_j).

It is reasonable to assume, as done in [13], that the joint prior P(φ, w) can be written in the factored form P(φ, w) = P(w) · Π_j P(φ_j). Updates to the prior for each of the phases φ_j can then be obtained from (3) by integrating out the weights and all phases except φ_j (denoted by φ_\j), which yields the phase update (5). Similarly, integrating over all phases φ gives the updated weight distribution (7). Finally, it can be verified that the marginal probability (2) can be expressed in terms of the scalars C_j and W_j from equation (6) as (8). In order to implement Bayesian phase estimation we need a suitable representation for the probability density functions P^{(ℓ)}(φ_j) of the phases. We look at these next, and discuss the treatment of weights in more detail in Section 2.3.

Normal distribution
One of the simplest ways to represent the prior P(φ) is based on the probability density function f_{µ,σ} of the univariate normal distribution N(µ, σ²). As phases that differ by integer multiples of 2π are equivalent, we can obtain the desired representation by wrapping the normal distribution around the interval [0, 2π) to get

f°_{µ,σ}(φ) = Σ_{m ∈ Z} f_{µ,σ}(φ + 2πm).    (9)

After acquiring the measurements from an experiment, we would like to update the prior distributions to P_j^{(ℓ+1)}(φ_j), as given in (5). However, the resulting candidate distribution will not be in the form of a wrapped normal distribution, and we must therefore find the best fit. This amounts to finding the mean µ^{(ℓ+1)} and standard deviation σ^{(ℓ+1)} of the candidate distribution and using these to define the updated normal prior. In order to find these parameters, we have to evaluate the expected value of e^{iφ} over the distribution Q_j = P_j^{(ℓ+1)}, namely

E_j = ∫_0^{2π} e^{iφ} Q_j(φ) dφ.    (10)

Given this expectation we can obtain the mean and Holevo variance using (see for example [9])

µ_j = arg(E_j),    σ_j² = |E_j|^{-2} − 1.    (11)

These values are then used to define the updated prior P_j^{(ℓ+1)} := f°_{µ_j,σ_j}. Priors based on the normal distribution were used in the context of (noisy) single-round experiments with a single eigenstate by [19]. In that work, the integral appearing in (10) is evaluated approximately using rejection sampling, which requires a potentially large number of samples to evaluate accurately. In Section 3.3 we show how this expectation can be computed much more efficiently.
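To make the moment computation in (11) concrete, the following short Python sketch recovers the mean and Holevo variance from the expectation E[e^{iφ}]. The sample-based check at the end is for illustration only; the function and variable names are ours.

```python
import numpy as np

def circular_moments(expectation):
    """Mean and Holevo variance from E[exp(i*phi)], as in eq. (11)."""
    mu = np.angle(expectation) % (2 * np.pi)   # circular mean in [0, 2*pi)
    holevo_var = abs(expectation) ** (-2) - 1  # Holevo phase variance
    return mu, holevo_var

# Illustration: estimate E[exp(i*phi)] from samples of a wrapped normal.
rng = np.random.default_rng(0)
samples = (1.5 + 0.1 * rng.standard_normal(200_000)) % (2 * np.pi)
E = np.mean(np.exp(1j * samples))
mu, vh = circular_moments(E)
```

For a wrapped normal with parameters (µ, σ) the exact expectation is e^{iµ − σ²/2}, so the recovered Holevo variance approaches e^{σ²} − 1 ≈ σ² for small σ.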

Fourier representation
A second way to represent distributions is based on the Fourier series

P(φ) = Σ_{k=0}^{n} ( c_k cos(kφ) + s_k sin(kφ) ),

where c_k and s_k are scalar coefficients. The mean and variance of the resulting distribution over [0, 2π) are conveniently expressed in terms of the coefficients as arg(c_1 + i·s_1) and (π²(c_1² + s_1²))^{-1} − 1, respectively [13]. It is well known that products of sine and cosine terms can be rewritten as sums of individual sine and cosine terms by, possibly repeated, application of the product-sum formulas

cos(θ)cos(ϕ) = ½(cos(θ − ϕ) + cos(θ + ϕ)),
sin(θ)sin(ϕ) = ½(cos(θ − ϕ) − cos(θ + ϕ)),    (12)
sin(θ)cos(ϕ) = ½(sin(θ − ϕ) + sin(θ + ϕ)).

Together with the identities cos(−φ) = cos(φ) and sin(−φ) = −sin(φ) it follows that sums and products of Fourier series can again be written as Fourier series. By appropriately adding or subtracting the product-sum formulas in (12) it follows that

cos(θ + ϕ) = cos(θ)cos(ϕ) − sin(θ)sin(ϕ),
sin(θ + ϕ) = sin(θ)cos(ϕ) + cos(θ)sin(ϕ).
The uniform prior P(φ) = 1/(2π) can be written as a Fourier series with c_0 = 1/(2π) and all other coefficients zero. Since the update rule in (5) only multiplies and adds distributions, the resulting phase distributions P_j can be written exactly as a Fourier series [3,9,13]. More generally, it follows that whenever the initial prior can be expressed as a Fourier series, all subsequent phase distributions can be expressed as Fourier series. In order to evaluate the products of distributions in (5) we can use the convenient update rules (15) that follow from the product-sum formulas. Given two Fourier series with maximum indices n_1 and n_2, the sum of the two series has no more than max{n_1, n_2} terms, whereas the product generally has a non-zero term at the maximum index n_1 + n_2. When updating the distribution, this means that each round in the experiment increases the maximum index of the Fourier-series representation by k_r. The number of coefficients that need to be stored and manipulated therefore keeps growing with each update, leading to a computational complexity that grows at least quadratically in the number of rounds. To keep the Bayesian approach practical, we therefore need to limit the number of terms in the representation [13]. However, as the algorithm progresses, the probability distributions tend to become increasingly localized, and ringing effects, such as those shown in Figure 2, start to appear as a result of the truncation. This eventually leads to instability in the distribution, causing the algorithm to fail. Increasing the number of terms in the representation can delay but not eliminate this effect. Moreover, from a computational point of view, we would of course like to keep the number of terms in the representation as small as possible.
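The product of two truncated Fourier series amounts to a convolution of their coefficients, which is easily seen by converting the cos/sin form to two-sided complex coefficients. A minimal sketch (not the paper's implementation; the helper names are ours):

```python
import numpy as np

def to_complex(c, s):
    """cos/sin coefficients -> two-sided complex coefficients a,
    with P(phi) = sum_k a[n+k] * exp(i*k*phi) and a[n-k] = conj(a[n+k])."""
    n = len(c) - 1
    a = np.zeros(2 * n + 1, dtype=complex)
    a[n] = c[0]
    for k in range(1, n + 1):
        a[n + k] = (c[k] - 1j * s[k]) / 2
        a[n - k] = (c[k] + 1j * s[k]) / 2
    return a

def from_complex(a):
    """Inverse of to_complex for a real-valued series."""
    n = len(a) // 2
    c = np.zeros(n + 1)
    s = np.zeros(n + 1)
    c[0] = a[n].real
    for k in range(1, n + 1):
        c[k] = 2 * a[n + k].real
        s[k] = -2 * a[n + k].imag
    return c, s

def fourier_product(c1, s1, c2, s2):
    """Pointwise product of two Fourier series; the result has maximum
    index n1 + n2, obtained as a convolution of the coefficients."""
    a = np.convolve(to_complex(c1, s1), to_complex(c2, s2))
    return from_complex(a)

# Example: cos(phi) * cos(phi) = 1/2 + cos(2*phi)/2.
c1 = np.array([0.0, 1.0])
s1 = np.array([0.0, 0.0])
cp, sp = fourier_product(c1, s1, c1, s1)
```

The growth of the maximum index from n_1 and n_2 to n_1 + n_2 is visible directly in the length of the convolved coefficient array.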

Weight updates
Given an initial uniform prior for the weights, we see that equation (7) evolves to a weighted sum of Dirichlet distributions, with densities proportional to Π_i w_i^{α_i}. The parameters α for each such distribution are initially zero, but change after each update such that each Dirichlet distribution has integer parameters α_i ≥ 0 with Σ_i α_i = n, where n is the update iteration. Assuming there are k eigenvalues, this gives a total of binom(n+k−1, k−1) = O(n^{k−1}) terms to represent the distribution exactly, which becomes impractical for even modest values of n and k. An alternative approach, taken in [13], is to choose the weights w such that they maximize the log-likelihood of the measurements. Assuming a uniform prior, this leads to the convex optimization problem

minimize_{w ∈ Δ}  −(1/n) Σ_{ℓ=1}^{n} log⟨C^{(ℓ)}, w⟩,    (16)

where C^{(ℓ)} denotes a vector of the coefficients C_j^{(ℓ)} from (6) at iteration ℓ. The normalization by 1/n is included to keep the problem scale independent of the number of iterations, but otherwise does not affect the solution. In [13], this subproblem is solved using a sequential least-squares approach for the first thousand iterations, after which a local optimization method is applied.

Proposed approach
In the previous section we have seen that, given initial uniform priors, the phase distributions have exact Fourier representations up to the point where the number of coefficients exceeds a chosen maximum. Truncation of the series eventually leads to instability as the phase distributions become increasingly localized. In order to prevent the algorithm from breaking down we propose to monitor the standard deviation of each phase distribution. Once the standard deviation falls below a certain threshold we convert the Fourier-series representation of the distribution to a normal-based representation. From then on we only update the parameters of the normal distribution.
At the beginning of the algorithm we need to choose the priors and initialize the weights. Starting with identical priors and weights for each distribution leads to a symmetry in which the updates to the distributions are identical, which means that we effectively only use a single distribution. To break this symmetry we can initialize the priors such that they are all distinct. In Section 3.1 we show how a wrapped normal distribution can be approximated by a finite Fourier series. This allows us to initialize the phase distributions with relatively flat normal-like distributions with different mean values. In Section 3.2 we bound the pointwise error in representing a normal distribution by a truncated Fourier series. For a given maximum number of coefficients in the Fourier representation we can then find the standard deviation for which the error bound exceeds a given maximum. This standard deviation is then used as the threshold that triggers conversion from the Fourier to the normal representation of the phase distribution. In Section 3.3 we show that the parameter updates for the normal distribution can be computed analytically. This means that we no longer need the rejection-sampling procedure used in [19] to approximate the parameters. Finally, in Section 3.4 we propose an efficient approach for solving the weight optimization problem (16).

Fourier representation of the normal distribution
In order to represent a wrapped normal distribution f°_{µ,σ} in Fourier form we need to determine the sine and cosine coefficients. For a term such as sin(kφ), this can be done by evaluating the inner product

∫_0^{2π} f°_{µ,σ}(φ) sin(kφ) dφ = ∫_{−∞}^{∞} f_{µ,σ}(φ) sin(kφ) dφ,    (17)

where the equality follows from the definition of f°_{µ,σ} in (9) and the periodicity of the sine function. A similar expression holds for the inner product with cosine terms cos(kφ). We can evaluate these expressions using the Fourier transform of the normal distribution [5,8], which is given by

f̂_{µ,σ}(t) = ∫_{−∞}^{∞} f_{µ,σ}(φ) e^{itφ} dφ = e^{itµ − σ²t²/2}.

Choosing t = k and taking the real and imaginary parts respectively, we obtain

∫ f_{µ,σ}(φ) cos(kφ) dφ = e^{−k²σ²/2} cos(kµ),    ∫ f_{µ,σ}(φ) sin(kφ) dφ = e^{−k²σ²/2} sin(kµ).    (18)

For the final coefficients we need to make sure that the basis functions are orthonormal, which is equivalent to scaling the coefficients by 1/π for k ≠ 0 and by 1/(2π) for k = 0, since ∫_0^{2π} cos²(kφ) dφ equals π for k ≠ 0 and 2π for k = 0.
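The resulting coefficients give the familiar series f°_{µ,σ}(φ) = 1/(2π) + (1/π) Σ_k e^{−k²σ²/2} cos(k(φ − µ)). A short sketch (helper names are ours), checked against a direct summation of the wrapped density:

```python
import numpy as np

def wrapped_normal_fourier(mu, sigma, n_max):
    """Normalized Fourier coefficients (18) of the wrapped normal."""
    k = np.arange(n_max + 1)
    damp = np.exp(-0.5 * (k * sigma) ** 2)
    c = damp * np.cos(k * mu) / np.pi
    s = damp * np.sin(k * mu) / np.pi
    c[0] = 1 / (2 * np.pi)   # constant term scaled by 1/(2*pi)
    s[0] = 0.0
    return c, s

def eval_series(c, s, phi):
    """Evaluate the truncated Fourier series at a single point phi."""
    k = np.arange(len(c))
    return float(np.sum(c * np.cos(k * phi) + s * np.sin(k * phi)))

c, s = wrapped_normal_fourier(2.0, 0.5, 60)
```

The damping factor e^{−k²σ²/2} also explains the truncation behavior discussed next: the smaller σ, the more slowly the coefficients decay, and the more terms are needed.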

Conversion to normal distribution form
After each experiment we update the Fourier representation of each of the eigenphases using the rules in (15). Truncation of the coefficients during the update may introduce intermediate errors, and is therefore best done after the update by discarding the excess coefficients. Over the course of successive updates, the distributions tend to become increasingly peaked. As a result of the limited number of coefficients, this causes the distributions to exhibit the ringing effects seen in Figure 2, which become more and more severe until the distributions blow up and become meaningless. We can avoid this by monitoring the standard deviation of each of the distributions and converting a distribution from Fourier representation to normal form once the standard deviation becomes too small. We determine the critical standard deviation by considering the maximum pointwise error of approximating a normal distribution with a truncated Fourier series. After normalizing the coefficients found in (18), the pointwise error at any point φ for n_max ≥ 0 is given by summation over the discarded terms:

e(φ) = (1/π) Σ_{k=n_max+1}^{∞} e^{−k²σ²/2} cos(k(φ − µ)).    (19a)

The maximum absolute error occurs at φ = µ, where all cosine terms attain their maximum, giving the bound

ε(σ, n_max) = (1/π) Σ_{k=n_max+1}^{∞} e^{−k²σ²/2}.    (19b)

We illustrate the error bound in Figure 2. Given a fixed limit n_max on the number of coefficients and a maximum permitted error ε, we can define the critical value σ*(n_max) as the minimum σ that satisfies ε(σ, n_max) ≤ ε. A good approximation can be determined efficiently using bisection search. Once the standard deviation of the Fourier representation falls below this critical value, we can no longer guarantee that the representation is accurate, and we therefore switch to a normal representation of the distribution based on the values in (11). Note that the conversion is done independently for the different distributions.
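Since the bound (19b) is monotonically decreasing in σ, the bisection is straightforward. A sketch (helper names are ours; the tail of the sum is truncated at a large index, which is harmless for bracketing):

```python
import numpy as np

def truncation_error(sigma, n_max, k_hi=20000):
    """Error bound (19b): (1/pi) * sum_{k > n_max} exp(-k^2 sigma^2 / 2)."""
    k = np.arange(n_max + 1, k_hi)
    return float(np.sum(np.exp(-0.5 * (k * sigma) ** 2)) / np.pi)

def critical_sigma(n_max, eps, lo=1e-4, hi=10.0, iters=60):
    """Bisection for the smallest sigma with truncation_error <= eps."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncation_error(mid, n_max) > eps:
            lo = mid
        else:
            hi = mid
    return hi

sigma_star = critical_sigma(200, 1e-4)
```

For n_max = 200 and ε = 10^{-4} this yields a value close to the σ*(200) ≈ 0.023 used in the experiments.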

Updating the normal distribution
In [19], single-round updates to the normal distribution are done using rejection sampling. It turns out that the integrals appearing in the update have closed-form solutions that are easily evaluated. We first consider the computation of the C_j^{(ℓ)} coefficients appearing in (6). It follows from (4) that we can write P_{k,β}(m | φ_j) as a weighted sum of sine and cosine terms with certain coefficients c_k and s_k. Given a wrapped normal prior P_j^{(ℓ)} = f°_{µ,σ}, it then follows that C_j^{(ℓ)} is a weighted sum of integrals of sine and cosine terms against f°_{µ,σ}. Using the integrals in (18), this yields a closed-form expression. Once these values have been computed, we can evaluate the probability P^{(ℓ)}_{k,β}(m) using (8). The posterior distribution used to define the updated prior is then given by equation (5), where a tilde indicates that this is an intermediate distribution. The expression within brackets, along with the normalization factor 1/P^{(ℓ)}_{k,β}(m), can be expressed in Fourier representation with certain coefficients c_k and s_k. Together with the normal prior distribution, we thus have

P̃_j^{(ℓ+1)}(φ) = f°_{µ,σ}(φ) · Σ_k ( c_k cos(kφ) + s_k sin(kφ) ).

To obtain an updated normal distribution we use (11), which requires the evaluation of E_Q[e^{iφ}] with Q = P̃_j^{(ℓ+1)}. Writing e^{iφ} = cos(φ) + i sin(φ) and expanding the products of sine and cosine terms using (12), the expectation again reduces to a weighted sum of the integrals in (18), and is therefore easily evaluated.
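The whole analytic update collapses to a few evaluations of the normal characteristic function once the likelihood is written in two-sided complex Fourier coefficients a_k, so that L(φ) = Σ_k a_k e^{ikφ}. A sketch under that coefficient convention (which is ours, not necessarily the paper's):

```python
import numpy as np

def normal_update(mu, sigma, a):
    """Analytic update of a wrapped normal prior N(mu, sigma^2) against a
    likelihood with two-sided complex Fourier coefficients a[0..2n],
    where a[n] is the constant term. Returns the posterior mean and
    Holevo variance via E_Q[exp(i*phi)], cf. (10)-(11)."""
    n = len(a) // 2
    k = np.arange(-n, n + 1)
    # E[exp(i*t*phi)] under the wrapped normal prior (characteristic function).
    char = lambda t: np.exp(1j * t * mu - 0.5 * (t * sigma) ** 2)
    norm = np.sum(a * char(k))        # integral of likelihood * prior
    first = np.sum(a * char(k + 1))   # integral of e^{i phi} * likelihood * prior
    E = first / norm
    return np.angle(E) % (2 * np.pi), abs(E) ** (-2) - 1
```

With a constant likelihood the prior is returned unchanged; a likelihood peaked at the prior mean, such as 1 + cos(φ − µ), shrinks the Holevo variance, as expected.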

Updating the weights
Updates of the weights are done by minimizing the negative log-likelihood function (16). A single evaluation of the objective function and its gradient at iteration n has complexity O(mn), where m is the total number of eigenphases considered. If we were to solve this problem for weight updates in every iteration, we would obtain a complexity that is quadratic in the number of iterations n. Although polynomial, this nevertheless imposes a practical limit on the number of iterations the algorithm can take. Adding a single vector C^{(ℓ)} to the summation after a large number of iterations does little to change the objective function or gradient. It therefore makes sense to solve for the weights only when a fixed percentage of new vectors has been added since the previous solve. In other words, we can use an exponentially spaced grid of iterations at which to compute the weights. Assuming a fixed maximum number of iterations to solve each problem, we then obtain a complexity of O(mn) for all weight updates combined, which is linear in the total number of iterations. An alternative approach, taken in [13], is to solve (16) using sequential least-squares programming for the first hundred experiments, and to switch to optimization of a quadratic model of the objective function after that. Given the constant size of the model, this means that the time complexity also remains linear in n.
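The exponentially spaced grid can be expressed as a simple predicate; a minimal sketch with T = 512, the value used in our experiments (the helper name is ours):

```python
def should_update_weights(n, T=512):
    """True if the weights are re-optimized at iteration n: at every one of
    the first T iterations, and afterwards only at iterations that are
    power-of-two multiples of T (2T, 4T, 8T, ...)."""
    if n <= T:
        return True
    q, r = divmod(n, T)
    return r == 0 and (q & (q - 1)) == 0  # q is a power of two
```

Up to iteration n this schedules T + O(log(n/T)) solves, and since the cost of the solve at iteration 2^j·T is O(m·2^j·T), the total cost telescopes to O(mn).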
For the minimization of (16) we use a gradient-projection algorithm [4]. This iterative method requires an operator for Euclidean projection onto the unit simplex:

P_Δ(x) = argmin_{w ∈ Δ} ‖w − x‖_2.

Application of the projection operator can be done using a simple O(m log m) algorithm, which works well for small m. More advanced O(m) algorithms exist [7], but may have a large constant factor. We implemented the gradient-projection algorithm with a curvilinear line search, in which updates take the form

x^{(i+1)} = x^{(i)}(α) := P_Δ(x^{(i)} − α ∇f(x^{(i)})).

The line-search parameter α is chosen to satisfy the Armijo condition [12]

f(x^{(i)}(α)) ≤ f(x^{(i)}) + γ ⟨∇f(x^{(i)}), x^{(i)}(α) − x^{(i)}⟩,    (21)

for some small value of γ, for instance γ = 0.001. We use a backtracking line-search procedure in which the line-search parameter is initialized to α = 1, and halved as often as needed to satisfy (21).
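The solver can be sketched compactly, assuming the inner-product form of the objective in (16). For simplicity the sketch uses the simplified line search that projects once with α = 1 and backtracks along the resulting segment, rather than re-projecting for every α; all names are ours:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex, O(m log m) via sorting."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def nll_and_grad(w, C):
    """Objective and gradient of (16); row l of C holds the vector C^(l)."""
    p = C @ w
    return -np.mean(np.log(p)), -(C.T @ (1.0 / p)) / len(p)

def gradient_projection(C, w0, gamma=1e-3, max_iter=200, tol=1e-10):
    w = w0.copy()
    for _ in range(max_iter):
        f, g = nll_and_grad(w, C)
        d = project_simplex(w - g) - w          # projected descent direction
        if np.dot(d, d) < tol:                  # small ||d(1)||: stop
            break
        alpha = 1.0
        for _ in range(50):                     # bounded number of halvings
            if nll_and_grad(w + alpha * d, C)[0] <= f + gamma * alpha * np.dot(g, d):
                break
            alpha *= 0.5
        w = w + alpha * d
    return w

rng = np.random.default_rng(1)
C = rng.uniform(0.1, 1.0, size=(500, 3))
w = gradient_projection(C, np.ones(3) / 3)
```

Because each iterate is a convex combination of simplex points, the weights remain feasible throughout, and the stopping test on ‖d(1)‖ matches the main criterion described below.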
The main stopping criterion is that ‖d^{(i)}(1)‖_2 is sufficiently small, where d^{(i)}(α) = x^{(i)}(α) − x^{(i)}, as this indicates that either the norm of the gradient is small, or the negative gradient lies close to the normal cone of the simplicial constraint. In addition we impose a maximum number of iterations, as well as a maximum on the number of halving steps in the line-search procedure.

Noise modeling
One of the advantages of Bayesian phase estimation is that common types of noise can easily be included in the model. One such type of noise is depolarizing noise, which results in an ideal noise-free measurement with probability p, and a uniformly random measurement with probability 1 − p. For a single-round experiment, the probability p can be written as e^{−k_1/k_err}, where k_err is a system-dependent parameter [13]. This can be incorporated in the conditional measurement distribution by defining

P̄_{k,β}(m | φ) = p · P_{k,β}(m | φ) + (1 − p)/2,    (22)

and computing C_j in (6) using the updated measurement probability. For experiments with R rounds, such that each round uses a different auxiliary qubit and all measurements are taken at the very end, we can generalize this by replacing k_1 by the sum of all k_i values in p, and replacing the last term (1 − p)/2 by (1 − p)/2^R. Another example of noise that can be included in the probability model is measurement noise. Here, each of the measurement outcomes m_i is flipped with some fixed probability p that can be estimated experimentally. For multi-round experiments this means that the current state differs from the one expected based on the measurements so far. When the number of rounds is limited we can evaluate the measurement probability as

P̄_{k,β}(m | φ) = Σ_{m′} P(m | m′) · P_{k,β}(m′ | φ),    (23)

where m′ represents the measurements if there were no noise, and P(m | m′) is the probability of observing the noisy measurement m, which depends on the Hamming distance d(m, m′) to m′. Both (22) and (23) can be written as finite Fourier series and therefore easily fit into the proposed framework.
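As an illustration, the adjusted probabilities can be sketched as follows; the helper names are ours, and the R-round uniform term follows the generalization described above:

```python
import numpy as np

def depolarized_prob(p_ideal, k1, k_err, R=1):
    """Eq. (22), generalized to R rounds: mix the ideal measurement
    probability with a uniform outcome over the 2^R measurement vectors.
    p = exp(-k1 / k_err) is the probability of a noise-free measurement."""
    p = np.exp(-k1 / k_err)
    return p * p_ideal + (1.0 - p) / 2 ** R

def readout_prob(p0_ideal, flip):
    """Single-bit read-out error: probability of reporting outcome 0 when
    the measured bit is flipped with probability `flip`."""
    return (1.0 - flip) * p0_ideal + flip * (1.0 - p0_ideal)
```

For k_1 = 0 the ideal probability is recovered, while for k_1 much larger than k_err the outcome becomes uniformly random, matching the limiting behavior of the depolarizing model.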

Numerical experiments
In this section we compare the performance of the three different approaches: the normal approach in which all priors take the form of a normal distribution; the Fourier approach in which the priors are represented as a truncated Fourier series; and the proposed mixed approach, in which each of the priors is initially represented as a truncated Fourier series and converted to normal distribution form once the standard deviation falls below a given threshold. For all experiments we choose phase shift values β uniformly at random on the interval [0, 2π), and initialize the prior distributions as equally spaced normal distributions with a standard deviation of 3, either directly, or approximately using the techniques described in Section 3.1. By choosing different mean values we break the model symmetry, which would otherwise lead to identical updates to all distributions. All experiments were run on a 2.6 GHz Intel Core i7 processor with 16 GB of memory. The results in this section can be reproduced using the code available online at: https://github.com/ewoutvandenberg/Quantum-phase-estimation.

Fourier representation
As a first experiment we consider the application of the Fourier approach to a noiseless test problem in which two eigenstates with eigenphases φ_1 = 2 and φ_2 = 4 are superimposed with weights w_1 = 0.7 and w_2 = 0.3. We model the phase distributions using a truncated Fourier series and consider truncation after 200, 1000, and 5000 terms respectively. For the experiments we use a fixed three-round setup with k = [1, 2, 5], and run 20 independent trials of 10^6 iterations each. As the distributions evolve they may converge to eigenphases corresponding to any of the actual eigenstates. In order to evaluate the accuracy of the results, we therefore need to associate each distribution with one of the known eigenphases. There are several ways in which this matching can be done. The simple approach we found to work very well, and which we therefore use throughout, consists of matching each distribution to the eigenphase that is closest to its mean at a given reference iteration. We typically perform the matching based on the distribution at the final iteration, but intermediate distributions are sometimes used to better illustrate the performance. Applying this procedure to our current setup, with matching done at iteration 100, gives median phase errors as shown in Figure 3(a). For each of the three truncation limits we see that the median phase error initially converges steadily before suddenly diverging rapidly. For the Fourier representation with 200 terms this divergence happens after some 2,500 iterations. Using representations with a larger number of terms helps postpone the point at which divergence sets in, but does not prevent it. Matching at different iterations beyond 1,000 confirms that the divergence is inherent in the representation and not caused by the distributions suddenly converging to different eigenphases.
In Figure 3(b) we plot the median standard deviation as a function of the iteration, along with the critical σ values for ε = 10^{-4}, as calculated using the procedure described in Section 3.2. For plotting purposes we set the standard deviation to 20 when the weight of the distribution is zero. For the case where the maximum number of coefficients is 5,000, we see that the breakdown occurs very quickly after the critical standard deviation is reached. To get a better understanding of the sensitivity of this critical value with respect to ε, we plot the Fourier truncation error (19b) as a function of σ in Figure 3(c). Delaying the breakdown much further by increasing the number of terms will become computationally expensive, and we therefore conclude that a pure Fourier-based implementation is not practical in general.

Normal and mixed representations
We now move on to another experiment in which we compare the normal and mixed approaches. For the mixed approach we use a Fourier representation with 200 coefficients and a critical standard deviation corresponding to ε = 10^{-4}. We maintain the non-adaptive measurement scheme described above, but change to a new problem with three phases φ = [2, 4, 5] and weight values w = [0.5, 0.3, 0.2]. Given that the number of eigenstates in the initial state |Ψ⟩ is generally not known, we maintain five distributions for the phases. For the performance evaluation we match each distribution to one of the three ground-truth values. The weight values contributing to each distribution are added up, and the phases are averaged using the relative weights. The resulting median phase errors for the collated distributions are plotted in Figure 3(d). The normal-based approach has a slower start than the mixed approach, but eventually catches up after around 10^5 iterations. The standard deviation of the distributions and the convergence of the weights in terms of the median absolute distance are plotted in Figures 3(e-f).
A closer look at the data indicates that the mixed approach successfully converged to the correct weights in all twenty trials. The normal-based approach does so in the majority of trials, but has some notable deviations, which lead to a deteriorated mean absolute error. The average runtime of the mixed approach was 70.43 seconds, which is larger than the average of 15.65 seconds for the normal-based approach. The difference in runtime is due to the updates of the Fourier representation in the initial iterations, which are more expensive. In most of the trials of the mixed approach we found exactly three distributions with non-negligible weights. For these distributions, the transition from the Fourier representation to the normal distribution occurred on average at iterations 282, 533, and 1111, in order of decreasing weight of the distribution.

Weight updates
In Section 3.4 we proposed a way of reducing the number of weight updates to ensure that the computational complexity scales linearly instead of quadratically in the number of iterations. To validate the approach we evaluate the performance of the mixed approach with five different weight-update settings. For each of these we update the weights at every iteration for the first T iterations. After that, the weights are updated only at iterations that are power-of-two multiples of T, namely 2T, 4T, 8T, and so on. In the first setting we update the weights at every iteration for the first T = 10^5 iterations. This setting therefore closely resembles the naive approach with a complexity that scales quadratically in the number of iterations. In the next three settings we respectively choose T = 2, T = 64, and T = 512. In the fifth setting we again choose T = 512, but simplify the line search described in Section 3.4: rather than projecting x + αd for each line-search parameter α, we project once with α = 1, and perform backtracking line search using only that point. We evaluate these settings on the problem described in Section 4.2 and plot the resulting mean absolute weight errors in Figure 3(f). The different shades of the colors used for the normal and mixed approaches correspond to the different update schemes. The only setting that failed to converge was the normal approach combined with update parameter T = 2. Aside from this, all update schemes performed comparably for each of the approaches. The only real difference, not visible in the plot, lies in the runtime: in the mixed approach, the first setting, which updates the weights at every iteration for the first 10^5 iterations, takes 1040 seconds to complete. Settings two through four take 472, 587, and 610 seconds, and the final setting takes 604 seconds. For the normal approach these runtimes are 366, 117, 122, 121, and 126 seconds, respectively.
The experiments in Section 4.2 were done using T = 512, and we will continue to use this value for the remainder of the paper.

Design of single-round experiments
As an alternative to the three-round setup with fixed exponents k = [1, 2, 5] used so far, we now consider the effect of using single-round experiments in which the single k value is either chosen according to some fixed scheme, or chosen adaptively based on the estimated phase distributions. Using a fixed k = 1 was found not to work, and we therefore follow [13] and consider choosing k cyclically over the values {1, 2, ..., c_max} for different values of c_max. The resulting median phase error over twenty problem instances with a single eigenphase is shown in Figure 4(a). The performance of the normal and mixed approaches is quite similar for c_max = 1, which coincides with keeping k = 1 fixed. For c_max values equal to 2, 3, or 5, the normal-based approach consistently fails to converge to the desired phase. The mixed approach, by contrast, shows rapid convergence during the initial iterations and steady convergence for subsequent iterations, even after switching to the normal distribution. The fact that the performance remains good must be due to the initialization of the normal distribution with the partially converged Fourier representation. To verify this, we study the convergence of the normal approach when initialized with different normal distributions. The standard deviation of all distributions is set to the critical value σ*(200) ≈ 0.023, and the mean is set to the correct phase plus or minus a bias term, for a range of different bias values. We run 1,000 trials of the algorithm for 100,000 iterations, using different cyclic schemes. The percentage of trials that fail to converge to within 0.1 of the desired phase is plotted in Figure 4(b) for each setting. Note that those trials that do succeed generally continue to converge to the desired phase as the number of iterations increases. For each of the curves there is a clear transition between bias values for which the method converges, and those for which the algorithm fails to converge.
As c_max increases, or the critical sigma value decreases, this transition becomes sharper and starts at smaller bias values. For the transition from a Fourier representation to a normal distribution to succeed, it is therefore important that sufficient convergence has been achieved at the time of the switch. If needed, this can be achieved by adjusting the scheme or increasing the maximum number of Fourier coefficients. Next, we apply the same schemes to the three-eigenstate problem described in Section 4.2. When combined with the normal-based approach, none of these schemes showed successful convergence to the desired phases. Moreover, for c_max = 1 both the normal and mixed approaches failed to converge. For other small values up to c_max = 5, as seen in Figure 4(c), the mixed approach initially converged to the desired phases, but quickly diverged after switching to the normal distribution. The plot also illustrates that we can improve convergence by increasing c_max or by increasing the number of terms used in the Fourier representation. Results obtained when using five distributions to approximate the desired three are similar to those shown, and are therefore omitted from the plot.
It is also possible to use an adaptive scheme in which the value of k_1 is determined based on the current standard deviation. The choice k_1 = 1.25/σ was proposed in [19] for single-eigenstate experiments, and combined with cyclic schemes for multiple-eigenstate experiments in [13]. To evaluate the use of adaptive schemes in both the normal-based and mixed approaches, we consider the quantity k̄ = ⌈1.25 / min_j σ_j⌉, where σ_j denotes the standard deviation of the j-th phase distribution. Based on this quantity we derive two schemes: a purely adaptive scheme in which we simply set k_1 = k̄, and an adaptive cyclic scheme in which we choose k cyclically with an adaptive maximum value c_max = k̄. In Figure 4(d) we show the performance of these schemes, along with the fixed (k = 1) and purely cyclic schemes, on a single-eigenstate problem. The adaptive and cyclic schemes with c_max = 50 have similar convergence and all outperform the scheme with fixed k_1. Choosing c_max = 4,096 gives even faster convergence, especially for the purely adaptive scheme. The median phase error of the adaptive and purely cyclic schemes steadily decreases until it comes to a halt around iteration 7,000.
Applying the same schemes to a problem with three eigenphases, we find that none of the normal-based setups works. In addition, the fixed and purely adaptive schemes, which previously excelled, also fail to converge to the three desired values. Both the purely cyclic and adaptive cyclic approaches converge to the desired values, as shown in Figure 4(e). The lighter-shade curves in the figure show that these methods continue to work when using five rather than three distributions. The parameters obtained using the adaptive cyclic approach are plotted in Figure 4(f).
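The two adaptive schemes can be sketched as follows. The ceiling-based rounding of k̄ and the explicit cap k_max are assumptions on our part, made so that the result is a valid positive integer exponent:

```python
import math

def k_bar(sigmas, k_max):
    """Adaptive target k_bar = ceil(1.25 / min_j sigma_j), capped at
    k_max; the rounding and the cap are illustrative assumptions."""
    return min(k_max, max(1, math.ceil(1.25 / min(sigmas))))

def adaptive_k(sigmas, k_max):
    """Purely adaptive scheme: use k_bar directly as the exponent."""
    return k_bar(sigmas, k_max)

def adaptive_cyclic_k(sigmas, k_max, iteration):
    """Adaptive cyclic scheme: cycle k over 1, ..., c_max with an
    adaptive maximum value c_max = k_bar."""
    return 1 + iteration % k_bar(sigmas, k_max)
```

As the distributions narrow, min_j σ_j shrinks and both schemes automatically increase the circuit depth.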

Decoherence and read-out errors
In Section 3.5 we showed how different types of noise can be incorporated into the probability models.
Here we study the performance of Bayesian phase estimation under decoherence and read-out errors. Decoherence in single-round experiments has the effect of producing a completely random measurement with probability 1 − e^(−k_1/k_err), where k_1 is the number of controlled applications of the unitary U, and k_err is a system-dependent parameter, which we set to 100 in our experiments. The larger the value of k_1, the larger the probability of obtaining a completely random measurement. Indeed, for k_1 = 70 this already occurs with a probability slightly exceeding 0.5, which clearly shows that we can no longer choose arbitrarily large values for k_1. We therefore run adaptive and cyclic experiments on a single eigenstate with c_max and k_max conservatively bounded by 50. The resulting phase estimates, along with those obtained for a fixed k_1 = 1, are plotted in Figure 5(a). The cyclic and adaptive schemes have faster initial convergence than the fixed scheme, and likewise for the mixed versus the normal approach. The only configuration that failed was the normal approach combined with a purely cyclic scheme. When applied to the problem with three eigenphases, we found that the purely adaptive approach fails, even with k_max reduced to 10 or 20. The results obtained using the mixed approach and the purely cyclic scheme are shown in Figure 5(b). The results for the adaptive cyclic scheme are similar, aside from slightly faster initial convergence and somewhat slower convergence later on. For the cyclic scheme we observe that the convergence of the different phases begins at later iterations when c_max is larger. In the case of c_max = 100 we see an abrupt degradation in performance around iteration 10^5. As discussed in Section 4.1, this happens when the deviation from the correct phase is still too large at the time of switching to the normal distribution.
In this case we can avoid the degradation by increasing the number of terms n_max in the Fourier representation to 1,000, as illustrated in Figure 5(b).
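The decoherence model above, a completely random outcome with probability 1 − e^(−k_1/k_err), amounts to mixing the noiseless outcome probability with a uniform one. A minimal sketch (function name ours):

```python
import math

def decohered_probability(p_ideal, k1, k_err=100.0):
    """Mix the noiseless single-round outcome probability p_ideal with
    a uniformly random outcome (p = 1/2 for one measurement qubit),
    where the random outcome occurs with probability 1 - exp(-k1/k_err)."""
    p_random = 1.0 - math.exp(-k1 / k_err)
    return (1.0 - p_random) * p_ideal + p_random * 0.5

# For k1 = 70 and k_err = 100 the random-outcome probability is
# 1 - exp(-0.7), which is slightly above one half.
```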
For our experiments with read-out errors, we apply Bayesian phase estimation in combination with the quantum Fourier transform (QFT). The QFT circuit, illustrated in Figure 5(c) for three measurement qubits, has long been used for non-iterative estimation of a single eigenphase [1,6]. Here we show that the same circuit can be used to recover multiple eigenphases even when the measurements are noisy. The circuit in Figure 5(c) is similar to the one given in Figure 1, with some important differences. First, the exponents k are fixed to successive powers of 2, starting at 1. Second, the phase-shift values β are no longer predetermined but instead appear as conditional rotations. In our example we use quantum-controlled gates, which means that we do not know the effective value of a rotation until after measuring the corresponding qubit, provided there are no read-out errors. An alternative implementation in which the rotations are classically controlled based on earlier measurements is also possible, but we do not consider it here. Unlike our earlier experimental setup, we are now in a situation where the effective β values depend on the quantum state of the system. In order to deal with read-out errors we must therefore distinguish between the actual state of the system and the observed measurements. As described in Section 3.5, we determine the measurement probability by first computing the probability of ending up in each of the 2^n possible states, and then multiplying by the probability of obtaining the observed measurement from each given state. Although the computational complexity of this approach grows exponentially in the number of qubits, it remains tractable when the number of qubits is small. This is certainly the case for our experiments, where we use a QFT circuit with measurements on five qubits (the illustration in Figure 5(c) uses only three).
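Assuming independent per-qubit read-out errors with flip probability eps (the specific error model of Section 3.5 may differ), the probability of an observed measurement can be sketched as:

```python
def observed_probability(p_state, m, eps):
    """Probability of observing bit string m, given the probabilities
    p_state[s] of the system being in each of the 2**n basis states s
    and an independent per-qubit read-out flip probability eps."""
    n = len(p_state).bit_length() - 1    # number of qubits
    total = 0.0
    for s, p in enumerate(p_state):
        flips = bin(s ^ m).count("1")    # qubits read out incorrectly
        total += p * eps**flips * (1 - eps)**(n - flips)
    return total
```

The sum over all 2^n basis states for each observation makes the exponential cost of this approach explicit.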
The results obtained in this way on a single-eigenphase problem with varying levels of read-out error are shown in Figure 5(d). The performance of the normal approach gradually deteriorates as the noise level increases, and the method only just manages to converge when the read-out error reaches 26%. Beyond that point the method does not converge to the desired value within the given number of iterations. The mixed approach also converges more slowly with increasing read-out errors, but performs well up to a 48% error rate. As seen in Figure 5(e), the normal and mixed approaches both converge for the problem with three eigenphases with read-out error rates up to 8%. When the noise reaches 10% or above, the normal-based approach fails to find the desired phases. For the mixed approach this happens only at noise levels of around 30%.

Additional phases
In practical settings, it can be expected that the initial state is a superposition of not only the eigenstates of interest but also one or more spurious eigenstates of small weight. To better understand the effect these eigenstates have on phase estimation, we revisit our earlier experiments with three main phases, but this time include additional eigenstates. In a series of problem instances, we vary the number of additional phases while keeping their total weight at 0.1, and scale down the weights of the original three phases by a factor of 0.9, yielding weights [0.45, 0.27, 0.18]. For phase estimation, we use the mixed approach with depths k chosen according to the cyclic scheme with c_max = 20. We provide the algorithm with extra phase parameters, which are needed to absorb the spurious phases and also reflect the fact that the actual number of eigenstates in the initial state is typically unknown in practice. In Figure 6(a) we show the trajectories of eight phase estimates in the case of two spurious phases. From these trajectories we would like to extract the three dominant phases, indicated by the horizontal dashed lines. We use three simple techniques to do so. First, we note that two of the estimated phases in this example have zero weight, and these are therefore easily filtered out. More generally, we can filter out phases with weights below some selected threshold. Second, we see that several of the phases exhibit strong oscillations. Clearly, these phases cannot provide an accurate estimate, and we therefore filter them out based on the total variation of the estimated angle, that is, the sum of absolute differences between successive estimates over the last, say, 50 recorded values. Third, we see that multiple estimated phases can converge towards the same phase; φ_1 = 2 in this case.
We found that such estimates tend to repel each other, leading to a situation in which none of them is accurate. To correct this, we detect bundles of similar estimates (phases deviating at most τ degrees from the bundle center) and replace them by their weighted average. Applying these three correction steps to the current example with τ = 5, we obtain the phase estimates shown in Figure 6(b). Next, we generate 100 problem instances each for 2, 5, 10, and 20 spurious phases, with weights and phases sampled uniformly at random. We normalize the total weight of the spurious phases such that they add up to 0.1, and apply Bayesian phase estimation followed by the filtering process described above.
For each problem instance we record the weight and phase differences whenever exactly three phases are recovered; otherwise we mark the problem instance as failed. We run the algorithm with three plus a given number of additional phase parameters for threshold values τ ∈ {1, 3, 5}, and summarize the results in Table 1. For the successful cases with τ = 3 we see that the phase error is small in all but one setting, indicating that whenever the algorithm converged, it did so to the correct phases. The fact that the maximum phase error decreases as the number of spurious phases increases is likely due to the decrease in the individual weights. The worst-case estimation error in the weights is larger than that in the phases, but overall the median error is very small. The number of failed cases decreases significantly for larger values of τ, which is possible here because the actual phases are well spread out. Finally, we note that the method appears fairly insensitive to the selected number of additional phase parameters.
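The three filtering steps (weight thresholding, total-variation screening, and bundle merging) can be sketched as follows. The threshold values w_min and tv_max are illustrative assumptions, and the greedy pairwise merge is only an approximation of the bundle detection described above:

```python
import numpy as np

def filter_phases(weights, trajectories, w_min=0.01, tv_max=20.0, tau=5.0):
    """Return a list of (weight, phase) pairs; phases in degrees."""
    kept = []
    for w, traj in zip(weights, trajectories):
        if w < w_min:
            continue                        # step 1: negligible weight
        if np.abs(np.diff(traj[-50:])).sum() > tv_max:
            continue                        # step 2: strong oscillations
        kept.append((w, traj[-1]))
    kept.sort(key=lambda wp: wp[1])
    merged = []
    for w, phi in kept:                     # step 3: merge nearby bundles
        if merged and abs(phi - merged[-1][1]) <= tau:
            w0, phi0 = merged[-1]
            merged[-1] = (w0 + w, (w0 * phi0 + w * phi) / (w0 + w))
        else:
            merged.append((w, phi))
    return merged
```

Estimates that repel each other around a common value are thus replaced by a single weighted average.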

Scalability
We now move beyond the three phases considered so far and look at the performance of Bayesian phase estimation on problems with up to twelve phases. For this we consider a somewhat idealized setting in which the unknown phases are initialized on a grid of angles with π/6 increments, starting at π/12. We randomly perturb these values by adding angles, in radians, sampled uniformly at random from the interval [−0.05, 0.05]. The weights are sampled uniformly at random from [1/2, 1] to ensure all weights are similar in magnitude, and are then normalized to sum to one. We generate 100 instances each for problems with the number of phases ranging from one to twelve, and apply the filtering procedure described in the previous section to obtain the final phase estimates; these problems do not include any spurious phases. We focus on recovery of the phases and declare a problem instance successful if it recovers all ground-truth phases with a deviation of at most 0.005 radians and does not include any other phases. Table 2 summarizes the success counts for filtering with threshold values τ ∈ {0.01, 1, 3, 5} and a varying number of additional phase parameters. Increasing τ results in higher success rates but, as mentioned above, this comes at the cost of resolution and is appropriate only when the threshold is less than half the minimum distance between the phases. For those problem instances that failed to recover all phases, we found that the estimated values were often accurate, but that some of the phases were missing, as illustrated in Figure 6(c) for a problem instance with nine phases. Further simulations would be needed to see how the algorithm deals with phases that are clustered more closely together or exhibit a larger range of weights.

Table 1: Performance of the Bayesian phase estimation algorithm with filtering of the results, for 100 problem instances with three base phases and varying numbers of spurious phases. The algorithm uses three plus the given number of additional phase parameters. Clustering of phase estimates is done with threshold values τ equal to 1, 3, and 5, and the corresponding numbers of problem instances that failed to return three phases are listed, from left to right, in the failed-instances column. For threshold parameter τ = 3 we show the maximum and median phase and weight errors over all successful problem instances.
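A sketch of the instance generation used in this experiment (the function name is ours):

```python
import numpy as np

def random_instance(n_phases, rng):
    """Phases on a pi/6 grid starting at pi/12, perturbed by angles
    drawn uniformly from [-0.05, 0.05] radians; weights drawn
    uniformly from [1/2, 1] and normalized to sum to one."""
    grid = np.pi / 12 + np.arange(n_phases) * np.pi / 6
    phases = grid + rng.uniform(-0.05, 0.05, n_phases)
    weights = rng.uniform(0.5, 1.0, n_phases)
    return phases, weights / weights.sum()
```

Drawing the raw weights from [1/2, 1] bounds their ratio by two, so no phase is dominated by the others.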

Conclusions
In this work we have analyzed the performance of Bayesian phase estimation for different representations of the prior distributions. The first representation is a normal distribution, as used earlier in [19]. We show that updates to the distributions can be evaluated analytically in noiseless settings, as well as in settings with several commonly encountered types of noise. The normal-based approach is fast, but performs poorly when multiple eigenphases are present, especially in experiments with only a single measurement round. In these cases the distributions tend to collapse into a single distribution, which is then impossible to untangle, since the updates will be essentially identical. As a second representation, we consider the truncated Fourier-series representation used in [13]. For noiseless problems, as well as for problems with many different types of noise, the Fourier series is ideal in that it captures the desired distributions exactly. However, the number of coefficients required to maintain an exact representation grows with each additional round of measurement. For the method to remain computationally attractive, it is therefore necessary to truncate the Fourier series. We show that this truncation eventually causes the distribution to become unstable, thereby limiting the accuracy that can be achieved with a fixed number of terms. To combine the advantages of both approaches, we propose a mixed approach in which the distributions are initially represented as truncated Fourier series. When successive updates cause the standard deviation of a distribution to fall below a suitably chosen threshold, we change the representation to a normal distribution. We show that the proposed mixed approach performs well with both static and adaptively chosen values of k, and that its performance remains stable when decoherence or read-out errors are present.
Finally, we show how the Bayesian approach can be combined with the quantum Fourier transform, which is traditionally used for phase estimation. Measurement errors can be dealt with successfully in this setting, but the current method requires the evaluation of the probability distributions over all possible states. Reducing this complexity is an interesting topic for future work. In terms of scalability, we conclude that Bayesian phase estimation is especially well suited for problems where the state is a superposition of a small number of dominant eigenstates, possibly together with many spurious eigenstates of much smaller weight.