Robust and Efficient Hamiltonian Learning

With the fast development of quantum technology, the sizes of both digital and analog quantum systems increase drastically. In order to have better control and understanding of the quantum hardware, an important task is to characterize the interaction, i.e., to learn the Hamiltonian, which determines both static and dynamic properties of the system. Conventional Hamiltonian learning methods either require costly process tomography or adopt impractical assumptions, such as prior information on the Hamiltonian structure and the ground or thermal states of the system. In this work, we present a robust and efficient Hamiltonian learning method that circumvents these limitations based only on mild assumptions. The proposed method can efficiently learn any Hamiltonian that is sparse on the Pauli basis using only short-time dynamics and local operations without any information on the Hamiltonian or preparing any eigenstates or thermal states. The method has a scalable complexity and a vanishing failure probability regarding the qubit number. Meanwhile, it performs robustly given the presence of state preparation and measurement errors and resiliently against a certain amount of circuit and shot noise. We numerically test the scaling and the estimation accuracy of the method for transverse field Ising Hamiltonian with random interaction strengths and molecular Hamiltonians, both with varying sizes and manually added noise. All these results verify the robustness and efficacy of the method, paving the way for a systematic understanding of the dynamics of large quantum systems.


Introduction
Recently, we have witnessed a rapid development of quantum technologies with drastically increasing sizes of fully or partially controllable quantum systems [1-Wenjun Yu: wenjunyus@gmail.com Xiao Yuan: xiaoyuan@pku.edu.cn 10]. Following the law of quantum mechanics, the quantum devices allow for controllable quantum operations and offer opportunities to naturally probe quantum many-body behaviors in a programmable or analog way. To further upgrade the quantum technology or even to realize a universal quantum computer, an essential requirement is to better understand the underlying quantum mechanisms of the systems [11][12][13]. According to the Schrödinger equation, the static and dynamical behaviors of quantum systems are described by their Hamiltonian. Successful detection of the Hamiltonian, i.e., the so-called Hamiltonian learning [14-20] task, could guide us to have a better understanding and hence control of analog quantum simulators [21-24, 11, 13], digital quantum processors [25-29, 9, 30], quantum sensors [31][32][33], etc.
A straightforward choice of learning Hamiltonian is via quantum process tomography [34][35][36][37][38]. However, since the dimension of a many-body Hamiltonian grows exponentially with the system size, complete tomography of the Hamiltonian is costly and formidable in real experiments [39,40]. Here we briefly review some other efficient learning methods and describe their limitations. Wiebe et al. [15,16] firstly use a trustworthy quantum simulator to inverse the unknown evolution and iteratively approximate the exact inverse. However, since the scaling of complexity is case-specific, the order of magnitude thereof could be prohibitively large for a certain Hamiltonian family. Besides, the accurate simulation of some arbitrary inverse Hamiltonian evolution remains hard for experimental realizations in NISQ era [41]. To avoid the usage of expensive and inaccessible quantum resources, some protocols focus on using states that contain information on the Hamiltonia, including steady states and thermal states [42][43][44][45]. These methods are scalable with the system size by measuring states with desired properties to infer the Hamiltonian. However, they rely on specific states containing the information of the target Hamiltonian, which are generally hard to prepare and are vulnerable to measurement errors.
There are also learning proposals that get rid of the requirements of preparing some special initial states. For example, Li et al. [46] construct equations based on the energy conservation of arbitrary states dur-ing evolution. Given the structure of the Hamiltonian, this work considers the effects of measurement errors and gives a bound on the measurement errors based on the gap between the smallest two singular values of the equation matrix. Notably, this protocol is steady under both state preparation and measurement (SPAM) errors. This steadiness, nevertheless, relies heavily on the model assumptions of the measurement errors and Hamiltonian structure (ansatz) errors as well as the eigenstate thermalization hypothesis, and it is hard to give a firm claim for the real-world cases based on these constraints. Zubida et al. [47] estimate the Hamiltonian parameters by constructing equations from Ehrenfest theorem about the derivatives of observables involved in the structure of the Hamiltonian. The number of measurements scales with precision quartically Ω(1/ϵ 4 ) due to the trade-off between statistical and systematic errors, but its accuracy would be affected by state preparation and measurement errors. Hangleiter et al. [48] measure state properties by evolving it to identify the nearest-neighbor coupled Hamiltonian in superconducting systems. Since the structure and system nature are known, they show that measurements contain the eigenvalues of the Hamiltonian. By fixing a time evolution as a base, this method can be robust to state preparation errors. Nevertheless, all of the above three methods need a precise prior about the structure of the Hamiltonian, rendering them unfavorable for tasks of learning an unknown system from scratch or reliably verifying Hamiltonian simulation with extra noise terms.
In this work, we report an efficient Hamiltonian learning algorithm that circumvents the above requirements and assumptions. Our method only utilizes short-time Hamiltonian evolution -getting rid of the requirement of preparing ground or thermal states, assumes sparse Hamiltonians -avoiding the second assumption without specifying the Hamiltonian structure 1 , and exploits ideas from randomized benchmarking (RB) [49] -being robust against effects from SPAM errors. The method requires only local input states, local Pauli measurements, and local Pauli operations inserted in the Hamiltonian dynamics, which is accessible for various experimental platforms, such as superconducting qubits [2,9,29,[50][51][52][53], trapped ions [10, 54, 55], Rydberg atoms [56, 3], nuclear magnetic resonance [57,19], etc.
After our paper was posted to the arXiv, there come out various interesting proposals toward efficiently learning unknown Hamiltonians [58][59][60][61]. Wilde et al. [60] recruit the first-order Trotter to simulate Hamiltonian with prior parameters and use the effective fidelity between the simulated and evolved states 1 The sparsity is defined with respect to the decomposition of the Hamiltonian in the Pauli matrix basis. Note that realistic Hamiltonian, including molecules, lattice models, quantum field theories, etc., all satisfy the sparsity requirement.
to generate the loss function. By imposing constraints on the loss function, they also show the accuracy guarantee of the optimization solution. Huang et al. [61] randomly implement Pauli gates to project the target Hamiltonian onto some stabilizer spaces in the Pauli basis, so the eigenspace of the new Hamiltonian is easily determined. They achieve the Heisenberglimited scaling complexity by introducing the phase estimation method as a subroutine and assuming the bounded-degree structure. We also expect to find more and more interesting proposals for learning an unknown Hamiltonian with fewer assumptions about the Hamiltonian itself.
Contributions. For the first time, we introduce a provably efficient and robust method to detect an arbitrary Hamiltonian operator containing sparse decomposition parameters over Pauli basis. In order to be SPAM error robust, our protocol employs the exponential fitting idea from randomized benchmarking to estimate the absolute values of parameters and uses simplified process tomography equations to fulfill the need of sign estimation. The sign estimation remains robust as it only returns discrete values. Based on this protocol, we give a provable guarantee in Section 4 to the estimation results sketched from theoretical bounds of all subroutines. Given the sparsity s and qubit number n, we achieve a measurement complexity as O sn/ϵ 4 where ϵ depicts the desired precision. We simulate the protocol on multiple systems with various Hamiltonian structures in Section 5 to verify and monitor the scaling properties, including an 8-qubit molecular Hamiltonian with over 100 nonzero Pauli terms. All these results conclude that our Hamiltonian learning protocol on n-qubit systems is scalable and robust. It is also comparably practical since it does not require the correct Hamiltonian structure or the preparation of eigenstates or thermal states and only needs single-qubit operations.
During the estimation of the absolute values (which we denote by stage 1), the protocol estimates the Pauli fidelity of the Hamiltonian evolution channel and transforms these values to be Pauli error rates, which are equal to squares of the decomposition parameters. We construct the fidelity estimator by implementing the random circuits [49] in an interleaved manner to estimate the extraneous Hamiltonian channel. We need to assume that the circuit noise performs benignly as in Assumption 2 and 3, and the exponential decay and fittings achieve SPAM-robust estimation. Suppose Hamiltonian operators to be sparse as in A1 and A2, we adopt the sparse transforming idea [62] to reduce the exponential calculation to an efficient sampling. We also relieve the original assumption about the unbiased Gaussian noise to biased noise. In the second stage, we introduce an optimization problem inspired by the process tomography protocol [63] to estimate the sign information. With the assumption about the SPAM errors in Assumption 1, the sign estimator can be SPAM robust as it tolerates small disturbances.
Organization. The remainder of this paper is organized as follows. We provide the basic problem setting and intuition in Section 2. We present an overview and the main ideas of our protocol in Section 3 with a rigorous analysis of the validity of the protocol in Section 4. The numerical simulations for the Hamiltonian learning protocol are exhibited in Section 5. We conclude the results in Section 6.

Background
In this section, we introduce the main setting of the Hamiltonian learning problem and some basics thereof. We then summarize the assumptions about the Hamiltonian in our protocol. We also introduce the relationship between the time evolution channel and the Hamiltonian operator as a tool to extract the estimation. In order to better control the SPAM errors and circuit noise, we propose some restrictions to keep them performing benignly. The assumptions and constraints help to learn the information of the Hamiltonian operator robustly and efficiently.

Problem Setting
The Hamiltonian learning problem aims to estimate the unknown underlying Hamiltonian operator of a quantum system. Consider an n-qubit Hermitian Hamiltonian H with dimension d × d and d = 2 n , which can be decomposed on the Pauli operator basis as with real coefficients s α .
Here the P n = {I, X, Y, Z} ⊗n is the refined Pauli group consisting of 4 n Pauli matrices. We label each element therein by a 2n-bit string as α for P α , and Appendix A.1 includes the detailed definitions of this labeling and other Pauli-related concepts. We denote the vector s ∈ R d 2 by the decomposition parameter, and each element can be obtained by s α = 1 d Tr(HP α ) since P n forms a basis under the Hilbert Schmidt norm. Consequently, the Hamiltonian learning problem is treated as the reconstruction of the decomposition parameters s.
In the literature, several Hamiltonian learning methods have been proposed to efficiently learn the decomposition parameters s [42, 64,65,44,45,48] by imposing (1) a certain Hamiltonian structure, such as nearestneighbor or local interactions (2) the ability to prepare an eigenstate or thermal state of the Hamiltonian.
Yet, these constraints may fail in practical scenarios. For the first requirement, a realistic system may have different types of interactions, and it would be impractical to assume a specific Hamiltonian structure. For example, we may not have full prior information on the Hamiltonian structure for some partially controllable systems [12]. Meanwhile, even if we know the Hamiltonian of an ideal quantum system, the imperfection of the system may lead to unknown and complex interactions [13,12,10], such as the multiqubit crosstalk of superconducting qubits [29,9,[66][67][68]. Indeed, the dimension of the vector s increases exponentially with the number of qubits, and a specific structure of the Hamiltonian will restrict the problem size to an exponentially smaller subspace. In this work, we release the assumption to avoid prior information on the Hamiltonian. We notice that while the Hamiltonian structure may be unknown, most real-world physical systems, such as superconducting qubits, cold atoms, etc., generally have sparse decomposition parameters s. For convenience, we denote the set of nonzero parameters by s ⋆ . In this work, our protocol and analysis mainly focus on the Hamiltonian with sparse nonzero parameters, and we make the following assumptions to determine the decomposition parameters s of the target Hamiltonian.

Assumption. Given Hamiltonian H, we have
A1 (Random sparse support) The support of the nonzero decomposition parameters s ⋆ is chosen uniformly at random from all subsets with size s of the group P n . Here s is polynomial with the qubit number, n.
A2 (Spectral gap) The smallest nonzero parameter of s ⋆ is larger than a constant √ ϵ 0 .
A1 ensures the sparsity of the Pauli decomposition of the target Hamiltonian, which is much more general compared to the structure assumptions, i.e., indicating which Pauli term has a nonzero coefficient. Although this assumption supposes uniformly random support, this is due to the concern of theoretical analysis and proofs. In the numerical results, we will show the feasibility of our protocol when learning sparse Hamiltonians with different structures, including a comparably complicated molecular Hamiltonian. The assumed lower bound of the spectral gap in A2 helps to distinguish the nonzero parameters from overwhelming noise effects. These two assumptions help to narrow the estimation to the sparse parameters s ⋆ , which makes the execution efficient regarding the number of qubits. For the second requirement, preparing the eigenstate or thermal state of an arbitrary Hamiltonian is proven to be QMA-hard [69], and this eigen-or thermal state preparation also seems impractical. Instead of solving those computationally hard problems, a more natural way is to consider the time evolution of the Hamiltonian, which is accessible for both analog and digital quantum computers [11,30,[70][71][72]. Using the Schrödinger equation, the time-evolved state for an initial state ρ is represented as a unitary where U H (t) = e −iHt corresponds to Hamiltonian H and time t. Here and henceforth, we use H t to denote the Hamiltonian evolution channel with time t. For a short enough time, we can approximate the Hamiltonian evolution channel by expanding and truncating the infinite series as (3) For simplicity, here and after in this paper we will use the index α ∈ P n to indicate the corresponding operator P α ∈ P n when it would not incur ambiguity. We note that the first-and second-order terms of the evolution channel faithfully store the decomposition parameters of the Hamiltonian operator, which offers the possibility for extracting the Hamiltonian information from the dynamics. Together, we will utilize the short-time evolution under the target unknown Hamiltonian and present a Hamiltonian learning protocol for any unknown Hamiltonian with sparse decomposition parameters.

Noise Models
In way to construct a robust Hamiltonian learning protocol, a crucial obstacle remaining is the unknown SPAM errors and circuit noise. Due to the limitation of the quality and controlling quantum systems, these errors and noise are ubiquitous in NISQ devices. The imperfect state preparation and readout noise will undermine the availability of measurement outcomes, which renders the estimation untrustworthy. To make the protocol SPAM robust, we first need to restrict the strength of these errors; otherwise, the unbounded errors will always be able to destroy all the effective information from measurements. For theoretical convenience, we will hold the following assumption to quantify the device's SPAM errors.
Assumption 1. We denote the noisy state byρ and the noisy measurement operator byM . The noisy device satisfies the following constraints for every pair of positive Pauli eigenstates and Pauli measurements, where H here represents the Hamiltonian evolution with all possible short evolving t used in the implementation. The positive bound τ must be in the region of 0, O t 1 ϵ0 s where t 1 is a typical unit time used in the regression during the sign estimation.
In order to eliminate these depicted SPAM effects, we will show shortly that the protocol would choose special circuit components, especially random Pauli gates, and implement them in a interleaved way to build a SPAM-robust estimation. In this sense, we need to clarify the form and amplitude of the introduced gate noise in a rigorous way. Following the idea and convention in randomized benchmarking protocols, [49,73,74] we impose time-independent and Markovian quantum noise. Furthermore, for every noisy implementation of the Pauli gate, we assume that there exists some certain noise channel Λ such thatP where P is the implementation of the unitary P . We denote the noise effect satisfies Eq. (5) by gateindependent noise. Therefore, we have the following model for circuit noise.

Assumption 2. The noise from the implementation of quantum circuits is time-independent, Markovian, and gate-independent.
Even though only gate-independent noise allows the RB circuit to construct the standard twirling structure, Wallman [75] proved that a slight gatedependent disturbance of the gate-independent noise still keeps the twirling results and causes limited harm to the final analysis of RB. This might give us a clue about the possibility to relax this gate-independent requirement. An additional assumption is ad-hoc for the execution of the interleaved protocol. Since the implementation of random Pauli gates carries some noise inevitably, our protocol needs to calibrate and compensate for these effects to extract the net information of the Hamiltonian. However, in order to eliminate them, the noise is supposed to be diagonal in the superoperator representation, i.e., a stochastic channel.
Assumption 3. The noise Λ of the circuit implementation for Pauli gates is a stochastic channel (Pauli channel), and it is close to the identity according to the operator norm as Note that Pauli gates are actually implemented from single-qubit pulses in realization. From some current experimental verification and detection results [76][77][78], the implementation of local operations can be close to their ideal version up to high precision. Therefore, Assumption 3 is reasonable. In fact, the constant 1 3 can be further enlarged to an arbitrary constant less than 1, which still makes sure that all the Pauli fidelity terms of Λ are larger than 0.

Protocol Overview
In this section, we illustrate the main ingredients of our protocol to estimate the unknown Hamiltonian operator. The basic idea is to extract the information of the sparse parameter vector s ⋆ sequentially from the second-and first-order terms of the series expansion of Hamiltonian evolution in Eq. (3). In this sense, our protocol is naturally analog to a quantum channel detection scheme.
We briefly summarize our protocol as follows. We also refer to Figure 1 for a schematic summary.
• At the first stage, the procedure first extracts the Pauli fidelity with Pauli inputs and measurements in a similar vein to RB as in Figure 1 (a), which is robust against SPAM errors. By linear regression with t 2 , the results correspond to the second-order terms in the evolution channel. Then we show a procedure to efficiently transform the extracted fidelity to the squares of s ⋆ , and an example is illustrated in Figure 1 (b).
• At the second stage, we construct and solve the process equations to estimate the remaining sign information of s ⋆ from the first-order terms as in Figure 1 (c). With knowledge of the support position about s ⋆ from the first stage and the nature that small noise on results will not cause any sign flips, the sign estimator can be executed robustly and efficiently for both quantum and classical complexity.
One may argue that a straightforward strategy is to directly consider the first-order expansion of Eq.
(3) and extract s ⋆ with certain input states and measurements. However, the first-order term corresponds to coherent (off-diagonal) information of the channel, which is generally sensitive to SPAM errors. Indeed, most quantum channel detection schemes that are robust to SPAM errors, such as randomized benchmarking [79][80][81][82], can only extract incoherent (diagonal) information of channels. Moreover, as in Eq. (18), the overhead for classical computing to construct the process equations from state preparations and measurements would be exponentially large without the information of the sparse support of the Hamiltonian effect since there are exponential items on the Pauli basis. Therefore, we choose this composite algorithm to circumvent these obstacles. We discuss in detail the two stages in the remainder of this section and refer to the next section and appendix for an informal summary and detailed proofs of the protocol, respectively.

Stage 1. Absolute value estimation
The first stage is to estimate the squares (absolute values) of the decomposition parameters, which come from the second-order effects of the Hamiltonian evolution.
Considering the χ-representation using the Pauli basis, the Hamiltonian evolution channel can be decomposed as where p α := χ αα is denoted by the Pauli error rate. Comparing Eq. (7) with Eq.
(3), the relationship between the parameters in s and Pauli error rates Since the square of decomposition parameter s 2 α is proportional to the Pauli error rate p α with a small remainder from small t, we can approximately obtain every |s α | given the corresponding Pauli error rate p α . Note that the sparsity of |s α | also implies the sparsity of p α . Therefore, the basic idea is to exploit the recently proposed Pauli estimation protocol to learn all the sparse Pauli error rates [62] from Pauli fidelity (which is defined later). Nevertheless, there are two challenges if we directly adopt existing Pauli estimation protocols.
• First, the Hamiltonian channel also contains coherent parts, while we only need the incoherent information here.
• Second, Pauli error rates of a short-time Hamiltonian channel are only approximately sparse to the second-order expansion. It is actually dense if we consider infinite expansions.
Fortunately, we can circumvent these in our estimation. For the first one, our circuit would apply the Pauli twirling to the Hamiltonian evolution channel to convert it into a Pauli (diagonal) channel. In brief, Pauli twirling annihilates the off-diagonal elements of a channel and keeps all the Pauli (diagonal) information the same. Denote a Pauli gate by P α , and the evolution channel H t after Pauli twirling must be For the second one, we apply regression to the results with different small t to suppress the effects induced by higher-order terms in the evolution. We further prove that the protocol is noise-resilient and efficient for large-scale Hamiltonian evolution channels. The first stage can be viewed as a Pauli error rate estimator along with an independent Pauli fidelity estimator. In circuit (a), we introduce the kernel of the random circuits to get the desired fidelity terms, whereP denotes the noisy Pauli gate. By choosing the random matrix M , the index ℓ, and the offset d, the protocol accordingly queries the fidelity terms f M ′ ℓ+d , where M ′ is the shuffled M as discussed in Section 3.1.2. According to the query, the procedure then prepares input states followed by the cascading structures including the noisy Pauli gates and Hamiltonian evolution. The sub-subscripts represent the corresponding single-qubit parts of Pauli gates. By measuring circuits with different depths, it acquires the fidelity terms with bounded noise and combines them to be the bin sets as in Eq. (15).
The formal illustration of the circuits can be viewed in Section 3.1.1. In subfigure (b), the procedure uses multiple subsampling groups and bin sets aligned therein to detect the type of each bin set and peel to generate bins with single nonzero Pauli error rates. For example, in this two-qubit case, the procedure finds the first bin set in the first group to be single-ton and extracts the nonzero term and index, pXX , with the aid of different signs. For the second bin set, the procedure gets stuck due to the summation of multiple terms. It then switches to the second group to detect the first bin set with result pIZ . By learning pIZ is nonzero, the procedure uses it to peel that stuck case and gets pY X . Then the procedure uses estimated error rates to peel the second bin set of group two and certifies it as a zero-ton bin. This peeling can be viewed more detailedly in Section 3.1.2. The second stage aims to complete the Hamiltonian learning by estimating signs of decomposition parameters as illustrated in Section 3.2. In circuit (c), the procedure inputs random local Pauli eigenstates and measures them with Pauli operators after evolving under Hamiltonian, whereC andC ′ denote the noisy local Clifford gates for state preparation and measurements which are not conjugate. By constructing the relationship between the measurement outcomes and the unknown parameters, we can solve them and recover the sign information.

Pauli fidelity estimation
In this subsection, we focus on getting data from measuring quantum circuits consisting of the Hamiltonian evolution. Particularly, the protocol extracts Pauli fidelity as a type of Pauli information for further analysis and extraction of the target parameters. Inputting and measuring the same operator P x , the outcome corresponds to a Pauli fidelity of an arbitrary channel E as defined in the following where ⟨x, α⟩ p ∈ {0, 1} is a useful operation named Pauli inner product. Depending on whether P α and P x commute or not, ⟨x, α⟩ p returns 0 or 1. We give the mathematical definition of this inner product using bit strings of indices in Appendix A.1. Eq. (9) and Eq. (10) depict the major tool and the main target of the estimation in this step, respectively.
The fidelity estimator contains two circuits to extract the Pauli fidelity of the target Hamiltonian chan-nel, as shown in Figure 2. The idea of the circuit (a), as well as the Pauli estimation, is proposed in [49]. We introduce an additional circuit (b) to rule out the noise effects. Therefore, each round of execution consists of implementing these two circuits.
As shown in circuit (a) of Figure 2, we prepare a stabilizer state using a local Clifford gate C and |0⟩ ⊗n , where the state is the simultaneous positive eigenstate of all Pauli operators in a stabilizer group G that includes P x given the target fidelity is f x . Here the group can be chosen arbitrarily given it contains P x . The concepts related to the stabilizer group in the Pauli group are reviewed in Appendix A.2. We then implement random Pauli gates, and these gates twirl the noise channel Λ into a Pauli channel Λ P n which is a diagonal matrix in the superoperator representation with elements to be all Pauli fidelity. In this case, the circuit is equal to the combination of diagonal channel from the twirled noise channel. We finally measure the syndrome β ∈ A G = P n ⧸ G of the state, i.e., do C † to the state and measure it on Z ⊗n basis. A brief introduction of the syndrome measurement is In order to rule out the effects of the implementation noise Λ, we recruit the circuit in [49] as a reference circuit. The procedure inputs the stabilizer states of stabilizer group G in the Pauli group and measures the Pauli syndromes using local Clifford gates C and C † . Random Pauli gates P twirl noise channels into the diagonal version with only fidelity left. The sub-subscripts of the locally tensored gates represent the corresponding single-qubit parts. Therefore, the procedure varies the gate length and gets a decaying factor by tracking the overall syndromes from both Pauli gates and the syndrome measurements, which helps to estimate the fidelity information of Λ. (b) Since we have access to the shorttime evolution of the target Hamiltonian, we implement it in the estimation circuit. Similarly to the previous circuit, we estimate the fidelity of the composite channel Ht • Λ. Under Assumption 3, a ratio estimator contributes to the extraction of the net fidelity information of the Hamiltonian channel from the composite results from both circuits. also in Appendix A.2.
The estimator sums all the indices of the sampled Pauli gates as a and decides the corresponding syndrome or the decomposition of a in the quotient group: α ∈ A G . It is found that the distribution probability of the overall error syndrome α+β is the linear combination of exponentially decaying fidelity terms with signs depending on the inner products of fidelity indices and the overall syndrome. We can further isolate every single decaying by adding an additional sign for each overall syndrome and taking an expectation over these syndromes as shown in Lemma 6 of [49]. By varying m, we can extract the desired fidelity term f x of Λ robustly as SPAM errors do not affect the decay rates. An advantage of considering the stabilizer formalism is that we can simultaneously estimate all the fidelity in one stabilizer group within one round of experiments by different post-processing. While it would be hidden in the asymptotic complexity, this advantage can expedite real-world processing considerably.
In circuit (b) of Figure 2, we interleave Pauli gates and the additional time evolution and keep the state preparation and measurements the same. The target channel now becomes H t •Λ with the same analysis as the previous circuit. We can, therefore, get the Pauli fidelity f Ht•Λ . In the remainder of this paper, we will omit the subscript when it is clear of which channel the fidelity is.
With the executions of both circuits, we can eliminate the Pauli gate noise Λ and extract the net information of the Pauli fidelity of H t solely. As in assumption 3, we assume that the implementation noise Λ is itself a Pauli channel, and a simple ratio estimator helps to detect the corresponding Pauli fidelity of the Hamiltonian evolution H t . Therefore, we get a noise-robust estimation of the Pauli fidelities of the Hamiltonian evolution channel H t .
Observing Eq. (8) and Eq. (10), we realize that fidelity terms of the evolution channel also contain higher-order effects of the evolving time. In order to extract the net second-order coefficients that directly contain the information of decomposition parameters, our procedure applies an ordinary linear regression for multiple pairs of Pauli fidelity of evolution channels and squares of evolving times. This helps to estimate the second-order Pauli fidelity, {f (2) x }, of the Hamiltonian evolution channel, which is the second-order coefficient of the ordinary fidelity of the evolution, This fidelity denotes the second-order coefficient of the ordinary Pauli fidelity, and we have the following definition from Eq. (3) and Eq. (11) which is time-independent. We summarize the whole fidelity estimator in Algorithm 1. The subroutines, FEstimator and HFEstimator, denote the implementation of circuit (a) and (b) in Figure 2 with post-processing, respectively. The algorithm uses the former estimation as a reference and implements the latter for multiple evolving times to both eliminate the Pauli noise channel and extract the second-order terms. For instance, here, we choose five evenly-increasing evolving times for Hamiltonian evolutions. Dividing the composite fidelity by the reference fidelity of noise channel Λ, we can estimate the fidelity of the Hamiltonian channel with the corresponding evolving time, as shown in Lemma 1. We denote this as the ratio estimator. We will elaborate on these subroutines in detail in Appendix B.1.
Algorithm 1 Fidelity Estimator: FE(G, X, l, t 0 ) 1: Input: a target set X belonging to a stabilizer group G, a positive integer l, and the unit time length t 0 2: Initialize the identity array I with size |X| to be all 1 3:f Λ ← I − FEstimator(G, X, l) 4: for i = 1, · · · , 5 do 5: t ← i · t 0 6:f Ht•Λ,i ← I − HFEstimator(G, X, l, t) 7:f Ht,i ←f Ht•Λ,i /f Λ ▷ ratio estimator 8: end for 9: Do ordinary linear regression for every fidelity in X from (f Hit 0 , (i · t 0 ) 2 ) pairs and getf (2) 10: return the second-order fidelity arrayf (2) For this second-order fidelity estimation, we hold a guarantee about the accuracy and complexity of the results. In this theoretical statement, we assign the ordinary least square as our regression method for convenience.
Proposition 1. Let G be a stabilizer group in P n and X ⊂ G, and suppose Assumption 2 and 3 hold. For any small ϵ, δ > 0, the following happens with probability 1 − δ: Run Algorithm 1 FE(X, G, l, t 0 ) with t 0 satisfies ∥H 5t0 − I∥ < 1 4 and l = 2 ϵ 2 log 2|κ| δ where κ is the set of variant sequence lengths. regression, and the resulting regression arrayf (2) satisfies where the upper bound term adopts the largest bound term referred from Lemma 1 among multiple evolution with different t, and σ is a constant regrading the regression method.
Proof. The formal proof can be viewed in Appendix B.1. The main idea is to combine the accuracy bound from the subroutines of this algorithm. By tracking the processes of the ratio estimator and the regression, the accuracy of the second-order fidelity can be determined from those inherited bounds.
Remark 1. Note that in this analysis, we choose the ordinary least square regression with an instance of five evenly increasing evolving times, and the parameter to represent the noise propagation during fitting is denoted by σ. In order to minimize the overall noise in this phase, we can adjust the unit time t 0 to reconcile these two noise sources. Specifically, under this fitting model, we need to choose t 0 ∼ 4 O(σϵr Λ ).
Algorithm 1 serves as a supplier of second-order Pauli fidelity for the later bin construction as stated in Algorithm 7, which is a key structure in the transform from fidelity to error rates in Sec. 3.1.2. The bin construction specifies the fidelity terms needed with indices included in the subset X ⊂ G since we do not need the whole estimation as explained in the next subsection. Algorithm 1 will return the set of desired second-order fidelity X in stabilizer group G. Even though this estimator can estimate all the fidelity in a stabilizer group using only one round of measurements, we still count each query separately regarding the worst-case complexity.

From Pauli fidelity to Pauli error rates
In the last subsection, we have constructed a secondorder fidelity estimator. According to the previous discussion, the rest of this stage is to use this subroutine to get the desired values, namely, to transform the second-order fidelity to squares of decomposition parameters. We first define the second-order Pauli error rate p (2) α which stands for the second-order coefficient of p α . According to Eq. (8), we know the specific representation of p (2) : Therefore, the target is reduced to implement the transform from second-order fidelity to second-order Pauli error rates. We know from the definition that the following Walsh-Hadamard transform with Pauli inner product depicts the relation between Pauli fidelity and Pauli error rates. In the remainder of this paper, we use Walsh-Hadamard transform to denote the Pauli variant when it causes no ambiguity.
Since second-order fidelity and Pauli error rates are linear components of the ordinary fidelity and Pauli error rates, they can also be linked by the Walsh-Hadamard transform. Even though the second-order error rates are sparse according to Assumption A1, the second-order fidelity is generally dense. A faithful implementation of Walsh-Hadamard transform thus requires the summation of an exponential number of terms. Instead of directly following Eq. (14) with an exponential number of summations, our protocol recruits the idea for Pauli properties estimation [62] to efficiently obtain the sparse Pauli error information. Along with the idea firstly introduced in Ref. [62], our protocol utilizes a special structure named bin. A bin U is a linear combination of a subset of noisy second-order fidelity terms, as defined in Eq. (15), which also indicates the value of a set of second-order Pauli error rates. The procedure first determines a size index b < 2n. Then each bin is constructed from 2 b randomly sampled second-order fidelity terms. In order to implement the random sampling, the procedure employs several 2n × b random binary matrices M and transforms them to corresponding M ′ , where M ′ := J n M and J n := I n ⊗ X. By enumerating all bbit string ℓ, the random matrices M ′ help to disperse them to 2n-bit which are valid Pauli indices. This conversion from M to M ′ aims to keep the dispersion in the form of Eq. (15). Furthermore, the procedure recruits additional 2n-bit vectors d as offsets. The detailed construction can be viewed in Algorithm 7. After this construction, we observe that each bin consists of 2 2n−b second-order Pauli error rates and that different bins have no overlap for a given M (or M ′ ) as follows, This equation can be checked in Lemma 2. Herẽ f (2) M ′ ℓ+d denotes the noisy Pauli fidelity after the linear regression, andp α is a noisy second-order Pauli error rate with noise comes from the noisy fidelity. Note that each bin U [j] selects all the error rates of which indices satisfy M T α = j. The bin construction actually disperses the whole set of error rates into bins with different indices. This is similar to a hash function that uses α to uniquely decide the bin index j that containsp (2) α . Therefore, Given a matrix M (or M ′ ), all bins form a partition over the second-order Pauli error rates. We denote every sampling matrix M (or M ′ ) a subsampling group as indices it disperses form a subgroup in F 2n 2 . Based on the sparse support assumption and a large enough subsampling size b ∼ O(log s), the procedure constructs bins that are likely to contain only one or few nonzero Pauli error rates. Therefore, we only need polynomial number of fidelity terms. This detection of the number of nonzero terms in every bin can be achieved with the aid of offset strings d. Intuitively, we construct a bin U [j] with multiple offsets d, and we denote these bins with the same bin index j and different offsets to belong to a bin set. Notably, all bins in a single bin set contain the same set of second-order error rates with different signs according to Eq. (15). Suppose a bin set only contains one nonzero term p (2) α and some small noise, we would find absolute values of bins in this bin set remain close. Similarly, absolute values that are all close to zero indicate that bins contain all zero Pauli error rates, while significant but diverse absolute values imply that there are multiple nonzero rates in a bin set. Therefore, we can classify bin sets by zero-ton, single-ton, and multi-ton bins. This bin detector is formalized in Algorithm 8 of Appendix B.2.
When we decide the bin is a single-ton, the averaged absolute value would be regarded as the estimation of that second-order Pauli error rate. Furthermore, the Pauli index can be learned from signs of bins with different offsets. According to Eq. (15), when the noise is not large enough to flip the sign, there comes an equation to describe the linkage between a bin's sign and the Pauli index of the nonzero term, where the sign function is defined as follows, By choosing multiple offsets d, the procedure can solve the index α. However, it is hard to deal with a multi-ton bin since it is a summation of multiple error rates, and we can hardly extract the information of any single rates directly. In order to avoid this stuck case, the procedure uses multiple sampling matrices M and constructs different groups of bins. In each group, the partition of Pauli error rates varies from others. In this case, if the procedure finds a multi-ton bin, it is supposed to find some single-ton bins in another group with overlapped Pauli error rates.
By peeling out the overlapping error rates detected from single-ton bins, the multi-ton bin contains fewer nonzero error rates and is degraded to single-ton eventually. We denote this by the peeling process, and we show an example of the peeling part of a 2-qubit system in Figure 1. Here in Algorithm 2, we give a concise overview of how the peeling process works as mentioned above in this subsection and temporarily remove all the technical ingredients. Shortly, this algorithm invokes Algorithm 7 to initialize several groups of random bins as above-explained. With these bins, the peeling process enumerates all of them and uses Algorithm 8 to detect their nature, including their types, and possible values and indices. We can then record the desired second-order Pauli rates when finding single-ton bins. We will further give an exhaustive illustration of this peeling process in Algorithm 9 in Appendix B.2.
It is worth noting that since the procedure needs to peel out nonzero rates and generate new single-ton bins, it triggers noise propagation among bins during peeling. In Appendix B.2, we analyze the behavior of the whole transformation with noisy fidelity terms gained from Algorithm 1. We show that an index array T indicates the size of variances for noise in each bin. Moreover, since the noise of fidelity terms from Algorithm 1 is not necessarily unbiased, we also consider the bias effects in the noise propagation during the peeling process, which is more complicated than the ideal version of transform introduced in [62]. Combining the noise analysis and the mechanisms of if the bin is single-ton then 8: for all other groups c ′ ̸ = c do 10: The estimation works The proof relies on several lemmas stated in Appendix B.2. The proof starts by fetching the error bounds from Proposition 1 and analyzing the noise in every bin. By proving Lemma 2 and Lemma 3, it shows that sampling bins serve as hash functions and that the bin detector works accurately with a given level of noise. Based on these, the index T records the noise size and helps the procedure to adjust the noise expectation. Therefore, we can prove the success of this peeling process. The formal proof can be viewed in Appendix B.3 In summary, the first stage can detect the second-order Pauli error rates of the Hamiltonian evolution channel or, equivalently, the squared decomposition parameters of the Hamiltonian. The protocol first uses random gate twirling to detect the Pauli fidelity of the Hamiltonian channel, followed by regression to eliminate evolving times. It then constructs bins from several sets of second-order fidelity as the estimator of all nonzero second-orderPauli error rates. The detailed protocol with formalized subroutines is illustrated in Appendix B. The validity and efficiency of estimation for the whole set of squared decomposition parameters s ⋆ is proven rigorously in Appendix B.3. We also discuss the possible benefits the estimation gets if some prior knowledge about the underlying structure of the Hamiltonian is available. The knowledge might be collected from the understanding of the experimental devices or the properties of the tasks running on the devices. We can expect a more accurate and reliable result with this kind of aid, which is illustrated in Appendix B.4.

Stage 2. Sign Estimation
After the first stage, we can detect all nonzero secondorder Pauli error rates of the Hamiltonian channel. Nevertheless, these terms only carry the square values of parameters as stated in Eq. (13). As a complement, the protocol for this stage is to estimate all the sign information. The idea of our sign estimator is based on constructing equations as in quantum process tomography [35,83,84] and utilizing the information gained from the first stage.
Suppose the procedure chooses m states and measurement settings {(ρ k , M k )} m k=1 . As shown in circuit (c) of Figure 1, the procedure measures these states on the Pauli operators after evolving under the target Hamiltonian. For each pair, the procedure collects multiple measurement results of the evolved state with different evolving times {t}. By the regression over {t}, the procedure focuses on the first-order coefficients of the measurements. According to the expansion in Eq. (3), the first-order measurement for a pair of SPAM settings is depicted by the process equation as where H (1) t is the first-order expansion of the Hamiltonian evolution channel, and [P α , M ] is the commutator. After the procedure computes all these first-order coefficients from measurements, we collect a series of observations for different SPAM settings where each corresponds to a linear combination of trace results for commutators as in Eq. (18). This reduces the problem of estimating Hamiltonian parameters s to solving the following process equations, Generally, the size of unknown parameters s scales exponentially with n, rendering the solution intractable. In our execution, the sparsity assumption guarantees that there are only sparse nonzero variables to be solved. Besides, from the first stage, our procedure has extracted the square values of decomposition parameters, which also includes the precise sparse support information. The size of linear equations is reduced to s by shrinking the target parameter from s to s ⋆ and reducing the coefficient matrix. Therefore, we have the following equations to illustrate the polynomial-size problem, The problem will be more knotty when SPAM errors are taken into account. Suppose measurements of the evolved states for the chosen observable are corrupted by additive noise. The noise will be transmitted in the linear regression stage, and the effects of noise will be enlarged due to the short evolving time t. In this case, the first-order part of measurements M will also be corrupted by the noise asM = M+ω. To suppress these effects, by choosing the coefficient matrix Φ with random local Pauli eigenstates ρ and Pauli measurements M , we can construct a coefficient matrix which limits the noise propagation during solving the unknown parameters. Regarding the noise and biased observation vectorM, the original equations will be derived to a new optimization problem, where ϵ is set to avoid the stuck case when the equation has no solution, and x is an unknown vector with dimension s. It is easy to see that when ϵ ≥ ∥ω∥ 2 , the true parameter s ⋆ lies in the feasible set. The algorithm for sign estimation is shown in the following with a proper choice of ϵ ≥ ∥ω∥ 2 according to Lemma 6.
By sampling the input states and measurement operators, the Φ matrix possesses approximate restricted isometry property (RIP) as defined in Definition 3. This property, by definition, makes sure that the norm of an arbitrary vector after the matrix product is close to the norm of the original vector. We prove this RIP of the randomly sampled process matrix for all possible decomposition parameters with a high probability according to concentration inequalities in Lemma 6, which we refer to as the approximate RIP. Based on the definition of RIP, the noise of resulting nonzero Algorithm 3 Sign Estimation: SE({|p α |, α}, m, t 1 ) 1: Input: The nonzero error rates support of P from Algorithm 9, a positive integer m, and the unit time length t 1 2: Prepare m random local Pauli eigenstates and m random Pauli operators for measurements (ρ γi , P βi ) m i=1 3: Choose evolving times {i · t 1 } which are multiples of t 1 and construct the Hamiltonian evolution channel {H t } for evolving time 4: Measure {Tr(H t (ρ γ )P β )} and calculate the coefficient matrix Φ for Pauli indices belong to P 5: Process ordinary linear regression on the measurement outcomes along the time {i · t 1 } and get ob-servationsM ) for the optimization problem in Eq. (21) 8: Solve the organized optimization problem with solution x ⋆ 9: return solution x ⋆ parameters can be bounded by the chosen relaxation ϵ. Since the sign estimator aims at extracting the discrete ±1 information, it tolerates a reasonable amount of noise that would not cause a sign flip in every solved parameter. Therefore, the noise is not destructive for the estimator, given that we use multiple rounds of equations to minimize the statistical effects. Proof. The full proof is shown in Appendix C.

Remark 2.
As there exists an upper bound of the m due to the accumulation of the SPAM errors, we cannot directly use larger m to significantly suppress the sign errors. On the other hand, we can choose multiple blocks of equations to estimate signs for multiple times. Based on this, we can further reduce the failure probability by an additional majority vote mechanism when we hold multiple sets of signs.
In this stage, we propose a protocol to extract the sign information of the Hamiltonian channel by constructing equations from random state preparation and measurements. With the aid of absolute values from the former stage, the size of this problem is reduced to linear with the sparsity. Based on the randomness, the equations will be robust against the measurement noise. Therefore, this efficient and robust estimation fulfills the need for sign estimation.

Main Theorem
Executing the two-stage protocol illustrated above, we can estimate the whole information of the nonzero decomposition parameters s ⋆ , namely, we can recover the Hamiltonian operator faithfully from this estimation as in Algorithm 4.

Algorithm 4 Hamiltonian Estimator
1: Input: An oracle to control the unknown Hamiltonian H, positive integer l, b, C, P and m, and the unit time lengths t 0 and t 1 2: Input: Offsets D for subsampling 3: Call Algorithm 9 with l, b, C, P , t 0 and offsets D to construct bin sets with random subsampling groups and estimate all the second-order error rates We combine the guarantee of our protocol from the guarantees of the two stages since our main protocol is a composite algorithm. Our guarantee, therefore, does rely on some mild assumptions, which are either inherited from foregoing subroutines or from requirements to bond these stages in a self-consistent way.
We summarize the results in Theorem 1, which states that our algorithm can estimate the decomposition parameters of an unknown sparse Hamiltonian with high accuracy and vanishing failure probability. It shows that our protocol is provably robust against circuit noise and SPAM errors under mild assumptions, which can be viewed from the guarantees of the fidelity estimator and the sign estimator separately. This algorithm is also provably scalable for both quantum measurement complexity and classical computational complexity. Moreover, the quantum complexity claimed above is for the worst case, and this complexity can be further reduced by merging all queries of qubitwise commutative Pauli fidelity, which relies on the choices of stabilizer groups elaborated in Appendix B.1. This merging can usually reduce the needed rounds of fidelity detectors by one or two orders of magnitude. Therefore, our protocol provides an implementable method for Hamiltonian learning.
s . The circuit measurement complexity of this execution isÕ sn ϵ 4 wherẽ O notation ignores the logarithmic terms. The postprocessing complexity is poly(n, s), where poly denotes that the scaling is polynomial with the elements.
Proof. This statement can be regarded as a combination of Proposition 1, 2, and 3. During the combination, we need first to consider the overall precision resulting from subsequently invoking subroutines. According to the target overall precision, we shall calculate the precision for each subroutine and then count and sum up the required complexities for all these subroutines. The rigorous counting and derivation are given in Appendix D.
Remark 3. It is true that we can choose other fitting models and strategies for extracting the corresponding information in both stages, and we leave a detailed study of the strategy for future studies. In the execution of the algorithm, we need two inputs, t 0 and t 1 , to serve as the fitting times of stages 1 and 2, respectively. Nevertheless, there is a natural trade-off toward the size of those evolving times t 0 and t 1 . If the evolving times are too short, this could suppress systematic errors from high-order terms in the expansion, while it would enlarge the errors from imperfect observations and measurements by fitting. If the times are too long, this will lead to large systematic errors. Therefore, the balanced choice of t 0 and t 1 can be viewed in Remark 1 and 5, respectively. To further improve the performance, a possible modification is to consider those higher-order terms in the fitting model to mitigate the systematic errors. In appendix E.2, we exhibit different trials of executions by fitting higher-order terms and taking the desired coefficients. This can vastly reduce the amplitude of the systematic errors and allow us to implement a longer time evolution. Moreover, the idea from [85] can help to construct an exponential suppression of the highorder terms with the number of data points. In general, we expect the choices of evolving times to satisfy that the effects of fluctuation errors from observations dominate.

Numerical Results
In this section, we exhibit several numerical results to show the validity and accuracy of the protocol for learning an unknown Hamiltonian with sparse decomposition parameters on the Pauli basis. These numerical results work as strong evidence that verifies the foregoing intuitively illustrated protocol. They also reveal some critical properties of the Hamiltonian learning method in the realistic implementation as we discuss shortly.
In order to simplify the numerical simulation without modifying the core components of the protocol, we make some mild assumptions in the simulation, which would not weaken the results arguably. Among all of these simulations, instead of running the cascading circuits to extract the fidelity information as in Section 3.1, we assume there exists an oracle to return measurement outcomes of input states under the evolution with the input time of the unknown Hamiltonian, which helps to directly get the Pauli fidelity terms of evolution. In this sense, we count each desired fidelity term as one query, and the number of calls for the oracle represents the total measurements. This is based on the understanding that the complexity of the first stage is always dominant in the total number of circuit measurements, which makes the tracking of fidelity query a faithful indicator of the total quantum complexity.
As for the simulation of noise effects, we add zero-mean Gaussian noise N (0, 10 −3 ) to the queried Hamiltonian fidelity terms to imitate the fluctuation noise from the fidelity estimator. Along the fitting process and the sparse transformation, the noise effects would be exhibited in the absolute value estimation. Moreover, we add the Gaussian noise N (0, 10 −3 ) to the measurement outcomes in the sign estimation as a representation of the SPAM effects when constructing the process equations. These errors will be propagated by the fitting and the optimization stated in stage 2, and the resulting signs reflect overall noise effects. This makes it possible to observe the noise-resilience of our learning protocol from the accuracy of each part in the simulation. First, we study how different choices of the number of bins b affect the efficacy of the protocol. We consider the 6-qubit transverse field Ising model (TFIM) Hamiltonian with random interaction parameters, For different choices of b, we run the protocol for 50 randomly chosen Hamiltonian operators, where each of them is estimated for 10 independent rounds. As defined in Section 3.1, the parameter b reflects the number of bin sets in every subsampling group. With a larger b, the procedure hashes Pauli error rates more In the box plot, we exhibit the statistical properties of errors from the outcomes using different b. In each case, the data is gained from the estimation of 50 random TFIM Hamiltonians with 10 rounds each. For every box, the orange line denotes the median of the error distribution, while the upper and lower bounds represent the 75% and 25%, respectively. By increasing b, the procedure estimates all the decomposition parameters with smaller errors. A sharp vanishing over the distribution of the estimation error is witnessed when b grows up to 4. In the inset, we use the violin plot to show a complete view of the distribution of the query numbers and recovery errors. Every violin along its axis represents a distribution of the corresponding value, and the widths everywhere illustrate the probability density. There is also a sharp shrink of the error bar while the querying number is growing along with the increasing b.
dispersively, meaning that every bin contains fewer underlying Pauli error rates as well as nonzero error rates. Since the peeling process relies on the sparsity of nonzero terms in every bin, procedures with larger b would generate more accurate estimations, where the reconstruction error is depicted by the relative 1-norm distance, On the other hand, a larger b is equivalent to querying more fidelity terms in the subsampling step, which leads to a trade-off of choosing a proper b. In Figure 3, we can view in the box-plot that the recovery errors decay vastly when the procedure increases b from 3 to 4. The error is even smaller when it continues to increase b. However, as shown in the inset of Figure 3, the execution overhead also increases with a b. Therefore, a wise choice of b can be explored as the threshold under which the error is significantly large, where we choose it as b = 5 in our following simulation. This remains consistent with the previous analysis about the peeling step, where the procedure expects the bin number to be larger than the sparsity. However, we may not have access to the error in practice for an unknown quantum system. The problem then rises as to how to decide a proper b in practice. That relies on adaptively increasing b from b 0 . In (a), we show the distribution of the results for random TFIM Hamiltonian with different system sizes varying from 1 to 7. For each n, the procedure runs for 50 random TFIM Hamiltonian operators with 10 rounds for each to detect the average reconstruction results. The parameter b is chosen by witnessing the threshold behaviors by varying b as a trial process. For example, we choose b = 5 for 6-qubit random TFIM Hamiltonian learning, and all others can be viewed in the appendix. In (b), we show the ability to recover different molecular Hamiltonians, H2 (4 qubits), H3 (6 qubits), and LiH (6 qubits). For each molecular system, we execute our method for 500 independent rounds to gather these statistical properties. The bin parameters are also chosen according to the threshold behaviors. In (c), we show an average reconstruction among 20 rounds of the 7-qubit TFIM Hamiltonian, and we use the error bars to display the standard deviations of parameters. The terms are sorted by the absolute values of decomposition parameters. The labels of the x-axis indicate both the Pauli indices as well as the signs of the estimation (the signs are all corrected estimated, as one can see from (a)). In (d), the procedure estimates the Hamiltonian for the H4 (8 qubits) molecule, and we exhibit the 50 dominant terms of both ideal and reconstructed parameters among 184 nonzero terms. This result is an average of 8 rounds, and the error bars are constructed according to standard deviations. In each round, the simulation uses 5 blocks to vote for the final estimation of signs to suppress the failure probability of sign flips. Similar to (c), this plot also uses x-axis coordinates to denote the Pauli indices and signs, and the estimation shows great agreement with the ideal parameters.
to choose the proper bin size b. For every b, we run the protocol multiple times to first observe the peeling process to find out whether some multi-ton bins get stuck after the peeling process. If this happens with significant frequencies, it indicates that the current b is not enough. We can further observe the convergence of results from different rounds. As shown in Figures 3 and 5, the variance between different rounds of estimation will decrease sharply when b is large enough i.e., b ≥ ⌈log s⌉, and we denote the first b to observe this decaying as the threshold. Since the execution complexity scales as O(n2 b /ϵ 4 ) when we choose b arbitrarily, the overall adaptive trial of b to b = ⌈log s⌉ only increases the complexity O(ns/ϵ 4 ) by a constant factor.
Next, we study the scaling of the protocol for different Hamiltonians with varying system sizes. We first consider the TFIM with different numbers of qubits. Figure 4(a), we show a complete view of the distribution of outcomes in different n. We choose b for each n according to the abovementioned threshold idea. The errors are also measured by the relative 1-norm distance, and the major parts concentrate on small errors in each case. While an interesting observation is that the lower bound of these errors grows with the system size, this is mainly due to the regression error as we execute the ordinary least-square linear regression in this simulation. Even though we only consider the second-order effects here, We further analyze the effects of different fitting settings in the following and in Appendix E.2, which suggests that considering the higher even-order terms in the fitting can significantly improve the estimation.

As in
The TFIM model still has a simple structure with nearest-neighbour 1D interactions. We then consider more complicated examples for benchmark, i.e., molecular Hamiltonians, which have been demonstrated to be potential candidates for quantum computing and quantum simulation [86,87]. We encode the fermionic Hamiltonian with qubits, which generally has much more complicated Pauli decomposition parameters that are not local but have sophisticated multiqubit interaction. We note that though the nonzero terms are sparse compared to the exponential size of Pauli decomposition, the nonzero support is still quite large, which grows quartically with the system size. If we could measure each Pauli decomposition parameter to accuracy ε, the reconstruction error defined in Eq. (23) also grows quartically. In order to show a system-size independent error rate, we recruit the average 1-norm distance, to depict the reconstruct errors in the molecular cases.
In Figure 4(d), we display the distribution of the estimation of several molecular Hamiltonians only considering the zero-and second-order terms during fitting. Generally, both the quantum complexity increase when system sizes grow, while the average reconstruction errors remain comparably constant.
In the remaining two subplots, we use the bar plot to exhibit the estimation of every parameter. In Figure 4(c), we exhibit the average estimation of 50 independent rounds of executions for a 7-qubit TFIM Hamiltonian with random parameters. Furthermore, we use the higher-order fitting considering the fourthorder terms of the Hamiltonian expansion to extract a much more precise estimation. The higher-order fitting will require an additional one or two times query complexity, while it brings a greatly precise estimation. Therefore, as can be viewed in (c), the distances between pairs of original values and reconstructed values are generally close to zero. In Figure 4(d), we estimate the Hamiltonian of the hydrogen chain H 4 . We show the 50 dominant terms with both the ideal and reconstructed decomposition parameters 2 . Since the number of nonzero terms is great and they vary from large terms to very small terms, we also consider the fourth-order terms' effects during fitting to improve the estimation of absolute values. In order to provide the perfect sign estimation, we add a further majority voting from 5 blocks of resulting signs when deciding the final signs. In this average estimation of 8 independent rounds, random noise causes little harm to the reconstruction, and our procedure returns precise estimations of the most significant terms among the nonzero parameters. On the other hand, the error may grow when the underlying value turns small since the procedure cannot divide the noise effects from the small parameters accurately. This result sheds light on how the algorithm would execute on Hamiltonian with much more nonzero decomposition parameters.
We refer to Appendix E for more discussions on the implementation and numerical results. The codes to generate these simulations are shared in [88].

Conclusion
In this work, we have proposed a universal, robust, and efficient Hamiltonian learning protocol. The protocol is universal in the sense that it estimates any nqubit Hamiltonians with sparse Pauli decomposition parameters without prior knowledge of the Hamiltonian structure. For example, the protocol can estimate Hamiltonians consisting of global interactions, or namely those high-weight Pauli terms. Next, our protocol is robust since it estimates the Hamiltonian noise-resiliently against SPAM errors and certain amounts of the circuit and shot noise. The protocol is also efficient with polynomially scaling classical and quantum complexities. Besides, the protocol directly exploits the time evolution of the Hamiltonian instead of requiring eigenstates or thermal states of the system, which make it reliable in the real-world implementation. All these results have been rigorously proven in our theoretical analysis in Appendix as well as numerically verified for different sizes of transverse field Ising models and different electronic Hamiltonians in the main text.
Our work provides a scalable and more practicallyminded Hamiltonian learning method compared to the previous protocols based either on explicit structures or informative states. This protocol could be applied for calibrating quantum computing hardware, studying the interaction structures of intricate manybody quantum systems, detecting interesting manybody physics phenomena, etc. As our method exhibits a direct linkage between the fidelity estimation and Hamiltonian learning, we can expect that development in the fidelity estimation also improve the learning of Hamiltonian. For example, the proposal in [89] implies that entanglement states can help to reduce the overheads for fidelity estimation, which can also be employed in the Hamiltonian learning. Our method only exploits the incoherent part of the Hamiltonian time evolution process to estimate the absolute value of the decomposition parameters, and one may expect to make use of the coherent part (the first-order expansion) to more efficiently extract the Hamiltonian information. Nevertheless, as we have discussed previously, this would require a new SPAMrobust process detection mechanism, which also deserves its own investigation. Another point full of practical interest is to combine the fitting method that holds a clear trade-off between the higher-order effects and the random noise with our protocol. As inspired by Figure 4 and 6, a wise choice of fitting method can significantly improve the estimation, while we still need to balance the effects of the overall errors. Moreover, a rigorous analysis of more general nonsparse Hamiltonian learning remains open. we are interested in the case when the parameters of the unknown Hamiltonian distribute unevenly rather than sparsely. As a rough idea, compressed sensing might help in this circumstance.

A.1 Pauli Concept
In this section, we will introduce the commonly-used Pauli group and the label to represent every element. We will start from the single-qubit case. For the single-qubit Pauli group P, it is generated by three Pauli operators, The multiplication among these generators will cause some extra phase terms for some operators, so this group contains several elements that are not Hermitian. On the other hand, all the Pauli elements serve either as observables or gates. Therefore, we only take the refined Pauli group P with all Hermitian elements into account. In group P, we treat elements with only differences on phase the same, which is equivalent to making a quotient over the phase element Moreover, for every integer n, the refined n-qubit Pauli group P n is constructed from tensor products of P, and we have In order to specify every element among 4 n in the refined n-qubit Pauli group, a canonical way will be using a 2n-bit string to label every Pauli operator in the refined group. We can start to illustrate this convention from the case of the single-qubit Pauli group P. This refined group contains four elements which are the three Pauli matrices and an identity operator. As for the label, we refer to a bit string α with a length of two, I X Y Z α 00 01 10 11 .
Therefore, if we want to label an n-qubit Pauli element, we recruit 2n-bit strings to record the Pauli element in each qubit. For a 2n-bit label α, the 2i − 1 th and 2i th bits, α 2i−1 α 2i , characterize the corresponding single Pauli operator of P α on the ith qubit. For example, if we want to represent the operator XZIX, then its label must be α = 01110001.
Since there exists an isomorphism between the 2n-bit labels and the Pauli elements, we need to define the corresponding operations that use the label to express some properties of Pauli operators. The basic operation is the addition. We define the addition of Pauli labels following the multiplication of Pauli operators. Regardless of phases, the summation is equal to the multiplication of every pair of Pauli operators so that P α · P β = P α+β , ∀ P α , P β ∈ P n . (28) Note that we will use α ∈ P n for short of denoting P α ∈ P n when it would not incur any confusion. We next introduce a new inner product between labels of two Pauli operators. For two n-qubit Pauli labels α and β, we define the Pauli inner product as follows where we use the subscript to distinguish the Pauli inner product from the ordinary inner product. This operation actually describes the commutation of every pair of Pauli operators. By calculating the qubitwise commutation and accumulating them, the resulting Boolean value indicates the commutation property as

A.2 Pauli Stabilizer Group
There exists a kind of special subgroup in the refined Pauli group, namely the stabilizer group.

Definition 1.
A stabilizer group S is defined to be a subgroup of the refined Pauli group P n such that ⟨a, b⟩ p = 0 for all P a , P b ∈ S.
It can be directly inferred from the definition that each stabilizer group is a normal subgroup of the refined Pauli group. Therefore, we can find further the quotient group A S := P n ⧸ S . By group theory, we know that for an arbitrary P c ∈ P n , there exists a unique pair of e c ∈ A S and s c ∈ S so that P c = e c · s c . Since only e c contributes to the syndrome measurement after implementing P c to a stabilizer state, we denote e c by the error syndrome of P c .
It is easy to prove that the maximum size of a stabilizer group is 2 n for P n . Therefore, the size of the quotient group of a maximum stabilizer group is 2 n . Consider a maximum stabilizer group generated by n generator G := ⟨g 1 , · · · , g n ⟩, and for any Pauli operator P c = e c · s c , we can get the full commuting relation between P c and G by the commuting relation between P c and set {g 1 , · · · , g n }, which a n-bit string can represent. Therefore, each element in A S maps onto a certain n-bit string, and this serves the basics of the stabilizer algebra. Consequently, if we implement a simultaneous measurement of all the generators referred to as the syndrome measurement, the results turn out to be the bit string representation of the error syndrome of the equivalent Pauli gate to the whole circuit.
During the analysis of the fidelity estimator in Sec. 3.1.1, the protocol tracks and adds the syndromes from sampled Pauli gates and the final measurement. The summation is processed by the corresponding bit strings by binary summation. And the result is the overall syndrome of the circuit.

A.3 Pauli Channel Characterizations
The commonly used Pauli channel can be regarded as a stochastic quantum channel. The is due to the definition of Pauli channels, where coefficients {p α } are Pauli error rates. By the CPTP conditions for every physical channel, we have the following constraints over these Pauli error rates, Therefore, the set {p α } constitutes a probability distribution over 4 n possible Pauli operations. Consequently, a Pauli channel with the form as Eq. (31) can be interpreted as applying a Pauli gate P α (·)P α with probability p α . Moreover, there is a second set of 4 n parameters that can fully characterize a Pauli channel E P , which is defined in Section 3.1 as Pauli fidelity, Note we can also deduce this value from the definition of Pauli channels. From Eq. (31), every Pauli operator is an eigenoperator of an arbitrary Pauli channel, while the eigenvalue of P α is just f α . Thereby, some also refer values defined in Eq. (10) as Pauli eigenvalues.

B Pauli Properties Estimation
In this work, we estimate an arbitrary Hamiltonian operator with a sparse decomposition vector efficiently. As stated in Section 2, the decomposition vector s can faithfully describe a Hamiltonian operator. Therefore, the problem of Hamiltonian estimation is, in principle, a scalable task. According to Eq. (3), the Hamiltonian channel contains the full information of the vector. We choose the Hamiltonian evolution as the target, so the question is deduced to estimate a Hamiltonian channel. In this section, we introduce the fidelity estimator in Appendix B.1. The estimator applies to the reference circuits and the circuit with the Hamiltonian evolution sequentially to extract the net Pauli fidelity of the Hamiltonian channel. Moreover, we recruit the sparse Pauli error rates estimator in Appendix B.2 to detect the error rates of the Hamiltonian channel, which are directly linked with the decomposition parameters.

B.1 Pauli Fidelity Estimation
Here we first introduce the procedure and theoretical guarantees presented in [49], and it will be recruited as a subroutine for the estimation of the Hamiltonian channel's Pauli fidelity. This algorithm was first proposed to estimate the implementation quality of all Pauli gates. To quantify this quality, we regard the practical implementation of a Pauli gate P asP = Λ • P . Therefore, Algorithm 5 estimates the Pauli fidelity of this noise channel Λ over a set of Pauli indices, X, which belongs to a stabilizer subgroup G in the whole Pauli group P n . For the implementation, we choose the input as a stabilizer state of stabilizer subgroup G, and the final measurement will be a syndrome measurement of the group G. Note both the input states and the measurements contain some noise. Hence the protocol must be noise-resilient of SPAM errors. The execution starts with a series of random Pauli gates. Intuitively, it first sets the length index m as 0 as a reference, and m increases exponentially from 1. We denote this sequence of m by κ. The procedure then applies m + 1 layers of uniformly random Pauli gates to the input state for each m ∈ κ. Since the randomly twirled channel will be a diagonal channel in the superoperator representation, the expected channel of this circuit equals the composition of m identically diagonal channels with some SPAM error effects, where the diagonal channel is equal to the diagonal part of the noise channel, Λ, namely, the Pauli fidelity terms of Λ. By employing the stabilizer states and syndrome measurements, the results bring some diagonal information about the noise channel tarnished by SPAM errors. More specifically, the results will be an exponential decay with the length index m with factors to be the diagonal terms. When a certain measurement result decays down to one-third of the reference, the algorithm then collects all results along m ∈ κ to fit and get the diagonal information. The Algorithm 5 is shown in the following, where l denotes the number of random sequences for every length needed to construct the random estimator.
Algorithm 5 FEstimator(G, X, l) 1: Input: a target set X belonging to a stabilizer group G, and a positive integer l 2: Initialize an arrayr Λ with size |X| to be all -1; Initialize an array V with size |X| to be all 0 3: Input states will be positive stabilizer statesρ G ; The measurement will be syndrome measurement M G of G 4: for all k ∈ [l] do

5:
Apply a random Pauli gate P α ∈ P n to stateρ G , and perform measurement M G with result b 6: for all x ∈ X do  m ← 2m 25: end while 26: return the arrayr Λ Even though the above algorithm works for a subset of a stabilizer group, we can simply generalize this method to detect a general subset X in the Pauli group. To achieve that, we employ the concept of stabilizer covering. For a set X of Pauli operators, its stabilizer covering is a collection of stabilizer groups, of which the union set covers X. Therefore, we can divide the target set exclusively into multiple stabilizer groups and execute the algorithm for each piece to estimate the whole set. 8 in [49]). Let G be a stabilizer group in P n and X ⊂ G. Suppose the noise Λ satisfies the assumption 2 and ∥Λ−I∥ < 1 2 , and the following holds with probability 1−δ by choosing small ϵ, δ > 0: Running FEstimator(G, X, l) with l = 2 ϵ 2 log 2|X||κ| δ where κ is the set of exponentially increasing sequence lengths. The procedure uses at most O log 1 ∆ X sequence lengths with largest length m to be O 1 ∆ X , where ∆ X denotes the smallest residual r Λ := 1 − f Λ among the set X regarding Pauli fidelity f Λ of noise channel Λ. Moreover, the outputr Λ satisfies

Proposition 4 (Rephrasing Proposition
Beyond this standard Pauli fidelity estimation for the implementation noise of Pauli gates, we want further to detect the fidelity information of the Hamiltonian evolution. The idea of this detection is similar to that of the standard procedure for the noise channel. We will use random Pauli gates to twirl the desired Hamiltonian channel. For convenience, we denote the extended algorithm by HFEstimator(G, X, l, t) where the additional parameter t represents the evolving time of the Hamiltonian channel. The major difference between this protocol and the original one is that the new procedure will estimate the composite channel of the implementation noise and the Hamiltonian evolution, which is H t • Λ. After implementing a random Pauli gate, the new procedure appends the Hamiltonian channel H t to the circuit, as shown in Algorithm 6. In summary, the modified procedure will implement m+1 random gates along with the channel H t , alternatively. The circuit is illustrated in Figure 2(b).
Algorithm 6 HFEstimator(G, X, l, t) 1: Input: a target set X belonging to a stabilizer group G, a positive integer l, and the evolving time t 2: Initialize an arrayr H•Λ with size |X| to be all -1; Initialize an array V with size |X| to be all 0 3: Input states will be positive stabilizer statesρ G ; The measurement will be syndrome measurement M G of G 4: for all k ∈ [l] do

5:
Apply a random Pauli gate P α ∈ P n and Hamiltonian evolution H t to stateρ G ; Perform measurement m ← 2m 25: end while 26: return the arrayr H•Λ Similar to Algorithm 5, this algorithm actually replaces the noise channel Λ by the composite channel H t • Λ. Therefore, the resulting estimation from Algorithm 6 keeps very similar guarantees to Proposition 4. Corollary 1. Let X ⊆ P n and G be a stabilizer group containing X. Suppose the noise Λ satisfies assumption 2 and assumption 3, and the following holds with probability 1 − δ by choosing small ϵ, δ > 0: Running HFEstimator(G, X, l, t) with t satisfies ∥H t − I∥ < 1 4 and l = 2 ϵ 2 log 2|X||κ| δ where κ is the set of exponentially Proof. Since the Hamiltonian channel is defined to be H t (ρ) = e −iHt ρe iHt , the channel is close to the identity channel when the evolving time t is small. With the assumption stated above, the composite channel H t • Λ is close to the identity under the Pauli twirling. Specifically, we have where the first inequality comes from the fact that all Pauli fidelity is upper bounded by 1, and the last inequality comes from the requirement of t and the assumption 3. Therefore, according to the Proposition 4, HFEstimator can estimate the composite channel precisely as stated.
Thus we can calculate the estimation by the ratio estimator. The upper and lower bounds of this estimation are therefore calculated as follows, For the lower bound, we have a similar way of quantifying that Remark 4. In the statement, we claim the errors of estimations will be bounded by a multiplicative accuracy.
Although there are different components involved in the relative part, this is still precise since the residual is small and all the fidelity terms are less than 1. As we have introduced in the main text, the local gate noise is usually very close to the identity channel [76][77][78], which means f Λ is close to 1, and r Λ is negligible. Likewise, the Hamiltonian channels are assigned with short evolving times in our learning method. Thus, these channels are also well-behaved with large f Ht and vanishing r Ht•Λ . Therefore, our estimation will be of a multiplicative precision, and this error is much less than the predetermined ϵ.
In this analysis, we do not rule out the effects of evolving times, so this bound of estimation may vary from different Hamiltonian and evolving times. To resolve this dependence, the procedure FE increases the evolving times evenly among multiple calls of HFEstimator. With a regression over these multiple t, the procedure can estimate the second-order fidelity f (2) as stated in Eq. (8). With the bounded noise proved in Lemma 1, the regression parameter can be estimated with high accuracy. Proposition 1. Let X ⊆ P n and G be a stabilizer group containing X, and suppose Assumption 2 and 3 hold. For any small ϵ, δ > 0, the following happens with probability 1 − δ: Run Algorithm 1 FE(X, G, l, t 0 ) with t 0 satisfies ∥H 5t0 − I∥ < 1 4 and l = 2 ϵ 2 log 2|κ| δ where κ is the set of variant sequence lengths. regression, and the resulting regression arrayf (2) satisfies where the upper bound term adopts the largest bound term referred from Lemma 1 among multiple evolution with different t, and σ is a constant regrading the regression method.
Proof. According to the ordinary least squares, the parametersf (2) can be calculated from the noisy observations and ideal predictors,f (2) where k denotes the number of observations in the regression. According to Lemma 1, the errors of estimation of f Ht are also bounded by the predetermined precision ϵ multiplying the size of residual. In Eq. (38), the estimator is a linear summation. Thus we choose the largest bias among these fidelity terms with different evolving times t and use it to replace all other bounds, Note that the higher-order errors come from the systematic errors in the estimation of f (2) as we only consider the second-order terms of Pauli fidelity. Furthermore, o(t 2 ) is due to the observation that only even number order terms contribute to the fidelity. In the execution of FE, we implement multiple evolutions with t varies from t 0 up to 5t 0 evenly. Therefore, we have the following constant (40) Combining the above two equations, we can get the claimed bound.

B.2 Sparse Pauli Error Rates
In this section, we focus on the post-processing of the fidelity information detected in the last part. Even though the whole set of fidelity can fully characterize an unknown Pauli channel, it costs an exponential-size calculation to transform the fidelity to Pauli error rates, as stated in Eq. (14). These error rates imply the absolute values of our desired parameters according to Eq. (13). Fortunately, the proposal in [62] offers an efficient method to complete this transformation given the target channel has sparse nonzero Pauli error rates. We will illustrate this method with some reorganized algorithms and guarantees. The protocol first constructs bins to collect the linear combinations of certain fidelity terms. As shown in Return U c;t [j] 16: end for Algorithm 7, the procedure chooses fidelity according to the random indices and queries Algorithm 1 for the fidelity value. According to the Pauli index defined in Appendix A.1, the stabilizer group G k is generated by  ⟨P ϕ(k1k2,1) , P ϕ(k1k2,1) , · · · , P ϕ(k1k2,n) ⟩, where result of ϕ : F 2 2 × {1, · · · , n} → F 2n 2 is 0-extended to 2n-bit string: Intuitively, that means we extend the original Pauli to 1-weight Pauli on every qubit as a generator and use X when encountering I on some qubits. M here is some random Boolean matrix, and the procedure reshuffles the fidelity by this matrix. In order to make bins represented in the same form of a hash function, the procedure further introduces the additional J n := I n ⊗ X to modify random matrices. Vector d works as offsets to create variant phases. Therefore, every bin is labeled by the matrix index c, the offset index t, and the intrinsic bit string j with length b. We denote the bins with the same j and c to be in a single bin set. Therefore, there are B = 2 b bin sets regarding a random matrix. With the summation and Walsh-Hadamard Transform, the bin set {U c;t [j]} t actually constructs a partition over the complete Pauli error rates as stated in Lemma 2.
Lemma 2 (Rephrased from Lemma 1 in [62]). The B-point WHT subsampled bin coefficients with index j ∈ F b 2 can be written as: Moreover, the sampling error is as follows where W α is the noise of the Pauli error rate p α .
From Lemma 2, matrix M determines the hash function to partition the whole set of Pauli error rates into these bin sets. We thus denote these matrices by subsampling matrices. Since each bin set is a partitioned piece and there are only sparse nonzero error rates, it will be likely that every bin set contains only a single nonzero error rate, which allows us to learn the value directly. To achieve this, the procedure must recruit bin sets number B = O(s).
The procedure also constructs methods to determine the precise status of every bin set to verify we can acquire the desired error rate directly. The following Algorithm 8 gives a detailed illustration of the construction of this detector. We can shed the most light on the case of noiseless processing. According to Lemma 2, an offset vector decides the sign for every term in a bin. Because every bin set employs P different offsets, if all the terms are zero, the absolute values of these bins would be all zero. In a similar manner, when there exists one nonzero term, all the absolute values would be the same. In contrast, if these absolute values vary from different bins, the procedure claims there are several nonzero terms in a bin set. For convenience, we denote the above three types of bins by zero-ton, single-ton, and multi-ton bins, respectively. Obviously, bins in a bin set remain in the same type, so we may also use these types to label a whole bin set without leading to ambiguity.
Besides, this detector aims to extract both the value and the index of the nonzero error rate when it is the only nonzero term in a bin set. The idea of index estimation relies on the choice of offsets d. Ideally, if there is one nonzero term without noise, the procedure would choose the set of offsets {d c,1 , · · · d c,P } to contain all the linear independent unit vectors in the binary space of length 2n. Therefore, the sign of each of the bins suggests the binary value of α on the corresponding bit according to Lemma 2. As for the value of the corresponding error rate, the procedure arbitrarily fetches a bin in the bin set and eliminates the effect of the random sign to get the rate since we have known the index of this rate.
Suppose there exists some noise during the construction of bins. The idea of distinguishing a bin set among the aforementioned three types keeps the same. The procedure checks bins with the first P 1 random offsets and determines the type of this bin set. Given a single-ton bin set with the error rate p α , we have the following where D c is offset matrix from P = P 1 + P 2 ds, and W represents the effects from noise. In order to protect the information about the index α of the only nonzero term against noise, the procedure employs some linear error-correcting codes and implements each offset d among the last P 2 as a column of the code generating matrix. Therefore, the signs of the last P 2 bins serve as an element of codewords that remains resilient against the noise from corrupting the signs. In the detection, the procedure thus decodes signs back to index α if the code distance is large enough. Given the index of the nonzero term, the procedure will eliminate signs in the first P 1 bins and average over these bins. This helps to suppress the statistical fluctuations in these bins and get a much more precise estimation of the nonzero error rate. However, the procedure cannot extract all the nonzero error rates in this way since it is highly possible that there are some bins including several nonzero terms. The protocol implements the peeling among different bin sets using the single-ton bin set with the support of the bin detector. As shown in Algorithm 2, the procedure will fetch the single-ton bin set to peel out the error rate of all other bin sets in different sampling groups which contain the same error rate. By constantly peeling, the procedure generates more artificial single-ton bins, and the peeling will keep until there are no remaining nonzero error rates. In the following, we give the full version of the peeling process with detailed parameters and calculations. The procedure also maintains an array T to keep track of noise propagated among bins, which helps the bin detector return a reliable detection even with the occurrence of noise.
Even though the algorithms are rephrased from [62], the model of noise from desired fidelity terms varies in our implementation of this Pauli error rate estimator. In this work, this estimator plays as a subroutine utilized to process the estimated fidelity from the fidelity estimation. Therefore, we cannot impose a similar assumption Return ( B,nil, nil) 15: end if as stated in [62] on the estimated fidelity. On the other hand, the theorem and corresponding proofs can be modified in our implementation to coordinate with the fidelity estimator's form.
According to the fidelity estimator, there exist some instructions, such as the ratio estimation, that are inherently biased. Generally, we denote the noise of the estimated fidelity f (2) by w = ∆ + ω, where ∆ = E(w) is a constant that represents the bias of the estimation. In this sense, the noise ω is random with zero mean, and we will further assume that ω is zero-mean Gaussian noise. This is due to the central limit theorem as there include many samples of fidelity estimation in the first step. Even though we have little knowledge about this noise, the whole noise w is bounded according to Lemma 1. For convenience, we denote the bound by B > 0. Thus we model this noise as a constant belonging to an interval ∆ ∈ [−B, B] along with zero-mean Gaussian noise ω ∼ N (0, B 2 ).
We must consider even further how this form of noise behaves in every bin. A bin is constructed from up to B fidelity terms, and all these fidelity terms contain independent random noise. Thus the Gaussian part of bins' noise that comes from the linear combination of different ω is W ∼ N (0, B 2 B ). As for the constant parts, we can treat them as biases of the underlying error rates. From the Walsh-Hadamard transform, the size of bias ∆ (p) α for an error rate p α is no larger than BB 4 n . In this case, we divide the general noise from the noisy estimation of Pauli fidelity into the two parts mentioned above.
In our execution of this error rate estimator, the main challenge comes from the noisy inputs. The major part of dealing with the ubiquitous noise is the robust bin detector. Based on this, the analysis of the failure or exception for running this estimator will focus on the resilience of the bin detector. Primarily, we rephrase the tail bounds in [90] since we will employ this bound heavily in our proof.
Lemma 3 (Tail bound [90,Lemma 11]). Given g, k ∈ R N where k is an isotropic Gaussian random variable k ∼ N (0, ν 2 1 N ), then the following tail bound holds: for τ 1 , τ 2 , and θ 0 satisfying As stated in Algorithm 8, the procedure aims to distinguish every input bin set as zero-ton, single-ton, or multi-ton. Thus the number of type errors is six by enumerating all pairs between the preceding three. To be more clear, we follow the notation in [62], and let B be the real type andB to be the detected type. Besides that, we also need to consider the case in which a bin is detected to be single-ton correctly while the detector  if hassingle = f alse then We then show a comprehensive lemma to illustrate the robustness in most executions of this detector. The lemma states an exponentially vanishing failure probability with regard to some mild conditions that we expect to rule out later by the total probability formula. In the proof, we will separately analyze the probability of all these eight types of failures. . With the fidelity extracted from Algorithm 1 and bins constructed from Algorithm 7, the detector will deal with noisy bins. Let E denote the event that an arbitrary bin detection with inputs as those in Algorithm 9 get failed as defined in Definition 2. Let D be the event that every bin contains at most P 1 nonzero terms Let V be the event that all the prior bin detection executes successfully. Let H be the event that the peeling graph is cycle-free. Let K denote that all the peelings do not rule out the randomness and that the bias parts remain limited dependence. Then Proof. During the proof we denote B √ B by ν. The original noise in each bin can be divided as zero-mean Gaussian noise N (0, ν 2 ) and a bias with variance less than B 2 ν 2 N given the event K as illustrated in Lemma 5.
From Assumption A1 & A2 and the constraint that In the remainder of this proof, we will keep this constraint on ϵ 0 .
We will follow the Definition 2 to check all the possibility of variants of detection failures. For the first one, we consider the case that a bin detection identify a single-ton bin as a zero-ton bin. The failure rate is defined as follows, According to the lemma 7 in [62], the maintained array T keeps track of the variance information of Gaussian noise in each bin. Thus, the above failure rate can be bounded by Lemma 3, Lemma 5, and Markov inequality.
Since the number N = 4 n , B is a polynomial parameter and P 1 = O(n), the failure rate is exponentially low. Then, we consider the case that we recognize a zero-ton bin as a single-ton bin.
Similarly, we can use the formula of the total probability to bound this failure rate.
As for the failure rate f 3 , it measures the probability of identifying a single-ton bin as a multi-ton bin. For clearness, we denote the bin contains error rate p m actually, and the estimated rate ispα. And the sign vector added to peel the error rate is represented by s α Since here involves multiple error rates, we only consider the case that the estimated error rate satisfies that α = α and p α −p α ≤ ν. By Lemma 3, we have As for the next failure, we consider the case that the bin contains several nonzero terms while the detector identifies it as a single-ton bin.
We then decompose the bin vector by the Gaussian noise part W, the bias part ∆W, the error rates β, and the sign matrix s. The failure rate can be calculated as follows, The first term in the right-hand side can be easily bounded by Lemma 3 since T tracks the variance size of every element in W, As for the second term, we also need to decompose the remaining two terms, Note that ∆W is a random variable with the mean to be zero from the summation of multiple Bernoulli variables. Thus it is an approximate Gaussian variable by the central limit theorem. The variance falls into the interval B N B 2 , 2B N B 2 . By Lemma 3, the first term can be bounded by The second term can follow the analysis in Appendix E of [90], and we get Even N = 4 n , we can choose a proper P 1 and guarantee that this is still an exponential vanishing failure rate. Therefore, we have The following two failure types are easy to handle as they can be reduced to the previous rates. For the failure that the procedure identifies a zero-ton bin as a multi-ton bin, this bin must pass the zero-ton verification first. Thus, we have If the detector recognizes a multi-ton as a zero-ton bin, the probability of this failure is very similar to f 4 , and we can claim that Then we need to consider the remaining index and value errors. As the sign for every single-ton bin is generated from the inner product of offsets and the index as follows, sgn [U c;t [j]] = sgn (−1) ⟨dc;t,α⟩p p α + ∆W + W = ⟨d c;t , α⟩ p ⊕ Z t , ∀ t ∈ {P 1 + 1, · · · , P 1 + P 2 }, where sgn is a function that returns the sign of input and the Bernoulli variable Z denotes whether the bias and noise flips the sign of this bin. By deliberately choosing these P 2 offsets, we can use these bit strings to construct a generating matrix G of some error correcting code, From the error-correcting codes' perspective, there are 2n logical bits, and the code words have length P 2 . Thus, the code rate is R = 2n P2 ≤ 1. Besides, we denote the code protects error with Hamming weight up to βP 2 with a positive parameter β ∈ [0, 1]. To decide whether a code with a given distance can protect the index information, we need to figure out the property of the Bernoulli variables Z. By definition, the variable Z = 1 if and only if |∆W + W | ≥ p α and they contain opposite signs. Therefore, we have Due to the fact that W is Gaussian noise with zero mean and variance T ν 2 , there exists a tail bound for Gaussian random variables.
The last inequality comes from the assumption about the smallest Pauli error rate. Combining this tail bound and the bound of the bias, the success probability of the Bernoulli variable is We denote this success rate by P, and the index failure rate is therefore bounded by Hoeffding's inequality, The value estimation comes from the single-ton search step in Algorithm 8.
Till now, we have completed the analysis of all types of failures that would occur in the detection of bins. In the implementation of our algorithm, we choose P 1 P 2 ∼ O(n) and N = 4 n is an exponentially increasing parameter. Therefore, by the union bound, we have that Lemma 5. Suppose Assumptions A1 & A2 are satisfied. Suppose that every bin contains at most n 2 nonzero error rates and that every previous bin detection ran successfully before. In an arbitrary peeled bin set U c0 [j 0 ], the bias in every bin keeps bounded dependence, and the variance after the peeling will be bounded as follows, This above-mentioned observation succeeds with probability at least 1 − e −O(n) .
Proof. We prove this lemma inductively. In the beginning, we first focus on the bias from an arbitrary bin before peeling. According to Lemma 1, the Pauli fidelity terms are carrying bounded noise w ∈ [−B, B], and we can regard the bias of fidelity as still falling into this bound naturally. The procedure constructs every bin by a specific linear combination of multiple fidelity terms, and Lemma 2 tells the form of the resulting bins. Let us consider the bias ∆W in a bin U c;t [j], which is in the form of the summation of constants with random signs, where these ∆ (p) are the effective biases of every error rate by Walsh-Hadamard tranform. Since all the bias terms are pair-wise independent, we can calculate the variance, Then we need to analyze the effect of peeling processing on the biases of bins. The peeling decoder algorithm points out that the single-ton bin will be averaged over random offsets to peel the corresponding error rate term in other bins. The bias in a single-ton bin is propagated to the target bin during every peeling.
For brevity, we denote the l single-ton bins and the target bin by {U c1 [j 1 ], · · · , U c l [j l ]} and U c0 [j 0 ], respectively. Averaging over those single-ton bins with P 1 randomly chosen offsets, and the target bin after the peeling is When the previous detection works perfectly, the right-hand side of Eq. (72) becomes a single-ton bin, and all the other terms are either Gaussian noise or biases. By the induction assumption, we can find all the involving bins have biases with variance less than 4B N B. Note that the bias terms span all possible Pauli indices in a bin, and the average over different random signs is not completely independent. The bias linked to the nonzero error rate will be propagated faithfully. Thus we have where 2l ≤ n ≤ P 1 ∼ O(n). The second term comes from the covariance of the desired nonzero terms' biases, and the third term is the independent part of summation.
In the above analysis, we have presumed two things. For the first, we assume there is only one bias term from the peeling that performs dependently with the original bias terms. This actually contains two aspects that both the peeling bin and the peeled bin have untouched biases with the corresponding Pauli index. These rely on the constraint that the accumulation of all previous signs that come from peelings would not cause any collapses to the signs of nonzero terms in a bin. Then we need to bound the probability that the accumulation of random signs in multiple peelings rules out the randomness. This is possible because the propagating terms carry one more random sign during every peeling process. Random signs must come from the inner product of the corresponding offset and a nonzero bit string in the kernel space of the subsampling matrix. To check whether these random signs will annihilate themselves, we can construct a directed group with nodes representing bins and edges representing peeling processes. Therefore, only the peeling that lays upstream in the corresponding path will affect the target bin's bias part. Note there are BC bin sets, and each bin set will be affected by at most BC bin sets. By the fact that the sum of all previous bitstrings must be zero to eliminate the random sign or equal to the original nonzero terms' indices, we have the failure rate Pr (K c ) for the above variance bound to be Since C is a constant and N is exponentially large, the failure rate is exponentially vanishing.

B.3 Theoretic bounds
The abovementioned lemmas help to remove obstacles over the engagement of the fidelity estimator and the error rate transformer. Consequently, we can develop a comprehensive proposition to demonstrate the validity of combining these two subroutines. This is based on the following assumptions.  , where σ is a constant related to the choice of fitting times. Hence, we reduce the proposition to a claim that with bounded-noise fidelity, the transformer can estimate all the s error rates correctly with their support indices. This claim can be verified by a series of lemmas.
We first consider the failure probability and error bounds of the transformer. Since the peeling decoder is the main routine for this transformer, the failure probability can be tracked as follows, P F ≤ Pr Peeling decoder fails E c bin + Pr (E bin |D, H) + Pr (D c ) + Pr (H c ) .
The event E bin denotes that there exist some failing bin detections. On the contrary, E c bin denotes that no bin detection error occurred in the entire execution of Algorithm 9. D and H denote that the maximum number of nonzero terms is at most n 2 and that the peeling routines are all cycle-free, respectively. According to the Proposition 4 in [90], we know the first term is vanishing with s as O 1 s . The third term is exponentially vanishing according to Lemma 8 in [62], and the fourth term is at most O log log log s s Thus, we get the stated failure probability.
We then consider the query number of the fidelity estimator, which represents the number of experiments needed to estimate the Pauli information of a Hamiltonian. As illustrated by Proposition 4 in [90], the peeling decoder succeeds with the probability of at least 1−O 1 s adopting C = O(1) subsampling groups and B = Θ(s) bin sets in every group. Therefore, to prepare these bins, the number of queried eigenvalues is BP C = O(P s). Note in every bin set, so we choose P = P 1 + P 2 = O(n). Therefore, we get the illustrated query number.
As for the error bound, since the event that the procedure estimates the error rates with noise larger than O ϵ 2 t 2 0 √ s is classified as value error, the above analysis of the failure probability demonstrates that all error rates fall into the stated interval in successful execution. Therefore, the absolute values of decomposition parameters can be estimated by the square root of these Pauli error rates, and we get the stated precision.

B.4 With Prior Knowledge
From an experimental view of system's Hamiltonian learning, we will always hold some prior knowledge about the system's underlying dynamics. For example, we can find out some significant interactions of the system that are gained from the experimental details or the designed properties of the device. Formally speaking, given the prior knowledge, we can expect that the unknown Hamiltonian contains the corresponding terms on the Pauli basis with unknown parameters, which we refer to as the structure information. Indeed, it is not always true that this prior information covers all the possible structure, so we cannot fully rely on the prior structure to estimate the Hamiltonian. In this subsection, we would like to discuss how the partial prior knowledge about the structure would boost this Hamiltonian learning method, especially in the Pauli error rate estimation.
During the bin detection, the structure information can serve as a lower bound for the nonzero-term detection. For example, given that there exists one known term in the detected bin U , we shall accept the detection only if it is found to be a single-ton or multi-ton bin. Otherwise, the algorithm can output an "exception", and we shall adaptively adjust the execution parameter like γ 1 in Algorithm 8. And the same happens when we have two or more terms in one bin. As for the peeling step, since this procedure would invoke the bin detector as a subroutine, the prior information would not directly improve this step.
As discussed above, the prior information about the possible structure can help to improve the accuracy of the overall estimation. However, since our method is designed without prior information, the benefits we get from the structure knowledge remains in a heuristic way and can be mainly observed in numerical results.

C Sign Estimation
We have briefly summarized the sign estimation procedure in Sec 3.2, including the motivation and basic ideas. In this section, we will illustrate the protocol to estimate signs of all decomposed parameters. As this sign estimation plays the third step in the whole Hamiltonian estimation procedure, this estimation employs some information gained from previous Pauli estimation.
In the above section, we regard the Hamiltonian channel as a processing matrix form in Eq. (7), or, more specifically, as a Pauli channel. That is due to the fact that our target information is faithfully stored in diagonal terms. On the contrary, we estimate the sign information mainly from off-diagonal terms in this section, so we need to care for the whole channel. With the expansion in Eq.
(3), we consider the SPAM as a local Pauli eigenstate ρ γ and a Pauli operator P β with the measurement outcome as follows, where we want to solve all decomposition parameters s.
In order to solve all the parameters to get the sign information, we construct a family of linear equations to extract all the interesting parameters. Firstly, we want to rule out the effect of the constant term in Eq (80). We use variant t and fit the measurement results on these t. Therefore, we get the first-order version of the Hamiltonian evolution.
where [P α , P β ] is the commutator notation. Moreover, since we know the support of nonzero s ⋆ exactly from the last section, it helps to reduce the number of unknown variables and equally the number of equations needed. Sequentially, we choose multiple state and measurement settings to construct the necessary equations. The collection of SPAM settings is denoted by {(ρ 1 , M 1 ), · · · , (ρ m , M m )}. Collect these first-order measurement results, and this vector is denoted by M. For the coefficient matrix Φ, we define Φ j,α := i Tr ρ γj [P α , M βj ] . As we pick m pairs states and measurements, and Φ only covers s = |s ⋆ | nonzero parameters, the matrix is in the shape of m × s. Therefore, we construct the following equations based on Eq. (81) The challenge arises when the observations M are noisy. This is possible as we have some noisy SPAMs, and the linear regression brings systematic errors due to the higher-order terms. Consider the case that the protocol can only estimate M with the additive noise ω, so the equation becomeŝ Indeed solving the equation directly is not applicable due to the unknown noise. Instead, we adopt the method from compressed sensing [91][92][93] and consider the problem of finding a feasible vector x satisfying the following constraints This problem is also illustrated in the main text as Eq. (21). Notably, this problem has a nonempty feasible set for every ϵ ≥ ∥ω∥ 2 since the ideal value of s ⋆ is a feasible solution. We always choose ϵ = √ m στ t1 + o(t 1 ) , which is the upper bound of ∥ω∥ 2 . We denote this problem as sign optimization.
We will show that the solution x is a well-qualified estimation of the ideal parameters given the measurements and states are randomly chosen from the Pauli group and local eigenstates thereof. In the following, we first give a definition of the restricted isometry property introduced by [91,93]. Then we show how to prove this property for the randomly sampled process matrices in our specific case and how to bound the noise effects.
where C is a constant related to the choice of the sampling set of SPAM.
Proof. We first figure out the size of noise ω that comes from SPAM errors and linear regressions in (83). According to Assumption 1, every measurement carries noise at most τ . Therefore, we have Tr H t (ρ γ )P β = Tr(ρ γ P β ) + it α∈P n s α Tr(ρ γ P α P β − P α ρ γ P β ) + o(t 2 ) + τ, which satisfies for all t. Considering the ordinary least square, we have Tr H (1) where K denotes the number of evolving times used to extract the first-order term andt is the averaged time length. As we choose the t from t 1 up to Kt 1 arithmetically, the foregoing bound is derived to στ t1 + o(t 1 ) for some regression constant σ, which is very similar to Eq. (40). Therefore, the norm of the total noise term ω can be bounded by O 1 ∆ where ∆ refers to the smallest fidelity residual of the detected channel. In the second stage, the protocol needs to prepare m equations, which are equal to m measurements. Therefore, the complexity M Q satisfies The classical complexity is also combined from the foregoing two stages. In the first stage, the post-processing is mainly contributed by the peeling algorithm, of which the complexity is stated as O(sn 2 ) in Theorem 1 of [62]. Also, to extract the desired fidelity terms, the procedure needs O(sn) fittings. In the second stage, the optimization problem illustrated in Eq. (21) is of size m × s. Since we have confirmed the existence of the solution, the complexity is poly(s). Therefore, we have M C = O(sn 2 ) + O(sn) + poly(s) = poly(n, s). (95)

E Implementation and numerical results
This section serves as the further analysis and numerical detection of some concerns that are raised in the main text.

E.1 Threshold Behaviors
In this section, we perform the exhaustive numerical simulation to show how we choose the bin parameter b in each case of the simulations introduced in Section 5. Even though the procedure runs with an inputted parameter, b, this parameter is supposed to rely on the structure of the target Hamiltonian to guarantee the successful execution. Hence, we have to employ a trial process to determine the proper b for each Hamiltonian. More clearly, we increase the parameter b and run the procedure on the target Hamiltonian to find the proper choice of b 0 such that the estimation with b ≥ b 0 is steadily close to the ideal values. In Figure 5, we exhibit the simulation results of running the procedure exhaustively to detect the behaviors of different b on different systems.
In Figure 5, we display the trial results for the random TFIM Hamiltonian with different system sizes. In the case of n = 1, all these b perform well due to the fact that there is only one nonzero parameter in the Hamiltonian. In the case of the n = 2 system, these results are more typical, which is very similar to Figure 3. According to the box plot, the procedure performs much better when b ≥ 2, and the performance keeps steady after b grows up to 4. Therefore, the procedure will choose b = 4 as a proper parameter to execute on the 2-qubit system. In all other cases, this process works very similarly. Based on the understanding of the requirement of the parameter b and the observations on Figure 3 and Figure 5, parameters for random TFIM Hamiltonian learning are just the same as those used in Section 5.
In addition to the random TFIM systems, we also check the threshold behaviors for molecular systems. The numerical results of the trial processes are shown in Figure 5(b). A key property of molecular Hamiltonian is that they contain much more nonzero terms than the Ising models. Hence, the threshold behaviors are generally asking for larger b. In the first plot, the reconstruction errors of H 2 molecules decay rapidly from b = 2 to b = 5. After the procedure increases b up to 6, the reconstruction becomes steadily close to the ideal decomposition parameters. Based on these results, the procedure regards b = 6 as a proper choice of the bin parameter, and we exhibit the corresponding distribution of b = 6 estimations in Figure 4. Similarly, we run the trial simulation for H 3 and LiH molecules, respectively. By witnessing the reconstruction with steady errors, the procedure can then determine the proper parameter in each case.

E.2 Higher-Order Fitting
During the simulation for the numerical results, we found there exist some lower bounds of the errors after we detect the distribution of our reconstructions for systems with comparable large sizes. This means there are some systematic errors in some processes of the procedure. In this section, we will show these errors are mainly contributed by the higher-order terms during the fitting of the second-order fidelity.
Firstly, we need to consider the definite expression of the fidelity terms of a Hamiltonian channel. According For m = 2, it is the ordinary fitting used for the vast estimation exhibited in the main text. For m = 4, we consider the effects of fourth-order terms besides the second-order terms in the fitting process. This modification asks for a little more time steps for regression while bringing much better estimation. As for m = 6, we consider the further sixth-order terms, and this modification requires even more time steps to implement six-order regression, which increases the numbers of oracle used.
the highest terms that we consider in the fitting. As can be observed, the estimation is better when considering the fourth-order terms than the ordinary fitting with a mild sacrifice of the measurement complexity due to more time step needed for regression. This proves the previous idea about systematic errors. On the other hand, when the procedure keeps fetching higher-order terms, both the errors and complexity grow due to the effects of the circuit noise. Even though the systematic errors are suppressed by considering higher-order terms, the circuit noise would be enlarged by the higher-order regression. Going from m = 4 to m = 6, the circuit noise is now the dominant contributor of the overall errors. Therefore, we propose an effective solution to significantly alleviate the systematic errors in the fitting process, while we still need to suppress the circuit noise by further increase the shot numbers in order to balance the systematic errors and circuit noise of fidelity.