Learning ground states of gapped quantum Hamiltonians with Kernel Methods

Neural network approaches to approximate the ground state of quantum Hamiltonians require the numerical solution of a highly nonlinear optimization problem. We introduce a statistical learning approach that makes the optimization trivial by using kernel methods. Our scheme is an approximate realization of the power method, where supervised learning is used to learn the next step of the power iteration. We show that the ground-state properties of arbitrary gapped quantum Hamiltonians can be reached with polynomial resources under the assumption that the supervised learning is efficient. Using kernel ridge regression, we provide numerical evidence that the learning assumption is verified by applying our scheme to find the ground states of several prototypical interacting many-body quantum systems, both in one and two dimensions, showing the flexibility of our approach.


Introduction
The exact simulation of quantum many-body systems on a classical computer requires computational resources that grow exponentially with the number of degrees of freedom. However, to address scientifically relevant problems such as strongly correlated materials [1,2] or quantum chemistry [3,4], it is necessary to study large systems. Over the years, a variety of numerical methods have been proposed that exploit the specific structure of the system at hand and resort to approximation schemes to lower the computational cost. For example, tensor networks [5] can efficiently encode one-dimensional systems, but they face challenges in higher dimensions [6]. Quantum Monte Carlo methods [7,8] can give accurate results for stoquastic Hamiltonians [9], but are in general plagued by the so-called sign problem [10,11]. Finally, traditional variational methods [12,13,14,8] require that the ground or time-evolving state be well approximated by a physically inspired parameterized function [15,16,17].
Recently, data-driven approaches for compressing the wave function based on neural networks [18,19] and kernel methods [20,21,22] have been proposed. However, data-driven approaches require knowledge of the exact wave function, either from experiments or from numerically exact calculations. In contrast, the variational principle for ground-state calculations provides a principle-driven approach to the wave-function optimization problem, and it has led to the proposal of Neural(-network) Quantum States (NQS) [23] and Gaussian process states [24,25]. The NQS approach produced state-of-the-art ground-state results on a variety of systems such as the J_1-J_2 spin model [26,27,28], atomic nuclei [29], and molecules [30]. While neural networks are universal function approximators and can, in principle, represent any wave function, in practice the variational energy optimization of NQS is a non-trivial task [31,32]. Refs. [33,34,35] proposed schemes which solve a series of simpler supervised-learning tasks instead. When used to solve the ground-state problem with a first-order approximation of imaginary-time evolution with a large time step, equivalent to the power method, we refer to this approach as the Self-Learning Power Method (SLPM). This supervised approach does not immediately solve the optimization hardness of NQS, as there is still a non-trivial optimization problem to solve at every step of the procedure.
Kernel methods are a popular class of machine learning methods for supervised learning tasks [36,37]. They map the input data to a high-dimensional space by a non-linear transformation, with the goal of making the input data correlations approximately linear. The similarity between the input data is encoded by the kernel, whose choice is problem-dependent. When compared to neural-network approaches, kernel methods have the crucial advantage that the solution of certain optimization problems can be obtained by solving a linear system of equations.
In this article we combine the SLPM with kernel methods, rendering the optimization problem at each step of the power method straightforward. We prove the convergence of SLPM methods to the ground state of gapped quantum Hamiltonians under a learning-efficiency assumption by generalizing previous results of Ref. [38]. Considering the SLPM with kernel ridge regression, we numerically verify the learning-efficiency assumption for small quantum systems. For larger systems, we estimate the ground-state energy directly and find a favorable system-size scaling.
The article is organized as follows: in Section 2, we recall the power method and the basics of supervised learning, setting the notation we use throughout the text. In Section 3, we introduce the SLPM, with Section 3.2 containing an in-depth theoretical analysis of its convergence properties, which is the first major result of our work. Then, after briefly recalling Kernel Ridge Regression in Section 4.1, we discuss our particular choice of kernel and the numerical implementation of the SLPM in Sections 4.2 and 4.3. Finally, in Section 5, we provide comprehensive numerical results obtained on the transverse-field Ising (TFI) and antiferromagnetic Heisenberg (AFH) models in one and two dimensions, concluding with a discussion in Section 6.

Preliminaries
First, we briefly introduce our notation and recap some well-known concepts regarding the Power Method (PM) in Section 2.1. We then briefly overview supervised learning in Section 2.2.
Let Ĥ be the Hamiltonian of a quantum system. We denote its normalized eigenstates by |Υ_k⟩, and we order them with respect to their corresponding eigenvalues E_k, such that E_0 ≤ E_1 ≤ ··· ≤ E_max. We wish to determine the ground state |Υ_0⟩ and its energy E_0. The gap of Ĥ is defined as δ = E_1 − E_0, and we say that the Hamiltonian is gapped if δ > 0. We remark that the method remains efficient for Hamiltonians that are gapless in the thermodynamic limit, as long as the gap closes polynomially with the inverse system size.

Power Method
The PM is a procedure to find the dominant eigenvector of a matrix. Following the notation of Ref. [8], we consider a gapped Hamiltonian Ĥ and a constant Λ ∈ R. The PM relies on the repeated application of the shifted Hamiltonian Λ − Ĥ to a trial state |Φ^(0)⟩. The state obtained at the (n+1)-th step is, therefore,

|Φ^(n+1)⟩ ∝ (Λ − Ĥ)|Φ^(n)⟩.    (1)

To make the ground state the dominant eigenvector, one must take Λ > (E_0 + E_max)/2. Starting from a |Φ^(0)⟩ with non-zero overlap with the true ground state |Υ_0⟩, we have that lim_{n→∞} |Φ^(n)⟩ ∝ |Υ_0⟩, as the infidelity with the ground state decreases exponentially, as ((Λ − E_1)/(Λ − E_0))^n when Λ ≥ (E_1 + E_max)/2 and as ((E_max − Λ)/(Λ − E_0))^n otherwise. We note that, in the case of a degenerate ground state, the PM converges to a linear combination of the degenerate eigenstates, dependent on their overlap with the initial state |Φ^(0)⟩.
The PM is widely adopted in exact diagonalization studies, where one works with vectors storing the wave-function amplitudes ⟨x|Φ^(n)⟩ for all the basis states x. This approach requires exponential resources to store the vector encoding the wave-function amplitudes in a chosen basis.
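As an illustration of this exponential cost, the exact PM can be sketched in a few lines of NumPy; the function and variable names here are ours, and the dense matrix makes the memory requirement explicit:

```python
import numpy as np

def power_method(H, Lam, n_steps=200, seed=0):
    """Exact power method: repeatedly apply (Lam*I - H) to a trial vector.

    H is the full Hamiltonian matrix (exponentially large in a many-body
    problem) and Lam is a shift chosen so that the ground state becomes
    the dominant eigenvector of (Lam*I - H).
    """
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=H.shape[0])      # trial state |Phi^(0)>
    A = Lam * np.eye(H.shape[0]) - H       # shifted Hamiltonian
    for _ in range(n_steps):
        phi = A @ phi
        phi /= np.linalg.norm(phi)         # normalize to avoid overflow
    energy = phi @ (H @ phi)               # energy estimate of the iterate
    return phi, energy

# Toy 2x2 "Hamiltonian" with eigenvalues E0 = -1 and E1 = +1
H = np.array([[0.0, 1.0], [1.0, 0.0]])
phi, e0 = power_method(H, Lam=2.0)
```

For a spin system with N sites the matrix dimension is 2^N, which is precisely the cost that the approximate scheme of the next section avoids.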

Supervised Learning
Suppose we are given a set of observations D = {(x_i, y_i)}_{i=1}^{N_S} of an unknown function f_0 : X → R, where X ⊆ R^n is the input space, x_i ∈ X are samples, and y_i = f_0(x_i) ∈ R are the corresponding function values, also known as labels. The task of supervised learning is to find the optimal function f⋆ in some suitable space of Ansatz functions H which best describes the observations. This is done by minimizing a so-called loss function L, which quantifies the distance between the predictions of each Ansatz function and the observations. The corresponding optimization problem is given by

f⋆ = argmin_{f∈H} L(f, D).    (2)

Notably, supervised learning can be done with artificial neural networks, a specific instance of highly expressive parameterized maps that can typically approximate complex unknown functions with high accuracy. After fixing the architecture, the optimization problem of Eq. (2) is solved by finding the optimal parameters, for example by using gradient-based optimization methods. For a more complete overview of supervised learning, we refer the reader to one of the standard textbooks in the literature, such as Ref. [39].
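As a minimal concrete instance of this setup — with an illustrative linear Ansatz space of our choosing, not one used in this work — the minimization of the squared loss reduces to ordinary least squares:

```python
import numpy as np

# Observations from the unknown function f0(x) = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0

# Ansatz space H = {f(x) = a*x + b}; minimizing the squared loss over (a, b)
# is a linear least-squares problem.
A = np.stack([X, np.ones_like(X)], axis=1)   # design matrix for (a, b)
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
```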

Self-Learning Power Method
In this section, we introduce an approximate version of the PM that has polynomial complexity (Section 3.1) and provide a quantitative theoretical discussion of its convergence properties (Section 3.2). We call this approach the Self-Learning Power Method (SLPM), which is sketched in Fig. 1. The SLPM encodes the wave function with an approximate representation Ψ^(n) ≈ Φ^(n), taken from a space H of functions with polynomial memory and query complexity in the computational basis, to bypass the exponential computational cost of the exact PM discussed in the previous section. In the following, we show that the state at step n + 1 can be computed by solving an optimization problem given the state at step n.

Figure 1: Sketch of the SLPM. The current state Ψ^(n) is propagated and sampled to form a data set, which is then learned with supervised learning, obtaining a new state Ψ^(n+1) approximating the propagated state, which can again be propagated and sampled from. The procedure is repeated until convergence to a state close to the true ground state. In this paper the learning is done with kernel ridge regression (see Section 4).

Algorithm
Given Ψ^(n), the state Ψ^(n+1) ∈ H is the solution of the optimization problem

Ψ^(n+1) = argmin_{Ψ∈H} L(Ψ, (Λ − Ĥ)Ψ^(n)),    (3)

for any similarity metric L. In this article, we treat this optimization problem in the framework of supervised learning, which we introduced in Section 2.2, replacing the "target" state (Λ − Ĥ)Ψ^(n) with a data set D^(n+1) = {(x_i, y_i)}, where

x_i ∼ Π,   y_i = ⟨x_i|Λ − Ĥ|Ψ^(n)⟩.

Here ∼ indicates that the x_i are sampled from the target distribution Π, which we do with Markov-chain Monte Carlo methods (see Appendix B for a discussion). We remark that most physical Hamiltonians are sparse, and therefore the elements ⟨x|Λ − Ĥ|Ψ^(n)⟩ can be queried efficiently if Ψ^(n) can be queried efficiently in the computational basis³. In practice, when considering spin systems on a lattice with N sites, the wave function ψ(x) takes as inputs the bit-strings x ∈ {−1, 1}^N encoding basis states |x⟩. The data set contains a polynomially large set of bit-strings x, sampled from the Born probability distribution |ψ(x)|², and associated with their corresponding amplitude ψ(x).
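To make the data-set construction concrete, the following hypothetical sketch assembles D^(n+1) for a 1D TFI chain (taking H = −Σ σ^z_i σ^z_{i+1} − h Σ σ^x_i, the model benchmarked in Section 5); `psi` stands for any queryable amplitude function, and all names are ours:

```python
import numpy as np

def local_label(x, psi, h, Lam):
    """Label y = <x|(Lam - H)|Psi> for the periodic 1D TFI chain,
    H = -sum_i sz_i sz_{i+1} - h sum_i sx_i, visiting only the
    polynomially many configurations connected to x."""
    diag = -np.sum(x * np.roll(x, -1))     # diagonal part <x|H|x>
    y = (Lam - diag) * psi(x)
    for i in range(len(x)):                # sx_i flips spin i
        xp = x.copy(); xp[i] *= -1
        y += h * psi(xp)                   # <x|H|x'> = -h enters (Lam - H) as +h
    return y

def build_dataset(samples, psi, h, Lam):
    """D^(n+1): sampled configurations paired with propagated amplitudes."""
    return [(x, local_label(x, psi, h, Lam)) for x in samples]

# Toy check on 2 spins with the uniform state Psi(x) = 1:
# the periodic 2-site chain counts the zz bond twice, so <x|H|x> = -2*x0*x1.
psi = lambda x: 1.0
x = np.array([1, 1])
y = local_label(x, psi, h=0.5, Lam=1.0)
```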
We remark that if H spans the whole Hilbert space and the data set contains all (exponentially many) bit-strings, the solution to the optimization problem of Eq. (3) would match the PM exactly. By truncating the data-set size and considering only a subset of all possible wave functions, the solution is only approximate, and therefore there is a finite difference between an exact step of the PM and the approximate procedure. We quantify this difference with the step infidelity, which we define as the infidelity between the learned state Ψ^(n+1) and the exactly propagated state (Λ − Ĥ)Ψ^(n),

I^(n+1) = 1 − |⟨Φ^(n+1)|Ψ^(n+1)⟩|² / (⟨Φ^(n+1)|Φ^(n+1)⟩ ⟨Ψ^(n+1)|Ψ^(n+1)⟩),   |Φ^(n+1)⟩ = (Λ − Ĥ)|Ψ^(n)⟩.

³ More in detail, one has to calculate ⟨x|Λ − Ĥ|Ψ^(n)⟩ = Σ_{x′} ⟨x|Λ − Ĥ|x′⟩ Ψ^(n)(x′). Physical Hamiltonians, in general, have a polynomial number of nonzero terms ⟨x|Ĥ|x′⟩ ≠ 0; therefore the sum over x′ runs over a polynomial number of elements and this query is efficient.

Discussion of convergence properties
The SLPM approximates the propagation of the state, which is instead exact when using the standard PM. It is well known (see Section 2.1) that the PM converges exponentially fast to the dominant eigenstate. In this subsection, we discuss how the noise introduced by a non-zero step infidelity affects the convergence. To derive quantitative bounds, we prove that if this noise is small enough, it does not significantly hinder the convergence to the ground state, as the relative error of the energy remains bounded (as we will see in Eq. (13)).
The discussion is based on Ref. [38], but has been adapted to the language of computational physics.

Power method with noise
We consider the general case where, at every step of the PM, a small noise term |∆⟩ is added to the state.
In the setting we are interested in, this noise arises from the self-learning procedure, but we wish to keep the theoretical treatment general, and accordingly we make no assumptions on the origin of the noise. Formally, we define the power method with noise as follows.

Definition 2 (Noisy Power Method). Take an initial state |Ψ^(0)⟩.
Step (n + 1) of the noisy power method is defined recursively as

|Ψ^(n+1)⟩ = γ^(n) [ (Λ − Ĥ)|Ψ^(n)⟩ + |∆^(n)⟩ ],

where ∆^(n) is an additive noise term, which, without loss of generality, is taken such that ⟨∆^(n)|Λ − Ĥ|Ψ^(n)⟩ = 0. The factor γ^(n) ∈ C captures both a potential drift in the global phase as well as the normalization.
If I^(n) = 0, the noise must be zero as well, while in the general case the step infidelity bounds the amplitude of the noise term according to Eq. (8).

Theorem 1 (Convergence of the noisy power method). Let |Υ_0⟩ represent the ground state of the Hamiltonian Ĥ. Take Λ ≥ (E_1 + E_max)/2 and assume that the initial state |Ψ^(0)⟩ and the noise |∆^(n)⟩ respect the conditions of Eqs. (9) and (10) at every step n of the noisy power method, for some ε < 1/2. Then there exists a minimum number of steps M after which the infidelity with the ground state is at most ε². The proof, which generalizes Ref. [38], is given in Appendix A.

Eq. (9) requires that, if the initial state Ψ^(0) has an exponentially small overlap with the ground state (as in random initialization [38]), the noise parallel to the ground-state wave function must also be exponentially small. Instead, the assumption of Eq. (10) requires that the noise amplitude be smaller than ε. As the final infidelity is bounded by ε², we want to choose the smallest ε possible. For a given step infidelity I^(n), the smallest ε we can guarantee using Eq. (8) is ε⋆, given in Eq. (11). Requiring ε⋆ < 1/2, it is possible to show that the step infidelity must be sufficiently small and satisfy the corresponding bound. When those requirements are satisfied, Theorem 1 states that, in a number of steps M logarithmic in both the initial overlap ⟨Υ_0|Ψ^(0)⟩ and the final infidelity ε², we reach a state with infidelity at most ε². This state has an accuracy on the ground-state energy given by the relative error of Eq. (13). For simplicity, the theorem assumes that the noise bounds are constant throughout the run-time of the noisy power method. While this might not be the case in practice, it is easy to generalize the result to a varying ε. Doing so, one finds that the first assumption (Eq. (9)) is necessary to start the method, while asymptotically the bound is given only by the latter assumption (this is discussed in more detail in Appendix C).

Self-Learning Power Method
While the discussion of the noisy power method's convergence so far is general, we now contextualize it to the case of the SLPM. To do so, we assume that the learning is efficient, precisely defined as follows.
Definition 3 (Efficient supervised learning). We say that the supervised learning is efficient if its step infidelity is of the order of N_S^{−α} for some α > 0, where N_S is the size of the data set.
As a consequence of Theorem 1, summing up the discussion of the convergence properties, we present the following corollary for the convergence of the SLPM.

Corollary 1 (Convergence of the self-learning power method). Let Ĥ be a gapped Hamiltonian, take Λ ≥ (E_1 + E_max)/2, and assume that the supervised learning is efficient, meaning that its step infidelity scales as I^(n) ∼ N_S^{−α}. Then the final infidelity I is bounded by a quantity that vanishes polynomially with the data-set size N_S, and the error on the ground-state energy of Ĥ is of the same order.

Therefore, assuming the supervised learning is efficient, it is possible to use a polynomially large data set to compute the ground-state energy of a gapped Hamiltonian with a polynomial cost. The same holds for gapless Hamiltonians, as long as the gap closes polynomially with the inverse system size, as we show in Appendix A.

The Self-learning Power Method with Kernel Ridge Regression
There are several practical ways to implement the SLPM defined in Section 3, by solving the supervised learning problem of learning the next state (Eq. (3)) with a suitable approach. In this section, we specialize our discussion to realizing the SLPM with a kernel method called Kernel Ridge Regression.

Kernel Ridge Regression
Given that we are discussing a kernel method, we start with a brief formal definition of the kernel. A positive-definite kernel k (named Mercer kernel after the author of Ref. [40]) is a function k : X × X → R, where X ⊆ R^n, with the following properties: it is symmetric, k(x, y) = k(y, x), and positive semi-definite, meaning that Σ_{i,j} c_i c_j k(x_i, x_j) ≥ 0 for any finite set of points x_i ∈ X and coefficients c_i ∈ R. It can be shown that every kernel uniquely defines a function space [41], the so-called Reproducing Kernel Hilbert Space (RKHS),

H_k = span{ k(x, ·) : x ∈ X },    (16)

completed with respect to the norm induced by k. The RKHS can be used as the space of Ansatz functions for the supervised learning problem of Eq. (2). When using the regularized least-squares loss (ridge loss), this approach is called Kernel Ridge Regression (KRR; see e.g. Ref. [36] for more details). Here the loss is

L(f, D) = Σ_i (f(x_i) − y_i)² + λ ∥f∥²_{H_k}.    (17)

It can be shown that, in this setting, the supervised learning problem has an analytical solution of the form [42,43]

f⋆(x) = Σ_i w_i k(x_i, x),    (18)

where the sum only goes over the finitely many training samples and the weights w_i are uniquely determined by solving the linear system of equations

Σ_j [ k(x_i, x_j) + λ δ_ij ] w_j = y_i.    (19)

In the case where the kernel matrix k(x_i, x_j) is singular, there is an infinite number of weight vectors w that interpolate the data; the presence of the infinitesimal regularization term λ in Eq. (19) ensures that we choose the solution with the minimal norm ∥f∥_{H_k}.
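A minimal NumPy sketch of kernel ridge regression — solving the linear system of Eq. (19) and evaluating the resulting predictor — using a Gaussian (RBF) kernel as an illustrative stand-in for the kernel introduced later; the function names are ours:

```python
import numpy as np

def krr_fit(X, y, kernel, lam=1e-10):
    """Solve (K + lam*I) w = y for the KRR weights."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(x, X, w, kernel):
    """Evaluate the predictor f*(x) = sum_i w_i k(x_i, x)."""
    return sum(wi * kernel(xi, x) for wi, xi in zip(w, X))

# Illustrative Gaussian kernel and a tiny training set
rbf = lambda a, b: np.exp(-0.5 * np.sum((np.asarray(a) - np.asarray(b))**2))
X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
y = np.array([0.0, 1.0, 4.0])
w = krr_fit(X, y, rbf)
```

With a tiny regularization λ the predictor interpolates the training labels; larger λ trades training accuracy for a smaller RKHS norm.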

Implementation
The most straightforward approach would be to learn the amplitudes Ψ^(n+1) with functions from the reproducing kernel Hilbert space H_k (Eq. (16)) using kernel ridge regression. However, as is commonly done with NQS, and as in a previous work using a different kernel method (Ref. [20]), we use the method to learn the log-amplitudes log Ψ^(n+1)(x) instead. The loss function remains the same, but the data set D^(n+1) = {(x_i, y_i)} is changed to include log-amplitudes as labels,

y_i = log ⟨x_i|Λ − Ĥ|Ψ^(n)⟩,    (20)

and we have to take the exponential to make predictions,

Ψ^(n+1)(x) = exp( Σ_i w_i k(x_i, x) ),

where the weights w_i are found through Eq. (19) for the modified data set of Eq. (20). We remark that this prediction is different from what one would obtain with the relevance-vector regression [44,45] used in Refs. [20,21].
While both methods aim to minimize the mean-squared error on the training data set (first term in Eq. (17)), they favour different solutions. KRR finds predictions with a small RKHS norm ∥f∥_{H_k} due to the regularization term (favouring smooth log Ψ). Relevance-vector regression instead favours solutions which have small weights w, due to a zero-mean Gaussian prior on them.
We make one final approximation for the simulations in this article to reduce the computational cost. We assume that the distribution of the previous state Ψ^(n) is sufficiently close to the distribution of the propagated state (Λ − Ĥ)Ψ^(n), and sample from |Ψ^(n)(x)|² instead, reducing the number of evaluations of Ψ^(n) required.
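A sketch of this log-amplitude variant, assuming positive amplitudes (as in the stoquastic case); the kernel and all names are illustrative:

```python
import numpy as np

def fit_log_amplitudes(X, amplitudes, kernel, lam=1e-10):
    """KRR on log-amplitudes: weights solve (K + lam*I) w = log(psi).
    Assumes strictly positive amplitudes."""
    y = np.log(np.asarray(amplitudes))
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_amplitude(x, X, w, kernel):
    """Psi(x) = exp( sum_i w_i k(x_i, x) )."""
    return np.exp(sum(wi * kernel(xi, x) for wi, xi in zip(w, X)))

rbf = lambda a, b: np.exp(-0.5 * (a - b)**2)   # illustrative kernel
X = [0.0, 1.0, 2.0]
w = fit_log_amplitudes(X, [1.0, 2.0, 4.0], rbf)
```

Learning in log space keeps the fit well-behaved even when amplitudes span many orders of magnitude, at the price of restricting the sketch to a fixed-sign wave function.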

Kernel choice and symmetries
The properties of the kernel k(·, ·) are fundamental, as they are reflected in the encoded wave function. For example, discrete symmetries can be explicitly enforced by constructing a kernel that averages the output over all possible input permutations.

Figure 2: TFI chain, 20 spins. Step infidelity I^(n) (defined in Eq. (12)) and relative error ϵ_rel of the predicted energy of the final state (defined in Eq. (13)) after convergence of the self-learning power method, as a function of the number of samples N_S in the data set. Statistical error bars are smaller than the markers and have been omitted from the plot.
In this article, we consider a symmetric kernel built from vectors x, y which encode the basis states, with σ(z) = z arcsin(γz) as our choice of non-linear function. We remark that by taking γ ≈ 0.5808, this kernel corresponds to a symmetrized Restricted Boltzmann Machine (RBM) in the infinite hidden-neuron density limit through neural-tangent-kernel theory [47]. Details of this connection are explained in Appendix D but are not needed for the discussion.
To contain the computational cost when simulating lattice systems, we consider the group of all possible translations rather than the full space group of the lattice. Additionally, spin-inversion (Z_2) symmetry can be enforced by choosing an even non-linear function σ.
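The translation-averaging construction can be sketched as follows; the base kernel here is a simple illustrative choice depending on the overlap x·y/N, not the arcsin-based kernel of this work, and the names are ours:

```python
import numpy as np

def translations(x):
    """All cyclic translations of a spin configuration on a chain."""
    return [np.roll(x, t) for t in range(len(x))]

def symmetric_kernel(x, y, base_kernel):
    """Average a base kernel over the translation group, making
    k invariant under translating either input."""
    return np.mean([base_kernel(gx, y) for gx in translations(x)])

# Illustrative base kernel built from the normalized overlap x.y / N
base = lambda a, b: (np.dot(a, b) / len(a))**2

x = np.array([1, 1, -1, -1])
y = np.array([1, -1, -1, 1])
k1 = symmetric_kernel(x, y, base)
k2 = symmetric_kernel(np.roll(x, 1), y, base)   # translated x, same value
```

Because the average runs over the whole group, translating x permutes the terms of the mean without changing it, and (since the base kernel depends only on relative shifts) the same holds for translations of y.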

Numerical experiments
To numerically investigate the viability of the SLPM we benchmark it on the transverse-field Ising model (TFI) with periodic boundary conditions in one and two dimensions, and on the antiferromagnetic Heisenberg model (AFH) on the square lattice.

TFI model in one dimension at fixed system size
The Hamiltonian of the TFI model is

Ĥ = −Σ_{⟨i,j⟩} σ̂^z_i σ̂^z_j − h Σ_i σ̂^x_i,

where σ̂^{x,y,z}_i are Pauli matrices acting on site i, ⟨i, j⟩ runs over all nearest-neighbor pairs, and h is the strength of the external field in the transverse direction.
We start by considering a one-dimensional chain of 20 spins with transverse field h ∈ {0.5, 1, 2}. The initial state is always taken to be Ψ^(0) ∝ Σ_x |x⟩, the uniform superposition of all computational basis states, and we fix Λ = 1.
In the left panel of Fig. 2, we plot the relative error of the energy with respect to the ground-state value as a function of the number of iterations for h = 1. We compare the SLPM for several data-set sizes N_S against the exact version, observing a crossover from an initial regime, in which the effect of the noise is negligible and the SLPM closely matches the exact method, to a regime in which the noise dominates, the bound given by Theorem 1 prevents further improvement, and a steady state is reached. As expected, the number of steps at which we observe the crossover depends on the number of samples.

Numerical verification of the efficientlearning assumption
In the right panel of Fig. 2, we numerically verify the assumption of efficient learning by showing that the step infidelity at n = 300 is compatible with a power law I^(n) ∝ N_S^{−α}, where the exponent α depends on the parameters of the Hamiltonian. In the same figure, we also report that the best relative error follows a similar power law with the same exponent. Interestingly, we see that the scaling exponents α of the step infidelity and of the relative error are degraded for values of h below the critical point (h = 1 in this case). This supports the validity of Corollary 1 and therefore gives further grounding to the theoretical analysis we carried out in Section 3.2. In Fig. 7 of the appendix we show the same for the TFI in two dimensions and for the AFH in one and two dimensions.

Figure 3: Reference energies are computed with exact diagonalization for small systems [48,49] and with Quantum Monte Carlo for larger systems (loop algorithm from the ALPS library [50,51]); details are provided in Appendix C.

TFI model in one and two dimensions and scaling as function of system size
Continuing, we investigate the scaling of the accuracy of the SLPM with increasing system size. In Fig. 3 we plot the relative error of the ground-state energy for 1D (left panel) and 2D (right panel) periodic lattices. The data-set size is kept fixed at N_S = 4096 for all system sizes. In both cases, for values of the transverse field h above the critical point (h = 1 for 1D chains and h ≈ 3.044 for 2D square lattices in the thermodynamic limit [52,53,54,55,56]), we observe a behavior consistent with a power-law dependence of the relative error on the system size. As the Hilbert-space size increases exponentially, this means that a tiny fraction of the Hilbert space is sufficient to compute the ground-state energy accurately in a few hundred steps. Evidently, at the critical point the gap of the Hamiltonian becomes smaller, and therefore we need to perform more iterations (M ≈ 2000) to converge.
As in the previous simulations, the scaling with the system size degrades for values of the transverse field below the critical point. This is linked to a less efficient supervised learning of the state (in terms of the number of samples) at every step of the SLPM, and is probably related to poor generalization properties of the kernel in this regime. In principle, we expect that by choosing a different kernel function it should be possible to improve the learning efficiency and, therefore, the algorithm's overall performance.

AFH model in one and two dimensions
In addition to the TFI, we also benchmark the SLPM on the antiferromagnetic Heisenberg model, whose Hamiltonian is given by

Ĥ = Σ_{⟨i,j⟩} ( σ̂^x_i σ̂^x_j + σ̂^y_i σ̂^y_j + σ̂^z_i σ̂^z_j ),

where we assume periodic boundary conditions. The AFH Hamiltonian is gapless in the thermodynamic limit. However, the gap is non-vanishing on finite lattices, and the SLPM can be applied. The ground state has a well-known sign structure, which can be accounted for by rotating the Hamiltonian according to the Marshall sign rule [57]; the SLPM is then used to learn the amplitudes. To simplify the problem, the ground-state search is constrained to the symmetry sector with zero magnetization by introducing a proper constraint in the sampling step used to generate the data set. For the simulations of the AFH we fix Λ = 0. In Fig. 4, we show the dependence of the final relative error of the energy on the number of samples in the data set, in the left panel for a one-dimensional chain and in the right panel for two-dimensional square lattices. Both result in a power-law-like scaling, with an exponent lower than that of the TFI, meaning that the supervised learning is less efficient and requires more samples to reach a comparable accuracy.

Discussion
In this article, we have presented a kernel-method realization of the SLPM that can be used to find the ground state of gapped quantum Hamiltonians by solving a series of quadratic optimization problems.
We have shown that if the supervised learning requires a polynomial number of samples at each step of the power method, a number of steps logarithmic in both the initial overlap and the final infidelity of the SLPM is sufficient to reach a ground-state infidelity that scales polynomially with the number of samples. In our numerical experiments, we have considered a relatively simple kernel that is reasonably cheap to evaluate and enforces physical symmetries on the ground-state wave function. For the TFI and AFH models in one and two dimensions we have numerically verified that the efficient-learning assumption is valid for kernel ridge regression using this kernel, at least for the small system sizes for which the exact infidelity computation is tractable. For larger system sizes, a direct computation shows a favorable scaling of the energy relative error in terms of the number of samples and as a function of the system size. Our kernel ridge regression approach is ultimately limited by the number of samples, as the computational resources needed to compute and store the kernel matrix scale quadratically with the data-set size, while the solution of the linear system of equations scales cubically. Therefore, in practice the number of samples is at most of the order of 10^5 due to hardware limitations. Algorithmic improvements, such as re-using information from the matrix decomposition of the previous step when most of the samples remain the same, or using iterative solvers enabling parallelization, could ease some of these limitations. Nevertheless, we believe that the primary focus for improvements has to be laid on increasing the efficiency of the supervised learning. This can be done either by developing kernels with superior generalization properties for the problems at hand, with kernel methods which do not require computing the full kernel matrix, like those used in Refs. [20,21], or by using methods based on neural networks.
Possible extensions of this work include the application of the SLPM to non-stoquastic Hamiltonians. This can be achieved by learning the sign structure or the phase of the wave function in addition to the absolute value of the amplitude. It might be worth exploring a more general Ansatz using pseudo-kernels [58], where the absolute value and phase of the wave-function amplitude are learned simultaneously.
The code used to run the simulations in this article can be found in Ref. [59].

Appendix
We provide a proof of Theorem 1 and touch on the scaling in the case of gapless Hamiltonians in Appendix A, discuss the details of the numerical implementation of the sampling and kernel ridge regression in Appendix B, and give additional numerical results in Appendix C. Finally, in Appendix D we highlight the connection between the kernel used in this article and a symmetrized Restricted Boltzmann Machine (RBM) in the infinite hidden-neuron density limit.

A Proof of Theorem 1 (Convergence of the noisy power method)
We provide a proof of Theorem 1 presented in the main text, conceptually following the steps of the proof in the supplementary material of Ref. [38].
The main idea of the proof is to show that, if the noise is small enough, at every step the distance between the current state and the ground state decreases with respect to the previous step, where the Fubini-Study metric,

θ(Ψ, Φ) = arccos √( |⟨Ψ|Φ⟩|² / (⟨Ψ|Ψ⟩⟨Φ|Φ⟩) ),

is used as the measure of distance. We remark that the Fubini-Study metric is a monotonic function of the infidelity. To simplify notation, we define the following short-hand for the overlaps of the state and of the noise at step n with the eigenstates of the Hamiltonian:

Ψ_k^(n) = ⟨Υ_k|Ψ^(n)⟩,   ∆_k^(n) = ⟨Υ_k|∆^(n)⟩.

We assume that Λ > (E_1 + E_max)/2, and therefore the two largest (also in absolute terms) eigenvalues of Λ − Ĥ are given by σ_0 = Λ − E_0 and σ_1 = Λ − E_1. Assuming normalized eigenstates ∥Υ_k∥ = 1, the cosine of the Fubini-Study distance can be expressed through the overlaps Ψ_k^(n); basic trigonometric identities, together with the normalization Σ_k |Ψ_k^(n)|² = 1, then give the corresponding tangent. We can show that the distance decreases at every step of the noisy power method under the assumptions on the noise stated in Eq. (30), given in terms of the current state Ψ^(n). The tangent of the distance between the propagated state and the ground state, which is a monotonic function of that distance, is bounded by applying the triangle inequality and bounding the numerator with the largest eigenvalue, and by applying the reverse triangle inequality in the denominator, assuming that the term σ_0 Ψ_0^(n) dominates the noise contribution there. Under the assumptions of Eq. (30), defining ξ := δ/4, splitting 1/(σ_1 + 2ξ) = (1 − ξ/(σ_1 + 2ξ)) · 1/(σ_1 + ξ), and using that a weighted mean of two terms is bounded by their maximum, one arrives at the contraction bound of Eq. (34), which shrinks the tangent at each step by the factor

ω := max( ((Λ − E_1)/(Λ − E_0))^{1/4}, ε ).

Therefore, under the assumptions of Eq. (30), the noisy power method decreases tan θ(Ψ^(n), Υ_0) at every step until, at some step M, it reaches tan θ(Ψ^(M), Υ_0) = ε, and then it stays at that distance.
Finally we show that the following strengthened assumptions, given in terms of the initial distance, imply the assumptions in Eq. ( 30), and thus the convergence of the method.

Number of steps
We derive the bound on the number of steps M needed to reach a state with tan θ ≤ ε. As ω in Eq. (34) is a multiplicative factor, it is clear that a logarithmic number of steps M suffices to reach ε. In order to bound the number of steps, it is convenient to first bound ln(1/ω) using the definition of ω. Assuming ε < 1/2, we bound ln(1/ε) ≥ ln 2; for the other term, a standard bound on the logarithm yields Eq. (37). Then, recursively applying Eq. (34) until step M, where we assume to reach ε, taking the logarithm on both sides, solving for M, and using Eq. (37), we obtain the stated bound on the number of steps.

Scaling for gapless hamiltonians
We briefly analyze the scaling of the SLPM in the case of a gapless Hamiltonian with a gap which closes polynomially with the inverse system size.
From Theorem 1 it is easy to see that the number of steps needed is inversely proportional to the gap. If we assume, for a system of size L, that (I) the gap closes as δ ∝ L^{−β}, (II) we can choose a constant |Λ| ≪ |E_0|, and (III) the ground-state energy scales extensively with the system size, then a number of steps proportional to L^{β+1} is needed. We note that the same applies to the power method in the absence of noise.
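Under assumptions (I)-(III), the chain of scalings can be written out as a back-of-the-envelope estimate (using that the step count of Theorem 1 grows as M ∝ (Λ − E_0)/δ):

```latex
M \;\propto\; \frac{\Lambda - E_0}{\delta}
  \;\propto\; \frac{|E_0|}{\delta}
  \;\propto\; \frac{L}{L^{-\beta}}
  \;=\; L^{\beta+1},
```

so the required number of steps grows only polynomially with the system size.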
In the presence of noise, however, we also need a step infidelity small enough that our method can resolve the gap (see Corollary 1). Under the assumptions above, this results in a required step infidelity ∝ L^{−2(β+1)}, meaning that we would need of the order of L^{2(β+1)/α} samples. Therefore, as long as the gap closes polynomially with the inverse system size, the method is efficient.

Details of the Monte-Carlo sampling procedure
We use Markov-chain Monte Carlo sampling with the Metropolis-Hastings algorithm (see e.g. Ref. [8], Sec. 3.9, and references therein) to generate samples x ∼ |ϕ|², where ϕ is given in terms of log-amplitudes by the kernel ridge regression predictor. For the TFI model we perform single-spin-flip updates, and for the AFH model we propose to exchange two spins at each step, initializing the Markov chains with states of total spin 0.
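A minimal sketch of the single-spin-flip Metropolis-Hastings sampler follows; the actual simulations use NetKet's samplers, and `log_phi` here is an arbitrary stand-in for the kernel ridge regression predictor, not the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_phi(x):
    """Stand-in for the KRR log-amplitude predictor (arbitrary smooth function)."""
    return 0.3 * np.sum(x[:-1] * x[1:])   # e.g. favours aligned neighbours

def metropolis_chain(n_spins, n_steps, burn_in=200):
    """Sample x ~ |phi(x)|^2 with single-spin-flip (TFI-style) updates."""
    x = rng.choice([-1, 1], size=n_spins)
    logp = 2.0 * log_phi(x)               # log|phi|^2 = 2 * log-amplitude (real case)
    samples = []
    for step in range(n_steps + burn_in):
        i = rng.integers(n_spins)
        x_new = x.copy()
        x_new[i] *= -1                    # propose a single spin flip
        logp_new = 2.0 * log_phi(x_new)
        if np.log(rng.random()) < logp_new - logp:   # Metropolis acceptance rule
            x, logp = x_new, logp_new
        if step >= burn_in:
            samples.append(x.copy())
    return np.array(samples)

samples = metropolis_chain(n_spins=10, n_steps=1000)
```

For the AFH the proposal would instead exchange two opposite spins, which preserves zero total magnetization along the chain.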

Multiple occurrences of the same sample
Monte-Carlo sampling can produce the same sample multiple times. Formally, including multiple occurrences of the same sample in the data-set reduces the rank of the kernel matrix, which can lead to numerical instabilities when decomposing the matrix and needs to be accounted for with the regularization. We note that when using a kernel which is invariant under a symmetry group G, two samples x, x′ belonging to the same orbit, meaning that gx = x′ for some g ∈ G, have the same effect, and we consider them to be the "same" as well.
By introducing a weight factor c_i into the loss, counting the number of occurrences of sample x_i, we can keep only a unique set of samples. This reduces the cost, as the kernel matrix is smaller, while formally keeping the loss invariant, and results in a de-facto sample-dependent regularization, as we see in the following. The modified, weighted loss leads to a system of equations for the optimal weights that is very similar to the original one (Eq. (19) in the main text), except that the regularization is now sample-dependent, λ/c_i instead of λ; only in the limit λ → 0 does removing repeated samples from the data-set have no effect at all. We found that in practice, when the regularization λ is already very small, it is sufficient to remove repetitions without using the sample-dependent regularization, and we therefore adopt this procedure for the simulations in this paper.
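The equivalence between keeping repeated samples and deduplicating with a sample-dependent regularization λ/c_i can be checked directly. The sketch below uses a generic RBF kernel and synthetic data (not the NTK or data of the paper) purely to demonstrate that both linear systems define the same predictor:

```python
import numpy as np

def rbf(X1, X2, gamma=0.5):
    """Generic RBF kernel matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
lam = 1e-3

# Data set with repeated samples, as produced by Monte-Carlo sampling.
X_unique = rng.normal(size=(5, 3))
counts = np.array([3, 1, 2, 1, 4])            # c_i: occurrences of each unique sample
y_unique = rng.normal(size=5)
X_full = np.repeat(X_unique, counts, axis=0)
y_full = np.repeat(y_unique, counts)

# (1) Plain kernel ridge regression on the full data set, repetitions included.
K_full = rbf(X_full, X_full)
a_full = np.linalg.solve(K_full + lam * np.eye(len(X_full)), y_full)

# (2) Weighted KRR on unique samples: sample-dependent regularization lambda / c_i.
K = rbf(X_unique, X_unique)
a = np.linalg.solve(K + lam * np.diag(1.0 / counts), y_unique)

# Both formulations yield the same predictor on new points.
X_test = rng.normal(size=(4, 3))
pred_full = rbf(X_test, X_full) @ a_full
pred_dedup = rbf(X_test, X_unique) @ a
```

Both systems minimize the same regularized loss, so the predictions agree up to numerical round-off while the deduplicated system is much smaller.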

Numerical Details of the supervised learning procedure
The self-learning power method is initialized with a data-set for the uniform superposition state, taking uniform samples from the whole Hilbert space (restricted to states with magnetization 0 for the AFH) and setting all log-amplitude labels to y = 0, resulting in a state Ψ_0(x) = 1. For the kernel of Eq. (23) we use the nonlinearity σ(x) = x arcsin(γx), fixing γ = 0.5808. Throughout our experiments we keep the regularization fixed at λ = 10^(−8), except in rare cases where we encounter NaNs and have to increase it to 10^(−7).
We solve the linear system of equations for the weights (Eq. (19)) using the Cholesky decomposition of the regularized kernel matrix. In general we work with un-normalized states. As the operator Λ − Ĥ is not unitary, to avoid underflow/overflow and ensure numerical stability, at every step we subtract the largest log-amplitude present in the data-set from all the labels, effectively normalizing the state so that max_{x_i} Ψ^(n)(x_i) = 1 over the samples x_i in the data-set. We wrote the code for the numerical simulations with the jax library [60], using the sampler from netket [61,62], and optionally parallelized it using mpi4jax [63]. All of the simulations in this article were run serially on an NVIDIA V100 GPU. For 2D we also report SSE results from Ref. [64] for comparison.
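The linear solve and the log-amplitude shift can be sketched as follows. This is a NumPy stand-in for the jax implementation; `K` and `y` here are an arbitrary positive-definite matrix and arbitrary labels, not the actual NTK data:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1e-8

# Stand-ins for the kernel matrix on the data-set and the log-amplitude labels.
B = rng.normal(size=(50, 50))
K = B @ B.T / 50 + np.eye(50)          # symmetric positive-definite "kernel" matrix
y = rng.normal(size=50)

# Shift labels so that max_x Psi(x) = 1 on the data-set: subtracting the largest
# log-amplitude only changes the (irrelevant) overall normalization of the state.
y = y - np.max(y)

# Solve (K + lambda I) alpha = y via the Cholesky decomposition.
M = K + lam * np.eye(len(K))
L = np.linalg.cholesky(M)              # M = L L^T
z = np.linalg.solve(L, y)              # forward substitution
alpha = np.linalg.solve(L.T, z)        # back substitution

residual = np.max(np.abs(M @ alpha - y))
```

The Cholesky route exploits the symmetry and positive definiteness of the regularized kernel matrix, which a generic LU solve would ignore.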

C Additional Numerical Experiments
For completeness, in this appendix we provide the reference energies used for benchmarking purposes and present several additional numerical experiments that further support the results in the main text.
In order to benchmark our method we computed reference energies with exact diagonalization (ED) using the SpinED package [48,49], and performed Quantum Monte Carlo (QMC) simulations using the loop algorithm from the ALPS package [50,51]. For the QMC simulations we fixed the inverse temperature at β = 1000 and ran 10^5 thermalization steps followed by 10^6 sweeps. In Table 1 we provide energies for the TFI model in two dimensions, and in Table 2 for the AFH model in one and two dimensions.

TFI model in one dimension at fixed system size
We start with a few additional results for the 20-spin Ising chain. In Fig. 5 we provide a plot of the relative error as a function of the number of iterations for h = 0.5 and h = 2, in addition to h = 1, already plotted in the left panel of Fig. 2 in the main text. We observe comparable behaviour in all three regimes, except for h = 0.5 when the number of samples is low and fluctuations occur. This can be explained by the large noise introduced in this case, as can be seen from our study of the step infidelity. In Fig. 6 we plot the step infidelity I^(n) as a function of the step n. We observe that it is not constant over the steps of the self-learning power method, but varies before finally leveling off when a steady state is reached, as can be seen by comparing to Fig. 5. In the case of h = 0.5 and a number of samples N_S ≤ 512 the step infidelity grows over time, causing the error to increase. We point out that in this case the condition on the step infidelity mentioned in the discussion of Eq. (11) (ε < 1/2) is not satisfied, and therefore Theorem 1 cannot be applied. Nevertheless, increasing the number of samples alleviates these fluctuations.
As pointed out in the main text, it is possible to generalize the theoretical bounds to varying noise (quantified by ε in Theorem 1). One straightforward way to do so is to apply the theorem several times in a row. Starting from an initial state with possibly exponentially small overlap with the ground state, we fix ε = 1/2 and run enough steps until an infidelity of ε² = 1/4 is reached. In this regime the first assumption of Theorem 1 (Eq. (9)) dominates, as the initial fidelity is much smaller than ε. Now we are in a state with a finite fidelity of 3/4, which we use as the initial state to apply Theorem 1 again, choosing a smaller value of ε, e.g. as a function of the step infidelity according to Eq. (11). Now the dominant assumption of Theorem 1 is the second one (Eq. (10)), as the initial fidelity is much larger than ε. We note that this argument also justifies starting the SLPM with a lower number of samples until a steady state is reached, and increasing it afterwards, resulting in a lower computational cost.

Numerical verification of the efficient-learning assumption
In the main text we numerically investigated the efficient-learning assumption for the TFI model on a 1D chain with 20 spins, showing that the step infidelity after a fixed number of steps is compatible with a power law I^(n) ∝ N_S^(−α). In Fig. 7 we provide additional evidence that this is also verified for the TFI model in two dimensions and for the AFH model, plotting I^(n) as a function of the number of samples after n = 200 steps of the SLPM. In the left panel we consider the TFI model on a 4 × 4 square lattice for different values of h, and in the right panel the AFH on a 20-spin chain in 1D and on a 4 × 4 square lattice in 2D. Furthermore, we observe that the final infidelity I and the relative energy error ϵ_rel follow similar power laws with the same exponent as I^(n), confirming that Corollary 1 is valid for these systems.
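The exponent α of such a power law is extracted from a linear fit in log-log space. A minimal sketch with synthetic data standing in for the measured infidelities (the values and exponent below are invented for illustration):

```python
import numpy as np

# Synthetic stand-in for step infidelities measured at several data-set sizes,
# generated from I ~ N_S^(-alpha) with a known exponent.
N_S = np.array([128, 256, 512, 1024, 2048, 4096])
alpha_true = 1.3
I_n = 0.5 * N_S ** (-alpha_true)

# Fit log I = -alpha * log N_S + const by linear least squares.
slope, intercept = np.polyfit(np.log(N_S), np.log(I_n), 1)
alpha_fit = -slope
```

With real (noisy) data the same fit yields the exponent together with residuals that quantify how compatible the data is with a pure power law.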

TFI and AFH model in one and two dimensions
In the main text we studied the scaling of the SLPM ground-state energy error with the system size for the TFI model at a fixed number of samples, and the scaling with the number of samples for the AFH model. In Figs. 8 and 9 we provide the respective complementary plot for each model. In Fig. 8 we study the TFI for 64 spins, on a one-dimensional chain in the left panel and on an 8 × 8 square lattice in the right panel. In both cases we find a power-law-like scaling of the error with the number of samples, further corroborating our results. In Fig. 9 we investigate the scaling of the SLPM error with the system size for the AFH. For the one-dimensional systems in the left panel we find a scaling compatible with a power law, similar to what we found for the TFI in the main text. For the two-dimensional systems we observe that the error levels off at the largest numbers of samples. It might be possible to attribute this to finite-size effects on the smallest system, given that the number of samples becomes of the order of the effective Hilbert-space size.

D The Kernel of a symmetrized Restricted Boltzmann machine
Neural networks such as simple restricted Boltzmann machines (RBM) have been shown to be able to learn the ground states of the systems studied in this article, with an accuracy that increases with the width of the hidden layer, given by the hidden-layer density α [23]. Recently it has been discovered that the training of neural networks is governed by a kernel, the neural tangent kernel (NTK) [47]. It has been shown that in the infinite hidden-layer-width limit, the average prediction of randomly initialized neural networks, fully trained with gradient descent using the least-squares loss, is equal to the prediction of kernel ridge regression using the NTK. Therefore, this theory formally allows us to study RBMs in the α → ∞ limit using kernel methods.
In particular we are interested in the NTK of a symmetrized RBM, and in the effect the symmetrization of the neural network has on the symmetry properties of the kernel. Let G be a permutation group. A symmetrized RBM consists of the layers (a)–(d) listed below (a dense symmetric layer, an element-wise log cosh nonlinearity, a sum over features, and an average pool over G), which together produce an invariant model. We can write it as f(x) = (1/|G|) Σ_{g∈G} Σ_{i=1}^{n_1} σ((w L_g x)_i), (43) where σ(·) = log(cosh(·)), w ∈ R^{n_1×n_0} with w_{i,j} ∼ N(0, 1) is the matrix containing the filters, x ∈ X = {−1, 1}^{n_0} is the input of size n_0, and n_1 is the number of features in the dense symmetric layer, of which we take the limit n_1 → ∞. L_g denotes a concrete instance of an operator performing the symmetry operation g on an element of X; we do not add any hidden or visible bias.
The NTK of a neural network f_θ(x) is defined as Θ(x, y) = Σ_{p=1}^{P} (∂f_θ(x)/∂θ_p)(∂f_θ(y)/∂θ_p), where θ_p are the parameters of the neural network, in our case given by w, and P → ∞ when n_1 → ∞.
It can be shown that the NTK of the symmetrized RBM defined in Eq. (43) is given by the kernel Θ(x, y) presented in the main text, with σ(x) := x arcsin(γx) for γ ≈ 0.5808, which comes from approximating the nonlinearity. We note that in the main text, when presenting this kernel, we wrote g instead of L_g, under a slight abuse of notation, to simplify the expression.
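A symmetrized kernel of this form can be evaluated directly. The sketch below uses the cyclic translation group as the simplest choice of G and the arcsin-type nonlinearity quoted above; the normalization and the remaining structure of the full NTK are deliberately simplified, so this illustrates only the symmetrization:

```python
import numpy as np

GAMMA = 0.5808

def sig(t):
    """Arcsin-type nonlinearity appearing in the simplified kernel."""
    return t * np.arcsin(GAMMA * t)

def sym_kernel(x, y):
    """Translation-symmetrized kernel (sketch): average of
    sig(<x, L_g y> / n_0) over all cyclic shifts g."""
    n0 = len(x)
    vals = [sig(np.dot(x, np.roll(y, g)) / n0) for g in range(n0)]
    return np.mean(vals)

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=8)
y = rng.choice([-1, 1], size=8)

# The pooled kernel is invariant under translating either argument,
# exactly as required for a G-invariant model.
k0 = sym_kernel(x, y)
k1 = sym_kernel(np.roll(x, 3), y)
k2 = sym_kernel(x, np.roll(y, 5))
```

Translating either input only permutes the terms of the group average, which is why the three values coincide.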
In the remainder we provide a sketch of the derivation needed to arrive at this simplified expression for the kernel. We note that the NTK can also be computed with a library such as Ref. [65], using circular convolutions to obtain translation symmetry (the full space group is not possible), but at an increased computational cost with respect to the kernel presented above, as a convolution in twice the number of dimensions of the input needs to be performed.

Derivation of the neural tangent kernel of a symmetrized restricted Boltzmann machine with ±1 input values
The common way to determine the NTK Θ of a neural network is a recursive procedure computing it layer by layer. To compute Θ one also needs the covariance Σ of the centered multivariate normal distribution describing the output of the neural network over random initializations of the weights, before training. For a simple feed-forward neural network it has been shown that both Θ and Σ are represented by a scalar kernel times an implicit identity ⊗ I_{n_l}, as the features of the output of each layer are independent. For the symmetrized layer of the RBM, however, the kernels are no longer scalar, as the same features for different elements of G are correlated. Therefore, we have to work with kernels Σ_{g,g′}, Θ_{g,g′} of size |G| × |G|, with g, g′ ∈ G. The final scalar NTK is obtained only after the last pooling layer. We note that it is possible to show, for both the covariance and the NTK of a permutation group G, that Σ_{g,g′} = Σ_{e,g′g⁻¹} and Θ_{g,g′} = Θ_{e,g′g⁻¹} hold⁹, where e is the identity element of G. Therefore, instead of computing the whole kernel matrix, it suffices to work with a single row.
We apply the recursive procedure to compute the NTK of the symmetrized RBM.
9 To give a concrete example, in the case of circular translation with stride 1 this means that the kernel matrices are circulant.
We apply the pooling (d) to obtain the scalar NTK of the output of the symmetrized RBM, pooling over the whole group G, where we chose g, g′ ∈ G arbitrarily, as ∀g ∈ G : gG = G.
We combine the individual components to find the equation for the NTK of the symmetrized RBM, and simplify to obtain the NTK of the output of the final layer. What is left is to find an expression for Σ̃¹(x, y) = E_{f∼N(0,Σ¹)}[φ(f(x)) φ(f(y))], which requires evaluating a two-dimensional Gaussian integral. We are not aware of any analytical solution for our choice of nonlinearity, and opt to approximate the nonlinearity with one for which analytical solutions are known. We can approximate d/dx log cosh(x) = tanh(x) ≈ erf(αx), (51) where α ≈ 0.8324 (found numerically by minimizing the L2 error on a suitable interval centered around 0). Then, noticing that for all inputs x ∈ {−1, 1}^{n_0} it holds that Σ¹_{g,g}(x, x) = ∥x∥²/n_0 ≡ 1, and using Eqs. 10 and 11 of Ref. [66], we finally obtain the expression for the kernel presented in the main text (Section 4).
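For (u, v) jointly Gaussian with unit variances and covariance Σ_{uv}, the Gaussian integral over the erf nonlinearity (Eqs. 10 and 11 of Ref. [66]) reduces to an arcsine, which is also where the quoted value of γ comes from (a sketch):

```latex
\mathbb{E}\!\left[\operatorname{erf}(\alpha u)\,\operatorname{erf}(\alpha v)\right]
= \frac{2}{\pi}\arcsin\!\left(\frac{2\alpha^{2}\,\Sigma_{uv}}{1+2\alpha^{2}}\right),
\qquad
\gamma = \frac{2\alpha^{2}}{1+2\alpha^{2}} \approx 0.5808
\quad\text{for } \alpha \approx 0.8324 .
```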

Figure 3:
Figure 3: Scaling of the SLPM ground-state energy relative error as a function of the system size for 1D (left panel) and 2D (right panel) periodic lattices of the TFI Hamiltonian with varying values of the transverse field h. The number of samples is fixed in all simulations at N_S = 4096 and the energies are estimated by taking 2^20 samples from the final state. The horizontal axes use a logarithmic scale in the total number of spins. The reference energies for the relative error are computed analytically for 1D systems, with exact diagonalization for 2D up to 40 sites (using the code from Refs. [48,49]) and with Quantum Monte Carlo for larger systems (loop algorithm from the ALPS library [50,51]). They are provided in Appendix C.

Figure 4:
Figure 4: Scaling of the SLPM ground-state energy relative error as a function of the data-set size for 1D (left panel) and 2D (right panel) periodic lattices of the AFH Hamiltonian for different system sizes. Estimates and reference values computed as in Fig. 3.

Figure 6:
Figure 6: Step infidelity of learning the states visited along the SLPM for the TFI model on a one-dimensional chain of N = 20 spins. (left panel): h = 0.5, (central panel): h = 1, (right panel): h = 2. Shown are averages over 100 runs.

Figure 7:
Figure 7: Final state convergence for the TFI model in 2D (left panel) and for the AFH model in 1D and 2D (right panel), in analogy to the right panel of Fig. 2 in the main text (TFI in 1D). Plotted are I^(n): step infidelity of learning the final state (see Definition 1), I: infidelity of the final state with the true ground state (defined in Eq. (12)), and ϵ_rel: relative error of the predicted energy of the final state (defined in Eq. (13)) after convergence of the self-learning power method, as a function of the number of samples in the data-set N_S, taking averages over 100 runs.

Figure 8:
Figure 8: Scaling of the SLPM ground-state energy relative error as a function of the data-set size for a 1D chain of 64 spins (left panel) and a 2D 8 × 8 periodic lattice (right panel) of the TFI Hamiltonian. Estimates and reference values computed as in Fig. 3.

Figure 9:
Figure 9: Scaling of the SLPM ground-state energy relative error as a function of the system size for 1D (left panel) and 2D (right panel) periodic lattices of the AFH Hamiltonian for different data-set sizes N_S. Estimates and reference values computed as in Fig. 3.
(a) An initial dense symmetric layer projecting onto G; (b) an element-wise log cosh nonlinearity; (c) a sum over the features of (a); (d) a global AvgPool over the elements of G.
Figure 2: Convergence of the self-learning power method for the TFI model on a one-dimensional chain of N = 20 spins. (left panel): Relative error of the predicted energy with respect to the true ground-state energy as a function of the number of iterations n, compared to the power method, for h = 1. Starting from an initial uniform superposition state, a steady state is reached after a certain number of iterations, with an energy that becomes more accurate with increasing data-set size N_S; averages are taken over 100 runs.
(right panel): Final state convergence. Plotted are I^(n): step infidelity of learning the final state (see Definition 1), I: infidelity of the final state with the true ground state (defined in Eq. (12)), and ϵ_rel: relative error of the predicted energy of the final state (defined in Eq. (13)), as a function of the number of samples in the data-set N_S.

Table 1:
Reference ground-state energies of the transverse-field Ising model in two dimensions with periodic boundary conditions for several values of h. ED computed with SpinED [48,49], and QMC with ALPS [50,51].

Table 2:
Reference ground-state energies of the antiferromagnetic Heisenberg model on one- and two-dimensional periodic lattices. ED computed with SpinED [48,49], and QMC with ALPS [50,51].