Deep Learning of Quantum Many-Body Dynamics via Random Driving

Neural networks have emerged as a powerful way to approach many practical problems in quantum physics. In this work, we illustrate the power of deep learning to predict the dynamics of a quantum many-body system, where the training is \textit{based purely on monitoring expectation values of observables under random driving}. The trained recurrent network is able to produce accurate predictions for driving trajectories entirely different than those observed during training. As a proof of principle, here we train the network on numerical data generated from spin models, showing that it can learn the dynamics of observables of interest without needing information about the full quantum state. This allows our approach to be applied eventually to actual experimental data generated from a quantum many-body system that might be open, noisy, or disordered, without any need for a detailed understanding of the system. This scheme provides considerable speedup for rapid explorations and pulse optimization. Remarkably, we show the network is able to extrapolate the dynamics to times longer than those it has been trained on, as well as to the infinite-system-size limit.


Introduction
Machine learning based on neural networks (NN) [17,24] has revealed a remarkable potential in solving a wide variety of complex problems. More recently, a number of first applications to quantum physics have emerged. These include the identification of quantum phases of matter and learning phase transitions [2,5,48,50,51], quantum state tomography [16,47], quantum error correction and decoding [14,46], improving quantum Monte Carlo methods [52], solving optimization problems encoded in the ground state of a quantum many-body Hamiltonian [33], and tackling quantum many-body dynamics [4,15,34,41]. Our work provides a step forward in the latter direction.  This includes local observables but may also comprise, e.g., spatial two-point correlators.

In recent years, there has been tremendous activity in the field of non-equilibrium engineered quantum many-body systems, particularly motivated by the significant progress achieved in experiments [7,19,23,28,31,40]. However, simulating the dynamics of quantum many-body systems computationally is extremely difficult, even in 1D for long times. The reason is that, besides the exponential growth of the Hilbert space dimension with the system size, quantum correlations also grow significantly with time [3]. Recently, it has been shown that machine learning based on NNs can assist in overcoming such difficulties to a great extent [4,26,41]. For example, for systems that can be described by local Hamiltonians, the dynamics is confined to a smaller sector of the Hilbert space, so that a majority of quantum states cannot be reached during the time interval of interest [39]. In such cases, it might be enough to find an appropriate variational ansatz for the state of the system, containing a number of time-dependent parameters that may scale polynomially with the system size. NNs have shown great potential as such variational ansatzes for the quantum state [4,15,41]. Nevertheless, challenges remain with this approach, e.g. in efficiently evaluating expectation values.
In this work, we introduce a different approach to exploiting NNs for describing driven quantum many-body dynamics. Our aim is to teach a NN to predict the time-evolution of observables in non-equilibrium quantum many-body systems subject to arbitrary driving. Such driven many-body models play a key role in many areas, such as cold atom experiments [11,43], NMR [13,49], and quantum information processing [1].
In our approach, the learning is based purely on observations of expectation values during randomly driven dynamics (Fig. 1). If successful, this means such NNs can eventually be trained on actual experimental systems, without any prior knowledge of the underlying Hamiltonian or of any additional dissipative couplings and other aspects of the real physical system. It also means that they can be trained on partial observations, i.e., measurements of only a selected subset of degrees of freedom.
It is obvious that this goal represents a considerable challenge: Even if the Hamiltonian were known, predicting the full time evolution of a quantum many-body system in principle requires knowledge of the exponentially large many-body wave function. Alternatively, one has to come up with an efficient compressed representation, with matrix product states [37] as one prominent example. It is, however, hard to find such compressed representations in general, and their applicability can depend very much on circumstances such as the space dimension. In addition, time-evolving the representation and calculating expectation values can be computationally demanding.
Viewed more physically, the full many-body quantum state contains information about all possible correlators of observables (including multi-point correlators). This is important because even the attempt to predict only the evolution of single-particle observable expectation values in an interacting system will invariably require knowledge of higher-order correlators. This can be seen from the hierarchy of equations of motions that may be constructed for the set of all correlators and sometimes serves as a starting point for truncation schemes. This is the challenge faced by a NN that has to learn to predict many-body dynamics based purely on observations. It has to construct a compressed representation that implicitly must involve information about the correlations that are building up inside the system due to interactions. Knowing only the single-particle observables and the current driving field would certainly not be enough to predict the subsequent evolution. We may, therefore, think of such a NN as discovering on its own, from scratch (without human guidance), a representation of a correlated many-body state that is hopefully more predictive than simple low-order truncation schemes. At the same time it should be also more general (e.g. in terms of dimensionality) than approaches based on ansatz wave functions. Moreover, this representation will be optimized for the set of observables that was selected for teaching the dynamics.
In this context, we mention another challenge. At first glance, it might seem obvious that once a NN has been taught to solve some representation of the effective equations of motion of the system, it can use this knowledge to predict the dynamics up to arbitrarily large times, exceeding the time spans seen during training. However, as we will explain below, we typically consider dynamics starting out from some simple initial quantum state. As time progresses, entanglement grows. This implies that the dynamics at later times, when longer-range correlations have built up, can be qualitatively different from that at early times, which makes extrapolation of the dynamics towards larger times a non-trivial task.
As we will see, two main ingredients have helped us to achieve our goal. First, as already alluded to, we teach the NN by having it observe the dynamics for time-dependent driving by many different realizations of a random process. This approach is simple but powerful, because it can immediately be applied to experimental setups. Second, in a systematic study of different NN architectures, we found that recurrent neural networks (RNNs), i.e., NNs with internal memory, excel at this challenge. These NNs are typically used for sequence processing and prediction, such as of audio signals [18] or in language translation [45]. In the present context, they seem to become very efficient in time-evolving the autonomously discovered compressed representation of a quantum state.

General remarks on the approach
As indicated above, the main idea of our approach is to observe the system's dynamics under arbitrary driving, represented via time-dependent parameters D(t) in the Hamiltonian. We train the NN such that it maps these driving trajectories (such as magnetic or electric fields evolving in time) to the observed evolution of the expectation values of a set of relevant observables. Specific classes of important driving scenarios include periodic driving or quenches, but the goal is to train a deep NN that is eventually able to provide accurate predictions for any sort of driving (such as what is encountered during quantum control protocols). As we will explain below, we achieve this key goal by training on driving trajectories that are samples of a stochastic process.
Furthermore, at this point, we want to stress that we train our NN to predict partial observations, i.e., expectation values of only a selected subset of degrees of freedom, instead of a complete set of observables (i.e. the full quantum state). Because of this more modest, but practically very relevant goal, our approach is well-suited to be applied to experimental data, where a full quantum-state tomography during training would be infeasible. Moreover, the approach is compatible with a scenario where the internal dynamics of the system might even be unknown in the experiment, i.e. the Hamiltonian is not provided to us. Our NN is able to learn the dynamics while needing no prior information about the underlying dynamical equations. In contrast, standard methods based on, for example, exact Schrödinger evolution, time-dependent density matrix renormalization group [42], or more recent methods based on NN representations of quantum states [4,15,41], rely on precise knowledge of the general form of the underlying Hamiltonian ruling the dynamics of the device, whose parameters would need to be fitted to the experimental data. We note that there have been a few very promising recent first steps in the direction of training NNs on actual experimental measurement data, where the goal was to have the NN deduce the quantum state of a single qubit [12,44]. To apply our approach in an experiment, the set of system observables would be measured at a given time and then averaged over many runs for the same driving trajectory in order to obtain an estimate of the expectation values. This procedure would be repeated for different time-points to obtain the full response for that driving trajectory.
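The averaging procedure described in the last two sentences can be sketched with synthetic data. This is only an illustration under stated assumptions: the observable is taken to have outcomes ±1, the function name `estimate_expectation`, the toy "true" values, and the number of shots are all hypothetical and not taken from the paper.

```python
import numpy as np

def estimate_expectation(p_up, n_shots, rng):
    """Estimate the expectation value of a +/-1-valued observable from n_shots runs.

    p_up is the (hypothetical) probability of the +1 outcome at a given time.
    """
    outcomes = rng.choice([1.0, -1.0], size=n_shots, p=[p_up, 1.0 - p_up])
    mean = outcomes.mean()
    # Standard error of the mean for a +/-1 variable: sqrt((1 - mean^2) / n_shots)
    stderr = np.sqrt(max(1.0 - mean**2, 0.0) / n_shots)
    return mean, stderr

rng = np.random.default_rng(0)
times = np.linspace(0.0, 1.0, 5)
# Toy "true" expectation values, used only to generate synthetic outcomes.
true_values = np.cos(2.0 * np.pi * times)
# Repeat the shot-averaging at every time point of one driving trajectory.
estimates = [estimate_expectation((1.0 + v) / 2.0, 2000, rng) for v in true_values]
```

The statistical error of each point decreases as the inverse square root of the number of runs, which sets the measurement budget needed per trajectory and per time point.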
Learning the dynamics from observations also immediately implies that the approach is not restricted to Hamiltonian dynamics, but rather allows the NN to correctly predict open and noisy quantum dynamics, i.e., dissipative dynamics that would have to be described either via Lindblad master equations or even via non-Markovian equations. Moreover, while we will illustrate our approach further below on spatially homogeneous systems, inhomogeneous systems like impurity models or disordered systems fall within the purview of the method as well. By selecting suitable system observables, one can choose to focus on interesting spatial neighborhoods, e.g., the immediate vicinity of a quantum impurity.
As we shall illustrate for a specific example, our scheme provides substantial speedup as compared with direct numerical simulations of the Schrödinger equation. Exact Schrödinger evolution of a quantum many-body system involves an effort (memory and runtime) growing exponentially with system size. In contrast, we will show that our approach provides a more favourable scaling in terms of system size (at most polynomial) for certain tasks for which the full quantum state is not required. This is important for many applications, such as optimization (pulse engineering) and rapid exploration of the behaviour under different driving protocols [8,11,30,49]. Eventually, our approach could support efforts in autonomous design and calibration of complex quantum experiments [32,35].
We briefly highlight another feature of our approach. As we will explain below, we rely on RNNs to predict the mapping from driving to observations. This has an important consequence: the NN can be used to produce predictions on time windows that are longer than what it has been trained on, and we will show that these are fairly accurate. This can find immediate applications for the extrapolation of experimental data in platforms with time constraints (e.g., existing quantum computers and quantum simulators with finite decoherence times). We will see that under appropriate conditions, our NN can even be used to extrapolate the dynamics to longer times for the infinite-system-size limit while it is trained on finite-size systems.
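To make the exponential memory cost of exact evolution concrete, the following back-of-the-envelope sketch counts the bytes needed to store one full state vector of M spin-1/2 sites. It assumes double-precision complex amplitudes (16 bytes each); the helper name `state_vector_bytes` is hypothetical.

```python
def state_vector_bytes(num_spins, bytes_per_amplitude=16):
    """Memory for a full state vector of num_spins spin-1/2 sites (complex128 amplitudes)."""
    return (2 ** num_spins) * bytes_per_amplitude

for m in (10, 20, 30, 40):
    # Each additional spin doubles the memory requirement.
    print(m, state_vector_bytes(m) / 2**30, "GiB")
```

Under this assumption, 30 spins already require 16 GiB for a single state vector, whereas the number of expectation values the NN has to output grows at most polynomially with M.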

Technical considerations
We employ supervised learning to train a NN on predicting the expectation values of the observables of interest, denoted by S(t), given the time-dependent driving trajectory D(t). In this section, we comment on our specific choices for the key ingredients of the approach, that we have, after exhaustive explorations, found best suited to achieve high-quality predictions. These ingredients include the generation of training trajectories for D(t) based on a random process, and the most promising combination of NN architecture and training strategy.
Sampling Gaussian processes. We train the NN on arbitrary time-dependent trajectories D(t) sampled from a sufficiently large class of functions, so that the NN can afterwards predict the resulting dynamics of expectation values S(t) for different arbitrary time-dependent driving trajectories that it may have never seen during training. Investigating a few different classes of functions, we noted that sampling these trajectories from a random Gaussian process works very well for our purpose. Our choice is motivated by well-established properties of Gaussian distributions, such as the flexibility gained by controlling the mean and variance, which have made them among the most fundamental distributions in statistics. In more detail, we use a mixture of Gaussian processes by also randomly varying their respective correlation times over a wide range. It turns out that in this way, the NN can learn all the essential features needed to predict the dynamics for different time-dependent functions. Sampling from a random Gaussian process for spatially random static potentials was demonstrated to be efficient in another context, for predicting band structures in topological materials [38]. There are various methods to generate Gaussian random functions [25]. In Appendix B.1, we explain in detail the method that we use here (based on eigendecomposition of the correlation matrix). We also comment on the range of parameters (correlation time and amplitude of our Gaussian random processes) that we selected in our illustrative examples.
Initial states. It is important to realize that the fully trained NN will only be able to provide correct predictions for non-equilibrium quantum dynamics starting out in the class of initial states that was considered during training. We restrict ourselves to product states, which are a common choice in most real experimental implementations. However, our experiments confirm that our networks can also learn the dynamics for other classes of initial states, provided that they receive as input an appropriate description of the initial state. One can train the network on samples which are initially prepared in an identical entangled state or on samples for which the initial states are chosen from a specific class of entangled states. For example, we observed that our network also succeeds in learning the dynamics where the samples are prepared initially in the ground state of the initial Hamiltonian. Note that the state preparation stage can also be included as part of the driving protocol during the prediction phase. This observation is inspired by the typical way that, e.g., an experimental quantum simulator would be run: At first, the system is prepared in a very simple state (e.g. a product state), and then an external pulse sequence or an adiabatic ramp of parameters is used to generate a more complex many-body state, which forms the starting point for all subsequent dynamics (e.g. a quench). We can follow the same idea here, albeit at the expense of using up part of the time interval during which the NN predictions are reasonably accurate.
General training strategy. During training (Fig. 2(a)), we feed as input into the NN all the parameters that fully identify the Hamiltonian of the model, the entire driving trajectory D(t), as well as a description of the initial state. Since we pick the initial state ρ(0) to be a product state, it is sufficient to provide a compressed unique description f(ρ(0)) of that state, which will not involve exponentially many parameters. In our illustrative examples, where we choose product states for the initial states, it is sufficient to provide the expectation values of a set of local observables (hence f(ρ(0)) = S(0) in that case). If one considers the case of an identical initial entangled state for all samples, then it is not even required to feed any information about the initial state to the network. The output of the NN is the full evolution of the expectation values of a subset of observables of interest (S(t)). During training, in our case we provide the results of an exact simulation, but in future applications this data could come from an experiment.
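A minimal sketch of sampling Gaussian random functions through an eigendecomposition of the correlation matrix is given below. It is an illustration only: the squared-exponential correlation function, the parameter ranges, and the function name `sample_gaussian_process` are assumptions made here, not the specific choices of Appendix B.1.

```python
import numpy as np

def sample_gaussian_process(times, corr_time, amplitude, n_samples, rng):
    """Draw zero-mean Gaussian random functions via eigendecomposition of the correlation matrix."""
    dt = times[:, None] - times[None, :]
    # Assumed squared-exponential correlation function (an illustrative choice).
    corr = amplitude**2 * np.exp(-0.5 * (dt / corr_time) ** 2)
    eigvals, eigvecs = np.linalg.eigh(corr)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative values from round-off
    # If z ~ N(0, I), then V sqrt(L) z has covariance V L V^T = corr.
    z = rng.standard_normal((len(times), n_samples))
    return (eigvecs @ (np.sqrt(eigvals)[:, None] * z)).T  # shape (n_samples, len(times))

rng = np.random.default_rng(1)
times = np.linspace(0.0, 10.0, 200)
# Mixture of processes: each trajectory gets its own randomly chosen correlation time.
corr_times = rng.uniform(0.5, 5.0, size=8)
trajectories = np.stack([
    sample_gaussian_process(times, tc, amplitude=1.0, n_samples=1, rng=rng)[0]
    for tc in corr_times
])
```

Short correlation times give rapidly fluctuating trajectories and long ones give slowly varying trajectories, so drawing the correlation time at random exposes the network to a wide range of time scales.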
The neural network architecture. As for the architecture of the NN, we apply the most well-known type of RNN, "long short-term memory" (LSTM [21]; Fig. 2 (b)). As with any RNN, LSTM-NNs have a temporal structure and respect the fundamental principle of causality, which makes them well-suited to represent differential equations (equations of motion). Moreover, LSTMs specifically are able to capture both long-term and short-term dependencies. This characteristic is extremely useful as it gives the LSTM-NN the power to handle complex non-Markovian dynamics. Therefore, all of these together suggest this architecture to be very promising for our many-body dynamics prediction task (see Appendix A for a brief recapitulation of the LSTM architecture).
Let us also comment here briefly on the size of our LSTM-NN in terms of input and output size as well as number of hidden layers. The input size (at each time step) is given by the number of driving parameters and the number of parameters that identify our initial product state. The output size is set by the number of observables to be predicted. These observables may contain two-point or higher-order correlators. However, if we deal with a translationally invariant system and impose a cutoff for the range of all the correlators, the number of neurons in the output layer does not scale with system size. Even in the most general case (inhomogeneous system, arbitrary range of some finite-order correlators), the scaling is only polynomial.
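The input and output structure described above can be sketched with a single-layer LSTM written in plain NumPy. This is a hedged illustration, not the network used in this work: the class `TinyLSTM`, the helper `build_inputs`, the layer sizes, and the untrained random weights are all hypothetical, and training is omitted entirely.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Single-layer LSTM with a linear readout; weights are random (untrained)."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        scale = 1.0 / np.sqrt(n_hidden)
        # One stacked weight matrix for the four gates (input, forget, candidate, output).
        self.W = rng.uniform(-scale, scale, (4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.W_out = rng.uniform(-scale, scale, (n_out, n_hidden))
        self.b_out = np.zeros(n_out)
        self.n_hidden = n_hidden

    def forward(self, inputs):
        """inputs: (T, n_in) -> outputs: (T, n_out); step t only sees inputs up to t."""
        h = np.zeros(self.n_hidden)
        c = np.zeros(self.n_hidden)
        outputs = []
        for x_t in inputs:
            gates = self.W @ np.concatenate([x_t, h]) + self.b
            i, f, g, o = np.split(gates, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell memory
            h = sigmoid(o) * np.tanh(c)                   # update the hidden state
            outputs.append(self.W_out @ h + self.b_out)
        return np.array(outputs)

def build_inputs(driving, initial_state_descr):
    """At every time step, concatenate D(t) with the description of the initial state."""
    T = len(driving)
    return np.column_stack([driving, np.tile(initial_state_descr, (T, 1))])

rng = np.random.default_rng(0)
# 1 driving parameter + 3 initial-state parameters in, 9 (hypothetical) observables out.
net = TinyLSTM(n_in=4, n_hidden=16, n_out=9, rng=rng)
driving_long = np.sin(0.1 * np.arange(100))
s0 = np.array([0.6, 0.0, 0.8])
pred_short = net.forward(build_inputs(driving_long[:50], s0))
pred_long = net.forward(build_inputs(driving_long, s0))  # same weights, longer window
```

Because the same cell is applied at every step, the network can be run on a window longer than the one used in training without changing any weights, and its outputs up to a given time depend only on the driving up to that time.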
The size of the hidden layers determines the number of trainable parameters in the NN. We will briefly comment in a later section on our empirical observations regarding the required size for the illustrative examples considered in this work (we find no need to scale up with system size).

Physical Model
To evaluate the performance of our scheme in predicting the dynamics of a driven quantum many-body system, we concentrate on two prototypical 1D spin models (with periodic boundary conditions in each case): i) The transverse field Ising (TFI) model, where the spin ring is driven out of equilibrium with a spatially homogeneous time-dependent transverse field (Fig. 1(b)), and ii) the Heisenberg model, where we introduce a time-dependent coupling between neighbouring spins (Fig. 1(c)). Apart from their fundamental interest, these two models are of practical relevance in the context of non-equilibrium quantum many-body dynamics [10,20]. We discuss the Heisenberg model in Appendix I and mostly focus on the TFI model here in the main text.
The TFI Hamiltonian for a system of size M is written in terms of the Pauli operators $\sigma^{\alpha}_{j}$, with α = x, y, z, acting on site j, the nearest-neighbour coupling J, and the transverse magnetic field B(t). The spins form a one-dimensional ring, so that $\sigma^{\alpha}_{M+1} \equiv \sigma^{\alpha}_{1}$. The TFI model is a widely studied paradigmatic model with many experimental implementations. It displays non-trivial physics, with a quantum phase transition at J = B, and a rich phenomenology when driven out of equilibrium, in particular in the context of quenches [6]. In the following, we set J = 1 without loss of generality, since this only fixes the energy scale.
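For orientation, one common textbook convention for the transverse-field Ising ring is written out below, together with the resulting Heisenberg equation of motion for a local spin component (with ħ = 1). The overall signs and the choice of coupling and field axes are assumptions made for this illustration and may differ from the convention used in this work.

```latex
\begin{align}
H(t) &= -J \sum_{j=1}^{M} \sigma^{z}_{j}\sigma^{z}_{j+1} - B(t) \sum_{j=1}^{M} \sigma^{x}_{j}, \\
\frac{d}{dt}\langle \sigma^{x}_{j} \rangle &= i\,\langle [H(t), \sigma^{x}_{j}] \rangle
 = 2J \left( \langle \sigma^{z}_{j-1}\sigma^{y}_{j} \rangle + \langle \sigma^{y}_{j}\sigma^{z}_{j+1} \rangle \right).
\end{align}
```

The second line uses $[\sigma^{z}_{j}, \sigma^{x}_{j}] = 2i\sigma^{y}_{j}$. It shows explicitly that the evolution of a single-spin expectation value is driven by two-point correlators, whose own equations of motion involve higher-order correlators in turn, which is the hierarchy referred to in the Introduction.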
As the TFI model is a quantum-integrable model, one may wonder whether the success of the NN is tied to integrability. To clarify this we eventually also investigate non-integrable models, predicting the dynamics in the presence of an extra longitudinal field. In this case, the model cannot any more be mapped to free fermions and becomes non-integrable.
Training. We train the NN on Gaussian random fields B(t) over a broad range of field amplitudes. Specifically, for the numerical experiments analyzed here, we generated 50,000 random realizations of a Gaussian random process for B(t). Later we analyze how the performance of the NN depends on the size of the training data set. We solve the Schrödinger equation for the TFI ring driven out of equilibrium by these random transverse fields, for the particular system size M of interest. We calculate the evolution of a selected subset of observables. Note that in this work, for all the spin models that we investigate, we always choose as observables all the local spin operators $\sigma^{\alpha}_j$, as well as correlators of the type $\sigma^{\alpha}_j \sigma^{\alpha'}_{j+\ell}$, with $\alpha, \alpha' = x, y, z$. This choice is for illustration only and does not imply that one needs to train the NN on all these observables for it to learn the dynamics properly. We verified that one can successfully train the NN on fewer observables of interest as well; see Appendix E for a detailed discussion.
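One possible way to draw a Gaussian random drive on a time grid is sketched below. The squared-exponential covariance, the parameter names `amplitude` and `corr_time`, and the numerical values are assumptions made for illustration; the kernel and parameter ranges actually used for the training data are not specified in this section.

```python
import numpy as np


def sample_gaussian_drive(times, amplitude, corr_time, rng):
    """Draw one zero-mean Gaussian-process realization B(t) on `times`.

    Assumed covariance: amplitude^2 * exp(-(t - t')^2 / (2 corr_time^2)).
    """
    t = np.asarray(times, dtype=float)
    diff = t[:, None] - t[None, :]
    cov = amplitude**2 * np.exp(-0.5 * (diff / corr_time) ** 2)
    # Eigen-decomposition is used instead of Cholesky because smooth kernels
    # are often numerically rank-deficient on fine time grids.
    w, V = np.linalg.eigh(cov)
    w = np.clip(w, 0.0, None)
    return V @ (np.sqrt(w) * rng.standard_normal(len(t)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    times = np.linspace(0.0, 7.0, 50)
    B = sample_gaussian_drive(times, amplitude=1.5, corr_time=1.0, rng=rng)
    print(B.shape, float(B.min()), float(B.max()))
```

Repeating the call with fresh random numbers would yield an ensemble of trajectories of the kind described above.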
We solve the Schrödinger equation for the case in which the spins are initially prepared in an arbitrary translationally-invariant uncorrelated state with p chosen at random from the interval [0, 1]. Here, $|0\rangle$ and $|1\rangle$ denote, respectively, the ±1 eigenstates of $\sigma^z$. As we indicated above, starting from a product state does not restrict the applicability of our scheme, as one could start from more complicated states by including the state preparation as part of the driving protocol.
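The data-generation step can be illustrated by a minimal exact-evolution sketch for a small ring. Several choices here are assumptions rather than the original implementation: the sign convention H(t) = -J Σ σ^z_j σ^z_{j+1} - B(t) Σ σ^x_j, a piecewise-constant drive on each time step, dense matrices (practical only for small M), and the all-up initial state in the usage example instead of the randomly parametrized product state.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)


def site_op(op, j, M):
    """Embed a single-site operator at site j of an M-site ring."""
    out = np.array([[1.0 + 0j]])
    for k in range(M):
        out = np.kron(out, op if k == j else I2)
    return out


def tfi_terms(M):
    """Return (H_zz, H_x) with periodic boundary conditions."""
    H_zz = sum(site_op(SZ, j, M) @ site_op(SZ, (j + 1) % M, M) for j in range(M))
    H_x = sum(site_op(SX, j, M) for j in range(M))
    return H_zz, H_x


def evolve(psi0, B_values, dt, M, J=1.0):
    """Evolve psi0 under a piecewise-constant field B_values[k] for time dt each.

    Returns an array of (<sigma^x_0>, <sigma^z_0>) after every step and the
    final state. The sign convention of H is an assumption (see lead-in).
    """
    H_zz, H_x = tfi_terms(M)
    sx0, sz0 = site_op(SX, 0, M), site_op(SZ, 0, M)
    psi = np.asarray(psi0, dtype=complex)
    record = []
    for B in B_values:
        H = -J * H_zz - B * H_x
        w, V = np.linalg.eigh(H)                      # exact step propagator
        psi = V @ (np.exp(-1j * w * dt) * (V.conj().T @ psi))
        record.append((np.vdot(psi, sx0 @ psi).real, np.vdot(psi, sz0 @ psi).real))
    return np.array(record), psi


if __name__ == "__main__":
    M = 3
    psi0 = np.zeros(2**M)
    psi0[0] = 1.0                                     # all spins in |0> (sigma^z = +1)
    rec, _ = evolve(psi0, [0.5, -1.0, 2.0], dt=0.1, M=M)
    print(rec)
```

For the system sizes discussed in the text one would replace the dense diagonalization by a sparse or adaptive integrator; the sketch only shows how a drive trajectory is mapped to observable trajectories.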
Evaluation. In order to evaluate the performance of the fully trained NN and to show that it correctly predicts the dynamics for any type of time-dependent field, we test the NN on a new set of Gaussian random fields (with instances never seen during the training), as well as random periodic driving fields and different sorts of quenches. As for periodic drivings, we generate a set of functions of the form B(t) = A sin(ωt) with random values for A and ω. As for quenches, we generate a set of step functions where the heights of the steps and the time at which the quench occurs are chosen randomly. See Appendix B.2 for more details related to the range of parameters that we choose for generating our periodic and quench trajectories.
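The two evaluation classes described above can be sketched as simple function factories. The sampling ranges (`A_max`, `omega_max`, `B_max`, `t_max`) are hypothetical placeholders; the ranges actually used are given in Appendix B.2 and are not reproduced here.

```python
import math
import random


def periodic_drive(A, omega):
    """Return B(t) = A sin(omega t)."""
    return lambda t: A * math.sin(omega * t)


def quench_drive(B_before, B_after, t_quench):
    """Return a step function that switches from B_before to B_after at t_quench."""
    return lambda t: B_before if t < t_quench else B_after


def sample_drive(kind, rng, A_max=2.0, omega_max=3.0, B_max=2.0, t_max=7.0):
    """Draw a random trajectory of the requested kind (ranges are assumptions)."""
    if kind == "periodic":
        return periodic_drive(rng.uniform(0.0, A_max), rng.uniform(0.0, omega_max))
    if kind == "quench":
        return quench_drive(rng.uniform(-B_max, B_max),
                            rng.uniform(-B_max, B_max),
                            rng.uniform(0.0, t_max))
    raise ValueError(f"unknown drive kind: {kind}")


if __name__ == "__main__":
    rng = random.Random(0)
    B = sample_drive("quench", rng)
    print([round(B(0.5 * k), 3) for k in range(15)])
```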

Results
In order to discuss the power of our approach in predicting the dynamics of a quantum many-body system, we inspect the prediction accuracy as a function of evolution time. We will also highlight two additional capabilities of our approach: predicting the dynamics on a time window longer than those observed during training for (i) the system size that the NN was trained on, and for (ii) the infinite-system-size limit, based only on training data for finite-size systems.

Prediction accuracy
For the sake of illustration, we consider a spin ring of size M = 7 and train the NN on random Gaussian driving fields up to time Jt = 7, where finite-size effects are already apparent. We then evaluate the performance of the NN for three classes of driving fields up to time Jt = 20. Specifically, in the first column of Fig. 3, we evaluate the NN on a new set sampled from the class of driving fields that the NN has been trained on, namely Gaussian random fields. In the next two columns (highlighted in pink), we consider classes of driving fields that the NN has never seen explicitly, namely periodic driving fields and different sorts of quenches. In panels (a-c), we compare the predicted evolution of $\langle\sigma^x_j(t)\rangle$ and of the two-point correlators $\langle\sigma^z_j(t)\sigma^x_{j+\ell}(t)\rangle$ against their true evolution, for the driving fields shown at the top of each panel. The spins are initially prepared in random uncorrelated translationally-invariant states for the case of Gaussian driving fields, and we chose to initialize them along the z direction for the other two types of driving fields. As can be appreciated, the LSTM-NN is able to predict the dynamics very well for the time interval that it has been trained on and, remarkably, also to extrapolate to a longer time window (indicated by the light blue areas). We attribute both to the fact that the LSTM-NN has memory, so that it is able to build up by itself an implicit representation of the higher-order correlators using a non-Markovian strategy. We discuss this in more detail in Sec. 5.3. The results also confirm that training on Gaussian random driving fields is sufficient for the NN to learn all the essential features required for predicting the dynamics for arbitrary driving fields.
In order to show that the above-mentioned results hold beyond the particular driving trajectories that we displayed in Fig. 3(a-c), we present in Fig. 3(d-f) the time evolution of the mean square error (MSE). Throughout the paper, the MSE is defined as the quadratic deviation between the true dynamics and the predictions obtained from the NN, averaged over all samples and all the selected observables, where $O_j$ represents the expectation value of one of the observables of interest. In this case, we averaged over 1000 realizations of Gaussian fields (d), periodic driving fields (e), and different sorts of quenches (f). In all cases, the spins are initialized in a randomly chosen, uncorrelated, translationally invariant state.
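In symbols, the verbal definition of the error measure can be written as follows. The sample index $s$, the counts $N_s$ (samples) and $N_o$ (observables), and the superscripts "true" and "NN" are notation introduced here for illustration and may differ from the authors' own.

```latex
\mathrm{MSE}(t) = \frac{1}{N_s N_o} \sum_{s=1}^{N_s} \sum_{j=1}^{N_o}
  \left[ O_j^{\mathrm{true},(s)}(t) - O_j^{\mathrm{NN},(s)}(t) \right]^2
```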
The light blue areas in Fig. 3 (and all figures of the paper) denote the time interval that the NN has not been trained on. As one can see, also on this averaged level the predictions of the NN are valid even beyond the time window it has been trained on (remarkably, even for quenches that occur after the training time interval).
In Fig. 4, we analyse how the prediction accuracy depends on the size of the training dataset. For this purpose, we plot, for training sets of varying size, the error $\sqrt{\mathrm{MSE}}$ against the total number of training instances seen by the NN. As usual, training proceeds via epochs, each of which cycles once through the entire training data set. This plot verifies that 50,000 samples are already sufficient to achieve high accuracy, but smaller training sets, down to 10,000 samples, can already achieve reasonable learning success. This may be important for experimental implementation, where generating training data can be time-consuming, depending on the runtime for an individual shot of the experiment. We refer the reader to Appendix D for a more detailed discussion of the resources required to train the NN in terms of data set size and NN size.
Let us point out that, in general, for stronger transverse magnetic fields, it becomes more challenging for the NN to approximate the dynamics over a longer time window. We attribute this to the fact that for very large transverse fields, the expectation values of spin operators oscillate quite rapidly. Still, as can be seen from the plots above, the prediction of the NN for the range of driving amplitudes that we have considered here (see Appendix B for more details) remains valid most of the time up to Jt = 20, substantially longer than the training time interval.

Extrapolating the dynamics to the infinite-system-size limit
We now turn to the question whether one could train a NN on a finite-size system and yet have it extrapolate the dynamics to the infinite-system-size limit. We have found that this goal can be achieved by training the NN for a sufficiently large system size up to a point in time where the finite-size effects have not yet appeared. As we will show in this section, once the NN (trained in this manner) is asked to extrapolate the dynamics to longer times, the predictions fit the infinite-system-size-limit dynamics.
In our example, we train the NN on system size M = 15, which is the largest system size for which we are still able to generate training samples in a reasonable time by solving the many-body Schrödinger equation. For a given system size, the time window for which the observed dynamics is effectively indistinguishable from the infinite-system-size limit depends on the chosen driving field and also on which observable we are monitoring. Moreover, we have discovered an even stronger dependence on the initial state of the spins. For example, for system size M = 15, given initial spin states closely aligned with the magnetic field (paramagnetic initial condition), finite-size effects appear quite early, say for Jt ∈ [3,15]. However, if we initially prepare the spins in other directions, finite-size effects appear for Jt ∈ [7,25] for most observables. Therefore, in the following, we present results for initial paramagnetic states, with the NN trained up to Jt = 3, while for any other type of initial states training occurs up to Jt = 7.
In Fig. 5, we illustrate the power of our LSTM-NN in extrapolating the dynamics to the infinite-system-size limit, under different sorts of quenches. To verify the power of our NN in extrapolating to this limit, we compared the NN prediction against the largest system size (M = 21) that we have still been able to solve using the exact Schrödinger evolution (for the particular TFI model, we could also have resorted to the known exact solution, but in general, for arbitrary models, that would not be possible). We investigated two scenarios: (i) Quench from a "ferromagnetic" phase to the "critical point": the spins are initially prepared in a ferromagnetic state $\bigotimes_i |\uparrow\rangle_i$, and we quench to the critical point (B is suddenly changed from 0 to B = J). In Fig. 5(a), we show the power of the NN in predicting and extrapolating $\langle\sigma^z_i\rangle$ under this quench. As in the previous figure, the region highlighted in light blue corresponds to times beyond those on which the NN has been trained. We show the true dynamics for different system sizes in order to appreciate how finite-size effects (here in the form of revivals) appear later for larger sizes. Since we train up to a time where finite-size effects have not yet appeared, the NN is able to extrapolate the dynamics to the infinite-system-size limit, as is apparent from the plot (it shows no revivals).
(ii) Quench from a "paramagnetic" phase to a "ferromagnetic" phase: the spins are initially prepared in the paramagnetic state $\bigotimes_i |\rightarrow\rangle_i$, where $|\rightarrow\rangle$ is the +1 eigenstate of $\sigma^x$, and we quench into a ferromagnetic phase (B is suddenly changed from 2.8J to −0.1J). We present the performance of the NN for the cases where it was trained up to Jt = 7 and Jt = 3, for which the finite-size effects are already present or not, respectively. As can be seen, when training is performed on the longer time interval, the NN predictions match the evolution for the system size that it has been trained on (M = 15 in this case), including the corresponding finite-size effects. Remarkably, for a shorter training time interval where finite-size effects are not yet visible, the NN predictions are closer to the larger system size M = 21, approaching the infinite-size limit.
This behaviour can also be observed when averaging over many different trajectories. In general, we observe that when we initially prepare the spins in the paramagnetic initial state (i.e. aligned along the x direction), it is more difficult for the NN to extrapolate the dynamics to the infinite-system-size limit. More specifically, in that case the extrapolation to the infinite-system-size limit is valid for a shorter time window. We attribute this to the fact that finite-size effects appear quite early in this case. We examined different sorts of quenches with the spins in (i) a ferromagnetic initial state (aligned along z) and (ii) a paramagnetic initial state. For each class, we evaluated the NN on 30 realizations. We observed that the NN is able to extrapolate for a longer interval and with a higher precision for the ferromagnetic initial state. In Fig. 5(c), we show $\sqrt{\mathrm{MSE}}$ for the (more challenging) paramagnetic initial state. The green (pink) curve shows the deviation of the NN prediction from the exact evolution for M = 15 (M = 21). As one can see, the NN predictions are closer to the larger system size M = 21 than to the system size that it has been trained on (M = 15), verifying the power of the NN in extrapolating to the infinite-system-size limit. After a while, both curves approach each other, which is due to the fact that finite-size effects for system size M = 21 start to appear as well.

Implicit construction of higher-order correlators by the neural network
Even though we are only interested in predicting the dynamics of single-spin observables and two-point correlators, it is a well-known fact that the dynamics of those depends in turn on higher-order correlators. For example, in the particular model (TFI spin ring) discussed above, the equations of motion for the first- and second-order moments involve third-order moments; in these equations, $n, m, k \in \{x, y, z\}$ and $\epsilon_{ijk}$ denotes the Levi-Civita symbol. The fact that the NN is able to approximate the dynamics with high accuracy for long times even though it has not been trained on predicting higher-order correlators implies that it is able to build up some internal representation of the latter by itself. In order to verify this, we compare the NN predictions against the well-known 'Gaussian' approximation that is obtained by truncating the equations of motion at the level of two-point correlators. We apply the Gaussian moment theorem [27] (for arbitrary Gaussian operators A, B, and C) to the third-order moments and obtain a closed system of equations for the first- and second-order moments. In Fig. 6, we compare the evolution of $\langle\sigma^x_i\rangle$ and $\langle\sigma^x_i \sigma^y_{i+2}\rangle$ obtained from these truncated equations, from the NN, and from the exact dynamics. As can be seen, the dynamics extracted from the Gaussian approximation is valid only during a short time, while the prediction of the NN holds during much longer times. This verifies that the NN is successful in finding an implicit representation for higher-order correlators using a non-Markovian strategy (memory). To illustrate that this is generally true, we provide a much more detailed comparison in Appendix F.
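For reference, the standard third-order factorization that follows from setting the third cumulant to zero is shown below. This is the textbook form; the operator-ordering convention for non-commuting operators is not fixed here and may differ from the one used in [27] and in the authors' truncated equations.

```latex
\langle A B C \rangle \approx
  \langle A B \rangle \langle C \rangle
  + \langle A C \rangle \langle B \rangle
  + \langle B C \rangle \langle A \rangle
  - 2 \langle A \rangle \langle B \rangle \langle C \rangle
```

Substituting this expression for every third-order moment is what closes the hierarchy at the level of first- and second-order moments.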

Speedup
We recall that, as mentioned before, our approach offers advantages that go much beyond any possible speedups vs. direct numerical simulations, such as, in particular, the ability to learn the dynamics of an unknown experimental system. However, let us still comment on the speedup that can be achieved.
The speedup achieved by our trained NN depends on the algorithm which it replaces, as well as the particular implementation. As we do not put any constraints on the dimensionality of the physical system, or any other aspects, the natural point of comparison is the direct simulation of the Schrödinger equation (rather than powerful approximate methods like matrix-product states that are more restricted in their use). Of course in this work we just explored models in one dimension. However, when moving to higher space dimensions typically systems thermalize more easily via interactions, which means the evolution of observables can become even simpler (despite the large growth of entanglement, presenting a problem for wave-function-based methods). When that is the case, the NN should be able to learn even more easily.
In setting up the comparison between the NN and a direct numerical integration of the Schrödinger equation, we have to be careful. It would be unfair to impose the default (very stringent) numerical accuracy requirements on the numerical integrator, since the NN only gives approximate predictions. For that reason, we relaxed the accuracy setting until the numerical integrator had errors on the order of the typical deviations observed in the NN. In addition, we allowed adaptive stepsize for increased efficiency (See Appendix G for more details). Besides, we selected the sampling step, for which the integrator returns the expectation values, as large as possible. The NN is asked to predict the expectation values on a grid that is twice as fine.
Any direct Schrödinger evolution of a quantum many-body system involves an effort (memory and time) growing exponentially with system size, which is avoided by the NN approach as one can see in Fig.  7. In this figure, we compare how the runtime scales versus system size M , comparing a single evaluation of the NN with the direct Schrödinger evolution up to time Jt = 20. Both algorithms were run on the same CPU, and the numbers given are averaged over 1000 instances. As can be seen, the runtime for the NN scales linearly versus system size (see inset). Note that here we trained the NN on predicting all spatial two-point correlators of the spin operators, meaning that the size of the output layer of the NN increases linearly with system size. We do not see any reason that prevents such a favourable scaling for larger system sizes. Such polynomial scaling should even hold for inhomogeneous models.
Despite our efforts to account for aspects of accuracy, the speedup comparisons made above may still seem unfair at first glance. First, we have not included the effort for training here. Second, the NN is only able to predict the dynamics for the subset of observables it has been trained on; in contrast, the Schrödinger evolution provides access to the full quantum state. Concerning the first point, we argue that there are many practically relevant applications which demand exploration of a much larger number of samples than those required to train the NN, and in this case the speedup by the NN eventually outweighs the training effort. Examples include rapid exploration of many different parameter values or driving protocols, and pulse engineering, where the goal is to find the driving trajectories that lead to a desired behaviour of the system. Regarding the second point, we emphasize that there is an important class of applications where we are interested in the dynamics of a subset of observables and not in the full quantum state. In addition, in any typical experiment, only such a subset of observables is accessible in the first place. All of these aspects taken together make the comparison introduced here relevant for practical applications.
To provide some illustrative numbers (with all the caveats regarding dependence on computer hardware and algorithm), for a single instance at system size M = 15, the LSTM-NN is able to predict the dynamics up to time Jt = 20 in 15 ms, while solving the Schrödinger equation directly for the same instance takes about 16 s. Thus, not taking into account the training effort, this particular example shows a speedup by a factor of roughly 1000.
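The arithmetic behind these numbers, and the break-even argument about training effort, can be sketched as follows. The break-even model is a simplification introduced here: it assumes that all training samples are produced by direct simulation at the same per-instance cost and treats the network training time `t_training` as a free parameter (set to zero in the example because it is not quoted in the text).

```python
import math


def speedup(t_direct, t_nn):
    """Ratio of per-instance runtimes (direct simulation vs. NN evaluation)."""
    return t_direct / t_nn


def break_even_evaluations(n_train, t_direct, t_nn, t_training=0.0):
    """Number of evaluations from which the NN route is at least as cheap overall.

    NN route cost:     n_train * t_direct + t_training + n_eval * t_nn
    Direct route cost: n_eval * t_direct
    """
    return math.ceil((n_train * t_direct + t_training) / (t_direct - t_nn))


if __name__ == "__main__":
    t_direct, t_nn = 16.0, 0.015            # seconds, values quoted in the text
    print(f"speedup ~ {speedup(t_direct, t_nn):.0f}")
    print("break-even evaluations:", break_even_evaluations(50_000, t_direct, t_nn))
```

Under these assumptions, the break-even point lies only slightly above the number of training samples, so any application that needs substantially more evaluations than training samples benefits from the NN.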
Although we have not performed a systematic study of how the hidden layers should scale with system size to reach a certain accuracy, we can comment on what we infer from our observations on the models that we explored. For the translationally invariant models (with a cutoff for the range of correlators), the number of hidden layers does not scale up with system size. A more complicated scenario, that we do not explore in this work, is for spatially inhomogeneous models where the number of output neurons can naturally scale linearly with system size. For such cases, one needs to apply a convolutional neural network (CNN) to deal with the spatial dimension, combined with the LSTM-NN that tracks the time evolution. Then even though the number of output neurons scales (linearly) with system size, the nature of a CNN implies that the number of trainable parameters (weights) is still independent of system size. The only exception would be at or near a critical point, where very long-range correlations develop, which may make it necessary to grow the CNN weight kernels (or number of layers) with the system size, but not stronger than linear.
We also point out that NN training can exploit parallelism with respect to static parameters in the Hamiltonian. This means if a NN has been trained on a full range of these values (while supplying the parameter values as additional input), the result will be a NN that makes better predictions than what would have been obtained when using the same overall amount of training samples to train several separate NNs, each for a specific parameter value (or a specific small parameter interval). The reason is that the "meta-network" (that has been trained on all different parameter values simultaneously) effectively reuses what it has learned for one parameter value to improve its predictions at other values.

Predicting the dynamics of quantum nonintegrable models
The TFI model with an arbitrary time-dependent transverse driving field is a quantum-integrable model at every instant of time. This means that at each time, it can be readily diagonalized by a linear transformation of the fermionic operators that result from applying the Jordan-Wigner transformation [29] to map this spin model to a quadratic fermionic model. Thus, one might ask whether the success of the NN predictions is due to the integrability. In order to rule out this potential explanation, we discuss the power of our NN for a more general, non-integrable Hamiltonian as well. We add a longitudinal field to the TFI Hamiltonian, which is known to break integrability. We are interested in investigating how the power of the NN in predicting the dynamics is related to the strength of the integrability-breaking parameter g. Thus, we train the NN on the above Hamiltonian with Gaussian random transverse driving fields for a fixed value of g and evaluate its performance for the same g on a different class of driving fields. We eventually repeat this comparison for a set of different g.
In Fig. 8 (a), we compare the predicted and true evolution of $\langle\sigma^x_i\rangle$ under the quench trajectories shown, for different values of g. Evidently, the NN is able to predict the dynamics successfully both within the time window that it has been trained on (white regions) and beyond it (light blue regions).
For a more quantitative analysis, we show in Fig. 8 (b) the prediction error versus time, for different sorts of quenches. For each coupling g, we average over 1000 realizations and over all our observables. We note that, in general, it is not fair to compare the error for models with different values of g, as they have completely different dynamics. But at least Fig. 8 (b) demonstrates that the NN also succeeds in predicting the dynamics of the investigated non-integrable models, with various different integrability breaking parameters g. Nevertheless, as can be seen in this figure, there is a tendency for the NN predictions to be less accurate for the non-integrable models. Evaluating the NN on more classes of driving fields, we observed that less accurate predictions are more evident for smaller system sizes at long times. In such cases the NN has a tendency to predict trajectories close to the infinite-system-size limit, rather than the given system size that it has been trained on. At this moment, it remains unclear whether this is a consequence of the model being non-integrable. It is also unclear whether this would hold across a wider range of physical models, as we observed the same behaviour for the Heisenberg model with time-dependent exchange couplings, which is known to be quantum integrable at each instant. In Appendix H, we present some additional analysis, for Gaussian random driving fields.
Moreover, in Fig. 8 (c), we show how entanglement grows with time for the model (5). Specifically, we calculate the von Neumann entropy of half the spin system. As is evident, the entanglement already grows significantly in the time window for which the NN successfully predicts the dynamics (both within the training interval and beyond that). This again verifies the power of the NN in constructing an implicit representation of complicated entangled quantum many-body states, in terms of higher-order correlators, as already discussed in Sec. 5.3.

Other training strategy and other neural network architectures
All our analysis so far has relied on the LSTM-NN that has to predict the observables when receiving the driving trajectory as input. To put these results into perspective, we have analyzed a variety of other approaches, including other NN architectures and another training strategy, which we will now discuss.
As an alternative to the training strategy introduced in Sec. 3, which we will refer to as "full-evolution training", we now discuss a second training strategy that we can apply to our many-body dynamics prediction task. We call this strategy "step-wise training". In this strategy, we ask the NN to learn a set of Markovian evolution equations (Fig. 9). We start by feeding into the NN the initial value of the observables of interest, together with the initial value of time and the value of the driving field. We train the NN to predict the value of the observables one time step ahead. Moving on to the next time step, we feed into the NN the previously predicted observables, together with the current value of time and the driving field. The NN then predicts the observables for the next time step, and we keep going until we reach the last time step. We stress that this strategy explicitly prevents the NN from directly building up any internal representation of higher-order correlators as long as it is not recurrent, so it represents a useful benchmark. Of course, the NN could still try to approximately infer some higher-order correlators from the observables which it is supplied with as input at any given time, provided that the dynamics does not explore completely arbitrary states.
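The prediction loop of the step-wise strategy can be sketched as an autoregressive rollout. The callable `step_model` stands in for the trained network and is purely hypothetical; its signature (observables, time, drive value) follows the description above.

```python
def rollout_stepwise(step_model, obs0, times, drive):
    """Roll out predictions one step at a time.

    step_model(obs, t, B) -> observables one time step later (hypothetical
    stand-in for the trained NN). Each prediction is fed back as the next
    input, so errors can accumulate along the trajectory.
    """
    obs = list(obs0)
    trajectory = [obs]
    for t, B in zip(times[:-1], drive[:-1]):
        obs = list(step_model(obs, t, B))
        trajectory.append(obs)
    return trajectory


if __name__ == "__main__":
    # Toy stand-in model: each observable is shifted by the current drive value.
    toy_model = lambda obs, t, B: [o + B for o in obs]
    print(rollout_stepwise(toy_model, [0.0], [0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0]))
```

Because the model only ever sees the current observables, time, and drive value, any memory of earlier times has to be encoded in the observables themselves, which is the Markovian restriction discussed above.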
In addition to applying both of these training strategies, we also explore other types of architectures for the NN. For example, inspired by the time-translational symmetry of the dynamics together with the fact that CNNs can be made to respect causality and can also effectively represent memory effects, we discuss the power of a temporal 1D-CNN in predicting the dynamics. Moreover, we explore the power of a fully-connected neural network (FCNN) on this task.
Next, we compare the performance of all of these training strategies and architectures by concentrating on a TFI model with system size M = 15. We train the NN up to Jt = 7 on Gaussian random driving fields, and evaluate all architectures on the same time interval on a new set of Gaussian random driving fields (Fig. 10 (a)) as well as different sorts of quenches (Fig. 10 (b)). For each class, we averaged over 1000 realizations, and over all the selected observables. For all samples, the spins are initially prepared in a random translationally-invariant product state.
Among the different approaches, it is clear that the previously analyzed combination of the "full-evolution training" strategy and the LSTM-NN is superior. Notably, this scheme also outperforms the 1D-CNN, which can in principle apply a non-Markovian strategy to capture the dependencies on the past at each time. We attribute this to the fact that the LSTM-NN is able to decide in each step in a systematic way which long-term memories to keep. In more detail, it is able to capture both short- and long-term dependencies, while this is not the case for the CNN, where the choice of temporal kernel size also restricts the memory time span (see Appendix C.2 for details related to the layout of the 1D-CNN architecture, the optimal kernel size, and also error versus time for different kernel sizes).
The green and dark blue curves in Fig. 10 compare the power of a FCNN applying both training strategies. For the step-wise training, the error grows faster at the beginning. This may happen for two reasons. First, in each step the NN has access to the information from just one step ago. Therefore, it does not have a chance for building an implicit representation for the higher-order correlators that develop with time. Second, in each step we feed to the NN predictions from one step ago, which leads the NN to accumulate error at each step. For all of these reasons, the "full-evolution training" strategy looks more promising and indeed performs better.
It is impressive that an FCNN applying the full-evolution training strategy still has a reasonable performance up to some point, even though it has no built-in notion of causality and the architecture is not chosen to efficiently encapsulate memory. We attribute this reasonable performance to the all-to-all communication between neurons and the fact that the FCNN does have access to the full driving trajectory. Essentially, it needs to learn causality on its own (but is of course less efficient than a dedicated architecture like LSTM, e.g. in terms of the total number of weights being much larger for the FCNN). See Appendix C.1 for details related to the layout of the FCNN.

Application in experiments
We briefly indicate how our method would be applied to learn and predict quantum many-body dynamics in actual experiments.
Obtaining the time evolution of any expectation value in an experiment involves several steps. For each time point t, the experiment has to be run n times under a fixed driving trajectory until that point, when the measurement is performed. This will yield the expectation value up to an error $\sim 1/\sqrt{n}$ set by the projection noise. Afterwards, data has to be obtained for other time points and other driving trajectories. Overall, this requires $n N_t N$ runs, where N is the number of different driving trajectories (training samples) and $N_t$ the number of time points. The number of runs is substantial, e.g. on the order of $10^{10}$ if we assume $N \sim 10^4$, $N_t \sim 10^2$ and $n \sim 10^4$ (for a 1 per cent projection noise error). Still, in platforms where a single run takes on the order of only a few microseconds, like superconducting qubits, this becomes feasible. The response of correlated quantum materials under fast high-field light pulses may be an even more favourable area of application. This is due to the shorter time scales involved and due to the parallelism stemming from the simultaneous observation of many copies of the system (many unit cells of the crystal providing a contribution to the overall signal, which is directly a measure of the expectation value rather than a single projected measurement value). In general, multiple commuting observables can be measured in parallel; only non-commuting observables require separate runs. In the future, it will be an interesting challenge to explore how well a NN can deal with noisier measurement data, reducing the need to accumulate statistics. Some efficiency improvements are possible, e.g. one might allow for a larger statistical error (smaller n) if the time points are closely spaced, because the NN will then effectively try to interpolate smoothly through the noisy observations.
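The run-count estimate above can be reproduced with a few lines of arithmetic. The per-run duration of 1 microsecond used in the example is an assumption chosen for illustration (the text only states "a few microseconds"), and the estimate ignores reset and readout overheads.

```python
import math


def total_runs(n_shots, n_times, n_trajectories):
    """Total number of experimental runs: n * N_t * N."""
    return n_shots * n_times * n_trajectories


def projection_noise(n_shots):
    """Statistical error of an expectation value estimated from n shots."""
    return 1.0 / math.sqrt(n_shots)


def wall_clock_hours(runs, seconds_per_run):
    """Pure acquisition time, ignoring any per-run overhead."""
    return runs * seconds_per_run / 3600.0


if __name__ == "__main__":
    runs = total_runs(n_shots=10**4, n_times=10**2, n_trajectories=10**4)
    print("runs:", runs)
    print("projection noise:", projection_noise(10**4))
    print("hours at 1 microsecond per run:", wall_clock_hours(runs, 1e-6))
```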
One might also use tricks from the machine learning toolbox, like pre-training on simulated data (with a model that approximates the experiment at least roughly), which will condition the NN for more efficient learning on the actual experimental results.

Conclusion
In this work, we have introduced an approach that exploits the power of deep learning for predicting the dynamics of non-equilibrium quantum many-body systems. The training is based on monitoring the expectation values of a subset of observables of interest under random driving trajectories. We demonstrated that the trained RNN is able to produce accurate predictions for driving trajectories entirely different than those observed during training. Moreover, it is able to extrapolate to time spans and system sizes larger than those encountered during training.
Our scheme is model-free, i.e. learning requires no knowledge of the underlying quantum state or the correct evolution equations. In the future, this feature will allow for applications on actual experimental data, generated from a system that might even be open, noisy, inhomogeneous, or non-Markovian. This is in contrast to approaches which rely on precise knowledge of the underlying Hamiltonian, like exact Schrödinger evolution, tensor network wave functions, or recent methods based on NN representations of quantum states. Another independent benefit is the significant speedup that could be used, e.g., for pulse engineering via gradient-descent techniques or the exploration of dynamical phase diagrams. More advanced, feedback-based control strategies for quantum many-body systems might eventually be discovered from scratch in an efficient manner by combining the method introduced here with deep reinforcement learning.

Data Availability
All data supporting the findings of this study are available within the article and supplementary information or from the first author upon request.

Code Availability
The code used to generate the results of this study is available from the first author on request.

Figure 11: An LSTM-NN structure made of a chain of repeating modules, where each module includes three gates.

Acknowledgments
perceived information from the past, called the hidden input ($h_{t-1}$) [36]. Such a NN is able to record the history for, in principle, arbitrarily long times, since the weights are not time-dependent and therefore the number of trainable parameters does not grow with the time interval. For training such a NN, the gradient of the cost function needs to be backpropagated from the output towards the input layer, as in feedforward NNs, and also backward along the time axis.
However, RNNs are prone to run into a fundamental problem, the "vanishing/exploding gradient problem", i.e., the size of the gradient decreases (or sometimes increases) exponentially during backpropagation. In principle, this problem can also occur in traditional feedforward NNs, especially if they are deep. However, the effect is typically much stronger for RNNs, since the time series can get very long. This seriously limited the trainability of early RNN architectures, which were not capable of capturing long-term dependencies. The problem led to the development of RNNs with cleverly designed gated units (controlling memory access) that avoid the exponential growth or vanishing of the gradient and therefore permit the training of RNNs that capture long-term dependencies. The first such architecture is called long short-term memory (LSTM), developed by Hochreiter and Schmidhuber in the late 90s [21]. As an RNN architecture, a standard LSTM is also built of a chain of repeating modules, as shown in Fig. 11, where the repeating modules have a more complicated structure than in a simple recurrent network.
Each module includes three gates, where each gate is composed of a sigmoid NN layer together with a point-wise multiplication on top of it. Next, we explain step by step how these three gates together control how the memory is accessed. We label weights $w$ and biases $b$ by subscripts according to the name of the corresponding layer.
• Forget gate layer: this gate uses the hidden state $h_{t-1}$ from the previous time step and the external input $x_t$ at a particular time step $t$ (with the bias $b_f$ and the weight $w_f$) to decide whether to keep the memory or to discard the information that is of less importance, applying a sigmoid activation:
$$f_t = \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right).$$
Here $\sigma$ denotes the sigmoid function, $[h_{t-1}, x_t]$ the concatenation of hidden state and input, and the dot stands for matrix multiplication. Eventually, the output of the forget gate is multiplied with the module state of the previous step ($C_{t-1}$).
• Input gate layer: the operation of this gate is a three-step process. First, a sigmoid layer decides which data should be stored (very similar to the forget gate):
$$i_t = \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right).$$
Second, the hidden state and the current input are also passed into the tanh function, which pushes the values between $-1$ and $1$ to regulate the NN, and the result is stored in $\tilde{C}_t$:
$$\tilde{C}_t = \tanh\left(w_C \cdot [h_{t-1}, x_t] + b_C\right).$$
Third, the outcomes of the two previous steps are combined via multiplication, and this information is added to the module state ($f_t * C_{t-1}$):
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t.$$
Here the $*$ denotes element-wise multiplication.
• Output gate layer: the operation of this gate, which decides the value of the next hidden input, can be decomposed into two steps. First, a sigmoid layer is run on the previous hidden state and the current input, which decides what parts of the module state are going to be carried:
$$o_t = \sigma\left(w_o \cdot [h_{t-1}, x_t] + b_o\right).$$
Second, the module state is passed through tanh to squish the values to be between $-1$ and $1$, and is finally multiplied by the output of the sigmoid gate, so that we only pass to the next module some selected parts:
$$h_t = o_t * \tanh(C_t).$$

B Generating random driving trajectories
In this section we explain how we generate our random samples for training and also for evaluating the NN.

B.1 Sampling driving trajectories from a Gaussian random process for training the neural network
There are different methods to generate Gaussian random functions [25]. We will explain in detail the one we use. Discretizing the time interval of interest according to some step $\Delta t$, we define a vector $d = (D(0), D(\Delta t), D(2\Delta t), \ldots)^T$ that contains the trajectory of the time-dependent parameter. We build the correlation matrix $C$ with elements $C_{nm} = \langle D(n\Delta t)\, D(m\Delta t) \rangle$, where in this case we assumed a Gaussian correlation function with a correlation time $\sigma$ (though other functional forms could be used). Being real and symmetric, this matrix can be diagonalized as $C = Q \Lambda Q^T$, where $\Lambda$ is a diagonal matrix containing the eigenvalues and $Q$ is an orthogonal matrix. Hence, we can generate the random parameter trajectory as $d = Q \sqrt{\Lambda}\, x$, where the components of $x$ are independent random variables drawn from the unit-width normal distribution ($\langle x_n \rangle = 0$ and $\langle x_n x_m \rangle = \delta_{nm}$), which can be easily generated. Note that we could sample using Fourier series with random coefficients, but this would automatically generate periodic functions and thus introduce spurious correlations between the values of the drive at early and late times. That is why we avoided that method.
In order to make sure that our generated random functions are characteristic for a wide range of arbitrary time-dependent functions, we also choose the correlation amplitude $c_0$ and the correlation time $\sigma$ randomly for every training trajectory. The former is chosen uniformly from $c_0 \in [0, 4]$, corresponding to a maximum value of the magnetic field $|B|$ around 5, and the latter uniformly from the wide interval $\sigma \in [1, 9]$. Technically, this means we are sampling from a mixture of Gaussian processes.
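The sampling procedure of this subsection can be sketched in NumPy as follows. This is a minimal illustration rather than the code used for the paper: the function names are hypothetical, and the explicit kernel $c_0 \exp[-(t_n - t_m)^2 / (2\sigma^2)]$ is an assumption, since the text only specifies a Gaussian correlation function with amplitude $c_0$ and correlation time $\sigma$.

```python
import numpy as np


def sample_gaussian_trajectory(times, c0, sigma, rng):
    """Draw one trajectory d = Q sqrt(Lambda) x on the given time grid."""
    # ASSUMPTION: explicit form of the Gaussian correlation function.
    dt = times[:, None] - times[None, :]
    corr = c0 * np.exp(-dt**2 / (2.0 * sigma**2))

    # C is real and symmetric, so C = Q Lambda Q^T with orthogonal Q.
    eigvals, q = np.linalg.eigh(corr)
    # Round-off can make tiny eigenvalues slightly negative; clip them to zero.
    eigvals = np.clip(eigvals, 0.0, None)

    # Independent unit-width normal variables x_n.
    x = rng.standard_normal(len(times))
    return q @ (np.sqrt(eigvals) * x)


def sample_training_trajectory(times, rng):
    """Mixture of Gaussian processes: c0 and sigma are drawn anew per trajectory."""
    c0 = rng.uniform(0.0, 4.0)
    sigma = rng.uniform(1.0, 9.0)
    return sample_gaussian_trajectory(times, c0, sigma, rng)


rng = np.random.default_rng(seed=0)
times = np.linspace(0.0, 10.0, 50)
drive = sample_training_trajectory(times, rng)
```

Because $\langle d\, d^T \rangle = Q \sqrt{\Lambda} \langle x x^T \rangle \sqrt{\Lambda} Q^T = C$, the generated trajectories have the prescribed correlation matrix by construction.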

B.2 Generating random time-dependent functions for evaluating the network
As we explained in the main text, in order to evaluate the performance of the NN, and to illustrate that it succeeds in predicting the dynamics for different functional forms of the driving trajectories, we test the NN on random periodic driving fields and different sorts of quenches. As for the periodic drivings, we generate 1000 functions of the form $B(t) = A \sin(\omega t)$ with random amplitudes $A$ and frequencies $\omega$ chosen uniformly from the intervals $[-3, 3]$ and $[0.1, 4]$, respectively. As for quenches, we generate 1000 step functions where the heights of the steps are chosen randomly from the interval $[-3, 3]$, with the quench also happening at random times.
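A possible way to generate such evaluation sets is sketched below. The function names, the time grid, and the seed are hypothetical. The quench is assumed here to be a single step with a uniformly distributed quench time; the text does not specify the number of steps or the distribution of the quench times.

```python
import numpy as np


def periodic_drive(times, rng):
    """B(t) = A sin(omega t) with A in [-3, 3] and omega in [0.1, 4]."""
    amplitude = rng.uniform(-3.0, 3.0)
    omega = rng.uniform(0.1, 4.0)
    return amplitude * np.sin(omega * times)


def quench_drive(times, rng):
    """Step function with random heights in [-3, 3].

    ASSUMPTION: one step per trajectory, quench time uniform on the grid.
    """
    before, after = rng.uniform(-3.0, 3.0, size=2)
    t_quench = rng.uniform(times[0], times[-1])
    return np.where(times < t_quench, before, after)


rng = np.random.default_rng(seed=1)
times = np.linspace(0.0, 15.0, 151)  # hypothetical time grid
periodic_set = np.stack([periodic_drive(times, rng) for _ in range(1000)])
quench_set = np.stack([quench_drive(times, rng) for _ in range(1000)])
```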

C Neural network layout
In this section, we present the details regarding the layout of all the architectures that we have investigated. We have specified and trained all these different architectures with Keras [9], a deep-learning framework written for Python.

C.1 Fully connected and LSTM neural network
In Table 1, we present the layout of our deep fully connected neural network (FCNN) and the LSTM-NN. For the FCNN, the activation function for all layers except the last layer is the rectified linear activation function ("ReLU"). For the last layer the activation function is "linear". As optimizer we always use "adam".

NN architecture    L    N      Input size    Output size
FCNN               6    800    n+3           9(l-1)+3
LSTM               4    500    1+1+3         9(l-1)+3

Table 1: The layout of the deep fully connected neural network (FCNN) and of the LSTM-NN. Here $L$, $N$, and $n$ stand for the number of hidden layers, neurons per layer, and time steps, respectively, and $l$ denotes the largest distance between spins on a ring. The expression $9(l - 1) + 3$ counts the total number of observables that we choose for training throughout the examples discussed in the main text, containing all first and second-order moments (correlators) of spin operators. The number 3 is the number of parameters that identify the initial state: our initial product state is identified by the 3 expectation values of the first-order moments of the spin operators.
The most important difference between these two architectures is that the LSTM receives as input only the information for a single time step and outputs the predicted observables for that time step (as can be seen in the number of input and output neurons), eventually working its way sequentially through the whole time interval. Note that because the LSTM cells provide memory, all the information from previous time steps is still available. In contrast, the FCNN receives as input the full driving trajectory at once and also outputs the full time-dependent evolution of the observables throughout the whole time interval at once. For the LSTM, besides the driving field, we also feed in the current value of the time variable, which seems to help enhance the accuracy of the predictions (this is reasonable, since time-translation invariance is broken by the preparation of the initial state at time 0). In addition, both the FCNN and the LSTM receive as input information about the initial quantum state, in the form of the expectation values of the spin components (which is sufficient as we consider only product states as initial states).

Figure 12: The performance of a temporal 1D-CNN for different kernel sizes on the TFI model driven with Gaussian random driving fields ($M = 15$). The $\sqrt{\mathrm{MSE}}$ versus time ($Jt$) is averaged over all the selected observables and over 1000 realizations of Gaussian random driving fields.
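For orientation, the FCNN and LSTM layouts of Table 1 could be written in Keras roughly as follows. This is a sketch under stated assumptions, not the authors' code: the builder functions are hypothetical, the mean-squared-error loss is assumed, the LSTM is assumed to end in a linear dense layer applied at every time step, and the output size of the FCNN is left as a free parameter.

```python
from tensorflow import keras


def build_lstm(n_observables, n_layers=4, n_units=500):
    """Stacked LSTM mapping a per-step input of size 1+1+3 to the observables."""
    # Per time step: drive value, time, and 3 initial-state expectation values.
    inputs = keras.Input(shape=(None, 5))
    x = inputs
    for _ in range(n_layers):
        x = keras.layers.LSTM(n_units, return_sequences=True)(x)
    # ASSUMPTION: linear read-out applied independently at each time step.
    outputs = keras.layers.Dense(n_observables, activation="linear")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # ASSUMPTION: MSE loss
    return model


def build_fcnn(n_time_steps, n_outputs, n_layers=6, n_units=800):
    """Fully connected network receiving the whole drive (n values) plus 3 numbers."""
    inputs = keras.Input(shape=(n_time_steps + 3,))
    x = inputs
    for _ in range(n_layers):
        x = keras.layers.Dense(n_units, activation="relu")(x)
    outputs = keras.layers.Dense(n_outputs, activation="linear")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # ASSUMPTION: MSE loss
    return model
```

With, for example, $l = 3$ the number of observables is $9(l-1)+3 = 21$, so `build_lstm(21)` would return a model that maps an input of shape `(batch, time, 5)` to an output of shape `(batch, time, 21)`.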
We set the padding parameter as "padding=causal", which means that the output at time $t$ depends only on the current and previous time steps; therefore, the NN has a notion of causality. In Fig. 12, we also compare the performance of this architecture for different kernel sizes. As is evident, a larger kernel size leads to a better performance at longer times. However, above a kernel size of 10, the performance almost saturates for the 56 time steps that we considered here.
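The effect of causal padding can be illustrated with a small NumPy sketch. The function `causal_conv1d` is a hypothetical stand-alone illustration of the idea and not the Keras implementation: the input is padded with zeros on the left only, so that the output at step $t$ never sees later inputs.

```python
import numpy as np


def causal_conv1d(x, kernel):
    """1D convolution whose output at step t uses only x[t-k+1], ..., x[t]."""
    k = len(kernel)
    # Left-pad with k-1 zeros so that the output has the same length as x.
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(kernel, padded[t:t + k]) for t in range(len(x))])


rng = np.random.default_rng(seed=2)
x = rng.standard_normal(56)        # 56 time steps, as in the text
kernel = np.ones(10) / 10.0        # kernel size 10 (simple moving average)
y = causal_conv1d(x, kernel)

# Changing the input from step 30 onwards leaves the output before step 30 untouched.
x_changed = x.copy()
x_changed[30:] += 1.0
y_changed = causal_conv1d(x_changed, kernel)
print(np.allclose(y[:30], y_changed[:30]))
```

A single layer of this kind only looks back over the last few steps set by the kernel size, which is one plausible reason why larger kernels help at longer times.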

D Computational resources
In this section, we provide details related to the computational resources required for training. In Fig. 13(a), we show how the precision of the LSTM-NN in predicting the dynamics of the TFI model with $M = 7$ is related to the size of the NN.
We also discuss how, for the same model, the performance of the NN is related to the training set size. Fig. 13(b) shows how the error scales versus time for different training set sizes when the NN is trained on a fixed number of instances. We define the number of instances as the training set size multiplied by the number of epochs, where the number of epochs denotes the number of times that the NN goes through the entire training data set. As can be appreciated, 50,000 samples already provide a favourable performance. We observed more or less the same scaling behaviour (in terms of the size of the NN and the training set size) for the rest of the models and the different system sizes that we checked.
We generated the training samples on CPU nodes and trained the NNs on GPU nodes of a high-performance computing cluster. In Table 2, we provide details related to the required resources for the different models and system sizes that we have checked, given that 50,000 samples are provided for training.

E Performance of the LSTM network trained on first-order moments of spin operators
In this work we always trained the NN on all first and second-order moments of spin operators. However, this choice was for illustration purposes only. We demonstrate here that one does not need to train on all these observables to get the NN to learn the dynamics properly. In Fig. 14, we show the error versus time, comparing the two cases where we trained the NN either on just the first-order moments of spin operators or on all first and second-order moments of spin operators. These results are for a TFI model with $M = 11$. This figure verifies that one can successfully train the NN on fewer observables.

F Power of the LSTM network in implicit construction of higher-order correlators
In the main text, to illustrate the power of the LSTM-NN in building an implicit construction of higher-order correlators, we compared the performance of the LSTM-NN against truncated equations of motion within the Gaussian approximation for a typical driving field. To demonstrate that this holds more generally, also for other architectures, here we compare different architectures on 1000 realizations of different sorts of quenches. However, we emphasize that our goal in presenting this plot is not at all to compare the NN with a very naive approximation. Instead, we aim to illustrate that our NNs are able to implicitly construct higher-order correlators. As visible from Fig. 15, all architectures outperform the Gaussian approximation. It is surprising that, above some time, an FCNN that is trained in a step-wise manner and has no chance of constructing an implicit representation of higher-order correlators still outperforms the Gaussian approximation. We imagine that this is due to the fact that the dynamics does not explore completely arbitrary states in this case, and therefore the NN can still approximately infer some higher-order correlators from the observables which it is supplied with as input at any given time.

G Accuracy in solving the Schrödinger equation
As we indicated in the main text, for solving the Schrödinger equation we used QuTiP [22], which uses SciPy and the "zvode" method for the integration. To compare the runtime of the integrator with that of the NN, we relaxed the accuracy of the integrator by increasing the relative and absolute tolerances, denoted by "rtol" and "atol", respectively. This results in a larger integration step. In Fig. 16, we compare the error of the NN in predicting the dynamics of the observables with the error of the numerical integration for different relative and absolute tolerances. Note that we trained the NN on the dynamics obtained by our integrator with high accuracy, namely a small step size.
We define the MSE as the quadratic deviation of either the NN prediction or the dynamics obtained by our integrator with relaxed accuracy from the dynamics obtained by our integrator with high accuracy. As can be seen, setting "atol" to $10^{-5}$ and "rtol" to $10^{-3}$, the error in the numerical integration is comparable with the error of the NN. Therefore, as the runtime of the integrator, we take the runtime corresponding to this accuracy.
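The effect of relaxing the tolerances can be explored with a toy example. The sketch below does not use QuTiP; it calls SciPy's "zvode" integrator directly through `scipy.integrate.ode`. The single-spin Hamiltonian, the drive, the time grid, and the tight reference tolerances are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.integrate import ode

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)


def schrodinger_rhs(t, psi):
    """Right-hand side of i d(psi)/dt = H(t) psi for a driven single spin."""
    # ASSUMPTION: toy Hamiltonian H(t) = -B(t) sigma_x - sigma_z.
    field = 1.0 + 0.5 * np.sin(t)
    hamiltonian = -field * SX - SZ
    return -1j * (hamiltonian @ psi)


def sigma_z_dynamics(times, atol, rtol):
    """Integrate with zvode and return <sigma_z> at the requested times."""
    solver = ode(schrodinger_rhs)
    solver.set_integrator("zvode", method="adams", atol=atol, rtol=rtol)
    psi0 = np.array([1.0, 0.0], dtype=complex)
    solver.set_initial_value(psi0, times[0])
    values = [np.real(np.vdot(psi0, SZ @ psi0))]
    for t in times[1:]:
        psi = solver.integrate(t)
        values.append(np.real(np.vdot(psi, SZ @ psi)))
    return np.array(values)


times = np.linspace(0.0, 10.0, 101)
reference = sigma_z_dynamics(times, atol=1e-12, rtol=1e-10)  # high accuracy
relaxed = sigma_z_dynamics(times, atol=1e-5, rtol=1e-3)      # tolerances from the text

# Root of the quadratic deviation between relaxed and high-accuracy dynamics.
rmse = np.sqrt(np.mean((relaxed - reference) ** 2))
print(f"sqrt(MSE) of the relaxed integration: {rmse:.2e}")
```

Timing the two calls, for instance with `time.perf_counter`, would then give the kind of runtime comparison described in this section.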

H Power of the LSTM network in predicting the dynamics of a non-integrable model
In Fig. 17, we show the performance of the LSTM-NN in predicting the dynamics of the model (5) presented in Sec. V-E in the main text, subject to Gaussian random transverse fields (as opposed to quenches). We show the performance for a single instance (Fig. 17(a)) and over 1000 random realizations for different values of $g$ (Fig. 17(b)). It seems that the NN is less accurate in predicting the dynamics when $g \neq 0$ in comparison with $g = 0$. This is more evident for $M = 7$. As we indicated in the main text, we observed for small system sizes that the NN has less power for $g \neq 0$ to predict the finite-size effects. In such cases the NN has a tendency to predict trajectories close to the infinite-system-size limit.
For $M = 13$, the errors for non-integrable models become smaller at longer times. However, this does not imply that the NN has a better performance. As is visible in panel (b), the reason is that at longer times the dynamics for non-integrable cases relaxes to an almost fixed value.

I Heisenberg model
In this section, we illustrate the power of the LSTM-NN in predicting the dynamics of the 1D Heisenberg model. We take advantage of this example to illustrate the power of our NN for models where we drive the coupling constants (instead of the fields). We train the NN on the dynamics of the Heisenberg Hamiltonian with exchange couplings $J_x$, $J_y$, and $J_z$, where $J_x$ is a homogeneous Gaussian random function and we set $J_z = 1$ and $J_y = 1$.
We train the NN on system size $M = 9$ up to $Jt = 7$ on Gaussian random exchange couplings. Eventually, we evaluate the NN in predicting the dynamics for the same system size up to $Jt = 15$ for a new set of random exchange couplings as well as for quenches of the couplings. In Fig. 18(a), we show the performance of our LSTM-NN in predicting the dynamics under the Gaussian random exchange coupling in the $x$ direction shown at the top of the panel. We compare the true and predicted evolution of $\langle \sigma^z_i \rangle$ and $\langle \sigma^x_i \sigma^y_{i+\ell} \rangle$ for different values of $\ell$. As can be seen, the NN is able to predict the dynamics very well for the time window that it has been trained on, as well as to extrapolate the dynamics to a longer time window, which is highlighted in light blue. In Fig. 18(b), we show the performance of the NN on a class of exchange couplings that it has never seen, namely a quench in $x$ as shown at the top of the panel. As can be appreciated, the NN is able to predict the shown first and second-order moments for the time window that it has been trained on and for longer times.
To illustrate that comparable results can be achieved for a wide variety of exchange couplings, we show in Fig. 18(c) and (d) the $\sqrt{\mathrm{MSE}}$, i.e. the square root of the quadratic deviation between the true dynamics and the predictions obtained from the NN, for the subset of observables of interest. For each class of exchange couplings, we averaged over 1000 realizations. These plots demonstrate the power of our LSTM-NN in predicting the dynamics of this model for the time interval that it has been trained on and beyond, for a large variety of samples. For both classes of exchange couplings, the spins are initially prepared in a randomly chosen translationally invariant product state.
Checking the performance of our NN for a few more system sizes, we observed that the NN is sometimes less accurate in predicting finite-size effects at longer times. In these cases the NN has a tendency to predict trajectories close to the infinite-system-size limit, rather than those of the given system size that it has been trained on. The reason for this remains open and needs to be explored further as part of future work.
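The error measure used in these plots can be computed along the following lines. This is a minimal sketch with a hypothetical function name and array layout; it assumes that the square root is taken after averaging the quadratic deviation over realizations and observables, which the text does not state explicitly.

```python
import numpy as np


def rmse_versus_time(y_true, y_pred):
    """sqrt(MSE) at each time step.

    Both arrays have shape (realizations, time steps, observables).
    ASSUMPTION: average over realizations and observables first, then take the root.
    """
    squared_deviation = (y_pred - y_true) ** 2
    return np.sqrt(squared_deviation.mean(axis=(0, 2)))


# Synthetic check: a constant offset of 0.1 gives sqrt(MSE) = 0.1 at every time.
y_true = np.zeros((1000, 60, 12))
y_pred = y_true + 0.1
error_curve = rmse_versus_time(y_true, y_pred)
```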