Quantum Deep Hedging

Quantum machine learning has the potential for a transformative impact across industry sectors and in particular in finance. In our work we look at the problem of hedging where deep reinforcement learning offers a powerful framework for real markets. We develop quantum reinforcement learning methods based on policy-search and distributional actor-critic algorithms that use quantum neural network architectures with orthogonal and compound layers for the policy and value functions. We prove that the quantum neural networks we use are trainable, and we perform extensive simulations that show that quantum models can reduce the number of trainable parameters while achieving comparable performance and that the distributional approach obtains better performance than other standard approaches, both classical and quantum. We successfully implement the proposed models on a trapped-ion quantum processor, utilizing circuits with up to $16$ qubits, and observe performance that agrees well with noiseless simulation. Our quantum techniques are general and can be applied to other reinforcement learning problems beyond hedging.


Introduction
In financial markets, hedging is the important activity of trading with the aim of reducing risk. For example, buyers and sellers of derivative contracts will often trade the asset underlying the derivative in order to mitigate the risk of adverse price movements. Classical financial mathematics provides optimal hedging strategies for derivatives in idealized frictionless markets, but for real markets these strategies must be adapted to take into account transaction costs, market impact, limited liquidity, and other constraints.
Finding optimal hedging strategies in the presence of these important real-world effects is highly challenging. Deep Hedging [1,2] is a framework for the application of modern reinforcement learning techniques to solve this problem. One starts by defining a reinforcement learning environment for the hedging problem and a trading goal of maximizing a risk-adjusted measure of cumulative future returns. Then, one can apply standard deep reinforcement learning algorithms, such as policy-search or actor-critic approaches, by designing neural network architectures to model the trading strategy and by defining a training loss function to find the optimal parameters that maximize the trading goal.
Beyond Deep Hedging, the applicability of machine learning to finance has grown significantly in recent years as highly efficient machine learning algorithms have evolved over time to support different data types and scale to larger data sets. For instance, supervised learning can be used for asset pricing or portfolio optimization [3,4], unsupervised learning for portfolio risk analysis and stock selection [5,6], and reinforcement learning for algorithmic trading [7,8]. At the same time, machine learning has been identified as one of the most important domains of applicability of quantum computing, given the potential ability of quantum computers to solve classically-intractable computational problems [9], perform linear algebraic operations efficiently [10], compute gradients [11], and provide variational-type approaches [12]. Such techniques have already been considered for financial use cases [13][14][15][16][17][18], and in fact finance is estimated to be one of the first industry sectors to benefit from quantum computing [19,20].
In this work, we develop quantum deep learning methods and show how they can be a powerful tool for Deep Hedging. Quantum deep learning methods, and, in particular, quantum neural networks based on parametrized quantum circuits, have been proposed as a way to enhance the power of classical deep learning [12]. Such quantum neural networks may in general be difficult to train, often encountering problems of barren plateaus or vanishing gradients [21]. Here, we design quantum neural layers based on Hamming-weight preserving unitaries constructed out of 2-dimensional rotation gates (called RBS gates), and prove that they can be efficiently trainable, in the sense that the variance of the gradients decays only polynomially with the number of qubits. These quantum layers, first defined in [22], naturally provide models that have orthogonal features, improving interpretability [23], allowing for deeper architectures, and providing theoretical and practical benefits in generalization [24]. Depending on the input data encoding, one can control the size of the Hilbert space these neural networks explore while training. We call these two types of layers orthogonal layers and compound layers, respectively.
First, using our orthogonal layers within classical neural network architectures, we design novel quantum neural networks for time-series data. To evaluate the behavior of our quantum neural networks, we use the same example as in [1], where the market was simulated using Geometric Brownian Motion (GBM) with a single hedging instrument (equity). We then benchmark four different neural network architectures (Feed-forward, Recurrent, LSTM, Transformer), with both classical layers and our quantum layers, using the policy-search Deep Hedging algorithm [1,2]. Our quantum neural networks achieve scores comparable to their classical counterparts while obtaining qualitatively different solutions, providing models with orthogonal features and considerably fewer parameters. It is conceivable that similar parameter reduction may be obtained by purely classical techniques such as pruning.
Second, we design a novel quantum-native reinforcement learning method for Deep Hedging. We start by formulating a quantum encoding of the environment and trading goals for Deep Hedging. We then introduce a distributional actor-critic reinforcement learning algorithm in combination with quantum neural network architectures based on compound layers for the policy and value function. Our approach is inspired by classical distributional reinforcement learning, wherein the critic not only learns the expectation of cumulative returns but also approximates their distribution. Recent studies, such as AlphaTensor [25], have demonstrated that distributional reinforcement learning can lead to better models compared to standard reinforcement learning methods, despite the increased difficulty in training [26]. A distributional actor-critic method for Deep Hedging had not been studied in the classical case before. We note that our quantum distributional reinforcement learning algorithm can be used more generally for reinforcement learning problems and is not limited to Deep Hedging.
Quantum computers are naturally suited to distributional reinforcement learning. Each quantum circuit explicitly encodes a mapping between exponential-size distributions, and the measurement of a quantum circuit results in a sample from such a distribution. These samples can be used simply to learn the expectation of the entire distribution, or to flexibly obtain extra information about the distribution, such as the expectations restricted to relevant subsets of the total range. In the case of Deep Hedging, we parametrize the value function using quantum neural networks with compound layers, which preserve the Hamming-weight subspaces of their input domain. This choice is particularly suited to the Deep Hedging setting: when we encode a stochastic path as a binary string of up or down jumps, it is intuitively natural that the number of net jumps, namely the Hamming weight of the encoding, is a major component that determines its behavior. The restriction to compound layers further makes the neural network architecture shallower and trainable.
Confirming this intuition, the quantum policies trained using our distributional actor-critic algorithm outperform those trained with policy-search based or standard actor-critic models where only the expectation of the value function is learned. These results are achieved again for a variant of the example from [1], where we used a discretized Geometric Brownian Motion as a market model, both with and without transaction costs.
Last, we evaluated our framework on the Quantinuum H1-1 and H1-2 trapped-ion quantum processors [27]. In particular, we performed inference on the quantum hardware using two sets of Quantum Deep Hedging models which were classically pre-trained. First, we used the policy-search based algorithm with the LSTM and Transformer architectures instantiated with 16-qubit orthogonal layers. Second, we used the novel distributional actor-critic algorithm instantiated with compound neural networks using up to 12 qubits. We observed close alignment between noiseless simulation and hardware experiments, with our distributional actor-critic models again providing the best performance. We note that some of the circuits used to instantiate our framework may be classically simulatable with only a polynomial overhead (including the setup used in our numerical experiments) [28,29]. Nevertheless, this does not hold true for the Quantum Deep Hedging framework as a whole, which is more general and can be applied with any quantum layers. For example, it can be shown that with suitable input states, which are still very easy to create, the circuits used in our work become classically hard to simulate [30].
The rest of the paper is organized as follows. Section 2 introduces preliminaries for quantum computing and reinforcement learning. In Section 3, the problem of Deep Hedging is formulated and policy-search and actor-critic algorithms are presented. Section 4 presents our orthogonal and compound neural networks and proves their trainability. Section 5 introduces a novel quantum framework for reinforcement learning and applies it to Deep Hedging. Section 6 reports our simulation and hardware implementation results. Finally, Section 7 concludes with remarks and open questions.

Quantum Computing
Quantum computing [31] is a new paradigm for computing that uses the postulates of quantum mechanics to perform computation. The basic unit of information in quantum computing is a qubit. The state of a qubit can be written as $|x\rangle = \alpha_0|0\rangle + \alpha_1|1\rangle$, where $\alpha_0, \alpha_1 \in \mathbb{C}$ and $|\alpha_0|^2 + |\alpha_1|^2 = 1$, and corresponds to a unit vector in the Hilbert space $\mathcal{H} = \mathrm{span}_{\mathbb{C}}\{|0\rangle, |1\rangle\}$. A qubit can be generalized to $n$-qubit states, which are represented by unit vectors in $\mathcal{H}^{\otimes n} \simeq \mathbb{C}^{2^n}$. Quantum states can be measured, and the measurement process reveals information about the state of the system. The probability of a quantum state $|x\rangle$ yielding outcome $b$ from a measurement in the computational basis is given by $|\alpha_b|^2$. More generally, the measurement process is described by an observable, which is a Hermitian operator that acts on the quantum state. The observable can be written as $O = \sum_m o_m P_m$, where the $o_m$ are real numbers that specify the measurement outcomes and the $P_m$ are projection operators onto the subspaces corresponding to each outcome. The probability of obtaining outcome $o_m$ is the inner product between the state and the corresponding projection operator, $p_m(x) = \langle x|P_m|x\rangle$. The expectation of measuring the observable $O$ in the state $|x\rangle$ is defined as the sum of the measurement outcomes weighted by their corresponding probabilities, $\sum_m o_m p_m(x)$. This expectation can also be written as the trace of the observable $O$ applied to the density matrix $\rho(x) = |x\rangle\langle x|$, i.e., $\langle x|O|x\rangle = \mathrm{Tr}[O\rho(x)]$, where $\mathrm{Tr}$ is the trace operator, which returns the sum of the diagonal elements of a matrix.
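As a quick numerical illustration, the two equivalent forms of the expectation, $\langle x|O|x\rangle$ and $\mathrm{Tr}[O\rho(x)]$, can be checked for a single qubit. The state and observable below are illustrative choices, not taken from the paper:

```python
import numpy as np

# Single-qubit state |x> = a0|0> + a1|1>, normalized (illustrative amplitudes).
alpha = np.array([3 / 5, 4 / 5], dtype=complex)
assert np.isclose(np.linalg.norm(alpha), 1.0)

# Observable O = sum_m o_m P_m; here Pauli-Z with outcomes +1 and -1.
O = np.array([[1, 0], [0, -1]], dtype=complex)

# Expectation as the inner product <x|O|x> ...
exp_inner = np.vdot(alpha, O @ alpha).real

# ... and equivalently as Tr[O rho(x)] with rho(x) = |x><x|.
rho = np.outer(alpha, alpha.conj())
exp_trace = np.trace(O @ rho).real

print(exp_inner, exp_trace)  # both equal |a0|^2 - |a1|^2 = 9/25 - 16/25 = -0.28
```

Both expressions agree, as expected from the cyclicity of the trace.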

Reinforcement Learning
The aim of reinforcement learning [32] is to train an agent to discover the policy that maximizes the agent's performance in terms of cumulative future reward. While interacting with the environment, the agent only receives a reward signal. At each step, the agent can take an action from a set of possible actions according to a policy that maps each state to an action.
Environments in reinforcement learning are modeled as decision-making problems defined by specifying the state set, the action set, the underlying model describing the dynamics of the environment, and the reward mechanism. The usual framework used to describe the environment in reinforcement learning is that of Markov Decision Processes (MDPs). In this paper, we will consider finite-horizon MDPs, defined as follows:

Definition 1 (Finite-horizon MDP). A finite-horizon MDP $M$ is defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function with $\Delta(\mathcal{S})$ the set of distributions over $\mathcal{S}$, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $T \in \mathbb{N}^*$ is the time horizon.
Starting from a state $s_t \in \mathcal{S}$, a single interaction with the environment can be represented by a sequence of actions $\{a^\pi_{t'}\}_{t'=t}^{T}$ selected according to a deterministic policy $\pi : \mathcal{S} \to \mathcal{A}$, and a sequence of random states $\{s_{t'}\}_{t'=t}^{T}$ that follow the MDP transitions $p$. The cumulative return $R^\pi_t$ is the sum of rewards from time-step $t$ to $T$:
$$R^\pi_t = \sum_{t'=t}^{T} r(s_{t'}, a^\pi_{t'}).$$
In reinforcement learning, the objective is to find the policy $\pi^*$ that maximizes the expected return for all states $s_t$. The expected value of the return, taking into consideration all possible future states $\{s_{t'}\}_{t'=t}^{T}$ resulting from the environment transitions described by $p$, is referred to as the value function $v^\pi$. For any time-step $t$, denoting by $s_{t'} \in \Delta(\mathcal{S})$ the random variable that takes values in $\mathcal{S}$ according to the environment dynamics for $t' > t$ given $s_t$, the value function is defined as
$$v^\pi(s_t) = \mathbb{E}\left[R^\pi_t \mid s_t\right].$$
Typically, the goal of reinforcement learning is to find a policy that maximizes the expected return, and different algorithms have been developed to achieve this objective [33]. The value function is used to evaluate policies in order to find the one that maximizes the expected return.
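To make the definitions concrete, here is a minimal NumPy sketch of a finite-horizon MDP with a deterministic policy, where $v^\pi(s_0)$ is estimated by Monte Carlo averaging of cumulative returns. The toy dynamics, rewards, and policy are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-horizon MDP: two states {0, 1}, horizon T = 3 (all illustrative).
T = 3

def transition(s, a):
    # p(. | s, a): the state flips with probability 0.3, independent of a.
    return 1 - s if rng.random() < 0.3 else s

def reward(s, a):
    # r(s, a): reward 1 whenever the agent acts in state 0.
    return float(s == 0)

def policy(s):
    # Deterministic policy pi: S -> A (here: always play action 0).
    return 0

def rollout(s0):
    """Cumulative return R_t^pi = sum_{t'=0}^{T} r(s_t', a_t'^pi) from s0."""
    s, ret = s0, 0.0
    for _ in range(T + 1):
        a = policy(s)
        ret += reward(s, a)
        s = transition(s, a)
    return ret

# Monte Carlo estimate of the value function v_pi(0) = E[R_0^pi | s_0 = 0].
v_hat = np.mean([rollout(0) for _ in range(5000)])
print(round(v_hat, 3))  # close to sum_t P(s_t = 0) = 1 + 0.7 + 0.58 + 0.532
```

The estimate converges to the exact value $\sum_{t=0}^{T} \Pr[s_t = 0] \approx 2.81$ for this toy chain.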

Deep Hedging Environment
Deep Hedging is a classical algorithm that treats the hedging of a set of derivatives as a reinforcement learning problem. This algorithm was first introduced in [1,2] and has been further developed in subsequent works such as [34][35][36][37]. In the original approach, the authors use a reinforcement learning environment associated with Deep Hedging that employs finite-horizon MDPs in which there is a different state and action space per time-step. The time horizon $T$ represents the maximum maturity of all instruments, and $\mathcal{S}_t$ and $\mathcal{A}_t$ are the sets of observed market states and available actions at time-step $t$, respectively.
During the interaction with the environment, the agent observes a market state $s_t \in \mathcal{S}_t$ that contains all current and past market information (prices, cost estimates, news, internal state of neural networks, . . .), and takes an action $a^\pi_t = \pi_t(s_t) \in \mathcal{A}_t$, potentially subject to constraints (liquidity limits, risk limits, . . .), according to a deterministic policy $\pi := \{\pi_t\}_{t=0}^{T}$. Then, the environment transitions to the next state $s_{t+1} \sim p_t(s_t)$, according to $p_t : \mathcal{S}_t \to \Delta(\mathcal{S}_{t+1})$, which is assumed not to depend on the action $a^\pi_t$, since actions have no market impact in the Deep Hedging model [37]. Subsequently, the agent receives a total reward of
$$r_t = r^+_t - r^-_t,$$
where $r^+_t$ is the source of positive rewards, such as the generated cashflows, and $r^-_t$ represents the source of negative rewards and corresponds to the transaction costs. The interaction ends after a terminal state $s_T \in \mathcal{S}_T$ is reached. The cumulative sum of the rewards received during this interaction can be rewritten as $R^\pi_t(s_T) \equiv R^\pi_t(s_t, s_{t+1}, \ldots, s_T)$, since, by definition, $s_T$ contains all previous states $s_{t'}$ for all $t \leq t' < T$.
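The reward structure $r_t = r^+_t - r^-_t$ (hedging cashflows minus transaction costs, plus the terminal payoff of the hedged derivative) can be sketched for a single GBM path as follows. The market parameters, the short-call payoff, and the proportional cost model are illustrative assumptions, not the paper's exact environment:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative market parameters (not the paper's exact setup).
S0, mu, sigma, dt, cost = 100.0, 0.0, 0.2, 1 / 30, 1e-3
T = 30  # number of hedging steps

def gbm_path():
    """One discretized Geometric Brownian Motion path of length T + 1."""
    z = rng.standard_normal(T)
    log_ret = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    return S0 * np.exp(np.concatenate([[0.0], np.cumsum(log_ret)]))

def episode_return(path, positions, strike=100.0):
    """Sum of rewards r_t = r_t^+ - r_t^-: hedge cashflows minus proportional
    transaction costs, plus the terminal payoff of a short call (illustrative)."""
    pos = np.concatenate([[0.0], positions])          # start with a flat position
    cashflow = np.sum(pos[1:] * np.diff(path))        # gains of the hedge
    costs = cost * np.sum(np.abs(np.diff(pos)) * path[:-1])
    payoff = -max(path[-1] - strike, 0.0)             # short one call option
    return cashflow - costs + payoff

path = gbm_path()
print(episode_return(path, positions=0.5 * np.ones(T)))
```

With zero positions the return reduces to just the (negative) option payoff, since no trades and hence no costs occur.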

Trading Goals in Deep Hedging
The standard objective in reinforcement learning problems is to find the optimal strategy $\pi^*$ that maximizes the value function $v^\pi$ over all policies $\pi$. As discussed in Section 2.2, the value function is usually defined as the expected cumulative return, which, in this context, would be $\mathbb{E}[R^\pi_t(s_T) \mid s_t]$. However, in order to take into account the inherent risk in trading strategies, the goal of Deep Hedging is to find a deterministic optimal policy $\pi^*$ that maximizes the value function defined, for some policy $\pi$, as
$$v^\pi(s_t) = \mathcal{E}\left[R^\pi_t(s_T) \mid s_t\right],$$
where $\mathcal{E}$ is the expected utility defined by a risk-averse (i.e., concave) utility function.
Various forms of utility functions have been proposed that satisfy the concavity requirement. For a more detailed discussion on the desired properties of the utility function and examples of commonly used forms, see [1,2]. In this paper, we use the exponential utility function $\mathcal{E}_\lambda$, which is an example of a monetary utility function that is increasing, concave, and cash-invariant [37]. Specifically, it is defined for some risk aversion level $\lambda > 0$ as
$$\mathcal{E}_\lambda(X) = -\frac{1}{\lambda}\log \mathbb{E}\left[e^{-\lambda X}\right].$$
The parameter $\lambda$ can be used to reflect the investor's risk tolerance, with larger values indicating more risk aversion. With this exponential utility function, the value function $v^*$ associated with the optimal policy $\pi^*$ is given by
$$v^*(s_t) = \max_{\pi}\, \mathcal{E}_\lambda\left[R^\pi_t(s_T) \mid s_t\right].$$
Since the Deep Hedging objective is formulated in terms of risk-adjusted measures, the value function is no longer a solution to the standard Bellman equation. However, conventional reinforcement learning algorithms can be adapted to find policies that maximize the utility. Two algorithms have been developed to solve the Deep Hedging problem. The first approach, called policy-search Deep Hedging [1,2], uses a neural network to model the policy and updates the set of parameters using gradient descent to minimize the policy loss function. The second approach, actor-critic Deep Hedging [37], represents both the policy and the value function with neural networks and computes the utility using the policy function to update the value network, which is then used to update the policy network.
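As an illustration, the exponential utility can be estimated from sampled returns using a log-sum-exp shift for numerical stability. For Gaussian returns $X \sim \mathcal{N}(\mu, s^2)$ the closed form $\mathcal{E}_\lambda(X) = \mu - \lambda s^2/2$ provides a check; the sampling setup below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def exponential_utility(returns, lam):
    """Sample estimate of E_lambda(X) = -(1/lambda) * log E[exp(-lambda X)],
    computed with a log-sum-exp shift for numerical stability."""
    a = -lam * np.asarray(returns)
    m = a.max()
    return -(m + np.log(np.mean(np.exp(a - m)))) / lam

# For X ~ N(mu, s^2) the closed form is E_lambda(X) = mu - lam * s^2 / 2.
x = rng.normal(0.0, 1.0, 200_000)
print(exponential_utility(x, lam=1.0))  # approx -0.5
print(exponential_utility(x, lam=2.0))  # approx -1.0 (more risk averse)
```

Note how a larger $\lambda$ assigns a lower utility to the same zero-mean returns, reflecting stronger risk aversion.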

Quantum Neural Networks with Orthogonal and Compound Layers
A Quantum Neural Network (QNN) consists of a composition of parametrized unitary operations, whose parameters can be trained to provide machine learning models for classification or regression tasks. While current quantum hardware is still far from being powerful enough to compete with classical machine learning algorithms, many interesting quantum machine learning algorithms have started to appear, such as for regression [38,39], classification [40][41][42], generative modeling [43,44], and reinforcement learning [45].
In general, a QNN consists of data-loading layers and trainable layers, which are both parametrized unitary operations. In some architectures, the data loading is an explicit encoding scheme that directly embeds the classical input data into the amplitudes of a quantum state, while in others these parts only implicitly encode the data and prepare a state whose amplitudes are some complex non-linear function of the input data. The latter is known in the literature as a quantum feature map [46]. After the unitary operators are applied, the resulting quantum state is probed to produce classical data that can be used for inference or training.
One popular approach, based on variational quantum circuits [12], is to apply an alternating sequence of quantum feature maps and trainable parts and output the expectation of the resulting state with respect to some observable [47,48]. A second class of architectures encodes the input data in the amplitudes and performs trainable unitary operations that reproduce the linear layers of certain classical neural networks, but with reduced computational complexity [22,49]. The output is obtained through quantum-state tomography, and non-linear operations are then applied classically. After applying the non-linearity, the data is reloaded onto a quantum state, and the process is repeated to compose layers.
Such quantum circuits can be trained using classical gradient descent methods until convergence. For variational quantum circuits, where the output is an expectation value, the gradient can be computed using the parameter-shift rule [38,50]. One needs to be very careful in designing such quantum neural networks, since they may in general be difficult to train, often encountering problems such as barren plateaus or vanishing gradients [21].
In Sections 4.1 and 4.2 below, we review two different types of quantum layers built from Hamming-weight preserving unitaries that can be used to provide a natural quantization of classical neural network architectures. In Section 4.3 we discuss neural network architectures that make use of these layers. While similar techniques have appeared previously in [42,49], our discussion is more systematic and extends the techniques to a larger set of architectures. Finally, in Section 4.4 we discuss the properties of quantum neural networks with compound layers and prove their trainability.

Quantum Orthogonal Layers
The quantum orthogonal layer was proposed by Kerenidis et al. [22] to simulate classical orthogonal layers with reduced complexity at inference time. Specifically, a quantum orthogonal layer on $n$ qubits acts as an element of $SO(n)$ when restricted to the span of the computational basis states with Hamming weight one, i.e., the unary basis. This is achieved by composing two-qubit Reconfigurable Beamsplitter (RBS) gates. An RBS gate acting on the $i$-th and $j$-th qubits implements a Givens rotation: it acts as the identity on $|00\rangle$ and $|11\rangle$, and as
$$\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$
on the subspace spanned by $\{|10\rangle, |01\rangle\}$.
If the goal is to apply an orthogonal matrix to classical data given as a vector $x \in \mathbb{R}^n$, then $x$ can be efficiently amplitude-encoded in a quantum state in the unary basis with a log-depth circuit [51,52]. The unary data loader (depicted in Figure 1) uses $n$ qubits and maps the all-zeros basis state $|0\rangle^{\otimes n}$ to the state
$$|x\rangle = \frac{1}{\|x\|}\sum_{i=1}^{n} x_i |e_i\rangle,$$
where $\|\cdot\|$ denotes the $\ell_2$ norm and $|e_i\rangle$ is the $i$-th unary basis state $|0\rangle^{\otimes(i-1)}|1\rangle|0\rangle^{\otimes(n-i)}$.
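Classically simulating the unary data loader amounts to placing the normalized coordinates of $x$ on the $n$ unary basis states of the $2^n$-dimensional space. A small NumPy sketch (this simulates only the loader's output state, not the log-depth circuit itself):

```python
import numpy as np

def unary_encode(x):
    """Amplitude vector of |x> = (1/||x||) sum_i x_i |e_i> in the full
    2^n computational basis (classical simulation of the loader's output)."""
    n = len(x)
    state = np.zeros(2 ** n)
    for i, xi in enumerate(x):
        state[1 << (n - 1 - i)] = xi   # |e_i>: only qubit i is set
    return state / np.linalg.norm(x)

x = np.array([1.0, 2.0, 2.0])          # ||x|| = 3
psi = unary_encode(x)
print(psi[0b100], psi[0b010], psi[0b001])  # amplitudes 1/3, 2/3, 2/3
```

The resulting vector is a unit vector supported only on the three unary basis states.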
Let $G(i, j, \theta)$ denote the Givens rotation by angle $\theta$ applied to the $i$-th and $j$-th unary basis vectors $e_i$ and $e_j$, let $\theta$ be a vector of angles, and let $T$ be a list of triplets $(i, j, m)$. The orthogonal layer is defined by
$$U(\theta) = \prod_{(i,j,m)\in T} RBS_{i,j}(\theta_m).$$
It acts as $U(\theta)|x\rangle = W|x\rangle$, where $W = \prod_{(i,j,m)\in T} G(i, j, \theta_m)$. Since the dimension of the Hamming-weight-one subspace is $n$ for $n$ qubits, there exist efficient quantum-state tomography procedures for reading out the resulting quantum state encoding the matrix-vector product $W|x\rangle$ [22]. The fact that circuits of RBS gates can only span elements of $SO(n)$ avoids the computational overhead associated with re-orthogonalizing the weight matrix in the classical case. Such a layer can be implemented with a linear-depth quantum circuit. Note that the application of each such layer can also be performed classically in time $O(n^2)$. Furthermore, orthogonal layers are efficiently trainable, as the dimension of the space they explore is linear in the number of qubits used. Specifically, these layers are trained by classically simulating the circuit, with the quantum computer used for inference.
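The classical $O(n^2)$ simulation of an orthogonal layer simply multiplies out the Givens rotations. The sketch below builds a Pyramid-style $W$ from $n(n-1)/2$ angles and checks that it lies in $SO(n)$; the exact gate ordering is illustrative, not the paper's precise layout:

```python
import numpy as np

def givens(n, i, j, theta):
    """n x n Givens rotation G(i, j, theta) acting on coordinates i and j."""
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i], g[i, j] = c, s
    g[j, i], g[j, j] = -s, c
    return g

def pyramid_layer(n, thetas):
    """Classical simulation of a Pyramid-style orthogonal layer: W is a product
    of n(n-1)/2 nearest-neighbour Givens rotations (gate order illustrative)."""
    assert len(thetas) == n * (n - 1) // 2
    W, k = np.eye(n), 0
    for col in range(n - 1):
        for i in range(n - 1 - col):
            W = givens(n, i, i + 1, thetas[k]) @ W
            k += 1
    return W

n = 4
rng = np.random.default_rng(3)
W = pyramid_layer(n, rng.uniform(0, 2 * np.pi, n * (n - 1) // 2))
print(np.allclose(W @ W.T, np.eye(n)), np.isclose(np.linalg.det(W), 1.0))
```

Since each Givens rotation has determinant one, any product of them is orthogonal with determinant one, i.e., an element of $SO(n)$.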
There are different linear-depth circuits for $U(\theta)$, highlighted in Figure 2, each with its own unique properties. The Pyramid architecture, as described in [22], consists of $n(n-1)/2$ RBS gates arranged in a pyramid-like structure and has linear depth. This architecture allows for the representation of all possible orthogonal matrices of size $n \times n$ with determinant one. The Brick architecture is a variation of the Pyramid architecture and also consists of $n(n-1)/2$ RBS gates. However, it has a more compact layout of gates, while still exhibiting similar properties. Both the Pyramid and Brick architectures can be implemented on hardware with nearest-neighbour connectivity between qubits. On the other hand, the Butterfly architecture, which was proposed in [49], uses logarithmic-depth circuits with a linear number of gates to implement a quantum orthogonal layer. This architecture requires all-to-all connectivity in the hardware layout. To summarize, an orthogonal layer with input size $n$ uses a parametrized quantum circuit with $n$ qubits and a number of parameters equal to $n(n-1)/2$ for a Pyramid or Brick circuit, which can be implemented on hardware with nearest-neighbour connectivity; for a Butterfly circuit, the number of parameters is $\frac{n}{2}\log(n)$, and all-to-all hardware connectivity is required.

Because classical data can be efficiently loaded onto a quantum state and retrieved from a quantum orthogonal layer, it is possible to compose quantum orthogonal layers with non-linear activation functions. More specifically, after applying the sequence of RBS gates, the matrix-vector product $W|x\rangle$ is read out and an activation function is applied classically. The resulting vector is then loaded onto a quantum state using the unary data loader for the next layer. In Section 4.3, we will make use of this scheme to construct various quantizations of classical neural architectures for time series.

Quantum Compound Layers
The quantum compound layer is a natural and powerful generalization of the orthogonal layer [42], and a version of it has previously been used [49] to implement quantum analogues of vision transformers. The prefix "compound" refers to the fact that the quantum circuits implement linear operators on the exterior powers of a vector space.
For an $n$-dimensional vector space $V$ with orthonormal basis $\{e_i\}_{i=1}^{n}$, the $k$-th exterior power $\Lambda^k V$ is the $\binom{n}{k}$-dimensional vector space spanned by the $k$-fold alternating products $e_{i_1} \wedge \cdots \wedge e_{i_k}$ with $i_1 < \cdots < i_k$. The alternating property implies that for any permutation $\sigma$ of the indices,
$$v_{\sigma(1)} \wedge \cdots \wedge v_{\sigma(k)} = \mathrm{sgn}(\sigma)\, v_1 \wedge \cdots \wedge v_k.$$
The direct sum $\bigoplus_{k=0}^{n} \Lambda^k V$ equipped with the alternating product forms the exterior algebra of $V$, denoted $\Lambda V$.
For any linear operator $A$ on $V$, there exists an extension to a linear operator $A^{(k)}$ on $\Lambda^k V$, which acts on $k$-vectors as
$$A^{(k)}(v_1 \wedge \cdots \wedge v_k) = Av_1 \wedge \cdots \wedge Av_k.$$
The matrix of the extended operator, called the $k$-th (multiplicative) compound matrix, has entries $A^{(k)}_{IJ} = \det(A_{IJ})$, where $I$ and $J$ are $k$-sized subsets of the rows and columns of $A$, respectively. Furthermore, there exists a unique linear operator $\Lambda A := \bigoplus_{k=0}^{n} A^{(k)}$ over $\Lambda V$ whose restriction to the $k$-th exterior power is $A^{(k)}$. In the quantum setting, the $k$-vectors $e_{i_1} \wedge \cdots \wedge e_{i_k}$ are mapped to computational basis states $|S\rangle$, where $S \in \{0,1\}^n$, $|S| = k$, and $S_{i_t} = 1$ for all $t \in [k]$. Thus $n$ qubits can be used to encode a projectivization of the exterior algebra $\Lambda V$. To apply compound matrices to $k$-vectors in the qubit encoding, we utilize Fermionic Beam Splitter (FBS) gates, whose action on qubits $i$ and $j$ depends on the parity of the qubits between $i$ and $j$. On a computational basis state $|S\rangle$, the FBS gate acts on qubits $i$ and $j$ as the unitary
$$\begin{pmatrix} \cos\theta & (-1)^{f(i,j,S)}\sin\theta \\ -(-1)^{f(i,j,S)}\sin\theta & \cos\theta \end{pmatrix},$$
where $\theta \in [0, 2\pi)$ and $f(i,j,S) = \sum_{i<k<j} s_k$, and as the identity on all other qubits. An FBS gate can be implemented as a composition of controlled-X gates, controlled-Z gates, and RBS gates. As in the previous subsection, let $G(i, j, \theta)$ denote the Givens rotation applied to the $i$-th and $j$-th basis vectors $e_i$ and $e_j$, let $\theta$ be a vector of angles, and let $T$ be a list of triplets $(i, j, m)$. The quantum compound layer is defined by
$$U(\theta) = \prod_{(i,j,m)\in T} FBS_{i,j}(\theta_m).$$
It can be shown [53] that this layer acts as $U(\theta)|S\rangle = A^{(k)}|S\rangle$ for $|S| = k$, where $A^{(k)}$ is the $k$-th multiplicative compound of $A = \prod_{(i,j,m)\in T} G(i, j, \theta_m)$. Thus the operation $U$ acts as $\Lambda A$ over the quantum state space; in other words, it is a block-diagonal unitary that acts separately on each fixed Hamming-weight subspace (see Figure 3).
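The defining property $A^{(k)}_{IJ} = \det(A_{IJ})$ can be checked directly with NumPy. The sketch below also verifies two consequences used implicitly above: the compound of an orthogonal matrix is orthogonal, and compounds are multiplicative, by the Cauchy-Binet formula (the random test matrix is illustrative):

```python
import numpy as np
from itertools import combinations

def compound(A, k):
    """k-th multiplicative compound: entries are det(A_IJ) over k-subsets I, J."""
    n = A.shape[0]
    subsets = list(combinations(range(n), k))
    return np.array([[np.linalg.det(A[np.ix_(I, J)]) for J in subsets]
                     for I in subsets])

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal test matrix
C2 = compound(Q, 2)                               # 6 x 6 for n = 4, k = 2

# Compound of an orthogonal matrix is orthogonal; compounds are multiplicative:
# (AB)^(k) = A^(k) B^(k) (Cauchy-Binet).
print(np.allclose(C2 @ C2.T, np.eye(6)))
print(np.allclose(compound(Q @ Q, 2), C2 @ C2))
```

These two facts are exactly why a Givens circuit induces a unitary on every Hamming-weight subspace at once.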
The circuits for the compound layer are similar to those described in the orthogonal layer case, where, since only the unary basis is considered there, the FBS gates can be replaced by the simpler RBS gates. Note also that RBS and FBS gates acting on nearest-neighbour qubits, as in the Pyramid and Brick circuits, are equivalent. The main difference in the compound layer comes from the fact that the data-loading part is not restricted to the unary basis, and thus one needs to consider the entire exponential-size block-diagonal unitary, and not only its linear-size restriction to the unary basis.
Thus, by controlling the Hamming weight of the basis states used in the data-loading part, one can smoothly control the size of the explored space: from linear, when using the unary basis for data loading, to exponential, when all possible Hamming weights are used, as is the case, for example, when each coordinate of a classical data point is loaded by a one-qubit rotation with an appropriate data-dependent angle.
Lastly, since any element of $SO(n)$ can be expressed as a product of $O(n^2)$ Givens rotations, the direct sum of compound matrices can be implemented efficiently as a composition of FBS gates. Thus quantum computation can be used to efficiently parametrize and apply compound matrices over the exterior algebra.

Quantum Neural Network Architectures with Orthogonal Layers
Our aim is to develop quantum neural networks capable of processing sequential data.
To achieve this, we will utilize classical neural network architectures that have proven to be effective in dealing with time-series data. However, we will extend these classical architectures by replacing the linear layers with our orthogonal layers. This approach offers an alternative to the use of quantum variational circuits, which are commonly used in quantum machine learning. As discussed in Section 4.1, orthogonal layers use only the unary basis, whose size is equal to the number of qubits, and hence one can easily perform tomography to obtain a classical description of the output and apply a non-linearity. By combining the strengths of classical neural networks with the unique properties of quantum orthogonal layers, we hope to achieve improved results in processing sequential data.
We designed quantum neural networks to process input time-series data $(x_0, x_1, \ldots, x_T)$ and produce the final output sequence $(y_0, y_1, \ldots, y_T)$. We split these architectures into two categories: feed-forward and recurrent architectures.
For concreteness, we assume that the input and output have the same dimension $n$ and that this dimension is maintained across layers. Additionally, these architectures are made up of blocks that can be repeated to create deeper architectures. Here, we will assume that the number of blocks is one.
• Feed-forward Architectures: A classical Feed-forward neural network consists of multiple layers of transformations, where information flows from input to output without looping back. In each layer, a linear transformation, a bias shift, and a non-linear function are applied. The output is calculated as
$$y = f(Wx + \beta),$$
where $f$ is the activation function, and $W$ and $\beta$ are the weights and biases. The number of parameters in each layer of a classical network is $O(n^2)$. Our proposed quantum equivalent (Figure 4a) calculates each transformation as
$$y_t = f\left(\gamma \circ U(\theta)|x_t\rangle + \beta\right),$$
where $\circ$ represents the element-wise product, $U(\theta)|x_t\rangle$ is the output of a quantum orthogonal layer retrieved using tomography, $\gamma$ is a scaling factor used to rescale each feature, $\beta$ is a shift that acts as the bias, and $f$ is a non-linear function such as sigmoid, tanh, or ReLU. All parameters $\theta$, $\gamma$, and $\beta$ are trainable and shared across the networks used at different time steps. The total number of trainable parameters is $O(n^2)$ when using the Brick and Pyramid architectures, and $O(n\log(n))$ for the Butterfly architecture. Additionally, the parameters of the quantum orthogonal layers can either be shared across layers or not, depending on the requirements of the time-series model being used.
• Recurrent Architectures: Recurrent neural networks are designed to handle sequential data. One example of a recurrent architecture is the standard Recurrent Neural Network (RNN). In this example, we will show how to provide a quantum version of RNNs (Figure 4b), but the same approach can be applied to other recurrent models such as the LSTM. RNNs consist of a repeating module with a hidden state, $h_t$, that allows information to be passed from one step of the sequence to the next. At each time-step, the network takes as input $x_t$ and the hidden state from the previous time-step, $h_{t-1}$. The hidden state is updated as
$$h_t = f(W_x x_t + W_h h_{t-1} + \beta),$$
where $f$ is an activation function, $W_x$ and $W_h$ are weight matrices, and $\beta$ is a bias vector. The quantum analogue of this update is
$$h_t = f\left(\gamma_x \circ U(\theta_x)|x_t\rangle + \gamma_h \circ U(\theta_h)|h_{t-1}\rangle + \beta\right),$$
where we now have two orthogonal layers with parameters $\theta_x$ and $\theta_h$, one for the input state $x_t$ and another for the previous hidden state $h_{t-1}$, together with two scaling factors, $\gamma_x$ and $\gamma_h$. The hidden state $h_t$ is then used to generate the output at each time-step, $y_t$, through another layer, which can be implemented using another orthogonal layer to map it to the output. Moreover, the parameters are shared across time-steps, and the number of trainable parameters per layer in the quantum Recurrent Neural Network grows similarly to that in the quantum Feed-forward Neural Network, with the total number of parameters being $O(n^2)$ for the Brick and Pyramid architectures and $O(n\log(n))$ for the Butterfly architecture.
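The two quantum update rules above can be emulated classically by replacing each orthogonal layer $U(\theta)$ with an explicit orthogonal matrix, since the unary-basis action is exactly a matrix-vector product. The sketch below does this with random orthogonal stand-ins; all parameter values and helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4

def rand_orthogonal(n):
    """Stand-in for a trained orthogonal layer: a random orthogonal matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def ff_layer(x, W, gamma, beta, f=np.tanh):
    """Feed-forward block: y = f(gamma * (W x) + beta), x loaded unit-norm."""
    return f(gamma * (W @ (x / np.linalg.norm(x))) + beta)

def rnn_step(x_t, h_prev, Wx, Wh, gx, gh, beta, f=np.tanh):
    """Recurrent block: h_t = f(gx * (Wx x_t) + gh * (Wh h_{t-1}) + beta)."""
    return f(gx * (Wx @ (x_t / np.linalg.norm(x_t)))
             + gh * (Wh @ (h_prev / np.linalg.norm(h_prev))) + beta)

W, Wx, Wh = rand_orthogonal(n), rand_orthogonal(n), rand_orthogonal(n)
gamma, beta = np.ones(n), np.zeros(n)

x_seq = rng.standard_normal((6, n))               # toy time series
y_ff = np.stack([ff_layer(x, W, gamma, beta) for x in x_seq])
h = np.ones(n)
for x in x_seq:                                   # unrolled recurrence
    h = rnn_step(x, h, Wx, Wh, gamma, gamma, beta)
print(y_ff.shape, h.shape)
```

On a quantum device, the matrix-vector products would instead be obtained via tomography of the orthogonal layer's output state, with the non-linearity applied classically.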
We also define an attention mechanism that can be applied to the output sequence (y_0, y_1, . . ., y_T) to create a transformer architecture.
• Attention Mechanism: The attention mechanism, as used in the Transformer architecture [54], can be applied to both Feed-forward and Recurrent Neural Networks.
Here, we describe a basic quantum attention mechanism (Figure 4c), but it can be generalized. Given an output sequence (y_1, y_2, . . ., y_T), the goal of the attention mechanism is to compute the output (ỹ_1, ỹ_2, . . ., ỹ_T) as ỹ_t = Σ_{t′≤t} w_{t,t′} y_{t′}, where the weights w_{t,t′} are computed by considering all previous time steps as w_{t,t′} = softmax_{t′≤t}(y_{t′}^⊤ W_y y_t / τ), where W_y is a trainable attention matrix that combines the query and key matrices into one matrix, as described in [49], and τ is a temperature parameter. In the quantum case, we use w_{t,t′} = softmax_{t′≤t}(γ_y ⟨y_{t′}|U(θ_y)|y_t⟩ / τ), where γ_y is a scaling factor used to rescale each feature and U(θ_y) is the quantum orthogonal layer that computes the attention weights w. The dot product between |y_{t′}⟩ and U(θ_y)|y_t⟩ can be computed quantumly using an additional data-loader to unload y_{t′} after applying U(θ_y) to y_t. This procedure is similar to the one described in [52].
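For intuition, the transformations above can be simulated classically in a minimal NumPy sketch: on the unary (Hamming-weight-1) subspace, an RBS circuit acts as an n × n orthogonal matrix built from Givens rotations, and the feed-forward, recurrent, and causal-attention updates follow the formulas of this section. The brick-like pair ordering and the use of tanh are illustrative choices, not the exact circuits used in the paper.

```python
import numpy as np

def brick_pairs(n):
    # Index pairs mixed by successive RBS gates in a brick-like layout;
    # n(n-1)/2 gates overall for even n (an illustrative ordering).
    return [(i, i + 1) for layer in range(n) for i in range(layer % 2, n - 1, 2)]

def orthogonal_layer(angles, n):
    # On the unary subspace, each RBS(theta) acts as a Givens rotation,
    # so the whole circuit is an exactly orthogonal n x n matrix.
    Q = np.eye(n)
    for theta, (i, j) in zip(angles, brick_pairs(n)):
        G = np.eye(n)
        G[i, i] = G[j, j] = np.cos(theta)
        G[i, j], G[j, i] = np.sin(theta), -np.sin(theta)
        Q = G @ Q
    return Q

def ff_step(x, Q, gamma, beta):
    # Feed-forward update: y = f(gamma o (Q x) + beta), with f = tanh.
    return np.tanh(gamma * (Q @ x) + beta)

def rnn_step(x, h, Qx, Qh, gx, gh, beta):
    # Recurrent update: h_t = f(gamma_x o (Qx x_t) + gamma_h o (Qh h_prev) + beta).
    return np.tanh(gx * (Qx @ x) + gh * (Qh @ h) + beta)

def causal_attention(Y, W, tau=1.0):
    # Attention over past outputs only: w_{t,t'} ~ exp(y_{t'}^T W y_t / tau), t' <= t.
    out = []
    for t in range(len(Y)):
        scores = np.array([Y[tp] @ W @ Y[t] / tau for tp in range(t + 1)])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out.append(sum(wi * Y[tp] for tp, wi in enumerate(w)))
    return np.array(out)
```

Because Q is exactly orthogonal by construction, the layer preserves the norm of the feature vector before the element-wise rescaling by γ.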

Properties of Quantum Compound Neural Networks
We define a quantum compound neural network to be the standard variational QNN, i.e. y(x; θ) = Tr[O U(θ) ρ(x) U(θ)†], where O is an observable that preserves Hamming-weight (e.g., diagonal in the computational basis), U is a quantum compound layer, and ρ(x) is a quantum feature map (Figure 5). As mentioned in Section 4.2, it is sufficient to use only RBS gates in the Brick architecture to implement a quantum compound layer. Thus, without loss of generality, we define U(θ) to consist only of RBS gates in the Brick architecture. Under these assumptions, the output of the QNN decomposes as y(x; θ) = Σ_{k=0}^{n} Tr[P_k O P_k U^(k)(θ) P_k ρ(x) P_k U^(k)(θ)†], where P_k is the projector onto the Hamming-weight-k subspace and U^(k)(θ) is the k-th compound matrix associated with the Givens circuit U(θ) in the manner discussed earlier. One potential application of such a subspace-preserving QNN could be the following. Suppose there is some canonical grouping of the input data x ∈ X according to a partitioning function f : X → {0, 1, . . ., n}. Then one could potentially construct a quantum feature map or state preparation procedure such that P_{f(x)} ρ(x) P_{f(x)} = ρ(x), i.e. the quantum states encoding inputs lying in different groups are embedded into different Hamming-weight subspaces. Then it follows that y(x; θ) = Tr[P_{f(x)} O P_{f(x)} U^(f(x))(θ) ρ(x) U^(f(x))(θ)†], and the quantum compound neural network can potentially learn different functions over the different groups. This form of learning is a special case of group-invariant machine learning, which has also recently been explored in the quantum case [55]. Note that the parameters θ are shared across the different functions, which is beneficial for training. Furthermore, we show below that, under Gaussian initialization, the variance of the gradient on each subspace does not vanish exponentially with the number of qubits. Thus quantum compound neural networks can be trained efficiently.
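The restriction of a Givens circuit to the Hamming-weight-k subspace is the k-th compound matrix of its n × n orthogonal matrix, whose entries are k × k minors. A small NumPy sketch (our own illustration, not code from the paper) builds this matrix and checks its key properties:

```python
import numpy as np
from itertools import combinations
from math import comb

def compound_matrix(A, k):
    # k-th compound of A: entries are k x k minors, indexed by k-subsets of
    # rows/columns; for an RBS circuit this is the action on the
    # Hamming-weight-k subspace.
    n = A.shape[0]
    subsets = list(combinations(range(n), k))
    C = np.empty((len(subsets), len(subsets)))
    for a, rows in enumerate(subsets):
        for b, cols in enumerate(subsets):
            C[a, b] = np.linalg.det(A[np.ix_(rows, cols)])
    return C
```

Orthogonality of the compound matrix, and multiplicativity via the Cauchy-Binet formula, mirror the fact that the circuit acts unitarily inside each Hamming-weight subspace; the (n choose k) × (n choose k) size of these blocks is why simulating general compound layers classically appears to take exponential time.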
Classical neural networks over exterior algebras have been applied to manifold learning tasks, specifically for data that lies on Grassmannians [56]. The (n, k)-Grassmannian of V is a manifold containing all k-dimensional subspaces of V and can be embedded in the space ∧^k V, where k-wedge products of orthogonal vectors define subspaces. When considering A ∈ SO(n), the operator A^(k) maps between elements of the Grassmannian. While (n choose k) can be large, the application of A^(k) to a point on the Grassmannian can be done by multiplying n × k and n × n matrices. However, optimizing the linear layers of the neural network while ensuring the data remains represented by an orthogonal matrix can be computationally challenging.
Since the embedding of the (n, k)-Grassmannian into ∧^k V is not surjective, and in the quantum case we can apply compound matrices to the larger space ∧^k V, the technique used classically for Grassmannian neural networks to reduce the dimension of the matrix-vector products cannot be applied to simulate the quantum case. In other words, such compound layers are inherently quantum and perform an operation that in the general case seems to take exponential time to perform classically. Nevertheless, we will show below that such compound layers remain trainable in certain settings.
In Section 5, we present a specific example of the Deep Hedging problem where grouping the inputs by Hamming-weight is natural in the quantum setting and improves the accuracy of the model.
Highly expressive QNNs are known to suffer from barren plateaus in their training landscape at initialization. This occurs when the variance of the gradient decays exponentially with the number of qubits, which makes sampling to estimate the gradient asymptotically intractable. Specifically, consider a typical QNN of the form C(θ) = Tr[O U(θ) ρ U(θ)†], where O is an observable and U is a 2-design for the Haar measure µ on SU(2^n). Using known formulas for integration over the Haar measure on compact groups, it was shown [21] that Var[∂_θ C] decays exponentially in the number of qubits. More recent lines of work have shown that the symmetries of the parameterized quantum circuit also need to be considered and play an important role in understanding, for example, the convergence [57,58] and trainability [59] of QNNs. For trainability, it was shown that if the input state ρ lies in the invariant subspace H_k, then the variance of the gradient is in the worst case O(1/d_k²), where d_k = dim H_k. Note that there will also be a dependence on the initial state used. While this result was shown for SU, the asymptotics of Haar moments are similar for other classical compact Lie groups [60], such as SO. In the case of the compound layer, the subspaces H_k are spanned by the computational basis states with Hamming-weight k.
Since for the compound layer d_k = dim H_k = (n choose k), which is polynomial in n for constant k, the variance of the gradient does not exponentially decay with growing system size for all initial states in such a subspace. This is at least the case when k is independent of n.
It was further conjectured in [59] that the variance actually scales with the dimension of the (dynamical) Lie algebra restricted to the invariant subspace, which can be polynomial in n even when dim H_k is exponential. In the case of the compound layer, the dimension of the Lie algebra of the compound matrix group for SO(n) is actually equal to the dimension of so(n). Thus, even though there are invariant subspaces whose dimension is exponential in n, the dimension of the Lie algebra grows at most quadratically in n. If proven true, this conjecture would imply that the variance does not decay exponentially on any subspace, e.g. for k = n/2.
It is possible to go beyond unproven conjectures by making some well-justified assumptions about the initialization and measurement phases of the quantum compound neural network, namely:
(1) The parameters are randomly initialized from centered Gaussian distributions with variance inversely proportional to the number of gates in the circuit.
(2) The final measurement is made in the computational basis. Specifically, we make a vector-valued measurement by measuring Z_i = I^⊗(i−1) ⊗ Z ⊗ I^⊗(n−i) for each qubit i ∈ [1, n]. Since the operators Z_i all commute, the order of measurement is arbitrary and the measurement is well-defined. Such a measurement allows for loss functions that are arbitrary functions of the measured bit-string. It therefore suffices to demonstrate that the gradient of the output after measuring each Z_i does not decay exponentially with the number of qubits. The overall gradient may still decay depending on the loss function used, but this would be a property of the loss itself and not of the quantum neural network.
(3) The initial state is chosen to be either the uniform superposition, or the uniform superposition over all computational basis states of Hamming-weight k, where 1 ≤ k < n.
We now give a rigorous proof that the overall gradient does not vanish in this setting. We note that this is the setting used in our numerical experiments (Section 6). We use the following theorem from [64], which we paraphrase below in the necessary form.

Theorem 1 (Paraphrased from [64, Theorem 4.2]). Consider any n-qubit variational form with output function given by C(θ) = Tr[O U_L(θ_L) · · · U_1(θ_1) ρ U_1(θ_1)† · · · U_L(θ_L)†], where O is some n-qubit observable, and each U_j is expressible as the product of a constant number of parametrized 2-qubit gates (potentially with shared parameters). Then for any parameter θ_j it holds that Var_θ[∂C/∂θ_j] = Ω((∂C/∂θ_j)²|_{θ=0} / poly(L)) when each θ_j is initialized from a normal distribution N(0, γ²) with γ² = O(1/L).
From Theorem 1, we have the following theorem.
Theorem 2. Consider a variational quantum algorithm using a quantum compound layer and a final observable Z_m (where X_m, Y_m, Z_m correspond to the application of the corresponding Pauli gate on qubit m), and let the corresponding output (as a function of the parameters θ) be C(θ). We have Var_θ[∂C/∂θ_l] = Ω(1/poly(n)) for some parameter θ_l, when the parameters are initialized from a normal distribution N(0, γ²) with γ² = O(1/n²), and the input state is chosen to be ρ_0 = |ψ⟩⟨ψ|, where |ψ⟩ is an n-qubit state representing either the uniform superposition over computational basis states, or the equal superposition of computational basis states with Hamming-weight k, for any 1 ≤ k < n.
Proof. We observe that the number of parameterized gates in a quantum compound neural network is L = O(n²). Using Theorem 1, it suffices for our result to show, for some parameter θ_l, that (1/2)(∂C/∂θ_l)²|_{θ=0} = Ω(1/poly(n)).
All the parameterized gates in the quantum compound neural network are RBS gates. Let the gate corresponding to θ_l act on the qubits i, j. The corresponding unitary acts as the identity on the span of |00⟩ and |11⟩ of qubits i, j, and as the rotation with rows (cos θ_l, sin θ_l) and (−sin θ_l, cos θ_l) on the span of |01⟩ and |10⟩. Finally, let U_−(θ_{1:l−1}) and U_+(θ_{l+1:L}) denote the sections of the parameterized circuit before and after the l-th parameterized gate; at θ = 0 both reduce to the identity. By explicit differentiation, writing G := ∂_{θ_l} RBS(θ_l)|_{θ_l=0}, we have ∂C/∂θ_l|_{θ=0} = 2 Re⟨ψ_0|Z_m G|ψ_0⟩, which vanishes whenever m ∉ {i, j}. We therefore consider the case when m ∈ {i, j}. In the rest of the proof, we assume without loss of generality that i = 0, j = 1, m = 0. Consider any two computational basis states |a⟩, |b⟩. Then ⟨a|G|b⟩ = ±1 if a = 01x and b = 10x, or vice-versa, for some (n − 2)-bit string x, and 0 otherwise. We now determine (1/2)(∂C/∂θ_l)²|_{θ=0} for the different possible initial states |ψ_0⟩. Suppose the initial state is the uniform superposition over all computational basis states, |ψ_0⟩ = 2^{−n/2} Σ_{b∈{0,1}^n} |b⟩. In this case, the pairs (01x, 10x) contribute with equal sign after weighting by the eigenvalues of Z_0, giving |∂C/∂θ_l| |_{θ=0} = 1 and hence (1/2)(∂C/∂θ_l)²|_{θ=0} = 1/2. Now let the initial state be |ψ_k⟩, the uniform superposition over all strings of Hamming-weight k. For 2 ≤ k ≤ n − 1, the contributing pairs are those with |x| = k − 1, of which there are (n−2 choose k−1), so that |∂C/∂θ_l| |_{θ=0} = 4 (n−2 choose k−1)/(n choose k) = 4k(n − k)/(n(n − 1)), and therefore (1/2)(∂C/∂θ_l)²|_{θ=0} = Ω(1/poly(n)). By an analogous argument for the Hamming-weight 1 subspace we find that (1/2)(∂C/∂θ_l)²|_{θ=0} = Ω(1/poly(n)) there as well. As we have shown before, the gradients do vanish for the Hamming-weight 0 subspace.
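The gradient computation in this proof can be checked numerically on a few qubits. The sketch below builds the full 2^n matrix of a single RBS gate on qubits 0 and 1, takes C(θ) = ⟨ψ|U(θ)† Z_0 U(θ)|ψ⟩, and compares a finite-difference derivative at θ = 0 with the closed-form magnitudes obtained by counting the contributing |01x⟩/|10x⟩ pairs: 1 for the uniform superposition and 4k(n − k)/(n(n − 1)) for the Hamming-weight-k superposition. The sign convention of the RBS matrix is an arbitrary choice.

```python
import numpy as np

def rbs_01(theta, n):
    # Full 2^n matrix of an RBS gate on qubits 0 and 1 (leftmost bits):
    # it rotates each |01x> / |10x> pair and fixes |00x> and |11x>.
    U = np.eye(2 ** n)
    for x in range(2 ** (n - 2)):
        i01 = (0b01 << (n - 2)) | x
        i10 = (0b10 << (n - 2)) | x
        U[i01, i01] = U[i10, i10] = np.cos(theta)
        U[i01, i10] = np.sin(theta)
        U[i10, i01] = -np.sin(theta)
    return U

def cost(theta, psi, n):
    # C(theta) = <psi| U(theta)^T Z_0 U(theta) |psi> for a real state psi.
    phi = rbs_01(theta, n) @ psi
    z0 = np.array([1.0 if ((b >> (n - 1)) & 1) == 0 else -1.0
                   for b in range(2 ** n)])
    return np.sum(z0 * phi ** 2)

def hamming_state(n, k):
    # Uniform superposition over all n-bit strings of Hamming-weight k.
    amps = np.array([1.0 if bin(b).count("1") == k else 0.0
                     for b in range(2 ** n)])
    return amps / np.linalg.norm(amps)

def deriv_at_zero(psi, n, eps=1e-5):
    # Central finite difference of C at theta = 0.
    return (cost(eps, psi, n) - cost(-eps, psi, n)) / (2 * eps)
```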

Quantum Deep Hedging
In this section, we present a quantum framework for Deep Hedging, referred to as Quantum Deep Hedging, where we aim to leverage the power of quantum computing to enhance the deep reinforcement learning methods introduced in [1] for solving the hedging problem. We will incorporate the quantum neural networks defined in the previous sections and provide quantum reinforcement learning solutions to this problem. Quantum reinforcement learning involves utilizing quantum computing to improve reinforcement learning algorithms. A comprehensive survey by Meyer et al. [45] outlines various approaches for incorporating quantum subroutines in these algorithms:
• In classical environments: A common approach involves using quantum neural networks, such as variational quantum circuits, to represent the value function [65][66][67][68] or the policy [69,70] in classical environments. These quantum neural networks can replace their classical counterparts in various reinforcement learning training methods, including value-based, policy-based, and actor-critic approaches. Experiments have shown that they sometimes produce better policies or value estimates when applied to small environments, but they can face trainability problems in larger environments due to the barren plateaus that occur in these circuits (as discussed in Section 4).
• In quantum environments: Another approach considers the case of quantum access to the environment and aims to use this access to achieve a significant speed-up by developing a fully quantum approach [71,72]. This access can be achieved by oracularizing the environment's components, such as the transition probabilities and the reward function. Other methods, based on the gradient estimation algorithm from [11] and developed in [73,74], use quantum environments to directly compute the policy gradient as the output of a quantum procedure.
When trying to apply the standard quantum reinforcement learning techniques described above to Deep Hedging, one faces some problems that need to be resolved. One issue is that most quantum neural network models for policies have only been applied to discrete action spaces, while Deep Hedging has a continuous action space with constraints. Additionally, current algorithms for training quantum policies or value functions rely on solving the discounted Bellman equation, which is not suitable for hedging, as the goal is defined by a utility function and the value function no longer follows the Bellman equation. Furthermore, building a quantum-accessible environment and approximating the policy gradient with quantum methods also poses a challenge, as these methods require finite action spaces and use amplitude estimation to approximate the value function and its gradient with respect to the policy.
Therefore, we aim to design a quantum reinforcement learning framework for Quantum Deep Hedging by addressing the challenges of standard quantum reinforcement learning techniques. In the subsequent subsections, we outline the two methods we developed to overcome these challenges:
• Using orthogonal layers: We explore the application of quantum reinforcement learning methods to classical Deep Hedging environments using quantum orthogonal neural network architectures. This is in contrast to prior work that used variational circuits to represent parametrized quantum policies and value functions. To compare these quantum neural network architectures, we implemented policy-search Deep Hedging to train quantum policies and solve the classical environment. By using orthogonal layers, we aim to design a straightforward and effective method for enhancing Deep Hedging with quantum computing.
• Using compound layers: We propose a quantum-native approach to the Deep Hedging problem, where we formulate Quantum Deep Hedging as a fully quantum reinforcement learning problem and solve it using actor-critic methods. Following the steps in [1,2], we construct a quantum environment for Deep Hedging by providing quantum representations of the environment quantities and the trading goal. We then design specific quantum neural networks using compound layers and quantum reinforcement learning algorithms to solve the problem. Our approach is inspired by distributional reinforcement learning and leverages the properties of the quantum environment to provide a model-based, quantum-enhanced solution to Deep Hedging.
(Algorithm 1: policy-search Deep Hedging training loop; outputs the policy parameters ϕ.)

Classical Environment for Deep Hedging
We base our classical environment on the work in [1,2]. To model the market state s_t, we assume that it can be represented by a sequence of market observations {M_t}_{t=0}^{T}, and we provide a formal definition of the MDP for the environment described in Section 3.1. At each time-step t, the available market information M_t is described by n numerical quantities represented by a vector M_t = (M_t^1, . . ., M_t^n) ∈ R^n. This vector includes market information such as stock prices and other relevant financial data. In this setting, the market state s_t can be identified with the sequence of past and current market observations up to time-step t, s_t := {M_{t′}}_{t′=0}^{t}. Furthermore, there are m available hedging instruments, such as stocks, options, or futures, that can be traded with high liquidity in the market. The classical Deep Hedging environment can then be formally defined as a finite-horizon MDP over these market states, with continuous actions corresponding to the positions held in the m hedging instruments. This formulation allows us to model the problem of Deep Hedging formally, where the objective is to choose a sequence of trading actions {a_t^π}_{t=0}^{T} that optimizes the risk-adjusted expected returns over the given time horizon T, given a sequence of market observations {M_t}_{t=0}^{T}. At every time-step, the policy π maps the current market state s_t to an action a_t^π, performing a sequence-to-sequence mapping from {M_t}_{t=0}^{T} to {a_t^π}_{t=0}^{T}. Building upon this classical MDP framework for Deep Hedging, we now introduce quantum reinforcement learning methods specifically designed for classical environments, leveraging orthogonal layers to enhance their performance and efficiency.
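For intuition, a finite-horizon hedging MDP of this kind can be sketched as a minimal gym-style environment. Everything below (a single asset following a Gaussian random walk, proportional transaction costs, the specific reward) is an illustrative placeholder, not the environment of [1,2]:

```python
import numpy as np

class ToyHedgingEnv:
    """Minimal hedging episode sketch: the market observation M_t is one stock
    price, the action is the stock holding, and the reward is the hedge P&L
    net of proportional transaction costs (all dynamics are toy choices)."""

    def __init__(self, T=10, s0=100.0, sigma=1.0, cost=0.01, seed=0):
        self.T, self.s0, self.sigma, self.cost = T, s0, sigma, cost
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.price, self.holding = 0, self.s0, 0.0
        return np.array([self.price])          # market observation M_t

    def step(self, action):
        # Trade to the new holding, pay a proportional cost, then the price moves.
        trade_cost = self.cost * abs(action - self.holding) * self.price
        self.holding = action
        move = self.rng.normal(0.0, self.sigma)
        reward = self.holding * move - trade_cost   # P&L of the hedge
        self.price += move
        self.t += 1
        done = self.t >= self.T
        return np.array([self.price]), reward, done
```

A policy that never trades receives zero reward at every step, since it holds nothing and pays no costs; this gives a simple sanity check on the reward definition.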

Quantum Reinforcement Learning methods for Classical Environments
Our first approach to Quantum Deep Hedging utilizes quantum orthogonal neural network architectures to parametrize the policy π. While in this part of our work we focus on parametrizing the policy, this approach could be extended to the value function as well.
The policy QNN π(·; ϕ) is a sequence-to-sequence model parametrized with ϕ := {ϕ_t}_{t=0}^{T}, one parameter set per time-step, which can be shared or not depending on the setting, such that π_t^{ϕ_t} represents the neural network used at time-step t with parameters ϕ_t. The input time-series data (M_0, M_1, . . ., M_T) is preprocessed classically and transformed into a sequence of embeddings (x_0, x_1, . . ., x_T) in a high-dimensional feature space of dimension d. We use the quantum neural networks described in Section 4.3, which extract features from each x_t ∈ R^d and may pass information across time to produce the final output sequence (y_0, y_1, . . ., y_T). The number of hidden layers used in each neural network architecture is a hyper-parameter governed by factors like the complexity of the learning problem and the availability of resources. The output y_t ∈ R^d can be further processed classically to obtain the desired result, which is the action a_t^π. We can train these parametrized quantum policies using the policy-search Deep Hedging algorithm, introduced in [1,2], by updating the set of parameters ϕ using gradient descent to minimize the policy loss function L(ϕ), defined as the negative of the risk-adjusted trading goal achieved by the policy π_ϕ. The training procedure, outlined in Algorithm 1, proceeds as follows. At every iteration, it generates N trajectories {s_t^i}_{t=0}^{T} before using the policy QNN to compute the corresponding sequences of actions. Using these sequences of actions, we can compute the cumulative return for each episode, estimate the utility over these episodes to obtain an estimate of the policy loss function defined earlier, and then update the parameters ϕ.
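The outer loop of Algorithm 1 can be illustrated with a deliberately tiny stand-in: a one-parameter policy acting on simulated price increments, trained by gradient descent on an exponential-utility loss estimated over a batch of episodes. The dynamics, the scalar policy, and the finite-difference gradient are all simplifications for illustration, not the paper's setup.

```python
import numpy as np

def episode_returns(phi, n_paths=256, T=5, seed=0):
    # Toy stand-in for the hedging environment: hold a = tanh(phi) units of
    # stock against a short position of one unit, so R = sum_t (a - 1) dS_t.
    rng = np.random.default_rng(seed)
    dS = rng.normal(0.0, 1.0, size=(n_paths, T))
    return (np.tanh(phi) - 1.0) * dS.sum(axis=1)

def policy_loss(phi, lam=1.0):
    # Exponential-utility policy loss: L(phi) = (1/lam) log E[exp(-lam R)],
    # estimated over the batch of simulated episodes.
    return np.log(np.mean(np.exp(-lam * episode_returns(phi)))) / lam

phi, lr, eps = 0.0, 0.5, 1e-4
for _ in range(100):
    # Finite-difference gradient of the estimated loss, then a descent step.
    grad = (policy_loss(phi + eps) - policy_loss(phi - eps)) / (2 * eps)
    phi -= lr * grad
```

The loss is minimized near tanh(ϕ) = 1, i.e. a perfect hedge of the unit short position, so training drives the holding toward the hedge ratio.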
The same principle of using orthogonal layers within classical neural network architectures can be extended to other deep reinforcement learning algorithms for Deep Hedging, such as actor-critic and value-based methods, as well as to future approaches in Deep Hedging.

Quantum Deep Hedging in Quantum Environments
Here, we extend the classical Deep Hedging problem to the quantum case, starting by defining the quantum environment and the trading goal for Quantum Deep Hedging. Then, in order to develop quantum algorithms to solve this newly defined trading goal in the quantum environment, we connect to distributional reinforcement learning by showing that the value function in our environment can be expressed using a categorical distribution.
We will propose a model-based distributional approach to approximate this value function by constructing appropriate quantum unitaries and observables that approximate the value function and its distribution. Furthermore, we introduce two quantum algorithms: Quantum Deep Hedging with expected actor-critic and Quantum Deep Hedging with distributional actor-critic, summarized in Algorithms 2 and 3, respectively. These algorithms use quantum neural network architectures with compound layers to approximate the policy and value functions.

Quantum Environment for Deep Hedging
We present here a method for converting the classical Deep Hedging problem into a quantum-native setup, which we refer to as the quantum environment for Deep Hedging. In order for this approach to be efficient, we make the state space finite, allowing the market states to be encoded into quantum states of the form |s_t⟩. Moreover, at each time-step, the environment utilizes an oracle U_{p_t} to map the transition probabilities p_t to the amplitudes of a quantum state, which is a superposition of the next states |s_{t+1}⟩.
Note that the quantum environment introduced here still solves the classical Deep Hedging task and differs from previous work in quantum reinforcement learning. Previous work, such as [71][72][73][74][75], has employed quantum environments with finite action spaces to create model-based approaches. A key distinction in our approach is that our transition oracles do not necessitate an encoding of the action a_t, as the Deep Hedging model assumes that actions have no impact on the transition probabilities, i.e., trading actions do not affect the market [37]. This is a valuable feature of our approach, as it removes the need for a quantum encoding of actions. Additionally, our approach does not require quantum access to the reward function through oracles. While previous work encodes the reward function into parts of a quantum state, for example in the amplitudes [71], in quantum registers [72,73], or in the phases of the quantum states [74], it is unclear how to achieve this efficiently for the reward function associated with Deep Hedging, which is defined on a continuous action space. Consequently, we present an alternative formulation of quantum environments that incorporates the unique structure of the Deep Hedging MDP, as detailed in Definition 2.
To formally specify our quantum environment, we assume that the market information M_t at each time-step t can be represented as an n-bit binary string b_t ∈ {0, 1}^n, which we encode in an n-qubit computational basis state |M_t⟩ := |b_t⟩. The quantum encoding |s_t⟩ of the market state s_t can then be expressed as |s_t⟩ := |M_0⟩ ⊗ |M_1⟩ ⊗ · · · ⊗ |M_t⟩. With the formalism described above, the oracle U_{p_t} encoding the transition probabilities described by a classical transition function p_t can be written as U_{p_t} : |s_t⟩|0^n⟩ ↦ Σ_{s_{t+1}} √(p_t(s_{t+1}|s_t)) |s_{t+1}⟩, where the sum runs over the states s_{t+1} that extend s_t by one market observation. We will now redefine the trading goal in the context of the quantum environment for Deep Hedging. While our focus is on the exponential utility E_λ as defined in Section 3.2, our approach can be applied to any risk measure that can be expressed as the expectation of a deterministic function over future returns. To evaluate the value function for a given policy π, we need to compute the expectation of the exponentiated rewards over future returns. Specifically, for a state s_t, the random variable exp(−λR_t^π) over the future trajectories can be represented by a quantum observable O_t^π whose eigenvalues are the possible exponentiated returns. Using this observable, we can redefine the trading goal in the quantum environment for Deep Hedging as finding an optimal policy π* that minimizes the logarithm of the expectation of O_t^π. Our next goal is to develop quantum algorithms to solve this newly defined trading goal in the quantum environment. In other words, our objective is to find the optimal policy π* that minimizes the logarithm of the expectation of the quantum observable O_t^π, which represents the exponentiated rewards over future returns. To design our quantum algorithm, we now connect to distributional reinforcement learning by showing that the value function in our environment can be expressed using a categorical distribution.
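A transition oracle of this kind can be illustrated in NumPy by completing the column of amplitudes √(p_t(s_{t+1}|s_t)) to a full unitary; which unitary extends the column is arbitrary, and the toy transition row below is invented for illustration:

```python
import numpy as np

def oracle_unitary(p_row):
    # Build one valid (real) unitary whose first column is the amplitude
    # encoding of a transition row p_t(.|s_t); the remaining columns are an
    # arbitrary orthonormal completion obtained via QR.
    v = np.sqrt(np.asarray(p_row, dtype=float))
    M = np.eye(len(v))
    M[:, 0] = v
    Q, R = np.linalg.qr(M)
    return Q * np.sign(R[0, 0])      # fix the sign so column 0 equals v

p_row = np.array([0.1, 0.2, 0.3, 0.4])   # toy transition probabilities
U = oracle_unitary(p_row)
psi = U[:, 0]                            # state prepared from |0>
```

Measuring psi in the computational basis recovers the transition probabilities by the Born rule, which is all the algorithm requires of U_{p_t} on the |0⟩ input.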

Connection with Distributional Reinforcement Learning
The connection between our proposed Quantum Deep Hedging approach and distributional reinforcement learning lies in the definition of the value functions. In distributional reinforcement learning, the focus is on learning the probability distribution of the returns, as opposed to just the expected return. This is done using neural networks [76,77] that approximate the return distribution using categorical distributions. Similarly, in our Quantum Deep Hedging approach, the quantum observable O_t^π can be interpreted as a categorical distribution over all possible future returns, and our algorithms are designed to approximate this distribution and find the optimal policy π* that minimizes the logarithm of its expectation. Distributional reinforcement learning therefore provides a useful framework for our approach.
In distributional reinforcement learning, value functions are defined using distributions, and categorical distributions are a common choice in many approximation schemes. A categorical distribution P_z, with a finite support z = {z_1, z_2, . . ., z_K}, is defined as a mixture of Dirac measures on each element of z and has the form P_z := Σ_i p_i δ_{z_i}, where p_i ≥ 0, Σ_i p_i = 1, and δ_{z_i} is the Dirac measure on z_i [26]. In other words, a categorical distribution is a probability distribution over a finite set of discrete outcomes. To formalize the notion of approximating distributions, we will use the Cramér distance (or ℓ_2 metric), defined as follows:
Definition 3 (Cramér distance). Given two distributions P, Q over subsets of R, with cumulative distribution functions (over R) given by F_P, F_Q respectively, the Cramér distance between the two distributions is defined as ℓ_2(P, Q) := (∫_R (F_P(x) − F_Q(x))² dx)^{1/2}.
(Figure 6: data generated from a trimodal distribution. In (6a), a single distribution is learned to match the expectation over the entire set of states. In (6b), the set is divided into three distinct subsets, with each peak representing the weighted distribution learned within its corresponding subset. This improved approach provides a closer fit to the original data and effectively incorporates information about the tails, offering a more accurate representation of the underlying distribution.)
An important result from the distributional reinforcement learning literature [26, Proposition 1] shows the following property of the Cramér distance: a categorical distribution P_z over some support z can be projected onto a categorical distribution over a different support z′ by a mapping Π_C (called the Cramér projection) that preserves the expectation (E[P_z] = E[Π_C(P_z)]) while minimizing the Cramér distance between P_z and Π_C(P_z), as long as the new support z′ has a larger range, i.e. [min z, max z] ⊆ [min z′, max z′].
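Both the Cramér distance and the Cramér projection are easy to make concrete for finite mixtures of Diracs, since the CDFs are step functions. The sketch below (our own illustration, following the construction in [26]) spreads each atom's mass over its two nearest points of the new support, which preserves the expectation exactly whenever the atoms lie inside the new support's range:

```python
import numpy as np

def cramer_distance(atoms_p, probs_p, atoms_q, probs_q):
    # l2 distance between step-function CDFs: the integral over R reduces
    # to a finite sum over the intervals of the merged supports.
    xs = np.union1d(atoms_p, atoms_q)
    Fp = np.array([probs_p[atoms_p <= x].sum() for x in xs])
    Fq = np.array([probs_q[atoms_q <= x].sum() for x in xs])
    widths = np.diff(xs)
    return np.sqrt(np.sum((Fp[:-1] - Fq[:-1]) ** 2 * widths))

def cramer_projection(atoms, probs, support):
    # Spread each Dirac's mass onto its two neighbours in the new support,
    # proportionally to distance; this keeps the expectation unchanged for
    # atoms inside [min(support), max(support)].
    out = np.zeros(len(support))
    for z, p in zip(atoms, probs):
        j = np.searchsorted(support, z)
        if j == 0:
            out[0] += p
        elif j == len(support):
            out[-1] += p
        else:
            lo, hi = support[j - 1], support[j]
            w = (z - lo) / (hi - lo)
            out[j - 1] += p * (1 - w)
            out[j] += p * w
    return out
```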
To establish a connection between our framework and distributional reinforcement learning, we can utilize the fact that sampling from a categorical distribution can be achieved by measuring a quantum observable. Specifically, the observable O_z := Σ_{b=1}^{K} z_b |b⟩⟨b| yields the outcome z_b with probability p_b when measured in the state |z⟩ := Σ_{b=1}^{K} √(p_b) |b⟩. Here, the number of qubits m required to index all the outcomes is such that K ≤ 2^m. Thus, measuring O_z in |z⟩ serves as a quantum representation of the distribution P_z over the support z.
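This correspondence admits a one-line numerical check: with O_z diagonal and |z⟩ amplitude-encoding √(p_b), the Born rule gives outcome z_b with probability p_b, so ⟨z|O_z|z⟩ equals the categorical expectation. The support and probabilities below are invented for illustration.

```python
import numpy as np

z = np.array([-1.0, 0.0, 2.0, 5.0])     # support (eigenvalues of O_z)
p = np.array([0.1, 0.4, 0.3, 0.2])      # outcome probabilities

O_z = np.diag(z)                        # O_z = sum_b z_b |b><b|
state = np.sqrt(p)                      # |z> = sum_b sqrt(p_b) |b>

born_probs = state ** 2                 # computational-basis statistics
expectation = state @ O_z @ state       # <z|O_z|z>
```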
In the Quantum Deep Hedging framework, the categorical distribution that models the returns distribution of a policy π at a time-step t can be represented by measuring O_t^π, the quantum observable defined in Section 5.2.1, in the quantum state |(s_T|s_t)⟩ that encodes the probabilities of the future trajectories given the current state s_t.
Constructing the quantum state |(s_T|s_t)⟩ requires quantum access to the environment: it can be generated using the transition oracles {U_{p_{t′}}}_{t′=t}^{T} and measured with the observable O_t^π. However, the observable O_t^π is not efficient to implement, as its natural description requires classical computation of its eigenvalues and depends on the policy π, which changes during the training procedure. To address this, we propose a construction to approximately represent the distribution using a much simpler observable that is diagonal in the computational basis and has a spectrum independent of the specific reward structure. This construction is detailed in the following proposition, which demonstrates, using the Cramér projection, how an observable with a fixed support can approximate the value distribution while preserving its expectation.
Proposition 1. Consider a support z of size 2^m such that, for any policy π, the range of the exponentiated returns is contained in [min z, max z]. Then, there exists an observable O_t^z with eigenvalues in z such that, for any policy π, there is a unitary U_t^π whose output state, measured with O_t^z, matches the expectation of the exponentiated returns and approximates their distribution in Cramér distance.
Proof. Given a policy π and a state s_t ∈ S_t, the distribution of the exponentiated returns is a categorical distribution, denoted as P_t^π, from which we can obtain a categorical distribution over a support z using the Cramér projection. Specifically, we can project P_t^π onto the support z to obtain P_t^z := Π_C(P_t^π), which returns a z_b ∈ z with probability p(z_b|s_t). We can then define a unitary that maps |s_t⟩|0⟩^⊗m to |s_t⟩|(z|s_t)⟩, where |(z|s_t)⟩ := Σ_b √(p(z_b|s_t)) |b⟩ is a quantum state encoding the probabilities of P_t^z. The claim for a particular state s_t is satisfied when we measure the output of this unitary with the observable I ⊗ O_z, where I is the identity operator acting on the register holding |s_t⟩ and O_z is the diagonal observable with eigenvalues z. Since the different quantum encodings |s_t⟩ are orthogonal, a unitary U_t^π that performs the Cramér projection for all s_t ∈ S_t can be constructed. By known properties of the Cramér projection, the first requirement of matching expectations is satisfied.
We now consider the Cramér distance between the true and projected distributions. As the Cramér projection minimizes the Cramér distance, we obtain an upper bound by analyzing the distance from any projection onto the same support. Consider the projection that assigns to each z_b the weight of the true distribution between z_{b−1} and z_b (where the subtraction is performed by viewing b as the binary representation of an integer). Let the true and projected cdfs be F_{P_t^π} and F_{P_t^z} respectively. The square of the Cramér distance between these distributions is the sum of 2^m integrals of the form ∫_{z_{b−1}}^{z_b} (F_{P_t^π}(x) − F_{P_t^z}(x))² dx. Accumulating the 2^m terms and taking the square root, we obtain the necessary bound on the Cramér distance.
Proposition 1 illustrates the use of quantum circuits to approximate the distributional value function by fixing a support z and measuring the observable O_t^z in the quantum states |(z|s_t)⟩ produced by the unitary U_t^π. We could hope to learn U_t^π by adapting existing classical distributional reinforcement learning algorithms to our quantum setting. However, this approach may be challenging for two reasons. First, most of the existing approaches in distributional reinforcement learning assume that the value function conforms to the discounted Bellman equation, which does not hold in our case since we need to take into account the risk-adjusted measure. Second, the size of the quantum support increases exponentially with the number of qubits, making training impractical.
To overcome these challenges, we propose a new approach that utilizes the environment's model and exploits the structure and properties of our Hamming-weight preserving unitaries. This approach allows us to learn polynomial-sized distributions rather than exponential ones.

Distributional Value Function Approximation
We propose a model-based approach to learn distributional value functions in quantum environments. We introduce a new distributional reinforcement learning algorithm that differs from existing methods, in particular from [76], which fixes the support z and learns the probabilities of each element in the support, and from [77], which fixes the probabilities (or quantiles) and learns the support. Our approach instead splits the set of future trajectories into subsets and learns the expectation of the distribution within each subset. Because our approach is model-based, we can use the model to compute the probabilities of these subsets and then calculate the overall expectation as well.
This approach can use any type of unitaries that preserve subspaces of some nature; here we give a specific example using the Hamming-weight preserving quantum compound neural networks developed in Section 4.4. As we have seen, these compound neural networks preserve the Hamming-weight subspaces and thus allow us to split the set of future trajectories according to the Hamming-weight of their quantum representation.
In more detail, we partition the set of all complete trajectories S_T into n(T − t) + 1 disjoint subsets S_T^{t,k}, where k = 0, . . ., n(T − t), such that each subset contains the terminal states whose trajectories from t + 1 to T have Hamming-weight k. This allows us to decompose the superposition |(s_T|s_t)⟩ of all trajectories by grouping the future trajectories by Hamming-weight. Specifically, we express |(s_T|s_t)⟩ as a sum of terms corresponding to each subset, with each term given by |(s_T|s_t, k)⟩, where k is the Hamming-weight of the trajectories in the subset. We learn one expectation per subset, and by computing the probability of each subset, we can recover the overall expectation over all subsets. The probability that the future trajectory will have Hamming-weight k is denoted by p(k|s_t), and |(s_T|s_t, k)⟩ is the normalized superposition of the trajectories in S_T^{t,k}, so that |(s_T|s_t)⟩ = Σ_k √(p(k|s_t)) |(s_T|s_t, k)⟩. In what follows, we will show that there exist Hamming-weight preserving unitaries that can approximate the expected value for each subset of future trajectories grouped by Hamming-weight, extending the result of Proposition 1.
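Recovering the overall expectation from the per-subset quantities is just the law of total expectation over the Hamming-weight subsets, which can be checked directly on toy data (random trajectory probabilities and values, invented for illustration):

```python
import numpy as np
from itertools import product

def partition_by_hamming(n_bits, probs, values):
    # Group trajectories (bit-strings) by Hamming-weight; return, per weight
    # k, the subset probability p(k) and the conditional expectation of the
    # value within that subset.
    trajs = list(product([0, 1], repeat=n_bits))
    pk, pv = {}, {}
    for traj, p, v in zip(trajs, probs, values):
        k = sum(traj)
        pk[k] = pk.get(k, 0.0) + p
        pv[k] = pv.get(k, 0.0) + p * v
    return {k: (pk[k], pv[k] / pk[k]) for k in pk if pk[k] > 0}

rng = np.random.default_rng(2)
n_bits = 4                                  # toy number of future bits
probs = rng.random(2 ** n_bits)
probs /= probs.sum()                        # trajectory probabilities
values = rng.normal(size=2 ** n_bits)       # toy exponentiated returns
parts = partition_by_hamming(n_bits, probs, values)
total = sum(p * e for p, e in parts.values())
```

Here only n_bits + 1 conditional expectations are learned instead of 2^n_bits trajectory values, which is the saving the model-based construction exploits.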
Proposition 2. Consider a support $z$ of size $2^{n(T-t)+2}$ such that, for any policy $\pi$ and for any Hamming weight $k = 0, \ldots, n(T-t)$, the stated condition holds. Then, there exists an observable $O^z_t$ with eigenvalues in $z$ that acts on $n(T+1)+2$ qubits and such that, for any deterministic policy $\pi$, there is a Hamming-weight preserving unitary $U^\pi_t$ satisfying the stated approximation. Proof. Given a policy $\pi$ and a state $s_t \in S_t$, the distribution of the exponentiated returns restricted to future paths with Hamming weight $k$ is also a categorical distribution, denoted $(P^\pi_t)^{(k)}$, from which we can obtain a categorical distribution $(P^z_t)^{(k)}$ over a support $z^{(k)} := \{z_b \mid |b| = k+1\}$ using the Cramér projection. We can construct a Hamming-weight preserving unitary that maps $|(s_T|s_t,k)\rangle \otimes |01\rangle$ to $|s_t\rangle \otimes |(z^{(k)}|s_t)\rangle$, where $|(z^{(k)}|s_t)\rangle$ encodes the probabilities of $(P^z_t)^{(k)}$. Measuring this state with the observable $O^z_t$, as defined in the proof of Proposition 1, satisfies the claim for a specific state $s_t$ and Hamming weight $k$. Since the quantum states $|(z^{(k)}|s_t)\rangle$ have different Hamming weights, it is possible to construct a single unitary that performs this mapping for all Hamming weights. Similarly, as in Proposition 1, we can construct a unitary that performs this mapping for all states $s_t \in S_t$.
Regarding the above proof, there are two points to note. First, in order to calculate expectations on different subsets of future trajectories grouped by Hamming weight, we added two ancilla qubits to satisfy the requirement of the Cramér projection for a support of size at least 2. This requirement is not satisfied for trajectories of Hamming weight $0$ or $n(T-t)$ without these ancilla qubits. As a result, the Hamming weight of the measured quantum states is shifted by $+1$. Second, if the subsets of trajectories with Hamming weights $0$ and $n(T-t)$ are empty, then only $n(T-t)$ qubits are needed instead of $n(T-t)+2$.
If we have Hamming-weight preserving unitaries, as described in Proposition 2, that produce the correct expectation on every subset, then we can obtain the overall expectation of the distributional value function. By loading the superposition over all future states $|(s_T|s_t)\rangle$ using the transition oracles $\{U^p_{t'}\}_{t'=t}^{T-1}$ and applying the unitary $U^\pi_t$ from Proposition 2, we can calculate the overall expectation directly, without having to reconstruct the expectations per subspace and combine them classically. Hamming-weight preserving unitaries act only inside each subspace, mapping each $|(s_T|s_t,k)\rangle \otimes |01\rangle$ to $|(z|s_t,k+1)\rangle$. Therefore, the overall expectation matches the expectation of the value function when we apply $U^\pi_t$ to $|(s_T|s_t)\rangle \otimes |01\rangle$, resulting in the state $|(z|s_t)\rangle$ with density $\rho^\pi_t(z|s_t)$; in other words, the two expectations coincide. We have shown that for any policy $\pi$, a Hamming-weight preserving unitary exists at each time-step $t$ that accurately predicts the expectation of the exponentiated returns on every subset of future paths, as well as the overall expectation over all future paths, by projecting the output states onto an observable $O^z_t$ independent of $\pi$. We will now use quantum compound neural networks to parametrize these unitaries and provide reinforcement learning algorithms to learn the expectation on every subset before using the overall expectation to improve the policy $\pi$.
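As a classical sanity check of this decomposition, the following sketch partitions future bit-trajectories by Hamming weight, computes one expectation per subset, and recovers the overall expectation as the probability-weighted sum over subsets. The return function `ret` is a hypothetical stand-in for the exponentiated returns, and the uniform path measure mimics the equal-superposition transition oracles; none of these names come from the paper.

```python
import itertools
import numpy as np

def subset_expectations(n_future_bits, returns_fn):
    """Split all future bit-trajectories by Hamming weight and return, for
    each weight k, the probability of that subset and the expectation of
    the returns within it (uniform path measure assumed)."""
    paths = list(itertools.product([0, 1], repeat=n_future_bits))
    stats = {}
    for k in range(n_future_bits + 1):
        subset = [p for p in paths if sum(p) == k]
        prob = len(subset) / len(paths)
        stats[k] = (prob, float(np.mean([returns_fn(p) for p in subset])))
    return stats

# Hypothetical stand-in for the exponentiated returns.
ret = lambda p: float(np.exp(sum(p) - len(p) / 2))

stats = subset_expectations(4, ret)
# Overall expectation = probability-weighted sum of per-subset expectations,
# which should match the direct average over all future trajectories.
overall = sum(prob * e for prob, e in stats.values())
direct = float(np.mean([ret(p) for p in itertools.product([0, 1], repeat=4)]))
```

In the quantum setting the per-subset expectations are produced coherently by one Hamming-weight preserving unitary, and the subset probabilities come from the model rather than from enumeration.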

Compound Neural Networks for Deep Hedging
In this section, we discuss a general approach to constructing quantum neural networks using Hamming-weight preserving unitaries that can be used in Quantum Deep Hedging for quantum environments. We explain how these networks can be used for both the policy and the value function before providing algorithms for training them in Section 5.2.5. At each time-step, we design a quantum neural network that acts on $n(T+1)+2$ qubits. It takes the quantum encoding $|s_t\rangle$ as input on the first $n(t+1)$ qubits, applies the transition oracles on $n(T+1)$ qubits, and uses a compound neural network with two additional ancilla qubits (initialized to $|01\rangle$) to predict a value within some range $(o_{\min}, o_{\max})$. First, we design the observable, and then we present the architecture of the quantum circuit for the value function, which can also be applied to the policy by adjusting the bounds accordingly.
To construct the observable, we need to define a support that covers the valid range of values $(o_{\min}, o_{\max})$ for every Hamming-weight subset of size at least 2. For the value function, this range corresponds to the possible values of the exponentiated returns, which is crucial for the validity of the Cramér projection. For the policy, the support should only cover the range of valid actions. To design the support, we draw inspiration from [76] and define the support for each subset of Hamming weight $k+1$ as having size $N_k = \binom{n(T-t)+2}{k+1}$ with uniformly spaced atoms. We can express this support as a set of atoms indexed by $i_b$, where $i_b$ ranges from $0$ to $N_k - 1$ and indexes all the quantum states with Hamming weight $k+1$.
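A minimal sketch of this support construction, assuming the shifted Hamming weights $1, \ldots, n(T-t)+1$ on $n(T-t)+2$ qubits and a target range $(o_{\min}, o_{\max})$; the helper name `build_support` is ours, not the paper's.

```python
import math
import numpy as np

def build_support(n_qubits, o_min, o_max):
    """One uniformly spaced support per (shifted) Hamming-weight subspace:
    weight k + 1 gets N_k = C(n_qubits, k + 1) atoms covering [o_min, o_max]."""
    supports = {}
    for k in range(n_qubits - 1):  # shifted weights run over 1 .. n_qubits - 1
        n_atoms = math.comb(n_qubits, k + 1)
        supports[k] = np.linspace(o_min, o_max, n_atoms)
    return supports

# Example: n(T - t) + 2 = 6 qubits, value range (0, 1).
sup = build_support(n_qubits=6, o_min=0.0, o_max=1.0)
```

Note that every subspace in this range has at least two basis states, so each support has the minimum size of 2 required by the Cramér projection.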
To construct the unitary, we begin with $|s_t\rangle \otimes |0\rangle^{\otimes n(T-t)} \otimes |01\rangle$ and apply the model to create a superposition over all future trajectories $|(s_T|s_t)\rangle \otimes |01\rangle$. We can rewrite this superposition as the tensor decomposition $|s_t\rangle \otimes |M_{t+1}, \ldots, M_T\rangle \otimes |01\rangle$, since non-zero probability paths necessarily start from $s_t$. Next, we conditionally apply a compound layer to the last $n(T-t)+2$ qubits, controlled by the first $n(t+1)$ qubits. Instead of learning $|S_t| = O(2^{nt})$ parameter sets, one for each $s_t$, we control on each of the first $n(t+1)$ qubits individually and learn at most $2n(t+1)$ sets of parameters, two per control qubit.
Since the control qubits are always in a computational basis state, they can be removed and the appropriate unitary can be chosen classically for each past state. Thus, our compound neural network $U_t(\omega_t)$ with parameters $\omega_t$ is written as the product of $n(t+1)$ compound layers, each with parameters $\omega^{i,0}_t$ or $\omega^{i,1}_t$ depending on whether the $i$-th qubit is in state $|0\rangle$ or $|1\rangle$.
After applying the unitary $U_t(\omega_t)$ as described earlier, we obtain a quantum state such that measuring the observable $O^z_t$ in this state results in a value within the range $(o_{\min}, o_{\max})$. Furthermore, when using this circuit for the value function, we can apply $U_t(\omega_t)$ to a specific Hamming weight in order to predict a value only for that subset of trajectories.
In summary, we have described how to construct a quantum neural network for a given time-step $t$ using the environment's model. These quantum neural networks differ between environments, and we provide a concrete example in Section 6.2 (see Figure 7). In the next section, we use these networks to model the value function, adapt the same compound neural network for the policy, and train the parameters of both networks using an actor-critic algorithm.

Quantum Reinforcement Learning Methods for Quantum Environments
We present an actor-critic approach to Quantum Deep Hedging, which is specifically designed for quantum environments. To represent the policy and the value function, we use the compound neural networks introduced in the previous section. For each time-step $t$, we use a compound neural network with $n(t+1)$ layers, where each layer is controlled by one of the first $n(t+1)$ qubits and acts on the remaining $n(T-t)+2$ qubits. We use the Brick architecture with logarithmic depth for each layer, resulting in an overall depth of $O(nt \log n(T-t))$.
The value QNN $v$ is parameterized by $\omega := \{\omega_t\}_{t=0}^T$, where $\omega_t$ contains all the parameters used at time-step $t$. Each layer's parameters, $\omega^i_t$, can be further split depending on the possible values of the $i$-th control qubit. We construct the observable $O^z_t$ as defined in Section 5.2.4 on a support $z$ that covers the bounds on the exponentiated return function. We assume knowledge of $z$, which can be computed classically. In this case, the value QNN maps the state $s_t$ to its value estimate. Here, $\rho^{\omega_t}_t(z|s_t) := |(z|s_t)\rangle\langle(z|s_t)|$ is the output of the value QNN when applied to $s_t$. Similarly, we define the policy network $\pi$ with parameters $\phi := \{\phi_t\}_{t=0}^T$. For each time-step, we define an observable $O^a_t$ on a different support $a$ that covers the range of valid actions (typically $[0,1]$), using the approach described in Section 5.2.4; the resulting measurement is the output of the policy QNN when applied to $s_t$. If there is more than one hedging instrument, we can construct one policy QNN per instrument. However, this is not necessary for the value function, since it evaluates the overall policy across all instruments.
To train the value network, we use two optimization objectives: the distributional loss and the expected loss. The distributional loss $L_D$ takes the Hamming weight of the current state into account when evaluating the expected reward, while the expected loss $L_E$ only considers the expected reward at time $t$. After updating the value parameters $\omega$, we use them to build estimates of the value function and then update the policy parameters $\phi$. Using the value estimates, we update the policy with gradient descent to minimize a loss adapted from [37].
The training procedure for our approach is outlined in Algorithms 2 and 3. At each iteration, we generate $N$ trajectories $\{s^i_t\}_{t=0}^T$ and use the policy QNN to compute the corresponding sequence of actions. From this sequence of actions, we compute the cumulative return for each episode and each time-step, and we use these returns to update the value network. When using the expected loss, we update the value parameters $\omega$ such that we predict this cumulative return in expectation. In the distributional case, we additionally compute the Hamming weight of the future trajectory and update the parameters such that we predict the expectation for only that subspace. Once the value estimates are updated, we use them to update the policy parameters $\phi$.
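The Monte-Carlo targets behind the expected loss can be sketched classically as follows; `expected_loss` is a hypothetical stand-in that scores value predictions against cumulative future returns with a mean-squared error (the experiments in Section 6.2 use a Huber variant, omitted here for brevity).

```python
import numpy as np

def expected_loss(value_preds, rewards):
    """Mean-squared error between value predictions and Monte-Carlo
    cumulative-return targets G_t = sum_{t' >= t} r_{t'}.
    Both arrays have shape (episodes, T)."""
    # Reversed cumulative sum gives the return-to-go at every time-step.
    targets = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]
    return float(np.mean((value_preds - targets) ** 2))

rng = np.random.default_rng(0)
rewards = rng.normal(size=(16, 5))                     # 16 episodes, T = 5
targets = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]
loss_zero_net = expected_loss(np.zeros((16, 5)), rewards)
loss_perfect = expected_loss(targets, rewards)
```

A value network that outputs the exact return-to-go drives this loss to zero; the distributional variant computes the same targets restricted to episodes sharing a Hamming weight.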

Properties of Quantum Deep Hedging
Predicting the performance of any particular deep learning algorithm is difficult due to the complexity of the models used and the often unpredictable behavior of the non-convex optimization that must be performed for training. It is, however, possible to examine some desirable global properties of the system that indicate (but do not guarantee) good performance. The structure of the presented Quantum Deep Hedging framework leads to several of these global properties when specifically instantiated with Hamming-weight preserving unitaries (as in Section 5.2.3). Some of these properties have been hinted at throughout the text, and we summarize them here:
• Expressivity: The central question for any learning algorithm is whether the parameterized models used are expressive enough to capture target models of interest. The "universality" of models such as deep neural networks has been a driving force in their adoption and utility. In our algorithms we do not use models that are universal in the sense of expressing any quantum operation; however, we show that they are expressive enough to capture the quantities of interest, which in our case is the true distribution of the value function. The primary challenge is that the distribution of the value function is on an unknown and potentially changing support. We show in Proposition 1 that our model, which uses a fixed support and general parameterized unitaries on $m$ qubits, can approximate the true distribution with error decaying exponentially in $m$. In Proposition 2, we specialize this result to the case where the value function distribution is constant on the Hamming-weight subspaces and correspondingly use a fixed support with Hamming-weight preserving parameterized unitaries.
• Generalization: The number of possible futures in a deep-hedging environment grows exponentially with the time horizon $T$. In practice, our learning algorithm can only use a limited number of episodes (polynomial in $T$). We must therefore investigate the out-of-sample performance, or generalization, of our algorithm in this setting. This can be guaranteed in our setting, where we use the quantum compound neural networks. Our parameterized models consist of $O(T)$ such networks, each on $O(T)$ qubits. From the definition in Section 4.4, each of the networks has $O(T^2)$ parameters. As a consequence of results due to Caro and Datta [78], the pseudo-dimension of our parameterized model is polynomial in $T$. Therefore, $\mathrm{poly}(T)$ episodes suffice to ensure that empirical risk minimization over our sample converges with high probability to the true optimal expressible model over the whole distribution of futures.
• Trainability: Finally, we consider whether the task of optimizing the parameters of our model can be performed efficiently. The associated optimization problem is non-convex, and thus training convergence cannot be guaranteed. We can show, however, that in our setting the well-known "barren plateau" problem [21] does not arise. Each quantum compound neural network that we use acts on $O(T)$ qubits and has $O(T)$ depth. Furthermore, the loss function we measure can be constructed as a function of measurements in the computational basis (corresponding to measuring the observable $Z$ on each qubit $i$). We initialize the parameters of the model as normal random variables with variance $O(1/T)$. Theorem 2 ensures that the gradients decay only polynomially with the time horizon $T$.
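The initialization scheme in the trainability argument above amounts to two lines of code; `init_angles` is an illustrative helper of ours, not code from the paper.

```python
import numpy as np

def init_angles(num_params, horizon, rng):
    """Draw circuit parameters from N(0, 1/T): the variance-O(1/T) Gaussian
    initialization under which gradients decay only polynomially in T."""
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(horizon), size=num_params)

theta = init_angles(num_params=10_000, horizon=30, rng=np.random.default_rng(1))
```

For a horizon of 30 time-steps, the empirical variance of the drawn angles should be close to $1/30$.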

Results
In the previous sections we introduced quantum methods for Deep Hedging, which use quantum orthogonal and compound neural networks within policy-search and actor-critic reinforcement learning algorithms. In this section we present results of hardware experiments evaluating our methods on classical and quantum-accessible market environments. We benchmarked our models for both classical and quantum environments using three different methods: simulating our quantum models on classical hardware assuming perfect quantum operations, simulating them on classical emulators that model the noise of quantum hardware, and running our quantum models directly on the 20-qubit trapped-ion quantum processors Quantinuum H1-1 and H1-2 [27]. Note that because orthogonal layers are efficiently simulatable classically, we can perform simulations with up to 64 qubits, while for the compound architectures, which use the entire exponential space, we only simulated layers with up to 12 qubits.
In all experiments, the parameters of all the quantum compound neural networks were initialized using Gaussian initialization, and the training for all quantum neural networks was performed in exact classical simulation.In the following subsections, we give the details of the results.

Classical Market Environment
In the first part of our experiments, we consider Quantum Deep Hedging as described in Section 5.1 in classical market environments. We considered the environment from [1,2], where the authors used the Black-Scholes model to simulate the market state and evaluate hedging strategies. In this setup, the underlying asset is modeled using Geometric Brownian Motion (GBM), which is commonly used in finance to model stock prices.
For simulations, we take one calendar year as one unit of time and therefore set $dt = 1/252$, assuming 252 trading days in a calendar year. The market state $s_t$ is represented by the sequence of past and current market observations $\{M_{t'}\}_{t'=0}^t$, where $M_{t'}$ is the stock price at time-step $t'$. We used a European short call option with a strike price of $K = S_0$ as the instrument to be hedged. The time horizon was set to 30 trading days with daily rebalancing, the percentage drift ($\mu$) of the GBM was set to 0, and the percentage volatility ($\sigma$) was set to 0.2. Proportional transaction costs were used with a proportionality constant of 0.01. The training dataset comprised $9.6 \times 10^4$ samples, whereas the testing dataset consisted of $2.4 \times 10^4$ samples. We compared Feed-forward, Recurrent, LSTM, and Transformer models, constructed using the framework described in Section 5.1. Here, the input sequence $(M_0, M_1, \ldots, M_T) \in \mathbb{R}^{T+1}$ corresponds to the mid-market price of the underlying equity. The outputs $a^\pi_t \in [0,1]$ correspond to the model's delta for that time-step. The Feed-forward model is constructed using the Feed Forward architecture. Both the Recurrent and LSTM models are built using the Recurrent architectures, where in the Recurrent model the hidden state passed to the subsequent time-step is fixed to be the output of the previous time-step, i.e., the model's position in the hedging instrument at the previous time-step. The Transformer model used in this work is constructed by adding the attention mechanism on top of the Feed Forward architecture.
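As an illustration of this environment, the following sketch generates GBM mid-market paths and the terminal liability of the short call; the function names are ours, and the sketch omits transaction-cost accounting and the hedging positions themselves.

```python
import numpy as np

def gbm_paths(n_paths, horizon, s0=1.0, mu=0.0, sigma=0.2, dt=1 / 252, rng=None):
    """Simulate GBM mid-market prices M_t with daily steps (dt = 1/252),
    matching the Black-Scholes hedging environment parameters."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(size=(n_paths, horizon))
    # Exact log-Euler step for GBM: no discretization bias.
    log_inc = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    prices = s0 * np.exp(np.cumsum(log_inc, axis=1))
    return np.concatenate([np.full((n_paths, 1), s0), prices], axis=1)

def short_call_liability(paths, strike):
    """Terminal payoff of the hedged short European call (always <= 0)."""
    return -np.maximum(paths[:, -1] - strike, 0.0)

paths = gbm_paths(1000, horizon=30, rng=np.random.default_rng(2))
liability = short_call_liability(paths, strike=1.0)  # K = S_0
```

The hedging agent's actions would then offset this liability by trading the underlying along each path.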
Exact Simulations. To evaluate the behavior of quantum orthogonal neural networks, all four architectures (Feed-forward, Recurrent, LSTM, Transformer) were compared, both with classical linear and with quantum orthogonal layers (with Pyramid and Butterfly circuits). A feature size of 16 was used for the linear layers in classical architectures, and 16 qubits were used for the orthogonal layers in quantum architectures. For the Feed-forward and Recurrent models, each hidden layer was repeated three times within the network. The LSTM model had one hidden cell constructed using four classical linear/quantum orthogonal layers. The Transformer model had three hidden layers followed by two classical linear/quantum orthogonal layers for the attention mechanism. Parameters for all models were shared across time-steps. Noiseless classical simulations were performed for training and inference, with a batch of 256 paths. The results are presented in Table 1. We compared the achieved utilities with and without transaction costs, as well as the number of trainable parameters. We observe that quantum orthogonal neural networks (Pyramid and Butterfly) achieve performance competitive with classical neural networks while using fewer trainable parameters, and this holds for environments both with and without transaction costs. For the comparison, we used the same classical and quantum architectures with the same layer sizes and trained with identical hyperparameters. The quantum networks showed competitive performance and used fewer parameters because every linear layer had been replaced with an orthogonal one. Let us also note that it might be possible to achieve parameter reduction with other classical methods (e.g., pruning).
The Transformer and LSTM architectures demonstrated the highest model utilities among the studied architectures, while the quantum orthogonal Butterfly layers used the fewest trainable parameters.
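Because orthogonal layers act on the unary basis, their effect can be reproduced classically as a product of Givens rotations, one per RBS gate, which is also why they are efficiently simulatable. This sketch (with our own helper names) composes such rotations into a weight matrix that is orthogonal by construction.

```python
import numpy as np

def givens(n, i, theta):
    """Action of one RBS gate on the unary subspace: a Givens rotation
    mixing coordinates i and i + 1 of an n-dimensional feature vector."""
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i], g[i, i + 1] = c, s
    g[i + 1, i], g[i + 1, i + 1] = -s, c
    return g

def orthogonal_layer(n, gates):
    """Compose Givens rotations, given as (wire, angle) pairs, into the
    n x n orthogonal weight matrix realized by a quantum orthogonal layer."""
    w = np.eye(n)
    for i, theta in gates:
        w = givens(n, i, theta) @ w
    return w

rng = np.random.default_rng(3)
# A brick-style layout: alternating even and odd nearest-neighbour gates.
gates = [(i, float(rng.normal()))
         for layer in range(8) for i in range(layer % 2, 7, 2)]
W = orthogonal_layer(8, gates)
```

This built-in orthogonality is what reduces the parameter count relative to a dense linear layer: $n(n-1)/2$ angles instead of $n^2$ weights.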
Hardware Emulations. To investigate the behavior of our quantum neural networks on current hardware, we employed the Quantinuum H1-1 emulator [27] to perform inference with our models. We kept the same environment configuration and used a batch of 32 paths for inference. However, we downsized the network to a single layer, as this required fewer circuit executions without significantly hampering model utilities. The Feed-forward and Recurrent architectures use one circuit evaluation per time-step, while the LSTM architectures use 4 circuit evaluations per time-step and the Transformer architectures use 3 circuit evaluations per time-step. As the hardware architecture allows for all-to-all connectivity, we used quantum orthogonal layers with a Butterfly circuit, which yields log-depth circuits with a linear number of two-qubit gates and is thus well suited to near-term hardware. We used 1000 measurement shots per circuit evaluation to perform tomography over the unary basis and construct the output of each layer. The results are summarized in Table 2. The utility of the models is presented for two cases: when evaluated on a classical exact simulator and when evaluated on Quantinuum's hardware emulator. The table also summarizes the number of circuit evaluations needed to hedge 32 paths over 30 days with each model architecture. The results show that the LSTM architecture with Butterfly layers was the most robust to noise, as the model utilities on the simulator and emulator are close. We also observed that the LSTM model achieved the highest utility in both cases.
Hardware Experiments. For our hardware experiments, we used the LSTM and Transformer models with 16-qubit Butterfly quantum circuits to perform inference on the Quantinuum H1-1 trapped-ion quantum processor [27]. We reduced the time horizon of the GBM to 5 days and considered models with transaction costs for a batch of 4 randomly chosen paths. We used the same model size as on the hardware emulators, which resulted in 80 circuit executions for the LSTM model and 60 circuit executions for the Transformer model. We present inference results for a model with a classical linear layer, and for a quantum model with a Butterfly quantum orthogonal layer both simulated and executed on the quantum hardware. The results are presented in Table 3. In addition to model utilities, we also list the terminal Profit and Loss (PnL) for each path for a more fine-grained comparison. The results reveal that the LSTM architecture exhibits robustness to noise, consistent with the results obtained from the hardware emulator, as evidenced by the terminal PnL values of each path closely aligning with those of the simulations. Conversely, the Transformer's hardware execution demonstrates poorer performance compared to the simulation results.

Quantum Market Environment
In the second part of our experiments, we use quantum compound neural networks in various reinforcement learning algorithms in a quantum environment. Specifically, we implement the expected and distributional actor-critic algorithms, as described in Section 5.2, and compare them to the policy-search algorithm adapted for compound neural networks. To accomplish this, we first describe how to construct a quantum environment for Quantum Deep Hedging by adapting the classical environment used in Section 6.1.
More specifically, we aim to build a quantum environment that mimics market dynamics following the Black-Scholes model, as described by a GBM.
To encode the dynamics of the Brownian motion in a quantum environment, we can use the fact that Brownian motion arises as the limit of a discrete random walk. Specifically, we can use a sequence of $nT$ independent and identically distributed Bernoulli random variables $b_1, b_2, \ldots, b_{nT}$ with mean $1/2$ to approximate $W_T$, where $T$ is the maturity and $n$ is a hyperparameter that determines the precision of the approximation. Using this property, we can provide a discrete quantum environment for the Black-Scholes model and approximate the price $B_t$ at time-step $t$ from $b_1, b_2, \ldots, b_{nt}$.
To obtain a sample of $B_t$, we sample $nt$ Bernoulli variables $b_1, \ldots, b_{nt}$, i.e., $n$ Bernoulli variables per day. Thus, we define the encoding of the market observation $M_t$ for a time-step $t > 0$ as $|M_t\rangle := |b_{n(t-1)+1} \ldots b_{nt}\rangle$, which contains all the jumps between time-steps $t-1$ and $t$ and can be encoded using $n$ qubits. We define the encoding of the market state $s_t$ at time-step $t$ as the history of all previous jumps, i.e., $|s_t\rangle := |b_1 b_2 \ldots b_{nt}\rangle$, from which we can retrieve the price. Loading this quantum state can be done with a depth-1 circuit made of at most $nt$ Pauli-X gates acting on $nt$ qubits. Note that the number of qubits required to encode $s_t$ here is $nt$ and not $n(t+1)$ as in the general case, since the price at time-step 0 is fixed. For every time-step $t$, the transition model $p_t$ can be oracularized by applying $n$ Hadamard gates on an additional $n$ qubits. The different transition oracles can be applied in parallel to build a superposition of all future trajectories and obtain
$$|(s_T|s_t)\rangle = |s_t\rangle \otimes \sum_{b \in \{0,1\}^{n(T-t)}} \frac{1}{2^{n(T-t)/2}} \, |b\rangle.$$
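The Bernoulli approximation of the Brownian motion can be checked numerically; `bernoulli_brownian` is an illustrative helper of ours implementing $W_t \approx \sum_i (2b_i - 1)/\sqrt{n}$, where $n$ controls the precision of the approximation.

```python
import numpy as np

def bernoulli_brownian(t, n, n_samples, rng):
    """Approximate W_t by a scaled walk of floor(n * t) fair Bernoulli
    jumps: each jump contributes (2 * b - 1) / sqrt(n), so the sum has
    mean 0 and variance floor(n * t) / n, approaching t."""
    steps = int(n * t)
    b = rng.integers(0, 2, size=(n_samples, steps))
    return np.sum(2 * b - 1, axis=1) / np.sqrt(n)

# At t = 1 with n = 64 jumps per unit time, the samples should have
# mean close to 0 and variance close to 1, as for Brownian motion.
w1 = bernoulli_brownian(t=1.0, n=64, n_samples=200_000,
                        rng=np.random.default_rng(4))
```

Exponentiating a drift-adjusted version of this walk then yields the discrete GBM prices $B_t$ used by the quantum environment.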
In our experimental setup, we chose $n = 1$ Bernoulli variable per day to approximate the Brownian motion. We retained the same GBM parameters as in the classical environment ($\mu = 0$ and $\sigma = 0.2$), but set the instrument maturity to 10 days. Because we could only simulate circuits with up to 12 qubits for the compound architectures, we adjusted the time-step increment to 30 trading days instead of the usual 252; that is, each time-step $t$ is mapped to $t/30$ in the approximate GBM simulations, allowing us to capture short-term stock price fluctuations while maintaining the overall GBM behavior. The final payoff is a European short call option with a strike price of $K = 1$. We also investigated cases with and without a proportional transaction cost with constant $\epsilon = 0.002$. We used the compound neural networks from Section 5.2.4 to represent the policy in the policy-search algorithm, and both the policy and the value in the actor-critic algorithms. We employed the Huber loss and scaled the value function to prevent exploding gradients. The algorithms were trained using classical simulations of the quantum circuits for 2000 steps with Adam optimizers, employing 3 random seeds and a batch of 16 generated episodes per training step. We selected the best parameters from these runs for a random selection of 16 paths and report the inference results.

Exact Simulations.
Here, the performance of the algorithms was evaluated through exact simulation on classical hardware, using the Brick architecture with logarithmic depth per block for training the compound neural networks. The results, presented in Table 4, show that policies trained with the distributional actor-critic algorithm yielded better utilities for this particular example. These results align with the findings of Lyle et al. [26], where minimizing a distributional loss led to better policies.
Hardware Emulations. We investigated the performance of the algorithms in the presence of hardware noise by running inference on the Quantinuum H1-1 emulator [27] over 16 randomly chosen paths. To accommodate current hardware limitations, circuit depth was reduced by using a fixed depth per block instead of logarithmic depth per block. We compared the inference results of the hardware emulator with exact simulation. The results presented in Table 5 show that quantum compound neural networks are noise-resilient, with similar utilities between classical simulation and hardware emulation. Furthermore, our results show that the distributional policies outperformed the expected policies, and the expected policies outperformed the policies trained using the policy-search Deep Hedging algorithm.
Hardware Experiments. In the third part of our experiments, we performed inference on Quantinuum's H1-1 and H1-2 trapped-ion quantum processors [27] using policies trained via the distributional and expected algorithms. We used a set of 8 randomly chosen paths and compared the terminal PnLs and utility in the presence of transaction costs obtained on quantum hardware with exact classical simulations. The results are presented in Table 6.
We also present the results of the Black-Scholes delta hedge model for the same setting. We note that the utility obtained from the hardware closely aligns with the emulation results, and the PnL values for the selected paths are also similar. Our study reveals that both the distributional and expected policies significantly outperformed the Black-Scholes delta hedge, with the distributional policy exhibiting the best overall performance.

Discussion
In this work we developed quantum reinforcement learning methods for Deep Hedging. These methods are based on novel quantum neural network architectures that utilize orthogonal and compound layers, and on a novel distributional actor-critic algorithm that takes advantage of the fact that quantum states and operations naturally handle large distributions.
There are many potential advantages to using quantum methods to enhance the capabilities of Deep Hedging algorithms. First, for neural networks with deep architectures, as is the case for time-series data, feature orthogonality can improve interpretability, help avoid vanishing gradients, and result in faster and better training. Second, quantum compound neural networks can explore a higher-dimensional optimization landscape and thus might train to more accurate models, once we ensure that barren plateau phenomena do not occur. Third, quantum neural networks are natively suited to distributional reinforcement learning algorithms, which can lead to considerably better models, as we show for the toy example developed in this work. Finally, the quantum circuits needed to train competitive quantum models for Deep Hedging are rather small, since the number of qubits and the depth of the quantum circuit are essentially equal to the maturity time.
Note that our hardware experiments were done with a maturity of 10 days, since we had to simulate the training of the quantum models on a classical computer, which quickly becomes infeasible. In principle, one can train directly on the quantum computer, using for example a parameter-shift rule to compute the gradients [50], in which case even with current quantum hardware (or with hardware arriving in the next few years) one could train a Quantum Deep Hedging model for a maturity of a month or more. In that regime, the quantum model can no longer be simulated classically.
Moreover, we believe our quantum reinforcement learning methods have applications beyond Deep Hedging, for example in algorithmic trading or option pricing, and it would be interesting to develop specific quantum methods for such problems. Note that in these use cases the training data can be produced efficiently, removing the bottleneck of loading large amounts of data onto the quantum computer. The open questions regarding our work in quantum reinforcement learning center on three aspects. First, the results on the trainability of the quantum neural networks proposed here should be extended to other settings. Second, there is the question of how to extend the quantum environment built for the GBM to other environments, such as the Heston model studied in [37]. Finally, new distributional losses are needed that make use of temporal-difference methods to learn the value functions in the Deep Hedging context. One approach is to use the theoretical framework that allows for such a design, as developed in [79]; another is the moment-matching approach described in [80]. Currently, the work focuses on the expectation of the value function, but it is important to consider other moments that can be matched, both for the overall expectation and for the expectation per subspace.

Figure 1 :
Figure 1: A quantum circuit with logarithmic depth for data loading. Vertical lines represent RBS gates with parameters that depend on the input x. The unitary represented by this data loader is denoted U_L(x).
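As a sketch of how a data loader's angles can be computed classically, the code below uses the sequential (diagonal) recursion x_1 = cos(a_1), x_i = sin(a_1)···sin(a_{i-1}) cos(a_i); the logarithmic-depth loader of Figure 1 computes analogous angles over a binary tree of RBS gates. The function names are illustrative, and the sketch assumes a vector with nonnegative entries (signs require an extra convention).

```python
import numpy as np

def loader_angles(x):
    """RBS angles for a unary amplitude loader of a normalized vector x
    (sequential variant, for illustration only)."""
    x = np.asarray(x, float)
    x = x / np.linalg.norm(x)
    angles, r = [], 1.0
    for i in range(len(x) - 1):
        c = np.clip(x[i] / r if r > 1e-12 else 0.0, -1.0, 1.0)
        a = np.arccos(c)
        angles.append(a)
        r *= np.sin(a)   # remaining norm to distribute over later modes
    return np.array(angles)

def reconstruct(angles):
    """Amplitudes of the unary states produced by those RBS rotations."""
    amps, r = [], 1.0
    for a in angles:
        amps.append(r * np.cos(a))
        r *= np.sin(a)
    amps.append(r)
    return np.array(amps)
```

Running the two functions back to back recovers the normalized input, which is the defining property of the loader U_L(x).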

Figure 2 :
Figure 2: Various Hamming-weight preserving circuits used in quantum orthogonal layers. These circuits are parameterized by a set of parameters θ, with each parameter representing the angle of a specific RBS gate. The parameterized unitary represented by such a layer is expressed as U(θ).
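Restricted to the unary (Hamming-weight-1) subspace, each RBS gate acts as a Givens rotation on two modes, so a layer of RBS gates implements an n×n orthogonal matrix. The sketch below (function names illustrative) composes one brick-pattern sweep, as in one of the circuits of Figure 2, into the matrix the layer applies.

```python
import numpy as np

def rbs_unary(n, i, j, theta):
    """n x n matrix of one RBS(theta) gate restricted to the unary
    subspace: a Givens rotation mixing modes i and j."""
    m = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    m[i, i], m[j, j] = c, c
    m[i, j], m[j, i] = s, -s
    return m

def brick_layer_unary(n, thetas):
    """Compose nearest-neighbour RBS gates (even sublayer, then odd
    sublayer) into the orthogonal matrix the layer implements."""
    u = np.eye(n)
    k = 0
    for start in (0, 1):
        for i in range(start, n - 1, 2):
            u = rbs_unary(n, i, i + 1, thetas[k]) @ u
            k += 1
    return u
```

Because every factor is a rotation, the composed matrix is orthogonal by construction, which is the property the quantum orthogonal layers exploit.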

Figure 3 :
Figure 3: A quantum compound layer U(θ) acts as a block-diagonal unitary on each fixed Hamming-weight subspace.
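The block that a Hamming-weight preserving circuit applies on the weight-k subspace is the k-th compound of the n×n matrix it applies on the unary subspace: its entries are the k×k minors, indexed by k-subsets of modes. A brute-force sketch (illustrative, exponential in k, for checking small cases only):

```python
import numpy as np
from itertools import combinations

def compound_matrix(u, k):
    """k-th compound of u: entries are determinants of k x k minors.
    This is the block a compound layer applies on the Hamming-weight-k
    subspace, of dimension C(n, k)."""
    n = u.shape[0]
    subsets = list(combinations(range(n), k))
    c = np.zeros((len(subsets), len(subsets)))
    for a, rows in enumerate(subsets):
        for b, cols in enumerate(subsets):
            c[a, b] = np.linalg.det(u[np.ix_(rows, cols)])
    return c
```

Since the compound map is multiplicative and respects transposition, the compound of an orthogonal matrix is again orthogonal, consistent with the block-diagonal unitary picture of Figure 3.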

Figure 4 :
Figure 4: Diverse quantum neural network architectures for time-series data, featuring orthogonal layers in each block as outlined in Section 4.3. Here, x_t and y_t denote the time-series input and output, respectively, while ỹ_t represents the output after being adjusted by the attention mechanism.

Figure 5 :
Figure 5: A quantum compound neural network. U_L(x) refers to a general data loader unitary, and U(θ) denotes a Hamming-weight preserving unitary such as those shown in Figure 2.
(a) Learning over S_T. (b) Learning over multiple subsets of S_T.

Figure 6 :
Figure 6: An example illustrating the process of learning expectations for each subset. The figures depict data generated from a trimodal distribution. In (6a), a single distribution is learned to match the expectation over the entire set of states. In (6b), the set is divided into three distinct subsets, with each peak representing the weighted and learned distribution within its corresponding subset. This approach provides a closer fit to the original data and effectively incorporates information about the tails, offering a more accurate representation of the underlying distribution.
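The per-subset idea of Figure 6 can be sketched classically: partition the terminal states, then record each subset's probability mass and conditional expectation. The partition by fixed interval edges below is a hypothetical stand-in for the subspace decomposition used in the paper.

```python
import numpy as np

def subset_expectations(samples, edges):
    """Probability mass and conditional expectation of each subset,
    where subsets are intervals delimited by `edges` (an illustrative
    partition of the terminal-state space S_T)."""
    samples = np.asarray(samples, float)
    bins = np.digitize(samples, edges)
    probs, means = [], []
    for b in range(len(edges) + 1):
        sel = samples[bins == b]
        probs.append(len(sel) / len(samples))
        means.append(sel.mean() if len(sel) else 0.0)
    return np.array(probs), np.array(means)
```

By the law of total expectation, the mass-weighted subset means reproduce the overall expectation exactly, while the per-subset values retain information about the tails that a single expectation discards.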
An observable O_z := Σ_{b ∈ {0,1}^m} z_b |b⟩⟨b| can be applied on the quantum state |z⟩ := Σ_b √p_b |b⟩ to sample from the categorical distribution P_z with support z.
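A small classically-simulated sketch of this sampling step (the values of p and z are illustrative): measuring |z⟩ in the computational basis returns b with probability p_b, which is then mapped to the support value z_b, so the empirical mean converges to Σ_b p_b z_b.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Pz(p, z, shots):
    """Simulate measuring |z> = sum_b sqrt(p_b)|b> in the computational
    basis: outcome b occurs with probability p_b and maps to z_b."""
    idx = rng.choice(len(p), size=shots, p=p)
    return z[idx]

p = np.array([0.2, 0.5, 0.3])            # squared amplitudes
z = np.array([-1.0, 0.0, 2.0])           # categorical support
est = sample_Pz(p, z, 100_000).mean()    # approximates p @ z = 0.4
```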

Figure 7 :
Figure 7: The quantum compound neural network using O(T) + 2 qubits for the Black-Scholes model. For loading the data (in yellow), the first t qubits encode past jumps of the market state. The next O(T − t) qubits encode the transition oracles, which in the case of Black-Scholes correspond to an equal superposition over all possible future jumps. The last two qubits are ancillas, used to encode the state |01⟩. Next, for the unitaries (in blue), we use an architecture controlled on the past market state: based on the direction of the jump, a different compound layer constructed using the Brick architecture (Figure 2) is applied on T − t + 2 qubits. Finally, each of the last T − t + 2 qubits is measured (in green) independently. For t = 0, the input state is an equal superposition of all possible future jumps, followed by one unitary over T + 2 qubits without control. As described in Section 5.2.4, the control qubits are always in a computational basis state; thus they can be removed for an efficient hardware implementation, and the appropriate parameters for the unitary acting on T − t + 2 qubits can be chosen classically for each past state.

Algorithm 1
Policy-Search Deep Hedging with Orthogonal Neural Networks. Input: policy QNN π. Hyperparameters: number of episodes per training step N. Initialize the policy QNN with parameters ϕ.
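A minimal classical skeleton of such a policy-search loop, under loud assumptions: the policy is a plain parameter vector squashed through a sigmoid (standing in for the QNN), the objective is the exponential-utility goal −E[exp(−λ Σ_t r_t)] estimated over N episodes, and the gradient is a central finite difference (standing in for a circuit gradient). All names and defaults are illustrative.

```python
import numpy as np

def policy_search(reward_of_actions, T, lam=1.0, lr=1.0, n_iters=300,
                  n_episodes=64, seed=0):
    """Schematic policy search: ascend the exponential-utility objective
    -E[exp(-lam * R)] in the policy parameters phi, where actions are
    a_t = sigmoid(phi_t) in [0, 1] and R is the cumulative reward
    returned by `reward_of_actions` (a user-supplied episode simulator)."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(T)

    def objective(p):
        acts = 1 / (1 + np.exp(-p))
        returns = np.array([reward_of_actions(acts, rng)
                            for _ in range(n_episodes)])
        return -np.mean(np.exp(-lam * returns))

    for _ in range(n_iters):
        grad = np.zeros(T)
        for i in range(T):             # central finite differences
            e = np.zeros(T)
            e[i] = 1e-3
            grad[i] = (objective(phi + e) - objective(phi - e)) / 2e-3
        phi += lr * grad               # gradient ascent on the utility
    return 1 / (1 + np.exp(-phi))
```

On a toy deterministic reward with a known optimum (e.g. penalizing the squared distance of the actions from a target), the loop recovers the target actions, which is the sanity check one would run before swapping in the QNN policy and environment.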
Definition 2 (Classical MDP for Deep Hedging). The classical MDP for Deep Hedging is a finite-horizon Markov decision process defined by a tuple (S, A, p, r, T). Here, S is the market state space, which can be decomposed into subsets S_t ⊂ R^{n×(t+1)} at each time step t; A is the trading action space, which can also be decomposed into subsets A_t ⊂ [0, 1]^m; p is the transition model, which can be represented by p_t : S_t → ∆(S_{t+1}); r is the reward function, which can be represented by r_t : S_t × A_t → R; and T ∈ N* is the maturity time of all hedging instruments.
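A minimal sketch of this MDP for a single hedging instrument (n = m = 1): the state is the price history, the action is the new holding in [0, 1], and the reward is the mark-to-market PnL of the position minus a proportional transaction cost. The class name, the lognormal dynamics, and the cost level are illustrative stand-ins, not the paper's exact environment.

```python
import numpy as np

class HedgingMDP:
    """Illustrative finite-horizon hedging MDP in the sense of
    Definition 2, with a single underlying asset."""

    def __init__(self, T, mu=0.0, sigma=0.2, cost=0.001, dt=1 / 30, seed=0):
        self.T, self.mu, self.sigma = T, mu, sigma
        self.cost, self.dt = cost, dt
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.prices, self.pos = 0, [1.0], 0.0
        return np.array(self.prices)            # s_0 in S_0

    def step(self, action):                     # action in A_t = [0, 1]
        s_prev = self.prices[-1]
        s_next = s_prev * np.exp(
            (self.mu - 0.5 * self.sigma**2) * self.dt
            + self.sigma * np.sqrt(self.dt) * self.rng.normal())
        # r_t: PnL of the held position minus proportional trading cost
        reward = (self.pos * (s_next - s_prev)
                  - self.cost * abs(action - self.pos) * s_prev)
        self.pos = action
        self.prices.append(s_next)
        self.t += 1
        return np.array(self.prices), reward, self.t >= self.T
```

Note that the state s_t grows with t (it is the full price path so far), mirroring the decomposition S_t ⊂ R^{n×(t+1)} in the definition.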
This quantum state encodes the probabilities p_t(s_T|s_t) := P[s_T|s_t] in a superposition of all possible trajectories (s_{t+1}, …, s_T) of length T − t and can be prepared by sequentially applying the oracles U_{p_t}, U_{p_{t+1}}, …, U_{p_{T−1}} to |s_t⟩ and n × (T − t) ancilla qubits. Denoting ρ_t(s_T|s_t) := |(s_T|s_t)⟩⟨(s_T|s_t)|, the value function is a random variable with outcomes {exp(−λ R^π_t(s_T)) | s_T ∈ S_T} and probabilities p_t(s_T|s_t). We can express its expectation as the measurement of a quantum observable O^π_t in the quantum state |(s_T|s_t)⟩, where O^π_t := Σ_{s_T} exp(−λ R^π_t(s_T)) |s_T⟩⟨s_T|, and |(s_T|s_t)⟩ is defined similarly to |(s_{t+1}|s_t)⟩ as |(s_T|s_t)⟩ := Σ_{s_T} √(p_t(s_T|s_t)) |s_T⟩ ∈ H^{⊗ n×(T+1)}, so that the expectation equals Tr[O^π_t ρ_t(s_T|s_t)].
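A tiny numerical check of this identity (the three-path probabilities and rewards are illustrative): with amplitudes √p_t(s_T|s_t) and a diagonal observable with eigenvalues exp(−λ R^π_t(s_T)), the trace Tr[O^π_t ρ_t(s_T|s_t)] reproduces the classical expectation exactly.

```python
import numpy as np

# Hypothetical three-trajectory example.
lam = 1.0
p = np.array([0.1, 0.6, 0.3])        # p_t(s_T | s_t)
R = np.array([0.2, -0.1, 0.05])      # R_t^pi(s_T) per trajectory
vals = np.exp(-lam * R)              # eigenvalues of O_t^pi

psi = np.sqrt(p)                     # |(s_T|s_t)> = sum sqrt(p)|s_T>
rho = np.outer(psi, psi)             # rho_t(s_T|s_t)
O = np.diag(vals)                    # O_t^pi, diagonal in the path basis
expectation = np.trace(rho @ O)      # equals the classical mean p @ vals
```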
Let z be a fixed support with entries z_b for b ∈ {0,1}^m, where m ≥ 1. Then there exists an observable O^z_t with eigenvalues in z that operates on n × (t+1) + m qubits and such that, for any deterministic policy π, there is a unitary U^π_t satisfying ∀s_t ∈ S_t, Tr[O^z_t ρ_t(z|s_t)] = Tr[O^π_t ρ_t(s_T|s_t)], where |(z|s_t)⟩ := U^π_t(|s_t⟩ ⊗ |0⟩^{⊗m}) and ρ_t(z|s_t) := |(z|s_t)⟩⟨(z|s_t)|. Additionally, let P^π_t and P^z_t denote the distributions of the value function (the outcome of measuring O^π_t in |(s_T|s_t)⟩) and of its categorical projection onto the fixed support z (the outcome of measuring O^z_t in |(z|s_t)⟩), respectively. If the cumulative distribution function of P^π_t is L-Lipschitz, then the Cramér distance between P^π_t and P^z_t is bounded accordingly, and ∀k, ∀s_t ∈ S_t, Tr[O^z_t ρ_t(z|s_t, k+1)] = Tr[O^π_t ρ_t(s_T|s_t, k)], where |(z|s_t, k+1)⟩ := U^π_t(|(s_T|s_t, k)⟩ ⊗ |01⟩) and ρ_t(z|s_t, k+1) := |(z|s_t, k+1)⟩⟨(z|s_t, k+1)|.

Algorithm 3
Distributional Actor-Critic Deep Hedging with Compound Neural Networks. Input: policy QNN π, value QNN v. Hyperparameters: number of episodes per training step N. Initialize the policy and value QNNs with parameters {ϕ_t}_{t=0}^T, {ω_t}_{t=0}^T.

GBM is a continuous-time stochastic process B_t described by the stochastic differential equation dB_t = µ B_t dt + σ B_t dW_t, where µ ∈ R is the percentage drift and σ ∈ R_+ is the percentage volatility; W_t is a Brownian motion, so dW_t ∼ N(0, dt).

Table 1: Comparison of expected utilities without and with transaction costs for models with classical and orthogonal layers, using exact simulation over 256 paths and 30 trading days, including the number of trainable parameters.
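The SDE above has the well-known strong solution B_{t+dt} = B_t exp((µ − σ²/2) dt + σ √dt Z) with Z ∼ N(0, 1), which allows exact (discretization-error-free) path simulation. A short sketch (function name illustrative):

```python
import numpy as np

def gbm_paths(b0, mu, sigma, dt, steps, n_paths, seed=0):
    """Exact GBM simulation: each log-increment is
    (mu - sigma^2/2) dt + sigma sqrt(dt) Z with Z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, steps))
    log_incr = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_paths = np.cumsum(log_incr, axis=1)
    # prepend the initial price b0 at t = 0
    return b0 * np.exp(np.concatenate(
        [np.zeros((n_paths, 1)), log_paths], axis=1))
```

Since E[B_T] = b0 · exp(µT), the sample mean of the terminal prices gives a quick sanity check on the simulator before it is used to generate training paths.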

Table 2 :
Comparison of exact simulation and Quantinuum H1-1 emulator results for orthogonal layer models, evaluating expected utilities with transaction costs over 32 paths and 30 trading days, and showing the number of circuits emulated.

Table 3 :
Comparison of exact simulation and Quantinuum H1-1 hardware results for orthogonal layer models, evaluating expected utilities and terminal PnLs with transaction costs over 4 paths and 5 trading days, assessing performance under hardware conditions.

Table 4 :
Comparison of compound neural networks trained using different algorithms with exact simulation, evaluating expected utilities over 16 paths and 10 trading days, both without and with transaction costs.

Table 5 :
Comparison of compound neural networks trained using different algorithms with simulation and Quantinuum H1-1 emulator results, evaluating expected utilities with transaction costs over 16 paths and 10 trading days.

Table 6 :
Comparison of compound neural networks trained using different algorithms with exact simulation and Quantinuum H1-1, H1-2 hardware results, evaluating expected utilities and terminal PnLs with transaction costs over 8 paths and 10 trading days, benchmarked against the standard Black-Scholes delta-hedging model.