Graph neural network initialisation of quantum approximate optimisation

Approximate combinatorial optimisation has emerged as one of the most promising application areas for quantum computers, particularly those in the near term. In this work, we focus on the quantum approximate optimisation algorithm (QAOA) for solving the MaxCut problem. Specifically, we address two problems in the QAOA: how to initialise the algorithm, and how to subsequently train the parameters to find an optimal solution. For the former, we propose graph neural networks (GNNs) as a warm-starting technique for QAOA. We demonstrate that merging GNNs with QAOA can outperform both approaches individually. Furthermore, we demonstrate how graph neural networks enable warm-start generalisation not only across graph instances, but also to increasing graph sizes, a feature not straightforwardly available to other warm-starting methods. For training the QAOA, we test several optimisers on the MaxCut problem up to 16 qubits and benchmark against vanilla gradient descent. These include quantum-aware/agnostic and machine learning based/neural optimisers; examples of the latter include reinforcement learning and meta-learning. With the incorporation of these initialisation and optimisation toolkits, we demonstrate how optimisation problems can be solved using QAOA in an end-to-end differentiable pipeline.


Introduction
Among the forerunners for use cases of near-term quantum computers, dubbed noisy intermediate-scale quantum (NISQ) devices [1], are the variational quantum algorithms (VQAs). The most well known of these are the variational quantum eigensolver [2] and the quantum approximate optimisation algorithm (QAOA) [3]. In their wake, many new algorithms have been proposed in this variational framework, tackling problems in a variety of areas [4][5][6]. The primary workhorse in such algorithms is typically the parameterised quantum circuit (PQC), and due to the heuristic and trainable nature of VQAs, they have also become synonymous with 'modern' quantum machine learning [7]. This is particularly evident with the adoption of PQCs as the quantum version of neural networks [8,9].
In this work, we focus on one particular VQA, the QAOA, primarily used for approximate discrete combinatorial optimisation. The canonical example of such a problem is finding the 'maximum cut' (Max-Cut) of a graph, where (for an unweighted graph) one aims to partition the graph nodes into two sets such that as many edges as possible connect the two sets. Discrete optimisation problems such as Max-Cut are extremely challenging to solve (specifically, NP-hard), and exact solutions take exponential time in general. Aside from its theoretical relevance, Max-Cut finds applications across various fields, such as the study of spin glass models, network design, VLSI and other circuit layout design [10], and data clustering [11]. While it is not believed that quantum computers can solve NP-hard problems efficiently [12], it is hoped that quantum algorithms such as the QAOA may be able to outperform classical algorithms by some benchmark. Given the ubiquity of combinatorial optimisation problems in the real world, even incremental improvements may have large financial and quality impacts.
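For illustration, Max-Cut on a toy instance can be solved exactly by enumerating all 2^n partitions; the short Python sketch below is ours (function name and example graph are illustrative, not from the paper), and makes the exponential scaling of exact solutions concrete:

```python
from itertools import product

def maxcut_brute_force(n, edges):
    """Exhaustively search all 2^n partitions of an unweighted graph.

    `edges` is a list of (i, j) pairs; returns (best cut size, assignment).
    Exponential in n, so only usable for tiny illustrative instances.
    """
    best_cut, best_z = -1, None
    for z in product([+1, -1], repeat=n):
        cut = sum(1 for (i, j) in edges if z[i] != z[j])
        if cut > best_cut:
            best_cut, best_z = cut, z
    return best_cut, best_z

# A 4-cycle: alternating the two sides cuts all 4 edges.
cut, z = maxcut_brute_force(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

Even at a few dozen nodes this enumeration becomes infeasible, which is what motivates approximate approaches such as the GW algorithm and the QAOA discussed below.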
However, due to the limitations of running real experiments on small and unreliable NISQ devices, which currently are typically only accessible via expensive cloud computing platforms [41], it is important to limit the quantum resources (i.e. the overall number of runs, or the time for a single run on quantum hardware) required to solve a problem to the bare minimum. Effective initialisation and optimisation strategies for VQAs can therefore dramatically accelerate the search for optimal problem solutions. The former ensures the algorithm begins 'close' to a solution in the parameter space (near a local or global optimum), while the latter enables smooth and efficient traversal of the landscape. This is especially relevant given the existence of difficult optimisation landscapes in VQAs, plagued by barren plateaus [42][43][44][45], local minima [46,47] and narrow gorges [48]. To avoid these, and to enable efficient optimisation, several initialisation techniques have been proposed for VQAs, including the use of tensor networks [49], meta-learning [50][51][52] and algorithm-specific techniques [29,30].
Returning to the specifics of combinatorial optimisation, machine and deep learning have been shown to be effective means of solving this family of problems; see for example Refs. [53][54][55][56]. Of primary interest for our purposes are the works of [53,56]. The former [53] trains a graph neural network (GNN) to solve Max-Cut, while the latter [56] extends this to more general optimisation problems and demonstrates scaling up to millions of variables. These insights, together with the recent trend in the quantum domain of combining VQAs with neural networks (with software libraries developed for this purpose [57,58]), indicate that using classical and quantum learning architectures synergistically has much promise. We extend this hybridisation in this work. This paper is divided into two parts. In the first part (Sections 2-4), we discuss previous work on QAOA initialisation and give our first contribution: an initialisation strategy using graph neural networks. Specifically, we merge GNN solvers with the warm-starting technique for QAOA of [29], and demonstrate the effectiveness of this via numerical results in Section 4. By examining the conclusions of [56], we can see how our GNN approach would allow QAOA initialisation to scale far beyond the capabilities of current-generation near-term quantum devices. In the second part of the paper (Section 5), we complement this by evaluating several optimisation techniques for the QAOA proposed in the literature, including quantum-aware, quantum-agnostic and neural network based optimisation approaches.

QAOA for solving Max-Cut
For concreteness, in this work we focus on the discrete optimisation problem known as Max-Cut. It involves finding a division (a cut) of the vertices of a (weighted) graph into two sets which maximises the sum of the weights over all the edges crossing between the vertex subsets. For unweighted graphs, this cut simply maximises the number of edges across the two subsets.
The problem can be recast as minimising a weighted sum of operators acting on the vertices of a given graph. Mathematically, this can be stated as follows. Given a graph G := (V, E) with vertices V and edges E = {(i, j) | i, j ∈ V, i ≠ j}, the Max-Cut can be found by minimising the following cost function:

    C(z) = −(1/2) ∑_{(i,j)∈E} w_ij (1 − z_i z_j),        (1)

where z := z_1 z_2 . . . z_n assigns to each vertex, i, a variable z_i ∈ {+1, −1}, and w_ij is the corresponding weight of the edge between vertices i and j. The value (sign) of z_i determines on which side of the cut the node resides. The Max-Cut problem is a canonical NP-complete problem [60], meaning there is no known polynomial-time algorithm to solve it in general. Further, Max-Cut is also APX-hard [61], meaning there is no polynomial-time approximation scheme unless P = NP. The current best known polynomial-time approximate classical approach is the Goemans-Williamson (GW) algorithm, which is able to achieve an approximation ratio of r ≈ 0.878, where the approximation ratio of a candidate cut z, relative to the optimal cut z*, is:

    r := C(z)/C(z*).        (2)

Assuming the unique games conjecture (UGC) [62,63], the GW algorithm achieves the best approximation ratio attainable classically in polynomial time for Max-Cut. Even without this conjecture, it has been proven that it is NP-hard to approximate the Max-Cut value with an approximation ratio better than 16/17 ≈ 0.941.

Figure 1: (a) An overview of QAOA initialisation strategies: warm-starting techniques (continuous relaxations [29] or graph neural networks (GNNs)) and Trotterised quantum annealing [30] (TQA). TQA produces initial angles, {β, γ}, whereas warm-starting techniques initialise the QAOA state. 'Init' here refers to any parameter initialisation scheme. In this work, we choose a specific initialisation technique called the Xavier initialisation [59] for warm-starting techniques. (b) GNNs take an embedding of the initial graph and apply updates to the embeddings based on the neighbours of each node, using a parameterised function, f_θ. In this case, the GNN outputs a probability for each node being on either side of the cut. (c) A unified p-layer QAOA circuit for all initialisation schemes. In TQA, fixed choices for angles θ, φ initialise vanilla QAOA with the standard mixer, whereas warm-starting produces an initial state and mixer encoding a probabilistic solution given by the SDP or GNN. Here, x* implicitly encodes the regularisation parameter discussed in [29]. Note that both the GNN and QAOA parameters can be trained in an end-to-end differentiable manner, in contrast to other schemes.

To address Max-Cut quantum mechanically, one can quantise the cost function Eq. (1) by replacing the variables with operators, z_i → Z_i, where Z is the Pauli-Z matrix. The cost function can now be described with a Hamiltonian:

    H_C = −(1/2) ∑_{(i,j)∈E} w_ij (1 − Z_i Z_j),        (4)

where the actual cost, corresponding to the cut size, is extracted as the expectation value of this Hamiltonian with respect to a quantum state:

    C(ψ) = ⟨ψ| H_C |ψ⟩.        (5)

The goal of a quantum algorithm is then to find the ground state, |ψ_G⟩ := argmin_{|ψ⟩} ⟨ψ| H_C |ψ⟩, i.e. the state which minimises the energy of the Hamiltonian, H_C. Constructing the Hamiltonian as in Eq. (4) ensures that the minimum energy state is exactly the (computational basis) state encoding the Max-Cut of the problem graph:

    |ψ_G⟩ = |ψ_Max-Cut⟩ = |z*_1 z*_2 . . . z*_n⟩,        (6)

where z* is the optimal assignment minimising Eq. (1). However, since the Max-Cut problem is NP-hard, we expect that finding this ground state, |ψ_Max-Cut⟩, will also be hard in general. The QAOA attempts to solve this by initialising with respect to an easy Hamiltonian (also called a 'mixer' Hamiltonian):

    H_M = −∑_{i=1}^{n} X_i,        (7)

which has as an eigenstate (in fact, its ground state) the simple product state, |ψ_init⟩ = |+⟩^⊗n = H^⊗n |0⟩^⊗n, where X and H are the Pauli-X and Hadamard operators respectively. This can be viewed as an initialisation which is a uniform superposition of all possible candidate solutions. The QAOA then attempts to simulate adiabatic evolution from |ψ_init⟩ to the target state |ψ_Max-Cut⟩ by an alternating 'bang-bang' application of two unitaries derived from the Hamiltonians, Eq. (4) and Eq. (7), which are respectively:

    U_C(γ) = e^{−iγ H_C},    U_M(β) = e^{−iβ H_M}.        (8)

In the QAOA, the parameters, γ, β, are trainable, and govern the length of time each operator is applied for. These two unitaries are alternated in p 'layers' acting on the initial state, so the final state is prepared using 2p parameters, {β, γ} := {β_1, β_2, . . . , β_p, γ_1, γ_2, . . . , γ_p}:

    |ψ_{β,γ}⟩ = U_M(β_p) U_C(γ_p) · · · U_M(β_1) U_C(γ_1) |ψ_init⟩.        (9)

Optimising the parameters, {β, γ}, serves as a proxy for finding the ground state: the hope is that after a finite depth, p, we achieve a state |ψ_{β,γ}⟩ which is close to the target state |ψ_Max-Cut⟩.
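The alternating circuit described above is straightforward to simulate for small instances. The following is a minimal NumPy sketch of ours (not code used in the paper): since H_C is diagonal in the computational basis, its exponential is a phase mask, and the mixer factorises into single-qubit rotations. The sign convention chosen for the mixer rotation amounts to a redefinition of β.

```python
import numpy as np

def qaoa_state(n, edges, gammas, betas):
    """Dense-statevector sketch of p-layer Max-Cut QAOA from a cold start.

    Illustrative only: O(2^n) memory, unweighted edges, no hardware noise.
    """
    dim = 2 ** n
    # Diagonal of H_C = -1/2 sum_{(i,j)} (1 - Z_i Z_j); its minimum is -MaxCut.
    zvals = 1 - 2 * ((np.arange(dim)[:, None] >> np.arange(n)) & 1)  # (dim, n) of +/-1
    diag = -0.5 * sum(1 - zvals[:, i] * zvals[:, j] for (i, j) in edges)
    psi = np.full(dim, 1 / np.sqrt(dim), dtype=complex)  # |+>^n
    for gamma, beta in zip(gammas, betas):
        psi = np.exp(-1j * gamma * diag) * psi           # phase separation
        # Mixer layer: e^{-i beta X} applied to every qubit in turn.
        rx = np.array([[np.cos(beta), -1j * np.sin(beta)],
                       [-1j * np.sin(beta), np.cos(beta)]])
        for q in range(n):
            psi = psi.reshape(2 ** (n - q - 1), 2, 2 ** q)
            psi = np.einsum('ab,ibj->iaj', rx, psi).reshape(dim)
    return psi, diag

def expected_cost(psi, diag):
    """<psi| H_C |psi> for the diagonal Hamiltonian above."""
    return float(np.real(np.abs(psi) ** 2 @ diag))
```

With zero layers this reproduces the cold-start expectation over |+⟩^⊗n, where each edge contributes −1/2 to the cost.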

Initialising the QAOA
Since searching over the non-convex parameter landscape for an optimal setting of the {γ, β} parameters directly on quantum hardware may be expensive and/or challenging, any attempt to initialise the QAOA parameters near a candidate solution is extremely valuable, as the algorithm would then begin its search from an already good approximate solution. Such approaches are dubbed 'warm-starts' [29], in contrast to 'cold-starts'. One could consider a cold-start to be a random initialisation of {γ, β}, or the use of an initial state which encodes no problem information, e.g. |+⟩^⊗n as in vanilla QAOA. In this work, we use 'cold-start' to refer to the latter, and 'random initialisation' to mean a random setting of the parameters, {γ, β}. We first revisit and summarise two previous approaches [29,30], before presenting our approach to QAOA initialisation. We illustrate the previous initialisation approaches, which we review briefly in Section 2.1 and Section 2.2, in Fig. 1, along with our approach based on graph neural networks, which we introduce in Section 3. For simplicity, we focus on the simplest version of the QAOA, but the methods could be extended to other variants, for example recursive QAOA (RQAOA) [29, 64, 65].

Continuous relaxations
The first approach, that of [29], proposed a warm-starting method for QAOA which can be applied to Max-Cut as a special case of a quadratic unconstrained binary optimisation (QUBO) problem. In that work, two sub-approaches were discussed. The first converts the QUBO into its continuous quadratic relaxation, which is efficiently solvable, and directly uses the output of this relaxed problem to initialise the QAOA circuit. The second applies the random-hyperplane rounding method of the GW algorithm to generate a candidate solution for the QUBO. For Max-Cut, this QUBO can be written in terms of the graph Laplacian, L_G = D − A (where D is the diagonal degree matrix, and A is the adjacency matrix of G) as follows (we utilise this form later in this work):

    max_{x ∈ {0,1}^n} x^T L_G x.        (10)

However, by removing the requirement that each x_i is binary, one can obtain an efficiently solvable continuous relaxation that can serve as a warm-start for solving Eq. (10). Since L_G is a positive semidefinite (PSD) matrix, the relaxed form can be trivially written as:

    max_{x ∈ [0,1]^n} x^T L_G x,        (11)

i.e. by allowing each x_i to be continuous in this interval. If the matrix in the QUBO is not PSD, however, then one can obtain another continuous relaxation, as a semidefinite programme (SDP) [29,66]. The output of this optimisation is a real vector x* which, when a rounding procedure is performed (i.e. the GW algorithm), gives a candidate solution for the original Max-Cut. In order to use this relaxed solution to initialise QAOA, Ref.
[29] also demonstrated that the initial state from Eq. (9) and the mixer Hamiltonian Eq. (7) must also be altered as:

    |ψ^WS_init⟩ = ⊗_{i=1}^{n} R_Y(θ_i) |0⟩,    H^WS_M = ∑_{i=1}^{n} H^WS_M,i,
    H^WS_M,i = [ 2x*_i − 1                −2√(x*_i(1 − x*_i)) ]
               [ −2√(x*_i(1 − x*_i))      1 − 2x*_i           ]        (12)

where θ_i = 2 sin^{−1}(√x*_i). One can immediately see that |ψ^WS_init⟩ is the ground state of H^WS_M with eigenvalue −n. One possible issue that may arise with this warm-start is if a relaxed solution x*_i is exactly 0 or 1. When this happens, qubit i would be initialised to the state |0⟩ or |1⟩ respectively, and so would be unaffected by the problem Hamiltonian Eq. (4), which only contains Pauli-Z terms. To account for this possibility, [29] modifies θ_i in Eq. (12) with a regularisation parameter, ε ∈ [0, 0.5], whenever the candidate solution, x*_i, is too close to 0 or 1. Examining Fig. 1, this initialisation scheme is achieved by setting the angles in the initial state to θ_j = 2 sin^{−1}(√x*_j). The 'Init' in this figure indicates that one is free to choose any QAOA parameter initialisation method, as this warm-start approach only modifies the input state and the mixer Hamiltonian.
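A minimal sketch of this warm-start state preparation, assuming the θ_i = 2 arcsin(√x*_i) mapping with x*_i read as a probability, and implementing the regularisation as a clip of x*_i into [ε, 1 − ε] (the default ε below is illustrative, not a value taken from the paper):

```python
import numpy as np

def warm_start_angles(x_star, eps=0.25):
    """Map relaxed solutions x*_i in [0,1] to initial-state angles theta_i.

    Values within eps of 0 or 1 are pushed away from the poles so the mixer
    can still rotate those qubits (the regularisation discussed in the text).
    """
    x = np.clip(np.asarray(x_star, dtype=float), eps, 1 - eps)
    return 2 * np.arcsin(np.sqrt(x))

def warm_start_state(thetas):
    """Product state prod_i RY(theta_i)|0>, returned as a full statevector."""
    psi = np.array([1.0])
    for t in thetas:
        psi = np.kron(psi, np.array([np.cos(t / 2), np.sin(t / 2)]))
    return psi
```

For x*_i = 0.5 (maximal uncertainty) every angle is π/2 and the prepared state reduces to the cold-start |+⟩^⊗n, as one would hope.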

Trotterised quantum annealing
A second proposed method [30] to initialise QAOA uses concepts from quantum annealing [67,68], which is a popular method of solving QUBO problems of the form Eq. (10). QAOA was proposed as a discrete, gate-based method to emulate quantum adiabatic evolution, or quantum annealing. Therefore, one may hope that insights from quantum annealing may be useful in setting the initial angles for the QAOA circuit parameters. In the method proposed by [30] (dubbed Trotterised quantum annealing (TQA)), one fixes the QAOA circuit depth, p, and sets the parameters as:

    γ_k = (k/p) δt,    β_k = (1 − k/p) δt,        (13)

where k = 1, . . . , p and δt is a time interval which is a priori unknown, given as a fraction of the (unknown) optimal total anneal time, T*, so that δt = T*/p. The authors of [30] observed an optimal time step value of δt ≈ 0.75 for 3-regular graphs, and O(1) for other graph ensembles. In [30], the QAOA initialised with δt = 0.75 was observed to avoid local minima and find minima close to the global minimum. We also choose this value generically in our numerical results later in the text, although one should ideally pre-optimise the value for each graph instance. Note that, in contrast to the warm-starting method from the previous section, the TQA approach initialises the parameters, {β, γ}, rather than the initial QAOA state (and mixer Hamiltonian), which are set as in vanilla QAOA, i.e. as |+⟩^⊗n.
Again, revisiting Fig. 1, this initial state can be achieved by choosing ϕ_j = π, θ_j = π/2, ∀j; in particular, R_Y(π/2)|0⟩ = |+⟩. Similarly, for the mixer Hamiltonian, we have R_Y(π/2) e^{−iβ_k Z} R_Y(−π/2) = e^{−iβ_k X}, which is the single-qubit mixer unitary from vanilla QAOA, up to a redefinition of β_k.
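One consistent reading of the linearly ramped TQA schedule described above (γ increasing, β decreasing over the p layers) can be written in a few lines; the function name is ours:

```python
def tqa_parameters(p, dt=0.75):
    """Trotterised-quantum-annealing initial angles for a depth-p QAOA.

    gamma_k = (k/p) * dt ramps up and beta_k = (1 - k/p) * dt ramps down,
    mimicking a linear anneal with time step dt; dt = 0.75 is the value the
    text adopts for 3-regular graphs (ideally pre-optimised per instance).
    """
    gammas = [(k / p) * dt for k in range(1, p + 1)]
    betas = [(1 - k / p) * dt for k in range(1, p + 1)]
    return gammas, betas
```

Note that γ_k + β_k = δt at every layer, so the schedule interpolates between a pure mixer step at the start and a pure cost step at the end.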

Graph neural network warm-starting of QAOA
Now that we have introduced the QAOA, and existing methods for warm-starting its initial state and/or initial algorithm parameters, let us turn to our proposed method: the use of graph neural networks. This approach is closest to the relaxation method of Section 2.1, in that the GNN provides an alternative initial state to vanilla QAOA, and so it is to this method that we primarily compare. One of the main drawbacks of using SDP relaxations and the GW algorithm is that every graph for which the Max-Cut must be found generates a new problem instance, which must be solved from scratch. On the other hand, the GW algorithm comes equipped with performance guarantees (generating an approximate solution within a factor r ≈ 0.878 of the optimal answer).
As we shall see, using graph neural networks as an initialiser allows generalisation across many graph instances at once: since a trained GNN can be interpreted as a learned prior over graphs, even increasing the number of qubits will not significantly affect the inference time of such approaches. We also demonstrate how the model can be trained on a small number of qubits and still perform well on larger problem instances (size generalisation), a feature not present in any of the previous initialisation methods for QAOA. Furthermore, the incorporation of a differentiable initialisation structure allows the entire QAOA pipeline to become end-to-end differentiable, which is particularly advantageous since it makes the problem amenable to the automatic differentiation functionality of many deep learning libraries [69,70]. First, let us begin by introducing graph neural networks.

Graph neural networks
Graph neural networks (GNNs) [71] are a specific neural network model designed to operate on graph-structured data, typically functioning via some message-passing process across the graph. They are examples of models in geometric deep learning [72], in which one incorporates problem symmetries and invariances into the learning protocol. Examples have also been proposed in the field of quantum machine learning [50,[73][74][75]. Specifically, graph neural networks operate by taking an encoding of the input graph describing a problem of interest and outputting a transformed encoding. Each graph node is initially encoded as a vector, and these are then updated by the GNN so that each node incorporates information about the relative features of the graph. This is done by taking into account the connections and directions of the edges within the graph, using quantities such as the node degree and the adjacency matrix. The transformed graph information is then used to solve the problem of interest. There are many possible architectures for how this graph information is utilised in the GNN, including, for example, attention-based mechanisms [76] or graph convolutions [77]; see e.g. Ref. [78] for a review.
In order to transform the feature embeddings, the GNN is trained for a certain number of iterations (a hyperparameter). For a given graph node, n_ν, we associate a vector, h^t_{n_ν}, where t is the current iteration. In the next iteration, t + 1, to update the vector for node n_ν, we first compute some function of the vector embeddings of the nodes in a neighbourhood of n_ν, denoted N(n_ν) = {n_j}_{j : n_j ∼ n_ν}. A priori, there is no limitation on how large this neighbourhood can be; making it larger increases training time and difficulty, but also increases representational power. These function values are then aggregated (for example, by taking an average) and combined with the node's vector at the previous iteration (with perhaps a non-linear activation function) to generate h^{t+1}_{n_ν}. Each node update increases the information contained in the embedding relative to a larger subset of nodes in the graph. The collective action of these operations can be described by a parameterised function, f_θ (Fig. 1), whose parameters, θ, are suitably trained to minimise a cost function. In all cases here, we initialise all the elements of the feature vectors, h^0_{n_ν}, to be the degree of the node, n_ν. For the specific GNN architecture we use in the majority of this work, we choose the line graph neural network (LGNN) [79], shown to be competitive on combinatorial optimisation problems [53]. However, we also incorporate the graph convolutional network (GCN) proposed for combinatorial optimisation by [56] in some numerical results in Section 4. We give further details about these two architectures in Appendix A.
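The generic aggregate-and-combine update described above can be sketched as follows. This is our own mean-aggregation illustration with hypothetical weight matrices, not the LGNN or GCN layer used in the paper (those differ precisely in this aggregation step):

```python
import numpy as np

def gnn_layer(H, A, W_self, W_neigh):
    """One mean-aggregation message-passing update.

    H: (n, d) node feature matrix at iteration t; A: (n, n) adjacency matrix.
    Each node averages its neighbours' features, mixes them with its own
    features via two trainable weight matrices, and applies a ReLU.
    """
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)  # avoid divide-by-zero
    neigh_mean = (A @ H) / deg                         # aggregate over N(n_v)
    return np.maximum(0.0, H @ W_self + neigh_mean @ W_neigh)
```

Stacking T such layers gives each node a receptive field of its T-hop neighbourhood, which is what lets the final embeddings encode global graph structure.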
Once we have trained a GNN for a certain number of iterations, T, we can use the information encoded in {h^T_{n_j}}_j for the problem at hand. A simple example would be to attach a multi-layer perceptron and perform classification on each node, where {h^T_{n_j}}_j behave as feature vectors encoding the graph structure. For our purposes, we use these vectors to generate probabilities on the nodes: the probability that each node lies on a given side of the Max-Cut. These probabilities are then taken as the values x* in the warm-started QAOA circuit and its mixer Hamiltonian, Eq. (12).

Graph neural networks for Max-Cut
To attach these probabilities, there are at least two possible methods one could apply. Firstly, one could consider using reinforcement learning or long short-term memory networks (LSTMs) [80,81]. These methods generate probabilities in a stepwise fashion by employing a sequential dependency, and may be trained using a policy gradient method [82] (dubbed an 'autoregressive decoding' approach [83]).
The second (simpler) method is to treat each node independently, and generate the probability of the node lying on a given side of the Max-Cut (a 'non-autoregressive decoding'). This can be formulated as a vector, p, where the element corresponding to node n_ν is generated by applying a softmax to the final output feature vectors of the GNN, {h^T_{n_j}}_j [53,56]:

    p_ν = softmax(h^T_{n_ν,0}, h^T_{n_ν,1}) := exp(h^T_{n_ν,0}) / (exp(h^T_{n_ν,0}) + exp(h^T_{n_ν,1})).        (14)

In [53], for each node, n_ν, the final output is the two-dimensional vector [h^T_{n_ν,0}, h^T_{n_ν,1}], constructed as the output of a final two-output linear layer. The probability for each node is then taken as one of these outputs (say j = 0) via the softmax in Eq. (14), and is then used in the cost function described in the next section.
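This decoding step amounts to a row-wise softmax over the two output channels; a small sketch (our own function name, with the standard max-subtraction trick for numerical stability):

```python
import numpy as np

def node_probabilities(H_out):
    """Per-node cut probabilities from two-channel GNN outputs, Eq. (14) style.

    H_out has shape (n, 2): the two outputs of the final linear layer for
    each node. A numerically stable softmax is applied row-wise, and channel
    0 is taken as the probability of the node lying on one side of the cut.
    """
    shifted = H_out - H_out.max(axis=1, keepdims=True)  # avoid overflow in exp
    e = np.exp(shifted)
    return (e / e.sum(axis=1, keepdims=True))[:, 0]
```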

Unsupervised training
Now that we have defined the structure and output of the GNN, it must be suitably trained. One approach is supervised training; however, this would require a large number of solved example graphs to serve as ground truth. Instead, following [53,56], we opt for an unsupervised approach, bypassing the need for labels. To do so, we choose the cost function of [53], which is given by the Max-Cut QUBO itself in terms of the graph Laplacian (Eq. (10)), averaged over a training set of T graphs, {G_t}^T_{t=1}:

    L(θ) = −(1/T) ∑_{t=1}^{T} p_t^T L_{G_t} p_t,        (15)

where p_t ∈ [0, 1]^n is the probability vector from the GNN, Eq. (14), for graph G_t. Note that the graphs in the training set do not have to be the same size as the graphs of interest; the GNN can be trained on an ensemble of graphs of different sizes, or on graphs which are strictly smaller (or larger) than the test graph. We utilise this feature to improve GNN training in Section 4.1, and to demonstrate the generalisation capabilities of the GNN in Section 4.2. See Ref. [56] for how the cost function and GNN structure can be adapted to other QUBO-type problems.
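A minimal NumPy version of this unsupervised loss, written here as −p^T L_G p per graph (one consistent form of the Laplacian QUBO with soft node assignments; the helper names are ours):

```python
import numpy as np

def graph_laplacian(n, edges, weights=None):
    """Laplacian L_G = D - A of an undirected (optionally weighted) graph."""
    A = np.zeros((n, n))
    for k, (i, j) in enumerate(edges):
        w = 1.0 if weights is None else weights[k]
        A[i, j] = A[j, i] = w
    return np.diag(A.sum(axis=1)) - A

def unsupervised_maxcut_loss(p, L):
    """Relaxed Max-Cut cost for a single graph: -p^T L_G p.

    For binary p in {0,1}^n this equals minus the cut weight, so minimising
    the average of this loss over training graphs pushes the GNN's soft node
    probabilities towards large cuts, with no labelled data required.
    """
    p = np.asarray(p, dtype=float)
    return float(-p @ L @ p)
```

In a deep learning library this same expression, written on the GNN's output tensor, is directly differentiable with respect to the GNN parameters θ, which is what makes the end-to-end pipeline possible.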

Initialisation numerical results
Let us first study, numerically, the impact of the initialisation schemes discussed above on the QAOA. In all of the following, the approximation ratio, r, will be the figure of merit. We use the Xavier initialisation [59] for the QAOA parameters in all cases except for the TQA initialisation method. This initialises each parameter from a uniform distribution over an interval which depends on the number of parameters in the QAOA circuit.
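As a sketch of such a parameter initialiser: the bound below uses b = √(6/num_params), one common Glorot-style convention, treating the parameter count as the fan-in plus fan-out; this specific choice is our assumption, not a detail taken from the paper.

```python
import numpy as np

def xavier_uniform_angles(num_params, seed=None):
    """Xavier/Glorot-style uniform initialisation for the 2p QAOA angles.

    Samples each angle from U(-b, b) with b = sqrt(6 / num_params); the
    interval shrinks as the circuit acquires more parameters.
    """
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0 / num_params)
    return rng.uniform(-bound, bound, size=num_params)
```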

Figure 2: Success probability of the graph neural network on 3-regular graphs for Max-Cut. The histogram shows the percentage of graphs on which GNN QAOA versus cold-start QAOA can achieve a certain ratio of the optimal cut. Here, we set p = 5 and n = 12 qubits and generate the percentages over 50 random graphs. The GNN-initialised version is able to generate larger cut values than the vanilla version of QAOA.
We begin by benchmarking the graph-neural-network-initialised QAOA against a random cold-start initialisation, for the two architectures discussed above: the line graph neural network (LGNN) and the graph convolutional network (GCN). To demonstrate feasibility, Fig. 2 shows the success probability of the vanilla (cold-started) QAOA against the (LGNN-)QAOA initialisation over 50 random 3-regular graphs with 12 qubits and QAOA depth p = 5. We observe that the GNN-QAOA is capable of generating larger cuts on average than the vanilla version.
Next, Fig. 3 compares the approximation ratio directly output by the GNN against the GNN-initialised solution which is then further optimised by QAOA. To generate discrete Max-Cut solutions from the GNN probability vectors, we choose the same simple rounding scheme as [56], assigning x*_i = int(p_i) ∈ {0, 1} for each node, i. We leave the incorporation and comparison of more sophisticated rounding techniques to future work. We first plot the approximation ratio as a function of qubit number (graph size), for QAOA depths scaling proportionally with the number of qubits (p = 3n/4 in (a) and p = n/2 in (b)). We see that the GNN-initialised QAOA outperforms both the GNN and vanilla QAOA individually. This advantage is more pronounced at lower relative depth, and diminishes at higher depth, as confirmed by Fig. 3(c), where we fix the qubit number at 16 and plot the results as a function of QAOA depth. However, since the depth of quantum circuits is limited by decoherence in NISQ devices, the advantage of the GNN in the low-depth regime is promising. Finally, since we observe in these results that the LGNN outperforms the GCN architecture (due to its more complex message-passing functionality), we primarily use the former in the remainder of this work. However, for larger problem sizes, the LGNN may be less scalable [56].

Graph neural network versus SDP relaxations
Next, we benchmark the GNN against the SDP relaxation approach directly in Fig. 4. We begin in Fig. 4a by comparing the quality of the Max-Cut solution produced by the GW algorithm (r_GW) against the solution from the (line-)GNN (r_GNN). We observe comparable quality between the two methods (as remarked previously [56]), with the GNN generating solutions which are between 85-90% of the quality of the GW algorithm. However, we note that for these small graph instances, the training set becomes saturated, as all possible 3-regular graphs eventually appear. Naively increasing the size of the training set (say from N_train = 1000 to N_train = 5000) with graphs of the same size as the test case does not dramatically improve the performance of the GNN. Therefore, in order to non-trivially increase the amount of training data, we include graphs of different sizes in the training set (taking N_train = 5000 graphs which are 1×-5× the size of the test graph). Doing so increases the performance ratio of the GNN over GW to ∼95%, at the expense of a greater training time. If we then examine Fig. 4b and Fig. 4c, the tradeoff we are making becomes apparent: with a small sacrifice in solution quality, the GNN (once trained) is able to generate candidate solutions significantly faster than SDP relaxations and the GW algorithm.
Firstly, Fig. 4b shows that the inference time (the time to produce a warm-started solution) is significantly higher for the SDP relaxation method than for a trained GNN. For this plot, we focus on graphs relevant to the QAOA problem sizes we study in this paper. Next, we show in Fig. 4c how the GNN's speed advantage enables the scalable production of warm-started solutions for QAOA with millions of variables, far beyond the capability of near-term quantum devices. Specifically, we reproduce the results of [56], comparing the full time taken by the SDP relaxation and the GW algorithm against the GCN architecture (even including training time) to solve Max-Cut (see [56] for details). The runtime of the GW algorithm is limited by the interior-point method used to solve SDPs, which scales as O(n^3.5) [56]. In Section 4.2, we take this even further and demonstrate how the GNN approach is able to generalise not only across test graph instances of the same size as those in the training set, but also to larger test graph instances, a feature clearly not possible via the relaxation method.
As a final comparison, we benchmark the GNN initialisation technique against all the other techniques in Fig. 5: the warm-starting technique using relaxations of [29] ('Warm-start') and the Trotterised quantum annealing based approach of [30] ('TQA'), as a function of depth and training epochs.

Generalisation capabilities of GNNs
A key feature of using neural networks for certain tasks is their capability to generalise. This generalisation ability is one of the driving motivations behind machine learning, and tests the ability of an algorithm to actually learn (rather than simply memorise a solution). Similarly, we can test the generalisation ability of GNNs in warm-starting the QAOA. To do so, we train instantiations of GNNs on small graph instances, and then directly apply them to larger test instances. We test this for examples of between 6 and 14 nodes in Table 1. Here, we see this generalisation feature directly, as the GNN is capable of performing well on graphs larger than those in the training set. Note that a related generalisation behaviour was demonstrated via meta-learning [50,51] for the parameters of a variational circuit. The work of [50] utilises recurrent neural networks (we revisit this strategy in Section 5.3.2) for training the QAOA, but generalisation in this case was possible due to the structure of the algorithm and parameter concentration [18]. The FLexible Initializer for arbitrarily-sized Parametrized quantum circuits (FLIP) [51] also has related parameter-generalisation capabilities, which would be interesting to compare with and incorporate into warm-starting initialisers in future work.

Figure 3: Performance of the graph neural network on 3-regular graphs for Max-Cut. We compare the Max-Cut approximation ratios achieved with LGNN/GCN initialisation versus cold-start (vanilla) QAOA. We also plot the raw values output by the LGNN with a simple rounding scheme. In (a), the QAOA depth is set to p = 3n/4, while in (b), p = n/2. Finally, in (c), we fix the number of qubits to 16 and plot each method as a function of the QAOA depth, demonstrating monotonic improvement as a function of p. Each datapoint is generated via 1000 runs of the LGNN/GCN on random instances of 3-regular graphs of the appropriate size for the number of qubits.

Figure 4: Time versus quality tradeoff between GNN and relaxation initialisation methods. (a) Max-Cut approximation ratios generated by the GNN and the continuous relaxation (without QAOA), as a function of graph size. r_j is the approximation ratio generated by method j ∈ {GW, GNN}. We use a simple rounding technique to generate discrete values from the soft outcomes of the (line-)GNN (see main text), and plot results for 1000 and 5000 training graphs. (b) Comparison of the time taken to produce a warm-start via relaxation versus (line-)GNN initialisation, as a function of qubit number, averaged over 10 random graph instances. The GNN enables much faster inference for Max-Cut. This does not include pre-training time for the GNN but, as an example, training on 1000 graphs for 18 qubits takes only 6 minutes for the LGNN. Finally, (c) reproduces results from [56] demonstrating the scalability of the GNN initialisation (for the GCN) over solving the SDP relaxation with the GW algorithm; the GNN can provide warm-starts for QAOA on graphs with millions of nodes. Note, these plots do not include the QAOA runtime after the warm-start.
The rows in Table 1 correspond to the graph size on which the model was trained, and the columns to the graph size on which the GNN was then tested. For example, both training and testing on graphs with 10 nodes gives an approximation ratio of ≈ 0.93, and there is no drop in performance if we instead train the GNN using only graphs of 6 nodes. Even when testing on 14-node graphs, reducing the training graphs from 14 to 6 nodes incurs a drop in the approximation ratio of only 7%. Again, we mention the limitation to small problem sizes due to the QAOA circuit simulation overhead. As with the discussions above, it has been observed that GNNs trained on 30-node graphs have the ability to generalise to 300 nodes and larger [53,56]. Note that this generalisation is the reverse of the situation in Fig. 4a, where larger graphs are included in the training set to improve solution quality on smaller graphs, rather than in the test set, as we do here.
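The size generalisation observed here has a simple structural explanation: a GNN's trained weights act only on the feature dimension, never on the node dimension, so the same weights apply to graphs of any size. A minimal NumPy sketch of one graph-convolution layer (the propagation rule and dimensions are illustrative, not the exact LGNN/GCN architecture used in our experiments):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

    A: (n, n) adjacency matrix, H: (n, f_in) node features,
    W: (f_in, f_out) trained weights, independent of n.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # degrees (>= 1 after self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))                   # "trained" weights, fixed once

# The same W applies to a 6-node and a 14-node graph: only the feature
# dimension (here 4) must match, never the node count.
for n in (6, 14):
    A = (rng.random((n, n)) < 0.5).astype(float)
    A = np.triu(A, 1); A = A + A.T            # symmetric, no self-loops
    H = rng.normal(size=(n, 4))
    out = gcn_layer(A, H, W)
    print(n, out.shape)
```

This node-count independence is what allows a model trained on 6-node graphs to be evaluated directly on 14-node (or far larger) instances.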

Optimisation of the QAOA
Now, we move to the second focus of this work, which is a comparison between a wide range of different optimisers which can be used to train the QAOA when solving the Max-Cut problem. A large variety of optimisers for VQAs have been proposed in the literature, and each has its own respective advantages and disadvantages when solving a particular problem. Due to the hybrid quantum-classical nature of these algorithms, many of the optimisation techniques have been taken directly from the classical literature on, for example, the training of classical neural networks. However, driven by the need to improve the performance of near-term quantum algorithms, there has also been much effort put into the discovery of quantum-aware optimisers, which may include non-classical hyperparameters such as the number of measurement shots to be taken to compute quantities of interest from quantum states [84,85]. In the following sections, we implement and compare a number of these optimisers. We begin by evaluating gradient-based and gradient-free optimisers in Section 5.1. We then compare quantum and classical methods for simultaneous perturbation stochastic optimisation in Section 5.2, which typically have lower overheads than the gradient-free or gradient-based optimisers since all parameters are updated in a single optimisation step, as opposed to parameter-wise updates. Finally, we implement some neural network based optimisers in Section 5.3, which operate via reinforcement learning and the meta-learning optimiser mentioned above. In all of the above cases, we use vanilla stochastic gradient descent optimisation as a benchmark.

Gradient-based versus gradient-free optimisation
For a given parameterised quantum circuit instance, equipped with parameters θ_t at iteration (epoch) t, the parameters at iteration t + 1 are given by an update rule:

θ_{t+1} = θ_t + ∆C(θ_t),    (17)

where ∆C(θ_t) is the update, which contains information about the cost function to be optimised at epoch t, C(θ_t). In gradient-based optimisers, the update contains information about the gradients of C(θ_t) with respect to the parameters, θ_t. In contrast, gradient-free (or zeroth-order) optimisation methods use only information about C(θ_t) itself. One may also incorporate second-order derivative information, which tends to outperform the previous methods, but is typically more expensive as a result. In the following, we test techniques which fall into all of these categories, for a range of qubit numbers and QAOA depths.
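As a concrete instance of this update rule, a minimal NumPy sketch of vanilla gradient descent on a toy two-parameter cost (the quadratic cost stands in for a QAOA expectation value):

```python
import numpy as np

def vanilla_gd(grad, theta0, eta=0.1, epochs=200):
    """theta_{t+1} = theta_t - eta * grad C(theta_t): the simplest
    instance of the generic update rule, with a constant learning
    rate eta and no metric tensor."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        theta = theta - eta * grad(theta)
    return theta

# Toy cost C(theta) = (theta_0 - 1)^2 + (theta_1 + 2)^2, minimiser (1, -2).
grad = lambda th: np.array([2.0 * (th[0] - 1.0), 2.0 * (th[1] + 2.0)])

theta_star = vanilla_gd(grad, theta0=[0.0, 0.0])
print(theta_star)   # converges to [1., -2.]
```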
A general form of this update rule when incorporating gradient information can be written as:

θ_{t+1} = θ_t − η(t, θ) g(θ_t)^{-1} ∇C(θ_t),    (18)

where η(t, θ) is a learning rate, which determines the speed of convergence and may depend on the previous parameters, θ, and on t. The quantity g(θ) ∈ R^{d×d} is a metric tensor, which incorporates information about the parameter landscape. This tensor can be the classical or the quantum Fisher information (QFI), for example. In the case of the latter, the elements of g(θ) for a parameterised state, |ψ_θ⟩, are given by:

g_{ij}(θ) = Re[⟨∂_i ψ_θ|∂_j ψ_θ⟩ − ⟨∂_i ψ_θ|ψ_θ⟩⟨ψ_θ|∂_j ψ_θ⟩].

In this form, the gradient update Eq. (18) updates the parameters according to the quantum natural gradient (QNG) [86].
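To illustrate the role of the metric tensor, a sketch of a single natural-gradient step. For genuine QNG the metric would be the QFI estimated from circuit evaluations; here we hand it a known matrix, and choosing the Hessian of the cost (as below) turns the step into a Newton step, which solves a quadratic cost exactly. All names are illustrative:

```python
import numpy as np

def natural_grad_step(grad, metric, theta, eta=1.0):
    """theta <- theta - eta * g(theta)^{-1} grad C(theta).
    Solving the linear system g x = grad is cheaper and more stable
    than explicitly inverting the metric tensor g."""
    return theta - eta * np.linalg.solve(metric(theta), grad(theta))

# Ill-conditioned quadratic cost C(theta) = 0.5 * theta^T A theta.
A = np.diag([10.0, 0.1])
grad = lambda th: A @ th        # gradient of C
metric = lambda th: A           # stand-in metric (here: the Hessian of C)

theta = np.array([1.0, 1.0])
theta = natural_grad_step(grad, metric, theta)
print(theta)                    # the rescaled step lands at the minimum [0., 0.]
```

The point of the metric is visible here: plain gradient descent on this ill-conditioned cost must use a tiny learning rate to avoid diverging along the stiff direction, whereas the metric-rescaled step treats both directions on an equal footing.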
If we further simplify by taking g = 1 to be the identity and choosing different functions for η(θ, t), we recover many popular optimisation routines such as Adam [87] or Adadelta [88], which incorporate notions such as momentum into the update rule and make the learning rate time-dependent. Such behaviour is desirable to, for example, allow the parameters to make large steps at the beginning of optimisation (when far from the target), and smaller steps towards the latter stages when close to the optimal solution. The simplest form of gradient descent is 'vanilla' gradient descent, which takes η(θ, t) := η to be a constant. The 'stochastic' versions of gradient descent use an approximation of the cost gradient computed with only a few training examples. In Fig. 6, we begin by comparing some examples of the above gradient-based optimisation rule (specifically using QNG, Adam and RMSProp) to a gradient-free method (COBYLA) [89]. We also add a method known as model gradient descent (MGD), a gradient-based method introduced by [37] that involves quadratic model fitting as a gradient estimation (see Appendix A of [90] for pseudocode). The results are shown in Fig. 6 for Max-Cut on 3-regular graphs for up to 14 qubits. We observe that optimisation using the QNG outperforms the other methods; however, it does so with a large resource requirement, needed to compute the QFI using quantum circuit evaluations.
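For concreteness, a NumPy sketch of the Adam update rule, showing how the effective learning rate becomes time- and parameter-dependent (the toy gradient is illustrative, standing in for a QAOA cost gradient):

```python
import numpy as np

def adam(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, epochs=500):
    """Adam: gradient descent where the step is modulated per parameter
    by exponential moving averages of the gradient (m, 'momentum') and
    its elementwise square (v), making eta effectively time-dependent."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, epochs + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy quadratic with minimiser (1, -2).
grad = lambda th: np.array([2.0 * (th[0] - 1.0), 2.0 * (th[1] + 2.0)])
theta_adam = adam(grad, [0.0, 0.0])
print(theta_adam)   # approaches [1., -2.]
```

The early steps have size roughly η regardless of the gradient magnitude, which is exactly the "large steps when far from the target" behaviour described above.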
We next examine the convergence speed of the 'quantum-aware' QNG optimiser versus the standard Adam and RMSProp in Fig. 7. QNG significantly reduces convergence time, but again at the expense of being a more computationally taxing optimisation method [86].

Simultaneous perturbation stochastic approximation optimisation
From the above, using the QNG as an optimisation routine is very effective, but it carries a large computational burden due to the evaluation of the quantum Fisher information. A strategy to bypass this inefficiency was proposed by [91], which combines the QNG with the simultaneous perturbation stochastic approximation (SPSA) algorithm. SPSA is an efficient method to bypass the linear scaling in the number of parameters of the standard parameter-shift rule [7,92] for computing quantum gradients. For example, in Eq. (18) restricted to vanilla gradient descent, one gradient term must be computed for each of the d parameters (d = 2p in the case of the QAOA). In contrast, SPSA approximates the entire gradient vector by choosing a random direction in parameter space and estimating the gradient in this direction using, for example, a finite-difference method. This requires a constant amount of computation relative to the number of parameters. To incorporate the quantum Fisher information, [91] actually uplifts a second-order version of SPSA (called 2-SPSA), which exploits the Hessian of the cost function to be optimised. The update rules for 1-SPSA, 2-SPSA and QN-SPSA are given by:

θ_{t+1} = θ_t − η ∇̂C(θ_t),    (1-SPSA)
θ_{t+1} = θ_t − η Ĥ(θ_t)^{-1} ∇̂C(θ_t),    (2-SPSA)
θ_{t+1} = θ_t − η ĝ(θ_t)^{-1} ∇̂C(θ_t),    (QN-SPSA)

where the stochastic approximations to the Hessian, H(θ), and to the quantum Fisher information are given by:

Ĥ(θ) = (δC / (2ε²)) (∆_1 ∆_2^T + ∆_2 ∆_1^T)/2,
δC = C(θ + ε∆_1 + ε∆_2) − C(θ + ε∆_1) − C(θ − ε∆_1 + ε∆_2) + C(θ − ε∆_1),

ĝ(θ) = −(δF / (4ε²)) (∆_1 ∆_2^T + ∆_2 ∆_1^T)/2,
δF = F(θ, θ + ε∆_1 + ε∆_2) − F(θ, θ + ε∆_1) − F(θ, θ − ε∆_1 + ε∆_2) + F(θ, θ − ε∆_1),

respectively. Here, F(θ, θ + α) = |⟨ψ_θ|ψ_{θ+α}⟩|² is the fidelity between the parameterised state prepared with angles θ and a 'shifted' version with angles θ + α. The quantities ∆_1, ∆_2 are vectors sampled uniformly at random over {−1, 1}^d, and ε is a small constant arising from the finite-difference approximation of the gradient; for example, we approximate the true gradient, ∇C(θ), by ∇̂C(θ), given by:

∇̂C(θ) = ((C(θ + ε∆_1) − C(θ − ε∆_1)) / (2ε)) ∆_1.

We compare these three perturbation methods in Fig. 8 against SGD once again, with a fixed learning rate of η = 0.01. Notice that SGD performs comparably to 1-SPSA, but at the expense of more cost function evaluations.
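A minimal NumPy sketch of the 1-SPSA update, with a toy cost standing in for the QAOA expectation value. Note that one update costs only two cost evaluations, independent of the number of parameters d:

```python
import numpy as np

def spsa_step(cost, theta, rng, eta=0.05, eps=0.1):
    """One 1-SPSA update: perturb all d parameters at once along a
    random direction Delta in {-1, +1}^d and form a central
    finite-difference estimate of the gradient along that direction."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # Two cost evaluations in total, regardless of d
    # (versus 2d evaluations for the parameter-shift rule).
    slope = (cost(theta + eps * delta) - cost(theta - eps * delta)) / (2 * eps)
    return theta - eta * slope * delta

# Toy quadratic cost with minimiser [1., -2., 0.5].
cost = lambda th: np.sum((th - np.array([1.0, -2.0, 0.5])) ** 2)

rng = np.random.default_rng(1)
theta = np.zeros(3)
for _ in range(2000):
    theta = spsa_step(cost, theta, rng)
print(theta)   # close to [1., -2., 0.5]
```

Although each individual step follows a random direction, the estimate is unbiased (the expectation of slope * delta over delta is the true gradient for this cost), so the iterates drift towards the minimiser.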
Figure 8: Comparison of the SPSA variants (1-, 2- and QN-SPSA) for Max-Cut with QAOA. Plots show the approximation ratio as a function of (a) depth (qubit number fixed at 10), (b) number of qubits (QAOA depth fixed to p = 5) and (c) training iteration (qubit number and QAOA depth fixed to 10 and 7 respectively). In all cases, the average is taken over 10 independent optimisation runs.

Neural optimisation
In this section, we move to a different methodology for finding optimal QAOA parameters than those presented in the previous section. Specifically, as with the incorporation of graph neural networks in the initialisation of the algorithm, we can test neural network based methods for the optimisation itself. We test two proposals from the literature for optimising parameterised quantum circuits. The first uses the reinforcement learning approach of [33]. The second is derived from using meta-learning to optimise quantum circuits, as proposed by [50]. Both approaches involve neural networks outputting the optimised parameters, by either predicting the update rule or directly predicting the QAOA parameters.

Reinforcement learning optimisation
The work of [33] frames QAOA optimisation as a reinforcement learning problem, adapting [93] to the problem-specific nature of the QAOA. The primary idea is to construct and learn a policy, π(a, s), via which a reinforcement learning agent associates a state, s_t ∈ S, with an action, a_t ∈ A. In [33], an action is the update applied to the parameters (similarly to Eq. (17)), ∆γ, ∆β. A state, s_t ∈ S, consists of the finite differences of the QAOA cost, ∆C(θ_{tl}), and of the parameters, ∆γ_{tl}, ∆β_{tl}. Here, l ∈ {t − 1, . . . , t − L} ranges over the previous L iterations of history up to the current iteration, t. The corresponding possible actions, a_t ∈ A, are the set of parameter differences, {∆γ_{tl}, ∆β_{tl}}_{l=t−L}^{t−1}. The goal of the reinforcement learning agent is to maximise the reward, R(s_t, a_t, s_{t+1}), which in this case is the change of C between two consecutive iterations, t and t + 1. The agent aims to maximise a discounted version of the total reward over iterations.
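To make the state construction concrete, a short sketch of the L-step history window (the class name and layout are our own illustration of the scheme in [33], not the authors' implementation):

```python
from collections import deque

import numpy as np

class QAOAHistoryState:
    """State s_t for the RL agent: finite differences of the cost and of
    the (gamma, beta) parameters over the previous L iterations."""

    def __init__(self, L):
        self.L = L
        self.diffs = deque(maxlen=L)   # each entry: (dC, dgamma, dbeta)

    def update(self, dC, dgamma, dbeta):
        """Record the differences from the latest optimisation step;
        entries older than L iterations are discarded automatically."""
        self.diffs.append((dC, np.asarray(dgamma), np.asarray(dbeta)))

    def as_vector(self):
        """Flatten the history into the observation fed to the policy."""
        return np.concatenate([np.concatenate(([dC], dg, db))
                               for dC, dg, db in self.diffs])

# Depth p = 2 QAOA: 2 gammas + 2 betas, history length L = 3.
state = QAOAHistoryState(L=3)
for t in range(5):   # only the most recent 3 updates are kept
    state.update(dC=-0.1 * t, dgamma=[0.01, 0.02], dbeta=[0.03, 0.04])
print(state.as_vector().shape)   # (15,) = L * (1 + 2p)
```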
The specific approach to policy search proposed by [33] is an actor-critic network trained with the proximal policy optimisation (PPO) algorithm [94], using a fully connected perceptron with two hidden layers of 64 neurons for both actor and critic. The authors observed an eightfold improvement in the approximation ratio compared to the gradient-free Nelder-Mead optimiser. Furthermore, the ability of this method to generalise across different graph sizes is reminiscent of our QAOA initialisation approach using GNNs above.

Meta-learning optimisation
A second method to incorporate neural networks is via meta-learning. In the classical realm, this commonly takes the form of one neural network predicting parameters for another. The method we adopt here is that proposed by [50,95], which uses a neural optimiser that, when given information about the current state of the optimisation routine, proposes a new set of parameters for the quantum algorithm. Specifically, [50,95] adopts a long short-term memory (LSTM) network, an example of a recurrent neural network (RNN), as the neural optimiser (with trainable parameters, ϕ). Using this architecture, the parameters at iteration t + 1 are output as:

θ_{t+1}, s_{t+1} = LSTM_ϕ(s_t, θ_t, C_t).

Here, s_t is the hidden state of the LSTM at iteration t, and the next hidden state is suggested by the neural optimiser along with the QAOA parameters. C_t is used as a training input to the neural optimiser, and in the case of a VQA it is an approximation to the expectation value of the problem Hamiltonian, i.e. Eq. (4). The cost function for the RNN chosen by [50] incorporates the averaged history of the cost at previous iterations, as well as a term that encourages exploration of the parameter landscape. We compare this approach against SGD (with a fixed learning rate of η = 0.01) and the previous RL-based approach in Fig. 9.

Figure 9: Comparison of neural optimisers. We compare the LSTM-based meta-learning against a reinforcement learning optimiser, with vanilla stochastic gradient descent (SGD) used as a benchmark. Once each neural optimiser has converged, we continue the optimisation with SGD. Here we use 10 qubits for Max-Cut and QAOA depth p = 6. The results are averaged over 10 independent runs.
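The interaction between the neural optimiser and the cost evaluations can be sketched as a simple unrolled loop. Here a hand-written secant rule stands in for the trained LSTM_ϕ, purely so the loop is runnable; in [50] this callable would be the LSTM itself, and all names below are illustrative:

```python
import numpy as np

def unroll(neural_opt, cost, theta0, s0, T):
    """Outer loop of the meta-learning scheme: at each iteration the
    neural optimiser receives (theta_t, s_t, C_t) and proposes both
    theta_{t+1} and its own next hidden state s_{t+1}."""
    theta, s = theta0, s0
    for _ in range(T):
        theta, s = neural_opt(theta, s, cost(theta))
    return theta

# Stand-in for LSTM_phi: the 'hidden state' s carries the previous
# (theta, C) pair, and the proposal descends along a secant estimate
# of the gradient. A trained LSTM would learn a far richer update.
def secant_opt(theta, s, c):
    if s is None:
        return theta + 0.1, (theta, c)          # bootstrap step
    theta_prev, c_prev = s
    if theta == theta_prev:                     # converged; avoid 0/0
        return theta, s
    g = (c - c_prev) / (theta - theta_prev)     # 1-D secant slope
    return theta - 0.2 * g, (theta, c)

cost = lambda th: (th - 2.0) ** 2               # toy 1-D cost, minimum at 2
theta_T = unroll(secant_opt, cost, theta0=0.0, s0=None, T=30)
print(theta_T)   # close to 2.0
```

The essential structural point is that the optimiser is stateful: the hidden state lets it condition each proposal on the trajectory so far, rather than on the current cost alone.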

Conclusion and Outlook
The work presented in this paper builds new techniques into the analysis of the QAOA, a quantum algorithm for combinatorial optimisation problems. Here, we build an efficient and differentiable pipeline using the powerful machinery of graph neural networks. GNNs have been extensively studied in the classical domain for a variety of graph problems, and we adopt them as an initialisation technique for the QAOA, a necessary step to ensure the QAOA is capable of finding solutions efficiently. Good initialisation techniques are especially crucial for variational algorithms to achieve good performance when implemented on depth-limited near-term quantum hardware. Contrary to previous works on QAOA initialisation, our GNN approach does not require training a separate instance each time one encounters a new problem graph, and can therefore speed up inference time across graphs. We demonstrated this in the case of the QAOA by showing good generalisation capabilities on new (even larger) test graphs than the family of graphs on which the GNNs were trained. To complement the initialisation of the algorithm, we investigated the search for optimal QAOA parameters, or optimisation, with a variety of methods. In particular, we incorporated gradient-based and gradient-free classical and quantum-aware optimisers, along with more sophisticated optimisation methods incorporating meta- and reinforcement learning.
There is large scope for future work, particularly in the further incorporation and investigation of classical machine learning techniques and models to improve near-term quantum algorithms. One could consider alternative GNN structures for initialisation, other neural optimisers, or transfer learning techniques. For example, one could study combining the warm-starting abilities of GNNs with generic quantum circuit initialisers such as FLIP [51], both of which exhibit generalisation capabilities to larger problem sizes.
A second extension of our proposal could be to other graph problems besides Max-Cut, and at much larger scales following [56]. Finally, one could consider the use of truly 'quantum' machine learning models such as quantum graph neural networks [73] or others [9,96].