Quantum Optimal Control via Semi-Automatic Differentiation

We develop a framework of "semi-automatic differentiation" that combines existing gradient-based methods of quantum optimal control with automatic differentiation. The approach makes it possible to optimize practically any computable functional and is implemented in two open-source Julia packages, GRAPE.jl and Krotov.jl, part of the QuantumControl.jl framework. Our method is based on formally rewriting the optimization functional in terms of propagated states, overlaps with target states, or quantum gates. An analytical application of the chain rule then separates the time propagation from the evaluation of the functional when calculating the gradient. The former can be evaluated with great efficiency via a modified GRAPE scheme. The latter is evaluated with automatic differentiation, but with a profoundly reduced complexity compared to the time propagation. Thus, our approach eliminates the prohibitive memory and runtime overhead normally associated with automatic differentiation and facilitates further advancement in quantum control by enabling the direct optimization of non-analytic functionals for quantum information and quantum metrology, especially in open quantum systems. We illustrate and benchmark the use of semi-automatic differentiation for the optimization of perfectly entangling quantum gates on superconducting qubits coupled via a shared transmission line. This includes the first direct optimization of the non-analytic gate concurrence.


Introduction
Optimal control is a cornerstone in the development of quantum technologies [1][2][3][4][5][6][7]. Quantum information processing [8], quantum simulation [9], and quantum sensing [10] all rely on the ability to manipulate matter at the fundamental quantum level. In simple cases, analytical solutions to the control problem may be possible [11]. In more realistic settings, in particular when taking into account classical or quantum noise, numerical optimization methods must be used. The most general methods for open-loop pulse-level control, gradient ascent pulse engineering (GRAPE) [12,13] and Krotov's method [14][15][16][17][18][19][20], can both harness the full range of control possible via arbitrary waveform generators [21] or optical pulse shapers [22]. These methods iteratively update the controls via gradient information, that is, the derivative of the optimization functional with respect to the control parameters. Most generally, the control parameters are the amplitudes of time-dependent control fields, e.g., laser pulses in the control of trapped atoms or ions, and microwave pulses for the control of superconducting circuits.
Traditionally, the required gradients have to be derived analytically. In particular for GRAPE, the numerical scheme [12] is formulated specifically for an overlap with a target state. This limits existing GRAPE implementations [23][24][25][26] to a small class of optimization functionals that include state-to-state transfers and quantum gates. The implementation of Krotov's method [27] provides slightly more flexibility, but still requires that a function to evaluate the gradient be passed along with the optimization functional.
The limitation to standard functionals [19] has restrained the full potential of quantum control. In many applications, both in quantum information and in quantum sensing, there are figures of merit that capture the true objective of the optimization but do not fit into the simple mold of reaching a specific target state. For example, the true intent of a control scheme is often to generate entanglement. Similarly, in metrology, the quantum Fisher information [28] directly measures the metrological gain [29,30].
In the context of universal quantum computing, in combination with single-qubit gates, any perfectly entangling two-qubit gate is sufficient to implement a quantum circuit [8]. For a complex system, which specific gate can be implemented with minimal resources or with maximum noise robustness is not predictable, but can be identified by directly maximizing the entanglement power, as defined by the gate concurrence [31]. This is complicated by the fact that the gate concurrence is not an analytic quantity and thus there is no closed-form expression for the gradient. To address this, based on the geometric theory of two-qubit gates in the Weyl chamber [32,33], an alternative functional that has an analytic, albeit complicated, gradient was formulated and demonstrated in Refs. [34,35].
A breakthrough in the flexibility of quantum optimal control was made by adopting automatic differentiation (AD) in Refs. [36][37][38][39][40], with proof-of-concept implementations in Refs. [41,42]. Automatic differentiation [43,44] considers the evaluation of the optimization functional as a computational graph of elementary operations, and applies the chain rule to evaluate its derivative. This relies on the realization that at a sufficiently low level, any function numerically evaluated by a computer is analytic.
In addition to providing the flexibility to include arbitrary final-time functionals and running costs, AD has enabled the optimization of open quantum systems through quantum trajectories [38,45,46], which are otherwise difficult to incorporate analytically [47]. The approach of gradient-based optimization through AD has also been taken up in some more comprehensive quantum control software packages [48][49][50].
Although AD was first developed in the 1960s [51], its widespread use is associated with the rise of machine learning. It is at the core of the backpropagation method [52] for training neural networks. Hence, it is most commonly embedded in machine learning frameworks such as TensorFlow [53], PyTorch [54,55], JAX [56,57], or Flux/Zygote [58][59][60][61]. These frameworks have traditionally focused only on the numerical operations required for neural networks, specifically dense real-valued matrix-vector multiplications. The in-place sparse complex-valued linear algebra operations required for the efficient numerical methods of quantum dynamics [62][63][64][65][66] have only recently started to be addressed and thus have required workarounds that have hindered performance.
More fundamentally, AD requires the storage of gradient information for every node in the computational graph. In a "full-AD" mode as in Refs. [36][37][38][39][40], where the full time propagation as well as the evaluation of the optimization functional are performed within the AD framework, this generally implies the storage of a gradient matrix or vector for every linear algebra operation. For large Hilbert space dimensions, open quantum systems, or a large number of control parameters (time steps), this numerical overhead quickly becomes prohibitive.
In this paper, we address both of these issues by introducing the approach of semi-automatic differentiation. The approach exploits the fact that all pulse-level quantum control problems are based on the time evolution of quantum states. We show how formally rewriting the optimization functional as a function of the propagated states, of the overlaps with target states, or of a quantum gate allows the evaluation of the gradient to be split into two parts. The main part can be evaluated analytically and leads to an efficient modified GRAPE scheme. The remaining part has a profoundly reduced computational complexity and can be evaluated using automatic differentiation with negligible numerical overhead.
The modified GRAPE scheme is described in detail and compared with Krotov's method. Thus, the paper gives a blueprint for the efficient implementation of gradient-based optimal control for arbitrary functionals. In addition, we have implemented the method in the Julia programming language [67,68] in two packages, GRAPE.jl [69] and Krotov.jl [70], both of which are part of the more comprehensive QuantumControl.jl framework [71].
To demonstrate the numerical efficacy of the semi-automatic differentiation approach, we benchmark the optimization of entangling quantum gates for two superconducting transmon qubits [72] with a shared transmission line [73] for a varying Hilbert space dimension and a varying number of time steps. In addition to replicating the results of Refs. [34,35] with an automatic gradient, we also demonstrate the first direct optimization of a non-analytic entanglement measure [31]. We show that the numerical cost of the optimization in terms of both memory and runtime scales identically to the direct optimization of a specific quantum gate with an analytic gradient. This is in stark contrast to the use of full automatic differentiation along the lines of Refs. [36][37][38][39][40], which we show to quickly become infeasible in terms of memory usage and/or runtime.
The paper is structured as follows. In Section 2, we review the efficient implementation of GRAPE and Krotov's method. This includes the calculation of gradients to machine precision. The numerical scheme described here also applies to the analytic component of the semi-AD approach, with only minor changes. In Section 3, we briefly review the concepts of automatic differentiation to develop an understanding of the potential implications of a full-AD approach on numerical costs. Section 4 then develops the theory of semi-automatic differentiation and contains the main new results of this paper. Section 5 defines the optimization problem for perfectly entangling quantum gates on superconducting transmon qubits and shows the benchmarks of the semi-AD and full-AD approaches, as well as the direct optimization of a quantum gate. Section 6 concludes.

Gradient-based Optimal Control
In this section, we review standard methods of gradient-based numerical control theory. As we will show in Section 4, the method of semi-automatic differentiation that we introduce in this paper builds on these existing control methods, with minimal additional numerical overhead. That is, the numerical cost of optimizing arbitrary functionals with semi-automatic differentiation is virtually the same as the cost of optimizing standard functionals with traditional gradient-based control methods. Thus, we review the state of the art for implementing these methods as efficiently as possible.

Optimization Functionals
Mathematically, quantum control problems are solved by iteratively minimizing an optimization functional of the general form

\[
J(\{\epsilon_{nl}\}) = J_T\big(\{|\Psi_k(T)\rangle\}\big)
  + \sum_l \int_0^T g_a\big(\epsilon_l(t), t\big)\,\mathrm{d}t
  + \sum_k \int_0^T g_b\big(\{|\Psi_k(t)\rangle\}, t\big)\,\mathrm{d}t\,.
\tag{1}
\]

The terms that constitute the total functional are the final-time functional J_T and the running costs g_{a,b}(t), which may encode penalties on the pulse amplitudes or on the propagated states. J_T depends explicitly on the states {|Ψ_k(T)⟩}. These are the result of a forward propagation of some set of initial states {|φ_k⟩} at t = 0 under the control fields ε_l(t). The index l numbers independent control fields. In the example in Section 5, these would be the real and imaginary part of a complex-valued microwave field in a rotating frame, or equivalently the amplitude and phase of the microwave field in the non-rotating frame. The index k numbers different "objectives" that must be achieved simultaneously. For the example of a two-qubit gate Ô, k numbers the four logical basis states |00⟩, |01⟩, |10⟩, and |11⟩. Then, a typical functional is the square-modulus functional

\[
J_{T,\text{sm}} = 1 - \frac{1}{N^2} \Big| \sum_{k=1}^{N} \langle \phi_k^{\text{tgt}} | \hat U | \phi_k \rangle \Big|^2
\tag{2}
\]

with N = 4, where |φ_k^tgt⟩ ≡ Ô|φ_k⟩ is the target state at t = T and Û is the time evolution operator, so that |Ψ_k(T)⟩ ≡ Û|φ_k⟩.
On the left-hand side, the functional J explicitly depends on a set of control parameters {ε_nl}. We will primarily focus here on piecewise-constant control fields. In this case, ε_nl is the value of the l'th control field on the n'th interval of the time grid. With N_T time intervals, the time evolution operator is then

\[
\hat U = \prod_{n=N_T}^{1} e^{-i \hat H_n \, dt_n}\,,
\]

with later time steps to the left, where Ĥ_n is the Hamiltonian for the n'th interval of the time grid and dt_n = t_n − t_{n−1}. Typically, Ĥ_n = Ĥ^{(0)} + Σ_l ε_{nl} Ĥ^{(l)} with the drift Hamiltonian Ĥ^{(0)} and the control Hamiltonians {Ĥ^{(l)}}, although a nonlinear dependency on the control fields is also possible. While we have written the functional and the time evolution in terms of Hilbert space states and Hamiltonians, all of the formalism applies equally to open quantum systems. In this case, any state |Ψ⟩ is replaced by a density matrix ρ̂ and any Hamiltonian Ĥ is replaced by a Liouvillian super-operator L. Any inner product ⟨Ψ|Φ⟩ is defined for (density) matrices as tr[Â†B̂]. In any case, from a numerical perspective, a state |Ψ⟩ or a (vectorized) density matrix ρ̂ is a complex vector, and a Hamiltonian Ĥ or a Liouvillian L is a (sparse) matrix acting on that vector. The main distinction is that Ĥ for a normal Schrödinger equation is Hermitian, while L is not. For open quantum systems described by a master equation in Lindblad form, the explicit L can be constructed as a sparse matrix from the Lindblad operators; see, e.g., Appendix B.2 in Ref. [74].
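To make the discretization concrete, the following is a minimal NumPy sketch (the paper's actual implementations are in Julia) that propagates a single state under a piecewise-constant control and evaluates the square-modulus functional for a single objective (N = 1). The two-level model, control values, and time grid are arbitrary placeholders; the matrix exponential uses a plain Taylor series, which suffices for the small norms here.

```python
import numpy as np

def expm(M, order=40):
    """Matrix exponential via a plain Taylor series (fine for small ||M||)."""
    result = np.eye(M.shape[0], dtype=complex)
    term = np.eye(M.shape[0], dtype=complex)
    for k in range(1, order + 1):
        term = term @ M / k
        result = result + term
    return result

# Hypothetical toy model: drift and a single control Hamiltonian
H0 = np.array([[0.5, 0.0], [0.0, -0.5]], dtype=complex)   # drift
H1 = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)    # control (sigma_x)

T, N_T = 2.0, 40
dt = T / N_T
eps = 0.3 * np.ones(N_T)                  # piecewise-constant guess control

phi = np.array([1, 0], dtype=complex)     # initial state |0>
phi_tgt = np.array([0, 1], dtype=complex) # target state |1>

psi = phi.copy()
for n in range(N_T):                      # forward propagation, Psi(T) = U|phi>
    psi = expm(-1j * (H0 + eps[n] * H1) * dt) @ psi

tau = np.vdot(phi_tgt, psi)               # overlap with the target state
J_T = 1 - abs(tau) ** 2                   # square-modulus functional, N = 1
print(J_T)
```

The norm of `psi` is preserved by the (unitary) propagation, and the functional lies between 0 (target reached) and 1.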

GRAPE
The most direct gradient-based method for minimizing a functional as in Eq. (1) is gradient ascent pulse engineering (GRAPE) [12]. In its original form, the optimization is performed by calculating the gradient ∇J with respect to the piecewise-constant pulse values as the control parameters, i.e., the vector of values ∂J/∂ε_nl^{(i)} for all values n, l of the control field in iteration (i), and then updating the control field in the direction of the negative gradient, ε_nl^{(i+1)} = ε_nl^{(i)} − α (∇J)_{nl}, with some fixed step width (or "learning rate") α.
In practice, once the gradient ∇J has been calculated, it can be fed into a black-box gradient-based optimization package, e.g., the MATLAB or SciPy optimization toolboxes [75][76][77] or standalone packages such as Optim.jl [78]. For one, these packages will perform a line search to determine a suitable step width α in each iteration. Even more importantly, they can employ quasi-Newton methods for the optimization that use a Hessian (the matrix of second-order derivatives) estimated only from the gradient information of previous iterations. Using second-order information in this way dramatically speeds up convergence [13] and is strongly recommended.
We find that L-BFGS-B [79], written in Fortran [80] with wrappers for Python [77] and Julia [81], is a particularly robust quasi-Newton optimizer. It also includes the possibility of applying bounds to the control field (hence the suffix "-B"). Its particular line search method requires recalculating ∇J for each step, so it is numerically more expensive than some other line search methods that use only evaluations of J to determine α. Other implementations of L-BFGS [78] allow for a variety of line search algorithms [82] with more customization. However, we have not found any of these to reliably yield better convergence than the method built into L-BFGS-B, and indeed have found the lack of hyperparameters in L-BFGS-B to be a virtue.
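The black-box pattern described above can be sketched with SciPy's L-BFGS-B wrapper (a stand-in for the Julia packages used in the paper); the bounded quadratic "functional" below is an arbitrary placeholder for J({ε_nl}), chosen only so the sketch is self-contained.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder "functional": a quadratic in the control parameters,
# returning both the value J and the analytic gradient grad(J).
target = np.array([0.2, 0.8, -0.5])

def J_and_grad(eps):
    diff = eps - target
    return float(np.dot(diff, diff)), 2 * diff

eps_guess = np.zeros(3)
res = minimize(
    J_and_grad,
    eps_guess,
    jac=True,                     # J_and_grad returns (J, grad J)
    method="L-BFGS-B",
    bounds=[(-1.0, 1.0)] * 3,     # amplitude bounds on each control parameter
)
print(res.x)
```

The optimizer performs its own line search and builds its quasi-Newton Hessian estimate from the supplied gradients; only `J_and_grad` is problem-specific.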

Efficient Evaluation of Final-Time Gradients
Typically, implementations of GRAPE only consider specific final-time functionals, J ≡ J_T in Eq. (1), e.g., J_T = J_{T,sm} given by Eq. (2). We will discuss the efficient evaluation of the gradient for this special case here, and the more general case with non-zero running costs in Section 4.4. To evaluate the gradient, we must consider the derivative of the overlap τ_k = ⟨φ_k^tgt| Û_{N_T} … Û_1 |φ_k⟩ for each of the k objectives. The time evolution operators Û_n for the time intervals n = 1 … N_T are piecewise constant. We first consider

\[
\frac{\partial \tau_k}{\partial \epsilon_{nl}}
= \Big\langle \phi_k^{\text{tgt}} \Big|\, \hat U_{N_T} \cdots \hat U_{n+1}\,
  \frac{\partial \hat U_n}{\partial \epsilon_{nl}}\, \hat U_{n-1} \cdots \hat U_1 \Big| \phi_k \Big\rangle
= \Big\langle \chi_k(t_n) \Big|\, \frac{\partial \hat U_n}{\partial \epsilon_{nl}} \Big| \Psi_k(t_{n-1}) \Big\rangle
\tag{3}
\]

with |χ_k(t_n)⟩ = Û†_{n+1} … Û†_{N_T} |φ_k^tgt⟩, i.e., a backward propagation of the target state with the adjoint Hamiltonian or Liouvillian, and |Ψ_k(t_{n−1})⟩ = Û_{n−1} … Û_1 |φ_k⟩, i.e., a forward propagation of the initial state.
The derivative of the time evolution operator Û_n of a particular time step acting on an arbitrary state can be evaluated efficiently and to machine precision by defining a "gradient generator" for the n'th time step as a block matrix

\[
\hat G_n \equiv
\begin{pmatrix}
\hat H_n & & & \partial \hat H_n / \partial \epsilon_{n1} \\
& \ddots & & \vdots \\
& & \hat H_n & \partial \hat H_n / \partial \epsilon_{nL} \\
& & & \hat H_n
\end{pmatrix},
\tag{4}
\]

where the ∂Ĥ_n/∂ε_nl are the derivatives of the Hamiltonian with respect to the different controls numbered 1 through L, and Ĥ_n is the full Hamiltonian or Liouvillian for that time step. It can be shown that [83,84]

\[
e^{-i \hat G_n dt_n}
\begin{pmatrix} 0 \\ \vdots \\ 0 \\ |\Psi\rangle \end{pmatrix}
=
\begin{pmatrix}
(\partial \hat U_n / \partial \epsilon_{n1}) |\Psi\rangle \\
\vdots \\
(\partial \hat U_n / \partial \epsilon_{nL}) |\Psi\rangle \\
\hat U_n |\Psi\rangle
\end{pmatrix}.
\tag{5}
\]

That is, by propagating an extended vector |Ψ̃⟩ = [|ψ_1⟩, …, |ψ_L⟩, |Ψ⟩]^T under Ĝ_n, where the components |ψ_l⟩ for the L different controls are initialized to zero, we obtain both the forward propagation of an arbitrary state vector |Ψ⟩ and the gradient of the time evolution operator for that step with respect to every control field. Note that the block matrix defined in Eq. (4) does not need to be instantiated; it is sufficient to define it as an abstract operator such that

\[
\hat G_n |\tilde\Psi\rangle =
\begin{pmatrix}
\hat H_n |\psi_1\rangle + (\partial \hat H_n / \partial \epsilon_{n1}) |\Psi\rangle \\
\vdots \\
\hat H_n |\psi_L\rangle + (\partial \hat H_n / \partial \epsilon_{nL}) |\Psi\rangle \\
\hat H_n |\Psi\rangle
\end{pmatrix},
\tag{6}
\]

where |ψ_l⟩ is the l'th block of the extended |Ψ̃⟩ and |Ψ⟩ is the state vector on which |Ψ̃⟩ is based. Thus, the data structure encoding Ĝ_n can be essentially the same as the data structure encoding Ĥ_n. The propagation defined in Eq. (5) can be performed using any propagation method used to evaluate the "normal" propagation Û_n|Ψ⟩; that is, a generic ODE solver, or preferably a more efficient polynomial expansion of the time evolution operator Û_n = e^{−iĤ dt} for a uniform time step dt. For a Hermitian Ĥ, Chebychev polynomials are the fastest-converging polynomial expansion [85]. For the standard Schrödinger equation, propagation with Chebychev polynomials P_m(x) is very straightforward and efficient to implement. The time evolution of a state |Ψ⟩ by a single time step dt for the n'th interval of the time grid can be written as

\[
\hat U_n |\Psi\rangle = e^{-i \hat H_n dt} |\Psi\rangle
\approx \sum_{m=0}^{M-1} a_m P_m(-i \hat H_{n,\text{norm}}) |\Psi\rangle\,,
\tag{7}
\]

where Ĥ_{n,norm} is the Hamiltonian Ĥ_n normalized to a spectral range within [−1, 1], using the recursive definition of the Chebychev polynomials,

\[
P_0(x) = 1\,, \quad P_1(x) = x\,, \quad P_{m+1}(x) = 2x\, P_m(x) + P_{m-1}(x)\,,
\tag{8}
\]

where the plus sign in the recursion accounts for the imaginary argument x = −iĤ_{n,norm}.

[Fig. 1: Numerical schemes for (a) the GRAPE gradient evaluation and (b) Krotov's method; panel (b) involves backward propagation and storage with the guess controls, followed by forward propagation with the updated controls.]
The expansion coefficients a_m for a given spectral range and a uniform time step dt can be calculated analytically and are proportional to Bessel functions [62]. For a non-Hermitian Ĥ (e.g., a Liouvillian), an expansion into Newton polynomials is suitable [64,66,86]. Both Chebychev and Newton propagation have been implemented in Julia [87].
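As a concrete illustration of the Chebychev propagation described above, the following NumPy sketch propagates a state by one time step and compares the result to exact diagonalization. It is only a sketch: the spectral bounds are taken from `eigh` (production codes estimate them), the Bessel functions are evaluated by their power series, and the coefficients use the standard relation a_m = (2 − δ_m0) J_m(Δ·dt/2) with the global phase factored out.

```python
import math
import numpy as np

def bessel_j(m, x, terms=30):
    """Bessel function J_m(x) via its power series (fine for small x)."""
    return sum((-1) ** k * (x / 2) ** (m + 2 * k)
               / (math.factorial(k) * math.factorial(m + k))
               for k in range(terms))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
H = (A + A.conj().T) / 2              # random Hermitian "Hamiltonian"
dt = 0.2

E, V = np.linalg.eigh(H)
E_min, E_max = E[0], E[-1]            # spectral bounds (estimated in practice)
delta = E_max - E_min
H_norm = (2 * H - (E_max + E_min) * np.eye(4)) / delta  # spectrum in [-1, 1]

psi = np.zeros(4, dtype=complex)
psi[0] = 1.0

# Chebychev recursion applied to the state: phi_m = P_m(-i H_norm) |psi>
phi_prev = psi.copy()                 # P_0
phi = -1j * (H_norm @ phi_prev)       # P_1
result = bessel_j(0, delta * dt / 2) * phi_prev
result = result + 2 * bessel_j(1, delta * dt / 2) * phi
for m in range(2, 20):
    phi_prev, phi = phi, -2j * (H_norm @ phi) + phi_prev
    result = result + 2 * bessel_j(m, delta * dt / 2) * phi
result = np.exp(-1j * (E_max + E_min) * dt / 2) * result  # global phase

# Reference: exact propagation via diagonalization
psi_exact = V @ (np.exp(-1j * E * dt) * (V.conj().T @ psi))
print(np.allclose(result, psi_exact, atol=1e-12))
```

Because the coefficients decay faster than exponentially beyond m ≈ Δ·dt/2, a short recursion already reaches machine precision.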
In this context, it is worth noting that the eigenvalues of Ĝ_n in Eq. (4) are the same as the eigenvalues of the underlying Ĥ_n, with an additional (L + 1)-fold degeneracy. In particular, for a Hermitian Ĥ_n, the eigenvalues of Ĝ_n are real, despite Ĝ_n as a whole not being Hermitian. Thus, if e^{−iĤ_n dt} can be evaluated via a Chebychev expansion, so can e^{−iĜ_n dt}, with the same expansion coefficients.
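The block-matrix identity of Eq. (5) is easy to verify numerically. The sketch below instantiates the gradient generator explicitly for a single control (L = 1) and checks that propagating the extended vector [0, |Ψ⟩] yields both Û_n|Ψ⟩ and (∂Û_n/∂ε)|Ψ⟩, the latter compared against a central finite difference; the toy Hamiltonian is a placeholder.

```python
import numpy as np

def expm(M, order=60):
    """Taylor-series matrix exponential (adequate for small ||M||)."""
    result = np.eye(M.shape[0], dtype=complex)
    term = np.eye(M.shape[0], dtype=complex)
    for k in range(1, order + 1):
        term = term @ M / k
        result = result + term
    return result

H0 = np.array([[1.0, 0.2], [0.2, -1.0]], dtype=complex)  # drift (toy)
H1 = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)   # dH/d(eps)
eps, dt = 0.4, 0.15
H = H0 + eps * H1

# Gradient generator, Eq. (4), for a single control: G = [[H, H1], [0, H]]
G = np.block([[H, H1], [np.zeros((2, 2)), H]])

psi = np.array([1, 0], dtype=complex)
extended = np.concatenate([np.zeros(2, dtype=complex), psi])
out = expm(-1j * G * dt) @ extended
grad_psi, U_psi = out[:2], out[2:]     # Eq. (5): top block is (dU/deps)|psi>

# Finite-difference check of (dU/deps)|psi>
h = 1e-6
U_plus = expm(-1j * (H0 + (eps + h) * H1) * dt)
U_minus = expm(-1j * (H0 + (eps - h) * H1) * dt)
fd = (U_plus - U_minus) @ psi / (2 * h)
print(np.allclose(grad_psi, fd, atol=1e-8))
```

In practice, as noted above, the block matrix is never instantiated; this sketch does so only to make the identity transparent.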
The complete numerical scheme for evaluating ∇τ_k, with components (∇τ_k)_{nl}, is shown in Fig. 1 (a). It starts at the bottom left with the initial state |φ_k⟩. This state is forward-propagated using the values ε_nl^{(i−1)} for the l'th control and the n'th time step, where the superscript (i − 1) indicates the guess for the current iteration (i). The propagation continues to the final state |Ψ_k(T)⟩. All of these propagated states must be stored in memory. After the forward propagation ends, we initialize an extended state |χ̃_k(t = t_{N_T} = T)⟩ = [0, …, 0, |φ_k^tgt⟩]^T, consisting of a zero block for each of the L controls ε_l(t) and the target state for the objective (k) in the bottom block. This extended state is backward-propagated as |χ̃_k(t_{n−1})⟩ = e^{−i Ĝ†_n dt_n} |χ̃_k(t_n)⟩ with a negative time step dt_n, where Ĝ†_n denotes the gradient generator of Eq. (4) constructed from Ĥ†_n and (∂Ĥ_n/∂ε_nl)†; for a Hermitian Ĥ_n, it coincides with Ĝ_n. In the full block form of the extended state and generator, each propagation step is defined as

\[
\begin{pmatrix}
|\chi_{k1}(t_{n-1})\rangle \\ \vdots \\ |\chi_{kL}(t_{n-1})\rangle \\ |\chi_k(t_{n-1})\rangle
\end{pmatrix}
= e^{-i \hat G^{\dagger}_n dt_n}
\begin{pmatrix}
0 \\ \vdots \\ 0 \\ |\chi_k(t_n)\rangle
\end{pmatrix},
\tag{9}
\]

which is the backward version of the forward step defined in Eqs. (4)(5)(6). After each step in the backward propagation, we calculate the nl'th component of the gradient of τ_k with respect to the control values,

\[
(\nabla \tau_k)_{nl} = \frac{\partial \tau_k}{\partial \epsilon_{nl}}
= \langle \chi_{kl}(t_{n-1}) | \Psi_k(t_{n-1}) \rangle\,,
\tag{10}
\]

where |χ_kl(t_{n−1})⟩ is the state from the l'th block of |χ̃_k(t_{n−1})⟩ and |Ψ_k(t_{n−1})⟩ is read from the stored forward-propagated states. After calculating this overlap, the first L blocks of |χ̃_k(t_{n−1})⟩ must be zeroed out before performing the next step in the backward propagation, so as to have the correct state on the right-hand side of Eq. (9). The scheme depicted in Fig. 1 (a) is inherently parallel in the different objectives (index k). This is in fact one reason why we have framed the optimization of a quantum gate in terms of multiple objectives, one for each of the logical basis states. The approach is in contrast to the approach taken by most existing GRAPE implementations [23,24,37] that treat the quantum gate Û as the dynamic object and may even use explicit matrix exponentiation to calculate Û = e^{−iĤ dt}, limiting the implementation to small Hilbert space dimensions. In addition to numerical parallelizability and efficiency, formulating the control problem in terms of multiple simultaneous objectives also allows the method to extend to gate optimization in open quantum systems [88] or to ensemble optimization for robust quantum gates [89].
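The full forward-backward scheme can be condensed into a short NumPy sketch for a single objective and a single control (a toy stand-in for the Julia implementation; for a Hermitian Hamiltonian the backward generator has the same block structure as Eq. (4)). The resulting gradient of the overlap is checked against central finite differences.

```python
import numpy as np

def expm(M, order=60):
    """Taylor-series matrix exponential (fine for the small norms here)."""
    result = np.eye(M.shape[0], dtype=complex)
    term = np.eye(M.shape[0], dtype=complex)
    for k in range(1, order + 1):
        term = term @ M / k
        result = result + term
    return result

H0 = np.array([[0.5, 0.1], [0.1, -0.5]], dtype=complex)  # drift (toy)
H1 = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)   # control Hamiltonian
N_T, dt = 10, 0.2
eps = 0.3 * np.cos(np.linspace(0, 2, N_T))               # guess controls
phi = np.array([1, 0], dtype=complex)
phi_tgt = np.array([0, 1], dtype=complex)

def tau_of(eps):
    psi = phi.copy()
    for n in range(N_T):
        psi = expm(-1j * (H0 + eps[n] * H1) * dt) @ psi
    return np.vdot(phi_tgt, psi)

# (1) forward propagation, storing all states
states = [phi]
for n in range(N_T):
    states.append(expm(-1j * (H0 + eps[n] * H1) * dt) @ states[-1])

# (2) backward propagation of the extended state [0, chi], reading off
# the gradient component of Eq. (10) after every step
chi = phi_tgt.copy()
grad_tau = np.zeros(N_T, dtype=complex)
for n in range(N_T, 0, -1):
    Hn = H0 + eps[n - 1] * H1
    G = np.block([[Hn, H1], [np.zeros((2, 2)), Hn]])
    out = expm(1j * G * dt) @ np.concatenate([np.zeros(2, dtype=complex), chi])
    grad_tau[n - 1] = np.vdot(out[:2], states[n - 1])    # Eq. (10)
    chi = out[2:]                                        # re-zero the top block

# check against central finite differences of the overlap tau
h = 1e-6
fd = np.zeros(N_T, dtype=complex)
for n in range(N_T):
    ep, em = eps.copy(), eps.copy()
    ep[n] += h
    em[n] -= h
    fd[n] = (tau_of(ep) - tau_of(em)) / (2 * h)
print(np.allclose(grad_tau, fd, atol=1e-7))
```

The cost is one forward and one extended backward propagation, independent of the number of time steps entering the gradient.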
In principle, we could reverse the order of the backward and forward propagation: instead of the procedure shown in Fig. 1 (a), we could first backward-propagate |χ_k(T)⟩ = |φ_k^tgt⟩ and store the resulting |χ_k(t)⟩, and then forward-propagate an extended state |Ψ̃_k(t)⟩ to evaluate (∇τ_k)_{nl}, using Eqs. (4)(5)(6). This corresponds to ∂Û_n/∂ε_nl in Eq. (3) acting to the right instead of to the left. However, as we show in Section 4.1, doing the forward propagation first is necessary when combining GRAPE with automatic differentiation.
Having evaluated ∇τ_k, the gradient of the functional J_{T,sm} in Eq. (2) follows from the chain rule,

\[
\frac{\partial J_{T,\text{sm}}}{\partial \epsilon_{nl}}
= -\frac{2}{N^2}\, \mathrm{Re} \Big[ \Big( \sum_{k'} \tau_{k'} \Big)^{*} \sum_k \frac{\partial \tau_k}{\partial \epsilon_{nl}} \Big]\,.
\tag{11}
\]

For other functionals [19], we would similarly have to apply the chain rule to complete the GRAPE scheme.

Krotov's method
As an alternative to GRAPE, Krotov's method [14][15][16] takes a constructive approach based on time-continuous control fields, J({ε_l(t)}) on the left-hand side of Eq. (1). For a specifically chosen running cost,

\[
g_a\big(\epsilon_l(t)\big) = \frac{\lambda_a}{S_l(t)} \big( \epsilon_l(t) - \epsilon_l^{\text{ref}}(t) \big)^2\,,
\]

with an arbitrary "update shape" S_l(t), a scaling factor λ_a, and a reference field ε_l^ref(t) that is typically chosen in each iteration (i + 1) as the guess field ε_l^{(i)}(t) for that iteration, the derivation of Krotov's method [17][18][19][20] considers the necessary and sufficient conditions for the functional derivative ∂J/∂ε_l(t) to ensure monotonic convergence, J^{(i+1)} ≤ J^{(i)}, and finds a first-order [20] update equation for Δε_l^{(i+1)}(t),

\[
\Delta \epsilon_l^{(i+1)}(t)
= \frac{S_l(t)}{\lambda_a}\, \mathrm{Im} \sum_k
  \Big\langle \chi_k^{(i)}(t) \Big| \frac{\partial \hat H}{\partial \epsilon_l(t)} \Big| \Psi_k^{(i+1)}(t) \Big\rangle\,,
\tag{12}
\]

where |Ψ_k^{(i+1)}(t)⟩ is the initial state |φ_k⟩ forward-propagated with the updated pulse for the current iteration, and |χ_k^{(i)}(t)⟩ is a co-state backward-propagated under the guess pulse of the iteration,

\[
i \frac{\partial}{\partial t} |\chi_k^{(i)}(t)\rangle
= \hat H^{\dagger}\big(\{\epsilon_l^{(i)}(t)\}\big)\, |\chi_k^{(i)}(t)\rangle
\qquad (\text{for } g_b \equiv 0)\,,
\tag{13}
\]

with the boundary condition

\[
|\chi_k^{(i)}(T)\rangle = -\frac{\partial J_T}{\partial \langle \Psi_k(T) |}\bigg|_{(i)}\,.
\tag{14}
\]

The discretization to a time grid is done only after formulating the time-continuous update equation and results in the scheme shown in Fig. 1 (b). For each objective indexed by k, we start at the top right with |χ_k(T)⟩ defined by Eq. (14). This state is backward-propagated as |χ_k(t_{n−1})⟩ = e^{−iĤ†_n dt_n} |χ_k(t_n)⟩ (dt_n < 0) if there are no state-dependent running costs (g_b ≡ 0), or using an inhomogeneous propagator otherwise [90]. All backward-propagated states must be stored in memory. For each control field indexed by l, the control update for the first time step is calculated according to Eq. (12) for t = 0, i.e., using the result of the backward propagation and the initial state |φ_k⟩ for each objective. The updated controls for the first time step then allow us to obtain |Ψ_k(t_1)⟩, which together with |χ_k(t_1)⟩ from the stored backward-propagated states allows us to calculate the update for the second time step. In this sense, the scheme is sequential: the update in every time step depends directly on the state forward-propagated under the updated pulse from the previous time step. In contrast, the update in GRAPE is concurrent: it depends only on states propagated under the guess controls ε_nl^{(i−1)}.
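A single Krotov iteration for a toy state-to-state problem can be sketched as follows (a NumPy stand-in for the Julia packages; S(t) ≡ 1, the value of λ_a, and the two-level model are arbitrary choices). Note the sequential structure: each update Δε_n uses the state already propagated under the updated controls.

```python
import numpy as np

def expm(M, order=60):
    """Taylor-series matrix exponential (fine for the small norms here)."""
    result = np.eye(M.shape[0], dtype=complex)
    term = np.eye(M.shape[0], dtype=complex)
    for k in range(1, order + 1):
        term = term @ M / k
        result = result + term
    return result

H0 = np.array([[0.5, 0.0], [0.0, -0.5]], dtype=complex)  # drift (toy)
H1 = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)   # dH/d(eps)
N_T, dt = 40, 0.05
lam_a = 5.0                        # Krotov scaling factor, with S(t) = 1
phi = np.array([1, 0], dtype=complex)
phi_tgt = np.array([0, 1], dtype=complex)

def J_T(eps):
    psi = phi.copy()
    for n in range(N_T):
        psi = expm(-1j * (H0 + eps[n] * H1) * dt) @ psi
    return 1 - abs(np.vdot(phi_tgt, psi)) ** 2

eps_guess = 0.3 * np.ones(N_T)

# forward propagation with the guess, to evaluate the boundary condition
psi = phi.copy()
for n in range(N_T):
    psi = expm(-1j * (H0 + eps_guess[n] * H1) * dt) @ psi
tau = np.vdot(phi_tgt, psi)
chi = tau * phi_tgt                # Eq. (14) for J_T = 1 - |tau|^2

# backward propagation and storage of the co-states (guess controls)
chis = [None] * (N_T + 1)
chis[N_T] = chi
for n in range(N_T, 0, -1):
    chis[n - 1] = expm(1j * (H0 + eps_guess[n - 1] * H1) * dt) @ chis[n]

# sequential forward sweep with the update of Eq. (12)
psi = phi.copy()
eps_new = np.empty(N_T)
for n in range(N_T):
    delta = (1.0 / lam_a) * np.imag(np.vdot(chis[n], H1 @ psi))
    eps_new[n] = eps_guess[n] + delta
    psi = expm(-1j * (H0 + eps_new[n] * H1) * dt) @ psi

print(J_T(eps_new) < J_T(eps_guess))   # monotonic improvement
```

Choosing λ_a larger makes the update smaller and the monotonic decrease of J more robust, at the cost of slower convergence.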
Like the GRAPE scheme in panel (a), Krotov's method is inherently parallel in k, with the caveat that the parallelization must be synchronized after each time step in the forward propagation, so as to evaluate the sum over k in Eq. (12).

Automatic Differentiation
Existing implementations of GRAPE [23][24][25][26][48] typically hard-code the optimization functional to something equivalent to Eq. (2). In order to extend to other functionals, at best, a user has to manually supply a routine that calculates the gradient, for example, with a chain rule such as Eq. (11). In the worst case, if the gradient is not easily expressed in terms of the overlaps τ_k, the numerical scheme in Fig. 1 (a) would have to be adapted to that particular functional.
Even more problematic are figures of merit that are highly relevant to quantum engineering but do not have an analytic gradient. Typical examples are entanglement measures [31] or the quantum Fisher information [28], which is directly connected to the metrological gain in quantum sensing applications [29,30]. Evaluating these measures, in general, involves eigenvalue decompositions, which do not lend themselves to an analytic calculation of the gradient. Even if equivalent analytic expressions, e.g., for the gate concurrence, can be found [34,35], their analytic gradients are extremely tedious to calculate and implement [91].
In situations where the analytic calculation of a gradient is impossible or impractical, the numerical evaluation of gradients can be an alternative, in particular through automatic differentiation (AD) [36-40, 43, 44]. The core of AD is the realization that any numerical computation, even one that is seemingly non-analytic like an eigen-decomposition, can ultimately be expressed in elemental numerical steps: sums and products of floating-point numbers, if taken to the extreme of machine instructions. These elemental steps have a known derivative, and thus the gradient of any function can be evaluated by applying the chain rule ad nauseam. Doing this requires that the computer keep track of all intermediate values in the computation. In our case, the function of interest is the optimization functional J, with input values {ε_n} (we temporarily drop the index l numbering different controls here for simplicity). Consider, as a simple example borrowed from Ref. [37], a functional J(ε_1, ε_2) evaluated through a sequence of elemental steps: we introduce intermediary values v_1 = ε_1 and v_2 = ε_2, obtain each subsequent v_j as an elemental function of one or two previous values, and thus finally J = v_6. The full chain rule for the gradient of J then accumulates the products of the partial derivatives ∂v_j/∂v_i along every path connecting the inputs to J = v_6. There are two ways to write the chain rule recursively, allowing it to be evaluated consistently and numerically for arbitrarily complicated functionals. These correspond to the forward and reverse modes of automatic differentiation. For the forward mode, we define the "tangent" of an intermediary value as v̇_j ≡ ∂v_j/∂ε_n, where we have picked a particular ε_n for which we want to evaluate the derivative. We then find

\[
\dot v_j = \sum_i \frac{\partial v_j}{\partial v_i}\, \dot v_i\,,
\tag{18}
\]

where the sum is over all v_i on which v_j depends explicitly. We can evaluate these tangents along with the evaluation of the intermediary values themselves. For the derivative with respect to ε_1, in our example, we would seed v̇_1 = 1 and v̇_2 = 0, from which all other tangents follow, up to ∂J/∂ε_1 = v̇_6. To calculate the full gradient, the entire calculation must be evaluated once for each ε_n. Thus, calculating the gradient in forward mode is efficient only if the number of independent (input) variables is small.
On the other hand, for a single independent variable, the runtime and memory requirements of evaluating the derivative are proportional to those of evaluating J itself. In particular, a tangent v̇_j does not have to be kept in memory longer than the value v_j itself. This makes forward-mode automatic differentiation easy to implement, e.g., by using operator overloading and dual numbers [43].
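Forward-mode AD via operator overloading can be sketched in a few lines with a dual-number class (a generic illustration, not tied to the paper's packages): each value carries its tangent, and seeding ε̇_1 = 1 yields ∂J/∂ε_1 alongside J itself. The example functional is an arbitrary stand-in.

```python
import math

class Dual:
    """A dual number (value, tangent) for forward-mode AD."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    """Elemental sine with its known derivative rule."""
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# An arbitrary example functional: J(eps1, eps2) = eps1 * eps2 + sin(eps1)
def J(eps1, eps2):
    return eps1 * eps2 + sin(eps1)

e1, e2 = 1.2, 0.7
out = J(Dual(e1, 1.0), Dual(e2, 0.0))   # seed the tangent of eps1
# out.val is J; out.dot is dJ/d(eps1) = eps2 + cos(eps1)
print(out.val, out.dot)
```

Obtaining ∂J/∂ε_2 requires a second evaluation with the seed on ε_2, illustrating the cost of one pass per input parameter.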
For the reverse mode, we define the "adjoint" of an intermediary value as v̄_j ≡ ∂J/∂v_j. The name "adjoint" in this context is unrelated to the Hermitian conjugate in quantum mechanics, denoted by a dagger. The definition of the AD adjoint results in a recursive relationship

\[
\bar v_j = \sum_i \bar v_i\, \frac{\partial v_i}{\partial v_j}\,,
\tag{19}
\]

where the sum is over all v_i which depend on v_j. Note that this is the reverse of Eq. (18). Consequently, the adjoints are evaluated backward, starting from v̄_6 = ∂J/∂v_6 = 1; in our example, we then further find v̄_5, v̄_4, and so forth, down to the inputs. The final adjoints now contain the derivatives for all of the input parameters, v̄_1 = ∂J/∂ε_1 and v̄_2 = ∂J/∂ε_2. This is the primary benefit of reverse-mode AD: the full gradient can be evaluated at once, making it the preferred method for calculating the gradient of an optimization functional ℝ^{N_T L} → ℝ. However, we can only start the calculation of the adjoints once J itself has been evaluated. All intermediary values v_j must be stored together with information on which elemental function was used to obtain each value, as well as all adjoints v̄_j.
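The reverse mode can likewise be sketched with a minimal tape-based implementation (again a generic illustration, using the same arbitrary example functional as above): the forward pass records, for each intermediary value, its parents and the local partial derivatives, and the backward pass distributes adjoints from J down to the inputs.

```python
import math

tape = []   # for each variable j: list of (parent index i, d(v_j)/d(v_i))
vals = []   # intermediary values v_j

def new_var(val, parents=()):
    vals.append(val)
    tape.append(list(parents))
    return len(vals) - 1

def add(i, j):
    return new_var(vals[i] + vals[j], [(i, 1.0), (j, 1.0)])

def mul(i, j):
    return new_var(vals[i] * vals[j], [(i, vals[j]), (j, vals[i])])

def sin(i):
    return new_var(math.sin(vals[i]), [(i, math.cos(vals[i]))])

# J(eps1, eps2) = eps1 * eps2 + sin(eps1), recorded on the tape
e1 = new_var(1.2)
e2 = new_var(0.7)
out = add(mul(e1, e2), sin(e1))

# backward pass: accumulate adjoints from J = v_out down the tape
adj = [0.0] * len(vals)
adj[out] = 1.0
for j in reversed(range(len(vals))):
    for i, dvj_dvi in tape[j]:
        adj[i] += adj[j] * dvj_dvi

# adj[e1] = eps2 + cos(eps1) and adj[e2] = eps1: the full gradient at once
print(adj[e1], adj[e2])
```

Note that `vals` and `tape` must persist until the backward pass completes, which is precisely the memory overhead discussed below.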
In practice, this can be implemented in several ways. Early versions of TensorFlow [53] required that the entire calculation be set up as a computational graph; see Fig. 1 in Ref. [37]. A forward pass through the graph calculates the functional and stores intermediary values, while a backward pass distributes adjoint information to the parents of each node in the graph. Alternatively, a tabular representation of the graph may be constructed during the forward pass, in what is called a Wengert tape. The backward pass then adds adjoint information to the tape. Lastly, it is possible to transform the source code of a function that evaluates the optimization functional into a new function that first performs the original evaluation and then inverts the computational steps, splicing in code to calculate the adjoints. These implementation details can have a significant impact on performance and flexibility; see Ref. [92] for a discussion of the particular tradeoffs. However, they do not change the fundamental memory requirements associated with reverse-mode AD.
There is considerable flexibility in what is considered an "elemental" function for which the adjoint can be defined. Most importantly, linear algebra operations do not have to be handled at the level of scalar operations, alleviating to some extent the overhead associated with AD. Using the rules of matrix calculus [93,94], AD adjoints for many operations can be defined [95]. This even includes eigen- or SVD decompositions [96].
The more high-level the elemental functions are, the lower the numerical overhead of AD. However, this comes at the cost of having to define more and more analytic adjoints. This is why many AD frameworks have been slow to adopt operations that are outside the narrow scope of machine learning, which only requires real-valued dense matrix-vector operations. In contrast, quantum dynamics is inherently described by complex-valued state vectors, and operators are usually sparse. Defining AD adjoints for complex linear algebra operations is possible [97], but has only recently seen adoption. Another practical issue is that, naively, the intermediary values (or vectors) v_i are immutable. Thus, most AD frameworks (including Zygote [60,61], which we have used here) do not support in-place linear algebra (BLAS [98]) operations, which can greatly speed up the simulation of quantum dynamics.

Semi-Automatic Differentiation
We now develop a method to eliminate the two shortcomings of reverse-mode automatic differentiation: the excessive memory overhead associated with having to store the full computational graph, respectively a Wengert tape, and the limited support in AD frameworks for the linear algebra operations relevant to simulating the dynamics of a quantum system. We do this by applying an analytic chain rule to the calculation of the gradient. To this end, we introduce intermediary variables z_j and rewrite the functional in terms of these intermediaries,

\[
J(\{\epsilon_{nl}\}) = J\big(\{ z_j(\{\epsilon_{nl}\}) \}\big)\,.
\tag{20}
\]

The values z_j may be complex, which requires some care when writing out the chain rule.
In principle, one must separate each z_j into its real and imaginary parts as independent variables, writing the chain rule as

\[
\frac{\partial J}{\partial \epsilon_{nl}}
= \sum_j \left( \frac{\partial J}{\partial\, \mathrm{Re}(z_j)} \frac{\partial\, \mathrm{Re}(z_j)}{\partial \epsilon_{nl}}
+ \frac{\partial J}{\partial\, \mathrm{Im}(z_j)} \frac{\partial\, \mathrm{Im}(z_j)}{\partial \epsilon_{nl}} \right).
\]

An elegant alternative is to introduce Wirtinger derivatives, which instead treat z_j and the conjugate value z*_j as independent variables,

\[
\frac{\partial J}{\partial z_j} \equiv \frac{1}{2} \left( \frac{\partial J}{\partial\, \mathrm{Re}(z_j)} - i \frac{\partial J}{\partial\, \mathrm{Im}(z_j)} \right),
\qquad
\frac{\partial J}{\partial z_j^*} \equiv \frac{1}{2} \left( \frac{\partial J}{\partial\, \mathrm{Re}(z_j)} + i \frac{\partial J}{\partial\, \mathrm{Im}(z_j)} \right).
\tag{21}
\]

The derivative of the complex value z_j with respect to the real value ε_nl is defined straightforwardly as

\[
\frac{\partial z_j}{\partial \epsilon_{nl}} \equiv \frac{\partial\, \mathrm{Re}(z_j)}{\partial \epsilon_{nl}} + i \frac{\partial\, \mathrm{Im}(z_j)}{\partial \epsilon_{nl}}\,,
\tag{22}
\]

so that, for a real-valued J,

\[
\frac{\partial J}{\partial \epsilon_{nl}}
= 2\, \mathrm{Re} \sum_j \frac{\partial J}{\partial z_j} \frac{\partial z_j}{\partial \epsilon_{nl}}\,.
\tag{23}
\]

Our goal is to choose parameters z_j so that ∂J/∂z_j can be calculated with automatic differentiation with minimal numerical effort. That is, we would like the computational graph for J({z_j}) to be as small as possible. Additionally, if the number of parameters {z_j} can be kept small, forward-mode differentiation or even the use of finite differences may become feasible. For the second part of the chain rule, we require that ∂z_j/∂ε_nl can be calculated analytically (without the use of AD).
Software frameworks for automatic differentiation such as Zygote [61] and TensorFlow [53] may define a (mathematically questionable [94]) "gradient" of a real-valued function J with respect to a complex vector with elements z_j as

\[
(\nabla_z J)_j \equiv \frac{\partial J}{\partial\, \mathrm{Re}(z_j)} + i \frac{\partial J}{\partial\, \mathrm{Im}(z_j)}
= 2 \left( \frac{\partial J}{\partial z_j} \right)^{*}.
\tag{25}
\]

This differs from the Wirtinger derivatives by a factor of two. Thus,

\[
\frac{\partial J}{\partial z_j} = \frac{1}{2} \big( \nabla_z J \big)_j^{*}
\tag{26}
\]

in Eq. (23) when using, e.g., Zygote's gradient function for ∇_z J.
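The factor-of-two bookkeeping is easy to check numerically. In the sketch below (a generic illustration; the functions are arbitrary), z(ε) is a complex-valued function of a real parameter and J(z) a real-valued functional; the Wirtinger chain rule is compared against a finite difference of the composed function.

```python
import numpy as np

def z_of(eps):
    """An arbitrary complex-valued function of a real control parameter."""
    return np.exp(1j * eps) * (1.0 + eps)

def J_of(z):
    """An arbitrary real-valued functional of the complex value z."""
    return abs(z) ** 2 + z.real

eps = 0.37
z = z_of(eps)

# Wirtinger derivative of J = z z* + (z + z*)/2 with respect to z
dJ_dz = np.conj(z) + 0.5
# analytic derivative of z with respect to the real parameter eps
dz_deps = 1j * z_of(eps) + np.exp(1j * eps)
# chain rule for a real-valued J of a complex intermediary
grad = 2 * np.real(dJ_dz * dz_deps)

# finite-difference check of dJ/d(eps)
h = 1e-7
fd = (J_of(z_of(eps + h)) - J_of(z_of(eps - h))) / (2 * h)
print(abs(grad - fd) < 1e-6)
```

If `dJ_dz` were instead obtained from a framework's complex gradient, it would have to be halved (and conjugated) first, per the convention above.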

State functionals
As a starting point, we can take seriously the explicit dependency of J_T on {|Ψ_k(T)⟩} in Eq. (1) and write the gradient of J_T using the chain rule in the states. To do this, we must combine the Wirtinger derivative with the rules of matrix calculus [93,94], and write (∇J_T)_nl = 2 Re Σ_k (∂J_T/∂|Ψ_k(T)⟩)(∂|Ψ_k(T)⟩/∂ε_nl). With the definition in Eq. (21), this corresponds directly to the scalar chain rule (∇J_T)_nl = 2 Re Σ_k Σ_m (∂J_T/∂Ψ_km)(∂Ψ_km/∂ε_nl), where Ψ_km = ⟨m | Ψ_k(T)⟩ for any orthonormal basis {|m⟩} corresponds to the z_j in Eq. (20).
In Eq. (27), we may now recognize that the derivative of the scalar J_T with respect to a column vector |Ψ_k(T)⟩ results, according to the rules of matrix calculus, in a row vector that we may associate with a co-state ⟨χ_k|. Specifically, we may define ⟨χ_k(T)| = −∂J_T/∂|Ψ_k(T)⟩, with a minus sign that will be motivated in Section 4.5. Since |χ_k(T)⟩ does not depend on ε_nl, we may push it into the derivative and obtain (∇J_T)_nl = −2 Re Σ_k (∂/∂ε_nl) ⟨χ_k(T) | Ψ_k(T)⟩. We can now see that the term under the sum has the exact same form as Eq. (3), that is, the derivative of a complex overlap of two states, τ_k ≡ ⟨χ_k(T) | Ψ_k(T)⟩, which we know how to evaluate numerically via Eq. (10), or, equivalently, the scheme in Fig. 1 (a). The only difference is that in the original grape, we initialize the extended state |χ_k(T)⟩ for the backward propagation from the target state, whereas now we initialize it with Eq. (29). This also explains why we have chosen to perform the forward propagation first in Fig. 1 (a): while in the original grape, backward and forward propagation are interchangeable, we now need the result |Ψ_k(T)⟩ of the forward propagation in order to initialize the backward propagation. We can use automatic differentiation to evaluate Eq. (29) for arbitrary functionals. For example, with Zygote's gradient function to evaluate ∇_{Ψ_k} J_T analogously to Eq. (25), we have |χ_k(T)⟩ = −½ ∇_{Ψ_k} J_T, where the factor ½ accounts for the difference between the complex gradient and the correct Wirtinger derivative, cf. Eq. (26).
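The factor ½ and the minus sign can be illustrated with a small standalone example (Python, hypothetical toy functional; the AD gradient is emulated here by finite differences on real and imaginary parts, matching Zygote's convention for real functions of complex vectors). For J_T = 1 − |⟨φ|Ψ⟩|², the co-state comes out as |χ⟩ = ⟨φ|Ψ⟩ |φ⟩:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.normal(size=4) + 1j * rng.normal(size=4)
phi /= np.linalg.norm(phi)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)

def J(psi_vec):
    tau = np.vdot(phi, psi_vec)  # <phi|Psi>
    return 1.0 - abs(tau) ** 2

# emulate the AD "complex gradient" (d/dRe + i d/dIm) with finite differences
h = 1e-6
grad = np.zeros(4, dtype=complex)
for m in range(4):
    e = np.zeros(4); e[m] = 1.0
    dRe = (J(psi + h * e) - J(psi - h * e)) / (2 * h)
    dIm = (J(psi + 1j * h * e) - J(psi - 1j * h * e)) / (2 * h)
    grad[m] = dRe + 1j * dIm

chi_numeric = -0.5 * grad                    # |chi> = -1/2 * grad_Psi J_T
chi_analytic = np.vdot(phi, psi) * phi       # = tau |phi> for this functional
assert np.allclose(chi_numeric, chi_analytic, atol=1e-6)
```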

Overlap functionals
When formulating the gradient for the square-modulus functional in Eq. (11), we already used the overlap of the forward-propagated state for the k'th objective with the target state for that objective, τ_k ≡ ⟨φ_k^tgt | Ψ_k(T)⟩. Many of the most common functionals in quantum control are expressed in terms of overlaps of propagated states with target states [19]. We can exploit this to further analytically simplify the calculation of |χ_k(T)⟩ in Eq. (29) via automatic differentiation. We find |χ_k(T)⟩ = −(∂J_T/∂τ_k*) |φ_k^tgt⟩ = −½ (∇_{τ_k} J_T) |φ_k^tgt⟩, where we have used that only the complex conjugate τ_k* = ⟨Ψ_k(T) | φ_k^tgt⟩ of the overlap depends explicitly on the bra ⟨Ψ_k(T)|. The gradient ∇_{τ_k} J_T, defined as in Eq. (26), can be obtained with automatic differentiation.
We note that for functionals that explicitly depend on overlaps [19], it is generally not difficult to evaluate the chain rule analytically. Thus, the use of automatic differentiation here is less a matter of necessity than of convenience. It allows us to implement a grape optimization package where the user can pass an arbitrary functional J T ({τ k }) without having to explicitly specify a gradient.
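A small numerical check (illustrative Python, not the paper's Julia implementation) that differentiating through the handful of scalars τ_k reproduces the co-states obtained by differentiating through the full state vectors, for the toy square-modulus functional J_T = 1 − |Σ_k τ_k|²/N²:

```python
import numpy as np

rng = np.random.default_rng(7)
N, dim = 4, 9
targets = [v / np.linalg.norm(v) for v in
           (rng.normal(size=(N, dim)) + 1j * rng.normal(size=(N, dim)))]
states = [v / np.linalg.norm(v) for v in
          (rng.normal(size=(N, dim)) + 1j * rng.normal(size=(N, dim)))]

tau = np.array([np.vdot(t, s) for t, s in zip(targets, states)])

# Wirtinger derivative dJ/dtau_k^* = -(sum_k tau_k)/N^2, hence
# |chi_k> = -(dJ/dtau_k^*) |phi_k^tgt> = (sum(tau)/N^2) |phi_k^tgt>
w = np.sum(tau) / N**2
chi_from_tau = [w * t for t in targets]

def J(all_states):
    taus = [np.vdot(t, s) for t, s in zip(targets, all_states)]
    return 1.0 - abs(sum(taus)) ** 2 / N**2

# reference: chi_k = -dJ/d<Psi_k| via finite differences on Re/Im parts
h = 1e-6
for k in range(N):
    chi_ref = np.zeros(dim, dtype=complex)
    for m in range(dim):
        e = np.zeros(dim); e[m] = 1.0
        bump = lambda d: [s + (d if i == k else 0) for i, s in enumerate(states)]
        dRe = (J(bump(h * e)) - J(bump(-h * e))) / (2 * h)
        dIm = (J(bump(1j * h * e)) - J(bump(-1j * h * e))) / (2 * h)
        chi_ref[m] = -0.5 * (dRe + 1j * dIm)
    assert np.allclose(chi_ref, chi_from_tau[k], atol=1e-6)
```

Differentiating in the N scalars τ_k instead of the N·dim state components is what keeps the AD part of the computation trivially cheap.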

Gate functionals
For the optimization of quantum gates, such as the examples we will explore in Section 5, it is common to have a logical subspace embedded in a larger physical Hilbert space. The functional J_T in this case can often be written as a function of the achieved gate Û_L in the logical subspace.
In this context, Û_L is the projection of the full time evolution operator Û onto the logical subspace. Specifically, the entries of the matrix Û_L are (U_L)_ij = ⟨φ_i | Ψ_j(T)⟩, where {|φ_i⟩} are the basis states that span the logical subspace (assumed to be the initial states for the optimization objectives), and each |Ψ_j(T)⟩ is the result of forward-propagating |φ_j⟩. We may then calculate the gradient of J_T as in Eqs. (29,30), with a further analytic chain rule, just as in Section 4.2 but with the elements (U_L)_ij instead of the complex overlaps τ_k: |χ_k(T)⟩ = −Σ_ij (∂J_T/∂(U_L)_ij*) (∂(U_L)_ij*/∂⟨Ψ_k(T)|), again using the notation of the Wirtinger derivative. We have used that only (U_L)_ij* depends explicitly on the co-states {⟨Ψ_k(T)|}. Furthermore, according to the definitions in Eqs. (25,22), ∂(U_L)_ij*/∂⟨Ψ_k(T)| = δ_jk |φ_i⟩, with the Kronecker delta δ_jk. Thus, Eq. (34) simplifies to |χ_k(T)⟩ = −½ Σ_i (∇_{U_L} J_T)_ik |φ_i⟩, where ∇_{U_L} J_T is evaluated via automatic differentiation, e.g., with Zygote's gradient function. While the simplified Eqs. (32,37) are not fundamentally different from constructing the boundary states {|χ_k⟩} directly with automatic differentiation, they can help to circumvent numerical instabilities that an AD framework may have for complicated functionals. Also, for very large Hilbert space dimensions, they may eliminate some of the numerical overhead associated with automatic differentiation: while the direct construction of |χ_k⟩ requires differentiation in a number of variables proportional to the full Hilbert space dimension, using Eq. (32) reduces this to the dimension of the logical subspace.
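The dimension reduction can be made concrete with a short sketch (Python, with hypothetical numbers; a random unitary stands in for the full time evolution). Whatever the size of the physical Hilbert space, J_T(Û_L) only depends on 16 complex entries:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 25  # full (physical) Hilbert space dimension, hypothetical
# random unitary as a stand-in for the full time evolution operator
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim)))
logical = np.eye(dim)[:, :4]        # basis |phi_1>..|phi_4> of the logical subspace
psi_T = Q @ logical                 # forward-propagated states |Psi_j(T)>
U_L = logical.conj().T @ psi_T      # (U_L)_ij = <phi_i|Psi_j(T)>, a 4x4 matrix
assert U_L.shape == (4, 4)
# U_L is unitary only up to leakage from the logical subspace:
loss = 1 - np.trace(U_L.conj().T @ U_L).real / 4
assert 0 <= loss <= 1
```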

Running costs
So far, we have only discussed the evaluation of gradients for final-time functionals J_T. We now extend the discussion of semi-automatic differentiation to the running costs g_a,b in Eq. (1). Since we are considering piecewise constant pulses, the integral over the running cost turns into a sum over the time steps. That is, we rewrite Eq. (1) as J({ε_nl}) = J_T({|Ψ_k(T)⟩}) + Σ_{n=1}^{N_T} g_a({ε_nl}) dt_n + Σ_{n=0}^{N_T} g_b({|Ψ_k(t_n)⟩}, t_n) Δt_n. As in Fig. 1, we define |Ψ_k(t_n)⟩ = Û_n … Û_1 |φ_k⟩, t_0 = 0, t_{N_T} = T, and Û_n = exp[−i Ĥ_n dt_n] as the time evolution operator for the n'th time interval, dt_n = t_n − t_{n−1}. Similarly, Δt_n is the time step around the time grid point t_n, e.g., Δt_0 = dt_1, Δt_n = ½ (t_{n+1} − t_{n−1}) for 1 ≤ n < N_T, and Δt_{N_T} = dt_{N_T}. For uniform time grids, dt_n ≡ Δt_n ≡ dt.
Typically, running costs on the control fields are direct analytic expressions, e.g., g_a({ε_nl}) = λ_a ε²_nl to penalize large amplitudes, with a weight λ_a. Thus, they are easily included in the gradient, e.g., (∇g_a)_nl = 2 λ_a ε_nl. For convenience, this can also be done with automatic differentiation. This even extends to penalties on the first and second derivatives of the controls [37][38][39].
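Such control running costs are trivial to differentiate; a quick sketch (Python, toy values, with finite differences standing in for the equally valid AD evaluation):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3                       # penalty weight lambda_a, hypothetical
eps = rng.normal(size=20)       # piecewise-constant control values

g_a = lambda e: lam * np.sum(e ** 2)       # amplitude penalty
grad_analytic = 2 * lam * eps              # (grad g_a)_n = 2 lambda_a eps_n

h = 1e-6
grad_fd = np.array([(g_a(eps + h * u) - g_a(eps - h * u)) / (2 * h)
                    for u in np.eye(20)])
assert np.allclose(grad_analytic, grad_fd, atol=1e-6)
```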
More interesting is the case of state-dependent constraints. Typical examples [99] include trajectory optimizations, where the time evolution of each state |Ψ_k(t_n)⟩ should be close to some target evolution |Ψ_k^tgt(t_n)⟩ with a weight λ_b, or observable optimizations, where the expectation value of some observable D̂(t) is to be minimized. A special case of this is the minimization of the population in some forbidden subspace [100], where D̂(t_n) ≡ D̂ is a projector onto that subspace.
To obtain the full gradient of a functional with a state-dependent running cost, we apply the same procedure as in Section 4.1, cf. Eqs. (14,29). In the sum over n in Eq. (42b), we have used that |Ψ_k(t_n)⟩ depends on ε_n′l only for n ≥ n′. This implies that for the final time interval, n′ = N_T, there is only a single term, with the boundary state |χ_k(T)⟩ given by Eq. (45). Evaluating the gradient progressively backward in time for n′ = (N_T − 1) … 1, we then find a recursive relationship for the backward-propagated states |χ_k(t_n)⟩. Thus, there are no fundamental changes to the scheme in Fig. 1 (a) in the presence of state-dependent running costs. The states {|φ_k⟩} must be forward-propagated and stored, and then the extended states |χ_k(t_n)⟩ are propagated backward to produce the gradient. The only difference is that the boundary state |χ_k(T)⟩ is now constructed based on Eq. (45) instead of Eq. (29) (or the target states, in the original grape). Furthermore, the backward propagation uses the discrete inhomogeneous Eq. (47). The inhomogeneity is calculated using the forward-propagated states stored previously, with the derivative of g_b performed analytically or by automatic differentiation.
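To make the backward scheme with a discrete inhomogeneity concrete, here is a self-contained numerical sketch (Python, using our own sign and naming conventions rather than the paper's equations; the augmented matrix exponential stands in for the extended states of Section 2.3). For J = Σ_n w_n ⟨Ψ(t_n)|D̂|Ψ(t_n)⟩, the backward pass accumulates the inhomogeneity at every step, and the resulting gradient matches finite differences:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
dim, N, dt = 4, 6, 0.1
H0 = rng.normal(size=(dim, dim)); H0 = (H0 + H0.T) / 2   # drift Hamiltonian
V = rng.normal(size=(dim, dim)); V = (V + V.T) / 2       # control Hamiltonian
D = np.diag(rng.random(dim))                             # observable
w = np.full(N, dt)                                       # quadrature weights
psi0 = np.eye(dim)[:, 0].astype(complex)

def propagate(eps):
    psis, psi = [], psi0
    for n in range(N):
        psi = expm(-1j * dt * (H0 + eps[n] * V)) @ psi
        psis.append(psi)
    return psis

def J(eps):
    return sum(w[n] * np.vdot(p, D @ p).real for n, p in enumerate(propagate(eps)))

eps = rng.normal(size=N)
psis = propagate(eps)

# backward pass with inhomogeneity: lam_n = U_{n+1}^dag lam_{n+1} + w_n D psi_n
grad = np.zeros(N)
lam = np.zeros(dim, dtype=complex)
for n in reversed(range(N)):
    lam = lam + w[n] * (D @ psis[n])
    # dU_n/deps_n from the augmented exponential (Frechet derivative)
    M = np.block([[H0 + eps[n] * V, V], [np.zeros((dim, dim)), H0 + eps[n] * V]])
    dU = expm(-1j * dt * M)[:dim, dim:]
    psi_prev = psis[n - 1] if n > 0 else psi0
    grad[n] = 2 * np.vdot(lam, dU @ psi_prev).real
    lam = expm(-1j * dt * (H0 + eps[n] * V)).conj().T @ lam

h = 1e-6
grad_fd = np.array([(J(eps + h * u) - J(eps - h * u)) / (2 * h) for u in np.eye(N)])
assert np.allclose(grad, grad_fd, atol=1e-6)
```

Note that Û†_{n+1} here is an ordinary (adjoint) time evolution step, with the inhomogeneity simply added afterward, as stated in the text.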

Semi-AD for Krotov's method and time-continuous schemes
We have included a minus sign in the definition of |χ_k(T)⟩, Eq. (29), and likewise in Eq. (43), since this results in the exact same definition as the |χ_k(T)⟩ that is the boundary condition for the backward propagation in Krotov's method, Eq. (14). This allows us to make a strong connection between the most general case of semi-automatic differentiation for grape and for Krotov's method.
Evaluating Eq. (31) with a framework for automatic differentiation like Zygote or Tensorflow is all that is required to bring the concept of semi-automatic differentiation to Krotov's method. In this way, it becomes possible to use Krotov's method to optimize towards any computable functional. Conversely, for functionals where the derivative with respect to the states is known analytically, e.g., because they have already been explored using Krotov's method, that existing code can be shared between an implementation of grape and Krotov's method.
We may also observe that the generalizations of grape and Krotov's method are even more closely related in the time-continuous limit. For dt → 0, we may use the first-order Taylor expansion of e^{−iĤ_n dt} to find Eq. (48) for the gradient, with the boundary condition of Eq. (29) for |χ_k(T)⟩. With Δε(t) ∝ −∇J_T, this matches Krotov's update equation (12), up to the use of the concurrently propagated states |Ψ_k(t)⟩, i.e., states propagated under the guess controls instead of the updated controls. Similarly, if there are state-dependent constraints, g_b ≢ 0 in Eq. (1), the time-continuous limit of Eq. (46) for the backward propagation in the semi-AD grape method corresponds directly to the inhomogeneous Eq. (13) for the backward propagation in Krotov's method. However, the former does not require a genuine inhomogeneous propagator [90]: the operator Û†_{n+1} in Eq. (46) is a normal time evolution operator, with the inhomogeneity added afterward. Equation (48) may provide an alternative implementation of a generalized grape scheme in the limit of very small dt. The scheme would be nearly identical to Fig. 1 (b); instead of calculating the pulse update directly, it would calculate the gradient ∇J_T and feed it into the l-bfgs-b optimizer. Considering the time evolution operator only to first order in dt avoids having to construct and propagate the extended state discussed in Section 2.3 and may thus be slightly faster. Finally, Eq. (48) may also provide a motivation for the exploration of optimization schemes that mix and match sequential and concurrent updates [24,99].

Asymptotic memory usage and checkpointing
We can now analyze the asymptotic memory usage of the semi-automatic differentiation procedure in relation to the "full" use of automatic differentiation as in Refs. [36][37][38][39][40] in general terms. As discussed in Section 3, automatic differentiation in general requires the storage of any intermediate value in the evaluation of the functional, that is, the time propagation of the quantum states. Thus, the AD memory overhead depends very much on how exactly this propagation is implemented.
In many of the existing uses of AD for quantum control, the time propagation is achieved by exponentiating the Hamiltonian. The required storage if this exponentiation is performed in a single computational step has been analyzed in Ref. [101]. The intermediary values in this case are the time evolution operators, i.e., the exponentiated matrices, and the states resulting from the application of those time evolution operators, for every time step. For a Hilbert space dimension of N_H, the size of the matrices is N_H². The required storage for these matrices is thus proportional to N_T N_H², where N_T is the number of time steps. Asymptotically, this dominates over the storage required for the states, which is proportional to N_T N_H.
While matrix exponentiation in a single step minimizes the number of intermediary values in the computation graph and thus the AD memory overhead, the computational complexity of algorithms for matrix exponentiation scales polynomially in matrix-matrix multiplications, which themselves scale quite unfavorably as N_H³. In contrast, expanding the time evolution operator in a polynomial series and applying it directly to the states, e.g., using the Chebychev propagation method outlined in Section 2.3, reduces the computational complexity to something polynomial in matrix-vector products, which scale as N_H². However, in the context of AD, the polynomial expansion also increases the number of intermediary states. Specifically, for a polynomial of order M, there are M intermediary matrices and M intermediary states, and thus the total asymptotic memory overhead increases by a factor of M. Typically, the M required for convergence of the polynomial is between 10 and 100, depending on the spectral radius of the Hamiltonian and the size of the time step.
A well-established technique to reduce the excessive memory overhead of automatic differentiation by trading it for an increase in runtime is the use of checkpointing [44]. The central idea of checkpointing is to store only periodic snapshots of the intermediate variables in the computation graph, and then to recalculate the forward pass from the last available checkpoint when calculating the gradient via automatic differentiation. This idea has been applied to the use of automatic differentiation in GRAPE [101]. A snapshot is taken every C = √N_T time steps in the propagation. The asymptotic memory usage in this case reduces from N_T N_H² to (N_C + C) N_H² = 2 √N_T N_H², where N_C = N_T/C = √N_T is the number of checkpoints. That is, checkpointing achieves a quadratic reduction with respect to the number of time steps. With a polynomial propagator, we again have an increase by a factor of M within each checkpointed segment of size C, thus resulting in a memory scaling of (N_C + M C) N_H², again a quadratic reduction. The fundamental idea of checkpointing can also be applied in the semi-AD scheme in Fig. 1 (a). By default, we store all the forward-propagated states marked in red. Instead, we may store only every C'th state. During the backward propagation of |χ_k(T)⟩, we then repeat the forward propagation from the last available checkpoint to recover the states |Ψ_k(t_n)⟩ required for the overlaps that determine (∇τ_k)_nl. The required memory for storage is thus reduced from N_T to N_C + C = 2√N_T states. However, we need an additional N_T − N_C = N_T − √N_T propagation steps. Lastly, we can consider the special case of unitary dynamics. In this case, no storage at all is required in the semi-AD scheme: after the initial forward propagation in Fig. 1 (a), the resulting |Ψ_k(T)⟩ can simply be backward-propagated in parallel with |χ_k(T)⟩. Thus, we trade the need for storage for a full additional time propagation.
This trick is not applicable in open quantum systems, where the dynamics are not unitary, and a backward propagation of |Ψ_k(T)⟩ thus does not recover the states |Ψ_k(t_n)⟩ from the forward propagation. For unitary dynamics, it can be exploited in the same way to reduce the storage in a full-AD implementation of GRAPE [101].
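The storage arithmetic of the checkpointed scheme can be sketched in a few lines (Python, hypothetical N_T):

```python
import math

N_T = 10_000                       # number of time steps, hypothetical
C = int(math.isqrt(N_T))           # checkpoint interval C = sqrt(N_T)
N_C = N_T // C                     # number of checkpoints

# states held in memory: checkpoints plus one re-propagated segment
assert N_C + C == 2 * math.isqrt(N_T) == 200
# extra propagation steps paid for the quadratic memory reduction
assert N_T - N_C == N_T - math.isqrt(N_T) == 9_900
```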
The asymptotic memory requirement for both full-AD and semi-AD for different propagators, with and without checkpointing, is summarized in Table 1.

Table 1: Asymptotic memory usage for full automatic differentiation (Full-AD) and semi-automatic differentiation (Semi-AD). The table shows how the number of matrices and vectors that must be stored in memory scales with the size of the Hilbert space N_H and the number of time steps N_T. For full-AD, we compare propagation via matrix exponentiation ("exp.") and propagation via a polynomial method ("polyn."), such as the Chebychev polynomials detailed in Section 2.3. In this case, M is the order of the polynomial required for convergence, equal to the number of matrix-vector products in a propagation step. Also included is the number of matrices/states that need to be stored when checkpointing is used. C denotes the number of time steps between checkpoints, and N_C = N_T/C is the number of checkpoints. The bottom row shows the total asymptotic scaling of the required memory.
For semi-AD, the memory requirement is determined only by the storage of states, and is thus linear both in the Hilbert space dimension and in the number of time steps. The remaining use of automatic differentiation, the construction of the boundary co-states, involves only a constant number of states; at best, for ∇_{τ_k} J_T, a small number of scalars. In practice, the memory and runtime characteristics of full-AD may be less predictable. Many AD frameworks have a large constant overhead and large prefactors for the asymptotic scaling in Table 1. On the other hand, the size of the computational graph, and thus the amount of memory required for AD, depends heavily on the exact implementation details of the time propagation. Therefore, we will explore the scaling of memory and runtime empirically for a specific implementation and a realistic example in Section 5.
As discussed in Section 3, there is a certain freedom in defining which operations constitute an "elementary operation" with a known pre-defined adjoint. For example, at least in principle, it would be possible to define an entire propagation step via Chebychev expansion as a single node in the computational graph. This would eliminate the scaling with the number of coefficients M, but would require implementing a custom adjoint for that propagation step, which is not trivial. Custom adjoints would have to be implemented by hand for every propagation method.
In general, any use of automatic differentiation involves a trade-off between memory usage, runtime, and code complexity (with custom adjoints). We believe that semi-automatic differentiation strikes a particularly attractive balance between these three goals: it is strictly linear in the number of time steps and the dimension of the Hilbert space; its runtime and structure match those of a traditional GRAPE implementation without any AD capabilities; and it requires no implementation of custom adjoints, which might otherwise achieve a similar memory scaling. Specifically, it works with any propagation method without modification, and the runtime of the optimization is directly proportional to the runtime of the time propagation.
Example: Two Transmon Qubits with a Shared Transmission Line
We consider two superconducting transmon qubits, where each qubit is an anharmonic oscillator that couples to the transmission line. In the dispersive limit, where the qubit-cavity detuning dominates the coupling strength, the cavity can be eliminated. This results in an effective two-transmon Hamiltonian with a static qubit-qubit coupling. Furthermore, a microwave control field with a frequency ω_d near the qubit frequencies ω_1, ω_2 drives transitions on the transmons. In the rotating-wave approximation, the effective Hamiltonian reads [102] Ĥ = Σ_{q=1,2} [(ω_q − ω_d) b̂†_q b̂_q + (α_q/2) b̂†_q b̂†_q b̂_q b̂_q] + J (b̂†_1 b̂_2 + b̂_1 b̂†_2) + ½ [Ω(t) (b̂†_1 + b̂†_2) + Ω*(t) (b̂_1 + b̂_2)], where b̂†_q and b̂_q are the creation and annihilation operators for the excitations of transmon q, α_q is the anharmonicity, J is the effective qubit-qubit coupling, and Ω(t) is the control field. Transmon qubits can be engineered to a wide range of frequencies, anharmonicities, and coupling strengths [103]. Here, we use the parameters listed in Table 2, cf. Ref. [35], as a typical example.
The control field is, in general, complex-valued, corresponding to variations in both the amplitude and the phase relative to the rotating frame; it is easiest to split it into real and imaginary parts and treat them as independent controls Ω_re(t) and Ω_im(t), as we have done above.
The Hamiltonian in Eq. (49) is well-suited as a system on which to benchmark optimal control methods for quantum gates. First, the system is fully controllable, allowing any two-qubit gate to be implemented if no further restrictions are placed on the control field [34,104]. Second, entangling quantum gates have been demonstrated with gate durations anywhere between 20 ns and 5 µs [35,103,105,106]. Thus, we can reasonably explore the numerical scaling of the optimization procedure with the number of time steps. Lastly, the number of levels in the anharmonic oscillator that contribute appreciably to the gate dynamics varies significantly depending on the gate mechanism and the amplitudes and frequencies of the control field [106]. This allows us to benchmark the optimization for varying Hilbert space sizes. Especially for non-analytic controls obtained with optimal control, leakage from the logical subspace can be a significant problem, and we include as many as N_q = 15 levels in the numerical simulation to account for this.
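For reference, a small sketch (Python, illustrative; the paper's implementation is in Julia) of the truncated transmon ladder operators and the resulting two-transmon Hilbert space dimension N_H = N_q²:

```python
import numpy as np

def lowering(n_q):
    """Truncated lowering operator b with <m|b|m+1> = sqrt(m+1)."""
    return np.diag(np.sqrt(np.arange(1, n_q)), k=1)

N_q = 5
b = lowering(N_q)
# embed on the two-transmon space: b_1 = b (x) 1, b_2 = 1 (x) b
b1 = np.kron(b, np.eye(N_q))
b2 = np.kron(np.eye(N_q), b)
assert b1.shape == (N_q**2, N_q**2)   # N_H = N_q^2 = 25

# truncation breaks [b, b^dag] = 1 only in the highest level:
comm = b @ b.conj().T - b.conj().T @ b
assert np.allclose(np.diag(comm), [1, 1, 1, 1, 1 - N_q])
```

The sparse, banded structure of these operators is exactly what an in-place propagator exploits and what many AD frameworks handle poorly.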

Optimization Functionals
A primary benefit of semi-automatic differentiation is that it allows optimizing arbitrary figures of merit, including ones for which it is difficult or impossible to derive analytical gradients. This freedom can significantly enhance the effectiveness of optimal control. For example, in the context of entangling quantum gates, it was demonstrated that optimizing for an arbitrary perfectly entangling quantum gate is easier than optimizing for a specific entangling gate such as CNOT [34,35]. This is because, for a given Hamiltonian, it is not known a priori which perfect entangler will be easiest to implement with given resources, such as a maximum power of the control pulse. At the same time, for the realization of a universal quantum computer, one perfect entangler together with arbitrary single-qubit operations is sufficient.
The entangling power of a quantum gate is measured by the gate concurrence, defined as the maximum concurrence that can be obtained by applying the quantum gate to a separable input state [31]. It can be evaluated by writing a two-qubit gate Û_L (a 4 × 4 matrix in the logical subspace) in a Cartan decomposition [32], Û_L = k̂_1 exp[(i/2) (c_1 σ̂_x⊗σ̂_x + c_2 σ̂_y⊗σ̂_y + c_3 σ̂_z⊗σ̂_z)] k̂_2, where σ̂_{x,y,z} are the Pauli matrices, k̂_{1,2} are single-qubit operations, and c_{1,2,3} are real-valued coefficients that characterize the two-qubit aspect of the quantum gate. When eliminating symmetries in Eq. (50), the coefficients can be understood as coordinates in a geometric representation of two-qubit gates called the Weyl chamber. A gate is a perfect entangler with a gate concurrence C = 1 if c_1 + c_2 ≥ π/2, c_1 − c_2 ≤ π/2, and c_2 + c_3 ≤ π/2, which defines a polyhedron within the Weyl chamber. Otherwise, the gate concurrence C < 1 is determined by the Weyl chamber coordinates, Eq. (51), cf. Ref. [31]. The calculation of the Weyl chamber coordinates themselves is described in Ref. [107], implemented in Refs. [108,109], and involves obtaining the eigenvalues of Û_L as well as a branch selection of a complex logarithm. Thus, the evaluation of Eq. (51) is inherently non-analytic, preventing the analytic construction of a gradient ∇C with respect to the control values. For this reason, a less direct measure for the entangling power of the quantum gate was developed in Refs. [34,35]. Intuitively, it minimizes the geometric distance from the polyhedron of perfect entanglers in the Weyl chamber. Mathematically, that distance is formulated not in terms of the Weyl chamber coordinates, but in terms of the Makhlin local invariants g_{1,2,3} [110]. Similarly to the Weyl chamber coordinates, these characterize two-qubit gates up to single-qubit operations.
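The perfect-entangler polyhedron conditions are simple to check once the Weyl chamber coordinates are known; a short sketch (Python, with the standard coordinates of a few well-known gates, e.g., (π/4, π/4, 0) for √iSWAP):

```python
import numpy as np

def is_perfect_entangler(c1, c2, c3, tol=1e-12):
    """Polyhedron conditions: c1+c2 >= pi/2, c1-c2 <= pi/2, c2+c3 <= pi/2."""
    return (c1 + c2 >= np.pi / 2 - tol and
            c1 - c2 <= np.pi / 2 + tol and
            c2 + c3 <= np.pi / 2 + tol)

assert is_perfect_entangler(np.pi / 4, np.pi / 4, 0)              # sqrt(iSWAP)
assert is_perfect_entangler(np.pi / 2, 0, 0)                      # CNOT
assert not is_perfect_entangler(0, 0, 0)                          # identity (local)
assert not is_perfect_entangler(np.pi / 2, np.pi / 2, np.pi / 2)  # SWAP
```

The hard, non-analytic part is obtaining (c_1, c_2, c_3) from Û_L in the first place, which is why the gate concurrence could previously not be optimized directly.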
It can be shown that the geometric distance to the polyhedron of perfect entanglers is D_PE = g_3 √(g_1² + g_2²) − g_1. Unlike the Weyl chamber coordinates, the local invariants can be calculated analytically from the two-qubit gate Û_L as g_1 = (1/16) Re tr²(m̂), g_2 = (1/16) Im tr²(m̂), with m̂ = Û_B^T Û_B, where Û_B is the representation of Û_L in the Bell basis. Applying matrix calculus, it is possible, although both tedious and lengthy, to calculate an analytic gradient of Eq. (52) [91]. For the derivative with respect to a state, cf. Eqs. (14,29), a Python implementation is available in Ref. [108].
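In contrast to the Weyl chamber coordinates, the invariants g_1, g_2 are plain polynomial expressions in the entries of Û_L; a minimal sketch (Python; the magic-basis matrix Q follows one common convention and is an assumption here, but the identity-gate check below is convention-independent):

```python
import numpy as np

# one common choice of magic-basis (Bell-basis) transformation
Q = np.array([[1, 0, 0, 1j],
              [0, 1j, 1, 0],
              [0, 1j, -1, 0],
              [1, 0, 0, -1j]]) / np.sqrt(2)

def g1_g2(U_L):
    """Local invariants g1, g2 from m = U_B^T U_B with U_B = Q^dag U_L Q."""
    U_B = Q.conj().T @ U_L @ Q
    m = U_B.T @ U_B
    tr2 = np.trace(m) ** 2
    return tr2.real / 16, tr2.imag / 16

# for the identity gate, U_B = 1 and m = 1, so tr^2(m) = 16: g1 = 1, g2 = 0
g1, g2 = g1_g2(np.eye(4))
assert np.isclose(g1, 1.0) and np.isclose(g2, 0.0)
```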
Both C(Û_L) and D_PE(Û_L) are only well-defined if Û_L is unitary. To ensure this in the optimization, we may add a term that penalizes the loss of population from the logical subspace, p_loss = 1 − (1/4) tr[Û†_L Û_L]. Altogether, we use the following three optimization functionals: 1. Square-modulus gate optimization (SM), J_T,sm({τ_k}) = 1 − (1/16) |Σ_{k=1}^{4} τ_k|², cf. Eq. (2), where |φ_k^tgt⟩ is the result of applying the target gate Û_tgt = √iswap to the initial states |φ_1⟩ = |00⟩, |φ_2⟩ = |01⟩, |φ_3⟩ = |10⟩, |φ_4⟩ = |11⟩ spanning the logical subspace. The choice of √iswap as the target gate is somewhat arbitrary, although it has been demonstrated that this is an easily reachable gate for the Hamiltonian in Eq. (49) [35].

2. Perfect entangler optimization (PE)
where D PE and p loss are defined in Eq. (52) and Eq. (54), respectively.

3. Concurrence optimization (C)
where C and p loss are defined in Eq. (51) and Eq. (54), respectively. This is an example of a non-analytic functional whose gradient can only be evaluated via automatic differentiation.

Benchmarks
The result of benchmarking the functionals in Section 5.2 with semi-automatic and full automatic differentiation is shown in Fig. 2. The "Semi-AD" optimization uses the approach described in Section 4.2 for the gate optimization with the square-modulus functional (SM), and the approach described in Section 4.3 for the perfect-entangler (PE) and concurrence (C) optimizations, as indicated by the arguments on the left-hand sides of Eqs. (55)-(58). The resulting optimized control fields are similar to those obtained in Ref. [35], and are shown in the "PE" examples in the online documentation of the GRAPE.jl and Krotov.jl packages [69,70], which implement both the direct and the semi-AD optimization methods. All of these optimizations use an expansion in Chebychev polynomials for the time propagation, Eq. (7). We compare this with an optimization using full automatic differentiation ("Full-AD"). This means that we simply propagate the set of initial states with an AD-aware ode solver, the Runge-Kutta algorithm dp5 [111] in DifferentialEquations.jl [112], and then evaluate the functional within the Zygote AD framework [61]. As an alternative to the general ode solver, we also benchmark a full-AD variant of the Chebychev propagator. In this implementation, in-place linear algebra operations must be avoided, adding runtime overhead due to the allocation and deallocation of memory, cf. panels (e, f). It is possible to do this within Zygote without too much numerical overhead due to the simplicity of Eqs. (7,8). Of course, this propagator is also limited to standard Hermitian dynamics. Lastly, as a baseline, we benchmark the direct optimization of the square-modulus functional for a √iswap gate with the analytic gradient in Eq. (11), that is, without any use of automatic differentiation ("Direct (Cheby) (SM)").
In the left column, panels (a, c, e), we vary the number of levels at which we truncate each transmon between N_q = 3 and N_q = 15, shown in terms of the Hilbert space size, N_H = N_q² for two transmons. The gate duration is constant at 100 ns. In the right column, panels (b, d, f), we vary the gate duration between 20 ns and 800 ns while keeping the number of transmon levels constant at 5 (N_H = 25). The step size of the time grid is 0.1 ns, so that the number of time steps in the scheme of Fig. 1 (a) is directly proportional to the gate duration and varies between 200 and 8000 steps.

Figure 2: The dynamics are simulated by evaluating the time evolution operator in Chebychev polynomials ("Cheby"), or by using a generic ODE solver. The gradient is evaluated either via semi-automatic differentiation ("Semi-AD") for the PE and C optimizations, respectively via a set of overlaps {τ_k} for the SM optimization; via full automatic differentiation ("Full-AD"); or, for the "Direct" optimization, using an analytic gradient (no automatic differentiation).
We run each optimization for 10 iterations. At a minimum, each iteration includes one evaluation of the gradient as depicted in Fig. 1 (a), that is, a forward propagation of the four logical basis states and a backward propagation of four extended gradient states. However, the linesearch may require additional evaluations of the gradient in order to determine the step width. The number of linesearch steps varies between optimizations and between iterations. Thus, the benchmark of the runtime in panels (a, b) shows the seconds required per gradient evaluation. The optimizations were performed on an Intel Xeon Gold 6226R CPU workstation with a nominal clock speed of 2.9 GHz, without parallelization. The runtime was determined using BenchmarkTools.jl [113] as the minimum of 20 optimization runs, with a maximum total benchmark time of 24 hours. It excludes compilation overhead.
Fundamentally, the runtime scales super-linearly with the size of the Hilbert space. This is due to two factors. First, an increase in the spectral range of the Hamiltonian, which for the Chebychev propagation implies an increase in the number of coefficients required for Eq. (7) to converge to machine precision; this increase is roughly linear in the number of levels in the transmon, i.e., the square root of the Hilbert space dimension. Second, the numerical scaling of the matrix-vector multiplications in Eq. (8) (and likewise in the implementation of a Runge-Kutta ode solver). In principle, matrix-vector multiplication scales as N_H², although this is mitigated by the fact that we use sparse matrices to store the Hamiltonian. With respect to the gate duration, the runtime is directly proportional to the number of time steps.
We find that the runtime of the Semi-AD optimization is virtually indistinguishable from that of the direct optimization. This is in stark contrast to the full-AD optimization with a general ode solver. For a Hilbert space dimension greater than 64, the runtime of this optimization becomes prohibitively expensive (greater than 10 minutes per iteration, or per step in the linesearch). The situation improves dramatically when using a modified Chebychev propagator for the full-AD optimization. Within the shown parameter region, these optimizations stay within the order of one minute per iteration. However, the insets in panels (a, b) show a significant difference in scaling between the semi-AD and full-AD Chebychev optimizations. For the linear scaling in panel (b), we find a factor of 10 between the different slopes (3 ms per time step versus 30 ms per time step). This order of magnitude also appears realistic for the scaling with respect to the Hilbert space dimension, and may become a problem for the runtime of much larger systems. Certainly, it will not be possible to perform a full-AD optimization of anything but the most trivial open quantum systems, whose Liouville space dimension is the square of the underlying Hilbert space dimension. This is especially true because the Chebychev propagator is not applicable to non-unitary (open system) dynamics. The use of Newton polynomials discussed in Section 2.3 is numerically more demanding than the use of Chebychev polynomials, and it is unclear whether it would be feasible to implement a variant of the method that avoids in-place operations and would be compatible with a framework for automatic differentiation such as Zygote. Thus, a full-AD optimization of an open quantum system would likely have to rely on an ode solver, with the prohibitive runtime shown in panel (a). The relative scaling of a direct optimization compared to a full-AD optimization matches recent observations in Ref. [114].
There is no substantial difference in runtime between the different optimization functionals. This illustrates that the numerical effort of the optimization is entirely dominated by the time propagation, and further explains why the performance of the Semi-AD optimization is indistinguishable from a direct optimization with an analytic gradient.
The peak ram usage shown in panels (c, d) was measured by monitoring the Julia process running the optimization with psutil [115] and subtracting a baseline measured for a "Hello World" program, to account for the footprint of the Julia runtime, which can vary depending on the exact version of Julia and the installed packages. We show the median ram usage, as well as the range of values in the inset for the Semi-AD (Cheby) optimization. There is significant fluctuation in the peak ram usage, as it depends on Julia's garbage collector, which has some element of randomness. However, the data shown in the insets indicate that the peak ram usage for semi-automatic differentiation is essentially constant around 100 MB, and equal to the memory used for the direct optimization with an analytic gradient. In principle, as analyzed in Section 4.6, we would expect a linear scaling for large Hilbert space dimensions, determined by the storage of the forward-propagated states marked in red in Fig. 1 (a). For N = 4 targets, a Hilbert space dimension of N_H, and N_T time steps, each stored state is a complex vector of length N_H, and each complex number requires 64 bits each for the real and imaginary part, i.e., 16 bytes in total. Thus, the total memory required for storage is 4 · 16 · N_H · N_T bytes = 64 N_H N_T bytes. For N_H = 225, N_T = 1000, the rightmost point in the inset of panel (a), this comes out to 14 MB, and to 49 MB for N_H = 100, N_T = 8000, the rightmost point in the inset of panel (b). We can conclude that for the data shown in Fig. 2, the propagation overhead still dominates over the storage of propagated states. The memory usage of a full-AD optimization with an ode solver appears constant with respect to the size of the Hilbert space, albeit with a significant overhead, averaging around 700 MB.
The near-constant memory footprint of the full-AD optimization with the ode solver is likely due to the solver being optimized for automatic differentiation, that is, providing handwritten adjoints for the Runge-Kutta step and thus eliminating the overhead of the computational graph for a single propagation step. Its memory usage still scales linearly with the number of time steps, however, reaching 4 GB for 8000 time steps. The full-AD optimization using a Chebychev expansion does not benefit from the AD-aware propagation in DifferentialEquations.jl; its memory usage scales with the number of coefficients in Eq. (7), resulting in an excessive ram usage of 8 GB for a Hilbert space dimension of 225. Thus, the use of the Chebychev propagator for significantly larger Hilbert spaces would be prohibitive, and a possible implementation of a Newton propagator would likely perform even worse, making the use of full-AD for open quantum systems impractical.
Lastly, in panels (e, f) we show the accumulated memory allocated on the heap, as measured by BenchmarkTools.jl and normalized by the number of gradient evaluations. Note the scale of the y-axis, which reaches 4 × 10^5 MB, i.e., 400 GB. The allocations differ from the peak ram usage due to Julia's garbage collector, but correlate strongly with the runtime shown in panels (a, b). This illustrates the importance of good memory management, which is easy for an in-place Chebychev propagator (cf. the negligible allocations of < 60 MB shown in the inset), but impossible for a propagator running in an automatic differentiation framework that does not allow for in-place operations.
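To illustrate the kind of in-place memory management referred to here, the following sketch (in Python/NumPy rather than Julia, with function names of our own choosing) contrasts an allocating implementation of the Chebychev three-term recurrence T_{k+1}(H)v = 2 H T_k(H)v − T_{k−1}(H)v with one that cycles through three preallocated buffers and allocates nothing inside the loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
H = rng.normal(size=(n, n))
H = (H + H.T) / 2                 # Hermitian (real symmetric) test matrix
H = H / np.linalg.norm(H, 2)      # scale spectrum into [-1, 1], as Chebychev requires
v = rng.normal(size=n)

def cheby_naive(H, v, order):
    # allocating recurrence: every iteration creates new temporary arrays
    t_prev, t_cur = v.copy(), H @ v
    for _ in range(order - 1):
        t_prev, t_cur = t_cur, 2 * (H @ t_cur) - t_prev
    return t_cur

def cheby_inplace(H, v, order):
    # same recurrence with three preallocated buffers, reused every iteration
    t_prev = v.copy()
    t_cur = H @ v
    tmp = np.empty_like(v)
    for _ in range(order - 1):
        np.matmul(H, t_cur, out=tmp)   # tmp = H T_k v, written in place
        tmp *= 2.0
        tmp -= t_prev                  # tmp = 2 H T_k v - T_{k-1} v
        t_prev, t_cur, tmp = t_cur, tmp, t_prev  # rotate buffers
    return t_cur
```

The two functions return identical results; the in-place variant is the pattern a garbage-collected language rewards, and exactly the pattern reverse-mode AD frameworks without mutation support forbid.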

Conclusion and Outlook
We have developed a theory of semi-automatic differentiation that allows arbitrary functionals to be optimized efficiently and flexibly. Separating the time propagation from the evaluation of the functional eliminates the excessive computational overhead and limited scope traditionally associated with the use of automatic differentiation (AD). With semi-AD, the time propagation and the associated gradient can be evaluated outside of the AD framework in a modified grape scheme, which we have described in detail. This scheme can be implemented in the most efficient manner possible, using sparse linear algebra, complex matrices, and in-place operations. The remaining part of the gradient is of minimal computational complexity and can thus be evaluated efficiently within an AD framework.
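A minimal numerical illustration of this separation (a toy two-level system in Python/NumPy, not the GRAPE.jl implementation): the gradient of a functional J = f(τ) of the overlap τ = ⟨ψ_tgt|ψ(T)⟩ is assembled from (i) the derivatives ∂τ/∂ε_n, obtained from stored forward-propagated states and a backward propagation as in the modified grape scheme, and (ii) the "cheap" outer derivative of f, here mimicked by finite differences standing in for AD:

```python
import numpy as np
from scipy.linalg import expm

# toy two-level system H(eps) = H0 + eps * H1 (illustrative, not from the paper)
H0 = 0.5 * np.array([[1, 0], [0, -1]], dtype=complex)
H1 = np.array([[0, 1], [1, 0]], dtype=complex)
dt, n_steps, h = 0.1, 20, 1e-6
rng = np.random.default_rng(42)
eps = rng.normal(size=n_steps)                  # piecewise-constant controls
psi0 = np.array([1, 0], dtype=complex)
target = np.array([0, 1], dtype=complex)

U = lambda e: expm(-1j * dt * (H0 + e * H1))    # single time-step propagator

def tau_of(eps):                                # overlap <target|psi(T)>
    psi = psi0
    for e in eps:
        psi = U(e) @ psi
    return np.vdot(target, psi)

f = lambda tau: 1.0 - abs(tau) ** 2             # "cheap" functional J = f(tau)

# (i) propagation part: stored forward states and backward-propagated states
fw = [psi0]
for e in eps:
    fw.append(U(e) @ fw[-1])
chi = [None] * n_steps
chi[-1] = target
for k in range(n_steps - 2, -1, -1):
    chi[k] = U(eps[k + 1]).conj().T @ chi[k + 1]
dU = lambda e: (U(e + h) - U(e - h)) / (2 * h)  # dU/d eps (finite difference)
dtau = np.array([np.vdot(chi[k], dU(eps[k]) @ fw[k]) for k in range(n_steps)])

# (ii) outer derivative of f w.r.t. Re(tau), Im(tau) -- stand-in for AD
tau = tau_of(eps)
df_re = (f(tau + h) - f(tau - h)) / (2 * h)
df_im = (f(tau + 1j * h) - f(tau - 1j * h)) / (2 * h)

# chain rule: dJ/d eps_n combines the two parts
grad_semi = df_re * dtau.real + df_im * dtau.imag

# brute-force check: finite difference of the full J(eps)
def grad_fd():
    g = np.empty(n_steps)
    for k in range(n_steps):
        ep, em = eps.copy(), eps.copy()
        ep[k] += h
        em[k] -= h
        g[k] = (f(tau_of(ep)) - f(tau_of(em))) / (2 * h)
    return g
```

The point of the sketch is that the expensive part, the propagation, never enters the differentiation framework; only the scalar function f of the overlap does.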
For the example of entangling gates on superconducting transmon qubits, we have verified that we can optimize for an arbitrary perfectly entangling quantum gate, either via a functional exploiting the geometric structure of the Weyl chamber, or directly by maximizing the gate concurrence. This is the first demonstration of a gradient-based optimization of the concurrence, or of any non-analytic functional. Fundamentally, such functionals require the use of automatic differentiation and have thus far been held back by the associated numerical overhead.
The runtime and memory usage of the semi-AD optimizations shown here scale identically to a direct optimization of a quantum gate with a fully analytic gradient. That is, we completely eliminate the exorbitant numerical overhead traditionally associated with automatic differentiation. A "full-AD" optimization that uses a generic ode solver within the AD framework becomes infeasible in terms of runtime for Hilbert space dimensions greater than ≈ 100. The runtime can be improved significantly by adapting a propagation via expansion into Chebychev polynomials to the requirements of the AD framework (no in-place linear algebra operations). However, this results in excessive memory usage and would be difficult to extend to the significantly more complicated propagators required for open quantum systems.
As we have demonstrated, the potential improvements of semi-automatic differentiation scale superlinearly in both runtime and memory usage compared to a full-AD approach. Even for moderate Hilbert space sizes in a closed quantum system, we observe improvements of up to two orders of magnitude. For control problems of larger dimension (either the dimension of the Hilbert space or the number of control parameters), and especially in open quantum systems, this effect will be further magnified. In such a setting, and for functionals where an analytic gradient is not feasible, semi-automatic differentiation will be the only viable option. Fundamentally, an optimization will be possible as long as simulating the time dynamics of the system is computationally feasible.
We have implemented the semi-AD approach in two ready-to-use Julia packages, GRAPE.jl [69] and Krotov.jl [70], as part of the more general QuantumControl.jl framework [71]. As an AD framework, we have used Zygote [60,61]. However, the method described here is applicable to any language or AD framework. In fact, the low complexity of evaluating the optimization functional relative to the full time propagation opens up additional avenues in environments where AD support is underdeveloped. For example, we have tested the use of finite differences [116] and found it to be adequate for the examples in Section 5. Similarly, we would expect the easier-to-implement forward-mode differentiation to be a practical alternative.
We have developed the method of semi-automatic differentiation for the standard grape model of piecewise-constant control fields, albeit extending it to arbitrary functionals. However, the idea also applies to extensions of grape with a reduced number of control parameters, as used in the goat [117], group [118], and grafs [119] methods. A parametrization of the control field can be taken into account by a further chain rule in Eq. (3). Furthermore, as pointed out in Ref. [117], Eq. (5) can be used not just to evaluate the gradient of a piecewise-constant time step, but also the gradient of a full time propagation with respect to a single control parameter. In all of these cases, the method can be augmented with semi-automatic differentiation to extend it to arbitrary functionals. We will explore this in future work as part of the QuantumControl.jl framework [71].
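The further chain rule for a parametrized control field is straightforward to sketch: given the gradient ∂J/∂ε_n from the semi-AD scheme, the gradient with respect to a handful of pulse parameters θ follows by contracting with the Jacobian ∂ε_n/∂θ_j. The Gaussian parametrization and all names below are hypothetical, chosen only for illustration:

```python
import numpy as np

t = np.linspace(0, 1, 50)                       # time grid

def pulse(theta):
    # hypothetical parametrization: a Gaussian with amplitude, center, width
    amp, t0, width = theta
    return amp * np.exp(-((t - t0) ** 2) / (2 * width**2))

rng = np.random.default_rng(1)
grad_eps = rng.normal(size=t.size)              # stand-in for dJ/d eps_n

def grad_theta(theta, h=1e-6):
    """Chain rule dJ/d theta_j = sum_n (dJ/d eps_n)(d eps_n / d theta_j),
    with each Jacobian column obtained by central finite differences."""
    g = np.empty(len(theta))
    for j in range(len(theta)):
        tp, tm = np.array(theta, float), np.array(theta, float)
        tp[j] += h
        tm[j] -= h
        g[j] = grad_eps @ (pulse(tp) - pulse(tm)) / (2 * h)
    return g
```

In practice, the Jacobian of an analytic pulse shape would be computed analytically or by AD; the finite-difference version above merely shows where the extra chain-rule factor enters.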
In addition to enabling the use of AD, a second motivation for using a framework like TensorFlow to model the entire optimization problem was to enable the use of gpu computing, which can yield considerable speedups [37,49]. This possibility remains with semi-automatic differentiation, but the use of AD and the use of gpu computing are now entirely independent: The time propagation can be implemented in whatever way is most efficient, including on the gpu.
Lastly, in Ref. [120], it was observed that the scheme in Fig. 1 (a) can be extended to compute the full Hessian of an overlap of two states. With next-generation AD systems that allow for the calculation of higher-order derivatives [121], this opens up the possibility of a full-Hessian semi-automatic differentiation approach. The resulting Newton optimization may provide better convergence than the quasi-Newton method via lbfgs that we have used here.
Application of the semi-automatic differentiation framework developed in this paper will open new pathways to solve long-standing problems in quantum control. Apart from the optimization of entanglement measures in quantum information, which we have demonstrated here, and which could be extended to open quantum systems [122], the method would be applicable to the creation of multipartite entanglement in many-body systems. To date, this has been addressed with optimal control, either indirectly [123] or with gradient-free methods [124], usually via the crab method [125][126][127]. The ability to explore the optimization landscape for these control problems more broadly may lead to deep insights into the quantum behavior, e.g., of biological systems [128,129].
In quantum metrology, there is a wide range of opportunities to use optimal control beyond simple state-to-state transfers. For one, we can address functionals like the recently developed population transfer functional for atom interferometry [130]. So far, this has only been explored with Krotov's method, but the use of semi-AD would allow optimizing the components of an atom interferometer using grape, potentially exploiting the improved asymptotic convergence [24] in the generally flat optimization landscape of a robustness optimization [89]. This would also extend to a recently proposed tractor atom interferometer [131,132]. More fundamentally, we may directly maximize metrological measures through quantum control [133,134]. Generalizing from a recent application of optimal control to the creation of extreme spin-squeezed states [135], we may wish to directly maximize the quantum Fisher information [28,29], which is non-analytic in open quantum systems [30]. Thus, we expect semi-automatic differentiation to be an indispensable tool for the design of metrological protocols in open quantum systems.

Data Availability
The code used to generate the benchmarks in Section 5 and Fig. 2 is available at Ref. [136], or under DOI 10.5281/zenodo.7386493. Moreover, examples for the perfect entangler and concurrence optimizations for coupled transmon qubits that show the resulting optimized fields are available as part of the documentation of the GRAPE.jl and Krotov.jl packages, which implement the semi-AD approach [69,70].