Efficient variational synthesis of quantum circuits with coherent multi-start optimization

We consider the problem of the variational quantum circuit synthesis into a gate set consisting of the CNOT gate and arbitrary single-qubit (1q) gates with the primary target being the minimization of the CNOT count. First we note that along with the discrete architecture search suffering from the combinatorial explosion of complexity, optimization over 1q gates can also be a crucial roadblock due to the omnipresence of local minimums (well known in the context of variational quantum algorithms but apparently underappreciated in the context of the variational compiling). Taking the issue seriously, we make an extensive search over the initial conditions an essential part of our approach. Another key idea we propose is to use parametrized two-qubit (2q) controlled phase gates, which can interpolate between the identity gate and the CNOT gate, and allow a continuous relaxation of the discrete architecture search, which can be executed jointly with the optimization over 1q gates. This coherent optimization of the architecture together with 1q gates appears to work surprisingly well in practice, sometimes even outperforming optimization over 1q gates alone (for fixed optimal architectures). As illustrative examples and applications we derive 8 CNOT and T depth 3 decomposition of the 3q Toffoli gate on the nearest-neighbor topology, rediscover known best decompositions of the 4q Toffoli gate on all 4q topologies including a 1 CNOT gate improvement on the star-shaped topology, and propose decomposition of the 5q Toffoli gate on the nearest-neighbor topology with 48 CNOT gates. We also benchmark the performance of our approach on a number of 5q quantum circuits from the ibm_qx_mapping database showing that it is highly competitive with the existing software. The algorithm developed in this work is available as a Python package CPFlow.

1 Introduction 1 2 Background and notation 3 2.1 Quantum circuits and quantum gates .

Introduction
While many quantum algorithms, such as integer factoring [1], unstructured search [2], or linear equation solvers [3], promise game-changing speedups over classical, the current state of the quantum computing technology does not yet allow for a decisive demonstration with useful applications, although it might be on the verge (see e.g. a recent review [4]). There are plenty of factors limiting the performance of the current generation of quantum devices, such as initialization and readout errors, loss of coherence over time, and errors in gate operations. In the current NISQ When drawing large circuits we will often abbreviate rotation gates Rσ(a) by σ(a) to lighten the notation.
era [5], most gate-based quantum protocols are constructed out of single-qubit (1q) and two-qubit (2q) gates with the latter being significantly more errorprone across all leading platforms (at the same time, the problem of realizing multi-qubit gates is also under active study, see e.g. Ref. [6]). Hence, minimization of the 2q gate count is one of the key objectives that can improve performance of the near-term algorithms. On the other hand, in the fault-tolerant future a different type of resource, e.g. the T gate count, is likely to be the most expensive. At the high level, quantum algorithms are usually described using primitives, such as large multicontrolled gates or quantum Fourier transform, that are not directly accessible on the current devices. The standard compiling routines [7,8] include decomposing algorithmic primitives into native gates (which is always possible [9]), routing stage to comply with the possible connectivity restrictions of the target chip, and local simplifications of the resulting circuits (see e.g. here for a benchmark comparison of different frameworks [10]). Using special data structures such as graphs, tensor networks, ZX diagrams, decision diagrams and others, allows carrying out the compilation process without the need to simulate any part of the circuit. This makes these techniques extremely scalable, allowing to compile and optimize circuits with hundreds of qubits. The downside is that the resulting decompositions may be significantly less efficient than possible. A complementary strategy is to work directly with the unitary matrix of the circuit, thus eliminating any potential redundancies or inefficiencies in the original gate-based description. This is only feasible for small scale circuits, as the size of the state space and the corresponding unitary matrices scales exponentially with the number of qubits. In fact, even for a few-qubit circuits there may be other limiting factors such as circuit complexity, as we emphasize in this work. Although there could be applications of direct unitary synthesis to enhancing the performance of the NISQ algorithms, we expect the use of highly optimized small scale circuits as building blocks of large scale algorithms to be the most promising possibility.
The unitary synthesis problem amounts to finding the most efficient circuits optimizing a given objective function. Typical applications include maximizing fidelity with respect to the target unitary (compilation) or maximizing the overlap with the target state (state preparation), but more general problems can be considered. We study the problem of the variational synthesis into the gate set consisting of a single 2q gate (CZ or CNOT ) and arbitrary 1q gates. The primary optimization objective is the amount of CNOT gates, although indirectly we also address CNOT depth and even T count and T depth. It is natural to divide the problem into two parts: (i) Discrete optimization or architecture search, looking for best placements of 2q gates. (ii) Continuous optimization of 1q gate parameters for a given architecture. The difficulty associated with the architecture search has combinatorial origin and is manifest. The difficulty of the continuous optimization is however also essential, as is known in the context of variational quantum algorithms, but apparently underappreciated in the context of the variational compiling.
After fixing our notation and giving a brief introduction in Sec. 2, we zoom in the issue of the continuous optimization in Sec. 3. Results of this analysis may be of independent interest. Another central ingredient in our approach is to use parametrized 2q gates as a means to relax the discrete architecture search to another continuous optimization, that can be performed simultaneously and coherently with the optimization over 1q gates. We introduce the CPFlow algorithm in Sec. 4. In Sec. 5 we first illustrate all central features of CPFlow using the 3q Toffoli gate as an example, and then go beyond this toy case to synthesize efficient (and likely novel) decompositions of the 4q and 5q Toffoli gates on constrained topologies. In Sec. 6, we provide further benchmarks showcasing that CPFlow is very efficient in compilation of small scale circuits with moderate complexity, but also outline its limitations. Sec. 7 concludes with the summary and outlook.
There are three main contributions we present in this study.
(i) Exposing the problem of local minimums in the loss landscapes of variational compiling problems as a crucial yet underappreciated challenge. (ii) Developing a new heuristic algorithm for a simul-taneous search over the architectures and singlequbit angles and demonstrating that for smallscale quantum circuits of intermediate complexity it can produce optimal or nearly optimal decompositions. (iii) Proposing a simple post-processing step, that often allows refining approximate decompositions into exact. Observation (i) and technique (iii) are not specific to our core algorithm and are likely to be of wider interest.
Optimization of quantum circuits implementing various tasks, from simulation to combinatorial optimization, is critical for successful applications. Some approaches are informed by the structure of the problem, e.g. the symmetry of the dynamics to be simulated [11,12,13,14]. Others seek for compilation layouts robust to noise [15,16,17,18] or even attempt to redefine the basis gate set based on the chip design [19]. Our work is better characterized as numerical circuit synthesis.
The idea to use computer assisted search and numerical optimization for circuit synthesis goes back a long way [20] and continues to the present day with advances due to both algorithm design and growth of raw computational power. Possible frameworks include purely discrete search over a finite gate set [21,22], a natural separation into discrete architecture search and continuous optimization [23,24], adaptive circuit synthesis [25,26,27,28], techniques such as genetic algorithms [29,30] and machine learning [31,32], and a hybrid approach with part of the architecture search outsourced to a version of continuous optimization [33,34]. The scheme developed in Ref. [34] is in many respects similar to the one proposed in this paper, and similarly to our work, was originally motivated by the impressive success of variational compiling of random unitaries [35,36,37]. Random unitaries can be thought of as the circuits with maximal complexity. In this work, we address circuits of intermediate complexity, which are a much more challenging target.

Quantum circuits and quantum gates
Quantum circuits are usually drawn as diagrams similar to Fig. 4. Horizontal wires correspond to qubits, boxes and vertical connections to quantum gates. Each quantum gate can be though of as a unitary matrix. The ones we use in this paper are summarized in Fig. 1. An important detail is that CZ and CNOT gates are equivalent up to a conjugation by the 1q Hadamard gate, and hence completely equivalent for the purpose of the variational compiling that we consider. We often mention CNOT gates in more general discussions, as is more standard, but refer to CZ CZ block (a) Connected layer (b) Chain layer gates when the technical details are important. The state space associated to n qubits has dimension 2 n , the unitary matrices acting on this space have dimension 2 n ×2 n . The unitary matrix of a quantum circuit can be constructed by an appropriate tensor product of all the gates involved. The overall complex phase of the unitary matrix is inconsequential and can be fixed in any convenient way. The space of quantum circuits on n qubits is hence equivalent to the special unitary group SU (2 n ). A natural measure of fidelity between two unitary matrices is the distance induced from the Hilbert-Schmidt norm It is normalized to take values between 0 and 1 with the minimum being reached iff U and V differ by a global phase.

Template circuits
The prevalent approach to the variational compiling in the literature [24,36,37,38,34] that we will follow is to construct the template (or ansatz) circuits by repeated application of two-qubit blocks. The two types of entangling blocks we will use are CZ and CP blocks depicted in Fig. 2. They only differ by the type of the entangling gate used. We will explain the choice of 1q gates shortly. The blocks are further arranged in sequences we refer to as layers. In principle, layers can be arbitrary, but we will usually identify layers Figure 4: Template circuit U 4 CZ on a connected 3q topology with 4 entangling CZ -blocks. The complete connected layer is boxed, the incomplete layer is dash-boxed.
with coupling maps of the target topology (ordered in an arbitrary way). For example, see Fig. 3 showing layers corresponding to the fully connected and chain (or nearest-neighbor) topology. Finally, to fully specify the template, one must provide the total number of 2q gates. Layers are repeated until the specified number of 2q gates is reached, the last layer is truncated if needed. We will write U k CZ or U k CP for templates with k entangling gates of type CZ or CP respectively (layer specification is assumed but left implicit in the notation). For illustration, Fig. 4 depicts U 4 CZ on a connected topology (here and in the following connected topology means fully connected; connectivity restrictions will always be specified explicitly).

Theoretical lower bound
There is a provable minimum amount of CZ gates required to decompose any n-qubit unitary [39], that we will refer to as the theoretical lower bound It essentially follows from a simple parameter counting argument. We will sketch the argument, which is not only instructive, but also helps to motivate the structure of our template circuits. A unitary matrix of an n-qubit circuit has in general 4 n real parameters. Generic 1q gate on the other hand has 3 real parameters, e.g. angles in the Euler decomposition. Thus, without 2q gates, a quantum circuit with n qubits can have no more than 3n real parameters. Our template circuits start with a round of 1q gates on each qubit wire, cf Fig. 4. Adding a single CZ gate allows appending two more 1q gates that do not immediately combine with the existing ones. Superficially, this permits adding 6 real parameters per CZ gate. However, one parameter in each 1q gate is redundant, as illustrated in Fig. 5. Using ZXZ decomposition of an arbitrary single-qubit block U and the fact that R Z commutes with the CZ gate, the leftmost R Z blocks on each qubit can be pulled to the left and joined with the existing 1q blocks. Therefore, adding a single entangling CZ block allows increasing the real dimension by four. Requiring that the amount of 2q gates is at least sufficient to cover the dimension of the SU (2 n ) manifold leads to the equation 4 T LB(n) + 3n ≥ 4 n − 1, equivalent to (2) (additional unit subtracted is the irrelevant global phase parameter). Figure 5: Entangling CZ -block only allows adding four real parameters to the circuit. Explicit gate angles are not depicted.
While expression (2) is a simple theoretical bound, a strong evidence for its tightness exists. First, there is an constructive analytic procedure, known as the quantum Shannon decomposition [40], which synthesizes an arbitrary n-qubit unitary using only 23 12 T LB(n) CNOT gates (roughly twice as much as the theoretical lower bound requires). Second, recent numerical studies [36,37] suggest that the overhead of the quantum Shannon decomposition is not necessary and that CNOT count given by Eq. (2) is sufficient to compile random unitaries with a great numerical accuracy.
So far, our discussion and the bound (2) addressed generic or random unitaries. However, the unitary matrices of the central importance to quantum computation are highly structured and typically require much less 2q gates. The quantum Shannon decomposition does not appear to be particularly useful in this case. Its extension to restricted topologies is also difficult, often leading to a large multiplicative overhead [40]. Note that being able to find truly optimal decompositions of arbitrary unitaries would amount to determining their gate complexities, which is an NP-complete problem [41]. It is therefore natural to use numeric optimization and heuristic methods in the search for efficient decompositions.

Variational synthesis and its challenges
As mentioned in the introduction, it is natural to split the variational compiling into the discrete architecture search and the continuous optimization of 1q gates. The difficulty of the architecture search caused by the combinatorial explosion of complexity is manifest. At the same time, the difficulty of continuous optimization also can not be ignored. It is a non-convex problem and thus can not be solved with guaranties. In practice, it may suffer from a range  of problems including local minimums, plateaus, and saddle points. Mathematically, the problem of the variational synthesis is very similar to the classical optimization loop in quantum variational algorithms [42], especially their hardware-efficient [43] and adaptive [26] forms. Here, the two key obstacles are the barren plateaus and local minimums. The barren plateaus [44] manifest as negligible gradients in large areas of the loss landscape and are usually associated with a large number of qubits or parameters. In our experiments with small-scale quantum circuits, we did not find them to be relevant. On the other hand, the problem of local minimums alone is sufficient to render training of the variational algorithms NP-hard [45]. As our numerical experiments suggest, local minimums constitute a real hindrance to the variational compiling.
We will quantify the challenges associated with local minimums by the empirical success ratio where N is the total number of times the optimization procedure is performed starting with random initial conditions and M is the number of times the global minimum is reached. For example, let U (a) be the unitary matrix of the template circuit from Fig. 4 and a * be some particular choice of angles. It is clear that the global minimum of the Hilbert-Schmidt distance D(U (a), U (a * )) is zero (attained at a = a * ), but gradient-based optimization does not always reach it (as a cutoff value we take D ≤ 10 −4 ). With some particular random choice of a * and random uniform initialization of the template angles, our default optimization (detailed in Sec. 4.3) yields success ratio SR ≈ 0.3, which implies that roughly two thirds of the times the optimization gets stuck in a local minimum.
We now extend this simple numerical experiment more systematically. Fig. 6 charts the success ratios for 3q and 4q circuits as a function of the number of gates. The basic procedure is the same as above, with several additions. For each gate count k, we construct a CZ template with connected layer U k CZ and find the success ratio of this template learning its We consider the optimization successful if the Hilbert-Schmidt distance (1) drops below 10 −4 . This quite permissive numerical cutoff is enough to reveal the local minimums without worrying about the optimization details such as convergence rate. 1 Independent evidence that unsuccessful attempts are indeed due to local minimums will be presented at the end of the section.
More precisely, we take 10 different template in-stances for each gate count U k CZ (a * 1−10 ) and compute the success ratio for each of them using 1000 initial conditions, generated uniformly at random. Blue markers represent mean success ratios averaged over 10 target circuits, while error bars quantify the standard deviations. Absence of blue markers implies that the empirical success ratio turned out vanishing, i.e. that the global minimum was not reached. For 4q circuits data points were only collected for k = 3n, n ∈ N, k ≤ 63.
There are several remarkable features of these plots. First, the success ratio drops very quickly as the 2q gate count increases, reaching values below 10 −3 at 10 CZ gates for 3q circuits and 15 CZ gates for 4q circuits. Next, perhaps surprisingly, the success ratio rises back to values of order 1 as the number of CZ gates approaches the theoretical lower bound (2). In fact, this is in agreement with the empirical evidence found in the literature [36,37,35] that near the theoretical lower bound numerical compilation appears to be very efficient as if the problem was convex. That over-parameterized quantum circuits can often be trained efficiently have also been motivated theoretically [46,47]. Finally, although there is a certain spread of success ratios across different template instances, dependence on the 2q gate count sets the dominating trend. This suggests that the success ratio is mostly determined by the template, not by the target. To confirm this intuition, we carried out additional experiments using random unitaries V instead of template instances as targets. The difficulty here is that the true value of the global minimum of D(U k CZ (a), V ) is not known, yet the presence of local minimums is still manifest, because different optimization runs tend to end up with significantly different loss values. We modify the definition of the success ratio in this case, by counting as successful all optimization runs that approached sufficiently closely the lowest value across all runs for a given target unitary (using the same cutoff as before D − D min ≤ 10 −4 ). Note that with this modified definition, the success ratio can never be zero (because there is always at least a single run with the lowest value). We see that in the regime when success ratios for random instances are sufficiently high, success ratios for random unitaries closely parallel them, both in mean and in deviation. In the regions where success ratios for random instances are very small or vanishing, success ratios for random unitaries are nonzero (they can not be by construction) but are close to zero. We expect them to drop further if more samples are accounted for. Overall, our experiments strongly suggest that local minimums are mostly determined by the templates and not by the targets.
It is also instructive to inspect not just the success ratios, but the distribution of loss values for different templates. The histograms at Fig.6(c) depict distributions of final loss values achieved by the opti-mization starting from random initial conditions for connected 4q templates U k CZ (a) with three different depths k. The first observation here is that most loss values are clustered near a mean value away from the global minimum (by construction, the global minimum has zero loss). Next, the quality of the local minimums increases as the depth (and hence expressivity) of the template grows, and the spread shrinks. This is in a remarkable agreement with recent analytic results [48,49], where the following asymptotic distribution of the density of critical points E 0 for Hamiltonian-agnostic variational loss functions was derived Here E is the variational loss function normalized to satisfy 0 ≤ E ≤ 1, m the dimension of the Hilbert space, and l the number of independent parameters. Ratio quantifies the expressivity of the circuit and crucially affects the distribution of critical points. For γ ≪ 1 most local minimums are far away from the global minimum and the loss function is hard to train. For γ > 1 the local minimums cluster exponentially close to the global minimum and the model is easy to train (yet it is of exponential depth). In Fig.6(c) we fit the loss histograms with the distribution (4). Using the Hilbert-Schmidt test (see e.g. [24]), the compilation problem on n qubits can be reformulated as the state preparation problem on 2n qubits. Hence we choose the dimension of the Hilbert space m = 2 8 . The expressivity parameters γ * are fitted in an ad hoc way, to visually match the histograms. The expressivity parameters γ (5) are also indicated in the plot, but distributions corresponding to γ do not align well with the histograms and are not shown. Of course, one should not expect a precise quantitative agreement between the distribution (4) and our empiric histograms. For one, (4) is an asymptotic statement valid for large system sizes. There are other assumptions going in the derivation of (4) that our setup may fail to satisfy. Nevertheless, we view the qualitative agreement between our empiric results and the theoretical analysis as a strong indication that the local minimums in the variational compilation are real, present a significant hindrance, and should be taken into account in any synthesis approach relying on numerical optimization of parametric gates.
Interestingly, Figs. 6(a,b) suggest that for the circuits with few parameters, the local minimums are not present. Also, we note that for the 4q circuits, success ratios initially drop more slowly than for 3q circuits. Studying the onset of local minimums, and it's scaling with the system size, is an interesting question that we do not address here.
We should mention that a success ratio is not only a function of the loss landscape, but also of the optimization algorithm and the distribution of the initial conditions. One popular optimization mode for quantum algorithms is a layerwise training [50], as it uses less computational resources and can mitigate the Barren plateaus [51]. However, these concerns are not relevant on the scale of our experiments and, moreover, the layerwise training is known to raise additional trainability issues in some cases [52]. There is a number of proposals to alleviate the problem of local minimums by the choice of optimizer [53,54], but in our experiments, none performed sufficiently better than simple ADAM [55]-based optimization to justify additional computational resources that are typically required by higher-order methods such as the natural gradient [56] or imaginary time evolution [57]. A recent empirical comparison of various optimization methods for quantum variational algorithms [58] also suggests that ADAM optimizer is often the simplest and most efficient choice.
In contrast, our experiments suggest that parametrization and/or distribution of the initial parameters can have noticeable effects. As explained in Sec. 2.3 template structure illustrated at Fig. 4 features redundant 1q gates. In any entangling block, two rotation gates can be removed without compromising circuit expressivity, i.e. without any shrinking in the space of all unitaries obtainable from the template. However, performance of the templates with the minimal number of 1q gates appears to be worse on average (though there are counter examples, see Sec. 5.2). This can be due to the fact that overparametrization favorably deforms the loss landscape and/or because the random uniform initialization of the angles produces a different distribution of the initial unitaries. In the present study, we do not attempt to disentangle the two possible effects and leave this important question for future work. Unless stated otherwise, reported results correspond to the 'XYZ' templates as per Fig. 4.

The CPFlow algorithm 4.1 Motivation and overview
In the context of variational synthesis, results of the previous section suggest that solving the continuous optimization problem may be just as difficult as solving the discrete architecture search: even if the structure of the template is a perfect match for the target unitary, finding the suitable angles may be very challenging. In the absence of an efficient way to solve the latter problem in our approach, we choose the brute force route of an extensive multi-start optimization.
Our second main technique is to relax the discrete architecture search to yet another continuous optimization. For illustration, consider the circuit at Fig. 7. Here, the 2q gates are the controlled phase gates (1), which interpolate between the identity gate CP(0) = I and the CZ gate CP(π) = CZ. For generic values of the angle, a single CP(a) gate can be decomposed into 2 CZ gates (plus 1q gates). Therefore, different values of parameters in CP gates in the template (7) effectively capture several different templates with the 2q CZ gates and training templates with the CP blocks can encompass both the architecture search and the tuning of continuous parameters, moreover performed in a coherent manner.
We can anticipate, however, that training CP templates directly will result in most CP gates having generic angles and hence effectively doubling the CZ count of the original template. To address this issue, we introduce an additional penalty term to the loss function that is intended to drive all CP angles to either 0 or π. The shape of the penalty function that we use is presented in Fig. 8.
This penalty is intended to drive all CP angles during the optimization to either 0 or π, and hence to reduce the CZ count of the resulting circuit. Moreover, at values a = 0, 1 2 π, 3 2 π, 2π the regularization term faithfully captures the CZ cost of the CP gate. We choose a simple linear interpolation between these values because piecewise-linear penalty functions are known to lead to the discrete decision-making in certain cases of continuous relaxation of discrete optimization problems such as sparcification of machine learning models [59], compressed sensing [60,61], and robust PCA [62] to name a few. For numerical stability, small plateaus near the reference values are added (empirically we find that CP angles often relax near 1 2 π, 3 2 π as well). An obvious problem with this regularization function is the presence of the local minimum at a = π. In fact, enumerating all local minimums associated with the regularization terms of a CP template is equivalent to the discrete search through all CZ templates that it can reduce to. However, our empirical results suggest that simultaneous optimization over CP and 1q angles is a very efficient strategy if the overall weight of the regularization term is chosen properly. If it is too small, the regularization term has little effect and the resulting decompositions tend to have a high CZ count. When the weight is too high, the CP angles effectively get captured by the closest local minimum and the flexibility of our strategy is lost. In fact, optimization with a high regularization weight could be considered a version of a random search over the architectures (if the initial CP angles are chosen randomly) and performs significantly worse than optimization with a properly tuned regularization weight.

Procedure
We formulate the general synthesis problem as follows. Let L(U ) be the loss function to be minimized Figure 7: Template 3q circuit U 3 CP on a connected topology.
The first term is the value of the original loss function evaluated with the variational circuit as the argument. The second term is the regularization term, summing the CP penalties for all CP angles in the template. The number of 2q gates k and the overall regularization weight r are two of the most important hyperparameters of the model. Three main stages of the algorithm are described below. Precise details are given in Sec.4.3.

Static synthesis
1. Raw sampling. Loss function (6) is minimized starting from many initial conditions (num_samples). For each sample, both the CP angles and the angles of 1q gates are generated uniformly and independently at random.

2.
Selecting prospective results. Results of the first step are filtered based on two criteria (i) the original loss function L(U k CP ) must be below a given threshold (entry_loss) and (ii) the number of CZ gates in a projected CP circuit must be below a specified value (accepted_num_cz_gates). Condition (i) means we only accept circuits that are close enough to the global minimum, while (ii) rejects decompositions with too high CZ count. Projection from CP to CZ circuits U k CP (a) → U k ′ CZ (a ′ ) is preformed by rounding off angles of CP gates that are sufficiently close (within threshold_cp) to 0 or π and substituting other CP gates with their CZ decompositions.
3. Verification. At this stage the projected circuits contain only CZ gates and the regularization term is removed. For each prospective CZ circuit, the original loss function L(U k ′ CZ (a)) is further optimized starting from initial angles a ′ inherited from the CP circuit. The verification is considered successful if the CZ circuit reaches a more stringent loss threshold (target_loss).
This basic scheme can be modified in many ways: by choosing a different regularization function, different sampling of the initial angles or altering the details of the gradient based optimizer to name a few. We have mostly experimented with varying two hyperparameters that are evidently crucial, the number of CP gates k and the regularization weight r. Heuristically, we find that a reasonable number of CP gates is usually between k 0 and 2k 0 , where k 0 is the expected optimal CZ count of the decomposition. A performant choice for the regularization weight r for loss functions normalized so that 0 ≤ L(U ) ≤ 1 is r = 5 × 10 −4 . There could be exceptions to both these rules of thumb. To make better choices of hyperparameters on a case by case basis, we use the Bayesian tuning algorithm provided by the Hyperopt package [63].
Tuning of hyperparameters is significantly hindered by the fact that the loss function is stochastic. Taking sufficiently many samples to reliably estimate the quality of a hyperparameter configuration may cost too much computational resources, while not taking enough can make the acquired data too noisy to be useful. On the positive side, the ultimate goal is not to find the best hyperparameters, but rather to find the best decompositions which routinely occur at suboptimal points as well. The routine including hyperparameter tuning can be summarized as follows.

Adaptive synthesis
1. Defining the search space. Choose distributions to draw the number of gates k and the regularization weight r from. Typically, we use uniform distribution for k in some integer range and lognormal distribution for r with mean around 5 × 10 −4 and standard deviation 0.5.

2.
Evaluating the score function. Draw a sample k, r from the hyperparameter distribution according to the Hyperopt algorithm. Execute steps 1 and 2 from the static routine. Any CZ count is accepted at this stage, i.e. the results are only selected by the value of the original loss function L(U k CP (a)). Let k 1 , k 2 , . . . be CZ counts of all prospective results selected. We define the score function (reminiscent of softmin) by where N is the total number of raw samples. The intuition is as follows. Templates that are too expressive and have high regularization weight will yield many high-fidelity decompositions with excessive CZ counts. Imposing stricter hyperparameters will yield more efficient decompositions, but also more raw samples failed to converge. Function (7) balances between these scenarios by averaging CZ counts of accepted decompositions, weighting them exponentially, i.e. a single decomposition with k CZ gates scores as two decompositions with k + 1 or four with k + 2. The minimum of this function is k 0 , the CZ count of the best possible decomposition. It could be achieved if all raw samples reach the threshold fidelity and have the optimal CZ count k = k 0 . If some of the raw samples fail to converge or require higher than optimal CZ count, the score function increases. The maximum score = +∞ is obtained when none of the raw samples passed the fidelity threshold.
3. Verifying best decompositions. At the previous stage, prospective decompositions are usually not verified as the verification process is time-consuming and should not significantly alter the score estimation (some of the decompositions may fail to pass the verification, but this is rare). However, if there are prospective decompositions that improve on the current best they are verified and if accepted, the current best is replaced.

4.
Repeat. Repeat steps 2 and 3 until either the maximum number of score evaluations is reached or a decomposition with the desired number of gates is found.
As the result of the adaptive routine, one narrows down the space of good hyperparameters for the problem and collects several efficient decompositions found in the process. This may already be sufficient for the end goal, or provide a good starting point to generate more decompositions using the static routine with appropriately chosen hyperparameters. Although the algorithm directly targets only minimization of the CZ gate count, generating many similar decompositions allows one to further select by other criteria such as CZ depth, or even T count and T depth. We will illustrate this process in Sec.5.1.

Technical details
The static routine implemented in CPFlow proceeds as follows. First num_samples of initial angle combinations are generated uniformly at random. Learning at the raw sampling stage proceeds with the ADAM optimizer with learning_rate 0.1 ran for num_gd_iterations (2000 by default). At the selection stage for each sample, the best configuration of angles a * is chosen corresponding to the minimum of the regularized loss function (6) across all iterations. If the primary loss function at this configuration of angles L(U k CP (a * )) is below the entry_loss threshold (10 −3 by default) the CP circuit is projected to the CZ . Projection is performed as follows. CP gates with angles lying within the cp_threshold (0.2 by default) distance away from 0 or π are replaced by the identity and CZ gates respectively. CP gates that were not replaced at the previous stage are decomposed into 2 CZ decomposition using standard methods. If the 2q gate count k ′ of the resulting CZ circuit is below accepted_num_cz_gates, the circuit is deemed prospective.
At the verification stage, the original loss function L(U k ′ CZ (a)) is further optimized with the ADAM optimizer with a smaller learning_rate_at_verification (0.01 by default) ran for an increased number of iterations num_gd_iterations_at_verification (5000 by default). Importantly, the new optimization starts with the initial angles a ′ obtained after projecting the CP circuit. If the new optimization pass reaches a more stringent target_loss threshold (10 −6 by default) the verification is considered successful and the resulting circuit is added to the collection of decompositions.
The parameters specified above work well with the Hilbert-Schmidt loss function (1) in a sense that the majority of prospective decompositions pass at the verification stage. For other normalizations/shapes of the loss function, different parameter specifications might be required.
The adaptive search basically consists of several static rounds. At each round the score function (7) is computed from the prospective results. If there is a prospective result that improves the current best decomposition, it is verified and added to the decomposition pool if successful. Otherwise, the verification stage is omitted and a new static round with altered hyperparameters is initiated. The total number of rounds is controlled by max_evals (100 by default). Hyperparameters for each round are chosen by the Hyperopt implementation of the tree-structured Parzen Estimator algorithm [63] based on the previous score evaluations and input parameter distributions specified by the user. We use the uniform distribution for the number of gates k in the range between min_num_cp_gates and max_num_cp_gates. For the regularization weight r we use lognormal distribution with default values r_mean=5.5 × 10 −4 and r_variance=0.5. Note that by the default that we keep, the first 20 parameter choices in Hyperopt are intended to sample broadly from the search space and do not depend on the previous score evaluations (only on the input distributions), the actual optimization starts at further steps. However, with a good choice of the initial distributions optimal or near optimal decompositions are often found by CPFlow within several first evaluations.

Computational setup
CPFlow [64] is written entirely in Python, with the computational efficiency enabled by the JAX library [65]. One advantage of CPFlow is that the computations are highly parallelizable as the core of both basic routines consists in performing independent multi-start optimizations. We also rely on Qiskit [7] for visualization, validation, and postprocessing. Numerical experiments reported in this paper were carried out on a server equipped with a 16 GB NVIDIA Quadro RTX 5000 GPU. A single static routine with 1000 samples for 4q and 5q unitaries took several minutes in our setup. Correspondingly, a typical adaptive routine with 100 evaluations using 1000 samples each took several hours.

Synthesis of Toffoli gates
We now put the CPFlow algorithm to work. We choose the Toffoli gates as our key benchmark examples, motivated by several considerations. First, the Toffoli gates are among the most essential building blocks for a great variety of quantum algorithms. Next, large multi-controlled Toffoli gates can be built recursively from the smaller ones [9], hence optimization of the small-size Toffoli gates can potentially be propagated to the large multi-controlled gates required in useful quantum algorithms. Lastly, Toffoli gates have been studied extensively [9,66,67,68,69] and although deriving rigorous bounds beyond the 3q case is very difficult, the best existing decompositions  are likely to indeed be optimal and hence provide a perfect benchmark. The basic Toffoli gate, also known as the Controlled-Controlled-NOT or the C 2 X gate, is depicted as follows.
• • The right diagram represents the Toffoli gate as the C 2 Z gate conjugated by the two Hadamard gates (HZH = X). The C 2 Z gate itself is represented by a diagonal matrix C 2 Z= diag(1, 1, 1, 1, 1, 1, 1, −1) and is symmetric with respect to all qubits. Therefore, up to a conjugation by 1q gates the Toffoli gates (C 2 X as well as any C n X) are also symmetric. This implies that even if the symmetry between the qubits is broken by e.g. a non-trivial topology, the choice of the target qubit for the Toffoli gate is not relevant for our synthesis problem.

3q Toffoli
In this subsection, we use CPFlow to find efficient decompositions of the 3q Toffoli gate and illustrate many important features of the algorithm along the way. 3q Toffoli gate can be decomposed into 6 CZ gates on the fully connected topology or 8 CZ gates on the chain topology. Using CPFlow we were able to find many inequivalent decompositions with these optimal CZ counts. First, we ran the adaptive routine with 100 evaluation steps to identify the best hyperparameters in each case. Results are reported at Fig. 9.
Best hyperparameters for each topology are shown in Table 1. Note that at this stage decompositions with the optimal CZ counts have already been found, but it is instructive to further analyze the performance of the algorithm and generate more decompositions. To this end, we ran the static routine with optimal hyperparameters and 100 samples for each topology. The third column in Table 1 states how many of the initial conditions led to the decompositions with the optimal CZ count. For connected and chain topologies, chances of finding optimal decompositions are roughly 30% and 20% respectively. These figures are to be compared with the success ratios at the corresponding gate counts shown at Fig. 6, which are of order

Connected topology
Chain topology Figure 10: Decompositions of the 3q Toffoli gate that are likely to be optimal with respect to all four metrics: CZ depth, CZ count, T depth, T count. Rotation gates are shortened from Rσ to σ for readability. Non-Clifford gates, each obtainable from a single T gate, are highlighted. Figure 11: Decomposition of the 4q Toffoli gate on the star-shaped topology (all CZ gates touch the uppermost qubit) with 16 CZ gates.
10% and 2% respectively. The remarkable conclusion is that (at least in this simple example) our strategy of coherent learning of the architecture together with continuous parameters seems to be unreasonably efficient. With the best hyperparameters, the frequency of finding CZ count optimal decompositions is greater than the mean success ratios for reaching the global minimum for fixed architectures.
Decompositions which are initially generated by CPFlow are approximate and bear little resemblance to the standard Clifford+T representations, where 1q gates are expressible as rotation gates with angles being rational multiples of π. We develop a simple procedure that attempts to eliminate redundant gates, rationalize remaining ones, and if possible expand the circuit into Clifford+T gate set. Details of the procedure are deferred to App. A. The procedure is heuristic and does not guarantee elimination of all spurious angles, but often works well in practice. Applying it to the decompositions found at the static stage, we were able to generate 12 Clifford+T decompositions of 3q Toffoli gates for each topology. We then sorted these decompositions according to four metrics : CZ count, CZ depth, T count and T depth. For both topologies decompositions with the smallest T depth were simultaneously optimal with respect to the three remaining metrics. Fig. 10 depicts the best decompositions that were generated. Note that the decomposition on the chain topology has T depth 3 and is possibly a new result.
By construction, the distance D of decompositions reported in Tab. 1 and depicted in Fig. 10 to the exact 3q Toffoli gates satisfies D < 10 −6 , approaching the machine precision in our setup. It is natural to assume, that decompositions with rational angles, like those in Fig. 10, are in fact exact. Since all matrix elements of the rationalized decompositions are polynomials in e iπp/q (p, q ∈ Z) with integer coefficients, it must be possible to reduce the unitary of the decomposition to the target unitary using basic algebraic manipulations. Using symbolic algebra software external to CPFlow we verified that circuits in Fig. 10, as well all other decompositions presented explicitly in the paper (Figs. 11,15,14) are exact. Directly integrating this verification into CPFlow is left for future work.

4q Toffoli
We now proceed to the decomposition of the 4q Toffoli gates, which are significantly more challenging. Variational synthesis of the 4q Toffoli gate on various 4q topologies has been recently addressed in Ref. [38]. The approach adopted there was that of an exhaustive search over all architectures. It is interesting to note that for exhaustive search connectivity restrictions actually simplify the problem. Using CPFlow we were able to reproduce all results presented in Ref. [38], see Topology   CZ count  14  14  16  16  18  CZ depth  11  13  15 16 14   Table 2. Moreover, for the star-shaped topology we achieved a minor improvement, reducing the CZ count from 17 to 16. The corresponding circuit is depicted at Fig. 11. Note that we did not look for decompositions minimizing CZ depth, but simply report CZ depths of the first CZ count optimal decompositions found during the search.
In all except for the fully connected topology, decompositions were discovered by the adaptive algorithm with the full range of template depth for 4q unitaries (0, 61), 500 samples at each hyperparameter configuration, and 50 hyperparameter evaluations. The clock time taken by the search for each topology was about 40 minutes on our server 4.4. In contrast, the optimal decomposition on the fully connected topology only appeared after about 200 hyperparameter evaluations and took about 2 hours.
One might wonder why did the exhaustive search approach of Ref. [38] overlook the decomposition on the star topology with 16 CZ gates. A plausible cause might again be due to the local minimums. Table 3 shows empirical success ratios for the optimal circuits found by CPFlow on the fully connected and star topologies with two different choices of 1q gates: XYZ and XZ. First thing to note is that the success ratio for the star topology and XZ structure of 1q gates is only about 2%. Work [38] indeed used the XZ template, albeit with a different optimization procedure [70]. We find it likely that the appropriate architecture was missed simply due to an insufficient amount of trials. In fact, the problem of local minimums deprives of guarantees even the exhaustive search over architectures Another interesting observation is that for a connected circuit the success ratios of XYZ and XZ templates differ by an order of magnitude. This highlights the importance of the parametrization and/or initial sampling. Figure 12: A decomposition of the 5q Toffoli gate.

5q Toffoli
To our knowledge, the best decomposition of the 5q Toffoli gate on the fully connected topology without ancilla qubits features 30 CZ gates, an upper bound valid for all diagonal gates [40]. The best result of direct synthesis with CPFlow running for several hours that we observed featured 36 CZ gates, indicating that 5q unitaries with sufficiently many gates are significantly harder to address. However, synthesis with topological restrictions also poses significant challenges to the demultiplexing framework of Ref. [40]. It is thus interesting to see if our numerical routine can be useful.
To this end, we consider the compilation of the 5q Toffoli gate on the chain topology. As a reference point we take the best result achieved by the Qiskit transpiler out of 1000 runs with optimization_level set to 3, that yielded a decomposition with 61 CZ gates. 2 The best results of direct synthesis with CPFlow yielded a decomposition with 69 CZ gates. As we now show, by using a combination of the analytic and numerical techniques this result can be improved to 48 CZ gates.
Our strategy is to reduce synthesis of the 5q Toffoli gate to the synthesis of several 4q blocks. We first observe that the optimal 30 CZ gate decomposition of the 5q Toffoli gate on a fully connected topology can be obtained from the left circuit depicted in Fig. 12. Triply controlled √ X gate is a diagonal gate up to a conjugation by the Hadamard gates and hence can be decomposed using 14 CZ gates, just as the standard 4q Toffoli gate. Singly controlled √ X gates can be decomposed into 2 CZ gates each. Finally, the boxed 4q Toffoli gates can be replaced by their relative phase counterparts [67], each requiring only 6 CZ gates. In total, this gives 30 CZ gates, the best known amount.
We now try to adapt this decomposition to the chain topology. The right circuit in Fig. 12 shows that 2 Note that the transpilation process in Qiskit generally yields circuits which are only equal to the target unitary up to a possible permutation of qubits. As restoring the original permutation might be an expensive operation in terms of the CZ count, the transpilation results should be considered as a lower bound. by inserting four additional CNOT gates (each pair originates from a SWAP gate, the two closest CNOT gates cancel each other) around the triply controlled √ X gate we can place all 4q gates on the first four qubits. Next, we use CPFlow to decompose C 3 √ X and relative phase C 3 X gates on the chain topology. We found a decomposition of C 3 √ X with 18 CZ gates, the same gate count that is needed for decomposing of the C 3 X gate on a chain topology 2. For the relative phase C 3 X gate, we found a decomposition with 11 CZ gates. Details and corresponding circuit diagrams are delegated to App.B. In the end, this yields a decomposition of the 5q Toffoli gate on the chain topology with 48 = 2 × 11 + 18 + 2 × 4 CZ gates in total. We are not aware of other methods improving this count.

Further benchmarks
Following [34], we test the performance of CPFlow on a range of standard benchmark circuits from the ibm_qx database [72,71]. In Ref. [34] an extensive comparison between packages SQUANDER , QFast , and QSearch was performed. Provided enough time, SQUANDER reliably outperformed other packages in most examples. From each of the Tables 1, 3 and 4 presented in Ref. [34] we pick 5 circuits with the highest CZ count found by SQUANDER , as these are likely to be the most challenging and hence the most informative (for the same reason we skipped Table 2, which mostly contains much simpler circuits).
All selected examples are 5q circuits. As before, we consider the compilation successful if the distance to the target unitary (1) is less than 10 −6 , which is approaching the machine precision in our setup.
We must note that currently CPFlow does not have a dedicated subroutine to estimate the expected target complexity and narrow the hyperparameter window accordingly. Using the adaptive routine with the full range of gate counts allowed by the theoretical lower bound (2) already for 5q circuits is unnecessarily time-consuming. Informed by the gate counts obtained by SQUANDER we ran the adaptive routine with CP counts of templates in the range from  Table 4: CNOT counts of circuits from imb_qx set [71] synthesized with CPFlow , SQUANDER , QSearch and a hybrid combination QFast+Qiskit+SQUANDER . Bars indicate either a failure to synthesize a circuit or the absence of data. Results of CPFlow were obtained via the adaptive routine with the following options: min_num_cp_gates=20, max_num_cp_gates=100, num_samples=1000, max_evals=100 except for the gate counts marked with an asterisk. The latter were obtained under a different option set: min_num_cp_gates=40, max_num_cp_gates=60, num_samples=2000, max_evals=100. Results of the other software packages are reproduced from [34].
20 to 100 for 100 evaluations with 1000 samples each. Results are reported in Table 4 (CZ counts marked by asterisks are explained below).
Results for group I, targeting synthesis of lower complexity circuits on the fully connected topology, show a significant compression achieved by CPFlow compared to SQUANDER , averaging to approximately 25%. In contrast, for circuits of similar complexity on the chain topology, group II, the difference between CPFlow and SQUANDER is less noticeable, averaging to a 10% additional compression. It would be interesting to understand the role of topology in either approach in more detail. The last group III consists of the circuits that the SQUANDER package failed to synthesize on its own, being unable to generate initial templates for compression [34]. In our view, the reason is likely to be rooted in the local minimums problem, which grows more acute with an increasing gate count. The authors of Ref. [34] proposed an interesting workaround to generate initial templates using other software packages (Qiskit + QFast ) and then further compress them with SQUANDER . The resulting gate counts are reported in the last column of Table 4. The last two circuits in this group were synthesized by CPFlow along with the circuits from groups I and II and turned out to have far lower complexity than SQUANDER results suggest. On the other hand, the first three circuits indeed proved to be the hardest to synthesize, and in fact CPFlow found only poor or no decompositions at all for these circuits with the original search options. This lead us to initiate a second adaptive optimization with a narrower gate range (min_num_cp_gates=40, max_num_cp_gates=60) and increased amount of samples (num_samples=2000). Eventually, acceptable and even apparently efficient decompositions were found by CPFlow yielding an average compression of 25%. Yet, it also became apparent that gate counts above 40 are a very challenging target for the algorithm, with most trials yielding no prospective circuits at all.
We need to stress that figures in Table 4 should not be taken as an accurate performance comparison between the algorithms (and hence we do not report the consumed resources and runtimes). First, the lowlevel implementation and the processor that we used are very different from those employed in Ref. [34]. Second, we relied on the results obtained in Ref. [34] to tune hyperparameters of CPFlow in advance. Precise comparison should also have a clear metric such as the maximum compression efficiency, faster runtime, scalability etc. Our primary objective was to minimize gate counts of synthesized circuits, the task that CPFlow addressed with a promising efficiently and within a relatively short time frame.

Summary and outlook
In this paper, we presented a new approach to the variational synthesis into CNOT plus 1q gate set. We identified the problem of local minimums as a crucial yet underappreciated obstacle that needs to be addressed. In the absence of an efficient way to avoid local minimums we have made an extensive exploration of the initial conditions space an integral part of our scheme. We also proposed to use parametric 2q gates as a way to unify the architecture search with the continuous optimization of 1q gates into a single coherent optimization and demonstrated its efficiency. A recent work based on a similar idea [34] also showed significant improvement over more standard discrete architecture search [27].
Another contribution of this work is a technique to promote approximate numerical circuits to exact decompositions. As an example, for the Toffoli gates studied in this paper we found that often efficient decompositions can be translated into circuits with angles being rational multiples of π and verified to yield the exact synthesis. The technique is basically a post-processing of the numerical synthesis results and can be adopted by other approximate synthesis frameworks, so we believe it should be of broader interest.
While capable of generating many interesting results our method has its limitations. The first is rather fundamental and common to all similar schemes. Since the input circuit is represented by means of the corresponding unitary matrix, only small scale circuits that are easy to simulate classically can be addressed. This however does not preclude using variational synthesis to optimize smaller building blocks of large scale useful quantum algorithms [33]. Exploring this direction is an important and practically relevant avenue for future work.
Scaling up the variational approach within the classically accessible regime appears to be mostly limited by the circuit complexity rather than the qubit count, e.g. it seems to be more difficult to efficiently compile a 4q circuit that requires many 2q gates than it is to compile a 5q circuit that can be represented using only a small amount of 2q gates. Importantly, the challenges associated with the higher complexity are not caused only by the combinatorial growth of the possible architectures alone, but also by the proliferation of the local minimums in the loss landscape. Our empirical results (cf Fig. 6) indicate that already for 4q circuits in the range from 12 to 50 2q gates the probability of a successful continuous optimization is less than 0.1% even for a correctly chosen architecture. This probability depends on numerous factors including details of the optimization procedure, distribution of the initial conditions and parametrization of the loss landscape. Detailed understanding of these mechanisms may lead to a dramatic improvement in variational compilers' efficiency.
There are numerous further possibilities to enhance variational synthesis of high-complexity unitaries. For example, if the unitary originates from a known quantum circuit, it could be possible to take advantage of this information. Splitting the original circuit in parts, each having lower complexity, and synthesising them separately may lead to better results. Another proposal is to use the original circuit as the starting template for the variational compression [34]. A recent work [73] has shown that modifying template architectures on the go can help to reduce both global minimums and barren plateaus. It would be interesting to see if some of the circuits generated by CPFlow, which have efficient gate count but insufficient fidelity, can serve as a useful starting point for the "burrowing" procedure suggested in [73]. Eventually, the viability and resource allowance of the variational synthesis must be justified by the payoff if provides for useful applications.

A Circuit refinement
Quantum circuits that result from numerical optimization performed by CPFlow typically have many redundant 1q gates, see Fig. 13(a) for an example.We propose a simple refinement procedure that helps to find a simpler representation for such circuits.
First, we test if an angle a can be set to 0 without affecting the loss function. For instance, in the state preparation problem the initial round of R Z gates has no effect when applied to the all-zero state and hence can be simply omitted.
Next, we check if there are cancellations between pairs of rotation gates (not necessarily adjacent). As a function of a single angle any parametrized quantum circuit has the following simple form U (a) = U 0 cos a + U 1 sin a (9) with some unitary matrices U 0 , U 1 . In turn, as a func- (c) Rationalized circuit X ( 3π 4 ) • Z (−π) X ( 3π 4 ) • Z (−π) X ( π 4 ) • Z (−π) X ( −π 4 ) Z (π) Figure 13: Refinement of the approximate decomposition of the 3q Toffoli gate into an exact result.
Typically, all terms in this expression are nonvanishing and different choices of a 1 and a 2 correspond to different unitaries (up to discrete redundancies). It may happen, however, that either the unitary U itself or the loss function of interest L(U ) does not contain terms with a 1 + a 2 or a 1 − a 2 . For instance, for the two consecutive R X rotations R X (a 1 )R X (a 2 ) only the terms that depend on the sum of angles are present in (11). Hence, for every pair of angles (it is usually sufficient to only consider angles of gates acting on the same qubit) we check if L(U (0, a 2 ± a 1 )) = L (U (a 1 , a 2 )) and if such pair is found, the first angle is set to 0 and the second angle is adjusted accordingly. Results of this step are illustrated at Fig. 13(b). Finally, we check if the resulting angles of the 1q gates can be approximated by the rational multiples of π without compromising the accuracy of the loss function L(U ) (in fact the accuracy is often improved at this step), see Fig. 13(c).
The steps outlined above are heuristic and do not always lead to expected results, but often work well in practice. All circuits reported in this paper were obtained automatically in this fashion. Heuristically we find that when decompositions are close to optimal, the refinement procedure works best. Possibly this can be attributed to the fact that extra gates allow for more complicated redundancies in the circuit that are not accounted for by our simple steps. Also, the procedure works best when the loss function is the most restrictive as in the compilation problem, when only the global phase of the unitary is not defined. When the loss function is more permissive, such as in state preparation, further steps usually need to be taken to eliminate all redundant gates. B 4q gates featuring in the decomposition of the 5q Toffoli gate In Sec. 5.3 we constructed a decomposition of the 5q Toffoli gate on the chain topology with 48 CZ gates. This decomposition used efficient representations for the square root of the 4q Toffoli gate C 3 √ X and a relative-phase 4q Toffoli gate on the chain 4q topology. Decomposition of C 3 √ X with 18 CZ gates can be found using standard methodology described in Sec.5.2. The resulting circuit is depicted at Fig. 15. To find a relative-phase Toffoli gate one needs to use a non-standard loss function. By definition [67], U is a relative phase Toffoli gate if U = V D, where V is the unitary matrix of the Toffoli gate and D is a diagonal unitary matrix. To construct the corresponding loss function we can use the fact that the sum i |D ii | 2 for a unitary matrix D has the maximum value 2 n iff D is diagonal. Hence, the following loss reaches its minimum iff U is a relative phase Toffoli gate With this loss function and standard parameter specifications for the 4q circuits used in this work CPFlow generated a relative phase 4q Toffoli gate on the chain topology with 11 CZ gates. The circuit is shown at Fig.14.