Quantum-assisted quantum compiling

Compiling quantum algorithms for near-term quantum computers (accounting for connectivity and native gate alphabets) is a major challenge that has received significant attention both by industry and academia. Avoiding the exponential overhead of classical simulation of quantum dynamics will allow compilation of larger algorithms, and a strategy for this is to evaluate an algorithm's cost on a quantum computer. To this end, we propose quantum-assisted quantum compiling (QAQC). In QAQC, we use the overlap between a target unitary $U$ and a trainable unitary $V$ as the cost function to be evaluated on the quantum computer. More precisely, to ensure that QAQC scales well with problem size, our cost function involves not only the global overlap ${\rm Tr} (V^\dagger U)$ but also the local overlaps with respect to individual qubits. We introduce novel short-depth quantum circuits to quantify the terms in our cost function, and we present both gradient-free and gradient-based approaches to minimizing this function. As a demonstration of QAQC, we compile various one-qubit gates on IBM's and Rigetti's quantum computers into their respective native gate alphabets. Future applications of QAQC include algorithm depth compression, black-box compiling, noise mitigation, and benchmarking.


I. INTRODUCTION
Factoring [1], approximate optimization [2], and simulation of quantum systems [3] are some of the applications for which quantum computers have been predicted to provide speedups over classical computers. Consequently, the prospect of large-scale quantum computers has generated interest from various sectors, such as the financial and pharmaceutical industries. Currently available quantum computers are not large-scale but rather have been called noisy intermediate-scale quantum (NISQ) computers [4]. A proof-of-principle demonstration of quantum supremacy with a NISQ device may be coming soon [5,6]. Nevertheless, demonstrating the practical utility of NISQ computers appears to be a more difficult task.
While improvements to NISQ hardware are continuously being made by experimentalists, quantum computing theorists can contribute to the utility of NISQ devices by developing software. This software would aim to adapt textbook quantum algorithms (e.g., for factoring or quantum simulation) to NISQ constraints. NISQ constraints include: (1) limited numbers of qubits, (2) limited connectivity between qubits, (3) restricted (hardware-specific) gate alphabets, and (4) limited circuit depth due to noise. Algorithms adapted to these constraints will likely look dramatically different from their textbook counterparts.
These constraints have increased the importance of the field of quantum compiling. In classical computing, a compiler is a program that converts instructions into assembly language so that they can be read and executed by a computer. Similarly, a quantum compiler would take a high-level algorithm and convert it into a lower-level form that could be executed on a NISQ device. Already, a large body of literature exists on classical approaches for quantum compiling, e.g., using temporal planning [7,8], machine learning [9], and other techniques [10-17].
* The first three authors contributed equally to this work.
A recent exciting idea is to use quantum computers themselves to train parametrized quantum circuits, as proposed in Refs. [2,18-22]. The cost function to be minimized essentially defines the application. For example, in the variational quantum eigensolver (VQE) [18] and the quantum approximate optimization algorithm (QAOA) [2], the application is ground state preparation, and hence the cost is the expectation value of the associated Hamiltonian. Another example is training error-correcting codes [19], where the cost is the average code fidelity. In light of these works, it is natural to ask: what is the relevant cost function for the application of quantum compiling?
In this work, we introduce quantum-assisted quantum compiling (QAQC, pronounced "Quack"). The goal of QAQC is to compile a (possibly unknown) target unitary to a trainable quantum gate sequence. The cost function we use is given by the Hilbert-Schmidt inner product between the target unitary and the trainable gate sequence.
A key feature of QAQC is the fact that the cost is computed directly on the quantum computer. This leads to an exponential speedup (in the number of qubits involved in the gate sequence) over classical methods to compute the cost, since classical simulation of quantum dynamics is exponentially slower than quantum simulation. Consequently, one should be able to optimally compile larger-scale gate sequences using QAQC, whereas classical approaches to optimal quantum compiling will be limited to smaller gate sequences [23].
arXiv:1807.00800v2 [quant-ph] 25 Jul 2018
The circuit used to evaluate the cost in QAQC can itself be viewed as a quantum algorithm. This cost-evaluation circuit should ideally have the shortest possible depth, since noise in NISQ hardware will lead to inaccurate cost values for deep circuits. Our key technical contribution is to present two novel short-depth circuits for cost evaluation. We propose using one of these circuits for gradient-free QAQC (where the circuit evaluates the cost only) and the other for gradient-based QAQC (where the circuit evaluates the cost and the gradient of the cost).
Both of our circuits compute the Hilbert-Schmidt inner product between a target unitary $U$ and a trainable unitary $V$, $\langle V, U\rangle = {\rm Tr}(V^\dagger U)$. Our first circuit, which we call the Hilbert-Schmidt Test, computes the magnitude of this inner product, $|\langle V, U\rangle|$. It achieves short depth by avoiding implementing controlled versions of $U$ and $V$, and by implementing $U$ and $V$ in parallel. Our second circuit computes the real and imaginary parts of $\langle V, U\rangle$. It is closely related to the Power of One Qubit (POOQ) [24], a circuit that computes the trace of a unitary $U$ by implementing the controlled-$U$ gate with a single ancilla. In fact, it directly generalizes the POOQ, using two ancillas, and hence we call it the Power of Two Qubits (POTQ). Although the POTQ requires controlled versions of $U$ and $V$, a key feature that keeps the depth short is that these controlled gates are performed in parallel. Interestingly, setting $V$ to the identity in POTQ recovers the original POOQ.
When incorporated into QAQC, both of our circuits provide an exponential speedup over quantum compiling on a classical computer. Below we give detailed descriptions of our gradient-free and gradient-based methods for QAQC, which, respectively, employ our Hilbert-Schmidt Test and our POTQ circuits. Our QAQC method for training a gate sequence V involves optimizing over both the structure of the gate sequence (types of gates and their locations) as well as the continuous internal parameters inside the gates. Hence, QAQC involves solving a hybrid discrete-continuous optimization problem.
As a proof-of-principle, we implement our gradient-free QAQC on both IBM's and Rigetti's quantum computers, and we compile various one-qubit gates to the native gate alphabets used by this hardware. To our knowledge, this is the first compilation of a target unitary with cost evaluation on actual NISQ hardware. In addition, we compile multi-qubit gates using a simulator with both our gradient-free and gradient-based methods. Although the noise level of current NISQ hardware prevents us from compiling multi-qubit gates with our QAQC method, we are optimistic that slight improvements in the noise will enable this application.

FIG. 1. Potential applications of QAQC. In the circuit diagrams, one symbol denotes the z-rotation gate $R_z(\theta)$, while another represents the $\pi/2$-pulse given by the x-rotation gate $R_x(\pi/2)$. Both gates are natively implemented on commercial hardware [25,26]. (a) Compressing the depth of a given gate sequence $U$ to a shorter-depth gate sequence $V$ in terms of native hardware gates. (b) Uploading a black-box unitary. The black box could be an analog unitary $U = e^{-iHt}$, for an unknown Hamiltonian $H$, that one wishes to convert into a gate sequence to be run on a gate-based quantum computer. (c) Training algorithms in the presence of noise to learn noise-resilient algorithms (e.g., via gates that counteract the noise). Here, the unitary $U$ is performed on high-quality, pristine qubits and $V$ is performed on noisy ones. (d) Benchmarking a quantum computer by compiling a unitary $U$ on noisy qubits and learning the gate sequence $V$ on high-quality qubits.
In what follows, we first discuss several applications of interest for QAQC. Section III presents our main results: our short-depth circuits for cost evaluation on a quantum computer. Section IV outlines our gradient-free and gradient-based methods for QAQC. Finally, in Secs. V and VI, respectively, we discuss the implementations of our compiling methods on quantum hardware and on a simulator.

II. APPLICATIONS OF QAQC
Figure 1 illustrates four potential applications of QAQC. Suppose that there exists a quantum algorithm to perform some task, but its associated gate sequence is longer than desired. As shown in Fig. 1(a), it is possible to use QAQC to shorten the gate sequence by accounting for the NISQ constraints of the specific computer. This depth compression goes beyond the capabilities of classical compilers.
As a simple example, consider the quantum Fourier transform on $n$ qubits. Its textbook algorithm is written in terms of Hadamard gates and controlled-rotation gates [27], which may need to be compiled into the native gate alphabet. The number of gates in the textbook algorithm is $O(n^2)$, so one could use a classical compiler to locally compile each gate. But this could lead to a sub-optimal depth since the compilation starts from the textbook structure. In contrast, QAQC is unbiased with respect to the structure of the gate sequence, taking a holistic approach to compiling as opposed to a local one. Hence, it can learn the optimal gate sequence for given hardware. Note that classical compilers cannot take this holistic approach for large $n$ due to the exponential scaling of the matrix representations of the gates.
Alternatively, consider the problem of simulating the dynamics of a given quantum system with an unknown Hamiltonian $H$ (via $e^{-iHt}$) on a quantum computer. We call this problem black-box uploading because by simulating the black box, i.e., the unitary $e^{-iHt}$, we are "uploading" the unitary onto the quantum computer. This scenario is depicted in Fig. 1(b). QAQC could be used to convert an analog black-box unitary into a gate sequence on a digital quantum computer.
Finally, we highlight two additional applications that are the opposites of each other. These two applications can be exploited when the quantum computer has some pristine qubits (qubits with low noise) and some noisy qubits.
Consider Fig. 1(c). Here, the goal is to implement a CNOT gate on two noisy qubits. Due to the noise, to actually implement a true CNOT, one has to physically implement a dressed CNOT, i.e., a CNOT surrounded by one-qubit unitaries. QAQC can be used to learn the parameters in these one-qubit unitaries. By choosing the target unitary U to be a CNOT on a pristine (i.e., noiseless) pair of qubits, it is possible to learn the unitary V that needs to be applied to the noisy qubits in order to effectively implement a CNOT. We call this application noise-tailored algorithms, since the learned algorithms are robust to the noise process on the noisy qubits. Figure 1(d) depicts the opposite process, which is benchmarking. Here, the unitary U acts on a noisy set of qubits, and the goal is to determine what the equivalent unitary V would be if it were implemented on a pristine set of qubits. This essentially corresponds to learning the noise model, i.e., benchmarking the noisy qubits.

III. OUR CIRCUITS
Here we present our main results: short-depth circuits for evaluating the cost in Eq. (1). We note that these circuits are also interesting outside of the scope of QAQC, and they likely have applications in other areas.

A. Hilbert-Schmidt Test
Consider the circuit in Fig. 2. Below we show that this circuit computes $|{\rm Tr}(V^\dagger U)|$, where $U$ and $V$ are $d$-dimensional unitary matrices. Defining $n := \log_2 d$, the circuit involves $2n$ qubits, where we call the first (second) $n$-qubit system $A$ ($B$).
The first step is to create a maximally entangled state between $A$ and $B$, namely, the state
$$|\Phi^+\rangle_{AB} = \frac{1}{\sqrt{d}} \sum_{j} |j\rangle_A \otimes |j\rangle_B , \tag{2}$$
where $j = (j_1, j_2, \ldots, j_n)$ is a vector index where each component $j_k$ is chosen from $\{0, 1\}$. The first two gates in Fig. 2, the Hadamard gates and the CNOT gates (which are performed in parallel when on distinct qubits), create the $|\Phi^+\rangle$ state.
The second step is to act with $U$ on system $A$ and with $V^*$ on system $B$. ($V^*$ is the complex conjugate of $V$, where the complex conjugate is taken in the standard basis.) Note that these two gates are performed in parallel. This gives the state
$$(U \otimes V^*)|\Phi^+\rangle_{AB} .$$
We emphasize that the unitary $V^*$ is implemented on the quantum computer, not $V$ itself.
The third and final step is to measure in the Bell basis. This corresponds to undoing the unitaries (the CNOTs and Hadamards) used to prepare $|\Phi^+\rangle$ and then measuring in the standard basis. At the end, we are only interested in estimating a single probability: the probability for the Bell-basis measurement to give the $|\Phi^+\rangle$ outcome, which corresponds to the all-zeros outcome in the standard basis. The amplitude associated with this probability is
$$\langle\Phi^+|\,U \otimes V^*\,|\Phi^+\rangle = \langle\Phi^+|\,U V^\dagger \otimes \mathbb{1}\,|\Phi^+\rangle = \frac{1}{d}{\rm Tr}(V^\dagger U) .$$
To obtain the first equality we used the ricochet property:
$$(\mathbb{1} \otimes X)|\Phi^+\rangle = (X^T \otimes \mathbb{1})|\Phi^+\rangle , \tag{6}$$
which holds for any operator $X$ acting on a $d$-dimensional space. The probability of the $|\Phi^+\rangle$ outcome is then the absolute square of the amplitude, i.e., $(1/d^2)|{\rm Tr}(V^\dagger U)|^2$. Hence, this probability gives us the absolute value of the Hilbert-Schmidt inner product between $U$ and $V$. We therefore call the circuit in Fig. 2 the Hilbert-Schmidt Test (HST).
Consider the depth of this circuit. Let $D(G)$ denote the depth of a gate sequence $G$ for a quantum computer whose native gate alphabet includes the CNOT gate and the set of all one-qubit gates. Then, for the HST, we have
$$D({\rm HST}) = 4 + \max\{D(U), D(V^*)\} . \tag{7}$$
The first term of 4 is associated with the Hadamards and CNOTs in Fig. 2, and this term is negligible when the depth of $U$ or $V^*$ is large. The second term results from the fact that $U$ and $V^*$ are performed in parallel. Hence, whichever unitary, $U$ or $V^*$, has the larger depth will determine the overall depth of the HST.
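The three steps of the HST can be checked numerically on a classical simulator for small $d$. The following is a minimal sketch, not the hardware implementation: it assumes NumPy, and the helper names `random_unitary` and `hst_probability` are hypothetical. It builds $|\Phi^+\rangle$, applies $U \otimes V^*$, and compares the probability of the $|\Phi^+\rangle$ outcome with $(1/d^2)|{\rm Tr}(V^\dagger U)|^2$.

```python
import numpy as np

def random_unitary(d, rng):
    # Hypothetical helper: QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    # Rescale each column by a phase; the result is still unitary.
    return q * (np.diag(r) / np.abs(np.diag(r)))

def hst_probability(U, V):
    """Probability of the |Phi+> outcome after applying U on A and V* on B."""
    d = U.shape[0]
    # |Phi+> = (1/sqrt(d)) sum_j |j>|j>, stored as a vector of length d*d.
    phi = np.eye(d, dtype=complex).reshape(d * d) / np.sqrt(d)
    out = np.kron(U, V.conj()) @ phi      # (U tensor V*) |Phi+>
    amplitude = np.vdot(phi, out)         # <Phi+| (U tensor V*) |Phi+>
    return abs(amplitude) ** 2

rng = np.random.default_rng(0)
d = 4  # two qubits per register
U = random_unitary(d, rng)
V = random_unitary(d, rng)
p = hst_probability(U, V)
expected = abs(np.trace(V.conj().T @ U)) ** 2 / d**2
print(p, expected)
```

Because the simulation stores a vector of length $d^2$, it is only practical for a few qubits, which is the overhead the HST avoids on a quantum computer.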

B. The Power of Two Qubits
While the HST only gives the magnitude of $\langle V, U\rangle$, we now consider a circuit that gives the real and imaginary parts. Before we discuss this circuit, let us review the Power of One Qubit (POOQ) [24], shown in Fig. 3(a), which is a circuit for computing the trace of a $d$-dimensional unitary $U$. This circuit acts on a $d$-dimensional system $A$, initially in the maximally mixed state, $\mathbb{1}/d$, and a single-qubit ancilla $Q$ initially in the $|0\rangle$ state. After applying a Hadamard gate to $Q$ and a controlled-$U$ gate to $QA$ (with $Q$ the control system), the reduced density matrix $\rho_Q$ has its off-diagonal elements proportional to ${\rm Tr}(U)$. Hence, one can measure $Q$ in the $X$ and $Y$ bases, respectively, to read off the real and imaginary parts of ${\rm Tr}(U)$.
Our circuit for computing $\langle V, U\rangle$ generalizes the POOQ and is called the Power of Two Qubits (POTQ), depicted in Fig. 3(b). As the name suggests, the POTQ employs two single-qubit ancillas, $Q$ and $Q'$, each initially in the $|0\rangle$ state. In addition, two $d$-dimensional systems, $A$ and $B$, are initially prepared in the Bell state $|\Phi^+\rangle$ defined in Eq. (2). (Although not shown in Fig. 3(b), this Bell state is prepared with a depth-two circuit, as shown in Fig. 2.) The first step in the POTQ is to prepare the two-qubit maximally entangled state $\frac{1}{\sqrt{2}}(|0\rangle|0\rangle + |1\rangle|1\rangle)$ between $Q$ and $Q'$, using the Hadamard and CNOT gates as shown in Fig. 3(b). The second step is to apply a controlled-$U$ gate between $Q$ and $A$ (with $Q$ the control system). In parallel with this gate, the anticontrolled-$V^T$ gate is applied to $Q'B$, with $Q'$ the control system, where anticontrolled means that the roles of the $|0\rangle$ and $|1\rangle$ states on the control system are reversed. This results in the state:
$$\frac{1}{\sqrt{2}}\Big(|0\rangle_Q|0\rangle_{Q'} \otimes (\mathbb{1}\otimes V^T)|\Phi^+\rangle_{AB} + |1\rangle_Q|1\rangle_{Q'} \otimes (U\otimes \mathbb{1})|\Phi^+\rangle_{AB}\Big) = \frac{1}{\sqrt{2}}\Big(|0\rangle_Q|0\rangle_{Q'} \otimes (V\otimes \mathbb{1})|\Phi^+\rangle_{AB} + |1\rangle_Q|1\rangle_{Q'} \otimes (U\otimes \mathbb{1})|\Phi^+\rangle_{AB}\Big) ,$$
where to obtain the equality we used the ricochet property in Eq. (6). As in the HST, note that $V$ itself is not implemented. In this case, its transpose is implemented. Finally, a CNOT gate is applied to $QQ'$, with $Q$ the control system. This results in the reduced state on $Q$ being
$$\rho_Q = \frac{1}{2}\Big(|0\rangle\langle 0| + |1\rangle\langle 1| + \frac{1}{d}{\rm Tr}(U^\dagger V)\,|0\rangle\langle 1| + \frac{1}{d}{\rm Tr}(V^\dagger U)\,|1\rangle\langle 0|\Big) .$$
By inspection of $\rho_Q$, one can see that measuring $Q$ in the $X$ and $Y$ bases, respectively, gives the real and imaginary parts of ${\rm Tr}(V^\dagger U)$, up to the normalization factor $1/d$. Interestingly, if we set $V$ to the identity in the POTQ, then since the CNOT gate commutes with the controlled-$U$ gate and the reduced state of $|\Phi^+\rangle$ is the maximally mixed state $\mathbb{1}/d$, we recover the POOQ. The POTQ is therefore a generalization of the POOQ.
Note that while the POOQ can also be used to determine ${\rm Tr}(V^\dagger U)$, the POTQ has the advantage that the controlled gates for $U$ and $V$ can be executed in parallel, while in the POOQ they would have to be executed in series. This makes the POTQ better suited for NISQ devices, where short depth is crucial. Consider the depth of the POTQ. Denoting the controlled-$U$ and the anticontrolled-$V^T$ as $C_U$ and $C_{V^T}$ respectively, the overall depth is a small constant, contributed by the Hadamard and CNOT gates acting on the ancillas, plus $\max\{D(C_U), D(C_{V^T})\}$. Note the similarity here to Eq. (7). The overall depth is essentially determined by whichever controlled gate has the largest depth.
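The POTQ can likewise be simulated classically for small $d$ as a consistency check. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes NumPy, the function names `potq_overlap` and `random_unitary` are hypothetical, and the state is stored as an array indexed by $(Q, Q', A, B)$. It applies the controlled-$U$, the anticontrolled-$V^T$ and the final CNOT, then reads $\langle X\rangle$ and $\langle Y\rangle$ from the reduced state of $Q$.

```python
import numpy as np

def random_unitary(d, rng):
    # Hypothetical helper: QR decomposition of a complex Gaussian matrix.
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

def potq_overlap(U, V):
    """Return <X> + i<Y> on ancilla Q, which should equal Tr(V^dagger U)/d."""
    d = U.shape[0]
    phi = np.eye(d, dtype=complex) / np.sqrt(d)   # |Phi+> with indices (A, B)
    psi = np.zeros((2, 2, d, d), dtype=complex)   # indices (Q, Q', A, B)
    # Step 1: (|00> + |11>)/sqrt(2) on QQ', tensored with |Phi+> on AB.
    psi[0, 0] = phi / np.sqrt(2)
    psi[1, 1] = phi / np.sqrt(2)
    # Step 2a: controlled-U on A, active when Q = 1.
    psi[1] = np.einsum("ab,xbc->xac", U, psi[1])
    # Step 2b: anticontrolled-V^T on B, active when Q' = 0.
    psi[:, 0] = np.einsum("cb,qab->qac", V.T, psi[:, 0])
    # Step 3: CNOT with Q as control and Q' as target.
    psi[1] = psi[1, ::-1].copy()
    # Reduced state on Q: trace out Q', A and B.
    rho_q = np.einsum("qxab,pxab->qp", psi, psi.conj())
    exp_x = 2 * rho_q[1, 0].real   # <X> = rho_01 + rho_10
    exp_y = 2 * rho_q[1, 0].imag   # <Y> = i (rho_01 - rho_10)
    return exp_x + 1j * exp_y

rng = np.random.default_rng(1)
d = 2
U = random_unitary(d, rng)
V = random_unitary(d, rng)
print(potq_overlap(U, V), np.trace(V.conj().T @ U) / d)
```

Setting `V` to the identity in this sketch returns ${\rm Tr}(U)/d$, matching the reduction to the POOQ described above.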
FIG. 3. (a) Power of One Qubit. (b) Power of Two Qubits.

IV. OUR OPTIMIZATION METHOD
In this section, we give a detailed description of our optimization method for QAQC. First, we discuss how we parameterize the trainable unitary V in terms of both discrete and continuous parameters. Then we present our two different approaches to optimizing over the continuous parameters: a gradient-free approach (which employs the circuit in Fig. 2) and a gradient-based approach (which employs the circuit in Fig. 3).

Consider an alphabet $\mathcal{A} = \{G_k(\alpha)\}_k$ of gates $G_k(\alpha)$ that are native to the quantum computer of interest. Here, $\alpha \in \mathbb{R}$ is a continuous parameter, and $k$ is a discrete parameter that identifies the type of gate and which qubits it acts on. For a given quantum computer, the problem of compiling $U$ to a gate sequence of length $L$ is to determine
$$(\alpha_{\rm opt}, k_{\rm opt}) := \underset{(\alpha, k)}{\arg\min}\; C(U, V_k(\alpha)) , \tag{11}$$
where
$$V_k(\alpha) = G_{k_L}(\alpha_L)\, G_{k_{L-1}}(\alpha_{L-1}) \cdots G_{k_1}(\alpha_1)$$
is the trainable unitary. Here, $V_k(\alpha)$ is a function of the sequence $k = (k_1, \ldots, k_L)$ of parameters describing which gates from the native gate set are used and of the continuous parameters $\alpha = (\alpha_1, \ldots, \alpha_L)$ associated with each gate. The function $C(U, V_k(\alpha))$ is the cost, which quantifies how close the trained unitary is to the target unitary.

FIG. 4. Outline of our approach to optimization over gate structures and continuous gate parameters in order to perform QAQC for a given input unitary $U$. The optimization starts with a random choice of gate structure, denoted by $k$, followed by a continuous optimization over the internal gate parameters $\alpha$ in the trainable unitary $V_k(\alpha)$. After obtaining the optimal gate parameters, as determined by minimizing the cost function $C(U, V_k(\alpha))$, the structure is updated and the internal gate parameters are optimized again, followed by another structure update. This process repeats until the cost reaches its minimum.

The optimization in (11) contains two parts: discrete optimization over the finite set of gate structures parameterized by k, and continuous optimization over the parameters α characterizing the gates within the structure. Our quantum-classical hybrid strategy to perform the optimization in (11) is illustrated in Fig. 4. As the set of gate structures grows exponentially with the number of gates L, a brute force search over all gate structures in order to obtain the best one is intractable in general. To efficiently search through this exponentially large space, we adopt an approach based on simulated annealing. (An alternative approach is genetic optimization, which has been implemented previously to classically optimize quantum gate sequences [28].) Our simulated annealing approach starts with a random gate structure, then performs continuous optimization over the parameters α that characterize the gates in order to minimize the cost function. We then perform a structure update that involves randomly replacing a subset of gates in the sequence with new gates and reoptimizing the cost function over the continuous parameters α. If this structure change produces a lower cost, then we accept the change. If the cost increases, then we accept the change with probability decreasing exponentially in the magnitude of the cost difference. We iterate this procedure until the cost is (sufficiently close to) zero or until a maximum number of iterations is reached.
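The structure search described above can be sketched for a single qubit as follows. This is a toy illustration under several assumptions that are not taken from the text: the alphabet is restricted to $\{R_x(\pi/2), R_z(\theta)\}$, the cost is evaluated classically as one minus the HST probability $(1/d^2)|{\rm Tr}(V^\dagger U)|^2$, the inner continuous optimization is replaced by plain random sampling of the angles, and the names `anneal`, `optimize_angles`, `build` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def rx90():
    # R_x(pi/2) = exp(-i pi X / 4)
    return np.array([[1, -1j], [-1j, 1]]) / np.sqrt(2)

def rz(theta):
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])

def build(k, alpha):
    # Multiply the gates in order; later gates act on the left.
    V = np.eye(2, dtype=complex)
    for gate, a in zip(k, alpha):
        G = rz(a) if gate == "rz" else rx90()  # the angle is ignored for rx90
        V = G @ V
    return V

def cost(U, V):
    d = U.shape[0]
    return 1.0 - abs(np.trace(V.conj().T @ U)) ** 2 / d**2

def optimize_angles(U, k, rng, n_samples=200):
    # Stand-in for the continuous optimization: random sampling of angles.
    best = np.inf
    for _ in range(n_samples):
        alpha = rng.uniform(-np.pi, np.pi, size=len(k))
        best = min(best, cost(U, build(k, alpha)))
    return best

def anneal(U, L=3, n_iter=40, temperature=0.05, seed=0):
    rng = np.random.default_rng(seed)
    gates = ["rz", "rx90"]
    k = list(rng.choice(gates, size=L))          # random initial structure
    c = optimize_angles(U, k, rng)
    best_k, best_c = list(k), c
    for _ in range(n_iter):
        trial = list(k)
        trial[rng.integers(L)] = rng.choice(gates)   # structure update
        c_trial = optimize_angles(U, trial, rng)
        delta = c_trial - c
        # Accept improvements; accept increases with probability exp(-delta/T).
        if delta <= 0 or rng.random() < np.exp(-delta / temperature):
            k, c = trial, c_trial
        if c < best_c:
            best_k, best_c = list(k), c
    return best_k, best_c

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
print(anneal(H))
```

For the Hadamard target, the structure $(R_z, R_x(\pi/2), R_z)$ with both angles equal to $\pi/2$ reproduces the gate up to a global phase, so a zero-cost solution exists within length $L = 3$.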
To perform the continuous optimization over the gate parameters, we use both a gradient-free approach and a gradient-based approach. Our gradient-free approach is based on a cost function defined by the Hilbert-Schmidt Test, while our gradient-based approach uses a cost function defined by the Power of Two Qubits.

B. Gradient-free optimization
In the gradient-free approach to optimizing over the continuous gate parameters $\alpha$, we take as our cost function
$$C_{\rm GF}(U, V) := 1 - \frac{1}{d^2}\left|{\rm Tr}(V^\dagger U)\right|^2 , \tag{13}$$
and we compute this using the Hilbert-Schmidt Test as illustrated in Fig. 2 and described in Sec. III A. Note that for any two unitaries $U$ and $V$, $C_{\rm GF}(U, V) = 0$ if and only if $U$ and $V$ differ by a global phase factor, i.e., $V = e^{i\varphi} U$ for some $\varphi \in \mathbb{R}$. By minimizing $C_{\rm GF}$, we thus learn an equivalent unitary $V$ up to global phase.

Algorithm 1: Gradient-free Continuous Optimization for QAQC
Input: Unitary $U$ to be compiled; trainable unitary $V_k(\alpha)$ of a given structure; error tolerance ${\rm tol} \in (0, 1)$; maximum number of iterations $N$; sample precision $\epsilon > 0$.
Output: Parameters $\alpha_{\rm opt}$ such that at best $C_{\rm GF}(U, V_k(\alpha_{\rm opt})) \leq {\rm tol}$.
Init: $\alpha_{\rm opt} \leftarrow 0$; cost $\leftarrow 1$
repeat
    choose an initial parameter $\alpha^{(0)}$ at random;
    run gp_minimize with $\alpha^{(0)}$ as input and $\alpha_{\min}$ as output; whenever the cost is called upon for some $\alpha$, run the HST on $V_k(\alpha)^*$ and $U$ approximately $1/\epsilon^2$ times to estimate the cost $C_{\rm GF}(U, V_k(\alpha))$
    if cost $\geq C_{\rm GF}(U, V_k(\alpha_{\min}))$ then
        cost $\leftarrow C_{\rm GF}(U, V_k(\alpha_{\min}))$; $\alpha_{\rm opt} \leftarrow \alpha_{\min}$
until cost $\leq$ tol, at most $N$ times.
return $\alpha_{\rm opt}$, cost

For a given set of gate structure parameters $k$, the calculation of the cost on a quantum computer (as well as on a simulator) is affected by the fact that, due to finite sampling, the Hilbert-Schmidt Test allows us to obtain only an estimate of the magnitude of the Hilbert-Schmidt inner product. Noise within the quantum computer itself also affects the calculation of the cost. Therefore, in order to perform gradient-free optimization over the continuous internal gate parameters $\alpha$, we make use of stochastic optimization techniques that are designed to optimize noisy functions. Specifically, we make use of the gp_minimize routine in the scikit-optimize Python library [29], which is a gradient-free optimization routine that performs Bayesian optimization using Gaussian processes [30,31]. See Algorithm 1 for a general overview of the optimization procedure.
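The restart loop of Algorithm 1 and the effect of finite sampling can be sketched on a simulator as follows. This is a sketch under stated assumptions, not the procedure run on hardware: the HST outcome is modeled as a binomial sample whose success probability is $(1/d^2)|{\rm Tr}(V^\dagger U)|^2$, the trainable structure is fixed to $R_z(\alpha_2) R_x(\pi/2) R_z(\alpha_1)$ on one qubit, and a simple random local search (the hypothetical `local_search`) stands in for gp_minimize so that the example needs only NumPy. The names `sampled_cost` and `algorithm1` are likewise hypothetical.

```python
import numpy as np

RX90 = np.array([[1, -1j], [-1j, 1]]) / np.sqrt(2)

def rz(theta):
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])

def trainable(alpha):
    # Fixed one-qubit structure: Rz(alpha[1]) Rx(pi/2) Rz(alpha[0]).
    return rz(alpha[1]) @ RX90 @ rz(alpha[0])

def exact_cost(U, alpha):
    # One minus the HST probability for d = 2.
    return 1.0 - abs(np.trace(trainable(alpha).conj().T @ U)) ** 2 / 4.0

def sampled_cost(U, alpha, n_shots, rng):
    # Each simulated HST run gives the all-zeros outcome with probability 1 - cost.
    p = min(1.0, max(0.0, 1.0 - exact_cost(U, alpha)))
    return 1.0 - rng.binomial(n_shots, p) / n_shots

def local_search(f, alpha0, rng, n_steps=60, step=1.0):
    # Stand-in for gp_minimize: accept random moves that do not raise the cost.
    alpha, best = alpha0, f(alpha0)
    for _ in range(n_steps):
        trial = alpha + step * rng.normal(size=alpha.shape)
        value = f(trial)
        if value <= best:
            alpha, best = trial, value
        else:
            step *= 0.95
    return alpha, best

def algorithm1(U, tol=0.02, N=5, eps=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n_shots = int(np.ceil(1.0 / eps**2))     # about 1/eps^2 runs per cost call
    f = lambda a: sampled_cost(U, a, n_shots, rng)
    alpha_opt, cost = np.zeros(2), 1.0
    for _ in range(N):                        # at most N restarts
        alpha0 = rng.uniform(-np.pi, np.pi, size=2)
        alpha_min, c_min = local_search(f, alpha0, rng)
        if cost >= c_min:
            alpha_opt, cost = alpha_min, c_min
        if cost <= tol:
            break
    return alpha_opt, cost

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
print(algorithm1(H))
```

The returned cost is itself a sampled estimate, so it fluctuates at the level of the sample precision $\epsilon$ around the exact value.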
In Appendix A, Algorithm 3, we propose an alternative algorithm for gradient-free optimization that, on average, significantly reduces the number of calls to the quantum computer. As a result, it is more suitable for cloud computing under a queue submission system (e.g., IBM's Quantum Experience). This algorithm performs a "multi-scale bisection" of the parameter space based on simulated annealing.
We emphasize that our approach to gradient-free optimization avoids the exponential overhead of evaluating the cost function classically, yet at the same time makes use of fast and efficient classical heuristics for optimization. In fact, using the Hilbert-Schmidt Test, Algorithm 1 requires only $\tilde{O}(1/\epsilon^2)$ calls to the quantum computer in order to evaluate the cost, where $\epsilon = 1/\sqrt{n_{\rm shots}}$ is the sample precision, which is related to the number of samples $n_{\rm shots}$ taken from the device.
As mentioned in Sec. III A, a subtle point about gradient-free compiling via the Hilbert-Schmidt Test is that, to compute ${\rm Tr}(V_k(\alpha)^\dagger U)$ in order to obtain the cost $C_{\rm GF}(U, V_k(\alpha))$, the complex conjugate $V_k(\alpha)^*$ must be executed on the quantum computer, not $V_k(\alpha)$ itself. The complex conjugate of a unitary corresponding to a gate sequence can be obtained by taking the complex conjugate of each unitary in the gate sequence. However, if each gate in the sequence comes from a gate alphabet $\mathcal{A}$, it is possible that the complex conjugate of a gate in the sequence is not contained in the alphabet; for example, if $\mathcal{A} = \{R_x(\pi/2), R_z(\theta)\}$, then the complex conjugate of $R_x(\pi/2)$, which is $R_x(-\pi/2)$, is not contained in $\mathcal{A}$. But the unitary $R_z(\pi)R_x(\pi/2)R_z(\pi)$ is equal (up to a global phase) to $R_x(-\pi/2)$. There are thus two ways to proceed when performing the compilation procedure. The first is, during the optimization over the continuous parameters, to directly run the gate sequence corresponding to $V_k(\alpha)$, expressing it in terms of the native gate alphabet of the quantum computer, and then at the end to establish the complex conjugate of the optimal unitary as the unitary to which $U$ has been compiled. This last step would involve translating the complex conjugate of each gate in the optimal sequence into the native gate alphabet of the quantum computer.
An alternative is to first take the complex conjugate $V_k(\alpha)^*$ by translating the complex conjugate of each gate in the sequence into the native gate alphabet, then execute the resulting sequence on the quantum computer. In each case, we allow for a small-scale classical compiler that can perform the simple translation of the complex conjugate of a gate sequence into the native gate alphabet of the quantum computer. Note that this small-scale classical compiler does not come with exponential overhead because it is only compiling one- and two-qubit gates. Also, observe that if a gate alphabet is not closed under complex conjugation, then the depth of a gate sequence from that alphabet can increase by taking its complex conjugate. This is true for the example given above, in which the complex conjugate $R_x(-\pi/2)$ of $R_x(\pi/2)$ has a depth of three under the alphabet $\mathcal{A} = \{R_x(\pi/2), R_z(\theta)\}$, while the original gate has a depth of only one. However, in general, note that the final depth increases by at most a constant factor relative to the original depth.
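The one-qubit example above can be verified directly with $2 \times 2$ matrices. The following short check assumes NumPy and the standard conventions $R_x(\theta) = e^{-i\theta\sigma_x/2}$ and $R_z(\theta) = e^{-i\theta\sigma_z/2}$; the helper `equal_up_to_phase` is a hypothetical name.

```python
import numpy as np

def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def rz(theta):
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])

def equal_up_to_phase(A, B, atol=1e-12):
    # For unitaries, |Tr(A^dagger B)| = d exactly when B = e^{i phi} A.
    d = A.shape[0]
    return abs(abs(np.trace(A.conj().T @ B)) - d) < atol

# Complex conjugate of the native pulse, taken in the standard basis.
conj_gate = rx(np.pi / 2).conj()

# Depth-three replacement built only from native gates.
native = rz(np.pi) @ rx(np.pi / 2) @ rz(np.pi)

print(np.allclose(conj_gate, rx(-np.pi / 2)))
print(equal_up_to_phase(native, rx(-np.pi / 2)))
```

In this convention the native sequence equals $-R_x(-\pi/2)$, so the global phase is $e^{i\pi}$, which does not affect $C_{\rm GF}$.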

C. Gradient-based optimization
In this section, we introduce a gradient-based method for continuous optimization over the gate parameters α based on the POTQ algorithm for computing Tr(V † U ). While recent work on gradient descent continuous optimization has shown vast quantum speedups over classical variants [32][33][34], the majority of proposals still appear to be out of reach for implementations on NISQ devices, mainly due to their use of techniques, such as quantum state representation, the quantum Fourier transform, and the Grover search algorithm, which have high resource requirements. Instead, we focus on continuous optimization procedures that are feasible on current quantum computers and leave improvements to our algorithms as an open problem.
We define our cost function as the normalized Hilbert-Schmidt distance,
$$C_{\rm GB}(U, V) := \frac{1}{2d}\|U - V\|_{\rm HS}^2 = 1 - \frac{1}{d}{\rm Re}\left[{\rm Tr}(V^\dagger U)\right] .$$
Note that for any two unitaries $U$ and $V$, the cost $C_{\rm GB}(U, V)$ is zero if and only if $U = V$. Contrary to the gradient-free cost function $C_{\rm GF}$, this cost function does not vanish if $U$ and $V$ differ only by a global phase. Indeed, if $V = e^{i\varphi} U$, then $C_{\rm GB}(U, V) = 1 - \cos(\varphi)$.
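The global-phase behavior of this cost can be checked numerically. The sketch below assumes NumPy and takes the normalization $C_{\rm GB}(U, V) = \frac{1}{2d}\|U - V\|^2_{\rm HS}$, which is an assumption chosen because it reproduces the value $1 - \cos(\varphi)$ stated above; the function names are hypothetical.

```python
import numpy as np

def cost_gb(U, V):
    # 1 - (1/d) Re Tr(V^dagger U)
    d = U.shape[0]
    return 1.0 - np.real(np.trace(V.conj().T @ U)) / d

def hs_distance_form(U, V):
    # (1/2d) ||U - V||^2 in the Hilbert-Schmidt (Frobenius) norm
    d = U.shape[0]
    return np.linalg.norm(U - V, "fro") ** 2 / (2 * d)

# Any unitary will do; here a Hadamard followed by a z-rotation.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
U = np.diag([np.exp(-0.2j), np.exp(0.2j)]) @ H

phi = 0.9
V = np.exp(1j * phi) * U
print(cost_gb(U, V), 1 - np.cos(phi), hs_distance_form(U, V))
```

The two expressions agree for unitaries because $\|U - V\|^2_{\rm HS} = 2d - 2\,{\rm Re}[{\rm Tr}(V^\dagger U)]$.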
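In order to avoid exponential overhead, it is crucial for our gradient-based method to evaluate the gradient of the cost function on a quantum computer. Notice that the POTQ allows us to compute ${\rm Re}[{\rm Tr}(V_k(\alpha)^\dagger U)]$ on a quantum computer. Similarly, since the gradient of the cost function $C_{\rm GB}$ is given by
$$\nabla_\alpha C_{\rm GB}(U, V_k(\alpha)) = -\frac{1}{d}{\rm Re}\left[{\rm Tr}\left(\nabla_\alpha V_k(\alpha)^\dagger U\right)\right] ,$$
the POTQ can also be used to evaluate the gradient of the cost function on a quantum computer. Here, the gradient of $V_k(\alpha)$ is defined by
$$\nabla_\alpha V_k(\alpha) = \left(\frac{\partial V_k(\alpha)}{\partial \alpha_1}, \ldots, \frac{\partial V_k(\alpha)}{\partial \alpha_L}\right) ,$$
where the $(i, j)$ matrix element of the $\ell$-th component is
$$\left(\frac{\partial V_k(\alpha)}{\partial \alpha_\ell}\right)_{i,j} = \frac{\partial \left(V_k(\alpha)\right)_{i,j}}{\partial \alpha_\ell} .$$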

FIG. 5. (a) Any single-qubit gate $U$ can be decomposed into three elementary rotations (up to a global phase). Given appropriate parameters $\alpha = (\alpha_{z_1}, \alpha_y, \alpha_{z_2})$, $U$ can be written as $V(\alpha) = e^{-i\alpha_{z_2}\sigma_z/2}\, e^{-i\alpha_y\sigma_y/2}\, e^{-i\alpha_{z_1}\sigma_z/2}$. (b) Any two-qubit gate $U_{AB}$ can be decomposed into three CNOT gates as well as 15 elementary single-qubit gates, where each unitary $U_j(\alpha^{(j)})$ can be written as in (a). This decomposition is known to be optimal [35], i.e., it uses the least number of continuous parameters and CNOT gates. General universal quantum circuits for $n$-qubit gates are discussed in [36].
Evaluating the gradient on a quantum computer is possible due to the fact that any unitary gate can be decomposed into circuits in which only the single-qubit gates are parametrized. This is illustrated in Fig. 5. For example, consider the rotation gate $R_z(\alpha) = e^{-i\alpha\sigma_z/2}$, which is parametrized by the angle $\alpha$. Then, the derivative with respect to $\alpha$ can be written as
$$\frac{\partial R_z(\alpha)}{\partial \alpha} = -\frac{i}{2}\,\sigma_z\, R_z(\alpha) ,$$
which follows from the Taylor expansion of the exponent. This means that
$${\rm Tr}\left(\left(\frac{\partial R_z(\alpha)}{\partial \alpha}\right)^\dagger U\right) = \frac{i}{2}\,{\rm Tr}\left(R_z(\alpha)^\dagger \sigma_z U\right) .$$
In general, then, to compute the derivative of the cost function with respect to the continuous parameters, we simply add the appropriate local Pauli gate as part of the POTQ. Once the gradient has been evaluated, it can be supplied to a classical gradient descent optimization routine.
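The derivative of the rotation gate can be checked against a finite difference. The sketch below assumes NumPy; `d_rz` and `trace_with_pauli` are hypothetical names, and the trace identity it tests is the one obtained by taking the adjoint of $-\frac{i}{2}\sigma_z R_z(\alpha)$.

```python
import numpy as np

SZ = np.diag([1.0 + 0j, -1.0])

def rz(alpha):
    return np.diag([np.exp(-0.5j * alpha), np.exp(0.5j * alpha)])

def d_rz(alpha):
    # Analytic derivative: -(i/2) sigma_z R_z(alpha)
    return -0.5j * SZ @ rz(alpha)

def trace_with_pauli(alpha, U):
    # Tr[(dR_z/dalpha)^dagger U] rewritten with a Pauli-Z insertion.
    return 0.5j * np.trace(rz(alpha).conj().T @ SZ @ U)

alpha, h = 0.37, 1e-6
finite_diff = (rz(alpha + h) - rz(alpha - h)) / (2 * h)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
U = rz(1.0) @ H
lhs = np.trace(d_rz(alpha).conj().T @ U)
print(np.allclose(finite_diff, d_rz(alpha), atol=1e-8), lhs, trace_with_pauli(alpha, U))
```

The same pattern applies to $R_x$ and $R_y$ with $\sigma_x$ and $\sigma_y$ in place of $\sigma_z$.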
Our gradient-based optimization procedure is outlined in Algorithm 2. Given an arbitrary unitary $U$ as input, Algorithm 2 compiles $U$ to a unitary $V_k(\alpha_{\rm opt})$ of a given structure $k$ that minimizes the cost $C_{\rm GB}$. The gradient is evaluated with the POTQ circuit as a subroutine within a classical gradient-descent algorithm. The overall query complexity of Algorithm 2 is $\tilde{O}(NTL/\epsilon^2)$, where $\epsilon = 1/\sqrt{n_{\rm shots}}$ is the sample precision, $N$ is the maximum number of repetitions over random initial parameters $\alpha^{(0)}$, $L$ is the dimension of the continuous parameter space of $\alpha$, and $T$ is the number of gradient descent iterations for a suitable learning rate $\eta > 0$. In order to improve convergence, it may also be useful to supply the quantum subroutines for computing the cost function and the gradient to a more advanced stochastic minimizer, for example as found in the Python library SciPy [37]. We give results on compiling both single-qubit and two-qubit gates on a simulator in Section VI.

Algorithm 2: Gradient-based Continuous Optimization for QAQC
Input: Unitary $U$ to be compiled; a trainable circuit $V_k(\alpha)$ of a given structure, where $\alpha$ is a continuous circuit parameter of dimension $L$; maximum number of iterations $N$; error tolerance ${\rm tol} \in (0, 1)$; learning rate $\eta > 0$; sample precision $\epsilon > 0$.
Output: Parameters $\alpha_{\rm opt}$ such that at best $C_{\rm GB}(U, V_k(\alpha_{\rm opt})) \leq {\rm tol}$.
Init: $\alpha_{\rm opt} \leftarrow 0$; cost $\leftarrow 1$
repeat
    choose initial parameters $\alpha^{(0)}$ at random
    for $\tau = 1, 2, \ldots, T$ do
        for $i = 1, 2, \ldots, L$ do
            run the POTQ on $\partial_{\alpha_i} V_k(\alpha^{(\tau-1)})^T$ and $U$ approximately $1/\epsilon^2$ times to estimate ${\rm Re}[{\rm Tr}(\partial_{\alpha_i} V_k(\alpha^{(\tau-1)})^\dagger U)]$
        update $\alpha^{(\tau)} \leftarrow \alpha^{(\tau-1)} - \eta \nabla_\alpha C_{\rm GB}(U, V_k(\alpha^{(\tau-1)}))$
        run the POTQ on $V_k(\alpha^{(\tau)})^T$ and $U$ approximately $1/\epsilon^2$ times to estimate the cost $C_{\rm GB}(U, V_k(\alpha^{(\tau)}))$
        if cost $\geq C_{\rm GB}(U, V_k(\alpha^{(\tau)}))$ then
            cost $\leftarrow C_{\rm GB}(U, V_k(\alpha^{(\tau)}))$; $\alpha_{\rm opt} \leftarrow \alpha^{(\tau)}$
until cost $\leq$ tol, at most $N$ times
return $\alpha_{\rm opt}$, cost

When performing Algorithm 2, we rely on the ability to perform controlled operations of arbitrary gates. These gates may be unknown, e.g., as in Fig. 1(b). In general, to perform a controlled operation with respect to a target unitary $U$, one can use a method for "remote control" [38]. This method employs a local $U$ gate and controlled-SWAP operations in order to realize the controlled-$U$ gate. In practice, since any controlled unitary gate can be decomposed into native gates, the ability to compile controlled-SWAP, the Toffoli gate, and the set of controlled rotations is sufficient. In order to address this issue, we allow the user to have access to a small-scale classical compiler to perform such a translation. This does not incur exponential overhead since the gates to be translated are one- and two-qubit gates (or their controlled versions). While this may cause the depth of our compiled unitary to increase, it will only be by a constant factor.
We emphasize that our gradient-based approach can also be applied without explicitly searching over gate structures. Due to the existence of exact universal circuits for $n$-qubit unitary operations [36], our gradient-based QAQC approach can, in principle, directly compile an arbitrary unitary gate at the cost of suboptimal depth. We give examples of universal circuits for single-qubit and two-qubit gates in Fig. 5.

V. IMPLEMENTATION ON QUANTUM HARDWARE
In this section, we present the results of executing our QAQC procedure described in Sec. IV on IBM's and Rigetti's quantum computers. In each case, we performed gradient-free continuous parameter optimization in order to minimize the cost function $C_{\rm GF}$ in (13), evaluating this cost function on the quantum computer. In what follows, the depth of a gate sequence is defined relative to the native gate alphabet of the quantum computer used.

A. IBM's quantum computers
Here we consider the 5-qubit IBMQX4 and the 16-qubit IBMQX5. For these quantum computers, the native gate set is

$\{R_x(\pi/2),\ R_z(\theta),\ {\rm CNOT}_{ij}\}$, (21)

where the single-qubit gates $R_x(\pi/2)$ and $R_z(\theta)$ can be performed on any qubit and the two-qubit CNOT gate can be performed between any two qubits allowed in the topology; see [39] for the topology of IBMQX4 and [40] for the topology of IBMQX5.
To compile a given unitary $U$, we use the general procedure outlined in Sec. IV. Specifically, our initial gate structure, given by $V_k(\alpha)$, is selected at random from the gate alphabet in (21). We then calculate the cost $C_{\rm GF}(U, V_k(\alpha))$ by executing the Hilbert-Schmidt Test shown in Fig. 2 on the quantum computer. To perform the continuous parameter optimization over the angles $\theta$ of the $R_z$ gates, instead of employing the optimization procedure outlined in Sec. IV B, we make use of Algorithm 3 outlined in Appendix A. This method is designed to limit the number of objective function calls to the quantum computer, which is an important consideration when using queue-based quantum computers like IBMQX4 and IBMQX5 since these can entail a significant amount of idle time in the queue.
In essence, our method discretizes the continuous parameter space of angles $\theta$ to perform the continuous optimization. These angles are selected uniformly over the unit circle, and the grid spacing between them decreases with the number of iterations. See Appendix A for full details. If the cost of the new sequence is less than the cost of the previous sequence, then we accept the change. Otherwise, we accept the change with a probability that decreases exponentially in the magnitude of the difference in cost. This change in cost defines one iteration as shown in Fig. 6.
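The acceptance rule described above can be sketched as follows. This is only an illustration: the function name `accept_change` and the scale parameter `temperature` are hypothetical, and the exact form of the exponential used in Appendix A may differ.

```python
import math
import random

def accept_change(old_cost, new_cost, temperature=0.05, rng=random):
    """Decide whether a proposed gate sequence replaces the current one.

    `temperature` is a hypothetical scale that sets how quickly the
    acceptance probability decays with the increase in cost.
    """
    if new_cost < old_cost:
        # A strictly lower cost is always accepted.
        return True
    # Otherwise accept with a probability that decays exponentially
    # in the magnitude of the cost difference.
    probability = math.exp(-(new_cost - old_cost) / temperature)
    return rng.random() < probability

print(accept_change(0.30, 0.25))                    # improvement: always True
print(accept_change(0.0, 1.0, temperature=1e-3))    # large increase: rejected
```

Occasionally accepting a worse sequence lets the search escape gate structures that are locally but not globally optimal.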
In Fig. 6(a), we show results for compiling single-qubit gates on IBMQX4. All gates ($\mathbb{1}$, $T$, $X$, and $H$) converge to a cost below 0.1, but no gate achieves a cost below our tolerance of $10^{-2}$. As elaborated upon in Sec. V C, this is due to a combination of finite sampling, gate fidelity, decoherence, and readout error on the device. The single-qubit gates compile to the following gate sequences:
1. $\mathbb{1}$ gate: $R_z(\theta)$, with $\theta \approx 0.01\pi$.
In particular, note that $R_x(\pi/2)R_x(\pi/2)$ is a textbook decomposition of the $X$ gate, yet the cost is not zero. Similarly, Fig. 6(b) shows results for compiling the same single-qubit gates as above on IBMQX5. The gate sequences have the same structure as listed above for IBMQX4. The optimal angles achieved are $\theta = -0.03\pi$ for the $\mathbb{1}$ gate and $\theta = 0.23\pi$ for the $T$ gate. The $X$ gate compiles to $R_x(\pi/2)R_x(\pi/2)$, and the Hadamard $H$ compiles to $R_x(\pi/2)R_z(\pi/2)R_x(\pi/2)$.
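As a noiseless cross-check of the decompositions quoted above, one can evaluate the overlap-based cost numerically. The sketch below assumes a cost of the form $1 - |{\rm Tr}(V^\dagger U)|^2/d^2$, which is insensitive to global phase; the helper names are illustrative.

```python
import numpy as np

def rx(theta):
    """Single-qubit rotation about the x axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def rz(theta):
    """Single-qubit rotation about the z axis."""
    return np.array([[np.exp(-1j * theta / 2), 0], [0, np.exp(1j * theta / 2)]])

def overlap_cost(U, V):
    # Assumed cost form: 1 - |Tr(V^dagger U)|^2 / d^2.
    d = U.shape[0]
    return 1.0 - abs(np.trace(V.conj().T @ U)) ** 2 / d ** 2

X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T = np.diag([1.0, np.exp(1j * np.pi / 4)])

# Exact decompositions (equal to the targets up to a global phase).
cost_x = overlap_cost(X, rx(np.pi / 2) @ rx(np.pi / 2))
cost_h = overlap_cost(H, rx(np.pi / 2) @ rz(np.pi / 2) @ rx(np.pi / 2))
# Angle found on IBMQX5 for the T gate; the exact value would be pi/4.
cost_t = overlap_cost(T, rz(0.23 * np.pi))

print(cost_x, cost_h, cost_t)
```

In this ideal setting the $X$ and $H$ sequences have vanishing cost, so the nonzero costs observed on hardware come from the device and from finite sampling rather than from the decompositions themselves.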
In our data collection, we performed on the order of 10 independent optimization runs for each target gate above. The standard deviations of the θ angles were on the order of 0.05π, and this can be viewed as the error bars on the average values quoted above.

B. Rigetti's quantum computer
The native gate set of Rigetti's 8Q-Agave 8-qubit quantum computer is

$\{R_x(\pm\pi/2),\ R_z(\theta),\ {\rm CZ}_{ij}\}$, (22)

where the single-qubit gates $R_x(\pm\pi/2)$ and $R_z(\theta)$ can be performed on any qubit and the two-qubit CZ gate can be performed between any two qubits allowed in the topology; see [41] for the topology of the 8Q-Agave quantum computer.
As with the implementation on IBM's quantum computers, for the implementation on Rigetti's quantum computer we make use of the general procedure outlined in Sec. IV. Specifically, we perform random updates to the gate structure followed by continuous optimization over the parameters $\theta$ of the $R_z$ gates using the gradient-free stochastic optimization technique described in Sec. IV B, Algorithm 1. We take the cost error tolerance (the parameter tol in Algorithm 1) to be $10^{-2}$, and for each run of the Hilbert-Schmidt Test, we let $n_{\rm shots} = 10{,}000$ in order to estimate the cost. Our results are shown in Fig. 6(c). As described in Algorithm 1, we define an iteration to be one accepted update in gate structure followed by a continuous optimization over the internal gate parameters.
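The choice of $n_{\rm shots} = 10{,}000$ can be related to the tolerance through the sample precision $1/\sqrt{n_{\rm shots}}$ introduced with Algorithm 2. The following sketch simulates the shot noise of a single measured probability; the true probability of 0.9 and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_precision(n_shots):
    """Sample precision 1/sqrt(n_shots)."""
    return 1.0 / np.sqrt(n_shots)

def estimate_probability(p_true, n_shots, rng):
    """Estimate a measurement probability from n_shots simulated outcomes."""
    return rng.binomial(n_shots, p_true) / n_shots

rng = np.random.default_rng(1)
n_shots = 10_000
eps = sample_precision(n_shots)

# Repeat the estimate many times to see its statistical spread.
estimates = [estimate_probability(0.9, n_shots, rng) for _ in range(200)]
spread = np.std(estimates)

print("sample precision:", eps)
print("observed spread :", spread)
```

With 10,000 shots the sample precision equals the tolerance of $10^{-2}$, so statistical fluctuations alone are comparable in size to the target cost.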
The gates compiled in Fig. 6(c) have the following optimal decompositions. The same decompositions also achieve the lowest cost in the cost vs. depth plot in the inset. [...] order of $0.05\pi$, which can be viewed as the error bars on the average values (over 10 independent runs) quoted above.

C. Discussion
On both IBM's and Rigetti's hardware, we are able to successfully compile one-qubit gates with no a priori assumptions about gate structure or parameters. In principle, our approach extends to compiling larger unitaries U with an exponential speedup over classical methods. In practice, limitations arise in this approach due to decoherence, gate infidelity, and readout error on NISQ computers.
These limitations are more pronounced as circuit depth increases. We therefore encounter significant performance loss for controlled unitaries as required in the POTQ. Consequently, we did not implement our gradient-based optimization method on current quantum devices, but we speculate that improvements to quantum hardware will enable this application.
Our gradient-free optimization method employs a shorter-depth circuit (the HST in Fig. 2), and hence we were able to implement it on NISQ hardware. Nevertheless, the imperfections in NISQ hardware do affect the accuracy of estimating the cost. A qualitative noise analysis of the HST is as follows. To compile a unitary $U$ acting on $n$ qubits, a circuit with $2n$ qubits is needed. Preparing the maximally-entangled state $|\Phi^+\rangle$ in the first portion of the circuit requires $n$ CNOT gates, which are significantly noisier than one-qubit gates and propagate errors to other qubits through entanglement. In principle, all Hadamard and CNOT gates can be implemented in parallel, but on near-term devices this may not be the case. Additionally, due to limited connectivity of NISQ devices, it is generally not possible to directly implement CNOTs between arbitrary qubits. Instead, the CNOTs need to be "chained" between qubits that are connected, a procedure that can significantly increase the depth of the circuit.
The next level of the circuit involves implementing $U$ in the top $n$-qubit register and $V^*$ in the bottom $n$-qubit register. Here, the noise of the computer on $V^*$ is not necessarily undesirable since it could allow us to compile noise-tailored algorithms that counteract the noise of the specific computer, as described in Sec. II. Nevertheless, the depth of $V^*$ and/or $U$ essentially determines the overall circuit depth as noted in (7), and quantum coherence decays exponentially with the circuit depth. Hence, compiling larger gate sequences involves additional loss of coherence on NISQ computers.
The final level of the circuit involves making a Bell measurement on all qubits and is the reverse of the first part of the circuit. As such, the same noise analysis of the first portion of the circuit applies here. Readout errors can be significant on NISQ devices [42], and our circuit involves a number of measurements that scales linearly in the number of qubits. Hence, compiling larger unitaries can increase overall readout error.
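The three levels of the HST described above can be checked numerically in the noiseless case. The sketch below prepares $|\Phi^+\rangle$, applies $U \otimes V^*$, and computes the probability of projecting back onto $|\Phi^+\rangle$, which should equal $|{\rm Tr}(V^\dagger U)|^2/d^2$. The function names and the use of random unitaries are illustrative choices, not part of the procedure above.

```python
import numpy as np

def random_unitary(d, rng):
    """Random d x d unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, _ = np.linalg.qr(z)
    return q

def bell_state(d):
    """Maximally entangled state (1/sqrt(d)) sum_j |j>|j> on two d-dimensional registers."""
    return np.eye(d).reshape(-1) / np.sqrt(d)

def hst_probability(U, V):
    """Probability that the final Bell measurement projects back onto |Phi+>."""
    d = U.shape[0]
    phi = bell_state(d)
    out = np.kron(U, V.conj()) @ phi   # U on the top register, V* on the bottom
    return abs(np.vdot(phi, out)) ** 2

def normalized_overlap(U, V):
    """|Tr(V^dagger U)|^2 / d^2, computed directly from the matrices."""
    d = U.shape[0]
    return abs(np.trace(V.conj().T @ U)) ** 2 / d ** 2

rng = np.random.default_rng(7)
U = random_unitary(4, rng)   # two-qubit target, so the circuit uses 2n = 4 qubits
V = random_unitary(4, rng)

print(hst_probability(U, V), normalized_overlap(U, V))
```

On hardware this probability is estimated from a finite number of shots, and each of the three levels contributes the noise discussed above.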
For all of these reasons, we limit implementation on today's quantum devices to one-qubit gates. This establishes a proof-of-principle demonstration of QAQC, and our approach will only improve as the engineering of quantum hardware improves. To demonstrate the applicability of QAQC to future devices, we now show examples for compiling two-qubit unitaries by running the Hilbert-Schmidt Test and the POTQ on simulators.
We also note that the performance of any cost function optimization method is likely to decrease when compiling unitaries over large numbers of qubits, particularly when using random initial guesses for the circuit structure. Recent results on barren plateaus in gradient-based optimization [43,44] suggest that the probability of observing non-zero gradients tends to become exponentially small as a function of the number of qubits. Similar issues have been noted in classical deep learning [45]. While this is not an issue for the small circuits considered in this paper, it becomes an increasingly important consideration as the number of qubits increases. There may be ways around these problems, as suggested in [45]. For instance, recent work [46] shows that gradient descent with momentum (GDM) using an adaptive (multiplicative) integration step update, called resilient backpropagation (rProp), can help with convergence. However, this remains an active research area, and progress here will likely be important to the success of NISQ devices and more generally to the success of quantum machine learning.

VI. IMPLEMENTATION ON SIMULATOR
In this section, we present our results on finding short-depth algorithms for single-qubit and two-qubit gates using a simulator. We use the gate alphabet

$\{R_x(\pi/2),\ R_z(\theta),\ {\rm CNOT}_{ij}\}$, (23)

which is the gate alphabet defined in Eq. (21) except with full connectivity between the qubits. We adopt the methods from Sec. IV B and IV C to perform the continuous-parameter optimization. The simulations are performed assuming perfect connectivity between the qubits, no gate errors, and no decoherence. Although we do not obtain the exponential advantages of running our algorithms on a quantum computer, these assumptions allow us to compile larger unitaries.

a. Gradient-free optimization. Using Rigetti's quantum virtual machine [25], we compile the controlled-Hadamard (CH) gate, the CZ gate, the SWAP gate, and the two-qubit Fourier transform ${\rm QFT}_2$ by adopting the continuous optimization procedure in Algorithm 1. We also compile the single-qubit gates $X$ and $H$. For each run of the Hilbert-Schmidt Test to determine the cost, we took $n_{\rm shots} = 20{,}000$. Our results are shown in Fig. 7. For the SWAP gate, we find that circuits of depth one and two cannot achieve zero cost, but there exists a circuit with depth three for which the cost vanishes. The circuit achieving this zero cost is the well-known decomposition of the SWAP gate into three CNOT gates. While our compilation procedure reproduces the known decomposition of the SWAP gate, it discovers a decomposition of both the CZ and the ${\rm QFT}_2$ gates that differs from their conventional "textbook" decompositions, as shown in Fig. 7(c). In particular, these decompositions have shorter depths than the conventional decompositions when written in terms of the gate alphabet in (23).
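The SWAP result can be illustrated with exact matrices. The sketch below assumes a cost of the form $1 - |{\rm Tr}(V^\dagger U)|^2/d^2$ and only checks CNOT-only sequences, so it does not rule out every depth-one or depth-two circuit from the alphabet; it simply shows that the three-CNOT sequence reaches zero cost while its shorter prefixes do not.

```python
import numpy as np

# Basis ordering |q0 q1> with index 2*q0 + q1.
CNOT_01 = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=complex)  # control q0, target q1
CNOT_10 = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]], dtype=complex)  # control q1, target q0
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

def overlap_cost(U, V):
    # Assumed cost form: 1 - |Tr(V^dagger U)|^2 / d^2.
    d = U.shape[0]
    return 1.0 - abs(np.trace(V.conj().T @ U)) ** 2 / d ** 2

cost_depth_1 = overlap_cost(SWAP, CNOT_01)
cost_depth_2 = overlap_cost(SWAP, CNOT_10 @ CNOT_01)
cost_depth_3 = overlap_cost(SWAP, CNOT_01 @ CNOT_10 @ CNOT_01)

print(cost_depth_1, cost_depth_2, cost_depth_3)
```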
b. Gradient-based optimization. We use IBM's simulator [26] to compile a selection of single-qubit and two-qubit gates by performing the gradient-based optimization procedure in Algorithm 2. In order to improve convergence, we additionally supplied the gradient, as well as the cost function, to a stochastic minimizer from the Python library SciPy. For the single-qubit gates, we assume a fixed structure for the gate sequence according to the decomposition in Fig. 5(a), while for the two-qubit gates we assume a fixed structure for the gate sequence according to the decomposition in Fig. 5(b). We compile the $T$ gate, $X$ gate, Hadamard ($H$) gate, as well as the CNOT and CZ gates, all with $n_{\rm shots} = 10{,}000$. The results are shown in Fig. 8. We note that increasing $n_{\rm shots}$ to higher orders of magnitude significantly reduces the sampling error and results in more stable convergence at the cost of an increase in runtime.

FIG. 8. Compiling one- and two-qubit gates on a simulator with the gate alphabet in (23) using the gradient-based optimization technique described in Sec. IV C, Algorithm 2, with $n_{\rm shots} = 10{,}000$. Shown is the cost as a function of the number of iterations of the continuous parameter optimization using SciPy. The gate structure for the single-qubit gates is fixed to the one shown in Fig. 5(a), while the gate structure for the two-qubit gates is fixed to the one shown in Fig. 5(b).
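Supplying both the cost and its gradient to a SciPy minimizer can be sketched as follows. This is a minimal illustration under several assumptions: the text does not specify which SciPy routine was used, so `scipy.optimize.minimize` with the BFGS method is a stand-in; the cost is taken to be $1 - |{\rm Tr}(V^\dagger U)|^2/d^2$; the trial circuit is a single $R_z(\theta)$ gate rather than the structures of Fig. 5; and cost and gradient are computed exactly instead of being estimated from shots.

```python
import numpy as np
from scipy.optimize import minimize

Z = np.diag([1.0, -1.0]).astype(complex)

def rz(theta):
    """Single-qubit rotation about the z axis."""
    return np.array([[np.exp(-1j * theta / 2), 0], [0, np.exp(1j * theta / 2)]])

def cost_and_gradient(params, U):
    """Return the assumed cost and its derivative for the trial gate V = Rz(theta)."""
    theta = params[0]
    V = rz(theta)
    dV = (-0.5j * Z) @ V                 # derivative of Rz(theta) w.r.t. theta
    d = U.shape[0]
    t = np.trace(V.conj().T @ U)         # overlap Tr(V^dagger U)
    dt = np.trace(dV.conj().T @ U)       # its derivative w.r.t. theta
    cost = 1.0 - abs(t) ** 2 / d ** 2
    grad = -2.0 * np.real(np.conj(t) * dt) / d ** 2
    return cost, np.array([grad])

T = np.diag([1.0, np.exp(1j * np.pi / 4)])

# jac=True tells SciPy that the objective returns (cost, gradient).
result = minimize(cost_and_gradient, x0=np.array([0.1]), args=(T,),
                  jac=True, method="BFGS")

print("theta / pi:", result.x[0] / np.pi)
print("final cost:", result.fun)
```

On a quantum device or a shot-based simulator, both returned values would be noisy estimates, which is one reason a larger $n_{\rm shots}$ leads to more stable convergence.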

VII. CONCLUSIONS
Quantum compiling is crucial in the era of NISQ devices, where constraints on NISQ computers (such as limited connectivity, limited circuit depth, etc.) place severe restrictions on the quantum algorithms that can be implemented in practice. In this work, we presented a methodology for quantum compilation called quantum-assisted quantum compiling (QAQC), whereby a quantum computer provides an exponential speedup in evaluating the cost of a gate sequence, i.e., how well the gate sequence matches the target. In principle, QAQC should allow for the compiling of larger algorithms than standard classical methods for quantum compiling, due to this exponential speedup. As a proof-of-principle, we implemented QAQC on IBM's and Rigetti's quantum computers to compile various one-qubit gates to their native gate alphabets. To our knowledge, this is the first time NISQ hardware has been used to compile a target unitary. Current noise levels in the hardware prevented us from compiling multi-qubit gates, although we are optimistic that future noise reduction could enable larger scale QAQC.
Our main technical results were two short-depth circuits for computing the Hilbert-Schmidt inner product between a target unitary U and a trainable unitary V . One of these circuits, which we call the Hilbert-Schmidt Test and describe in Sec. III A, computes the inner product magnitude, and we incorporated this circuit into our gradient-free QAQC method in Sec. IV B. The other circuit, which we call the Power of Two Qubits and describe in Sec. III B, computes the real and imaginary parts of the inner product, and we exploited this circuit for gradient-based QAQC in Sec. IV C. The latter circuit generalizes the famous Power of One Qubit [24] and hence is likely of interest to a broader community.
QAQC is a novel variational algorithm, similar to other well-known variational algorithms such as VQE [18] and QAOA [2]. Variational algorithms are likely to provide some of the first real applications of quantum computers in the NISQ era. QAQC, in particular, is an algorithm that makes other algorithms more efficient to implement, via algorithm depth compression. The central application of our technique is thus to make quantum computers more useful.