Entanglement-efficient bipartite-distributed quantum computing


In noisy intermediate-scale quantum computing, the limited scalability of a single quantum processing unit (QPU) can be extended through distributed quantum computing (DQC), in which one can implement global operations over two QPUs by entanglement-assisted local operations and classical communication. To facilitate this type of DQC in experiments, we need an entanglement-efficient protocol. To this end, we extend the protocol in [Eisert et al., PRA, 62:052317 (2000)], which implements each nonlocal controlled-unitary gate locally with one maximally entangled pair, to a packing protocol, which can pack multiple nonlocal controlled-unitary gates locally using one maximally entangled pair. In particular, two types of packing processes are introduced as the building blocks, namely the distributing processes and embedding processes. Each distributing process distributes corresponding gates locally with one entangled pair. The efficiency of entanglement is then enhanced by embedding processes, which merge two non-sequential distributing processes and hence save the entanglement cost. We show that the structure of distributability and embeddability of a quantum circuit can be fully represented by the corresponding packing graphs and conflict graphs. Based on these graphs, we derive heuristic algorithms for finding an entanglement-efficient packing of distributing processes for a given quantum circuit to be implemented by two parties.
These algorithms can determine the required number of local auxiliary qubits in the DQC. We apply these algorithms for bipartite DQC of unitary coupled-cluster circuits and find a significant enhancement of entanglement efficiency.

Introduction
In the noisy intermediate-scale quantum era [1], the noise in quantum processes and the decoherence of qubits are two bottlenecks of the scalability of quantum computing. In a single quantum processing unit (QPU), the scalability is constrained by the number, connectivity and coherence time of qubits, which determine the effective width and depth of a quantum circuit. The effective size of a quantum processor can be quantified by its quantum volume [2,3]. It is believed that the connectivity of qubits on a QPU is a key feature to scale up quantum volume [3]. However, each platform has its intrinsic topological limits on the number of qubits and their connectivity on a QPU.
All of these technologies facilitate the development of a quantum internet [22] in different physical systems. Benefiting from the topology of a quantum internet, one can extend the connectivity of local qubits to build up a hybrid quantum computing system with its QPUs distributed over a quantum network, e.g. quantum computation over the butterfly network [23]. To distribute universal quantum computing over a quantum internet, one has to implement the universal set of quantum gates over local servers. The only global gate in the universal set that needs to be distributed is the CNOT gate. It is locally implementable with the assistance of entangled pairs according to the protocols in [24,25]. This shows the feasibility of distributed universal quantum computing through local operations and classical communication (LOCC), if a sufficient amount of entanglement is provided. However, the distribution of entanglement over a quantum network is probabilistic and time-consuming with respect to the coherence time of local qubits. To make distributed quantum computing (DQC) feasible, one therefore has to find an efficient way to exploit the costly entanglement resources.
There are two methods for nonlocal gate handling in DQC with entanglement-assisted LOCC. One is based on quantum state teleportation [24]; the other is based on the remote implementation of quantum gates through entanglement-assisted LOCC [26-29], which is also referred to as "telegate" [30,31]. In the former scheme, qubits associated with nonlocal gates are teleported forward and backward across parties. In the latter scheme, one implements a nonlocal gate with entanglement-assisted LOCC in such a way that all associated qubits of the gate are kept functional locally without sending them to the other parties. It has been experimentally implemented in [32,33]. Several telegate protocols are employed for compiling multipartite DQC [34-36].
In particular, the telegate protocol in [26], referred to as the EJPP protocol in this paper, can be employed to implement arbitrary nonlocal controlled unitaries. The DQC schemes based on state teleportation and the EJPP protocol are compared and combined in a general quantum network [37]. The EJPP protocol consumes only one ebit, that is one maximally entangled pair, which is optimal in entanglement efficiency. Based on this protocol, one can find the optimum partition for a circuit consisting of control-Z gates and single-qubit gates through hypergraph partitioning [34]. The EJPP protocol employed in the partitioning can be extended to sequential control-Z gates, which reduces the entanglement cost [35]. In [35], the EJPP protocol is terminated by each single-qubit gate, and restarted for control-Z gates with a new entangled pair. This leads to an entanglement cost much higher than the theoretical lower bound given by the operator Schmidt rank of a general circuit [38]. To further facilitate DQC in experiments, we develop a DQC protocol with a higher entanglement efficiency, in which we introduce a notion of entanglement-assisted packing processes for bipartite DQC through an extension of the EJPP protocol. Note that this method is extended to multipartite DQC with some additional treatments and is employed as the core nonlocal-gate-handling subroutine for modular quantum computing under network architectures [39]. A similar application of this method can also be employed for other modular compilation architectures [31,36,40-44].
As two particular types of entanglement-assisted packing processes, we introduce distributing processes and embedding processes as the two fundamental building blocks for DQC. In each distributing process, one ebit of entanglement is employed to achieve an operation equivalent to a global unitary, implemented by packing local gates using entanglement-assisted LOCC. The number of distributing processes equals the entanglement cost for DQC. To reduce the entanglement cost, we introduce embedding processes of intermediate gates to merge two non-sequential distributing processes into a single distributing process. Each embedding process can therefore save one ebit without changing the distributability of the circuit. The distributability and embeddability of gates in a circuit reveal the underlying entanglement cost for DQC employing packing processes. Such a packing method provides a tighter constructive upper bound on the entanglement cost for DQC. It can significantly enhance the entanglement efficiency of DQC.
The rest of the paper is structured as follows. In Section 2, we review the EJPP protocol. In Section 3, we introduce the entanglement-assisted packing processes and two special types of processes, namely the distributing and embedding processes. The conflicts between embeddings are addressed. The packing graph and conflict graph are introduced to represent the distributing and embedding structure. In Section 4, we derive the algorithms for the identification of packing graphs and conflict graphs for quantum circuits consisting of one-qubit and two-qubit gates. In Section 5, we demonstrate the enhancement of entanglement efficiency in unitary coupled-cluster (UCC) circuits [45,46] employing our method. In Section 6, we conclude the paper.

EJPP protocol
As shown in Fig. 1, a controlled-unitary gate $C_U$ on the left-hand side can be implemented through entanglement-assisted LOCC [47] with only one maximally entangled state on the right-hand side according to the EJPP protocol presented in [26]. The controlled unitary on the left side has a controlling qubit $q$ on the local quantum processor A and a target unitary $U$ acting on the multi-qubit subsystem $Q_B$ on the local quantum processor B. In the EJPP protocol, there is a pre-shared maximally entangled pair $\{e, e'\}$, on which one implements local controlled unitaries and global measurement-controlled gates through classical communication. It consumes only one ebit, i.e. a maximally entangled pair, which is the minimum necessary amount of entanglement for a distributed implementation of a controlled unitary, as determined by its operator Schmidt rank [38]. This means that the EJPP protocol is optimal in entanglement efficiency for a single controlled unitary.
Figure 1: The quantum circuit on the left side is a controlled-unitary gate. The quantum circuit on the right side is a distributed implementation of the controlled-unitary gate according to the EJPP protocol, where $Q_A \cup \{e'\}$ and $Q_B \cup \{e\}$ are the qubits in the local quantum processors A and B, respectively. The qubits $\{e', e\}$ are the auxiliary memory qubits that store a pre-shared maximally entangled pair, namely one ebit of entanglement.
The EJPP protocol can be divided into three parts, namely the start, the kernel, and the end. The start and the end are constructed with fixed gates, while the kernel can be tailored for the implementation of particular unitaries. Here, we define the two fixed parts as the starting process and the ending process, which correspond to the cat-entangler and the cat-disentangler in [48], respectively.

Definition 1 (Starting process and ending process)
The quantum circuits on the right-hand side of Fig. 2 (a) and (b) are referred to as a starting process and an ending process rooted on $q$, respectively, which are denoted by the symbols on the left-hand side. The auxiliary qubits $e$ and $e'$ are initialized as a Bell state $(|00\rangle + |11\rangle)/\sqrt{2}$ in the starting process.
The starting process duplicates the computational-basis components of an input state $|\psi\rangle$ on $q$ to an entangled state
$$S_{q,e}(|\psi_q\rangle) = C_{q,X_e}|\psi_q, 0_e\rangle = \langle 0_q|\psi_q\rangle\,|0_q, 0_e\rangle + \langle 1_q|\psi_q\rangle\,|1_q, 1_e\rangle, \tag{1}$$
where $C_{q,X_e}$ is a control-X gate with the control qubit $q$ and target qubit $e$, and $S$ is a linear isometry operator
$$S \in L(H_q \to H_q \otimes H_e), \quad S_{q,e}: |i\rangle_q \mapsto |i\rangle_q |i\rangle_e \ \text{ for all } i \in \{0, 1\}. \tag{2}$$
On the other hand, by applying an ending process to the state $|\Psi_{q,e}\rangle = S_{q,e}(|\psi_q\rangle)$, one obtains the partial trace of $C_{q,X_e}|\Psi_{q,e}\rangle$,
$$E_{q,e}(|\Psi_{q,e}\rangle) = \mathrm{tr}_e\left(C_{q,X_e}|\Psi_{q,e}\rangle\langle\Psi_{q,e}|C_{q,X_e}\right). \tag{3}$$
The ending process is therefore the inverse of the starting isometry $S$, which restores the original state
$$E_{q,e} \circ S_{q,e}(|\psi_q\rangle) = |\psi_q\rangle. \tag{4}$$
Figure 2 (c): The protocol in [26] for the distributed implementation of controlled-unitary gates.
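As a numerical sanity check of Eqs. (1)-(4), the following minimal sketch builds the isometry $S$ as a 4×2 matrix and applies the ending map of Eq. (3). The function names `starting` and `ending`, the qubit ordering $(q, e)$, and the test state are illustrative assumptions and not part of the protocol itself.

```python
import numpy as np

# Control-X with control q and target e, in the basis ordering |q, e>.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# Isometry S: |i>_q -> |i>_q |i>_e, built as CNOT acting on |psi>_q |0>_e (Eq. (1)).
KET0 = np.array([[1], [0]], dtype=complex)
S = CNOT @ np.kron(np.eye(2), KET0)  # shape (4, 2)


def starting(psi):
    """Map a single-qubit state on q to the entangled state on (q, e)."""
    return S @ psi


def ending(Psi):
    """Apply the control-X gate and trace out e (Eq. (3)); returns a density matrix on q."""
    phi = CNOT @ Psi
    rho = np.outer(phi, phi.conj()).reshape(2, 2, 2, 2)  # indices (q, e, q', e')
    return np.einsum("aebe->ab", rho)  # partial trace over e


# Hypothetical input state, chosen only for illustration.
psi = np.array([0.6, 0.8j], dtype=complex)
rho_out = ending(starting(psi))
```

Under these assumptions, `rho_out` coincides with the projector onto `psi`, which is the content of Eq. (4).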
Inserting a local unitary $C_{e,U_{Q_B}}$ acting on $\{e\} \cup Q_B$ between the starting and ending processes, one can implement a global controlled unitary $C_{q,U_{Q_B}}$ with local gates according to the following equality,
$$C_{q,U_{Q_B}} = E_{q,e} \circ C_{e,U_{Q_B}} \circ S_{q,e}.$$
Employing the squiggly arrows for starting and ending processes, the protocol in [26] can be represented as Fig. 2 (c).
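The three parts of the protocol can be simulated on a small state vector. The sketch below assumes a single target qubit `b` on B, the axis ordering $(q, e', e, b)$, a cat-entangler that measures $e'$ and corrects with X on $e$, and a cat-disentangler that applies a Hadamard gate on $e$, measures it, and corrects with Z on $q$. The measurement outcomes `m` and `n` are forced by projection so that every branch can be checked; all function names are hypothetical.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)


def apply_1q(state, gate, axis):
    """Apply a single-qubit gate to one tensor axis of the state."""
    out = np.tensordot(gate, state, axes=([1], [axis]))
    return np.moveaxis(out, 0, axis)


def apply_controlled(state, gate, ctrl, tgt):
    """Apply `gate` on axis `tgt` in the branch where axis `ctrl` equals 1."""
    out = state.copy()
    idx = [slice(None)] * state.ndim
    idx[ctrl] = 1
    tgt_sub = tgt if tgt < ctrl else tgt - 1  # axis index after dropping ctrl
    out[tuple(idx)] = apply_1q(state[tuple(idx)], gate, tgt_sub)
    return out


def project(state, axis, outcome):
    """Project one axis onto |outcome> and renormalize (a forced measurement)."""
    out = np.zeros_like(state)
    idx = [slice(None)] * state.ndim
    idx[axis] = outcome
    out[tuple(idx)] = state[tuple(idx)]
    return out / np.linalg.norm(out)


def ejpp(psi_q, phi_b, U, m, n):
    """Distributed controlled-U with forced outcomes m (on e') and n (on e)."""
    bell = np.eye(2, dtype=complex) / np.sqrt(2)  # amplitudes indexed [e', e]
    state = np.einsum("q,ab,t->qabt", psi_q, bell, phi_b)
    # Starting process: CNOT q -> e', measure e', X correction on e.
    state = apply_controlled(state, X, ctrl=0, tgt=1)
    state = project(state, axis=1, outcome=m)
    if m:
        state = apply_1q(state, X, axis=2)
    # Kernel: local controlled-U on B with control e and target b.
    state = apply_controlled(state, U, ctrl=2, tgt=3)
    # Ending process: Hadamard on e, measure e, Z correction on q.
    state = apply_1q(state, H, axis=2)
    state = project(state, axis=2, outcome=n)
    if n:
        state = apply_1q(state, Z, axis=0)
    return state[:, m, n, :]  # amplitudes on (q, b)


def controlled_u_reference(psi_q, phi_b, U):
    """Directly applied global controlled-U on (q, b), for comparison."""
    return np.stack([psi_q[0] * phi_b, psi_q[1] * (U @ phi_b)])
```

In this model, all four outcome branches `(m, n)` return the same state as the global controlled unitary, while only the kernel acts across neither party boundary.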

Entanglement-assisted packing processes
Beyond controlled-unitary gates, one can extend the EJPP protocol [26] to locally implement a swapping gate with the assistance of two ebits of entanglement, which is shown in Fig. 3. There are two steps to construct the distributed implementation. In the first step, we extend the protocol through the replacement of the local controlled unitary by a global unitary $K_{AB}$. In the second step, we implement the EJPP protocol for the remaining global controlled-X gate starting and ending on the qubit $q'$.
In this example, the process in the first step is implemented with a kernel $K_{AB}$ sandwiched by the starting and ending processes. The kernel $K_{AB}$ is a global unitary acting on the working qubits $\{q, q'\}$ and the auxiliary qubit $e$. In general, $K_{AB}$ can be either local or global. For a global kernel $K_{AB}$, one needs further steps to locally implement the global gates in the kernel with the starting and ending processes. Once all the global gates are distributed, the distribution of the quantum circuit is finished. Such an extension of the EJPP protocol through the replacement of local controlled-unitary gates by a global kernel $K_{AB}$ is therefore a general building block for distributed quantum computing based on the starting and ending processes. In this paper, we refer to these building blocks as entanglement-assisted packing processes.
Definition 2 (Entanglement-assisted packing process) An entanglement-assisted packing process $P_{q,e}[K]$ with a kernel $K$ rooted on a qubit $q$ assisted by two auxiliary qubits $(e, e')$ is a quantum circuit that packs a unitary $K$ by the entanglement-assisted starting and ending processes rooted on $q$ (see Fig. 4(a)),
$$P_{q,e}[K] := E_{q,e} \circ K \circ S_{q,e}.$$
In short, we call $P_{q,e}[K]$ a $q$-rooted $e$-assisted packing process.
Figure 4: Entanglement-assisted packing process. (a) A $q$-rooted $e$-assisted packing process with a kernel $K_{AB}$. (b) A $q$-rooted $e$-assisted A|B-distributing process with a local kernel $K_A \otimes K_B$.
In the starting process, one needs a maximally entangled pair, namely one ebit, and two auxiliary memory qubits $e'$ and $e$. We refer to the first auxiliary qubit $e'$ as the root auxiliary qubit and the second auxiliary qubit $e$ as the packing auxiliary qubit. The memory space of a root auxiliary qubit can be released immediately after the starting process, while a packing auxiliary qubit is occupied for the kernel $K_{AB}$ until the ending process. The required memory space on each local QPU is therefore one root auxiliary qubit for initializing a starting process and several packing auxiliary qubits for parallel packing processes. In DQC based on entanglement-assisted packing processes, the external resources that can be optimized are therefore the entanglement cost and the number of packing auxiliary qubits. By definition, an entanglement-assisted packing process consumes one ebit of entanglement and occupies one packing auxiliary qubit $e$. A packing auxiliary qubit $e$ implies one ebit of pre-shared entanglement between $(e, e')$. The number of packing processes in a compilation of a unitary $U$ is therefore exactly the number of maximally entangled pairs and the number of packing auxiliary memory qubits that are required for the DQC implementation of $U$. This number is always a constructive upper bound on the entanglement cost for the DQC implementation of $U$ based on entanglement-assisted LOCC. A tighter bound can be determined if one finds a more entanglement-efficient compilation of a circuit.

In general, a packing process is a completely positive and trace-preserving (CPTP) map represented by a set of two Kraus operators. For certain types of kernels, the two Kraus operators are equivalent up to a global phase $\varphi$. In this case, a packing process is equivalent to a unitary acting on the working qubits. We call $P_{q,e}[K]$ a canonical unitary-equivalent packing process if $\varphi = 0$. The condition for a packing process being canonically unitary-equivalent is discussed in Appendix A, where the packing processes with general kernels that lead to Kraus operators are also discussed. In the rest of this paper, we consider this ideal case and treat all packing processes as canonically unitary-equivalent.
The canonical unitary-equivalent packing processes are the building blocks of DQC. They have the important property that two sequential packing processes can be merged into one packing process according to the following theorem.

Theorem 3 (Merging of packing processes)
Let $P_{q,e}[K_{1,2}]$ be two $q$-rooted $e$-assisted packing processes that implement two unitaries $U_{1,2}$, respectively,
$$U_i = P_{q,e}[K_i], \quad i \in \{1, 2\}.$$
The product of unitaries $U_2 U_1$ can then be implemented by a single $q$-rooted $e$-assisted packing process with the kernel $K_2 K_1$,
$$U_2 U_1 = P_{q,e}[K_2 K_1].$$
Proof: See Appendix D.1.
This theorem allows us to implement the unitary $U_2 U_1$ through the packing of the two kernels $K_2 K_1$ into one packing process and hence save one ebit. To decompose a circuit in DQC, two types of packing processes are needed, namely the distributing processes and embedding processes, which will be introduced in Sections 3.2 and 3.3, respectively.
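The merging statement can be illustrated numerically. The sketch below is a simplified model rather than the proof in Appendix D.1: it assumes that a unitary-equivalent packing process with kernel $K$ implements $U = S^\dagger K S$ whenever $K S = S U$, with the isometry $S$ of Eq. (2) extended by the identity on one working qubit $b$ of B. The helper names `pack` and `controlled` and the example kernels are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)

# Isometry S: |i>_q -> |i>_q |i>_e, extended by the identity on qubit b.
S = np.zeros((4, 2), dtype=complex)
S[0b00, 0] = 1
S[0b11, 1] = 1
S_FULL = np.kron(S, I2)  # maps (q, b) to (q, e, b)


def pack(K):
    """Return U = S^dagger K S after checking K S = S U (assumed packing model)."""
    U = S_FULL.conj().T @ K @ S_FULL
    if not np.allclose(K @ S_FULL, S_FULL @ U):
        raise ValueError("kernel is not unitary-equivalent in this model")
    return U


def controlled(V):
    """Two-qubit controlled-V with the control as the first qubit."""
    P0 = np.diag([1, 0]).astype(complex)
    P1 = np.diag([0, 1]).astype(complex)
    return np.kron(P0, I2) + np.kron(P1, V)


V = np.diag([1, 1j]).astype(complex)
W = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

K1 = np.kron(I2, controlled(V))   # local on B: control e, target b
K2 = np.kron(np.kron(X, X), W)    # X on q (on A); X on e and W on b (on B)

U1, U2 = pack(K1), pack(K2)
U_merged = pack(K2 @ K1)          # one packing process, hence one ebit
```

Here `U1` is a controlled-`V` gate on $(q, b)$ and `U2` equals `X` on $q$ times `W` on $b$; the merged kernel reproduces their product with a single starting and ending process.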

Distributing process
Our ultimate goal is to find an efficient implementation of a quantum circuit with packing processes that contain only local kernels. To this end, we need to decompose a circuit into distributing processes, which are defined as follows.
Definition 4 (Distributing process) A unitary $U$ is $q$-rooted distributable (or distributable on $q$) over two local systems A|B, if there exists a $q$-rooted entanglement-assisted process with a local kernel with respect to A|B (see Fig. 4 (b)),
$$U = P_{q,e}[K_A \otimes K_B].$$
The kernel describes an (A|B)-distributing rule
$$D_q(U) := K_A \otimes K_B.$$
The process $P_{q,e}[K_A \otimes K_B]$ is called a $q$-rooted (A|B)-distributing process of $U$.
Explicitly, the EJPP protocol in Fig. 1 is a distributing process, while the DQC implementation of the swapping gate in Fig. 3 involves two nested distributing processes.Distributing processes are the backbones of distributed quantum computing.
The solution for entanglement-assisted quantum computation is to reveal the distributability structure of a quantum circuit, namely to find a decomposition of quantum circuits that consists of distributable unitaries.
In general, a unitary $U$ on the left-hand side of Fig. 5 can be written in the following form
$$U = \sum_{i,j \in \{0,1\}} |i\rangle\langle j|_q \otimes U^{(\bar{q})}_{ij},$$
where $\bar{q}$ denotes the set of qubits complementary to $\{q\}$.
The q-rooted A|B-distributability can be determined with the following necessary and sufficient condition.

Theorem 5 (Distributability condition)
A unitary $U$ is $q$-rooted distributable over A|B, if and only if $U$ is diagonal or anti-diagonal on $q$ and given in the following form, The unitaries $V_j$ and $W_j$ act on the qubits $Q_A \setminus q$ and $Q_B$, respectively. Without loss of generality, we assume that $q \in Q_A$ is on the QPU A. The corresponding distributing rule described by the kernel $K_A \otimes K_B$ can be constructed as Proof: See Appendix D.2.

The $q$-rooted distributability of a unitary $U$ depends on its representation in the computational basis of the root qubit. Note that the distributability condition can also be formulated in another basis of the root qubit, if one transforms the control qubit in the starting process and the correction gate Z in the ending process into the corresponding basis. However, such variant starting and ending processes cannot be merged with the standard ones defined in Definition 1. We therefore stick to the definition of distributability with respect to the standard starting and ending processes, and determine the distributability in the computational basis of the root qubit with Theorem 5.
As a result of Theorem 5, some trivial distributable gates can be determined as follows.

Corollary 6
The following unitaries are all q-rooted distributable.
1. Single-qubit unitaries on q that are diagonal or anti-diagonal.
2. Two-qubit controlled-unitary gates that have q as their control qubit.
The distributability of these gates does not depend on the bipartition A|B.
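The diagonal or anti-diagonal part of the condition in Theorem 5 is easy to test numerically. The following sketch extracts the four blocks $\langle i|_q U |j\rangle_q$ of a matrix and classifies it; it does not check the tensor-product structure over A|B, which Corollary 6 does not require for the gates listed above. The function names and the convention that qubit 0 is the most significant bit are assumptions made for this example.

```python
import numpy as np


def blocks_on_qubit(U, q, n):
    """Return B with B[i, j] = <i|_q U |j>_q for an n-qubit matrix U."""
    T = np.asarray(U, dtype=complex).reshape((2,) * (2 * n))
    # Bring the output and input axes of qubit q to the front.
    T = np.moveaxis(T, (q, n + q), (0, 1))
    return T.reshape(2, 2, 2 ** (n - 1), 2 ** (n - 1))


def root_type(U, q, n, tol=1e-12):
    """Classify U on qubit q as 'diagonal', 'anti-diagonal', or None."""
    B = blocks_on_qubit(U, q, n)
    if np.abs(B[0, 1]).max() < tol and np.abs(B[1, 0]).max() < tol:
        return "diagonal"
    if np.abs(B[0, 0]).max() < tol and np.abs(B[1, 1]).max() < tol:
        return "anti-diagonal"
    return None


X = np.array([[0, 1], [1, 0]])
HADAMARD = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])  # control: qubit 0, target: qubit 1
CZ = np.diag([1, 1, 1, -1])
```

With these conventions, a CNOT gate is classified as diagonal on its control qubit only, whereas a control-Z gate is diagonal on both of its qubits.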
As a result of this corollary, a two-qubit control-phase gate $C_{V(\theta)}$ acting on $q_{1,2}$ is $q_{1,2}$-rooted distributable, where $V(\theta)$ is a phase gate According to Theorem 3, one can implement two sequential $q$-rooted distributable unitaries $U_2 U_1$ through the packing of their distributing kernels $D_q(U_2) D_q(U_1)$ within a single packing process consuming only one ebit. Such packing is even possible for two non-sequential distributable unitaries through another type of packing process, namely embedding.

Embedding process
In this section, we introduce embedding processes, which enable the packing of non-sequential distributing processes through the embedding of the intermediate unitary between them.
Definition 7 (Embedding process) A unitary $U$ is $q$-rooted embeddable with respect to a bipartition A|B, if there exists a decomposition of $U$, $U = \prod_i V_i$ (Fig. 6 (a)), and corresponding kernels $B_q(V_i)$, where $L_{i,A}$ and $K_{i,A}$ act on local qubits $Q_A$, and $L_{i,B}$ and $K_{i,B}$ act on local qubits $Q_B \cup \{e\}$ (Fig. 6 (b)), such that The unitary $V_i$ is an embedding unit of $U$. The kernel $B_q(V_i)$ describes a $q$-rooted embedding rule of $V_i$. The process $P_{q,e}[B_q(V_i)]$ is a $q$-rooted $e$-assisted embedding process of $V_i$.
According to Theorem 3, a $q$-rooted embeddable unitary can be packed into a packing process with additional local kernels sandwiching the original embedding units $V_i$ (see Fig. 6 (c)), Note that $V_i$ can be a nonlocal unitary that still needs to be distributed. An example of embedding processes can be found in the first step of the DQC implementation of swapping in Fig. 3, where a CNOT gate is packed into a $q$-rooted embedding process. As shown in Fig. 7, an additional local CNOT gate is introduced in the kernel of the embedding process, while the nonlocal CNOT gate remains unaltered in the circuit. Such an embedding process in the swapping circuit can merge two non-sequential distributing processes into one packing process according to Theorem 3, as illustrated in Fig. 8 (a).
Formally, let U be a q-rooted embeddable unitary with the embedding rule B q (U ) = (M A ⊗M B )U (L A ⊗ L B ) and placed between two q-rooted distributable unitaries D 1,2 , which is shown in Fig. 8 (b).According to Theorem 3, one can pack the two nonsequential distributable unitaries D 1,2 into a packing process through the embedding of U , where D q (D i ) = K i,A ⊗K i,B are the distributing rules for D 1,2 and In the end, the embeddable unitary U is embedded into a merged packing process with the kernel A M B are both local unitaries and the distributability of U remains untouched.
The utility of embedding is not limited to packing processes rooted on a single qubit. It can be extended to simultaneous embedding processes rooted on multiple qubits. As shown in Fig. 9 (a), the unitary implemented by three CNOT gates can be jointly embedded in four packing processes which start simultaneously with four starting processes rooted on $\{q_1, q_2, q_3\}$ assisted by $\{e_1, e'_1, e_2, e_3\}$ and end simultaneously with four ending processes. In general, one can simultaneously embed a unitary into a joint packing process $\{P_{q_i,e_i}[K_i]\}_i$ rooted on multiple qubits $\{q_1, ..., q_k\}$, which is referred to as a $\{q_1, ..., q_k\}$-rooted joint embedding and illustrated in Fig. 9 (b).
Definition 8 (Joint embedding) A unitary $U$ is jointly embeddable on a set of root qubits $Q = \{q_1, ..., q_k\}$, if there exists a decomposition $U = \prod_i V_i$ and corresponding joint embedding rules such that each embedding unit can be implemented by a joint entanglement-assisted process $P_{Q,A}$ rooted on a set of qubits $Q$ and assisted with a set of auxiliary qubits $A = \{e_1, ..., e_k\}$, With joint embedding, one can simultaneously merge non-sequential distributing processes on multiple qubits. Note that a joint embedding can have packing processes on the same root qubit multiple times.
As shown in Fig. 9, the joint embedding of $U$ on $\{q_1, q_2, ..., q_k\}$ has two packing processes rooted on the same qubit $q_1$ and assisted with two auxiliary qubits $e_1$ and $e'_1$, respectively.
In general, it is non-trivial to derive a necessary condition for the embeddability of $U$, since one has to explore all the possible decompositions of $U$. However, one can still derive a primitive embedding rule $B_Q(U)$ as a sufficient condition for the $Q$-rooted embeddability of $U$ as follows. (25) where $C_{Q,X_A}$ is a sequence of control-X gates on {(q i , e i ) :

Corollary 9 (Primitive embedding rule)
This embedding rule is called the primitive $Q$-rooted embedding rule of $U$. Proof: This corollary is a result of Corollary 39 in Appendix A.
Due to the diagonal and anti-diagonal condition for distributability in Theorem 5, one can show that a $q$-rooted distributable unitary is always $q$-rooted embeddable.
Theorem 10 (Distributability and embeddability) If a unitary $U$ is $q$-rooted distributable, then $U$ is also $q$-rooted embeddable with the following embedding rules,
1. for $U$ being diagonal on $q$, $$U = P_{q,e}[U]; \tag{27}$$
2. for $U$ being anti-diagonal on $q$, $$U = P_{q,e}[U \otimes X_e],$$ where $X_e$ is an X gate on the auxiliary qubit $e$.
Proof: According to Theorem 5, a $q$-rooted distributable unitary $U$ is diagonal or anti-diagonal on $q$. One can show that, for $U$ being diagonal on $q$,
$$C_{q,X_e}\, U\, C_{q,X_e} = U,$$
while for $U$ being anti-diagonal on $q$,
$$C_{q,X_e}\, U\, C_{q,X_e} = U \otimes X_e.$$
As a result of Corollary 9, $U$ is embeddable on $q$ with the embedding rules $B_q(U) = C_{q,X_e}\, U\, C_{q,X_e}$.
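The two conjugation identities used in this proof can be checked on a small example. The sketch below assumes three qubits in the order $(q, e, b)$, a diagonal example given by a controlled phase gate on $(q, b)$, and an anti-diagonal example obtained by multiplying it with X on $q$. The phase value and all variable names are arbitrary choices for illustration.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
P0 = np.diag([1, 0]).astype(complex)
P1 = np.diag([0, 1]).astype(complex)


def kron3(a, b, c):
    """Tensor product of three single-qubit operators on (q, e, b)."""
    return np.kron(np.kron(a, b), c)


# Control-X gate with control q and target e, identity on b.
C_Q_XE = kron3(P0, I2, I2) + kron3(P1, X, I2)

# Diagonal on q: controlled phase gate with control q and target b.
V = np.diag([1, np.exp(0.3j)])
U_DIAG = kron3(P0, I2, I2) + kron3(P1, I2, V)

# Anti-diagonal on q: the same gate followed by X on q.
U_ANTI = kron3(X, I2, I2) @ U_DIAG

# X gate on the auxiliary qubit e.
X_E = kron3(I2, X, I2)

conj_diag = C_Q_XE @ U_DIAG @ C_Q_XE
conj_anti = C_Q_XE @ U_ANTI @ C_Q_XE
```

In this example, `conj_diag` equals `U_DIAG`, and `conj_anti` equals `U_ANTI @ X_E`, so the only additional kernel is the identity or a local X gate on $e$.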
In general, $q$-rooted embeddability is not a sufficient condition for $q$-rooted distributability, since there are some unitaries which are $q$-rooted embeddable but not $q$-rooted distributable. However, if the unitary we consider is a local unitary, one can show that embeddability is equivalent to distributability.

Corollary 11 (Embeddability of local unitaries)
A local unitary is $q$-rooted embeddable, if and only if it is $q$-rooted distributable. Proof: Without loss of generality, we assume that $q$ is on the system A. If a local unitary $U_A \otimes 1_B$ is $q$-rooted embeddable, there exists an embedding rule, The unitary $U_A$ is therefore $q$-rooted distributable with the local kernel , which shows that $q$-rooted embeddability is a sufficient condition for $q$-rooted distributability of a local unitary. According to Theorem 10, $q$-rooted embeddability is also a necessary condition for $q$-rooted distributability.
According to Theorem 5, this corollary implies that a local unitary is embeddable on q, if and only if it is diagonal or anti-diagonal on q.

Compatibility among embeddings
Let U be embeddable on two different sets of root qubits Q and Q ′ with the following embedding rules, respectively Suppose there are two Q-rooted packing processes P Q,A [L] and P Q,A [L ′ ] sandwiching the unitary U .The intermediate unitary U can be packed through its Qrooted embedding The same packing holds for Q ′ -rooted packing processes Although one can always implement the Q-rooted embedding of U followed by the Q ′ -rooted embedding recursively, it is not guaranteed that one can still pack the Q- Fig. 10 (a) shows a {q 1 , q 2 }-rooted embedding of U followed by a recursive {q 1 , q 3 }-rooted embedding.In this example, the root qubits are Q = {q 1 , q 2 } and Q ′ = {q 1 , q 3 } associated with the auxiliary qubits A = {e 1 , e 2 } and A ′ = {e ′ 1 , e 3 }, respectively.One can simultaneously pack Q-rooted processes and Q ′rooted processes through the joint embedding of U , if

(b),
Note that $Q \uplus Q'$ is the disjoint union of $Q$ and $Q'$, as the root qubits $Q$ and $Q'$ may share the same qubits. For example, the circuit on the left-hand side of Fig. 9 (b) is embeddable on both $Q = \{q_1, q_2\}$ and $Q' = \{q_1, q_3\}$. The $Q$-rooted embedding and $Q'$-rooted embedding of the circuit are compatible, since one can simultaneously implement these two embeddings in one joint embedding process.
As an embedding process can merge two non-sequential distributing processes, the compatibility of multiple embeddings allows us to merge multiple non-sequential distributing processes on multiple qubits. Fig. 11 demonstrates the utility of two compatible embeddings of $U$ to merge the distributing processes of $V_{1,2}$ and $W_{1,2}$. Suppose the unitary $U$ in Fig. 11 is embeddable on both $Q = \{q_1, q_2\}$ and $Q' = \{q_1, q_3\}$, which can be jointly embedded as shown in Fig. 10. Let $V_{1,2}$ and $W_{1,2}$ be distributable on $Q'$ and $Q$, respectively, while $W_{1,2}$ be $Q'$-rooted embeddable with the embedding rule As shown in Fig. 11 (a), one can first pack the $Q'$-rooted distributing processes of $V_{1,2}$ through the $Q'$-rooted embedding of $W_{1,2}$ and the $Q \uplus Q'$-rooted embedding of $U$. Afterward, one can pack the $Q$-rooted distributing processes of $W_{1,2}$ through the $Q \uplus Q'$-rooted embedding of $U$. In the end, the $Q \uplus Q'$-rooted embedding of $U$ allows the simultaneous merging of the $Q$-rooted processes and the $Q'$-rooted processes.
In this example, the recursive embeddings of $U$ rooted on $Q$ and $Q'$ are compatible for a simultaneous packing on $Q \uplus Q'$. For simultaneous packing, two embeddings must possess a recursive compatibility defined as follows.

Definition 12 (Compatible recursive embeddings)
Let B Q (U ) and B Q ′ (U ) be two embedding rules of a unitary U rooted on Q and Q ′ , respectively.They are compatible for a recursive embedding on Since the two sets of root qubits Q and Q ′ can share some common elements, the compatible merging of the two embeddings B Q and B Q ′ takes the disjoint union set Q ⊎ Q ′ as its root qubits.For the example in Fig. 10 (b), the root qubits after the compatible merging are As a result of Corollary 9, the compatibility of two recursive embeddings can be determined with the following sufficient condition.

Lemma 13 (Condition for recursive compatibility)
It is recursively compatible with another $Q$-rooted embedding of $U$, if The joint embedding rule is As a result of Lemma 13, the embedding rules for distributable unitaries are all recursively compatible.

Corollary 14 (Recursive embedding of distributing)
The $q$-rooted embedding of a $q$-rooted distributable unitary is recursively compatible with any other embedding. Proof: As a result of Lemma 13, the embedding rules in Theorem 10 are compatible for any recursive embeddings, since the additional kernel is $1$ or $X_e$, which commutes with any control-X gate.

A more complicated situation of simultaneous embeddings is the nested embedding shown in Fig. 12. Suppose the unitaries $U = BA$ and $U' = CB$ are embeddable on $Q = \{q_1, q_2\}$ and $Q' = \{q_1, q_3\}$ with the embedding rules $B_Q$ and $B_{Q'}$, respectively. If we have a unitary decomposed as $CBA$, it is not guaranteed that one can implement the embeddings of $B_Q(BA)$ and $B_{Q'}(CB)$ simultaneously. If the equality in Fig. 12 holds, the nested embeddings $B_Q(BA)$ and $B_{Q'}(CB)$ are compatible for simultaneous implementation.
Definition 15 (Compatible nested embedding) Let U and U ′ be embeddable on Q and Q ′ with the embedding rules B Q (U ) and B Q ′ (U ′ ), respectively.For the decomposition of U and U ′ , The embedding rules where B are all local unitaries (see Fig. 12).

Packing of distributable packets
A quantum circuit of $\eta$ qubits and depth $\tau$ can be represented by an $\eta \times \tau$ grid with the nodes $\{(q, t) : q \in \{q_1, ..., q_\eta\}, t \in \{1, ..., \tau\}\}$. At each depth $t$, we have a unitary $U_t$ implemented by a set of gates $g^{(Q_i)}_{t,i}$ acting on disjoint sets of qubits $Q_i$, which satisfy As a result of Theorem 5, the distributability of $U_t$ on a root qubit $q$ does not depend on the bipartition A|B. This means that one can easily identify all the distributable nodes of a circuit without paying any attention to the partitioning. The distributable nodes indicate the potential root points of distributing processes for global gates. In the packing procedure of the distributing processes, one can neglect the circuit nodes of single-qubit distributable gates and only consider the multi-qubit gates. According to Theorem 3, two consecutive distributable nodes $(q, t)$ and $(q, t + 1)$ can be straightforwardly packed in one distributing process.
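The grid picture translates directly into a small data structure. The sketch below is a simplified illustration, not the identification algorithm of Section 4: it stores a circuit as a list of layers and collects the nodes $(q, t)$ on which a multi-qubit gate is diagonal or anti-diagonal. The `Gate` class, the gate names, and the lookup table `ROOT_SLOTS` (which encodes Corollary 6 for a few common gates) are assumptions introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical lookup table: for each gate name, which qubit slots can act as
# a root, i.e. the gate is diagonal or anti-diagonal on that qubit.
ROOT_SLOTS = {
    "CZ": (True, True),
    "CPHASE": (True, True),
    "CNOT": (True, False),  # slots are (control, target)
}


@dataclass(frozen=True)
class Gate:
    name: str
    qubits: tuple  # ordered qubit indices, e.g. (control, target)


def distributable_nodes(layers):
    """Return the nodes (q, t) rooted on multi-qubit gates; depths start at 1."""
    nodes = set()
    for t, layer in enumerate(layers, start=1):
        for gate in layer:
            if len(gate.qubits) < 2:
                continue  # single-qubit gates are neglected in the packing
            slots = ROOT_SLOTS.get(gate.name, (False,) * len(gate.qubits))
            nodes.update((q, t) for q, ok in zip(gate.qubits, slots) if ok)
    return nodes


# Example circuit on three qubits with depth 3.
layers = [
    [Gate("CZ", (0, 2))],
    [Gate("H", (0,)), Gate("CNOT", (1, 2))],
    [Gate("CNOT", (0, 1))],
]
nodes = distributable_nodes(layers)
```

Note that no bipartition enters this step, in line with the observation above that distributable nodes can be identified before the partitioning is fixed.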
Suppose there are two non-sequential distributable nodes $(q, t_1)$ and $(q, t_2)$ separated by a unitary $W^{(q)}_{t_2;t_1}$, which is non-distributable but embeddable. One can still pack the two non-sequential distributing processes rooted on $(q, t_1)$ and $(q, t_2)$ into one packing process through the embedding of $W^{(q)}_{t_2;t_1}$. Such a packing of non-sequential distributing processes reduces the entanglement cost in the distributed quantum computation. To explore all the possible packings of distributing processes in a quantum circuit, we need to identify its potential embedding processes and distributing processes. The embedding processes will then merge the root points of distributing processes into distributable packets, which are introduced as follows.
Definition 16 (Distributable packet) Let $V_{q,T}$ be a set of distributable nodes $(q, t)$ on the qubit $q$ at the depths $t \in T$ of a quantum circuit, $V_{q,T} := \{(q, t) : t \in T,\ U_t \text{ is } q\text{-rooted distributable, and } (q, t) \text{ is associated with a multi-qubit gate}\}$. Let $\mathcal{B}$ be a set of embedding rules. The set $V_{q,T}$ is $\mathcal{B}$-packable with respect to a bipartition $A|B$ if the unitary between each pair of depths in $T$ is $q$-rooted embeddable with respect to the bipartition $A|B$ according to the embedding rules in $\mathcal{B}$. Such a set $V_{q,T}$ is called a distributable $\mathcal{B}$-packet. A single distributable node $\{(q, t)\}$ is a trivial distributable $\mathcal{B}$-packet. The set of distributable $\mathcal{B}$-packets of a circuit is denoted as $\mathcal{V}_{\mathcal{B}}$. If there is no ambiguity in a context, we will employ the terminology "distributable packet" without the specification of the embedding rules $\mathcal{B}$ for conciseness.
In this definition, we only consider the distributable nodes associated with global gates in Eq. (44), since local gates do not need to be distributed. It is worth noting that the nodes in $V_{q,T}$ are not necessarily contiguous.
If two distributable packets have a non-empty intersection, they can be merged into a larger packet through the following lemma.

Lemma 17 (Merging of distributable packets)
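Two distributable packets $V_{q,T_1}$ and $V_{q,T_2}$ can be merged into one packet $V_{q,T_1 \cup T_2}$ if they have a non-empty intersection, $V_{q,T_1} \cup_D V_{q,T_2} := V_{q,T_1 \cup T_2}$ if $V_{q,T_1} \cap V_{q,T_2} \neq \emptyset$. The symbol $\cup_D$ represents the merging of distributable packets.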
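The merging rule of Lemma 17 can be illustrated with a minimal Python sketch. The representation is an assumption made only for illustration: a packet $V_{q,T}$ is stored as a pair `(q, frozenset(T))`, and `merge_packets` is a hypothetical helper that is not part of the algorithms of this paper.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a packet V_{q,T} is (root qubit label, set of depths T).
Packet = Tuple[str, FrozenSet[int]]


def merge_packets(packets: List[Packet]) -> List[Packet]:
    """Merge packets on the same qubit whose depth sets intersect (Lemma 17)."""
    merged: List[Packet] = []
    for q, depths in packets:
        grown = set(depths)
        untouched: List[Packet] = []
        for q2, d2 in merged:
            if q2 == q and grown & d2:
                grown |= d2          # V_{q,T1} u_D V_{q,T2} = V_{q,T1 u T2}
            else:
                untouched.append((q2, d2))
        merged = untouched + [(q, frozenset(grown))]
    return merged


if __name__ == "__main__":
    packets = [("q", frozenset({0, 4})), ("q", frozenset({4, 5})),
               ("q", frozenset({6, 8})), ("q", frozenset({5, 6})),
               ("p", frozenset({4, 9}))]
    for q, depths in merge_packets(packets):
        print(q, sorted(depths))
```

Packets on different qubits are never merged, even if their depth sets overlap, since the merging is defined only for packets rooted on the same qubit $q$.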
After the merging of these processes, one obtains a packing process on a distributable packet, of which the kernels are given as follows.
Theorem 18 (Packing on distributable packets) Given a distributable packet $V_{q,T}$, the $q$-rooted distributing processes of $U_t$ with $t \in T$ can be packed into one packing process through embedding as follows, $\prod_{t \in [\min T, \max T]} U_t = P_{q,e}\big[\prod_{t \in [\min T, \max T]} K_t\big]$, (47) where the kernel $K_t$ of the depth $t$ is $K_t = D_q(U_t)$ for $t \in T$ and $K_t = B_q(U_t)$ for $t \notin T$.
Proof: The unitary $\prod_{t \in [\min T, \max T]} U_t$ can be implemented with distributing processes for $t \in T$ and embedding processes for $t \notin T$, $\prod_{t \in [\min T, \max T]} U_t = \prod_{t \in [\min T, \max T]} P_{q,e}[K_t]$, where the kernel $K_t$ of the depth $t$ is the distributing rule $D_q(U_t)$ for $t \in T$ and the embedding rule $B_q(U_t)$ for $t \notin T$. As a result of Theorem 3, the product of unitary-equivalent packing processes can be packed into the single packing process given in Eq. (47).
The set of distributable packets of a quantum circuit after all possible mergings contains only disjoint packets. The DQC of a quantum circuit can be implemented through the packing processes rooted on these distributable packets. The number of selected root packets is equal to the entanglement cost of the corresponding DQC implementation.

Conflicts among packing processes
Although distributing processes and embedding processes can be packed in a distributable packet, it is not guaranteed that the packing processes rooted on two distributable packets can be simultaneously implemented. The packing process on a distributable packet contains the distributing processes rooted on its elements and the embedding processes between them. The implementation of these distributing processes and embedding processes may introduce additional local kernels that change the distributability and embeddability of another packet. This may cause incompatibility among distributable packets.
To reveal the incompatibility among distributing processes and embedding processes on distributable packets, we can unpack a packing process on a packet $V_{q,T}$ in Eq. (47) into a sequence of packing processes represented by distributing kernels $D_{q;t}$ and embedding kernels $B_{q;t_1,t_2}$, see Eqs. (50) and (51). If there exists a trivial distributable packet $V_{q;t}$ with $t_1 < t < t_2$, then one can replace the embedding rule $B_{q;t}$ in Eq. (51) by the distributing rule $D_{q;t}$, $B_{q;t_1,t_2} \to B_{q;t,t_2} D_{q;t} B_{q;t_1,t}$. (52) This adds the node $(q, t)$ to the packet $V_{q,T}$ in Eq. (50) and forms a larger packet $V_{q,T \cup \{t\}}$. We call such an embedding decomposable, which is defined as follows.
Definition 19 (Indecomposable embedding) An embedding process $B_{q;t_1,t_2}$ is decomposable if there exists a distributable node $V_{q,t}$ of a multi-qubit gate with $t_1 < t < t_2$ such that $P_{q,e}[B_{q;t_1,t_2}] = P_{q,e}[B_{q;t,t_2} D_{q;t} B_{q;t_1,t}]$. (53) Otherwise, it is indecomposable. By definition, the embedding $B_{q;t}$ on a trivial distributable node $V_{q,t}$ is also an indecomposable embedding.
Given a circuit $Q$, the set of all indecomposable embeddings fully describes the embedding rules applied to the circuit $Q$. On the other hand, the set of all distributing kernels on trivial distributable packets fully describes the distributing rules for $Q$. The incompatibility among distributing rules and embedding rules should be identified at the level of indecomposable embedding processes and trivial distributing processes, which together form a set of kernels $\mathcal{K}(Q)$ of the potential packing processes. We call such a set the indecomposable packing kernels of $Q$. The incompatibility among indecomposable packing kernels can be identified with directed conflict edges.
Definition 20 (Kernel-conflict edges) Let $K_{q_1;T_1}, K_{q_2;T_2} \in \mathcal{K}(Q)$ be two indecomposable packing kernels of a circuit $Q$, abbreviated as $K_1$ and $K_2$. The incompatibility between $K_1$ and $K_2$ is represented by a directed edge $K_1 \to K_2$ with $c(K_1, K_2) = 1$ if the implementation of $K_1$ prevents the implementation of $K_2$. If the two kernels are mutually incompatible, i.e. $c(K_1, K_2) = c(K_2, K_1) = 1$, it forms an undirected edge $K_1 \leftrightarrow K_2$. (58) The matrix $\{c(K_i, K_j)\}_{i,j}$ is the adjacency matrix of conflict edges.
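As an illustration of the adjacency matrix $\{c(K_i, K_j)\}_{i,j}$, the following sketch separates one-directional conflict edges from mutual ones. The function name `classify_conflicts` and the nested-list encoding of the matrix are assumptions made for this example; the sample matrix encodes the conflicts of Fig. 13 (a) as described in the text, with an assumed ordering of the kernels.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def classify_conflicts(c: List[List[int]]) -> Tuple[List[Edge], List[Edge]]:
    """Split a conflict adjacency matrix into directed and mutual edges.

    c[i][j] == 1 means that implementing kernel i prevents kernel j.
    """
    directed: List[Edge] = []
    mutual: List[Edge] = []
    n = len(c)
    for i in range(n):
        for j in range(i + 1, n):
            if c[i][j] and c[j][i]:
                mutual.append((i, j))      # K_i <-> K_j
            elif c[i][j]:
                directed.append((i, j))    # K_i -> K_j
            elif c[j][i]:
                directed.append((j, i))    # K_j -> K_i
    return directed, mutual


if __name__ == "__main__":
    # Assumed kernel order: 0 = D_{q1;t}, 1 = D_{q2;t}, 2 = B_{q1;t}.
    c = [[0, 1, 1],
         [1, 0, 0],
         [0, 0, 0]]
    print(classify_conflicts(c))
```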
A conflict caused by the intrinsic structure of a quantum circuit, such as the incompatibility of embeddings, is called an intrinsic conflict. In practice, two simultaneous packing processes also need to compete for the necessary external resources. As shown in Fig. 2, the external resources consumed in a packing process are a one-ebit entangled pair, an auxiliary memory qubit $e'$ on the root QPU and an auxiliary memory qubit $e$ on the remote QPU, which are required in a DQC implementation. The limitation on available external resources leads to conflicts among packing processes. Such a conflict caused by limited external resources is referred to as an extrinsic conflict.
Note that, in this paper, the auxiliary memory qubits in the root QPU are not counted as limited external resources, since they are only needed in the starting process (see Fig. 2 (a)), and can be reset and reused immediately after the starting processes. Besides, we ignore the limitation on the coherence time of auxiliary memory qubits, as we assume that their coherence time is not worse than that of the local working qubits and is long enough for implementing packing processes.
In a circuit, an indecomposable packing kernel can be represented by a vertex placed at its root circuit nodes with its packing rule written on it. For example, in Fig. 13, the two trivial distributable packets $D_{q_1;t}$ and $D_{q_2;t}$ are represented as red vertices. According to Theorem 10, every distributable unitary is also embeddable. There must therefore be an underlying indecomposable embedding process associated with each trivial distributable packet rooted on the same circuit nodes. On the same root circuit nodes, the root vertices of the distributing kernels $D_{q_1;t}$ and $D_{q_2;t}$ are placed over their underlying indecomposable embedding vertices $B_{q_1;t}$ and $B_{q_2;t}$, respectively. A further example of an indecomposable packing kernel $B_{q_1;t_1,t_3}$ is shown in Fig. 14 as a pincer-shaped vertex on its root circuit nodes, where the pincer-shaped geometry indicates its ability to merge the two distributable packets before and after it.
At the level of indecomposable kernels, there are three types of incompatibility, namely the incompatibility between two distributing processes, between a distributing process and an embedding process, and between two embedding processes, which are demonstrated in Fig. 13, 14 and 15, respectively.
DD-type conflicts (Fig. 13). The first type of conflict is the incompatibility between two distributing processes, which we call a DD-type conflict. As shown on the right-hand side of Fig. 13 (a), a control-Z gate can be distributed in a distributing process with the kernel $D_{q_1;t}$ or $D_{q_2;t}$ rooted on the node $(q_1, t)$ or $(q_2, t)$, respectively. If one of the distributing processes is implemented, the other process is not needed anymore. The DD-type conflict of this example is an intrinsic property of a nonlocal control-Z gate. It is therefore an intrinsic conflict. On the left-hand side of Fig. 13 (a), the distributability of the control-Z gate is represented by two vertices (trivial distributable packets), where the distributing rules $D_{q_1;t}$ and $D_{q_2;t}$ of the control-Z gate are rooted. The incompatibility of the distributability is then represented by a red bidirected edge ($D_{q_1;t} \leftrightarrow D_{q_2;t}$) between these two red vertices.
Besides being distributable, the same CZ gate can also be embedded with the embedding rule $B_{q_1;t}$ shown on the right-hand side of Fig. 13 (a). The embedding rule $B_{q_1;t}$ and the distributing rule $D_{q_1;t}$ compete for auxiliary memory qubits on system $B$, since each packing auxiliary qubit can only be assigned to one process at a time. If there is only one auxiliary memory qubit $e_B$ available, then only one of $B_{q_1;t}$ and $D_{q_1;t}$ can be implemented, which is an extrinsic conflict. Since a distributing process has a higher priority than an embedding process, the conflict edge ($B_{q_1;t} \leftarrow D_{q_1;t}$) is directed from the distributing rule, which is represented by a green arrow in the right-hand side figure. As a whole, the incompatibility among the packing processes is represented by a graph of indecomposable packing root vertices. A red bidirected edge represents the intrinsic conflict between two trivial distributing processes, while a directed green edge represents the incompatibility between a trivial distributing process and its underlying embedding. Note that the directed green edge is omitted on the left-hand side graph by default.
Other possible DD-type conflicts occur when two trivial distributable packets are placed at the same depth of a local QPU, as shown in Fig. 13 (b). The conflict is caused by the competition among packing processes for external auxiliary memory qubits, which is extrinsic. The conflict between the embeddings $B_{q_1;t}$ and $B_{q_2;t}$ is automatically resolved if the conflicts associated with the trivial distributable packets are resolved. We therefore employ a dashed edge to represent it.
DB-type conflicts (Fig. 14). The second type of conflict is the incompatibility between a trivial distributing process and an indecomposable embedding process, which we call a DB-type conflict. As shown on the right-hand side of Fig. 14, a circuit consisting of an X gate and a CNOT gate at the depth $t_2$ can be distributed in a distributing process with the rule $D_{q_2;t_2}$ rooted on $(q_2, t_2)$, and is also embeddable with the rule $B_{q_2;t_2}$ rooted on $(q_2, t_2)$. Another indecomposable embedding rule $B_{q_1;t_1,t_3}$ can be applied on $(q_1, t_2)$, which merges the packing processes rooted on $(q_1, t_1)$ and $(q_1, t_3)$.
All three packing processes compete for local auxiliary memory qubits, which leads to extrinsic conflicts. Since the distributing process $D_{q_2;t_2}$ has a higher priority than the embeddings, we employ green directed edges $D_{q_2;t_2} \to B_{q_2;t_2}$ and $D_{q_2;t_2} \to B_{q_1;t_1,t_3}$ from the distributing root vertex $D_{q_2;t_2}$ to the embedding root vertices to represent the DB-type conflicts.
The conflict between the indecomposable embeddings is represented by a blue bidirected edge $B_{q_2;t_2} \leftrightarrow B_{q_1;t_1,t_3}$. Here, a dashed line is employed, since this type of extrinsic conflict is automatically resolved once the conflicts associated with the trivial distributing processes are resolved. For a detailed explanation, please refer to the BB-type conflicts in the next paragraph.
BB-type conflicts (Fig. 15). The third type of conflict is the incompatibility between two indecomposable embeddings, which we call a BB-type conflict.
Fig. 15 (a) shows a circuit example of an extrinsic BB-type conflict, in which the embedding $B_{q_1;t_1,t_3}$ of the X gate on $(q_1, t_2)$ competes for auxiliary memory qubits with the embedding $B_{q_2;t_1,t_3}$ of the CNOT gate on $(q_2, t_2)$. A blue bidirected edge $B_{q_1;t_1,t_3} \leftrightarrow B_{q_2;t_1,t_3}$ on the pincer-shaped vertices represents the extrinsic conflict. In a DQC implementation, the use of external resources for a packing process is meaningless unless a distributing process is involved. An indecomposable embedding must be packed inside a distributable packet. An extrinsic conflict between two indecomposable embeddings therefore implies a conflict between two distributable packets competing for the same extrinsic resource. It implies the existence of distributing processes that compete for the same resource and lead to DB-type or DD-type conflicts, which have a higher priority. As a result, such an extrinsic BB-type conflict is redundant and is therefore represented by a dashed line.
Besides extrinsic BB-type conflicts, there are conflicts between indecomposable embeddings that are intrinsic and cannot be resolved by adding external resources. For the circuit example shown in Fig. 15 (b), the nonlocal gate is embeddable either on $(q_1; t_2)$ or on $(q_2; t_2)$ with the embedding rules $B_{q_1;t_1,t_3}$ or $B_{q_2;t_1,t_3}$, respectively. However, these two embedding rules are not compatible for joint embedding due to fundamental limits (see Theorem 36 in Section 4.2). One has to resolve this conflict by abandoning one of the embeddings. Such an intrinsic BB-type conflict is represented as a blue bidirected edge $B_{q_1;t_1,t_3} \leftrightarrow B_{q_2;t_1,t_3}$ connecting the embedding root vertices. Since an intrinsic conflict is never redundant, it is always represented with a solid line.
Conflict structure and its solution. The distribution of a quantum circuit can possess all three types of conflicts at the same time. Fig. 16 shows the conflict structure of a circuit example in terms of the kernels of its trivial distributing processes and indecomposable embedding processes. With limited resources, one has to abandon conflicting processes to resolve the conflicts. The possible actions are the removal of distributing processes and the removal of embedding processes from a distributable packet. Let $V_{q,T'}$ be a sub-packet of a distributable packet $V_{q,T}$. The abandonment of the distributing processes associated with the sub-packet $V_{q,T'}$ from the packet $V_{q,T}$ does not affect the embeddability on $(q, T')$. It therefore leads to a new packet $V_{q,T \setminus T'}$. The action of the removal of $V_{q,T'}$ from $V_{q,T}$ is defined as follows.
Definition 21 (Removal of distributing processes) After the removal of the distributing processes associated with $V_{q,T'}$ from $V_{q,T}$, where $T' \subset T$, one obtains the packet $V_{q,T \setminus T'}$. On the contrary, the removal of embedding processes from a distributable packet splits the packet.
Definition 22 (Removal of embedding processes) Let $B_{q;t_1,t_2}$ be an indecomposable embedding in a distributable packet $V_{q;T}$ with $\{t_1, t_2\} \subseteq T$. After the removal of the embedding $B_{q;t_1,t_2}$, the packet $V_{q;T}$ is split into two packets $V_{q;T_1}$ and $V_{q;T_2}$, where $T_{1,2}$ are the two split regions of depths, $T_1 = \{t \in T : t \le t_1\}$ and $T_2 = \{t \in T : t \ge t_2\}$.
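The two removal actions can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a packet is stored as `(q, frozenset(T))`, the helper names `remove_nodes` and `remove_embedding` are hypothetical, and the split regions are taken to be the depths up to $t_1$ and the depths from $t_2$ on.

```python
from typing import FrozenSet, Iterable, Tuple

Packet = Tuple[str, FrozenSet[int]]  # assumed encoding of V_{q,T}


def remove_nodes(packet: Packet, removed: Iterable[int]) -> Packet:
    """Removal of distributing processes: V_{q,T} -> V_{q,T minus T'}."""
    q, depths = packet
    return q, frozenset(depths) - frozenset(removed)


def remove_embedding(packet: Packet, t1: int, t2: int) -> Tuple[Packet, Packet]:
    """Removal of an indecomposable embedding B_{q;t1,t2}: split the packet."""
    q, depths = packet
    if t1 not in depths or t2 not in depths or not t1 < t2:
        raise ValueError("t1 < t2 must both be depths of the packet")
    if any(t1 < t < t2 for t in depths):
        # A packet node strictly between t1 and t2 would make the embedding decomposable.
        raise ValueError("embedding is decomposable within this packet")
    left = frozenset(t for t in depths if t <= t1)    # assumed split region T_1
    right = frozenset(t for t in depths if t >= t2)   # assumed split region T_2
    return (q, left), (q, right)


if __name__ == "__main__":
    packet = ("q", frozenset({0, 4, 5, 6, 8}))
    print(remove_nodes(packet, {5}))
    print(remove_embedding(packet, 4, 5))
```

Removing nodes keeps a single packet, so the entanglement cost is unchanged, whereas removing an embedding produces two packets and hence one additional packing process.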

Packing graphs and conflict graphs
The conflicts between packing processes can be solved through the removal of distributing processes or embedding processes. Since distributing processes have a higher priority, we first solve the DD-type conflicts among distributing processes before the DB-type and BB-type conflicts, which involve embedding processes. In a quantum circuit, the conflicts between two distributing processes are associated with global gates. For general multi-qubit gates, one needs hyperedges to describe the conflict. Since all multi-qubit gates can be decomposed into two-qubit gates, in this paper we consider circuits consisting solely of one-qubit and two-qubit gates. For a two-qubit gate which is distributable on both of its gate nodes $(q_1, t)$ and $(q_2, t)$, the gate nodes belong to distributable packets $V_{q_1,T_1}$ and $V_{q_2,T_2}$, respectively. One has the freedom to choose either $V_{q_1,T_1}$ or $V_{q_2,T_2}$ as the root packet to anchor a distributing process. Once a root packet is selected, the distributable node on the other packet has to be removed according to Definition 21. One can assign a packing edge to each two-qubit gate and form a packing graph.

Definition 23 (Packing edge)
Given a set of distributable $\mathcal{B}$-packets $\mathcal{V}_{\mathcal{B}}$, two distributable $\mathcal{B}$-packets $V_{q_1,T_1}$ and $V_{q_2,T_2}$ are connected by a two-qubit gate $g_t$ at the depth $t$ if the nodes $\{(q_1, t), (q_2, t)\}$ of $g_t$ belong to $V_{q_1,T_1}$ and $V_{q_2,T_2}$, respectively. The pair of connected distributable packets forms a packing edge $E_{g_t}$ associated with the gate $g_t$. The gate $g_t$ is a packing-edge gate. The set of all such edges is the packing edge set. If $q_1$ and $q_2$ belong to the same local system, the edge $E_{g_t}$ is local, otherwise global.
Definition 24 (Packing graph) A $\mathcal{B}$-packing graph $G_{\mathcal{B}}$ of a circuit is formed by the set of distributable $\mathcal{B}$-packets and the set of packing edges. To distribute all the global gates, one needs to select a set of distributable packets that covers all packing edges. Note that the packing edges are defined at the level of distributable $\mathcal{B}$-packets, while the kernel-conflict edges in Definition 20 are defined at the level of indecomposable kernels of distributing and embedding processes.
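Selecting a set of packets that covers all packing edges is a vertex-cover problem on the packing graph. The sketch below enumerates all minimum vertex covers by exhaustive search. It is only an illustration for small graphs: the function name `minimum_vertex_covers` and the string labels of the packets are assumptions, and the search takes exponential time.

```python
from itertools import combinations
from typing import Hashable, Iterable, List, Set, Tuple


def minimum_vertex_covers(vertices: Iterable[Hashable],
                          edges: Iterable[Tuple[Hashable, Hashable]]) -> List[Set]:
    """Return all minimum vertex covers of a graph by brute force."""
    vertices = list(vertices)
    edges = list(edges)
    for size in range(len(vertices) + 1):
        covers = [set(subset) for subset in combinations(vertices, size)
                  if all(u in subset or v in subset for u, v in edges)]
        if covers:
            return covers  # smallest size that admits at least one cover
    return []


if __name__ == "__main__":
    # Hypothetical packing graph: one packet on q1 connected to packets on q2 and q3.
    packets = ["V_q1", "V_q2", "V_q3"]
    packing_edges = [("V_q1", "V_q2"), ("V_q1", "V_q3")]
    covers = minimum_vertex_covers(packets, packing_edges)
    print(covers, "entanglement cost:", len(covers[0]))
```

In this toy graph the single packet on $q_1$ covers both packing edges, so one packing process suffices if no conflicts arise.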
Besides the DD-type conflict, one also needs to take the conflicts associated with embeddings into account.Each conflict edge at the level of indecomposable kernels incident to an embedding kernel also defines a conflict edge at the level of distributable packets.
A quantum circuit has two sets of conflict edges at two different levels, one is the set of kernel-conflict edges at the level of indecomposable packing kernels, the other is the set of packet-conflict edges at the level of distributable packets.
Definition 25 (Conflict edge set) The set of kernel-conflict edges $C(\mathcal{K})$ is defined at the level of indecomposable packing kernels $\mathcal{K}$, $C(\mathcal{K}) := \{K_1 \to K_2 : c(K_1, K_2) = 1, \text{ and } K_{1,2} \in \mathcal{K}\}$, (67) while the set of packet-conflict edges $C(\mathcal{V})$ is defined at the level of distributable packets $\mathcal{V}$, $C(\mathcal{V}) := \{V_1 \to V_2 : \text{defined in Eq. (66), and } V_{1,2} \in \mathcal{V}\}$. (68) The extrinsic conflict edges caused by the competition for external resources, e.g. auxiliary qubits, can be solved by supplying sufficient external resources. The intrinsic conflict edges caused by the intrinsic incompatibility of embeddings are not solvable with external resources. The intrinsic and extrinsic edge sets are denoted by the subscripts in $C_{in}$ and $C_{ex}$, respectively. We can employ the intrinsic and extrinsic conflict graphs to represent the conflicts of packing processes.
Definition 26 (Conflict graph) Packet-conflict graphs and kernel-conflict graphs are defined at the level of distributable B-packets V B and indecomposable packing kernels K, respectively, for intrinsic conflict (69) and for extrinsic conflict (70)

Packing algorithm
The packing graphs G and conflict graphs κ contain the full information of distributability, embeddability, and compatibility, which can be employed to determine the final packing of a quantum circuit.In this section, we provide heuristic algorithms for finding entanglement-efficient packing in the scenarios of unlimited and limited external resources, respectively.
Packing with unlimited auxiliary qubits. Suppose that we have unlimited packing auxiliary qubits. One can then determine the ultimate packing and the required number of packing auxiliary qubits according to Algorithm 27. In Algorithm 27, one first searches all possible minimum vertex covers $\{V_i\}_i$ of the packing graph $G_{\mathcal{B}}$. From each minimum vertex cover $V_i$, one obtains a set of packing kernels $K_i$ rooted on the packets in $V_i$. The kernel set $K_i$ induces a subgraph $C_{in}(K_i)$ of the intrinsic kernel-conflict graph. To solve the intrinsic conflicts in $C_{in}(K_i)$ with the least amount of additional entanglement, one needs to find a minimum vertex cover $K_{i,j}$ of $C_{in}(K_i)$. One then removes the embeddings in $K_{i,j}$ and splits the corresponding packets in $V_i$ to get a set of compatible root packets $R_{i,j}$. The cardinality $|R_{i,j}|$ is the number of packing processes and hence the entanglement cost for implementing the packing processes rooted on $R_{i,j}$. One may further reduce the entanglement cost through the merging of the packets in $R_{i,j}$ with an add-on function "extended_embedding" based on the extended embedding introduced in Appendix B.
The required number of packing auxiliary qubits on the local $A(B)$ system can be determined by the chromatic number $\chi^{A(B)}_{i,j}$ of the extrinsic packet-conflict graph $C_{ex}(R^{A(B)}_{i,j})$. In the end, one obtains a set of compatible root packets $R_{i,j}$ and their corresponding required numbers of local auxiliary qubits $\chi^{A(B)}_{i,j}$. Depending on the explicit scenario, one can choose either the set of root packets with the least number of ebits $\min_{i,j} |R_{i,j}|$, or the least number of local packing ancillas $\min_{i,j} \chi^{A(B)}_{i,j}$, or some particular trade-off between the entanglement consumption and the packing auxiliary qubits.
Note that enumerating all minimum vertex covers is a hard problem, as the cardinality of the set of minimum vertex covers can increase exponentially [49]. A heuristic solution is to search for a single minimum vertex cover instead of enumerating all of them, which sacrifices the optimality for the entanglement efficiency. Besides, the determination of the chromatic number of a general conflict graph $C_{ex}$ is an NP-hard problem [50]. Nevertheless, one can still efficiently estimate an upper bound on the chromatic number $\chi$, which implies a number of auxiliary memory qubits that guarantees the implementation of the distributing processes. Overall, the time complexity of Algorithm 27 is estimated as $O(m^2\, 3^{4m/3}\, 2^{2m})$, while its heuristic implementation has the complexity $O(m^{7/2})$, where $m$ is the number of nonlocal two-qubit control-phase gates in a circuit (see Appendix D.5 for an explanation).
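One standard way to obtain such an upper bound on the chromatic number is greedy colouring. The sketch below is not the estimation procedure of this paper; it is a generic heuristic, shown under the assumption that the extrinsic conflict edges are treated as undirected. The function name `greedy_colouring` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def greedy_colouring(vertices: List[Hashable],
                     conflict_edges: Iterable[Tuple[Hashable, Hashable]]
                     ) -> Dict[Hashable, int]:
    """Greedy proper colouring; the number of colours upper-bounds chi."""
    neighbours = {v: set() for v in vertices}
    for u, v in conflict_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    colour: Dict[Hashable, int] = {}
    # Colour high-degree vertices first (Welsh-Powell style ordering).
    for v in sorted(vertices, key=lambda x: len(neighbours[x]), reverse=True):
        used = {colour[n] for n in neighbours[v] if n in colour}
        colour[v] = next(c for c in range(len(vertices)) if c not in used)
    return colour


if __name__ == "__main__":
    packets = ["a", "b", "c", "d"]
    conflicts = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
    colouring = greedy_colouring(packets, conflicts)
    print(colouring, "auxiliary qubits needed (upper bound):",
          max(colouring.values()) + 1)
```

Each colour corresponds to one packing auxiliary qubit, and packets of the same colour can share that qubit because they do not conflict.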
Packing with limited auxiliary qubits. For the packing with extrinsic limits on packing auxiliary qubits, one can employ Algorithm 28 to find an entanglement-efficient packing given a fixed number of auxiliary qubits. In Algorithm 28, the steps up to line 9 solve the intrinsic conflicts. They are identical to Algorithm 27 for the packing without extrinsic limits. From line 10, the steps for solving extrinsic conflicts under the extrinsic limits must be modified accordingly. Suppose the available numbers of local packing auxiliary qubits are $(\chi_A, \chi_B)$. On line 12, one divides $R^A_{i,j}$ into $\chi_A$ colors $\{R^A_{i,j,c}\}_{c=1,...,\chi_A}$ on the system $A$, and does the same for the system $B$. Each color represents a packing auxiliary qubit. The goal is to find the optimum $\chi_{A(B)}$-color partition of $R^{A(B)}_{i,j}$, for which the number of conflict edges covered by same-color packets is minimum. A partitioning algorithm can be employed to find the minimum-conflict color partition [51]. However, there is no efficient algorithm available yet. For a heuristic solution, one can simply find a $\chi_{A(B)}$-color coloring and continue to the next step.
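A simple heuristic for such a colour partition is sketched below. It is an assumption-laden illustration rather than the partitioning algorithm of [51]: each packet is greedily given the colour that currently creates the fewest same-colour conflicts, and the remaining same-colour conflict edges are returned so that they can be resolved in the next step. The function name `limited_colouring` is hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def limited_colouring(vertices: List[Hashable], conflict_edges: List[Edge],
                      n_colours: int) -> Tuple[Dict[Hashable, int], List[Edge]]:
    """Greedy n_colours-colour partition that tries to avoid same-colour conflicts."""
    neighbours = {v: set() for v in vertices}
    for u, v in conflict_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    colour: Dict[Hashable, int] = {}
    for v in sorted(vertices, key=lambda x: len(neighbours[x]), reverse=True):
        # Count already-coloured neighbours per colour and pick the least used one.
        counts = [0] * n_colours
        for n in neighbours[v]:
            if n in colour:
                counts[colour[n]] += 1
        colour[v] = counts.index(min(counts))
    unresolved = [(u, v) for u, v in conflict_edges if colour[u] == colour[v]]
    return colour, unresolved


if __name__ == "__main__":
    packets = ["a", "b", "c"]
    conflicts = [("a", "b"), ("b", "c"), ("c", "a")]
    print(limited_colouring(packets, conflicts, n_colours=2))
    print(limited_colouring(packets, conflicts, n_colours=3))
```

With two colours, a triangle of mutually conflicting packets necessarily leaves one same-colour conflict, which then has to be solved by removing kernels. With three colours no conflict remains.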
After obtaining the $c$-color extrinsic conflict graph $C_{ex}(R^{A(B)}_{i,j,c})$, one extracts the process kernels of $R^{A(B)}_{i,j,c}$ and finds the minimum vertex cover of their conflict graph. To solve the conflicts, one removes the kernels in this vertex cover and obtains a set of compatible $c$-color packets $R^{A(B)}_{i,j,c}$. In the end, one obtains the candidate sets of root packets $R_{i,j}$ through the union of the $c$-color root packets $R^{A(B)}_{i,j,c}$. The ultimate set of root packets is the one that has the minimum cardinality, $R = \arg\min_{R_{i,j}} |R_{i,j}|$. The corresponding quantum circuit can then be locally implemented with the packing processes rooted on the packets in $R$ with the assistance of $\chi_A$ and $\chi_B$ local packing auxiliary qubits and $|R|$ ebits of entanglement. Note that Algorithm 28 has the same time complexity as Algorithm 27 (see Appendix D.5 for an explanation).

Algorithm 27: Packing algorithm without extrinsic limits
...
$R_{i,j} \leftarrow \bigcup_{V \in V_i, B \in K_{i,j}} V \setminus B$: split the vertex cover $V_i$ by removing its embedding processes included in $K_{i,j}$;
/* Find possible entanglement reduction with an add-on function extended_embedding based on an extended embedding. (See Appendix B for the details of extended embeddings.) */
7 $R_{i,j} \leftarrow$ extended_embedding($R_{i,j}$): look for possible extended embeddings in $R_{i,j}$;
$R^{A}_{i,j}, R^{B}_{i,j} \leftarrow R_{i,j}$: divided into two sets of local packets;
...: update the extrinsic local packet-conflict graphs according to Eq. (66);
...

Identification of packing graphs and conflict graphs
The packing algorithms in Section 3.8 determine the distributable packets based on the distributability and embeddability of a quantum circuit, which are fully described by the packing graphs and conflict graphs of the circuit. Therefore, one needs to identify these characteristic graphs before implementing the packing algorithms. Since single-qubit and two-qubit gates form a universal gate set, all quantum circuits can be decomposed into single-qubit and two-qubit gates. We therefore consider the identification of packing graphs and conflict graphs of quantum circuits consisting of only single-qubit and two-qubit gates. For the vertices, we need to identify the set of distributable packets, which is equivalent to the identification of indecomposable embeddings. For the edges, we need to identify the packing edges associated with two-qubit gates, the intrinsic conflict edges associated with incompatible embeddings, and the extrinsic conflict edges associated with external resource limits.

Identification of distributable packets
According to Definition 16, in a circuit consisting of one-qubit and two-qubit gates, the elements in a distributable packet are the gate nodes of two-qubit gates. According to Definition 23, a packing edge between two distributable packets is always associated with a two-qubit gate, which is referred to as a packing-edge gate. A packing-edge gate of particular interest is the control-phase gate $C_V(\theta)$ in Eq. (17). One can show that the connectivity of packing graphs can be identified solely with control-phase gates, since all packing-edge gates are locally equivalent to a control-phase gate up to single-qubit distributable gates.

Lemma 29 (The connecting 2-qubit gate)
A two-qubit gate $g$ acting on $q_1$ and $q_2$ is distributable on both of the grid points $(q_1, t)$ and $(q_2, t)$ if and only if it is equivalent to a control-phase gate up to single-qubit distributable gates, i.e. $g$ equals a control-phase gate $C_V(\theta)$ preceded and followed by single-qubit distributable gates $D_{q_i}$ and $D'_{q_i}$ acting on $q_i$.
Proof: See Appendix D.6.

Algorithm 28: Packing algorithm with extrinsic limits
1 $\chi_A, \chi_B \leftarrow$ available amount of auxiliary qubits on each local system;
...
$R_{i,j} \leftarrow \bigcup_{V \in V_i, B \in K_{i,j}} V \setminus B$: split the vertex cover $V_i$ by removing its embedding processes included in $K_{i,j}$;
/* Find possible entanglement reduction with an add-on function extended_embedding based on an extended embedding. (See Appendix B for the details of extended embeddings.) */
8 $R_{i,j} \leftarrow$ extended_embedding($R_{i,j}$): look for possible extended embeddings in $R_{i,j}$;
...: update the extrinsic local packet-conflict graphs according to Eq. (66);
...

As a result of Lemma 29, to reveal the connectivity of distributable packets, we need to convert control unitaries to control-phase gates, where $W$ is the transformation matrix for the diagonalization of $U_0^\dagger U_1$ and $\theta_i$ are the eigenphases of $U_0^\dagger U_1$. As shown in Fig. 17 (a), after the control-phase conversion, a circuit can be decomposed into a sequence of two-qubit blocks, where the elemental two-qubit block $U_t$ is shown in Fig. 17 (b). In Fig. 17 (a), white circles denote single-qubit distributable gates, white squares identity gates, black squares Hadamard gates, and gray squares irrelevant gates. The control-phase nodes in Fig. 17 (a) are the nodes that we want to pack into packing processes through the embedding of the unitaries between them. If each two-qubit block $U_t$ in Eq. (76) is $q$-rooted embeddable, then the total unitary $U$ is also $q$-rooted embeddable. One can employ this sufficient condition to determine the $q$-rooted embeddability of $U$ and identify the distributable packets in a circuit. Note that we do not exclude the possible embeddability of $U$ ... that possesses a non-embeddable block $U_t$.
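The eigenphases entering the control-phase conversion can be computed directly for single-qubit targets. The sketch below assumes a control unitary of the form $|0\rangle\langle 0| \otimes U_0 + |1\rangle\langle 1| \otimes U_1$ with $2 \times 2$ blocks given as nested lists; the function name `eigenphases` and this input format are assumptions for illustration. It uses the closed-form eigenvalues of a $2 \times 2$ matrix and only the Python standard library.

```python
import cmath
import math
from typing import List, Sequence


def eigenphases(u0: Sequence[Sequence[complex]],
                u1: Sequence[Sequence[complex]],
                tol: float = 1e-9) -> List[float]:
    """Eigenphases in [0, 2*pi) of U0^dagger U1 for 2x2 unitaries U0, U1."""
    # M = U0^dagger U1, with (U0^dagger)_{ik} = conj(U0_{ki}).
    m = [[sum(complex(u0[k][i]).conjugate() * complex(u1[k][j]) for k in range(2))
          for j in range(2)] for i in range(2)]
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    phases = []
    for lam in ((tr + disc) / 2, (tr - disc) / 2):
        phase = cmath.phase(lam) % (2 * math.pi)
        if 2 * math.pi - phase < tol:   # guard against rounding just below 2*pi
            phase = 0.0
        phases.append(phase)
    return sorted(phases)


if __name__ == "__main__":
    identity = [[1, 0], [0, 1]]
    pauli_x = [[0, 1], [1, 0]]
    pauli_z = [[1, 0], [0, -1]]
    print("CNOT:", eigenphases(identity, pauli_x))
    print("CZ:  ", eigenphases(identity, pauli_z))
```

A CNOT gate and a CZ gate yield the same eigenphases, which reflects the fact that a CNOT gate is a control-phase gate with $\theta = \pi$ up to Hadamard gates on the target qubit.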

After the control-phase conversion, a quantum circuit consisting of one-qubit and two-qubit gates can be decomposed into two types of embedding units, which are shown in Fig. 17 (c).The embedding rules for such a quantum circuit can be summarized as the embedding rules for two-qubit blocks U qq ′ as follows.

Corollary 30 (Embedding rules for 2-qubit blocks)
A two-qubit block $U_{qq'}$ in Eq. (77) is $q$-rooted embeddable in the following two cases.
1. (D-type, Fig. 18 (a,b)) $F_q$ and $G_q$ are distributable, which means diagonal or anti-diagonal. The embedding rule depends on $n$, the number of anti-diagonal gates in $\{G_q, F_q\}$.
2. (Global H-type CZ, Fig. 18 (c)) The control-phase gate $C_V(\theta)$ is global with a phase of $\theta = \pi$. The single-qubit gates $F_q$ and $G_q$ can be decomposed in terms of single-qubit gates $D_q$ and $D'_q$, which are diagonal or anti-diagonal. The embedding rule depends on $n$, the number of anti-diagonal gates in $\{D_q, D'_q\}$.
The block $U_{qq'}$ is not embeddable if it is a local H-type block.
Proof: For a D-type block, $U_{qq'}$ is distributable on $q$, and has the embedding rules in Eq. (27) and (28) according to Theorem 10. For a global H-type CZ block, one can derive the primitive embedding rule explicitly. A local H-type block is not distributable according to Theorem 5. As a result of Theorem 10, a $q$-rooted non-distributable local unitary can never be $q$-rooted embeddable.
Now, we have the embedding rules for single-qubit gates $\mathcal{B}_1$ (Theorem 10) and two-qubit gates $\mathcal{B}_2$ (Corollary 30), which form a set of embedding rules $\mathcal{B}$ for the packing of gate nodes of control-phase gates, $\mathcal{B} = \{\mathcal{B}_1 := \text{Theorem 10}, \mathcal{B}_2 := \text{Corollary 30}\}$. (84) Employing the set of embedding rules $\mathcal{B}$, one can identify the set of distributable $\mathcal{B}$-packets $\mathcal{V}_{\mathcal{B}}$ of a quantum circuit.
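The D-type condition only requires checking whether the single-qubit gates are diagonal or anti-diagonal. A minimal sketch of this check is given below; the helper names `gate_type` and `is_d_type`, the nested-list matrix format and the numerical tolerance are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence, Tuple

Matrix = Sequence[Sequence[complex]]


def gate_type(g: Matrix, tol: float = 1e-9) -> str:
    """Classify a 2x2 gate as 'diagonal', 'anti-diagonal' or 'other'."""
    if abs(g[0][1]) < tol and abs(g[1][0]) < tol:
        return "diagonal"
    if abs(g[0][0]) < tol and abs(g[1][1]) < tol:
        return "anti-diagonal"
    return "other"


def is_d_type(f: Matrix, g: Matrix) -> Tuple[bool, Optional[int]]:
    """Return (True, n) if F_q and G_q are both distributable.

    n is the number of anti-diagonal gates among the two.
    """
    types = [gate_type(f), gate_type(g)]
    if "other" in types:
        return False, None
    return True, types.count("anti-diagonal")


if __name__ == "__main__":
    s = 1 / math.sqrt(2)
    pauli_x = [[0, 1], [1, 0]]
    pauli_z = [[1, 0], [0, -1]]
    hadamard = [[s, s], [s, -s]]
    print(is_d_type(pauli_z, pauli_x))   # both distributable, one anti-diagonal
    print(is_d_type(hadamard, pauli_z))  # Hadamard is neither type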
As shown in Fig. 17 (a), to determine the $\mathcal{B}$-packability of the nodes $(q, t_0)$ and $(q, t_8)$, one needs to determine the $q$-rooted embeddability of the unitary $W_{t_8;t_0}$ between them, Eq. (85). In this example, the embedding of $W_{t_8;t_0}$ can be decomposed into four indecomposable embeddings $W_{i=1,...,4}$ between control-phase gates. The embeddings of $W_2$ and $W_3$ (green dot-dashed lines) do not include any embedding unit. Their embeddability can be easily determined with the embedding rules $\mathcal{B}_1$ for single-qubit gates (Theorem 10). They identify the packets of neighbouring control-phase nodes $V_{q,\{t_4,t_5\}}$ and $V_{q,\{t_5,t_6\}}$, respectively. We call such an embedding a neighbouring embedding, and its corresponding distributable packet a neighbouring distributable packet.
On the other hand, the embeddings of $W_1$ and $W_4$ (blue dot-dashed lines) contain embedding units, which allow the packing of remote control-phase nodes $V_{q,\{t_0,t_4\}}$ and $V_{q,\{t_6,t_8\}}$, respectively. We call such an embedding a hopping embedding, and its corresponding distributable packet a hopping distributable packet. One needs the $\mathcal{B}_2$ embedding rules (Corollary 30) to identify hopping embeddings.
The embedding of $W_{t_8;t_0}$ in Eq. (85) is also a hopping embedding, which forms the hopping distributable packet $V_{q,\{t_0,t_8\}}$. However, according to Definition 19, $W_{t_8;t_0}$ is decomposable. In this example, one can see that the indecomposable embeddings $\{W_1, ..., W_4\}$ lead to the largest distributable packet $V_{q,\{t_0,t_4,t_5,t_6,t_8\}}$, which is merged according to Lemma 17, $V_{q,\{t_0,t_4,t_5,t_6,t_8\}} = V_{q,\{t_0,t_4\}} \cup_D V_{q,\{t_4,t_5\}} \cup_D V_{q,\{t_5,t_6\}} \cup_D V_{q,\{t_6,t_8\}}$. It is therefore necessary to identify all indecomposable embeddings of a quantum circuit to fully explore its packability. The identification of distributable packets of a circuit is thus equivalent to the identification of all the indecomposable embeddings.
For neighbouring embeddings, the identification according to B 1 (Theorem 10) always returns indecomposable embeddings.For indecomposable hopping embeddings constructed with the embedding units in Corollary 30, one can search them with the following condition.
Lemma 31 (Indecomposable hopping embedding) Two distributable nodes $\{(q, t_i), (q, t_j)\}$ with $i \le j - 2$ form an indecomposable hopping distributable B-packet $V_{q,\{t_i,t_j\}}$, if and only if the unitary $W_{t_j;t_i}$ between $(q, t_i)$ and $(q, t_j)$ can be decomposed as where $D'_{q,t_j}$ and $D_{q,t_i}$ are diagonal or anti-diagonal, and each embedding unit $U_{t_k}$ is global H-type with the control phase $\theta = \pi$.
Proof: See Appendix D.7.
This lemma implies that a control-phase gate that is local or has a control phase $\theta \neq \pi$ cannot be part of an indecomposable hopping embedding.
Corollary 32 Let $V_{q,\{t_i,t_j\}}$ be an indecomposable distributable packet. All the control-phase gates $C_V^{(q,q_t)}(\theta_t)$ with $t_i < t < t_j$ between the nodes $(q, t_i)$ and $(q, t_j)$ must be global with $\theta_t = \pi$.
Proof: This is a direct result of Lemma 31.
Now we have all the essential tools to develop an identification algorithm for distributable packets, which is summarized in Algorithm 33. To implement this algorithm, one needs two functions, neighbouring() and hopping(), to determine the q-rooted neighbouring and hopping embeddings, respectively. The neighbouring embeddings can be determined according to Algorithm 34 through the embedding rules $B_1$ in Theorem 10, which leads to a set of distributable $B_1$-packets $V_{B_1}$. Meanwhile, the indecomposable hopping embeddings can be determined according to Algorithm 35. Together, these yield the ultimate set of distributable packets $V_B$. The complexity of the identification of distributable packets is estimated to be $O(nd^2)$, where $n$ and $d$ are the total qubit number and the depth of a circuit (see Appendix D.8 for an explanation).
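The necessary condition of Corollary 32 can be used as a cheap pre-filter before any embeddability check. The following Python sketch illustrates this under stated assumptions: each two-qubit block on the chosen qubit is encoded as a hypothetical `(is_global, theta)` pair, and the function only tests the gates strictly between the two nodes. It does not check the distributability of the end nodes, and it is not the authors' Algorithm 35.

```python
import math


def hopping_candidates(blocks, tol=1e-9):
    """Return index pairs (i, j) with i <= j - 2 that pass the Corollary 32 filter.

    `blocks[k]` is a hypothetical (is_global, theta) description of the
    control-phase gate in the k-th block on the chosen qubit. A pair passes
    if every gate strictly between i and j is global with theta = pi.
    """
    def is_global_cz(block):
        is_global, theta = block
        return is_global and math.isclose(theta % (2 * math.pi), math.pi, abs_tol=tol)

    pairs = []
    for i in range(len(blocks)):
        for j in range(i + 2, len(blocks)):
            # Block j - 1 is the newest gate lying strictly between i and j.
            if not is_global_cz(blocks[j - 1]):
                break  # every larger j would contain this gate as well
            pairs.append((i, j))
    return pairs


blocks = [
    (True, 0.3), (True, math.pi), (True, math.pi),
    (False, math.pi), (True, math.pi), (True, 1.0),
]
print(hopping_candidates(blocks))  # [(0, 2), (0, 3), (1, 3), (3, 5)]
```

Pairs that survive this filter are only candidates; whether they form a hopping packet still depends on the single-qubit gates between the blocks.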
In the identification algorithm for indecomposable hopping embeddings (Algorithm 35), the goal of lines 10 to 28 is to shift the Hadamard gates between two control-phase nodes into an embedding block by commuting them with the single-qubit distributable gates, so that they form H-type embedding units. This is demonstrated by the examples in Fig. 19. Fig. 19 (a) shows a quantum circuit after the control-phase conversion. In this example, we look at the packing on the qubit q. The white squares are identity gates.
The single-qubit gates (red diamonds) between the control-phase nodes are converted into three types according to Eq. (94). In Fig. 19 (b), the red-dashed two-qubit blocks $U_{t_0}$, $U_{t_6}$, and $U_{t_9}$ contain a local control-phase gate or a global control-phase gate with the phase $\theta \neq \pi$, which can never be involved in an indecomposable hopping embedding according to Lemma 31. One can therefore divide the starting-gate and ending-gate groups into different divisions, such as $S^{(q)}_{1,2}$ and $E^{(q)}_{1,2}$. In each division, one can then construct a set of candidate q-rooted packets $P^{(q)}_k$ from $S^{(q)}_k$ and $E^{(q)}_k$. These steps are summarized on lines 7 to 9 in Algorithm 35.
For each candidate packet $\{(q, t_i), (q, t_j)\} \in P^{(q)}_k$, one can then check the q-rooted embeddability following lines 10 to 28. For the example of the candidate packet $\{(q, t_1), (q, t_6)\}$ in Fig. 19 (b), the single-qubit gate $W_{t_4;t_3}$ is an S-type and E-type gate, while $W_{t_5;t_4}$ is D-type. One first converts these gates into M-type gates following lines 12 to 23. In the next step, one checks the embeddability of $W_{t_6;t_1}$ by shifting the red squares into the embedding units according to lines 25 to 28, as shown in Fig. 19 (c). If there are no non-embeddable gates remaining outside the embedding units, then $\{(q, t_1), (q, t_6)\}$ forms a B-distributable packet $V_{q,\{t_1,t_6\}}$. Explicitly, the condition for a successful construction of sequential embedding units is given on line 25.

Figure 19 (caption, partial): (c) The determination of the packability of $\{(q, t_1), (q, t_6)\}$. (d) The determination of the packability of $\{(q, t_0), (q, t_4)\}$.

Algorithm 35 (excerpt): for each pair $\{(q, t_k), (q, t_{k+1})\}$ in $S^{(q)}$ or $E^{(q)}$, convert the S-type or E-type gates into M-type gates; then determine the q-embeddability of $W_{t_j;t_i}$ with a necessary and sufficient condition (for the proof, see Appendix D.9).
Theorem 36 implies that an indecomposable neighbouring embedding is always compatible with any other indecomposable embedding. Incompatibility exists only between indecomposable hopping embeddings that contain a global control-Z gate. Such incompatibility is intrinsic and independent of external resources. After the identification of the intrinsic incompatibility between two indecomposable hopping embeddings, one can represent the incompatibility associated with a global control-Z gate by a conflict edge at the level of a set of distributable packets $V$.

Definition 37 (Intrinsic conflict edges in V)
In a set of distributable packets $V$, a global control-Z gate $C_Z^{(q,q';t)}$ defines a conflict edge between $V_{q,T}$ and $V_{q',T'}$, where $V_{q,T}, V_{q',T'} \in V$ are the distributable packets that cover the control-Z gate, namely $\min T < t < \max T$ and $\min T' < t < \max T'$. These edges form the set of intrinsic conflict edges in $V$.

Since each indecomposable embedding in $B_2$ corresponds to a unique distributable packet in $H$, the intrinsic conflict edges between indecomposable embeddings introduced in Eq. (67) are equivalent to the intrinsic conflict edges between the distributable packets in $H$. One can therefore construct the intrinsic packet-conflict and kernel-conflict graphs from these edges. For the extrinsic kernel-conflict graphs, one needs to include the trivial distributable packets $V_0$ in the vertices, $K = V_0 \cup H$. The extrinsic kernel-conflict graph $C_{ex}(K)$ is then constructed at the level of $K$, from which one can construct the extrinsic packet-conflict graph $C_{ex}(V_B)$ according to Eq. (66).
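The covering condition in Definition 37 translates directly into a small search over packets and gates. The sketch below is an illustration only, assuming a hypothetical encoding in which a packet is a `(qubit, times)` pair and a global control-Z gate is a `(q, q_prime, t)` triple. Edges are returned as pairs of packet indices.

```python
def intrinsic_conflict_edges(packets, cz_gates):
    """List conflict edges between packets that cover the same global CZ gate.

    `packets` is a list of (qubit, times) pairs and `cz_gates` a list of
    (q, q_prime, t) triples (both hypothetical encodings). A packet V_{q,T}
    covers a gate at time t on its qubit if min T < t < max T.
    """
    def covers(packet, qubit, t):
        packet_qubit, times = packet
        return packet_qubit == qubit and min(times) < t < max(times)

    edges = set()
    for q, q_prime, t in cz_gates:
        on_q = [i for i, p in enumerate(packets) if covers(p, q, t)]
        on_q_prime = [i for i, p in enumerate(packets) if covers(p, q_prime, t)]
        for a in on_q:
            for b in on_q_prime:
                edges.add((min(a, b), max(a, b)))  # store each edge once
    return sorted(edges)


packets = [("q0", {0, 5}), ("q2", {1, 6}), ("q2", {7, 9})]
cz_gates = [("q0", "q2", 3), ("q0", "q2", 8)]
print(intrinsic_conflict_edges(packets, cz_gates))  # [(0, 1)]
```

Only the gate at $t = 3$ is covered on both qubits, so it is the only one that produces a conflict edge.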

Applications of embedding-enhanced distributed quantum computing
In this section, we demonstrate our packing algorithms for DQC of unitary coupled-cluster (UCC) circuits [45,46], with which one implements a variational quantum eigensolver to find the eigenvalue of an observable that simulates a chemical system. The implementation of our algorithm can be found in the open-source package pytket-dqc 1 , which is developed for multipartite DQC based on our embedding-enhanced method [39].

DQC of a 4-qubit UCC circuit
A 4-qubit UCC circuit after the control-phase conversion is shown in Fig. 24 in Appendix C. It contains 64 global control-Z gates. The qubits are partitioned into two local systems $A = \{q_0, q_1\}$ and $B = \{q_2, q_3\}$. Employing Algorithms 33, 34 and 35, one obtains the packing graph $G_B$ in Fig. 21 (a). The colored vertices are the B-distributable packets, while the white vertices stretching from the colored vertices represent the hopping embeddings that merge the neighbouring packets to form a colored vertex. According to Algorithm 27, one first determines a minimum vertex cover $V_{mvc}$ of the packing graph $G_B$, whose vertices are highlighted in red. Note that in this example, we only heuristically find one minimum vertex cover instead of searching for all minimum vertex covers.
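Finding a single minimum vertex cover of a packing graph can be sketched with the standard matching-based construction from König's theorem. The code below is a self-contained illustration, not the pytket-dqc implementation. It assumes the packing graph is bipartite (packets on A versus packets on B), uses a simple augmenting-path matching rather than Hopcroft-Karp, and all vertex names are made up.

```python
def minimum_vertex_cover(left, edges):
    """Return one minimum vertex cover of a bipartite graph.

    `left` lists the vertices on one side (e.g. packets on system A) and
    `edges` is a list of (left_vertex, right_vertex) pairs.
    """
    adj = {u: [] for u in left}
    for u, v in edges:
        adj[u].append(v)

    match_of_right = {}  # right vertex -> matched left vertex

    def try_augment(u, seen):
        # Search for an augmenting path starting at the left vertex u.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_of_right or try_augment(match_of_right[v], seen):
                match_of_right[v] = u
                return True
        return False

    for u in left:
        try_augment(u, set())

    # Koenig construction: alternating search from unmatched left vertices.
    matched_left = set(match_of_right.values())
    visited_left = {u for u in left if u not in matched_left}
    visited_right = set()
    stack = list(visited_left)
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in visited_right:
                continue
            visited_right.add(v)
            w = match_of_right.get(v)
            if w is not None and w not in visited_left:
                visited_left.add(w)
                stack.append(w)
    return (set(left) - visited_left) | visited_right


left = ["a1", "a2", "a3"]
edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a3", "b1")]
cover = minimum_vertex_cover(left, edges)
print(len(cover))  # 2
```

The size of the cover equals the size of the maximum matching, which is the number of root packets needed to cover all packing edges in this toy graph.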
According to Section 4.2, one can obtain the corresponding packet-conflict graph in Fig. 21 (b). The conflict edges are defined at the level of distributable packets. One has to solve the conflicts (red edges) induced by the minimum vertex cover $V_{mvc}$. To this end, we need to expand the packet-conflict graph to a kernel-conflict graph, as shown in Fig. 21 (c). According to Algorithm 27, we select the set of kernels $K_{mvc}$ that are induced by the minimum vertex cover $V_{mvc}$, and highlight them in red. The $K_{mvc}$-induced kernel-conflict graph $C_{in}(K_{mvc})$ represents the conflicts that need to be resolved. We then find the minimum vertex cover K mvc of the conflict graph $C_{in}(K_{mvc})$, and highlight its vertices in blue. To solve the conflicts, we remove the blue-highlighted hopping embeddings K mvc from the minimum-vertex-covering packets $V_{mvc}$.

1 https://github.com/CQCL/pytket-dqc
In total, we find 10 distributable packets in V mvc that cover all the packing edges and remove 7 hopping embedding kernels K mvc from the selected packets.In the end, we need only 17 distributing processes to implement a DQC of the 4-qubit UCC circuit, which contains 64 global control-Z gates.Our method therefore saves 47 ebits in total.
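The ebit count above can be written out as a short worked calculation. The symbols $N_{\text{proc}}$, $N_{\text{removed}}$, $N_{\text{CZ}}$ and $N_{\text{saved}}$ are introduced here only for illustration; the baseline assumes one ebit per global control-Z gate, as in the gate-by-gate protocol of [26].

```latex
\begin{align}
  N_{\text{proc}}  &= |V_{\mathrm{mvc}}| + N_{\text{removed}} = 10 + 7 = 17, \\
  N_{\text{saved}} &= N_{\text{CZ}} - N_{\text{proc}} = 64 - 17 = 47 .
\end{align}
```

Each removed hopping embedding splits one selected packet in two, which is why it adds one distributing process to the count.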

Comparison with other protocols
We use the heuristic implementation of the packing algorithm (Algorithm 27) to analyze the bipartite distribution of UCC circuits, and compare it with the G*-simple and G*-LP algorithms based on the "Migration Selection" method introduced in [35]. In the "Migration Selection" method, neighboring nonlocal CZ gates can also be packed together, but the packing stops when the process encounters a single-qubit gate or a local CZ gate, no matter whether it is embeddable or not. In our embedding-enhanced packing algorithm, one packs control-phase gates and uses embedding to merge non-sequential distributing processes over embeddable single-qubit and two-qubit gates. The G* algorithms can therefore be understood as special packing algorithms for CZ-compiled circuits without embedding.
Our benchmarking is performed on UCC circuits with an even number of qubits that are uniformly distributed over two QPUs, each of which has the same specifications. To account for the dependence of the entanglement cost on the qubit allocation, we benchmark all possible bipartitions with an equal number of local qubits for 4-qubit and 6-qubit circuits. The result is shown in Fig. 22 (a). It shows that embedding can significantly reduce entanglement costs under every possible bipartition. The entanglement efficiencies of these three protocols are further compared for UCC circuits with up to 20 qubits in Fig. 22 (b). Since the number of bipartitions increases exponentially with the qubit number, we fix one bipartition for each qubit number in this comparison. The result shows that the entanglement efficiency of embedding-enhanced distributing is also significantly improved for circuits with large qubit numbers.

Figure 21 (caption, partial): (b) The conflict edges between packets are induced from the intrinsic conflict edges between kernels according to Eq. (66). The red edges are the conflict edges induced by the minimum vertex cover $V_{mvc}$. (c) Intrinsic kernel-conflict graph defined at the level of kernels of packing processes. The colored vertices are the indecomposable hopping packets representing the kernels of hopping embeddings, while the white circles are the distributable packets formed from these embedding kernels. The red vertices $K_{mvc}$ are the indecomposable hopping packets that are included in the minimum vertex cover $V_{mvc}$. The red edges are the conflict edges induced by $K_{mvc}$. The blue vertices K mvc are a minimum vertex cover of the $K_{mvc}$-induced conflict graph, which are the embeddings that need to be removed. The order of K mvc is 7.
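The set of balanced bipartitions used in such a benchmark can be enumerated with the Python standard library. The sketch below assumes that exchanging A and B gives the same bipartition, which seems reasonable here because both QPUs are stated to have the same specifications. The function name is illustrative and not part of pytket-dqc.

```python
from itertools import combinations


def balanced_bipartitions(n_qubits):
    """Yield unordered bipartitions (A, B) of range(n_qubits) with |A| == |B|."""
    if n_qubits % 2:
        raise ValueError("n_qubits must be even")
    qubits = range(n_qubits)
    for part_a in combinations(qubits, n_qubits // 2):
        # Keep qubit 0 in A so that (A, B) and (B, A) are counted only once.
        if 0 in part_a:
            part_b = tuple(q for q in qubits if q not in part_a)
            yield part_a, part_b


print(len(list(balanced_bipartitions(4))))  # 3
print(len(list(balanced_bipartitions(6))))  # 10
```

The count is $\binom{n}{n/2}/2$, which grows exponentially with $n$ and explains why only one bipartition per qubit number is practical for larger circuits.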

Embedding-enhanced multipartite-distributed quantum computing
The extension of our embedding-enhanced packing protocol to multipartite DQC requires an extended definition of distributable packets (Definition 16) through an extension of the bipartite embedding rules to multipartite systems. Once one has the extended packing graphs and conflict graphs, with the extended distributable packets as their vertices, one can follow Algorithms 27 and 28 to compile the DQC. However, the identification of packing edges and conflict edges is challenging. For multipartite embeddings, for instance in an A|B|C system, their compatibility has to be considered under all possible bipartite subsystems, since the additional local correction gates in an embedding process of a nonlocal gate between two parties A|B may destroy its embeddability between the other parties A|C. The incompatibility of embeddings is therefore no longer describable by a bipartite conflict edge.
A straightforward but not optimal solution is to extend the notion of distributable packets to a set of gate nodes that are associated with two-qubit gates acting on the same local systems, such that each distributable packet is associated with only two parties.
With this solution, one can circumvent the multipartite conflict problem of embeddings and introduce the embedding method to existing compilers for multipartite DQC [34] to allow entanglement-efficient DQC on multipartite quantum computing networks [39].

Conclusion
In this paper, we have developed a theoretical architecture for entanglement-efficient distributed quantum computing over two local QPUs with the assistance of entanglement. The building blocks of the DQC architecture are the entanglement-assisted packing processes (Definition 2), which are extended from the EJPP protocol [26]. An entanglement-assisted packing process packs a sequence of gates as its kernel sandwiched by the entanglement-assisted starting and ending processes (Definition 1). Two types of entanglement-assisted packing processes are essential, namely the distributing processes (Definition 4) and the embedding processes (Definition 7). The ultimate goal of distributed quantum computing with entanglement-assisted packing processes is to implement a quantum circuit with distributing processes, in which the kernels are all local gates. An embedding process allows the packing of two non-sequential distributing processes and hence saves one ebit of entanglement. We therefore explore the embeddability of quantum circuits to save the entanglement cost in DQC.
The distributability of a unitary can be determined by Theorem 5. Each distributing process consumes an ebit of entanglement. To save entanglement, we introduced the embedding processes to merge two non-sequential distributing processes into one single packing process. The embeddability of a unitary between two distributing processes can be determined with a sufficient condition, from which one can derive the primitive embedding rules (Corollary 9, Theorem 10, and Corollary 11). These embedding rules join the nodes of distributable gates into distributable packets (Definition 16 and Lemma 17), on each of which one can implement the gates locally with the assistance of one ebit of entanglement in a single distributing process (Theorem 18).
The trivial distributing processes and indecomposable embedding processes (Definition 19) associated with different packets may be incompatible for simultaneous implementation. We therefore introduce packing edges between distributable packets (Definition 23) and conflict edges incident to embedding processes (Definitions 20 and 25) to represent the incompatibility among distributing and embedding processes. With these edges, one can construct the packing graphs and conflict graphs of a circuit, which contain the full information of distributability, embeddability and incompatibility in a circuit. Based on these graphs, one can determine the packing of a circuit with the heuristic Algorithm 27 (or Algorithm 28) for the scenario of unlimited (or limited) external resources.
In particular, we consider circuits consisting of only one-qubit and two-qubit gates, which are universal for quantum computing. The distributability structure (packing edges) of such a circuit is completely represented by the control-phase gates (Lemma 29) in its control-phase conversion. In a control-phase conversion, a circuit is decomposed as a sequence of two-qubit building blocks. We have derived the primitive embedding rules for these building blocks in Corollary 30. The embedding rules for single-qubit gates (Theorem 10) and two-qubit blocks (Corollary 30) identify the indecomposable neighbouring embeddings (Algorithm 34) and hopping embeddings (Lemma 31 and Algorithm 35), respectively. Based on these embedding rules, one can then identify the set of distributable packets of a circuit with Algorithm 33. Finally, one obtains the ultimate packing graph through the association of packing edges to global control-phase gates, and the ultimate conflict graphs through the association of conflict edges to incompatible embeddings (Theorem 36).
These algorithms are demonstrated in the distributed implementation of a 4-qubit UCC circuit, which contains 64 global control-Z gates. The ultimate number of packing processes in the distributed implementation is 17, hence requiring 17 ebits of entanglement and saving 47 ebits. We benchmark the algorithms with further instances of UCC circuits up to 20 qubits. The comparison with other protocols shows a significant enhancement of entanglement efficiency by embeddings.
The packing method in this paper can be summarized as "distributing enhanced by embedding". It incorporates the extrinsic limit of quantum resources, such as the number of available local memory qubits, in the conflict edges. It can therefore be employed to study the scalability of distributed quantum computing. One can further extend our method to multipartite scenarios and adopt quantum network topology to establish entanglement-efficient DQC over the quantum internet [39].
The DQC architecture of a quantum circuit based on entanglement-assisted packing processes, as revealed by the packing algorithms, is a constructive method to determine an upper bound on the entanglement cost of the entanglement-assisted LOCCs of a circuit. Benefiting from embeddings, one can obtain a tighter upper bound on the entanglement cost approaching the lower bound, which is determined by the operator Schmidt rank [38]. One can therefore exploit the packing method to explore the unitaries that have the optimal entanglement cost implemented with entanglement-assisted packing processes. The packing protocol in this paper is therefore practical for finding an entanglement-efficient solution for distributed quantum computing, and also useful for the fundamental study of optimality in entanglement-assisted LOCC.
In the computational basis of q and e, this equality is equivalent to 0 = ⟨i q , (i ⊕ 1) e | K|i q , i e ⟩ for i = 0, 1. (116) This proves that a packing process is equivalent to a unitary if and only if the R (e) X (φ)-calibrated kernel K has the following form where X (φ) calibration only introduce a global phase to the Kraus operator K − of a packing process, P q,e [K] and P q,e ( K) implement the same unitary U .The unitary U implemented by P q,e [K] is therefore given by As a result of Eq. (116), the representation of U in the computational basis of q is therefore which leads to

W = K
(q,ē) 00,01 K (q,ē) 00,10 Since K and U are both unitary, it holds then which leads to As a result, This completes the proof.
Corollary 39 (Sufficient condition for U -equivalent) An entanglement-assisted process with the kernel C q,Xe U C q,Xe is canonical and equivalent to the unitary U , U = P q,e [C q,Xe U C q,Xe ]. (125) The kernel C q,Xe U C q,Xe is the primitive kernel of U .Proof: According to Eq. (108), the Kraus operator of P q,e [C q,Xe U C q,Xe ] are

B Extended embedding
In addition to the ordinary H-type embedding rule in Corollary 30 for a two-qubit embedding unit with a control-Z gate, we introduce an extended embedding rule for the H-type embedding unit with a general controlling phase $\theta \neq \pi$.

Lemma 40 (Extended embedding)
An H-type 2-qubit embedding unit can be packed with the following rule (Fig. 23):

$$H_q\, U^{(q,q')}\, H_q = P_{q,e}\left[ H_q\, C_Z^{(q,e)}\, H_e\, U^{(e,q')}\, H_e\, C_Z^{(q,e)}\, H_q \right]. \quad (126)$$

Proof: Employing the primitive embedding rule in Corollary 9, we know that

$$H_q\, U^{(q,q')}\, H_q = P_{q,e}\left[ C_{q',X_e}\, H_q\, U^{(q,q')}\, H_q\, C_{q',X_e} \right]. \quad (127)$$

One can then show that this kernel reduces to the one in Eq. (126).

As shown in Fig. 23, this extended embedding of the global H-type block localizes the control-phase gate to the B system at the price of two additional global control-Z gates. On the left-hand side, the original distributable packets $V_a$ and $V_b$ (blue dashed circles) form a packing edge ($V_a \leftrightarrow V_b$). After the extended embedding, the packet $V_a$ is replaced by $V_\alpha$, and the packing edge is replaced, on the right-hand side, by two additional packets $V_c$ and $V_d$ that are globally connected to $V_\alpha$.
The global packing edges of the kernel are now changed to $V_c \leftrightarrow V_\alpha \leftrightarrow V_d$. The corresponding graph has the minimum vertex cover $\{V_\alpha\}$. If $V_a$ is already selected as one of the root packets for the distributing processes, then after the extended embedding, one can replace the original packet $V_a$ in the original vertex cover by $V_\alpha$. Such a replacement does not change the entanglement cost in the kernel. It therefore leads to a possible reduction of entanglement by one ebit through the merging of two non-sequential distributing processes hopping over the H-type unit.
However, if V b is selected in the vertex cover of the original packing graph rather than V a , the 1-ebit entanglement saved by the extended embedding hopping over V α has to be consumed for the distributing process rooted at V α .It means that the extended embedding does not bring any reduction of entanglement cost.
It is therefore necessary to select V a as a root packet for the extended embedding to save one ebit.Suppose that V a is a root packet.After the extended embedding, the new local control phase gate C (α ′ ,b) V may intrinsically prevent the other hopping embeddings.This happens when there is a selected root packet merged by the q ′ -rooted hopping embedding over the packet V b .In this case, one has to remove the q ′ -rooted hopping embedding, and split the corresponding packet, which will reuse the 1-ebit entanglement saved by the extended embedding.Such a conflict is the only intrinsic conflict that prevents the reduction of entanglement cost through the extended embedding.As a result, the necessary condition for reduction of entanglement cost by the extended embedding rule can be summarized as follows.

Corollary 41 (Entanglement reduction condition)
The q-rooted extended embedding of an H-type embedding unit can save 1 ebit only if
1. the distributable packet on q in the unit is already selected as a root packet, and
2. the extended embedding does not conflict with a selected hopping embedding.
This necessary condition is also sufficient if there are no extrinsic limits on external resources.
After solving the intrinsic conflicts and identifying the root packets $P_{i,j}$ in Algorithm 27 (line 6) and Algorithm 28 (line 7), one can introduce an add-on function "extended_embedding()" to further reduce the entanglement cost by checking the two conditions in Corollary 41. The algorithm for the add-on function is given in Algorithm 42. Note that for Algorithm 28 with extrinsic limits, one has to update the extrinsic conflict graph on line 9 after the extended embedding.

C UCC circuits
We employ the "pytket-dqc" package to generate a 4-qubit UCC circuit. The 4-qubit circuit is converted according to the control-phase conversion in Eqs. (76) and (77). The converted circuit is shown in Fig. 24. There are 64 2-qubit blocks. Every block contains one global control-phase gate. After the control-phase conversion, one employs Algorithms 33, 34 and 35 to identify the distributable packets and the corresponding packing graph and conflict graphs, which are shown in Fig. 21.

D Proof of theorems and lemmas

D.1 Proof of Theorem 3
According to Lemma 38, the matrices of the kernels K i in the computational basis are where U (q) i,jk := ⟨j q |U i |k q ⟩.It holds then where As a result of Lemma 38, one obtains This completes the proof.

D.2 Proof of Theorem 5
Let U be a q-rooted distributable unitary with q being on the local system A. According to Lemma 38, there exists a canonical kernel K A ⊗ K B such that It means that K A and K B must be diagonal or antidiagonal at the same time, where . As a result, It means that the following equality is a necessary condition for a q-rooted distributable unitary, This equality is also sufficient for a q-rooted distributable unitary, which can be implemented with the kernels given in Eq. (136).

D.3 Proof of Lemma 13
Since U is embeddable on $Q$ and $Q'$ with the embedding rules $B_Q$ and $B_{Q'}$, it holds According to Eq. (108), the unitary is equivalent to where $A \cup A' = \{e_1, ..., e_k\}$ is the disjoint set union of the auxiliary qubits $A$ and $A'$. As a result of the commutation relation in Eq. (41), the unitary U is then joint embeddable on $Q$

between $t_1$ and $t_2$ is q-rooted embeddable. If $(q, t_{1,2}) \in V_{q,T_i}$ belong to the same distributable packet, $W_{t_1;t_2}$ is q-rooted embeddable by definition. For $t_1 \in T_1$ and $t_2 \in T_2$, let $t_0 \in T_1 \cap T_2$ be a common node in both $V_{q,T_1}$ and $V_{q,T_2}$. There are three possible positions for $t_0$, namely $t_0 < t_1 < t_2$, $t_1 < t_0 < t_2$, and $t_1 < t_2 < t_0$. For $t_0 < t_1 < t_2$, the unitaries $W_{t_1;t_0}$ and $W_{t_2;t_0}$ are both embeddable, since $\{(q, t_0), (q, t_i)\} \subseteq V_{q,T_i}$. It holds then which is also q-rooted embeddable. Analogously, one can prove the q-rooted embeddability of $W_{t_2;t_1}$ for $t_1 < t_0 < t_2$ and $t_1 < t_2 < t_0$. This completes the proof.

D.5 The time complexity of Algorithm 27 and 28
The time complexity of Algorithm 27 is estimated as follows. Given a packing graph $G_B = (V, E)$, there are three main steps in the algorithm:
1. enumerate all minimum vertex covers $K_i$;
2. enumerate all minimum vertex covers $\{R_{i,j}\}_j$ of the intrinsic kernel-conflict graph $C_{in}(K_i)$;
3. determine the chromatic number of the extrinsic packet-conflict graphs $C_{ex}(R^{(A,B)}_{i,j})$.

The determination of minimum vertex covers is equivalent to the determination of a maximum matching according to König's theorem [52]. The complexity of enumerating all minimum vertex covers of a graph can therefore be estimated as $O(|V|^{1/2}|E| + |V| N_{MVC})$ according to Theorem 2 in [49], where $N_{MVC}$ is the number of all minimum vertex covers.
Let m be the number of nonlocal control phase gates in the control-phase conversion of a quantum circuit (Eq.(74)).The vertex number |V| and edge number |E| of a packing graph are upper bounded by 2m and m, respectively.The complexity of the algorithm is then estimated as The complexity of the first two steps is where is the maximum cardinality of the set of minimum vertex covers obtained in step 2.
The number of minimum vertex covers can be upper bounded by the number of minimal vertex covers, which is equal to the number of maximal independent sets and upper bounded by $3^{|V|/3} \le 3^{2m/3}$ according to [53]. In the worst case, the complexity of the first two steps is $O(m\,3^{4m/3})$.
In the last step, one needs to determine the chromatic number of the extrinsic conflict graph $C_{ex}(R^{(A,B)}_{i,j})$. In general, the determination of the chromatic number is NP-hard and has a complexity of $O(2^{|V|}|V|)$ [51]. Overall, to obtain the full information about all the options of packing strategies $\{(R_{i,j}, \chi^{(A)}_{i,j}, \chi^{(B)}_{i,j})\}_{i,j}$ with Algorithm 27, the complexity is roughly upper bounded by the expression in Eq. (147), which increases exponentially.
For a heuristic implementation of the algorithm, one can simply find one minimum vertex cover without enumerating all of them. The complexity of finding a minimum vertex cover can be estimated as $O(|V|^{1/2}|E|)$ according to the Hopcroft-Karp algorithm [54]. Besides, an upper bound on the chromatic number of a conflict graph can also be efficiently determined by $\chi \le \max_{v \in V}(d_v + 1)$, where $d_v$ is the degree of a vertex, with a time complexity of $O(|V|^2)$. As a whole, the heuristic implementation of Algorithm 27 has a time complexity of

$$O(m^{7/2}), \quad (148)$$

which is polynomial. In Fig. 25, the runtimes $t$ of Algorithm 27 for different UCC circuits are plotted with respect to the number of nonlocal gates $m$ in the circuits. It shows a polynomial increase of the runtime $t$ with respect to the number of nonlocal gates $m$, of order 3.157.
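The degree bound $\chi \le \max_v (d_v + 1)$ is the bound achieved by a plain greedy colouring, which can be sketched as follows. This is a generic illustration and not necessarily what the heuristic implementation of Algorithm 27 does. The conflict graph is assumed to be given as a hypothetical adjacency dictionary.

```python
def greedy_coloring(adjacency):
    """Colour vertices greedily; `adjacency` maps vertex -> iterable of neighbours.

    Returns a dict vertex -> colour index. Each vertex takes the smallest
    colour not used by its already coloured neighbours, so at most
    max_v(d_v + 1) colours are needed.
    """
    colors = {}
    for v in adjacency:
        used = {colors[u] for u in adjacency[v] if u in colors}
        color = 0
        while color in used:
            color += 1
        colors[v] = color
    return colors


def degree_bound(adjacency):
    """Upper bound max_v(d_v + 1) on the chromatic number."""
    return max((len(set(nbrs)) + 1 for nbrs in adjacency.values()), default=0)


# A triangle a-b-c with a pendant vertex d attached to c.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
coloring = greedy_coloring(graph)
print(len(set(coloring.values())), degree_bound(graph))  # 3 4
```

With adjacency lists, the greedy pass touches every edge a constant number of times, which stays within the $O(|V|^2)$ estimate quoted above.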
Algorithm 28 only differs from Algorithm 27 on step 3, where one searches for χ A and χ B partitioning on the extrinsic graph C ex (R (A) i,j ) and C ex (R (B) i,j ), respectively, instead of the determination of their chromatic numbers.The complexity of χ A,B partitioning is equal to the determination of the chromatic number [51].The complexity of Algorithm 28 is therefore the same as Algorithm 27, which is given in Eq. (147).The complexity of the heuristic solution for Algorithm 28 is the same as the one given in Eq. (148).

D.6 Proof of Lemma 29
According to Definition 23, a connecting gate g is distributable on both q 1 and q 2 .As a result of Theorem 5, the gate g can be decomposed as where {u ij } i,j and {v ij } i,j are diagonal or antidiagonal.Let D 1 and D 2 be diagonal or antidiagonal single-qubit gates such that {u ij } i,j = D 1 .V (φ 1 ) and {ũ ij } i,j = D 2 .V (φ 2 ) is diagonal.It holds then It follows that D † 1 V i and D † 2 V i are diagonal unitary.Furthermore, D † 1 V 0 and D † 2 U 0 fulfill ⟨0|D † 1 V 0 |0⟩ = ⟨0|D † 2 U 0 |0⟩.It follows that Furthermore, one can then show the following equality As a result,

D.7 Proof of Lemma 31
Two remote control-phase nodes $\{(q, t_i), (q, t_j)\}$ with $i \le j - 2$ form a B-distributable packet $V_{q,\{t_i,t_j\}}$, if and only if the unitary $W_{t_j;t_i}$ between them is B-embeddable through a hopping embedding consisting of several embedding units. According to the embedding rules B in Theorem 10 and Corollary 30, the embedding units are of global D-type, local D-type, or global H-type, as shown in Fig. 18 (a,b,c). The unitary $W_{t_j;t_i}$ between $\{(q, t_i), (q, t_j)\}$ is B-embeddable, if and only if it can be decomposed as where $U_{t_k}$ are D-type or H-type embedding units. Suppose there exists a D-type embedding unit $U_{t_l}$. One can then decompose $W_{t_j;t_i}$ into two embeddings where $W_{t_j;t_l}$ and $W_{t_l;t_i}$ are both B-embeddable. As a result, the distributable packet $V_{q,\{t_i,t_j\}}$ is indecomposable only if the embedding units in Eq. (159) are all H-type. This completes the proof.

D.9 Proof of line 25 in Algorithm 35
We need to prove that a unitary $W_{t_j;t_i}$ given in the following form is q-rooted embeddable according to the embedding rules B, if and only if $\alpha_{k+1} + \gamma_k = n\pi$ for all $k$,

$$W_{t_j;t_i} = V(\gamma_{j-1})\, H\, V(\alpha_{j-1})\, C^{(q,q_{j-1})} \cdots$$

Since the phase gates in the unitary $W_{t_j;t_i}$ commute with the $C_V$ gates, one can merge them together as follows According to the embedding rules $B_2$ in Corollary 30, $W_{t_j;t_i}$ is q-embeddable only if each gate $C_V$ forms an H-type embedding unit. To satisfy this necessary condition, we insert pairs of Hadamard gates $HH$ into $W_{t_j;t_i}$ to construct H-type embedding units $H C_V H$. This unitary is q-embeddable according to $B_1$ in Theorem 10 and $B_2$ in Corollary 30, if and only if $H V(\alpha_{k+1} + \gamma_k) H V(\beta_k)$ is embeddable for all $k$. This is equivalent to the condition $\alpha_{k+1} + \gamma_k = n\pi$ for all $k$. This completes the proof.
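The phase condition proved above, $\alpha_{k+1} + \gamma_k = n\pi$, is easy to test numerically. The following sketch assumes that the phases are supplied as two equally long Python lists indexed by $k$, and that the condition is checked for every consecutive pair; this indexing convention and the function names are assumptions made for illustration.

```python
import math


def is_multiple_of_pi(angle, tol=1e-9):
    """Return True if `angle` equals n*pi for some integer n, up to `tol`."""
    ratio = angle / math.pi
    return abs(ratio - round(ratio)) <= tol


def satisfies_phase_condition(alphas, gammas, tol=1e-9):
    """Check alpha_{k+1} + gamma_k = n*pi for all consecutive k.

    `alphas[k]` and `gammas[k]` are the phases attached to the k-th block
    (a hypothetical indexing of the decomposition of W_{t_j;t_i}).
    """
    if len(alphas) != len(gammas):
        raise ValueError("alphas and gammas must have the same length")
    return all(
        is_multiple_of_pi(alphas[k + 1] + gammas[k], tol)
        for k in range(len(gammas) - 1)
    )


alphas = [0.2, math.pi / 2, math.pi / 4]
gammas = [math.pi / 2, 3 * math.pi / 4, 0.7]
print(satisfies_phase_condition(alphas, gammas))  # True
```

The first $\alpha$ and the last $\gamma$ do not enter the check, since they sit at the boundary of the sequence in this convention.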

D.10 Proof of Theorem 36
In this section, we prove the compatibility of the primitive embedding rules in a circuit consisting of single-qubit and two-qubit gates. Fig. 20 shows all three possible cases of nested distributable packets. In Fig. 20 (a,b), the two distributable packets are on the same qubit q. Panel (a) shows the distributable packets $V_{q,\{t_0,t_5\}}$ and $V_{q,\{t_1,t_6\}}$ identified by two nested embeddings. We first implement the primitive embedding of $W_{t_5;t_0}$. According to the embedding rules B in Eq. (84), the additional kernels are all gates acting on system B, which are irrelevant for the embedding of $W^{(q)}_{t_6;t_2}$ on q. For the embedding of $W^{(q)}_{t_6;t_2}$ on q, the only additional gate is the control-X gate $C_{q,X_e}$ in the ending process of the embedding of $W^{(q)}_{t_5;t_0}$, which acts before $(q, t_5)$. One can shift the control-X gate $C_{q,X_e}$ into the block $U_{t_5}$,

$$U_{t_5} = (D_q \otimes G_{q_{n-1}})\, C_{q,X_e}\, C_V(\theta_5)\, (D_q \otimes F_{q_{n-1}}). \quad (167)$$

For the indecomposable embedding of $W^{(q)}_{t_6;t_2}$, one can find a decomposition of the local gates on q that forms an H-type embedding unit for $U_{t_5}$,

$$U_{t_5} = (D_q H_q \otimes G_{q_{n-1}})\, C_{q,X_e}\, C_V(\theta_5)\, (H_q D_q \otimes F_{q_{n-1}}). \quad (168)$$

The unitary $U_{t_5}$ can be decomposed into two H-type embedding units by inserting $H_q^2 \otimes 1$ between the two control gates $C_{q,X_e}$ and $C_V(\theta_5)$, as follows:

$$U_{t_5} = (D_q H_q \otimes G_{q_{n-1}})\, C_{q,X_e}\, (H_q \otimes 1_B) \times (H_q \otimes 1_B)\, C_V(\theta_5)\, (H_q D_q \otimes F_{q_{n-1}}). \quad (169)$$

$(H_e \otimes H_{e'})$ acting on the two auxiliary qubits $\{e, e'\}$. Removing the duplicated Hadamard gates between the two control-Z gates on the auxiliary qubits $\{e, e'\}$, the implementation of the embedding is shown in Fig. 20 (b), where the additional local gate $C_Z^{(e,e')}$ is highlighted in green. This proves the compatibility of the two nested embeddings shown in Fig. 20 (a).
For the case shown in Fig. 20 (c), one of the embedding W (q) t5;t2 is included in the embedding of W (q) t6;t0 .One can first implement the embedding W (q) t6;t0 , of which the local kernels of embedding all acts on B and are all irrelevant with the embedding of W t5;t2 .It means that the embedding of W (q) t5;t2 is not affected by the embedding of W (q) t6;t0 .These two embeddings are therefore compatible.The implementation of the two embeddings are shown in Fig. 20 (d).
For the case shown in Fig. 20 (e), the two nested embeddings are rooted on two qubits $\{q, q'\}$, which are on two different local systems. The embedding of $W^{(q)}_{t_5;t_0}$ can be decomposed into two embedding units $U_{t_2}$ and $U_{t_3}$, while $W^{(q')}_{t_6;t_1}$ can be decomposed into $U_{t_3}$ and $U_{t_4}$. The two hopping embeddings share the same embedding unit $U_{t_3}$. If one wants to implement the two hopping embeddings together, one has to implement the joint embedding of $U_{t_3}$ on $\{q, q'\}$. The primitive recursive embedding of $U_{t_3}$ will lead to a global control-Z gate $C_Z^{(e,e')}$ acting on the two auxiliary qubits $\{e, e'\}$,

$$C_{q,X_e}\, C_{q',X_{e'}}\, (H_q \otimes H_{q'})\, C_V(\theta)\, (H_q \otimes H_{q'})\, C_{q',X_{e'}}\, C_{q,X_e} = (H_e \otimes H_{e'})\, C_Z^{(q,e')}\, C_Z^{(q',e)}\, C_V(\theta)\, C_Z^{(e,e')}\, (H_e \otimes H_{e'}). \tag{170}$$

The incompatibility of the nested embeddings is caused by the incompatibility of the recursive embedding of the embedding unit $U_{t_3}$.
The distributable packets $V_{q,\{t_i,t_j\}}$ and $V_{q',\{t_k,t_l\}}$ are incompatible if and only if $\{q, q'\}$ belong to two local systems and there is an embedding unit $(H_q \otimes H_{q'})\, C_V^{(q,q')}(\theta)\, (H_q \otimes H_{q'})$ located at $t$ with $t_i < t < t_j$ and $t_k < t < t_l$. The existence of such an embedding unit for two distributable packets is equivalent to the existence of a control-phase gate acting on $\{q, q'\}$ at a depth $t$ with $t_i < t < t_j$ and $t_k < t < t_l$. This completes the proof.
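The incompatibility criterion stated above reduces to a simple interval test, which can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classes `Packet` and `ControlPhase`, the function `packets_incompatible`, and the `party_of` mapping from qubits to local systems are all hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    """Distributable packet rooted on `qubit`, spanning depths t_start..t_end."""
    qubit: str
    t_start: int
    t_end: int

@dataclass(frozen=True)
class ControlPhase:
    """Control-phase gate on an unordered pair of qubits at a given depth."""
    qubits: frozenset
    depth: int

def packets_incompatible(p1, p2, party_of, control_phase_gates):
    """Return True if the two packets conflict according to the criterion:
    roots on different local systems, and a control-phase gate on the two
    root qubits strictly inside both packets' depth ranges."""
    # Roots on the same local system: the criterion does not apply.
    if party_of[p1.qubit] == party_of[p2.qubit]:
        return False
    pair = frozenset({p1.qubit, p2.qubit})
    return any(
        g.qubits == pair
        and p1.t_start < g.depth < p1.t_end
        and p2.t_start < g.depth < p2.t_end
        for g in control_phase_gates
    )

# Hypothetical example mirroring the shape of Fig. 20 (e).
party_of = {"q": "A", "q_prime": "B"}
p = Packet("q", 0, 5)
p_prime = Packet("q_prime", 1, 6)
inside = [ControlPhase(frozenset({"q", "q_prime"}), 3)]
outside = [ControlPhase(frozenset({"q", "q_prime"}), 6)]

conflict = packets_incompatible(p, p_prime, party_of, inside)
no_conflict = packets_incompatible(p, p_prime, party_of, outside)
```

The strict inequalities follow the statement of the criterion, so a control-phase gate sitting exactly at a packet boundary does not count as a conflict in this sketch.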

Figure 2: The starting and ending processes: the symbol on the left side represents the operation given by the quantum circuit on the right side. A working qubit $q$ and an auxiliary qubit $e'$ belong to a local QPU $A$, while an auxiliary qubit $e$ belongs to another local QPU $B$. (a) The starting process. (b) The ending process. (c) The protocol in [26] for distributed implementation of controlled unitary gates.

Figure 3: Distributed implementation of swapping with two entanglement-assisted LOCCs

Figure 6: Embeddable unitary: (a) The decomposition of $U$ as a product of embedding units. (b) The embeddability of each embedding unit. (c) The embedding of $U$.

Figure 7: The embedding process of a CNOT gate.

Fig. 10 (a) shows a $\{q_1, q_2\}$-rooted embedding of $U$ followed by a recursive $\{q_1, q_3\}$-rooted embedding. In this example, the root qubits are $Q = \{q_1, q_2\}$ and $Q' = \{q_1, q_3\}$, associated with the auxiliary qubits $A = \{e_1, e_2\}$ and $A' = \{e'_1, e_3\}$, respectively. One can simultaneously pack $Q$-rooted processes and $Q'$-rooted processes through the joint embedding of $U$, if $U$ is $Q \uplus Q'$-rooted embeddable, as it is shown in Fig.

Figure 11: Simultaneous packing through compatible recursive embeddings. (a) The packing of the $Q'$-rooted processes including the embedding of $W_{1,2}$, embedding of $U$ and distributing of $V_{1,2}$. (b) The simultaneous packing of $Q$-rooted processes and $Q'$-rooted processes that merges the $Q'$-rooted embedding of $W_{1,2}$, $Q'$-rooted distributing of $V_{1,2}$, and $Q$-rooted distributing of $W_{1,2}$ through the $Q \uplus Q'$-rooted embedding of $U$.

Figure 12: Compatible nested embeddings. The unitaries $U = BA$ and $U' = CB$ are embeddable on $Q = \{q_1, q_2\}$ and $Q' = \{q_1, q_3\}$ with the embedding rules $\mathcal{B}_Q$ and $\mathcal{B}_{Q'}$, respectively. If the equality holds, $\mathcal{B}_Q$ and $\mathcal{B}_{Q'}$ are compatible.

Figure 13: Conflict between two distributing processes. (a) Intrinsic conflict between two trivial distributable packets. (b) Extrinsic conflict between two trivial distributable packets.

Figure 15: Conflict between two embedding processes. (a) Extrinsic conflict between two indecomposable embeddings competing for memory qubits. (b) Intrinsic conflict between two indecomposable embeddings due to their intrinsic incompatibility.

Figure 17: Identification of distributable packets: white circles represent distributable single-qubit gates, white squares represent identity gates, black squares represent Hadamard gates, and gray squares represent $q$-irrelevant gates. (a) A hopping embedding (black dot-dashed) decomposed into indecomposable hopping embeddings (blue dot-dashed) and neighbouring embeddings (green dash-dot). (b) A two-qubit block. (c) The D-type and H-type blocks.

Figure 19: Algorithm for the identification of distributable packets. (a) A circuit after control-phase conversion. The white squares are identity gates, the white circles are single-qubit distributable gates, and the red diamonds are the single-qubit non-distributable gates. (b) The circuit after the 1st step in Algorithm 33. The red squares are Hadamard gates. (c) The determination of the packability of $\{(q, t_1), (q, t_6)\}$. (d) The determination of the packability of $\{(q, t_0), (q, t_4)\}$.

Figure 20: Nested embeddings in elemental blocks. (a) Nested embeddings on the same qubit $q$. (b) Implementation of the embedding in (a). (c) Inclusive nested embeddings on the same qubit $q$. (d) Implementation of the embedding in (c). (e) Nested embeddings on two qubits $\{q, q'\}$ that belong to different local systems. (f) Incompatibility of the two nested embeddings due to the additional global gate $C_Z^{(e,e')}$.

Figure 21: The distributability and embeddability structure of a 4-qubit UCC circuit. (a) The packing graph: each colored vertex is a distributable packet, which is formed by merging the neighbouring packets through indecomposable hopping packets. The indecomposable hopping packets are depicted as white vertices and aligned in a line stretching from each colored vertex. The edges are packing edges. The red vertices are a minimum vertex cover $V_{\mathrm{mvc}}$ of the packing graph. The order of $V_{\mathrm{mvc}}$ is 10. (b) Intrinsic packet-conflict graph: the conflict graph defined at the level of distributable packets has the same vertex set as the packing graph. The conflict edges between packets are induced from the intrinsic conflict edges between kernels according to Eq. (66). The red edges are the conflict edges induced by the minimum vertex cover $V_{\mathrm{mvc}}$. (c) Intrinsic kernel-conflict graph defined at the level of kernels of packing processes. The colored vertices are the indecomposable hopping packets representing the kernels of hopping embeddings, while the white circles are the distributable packets formed from these embedding kernels. The red vertices $K_{\mathrm{mvc}}$ are the indecomposable hopping packets that are included in the minimum vertex cover $V_{\mathrm{mvc}}$. The red edges are the conflict edges induced by $K_{\mathrm{mvc}}$. The blue vertices $K_{\mathrm{mvc}}$ are a minimum vertex cover of the $K_{\mathrm{mvc}}$-induced conflict graph, which are the embeddings that need to be removed. The order of $K_{\mathrm{mvc}}$ is 7.

Figure 22: The entanglement costs of different DQC protocols for UCC circuits (left column) and their reduction ratios (right column). (a) The entanglement cost for 4-qubit and 6-qubit UCC circuits under different bipartitions. (b) The entanglement cost for UCC circuits with a qubit number from 6 up to 20 under a fixed bipartition.

Figure 23: An extended embedding rule for a general global H-type embedding unit.

Figure 25: The log-log plot of the runtimes of the packing algorithm (Algorithm 27) for different UCC circuits.
through the embedding rules $\mathcal{B}_2$ in Corollary 30. The indecomposable hopping embeddings determined by Algorithm 35 remotely connect two neighbouring distributable packets in $V_{\mathcal{B}_1}$. One can then merge two remote neighbouring distributable packets through the indecomposable hopping embeddings in $\mathcal{B}_2$ and obtain the

), these single-qubit gates are then grouped into $S$ for starting-gate candidates, $E$ for ending-gate candidates, $M$ for intermediate-gate candidates, and $D$ for distributable intermediate-gate candidates according to lines 3 to 6 in Algorithm 35. The gates in the groups $S$ and $E$ are the potential starting and ending single-qubit gates of an indecomposable hopping embedding, since they can be decomposed into gates containing only one Hadamard gate. The gates in the group $M$ can only be the intermediate single-qubit gates in an indecomposable hopping embedding, since they can only be converted into gates containing two Hadamard gates. Note that all gates in $S$, $E$, $D$ and $M$ can serve as single-qubit intermediate gates in an indecomposable hopping embedding, since they can be decomposed into a gate that contains two Hadamard gates.
embedding introduced in Definition 7 can merge two non-sequential distributing processes without changing global gates in the kernel. The packing edges associated with global gates in a packing graph are therefore not affected by an embedding process. Although an embedding process may introduce conflicts with some other embeddings, one can resolve these conflicts by removing embeddings, which splits their original distributable packets. This means that the distributability structure of global gates represented by packing edges in the packing graph is not changed by ordinary embeddings. Since the global gates in a kernel are not changed, the entanglement cost for distributing processes in the kernel does not change. The merging of two non-sequential distributing processes will then save 1 ebit.
First of all, the unitaries $U_t$ with $t \in T_1 \cup T_2$ are all distributable by definition. Second, for any $t_{1,2} \in T_1 \cup T_2$ with $t_1 < t_2$, we need to prove that the unitary

Figure 24: The 2-qubit blocks of a 4-qubit UCC circuit after the control-phase conversion.

D.4 Proof of Lemma 17

Up to line 8, the complexity is equal to $O(m_1 + m_2)$, where $m_1$ and $m_2$ are the numbers of single-qubit and two-qubit gates, respectively. For each qubit, the identification of neighboring distributable packets has the complexity $O(d)$, where $d$ is the depth of a circuit. For the identification of hopping distributable packets in Algorithm 35, the complexity is determined by the examination of the embeddability of $W_{t_j;t_i}$ on line 10, which is upper bounded by $O(d(d-1))$. As a result, the complexity of Algorithm 33 is estimated by $O(m_1 + m_2 + n \times d(d-1))$, where $n$ is the total number of qubits from all parties. Since $m_{1,2} \le nd$, the complexity is upper bounded by $O(nd^2)$.
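The last step can be spelled out as a one-line estimate. The sketch below assumes $d \ge 1$ and uses only the bound $m_{1,2} \le nd$ quoted above. The explicit constant 2 is an illustration that is not stated in the source and is absorbed by the $O$ notation.

```latex
m_1 + m_2 + n\,d(d-1) \;\le\; 2nd + nd^2 - nd \;=\; nd^2 + nd \;\le\; 2nd^2 ,
\qquad\text{hence}\qquad
O\bigl(m_1 + m_2 + n\,d(d-1)\bigr) \subseteq O(nd^2).
```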
means that one can implement the embedding of $W_{t_6;t_2}$ with an additional local gate $(H_e \otimes H_{e'})\, C_Z^{(e,e')}$