Optimising Matrix Product State Simulations of Shor's Algorithm

We detail techniques to optimise high-level classical simulations of Shor's quantum factoring algorithm. Chief among these is to examine the entangling properties of the circuit and to effectively map it across the one-dimensional structure of a matrix product state. Compared to previous approaches whose space requirements depend on $r$, the solution to the underlying order-finding problem of Shor's algorithm, our approach depends on its factors. We performed a matrix product state simulation of a 60-qubit instance of Shor's algorithm that would otherwise be infeasible to complete without an optimised entanglement mapping.


Introduction
With the potential for quantum computers to outperform the best classical computing resources available, achieving this quantum supremacy promises to be a major milestone in computing. However, the ability to demonstrate quantum supremacy [1][2][3] depends on several factors, including the computational task under consideration, as well as properties of the physical quantum computer (e.g. connectivity and error rate). On the side of classical computation, the difficulty in defining a quantum supremacy point for a given problem stems from bounding the computational power of classical machines. Therefore, we seek techniques to perform simulations of quantum algorithms and circuits more efficiently [3][4][5]. Generally, quantum algorithms possess some sort of structure that might be exploited to provide some advantage to classical simulation. In this paper, we examine how the entanglement structure of Shor's algorithm for integer factorisation lends itself to a particular matrix product state representation that quantifiably reduces the computational requirements for classical simulation. Additionally, we show how particular instances of Shor's algorithm become significantly easier to simulate when this structure is exploited.
Shor's algorithm [6] may be used to factor a semiprime integer N of l binary digits in polyno-mial time with respect to l on a quantum computer. Presently, there is no known algorithm to perform this factorisation efficiently (i.e. in polynomial time) on a classical computer. The presumed difficulty of factoring very large semiprime numbers is the foundation underpinning the security of the RSA public-key cryptosystem [7] used to secure online communications [8]. This provides motivation to build quantum computers capable of performing large enough instances of Shor's algorithm and to develop public-key cryptosystems resistant to quantum computers [9].
Existing physical implementations of Shor's algorithm [10][11][12][13] have been produced to factor small semiprimes no longer than five bits in length. This is significantly less than even the 15-bit instances first simulated in [14], requiring 45 qubits. Therefore, the simulated instances presented in this paper may be presented as medium-term goals for quantum hardware to benchmark against.
A typical state vector representation of an n-qubit system requires the storage of 2 n complex scalars, regardless of the state being stored. While simulations of quantum circuits may be performed by operating on this collection of scalars, the exponential space complexity ultimately limits the size of the systems that can reasonably be simulated. Instead, by using the matrix product state representation [15] of a quantum state, the space requirements scale according to the amount of entanglement present in the system. As tensor networks [16], matrix product states were originally used for simulating one-dimensional quantum many-body systems [17,18], but have since been adapted for simulating quantum circuits [14,15,19]. As such, even states of many qubits may feasibly be stored using a matrix product state representation, provided its entanglement is sufficiently limited. Other examples of tensor networks that may be used to simulate quantum circuits include PEPS [20], MERA [21] and tree tensor networks [22].
By examining the entanglement introduced by a high-level circuit of Shor's algorithm, we were able to improve space requirements over previous simulations [14] by sensibly mapping this entanglement across the one-dimensional structure of a matrix product state. Furthermore, in treating the task of simulating Shor's algorithm as a sampling problem, we could take advantage of single-qubit measurement to collapse entanglement and hence reduce our space usage. The combined result of our optimisations allowed us to simulate a nontrivial instance of Shor's algorithm involving 60 qubits on a supercomputer using an approximate total of 14 TB of RAM. In comparison to the 2 60 × 128 bit 1.8 × 10 7 TB to store a state vector for 60 qubits in double precision, this represents a significant reduction in the required memory.
To outline this paper, we begin with a review of Shor's algorithm in Sec. 2 and a review of the matrix product state formalism in Sec. 3. We then examine the entanglement introduced at particular stages in Shor's algorithm in Sec. 4 and then detail in Sec. 5 how the subsystems of a matrix product state may be partitioned to take advantage of this entanglement evolution. Finally, we provide benchmarks of our implementation in Sec. 6, including our simulation of a 60-qubit instance.

Review of Shor's Algorithm
We consider the circuit shown in Fig. 1 for this review of Shor's algorithm and for the general structure of our following simulations. Given an odd, squarefree semiprime N = pq represented by at least l binary digits, a high-level circuit for Shor's algorithm as detailed in [23] consists of an 'upper' register of 2l qubits initialised to |0 and a 'lower' register R of l qubits initialised to the state |1 : The Hadamard gate is then applied to each qubit of the upper register, creating the superposition where we shall ignore normalisation constants for clarity. Randomly choosing integer a = 1 from Z * N , the set of integers modulo and coprime to N , exponents of the unitary operator are applied to the lower register, controlled by qubits in the top: where r is the period of U . To determine this period r, the lower register is measured, forcing a choice of the index j in Eq. (2). This collapses the entanglement between the two registers, and we can now separate them. The upper register is now in state for the 0 ≤ j < r corresponding to the value (a j mod N ) measured from the lower register in Eq.
(2). The quantum Fourier transform (QFT) is then applied to this upper register: The upper register is then measured to produce a value s with probability for random variable S. This implies that values of s such that rs/2 2l is close to an integer are more likely to be measured, resulting in the peaks shown in the example probability distribution in Fig. 1(c). The quantum processing component of Shor's algorithm is now complete, and the result of S may be classically processed using the continued fractions and Euclidean algorithms to successfully factor N with high probability [23].

Review of Matrix Product States
To briefly review the matrix product state (MPS) [15] formalism for representing quantum states, we begin with some general composite state of n subsystems given by A subsystem m with dimensionality d m is referred to as a d m -level qudit. In the case where d m = 2, m is a qubit. Conventionally, the state-vector approach for computationally storing this state involves recording each of the complex coefficients ψ i1i2...in in a single, perhaps multidimensional, array. The space complexity of this particular representation is fixed at O(d) regardless of the specific state |ψ , where d = n m=1 d m . . . .  At its core, the MPS representation of a state involves storing matrices for each qudit, whose product equals each of the coefficients in Eq. (5): where we have used the usual summation notation. Notably, qudits whose matrices are adjacent in Eq. (6) may be contracted and redefined into a higher-dimensional qudit: By sequentially contracting adjacent qudits, the state vector representation may be obtained from an MPS representation of a state. Additionally, the reverse operation of Eq. (7) may be performed to decompose the matrices of two combined qudits. This involves rearranging the elements of Γ in Eq. (7) and applying a matrix decomposition such as the trivial decomposition where I is the appropriately sized identity matrix, or the rank-revealing singular value decomposition (SVD) where U and V are unitary matrices and the singular values, the elements of S, are nonnegative reals. The intermediate index α m is referred to as the bond index of its respective bipartition. By sequentially decomposing qudits into constituent subsystems, an MPS representation of a state represented in the state vector form may be obtained. Though the trivial decomposition, Eq. (8), is simple to perform, an MPS produced solely from such decompositions on a state vector representation offers no computational advantage in terms of time or space usage. Instead, if the SVD, Eq. (9), is used as the decomposition, the MPS representation might look like where each λ [m] is the vector of singular values obtained from Eq. (9). If care is taken during the treatment of these singular values [15], a canonical MPS form can be defined such that these λ [m] αm are equal to the Schmidt coefficients [23]  To simulate measurements on qudit m of a composite system, the reduced density matrix of m may be obtained from the general MPS expression, Eq. (6), by tracing over all other qudit subsystems: where the bars denote complex conjugation. From this, the measurement probabilities of the d m possible values of m may be read from the diagonal elements of ρ [m] . For a canonical MPS in the form of Eq. (10), ρ [m] may instead be obtained locally: This can be generalised to multiple-qudit gates acting on consecutive qudits in an MPS by contracting them, applying the gate, and then decomposing. Swap gates may be used to rearrange the order of qudits in an MPS.

Entanglement in Shor's Algorithm
Since the space usage of an MPS scales with the amount of entanglement in the state, we should examine the entangling properties of Shor's algorithm in order to determine an efficient embedding of the circuit's 3l qubits into an MPS with its inherently linear connectivity. By rewriting Eq. (3) for the state of the upper register following measurement of the lower register R in the form we observe that the α least-significant bits of each state on the right hand side of Eq. (13) are identical, where α is the number of trailing zeroes in the binary representation of r. Therefore, immediately prior to the measurement of R, these α least-significant qubits of the upper register only exhibit entanglement with R and not between themselves or other qubits of the upper register. Also at this point in the circuit, the remaining 2l − α most-significant qubits of the upper register exhibit entanglement between themselves and with the lower register R, due to the odd factor β ≡ r/2 α of r which cannot be localised to specific qubits. Following measurement of the lower register, all l qubits in R are now completely separable. Entanglement between the upper register with R is consequently collapsed, leaving the upper register's α least-significant qubits completely separable and its remaining 2l − α most-significant qubits only entangled amongst themselves.
To examine some properties of α, we note that r divides λ(N ), the Carmichael function at N , which is defined as the smallest integer such that x λ(N ) ≡ 1 mod N for all x ∈ Z * N . By the Chinese remainder theorem and Carmichael's theorem, for our odd, squarefree semiprime N . Therefore if d p is the largest integer such that 2 dp |(p−1) and similarly for d q , then max(d p , d q ) is the largest value α can take for a specific N over all choices of a. Also by the Chinese remainder theorem, uniformly choosing a = 1 from Z * N is equivalent to independently and uniformly choosing a p ∈ Z * p and a q ∈ Z * q so that a p , a q = 1. Therefore, by [23, Lemma A4.12 ], asymptotically for randomly and uniformly chosen a = 1. Furthermore, Dirichlet's theorem implies that there is asymptotically a 2 1−m probability that 2 m |(p − 1) for randomly chosen prime p. Therefore, d p is distributed geometrically such that Pr(d p = m) = 2 −m . If p and q are independently chosen, d p and d q are i.i.ds whose maximum has an expected value by [24,Eq. (2.6)]. This analysis suggests a further partitioning of the upper register. From now, we will refer to A as the set of α least-significant qubits in the upper register and B as the remaining 2l − α most-significant qubits of the upper register.

Previous approaches
High-level simulations of Shor's algorithm were performed in [14] with a static MPS qudit ordering that places the upper-register qubits on one side of the lower register R, per the example in Fig. 2(a). To briefly summarise this static method, the l qubits of R were kept contracted as a single qudit. With R effectively stored as a 1-dimensional qudit in the state |1 , each of the 2l upper-register qubits q i for 0 ≤ i < 2l is entangled with R by applying the following procedure to each q i in order of descending i: q 21 q 20 q 19 q 18 q 17 q 16 q 15 q 14 q 13 q 12 q 11 q 10 q 9 q 8 q 7 q 6 q 5 q 4 q 3 q 2 q 1 q 0 R Qudit Subsystem q 21 q 20 q 19 q 18 q 17 q 16 q 15 q 14 q 13 q 12 q 11 q 10 q 9 q 8 q 7 q 6 q 5 q 4 q 3 q 2 R q 1 q 0 Qudit Subsystem • An initially separable qubit q i is inserted between the previous upper-register qubit and R by altering the bond sizes appropriately and is set to the state (|0 + |1 )/ √ 2, representing the starting round of Hadamard gates.
• The gate U 2 i , which may be formed by repeated squaring of U , Eq. (1), is applied to R and controlled by q i . This involves contracting q i with R beforehand and may require R's current effective dimensionality to be increased.
• This contraction is then decomposed back into qubit q i and the lower-register qudit R by using the trivial decomposition, Eq. (8). Rank minimisation through a rank-revealing decomposition is not required at this stage since the apparent rank from the trivial decomposition is equal to the Schmidt rank.
With sufficient index book-keeping, these steps may be combined into a single operation per qubit q i . The effect of these controlled operations between each of the 2l upper-register qubits with the lower register R is to quickly initialise an MPS that encodes the state |ψ modexp in Eq. As all upperregister qubits are on one side of the lower-register qudit, applying these controlled gates ensures that the Schmidt ranks at each bipartition are nondecreasing toward the lower-register qudit. This culminates in a rank of r between the least-significant qubit q 0 of the upper register with the lower register R [25], as demonstrated in Eq. (2) and Fig. 2(c). Additionally, R is effectively stored as an r-dimensional qudit at this point.
This MPS is not in canonical form, since the trivial decomposition, Eq. (8), was used instead of the SVD, Eq. (9). As such, nonlocal measurement of the R was simulated by directly calculating its reduced density matrix through Eq. (11), and upon accordingly setting a value for R, entanglement is collapsed to reduce space usage by performing a sweep outward from R. Since the lower register is now separable at this stage, it is removed from the MPS and the remaining upper register of 2l qubits encodes the state |ψ upper in Eq. (3).

The quantum Fourier transform (QFT) was then
Accepted in Quantum 2018-12-31, click title to verify performed in [14] by sequentially contracting qubits together, eventually resulting in a 2 2l -dimensional state vector of the upper register. Using this full contraction ultimately limits the size of an instance of Shor's algorithm that may be simulated. However, it does scale well timewise with parallelism since decompositions are not required. Though in that case, for these high-level simulations, the use of a distributed discrete Fourier transform library [26] as a single operation on the fully contracted state vector should result in greater performance.
A separate approach to simulating Shor's algorithm up to the lower-register measurement, making use of a tree tensor network, is detailed in [22]. Due to the strict linear layout of an MPS, this tree tensor network more evenly represents the multipartite entanglement induced by the controlled exponentiated U gates at the cost of a more complicated series of tensor operations when performing basic tasks such as measurement or applying nearest-neighbour gates. While [22] elucidates how such a tree tensor network might be converted to an MPS with a more even entanglement distribution than in [14], we detail how such an MPS might be generated solely through MPS operations.

Simulation optimisations
Instead of returning a state vector following a simulation of Shor's algorithm, which ultimately negates the space savings made by using the MPS formalism, we performed simulations of Shor's algorithm as a sampling problem where we only wish to sample measurements according to the underlying probability distribution resulting from the circuit. This method mimics the process of obtaining results from an actual quantum computer, and is therefore a task that may be performed by both classical and quantum machines. Furthermore, performing simulations in this way is ideal for MPS since it mitigates the need to contract to a state vector, and as the entanglement collapse following a measurement reduces memory usage.
From the description of Shor's algorithm in Sec. 2, the coefficients in all quantum states prior to the QFT are real. Therefore, we can approximately halve our time and space requirements for this section by storing our MPS elements as real scalars instead of complex scalars of equal precision, and then later converting to complex scalars prior to the QFT. These savings come at an opportune time due to the amount of entanglement before measurement of the lower register R.
The centrepiece of our additional optimisations, however, is to embed the lower-register qudit R within our MPS. Whilst the static approach in the previous section had qudits positioned in the order [B] : [A] : [R] as in Fig. 2(a), this has a significant space disadvantage since A is not entangled to B, but the entanglement between B and R crosses through A and mul-tiplies the Schmidt ranks within A by β. Instead, we have chosen to perform our simulations by positioning our partitions as [B] : [R] : [A] in our MPS as in Fig. 2(b), which allows direct connectivity to R from A and B simultaneously. With this dynamic layout, the size of the MPS matrices for each of the α qubits in A is reduced by a factor of β 2 compared to the static approach as seen by comparing Figs. 2(c) and 2(d), resulting from a factor of β for the left and right bond indices of a matrix. Therefore, the dynamic qudit layout refines classification of the difficulty in simulating Shor's algorithm to include the factors of r, rather than just r per the static layout.
Since simulations should not depend on prior knowledge of the values r, α or β, our layout requires us to dynamically detect the transition from qubits in B to qubits in A when applying the controlled exponentiated U gates. In Sec. 4, we noted that B was entangled with R prior to the lower-register measurement, and that the qubits within B were entangled amongst one another. Therefore, as the controlled exponentiated U gates are sequentially operated on qubits in order of descending significance, the Schmidt ranks with R will temporarily stop increasing (at a value of β) when this amount of entanglement has been reached. Only when we start operating on qubits in A does the entanglement to R begin to increase again, as seen in Fig. 2(c). By detecting this plateau and subsequent rise in the Schmidt ranks, we can detect the boundary between A and B and begin to place qubits from A on the opposite side of R.
At this point, we also note from Eq. (15) that the choice of a has probability at least one half to result in the maximum α permitted by semiprime N . Therefore, classical simulations exceeding a given memory limit may be retried a constant number of times with different values of a before this maximum α is expected to be observed. Furthermore, from Eq. (16), when the choice of a results in the maximum α, A is expected to contain 8/3 qubits across all choices of N . In our dynamic qudit layout where the size of each MPS matrix in A is reduced by a factor of β 2 , this may result in significant space savings compared to the static layout.
When sufficiently many controlled exponentiated U gates have been operated to initialise the systems [B] : [R], we perform a right sweep as required for the local measurement of R. As each of the α qubits in A is interacted with R, their Schmidt ranks toward R must sequentially double to reach a lower-register occupancy of r = β × 2 α , given the contribution of β from B. This is demonstrated in Fig. 2(d). Use of the trivial decomposition Eq. (8) with careful normalisation removes the need for the left sweep across A. R may then be measured using Eq. (12). Sweeps are then performed outward from this site to collapse entanglement involving R and with R now separable, it is removed to leave an MPS consisting of systems [B] : [A]. Since the only remaining entanglement exists between the qubits of B, the highest Schmidt rank within the MPS at this stage is β, and the qubits of A are completely separable at this point.
We then applied the linear nearest-neighbour (LNN) circuit for the QFT to our remaining qubits, as shown in Figs. 1(a) and 1(b). Since each controlled phase gate is immediately followed by a Swap on the same qubits, we applied these as a combined operation. We note that the structure of the LNN QFT circuit allows us to measure any qubits that no longer require further operations. By performing these measurements as soon as possible and sweeping afterwards, we help mitigate any extra entanglement introduced during the progression of the QFT circuit. The final result is a completely separable MPS, rather than a full state vector, of the upper register corresponding to a value distributed according to Eq. (4).   Our simulations were implemented using the Elemental [27] library for distributed-memory linear algebra, mainly due to its divide and conquer SVD [28]. All results were obtained in double precision, with 64 bit real scalars before the quantum Fourier transform and 128 bit complex scalars during. Square process grids were used and elements follow a twodimensional element-wise cyclic distribution.
We began by simulating Shor's algorithm with the same parameters (l, N, a) benchmarked in [14]. Our results in Table 1 were obtained on a single Intel Xeon E5-2683 v4 with a base core clock of 2.10 GHz, using the Intel Parallel Studio XE suite for compilers and intraprocess BLAS and LAPACK [29], and Open MPI [30] for our MPI implementation. We also limited the space usage for these results to 8 GB of RAM to showcase our optimisations against the 16 GB required in [14]. Despite using a processor of lower clock speed, our implementation appears to result in a significant overall performance increase.
The parameters (l, N, a) chosen in [14] produce values of r equal to its upper bound of lcm(p − 1, q − 1) from Eq. (14), since the static layout bases difficulty on the size of r. Our dynamic approach classifies difficulty according to the factors of r, especially the odd factor β of r which is similar across the three cases in Table 1. This is somewhat reflected in our results by our increasing time advantage over the results of [14].
Though our individual timings for the application of the exponentiated U gates and the QFT appear slower than the results of [14], this is due to our t U being used to record the time to perform the sweep prior to lower-register measurement and us choosing to use the LNN QFT with interleaved measurements respectively. Our time savings made during the lowerregister measurement more than makes up for this. Additionally, whilst [14] is able to claim 1/n proc time scaling in their results due to the parallelism of MPS contraction, we relied more on MPS decomposition through the SVD to keep our memory requirements low. It would appear then that the SVD scales differently with n proc since we do not observe 1/n proc time scaling.
To benchmark a truly distributed implementation, we also performed some simulations across multiple nodes on Magnus [31], a Cray XC40 supercomputer with 24 cores at 2.60 GHz and 64 GB of RAM per node. Our results in Table 2 were run using the GNU Compiler Collection suite for compilers, OpenBLAS [32] for intranodal BLAS and LAPACK, and Cray's proprietary MPI implementation.
For our flagship 60-qubit l = 20 instance, we chose the parameters N = 961307 and a = 5. These parameters were specifically chosen to highlight the differences between the static and dynamic methods, by having the maximum period r = 479568 permitted by N , but with α = 4 higher than expected. Since r here is relatively high, it would be infeasible to simulate this with the static method even if the final MPS contraction was not performed. In our simulation, the α = 4 qubits in A significantly decrease the amount of resources required, just from an understanding of the entangling properties of Shor's algorithm. As such, we were able to perform a high-level simulation of sampling a measurement from Shor's algorithm on 60 qubits using a total of 216 nodes, 5184 cores, and 13.824 TB of memory within 8 h. To the best of our knowledge, this is the largest high-level simulation of Shor's algorithm.

Conclusion
We performed classical simulations of Shor's algorithm as a sampling problem using a dynamic MPS qudit layout. Compared to previous approaches that relied on static qudit layouts, our approach better maps the entanglement induced by the circuit for Shor's algorithm onto a system with linear connectivity as represented by an MPS. The use of this dynamic layout also refines classification of the difficulty in simulating Shor's algorithm not only to include the size of the period r, but also its factors. Furthermore, by simulating Shor's algorithm as a sampling problem, we were able to take advantage of measurement to collapse entanglement within the MPS. This reduces space usage during the quantum Fourier transform with respect to contracting MPS matrices into a full state vector.
In particular, we found that asymptotically, on average, the power of 2 that divides r is 8/3. This number of qubits would become completely separable following measurement of the lower register, greatly reducing the simulation difficulty. We also note that instances with semiprime N = pq such that p − 1 or q − 1 is divisible by a high power of 2 are especially easy to simulate if r does not possess a correspondingly large odd factor.
Our optimisations resulted in significant time and space savings for instances with up to 45 qubits when compared to previous benchmarks using the static qudit layout with full state-vector contraction. Through the use of supercomputing resources, we were able to simulate a 60-qubit instance of Shor's algorithm with high r, rendering it infeasible via the static approach. In terms of the number of qubits, this represents one of the largest simulations of a nontrivial quantum circuit ever performed.