How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits

We significantly reduce the cost of factoring integers and computing discrete logarithms over finite fields on a quantum computer by combining techniques from Griffiths-Niu 1996, Zalka 2006, Fowler 2012, Eker{\aa}-H{\aa}stad 2017, Eker{\aa} 2017, Eker{\aa} 2018, Gidney-Fowler 2019, Gidney 2019. We estimate the approximate cost of our construction using plausible physical assumptions for large-scale superconducting qubit platforms: a planar grid of qubits with nearest-neighbor connectivity, a characteristic physical gate error rate of $10^{-3}$, a surface code cycle time of 1 microsecond, and a reaction time of 10 micro-seconds. We account for factors that are normally ignored such as noise, the need to make repeated attempts, and the spacetime layout of the computation. When factoring 2048 bit RSA integers, our construction's spacetime volume is a hundredfold less than comparable estimates from earlier works (Fowler et al. 2012, Gheorghiu et al. 2019). In the abstract circuit model (which ignores overheads from distillation, routing, and error correction) our construction uses $3 n + 0.002 n \lg n$ logical qubits, $0.3 n^3 + 0.0005 n^3 \lg n$ Toffolis, and $500 n^2 + n^2 \lg n$ measurement depth to factor $n$-bit RSA integers. We quantify the cryptographic implications of our work, both for RSA and for schemes based on the DLP in finite fields.


INTRODUCTION
Peter Shor's introduction in 1994 of polynomial time quantum algorithms for factoring integers and computing discrete logarithms [1,2] was a historic milestone that greatly increased interest in quantum computing. Shor's algorithms were the first quantum algorithms that achieved an exponential speedup over classical algorithms, applied to problems outside the field of quantum mechanics, and had obvious applications. In particular, Shor's algorithms may be used to break the RSA cryptosystem [3] based on the hardness of factoring integers that are the product of two similarly-sized primes (hereafter "RSA integers"), and cryptosystems based on the discrete logarithm problem (DLP), such as the Diffie-Hellman key agreement protocol [4] and the Digital Signature Algorithm [5].
The most expensive operation performed by Shor's factoring algorithm is a modular exponentiation. Modern classical computers can perform modular exponentiations on numbers with thousands of bits in under a second. These two facts may at first glance appear to suggest that factoring a thousand bit number with Shor's algorithm should only take seconds, but unfortunately (or perhaps fortunately), that is not the case. The modular exponentiation in Shor's algorithm is performed over a superposition of exponents, meaning a quantum computer is required, and quantum hardware is expected to be many orders of magnitude noisier than classical hardware [6][7][8]. This noise necessitates the use of error correction, which introduces overheads that ultimately make performing reliable arithmetic on a quantum computer many orders of magnitude more expensive than on classical computers [9,10].
Although Shor's algorithms run in polynomial time, the constant factors hidden by the asymptotic notation are substantial. These constant factors must be overcome, by heavy optimization at all levels, in order to make the algorithms practical.

A. Our contributions and a summary of our results
In this work, we combine several novel and existing optimizations to reduce the cost of implementing Shor's algorithms. The main hurdle to overcome is to implement one or more modular exponentiations efficiently, as these exponentiations dominate the overall cost of Shor's algorithms.
We use the standard square-and-multiply approach to reduce the exponentiations into a sequence of modular multiplications. We apply optimizations to reduce the number of multiplications and the cost of each multiplication.
FIG. 1. Log-log plot of estimated space and expected-time costs, using our parallel construction, for various problems and problem sizes. See Section 3 for additional details. The jumps in space around n = 32786 occur as the algorithm exceeds the error budget of the CCZ factory from [11] and switches to the T factory from [12]. Generated by ancillary file "estimate costs.py".
The cost of each multiplication is reduced by combining several optimizations. We use Zalka's coset representation of modular integers [25], which allows the use of cheaper non-modular arithmetic circuits to perform modular additions. We use oblivious carry runways [26] to split registers into independent pieces that can be worked on in parallel when performing additions. We bound the approximation error introduced by using oblivious carry runways and the coset representation of modular integers by analyzing them as approximate encoded permutations [26]. We use windowed arithmetic (again) to fuse multiple additions into individual lookup additions [20]. We use a layout of the core lookup addition operation where carry propagation is limited by the reaction time of the classical control system, and where the lookups are nearly reaction limited [27]. Finally, we optimize the overall computation by trying many possible parameter sets (e.g. window sizes and code distances) for each problem size and selecting the best parameter set.
We estimate the approximate cost of our construction, both in the abstract circuit model, and in terms of its runtime and physical qubit usage in an error corrected implementation under plausible physical assumptions for large-scale superconducting qubit platforms with nearest neighbor connectivity (see Figure 1). We provide concrete cost estimates for several cryptographically relevant problems, such as the RSA integer factoring problem, and various parameterizations of the DLP in finite fields. These cost estimates may be used to inform decisions on when to mandate migration from currently deployed vulnerable cryptosystems to post-quantum secure systems or hybrid systems.
Compared to previous works, we reduce the Toffoli count when factoring RSA integers by over 10x (see Table I). To only compare the Toffoli counts as in Table I may prove misleading, however, as it ignores the cost of routing, the benefits of parallelization, etc. Ideally, we would like to compare our runtime and physical qubit usage to previous works in the literature. However, this is only possible when such estimates are actually reported and use physical assumptions similar to our own. The number of works for which this requirement is met is limited.
The works by Fowler et al. [9] and Gheorgiu et al. [19] stand out in that they use the same basic cost model as we use in this paper, enabling fair comparisons to be made. We improve on their estimates by over 100x (see Table II) when accounting for slight remaining differences in the cost model. We also improve upon the cost estimates given by O'Gorman et al. [18], even if a proper comparison is not possible as their physical assumptions are different from ours (they assume slower operations where all qubits are interconnected, which is typical of ion trap architectures, whilst we assume faster operations with nearest neighbor connectivity, which is typical of superconducting architectures).
As most previous works focus either on factoring, or on the DLP in elliptic curve groups, it is hard to find references against which to compare our cost estimates for solving the DLP in multiplicative groups of finite fields. In general, the improvements we achieve for factoring RSA integers are comparable to the improvements we achieve for solving the general DLP in finite fields. As may be seen in Figure 1, the choice of parameterization has a significant impact on the costs. For the short DLP, and the DLP in Schnorr groups, we achieve significant improvements. These are primarily due to our use of derivatives of Shor's algorithm that are optimized for these parameterizations.

B. Notation and conventions
Throughout this paper, we refer to the modulus as N . The modulus is the composite integer to be factored, or the prime characteristic of the finite field when computing discrete logarithms. The number of bits in N is n = lg N where lg(x) = log 2 (x). The number of modular multiplication operations to perform (i.e. the combined exponent length in the modular exponentiations) is denoted n e . Our construction has a few adjustable parameters, which we refer to as c exp (the exponent window length), c mul (the multiplication window length), c sep (the runway separation), and c pad (the padding length used in approximate representations).
In the examples and figures, we consider moduli of length n = 2048 bits when n needs to be explicitly specified. This is because n = 2048 bits is the default modulus length in several widely used software programs [28][29][30]. Our optimizations are not specific to this choice of n. Section 3 provides cost estimates for a range of cryptographically relevant modulus lengths n.
We often quote costs as a function of both the number of exponent qubits n e and the problem size n. We do this because the relationship between n e and n changes from problem to problem, and optimizations that improve n e are orthogonal to optimizations that improve the cost of individual operations.
C. On the structure of this paper The remainder of this paper is organized as follows. In Section 2, we describe our construction and the optimizations it uses, and show how to estimate its costs. We then proceed in Section 3 to describe how existing widely deployed cryptosystems are impacted by our work. In Section 4, we present several ideas and possible optimizations that we believe are worth exploring in the future. Finally, we summarize our contributions and their implications in Section 5.

A. Quantum algorithms
In Shor's original algorithm [1], composite integers N that are not pure powers are factored by computing the order r of a randomly selected element g ∈ Z * N . Specifically, period finding is performed against the function f (e) = g e . This involves computing a modular exponentiation with n e = 2n qubits in the exponent e. If r is even and g r/2 ≡ −1, this yields non-trivial factors of N . To see why, lift g to Z. As g r − 1 = (g r/2 − 1)(g r/2 + 1) ≡ 0 (mod N ) it suffices to compute gcd((g r/2 ± 1) mod N, N ) to factor N . The order r is even with probability at least 1/2.
In [21][22][23], Ekerå and Håstad explain how to factor RSA integers N = pq in a different way; namely by computing a short discrete logarithm. This algorithm proceeds as follows: First y = g N +1 is computed classically, where as before g is randomly selected from Z * N and of unknown order r. Then the discrete logarithm d = log g y ≡ p + q (mod r) is computed quantumly. To see why d ≡ p + q (mod r), note that Z * N has order φ(N ) = (p − 1)(q − 1) by Euler's totient theorem, so d ≡ pq + 1 ≡ p + q (mod r) as r divides φ(N ) and pq + 1 ≡ p + q (mod φ(N )), with equality if r > p + q.
For large random RSA integers, the order r > p+q with overwhelming probability. Hence, we may assume d = p+q. By using that N = pq and d = p + q, where N and d are both known, it is trivial to deterministically recover the factors p and q as the roots of the quadratic equation p 2 − dp + N = 0.
The quantum part of Ekerå and Håstad's algorithm is similar to the quantum part of Shor's algorithm, except for the following important differences: there are two exponents e 1 , e 2 , of lengths 2m and m qubits, respectively, for m a positive integer such that p + q < 2 m . Period finding is performed against the function f (e 1 , e 2 ) = g e1 y −e2 rather than the function f (e) = g e . The total exponent length is hence n e = 3m = 1.5n + O(1) qubits, compared to 2n qubits in Shor's algorithm. It is this reduction in the exponent length that translates into a reduction in the overall number of multiplications that need to be performed on the quantum computer.
The two exponent registers are initialized to uniform superpositions of all 2 m and 2 2m values, respectively, and two quantum Fourier transforms are applied independently to the two registers. This implies that standard optimization techniques, such as the square-and-multiply technique, the semi-classical Fourier transform of Griffiths and Niu [31], recycling of qubits in the exponent registers [32], and so forth, are directly applicable.
The classical post-processing algorithm used to extract the logarithm from the observed frequencies is by necessity different from Shor's. It uses lattice-based techniques, and critically does not require r to be known. Ekerå shows in [23] that the post-processing algorithm has probability above 99% of recovering d. As the factors p and q are recovered deterministically from d and N , this implies that it in general suffices to run the quantum algorithm once.
In summary, Ekerå and Håstad's algorithm is similar to Shor's algorithm from an implementation perspective: a sequence of multiplications is computed, where one operand is in a quantum register and one operand is a classical constant. The multiplications are interleaved with the Fourier transform. It is of no significance to our construction what sequence of classical constants are used, or if one or more independent Fourier transforms need to be applied. This fact, coupled with the fact that the algorithms of Ekerå and Håstad may be used to solve other relevant problems where n e is a different function of n (see Section 3), lead us to describe our construction in terms of n and n e .
Note that the above description of the algorithm has been somewhat simplified compared to the algorithm described in [22,23] in the interest of improved legibility. Furthermore, there are technical conditions that need to be respected for the analysis in [23] to be applicable. For the full details, see section A.2.1 of [23].

B. Reference implementation
To avoid overloading the reader, we will describe our factoring construction by starting from a simple reference implementation of the quantum part of Shor's original algorithm and then apply optimizations one by one.
The reference implementation works the way most implementations of Shor's algorithm do, by decomposing exponentiation into iterative controlled modular multiplication [13-16, 25, 33]. A register x is initialized into the |1 state, then a controlled modular multiplication of the classical constant g 2 j (mod N ) into x is performed, controlled by the qubit e j from the exponent e, for each integer j from 0 to n e − 1. After the multiplications are done, x is storing g e (mod N ) and measuring x completes the hard part of Shor's algorithm.
Controlled modular multiplication is still not a primitive operation, so it must also be decomposed. It can be performed by introducing a work register y initialized to |0 and then performing the following two controlled scaled additions: y += x · k (mod N ) then x += y · (−k −1 ) (mod N ). After these two operations, y is storing the result and x has been cleared to the |0 state. Swapping the two registers, so that the result is in x, completes the multiplication.
Performing a controlled scaled addition with classical scale factor k can be done with a series of n controlled modular additions. For each qubit q j in the input register, you add k · 2 j (mod N ) into the target, controlled by q j . Controlled modular addition in turn is performed via a series of non-modular additions and comparisons. For example, [13] uses five additions for this purpose. Finally, using the Cuccaro adder [34], uncontrolled non-modular additions can be performed with no additional workspace using 2n Toffolis.
Combining the numbers in the preceding two paragraphs implies a Toffoli count of n e · 2n · 5 · 2n = 20n e n 2 for the reference implementation.
C. The coset representation of modular integers Following Zalka [25], the first improvement we make over the reference implementation is to switch from the usual representation of integers to the coset representation of modular integers. The usual representation of an integer k in a quantum computer is the computational basis state |k . The coset representation is different: it uses a periodic superposition with period N and offset k. In ideal conditions, the integer k (mod N ) is represented by the state √ 2 −c pad 2 c pad j=0 |jN + k where c pad is the number of additional padding qubits placed at the end of the register. The key idea is that the periodicity of the superposition causes a non-modular adder to perform approximate modular addition on this representation, and the error of the approximation can be exponentially suppressed by adding more padding to the register.
As we will discuss later, the amount of padding required is logarithmic in the number of operations. This small cost enables the large benefit of using non-modular adders instead of modular adders. It is possible to perform a controlled non-modular addition in 4n Toffolis [34,35], significantly cheaper than the 10n we assumed for a controlled modular adder. Therefore switching to the coset representation reduces the leading asymptotic term of the Toffoli count of the reference implementation from 20n e n 2 to 8n e n 2 .

D. Windowed arithmetic
The next optimization we use is windowed arithmetic. Specifically, we follow the modular exponentiation construction from [20] and use windowing at two levels.
First, at the level of a multiplication, we window over the controlled additions. We fuse groups of controlled additions into single lookup additions. A lookup addition is an addition where the value to add into a register is the result of a table lookup. Small windows of the qubits that would have been used as controls for the additions are instead treated as addresses into classically precomputed tables of values to unconditionally add into the target.
Second, at the level of the exponentiation, we window over the controlled multiplications. This is done by including exponent qubits in the addresses of all lookups being performed within the multiplications. We refer the reader to [20] for the exact details of this nested windowed arithmetic construction.
The cost of windowed arithmetic depends on the size of the windows. Let c exp be the size of the window over exponent qubits that are being used to control multiplications. Let c mul be the size of the window over factor qubits being used to control additions. Using these parameters, the n e controlled multiplications we needed to perform become n e /c exp uncontrolled multiplications while the 2n controlled additions we needed to perform within each multiplication become 2n/c mul uncontrolled additions. The tradeoff is that each addition is now accompanied by a lookup with an address of size c exp + c mul .
Using Cuccaro et al's adder [34], each n-bit addition has a Toffoli count and measurement depth of 2n. Using Babbush et al's QROM read [36], each table lookup has a Toffoli count and measurement depth of 2 c mul +cexp . The cost of uncomputing the lookup is negligible due to using measurement based uncomputation [37]. The overhead due to the logarithmic padding introduced by the coset representation of modular integers is also negligible (for now).
Thus, by using windowed arithmetic, the leading term of the Toffoli count of the reference implementation has been reduced to 2nen c mul cexp (2n + 2 c mul +cexp ). If we set c mul = c exp = 1 2 lg n then we achieve a Toffoli count with leading term 24nen 2 lg 2 n . In terms of space, we are using 3n + O(lg n) logical qubits. The qubits are distributed as follows: n + O(lg n) for the accumulation register storing the product, n + O(lg n) for a workspace register during the multiplication, n + O(lg n) for the lookup output register, and O(lg n) qubits to hold the part of the exponent needed for the current stage of the semi-classical Fourier transform [31].

E. Oblivious carry runways
The last major algorithmic optimization we apply is the use of oblivious carry runways [26]. The basic problem addressed by runways is that, normally, a carry signal generated at the bottom of a register must propagate all the way to the top of the register. This process can be short-circuited by instead terminating the carry propagation into an appropriately constructed runway. Runways allow large additions to be performed piecewise, with each piece being worked on in parallel, by terminating carries into appropriately placed runways at the end of each piece.
As with the coset representation of modular integers, circuits using oblivious carry runways approximate the original circuit instead of perfectly reproducing it. But, as we will discuss later, increasing the runway length c pad exponentially suppresses the approximation error.
The major benefit of oblivious carry runways, compared to previous techniques for reducing the depth of addition such as Draper et al's logarithmic depth carry-lookahead adder [38], is that oblivious carry runways can be introduced gradually without incurring large overheads. The Toffoli count and workspace overheads are linear in the number of pieces n/c sep (where c sep is the runway spacing) but only logarithmic in n. For example, if you place a single runway at the midpoint of a 2048-qubit register, then the number of qubits and the number of Toffolis needed to perform an addition will increase by a couple percent but the depth of the additions is nearly halved.

F. Interactions between optimizations
A major benefit of the set of optimizations we have chosen to use in this paper is that they complement each other. They make orthogonal improvements that compound or reinforce, instead of conflicting.
For example, when using the coset representation of modular integers, it is important that additions only use offsets less than N . Larger offsets cause larger approximation error. However, because we are using windowed arithmetic, every addition we perform has an offset that is being returned by a table lookup. Since the entries in the tables are classically precomputed, we can classically ensure all offsets are canonicalized into the [0, N ) range.
There are two cases where the optimizations do not mesh perfectly. First, using oblivious runways reduces the depth of addition but not the depth of lookups. This changes their relative costs, which affects the optimal window sizes to use in windowed arithmetic. When addition is suddenly four times faster than it used to be, the optimal window sizes c exp and c mul decrease by 1 (approximately).
Second, when iterating over input qubits during a multiply-add, it is necessary to also iterate over runway qubits and padding qubits. This increases the number of lookup additions that need to be performed in order to complete a windowed multiplication. This issue is partially solved by temporarily adding the runways back into the main register before performing a multiply-add where that register is used as one of the factors (instead of the target). This temporarily reduces each runway to a single carry qubit (see Figure 3). We could reduce the runways to zero qubits, but this would require propagating the carry qubits all the way across the register, so we do not.
3. How to temporarily reduce oblivious carry runways to a single qubit, in preparation for being used as the input factor in a multiply-add operation that must iterate over all qubits in the register. The multiply-add should occur during the ". . . " section in the middle.

G. Abstract circuit model cost estimate
We have now described all the details necessary to estimate the cost of our implementation in the abstract circuit model. The cost depends on the two parameters specified by the problem (the input size n and the number of exponent qubits n e ) and also the four controllable parameters we have discussed (the exponentiation window size c exp , the multiplication window size c mul , the runway separation c sep , and the padding/runway length c pad ).
Although we do consider alternate values when producing tables and figures, in general we have found that the settings c exp = c mul = 5, c sep = 1024, c pad = 2 lg n + lg n e + 10 ≈ 3 lg n + 10 work well. In this overview we will focus on these simple, though slightly suboptimal, values.
Recall that an exponentiation may be reduced to a sequence of n e multiplications, which we process in groups of size c exp . For each multiplication, we do two multiply-adds. Each multiply-add will use several small additions to temporarily reduce the runway registers to single qubits. The multiply-add then needs to perform a sequence of additions controlled by the n main register qubits, the O(lg n) coset padding qubits, and the n/c sep reduced runway qubits. Using windowed arithmetic, these additions are done in groups of size c mul with each group handled by one lookup addition. Therefore the total number of lookup additions we need to perform is The lookup additions make up essentially the entire cost of the algorithm, i.e. the total Toffoli count is approximately equal to the Toffoli count of a lookup addition times the lookup addition count. This works similarly for the measurement depth. Ignoring the negligible cost of uncomputing the lookup, the Toffoli count of a lookup addition is 2n + nc pad /c sep + 2 cexp+c pad and the measurement depth is 2n + 2c pad + 2 cexp+c mul . Therefore ToffoliCount(n, n e ) ≈ LookupAdditionCount(n, n e ) · 2n + c pad n c sep + 2 cexp+c pad ≈ 0.2n e n 2 + 0.0003n e n 2 lg n MeasurementDepth(n, n e ) ≈ LookupAdditionCount(n, n e ) · 2c sep + 2c pad + 2 cexp+c mul ≈ 300n e n + 0.5n e n lg n These approximate upper bounds, with n e set to 1.5n, are the formulae we report in the abstract and in Table I for the cost of factoring RSA integers.

H. Approximation error
Because we are using oblivious carry runways and the coset representation of modular integers, the computations we perform are not exact. Rather, they are approximations. This is not a problem in practice, as we may bound the approximation error using concepts and results from [26].
The oblivious runways and the coset representation of modular integers are both examples of "approximate encoded permutations" which have a "deviation". When using a padding/runway length of c pad , and a runway separation of c sep , the deviation per addition is at most n csep2 c pad . Deviation is subadditive under composition, so the deviation of the entire modular exponentiation process is at most the number of lookup additions times the deviation per addition.
We can use this to check how much deviation results from using c pad = 2 lg n + lg n e + 10, which we so far have merely claimed would be sufficient: When an approximate encoded permutation has a deviation of , the trace distance between its output and the ideal output is at most 2 √ . Therefore the approximation error (i.e. the trace distance due to approximations), using the parameter assignments we have described, is roughly 0.1%. This is significantly lower than other error rates we will calculate.

I. Spacetime layout
Our implementation of Shor's algorithm uses a layout that is derived from the (nearly) reaction limited layouts presented in [27]. In these layouts, a lookup addition is performed as follows (see Figure 4).
All data qubits are arranged into rows. The rows of qubits making up the target register of the addition are spread out into row pairs with gaps five rows wide in between. In these gaps there will be two rows for the lookup output register, and three rows for access hallways. This arrangement allows the lookup computation two ways to access each output row, doubling the speed at which it can run, while simultaneously ensuring the target register and the lookup output register are interleaved in a convenient fashion.
After the lookup is completed, the rows of the lookup output register and the addition target register are packed tightly against the top (or equivalently bottom) of the available area. An operating area is prepared below them, and the data is gradually streamed through the operating area while the operating area gradually shifts upward. This performs the MAJ sweep of Cuccaro's ripple carry adder [34]. Everything then begins moving in the opposite direction in order to perform the UMA sweep on the return stroke.
The lookup register is quickly uncomputed using measurement based uncomputation [37], and the system is returned to a state where another lookup addition can be performed.
In order to perform piecewise additions separated by oblivious carry runways, we simply partition the computation horizontally as shown in Figure 5 and Figure 7. The lookup before each addition prepares registers across the pieces, as shown in Figure 6, but the additions themselves stick to their own piece.
The width of each piece is determined by the number of CCZ factories needed to run at the reaction limited rate. This number is 14, assuming a code distance of 27 and using the CCZ factories from [11,27]. Also, assuming a level 1 code distance of 17, the footprint of the factory is 15x8 [27]. The factories are laid out into 2 rows of 7, with single logical qubit gaps to allow space for data qubits to be routed in and out. The total width of a piece is 15 · 7 + 7 = 113 logical qubits.
The height of the operating area is 33 rows of logical qubits (2 · 8 for the CCZ factories, 3 for the ripple-carry operating area, 6 for AutoCCZ fixup boxes, and 8 for routing data qubits). The c pad + c sep qubits from each register add another 30 rows (approximately). So, overall, when working on a problem of size n, the computation covers a 113w × 63 grid of logical qubits where w = n/c sep = n/1024.  [27]. During lookup, the target register is spread out to make room for the temporary register that will hold the lookup's output. During addition the target register and lookup output register are squeezed through a moving operating area that sweeps up then down, applying the MAJ and UMA sweeps of Cuccaro's adder. Uncomputing the lookup is done with measurement based uncomputation [37], which is overlapped slightly with the UMA sweep (this is why the yellow rows disappear early).

J. Runtime
Because our implementation is dominated almost entirely by the cost of performing lookup additions, its duration is approximately equal to the number of lookup additions times the duration of a single lookup addition.
During the lookup phase of a lookup addition, the computation is code depth limited. Assuming a code depth of d = 27 and a surface code cycle time of 1 microsecond, it takes 1µs · d/2 · 2 cexp+c mul ≈ 14 milliseconds to perform the lookup using double-access hallways. During the addition phase of a lookup addition, the computation is reaction limited. Given a reaction time of 10 microseconds, it takes 2(c sep + c pad ) · 10µs ≈ 22 milliseconds to perform the addition. The remaining bits of a lookup addition, such as uncomputing the lookup and rearranging the rows, take approximately 1 millisecond.
Thus one lookup addition takes approximately 37 milliseconds. Given this fact, i.e. that we perform quantum lookup additions slower than most video games render entire frames, we can approximate the total duration: TotalRuntime(n, n e ) ≈ LookupAdditionCount(n, n e ) · 37µs ≈ 4n e nµs Though we caution the reader that this estimate ignores the fact that, at larger problem sizes, lookups become FIG. 5. Data layout during a parallel addition, as generalized from [27]. To scale, assuming a 2048 bit number is being factored. The left and right halfs of the computer run completely independently and in parallel. The factories (red boxes and pink boxes) are feeding AutoCCZ states into the blue area, which is rippling the carry bit back and forth as the offset and target register data (green and yellow rows) is routed through gaps between the factories to the other side.
FIG. 6. Data layout during a table lookup, as generalized from [27]. To scale, assuming a 2048 bit number is being factored. The factories (red boxes and pink boxes) are feeding AutoCCZ states into the dark gray region, which is performing the unary iteration part of a table lookup computation. There are enough factories, and enough work space, to run the lookup at double speed, made possible by the fact that every qubit in the lookup output register (yellow) is adjacent to two access hallways. The target register (green rows) and factor register (blue rows) are idle, except that a few qubits from the factor register are being used as address bits in the table lookup. slower due to the minimum code distance increasing. This estimate implies that factoring a 2048 bit integer will take approximately 7 hours, assuming only one run of the quantum part of the algorithm is needed. Note that in our reported numerical estimates we achieve lower per-run numbers by using more precise intermediate values and by more carefully selecting parameters.

K. Distillation error
In order to perform the 0.2n e n 2 + 0.0003n e n 2 lg n Toffoli gates our implementation uses to factor an n = 2048 bit RSA integer, we need to distill approximately 3 billion CCZ states. According to the spreadsheet included in [11], using a level 1 code distance 17 and a level 2 code distance of 27, this corresponds to a total distillation error of 6.4%. This quantity is computed by considering the initial error rate of injecting physical T states, topological error within the factory, and the likelihood of the various stages of distillation producing a false negative.

L. Topological error
Now that we know the number of logical qubits, and the duration of the algorithm, we can approximate the probability of a topological error occurring within the surface code during the algorithm. This will allow us to verify our initial assumption that a code distance of 27 is sufficient in the case where n = 2048. Larger computations will require larger code distances.
Assuming a physical gate error rate of 10 −3 , the probability of error in a logical qubit of distance d, per surface code cycle time, is approximately 10 − d/2+1 (see equation 11 from [39]). When factoring an n = 2048 bit RSA integer we are using a board of 226 · 63 logical qubits. Approximately 25% of these qubits are being used for distillation, which we already accounted for, and so we do not count them in this calculation. The remaining qubits are kept through 4n e n ≈ 50 billion surface code cycles, implying the chance of topological error is approximately 10 − 27/2+1 · 226 · 63 · 0.75 · 50 · 10 9 ≈ 50%. This is a large error rate. Using a code distance of 27 is pushing the limits of feasibility. We would need to repeat the computation twice, on average, doubling the expected runtime. If our goal is to minimize the expected spacetime volume of the computation, perhaps we should increase the code distance to 29. Doing so would increase the physical qubit count by 15%, but the error rate would drop by approximately a factor of 10 and so the expected number of retries would be much closer to 1. Ultimately the choice comes down to one's preferences for using more space versus taking more time.
Note that, in the more precise estimates computed by ancillary file "estimate costs.py", the total error rate at distance 27 is estimated to be closer to 30%. This lowers the gains from increasing the code distance.

M. Physical qubit count
In lattice surgery, a logical qubit covers 2(d + 1) 2 physical qubits where d is the code distance (see Figure 8). Assuming we push the limits and use a code distance of 27 at n = 2048, each logical qubit will cover 1568 physical qubits. Therefore the total physical qubit count is the number of logical qubits 226 · 63 times 1568; approximately 23 million qubits.
Attentive readers will note that this number disagrees with the estimate in the title of the paper. This is because, throughout this section, we have been sloppily rounding quantities up and choosing fixed parameters in order to keep things simple. The estimate in the title is produced by the ancillary file "estimate costs.py", which does not make these simplifications. (In particular, "estimate costs.py" realizes that the level 1 code distance used during distillation can be reduced from 17 to 15 when n = 2048 and this adjusts the layout in several fortuitous ways.)

N. Summary
We now review the basic flow of our implementation of Shor's algorithm, including some details we would have otherwise left unstated.
The quantum part of the algorithm starts by preparing two registers, storing zero and one respectively, into the coset representation of modular integers with oblivious carry runways. The cost of this preparation is negligible compared to the cost of the rest of the algorithm. The register storing one is the accumulation register that will ultimately store the modular exponentiation, while the zero register is a workspace register.
FIG. 8. Lattice surgery qubits, in the rotated surface code, with code distances d of (from left to right) 3, 5, 7, 9, 11, and 13. Each green (blue) square correspond to physical X (Z) measurement qubit. Each white square corresponds to a physical data qubit. Blacked out regions correspond to disabled qubits. The two types of boundaries are highlighted with blue and green borders. There are 2(d + 1) 2 physical qubits per logical qubit (including the dividers). With a physical error rate of 10 −3 , each increase of the code distance by 2 suppresses logical errors per surface code cycle by a factor of 10.
FIG. 9. Physical view and logical view of lattice surgery qubits during a computation. The vertical columns (right) are logical qubits, corresponding to patches in the physical layout (left). The horizontal bar/region corresponds to a stabilizer measurement of the product of the two connected qubits (times whatever else is connected) [12]. The connections shown in the 3d diagram correspond to missing boundaries in the physical layout, and indicate that physical stabilizer measurements are crossing between the two regions instead of staying separated.
The bulk of the execution time is then spent performing the modular exponentiation(s). Exponent qubits are iteratively introduced in groups of size c exp . We are performing a semi-classical Fourier transform [31], so the exponent qubits must be phased according to measurements on previous exponent qubits. Performing the phasing operations requires using T factories instead of CCZ factories, but there is so little phasing work compared to CCZ work that we ignore this cost in our estimates. The phased exponent qubits are used as address qubits during a windowed modular multiplication, then measured in the frequency basis and discarded. See Figure 2. Within this step, almost all of the computation time is spent on lookup additions.
After all of the modular multiplications have been completed, the accumulation register is storing the result of the modular exponentiation. Note that this register still has oblivious carry runways and is still using the coset representation of modular integers. We could decode the register before measuring it but, because the decoding operations are all classical, it is more efficient to simply measure the entire register and perform the decoding classically. As with the cost of initialization of the registers, this cost is completely negligible.
This completes the quantum part of the algorithm. The exponent qubit measurement results, and the decoded accumulator measurement result, are fed into the classical post-processing code which uses them to derive the solution. If this fails, which can occur e.g. due to a distillation error during quantum execution, the algorithm is restarted.
Note that there are many minor implementation details that we have not discussed. For example, during a windowed multiplication, one of the registers is essentially sitting idle except that its qubits are being used (c mul at a time) as address qubits for the lookup additions. We have not discussed how these qubits are routed into and out of the lookup computation as they are needed. It is simply clear that there is enough leeway, in the packing of the computation into spacetime, for this routing task to be feasible and contribute negligible cost.

CRYPTOGRAPHIC IMPLICATIONS OF OUR CONSTRUCTION
In this section, we consider how the security of RSA, and of cryptographic schemes based on the intractability of the DLP in finite fields, is affected when the optimized construction described in the previously section is used to implement the currently most efficient derivatives of Shor's algorithms.
Our goal throughout this section is to minimize the overall expected spacetime volume, including expected repetitions, when factoring RSA integers or computing discrete logarithms. That is to say, we minimize the spacetime volume in each run of the quantum algorithm times the expected number of runs required.
To derive an upper bound on the overall probability of errors occurring, we separately estimate the probabilities of topological errors occurring due to a failure of the surface code to correct errors, of approximation errors occurring due to using oblivious carry runways and the coset representation of modular integers, of magic state distillation errors, and of the classical post-processing algorithm failing to recover the solution from a correct run of the quantum algorithm. We combine these error probabilities, assuming independence, to derive an upper bound on the overall probability of errors occurring.
We chose to optimize the quantity s 1.2 ·t/(1− ), which we refer to as the "skewed expected spacetime volume". The t/(1 − ) factor is the expected runtime, and the s 1.2 factor grows slightly faster than the space usage. We skew the space usage when optimizing because we have a slight preference for decreasing space usage over decreasing runtime. We consider all combination of parameters (d 1 , d 2 , δ off , c mul , c exp , c sep ), choose the set that minimizes the skewed expected spacetime volume, and report the corresponding estimated costs.

B. Implications for RSA
Today, the arguably most commonly used modulus size for RSA is n = 2048 bits. Larger moduli are however in widespread use and smaller moduli have been used historically. The best current academic record is the factorization of a 768 bit RSA modulus in 2009, see [40] for details. In Table III and Figure 1, we provide estimates for the resource and time requirements for attacking RSA for various cryptographically relevant modulus lengths n. The estimates are for factoring RSA integers with Ekerå-Håstads's algorithm [22,23] that computes a short discrete logarithm, see the appendix to [23] for full technical details. As is explained in [23], a single correct run of this quantum algorithm suffices for the RSA integer to be factored with at least 99% success probability in the classical post-processing. C. Implications for finite field discrete logarithms Given a generator g of an order r subgroup to Z * N , where the modulus N is prime, and an element x = g d , the finite field discrete logarithm problem is to compute d = log g x. In what follows, we assume r to be prime. If r is composite, the discrete logarithm problem may be decomposed into problems in subgroups of orders dividing r, as shown by Pohlig and Hellman [41]. For this reason, prime order subgroups are used in cryptographic applications.
As Z * N has order N − 1, it must be that r divides N − 1, so N = 2rk + 1 for some integer k ≥ 1. The asymptotically best currently known classical algorithms for computing discrete logarithms in subgroups of this form are generic cycle-finding algorithms, such as Pollard's ρ-and λ-algorithms [42], that run in time O( √ r) and O( √ d), respectively, and the general number field sieve (GNFS), that runs in subexponential time in the bit length n of N .
The idea of factoring via algebraic number fields was originally introduced by Pollard [43] for integers on special forms. It was over time generalized, by amongst others Buhler et al. [44], Lenstra et al. [45] and Pollard [46], to factor general integers, and modified by amongst others Gordon [47] and Schirokauer [48,49] to compute discrete logarithms in finite fields. Much research effort has since been devoted to optimizing the GNFS in various respects. For a proper in-depth historical account of the development of the GNFS, see [50,51].
Let z be number of bits of security provided by the modulus with respect to classical attacks using the GNFS. Let n d and n r be the lengths in bits of d and r, respectively. It then suffices to pick n d , n r ≥ 2z to achieve z bits of classical security, as the generic cycle-finding algorithms are then not more efficient than the GNFS.
When instantiating schemes based on the intractability of the finite field discrete logarithm problem, one may hence choose between using a Schnorr group, for which n d = n r = 2z, or a safe-prime group, for which n r = n − 1. In the latter case, one may in turn choose between using a short exponent, such that n d = 2z, or a full length exponent, such that n d = n r = n − 1. All three parameterization options provide z bits of classical security.

What finite field groups are used in practice?
In practice, Schnorr groups or safe-prime groups with short exponents are often preferred over safe prime groups with full length exponents, as the comparatively short exponents yield considerable performance improvements.
The discrete logarithm problem in Schnorr groups is of standard form, unlike the short discrete logarithm problem in safe-prime groups, and Schnorr groups are faster to generate than safe-prime groups. A downside to using Schnorr groups is that group elements received from untrusted parties must be tested for membership of the order r subgroup. This typically involves exponentiating the element to the power of r, which is computationally expensive. Safe-prime groups are more flexible than Schnorr groups, in that the exponent length may be adaptively selected depending on the performance requirements. The reader is referred to [52] for a more in-depth comparison and historical recommendations. In more recent years, the use of safe-prime groups would appear to have become increasingly prevalent. Some cryptographic schemes, such as the Diffie-Hellman key agreement protocol, are agnostic to the choice of group, whereas other schemes, such as DSA, use Schnorr groups for efficiency reasons.
The National Institute of Standards and Technology (NIST) in the United States standardizes the use of cryptology in unclassified applications within the federal government. Up until April of 2018, NIST recommended the use of randomly selected Schnorr groups with moduli of length 2048 bits for Diffie-Hellman key agreement. NIST changed this recommendation in the 3rd revision of SP800-56A [53], and are now advocating using a fixed set of safe-prime groups with moduli of length up to 8192 bits, with short or full length exponents. These groups were originally developed for TLS [54] and IKE [55] where, again, they are used either with short of full length exponents.

Complexity estimates
To estimates the resource and time requirements for computing discrete logarithms in finite fields for various modulus lengths n, and for the aforementioned parameterizations options, we need to decide on what model to use for estimating z as a function of n. Various models have been proposed over the years, see for instance [56,57]. For simplicity, we use the same model that NIST uses in SP 800-56A [53]. It is described on page 110 in FIPS 140-2 IG [58]. Note that NIST rounds z to the closest multiple of eight bits.
To compute short logarithms in safe-prime groups, the best option is to use Ekerå-Håstad's algorithm [22,23] that is specialized for this purpose. To compute general discrete logarithms in safe-prime or Schnorr groups, one option is to use Ekerå's algorithm [24]. As is explained in [23,24], a single correct run of these quantum algorithms suffices for the logarithm to be recovered with ≥ 99% success probability in the classical post-processing. These algorithms do not require the order of the group to be known. See Table IV and Figure 1 for complexity estimates.  [24] algorithms. This table was produced by the script in the ancillary file "estimate costs.py".
If the group order is known, a better option for computing general discrete logarithms in safe-prime groups and Schnorr groups when not making tradeoffs is to use Shor's original algorithm [1], modified to work in the order r subgroup rather than in the whole multiplicative group Z * N , and to start with a uniform superposition of all exponent values, as opposed to superpositions of r values. Note that the latter modification is necessary to enable the use of the semi-classical Fourier transform, qubit recycling and the windowing technique. The modified algorithm is described by Ekerå in [21] in the section on general logarithms, and in [59] where a heuristic analysis is also provided.
When using this modified version of Shor's algorithm to compute discrete logarithms, the heuristic analysis [59] shows that a single correct run suffices to compute the logarithm with ≥ 99% success probability, assuming each of the two exponent registers is padded with 5 bits, and assuming a small search is performed in the classical post-processing. This implies that Shor's algorithm outperforms Ekerå's algorithm, as Shor's algorithm performs only approximately 2n r group operations per run, compared to 3n r operations in Ekerå's algorithm, see Table V for complexity estimates. This is because Ekerå's algorithm does not require r to be known. In fact, it computes both d and r.
Note that for safe-prime groups, r = (N − 1)/2, so when N is known to the adversary then so is r. For Schnorr groups, it may be that r is unknown to the adversary, especially if the group is randomly selected. It may be hard to compute r classically, as it amounts to finding a n r = 2z bit prime factor of (N − 1)/2.  [1] modified as described in [21,59]. This table was produced by the script in the ancillary file "estimate costs.py".

D. Implications for elliptic curve discrete logarithms
Over the past decades cryptographic schemes based on the intractability of the DLP in finite fields and the RSA integer factoring problem have gradually been replaced by cryptography based on the intractability of the DLP in elliptic curve groups. This is reflected in standards issued by organization such as NIST.
Not all optimizations developed in this paper are directly applicable to arithmetic in elliptic curve groups. It is an interesting topic for future research to study to what extent the optimizations developed in this paper may be adapted to optimize such arithmetic operations (see Section 4 E). This paper should not be perceived to indicate that the RSA integer factoring problem and the DLP in finite fields is in itself less complex than the DLP in elliptic curve groups on quantum computers. The feasibility of optimizing the latter problem must first be properly studied.

E. On the importance of complexity estimates
It is important to estimate the complexity of attacking widely deployed asymmetric cryptographic schemes using future large-scale quantum computers. Such estimates enable informed decisions to be made on when to mandate migration from existing schemes to post-quantum secure schemes.
For cryptographic schemes that are used to protect confidentiality, such as encryption and key agreement schemes, a sufficiently long period must elapse inbetween the point in time when the scheme ceases to be used, and the point in time when the scheme is projected to become susceptible to practical attacks. This is necessary so as to ensure that the information that has been afforded protection with the scheme is no longer sensitive once the scheme becomes susceptible to practical attacks. This is because one must assume that encrypted information may be recorded and archived for decryption in the future. If the information you seek to protect is to remain confidential for 25 years, you must hence stop using asymmetric schemes such as RSA and Diffie-Hellman at least 25 years before quantum computers capable of breaking these schemes become available to the adversary. For cryptographic schemes that are used to protect authenticity, or for authentication, such as signature schemes, it suffices to migrate to post-quantum secure schemes before the schemes become susceptible to practical attacks. This is an important distinction. The process of transitioning to post-quantum secure schemes has already begun. However, no established or universally recognized standards are as of yet available. Early adopters may therefore wish to consider implementing schemes conjectured to be post-quantum secure alongside existing classically secure schemes, in such a fashion that both schemes must be broken for the combined hybrid scheme to be broken.

A. Investigate asymptotically efficient multiplication
The multiplication circuits that we are using have a Toffoli count that scales quadratically (up to polylog factors). There are multiplication circuits with asymptotically better Toffoli counts. For example, the Karatsuba algorithm [60] has a Toffoli count of O(n lg 3 ) and the Schönhage-Strassen algorithm [61] has a Toffoli count of O(n lg n lg lg n). However, there are difficulties when attempting to use these asymptotically efficient algorithms in the context of Shor's algorithm.
The first difficulty is that efficient multiplication algorithms are typically classical, implemented with non-reversible computation in mind. They need to be translated into a reversible form. This is not trivial. For example, a naive translation of Karatsuba multiplication will result in twice as many recursive calls at each level (due to the need to uncompute), and increase the asymptotic Toffoli count from O(n lg 3 ) to O(n lg 6 ). Attempting to fix this problem can result in the space complexity increasing [62], though it is possible to solve this problem [63].
The second difficulty is constant factors, both in workspace and in Toffoli count. Clever multiplication circuits have better Toffoli counts for sufficiently large n but, when we do back-of-the-envelope estimates, "sufficiently large" is beyond n = 2048. This difficulty is made worse if the multiplier is incompatible with optimizations that work on naive multipliers, such as windowed arithmetic and the coset representation of modular integers. Clever multiplication circuits also tend to use additional workspace, and it is necessary to contrast using the better multiplier against the opportunity cost of using the space for other purposes (such as distillation of magic states). For example, the first step of the Schönhage-Strassen algorithm is to pad the input up to double length, then split the padded register into O( √ n) pieces and pad each piece up to double length. The target register quadruples in size before even getting into the details of performing the number theoretic transform! This large increase in space means that the quantum variant of the Schönhage-Strassen algorithm is competing with the many alternative opportunities one has when given 6000 more logical qubits of workspace (at n = 2048). For example, that's enough space for fifty additional CCZ factories.
Can multipliers with asymptotically lower Toffolis counts help at practical sizes, such as n = 2048? We believe that they don't, but only a careful investigation can properly determine the answer to this question.

B. Optimize distillation
To produce our magic states, we use slightly-modified copies of the CCZ factory from [11] as explained in [27]. We have many ideas for improving on this approach.
First, the factory we are using is optimized for the case where it is working in isolation. But, generally speaking, error correcting codes get better when working over many objects instead of one object. It is likely that a factory using a block code to produce multiple CCZ states at once would perform better than the factory we used. For example, [64] presents a distillation protocol that produces good CCZ states ten at a time.
Second, since publishing [11], we have realized there are two obvious-in-hindsight techniques that could be used to reduce the volume of the factories in that paper. First, we assumed that topological errors that can occur within the factories were undetected, but actually many of them are heralded as distillation failures. By taking advantage of this heralding, it should be possible to reduce the code distance used in many parts of the factories. Second, when considering the set of S gates to apply to correct T gate teleportations performed by the factories, there are redundancies that we were not previously aware of. In particular, for each measured X stabilizer, one can toggle whether or not every qubit in that stabilizer has an S gate applied to it. This freedom makes it possible to e.g. apply dynamically chosen corrections while guaranteeing that the number of S gate fixups in the 15-to-1 T factory is at most 5 (instead of 15), or to ensure a particular qubit will never need an S gate fixup (removing some packing constraints).
Third, we believe it should be possible to use detection events produced by the surface code to estimate how likely it is that an error occurred, and that this information can be use to discard "risky factory runs" in a way that increases reliability. This would allow us to trade the increased reliability for a decreased code distance. As an extreme example, suppose that instead of attempting to correct errors during a factory run we simply discarded the run if there were any detection events where a local stabilizer flipped. Then the probability that a factory run would complete without being discarded would be approximately zero, but when a run did pass the chance of error would be amazingly low. We believe that by picking the right metric (e.g. number of detections or diameter of alternating tree during matching), then interpolating a rejection threshold between the extreme no-detections-anywhere rule and the implicit hope-all-errors-were-corrected rule that is used today, there will be a middle ground with lower expected volume per magic state. (Even more promisingly, this thresholded error estimation technique should work on any small state production task and almost all quantum computation can be reformulated as a series of small state production tasks.) Reducing the volume of distillation would improve the space and time estimates in this paper. But turning some combination of the above ideas into a concrete factory layout, with understood error behavior, is too large of a task for one paper. Of course, the problem of finding small factories is a problem that the field has been exploring for some time. In fact, in the week before we first released this paper, Daniel Litinsky made [65] available. He independently arrived at the idea of using distillation to herald errors within the factory, and provided rough numerics indicating this may reduce the volume of distillation by as much as a factor of 10. Therefore we leave optimizing distillation not as future work, but as ongoing work.

C. Optimize qubit encoding
Most of the logical qubits in our construction spend most of their time sitting still, waiting for other qubits to be processed. It should be possible to store these qubits in a more compact, but less computationally convenient, form. A simple example is that resting qubits don't need the "padding layer" between qubits shown in Figure 9, because this layer is only there to make surgery easier. Another simple example is that there appear to be ways to pack lattice surgery qubits that use less area while preserving the code distance (e.g. the "houses" shown in Figure 10).
One could also imagine encoding the resting qubits into a block code. The difficulty is in finding a block code that a) works with small groups of qubits, b) can be encoded and decoded fast enough and compactly enough to fit into the existing computation, and c) is sufficiently better than the surface code that the benefits (reduced code distance within the surface code) outweigh the costs (redundant additional qubits).
Finally, there are completely different ways of storing information in the surface code. For example, qubits stored using dislocations [66] could be denser than qubits stored using lattice surgery. However, dislocations also create runs of stabilizers over not-quite-adjacent physical qubits. Measuring these stabilizers requires additional physical operations, creating more opportunities for errors, and so errors will propagate more quickly along these runs.
Do the "qubit houses" in Figure 10 actually work, or is there some unexpected error mechanism? Can dislocation qubits be packed more tightly than lattice surgery qubits while achieving an equivalent logical error rate? Is there a block code that is sufficiently beneficial when layered over surface code qubits? Until careful simulations are done it will be unclear what the answer to these questions is, and so we leave the answers to future work.

D. Distribute the computation
An interesting consequence of how we parallelize addition, by segmenting registers into pieces terminated by carry runways, is that there is limited interaction between the different pieces. In fact, the additions themselves require no communication between the pieces; it is only the lookup procedure preparing the input of the additions that requires communication. If we place each piece on a separate quantum computer, only a small amount of communication is needed to seed the lookups and keep the computation progressing.
Given the parameters we chose for our construction, each lookup is controlled by ten address qubits. Five of those qubits come from the exponent and are re-used hundreds of times. The other five come from the factor register, which is being iterated over during multiplication. If the computation was to be distributed over multiple machines, the communication cost would be dominated by broadcasting the successive groups of five qubits from the factor register.
Recall from Section 2 that each lookup addition takes approximately 37 milliseconds. Five qubits must be broadcast per lookup addition. Therefore, if each quantum computer had a 150 qb/s quantum channel, the necessary qubits could be communicated quickly enough to keep up with the lookup additions. This would distribute the factoring computation.
For example, a network topology where the computers were arranged into a linear chain and were constantly forwarding the next set of factor qubits to each other would be sufficient. Each quantum computer in the distributed computation would have a computational layout basically equivalent to the left half (or right half) of Figure 7. Factoring an n = 2048 bit number could be performed with two machines each having perhaps 11 million physical qubits, instead of one machine with 20 million physical qubits.
We can decrease the size of each individual quantum computer by decreasing the piece size we use for the additions. For example, suppose we used an addition piece size of c sep = 256 and distributed an n = 2048 RSA factoring computation over 8 machines. Normally using smaller pieces would proportionally accelerate the addition, but the number of magic state factories has not changed (they have simply been distributed) so each lookup addition will still take approximately the same amount of time. So a 150 qb/s quantum channel per computer should still be sufficient. One caveat here is that each of these much smaller machines has to run its own copy of the lookup operation, and because there are so few factories per machine the lookup will take much longer. The optimal windows sizes of the lookups will decrease, and the total computation time will approximately double. In the diagram, there is one distance 9 logical qubit per "house" (triangular roof plus rectangular base). The logical X/Z observable chains of one of the qubit houses is shown. The two region highlight colors correspond to different operation orders for performing stabilizer measurements. This is important information because every stabilizer measurement has one particular physical error, the "long error", that can produce detection events a distance 2 apart. By choosing the operation order, we determine if the long errors of each type are oriented vertically or horizontally. The goal is to avoid pointing long errors towards boundaries of the same type, because that reduces the effective code distance. In green regions, the long blue errors are vertical and the long green errors are horizontal (by "blue error" and "green error" we mean an error that terminates on a boundary of the same color). In blue regions, the long errors have the opposite orientation.
Based on this surface analysis it seems that, instead of using 1 machine with 20 million qubits, we could use 8 machines each with perhaps 4 million qubits, as long as they were connected by quantum channels with bandwidths of 150qb/s. But we have not carefully explored this question, and perhaps there are even better ways to distribute the computation with even lower overheads. Only future work can tell.

E. Revisit elliptic curves
Many of the optimization techniques that we use in this paper generalize to other contexts where arithmetic is performed. In particular, consider the Toffoli count for computing discrete logarithms over elliptic curves reported by [17]. It can likely be improved substantially by using windowed arithmetic. On the other hand, because of the need to compute modular inverses, it is not clear if the coset representation of modular integers is applicable.
Which of our optimizations can be ported over, and which ones cannot? How much of an improvement would result? These are interesting questions for future research work.

CONCLUSION
In this paper, we combined several techniques and optimizations into an efficient construction for factoring integers and computing discrete logarithms over finite fields. We estimated the approximate cost of our construction, both in the abstract circuit model and under plausible physical assumptions for large-scale quantum computers based on superconducting qubits. We presented concrete cost estimates for several cryptographically relevant problems. Our estimated costs are orders of magnitude lower than in previous works with comparable physical assumptions.
In [67], Mosca poses the hypophoric question: "How many physical qubits will we need to break RSA-2048? [...] Current estimates range from tens of millions to a billion physical qubits". The upper bound of "a billion physical qubits" is likely from [9]. Our physical assumptions are more pessimistic than the physical assumptions used in that paper (see Table II), so our results can be directly compared. Doing so shows that, in the four years since 2015, the worst case estimate of how many qubits will be needed to factor 2048 bit RSA integers has dropped nearly two orders of magnitude; from a billion to twenty million. Clearly the low end of Mosca's estimate should also drop. However, the low end of the estimate is highly sensitive to advances in the design of quantum error correcting codes, the engineering of physical qubits, and the construction of quantum circuits. As predicting these advances is out of the scope of this paper, we leave the task of estimating new lower bounds to others.
Post-quantum cryptosystems are in the process of being standardized [68], and small-scale experiments with deploying such systems on the internet have been performed [69]. However, a considerable amount of work remains to be done to enable large-scale deployment of post-quantum cryptosystems. We hope that this paper informs the rate at which this work needs to proceed.

CONTRIBUTIONS
Craig Gidney designed the efficient construction for modular exponentiation, produced initial cost estimates for RSA, and assembled results from other papers for comparison. Martin Ekerå did the cryptographic impact analysis, and extended the construction and cost estimates to problems beyond factoring RSA integers and to algorithms beyond Shor's factoring algorithm. table span decades, and a range of assumptions about the architecture of quantum computers, and generally do not provide spacetime volumes. To assign volumes to these papers, we plugged their asymptotic formulas into various possible realizations of that algorithm (serial, parallel, and intermediate) using ancillary file "fillin-table.py". Each volume entry is the minimum volume achieved by the different possible realizations. We assumed all papers but our own had a negligible retry chance.

Entries
Some entries in the table are directly from a paper, others had to be inferred by hand, and others were filled in using the output of the ancillary file "fill-in-table.py". Here is where each entry in the table came from: • Vedral et al. 1996 [13]: -Toffoli+T/2 Count: 80n 3 derived from figures at end of the paper.
-Abstract Qubits: Paper says 7n + 1 (end of section IV). Verified using figures at end of paper. Paper mentions this can be improved to 4n + 3, but the described method would radically increase the Toffoli count.
-Measurement Depth: Equal to the Toffoli count. The circuit construction is serial.
-Measurement Depth: Equal to the Toffoli count. The circuit construction is serial.

• Beauregard 2002 [15]:
-Toffoli+T/2 Count: Derived from figures in the paper. Our estimate has an additional factor of lg n to account for the need to approximate arbitrary phase rotations using T states.
-Abstract Qubits: Stated in the title of the paper.
-Measurement Depth: Derived from figures in the paper. Our estimate has an additional factor of lg n to account for the need to approximate arbitrary phase rotations using T states.
-Abstract Qubits: The paper erroneously claims 2n in table I due to overlooking a necessary workspace register. We corrected this to 3n + O(1).