Halving the cost of quantum addition

We improve the number of T gates needed to perform an n-bit adder from 8n + O(1) to 4n + O(1). We do so via a"temporary logical-AND"construction which uses four T gates to store the logical-AND of two qubits into an ancilla and zero T gates to later erase the ancilla. This construction is equivalent to one by Jones, except that our framing makes it clear that the technique is far more widely applicable than previously realized. Temporary logical-ANDs can be applied to integer arithmetic, modular arithmetic, rotation synthesis, the quantum Fourier transform, Shor's algorithm, Grover oracles, and many other circuits. Because T gates dominate the cost of quantum computation based on the surface code, and temporary logical-ANDs are widely applicable, this represents a significant reduction in projected costs of quantum computation. In addition to our n-bit adder, we present an n-bit controlled adder circuit with T-count of 8n + O(1), a temporary adder that can be computed for the same cost as the normal adder but whose result can be kept until it is later uncomputed without using T gates, and discuss some other constructions whose T-count is improved by the temporary logical-AND.


Introduction
The surface code [5,8,21,22,11] is a quantum error correcting code that works on a 2D nearest-neighbour array of qubits and achieves a threshold error rate of approximately 1%. This makes the surface code a likely component in the architecture of future error corrected quantum computers, because 2D arrays of qubits with nearest-neighbor connections are possible with many qubit technologies [23,4,12,18,17] and other well understood error correcting codes either have lower thresholds or require stronger connectivity.
One of the downsides of the surface code is that it has no cheap mechanism to apply non-Clifford operations such as T gates. Instead, T gates are performed by distilling and consuming |T = 1 √ 2 (|0 + e iπ/4 |1 ) Craig Gidney: craiggidney@google.com  Figure 2 for the adder building-block and Figure 3 for the logical-AND computation and uncomputation circuits.
states. The T gate is performed via a controlled-NOT from the target qubit into the |T ancilla, followed by a classical feedback step that measures the ancilla to determine if an S gate is applied to the target. Unfortunately, distilling the |T state that will ultimately be passed into this process has significant cost. The cost of distillation is high enough that error corrected quantum computations are expected to bottleneck waiting for |T states to be produced, so that their runtime is dominated by the number of T gates (as opposed to being dominated by gate count, circuit depth, or even measurement depth). For example, although our adder has the same measurement depth as the Cuccaro adder (i.e. 2n + O(1)), we still expect our adder to execute more quickly in practice on early error corrected machines (assuming there is room for the n ancillae our adder requires). Because the surface code is a likely component of future quantum computers, and the runtime of computations within the surface code will be dominated in practice by the number of T gates, it is important to optimize the number of T gates used by quantum circuits. Optimizing the T-count of basic elements of quantum circuits, such as the construction of adders and Toffoli gates, is particularly important because any improvement is widely applicable.
The textbook construction of a Toffoli gate uses seven T gates [20]. When Toffoli operations are paired, i.e. when an initial Toffoli operation is later uncomputed by a second Toffoli operation, each Toffoli in the pair can omit three of the T gates from the textbook construction. This introduces phase errors but, assuming intermediate operations aren't sensitive to the phase errors, the second Toffoli gate can uncompute the phase errors while uncomputing the state permutation [3,20]. It is also possible to reduce the T-count of an unpaired Toffoli gate to 4 by using an ancilla qubit and a classically conditioned fixup operation [15].
The Cuccaro adder uses 2n + O(1) Toffoli gates [7]. Existing constructions implement the Cuccaro adder's Toffolis using 8n + O(1) T gates [3,7,1]. This T-count can be achieved either by applying the ancilla-and-fixup construction from [15] to each individual Toffoli gate, or by noting that all but one of the adder's Toffoli gates appear in compute/uncompute pairs and then applying the matched-phase-error construction from [3,20] to each pair.
The leading factor of 8 in the T-count of an n-bit adder has stood for over a decade [3,7,10]. We improve the leading factor with a construction based on the temporary logical-AND, combining ideas from both the matched-phase-error construction and the ancillaand-fix-up construction. We thereby halve the number of T gates needed to perform Toffoli gates that appear in compute/uncompute pairs and thus halve the T-count of Cuccaro-style adders from 8n + O(1) to 4n + O (1).
Although the title of this paper focuses on addition, we consider our main conceptual contribution to be reframing the ancilla-and-fix-up construction from [15] into the temporary logical-AND operation. This conceptualization makes it clear that the construction is more widely applicable than previously realized. Anytime operations share controls, even if those operations are far apart, a temporary logical-AND allows the controls to be combined once instead of once per operation (as long as intermediate operations are not sensitive to phase error in the controls). Addition is just one example of an operation where this is beneficial. We also discuss other examples.

Opportunity Cost of Ancillae
Our adder construction consumes 4n fewer |T states than the Cuccaro adder, but in doing so it holds n ancillae through a measurement-depth of 2n. Before we discuss our adder construction, we must first discuss the question: is that tradeoff worth making?
Obviously, if performing an addition on a small quantum computer with no space to spare for additional ancillae, it would be not be possible to use our adder whereas Cuccaro's would still be viable. However, lack of space is not the only reason ancillae can be problematic. In particular, note that any qubits being used as ancillae are qubits not being used in T factories.
It is possible to produce a high-quality |T state in 960 units of "spacetime volume" [2] (though this is certainly not a lower bound!). For comparison, an ancilla qubit stored as a double-defect through k serial measurements will cover at least 2k units of spacetime volume. Putting these two facts together, and ignoring the fact that T factories are discrete objects with packing constraints, we will approximate the opportunity cost of holding an ancilla as 1 480 |T states per measurementdepth.
With the above in mind, the "effective" T-count of our adder, including the opportunity cost of ancillae (and accounting for the fact that some are kept for longer than others), is 1 480 n 2 + 4n whereas the effective T-count of the Cuccaro adder is 8n. This implies that our adder is worse than the Cucarro adder when n > 1920. However, switching adders at n = 1920 ignores the possibility of using a hybrid of both adders where bit positions below a cutoff propagate carries via ancillae as in our adder and bit positions above the cutoff propagate carries inline as in the Cuccaro adder. In a hybrid adder of this type, the cutoff actually occurs at n = 960.
We caution that the cutoff of 960 that we just estimated is not a fixed constant. It is affected by other optimizations, by future research, and by overall system design. As the spacetime volume of T factories is improved, the cutoff will move downward. Conversely, if the spacetime volume of ancillae is improved, the cutoff moves upward. For example, qubits that are idle can be compressed into an inactive "memory" form that covers six times fewer physical qubits than the double-defect form [13]. This means that idle ancillae, like the ones in our adder, have a lower opportunity cost than active ancillae. This optimization increases the cutoff to n ≈ 5760, which is beyond the size of adders that would be used to break 4096-bit RSA keys with Shor's algorithm.
An interesting consequence of ancillae having a Tcount opportunity cost, besides making circuits that ex-ecute inplace desirable, is that it makes circuit depth relevant to the T-count. Logarithmic-depth adders may use a constant factor more |T states and ancillae than our adder, but they use exponentially less ancilladepth. The T-count opportunity cost of logarithmicdepth adders grows like Θ(n lg n) instead of like Θ(n 2 ). So, despite their "raw" T-count being larger, for sufficiently large n their effective T-count must become lower than the effective T-count of our ripple-carry adder.
Beware that the previous paragraph only applies in the regime where there are enough physical qubits to support enough T factories to properly feed a logarithmic-depth adder. In the early error corrected regime, when only a few T factories are available, the constant-factor penalty on the T-count of logarithmicdepth adders will result in a runtime that is longer than the runtime of ripple-carry adders (because they both have to wait for |T states). It is only as more and more T factories become available that the parallelism inherent to logarithmic-depth adders becomes accessible, overcomes the constant-factor T-count penalty, and triggers a gradual transition away from ripple-carry adders.
Whenever ancillae are used to save T gates, it is important to consider that there may be alternative uses of those ancillae that net even more |T states (or other desirable resources). These tradeoffs depend on overall system design. Overall system optimization is important, but it is not the subject of this paper and so we will limit ourselves to introducing basic tools (e.g. the concept of a temporary logical-AND, as well as an adder that uses 4 T gates and one ancillae per bit instead of 8 T gates per bit) that future system architects can combine with other techniques on a case by case basis.

Results
In Figure 1, we present a 5-bit adder with a T-count of 16. It performs 4 temporary logical-ANDs, each with a T-count of 4. All other operations are Clifford operations, with no T-count.
The building block of our adder is shown in Figure 2. We construct n-bit adders by nesting n copies of the building block inside of each other. The outer-most and inner-most blocks (which act on the low bit and high bit respectively) are then specialized based on the fact that they either have no carry input or no carry output.
Our adder uses temporary logical-AND operations, which we draw as wires emerging out of a pair of controls then later merging into the same pair of controls. Figure 3 shows how we compute the logical-AND of the two controls, and also the corresponding uncomputation.
x Figure 3: How to compute and uncompute the logical-AND of two qubits. The computation circuit (top) has a T-count of 4 and a measurement-depth of 1. For systems where |T states cannot be used as an input without performing a measurement, it is still possible to achieve a measurement depth of 1 by rearranging the circuit and using a temporary ancilla (e.g. as in Figure 1 of [15]). Note that the |T state input contributes to the T-count, because |T states are the resource used to perform T gates. The uncomputation circuit (bottom) uses a measure-and-fixup approach [15] that requires only Clifford gates, and so has a T-count of zero and a measurement-depth of 1.
An alternative uncomputation construction is to simply do the reverse of the computation circuit. This alternative approach has a net T-count of 2 (because a |T state is recovered). The resulting temporary logical-AND with a T-count of 6 would still be an improvement on existing work, but would be inferior to the measure-and-fixup approach shown above.
Computing the temporary logical-AND has a T-count of 4, but uncomputing it has a T-count of 0. This surprising asymmetry is due to the fact that measurement is not reversible. The uncomputation uses measurement in a way that the computation cannot. Figure 4 shows the building-blocks for two variations on our adder: a controlled adder and an out-of-place adder.
Some additions, such as the ones performed by the multiplications within the modular exponentiation in Shor's algorithm, are conditioned on a control qubit. Our controlled adder reduces the cost of these additions from 21n + O(1) [19]

to 8n + O(1).
Out-of-place adders are useful when a circuit is going to compute an addition, use it for awhile, then uncompute it. Our out-of-place adder does not improve on the cost of computing an out-of-place addition. However, because our out-of-place adder is based on computing a temporary logical-AND, its inverse does not use any T gates. This makes it possible to uncompute an out-of-place addition while consuming no T gates.
Decreasing the T-count of addition reduces the Tcount of any construction based on addition. For example, in [11] it is estimated that factoring a 2048-bit num-ber on a surface-code-based quantum computer would take 2·10 12 distilled |T states and have a measurementdepth spanning 27 hours (though the actual computation would likely be bottlenecked on distillation rather than measurement). The time estimate assumes Toffolis have a measurement-depth of 3, and the T-count estimate assumes Toffolis have a T-count of 7. Shor's algorithm is dominated by the cost of controlled additions. With our techniques, the average T-count and measurement-depth of the relevant Toffolis is ∼ 2.7 and 1 respectively. With these numbers, the measurementdepth estimate is reduced to 9 hours and the T-count estimate is reduced to 8 · 10 11 distilled |T states.
On top of reducing the T-count of obviously-related classical operations like multiplication and exponentiation, reducing the T-count of addition also reduces the T-count of quantum-specific operations such as rotating qubits.
For example, our improved adder allows the operation R Z (θ) to be applied to n qubits with a T-cost of 4n + O(lg n lg 1 ) as follows. First, we must reduce the n target qubits into a binary "Hamming weight register", which indicates how many of the target qubits are on. We start by assigning each qubit a "weight" of 1. We then start applying our out-of-place adder buildingblock to triplets of qubits of the same weight. Each application of the adder will turn 3 relevant qubits of weight w into 1 relevant qubit of weight w and 1 relevant qubit of weight 2w. Note that this means the number of relevant qubits goes down by 1 per adder application, except when only 2 qubits of weight w remain (which, as long as lower-weight qubits are always added first, occurs at most lg n times). Since we start with n qubits, and the Hamming weight register will have lg n qubits, and the two-remaining-qubits-ofsame-weight case occurs at most lg n times, it will take at most n − lg n + lg n = n adder applications to reduce the n initial qubits into the lg n qubits forming the Hamming weight register. Therefore, computing the Hamming weight register has a T-count of at most 4n (whereas uncomputing it will be free). Now, for each position p in the Hamming weight register, synthesize and apply the operation R Z (θ · 2 p ) to the register qubit at that position. This uses O(lg n lg 1 ) T gates, which is negligible in comparison to 4n for large n. Finally, uncompute the Hamming weight register (using no T gates). This completes the application of the n desired R Z (θ) operations.
Another quantum operation that can be implemented via an adder is the n-qubit phase gradient operation Grad n = 2 n −1 k=0 e 2iπk/2 n |k k|. Normally this operation would be implemented by separately applying the operation R Z (π2 −p ) to each qubit of the target register, where p is the qubit's index in the register and the (a) control number of T gates needed for each rotation depends on the maximum per-gate error . However, assuming a "phase gradient register" prepared in the state |k is available, the phase gradient operation can be performed via addition [16]. Add the target register into the phase gradient register, and phase kickback will apply the Grad n operation to the target. With our adder, this construction performs phase gradients using 4n + O(1) T gates. This T-count is interesting because it appears to be independent of despite the phase gradient operation involving arbitrarily small rotations. However, there are three ways in which this phase gradient construction's cost does depend on . First, still bounds the minimum quality of the |T states powering the T gates. Second, large phase gradients can be truncated down to a size n max asymptotically equal to Θ(lg 1 ). Third, initializing the reusable phase gradient register has a one-time cost of O(n max lg 1 ) T gates.
The temporary logical-AND is useful for optimizing an even wider variety of circuits than addition is. Whenever Toffoli gates appear in compute/uncompute pairs, and intermediate operations are not sensitive to phase errors on the controls of the Toffoli gate (i.e. the condition in Figure 5 is satisfied), it is possible to save 4 T gates by replacing the pair of Toffoli gates with a temporary logical-AND.
For example, a simple way to construct Grover oracles starts by translating a classical predicate into a classical reversible circuit made with Toffoli gates. This reversible circuit is then used on a quantum computer to compute an output qubit, and a Z gate is applied to the output qubit before the circuit is run in reverse to uncompute the output qubit. Every Toffoli gate generated by this construction is part of a com- Figure 5: A sufficient condition for replacing a pair of Toffoli gates with a temporary logical-AND, saving 4 T gates.
1) The later Toffoli gate must be uncomputing the earlier Toffoli gate.
2) Intermediate operations must not be sensitive to the presence of the entangled ancilla.
pute/uncompute pair (each Toffoli in the computation will match a Toffoli in the uncomputation). Using temporary logical-ANDs, instead of individually translating each Toffoli gate into T gates, halves the T-count of this approach to constructing Grover oracles. However, note that this naive optimization tends to require an unreasonable number of ancillae and will not generalize to all methods for producing Grover oracles.
As another example, note that the temporary logical-AND can perform NOT gates with many controls by iteratively combining the controls down to a single representative ancilla. The representative ancilla is then used to control a CNOT onto the target qubit. This construction takes 4n − 4 T gates to perform a NOT with n controls (and is equivalent to a nesting construction from [15]). As when replacing Toffoli gates in compute/uncompute pairs with a temporary logical-AND, the ancilla representing the n controls should be kept and used multiple times whenever possible (instead of being uncomputed and recomputed).
Yet another example of a kind of circuit that benefits from the temporary logical-AND is low-depth adder circuits. For example, the Toffoli gates produced by the P and P −1 rounds in Draper et al's logarithmic-depth adder [9] form compute/uncompute pairs that can be replaced by temporary logical-ANDs. Alternatively, by using temporary logical-ANDs to implement the classical Brent-Kung adder [6] in a reversible fashion, two n-bit numbers can be added in O(lg n) depth with a T-count of only 12n.
Finally, we note that Toffolis also appear in compute/uncompute pairs in quantum circuits rooted in physics instead of mathematics. For example, we used temporary logical-ANDs to cut the T-count of a chemistry algorithm in preparation by nearly 50% [2].

Discussion
For over a decade, the T-count of addition has been 8n + O(1) [1,3,7]. In this paper we showed how to halve the leading factor of this cost by replacing Toffolis in compute/uncompute pairs with temporary logical-ANDs. The temporary logical-AND is a basic circuit construction widget that can be used in many circuits, and a good example of measurement breaking the symmetry between computation and uncomputation (allowing one to be more efficient than the other). We demonstrated how to optimize the T-count of a few tasks using our adder and the temporary logical-AND, but there are many other low hanging applications (e.g. converting between binary and unary, temporary sorting, applying an operation to a qubit indexed by a register in superposition, parameterized bit-rotation of a qubit register, computing the greatest common divisor of two values, etc).
It is interesting to consider if addition or temporary logical-ANDs can be done with still fewer T gates. In [14] it is proven that a Toffoli gate requires at least 4 T gates, though the specific nature of the proof does not eliminate the possibility that n Toffolis could be implemented with fewer than 4n T gates for some large n. Although we suspect the true lower bound really is 4, we also expect that there are regardless many opportunities to optimize how many T gates go into performing Toffolis. For example, if a task involves repeating an action several times, T gates inside Toffolis at the end of one repetition could cancel out T gates in work being done at the start of the next repetition.
Regardless of whether there will be further T-count improvements for operations as basic as paired Toffoli gates, it is clear that many other kinds of Tcount optimizations are waiting to be found. Not just in the construction of basic low-level operations, but in medium-level constructions, high-level constructions, lower-than-circuit level constructions, and generally across the whole technology stack that will be needed to perform error corrected quantum computation with the surface code for the first time.