A volumetric framework for quantum computer benchmarks

We propose a very large family of benchmarks for probing the performance of quantum computers. We call them \emph{volumetric benchmarks} (VBs) because they generalize IBM's benchmark for measuring quantum volume [1]. The quantum volume benchmark defines a family of \emph{square} circuits whose depth $d$ and width $w$ are the same. A volumetric benchmark defines a family of \emph{rectangular} quantum circuits, for which $d$ and $w$ are uncoupled to allow the study of time/space performance trade-offs. Each VB defines a mapping from circuit shapes -- $(w,d)$ pairs -- to test suites $\mathcal{C}(w,d)$. A test suite is an ensemble of test circuits that share a common structure. The test suite $\mathcal{C}$ for a given circuit shape may be a single circuit $C$, a specific list of circuits $\{C_1\ldots C_N\}$ that must all be run, or a large set of possible circuits equipped with a distribution $\Pr(C)$. The circuits in a given VB share a structure, which is limited only by designers' creativity. We list some known benchmarks, and other circuit families, that fit into the VB framework: several families of random circuits, periodic circuits, and algorithm-inspired circuits. The last ingredient defining a benchmark is a success criterion that defines when a processor is judged to have "passed" a given test circuit. We discuss several options. Benchmark data can be analyzed in many ways to extract many properties, but we propose a simple, universal graphical summary of results that illustrates the Pareto frontier of the $d$ vs $w$ trade-off for the processor being benchmarked. [1] A. Cross, L. Bishop, S. Sheldon, P. Nation, and J. Gambetta, arXiv:1811.12926 (2018).


Figure 1 (caption fragment). This data is for illustration purposes only and does not come from a real device or rigorous simulation!
We need techniques for characterizing these devices and assessing their performance. Much recent work has gone into QCVV (quantum characterization, verification, and validation), which focuses on measuring intrinsic noise/error behavior of 1- and 2-qubit gates. But with the emergence of larger processors that can be viewed and treated like baby quantum computers, there is increasing need for holistic benchmarks that capture interesting aspects of processors' overall performance -- preferably at practically relevant tasks, whenever possible -- and enable comparing and contrasting different processors and architectures. IBM recently introduced a metric of quantum computer performance called "quantum volume" [1]. Quantum volume is intended to measure the size of a quantum processor's accessible state space. A processor with n qubits has a 2^n-dimensional state space. No matter how well it works, it can't access more than 2^n computational states. On the other hand, if the processor is poorly controlled, or suffers a lot of noise, then some of those n qubits may be superfluous -- the full state space won't be accessible. Quantum volume aims to capture both of these limitations at once. It asks "What's the largest number of qubits on which the processor can reliably produce a random state?" Random states can be generated by random programs (circuits) [10], but the number of steps required (the depth of the circuit) grows with the number of qubits. So demonstrating that a processor's quantum volume is at least 2^n requires demonstrating the ability to reliably run a random quantum circuit with width (# of qubits involved) and depth (# of consecutive steps) both at least n. But while the quantum volume can provide a concise, high-level summary of system performance, very few quantum algorithms actually correspond to square circuits.
Each aspect of a circuit's shape (its width and depth) is independently interesting and can change dramatically depending on the target application. Probing precisely how a circuit's shape affects a given processor's ability to run it successfully (see Fig. 1) can serve as an important device diagnostic as well as an informative performance metric.
IBM proposed a specific benchmark to probe quantum volume. By benchmark, we mean a set of quantum circuits (a test suite) together with instructions for how to run them (an experimental design), an analysis procedure for processing the raw results, and finally an interpretation rule for drawing high-level conclusions. Benchmarks are often designed to measure one or more metrics, and in this case the metric is quantum volume. Other benchmarks like this exist. Randomized benchmarking (RB) [11][12][13][14][15] is a benchmark designed to measure a certain fidelity metric. Long-sequence gate set tomography (GST) [16][17][18] defines a tailored family of benchmarks designed to measure the gate set that describes the processor's noisy operations, from which a variety of metrics can be computed.
But benchmarks don't necessarily measure a well-defined metric. LINPACK [19] is a benchmark for classical computers that measures... performance on LINPACK. That's a proxy for performance on general linear algebra, but it doesn't correspond to any intrinsic or preexisting property of the computer. LINPACK effectively defines a new synthetic metric. Most benchmarks for classical computers are like this. Some are intended to measure a well-defined metric like FLOPS, but even in those cases the benchmark defines a particular context (e.g. linear algebra), and the metric's meaning varies across contexts.
Just what properties of quantum computers will be the most relevant -- which intrinsic and/or synthetic metrics will capture important aspects of performance -- is very unclear at this time. Quantum volume is intriguing, and a great first step, but almost certainly not the last step in benchmarking quantum computers. In the absence of clear desiderata for metrics and/or benchmarks, we suggest that what the quantum computing community needs is a lot of candidate benchmarks -- which may or may not be tied to intrinsic metrics -- that can be tested, deployed, applied, and made to compete against each other. The ones that prove the most useful will persist.
In this paper, we lay out a framework for a large and flexible family of benchmarks (see example in Fig. 1). It was inspired by IBM's benchmark for quantum volume, so we call them volumetric benchmarks. We give a few examples, mostly to demonstrate how the framework works, but these are not intended to be exhaustive! The most important property of a test circuit is its structure, because circuits with different structures can probe (or suppress) totally different noise/error properties. The volumetric framework allows circuit families with radically different structures, which we hope will allow the construction of volumetric benchmarks that capture a radically wide range of behaviors.
Like most ideas, this one is not entirely original. Quantum volume is the most obvious and explicit precedent, and our explicit intent here is to generalize it. More generally, it is a basic and well-known truth that both depth and width of the circuits that can be run on a quantum processor are interesting. Emerson and Wallman proposed a metric -- called quantum circuit capacity in a 2017 talk [20] or, more recently, quantum processing power [21] -- based on a processor's ability to run d × w rectangular circuits derived from cycle benchmarking [22]. Bishop, responding to a question in an online forum, suggested that the quantum volume could serve as an "executive summary" of device performance, but that one should additionally report benchmarking results at a range of width/depth pairs [23]. In fact, almost every aspect of the framework we propose here is inspired by one or more proposed approaches to benchmarking. Our main contributions are (1) their integration, and (2) the reporting and visualization framework in the last section of the paper. We have attempted to cite prior work of which we're aware, and apologize in advance for any failures.

Motivations for generalizing (why not just quantum volume?)
An obvious question is "Is it even necessary to generalize beyond quantum volume, given that quantum volume already exists?" Although quantum volume is an excellent (and inspiring) idea, neither quantum volume nor any other single metric will capture all the important information about holistic performance. Quantum volume measures the processor's ability to run "square" circuits of a particular form. While this correctly captures the idea that useful circuits' depth must increase with width (number of qubits), the circuits for most algorithms aren't square. Shor's factoring algorithm [24], for example, requires O(n) qubits but O(n^3) depth. Grover's algorithm [25] applied to search over n-bit strings requires O(2^{n/2}) repetitions of an oracle subroutine that, itself, generally has depth O(poly(n)). On the other hand, many quantum algorithms can be reformulated to require only polylog(n) depth, at the cost of much larger width [26].
The intuition behind quantum volume is that taking full advantage of n qubits requires the ability to access any part of their Hilbert space, and that task -- also known as scrambling the initial state [27] -- demands O(n) depth. There's significant truth in this, but it's incomplete. Useful quantum algorithms do not always explore the whole Hilbert space -- for example, Grover's algorithm operates within a 2-dimensional subspace spanned by the answer and the uniform superposition. Conversely, just one scrambling isn't usually enough; algorithms often need to follow a long and complex path through Hilbert space.
For all these reasons, users and designers of quantum computers will want to understand the trade-offs between width and depth. A 100-qubit processor that can perform 1000 layers of gates is more useful for certain tasks than one that can just perform 100 layers. Similarly, there are situations where 1000 qubits that can perform 100 layers are more useful than just 100 qubits that can also perform 100 layers. Square circuits might be the most important ones, but by no means the only important circuits.
There's another important reason to design, run, and study other benchmarks besides the one that defines quantum volume. Different errors affect different circuits in different ways. Independent local depolarizing errors affect pretty much all circuits in the same predictable way. But unitary or "coherent" errors (e.g. over-rotations in gates) behave very differently in random circuits, where they tend to get smeared out into effective depolarizing errors [15], and in periodic circuits, where they can get amplified rapidly [16]. More exotic faults like crosstalk and non-Markovianity are not yet well understood, but will almost certainly have strongly circuit-dependent effects. The circuits for algorithms and other applications tend to be highly and nontrivially structured. We do not yet fully understand how specific types of errors affect them! But we have no reason to believe that random and/or scrambling circuits will be a reliable proxy for algorithms. This is a strong argument for inventing and using other benchmarks -- ones that incorporate structured or periodic circuits, and ones that attempt to mimic real algorithm circuits. A diverse pool of benchmarks -- rather than a monoculture -- ensures robustness against unknown and unexpected effects.
Finally, quantum computing is still a very immature technology. There's a lot we don't know! Benchmarks will surely need to evolve. For example, it's not entirely clear how best to quantify "success" for a given circuit. Heavy outcome probability, proposed for the quantum volume benchmark, is a promising idea -- but it's also one of the first ideas, and better success metrics may be invented. New error types may appear and become significant as hardware matures. New circuit types will almost surely be invented. Our goal in this paper is to encourage a proliferation of new, creative benchmarks from which (through use and argument) the best will emerge. Too many constraints inhibit creativity, but a total absence of structure inhibits communication. We hope the volumetric benchmark framework will hit a sweet spot, establishing a minimalist structure that enables all practitioners to measure and communicate performance, while allowing a lot of flexibility to define new benchmarks that capture many aspects of quantum computer performance. In the "Examples" at the end of this paper, we show that many -- perhaps almost all -- extant benchmarks and test suites can be fitted into the volumetric framework, which makes us hopeful that it can encompass future benchmarks too.

2. A defined measure of "success" for each circuit, which takes as input a set of N w-bit strings resulting from running that circuit on a device, and outputs either (a) a bit (0/1) indicating whether the device "passed" the test, or (b) a real number between 0 and 1 quantifying how well it scored on the test.

3. A defined measure of overall success on an ensemble of circuits, which takes as input a set of success values (as defined in #2, above) for all the circuits in an ensemble, and condenses them into a single representative number between 0 and 1.

4. (Optional) An experimental design specifying how the circuits are to be run. The simplest experimental design is just a specification of N (how many times each circuit should be run). More complicated designs may also include guidelines or requirements on what order to run the circuits in, such as whether the N repetitions should be interleaved or divided in chunks to mitigate effects of drift [28,29], or perhaps even specifying an algorithm for choosing circuits adaptively based on preliminary results, etc.
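As a rough illustration of how these ingredients fit together (the circuit ensemble C(w, d), the two success measures, and the optional experimental design), the following Python sketch bundles them into one object. Every name in it (`VolumetricBenchmark`, `ensemble`, `circuit_success`, `ensemble_success`, `repetitions`, and the `run` callback standing in for a device) is a hypothetical choice made for this example and is not part of any existing library or of the framework's definition.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical placeholder types: any circuit representation would do.
Circuit = str
Counts = Sequence[str]  # N measured w-bit strings, e.g. ["010", "011", ...]


@dataclass
class VolumetricBenchmark:
    # Map from circuit shape (w, d) to the list of test circuits to run.
    ensemble: Callable[[int, int], Sequence[Circuit]]
    # Per-circuit success: a value in [0, 1] computed from the measured strings.
    circuit_success: Callable[[Circuit, Counts], float]
    # Condenses the per-circuit values into one number in [0, 1].
    ensemble_success: Callable[[Sequence[float]], float]
    # Optional experimental design; here only the repetition count N.
    repetitions: Optional[int] = None

    def score(self, w: int, d: int, run: Callable[[Circuit, int], Counts]) -> float:
        n = self.repetitions or 100  # assumed default when no design is given
        values = [self.circuit_success(c, run(c, n)) for c in self.ensemble(w, d)]
        return self.ensemble_success(values)


def all_zero_fraction(circuit: Circuit, counts: Counts) -> float:
    # Toy success measure: fraction of shots equal to the all-zeros string.
    return sum(s == "0" * len(s) for s in counts) / len(counts)


# Toy usage: one "idle" circuit per shape and a fake, error-free device.
bench = VolumetricBenchmark(
    ensemble=lambda w, d: [f"idle:{w}x{d}"],
    circuit_success=all_zero_fraction,
    ensemble_success=min,
    repetitions=10,
)
fake_run = lambda circuit, n: ["000"] * n
```

Using `min` as the ensemble measure corresponds to demanding good performance on every circuit; an average would be an equally valid choice under this sketch.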
All of these properties should be specified and/or referenced clearly and unambiguously when a volumetric benchmark is reported. Some benchmarks (like randomized benchmarking) have evolved organically, and are performed in various ways [30]. For such benchmarks, repeatability and cross-platform comparison is only possible if the results are accompanied by a precise description of exactly what variation of the benchmark was performed (including, if at all possible, the order and timing of the circuits, which define the experimental design). If a benchmark has been precisely defined before (e.g. in a published article), it is sufficient when reporting results to reference that standard and note any specific variations. When defining and reporting the performance of a new benchmark, it is desirable to be as precise and specific as possible so that subsequent experiments can reproduce the same benchmark as closely as possible in order to enable comparison, or state precisely what was changed.
The width- and depth-dependent circuit ensemble C(w, d) is the defining property of a benchmark. For a given circuit shape (w, d), the test suite C may contain:
1. A single circuit,
2. A list of K(w, d) circuits, all of which need to be run on the device,
3. A probability distribution over a large set of circuits, together with an integer K(w, d) that specifies how many circuits should be drawn at random from the distribution and run on the device.
Each circuit ensemble defines a test that the device has to pass. The nature and specification of the circuit ensembles is up to the designer of the benchmark. However, the circuit ensembles are expected to share a common structure that "scales" naturally with w and d. Increasing w and/or d is expected to make the test harder, or at least not easier. Several classes -- not necessarily exhaustive -- are discussed in the next section. Concrete examples of volumetric benchmarks from these classes are given in the concluding "Examples" section.
The other main ingredients of a benchmark define what it means to "pass" a test: first, what is required to succeed at performing a single circuit; second, what counts as success for an ensemble of them.
For single circuits, the simplest measure of success is "Did running the circuit yield the correct/expected outcome more than 2/3 of the time?" (Note: 2/3 is an arbitrary constant, but fairly conventional). However, this only makes sense for definite-outcome circuits that have a unique "correct" outcome. Other criteria are discussed below in the "Success Criteria" section.
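With a finite number N of repetitions, the observed success frequency is only an estimate of the true success probability. One conservative way to decide whether the 2/3 threshold has been cleared is to subtract a Hoeffding confidence margin from the observed frequency. This is an assumption made for illustration; the text does not prescribe a particular statistical test, and the function name and default values below are hypothetical.

```python
import math


def passes_with_confidence(successes, n, threshold=2 / 3, confidence=0.95):
    """Return True if the success probability exceeds `threshold` at the given confidence.

    Uses the one-sided Hoeffding bound: with probability at least `confidence`,
    the true probability is no smaller than successes/n - margin.
    """
    delta = 1.0 - confidence
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return successes / n - margin > threshold
```

The Hoeffding margin is looser than an exact binomial interval, so this check errs on the side of failing borderline circuits.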
If a benchmark defines a single circuit for each shape, then "passing" just requires succeeding at that circuit. If the benchmark specifies a list of circuits, it may require all circuits to be run successfully, or may require the average success rate to be greater than 2/3, or may require success on a fixed fraction of the circuits, etc. Similar criteria apply to distributions over circuits, but in this case, average success rate is more compelling than (e.g.) criteria that require all circuits to succeed, because of reproducibility concerns when the circuits are drawn randomly and therefore unreproducibly. More complex criteria are also possible -- we do not intend to constrain the possible definitions. Potential success criteria (and their uses) are discussed below in the "Success Criteria" section.
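The three ensemble-level rules mentioned above (all circuits succeed, a fixed fraction succeed, or the average success exceeds a threshold) can be sketched in a few lines of Python. The function names and default thresholds are illustrative assumptions only.

```python
from statistics import mean


def passes_all(successes):
    """Every circuit in the ensemble must have succeeded (list of booleans)."""
    return all(successes)


def passes_fraction(successes, fraction=0.9):
    """At least the given fraction of circuits must have succeeded."""
    return sum(successes) >= fraction * len(successes)


def passes_average(scores, threshold=2 / 3):
    """The mean of per-circuit scores in [0, 1] must exceed the threshold."""
    return mean(scores) > threshold
```

The first two rules take pass/fail bits per circuit, while the third takes real-valued scores, matching the two kinds of per-circuit output a success measure may produce.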
Finally, a benchmark may or may not specify a particular experimental design. For example, randomized benchmarking is conventionally not associated with an experimental design -- the circuits can be run in any order, with the N repetitions of each circuit interleaved or batched together as the experimentalist finds convenient. But in a benchmark designed to test stability over time (for example), a particular experimental design that specifies a particular order of circuits (e.g., N/2 repetitions at one time, and N/2 at another time) might be important [29]. We expect that most benchmarks will not require any special experimental design, but this allows for extensibility to address new tasks not envisioned at this time.

Classes of circuits for volumetric benchmarks
This paper presents a framework for volumetric benchmarks, and is not intended to introduce (or advocate for) any particular benchmarks. We hope to see a wide variety of benchmarks proposed, each motivated by a different aspect of "performance" that the benchmark seeks to capture. The possibility of basing benchmarks on creative and novel kinds of circuits is essential for that vision. But we also want to illustrate the framework's potential with some examples, so in this section we give a non-exhaustive list of circuit classes that could (and in some cases already do) form benchmarks.

Random circuits
Random circuits may be the most obvious choice, and many ensembles of random circuits are known. Generally, a random circuit class is defined by (1) specifying an ensemble of subroutine circuits, and then (2) building a large circuit of width w and depth d by drawing subroutines at random and combining them in series, parallel, or both. Some higher-level structure is usually imposed as well. For example, standard (Clifford) randomized benchmarking circuits [14] are mostly composed of randomly chosen w-qubit Clifford subroutines combined in series, but with the high-level constraint that the last Clifford must invert the previous d−1 Cliffords.
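The two-step recipe above, together with the inversion constraint used in Clifford RB, can be illustrated with a deliberately simplified toy. The sketch below uses permutations of w wires as a stand-in for Clifford subroutines (a swapped-in group chosen only because it is easy to compose and invert in plain Python), draws d−1 of them at random, and appends one final subroutine that undoes their product. It is not an implementation of randomized benchmarking, and all function names are hypothetical.

```python
import random


def compose(p, q):
    """Return the permutation 'apply p, then q' (tuples mapping index -> image)."""
    return tuple(q[p[i]] for i in range(len(p)))


def invert(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def random_inverting_sequence(w, d, rng):
    """Build d subroutines on w wires: d-1 random ones plus one that inverts them."""
    layers = [tuple(rng.sample(range(w), w)) for _ in range(d - 1)]
    total = tuple(range(w))  # start from the identity
    for layer in layers:
        total = compose(total, layer)
    return layers + [invert(total)]
```

Because the last subroutine inverts everything before it, the ideal circuit acts as the identity, which is what gives such circuits a definite, easily checked outcome.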
Clifford RB is not the only random-circuit volumetric benchmark. Direct RB [31] replaces the random Clifford subroutines with d depth-1 layers that are themselves formed by parallel combination of native 1- and 2-qubit gates on w qubits. Here, the higher-level structure takes the form of a prefix and a postfix. The prefix prepares a random stabilizer state and the postfix transforms to the computational basis before measurement. (Note that "depth" d has a different meaning in each of these benchmarks; this is discussed in the next section). The simultaneous RB protocol [32] defines a benchmark composed of circuits that combine 1- or 2-qubit Clifford (or direct) RB subroutines in parallel across w qubits. There are several other variations and spin-offs of RB, e.g. cycle benchmarking [22], that could also be adapted as volumetric benchmarks.
Conventionally, RB is used as a QCVV protocol to measure an intrinsic error rate that can be extracted through particular data analysis. Here, we ignore that application, and propose repurposing the RB circuits as a volumetric benchmark. In this use, the results are not analyzed in the conventional way; we propose a standardized way to report volumetric benchmarking in the "How to Report Results" section at the end of this paper.
Scrambling circuits, which are used in quantum volume [1,27], are another class of random circuits. These can, in principle, explore every corner of the system's Hilbert space (RB circuits typically only explore a discrete subset that spans the state space). Scrambling circuits can be constructed in various ways. The construction given by IBM for quantum volume involves a high-level structure: the circuits are composed of d alternating sequential layers, each of which is either a randomly chosen subroutine that permutes the w qubits or a randomly chosen subroutine that applies local 2-qubit unitaries from SU(4). Another class has been proposed by Google [33,34].

Periodic circuits
The oldest qubit test suites -- Rabi and Ramsey sequences -- are periodic rather than random. More recent test protocols that define suites of periodic circuits include robust phase estimation (RPE) [35] and gate set tomography (GST) [16]. Periodic circuits repeat a single subroutine O(d) times. They stress-test different aspects of a processor's performance from the ones emphasized by random circuits -- coherent errors can be amplified by periodic circuits, whereas in random circuits they tend to get twirled away or smeared out. Even the simplest periodic circuit, a long sequence of "idle" operations, each of which is intended to do nothing for a single clock cycle, is useful for identifying calibration errors and decoherence. It's easy to envision a processor that excels at random circuits but fails at periodic ones, or vice versa. Doing equally well on both kinds of benchmark is a hallmark of reliability.
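A small numerical sketch can make the amplification argument concrete. It assumes a single qubit starting in |0⟩ and an "idle" gate that is actually a rotation by a small angle ε about the X axis. In a periodic circuit the d rotations add coherently, giving a flip probability of sin²(dε/2). For the random-circuit case the sketch uses a simplified stand-in model, an assumption made here rather than a derivation: each step independently flips the qubit with probability sin²(ε/2). The function names are hypothetical.

```python
import math


def flip_probability_periodic(epsilon, d):
    """Repeating the same over-rotation d times adds the angles coherently."""
    return math.sin(d * epsilon / 2) ** 2


def flip_probability_randomized(epsilon, d):
    """Toy stochastic model: d independent flips, each with probability sin^2(eps/2)."""
    p = math.sin(epsilon / 2) ** 2
    # Probability that an odd number of the d independent flips occurred.
    return (1 - (1 - 2 * p) ** d) / 2
```

For small dε the periodic error grows roughly as (dε)²/4, quadratically in d, whereas the stochastic model grows roughly as dε²/4, linearly in d. That gap is the sense in which periodic circuits amplify coherent errors.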
Periodic circuits are also useful because they reflect the structure of many quantum algorithms. Grover's algorithm is an extreme example; two subroutines are performed alternately O(√N) times to search a set of N possible solutions. Phase estimation, a very common subroutine [36], relies on repeated applications of a unitary to amplify its eigenvalues.

Application circuits
Not all algorithms are periodic or random. Furthermore, generic periodic circuits (Rabi/Ramsey sequences or GST circuits) don't reflect the complexity of algorithm circuits. So a third essential class of circuits is application circuits -- circuits that behave like the ones that are expected to be run in real-world usage. Currently, applications for circuit-model quantum computers fall into three main classes: "digital" quantum algorithms that are intended to be compiled into discrete gate sets to run on error-corrected qubits; "analog" quantum algorithms designed to run on physical qubits that admit continuous families of native gates (e.g. Z rotations by θ) [37]; and quantum error correction.
In certain cases, it may be possible to define a volumetric benchmark by simply choosing actual instances of a given algorithm that correspond to each (w, d) pair. However, most algorithms don't scale smoothly to every depth and width. For example, full Grover search on w qubits requires depth O(2^{w/2}). But the circuits in a benchmark needn't actually perform an algorithm. Rather, they should be representative of the circuits that perform that algorithm (or other application). For example, a Grover-inspired benchmark might include circuits that perform d iterations of the Grover subroutine -- for any integer d -- in between "prefix" and "postfix" subroutines designed to simplify the circuits' output distribution. More generally, a wide variety of representative circuits can be constructed by identifying key subroutines for larger applications (e.g. phase estimation or syndrome extraction) that scale more smoothly with d and w than the full-scale application.
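To see why full Grover search does not populate every (w, d) shape, the sketch below tabulates the standard iteration estimate ⌊(π/4)·2^{w/2}⌋ for a single marked item among 2^w, and contrasts it with a Grover-inspired circuit whose depth d is chosen freely. The list-of-labels circuit description and both function names are hypothetical placeholders, not a real circuit format.

```python
import math


def full_grover_iterations(w):
    """Standard estimate floor((pi/4) * sqrt(2**w)) for one marked item among 2**w."""
    return math.floor(math.pi / 4 * math.sqrt(2 ** w))


def grover_inspired_circuit(d):
    """Hypothetical description: d Grover steps between a prefix and a postfix."""
    return ["prefix"] + ["grover_step"] * d + ["postfix"]


# Iteration counts for a few widths; the depth is fixed by w in the full algorithm.
table = {w: full_grover_iterations(w) for w in (4, 10, 20)}
```

In the full algorithm each width admits only one natural depth, and it grows exponentially; the Grover-inspired construction decouples the two so that any (w, d) pair can be tested.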
Eventually, application circuits will constitute the most important and useful class of test suites. However, identifying the best ways to construct test suites based on applications will require significant additional research. As we said above, our goal in this paper is to construct a useful container for benchmarks, not the benchmarks themselves. We anticipate that the volumetric framework and the observations above will help to guide the construction of application-specific benchmarks.

Meanings of "depth" (and why not to compare different benchmarks naively)
We have been intentionally vague, so far, on precisely what "depth" (d) means. For some circuit classes, including Rabi-oscillation circuits, the depth refers simply to the number of physical clock cycles required to implement the circuit. But many other circuit classes possess their own natural definition of depth. Ramsey oscillation and direct RB circuits, for instance, require explicit prefix/postfix operations that are typically excluded from the depth count. The number of clock cycles required to implement these prefix/postfix operations may not even be constant -- e.g., for direct RB circuits, the number of required clock cycles is somewhat random (and grows with the circuit width). But more generally, "depth" counts the number of subroutines in a circuit, and those subroutines themselves often require more than 1 clock cycle. Standard (Clifford) RB is a good example; the subroutines are Clifford operations, which generally require more than 1 native gate.
To clarify matters, we distinguish between a circuit's logical depth, which corresponds to the number of consecutive subroutines, and its physical depth, which corresponds to the number of clock cycles or consecutive elementary gates required to implement the circuit. Both are useful. Which one is more relevant, in a specific situation, depends on what property of the processor is being focused on. Physical depth is more relevant for debugging, hardware-level metrics, and quantifying the amount of error per gate. Benchmarks that rely on physical depth (e.g. direct RB) will tend to capture the raw "physics" performance of quantum gates and hardware, and make more sense to experimental physicists and hardware engineers. On the other hand, logical depth is more abstract, hardware-agnostic, and consistent with concepts from computer science. In this paper, d refers to a circuit's logical depth, exclusively.
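The distinction can be made concrete with a short sketch. A circuit is represented here as a list of subroutine names; logical depth counts the subroutines, while physical depth sums the clock cycles each one costs after compilation, plus any prefix/postfix overhead. The cycle costs in the example are invented numbers used purely for illustration, and the function names are hypothetical.

```python
def logical_depth(subroutines):
    """Number of consecutive subroutines in the circuit."""
    return len(subroutines)


def physical_depth(subroutines, cycles_per_subroutine, overhead_cycles=0):
    """Total clock cycles: per-subroutine costs plus prefix/postfix overhead."""
    return overhead_cycles + sum(cycles_per_subroutine[name] for name in subroutines)


# Hypothetical compilation costs (clock cycles) on two processors.
circuit = ["clifford", "clifford", "permutation", "clifford"]
costs_a = {"clifford": 12, "permutation": 20}
costs_b = {"clifford": 5, "permutation": 3}
```

Under these made-up costs the same logical-depth-4 circuit needs 56 cycles on the first processor and 18 on the second, which is the kind of compilation gap that lets one processor win on logical depth while another wins on physical depth.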
If Processor A and Processor B have very different architectures and APIs, then it's entirely possible for A to consistently succeed at circuits with higher physical depth than B... but for B to consistently succeed at circuits with higher logical depth than A. This would suggest that A has higher-fidelity native gates, but that standard subroutines (e.g. Clifford operations, Grover steps, qubit permutations, or arbitrary SU(4) operations [1]) can be compiled much more efficiently on B. Neither processor dominates -- each can do something better than the other, and as of 2019 we believe it's too early to state which capability might be more important. This is the motivation for a common framework that encompasses many different open-source benchmarks and allows exploration of different aspects of performance and failure.
The existence of two distinct and useful notions of depth highlights something important about volumetric benchmarks: performance on two different volumetric benchmarks is generally not directly comparable. A 3-qubit processor might easily achieve d = 400 on benchmarks derived from Rabi oscillation or direct RB circuits, but only d = 50 on random-Clifford benchmarks, and top out at d = 10 on benchmarks based on quantum volume or circuits for Grover's algorithm. Because d corresponds to logical depth, these are all consistent. But even with two benchmarks where "depth" means exactly the same thing, it usually doesn't make sense to compare performance directly. For example, error can add up very differently in random circuits and periodic circuits, so it should not be surprising if a processor can achieve physical depth d = 1000 on a random-circuit benchmark but only d = 200 on a periodic benchmark or vice-versa. Certain specific cross-benchmark comparisons will make sense (e.g., the last example would hint at coherent errors), but in general each benchmark will stand alone.

Circuit ensembles
A volumetric benchmark defines a family of tests -- one for each circuit shape (w, d) -- that a processor must pass. A test can take various forms. The simplest test is "Demonstrate the ability to run a specific circuit C(w, d) and get an acceptable answer most of the time." But there are other kinds of tests. So a volumetric benchmark defines, for each circuit size (w, d), an ensemble of circuits that we denote C(w, d). The term "ensemble" is intentionally vague here, allowing freedom to encompass at least three distinct kinds of test:
1. A single circuit.
2. A finite list of circuits, some or all of which must be "passed".
3. A distribution over a (possibly infinite) set of circuits, from which a sample is to be drawn.
Passing a single-circuit test just requires demonstrating the ability to run that circuit successfully. We discuss how to evaluate this in the next section. Benchmarks of this form will be relatively rare, because we usually want to know whether a processor can run many circuits of a particular form. Passing a list-of-circuits test can mean at least three distinct things:
1. Running all circuits in the list successfully.
2. Running at least a certain fraction (e.g. 90%) of the circuits successfully.
3. Achieving an average quantitative success -- defined by the average of some success metric (see next section) over all the circuits in the list -- above a specified threshold.
We see the first criterion -- success at all circuits -- as the most clearly useful. While it is clearly the most challenging of the alternatives, it is by no means unreasonable. All the circuits in C(w, d) are of the same size, and it's reasonable to expect a quantum processor that can run some or most circuits of a given size to run all circuits of that size. Moreover, benchmarks that require success at all circuits provide strong guarantees of performance. No user wants to find out (later) that the circuit they need to run is one of the few that their processor can't run!

Success criteria for individual circuits
Passing a test comes down to succeeding at individual circuits. The most obvious definition of success at circuit C is: Running C on an as-built processor always yields the same result that it would on a perfect processor with no errors. But this definition is too strict -- some amount of failure has to be tolerated in order to evaluate today's processors -- and it only applies to definite-outcome circuits that would ideally produce a unique outcome. Most circuits are intended to generate random outputs sampled from a specific but nontrivial probability distribution. Several success criteria are available, and we expect more to be invented. It's critical that when benchmarks are defined and used, the success criterion that was used is (1) specified precisely, and (2) motivated and/or explained. Since there are as yet no hard and fast rules for what makes something a "useful" or valid success criterion, there is potential for abuse -- someone could in principle introduce benchmarks based on weak or meaningless success criteria, and use them for "benchmarketing" (inflating performance figures in a misleading fashion). The defense against this abuse is to demand a clear motivating description of why a new criterion really does indicate "success", and how it compares to other, better studied, criteria.
Here are some candidate criteria:
• The probability of the correct outcome of circuit C is greater than a specific threshold (e.g. 2/3) with high (e.g. 95%) statistical confidence. Note that this only applies to definite-outcome circuits, and see discussion below.
• The heavy output probability [38] of circuit C is greater than a specific threshold (e.g. 2/3) with high statistical confidence. This is the criterion used for quantum volume [1]. Note that this applies to some indefinite-outcome circuits, but not uniform-outcome circuits (where all outcomes are equally probable), and also that it requires the ability to compute all outcome probabilities in advance.
• The distribution of the outcomes of circuit C is within a specific distance of the ideal distribution (according to some specified metric such as total variation distance) with high statistical confidence. This criterion can apply to all circuits (definite-, indefinite-, and uniform-outcome), but requires computing outcome probabilities in advance. Also, the amount of data required to achieve statistical significance increases with the entropy of the distribution.
• The coarse-grained outcome distribution is close to ideal (precisely as stated in the previous criterion) with respect to some specific coarse-graining. See discussion below -this is a powerful class of criteria that includes heavy output probability as a special case.
• The probability of an output within a specified Hamming distance (e.g. 2) of the correct outcome is greater than a certain threshold (e.g. 2/3). This definition captures the idea that in a many-qubit processor, some results are more wrong than others. If only few bits ever get flipped, this suggests that the processor may be suitable for quantum error correction.
• The observed outcome frequencies of circuit C are consistent with the predicted probabilities, meaning that the empirical cross-entropy between them is below a specified threshold. This interesting quantifier of success is at the heart of Google's proposal for demonstrating quantum supremacy [33,34].
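Several of the candidate criteria above reduce to simple statistics of the measured counts. The following is a minimal sketch, assuming outcomes are stored as bitstrings and that `ideal_probs` lists the ideal probability of every one of the 2^w outcomes; the function names are illustrative and are not taken from any existing protocol or library. The functions return empirical frequencies only, so a separate statistical step is still needed before claiming that a threshold has been exceeded "with high confidence".

```python
from statistics import median


def heavy_outputs(ideal_probs):
    """Return the outcomes whose ideal probability is above the median.

    Assumes ideal_probs contains an entry for every possible outcome.
    """
    med = median(ideal_probs.values())
    return {x for x, p in ideal_probs.items() if p > med}


def heavy_output_frequency(counts, ideal_probs):
    """Fraction of observed shots that landed on a heavy outcome."""
    heavy = heavy_outputs(ideal_probs)
    shots = sum(counts.values())
    return sum(n for x, n in counts.items() if x in heavy) / shots


def total_variation_distance(counts, ideal_probs):
    """Total variation distance between observed frequencies and ideal probabilities."""
    shots = sum(counts.values())
    outcomes = set(counts) | set(ideal_probs)
    return 0.5 * sum(
        abs(counts.get(x, 0) / shots - ideal_probs.get(x, 0.0)) for x in outcomes
    )


def near_correct_frequency(counts, correct, max_distance=2):
    """Fraction of shots within a given Hamming distance of the correct bitstring."""
    def hamming(a, b):
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    shots = sum(counts.values())
    return sum(
        n for x, n in counts.items() if hamming(x, correct) <= max_distance
    ) / shots
```

Each value would then be compared against the chosen threshold (e.g. 2/3 for the heavy output frequency, or a maximum allowed distance for the total variation distance).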
A rather long paper could be written entirely about success criteria. We do not have space to discuss this topic in that level of detail! Our goal here is to state, and demonstrate, the simple idea that there are many useful ways to define "success" at running a circuit. Most of the definitions given here have appeared in testing and characterization protocols already.
Two points are worth touching on briefly, though. First, as noted in the first criterion (probability of the correct outcome), even the simplest criteria have some subtleties. The precise probability of a given outcome is not accessible through experiment. Flipping a coin 1000 times and observing 689 heads doesn't mean that Pr(heads) = 0.689. We can only make statements such as: given the empirical frequency of 0.689, Pr(heads) is greater than 2/3 with reasonably high confidence, and greater than 1/2 with very high confidence. It's also not clear where to put the threshold. 2/3 is common in computer science, but this is explicitly arbitrary. The best-motivated threshold may be problem-dependent.
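One way to make the coin-flip statement concrete is an exact one-sided binomial test: if the true probability were no larger than the threshold, how likely would it be to observe at least this many successes? The sketch below is an illustration under that assumption (the choice of test is ours, not something prescribed here), and `binomial_upper_tail` is a hypothetical helper name. A small tail probability corresponds to high confidence that the true probability exceeds the threshold.

```python
import math


def binomial_upper_tail(successes, shots, p):
    """P(X >= successes) for X ~ Binomial(shots, p), with 0 < p < 1.

    Each term is evaluated in log space to avoid overflow in the
    binomial coefficients.
    """
    log_p = math.log(p)
    log_q = math.log1p(-p)
    terms = []
    for k in range(successes, shots + 1):
        log_pmf = (
            math.lgamma(shots + 1)
            - math.lgamma(k + 1)
            - math.lgamma(shots - k + 1)
            + k * log_p
            + (shots - k) * log_q
        )
        terms.append(math.exp(log_pmf))
    return math.fsum(terms)


# 689 heads in 1000 flips, tested against the two thresholds from the text.
tail_two_thirds = binomial_upper_tail(689, 1000, 2 / 3)
tail_one_half = binomial_upper_tail(689, 1000, 1 / 2)
```

For this data the tail probability at threshold 2/3 is a few percent, while at threshold 1/2 it is vanishingly small, matching the qualitative statement in the text.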
Second, we note that coarse-graining of the output distribution is a powerful and general technique. This means taking the 2^w possible outcomes of a circuit on a w-qubit processor and dividing them into k ≪ 2^w bins according to some rule. Now we can discuss, report, and analyze the empirical and predicted distributions over just k outcomes. If k is kept constant as w increases, then many problems with comparing those distributions become manageable.
There are many ways to coarse-grain. Heavy output probability, which is used in quantum volume [1] and also in other quantum supremacy benchmarks [38], is one example. By computing all the outcome probabilities in advance, the outcomes are divided into just two bins: those with probability greater than the median, and those with probability less than the median. Now, evaluating success is simply a matter of comparing two 2-outcome distributions (the empirical one and the predicted one). This is a neat and useful idea -but it's not unique. Many other useful coarse-grainings exist, and should be considered. One is local marginals (obtained by marginalizing over all bits except one). Or, for an algorithm that is supposed to produce an outcome that's uniformly random over a relatively small subset of strings, the natural coarse-graining is into (a) all the strings that should occur, and (b) the rest.
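The coarse-grainings described above can all be expressed as a rule that maps each outcome to a bin label. Here is a minimal sketch, assuming outcomes are bitstrings; `coarse_grain`, `local_marginal`, and `membership` are illustrative names rather than part of any established tool. The same function works on observed counts and on predicted probabilities, so both can be binned identically before they are compared.

```python
from collections import Counter


def coarse_grain(distribution, bin_of):
    """Sum the weight (counts or probabilities) of all outcomes in each bin."""
    binned = Counter()
    for outcome, weight in distribution.items():
        binned[bin_of(outcome)] += weight
    return dict(binned)


def local_marginal(qubit):
    """Binning rule that keeps one bit and marginalizes over all the others."""
    return lambda bitstring: bitstring[qubit]


def membership(allowed):
    """Binning rule for 'strings that should occur' versus 'the rest'."""
    allowed = frozenset(allowed)
    return lambda bitstring: "expected" if bitstring in allowed else "rest"
```

For example, `coarse_grain(counts, local_marginal(0))` reduces a w-qubit histogram to a two-outcome distribution for the first bit.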

Examples of volumetric benchmarks
The motivation for the VB framework presented here is simple: it gathers up and generalizes many existing techniques that have been proposed and used to evaluate the performance of quantum processors. Every aspect of the framework discussed in the previous sections was inspired directly by some specific known protocol (although to the best of our knowledge the visualization and reporting paradigm illustrated in the next section is novel). So, in this section, we give some examples of how existing protocols might be adapted into VBs.
We emphasize that the VB based on an existing protocol is not necessarily identical to the protocol itself, because the data analysis is often different. QCVV protocols generally define not just a circuit family, but also specific analysis routines that extract intrinsic properties of a device. For example, simultaneous RB data is usually analyzed via a nontrivial procedure to detect crosstalk, and gate set tomography data is usually fit to a complex Markovian gate set model. So when we co-opt a protocol's circuit definitions to use as a VB, the associated analyses are not part of the VB framework. A VB based on GST might inherit the same circuit families used for GST, but analyze the data in a totally different (and much simpler) way that makes no attempt to infer a gate set. Furthermore, it may be useful to generalize the circuit families defined by a given QCVV protocol. For instance, the standard GST analysis makes use only of circuits with depths equal to powers of two. When GST circuits are adapted for use as a volumetric benchmark this restriction can be easily relaxed to construct circuits of nearly any depth.
As we observed above, circuit classes are the heart of a VB. Changing the circuit class structure (e.g. from random circuits to periodic ones) creates an entirely new and incomparable VB, whereas other changes (e.g. to the success metric) may have little effect. So we organize our list of examples by circuit type. This is not intended to be exhaustive; there are undoubtedly some good examples that we fail to mention here. For each example, we list all the specific properties and choices that define a VB, as well as a brief description of the device property that is measured by the benchmark.

Random circuits
The examples given here, based on various types of random circuits, require almost no adaptation to fit in the VB framework. Data from randomized benchmarking, scrambling (quantum volume) circuits, and random quantum supremacy circuits are usually analyzed in simple ways that are consistent with the VB framework. Although RB data is often used to estimate a decay rate -which goes outside the analysis we suggest for a VB -this simple estimation of a parameter is very closely correlated with the VB analysis, which would simply identify the size/shape of quantum circuits where the overall failure probability drops below a pre-established threshold.
Protocol: Clifford randomized benchmarking [14,15].

Protocol: Direct randomized benchmarking [31].
Circuit family C(w, d): Sequences of arbitrary w-qubit depth-1 layers chosen from a set that generates the w-qubit Clifford group, prefixed and postfixed by random-stabilizer-initialization and inversion Clifford subroutines.
Measure over circuits: Random over sequences of layers, i.i.d. with respect to each layer, but with adjustable constraints and probability distributions on layers (see citation for details).
Definition of "depth": Physical, with an additive constant that scales with w.
Per-circuit success metric: Probability of unique correct outcome.
Family-wise success metric: Uniform average over all sampled circuits.
Property probed: Average error of randomized native-gate circuits.

Protocol: Simultaneous randomized benchmarking [32].
Circuit family C(w, d): Sequences of arbitrary w-parallel 1-qubit Clifford subroutines, each postfixed by a single fully-constrained inversion Clifford.
Measure over circuits: Uniformly random over the first d − 1 Cliffords, and independent for each qubit.

Application circuits
The example VBs listed in this section are speculative. We are not aware of any experiments or explicit proposals that could be evaluated as a VB in current form. However, for the two circuit families we suggest -iterations of a Grover step, and steps of a Trotterized Hamiltonian simulation -we believe it is obvious that a VB could be built around them. Exactly how to do so is not obvious, though. Many variations are possible, and we do not attempt to nail down the specifics here. Establishing the best way to construct circuits for these VBs, and identifying other algorithms/applications that can form the basis of VBs, is a promising area for future work. (Note: VBs based on syndrome extraction circuits for quantum error correction are also good candidates for "application-focused" benchmarks, but identifying a sensible circuit family is nontrivial for several reasons. For example, in contrast to all other examples, increasing the number of physical qubits w is likely to increase the success probability if it increases the code distance.)

Protocol: Grover iterations [25].
Circuit family C(w, d): d iterations of a single w-qubit Grover step, alternating oracle marking and reflection steps. Oracle can be defined by user; simplest choice is to just mark |0⟩ or a randomly chosen fixed basis state.
Measure over circuits: Unique circuit unless the oracle is allowed to vary; for the simple oracle suggested above, the marked state could be uniformly random.
Definition of "depth": Logical; physical depth depends strongly on architecture and even more strongly on the oracle chosen if it is nontrivial.
Per-circuit success metric: Heavy outcome probability, cross-entropy, or other metric suitable for arbitrary non-definite outcome distributions.
Family-wise success metric: Average of single-circuit criterion (if applicable).
Property probed: Ability to run Grover's algorithm without errors.
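To see why the Grover entry calls for a metric suited to non-definite outcome distributions, it helps to look at the ideal behavior. The sketch below uses the standard textbook expression for the probability of measuring the marked state after d error-free Grover iterations with a single marked basis state among 2^w; this formula is not taken from this paper, and `grover_ideal_success` is an illustrative name.

```python
import math


def grover_ideal_success(width, iterations):
    """Ideal probability of measuring the single marked basis state.

    Assumes one marked state among 2**width and a noiseless processor:
    P = sin^2((2 * iterations + 1) * theta), with theta = asin(2**(-width / 2)).
    """
    n_states = 2 ** width
    theta = math.asin(1 / math.sqrt(n_states))
    return math.sin((2 * iterations + 1) * theta) ** 2


# Ideal success probability for a 3-qubit search at depths d = 0..3.
profile = [grover_ideal_success(3, d) for d in range(4)]
```

Even without errors the marked-state probability oscillates with d and is generally below 1, so a circuit of arbitrary depth d in this family does not have a unique correct outcome.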
Protocol: Trotterized Hamiltonian simulation [40].
Circuit family C(w, d): d iterations of a single Trotter step for simulating a w-qubit Hamiltonian comprising a sum of local terms.
Measure over circuits: Various choices: both input state and Hamiltonian could be varied either randomly over some measure or systematically. Alternatively, a single physically motivated Hamiltonian and input state could be used (with w controlling fineness of discretization).
Definition of "depth": Logical; physical depth depends strongly on architecture and even more strongly on the Hamiltonian chosen.
Per-circuit success metric: Heavy outcome probability, cross-entropy, or other metric suitable for arbitrary non-definite outcome distributions. Alternatively, error in empirically estimated expectation value of an observable of physical interest must be < threshold.
Family-wise success metric: Various choices if C(w, d) is non-unique; simplest choice is to require all circuits to succeed.
Property probed: Duration over which Hamiltonian time evolution can be simulated accurately.

Reporting and interpreting VB results
Note: The figures in this section are for illustration purposes only. They do not represent real data; the "data" presented in these figures do not come from real devices or rigorous simulations!
In this section, we propose and illustrate a consistent, informative visual representation of data from volumetric benchmarking experiments. It allows a single figure to provide a comprehensive answer to the nontrivial question "What quantum circuits of a specific type can (and can't) this device run successfully?" The value of a VB, as an alternative to single-number metrics like quantum volume, is the richness of the answer it provides to this question. This richness is, in general, better suited to visual display. A useful volumetric benchmarking plot should allow a reader to quickly understand the gross performance of one or more quantum devices under some particular metric of success. At a minimum, it should address the following questions:
1. What volumetric benchmark was used?

For which width/depth pairs were experiments successful?
We display benchmark results by arranging circuit shapes (width/depth pairs) in a grid, with circuit depth increasing along the abscissa (x-axis) and circuit width increasing along the ordinate (y-axis). We specify the circuit family and success metric in the figure title and caption, and strongly recommend this practice -many VBs are possible, and a precise specification of which VB was used is critical for reproducibility. The figures in this section are intended to illustrate how an experimental paper might display real data. Each figure contains and compares multiple sub-figures, each with its own caption. The sub-figure captions are in plain Roman type (e.g. "Figure 1(c)") and illustrate how real data might be captioned. Main figure captions are in bold (e.g. "Figure 1:"), and represent our own views.
For small, NISQ-scale processors, linear axes work well. But displaying data from processors that can achieve width or depth ≫ 1 typically demands logarithmic axes. Current state of the art processors can achieve depth > 1000, but width at most 10-20. To display such capabilities, we generally mark the depth axis with integer powers of 2, but mark the width axis with integer powers of 1.2 (rounded to an integer, duplicates removed). The rounding process results in an approximately linear scaling up to ∼ 10 qubits before transitioning to a more apparent logarithmic scale, as in Fig. 5.
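The axis marking rule just described is easy to reproduce. The following is a small sketch of it; the function names `depth_ticks` and `width_ticks` are illustrative and are not taken from the supplementary code.

```python
def depth_ticks(max_depth):
    """Integer powers of 2 up to and including max_depth."""
    ticks, depth = [], 1
    while depth <= max_depth:
        ticks.append(depth)
        depth *= 2
    return ticks


def width_ticks(max_width, base=1.2):
    """Powers of `base` (base > 1), rounded to integers, duplicates removed."""
    ticks, exponent = [], 0
    while True:
        width = round(base ** exponent)
        if width > max_width:
            break
        # Rounded powers are non-decreasing, so duplicates are always adjacent.
        if not ticks or width != ticks[-1]:
            ticks.append(width)
        exponent += 1
    return ticks
```

For a 16-qubit device, `width_ticks(16)` gives every width from 1 to 7 followed by 9, 11, 13, and 15, which shows the transition from roughly linear to logarithmic spacing.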
The easiest VBs to report are those with binary success criteria -each shape (width/depth pair) either passes or fails. In this situation, we indicate successes with a filled square. Failed shapes can be indicated, if desired, with a hollow square (to distinguish "test failed" from "test not performed"). If a benchmark's success metric is real-valued, we place a solid marker at each tested shape and use shading to indicate the observed value of the success metric. Figure 2 illustrates how both binary and real-valued VB results can be displayed for a hypothetical two-qubit system. Even a quick glance at Fig. 2 reveals the most important result, which is that two-qubit (w = 2) circuits perform significantly worse than single-qubit (w = 1) circuits.
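As a lightweight illustration of the binary display convention, the grid can be rendered as plain text. This is only a sketch and is not the plotting code used for the figures; it assumes results are stored in a dictionary keyed by (width, depth), with `True` for a pass, `False` for a fail, and no entry for shapes that were not tested. The name `render_grid` is hypothetical.

```python
def render_grid(results, widths, depths):
    """Render pass/fail results with depth along x and width along y.

    A filled square marks a pass, a hollow square marks a fail, and a
    blank cell marks a shape that was not tested.
    """
    lines = []
    for w in sorted(widths, reverse=True):  # larger widths at the top
        cells = []
        for d in depths:
            outcome = results.get((w, d))
            if outcome is None:
                cells.append(" ")
            else:
                cells.append("■" if outcome else "□")
        lines.append((f"w={w:>2} " + " ".join(cells)).rstrip())
    return "\n".join(lines)
```

Keeping "fail" and "not tested" visually distinct preserves the difference between a negative result and missing data.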
The two-qubit examples demonstrate the core features of volumetric benchmarking plots, but become much more useful for larger processors. The most relevant information from benchmarks with binary success criteria is contained in the Pareto frontier that separates the regions of passed and failed tests. The Pareto frontier is defined by the maximum depth, at each circuit width, for which the processor is reliable under the test metric. It bounds the feasible region.
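Extracting the frontier from a table of pass/fail results takes only a few lines. The sketch below makes one assumption that the text leaves open: a depth counts as reliable only if every shallower tested depth at the same width also passed, so an isolated pass beyond a failure is ignored. The name `pareto_frontier` and the (width, depth) dictionary layout are illustrative.

```python
def pareto_frontier(results):
    """Map each tested width to the largest reliably passed depth.

    `results` maps (width, depth) to True (pass) or False (fail).
    Returns 0 for a width whose shallowest tested circuit failed.
    """
    frontier = {}
    for w in sorted({width for width, _ in results}):
        best = 0
        for d in sorted(depth for width, depth in results if width == w):
            if not results[(w, d)]:
                break  # stop at the first failure for this width
            best = d
        frontier[w] = best
    return frontier
```

Plotting the returned depth against width traces the boundary of the feasible region.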
In most cases, the device's performance can be effectively conveyed just by plotting the Pareto frontier, without showing the success/failure of every shape in the grid. This provides simpler, cleaner plots and is especially useful for superposing and comparing results from multiple devices in a single figure. Figure 3 illustrates this. Figures 4 and 5 provide further examples of how showing the Pareto frontier can either replace or augment the basic grid display.
In many situations, prior information about the expected performance of a processor is available. This usually constitutes (or implies) a predictive model of the processor -e.g., one based on one- and two-qubit calibration experiments. It can be extremely useful to compare experimental benchmarking data to the predictions of such a model. We can do this by displaying both the empirical and predicted behavior on the same plot. If they disagree, then the form of the discrepancy can highlight and help to identify emergent failure mechanisms that occur in sophisticated processors, such as crosstalk and non-Markovianity. Figures 4, 5, and 6 show how to do this. In each of these figures, we display the experimental data as before using large squares, and add small squares to represent circuit shapes that the model predicted should pass. This technique can be extended to quantitative metrics as shown in Fig. 5(b). In that plot, we use shading to indicate quantitative performance, so the color of the large square indicates observed performance, while the color of the small square indicates predicted performance. Discrepancies between the two are apparent by inspection. Figure 5(b) takes this one step further. To highlight one advantage of plotting predictions and data together, we use red boxes to indicate regions on the plot where a specific discrepancy between observed and predicted performance suggests what might be going wrong. In general, these discrepancies indicate that the model is insufficient to describe the dynamics of the device. In particular, width deficiencies can indicate bad qubits, bad two-qubit gates, or the appearance of emergent crosstalk. Depth deficiencies can also arise from crosstalk, but for width-1 circuits they are indicative of non-Markovianity (or coherent errors if the benchmark being used is sensitive to them).
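The comparison between observed and predicted behavior can also be tabulated before it is plotted. This is a minimal sketch assuming both data sets are pass/fail dictionaries keyed by (width, depth); `classify_discrepancies` and its category labels are illustrative names only.

```python
def classify_discrepancies(observed, predicted):
    """Sort circuit shapes by how observation compares with the model.

    Only shapes present in both dictionaries are compared.
    """
    report = {"agree": [], "worse_than_model": [], "better_than_model": []}
    for shape in sorted(set(observed) & set(predicted)):
        if observed[shape] == predicted[shape]:
            report["agree"].append(shape)
        elif predicted[shape]:
            # The model predicted success but the device failed.
            report["worse_than_model"].append(shape)
        else:
            report["better_than_model"].append(shape)
    return report
```

Shapes listed under `worse_than_model` are the natural candidates for the highlighted regions discussed above, since they point to failure mechanisms that the model does not capture.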
We conclude by showing how quantum volume relates to -and can be shown within -the VB visualization framework. Obviously, making a rigorous connection to quantum volume as defined in Ref. [1] requires choosing and performing a specific benchmark (the one proposed in that work)! But even for that benchmark, there are advantages to the richer reporting that we propose. They are illustrated in Fig. 6. In this example, the quantum volume experiments are explicitly highlighted along the diagonal, as is a region of implied success (those circuits with width and depth less than or equal to log_2 of the quantum volume). This particular hypothetical device has a quantum volume of 2^8, which is clear in the plot -but it's capable of implementing significantly deeper circuits if restricted to slightly fewer qubits. As usual, logarithmic axes allow the display of a much wider range of circuit shapes, as shown in Fig. 6(b).
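The relationship between a grid of VB results and quantum volume can be sketched in code. The version below is a simplification and is not the full definition in Ref. [1]: it assumes scrambling-circuit results stored as a pass/fail dictionary keyed by (width, depth), and it takes log_2 of the quantum volume to be the largest n for which every square shape up to (n, n) passed. Both function names are hypothetical.

```python
def log2_quantum_volume(results):
    """Largest n such that the square shapes (1,1) ... (n,n) all passed.

    Untested squares are treated as failures (a conservative assumption).
    """
    n = 0
    while results.get((n + 1, n + 1), False):
        n += 1
    return n


def implied_success_region(results):
    """Shapes with both width and depth at most log2 of the quantum volume."""
    n = log2_quantum_volume(results)
    return {(w, d) for w in range(1, n + 1) for d in range(1, n + 1)}
```

Off-diagonal entries, such as a narrow but much deeper circuit that passed, do not affect this number, which is exactly the information a single-number metric discards.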
The sample volumetric benchmarking plots shown in the next pages were generated using Python code, which is provided as supplementary material.

[Figure 2 panel title: Volumetric Benchmarking of 2 Qubits; Scrambling Circuits; Heavy Output Probability > 2/3]

Figure 2(a). Two-qubit volumetric benchmarking using scrambling circuits. Circuit ensemble passes if the average heavy-output probability is greater than 2/3. These results indicate that while single-qubit circuits up to depth ∼ 2000 can be performed reliably, two-qubit circuits are limited to depth ∼ 8.

Figure 2(b). Two-qubit volumetric benchmarking using scrambling circuits. Circuit ensemble passes if the average heavy-output probability is greater than 2/3. These results indicate that while single-qubit circuits up to depth ∼ 2000 can be performed reliably, two-qubit circuits are limited to depth ∼ 8.

Figure 2(c). Two-qubit volumetric benchmarking using scrambling circuits. Displayed is the heavy output probability for each width/depth pair tested. These results indicate that while single-qubit circuits up to depth ∼ 2000 can be performed reliably, two-qubit circuits are limited to depth ∼ 8.

Figure 6(a). Volumetric benchmarking of a 16 qubit device using scrambling circuits. If at least 2/3 of the measurement results are heavy for a given width/depth pair, then the pair passes the test and is marked with a large, solid blue box. Using linear axes, the quantum volume experiments appear along the diagonal and are outlined with heavy, red lines. For this example, log_2(V_Q) = 8. It is expected that scrambling circuits with both width and depth less than or equal to log_2(V_Q) should succeed, and we highlight these with a grey background.

Figure 6(b). Volumetric benchmarking of a 16 qubit device using scrambling circuits. If at least 2/3 of the measurement results are heavy for a given width/depth pair, then the pair passes the test and is marked with a large, solid blue box. Using logarithmic axes, the quantum volume experiments appear along a curved line and are outlined with heavy, red lines. For this example, log_2(V_Q) = 8. It is expected that scrambling circuits with both width and depth less than or equal to log_2(V_Q) should succeed, and we highlight these with a grey background.

Figure 6: Logarithmic axes allow more data (more circuit shapes) to be compressed into the same space. In Fig. 6(b), the depths are powers of 2 and the widths are powers of 1.2 (rounded and with duplicates removed). This width scaling was chosen to deal with the relatively small size of contemporary devices. This data is for illustration purposes only and does not come from a real device or rigorous simulation!