Gate Set Tomography

Gate set tomography (GST) is a protocol for detailed, predictive characterization of logic operations (gates) on quantum computing processors. Early versions of GST emerged around 2012-13, and since then it has been refined, demonstrated, and used in a large number of experiments. This paper presents the foundations of GST in comprehensive detail. The most important feature of GST, compared to older state and process tomography protocols, is that it is calibration-free. GST does not rely on pre-calibrated state preparations and measurements. Instead, it characterizes all the operations in a gate set simultaneously and self-consistently, relative to each other. Long-sequence GST can estimate gates with very high precision and efficiency, achieving Heisenberg scaling in regimes of practical interest. In this paper, we cover GST's intellectual history, the techniques and experiments used to achieve its intended purpose, data analysis, gauge freedom and fixing, error bars, and the interpretation of gauge-fixed estimates of gate sets. Our focus is fundamental mathematical aspects of GST, rather than implementation details, but we touch on some of the foundational algorithmic tricks used in the pyGSTi implementation.

I. Introduction

Getting to useful quantum computation will require engineering quantum bits (qubits) that support high-fidelity quantum logic operations, and then assembling them into increasingly large quantum processors, with many coupled qubits, that can run quantum circuits. But qubits are susceptible to noise. High-fidelity logic operations are hard to perform, and failure rates as low as 10^-4 are rare and remarkable. Useful quantum computation may require fault-tolerant quantum error correction, which distills a few high-fidelity logical qubits from many noisy ones. Threshold theorems [1-4] prove that this is possible in realistic quantum processors, provided the processor's errors satisfy certain conditions. Each threshold theorem assumes different conditions on the errors, but in general they must (1) be only of certain types, and (2) occur at a rate below a certain value (generally around 10^-3 to 10^-4 for the most common types). So it is not an exaggeration to say that the entire program of practical quantum computing depends on understanding and characterizing the errors that occur in as-built processors, controlling their form, and reducing their rates.
Assessing how qubits, logic operations, and entire processors behave is the central task of quantum device characterization, a.k.a. "quantum characterization, verification, and validation" (QCVV). There are many protocols for this task, all of which share the same basic structure: a set of experiments, described by quantum circuits, is performed on the processor, yielding data that is processed according to some algorithm to produce an estimate of some aspect of the processor's behavior. Some protocols produce highly detailed predictive models of each logic operation (tomography). Others produce semi-predictive partial characterizations (component benchmarking). And some assess the performance of entire processors (holistic benchmarking [75-87]).
All of these methods are valuable parts of the QCVV toolbox, serving distinct (though overlapping) purposes. Some tasks demand simple holistic summaries of overall performance, while others are best served by partial characterization. But for debugging, rigorous device modeling, and reliable prediction of complex circuit performance (including the circuits that perform quantum error correction) there is as yet no substitute for detailed tomography of individual logic operations, a.k.a. quantum gates. The most powerful and comprehensive technique for this task is currently gate set tomography (GST). Developed around 2012-13, GST has been used in a variety of experiments [33, 35-37, 88-93], discussed in the literature [68, 69, 94-103], and implemented in a large open-source software project [104,105]. But no comprehensive explanation of the theory behind GST has appeared in the literature. This paper fills that gap.

A. Gate set tomography
The term "gate-set tomography" first appeared in a 2012 paper from IBM Research [106] that introduced several of GST's core concepts, but the group did not pursue GST further. Around the same time, our group at Sandia National Labs started pursuing an independent approach to the same problem [33,35], which we have developed extensively since then [36,37,89,105]. This approach to GST, which has been used in a number of additional experiments over the past 5 years [88, 90-93, 98, 99, 107], is the one that we discuss and present here. We discuss its relationship to IBM's original approach where appropriate.
Like process tomography, GST is intended to characterize how logic operations (e.g. gates) impact their target qubits. The gates' target subsystem and its state space need to be specified up front, and GST reconstructs the gate-driven evolution of those targets only. GST is not designed to probe the holistic performance of many-qubit processors, and as most commonly used, it does not capture general crosstalk. (Although GST can easily be adapted to focus on specific types of crosstalk, and characterize them in detail.) GST differs from state and process tomography, which we discuss later in Sec. II B, in two fundamental ways. First, it is almost entirely calibration-free (or "self-calibrating"). When GST reconstructs a model of a quantum system, it does not depend on a prior description of the measurements used (as does state tomography) or the states that can be prepared (as does process tomography). Second, GST reconstructs or estimates not a single logic operation (e.g., state preparation or logic gate), but an entire set of logic operations: a gate set including one or more state preparations, measurements, and logic gates. These two features are inextricably linked, and we discuss this connection later in this paper.
GST's independence from calibration is its original raison d'etre. As IBM pointed out [106], state and process tomography become systematically unreliable when the "reference frame" operations on which they rely (pre-calibrated measurements for state tomography, pre-calibrated state preparations and measurements for process tomography) are either unknown, or misidentified. This limits their practical utility and makes them unsuitable for rigorous characterization.
GST is not the only calibration-free method. Today, new characterization protocols are more or less expected to be calibration-free. The oldest calibration-free protocol is randomized benchmarking (RB) [38,39], which predates GST by almost 10 years. Detailed comparisons between RB and GST have been given elsewhere [36,37], and they are complementary tools addressing different needs.
This paper is not a GST "how-to" manual. We do provide some incidental discussion of how to perform GST in practice, and how it is implemented in our pyGSTi software [104], but only inasmuch as it illustrates aspects of the theory. Separately, we recently published a concise guide to applying GST to characterize hardware [105]. In contrast, this article is intended as a theory paper presenting GST's mathematical background, justifications, and derivations.

B. Section guide & expert summary
Because this paper attempts to be comprehensive, it is rather long. So we begin here with a short expert-level summary of the major ideas and results, which may help readers decide what parts to read. A reader who understands everything in this section may not need to read any further at all! Other readers may wish to skip to the first part that discusses a statement (from this summary) that is not obvious.
We begin in Section II by establishing mathematical and conceptual foundations. We introduce quantum logic operations (II A 1); the mathematical representations of states and measurements (II A 2); and superoperators as a model of noisy quantum logic gates, including transfer matrix and Choi matrix representations, and introduce super-Dirac (superbra / superket) notation (II A 3). We introduce gate sets, their representation as a collection of superoperators and operators, and gauge freedom (II A 4). Finally, we establish notation and conventions for quantum circuits and the superoperator τ(g) caused by a quantum circuit g (II A 5).
In the second part of Section II, we review standard forms of tomography for states, processes, and measurements (II B). We conclude this introduction by explaining the role of calibration in standard tomography (II C) to set the stage for GST.
Section III is devoted to linear inversion GST (LGST). We start with the history of GST, describing how LGST solved some of the problems with the initial versions of GST (III A). We then derive LGST, emphasizing parallels with process tomography (III B), and introducing techniques required for long-sequence GST. Readers inclined to skip over the mathematics in Section III B should feel free to do so, as these derivations are not prerequisites for subsequent material. In the last parts of Section III, we address three issues left unanswered in the derivation: how to deal with overcomplete data (III C), how to prepare fiducial states and measurements (III D), and how to fit LGST data better with maximum likelihood estimation (III E).
Section IV introduces long-sequence GST, the standard "GST" protocol. (LGST came first, and is easier to analyze theoretically, but is rarely used alone in practice.) We introduce the two main motivations for long-sequence GST, which are (1) much higher accuracy, and (2) the ability to impose constraints such as complete positivity. After discussing early and obsolete versions of long-sequence GST (IV A), we explain how to construct circuits that amplify errors and enable Heisenberg-limited accuracy (IV B). Then, we define the GST likelihood function and show how to maximize it to find the maximum likelihood estimate of a gate set model (IV C).
Section V's two subsections cover two "advanced" aspects of long-sequence GST. The first (V A) conceptualizes gate sets as parameterized models, introducing the term "gate set model", and discusses how to impose constraints like trace preservation or complete positivity. The second (V B) describes "fiducial pair reduction", a method for eliminating redundancy among the standard set of GST circuits and thereby reducing the number of circuits prescribed.
Section VI addresses the question "What can you do with a GST estimate once you have it?" We start with the critical step of assessing goodness-of-fit and detecting model violation or non-Markovianity (VI A). We also explain what we mean by "Markovian" in this context, and why model violation is associated with "non-Markovian" behavior. We then turn to the important problem of extracting simplified metrics (e.g., fidelity) from a gate set, explain why choosing a gauge is necessary, and explain how to do so judiciously using a process we call "gauge optimization" (VI B). Finally, we discuss how to place error bars around that point estimate, including how to deal with the complications induced by gauge freedom (VI C).
Appendices address some of the most technical and tangential points. These include the gauge (A), the historically-important eLGST algorithm (B), specific implementation choices made in pyGSTi (C), numerical studies that support claims made in the main text (D), and the bias we find in a particular estimator (E). Appendices are referenced from the text where applicable.

A. Mathematical Preliminaries
We begin with a concise introduction to the mathematics, formalism, and notation used to describe qubits, multiqubit processors, their logic operations, and the programs (quantum circuits) that they perform.

[FIG. 1 caption: Because the models used by GST treat distinct layers as separate n-qubit gates (indicated by the arrow), FI/CO circuits are, for the purposes of this paper, composed of: an n-qubit state preparation, a sequence of n-qubit gates (layers), and an n-qubit measurement. The symbols ρ, G_γ, and M = {E_j}_j label these operations within a gate set (cf. Eq. 6). The particular circuit shown begins with preparation in the i-th available state, proceeds with execution of gates indexed by γ_1, . . . , γ_6, and concludes with performance of the m-th available measurement.]

Quantum processors and quantum circuits
An idealized quantum information processor comprises one or more qubits that can be (i) initialized, (ii) transformed by quantum logic gates, and (iii) measured or "read out" to provide tangible classical data [125]. Using such a processor is usually described by physicists as running experiments, and by computer scientists as running programs. These are the same thing. Experiments/programs are described by quantum circuits (see Figure 1a), which specify a schedule of logic operations.
In this paper, the experiments we consider correspond to circuits that begin with initialization of all the qubits, conclude with measurement of all the qubits, and apply zero or more logic gates in the middle.
If a processor contains just one qubit, then only one logic gate can be performed at a time, and every circuit corresponds to a state preparation, a sequence of logic gates, and a measurement. Circuits on two or more qubits typically specify logic gates on one or two qubits to be performed in parallel on disjoint subsets of qubits. In this case, the operations to be performed in parallel during a single clock cycle constitute a layer of the circuit, and every circuit corresponds to a preparation, a sequence of layers, and a measurement. Models of many-qubit circuits must then have a means of describing how a circuit layer propagates an n-qubit quantum state based on the one- and two-qubit gates within that layer.
Such modeling, when the layers are composed of imperfect gates, is nontrivial [126].
The implementation of GST described in this paper utilizes models in which each circuit layer is an independent operation. That is, the only models we consider describe n-qubit operations (gates) that never occur in parallel. Every circuit layer is an independent "gate", and all gates act on all the qubits. This motivates our synonymous use of "gate" and "circuit layer" throughout the text. For example, a GST model assumes no a priori relationship between the behavior of the following three layers (labeled by G for "gate"):

• "G_XI" - An X gate on qubit 1, in parallel with an idle gate on qubit 2.
• "G_IX" - An idle gate on qubit 1, in parallel with an X gate on qubit 2.
• "G_XX" - An X gate on qubit 1, in parallel with an X gate on qubit 2.
This approach rapidly becomes unwieldy as the number of qubits grows, since exponentially many distinct layers are generally possible. While this limits the scalability of the GST protocol (which was developed in the context of one- and two-qubit systems, where modeling parallel gates wasn't an issue), many of the ideas and theory surrounding it can be extended to many-qubit models [127].

QCVV protocols, including tomography, seek to deduce properties of a processor's elementary logic operations from the results of running circuits. Doing so requires a model for all three kinds of logic operations (initialization, gates, and measurements) that has adjustable parameters and can be used to predict circuit results probabilistically. The most common model, and the one used by GST, is based on Hilbert-Schmidt space.

States, measurements, and Hilbert-Schmidt space
A quantum system -e.g., an n-qubit processor -is initialized into a particular quantum state. A quantum state enables predicting probabilities of outcomes of any measurements that could be made on the system. After initialization, the state evolves as gates are applied. The final state, just before the processor is measured, determines the probabilities of all possible measurement outcomes. So we model the three kinds of quantum logic operations as follows: initialization is modeled by a quantum state; logic gates are modeled by dynamical maps that transform quantum states; measurements are described by linear functionals that map quantum states to probability distributions.
A quantum system is described with the help of a d-dimensional Hilbert space H = C^d, where d is the largest number of distinct outcomes of a repeatable measurement on that system (and is therefore somewhat subjective). For a single qubit, d = 2 by definition, and for a system of n qubits, d = 2^n. A quantum state is described by a d × d positive semidefinite, trace-1 matrix ρ that acts on its Hilbert space (ρ : H → H) [128]. This is called a density matrix. A matrix must be positive semidefinite and have trace 1 in order to represent a physically valid state. Such density matrices form a convex set, and it is very convenient to embed this set in the complex d^2-dimensional vector space of d × d matrices. This vector space is called Hilbert-Schmidt space and denoted by B(H). We are almost always concerned only with Hermitian (self-adjoint) matrices, which form a real d^2-dimensional subspace of Hilbert-Schmidt space. In this paper we will often disregard the constraints on density matrices and use the symbol ρ to describe any element of Hilbert-Schmidt space, stating explicitly if/when the trace and positivity constraints are to be assumed.
Hilbert-Schmidt space is equipped with an inner product ⟨A, B⟩ ≡ Tr(A†B) that is used often in quantum tomography. We represent an element B of Hilbert-Schmidt space by a column vector denoted |B⟩⟩, and an element of its (isomorphic) dual space by a row vector denoted ⟨⟨A|, and so ⟨⟨A|B⟩⟩ = Tr(A†B). We will represent quantum states as Hilbert-Schmidt vectors, and linear functionals on states (e.g., outcomes of measurements) as dual vectors. Measuring a quantum system yields an outcome or "result", which we assume is drawn from a discrete set of k alternatives. Each outcome's probability is a linear function of the state ρ. Therefore, the i-th outcome can be represented by a dual vector ⟨⟨E_i|, so that Pr(i|ρ) = ⟨⟨E_i|ρ⟩⟩ = Tr(E_i ρ). The laws of probability require that E_i ≥ 0 and Σ_i E_i = 1l. The E_i are called effects. The set {E_i} completely describes the measurement and is called a positive operator-valued measure (POVM). As with states, we sometimes relax positivity constraints on POVM effects for convenience.
We find it useful to define a basis for Hilbert-Schmidt space, {B_i}, with the following properties:

1. B_1 = 1l/√d (so all other elements are traceless),
2. Hermiticity: B_i = B_i†,
3. Orthonormality: Tr(B_i B_j) = δ_ij.

Here, 1l is the d-dimensional identity map.
For a single qubit, the normalized Pauli matrices 1l/√2, σ_x/√2, σ_y/√2, σ_z/√2 constitute just such a basis. For n qubits, the n-fold tensor products of the normalized Pauli matrices satisfy these conditions.
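As a concrete check of these basis properties, the following sketch (plain NumPy, not pyGSTi; helper names are ours) builds the normalized single-qubit Pauli basis, verifies orthonormality under the Hilbert-Schmidt inner product, and forms the two-qubit basis from tensor products:

```python
import itertools
import numpy as np

# Normalized Pauli matrices: a Hermitian, orthonormal basis for the
# Hilbert-Schmidt space of one qubit (d = 2, so dimension d^2 = 4).
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
pauli_basis = [P / np.sqrt(2) for P in (I, X, Y, Z)]

def hs_inner(A, B):
    """Hilbert-Schmidt inner product <A, B> = Tr(A^dagger B)."""
    return np.trace(A.conj().T @ B)

# Verify orthonormality: <B_i, B_j> = delta_ij.
for i, Bi in enumerate(pauli_basis):
    for j, Bj in enumerate(pauli_basis):
        assert np.isclose(hs_inner(Bi, Bj), 1.0 if i == j else 0.0)

# For n qubits, n-fold tensor products inherit all three properties.
two_qubit_basis = [np.kron(A, B) for A, B in itertools.product(pauli_basis, repeat=2)]
print(len(two_qubit_basis))  # 16 = d^2 for d = 4
```

The same construction works for any n, with the basis size growing as 4^n.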
Since states and effects are both Hermitian, choosing a Hermitian basis makes it easy to represent states and effects in the d 2 -dimensional real subspace of B(H) containing Hermitian operators. This trivializes the adjoint ( †) operation. We abuse notation slightly by using "Hilbert-Schmidt space" and B(H) to denote this real vector space, because we almost never use its complex extension.
We conclude with a brief observation about notation and representation of matrices. Conventionally, states (ρ) and effects (E) are represented and thought of as matrices. Matrices are closed under linear combination, so they form a vector space, but they can also be multiplied. However, multiplication is not especially relevant for states and effects. When it appears [e.g., in Born's rule via Tr(Eρ)], its role can be filled by the Hilbert-Schmidt dot product. This allows us to represent states and effects simply as vectors in B(H). As we explain in the next section, operators that act on B(H) are called superoperators. So, generalizing Dirac's notation, we refer to |ρ⟩⟩ as a superket and to ⟨⟨E| as a superbra.

Quantum logic gates
What happens between the beginning of a quantum circuit (initialization into a state ρ) and its end (a measurement {E_i}) is defined by a sequence of circuit layers (Figure 1a). Because we treat each circuit layer as a unique and independent "gate" (cf. Sec. II A 1), the middle portion of a circuit can be seen as a sequence of quantum logic gates (Figure 1b). Gates are dynamical transformations on the set of quantum states. An ideal quantum logic gate is reversible and corresponds to a unitary transformation of H. Such a gate would transform ρ as ρ → UρU† for some d × d unitary matrix U. This is a linear transformation from B(H) to itself, and is called a superoperator. The superoperator describing an ideal (unitary) gate is orthogonal.

Real logic gates are noisy, and not perfectly reversible. They still act linearly on states, so they can be represented by superoperators on B(H). But the superoperators for noisy gates are not orthogonal, and are generally contractive. These superoperators are known as quantum processes or quantum channels. Given any superoperator Λ, we can represent it as a d^2 × d^2 matrix that acts associatively (by left multiplication) on |ρ⟩⟩ ∈ B(H), by just choosing a basis for B(H). This representation is often called the transfer matrix of Λ. We adopt this terminology, and denote it by S_Λ. Thus, we write

Λ : |ρ⟩⟩ → S_Λ |ρ⟩⟩,   (1)

where the column vector S_Λ|ρ⟩⟩ is obtained by multiplying the column vector |ρ⟩⟩ from the left by the matrix S_Λ. If ρ is prepared, and then Λ is performed, and finally {E_i} is measured, then the probability of outcome i is

Pr(i) = ⟨⟨E_i| S_Λ |ρ⟩⟩.   (2)

Not every linear superoperator describes a physically allowed operation. For example, the summed-up probability of all measurement outcomes is given by Tr(ρ), so Tr(ρ) must equal 1. A superoperator Λ that changes Tr(ρ) violates the law of total probability. So only trace-preserving superoperators are physically allowed.
Furthermore, if ρ is not positive semidefinite, then some outcome of some measurement will have negative probability. So states must satisfy ρ ≥ 0. If there exists any ρ ≥ 0 such that Λ(ρ) is not ≥ 0, then that Λ is not physically possible. A superoperator that preserves ρ ≥ 0 is called positive.
However, a stronger condition is actually required: when Λ acts on part of a larger system, it must preserve the positivity of that extended system. This corresponds to requiring Λ ⊗ 1l A to be positive for any auxiliary state space A, and is called complete positivity [129,130]. It is stronger than simple positivity; some positive maps aren't completely positive. Superoperators representing physically possible operations must be completely positive and trace-preserving (CPTP). The CPTP constraint turns out to be both necessary and sufficient; any CPTP superoperator can be physically implemented given appropriate resources by the Stinespring dilation construction [128].
The TP (trace-preserving) condition is easy to describe and impose in the transfer matrix representation; it corresponds to ⟨⟨1l|S_Λ = ⟨⟨1l|. If S_Λ is written in a basis whose elements are traceless except for the first one (as required earlier), then Λ is TP if and only if the first row of S_Λ equals [1, 0, . . . , 0].
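The transfer-matrix machinery is easy to exercise numerically. The sketch below (plain NumPy, with helper names of our own choosing; it assumes the normalized Pauli basis) builds S_Λ for an ideal X(π/2) gate, computes an outcome probability as ⟨⟨E|S_Λ|ρ⟩⟩, and checks both the orthogonality of the ideal gate's superoperator and the TP condition on the first row:

```python
import numpy as np

# Single-qubit Pauli basis, normalized so Tr(B_i B_j) = delta_ij,
# with B_1 proportional to the identity (as required above).
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
B = [P / np.sqrt(2) for P in (I, X, Y, Z)]

def superket(rho):
    """Expansion coefficients of rho in the basis: the column vector |rho>>."""
    return np.array([np.trace(Bi.conj().T @ rho) for Bi in B])

def transfer_matrix(U):
    """S_Lambda for the unitary channel rho -> U rho U^dagger:
    (S_Lambda)_ij = Tr(B_i^dag U B_j U^dag)."""
    return np.array([[np.trace(Bi.conj().T @ U @ Bj @ U.conj().T)
                      for Bj in B] for Bi in B])

# Example: pi/2 rotation about x, applied to |0><0|, measured with effect |0><0|.
U = np.cos(np.pi / 4) * I - 1j * np.sin(np.pi / 4) * X
S = transfer_matrix(U).real            # unitary gates give a real matrix
rho0 = np.diag([1, 0]).astype(complex)  # state |0><0|; also used as effect E_0

prob = superket(rho0).real @ S @ superket(rho0).real  # <<E_0| S |rho_0>>
assert np.isclose(prob, 0.5)             # |0> -> equal superposition
assert np.allclose(S @ S.T, np.eye(4))   # ideal gate: orthogonal superoperator
assert np.allclose(S[0], [1, 0, 0, 0])   # TP condition on the first row
print("transfer matrix checks passed")
```

A noisy gate would fail the orthogonality check (its S would be contractive) while still satisfying the TP check.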
The CP condition is more easily described in a different representation, the operator-sum representation [129,130]:

Λ : ρ → Σ_{i,j} χ^Λ_{ij} B_i ρ B_j†.   (3)

Here, {B_i} is a basis for B(H), and χ^Λ_{ij} is a matrix of coefficients called the "Choi process matrix" that represents Λ. This Choi representation of Λ provides the same information as the transfer matrix S_Λ. The mapping between them is known as the Choi-Jamiolkowski isomorphism [131]:

χ_Λ = d (S_Λ ⊗ 1l) |Π_EPR⟩⟩,   (4)

where Π_EPR = |Ψ_EPR⟩⟨Ψ_EPR| and |Ψ_EPR⟩ is a maximally entangled state on a space that is the tensor product of B(H) and a reference auxiliary space of equal dimension. Note that Eq. 4 is technically problematic: the right hand side is a super-ket, but the left hand side is a matrix. Equality here means that, in a consistent basis, these two objects are element-wise equal. This element-wise equality is the Choi-Jamiolkowski isomorphism. Kraus operators correspond to the eigenvectors of χ_Λ, and provide an intuitive picture of many common superoperators [128]. The CP condition is elegant and simple in the Choi representation: Λ is CP if and only if χ is positive semidefinite. This condition is independent of the basis {B_i} used to define χ.
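The CP test is straightforward to implement. The following sketch uses one common convention for building the Choi matrix directly from a channel's action on matrix units (normalization constants differ between references, but an overall positive scale does not affect the positivity test); the function names are illustrative, not pyGSTi API:

```python
import numpy as np

d = 2  # single-qubit Hilbert space dimension

def choi_matrix(channel):
    """Choi matrix of a channel, given as a function on d x d matrices:
    chi = sum_{ij} channel(|i><j|) (x) |i><j|.  CP  <=>  chi >= 0."""
    chi = np.zeros((d * d, d * d), dtype=complex)
    for i in range(d):
        for j in range(d):
            Eij = np.zeros((d, d), dtype=complex)
            Eij[i, j] = 1.0
            chi += np.kron(channel(Eij), Eij)
    return chi

def is_cp(channel, tol=1e-10):
    """Complete positivity: all eigenvalues of the Choi matrix >= 0."""
    evals = np.linalg.eigvalsh(choi_matrix(channel))
    return bool(np.all(evals > -tol))

# A depolarizing channel is CP; the transpose map is positive but NOT CP,
# which is the classic example separating the two conditions.
depolarize = lambda rho: 0.75 * rho + 0.25 * np.trace(rho) * np.eye(2) / 2
transpose = lambda rho: rho.T

print(is_cp(depolarize))  # True
print(is_cp(transpose))   # False
```

The transpose map's Choi matrix is the SWAP operator, whose -1 eigenvalues witness the failure of complete positivity.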
Hereafter, we use the transfer matrix representation almost exclusively, and so we use the term "superoperator" to also refer to the superoperator's transfer matrix. Likewise, when we refer to a superoperator's matrix representation, this should be understood to be the transfer matrix. When we use the Choi matrix representation (only to test/impose complete positivity), we will state it explicitly. We use the term "quantum operation" interchangeably with "gate" to refer to general quantum operations (not necessarily CP or TP), and will explicitly state when CP and/or TP conditions are assumed.

Gate sets
Prior to GST, tomographic protocols sought to reconstruct individual logic operations -states, gates, or measurements -in isolation. But in real qubits and quantum processors, this isolation isn't possible. Every processor has initial states, gates, and measurements, and characterizing any one of them requires making use of the others. Underlying GST, RB, robust phase estimation [68], and other modern QCVV protocols is the simultaneous representation of all a processor's operations as a single entity -a gate set.
Most processors provide just one native state preparation and one native measurement. Gates are used to prepare other states and perform other measurements. For completeness, we consider the most general situation here. Consider a processor that can perform N_G distinct gates, N_ρ distinct state preparations, and N_M distinct measurements, where the m-th measurement has N_E^(m) distinct outcomes. We define the following representations for those operations:

G_i : B(H) → B(H)   for i = 1 . . . N_G,
|ρ^(i)⟩⟩ ∈ B(H)   for i = 1 . . . N_ρ,   (5)
⟨⟨E_i^(m)| ∈ B(H)*   for m = 1 . . . N_M and i = 1 . . . N_E^(m).

The collective set of symbols G_i, |ρ^(i)⟩⟩, and ⟨⟨E_i^(m)| serves two distinct roles. First, by simply naming a set of operations, they provide a specification of the quantum processor's capabilities. Second, each symbol represents a mathematical model for the behavior (ideal, estimated, or unknown, depending on context) of the physical hardware device when the corresponding operation is performed in a quantum circuit. The notation given above is used throughout this paper. We refer to all these matrices and vectors collectively as a gate set, denoted by G and defined as

G = { {G_i}_{i=1}^{N_G} ; {|ρ^(i)⟩⟩}_{i=1}^{N_ρ} ; {⟨⟨E_i^(m)|}_{m=1, i=1}^{N_M, N_E^(m)} }.   (6)

A gate set is a complete model and description of a quantum processor's behavior when running quantum circuits. A gate set is more than just a convenient way to package together independent gate, preparation, and measurement operations. The operations on a quantum processor that a gate set describes are relational and interdependent. As a consequence, the representation given above (where gates correspond to d^2 × d^2 superoperators, state preparations to d^2-element column vectors, and POVM effects to d^2-element row vectors) is an overspecification of the gate set. To see this, consider a transformation of the gate set that acts as

⟨⟨E_i^(m)| → ⟨⟨E_i^(m)| M^{-1},
G_i → M G_i M^{-1},   (7)
|ρ^(i)⟩⟩ → M |ρ^(i)⟩⟩,

where M is any invertible superoperator. This transformation changes the concrete representation of the gate set, but does not change any observable probability (cf. Eq. 10).
Since absolutely nothing observable has changed, these are equivalent descriptions or representations of the same physical system. The transformations defined by Eq. 7 form a continuous group, so the standard representation of gate sets described above contains entire "orbits" of equivalent gate sets. This degeneracy, referred to generically as "gauge freedom", is a perennial irritant and obstacle to characterization of quantum processors [37,54,132,133]. More importantly, it shows that a gate set is not just the sum of its parts, and that tomography on a gate set is not just tomography on its components. GST is tomography of a novel entity.
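This gauge freedom is easy to demonstrate numerically: conjugating a gate set by any invertible M leaves every predicted probability unchanged. A minimal sketch, using a toy randomly generated "gate set" (all names and values here are hypothetical stand-ins, not pyGSTi objects):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # superoperator dimension d^2 for one qubit

# A toy gate set in some (unspecified) Hilbert-Schmidt basis:
rho = rng.standard_normal(dim)           # state superket  |rho>>
E = rng.standard_normal(dim)             # effect superbra <<E|
G = rng.standard_normal((dim, dim))      # one gate's transfer matrix

# Any invertible M defines an equivalent, gauge-transformed gate set:
M = rng.standard_normal((dim, dim))      # invertible with probability 1
Minv = np.linalg.inv(M)
rho_g, E_g, G_g = M @ rho, E @ Minv, M @ G @ Minv

# Every observable probability <<E| G^k |rho>> is unchanged, because the
# M's and M^-1's telescope away inside the product.
for k in range(4):
    p = E @ np.linalg.matrix_power(G, k) @ rho
    p_g = E_g @ np.linalg.matrix_power(G_g, k) @ rho_g
    assert np.isclose(p, p_g)
print("gauge-equivalent gate sets predict identical probabilities")
```

Since no experiment can distinguish the two representations, no tomographic protocol can (or should) estimate the gauge degrees of freedom.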

Circuits
The term "quantum circuit" is used in many contexts to mean subtly different things. For clarity, we find it helpful to distinguish two related but distinct types of quantum circuits: fixed-input, classical-output (FI/CO) and quantum-input, quantum-output (QI/QO) circuits.
Data for standard QCVV protocols (e.g., state/process tomography, RB, or GST) consists of the counted outcomes of experiments. Each experiment is described by a quantum circuit that begins by initializing all the qubits and ends by measuring them. Because the input to such a circuit is "fixed" by the state preparation and the output of the circuit is classical measurement data, we call this type of circuit a fixed-input classical-output circuit. A FI/CO circuit represents and describes a probability distribution over classical bit strings, and a more-or-less concrete procedure for generating it with a quantum processor.
A quantum circuit can also describe an arrangement of unitary logic gates, with no explicit initialization or measurement, that could be inserted into a larger circuit as a subroutine. We call this kind of circuit a quantum-input quantum-output circuit. It is a sequence of circuit layer operations, none of which are preparation or measurement layers. A QI/QO circuit represents and describes a quantum channel that maps an input quantum state to an output quantum state.
Thus, each FI/CO circuit comprises (1) one of the N_ρ possible state preparations, (2) a QI/QO circuit (a sequence of layers), and (3) one of the N_M measurements, which generates a specific outcome. On a processor that has just one state preparation and one measurement operation, every FI/CO circuit can be described completely by the QI/QO circuit composed of all its circuit layers.
Let γ be an index (or label) for the available circuit layers (gates). When we define a QI/QO circuit S as S = (γ_1, γ_2, . . . , γ_L), it means "do layer γ_1, then do layer γ_2, etc." When the layer indexed (labeled) by γ is performed, it transforms the quantum processor's state. This transformation is represented by a superoperator G_γ. Performing an entire QI/QO circuit, e.g. S, also applies a superoperator. We denote the transfer matrix for QI/QO circuit S by τ(S): the map from B(H) → B(H) formed by composing the elements of S, i.e., the reverse-ordered product of the corresponding gates (quantum processes),

τ((γ_1, γ_2, . . . , γ_L)) = G_{γ_L} · · · G_{γ_2} G_{γ_1}.   (8)

Because it is a common source of confusion, we emphasize that gates appear in chronological order in layer sequences (S), but in reverse chronological order in matrix products τ(S), because that is how matrix multiplication works. We use integer exponentiation of a QI/QO circuit to denote repetition. So, if S = (γ_1, γ_2) then S^2 = SS = (γ_1, γ_2, γ_1, γ_2). It follows that

τ(S^n) = τ(S)^n.   (9)

Data sets are generated by defining a set of FI/CO circuits, repeating each one N times, and recording the results. Results from each specific circuit are summarized by observed frequencies, f_k = n_k/N, where n_k is the number of times the k-th measurement outcome was observed, for k = 1 . . . N_E^(m). These frequencies are usually close to the corresponding probabilities,

p_k = ⟨⟨E_k^(m)| τ(S) |ρ^(i)⟩⟩,   (10)

and can be used to estimate them. More sophisticated estimators exist, but Eq. 10 motivates the idea that some or all of |ρ^(i)⟩⟩, ⟨⟨E_k^(m)|, and G_γ can be estimated from observed frequencies. Doing so is tomography. Each kind of tomography treats some operations as known, and uses them as a reference frame to estimate the others. It is usually apparent from context whether a FI/CO or QI/QO circuit is being referenced. In the remainder of this paper, we only specify the type of circuit in cases where it may be ambiguous.

[FIG. 2 caption: Structure of the circuits required for state, process, and measurement tomography. Each of these protocols reconstructs an unknown entity (a state ρ, process G, or measurement M) by placing that entity in circuits with an assumed-known reference frame formed by an informationally complete set of state preparations or measurements (or both). Primed symbols (ρ′ and M′) are meant to connote effective state preparations and measurements, which are often implemented by applying gate operations after or before a native state preparation or measurement. A critical problem with these standard tomographic techniques is that ρ′ and M′ are in practice never known exactly.]

B. State, Process, and Measurement Tomography
In this section we review standard state and process tomography. We also briefly describe the (straightforward) generalization to measurement tomography. These methods establish the context for GST, but they also introduce the mathematical machinery and notation that we will use for GST.

Quantum state tomography
The goal of quantum state tomography [5][6][7][8][9][10][11][12][13][14][15][16][17] is to construct a full description of a quantum state ρ from experimental data. It is assumed that some quantum system can be repeatedly prepared in the unknown state ρ, that various fiducial measurements can be performed on it, that those measurements are collectively informationally complete, and that the POVMs representing those measurements are known (see Figure 2). "Fiducial" means "accepted as a fixed basis of reference", and the measurements used in state tomography constitute a frame of reference. To perform state tomography, many copies of ρ are made available, and divided into M pools. The mth fiducial measurement is applied to all the copies in the mth pool. The observed frequencies f_i^(m) are recorded, and used to estimate the corresponding probabilities p_i^(m) = Tr[E_i^(m) ρ]. Then ρ is deduced from these probabilities.
Practical state tomography is more complicated (and interesting) because only finitely many copies can be measured, so each observed frequency isn't generally equal to the corresponding probability. Those probabilities have to be estimated from the data. One option is to estimate p̂_i = f_i (where a hat over a variable indicates an estimate). This is called linear inversion. It's not very accurate, and often yields an estimate ρ̂ that isn't positive, but despite these practical drawbacks it's a theoretical cornerstone of tomography. Linear inversion underlies the concept of informational completeness, which captures whether an experiment design (here, a set of measurements) is sufficient for estimation. These ideas also underlie GST, and are used directly in linear GST. So we now present a concise overview of linear inversion state tomography and informational completeness.
Let the Hilbert-Schmidt space vector |ρ⟩ denote an unknown quantum state that we want to reconstruct. Let {⟨E_j|}, for j = 1 ... N_f1, denote the list of POVM effects representing all the outcomes of all the fiducial measurements. Now, let us assume that we have learned the exact probabilities of each measurement outcome in state ρ (ignoring the entire process of performing measurements on samples, collating the results, and estimating probabilities from frequencies). These probabilities are given by Born's rule,

p_j = Tr[E_j ρ].

We can write this as a Hilbert-Schmidt space inner product, p_j = ⟨E_j|ρ⟩, and then stack all the row vectors ⟨E_j| atop each other to form an N_f1 × d² matrix A:

A = Σ_j |j⟩⟨E_j|.

Now, p = A|ρ⟩ is a vector whose jth element is p_j (the probability of E_j, as defined above). In the context of state tomography, all the measurement effects E_j are assumed to be known a priori, so the entire matrix A is known. A also has full column rank (d²), because the {⟨E_j|} are informationally complete and therefore span Hilbert-Schmidt space. Therefore, we can reconstruct ρ using linear algebra. If A is square (N_f1 = d²), it has an inverse, and |ρ⟩ = A^{-1} p. If N_f1 is greater than d², then the {⟨E_j|} form an overcomplete basis and A is not square. In this case, we can solve for |ρ⟩ using a pseudo-inverse:

|ρ⟩ = (A^T A)^{-1} A^T p.

Quantum process tomography

The goal of quantum process tomography [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32] is to construct a full description of a quantum process (e.g., a gate) from experimental data. This is done by constructing an informationally complete set of known fiducial states, preparing many copies of each of them, passing those copies through the process to be tomographed, and performing state tomography on the output states (see Figure 2). All the same caveats and complications mentioned for state tomography apply here. As above, we ignore them and focus on the underlying math problem of reconstructing an unknown process from exactly measured probabilities.
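As a concrete illustration, the linear-inversion recipe above can be sketched in a few lines of numpy. Here the six Pauli-eigenstate projectors serve as an assumed-known (and overcomplete) set of fiducial effects; the state and all other specifics are illustrative choices, not prescriptions from the text:

```python
import numpy as np

# Linear-inversion state tomography sketch for a single qubit (d = 2).
# We vectorize d x d matrices into Hilbert-Schmidt space by flattening.

def vec(m):
    """Flatten a d x d matrix into a length-d^2 Hilbert-Schmidt vector."""
    return m.reshape(-1)

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])

# Projectors onto the +/- eigenstates of X, Y, Z: six effects, an
# (over)complete fiducial measurement set assumed known exactly.
effects = [(I2 + s * P) / 2 for P in (X, Y, Z) for s in (+1, -1)]

# Stack the row vectors <E_j| into the 6 x 4 matrix A.
A = np.array([vec(E).conj() for E in effects])

# "True" state (unknown to the tomographer): a slightly mixed state near |0><0|.
rho = 0.9 * np.diag([1.0, 0.0]) + 0.1 * I2 / 2
p = A @ vec(rho)                      # Born-rule probabilities p_j = Tr[E_j rho]

# Linear inversion: A is 6 x 4 (overcomplete), so we use a pseudo-inverse.
rho_hat = np.linalg.pinv(A) @ p
assert np.allclose(rho_hat.reshape(2, 2), rho)
```

With exact probabilities the reconstruction is exact; with finite-sample frequencies in place of p, the same pseudo-inverse yields the (possibly non-positive) linear-inversion estimate discussed above.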
Let G be the superoperator (transfer matrix) representing the process we want to reconstruct. Let {⟨E_j|} be the list of POVM effects representing all the outcomes of all the fiducial measurements, as in state tomography. And let {|ρ_i⟩} be a list of N_f2 fiducial quantum states that will be repeatedly prepared as inputs to G.
If state ρ_i is prepared, G is applied, and a measurement is performed with possible outcomes {E_j}, then the probability of observing outcome E_j is

P_{j,i} = ⟨E_j|G|ρ_i⟩.

In addition to the A matrix defined previously, we define a new d² × N_f2 matrix B from the column vectors representing the fiducial states |ρ_i⟩:

B = Σ_i |ρ_i⟩⟨i|.

Now, we can write the N_f1 × N_f2 matrix P whose elements are P_{j,i} as

P = A G B.

In the context of process tomography, both the fiducial states and the fiducial measurement effects are known, so all the elements of both A and B are known. A must have full column rank and B must have full row rank, because the {ρ_i} and {E_j} were assumed to be informationally complete. If they are square (N_f1 = N_f2 = d²), then they are invertible, and we can immediately reconstruct the original process as

G = A^{-1} P B^{-1}.

If there are more than d² input states and/or measurement outcomes, then (as before) we use pseudo-inverses to obtain a least-squares solution:

G = (A^T A)^{-1} A^T P B^T (B B^T)^{-1}.

More sophisticated estimators (i.e., maps from {P_{j,i}} data to estimates of G) exist. They are important when finite-sample errors in the estimated probabilities are significant. But the linear inversion tomography described here is the essence of process tomography.
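The same machinery gives a short process tomography sketch. The fiducial states, effects, and the process itself (an X gate) are illustrative choices; the fiducials are assumed known exactly, as the text requires:

```python
import numpy as np

# Linear-inversion process tomography sketch (single qubit, d = 2), in the
# flattened ("vec") representation used in the state tomography example.

def vec(m):
    return m.reshape(-1)

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])

# Known fiducial effects (rows of A) and fiducial states (columns of B).
effects = [(I2 + s * P) / 2 for P in (X, Y, Z) for s in (+1, -1)]
states = [np.diag([1.0 + 0j, 0]), np.diag([0, 1.0 + 0j]),
          (I2 + X) / 2, (I2 + Y) / 2]
A = np.array([vec(E).conj() for E in effects])    # 6 x 4, full column rank
B = np.column_stack([vec(r) for r in states])     # 4 x 4, invertible

# "True" process: an X gate. With row-major vec, vec(U rho U^dag) is
# (U kron U*) applied to vec(rho).
U = X
G = np.kron(U, U.conj())

P = A @ G @ B              # observable probabilities P_{j,i} = <E_j| G |rho_i>

# Linear inversion with pseudo-inverses (A is overcomplete, B is square).
G_hat = np.linalg.pinv(A) @ P @ np.linalg.pinv(B)
assert np.allclose(G_hat, G)
```

Since pinv(A)·A is the identity when A has full column rank, the reconstruction is exact given exact probabilities.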

Measurement tomography
Using tomographic techniques to characterize measurements gets much less attention than state and process tomography -partly, perhaps, because it's a straightforward generalization of state tomography.
Measurement tomography [108][109][110][111][112] uses precalibrated fiducial states to reconstruct the POVM effects for an unknown measurement (see Figure 2). This is easy to represent using the framework already developed for process tomography. If {ρ_i} are known fiducial states, and {E_j} is the unknown POVM, then we can define one vector of observable probabilities p_j for each unknown effect:

[p_j]_i = ⟨E_j|ρ_i⟩.

Using the B matrix defined in the preceding discussion of process tomography, this is simply

p_j^T = ⟨E_j| B,

which means that each effect E_j can be reconstructed by linear inversion as

⟨E_j| = p_j^T B^{-1},

where B^{-1} is a matrix inverse or pseudo-inverse, as appropriate.

C. The role of calibration in tomography
State, process, and measurement tomography are conceptually straightforward, but rely on a critical (and unrealistic) assumption. They reconstruct the unknown operation relative to some reference operations -fiducial measurements and/or states -that are assumed to be perfectly known. Often, the reference operations are also assumed to be noiseless.
These assumptions are never satisfied in practice. Identifying the exact POVMs that describe the fiducial measurements for state tomography -i.e., calibrating them -would require perfectly known states. But identifying those states, by state tomography, relies on known measurements. This leads to an endless loop of self-referentiality. In the same way, process tomography relies on fiducial states and measurements that are almost always produced by applying quantum logic gates (specific quantum processes) to just a few native states and measurements [134]. A process tomographer's knowledge of those fiducial objects can be no more precise than their knowledge of the gates used to prepare them -which would have to be characterized with process tomography.
Ref. 106 articulated this problem clearly. It also demonstrated that this concern is real, and has practical consequences. In realistic scenarios, errors in state preparation and measurement (SPAM) dominate inaccuracy in process tomography.
Several forms of "calibration-free" tomography [33,37,[135][136][137][138][139][140][141][142][143][144][145][146][147][148][149][150] were proposed either independently from, or in response to, Ref. 106. Each sought to characterize quantum gates, state preparations, and/or measurements self-consistently without any prior calibration. Gate set tomography has emerged as the most widely adopted of these, and is now a de facto standard for performing calibration-free tomography. However, it comes with certain costs, which have motivated the continued use of state and process tomography in some experiments. While we recognize the costs and drawbacks of GST, the risks of traditional state and process tomography are so clear that they should not be used, except under remarkable circumstances (e.g., where there are very good reasons to believe the reference operations are known to higher precision than is needed in the final estimate).

III. LINEAR GATE SET TOMOGRAPHY (LGST)
GST offers two benefits. It resolves the circularity inherent to state and process tomography, and achieves higher accuracy with lower experimental cost than process tomography. These features are only loosely linked. Linear GST (LGST) resolves the pre-calibration problem, but has the same accuracy/effort scaling as process tomography. In this section, we use LGST to introduce the basic concepts and methods of GST.

A. Introduction to LGST
Linear-inversion gate set tomography (LGST) is basically a calibration-free amalgamation of state, process, and measurement tomography. It constructs a low-precision estimate of a gate set G using the data from a specific set of short circuits.
LGST demonstrates the feasibility of avoiding pre-calibrated reference frames, and also illustrates why doing so creates gauge freedom.
LGST was developed roughly contemporaneously with IBM's "overkill" approach to gate set tomography [106]. IBM identified the pre-calibration problem, recognized that circuit outcome probabilities are given by expressions of the form

p_S = ⟨E| G_{γ_L} · · · G_{γ_1} |ρ⟩,

and observed that a sufficiently large set of such probabilities should be sufficient to reconstruct the gate set. They proposed performing all circuits of 3 or fewer gates to generate data, then fitting a gate set to that data by numerical maximization of the likelihood function (maximum likelihood estimation, or MLE). But this likelihood function is not well-behaved. It is not quasi-convex (i.e., its level sets are not convex), because (1) the probabilities for the experimental observations are not linear functions of the parameters, and (2) the existence of the gauge makes the likelihood's maximum very degenerate. Plotting the likelihood would reveal, instead of a unimodal "hill", an assortment of winding "ridges" with perfectly level crests. Standard optimization methods [151] often fail to find global maxima of such functions, getting stuck in local maxima. The IBM "overkill" algorithm therefore relied on starting with a good initial guess for the gate set, one that would lie with high probability in a local neighborhood of the likelihood's maximum.
LGST avoids this problem by using a very different, structured, experiment design. Instead of an unstructured set of short circuits, LGST prescribes a specific set of them. The resulting data can be analyzed using only linear algebra, and the analysis is very similar to that used for process tomography.
As LGST looks very much like process tomography, it's important to recognize that it is doing something significantly different. If LGST was trying to reconstruct transfer matrices for individual gates, it would fail, because reconstructing individual gates without a pre-calibrated reference frame is not possible.
LGST reconstructs gates up to gauge freedom. Actually, it does more: it reconstructs the entire gate set up to the global gauge freedom given in Eq. 7, recapitulated below:

|ρ^(i)⟩ → M |ρ^(i)⟩,  ⟨E_i^(m)| → ⟨E_i^(m)| M^{-1},  G_k → M G_k M^{-1},

for any invertible matrix M. Such transformations change the elements of the gate set, but not any observable probability. So it's not possible to distinguish between gauge-equivalent gate sets, and reconstructing a gate set up to an arbitrary M constitutes success.
Since a gate set comprises states, gates, and measurements, it's tempting to say that LGST characterizes all of them simultaneously. But this is not quite right. A gate set is not a collection of unrelated quantum operations. Quantum operations are usually described relative to an implicit and absolute reference frame. But in most experiments, no such reference frame is available. So GST characterizes all these operations relative to each other, and estimates every property of a gate set that can be measured without a reference frame. But some properties of gate sets can't be measured, even in principle, and they correspond to gauge degrees of freedom.
Gauge freedom makes some familiar properties of gates unmeasurable. Other properties of gates turn out to be not associated with a single operation, but purely relational properties -i.e., they are properties of the gate set, but not of any individual gate within it. This awkwardness is the unavoidable price of avoiding pre-calibrated reference frames. GST outputs a self-consistent representation of the available states, processes, and measurements, but that representation is generally not unique. If finite-sample errors did not exist, LGST would be a perfect estimator of the gate set, and this paper would be much shorter. But real experiments always suffer from finite sample error. N trials of an event with probability p does not generally yield exactly pN successes, so estimating p from data generally yieldsp = p ± O(1/ √ N ). As a result, LGST's estimation error scales as O(1/ √ N ), just like process tomography. So estimating a gate set to within ±10 −5 with LGST would require repeating each circuit N ≈ 10 10 times, which is impractical. The "longsequence GST" protocol described in Section IV is much more efficient, and makes standalone LGST preferable only when there are severe resource constraints. But LGST remains important both pedagogically and as a key first step in long-sequence GST's analysis pipeline.

B. The LGST algorithm
In this section, we present the core LGST algorithm. We focus on (1) what LGST seeks to estimate, (2) what data is required for LGST, and (3) how to transform that data into an estimate under ideal circumstances. At first, we make some idealized simplifying assumptions in order to maximize clarity. After presenting the core algorithm, we return to these assumptions, relax them, and show how to make this algorithm practical and robust. The simplifying assumptions are: 1. We assume the ability to create informationally complete sets of fiducial states {|ρ_j⟩} and measurement effects {⟨E_i|} (see Section III D).
2. We ignore finite sample error in estimated probabilities, and its effects (see Section III E).
3. We assume that the fiducial states and effects are exactly informationally complete, not overcomplete, so that N_f1 = N_f2 = d² (see Section III C).

FIG. 3 (caption excerpt): LGST also requires the circuits that perform state (measurement) tomography on ρ (M), but these are not explicitly shown. They are similar to (d)-(f) (just replace ρ′ with ρ, or M′ with M), and are actually included as a subset of those circuits when the gate set contains only a single native state preparation (measurement) and one of the preparation (measurement) fiducial circuits is the empty (do-nothing) circuit.
We use the notation of Eq. 6 to denote the contents of a generic gate set G:

G = { {|ρ^(i)⟩}_{i=1..N_ρ} ; {G_k}_{k=1..N_G} ; {⟨E_i^(m)|}_{i=1..N_E^(m)}, m = 1..N_M }.

As in Sec. II A 4, N_ρ, N_G and N_M are the number of distinct state preparations, gates, and measurements, and N_E^(m) is the number of possible outcomes for the m-th distinct measurement.
To perform process tomography on an operation G in Sec. II B 2, we constructed a matrix P of observable probabilities using informationally complete fiducial states and effects. For LGST, we will do the same thing. But although we assume the existence and implementability of fiducial sets, we do not assume that we know them. They still form a reference frame, but we don't know what that frame is.
To reconstruct a set of gates (processes) {G_k}, we will need one such matrix P_k for each gate:

[P_k]_{i,j} = ⟨E_i|G_k|ρ_j⟩.

These probabilities are directly measurable: we don't know what ρ_j and E_i are, but we can prepare/measure them. The first line of Figure 3 illustrates the circuit corresponding to Eq. 28. We can construct A and B matrices from the fiducial vectors exactly as for process tomography. The difference, of course, is that although those matrices exist, their entries are not known to the tomographer. Just as before, we can write

P_k = A G_k B.

Since we do not know A or B, we cannot solve this equation for G_k. Instead, to compensate for our ignorance about ρ_j and E_i, we measure some additional probabilities that correspond to process tomography on the null operation. We arrange these into a Gram matrix for the fiducial states/effects:

[1l]_{i,j} = ⟨E_i|ρ_j⟩.

The second line of Figure 3 depicts the circuits whose outcome probabilities make up 1l. This matrix can also be written in terms of the A and B matrices, as

1l = A B.

We assumed that these matrices are square (N_f1 = N_f2 = d²) and invertible (which follows from informational completeness). So we can invert the Gram matrix and solve for G_k to get

Ĝ_k = B 1l^{-1} P_k B^{-1}.

This may not look like the solution (there's still an unknown B involved), but it is: since 1l^{-1} P_k = B^{-1} A^{-1} A G_k B = B^{-1} G_k B, we have reconstructed G_k up to a similarity transformation by the unknown B. Moreover, we can do this in exactly the same way for all the gates G_k, and get estimates of them all up to the same B.
We also need to reconstruct the native states ρ^(l) and measurement effects {E^(m)_l} in the gate set. To do so, we construct the following vectors (denoted by ⃗·) of observable probabilities:

[ρ⃗^(l)]_i = ⟨E_i|ρ^(l)⟩,
[E⃗^(m)_l]_j = ⟨E^(m)_l|ρ_j⟩.

Measuring these probabilities corresponds to performing state tomography on each native state ρ^(l), and measurement tomography on every native effect E^(m)_l, in the unknown frame defined by the fiducial effects and states. They can be written using A and B as

ρ⃗^(l) = A |ρ^(l)⟩,
[E⃗^(m)_l]^T = ⟨E^(m)_l| B,

and by using the identity 1l = AB in Eq. 36, we get the following equations for all the elements of the gate set:

|ρ̂^(l)⟩ = B 1l^{-1} ρ⃗^(l),
⟨Ê^(m)_l| = [E⃗^(m)_l]^T B^{-1},
Ĝ_k = B 1l^{-1} P_k B^{-1}.

Perhaps surprisingly, this is the answer: we've recovered the original gate set up to a gauge. The unknown B is now a gauge transformation (see Eq. 7). It's not just unknown, but unknowable, and irrelevant. The tomographer can set B equal to any invertible matrix and get back the original gate set, up to gauge freedom. Equivalently, B indexes different (but equivalent) representations of the gate set. The simplest choice is B = I (the identity), but the resulting representation is not very useful. The best a priori choice is a B corresponding to the tomographer's best a priori guess for the fiducial states (see Eq. 17), because that (implicitly) defines the gauge in which the tomographer is expecting to work. But although this is the best a priori choice, it is almost never satisfactory. Reliable analysis of the estimate requires a posteriori gauge-fixing, which we discuss in Section VI B.
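The whole LGST derivation can be checked end-to-end in a small simulation. The gate set below (π/2 rotations Gx and Gy, a |0⟩ preparation, a Z-basis effect, and the fiducial circuits {∅, Gx, Gy, GxGx}) is an illustrative choice, not a prescription from the text, and we take the gauge choice B = I in the estimator. Superoperators are represented as real 4×4 Pauli transfer matrices:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
paulis = [I2, X, Y, Z]

def pauli_vec(m):
    """Coefficients of m in the orthonormal basis {P_i / sqrt(2)}."""
    return np.array([np.trace(P @ m).real for P in paulis]) / np.sqrt(2)

def ptm(U):
    """Pauli transfer matrix of the unitary channel rho -> U rho U^dag."""
    return np.array([[np.trace(Pi @ U @ Pj @ U.conj().T).real / 2
                      for Pj in paulis] for Pi in paulis])

def rot(P, theta):
    """Unitary exp(-i theta P / 2) for a Pauli P."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * P

# Native operations: prepare |0>, measure the |0><0| effect, gates Gx and Gy.
rho = pauli_vec(np.diag([1.0, 0.0]))            # |rho>  (column vector)
E0 = pauli_vec(np.diag([1.0, 0.0]))             # <E_0|  (row vector)
Gx, Gy = ptm(rot(X, np.pi / 2)), ptm(rot(Y, np.pi / 2))

# Fiducial circuits build the (unknown-to-the-tomographer) frame matrices:
# B's columns are effective preps, A's rows are effective effects.
fids = [np.eye(4), Gx, Gy, Gx @ Gx]
B = np.column_stack([F @ rho for F in fids])
A = np.vstack([E0 @ F for F in fids])

gram = A @ B                 # Gram matrix of observable probabilities
P_x = A @ Gx @ B             # probabilities <E_i| Gx |rho_j>

# LGST estimate (gauge choice B = I): recovers Gx up to similarity by B.
Gx_hat = np.linalg.inv(gram) @ P_x
assert np.allclose(Gx_hat, np.linalg.inv(B) @ Gx @ B)

# The gauge cancels in every predicted circuit probability:
rho_hat = np.linalg.inv(gram) @ (A @ rho)       # = B^-1 |rho>
E0_hat = E0 @ B                                  # = <E_0| B
assert np.allclose(E0_hat @ Gx_hat @ Gx_hat @ rho_hat, E0 @ Gx @ Gx @ rho)
```

The two assertions verify exactly the claims in the text: the estimate is a similarity transform of the true gate by the unknown B, and gauge-equivalent gate sets predict identical probabilities.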

C. Addressing overcompleteness with pseudo-inverses
Although using exactly N_f1 = N_f2 = d² fiducial states and effects makes the mathematics convenient, there are practical advantages to using more than d² fiducials. They include:
• Redundancy: overcomplete fiducials provide more information and allow self-consistency checking.
• Precision: with some gate sets, such as Clifford gates, optimal fiducial sets (2-designs) can only be generated with more than d² fiducials.
• Feasibility: if only projective measurements with d effects are used, then at least d + 1 of them are required for informational completeness, which yields d² + d effects.
Overcomplete fiducials yield vector spaces of observable probabilities whose dimension exceeds that of Hilbert-Schmidt space. The A, B, and (generally) P k matrices are not square, and therefore not invertible. However, all the steps from the previous derivation still hold -they just require slightly more complicated linear algebra.
Let us return to the derivation at Eq. 29. We have measured all the N_f1 × N_f2 elements of P_k and 1l, and we want to solve for G_k in these two equations:

P_k = A G_k B,
1l = A B.

The elements of P_k and 1l will experience small random perturbations because of finite-sample errors in the estimated probabilities. So in general, these equations will have no exact solutions: the rank of P_k and 1l will be higher than d², but A G_k B and A B can only have rank d² (because we demand reconstructed gates of this dimension). Therefore, we seek the best approximate solution. A simple and elegant criterion for "best" (though not actually the best criterion; see "Maximum likelihood estimation" below) is least squares. We seek the G_k (and also A and B) that minimize |P_k − A G_k B|² and |1l − A B|².
To find A and B, we start with the identity

|1l − A B|² = Tr[(1l − A B)^T (1l − A B)],

differentiate with respect to A (using the rules ∇_A Tr[AX] = X^T and ∇_A Tr[A^T X] = X), set the resulting gradient to zero, and solve. This yields

A = 1l B^T (B B^T)^{-1}.

Some useful observations about this expression: • This is a valid generalization of the expression derived previously (A = 1l B^{-1}), because B^T (B B^T)^{-1} is a right inverse of B; i.e., multiplying it on the left by B yields the identity matrix.
• Furthermore, if B is square and invertible, then this right inverse is unique, so this expression reduces to A = 1l B^{-1} in that case.
• This inverse is well-conditioned. Although B is generally singular, BB T is square and full-rank (by informational completeness).
Next, we write out |P_k − A G_k B|² the same way, take the gradient with respect to G_k, set it to zero, and solve for G_k. This yields

G_k = (A^T A)^{-1} A^T P_k B^T (B B^T)^{-1},

and substituting A from Eq. 42 yields

Ĝ_k = B B^T (B 1l^T 1l B^T)^{-1} B 1l^T P_k B^T (B B^T)^{-1}.

This, again, is a valid generalization of the expression derived earlier (Ĝ_k = B 1l^{-1} P_k B^{-1}). If B is invertible, then this entire expression reduces to the earlier one. But this expression only requires inverting d² × d² matrices that have full rank if the fiducials are informationally complete. One problem remains: B (which is still unknown) appears in the estimate. In the previous derivation, where B was assumed to be square and invertible, we showed that although the final expression for G_k involved B, it only determined the gauge, and had no impact on any predicted probability.
In Eq. 44, this is not quite true. B does affect the estimate... but only through its support. Recall that B's dimensions are now d² × N_f2. If the fiducial states are informationally complete, then B has rank d². Its columns span the entire d²-dimensional state space, but its rows only span a d²-dimensional subspace of the N_f2-dimensional space of observable probabilities.
We can write B as B = B_0 Π, where Π is a d² × N_f2 projector onto some d²-dimensional subspace (with ΠΠ^T = I), and B_0 is an arbitrary full-rank d² × d² matrix. Substituting this into Eq. 44 yields a remarkable simplification:

Ĝ_k = B_0 (Π 1l^T 1l Π^T)^{-1} Π 1l^T P_k Π^T B_0^{-1}.

Here, B_0 clearly determines the gauge and only the gauge. Π, on the other hand, has a real effect. What's happening here is simple: the high-dimensional information contained in the data (1l and P_k) has to be compressed into the lower-dimensional space of the estimate Ĝ_k. The estimator is a linear map, so the only way to do this is by projection, and Π (the projector onto the support of B) is that projection. So we are free to choose B more or less arbitrarily (as long as it has rank d²), but its support will affect the estimate, while its form on that support affects only the gauge.
To choose Π optimally, we recall that A and B are chosen to minimize |1l − AB|². If we write out AB using Eq. 42 and the identity B = B_0 Π, we get

A B = 1l B^T (B B^T)^{-1} B = 1l Π^T Π,

and if we define the complement projector Π_c = I − Π^T Π, then

|1l − A B|² = |1l Π_c|².

This is uniquely minimized by choosing Π to be the projector onto the d² right singular vectors of 1l with the largest singular values. With this specification for Π, Eq. 45 is a completely specified LGST estimator for G_k. A more or less identical (but easier) calculation yields linear-inversion estimates of the native states and effects, in terms of the directly observable probability vectors ρ⃗^(l) and E⃗^(m)_l.
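The overcomplete estimator can be sketched numerically. To keep the linear algebra in the foreground, this example uses random full-rank matrices in place of a concrete gate set (all dimensions and names are illustrative), and verifies that the result is the true gate up to a pure gauge (similarity) transformation:

```python
import numpy as np

# Sketch of the overcomplete LGST estimator (Eq. 45), with gauge choice
# B_0 = identity. A (N_f1 x d^2) and B (d^2 x N_f2) stand in for the unknown
# fiducial-frame matrices; G stands in for an unknown gate.
rng = np.random.default_rng(0)
d2, nf = 4, 6
A = rng.normal(size=(nf, d2))      # full column rank with probability 1
B = rng.normal(size=(d2, nf))      # full row rank with probability 1
G = rng.normal(size=(d2, d2))

gram = A @ B                        # measured Gram matrix (rank d^2)
P = A @ G @ B                       # measured probability matrix for gate G

# Pi = projector onto the d^2 leading right singular vectors of the Gram
# matrix, which minimizes |gram * Pi_c|^2 as described in the text.
_, _, Vt = np.linalg.svd(gram)
Pi = Vt[:d2]                        # d^2 x N_f2, orthonormal rows

# Eq. 45 with B_0 = I:
M = Pi @ gram.T @ gram @ Pi.T
G_hat = np.linalg.inv(M) @ Pi @ gram.T @ P @ Pi.T

# G_hat equals G up to similarity by C = B Pi^T, i.e., up to gauge:
C = B @ Pi.T
assert np.allclose(G_hat, np.linalg.inv(C) @ G @ C)
```

Because the rows of Π span the row space of B (which equals the row space of the noiseless Gram matrix), C = BΠ^T is invertible and the estimate is gauge-equivalent to the truth; with noisy data, the same formula gives the least-squares estimate described above.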

D. Preparing the fiducial vectors
In Section III B above, we assumed by fiat that informationally complete sets of fiducial states {ρ j } and ef-fects {E i } could be prepared. But most processors admit just one native state preparation, and one measurement. So fiducial states and measurements are implemented using gates from the gate set itself (as shown in Figure 3 (b-c,e-f)).
To do this, we define two small sets of QI/QO "fiducial circuits". Each fiducial state is prepared by applying one of the preparation fiducial circuits to a native state preparation. Each fiducial measurement is performed by performing one of the measurement fiducial circuits before a native measurement. In the common case where the native state preparation and measurement are unique, the {|ρ_j⟩} and {⟨E_i|} are entirely determined by the corresponding fiducial circuits:

|ρ_j⟩ = τ(F_j)|ρ⟩,
⟨E_i| = ⟨E_{k(i)}|τ(H_{h(i)}),

where F_j denotes the jth preparation fiducial circuit, H_h the hth measurement fiducial circuit, and i indexes the (native outcome, measurement fiducial) pairs, as shorthand to avoid too cumbersome a notation.
In the more general case of multiple state preparations and measurements, these expressions become:

|ρ_j⟩ = τ(F_{f(j)}) |ρ^{(r(j))}⟩,
⟨E_i| = ⟨E^{(m(i))}_{l(i)}| τ(H_{h(i)}).

They must include simple functions that map each effective-item index (i or j) to the native preparation index (r(j)), measurement index (m(i)), or fiducial index (f(j) or h(i)) corresponding to that item. Equations 52 and 53 (or 50 and 51) introduce and define measurement fiducial circuits, {H_k}, and preparation fiducial circuits, {F_k}, that result in the sets {|ρ_j⟩} and {⟨E_i|} being informationally complete. Their essential function is to describe how effective preparations and measurements are constructed from native operations, and they are referred to for this purpose in the remainder of this work. This is shown pictorially in Figure 3, where indices have been omitted to increase clarity. This construction has no consequences for the linear-inversion analysis described above (how the fiducial states/effects are produced doesn't matter, as long as they are consistent), but it does have other consequences.
First, it ensures that every "observable probability" required by LGST can be obtained by running a specific circuit, composed from operations in the gate set itself.
Second, it subtly reduces the number of free parameters in the model, because (e.g.) ρ 1 and ρ 2 are not entirely independent, but are actually generated by the same set of operations applied in a different order. The LGST linear inversion algorithm is blind to this correlation, but more careful statistical analysis can extract it and provide a somewhat more accurate estimate (see below).
Third, this construction places the burden of ensuring informational completeness of the fiducial states/effects onto the choice of fiducial circuits. This choice is not "canonical" for GST; it depends on the gates available in the gate set. In practice, "fiducial selection" is usually done by (1) modeling the gates as error-free, and then (2) using that model to find a set of fiducial circuits that produce informationally complete states/effects. The goal is to have a Gram matrix 1l that is well-conditioned, so that its inverse doesn't inflate finite-sample errors. Optimal condition number is achieved when the fiducial states/effects form a 2-design [152].
Of course, the gates are not necessarily error-free, and fiducial selection may fail if they are sufficiently erroneous. Fortunately, post-hoc validation is straightforward, by computing the singular values of the measured 1l before inverting it. If the d 2 largest singular values are not all sufficiently large, then different (or additional) fiducials should be added (and more data taken). If no fiducials are found to produce informationally complete sets, then the operations in the gate set are not capable of exploring the system's nominal state space, and tomography must be limited to a subspace thereof [153].
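The validation step just described is straightforward to sketch. The singular-value threshold below is a heuristic placeholder, not a canonical value from the text:

```python
import numpy as np

def gram_is_complete(gram, d2, threshold=0.1):
    """Post-hoc diagnostic sketch: check that the measured Gram matrix has d2
    sufficiently large singular values (informational completeness), and
    report the condition number restricted to that support."""
    s = np.linalg.svd(gram, compute_uv=False)   # sorted in descending order
    leading = s[:d2]
    cond = float(leading[0] / leading[-1])
    return bool(np.all(leading > threshold)), cond

# A well-conditioned Gram matrix passes; one with a near-zero singular value
# (fiducials that fail to span the state space) fails the check.
ok_good, _ = gram_is_complete(np.diag([1.0, 0.6, 0.5, 0.4]), d2=4)
ok_bad, _ = gram_is_complete(np.diag([1.0, 0.6, 0.5, 1e-8]), d2=4)
assert ok_good and not ok_bad
```

When the check fails, the text's prescription applies: choose different (or additional) fiducials, or restrict tomography to the subspace the operations can actually explore.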

E. Improving LGST with maximum likelihood estimation
In the preceding sections, we have described the goal of tomography (including GST, and specifically LGST) as "reconstructing" states, processes, or gate sets. This language has a long history in the tomography literature. But admitting the existence of finite sample errors -observed frequencies of circuit outcomes will not exactly match their underlying probabilities -disrupts this paradigm. The objects that we "reconstruct" are not quite the true ones. Worse yet, in general there is no "true" state or process! Quantum states and quantum processes are mathematical models for real physical systems. Those models rely on assumptions (e.g., Markovianity), which are never quite true in practice.
"Reconstructing" is a flawed term, and at this point we leave it behind. It is more accurate to say that GST estimates the gate set. The goal of GST is to fit a gate set to data, i.e., to find the gate set that fits the observed data best. This description is more consistent with the least-squares generalization of linear inversion derived above. But, having rephrased the problem this way, it is very reasonable to ask "Is least squares the right way to fit the data?" The LGST algorithms derived above minimize a sum of squared differences between predicted probabilities and observed frequencies:

Σ_j (p_j − f_j)²,

where j ranges over the observable probabilities. But while minimizing this sum of squared residuals is very convenient, it's not especially well-motivated. Furthermore, the LGST algorithms given above don't actually minimize the average squared error over the entire dataset (i.e., over all the circuits performed to obtain LGST data), because the projector Π is chosen specifically to capture the support of the Gram matrix 1l, without regard for the other directly observed probabilities (e.g., those in the P_k). The best way to fit the observed data optimally is to set aside traditional tomography and linear algebra, and focus directly on (1) the actual data (outcome frequencies of quantum circuits), and (2) the model being fit to it (gate sets). The set of all possible gate sets can be viewed as a parameterized manifold. Each point on that manifold is a gate set that predicts specific, computable probabilities for each quantum circuit. It is therefore possible to write the objective function from Eq. 54 precisely, with j ranging over every circuit performed, as a function of the gate set, and minimize it numerically to find a gate set that really does minimize the total squared error.
There's no particularly compelling reason to minimize the total squared error, though. An objective function that is very well motivated in statistics is the likelihood function:

L(G) = Pr(data | G) = ∏_s ∏_k p_{s,k}(G)^{N_{s,k}},

where s ranges over the circuits performed, k over their outcomes, N_{s,k} is the number of times outcome k was observed for circuit s, and p_{s,k}(G) is the probability that gate set G predicts for that event. Although not immediately obvious, the sum-of-squares objective function in Eq. 54 can be seen as an approximation to the likelihood. Maximizing the likelihood function is the apotheosis of linear inversion: it's what linear inversion wants to be when it grows up. If we numerically vary over gate sets to find the one that maximizes L, that is maximum likelihood estimation (MLE), and it is a strictly better and more accurate way of analyzing LGST data than the linear-inversion methods defined above.
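The likelihood's structure can be sketched directly. The function below is the standard multinomial log-likelihood (with constants dropped); the toy check at the end illustrates, for a single two-outcome circuit, the connection between maximizing the likelihood and matching observed frequencies:

```python
import numpy as np

def log_likelihood(pred_probs, counts):
    """Multinomial log-likelihood sketch (constants dropped).
    pred_probs, counts: lists of same-shape arrays, one per circuit, giving
    the model's outcome probabilities and the observed outcome counts."""
    total = 0.0
    for p, n in zip(pred_probs, counts):
        # Clip to avoid log(0); a real implementation handles p = 0 carefully.
        total += np.sum(n * np.log(np.clip(p, 1e-12, None)))
    return total

# Toy check: one circuit, 100 shots, outcome "0" seen 73 times. Scanning
# candidate probabilities shows the likelihood peaks at the observed frequency.
counts = [np.array([73, 27])]
candidates = np.linspace(0.01, 0.99, 99)
lls = [log_likelihood([np.array([q, 1 - q])], counts) for q in candidates]
best = candidates[int(np.argmax(lls))]
assert abs(best - 0.73) < 0.01
```

In full GST the candidate probabilities are not free parameters but functions of the gate set, so this objective is maximized numerically over the gate set's parameter space.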
In the rest of this paper, we will adopt this approach. We will view the data as a list of circuits with observed outcome frequencies. We will view gate sets as statistical models that predict circuit probabilities. And we will find good estimates of real-world gate sets by fitting that model's parameters to observed data, using MLE or approximations to it.
A natural question, then, is "Why develop the whole linear-inversion LGST framework, if having done so we immediately move beyond it?" There are several good reasons. The most practical is that the linear-inversion LGST algorithm developed above is actually a critical first step in any numerical MLE algorithm! Because circuit probabilities are nonlinear functions of the gate set parameters, both Eq. 54 and the likelihood function itself have unpleasant global behavior, including local extrema. Linear-inversion LGST is a very fast algorithm to obtain a pretty good estimate, which can serve as a starting point for (and be refined by) numerical MLE. But LGST also serves as a critical conceptual foundation. Regardless of what estimator we use (linear inversion, MLE, or something else), we can't estimate the gate set's parameters unless the experiments that generated the data are sensitive to them.
LGST constructs a set of experiments guaranteed to be sensitive to all the parameters in the gate set. Deeper circuits, introduced in the next section, amplify those parameters and allow greater sensitivity -but the experiments that we use in long-sequence GST are clearly derivative of the ones in LGST, and rely on the same structure to guarantee sensitivity to all amplified parameters.

IV. LONG-SEQUENCE GATE SET TOMOGRAPHY
Traditional process tomography on a gate $G_k$ uses circuits in which $G_k$ appears just once. Observed probabilities depend linearly on $G_k$, making it very easy to invert Born's rule to estimate $G_k$.
LGST bends this rule - a given $G_k$ may appear both in the middle of the circuit and in the fiducial circuits - but the circuits are short, and each gate appears only $O(1)$ times. This enables the relatively straightforward analysis presented in the previous section, but it also limits precision. If $A$ and $B$ (Eqs. 13 and 17) are well-conditioned [condition number $O(1)$], then each matrix element of $G_k$ is very close to a linear combination of observed probabilities.
If each circuit is performed $N$ times, then finite sample fluctuations cause estimation errors in each probability, and the accuracy of $\hat{G}_k$ is also limited to $\pm O(1/\sqrt{N})$. Under certain circumstances, some of these probabilities (with $p \approx 0$) can be estimated to within $O(1/N)$ [154], but the resulting improvement in overall accuracy is limited by SPAM noise and by the fact that only a few of the circuits can be chosen to have $p \approx 0$.
A completely different way to break the $1/\sqrt{N}$ boundary is to incorporate data from deep circuits, in which a gate appears many times. The circuits must be designed so that their outcome probabilities depend more sensitively on the elements of $G_k$, with sensitivity proportional to the number of times $G_k$ appears in the circuit. For example, the outcome probabilities for a circuit containing four repetitions of $G_k$ are four times as sensitive to changes in $G_k$ as the outcome probabilities for a circuit containing $G_k$ just once, and therefore allow four times the precision in estimating some aspects of $G_k$. Long-sequence GST turns this simple idea into a practical algorithm for characterizing gate sets.

The original motivation for long-sequence GST was precision and accuracy, but another advantage emerged later. Unlike traditional tomography and LGST, long-sequence GST is easily adaptable to a wide range of noise models. In process tomography and LGST, each noisy gate is described by an arbitrary transfer matrix. Even small extensions - e.g., focusing on low-rank transfer matrices - require conceptual and practical modifications of tomography [155][156][157][158]. But long-sequence GST only requires a parameterized model for the noisy gates that can be embedded into transfer matrices. The gate set of Eq. 6 is such a model, with one parameter per element of $|\rho^{(i)}\rangle\rangle$, $G_i$, and $\langle\langle E^{(m)}_i|$ expressed in a chosen basis. But a gate set can, more generally, be described by any mapping between a parameter space and the aforementioned operations. Long-sequence GST can be used to estimate a gate set that is parameterized in any reasonable way. This flexibility can be used to model various features of, and constraints on, the gate set, including complete positivity (CP) and trace preservation (TP). We defer a discussion of this flexibility until Section V A, as these details are not essential to understanding long-sequence GST.
For now, it is sufficient to note that a gate set is a mapping from a parameter space (e.g., the direct sum of the vector spaces for each operator) to a set of operations capable of predicting circuit outcomes.
The core long-sequence GST protocol used by our research group has remained largely unchanged since about 2017. But between 2013 and 2017, it morphed repeatedly as we discovered that certain things didn't work, or found better ways to solve certain problems. Its current stable form is the result of a nontrivial evolution process. Like most products of evolution, it contains both historical artifacts and necessary solutions to non-obvious problems. Rather than present the state of the art as a fait accompli, we introduce it by outlining the historical path that led to it (Section IV A), emphasizing some of the techniques that didn't work. Next, we explain how to construct a good long-sequence GST experiment design, a set of circuits of depth at most O(L) that enables estimating gate set parameters to precision O(1/L) (Section IV B). We conclude this presentation of long-sequence GST by showing how to estimate those parameters by efficiently maximizing the likelihood function, and discuss a variety of technical insights that make the numerical methods reliable in practice (Section IV C).

A. Historical background
Our first attempt to achieve higher accuracy involved performing random, unstructured circuits and fitting gate set models to the resulting count data [33]. We augmented the set of circuits prescribed by LGST with a variety of random circuits (much like those used in direct randomized benchmarking [59]) of various depths, and then used numerical MLE to fit a gate set to the resulting data. This approach produces a likelihood function that is generically messy: it is not a quasi-convex function of the gate set, usually has local maxima, and has no particular structure. However, it is straightforward to evaluate, and its derivative can be computed efficiently from the data. Moreover, a subset of the data enables LGST, which provides a "pretty close" starting point for local gradient maximization of $\mathcal{L}(\mathcal{G})$.
Investigation of this approach revealed two problems. 2. The lack of structure in the likelihood function made numerical MLE problematic. Local gradient optimization was unreliable, and highly dependent on starting location. We achieved reasonable success in simulations by starting with LGST, and then refining this estimate by adding successively deeper circuits to the likelihood function. However, this technique proved less reliable in experiments, where the underlying model was less valid. Running the optimizer repeatedly with different starting conditions, and incorporating more global-optimization techniques, often revealed better local maxima of the likelihood. This suggested that even the best estimates we found might not be global maxima.
We concluded that (1) as the IBM group had observed earlier, brute-force MLE on unstructured GST data was not a satisfactory solution, and (2) we needed to choose circuits more cleverly (as with LGST) to make the analysis easier and more reliable.
We developed an approach called extended LGST (eLGST). It relied on two critical modifications to the original "unstructured MLE" approach, which imposed additional structure on the experiment design and the data analysis, respectively.
First, instead of performing random circuits, we constructed a set of circuits corresponding to performing LGST - not on the gates $G_k$ themselves, but on a small set of "base" circuits, $\{S_l\}$. We took each $S_l$ and sandwiched it between fiducial circuits (just as we did with each $G_k$ for LGST). Originally, the base circuits were simple repetitions of individual gates, e.g. $G_k^p$ for $p = 1, 2, 4, 8, \ldots$. We found that this amplified some, but not all, parameters of the gate set.
This extended-LGST (eLGST) experiment design eventually evolved into one where the base circuits were chosen to be a set of short "germ" circuits ($g_i$) repeated $p$ times ($g_i^p$). This spawned the "germ selection" procedure used today in long-sequence GST, which we discuss in detail in Section IV B. This structure ensures that each base circuit amplifies some set of deviations from the target gates, so that these deviations change the observed probabilities for the circuits built on that base circuit by $O(L)$. The germs are chosen so that they amplify different deviations, and so that collectively they amplify all deviations. This careful, non-random experiment design allowed eLGST to achieve consistent, reliable, and predictable accuracy that scales as $1/L$.

Second, instead of using MLE to fit a gate set directly to the data, eLGST fit the gates indirectly via a 2-step process. In the first step, we estimated the transfer matrix for each base circuit, $\tau(g_i^p)$. Then, those estimates were used to back out a transfer matrix for each gate. The eLGST protocol is a precursor to the long-sequence GST we now describe. Appendix B gives a full description of eLGST.

B. Experiment design for long-sequence GST
A long-sequence GST experiment is designed to enable high accuracy with minimal experimental effort.
LGST can estimate a gate set to arbitrary accuracy, but because uncertainty in the estimated parameters scales as $O(1/\sqrt{N})$ when each circuit is repeated $N$ times, achieving precision $\epsilon$ requires $O(1/\epsilon^2)$ repetitions. This makes precisions of $\epsilon \approx 10^{-5}$, as demonstrated in [159], practically impossible to reach.
Long-sequence GST overcomes this barrier by specifying a different experiment design - a set of quantum circuits to be run - containing circuits that amplify errors in the gate set. This experiment design retains the basic structure of LGST: each of a list of "operations of interest" is probed by constructing circuits that sandwich it between pre- and post-operation fiducial circuits. But instead of a single gate, the middle of each sandwich is a more complicated base circuit that amplifies certain errors so they can be measured more precisely by tomography. In this section, we present the long-sequence GST experiment design. We use the term experiment to mean an experiment design along with the data obtained by repeating each of its circuits many times.
Each long-sequence GST circuit consists of three consecutive parts (see Figure 4): 1. Prepare a particular state $|\rho_k\rangle\rangle$ by performing a native preparation followed by a fiducial circuit.
2. Perform p repetitions of a short circuit g that we call a germ.

3. Perform a particular measurement $\{\langle\langle E^{(m)}_i|\}$ by performing a fiducial circuit and then a native POVM measurement.
The outcome statistics from repeating such a circuit many times allow estimating probabilities of the form $p_i = \langle\langle E^{(m)}_i | \, \tau(g_j^p) \, |\rho_k\rangle\rangle$. If the states and measurements are each informationally complete, these probabilities are sufficient for tomography on $\tau(g_j^p)$. All GST experiments are derived from this basic structure, and every circuit performed for GST is specified by a $(j, k, p, m)$ tuple.
We call the short circuit $g$ a germ because, like a germ of wheat or the germ of an idea, it serves as the template for something larger - here, an arbitrarily deep circuit built by repetition. We refer to $g^p$, the object of tomography, as a base circuit and $p$ as a germ power. Repeating $g$ amplifies errors in $g$. For example, if $g$ is a single unitary gate $G$, and $G$ over-rotates by angle $\theta$, then $\tau(g^p)$ will over-rotate by $p\theta$. By varying the input state and final measurement ($k$ and $m$) over informationally complete sets, we probe the operation between them, just as in process tomography or LGST. If this tomography on $g^p$ measures $p\theta$ to precision $\pm\epsilon$, then we've measured $\theta$ itself to precision $\pm\epsilon/p$.
Simple germs comprising a single gate $G$ are not sufficient to amplify every error. Some errors can only be amplified by first constructing a multi-gate circuit, e.g., $g = G_1 G_2$, and then repeating it. Repeating $g$ many times amplifies errors that commute with $\tau(g)$. In the example above, an over-rotation error in $G$ is an error that commutes with $G$, so it is amplified. But suppose that $G$ rotates by the correct angle, but around the wrong axis: e.g., instead of a rotation around $\hat{z}$, $G$ performs a rotation around $\hat{z}\cos\phi + \hat{x}\sin\phi$. This tilt error is not amplified by repeating $G$, but it can be amplified by a more complex germ that includes $G$ together with other gates.
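This contrast can be checked numerically. The sketch below is an illustration only (not part of the protocol): it uses the SO(3) rotation picture of single-qubit gates, with hypothetical error magnitudes, to show that repetition amplifies an over-rotation linearly while a tilt error essentially cancels once the accumulated rotation wraps around $2\pi$:

```python
import numpy as np

def rot(axis, angle):
    """SO(3) rotation matrix about a unit axis (Rodrigues' formula)."""
    x, y, z = np.asarray(axis, float) / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

ideal = rot([0, 0, 1], np.pi / 2)                       # ideal pi/2 z-rotation
over = rot([0, 0, 1], np.pi / 2 + 1e-3)                 # over-rotation by 1e-3
tilt = rot([np.sin(1e-3), 0, np.cos(1e-3)], np.pi / 2)  # axis tilted by 1e-3

def err(G, p):
    """Distance between the noisy and ideal p-fold repetitions."""
    return np.linalg.norm(np.linalg.matrix_power(G, p)
                          - np.linalg.matrix_power(ideal, p))

print(err(over, 4) / err(over, 1))  # ~4: over-rotation error grows with p
print(err(tilt, 4) / err(tilt, 1))  # ~0: after a full 2*pi, the tilt cancels
```

A composite germ that mixes $G$ with rotations about other axes breaks this cancellation, which is exactly why multi-gate germs are needed.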
To achieve high precision estimates efficiently, we need to amplify every parameter in the gate set that can be amplified. Two kinds of parameters cannot be amplified. Gauge parameters cannot be measured at all, and properties of the SPAM operations cannot be amplified because SPAM operations only appear once in each circuit. So germs amplify errors in gates. Therefore, as we discuss selection of germs, we will focus exclusively on the gates and ignore SPAM operations.
In the following sections we describe how to select a set of base circuits - germs, and powers to raise them to - that amplify all of the amplifiable parameters of a given "target" gate set. This target gate set is typically taken to be the ideal gates a device is designed to implement. More details may be found in Appendix C 1.

FIG. 5: Overview of how to design a GST experiment.
Step 1. Choose germ circuits that amplify all gauge-invariant parameters in the processor's gate set.
Step 2. Define base circuits by raising each germ to powers chosen so that the base circuits' depths are as close as possible to being logarithmically spaced.
Step 3. Sandwich each base circuit between each of $N_{f1} \times N_{f2}$ fiducial pairs defined by $N_{f1}$ preparation and $N_{f2}$ measurement fiducial circuits.
Step 4. Optionally apply fiducial pair reduction (FPR) to eliminate redundant circuits from the experiment design, leaving just enough to ensure sensitivity to each base circuit's amplified parameters.

Final result. A GST experiment design can be visualized as a grid of plaquettes. The plaquettes correspond to base circuits, and are arranged by germ and by $L$, the maximum depth used to construct the base circuit from the germ. Within a plaquette, squares indicate different fiducial pairs; a plaquette is "sparse" when FPR has been applied.

Selecting germs (g)
To estimate a gate set efficiently and accurately, every variation in the gate set, except those corresponding to SPAM and gauge directions, must be amplified by some germ. We call a set of germs that achieves this goal amplificationally complete, in direct analogy to "informationally complete" sets of states or measurements.
To evaluate (and ensure) amplificational completeness, we model errors in the gate set as small perturbations to the target gate set. Each germ g, when repeated, will amplify certain errors. Specifically, any small perturbation to a gate set's parameters that causes a change in τ (g) that commutes with τ (g) will be amplified by g. The set of all such perturbations is easily shown to be closed under linear combination, so each germ g amplifies error vectors that lie in a subspace of the gate set's parameter space. This subspace can be easily constructed from the target gate set and the germ.
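The commutation criterion above is easy to test numerically. The sketch below is an illustration in the SO(3) picture (hypothetical helper names, not pyGSTi code): it checks whether a first-order perturbation $dG$ of $\tau(g)$ commutes with $\tau(g)$, distinguishing an amplified over-rotation from an unamplified tilt:

```python
import numpy as np

def rot(axis, angle):
    """SO(3) rotation about a unit axis (Rodrigues' formula)."""
    x, y, z = np.asarray(axis, float) / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def is_amplified(dG, G, tol=1e-8):
    """A first-order perturbation dG of tau(g) = G is amplified by
    repeating g iff it commutes with G (the criterion in the text)."""
    return np.linalg.norm(dG @ G - G @ dG) < tol

G = rot([0, 0, 1], np.pi / 2)
Kz = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 0]])  # z-rotation generator
d_over = Kz @ G       # over-rotation perturbation: commutes with G
eps = 1e-6            # finite-difference step for the tilt perturbation
d_tilt = (rot([np.sin(eps), 0, np.cos(eps)], np.pi / 2) - G) / eps

print(is_amplified(d_over, G))   # True
print(is_amplified(d_tilt, G))   # False
```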
It is also straightforward to construct the tangent subspace of gauge variations: it contains all small perturbations around the target gate set that do not change any observable probability at all. Its complement is the subspace of physical errors that need to be amplified. Now we can define amplificational completeness precisely: a set of germs $\{g_j\}$ is amplificationally complete for a target gate set $\mathcal{G}$ if and only if the union of the error subspaces amplified by the $g_j$ spans the complement of the subspace of gauge variations.
Amplificational completeness defines a concrete condition that germs must satisfy for GST. However, it depends on the target gate set, so each new target gate set requires finding a new set of germs. Furthermore, different desiderata may motivate different amplificationally complete sets of germs (e.g., it is reasonable to prioritize the smallest set of germs, or the shortest possible germs, or the set that achieves the highest precision for a given maximum depth). In the pyGSTi implementation of GST [104,105], we use a particular search algorithm to evaluate sets of short germs and find the smallest amplificationally complete set. Algorithmic and mathematical details can be found in Appendix C 1. We require that a chosen germ set always include each gate $G_i$ in the gate set as an independent germ.

Defining base circuits
Once a set of germs is selected, a set of base circuits is constructed by raising each germ to several powers $p$. We begin by selecting $p = 1$. Since each gate $G_i$ is also a germ, this ensures that each individual gate appears as a base circuit. What remains is to choose all the other values of $p$ that will appear in the experiment.
Large values of $p$ amplify errors more. But including only large powers $p$ would create an aliasing problem. If $\tau(g)$ over-rotates by $\theta = \pi/16$, then repeating it $p = 32$ times creates an over-rotation by $2\pi$, which appears to be no error at all! So, exactly as in phase estimation, smaller powers $p$ are needed too. Since this is essentially a binary search, logarithmically spaced values of $p$ are optimal.
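A quick numerical illustration of the aliasing problem (the angle $\theta = \pi/16$ is the hypothetical error from the text):

```python
import numpy as np

theta = np.pi / 16   # hypothetical over-rotation of tau(g)
for p in (1, 2, 4, 8, 16, 32):
    apparent = (p * theta) % (2 * np.pi)   # what tomography on g^p can see
    print(p, round(apparent, 4))
# At p = 32 the accumulated rotation is exactly 2*pi, so the error aliases
# away entirely; the shorter, log-spaced powers pin down theta unambiguously.
```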
More generally, a base circuit's $p$ is less relevant than its overall depth. If germ $g_1$ has depth 1 while $g_2$ has depth 5, then $g_1^{10}$ and $g_2^2$ both have depth 10. So in describing GST experiments, we typically organize base circuits by their approximate depth $l$ rather than by $p$. If we denote the depth of germ $g$ by $|g|$, then the depth-$l$ base circuit based on $g$ is $g^{\lfloor l/|g| \rfloor}$. Our intuition to use logarithmically spaced $p$ above leads us to choose logarithmically spaced $l$ - i.e., $l = 1, m, m^2, m^3, \ldots$ - to define the base circuits.
Making $m$ larger reduces the total number of circuits to be run, but requires higher precision (more repetitions) at each value of $l$. Empirically, we have found $m = 2$ to work reliably - i.e., $l = 1, 2, 4, 8, 16, \ldots$ - but other choices are possible. Germs of depth $d > 1$ do not have realizations for depths $l < d$, so a germ of depth 5 would appear first at $l = 8$ (with 1 repetition), then $l = 16$ (with 3 repetitions), etc. Thus, the depth of a "depth $l$" base circuit is not necessarily equal to $l$, but is always in the interval $(l/m, l]$.

Any given experiment can only include finitely many circuits, so for each germ there must be a maximum depth $L$ at which it appears. This is configurable. Selecting $L$ should be guided by three facts. First, increasing $L$ amplifies errors more and yields more precision (all else being equal). Second, increasing $L$ makes the experiment larger (more circuits), which requires more time to obtain and analyze. Third, there is essentially no benefit to increasing $L$ beyond the point where decoherence and stochastic errors dominate. If the rate of decoherence in each gate is $\eta$, then circuits of depth $\approx 1/\eta$ generally all produce the same equilibrium state, and little or nothing will be learned from circuits of depth $L > O(1)/\eta$. If different gates have very different rates of stochastic errors, then it is useful to let $L$ vary from germ to germ.
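The repetition count and realized depth of a "depth-$l$" base circuit can be sketched as a small helper (an illustration, not pyGSTi's implementation):

```python
def base_circuit(l, germ_depth, m=2):
    """Repetitions and realized depth of the 'depth-l' base circuit
    g^(l // |g|); returns None if the germ is too deep to appear at this l."""
    reps = l // germ_depth
    if reps == 0:
        return None
    depth = reps * germ_depth
    assert l / m < depth <= l    # realized depth always lies in (l/m, l]
    return reps, depth

for l in (1, 2, 4, 8, 16, 32):
    print(l, base_circuit(l, 5))   # a depth-5 germ first appears at l = 8
```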

Putting it all together
The circuits for a full GST experiment are constructed as follows. First, a set of amplificationally complete germs is selected (cf. IV B 1, Fig. 5 step 1). Next, a set of base circuits is chosen (cf. IV B 2, Fig. 5 step 2). Let these be given by $\mathcal{O} = \{ g_i^{p_{i,j}} \}_{i,j}$, where $i$ indexes a germ and $p_{i,j}$ is the $j$-th power applied to the $i$-th germ. As explained above, to amplify the desired errors it is sufficient that the GST circuits estimate the probabilities
$$p_{abij} = \langle\langle E^{(m(a))}_{t(a)} | \, \tau(F_a) \, \tau\!\left(g_i^{p_{i,j}}\right) \tau(F'_b) \, |\rho\rangle\rangle, \qquad \text{(61)}$$
where $a$ indexes a measurement fiducial (and, through $m(a)$ and $t(a)$, a native measurement and one of its outcomes), and $b$ indexes a preparation fiducial.
The corresponding circuit (which can be read off directly, from right to left, because the matrix ordering of Eq. 61 is reversed from time ordering) is repeated $N$ times to estimate $p_{abij}$. The steps of this circuit are:

1. prepare $|\rho\rangle\rangle$ using the native state preparation,

2. perform the preparation fiducial $F'_b$, the base circuit $g_i^{p_{i,j}}$, and the measurement fiducial $F_a$, in that order,

3. measure using the $m(a)$-th type of measurement.

The frequency $f_{t(a)} = n_{t(a)}/N$, where $n_{t(a)}$ is the number of times the $t(a)$-th outcome was observed, estimates $p_{abij}$. In this way, a single circuit's outcome data can estimate all of the $p_{abij}$ that differ only in their $t(a)$ (POVM effect) index.
Letting $a$, $b$, $i$ and $j$ range over their allowed values in the steps above defines all the circuits needed to run long-sequence GST. These circuits constitute a GST experiment design. Their structure is shown schematically in Figure 4, where again we omit the many indices (cf. Eq. 61) for clarity. (For later analysis of the circuits' outcome data, it is helpful to keep separate track of the fiducial and germ circuits, and the germ powers.) In Appendix D we show that a set of circuits chosen this way (using the particular pyGSTi algorithms discussed in Appendix C 1) results in a Heisenberg-like accuracy scaling with respect to circuit depth and total number of circuits.
Finally, we note that arriving at a definite list of circuits (an experiment design) required information based on the experimental circumstances. The sets of fiducials and germs were based on the available native gates (specified by a target gate set), and the maximum depth of the circuits was based on a tradeoff between resources and accuracy (longer experiments are more resource-intensive, because they require more and deeper circuits, but they give more accuracy -although only up to a maximum useful depth that is roughly the inverse of the decoherence rate). GST experiments are tailored to a hardware's capabilities.

C. Estimating gate set parameters in long-sequence GST
A gate set is a parameterized statistical model. The parameterization may be simple (e.g., every gate element is a parameter) or nontrivial (see section V A). Fitting a gate set model to data from the experiments described in Section IV B is an example of parameter estimation. There are many ways to do this (cf. section III E). In this paper, we focus on maximum likelihood estimation (MLE), which requires varying the gate set's parameters to maximize the probability of the data (or, for practical simplicity, its logarithm).
The loglikelihood function can be constructed as follows. Let $\mathcal{G}$ denote the model, $s$ index a circuit of the experiment design, and $N_s$ be the total number of times circuit $s$ was repeated in the GST experiment. Furthermore, let $m_s$ be the number of outcomes of $s$, and $N_{s,\beta_s}$ denote the number of times outcome $\beta_s$ was observed. The contribution to the total loglikelihood from $s$ is simply the multinomial likelihood function for an $m_s$-outcome Bernoulli scheme,
$$\log L_s = N_s \sum_{\beta_s} f_{s,\beta_s} \log(p_{s,\beta_s}), \qquad \text{(62)}$$
where $p_{s,\beta_s}$ is the probability predicted by $\mathcal{G}$ of getting outcome $\beta_s$ from circuit $s$, and $f_{s,\beta_s} = N_{s,\beta_s}/N_s$ is the corresponding observed frequency. The total loglikelihood for the entire GST experiment is just the sum
$$\log L = \sum_s \log L_s, \qquad \text{(63)}$$
where $s$ ranges over all of the circuits in the experiment design. This derivation assumes that $\mathcal{G}$ is trace-preserving (TP), so that $\sum_{\beta_s} p_{s,\beta_s} = 1$ for every $s$. An extension to non-TP gate sets requires some technical tricks and is explained in Appendix C 2.
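The total loglikelihood above can be transcribed directly (a sketch with hypothetical array inputs, assuming a TP model; not pyGSTi's optimized implementation):

```python
import numpy as np

def log_likelihood(pred_probs, counts):
    """Total loglikelihood: sum over circuits s of
    N_s * sum_beta f_{s,beta} * log(p_{s,beta}), assuming probabilities
    sum to 1 for each circuit (TP model)."""
    total = 0.0
    for p_s, n_s in zip(pred_probs, counts):
        p_s, n_s = np.asarray(p_s, float), np.asarray(n_s, float)
        N_s = n_s.sum()                          # repetitions of circuit s
        total += N_s * np.sum((n_s / N_s) * np.log(p_s))
    return total

# One 2-outcome circuit, repeated 100 times with 60/40 observed outcomes:
print(log_likelihood([[0.6, 0.4]], [[60, 40]]))
```

The loglikelihood is maximized when each predicted probability equals the corresponding observed frequency, which is what makes it a principled objective for fitting.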
Maximizing $\log L$ is made nontrivial by the structure of the problem. This contrasts with state and process tomography, where the parameterized model is a density matrix $\rho$ or a superoperator $G$, each observable probability is a linear function of the parameters, and the loglikelihood is a sum of logarithms of linear functions. This makes MLE state/process tomography a straightforward convex optimization problem.
In contrast, the GST likelihood function (Eq. 63) is extremely non-convex. Because gates appear repeatedly in circuits, the probabilities p s,βs are nonlinear functions of gate elements (and, more generally, the parameters of G). The construction outlined previously causes the p s,βs to oscillate, creating a loglikelihood function that looks like an egg crate or optical lattice, with many local maxima. Finding global maxima of such functions is generally hard.
The gauge freedom creates more problems, by turning unique maxima into "ridges" that trace out gauge orbits. Constraining the optimization to CP gate sets creates complexity, because the CP condition does not respect gauge symmetry and can create local maxima. Ignoring the CP constraint makes optimization easier, but allows unphysical gate sets that can have zero or negative likelihood.
We have developed a particular pipeline of MLE and related estimators that reliably maximizes Eq. 63 when $\mathcal{G}$ is parameterized in one of several common ways (including with TP and CPTP constraints). We present this method, not as the only way to perform the parameter-estimation step of long-sequence GST, but as a way of implementing this crucial step of the protocol.
Our approach is outlined in Algorithm 1. Here $\vec\theta$ is the vector of $\mathcal{G}$'s parameters being optimized, $D$ is a data set, Truncate($D$, $L$) is the subset of $D$'s data corresponding to circuits whose germ power has depth $\leq L$, and Argmin($S$, $\mathcal{G}$, $D$, $\vec\theta_1$) yields the $\vec\theta$ at which the statistic $S(\mathcal{G}(\vec\theta), D)$ is minimal, using a local optimizer seeded at $\vec\theta_1$. $D_0$ is the full GST data set, and $\vec\theta_0$ is the initial vector of model parameters, often provided by LGST (Section III). We have found $m = 2$ to work well in practice.
We highlight three elements of this approach that we believe are particularly important to its success: Optimization Stages. Our approach proceeds in multiple "stages". At each stage, we run a traditional local optimization method (we find the Levenberg-Marquardt technique to give good results) on a subset of all the data. Each stage sets a maximum base circuit depth $L$, and at that stage only data from base circuits with depth less than or equal to $L$ are incorporated into the likelihood function (Eq. 63). We choose $L = 1, m, m^2, m^3, \ldots$, so that the stage corresponding to $L$ contains the circuits whose base circuits are $g^{\lfloor l/|g| \rfloor}$ for $l = 1, m, m^2, \ldots, L$. Successive stages add deeper circuits, while keeping all the shorter circuits. By using the best-fit output of a stage as the starting point of the next, we create a daisy chain of estimations that avoids local minima. As long as finite sample fluctuations are kept small enough (by repeating each circuit enough times), the previous stage's best-fit estimate lies with high probability in the correct basin of the next stage's objective function, rendering its oscillatory nature benign.
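The staged optimization can be sketched as follows (hypothetical helper names; scipy's Levenberg-Marquardt least-squares routine stands in for the optimizer described above):

```python
import numpy as np
from scipy.optimize import least_squares

def staged_fit(residual_fn, data_by_depth, theta0, depths=(1, 2, 4, 8)):
    """Multi-stage fitting sketch: each stage refits using all circuits whose
    base depth is <= L, seeded with the previous stage's best fit (the
    Truncate + Argmin loop in the notation above)."""
    theta = np.asarray(theta0, float)
    for L in depths:
        data = [d for depth, d in data_by_depth if depth <= L]  # Truncate(D, L)
        result = least_squares(lambda t: residual_fn(t, data), theta,
                               method="lm")    # local Levenberg-Marquardt step
        theta = result.x                       # seed for the next, deeper stage
    return theta
```

Seeding each stage with the previous stage's optimum is what keeps the optimizer in the correct basin as deeper, more oscillatory circuits enter the objective.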
Use of $\chi^2$ as a $\log L$ proxy. At every stage except the last one, the $\chi^2$ statistic is optimized instead of the loglikelihood. The $\chi^2$ is a weighted sum-of-squares function that (like the likelihood) quantifies goodness of fit. Using our definitions above, it can be written as
$$\chi^2 = \sum_s \chi^2_s, \quad \text{where} \quad \chi^2_s = N_s \sum_{\beta_s} \frac{\left(p_{s,\beta_s} - f_{s,\beta_s}\right)^2}{p_{s,\beta_s}} \qquad \text{(64)}$$
is the contribution from a single circuit $s$. The weights $1/p_{s,\beta_s}$ ensure that $\chi^2$ is a local quadratic approximation to the negative loglikelihood. Compared with the loglikelihood, $\chi^2$ can be computed faster, and is better behaved as an objective function. It has one significant disadvantage: its minimum is a slightly biased estimator (see Appendix E). Optimizing the loglikelihood in the final stage renders this a non-issue, and we find minimizing $\chi^2$ to be more robust at seeding the final $\log L$ maximization than performing $\log L$ maximizations at every stage. One notable exception occurs when the number of circuit repetitions is low and $\chi^2$ becomes a poor proxy for $\log L$.
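The $\chi^2$ objective described above, transcribed directly (a sketch with hypothetical array inputs):

```python
import numpy as np

def chi2(pred_probs, counts):
    """chi^2 = sum_s N_s sum_beta (p - f)^2 / p: the weighted sum of squares
    used as a fast local proxy for the negative loglikelihood."""
    total = 0.0
    for p_s, n_s in zip(pred_probs, counts):
        p_s, n_s = np.asarray(p_s, float), np.asarray(n_s, float)
        N_s = n_s.sum()
        f_s = n_s / N_s                    # observed outcome frequencies
        total += N_s * np.sum((p_s - f_s) ** 2 / p_s)
    return total

print(chi2([[0.5, 0.5]], [[50, 50]]))   # 0.0: a perfect fit
```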
Regularization. We increase the reliability of optimization by regularizing both $\log L$ and $\chi^2$. Both functions have poles where probabilities are zero, which can lead to numerical instabilities when probabilities are near zero. By slightly altering each function, we make them more amenable to optimization. A simple and effective way to regularize Eq. 64 is to cap the least-squares weights at a maximum value, e.g. $1/p_{\min}$. Since $\chi^2$ is already just a proxy for the negative loglikelihood, this has no effect on the final fit. The $\log L$ function needs to be tweaked more carefully, by replacing Eq. 62 with its second-order Taylor series when $p_{s,\beta_s} < p_{\min}$, where $p_{\min}$ is chosen to be much less than the smallest possible nonzero frequency (e.g., $10^{-4}$ if each circuit is repeated 1000 times). Because this alters the $\log L$ function only when a probability is very different from its observed frequency, it distorts the objective's value (and thereby the rigor of its interpretation) only for particularly bad fits. We have evidence that these small adjustments to the standard $\log L$ and $\chi^2$ make local optimization methods more robust, presumably because they avoid regions with very large gradients and widen basins of convergence.
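A sketch of the $\log L$ regularization described above: below a cutoff $p_{\min}$ (a hypothetical value here), $\log p$ is replaced by its second-order Taylor expansion about $p_{\min}$, which removes the pole at $p = 0$:

```python
import numpy as np

P_MIN = 1e-4   # hypothetical cutoff, well below the smallest nonzero frequency

def safe_log(p):
    """log(p) for p >= P_MIN; below P_MIN, the second-order Taylor
    expansion of log about P_MIN (sketch of the regularization)."""
    if p >= P_MIN:
        return np.log(p)
    dp = p - P_MIN
    return np.log(P_MIN) + dp / P_MIN - dp ** 2 / (2 * P_MIN ** 2)

print(safe_log(P_MIN) == np.log(P_MIN))   # True: matches at the cutoff
print(np.isfinite(safe_log(0.0)))         # True: no pole at p = 0
```

The replacement is smooth through the cutoff (it matches the value and first two derivatives of $\log p$ at $p_{\min}$), so the optimizer sees no kink.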
The method outlined above is not guaranteed to find a global maximum of log L, but does so impressively often in practice. Improvements to the algorithm are a deserving area for future work, but the algorithm described above demonstrates that the parameter estimation problem lying at the heart of long-sequence GST is tractable, enabling its practical use.
The long-sequence GST protocol described above - designing an experiment, running that experiment to produce data, and finally analyzing the data via multi-stage MLE - produces a best-fit gate set in an unknown gauge. The gauge is irrelevant when predicting circuit outcomes, so the estimated gate set can be used immediately for this purpose. (For example, it is possible to predict the success probabilities of RB circuits and extract the RB number.) However, computing standard metrics such as fidelity and diamond distance requires an additional step called gauge optimization. We discuss this, and other common post-processing, in Section VI. Readers should feel free to jump ahead to that section, if desired. Section V covers several advanced topics omitted from the presentation of long-sequence GST so far.

V. ADVANCED LONG-SEQUENCE GST
This section discusses two additional topics related to long-sequence GST. Although they are not essential to the protocol as a whole, they infuse it with significant additional capability. First, we formalize the notion of a gate set model, and suggest a natural path to extending long-sequence GST by creating additional models. Second, we describe how the number of circuits in the GST experiment design (Section IV B) can be substantially reduced by taking advantage of the overcompleteness present in the standard design. The material here is not required by any of the remaining main text.

A. Gate set models
Gate sets, as we have defined them, contain general states, measurements, and superoperators. A gate set $\mathcal{G}$ (Eq. 6) represents each gate as a $d^2 \times d^2$ real-valued matrix, and each state preparation or measurement effect as a $d^2$-dimensional real-valued vector or dual vector, respectively. Let us define matrix space as $\mathcal{M} \cong \mathbb{R}^{N_e}$, where $N_e$ is the total number of (real) elements in a gate set. $\mathcal{M}$ is thus isomorphic to the direct sum of the vector spaces containing each gate set operation's matrix, vector, or dual vector. Any gate set (as defined by Eq. 6) is a point in matrix space, $\mathcal{G} \in \mathcal{M}$, with coordinates given by the elements of each gate set operation.
A gate set is a statistical model that assigns probabilities to the outcomes of quantum circuits. We use the term gate set model to mean a mapping between a parameter space $\mathcal{P} \cong \mathbb{R}^{N_p}$ and $\mathcal{M}$. The dimension of parameter space, $N_p$, is typically less than $N_e$. Formally, a gate set model corresponds to a choice of $\mathcal{P}$ and a map $W : \mathcal{P} \to \mathcal{M}$.
A gate set model is a parameterized statistical model that associates a gate set with every point in parameter space. Long-sequence GST finds an optimal gate set by searching over this parameter space, optimizing over $\vec\theta \in \mathcal{P}$. So by using different gate set models, we can constrain this optimization to subsets of the entire matrix space. Informally, a basis for $\mathcal{P}$ defines "knobs" that can be adjusted - e.g., by the MLE optimization, but also by hand if so desired - to vary a gate set and make it more consistent with the observed data. If we set $\mathcal{P} = \mathcal{M}$ and $W(x) = x$, then each element of every operation in the gate set is an independent parameter. We call this the fully parameterized gate set model, and we can use it for GST (as in Sec. IV). However, it includes all gate sets, even wildly nonphysical ones that violate complete positivity and/or trace preservation. So it is useful to define smaller gate set models that parameterize strict subsets of $\mathcal{M}$ (e.g., only CPTP gate sets).

Gauge freedom in constrained gate sets
Gate set models may have gauge freedoms. A gauge freedom exists if there is a transformation of $\vec\theta \in \mathcal{P}$ that changes $\vec\theta$ but does not change any observable probability computed from the corresponding gate set $W(\vec\theta)$. We have already discussed the gauge freedoms of the fully parameterized gate set model; they correspond to similarity transformations by invertible matrices $M$ that act on each gate matrix as in Eq. 7. For any given $\mathcal{G}$, the action of all possible $M$ on $\mathcal{G}$ traces out a gauge orbit containing all the gate sets $\mathcal{G}' \in \mathcal{M}$ that are equivalent to $\mathcal{G}$. For general gate set models, one must consider the pullback of Eq. 7 to $\mathcal{P}$. As we discuss below, this can make the analysis more complicated. In $\mathcal{M}$, gauge freedoms correspond to gauge orbits - sets of equivalent gate sets. But in $\mathcal{P}$, these freedoms can map onto a different construct. Still, it is helpful to picture the parameter space as being foliated into gauge orbits, like an onion or a layered pastry. Appendix A 1 describes gauge transformations in more detail.
It can be useful to count the gauge degrees of freedom for a given gate set model. All the gauge orbits, except for a set of measure zero, are manifolds of the same fixed dimension. (If one or more gates in a gate set have degenerate spectra, then they remain invariant under certain gauge transformations; the corresponding orbits have lower dimension, like the center of an onion.) Since P has dimension N_p, any point θ can be varied in N_p linearly independent directions. These define a local tangent space at θ, which we can partition into orthogonal subspaces of N_p^gauge gauge directions that are tangent to the gauge orbit on which θ lies, and N_p^nongauge non-gauge directions that are normal to the gauge orbit (see Appendix A 1). It follows trivially that N_p^gauge + N_p^nongauge = N_p. This partition of N_p is the same at almost all points θ (the exceptions are the singular points corresponding to gate sets with degenerate gates), allowing us to view P mathematically as a fiber bundle.
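The gauge/non-gauge partition can be counted numerically. The sketch below (our own illustration, not pyGSTi's algorithm) builds the gauge-orbit tangent vectors generated by M = I + εK for a basis of generators K, using the first-order action ρ → ρ + εKρ, E → E − εEK, G → G + ε(KG − GK), and counts the linearly independent ones. For a single-qubit gate set containing only X(π/2), whose transfer matrix has a degenerate eigenvalue 1, one generator commutes with every operation, so the orbit dimension comes out to 15 rather than d² = 16 - an instance of the degeneracy exception noted above.

```python
import numpy as np

def gauge_orbit_tangents(rho, E, gates):
    """Tangent vectors of the gauge orbit at (rho, E, gates).

    An infinitesimal gauge transformation M = I + eps*K acts to first
    order as rho -> rho + eps*K rho, E -> E - eps*E K, and
    G -> G + eps*(K G - G K). One tangent vector per generator K,
    where K runs over a basis of all d^2 x d^2 matrices."""
    d2 = rho.shape[0]
    dirs = []
    for a in range(d2):
        for b in range(d2):
            K = np.zeros((d2, d2))
            K[a, b] = 1.0
            pieces = [K @ rho, -(E @ K)] + [K @ G - G @ K for G in gates]
            dirs.append(np.concatenate([p.ravel() for p in pieces]))
    return np.array(dirs)

# One-qubit example in the Pauli transfer representation (d^2 = 4):
rho = np.array([[1.0], [0.0], [0.0], [1.0]]) / np.sqrt(2)  # |0> preparation
E = rho.T.copy()                                           # <0| effect
Gx = np.zeros((4, 4))                                      # X(pi/2) transfer matrix
Gx[0, 0] = Gx[1, 1] = 1.0
Gx[2, 3], Gx[3, 2] = -1.0, 1.0
D = gauge_orbit_tangents(rho, E, [Gx])
n_gauge = np.linalg.matrix_rank(D)   # dimension of the gauge orbit
```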
The fully parameterized gate set model has a parameter space whose dimension N_p is simply the total number of matrix and vector elements in the gate set, and it generally has d² gauge degrees of freedom; exceptions occur when one or more operators commute with every operation in the gate set.

TP and CPTP constrained models
Quantum theory requires density matrices to have unit trace, and the operations acting on them to be trace-preserving (TP). These are constraints on G which can be used to define a smaller gate set model. The TP constraint corresponds to locking the first row of every superoperator to [1, 0, . . ., 0] and the first element of every state preparation vector to 1/√d (see Sec. II A 2). Enforcing these constraints defines a TP parameterized gate set model. The mapping W is trivial to construct in this case: it leaves some of the elements of G fixed.
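The TP mapping W can be sketched directly (a simplified illustration, not pyGSTi's implementation): the constrained elements are simply held fixed while all others remain free parameters.

```python
import numpy as np

def impose_tp(G, rho, d=2):
    """Hold the TP-constrained elements fixed: the first row of each
    superoperator is [1, 0, ..., 0] and the first element of each state
    preparation vector is 1/sqrt(d) (in a basis whose 0th element is
    proportional to the identity)."""
    G = G.copy()
    G[0, :] = 0.0
    G[0, 0] = 1.0
    rho = rho.copy()
    rho[0] = 1.0 / np.sqrt(d)
    return G, rho

rng = np.random.default_rng(0)
G, rho = impose_tp(rng.normal(size=(4, 4)), rng.normal(size=4))
```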
When running long-sequence GST, the TP parameterized gate set model is generally superior to the fully parameterized model, and presents no extra complications. The gauge freedoms for the TP parameterized model are easy to derive; gauge transformations correspond exactly to matrices M that are also TP, so instead of d 2 gauge parameters, the TP parameterized model has d 2 − d gauge parameters.
Quantum theory also requires operations to be completely positive (CP). Imposing this constraint defines a CP parameterized gate set model. However, the CP constraint is harder to define (and impose) than the TP constraint. Whereas the TP constraint is a linear equality and defines a subspace of the full parameter space, the CP constraint is a nonlinear inequality.
A mapping function W whose range is constrained to CP gate sets can be constructed in several ways. One is to represent each gate by the Cholesky factorization T of its Choi matrix. In this way, each gate is represented by a lower-triangular matrix T_k with real diagonal elements, and the gate (transfer) matrix G_k is obtained by applying the Choi-Jamiolkowski isomorphism to the Choi matrix χ_k = T_k^† T_k. It is usually desirable to apply the TP constraint as well, to get a CPTP parameterized gate set model; this is done by additional constraints on each T_k (unfortunately, the TP constraint is not as simple in the Choi representation).
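The Cholesky trick guarantees positivity by construction: χ = T†T is positive semidefinite for any T. A minimal sketch (parameter-packing convention ours; normalized to unit trace here for simplicity):

```python
import numpy as np

def choi_from_cholesky(t_params, d):
    """Build a positive-semidefinite Choi matrix from Cholesky parameters.

    t_params fills a lower-triangular d^2 x d^2 matrix T (real diagonal,
    complex below-diagonal entries); chi = T^dag T is PSD by construction.
    Takes n + n(n-1) = n^2 real parameters for n = d^2."""
    n = d * d
    T = np.zeros((n, n), dtype=complex)
    idx = 0
    for i in range(n):
        for j in range(i):
            T[i, j] = t_params[idx] + 1j * t_params[idx + 1]
            idx += 2
        T[i, i] = t_params[idx]   # real diagonal
        idx += 1
    chi = T.conj().T @ T
    return chi / np.trace(chi)

rng = np.random.default_rng(1)
chi = choi_from_cholesky(rng.normal(size=16), d=2)  # 4^2 = 16 parameters
eigs = np.linalg.eigvalsh(chi)
```

Any parameter vector, however it is varied by the optimizer, maps to a valid (positive) Choi matrix.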
An alternative way of constraining a gate to be CP is to write it in terms of an error generator, which we denote ξ. Let G_k be the transfer matrix of a gate, and let the corresponding ideal (error-free) CPTP operation be G⁰_k. We define G_k's error generator, ξ, to be the logarithm of the quotient G_k (G⁰_k)⁻¹, so that

G_k = e^ξ G⁰_k.

The error generator ξ is itself a superoperator, and when ξ = 0, G_k = G⁰_k and the gate has no error. By restricting ξ to be a Lindbladian [160], we can guarantee that e^ξ is CPTP. Because CPTP maps form a semigroup, G_k is CPTP as well. The Lindblad form that ensures e^ξ is CPTP is

ξ = Σ_i α_i H_i + Σ_{j,k} β_{j,k} S_{j,k},   (70)

where the operators H_i and S_{j,k} act on density matrices ρ as

H_i(ρ) = −i[B_i, ρ],
S_{j,k}(ρ) = B_j ρ B_k^† − ½ (B_k^† B_j ρ + ρ B_k^† B_j),

each α_i is real, and β is a positive semidefinite (Hermitian) matrix. Here, {B_i} is a basis for B(H) with the properties given in Sec. II A 2. Indices j and k sum only over the non-identity elements, so they begin at 2. The condition on β can be implemented by parameterizing β's Cholesky factorization as described above.
We have found that this parameterization facilitates optimization better than the Choi-matrix parameterization. (This practical advantage makes this the CPTP parameterization of choice in pyGSTi.) A downside to the error generator approach is that not all CPTP maps can be put into the form given by Eq. 70 -only those that can be infinitesimally generated. We have not observed this restriction to have significant consequences in any experiment to date. Using either parameterization, the number of parameters N p for the CP and CPTP gate set models are the same as those for the full and TP models, respectively.
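To make the error-generator construction concrete, here is a small self-contained sketch (basis conventions and normalizations are ours, not pyGSTi's): a single-qubit generator with a Hamiltonian term (an X-axis overrotation) and a stochastic X term, exponentiated to give a map that is CPTP by construction.

```python
import numpy as np

# Single-qubit operator basis {B_i}: identity first, then the Paulis.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = [I2, X, Y, Z]

def superop(f):
    """Matrix of a linear map on 2x2 matrices, in the row-stacking basis."""
    cols = []
    for j in range(4):
        E = np.zeros((2, 2), dtype=complex)
        E.flat[j] = 1.0
        cols.append(f(E).reshape(-1, 1))
    return np.hstack(cols)

def expm(A, terms=30):
    """Matrix exponential via Taylor series (fine for small-norm generators)."""
    out = np.eye(A.shape[0], dtype=complex)
    term = np.eye(A.shape[0], dtype=complex)
    for n in range(1, terms):
        term = term @ A / n
        out += term
    return out

def lindblad_generator(alphas, beta):
    """xi = sum_i alpha_i H_i + sum_{j,k} beta_{jk} S_{jk} (Eq. 70), with
    H_i(rho) = -i[B_i, rho] and
    S_{jk}(rho) = B_j rho B_k^dag - (1/2){B_k^dag B_j, rho}.
    Indices run over the non-identity basis elements only."""
    def xi(rho):
        out = np.zeros((2, 2), dtype=complex)
        for a, B in zip(alphas, paulis[1:]):
            out += a * (-1j) * (B @ rho - rho @ B)
        for j in range(3):
            for k in range(3):
                Bj, Bk = paulis[j + 1], paulis[k + 1]
                BkdBj = Bk.conj().T @ Bj
                out += beta[j, k] * (Bj @ rho @ Bk.conj().T
                                     - 0.5 * (BkdBj @ rho + rho @ BkdBj))
        return out
    return superop(xi)

# Small X overrotation (Hamiltonian) plus weak stochastic X noise.
xi = lindblad_generator([0.02, 0.0, 0.0], np.diag([0.01, 0.0, 0.0]))
G = expm(xi)   # the error map e^xi; G_k = e^xi G0_k for a target gate G0_k
rho_out = (G @ np.array([1, 0, 0, 0], dtype=complex)).reshape(2, 2)
```

Because β is positive semidefinite, e^ξ applied to any state yields a valid (unit-trace, positive) density matrix.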
In the literature on state and process tomography, estimators that respect CP are common, and generally regarded as superior to unconstrained estimators. The argument is simple: nature and quantum theory only permit CP processes, so we should not consider non-CP estimates. Furthermore, the CP constraint constitutes universally valid prior information, and must therefore improve estimation accuracy (at the cost of some bias).
The value of CP is more ambiguous in the context of GST. The CP constraint can be imposed, as described above, but doing so has some practical consequences. It makes maximum likelihood estimation more complex [161], and complicates the construction of error bars (see Section VI C). It also removes a valuable diagnostic: the data should be consistent with a CP gate set whether or not the constraint is imposed, so when GST returns a non-CP gate set (possible when the constraint isn't imposed), this is a sign that something else is wrong - usually some form of non-Markovian dynamics in the experiment.
Most importantly, the CP constraint interacts badly with the gauge freedom. Every gate set on a gauge orbit is equivalent -gauge transformations do not change any physically observable property. But non-unitary gauge transformations do change complete positivity! This may seem paradoxical, since non-CP gates can produce negative circuit outcome probabilities. But this operational interpretation requires the ability to create arbitrary input states, including states entangled with a reference frame, and apply the gate to them. Such states constitute an absolute reference frame, and the gauge freedom arises precisely because no such reference frame is available. A gate set may be CP in one gauge, and non-CP in another gauge, because those two gauges imply different sets of allowable (non-negative) input states to the gates. We explore these issues further in Appendix A 1.
In practice, we have found it useful to perform long-sequence GST using both the CPTP and TP gate set models, and compare the results. If a significantly better non-CP fit is found, then the two estimates should be examined carefully: either the CPTP fit got trapped in a local maximum, or significant non-Markovian dynamics occurred during the experiment. If the difference between the two fits is not statistically significant, then the CPTP fit is often preferable for later analysis because it is better behaved in post-processing (e.g., each gate's eigenvalues cannot be greater than 1.0).
We conclude this discussion of gate set models by noting that we have only scratched the surface of possible models here. The long-sequence GST framework (Section IV) is agnostic to which gate set model is used. We have experimented with models ranging from the fully parameterized model to radically simplified models like a single-parameter model where each gate experiences depolarizing error with the same rate. In this paper we do not dig more deeply into these possible variations or their applications: we generally assume that one of the fully-, TP-, or CPTP-parameterized models is being used.

B. Fiducial-pair reduction
In the "standard long-sequence GST" experiment design of Section IV B, every base circuit is sandwiched between every possible pair of preparation and measurement fiducial circuits to generate the full experiment design. This enables complete tomography of each base circuit, which is clearly sufficient to estimate the gate set. But it also yields very large GST experiments. Informational completeness requires O(d²) preparation fiducial circuits and O(d) measurement fiducial circuits (because each one provides d − 1 independent outcomes), giving O(d³) fiducial pairs. In the 2-qubit case (d = 4), this means ≈ 4³ = 64 circuits are required for each base circuit. However, many of these circuits are redundant. Eliminating redundant circuits yields much smaller and more efficient experiments. The redundancy stems from the fact that (as observed previously) each germ g only amplifies certain "directions" in error space. Because this subspace corresponds to commuting deviations in τ_g (the operation produced by g), its dimension is that of τ_g's commutant. This can be much smaller than the space defining τ_g itself - e.g., if τ_g is a nondegenerate d² × d² matrix, then although perturbations to τ_g form a d⁴-dimensional space, the commutant is only d²-dimensional. So measuring all the fiducial pairs is redundant. We only need to ensure that each base circuit is probed so that every amplified perturbation impacts the observed outcome probabilities. Each amplified direction will affect the outcome probabilities corresponding to at least one fiducial pair. So if a given germ amplifies m directions in parameter space, it is sufficient to select m fiducial pairs that measure linearly independent probabilities sensitive to those m variations. Step 4 in Figure 5 depicts how fiducial-pair reduction (FPR) functions during the construction of a GST experiment design.
The pyGSTi implementation performs "fiducial pair reduction" via an algorithm, described in Appendix C 1, that starts by constructing an informationally complete set of effective SPAM pairs, then selects germs, and finally eliminates redundant fiducial pairs for each base circuit, one by one, until no further fiducial pairs can be eliminated without losing sensitivity to some amplified perturbation of the gate set's parameters.
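The core selection criterion - keep only fiducial pairs whose sensitivities add linearly independent information about the amplified directions - can be illustrated with a toy greedy routine (a deliberate simplification, not the Appendix C 1 algorithm):

```python
import numpy as np

def reduce_fiducial_pairs(J, tol=1e-10):
    """Greedy sketch of fiducial-pair reduction.

    J[p, a] is the sensitivity of the probability measured by fiducial
    pair p to amplified direction a. Keep a minimal set of pairs whose
    rows of J still span all of the amplified directions."""
    target_rank = np.linalg.matrix_rank(J, tol=tol)
    kept = []
    for p in range(J.shape[0]):
        trial = J[kept + [p]]
        if np.linalg.matrix_rank(trial, tol=tol) > len(kept):
            kept.append(p)          # this pair adds new information
        if len(kept) == target_rank:
            break                   # all amplified directions covered
    return kept

# Toy example: a germ that amplifies m = 3 directions, probed by 16 pairs.
rng = np.random.default_rng(2)
J = rng.normal(size=(16, 3))
kept = reduce_fiducial_pairs(J)   # only 3 of the 16 pairs are needed
```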
We have found that fiducial pair reduction can reduce the size of GST experiments significantly, by 50-90% in many cases. However, eliminating redundancies in the data also tends to increase the risk that GST's parameter estimation step will get stuck at a local maximum. We see the fiducial pair reduction technique as useful and sometimes necessary, but not one that should be applied in every case.

VI. ANALYZING GST ESTIMATES
We now turn from describing the GST protocols to interpreting their output. GST is a characterization protocol, and its purpose is not just to provide overall performance estimation (like benchmarking) but to reveal how an as-built component (e.g., gate or gate set) differs from its ideal. This requires nontrivial analysis. In this section, we give several ways to analyze or postprocess the results of a GST experiment. We first discuss how to assess the validity of the GST estimate (Section VI A). If a GST estimate is valid, it may be helpful to gauge-optimize the estimate in order to compute common gauge-dependent metrics like process fidelity or diamond distance (Section VI B). Finally, we describe how to put error bars on quantities derived from a GST estimate (Section VI C).

A. Goodness-of-fit and Markovianity
The first diagnostic from a GST experiment has nothing to do with the estimated gate set itself. It is the goodness of fit -how well does that estimate (whatever it is) match the data? GST's gate set model is capable of describing any set of Markovian gates in d-dimensional Hilbert space, along with state preparations and POVM effects (modulo optional TP and/or CP constraints). So if it fails to fit the data, this is strong evidence that some assumption of the model was violated. We describe all such violations as "non-Markovianity", meaning that the observed behavior was influenced by some internal or external context variable that was not included in the model [162,163]. Common phenomena of this type include slow drift [89], leakage [53,164], persistent environments [165,166], pulse spillover [45,[167][168][169], and gate-induced heating [170,171]. We do not attempt to diagnose specific phenomena here. Instead we focus on how to quantitatively detect when the GST model has "failed to fit the data" as well as it should.
Consider a GST experiment containing N_exp distinct circuits. It will produce a dataset described by N_o free parameters, where N_o is simply the number of independent circuit outcomes that can be observed. In the typical case where there is just one native measurement (N_M = 1) that has N_E^(1) outcomes,

N_o = N_exp (N_E^(1) − 1).

For a single qubit supporting just one native 2-outcome measurement, N_E = 2 and each circuit's data has 2 − 1 = 1 independent degree of freedom.
Any data set like this can be explained perfectly by a trivial "maximal model" with N_o parameters - the one that assigns one probability to each independent outcome. Fitting this maximal model to the data achieves log L_max (see Section IV C).
If GST's estimate has a likelihood of L, then we can measure how well GST fit the data by comparing log L to the maximal model's log L_max. If the data were produced by some Markovian gate set, then both the GST model and the maximal model are valid, and both should fit the data equally well if we account for the extra free parameters in the maximal model. Wilks' theorem [172] tells us how to do this accounting, and we can use it to quantify the fit quality of the GST estimate relative to that of the maximal model. Wilks' theorem states that if the gate set model is valid, then the loglikelihood ratio statistic between the two models is a χ²_k random variable:

2(log L_max − log L) ∼ χ²_k,

where k = N_o − N_p^nongauge is the number of extra parameters the maximal model has relative to the gate set model's non-gauge parameters.
If the loglikelihood ratio appears to have been sampled from a χ²_k distribution, then the GST model fits the data as well as can be expected, and captures the behavior of the device exposed by this data. On the other hand, if the observed loglikelihood ratio is so high that a χ²_k distribution would be very unlikely to have produced it, then we have evidence that the data were generated by a non-Markovian process. The χ²_k distribution has mean k and standard deviation √(2k). We quantify the observed model violation by the number of standard deviations by which the loglikelihood ratio exceeds its expected value under the χ²_k hypothesis:

N_σ = [2(log L_max − log L) − k] / √(2k).

N_σ ≤ 1 indicates an extremely good fit that appears completely trustworthy. (We have rarely seen this except on artificially simulated data.) Conversely, N_σ ≫ 1 indicates significant model violation, meaning that no gate set can describe all of the data. Since the gate set model only assumes Markovianity (and sometimes the physical TP and CP constraints), model violation indicates the presence of some kind of non-Markovian noise (as defined above).
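The statistic is a one-line computation; a sketch with made-up numbers (not from a real experiment):

```python
import numpy as np

def model_violation_nsigma(logl, logl_max, k):
    """N_sigma: number of standard deviations by which the loglikelihood
    ratio statistic 2*(logL_max - logL) exceeds its chi^2_k mean k."""
    lam = 2.0 * (logl_max - logl)
    return (lam - k) / np.sqrt(2.0 * k)

# Hypothetical loglikelihoods for illustration only:
k = 1000                                               # N_o - N_p^nongauge
nsig_ok = model_violation_nsigma(-5500.0, -5000.0, k)   # ratio equals k
nsig_bad = model_violation_nsigma(-5600.0, -5000.0, k)  # ratio is 1200
```

A loglikelihood ratio equal to k gives N_σ = 0 (a textbook-perfect fit), while an excess of 200 over k = 1000 gives N_σ ≈ 4.5, i.e., significant model violation.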
We emphasize, however, that a large N_σ value does not necessarily mean that the observed behavior is strongly non-Markovian. It indicates high confidence in the conclusion "the Markovian model was violated". This does not quantify how much the model would have to be expanded to fit the data well, nor the probability of observing a surprising event, and it doesn't imply that the GST estimate is useless or untrustworthy. If there is any model-violating behavior at all, then N_σ generally increases linearly with the number of times each circuit is repeated. So more sensitive experiments will yield higher N_σ, even with exactly the same physics. Quantifying model violation in more operational and useful ways is a compelling topic for future research.
The N_σ statistic is sensitive to the total model violation, added up across all circuits. It's also useful to identify individual circuits whose behavior is inconsistent with the GST estimate. We can do this by examining each circuit's contribution to the loglikelihood, log L_s (Eq. 62), and comparing it to the maximal model. Wilks' theorem predicts that 2(log L_max,s − log L_s) should be χ²_k-distributed, with k approximately equal to the number of independent outcomes of a single circuit. 4 These per-circuit tests are almost independent from the "total" test described above - either test may reveal model violation when the other does not, although most forms of model violation trigger both.
If many circuits are tested simultaneously for model violation (as is usually the case), it is important to raise the threshold for detecting a violation, using extreme value theory. If K tests are performed in parallel, then to maintain an overall confidence level of α (i.e., to keep the probability of any false positive at 1 − α), each test should be performed at the α^(1/K) confidence level.
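This adjustment (equivalent to the Šidák correction for independent tests) is a one-liner:

```python
def per_test_confidence(alpha, K):
    """Per-test confidence level such that K independent tests jointly
    report no false positive with probability alpha, since
    (alpha**(1/K))**K = alpha."""
    return alpha ** (1.0 / K)

# 100 parallel per-circuit tests at overall 95% confidence:
c = per_test_confidence(0.95, 100)   # each test run at ~99.95% confidence
```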
We have found a plot of the per-circuit goodness-of-fit values, represented as a grid of colored boxes arranged by germ and base circuit depth (see Figure 6), to be a useful diagnostic tool. The color scale in Figure 6 changes from a linear grayscale to a logarithmic red scale when the log L_s value of a box exceeds the 95th percentile of the expected χ² distribution (but see the note above about adjusting for multiple tests). Deeper circuits are more sensitive to most forms of non-Markovian noise. If model violation (red boxes) appears preferentially in circuits containing a particular germ, this usually indicates that a gate appearing in that germ is the cause.

4 Technically it would be more accurate to subtract from this value the number of gate set parameters "per circuit", but this is usually much less than one and has virtually no impact.

Figure 6 (caption, partial): ... for circuits s. Each 6 × 6 plaquette of colored squares represents a set of circuits based on the same base circuit g_p, with (maximum) depth L increasing along the horizontal axis and the germ g varying along the vertical axis. The 36 distinct boxes within a plaquette represent distinct circuits, formed by sandwiching g_p between 6 different preparation fiducials (indexed by column) and 6 measurement fiducials (indexed by row). In this particular case the single-qubit gate set ideally comprising X(π/2), Y(π/2), and I (the identity) was used, and both the preparation and measurement fiducials are the set {∅, X, Y, XX, XXX, YYY}, where X and Y are shorthand for X(π/2) and Y(π/2).

B. Gauge optimization
We now turn to analysis of estimated gate sets. The output of "doing GST" on a quantum processor -constructing an experiment, performing the circuits, and fitting a model to the data -is a gate set that represents how that processor's gates act. The catch, as noted repeatedly above, is that this estimate is only unique up to gauge transformations. And while gauge transformations have no effect on outcome probabilities of circuits, they can have a huge effect on the individual transfer matrices representing the gates, and derived quantities of interest like fidelity.
The ideal solution to this problem would be to only compute and report gauge-invariant properties. But most of the metrics commonly used to assess quantum operations are not gauge-invariant. They vary -usually quite a lot -over gauge orbits. This means they don't correspond to physically observable quantities [54]. These metrics originated as ways to quantify the performance of a single gate in the context of other, perfect operations that form a reference frame. Gauge freedom emerges from the absence of a reference frame, and the motivation for GST is that, despite common assumptions, experiments on quantum processors usually do not feature an absolutely reliable reference frame. But even if they are flawed, gauge-variant metrics like fidelity, diamond distance, or entropy are not currently dispensable; no gauge-invariant replacements exist. So in order to compute well-studied gauge-variant metrics such as process fidelity and diamond distance for GST estimates, we choose a particular gauge using what we call gauge optimization.
Gauge optimization means reporting the GST estimate, Ĝ, in a gauge that optimizes some gauge-variant metric of "closeness" to the ideal target gate set. In other words, we choose the gauge to make the gates look as good as possible. The best metric is not unique or obvious, and can be highly situation-dependent. The implementation of GST in pyGSTi [105] supports gauge optimization using a variety of metrics. Actually performing the gauge optimization, once a metric is chosen, is straightforward and uninteresting (pyGSTi uses local gradient minimization, with occasional long random jumps to avoid rare local minimum traps). We focus here instead on (1) the rationale and justification for gauge optimization, and (2) the rationale for choosing a metric to optimize.

Why gauge optimization?
A common and fundamental objection to GST and gauge optimization goes something like this: "Intentionally seeking out and constructing the gate set closest to the desired target seems like cheating, if not actively circular." No truly satisfactory answer to this can exist, because in the absence of a privileged reference frame, gauge-variant metrics have limited meaning and shouldn't be computed. Gauge optimization is fundamentally a hack. But the quantum computing community has been reluctant to abandon metrics like process fidelity, diamond distance, and trace distance that are deeply rooted in the theory and literature of quantum information - not least because no good alternatives are known yet.
But, with that disclaimer, there are good arguments (which we find compelling) that gauge optimization is at least a good hack, contrary to the objection proposed above.
First: Gauge optimization is not an all-powerful tool for reducing errors. It cannot alter any predicted circuit outcome probabilities, and is therefore powerless to improve any circuit-outcome-based performance metric. Gauge optimization does seek out the gate set closest to the desired targets - but it searches over a very limited set of candidates! Each gate's eigenvalues are gauge-invariant. Target gates are almost always unitary, so they have unit-modulus eigenvalues. Any errors that change a gate's eigenvalues - e.g., stochastic errors that shrink a gate's eigenvalues toward zero - cannot be disguised or eliminated by gauge transformations. Almost every known physical error in gates cannot be eliminated by gauge transformations. Even coherent tilt errors, which manifest as relational mismatches between the rotation axes of different gates, can only be pushed from one gate to the other by gauge transformations.
Second: While gauge transformations cannot remove most physically plausible errors, they can create arbitrarily large unphysical errors. Applying a large unitary gauge transformation to the target gate set itself will make every gate appear to have large coherent errors, and thus large infidelities and diamond distances. But of course, this is an illusion -this gate set has no errors at all, because it's just another representation of the target. As long as we are stuck with gauge-variant metrics, this example demands gauge optimization or an equivalent gauge-fixing procedure.
Third: It's tempting to seek a different gauge-fixing procedure that produces a standard form for gate sets, without explicitly trying to make them look like the target gate set. But, as the previous example shows, any gauge-fixing procedure that works actually has to do this anyway -just in disguise. If we construct a gate set that's equivalent to the target, and then feed it into a gauge-fixing procedure, it absolutely must return the target gate set, regardless of what the target is. If it doesn't, then it will return a gate set that appears to have errors, which is not true (by construction).

Metrics for gauge optimization
Given an estimated gate set Ĝ and a target G_0, gauge optimization finds the gauge transformation M that minimizes some function f(Ĝ_M, G_0), where Ĝ_M denotes the result of applying M to Ĝ. Some general guidelines for choosing f include: it should be non-negative, it should be uniquely zero if Ĝ = G_0, and it should facilitate minimization (e.g., smoothness and convexity are desirable). However, it does not need to satisfy all the properties of a mathematical metric (e.g., the triangle inequality). We examined variations on three different metrics, all derived from summing up some well-known function of two matrices over all the logic operations in the gate set 5 :

1. Infidelity,

2. Trace/diamond distance,

3. Squared Frobenius (Euclidean) distance.
Infidelity and trace/diamond distance have operational interpretations in quantum information. The Frobenius distance between transfer matrices and density matrices generally does not.
Infidelity is not a suitable metric for gauge optimization for an interesting reason: when applied to general matrices (with no positivity constraint), it is neither strictly positive nor uniquely minimized when Ĝ = G_0. This is related to the fact that infidelity is not actually a metric in the mathematical sense. If G is a unitary gate, then although G has fidelity 1 with itself - F(G, G) = 1 - it's actually possible to achieve fidelity greater than 1 by performing a nonunitary gauge transformation, so F(G, M G M⁻¹) > 1 for some M. We emphasize that in these circumstances, M is not unitary, and M G M⁻¹ is not completely positive. But nonetheless, this means that infidelity only satisfies the required desiderata if the CP constraint is imposed during gauge optimization. This is inelegant, and poses practical problems in gauge optimization. Furthermore, using infidelity leads to other undesirable outcomes beyond this simple example. We conclude that infidelity is not a good metric for gauge optimization.
We found trace/diamond distance to be expensive to compute (because there is no closed form for diamond distance), hard to work with (because it is not smooth), and to yield results not significantly different from squared Frobenius distance. In principle, diamond distance is an appealing metric - it has deep operational meaning, and there is a nice elegance in saying "We have chosen the gauge that minimizes the diamond distance to the target gates." However, we found it too inconvenient in practice to justify this advantage.
We find the squared Frobenius distance to be the most useful gauge optimization metric. It is well-behaved and fast to compute, and we have found few disadvantages to its use. In particular, we utilize the weighted sum of squared Frobenius distances between the estimated and target gate set elements,

f(Ĝ) = Σ_i α_i |ρ̂^(i) − ρ^(i)|² + Σ_i β_i |Ĝ_i − G_i|² + Σ_{m,i} γ_{m,i} |Ê_i^(m) − E_i^(m)|²,   (78)

where the notation of Eq. 27 is used to denote the elements of the estimated (Ĝ) and target (G) gate sets, and |·| represents the Frobenius norm. The weights α_i, β_i, and γ_{m,i} are real numbers. The weighting turns out to be important, for the odd and interesting reason that SPAM errors cannot be amplified. Long-sequence GST with very deep circuits can resolve errors in gates to very high precision, potentially below 10⁻⁵. But achieving this sort of accuracy in SPAM operations is virtually impossible. An error that rotates the initial state relative to the measurement axis by a small angle θ cannot be detected using fewer than O(1/θ²) shots. In an experiment where the longest circuit is L gates deep, and each circuit is repeated N times, estimation error in the SPAM operations is typically O(1/√N), while estimation error in the gates is O(1/(L√N)), which can be vastly smaller. But gauge transformations can "slosh" certain errors - coherent errors, in particular - between SPAM operations and gate operations. As a specific example, consider a GST experiment with N = 10⁴ and L = 100, performed on a perfect experimental system. In this idealized case, all estimated errors are due entirely to finite-sample noise in the experiment. We would expect the estimated errors in the gates to be around ±1/(L√N) ≈ 10⁻⁴, but estimated errors in the SPAM to be around ±1/√N ≈ 10⁻². But if we perform gauge optimization and seek to minimize the equally weighted sum of Eq. 78, the optimal solution will "split the difference," finding a gauge in which both the SPAM and gate operations have about 0.5 × 10⁻² coherent error.
This is a poor and misleading choice of gauge, because the highly precise estimate of the gates is polluted by estimation error in the SPAM.
One solution is to adjust the weights in Eq. 78, assigning different weights to SPAM operations (α_i and γ_{m,i}) and gate operations (β_i). We have found that a good rule of thumb is to assume that the finite-sample estimation error in each operation is inversely proportional to the largest number of times it appears in any circuit - typically L for gates and 1 for SPAM - and then give the corresponding term in the metric a weight proportional to the inverse square of its estimation error. So, in the example above, we would set α_i = γ_{m,i} = 1 and β_i = 10⁴ and minimize Eq. 78. This would tend to split relational error between SPAM and gates unequally, assigning 99% of the error to the SPAM.
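The weighted objective in the spirit of Eq. 78 (notation simplified here to a single prep, single effect, and a list of gates; the weights follow the rule of thumb above) can be sketched as:

```python
import numpy as np

def weighted_frobenius(rho_hat, rho0, E_hat, E0, gates_hat, gates0,
                       alpha, gamma, beta):
    """Weighted sum of squared Frobenius distances: alpha and gamma
    weight the SPAM terms, beta weights the gate terms."""
    f = alpha * np.linalg.norm(rho_hat - rho0) ** 2
    f += gamma * np.linalg.norm(E_hat - E0) ** 2
    f += beta * sum(np.linalg.norm(Gh - G0) ** 2
                    for Gh, G0 in zip(gates_hat, gates0))
    return f

# Rule of thumb: gate errors ~ 1/(L sqrt(N)), SPAM errors ~ 1/sqrt(N),
# so with L = 100 weight the gate terms by beta/alpha = L^2 = 1e4.
rho0 = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
G0 = np.eye(4)
f = weighted_frobenius(rho0 + 1e-2, rho0, rho0, rho0,
                       [G0 + 1e-4], [G0], alpha=1.0, gamma=1.0, beta=1e4)
```

With these weights, a 10⁻² SPAM perturbation and a 10⁻⁴ gate perturbation contribute on comparable scales, so the optimizer no longer "splits the difference" by polluting the gates.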
Another heuristic for avoiding poor gauge choices, which we find works even better in practice, is to perform several sequential "stages" of gauge optimization. Each stage uses the Frobenius distance metric (Eq. 78) with different weights and a different optimization domain. In the case of a fully parameterized gate set these stages are:

1. Optimize over the full gauge group (over all invertible M in Eq. 7) using uniform weights. This finds a decent gauge choice - one that may need some tweaking but isn't too far from a good choice. Starting from this point, we next:

2. Optimize over the unitary group (over all unitary M) using 100% weight on the gates (α_i = γ_{m,i} = 0, β_i = 1). This rotates the gates into the best versions of themselves possible, and is justified in the common case where we know the gates to far higher accuracy than the SPAM. By restricting to the unitary group, we disallow transformations that make the gates slightly better by drastically stretching or skewing the notion of distance.
3. Optimize over the SPAM gauge group using 100% weight on the SPAM operations. We define the "SPAM gauge group" as the 2-parameter family {diag(α, β, . . . β) : α, β ∈ R}, where diag(. . .) is a diagonal matrix with the given values on its diagonal. Matrices of this form slosh error between SPAM operations and the non-unital part of the gates (recall that the 0-th basis element is the identity) in a gate set. Thus, this optimization reduces errors on the SPAM operations at the expense of increasing non-unital gate errors. Because the SPAM gauge group doesn't necessarily preserve complete positivity, we include this as an optimization constraint.
In the case of a TP gate set, the initial stage is modified to optimize only over TP matrices M, and the SPAM gauge group of the final stage is restricted to the 1-parameter family of diagonal matrices {diag(1, β, . . . β) : β ∈ R} whose top-left element is one. We treat CPTP gate sets by simply omitting the first stage of the procedure followed in the TP case. Gauge optimization yields a gate set Ĝ_GO that is gauge-equivalent to Ĝ, but effectively in standard form. That is, any distinct yet equivalent gate set would be brought to the same form by gauge optimization, provided that the same metric was used for gauge optimization. From Ĝ_GO, gauge-variant quantities can now be computed, and assigned meaning on the grounds that at least they've been computed in a fixed and well-motivated gauge. Computing gate metrics such as the fidelity or diamond distance requires some type of informed gauge-fixing, and gauge optimization as described here is the best method we have found.

C. Error bars
The GST analysis pipeline described so far produces a point estimate of the gate set - a single gate set that fits the data well. But estimation is like throwing darts at a board: no dart ever hits the exact center of the target. Even if we assume there exists a "true" gate set, Ĝ_MLE will at best be close to it. We need to estimate how close. We need what physicists tend to call "error bars", and what statisticians refer to as a region estimator.
At least three kinds of region estimators are commonly used: confidence regions [173], credible regions [174,175], and standard errors [176]. The details of what each means, and their relative merits, are rather technical, and good discussions relevant to tomography can be found in [13,177]. Here, we mostly follow the confidence region approach, but we do not attempt rigor. Our goal is to demonstrate that putting plausible, reasonable error bars around GST estimates, by adapting well-known techniques, is possible. We welcome rigorous critiques and improvements.
We have successfully used two techniques to equip GST estimates with error bars. The first is bootstrapping. The second is likelihood-ratio confidence regions. Our limited testing has found both to reliably produce reasonable error bars. However, our testing also makes it clear that GST is a "messy" statistical problem, and further research and development are needed. To put this more bluntly: we would not attempt to publish the error bar algorithms reported here as a standalone paper at this time, because they are insufficiently developed. But, at the same time, we believe it is irresponsible to publish or deploy a QCVV protocol without any mention of error bars. Since the overall theory of GST is mature and overdue for publication, our imperfect solution to these frustrated constraints is to discuss the best available error bar technology for GST, including this clear disclaimer.

1. General problems with region estimators
There is no universal agreement on the "correct" way to report uncertainty about any tomographic estimate. However, the most principled approach appears to be to construct and report either a confidence region [13,177,178] or credible region [16,179]. In either approach, the tomographer describes their uncertainty by (1) choosing a probability α (e.g., α = 95%), then (2) reporting a subset R of the parameter space, of more or less arbitrary shape, for which a statement like "R probably contains the true parameter, with probability α." can be made. 6 Both approaches pose some significant practical challenges.
1. There is no standard or efficient way to describe an arbitrary region in a high-dimensional space.
2. This construction requires the tomographer to pick a particular α in advance. But this value is rarely meaningful to end users, and there is absolutely no guarantee that a region for a different confidence level α′ ≠ α can be extrapolated from the α region.
3. Researchers and end users are almost always more concerned with some particular scalar function f ( θ) of the high-dimensional tomographic parameter θ than with θ itself. Extracting an interval estimate for f ( θ) from a region estimate for θ is nontrivial, and if done crudely can dramatically overestimate the uncertainty in f .
None of these problems undermines the theoretical validity of optimal confidence or credible regions. But because of them, most experiments either ignore error bars, or simplify theoretically rigorous techniques so ruthlessly that their provable properties (optimality or rigorous coverage) are lost. For example, the first issue can be mitigated by specializing to regions of a particular shape - e.g., ellipsoids, or spheres in a particular metric - but if coverage probability is maintained, this usually requires regions that are far from optimal and overestimate uncertainty.
The two methods we discuss here can both be motivated by a fairly common ansatz: we assume that local asymptotic normality (LAN) applies. When LAN holds,
1. the likelihood function for the current data is Gaussian, and
2. the likelihood function for other hypothetical data sets that could have been observed (but weren't) is almost surely Gaussian with almost the same covariance matrix.
These consequences mean that we can treat the underlying statistical problem like a Gaussian shift model. The MLE will be N(θ_true, σ) distributed around the true parameter value, and the likelihood function will be a Gaussian whose covariance matrix is that same σ. Inasmuch as this assumption holds, it resolves all of the practical problems above:
1. Every optimal confidence region is an ellipsoid centered at the MLE with covariance σ.
2. Changing α just requires scaling that ellipsoid up or down.
3. Confidence intervals for any linear function f(θ) can be extracted very straightforwardly from the confidence region (by marginalizing the confidence ellipsoid and applying a Bonferroni correction), and the same procedure can be used when f(θ) is nonlinear as long as it's approximately linear over the extent of the confidence ellipsoid.
Of course, LAN will never hold exactly, and we can construct GST examples where it is a pretty bad approximation. In these cases, ellipsoid confidence regions will (usually) overestimate uncertainty, and (occasionally) have insufficient coverage probability. More research is required to identify and map out the regimes where the LAN ansatz used here is unreliable.

2. Bootstrapping
The basic idea of the bootstrap is very simple: repeatedly simulate the experiment that generated the real data to generate a large number of "fake" datasets, analyze each of them to produce a point estimate, and use their scatter (variance) to construct a region. This works very well for Gaussian shift models, as long as a large enough sample of fake datasets is constructed. It is known to break down when the estimator is biased, when it is significantly heteroskedastic, or if the bootstrap distribution is undersampled. Both bias and heteroskedasticity are known problems for state and process tomography due to positivity constraints. We find these to be less important for GST for two reasons. First, we do not usually impose CPTP constraints explicitly. Second, long-sequence GST can achieve such high accuracy that the data bound the estimate comfortably away from the zero-probability boundaries where heteroskedasticity becomes large. When we have cross-validated bootstrapping against the alternative method described below, we have observed consistent results.
The two basic types of bootstrap are parametric (in which the "fake" datasets are simulated from scratch using the MLE gate set) and non-parametric (in which each circuit's outcomes are resampled directly from the data, with replacement). We have performed both types, and have not seen significant differences. However, on theory grounds, the parametric version should be more reliable because it is less affected by constraint-induced bias, and by certain small-sample-size effects.
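The two variants can be sketched as follows for a single circuit's outcome counts. All names here are illustrative, and the scalar "estimator" stands in for the full MLE re-analysis described in the text.

```python
import numpy as np

# Minimal sketch of the two bootstrap variants, for a single circuit with
# two-outcome counts. The toy estimator below is hypothetical; in real GST
# each resampled dataset is re-fit with MLE to produce a gate set estimate.
rng = np.random.default_rng(0)

def nonparametric_resample(counts, rng):
    """Resample outcomes directly from the observed data, with replacement."""
    return rng.multinomial(counts.sum(), counts / counts.sum())

def parametric_resample(model_probs, n_shots, rng):
    """Simulate a fresh dataset from the fitted (MLE) model's probabilities."""
    return rng.multinomial(n_shots, model_probs)

def bootstrap_error_bar(counts, estimator, n_boot=300, rng=rng):
    """Standard error of a scalar estimator via the non-parametric bootstrap."""
    estimates = [estimator(nonparametric_resample(counts, rng))
                 for _ in range(n_boot)]
    return float(np.std(estimates, ddof=1))
```

For the parametric variant, `model_probs` would come from the MLE gate set's predicted probabilities rather than from the raw counts.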
Applying standard bootstrap methods to GST encounters two specific problems. First, the GST model has a lot of parameters. One-qubit gate sets usually have at least n = 30 free parameters, while 2-qubit gate sets usually have at least n = 1000. Unless at least n samples are generated, the sample covariance will necessarily be rank-deficient, so estimating the population covariance by bootstrapping requires at least O(n^2) samples. Since each bootstrap sample requires running an MLE analysis on a fake dataset, which can take minutes or hours, the time required to generate 10^6 (or even 1000) samples is excessive.
So, instead of trying to construct a confidence region for the entire gate set (which, as noted above, is almost never directly useful), we use the bootstrap to construct confidence intervals for scalar parameters directly. Estimating the uncertainty in a single parameter this way requires only O(1) samples, and k distinct parameters (e.g., all the matrix elements of the transfer matrices in the gate set) can be reliably bootstrapped with only O(log k) samples.
The second issue is more subtle, and stems from the gauge freedom in gate sets. We discuss this in Sec. VI C 5 below.

3. Confidence regions based on the Hessian of the likelihood
Likelihood-ratio (LR) confidence regions are an alternative and logically independent approach to quantifying uncertainty. They are based on the intuition that the true θ will probably have a high likelihood -not much less than the maximum -and so the set of all θ with likelihood above some threshold constitutes an efficient confidence region. The basic theory for LR confidence regions in tomography can be found in Ref. 173.
If we assume that LAN applies, then the likelihood function is Gaussian, its logarithm is quadratic, and so it is completely determined by its second derivatives or Hessian evaluated at the MLE. Under these assumptions, a LR confidence region can be computed simply by 1. Calculating the Hessian H of the loglikelihood at the MLE, which is somewhat tedious but straightforward.
2. Inverting H and multiplying H^-1 by an appropriate scaling factor to define an ellipsoid that coincides with a specific contour of the likelihood function.
The desired contour is defined by χ² theory, and corresponds to gate sets G whose loglikelihood satisfies

2 [log L(Ĝ) − log L(G)] ≤ λ_thresh,

where k = N_p^nongauge is the number of non-gauge parameters in the gate set, L(Ĝ) is the maximum likelihood for any gate set (e.g., GST's point estimate), α is the desired confidence level, and λ_thresh is the α quantile of a χ²_k distribution. Confidence intervals for linear functions f(θ) = v · θ of the gate set can be derived easily from this procedure. We simply project the covariance matrix H^-1 onto the subspace spanned by v to get the variance of a scalar, and then set λ_thresh based on k = 1 free parameter. However, we have to deal with the gauge freedom first.
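Concretely, the threshold comes from the χ² quantile function. A small sketch using the standard likelihood-ratio form of the region (the parameter count k below is a hypothetical example value, not one computed in this paper):

```python
from scipy.stats import chi2

# lambda_thresh is the alpha quantile of a chi^2_k distribution, with k the
# number of non-gauge parameters. k = 40 below is a hypothetical value.
k_nongauge = 40
alpha = 0.95
lam_thresh = chi2.ppf(alpha, df=k_nongauge)

def in_region(loglike_G, loglike_max, lam=lam_thresh):
    """Standard LR membership test: 2*(logL_max - logL(G)) <= lambda_thresh."""
    return 2.0 * (loglike_max - loglike_G) <= lam
```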

4. Dealing with gauge freedom
GST's gauge freedom creates problems for both the bootstrapped and the Hessian-based regions described above. All the gate sets on a gauge orbit are completely equivalent. Imagine that we could parameterize the set of all gate sets as G = (X, Y ), where X describes which gauge orbit G is on and Y specifies its location on the gauge orbit. X would include all the gauge-invariant parameters, and Y would constitute pure gauge parameters. In this scenario, how would we describe our uncertainty about (i.e., put error bars on) Y ? On one hand, the data provide absolutely no information about Y , suggesting total uncertainty and large error bars. But, on the other hand, we may simply choose a gauge by fiat, suggesting absolute certainty and vanishing error bars.
The answer is made clear by considering how "gauge parameters" (Y, above) affect the variables that scientists wish to learn from GST - e.g., superoperator matrix elements, fidelities, or diamond norms. These are all functions of the gate set, but they are not gauge-invariant. So if we know X exactly, but allow Y to vary wildly, then these desirable variables will also vary. Thus, if we assign large error bars to gauge parameters, we necessarily get large uncertainty about fidelities and other gate properties - even if we have learned everything that can be learned about the gate set (X) to high precision! So we can only assign consistent error bars if we fix the gauge. Every point estimate necessarily has a fixed gauge, by its very nature - Ĝ = (X̂, Ŷ), so Ĝ has a fixed value of Y (gauge). But a region is a set of points {G}. It is easy to construct a confidence region containing gate sets with different gauges - say, G_1 = (X_1, Y_1) and G_2 = (X_2, Y_2), where Y_1 ≠ Y_2. If we allow this to happen, then the error bars on interesting but non-gauge-invariant quantities will be artificially inflated. So fixing the gauge for a region requires ensuring that every element of the region has the same Y coordinates. This is harder than it sounds, and is actually something of a category mistake. Above, we said "Imagine that we could" separate G into gauge-invariant parameters and pure gauge parameters. This is not generally possible in gauge theories, because it would require a global definition of parallel transport that does not generally exist. Different gauge orbits are typically not the same, or even isomorphic. For example, consider two gate sets G_1, G_2 that each contain just a single gate G_i - but in G_1, G_i is the identity matrix, while in G_2, G_i is nondegenerate. One G_i is invariant under all gauge transformations, while the other is not.
So the gauge orbit of G 1 is a zero-dimensional point (the entire gate set G 1 = {G i } is invariant under gauge transformations), while that of G 2 is much larger and more complex. No coordinate system (Y ) can describe them both, and if a confidence region includes both G 1 and G 2 there is no way to demand that they have "the same Y coordinate".
Instead, we gauge-fix a region by first identifying each orbit (equivalence class of gate sets) in the region, assigning it a single representative gate set, and selecting those representatives to minimize the size (variance) of the resulting set of gate sets. This approach can be used for both bootstrapped and Hessian-based regions, but plays out differently in practice for the two constructions.

5. Gauge-fixing bootstrap regions
We have found the following procedure effective at eliminating the effects of gauge freedom from bootstrapped error bars. Each gate set that is sampled via the bootstrap is "gauge optimized" using a fixed procedure, which sets its gauge to make it as close as possible to the target gate set. This is a good operational approximation to requiring every element of the region to be "in the same gauge". By making every representative as close as possible to a fixed target, it minimizes the radius and variance of the entire region.

6. Gauge-fixing Hessian-based regions
This is a bit trickier, because in normal statistical scenarios, the Hessian H of the loglikelihood is full rank. But each gauge degree of freedom creates a direction in parameter space, at the MLE, along which the loglikelihood is locally constant. It has no curvature in these directions, and thus its second derivative vanishes, giving H a kernel. Before we can invert H, we must get rid of these zero eigenvalues by projecting H onto its support. This projection corresponds to "slicing" an uncertainty ellipsoid that is infinitely extended along the tangent plane of the gauge orbit at the MLE, which yields a finite and bounded ellipsoid. However, minimizing the size of the region requires choosing the right plane in which to slice. This is somewhat technical, and details are given in Appendix A 2. The resulting projected Hessian has full rank and can be inverted and analyzed as given above. It defines a covariance tensor in (non-gauge) gate set space that, when scaled by an appropriate factor, defines an ellipsoid that is a valid α confidence region. Writing down this ellipsoid explicitly (as a tensor), while possible, is not very practical. Instead, we use it to define error bars (confidence intervals) for any and all interesting scalar quantities (e.g., fidelities, diamond norms, superoperator matrix elements, etc.).
We compute such confidence intervals as follows. Let f(G) be a scalar function of a gate set's parameters with linearization f(G) ≈ f_0 + ∇f · (G − G_MLE). We define an α confidence interval around the best-estimate value f_0 = f(G_MLE) by computing

δf = [ (P(∇f))† · (P(H)/C_1)^-1 · P(∇f) ]^(1/2),   (79)

where P indicates projection onto the (correctly chosen) support of H, P(H) is the projected Hessian, and P(∇f) is the similarly projected gradient of f. C_1 is a scalar constant which satisfies CDF_1(C_1) = α, where CDF_k is the cumulative distribution function of the χ²_k probability distribution. With δf so defined, f_0 ± δf specifies a good α confidence interval for f. Within the linear approximation to f, which is valid for small δf, this interval corresponds to minimizing and maximizing the value of f over the contour of the loglikelihood that would correspond to an α confidence region if the loglikelihood had a single parameter.
We emphasize that this does not construct an α confidence region. Equation 79 can be used to construct α confidence intervals for each of the N_p^nongauge parameters, but the region formed by taking the product of all these intervals contains the truth only if every one of the intervals contains its parameter. This occurs with probability at least α^(N_p^nongauge), not α. A true α region can be readily constructed by simply replacing C_1 with C_k in Eq. 79, where k = N_p^nongauge. When error bars on a scalar quantity are needed, we believe the confidence intervals constructed above are more meaningful than projections of the full α confidence region, for two reasons. First, these intervals are consistent with the error bars reported by the parametric bootstrap, which yields standard errors for each parameter independently. Those intervals would have to be expanded significantly to construct a joint confidence region. Second, we are often interested in the uncertainty of a single scalar quantity (e.g., a diamond norm), independently of others. The confidence intervals constructed this way correctly measure this uncertainty.
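Under the LAN assumptions, Eq. 79 is a one-line computation once the Hessian has been projected onto the non-gauge directions. The sketch below takes a full-rank (already projected) Hessian and gradient as hypothetical inputs; note that δf is the square root of the quadratic form, so it carries the units of f.

```python
import numpy as np
from scipy.stats import chi2

def confidence_interval(f0, grad_f, hessian, alpha=0.95):
    """(low, high) for f via delta_f = sqrt(g^T (H/C1)^-1 g), CDF_1(C1) = alpha.

    `hessian` must already be projected onto the non-gauge support, so that
    it is full rank and invertible.
    """
    C1 = chi2.ppf(alpha, df=1)             # scalar satisfying CDF_1(C1) = alpha
    cov = np.linalg.inv(hessian / C1)      # scaled inverse (projected) Hessian
    delta_f = float(np.sqrt(grad_f @ cov @ grad_f))
    return f0 - delta_f, f0 + delta_f
```

For a single Gaussian parameter with Hessian h (inverse variance), this reproduces the familiar f_0 ± 1.96/√h at 95% confidence.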

VII. SUMMARY
We have presented gate set tomography as a method for characterizing a quantum logic device. GST is in some ways a successor to existing tomographic methods such as quantum state and process tomography, but it is also fundamentally different: a gate set is an entity unto itself, not just a collection of unrelated components. So GST is really a new type of tomography that probes a different object (a gate set) with unique properties. There is still considerable overlap between GST and other tomographic methods, so we conclude by recalling and summarizing GST's primary distinguishing (and novel) features.
• Calibration free: GST neither requires nor even considers a pre-calibrated reference frame, nor relies on any a priori assumptions about the accuracy of any operations.
• Self consistent: GST produces self-consistent descriptions of all the operations (state preparations, gates, and measurements) exposed by a quantum processor. This is different from estimating each quantum state and superoperator independently, both because it requires no reference frame, and because (as a result) it exposes gauge degrees of freedom that make some errors relational between operations, rather than endemic to a specific gate.
• Hyperaccurate: The use of deep, periodic circuits enables GST estimates' precision to achieve Heisenberg-like scaling (modulo gauge freedom; see appendix D).
• Constraint-friendly: GST is not restricted to estimating arbitrary quantum transfer matrices, but can naturally incorporate almost any constraint or physics-inspired model, by treating a gate set as a parameterized model. Immediately useful examples include restricting the estimate to be trace-preserving (TP) or completely positive and trace-preserving (CPTP).
• Model validation: It is very natural within GST to detect and quantify violation of the underlying model that is being fit, using standard statistical techniques. This is a useful byproduct of the overcomplete experiments used by GST, and can be used as a built-in warning system to identify systematic errors in characterization.
Although this paper is intended as the first comprehensive description of gate set tomography, including the details of the long-sequence variant, GST has been used in a variety of published experiments [33, 35-37, 88-94, 98, 99] in several qubit technologies. Like all protocols, it has limitations, which we've tried to expose and discuss here. But GST is also a mature protocol with extensive real-world validation in experiments. A reference implementation of GST is available in the free, open-source Python package pyGSTi [104,105]. Extensions of GST to include additional constraints and to analyze time-dependent data are currently under investigation, as are variants of the protocol which scale more favorably with the number of qubits. But even if these extensions should fail to produce useful practical tools, the development of GST to date has contributed something we believe is of lasting value: it demonstrated the principle of self-consistently characterizing all the operations on a quantum logic device, and revealed that this problem is both more and less than the sum of its parts, thanks to the unexpected role of gauge freedom.

Appendix A: Gauge

1. Gauge degrees of freedom

GST estimates gate set parameters by varying these parameters to better fit a given set of data. A naive goal would be to estimate every parameter with high accuracy - i.e., to estimate a best-fit point within the parameter space P. However, it is usually the case (in GST analyses) that at every point in parameter space there are directions along which none of the predicted probabilities change. Variations along these directions do not affect the goodness-of-fit (likelihood) at all. We call them gauge directions.
Gauge directions arise from the class of gauge transformations T_M : M → M acting on the space of explicitly represented gate sets. They were defined by Eq. 7 and we repeat them here:

T_M : |ρ⟩ → M|ρ⟩,   ⟨E| → ⟨E|M^-1,   G_i → M G_i M^-1,

where M is any invertible superoperator. In this appendix, we refer to these as matrix-space gauge transformations. Such transformations map one gate set in M to another without changing any observable probability (cf. Eq. 10). The set of transformations T_M forms a group isomorphic to the Lie group GL(d²), as T_M ∘ T_N = T_MN. We call M ∈ GL(d²) an element of the "gauge group" GL(d²). The action of T_M can be pulled back by the gate set model's mapping (W) to act on the parameter space P. This pullback is not always a well-defined function, as T_M does not, in general, respect the constraints imposed by W. When it is well-defined, we call the resulting pulled-back transformation on parameter space a parameter-space gauge transformation. In this appendix, the unqualified term "gauge transformation" will always refer to a parameter-space transformation, and T_M : M → M will be referred to as a "matrix-space gauge transformation".
In the case of the fully parameterized model, P is isomorphic to M, and the gauge transformations are identical to the matrix-space gauge transformations of Eq. 7.
In the case of a TP parameterized model, the gauge transformations are given by restricting Eq. 7 to the parameterized elements of M (i.e. excluding the first row of each G_i and the first element of each |ρ^(i)⟩) and constraining M to be TP (i.e., its first row should be [1, 0, 0, . . . 0]). The gauge transformations of the TP model form a subgroup of the matrix-space gauge transformations, which is isomorphic to the subgroup of GL(d²) comprising all d² × d² matrices whose first row is [1, 0, 0, . . . 0]. This nice structure results in a well-defined gauge group on the parameter space.
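The subgroup claim is easy to verify numerically: if e_0^T M = e_0^T for two matrices, the same holds for their product and their inverses. A minimal check with arbitrary random elements (the construction below is illustrative, not pyGSTi code):

```python
import numpy as np

# The TP gauge group: matrices whose first row is [1, 0, ..., 0]. A quick
# numerical check (random example) that such matrices are closed under
# products and inverses, i.e. they really form a subgroup of GL(d^2).
rng = np.random.default_rng(1)

def random_tp_element(d2, rng):
    """Random (almost surely invertible) matrix with first row [1, 0, ..., 0]."""
    M = rng.normal(size=(d2, d2))
    M[0, :] = 0.0
    M[0, 0] = 1.0
    return M

A = random_tp_element(4, rng)
B = random_tp_element(4, rng)
```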
When dealing with a CP (or CPTP) parameterized model, there is no such structure, and the set of gauge transformations that respect the model's constraints varies from point to point in parameter space. The CP constraint does not correspond to a restriction of M ∈ GL(d²) to one of its subgroups, and so there is no well-defined gauge group. In this sense, we say the CP constraint "does not play nicely with the gauge". In this scenario, it may be possible to identify a gauge subgroup that does respect the constraints. For example, consider restricting the set of all matrix-space gauge transformations to those for which M corresponds to a unitary action on H. Since these transformations respect the CP and TP constraints and have a group structure isomorphic to the unitary group, they form a proper gauge subgroup.
Each gauge direction at point p ∈ P corresponds to a degree of freedom within the class of gauge transformations. For example, in a single-qubit fully parameterized gate set model, the gauge transformations are in one-to-one correspondence with the invertible 4 × 4 matrix M of Eq. 7. So there are 16 independent gauge directions at every point in parameter space, defined by varying each of the elements of M . In a single-qubit TP-parameterized model there are only 12 such gauge directions because of the restriction that M be TP. In the CP (CPTP) case there are 16 (12) gauge directions at every point in parameter space except those corresponding to gate sets where at least one gate lies on the boundary of CP space. At such points, the gauge freedom is constrained, and there are fewer gauge directions. 7 As we stated in the main text, the number of gauge directions is fixed almost everywhere by the gate set model, but the gauge directions themselves vary from one parameter-space point to another. The gauge freedom thus gives rise to curved "gauge manifolds" (or orbits) within parameter space whose points all yield identical probabilities and thus identical fits to any data. If S : P → P is a gauge transformation and θ is a vector of gate set parameters, then the gate sets W ( θ) and W (S( θ)) are physically indistinguishable (W : P → M is the gate set model's mapping). This situation is similar to that found in electromagnetism and quantum field theory, where the natural mathematical description of an entity is over-complete or redundant.
The parameter space, because it is foliated into gauge manifolds, takes the structure of a fiber bundle. Fibers are given by the local gauge directions, and the base manifold is given by the perpendicular non-gauge directions. Importantly, the base manifold depends on the metric chosen for P. In essence (see Figure 7), P can be viewed as a space of physically distinguishable fibers of equivalent gate sets (or, more precisely, their pre-images). In this picture, "fixing a gauge" means choosing a representative point from each of the fibers to define a set (which may be a manifold, if the choice is performed in a smooth way) of physically distinguishable gate sets. When the gate set model yields gauge transformations with a group structure - i.e., when there is a well-defined gauge group - the fibers correspond to orbits of the gauge group.
Let us consider the tangent space T_q at a given point q ∈ M. Recall that the dimension of M is given by N_e (Eq. 66). T_q is spanned by the derivatives with respect to some basis for M. The derivatives of the infinitesimal gauge transformations will span a subspace of T_q, which we call the gauge subspace at q. Let us write an element of the gauge group GL(d²) (cf. Eq. 7) as

M = e^K,   (A2)

where K is a d² × d² real matrix. The space of infinitesimal gauge transformations is spanned by the operations

dT_K : |ρ⟩ → K|ρ⟩,   ⟨E| → −⟨E|K,   G_i → [K, G_i],   (A3)

which can be found by inserting Eq. A2 into Eq. 7 and keeping only first-order terms in K. Square brackets in Eq. A3 denote the commutator. Taking a derivative with respect to each element K_ij yields the gauge subspace directions

dQ_ij : |ρ⟩ → u_ij|ρ⟩,   ⟨E| → −⟨E|u_ij,   G_k → [u_ij, G_k],   (A4)

where u_ij is the matrix whose i, j-th element equals 1 and whose other elements are zero. These gauge directions can be viewed as length-N_e vectors, dQ_ij = dT_M/dK_ij ∈ T_q, where i and j range between 1 and d². Let dQ be the N_e × d⁴ matrix with the dQ_ij as its columns, and let dP be the N_e × N_p Jacobian of W at point p ∈ P. The first N_p components of each vector in a basis for the nullspace of [dP | dQ] give a basis for the pullback of the gauge subspace to T_θ, the tangent space at θ ∈ P. That is, the columns of the matrix P formed from those first N_p components form a basis for the gauge space at θ, and projection onto this space is given by the action of the projector

Π_gauge = P (P^T P)^-1 P^T.

We call the number of gauge directions at each point (equivalently, the dimension of the fibers) the "number of gauge parameters", and denote it N_p^gauge. The number of remaining dimensions in parameter space, N_p^nongauge = N_p − N_p^gauge (also the dimension of the base manifold), is called the "number of non-gauge parameters". Note that while the gauge directions generally vary from point to point, the numbers of total (N_p), gauge (N_p^gauge), and non-gauge (N_p^nongauge) parameters are globally fixed almost everywhere by our common gate set models (full, TP, and CP) and can be treated as constants.

[Footnote 7: There may be fewer than 16 and 12 gauge parameters in special circumstances, when there are classes of would-be gauge transformations which commute with all of the gates and SPAM elements.]

[FIG. 7 caption: Movement within the fibers of P is accomplished by "gauge transformations" as referred to in the text. Since the goodness-of-fit between a gate set and data is invariant under gauge transformations, the "best-fit" gate set G_MLE is really a best-fit fiber.]
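This counting can be checked numerically. For a fully parameterized model, P is isomorphic to M, so the number of gauge parameters is simply the rank of dQ. The sketch below builds dQ for a hypothetical single-qubit gate set from the linearized gauge action (dρ = Kρ, dE = −EK, dG = [K, G], with K ranging over the matrix units u_ij), and recovers the 16 gauge directions quoted in the text.

```python
import numpy as np

# Count gauge directions of a fully parameterized 1-qubit gate set by
# building dQ, whose columns are the linearized gauge transformations
# d|rho> = K|rho>, d<E| = -<E|K, dG = [K, G] for K = u_ij, and taking
# its rank. The gate set (PTM representation) is a toy example.
d2 = 4
GX = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]], float)
GY = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, -1, 0, 0]], float)
rho = np.array([1, 0, 0, 1], float) / np.sqrt(2)   # |0><0| as a Pauli vector
E = np.array([1, 0, 0, 1], float) / np.sqrt(2)     # <0| effect

def gauge_direction(K):
    """Stack the first-order changes of all gate-set elements into one vector."""
    return np.concatenate([K @ rho, -(E @ K),
                           (K @ GX - GX @ K).ravel(),
                           (K @ GY - GY @ K).ravel()])

cols = []
for i in range(d2):
    for j in range(d2):
        u = np.zeros((d2, d2))
        u[i, j] = 1.0
        cols.append(gauge_direction(u))
dQ = np.column_stack(cols)
n_gauge = np.linalg.matrix_rank(dQ)   # number of gauge parameters
```

For this gate set no nontrivial K commutes with everything while also annihilating ρ and E, so all 16 directions are independent; the special cases of footnote 7 would reduce this rank.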
Gauge degrees of freedom are very relevant to the central task of parameter estimation because at any point in parameter space there is no difference in the goodness of fit along the gauge directions (i.e. along the gauge fiber). This means that though we naively desire a best-fit point in parameter space (i.e. knowledge of all the parameters), the data can at most determine a best-fit fiber (a point of the base manifold). GST data analysis (e.g., MLE) seeks to find a gate set whose parameters mark a point within the best-fit fiber: the data cannot distinguish between points within a fiber. Any specific best-fit gate set necessarily assigns values to all N p parameters, but only N nongauge p linear combinations of those parameters are actually physically meaningful, or can have finite error bars, as discussed in section A 2. Gauge transformations, which act as automorphisms of each gauge fiber, may be used to map the initial best-fit gate set to another physically identical one. We say that two gate sets mapped to each other under gauge transformations are gauge-equivalent.
This gauge ambiguity -the inability to pin down N gauge p parameters of a gate set -is unavoidable and stems from a relativism intrinsic to gate sets (as discussed in the main text). For example, there is no unique matrix representation for a X(π/2) gate. Instead an X(π/2) gate is defined by its behavior relative to the other gates, state preparations, and effects which comprise the gate set. This is related to the standard freedom to choose a basis in linear algebra; the fundamental issue here is that the data do not and cannot choose a single preferred basis (i.e. reference frame).
Gauge ambiguity makes it difficult to compare two gate sets, since two distinct gate sets corresponding to different points in P may actually be physically equivalent.
Example. Consider the gate set G = {ρ = |0⟩⟨0|, E = |1⟩⟨1|, G_1 = G_X}, where G_X : ρ → σ_x ρ σ_x. Any unitary change of basis is a valid gauge transformation, so G is gauge-equivalent to G′ = {ρ′ = |+⟩⟨+|, E′ = |−⟩⟨−|, G′_1 = G_Z}, where G_Z : ρ → σ_z ρ σ_z. If an experimentalist intended to produce G, but tomography reports the (equivalent) gate set G′, then by all the obvious metrics the device is wildly out of spec. Of course, this is merely a misunderstanding resulting from different choices of reference frame. To avoid such communication failures, it is necessary that all parties agree on a shared gauge.
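This example can be checked numerically: conjugating every element of G by the Hadamard (a unitary change of basis, hence a valid gauge transformation) turns G_X into G_Z and |0⟩, |1⟩ into |+⟩, |−⟩, while leaving every predicted circuit probability unchanged. A minimal density-matrix sketch:

```python
import numpy as np

# Numerical check that G (with G_X) and G' (with G_Z), related by a
# Hadamard change of basis, predict identical circuit probabilities.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

rho, E = np.array([[1, 0], [0, 0]]), np.array([[0, 0], [0, 1]])  # |0><0|, |1><1|
rhoP, EP = H @ rho @ H, H @ E @ H                                # |+><+|, |-><-|

def prob(effect, state, unitaries):
    """p = Tr[E * U_k ... U_1 rho U_1^dag ... U_k^dag]."""
    for U in unitaries:
        state = U @ state @ U.conj().T
    return float(np.trace(effect @ state).real)
```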

2. Gauge considerations in error bars
Following Section VI C 6, suppose we compute the Hessian of the loglikelihood at a given point in parameter space. In the absence of gauge freedom the Hessian would be full rank. In our case, however, the presence of N_p^gauge gauge degrees of freedom means that at the MLE there are N_p^gauge directions in parameter space along which the loglikelihood has no (zero) curvature. Stated equivalently, the N_p × N_p Hessian matrix will have N_p^gauge zero eigenvalues. If one were to proceed with the Hessian-based confidence region formalism, ignoring the gauge freedom, trouble would arise when needing to invert a rank-deficient Hessian.
As usual, the gauge freedom gives us the opportunity to make a choice -namely how to fix the gauge. From the point of view of uncertainty analysis, we don't need to have any uncertainty about the position along the gauge directions because we can fix it. The data tell us nothing about the gauge directions because they are physically meaningless, yet because they are meaningless we may choose them however we want, and therefore know them completely.
Practically, this means we may ignore the Hessian components along the gauge directions and only consider those along the "non-gauge" directions, defined as those orthogonal to the gauge directions. However, orthogonality is metric-dependent. Our choice of a parameter-space metric will determine which directions are considered to be the non-gauge directions, and therefore will influence the Hessian and consequently any region estimates and error bars. Recognizing this, and dealing with it, is critical for assigning sensible error bars.
To visualize how error bars are affected by the choice of a metric, let us return to the fiber-bundle picture of parameter space presented in section A 1 and Figure 7 above. The MLE, a single point in parameter space, is an arbitrary point within the fiber corresponding to the best-fit equivalence class of gate sets. The gauge directions are well-defined: they are the directions given by Eq. A4 along which the loglikelihood is invariant, and they form an $N_p^{\rm gauge}$-dimensional space. However, there are many different $N_p^{\rm nongauge}$-dimensional non-gauge spaces which are linearly independent from the gauge space, and would be considered orthogonal to the gauge space in some metric. As Figure 8 illustrates, choosing this space is like choosing a specific angle/way of cutting through the fiber bundle. By choosing a specific non-gauge space, we are essentially choosing how much movement along the fibers (the gauge directions) should be coupled to movement between the fibers.

[Figure 8 caption: Computing Hessian-based error bars requires that we fix the gauge not just within the ML orbit but in a region around this orbit (so we can project the Hessian and remove its zero eigenvalues). Performing this regional gauge fixing can be done in many ways, as it amounts to selecting a point within each fiber; the arrows give two such examples.]
Almost any set of $N_p^{\rm nongauge}$ independent directions, or equivalently almost any metric, will yield a valid non-gauge space. Moreover, there is no "correct" metric: we are free to choose whichever we please. But our choice will affect the error bars. So, to choose wisely, we may perform an optimization over metrics, minimizing a weighted sum of the size of the error bars on 1) the matrix elements of each superoperator representation, and 2) the vector elements of each state preparation and measurement effect.
An alternative way of choosing a metric involves partitioning the parameters into "gate" and "SPAM" groups. We then define each group's "intrinsic error" as the minimum average size of the error bars on parameters in that group when other groups' error bars are ignored (allowed to be large). We then choose the metric which weights the parameters of the different partitions according to the ratios of their intrinsic error rates.
Once a metric is selected, the Hessian can be projected onto the corresponding non-gauge space by constructing a projector using an expression identical to Eq. A6, but where the columns of P are non-gauge rather than gauge directions.
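In code, this projection step can be sketched as follows (a minimal numpy illustration with hypothetical names; pyGSTi's actual implementation differs in its details):

```python
import numpy as np

def project_hessian_nongauge(hessian, gauge_dirs):
    """Project a loglikelihood Hessian onto the non-gauge subspace.

    hessian    : (Np, Np) symmetric matrix with ~N_gauge zero eigenvalues.
    gauge_dirs : (Np, N_gauge) matrix whose columns span the gauge directions
                 (in the chosen metric's coordinates).
    Returns the projected Hessian, which is nonsingular on the non-gauge subspace.
    """
    # Orthonormal basis for the gauge directions
    q, _ = np.linalg.qr(gauge_dirs)
    # Projector onto the orthogonal complement of the gauge space
    proj = np.eye(hessian.shape[0]) - q @ q.T
    return proj @ hessian @ proj
```

The projected Hessian has, by construction, no curvature along the gauge directions, so the confidence-region formalism can proceed on the non-gauge subspace alone.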

Appendix B: The extended LGST (eLGST) protocol
The extended-LGST (eLGST) protocol was an important milestone on the way to the long-sequence GST protocol described in the main text. As we state in Section IV A, it utilizes structured circuits based on repeated germs (similar to long-sequence GST), but uses LGST in combination with a least-squares objective function to arrive at its final estimate (instead of the MLE used in long-sequence GST). As such, it constitutes an interesting fossil - an extinct intermediate step - that, despite being of little practical use, is worth outlining here.
The eLGST analysis proceeds as follows. Let $\{g_i\}$ be the set of germ circuits, and let the data constitute LGST experiments on base circuits of the form $g_i^{l_j}$ - that is, $l_j$ repetitions of each germ circuit, for some set of integers $\{l_j\}$. Each circuit is repeated $N$ times to yield data.
First, LGST is used for each base circuit to obtain a reasonably accurate estimate of $\tau(g_i)^{l_j}$ for that base circuit. We denote this estimate by $\widehat{\tau(g_i)^{l_j}}$, and assume that
$$\widehat{\tau(g_i)^{l_j}} = \tau(g_i)^{l_j} \pm O(1/\sqrt{N}). \qquad \text{(B1)}$$
Now, if the $g_i$ were just numbers (not matrices), then we could get a high-precision estimate by simply taking the low-precision estimate $\widehat{\tau(g_i)^{l}}$, for some very large $l$, and computing its $l$th root,
$$\widehat{\tau(g_i)} = \left(\widehat{\tau(g_i)^{l}}\right)^{1/l} = \tau(g_i) \pm O\!\left(\frac{1}{l\sqrt{N}}\right). \qquad \text{(B2)}$$
Getting this same result when $g_i$ is a matrix is a bit more complicated, because matrix roots are very unstable. To make this concept work for matrices, we need to 1) iteratively increase $l$ and 2) avoid explicitly taking $l$th roots of matrices. These technical points are discussed below.
The following iterative algorithm is used in eLGST to get a high-precision estimate of each $g_i$:

1. Perform LGST on base circuits of the form $g_i^{l}$ for $l = 1, 2, 4, 8, 16, \ldots, l_{\max}$ to obtain approximate estimates $\widehat{\tau(g_i)^{l_j}}$, where $i$ indexes the germ circuit.

2. For each $g_i$, set the initial estimate $\widehat{\tau(g_i)}$ equal to $\widehat{\tau(g_i)^{1}}$.

3. Use least-squares optimization to find the $\tau(g_i)$ that minimizes
$$\sum_j \left\| \tau(g_i)^{l_j} - \widehat{\tau(g_i)^{l_j}} \right\|_2^2 .$$

6. Use the same approach to extract high-precision estimates of the gates ($\hat G_k$) from the estimates of the germs ($\widehat{\tau(g_i)}$).
This algorithm was found to be reliable and accurate when tested on simulated data, and thus inspired the current stable algorithm used for long-sequence GST.
The problems with matrix roots mentioned around Eq. B2 are worth a brief discussion. The first problem is demonstrated even by the simple case where $\tau(g_i)$ is a complex scalar, $\tau(g_i) = e^{i\theta}$. Now $\tau(g_i)^l = e^{i(l\theta \bmod 2\pi)}$, and so the $l$th root is multi-valued (i.e., $\theta \to \theta + 2\pi n/l$ leaves all observable probabilities invariant). We have to choose the right branch. This is impossible without further information.
We solve this problem by iteratively increasing $l$ from 1 to its final value logarithmically. For example, if the final value is $l = 32$, we would include the base circuits for $l = 1, 2, 4, 8, 16, 32$. The additional cost is only logarithmic in the maximum $l$, and as long as $N$ is large enough, it completely solves the problem of choosing the correct branch. We begin by using the $l = 1$ data to get a decent estimate of $\tau(g_i) \pm O(1)/\sqrt{N}$. Then, we use the $l = 2$ data to deduce that $\theta \in \{\hat\phi_2/2,\ \hat\phi_2/2 + \pi\}$ up to $O(1)/(2\sqrt{N})$, where $\hat\phi_2$ is the estimated value of $2\theta \bmod 2\pi$, and use the $l = 1$ estimate to identify unambiguously which of the two branches is correct. We repeat this process recursively for each successively larger $l$.
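This branch-selection loop is easy to sketch numerically (an illustrative toy, not pyGSTi code; `refine_phase` and its inputs are hypothetical names):

```python
import numpy as np

def refine_phase(noisy_phases, ls):
    """Iteratively refine theta from noisy estimates of (l*theta mod 2pi),
    for logarithmically spaced l, as in eLGST's branch selection.

    noisy_phases[j] ~ (ls[j] * theta) mod 2pi, each with O(1/sqrt(N)) error.
    Doubling l halves the branch spacing, so the previous (coarser) estimate
    always singles out the correct branch when the noise is small enough.
    """
    theta = noisy_phases[0]  # l = 1: coarse but branch-free
    for phi, l in zip(noisy_phases[1:], ls[1:]):
        candidates = (phi + 2 * np.pi * np.arange(l)) / l  # the l branches
        theta = candidates[np.argmin(np.abs(candidates - theta))]
    return theta
```

The final uncertainty is set by the largest $l$, giving the $1/(l\sqrt{N})$ accuracy quoted above.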
The second problem is peculiar to matrices. The procedure given above works very well for scalar $\tau(g_i) = e^{i\theta}$, but not for matrix-valued $\tau(g_i)$. To see this, consider the example $\tau(g_i) = \sigma_z$. Clearly, $\tau(g_i)^2 = \mathbb{1}$. If we perform LGST on $\tau(g_i)^2$, we will generally obtain $\widehat{g_i^2} = \mathbb{1} \pm O(1)/\sqrt{N}$, where the error term is a small, random matrix. Suppose (without loss of generality) that we get $\widehat{g_i^2} = \mathbb{1} + \epsilon\sigma_x$, where $\sigma_\cdot$ denotes a standard Pauli matrix and $\epsilon$ is small. There are multiple square roots of $\widehat{g_i^2}$, but this isn't the main problem; the main problem is that none of them are even close to the true value of $\tau(g_i) = \sigma_z$! Instead, every square root of $\mathbb{1} + \epsilon\sigma_x$ is diagonal in the $\sigma_x$ basis.
The root of the problem is that the matrix square-root function is highly non-smooth, and can rip apart the topology of matrices. To fix it, we observe that we should be looking for a $\widehat{g_i}$ such that $\widehat{g_i}^2 \approx \widehat{g_i^2}$ - not such that $\widehat{g_i} \approx \big(\widehat{g_i^2}\big)^{1/2}$. These two conditions are not equivalent, due to the topological violence that can be hiding in the matrix root.
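A small numerical sketch of this pitfall, using the $\sigma_z$ example above (illustrative only: `est_1` plays the role of the coarse $l=1$ LGST estimate, and plain gradient descent stands in for eLGST's least-squares optimizer over the $l=1$ and $l=2$ data):

```python
import numpy as np

sz = np.diag([1., -1.])
sx = np.array([[0., 1.], [1., 0.]])

# True germ action is sigma_z, so its square is the identity; the LGST
# estimate of the square carries small noise:
sq_est = np.eye(2) + 0.01 * sx

# Naive approach: matrix square root of the estimate.  For symmetric
# sq_est this is v sqrt(w) v^T -- diagonal in the sigma_x basis and close
# to the identity, nowhere near sigma_z.
w, v = np.linalg.eigh(sq_est)
naive = v @ np.diag(np.sqrt(w)) @ v.T

# eLGST-style approach: find g minimizing ||g - est_1||^2 + ||g^2 - sq_est||^2,
# where the coarse l = 1 estimate est_1 anchors the correct branch.
est_1 = sz + np.array([[0.03, -0.02], [0.01, 0.04]])  # noisy l = 1 estimate
g = est_1.copy()
for _ in range(5000):  # plain gradient descent on the least-squares cost
    r1, r2 = g - est_1, g @ g - sq_est
    g -= 0.05 * (r1 + r2 @ g.T + g.T @ r2)
```

The fitted `g` stays on the $\sigma_z$ branch, while `naive` lands about distance 2 away from it.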
These tricks are implemented in the final version of eLGST given above, and were pivotal to the success of eLGST as a protocol.

Appendix C: Implementations in pyGSTi
This appendix provides descriptions of several algorithms used in pyGSTi's implementation of gate set tomography. These are not included in the main text because they describe implementation choices that are not fundamental to GST as a protocol. These implementation details demonstrate that various steps of the GST protocol are tractable in practice, and may be helpful to readers seeking to implement GST in computer code.

1. Circuit selection
Recall that the primary goal when selecting circuits for long-sequence GST is to amplify all possible gate errors. More specifically, the germs must amplify all of a gate set's parameters with the exception of those linear combinations that correspond to gauge freedoms - we want to amplify $N_p^{\rm nongauge}$ independent parameter directions as defined in appendix A 1. By imposing a periodic structure on circuits (see section IV B), the choice of circuits becomes one of choosing (together) a set of germs $g_i$, germ powers $p$, and effective SPAM pairs (when fiducial-pair reduction is used, cf. V B). The native set of SPAM operations is a property of the target gate set (input by the user). The selection of (effective-preparation, effective-measurement) pairs means selecting appropriate fiducial circuits to precede or follow some or all of the native state preparations and measurements. In the typical case, when there is just a single native state preparation and a single native POVM, this pair selection amounts to just selecting pairs of fiducial circuits.
In pyGSTi's most tested (and trusted) implementation of circuit selection, the tasks of selecting fiducials, germs, and germ-powers are separated for simplicity. After these tasks complete, one may optionally perform fiducial pair reduction. We consider each task in turn below.
Throughout this appendix, G refers to a gate set model for the quantum processor at hand. So, in addition to using the notation of Eq. 6 to describe members of G, we may also speak of G's parameters. G should approximate the true gate set (i.e. the to-be-determined best-fit estimate) so that circuits that amplify G's parameters also amplify the parameters of the best-fit estimate, and likewise for informationally complete fiducial circuits. Initially, we set G to be a TP-parameterized gate set containing the ideal (target) gates. Usually the best-fit gate set is close enough to the ideal one that the experiment design resulting from this initial choice of G is all that is ever needed. However it is possible, if necessary, to iterate this process, using, e.g., the initial GST estimate to generate a new experiment design which gives rise to a second GST estimate, etc.

a. Fiducial selection
Fiducial selection refers to the process of (independently) choosing informationally complete (IC) sets of effective state preparations and effective measurements. That is, fiducial selection is the process that determines {H k } and {F k } in Eqs. 52 and 53, respectively.
A set of matrices is IC if and only if it spans the vector space $B(\mathcal{H})$. This requires at least $d^2$ linearly independent elements. While choosing $d^2$ random circuits would almost certainly result in an IC set, elements could likely be chosen that are "more" linearly independent. This notion of an amount of linear independence is quantified by the spectrum of the Gram matrix $\tilde{\mathbb{1}}$, defined (as in Eq. 30) by $\tilde{\mathbb{1}}_{ij} = \langle\!\langle E_i | \rho_j \rangle\!\rangle$. If either $\{|\rho_j\rangle\!\rangle\}$ or $\{\langle\!\langle E_i|\}$ fails to be IC, the Gram matrix will fail to have $d^2$ non-zero (to machine precision) singular values. As the $d^2$-th largest-magnitude singular value approaches zero, this indicates that one of the sets is close to being linearly dependent, which is to be avoided. Thus, an optimal set of fiducial circuits (one whose elements span $B(\mathcal{H})$ as uniformly as possible) can be found in practice by maximizing the smallest of the top $d^2$ singular values of the Gram matrix over many candidate sets. We typically consider as candidates all the circuits with depth below some cutoff, and have only a single native state preparation and measurement ($N_\rho = N_M = 1$). The precise details of how fiducial selection is done are non-essential to achieving the desired asymptotic accuracy scaling, as this only requires informational completeness. More uniformly IC sets are beneficial because they decrease the prefactor in the accuracy scaling (i.e. the y-offset of the diamond-distance vs. depth series in Figure 10).
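This singular-value criterion is simple to score (an illustrative sketch with a hypothetical function name, assuming the effective preparations and effects are given as length-$d^2$ superket/superbra vectors in some fixed basis):

```python
import numpy as np

def gram_score(prep_vecs, effect_vecs, d):
    """Score a candidate fiducial set by the spectrum of its Gram matrix.

    prep_vecs   : effective-preparation vectors |rho_j>> (length-d^2 arrays)
    effect_vecs : effective-effect vectors <<E_i| (length-d^2 arrays)
    Returns the d^2-th largest singular value of Gram_ij = <<E_i|rho_j>>.
    Larger is better; zero means the set is not informationally complete.
    """
    gram = np.array([[np.dot(e, r) for r in prep_vecs] for e in effect_vecs])
    svals = np.linalg.svd(gram, compute_uv=False)  # sorted descending
    if len(svals) < d**2:
        return 0.0
    return svals[d**2 - 1]
```

Fiducial selection then amounts to maximizing this score over candidate circuit sets.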
For fiducial selection to succeed, one must compute the Gram matrix using good estimates of the actual gates. When the true (or best-fit) gate set is only slightly different from G, then the actual gates will prepare states (and effects) that are close to the expected ones - and therefore close to uniformly IC. If the true gates are sufficiently far from those used in fiducial selection, this is readily detected in the singular values of the empirical Gram matrix (Eq. 30), and remedied by iterating over multiple experiment-generation steps as described at the beginning of this section. From this point forward, let us assume fiducial selection has succeeded in identifying sets of $N_{f1}$ effective preparations and $N_{f2}$ effective POVM effects.

b. Germ selection
Germ selection seeks to identify a set of low-depth circuits whose powers make useful "operations of interest" for tomography. Let us begin by considering more generally a completely arbitrary set of "operations of interest", O, and asking: "What needs to be in this set to amplify all possible gate set parameters?".
By performing the circuits that sandwich each $O \in \mathcal{O}$ between all pairs of effective state preparations and measurements, we are able to estimate the probabilities $\langle\!\langle E_a | \tau(O) | \rho_b \rangle\!\rangle$. Naively, we could simply take $\mathcal{O}$ to be the set of single gates (each a depth-1 circuit) in the gate set, and repeatedly measure the prescribed circuits many times. These are the same circuits used by process tomography, and are, up to some additional short circuits (cf. Eqs. 30-35), sufficient to characterize the gates using LGST. However, by repeatedly measuring each circuit $N$ times we are only able to estimate each gate's elements (and thereby the gate set parameters, assuming these parameters are linear in the elements) to an accuracy proportional to $1/\sqrt{N}$ (cf. Section IV). This is just the standard stochastic error scaling arising from the $1/\sqrt{N}$ scaling of the standard deviation of the mean in a Gaussian or binomial distribution. To achieve $1/N$ scaling, we must include deeper circuits. Intuitively, we want to take each parameter $\theta$ (an element of $\vec\theta$ after fixing a basis for parameter space) in the gate set model and map it to an observable probability that depends on it as $\Pr \approx p\,\theta$, where $p$ is (or is proportional to) the germ power.
Consider the case of a single-qubit gate $G_x$ that is intended to be an x-rotation by $\pi/2$ radians but is actually an x-rotation by $\theta = \pi/2 + \epsilon$, where $\epsilon$ is small. Let us suppose that every circuit is measured $N$ times. If $\tau(O) = G_x$ then we would estimate $\theta \bmod 2\pi = \pi/2 + \epsilon \pm \alpha/\sqrt{N}$ for some constant $\alpha$. (The mod $2\pi$ arises because rotations by $2\pi$ are undetectable.) If $\tau(O) = G_x^2$, which is a rotation by $2\theta = \pi + 2\epsilon$, then we would estimate $2\theta \bmod 2\pi = \pi + 2\epsilon \pm \alpha/\sqrt{N}$ and thus $\theta \bmod \pi = \pi/2 + \epsilon \pm \alpha/(2\sqrt{N})$. More generally, if $\tau(O) = G_x^p$, then we would estimate $p\theta \bmod 2\pi = p\pi/2 + p\epsilon \pm \alpha/\sqrt{N}$ and thus $\theta \bmod 2\pi/p = \pi/2 + \epsilon \pm \alpha/(p\sqrt{N})$. While it may be initially concerning that as $p$ increases the "branch" ambiguity in $\theta$ increases (as one is unable to discriminate between angles separated by $2\pi/p$), this issue can be circumvented by probing $G_x^p$ for multiple values of $p$, choosing logarithmically-spaced depths $l$ and letting $p = \lfloor l/|g| \rfloor$ (see Section IV B 2). One only needs logarithmically-spaced $p$ (or $l$) values to determine the correct "branch" for $\theta$, as each $p$ can be used to rule out a constant factor of the possible branches. The same reasoning lies behind the Robust Phase Estimation protocol [180].
In the end, we see that by increasing $l$ the uncertainty in $\theta$ decreases as $1/l$. Since the total number of circuits scales at most linearly with $l$ (and more often than not logarithmically, since $l$-values are logarithmically spaced), this is at worst a Heisenberg-like scaling in the total circuit number, and markedly better than the $1/\sqrt{N}$ scaling that would be obtained by just increasing $N$ for a single circuit.

This simple example suggests that we should include in $\mathcal{O}$ logarithmically spaced powers of each of the gates. These circuits amplify some but not all types of gate errors (i.e. parameters of the gate set). Let us return to our single-qubit example, but now suppose $G_x$ is an exact $\pi/2$ rotation around a slightly wrong axis. This corresponds to the unitary map $G_x = e^{-i(\pi/4)(\cos\epsilon\,\sigma_x + \sin\epsilon\,\sigma_y)}$, where $\sigma_x$ and $\sigma_y$ are the Pauli operators and $\epsilon$ is assumed small. This "tilt error" is not amplified by $G_x^p$, which is easily seen by observing that $G_x^4 = \mathbb{1}$, so the error cancels, or "echoes", itself out after just four repetitions. More sophisticated circuits are needed to amplify tilt errors. In this example, it is sufficient to probe $G_x G_y$, where $G_y$ is a perfect single-qubit $\pi/2$ rotation around the y-axis. $G_x G_y$ is then a rotation by $2\pi/3 + \epsilon/\sqrt{3}$. Therefore, performing $(G_x G_y)^p$ amplifies the deviation by a factor of $p$ and, as before, allows estimation of $\epsilon$ to an accuracy scaling as $1/(p\sqrt{N})$. This illustrates the need for circuits other than the gates themselves which must be repeated to amplify all of a gate set's parameters. We call each of the circuits which is repeated, $O \in \mathcal{O}$, a "germ" (consistent with section IV B).
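The echo/amplification dichotomy can be checked directly with 2×2 unitaries (an illustrative sketch; the tilt angle and the global-phase-insensitive distance are our own choices, not pyGSTi conventions):

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(axis, angle):
    """SU(2) rotation exp(-i*(angle/2)*(axis . sigma)) for a unit axis."""
    n_sigma = axis[0] * sx + axis[1] * sy + axis[2] * sz
    return np.cos(angle / 2) * np.eye(2) - 1j * np.sin(angle / 2) * n_sigma

def dist(u, v):
    """Global-phase-insensitive distance between 2x2 unitaries."""
    return np.sqrt(max(0.0, 2.0 - abs(np.trace(u.conj().T @ v))))

eps = 1e-3
gx = rot((1, 0, 0), np.pi / 2)                           # ideal X(pi/2)
gy = rot((0, 1, 0), np.pi / 2)                           # ideal Y(pi/2)
gx_tilt = rot((np.cos(eps), np.sin(eps), 0), np.pi / 2)  # tilted rotation axis

# The tilt "echoes out" under repetition of gx alone (gx_tilt^4 = -1l exactly):
d_echo = dist(np.linalg.matrix_power(gx_tilt, 4), np.linalg.matrix_power(gx, 4))

# ...but the germ gx*gy amplifies the tilt roughly linearly in the power p:
germ, germ_tilt = gy @ gx, gy @ gx_tilt
d1 = dist(germ_tilt, germ)
d16 = dist(np.linalg.matrix_power(germ_tilt, 16), np.linalg.matrix_power(germ, 16))
```

Here `d_echo` is numerically zero, while `d16` exceeds `d1` by roughly the germ power, which is exactly the amplification the germ-selection procedure is designed to exploit.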
The general situation gets rapidly complicated - e.g., if $G_y$ were not perfect in the above example, then $G_x G_y$ alone could not distinguish between y-axis tilt in $G_x$ and x-axis tilt in $G_y$. In general, every circuit is sensitive to some nontrivial linear combination of gate set parameters. To choose a set of germs, we create a list of candidate germs (e.g., every circuit shorter than some cutoff depth), and for each candidate germ $g$ identify what linear combination of parameters it amplifies. We do this by computing the Jacobian
$$\nabla^{(p)}_g = \frac{\partial\, \tau(g)^p}{\partial \vec\theta}, \qquad \text{(C4)}$$
where $\tau(g)$ is the $B(\mathcal{H}) \to B(\mathcal{H})$ map formed by composing the elements of $g$ (cf. Eq. 8), and $\vec\theta \in \mathcal{P}$ is a vector of gate set parameters. Note that we are justified in assuming we have access to this entire Jacobian - i.e. the derivatives of all the elements of $\tau(g)^p$ - because we have committed to sandwiching the base circuit between informationally complete sets of fiducials. As we will see in section C 1 d, inclusion of all of these fiducial pairs is overkill, and we can remove many of them. Note also that, as in the main text, since the germs are only able to amplify errors in the gates and not in the SPAM, we will assume a gate set model whose parameters only vary its gates.
In general, $\tau(g)$ is a $d^2 \times d^2$ matrix and the gate set has $N_p$ parameters ($N_p = N_G d^4$ in the case of a fully parameterized gate set model). This means $\nabla^{(p)}_g$ is a $d^4 \times N_p$ matrix. Its $d^4$ right singular vectors indicate linear combinations of model parameters that $\tau(g)$ amplifies, and the corresponding singular values quantify how much they are amplified. A zero singular value indicates a parameter combination that is not amplified at all (like the tilt error discussed above). We expect each germ to amplify at least the $d^2$ linear combinations of parameters given by its eigenvectors. It may amplify more than $d^2$ if its spectrum is degenerate and remains so for all unitary perturbations (e.g., a single-qubit unitary gate will always have two unit eigenvalues). The amplification properties of a set of $N_g$ germs $g_1 \ldots g_{N_g}$ are described by the $N_e \times N_p$ Jacobian formed by stacking the individual germ Jacobians,
$$\nabla = \begin{pmatrix} \nabla^{(p)}_{g_1} \\ \vdots \\ \nabla^{(p)}_{g_{N_g}} \end{pmatrix}. \qquad \text{(C5)}$$
Our goal is to choose germs that provide high sensitivity at "large" values of $p$ (or equivalently $l$, if $p = \lfloor l/|g| \rfloor$). In practice, it is not useful to make $l$ larger than $1/(|g|\eta)$, where $\eta$ is the rate of stochastic or depolarizing noise. To select germs, however, we ignore this effect and make the simplifying assumption that the gates (and therefore $\tau(g)$) are reversible. In practice, we project the gates of G onto their closest unitary and consider random unitary variations of G to avoid the effects of accidental degeneracies. Under the assumption of unitary gates, it is possible to define the $l \to \infty$ limit of the Jacobian in Eq. C4. Using the product rule and the fact that $\tau(g)^{-1} = \tau(g)^\dagger$,
$$\nabla^{(p)}_g = \sum_{n=0}^{p-1} \tau(g)^{p-1-n}\, \nabla^{(1)}_g\, \tau(g)^{n} = \tau(g)^{p-1}\; p \left[ \frac{1}{p} \sum_{n=0}^{p-1} \tau(g)^{-n}\, \nabla^{(1)}_g\, \tau(g)^{n} \right].$$
As $p \to \infty$, the average over all powers $n$ of $\tau(g)$ twirls $\nabla^{(1)}_g$. By Schur's lemma, the effect of twirling is to project $\nabla^{(1)}_g$ onto the commutant of $\tau(g)$ - i.e., onto the subspace of matrices that commute with $\tau(g)$. Furthermore, multiplication by the unitary $\tau(g)^{-(p-1)}$ is merely a change of basis, and has no effect on the right singular vectors or the singular values of $\nabla_g$.
So, up to an irrelevant change of basis,
$$\nabla^{(p)}_g \;\xrightarrow{\;p \to \infty\;}\; p\, \Pi_{\tau(g)}\!\left[\nabla^{(1)}_g\right],$$
where $\Pi_{\tau(g)}$ is the projection onto the commutant of $\tau(g)$. This framework defines a notion of completeness for germs. The set of germs $\{g_i\}_{i=1}^{N_g}$ is amplificationally complete (AC) if and only if the rank of its Jacobian, the $p \to \infty$ limit of Eq. C5, equals the total number of physically accessible parameters in the gate set, $N_p^{\rm nongauge}$ (cf. appendix A 1). To build an AC set of germs, it is sufficient to add germs, one by one, to a set until its Jacobian has rank equal to $N_p^{\rm nongauge}$. One may continue to optimize the set of germs numerically by adding and removing germs (taken from some master candidate list), and only keeping modifications that lower a certain score function. The primary score function used in pyGSTi is, roughly, $N_g \sum_j \sigma_j^{-2}$, where the $\sigma_j$ are the singular values of the stacked Jacobian: this score estimates the mean squared error of estimation if a fixed number of counts is spread over the $N_g$ distinct germs.
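For a $\tau(g)$ of finite order, the twirl that produces $\Pi_{\tau(g)}$ can be computed exactly as a finite average (a small numpy sketch with names of our own choosing):

```python
import numpy as np

def twirl(m, t, order):
    """Average of t^-n @ m @ t^n over n = 0 .. order-1.

    When t^order = 1l, this finite average is exactly the projection of m
    onto the commutant of t (the matrices commuting with t), mirroring the
    p -> infinity twirl in the germ-Jacobian analysis.
    """
    acc = np.zeros_like(m, dtype=float)
    t_inv = np.linalg.inv(t)
    for n in range(order):
        acc += np.linalg.matrix_power(t_inv, n) @ m @ np.linalg.matrix_power(t, n)
    return acc / order
```

For example, twirling by `t = diag(1, -1)` (order 2) projects any matrix onto its diagonal part, which is precisely the commutant of `t`.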
c. Base circuit selection

It remains to specify the number of repetitions (i.e. the values of $l$ and therefore $p$) for each germ used to construct the base circuits of a GST experiment design. We have motivated the use of logarithmically spaced germ powers (above, and in appendix B), and as described in the main text (Section IV B 2) the maximum number of repetitions should be chosen so that the depth of the repeated germ does not exceed $L_\eta$, a rough maximum circuit depth beyond which the output probabilities cease to yield useful information.
The main idea is to repeat each germ a logarithmically increasing number of times until the depth of the repeated germ circuit reaches $L_\eta$. We define the set of "maximum depths" $\{l_i = m^{i-1}\}_{i=1}^{N_l}$ (with $m = 2$ usually) such that $l_{N_l} \le L_\eta$, and define $l_{\max} = l_{N_l} = \max_i l_i$ for later convenience. Let us write the germ power $p$ formally as a function of germ and maximum depth:
$$p(g, l) = \lfloor l / |g| \rfloor,$$
so that $p(g, l)$ is the maximum number of times $g$ may be repeated without exceeding depth $l$ (if $|g| > l$ then $p(g, l) = 0$). Germ $g$ is then repeated $p(g, l_i)$ times for each $i = 1 \ldots N_l$. Note that this yields slightly different base circuits from just choosing logarithmically-spaced powers ($p$) - see Figure 5 step 2. In particular, when $|g| > 1$ we may end up including additional circuits that we would not have if $p$ was itself logarithmically spaced. These additional circuits add to the over-completeness of the data, which we discuss later. The reason for choosing logarithmically-spaced maximum depths ($l$) rather than exponents ($p$) is primarily so that we can, in most cases, use a single set $\{l_i\}$ for all germs. If a common set of powers ($p$) were used for all the germs, this would either result in wasted circuits of depth greater than $l_{\max}$ or the failure to repeat short germs as many times as one could. A more flexible (but more complicated) approach is to vary $l_{\max}$ from germ to germ based on prior knowledge of the device's stochastic noise behavior (e.g., when two-qubit gates have more stochastic error than one-qubit gates). pyGSTi allows users to specify per-germ maximum depths, but this is not the default behavior.
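In code, the depth/power bookkeeping is a one-liner each (illustrative helper names, not pyGSTi's API):

```python
def max_depths(l_max, m=2):
    """Logarithmically spaced maximum depths 1, m, m^2, ... up to l_max."""
    depths, l = [], 1
    while l <= l_max:
        depths.append(l)
        l *= m
    return depths

def germ_power(germ_depth, max_depth):
    """p(g, l): the largest p with p * |g| <= l (0 if the germ is too long)."""
    return max_depth // germ_depth
```

For a depth-3 germ with maximum depths 1, 2, 4, 8, 16, this gives powers 0, 0, 1, 2, 5 - note the extra intermediate circuits relative to strictly logarithmically-spaced powers.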
In the end, the complete set of pyGSTi-generated GST circuits is as follows. With all the repeated germs given by $\{g_i^{p(g_i, l_j)}\}$, where $\{l_j\}_{j=1}^{N_l}$ is the set of $N_l$ maximum depths, the probabilities corresponding to Eq. 60 are given by (using $p_{i,j} = p(g_i, l_j)$):
$$p^{(i,j)}_{a,b} = \langle\!\langle E_a |\, \tau(g_i)^{p_{i,j}}\, | \rho_b \rangle\!\rangle, \qquad \text{(C12)}$$
where $\langle\!\langle E_a|$ and $|\rho_b\rangle\!\rangle$ range over the effective measurements and preparations obtained from fiducial selection. While the set of circuits corresponding to Eq. C12 achieves the goal of amplifying all of the gate set model's parameters, it isn't as efficient as it could be. This is because we have included a complete set of fiducial pairs for each base circuit.

d. Fiducial pair reduction

In this section we expand upon the description of fiducial pair reduction (FPR) in the main text (Section V B).
FPR is especially needed when we consider scaling GST beyond a single qubit. Moving to more qubits necessarily brings dramatic increases in the dimensionality of the parameter manifold, but the standard GST experiment design grows more rapidly than is required by the greater number of parameters. An ambitious 1-qubit GST experiment uses 6 fiducial circuits, 11 germs, and up to 13 powers ($L = 1, 2, 4, \ldots, 8192$). This adds up to approximately $6 \times 6 \times 11 \times 13 = 5148$ circuits. This sounds like a lot - but a direct generalization of this method to 2 qubits would require 36 fiducial circuits and 80 germs, for roughly $1.3 \times 10^6$ distinct circuits. This is not practical.
Thankfully, this many circuits is not necessary. As discussed in the main text, the standard experiment design contains redundancies stemming from the fact that we don't need full tomographic information for each repeated-germ circuit $g_i^{p(g_i, l_j)}$. We only need to be able to extract the amplified linear combinations of parameters for each germ, which are typically much fewer ($\approx d^2$) than the $\approx d^3$ circuits that pinpoint the $\approx d^4$ elements of $\tau\big(g_i^{p(g_i, l_j)}\big)$. As such, we roughly expect that we can reduce the number of circuits by a factor of $d^3/d^2 = d$ and still maintain the desired Heisenberg scaling.
One way of doing this is to simply randomly remove some fraction of the $N_{f1} \cdot N_{f2}$ circuits corresponding to each repeated germ. This technique seems to work in practice but has not been studied extensively. An alternate technique is to remove a specific set of the $N_{f1} \cdot N_{f2}$ "fiducial pairs" (corresponding to $(a, b)$ pairs of indices in Eqs. 60-61) for each germ. The same set of fiducial pairs is removed for every power of the same germ. To construct a reduced set of $M$ fiducial pairs for germ $g$, we construct the projectors $\{P_k\}$, where $P_k = J_k J_k^T$ and $J_k$ is the $M \times N_p$ matrix whose $m$-th row is the derivative of $\langle\!\langle E_{a_m} | \tau(g) | \rho_{b_m} \rangle\!\rangle$ with respect to the $k$-th amplified parameter of $g$. The "amplified parameters" of $g$ are the non-zero elements of $\tau(g)$'s Jordan normal form, as these are the generalization of $\tau(g)$'s eigenvalues which includes off-diagonal elements of degenerate blocks. If the rank of $\sum_k P_k$ equals the maximal number of parameters $g$ can amplify, then the set of fiducial pairs $\{(a_m, b_m)\}_{m=1}^{M}$ is sufficient and may be used as the reduced set of fiducial pairs for $g$. (The maximal number of parameters $g$ can amplify can easily be found by computing the rank when including all possible fiducial pairs.) For each germ we find a minimal set of fiducial pairs which amplifies the maximum number of parameters, and this collection of per-germ fiducial pairs dictates a reduced set of circuits which should maintain the same desired accuracy scaling. We find this to be true in practice, but the FPR process also makes the parameter estimation less robust to data that does not fit the gate set model. The redundancy, as one might expect, serves to stabilize the numerical optimization used in the parameter estimation portion of GST.
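This rank test suggests a simple greedy reduction (an illustrative sketch with made-up names; the same rank-completion idea also applies to building an AC germ set one germ at a time):

```python
import numpy as np

def reduce_fiducial_pairs(pairs, deriv_row, max_amplified, tol=1e-10):
    """Greedily keep only fiducial pairs whose derivative rows increase the
    rank, stopping once the germ's maximal number of amplified parameters
    is reached.

    pairs         : candidate (a, b) fiducial-pair indices
    deriv_row     : function mapping a pair to its row of derivatives of
                    <<E_a| tau(g) |rho_b>> w.r.t. the amplified parameters
    max_amplified : maximal number of parameters the germ can amplify
    """
    kept, rows, rank = [], [], 0
    for pair in pairs:
        trial = rows + [np.atleast_2d(deriv_row(pair))]
        new_rank = np.linalg.matrix_rank(np.vstack(trial), tol=tol)
        if new_rank > rank:
            kept.append(pair)
            rows, rank = trial, new_rank
        if rank >= max_amplified:
            break
    return kept
```

Pairs whose rows are linearly dependent on those already kept contribute no new amplified information and are dropped.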
It is also possible to consider different maximum-depth sets $\{l_i\}$ for different germs, so that the number of times each germ is repeated is strictly a logarithmic sequence. We expect that this would similarly lead to GST maintaining the desired scaling on data that can be fit well, but that it could lead to an overall lack of robustness to data which cannot be fit well. We have not performed simulations to verify this hypothesis, and leave this as a subject of future work.

2. Non-TP loglikelihood optimization
If the gate set is not TP, then the expression for the loglikelihood in Eq. 62 above must be modified. In doing so, we will make use of a general relationship between the Poisson and multinomial distributions. Consider a multinomial distribution with event probabilities $\{p_j\}$ and number of samples $N$. Now suppose the number of samples is no longer fixed but instead is Poisson-distributed with mean $\Lambda = N$. This results in the previously multinomially-distributed counts $n_j$ becoming independently Poisson-distributed with rates $\lambda_j = p_j \Lambda = p_j N$. This is easier to see in the reverse direction: starting from independent Poisson-distributed $n_j$ and post-selecting on $\sum_j n_j = N_0$, where $N_0$ is a fixed number, introduces correlations among the $n_j$, and the $n_j$ become multinomially distributed.
To see this, suppose we have $K$ Poisson-distributed random variables $n_j$, each with its own rate parameter $\lambda_j$:
$$\Pr(n_j) = \frac{\lambda_j^{n_j} e^{-\lambda_j}}{n_j!}. \qquad \text{(C13)}$$
In turn, the random variable $X \equiv \sum_j n_j$ is Poisson-distributed, with rate parameter $\lambda \equiv \sum_j \lambda_j$:
$$\Pr(X = N_0) = \frac{\lambda^{N_0} e^{-\lambda}}{N_0!}. \qquad \text{(C14)}$$
Further, since the $n_j$ are independent, the probability of observing a particular set of values $\{n_j\}_{j=1}^K$ is simply the product of the individual probabilities:
$$\Pr(\{n_j\}) = \prod_{j=1}^{K} \frac{\lambda_j^{n_j} e^{-\lambda_j}}{n_j!}. \qquad \text{(C15)}$$
The conditional probability we seek is
$$\Pr(\{n_j\} \mid X = N_0) = \frac{\Pr(\{n_j\} \text{ and } X = N_0)}{\Pr(X = N_0)}. \qquad \text{(C16)}$$
The denominator of this expression has already been calculated (Eq. C14); what remains is to calculate the numerator. Notice that if the sum of the $\{n_j\}$ is not $N_0$, then the numerator is zero, while if their sum is $N_0$, then the numerator is $\Pr(\{n_j\})$:
$$\Pr(\{n_j\} \text{ and } X = N_0) = \delta_{N_0, \sum_j n_j} \prod_{j=1}^{K} \frac{\lambda_j^{n_j} e^{-\lambda_j}}{n_j!}. \qquad \text{(C17)}$$
Therefore, the conditional probability is
$$\Pr(\{n_j\} \mid X = N_0) = \frac{\prod_j \lambda_j^{n_j} e^{-\lambda_j} / n_j!}{\lambda^{N_0} e^{-\lambda} / N_0!} = \frac{N_0!}{\prod_j n_j!} \prod_j \left(\frac{\lambda_j}{\lambda}\right)^{n_j}, \qquad \text{(C18)}$$
which is a multinomial distribution over $N_0$ trials with event probabilities $p_j = \lambda_j / \sum_k \lambda_k$. Note that the second equality uses the fact $N_0 = \sum_j n_j$ (to write $\lambda^{N_0} = \prod_j \lambda^{n_j}$ and cancel $e^{-\lambda} = \prod_j e^{-\lambda_j}$).
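The Poisson-to-multinomial identity of Eq. C18 can be verified exactly for small cases (a self-contained numerical check using standard pmfs; variable names are ours):

```python
from math import exp, factorial, prod

def poisson_pmf(k, lam):
    """Pr(k) for a Poisson distribution with rate lam (Eq. C13)."""
    return lam ** k * exp(-lam) / factorial(k)

def multinomial_pmf(ns, ps):
    """Multinomial probability of counts ns with event probabilities ps."""
    n = sum(ns)
    coeff = factorial(n) // prod(factorial(k) for k in ns)
    return coeff * prod(p ** k for p, k in zip(ps, ns))

# Condition K = 3 independent Poissons on their sum (Eq. C16) and compare
# with the multinomial with p_j = lambda_j / sum(lambda) (Eq. C18):
lams, ns = [1.5, 2.0, 0.5], [2, 3, 1]
joint = prod(poisson_pmf(k, l) for k, l in zip(ns, lams))
conditioned = joint / poisson_pmf(sum(ns), sum(lams))
multinom = multinomial_pmf(ns, [l / sum(lams) for l in lams])
```

The two quantities agree to machine precision, confirming the derivation above.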
When a gate set is not constrained to be TP, the predicted outcome probabilities for a circuit need not sum to one, and equivalently the predicted total number of counts need not equal the actual (observed) number of counts. Assuming the predicted number of counts will be (Poisson-) distributed around the actual number, we find ourselves in the situation just described, with the rates corresponding to the outcome probabilities $\{p_{s,\beta_s}\}_{\beta_s}$. We may thus think of $\lambda_{s,\beta_s} \equiv N_s p_{s,\beta_s}$ as the rate of seeing outcome $\beta_s$ after circuit $s$, and write the likelihood for a single circuit as the likelihood of independent Poisson distributions:
$$\mathcal{L}_s = \prod_{\beta_s} \frac{(N_s p_{s,\beta_s})^{N_{s,\beta_s}}\, e^{-N_s p_{s,\beta_s}}}{N_{s,\beta_s}!} \;\;\Rightarrow\;\; \log \mathcal{L}_s = \sum_{\beta_s} N_{s,\beta_s} \log(p_{s,\beta_s}) - N_s p_{s,\beta_s},$$
where in the final expression we have for convenience discarded the $p_{s,\beta_s}$-independent $N_{s,\beta_s} \log(N_s)$ and $\log(N_{s,\beta_s}!)$ terms and defined $\log \mathcal{L}_s$ accordingly. The total loglikelihood is the sum of each $\log \mathcal{L}_s$, giving
$$\log \mathcal{L} = \sum_s \log \mathcal{L}_s = \sum_{s,\beta_s} N_{s,\beta_s} \log(p_{s,\beta_s}) - N_s p_{s,\beta_s}. \qquad \text{(C19)}$$
The second term in Eq. C19 can be viewed as a Lagrange multiplier (or penalty term) which softly enforces the TP constraint that $\sum_{\beta_s} p_{s,\beta_s} = 1$ for all $s$. Furthermore, we note that this "Poisson-picture" form of the loglikelihood is identical (up to an unimportant constant shift) to Eq. 63 for TP gate sets, and so Eq. C19 can be interpreted as a more general formulation which allows for both TP and non-TP gate sets.
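As a sketch, the Poisson-picture objective of Eq. C19 for a single circuit is just a few lines (a hypothetical helper, not pyGSTi's implementation, which additionally regularizes small and negative probabilities):

```python
import numpy as np

def poisson_logl(counts, probs, n_shots):
    """Poisson-picture per-circuit loglikelihood (Eq. C19, constants dropped):
    sum_b [ N_b log(p_b) - N p_b ].  probs need not sum to 1 (non-TP models);
    the second term penalizes total probability mass away from the data.
    """
    counts, probs = np.asarray(counts, float), np.asarray(probs, float)
    return float(np.sum(counts * np.log(probs) - n_shots * probs))
```

For TP models (probabilities summing to one) the second term is the constant $-N_s$, recovering the standard loglikelihood up to an unimportant shift.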
pyGSTi uses this Poisson-picture formulation of log L all the time, both because it is more general (can handle cases when the gates are not TP) and because it makes the function more amenable to least-squares optimization techniques such as the Levenberg-Marquardt method. As noted in the text (Section IV C), pyGSTi also regularizes the log L functions so that evaluating the objective function at negative probabilities (as can be predicted by non-CP-constrained gate set models) is possible.

Appendix D: Numerical verification
This appendix presents a numerical verification of the accuracy scaling claims derived in the main text. Here we present evidence that:

• the accuracy with which LGST reconstructs a gate set scales as $O(1/\sqrt{N})$, where $N$ is the number of times each circuit is run on the device (so there are $N$ sampled outcomes per circuit); and

• the accuracy with which long-sequence GST reconstructs a gate set scales as $O(1/L)$, where $L$ is the maximum depth of the base circuits used in long-sequence GST.
Both points can be handled similarly, as they both involve testing whether the slope of a particular plot is near a target value. Our results consist of verifying the values of such slopes over many trials on 1- and 2-qubit systems. Each trial consists of the following steps:

1. Choose a "true" gate set. This identifies the trial, and is a slightly perturbed ideal/target gate set. The true gate set contains imperfect operations.
2. Simulate circuits using the true gate set. Circuit outcomes are generated by sampling N times the multinomial distribution corresponding to the outcome probabilities predicted by the true gate set. For LGST trials, the number of circuits is fixed and N is varied to form multiple data sets each corresponding to a different value of N . For longsequence GST trials, N is fixed, and data sets corresponding to different maximum base-circuit depth L are created.

Perform
LGST or GST on the simulated data sets. This leads to a series of estimated gate sets -one for each value of N or L, respectively.

4.
Plot the accuracy of the LGST (GST) estimates vs. N (L) on a log-log plot, and compute its best-fit slope. Accuracy is computed as a distance to the true gate set (see below), and quantifies how well the estimate matches the true gate set.
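The slope-extraction step above is a simple least-squares fit in log-log coordinates. A minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def loglog_slope(x, y):
    """Best-fit slope of y vs. x on a log-log plot,
    via a degree-1 polynomial fit to (log x, log y)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope

# Synthetic check: an accuracy falling off as 1/sqrt(N)
# should give a best-fit slope of about -1/2.
N = np.array([100, 1000, 10000])
err = 2.0 / np.sqrt(N)
# loglog_slope(N, err)  # ≈ -0.5
```

In the trials described above, the fitted slope is then compared against the target value (−1/2 for LGST, −1 for long-sequence GST).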
In the end, each trial leads to a slope that we compare with the target value of −1/2 (for LGST trials) or −1 (for GST trials). Trials are constructed by first selecting a perfect gate set with unitary gates. We consider six 1-qubit and five 2-qubit perfect gate sets of the kinds commonly seen in hardware. Each 1-qubit (2-qubit) perfect gate set is perturbed 100 (10) times by applying (a) independently random depolarization to each gate, with a maximum strength of 10⁻³, (b) a constant SPAM error of 10⁻² (1%), and (c) independently random rotations of up to 10⁻³ radians about each axis (x, y, and z in the 1-qubit case; the axes corresponding to the 15 nontrivial Pauli matrices in the 2-qubit case). These noise strengths were chosen to be realistic values for current state-of-the-art devices. Each perturbed gate set is then used to generate a series of data sets by varying N or L as described above.
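The data-generation step, drawing N outcomes per circuit from the multinomial distribution defined by the true gate set's predicted probabilities, can be sketched as follows (the function name and fixed seed are assumptions for this example):

```python
import numpy as np

def sample_circuit_counts(probs, n_samples, rng=None):
    """Draw outcome counts for one circuit by sampling n_samples
    times from the multinomial distribution defined by the model's
    predicted outcome probabilities."""
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility
    return rng.multinomial(n_samples, probs)

counts = sample_circuit_counts([0.9, 0.1], 1000)
# counts is a length-2 array of nonnegative integers summing to 1000
```

Repeating this for every circuit in the experiment design, at each value of N (or L), produces the simulated data sets used in the trials.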
Quantifying the accuracy of LGST/GST is done by computing the average diamond-norm distance, denoted ε_avg(G, G₀), between the gates of the true gate set G₀ and those of the estimate G, where

ε_avg(G^(1), G^(2)) = (1/N_gates) Σ_i ½ ‖G_i^(1) − G_i^(2)‖_◇.   (D1)

In Eq. D1, G_i^(1) are the gates belonging to G^(1), G_i^(2) are the gates belonging to G^(2), N_gates is the number of gates, and the diamond norm is given by

‖Λ‖_◇ = max_ρ ‖(Λ ⊗ I)[ρ]‖₁,

where ρ ranges over all valid quantum states.
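Computing the diamond norm exactly requires solving a semidefinite program, but the averaging structure of Eq. (D1) is easy to illustrate. The sketch below uses half the Frobenius distance between superoperator matrices as a stand-in per-gate distance (the text notes that such alternatives are expected to show the same scaling); the helper name is hypothetical.

```python
import numpy as np

def avg_gate_distance(gates1, gates2, dist=None):
    """Average per-gate distance between two gate sets, as in Eq. (D1).

    `dist` should map two superoperator matrices to a distance.
    The default, half the Frobenius distance, is a cheap stand-in
    for the half-diamond-norm distance (which needs an SDP)."""
    if dist is None:
        dist = lambda a, b: 0.5 * np.linalg.norm(a - b, "fro")
    return float(np.mean([dist(g1, g2) for g1, g2 in zip(gates1, gates2)]))

# Identical gate sets are at distance zero.
ident = [np.eye(4), np.eye(4)]
# avg_gate_distance(ident, ident)  # → 0.0
```

Swapping in a true diamond-norm routine (e.g., one based on a semidefinite-programming solver) only changes the `dist` argument, not the averaging.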
Since the diamond distance is not gauge invariant (see Appendix A 1), we gauge-optimize the estimated gate sets to the true gate set before computing it. If we did not, the results would not be meaningful, and we would not expect ε_avg to decrease in the expected way. The choice of average diamond distance here is somewhat arbitrary: if we instead used the infidelity, the Frobenius distance, or another similarly well-behaved quantity measuring the difference between two gate sets, we would expect to see the same scaling behavior. Figure 9 shows the average diamond distance vs. N for estimates obtained from LGST, and confirms the expected 1/√N scaling (slopes are near −0.5). This is true for all the single- and two-qubit gate sets we tested. In the figure legends, single-qubit rotation gates are labeled by their axis and angle, e.g., X(π/2) denotes an x-axis rotation by π/2 radians. The N gate is a π/2-rotation gate about the tilted (x, y, z) = (√3/2, 0, −1/2) axis. Two-qubit gates are given either as two letters indicating a tensor product of single-qubit π/2 rotation gates (e.g., XI means an X(π/2) gate on qubit 1 and an idle on qubit 2), or as the commonly known CNOT or CPHASE gate.
The O(1/√N) scaling leads to an overall O(1/√N_tot) scaling, where N_tot is the total number of samples, since the number of circuits required by LGST is fixed and N_tot is proportional to N.

[Figure 9: LGST accuracy. The average diamond-norm distance between the gates of an LGST estimate and the data-generating "true" gate set, as a function of the number of data samples per circuit, N. The dashed line with slope −0.5 indicates the expected O(1/√N) scaling. The left and right columns of plots correspond to 1- and 2-qubit results, respectively. At each value of N, 100 (1-qubit cases) or 10 (2-qubit cases) perturbations of the original perfect gate set were used (color-coded), with a line plotted for each perturbed gate set. Y-axis values are multiplied by 10^k to aid visibility (this shifts lines upward by k grid cells); the integer k differs for each perfect gate set and is given in the legend, indicating the number of grid cells the data must be shifted down to recover the correct y-axis values. Below each plot is a histogram of the best-fit slopes of all the lines within it. The 1-qubit gate sets tested include {X(π/2), Y(π/2)}, {X(π/2), Z(π/2)}, {X(π/2), Y(π/2), I}, {X(π/2), Y(π/2), Z(π/2), I}, {Z(π/2), N}, and {X(π/4), Z(π/2)}; the 2-qubit gate sets include {XI, YI, IX, IY}, {XI, YI, IX, IY, CPHASE}, {XI, YI, IX, IY, II, CNOT}, and {XI, YI, IX, IY, II, CPHASE}.]

[Figure 10: Long-sequence GST accuracy. Left and right columns of plots correspond to 1- and 2-qubit results, respectively. At each value of L, 100 (1-qubit cases) or 10 (2-qubit cases) perturbations of the original perfect gate set were used (color-coded), with a line plotted for each perturbed gate set. Y-axis values are multiplied by 10^k to aid visibility (this shifts lines upward by k grid cells); the integer k differs for each perfect gate set and is given in the legend, indicating the number of grid cells the data must be shifted down to recover the correct y-axis values. Below each upper plot is a histogram of the best-fit slopes of all the lines within it.]
Long-sequence GST utilizes deep circuits to amplify gate errors and thereby make them more visible. By design, the sensitivity of GST to gate errors should scale as 1/L, where L is the maximum depth of the repeated germ circuits. Thus, we expect GST to achieve O(1/L) accuracy scaling. Figure 10 shows the average diamond distance vs. L for estimates obtained from long-sequence GST, and confirms this 1/L scaling (slopes are close to −1).
The long-sequence GST simulations were run with fixed N = 1000, with the maximum base-circuit depths chosen to be logarithmically spaced powers of 2 from 1 to L. There is some expected "roll-off" at larger values of L, believed to be due to L approaching L_η (the depth at which circuit outcomes no longer provide useful information because the quantum state has completely decohered). This slight bend is more visible in the 2-qubit case.
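Generating the logarithmically spaced depth schedule described above is straightforward; a small sketch (the function name is ours):

```python
def base_circuit_depths(max_depth):
    """Powers of 2 from 1 up to max_depth (inclusive), as used for
    the maximum base-circuit depths in the long-sequence GST runs."""
    depths, d = [], 1
    while d <= max_depth:
        depths.append(d)
        d *= 2
    return depths

# base_circuit_depths(16)  # → [1, 2, 4, 8, 16]
```

Each depth in this schedule defines one data set, so the number of distinct depth values grows only logarithmically with L.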
This O(1/L) scaling easily translates to an overall O(1/N_tot) scaling, where N_tot is the total number of samples: with N fixed, and pessimistically assuming that there are L distinct maximum-base-circuit-depth values (usually there are only O(log L) of them), N_tot is proportional to L.
All of the analysis shown here used the LGST and long-sequence GST implementations in pyGSTi version