Operational, gauge-free quantum tomography

As increasingly impressive quantum information processors are realized in laboratories around the world, robust and reliable characterization of these devices is now more urgent than ever. These diagnostics can take many forms, but one of the most popular categories is tomography, where an underlying parameterized model is proposed for a device and inferred by experiments. Here, we introduce and implement efficient operational tomography, which uses experimental observables as these model parameters. This addresses a problem of ambiguity in representation that arises in current tomographic approaches (the gauge problem). Solving the gauge problem enables us to efficiently implement operational tomography in a Bayesian framework computationally, and hence gives us a natural way to include prior information and discuss uncertainty in fit parameters. We demonstrate this new tomography in a variety of different experimentally-relevant scenarios, including standard process tomography, Ramsey interferometry, randomized benchmarking, and gate set tomography.

Quantum computing offers the potential for significant advantages across a wide range of important problems. Establishing a rigorous understanding of the costs involved in producing enterprise-scale quantum computers is a critical part of current decision making. This need has driven efforts to more precisely estimate the costs of different algorithms across different applications, such as in quantum chemistry simulations [1]. However, these resource estimations depend critically on the quality of the qubits used, i.e., the accuracy with which one can perform quantum gates and measurements. The collection of procedures used for detecting and debugging faulty operations on quantum computers is known as quantum characterization, verification, and validation (QCVV). Through QCVV, scientists and engineers working on quantum hardware can hope to diagnose errors and certify performance, in turn improving qubit design and operation.
One goal of QCVV is to learn what actually happens when we attempt to apply a target unitary operator U, a procedure broadly known as quantum tomography. Using the language of open quantum systems, we can hypothesize that there is some channel Λ that, if we knew it, would allow us to predict what happens when we apply U to any state. The problem then becomes determining how should we best learn Λ given experimental evidence from our quantum device. The tomography problem has been approached in a wide variety of ways [2][3][4][5][6][7][8][9][10][11][12]. The various procedures generally learn Λ by (repeatedly) preparing a variety of input states {ρ i }, sending each through the application of U, and then measuring a variety of effects {E j }. In some cases, the use of auxiliary qubits as a reference can eliminate the need to vary over input states [13]. However, these latter approaches are mainly useful for reasoning mathematically about tomography [4] and offer limited experimental applicability. Hence, we will focus on the more typical case in this work.
While valid, this rests critically on the assumption that we know what each of {ρ i } and {E j } are. In practice, each state ρ i and each measurement E j may be subject to its own physical errors, and these may in turn be objects that we would like to learn. Worse still, we often prepare states by performing a particular privileged state preparation procedure ρ , and then applying unitary evolution operators {V i } to obtain ρ i := V i ρ V † i . Similarly, measurements are often effected by transforming a particular privileged measurement under unitary evolution.
Once we include the experimental consideration that the channels we would like to study are the same ones that we use to prepare and measure our devices, we are forced to ensure that we learn the channels describing our system in a self-consistent manner. We cannot learn a channel Λ entirely independently of the experimental context in which Λ occurs, but must describe that channel such that we can predict its action in a larger experiment. This self-consistency requirement then forces us to face another difficulty: we can always transform the entire description of an experiment in a consistent fashion, such that there is no observable difference whatsoever. For instance, the states |0 and |1 are in essence just labels for two levels of a quantum system; there is no observable impact to our calling them |♥ and |♦ , | and | , or even |1 and |0 .
That we can rename |0 to |1 and vice versa illustrates one way to formally describe the challenge imposed by self-consistency. In particular, if we perform an experiment whose outcomes are described by Born's rule as Tr(EΛ[ρ]), then for any unitary map U the experiment Tr(UEU † · UΛ[U † · UρU † · U]U † ) has the exact same outcome distribution, and thus cannot be distinguished from our original description. That is, we cannot decide if we have (ρ, E, Λ) or if we have (UρU † , UEU † , UΛU † ).
Taking a step back, something seemingly ridiculous has happened: we asked merely for a description of how one component of our quantum device operates, and arrived at a seemingly fundamental limit to what knowledge we can ever gather about our device. After all, UρU † and ρ seem to be very different preparation procedures! Recently, gate set tomography (GST) [14,15] has been used as a means to solve this conundrum by explicitly including the effects of this apparent ambiguity into estimation procedures. With GST, we perform inference on the entire gate set, state preparation, and measurement procedure based on empirical frequencies from repeated experiments. This inference procedure can be quite sophisticated in practice, with carefully designed experiments to tease out very slight channel imperfections. Over the past several years, GST has been demonstrated experimentally on a wide variety of platforms [15][16][17][18][19][20][21][22][23][24][25][26][27][28], predominately using the software package pyGSTi [29,30].
Of course, gate set tomography also has drawbacks, suffering from a conceptual difficulty known as the gauge problem [17,31,32]. Specifically, GST eschews any notion of a fixed reference frame in favor of a gauge group that specifies how to transform a valid estimate of an error model into a family of related error models which give identical experimental predictions. While such gauge transformations do not impact the predictions made by such a model, they do impact the particular channels reported at the end of the inference procedure, and some commonly reported metrics on channels are not gauge-invariant. In practice, one gauge-fixes resulting channels to some external reference frame, but this procedure requires nonlinear optimization whose global convergence is not guaranteed. Finally, a procedure for systematically including prior information in GST has not yet been put forward, which could potentially result in massive savings if developed.
In this paper, we introduce operational quantum tomography (OQT), which is a general (operational) framework that allows us to reason about a host of different tomographic procedures (including GST) in a manifestly gaugeindependent manner. In addition to resolving the gauge problem, OQT allows us to naturally include prior information in GST within Bayesian inference, which was computationally prohibitive previously due to the the gauge fixing procedure.
OQT is enabled by using a new, manifestly gauge-invariant, representation of our gate set. This representation is inspired by linear-inversion gate set tomography [15]; we term this the operational representation. After introducing the operational representation and explaining how it resolves the gauge problem, we discuss how to implement OQT numerically within a Bayesian framework while including prior information. We then detail the performance of this technique by tracking prediction loss, a useful and gauge-invariant measure of the quality of our ability to predict the outcome of future experiments, across a suite of experimentally relevant problems: Ramsey interferometry, quantum process tomography, and randomized benchmarking. We close by showing how dynamics of quantum systems may, in general, be described using the operational representation.

A. GST formalism
As described in the introduction, quantum state and process tomography make strong assumptions about our ability to perform state preparation and measurement (SPAM). Tomographic reconstructions of states and processes that are made assuming perfect SPAM will be inconsistent with the true, noisy operations. A key advantage of GST is that it produces self-consistent estimates by simultaneously characterizing SPAM along with other processes.
Here we briefly review GST, following Refs. [14,15,17], restricting our attention to the simplest case with a single state preparation and a single, two-outcome measurement. To start, suppose that we have the ability to prepare an (unknown) state ρ, perform an (unknown) two-outcome measurement E, and perform some number n additional (unknown) operations {G 0 , . . . G n−1 }. We think about such a system as a box with labeled buttons, as depicted in Figure 1, where each button denotes an operation we can perform. Hence, we have a button for state preparation (b ρ ), measurement (b E ), and buttons for each other operation labeled by elements of the set B := {b 0 , . . . , b n−1 }, where we abbreviate b i = b G i for notational convenience. A light on the box turns on or stays off to indicate the outcome of the measurement.
Within this formalism, all experiments we can perform are of the form: 1. Press b ρ to begin the experiment.

3.
Press b E to end the experiment.

Record whether the light turned on.
Our goal is to compute the likelihood of observing the light given a particular sequence of buttons. Within a quantum model, we do this by expressing the actions of buttons as super-operators, which are linear operators that take density matrices to density matrices. Formally, let H = C d be a Hilbert space of finite dimension d. Then, we denote by L(H) the space of linear functions H → H, and denote by T(H) the space of linear functions L(H) → L(H). Since T(H) is a space of linear functions, elements of T(H) can be written down as linear operators acting on vectors in L(H). We denote vectors in L(H) by "super-kets", e.g. |ρ ; covectors for L(H) are "super-bras" and correspond to measurements, e.g., E|.
As an example, if d = 2, then ρ can be represented as a 2 × 2 matrix, which we can instead arrange as a 4 × 1 column vector |ρ ∈ C 4 . In this case, we can represent elements from T(H) as 4 × 4 matrices, which act linearly (by 0 Figure 1: An example of a quantum device modeled as a black box with buttons. Buttons are labeled by the actions they perform, for example prepare state ρ, apply operation G i , and take measurement E. A light on top of the box turns on or stays off to indicate the result of the measurement. multiplication) on super-kets. 1 We assign to each button b i a super-operator G i ∈ T(H) (which we represent as a matrix acting on C d 2 ). If Λ G i is a quantum channel acting on a density matrix ρ, then G i is the operator such that We refer to the set of button presses between applications of SPAM as a sequence. We denote the set of possible sequences as S, which contains the empty sequence, as well as every possible combination of button presses. Sequences compose under concatenation. 2 For two sequences s, t ∈ S: We will also write |s| to mean the length of s, such that |s + t| = |s| + |t|.
Using the assignment of super-operators to buttons, we can compute the likelihood of experimental outcomes.
be a sequence of m button presses from the buttons on our box. The likelihood of the light turning on after performing s, the sequence probability, is given by the Born rule: This shows that, were we to learn the explicit form of |ρ , E|, {G i }, we would be able to predict the results of any future experiment. Nevertheless, as we already touched on in the introduction, super-operators suffer from a gauge problem, making many numerically distinct sets of super-operators operationally equivalent. (In the language of super-operators, {|ρ , E|, {G i }} is gauge-equivalent to {B|ρ , E|B −1 , {BG i B −1 }} for any appropriately-sized invertible matrix B.) In the next section, we clarify this notion within the context of linear-inversion GST.

B. Linear inversion GST
The simplest GST inference procedure to learn |ρ , E|, {G i } is linear-inversion GST (LGST). For any GST protocol, one first chooses a set of fiducial sequences, f = { f i }, which act as a "reference frame" for analysis of the experiments. 3 Fiducial sequences are typically short sequences of button presses, and as a set they must be informationally complete (which will be formally defined below, with a consequence being that the set of fiducials has at least d 2 elements).
We use the set of fiducial sequences to construct the following scalar quantities: 1 This is because, if vec(A) denotes the vectorization of matrix A (by column-stacking), then vec(ABC) = (C T ⊗ A)vec(B). As a density matrix ρ evolves by similarity transformation (or a sum over them), |ρ evolves by matrix multiplication. 2 The set of experimental sequences S is a monoid under addition; that is, S is closed under concatenation, and has the empty sequence of buttons () as an additive identity. We note that S does not contain inverses, since we cannot make a sequence of buttons shorter by pressing more buttons. 3 It is sometimes necessary to pick distinct preparation and measurement fiducial sets; our results extend to those situations as well.
where F i is the super-operator obtained by multiplying together the super-operators for the constituent buttons in a fiducial sequence f i . In principle, the entries of these matrices are the probabilities of the light turning on for the given experiments. Hence, by repeating the experiments, we approximate these probabilities via the empirically observed frequencies. Defining A = ∑ j |j E|F j and B = ∑ j F j |ρ j|, where the |j are basis states of the space H ⊗ H, we can recover the desired |ρ , E|, and {G k } according to: We further note that, by definition,F = AB. We require thatF has rank of at least d 2 , where d is the dimension of the qubit Hilbert space. If dim(F) > d 2 , then the pseudo-inverse is used instead of the normal inverse. This rank criterion (rank(F) = d 2 ) is what provides our definition of informational completeness in this context. It provides a check of the choice of fiducials, which can be useful if good initial guesses of the gates are not known. Such a set of fiducials can even be chosen 'on the fly' by performing experiments until one can construct an invertibleF.

C. Gauge and the operational representation
In the above section, one might be troubled that we did not actually recover the literal values G k , ρ, and E. Rather, they are now complicated by the presence of the gauge transformation B -the "true" super-operators G k , ρ, E are all gauge-dependent quantities. However, the gauge B itself is not accessible experimentally. This is because the observed sequence probabilities are totally independent of gauge: More formally, let us begin by making a mapping between button sequences and super-operators. We assign an element of T(H), the space of linear operators on H, to each sequence s ∈ S using a mapping Φ : S → T(H), In general, the mapping Φ between button sequences and channels can arbitrary, especially in the presence of non-Markovian errors, but in this work we will consider the special case in which Φ is a homomorphism between the monoids S and T(H) 4 , This mapping is not unique, but can be specified by the outputs of Φ for each single-button sequence, Φ((b 0 )) = G b 0 , Φ((b 1 )) = G b 1 , and so forth. Considering this special case, we can think of SPAM as a special button b SPAM such that Making this identification, we can then use Φ to recover the probabilities in (2) by taking the trace of Φ(s) for each sequence s ∈ S, so long as we adopt the convention that s begins with b SPAM , Pr(light|s; Φ) = Tr(Φ(s)).
The problem of inferring the properties of our box is equivalent to identifying which Φ maps from button sequences to super-operators in a manner that correctly predicts experimental outcomes according to (2). Following this motivation, we define that two mappings Φ, Φ : S → T(H) are gauge-equivalent (Φ ∼ Φ ) if and only if they yield the same sequence probabilities for all elements of S. The term "gauge" used to describe the equivalence class ∼ is motivated by the observation that LGST Experiment LGST). A set of fiducial sequences is chosen; we perform the specified experiments and record how many times the light turned on. Following the linear inversion step in (4), we can reconstruct a copy of the super-operators for each button. However, the results we obtain will be expressed in an unknown gauge which is one of infinitely many in the gauge orbit.
We say that the equivalence class [Φ] := {Φ ∈ Hom(S, T(H)) such that Φ ∼ Φ} of gauge-equivalent Φ is the gauge orbit. It is easy to identify one such Φ (just choose any invertible matrix of appropriate dimension), but it is expensive to compute an entire equivalence class of distinct ones.
Choosing a gauge to represent a gate set is typically accomplished through nonlinear optimization, in which a gauge is sought that transforms the estimated gate set to be as close as possible (by some metric) to an ideal "target" gate set. This allows for computation of gauge-variant metrics between the estimate and the target (e.g., diamond distance, fidelity). In practice, these procedures can work reasonably well [17], but they scale inefficiently, are not guaranteed to be numerically stable, and are not guaranteed to not get stuck in a local extremum. Thus, as a practical matter, we would like to identify a set of parameters that is necessary and sufficient to identify gauge orbits without having to actually perform a gauge optimization. That is, we seek to parameterize and perform inference on the set of gauge orbits directly: where A/∼ is the factor set of A defined by the relation ∼ as the set of equivalence classes A/∼ := {[a] : a ∈ A}.
When it is clear from context which button set and Hilbert space are used to define our box, we will omit them for brevity, writing that G = G(B, H). We say that each member of G(B, H) is a gate set, such that identifying which member of G(B, H) was used to generate a data record is gate set tomography. When it is clear from context, we will also refer to sets of super-operators G = {G 0 , . . . , G k−1 , |ρ , E|} as gate sets, with the implicit understanding that we are interested in the gauge orbit [G] (equivalence class under ∼) of G.
We call any such representation of G(B, H) operational, since it is a complete description of all operational experiments that we can perform on our box, under the promise that the box is described by some model over H. In fact, we have already seen an especially convenient operational representation:Ẽ,F, {G (k) }. They are a set of gaugeindependent values (as they are directly experimentally observable), and are unique to a particular gate set for a given choice of fiducials. They also yield the same measurement probabilities as their gauge-dependent counterparts. To see this, consider some sequence of button presses (b ρ , b s 0 , . . . b s m−1 , b E ). The sequence probability is: This leads to the remarkable fact that when we learnẼ,F, {G (k) }, we can predict the outcome of any future experiments. Note that this statement is distinct from performing LGST: we can useẼ,F, {G (k) } as our underlying model, while updating it via more sophisticated experiments.

III. IMPLEMENTATION
Having thus established that learning the operational representation of a gate set allows us to predict its behavior, we are left with the question of how to learn operational representations from data records. In this section, we describe our implementation of operational quantum tomography, based on Bayesian inference. In particular, we implement the inference numerically using the particle filter, or sequential Monte Carlo (SMC) approach, a standard technique for carrying out Bayesian inference computationally [33].

A. Bayesian inference: obtaining posteriors from evidence
As applied to quantum information, Bayesian inference is a formalism for describing our knowledge about a quantum system given classical data observed from it. In particular, Bayesian inference represents our state of knowledge at any given point in a characterization protocol by a distribution of the form Pr(hypothesis|data), where "hypothesis" describes some hypothesis that we can use to predict the future behavior of our quantum system, and "data" is the set of observations made of that system.
In the special case that data = {} (that is, before we have made any observations), we write our state of knowledge as Pr(hypothesis), also known as our prior distribution. For example, in traditional Ramsey interferometry, our hypothesis might consist of the assumption that the system evolves under a Hamiltonian of the form H = ωσ z /2 for some ω. We may assign a prior distribution over ω such as representing that we are equally willing to believe that ω has any value in the interval [0, ω max ].
Since distributions of the form Pr(hypothesis|data) represent our state of knowledge at any point during an experimental procedure, equipped with such a distribution, we can answer questions such as "what is the best hypothesis to report given what we have learned from our quantum system?". Returning to the Ramsey example, we may want to report an estimateω such that the the squared error (ω − ω) 2 is minimized on average. As summarized in Appendix A, this is achieved by reporting the Bayesian mean estimateω BME := E ω [ω|data] = ω Pr(ω|data)dω.
We are thus left with the problem of finding our state of knowledge at some point in an experimental procedure given our most recent observation, and given our previous state of knowledge; that is, of how to update our state of knowledge to reflect new information. To do so, we rely on Bayes' rule, which states that Pr(hypothesis|data) ∼ Pr(data|hypothesis) × Pr(hypothesis), (14) where Pr(hypothesis) is our prior distribution, and ∼ indicates equality up to renormalization. Intuitively, this rule tells us that a hypothesis is reweighted according to how plausible it is for a given observation to arise given that hypothesis. To perform this update, we must simulate Pr(data|hypothesis), known as the likelihood function for our quantum system. Put differently, we can only learn properties of a system whose effects can be simulated. We cannot learn about a parameter that has no effect on the outcomes of system, or whose effects we cannot simulate. It is for this reason that, in the rest of the paper, we take our hypothesis to be the operational representation of some quantum system. In particular, the operational representation is a minimal set of parameters required to simulate the behavior of that system, such that any parameter beyond the operational representation cannot have any effect on our predictions. For example, we can never learn gauge parameters from experimental observations, as they have no effect on the likelihood function for any measurement that we could perform 5 .

B. Numerical approach: sequential Monte Carlo
So far, we have regarded Bayesian inference in the abstract, without reference to or concern for how one might implement an inference procedure in practice. A practitioner interested in using Bayes' rule will find it difficult to work with (14) directly, as the normalization suppressed by the use of ∼ notation converges exponentially quickly to 0 with the amount of data considered, exacerbating numerical precision issues. Moreover, any choice of discretization informed by the prior is not likely to be terribly useful as the posterior shrinks in width.
In lieu of these considerations, a number of different computational algorithms have been developed that offer a Bayesian practitioner a range of different options. For instance, rejection sampling techniques such as the Metropolis-Hastings algorithm [34], as well as more sophisticated modern algorithms such as Hamiltonian Monte Carlo [35] and NUTS [36], allow for obtaining samples from a posterior distribution with reasonable computational effort. These algorithms have been used in quantum information to solve otherwise intractable problems such as the estimation of randomized benchmarking parameters [37].
For application to online experimental protocols, however, it is often useful to adopt an algorithm that works in a streaming fashion. This allows for samples from a posterior distribution to be drawn at any point in an experimental procedure, such that adaptive decisions such as stopping criteria or experiment design can be made easily. Critical to realizing this capability is that the cost of an algorithm can depend only approximately linearly on the amount of data taken. This restriction motivates the use of filtering algorithms, which update an approximation of a prior given incoming data to yield a new approximation of the resulting posterior. The Kalman filter, for example, is a Bayesian filter for the special case in which the prior and posterior are both normal, and in which the likelihood is a linear model perturbed by normally distributed noise [38].
In this paper, we adopt the particle filter [33], also known as the sequential Monte Carlo approximation. Particle filters are applicable to a very broad range of likelihood functions, and give rich diagnostic data to assist in understanding their execution. The QInfer library [39] provides a useful implementation of particle filters for quantum information applications, and this library is used throughout the rest of the work.
Particle filters work by representing the distribution over some random variable x as a weighted sum of δdistributions at each step, where {w i } are non-negative real numbers summing to 1, and where {x i } are different hypotheses about x. Each hypothesis x i is called a particle, and is said to have a corresponding weight w i . Numerical stability is achieved by periodically moving each particle to concentrate discretization on regions of high posterior density [40]. Examples of this in operation can be seen in videos at https://youtu.be/aUkBa1zMKv4 and https://youtu.be/4EiD8JcCSlQ.

C. Setting priors over the operational representation
Within a Bayesian framework, we begin with a statement about our beliefs before starting an experiment. We write this down formally as a prior distribution, which gives us a mathematical description of our prior knowledge. In absence of any data from a particular experimental run, a prior distribution π assigns a probability π(x) to each object of interest x (e.g., the elements of the operational representation).
In experimental QCVV, we typically express our beliefs in terms of gauge-dependent formalisms (e.g., superoperators). Here, we need to translate these prior distributions into a prior over the operational representation, which is gauge-independent. Fortunately, we can easily sample from the prior distribution over the operational representation induced by a distribution over a gauge-dependent representation. Upon choosing a set of fiducial sequences, we proceed to: 1. State the prior over some gauge-dependent representation (e.g., parameters in super-operators).
2. Draw a sample from the gauge-dependent prior.
3. Convert the gauge-dependent sample to the operational representation by applying LGST. 4. Return this as the sample from the gauge-independent prior.
As a concrete example, suppose we intuit that a particular button should perform single-qubit Z-rotation gates. We can write these in a familiar, gauge-dependent way by expressing them as super-operators in some matrix basis, R z (θ) (in our implementation, we use the Pauli basis). Now suppose that we suspect this button over-rotates about Z by an angle δθ that is somewhere between 0 and π/10. To generate samples from this prior expressed in the operational representation according to this belief, we first choose samples of δθ uniformly at random from 0 to π/10. Next, we use these sampled angles to synthesize corresponding channels for each member of the gate set, i.e. R z (θ + δθ). A prior distribution in terms of superoperators can be constructed in a similar manner for each button on the box. Together, we use them to compute the frequencies for each element of the operational representation using the linear-inversion step of (4).

D. Informational completeness and germ sensitivity
In addition to choosing a prior distribution, we must also choose a set of fiducial sequences to fix a reference frame. Any choice will yield a valid operational representation, in the sense that we can populate anẼ,F, and {G (k) } with the outcome frequencies of the experiments. However, an additional requirement of the fiducial sequences is that they must be informationally complete. As we will see, the definition of this is dynamic.
Consider for a moment standard quantum state tomography, where we prepare (perfectly) some unknown state, and can execute perfect measurements. In the case of a single qubit, it is well known that measuring σ x , σ y , and σ z is sufficient to fully reconstruct the state [41]. These measurements span the Bloch sphere, and we say that they are informationally complete.
In GST, a similar notion holds. However, we do not know a priori how the measurement and operation buttons are oriented relative to an external reference frame. For instance, if someone provides us with a box with buttons labeled σ x , σ y , σ z , we do not know what they actually do. They may be noisy implementations of these operations, they may be completely different operations, or, they may even do nothing at all. Naively using these buttons to execute measurements is therefore not guaranteed to give us something informationally complete, even if the labels suggest they should.
In GST, we can check for informational completeness using the matrixF in the operational representation. Recall that this is constructed using experiments performed by applying pairs of the fiducial sequences. When we initialize the operational representation from the prior over super-operators, we must computeF −1 . If the fiducial sequences are poorly chosen,F may be ill-conditioned, or even singular.

Definition 2.
A set of fiducial sequences f ⊂ S is informationally complete for a gate set G ifF = ∑ ij E|F i F j |ρ |i j| has rank of at least d 2 , where d is the dimension of the qubit Hilbert space. Here, Note that since we can conjugate by B in Definition 2, we can choose any Φ ∈ G that is convenient for evaluating F -if we can find a Φ such that a set of fiducial sequences is informationally complete, it must also be complete for An issue that arises as a consequence of the choice of fiducials is that it is possible to find values acrossẼ,F, {G (k) } that are identical. For example, if one of the fiducials is the empty sequence, thenẼ 0 =F 00 ,Ẽ 1 =F 01 =F 10 , and so forth. Were we to perform a SMC update step on the full set of matrix entries, the entries that are constrained to be identical will be perturbed in different ways, leading to inconsistent outcomes.
To remedy this, we first perform a preprocessing step that eliminates redundant entries, producing a minimal set of model parameters on which we can perform inference. Mappings are employed to transform the minimal set back to fullẼ,F, {G (k) }, and vice versa, throughout. Learning this minimal set of parameters is then sufficient to characterize the entire system. This trimming procedure also has the benefit of substantially reducing the number of model parameters required, speeding up the inference process.
Beyond fiducial selection, we have considerable freedom in the selection of the particular experiments we perform. The best choice of experiments depends on our particular learning objective. For most of our demonstrations in this work, we fix a total number of experiments, a minimum/maximum sequence length, and then produce sequences of increasing lengths between the bounds. In some implementations of GST, one designs a small collection of short button sequences known as "germs", such that by taking appropriate powers of germs one can amplify coherent errors to gain optimal information as the number of experiments is increased. Such a pattern of germs is called "amplificationally complete" [17], and can reduce the total number of experiments required. We take this approach in our implementation of GST, and take care to identify the experiments we carry out in all of our examples.

E. Constraints on gates and gate sets
If we represent inferred channels with super-operators (as is customary in quantum process tomography), the allowed form of the matrices is not arbitrary. Rather, physical constraints such as positivity and trace preservation of density matrices restrict the allowed structure. When generalizing to GST, the problem of identifying when a gate set is valid is complicated by the introduction of the gauge: the elements of a gate set might not, in a particular representation, be CPTP, yet still be gauge-equivalent to a CPTP representation. Performing inference on an operational representation introduces a similar challenge: how do we ensure that an operational representation corresponds to a gate set that makes physical sense?
Analogous to the case in GST, we need a condition that is simultaneously satisfied by all the gates in the gate set. For the operational representation, an obvious first test is to check whether all the entries are in the interval [0, 1].
It is unclear how one can create operational representations that are positive by construction, other than by converting from a standard (gauge-variant) representation that is explicitly positive (in some gauge). However, it is possible to ensure that inference begins from a point where this is true through our choice of prior distribution. As described in Section III C, when we set a prior distribution, we begin with a gauge-dependent prior, expressing each gate of the gate set in the same chosen gauge. We can then guarantee by construction that each member of the gate set has characteristics such as complete positivity, ensuring they will always produce valid outcome probabilities.
As inference proceeds, however, checking for properties such as complete positivity is practically difficult, because doing so requires a gauge-fixing procedure, which we wish to avoid for the aforementioned reasons. Such a procedure is feasible once at the end of an inference run, but not at each update step during Bayesian inference.
One workaround is to ensure, at the very least, that all the values in the operational representation are positive. Though this does not guarantee true positivity, we found in practice that negative values of the likelihood function appear regularly, and these must be handled appropriately for the sequential Monte Carlo updates to succeed. We therefore simply clip the output of the likelihood function, setting any negative 'likelihoods' to 0 and any 'likelihoods' greater than 1 to 1.
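The clipping workaround is a one-liner; a sketch (numpy, with hypothetical values):

```python
import numpy as np

def clipped_likelihood(p):
    """Clip raw 'likelihoods' from the operational model into [0, 1].

    Slightly unphysical particles can predict p < 0 or p > 1; clipping keeps
    the sequential Monte Carlo update well defined.  This is the pragmatic
    workaround described in the text, not a guarantee of positivity.
    """
    return np.clip(p, 0.0, 1.0)

print(clipped_likelihood(np.array([-0.02, 0.4, 1.3])))
```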
An alternate way to approach model validity is to choose a set of validation experiments. In plausible experimental settings, one has a specific application (and hence gate sequence) in mind. We can then decide if a model is valid for a particular set of gate sequences by checking if it produces a proper likelihood for all the validation experiments, a notion that we call operational positivity.

Definition 4 (Operational positivity). An estimate of a gate set Ĝ is operationally positive with respect to a set of validation sequences S_validate if it produces a valid probability, Pr(light|s) ∈ [0, 1], for every s ∈ S_validate.
From these definitions, positivity implies operational positivity but the converse need not be true. Operational positivity is a useful concept because it is both easy to test and also of practical relevance when one wishes to check particular applications.

IV. PREDICTION LOSS
Once we have obtained a posterior distribution Pr(x|data) over the operational representation x of our gate set from some sequence of experiments, we are typically interested in extracting diagnostic and benchmarking information. To do so in a manner consistent with the gauge, one could consider gauge fixing procedures, which consist of optimization problems that pick out particular gauge-dependent representations of a gate set that we can then use to report traditional metrics [17].
For instance, if we intend a priori that the b_x, b_y, b_z buttons should be describable by the unitary transformations e^{−iπσ_x/2}, e^{−iπσ_y/2}, and e^{−iπσ_z/2}, respectively, we may wish to report gauge-dependent metrics such as the diamond norm by fixing to a gauge that best agrees with this description. By taking the best case over members of the gate set in that gauge, we can construct statements such as "there exists a gauge-dependent description of our gates such that, with posterior probability at least (1 − α), the agreement between each gate and its action in a particular chosen frame is no worse than ε." Unfortunately, this gauge-fixing procedure can be cumbersome to implement (especially across many hypotheses), is not guaranteed to work (i.e., to find the optimal gauge given a target gate set), and is open to multiple interpretations.

As an alternative, we instead score our predictions on a set of experiments of interest. To do this, we recall that a gate set G is sufficient to predict the outcome of any hypothetical experiment we may wish to perform within the GST framework. We thus take a data-driven approach to the problem and choose a set of button sequences S_validate that we are interested in correctly predicting. Concretely, let p_s := Pr(light|s) for each s ∈ S_validate be a parameter that we are interested in estimating. If we report an estimate p̂_s for p_s, then we can consider the quadratic loss

L(p_s, p̂_s) = (p_s − p̂_s)².

We call this a prediction loss for the sequence s, since it rewards estimators that can accurately predict the outcome of future experiments. The quadratic loss is by no means unique, and there are other suitable choices, such as the Kullback-Leibler divergence.
Since the prediction loss is Bregman for each s, the Bayesian mean estimator (BME), in which we average over the prediction made by each gate set in the support of our posterior, is optimal [42]. That is, to minimize loss we choose as our estimator

p̂_s = E_{G∼Pr(G|data)}[Pr(light|s; G)].    (17)

Intuitively, we predict the outcome of measuring the sequence s for each hypothesis G, and then take the average. This gives us much better predictive capability than restricting ourselves to a single estimated gate set to predict all future experiments. As we validate with longer and longer sequences than those in the training set used to obtain our posterior, our posterior uncertainty in p_s will necessarily grow, as can be seen from the method of hyperparameters [44]. This is not reflected if we pick a single gate set, but is immediately included in the Bayesian mean estimator (17), which will tend to hedge towards 1/2 as sequences grow in length.
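A small numerical sketch of the BME over a hypothetical SMC particle cloud (the weights and per-particle predictions are illustrative, not data from the paper) shows the optimality property concretely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SMC posterior: particle weights and each particle's predicted
# Pr(light | s) for one validation sequence s.
weights = rng.random(1000)
weights /= weights.sum()
preds = np.clip(rng.normal(0.7, 0.05, 1000), 0.0, 1.0)

# Bayesian mean estimator: posterior-weighted average of the predictions.
p_hat = weights @ preds

def quadratic_loss(p_true, p_est):
    return (p_true - p_est) ** 2

# The BME minimizes posterior-expected quadratic (Bregman) loss: its expected
# loss is no larger than that of committing to any single particle's prediction.
expected_loss_bme = weights @ quadratic_loss(preds, p_hat)
expected_loss_single = weights @ quadratic_loss(preds, preds[0])
print(expected_loss_bme <= expected_loss_single)
```

This is just the familiar fact that the mean minimizes expected squared error; the same weighted average is what hedges p̂_s towards 1/2 as posterior uncertainty grows.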

V. EXAMPLES
In this section, we demonstrate the versatility of our framework by applying it to many common QCVV protocols. This includes replicating the results of other state-of-the-art techniques, such as long-sequence gate set tomography [17] and randomized benchmarking. A discussion of applications of OQT to quantum state tomography can be found in Appendix B.

A. Ramsey interferometry
Single-qubit operations are often implemented by applying electromagnetic pulses to induce rotations about axes of the Bloch sphere. The basis of such methods is the intrinsic Rabi oscillation frequency of the system, which determines the likelihood of measuring the qubit in its |0⟩ or |1⟩ state at a given time. Knowledge of the Rabi frequency allows us to adjust the pulse duration in order to obtain exactly the superposition we desire.
Typically, one learns this frequency by means of either Rabi or Ramsey interferometry. In Ramsey interferometry (depicted in Figure 3), a qubit is prepared in state |0⟩ and an R_x(π/2) pulse is applied. The qubit is left to evolve under the Hamiltonian H = (ω/2)σ_z for some time t, after which another R_x(π/2) pulse is applied, followed by measurement in the computational basis. The likelihood of obtaining the measurement outcome |0⟩ is then cos²(ωt/2). (For a more detailed discussion of the role of Bregman estimators in tomography, we refer the reader to the work of Ferrie and Kueng [43].)
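A minimal simulation of this calculation (assuming particular sign conventions for R_x and H; with the conventions below the cos²(ωt/2) signal appears on the |1⟩ outcome, and relabeling which outcome registers as 'light' recovers the expression above):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)

def rx(theta):
    # R_x(theta) = cos(theta/2) I - i sin(theta/2) X
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X

def ramsey_prob(omega, t):
    """Probability of the 'bright' outcome |1> after an ideal Ramsey sequence:
    prepare |0>, Rx(pi/2), free evolution under H = (omega/2) Z for time t,
    Rx(pi/2), then measure in the computational basis."""
    psi = np.array([1.0, 0.0], dtype=complex)
    U_free = np.diag(np.exp(-1j * omega * t / 2 * np.array([1.0, -1.0])))
    psi = rx(np.pi / 2) @ U_free @ rx(np.pi / 2) @ psi
    return abs(psi[1]) ** 2

# The simulated circuit reproduces the cos^2(omega t / 2) fringe exactly.
omega = 0.3467
for t in [0.0, 2.0, 7.5]:
    assert np.isclose(ramsey_prob(omega, t), np.cos(omega * t / 2) ** 2)
```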
Training experiments: (R_x, (δt)^n, R_x), n = 2, …, 49
Testing experiments: (R_x, (δt)^n, R_x), n = 50, …, 100

However, in a given implementation, it is likely that the R_x(π/2) pulse is not perfect and the resultant state will be slightly over- or under-rotated, yielding an incorrect estimate of ω. Our goal is to learn not only ω, but also the precise rotation angle, so that we can compensate for this discrepancy by adjusting the duration of the pulse.
First, we translate Ramsey interferometry into the language of the operational framework (a box with buttons). In this case, there are four buttons. The first two are for SPAM: the button b_ρ prepares the |0⟩ state, and b_E performs a measurement in the computational basis. The third button b_Rx performs R_x(π/2), and the final button b_δt waits for a discrete time δt (free evolution). By applying b_δt a total of n times, we can wait time t = n · δt.
Next, we choose a prior distribution from which to sample to begin Bayesian inference. For convenience, these priors are summarized in Table I. As explained in Section III C, our prior is defined initially in a gauge-dependent way, and is then used to induce a prior distribution on the gauge-independent operational representation. The initial R_x(π/2) is sampled from a distribution that encompasses over- and under-rotation: we choose rotations of the form R_x(π/2 + δθ), where δθ is a deviation sampled from a normal distribution with mean 0 and a small variance σ² = 10⁻³. As δt is meant to indicate evolution around the z axis, we sample from R_z(θ) with θ chosen uniformly at random between 0 and 1.
For both the state preparation and measurement priors, we apply depolarization to the ideal state |0⟩. When acting on a density matrix ρ, depolarization of strength p, 0 ≤ p ≤ 1, sends ρ → (1 − p)ρ + p𝟙/2. The associated Bloch vector then shrinks according to r → (1 − p)r [41]. Here all super-operators are expressed in the Pauli basis, where applying depolarization to a super-operator G sends G → D_p G, with D_p = diag(1, 1 − p, 1 − p, 1 − p). For the priors, we assume that depolarization occurs with strength p chosen uniformly at random between 0 and 0.1, denoted U(0, 0.1).

The last step is to choose a set of fiducial sequences. We choose four short sequences of the R_x and δt buttons. These sequences are read from left to right; the first is the empty sequence, and application of SPAM is implicit in all sequences. This choice is not unique, and we picked sequences that performed well in practice. Using these fiducials to construct an operational representation results in 27 parameters, reduced (due to duplication) from the 52 expected from counting the full Ẽ, F̃, and G̃^(k).
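A sketch of sampling this prior (numpy; the Pauli-basis conventions and function names are our own, and only the R_x and depolarization pieces are shown):

```python
import numpy as np

rng = np.random.default_rng(42)

def rx_ptm(theta):
    """Pauli transfer matrix of Rx(theta) (basis order I, X, Y, Z):
    X is fixed, Y -> cos(theta) Y + sin(theta) Z, Z -> -sin(theta) Y + cos(theta) Z."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, c, -s],
                     [0, 0, s,  c]])

def depolarize(p):
    """PTM of depolarization with strength p: the Bloch vector shrinks by (1 - p)."""
    return np.diag([1.0, 1 - p, 1 - p, 1 - p])

def sample_prior_rx():
    # Over/under-rotation: Rx(pi/2 + dtheta), dtheta ~ N(0, sigma^2 = 1e-3)
    dtheta = rng.normal(0.0, np.sqrt(1e-3))
    return rx_ptm(np.pi / 2 + dtheta)

def sample_prior_state():
    # Depolarized |0>: Pauli vector (1, 0, 0, 1 - p)/sqrt(2), p ~ U(0, 0.1)
    p = rng.uniform(0.0, 0.1)
    return depolarize(p) @ (np.array([1.0, 0, 0, 1.0]) / np.sqrt(2))

G = sample_prior_rx()
rho = sample_prior_state()
```

Each draw of (G, rho, …) is one hypothesis; pushing many such draws through the fiducial construction induces the prior on the gauge-independent operational representation.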
We initialized a SMC cloud with 10000 particles, and performed Bayesian inference over these parameters by feeding in simulated experimental data for sequences of the form (b_Rx, (b_δt)^n, b_Rx) for n = 2, …, 49. The 'true' values of the parameters that generated the data were randomly sampled from the prior distribution, with specific parameters given in Table I. See the supplementary materials for the implementation. We see an increased spread at higher sequence lengths, which is highlighted later in Figure 6.
In Figure 4 we plot the likelihood as calculated over the posterior distribution, and compare to the true likelihood (in this case, computed from the set of model parameters that was used to produce the experimental data). We see that our inference has learned this operational representation, and produces comparable likelihoods even out to sequences that are double the length of those we trained with. Using Figure 4, it is possible to fit a curve of the form cos²(ωt/2) and extract an estimate for the value of ω. We obtain ω̂ = 0.345905, a roughly 0.6% difference from the true value ω = 0.346754 as noted in Table I.
Instead, we can judge the quality of the reconstruction by plotting prediction loss, as shown in the right panel of Figure 4. The amount of quadratic loss is small in the absolute sense, and clearly worsens with experiment length, with the peaks increasing quadratically. To build upon Section IV, we include in the supplementary materials a similar plot using the KL divergence.
We can also visually examine the loss by plotting trajectories of different particles sampled from the posterior. Shown in Figure 5 are the trajectories of 50 such particles, with likelihoods computed out to sequences of up to n = 300 presses of b_δt. As one might expect, the trajectories begin to 'spread' significantly past the n = 100 point. The spread can also be quantified and visualized in the manner of Figure 6, in which we have plotted the difference between the likelihoods of all particles of the posterior and the true likelihoods at each sequence length. We can see that the mean deviation from the true likelihood increases as the spread in possible values becomes greater at longer sequence lengths.

While Ramsey interferometry is an arguably simple characterization procedure, it is perhaps the most surprising successful application of OQT we have explored. The same task would not be possible in the traditional GST formalism if one is limited to performing only Ramsey-type experiments, namely two R_x(π/2) pulses separated by some amount of time. Circuits of that form are not rich enough to generate all the sequences required by GST. While one can construct an informationally complete fiducial set using only compositions of R_x(π/2) and R_z(δt) gates, there will always be GST-required circuits that do not follow the Ramsey circuit form. (For example, GST will require at least one circuit with three applications of R_x(π/2); such a circuit is not allowed if one is only performing Ramsey circuits, all of which contain only two applications of R_x(π/2).) Even though such circuits appear in the operational representation, our prior information allows us to not perform them if we so choose; this highlights the value of being able to incorporate prior information into a characterization protocol.
Since the entire prior distribution is created computationally, we can still perform OQT even in cases where we are not able to physically perform a set of experiments that corresponds to every sequence in the operational representation.

B. Long-sequence gate set tomography
We next compare OQT to long-sequence GST, where carefully designed sequences are used to self-consistently fit both SPAM and an unknown gate set [17]. Long-sequence GST uses the linear-inversion step of LGST as a starting point, and then proceeds with maximum-likelihood estimation over experiments with progressively longer sequences of gates. Once the procedure finishes, a final gauge fixing is often used to compare the resulting super-operators to the expected ones.
In [17], long-sequence GST was performed on experimental data from a trapped-ion qubit on which three operations could be performed: G_i, G_x = R_x(π/2), and G_y = R_y(π/2). Thus, including SPAM, our box has five buttons. The linear-inversion step in [17] was originally performed using 6 fiducials; however, choosing the same 6 fiducials here results in a 6 × 6 F̃ that has rank 4. The reason to include the extra fiducials in LGST is to increase stability, since LGST then represents an overdetermined system of linear equations. In OQT, we can still include these extra experiments in our analysis, but since the fiducials are used directly to define our model parameters, we need to pick a subset of size 4. We perform OQT using the set of experiments included in the supplementary material of [17]. These experiments have the form (f_i, (g_k)^L, f_j), where the g_k are 'germ' sequences specified in Table II. The particular germs were chosen in [17] because they are amplificationally complete. From these experiments, we exclude those of the form ((b_Gx)^n), ((b_Gy)^n), and ((b_Gi)^n) for n = 1, 2, 4, …, 8192 from our training data set; these are kept as a testing set.
The choice of prior plays a particularly important role here, due to the inherently noisy nature of a physical system.

Table II: OQT parameters for long-sequence GST on trapped-ion data. Button labels are abbreviated b_Gx → G_x for simplicity here when specifying button sequences. All priors involve adding random noise to the original super-operators, using the Ginibre distribution over qubit density matrices for ρ and E (denoted here by GinibDM(2)), and the BCSZ distribution for the super-operators (denoted by BCSZ(2)). The set of training experiments is the same as in Blume-Kohout et al. [17], except that, as noted below, we have removed a subset of these for testing.

Button label | Prior | Example values
We choose a very general prior, based on convex combinations of the ideal super-operators with ones chosen uniformly at random. For both ρ and E, we take a combination of the form (1 − ε)·(ideal) + ε·(random), where the random state is drawn from GinibDM(2), the Ginibre distribution, i.e., the uniform distribution over single-qubit density matrices. Such states are sampled by computing ρ = XX†/Tr(XX†), where X is a 2 × 2 matrix with complex Gaussian entries. We take a similar approach for G_i, G_x, and G_y by adding Ginibre noise to the ideal super-operators: G → (1 − ε)G_ideal + εΛ. Here, Λ is a super-operator chosen from the uniform distribution over CPTP super-operators, known as the BCSZ distribution [45], denoted by BCSZ(2).

The choice of ε was informed by a combination of the experimental data and a grid search. Observing Figure 1c in [17], we note that likelihoods in the (ideally) definite-outcome testing experiments start to decay significantly at around 10⁴ gates, hence we intuit that ε should be around 10⁻⁴. This was later confirmed using a grid search. We ran OQT using a cloud of 10,000 particles for 192 different combinations of ε's. We set the same ε for G_i, G_x, G_y, drawn from {10⁻ᵐ, 2·10⁻ᵐ, 4·10⁻ᵐ, 8·10⁻ᵐ} for m = 5, 4, 3. For SPAM, we also choose the same ε for ρ and E, and explore the range {10⁻ᵐ, 2·10⁻ᵐ, 4·10⁻ᵐ, 8·10⁻ᵐ} for m = 5, 4, 3, 2.
The quality of each pairing of ε's was determined by (a) whether or not the SMC updater succeeded without all particle weights going to zero, and (b) the sum of the total variation distance over the testing experiments. For a given sequence s, let p(s, R) and p(s, E) be the outcome probabilities for a given reconstruction R and the experimental data E. The total variation distance (TVD) between the two (binary-outcome) probability distributions is TVD(s) = |p(s, R) − p(s, E)|.

[Figure 7: (Left) Likelihoods from OQT run with the parameters of Table II, compared to ideal likelihoods, the pyGSTi reconstruction, and experimental data for the trapped-ion data of [17]. The testing experiments consist of exponentially longer sequences of repeated button presses, G_x^k and G_i^k. (Right) Total variation distance of the pyGSTi and OQT reconstructions for the same gate sequences. The OQT results vary from those of pyGSTi for G_x and G_y, but are comparable for G_i. The total TVD for OQT across all testing experiments is lower, at 0.724, while that of pyGSTi is 0.961.]

Bayesian inference ran to completion for gate ε in the range 4·10⁻⁵ up to 2·10⁻⁴. For these values, inference was successful over essentially the full range of SPAM ε values. However, the sum of total variation distances
was notably lower for gate ε = 10⁻⁴ and SPAM ε between 10⁻⁵ and 10⁻⁴, reaching a minimum at SPAM ε = 10⁻⁴ during one full sweep of the grid search.
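For two-outcome experiments the per-sequence TVD reduces to |p − q|; summing over the testing set gives the grid-search figure of merit. A sketch with placeholder numbers:

```python
def total_variation_distance(p, q):
    """TVD between two Bernoulli distributions:
    0.5 * (|p - q| + |(1 - p) - (1 - q)|) = |p - q|."""
    return abs(p - q)

def summed_tvd(pred_probs, data_freqs):
    """Sum of per-sequence TVDs between a reconstruction's predictions and the
    observed frequencies (a sketch; the sequence set and values are placeholders)."""
    return sum(total_variation_distance(p, q)
               for p, q in zip(pred_probs, data_freqs))

print(summed_tvd([0.9, 0.5, 0.1], [0.88, 0.55, 0.1]))
```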
Results for OQT run with the parameters of Table II are plotted in Figure 7. The left column of plots compares the likelihoods predicted for the test sequences from the OQT posterior distribution to the likelihoods of the 'perfect' gate, the experimental counts, and the gate set reconstructed in [17] using the pyGSTi software package. We see that OQT produces results that are competitive with its contemporaries without the need to perform MLE. This is quantified in the right column that plots the variation distance for the same set of experiments. The total TVD for the OQT posterior is 0.724, and that of the pyGSTi reconstruction is 0.961.

C. Randomized benchmarking
Like GST, OQT equips us to make predictions about the outcome of any future experimental sequences. Hence, as has been done before using GST [17], we can use OQT to perform randomized benchmarking (RB). To do so, we perform OQT to learn the generators of the Clifford group. Then, using samples taken from the obtained posterior, we will apply RB type sequences and compute the survival probability.

Background for randomized benchmarking
RB makes use of random elements of the Clifford group, which for one qubit can be generated by two elements, C = ⟨H, S⟩, where H = (1/√2)[[1, 1], [1, −1]] and S = diag(1, i). Up to a phase, C contains 24 elements. A traditional RB experiment seeks to characterize the errors present in our Clifford gates. We begin by preparing a known state ρ_ψ, and then apply a randomly chosen sequence of Clifford elements. This is followed by applying the element that is the inverse of the group element formed by the sequence (not just performing the sequence backwards). We then measure our system using the measurement operator E_ψ corresponding to our initial state.
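The 24-element count is easy to verify by closing ⟨H, S⟩ under multiplication modulo global phase (an illustrative sketch, not part of the protocol itself):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.diag([1.0, 1j])

def key(U):
    # Canonical form modulo global phase: rotate so the first nonzero entry
    # is real and positive, then round the entries for hashing.
    flat = U.flatten()
    pivot = flat[np.flatnonzero(np.abs(flat) > 1e-9)[0]]
    V = U * (abs(pivot) / pivot)
    return tuple(np.round(V, 6).flatten())

# Breadth-first closure of <H, S> modulo phase.
group = {key(np.eye(2)): np.eye(2, dtype=complex)}
frontier = list(group.values())
while frontier:
    new = []
    for U in frontier:
        for g in (H, S):
            W = g @ U
            k = key(W)
            if k not in group:
                group[k] = W
                new.append(W)
    frontier = new

print(len(group))  # 24 single-qubit Clifford elements (up to phase)
```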
If there were no errors in the Clifford gates, the action of the sequence and its inverse would cancel, leaving the state exactly as we found it. When there are errors, however, we can compute what is termed the survival probability of the original state. As the sequences increase in length, errors accumulate and the survival probability decays. Typically, one plots a "decay curve" of the form

P(m) = (A − B)p^m + B,    (27)

where m is the sequence length, and P(m) is the mean survival probability over all sequences of length m.
That is, P(m) is defined as the expectation of the survival probability over random sequences of length m. We note that since probabilities are not directly observable, and can only be estimated, caution must be taken when estimating P(m) or interpreting estimates obtained in an ad hoc fashion. Keeping this caution in mind, the form (27) for the expectation value of the survival probability over sequences of length m was derived analytically by Magesan et al. [46], where it was noted that the parameter p contains information about the average fidelity of our Clifford elements. In particular,

F_ave(Λ) = p + (1 − p)/d,

where d is the dimension of the Hilbert space under consideration (d = 2 for a single-qubit RB experiment), F_ave(Λ) is the average gate fidelity of the channel Λ, and Λ is the average error in implementing each member of the Clifford group. In particular, Λ takes on the gauge-dependent form

Λ = (1/|C|) ∑_{U∈C} (U† •) ∘ Λ_U,

where (U† •) is the ideal action of U†, and Λ_U is a super-operator representing the actual implementation of U.
Despite the large literature on RB [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60], both the experimental implementation and statistical interpretation are challenging. Since RB is frequently used to assess suitability for quantum error correction applications, this is troubling. Since the technique that we describe here indirectly performs RB through ex post facto simulation, we are less vulnerable to some of these challenges. More details on this can be found in Appendix C.

Performing RB using OQT
To perform RB using OQT, we need a box with 4 buttons: b_ρψ, b_Eψ, b_H, and b_S. As RB is robust to SPAM errors, we assume for simplicity that SPAM is perfect; that is, we focus on learning H and S. Our first step is to choose an appropriate prior over H and S: we pick one that represents our belief that errors in each generator are due to both systematic over-rotations and Ginibre noise. To apply over-rotation to the Hadamard, we begin with its super-operator representation G_H = H ⊗ H. Mathematically, this can also be written as H ⊗ H = e^{i(π/2)(H⊗𝟙 − 𝟙⊗H)}, which we recognize as evolution for time π/2 under the generator L = i(H⊗𝟙 − 𝟙⊗H) (recall that H is real and symmetric). We perturb the evolution time slightly to write

G_H(δθ) = e^{i(π/2+δθ)(H⊗𝟙 − 𝟙⊗H)} = cos²(δθ) H ⊗ H + sin²(δθ) 𝟙 ⊗ 𝟙 − (i/2) sin(2δθ)(H⊗𝟙 − 𝟙⊗H).    (31)

An S gate is simply an R_z(π/2) gate, so in line with the previous examples, we choose a distribution R_z(π/2 + δθ), where δθ is normally distributed with mean 0 and variance σ²_θ. We then add Ginibre noise to both H and S by sampling random super-operators Λ from the BCSZ distribution, such that the sampled super-operators have the form

G_H = (1 − ε)G_H(δθ_H) + εΛ_H,  G_S = (1 − ε)G_S(δθ_S) + εΛ_S.

For the presented example, we chose ε = 10⁻³, and δθ_H, δθ_S ∼ N(0, σ² = 0.0015). With respect to the choice of σ², it can be shown that a channel over-rotated by δθ has fidelity F = 2/3 + (1/3)cos(2δθ). Fidelities on the order of 0.999 are typical of qubits today, and so, taking δθ to correspond roughly to the standard deviation, we choose σ² = 0.0015. As we are assuming the addition of Ginibre noise, the actual fidelity of our operations will be slightly lower than this.
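The fidelity formula used to set σ² can be checked numerically with the standard average-gate-fidelity expression for a unitary error, F = (|Tr U|² + d)/(d(d + 1)); here we take the error unitary to be e^{−iδθ·X}, the convention under which F = 2/3 + (1/3)cos(2δθ) holds:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)

def avg_gate_fidelity_unitary(U, d=2):
    """Average gate fidelity of a unitary error channel relative to the
    identity: F = (|Tr U|^2 + d) / (d (d + 1))."""
    return (abs(np.trace(U)) ** 2 + d) / (d * (d + 1))

def overrotation_error(dtheta):
    # Error unitary e^{-i dtheta X}; under this convention the fidelity
    # reproduces F = 2/3 + (1/3) cos(2 dtheta).
    return np.cos(dtheta) * np.eye(2) - 1j * np.sin(dtheta) * X

for dtheta in [0.0, 0.02, 0.1]:
    F = avg_gate_fidelity_unitary(overrotation_error(dtheta))
    assert np.isclose(F, 2 / 3 + np.cos(2 * dtheta) / 3)

# Sanity check of the prior scale: dtheta ~ sqrt(0.0015) gives F ~ 0.999.
print(avg_gate_fidelity_unitary(overrotation_error(np.sqrt(0.0015))))
```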
To generate data, we chose a 'true' version of G_H and G_S by sampling from the prior, with the super-operators Λ_H, Λ_S expressed in the Pauli basis.
With this prior distribution, we initialized a cloud of 10000 particles. Bayesian inference was performed to learn G_H and G_S by training with 100 RB sequences of lengths 40 to 60, using an equal number of sequences at each length. We then tested the model using 87 sequence lengths logarithmically spaced in the range 10 to 252, with 100 random sequences at each length. For each particle in the posterior distribution, we compute the survival probability for each sequence (the same set of sequences is used for each particle). For each particle we then fit a curve of the form P(m) = (A − B)p^m + B to obtain the traditional set of RB fit parameters. The parameters A, B are constrained to lie between 0 and 1, and p between −0.5 and 1. The fit is a least-squares fit weighted by variance, since at every sequence length the survival probability is averaged over 100 different sequences.
The mean survival probability is shown in Figure 8. At each length, the mean is computed first for each particle over the set of 100 sequences, and then a weighted average over these means is taken using the particle weights in the posterior distribution. Since each particle yields a set of (A, B, p), we can also compute the weighted mean of these parameters, shown as the solid blue line in Figure 8. The mean fit parameters are (A, B, p) = (0.999916, 0.481494, 0.991119). The mean value of p can be used to compute an average gate set fidelity of (1 + p)/2 = 0.995560. Using the same testing experiments, we can compute the RB decay rate for the 'true' gate set that generated the data; we obtain a 'true' value of 0.995337. We can also plot the 95% credible interval over all RB parameters using the Bonferroni correction [61, 62], shown in Figure 8 as dotted lines. For p we obtain the interval [0.990608, 0.992230], corresponding to fidelities of [0.995304, 0.996115], which neatly contains the 'true' value of 0.995337. Details and additional plots are available in the supplementary materials.

VI. QUANTUM MECHANICS IN AN OPERATIONAL REPRESENTATION
While the operational representation above is strongly motivated by issues in tomography, in principle all discrete quantum processes can be represented within it through an appropriate choice of button definitions. Note that unlike most representations, there is no explicit quantum state |ρ⟩⟩; instead, the information about the quantum state is encoded in the initial gates, fiducials, and measurements that define the gate set. This is also relevant because it provides a clearer delineation of what the operational representation models. Alone, the representation does an excellent job of describing the operation of a quantum device that can prepare, manipulate, and measure a quantum state. However, if the state |ρ(t)⟩⟩ has its own internal dynamics, then the operational representation alone does not have enough information to predict the outcomes of experiments as a function of time. Here we show that we can solve this issue and describe the behavior of our quantum device as a function of time, all while avoiding the explicit use of gauge-dependent quantities such as |ρ(t)⟩⟩.
In particular, to allow arbitrary quantum dynamics it is convenient to think of our operational representation as an explicit function of time. We assume here for simplicity that the gates, fiducials, and measurements are all given by time-independent sequences. As an example, consider the case where ∂_t|ρ(t)⟩⟩ = L|ρ(t)⟩⟩, where L is the Lindbladian super-operator and |ρ(t)⟩⟩ is the state of the system at time t.
The equation of motion for the operational representation of the gate set is then ∂_t Ẽ_i(t) = ⟨⟨E|F_i L|ρ(t)⟩⟩, and similarly for the other tensors. A challenge with this representation is that its evaluation relies on objects that we do not know a priori and that are not (directly) related to observed quantities, since the expectation values of L are not assumed to be known in the operational representation. In part, this has to do with the way we have chosen to represent L. In the following, let us assume that there exist coefficients α_ℓ such that

L = ∑_ℓ α_ℓ F_ℓ.    (38)

These values of α can further be learned empirically using the operational representation. Suppose we measure the time derivative by choosing δ ≪ 1 and taking ∂_t Ẽ(t) ≈ (Ẽ(t + δ) − Ẽ(t))/δ. If we take the resultant vector to be ∂_t Ẽ(t), take F̃(t) to be the matrix representation of the F̃_ij(t) tensor, and take α to be the unknown vector of coefficients for the Lindbladian in (38), then, if F̃(t) is an invertible matrix,

α = F̃(t)⁻¹ ∂_t Ẽ(t).

Thus such a representation can be learned whenever F̃(t) is invertible; if not, a least-squares approximation can be found by applying the Moore-Penrose pseudoinverse in its place. Of course, this merely proves the existence of a solution (or a least-squares solution) for the coefficients of the Lindbladian in terms of the fiducials. In practice, Bayesian methods such as the ones considered here and elsewhere may be of great use for both learning and quantifying the uncertainty in the model Lindbladian. Given a set of coefficients for the Lindbladian, the first-order system of equations that governs the evolution can be expressed as ∂_t Ẽ_i(t) = ∑_ℓ α_ℓ F̃_iℓ(t). As we can see, the derivatives of Ẽ_i(t) depend on the values of F̃_ij(t), but the derivatives of F̃_ij and G̃^(k)_ij depend on expectation values of cubic functions of the fiducials. Thus we can solve these equations, but doing so may require more information in some cases. Below we consider two important cases.
The first is where the set of fiducial super-operators is not closed under multiplication, and the second is where the group is closed and consists of at most quadratic polynomials in the fiducials.
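The coefficient recovery α = F̃⁻¹ ∂_t Ẽ (or its pseudoinverse variant) can be sanity-checked on synthetic data; everything below (the fiducial set, the toy α, the Pauli-basis conventions) is illustrative, not the paper's implementation:

```python
import numpy as np

# Pauli transfer matrices for a hypothetical, invertible fiducial set.
def rx_ptm(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                     [0, 0, c, -s], [0, 0, s, c]], float)

def ry_ptm(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0, 0], [0, c, 0, s],
                     [0, 0, 1, 0], [0, -s, 0, c]], float)

fiducials = [np.eye(4), rx_ptm(np.pi / 2), ry_ptm(np.pi / 2), rx_ptm(np.pi)]
E = np.array([1.0, 0, 0, 1.0]) / np.sqrt(2)    # measure |0>
rho = np.array([1.0, 0, 0, 1.0]) / np.sqrt(2)  # prepare |0>

# Toy generator L = sum_l alpha_l F_l, as in Eq. (38).
alpha_true = np.array([0.1, -0.3, 0.25, 0.05])
L = sum(a * Fl for a, Fl in zip(alpha_true, fiducials))

# Observable quantities: F~_ij = <<E|F_i F_j|rho>>, (dE~/dt)_i = <<E|F_i L|rho>>
F = np.array([[E @ Fi @ Fj @ rho for Fj in fiducials] for Fi in fiducials])
Edot = np.array([E @ Fi @ L @ rho for Fi in fiducials])

# Least-squares recovery of the Lindbladian coefficients from observables only.
alpha_hat = np.linalg.pinv(F) @ Edot
print(np.allclose(alpha_hat, alpha_true))
```

When F̃ is singular the same pseudoinverse call returns the least-squares solution described in the text.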

Dynamics for infinite sets of Fiducials
As a first example of how the dynamics of the operational representation works, consider the case where the fiducial super-operators form an infinite group wherein the group product is given by super-operator multiplication. In this case, we cannot assume any structure to the fiducials that will cause products of them to contract to a finite set of super-operators.
If we have such a model then the dynamics can again be written in terms of a set of observables; however, the set that needs to be measured becomes larger in this setting. In particular, we extend the definition of the Ẽ and G̃ tensors such that Ẽ_{i_1,…,i_p}(t) = ⟨⟨E|F_{i_1} ⋯ F_{i_p}|ρ(t)⟩⟩.
Under these assumptions the dynamics of the operational representation of the gate set takes the form of a driven first order dynamical system.
Thus the entire dynamics of the gate set can be predicted if the Ẽ and G̃ tensors are known in their entirety. This is operationally equivalent to the Schrödinger equation, while eschewing the need for unobservable quantities such as the quantum state. While solving the resultant dynamical equations formally requires knowing an infinite hierarchy of terms to predict future dynamics perfectly, in many cases the super-operators for the fiducials will form a finite group, making knowledge of the complete hierarchy of tensors unnecessary. Finally, in practice the entire hierarchy is not needed in order to accurately estimate the dynamics for all subsequent times from data at a single time, given the decomposition of the Lindbladian into a sum of fiducials. From Taylor's theorem and Stirling's approximation, the error from truncating the short-time Taylor series at order K is at most

Δ^{K+1}/(K+1)! ≤ (eΔ/(K+1))^{K+1},

where Δ = δ Σ_ℓ |α_ℓ| for time step δ. Thus, by solving this equation for the value of K that yields error ε, we find that a sufficient value of K is K ∈ O(log(1/ε)/log log(1/ε)) if Δ ≤ Σ_ℓ |α_ℓ|. Thus the total number of terms needed to simulate the dynamics for a short time step with error at most ε varies logarithmically with the error tolerance. Each such term can be approximated using Monte-Carlo sampling such that the error in the estimate of the derivatives is at most ε using a number of samples that scales as O(poly(1/ε)), and therefore even in the case where the algebra does not close, the dynamics can be simulated using a small number of observables. It should be noted that in the event that the fiducials form a closed group, this scaling improves exponentially and Monte-Carlo sampling is no longer required to estimate the derivatives. This shows that, under reasonable assumptions, the operational representation can also be used to describe the dynamics of a quantum system that we can probe experimentally using a set of fiducial operations and gates.
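The logarithmic scaling of the truncation order can be checked numerically. The helper below (our own illustrative code) finds the smallest K for which the Taylor remainder bound Δ^{K+1}/(K+1)! drops below a tolerance ε:

```python
import math

def truncation_order(delta: float, eps: float) -> int:
    """Smallest K such that delta**(K+1) / (K+1)! <= eps."""
    k, term = 0, delta            # term tracks delta**(k+1) / (k+1)!
    while term > eps:
        k += 1
        term *= delta / (k + 1)   # advance the remainder bound by one order
    return k

# The required order grows slowly as the tolerance eps shrinks:
orders = [truncation_order(0.5, eps) for eps in (1e-3, 1e-6, 1e-12)]
assert orders == sorted(orders)   # tighter tolerance never needs fewer terms
```

Here Δ = 0.5 is an arbitrary illustrative value; the loop simply inverts the remainder bound quoted above.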
Hence, while inspired by problems of characterization in quantum systems, much broader classes of quantum dynamical problems can also be discussed using our formalism while only making reference to observable quantities.

Dynamics for closed sets of fiducials
Next let us consider a simpler case, where the set of fiducial super-operators is closed under multiplication. Specifically, let S = {F_i, F_i F_j} be the set of all monomials and binomials in the fiducials. Because the set is closed under multiplication, there exists a function g such that for any s_i and s_j in S there exists s_{g(i,j)} with s_i s_j = s_{g(i,j)}. Also, for simplicity, assume that the set is laid out in lexicographic ordering such that s_1 = F_1, s_2 = F_2, . . .. It then follows, using the fact that the set is closed, that the equations of motion for the operational representation greatly simplify to a finite system of equations, including

Ġ^(k)_{i j_1 j_2}(t) = Σ_ℓ α_ℓ ⟨E|F_i G_k s_{g(j_1, g(j_2, ℓ))}|ρ(t)⟩. (46)

These equations can, in many cases, be solved directly without having to truncate (as was done in the infinite case considered above). Also, because of the lack of a curse of dimensionality, the resulting equations can be solved within error ε using O(polylog(1/ε)) operations via existing differential-equation solvers. For this reason, cases where the fiducial operators form a closed (or approximately closed) set under multiplication can greatly simplify the equations of motion. However, it should be noted that such cases are highly restrictive and, for example, preclude the inclusion of depolarizing noise or similar effects, because such noise models will typically lead to a fiducial set that is not closed under multiplication. For such situations, truncating the infinite dynamics at finite order may be preferable.
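To make the group-product function g concrete, here is an illustrative construction (our own example, not drawn from the experiments in this paper) using the four Pauli conjugation channels ρ → PρP†, whose Pauli-transfer matrices form a set that is exactly closed under multiplication:

```python
import numpy as np

# The four single-qubit Paulis.
paulis = [np.eye(2),
          np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]),
          np.array([[1, 0], [0, -1]])]

def ptm(p):
    """Pauli-transfer matrix of the conjugation channel rho -> p rho p^dagger."""
    return np.real(np.array([[0.5 * np.trace(a @ p @ b @ p.conj().T)
                              for b in paulis] for a in paulis]))

S = [ptm(p) for p in paulis]   # a multiplicatively closed set of superoperators

# Build the product function g with s_i s_j = s_{g(i, j)} by matrix matching.
g = {(i, j): next(k for k, s in enumerate(S) if np.allclose(s_i @ s_j, s))
     for i, s_i in enumerate(S) for j, s_j in enumerate(S)}

# Every product lands back inside S, so the equations of motion stay finite.
assert all(np.allclose(S[i] @ S[j], S[g[(i, j)]]) for i in range(4) for j in range(4))
```

For a closed set like this one, the table g is all that the simplified system (46) needs, and no truncation is required.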

VII. CONCLUSIONS
We have demonstrated a framework for quantum tomography in which we can represent many other characterization tasks. Working with a gauge-independent representation of the system, we can learn its behavior from experimental data and predict the outcomes of future experiments. OQT gives us the freedom to incorporate prior information computationally (without any physical experiments). Future improvements to OQT involve the extension to two-qubit operations, as well as allowing for buttons to be held down for arbitrary duration (i.e. time-dependent operations).

A. ESTIMATION THEORY

Let X be the set of values that we wish to estimate, and let D be the set of data we could have possibly obtained. Then, we will say that any function x̂(·) : D → X which accepts data and returns estimates is an estimator.
For example, given any x_0 ∈ X, the constant function x̂(D) = x_0 is an estimator that disregards all evidence in favor of returning x_0. Clearly, while this is a valid estimator, it is not a very good one to use in practice. Our task in estimation theory is then to recommend a particular estimator that is desirable according to some set of practical considerations. We may, for example, want an estimator that incurs as little error as possible.
We can formalize this desire by introducing a function L : (X × X) → R_+ such that L(x̂, x) is the loss that we incur if we return x̂ as our estimate when the true value is x. For example, if we are estimating a single real number (X = R), then we may choose the squared error L(x̂, x) = (x̂ − x)^2 as our loss. More generally, for X = R^d with d ∈ N, the quadratic loss L_Q(x̂, x) = (x̂ − x)^T Q (x̂ − x) is a well-defined loss function for any positive-definite matrix Q.
Once we have decided upon a loss function, we can then reason about what losses we may incur in a given experiment using a particular estimator. To do so, we first need to extend our definition of loss from estimates to estimators by taking the average over all possible data sets that an estimator could take as input. Concretely, given a loss function L, define the risk R : (D → X) × X → R_+ of an estimator as

R(x̂(·), x) = E_{D|x}[L(x̂(D), x)].

The risk implicitly defines a multi-objective optimization problem, in that an estimator that works well for a particular ground truth need not work well more generally. At an extreme, the constant estimator x̂(D) = x_0 works beautifully well when x = x_0. We thus at a minimum want an estimator that minimizes the risk that we incur in some case of interest. To formalize this notion, we say that an estimator x̂(·) is dominated by an estimator x̂'(·) if, for all x, R(x̂, x) ≥ R(x̂', x), and if there exists some x for which this inequality is strict. Put differently, an estimator dominates another estimator if it is less risky in all circumstances, such that there is no decision-theoretic basis for preferring the dominated estimator. An estimator which is not dominated by any other estimator is said to be admissible.
From a Bayesian perspective, however, we are generally most interested in minimizing what we expect the risk to be given our experience with a system so far. We can make this precise by taking the expectation value of the risk with respect to a prior distribution π to obtain the Bayes risk of an estimator,

r(x̂(·)) = E_x[R(x̂(·), x)] = ∫ π(x) R(x̂(·), x) dx.

The unique estimator minimizing the Bayes risk for a particular loss function is called the Bayes estimator for that loss, x̂_Bayes := arg min_{x̂(·)} r(x̂(·)).
By construction, the Bayes estimator is admissible: any estimator that dominates the Bayes estimator would have a strictly smaller Bayes risk. Under fairly weak conditions [63], however, we can conclude the converse as well, namely that every admissible estimator is the Bayes estimator for a particular prior distribution. In full generality, computing the Bayes estimator for a particular loss function requires minimizing over functions of all data sets, which is not feasible in practice. Some loss functions, however, allow Bayes estimators to be computed much more efficiently. In particular, Bregman divergences are loss functions which can be written as the difference between a convex function and its first-order Taylor expansion. If a loss function is Bregman, then the celebrated theorem of [42] shows that x̂_Bayes(D) = E[x | D]. That is, the posterior mean of x is the Bayes estimator for any Bregman divergence. Many practically relevant loss functions are Bregman divergences, including the squared error, quadratic loss, and Kullback-Leibler divergence. Thus, the posterior mean gives us a method of efficiently computing admissible estimators that minimize the average error we incur in inference procedures. As we saw in Section III B, the posterior mean can be efficiently computed using particle filtering, giving us a practical method for reporting Bayes estimates.
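As a small illustration of this fact (a toy example of our own, not taken from the experiments in this paper), the weighted-particle posterior mean below attains a smaller expected squared error over the posterior than nearby perturbed estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Particle approximation of a (hypothetical) posterior distribution.
particles = rng.normal(loc=0.3, scale=0.1, size=5000)
weights = np.full(particles.shape, 1.0 / particles.size)   # uniform after resampling

# Posterior mean = Bayes estimate for the squared error (a Bregman divergence).
x_bayes = np.sum(weights * particles)

def expected_loss(x_hat):
    """Posterior-expected squared error of the estimate x_hat."""
    return np.sum(weights * (x_hat - particles) ** 2)

# No perturbed estimate does better than the posterior mean.
assert all(expected_loss(x_bayes) <= expected_loss(x_bayes + d)
           for d in (-0.05, -0.01, 0.01, 0.05))
```

The same weighted sum is exactly what a particle filter reports, which is why the posterior mean is so convenient as a practical Bayes estimate.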

B. QUANTUM STATE TOMOGRAPHY
In traditional quantum state tomography, we seek to learn an unknown state using a set of measurements that are presumed to be perfectly known. Naively performing this task, however, leads to estimates that are not self-consistent. We provide a demonstration of this by performing OQT on unknown rebits, i.e. qubits with no y-components.

Table IV: OQT parameter specification for rebit state tomography. Button labels are abbreviated b_{R_x} → R_x for notational simplicity. State tomography was performed independently for 500 states (and associated gate sets) sampled from the distributions below. As such, we do not provide examples of the sampled parameters in this case.
We sample our states from the Ginibre rebit distribution, the uniform distribution over rebit states. Such states are sampled by computing

ρ = X X^T / Tr(X X^T),

where in our case X is a real 2 × 2 matrix [9]. The rebit states are subject to a small amount of depolarization with strength p ∼ U(0, 0.1). We apply similar depolarization to the measurement E = |0⟩⟨0|. Full details of our parameter specifications are shown in Table IV. The set of chosen fiducial sequences is f = {(·), (R_x), (R_y), (R_x, R_x)}. If our buttons were perfect, this set of fiducials would provide a set of measurements that is informationally complete in the traditional sense, meaning that the measurements span the entire Bloch sphere. However, in practice these will be noisy; our definition of informational completeness thus shifts to whether or not the fiducials produce a well-conditioned F̃. We find that the chosen set is reliable in practice.
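The sampling procedure just described can be sketched as follows (function and parameter names are ours; the depolarization convention assumed is the standard ρ → (1 − p)ρ + p·I/2):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_noisy_rebit():
    """Ginibre rebit state followed by weak depolarization, as described above."""
    X = rng.normal(size=(2, 2))               # real 2x2 Ginibre matrix
    rho = X @ X.T / np.trace(X @ X.T)         # uniform (Ginibre) rebit state
    p = rng.uniform(0, 0.1)                   # depolarization strength ~ U(0, 0.1)
    return (1 - p) * rho + p * np.eye(2) / 2  # depolarizing channel

rho = sample_noisy_rebit()

# Sanity checks: unit trace, no y-component (real symmetric), inside the disc.
a_x = np.trace(rho @ np.array([[0.0, 1.0], [1.0, 0.0]]))
a_z = np.trace(rho @ np.diag([1.0, -1.0]))
assert np.isclose(np.trace(rho), 1.0)
assert a_x ** 2 + a_z ** 2 <= 1.0
```

Because X is real, the resulting density matrix is real and symmetric, so the sampled states genuinely have zero y-component before any noisy reconstruction is applied.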
In the 'naive' method of performing state tomography, the fiducial sequences and associated probabilities would be directly related to the coordinates on the Bloch sphere (a_x, a_y, a_z); for an ideal measurement E = |0⟩⟨0| and perfect rotations,

a_z = 2 Pr(0 | (·)) − 1,  a_y = ±(2 Pr(0 | (R_x)) − 1),  a_x = ±(2 Pr(0 | (R_y)) − 1),  (52)

with the signs fixed by the rotation conventions. In the remainder of this section, we will demonstrate the consequences of this naive method. We performed state tomography with OQT independently on 500 random states. In Figure 9, we have plotted the true Bloch coordinates of the initial states in the left panel. In the middle panel, we see the coordinates obtained from their initial operational representations according to (52). States pulled from the Ginibre ensemble should lie firmly within the boundaries of the Bloch sphere, or circle in the rebit case. However, reconstruction according to (52) produces Bloch coordinates that fall well outside the boundaries. Furthermore, they pick up a small y-component, as demonstrated in the first two panels of Figure 10.
For our OQT experiments, we push the state preparation button once, then apply a sequence of randomly selected gate buttons from a minimum length of 1 to a maximum length of 10. We then measure, record the outcome, and repeat for 50 cases to form a training corpus. The sequence length steadily increased during training, with the same number of sequences generated at each length.

Figure 9: What happens when we perform state tomography 'naively' using the measurement results from noisy gates we had assumed were perfect. (Left) The 500 initial random states, sampled from the prior. They are rebits, and have only x and z components. (Center) A 'pseudo Bloch circle' constructed by pulling coordinates from the initial operational representation, i.e. fiducial experiment probabilities, as per (52). Points are colored by their distance to the corresponding true states in the left panel. (Right) The same plot as for the middle, but calculating the Bloch coordinates using the posterior mean after performing OQT. See Figure 11 for a histogram of the colored difference before and after reconstruction.

In Figure 10, we color the states according to the y component of the pseudo Bloch vectors. In theory this should always be 0, but we observe here that our naive reconstruction method produces slight deviations both before and after reconstruction. However, we note that after reconstruction the deviation is smaller, as displayed in the right panel of Figure 11.
We note that Figure 9 illustrates the dangers of 'naive' state tomography in the presence of measurement errors. For each of the hypotheses shown in the middle and right-hand plots of Figure 9, if a naive tomographer were to take an infinite amount of data from a system described by that hypothesis and then reconstruct the initial state ρ, they would correctly conclude either that their data was impossible, or that ρ lies outside the Bloch sphere entirely. Put differently, if one assumes that the measurement sequences used in a state tomography experiment are ideal, then naive state tomography will return absurd results even in the limit of infinite data.

C. DETAILS ON RANDOMIZED BENCHMARKING
In this appendix, we discuss the advantages of performing randomized benchmarking within an operational framework.

Figure 11: Histograms of pseudo Bloch vector properties before and after performing OQT. Solid lines show the mean of the corresponding distribution. OQT learns these vectors well and produces comparable distributions, but the naive method of tomography nevertheless leads to a noticeable y component added to many of the rebits.

A. Using operational formalisms to perform randomized benchmarking
Magesan et al. [46] derived that A and B contain information about the state preparation and measurement errors incurred by a randomized benchmarking experiment. They are expressed analytically as

A = Tr[E_ψ Λ(ρ_ψ − 1/d)]  and  B = Tr[E_ψ Λ(1/d)].
A key point here is that traditional RB assumes that Λ is the same for all elements of the Clifford group. However, as we will see, Clifford elements implemented in the GST framework will naturally have different errors, as elements are composed of sequences of H and S of varying lengths. If the implementations of each Clifford element are perfect, we obtain A = 1/2, B = 1/2, and p = 1, and so the survival probability is identically 1 for all sequences. However, in the worst case, we obtain something essentially depolarized and so p = 0, meaning that the curve will immediately decay to B = 1/2. Fitting the experimental data to a curve of this form can thus give an idea of the value of p, which in turn can give us an estimate of the average gate fidelity.
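To make the roles of these parameters concrete, the following toy sketch (synthetic, noiseless numbers of our own choosing) generates a decay curve F(m) = A p^m + B, reads off p, and converts it to an average gate fidelity via the standard single-qubit relation F_avg = p + (1 − p)/d:

```python
import numpy as np

# Zeroth-order RB model F(m) = A p^m + B, with illustrative parameter values.
A, B, p = 0.5, 0.5, 0.98
m = np.arange(1, 101)          # sequence lengths
F = A * p**m + B               # noiseless mean survival probabilities

# With noiseless data, p can be read off from successive differences, since
# F(m+1) - F(m) = A p^m (p - 1), so the ratio of consecutive differences is p.
p_est = (F[2] - F[1]) / (F[1] - F[0])
assert np.isclose(p_est, p)

# Average gate fidelity for a single qubit (d = 2).
avg_fidelity = p_est + (1 - p_est) / 2
```

Real data are of course noisy, which is precisely why the fitting subtleties discussed below matter.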
Before proceeding, it is helpful to establish that, despite its apparent simplicity, learning figures of merit from randomized benchmarking data is an astonishingly subtle problem that warrants no small amount of caution. Especially given the rigorous demands placed on randomized benchmarking results for application to predicting the success of fault-tolerance, it is of the utmost importance that the results of RB experiments are understood in full recognition of the caveats placed on said results by current experimental and theoretical limitations. For instance, as mentioned above, the derivation of Magesan et al. [46] rests critically on the assumption that the noise on each element of a gate set is independent of which element is being considered. While Magesan et al. [46] does provide a derivation that attempts to include gate-dependence, later counterexamples have shown that this assumption cannot even be made in a gauge-independent fashion [58]; this implies that the gate-independence assumption cannot be experimentally tested. Later work has shown that the effects of gate-dependence on randomized benchmarking data are exponentially small [59, 64], but it is still an open question as to how to meaningfully interpret RB data.
Perhaps more pressing still, the original derivation of Magesan et al. [46] only derived the mean survival probability and not any higher moments. A fitting procedure such as homoscedastic least-squares fitting (the default procedure offered by MATLAB, SciPy, and many other packages, see Appendix C B for a review) will thus necessarily give incorrect or misleading answers, as the variance over randomized benchmarking data depends both on the variance within each sequence and over shots of that sequence, and on the variance between different sequences. This challenge can be overcome by committing to taking exactly one repetition of each sequence before choosing a new sequence [65], but this is feasible only for a small number of experimental platforms, such as those controlled by custom FPGA firmware [66]. As an alternative solution, one can introduce nuisance parameters to track the unknown higher moments and estimate them at the same time as the expectation of interest. A recent proposal of this form was advanced by Hincks et al. [37], who introduced a parameterization for RB protocols that includes a distribution at each sequence length that is then sampled using Hamiltonian Monte Carlo, effectively introducing an uncountable number of nuisance parameters in a way that allows them to be efficiently estimated.
From this perspective, using OQT to analyze randomized benchmarking data provides an explicit and gauge-independent nuisance parameterization that avoids both the interpretational and practical difficulties of drawing inferences from RB data. We can then rely on the procedure of Blume-Kohout et al. [17] to synthesize, from a final posterior over operational representations, RB data of a form that is immediately amenable to analysis by even relatively informal methods such as heteroscedastic least-squares fitting.

B. Estimation within randomized benchmarking
In this Appendix, we review the estimation theory underlying randomized benchmarking and summarize some of the most prevalent pitfalls. To do so, we will rely heavily on the Likelihood Principle [67], which informally states that in order to make decisions consistent with experimental observation, we must base our decisions only on the evaluation of a likelihood function at our data, and cannot base our inference on any property of our data that is not expressed in the likelihood. For RB in particular, this consistency requirement forces us to describe our implementations of RB in an operational manner, such that we can write down likelihood functions.
For instance, we recall that, as per (27), the Magesan et al. [46] model gives us the mean sequence probability E[F(m)] = A p^m + B. Thus, if we wish to compute likelihood functions for K shots at each sequence, we must be able to compute the Kth moment of the distribution of sequence probabilities over all sequences of a given length. This makes it clear how both the techniques of Granade et al. [65] and Hincks et al. [37] operate. The former restricts attention to the case in which K = 1, such that the needed moment is precisely that given by Magesan et al. [46], while the latter introduces additional parameters (formally, nuisance parameters) to track the higher moments of distributions over sequences.
Though both of these approaches are provided along with software implementations, there may be practical constraints that prevent adopting the K = 1 experimental restriction or introducing large numbers of nuisance parameters. In practice, therefore, convenience often demands deviating from statistical principle and exploring what can be done with ad hoc methods. For example, least-squares methods are often used in experimental papers to report results from randomized benchmarking observations [68]. In this case, such methods are ad hoc in the sense that least-squares fitting requires additional assumptions that are often left implicit.
In particular, if one is attempting to learn the argument x of a function f(x) from samples y_i = f(x_i) + ε_i where ε_i ∼ N(0, σ^2), then the least-squares solution can readily be shown to be the maximum-likelihood estimator for x. Thus, if a minimum-variance unbiased estimator exists for x, it is equal to the least-squares solution. Applying this argument to the RB case thus demands that a strong additional assumption be made, namely that n(m_i) ∼ N(P(m_i), σ^2).
By using heteroscedastic least-squares fitting, we can relax this assumption such that the variance on each n(m_i) is a function of m_i: n(m_i) ∼ N(P(m_i), σ_i^2).
In order to apply heteroscedastic least-squares fitting, we must therefore be able to assume normality, and we must have a way to compute σ 2 i for each m i . In typical experiments, we do not have direct access to such variances. That said, when synthesizing RB data from a posterior over operational representations, something remarkable happens: we can interpret the variance as the mean Bayes risk for the prediction loss over sequences. This interpretation makes it possible to directly compute σ 2 i from our posterior uncertainty, motivating the use of heteroscedastic least-squares fitting.
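As a toy numpy-only illustration of heteroscedastic least squares (the linear model and noise schedule are our own invention; an RB fit would use a nonlinear solver with the same weighting), residual weights 1/σ_i implement the variance-aware objective:

```python
import numpy as np

rng = np.random.default_rng(3)

# Data y_i = f(x_i) + eps_i with eps_i ~ N(0, sigma_i^2) and known, per-point
# variances (in the OQT setting these would come from the posterior, as above).
x = np.linspace(0, 1, 200)
sigma = 0.02 + 0.2 * x                       # noise grows with x
y = 1.0 - 0.5 * x + sigma * rng.normal(size=x.size)

# np.polyfit weights multiply residuals, so w = 1/sigma minimizes
# sum_i ((y_i - f(x_i)) / sigma_i)^2, i.e. heteroscedastic least squares.
slope, intercept = np.polyfit(x, y, deg=1, w=1 / sigma)
assert abs(intercept - 1.0) < 0.1 and abs(slope + 0.5) < 0.3
```

The low-noise points near x = 0 dominate the fit, which is exactly the behavior one wants when posterior variances differ strongly across sequence lengths.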