Semi-device-dependent blind quantum tomography

Extracting tomographic information about quantum states is a crucial task in the quest towards devising high-precision quantum devices. Current schemes typically require measurement devices for tomography that are a priori calibrated to high precision. Ironically, the accuracy of the measurement calibration is fundamentally limited by the accuracy of state preparation, establishing a vicious cycle. Here, we prove that this cycle can be broken and the dependence on the measurement device's calibration significantly relaxed. We show that exploiting the natural low-rank structure of quantum states of interest suffices to arrive at a highly scalable `blind' tomography scheme with a classically efficient post-processing algorithm. We further improve the efficiency of our scheme by making use of the sparse structure of the calibrations. This is achieved by relaxing the blind quantum tomography problem to the de-mixing of a sparse sum of low-rank matrices. We prove that the proposed algorithm recovers a low-rank quantum state and the calibration provided that the measurement model exhibits a restricted isometry property. For generic measurements, we show that it requires a close-to-optimal number of measurement settings. Complementing these conceptual and mathematical insights, we numerically demonstrate that robust blind quantum tomography is possible in a practical setting inspired by an implementation of trapped ions.


Introduction
The development of quantum technologies is arguably one of the most vibrant scientific endeavours of our time. This development faces a daunting challenge: to realize the promised advantages of these technologies, one must engineer individual quantum components with enormous precision. The main factors limiting the implementation of the many existing theoretical proposals for exciting applications in quantum computing today are the achievable noise levels and the scalability of the components.
From an engineering perspective, improving such noisy intermediate-scale quantum (NISQ) devices [1,2] requires flexible diagnostic techniques that extract actionable advice on how to improve the device in the engineering cycle. Such diagnostic schemes must meet tight practical constraints in terms of their complexity as well as the required accuracy of the devices used. One of the most basic diagnostic tasks is the extraction of tomographic information about quantum states from experimentally measured data. Indeed, at the heart of every quantum computation is the preparation of a quantum state. Quantum state tomography can therefore provide valuable information for improving quantum devices beyond a mere benchmarking of their correct functioning [3].
However, in any such endeavour one encounters the following fundamental challenge: In order to arrive at an accurate state estimate, most tomography schemes rely on measurement devices that are calibrated to a very high precision. At the same time, a precise and detailed characterization of a measurement device requires an accurate state preparation. But improving the accuracy of the state preparation using tomographic information was our goal to begin with. We are trapped in a vicious cycle. This vicious cycle, depicted in Figure 1, constitutes a fundamental obstacle to the improvement of quantum devices.
Using various assumptions and models that are motivated by the specific underlying physical platform, quantum devices are routinely calibrated in a bottom-up fashion, building trust in the individual components such as steps of ground state preparation, read-out, and individual gate pulses one at a time. However, such methods are ultimately limited by the vicious cycle. State-of-the-art quantum computing experiments in addition employ the feedback from self-consistent characterization methods that are robust to state-preparation and measurement (SPAM) errors, such as (linear) cross-entropy benchmarking [4,5], other variants of randomized benchmarking [6,7,8,9] or gate set tomography [10,11], to refine the device calibration heuristically. Notably, these approaches are closely tied to modelling a quantum computing device in terms of structured gate sets in addition to state preparation and measurement, and moreover require performing sequences of multiple gates. Such models are distinct from our abstraction in terms of an unknown state preparation and an uncalibrated device.
An important conceptual question with immediate practical relevance is therefore: Is there any hope to directly break the vicious cycle? In other words, can one perform quantum state tomography blindly, that is, without full knowledge of the measurement to begin with? More specifically, can one simultaneously infer a quantum state and learn certain unknown calibration parameters of the measurement device in a self-calibrating tomography scheme [12]? A simple parameter count indicates that this is typically impossible by just measuring a set of mutually orthogonal observables: While an arbitrary quantum state in a d-dimensional Hilbert space is characterized by d^2 - 1 real parameters, the number of linearly independent measurements in this space implies that we can learn at most d^2 independent parameters. This leaves room for a single additional (calibration) parameter only and prohibits even slightly relaxing the requirement of a complete and accurate characterization of the measurement device towards a partially uncalibrated device. Tomography of an arbitrary quantum state is therefore typically intrinsically measurement-device-dependent in this sense.
In this work, we break this vicious cycle. We observe that the above naive parameter count misses the point that in reasonably controlled quantum devices, commonly encountered quantum states exhibit a natural structure: they are close to being pure. We leverage this natural property to prove that one can simultaneously learn an unknown calibration and a low-rank quantum state. We thus arrive at what we coin a semi-device-dependent scheme in which the dependence on the measurement apparatus is significantly softened.
In order to achieve this goal, we formulate the blind tomography problem as the recovery task of a highly structured signal. This allows us to exploit and further develop a powerful formal machinery from modern signal processing to propose a scalable self-calibrating state tomography scheme that comes with mathematical guarantees. We use a general model of the measurement device which applies in a variety of relevant experiments: we model the measurements as depending linearly on the unknown parameters of the possible calibration errors. Indeed, in many situations the daunting uncertainty about the device calibration is small and can be approximated as a linear deviation from an empirically known calibration baseline.
Our scheme makes the trade-off between the dependence on the measurement device and the state preparation device explicit and allows one to optimally exploit this dependence in a practical scheme. It is an intriguing feature of our results that while structural assumptions on the quantum state to be learned typically allow for more efficient solutions [13,14,15,16,17], here, structural assumptions allow one to solve a task in settings where it could not be solved at all in their absence.
Going further, we exploit yet another structure to significantly extend the realm of applicability and efficiency of our scheme, namely the sparsity of the calibration. Physically, this structural property amounts to the assumption that only a small number out of the many possible calibration errors has occurred in the specific experiment. In our scheme we therefore simultaneously exploit the low-rank structure of the quantum state and sparsity of the calibration coefficients to overcome the vicious tomography cycle and provide rigorous guarantees with a favourable scaling in terms of both the system dimension and the number of calibration errors.

Provable blind tomography via sparse demixing
Let us be slightly more formal in order to give an overview of the methods used and the technical contributions made in this work. In mathematical terms, the blind tomography task that we solve is to infer a vector ξ of n calibration parameters and a rank-r quantum state ρ from data of the form

y = B_ξ(ρ),   (1)

where B : (ξ, ρ) → B_ξ(ρ) is a bi-linear map describing the measurement model. The measured data y might for example be estimates for the expectation values of observables or probabilities of POVM elements. For the time being, we ignore the error of the estimates induced by finite statistics. It is convenient to regard the data as associated with a structured linear estimation problem: we can equivalently model the measurement map as a linear map A acting on ξ ⊗ ρ. Such structured linear inverse problems are studied in the mathematical discipline of model-based compressed sensing [18,19], where efficient algorithms with analytical performance guarantees have been developed. A workhorse of compressed sensing, and among the fastest solvers for the relevant inverse problems, are so-called iterative hard-thresholding (IHT) algorithms [20]. In this work, we use this general algorithmic paradigm, study the novel thresholding operations that arise in our context and prove new recovery guarantees.
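To convey the IHT paradigm, here is a minimal generic sketch (our own illustration, not the algorithm of this work): each iteration takes a gradient step on the least-squares fit to the data and then projects onto the structured signal set via a problem-specific thresholding routine, here passed in as a hypothetical `project` callable.

```python
import numpy as np

def iht(A, y, project, steps=200, mu=1.0):
    """Generic iterative hard thresholding.

    A       : m x N matrix representing the (vectorized) linear measurement map
    y       : observed data vector of length m
    project : maps a length-N vector to the nearest structured signal
    """
    x = np.zeros(A.shape[1], dtype=complex)
    for _ in range(steps):
        # gradient step on ||y - A x||^2 followed by hard thresholding
        x = project(x + mu * A.conj().T @ (y - A @ x))
    return x
```

For plain s-sparse signals, `project` keeps the s largest-magnitude entries; for the problems considered in this work, it is replaced by the structured projections derived below.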
As a first result of this work, we establish that the key step of an IHT algorithm that solves the blind tomography problem is NP-hard. To overcome this obstacle, we propose an IHT algorithm that solves a slightly relaxed version of the blind tomography problem: the task of de-mixing a sum of n different low-rank quantum states ρ_i, i.e., data of the form

y = A(∑_{i=1}^n e_i ⊗ ρ_i),   (2)

where {e_i}_{i=1}^n denotes the standard orthonormal basis. An efficient IHT algorithm for the de-mixing problem of low-rank matrices was developed and analysed in Ref. [21]. This algorithm can be readily adapted to our problem.
But relaxing the blind tomography problem to the de-mixing problem artificially introduces an overhead in the number of unknown degrees of freedom of the problem, scaling as 2drn, and in particular linearly in the number of calibration parameters of the model. This leads to an unfavourable situation in a two-fold manner: First, determining many calibration parameters also requires many measurement settings, as the cost per calibration parameter scales with the dimension d of the quantum system. Second, a necessary condition for a well-posed blind de-mixing problem of rank r with a maximal number of d^2 linearly independent measurements of the form (2) is that there are more linearly independent measurements than real parameters, i.e., 2rnd ≤ d^2. This means that the simultaneous determination of a given number of calibration parameters n can in principle only work for sufficiently large system dimension d. This severely constrains the achievable self-calibration for small system sizes.
We argue that an additional well-motivated structural assumption can render blind tomography much more broadly applicable. This structure is exploited in our new sparse-demixing thresholding (SDT) algorithm. Our argument is based on the observation that determining an accurate estimate of the quantum state in the blind setting involves solving two distinct sub-problems: first, one needs to determine which of the many potential error models of the measurement contribute. Second, one needs to estimate the calibration parameters of these models. Generically, there are many potential models that parametrize, for instance, the deviation of every imperfect implementation of a fixed measurement setting from its ideal implementation.
In this case, the first problem becomes combinatorially costly since many distinct measurement settings need to be simultaneously calibrated. In contrast, in our approach it is straightforward to solve both tasks simultaneously and even avoid a combinatorial overhead using the built-in relaxations of compressed sensing. To this end, we observe that allowing for many potential errors with associated calibration parameters, of which only a small number s contribute, amounts to assuming that the calibration vector ξ is s-sparse, i.e., it has only s non-vanishing entries. Of course, we do not assume that we know the support of the vector ξ. This falls naturally into the framework of structured signal recovery. To summarize: we observe data generated by linear measurements acting on ξ ⊗ ρ, where ξ is an s-sparse vector and ρ is a rank-r quantum state.
We are now faced with the recovery problem of de-mixing a sparse sum of different low-rank quantum states. We show that the projection onto this structure can be efficiently calculated using hierarchical thresholding [22] and therefore circumvents our NP-hardness result. We derive the corresponding iterative hard-thresholding algorithm and prove that it successfully recovers the states ρ_i and the sparse vector ξ provided that the measurement map A acts isometrically on sparse sums of low-rank states. We further show that generic measurement ensembles with m different measurement settings exhibit this restricted isometry property provided that m scales at least as srd + s log n. Thus, we find that our algorithm solves the blind tomography problem with an overhead in the required number of measurements that scales linearly in s as compared to the number of degrees of freedom in the problem, given by rd + s. In particular, the number of potential calibration models n enters only logarithmically in the measurement complexity of the scheme. This renders the scheme highly scalable in n, providing flexibility in the modelling of systematic measurement errors or calibration corrections. Furthermore, it leaves sufficiently many linearly independent parameters to allow one to infer several calibration parameters already for comparably small system sizes. We demonstrate the performance of the algorithm for the physically relevant case of measuring Pauli operators that are locally mixed with the unknown calibration parameters. Our results not only answer a practically inspired, conceptual question in the context of blind quantum tomography, but at the same time contribute to the mathematical framework of compressed sensing as such and have potential applications in other engineering disciplines.
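The hierarchical thresholding step can be sketched as follows (a minimal illustration of the two-level projection, not the authors' reference implementation): every d × d block is truncated to rank r via an SVD, and then only the s blocks with the largest remaining Frobenius norm are kept.

```python
import numpy as np

def project_sparse_lowrank(X_blocks, s, r):
    """Hierarchical hard-thresholding projection onto sparse mixtures of
    low-rank blocks. `X_blocks` has shape (n, d, d): n stacked d x d blocks."""
    n = X_blocks.shape[0]
    truncated = np.zeros_like(X_blocks)
    norms = np.zeros(n)
    for k in range(n):
        # best rank-r approximation of block k (Eckart-Young)
        U, sv, Vh = np.linalg.svd(X_blocks[k])
        truncated[k] = (U[:, :r] * sv[:r]) @ Vh[:r]
        norms[k] = np.linalg.norm(sv[:r])
    # keep only the s blocks with the largest Frobenius norm
    out = np.zeros_like(X_blocks)
    keep = np.argsort(norms)[-s:]
    out[keep] = truncated[keep]
    return out
```

Both levels of the projection (per-block rank truncation, then block selection) are efficiently computable, which is what circumvents the NP-hardness of the projection for the unrelaxed blind tomography structure.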

Practical blind tomography
Going beyond working out the theoretical guarantees, we numerically demonstrate the functioning of the scheme and the mindset behind it. Specifically, we show that the iterative hard-thresholding algorithm solves the blind tomography problem from much fewer samples than competing methods for generic (Gaussian) measurements as well as sub-sampled random Pauli measurements. We then take the theoretical model to the practical testbed and turn to a realistic model of measurement errors given by a coherent over-rotation along some axis. Those measurements have significantly more structure. We observe that the measurement structure together with the sparsity constraints causes the SDT algorithm to frequently get stuck at objective variables with an incorrect support. For this reason, we also study the performance of a more pragmatically minded optimization strategy, namely constrained alternating minimization, which does not require the relaxation to the de-mixing problem. We numerically demonstrate that the blind tomography problem in a realistic setting can be solved using this adapted algorithmic approach. Thereby, we show that exploiting the low-rank structure of quantum states allows for performing tomography blindly in realistic calibration and measurement models. These findings may serve as a strong motivation and invitation to translate our approach to a variety of concrete experimental settings that are practically relevant in quantum technologies. The main theoretical prerequisite is to identify plausible linear (or simple non-linear) calibration models for concrete measurement implementations. However, it is a formidable experimental task in itself to identify an application and demonstrate a concrete advantage of performing state tomography blindly compared to using standard bottom-up protocols, other semi-device-dependent robust calibration approaches or error mitigation techniques.

Figure 2: Illustration of the spectrum between fully device-independent (DI) and fully device-dependent (DD) quantum system characterization methods, such as self-testing [23,24,25,26] and standard tomography, respectively. Semi-device-independent (semi-DI) methods relax the stringent requirements of full device-independence. Self-calibrating tomography relaxes the assumptions on the calibration of the measurement device and therefore exemplifies a semi-device-dependent (semi-DD) scheme. The blind tomography scheme presented here is an example of such a semi-device-dependent scheme.

Related work and applications in signal processing
In our semi-device-dependent, self-calibrating scheme we aim at softening the requirements of fully device-dependent schemes that crucially rely on perfect measurement apparatuses. Coming from the opposite end, in quantum communication, introducing mild assumptions such as bounds on the system dimension [27] weakens the impractical stringency of full device-independence to semi-device-independence [23,24,25,26]. Device-independent and device-dependent approaches can thus be seen as the extreme ends of an axis that quantifies the amount of assumptions on the measurement device; see Figure 2 for an illustration. Hand in hand with reducing the amount of assumptions and gaining robustness to imperfections, the amount of novel information that can be gained is dramatically reduced. Semi-device-independent schemes move away from the requirements of full device-independence towards more practical settings but are still extremely demanding in terms of the required resources and acceptance criteria. Our semi-device-dependent tomography scheme lies in the opposite regime. It slightly relaxes the assumptions on the precision of the measurement apparatus but still extracts tomographic information.
Self-calibrating tomography schemes have been previously proposed in specific contexts, using different methods and assumptions as leverage to break the vicious blind tomography cycle. In Ref. [28] it has been argued that single-photon detectors can be simultaneously calibrated during state tomography under the assumption that the state is squeezed, extending the mindset of Ref. [29]; see also Ref. [30] for a more extensive discussion of potential classes of states and the recent Ref. [31] for error bars in this context. Ref. [12] has reported an experimental demonstration of simultaneously reconstructing a quantum state together with certain unknown unitary rotations associated to the measurement device via maximum likelihood estimation in a linear optics setting. There, the term 'self-calibrating tomography' was coined. From a practical perspective, this work is perhaps closest in mindset to the current work. Complementing and going significantly beyond it, here we prove that such a self-calibrating approach works under very mild and natural structural assumptions and give rigorous guarantees.
Another general-purpose framework is the Gram matrix completion proposed in Refs. [32,33]. Here, a correlation matrix encoding information about the measurement, the state and the measured data is completed from a subset of known indices. So-called data-pattern tomography [34] avoids the calibration of the measurement device by comparing the data to previously determined signatures of well-controlled reference states such as coherent states [35]. More conceptually speaking, schemes incorporating model selection for quantum state tomography can be also viewed as self-calibrating [36].
Another set of approaches focuses on characterizing entire gate sets or their respective unitary errors self-consistently from the observed statistics when applying different sequences [10,11,37,38,39,40]. These methods typically rely on a certain design of the measurement sequences and cost-intensive classical post-processing. In contrast, our model of the measurement devices as linearly depending on a set of calibration parameters is much simpler and requires far less resources.
Our work builds on and further develops compressed-sensing techniques for the tomography of quantum devices. Previous compressed-sensing schemes for quantum tomography reduce the effort in the data acquisition while still ensuring an efficient classical post-processing [13,41,42,15,43,44]. These schemes come with theoretical guarantees and have also been successfully employed in experiments [17,16,45]. The practical applicability of compressed-sensing tomography schemes rests on their robustness and stability against various imperfections of the experimental setup. Small deviations from the compressive model assumption and additive errors to the measurement outcomes, e.g. induced by finite statistics, are reflected in a proportional and only slightly enhanced estimation error. Still, the schemes rely on measurement devices that are calibrated to very high precision, with the notable exception of compressive tomography schemes for quantum processes that use randomized benchmarking data [46,47]. Here, we relax this requirement using a semi-device-dependent approach. In distinction, in the previous schemes low-rank assumptions were considered to reduce the complexity of a tomography scheme, giving rise to an important quantitative improvement. Here, those assumptions are expected to often make blind tomography possible in the first place and therefore permit even a qualitative improvement over the known schemes.
Recovery problems of the form (1) or the related de-mixing problem (2) also arise in other disciplines. For example, these problems appear in future mobile communication scenarios with the promise to yield much more scalable protocols with respect to the number of served devices [48]. More specifically, our work can be applied in order to extend the internet-of-things setup described in Ref. [21] in case one additionally wants to exploit the sporadic (sparse) user activity of machine-type messaging. Furthermore, our work identifies yet another set of hierarchical signal structures that allow for an efficient projection: it extends the work of a subset of the authors on compressed sensing with hierarchically sparse signals to low-rank matrices [49,22].
The remainder of this work is organized as follows. In the subsequent Section 2, we give a detailed description of a concrete experimental setup that motivates our mathematical formulation of the blind tomography problem. In Section 3, we provide the formal definitions of the blind tomography problem and introduce the notation used in the subsequent parts of the work. The details of the sparse de-mixing algorithm and its variant based on alternating optimization are derived in Section 4. On the way, we establish the NP-hardness of the projection associated to the original blind tomography problem. The theorems guaranteeing the performance of the sparse de-mixing algorithm are explained in Section 5. The corresponding proofs are given in the appendix. Finally, numerical simulations of the algorithm's performance and its application to practical use cases are shown in Section 6 before we conclude with an outlook in Section 7.
2 Quantum state tomography with imperfect Pauli correlation measurements

So far, our description of the measurement scheme has been fairly abstract. In the following, we describe a concrete scenario in which our formalism applies. Consider an ion-trap experiment preparing a multi-qubit quantum state ρ. We perform Pauli correlation measurements, i.e., we estimate m expectation values of the form

[A_0(ρ)]_k = Tr(O_k ρ)  with  O_k = σ_1^{(k)} ⊗ · · · ⊗ σ_l^{(k)},   (3)

where σ_j^{(k)} ∈ {X, Y, Z, Id} is a Pauli matrix acting on the jth qubit and k ∈ [m] := {1, 2, . . . , m}. We refer to A_0 : C^{d×d} → R^m as the measurement map or sampling operator¹.
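A sampling operator of this type can be sketched in a few lines (our own illustration; the helper names are not from this work): each Pauli string is expanded into a tensor-product observable, and the map returns the vector of expectation values.

```python
import numpy as np
from functools import reduce

# single-qubit Pauli matrices
PAULI = {"I": np.eye(2, dtype=complex),
         "X": np.array([[0, 1], [1, 0]], dtype=complex),
         "Y": np.array([[0, -1j], [1j, 0]]),
         "Z": np.array([[1, 0], [0, -1]], dtype=complex)}

def pauli_observable(label):
    """Tensor product of single-qubit Paulis, e.g. 'ZY' -> Z (x) Y."""
    return reduce(np.kron, (PAULI[c] for c in label))

def sampling_operator(labels):
    """The map A_0: rho -> (Tr(O_k rho))_k for the given Pauli strings."""
    return lambda rho: np.real(
        np.array([np.trace(pauli_observable(l) @ rho) for l in labels]))
```

For example, for the two-qubit state |00⟩⟨00| the strings "ZZ", "XX" and "ZI" yield the expectation values 1, 0 and 1.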
In many experimental setups, it is natural to implement measurements of a certain Pauli observable (in the case of ion traps, Pauli Z), while the other Pauli observables require more effort. A measurement of any other Pauli observable (in the case of ion traps, Pauli X and Pauli Y) can then be implemented by applying a suitable sequence of unitary gates prior to the measurement. For example, using addressed laser pulses of different duration one can implement rotations around different axes and thus realize the Hadamard gate H as well as the phase gate S. In this way, one can perform measurements in the X = HZH and Y = SHZHS† bases.
But each application of an additional gate may come with a coherent error in addition to the native error associated with the measurement itself. In this way, we end up with different systematic errors for different Pauli observables, parametrized by the angles θ, φ of a coherent error given by e^{iθX} e^{iφZ}. This gives rise to some probability of actually measuring the expectation value of another local Pauli matrix than the targeted one. For example, consider a coherent error given by a (small) rotation around the Z-axis, e^{iφZ}. The faulty implementation of the Hadamard gate is then given by H̃ = e^{iφZ} H. Of course, the native Z-measurement is untouched by this coherent error, since no unitary rotation precedes this measurement. However, instead of Y one now actually measures Ỹ = S H̃ Z H̃† S† = cos(2φ) Y + sin(2φ) X. At the same time, X remains undisturbed.

¹ Note that one might actually implement projective measurements in the multi-qubit Pauli basis as done, e.g., in Refs. [16,17]. While such projective measurements contain more information than the Pauli correlation measurements, we restrict ourselves to Pauli expectation values both for the sake of simplicity and to remain in a setting for which theoretical guarantees can be proven [13,42].
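The identity Ỹ = S H̃ Z H̃† S† = cos(2φ)Y + sin(2φ)X can be checked numerically; the following snippet verifies it for an arbitrarily chosen angle φ = 0.1:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S = np.array([[1, 0], [0, 1j]])

phi = 0.1  # small coherent over-rotation angle (arbitrary choice)
Ht = np.diag([np.exp(1j * phi), np.exp(-1j * phi)]) @ H  # faulty Hadamard e^{i phi Z} H

# effective observable measured instead of Y
Yt = S @ Ht @ Z @ Ht.conj().T @ S.conj().T

assert np.allclose(Yt, np.cos(2 * phi) * Y + np.sin(2 * phi) * X)
```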
More generally, we can introduce calibration parameters ξ_{W→W̃} measuring the strength of the error that replaces a certain target Pauli matrix W by W̃. For instance, in the above example those parameters are given by ξ_{Y→Y} = cos(2φ), ξ_{Y→X} = sin(2φ) and ξ_{Z→Z} = ξ_{X→X} = 1. For simplicity, we assume that these calibration parameters are identical for different qubit registers. Assuming that errors are not too large, the calibration parameters for the target measurement fulfil ξ_{W→W} ≈ 1 for all W ∈ {X, Y, Z}. This leaves us with six independent calibration parameters corresponding to the cross-contributions. To construct the measurement map A, we start from the definition of the target measurement A_0 in (3). From A_0 we can derive calibration measurement components A_{W→W̃}, appearing with the coefficient ξ_{W→W̃}, by replacing all appearances of the Pauli matrix W in the definition of A_0 with W̃. If W appears in a multi-qubit Pauli observable several times, the resulting observable is the sum of all Pauli observables generated by replacing only one of the W by W̃, assuming that the coherent errors are small so that higher-order terms can be neglected. For example, a faulty realization of the observable ZYZZY is, to linear order, given by ξ_{Y→Y} ZYZZY + ξ_{Y→X} (ZXZZY + ZYZZX). Altogether, to linear order in the calibration parameters ξ_{W→W̃} with W ≠ W̃, we end up with a description of the effective faulty measurement

A(ξ ⊗ ρ) = ξ_0 A_0(ρ) + ∑_{W ≠ W̃} ξ_{W→W̃} A_{W→W̃}(ρ),

which can be written as a linear map A acting on ξ ⊗ ρ with ξ = [ξ_0, ξ_{X→Y}, ξ_{X→Z}, . . . , ξ_{Z→Y}]^T. By assumption, we set ξ_0 = 1. In this measurement model the sparsity assumption is justified if unitary errors in a certain coordinate plane are dominant compared to others, thus singling out certain types of calibration measurement components. Importantly, we do not assume that we know a priori which corrections are dominant (i.e., the support of ξ). The measurement model also exemplifies a setting in which one is ultimately limited to measuring a maximal set of d^2 observables.
Thus, blind tomography becomes possible only by exploiting structural assumptions, provided one does not allow for different ways of implementing the same measurement that yield different calibration corrections without introducing too many new calibration parameters.

Formal problem definition
Motivated by this example, we set out to provide a formal definition of the blind tomography problem and the related sparse de-mixing problem. The notation and terminology introduced in this section allow us to formulate a general signal-processing framework in which the blind tomography problem can be provably solved. Both the blind tomography and the sparse de-mixing problem are linear inverse problems that feature a combination of different compressive structures. These are smaller subsets of linear vector spaces, and it will be convenient to introduce some notation to refer to them. The prototypical example is the set of s-sparse real vectors, defined by the support supp ξ of a vector ξ, i.e., the index set of its non-vanishing entries, having cardinality at most s. The set of s-sparse vectors is not a vector space itself but the union of (n choose s) many s-dimensional subspaces. In the realm of quantum mechanics, the non-commutative analogue of sparse vectors, namely low-rank matrices, is important: the complex d × d matrices of rank at most r. Since we are dealing with quantum states we will restrict our attention to the set D_d ⊂ C^{d×d} of trace-normalized, positive semi-definite matrices, i.e., ρ ≥ 0 and Tr ρ = 1 for all ρ ∈ D_d.

Figure 3: The signal sets of the blind tomography and sparse de-mixing problems can be regarded as subsets of C^{nd×d}, i.e., matrices consisting of n blocks of size d × d. For a blind tomography signal in Ω^{n,d}_{s,r}, only s out of the n blocks are non-zero, and these are proportional to the same rank-r matrix. In contrast, a signal of the sparse de-mixing problem in Ω̃^{n,d}_{s,r} comprises s non-vanishing blocks with potentially different rank-r matrices.
Our results can be straightforwardly generalized to general matrices without these constraints. We denote the set of rank-r quantum states as D^d_r := {ρ ∈ D_d : rank ρ ≤ r}. In particular, D^d_1 is the set of pure quantum states. In order to solve the blind tomography problem we need to simultaneously recover an s-sparse real vector ξ and a rank-r quantum state ρ. It is convenient to regard both ξ and ρ as a combined signal X = ξ ⊗ ρ and to model the measurement, including its dependence on the calibration parameters, as a linear map A acting on X. Considering such linear maps instead of bi-linear maps is sometimes referred to as 'lifting' in the compressed sensing literature [50]. For a physicist, 'lifting' is also the natural isomorphism at the heart of the density matrix formulation of quantum mechanics. The signal X is highly structured as it is a tensor product of a sparse vector and a low-rank quantum state. We denote the set of all potential signals as Ω^{n,d}_{s,r} := {ξ ⊗ ρ : ξ ∈ R^n s-sparse, ρ ∈ D^d_r}. One can regard a signal X ∈ Ω^{n,d}_{s,r} as an nd × d matrix consisting of n blocks of size d × d stacked on top of each other, as depicted in Figure 3, where each d × d block is proportional to the same quantum state ρ and only s of the blocks are non-vanishing.
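The lifted signal structure can be made concrete in a few lines (a small illustration with arbitrarily chosen sizes): the Kronecker product of an s-sparse vector with a pure state yields exactly the stacked-block matrix described above.

```python
import numpy as np

n, d = 5, 4                                # number of blocks, block dimension
xi = np.zeros(n)
xi[[0, 3]] = [0.7, 0.3]                    # s-sparse calibration vector (s = 2)
psi = np.ones(d) / np.sqrt(d)
rho = np.outer(psi, psi.conj())            # rank-1 (pure) quantum state
X = np.kron(xi.reshape(-1, 1), rho)        # lifted signal: n stacked d x d blocks
```

The kth block of X is ξ_k ρ; since only s entries of ξ are non-zero, only s blocks survive, and the whole nd × d matrix has rank one.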
We are now equipped to concisely state the problem we would like to study.

Problem 1 (Blind tomography). Let A : C^{nd×d} → R^m be a linear map. Given data y = A(X) ∈ R^m and the linear map A, recover X under the assumption that X ∈ Ω^{n,d}_{s,r}.

Our approach to algorithmically solving the blind tomography problem makes use of a proxy problem: we relax it to signals that are a bit less restrictively structured, namely

Ω̃^{n,d}_{s,r} := {∑_{i=1}^n ξ_i e_i ⊗ x_i : ξ ∈ R^n s-sparse, x_i ∈ D^d_r}.

Both sets Ω̃^{n,d}_{s,r} and Ω^{n,d}_{s,r} are subsets of C^{nd×d}. The difference between them, as illustrated in Figure 3, is the following: while for X ∈ Ω^{n,d}_{s,r} all d × d blocks are proportional to the same quantum state, we allow the d × d blocks of X̃ ∈ Ω̃^{n,d}_{s,r} to be proportional to different quantum states x_i. Again, only s out of the n blocks of X̃ are non-vanishing. Analogously to Problem 1, we define the linear inverse problem associated with Ω̃.
Problem 2 (Sparse de-mixing). Let A : C^{nd×d} → R^m be a linear map. Given data y = A(X) ∈ R^m and the linear map A, recover X under the assumption that X ∈ Ω̃^{n,d}_{s,r}.

The observed data of the sparse de-mixing problem can be equivalently described as y = Σ_{k=1}^{n} ξ_k A_k(x_k), where we have split up X into trace-normalized d × d blocks x_k and their norms ξ_k according to the definition of Ω̃^{n,d}_{s,r}. Correspondingly, we can decompose the linear map A into the set of linear maps {A_k}_{k=1}^{n}, where each A_k acts only on the kth d × d block of X. From this reformulation it becomes clear that the problem amounts to reconstructing a set of low-rank signals {x_k}_k from observing their sparse mixture under linear maps, hence the name sparse de-mixing.
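The block decomposition of the measurement can be made concrete in a few lines. This sketch assumes each A_k is specified by m Hermitian observables A_k^{(i)}, stored together as an array of shape (m, n, d, d); that storage convention is our own choice for illustration.

```python
import numpy as np

def measure(obs, X, n, d):
    """Apply the lifted measurement: y_i = sum_k <A_k^{(i)}, x'_k>,
    where x'_k is the kth d x d block of X and <A, B> = Tr(A^dag B).
    obs has shape (m, n, d, d) with Hermitian d x d slices."""
    blocks = X.reshape(n, d, d)
    return np.real(np.einsum('ikab,kab->i', obs.conj(), blocks))

# For X = xi (x) rho this reduces to the sparse mixture
# y_i = sum_k xi_k * Tr(A_k^{(i)} rho).
rng = np.random.default_rng(0)
m, n, d = 5, 3, 2
Z = rng.normal(size=(m, n, d, d)) + 1j * rng.normal(size=(m, n, d, d))
obs = 0.5 * (Z + np.conj(np.swapaxes(Z, -1, -2)))   # Hermitian observables

xi = np.array([0.0, 1.0, -0.5])
rho = np.array([[1.0, 0.0], [0.0, 0.0]])            # pure state |0><0|
X = np.kron(xi.reshape(n, 1), rho)
y = measure(obs, X, n, d)
```

Since the observables and the blocks are Hermitian, the resulting data vector y is real, as stated in the text.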
For both the blind-tomography and the sparse-de-mixing problem, we alternatively write each of the n linear maps A_k in terms of m observables A_k^{(i)} as A_k(x)_i = ⟨A_k^{(i)}, x⟩, with the Hilbert-Schmidt inner product ⟨X, Y⟩ = Tr(X†Y).
Note that as long as we consider Hermitian matrices for the observables A_k^{(i)} and Hermitian signals x_k, we end up with a real data vector y ∈ R^m. For applications other than quantum tomography it is straightforward to adapt our proofs and results to real signals or complex-valued measurement maps. Furthermore, for the sake of simplicity we have formulated both recovery problems without noise. More generally, the data can be assumed to be of the form y = A(X) + ϵ, where ϵ denotes additive, e.g. statistical, noise.
In the following, we will also make use of the inner product of vectors x, y ∈ R^n defined as ⟨x, y⟩ = Σ_i x_i y_i, their ℓ_2-norm ∥x∥_{ℓ_2} = √⟨x, x⟩, and the Frobenius norm of a matrix X ∈ C^{d_1×d_2} induced by the Hilbert-Schmidt inner product, ∥X∥_F = √⟨X, X⟩.

Algorithm
We now turn to the technical derivation of our algorithm for the blind quantum tomography and the sparse de-mixing problem. Our algorithm builds on primitives developed in the field of compressed sensing. In particular, we generalize the hard thresholding algorithm to accommodate the structural assumptions of both problems. As a first step, we establish the hardness of direct thresholding approaches to the blind tomography problem before stating a tractable algorithm for the sparse de-mixing problem.
Let us be more precise: the blind quantum tomography problem requires different assumptions on two levels. First, we want the signal to be a tensor product ξ ⊗ ρ, i.e., of rank one. Second, both tensor factors are assumed to be structured. Concretely, we assume ξ to be s-sparse and ρ to be of rank r. We are therefore faced with low-rank structures on two separate levels: first, the block-structured signal as given by the tensor product of calibration vector and quantum state has unit rank. Second, by assumption the target quantum states, i.e., the individual blocks of the signal, have low rank.
It has been observed in the compressed sensing literature that multi-level structures with structured tensor components can be notoriously hard to reconstruct. One prototypical example is combined sparsity and low-rankness in the sense that the signal is the tensor product of two sparse vectors, i.e., X = ξ ⊗ τ with ξ, τ ∈ Σ^n_s. This problem is already very similar to the blind tomography problem, where one of the sparse vectors is replaced by a low-rank matrix, the quantum state.
The obstacle arising from such structures can be understood from a different perspective present in the compressed sensing literature that is related to different algorithmic approaches. Perhaps the most prominent approach in compressed sensing is the convex relaxation of structure-promoting regularisers, yielding efficient convex optimization programs. Minimizing the ℓ_1-norm or the Schatten-1-norm is known to solve linear inverse problems involving sparse vectors or low-rank matrices efficiently and sampling-optimally, respectively. However, simply combining both regularisers in a convex fashion does not yield a sampling-optimal reconstruction for problems that feature both structures [51].

Hard-thresholding algorithms: Ease and hardness of the projection
Another algorithmic approach used in compressed sensing are so-called hard-thresholding algorithms such as CoSaMP, IHT or HTP [52,20,53]; see also the textbook [19] for an introduction. These are typically iterative procedures that minimize the deviation from the linear constraints in some way or other, e.g. by gradient descent, and in each iteration project onto the structure of the signal. For many compressed sensing problems this strategy works because even though the recovery problems themselves, such as ℓ_0-norm minimization, are NP-hard [54], the related projection can be computed efficiently. For the example of projecting onto s-sparse vectors, this projection is given by the hard-thresholding operation defined as follows: let Σ_max be the set of indices of the s absolutely largest entries of τ. Then P_{Σ^n_s}(τ)_i = τ_i for i ∈ Σ_max and P_{Σ^n_s}(τ)_i = 0 otherwise. In words, one keeps the s largest entries of τ and replaces the other entries by zero. Analogously, the projection of Hermitian matrices onto low-rank matrices can be computed efficiently by calculating the eigenvalue decomposition and applying P_{Σ^d_r} to the eigenvalue vector. Let X ∈ C^{d×d} be a Hermitian matrix with eigenvalue decomposition X = U diag(λ)U†. We define the projection onto positive semi-definite low-rank matrices as P_{D^d_r}(X) = U diag(P_{Σ^d_r}(λ|_{≥0})) U†, where λ|_{≥0} denotes the restriction of λ to its non-negative entries (negative entries being set to zero).
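Both projections are a few lines of NumPy each. This is a sketch; it relies on the fact that `eigh` returns eigenvalues in ascending order, so the rank truncation zeroes all but the last r entries.

```python
import numpy as np

def project_sparse(tau, s):
    """Hard-thresholding P_{Sigma_s}: keep the s absolutely largest
    entries of tau and set all others to zero."""
    out = np.zeros_like(tau)
    keep = np.argsort(np.abs(tau))[-s:]
    out[keep] = tau[keep]
    return out

def project_psd_rank(X, r):
    """Projection onto PSD matrices of rank at most r: eigendecompose,
    drop negative eigenvalues, keep the r largest remaining ones."""
    lam, U = np.linalg.eigh(X)          # ascending eigenvalues
    lam = np.clip(lam, 0.0, None)       # restriction to non-negative part
    lam[:-r] = 0.0                      # keep only the r largest
    return (U * lam) @ U.conj().T

tau = np.array([0.1, -2.0, 0.5, 1.5])
print(project_sparse(tau, 2))           # keeps -2.0 and 1.5
```

The projection `project_psd_rank` implements P_{D^d_r} as defined above, combining the non-negative restriction λ|_{≥0} with the hard thresholding of the eigenvalue vector.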
In hard-thresholding algorithms, the problems associated with simultaneously exploiting sparse and low-rank structures manifest themselves in the computational hardness of the respective projections. For the case of a unit-rank matrix with sparse singular vectors, calculating the projection amounts to the so-called sparse PCA problem: given a matrix A, find the s-sparse unit vector ξ that maximizes the quadratic form ⟨ξ, Aξ⟩. Indeed, exactly solving this problem in the worst case is NP-hard by a simple reduction from the CLIQUE problem [54]. But it turns out that the situation is much worse: one can even make average-case hardness statements based on conjectures regarding the hardness of the planted-clique problem [55,56,57]. Moreover, the sparse PCA problem remains just as hard even when one merely asks for an approximation up to a constant relative error [58,54].
As the first technical result of this work, we show that the projection onto Ω^{n,d}_{s,r} is also an NP-hard problem, by a reduction from the sparse PCA problem.

Theorem 3 (Hardness of constrained minimization). There exists no polynomial-time algorithm that calculates the projection

P_{Ω^{n,d}_{s,r}}(A) ∈ argmin_{X ∈ Ω^{n,d}_{s,r}} ∥A − X∥_F

for all A ∈ C^{nd×d} unless P = NP. This holds even in the case s = n.
The details of the proof are given in Appendix A. This hardness result is a strong indication that a straightforward adaptation of compressed sensing techniques is not feasible. In this work, our way out is to sacrifice sampling optimality of the algorithm in exchange for a lower runtime and provable analytical performance guarantees. Alternating minimization approaches that make the factorization explicit are also a viable way forward; we provide a detailed description of such an algorithm in Section 4.4. But proving global recovery guarantees for non-convex algorithms typically becomes much more involved.

Relaxing the blind tomography problem: sparse de-mixing
In fact, the bi-sparse and low-rank structure can be relaxed to a simple hierarchical sparsity constraint [59,60]. A vector ξ ∈ C^{Nn} consisting of N blocks of size n is called (s, σ)-hierarchically sparse if it has at most s blocks with non-vanishing entries that are themselves σ-sparse [61,62,63,64]. For this structure a hard-thresholding algorithm together with theoretical recovery guarantees has been derived in Refs. [22,49,65,66]. It has been applied in different contexts [67,68,69,70], including sparse blind deconvolution [59], which features the combined low-rank and sparse structure. Here, we make use of this approach to solve the blind quantum tomography problem as formalized in Problem 1. At the heart of our approach is the insight that the projection onto Ω̃^{n,d}_{s,r} can be efficiently computed since the n blocks of size d × d may be different. This allows one to combine the projection onto Σ^n_s and the projection onto D^d_r: first, the low-rank projection P_{D^d_r} is applied to each of the d × d blocks of the input matrix X. Subsequently, the sparse projection operator is applied by setting the n − s blocks that are smallest in Frobenius norm to zero. The resulting algorithm is summarized as Algorithm 1, where Y_W denotes the subvector of Y indexed by the entries in the complement of W. The computational cost of the projection onto Ω̃^{n,d}_{s,r} is dominated by the eigenvalue decompositions required to compute the low-rank approximation P_{D^d_r} of each block. Computing the full eigenvalue decomposition of the d × d blocks requires computation time of O(d³) using, e.g., Householder reflections [71]. Since we are only interested in the dominant r ≪ d eigenvalues, the effort can be reduced to O(rdw) using the Lanczos algorithm, where w is the average number of non-zero elements in a row of a block [71].
Using randomized techniques one might be able to further reduce the computational cost [72]. The calculation of the Frobenius norms contributes O(nd²) flops. The s largest blocks can be selected using the quickselect algorithm [73] in O(n). Note that the low-rank projections and Frobenius norms of all blocks can also be computed in parallel without any modification to the algorithm.
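Putting the two pieces together, the projection onto the relaxed set is a blockwise rank-r PSD projection followed by keeping the s blocks that are largest in Frobenius norm. A minimal sketch (using the full `eigh` decomposition rather than the Lanczos variant discussed above):

```python
import numpy as np

def project_omega_tilde(X, n, d, s, r):
    """Project an (n d) x d matrix onto the relaxed set: rank-r PSD
    projection of every d x d block, then zero all but the s blocks
    that are largest in Frobenius norm."""
    blocks = X.reshape(n, d, d).astype(complex)
    for k in range(n):
        lam, U = np.linalg.eigh(0.5 * (blocks[k] + blocks[k].conj().T))
        lam = np.clip(lam, 0.0, None)       # restrict to non-negative part
        lam[:-r] = 0.0                      # keep the r largest eigenvalues
        blocks[k] = (U * lam) @ U.conj().T
    norms = np.linalg.norm(blocks, axis=(1, 2))
    drop = np.argsort(norms)[:n - s]        # indices of the n - s smallest blocks
    blocks[drop] = 0.0
    return blocks.reshape(n * d, d)
```

The per-block projections and the norm computation are independent, reflecting the parallelization remark above.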
Equipped with an efficient projection onto Ω̃^{n,d}_{s,r}, we can construct a structured iterative gradient descent algorithm. This is a variant of the IHT algorithm, which was originally developed for sparse vectors [20]. The IHT algorithm is a projective gradient descent algorithm that iteratively alternates between gradient steps minimizing the ℓ_2-norm deviation between the data and the measured iterate, and projections onto the constraint set. The resulting recovery algorithm for the sparse de-mixing (SDT) problem is stated as Algorithm 2.
The SDT algorithm is closely related to the IHT algorithm for de-mixing low-rank matrices that was developed in Ref. [21]. We will refer to this algorithm as the DT algorithm. The main difference between our SDT and the DT algorithm of Ref. [21] is that the latter does not make the additional sparsity assumptions on the signal.
For this reason, the SDT algorithm differs in the projection P_{Ω̃^{n,d}_{s,r}}, which additionally applies the projection P_{Σ^n_s} selecting the s dominant blocks. In fact, in the special case of non-sparse signals in Ω̃^{n,d}_{n,r}, the SDT algorithm coincides with the DT algorithm.

Details of the SDT algorithm
To be fully self-contained, let us now go through the individual steps of the SDT algorithm and specify the relevant details. Every iteration of the algorithm starts with the computation of G^l = A†(y − A(X^l)), the negative gradient of the ℓ_2-norm deviation f(X) = ½∥y − A(X)∥²_{ℓ_2} evaluated at X^l. The algorithm subsequently employs a modification from Ref. [74] for calculating the steepest descent direction that is inspired by geometrical optimization techniques and leads to faster convergence [75,76]: the set of rank-r matrices is an embedded differentiable manifold in the linear vector space of all matrices. Thus, a direction on this embedded manifold is characterized by a tangent vector of the manifold. While this geometry straightforwardly generalizes to the set of nd × d matrices with rank-r blocks, due to the sparsity constraint Ω̃^{n,d}_{s,r} fails to be a differentiable manifold. Nonetheless, for the non-vanishing blocks of X^l we can make use of tangent vectors as 'natural' search directions that conform to the fixed-rank constraint.
The tangent space of the manifold of rank-r matrices at a point x is given by the set of matrices that share the same column or row space as x [75]. Correspondingly, the tangent space projection of a non-vanishing block of X can be defined as follows: let x_k = U_k Λ_k U_k† be the eigenvalue decomposition of the kth block of X, with Λ_k the diagonal matrix of eigenvalues in decreasing order. Further, let U_k^{(r)} denote the restriction of U_k to its first r columns, corresponding to the range of x_k. Then the tangent space projection acting on g_k, the kth block of G, is given by

P_{T_{x_k}}(g_k) = P_k g_k + g_k P_k − P_k g_k P_k,  with  P_k = U_k^{(r)} U_k^{(r)†}.  (7)

The entire tangent space projection P_{T_X}(G) is defined by acting trivially on the blocks of G corresponding to vanishing blocks of X and as the projection (7) otherwise.
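Assuming the standard projector onto the tangent space of Hermitian rank-r matrices, P_T(g) = P g + g P − P g P with P the orthogonal projector onto the range of x, a per-block implementation reads:

```python
import numpy as np

def tangent_projection(g, x, r):
    """Project g onto the tangent space at the rank-r Hermitian point x:
    P_T(g) = P g + g P - P g P, where P projects onto the range of x
    (spanned by the eigenvectors of its r largest eigenvalues)."""
    _, U = np.linalg.eigh(x)
    Ur = U[:, -r:]                  # eigenvectors of the r largest eigenvalues
    P = Ur @ Ur.conj().T
    return P @ g + g @ P - P @ g @ P
```

As an orthogonal projection, applying it twice gives the same result as applying it once, and any matrix already in the tangent space (in particular x itself) is left unchanged.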
As we prove below, in generic situations the SDT algorithm converges for a constant step width set to µ^l = 1 and even without using the tangent space projection. Empirically, faster convergence is achieved with the tangent space projection and the following prescription for the step-width calculation: from the projected gradient G^l_P = P_{T_{X^l}}(G^l) in the lth iteration we calculate the algorithm's step width for each block individually as µ^l_k = ∥g^l_{P,k}∥²_F / ∥A_k(g^l_{P,k})∥²_{ℓ_2} and multiply each block by the corresponding µ^l_k.
In order to have a compact notation, we introduce the diagonal matrix diag(µ^l) = diag(µ^l_1, …, µ^l_1, µ^l_2, …, µ^l_2, …, µ^l_n), in which each step width is repeated d times. The new state of the algorithm, X^{l+1}, is given by the projection of the result of a gradient step with step widths µ^l onto the set Ω̃^{n,d}_{s,r}. Finally, we have to specify a stopping criterion at which the loop of the algorithm is exited. We terminate the algorithm if the objective function is below a specified threshold, i.e.,

∥y − A(X^l)∥_{ℓ_2} ≤ γ_break,  (8)

or a maximal number of iterations is reached. If the data vector y has additive noise, γ_break has to be chosen larger than the expected norm of the noise. To rely less on assumptions about the noise level, one can alternatively use criteria based on the gradient and step width, or test for oscillating patterns in the identified support.
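Collecting the pieces, the basic SDT iteration with constant step width µ^l = 1 and without the tangent-space refinement can be sketched as follows. The projection is repeated here for self-containment, and the measurement map `A` with adjoint `A_adj`, the GUE-style demo ensemble, and all dimensions are illustrative assumptions of this sketch.

```python
import numpy as np

def project_omega_tilde(X, n, d, s, r):
    """Blockwise rank-r PSD projection, then keep the s largest blocks."""
    blocks = X.reshape(n, d, d).astype(complex)
    for k in range(n):
        lam, U = np.linalg.eigh(0.5 * (blocks[k] + blocks[k].conj().T))
        lam = np.clip(lam, 0.0, None)
        lam[:-r] = 0.0
        blocks[k] = (U * lam) @ U.conj().T
    norms = np.linalg.norm(blocks, axis=(1, 2))
    blocks[np.argsort(norms)[:n - s]] = 0.0
    return blocks.reshape(n * d, d)

def sdt(y, A, A_adj, n, d, s, r, gamma=1e-8, max_iter=300):
    """Sketch of the SDT iteration: projected gradient descent on
    f(X) = 0.5 * ||y - A(X)||^2 with constant step width 1."""
    X = np.zeros((n * d, d), dtype=complex)
    for _ in range(max_iter):
        residual = y - A(X)
        if np.linalg.norm(residual) <= gamma:   # stopping criterion (8)
            break
        X = project_omega_tilde(X + A_adj(residual), n, d, s, r)
    return X

# Demo with a random Hermitian (GUE-style) measurement ensemble:
rng = np.random.default_rng(1)
n, d, s, r, m = 3, 2, 1, 1, 200
Z = rng.normal(size=(m, n, d, d)) + 1j * rng.normal(size=(m, n, d, d))
obs = 0.5 * (Z + np.conj(np.swapaxes(Z, -1, -2)))

def A(X):
    return np.real(np.einsum('ikab,kab->i', obs.conj(), X.reshape(n, d, d))) / np.sqrt(m)

def A_adj(y):
    return np.einsum('i,ikab->kab', y, obs).reshape(n * d, d) / np.sqrt(m)

v = rng.normal(size=d) + 1j * rng.normal(size=d)
v /= np.linalg.norm(v)
rho = np.outer(v, v.conj())
xi = np.array([0.0, 2.0, 0.0])              # positive entry -> PSD blocks
X_true = np.kron(xi.reshape(n, 1), rho)
X_hat = sdt(A(X_true), A, A_adj, n, d, s, r)
```

With the ensemble well oversampled relative to the few degrees of freedom of the signal, the iteration recovers X_true up to numerical precision.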

Blind tomography via alternating least-squares optimization
A more direct algorithmic approach to the blind tomography problem is to use constrained alternating least-squares (ALS) optimization. In ALS optimization, one alternates between constrained optimizations of the objective function with respect to one of the two variables while regarding the respective other variable as constant, see Algorithm 3. We perform the optimization over Σ^n_s, Step 3 of Algorithm 3, using the standard IHT algorithm. The ALS optimization requires an initialization with a suitable ρ_0 in order to evaluate the first objective function for optimizing ξ. One method that we found viable is to randomly draw a rank-r state using Haar-random eigenvectors. Note that, in general, constrained ALS optimization can be highly sensitive to the chosen initialization. For this reason, depending on the measurement map and calibration model, alternative initialization strategies might become necessary. As stopping criteria we can again use a bound on the objective function as in (8) and a maximal number of iterations.
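A minimal sketch of such a constrained ALS loop, with our own simplifications: the ξ-update is a plain least-squares fit followed by hard thresholding rather than a full IHT run, and the ρ-update is a single projected gradient step. The observable array layout is the same illustrative (m, n, d, d) convention as before.

```python
import numpy as np

def als_blind_tomography(y, obs, s, r, n_iter=200, rng=None):
    """Alternating least-squares sketch for y_i = sum_k xi_k Tr(A_k^{(i)} rho).
    obs has shape (m, n, d, d) with Hermitian slices."""
    m, n, d, _ = obs.shape
    rng = rng or np.random.default_rng(0)
    # initialize with a random rank-r state (Haar-ish eigenvectors)
    V = rng.normal(size=(d, r)) + 1j * rng.normal(size=(d, r))
    Q, _ = np.linalg.qr(V)
    rho = Q @ Q.conj().T / r
    xi = np.zeros(n)
    for _ in range(n_iter):
        # (i) fit xi at fixed rho by least squares, then hard threshold
        M = np.real(np.einsum('ikab,ab->ik', obs.conj(), rho))
        xi_ls = np.linalg.lstsq(M, y, rcond=None)[0]
        xi = np.zeros(n)
        keep = np.argsort(np.abs(xi_ls))[-s:]
        xi[keep] = xi_ls[keep]
        # (ii) one projected gradient step in rho at fixed xi
        B = np.einsum('k,ikab->iab', xi, obs)          # effective observables
        resid = y - np.real(np.einsum('iab,ab->i', B.conj(), rho))
        G = np.einsum('i,iab->ab', resid, B) / m
        lam, U = np.linalg.eigh(rho + G)
        lam = np.clip(lam, 0.0, None)
        lam[:-r] = 0.0
        rho = (U * lam) @ U.conj().T
        tr = np.trace(rho).real
        if tr > 0:
            rho /= tr                                   # keep trace one
    return xi, rho
```

Keeping ρ trace-normalized lets the calibration vector ξ absorb the overall scale, matching the block decomposition used in Problem 2.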

Recovery guarantees
We now prove that for certain simple measurement ensembles, the SDT algorithm converges to the optimal solution before we numerically demonstrate its performance in the following section. More precisely, following the outline of model-based compressed sensing [18,77], the SDT algorithm can be accompanied by recovery guarantees based on a restricted isometry property (RIP) of the measurement ensemble that is custom-tailored to the structure at hand. Intuitively, it seems clear that a measurement map should at least in principle allow for solving the associated linear inverse problem uniquely if it acts as an isometry on signals from the constraint set. So-called RIP constants formalize this intuition:

Definition 4 (Ω̃^{n,d}_{s,r}-RIP). Given a linear map A : C^{nd×d} → R^m, we denote by δ_{s,r} the smallest constant such that

(1 − δ_{s,r}) ∥x∥²_F ≤ ∥A(x)∥²_{ℓ_2} ≤ (1 + δ_{s,r}) ∥x∥²_F

for all x ∈ Ω̃^{n,d}_{s,r}.

The constant δ_{s,r} measures how much the action of A, when restricted to elements of Ω̃^{n,d}_{s,r}, deviates from that of an isometry. Correspondingly, if δ_{s,r} is sufficiently small we expect the restricted action of A to become invertible. In fact, if a measurement map has a sufficiently small RIP constant, one can prove the convergence of projective gradient descent algorithms to the correct solution of the structured linear inverse problem. For the sake of simplicity, we analyze the SDT algorithm omitting the tangent space projection and assuming a constant step width µ^l = 1. In numerical simulations we observe that making use of the tangent space projection and a more sophisticated heuristic for the step width yields faster convergence and better recovery performance. But the RIP assumption is in fact strong enough to ensure, already for this simpler algorithmic variant, that the following theorem holds:

Theorem 5 (Recovery guarantee). Let A : C^{nd×d} → R^m be a linear map and suppose that the RIP condition

δ_{3s,3r} < 1/2  (9)

holds for A. Then, for X ∈ Ω̃^{n,d}_{s,r}, the sequence (X^l) defined by the SDT algorithm (Algorithm 2) with µ^l = 1 and P_{T_{X^l}} = Id applied to y = A(X) satisfies, for all l ≥ 0,

∥X − X^{l+1}∥_F ≤ (2δ_{3s,3r})^{l+1} ∥X∥_F.

The theorem's proof is presented in Appendix B. We establish that the SDT algorithm converges to the correct solution of the sparse de-mixing problem at a linear rate that is governed by the RIP constant δ_{3s,3r} of the measurement map. The right-hand side of the RIP condition (9) is not expected to be optimal. Typically, one can at least improve the bound to 1/√3 with a slightly more complicated argument [19]. Since we are interested in the parametric scaling here, we choose to present the simpler argument at the cost of slightly worse constants. Furthermore, the statement of Theorem 5 does not account for statistical noise or potential mild violations of the signal constraints. When deployed in practice, one of course expects the calibration parameter and the quantum state to be only approximately sparse and approximately low-rank, respectively. For example, even a small amount of depolarizing noise causes a pure quantum state to be of full rank. Such a state is still well-approximated by a rank-one matrix, however, and one would expect the recovery to be robust to such deviations. Theorem 5 can be generalized to a noise- and model-robust guarantee; we provide a detailed discussion at the end of Appendix B. For the current analysis, focusing on the scaling behaviour, we are content with the significantly simpler version.
The pressing next question is, of course, which measurement ensembles actually exhibit the required RIP. Interestingly, it is notoriously hard to give deterministic constructions of measurement maps that are sampling optimal and feature the RIP. In fact, already for the RIP for s-sparse vectors no sampling-optimal deterministic measurement maps are known to date [19]. To further complicate the state of affairs, it is also known to be NP-hard to check whether a given measurement map exhibits the s-sparse RIP with a RIP constant smaller than a given δ [78].
For this reason, the field of compressed sensing uses probabilistic constructions to arrive at provably sampling-optimal measurement maps. Using a random ensemble of measurement maps of sampling-optimal dimension, one establishes that with high probability a randomly drawn instance will exhibit the RIP. In other words, one proves that the originally hard linear inverse problem typically becomes easy for a certain measurement ensemble. Arguably, the simplest measurement ensemble consists of observables given by i.i.d. random Gaussian matrices. In our setting, a fully Gaussian measurement map can be constructed from a set of m Gaussian matrices A_i with entries drawn i.i.d. from the normal distribution N(0, 1), defining y_i = Tr(A_i X).
As a toy model for quantum tomography it is more natural to consider observables drawn from a random ensemble of Hermitian matrices such as the Gaussian unitary ensemble (GUE). Operationally, we define the GUE by drawing a matrix X with complex Gaussian entries, X_{k,l} ∼ N(0, 1) + iN(0, 1), and subsequently projecting X onto Hermitian matrices via P : X ↦ ½(X + X†). For measurement maps from the GUE we prove the following statement:

Theorem 6 (RIP for GUE measurements, informal). A measurement map consisting of m observables drawn i.i.d. from the GUE exhibits the Ω̃^{n,d}_{s,r}-RIP with constant δ_{s,r} with high probability, provided that

m ≥ C δ_{s,r}^{-2} (s ln n + s r d)  (10)

for a sufficiently large constant C.

The proof of the theorem is provided in Appendix C. Based on this result for random Hermitian measurement maps, we now discuss the asymptotic scaling of the measurement complexity of our approach to the blind tomography problem and the sparse de-mixing problem. First, the derived measurement complexity (10) is in accordance with the number of degrees of freedom of a signal X ∈ Ω̃^{n,d}_{s,r}: the second term of O(drs) corresponds to the number of degrees of freedom specifying the s rank-r matrices of dimension d, and the first term of O(s ln n) is the minimal sampling complexity in s for learning the s non-trivial entries and their support [19]. Second, in analogy, we expect the optimal number of measurements for the blind tomography problem, i.e., for reconstructing signals in Ω^{n,d}_{s,r} instead of Ω̃^{n,d}_{s,r}, to scale as O(s ln n + dr). Hence, having a provably efficient algorithm capable of solving the blind tomography as well as the sparse de-mixing problem comes at the cost of an increase in the sampling complexity by an additional factor of s in the second term. Most importantly, invoking the sparsity assumption on the calibration vector ξ allows us to avoid a linear increase in n, the number of calibration parameters. Thus, the overhead in measurement complexity of our approach to the blind tomography problem is relatively mild.
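The GUE projection described above is direct to implement. A sketch following the operational definition in the text:

```python
import numpy as np

def draw_gue(m, dim, rng):
    """Draw m GUE observables: complex Gaussian matrices with entries
    N(0,1) + i N(0,1), projected onto Hermitian matrices X -> (X + X^dag)/2."""
    Z = rng.normal(size=(m, dim, dim)) + 1j * rng.normal(size=(m, dim, dim))
    return 0.5 * (Z + np.conj(np.transpose(Z, (0, 2, 1))))

rng = np.random.default_rng(0)
obs_gue = draw_gue(100, 4, rng)
# every observable is Hermitian by construction
assert np.allclose(obs_gue, np.conj(np.transpose(obs_gue, (0, 2, 1))))
```

Hermiticity of the observables guarantees a real data vector y_i = Tr(A_i X) for Hermitian signals, as required by the measurement model.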
In fact, the measurement complexity derived for Gaussian measurements can often be used as a guideline for the sampling complexity of other measurement ensembles that are also sufficiently unstructured. However, the proof techniques for model-based compressed sensing that exploit the combination of different structures are not easily translatable to other measurement ensembles. An exception are measurement ensembles featuring a structure that is sufficiently aligned with the signal structure, such as the one exploited in Ref. [22] for hierarchically sparse signals. We leave the study of more involved measurement ensembles to future work.

Numerical results
The analytical results of the previous section provide worst-case bounds on the asymptotic scaling for a class of idealized, unstructured measurements. In order to benchmark and assess the non-asymptotic performance of compressed sensing algorithms in practice, however, numerical simulations are indispensable. In a first step we therefore perform numerical simulations for the idealized measurement model given by random GUE matrices, comparing the performance of our algorithm to related established algorithms that do not fully exploit the structure of the problem. In a second step, we compare the SDT algorithm (Algorithm 2) with standard compressed sensing tomography in a blind tomography setting involving measurements of Pauli correlators, cf. (3). To do so, we randomly draw subsets of the possible Pauli measurements as possible calibrations A_i of the measurement apparatus. Finally, we demonstrate the feasibility of blind tomography under structure assumptions in the realistic measurement and calibration setting involving single-qubit coherent errors described in Section 2. To this end, we employ Algorithm 3, which performs alternating constrained optimization. The algorithms and the scripts producing the plots have been implemented in Python and will be made available under Ref. [79].

Figure 4: The recovery rate for the SDT, DT and informed DT algorithms for different numbers of observables m for GUE measurements. Each point is averaged over 50 random measurement and signal instances with r = 1, d = 16, n = 10 and s = 3. A signal is considered successfully recovered if its Frobenius norm deviation from the original signal is smaller than 10⁻³. One observes nearly coinciding recovery performances for the informed DT and the SDT algorithm. In comparison, the DT algorithm requires significantly more observables for recovery.

GUE measurements
The SDT algorithm goes beyond existing IHT algorithms for the de-mixing problem of low-rank matrices in that it additionally allows one to exploit a sparse mixture. We demonstrate that this yields a drastic and practically important improvement in the number of measurements required for reconstruction.
To this end, we draw signal instances X = ξ ⊗ ρ at random from Ω^{n,d}_{s,r}. We use four-qubit pure states ρ = |ψ⟩⟨ψ| with r = 1 and d = 16, where |ψ⟩ is drawn uniformly (Haar) at random from the complex ℓ_2-norm sphere. The calibration vector ξ ∈ R^n with n = 10 has a support of size s = 3 drawn uniformly from the set of all (n choose s) possible supports. The non-vanishing entries of ξ are normally distributed with unit variance. The measurements are drawn at random from the GUE ensemble as defined above with a varying number of observables m.
The closest competitor to the SDT algorithm is the related algorithm of Ref. [21]. It coincides with the special case of the SDT algorithm in which one uses the projection onto Ω̃^{n,d}_{n,r} with s = n, ignoring the sparsity in the block structure. We will refer to this algorithm as the DT algorithm. We can also give the DT algorithm the 'unfair' advantage of restricting the problem to the correct block support of the signal from the beginning. We refer to this variant as the informed DT algorithm. Figure 4 shows the recovery rate for the SDT algorithm, the DT algorithm and its informed variant for different m. Each point is averaged over 50 random signal and measurement instances. We consider a signal as successfully recovered if the distance of the algorithm's output to the original signal is smaller than 10⁻³ in Frobenius norm. The algorithms terminate if either the stopping criterion (8) with γ_break = 10⁻⁵ is met or after a maximal number of 600 iterations. We observe that if one of the algorithms successfully recovers a signal, it typically meets the stopping criterion after fewer than 100 iterations.
The curves for all three algorithms in Figure 4 display a sharp phase transition from a regime where the number of measurements is too small to recover any signal to a regime of reliable recovery. While the phase transition for the SDT algorithm appears in a similar regime as for the informed DT algorithm, the DT algorithm requires considerably more samples in order to recover the signal instances.
We conclude that the sparsity of the calibration parameters can be exploited by the SDT algorithm to considerably reduce the required number of measurements. Even more so, this does not require many more sampling points compared to an algorithm that is given a priori knowledge of which errors were present, that is, of the block support of the signal. This shows that the SDT algorithm solves the de-mixing and blind tomography tasks in a highly efficient and scalable way. Finally, the number of possible erroneous measurements A_i can be scaled up at a very low cost in terms of required measurement settings.

Sub-sampled Pauli measurements
For the application of characterizing quantum devices, it is key to compare the recovery performance of the SDT algorithm with standard low-rank quantum tomography algorithms. To this end, note that the SDT algorithm restricted to n = s = 1 is also a state-of-the-art algorithm for standard low-rank state tomography without the on-the-fly calibration. Thus, we will make use of this implementation of conventional low-rank state tomography in the following.
We draw signal instances as before but using three-qubit states, s ∈ {3, 4}, and an altered model for the calibration parameter: we set the first entry of ξ to ξ_0 = 1, and the support of the remaining calibration coefficients is drawn at random. We simulate statistical noise using 10⁸ samples per expectation value in order to realistically limit the resolution of the SDT algorithm.
We simultaneously perform recoveries with the SDT algorithm using the entire measurement matrix, including the calibration measurement components, and with the SDT algorithm using only the target measurement A_0 as in a conventional tomography setting.
The resulting trace distance of the state estimate, i.e., the trace-normalized first block of X, from the original ρ is shown for different numbers of measurements in Figure 5. One observes that conventional low-rank tomography becomes more accurate with an increasing number of measurements but is asymptotically still bounded from below by the systematic error induced by the calibration, on the order of 10⁻¹. This agrees with the order of magnitude of the variance of the calibration coefficients. In contrast, the SDT algorithm, while performing slightly worse in the regime of insufficiently many measurements, outperforms the conventional algorithm for a moderate number of samples and is ultimately only limited by the statistical noise. However, in the parameter regime under investigation there is, even for large numbers of samples m > 150, a small number (well below 10%) of instances where SDT only reaches an accuracy comparable to standard tomography. In these instances we find that the support of the calibration measurement components was incorrectly identified. For s = 4 we furthermore observe one pathological instance of SDT for m = 240 whose recovery is worse than what standard tomography achieves in this regime. For s = 4 the phase transition of SDT appears at slightly larger values of m compared to s = 3. The curves for the reconstruction error of the quantum state approximately coincide with the curves for the error in the calibration parameter. We conclude that, for a sufficient number of measurement settings, the SDT algorithm almost always performs a significantly more accurate state reconstruction and simultaneously extracts the calibration parameters. The precision is ultimately only limited by the statistical error in the estimation of the expectation values.

Pauli measurements with coherent single-qubit errors
We now come back to the concrete realistic scenario described in Section 2. There we derived the calibration measurement model originating from coherent errors in the gates that implement the single-qubit measurements. For the numerical simulations, we draw a set of m Pauli observables uniformly at random as the target measurement. Subsequently, we introduce six calibration blocks such that every observable in the set {X, Y, Z} is swapped with another Pauli observable in {X, Y, Z} in a specific block. We generate data y for given states and calibration parameters using the linear calibration measurement model without noise as induced by finite statistics.
We find that, in the parameter regimes easily amenable to numerical studies on desktop hardware, the SDT algorithm is not capable of successfully reconstructing the states when the calibration parameters for the corrections are considerably smaller than the leading-order measurement. To thoroughly understand this limitation, in the following we briefly report the performance of the SDT algorithm on different sub-tasks related to the recovery problem.
First, we choose d = 16 and n = s = 1 such that only a single block, either the ideal measurement or one of the correction blocks, is used to generate the signal from a random pure state (r = 1). We observe that the SDT algorithm is able to recover the signals in this standard tomography problem. This indicates that the calibration blocks individually also allow for the tomographic reconstruction of low-rank states. Second, the SDT algorithm can discriminate between different mixtures of the six correction blocks. To demonstrate this, we ignore the ideal measurement and employ only the correction blocks to generate the signal. We set the active calibration coefficients to one; thus, n = 6, s ≤ n and ξ_i = 1 for all active i. We observe that, given a sufficient number of measurement settings, the SDT algorithm correctly reconstructs pure states in this measurement setting. The same findings hold true if the target measurement is included again, as long as the active calibration coefficients are set to 1. We observe successful reconstructions of unit-rank states for n = 7 and s ∈ {1, 2, 3}.
A more natural setting, however, would typically have calibration coefficients that are considerably smaller than the ideal measurement; this is what justifies the linear expansion of the measurement model in the first place. If we choose, e.g., ξ_i = 1/10 for the indices i of active blocks, we are unable to identify a parameter regime on desktop hardware where the SDT algorithm can successfully recover the majority of instances of pure states. We observe that if the SDT algorithm settles on an objective variable with an incorrect block support in the first few iterations, in most instances it is not able to subsequently move to objective variables with a different block support. Despite this negative result for the SDT algorithm in the most realistic setting, the general mindset of exploiting structure (low-rankness) to allow quantum state tomography in a blind fashion proves fruitful with a slightly different algorithmic strategy.
To this end, we use the constrained alternating least squares (ALS) algorithm described in Section 4.4. We set the first calibration coefficient, corresponding to the ideal measurement, to one. The support of the remaining active calibration coefficients is drawn uniformly at random and their values are drawn i.i.d. from a shifted normal distribution with standard deviation 0.05 and mean 0.2. We use Haar-random pure states (r = 1) of a four-qubit system, d = 16, as the target states.
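The instance generation just described can be sketched as follows. This is our own illustration with hypothetical function names; the conventions follow the text: the ideal-measurement coefficient is fixed to one, the remaining active coefficients are N(0.2, 0.05²) draws on a uniformly random support, and the target states are Haar-random pure states.

```python
import numpy as np

rng = np.random.default_rng(7)

def haar_random_pure_state(d):
    """Haar-random pure state as a rank-one (r = 1) density matrix."""
    psi = rng.normal(size=d) + 1j * rng.normal(size=d)
    psi /= np.linalg.norm(psi)
    return np.outer(psi, psi.conj())

def random_calibration(n, s):
    """Calibration vector: coefficient 0 (ideal block) fixed to one, the
    other s - 1 active coefficients drawn i.i.d. from N(0.2, 0.05^2)
    on a uniformly random support."""
    xi = np.zeros(n)
    xi[0] = 1.0
    support = rng.choice(np.arange(1, n), size=s - 1, replace=False)
    xi[support] = rng.normal(loc=0.2, scale=0.05, size=s - 1)
    return xi
```

Normalizing a complex Gaussian vector yields the Haar measure on pure states, which is why no explicit unitary integration is needed.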
The algorithm is initialized with a Haar-randomly drawn pure state. We allow for a maximal number of 1000 iterations of the algorithm, or terminate earlier if the criterion (8) with γ_break = 10^{−5} is met. Furthermore, if the stopping criterion is not met after 50 iterations, we re-initialize the algorithm with a new random pure state. We allow for a maximal number of 10 or 20 re-initializations for s = 2 and s = 3, respectively. We observe that in case of successful recovery typically at most 3 re-initializations are required, with most instances already converging correctly from the initial state.
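The termination and restart logic can be summarized by a small generic driver. This is our own sketch: `step`, `init_state` and `residual` are placeholders for the ALS update, the random re-initialization and the stopping criterion (8), none of which are spelled out here.

```python
def run_with_restarts(step, init_state, residual, max_iter=1000,
                      recheck_at=50, max_restarts=10, gamma_break=1e-5):
    """Iterate `step`; stop once `residual` drops below gamma_break.
    If convergence has not set in after `recheck_at` iterations,
    re-initialize with a fresh random state, up to `max_restarts` times."""
    for restart in range(max_restarts + 1):
        x = init_state()
        for it in range(max_iter):
            x = step(x)
            if residual(x) < gamma_break:
                return x, restart, it + 1
            if it + 1 == recheck_at and restart < max_restarts:
                break  # give up on this initialization and restart
    return x, max_restarts, max_iter
```

With a toy contraction step x ↦ x/2 the driver converges from the first initialization, illustrating the "most instances converge from the initial state" behaviour reported above.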
As in the previous section, we compare the recovery performance of the ALS algorithm with the standard low-rank tomography algorithm. The trace-norm error and calibration error for different numbers of measurement settings for s = 2 and s = 3 are displayed in Figures 7 and 8, respectively. We observe that, as expected, the reconstruction error of standard low-rank tomography is again lower-bounded by a scale set by the magnitude of the calibration parameters. In contrast, with an only slightly larger number of measurement settings, the constrained ALS algorithm is capable of recovering the states and the calibration parameters with an accuracy that is improved by orders of magnitude and, in the noiseless scenario, only limited by the algorithm's stopping criterion. Compared to the recovery performance of the SDT algorithm, we observe an even sharper phase transition to the regime of recovery.
Finally, in practice, due to further imperfections such as incoherent noise, the underlying state and calibration vector will only be approximately of low rank and approximately sparse, respectively. In order to probe the robustness of the ALS algorithm against such model mismatch, we generate measurement data as before for s = 2, choosing m = 130, and either add depolarizing noise of varying strength to the state or replace the vanishing coefficients in the calibration vector by the absolute values of Gaussian random variables of varying standard deviation. Figure 9 displays the reconstruction errors of the state and calibration coefficients against the amount of model mismatch, as determined by the depolarizing strength and the Gaussian width. We observe a linear dependence of the reconstruction errors on the model mismatch and conclude that the method is robust enough to tolerate small deviations from the structure assumptions.
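The two kinds of model mismatch probed here can be generated as in the following sketch (our own illustration; function and parameter names are hypothetical):

```python
import numpy as np

def depolarize(rho, p):
    """Depolarizing noise of strength p: the output is exactly rank-one
    only for p = 0, otherwise merely approximately low-rank."""
    d = rho.shape[0]
    return (1 - p) * rho + p * np.eye(d) / d

def perturb_calibration(xi, sigma, rng):
    """Calibration-side mismatch: replace exactly vanishing coefficients
    by absolute values of N(0, sigma^2) draws, so that xi becomes only
    approximately sparse."""
    xi = xi.copy()
    zero = xi == 0
    xi[zero] = np.abs(rng.normal(scale=sigma, size=zero.sum()))
    return xi
```

For a pure state on d = 4 and p = 0.1 the largest eigenvalue drops from 1 to (1 − p) + p/d = 0.925 while the trace stays one, so the state remains close to rank one for small p.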

Summary and outlook
In this work, we have shown that the natural assumption of low-rankness allows one to perform self-calibrating quantum state tomography. Relaxing the blind tomography problem to a sparse de-mixing problem has allowed us to develop an efficient classical post-processing algorithm, the SDT algorithm, that is theoretically guaranteed to recover both the quantum state and the device calibration under a restricted isometry condition of the measurement model. We have demonstrated the necessity of relaxing the blind tomography problem within the framework of hard-thresholding algorithms by establishing the NP-hardness of the projection onto the set consisting of the outer products of sparse vectors and fixed-rank matrices. Introducing a sparsity assumption on the calibration coefficients ensures that the reconstruction scheme can already be applied for fairly small system dimensions. We have explicitly proven that a Gaussian random measurement model meets the required restricted isometry condition with a close-to-optimal measurement complexity in O(s ln n + drs). Furthermore, we have numerically demonstrated an improved performance of the SDT algorithm for random instances of measurement models compared to previously proposed non-sparse de-mixing algorithms and standard low-rank state tomography.
While these generic measurement and calibration models allow us to derive analytical guarantees, it is fair to argue that they might at best capture some aspects of actual experimental implementations. A potential starting point for extending the recovery guarantees to more realistic settings is the generalization of our results to random Pauli measurements as considered in Sec. 6 [42], together with the coherence measures and structured measurement guarantees developed in the context of hierarchically sparse signals [63,22,65,66].
To complement our conceptually and rigorously minded work with a more pragmatic approach, we have additionally developed and implemented a structure-exploiting blind tomography algorithm based on alternating optimization. We have numerically demonstrated that the alternating algorithm is able to perform self-calibrating low-rank tomography in a realistic measurement and calibration model that is well-motivated by gate implementations in ion traps. These numerical simulations indicate that the approach to the blind tomography problem developed here might be well-suited to improve tomographic diagnostics in current experiments. Ultimately, the recovery performance of the proposed algorithms has to be evaluated on experimental data. It is our hope that this work contributes to establishing a new mindset in quantum system identification, and specifically in tomographic recovery, in which no component used has to be precisely known while physically meaningful structural assumptions still hold, a mindset here referred to as semi-device-dependent.

A Hardness of projection
As a starting point we state the SparsePCA problem.
Problem 7 (SparsePCA). Input: A symmetric matrix A ∈ R^{n×n}, a sparsity s ∈ N, and a real number a > 0. Question: Does there exist an s-sparse unit vector v ∈ R^n with v^T A v ≥ a?
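For intuition (and only for very small instances), the decision problem can of course be solved by brute force over all supports. The following sketch is our own illustration, not part of the reduction; it makes explicit that on a fixed support the problem reduces to an ordinary eigenvalue computation, so the hardness lies entirely in the combinatorial choice of the support.

```python
import itertools
import numpy as np

def sparse_pca_decide(A, s, a):
    """Brute-force decision version of SparsePCA: does an s-sparse unit
    vector v with v^T A v >= a exist?  Exhausts all C(n, s) supports;
    on each fixed support the optimum is the largest eigenvalue of the
    restricted matrix."""
    n = A.shape[0]
    best = -np.inf
    for support in itertools.combinations(range(n), s):
        idx = list(support)
        sub = A[np.ix_(idx, idx)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best >= a
```

The loop over all C(n, s) supports is exactly the exponential cost that the hardness results below suggest is unavoidable in general.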
It has been folklore for quite some time that the SparsePCA problem is NP-hard. A formal proof can be found in Ref. [54], where the CLIQUE problem is encoded into instances of SparsePCA. From the hardness of SparsePCA it follows that there does not exist a polynomial time algorithm for the projection onto the set of symmetric, unit rank matrices with sparse eigenvectors, unless P = NP. Formally, we have:

Proposition 8 (Hardness of projection onto the set of symmetric, unit rank matrices with sparse eigenvectors). Given a matrix A ∈ R^{d×n} and s, σ ∈ N, there exists no polynomial time algorithm that calculates

minimize ∥A − λ wv^T∥_F subject to λ ≥ 0, v ∈ Σ^n_s ∩ B^n_{ℓ2}, w ∈ Σ^d_σ ∩ B^d_{ℓ2},

unless P = NP.

Proof. It turns out to be sufficient to only consider the case where σ = d, i.e., only one of the factors is required to be sparse. It is straightforward to see that solving the problem with both vectors being sparse allows one to solve the projection with only one sparse vector: padding A with blocks 0_{a,b}, where 0_{a,b} denotes an a × b matrix filled with zeros, renders the sparsity constraint on w vacuous while leaving the optimal v unchanged. We now embed the SparsePCA problem. To do so, we first make the normalization of the vectors v, w in the optimization problem explicit, turning it into an optimization over normalized vectors,

minimize ∥A − λ wv^T∥_F subject to λ ≥ 0, v ∈ Σ^n_s ∩ B^n_{ℓ2}, w ∈ B^d_{ℓ2},   (11)

with B^n_{ℓ2} = {v ∈ R^n | ∥v∥_{ℓ2} ≤ 1} the ℓ2-norm ball. Solving the optimization problem over λ yields λ = |⟨w, Av⟩| at the optimum, with optimal value ∥A∥²_F − ⟨w, Av⟩². Since A is fixed, we conclude that the optimization problem (11) is equivalent to

maximize |⟨w, Av⟩| subject to v ∈ Σ^n_s ∩ B^n_{ℓ2}, w ∈ B^d_{ℓ2}.

Furthermore, using the Cauchy-Schwarz inequality we find that the maximum over w is attained at w = Av/∥Av∥_{ℓ2}, so that the optimal value equals the maximum of ∥Av∥_{ℓ2} over v ∈ Σ^n_s ∩ B^n_{ℓ2}. Now consider an instance of the SparsePCA problem with a symmetric input matrix B ∈ R^{n×n}, sparsity s and a > 0. W.l.o.g. we can assume that B is positive semidefinite, since solving the SparsePCA problem for B − min{0, λ_min(B)} Id, i.e. B shifted by its smallest eigenvalue λ_min(B), with a shifted correspondingly, allows one to solve the SparsePCA problem for B. For a positive semidefinite matrix B we find a factorization B = A^T A.
Hence, deciding whether the maximum of w^T B w over all s-sparse unit vectors w is at least a is solved by calculating the maximum of ∥Aw∥²_{ℓ2} = w^T B w. This completes the reduction.
We are now prepared to tackle our related problem: the projection onto Ω^{n,d}_{s,r}. We have the following statement:

Theorem 9 (Hardness of constrained minimization). There exists no polynomial time algorithm that calculates for all A ∈ C^{nd×d}

minimize ∥A − X∥_F subject to X ∈ Ω^{n,d}_{s,r},   (12)

unless P = NP. This still holds for s = n.
We note that our result for the exact computation straightforwardly generalizes to approximating the target function up to constant relative error, using results on the approximability of SparsePCA [58].
Proof. Suppose there existed an efficient algorithm that determines the objective value of the projection (12). To encode the SparsePCA problem, we choose an instance of A as follows. Let A′ ∈ R^{n×d} be a matrix and let A′_i denote the ith row of A′. Let e_i be the basis vectors, (e_i)_j = δ_{i,j}, with δ_{i,j} the Kronecker symbol. We take A ∈ C^{nd×d} to be built from the vector in R^{nd} arising from concatenating all rows of A′. By definition, an X ∈ Ω^{n,d}_{s,r} can be decomposed as X = ξ ⊗ ρ with ξ ∈ Σ^n_s and ρ ∈ H^d_r. Let ρ = U diag(λ) U† be the eigenvalue decomposition of ρ with a suitable unitary U ∈ U(d) and λ the vector of its eigenvalues. We can then rewrite the objective using the unitary invariance of the ℓ2-norm, introduce the doubly stochastic matrix W with entries W_{k,l} = |U_{k,l}|², and relax the optimization over U to an optimization over doubly stochastic W. The optimum of the relaxed problem is attained for a permutation matrix W = Π_σ, corresponding to a unitary with entries U_{i,j} = (Π_σ)_{i,j}. Thus, an algorithm calculating the projection onto Ω^{n,d}_{s,r} for the matrix A chosen here solves the SparsePCA problem for A′. We conclude that there exists no polynomial time algorithm for the problem.

B Convergence proof
In this section we provide the proof of Theorem 5. We first introduce some more notation. Consider X ∈ Ω̃^{n,d}_{s,r}. By definition, it has at most s non-vanishing blocks, each of rank at most r. Note that the projection onto Ω̃^{n,d}_{s,r} simultaneously projects onto the "block-wise support" of X.
It is common and useful to rewrite RIP inequalities such as the one in Definition 4 as an equivalent spectral condition on restrictions of A† ∘ A.

Proposition 10. Let A : C^{nd×d} → R^m be a linear map and δ > 0. Then the following two statements are equivalent:
(i) (1 − δ)∥X∥²_F ≤ ∥A(X)∥²_{ℓ2} ≤ (1 + δ)∥X∥²_F for all X ∈ Ω̃^{n,d}_{s,r};
(ii) ∥P ∘ (A† ∘ A − Id) ∘ P∥ ≤ δ for every orthogonal projection P onto a subspace contained in Ω̃^{n,d}_{s,r}.   (14)

Proof. The inequality (i) is equivalent to |⟨X, (A† ∘ A − Id)(X)⟩| ≤ δ∥X∥²_F for all X ∈ Ω̃^{n,d}_{s,r}. Restricting the quadratic form to the range of a projection P as in (ii), and using that the supremum of the quadratic form of a Hermitian map over a subspace equals the spectral norm of its restriction, the last bound is equivalent to (14).
We will now prove the recovery guarantee, Theorem 5. The derivation of recovery guarantees for the IHT algorithm follows largely the blueprint developed in the original IHT proposal for sparse vectors [20]; see also Ref. [19] for a detailed description of the proof. Here, we are additionally in the comfortable position that Ref. [21] has already fleshed out the details of the recovery proof for an IHT algorithm for de-mixing low-rank matrices. However, in order to accommodate a non-trivial choice of the step width, the proof of Ref. [21] yields a slightly weaker result than what can be shown by a simpler argument for a fixed step width. Thus, we give a slightly simpler proof that carefully adapts the one given in Ref. [21] to account for the additional sparsity constraint and uses a slightly more concise notation.
Proof of Theorem 5. Let X ∈ Ω̃^{n,d}_{s,r} be the matrix to be recovered. Let X^l denote the lth iterate of the vector of matrices in the SDT algorithm (Algorithm 2). Since the algorithm always involves a projection step onto Ω̃^{n,d}_{s,r}, the lth iterate X^l is in Ω̃^{n,d}_{s,r}. Furthermore, we observe that X + X^l + X^{l+1} ∈ Ω̃^{n,d}_{3s,3r}. For convenience, we denote the projection onto the ("block-wise") joint range and support of X, X^l and X^{l+1} simply by P_l, and its orthogonal complement by P_l⊥. It is crucial for the proof to bound norm deviations restricted to the range of P_l, as this eventually allows us to apply a RIP bound.
We want to show the convergence of the iterates X^l of the algorithm to the correct solution X. In other words, we want to derive a bound of the form

∥X^{l+1} − X∥_F ≤ γ ∥X^l − X∥_F

with constant γ < 1. Note that, by the theorem's assumption, we set the step width to µ_l = 1 and omit the tangent space projection P_{T_{X^l}}.
We first derive the following consequence of the thresholding operation. Let G^l := A†(y − A(X^l)) = A† ∘ A(X − X^l). By the definition of X^{l+1} as the best approximation to X^l + G^l in Ω̃^{n,d}_{s,r}, it holds that

∥X^l + G^l − X^{l+1}∥_F ≤ ∥X^l + G^l − X∥_F.

Since the parts of both sides of the inequality in the range of P_l⊥ coincide (X and X^{l+1} both lie in the range of P_l), we get the same inequality also with P_l inserted:

∥P_l(X^l + G^l − X^{l+1})∥_F ≤ ∥P_l(X^l + G^l − X)∥_F.

With the help of this inequality, we can bound

∥X^{l+1} − X∥_F ≤ ∥P_l(X^{l+1} − X^l − G^l)∥_F + ∥P_l(X^l + G^l − X)∥_F ≤ 2∥M_1(X^l − X)∥_F ≤ 2∥M_1∥ ∥X^l − X∥_F,   (15)

where in the last steps we used the definition of G^l, the fact that P_l acts trivially on X^l − X, and defined M_1 := P_l ∘ (Id − A† ∘ A) ∘ P_l. To arrive at the theorem's assertion, we now bound the spectral norm of M_1 using the RIP property of A and Proposition 10:

∥M_1∥ ≤ δ,   (16)

since the range of P_l is contained in Ω̃^{n,d}_{3s,3r}. Using (16) in (15) completes the proof.
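For concreteness, the thresholding iteration analysed in this proof can be sketched as follows. This is our own illustrative implementation, not the authors' Algorithm 2: the projection onto Ω̃^{n,d}_{s,r} truncates each block to rank r via an SVD and then keeps the s blocks of largest Frobenius norm, and `A_op`/`A_adj` are placeholders for the measurement map and its adjoint.

```python
import numpy as np

def project_blocks(X, s, r):
    """Projection onto s-block-sparse vectors of rank-r blocks: truncate
    each block to rank r, then keep the s blocks of largest norm."""
    truncated = []
    for block in X:
        U, sv, Vh = np.linalg.svd(block)
        sv[r:] = 0.0
        truncated.append((U * sv) @ Vh)
    norms = [np.linalg.norm(b) for b in truncated]
    keep = set(int(i) for i in np.argsort(norms)[-s:])
    return [b if i in keep else np.zeros_like(b)
            for i, b in enumerate(truncated)]

def iht_step(X, y, A_op, A_adj, s, r, mu=1.0):
    """One iterate X^{l+1} = P(X^l + mu * A^dagger(y - A(X^l)))."""
    G = A_adj(y - A_op(X))
    return project_blocks([Xi + mu * Gi for Xi, Gi in zip(X, G)], s, r)
```

Note that this projection computes the best approximation block-wise, which is exactly the relaxed projection that makes the scheme tractable, in contrast to the NP-hard projection onto Ω^{n,d}_{s,r} of Appendix A.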
Theorem 5 assumes that the input data y for the SDT algorithm originate from a signal X ∈ Ω̃^{n,d}_{s,r}. In particular, we are assuming a bound on the block-sparsity s and the rank r of the blocks. In practice, one will often encounter the situation that the signal producing the data is not exactly sparse and of low rank but rather well-approximated by a structured signal. It is straightforward to also derive a model-robust version of Theorem 5. We here briefly sketch the required modifications to the proof. Let X̃ ∈ C^{nd×d} be an arbitrary signal without the hierarchical structure and suppose that y = A(X̃) = A(X) + A(X̃ − X) with X = P_{Ω̃^{n,d}_{s,r}}(X̃) is the input to the SDT algorithm. In the proof, we can adapt the definition of G^l to include the additional term involving the model mismatch X̃ − X. In the following steps, we can then account for the model-mismatch term by an additional summand on the right-hand side of (15) proportional to ∥X̃ − X∥_F. In this way, we establish a bound of the form

∥X^{l+1} − X∥_F ≤ γ ∥X^l − X∥_F + C ∥X̃ − X∥_F

for some constant C, independent of n, d, s, and r. We thus conclude that if the signal violates the structure assumptions, the SDT algorithm converges to the projection of the signal onto Ω̃^{n,d}_{s,r} up to an accuracy proportional to the magnitude of the model mismatch, measured as ∥X̃ − P_{Ω̃^{n,d}_{s,r}}(X̃)∥_F.

C RIP guarantee for Hermitian random matrices
In this section we provide the proof of Theorem 6 that establishes the RIP condition for measurement matrices consisting of Hermitian matrices i.i.d. drawn from the GUE. Establishing RIP conditions for Gaussian matrices for a set of structured signals typically proceeds in two steps: One first derives a strong concentration result for a single signal in the set using standard concentration of measure. Second, one takes the union bound over the signal set with the help of an ϵ-covering net construction to arrive at the uniform statement of the RIP. We can readily adapt this strategy to the GUE.
For the first step, we derive a Gaussian-type concentration result, modifying a standard line of arguments for our example, see, e.g., Ref. [21]. The result is summarized in the following lemma.

Lemma 11 (Concentration for a fixed signal). Let X ∈ Ω̃^{n,d}_{s,r} and let A be the measurement map of Eqs. (5) and (6). Then, for 0 < δ < 1,

(1 − δ)∥X∥²_F ≤ ∥A(X)∥²_{ℓ2} ≤ (1 + δ)∥X∥²_F

holds with probability at least 1 − 2e^{−mδ²/C_δ} and constant C_δ ≥ 40.
Our proof essentially follows the argument of Ref. [21] for Gaussian measurements and then exploits that the Hermitian blocks of the signal X ∈ Ω̃^{n,d}_{s,r} only overlap with the Hermitian part of the Gaussian measurement matrix.
Proof. Let X ∈ Ω̃^{n,d}_{s,r} and denote its n d × d blocks by x_i. Consider a set {B^{(k)}_i ∈ C^{d×d}}_{k,i=1}^{m,n} of m · n d × d matrices with entries independently drawn from the complex-valued normal distribution. Let A^{(k)}_i := P B^{(k)}_i be the corresponding matrices drawn from the GUE, where P denotes the projection onto the Hermitian part, and let A be the corresponding measurement map. Since all blocks x_i are Hermitian, we have ⟨A^{(k)}_i, x_i⟩ = ⟨P B^{(k)}_i, x_i⟩, and the inner products {⟨A^{(k)}_i, x_i⟩} are i.i.d. real random variables from the distribution N(0, ∥x_i∥²_F) for all i and k. We conclude that all entries y_k = A(X)^{(k)} of A(X) are Gaussian distributed with variance σ² = Σ_i ∥x_i∥²_F = ∥X∥²_F and have even moments E[y_k^{2t}] = 2^{−t} t! (2t choose t) σ^{2t} [19, Corollary 7.7]. Correspondingly, the squared entries are subexponential random variables with mean E[y_k²] = σ². We denote the associated centred subexponential variable by z_k := y_k² − σ². The moments of z_k are bounded by

E[|z_k|^t] ≤ 2^t E[y_k^{2t}] = t! (2t choose t) σ^{2t},

where the first inequality follows from the triangle and Jensen's inequalities. The binomial can be upper bounded using Stirling's formula [19, (C.13)] by (2t choose t) = 4^t r_t/√(πt) with r_t ≤ e^{1/(24t)}. Thus, for t ≥ 2 we have E[|z_k|^t] ≤ t! R^{t−2} Σ²/2 with R = 4σ² and Σ² = √(2/π) e^{1/48} · 16σ⁴ ≤ 0.815 · 16σ⁴. Controlling the moments of z_k for t ≥ 2, we can apply the Bernstein inequality [19, Theorem 7.30] and bound the probability that ∥A(X)∥²_{ℓ2} deviates by more than ∆ > 0 from its expectation value, yielding the tail bound (17). Setting ∆ = δ∥X∥²_F for some 0 < δ < 1, we can rewrite the tail bound (17) in the form (18) with a constant C_δ ≥ 40. Hence, the condition of Lemma 11 holds with probability at least 1 − 2e^{−mδ²/C_δ}.
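The distributional claim at the start of the proof is easy to check numerically. The sketch below is our own illustration; in particular, the normalization convention (real and imaginary parts of every entry of B drawn as independent standard normals) is an assumption, under which ⟨A, x⟩ = Re tr(Bx) has variance ∥x∥²_F for Hermitian x.

```python
import numpy as np

rng = np.random.default_rng(0)

def hermitian_part_of_gaussian(d):
    """Hermitian part of a complex Gaussian matrix: real and imaginary
    parts of each entry are i.i.d. standard normal."""
    B = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (B + B.conj().T) / 2

# Monte-Carlo check: for a normalized Hermitian x, the measurement
# outcome <A, x> should be approximately standard normal.
d = 4
x = hermitian_part_of_gaussian(d)   # any Hermitian test signal
x /= np.linalg.norm(x)              # normalize so the target variance is 1
samples = np.array([np.trace(hermitian_part_of_gaussian(d) @ x).real
                    for _ in range(50000)])
```

The empirical mean and variance of `samples` concentrate around 0 and 1, consistent with the N(0, ∥x∥²_F) claim used in the proof.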
Note that by the homogeneity of the RIP condition it suffices to restrict ourselves to normalized elements of Ω̃^{n,d}_{s,r} in the proof of Theorem 6. In the following, we will therefore focus on the set Ω^{n,d}_{s,r} := {X ∈ Ω̃^{n,d}_{s,r} | ∥X∥²_F = 1}.
To take the union bound over the set Ω^{n,d}_{s,r} we need to bound the size of an ϵ-net that covers it. An ϵ-net S covering a set of matrices M ⊂ C^{nd×d} is a finite subset of M such that for all X ∈ M there exists X̃ ∈ S with ∥X − X̃∥_F ≤ ϵ. Our construction generalizes the construction of Ref. [21], where a covering net for the set of normalized block-wise low-rank matrices Ω^{n,d}_{n,r} was derived. We summarize the statement given in Ref. [21] in the following lemma without giving a proof.
The proof of Lemma 12 basically lifts an ϵ-net for low-rank matrices from Ref. [81] to the set Ω^{n,d}_{n,r} using the triangle inequality.
We can combine multiple ϵ-nets for Ω^{s,d}_{s,r} to construct an ϵ-covering net for the set Ω^{n,d}_{s,r} of block-sparse matrix vectors with low-rank blocks. The bound on the cardinality of the resulting ϵ-covering net is given in the following lemma:

Lemma 13 (Bound on the cardinality of a covering net). For Ω^{n,d}_{s,r} there exists an ϵ-covering net S^{n,d}_{s,r} of cardinality bounded by (n choose s) · (9/ϵ)^{(2d+1)sr}. Furthermore, for each X = [X_1, ..., X_n] ∈ Ω^{n,d}_{s,r} there exists X̃ = [X̃_1, ..., X̃_n] ∈ S^{n,d}_{s,r} such that ∥X − X̃∥_F ≤ ϵ and ∥X̃_k∥_F = 0 for all k for which ∥X_k∥_F = 0.
Proof. Let Γ ⊂ [n] with |Γ| ≤ s denote the support of an s-sparse block vector. For each such Γ, Lemma 12 provides an ϵ-net of cardinality at most (9/ϵ)^{(2d+1)sr} for the elements of Ω^{n,d}_{s,r} with block support contained in Γ. The union S^{n,d}_{s,r} of these nets is an ϵ-covering net for Ω^{n,d}_{s,r}. The union is taken over (n choose s) different sets. Thus, the cardinality of S^{n,d}_{s,r} is upper bounded by (n choose s) · (9/ϵ)^{(2d+1)sr}. The second statement follows by construction.
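Taking logarithms of the bound in Lemma 13 recovers the measurement complexity O(s ln n + drs) quoted in the summary: the logarithm of (n choose s) contributes the s ln n term, and the exponent (2d + 1)sr contributes the drs term. A one-line helper (our own illustration) makes this explicit:

```python
from math import comb, log

def net_cardinality_log(n, d, s, r, eps):
    """Logarithm of the covering-net cardinality bound
    (n choose s) * (9 / eps)^((2d + 1) * s * r) from Lemma 13."""
    return log(comb(n, s)) + (2 * d + 1) * s * r * log(9 / eps)
```

Both summands grow linearly in the block-sparsity s, matching the O(s ln n + drs) scaling.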
We are now in the position to prove Theorem 6.
Proof of Theorem 6. The proof proceeds in two steps. First, we prove the RIP for elements of the ϵ-covering net S^{n,d}_{s,r} of Ω^{n,d}_{s,r}. To do so, we combine the concentration result of Lemma 11 and the union bound of Lemma 13 to establish uniform concentration. In a second step, following Ref. [21], we then use the definition of an ϵ-covering net to show that for elements X ∈ Ω^{n,d}_{s,r} that are close enough to an element of the net, the RIP condition still holds.
Step 1: Taking the union bound over the ϵ-net S^{n,d}_{s,r} constructed in Lemma 13 and using the result of Lemma 11 in the form of (18) with constant C_δ ≥ 40, we bound the probability that the RIP fails on some element of the net. The aim is to find a lower bound on the number of measurements m for which the RIP holds on all elements of the net with probability at least 1 − τ.
Step 2: Let us now transfer the RIP from S^{n,d}_{s,r} to the entire set Ω^{n,d}_{s,r} while keeping the error under control. To this end, we choose the net parameter as ϵ = δ/(4√2). By definition of an ϵ-net, for every element X ∈ Ω^{n,d}_{s,r} there exists an element X̃ ∈ S^{n,d}_{s,r} such that

∥X − X̃∥_F ≤ δ/(4√2).   (22)

To prove the RIP for the set Ω^{n,d}_{s,r} we need to bound ∥A(X)∥_{ℓ2} from above and below.
We start with the upper bound, making use of Eq. (22):

∥A(X)∥_{ℓ2} ≤ ∥A(X̃)∥_{ℓ2} + ∥A(X − X̃)∥_{ℓ2}.   (23)

Now ∥A(X − X̃)∥_{ℓ2} has to be bounded from above. We use that, by the second statement of Lemma 13, the block supports of X and X̃ coincide. Therefore, X − X̃ also has s non-vanishing blocks, each of rank at most 2r. We can thus decompose X − X̃ = B + C in terms of matrices B, C ∈ Ω̃^{n,d}_{s,r} that obey ⟨B, C⟩ = 0. In particular, B and C have the same block support as X. Let us define κ_{s,r} := sup_{X ∈ Ω^{n,d}_{s,r}} ∥A(X)∥_{ℓ2}.

Then we get, using homogeneity,

∥A(X − X̃)∥_{ℓ2} ≤ ∥A(B)∥_{ℓ2} + ∥A(C)∥_{ℓ2} ≤ κ_{s,r} (∥B∥_F + ∥C∥_F) ≤ √2 κ_{s,r} ∥X − X̃∥_F ≤ δ κ_{s,r}/4,   (24)

where the last steps make use of the orthogonality of B and C together with Eq. (22). Together with (23) it follows that

∥A(X)∥_{ℓ2} ≤ ∥A(X̃)∥_{ℓ2} + δ κ_{s,r}/4.   (25)

It remains to derive an upper bound for κ_{s,r}. To this end, we use that, by definition, κ_{s,r} is the best upper bound of the left-hand side of (24). Inserting (25) into the right-hand side of (24), we find the condition κ_{s,r} ≤ √(1 + δ/2) + δ · κ_{s,r}/4.