The battle of clean and dirty qubits in the era of partial error correction

When error correction becomes possible it will be necessary to dedicate a large number of physical qubits to each logical qubit. Error correction allows for deeper circuits to be run, but each additional physical qubit can potentially contribute an exponential increase in computational space, so there is a trade-off between using qubits for error correction or using them as noisy qubits. In this work we look at the effects of using noisy qubits in conjunction with noiseless qubits (an idealized model for error-corrected qubits), which we call the "clean and dirty" setup. We employ analytical models and numerical simulations to characterize this setup. Numerically we show the appearance of Noise-Induced Barren Plateaus (NIBPs), i.e., an exponential concentration of observables caused by noise, in an Ising model Hamiltonian variational ansatz circuit. We observe this even if only a single qubit is noisy, given a deep enough circuit, suggesting that NIBPs cannot be fully overcome simply by error-correcting a subset of the qubits. On the positive side, we find that for every noiseless qubit in the circuit, there is an exponential suppression in concentration of gradient observables, showing the benefit of partial error correction. Finally, our analytical models corroborate these findings by showing that observables concentrate with a scaling in the exponent related to the ratio of dirty-to-total qubits.


Introduction
Since Feynman's seminal proposal [1], quantum computing has moved from a purely theoretical exercise to real devices. Yet, quantum computers today are still far from an ideal machine able to truly harness the full power of fault-tolerant computing. Current devices do not have the qubit counts or gate fidelities required to implement large-scale error correction. This means that only small-scale implementations of error-correcting codes are currently feasible [2]. Therefore, Shor's factoring algorithm [3], the Harrow-Hassidim-Lloyd algorithm [4], and the myriad other possibilities this computational paradigm could bring remain a distant reality.
An important primitive in quantum computing is estimating the expectation value of some operator at the end of a quantum circuit. This appears in the Noisy Intermediate Scale Quantum (NISQ) [5] setting for the case of variational quantum algorithms [6,7] and quantum machine learning [8], but also in fault-tolerant algorithms [9]. However, it has been shown that noise can have detrimental effects resulting in the exponential concentration of expectation values [10][11][12][13]. In the specific case of variational quantum algorithms and quantum machine learning, this implies untrainability of the models, i.e., exponential scaling of resources required for training [13][14][15][16][17][18]. This scaling phenomenon is known as a Noise-Induced Barren Plateau (NIBP) [13], which is a specific type of barren plateau [14][15][16][17][18][19][20][21][22]. In addition, it was also shown that error mitigation cannot reverse the effect of noise on expectation value concentration [23]. Therefore, noise presents one of the largest obstacles to the practical utility of quantum computers and to obtaining near-term quantum advantage. This leads to a critical question that we aim to address in this work: how can we mitigate, or remove, the effect of NIBPs and exponential concentration due to noise?
To investigate this question we construct a quantum computational model that employs "clean" and "dirty" qubits. Here, we define clean qubits as qubits on which no noise channels are applied, which we envision as the best-case implementation of an error-correcting code. On the other hand, the dirty qubits in our setup are affected by noise. This setup presents a novel, physically motivated paradigm where we can explore the scalability of cost concentration in quantum algorithms. We can motivate this setup as an idealized scenario of the regime where the number of qubits becomes sufficiently large to implement error correction on some, but not all, of the qubits in a quantum computer. The intersection of NISQ and quantum error correction has been considered in other recent works [24][25][26]. We note our setup is different from earlier clean qubit models [27][28][29], which work on a system where one or more qubits are in a known state, such as |0⟩, and the rest are in the maximally mixed state.
We find that using clean qubits can mitigate the effect of NIBPs. Specifically, we find that using clean qubits increases the depth accessible to an ansatz before the magnitude of the gradients vanishes. However, NIBPs are not avoided unless all qubits are clean. We analytically explore the effect of clean qubits on the concentration of expectation values and gradient using two toy models. The first is a CNOT ladder acting on an input state followed by depolarizing noise on the dirty qubits. The second explores two initially disjoint subsystems, one or both of which are noisy, that are subsequently connected via CNOTs. Both show that the expectation value concentrates with a scaling in the exponent related to the ratio of dirty to total qubits.
Our numerics arrive at essentially the same conclusions as our analytics. Here, we numerically explore the magnitude of the gradient when optimizing the Hamiltonian Variational Ansatz (HVA) for systems of 4, 6, and 8 qubits under a depolarizing noise model and also a more realistic model of a trapped ion quantum computer. This allows us to investigate how the gradient scales with system size under the effect of noise. We show that using clean qubits leads to gradient scaling similar to the case of reducing error rates of the noise on every qubit. This in turn means that deeper circuits can be employed before NIBPs become an issue. Therefore, our work presents a possible avenue to improve the reach of quantum computing using a combination of error-corrected and dirty qubits.
Figure 1: The clean and dirty setup. A schematic example of the clean and dirty setup. We have n c noiseless, clean qubits at which no errors happen. For these qubits, we represent the action of a gate by its unitary U. We also have n d uncorrected, dirty qubits. The action of a gate at these qubits is represented by U followed by a noise channel N. In the case of gates acting at both the dirty and the clean qubits, we represent them by U followed by a noise channel N acting only at the dirty qubits.

The Clean and Dirty Computational Model
In this work, we investigate how the emergence of exponential concentration and NIBPs is affected by introducing clean, noiseless qubits in a quantum computer. Such a setup may provide insights into an early era of quantum error correction when, due to limited numbers of physical qubits available, the number of error-corrected, high-quality logical qubits will be severely restricted [33][34][35]. In such a case, a question arises as to whether combining a limited number of high-quality logical qubits with uncorrected, noisy qubits could mitigate or perhaps even prevent the exponential concentration of expectation values and gradients. If so, this would highlight the power of post-NISQ quantum computing architectures, relative to their NISQ counterparts.
To provide insight into these questions, we consider an idealized model of such a quantum computer, with n qubits, partitioned into n c noiseless, clean qubits, and n d noisy, dirty qubits. We assume that the action of a gate g, defined by a unitary U, on the clean qubits of a state ρ is
\[ \rho \to U \rho U^{\dagger}. \tag{1} \]
When g acts on the dirty qubits, we represent its action by U followed by a noise channel N acting only on the dirty qubits:
\[ \rho \to \mathcal{N}(U \rho U^{\dagger}). \tag{2} \]
For a gate g acting between a dirty and a clean qubit, we assume no noise acts on the clean qubit. We call this setup the clean and dirty setup, and we illustrate it schematically in Figure 1. This simple setup can be used to understand the fundamental limitations of a quantum computation involving both high-quality logical qubits and low-quality qubits, the latter being either uncorrected or logical. For n d = 0 the setup is equivalent to a perfect, noiseless quantum computation, and for n c = 0 it becomes equivalent to a noisy quantum device.
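As a minimal sketch of the noise step in Eq. (2), the snippet below applies single-qubit depolarizing noise to the dirty qubits only, working in the Pauli expansion of a two-qubit state. The function names, the choice of a Bell state, and the depolarizing convention D_p(ρ) = (1 − p)ρ + p I/2 are illustrative assumptions rather than part of the original model description.

```python
from itertools import product

def bell_state_coefficients():
    """Pauli coefficients c_P = Tr[rho P] of the Bell state (|00> + |11>)/sqrt(2)."""
    coeffs = {"".join(p): 0.0 for p in product("IXYZ", repeat=2)}
    coeffs.update({"II": 1.0, "XX": 1.0, "YY": -1.0, "ZZ": 1.0})
    return coeffs

def depolarize_dirty(coeffs, dirty, p):
    """Depolarize only the qubits listed in `dirty`.

    Assumed convention: D_p(rho) = (1 - p) rho + p I/2, which damps the
    X, Y, Z components on a qubit by (1 - p) and leaves I untouched.
    """
    out = {}
    for pauli, c in coeffs.items():
        weight = sum(1 for q in dirty if pauli[q] != "I")
        out[pauli] = c * (1 - p) ** weight
    return out

def purity(coeffs):
    """Tr[rho^2] = (1 / 2^n) * sum_P c_P^2 for an n-qubit Pauli expansion."""
    n = len(next(iter(coeffs)))
    return sum(c * c for c in coeffs.values()) / 2 ** n

rho = bell_state_coefficients()
noisy = depolarize_dirty(rho, dirty=[0], p=0.2)  # qubit 0 dirty, qubit 1 clean
print("purity before:", purity(rho))
print("purity after: ", purity(noisy))
```

Even though qubit 1 is clean here, the correlations it shares with the dirty qubit are damped, which is the basic mechanism the later sections quantify.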
We note that this idealized setup neglects error rates of real-world logical qubits due to imperfections of quantum error correction. Furthermore, the model does not account for the larger number of native hardware gates necessary to implement logical non-Clifford gates compared to the Clifford ones [36]. This overhead may result in a higher effective error rate for logical non-Cliffords that the clean-and-dirty setup does not consider. Therefore, we treat this work as an exploration of the potential of quantum algorithms combining qubits with large and small effective error rates rather than a realistic model of future quantum devices.
An alternative approach for early implementations of quantum error correction is to use less robust quantum error correction codes with lower distances to detect and correct only some of the errors. An example of this approach is presented in Ref. [37]. In general, the error rates of logical qubits are expected to decay exponentially with the code distance, with the base of the exponential being proportional to the ratio of the physical qubits' error rate to its threshold value [35]. Therefore, at the beginning of the quantum error correction era, when the physical error rates will be close to the threshold, obtaining high-quality logical qubits would require a large code distance and consequently a large number of physical qubits per logical qubit [35]. Consequently, implementing less powerful codes would require far fewer physical qubits, making them more suitable for the first realizations of quantum error correction.
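To make the overhead argument concrete, the following sketch estimates the code distance needed to reach a target logical error rate. It assumes the common heuristic p_L ≈ A (p/p_th)^((d+1)/2) and a rotated surface code with 2d² − 1 physical qubits per logical qubit; the prefactor A = 0.1, the code family, and all function names are assumptions for illustration and are not taken from Ref. [35].

```python
def logical_error_rate(ratio, distance, prefactor=0.1):
    """Heuristic p_L = A * (p / p_th)^((d + 1) / 2); A and the exponent form are assumptions."""
    return prefactor * ratio ** ((distance + 1) / 2)

def min_odd_distance(ratio, target, prefactor=0.1, max_distance=1001):
    """Smallest odd code distance whose heuristic logical error rate is <= target."""
    for d in range(3, max_distance + 1, 2):
        if logical_error_rate(ratio, d, prefactor) <= target:
            return d
    raise ValueError("target not reachable below max_distance")

def physical_qubits_per_logical(distance):
    """Rotated surface code (assumed): d^2 data qubits plus d^2 - 1 measurement qubits."""
    return 2 * distance ** 2 - 1

for ratio in (0.9, 0.5, 0.1):  # physical error rate divided by threshold
    d = min_odd_distance(ratio, target=1e-9)
    print(f"p/p_th = {ratio}: distance {d}, {physical_qubits_per_logical(d)} physical qubits")
```

Under these assumptions, physical error rates close to the threshold drive the required distance, and hence the qubit count per logical qubit, up sharply.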
Nevertheless, to obtain a physical qubit count large enough for a quantum advantage, it might be necessary to supplement such logical qubits with noisy, physical ones. Furthermore, a partition of the device qubits into lower-quality error-corrected qubits and noisy ones would increase the number of computational qubits whilst potentially retaining some advantages of quantum error correction. Therefore, a combination of the noisy and quantum-error-corrected qubits modeled by the clean and dirty setup might be the optimal solution when a limited number of device qubits is the primary limitation on the computer's computational power.
While this motivation applies to the first generations of quantum computers utilizing quantum error correction, the clean and dirty setup can also provide insights into later, more advanced architectures combining high and lower-quality logical qubits. Such a combination might be desirable to optimally utilize qubit counts of quantum hardware when the number of qubits is insufficient to use high-quality error correction for all logical qubits. Furthermore, in the case of algorithms having qubits with particularly many gates acting upon them, many more errors will occur at these qubits. In such a case, using a code with a higher distance for these qubits might be a natural choice. An example of such an algorithm is quantum phase estimation with a large unitary controlled by an ancilla qubit [38]. Therefore, we expect insights gained with the clean and dirty model to potentially remain relevant beyond the early era of quantum error correction.

Concentration of expectation values
In this section, we present two analytical models in which exponential scaling effects due to noise are observed, even though a proportion of the registers are noise-free. Such scaling effects are present even if there is only one noisy register. The two settings demonstrate two different mechanisms in which noise can spread through a system via entangling gates. In the first setting, we consider a simple circuit where clean and dirty qubits interact via cycles of CNOT gates. Here, we find that the CNOTs allow entropy to spread from noisy to clean registers despite the fact that if the initial state is in the computational basis, no entanglement is actually created between the subsystems. In the second case, we consider a setting where two subsystems, which can in principle be initially disconnected, undergo an entangled measurement. We show that, due to the entangled measurement, local noise in the dirty qubits leads to decoherence and information loss in the joint system of all clean and dirty qubits. In both models, we find that the output distribution of the circuit, as measured in the computational basis, is exponentially indistinguishable from the uniform distribution with increasing circuit depth. However, in both settings, this concentration effect is dampened with increasing number of clean qubits, compared to the number of dirty qubits.
In the first setting, we consider the interaction of local noise with CNOT ladders, as shown in Figure 2. Ladders of 2-qubit gates can be found as a primitive in quantum machine learning settings (e.g. in [39]) as well as in variational quantum algorithm settings (e.g. for the unitary coupled cluster ansatz when one implements an exponentiation of a multi-qubit Pauli operator [40]). In order to understand how such ladders interact with noise, we consider a simplified model where layers of CNOT ladders are interleaved with instances of local depolarizing noise, where the local noise only acts on the first n d qubits. In the following proposition, proved in Appendix A, we show how the output distribution converges with circuit depth.

Figure 2: CNOT ladders. In our first analytical example we consider an input state ρ acted on by L layers of CNOT ladders. After each layer, we consider an instance of local depolarizing noise $\mathcal{D}_p$ with depolarizing probability p acting on each of the first n d registers. We compare the output distribution as measured in the computational basis to the uniform distribution.

Proposition 1. Consider the circuit in Figure 2 with L layers, n d dirty qubits each with depolarizing noise with depolarizing probability p in each layer, and computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. Denote the channel that describes the action of the CNOT ladders and noise in the circuit as $\mathcal{W}^{(n_d)}_{L}$. For any input state ρ, the expectation value concentrates as

[...] (3)

for any computational basis measurement, where $P(\rho) = \mathrm{Tr}[\rho^2]$ denotes the purity of ρ, and we have defined

[...]

Proposition 1 shows that the output distribution of the CNOT ladder circuit exponentially concentrates in the number of layers of the circuit. Moreover, compared to the setting where all registers are noisy, the exponent is scaled by a factor approximately equal to $n_d/n$. Thus, we can consider the scaling to depend on an effective noisy depth $L_{\mathrm{eff}} := L n_d/n$. We note that the bound in Eq. (3) holds even if there is no entanglement in the circuit; that is, it holds even if the input state ρ is a tensor-product state diagonal in the computational basis.
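The entanglement-free case mentioned above can be simulated classically. For a computational basis input, each CNOT ladder permutes bitstrings, and depolarizing noise on a dirty qubit flips the corresponding bit with probability p/2 (assuming the convention D_p(ρ) = (1 − p)ρ + p I/2). The sketch below tracks the resulting probability distribution and its largest deviation from uniform. The ladder ordering CNOT(0,1), CNOT(1,2), ..., the bit ordering, and the function names are assumptions made for illustration.

```python
def cnot_ladder(z, n):
    """Apply CNOT(0,1), CNOT(1,2), ..., CNOT(n-2,n-1) to basis state z (qubit q = bit q)."""
    for control in range(n - 1):
        if (z >> control) & 1:
            z ^= 1 << (control + 1)
    return z

def apply_layer(probs, n, n_dirty, p):
    """One layer: CNOT ladder, then depolarizing noise on the first n_dirty qubits."""
    size = 1 << n
    moved = [0.0] * size
    for z, weight in enumerate(probs):
        moved[cnot_ladder(z, n)] += weight
    flip = p / 2  # a diagonal state sees depolarizing noise as a bit flip with probability p/2
    for qubit in range(n_dirty):
        mask = 1 << qubit
        moved = [(1 - flip) * moved[z] + flip * moved[z ^ mask] for z in range(size)]
    return moved

def max_deviation_from_uniform(n, n_dirty, p, layers, start=0):
    """Largest |P(z) - 1/2^n| over all bitstrings z after the given number of layers."""
    probs = [0.0] * (1 << n)
    probs[start] = 1.0
    for _ in range(layers):
        probs = apply_layer(probs, n, n_dirty, p)
    uniform = 1 / (1 << n)
    return max(abs(x - uniform) for x in probs)

for n_dirty in range(5):
    row = [max_deviation_from_uniform(4, n_dirty, 0.1, L) for L in (10, 50, 200)]
    print(n_dirty, ["%.2e" % x for x in row])
```

With no dirty qubits the distribution never spreads, while even a single dirty qubit eventually drives every outcome probability toward 1/2^n, only more slowly than when all qubits are dirty.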
An alternative interpretation of the clean and dirty computational model follows by inspecting Proposition 1 in the low-error limit. Namely, considering for simplicity n to be a power of 2, we have $(1-p)^{\Gamma_{n,n_d,L}} = (1-p)^{L_{\mathrm{eff}}}$ and hence, if $p \ll 1$, then
\[ (1-p)^{L n_d/n} \approx \left(1 - p\,\frac{n_d}{n}\right)^{L} \tag{5} \]
up to first order in p. Equation (5) shows that adding clean qubits (left-hand side) is approximately equivalent to a fully noisy system with an effective noise parameter rescaled by $n_d/n$ (right-hand side). The identification of the right-hand side can be made precise from Proposition 1, as $(1 - p\,n_d/n)^{L}$ is precisely the scaling coefficient one finds in an L-layered system where all qubits are subject to noise with probability $p\,n_d/n$. As such, having fewer dirty qubits has a similar effect to reducing the effective noise in a system where all qubits undergo the same rate of depolarizing noise. This can be thought of as arising due to the fact that the CNOTs propagate the noise from dirty qubits through to the rest of the system. Thus, for sufficiently deep circuits, the localized contributions of error instances wash into the rest of the circuit, and the decoherence can be broadly characterized by the total number of error instances, or equivalently by the effective error rate and depth $L_{\mathrm{eff}}$. This interpretation is explored more in our numerical studies in Section 4. We note that our first setting can also be adapted to give similar results for Pauli observables. Specifically, by considering the reverse circuit with Pauli observables we observe the same exponential concentration. We discuss this further in Appendix A.3.1.
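The first-order equivalence stated around Eq. (5) can be checked numerically. The sketch below compares the clean-and-dirty factor (1 − p)^(L n_d/n) with the rescaled-noise factor (1 − p n_d/n)^L for a few values of p; the function names and the chosen parameter values are illustrative assumptions.

```python
def clean_dirty_factor(p, n, n_dirty, layers):
    """(1 - p)^(L * n_d / n): noise p on n_d of the n qubits."""
    return (1 - p) ** (layers * n_dirty / n)

def rescaled_noise_factor(p, n, n_dirty, layers):
    """(1 - p * n_d / n)^L: all qubits noisy with the rescaled probability p * n_d / n."""
    return (1 - p * n_dirty / n) ** layers

n, n_dirty, layers = 8, 2, 100
for p in (1e-1, 1e-2, 1e-3):
    lhs = clean_dirty_factor(p, n, n_dirty, layers)
    rhs = rescaled_noise_factor(p, n, n_dirty, layers)
    print(f"p = {p:g}: lhs = {lhs:.6f}, rhs = {rhs:.6f}, relative gap = {abs(lhs - rhs) / rhs:.2e}")
```

The relative gap shrinks as p decreases, consistent with the two expressions agreeing to first order in p.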
In the second setting, we show how local noise impacts the global system due to entangled measurements. This can also equally be considered to demonstrate the role of entanglement when considering the Heisenberg picture (i.e., considering the circuit acting in the reverse order). We consider the circuit in Figure 3(a) and suppose that the registers are split into two subsystems A and B. Both subsystems are acted on by a global unitary W (which can in principle even be separable across the cut A|B), before they undergo evolution according to some channel T which is also separable across the cut A|B. One choice for T is to consider noise-free unitary evolution, which we denote as U ⊗ V. In order to characterize the effects of noise, we can also modify these channels to account for hardware noise, as demonstrated in Figure 3(b), in the sense that between each non-parallelizable layer of gates there is an instance of local depolarizing noise acting on each of the noisy qubits. We denote such a noisy modification as $\mathcal{U} \to \tilde{\mathcal{U}}$ or $\mathcal{V} \to \tilde{\mathcal{V}}$. Further, we denote the number of such unitary layers as L U and L V respectively. Finally, at the end of the circuit there is an entangled measurement across m pairs of registers in the two subsystems, which we denote with the measurement operator $\mathcal{B}_m(|z_{2m}\rangle\langle z_{2m}|)$, where $\mathcal{B}_m$ is the map corresponding to a Hadamard gate followed by a CNOT on m pairs of qubits. We note that a special case of this circuit is the (global) Hilbert-Schmidt test circuit [41]. We find that the following proposition, proved in Appendix A, holds.

Figure 3: Entangled registers. (a) In our second analytical example we show that entanglement can also cause local noise to spread globally. In this circuit, we consider an arbitrary input state ρ that is acted on by some unitary W. We then partition the system into two subsystems, each of which undergoes its own unitary evolution, denoted as U and V. In our analytics, we consider a modification where one of the subsystems undergoes noisy evolution whilst the other is noise-free. At the end of the circuit, a subset of the qubits undergoes an entangled measurement across the two subsystems. (b) We define the depth of U to be the number of non-parallelizable unitary layers L U in a given hardware implementation. When we consider a noisy modification $\mathcal{U} \to \tilde{\mathcal{U}}$, we insert layers of local depolarizing noise with depolarizing probability p in between each unitary layer. We define L V and $\tilde{\mathcal{V}}$ in an analogous manner.

Proposition 2. Consider the circuit in Figure 3(a) partitioned into two subsystems A and B. We have entangling operation $\mathcal{B}_m$ on m ⩾ 1 pairs of qubits across A and B, noisy unitary evolution T separable across the cut A|B, global unitary evolution W, and computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. For any input state ρ and any computational basis measurement, the expectation value concentrates as

[...]

where $P(\rho) = \mathrm{Tr}[\rho^2]$ is the purity of ρ and we have defined

[...]

Proposition 2 shows the role entanglement (or entangled measurements) can play in spreading noise from a noisy subsystem to a clean subsystem. Namely, despite noise only occurring in one part of the circuit, the output distribution of the full system exponentially concentrates in the circuit depth. We also note that this occurs even if only one clean qubit is coupled to a dirty qubit (m = 1). The proof follows from analysis in the Heisenberg picture and thus implies that entangling registers at the beginning of the computation also leads to an identical effect, which we demonstrate in Appendix A.4.
We note that, when only subsystem A (B) is noisy, there is still exponential scaling; however, compared to the fully noisy setting, the exponent is modified by [...]. If we consider the unitaries to have linear depth in the number of qubits and scale proportionally with the same factor (i.e., L U /n A = L V /n B ), then the modification to the exponent is n A /n (n B /n). That is, the exponent is rescaled by the ratio of the number of dirty qubits to the total number of qubits, which is similar to the effect we find in Proposition 1.
Propositions 1 and 2 together show that despite some of the registers being clean, exponential concentration of expectation values in the circuit depth can still occur, in a similar spirit to that of NIBPs as presented in Ref. [13]. Indeed, when we set n d = n we recover the same scaling. Moreover, there is also exponential concentration with each increasing dirty qubit (for linear depth circuits, in the case of Proposition 2). We note that following the methods presented in Ref. [13], our results can also be generalized to classes of Pauli noise that have at least two types of Pauli error occurring with non-zero probability.
We remark that, despite the fact that our bounds are one-sided, there is some physical significance to the factor of n d /n that the exponent is rescaled to when transitioning from n to n d dirty qubits. This is due to the fact that the exponent in our bounds corresponds to the Pauli string with the smallest amplitude decay. Thus, whilst we would not expect all circuits to exactly exhibit this scaling, we do expect it to be characteristic of circuits in the best case. Further details are given in the proofs of Propositions 1 and 2, which can be found in Appendices A.3 and A.4 respectively.

Gradient scaling
Our above results can also be readily transported to consider gradient scaling. First, we note that if either of the two above settings is modified to contain a trainable parameter (without breaking initial assumptions) and in settings amenable to the parameter shift rule, then Propositions 1 and 2 automatically imply exponentially vanishing gradients in the depth of the circuit. Namely, this implies barren plateaus for linear depth circuits if n d scales linearly with n. This also implies barren plateaus for polynomial depth circuits of degree 2 or greater, for any n d > 0. Further, in more general settings where one cannot apply the parameter shift rule, we now explicitly demonstrate how gradient scaling results can be obtained for modified circuits, and present further details in Appendix A.5.
We consider a trainable unitary of the form

[...]

where $\{W_k\}_{k=0}^{K}$ are arbitrary fixed unitary operators and $\{H_k\}_{k=1}^{K}$ are Hermitian operators. We denote the channel that corresponds to this unitary as $\mathcal{Y}(\theta)$. In the following propositions, we modify the circuits in Section 3.1 by inserting $\mathcal{Y}(\theta)$ in the middle of the circuit and constraining the input state to be a computational basis state.

Proposition 3. Consider a cost function

[...]

where $\mathcal{W}^{(n_d)}_{L_i}$ denotes the channel corresponding to $L_i$ instances of CNOT ladders and local noise on n d qubits as presented in Figure 2, and with computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. The partial derivative with respect to parameter θ k is bounded as

[...]

for i ∈ {1, 2}.

Proposition 4. Consider a cost function

[...]

where ρ is a computational basis input state, $\mathcal{B}_{m_i}$ is an entangling operation between $m_i$ pairs of qubits in subsystems $A_i$ and $B_i$ as considered in Proposition 2, $\mathcal{T}_i$ denotes noisy unitary evolution separable across the cut $A_i|B_i$, and with computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. The partial derivative with respect to parameter θ k is bounded as

[...]

where we denote

[...]

for i ∈ {1, 2}, and $L_{U_i}$ and $L_{V_i}$ are defined in the same manner as L U and L V in Figure 3.
Propositions 3 and 4 demonstrate that when modifying the settings considered in Section 3.1 to include a parameterized unitary, the partial derivative of measurement outcomes with respect to any parameter displays similar scaling to the concentration of expectation values. Specifically, if we assume that the largest singular values of the generators {H k } K k=1 of the unitary Y (θ) scale at most polynomially in n, this implies NIBPs for linear depth circuits if n d also scales linearly in n, or for polynomial-depth circuits of degree 2 or greater for any n d > 0.
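The depth regimes named above can be illustrated with the effective-depth heuristic $L_{\mathrm{eff}} = L n_d/n$ from the discussion of Proposition 1. The sketch below evaluates the damping factor (1 − p)^(L n_d/n) for three scalings of L and n d with n. It uses this heuristic exponent rather than the exact bounds of Propositions 3 and 4, and the value p = 0.05 and the function name are assumptions.

```python
def damping(p, n, n_dirty, layers):
    """Heuristic damping factor (1 - p)^(L * n_d / n) suggested by the effective depth."""
    return (1 - p) ** (layers * n_dirty / n)

p = 0.05
print(" n   L=n,n_d=n/2   L=n^2,n_d=1   L=n,n_d=1")
for n in (4, 8, 16, 32, 64):
    linear_extensive = damping(p, n, n_dirty=n // 2, layers=n)  # exponent n / 2
    quadratic_single = damping(p, n, n_dirty=1, layers=n * n)   # exponent n
    linear_single = damping(p, n, n_dirty=1, layers=n)          # exponent 1
    print(f"{n:2d}   {linear_extensive:.3e}     {quadratic_single:.3e}     {linear_single:.3e}")
```

In this heuristic, the first two columns shrink exponentially with n, whereas a linear-depth circuit with a single dirty qubit keeps a constant factor.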

Numerical Simulations
In this section we numerically investigate the behavior of the gradient of a cost function for the Hamiltonian Variational Ansatz (HVA) using both a single-qubit local depolarizing noise model and a model based on a trapped-ion quantum computer [42,43]. We start with an introduction of the HVA in Section 4.1, and follow with a presentation of the results in Section 4.2.

Hamiltonian Variational Ansatz
We consider an HVA [44, 45] for a task of variational ground state search for a transverse-field one-dimensional quantum Ising model. The model Hamiltonian consists of a nearest-neighbor coupling term H XX and a transverse-field term H Z weighted by a constant g, where X and Z are Pauli matrices. We choose g = 1 and assume Periodic Boundary Conditions (PBC). The ansatz with L layers is built from the exponentials $e^{-i\theta_k H_{XX}}$ and $e^{-i\theta_l H_Z}$, where $\theta = (\theta_1, \ldots, \theta_{2L})$ are parameters. We decompose $e^{-i\theta_k H_{XX}}$ and $e^{-i\theta_l H_Z}$ into entangling Mølmer-Sørensen gates and single-qubit rotations, respectively, as shown in Figure 4. We minimize the energy of the model using the cost function C defined in Eq. (15), and compute its gradient, Eq. (16), using a parameter shift rule [46]. The gradient (16) is not affected by a barren plateau in the noiseless case [20], making it a convenient setup for numerical investigation of the noise effects on the gradient scaling. For technical details on the gradient computation see Appendix B.
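As a toy illustration of the parameter shift rule referred to here, the sketch below differentiates a single-qubit cost function and compares the result with the analytic derivative. It assumes a gate of the form exp(−iθX/2), for which the shift is π/2; the cost function, the names, and the single-gate setting are illustrative and do not reproduce the HVA gradient of Appendix B. When one parameter enters several gates, as in an HVA layer, the rule would be applied to each gate separately and the contributions summed.

```python
import math

def cost(theta):
    """<Z> after RX(theta) = exp(-i * theta * X / 2) acts on |0>; analytically cos(theta)."""
    amp0 = math.cos(theta / 2)        # amplitude of |0>
    amp1 = -1j * math.sin(theta / 2)  # amplitude of |1>
    return abs(amp0) ** 2 - abs(amp1) ** 2

def parameter_shift_gradient(f, theta, shift=math.pi / 2):
    """Two-term shift rule, valid for gates exp(-i * theta * G / 2) with G^2 = I."""
    return (f(theta + shift) - f(theta - shift)) / 2

for theta in (0.3, 1.2, 2.5):
    estimate = parameter_shift_gradient(cost, theta)
    print(f"theta = {theta}: shift rule = {estimate:+.6f}, analytic = {-math.sin(theta):+.6f}")
```

Unlike a finite-difference estimate, the shifted evaluations are themselves expectation values of the same circuit, which is why the rule is convenient on noisy hardware and in noisy simulations.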

Numerical results
We examine the scaling of the averaged absolute value of the partial derivative of the cost function with respect to a parameter θ k , |∂ θ k C|, for the cost function of Eq. (15) in a system with n = 8 qubits and for an HVA ansatz with the number of layers L increasing from 1 to 600. The chosen range of L allows us to showcase the asymptotic scaling of |∂ θ k C| with increasing L. The average is uniformly taken over both random sets of parameters θ and the parameters θ k . We consider n d = 0, 1, 2, . . . , n in both local depolarizing noise and a realistic trapped-ion noise model [42,47].
The depolarizing noise model applies a local depolarizing channel after each gate if the qubit is designated to be dirty. In the case of the XX gate, noise is represented by applying a single qubit depolarizing channel to each dirty qubit. The realistic model is based on real-world machines [42] and consists of single-qubit noise channels for single-qubit gates acting on the dirty qubits, and a correlated two-qubit noise channel after XX gates. The XX noise is applied only when both qubits are dirty. A detailed description of the noise models, including figures and parameters, can be found in Appendix C.
We assume that the dirty qubits form a contiguous block as in Figure 1. We gather results in Figure 5.
Here we can see that |∂ θ k C| decays exponentially with L for n d > 0 (as evidenced by a straight line in the log-linear plot). Furthermore, we can see that the averaged gradient also decreases exponentially with increasing n d . This can be seen from the fact that for a fixed L, increasing n d decreases the value of |∂ θ k C| by a constant factor in the log-linear plot. Hence, we find in both cases that for n d > 0, |∂ θ k C| decays exponentially with L and n d . Namely, we heuristically observe that |∂ θ k C| ∈ O(a Ln d ) for some a < 1.
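The reading of a straight line in a log-linear plot can be made explicit with a least-squares fit. The sketch below fits log|∂C| against L for synthetic values generated from the heuristic form a^(L n_d); the value a = 0.99 and the depth grid are hypothetical, and the data are generated by the script itself rather than taken from the simulations reported here.

```python
import math

def fit_log_linear(depths, values):
    """Least-squares fit of log(values) = intercept + slope * depth."""
    xs = list(depths)
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

A_HYPOTHETICAL = 0.99              # hypothetical decay base, chosen only for illustration
DEPTHS = list(range(1, 601, 50))   # hypothetical depth grid

for n_dirty in (1, 2, 4):
    synthetic = [A_HYPOTHETICAL ** (L * n_dirty) for L in DEPTHS]  # synthetic, not measured
    slope, _ = fit_log_linear(DEPTHS, synthetic)
    print(f"n_d = {n_dirty}: fitted slope = {slope:.5f}, n_d * ln(a) = {n_dirty * math.log(A_HYPOTHETICAL):.5f}")
```

For data of this form, the fitted slope is n d ln a, so a slope that grows linearly with n d is what the heuristic scaling predicts.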
These results reflect the exponential scaling in L and n d obtained in our theoretical results from Propositions 1 and 2, thus indicating that the scaling in our propositions might be present in realistic models.
Here we can also verify to what extent Eq. (5) holds in the considered realistic scenario. Namely, we perform numerical simulations for the same HVA ansatz as previously analyzed, but now all the qubits are dirty and the local depolarizing probability is rescaled according to Eq. (5). For instance, in the case of depolarizing noise, we consider the error rates
\[ p \to p f, \quad \text{with } f = 1/n,\ 2/n,\ \ldots,\ (n-1)/n,\ 1. \tag{17} \]
The error rates in the case of the trapped-ion noise model are scaled analogously; see Appendix C for details. Figure 5 shows that the results for n d < n (solid lines), with the local depolarizing and the trapped-ion noise models, are similar to the case when all qubits are dirty and the noise is rescaled by f = n d /n and f = (n d − 1)/n, respectively (dashed lines). The difference in f comes from the fact that in the case of depolarizing noise, the number of noise channels per layer of the ansatz is proportional to n d , while in the case of the trapped-ion noise model we have n d − 1 noisy Mølmer-Sørensen gates per layer, as explained in Appendix C. Therefore, our numerics demonstrate that adding clean qubits is approximately equivalent to rescaling the error rate by a factor n d /n as in Eq. (5).
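The two rescaling factors can be recovered by counting noise channels per ansatz layer. The sketch below assumes that each layer contains one XX gate per nearest-neighbor pair on a ring (periodic boundary conditions) and one single-qubit rotation per qubit, with the dirty qubits forming the contiguous block 0, ..., n d − 1; this layer structure and all function names are assumptions, since Figure 4 and Appendix C are not reproduced here.

```python
import math

def ring_pairs(n):
    """Nearest-neighbor pairs on a ring of n >= 3 qubits."""
    return [(i, (i + 1) % n) for i in range(n)]

def depolarizing_channels_per_layer(n, n_dirty):
    """Single-qubit depolarizing channels per layer: one per dirty qubit touched by each gate."""
    dirty = set(range(n_dirty))
    xx_channels = sum((a in dirty) + (b in dirty) for a, b in ring_pairs(n))
    rotation_channels = n_dirty
    return xx_channels + rotation_channels

def noisy_xx_gates_per_layer(n, n_dirty):
    """XX gates whose two qubits are both dirty (the rule quoted for the trapped-ion model)."""
    dirty = set(range(n_dirty))
    return sum(a in dirty and b in dirty for a, b in ring_pairs(n))

def total_error_rate(channels_per_layer, rate, layers):
    """Sum of the error rates of all noisy operations in the circuit."""
    return channels_per_layer * rate * layers

n, p, layers = 8, 0.01, 100
for n_dirty in range(1, n):
    clean_dirty = total_error_rate(depolarizing_channels_per_layer(n, n_dirty), p, layers)
    rescaled = total_error_rate(depolarizing_channels_per_layer(n, n), p * n_dirty / n, layers)
    print(n_dirty, noisy_xx_gates_per_layer(n, n_dirty), round(clean_dirty, 6), round(rescaled, 6))
```

Under these assumptions, the depolarizing count is proportional to n d and the count of fully dirty XX gates is n d − 1, matching f = n d /n and f = (n d − 1)/n respectively.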
In Appendix D we additionally analyze the gradient behavior for n = 4, 6 qubits, obtaining the same conclusions as here for n = 8.
We further explore the behavior of the averaged derivative of the cost function with respect to the parameters in Figure 6 by plotting |∂ θ k C| versus the total circuit error rate, which we define as the sum of the error rates of all gates in a circuit. For both the local depolarizing and the trapped-ion noise models we obtain with good accuracy a collapse of |∂ θ k C| plotted versus the circuit error rate for all considered values of n d and f. In Appendix D we conduct a similar analysis for n = 4, 6, obtaining the same conclusion. We see small deviations in the depolarizing case which appear to increase with increasing system size. Furthermore, in the same appendix, we also explore the clean and dirty setup for low circuit depths (L = 1 to 30) up to n = 10 in the realistic noise model, obtaining similar behavior as for the higher depths.
We note that the deviations are within one standard deviation of the mean gradient, which may indicate that they are finite-sample artifacts. Nevertheless, to elucidate their origin a more systematic investigation would be necessary, which we leave to future work. We find that for the realistic noise model deviations are much smaller, resulting in nearly perfect collapse. Finally, we note that in the case of the depolarizing noise, the total error is proportional to n d /n, while in the case of the realistic noise, it scales approximately linearly with n d /n as shown in an inset of Figure 6b. Therefore, the obtained collapses provide further examples of exponential suppression of the gradient by a ratio of n d /n.

Figure 5: We plot the mean absolute value of the partial derivative of the cost function with respect to a parameter θ k , |∂ θ k C|, for L = 1, . . . , 600 and n = 8. The average is taken over both 28 random sets of parameters θ and the parameters θ k . In (a) we show results for local depolarizing noise. In (b) we show results for the trapped-ion realistic noise model. The solid lines are obtained for n d = 0, 1, . . . , 8. In the plots larger n d is plotted with brighter colors. The dashed lines are obtained for the case of n dirty qubits and error rates of the noisy gates scaled by a factor f = 1/8, 2/8, . . . , 7/8 with respect to their noise rates for the aforementioned simulations. We plot the results for n d and f = n d /n with the same color.

Discussion
We have found that partitioning a quantum computer into noiseless "clean" qubits and noisy "dirty" qubits typically has an analogous effect to lowering the overall error rate by the ratio n d /n. Consequently, our model of clean qubits cannot avoid exponential scaling effects due to noise such as barren plateaus, and even a single dirty qubit will ruin the bunch eventually. As it has previously been shown that standard error mitigation techniques also cannot avoid exponential scaling effects due to noise [23,48], our work provides further evidence that a fully noiseless system may be ultimately required to avoid such scaling. We note that even though we exclusively look at gradient scaling in our numerics, this also implies the concentration of expectation values due to the results of [16]. This means that our results have implications for the scalability of near-term algorithms in general, not just in the realm of training parameterized circuits.
On the other hand, we found numerically that each additional clean qubit exponentially suppresses the rate of gradient decay with respect to depth. Therefore, the inclusion of clean qubits helps to mitigate NIBPs and exponential concentration of expectation values due to noise. Our analytical results point to this being a more general phenomenon in partially noisy circuits. We note that here, in order to numerically investigate the scaling of trainability at large depths, we assumed that the error rates of the noisy qubits are larger than the ones expected from future machines. Their reduction should enable even more impressive increases in computational reach than the ones shown in our numerics.
We expect this work to lay the groundwork for potential explorations into the post-NISQ era, which is seemingly drawing closer with ever-larger qubit counts and decreasing error rates. Our results indicate that quantum computation with both error-corrected and noisy qubits can mitigate the exponential concentration issues. They therefore motivate the search for practically viable realizations of this computational paradigm. The clean and dirty setup is an idealized model of such a quantum computer, as a clean qubit implementation requires the correction of every error occurring at a logical qubit.
A natural follow-up study would be an extension of our model that takes into account that real-world logical qubits are characterized by logical error rates. Such a model would also enable a more comprehensive study of strategies involving combining logical qubits with different logical error rates and their comparison to using the same code distance for all the device qubits. In order to better model real-world logical qubits, it is essential to consider the gate-dependent errors, as non-Clifford gates pose significantly greater implementation challenges than Clifford gates.
Another line of follow-up research would be to consider the application of the clean and dirty setup to algorithms that involve some qubits subject to many more gates than the others, such as the quantum phase estimation algorithm [38]. For such qubits, one naturally obtains higher effective error rates, and potentially better error suppression when just a fraction of qubits are corrected. Therefore, in such cases, the clean and dirty setup may result in even better error suppression than in the cases investigated here. Another scenario worth exploring, which may result in the clean and dirty setup as an effective model, is the coupling of a sensor built from noisy qubits [49] to logical qubits processing information obtained from the sensor [50]. Finally, one can also ask about using the clean and dirty computational paradigm to design custom quantum algorithms that can maximize quantum advantage in the post-NISQ era.

Figure 6: The solid lines show the clean and dirty setup, while the dashed ones show the variable error setup, with darker lines representing lower noise levels. Both the depolarizing noise model (a) and the realistic noise (b) results show an exponential decay of the mean absolute value of the partial derivative with the total error rate up to small deviations. The inset in (b) shows the error rate for the two setups and the realistic noise model in relation to $n_d$, demonstrating that it is approximately linear. Consequently, in both cases, we obtain that the mean gradient decays approximately exponentially with $n_d$.

Acknowledgements
We thank Andrew Arrasmith, Burak Sahinoglu, Chenfeng Cao, and Hsin-Yuan Huang for their helpful discussions. We also thank Tom O'Leary for producing an open-source Qiskit version of the clean and dirty setup for the repository. This work was supported by the Quantum Science Center (QSC), a National Quantum Information Science Research Center of the U.S. Department of Energy (DOE). DB was supported by the U.S. DOE through a quan-

Code availability
The scripts and data used to create all the plots and to recreate some of the data can be found in Appendix E.

A.1 Preliminaries
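Definition 1 (Depolarizing noise). The action of a single-qubit depolarizing noise channel $\mathcal{D}_p$ with error probability $p$ on a single-qubit quantum state $\rho$ can be written as
$$\mathcal{D}_p(\rho) = (1-p)\,\rho + p\,\frac{\mathbb{1}}{2}, \tag{18}$$
where $\mathbb{1}$ is the $2 \times 2$ identity matrix.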
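Definition 2 (Pauli strings). We define the set of non-identity Pauli tensor product strings $P_n$ as
$$P_n = \{\mathbb{1}, X, Y, Z\}^{\otimes n} \setminus \{\mathbb{1}^{\otimes n}\}. \tag{19}$$
Action of local depolarizing noise on Pauli strings. We note that for any $\sigma \in P_n$ we have
$$\mathcal{D}_p^{\otimes n}(\sigma) = q_\sigma\,\sigma, \tag{20}$$
where $|q_\sigma| \leqslant 1 - p$.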
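The damping of Pauli strings under local depolarizing noise can be checked numerically. The sketch below is only illustrative: it assumes the channel form $\mathcal{D}_p(A) = (1-p)A + p\,\mathrm{Tr}[A]\,\mathbb{1}/2$, and the helper names `depolarize` and `damping_factor` are hypothetical. Under that assumption each non-identity tensor factor contributes a factor $1-p$, so $|q_\sigma| \leqslant 1-p$ for every non-identity string.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = {"1": I2, "X": X, "Y": Y, "Z": Z}


def depolarize(op, p):
    """Assumed single-qubit channel: D_p(A) = (1 - p) A + p Tr[A] 1/2."""
    return (1 - p) * op + p * np.trace(op) * I2 / 2


def damping_factor(label, p):
    """Return q_sigma such that the local noise maps sigma to q_sigma * sigma."""
    sigma = np.array([[1]], dtype=complex)
    noisy = np.array([[1]], dtype=complex)
    for ch in label:
        sigma = np.kron(sigma, PAULIS[ch])
        # The noise acts qubit by qubit, so it factorizes on product operators.
        noisy = np.kron(noisy, depolarize(PAULIS[ch], p))
    dim = sigma.shape[0]
    # Hilbert-Schmidt overlap with sigma, normalized by Tr[sigma^2] = dim.
    return (np.trace(sigma.conj().T @ noisy) / dim).real


p = 0.1
factors = {
    "".join(s): damping_factor(s, p)
    for s in itertools.product("1XYZ", repeat=2)
    if set(s) != {"1"}
}
for label, q in sorted(factors.items()):
    print(label, round(q, 4))
```

Strings with a single non-identity factor are damped by $1-p$ and strings with two by $(1-p)^2$, which is the behavior the proofs below rely on.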
Proof. We have where in the first line we have added and subtracted 2 2 n ρ − 1 2 n 1 2 n , in the second line we have explicitly computed the trace for the last two terms, and in the final line we have used the definition of the Schatten 2-norm.

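Lemma 2 (2-norm of commutators). Let $X$ and $Y$ be complex matrices. Let $\|\cdot\|_2$ and $\|\cdot\|_\infty$ denote the Schatten 2- and $\infty$-norms respectively. We have
$$\|[X, Y]\|_2 \leqslant 2\,\|X\|_2\,\|Y\|_\infty.$$
Proof. Writing out the commutator explicitly, we have
$$\|[X, Y]\|_2 = \|XY - YX\|_2 \leqslant \|XY\|_2 + \|YX\|_2 \leqslant 2\,\|X\|_2\,\|Y\|_\infty,$$
where the first inequality is due to the triangle inequality, and the second inequality is an application of the tracial matrix Hölder inequality [51].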
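As a sanity check of the commutator bound, the following sketch samples random complex matrices and evaluates the slack in $\|[X,Y]\|_2 \leqslant 2\|X\|_2\|Y\|_\infty$. The exact form of the inequality is taken here as an assumption based on the proof description (triangle inequality followed by Hölder), and the function names are hypothetical.

```python
import numpy as np


def schatten_2(a):
    """Schatten 2-norm, i.e. the Frobenius norm."""
    return np.linalg.norm(a, "fro")


def schatten_inf(a):
    """Schatten infinity-norm, i.e. the largest singular value."""
    return np.linalg.norm(a, 2)


def commutator_gap(x, y):
    """Slack 2 ||X||_2 ||Y||_inf - ||[X, Y]||_2 (non-negative if the bound holds)."""
    return 2 * schatten_2(x) * schatten_inf(y) - schatten_2(x @ y - y @ x)


rng = np.random.default_rng(0)
gaps = []
for _ in range(200):
    x = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    y = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
    gaps.append(commutator_gap(x, y))

min_gap = min(gaps)
print("smallest slack over samples:", min_gap)
```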

A.3 Model 1 - CNOT ladders
In this section we consider a circuit consisting of CNOT ladders as presented in Figure 7(a). After each ladder, a fraction n d /n of the qubits experiences single-qubit depolarizing noise D p .
We first provide a brief roadmap of our proof methods. In order to analyze such circuits, we consider the Heisenberg picture and bound the output purity of the reverse circuit (Figure 7(b)). As we only consider computational basis measurements, we only need to consider Pauli tensor product strings of the form $\{\mathbb{1}, Z\}^{\otimes n}$. We will adopt the notation $A \ldots B$ to represent a tensor product string $A \otimes \ldots \otimes B$. Due to Eq. (20), local depolarizing noise on $n_d$ qubits will decrease the amplitude of certain strings in $\{\mathbb{1}, Z\}^{\otimes n}$. Specifically, wherever a string has a "Z" on a noisy register, that string's amplitude will decay by a factor $(1 - p)$ for each instance of local depolarizing noise $\mathcal{D}_p$. The goal of our analysis will be to observe patterns in Pauli strings as they are acted on by CNOT ladders and to keep track of the "best case" Pauli string that is least affected by noise. This corresponds to the Pauli string that has the fewest "Z"s appearing on noisy registers. This can then be used to establish upper bounds on expectation value concentration.

Definition 3 (CNOT ladder). We denote the mapping corresponding to one CNOT ladder (consisting of CNOT gates between all pairs of adjacent qubits in a 1D chain of $n$ qubits, as in Figure 7(b)) as $\mathrm{CNOT}_n$.
Definition 4 (CNOT ladder + noise). We denote the channel corresponding to a CNOT ladder plus depolarizing noise on the first $n_d$ qubits as $\mathcal{W}^{(n_d)\dagger} := \mathrm{CNOT}_n \circ (\mathcal{D}_p^{\otimes n_d} \otimes \mathcal{I}_{n-n_d})$. Further, we denote the channel corresponding to L instances of this channel (i.e. the full circuit in Figure 7

Figure 7: (a) Our goal is to consider a circuit with a repeating ansatz of $L$ layers of a CNOT ladder. After each layer, the qubits on the first $n_d$ registers go through the depolarizing noise channel $\mathcal{D}_p$. At the end of the circuit, we measure in the computational basis. We denote $\rho$ as the input state. (b) In order to analyze such circuits, we consider the Heisenberg picture and bound the output purity of the reverse circuit, with computational basis input states.
As a preliminary example, we consider the mapping of the string Z ⊗ 1 ⊗ Z ⊗ Z under such CNOT ladders: We note that after 4 iterations, the Z1ZZ is mapped back to itself. We will refer to this as a string with period 4. We make two observations about the chain of mappings in (28). First, the string is mapped only to other strings in the set {1, Z} ⊗n . Second, the mapping is cyclic. Both observations can be noted to be a direct consequence of the fact that all strings from {1, Z} ⊗n are mapped bijectively to another element of the set under CNOT n , which we will now formally show.

Lemma 3. The set {1, Z} ⊗n is mapped to itself under CNOT n bijectively.
Proof. We can check the mappings for n = 2 and a single CNOT gate: As CNOT n is composed of n − 1 CNOT maps, it follows that strings in {1, Z} ⊗n are mapped to strings in {1, Z} ⊗n by CNOT n . Finally, the map is unitary and so has an inverse, and thus the map is bijective.
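The cyclic behavior and the bijectivity stated in Lemma 3 can be reproduced with a short bit-level sketch. It rests on an assumed convention: each CNOT has its control on qubit $i$ and target on qubit $i+1$, it is applied in the Heisenberg picture (where $Z$ on the target becomes $Z \otimes Z$ and $Z$ on the control is unchanged), and the pairs are processed from the first to the last. With "Z" written as 1 and the identity as 0, one ladder then replaces every entry by its sum modulo 2 with its right neighbor. The function names are hypothetical.

```python
from itertools import product


def ladder_step(bits):
    """One CNOT ladder on a {1, Z} string (assumed convention, see lead-in)."""
    out = list(bits)
    for c in range(len(out) - 1):
        # A Z on the target (c + 1) toggles the Z on the control (c).
        out[c] ^= out[c + 1]
    return tuple(out)


def to_bits(label):
    return tuple(1 if ch == "Z" else 0 for ch in label)


def to_label(bits):
    return "".join("Z" if b else "1" for b in bits)


def orbit(label):
    """List the strings visited before the input string reappears."""
    start = to_bits(label)
    seen = [start]
    cur = ladder_step(start)
    while cur != start:
        seen.append(cur)
        cur = ladder_step(cur)
    return [to_label(b) for b in seen]


cycle = orbit("Z1ZZ")
print(" -> ".join(cycle + [cycle[0]]))

# Bijectivity on {1, Z}^{(x) 4}: the 16 inputs have 16 distinct images.
images = {ladder_step(b) for b in product((0, 1), repeat=4)}
print(len(images))
```

Under this convention the string Z1ZZ returns to itself after four ladders, matching the period-4 example in the text; a different gate ordering may traverse the same cycle in the opposite direction.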
Returning to our former example, we observe the elements of the cyclic mapping of Z1ZZ can be written as a slanted list where we denote the row of the list with r, and the line marks the end of a cycle. We now make a key remark.
Definition 6 (Inverted Binary Pascal Triangle). We define the inverted binary Pascal triangle as the triangle of rows of binary strings, where the L elements in the kth row are uniquely generated by L + 1 elements of the (k − 1)th row, with each element constructed by summing the element directly above and to the left, with the element directly above and to the right. An inverted binary Pascal triangle with a length n string in the 0th row has in total n rows, ending in a row with a length 1 string.
Remark. The list of cyclic mappings of any element L ∈ {1, Z} ⊗n under CNOT n has a one-to-one correspondence with the inverted binary Pascal triangle whose string in row 0 is generated by taking L ⊗ 1 ⊗n−1 and applying the map Z ↔ 1, 1 ↔ 0.
As an example for the above remark, we can observe this correspondence for the string Z1ZZ (we truncate the triangle at r = 3).
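The correspondence in the remark can be sketched as follows. The code builds the inverted binary Pascal triangle whose row 0 is the input string padded with $n-1$ zeros and compares the first $n$ entries of each row with repeated applications of the ladder map. The ladder convention (each entry is added modulo 2 to its right neighbor) is the same assumption as elsewhere in these sketches, and the triangle is truncated at row $n-1$, as in the Z1ZZ example.

```python
from itertools import product


def ladder_step(bits):
    """Assumed ladder action on bits: entry i becomes b_i XOR b_{i+1}."""
    out = list(bits)
    for c in range(len(out) - 1):
        out[c] ^= out[c + 1]
    return tuple(out)


def pascal_rows(bits):
    """Rows 0..n-1 of the inverted binary Pascal triangle for a padded string."""
    n = len(bits)
    row = list(bits) + [0] * (n - 1)
    rows = [row]
    while len(row) > n:
        # Each new entry is the mod-2 sum of the two entries above it.
        row = [a ^ b for a, b in zip(row, row[1:])]
        rows.append(row)
    return rows


def matches_ladder(bits):
    """Check that the first n entries of row r equal the r-th ladder image."""
    n = len(bits)
    cur = tuple(bits)
    for row in pascal_rows(bits):
        if tuple(row[:n]) != cur:
            return False
        cur = ladder_step(cur)
    return True


triangle = pascal_rows((1, 0, 1, 1))  # Z1ZZ with Z -> 1 and identity -> 0
for r, row in enumerate(triangle):
    print(" " * r + " ".join(str(b) for b in row))

all_match = all(matches_ladder(b) for b in product((0, 1), repeat=4))
print(all_match)
```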
It can be verified that this sequence has values a(m) = 2 m + 1, which completes the proof.
Corollary 1 (Bit-wise addition of strings). Given a string composed of adjacently joining two strings of length $n \in \{2^x ;\ x \in \mathbb{Z}\}$, starting in row 0, the first $n$ entries of the string in row $n$ are the binary bit-wise addition of both strings.
That is, the periodicity is an integer power of 2, and at most the smallest power of 2 that is greater than or equal to $n$.
where $A' \ldots A \ldots B\, 0 \ldots 0$ denotes the string with $2 \cdot 2^{\lceil \log_2 n \rceil}$ entries where we have appended $2^{\lceil \log_2 n \rceil}$ elements in "0". Thus, $A \ldots B$ reappears after $2^{\lceil \log_2 n \rceil}$ steps, implying it has a period of at most $2^{\lceil \log_2 n \rceil}$. As after $2^{\lceil \log_2 n \rceil}$ steps $A \ldots B$ must be mapped back to itself, the period must be a factor of $2^{\lceil \log_2 n \rceil}$, which only has prime factor 2. Therefore, the period of $A \ldots B$ must itself be a power of 2.
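The periodicity statement can be verified by brute force for small chains. The sketch below enumerates every non-trivial bit string for $n = 2, \ldots, 8$, computes its period under the ladder map, and collects the periods that occur. As in the other sketches, the bit-level ladder rule is an assumed convention and the helper names are hypothetical.

```python
from itertools import product


def ladder_step(bits):
    """Assumed ladder action on bits: entry i becomes b_i XOR b_{i+1}."""
    out = list(bits)
    for c in range(len(out) - 1):
        out[c] ^= out[c + 1]
    return tuple(out)


def period(bits):
    """Number of ladder applications needed to return to the input string."""
    cur, steps = ladder_step(bits), 1
    while cur != bits:
        cur = ladder_step(cur)
        steps += 1
    return steps


def max_period_bound(n):
    """2 ** ceil(log2(n)) computed with integer arithmetic."""
    return 1 << (n - 1).bit_length()


periods = {
    n: {period(b) for b in product((0, 1), repeat=n) if any(b)}
    for n in range(2, 9)
}
for n, ps in periods.items():
    print(n, sorted(ps), "bound:", max_period_bound(n))
```

For every tested $n$ the observed periods are powers of 2 and do not exceed $2^{\lceil \log_2 n \rceil}$.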
We now have the tools to understand how purity decays with circuit depth under CNOT n with local depolarizing noise. We will consider the circuit in Figure 7.
Lemma 6. For any computational basis state |z n ⟩⟨z n | with z n ∈ {0, 1} n , we have with where P (·) denotes the purity function, and ⌊·⌋ and ⌈·⌉ denote the integer floor and ceiling functions respectively.

Remark. If n is a power of 2, then the exponent in Eqs. (35) is given
Thus, we see that the exponent in Eqs. (35) is approximately equal to L · n d n .
Proof. We start with the proof for number of dirty qubits n d = 1, before showing how a looser bound can be constructed for arbitrary n d . From Lemma 5 we know that the period of a string of length n is at most 2 ⌈log 2 n⌉ . We now show that there always exists a binary string with period 2 ⌊log 2 n⌋ that corresponds to the Pauli string that is least affected by noise in the circuit in Fig. 7.
Consider a string of length $n$ made up of all "0"s, with "1" on the $2^{\lfloor \log_2 n \rfloor}$th entry. By inspection, we see that the "1" propagates out to the left and the right in a lightcone structure. In addition, due to Lemma 5, the string must repeat after $2^{\lfloor \log_2 n \rfloor}$ steps. Due to the lightcone structure, it is clear there is only one string per cycle with "1" as its first entry. Thus, every $2^{\lfloor \log_2 n \rfloor}$ steps there is one string with a "1" in the first entry. Now we show that no other string produces a cycle with proportionally fewer "1"s in the first entry every $L$ steps.
First, we remark that no nontrivial string with period smaller than 2 ⌊log 2 n⌋ can produce such a cycle, as due to the lightcone structure of "1" s propagating left, there must always be at least one string with a "1" in the first entry. Second, we note that if n is a power of 2, then 2 ⌊log 2 n⌋ = 2 ⌈log 2 n⌉ and thus with the above we have already found the cycle with the proportionally fewest "1"s in the first entry.
Thus, it only remains to consider strings with length $n$ not a power of 2, and with period $2^{\lceil \log_2 n \rceil}$ (as this is the maximum possible period, as set by Lemma 5). We remark that for such strings, there must be at least one string with "1" as the first entry within each cycle, as any "1" propagates left in a lightcone. Without loss of generality, we consider the first string of each cycle to be such a string. As we are considering non-trivial strings and cycles, the second string in the cycle must also have a "1" at some position. If the left-most "1" is the first entry, then we have trivially found a cycle with two adjacent strings with "1" in the first entry. If the left-most "1" is not the first entry, the "1" will propagate left with each step in a light cone. As $2^{\lceil \log_2 n \rceil} > n$ if $n$ is not a power of 2, it is assured that the light cone of "1"s reaches the first entry before the end of the cycle. Thus, any string with period $2^{\lceil \log_2 n \rceil}$ has at least two strings per cycle with "1" in the first entry.
We can summarize the above result as
$$\text{number of "1"s in first entry after } L \text{ steps} \geqslant L / 2^{\lfloor \log_2 n \rfloor},$$
for any string in $\{\mathbb{1}, Z\}^{\otimes n}$, where the bound is achievable with the string of all "0"s, with "1" on the $2^{\lfloor \log_2 n \rfloor}$th entry.
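The counting argument above can be checked exhaustively for small $n$. The sketch computes, for every non-trivial string, the fraction of strings in its cycle that carry a "1" in the first entry, and compares the smallest fraction with $1/2^{\lfloor \log_2 n \rfloor}$ and with the fraction attained by the single-"1" string. The comparison is done per full cycle, which is an interpretation of "after $L$ steps" adopted here; the ladder rule and helper names are assumptions, as before.

```python
from fractions import Fraction
from itertools import product


def ladder_step(bits):
    """Assumed ladder action on bits: entry i becomes b_i XOR b_{i+1}."""
    out = list(bits)
    for c in range(len(out) - 1):
        out[c] ^= out[c + 1]
    return tuple(out)


def first_entry_fraction(bits):
    """Fraction of strings in the cycle of `bits` whose first entry is 1."""
    hits, length, cur = 0, 0, bits
    while True:
        hits += cur[0]
        length += 1
        cur = ladder_step(cur)
        if cur == bits:
            return Fraction(hits, length)


def single_one(n):
    """All zeros with a 1 on entry 2 ** floor(log2(n)), counted from 1."""
    m = 1 << (n.bit_length() - 1)
    return tuple(1 if i == m - 1 else 0 for i in range(n))


results = {}
for n in range(2, 9):
    m = 1 << (n.bit_length() - 1)
    worst = min(
        first_entry_fraction(b) for b in product((0, 1), repeat=n) if any(b)
    )
    results[n] = (worst, first_entry_fraction(single_one(n)), Fraction(1, m))

for n, (worst, special, bound) in results.items():
    print(n, worst, special, bound)
```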
We now return to the quantum problem, recalling that Z corresponds to "1" and 1 corresponds to "0". From Eq. (38) we can conclude that through the L channel instances that make up W (1) L , any input string in {Z, 1} ⊗n will produce at least L/2 ⌊log 2 n⌋ strings with Z in the first register. Thus, for any σ where σ ′ ∈ {Z, 1} ⊗n /{1 ⊗n } and for some {q (1) Moreover, the map in Eq. 39 bijectively maps Pauli strings to Pauli strings. We note that computational basis states can be written as |z n ⟩⟨z n | = 1 2 n σ∈{1,Z} ⊗n s zn,σ σ where s zn,σ ∈ {+1, −1} and s zn,1 = 1. We can write where in the first equality we have used Lemma 1, in the second equality we have used Eq. (39), in the third equality we have used the definition of the Schatten 2-norm and the orthogonality of the Pauli matrices under the Hilbert-Schmidt inner product. The inequality comes from Eq. 40.
In order to generalize the results from one dirty qubit to $n_d$ dirty qubits, we argue that for each extra dirty qubit, any string in $\{Z, \mathbb{1}\}^{\otimes n}$ generates a cycle that has at least one extra $Z$ per cycle in the first $n_d$ qubits. We now relax our previous result and consider cycles of maximal length $2^{\lceil \log_2 n \rceil}$. As argued above, any non-trivial string generates a cycle that contains at least one string with "1" as the first entry.
We suppose that such a string appears in row $k$. From the rules of a Pascal triangle, this implies there must be a "1" in row $k - 1$ either as the first entry or the second, i.e. the first two entries of row $k - 1$ are either (1, 0) or (0, 1). We recall that this corresponds to a Pauli string with a $Z$ in the first or second qubit. Thus, if the first and second qubits are now dirty, this guarantees for any input string an additional factor of $1 - p$ per cycle. In general we can consider $n_d \leqslant n$ dirty qubits and the light cone of influence from the first entry of row $k$ up to the first $n_d$ entries of row $k - (n_d - 1)$. In this triangular light cone, there will exist at least $n_d$ "1"s. Thus, we have
$$\text{number of "1"s in first } n_d \text{ entries after } L \text{ steps} \geqslant L\, n_d / 2^{\lceil \log_2 n \rceil}.$$
This implies that W where σ ′ ∈ {Z, 1} ⊗n /{1 ⊗n } and for some {q (1) Proceeding with the same steps as the above, for n d dirty qubits we obtain the desired result with exponent L · n d 2 ⌈log 2 n⌉ .
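The generalization to $n_d$ dirty qubits can also be probed numerically. The sketch below counts, over $2^{\lceil \log_2 n \rceil}$ consecutive ladders (a whole number of cycles for every string), how many "1"s land on the first $n_d$ entries, and records the minimum over all non-trivial inputs. The expectation, read off from the light-cone argument, is that this minimum is at least $n_d$. The bit-level ladder rule and all function names are assumptions of this sketch.

```python
from itertools import product


def ladder_step(bits):
    """Assumed ladder action on bits: entry i becomes b_i XOR b_{i+1}."""
    out = list(bits)
    for c in range(len(out) - 1):
        out[c] ^= out[c + 1]
    return tuple(out)


def ones_on_dirty(bits, n_d, steps):
    """Total number of 1s on the first n_d entries over `steps` ladders."""
    total, cur = 0, bits
    for _ in range(steps):
        cur = ladder_step(cur)
        total += sum(cur[:n_d])
    return total


min_counts = {}
for n in range(2, 8):
    steps = 1 << (n - 1).bit_length()  # 2 ** ceil(log2(n))
    for n_d in range(1, n + 1):
        min_counts[(n, n_d)] = min(
            ones_on_dirty(b, n_d, steps)
            for b in product((0, 1), repeat=n)
            if any(b)
        )

for (n, n_d), count in sorted(min_counts.items()):
    print(f"n={n} n_d={n_d} min count per 2^ceil(log2 n) steps: {count}")
```

Each such "1" corresponds to one factor of $1-p$, so a minimum count of at least $n_d$ per $2^{\lceil \log_2 n \rceil}$ steps is what yields the exponent $L \cdot n_d / 2^{\lceil \log_2 n \rceil}$.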
We now restate and prove Proposition 1, which shows how the output distribution of Figure 7(a) exponentially concentrates.

Proposition 1. Consider the circuit in Figure 7(a) with $n_d$ dirty qubits and computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. Denote the channel that describes the action of the gates and noise in the circuit as $\mathcal{W}^{(n_d)}_L$. For any input state $\rho$, the expectation value concentrates as for any computational basis measurement, where $P(\rho)$ denotes the purity of $\rho$ and we have defined
Proof. We have where in the first equality we have used the fact that $\mathrm{Tr}[\mathcal{W}^{(n_d)}_L(\rho)] = 1$, the second equality is simply a grouping of terms, the first inequality is due to the tracial matrix Hölder's inequality [51], the third equality is an application of Lemma 1, and the final inequality is an application of Lemma 6.

A.3.1 Pauli observables
We remark that Eq. 48 also implies the concentration of Pauli observables with ladder circuits. We make this precise with the following proposition.

Supplemental Proposition 1 (Pauli observables). Consider the reverse ladder circuit (Fig. 7b) with computational basis input state $|z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$ and measurement observable $O$. Suppose further that $O$ can be decomposed in the Pauli basis as $O = \sum_{\sigma \in W} \omega_\sigma \sigma$ with $W \subset \{\mathbb{1}, X, Y, Z\}^{\otimes n}$. Then, the output of this circuit concentrates as
where we denote $\|\omega\|_\infty := \max_\sigma |\omega_\sigma|$ as the Pauli coefficient of largest magnitude. Thus, the output of the circuit concentrates to the maximally mixed value exponentially quickly with increasing circuit depth.
Proof. Starting from Eq. 48 we can directly evaluate where in the first line we have decomposed $|z_n\rangle\langle z_n|$ and $O$ into the Pauli basis, the second line comes from the identity $\mathrm{Tr}[\sigma' \sigma''] = \delta_{\sigma' \sigma''} 2^n$ where $\delta_{\sigma' \sigma''}$ is the Kronecker delta, the third line comes from taking the maximal magnitudes of $q^{(n_d)}_{\sigma, L}$ and $\omega_{\sigma''}$ separately, and the final line comes by noting there can only be at most $|W|$ non-zero terms in the sum.
We remark that this bound is particularly relevant if the measurement operator is composed of a small number of Pauli operators with bounded coefficients, compared to the circuit depth. For instance, if ∥ω∥ ∞ := max σ {ω σ } σ ∈ O(poly(n)) and N O ∈ O(poly(n)) but L = Ω(n), then Supplemental Proposition 1 implies exponential concentration with increasing system size.

A.4 Model 2 - entangled registers
In this section we consider circuits of the form in Figure 8(a). In order to analyze such circuits we bound the output purity of the reverse circuit with computational basis input states (Figure 8(b)). This will allow us to consider the circuit in Figure 8(a) using the Heisenberg picture.
We consider two distinct subsystems A and B, of size n A qubits and n B qubits respectively. The input to the circuit in Figure 8(b) is a computational basis state |z n ⟩ = |z n A ⟩ |z n B ⟩; z n ∈ {0, 1} n . We first implement m entangling operations with Hadamard gates and CNOTs on a subset of 2m registers between A and B, with an operation we denote as B m . Note that we allow for any generic m satisfying m > 0. The second step of the circuit is to act on A with unitary evolution U, and act on B with unitary V. Finally, we allow the action of a global unitary W on the joint system AB. We remark that this circuit includes as a special case the Hilbert-Schmidt Test circuit [41].
We consider three modifications of this circuit where we now consider the effects of noise. In the first modification, we consider subsystem A to be noisy, and thus we modify the channel $\mathcal{U} \to \tilde{\mathcal{U}}$, where $\tilde{\mathcal{U}}$ is the channel corresponding to the hardware implementation of the unitary channel $\mathcal{U}$ with an instance of local depolarizing noise $\mathcal{D}_p^{\otimes n_A}$ in between each layer of gates. This follows the noise model as considered in Ref. [13], and we present a schematic of this modification in Figure 9. In the second modification, we consider $\mathcal{V} \to \tilde{\mathcal{V}}$, where $\tilde{\mathcal{V}}$ is the noisy version of the unitary channel $\mathcal{V}$. We define the depth of the unitaries as the number of unitary layers in such an implementation and denote the depth of $\tilde{\mathcal{U}}$ ($\tilde{\mathcal{V}}$) as $L_U$ ($L_V$). Finally, we also consider the scenario when both A and B are noisy, and we have $\mathcal{U} \to \tilde{\mathcal{U}}$, $\mathcal{V} \to \tilde{\mathcal{V}}$.

Figure 9: We consider the depth of $\mathcal{U}$ to be the number of non-parallelizable unitary layers in a given hardware implementation. When we consider a noisy modification $\mathcal{U} \to \tilde{\mathcal{U}}$, we insert layers of local depolarizing noise with depolarizing probability $p$ in between each unitary layer.
for any computational basis state input $|z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$, where we have defined Thus, when only subsystem A (B) is noisy, there is still exponential scaling; however, compared to the fully noisy setting the exponent is modified by a factor $\frac{L_U}{L_U + L_V}$ ($\frac{L_V}{L_U + L_V}$). If we consider the unitaries to have linear depth in the number of qubits and scaling with the same proportionality factor (i.e. $L_U/n_A = L_V/n_B$), then the modification to the exponent is $\frac{n_A}{n_A + n_B}$ ($\frac{n_B}{n_A + n_B}$).
Proof. Consider the mapping of strings in {1, Z} ⊗2 under a Hadamard gate followed by a CNOT. For map This implies that after the CNOTs in the circuit in Figure 8(b), the strings are (up to a phase factor) exclusively of the form with A ∈ {X, Y, Z} ⊗m . In particular, we make a key observation that S A = 1 ⊗n A if and only if S B = 1 ⊗n B .
This implies that we can import the results of Ref. [13] and treat subsystems A and B individually in our analysis of the evolution under T . Namely, for any σ A ∈ P n A , with P n A the group of Pauli operators acting on subsystem A, we have where |q σ B | ⩽ 1 − p. A consequence of this is that for any unitary channel Y and for input operator where in the first equality we have used Eq. (68), in the second equality we have used the unitary invariance of Schatten norms, the third equality comes from the definition of the Schatten 2-norm, and the inequality comes from the bound |q σ A | ⩽ 1 − p. The final equality again comes from the definition of the Schatten 2-norm. We note that the output of S A ⊗ S B ∈ Bell/{1 ⊗n } under unital channels separable across the cut A|B can generically be written in the same form as the argument in the left-hand side of Eq. (70). As we only consider such unitary channels in our analysis of T , we thus can iteratively apply the above to write the concentration inequality where λ(S A , S B ) ∈ C are arbitrary coefficients. By repeating analogous steps to Eqs. (70)-(77) we can also obtain We now have the tools to prove the lemma. We recall that $|z_n\rangle\langle z_n| = \frac{1}{2^n} \sum_{\sigma \in \{\mathbb{1}, Z\}^{\otimes n}} s_{z_n, \sigma}\, \sigma$ where $s_{z_n, \sigma} \in \{+1, -1\}$ and $s_{z_n, \mathbb{1}} = 1$. Then, we can write where L T is defined in Eq. (62) and f (z n , S A , S B ) is a function that generates the appropriate phase. In the first equality we have used Lemma 1, in the second equality we have used Eqs. (64)-(67). The inequality comes from Eqs. (77) and (78), and the unitary invariance of the Schatten norms. In the last three equalities, we again use Eqs. (64)-(67), the unitary invariance of the Schatten norms, and Lemma 1.
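The key observation used in this proof, that after the entangling step a string is trivial on A if and only if it is trivial on B, can be illustrated on a single pair of qubits. The sketch below conjugates the strings in $\{\mathbb{1}, Z\}^{\otimes 2}$ by a Hadamard on the first qubit followed by a CNOT. The gate order, the choice of the first qubit as control, and the direction of conjugation ($U \sigma U^\dagger$) are assumptions of this sketch, and `pauli_image` is a hypothetical helper.

```python
import itertools
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULI = {"1": I2, "X": X, "Y": Y, "Z": Z}

H = (X + Z) / np.sqrt(2)
CNOT = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)
U = CNOT @ np.kron(H, I2)  # Hadamard on the first qubit, then CNOT


def pauli_image(label):
    """Return (label', sign) such that U sigma U^dagger = sign * sigma'."""
    sigma = np.kron(PAULI[label[0]], PAULI[label[1]])
    image = U @ sigma @ U.conj().T
    for a, b in itertools.product("1XYZ", repeat=2):
        coeff = np.trace(np.kron(PAULI[a], PAULI[b]) @ image) / 4
        if abs(coeff) > 1e-9:
            return a + b, int(round(coeff.real))
    raise ValueError("image is not proportional to a Pauli string")


images = {s: pauli_image(s) for s in ("11", "Z1", "1Z", "ZZ")}
for s, (t, sign) in images.items():
    print(s, "->", ("-" if sign < 0 else "+") + t)
```

Up to a sign, every non-trivial input is mapped to a string of the form $A \otimes A$ with $A \in \{X, Y, Z\}$, so the image acts non-trivially on both qubits or on neither.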
Using Lemma 7 we can now prove our proposition on the concentration of output distributions in circuits with entangled measurements.

Proposition 2. Consider the circuit in Figure 8(a) with computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. For any input state $\rho$ and choice of measurement string $z_n$, the expectation value concentrates as
where $P(\rho)$ is the purity of $\rho$ and
Proof. In similar steps to the proof of Proposition 1 where in the first equality we have used the fact that $\mathrm{Tr}[\mathcal{B}_m^\dagger \circ \mathcal{T} \circ \mathcal{W}(\rho)] = 1$, the second equality is simply a grouping of terms, the first inequality is due to the tracial matrix Hölder's inequality [51], the third equality is an application of Lemma 1, and the final inequality is an application of Lemma 7.

A.5 Gradient scaling
In this section, we show how to transport our results to consider gradient scaling. We modify the settings studied in Section 3 to include a trainable unitary in the middle of the circuit. In doing so, we constrain ourselves to consider only computational basis input states. We consider a trainable unitary of the form where $\{W_k\}_{k=0}^{K}$ are arbitrary fixed unitary operators and $\{H_k\}_{k=0}^{K}$ are Hermitian operators. We denote the channel that corresponds to this unitary as $\mathcal{Y}(\theta)$. We modify the circuit in Figure 2 to include this trainable unitary, which we present in Figure 10. In the following proposition, we show how partial derivatives of parameters in this unitary exponentially vanish in the circuit depth under a CNOT ladder circuit.

Proposition 3. Consider a cost function corresponding to the circuit in Figure 10 which we denote as
denotes the channel corresponding to $L_i$ instances of CNOT ladders and local noise on $n_d$ qubits as defined in Section A.3, and with computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. The partial derivative with respect to parameter $\theta_k$ is bounded as where we denote for $i \in \{1, 2\}$.
Proof. We have where the inequality is due to Hölder's tracial matrix inequality [51], and we denote Y † k− (θ) and Y † k+ (θ) as the channels corresponding to unitary operators respectively. The first term in Eq. (95) can be bounded by considering the unitary invariance of Schatten norms and by using Lemma 6 to obtain where in the last line we have used the fact that the purity of pure states is equal to 1. The second term in Eq. (95) can be bounded as where the equality comes by directly evaluating the partial derivative, the first inequality comes from the application of Lemma 2, and the second inequality is an application of Lemma 6. Substituting Eqs. (99) and (102) into Eq. (95) we obtain the gradient scaling result Eq. (93) as desired.
Proposition 3 shows that the partial derivatives with respect to any parameter $\theta_k$ vanish exponentially in the circuit depth. This implies that so long as $\|H_k\|_\infty$ grows at most polynomially in the number of qubits $n$, the circuit in Figure 10 has an NIBP for linear circuit depths.
We can similarly modify the circuit in Figure 3 to include the trainable unitary $\mathcal{Y}(\theta)$ and entangling gates at the start of the circuit. We present this circuit in Figure 11. In the following proposition, we show that partial derivatives of parameters in this unitary also exponentially vanish with increasing depth.

Proposition 4. Consider a cost function corresponding to the circuit in Figure 11 which we denote as
where $\rho$ is a computational basis input state, $\mathcal{B}_{m_i}$ is an entangling operation between $m_i$ pairs of qubits in subsystems $A_i$ and $B_i$ as considered in Proposition 2, $\mathcal{T}_i$ denotes noisy unitary evolution separable across the cut $A_i | B_i$, and with computational basis measurement $O_{z_n} = |z_n\rangle\langle z_n|$; $z_n \in \{0, 1\}^n$. The partial derivative with respect to parameter $\theta_k$ is bounded as where we denote for $i \in \{1, 2\}$, and $L_{U_i}$ and $L_{V_i}$ are defined in the same manner as $L_U$ and $L_V$ in Section A.4.
Proof. Similar to the proof of Proposition 3, we have where the inequality is due to Hölder's tracial matrix inequality [51]. The first term is bounded as where we have used Lemma 7, and the fact that the lemma gives the same result if we consider the adjoint map of T 2 . As in the proof of Proposition 3, the second term in Eq. 105 is bounded as where the equality comes by directly evaluating the partial derivative, the first inequality comes from the application of Lemma 2, and the second inequality is an application of Lemma 7. Substituting Eqs. (107) and (110)

C Noise models
In our numerical studies, we use the local depolarizing noise model and a noise model of a near-term trapped-ion quantum computer developed by Trout et al. [42]. We assume single-qubit rotations $R_X(\theta) = e^{-i\theta X/2}$, $R_Y(\theta) = e^{-i\theta Y/2}$, $R_Z(\theta) = e^{-i\theta Z/2}$, and the two-qubit Mølmer-Sørensen gate $XX(\theta) \equiv e^{-i\theta X_i \otimes X_j}$ as native gates of the quantum computer. Here, $X$, $Y$, and $Z$ are Pauli matrices.

C.1 Local depolarizing noise
The local depolarizing noise channel with the error rate $p$ is
$$\mathcal{D}_p(\rho) = (1-p)\,\rho + p\,\frac{\mathbb{1}}{2}.$$
This is consistent with the definition in Eq. (18) and is included here for the reader's convenience. We choose $p = 2.425 \times 10^{-3}$ for simulations of the clean and dirty setup and $p = 2.425 \times 10^{-3} f$, where $f = 1/n, 2/n, \ldots, (n-1)/n$, for the simulations with all dirty qubits and variable error rates. The error channels are applied to the dirty qubits after each layer of non-parallelizable gates as shown in Figure 12.

Figure 12: Local depolarizing noise model. A 4-qubit HVA circuit with the local depolarizing noise and 2 dirty qubits. The depolarizing noise channel $\mathcal{D}_p$ is applied to all the dirty qubits after each layer of non-parallelizable gates.
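To make the clean and dirty noise placement concrete, the sketch below applies one layer of single-qubit depolarizing noise to the first $n_d$ qubits of a 4-qubit density matrix and reports the resulting purity. It is a minimal illustration, not the simulation code used for the figures: it assumes the channel $\mathcal{D}_p(\rho) = (1-p)\rho + p\,\mathbb{1}/2$ on each dirty qubit, written in the equivalent Pauli form $(1 - 3p/4)\rho + (p/4)(X\rho X + Y\rho Y + Z\rho Z)$, and all function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def embed(op, q, n):
    """Single-qubit operator `op` acting on qubit q of an n-qubit register."""
    out = np.array([[1.0 + 0j]])
    for i in range(n):
        out = np.kron(out, op if i == q else I2)
    return out


def depolarize_qubit(rho, q, n, p):
    """Assumed channel on qubit q: (1 - p) rho + p Tr_q[rho] (x) 1/2."""
    out = (1 - 3 * p / 4) * rho
    for pauli in (X, Y, Z):
        big = embed(pauli, q, n)
        out = out + (p / 4) * (big @ rho @ big)
    return out


def dirty_layer(rho, n, n_d, p):
    """Apply the noise to the first n_d (dirty) qubits only."""
    for q in range(n_d):
        rho = depolarize_qubit(rho, q, n, p)
    return rho


n, p = 4, 2.425e-3
rho0 = np.zeros((2**n, 2**n), dtype=complex)
rho0[0, 0] = 1.0  # |0000><0000|

states = [dirty_layer(rho0, n, n_d, p) for n_d in range(n + 1)]
purities = [np.trace(r @ r).real for r in states]
scaled_rates = [p * k / n for k in range(1, n)]  # variable-error setup, f = k/n
print(purities)
print(scaled_rates)
```

For a product input the purity after one layer factorizes, so it drops by the same single-qubit factor for every additional dirty qubit.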

Figure 13: Trapped-ion, realistic noise model. A 4-qubit HVA circuit with realistic, trapped-ion noise and 2 dirty qubits. When a two-qubit gate is applied between two dirty qubits, it is followed by a 2-qubit noise channel $\mathcal{C}$. All the $R_Z$ gates acting at the dirty qubits are followed by their noise channels $\mathcal{S}_Z$. If a two-qubit XX gate acts in between a dirty and a clean qubit we assume that it is noiseless.
In the case of an XX gate acting at a dirty and a clean qubit, we assume that it is a noiseless gate. In practice, modeling an XX gate in between an error-corrected qubit and a dirty qubit requires taking into account details of the error correction code. This is outside of the scope of this work, so we assume no error for such a gate. This assumption allows us to upper bound the performance of a quantum computer with quantum error-corrected qubits. Figure 13 shows how the noise channels are applied for our HVA simulations.
In the case of our HVA simulations, the choice of perfect XX gates in between dirty and clean qubits causes deviations from a scaling of the gradient magnitude with the n d /n ratio. For our choice of the dirty qubits forming a contiguous block of qubits, the number of the noisy XX gates is L(n d − 1) for 1 ≤ n d ≤ n − 1 and Ln d for n d = 0, n. Therefore, a fraction of the noisy two-qubit gates in the circuit scales as (n d − 1)/n for 1 ≤ n d ≤ n − 1 causing the gradient magnitude to scale with (n d − 1)/n rather than n d /n as can be seen in Figure 6b.
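The gate count quoted above can be reproduced with a few lines. The sketch assumes that each HVA layer applies one XX gate per nearest-neighbor pair on a ring with periodic boundary conditions, that the dirty qubits occupy the first $n_d$ positions, and that an XX gate is noisy only when both of its qubits are dirty; `noisy_xx_per_layer` is a hypothetical helper.

```python
def noisy_xx_per_layer(n, n_d):
    """Number of XX gates per layer acting on two dirty qubits (ring of n qubits)."""
    dirty = set(range(n_d))  # contiguous block of dirty qubits
    return sum(1 for i in range(n) if i in dirty and (i + 1) % n in dirty)


n = 8
counts = [noisy_xx_per_layer(n, n_d) for n_d in range(n + 1)]
for n_d, count in enumerate(counts):
    print(f"n_d={n_d}: {count} noisy XX gates per layer, fraction {count / n:.3f}")
```

Multiplying by the number of layers $L$ gives $L(n_d - 1)$ noisy gates for $1 \leq n_d \leq n - 1$ and $L n_d$ for $n_d = 0, n$, so the jump at $n_d = n$ comes from the single gate that closes the ring.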

D Additional numerical results
Here we show results for the gradient behavior in the case of 4 and 6-qubit HVA circuits simulated with the realistic and depolarizing noise models. Figure 14 shows the mean absolute value of the derivative of the cost function with respect to a parameter, $|\partial_{\theta_k} C|$, averaged over both 28 instances of random HVA parameters and the HVA parameters, plotted versus the depth of the ansatz for the depolarizing and realistic noise models. The behavior is very similar to that of the 8-qubit HVA shown in Figure 5. Figure 15 displays $|\partial_{\theta_k} C|$ versus the total error rate for the 4 and 6-qubit HVA circuits. We find a very good quality of the collapse, similar to the 8-qubit case shown in Figure 6. We also observe that for the depolarizing noise small deviations from the perfect collapse grow with increasing $n$. We leave further investigation of this phenomenon to future work. Figure 16 shows the behavior of $|\partial_{\theta_k} C|$ for lower layer numbers $L = 1, \ldots, 30$, $n = 4, 6, 8, 10$, and the realistic noise model. Again, for each triple of $n$, $n_d$, $L$ values we use 28 random HVA instances to estimate $|\partial_{\theta_k} C|$. Apart from the very low $L \leq n$, the behavior of the derivative depends primarily on the $n_d/n$ ratio, displaying a decay with an increasing $n_d$ similar to the larger $L$ case. The different behavior observed for very small $L$ is not surprising, as for such shallow circuits one can expect the impact of the errors to be strongly localized due to causal cone effects, unlike for the deeper circuits. Furthermore, we see in that regime a strong dependence of the results on $L$ that overshadows subtler $n_d$ effects. For $n_d > 0$ and larger $L$, the decay of the gradient with $L$ is approximately exponential, though this behavior is less clear than for the deeper circuits due to seemingly random fluctuations superimposed on the dominant trend, which might be finite sample effects.

E Repository
We have prepared a repository which can be found at https://github.com/danielbultrini/Clean_Dirty_Qubits. This contains data frames with numerical results and functions required to plot all the figures shown. In addition, to reproduce the depolarizing noise model results, an open-source Qiskit-based code is provided. This code is not optimized but does implement the depolarizing clean and dirty model. Its aim is to make it easier for interested readers to expand or alter it for their own purposes.

Figure 14: Here we show the mean absolute value of the partial derivative of the cost function with respect to a parameter, $|\partial_{\theta_k} C|$, for 4 (a,b) and 6-qubit (c,d) systems. We plot the results for both the clean and dirty and the variable noise setups in the same way as in Figure 5. As for the 8-qubit case we find that $|\partial_{\theta_k} C|$ decays exponentially with increasing $L$. Furthermore, we observe that $|\partial_{\theta_k} C|$ decays exponentially with increasing $n_d$, similarly as for the 8-qubit HVA circuits. We see that the general pattern seen in the main text holds, with higher noisy qubit counts decaying faster than lower ones.