Reduction of finite sampling noise in quantum neural networks

Quantum neural networks (QNNs) use parameterized quantum circuits with data-dependent inputs and generate outputs through the evaluation of expectation values. Calculating these expectation values necessitates repeated circuit evaluations, thus introducing fundamental finite-sampling noise even on error-free quantum computers. We reduce this noise by introducing variance regularization, a technique for reducing the variance of the expectation value during the quantum model training. This technique requires no additional circuit evaluations if the QNN is properly constructed. Our empirical findings demonstrate that the reduced variance speeds up the training, lowers the output noise, and decreases the number of necessary evaluations of gradient circuits. This regularization method is benchmarked on the regression of multiple functions and the potential energy surface of water. We show that in our examples, it lowers the variance by an order of magnitude on average and leads to a significantly reduced noise level of the QNN. We finally demonstrate QNN training on a real quantum device and evaluate the impact of error mitigation. Here, the optimization is feasible only due to the reduced number of necessary shots in the gradient evaluation resulting from the reduced variance.


Introduction
The methods of quantum machine learning are among the most promising approaches to achieve a quantum advantage in today's era of noisy intermediate-scale quantum (NISQ) computers [1,2,3]. Quantum machine learning (QML) offers significant flexibility in the design of quantum circuits, allowing for the use of short and hardware-efficient circuits. These circuits may operate more effectively on noisy hardware compared to problem-driven approaches such as quantum optimization or quantum chemistry.
In this young field, mainly two approaches are followed to develop machine learning algorithms on NISQ hardware. The first approach utilizes the exponentially large Hilbert space that is accessible by a quantum computer for the so-called kernel trick [4]. Here, the input data is embedded into a high-dimensional space, enabling linear regression or classification within this space [5]. On a quantum computer, this is achieved by using a quantum feature map [6,7], a circuit in which data is encoded into a quantum state, for example by using rotation gates [8]. Optionally, the quantum feature map may contain additional trainable parameters that enable an adaptation of the feature map to the given data [9,10,11]. The feature map is also utilized in the second QML approach, in which the outputs of the QML model are either probabilities of measuring qubits in a certain computational state or expectation values of manually chosen operators [12,3]. This approach follows the principles of variational quantum algorithms [2], and in this context, the feature map is more commonly referred to as a parameterized quantum circuit (PQC) [13]. It is also possible to use quantum states directly as an input and manipulate the state by a trainable circuit before measuring it [14,15,16,17].
In the following, we denote the second QML approach as Quantum Neural Networks (QNNs), since this is the most established name today [18]. Parameters of QNNs are often optimized by minimizing a loss function, similar to the training of a classical artificial neural network (ANN) by gradient descent [12,19,20]. Differentiation of the expectation value is possible, for example, by the parameter-shift rule, in which the analytic derivative is obtained through evaluations of the expectation value with shifted parameter values [12,21,22].
Recent research on QNNs has placed a strong emphasis on understanding their mathematical properties and investigating their advantages and disadvantages. Like ANNs, QNNs also offer universal function approximation [23,24], which can already be achieved on the single-qubit level [25]. Furthermore, QNNs can reach a high expressivity by the repeated encoding of the input data, also known as data reuploading [26,27,24]. An open question is whether the high expressivity of QNNs is more of a problem than a feature, as it can also easily lead to overfitting of the data. On the other hand, research has also demonstrated that QNNs can achieve good generalization [Peters.2023] with only few data points [28].
Another important topic in the study of QNNs is the phenomenon of barren plateaus, which refers to the observation that the variance of the gradient vanishes exponentially. The issue is a well-known challenge [29,30] caused by various sources [31,32,33,34]. While specially designed PQCs have shown promise in avoiding this issue [35,36], another source of barren plateaus arises from noise that can occur during evaluation on real hardware [37,38]. Even if hardware-related noise is reduced in the future, there will always be some level of noise resulting from the finite sampling of the quantum state.
In this work, we address the challenge of handling finite sampling noise in QNNs within the constraints of the NISQ era, where a massive number of repeated evaluations of a quantum circuit is currently not feasible. Today, evaluating a circuit with a larger number of shots comes with long execution times and high financial costs, and therefore, the number of shots is often fixed to a maximum number (see footnote 1). As this is a technological issue that is not provider-specific, and given that the optimization of a QNN typically requires the evaluation of thousands of circuits, finite sampling noise can be a significant obstacle in training QNNs on real hardware. Additionally, access to quantum computers is often granted by a pay-per-shot plan [39], which can result in considerable costs for the full training of a QNN. Therefore, reducing the number of shots is a practical and crucial goal in the current NISQ era of quantum computing.
The standard deviation (STD) associated with the finite sampling noise of the expectation value E[Ĥ] = ⟨Ψ|Ĥ|Ψ⟩ of some operator Ĥ is calculated as:
\[ \mathrm{std}(E[\hat{H}]) = \sqrt{\frac{\mathrm{var}(E[\hat{H}])}{N_\mathrm{shots}}} \tag{1} \]
where var(E[Ĥ]) represents the variance of the expectation value. The formula shows that the finite sampling noise of the expectation value decreases with O(1/√N_shots) as the number of shots N_shots increases. However, since the number of shots is practically limited in current NISQ devices, we propose an alternative approach that reduces std(E[Ĥ]) by lowering the variance of the expectation value. In QNNs, the PQC that generates the wavefunction |Ψ⟩ and the operator Ĥ can be chosen freely, and they can be optimized to reduce the variance. We show that a significant reduction of the finite sampling noise can be achieved by adding the variance of the QNN as a regularization term in the loss function. This approach ensures that the training of the QNN not only optimizes the fitting loss but also significantly reduces the variance of the output.
Footnote 1: For example, on IBM's currently available hardware, the maximum number of shots is limited to 100 000. An execution of a single circuit on one of the fastest platforms available at this provider with this number of shots takes about 2 minutes. (Experiment: The circuit as displayed in Figure 1 with 20 qubits and 3 layers is evaluated with 100 000 shots on ibm_cairo.)
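The 1/√N_shots scaling of the sampling noise can be checked numerically. The following is a minimal sketch, assuming a hypothetical observable with eigenvalues +1 and −1 that is measured as +1 with probability `p_plus`; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def predicted_std(p_plus, n_shots):
    """std(E[H]) = sqrt(var(E[H]) / N_shots) for an observable with eigenvalues +1 and -1."""
    mean = 2.0 * p_plus - 1.0   # <H>
    variance = 1.0 - mean**2    # <H^2> - <H>^2, with <H^2> = 1
    return float(np.sqrt(variance / n_shots))

def empirical_std(p_plus, n_shots, n_repeats, seed=0):
    """Spread of the sampled expectation value over many repeated estimates."""
    rng = np.random.default_rng(seed)
    # Number of +1 outcomes in each repetition of n_shots measurements
    counts_plus = rng.binomial(n_shots, p_plus, size=n_repeats)
    estimates = (2.0 * counts_plus - n_shots) / n_shots
    return float(np.std(estimates))

if __name__ == "__main__":
    for n_shots in (100, 1000, 10000):
        print(n_shots, predicted_std(0.7, n_shots), empirical_std(0.7, n_shots, 2000))
```

For a fixed number of shots, the only remaining handle on the noise is the variance in the numerator, which is the quantity targeted by the regularization.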
The paper is structured as follows: In Section 2, we provide a technical introduction to QNNs and our concept of variance regularization. Section 3 presents an introductory example that demonstrates the impact of variance regularization on a selected problem. Next, in Section 4, we delve into the details of how variance regularization can be incorporated into the optimization procedure. We further discuss in more depth the optimization for a regression of the logarithm function and present results for two other functions. Additionally, we apply the variance regularization in the interpolation of the potential energy surface of water in Section 5. Section 6 focuses on evaluating the performance of QNNs on real quantum computing hardware, and we provide an overview of the optimization procedure conducted on the real hardware. Finally, we examine the combination of variance regularization and error mitigation techniques, specifically zero-noise extrapolation.

Variance Regularization
The QNN employed in this work utilizes a PQC, denoted as Û, to encode the classical input data x ∈ R into a quantum state |Ψ⟩ and manipulate the quantum state using additional parameters φ:
\[ |\Psi(x, \varphi)\rangle = \hat{U}(x, \varphi)\,|0\rangle \tag{2} \]
Practically, the PQC Û(x, φ) is primarily constructed from one- and two-qubit gates, and often rotational gates are utilized for encoding data and manipulating the quantum state. Employing repeated input encoding techniques enhances the expressibility of the circuit [27]. An example of a PQC used in this study can be found in Figure 1.
The output of the QNN, denoted as f(x), is obtained by evaluating the expectation value of a cost operator Ĉ(ϕ) as follows:
\[ f(x) = \langle \Psi(x, \varphi) |\, \hat{C}(\phi) \,| \Psi(x, \varphi) \rangle \tag{3} \]
In scenarios where multiple outputs are desired from the QNN, individual cost operators Ĉ_j(ϕ_j) can be used for each output. In such cases, the QNN output f is considered as a vector containing the expectation values of each respective cost operator. When there is no need to distinguish between the parameters φ of the PQC and the parameters ϕ of the cost operators, they can be combined into a single parameter vector θ. For convenience, the explicit dependency on the parameters is often omitted in the subsequent discussions.
The parameters θ of the QNN are determined through the minimization of a loss function L_fit(θ). The choice of the loss function depends on the specific task at hand and plays a crucial role in the model's performance. It is common to adapt and utilize classical machine learning loss functions within the context of QNNs. In this work, we mainly focus on regression, for which a possible loss function is given by the squared error (named fitting loss in the following):
\[ L_\mathrm{fit}(\theta) = \sum_i \left( f_\theta(x_i) - y_i \right)^2 \tag{4} \]
Here, we consider a set of input data {x_i ∈ R^d} and their corresponding labels {y_i ∈ R} within a supervised learning scenario. In case of a classification task, the cross-entropy loss function is often employed [40]. Moreover, alternative loss functions empower the QNN to tackle other problems such as differential equations [41].
The optimization of the parameters θ involves minimizing the loss function, which can be challenging because the loss function is typically non-convex. Therefore, gradient-based optimization methods are commonly employed to address this task [20]. To obtain derivatives with respect to the parameters φ of the PQC, one approach is the parameter-shift rule [12,21,22]. In this method, the parameter is shifted in each gate by a specific value and the analytic derivative is obtained from the difference of the resulting expectation values. Consequently, it is necessary to evaluate two circuits for each gate containing the parameter. However, when dealing with PQCs that consist of numerous parameterized gates, the evaluation of the resulting number of circuits can become a computational bottleneck, particularly when utilizing state-of-the-art quantum hardware. While alternative approaches, such as the linear combination of unitaries, do exist [42,43], it is important to consider that these approaches may be more vulnerable to hardware noise when compared to the parameter-shift rule.
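As an illustration of the parameter-shift rule, the following minimal sketch differentiates the expectation value ⟨Ẑ⟩ of a single qubit after an Ry(θ) rotation, for which the exact result is cos(θ). The single-qubit setting and the function names are assumptions chosen for illustration; the shift of π/2 is the standard choice for gates generated by a Pauli operator.

```python
import numpy as np

def expectation_z(theta):
    """<Z> after applying Ry(theta) to |0>; analytically equal to cos(theta)."""
    state = np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])
    pauli_z = np.diag([1.0, -1.0])
    return float(state @ pauli_z @ state)

def parameter_shift_derivative(f, theta):
    """Analytic derivative from two shifted evaluations: (f(theta + pi/2) - f(theta - pi/2)) / 2."""
    shift = np.pi / 2.0
    return (f(theta + shift) - f(theta - shift)) / 2.0

if __name__ == "__main__":
    theta = 0.3
    print(parameter_shift_derivative(expectation_z, theta), -np.sin(theta))
```

Each parameterized gate costs two circuit evaluations here, which is why the number of shots per gradient circuit matters for the overall runtime.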
The differentiation with respect to the cost operator parameters ϕ is straightforward:
\[ \partial_{\phi_m} f(x) = \langle \Psi |\, \partial_{\phi_m} \hat{C}(\phi) \,| \Psi \rangle \tag{5} \]
The derivative ∂_ϕm Ĉ(ϕ) of the cost operator can often be evaluated from the same measurement as the expectation value of the cost operator. In our routines, we reuse the results from the circuit evaluation of the cost operators to calculate these derivatives.
Once the loss function is minimized to a satisfactorily low value, the QNN is considered to be trained and the model can be used to obtain new predictions by inputting new data into Eq. (3). However, it is important to keep in mind that the output of the model is determined by the stochastic process of measuring the expectation value, and therefore, in contrast to classical ANNs, the output contains noise. The standard deviation of the expectation value, i.e. the model output, is obtained by (cf. Eq. (1)):
\[ \mathrm{std}(f) = \sqrt{\frac{\sigma_f^2}{N_\mathrm{shots}}} \tag{6} \]
The variance σ_f² of the QNN can be computed by
\[ \sigma_f^2 = \langle \Psi | \hat{C}^2 | \Psi \rangle - \langle \Psi | \hat{C} | \Psi \rangle^2 \tag{7} \]
In case of multiple outputs, the standard deviation is obtained for each output by evaluating the variance for each cost operator Ĉ_j separately.
In the following, we utilize the variance as a regularization during the optimization of the QNN in order to reduce the noise of the model output. This is achieved by adding the following term to the loss function that minimizes the variance of the function at a set of points {x_k}:
\[ L_\mathrm{var} = \sum_k \left\| \sigma_f^2(x_k) \right\| \tag{8} \]
where ||.|| is a suitable vector norm. The total loss that is minimized during the training procedure is the sum
\[ L = L_\mathrm{fit} + \alpha\, L_\mathrm{var} \tag{9} \]
The hyper-parameter α > 0 is used to adjust the balance between the fitting error and the variance reduction of the model. In our experience, a value between 10⁻⁴ and 10⁻² yields satisfactory results. The choice of {x_k} is in principle arbitrary and should evenly capture the domain space of the input data.
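The regularized loss can be sketched in a few lines. This is an illustration only, assuming a single-output QNN whose outputs and variances at the training points are already available as arrays, and using the plain sum of the variances as one possible choice of the norm; the function names are hypothetical.

```python
import numpy as np

def fitting_loss(f_values, labels):
    """Squared error between QNN outputs and labels."""
    return float(np.sum((np.asarray(f_values) - np.asarray(labels)) ** 2))

def variance_loss(variances):
    """Sum of the output variances at the regularization points (assumed norm choice)."""
    return float(np.sum(np.abs(variances)))

def total_loss(f_values, labels, variances, alpha):
    """Fitting loss plus alpha times the variance regularization term."""
    return fitting_loss(f_values, labels) + alpha * variance_loss(variances)

if __name__ == "__main__":
    f_values = np.array([0.1, 0.5, 0.9])
    labels = np.array([0.0, 0.5, 1.0])
    variances = np.array([2.0, 1.0, 3.0])
    print(total_loss(f_values, labels, variances, alpha=1e-2))
```

Because the variances enter only through classical post-processing, changing α or the norm does not change which circuits have to be executed.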
We assume that the training data is a good representation of the model domain, and therefore, we will set {x_k} = {x_i} in the following. This approach offers the advantage of reusing the circuits and function evaluations of f(x) from the fitting loss L_fit. The computational complexity of ⟨Ψ|Ĉ²|Ψ⟩ strongly depends on the choice of Ĉ. If the operator is idempotent, i.e. Ĉ = Ĉ², the evaluation of this term comes with no additional costs. For example, this is the case for the probability p_n(0) (or p_n(1)) of a single qubit n, since this is equivalent to a cost operator of |0⟩⟨0|_n (or |1⟩⟨1|_n), in which n labels the qubit. When the operator is diagonal in the computational basis, it becomes feasible to evaluate the expectation value of the squared operator using the same circuit measurements as those employed for function evaluation. This scenario occurs, for example, when the operator solely consists of Pauli operators Î and Ẑ, as observed in an Ising Hamiltonian of the following form:
\[ \hat{C} = \phi_1 \hat{I} + \phi_2 \sum_i \hat{Z}_i + \phi_3 \sum_{i<j} \hat{Z}_i \hat{Z}_j \tag{10} \]
This is also true for the evaluation of the gradient of the variance:
\[ \partial_\theta \sigma_f^2 = \partial_\theta \langle \Psi | \hat{C}^2 | \Psi \rangle - 2\, \langle \Psi | \hat{C} | \Psi \rangle\, \partial_\theta \langle \Psi | \hat{C} | \Psi \rangle \tag{11} \]
The second term is computed from the intermediates that are also used in the gradient computation of the loss function L_fit. The derivatives of the first term are obtained using the same circuits and measurements as those used for calculating the gradient of the function. Extra work is introduced purely in the classical post-processing to evaluate the additional expectation values from the measurements. For general observables, the operators Ĉ and Ĉ² can be simultaneously measured in the eigenbasis of Ĉ. Therefore, in principle, the quantum computation overhead is only increased by the additional unitary transformation to the eigenbasis. If this eigenbasis is unknown or impractical to compute, evaluating the squared cost operator requires a separate calculation of the expectation value ⟨Ψ|Ĉ²|Ψ⟩. This may make the application of variance regularization impractical in variational algorithms involving many non-commuting terms in the observables, as is often encountered in the variational quantum eigensolver (VQE). To discuss this in more detail, we consider the following transverse Ising Hamiltonian:
[Equation (12) is missing in the source.]
The observables involving Ẑ_p Ẑ_q and X̂_p can be organized into distinct groups of commuting observables. Consequently, evaluating the expectation value of Ĉ requires measurements from two separate circuits. However, in the case of the squared observable Ĉ², the count of distinct groups with commuting observables increases to 2N_q + 2 with N_q qubits. This expansion includes a group solely with pure Ẑ observables, another group with pure X̂ observables, and 2 groups per qubit, each involving either X̂ or Ŷ acting on the qubit while Î and Ẑ operate on all other qubits. Hence, 2N_q + 2 circuits must be measured, introducing a significant overhead for calculating the variance of the given transverse Ising Hamiltonian.
In QNNs, however, the situation is considerably different since the choice of the observable is usually flexible. Commonly used observables in the literature are expectation values of Pauli-Z observables or probabilities of being in a certain computational state [2,25,35,40,23,28]. The remainder of the manuscript thus explores scenarios where Ĉ is diagonal in the computational basis.
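For an operator that is diagonal in the computational basis, both ⟨Ĉ⟩ and ⟨Ĉ²⟩ follow from the same measurement counts. The sketch below illustrates this for a hypothetical operator of the form offset·Î + Σ_i c_i Ẑ_i. The counts dictionary, the coefficient names, and the convention that character i of the bitstring belongs to qubit i are assumptions for illustration (frameworks such as Qiskit order the bits differently).

```python
import numpy as np

def diagonal_operator_value(bitstring, offset, z_coeffs):
    """Eigenvalue of offset*I + sum_i z_coeffs[i]*Z_i for one computational basis state."""
    # Z has eigenvalue +1 for bit '0' and -1 for bit '1'
    z_eigenvalues = np.array([1.0 if bit == "0" else -1.0 for bit in bitstring])
    return offset + float(np.dot(z_coeffs, z_eigenvalues))

def mean_and_variance_from_counts(counts, offset, z_coeffs):
    """Estimate <C> and <C^2> - <C>^2 from a single set of measurement counts."""
    total = sum(counts.values())
    values = np.array([diagonal_operator_value(b, offset, z_coeffs) for b in counts])
    probabilities = np.array([n / total for n in counts.values()])
    mean = float(np.sum(probabilities * values))
    mean_of_square = float(np.sum(probabilities * values**2))
    return mean, mean_of_square - mean**2

if __name__ == "__main__":
    counts = {"00": 500, "11": 500}
    print(mean_and_variance_from_counts(counts, offset=0.0, z_coeffs=[1.0, 1.0]))
```

The only additional cost compared to evaluating ⟨Ĉ⟩ alone is the second weighted sum in the classical post-processing.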

Introductory Example
In this section, we present an illustrative example that demonstrates how variance regularization significantly enhances the robustness of a QNN against finite-sampling noise within a shot-based simulation. This example also highlights that a perfect fit, derived from a noise-free simulation, could inadvertently introduce a substantial variance of the expectation value. Consequently, this would necessitate a significantly large number of shots for an accurate evaluation. We want to point out that this effect should be taken into account when interpreting results from QNNs that are entirely sourced from noise-free simulators.
Figure 1 (caption, partially recovered): The layer is repeated l times for a repeated input encoding. The last layer of Ry gates serves as a change of the basis that is used for measurement. For a hardware-efficient approach, the rightmost controlled gate in the blueish layer is removed to avoid swapping.
Figure 1 displays the PQC that is utilized throughout this work. We use the Chebyshev input encoding, in which the input data is first encoded by a cos⁻¹(x) function [12]. The use of the inverse cosine implies a rescaling of the input data to [−1, 1]. Using this non-linear input encoding as angles in rotation gates around the x-axis yields the following identity for integer n [41]:
[Equation (13) is missing in the source.]
in which T_n(x) and U_n(x) are the first and second kind Chebyshev polynomials of degree n [44]. X̂ denotes the Pauli-X operator. In contrast to previous works, we introduce a parameter φ instead of a fixed value for n, which enables a flexible optimization of the degree of the Chebyshev polynomials during the training. An example of curves generated using Chebyshev polynomials with a non-integer value of n is presented in Appendix A.1. These curves demonstrate a smooth transition between the individual Chebyshev polynomials.
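The connection between the cos⁻¹ encoding and Chebyshev polynomials can be verified numerically. The sketch below assumes the gate convention Rx(θ) = exp(−iθX̂/2) and the rotation angle 2n·cos⁻¹(x); under these assumptions the gate equals T_n(x)·Î − i·√(1−x²)·U_{n−1}(x)·X̂ for integer n ≥ 1. The prefactor convention used in the paper may differ, and the function names are illustrative.

```python
import numpy as np

def chebyshev_t(n, x):
    """First-kind Chebyshev polynomial T_n(x) via the three-term recurrence."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def chebyshev_u(n, x):
    """Second-kind Chebyshev polynomial U_n(x) via the three-term recurrence."""
    u_prev, u_curr = 1.0, 2.0 * x
    if n == 0:
        return u_prev
    for _ in range(n - 1):
        u_prev, u_curr = u_curr, 2.0 * x * u_curr - u_prev
    return u_curr

PAULI_X = np.array([[0, 1], [1, 0]], dtype=complex)
IDENTITY = np.eye(2, dtype=complex)

def rx(theta):
    """Rotation around the x-axis, assumed convention exp(-i*theta*X/2)."""
    return np.cos(theta / 2.0) * IDENTITY - 1j * np.sin(theta / 2.0) * PAULI_X

def chebyshev_form(n, x):
    """T_n(x)*I - i*sqrt(1 - x^2)*U_{n-1}(x)*X for integer n >= 1."""
    return chebyshev_t(n, x) * IDENTITY - 1j * np.sqrt(1.0 - x**2) * chebyshev_u(n - 1, x) * PAULI_X

if __name__ == "__main__":
    n, x = 3, 0.4
    print(np.allclose(rx(2 * n * np.arccos(x)), chebyshev_form(n, x)))
```

Replacing the integer n by a trainable real parameter keeps the gate well defined, which is what allows the degree to be adjusted smoothly during training.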
The initial values for the parameters in the encoding are chosen to be evenly distributed between φ_{l,1} = 0.01 and φ_{l,N_q} = β, following the idea of the Chebyshev tower approach [41]. Here and in the subsequent sections, N_q denotes the number of qubits. The initial value 0.01 is chosen to avoid a redundancy in the parameter that occurs for φ = 0.0, and β is considered as a hyper-parameter of the model. The quantum state from the input encoding is manipulated by a hardware-efficient approach of Rzz(φ) = exp(−i (φ/2) Ẑ ⊗ Ẑ) two-qubit interaction gates that are arranged in a nearest-neighbor entangling set-up. The Ry gates at the beginning and the end of the feature map enable a basis change of the initial state and the measuring basis. Both are also optimized during the training.
In this introductory example, we choose the Ising Hamiltonian introduced in Eq. (10) as the cost operator. The parameters ϕ_1 to ϕ_3 of the Ising Hamiltonian are also optimized during the training. The first parameter ϕ_1 introduces a constant offset of the output of the QNN. The specific selection of this operator exemplifies a scenario where a large variance is observed after the training process. The PQC of the QNN is introduced in Figure 1, and N_q = 4 qubits and l = 2 layers are chosen. The optimization of the loss function, i.e. the training, is conducted using the noise-free simulator of Qiskit [45], and it is performed with and without the variance regularization.
Figure 2 displays the inference of the trained QNN with and without finite sampling noise. The top plot illustrates the QNN obtained without variance regularization, while the plot at the bottom demonstrates the outcomes with variance regularization. Note that the noise-free output of the QNN, represented by the dotted black line, accurately reproduces the logarithm function in both plots. The results of the shot-based simulator, here the QASM simulator in Qiskit [45], are computed utilizing 10 000 shots. By incorporating the variance regularization, the averaged variance of the trained QNN is reduced significantly by a factor of 85. This reduction in variance impacts the model's precision, and it leads to an increase in the squared error from 9.3 · 10⁻⁵ to 8.3 · 10⁻³ in the noise-free optimization. However, there is no substantial visual difference observed in the noise-free outcome for these low fitting losses.
This picture changes when considering the results of the shot-based simulations. Here, the difference between the evaluation of the QNN with the parameters obtained with and without variance regularization during the training is substantial. While the parameters obtained without variance regularization result in a drastically noisy inference, the result with variance regularization is difficult to visually distinguish from the noise-free result. Effectively, the 85-fold reduction in variance implies that the same level of shot noise can be achieved with 85 times fewer shots. Alternatively, if the number of shots is kept constant (as depicted in Figure 2), the standard deviation (cf. Eq. (6)) is reduced by a factor of 9.2. The example without variance regularization further illustrates that the QNN obtained through training with a noise-free simulator may not be practical for evaluations on a shot-based backend. Despite its high accuracy, the solution without variance regularization would require approximately 850 000 shots to achieve a similar result as the QNN trained with variance regularization.
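The two figures quoted above follow directly from the 1/√N_shots scaling of the standard deviation. As a short worked check (variable names are illustrative):

```python
import math

variance_reduction = 85.0      # factor by which the averaged variance is reduced
shots_regularized = 10_000     # shots used for the regularized QNN in Figure 2

# At a fixed number of shots, the standard deviation scales with sqrt(variance)
std_reduction = math.sqrt(variance_reduction)

# To reach the same noise level, the shots must scale linearly with the variance
equivalent_shots = shots_regularized * variance_reduction

print(round(std_reduction, 1), int(equivalent_shots))
```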
The cost operator presented in Eq. (10) exhibits limited flexibility due to its inclusion of only three free parameters. Additionally, the two-body interaction term is not well-suited for calculating the variance, as classical post-processing scales with O(N_q⁴). In the subsequent sections, we will utilize a different operator for our experiments, given by:
[Equation (14) is missing in the source.]
Our experiments have shown this operator to be more versatile and less susceptible to shot noise in general. Therefore, all the results presented in the following sections are obtained using this cost operator.

Optimization with variance regularization
In this section, we show numerical evidence for the benefits of using the variance regularization approach introduced in Section 2. All following calculations have been executed with the QML Python package sQUlearn [46], in which the variance regularization has been integrated. In the following optimization, we compare two approaches for choosing the parameter α in the variance regularization (cf. Eq. (9)). In the first approach, the parameter remains constant throughout the optimization process. The second approach involves adjusting the parameter dynamically during optimization. The objective is to initially force the QNN to significantly reduce the variance and then gradually transition towards a regime where the squared error (cf. Eq. (4)) dominates the total loss. This approach results in a significantly lower final variance value. An additional benefit of reducing the variance at the beginning is the ability to use a lower number of shots in the initial iterations. The parameter α is determined using a modified sigmoid function, given by:
[Equation (15) is missing in the source.]
Here, the parameter v represents the strength of the variance regularization at the end of the training, and b defines the width of the plateau in the initial phase when the optimization primarily focuses on reducing the variance of the function. The parameter a determines the slope of the decay in the regularization parameter. A plot illustrating the regularization parameter α_{a,b,v} for various parameter values is presented in Figure 3. As detailed in Appendix A.2, the fully converged solutions primarily depend on the parameter v, which balances variance and fitting loss. Higher values of v reduce the variance more but also increase the fitting loss. In this work, we consistently used a value of v = 0.005, though different applications might require adjustments. The parameters a and b can be used to adjust the variance reduction at the beginning of training. Parameter b should ensure a long enough plateau to reduce variance when the fitting loss is initially high. Parameter a controls the rate of variance increase after reaching its minimum; however, results are not very sensitive to the precise value of a. More details on parameter selection and results for different parameters can be found in Appendix A.2.
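A schedule with the qualitative behaviour described above (an initial plateau, a decay whose slope is set by a, a plateau width set by b, and a final value v) can be sketched as follows. The exact functional form used in the paper is not recoverable from this text; the form below, including the initial plateau value `alpha_start = 1.0`, is an assumption for illustration only.

```python
import numpy as np

def alpha_schedule(iteration, a, b, v, alpha_start=1.0):
    """Sigmoid-type decay from alpha_start (assumed) to v around iteration b with slope a."""
    z = a * (np.asarray(iteration, dtype=float) - b)
    # 1 / (1 + exp(z)) written with tanh to avoid overflow for large |z|
    decay = 0.5 * (1.0 - np.tanh(z / 2.0))
    return v + (alpha_start - v) * decay

if __name__ == "__main__":
    iterations = np.arange(0, 301, 50)
    print(alpha_schedule(iterations, a=0.1, b=100, v=0.005))
```

With this form, the value at iteration b is the midpoint between the assumed starting value and v, and a larger a makes the transition sharper.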
In addition, the number of shots for evaluating the gradient is adjusted during the optimization. A high number of shots is not needed at the beginning of the optimization, since a precise evaluation is not beneficial in case of a large fitting error. In this section, we introduce a procedure that automatically adjusts the number of shots during the evaluation of the gradient circuits. It is based on the relative standard deviation (RSD) of the fitting loss L_fit (Eq. (4)). The number of shots is adjusted such that the RSD is lower than a predefined boundary. By using the approximation var(f(X)) ≈ (f′(E[X]))² var(X) [47], one obtains for the variance of the fitting loss L_fit:
\[ \mathrm{var}(L_\mathrm{fit}) \approx \frac{4}{N_\mathrm{shots}} \sum_i \left( f_\theta(x_i) - y_i \right)^2 \sigma_f^2(x_i) \tag{16} \]
Empirically, we have observed that this approach also provides a good approximation of the variance of the total loss function L, with var(L_var) being notably lower. During the initial stages of optimization, the fitting error dominates the variance of the total loss. Furthermore, as the optimization progresses and the regularization parameter α becomes smaller, the contribution of the variance loss to the total variance diminishes to a negligible extent. Using Eq. (16), the relative standard deviation of the fitting error can be expressed as follows:
\[ \mathrm{RSD}(L_\mathrm{fit}) = \frac{\sqrt{\mathrm{var}(L_\mathrm{fit})}}{L_\mathrm{fit}} \tag{17} \]
The number of shots for evaluating the circuits of the gradient computation is obtained by setting an upper limit of the RSD by a predefined hyper-parameter β. Additionally, a minimum (100) and maximum number of shots is set. In our experience, a value of β = 0.1 is enough for a gradient evaluation that does not negatively impact the optimization compared to an optimization with the given maximum number of shots.
The criterion for determining the number of shots was designed in such a way that no additional evaluations of quantities are required. The values of f_θ(x_i) and σ_f²(x_i) are already included in the loss function and are evaluated with the maximum number of shots. Subsequently, the number of shots in the evaluation of the circuits needed for the gradient computation is adjusted. The variance of the loss function, theoretically, does not provide any information about the variance of its gradient. We are not aware of any bounds that can be derived for the gradient of the loss function without introducing additional circuit evaluations and expectation value calculations. As a result, there is no theoretical guarantee for determining the correct number of shots to ensure a predefined noise level for the gradient. However, empirically, we have found that this approach yields satisfactory results, and it achieves a similar optimization performance compared to working with the maximum number of shots.
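The shot-selection criterion can be sketched as follows, assuming that the variance of the fitting loss is approximated by (4/N_shots)·Σ_i (f_θ(x_i) − y_i)²·σ_f²(x_i), which follows from the stated approximation applied to each squared residual. Requiring RSD ≤ β then gives N_shots ≥ 4·Σ_i (f_θ(x_i) − y_i)²·σ_f²(x_i) / (β·L_fit)². The function name and the handling of a vanishing fitting loss are illustrative choices; the limits of 100 and 5000 shots are the values quoted in the text.

```python
import numpy as np

def shots_for_gradient(f_values, labels, variances, beta=0.1, min_shots=100, max_shots=5000):
    """Smallest shot number with RSD(L_fit) <= beta, clipped to [min_shots, max_shots]."""
    residuals = np.asarray(f_values, dtype=float) - np.asarray(labels, dtype=float)
    fit_loss = float(np.sum(residuals**2))
    if fit_loss == 0.0:
        # RSD is undefined for a vanishing loss; fall back to the most precise setting
        return max_shots
    # Variance of L_fit for a single shot (delta-method approximation)
    single_shot_variance = float(np.sum(4.0 * residuals**2 * np.asarray(variances, dtype=float)))
    required = single_shot_variance / (beta * fit_loss) ** 2
    return int(np.clip(np.ceil(required), min_shots, max_shots))

if __name__ == "__main__":
    print(shots_for_gradient([2.0, 0.0], [0.0, 0.0], [100.0, 1.0], beta=0.5))
```

A large fitting loss or a small output variance both lower the required number of shots, which is why the variance regularization shortens the gradient evaluation early in training.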
Figure 4 illustrates the optimization process for fitting the logarithmic function. Displayed from top to [...] It utilizes N_q = 10 qubits, l = 3 layers, and the cost operator introduced in Eq. (14). The optimization is performed by ADAM [48] with a learning rate of 0.1, utilizing IBM's shot-based QASM simulator with a maximum of 5000 shots. The dashed lines in the plots represent statevector simulations without any noise. The shot-based optimization is repeated 10 times with the same initial parameters, and the individual results are displayed using thin lines. The solid lines represent the averaged result over the 10 runs.
We first discuss the results without any variance regularization (blue curves). Without any finite measurement errors, the fit error of the noiseless optimization is the lowest of all optimization strategies. However, switching to the shot-based simulator yields a considerably different picture. The fit error reaches a considerably higher plateau compared to the statevector simulation, primarily due to the finite sampling noise, and the final value is the highest compared to all other methods. Nonetheless, the variance slightly decreases as the optimization progresses. We conclude that the ADAM optimization indirectly addresses the variance in an attempt to further minimize the fitting error. However, this process occurs at a slow and inefficient pace.
Adding the variance regularization with a constant parameter α (green curves) strongly reduces the variance of the QNN, and the final variance is significantly lower. The fit error in the noise-free simulation is higher compared to the case without variance regularization, as the optimization is not solely focused on minimizing the fit error. However, this effect reverses when transitioning to the shot-based simulation, in which the reduced variance also improves the convergence to a lower fit error.
Moving on to the iteration-based variance parameter, we observe a similar trend in the fit error. However, the variance is significantly decreased compared to the constant parameter α. The variance plot clearly demonstrates that this strategy primarily targets the variance reduction first, while prolonging the optimization of the fit. On the other hand, the lower variance results in fewer shots required for the time-consuming gradient evaluation, leading to a faster overall computation time until reaching the plateau after 300 iterations. The final variance is reduced from 52.09 (without variance regularization) to 4.16. In other words, the final QNN obtained with variance regularization requires over ten times fewer shots to achieve a similar level of finite sampling noise.
Figure 5 presents the final results of regression on three different functions: the logarithm, the absolute value function, and an oscillating function. The figures also display the averaged values of the final fit error and variance over the last 10 optimization iterations. The optimization process follows the methodology described above. For the absolute value function, the fitting error of the individual training data points is weighted, with higher weights assigned to the points in the center. These weights are determined by the function w(x) = 2 exp(−x²).
In all three examples, there is a noticeable reduction in both finite sampling noise and fit error. Particularly, the example of the absolute value function demonstrates a remarkable reduction in variance, exceeding a factor of 20. The reduction in fit error is evident in the plot for the oscillating function. The operator utilized in these examples (cf. Eq. (14)) has the capability to generate results without the same level of significant noise observed in the introductory example. However, despite this inherent capability, the application of variance regularization greatly enhances the results for all three examples, without imposing a notable computational overhead.

Application: Potential energy surface
In this section, we discuss an application addressing the interpolation of potential energy surfaces (PES) for molecules. PES are derived by solving the electronic Schrödinger equation for molecules, considering the spatial coordinates of their constituent atoms. They offer valuable insights into stable molecular configurations, reaction pathways, and reaction kinetics. Computing the energy for a given molecular geometry is computationally demanding with unfavorable scaling, making it crucial to obtain a good representation of the PES from a limited number of data points. Given that the data stems from inherently quantum simulations, this presents an interesting application for quantum machine learning [49].
In the following, we demonstrate an interpolation of a PES for the water molecule based on the dataset [50] provided in Ref. [51]. The coordinates of water molecules are transformed into the three degrees of freedom of the molecule: the two distinct bond distances between the oxygen atom and each of the hydrogen atoms, and the angle formed by one hydrogen atom, oxygen, and the second hydrogen atom. Subsequently, the bond distances and angles are rescaled to the interval [−0.9, 0.9] for the cos⁻¹ encoding (cf. Figure 1), while the energies are scaled to the interval [0, 1] for the output of the QNN. The dataset is divided into a training set consisting of 50 samples and a test set of 47 samples. The QNN setup follows the procedure described in Section 4, employing 9 qubits and setting the ADAM learning rate to 0.01. The three input features are encoded cyclically; for instance, the first feature is encoded in the Rx gates of qubits 1, 4, and 7. [...] 4, with the variance reduced by an order of magnitude and a substantial decrease in the fitting loss. Moreover, the reduction in variance allows for a significant decrease in the number of shots required for gradient evaluation, while still yielding a lower fitting loss L_fit. The inference plots illustrate a significant reduction in the output variance through the variance regularization. Specifically, the width of the 95% confidence interval decreases notably from an average value of 0.060 to 0.012 Hartree. Additionally, the averaged values of the R² scores demonstrate further improvement through the variance regularization.
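The preprocessing described above can be sketched as a min-max rescaling followed by a cyclic assignment of features to qubits. This is an illustration only: the function names are hypothetical, the min-max form of the rescaling is an assumption, and qubits are counted from 1 as in the text.

```python
import numpy as np

def rescale(values, lower, upper):
    """Column-wise min-max rescaling of the data to the interval [lower, upper]."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(axis=0), values.max(axis=0)
    return lower + (values - v_min) * (upper - lower) / (v_max - v_min)

def feature_index_for_qubit(qubit, num_features=3):
    """Cyclic encoding: qubits 1, 4, 7 carry feature 0; qubits 2, 5, 8 carry feature 1; and so on."""
    return (qubit - 1) % num_features

if __name__ == "__main__":
    geometries = np.array([[0.9, 0.9, 100.0], [1.0, 1.1, 104.5], [1.2, 1.0, 110.0]])
    print(rescale(geometries, -0.9, 0.9))
    print([feature_index_for_qubit(q) for q in range(1, 10)])
```

Keeping the inputs inside [−0.9, 0.9] rather than [−1, 1] stays away from the end points of the domain of cos⁻¹.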

Results from the real hardware
In this section, we investigate the impact of the variance regularization on the performance of QNNs on real quantum computing hardware. All following computations are executed within the IBM Quantum ecosystem [52]. Training a QNN on real quantum computing platforms remains a challenging and time-consuming task today. We demonstrate that the reduced variance and the optimization procedure discussed in Section 4 enable the optimization process on a real quantum computing backend, providing a notable performance boost due to the reduced number of shots.
We train the QNN using the procedure described in Section 4, wherein the last two-qubit gate Rzz of each layer is removed to achieve hardware-efficient linear entangling without introducing swap gates. To reduce computational time, we decrease the number of training points for fitting the data compared to the previous examples. The training data used in this example is shown in Figure 8. Additionally, to expedite the process, no error mitigation techniques are employed. The qubit assignment remains fixed throughout the entire optimization process, as well as during inference. Nevertheless, evaluating the circuits required by the parameter-shift rule still consumes a significant amount of time: it takes approximately 27 minutes to evaluate the 2618 circuits needed in this example for the parameter-shift derivative on the real backend, even with the lowest setting of 100 shots. Owing to further overhead, such as pre- and post-processing of circuits and especially queuing, the training process for the QNN extends over a duration of several weeks. Frequent re-calibration of the quantum hardware could potentially impact the optimization process; in our optimization, however, this effect seems negligible.
Figure 7 showcases the optimization performed on the IBM backend ibmq_montreal for the absolute value function. Additionally, we present the optimization with the exact same settings on the shot-based simulator. Notably, both optimizations exhibit very similar behavior, indicating the potential for further reduction in the fitting loss on the real backend. We anticipate that at some point, the additional noise introduced by the hardware will limit the further optimization process, resulting in a higher final loss compared to the shot-based simulation. Due to the variance regularization, which greatly reduces the output variance of the QNN, we can perform the gradient evaluation with a relatively low number of shots. This factor enables us to carry out the optimization of this example on the real backend in the first place. The full optimization curve of the shot-based QASM simulation is displayed in the Appendix in Section A.3.
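To illustrate why the number of shots matters in the parameter-shift gradient evaluation, the following toy sketch estimates the derivative of a single-qubit expectation value from finite samples. It is an assumption-laden illustration, not the QNN of this work: the state Ry(θ)|0⟩ with observable Z (for which ⟨Z⟩ = cos θ) replaces the full circuit, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_expectation(theta, shots):
    """Estimate <Z> for the state Ry(theta)|0> from a finite number of shots."""
    p0 = np.cos(theta / 2) ** 2        # probability of measuring outcome 0
    n0 = rng.binomial(shots, p0)       # number of 0 outcomes among the shots
    return (2 * n0 - shots) / shots    # outcome 0 counts +1, outcome 1 counts -1

def parameter_shift_gradient(theta, shots):
    """Two circuit evaluations per parameter, shifted by +/- pi/2."""
    plus = sampled_expectation(theta + np.pi / 2, shots)
    minus = sampled_expectation(theta - np.pi / 2, shots)
    return 0.5 * (plus - minus)

theta = 0.4
exact = -np.sin(theta)  # analytic derivative of cos(theta)
grads_100 = np.array([parameter_shift_gradient(theta, 100) for _ in range(500)])
grads_5000 = np.array([parameter_shift_gradient(theta, 5000) for _ in range(500)])

print("exact gradient:", exact)
print("100 shots:  mean", grads_100.mean(), "std", grads_100.std())
print("5000 shots: mean", grads_5000.mean(), "std", grads_5000.std())
```

Both estimators are unbiased, but the spread of the 100-shot gradient is considerably larger. A lower output variance of the model reduces this spread at a fixed number of shots, which is what makes low-shot gradient evaluations usable.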
Figure 8 displays the inference of the trained QNNs. The blue curve is evaluated with the shot-based simulator and with parameters that are obtained from the converged QASM optimization displayed in Figure 7. The orange curve is obtained from the IBM backend ibmq_montreal with parameters that result from the optimization on the same backend. At this stage of the optimization, the output agrees well with the results from the shot-based simulator.
We would like to emphasize that the optimization and inference on the real quantum computer were conducted without any error mitigation techniques. Despite this, the QNN demonstrated a remarkable capability to adapt to hardware imperfections during real hardware training, as evidenced by its good agreement with the simulation results. However, when evaluating the QNN on real backends using parameters obtained from the QASM optimization, we observed substantial deviations from the simulated outputs. These findings strongly indicate that QNNs intended for evaluation on a quantum computer should be trained on the same machine. Such an approach takes into account the specific hardware characteristics and imperfections, resulting in better performance and improved adaptability to the target quantum system. In our view, this emphasizes the necessity for techniques that enable training directly on the hardware, instead of solely concentrating on simulations.
The final example is motivated by Ref. [38], in which it is shown that the variance is increased by error mitigation. Furthermore, it is discussed that error mitigation protocols can worsen the trainability of QNNs because of the increased variance. To investigate how the variance regularization is influenced by error mitigation, we compute results on the real backend utilizing the zero-noise extrapolation protocol [53,54] as error mitigation. Zero-noise extrapolation involves deliberately amplifying the noise in the output by replicating gate operations in the circuit without altering its functionality. This procedure allows the creation of a simple model based on the expectation value with controlled noise amplification, enabling extrapolation to the expectation value in the absence of any noise. The QNNs in this example are obtained by an optimization with the shot-based simulator, both with and without variance regularization. The training follows the protocol described above and in Section 4 and fits the absolute value function. The expectation values of the different QNNs are evaluated 300 times at x = −0.5 on the ibmq_montreal backend utilizing 5000 shots in each run. Figure 9 displays the histogram of the expectation values, both with and without zero-noise extrapolation. Gaussian distributions are fitted to the 300 expectation values to illustrate the width of the distribution, and the mean (µ) and the standard deviation (σ) of these Gaussians are provided. It is evident that the variance of the expectation value is significantly reduced by the variance regularization, regardless of the presence of zero-noise extrapolation. Furthermore, the center of the distribution is shifted to the noise-free reference value of the QNN by the application of zero-noise extrapolation. The increased variance through the mitigation is visible for both QNNs, although it is not particularly significant in our example. Nonetheless, the variance reduction achieved through regularization remains strong even in the presence of zero-noise extrapolation.
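The two effects of zero-noise extrapolation discussed above, a corrected mean and a broadened distribution, can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: the hardware noise is modeled as a linear damping of the expectation value with the noise scale factor, the shot noise as Gaussian noise, and all numbers (`ideal`, `damping`, `shot_std`, the scale factors) are hypothetical and not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def zero_noise_extrapolate(scale_factors, values, degree=1):
    """Fit a polynomial in the noise scale factor and evaluate it at zero noise."""
    coeffs = np.polyfit(scale_factors, values, degree)
    return np.polyval(coeffs, 0.0)

ideal = 0.5                    # hypothetical noise-free expectation value
damping = 0.1                  # toy hardware-noise strength per scale unit
shot_std = 0.014               # crude shot-noise scale, roughly 1/sqrt(5000)
scale_factors = np.array([1.0, 2.0, 3.0])

raw, mitigated = [], []
for _ in range(300):
    # Expectation values at amplified noise levels, each with its own shot noise
    noisy = ideal * (1 - damping * scale_factors) + rng.normal(0, shot_std, 3)
    raw.append(noisy[0])                                   # unmitigated result
    mitigated.append(zero_noise_extrapolate(scale_factors, noisy))

print("raw:       mean", np.mean(raw), "std", np.std(raw))
print("mitigated: mean", np.mean(mitigated), "std", np.std(mitigated))
```

In this toy model the extrapolated values are centered at the noise-free value, while their spread is larger than that of the unmitigated values because the fit propagates the shot noise of all amplified circuits into the intercept.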

Conclusion
In this work, we investigated the impact of finite sampling noise, an inevitable aspect of quantum computing, on QNNs. To mitigate this noise, we introduced a technique called variance regularization, which exploits the expressivity of QNNs to reduce the variance of the output. The method additionally includes the variance in the loss function that is minimized in the training. When the cost operator of the QNN is chosen to be diagonal in the computational basis (e.g., only using Pauli I and Z operators), the variance and its derivative can be computed from the same circuit evaluations as the function values and gradients.
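The statement that a diagonal cost operator yields the variance from the same circuit evaluations can be made concrete with measurement counts. The following sketch is a minimal illustration: the counts are invented, the operator (a sum of Pauli Z on two qubits) is chosen only as an example, and the weighted-sum form in `regularized_loss` is an assumption about how the fitting and variance terms are combined.

```python
def z_sum_eigenvalue(bitstring):
    """Eigenvalue of sum_i Z_i: each 0 contributes +1, each 1 contributes -1."""
    return sum(1 if bit == "0" else -1 for bit in bitstring)

def expectation_and_variance(counts, eigenvalue):
    """Estimate <O> and <O^2> - <O>^2 from one set of computational-basis counts."""
    shots = sum(counts.values())
    mean = sum(eigenvalue(b) * n for b, n in counts.items()) / shots
    mean_sq = sum(eigenvalue(b) ** 2 * n for b, n in counts.items()) / shots
    return mean, mean_sq - mean ** 2

def regularized_loss(fit_loss, variance_loss, alpha):
    """Assumed combination: fitting loss plus alpha times the variance loss."""
    return fit_loss + alpha * variance_loss

# Invented measurement counts for a two-qubit circuit (1000 shots)
counts = {"00": 600, "01": 150, "10": 150, "11": 100}
mean, variance = expectation_and_variance(counts, z_sum_eigenvalue)
print(mean, variance)
```

Because the operator is diagonal, every bitstring is an eigenstate, so both moments follow from the same histogram and no additional circuits have to be executed.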
We presented an example illustrating that noise-free QNNs obtained from simulators can exhibit a good fit but may suffer from a high variance, requiring either a huge number of shots or introducing a significant amount of finite sampling noise. We believe that this aspect is often underestimated in current research on QNNs. Our findings demonstrate that the variance regularization significantly reduces the finite sampling noise. Compared to results without regularization, the final QNN requires on average an order of magnitude fewer shots to achieve a similar level of noise. We introduced an optimization procedure in which the contribution of the variance loss is adjusted during the optimization, resulting in substantial variance reduction and improved regression of the QNNs. Empirically, we showed that the number of shots required during the gradient evaluation can be adjusted based on the variance of the fitting loss, leading to faster computation times.
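The link between variance and shot count follows from the standard error of a sample mean, σ/√N: reaching a target noise level ε requires about N = σ²/ε² shots. A short sketch with hypothetical numbers (the function name and the two variance values are illustrative assumptions, not results from this work):

```python
import math

def shots_for_target_noise(variance, target_std):
    """Shots needed so that the standard error sqrt(variance / N) reaches target_std."""
    return math.ceil(variance / target_std ** 2)

# Hypothetical single-shot variances without and with variance regularization
shots_unregularized = shots_for_target_noise(0.5, 0.01)
shots_regularized = shots_for_target_noise(0.05, 0.01)
print(shots_unregularized, shots_regularized)
```

Since N scales linearly with the variance, a variance that is lower by an order of magnitude translates directly into an order of magnitude fewer shots at the same noise level.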
In the final part of our study, we examined the impact of variance regularization on IBM's real quantum computing backends. We demonstrated QNN optimization on a real backend, showcasing the adaptability of the QNN to hardware-specific characteristics during training. The results show similar improvements in the variance reduction compared to the simulation examples. Additionally, we observed that zero-noise extrapolation has no strong influence on the reduced variance of the output.
We believe that variance regularization is a necessary step to make QNNs more practical for real-world applications, although the time-consuming training on real hardware remains a big challenge. Exploring the application of variance regularization in other variational quantum algorithms may also be worth investigating when reducing finite sampling noise is crucial.

A.2 Dependencies on parameters in α(i)
In this section, we discuss the dependencies on the hyper-parameters in the function α_{a,b,v}(i) (cf. Eq. ( )) that is utilized in the variance regularization. We execute the example of fitting the logarithmic function, as discussed in Section 4, with the various hyper-parameters displayed in Figure 3 on a noise-free simulator. The function α_{a,b,v}(i), the variance loss L_var, and the fitting loss L_fit are displayed in Figure 11. We observe that although the progression of the variance loss and the fitting loss depends on the hyper-parameters, the final results consistently depend only on the value of the final plateau of α(i) set by the parameter v. The hyper-parameter v describes the proportion between the variance loss and the fitting loss; a large value reduces the variance more significantly at the cost of increasing the fitting loss. In this work, we chose the same set of hyper-parameters for all discussed applications, but this might change for other applications. Here are some general intuitions on choosing the hyper-parameters: The plateau defined by the parameter b should be sufficiently long to ensure that the variance is reduced if the fitting loss is significantly larger than L_var at the beginning. The decay of α(i), controlled by the parameter a, influences the increase of the variance after reaching its minimum. However, the precise value of a has minimal impact on the final result, as demonstrated in Figure 11.
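The qualitative roles of the three hyper-parameters can be illustrated with a stand-in schedule. The functional form below is an assumption made purely for illustration and is not the definition of α_{a,b,v}(i) used in this work: it holds an initial plateau for b iterations, then decays exponentially with rate a toward the final plateau v. The initial height `start` is likewise hypothetical; the default values of a, b, and v are those quoted in the caption of Figure 3.

```python
import math

def alpha_schedule(i, a=0.08, b=20, v=0.005, start=1.0):
    """Illustrative stand-in: plateau of length b, decay rate a, final plateau v."""
    if i <= b:
        return start                                   # initial plateau
    return v + (start - v) * math.exp(-a * (i - b))    # decay toward v

for iteration in (0, 20, 50, 100, 300):
    print(iteration, alpha_schedule(iteration))
```

In such a schedule, only v survives at late iterations, which is consistent with the observation that the final result depends on the final plateau rather than on the precise values of a and b.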
A.3 Comparison between real quantum computing hardware and the fully converged simulated optimization
Figure 12 shows the comparison of the optimization on the real quantum computing (QC) backend ibmq_montreal and the fully converged optimization carried out with the shot-based simulator. The optimization on the real backend was terminated at 95 iterations after running for several weeks. Although we were not able to run the optimization until convergence, the result shown in Figure 12 (and Figure 7, respectively) is the longest uninterrupted optimization run that we were able to perform on real hardware, given the current practical limitations discussed in the main text.

Figure 1: Parameterized quantum circuit of the QNN used in all examples in this work. The first layer of Ry gates manipulates the initial state. The layer highlighted in blue includes the Chebyshev input encoding in the Rx gates as well as the parameterized control manipulation of the quantum state. The layer is repeated l times for a repeated input encoding. The last layer of Ry gates serves as a change of the basis that is used for measurement. For a hardware-efficient approach, the rightmost controlled gate in the blue layer is removed to avoid swapping.

Figure 2: The output of the QNN is evaluated for two cases: trained with (a) and without (b) variance regularization. The training is conducted using a noise-free simulator, and the output is computed with and without shots. For the shot-based simulation, 10 000 shots are utilized.

Figure 3: Different regularization parameter functions α_{a,b,v}(i) for various combinations of a, b, and v. The blue curve with a = 0.08, b = 20, v = 0.005 is used for the experiments throughout this paper.

Figure 4: Graphs representing the ADAM optimization of a QNN using the shot-based QASM simulator. The upper panel (a) showcases the variance loss (excluding the prefactor α), while the middle panel (b) displays the fitting loss throughout the optimization process. The bottom panel (c) illustrates the number of shots utilized in the gradient evaluation. All results are averaged values from 10 runs, and the individual results are depicted as thin lines. Results from the noise-free statevector simulator (SV) are represented by dashed lines.

Figure 5: Regression of various functions without (blue) and with (orange) variance regularization. The black lines show the reference function; the data points marked with an x are used for training the QNN. The training and the final inference are obtained from a shot-based QASM simulation using 5000 shots. The inference for the logarithm function is performed on a test set with an equidistant spacing of 0.002, while the spacing for the other functions is 0.004.

Figure 6: a: Results from the optimization of the water PES using a shot-based QASM simulator. b: Inference of the training and test data with the best trained model. The error bars represent the 95% confidence interval of the rescaled QNN output. These confidence intervals are obtained by performing 100 evaluations for each data point using 5000 shots. The X-marks indicate the averaged output over the 100 calculations. The R² scores are shown for training and test data.

Figure 6 depicts the loss functions L_var and L_fit as well as the number of shots utilized during the optimization of the QNN (a), along with the inference results for both training and test data (b).

Figure 7: Results from the optimization on the IBM backend ibmq_montreal and the corresponding shot-based QASM simulation, both utilizing 10 qubits. The optimization procedure is described in Section 4.

Figure 8: The output of the QNNs from both shot-based simulators (QASM) and real backends is presented. The blue curve represents the result obtained from the converged QASM optimization. The orange curve depicts the output after 95 training steps on the ibmq_montreal backend. It is worth noting that the results from the real quantum computing (QC) backend are obtained without the utilization of error mitigation techniques.

Figure 9: Histogram of the expectation value evaluated 300 times on the IBM backend ibmq_montreal. The darker colors are obtained with zero-noise extrapolation (ZNE). Gaussian distributions are fitted to the obtained expectation values; the resulting mean (µ) and standard deviation (σ) are displayed in the same color. The dashed lines show the reference values obtained from the same QNNs evaluated on a noise-free statevector (SV) simulator.

Figure 11: Results for the optimization of the logarithm curve-fitting example detailed in Section 4 with various hyper-parameters.

Figure 12: Comparison between the optimization on the ibmq_montreal backend and the fully converged simulated QASM optimization. More details and discussion are given in Section 6.