Check-Agnosia based Post-Processor for Message-Passing Decoding of Quantum LDPC Codes

The inherent degeneracy of quantum low-density parity-check codes poses a challenge to their decoding, as it significantly degrades the error-correction performance of classical message-passing decoders.


Introduction
Quantum low-density parity-check (qLDPC) codes [1] have become one of the main candidates to implement the error-correction layer of a large-scale quantum computer architecture [2][3][4]. Compared to other families of quantum error correction codes, qLDPC codes may reduce the physical qubit overhead while protecting a larger number of logical qubits, so higher code rates can be obtained with similar or better error-correction performance [5][6][7][8][9]. Yet, for qLDPC codes to work on a real system, a larger number of physical qubits than those available in today's noisy intermediate-scale quantum systems is required [3], [10]. Nonetheless, before large-scale quantum technology becomes available, two important problems need to be addressed from the qLDPC decoding perspective: i) devising new decoding algorithms that overcome or mitigate the effect of degeneracy [11], thus providing increased error correction capabilities, and ii) developing hardware designs that meet the latency and power constraints imposed by the quantum system (e.g., latency values within the decoherence time of the qubits to be protected, or power limitations for qubit technologies requiring cryogenic cooling, when the decoder is implemented within the low-temperature layers), a topic that has only recently received attention [12][13][14].
To achieve the first objective, several approaches building upon classical message-passing (MP) decoding algorithms have been recently proposed in the literature, where the degeneracy issue is dealt with by either incorporating neural network techniques in the MP decoder [15], or adding a post-processing step, taking advantage of the soft information delivered by the MP decoder [5,16].
Neural-network-based decoders are bound to the noise models used to train them and do not scale well with the number of qubits [17]. Moreover, as shown in [18], there are not only different noise sources and noise models, but the noise may also differ depending on the area of the layout of the quantum processor, the environmental conditions, and the evolution of errors with time since the last calibration (space and time drift of the errors [19]). In that sense, more general solutions are required, at least at the time of writing, when there is no standardized or predominant technology or architecture for future large-scale quantum devices. Hence, post-processing techniques may become an interesting choice. A first post-processing technique based on ordered statistics decoding (OSD) was proposed in [5,20]. Although the improvement in terms of coding gain is significant, the complexity is too high, and hence it becomes impractical for real-time hardware implementations [21]. Recently, some of us proposed a new post-processing technique for Calderbank-Shor-Steane (CSS) qLDPC codes, called stabilizer inactivation [16]. The post-processing consists in inactivating a set of unreliable qubits supporting a check in the dual code (a stabilizer generator of the same type as the decoded error). Then the MP decoding is run again, while taking the inactivated qubits and their neighbor check-nodes out of the decoding process. The remaining qubits and check-nodes that participate in the MP decoding are called active. Several stabilizer generators may be inactivated, one at a time (which can be implemented either sequentially or in parallel), until one MP decoding meets the syndrome constraints on the active check-nodes. Inactivated qubits are then determined by solving a small linear system, defined by the inactive check-nodes. Stabilizer inactivation shows a non-negligible error correction improvement and increased flexibility (with regard to the MP decoding schedule) compared to OSD, with a considerable reduction of complexity. However, as discussed later in this paper, additional hardware-oriented analysis and optimization are required to ensure that the hardware design meets the constraints required to provide real-time support to a quantum processor.
The main contributions of the paper are as follows. First, inspired by the stabilizer inactivation, we introduce a new post-processing algorithm for MP decoders. The algorithm takes into account the architectural properties of MP decoders in order to reduce the computational load and the required hardware resources. It also limits the amount of information required from the code, eliminating the need to know the stabilizer structure (dual code) and simply treating both parity-check matrices as independent. Similar to the stabilizer inactivation, the algorithm identifies a small set of unreliable qubits (which are, however, not inactivated in the sense described above). This set of qubits is identified based on the check-node reliability. When the MP decoder fails, the a priori information for the qubits connected to the least reliable check-nodes is erased, and the post-processor then tries to learn the reliability of these qubits again, based on the information from the rest. For this reason, we call the post-processing technique check-agnosia. We also suggest several approaches to select the unreliable check-nodes so as to reduce power consumption and latency, which are the constraints that limit the implementation of decoders in real systems [22,23].
Second, throughout the paper, a non-agnostic hardware perspective is described to help meet the constraints of future large-scale quantum devices. In line with this, we provide a functional description of the proposed solution, in terms of performance and hardware results, for the two main schedules employed for MP decoders (flooded and layered [24], [25]). We carry out a detailed analysis of different corner cases, which is then illustrated for a specific qLDPC code, by providing latency and power consumption values of the check-agnosia solution implemented on an FPGA board.
The rest of the paper is organized as follows. Section 2 introduces the relevant notation and the algorithmic background. Section 3 introduces the check-agnosia post-processing, and discusses the check-node reliability metric along with several hardware-oriented optimizations. Section 4 analyzes the impact of the post-processing algorithm on the hardware implementation, considering architectures with different schedules and varying degrees of parallelism. Latency and power consumption results are also provided there. Section 5 evaluates the error-correction performance of the proposed check-agnosia decoder for different qLDPC codes, and compares it to other existing solutions. Finally, Section 6 provides the main conclusions of this work.

Algorithmic Background
We consider qLDPC codes of CSS type, defined by two parity-check matrices $H_x$ and $H_z$, corresponding respectively to X-type and Z-type generators. In the following, we will consider decoding of one type of error (since similar considerations apply to the other type), and will denote by $H$ the corresponding decoding matrix (e.g., $H = H_z$ for decoding X-type errors). We also consider the Tanner graph associated with $H$, and denote by $Q$ the set of qubit-nodes and by $C$ the set of check-nodes (stabilizer generators). We denote by $N(q) \subset C$ the set of neighboring check-nodes of a qubit-node $q \in Q$, and similarly, by $N(c) \subset Q$ the set of neighboring qubit-nodes of a check-node $c \in C$.
We denote by $p$ the probability that an error of the considered type occurs, and by $e \in \{0, 1\}^{|Q|}$ the error indicator vector. We assume errors happen independently on the qubits, hence $P(e_q = 1) = p$. Information about the error $e$ is revealed through the measurement of stabilizer generators, in the form of an error syndrome $s := H \cdot e$. Throughout this work, we assume ideal syndrome extraction, i.e., we only consider errors occurring on qubits, not on the extracted syndrome. The decoding problem consists in determining the most likely error $\hat{e}$ such that $H \cdot \hat{e} = s$.
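As a minimal illustration of the syndrome definition above, the following sketch computes $s = H \cdot e$ over GF(2) in plain Python. The 3 × 5 matrix is a hypothetical toy example chosen for readability, not one of the codes studied in this paper, and the function name `syndrome` is an assumption of this sketch.

```python
def syndrome(H, e):
    """Return s = H . e over GF(2); H is a list of 0/1 rows, e a list of 0/1."""
    return [sum(h * x for h, x in zip(row, e)) % 2 for row in H]

# Toy 3x5 check matrix (hypothetical, for illustration only).
H = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
]

e = [0, 1, 0, 0, 0]   # single error on qubit 1
s = syndrome(H, e)    # checks 0 and 1 both involve qubit 1
print(s)              # [1, 1, 0]
```

The decoder only sees `s`, never `e`; any estimate that reproduces `s` is accepted by the stopping criterion.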
Although maximum likelihood decoding is optimal, it is computationally prohibitive. Instead, classical LDPC codes are efficiently decoded by MP algorithms. For a qubit $q \in Q$, we denote by $\gamma_q$ the a priori log-likelihood ratio (LLR) of an error happening on qubit $q$, which is defined as $\gamma_q = \log\left(P(e_q = 0)/P(e_q = 1)\right) = \log((1 - p)/p)$. The set of LLR values $\{\gamma_q \mid q \in Q\}$ constitutes the input of the MP decoder and is used to initialize an iterative exchange of messages between qubit- and check-nodes. We denote these messages by either $\mu_{q \to c}$ or $\mu_{c \to q}$, the arrow in the notation indicating whether the message is sent from a qubit-node $q$ to a check-node $c$, or in the opposite direction. At each iteration, the exchanged messages are used to compute a posteriori LLR values $\tilde{\gamma}_q$, for each qubit-node $q$, which provide an estimate $\hat{e}_q$ of the corresponding qubit error. The iterative message-passing process stops when either the estimated error satisfies the syndrome (i.e., $H \cdot \hat{e} = s$), or a maximum number of iterations is reached. In the following, we shall also refer to (a priori/a posteriori) LLR values as qubit reliabilities.
Throughout this work, we shall consider the Min-Sum (MS) decoding, or its normalized variant (NMS), which represent the option of choice for hardware implementations for several reasons: reduced computational complexity, reduced memory requirements (by adopting the first and second minimum compression method for check-node messages [26]), and insensitivity to input LLR values up to a constant scaling factor (see Section 5 for the input LLRs of the finite-precision MS decoder). For details regarding the MS and other MP decoding algorithms we refer to [27] (see also the discussion in [16, Section 2]).
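To make the first and second minimum compression concrete, here is a hedged sketch of a single (normalized) min-sum check-node update in a syndrome-based setting. It assumes the usual convention that the outgoing sign is the product of the other incoming signs, flipped when the syndrome bit is 1; the function name, the dictionary-based interface and the `alpha` normalization parameter are assumptions of this sketch, and the exact rules used in this work are those of [27].

```python
def check_node_update(incoming, s_c, alpha=1.0):
    """Min-sum update for one check-node of degree >= 2.

    incoming: dict qubit -> message mu_{q->c}
    s_c:      syndrome bit of the check (0 or 1)
    alpha:    normalization factor (alpha = 1.0 gives plain min-sum)
    Returns (outgoing dict qubit -> mu_{c->q}, first minimum, second minimum).
    """
    mags = sorted(abs(m) for m in incoming.values())
    min1, min2 = mags[0], mags[1]
    # Qubit whose incoming message achieves the first minimum.
    argmin = min(incoming, key=lambda q: abs(incoming[q]))
    # Product of all incoming signs, flipped if the check is unsatisfied.
    total_sign = -1 if s_c else 1
    for m in incoming.values():
        if m < 0:
            total_sign = -total_sign
    outgoing = {}
    for q, m in incoming.items():
        # Exclude the qubit's own contribution: own sign and own magnitude.
        sign = total_sign * (-1 if m < 0 else 1)
        mag = min2 if q == argmin else min1
        outgoing[q] = alpha * sign * mag
    return outgoing, min1, min2

out, m1, m2 = check_node_update({0: 2.0, 1: -0.5, 2: 3.0}, s_c=0)
print(out)     # {0: -0.5, 1: 2.0, 2: -0.5}
print(m1, m2)  # 0.5 2.0
```

Only the two minima, the position of the first minimum and the signs need to be stored per check, which is the memory saving referred to above.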
A key attribute of MP decoding algorithms is the underlying scheduling, indicating the order in which qubit- and check-node messages are updated [27]. Flooded, layered, or serial schedules are usually implemented through fully-parallel, partially-parallel, or serial hardware architectures, respectively, yielding designs with different performances in terms of latency, area, or power consumption.
Compared to the flooded schedule, serial and layered schedules are also known to propagate information twice as fast in the Tanner graph [28] for classical (non-degenerate) LDPC codes. This directly translates into a faster convergence speed. However, the flooded schedule provides decoding performance similar to the serial and layered ones, at the cost of doubling the number of decoding iterations. This is no longer true for qLDPC codes, presumably due to the code degeneracy. As observed in [16], not only may the flooded schedule be unable to approach the decoding performance of the serial or layered schedules, even at the cost of an increased number of iterations, but in some cases it may also penalize the performance of the post-processing algorithm. Layered MP decoding of qLDPC codes has been recently investigated by the authors in [29], where it has also been observed that processing the layers in a random order (at each decoding iteration) may significantly improve the performance of the MP decoder. We will use these results in Section 5 of this paper.

Check-Agnosia Decoder
We introduce in this section the Check-Agnosia (CA) post-processing. We first describe the generic post-processing technique (Algorithm 1) and then discuss possible modifications.

Generic Check-Agnosia Decoder
The error vector $\hat{e}$ is estimated first using a soft-output MP decoding algorithm. If the error estimate $\hat{e}$ satisfies the syndrome, i.e., $H \cdot \hat{e} = s$, then no post-processing is applied.
If the initial MP decoding fails, a metric on the exchanged soft information is used to find the $\lambda$ checks $\{c_k\}_{k \in [\lambda]}$ whose supports are the most likely to be involved in the decoding failure (this metric will be discussed later). The post-processing consists of rerunning the MP decoder at most $\lambda$ times with new a priori qubit reliabilities $\{\gamma'_q\}$ and a modified stopping criterion.
For the $k$-th decoder, the input reliability is set to $\gamma'_q = 0$ for the qubits $q \in N(c_k)$, considered unreliable, and to $\gamma'_q = \gamma_q$ for the rest of the qubits. Setting the input reliability to 0 can be regarded as an erasure in the MP decoder [30], ensuring that these qubits are deprived of any a priori information that may interfere with the decoding of the more reliable qubits. We further define $N_k = \bigcup_{q \in N(c_k)} N(q)$, the set of checks that share a neighbor qubit-node with $c_k$ (note that $c_k \in N_k$), and denote by $\overline{N}_k = C \setminus N_k$ its complement. This allows us to define the partial syndrome $s|_{\overline{N}_k}$, containing only the checks that have no neighbor qubit-node in $N(c_k)$, and the residual syndrome $s|_{N_k}$. We then run an MP decoder with a modified stopping criterion that only tries to match the partial syndrome $s|_{\overline{N}_k}$. In Algorithm 1, we denote this decoder by MP⋆$(H, s, \{\gamma'_q\}, N_k)$. We emphasize that MP⋆ applies exactly the same decoding rules on the same Tanner graph as MP, except that MP⋆ is initialized with qubit reliabilities $\{\gamma'_q\}$, and it stops when the partial syndrome $s|_{\overline{N}_k}$ is satisfied (no matter whether the residual syndrome $s|_{N_k}$ is satisfied or not). If MP⋆ succeeds in matching the partial syndrome, the decoder then attempts to match the residual syndrome by brute-forcing the error pattern on $N(c_k)$. Note that sometimes MP⋆ can actually match the full syndrome, in which case no brute-forcing is needed (this will be discussed in more detail later). In Algorithm 1, $H|_{\overline{N}_k}$ denotes the submatrix of $H$ whose rows correspond to check-nodes $c \notin N_k$. Consequently, $H|_{\overline{N}_k}(c, q) = 0$ for any $q \in N(c_k)$, and thus the partial syndrome depends only on the estimate $\hat{e}|_{\overline{N(c_k)}}$ on the qubits outside $N(c_k)$. If this estimate matches the partial syndrome, we keep its value and brute-force $\hat{e}|_{N(c_k)}$ to also match the residual syndrome.
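The construction of the erased priors and of the check set $N_k$ can be sketched as follows. This is an illustrative sketch under the definitions above, not the authors' implementation; the helper names (`neighborhoods`, `agnosia_inputs`) and the toy matrix are hypothetical.

```python
def neighborhoods(H):
    """Return (N_check, N_qubit): neighbor sets of each check and each qubit."""
    n_checks, n_qubits = len(H), len(H[0])
    N_check = [{q for q in range(n_qubits) if H[c][q]} for c in range(n_checks)]
    N_qubit = [{c for c in range(n_checks) if H[c][q]} for q in range(n_qubits)]
    return N_check, N_qubit

def agnosia_inputs(H, gamma, c_k):
    """Build the inputs of the k-th post-processing attempt for check c_k.

    Returns (gamma_prime, N_k, partial_checks), where gamma_prime erases the
    priors on N(c_k), N_k collects the checks sharing a qubit with c_k, and
    partial_checks lists the checks outside N_k (the partial syndrome).
    """
    N_check, N_qubit = neighborhoods(H)
    erased = N_check[c_k]
    gamma_prime = [0.0 if q in erased else g for q, g in enumerate(gamma)]
    N_k = set().union(*(N_qubit[q] for q in erased))
    partial_checks = [c for c in range(len(H)) if c not in N_k]
    return gamma_prime, N_k, partial_checks

# Toy 3x5 check matrix (hypothetical, for illustration only).
H = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
gamma_prime, N_k, partial = agnosia_inputs(H, [2.0] * 5, c_k=0)
print(gamma_prime)  # [0.0, 0.0, 2.0, 2.0, 2.0]
print(sorted(N_k))  # [0, 1]
print(partial)      # [2]
```

In this toy case, check 2 has no qubit in common with check 0, so it is the only check whose syndrome bit the modified stopping criterion has to match.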
The intuition behind the post-processing is that the presence of quantum trapping sets [11] in the Tanner graph causes the a posteriori reliability values of trapped qubit-nodes to oscillate. This prevents the decoder from converging, regardless of the number of decoding iterations (for oscillating trapping sets see also [31]). Taking into account the oscillation effect, it is reasonable to expect that the messages associated with the untrapped qubits will grow with each iteration, while the trapped ones will keep a relatively low reliability. This effect helps to identify possible trapped qubits. To this end, we define a reliability metric on checks to decide (the support of) which checks should be erased. A natural approach to define such a reliability metric is to consider the reliability (i.e., absolute value) of either incoming messages $\{\mu_{q \to c} \mid q \in N(c)\}$ or outgoing messages $\{\mu_{c \to q} \mid q \in N(c)\}$. However, for MS-based decoders, the absolute value of outgoing messages is equal to either the first or the second minimum of the absolute values of incoming messages, denoted by $\min_{q \in N(c)} |\mu_{q \to c}|$ and $\min^{(2)}_{q \in N(c)} |\mu_{q \to c}|$, respectively. This motivates the reliability metric $\delta_c$ considered in Algorithm 1 (see footnote 6). The cost of computing this metric is nearly zero, as the two minima are already computed by the MS decoder. Also, this metric is computed for all the check-nodes, based on the $\{\mu_{q \to c}\}$ messages at some specific (predetermined) iteration of the MS decoder (as discussed below). This allows all the checks to be sorted according to the proposed metric, after which the post-processing can be applied to the $\lambda$ most unreliable checks.
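The selection of the least reliable checks can be sketched as below, assuming that $\delta_c$ is the sum of the first and second minima of the absolute incoming messages (as stated in the hardware discussion of Section 4). The function names, the data layout and the tie-breaking by check index are assumptions of this sketch.

```python
def check_reliability(incoming):
    """delta_c: sum of the two smallest absolute incoming messages of a check."""
    mags = sorted(abs(m) for m in incoming)
    return mags[0] + mags[1]

def least_reliable_checks(messages, lam):
    """Return the lam checks with the smallest delta_c.

    messages: dict check -> list of incoming messages mu_{q->c}
    Ties are broken by check index (an arbitrary choice made here).
    """
    delta = {c: check_reliability(msgs) for c, msgs in messages.items()}
    return sorted(delta, key=lambda c: (delta[c], c))[:lam]

messages = {
    0: [2.0, -0.5, 3.0],    # delta = 0.5 + 2.0  = 2.5
    1: [4.0, 5.0, -6.0],    # delta = 4.0 + 5.0  = 9.0
    2: [-0.25, 0.25, 1.0],  # delta = 0.25 + 0.25 = 0.5
}
print(least_reliable_checks(messages, lam=2))  # [2, 0]
```

In hardware the two minima are already available from the check-node unit, so only the addition and the sorting are extra work.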
To reduce the overall latency (initial MP decoding and post-processing), one may compute the check reliability values $\delta_c$ at an early iteration, i.e., before the initial MP reaches the maximum number of decoding iterations. This allows the post-processing to start running in parallel before the initial MP has ended. If the initial MP succeeds later on, the post-processing stops and the decoder outputs the error found by the initial decoder. However, if the initial MP decoder fails, the post-processing will have already started, reducing the total latency. As will be shown in Section 5, the error correction performance obtained by determining the list of least reliable checks using the soft information from either the last or an early iteration is almost the same, but the speedup is considerably higher in the latter case. Moreover, the reliability metric computed after a few iterations may be more accurate than the one computed at the last iteration, as the oscillation effects (also combined with saturation effects of the finite-precision arithmetic) might considerably alter the accuracy of the reliability metric computed after a large number of iterations.
Footnote 6: While different variations of this metric are possible (e.g., the sum of the absolute values of all incoming messages), we have not observed any significant difference in terms of error correction performance. Also, similar reliability metrics can be obtained for other MP decoding algorithms, e.g., sum-product.
We now discuss the brute-forcing of $\hat{e}|_{N(c_k)}$ in Algorithm 1. There are several possible methods to solve the system $H|_{N_k} \cdot \hat{e} = s|_{N_k}$, including Gaussian elimination. However, since the system to solve is small, brute-forcing, i.e., trying all the possible combinations in the hope of finding one that satisfies the system, is a more efficient solution for hardware implementation. Moreover, it is not too difficult to see that the brute-force approach can be simplified by taking into account the local structure of the code, eliminating a lot of computation. For instance, a check-node $c \in N_k \setminus \{c_k\}$ that has exactly one qubit-node in common with $c_k$ uniquely determines the value of that qubit.
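A naive version of the brute-force step, without the structural simplifications mentioned above, might look as follows. This is a sketch under the stated definitions, not the hardware procedure; the function name and arguments are hypothetical, and the search simply returns the first assignment found in enumeration order.

```python
from itertools import product

def brute_force_erased(H, s, e_hat, erased, residual_checks):
    """Search the values on the erased qubits that satisfy the residual checks.

    H, s:            check matrix (list of 0/1 rows) and syndrome
    e_hat:           current estimate; entries outside `erased` are kept fixed
    erased:          qubits in N(c_k)
    residual_checks: checks in N_k
    Returns a full error estimate, or None if no assignment works.
    """
    erased = sorted(erased)
    n = len(e_hat)
    for bits in product((0, 1), repeat=len(erased)):
        cand = list(e_hat)
        for q, b in zip(erased, bits):
            cand[q] = b
        if all(sum(H[c][q] * cand[q] for q in range(n)) % 2 == s[c]
               for c in residual_checks):
            return cand
    return None

# Toy 3x5 check matrix (hypothetical, for illustration only).
H = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
print(brute_force_erased(H, [1, 1, 0], [0] * 5, {0, 1}, [0, 1]))
# [0, 1, 0, 0, 0]
```

The search space has size $2^{|N(c_k)|}$, which stays small for low-weight checks; the function returns `None` when the fixed part of the estimate makes the residual system unsolvable.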

Check-Agnosia Decoder Without System Solver
One alternative to determine $\hat{e}|_{N(c_k)}$, described in Algorithm 2, is to use a regular MP decoder that stops only if the full syndrome is matched. Precisely, the MP⋆$(H, s, \{\gamma'_q\})$ in Algorithm 2 is a regular MP decoder, initialized with qubit reliabilities $\{\gamma'_q\}$, which stops when the full syndrome is satisfied. We keep the MP⋆ notation in the post-processing step only to distinguish it from the initial MP decoder (this will be needed later in Section 4). To justify Algorithm 2, let us consider the case when the graph induced by any subset $S \subseteq N(c_k)$ contains at least one check-node of degree one. Then, assuming the MP decoder has converged on $\hat{e}|_{\overline{N(c_k)}}$, it will converge on the remaining $\hat{e}|_{N(c_k)}$ at the cost of a few more iterations. The above condition is the same as requiring that $N(c_k)$ contains no stopping subset (see footnote 8), and running the MP for a few more iterations amounts to running a peeling decoding [32] on the erased qubits. For instance, if the Tanner graph contains no cycles of length four, then $N(c_k)$ satisfies the no-stopping subset condition, and one extra iteration is enough to determine $\hat{e}|_{N(c_k)}$. The no-stopping subset condition may also be satisfied for graphs containing cycles of length four, but in such a case more than one extra iteration may be needed.
For a given Tanner graph the above no-stopping subset condition can easily be verified, and then we may use Algorithm 2 instead of Algorithm 1 (numerical simulations also confirmed that those two approaches give similar performance).For the simulation results shown later in this paper (Section 5), we always use Algorithm 2.
Presumably, the only meaningful case in which the no-stopping subset condition is not satisfied is when the code is auto-dual (i.e., $H_x = H_z$), since in such a case $N(c_k)$ is the support of a codeword, hence a stopping set. It is worth noticing that for auto-dual codes, the check-agnosia (Algorithm 1) and stabilizer-inactivation [16] decoders are the same, up to the reliability metric used to select the $\lambda$ least reliable check-nodes. However, for codes that are not auto-dual, the check-agnosia decoder, implemented as in Algorithm 2, presents several advantages, including the use of a simpler, hardware-friendly check-node reliability metric (which does not require the dual matrix), as well as the fact that it relies solely on MP decoding, eliminating the need for brute-forcing or other system-solving methods. A final remark is that all MP and MP⋆ decoders can implement a flooded or a layered schedule, as discussed in Section 2, to cope with the hardware constraints.
Footnote 8: A set of qubit-nodes is said to be a stopping set if the induced subgraph contains no check-nodes of degree 1. If the qubit-nodes in a stopping set are erased, they can get no information during the MP decoding, that is, incoming and outgoing messages to and from these qubit-nodes remain equal to zero during the entire iterative decoding process.

Hardware Architectures
This section aims to analyze the impact of the post-processing algorithm on the hardware implementation, considering architectures with different schedules and varying degrees of parallelism. We carry out a detailed analysis of different corner cases, providing latency and power bounds to assist future hardware decoder designers.

MP Decoder Architecture
We consider first a single MP decoder, without any post-processing. To implement the MP decoder in hardware, one can use a fully parallel architecture, implementing a flooded schedule, referred to as flooded decoder, or a partly parallel architecture, implementing a layered schedule, referred to as layered decoder.
We will make standard assumptions regarding the two above architectures [26]. For the flooded decoder, the Tanner graph is instantiated in hardware, where messages are exchanged through wires between processing units, corresponding to qubit- and check-nodes. Each decoding iteration is performed in two clock cycles, with one clock cycle for qubit-node messages and a posteriori LLRs, and a second one for check-node messages. Thus, the worst case (maximum) latency of the flooded decoder is equal to $(1 + 2 I_F)/f_F$ (s), where we count one clock cycle for data loading, $I_F$ is the maximum number of decoding iterations of the flooded decoder, and $f_F$ is the clock frequency.
For the layered decoder, the number of processing units instantiated in hardware is given by the size of the largest layer, messages are exchanged through shared memory, and each processing unit is reused $\eta_L$ times for each decoding iteration, where $\eta_L$ denotes the number of layers per iteration. The worst case latency of the layered decoder is equal to $(1 + \eta_L I_L)/f_L$ (s), where we count again one clock cycle for data loading, $I_L$ is the maximum number of decoding iterations of the layered decoder, and $f_L$ is the clock frequency.
Two observations are in order here. First, the flooded architecture may lead to a large number of connections among processing units, causing routing congestion in the case of large codes. Due to the large interconnect network, the operating clock frequency of the flooded architecture ($f_F$) is usually smaller than twice that of the layered architecture ($f_L$). Second, as discussed in Section 2, the layered schedule propagates information about twice as fast as the flooded one, thus the maximum number of iterations of the layered architecture ($I_L$) is usually smaller than that of the flooded architecture ($I_F$). Overall, this can make the layered architecture comparably fast to the flooded one, despite the fact that it employs a reduced degree of parallelism (provided, of course, that the number of layers per iteration is sufficiently small).
Finally, one possible approach to further increase the clock frequency of the layered decoder is to pipeline the design (i.e., perform each layer in a number of pipelined clock cycles). However, this may lead to delayed message write-backs in memories, and thus to pipeline-related hazards [33]. Solving such hazards (without relying on pipeline stalls, which introduce extra latency) can be done for classical LDPC codes at the code construction stage [34]. However, such solutions are not generic (they need a specific code construction) and may not apply to qLDPC codes. Therefore, to keep the analysis as generic as possible, we do not consider pipelined designs in this work.

Post-Processing Elements
For the check-agnosia scheme, the first step after the MP decoder is the computation of the check reliability values, as outlined in Algorithms 1 and 2. The metric used to calculate the check reliability, denoted as $\delta_c$, involves adding the two least reliable messages. These values are computed during the tree finder process employed to calculate check-node messages in the min-sum decoder. Thus, the only additional hardware required is an adder per check-node to compute $\delta_c$. These values are updated on-the-fly during each iteration, eliminating the need for extra clock cycles after the MP decoder. As described earlier in Section 3, one does not have to wait until the end of the initial MP decoder to start the post-processing (the impact of utilizing the $\delta_c$ information from early iterations will be evaluated in Section 5). In the proposed architecture, the $\delta_c$ values can be stored in the registers of the sorting unit (see below) before the first MP decoder completes, without any additional hardware. This allows absorbing some additional latency and initiating the post-processing MP⋆ decoders before the initial MP decoder completes.
After the $\delta_c$ values are available, the checks are sorted in order of reliability. Sorting the checks in order of increasing reliability requires $|C| - 1$ comparators, arranged in a tree structure that should be pipelined to avoid increasing the critical path of the decoder. The number of clock cycles needed to obtain the complete sorted list is $\lceil \lambda/2 \rceil \times \lceil \log_2 |C| \rceil$.

Overall Check-Agnosia Architecture
In this section, we detail the check-agnosia architecture corresponding to Algorithm 2 (that relies on MP decoding only, without brute-forcing).After the list of λ least reliable checks is obtained, the λ MP ⋆ decoders are performed.Depending on the time constraints and/or power budget, we may consider two different approaches, illustrated in Figure 1.
The first approach consists of performing the $\lambda$ MP⋆ decoders sequentially, reusing the same hardware as the one used for MP. Only $|Q|$ extra multiplexors are required to choose between $\gamma'_q = \gamma_q$ and $\gamma'_q = 0$, and $|C|$ extra multiplexors are required to decide which syndrome bits belong to $s|_{N_k}$, depending on the check $c_k$. This approach of reusing hardware yields higher latency, but may be of interest for a quantum computer with time constraints on the order of microseconds, e.g., based on trapped-ion technology [3].
For the second approach, the $\lambda$ MP⋆ decoders are performed in parallel, by using dedicated hardware. Moreover, the $\lambda$ MP⋆ decoders may start before the initial MP completes, using check-reliability values computed at an early iteration, which we will denote in the sequel by $I_{\delta_c}$. This approach may be of interest for quantum technologies with more restrictive latency constraints, keeping in mind that power can also be a limitation. This is the case for superconducting qubits, where the decoder needs to reduce its power budget when it operates close to the quantum chip at cryogenic temperatures.
To illustrate the degree of complexity in hardware implementations and measure the gap between the proposed solutions and the latency/power constraints, we analyze below the Pareto designs for the two approaches above, where the MP decoder uses either a flooded or a layered schedule. We provide the worst-case latency (simply referred to as latency), as well as the power consumption as a function of the nominal power consumption of the MP decoder, denoted by $P_F$ or $P_L$, with a subscript indicating the flooded or layered architecture (we may reasonably assume that MP and MP⋆ yield the same power consumption). For the latency value, we take into account the latency induced by sorting the check-nodes according to their reliability (Section 4.2). The corresponding power consumption is not accounted for, as we assume it is negligible with respect to the power consumption of the MP decoder.

Implementation Results
To illustrate the analysis from the previous section, we have implemented both flooded and layered NMS decoders on a Xilinx xcvu095 FPGA board, for the B1 [[882, 24]] code from [5]. The implemented decoders use finite-precision arithmetic, with exchanged messages quantized on 6 bits, and a posteriori LLR values quantized on 8 bits. The parity-check matrix (for both X and Z errors) is of size 441 × 882 (check-nodes × qubit-nodes) and has no four-cycles (thus, it satisfies the no-stopping subset condition, and we may safely apply Algorithm 2). The flooded NMS / NMS⋆ decoders achieve a maximum operating frequency $f_F = 100$ MHz (corresponding to a critical path of 10 ns), with 62% of the hardware resources of the device utilized, and a total power consumption $P_F = 5.5$ W.
To implement the layered decoders, we use the 2-covering approach from [29], where 7 overlapping layers are used to cover 2 iterations, yielding a fractional number of layers per iteration ($\eta_L = 3.5$). Of the critical path of the layered decoder, about 25% is due to the logic depth of the operations and 75% is due to the routing limitations of the FPGA device. The decoder uses only 13% of the hardware resources of the device, and the total power consumption is around $P_L = 2.03$ W. We consider a maximum number of decoding iterations $I_F = 30$ for the flooded decoders, and $I_L = 15$ for the layered decoders (due to faster convergence). For the post-processing step, we consider a list of $\lambda = 10$ least reliable checks (these parameters will be evaluated from the error correction perspective in Section 5). Latency and power consumption values are summarized in Table 1, for the Pareto designs considered in the previous section. Note that we consider two cases for the dedicated hardware scenario, in which the iteration $I_{\delta_c}$ (used to compute the check-node reliability values) is chosen to be either the last or the third iteration of the NMS decoder.
It can be observed that the layered architecture achieves latency values close to the flooded one, despite the fact that it employs a degree of parallelism 3.5 times lower, while considerably reducing the power consumption. It is also worth noticing that the part of the latency due to the sorting unit is 0.45 µs for the flooded architectures, and 0.56 µs for the layered ones. Further optimizations are possible to reduce the latency of the sorting unit (e.g., carefully balancing the pipeline stages of the sorting unit by taking into account the maximum critical path latency of the MP decoder, splitting the sorting unit into layers in the case of a layered schedule, or using a different clock domain for the sorting unit), but they are beyond the scope of this work. We mention that the maximum frequency that can be reached for the sorting unit (implemented alone) is 230 MHz, which gives a lower bound on the achievable latency of 0.2 µs.
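The sorting latencies quoted above can be cross-checked against the cycle count $\lceil \lambda/2 \rceil \times \lceil \log_2 |C| \rceil$ from Section 4.2, using $\lambda = 10$ and $|C| = 441$. The symbols $N_{\mathrm{sort}}$ and $t_{\mathrm{sort}}$ are introduced here only for this worked example.

```latex
\begin{align*}
N_{\mathrm{sort}} &= \lceil \lambda/2 \rceil \cdot \lceil \log_2 |C| \rceil
  = \lceil 10/2 \rceil \cdot \lceil \log_2 441 \rceil = 5 \cdot 9 = 45 \text{ cycles},\\
t_{\mathrm{sort}}^{F} &= \frac{45}{100\ \mathrm{MHz}} = 0.45\ \mu\mathrm{s},\\
t_{\mathrm{sort}}^{\min} &= \frac{45}{230\ \mathrm{MHz}} \approx 0.196\ \mu\mathrm{s} \approx 0.2\ \mu\mathrm{s}.
\end{align*}
```

Under the same cycle count, the 0.56 µs reported for the layered architectures would be consistent with a clock of about $45 / 0.56\ \mu\mathrm{s} \approx 80$ MHz, although that figure is inferred here rather than quoted.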
Moreover, as will be shown in Section 5, because of the highly degenerate structure of the codes, the layered schedule provides better error correction performance than the flooded one, even if the number of decoding iterations of the latter exceeds significantly the number of decoding iterations of the former (in fact, to get a flooded decoder that approaches the layered decoder, albeit not closely, one would have to go for at least 60 iterations, see Section 5).One last advantage of the layered architecture, reported in [29], is that the logical error rate can be considerably improved by processing layers in random order at each iteration.Such a random layer order can be implemented at a very low cost, as it only requires modifying the ROM memory that stores the layers' control sequence and including a deeper memory with a pseudo-random sequence of layers.
From the results presented above, it can be concluded that the achieved latencies fall within the range of the requirements reported in [12] for transmon and ion-trap technologies, between microseconds and milliseconds. However, these implementations do not meet the highly restrictive time and power conditions of superconducting qubits, which are around 400 ns and 1 W, see [22]. The fastest solution in Table 1 misses the time budget by a factor of more than 3, and is more than one order of magnitude away in terms of power consumption. For these scenarios, other implementation approaches, such as ASICs or more advanced FPGA devices based on a 16 nm CMOS process or below (note that the xcvu095 belongs to the previous, 20 nm generation), need to be explored in future work. Moreover, for the parallel implementation of the flooded schedule, a ping-pong architecture that takes advantage of the pipeline registers to halve the number of MP⋆ decoders could reduce the power consumption by almost half.
We now extrapolate from state-of-the-art ASIC implementations of classical LDPC decoders. A clock frequency of 151 MHz is reported in [35] for a 65-nm CMOS ASIC implementation of a min-sum decoder using the layered architecture described in Section 4.1, for a regular LDPC code with characteristics similar to those of the B1 code investigated here.¹⁴ The operating frequency is expected to increase further for the B1 code, given that both the parity-check matrix and the layer size are smaller than those of the LDPC code in [35]. For f L = 151 MHz, the latency of the layered architecture with parallel post-processing (dedicated hardware) is equal to 1 µs if I δc = I L (last iteration), and 0.73 µs if I δc = 3. Since the operating frequency increases with decreasing technology node, and assuming an inverse-linear frequency scaling [36], we may conclude that a latency constraint of around 400 ns or below can be achieved for more advanced technology nodes, e.g., below 22 nm (today's technology nodes are actually much smaller).
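The extrapolation above can be checked with a short calculation. The sketch below assumes, as a simplification, that the whole decoding latency scales with the clock period, and that the clock frequency is inversely proportional to the technology node (the inverse-linear assumption stated in the text). The function name is hypothetical; the 65 nm latencies are the values quoted above.

```python
def scaled_latency_us(latency_us, node_from_nm, node_to_nm):
    """Scale a latency between technology nodes.

    Assumption: frequency is proportional to 1/node, hence latency is
    proportional to the node size (inverse-linear frequency scaling).
    """
    return latency_us * node_to_nm / node_from_nm


# Latencies quoted in the text for f_L = 151 MHz at 65 nm.
latencies_65nm = {"I_dc = I_L": 1.0, "I_dc = 3": 0.73}

for label, latency in latencies_65nm.items():
    scaled = scaled_latency_us(latency, node_from_nm=65, node_to_nm=22)
    print(f"{label}: {latency:.2f} us at 65 nm -> {scaled:.2f} us at 22 nm")
```

Under these assumptions, both configurations fall below the 400 ns budget at 22 nm (roughly 0.34 µs and 0.25 µs).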
Finally, we note that all the previous results assume that the check-agnosia decoder is implemented as in Algorithm 2. For the codes where Algorithm 2 cannot be applied, the latency will be slightly higher than what was computed here, due to the brute-forcing step in Algorithm 1. We also provide in Appendix B arguments for why OSD post-processing (widely used today in the community for decoding small to medium LDPC codes) is not a viable solution going forward when trying to cope with the hardware implementation constraints.

Error Correction Performance
In this section, we evaluate the error correction performance of the proposed check-agnosia post-processing. The codes used are B1 [[882, 24]] and C2 [[1922, 50]] from [5]. For both codes, the no-stopping subset condition from Section 3.2 is satisfied, hence in the following all simulations are performed using Algorithm 2 (without brute-forcing the system).
As our post-processing is targeted at decoding X and Z errors separately, we use an X noise model, and thus "physical error rate" does actually refer to the physical X error rate.
Our numerical simulations are consistent with the parameters used in Section 4.4. Precisely, we consider a finite-precision NMS decoder, using 6 bits for the exchanged messages and 8 bits for the a posteriori LLRs. Although in floating-point precision the initial (a priori) LLRs of the NMS decoder can be scaled to 1, in finite precision the initial LLR values have a non-negligible impact on convergence. In the simulations, we use the following parameters, optimized by extensive search. For the flooded decoder, we set the initial LLR values to LLR init = 12 and the NMS scaling factor to s nms = 0.875. For the layered decoder, we use LLR init = 8 and the NMS scaling factor s nms = 0.9375. Note that both scaling factors are sums of powers of 2 (0.875 = 1/2 + 1/4 + 1/8 and 0.9375 = 1/2 + 1/4 + 1/8 + 1/16), and as such the scaling operation can be implemented efficiently in hardware, using only SHIFT and ADD operations.
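To illustrate the remark on SHIFT and ADD operations, the following minimal sketch scales an integer message by a sum of negative powers of 2, using only right shifts and additions. It is an illustration and not the hardware implementation: the function name, the sign-magnitude handling, and the truncation of each shifted term are assumptions made here.

```python
# Exponents k such that the scaling factor equals the sum of 2**(-k).
FLOODED_SHIFTS = (1, 2, 3)      # 1/2 + 1/4 + 1/8        = 0.875
LAYERED_SHIFTS = (1, 2, 3, 4)   # 1/2 + 1/4 + 1/8 + 1/16 = 0.9375


def shift_add_scale(value, shifts):
    """Scale an integer message using only shifts and additions.

    Assumptions: the sign is handled separately from the magnitude, and
    each shifted term is truncated (no rounding), so the result can be
    slightly smaller in magnitude than the exact product.
    """
    magnitude = abs(value)
    scaled = sum(magnitude >> k for k in shifts)
    return -scaled if value < 0 else scaled


for message in (8, 16, -16, 31):
    print(message,
          shift_add_scale(message, FLOODED_SHIFTS),
          shift_add_scale(message, LAYERED_SHIFTS))
```

For inputs that are multiples of 16 the result is exact (for instance, 16 is mapped to 14 and 15, respectively); for other inputs, the truncation of each term introduces a small negative bias, which a real design may or may not compensate.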
For the maximum number of decoding iterations, we use I F = 60 for the flooded decoder, and I L = 15 for the layered decoder. The maximum number of decoding iterations for the flooded decoder is the only deviation with respect to the parameters used in Section 4.4 (where I F = 30 was used). In fact, our goal here is to demonstrate the advantage in terms of error correction performance of the layered architecture as compared to the flooded one, even when the latter employs a significantly higher number of decoding iterations. For the post-processing part, we use λ = 10.
Whenever the layered decoding is used, we add the random ordering perturbation introduced in [29] that was shown to significantly improve the decoding convergence.
Figure 2 shows the impact of the iteration I δc used to select the checks in the post-processing. All simulations are done on the B1 code. For flooded simulations (Figure 2(a)), it is actually beneficial to use the 3rd iteration for the metric instead of the last one (60th iteration). The most probable explanation is that the relatively high number of iterations, combined with the finite-precision arithmetic, makes the metric less reliable after a larger number of iterations. For comparison purposes, we also consider a random metric, corresponding to a random choice of the λ checks in the post-processing. As can be seen, the random metric exhibits a pronounced error floor, which validates the reliability-based metric used for check selection.
For layered simulation (Figure 2(b)), all three curves are close by, since the layered NMS decoder with random layer ordering performs already very well.
In fact, the three metrics yield virtually the same performance of the check-agnosia decoder, which is nevertheless better than that of the layered NMS decoder with 30 iterations and without post-processing in Figure 3(a). Although the metric is less important in this case, this shows that the perturbation introduced by the post-processing step in the input reliabilities has an impact on the decoding, and that it is better to run multiple decoders in parallel with perturbed inputs and fewer iterations than to run a single decoder for a long time. As a whole, this validates the fact that the post-processing can be done efficiently using the dedicated hardware approach, increasing the post-processing parallelism and improving the latency.
We would also like to make a case for the choice I δc = 3. This hyperparameter can be optimized to get the best numerical results for a given code. However, the value 3 was chosen here for a different reason. Since both codes have girth 6, choosing I δc = 3 guarantees that, when the a posteriori LLRs are extracted, the decoder has had access to the information from the largest neighbourhood of each variable node that contains no loopy information. This ensures that, although it is very local, this information is also less noisy than information coming from later iterations.
In Figure 3, the post-processing is applied to the codes B1 and C2, with both flooded and layered schedules. For a comparison with the state of the art, in both figures, we added a dashed black curve of an optimized NMS-OSD decoder using 100 iterations of floating-point NMS with a scaling factor of 0.625 [5]. Keep in mind that this decoder is not at all hardware-friendly in terms of complexity, latency and power consumption, and it only serves as a reference. As can be seen from both simulations, our results closely match the performance of the OSD post-processing, concretely showing the effectiveness of our hardware-friendly approach. In both figures, the curves in red correspond to the flooded schedule and those in blue to the layered schedule. In each case, the dotted curves show the performance of the decoder without post-processing.
On the B1 code, for the flooded schedule, the impact of the post-processing is clear, and the check-agnosia flooded decoder exhibits good performance while keeping a latency of around 1.7 µs (taking into account I F = 60). For the layered schedule, the use of the post-processing increases the steepness of the waterfall. The check-agnosia layered decoder keeps the latency at around 1.4 µs. (The latency values above correspond to our FPGA implementation from Section 4.4.) On the C2 code, the performance gains for the flooded schedule are clear, even if the post-processing suffers from a relatively high error floor. For the layered schedule, once again check-agnosia achieves better results in the error floor compared to no post-processing, closely matching the NMS-OSD curve.
Further numerical results are provided in Appendix A, where we evaluate the error correction performance of the check-agnosia decoder on the family of T-codes from [20], showing a threshold phenomenon with hyperparameter λ = 0.02 × |C|.

Conclusions
This work introduced the check-agnosia algorithm, a new post-processing method improving on the syndrome-inactivation algorithm from a hardware-oriented viewpoint. Interestingly, although in the general case brute-forcing a small linear system may still be needed, for a large class of qLDPC codes the check-agnosia post-processing relies only on MP decoding, eliminating the need for any system solver. The proposed solution is flexible and allows devising different hardware architectures in order to meet the latency or the power constraints of the quantum system. The analysis carried out in the document, along with the hardware implementation results for MP decoders (our own implementation on an FPGA board, or results extrapolated from state-of-the-art ASIC implementations), showed that our solution can meet the latency constraints of a wide range of quantum technologies, while providing state-of-the-art error-correction performance with hardware-accurate, finite-precision arithmetic. To the best of our knowledge, there is no prior work on the hardware architecture and implementation of a post-processing enhanced MP decoder for qLDPC codes. An interesting open question going forward would be to look at space-time decoding and see if the underlying graph structure lends itself well to the use of the check-agnosia post-processing without system-solving.

A Hyperparameter λ and decoding threshold
In Figure 4, we include additional numerical results giving a threshold for a constant-rate family of LDPC codes, namely the T codes family from [20]. Since we are not aware of a simple way to build layers for this family of codes, we ran the simulations using a serial decoder, which is a fair approximation of the numerical results one would get with layered decoding. We use check-agnosia with hyperparameter λ = 0.02 × |C| = 0.01 × |Q|, since for all the codes of the T family, |C| = |Q|/2. The computational complexity of the algorithm is hence $(0.01\times)\, n^2 \log n$, where n = |Q| is the number of qubits, and this complexity can be spread between time and energy consumption depending on the architecture needs (see Figure 1). Furthermore, we make a case that the average complexity of the decoder is actually much better than that. In the figure, we also included the average number of inactivations (denoted λ avg ) for physical error rates 0.6 and 0.5, where the average values get very close to one (meaning that only one inactivation might usually be necessary). Since this number gets close to one for low error rates, in practice the post-processing may add only a constant multiplicative overhead. This average λ is particularly meaningful in a sequential architecture, where we stop the post-processing as soon as the first post-processing attempt converges. In the parallel architecture, it should still be possible to optimize the actual number of parallel runs if we have access to some prior information on the noise level, e.g., by considering a pool of MP⋆ decoders that serve for the post-processing of several logical qubits and are dynamically allocated between them.
Figure 4: Check-agnosia threshold for the T codes family from [20]. Codes shown: T1 [[126,12,d<11]], T2 [[254,14,d<17]], T3 [[510,16,d<19]], T4 [[1022,18]], T5 [[8190,24]]. Serial scheduling with random ordering and 15 iterations. Normalized min-sum with scaling factor 0.9375 and finite-precision arithmetic, with 6-bit quantization for the input LLRs and exchanged messages, and 8-bit quantization for the a posteriori LLRs. For the post-processing, check-agnosia is used with I δc = 3 and λ = 0.02 × |C|.

B Latency comparison of CA and OSD
We provide below a comparison, in terms of latency, between the check-agnosia proposal and the OSD post-processing solution. This comparison is similar to the method presented in [21], and is intended to clarify the differences between the two solutions with respect to hardware VLSI implementations.
We consider a layered MP decoder, with check-agnosia implemented through "Dedicated hardware" (that is, the λ MP⋆ decoders are executed in parallel). Since the layered MP⋆ decoders achieve a maximum operating frequency f L = 80 MHz (corresponding to a critical path of 12.5 ns), the total latency of the check-agnosia post-processing is 12.5 ns × 3.5 × 15 = 656.25 ns. We have omitted here the latency of the sorting unit, required to sort the check-nodes according to their reliability.
Considering now the OSD post-processing, we again omit the latency of the sorting unit, required this time to sort the qubit-nodes according to their reliability. We actually consider only the latency of the Gaussian elimination step required by OSD post-processing (and omit the latency of any other steps). Refs. [37,38] provide the two main highly parallel architectures known in the literature to perform Gaussian elimination over finite fields. In both cases, the number of clock cycles required to perform Gaussian elimination is equal to $(M^2 + M)/2$, where $M$ is the number of rows of the parity-check matrix. Thus, the latency of the Gaussian elimination implementation is given by $T_{\mathrm{OSD}} = (M^2 + M)/(2 f_{\mathrm{OSD}})$, where $f_{\mathrm{OSD}}$ is the operating frequency. The frequency required to achieve the same time budget as our proposal is therefore $f_{\mathrm{OSD}} = \frac{(441^2 + 441)/2}{656.25\ \mathrm{ns}} \approx 148.5\ \mathrm{GHz}$. Such a frequency is completely unrealistic, and would certainly lead to timing violations in the design (note that it is $148.5/0.08 \approx 1856$ times larger than that of the layered MP decoder). Besides, it would also translate into an extremely large power consumption, which typically increases linearly with the operating frequency.
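The arithmetic of this comparison can be reproduced with the short sketch below. All numerical values are those quoted in this appendix; the function and variable names are hypothetical, and the factor 3.5 is taken as is from the latency expression above.

```python
def check_agnosia_latency_ns(critical_path_ns=12.5, factor=3.5, iterations=15):
    """Latency of the parallel check-agnosia post-processing, in ns.

    The three values are those of the expression 12.5 x 3.5 x 15 in the
    text; the sorting-unit latency is omitted, as in the text.
    """
    return critical_path_ns * factor * iterations


def gaussian_elimination_cycles(num_rows):
    """Clock cycles reported for the architectures in [37, 38]: (M^2 + M)/2."""
    return (num_rows * num_rows + num_rows) // 2


ca_latency_ns = check_agnosia_latency_ns()       # 656.25 ns
ge_cycles = gaussian_elimination_cycles(441)     # M = 441 rows
f_osd_ghz = ge_cycles / ca_latency_ns            # cycles per ns = GHz
ratio_to_layered = f_osd_ghz / 0.08              # layered MP runs at 80 MHz

print(f"check-agnosia latency: {ca_latency_ns} ns")
print(f"Gaussian elimination cycles: {ge_cycles}")
print(f"required OSD frequency: {f_osd_ghz:.1f} GHz")
print(f"ratio to the 80 MHz layered decoder: {ratio_to_layered:.0f}")
```

With M = 441, Gaussian elimination alone takes 97461 clock cycles, which is why matching a 656.25 ns budget would require a clock frequency of about 148.5 GHz.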

Figure 1: Comparison of different architectures for the check-agnosia decoder. The clock cycle diagram is included for the different proposals (Warning: drawing is not to scale). In case 1), MP and MP⋆ use the same hardware.
Sort checks in increasing order of reliability; extract {c k } k∈[λ], the λ least reliable checks.
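The selection step above can be sketched as follows. This is an illustration only: the per-check reliability values are a hypothetical input (how they are computed is defined elsewhere in the paper and is not reproduced here), and the function name is an assumption.

```python
def least_reliable_checks(reliability, lam):
    """Return the indices of the lam least reliable checks.

    reliability[c] is a hypothetical reliability value of check c; the
    checks are sorted in increasing order of reliability, and ties are
    broken by check index because Python's sort is stable.
    """
    order = sorted(range(len(reliability)), key=lambda c: reliability[c])
    return order[:lam]


# Hypothetical reliabilities for five checks, with lam = 2.
selected = least_reliable_checks([5, 1, 3, 0, 4], lam=2)
print(selected)  # indices of the two least reliable checks: [3, 1]
```

In hardware, this step corresponds to the sorting unit, whose latency is discussed in the implementation results.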

Table 1: Latency (L) and power consumption (P) values for the Pareto designs in Section 4.3 (Imax in the table stands for IF for flooded architectures, or for IL for layered ones).