Block-encoding structured matrices for data input in quantum computing

The cost of data input can dominate the run-time of quantum algorithms. Here, we consider data input of arithmetically structured matrices via block encoding circuits, the input model for the quantum singular value transform and related algorithms. We demonstrate how to construct block encoding circuits based on an arithmetic description of the sparsity and pattern of repeated values of a matrix. We present schemes yielding different subnormalisations of the block encoding; a comparison shows that the best choice depends on the specific matrix. The resulting circuits reduce flag qubit number according to sparsity, and data loading cost according to repeated values, leading to an exponential improvement for certain matrices. We give examples of applying our block encoding schemes to a few families of matrices, including Toeplitz and tridiagonal matrices.


Introduction
The advent and astonishing increase in computational power of classical computing has truly revolutionised the world and ushered in the age of information. Yet, there are computational problems that are and will stay out of reach of classical computation due to their exponential complexity. Quantum computing [1] offers the dazzling prospect of providing a speed-up and moving select problems from the intractable to the tractable. Demonstrations of first quantum computers [2][3][4] up to a few hundred qubits provide a first step towards realising such a computational advantage. Apart from vast increases in the number and quality of qubits on the hardware level, improvements and development of new quantum algorithms are also dearly called for. Currently, the number of quantum algorithms and application use cases known to provide an exponential advantage is rather limited [5,6].
The inception of the quantum singular value transform (QSVT) [7] has recently led to a new perspective on quantum algorithms. In what has been termed the "grand unification of quantum algorithms" [8], many previously known quantum algorithms have been reformulated within the framework of QSVT, including matrix inversion, phase estimation, Hamiltonian simulation, and amplitude amplification. The core of the QSVT algorithm is a polynomial transformation of an input matrix's singular values. Different choices of polynomials and input matrices give rise to these various applications. In order to run a quantum algorithm in a real-world setting, data input (into the quantum computer) and data output (readout) are important steps and can severely limit any speed-up provided by the quantum algorithm itself [9]. In this article, we will study how to input structured data efficiently and provide a scheme that facilitates the construction of explicit quantum circuits for input of structured data matrices, demonstrated by several examples.
In QSVT-based algorithms, the input model for matrices of data is that of a block encoding [7]. Generally, an input matrix of data A could be non-unitary, and there is no quantum circuit that could implement the operator A directly. Instead, in a block encoding, A is embedded as a block inside a larger unitary U:

U = ( A/α  ∗ ; ∗  ∗ ),    (1)

where the blocks marked ∗ contain junk and ";" separates block rows. Inside the larger unitary, the matrix A is scaled down by the subnormalisation α. The precise values in the junk blocks are inconsequential (but must be consistent throughout one QSVT circuit). Subnormalisation and junk serve two purposes: first of all, they can be necessary to ensure that an embedding of A into a unitary exists. However, the existence of an embedding (or even numerical knowledge of possible junk values) is not sufficient to input the data matrix A. Rather, U must be implemented as a quantum circuit and expressed in an elementary quantum gate set in order to run it on a quantum computer. Finding such a quantum circuit will typically further increase the subnormalisation α and the dimension of the junk blocks. In a quantum circuit, the block encoding acts as follows: the top qubit register consists of the flag qubits; its dimension depends on the dimensions of the junk blocks in (1). The block of the unitary containing A/α is selected by initialising the flag qubits as |0⟩ and postselecting them as |0⟩. The probability of measuring the correct |0⟩ outcome on all flag qubits is related to the subnormalisation; a smaller subnormalisation α is better. A smaller subnormalisation typically also reduces the circuit length of a QSVT transforming the matrix, as lower resolution and lower polynomial degree are required. The bottom register has the same dimension N as the matrix A. The matrix elements A_ij/α can then be recovered as the amplitudes on the bottom register, as indicated in the circuit diagram.
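As a small numerical illustration of the embedding (1), the following sketch places a hypothetical non-unitary 2 × 2 matrix A into a 4 × 4 unitary via the standard unitary-dilation construction, with the spectral norm as subnormalisation. This only demonstrates that the embedding exists; it is not an efficient circuit construction.

```python
import numpy as np

# Hypothetical non-unitary 2x2 data matrix
A = np.array([[1.0, 0.5],
              [0.2, 0.8]])

alpha = np.linalg.norm(A, 2)        # spectral norm as subnormalisation
B = A / alpha                       # A/alpha has singular values <= 1

# unitary dilation U = [[B, sqrt(I - B B^T)], [sqrt(I - B^T B), -B^T]]
W, s, Vt = np.linalg.svd(B)
c = np.sqrt(np.clip(1 - s**2, 0.0, None))
U = np.block([[B,                      W @ np.diag(c) @ W.T],
              [Vt.T @ np.diag(c) @ Vt, -B.T]])

assert np.allclose(U @ U.T, np.eye(4))      # U is unitary
assert np.allclose(U[:2, :2] * alpha, A)    # top-left block is A/alpha
```

Postselecting the flag qubit as |0⟩ then singles out exactly this top-left block.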
Since block encodings are fundamental to QSVT, they have been studied previously. Quantum circuits implementing block encodings for arbitrary dense matrices have been worked out [10,11] but are exponentially expensive: for a 2^n × 2^n matrix on n qubits, they require O(2^n) T gates.¹ However, the scope of these works did not include optimisations for matrices that are sparse or structured (have repeated values). Other work shows quantum circuit constructions for sparse matrices based on black-box oracles [7]. There, the implementation of the oracles is not discussed, and the complexity of the block encoding circuits is analysed in terms of black-box usages. More specialised schemes for, e.g., density operators, POVM operators, Gram matrices [7], or kernel matrices [14] have also been discussed. Explicit circuits for certain sparse, structured matrices are shown in [15] for a specific value of N.
To gain a computational advantage for some problems, the exponential cost for arbitrary matrices [10,11] must be reduced. In this work, we consider matrices that are sparse and/or structured (having repeated elements); with sufficient structure, we can construct exponentially more efficient circuits. We provide several variations of a block-encoding scheme based on oracles, each with a different subnormalisation. It depends on the matrix which variation performs best. The schemes fully take into account sparsity, reducing the number of flag qubits beyond the schemes in [7,10,11]. Moreover, we explain how to construct circuit implementations of the required oracles given a family of matrix structures for increasing N, and provide several explicit examples.
Block encodings also appear beyond QSVT. In fact, there is a calculus allowing block-encoded matrices to be summed and multiplied [7]. Many modern quantum algorithms for chemistry are based on phase estimation via qubitisation [16][17][18][19][20], an application of quantum walks [21,22]. They are based on block-encoding the Hamiltonian. Successive articles go into great detail on how to construct the block encoding, and advances mainly stem from lower subnormalisation and shorter circuits of the block encoding. For chemistry applications, the block encodings are constructed as a linear combination of unitaries (LCU) with so-called PREPARE and UNPREPARE oracles. This viewpoint is quite distinct from the other block-encoding schemes in the literature mentioned in the previous paragraphs; in section 2.3, we connect these two viewpoints, which can lead to an improved subnormalisation.
The quantum circuits involved are too long to run directly with the noisy qubits provided by current hardware devices. Despite efforts to run QSVT primitives on noisy qubits [23], error correction and fault-tolerant quantum computation remain essential for longer circuits. In quantum error correction, an error correcting code is applied to several noisy physical qubits, yielding one or more logical qubits [24]. The circuit is run at the level of logical qubits. Not all gates are equal in error correction codes; rather, in the popular surface codes [25], so-called T gates and Toffoli gates (which can be re-expressed with T gates) are much more expensive than Clifford gates [26][27][28][29][30]. Consequently, we will focus on the T gate count in assessing the cost of the block encoding.
Section 2 explains and compares the proposed block encoding schemes. Their costs are summarised in Table 1, and an adaptation to yield Hermitian block encodings for symmetric matrices is considered in section 2.4. Examples of using the scheme to construct block encodings of specific matrix families are provided in section 3; for example, our Toeplitz matrix encoding provides an exponential advantage over arbitrary-matrix encoding. Finally, we draw some conclusions and give an outlook in section 4. It is our hope that the schemes presented in this work will in the future enable the construction of efficient block encoding circuits for further families of matrices from a wide range of application areas. The appendices give more detail on the circuit implementations of some parts of the block encoding: Appendix A focuses on the data loading oracles. Appendix B explains singular value amplification, a technique that can be employed to improve the subnormalisation, and provides some new analysis of its efficacy.

Block-encoding schemes
Throughout, we consider real matrices, including negative values. The schemes presented are adapted to structured matrices; that is, possibly sparse matrices with repeated data values and an arithmetic description of the structure available. This will become clear when constructing the circuits.
We take the following product as a total figure of merit for a block encoding:

(T-gate count) · subnormalisation.    (3)

A lower subnormalisation increases the probability to measure |0⟩ for the flag qubits, making it easier to extract the matrix. Lower subnormalisation typically also leads to shorter circuits in QSVT [7,8] or qubitisation [17] algorithms that use the block encoding. Taking the product (3) as the figure of merit is motivated by the fact that it is approximately constant under singular value amplification (see appendix B), which can reduce the subnormalisation of a block encoding by inversely increasing its circuit length (up to a logarithmic factor). For simplicity, instead of considering the T-gate count as a measure for circuit cost as in (3), we focus on the "data loading cost", i.e. the number of data values loaded, which generically dominates the cost. As we will see, we expect the other circuit parts to be implementable with O(polylog N) T gates. Of course, for low data loading cost, these other circuit parts could become dominant. Therefore we focus on the figure of merit

(data loading cost) · subnormalisation.    (4)

The next subsections first describe our base block encoding scheme (section 2.1), followed by variants termed preamplified (section 2.2) and PREP/UNPREP (section 2.3) that for some applications will improve the figure of merit (4).

Base scheme
Let us introduce notation. Let N be the dimension of the matrix A, and D the number of distinct data items A_d (0 ≤ d < D) in the matrix (apart from zeros following the sparsity pattern). Crucially, each data value may appear multiple times in the matrix. The multiplicity M is the maximum multiplicity of any of the D data items. Finally, we have the column sparsity S_c and row sparsity S_r, the maximal number of non-zero elements per column or row (according to the sparsity pattern, if any), and the maximum sparsity S = max(S_c, S_r). The smaller S, the sparser the matrix.

Simplified introductory case
Consider first the simple case in which each of the D data items has the same multiplicity M, each of the N rows and columns has the same sparsity S_c = S_r = S, and all of these are powers of 2. The equality

M D = N S_c = N S_r = #nonzero    (5)

then follows from counting the number of nonzero entries in the matrix's sparsity pattern from three different perspectives. From these three perspectives, each nonzero matrix element is labelled either by
• (d, m), its data index 0 ≤ d < D and multiplicity index 0 ≤ m < M, or
• (j, s_c), its column index 0 ≤ j < N and column sparsity index 0 ≤ s_c < S_c, or
• (i, s_r), its row index 0 ≤ i < N and row sparsity index 0 ≤ s_r < S_r.
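These three counting perspectives can be checked numerically. The sketch below uses a small two-valued checkerboard pattern (a matrix family the paper returns to in section 3, with hypothetical values) and verifies the counting identity (5).

```python
import numpy as np

# 4x4 matrix with two distinct values in a checkerboard pattern
# (hypothetical values)
a, b = 0.7, -0.3
A = np.array([[a, b, a, b],
              [b, a, b, a],
              [a, b, a, b],
              [b, a, b, a]])

N = A.shape[0]
values = np.unique(A[A != 0.0])
D = len(values)                                  # distinct data items
M = max(int(np.sum(A == v)) for v in values)     # maximal multiplicity
S_c = max(np.count_nonzero(A[:, j]) for j in range(N))
S_r = max(np.count_nonzero(A[i, :]) for i in range(N))

# the counting identity (5): M D = N S_c = N S_r = #nonzero
assert M * D == N * S_c == N * S_r == np.count_nonzero(A)
```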
The oracles underlying our block encoding are the unitary column oracle O_c and row oracle O_r relating these three equivalent descriptions. In particular, O_c maps |d⟩|m⟩ ↦ |j⟩|s_c⟩ and O_r maps |d⟩|m⟩ ↦ |i⟩|s_r⟩. We suggest constructing these oracles by first establishing a labelling of the matrix by (d, m). We suppose that arithmetic expressions for i(d, m) and j(d, m) follow; throughout, we use the term "structured matrix" to refer to this kind of arithmetic structure allowing the computation of the positions of elements. These arithmetic expressions can then be converted to quantum circuits, possibly making use of ancillas. The exact values of s_r and s_c are irrelevant as long as they fall within range; they arise naturally in the conversion of the arithmetic expressions, making the quantum circuits unitary. Example constructions are performed in section 3. Quantum arithmetic can be implemented efficiently [31,32], and so we expect the same of O_c and O_r. Provided the arithmetic expressions underlying the oracles are sufficiently short (for example, do not increase in length with N), the T-gate counts of the oracles O_c and O_r are expected to be of order O(polylog N) [31,32].
In addition to the structure of the matrix, which is encoded in the oracles O_c and O_r, a data loading oracle inputs the values A_d themselves, acting as |0⟩|d⟩ ↦ (A_d/‖A‖max |0⟩ + √(1 − A_d²/‖A‖max²) |1⟩)|d⟩. In a quantum circuit, we write these multiplexed rotations using one flag qubit. The notation of a slash in a control indicates that the gate is not controlled on a single value, but multiplexed over the values of the control register as in [33,10]. These multiplexed rotations are the data loading step and may be implemented in multiple ways; see appendix A for details. In general, the data loading cost of (9) corresponds to a Toffoli count of O(D) (QROM [17]) or O(√D) (select/swap network [33]) for the multiplexing, plus T gates to synthesise the rotations. Note there is no dependence on M, N or S. As with the other oracles, implementation details possibly require ancilla qubits.
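A minimal matrix-level model of these multiplexed rotations (not an optimised circuit; the data values are hypothetical) is the block-diagonal unitary below. Postselecting the data qubit as |0⟩ leaves the amplitude A_d/‖A‖max on branch |d⟩.

```python
import numpy as np

# Sketch of the multiplexed data-loading rotations, acting as
# |d>|0> -> |d>( (A_d/amax)|0> + sqrt(1 - (A_d/amax)^2)|1> ), amax = ||A||max.
A_vals = np.array([0.9, -0.4, 0.25, 0.6])     # hypothetical data items A_d
amax = np.abs(A_vals).max()
a = A_vals / amax

D = len(a)
O_A = np.zeros((2 * D, 2 * D))                # basis ordering: index = 2*d + q
for d, ad in enumerate(a):
    s = np.sqrt(1 - ad**2)
    # rotation on the data qubit, multiplexed on the |d> register
    O_A[2*d:2*d+2, 2*d:2*d+2] = np.array([[ad, -s],
                                          [s,  ad]])

assert np.allclose(O_A @ O_A.T, np.eye(2 * D))            # unitary
# postselecting the data qubit as |0> leaves amplitude A_d/amax
assert np.allclose([O_A[2*d, 2*d] for d in range(D)], a)
```

Note the size of this multiplexed gate depends only on D, not on M, N or S, matching the cost discussion above.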
The oracles are supplemented by H_S to construct the block encoding. The gate H_S, sometimes called a diffusion operator, creates an equal superposition state H_S |0⟩ = (1/√S) Σ_{s=0}^{S−1} |s⟩. If S is a power of 2, H_S is simply a string of Hadamard gates. In other cases, it can for example be constructed by amplitude amplification using an ancilla (see [17]). Putting these together yields the block encoding circuit. Crucially, the pink qubit registers in the middle can have dimensions D and M distinct from the black qubit registers of dimension S and N. Yet, thanks to the equality N S = M D, the oracles have the same total input and output dimensions. To recover the matrix from the block encoding, the flag qubits (i.e. the data qubit and the s register) must be initialised and postselected as |0⟩. The values are then recovered when initialising the bottom register with |j⟩ and postselecting/measuring an |i⟩. This can be most easily seen by inserting a resolution of the identity I = Σ_{d,m} |d⟩⟨d| ⊗ |m⟩⟨m| into the middle of the circuit, which yields ⟨0, i| U |0, j⟩ = A_ij / (S ‖A‖max). The factor of S comes from the H_S gates. Thus, the circuit is a block encoding of A with subnormalisation S ‖A‖max and 1 + log₂ S flag qubits of total dimension 2S. The matrix is implemented exactly, apart from any finite accuracy in the multiplexed rotations (the data loading).
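For S a power of 2, the diffusion operator H_S is just log₂ S Hadamard gates, as the following short sketch verifies:

```python
import numpy as np

# H_S for S a power of two: log2(S) Hadamard gates map |0...0>
# to the equal superposition over all S basis states.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

S = 8
H_S = np.array([[1.0]])
for _ in range(int(np.log2(S))):
    H_S = np.kron(H_S, H)

state = H_S @ np.eye(S)[0]        # H_S |0>
assert np.allclose(state, np.full(S, 1 / np.sqrt(S)))
```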
In applications of the block encoding scheme, we expect the matrix structure (the sparsity pattern and pattern of repeated values) to be given, from which O_c and O_r can be inferred. Different instances of the problem only require replacement of the data loading oracle, a straightforward, automatable task.

Full block encoding
A general structured matrix may not fulfill the strict requirements of the simplified case in the preceding section 2.1.1. Here, we consider the general case, where each of the D data items may have a distinct multiplicity (with maximum M), and each of the N rows and columns may have distinct sparsities (with maximum S_c for columns and S_r for rows, and S = max(S_c, S_r)). Therefore, a priori, M D = N S may not hold. We pad M and/or S, increasing their size with further dummy index range, until the equality M D = N S is fulfilled and the same construction with oracles O_c and O_r is possible. The action of these oracles on the padded dummy indices is insignificant (they will be flagged and deleted by an out-of-range oracle introduced in the next paragraph), as long as unitarity is ensured. Without loss of generality, we assume the resulting qubit register dimensions are powers of 2; otherwise they can trivially be embedded in a larger register made up of an integer number of qubits.
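A possible classical sketch of this padding step (a hypothetical helper, not the paper's exact procedure) rounds the register dimensions to powers of two and then doubles M or S with dummy indices until the total dimensions match:

```python
import math

def next_pow2(x):
    return 1 << max(0, math.ceil(math.log2(x)))

def pad_registers(D, M, N, S):
    # Hypothetical helper sketching the padding: embed all register
    # dimensions in powers of two, then enlarge M and/or S with dummy
    # index range until the total dimensions match, D*M == N*S.
    # Dummy (d, m) labels are later flagged and removed by O_rg.
    D, M, N, S = map(next_pow2, (D, M, N, S))
    while D * M < N * S:
        M *= 2
    while N * S < D * M:
        S *= 2
    return D, M, N, S

D, M, N, S = pad_registers(D=3, M=5, N=8, S=3)
assert D * M == N * S                                  # padded equality
assert all(v & (v - 1) == 0 for v in (D, M, N, S))     # powers of two
```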
As in the simplified case, the matrix is labelled from the three perspectives (d, m), (i, s_r), and (j, s_c). Now however, because of the padding, not all labels (d, m) with 0 ≤ d < D, 0 ≤ m < M are in-range, i.e. mapped by O_r and O_c to row and column indices i, j matching the matrix pattern. Out-of-range labels are flagged by a new oracle, which we call O_rg, the out-of-range oracle. The oracle is controlled on the D and M registers and flips a "delete" flag qubit whenever (d, m) is out-of-range. The resulting full block encoding circuit has one flag qubit (the del qubit) more than the simplified scheme, giving 2 + log₂ S flag qubits.
The subnormalisation now takes into account possibly unequal S_c ≠ S_r ≠ S and becomes √(S_c S_r) ‖A‖max. The data loading oracle must load D data items. Because the other oracles can be implemented with arithmetic at O(polylog N) T-count [31,32], we take the number of data items to load as a metric for the cost of the block encoding circuit. Note that if D is small and constant as the size N grows, eventually the cost of the other oracles will surpass the data loading.
The cost figures (data loading cost, subnormalisation, flag qubit dimension) are summarised in table 1. As a reference, we also include the cost figures of encoding schemes found in the literature. Gilyén et al. present a block encoding scheme for sparse-access matrices ([7], Lemma 48). In contrast to our scheme, where the data loading oracle is multiplexed only over the D distinct indices, in their scheme the data loading oracle is multiplexed over the N row and N column indices i and j. The inner workings of the oracle are not specified; see appendix A.3 for a possible efficient implementation. The subnormalisation is the same as in our case. While they have 3 + log₂ N flag qubits, our encoding takes the sparsity into account and requires only 2 + log₂ S flag qubits.
Chakraborty et al. present a block encoding scheme for a matrix from a quantum data structure ([11], Lemma 6.2). A similar construction is also discussed by Clader et al. [10]. Regardless of the sparsity, the data loading cost is always N² + N items. These works presuppose that the data loading can be performed efficiently with a quantum data structure like qRAM [12,13], which promises a logarithmic T-gate depth. However, this is achieved by a large parallel execution of T gates on a large number of ancillary qubits; the total number of T gates is not reduced compared to other approaches (QROM [17], select/swap [33]). In the comparison table 1, we remain agnostic about the type of data loading (qRAM, QROM, . . . ) and record the number of items to be loaded. When D ≤ N, we find a lower figure of merit (4), (data loading cost) · subnormalisation, for our block encoding scheme. However, for a matrix with more distinct data entries than N, the scheme by Chakraborty et al. may or may not be advantageous over the base scheme presented above, depending on the matrix norms. D. Camps et al. [15] show an explicit example construction for a circulant matrix. Similarly to our scheme, it requires only D data loadings even though values appear multiple times, and a flag qubit number of ≈ log₂ S. However, the scheme is only suitable for matrices with a very particular structure: each value must appear exactly once in each column. Our block encoding encompasses this case and goes beyond it, covering other types of structured matrices.

Preamplified scheme
In this section, we will show how the block encoding from the base scheme can, in some cases, be improved by a method called preamplification. The subnormalisation of a block encoding can be reduced by performing singular value amplification (see appendix B) [7,11], resulting in an ε-approximate block encoding with subnormalisation reduced by an amplification factor γ. However, it requires O((γ/δ) log(γ/ε)) applications of the original block encoding (δ is related to a bound on the matrix's singular values). Simply performing singular value amplification on the full block encoding therefore does not improve the figure of merit (4), (data loading cost) · subnormalisation; in fact, because of the factors related to the accuracy of the approximate result, it gets worse. Preamplification, presented by Gilyén et al. [7], is based on two individual singular value amplifications of two separate circuit parts. As we will see, it is also applicable to our block encoding and can improve its figure of merit (4) in some cases, intuitively when the matrix has values of strongly varying magnitude.
The starting point for preamplification is to split the data loading oracle into two parts according to a choice of 0 ≤ p ≤ 1, adapting the block encoding circuit (16) accordingly. The two data qubits and data loading oracles combine to give the same matrix as the original block encoding; in fact, the subnormalisation has not changed either. The idea of preamplification is that, as we will see momentarily, the two circuit parts U_c† and U_r are block encodings in their own right and can be singular-value amplified individually by amplification factors γ_c and γ_r. While the data loading cost increases by a factor of O(γ_c + γ_r) (dropping logarithmic factors), the total subnormalisation of the matrix is reduced multiplicatively by γ_c γ_r. Depending on the factors hidden in the big-O notation, preamplification can thereby reduce the figure of merit (data loading cost) · subnormalisation.
Let us discuss U_c†; U_r is similar. The unitary U_c† is a slightly more general block encoding, where the flag qubits on the left in circuit (19) are data0 through s, and on the right only data0 through del. Omitting data1 and del, which do not participate in U_c†, the encoded non-square matrix can be deduced by inserting resolutions of the identity Σ_d |d⟩⟨d|, Σ_m |m⟩⟨m|, Σ_j |j⟩⟨j|. This matrix is already written in the form of a singular value decomposition because {|x_j⟩} and {|j⟩} are orthogonal systems; the singular values then follow from their normalisation. Following [7], we choose the amplification factor γ_c accordingly (see appendix B for details). Replacing U_c† and U_r by their amplification circuits results in a total subnormalisation divided by γ_c γ_r; the resulting subnormalisation after preamplification is recorded in table 1. Two flag qubits are needed in addition to (19) to perform the two amplifications. The data loading cost follows from appendix B, where we have determined the prefactor 3. The largest possible choice of δ is δ = 1 − 2^(−1/4) ≈ 0.16, see (98) in appendix B. The accuracy ε determines the accuracy of the amplifications of U_c† and U_r. Hence, the accuracy of the full matrix is bounded by (2ε + ε²) · subnormalisation. Preamplification was introduced in the scheme by Gilyén et al. [7], where it works the same way. As in the comparison of the base block encoding, our scheme requires a lower number of flag qubits for sparse matrices (S < N).
While the p-norm encoding from the quantum data structure scheme [11] does not employ singular value amplification, it is similar in spirit because it splits the matrix elements into powers p and 1 − p. The subnormalisation is the same as in preamplification, except for the √2 factor. To compare the data loading costs, one can bound the amplification factors; the preamplified data loading cost (dropping logarithms) is then to be contrasted with the 2N² of the p-norm encoding. If the matrix has D = N² distinct values, p-norm encoding is clearly favourable. Our scheme does well if the matrix is structured with a lower D.
Preamplification has a significant overhead in data loading cost, due to the prefactor, δ, and ε in (27). In order to still achieve a benefit over the base block encoding scheme, the amplification factors γ_c (25) and γ_r must be large. Intuitively, this requires matrices with values of strongly varying magnitude. Whether the base or the preamplified scheme yields a better block encoding w.r.t. (data loading cost) · subnormalisation depends on the specific matrix, as does the optimal choice of p in the preamplification scheme.

PREP/UNPREP scheme
When the matrix structure has repeated data such that D ≤ S_c, S_r, the row/column oracles and the multiplexed rotations often commute. Intuitively, this is the case when each data value appears in all (or most) of the rows and columns and the structure of the matrix is such that the columns and rows are (mostly) permutations of each other. Then, assuming commutativity, one can use a PREP/UNPREP scheme to reduce the subnormalisation of the base block encoding. PREP and UNPREP operators previously appeared in quantum algorithms for chemistry, when block-encoding Hamiltonians [17]. We show how such operators can be used in our block encoding scheme for more general structured matrices.
The starting point is splitting the multiplexed rotations into two parts (19), just like in preamplification. By assumption, the row/column oracles commute with the rotations, hence we can use an identity involving a prepare operator and, because the left side is not a normalised quantum state, a scaling factor X. While PREP |0⟩ is of course a normalised state, the left-hand side of (28) is not, due to the postselection of the flag qubit as |0⟩. The quotient factor X in (28) is smaller than one and results in a lower subnormalisation after application of the identity. A state preparation operator for a D-dimensional state can be implemented in various ways with no more than D data loading cost; see appendix A.2 for details. We use a similar operator, UNPREP, on the right side of the circuit. The total block encoding circuit then has subnormalisation α_p. Contrary to preamplification, there is no singular value amplification that necessitates repeated data loading. Instead, the data just needs to be loaded twice (i.e. 2D data loading cost), once for PREP and once for UNPREP. An application of Callebaut's inequality (a refinement of the Cauchy-Schwarz inequality) [34] shows that the subnormalisation α_p improves as p → 1/2: there is a single optimal choice p = 1/2 of the parameter for all matrices, contrary to preamplification.
In fact, for p = 1/2, one can choose UNPREP = PREP† up to sgn(A_d), and for certain circuit implementations of PREP one can use measurement-based uncomputation to significantly reduce the number of Toffoli gates in UNPREP [18]. Effectively, we take this into account by recording a reduced data loading cost of D for the optimal p = 1/2 in table 1. Compared to our base scheme, the subnormalisation is always reduced: the reduction is stronger when the matrix values have strongly varying magnitude than when they are all of similar size. The data loading cost is only increased by a factor of 2 compared to the base scheme, except for the optimal p = 1/2, where it is the same. Whenever possible (i.e. D ≤ S_c, S_r and the row and column oracles commute with the data loading), one should therefore choose PREP/UNPREP with p = 1/2 instead of the base scheme. Whether the preamplified or the PREP/UNPREP scheme has the better figure of merit (4) depends on the specific matrix, including its values.
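The p-dependence of the subnormalisation can be checked numerically. Below, f(p) is a sketch of its D-dependent factor under the assumption that it takes the form √(Σ_d |A_d|^{2p}) · √(Σ_d |A_d|^{2(1−p)}) (sparsity prefactors omitted; the data values are hypothetical). Callebaut's inequality then makes p = 1/2 optimal, with f(1/2) = Σ_d |A_d| ≤ D·max_d |A_d|.

```python
import numpy as np

# Hypothetical distinct data items A_d of strongly varying magnitude
A_vals = np.array([1.0, 0.3, 0.05, 0.6])

def f(p):
    # assumed form of the D-dependent subnormalisation factor (sketch)
    a = np.abs(A_vals)
    return np.sqrt(np.sum(a ** (2 * p))) * np.sqrt(np.sum(a ** (2 * (1 - p))))

# Callebaut's inequality: f(p) >= f(1/2) for all 0 <= p <= 1
assert all(f(p) >= f(0.5) for p in np.linspace(0, 1, 21))
# at the optimum, f(1/2) = sum_d |A_d|, never exceeding D * max_d |A_d|
assert np.isclose(f(0.5), np.abs(A_vals).sum())
assert f(0.5) <= len(A_vals) * np.abs(A_vals).max()
```

The last assertion mirrors the statement that PREP/UNPREP never does worse than the base scheme, with the strongest gain for values of widely varying magnitude.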

Hermitian block encoding for symmetric matrices
When the matrix A is symmetric, it can be desirable to construct a Hermitian block encoding. This simplifies quantum walks [21,22] and phase estimation via qubitisation [16]. A priori, the block encodings constructed above may not be Hermitian, even if the matrix is symmetric.
In this section, we show how all our block encodings become Hermitian with only slight modifications when the oracles are constructed in a particular way. The starting point is, as in all our constructions, a labelling (d, m) of the matrix elements, and a column oracle O_c : (d, m) → (j, s_c). Because the matrix is symmetric, elements related by transposition (i ↔ j) have the same d. Thus, there is an oracle O_t that maps (d, m) → (d, m′), the label of the element related to (d, m) by transposition. On the diagonal of the matrix, m′ = m. When the arithmetic expression is cast into a quantum circuit, it becomes unitary and Hermitian, because transposition is an involution, O_t² = I. As usual, the action of O_t on out-of-range (d, m) does not matter (apart from ensuring its unitarity and Hermiticity) because O_rg flips the delete qubit. A row oracle can then be constructed from the column oracle as O_r = O_c O_t, keeping in mind that S_c = S_r. We will show examples of constructions of Hermitian block encodings in section 3. A further Z gate must be added to the data qubit to make the base block encoding scheme Hermitian. The resulting circuit is Hermitian because the left and right parts are Hermitian conjugates, and the middle part consists of three commuting Hermitian operators (the O_rg-controlled NOT, O_t, and the X-axis rotations (7) preceded by Z). The Z does not affect the encoded matrix because it has no effect when the flag qubit is initialised as |0⟩. Yet, it is needed to make the full block encoding unitary Hermitian. This Hermitian counterpart of the base block encoding scheme has the same data loading cost, subnormalisation, and flag qubit number.
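The involution property of O_t can be illustrated classically: for a symmetric matrix with a (hypothetical) labelling of its nonzeros, the map (d, m) → (d, m′) is an involution, so the corresponding permutation matrix is both unitary and Hermitian.

```python
import numpy as np

# Transposition oracle O_t for a symmetric matrix (hypothetical tridiagonal
# example): it maps the label (d, m) of each nonzero to the label (d, m') of
# its transposed element.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
assert np.allclose(A, A.T)

# label the nonzeros: d indexes the distinct value, m enumerates occurrences
vals = sorted(set(A[A != 0.0]))
labels = {}                                        # (i, j) -> (d, m)
for d, v in enumerate(vals):
    for m, (i, j) in enumerate(zip(*np.nonzero(A == v))):
        labels[(i, j)] = (d, m)

perm = {labels[(i, j)]: labels[(j, i)] for (i, j) in labels}
assert all(perm[perm[k]] == k for k in perm)       # involution: O_t^2 = I

# as a matrix over the label space, O_t is a symmetric permutation
idx = {k: n for n, k in enumerate(sorted(perm))}
O_t = np.zeros((len(idx), len(idx)))
for k, v in perm.items():
    O_t[idx[v], idx[k]] = 1.0
assert np.allclose(O_t @ O_t, np.eye(len(idx)))    # unitary
assert np.allclose(O_t, O_t.T)                     # Hermitian
```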
The preamplified scheme can also be made Hermitian by using the O_r oracle (35) constructed above. Using p = 1/2, the circuit (19) can be adapted: by removing the sign of A_d from the first multiplexed rotations in (19), the two subcircuits to be amplified can be made Hermitian conjugates of each other. The same then holds for their amplification circuits. A simple way to make any block encoding U Hermitian is the circuit from [16]: it increases the number of flag qubits by 1 and increases the gate complexity / data loading cost by a factor of two; additional cost is incurred for controlling U and U†. A Hermitian block encoding scheme in terms of certain black-box oracles is suggested in [15]. It cannot deal with negative matrix values and has log₂ N + 2 flag qubits. It also duplicates the data loading cost, in contrast to our Hermitian counterpart of the base block encoding.

Example block encodings

Checkerboard matrix
We will first consider a matrix with a checkerboard pattern, a nice example demonstrating how our block encoding scheme provides for repeated values, and the application of PREP/UNPREP. The N × N checkerboard matrix (take N a power of 2) has only D = 2 data values, sparsities S_c = S_r = S = N (it is dense), and multiplicity M = N²/2 for each of the two values. It clearly has an arithmetic structure, which can be translated into row and column oracles. The starting point is a labelling of the matrix elements by (d, m), 0 ≤ d < D, 0 ≤ m < M. We choose a labelling where m increases reading the matrix row by row from left to right; for N = 4, the labelling can be written out explicitly. For simplicity of presentation, we split m = N m_hi + (N/2) m_mid + m_lo into its high log₂(N/2) bits, its mid bit, and its low log₂(N/2) bits. The row and column indices i and j can then be computed arithmetically from (d, m); the specific values of s_c and s_r are not important for the block encoding. Oracles for these expressions are very simple; O_r can even be implemented as the identity. The resulting block encoding has data loading cost D = 2. Note that the oracles are very cheap, in fact of zero T-gate count. It turns out that this block encoding is already Hermitian. The subnormalisation is the base-scheme value S ‖A‖max = N ‖A‖max (44). The subnormalisation can be improved by going to PREP/UNPREP, since D ≤ S and the oracles commute with the multiplexed rotations. When the matrix elements have varying magnitude, the resulting subnormalisation presents an improvement over the subnormalisation (44) of the base scheme. The checkerboard matrix serves as a simple pedagogical example; because of its low rank and known eigenvalues, it is unlikely to be used in an actual quantum algorithm. Alternatively to our schemes, a block encoding could also be constructed by exploiting the factorisation of the matrix, leading to a circuit with the same ancilla count, T count, and subnormalisation as our schemes.
Our scheme can adapt to changes in the structure of the matrix. For example, to demonstrate the out-of-range oracle O_rg, consider a checkerboard matrix with the top left and bottom right entries replaced by zero. The row and column oracles are as above; the additional O_rg oracle deletes out-of-range labels and can be inserted in both the base circuit (43) and the PREP/UNPREP circuit (45).
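The structural parameters quoted above (D = 2 distinct values, dense sparsity S = N, multiplicity M = N^2/2, low rank) are easy to verify numerically. A minimal sketch, with illustrative sample values for A_0 and A_1:

```python
import numpy as np

# Build the N x N checkerboard matrix from two sample values A0, A1
# and check the structural parameters used in the text.
def checkerboard(N, A0, A1):
    i, j = np.indices((N, N))
    return np.where((i + j) % 2 == 0, A0, A1)

N = 8
A = checkerboard(N, 0.7, -0.3)

assert len(np.unique(A)) == 2                                  # D = 2 distinct values
assert all(np.count_nonzero(A[:, c]) == N for c in range(N))   # S_c = N (dense)
assert np.count_nonzero(A == 0.7) == N**2 // 2                 # M = N^2/2 multiplicity
assert np.allclose(A, A.T)                                     # symmetric, Hermitian encoding possible
assert np.linalg.matrix_rank(A) == 2                           # the low rank mentioned above
```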

3.2 Toeplitz matrix
Consider a Toeplitz matrix with D diagonals, i.e. D values, offset from the main diagonal by k: For an N × N matrix with N ≥ D it follows that S_c = S_r = S = D and M = N. This already fulfils N S_c = N S_r = M D, such that no padding is necessary, and we assume N and D to be powers of 2 for simplicity. Different choices can be made for the labelling in terms of (d, m) (distinct values and their repetitions). For the distinct values, we choose d as above. For m, we simply choose the column index. Note that, for example, (d = 0, m = 0) is out-of-range: in general, there is no A_0 in column 0. Arithmetically, the mapping to row and column indices is then: and out-of-range (d, m) pairs are those where an overflow or underflow in the calculation of i occurs, that is, when (51) holds. The oracles can be constructed with in-place additions, such that the block-encoding circuit is as follows. The last qubit is an ancilla qubit serving as the overflow bit (the next higher bit) for the modular additions. It is not a flag qubit because it is uncomputed back to |0⟩. (Note that the additions can be performed without further ancillas if only few qubits are available [35][36][37].)
If we were encoding a banded circulant matrix, the equations (51) would have to be taken mod N. There would be no out-of-range pairs (d, m), and no O_rg oracle, del qubit, or overflow ancilla qubit. The above circuit would then be a block encoding with log_2 D + 1 = log_2 S + 1 flag qubits, subnormalisation S||A||_max, and data loading cost D. That is the same as the block encoding discussed in [15]. Yet for the Toeplitz matrix, the circuit can be simplified to be more compact. The overflow qubit can be used directly as the del flag qubit; then only the modular part of the addition/subtraction must be uncomputed. The O_rg and O_r oracles can then be merged, resulting in a simpler circuit. The circuit can also be interpreted as block-encoding a circulant matrix, with the del qubit selecting the top left block: every Toeplitz matrix can be embedded in a larger circulant matrix. Such an observation was already used for HHL [38]. Further, the Fourier transform of a circulant matrix is diagonal. Conceivably, one could construct a block encoding of this diagonal matrix in the Fourier basis. That approach is equivalent to compiling the additions in the above circuit with QFT-based adders [35].
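The two observations above, embedding a banded Toeplitz matrix in a larger circulant and diagonalisation of a circulant by the Fourier transform, can be checked numerically. The sizes, offsets, and diagonal values below are illustrative:

```python
import numpy as np

# A banded upper-triangular Toeplitz matrix with D = 3 diagonals at
# offsets 0, +1, +2 (illustrative values).
N, vals = 8, [0.5, -0.25, 0.125]
T = sum(v * np.eye(N, k=k) for k, v in enumerate(vals))

# Embed it in a 2N x 2N circulant C_{ij} = c_{(i-j) mod 2N}:
# a value on diagonal +k of T goes to entry (-k) mod 2N of the first column c.
c = np.zeros(2 * N)
for k, v in enumerate(vals):
    c[(-k) % (2 * N)] = v
C = np.array([np.roll(c, s) for s in range(2 * N)]).T

assert np.allclose(C[:N, :N], T)   # top-left block reproduces the Toeplitz matrix

# The DFT of the first column gives the circulant's eigenvalues; the m-th
# Fourier mode is an eigenvector.
eigs = np.fft.fft(c)
m = 3
x = np.exp(2j * np.pi * m * np.arange(2 * N) / (2 * N))
assert np.allclose(C @ x, eigs[m] * x)
```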
The arithmetic in the oracles requires only ∝ log N Toffoli gates. The flag qubit number is 2 + log_2 D, and we have data loading cost D. The Gilyén et al. method would need more flag qubits, 3 + log_2 N.
Note that this matrix has D ≤ S, and the row and column oracles commute with the data loading oracle. Hence, application of PREP/UNPREP can reduce the subnormalisation, resulting in the circuit

3.3 Tridiagonal matrix
We consider a matrix that is tridiagonal and symmetric, where the entries along a diagonal are all different. For an N × N matrix (we take N a power of 2), we have Arithmetic expressions for the row and column indices can be written as Translating these expressions to quantum circuits results in the block encoding circuit The equal superposition prepared by H_3 must be since those are the values of s_c and s_r that the oracles map valid (d, m) pairs to. Note that in this example S > D, and the PREP/UNPREP method cannot be applied; indeed, the data loading does not commute with the oracles. The Toffoli count of the arithmetic oracles is logarithmic in N, and the addition and subtraction can be implemented without further ancilla qubits. The Toffoli cost is dominated by the data loading of D = 2N items. There are only 4 flag qubits; in the Gilyén et al. scheme, there would be 3 + log_2 N flag qubits.
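As a numerical sanity check of the parameters just stated (maximum sparsity 3, padded to S = 4; 2N − 1 distinct values, padded to D = 2N), with illustrative all-distinct entries:

```python
import numpy as np

# Symmetric tridiagonal matrix with all-distinct entries (sample values).
N = 8
diag = 1.0 + np.arange(N)              # N distinct diagonal values
off = 0.1 + np.arange(N - 1) / 10.0    # N - 1 distinct off-diagonal values
A = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)

assert np.allclose(A, A.T)
# D before padding: N + (N - 1) = 2N - 1 distinct values (padded to 2N for loading).
assert len(np.unique(A[np.nonzero(A)])) == 2 * N - 1
# Maximum row sparsity is 3 (padded to S = 4 in the circuit).
assert max(np.count_nonzero(A[r]) for r in range(N)) == 3
```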
The circuit can be simplified further by appreciating that the first CNOT in O_rg will never be triggered (it commutes with the column and row oracles and is annihilated by the H_3 of eq. (62)) and can be removed. Further, instead of the second multicontrolled NOT in O_rg, one could explicitly load a zero value for d = 2N − 1 with the multiplexed rotations. Then the block encoding needs neither a delete flag qubit nor an O_rg oracle: The above block encoding circuit is not Hermitian. We now demonstrate how to use our scheme to construct a Hermitian block encoding. The O_t oracle has the following action: and can easily be implemented as a quantum circuit: The column oracle O_c is the same as above, and the row oracle is now constructed as Putting this into the Hermitian version of the base block encoding circuit (36) gives

3.4 Extended binary tree matrix
The symmetric adjacency matrix of a balanced binary tree was considered in [15]. While the authors considered a non-Hermitian block encoding for N = 8, we will demonstrate how our scheme can be used to generate a Hermitian block encoding for N an arbitrary power of 2. For N = 8, the binary tree and its adjacency matrix are The root and leaf nodes have weight A_0, all other nodes weight A_1, and the edges weight A_2. The labelling of d is already exemplified in the above matrix. One can see that Hence, M is padded to 2N, and S is padded to 6. Moreover, in this example D and the padded S are not powers of two, such that they are embedded in 2 and 3 qubits, respectively. The labelling of m can be chosen as which generalises to the following relations: with out-of-range (d, m) those with The oracle O_t can easily be implemented by flipping one qubit because we are assuming that N is a power of 2.
When implementing the oracle O r , we must make sure to only get S r = 3 values of s r for valid (d, m).
The first X together with the two CNOTs ensures that d_lo is set to zero for all entries on the diagonal. As indicated in the circuit, that qubit is then |0⟩ for all valid inputs (d, m). It is then used as a control qubit for a rotation implementing ⌊m/2⌋ from (69); the rotation consists of swaps circularly rotating the qubits down. Finally, the CNOT and CCNOT ensure that 0 ≤ s_r < S_r = 4, i.e. s_r^hi = 0 for all valid inputs. Specifically, the enumeration by s_r within rows implemented by the circuit is The out-of-range oracle O_rg is: From these oracles, a Hermitian base and preamplified block encoding can be constructed with S_r = 4.
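The weight pattern described above (root and leaves A_0, other nodes A_1, edges A_2) is easy to illustrate classically. The sketch below builds a plain 7-node binary tree embedded in dimension N = 8; the exact "extended" tree of the text may differ, and the weights are illustrative. The point is only that such a matrix carries just D = 3 distinct values despite its many nonzero entries:

```python
import numpy as np

# Adjacency-style matrix of a 7-node binary tree (children of v are 2v+1, 2v+2),
# embedded in dimension N = 8.  Weights A0/A1/A2 are sample values.
A0, A1, A2 = 0.5, -0.5, 0.25
N = 8
A = np.zeros((N, N))
for v in range(7):
    A[v, v] = A0 if (v == 0 or v >= 3) else A1   # root/leaves get A0, others A1
    for c in (2 * v + 1, 2 * v + 2):             # tree edges carry weight A2
        if c < 7:
            A[v, c] = A[c, v] = A2

assert np.allclose(A, A.T)                       # a Hermitian block encoding is possible
assert len(np.unique(A[np.nonzero(A)])) == 3     # only D = 3 values to load
```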
One cannot use the PREP/UNPREP scheme with the above oracles, as the multiplexed rotations do not commute with O_r. This has to do with the fact that in (31), H_{S_r/D} = H_{4/3} does not exist. One can artificially increase S_r to 6 and then construct a PREP/UNPREP block encoding circuit. It would have subnormalisation

3.5 2-dimensional Laplacian
Here, we construct an example block encoding of a 2-dimensional discrete Laplacian operator on a grid. Using finite differences, the one-dimensional Laplacian operator can be expressed as When the discrete values f(x_i) are understood as the components of a vector, the Laplacian corresponds to a Toeplitz matrix with −2/(∆x)^2 on the diagonal and +1/(∆x)^2 on the first two off-diagonals. The block encoding scheme of section 3.2 can be used. Now, let us consider a two-dimensional regular grid of size N_x × N_y, with dimensions powers of 2 for simplicity as usual. The finite difference Laplacian becomes A standard encoding of the values on the grid into an N = N_x N_y-dimensional vector is row by row: Then, the Laplacian matrix A is defined by We have 3 different values, padded to D = 4. Visually, for N_x = 4, N_y = 4, the matrix looks like where there are N_y dashed rectangles of dimension N_x × N_x each. We have maximum sparsity 5 padded to S = 8, and maximum multiplicity 2N − 2 padded to M = 2N, such that N S = D M = 8N. For the labelling by m, we separate the high bit m_hi from the low log_2 N bits m_lo. We choose m_hi = 0 for the lower left triangular matrix (including the diagonal), and m_hi = 1 for the upper triangular matrix. We choose m_lo to be the row index (in the lower triangular matrix) or the column index (in the upper triangular matrix). The Hermitian block encoding scheme for symmetric matrices (section 2.4) can be applied.
The out-of-range indices (d, m) are the following: To write this in quantum circuit form, we split the m_lo register into two registers of log_2 N_y and log_2 N_x bits; this respects the structure in (77). When putting together the circuit, we can use the simplified form on the right, because certain controls are never triggered. The full Hermitian block encoding circuit (36) for these oracles is then: The subnormalisation is Since the data loading and row/column oracles commute, we can use the PREP/UNPREP scheme to improve the subnormalisation (in the symmetric p = 1/2 case) to Block encodings of a 2-dimensional Laplacian for use in quantum algorithms were already considered in [39], using an approach that constructs an approximate block encoding. Here, in contrast, the block encoding is exact (up to finite accuracy in the data loading oracle). Moreover, the block encoding constructed with our scheme requires O(log N + log N_y) gates, coming from the additions in (86), whereas the approximate block encoding in [39] appears to have an exponentially worse scaling. Our method does well because treating the matrix solely as a sparse [7] or dense matrix [10,11] does not harness the repeated elements, and it is not of the class considered in [15].
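The structural parameters above (3 distinct values, maximum sparsity 5) can be confirmed by building the 2d Laplacian via Kronecker products in the row-by-row vector ordering. Grid sizes and spacings below are illustrative; unequal spacings keep the three values distinct:

```python
import numpy as np

# 1d finite-difference Laplacian: -2/h^2 on the diagonal, +1/h^2 off-diagonal.
def lap1d(n, h):
    return (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
            + np.diag(np.ones(n - 1), -1)) / h**2

# 2d Laplacian on an Nx x Ny grid, row-by-row ordering as in the text.
Nx, Ny, hx, hy = 4, 4, 1.0, 0.5
A = np.kron(np.eye(Ny), lap1d(Nx, hx)) + np.kron(lap1d(Ny, hy), np.eye(Nx))

assert np.allclose(A, A.T)
assert len(np.unique(A[np.nonzero(A)])) == 3      # 3 values, padded to D = 4
assert max(np.count_nonzero(r) for r in A) == 5   # sparsity 5, padded to S = 8
```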

Conclusions and Outlook
In this work, we have presented a number of schemes to block encode structured matrices (base, preamplified, and PREP/UNPREP schemes, along with Hermitian extensions). Such a block encoding is necessary to use a matrix in QSVT and related quantum algorithms. All the schemes are based on a labelling of the matrix structure in terms of (d, m), where d labels distinct non-zero values and m distinguishes different elements with the same value. Arithmetic quantum circuits relating this labelling to the column and row indices constitute the core of the quantum circuits. Section 2 shows our circuit constructions based on these arithmetic oracles along with a data loading oracle.
All schemes fully incorporate the sparsity of the matrix, reflected in the flag qubit number, as well as repeated values: the data loading oracle is controlled on d, such that the data loading cost corresponds to the number of distinct values; no value is loaded twice, even if it appears in the structured matrix multiple times. The schemes differ in the subnormalisation achieved; our detailed analysis is summarised in table 1. Which scheme performs best depends on the specific matrix in question. Further, our block encodings can be adapted to be Hermitian in the case of symmetric matrices (section 2.4) without extra data loading cost or subnormalisation.
In section 3 we have provided examples showing our block encoding schemes in action for several families of structured matrices (Toeplitz, tridiagonal, extended binary tree, and 2d Laplacian matrices). Together with the information provided in the appendix on data loading, the full circuits can be elaborated. We hope that, beyond the examples considered here, our schemes will prove useful for constructing block encodings for a large variety of matrix families from different application areas.
A theoretical lower bound for the subnormalisation is the spectral norm ||A||_op, due to the requirement that the block encoding be unitary. None of the schemes achieve this for an arbitrary matrix, so improvements may still be possible. While this article focuses on real matrices, an extension to complex-valued matrices is straightforward: one could sum block encodings of a matrix's real and imaginary parts, or adapt the multiplexed rotations to yield complex amplitudes of the |0⟩ state. We have assumed structure in the pattern and position of matrix elements, but not in the values. If the matrix possesses further structure, we expect more efficient circuits can be found. For example, if the values depend arithmetically on any of the indices i, j, d, m, data loading may not be necessary. Further, the block encodings constructed are exact, up to finite accuracy in the data loading (see appendix A) and, in the preamplified scheme, the accuracy of singular value amplification (see appendix B). Possibly, approximate block encodings [39,14] could be implemented with more efficient circuits.
Next, step (2) rotates the data qubit according to the bit value saved in the ancilla register. This can be done with the phase-gradient technique [43,44,33], in which the b-bit angle register is added into a reusable phase gradient state. This addition requires b − 1 Toffolis. Alternatively, step (2) can be implemented with b rotations, each controlled on one of the qubits of |α_d⟩ in turn [33].
Finally, in step (3), the data lookup must be uncomputed to return the ancillas to |0⟩^⊗b. In some cases, the Toffoli cost of uncomputing the lookup can be reduced compared to step (1) by using measurement-based uncomputation [17].
The cost of the three-step procedure is asymptotically reduced compared to (90) because steps (1), (2), and (3) are performed in sequence, so the term related to the accuracy of the rotations is added rather than multiplied to the term D related to the data lookup. In either case, the T cost scales with D as O(D) when using QROM; the number of data items to load is a sensible stand-in for circuit length (T count) while staying agnostic about the exact procedure.
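The rotation-accuracy term can be illustrated classically: storing the angles α_d = arccos(A_d/||A||_max) in a b-bit register rounds them to multiples of 2π/2^b, which perturbs each loaded amplitude cos(α_d) by at most half an angle step. The data values below are illustrative:

```python
import numpy as np

# Sample data values and their loading angles alpha_d = arccos(A_d / ||A||_max).
A_vals = np.array([0.7, -0.3, 0.5, -0.9])
alpha = np.arccos(A_vals / np.max(np.abs(A_vals)))

b = 10                                   # bits of angle precision
step = 2 * np.pi / 2**b
alpha_b = np.round(alpha / step) * step  # b-bit fixed-point angles

# |cos'| <= 1, so the amplitude error is bounded by half an angle step.
err = np.max(np.abs(np.cos(alpha_b) - np.cos(alpha)))
assert err <= step / 2
```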

A.2 State preparation
In the PREP/UNPREP scheme (section 2.3), the data loading oracle O_data and the diffusion operator H_S are merged into a state preparation operator, reducing the subnormalisation. The data loading is not performed with multiplexed rotations O_data on a data flag qubit as in the base and preamplified schemes. Instead, the data is loaded as the amplitudes of a state prepared by a PREP operator.
An arbitrary quantum state |ψ⟩ = PREP |0⟩^{⊗ log_2 D} can be prepared by a sequence of multiplexed rotations whose angles α have been precomputed from ψ's amplitudes [41,33], as in this example with D = 8: These multiplexed rotations can be implemented as in appendix A.1. Alternative approaches to preparing |ψ⟩ include coherent alias sampling [17] and prerotation [10].
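A minimal classical sketch of the angle precomputation, using a Möttönen-style binary tree of partial norms and restricted to real, non-negative amplitudes (signs would be handled separately by a sign oracle), with a check by direct statevector simulation:

```python
import numpy as np

def prep_angles(psi):
    """Per-level multiplexed-Ry angles for a real, non-negative state of length 2**n."""
    n = int(np.log2(len(psi)))
    probs = np.abs(psi) ** 2
    levels = []
    for _ in range(n):
        pairs = probs.reshape(-1, 2)
        parents = pairs.sum(axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(parents > 0, pairs[:, 0] / parents, 1.0)
        # cos(theta/2)^2 = p_left / p_parent
        levels.append(2 * np.arccos(np.sqrt(np.clip(ratio, 0.0, 1.0))))
        probs = parents
    return levels[::-1]          # top of the tree first

def simulate(levels):
    """Apply one multiplexed Ry per level, doubling the register each time."""
    state = np.array([1.0])
    for theta in levels:
        new = np.empty(2 * len(state))
        for k, a in enumerate(state):
            new[2 * k] = a * np.cos(theta[k] / 2)
            new[2 * k + 1] = a * np.sin(theta[k] / 2)
        state = new
    return state

psi = np.sqrt(np.array([0.1, 0.2, 0.3, 0.4]))   # D = 4 example amplitudes
assert np.allclose(simulate(prep_angles(psi)), psi)
```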

A.3 Data loading oracle in Gilyén et al.'s sparse scheme
Lemma 48 in [7] shows a construction of a block encoding in terms of black-box oracles. One of the black boxes is the data loading oracle, called O_A. Given the row and column indices i and j, it loads the b-bit bitstring A_ij of the corresponding matrix entry: The implementation of this oracle is not discussed. From a first cursory look at the oracle, one might conclude that repeated entries must be loaded separately. However, given a structured matrix as considered in this paper, it can be implemented with only D data loading and the usual O(polylog N) arithmetic overheads: the arithmetic oracle O_rc computes (d, m) into ancilla qubits from (i, j). Possibly, some of the arithmetic could be done in-place, and the i and j registers reused for d and m.

Figure 1: Singular value amplification. The target function for an amplification by γ is γζ (black line), and 0 outside of the validity region. For QSVT, it has to be approximated by a polynomial (blue) of sufficient accuracy. The region of accuracy (grey shaded) is determined by the parameters δ and ε. Asymptotically, the degree was shown to be O(γ/δ log(γ/ε)) [45]. We truncate a Chebyshev expansion of an analytic approximation to the target function to find the polynomial's degree.
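The reduction of O_A to only D stored values via (i, j) → (d, m) arithmetic can be sketched classically for a Toeplitz matrix, where d depends only on j − i. The offset k and data values below are illustrative:

```python
import numpy as np

# D distinct diagonal values at offsets -k ... D-1-k (illustrative).
k = 1
data = [0.3, 0.5, -0.2, 0.1]
D, N = len(data), 8

def O_A(i, j):
    """Classical stand-in for the lookup (i, j) -> A_ij via d = j - i + k."""
    d = j - i + k
    return data[d] if 0 <= d < D else 0.0   # out-of-range label -> zero entry

A = np.array([[O_A(i, j) for j in range(N)] for i in range(N)])
assert len(np.unique(A[np.nonzero(A)])) == D            # only D values ever stored
assert np.count_nonzero(np.diag(A, -1) - data[0]) == 0  # offset -k diagonal is data[0]
```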

B Singular value amplification
Singular value amplification [7,45] allows one to reduce the subnormalisation of a block encoding by increasing the circuit length. This is achieved by performing a QSVT that multiplies the singular values by an amplification factor γ. While the resulting block encoding (with one more flag qubit) has a subnormalisation reduced by a factor of γ, the circuit length increases by roughly a factor of γ, up to logarithmic factors and constants. This reciprocal behaviour substantiates the figure of merit

circuit cost · subnormalisation, (95)

of which eq. (4) is a proxy, used to assess a block encoding. As a crucial ingredient of the preamplified scheme (section 2.2), the complexity of singular value amplification, including constant factors, determines which block encoding scheme performs best for specific data. While previous work has determined its big-O scaling [7,45], this appendix sheds light on the constant factors. Note that singular value amplification can be applied to general projected matrices beyond quadratic A flagged by |0⟩s; as such, it encompasses uniform amplitude amplification. Figure 1 shows an example of a target function for amplification, along with a polynomial approximation for the QSVT. The parameter ε controls the relative accuracy of the singular value amplification, while δ controls the applicable range. The requirement ζ_i ≤ 1/γ is natural because the singular values ζ̃_i ≈ γζ_i of the block encoding Ũ must be bounded by one. To ensure the polynomial approximation can be bounded by one across the entire range [−1, +1] (a requirement for QSVT), the permissible range of singular values of A must be lowered slightly by δ. In principle, a trade-off between γ and δ is possible. For preamplification, Gilyén et al.
[7] choose an amplification factor of γ = 1 The starting point for the polynomial P(ζ) is a sufficiently good analytic approximation of the rectangle function on the domain [−(1 − δ)/γ, (1 − δ)/γ], based on error functions; see [7,45]. Its expansion in Chebyshev polynomials is truncated such that an absolute accuracy of ε is achieved. The product with γζ then gives the desired polynomial approximation. Yet the authors of [45] "emphasize that our proposed sequence of polynomial transformations serve primarily to prove their asymptotic scaling." They suggest obtaining the constant factors by a direct Chebyshev truncation of the entire function.
Here, we therefore numerically perform Chebyshev truncations of the entire approximation to the rectangle function. The optimal degree satisfying the required accuracy is found with a binary search, for various parameter combinations of the amplification factor γ, the accuracy ε, and δ. Fig. 2 shows our results along with a fit to the big-O result (96). This study confirms the scaling behaviour (96) and determines a surprisingly small scaling factor of ≈ 3. We thus find that singular value amplification requires approximately 3 (γ/δ) log(γ/ε) repetitions of the block encoding, where log is the natural logarithm.
Note that this analysis follows the prescription from [45], where the polynomial is constructed from an approximate rectangle function multiplied by γζ. Actually, the behaviour of the polynomial outside of the accuracy region does not matter, as long as it stays bounded by ±1. In particular, in Fig. 1 it does not need to decay to zero rapidly around ζ = ±(1 − δ)/γ. Perhaps a lower-degree polynomial exists that does not follow the analytic construction based on the rectangle function. Improved degrees over the analytic construction have already been achieved for other target functions (such as matrix inversion) in [46]; the authors used a numerical optimisation approach based on the Remez method to directly find Chebyshev approximations.

Figure 2: Degree of the polynomial for singular value amplification. Asymptotically, the degree was shown to be O(γ/δ log(γ/ε)) [45]. We truncate a Chebyshev expansion of an analytic approximation to the target function to find the polynomials' exact degrees. By varying the amplification factor γ, the accuracy ε, and δ, we aim to find the constant factor in the asymptotic complexity. From the fits shown in the plots, we conclude the degree is approximately 3 γ/δ log(γ/ε). Log refers to the natural logarithm.
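A single parameter point of this numerical experiment can be reproduced as follows. The erf smoothing width is a heuristic choice, and the binary search assumes the truncation error decreases with the degree; both are sketch-level assumptions, not the paper's exact procedure:

```python
import math
import numpy as np
from numpy.polynomial import chebyshev as Ch

# One parameter point: amplification gamma, range parameter delta, accuracy eps.
gamma, delta, eps = 4.0, 0.25, 1e-3
a = (1 - delta / 2) / gamma               # centre of the transition region
w = (delta / 2) / gamma                   # transition half-width
kk = math.sqrt(math.log(4 / eps)) / w     # erf steepness (heuristic choice)
erf = np.vectorize(math.erf)

def target(x):
    # gamma * x times an erf-based approximation of the rectangle function.
    rect = 0.5 * (erf(kk * (x + a)) - erf(kk * (x - a)))
    return gamma * x * rect

# Error of the degree-deg Chebyshev interpolant against gamma * x on the validity region.
grid = np.linspace(-(1 - delta) / gamma, (1 - delta) / gamma, 2001)
def err(deg):
    c = Ch.chebinterpolate(target, deg)
    return np.max(np.abs(Ch.chebval(grid, c) - gamma * grid))

lo, hi = 1, 4096
while lo < hi:                            # binary search for the minimal degree
    mid = (lo + hi) // 2
    hi, lo = (mid, lo) if err(mid) <= eps else (hi, mid + 1)
print("degree:", lo)                      # compare with ~ 3*(gamma/delta)*log(gamma/eps)
```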

Reminiscent of the usual circular notation for controls, the circle/oval serves as the control for the O_rg-controlled NOT; the d and m registers are not modified. The out-of-range oracle can be implemented with quantum arithmetic, possibly using ancilla qubits, and has a low O(polylog N) T-gate count. As for the O_c and O_r oracles, the arithmetic expressions underlying O_rg follow from the structure of the matrix and the chosen labelling in terms of (d, m). The oracles O_c and O_data may map invalid (d, m) to any values; they just need to be unitary. For valid (d, m), the specific values of the corresponding s_r and s_c are irrelevant, as long as they are within the ranges 0 ≤ s_r < S_r and 0 ≤ s_c < S_c. Example constructions are shown in section 3.
(d, m). These are (d = 0, m = 0) and (d = 0, m = N^2/2 − 1) for the top-left and bottom-right entries. A circuit implementation of the oracle is

Since SN = 3N ≠ DM = 4N − 4, we must pad S and/or D. The equality can be fulfilled by padding to D = 2N and S = 4.
(59) when splitting d = 2d_hi + d_lo into its high log_2 N bits and lowest bit. Out-of-range pairs (d, m) are those with (d_lo = 0 and m = 1) or d = 2N − 1.

requires a transposition oracle O_t(d, m) = (d, m′) that gives the corresponding label of the transposed element. We have O_t(d, m_hi, m_lo) = (d, m_hi, m_lo) for d = 0 and (d, 1 − m_hi, m_lo) for d = 1 or d = 2, (80) which in quantum circuit form is (81). Next, we need a column oracle O_c : (d, m) → j that gives the column index, with O_c(d, m_hi, m_lo) = m_lo − N_x for m_hi = 0 and d = The |s⟩ states required for H_S are H_S |0⟩ = (1/√5)(|000⟩ + |010⟩ + |011⟩ + |100⟩ + |101⟩) (reading s_hi to s_lo left to right).
and makes it easy to implement the conditions (m_lo mod N_x) = 0 and m_lo < N_x. (Splitting j into those two registers also simplifies the −N_x in O_c, which is just a −1 on the log_2 N_y-bit register.) The out-of-range oracle is

Table 1: Comparison of block-encoding schemes for a matrix A. N: dimension of the matrix. D: number of distinct data values A_d. M: maximum multiplicity of each value. S_c, S_r: maximum column and row sparsities (number of nonzero elements per column/row). The maximum sparsity S = max(S_c, S_r) or the multiplicity M is padded to ensure SN = DM, if necessary. In the comparison we have added further factors to Gilyén et al.'s results to remove the |A_ij| ≤ 1 assumption. Block encoding from a quantum data structure is also discussed in [10]. Lower is better for data loading cost, subnormalisation, and flag qubit number. D data loading incurs an O(D) T-gate count, or O(√D) if a large number O(√D) of possibly dirty ancillas is available (see appendix A).
A multiplexed rotation oracle O_data = Σ_d R_y(2 arccos(A_d/||A||_max)) ⊗ |d⟩⟨d| (7) encoding the values of the D data items is needed. The factor ||A||_max = max_d |A_d| ensures all values are within range of the arccosine. Provided the first qubit is initialised and postselected as |0⟩, the effect of O_data is to load the correct values: The correct sign is obtained with the gate denoted sgn A_d. It can be seen as a new data loading oracle loading and applying the sign, and can be implemented with multicontrolled Z gates flipping the signs for those |d⟩ with sgn A_d = −1. All circuit elements in the middle are Hermitian and commute. Compared to the preamplification scheme, this Hermitian counterpart has the same number of flag qubits and the same subnormalisation, but data loading increased by D due to the separate sign oracle. The PREP/UNPREP scheme is already Hermitian as-is for p = 1/2, provided O_r = O_c O_t is used for the row oracle and matching PREP and UNPREP operators are used. While equations (29) and (30) only specify the operators' actions on |0⟩, the identity PREP = O_sgn UNPREP (38) must hold as an operator identity to make the full circuit Hermitian. The oracle O_sgn flips the signs to match sgn(A_d). Because O_sgn is Hermitian, the full PREP/UNPREP block encoding (31) is Hermitian. In practice, O_sgn is integrated into PREP, and the Hermitian counterpart of the PREP/UNPREP scheme has the same costs.