Efficient algorithms for quantum information bottleneck

The ability to extract relevant information is critical to learning. One ingenious approach of this kind is the information bottleneck, an optimisation problem whose solution corresponds to a faithful and memory-efficient representation of the relevant information in a large system. The advent of the age of quantum computing calls for efficient methods that work on information regarding quantum systems. Here we address this need by proposing a new and general algorithm for the quantum generalisation of the information bottleneck. Our algorithm excels in the speed and the definiteness of convergence compared with prior results. It also works for a much broader range of problems, including the quantum extension of the deterministic information bottleneck, an important variant of the original information bottleneck problem. Notably, we discover that a quantum system can achieve strictly better performance than a classical system of the same size with respect to the quantum information bottleneck, shedding new light on the advantage of quantum machine learning.


Introduction
Learning is a task of eminent importance to the contemporary world. As such, it has always been a top priority to seek powerful tools for learning information. The information bottleneck [32] stands as an excellent example, with many useful applications including deep learning [8,28,33], video processing [16], clustering [29] and polar coding [30]. Concretely, the information bottleneck is a method to extract from a system X a piece of information T that is relevant to a system Y, and is formulated as the minimization of the difference I(T : X) − βI(T : Y) with a positive parameter β, where I(T : X) is the mutual information between T and X. In particular, we are interested in the case when X is classical. By design, the information bottleneck achieves an irreversible compression, extracting the essential information about Y and simultaneously removing the unessential information contained in X.

Figure 1: Visualization of quantum information bottleneck. In a prototypical setting of quantum information bottleneck, the task is to compress a classical system X into a smaller system T, which can be either classical or quantum, by extracting its useful information about a quantum system Y and removing the useless information. It is expected that the relevant information about Y, instead of the entire X, can be recovered from T.
As we step into the age of quantum information, the demand is growing for methods that efficiently learn information on quantum systems. For this purpose, let us consider the setup of quantum information bottleneck (QIB), illustrated in Fig. 1. Similar to its classical counterpart, the aim of QIB is to compress X into a smaller system T while preserving the correlation with Y when some of these systems are quantum. Prior to this work, QIB has been discussed in several recent works [2,6,9,14,24] and has been applied to quantum information theory [6,14] and quantum machine learning [2]. On the other hand, fundamental properties of QIB such as convergence have not been analysed, which hinders its application in more practical tasks. Quantum information bottleneck was first proposed as a quantum extension of the information bottleneck method in [9]. That reference also derived a necessary condition for the solution of the minimization problem (see [9, Appendix A]) by using the Lagrange multiplier method in the same way as [1,4]. Using the obtained condition, it proposed an iterative algorithm to find a solution satisfying the necessary condition [9, Appendix C]. Then, the reference [24] considered QIB in the quantum communication scenario.¹ However, no study discusses the behaviour of the iterative algorithm, i.e., it is not known whether the algorithm monotonically reduces the objective function [9,24,31,32]. It was also claimed in [9, Appendix B] that there is no advantage in using a quantum T if X, Y are both classical.
In this work, we conduct a systematic study of quantum information bottleneck, focusing on the case when the system X is classical. Compared to existing works [2,6,9,14,24], our work makes significant contributions in several directions: First, we provide thorough analyses of two critical properties of QIB, efficiency and convergence. Motivated by a recent generalization [22] of the Arimoto-Blahut algorithm [1,4], we introduce a new quantum information bottleneck algorithm with an acceleration parameter γ that, when chosen properly, makes the value of QIB converge much faster than before. We prove rigorous criteria for our algorithm to converge and to achieve a minimum. In particular, we prove that the choice of β plays an important role in convergence.
Second, in contrast to the claim in Refs. [9,24], we provide concrete examples where using a quantum instead of classical T could reduce the minimal value of QIB. Notably, our result justifies a genuine quantum advantage in quantum machine learning [3,27,34], where the employment of quantum circuits has been prevalent [5,11,17,20,25,26] but the quantum advantage was rarely justified.
Last but not least, we generalise QIB by considering a general target function (1 − α)H(T) + αI(T : X) − βI(T : Y) with parameters α, β ≥ 0, which reduces to the standard QIB when α = 1. By doing so, the generalised QIB contains QDIB, i.e., the quantum version of the deterministic information bottleneck [31], obtained by setting α = 0. We show that our analyses and our algorithm carry over to this generalised setting and, in particular, to QDIB. Then, we clarify that QDIB can be used to find a good approximate sufficient statistic T of X with respect to Y, which requires a smaller entropy H(T) and a larger mutual information I(T : Y). We justify our finding via a numerical example, where QDIB extracts a good approximate sufficient statistic of the information about a quantum ensemble.
In summary, our work addresses several critical issues of QIB, including convergence, efficiency, choice of parameters, and the quantum advantage. We also extend QIB to a generalised setting and introduce the notion of QDIB. Our results consist of both rigorous analytical analyses and numerical experiments that justify the importance of QIB and QDIB in fundamental tasks of learning.

¹ The reference [24, Appendix A] derived a necessary condition for the solution of the minimization problem by using the Lagrange multiplier method in the same way as [1,4]. Using the obtained condition, it also proposed an iterative algorithm to find a solution satisfying the necessary condition [24, end of Appendix C].
The remaining part of this paper is organized as follows. Section 2 introduces our algorithm for quantum information bottleneck, and discusses its convergence and its dependence on the parameter β. Section 3 discusses our algorithm when the memory system T is classical. Section 4 presents examples that realize a smaller value of the target function with a quantum memory T than with a classical memory T. Section 5 discusses an application of our QIB algorithm to data classification. Section 6 proposes our algorithm for quantum deterministic information bottleneck, and studies its properties. Section 7 applies it to the extraction of approximate sufficient statistics, and numerically verifies its efficiency in an example. Section 8 concludes with a discussion.

Problem definition
Consider a classical-quantum joint system composed of X and Y with the joint state

ρ_XY := Σ_x P_X(x)|x⟩⟨x| ⊗ ρ_{Y|x},   (1)

where X is a classical system and Y is a quantum system. Our quantum information bottleneck (QIB) problem aims at constructing an information processor, modelled by a c-q channel σ_{T|X} from X to T (which prepares a quantum state σ_{T|x} when the classical register is x), that extracts efficient information from X with respect to the quantum system Y. After the action of the information processor, the joint state becomes

ρ_{XYT} := Σ_x P_X(x)|x⟩⟨x| ⊗ ρ_{Y|x} ⊗ σ_{T|x}.   (2)

To this aim, the QIB problem concerns constructing a classical-quantum channel σ_{T|X} : X → T that minimizes the information bottleneck function, consisting of entropic quantities defined with respect to the joint state ρ_{XYT}:

f_α(σ_{T|X}) := (1 − α)H(T) + αI(T : X) − βI(T : Y) = H(T) − αH(T|X) − βI(T : Y),   (3)

where H(T) denotes the entropy of T, H(T|X) denotes the conditional entropy of T given X, and I(T : Y) stands for the mutual information between T and Y; the second expression follows because I(T : X) = H(T) − H(T|X) for classical X. That is, our aim is the calculation of the following value:

min_{σ_{T|X}} f_α(σ_{T|X}).   (4)

In the information bottleneck function (3), α and β are nonnegative real parameters modelling the objective of the task.
In the original proposal of the information bottleneck [32], α = 1. Another common choice is α = 0, in which case the task is called deterministic QIB (whose classical counterpart was discussed in Ref. [31]). The parameter β controls the tradeoff between faithfulness and compression. For instance, in the deterministic information bottleneck, a larger β makes I(T : Y) more prominent in the objective function, forcing the information processor to preserve more information about Y, whereas a smaller β emphasizes the role of I(T : X), prompting the information processor to compress X more.
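To make the objective concrete, the following minimal sketch evaluates f_α = (1 − α)H(T) + αI(T : X) − βI(T : Y) for a classical channel σ_{T|X}. The toy distributions below are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(p_joint):
    """I(A:B) for a joint distribution given as a 2-D array."""
    return (entropy(p_joint.sum(axis=1)) + entropy(p_joint.sum(axis=0))
            - entropy(p_joint.ravel()))

def ib_objective(p_x, p_y_given_x, q_t_given_x, alpha, beta):
    """f_alpha = (1 - alpha) H(T) + alpha I(T:X) - beta I(T:Y)."""
    p_tx = q_t_given_x * p_x[None, :]          # joint P(t, x), shape (|T|, |X|)
    p_ty = p_tx @ p_y_given_x.T                # joint P(t, y) = sum_x P(t,x) P(y|x)
    return ((1 - alpha) * entropy(p_tx.sum(axis=1))
            + alpha * mutual_information(p_tx)
            - beta * mutual_information(p_ty))

# Hypothetical toy problem: 4 inputs, binary label, binary bottleneck T.
p_x = np.full(4, 0.25)
p_y_given_x = np.array([[0.9, 0.8, 0.2, 0.1],
                        [0.1, 0.2, 0.8, 0.9]])  # P(y|x), rows indexed by y
# A channel that groups {0, 1} against {2, 3}, keeping the label-relevant structure.
q_t_given_x = np.array([[1.0, 1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 1.0]])
print(ib_objective(p_x, p_y_given_x, q_t_given_x, alpha=1.0, beta=10.0))  # about -2.011
```

For α = 1 and β = 10, the grouping channel attains a negative value, whereas a channel that ignores X (e.g., the uniform channel) gives 0, so the grouping channel is preferred by the objective.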
Although this section addresses the case with quantum systems Y and T, the case with a classical Y and a quantum T is contained as a special case, obtained by considering diagonal densities ρ_{Y|x}. On the other hand, the case with a classical system T is a different problem from the case with a quantum system T, because we need to discuss a different minimization problem, with a different range for the minimizing variable. Fortunately, our algorithm for a quantum system T, presented in the next subsection, can be applied to the case with a classical system T; Section 3 discusses this case. We remark that the case where both T and Y are classical has been widely studied in classical information theory and machine learning; see, e.g., Refs. [28,31-33].

QIB algorithm for α = 1
The paper [9] discussed this problem when X, Y, T are quantum systems and α = 1, extending the classical information bottleneck [32] to the quantum regime. It derived a necessary condition for σ_{T|X} to achieve the minimum (4). With quantum systems T, Y and a classical system X, the necessary condition is written as (5), where C_x is a normalizing constant. Since this condition is self-consistent, the paper [9] used it to propose an iterative algorithm with the update rule (9).

2.3 The acceleration parameter γ

Next, we propose an extension of the iterative algorithm in [9]. First, we introduce a new parameter γ > 0 and rewrite the condition (5) in the form (10). Using (10), we can derive another iterative algorithm, (12). In this way, we can easily generalize the iterative algorithm (9) of [9]. However, it is not trivial to find a suitable value of γ, which, as we show later, is critical to the efficiency of our iterative algorithm. Although many papers [9,24,31,32] discussed the iterative algorithm given by (9), including the classical case, no preceding study showed the convergence of the iterative algorithm (9). In addition, the discussion above is restricted to α = 1 and does not include the case of the deterministic information bottleneck (α = 0). Therefore, to make an efficient algorithm, we need to discuss the choice of the parameter γ for generic α.

QIB algorithm with general α and convergence
To analyze the convergence of the algorithm (12), we introduce a two-input-variable function based on the idea of Ref. [22], which was obtained as a generalization of the Arimoto-Blahut algorithm [1,4]. The idea is that, instead of directly solving the minimization of f_α(σ_{T|X}), which is often too difficult, we find a continuous function J(σ_{T|X}, σ'_{T|X}) of two variables σ_{T|X}, σ'_{T|X}. Then we can update these two input variables alternately to decrease J(σ_{T|X}, σ'_{T|X}). Finally, if the function satisfies the condition (13), the minimum of J(σ_{T|X}, σ'_{T|X}) will be close to the minimum of the IB function. A function of the above type can be constructed if we find an operator F_α[σ_{T|X}](x) satisfying (14). In this paper, we employ the function defined in (15); then, the condition (14) is satisfied.
Using this function, we can define a function J_0(σ_{T|X}, σ'_{T|X}), which satisfies the condition (13). However, it is difficult to optimize the two input variables alternately in the function J_0(σ_{T|X}, σ'_{T|X}). Instead, for γ > 0, we introduce the function J_{γ,α}(σ_{T|X}, σ'_{T|X}) given in (17), where D(σ_{T|X}‖σ'_{T|X}) := Σ_x P_X(x)D(σ_{T|x}‖σ'_{T|x}) and D(σ_{T|x}‖σ'_{T|x}) denotes the relative entropy. Next, we need to specify the rules for alternately updating σ_{T|X}, σ'_{T|X}. Crucially, we need to ensure that J_{γ,α}(σ_{T|X}, σ'_{T|X}) is non-increasing under the updating rules. For this purpose, we first introduce the condition (A1). In fact, the condition (A1) can be rewritten as γ ≥ γ(σ_{T|X}, σ'_{T|X}) by defining γ(σ_{T|X}, σ'_{T|X}) as in (19). This quantity can be evaluated as in (20). To state our updating rules, we define the operator σ̂_{γ,α,T|x}[σ_{T|X}] by (23); in particular, when γ = α, this operator takes a simpler form.

Theorem 1 Under the condition (A1), we have (27) and (28).

Proof of Theorem 1: The condition (A1) yields an inequality from which we obtain (27). Also, we have the chain of (in)equalities (30), where (a), (b), and (c) follow from (17), (23), and (25), respectively. Finally, from Eq. (30) we see that (28) holds: the first term of (30) is non-negative (with equality achieved when σ_{T|X} = σ̂_{γ,α,T|x}[σ'_{T|X}]) and the second term is independent of σ_{T|X}. Hence, we obtain (28).
Hence, the monotonicity of the information bottleneck under the updating rules is also guaranteed, as long as γ is sufficiently large.
Finally, we propose the following algorithm with a fixed γ and general α.

Algorithm 1 Quantum information bottleneck (QIB) algorithm
Update the channel according to the rules [Eqs. (23) and (25)]; set n as n + 1.
6: until convergence.
7: Output: A c-q channel σ^(n+1)_{T|X}.

As mentioned, when γ satisfies the condition (A1) in all iteration steps, i.e., when γ is sufficiently large, Theorem 1 guarantees the monotonicity of the information bottleneck function along the iterations. Since f_α consists of bounded entropic quantities (assuming the systems are finite-dimensional), it is bounded. Therefore, the sequence {f_α(σ^(n)_{T|X})} in our algorithm converges. In addition, we can show that the sequence of c-q channels {σ^(n)_{T|X}} converges as well.
Using (30), we have (33); thus, we have (35). Since, due to (33) and (35), the sequence {σ^(n)_{T|X}} is a Cauchy sequence, it converges.
We remark that the convergence criterion in Algorithm 1 can be chosen freely.
In Algorithm 1, γ is fixed to a large enough value. Intuitively (see the next paragraph for a more detailed discussion), 1/γ acts as an acceleration parameter: choosing a smaller γ makes the algorithm converge faster.
To begin with, we show the role of γ in the convergence of the algorithm. Denote by σ*_{T|X} the convergence point of {σ^(n)_{T|X}}. The performance of our algorithm can be characterized by the decreasing speed of the average divergence between σ*_{T|X} and σ^(n)_{T|X}, which is evaluated via a chain of inequalities in which (a), (b), and (c) follow from the combination of (23) and (25), from (30), and from (27), respectively. The above discussion manifests that, as long as F_α[σ^(n)_{T|X}](x) > 0, making γ smaller makes the average divergence between σ*_{T|X} and σ^(n)_{T|X} decrease faster. On the other hand, making γ too small leads to a risk of violating the condition (18) (and, consequently, breaking the monotonicity of J_{γ,α}).
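As an illustration of the role of γ, here is a sketch of the accelerated iteration in its classical reduction (classical Y and T, α = 1), where the update of Ref. [32] reads q(t|x) ∝ q(t) exp(−βD(p(y|x)‖q(y|t))); the γ-accelerated variant mixes the logarithm of this update (weight 1/γ) with the logarithm of the current channel (weight 1 − 1/γ), so that γ = 1 recovers the standard rule. The quantum operator F_α is not reproduced here, and the toy distributions are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p||q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def accelerated_ib_step(q_t_given_x, p_x, p_y_given_x, beta, gamma):
    """One gamma-accelerated IB iteration (classical reduction, alpha = 1)."""
    q = np.maximum(q_t_given_x, 1e-12)             # floor avoids log(0)
    q = q / q.sum(axis=0, keepdims=True)
    n_t, n_x = q.shape
    p_t = q @ p_x                                   # marginal q(t)
    p_ty = (q * p_x[None, :]) @ p_y_given_x.T       # joint q(t, y)
    q_y_given_t = p_ty / p_ty.sum(axis=1, keepdims=True)
    log_new = np.empty_like(q)
    for x in range(n_x):
        for t in range(n_t):
            standard = np.log(p_t[t]) - beta * kl(p_y_given_x[:, x], q_y_given_t[t])
            log_new[t, x] = standard / gamma + (1.0 - 1.0 / gamma) * np.log(q[t, x])
    q_new = np.exp(log_new - log_new.max(axis=0, keepdims=True))
    return q_new / q_new.sum(axis=0, keepdims=True)

# Hypothetical toy problem: 4 inputs, binary Y, binary bottleneck T.
rng = np.random.default_rng(0)
p_x = np.full(4, 0.25)
p_y_given_x = np.array([[0.9, 0.8, 0.2, 0.1],
                        [0.1, 0.2, 0.8, 0.9]])
q = rng.random((2, 4))
q /= q.sum(axis=0, keepdims=True)
for _ in range(50):
    q = accelerated_ib_step(q, p_x, p_y_given_x, beta=10.0, gamma=0.8)
```

With γ = 1 each step is the standard update and the objective decreases monotonically; with γ < 1 the step extrapolates beyond it, mirroring the acceleration discussed above, at the risk of instability when γ is too small.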

Remark 1
The reference [22, Section III] considered a general setting. If σ_{T|X} were a single density matrix, our method could be considered a special case of their setting. However, since σ_{T|X} is a classical-quantum channel in our case, our analysis is not a special case of their setting.

Remark 2
The references [9, Appendix A] and [24, Appendix A] considered the case when the systems X, Y, T are quantum and α = 1. They derived a necessary condition for the solution of the minimization problem by using the Lagrange multiplier method in the same way as [1,4]. Using the obtained condition, they also proposed an iterative algorithm to find a solution satisfying the necessary condition [9, Appendix C] [24, Appendix C]. Their necessary condition appears to be the same as (31) with γ = α = 1. However, they did not discuss the convergence of their algorithm to a local minimizer.

Numerics on the effects of different γ
To see the effect of different γ, let us take a look at a concrete example. Consider a single-qubit quantum system Y and a classical register X of size 2^8. We assume that P_X is the uniform distribution over X = {0, ..., 2^8 − 1}, and the density ρ_{Y|x} is given by Eq. (37), where σ_x = ((0, 1), (1, 0)) is the Pauli-X matrix. The parameters θ_x and λ_x are randomly chosen.
Then, the ensemble we consider admits the joint density matrix ρ_XY = (1/2^8) Σ_x |x⟩⟨x| ⊗ ρ_{Y|x} (38), with ρ_{Y|x} = ρ(θ_x, λ_x) given by Eq. (37). Now, we apply our QIB algorithm (i.e., Algorithm 1) to the ensemble (38). We consider a classical T whose size is the square root of |X| (i.e., |T| = 2^4). We set α = 1 and β = 10. Our focus is the effect of different choices of the acceleration parameter γ. As shown in Fig. 2, the choice of γ is crucial for the performance, more specifically, the efficiency and the convergence, of the QIB algorithm.
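Since Eq. (37) is not reproduced in this excerpt, the following sketch only generates an ensemble of the same flavour: qubit states with spectrum {λ_x, 1 − λ_x} conjugated by a rotation generated by the Pauli-X matrix. The exact parametrisation is an assumption for illustration, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_x = np.array([[0., 1.], [1., 0.]])            # Pauli-X matrix

def rotated_state(theta, lam):
    """Qubit density matrix with spectrum {lam, 1 - lam}, conjugated by
    exp(-i * theta * sigma_x / 2).  One natural parametrisation consistent
    with the text; the exact form of Eq. (37) is not reproduced here."""
    u = np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * sigma_x
    return u @ np.diag([lam, 1.0 - lam]) @ u.conj().T

# One state per classical register value x in {0, ..., 2**8 - 1}.
ensemble = [rotated_state(rng.uniform(0, np.pi), rng.uniform(0, 1))
            for _ in range(2**8)]
```

Each element is a valid density matrix (Hermitian, unit trace, positive semidefinite), so the list can play the role of (ρ_{Y|x})_x in a classical simulation of Algorithm 1.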
Two interesting phenomena are manifested by our numerics. For one thing, choosing a smaller γ accelerates convergence: as shown in Fig. 2, by choosing a suitably smaller value of γ (e.g., 0.8 or 0.5), our QIB algorithm converges faster than the existing QIB algorithm [9,24], which corresponds to Algorithm 1 with γ = 1. For another, choosing too small a γ ruins the convergence of the QIB algorithm: for instance, when γ = 0.4, f_α jumps up after a few iterations and ends up at a much larger value than its initial value. In conclusion, the numerics justify our theoretical analysis (see Section 2.4) on the importance of choosing a suitable γ. We emphasize that our contribution in this direction is twofold: 1. We proposed a method of accelerating the QIB algorithm, making it converge within fewer rounds of iteration, by introducing a new parameter γ and setting it to be smaller than one.
2. We showed that the QIB algorithm cannot achieve the desired minimal value of f α if γ is too small.

Choice of β
The output of our QIB algorithm depends not only on ρ_XY [cf. (1)] but also on the choice of α and β. Intuitively, a larger β improves the faithfulness (as it makes I(Y : T) more significant in f_α), while a smaller β leads to more compression (as it makes I(X : T) more significant in f_α). Somewhat surprisingly, the choice of β is not completely free: in the following, we show that the QIB algorithm yields a trivial σ_{T|X} if β is too small. To consider the relation between the choice of β and the resultant information in T, we introduce the condition (A2) for a subset S ⊂ S_{X→T}, where S_{X→T} is the set of all c-q channels from X to T, i.e., the set {σ_{T|X} = (σ_{T|x})_{x∈X}}. The condition (A2) is unitarily invariant, i.e., the pair (σ_{T|X}, σ'_{T|X}) satisfies the condition (A2) if and only if the pair (Uσ_{T|X}U†, Uσ'_{T|X}U†) satisfies the condition (A2) for any unitary U on T. Hence, we choose S as a unitarily invariant subset. If σ^M_{T|x} is the maximally mixed state for every x, then T is uncorrelated with Y and does not contain any meaningful information. In other words, when the assumption of Theorem 4 holds, the solution of the QIB problem is not useful. Hence, we need to choose the parameters α, β such that the condition (A2) does not hold. Now we discuss how to avoid the condition (A2). When the LHS of (A2) is evaluated, the coefficient of β therein is negative. Hence, a smaller β is more likely to satisfy the condition (A2). That is, to obtain a useful solution, we need to choose β to be sufficiently large.
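The threshold behaviour in β can already be seen in a classical toy example (the distributions below are hypothetical): for small β the trivial, uniform channel achieves a lower value of f_α than an informative channel, while for large β the ordering is reversed.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mi(p_joint):
    return entropy(p_joint.sum(1)) + entropy(p_joint.sum(0)) - entropy(p_joint.ravel())

def f_alpha(q, p_x, p_yx, alpha, beta):
    """f_alpha = (1 - alpha) H(T) + alpha I(T:X) - beta I(T:Y)."""
    p_tx = q * p_x[None, :]
    p_ty = p_tx @ p_yx.T
    return (1 - alpha) * entropy(p_tx.sum(1)) + alpha * mi(p_tx) - beta * mi(p_ty)

p_x = np.full(4, 0.25)
p_yx = np.array([[0.9, 0.8, 0.2, 0.1],
                 [0.1, 0.2, 0.8, 0.9]])
informative = np.array([[1., 1., 0., 0.],
                        [0., 0., 1., 1.]])
trivial = np.full((2, 4), 0.5)                 # T carries no information; f_1 = 0
for beta in (1.0, 10.0):
    print(beta, f_alpha(informative, p_x, p_yx, 1.0, beta),
                f_alpha(trivial, p_x, p_yx, 1.0, beta))
```

At β = 1 the informative channel scores worse than the trivial one, so an algorithm minimising f_α is driven toward the trivial solution; at β = 10 the informative channel wins, consistent with the requirement that β be sufficiently large.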

Proof of Theorem 4: Let U be an arbitrary unitary on T. Since both S and f_α are invariant under conjugation by U on T, the minimizer inherits this invariance, so σ*_{T|x} is the completely mixed state on T for any x.
3 Classical system T

Next, we consider the case when T is constrained to be a classical system. We stress that this is a different minimization from the previously discussed one with a quantum system T, whose minimum may not be attainable with a classical T: our objective function is now the restriction of f_α to classical channels. Therefore, we need to re-examine the validity of our previous analyses. Let us start with the form of the QIB algorithm. Fortunately, our algorithm for a quantum system T can be applied to this case, simply with the adaptation that the states σ_{T|x} are limited to diagonal density matrices with respect to the basis {|t⟩} of T. Under this condition, the states σ̂_{γ,α,T|x}[σ_{T|X}] are also diagonal density matrices. Therefore, when we set the initial states to be diagonal density matrices, Algorithm 1 works for this case.
The above discussion leads to an interesting observation. The convergence point σ*_{T|X} obtained from a diagonal initial σ_{T|X} satisfies the condition (10) and is also diagonal. That is, if the minimum with classical T is strictly larger than the minimum with quantum T, the classical minimizer provides an example of the following statement: a solution of the condition (10) does not necessarily give the minimum of f_α with quantum T. This fact shows the risk that a solution to (10) might be a saddle point or a local minimum rather than the global minimum of f_α with quantum T.
When the states σ_{T|x} are limited to diagonal density matrices with respect to the basis {|t⟩} of T, the notion of unitary invariance reduces to invariance under permutations on T, and the condition (A2) is invariant under permutations on T. Then, Theorem 4 can be rewritten as follows.
Theorem 5 Assume that a subset S satisfies (A2) and is invariant under any permutation on T . Let σ * T |X be the minimizer of min σ T |X :diagonal f α (σ T |X ). When σ * T |X belongs to S, σ * T |x is the uniform distribution over T for any x.
Theorem 5 can be shown in the same way as Theorem 4.
In this case, we can give a more precise discussion of the condition (A2). For this purpose, we consider a maximum ratio κ of the relevant information quantities; the inequality κ ≤ 1 follows from the information processing inequality for the relevant map. Then, the LHS of (A2) simplifies, and when α ≥ 1 and α/κ > β hold, the LHS of (A2) is positive for σ_{T|X} = σ'_{T|X}. Hence, to extract a useful σ_{T|X}, we need to choose β to satisfy the condition β > α/κ with α = 1. In fact, even when β > α/κ, there is a possibility that a permutation-invariant subset S satisfies (A2). Due to Theorem 5, when a permutation-invariant subset S satisfies (A2), a useful solution does not belong to the subset S. Hence, to obtain a useful solution, we need to choose β sufficiently large, beyond the condition β > α/κ with α = 1.

Remark 3
We consider the case with classical Y and γ = α. In this case, the operator σ̂_{α,T}[σ_{T|X}](x) takes a simplified form. The reference [31, (14), Section 3] proposed the update rule (47), defined via an operator τ̂_T[σ_{T|X}](x). A direct calculation shows that the update rule (47) of [31, (14), Section 3] is the same as ours in this special case. In particular, the update rule (47) with α = 1 coincides with the update rule of the reference [32].
Remark 4 When the system Y is classical and α = 1, the reference [9, Appendix B] claimed that there is no difference between the optimal value with quantum T and the optimal value with classical T. Since their algorithm works with a T of fixed size, it can be understood that they made this claim for a fixed size of T. However, their proof (see [9, Appendix B II]) contains a gap: the statement under Eq. (B23) that "the Lagrangian is invariant under a measurement of the memory M in a chosen basis |m⟩" is not backed by a rigorous mathematical proof. It is thus unclear whether this statement and, consequently, the claim that there is no quantum advantage are correct. Indeed, as we show next, the optimal value with quantum T can be strictly smaller than the optimal value with classical T. That is, the claim in [9, Appendix B] contradicts the result of the next section.

Quantum advantage for T
To see the advantage of a quantum system T over a classical system T, we discuss several examples with the strict inequality min_{classical T} f_α > min_{quantum T} f_α. We provide an analytical example in this section and a numerical example with an application to quantum machine learning in Section 5.2, in both cases with the size of the system T fixed. Generally, to achieve the optimal performance, we need to choose the system T to be of sufficiently large dimension; however, in this section, to provide analytical examples, we fix the size of the system T. Assume that Y is a classical system of size d. The size of X is k times the size d of Y: we assume that X is given as X_1 × X_2 with X_1 = Y and |X_2| = k. The distribution of X is assumed to be uniform. We focus on a quantum system T of dimension n < d.
Lemma 6 When β ≥ 1 and β ≥ α, we have the bound (53).

Proof: First, we show a bound on the QIB for a generic (quantum) T. For any σ_{T|x}, we have H(T) ≥ I(T : Y). Since H(T) ≤ log n and 1 − β ≤ 0, we obtain the bound (53). The bound is tight: indeed, it is attained by choosing σ_{T|x1,x2} to be a suitable pure state.

Next, we focus on the case when T is a classical system of dimension n < d.
Proof: Any classical channel σ_{T|x} can be written as a probabilistic mixture of deterministic channels σ^j_{T|x}. Since Y is independent of X_2 and of the random variable J describing the choice of j, we have (57). Combining these relations, we obtain the claimed bound, where (a) follows from (53), and (b) follows from (57) and the relation above.

Information bottleneck in supervised learning
Supervised learning is a cornerstone of machine learning. Given a dataset {(x, y)} sampled from an unknown probability distribution P_XY, a general supervised learning task is to find a classifier such that, for any test data point (x', y') sampled from the same distribution P_XY, it predicts the label y' from x' with as high accuracy as possible.
Remarkably, recent studies [8,28,33] on the information bottleneck theory showed evidence that the training phase of deep learning can be divided into two stages. In the first stage, a representation T of X that faithfully encodes its correlation with Y is found, featured by increasing I(T : Y). In the second stage, the size of T is compressed, featured by decreasing I(T : X). This result suggests that finding an efficient and compressed representation of X facilitates data classification.

Quantum feature maps
Following the above intuition, we propose a classical-quantum hybrid algorithm for data classification, combining the QIB algorithm with the kernel method. The idea is illustrated in the flowchart in Fig. 3. Given a training dataset S_train, the algorithm first identifies an efficient representation T of X by minimising the information bottleneck function f_α := H(T) − αH(T|X) − βI(T : Y). Then a classifier is constructed that yields a prediction Ŷ based on the state in T corresponding to the value of X. For simplicity, we consider for now the case when Y ∈ {1, −1} is binary. In the first step, we set the representation T to be a quantum state ρ(x) that depends on the data x, and we obtain ρ(x) via Algorithm 1. In the second step, we use a linear classifier ŷ(x) = sign(Tr[Aρ(x)] + b), where A is a Hermitian operator and b ∈ R. We further consider A that can be expressed as a linear combination A = Σ_{x̃:(x̃,ỹ)∈S_train} a_x̃ ρ(x̃), so that the classifier takes the reduced form ŷ(x) = sign(Σ_{x̃:(x̃,ỹ)∈S_train} a_x̃ K(x̃, x) + b), where K(x, x̃) := Tr[ρ(x)ρ(x̃)] is the kernel function, in our case given by the Hilbert-Schmidt (HS) inner product of quantum states, which can be evaluated by performing the SWAP test on a quantum computer. The algorithm is summarised as Algorithm 2. We remark that the quantum kernel method, where a mapping x → ρ(x) is constructed for better classification, has been a hot topic recently (see, e.g., [5,11,17,20,25,26]). The key distinction between existing works and our present method is the following: in existing works, the parameter x is passed to a parametrised (a.k.a. variational) quantum circuit that prepares the state ρ(x). One needs to train the circuit parameters on a quantum computer to obtain a good mapping x → ρ(x), which is called a feature map. In the near term, this method might be subject to the physical limitations of quantum devices. In contrast, in our present method ρ(x) is directly computed via a simple iterative algorithm. Therefore, there are two possible ways of realizing our present method, i.e., Algorithm 2.
In the near term, we can regard Algorithm 2 as a "quantum-inspired" classical algorithm, and evaluate everything on a classical computer. When large-scale quantum computing becomes feasible, Algorithm 2 can be readily "quantised". Indeed, the evaluation of ρ(x) in each iteration requires subroutines that compute matrix powers and logarithm and solve linear systems, which have already been developed in Refs. [7,10,18,19].
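A classical simulation of the second step of Algorithm 2, the kernel construction and the linear classifier, can be sketched as follows. The ridge-regularised least-squares fit of the coefficients a_x and the bias b is our own choice, as the excerpt does not fix the training procedure.

```python
import numpy as np

def hs_kernel(rho_a, rho_b):
    """Hilbert-Schmidt inner product Tr[rho_a rho_b]; on quantum hardware
    this overlap could be estimated with a SWAP test."""
    return float(np.real(np.trace(rho_a @ rho_b)))

def fit_kernel_classifier(states, labels, ridge=1e-6):
    """Fit a_x and b in y_hat(x) = sign(sum_x' a_x' K(x', x) + b).

    A ridge-regularised least-squares sketch; the training rule for a and b
    is an assumption, not the paper's prescription."""
    n = len(states)
    K = np.array([[hs_kernel(states[i], states[j]) for j in range(n)]
                  for i in range(n)])
    Phi = np.hstack([K, np.ones((n, 1))])          # last column carries the bias b
    y = np.array(labels, dtype=float)
    coef = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(n + 1), Phi.T @ y)
    a, b = coef[:-1], coef[-1]

    def predict(rho):
        score = sum(ai * hs_kernel(si, rho) for ai, si in zip(a, states)) + b
        return int(np.sign(score))

    return predict

# Toy usage: two orthogonal pure states with opposite labels.
rho0 = np.array([[1., 0.], [0., 0.]], dtype=complex)
rho1 = np.array([[0., 0.], [0., 1.]], dtype=complex)
predict = fit_kernel_classifier([rho0, rho1], [1, -1])
```

In a full pipeline, the states ρ(x) would come from the QIB feature map of the first step rather than being fixed by hand as here.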

Numerical experiments
As a proof-of-principle experiment, we tested the performance of our QIB classifier on a dataset in R², generated in the following way. First, we define the discrete sets X = X_1 × X_2 and Y, with X_1 = Y = {0, 1, 2} and X_2 = {0, 1, . . . , 9}. To apply our classification method, we arbitrarily choose a permutation π, and generate n = 400 independent and identically distributed data points (X̃_{1,i}, X̃_{2,i}, Y_i) for i = 1, . . . , n as follows. We independently generate (X_{1,i}, X_{2,i}, Y_i) according to the distribution P_Y(y)Q_{X1|Y}(x_1, y)Q_{X2|X1}(x_2, x_1), where P_Y is the uniform distribution over Y, Q_{X1|Y}(x_1, y) = δ(x_1, y), Q_{X2|X1}(x_2, x_1) = (δ(x̄_1, x̄_2) + 1)/(|X_2| + 1), and (x̄_1, x̄_2) = π(x_1, x_2). Next, we generate the random variables X̃_{j,i} := X_{j,i} + R_{j,i}, where the random variable R_{j,i} is subject to the uniform distribution on the interval [0, 1.2), unless j = 1, X_{1,i} = 2 or j = 2, X_{2,i} = 9, in which case it is subject to the uniform distribution on the interval [0, 1). Then, using the obtained data (X̃_{1,i}, X̃_{2,i}, Y_i) with i = 1, . . . , n, we define its empirical distribution P̃_XY. We apply Algorithm 1 to the distribution P̃_XY, as shown in Fig. 4. For the distribution P̃_XY, Algorithm 1 with quantum T realizes a smaller f_α than Algorithm 1 with classical T, which shows the advantage of quantum T over classical T.
In the classification experiment, 50% of the data are used as the training set and the rest as the testing set. The kernel is constructed via Algorithm 2 with α = 1, β = 15, γ = 1, a single-qubit register T, and 10 iterations. We compare the case where T is a generic qubit system with the case where T is restricted to a binary classical system. As can be seen from Fig. 4, the case of quantum T attains a lower IB value than the case of classical T. The final feature map σ_{T|X} in the quantum T case suffers a certain degree of dispersion due to the random noise R_1, R_2, but the quantum features still form 3 clusters. In contrast, the final σ_{T|X} in the classical T case maps different values of X into only two clusters.
The effect of the above distinction is apparent in the classification performance. In Fig. 5, the performance of the classifiers constructed from the kernels is illustrated via their decision regions. It can be seen that, since the classical-T feature map groups X into two clusters, its resultant classifier gives a binary prediction on any input, necessarily giving up one of the three possible labels. In stark contrast, the quantum-T feature map utilizes the full Bloch ball to generate 3 clusters, leading to a much higher prediction accuracy. The advantage of a genuinely quantum feature map is thus manifested by this numerical example.
For reference, in Fig. 5, we also plot the performance of two standard classical feature maps. The reference methods (linear kernel and polynomial kernel) achieve accuracies (defined as the ratio of correct predictions on the testing set) of 0.64 and 0.62, respectively, which are slightly higher than that of the classical-T information bottleneck kernel (0.565) but much lower than that of the QIB kernel (0.92). This further justifies the superior performance of our QIB method in classification.

6 Quantum deterministic information bottleneck (QDIB)

Considering the limit α → +0, the paper [31] proposed the deterministic IB, which minimizes f_0. Now, we consider this minimization with quantum systems T, Y and a classical system X. First, we define the operator σ̂_{0,T|x}[σ_{T|X}], where P_{T|x}[σ_{T|X}] is the projection onto the eigenspace of the maximum eigenvalue of the associated operator. Given an initial point σ^(1)_{T|X}, we propose the update rule stated in Algorithm 3 below. As shown below, each step of this algorithm improves the value of the target function f_0.
Since Theorem 1 and (20) guarantee the monotonic decrease for every α > 0, taking the limit α → 0 in (67) implies that each step of this algorithm improves the value of the target function f DIB := f α→0 .
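The projection-update structure of this iteration can be sketched as follows. Since the operator F α=0 [σ T |X ](x) is defined elsewhere in the paper and not reproduced in this section, it enters the sketch only as a user-supplied list of Hermitian matrices:

```python
import numpy as np

def qdib_projection_update(F_ops, tol=1e-9):
    """One QDIB-style update: for each x, replace sigma_{T|x} by the normalized
    projector onto the minimum-eigenvalue eigenspace of a Hermitian operator F(x).
    F_ops stands in for F_{alpha=0}[sigma^{(n)}_{T|X}](x), which is not
    reproduced here; this is a structural sketch, not the paper's exact formula."""
    updated = []
    for F in F_ops:
        w, V = np.linalg.eigh(F)              # eigenvalues in ascending order
        mask = w <= w[0] + tol                # possibly degenerate minimum eigenspace
        P = V[:, mask] @ V[:, mask].conj().T  # projector onto that eigenspace
        updated.append(P / np.trace(P))       # normalize to a density matrix
    return updated

# Toy check: the minimum-eigenvalue eigenspace of diag(0, 1) is spanned by |0>.
sigma = qdib_projection_update([np.diag([0.0, 1.0])])[0]
assert np.allclose(sigma, np.diag([1.0, 0.0]))
```

Iterating this map until the assignments stop changing mirrors the repeat-until-convergence loop of Algorithm 3.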

Algorithm 3 Quantum deterministic information bottleneck (QDIB) algorithm

where P[σ (n) T |X ] is the projection onto the space spanned by the eigenvectors of F α=0 [σ (n) T |X ](x) [cf. (15)] corresponding to the minimum eigenvalue.
5: Set n as n + 1.
6: until convergence.
7: Output: a c-q channel σ (n+1) T |X .

7 Approximate sufficient statistics from DIB

7.1 Task formulation

Next, we discuss how DIB can be used for the extraction of useful information from a classical-quantum (c-q) joint system composed of X and Y with the joint state ρ XY := Σ x P X (x)|x⟩⟨x| ⊗ ρ Y |x , where X is a classical system and Y is a quantum system. For example, assume that our interest is in the quantum phenomena in the quantum system Y , which is correlated with the classical system X. However, the classical system X may contain redundant information. In this case, it is useful to extract from X the information essential for describing the behavior of the quantum phenomena in Y . To discuss this essential information, we introduce the concept of an ε-(approximate) sufficient statistic of the classical system X with respect to the quantum system Y ; the papers [12,36] discussed this concept when the system Y is classical.
A function f from X to T is called a sufficient statistic of X for the quantum system Y when there exists a conditional distribution P X|T satisfying the recovery condition. This condition is equivalent to the equality I(X : Y ) = I(T : Y ), while in general we only have the inequality I(X : Y ) ≥ I(T : Y ). However, an exact sufficient statistic cannot remove a small correlation generated by noise. As an example, suppose that the classical system X is composed of two classical systems X 1 and X 2 , and that we have a c-q state ρ X1X2Y = Σ x1,x2 P X1,X2 (x 1 , x 2 )|x 1 , x 2 ⟩⟨x 1 , x 2 | ⊗ ρ Y |x1 .
We assume that we already know the distribution P X1X2 but do not know ρ Y |x . Also, we assume that we generate this state several times and apply state estimation to the generated states. As a result, we obtain an estimate ρ̂ Y |x1,x2 . Since our estimate always has a small error, ρ̂ Y |x1,x2 is not exactly the same as ρ Y |x1 , but it is close to it. This difference should be considered as noise; that is, the dependence on X 2 is not essential. It is better to consider that the correlation is given by ρ̂ Y |x1 := Σ x2 P X2|X1 (x 2 |x 1 ) ρ̂ Y |x1,x2 ,

Figure 6: Bloch representation of the estimated ensemble {ρ(θ x1,x2 , λ x1,x2 )}. As can be seen in the figure, the qubit states, especially those with higher purity, form several clusters in the Bloch ball. In each cluster, the states have the same value of x 1 and different values of x 2 . This shows that the correlation between X 1 and Y is higher than the correlation between X 2 and Y .
so that our estimate of ρ X1X2Y is given by the corresponding c-q state. For ε > 0, a function f : X → T is called an ε-sufficient statistic when the inequality I(X : Y ) − I(f (X) : Y ) ≤ ε holds. Hence, a sufficient statistic with T of small size, or an ε-sufficient statistic, can be considered as compressed data of X with respect to Y . In the above example, X 1 X 2 is a sufficient statistic for Y . When δ is sufficiently small relative to ε, I(X 1 : Y ) is close to I(X 1 X 2 : Y ), i.e., X 1 is an ε-sufficient statistic. Hence, we can remove the non-essential information X 2 . However, if X = X 1 × X 2 is disturbed by a random permutation π, it becomes non-trivial to extract the essential information. To cover such non-trivial cases, we need a systematic approach to finding such a function with a small-size T . For this purpose, we can use the information bottleneck algorithm.
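For a quantum Y , the mutual information terms in this definition are Holevo quantities of the c-q state, so ε-sufficiency can be checked numerically. The following minimal sketch uses a toy ensemble in which ρ Y |x depends only on x 1 , so that T = f (X) = X 1 is an exactly sufficient statistic (ε = 0):

```python
import numpy as np

def vn_entropy(rho):
    """von Neumann entropy in bits."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]
    return float(-np.sum(w * np.log2(w)))

def holevo(probs, states):
    """I(X:Y) of the c-q state sum_x P(x)|x><x| (x) rho_{Y|x}, i.e. the
    Holevo quantity of the ensemble {P(x), rho_{Y|x}}."""
    avg = sum(p * r for p, r in zip(probs, states))
    return vn_entropy(avg) - sum(p * vn_entropy(r) for p, r in zip(probs, states))

# Toy ensemble: X = (x1, x2) with rho_{Y|x} depending only on x1.
base = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]   # rho_{Y|x1}, x1 in {0, 1}
probs = [0.25] * 4                                  # uniform over X1 x X2
states = [base[x1] for x1 in (0, 0, 1, 1)]          # x = (x1, x2)
IXY = holevo(probs, states)
ITY = holevo([0.5, 0.5], base)                      # T = f(x) = x1
# epsilon = I(X:Y) - I(T:Y) vanishes: x1 keeps all the correlation with Y.
assert abs(IXY - ITY) < 1e-9
```

With noisy estimates ρ̂ Y |x1,x2 the gap I(X : Y ) − I(T : Y ) becomes a small positive ε instead of zero, matching the discussion above.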
To extract an approximate sufficient statistic T , we focus on two requirements: the mutual information I(T : Y ) should be large, and the entropy H(T ) should be small. To handle both requirements, we simply minimize H(T ) − βI(T : Y ) using the deterministic information bottleneck algorithm with |T | = |X |. Since the algorithm minimizes H(T ) − βI(T : Y ) and the conditional distribution P T |X in the solution is deterministic, the support of P T in the solution is expected to be smaller than the original set T .
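To illustrate why the support of P T shrinks under this objective, the following hard-assignment sketch runs a classical DIB-style update (a hypothetical analogue for intuition, not Algorithm 3 itself) and shows the number of used symbols of T collapsing:

```python
import numpy as np

def dib_step(assign, PX, PYgX, beta, nT):
    """One hard-assignment DIB step: t(x) = argmin_t [-log P(t) + beta*KL(p(y|x)||p(y|t))].
    A classical sketch of the H(T) - beta I(T:Y) heuristic described in the text."""
    PT = np.zeros(nT)
    PYgT = np.zeros((nT, PYgX.shape[1]))
    for x, t in enumerate(assign):          # accumulate P(t) and p(y|t)
        PT[t] += PX[x]
        PYgT[t] += PX[x] * PYgX[x]
    PYgT[PT > 0] /= PT[PT > 0, None]
    new = []
    for x in range(len(PX)):
        costs = np.full(nT, np.inf)
        for t in range(nT):
            if PT[t] > 0:
                kl = float(np.sum(PYgX[x] * np.log(PYgX[x] / np.maximum(PYgT[t], 1e-12))))
                costs[t] = -np.log(PT[t]) + beta * kl
        new.append(int(np.argmin(costs)))   # deterministic (hard) assignment
    return new

# Toy example: |X| = 4, but p(y|x) takes only 2 distinct rows.
PX = np.ones(4) / 4
PYgX = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
assign = [0, 1, 2, 3]           # start from the identity map, |T| = |X|
for _ in range(5):
    assign = dib_step(assign, PX, PYgX, beta=5.0, nT=4)
assert len(set(assign)) == 2    # the support of P_T shrinks to 2 symbols
```

The −log P(t) term rewards reusing already-populated symbols of T , which is exactly the mechanism that keeps the final support small.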

7.2 Numerics
To demonstrate the above idea, let us examine a concrete example, a modification of the example in Section 2.5. Consider a single-qubit quantum system Y and a classical register X that encodes information about Y . The register X is split into two sub-registers X 1 and X 2 taking values in the sets X 1 = {0, 1, . . . , 4} and X 2 = {0, 1, . . . , 19}. We assume that P X is the uniform distribution over X 1 × X 2 , and the density matrix ρ Y |x1 is given as ρ(θ x1 , λ x1 ) with (37), where the parameters θ and λ depend only on x 1 . Obviously, the quantum system depends only on X 1 , and X 2 contains no information about the quantum system. An experimentalist who has access to the ensemble, however, does not know this. To extract information about the quantum system, for each pair (x 1 , x 2 ), the experimentalist estimates the density matrix by repetitively (ν < ∞ times) making a suitable measurement on ρ(θ x1 , λ x1 ). According to quantum state estimation theory [13,15], the estimate has an inaccuracy proportional to 1/ √ ν. Taking this into account, we model the estimated density matrix as ρ(θ x1,x2 , λ x1,x2 ), whose noisy parameters characterise the estimation errors. The estimated ensemble then admits the density matrix given in (72) with ρ̂ Y |x1,x2 = ρ(θ x1,x2 , λ x1,x2 ), given by Eqs. (37), (75), and (76). Notice that the register X 2 is now correlated with Y in the estimated joint state ρ̂ XY , even though the estimation-induced noise follows a distribution that does not depend on the value of X 2 . The task is now to compress the register X by constructing a map from X to a smaller classical register T ; here we take T to be the same size as X. One intuitive approach is to discard the X 2 register, because X 1 contains much more information about the qubit state than X 2 . Nevertheless, such a simple map does not exist in more general cases. For instance, if the values of (x 1 , x 2 ) in Eq.
(72) are permuted, discarding X 2 will not result in faithful compression. To see this, we further apply an arbitrarily chosen, unknown reshuffling π : X → X to the classical register X = X 1 × X 2 in Eq. (72). The ensemble then admits the joint density matrix (77), with ρ(θ x1,x2 , λ x1,x2 ) given by Eqs. (75) and (76). The goal is to extract an approximate sufficient statistic by constructing a map Q : X → T . We apply our quantum deterministic information bottleneck (QDIB) algorithm to the state (77) (see also Fig. 6). For the joint state, we choose |X 1 | = 5 and |X 2 | = 20, and P X to be the uniform distribution over X = X 1 × X 2 . The noise terms r ν (x 1 , x 2 ) and r′ ν (x 1 , x 2 ) are drawn independently and uniformly from the interval (−1/ √ ν, 1/ √ ν) with ν = 20 for every x 1 ∈ X 1 , x 2 ∈ X 2 . In the QDIB algorithm (Algorithm 3), we choose β = 20 and |T | = |X | = 100. In the upper plot, the information bottleneck value f DIB := f α→0 is plotted as a function of the number of iterations. As can be seen, the QDIB value of our algorithm becomes lower than that of the fictional protocol of "discarding X 2 after the inverse permutation π −1 " after only 3 iterations. In the lower plot, the faithfulness I(T : Y ) is plotted as a function of the number of iterations; for reference, we also plot I(X 1 : Y ) after the inverse permutation π −1 (the performance of the fictional protocol) and I(X : Y ) (the upper bound on I(T : Y )). Both plots confirm that our QDIB algorithm performs well in the task of constructing an approximate sufficient statistic.
Our QDIB algorithm thus works as a more systematic and more efficient method to extract essential information and discard non-essential information, even in the presence of an arbitrary permutation. In the QDIB algorithm (Algorithm 3), we choose β = 20 and |T | = |X | = |X 1 ||X 2 |. First, we consider the case when the ensemble admits the form (72); the performance is summarised in Fig. 7. As one can see from the numerics, f DIB := f α→0 obtained by applying our QDIB algorithm to ρ̂ XY drops below that of the "discarding X 2 after the inverse permutation π −1 " approach within 5 iterations, and converges to a much lower value, suggesting better compression performance. This is further confirmed in the second plot, where the faithfulness I(T : Y ) and the residual information I(T : X) are plotted. We can see that our QDIB algorithm preserves almost as much information about Y as the original variable X, while compressing away a considerably larger portion of the information about the original register X.
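The noisy-estimation model of this experiment can be simulated in a few lines. Since Eqs. (37), (75), and (76) are not reproduced in this section, the parametrization ρ(θ, λ) below is a hypothetical Bloch-plane form adopted only for illustration; the noise magnitudes follow the stated Uniform(−1/√ν, 1/√ν) model:

```python
import numpy as np

# Hypothetical parametrization standing in for Eq. (37):
# rho(theta, lam) = (I + lam (sin(theta) X + cos(theta) Z)) / 2.
def rho(theta, lam):
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    sz = np.array([[1, 0], [0, -1]], dtype=complex)
    return 0.5 * (np.eye(2) + lam * (np.sin(theta) * sx + np.cos(theta) * sz))

rng = np.random.default_rng(0)
nu, nX1, nX2 = 20, 5, 20                       # nu = 20 as in the text
ensemble = {}
for x1 in range(nX1):
    theta, lam = np.pi * x1 / nX1, 0.9         # hypothetical x1-dependence
    for x2 in range(nX2):
        # estimation noise of magnitude ~ 1/sqrt(nu), independent of x2
        r1, r2 = rng.uniform(-1 / np.sqrt(nu), 1 / np.sqrt(nu), size=2)
        ensemble[(x1, x2)] = rho(theta + r1, min(lam + r2, 1.0))  # clip to stay physical

# States sharing x1 cluster together, while different x1 values sit far apart,
# so X2 carries only noise-induced correlation with Y (cf. Fig. 6).
same = [np.linalg.norm(ensemble[(0, i)] - ensemble[(0, j)])
        for i in range(nX2) for j in range(i + 1, nX2)]
diff = [np.linalg.norm(ensemble[(0, i)] - ensemble[(2, j)])
        for i in range(nX2) for j in range(nX2)]
assert np.mean(same) < np.mean(diff)
```

Feeding such an estimated ensemble to the QDIB algorithm reproduces the compression task studied in Figs. 6 and 7.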

Discussion and conclusion
We have proposed a generalized algorithm for QIB with an acceleration parameter γ and an additional parameter α, and have derived a necessary condition for the monotonic decrease of the objective function f α = H(T ) − αH(T |X) − βI(T : Y ) with quantum systems Y, T and a classical system X when we extract information T with respect to Y from X. We have also shown its convergence under the same condition, and that a wisely chosen parameter γ can accelerate the convergence. Our numerical calculations further support this analysis: making γ smaller accelerates the convergence, but if γ is made smaller than a threshold, the algorithm fails to converge. In addition, we have provided examples in which a quantum system T has an advantage over a classical system T even when Y and X are classical.
Next, taking the limit α → +0, we have proposed an iterative algorithm for QDIB that minimizes the objective function f DIB = H(T ) − βI(T : Y ). We have shown that this iterative algorithm decreases the objective function monotonically. QDIB can be applied to find an approximate sufficient statistic because it realizes a smaller entropy H(T ) and a larger mutual information I(T : Y ). We have then numerically demonstrated that our QDIB algorithm works well for constructing approximate sufficient statistics.
An important application shown in this work is that our QIB algorithm yields a new approach to constructing quantum feature maps for classification. In our numerical example, a quantum system T realizes a smaller value of the objective function than a classical system T . This numerical analysis shows the advantage of using a quantum memory T for classification. Despite significant recent progress [3,5,11,17,20,25-27,34], the advantage of quantum machine learning over its classical counterpart has not been widely discussed. Our work provides a new angle of attack on this issue, shedding light on a new proposal to rigorously justify and quantify quantum supremacy in the world of learning.
An open question left for future study is how to extend our results to the case where X is also a quantum system, which covers, for instance, the scenario of compressing a quantum system while keeping its correlation with a classical label [21,23,35-38]. Remarkably, in such a scenario, it has been shown that, if T is classical, some correlation will be lost regardless of its size [37]. We therefore anticipate that the advantage of a quantum T might persist, or grow even stronger, for QIB with a quantum X.
Finally, we remark that there is currently no efficient method to compute the restriction on γ in Theorem 3. Resolving this issue in future work would further accelerate the convergence of our information bottleneck algorithm.