Which entropy for general physical theories?

We address the problem of quantifying the information content of a source for an arbitrary information theory, where the information content is defined in terms of the asymptotic achievable compression rate. The functions that solve this problem in classical and quantum theory are Shannon's and von Neumann's entropy, respectively. However, in a general information theory there are three different functions that extend the notion of entropy, and this opens the question as to whether any of them can universally play the role of the quantifier for the information content. Here we answer the question in the negative, by evaluating the information content as well as the various entropic functions in a toy theory called Bilocal Classical Theory.


Introduction
The experience of quantum information has suggested that viewing quantum theory as an information theory helps in making its most counterintuitive consequences comprehensible. This perspective promotes the role of information to a fundamental one in the development of a physical theory [1,2,3,4,5]. For example, one might wonder to what extent the laws of physics depend on the properties of the physical systems used as information carriers. Such an approach to the foundations of physics has already proved successful, both in motivating the use of quantum theory on operational grounds and in suggesting what viable generalisations of quantum theory for a post-quantum physics should look like [6,7,8,9,10,11].
From the above perspective, it is very natural to introduce in a general theory of physical systems some notion of entropy at a very fundamental level. Indeed, entropic functions have been identified as quantifiers of information from a source since the pioneering work by Shannon that paved the way to classical information theory [12]. Along this line, von Neumann entropy [13] was studied in the realm of quantum information theory as a quantifier of quantum information [14].
Shannon's and von Neumann's entropies thus quantify the information content in their respective theories. In BCT, on the contrary, the information content of its states does not coincide with its entropy. Moreover, in BCT pure states actually have non-null information content. The latter feature can be understood by considering that the independent preparation of two systems in pure states does not correspond to a pure state of the composite system. Such a preparation thus introduces some ignorance about the whole system, even if the outcomes of independent experiments on its components are fully predictable. The latter result can be extended to any theory of classical systems where the rule for composing systems is such that the composition of pure states is not necessarily pure. Therefore, thinking of pure states as representing complete knowledge of the physical system is inaccurate if the composition law is not purity-preserving.

An overview of the framework
A brief review of the framework follows. A standard reference is [5], but an extended treatment can be found in [8]. The framework is also carefully reviewed in the introduction of [19].

Review of the framework
The primitive notions of an operational theory are those of test, event and system. A test {A_i}_{i∈X} is given by a collection of events, where i labels the elements of the outcome space X. The systems (or, more precisely, system types) allow for the connection between different tests, and are denoted by capital Roman letters A, B, . . . . A test is therefore completely determined by its input and output systems and by the events associated with the outcome space X. In order to represent a test and its events {A_i}_{i∈X} we use the usual diagrammatic notation, and we call A and B the input and the output system of the test, respectively. In this case we say that the test {A_i}_{i∈X}, as well as each of its events A_i, is of type A → B. If {A_i}_{i∈X} and {B_j}_{j∈Y} are two tests, one can define their sequential composition as the test {C_{i,j}}_{(i,j)∈X×Y}, whose events are represented diagrammatically by composing the corresponding boxes. Notice that this definition requires the output system of the events on the left to be the input system of the events on the right. In formula, the sequential composition will be denoted as C_{i,j} := B_j A_i.
A singleton test is a test whose outcome space X is a singleton, and the unique event contained in it is called deterministic. For any system A there exists a unique identity test {I_A} such that, for any event A of type A → B, one has A I_A = I_B A = A. Another operation that can be performed on tests to define a new test is parallel composition. Given two systems A and B, we call AB the composite system of A and B. Then, if {A_i}_{i∈X} and {C_j}_{j∈Y} are two tests of type A → B and C → D, respectively, their parallel composition is the test {A_i ⊠ C_j}_{(i,j)∈X×Y}. Both the parallel and sequential compositions must be associative, and the parallel composition commutes with the sequential one. The identity I_AB of the composite system AB is the parallel composition of the two identities, I_AB = I_A ⊠ I_B. We will denote by A^⊠k the composition of the same system A with itself k times.
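The interchange of parallel and sequential composition can be checked concretely in the classical special case, where deterministic tests are column-stochastic matrices, sequential composition is the matrix product, and parallel composition is the Kronecker product. The following sketch (an illustration under those assumptions, not the abstract OPT formalism) verifies that (BA) ⊠ (DC) = (B ⊠ D)(A ⊠ C):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_channel(n_out, n_in, rng):
    # Column-stochastic matrix: a deterministic classical test of type n_in -> n_out.
    m = rng.random((n_out, n_in))
    return m / m.sum(axis=0)

A = random_channel(3, 2, rng)   # type A -> B
B = random_channel(2, 3, rng)   # type B -> C
C = random_channel(4, 2, rng)   # type C -> D
D = random_channel(2, 4, rng)   # type D -> E

lhs = np.kron(B @ A, D @ C)           # sequential compositions, then parallel
rhs = np.kron(B, D) @ np.kron(A, C)   # parallel compositions, then sequential
assert np.allclose(lhs, rhs)
```

The identity used here is the mixed-product property of the Kronecker product, which is the matrix-level counterpart of the diagrammatic sliding rules above.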
There is a special kind of system, the trivial system I, satisfying AI = IA = A for every system A. Tests with I as input system and A as output are called preparation tests of A, while tests with input system A and output I are called observation tests of A. The events of a preparation test {ρ_i}_{i∈X} and of an observation test {a_j}_{j∈Y} are represented through corresponding diagrams with no input wire and no output wire, respectively.
In the following we will always use Greek letters to denote preparation tests and Latin letters for observation ones. Preparation and observation events will also be denoted using round brackets, |ρ)_A and (a|_A respectively, and the system will be omitted whenever it is clear from the context.
A circuit is a diagram representing an arbitrary test obtained by sequential and parallel composition of other tests. We say that a circuit is closed when its input and output systems are both the trivial one, namely when it starts with a preparation test and ends with an observation test. For any pair of systems, we want agents to be allowed to exchange their systems. This requirement is captured by the notion of braiding, a family of reversible transformations S_AB defined for every pair of systems of the theory. Generally, S_AB and S_AB^{-1} are different transformations. When they are equal, e.g. in the case of QT, where S_AB is represented by the swap operator, the theory is called symmetric. These transformations must obey a sliding property, which asserts that two agents can equivalently perform their transformations and then exchange their output systems, or first exchange their input systems and then perform the transformations. An operational theory is then defined by a collection of systems closed under composition, and a collection of tests (including the family of tests S_AB for any A and B) closed under a pair of associative operations of sequential and parallel composition. An operational probabilistic theory is an operational theory where the test corresponding to any closed circuit (equivalently, any test from the trivial system to itself) is given by a joint probability distribution for its outcomes, conditioned on the tests that make up the circuit. Moreover, compound tests from the trivial system to itself are independent, namely the joint probability distribution is assumed to be given by the product of the probability distributions of the composing tests. A simple example is given by a preparation test {ρ_i}_{i∈X} sequentially followed by an observation test {a_j}_{j∈Y}: the closed circuit yields probabilities p(i, j) with Σ_{i,j} p(i, j) = 1. Thus one has a joint probability distribution, conditioned on the chosen tests {ρ_i}_{i∈X} and {a_j}_{j∈Y}. From now on we will simply omit this dependence. The probability associated with the closed circuit where a preparation ρ_i is followed by an observation a_j will also be denoted by a pairing, p(i, j) = (a_j|ρ_i).
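In the classical special case the pairing p(i, j) = (a_j|ρ_i) is just an inner product of vectors. The following minimal sketch (the specific numbers are illustrative assumptions) models a two-outcome preparation test and a two-outcome observation test and checks normalization of the resulting joint distribution:

```python
import numpy as np

# Preparation test {rho_i}: sub-normalized probability vectors whose
# coarse-graining is a normalized state.
rho = [np.array([0.2, 0.1]), np.array([0.3, 0.4])]
# Observation test {a_j}: effects summing to the deterministic effect (1, 1).
a = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

# p(i, j) = (a_j | rho_i)
p = np.array([[aj @ ri for aj in a] for ri in rho])
assert abs(p.sum() - 1.0) < 1e-12
```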
Given any system A of an OPT, one can define an equivalence relation on the set of preparation events by declaring that ρ ∼ σ iff (a|ρ) = (a|σ) for every observation event a. The set of equivalence classes with respect to this relation is called the set of states of system A and is denoted by St(A).
Similarly, one can define the set of effects as the set of equivalence classes of observation events under the relation a ∼ b iff (a|ρ) = (b|ρ) for every preparation event ρ; it is denoted by Eff(A). The sets of deterministic states and effects, obtained as the equivalence classes of deterministic preparation and observation events, will be denoted by St_1(A) and Eff_1(A), respectively. By definition, effects are separating for states, namely |ρ) ≠ |σ) iff there exists an effect (a| such that (a|ρ) ≠ (a|σ), and vice versa states are separating for effects.
Given the probabilistic structure, states can be seen as functionals on the set of effects and vice versa; one can then consider linear combinations of them, thus defining two linear spaces, St_R(A) and Eff_R(A), which are dual to each other (assuming that they are finite-dimensional). The size D_A of a given system A is defined as the dimension of the linear space St_R(A). The set of events of generic type A → B can also be endowed with an equivalence relation. Indeed, given two events A and B of the same type A → B, we say that they are operationally equivalent (A ∼ B) if they yield the same probabilities in every closed circuit, for any preparation event Ψ, observation event a and ancillary system C. We then denote the set of all equivalence classes by Tr(A → B), whose elements are simply called transformations. The operations of sequential and parallel composition are extended to the equivalence classes in the obvious way, by means of representatives; from now on we will always refer to transformations omitting the square brackets in the notation. Via sequential composition, a transformation of type A → B is associated with a map from St(AC) to St(BC) for every ancillary system C, which can be uniquely extended to a linear map from St_R(AC) to St_R(BC) (see [5]). Therefore, a transformation is completely characterised by a family of linear maps, one for each ancillary system C. As for states and effects, the set of deterministic transformations will be denoted by Tr_1(A → B). If U ∈ Tr(A → B) and there exists V ∈ Tr(B → A) such that VU = I_A and UV = I_B, we say that U is reversible. Accordingly, two systems A and B are called operationally equivalent (A ≅ B) if there exists a reversible transformation U ∈ Tr_1(A → B).
A notion that will be useful in the following is that of asymptotically equivalent systems.

Definition 2.1 (Asymptotically equivalent systems). Two systems A_1 and A_2 are asymptotically equivalent if:
1. for every integer k_1 < ∞ there exists an integer k_2 < ∞ and a pair of maps E ∈ Tr_1(A_1^⊠k_1 → A_2^⊠k_2) and D ∈ Tr_1(A_2^⊠k_2 → A_1^⊠k_1) such that DE = I_{A_1^⊠k_1};
2. the same condition holds with the roles of A_1 and A_2 exchanged, namely for every integer h_2 there exists a pair of integers h_1, h and corresponding maps in the opposite direction.

Let M_2^min(k_1) be the smallest k_2 such that item 1 is satisfied for a given k_1, and similarly for M_1^min(h_2) with reference to item 2. Notice that item 1 in the above definition is equivalent to the following: for any k < ∞ there exist k_2 < ∞ and a pair of maps Ẽ and D̃ with D̃Ẽ = I_{A_1^⊠k}. Indeed, the above statement implies item 1, as we now prove.

Assuming the validity of item 1 in definition 2.1, if k = k_1 the statement is trivially true. We then first prove that for every integer k < k_1 there exist another integer k_2 and a pair of maps Ẽ and D̃ such that D̃Ẽ = I_{A_1^⊠k}. Let (E, D) be the maps given by item 1, choose a deterministic state |ρ) of A_1^⊠(k_1−k), and define the map E_ρ as the sequential composition of the encoding map E, preceded by the parallel composition of |ρ) (which is a map from the trivial system I to A_1^⊠(k_1−k)) with the identity I_{A^⊠k} on the remaining k systems, i.e. E_ρ := E(|ρ) ⊠ I_{A^⊠k}).

Similarly for the decoding: we take the decoding D and a deterministic effect (e| on A_1^⊠(k_1−k) (a map from A_1^⊠(k_1−k) to the trivial system), and define the sequential composition of D with (e| ⊠ I_{A^⊠k}, obtaining D_e := ((e| ⊠ I_{A^⊠k})D.

The pair (E_ρ, D_e) perfectly encodes A^⊠k: indeed, using the fact that parallel and sequential compositions commute, D_e E_ρ = ((e| ⊠ I_{A^⊠k}) DE (|ρ) ⊠ I_{A^⊠k}) = (e|ρ) I_{A^⊠k} = I_{A^⊠k}, where the last step follows from the fact that we have chosen a deterministic state |ρ) and a deterministic effect (e| to define the encoding E_ρ and the decoding D_e, so that (e|ρ) = 1. Now, for k > k_1, let m be the integer such that (m − 1)k_1 ≤ k < mk_1 and let k̄ := mk_1 − k ≤ k_1. One then proves that there exist an integer k_2 and a pair Ẽ ∈ Tr_1(A_1^⊠k → A_2^⊠k_2) and D̃ that perfectly encode A_1^⊠k.

Given an event C, a collection of events {D_i}_{i∈Y} whose coarse-graining is C is called a refinement of C. Conversely, C is called the coarse-graining of the events {D_i}_{i∈Y}. We denote by R(C) the collection of all the refinements of C, and we call an event atomic if it admits only trivial refinements. The notions of refinement and refinable events give rise to the definitions of pure and mixed states.

Definition 2.4 (Pure and mixed states). ρ ∈ St(A) is called pure if it is extremal and deterministic. We will denote by PurSt(A) the set of all the pure states of system A. On the other hand, ρ ∈ St(A) is called mixed if it is neither atomic nor extremal.

A convex refinement of a state is a convex decomposition of it. Given any state ρ ∈ St_1(A), any convex decomposition of ρ made of pure states will accordingly be called pure.

Definition 2.5. Let ρ ∈ St(A) and Ψ ∈ St(AB). We say that Ψ is a dilation of ρ if there exists a deterministic effect e ∈ Eff_1(B) such that (I_A ⊠ (e|_B)|Ψ) = |ρ). If Ψ is also pure, then we say that it is a purification of ρ, and B is called the purifying system. We denote by D_ρ the set of all dilations of the state ρ, and by P_ρ the set of all its purifications.
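In the classical special case a dilation is simply a joint probability distribution whose marginal is the given state, and the deterministic effect is the all-ones covector. A minimal sketch under those assumptions (the specific joint distribution is illustrative):

```python
import numpy as np

# Joint state Psi of AB: Psi[i, j] = p(i, j).
Psi = np.array([[0.1, 0.3],
                [0.2, 0.4]])
e = np.ones(2)     # the unique deterministic effect on B
rho = Psi @ e      # (I_A x (e|_B)|Psi) = |rho): marginalize over B
assert np.allclose(rho, [0.4, 0.6])
```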
Trivially one has that P_ρ ⊆ D_ρ. The linear space St_R(A) can be endowed with a metric structure by means of the following norm, which has an operational meaning related to optimal discrimination schemes [8].
Definition 2.6 (Operational norm). The norm of an element δ ∈ St_R(A) is defined as ‖δ‖_op := sup [(a_0|δ) − (a_1|δ)], where the supremum runs over all binary observation tests {a_0, a_1}.
This norm reduces to the usual trace norm in the case of quantum theory. Moreover, it satisfies a monotonicity property, as stated in the following lemma (see [8] for the proof).

Lemma 2.1 (Monotonicity of the operational norm).
For any δ ∈ St_R(A) and C ∈ Tr(A → B) the following inequality holds: ‖Cδ‖_op ≤ ‖δ‖_op, (6) with equality holding iff C is reversible.
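For classical systems the operational norm reduces to the ℓ1 norm: the optimal binary test assigns a_0 to the positive components of δ and a_1 to the negative ones. The monotonicity of lemma 2.1 can then be checked against a random stochastic channel, as in this sketch (the particular δ and channel are illustrative assumptions):

```python
import numpy as np

def op_norm(delta):
    # sup over binary tests {a0, a1} of (a0|delta) - (a1|delta):
    # achieved by indicator effects on the positive/negative parts,
    # yielding the l1 norm of delta.
    return np.abs(delta).sum()

rng = np.random.default_rng(1)
delta = np.array([0.5, -0.2, -0.3])   # difference of two classical states

C = rng.random((3, 3))
C /= C.sum(axis=0)                    # column-stochastic channel

# Monotonicity: ||C delta||_op <= ||delta||_op
assert op_norm(C @ delta) <= op_norm(delta) + 1e-12
```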
An OPT can enjoy different degrees of locality according to the following definition [20].

Definition 2.7 (n-local discriminability). Let n ≤ m. We say that an OPT satisfies n-local discriminability if the effects obtained as conic combinations of parallel compositions of effects a_1, . . ., a_n (where a_j is k_j-partite with k_j ≤ n for every j) are separating for m-partite states.
We recall that a k-partite effect is an effect on a system obtained as the composition of k systems. The case n = 1, which is the case of classical and quantum theory, is simply called local discriminability. In the latter case, this property tells us that state tomography can be accomplished by means of local measurements only, and it is equivalent to the statement that for any pair of systems A, B with sizes D_A and D_B, the composite system AB has size D_AB = D_A D_B. Notice that a theory that is n-locally discriminable is also (n+1)-locally discriminable. Fermionic theory, real quantum theory and the bilocal classical theory considered here are examples of strictly bilocal theories, namely they satisfy definition 2.7 with n = 2 but not with n = 1. An OPT with local discriminability also necessarily satisfies the following feature.

Definition 2.8. An OPT satisfies atomicity of parallel composition of states if, for any pair of systems A and B and for every pair of atomic states φ ∈ St(A) and ψ ∈ St(B), the parallel composition φ ⊠ ψ is atomic.
This property is clearly satisfied by classical and quantum theory, and fermionic theory as well. The latter is a trivial example showing that the atomicity of parallel composition of states does not imply local discriminability. Moreover, examples of theories that violate atomicity of parallel composition of states can be constructed, and we will discuss one in detail in the present analysis.
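The tomographic reading of local discriminability mentioned above can be checked numerically in the simplest classical case: for two bits, products of local effects already span the full bipartite effect space, matching D_AB = D_A D_B = 4. A minimal sketch, assuming the classical vector representation of effects:

```python
import numpy as np

# Local effects on a classical bit: the two point effects and the
# deterministic effect.
local = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

# Product effects a (x) b on the composite of two bits.
products = [np.kron(a, b) for a in local for b in local]

# They span a space of dimension 4 = D_A * D_B: local discriminability.
assert np.linalg.matrix_rank(np.stack(products)) == 4
```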
We now introduce two properties that an OPT can have, and that are necessary for our discussion of the information content, whose definition will be recalled in the next subsection. These are strong causality and steering.

Definition 2.9 (Causal theories). An OPT satisfies strong causality if for every test {A_i}_{i∈X} and every collection of tests {B_j^{(i)}}_{j∈Y} labelled by i ∈ X, the conditioned collection {B_j^{(i)} A_i}_{(i,j)∈X×Y} is a test of the theory.
In other words, any theory that is strongly causal contains all the conceivable conditioned tests, namely those in which the second test to be performed is chosen according to the outcome of the first one. This is a strong notion of causality, since one can show that the above statement implies uniqueness of the deterministic effect [5,8], which corresponds to a weaker form of causality defined as follows.
Definition 2.10 (Weakly causal theories). An OPT satisfies weak causality if for every system type A there is a unique deterministic effect e_A.
In a non-deterministic theory (i.e. a theory where at least one event of the trivial system is neither 0 nor 1), strong causality implies convexity of the set of events of every type, including, of course, the sets of states of every system [21,5]. An example of an OPT that is weakly but not strongly causal is given in [22]. For every system of this theory the set of states is a simplex whose vertices, corresponding to the pure states, are perfectly discriminable, just as in standard classical theory. The fact that the set of states is a simplex for every system implies that the theory is weakly causal (by theorem 1 in [23]). However, the set of allowed tests between non-trivial systems only includes compositions of preparation and observation tests with permutations of subsystems of composite systems, and does not contain conditional tests, i.e. tests whose events are of the form (7). This implies that such a theory is not strongly causal.
We now come to the second property of interest, which is the property of steering.Steering as we define it is actually a property of both classical and quantum theory.
Property 1 (Steering). Let ρ ∈ St(A) and let {σ_i}_{i∈X} ⊆ St(A) be a refinement of ρ. Then there exist a system B, a state Ψ ∈ St(AB), and an observation test {b_i}_{i∈X} such that (I_A ⊠ (b_i|)|Ψ) = |σ_i) for every i ∈ X.

Notice that the state Ψ in the steering assumption must be a dilation of ρ (see definition 2.5), as one can easily verify upon summing over i ∈ X and noticing that the coarse-graining of all the effects in an observation test yields a deterministic effect. Moreover, we are not requiring that the same dilation Ψ steers all the possible refinements of ρ, differently from what happens in quantum theory. We remark indeed that in quantum theory a stronger version of steering holds: any ensemble of a given state can be steered by means of a purification of that state. In more precise terms, given a state ρ ∈ St(A) and a purification Φ ∈ PurSt(AB) of ρ, for any decomposition Σ_{i∈X} p_i σ_i of ρ there exists an observation test {b_i}_{i∈X} on B such that (I_A ⊠ (b_i|)|Φ) = p_i|σ_i). However, we do not need such a strong property for our purposes. In the following, when referring to steering we will mean property 1.
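Classically, steering is realized by a joint distribution whose columns (indexed by the ancilla B) are exactly the members of the refinement. The following sketch illustrates property 1 in that special case (the refinement chosen here is an illustrative assumption):

```python
import numpy as np

# A refinement {sigma_i} of rho, as sub-normalized probability vectors.
sigma = [np.array([0.3, 0.1]), np.array([0.2, 0.4])]
rho = sum(sigma)                     # coarse-graining

# Dilation Psi in St(AB): Psi[a, i] = sigma_i(a), so that measuring B
# with the point effects b_i steers A to sigma_i.
Psi = np.stack(sigma, axis=1)
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # observation test on B

steered = [Psi @ bi for bi in b]     # (I_A x (b_i|)|Psi) = |sigma_i)
assert all(np.allclose(s, t) for s, t in zip(steered, sigma))
assert np.allclose(sum(steered), rho)
```

Summing over i recovers ρ, confirming that Ψ is indeed a dilation of ρ, as remarked above.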
Finally, with the following set of definitions we make explicit what we mean by a classical theory. First of all, we say that an OPT is convex when, for all systems of the theory, the set of states (and more generally of transformations) is convex. As we remarked above, this is always the case for non-deterministic theories with strong causality.

Definition 2.11 (simplicial theory).
A simplicial theory is a finite-dimensional OPT where the extremal states of every system A are the vertices of a D A -simplex.

Definition 2.12 (Joint perfect discriminability). Let A be a system of the theory. A set of states {|ρ_i)_A}_{i∈X} is jointly perfectly discriminable if there exists an observation test {(a_i|_A}_{i∈X} such that (a_i|ρ_j) = δ_ij for all i, j ∈ X.

We now define classical theories only in the restricted context of weakly causal theories. The reason is that the definition would otherwise involve complications that are unnecessary for our present purposes. Indeed, the core of our analysis will deal with strongly causal theories, which are also weakly causal.

Definition 2.13 (classical theories).
A classical theory is a theory where any set of pairwise non-proportional atomic states is jointly perfectly discriminable.
As is clear, a classical theory in the sense of the above definition is not necessarily convex, even though the pure states are the extremal points of a simplex. However, when strong causality is assumed, the scenario is restricted to convex theories, and classicality in that case implies that the set of states is a simplex whose extremal points are perfectly discriminable pure states.
While atomicity of parallel composition of states (definition 2.8) is always implied by local discriminability, the converse is generally not true, a trivial example being fermionic theory, which is strictly bilocal but purity-preserving under composition. Nevertheless, a converse holds for n-locally discriminable theories that are also simplicial, as a corollary of theorems 2 and 3 in [23]: firstly, every simplicial theory satisfying n-local discriminability admits no entangled states iff it satisfies atomicity of parallel composition of states (theorem 3 of [23]); secondly, a simplicial theory is locally discriminable iff there exist no entangled states (theorem 2 of [23]). Therefore, a simplicial n-locally discriminable theory enjoys atomicity of parallel composition of states iff it is actually locally discriminable.
In the following we will deal with strongly causal classical theories, where the set of states of any system is a simplex whose extreme points, except for the null state (i.e. the state that yields zero on all the effects), are pure.Thus, atomicity of parallel composition for states is equivalent to purity of parallel composition of pure states.

Digitizable theories and the information content
In a general physical theory of information, information plays of course a central role. It is then reasonable to expect that one has a way to quantify it, and this in turn has to come along with a unit. In classical and quantum theory, the output of any physical source is digitized in terms of bits and qubits, respectively. In particular, focusing on the case of classical information theory, any string of length N, say i := i_1 . . . i_N, where each symbol is extracted from an alphabet of d letters, can be perfectly encoded on a distinct array of a suitable number of bits. The bit is indeed the reference system that we usually adopt as a unit for assessing the amount of information of a given classical source; an analogous role is played by the qubit in quantum information theory. Moreover, notice that one can actually choose any other system for the digitization of the outputs of a classical or quantum information source, with the only difference being that the entropy function, which numerically quantifies the information content, must be rescaled by a suitable multiplicative factor.
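The classical digitization just described can be made concrete: a string of N symbols from a d-letter alphabet fits into ⌈N log₂ d⌉ bits. A minimal sketch (the helper names `encode`/`decode` are ours, not the paper's):

```python
import math

def encode(symbols, d):
    """Pack a string over a d-letter alphabet into bits (little-endian)."""
    n = 0
    for s in symbols:
        n = n * d + s
    nbits = math.ceil(len(symbols) * math.log2(d))
    return [(n >> k) & 1 for k in range(nbits)]

def decode(bits, length, d):
    """Invert encode: recover the original symbol string."""
    n = sum(b << k for k, b in enumerate(bits))
    out = []
    for _ in range(length):
        out.append(n % d)
        n //= d
    return out[::-1]

msg = [2, 0, 1, 2, 1]                 # N = 5 symbols, d = 3
bits = encode(msg, 3)
assert decode(bits, len(msg), 3) == msg
assert len(bits) == math.ceil(5 * math.log2(3))   # 8 bits suffice
```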
These simple observations can be turned into a requirement that a theory must abide by, thus setting the stage for a meaningful definition of compression rate in the OPT framework. This is formalized as follows.
Definition 2.14 (Digitizability). We say that an OPT is digitizable if there exists a system B (called the o-bit) such that for any system X there exist k < ∞ and a pair of maps C ∈ Tr_1(X → B^⊠k) and F ∈ Tr_1(B^⊠k → X) such that F ∘ C = I_X. Moreover, if B_1 and B_2 are two such systems, then they are asymptotically equivalent.
As we have already mentioned, finite-dimensional classical and quantum theory are digitizable. In particular, all their systems are asymptotically equivalent in the sense of definition 2.1, and any system can serve as an o-bit. The OPT considered in [22], on the other hand, is an example of a theory that is not digitizable. This theory was constructed for a purpose other than studying digitizability, but it is not difficult to realize that it does not satisfy the latter property. As we have already mentioned, in this theory all systems are classical, in the sense that for any system the set of states is a simplex, and the vertices corresponding to non-null extremal states are jointly perfectly discriminable. Moreover, for any integer D there exists a type of system of size D. The set of allowed channels between non-trivial systems is restricted to compositions of deterministic states and effects with permutations of subsystems of composite systems. It is then clear that, given a system of type A, for any other system B of a different type it is not possible to find an integer M and a left-reversible map of type A → B^⊠M. Indeed, such a left-reversible map should have the system A as a subsystem of its output in order to be left-reversible, and given the uniqueness of the decomposition of integers into prime factors, this is impossible. Thus, the theory cannot be digitizable.
An information source is characterized by a system A and a state ρ ∈ St(A). Repeated uses of the source then generate a message that, generalizing the i.i.d. setting, has the form of a factorized state ρ^⊠N. Given a pair of maps E ∈ Tr_1(A^⊠N → B^⊠M) and D ∈ Tr_1(B^⊠M → A^⊠N), which we name encoding and decoding respectively, we need a figure of merit that establishes the quality of the codec map C := DE. Our choice comes as a consequence of the two following elementary observations: (i) the output message to which we have local access may be correlated with an ancilla, and (ii) it could be prepared in different ways. Since the aim is to preserve all the possible information gathered in the message, we use a figure of merit D(ρ^⊠N, C) defined as an optimization of the average error over all the possible decompositions of any dilation of the message. We then define an (N, ϵ)-compression scheme as an encoding-decoding pair (E, D) such that D(ρ^⊠N, DE) < ϵ, and we denote by E_{N,M,ϵ}(ρ) the set of all the (N, ϵ)-compression schemes. We now have all the ingredients for defining the information content of a source.
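As a classical benchmark for the quantity defined next: for a classical i.i.d. source, Shannon's source coding theorem states that the optimal asymptotic rate is the Shannon entropy H(p) bits per use. A minimal numerical sketch of that benchmark value:

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i log2 p_i, the optimal classical compression rate."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# A source emitting three symbols with these probabilities can be compressed
# at any rate above H(p) = 1.5 bits per use, and at no rate below it.
p = [0.5, 0.25, 0.25]
assert abs(shannon_entropy(p) - 1.5) < 1e-12
```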
We define the smallest achievable compression ratio for length N and tolerance ϵ, and the information content I(ρ) of the state ρ as its asymptotic rate (equation (10)). The well-posedness of this definition has been proved in [18], as a direct consequence of the digitizability assumption. As is clear, the definition of I(ρ) strongly depends on the choice of the figure of merit D(ρ^⊠N, C). The one that we have introduced takes into account all possible refinements of any dilation of ρ^⊠N. Alternatively, one can consider only the pure convex refinements of any dilation of ρ^⊠N and define a figure of merit D_pur accordingly. When the theory satisfies strong causality (definition 2.9), and if any state is proportional to a deterministic one, D and D_pur can be equivalently used to define the information content via formula (10). If a theory also satisfies steering (property 1), along with strong causality, one can use a figure of merit that is further simplified [18].

A case study: the information content in Bilocal Classical Theory

Bilocal Classical Theory (hereafter BCT) is a fully-fledged operational probabilistic theory that has been constructed with the aim of showing the independence of two fundamental notions: entanglement and complementarity. As we have already recalled, simplicial theories, in the sense of definition 2.11, admit entangled states iff they do not satisfy the local discriminability principle (see theorem 2 in [23]), and this can be explicitly seen in the case of BCT. Here we outline the features of the theory that are relevant for the present work: in particular, we recall the structure of the state spaces, which are essentially the same as those of classical theory, and, more importantly, the way in which systems are composed in parallel (postulates 1 and 2 of [23]) and the characterization of channels. For a detailed account we refer the reader to [19].
From the point of view of the state space, BCT does not differ from standard classical theory. In BCT, for every integer D > 1 it is assumed that there exists a unique system of size D. The set of states of each system is a simplex whose vertices are jointly perfectly discriminable, whence BCT is a convex classical theory according to definition 2.13. In particular, for any system A, the extremal states are the vertices of the simplex St(A), and the non-null extremal states, say {|i)_A}_{i=1}^{D_A}, are perfectly discriminated by an observation test {(j|_A}_{j=1}^{D_A}, i.e. (j|i)_A = δ_ij. Moreover, given that the pure states correspond to the vertices of a simplex, any other state can be uniquely decomposed in terms of pure ones. In [23] (see in particular theorem 1 therein) it has been shown that simplicial theories are necessarily weakly causal, whereby BCT satisfies weak causality.
The salient features of BCT are a consequence of the behaviour of the parallel composition of states. By theorem 4 in [23], in a simplicial theory the state space St(AB) of a composite system AB is fully characterised by: i) a choice of the size D_AB of the composite system AB; ii) an unambiguous labeling of the pure states of AB as |(ij)_k)_AB, for i ∈ {1, . . ., D_A}, j ∈ {1, . . ., D_B} and k in finite sets I_ij, together with a choice of probability distributions governing the composition of pure states. BCT is an explicit example of how the situation described above can be consistently realised. In the first place, the following rule for the size of bipartite systems is postulated: for any two systems A and B, the size of the composite system AB is given by D_AB = 2 D_A D_B. (12) Recalling that a theory satisfies local discriminability if and only if D_AB = D_A D_B, the fact that BCT does not enjoy such a property follows from the above compositional law (12). Actually, BCT is strictly bilocal [19], since it satisfies a constraint on the size of tripartite systems (see theorem 2 in [19]) expressed in terms of ∆_AB := D_AB − D_A D_B, as one can easily check using equation (12). The unambiguous labeling of the pure states of the composite system is made as follows: for any i ∈ {1, . . ., D_A} and j ∈ {1, . . ., D_B}, the set I_ij is given by {+, −}, so that the set of pure states of any composite system AB is {|(ij)_s)_AB : 1 ≤ i ≤ D_A, 1 ≤ j ≤ D_B, s ∈ {+, −}}, and for all pure states |i)_A ∈ PurSt(A), |j)_B ∈ PurSt(B) the following parallel composition rule holds: |i)_A ⊠ |j)_B = (1/2)|(ij)_+)_AB + (1/2)|(ij)_−)_AB. (13) When a third system is added, the association of signs obeys a corresponding rule (14), holding for all local indices i, j, k and signs s_1, s_2. As already mentioned, as a consequence of bilocality BCT admits entangled states. More precisely, all the pure states of any bipartite system are entangled [19].
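The bookkeeping implied by the composition rule (13) can be sketched numerically: composing N local pure states yields the uniform mixture of the 2^(N−1) pure states labelled by the sign strings s. The following is a hedged sketch of that counting only (the dictionary representation of mixtures is our own illustrative convention, not the paper's formalism):

```python
import itertools

def compose_pure(indices):
    """Mixture obtained by composing the local pure states |i_1), ..., |i_N):
    a uniform distribution over the 2**(N-1) sign strings s."""
    signs = list(itertools.product('+-', repeat=len(indices) - 1))
    return {(tuple(indices), s): 1.0 / len(signs) for s in signs}

mix = compose_pure([0, 1, 0])     # N = 3 local pure states
assert len(mix) == 2 ** (3 - 1)   # 4 entangled pure states in the mixture
assert abs(sum(mix.values()) - 1.0) < 1e-12
```

This makes concrete why a factorized preparation of pure local states is not pure in BCT: the sign degrees of freedom carry residual ignorance.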
According to equation (13), each time we compose states in parallel there is an additional degree of freedom s, associated with a sign + or −, which takes either of the two values with the same probability 1/2. This entails that a message of length N emitted from a source ρ = Σ_i p_i |i) is described by ρ^⊠N = 2^{−(N−1)} Σ_{i,s} p_{i_1} · · · p_{i_N} |(i_1 · · · i_N)_s), (15) where i and s collectively denote the string of N local indices and of N − 1 signs, respectively. Notice that, according to the rule of Eq. (14), the string of signs s depends on the order in which the N systems are associated. If the order of composition changes, however, one just has a change in the string of signs, s′ = f(s), according to Eq. (14), which is immaterial since s is a dummy index and f is an invertible function. Anyway, for the sake of clarity, we will ubiquitously adopt the convention that the expression in Eq. (15) refers to the left-associated composite system. For our purpose it is of fundamental importance to know how transformations, and in particular channels, are characterized, since this establishes how much freedom we have in devising suitable compression schemes (E, D). In [19] the authors first introduce the reversible transformations as follows (see in particular postulate 3): for any pair of non-trivial systems A and A′ of the same size D_A, R is a reversible transformation if and only if there exist a permutation of D_A elements and a sign τ_i for every i realizing its action, equation (16), on any |(ij)_s) ∈ PurSt(AE), for any non-trivial E. Then, transformations are introduced as those admitting a reversible dilation (postulate 4 in [19]). More precisely, for any pair of systems A and B, T ∈ Tr(A → B) if and only if there exist a reversible transformation R ∈ Tr(B′A → A′B), a state |Σ)_{B′} and an effect (H|_{A′} realizing T. The above realisation of a BCT transformation is given in an implicit form via its reversible simulation, whose action is explicitly defined in equation (16).
The following result, proved in [19] (see in particular lemmas 3 and 4 in appendix A), provides an explicit characterisation of transformations and channels (proposition 3.1): every transformation acts on pure states through a set of weights λ_{m,τ}, and if the transformation is deterministic, then {λ_{m,τ}}_{(m,τ)∈I(i)} is a probability distribution for every i.
We can now analyse the behaviour of the information content in BCT by applying the formalism developed in subsection 2.2. The first fact to be checked is that BCT is a digitizable theory, in the sense of definition 2.14. This is a mandatory step, since it allows for a meaningful operational definition of the information content, as discussed at the end of the previous section. The proof is given in appendix A (lemma A.1). In BCT, as in classical and quantum theory, any type of system can serve as o-bit. Here we choose the type of system with D = 2, which we will call a bibit from now on.
The major obstacle in computing the information content of a given state is the complexity of the figure of merit. The larger the set of states on which we must validate the compression scheme, the more difficult it is to devise one that works as we wish. However, for any state ρ ∈ St_1(A) of BCT one can find a dilation Π ∈ St(AE) from which all the other dilations can be computed by applying a suitable channel on the ancillary system E, as stated in the following proposition.

Proposition 3.2. Let
|Π)_AE be the dilation of ρ with joint probability distribution p_{ikτ}. Then, for any other dilation |Ψ)_AF of ρ, there exists a channel C ∈ Tr_1(E, F) such that (I_A ⊠ C)|Π)_AE = |Ψ)_AF. The proof of this proposition, and of all the other results of this and the subsequent section, is given in appendix A. The above statement, along with the fact that BCT satisfies the strong causality principle and the steering property, straightforwardly implies the following proposition, which drastically simplifies the task of devising a compression scheme.

Proposition 3.3. Given a state ρ ∈ St_1(A) and a compression scheme (E, D) for a message of length N, the figure of merit can be computed from the mother dilation alone. In other words, defining Ĩ(ρ) analogously to I(ρ) (equation (10)) by replacing D(ρ^⊠N, C) with D̃(ρ^⊠N, C), it holds that I(ρ) = Ĩ(ρ).

Theorem 3.1. Let A be a BCT system and |ρ)_A = Σ_i p_i |i)_A ∈ St_1(A). Then the information content of the source is

I(ρ) = (H(p) + 1)/2. (18)
A first corollary of formula (18) is that the information content is additive in BCT. Indeed, if Σ_i p_i |i)_A = |ρ)_A ∈ St(A) and Σ_j q_j |j)_B = |σ)_B ∈ St(B), then the factorized state is given by

ρ ⊠ σ = Σ_{i,j,s} (p_i q_j / 2) |(ij)_s)_AB,

and a straightforward application of equation (18) then gives

I(ρ ⊠ σ) = (H(p) + H(q) + 1 + 1)/2 = I(ρ) + I(σ).

We then deduce that atomicity of parallel composition of states is not a necessary condition for the additivity of the information content on factorized states. In particular, not even local discriminability is a necessary condition for the additivity of I(ρ). The latter fact, however, was already known from [24], where it is proven that the information content of a fermionic source is given by the von Neumann entropy of the state representing the source, just as in quantum theory, and fermionic theory is a bilocal theory (see [25, 26]).
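The additivity computation can be verified numerically. The sketch below assumes the formula I(ρ) = (H(p) + 1)/2 of theorem 3.1 and the factorized distribution p_i q_j/2 derived above; the helper names are ours, for illustration only.

```python
import math

def shannon(p):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def info_content(p):
    """Information content of a BCT state with pure-state distribution p,
    assuming theorem 3.1: I(rho) = (H(p) + 1) / 2."""
    return (shannon(p) + 1) / 2

p, q = [0.5, 0.5], [0.25, 0.75]
# Distribution of rho ⊠ sigma: weights p_i * q_j / 2 over the pure states (ij)_±
pq = [pi * qj / 2 for pi in p for qj in q for _ in ("+", "-")]
assert abs(sum(pq) - 1) < 1e-12
# H(p x q x sign) = H(p) + H(q) + 1, hence I is additive on factorized states:
assert abs(info_content(pq) - (info_content(p) + info_content(q))) < 1e-12
# A pure state has sharp distribution but strictly positive content: I = 1/2
assert info_content([1.0]) == 0.5
```

The last line anticipates the observation of the next paragraph: in BCT even pure states carry non-zero information content.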
Another interesting feature of the information content in this theory is that it is strictly positive for all states of any system. In particular, the Shannon entropy of any sharp probability distribution vanishes, whence, for any system A and any pure state |φ)_A, equation (18) yields I(φ) = 1/2 > 0. This is in contrast with what we know from classical, quantum and fermionic theory, where the information content vanishes if and only if the state is pure. In this respect, one is usually led to give the notion of purity an operational meaning, by saying that we have maximal knowledge about a physical system whenever it is in a pure state. In [18] (proposition V.2) it has been shown that atomicity of parallel composition of states (definition 2.8) and the uniqueness of purifications up to reversible transformations on the ancillary system are sufficient conditions for this interpretation. Notably, a converse is also true [18] (proposition V.2): if I(ϕ) = 0 whenever the state ϕ is pure, then state purity is preserved under parallel composition.
Here we explicitly see that, in a theory violating atomicity of parallel composition of states, even pure states can have non-vanishing information content.
In reference [16], the authors address the problem of noiseless coding within the framework of generalised probabilistic theories. Under certain assumptions, they prove (see in particular theorem VIII.I in [16]) that the measurement entropy S_1(ρ) is a lower bound for the rate of the compression task they define in section VIII. At first glance, this result seems to contradict theorem 3.1. However, we remark that two assumptions made in order to prove that result are not satisfied by BCT. The first one is that the composition of fine-grained measurements yields a fine-grained measurement, where the latter is defined in [16] as a measurement all of whose effects are atomic. BCT does not satisfy this property because of the violation of atomicity of parallel composition of states, which also entails that the composition of two fine-grained effects is no longer fine-grained. The second assumption is that, if the dimension of the state space of system A is d, then the dimension of the state space of the composite system A^⊠N is d^N (here, by dimension, we refer to the notion introduced in [16], namely the maximum number of elements of a fine-grained observation test). In the case of BCT, for a system A with size D_A (recall that the size is the dimension of the linear space St_R(A)), the dimension is d_A = D_A, so that for A^⊠N we find d_{A^⊠N} = D_{A^⊠N} = 2^{N−1} D_A^N > D_A^N, whence also this assumption is violated in our case. Therefore, our result does not contradict theorem VIII.I of reference [16], since BCT lies outside the domain of validity of that theorem. Nevertheless, that result is relevant for theories in which parallel composition preserves atomicity, in particular for PR-boxes, which represent a very interesting case to investigate.

Comparing the information content with entropies
Entropic-like quantities have been introduced in the context of generalized probabilistic theories in terms of the Shannon entropy function [15, 16, 17]. We recall their definitions (equations (20)-(22)) for the sake of completeness; the Shannon entropies and the mutual information on the r.h.s. of those equations refer to the random variables I, J distributed according to the joint probability distribution q(i, j) := p_i (a_j|ϕ_i)_A. For any i = 1, 2, 3 we introduce the regularized version of S_i as follows:

S_i^reg(ρ) := lim sup_{N→∞} S_i(ρ^⊠N)/N.

Elementary properties such as concavity, strong concavity and subadditivity have been studied for these quantities; moreover, they have been proven to be inequivalent. As we have seen in the foregoing section, the relation between the information content of a state and the Shannon entropy of the associated probability distribution is not trivial, as a consequence of the violation of atomicity of parallel composition of states. The latter property also has a remarkable consequence for the behaviour of the regularized entropies with respect to their single-system counterparts, which is stated in the proposition below.

Proposition 4.1. In any classical theory, for any system A and Σ_i p_i |i)_A = |ρ)_A ∈ St(A), one has S_i(ρ) = H(p). Moreover, whenever atomicity of parallel composition of states (definition 2.8) is violated, there exists a state Σ ∈ St(C) for some system C such that the strict inequality S_i^reg(Σ) > S_i(Σ) holds for any i = 1, 2, 3.
As a corollary of the above proposition, we can also notice that, for classical theories violating property 2.8, each entropy, in addition to being superadditive, also violates additivity on factorized states. Indeed, if there exist |i)_A ∈ PurSt(A) and |j)_B ∈ PurSt(B) such that |i) ⊠ |j) is mixed, we immediately see that

S_i(|i)_A ⊠ |j)_B) > 0 = S_i(|i)_A) + S_i(|j)_B),

where each S_i is given by the Shannon entropy of the respective decomposition, according to proposition 4.1.
While in classical and quantum theory all the S_i's and their regularized versions collapse to the Shannon and von Neumann entropy, respectively, thus boiling down to the same operational interpretation given by the noiseless coding theorems, much less is known about their operational meaning in a general theory. In BCT the regularized entropies are related to the Shannon entropy of the state according to the following proposition.

Proposition 4.2. Let A be a BCT system and |ρ)_A = Σ_i p_i |i)_A ∈ St_1(A). Then, for any i = 1, 2, 3, S_i(ρ) = H(p) and S_i^reg(ρ) = H(p) + 1.
A remarkable corollary of this proposition is that, in general, none of the S_i's, nor of the S_i^reg's, can be understood as the minimal compression rate. Notice that, while the additivity property on factorized states is violated by all the entropies, it is satisfied by their regularized versions. The result of proposition 4.2 is rather intuitive if we think of the particular compositional rule that BCT states satisfy. Indeed, at the level of single systems there is no difference with respect to standard classical theory, and this is true for any classical theory that does not satisfy atomicity of parallel composition of states. The effect of the violation shows up when we consider N copies of the same state: this operation yields an extra flat bit for each additional copy of the original state, and the appearance of these extra bits is captured by the regularized entropies. The factor 2 in the information content can also be intuitively expected, since when we compose bibits we likewise get additional room that can be used to allocate the message. Therefore, the departure of I(ρ) from S_i^reg(ρ) can be essentially ascribed to the peculiar compositional rule for systems.
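The mechanism of one extra flat bit per additional copy can be made concrete numerically. The sketch below builds the pure-state distribution of ρ^⊠N according to equation (15) and checks that S(ρ^⊠N)/N approaches H(p) + 1, while the information content (H(p) + 1)/2 of theorem 3.1 matches neither the plain nor the regularized entropy; the function names are ours, for illustration only.

```python
import math

def shannon(p):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def copies_distribution(p, n):
    """Pure-state distribution of rho^⊠N in BCT, as in equation (15):
    products p_{i1}...p_{iN} times a uniform weight over the 2^(N-1) signs."""
    dist = list(p)
    for _ in range(n - 1):
        # composing one more copy splits every weight into two equal signs
        dist = [w * x / 2 for w in dist for x in p for _ in ("+", "-")]
    return dist

p = [0.5, 0.25, 0.25]
h = shannon(p)  # single system: all three entropies reduce to H(p)
# Regularization: S(rho^⊠N)/N = H(p) + (N-1)/N, which tends to H(p) + 1
for n in (2, 3, 4):
    s_n = shannon(copies_distribution(p, n)) / n
    assert abs(s_n - (h + 1 - 1 / n)) < 1e-9
# The information content of theorem 3.1 differs from both H(p) and H(p) + 1:
info = (h + 1) / 2
assert info != h and info != h + 1
```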
We notice that the results of theorem 3.1 and proposition 4.2 are consistent with the following general bound from [18] (in particular, see the proof of lemma V.2 and the result of lemma V.1):

I(ρ) ≥ S_2(ρ)/log_2 D,

where D is a constant such that D_{B^⊠M} ≤ kD^M for some k. Indeed, in the case of BCT D_{B^⊠M} = (1/2) 4^M, whence D = 4. Actually, in the present case such a bound is saturated once S_2(ρ) is replaced by S_2^reg(ρ).

Conclusion
We have presented a full computation of the information content in a bilocal classical theory. The result is given in terms of the Shannon entropy of the probability distribution defining the state: if |ρ)_A = Σ_i p_i |i)_A, then I(ρ) = (H(p) + 1)/2. The definition of I(ρ) is hardly amenable to direct calculation in general theories. However, in the special case of BCT, the calculation is simplified by the fact that, for any state, there exists a "mother" dilation from which all the other ones can be obtained by applying a suitable channel on the ancilla. With respect to standard classical theory, the information content shows two differences, both of which can be ascribed to the violation of atomicity of parallel composition of states in BCT: i) there is an overhead of +1 in the numerator, due to the appearance of a bit each time a new copy of the same state is composed in parallel; since each such bit is uniformly distributed, we end up with the maximum overhead, which is indeed +1; ii) there is a factor 2 in the denominator, which follows from the fact that, when we compose bibits into registers, their dimension is given by the formula D_{B^⊠M} = 2^{M−1} D_B^M, so that the room for allocating messages per single bibit is almost "double" with respect to the size of the register. Notice that the factor 2 in the denominator is thus related to the meaning of the information content in the specific theory at hand, where the elementary systems for physical encoding are bibits. If we had to evaluate the classical information content, i.e. the ability of the source to encode classical information, then the regularized mutual information would be the right quantifier.
We have already noticed that the information content in BCT is additive on factorized states, i.e. states of the form ρ ⊠ σ, and this means that atomicity of parallel composition is not a necessary condition for additivity. A question that remains open is under which hypotheses, given two states ρ, σ ∈ St_1(A), one is able to prove that I(ρ ⊠ σ) = I(ρ) + I(σ). It might also be the case that additivity is a feature of I(ρ) that holds in full generality, as would be desirable for a measure of the information content. The results of the present work seem to suggest that atomicity of parallel composition plays a marginal role for this property.
Along with the information content we have also analyzed the behaviour of three different entropic functions that have been considered in the literature. At the level of single systems there is no difference with respect to standard classical theory, and they all coincide with the Shannon entropy of the state ρ. As a consequence, any classical theory is monoentropic [15]. The regularized entropies are clearly sensitive to the extra bit that arises when systems are composed, and they all turn out to be equal to H(p) + 1. This result then establishes the existence of theories of information where none of the proposed generalizations of entropy can be interpreted as the information content of the source, and neither can their regularized versions. Remarkably, this is true in a theory that is monoentropic. The departure of the regularized versions from their single-system counterparts is not peculiar to BCT: it actually takes place in any classical theory (in the sense of definition 2.13) whenever atomicity of parallel composition of states is violated, as proved in proposition 4.1.
As we have already noticed at the end of the last section, in the case of BCT there is a relation between the regularized entropy S_2^reg(ρ) and the information content of the following form:

I(ρ) = S_2^reg(ρ)/log_2 D, (23)

where D is a constant such that the relation D_{B^⊠M} ≤ kD^M holds for some k. In the case of BCT, B is the bibit, and this relation is saturated with k = 1/2 and D = 4, whence the equation above trivially holds. One might be tempted to conjecture that a result of this form holds for any classical theory, but it is not difficult to realize that this is not the case. Consider a classical theory with only one type of system, say the bit (whose size is 2), satisfying local discriminability and, consequently, atomicity of parallel composition. Now restrict the allowed tests of the theory to preparation tests, observation tests, and all possible permutations of the bits when more bits are composed in parallel. It is easy to realize that there is no protocol that compresses a source represented by a mixed state of a single bit, so that I(ρ) = 1. But proposition 4.1 implies that S_i(ρ) = h(p), where h(p) is the binary entropy of the bit state ρ, and the same holds for S_i^reg(ρ) by local discriminability, whence the conjecture is false.
The main lesson we learn from the results presented so far is that a foundational treatment of the notion of information content cannot ignore the compositional structure of a physical theory: as we have seen, the latter heavily affects its behaviour even in a classical theory. Moreover, as we have just argued, the allowed transformations also play a significant role, as they might severely restrict the freedom in compressing.
We leave open for further studies what happens if one considers non-local classical theories with a no-restriction hypothesis on the allowed transformations. It might be the case that a relation very similar in form to equation (23) holds. As a first step beyond the non-local classical scenario, one could ask how the present formalism applies to other known non-local theories, in order to investigate possible deviations of the information content from the expected behaviour in the absence of local discriminability. Recently, the authors studied information compression in fermionic theory [24], which is a strictly bilocal theory, showing that the information content equals the von Neumann entropy of the fermionic state; a similar analysis could be carried out for real quantum theory, which also violates local discriminability. Another model that could highlight relevant features of the information content is that of Ref. [27], where finite-dimensional real, complex and quaternionic quantum theories have been unified within a single category, in which complex quantum systems compose in a non-trivial way. Another interesting question is what happens in the case of Popescu-Rohrlich boxes [28]. It is possible that one of the entropies equals the information content there: we already know that I(ρ) ≥ S_2(ρ), so the missing part is achievability, namely the direct part of a noiseless coding theorem, which would establish I(ρ) = S_2(ρ). The question is particularly relevant for PR-boxes in light of the fact that the three entropies are known to be inequivalent in that context.
Answering fundamental questions about information content and entropic functions is a first step toward the formulation of area laws and holographic principle beyond the standard quantum scenario.Both area laws and the holographic principle rely on the notion of entropy, which in turn is operationally interpreted as a quantifier of uncertainty (or information content).Understanding to what extent those laws can be generalized, independently of the nature of the systems involved, may shed new light on the relation between microscopic and large-scale physical phenomena in terms of localization and flow of information [29,30,31].

A Proofs of the main results
Lemma A.1. BCT is a digitizable theory.
Proof. We show that any system B of any size D_B can serve as o-bit for BCT. Indeed, let A be any other system of the theory, denote by D_A its size, and let M be given by formula (24). Then let h : {1, . . ., D_A} → {1, . . ., D_B}^M × {+, −}^{M−1} be an injective function, whose existence is guaranteed by the choice of M. We then set λ^{(i)}_{h(i),+} = 1 for all i ∈ {1, . . ., D_A} for the encoding. For the decoding, for every j_s such that there exists a unique i satisfying h(i) = j_s we define µ^{(j_s)}_{i,+} = 1, while for every other j_s we can freely choose any probability distribution at our wish. It is now easy to realize that, for any i, j and E, the recovery condition of equation (25) holds. Now consider two systems of BCT, say B_1 and B_2. For any integer number k_1 of systems B_1, the minimal number M_2^min(k_1) of systems B_2 needed to perform the encoding in a perfectly recoverable way, as in equation (25) (with A = B_1^⊠k_1), is given by formula (24), and similarly for M_1^min(k_2).

Since the set of states is a simplex, the pure states are affinely independent, and this entails that the condition (I_A ⊠ C)|Π)_AE = |Ψ)_AF is satisfied iff the corresponding equation holds for any i, j and s. For p_ikτ as in the statement, and since Σ_{j,s} q_ijs = p_i, for those indices i such that p_i = 0 the equation trivially holds whatever the choice of λ_js is, while for those i's such that p_i > 0, λ is an admissible solution.
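Assuming that digitizability reduces to counting pure states, i.e. that M copies of B host an injective encoding of A as soon as 2^{M−1} D_B^M ≥ D_A, the minimal M of formula (24) can be sketched as follows (this is our reading of the elided formula; the helper names are hypothetical):

```python
def pure_state_count(d: int, m: int) -> int:
    """Number of pure states of B^⊠M in BCT: 2^(M-1) * D_B^M index-sign labels."""
    return 2 ** (m - 1) * d ** m

def m_min(d_a: int, d_b: int) -> int:
    """Minimal number of copies of B (size d_b) whose pure states can host an
    injective encoding of the D_A pure states of A, as in lemma A.1."""
    m = 1
    while pure_state_count(d_b, m) < d_a:
        m += 1
    return m

# One bibit (D = 2) has 2 pure states; two bibits already have 2^1 * 2^2 = 8
assert pure_state_count(2, 2) == 8
# Encoding a system of size 5 needs 2 bibits, since 2 < 5 <= 8
assert m_min(5, 2) == 2
```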
In order to relate the information content I(ρ) to the Shannon entropy H(p) of the probability distribution defining the unique pure-state decomposition of ρ, we use standard techniques from classical information theory. In particular, we remind the reader of the notions of typical string and typical set, along with some properties of the latter.
Let p = {p_i}_{i∈X} be a probability distribution, with X a finite outcome set whose labels are denoted by i. For a given N, consider the string i := i_1 . . . i_N, associated with the factorized probability p(i) := Π_{k=1}^N p_{i_k}. For any δ > 0, the (N, δ)-typical set T_δ^N(p) is the subset of all strings of length N defined as follows:

T_δ^N(p) := { i : | −(1/N) log_2 p(i) − H(p) | ≤ δ },

where H(p) := −Σ_{i∈X} p_i log_2 p_i denotes the Shannon entropy of the probability distribution p. Correspondingly, a string belonging to T_δ^N(p) is called a typical string. Notice that, reformulating the definition of typical set, a string is typical iff the associated probability is bounded as follows:

2^{−N(H(p)+δ)} ≤ p(i) ≤ 2^{−N(H(p)−δ)}. (28)

The typical set has the following relevant properties, which will be extensively used in the proof of theorem 3.1.

Theorem A.1. Let p be a probability distribution, and for δ > 0 let T_δ^N(p) denote the (N, δ)-typical set. Then:
1. Let η > 0. Then there exists N_0 such that, for any N ≥ N_0, Σ_{i∈T_δ^N(p)} p(i) ≥ 1 − η.
2. Let η > 0. Then there exists N_0 such that, for any N ≥ N_0, the cardinality of T_δ^N(p) is bounded as (1 − η) 2^{N(H(p)−δ)} ≤ |T_δ^N(p)| ≤ 2^{N(H(p)+δ)}.

This is a standard result, whose proof can be found in any of the references [32, 33]. With this tool in our hands, we can prove the following.

Lemma A.2. Let S(N) be any collection of strings i_s with |S(N)| ≤ 2^{2NR−1} for some R < (H(p) + 1)/2. Then, for any η > 0, there exists N_0 such that for any N ≥ N_0 one has Σ_{i_s∈S(N)} p(i) 2^{−(N−1)} ≤ η.

Proof. Let S be the set of all possible strings i_s, set ∆ := H(p) + 1 − 2R > 0, take δ < ∆, and define S_1 := S(N) ∩ (T_δ^N(p) × {+, −}^{N−1}) and S_2 := S(N) ∩ (T_δ^N(p) × {+, −}^{N−1})^c, where A^c denotes the complement of a set A. Then consider the corresponding splitting of the sum. The first term is bounded as

Σ_{i_s∈S_1} p(i) 2^{−(N−1)} ≤ 2^{2NR−1} 2^{−N(H(p)−δ)} 2^{−(N−1)} = 2^{−N(∆−δ)} ≤ η/2,

thanks to the definition of the typical set (see equation (28)), provided that we take N ≥ N_1 with N_1 sufficiently large. For the term with S_2 we use item 1 of theorem A.1, which implies that for η/2 > 0 there exists N_2 such that, for any N ≥ N_2, Σ_{i∉T_δ^N(p)} p(i) ≤ η/2. Setting N_0 = max{N_1, N_2} we have the thesis.
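The two properties of theorem A.1 can be observed numerically for small N. The following sketch is our own brute-force illustration of the standard asymptotic equipartition property, not part of the proofs:

```python
import itertools
import math

def shannon(p):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def typical_set(p, n, delta):
    """(N, delta)-typical set: strings whose per-symbol surprisal
    -log2(p(s))/N is within delta of H(p)."""
    h = shannon(p)
    typical = []
    for s in itertools.product(range(len(p)), repeat=n):
        prob = math.prod(p[i] for i in s)
        if abs(-math.log2(prob) / n - h) <= delta:
            typical.append((s, prob))
    return typical

p, n, delta = [0.8, 0.2], 12, 0.2
ts = typical_set(p, n, delta)
h = shannon(p)
# Item 1: the typical set carries most of the probability as N grows
assert sum(prob for _, prob in ts) > 0.45
# Item 2: its cardinality is at most 2^(N(H + delta))
assert len(ts) <= 2 ** (n * (h + delta))
# Equation (28): every typical string obeys the two-sided probability bound
assert all(2 ** (-n * (h + delta)) <= prob <= 2 ** (-n * (h - delta)) for _, prob in ts)
```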
Before giving the proof, we provide an explicit formula to compute the figure of merit in equation (29) in terms of the probability distributions that characterise the encoding and decoding channels. Let us consider a BCT source represented by a state |ρ)_A = Σ_{i∈X} p_i |i)_A on a system A, where {p_i}_{i∈X} is a probability distribution. Fix the number N of copies of A and consider a compression scheme (E, D) for M bibits, i.e. E ∈ Tr_1(A^⊠N → B^⊠M) and D ∈ Tr_1(B^⊠M → A^⊠N).
For any i s consider the norm where we have decomposed the sum into three terms.
For any element ρ ∈ St_R(A), say ρ = Σ_i c_i |i)_A, we have ∥ρ∥_op = Σ_i |c_i|. Therefore, using the normalisation conditions for {λ} and {µ}, the figure of merit can be written in the form used below.

Proof of theorem 3.1. We first prove achievability, namely that I(ρ) ≤ (H(p) + 1)/2. Let δ > 0, and for any N consider the following number of bibits:

M := ⌈N(H(p) + δ + 1)/2⌉.

By item 2 of theorem A.1 and the above choice of M we have that, for all N ∈ ℕ,

2^{2M−1} ≥ |T_δ^N(p)| · 2^{N−1}.

This bound entails the existence of a subset P_N ⊆ PurSt(B^⊠M) with cardinality equal to |T_δ^N(p)| 2^{N−1}. Denote by S_{P_N} the set of strings associated with P_N. We will use the same notation introduced in the proof of lemma A.1 and below the proof of lemma A.2: any element of S_{P_N} is specified by M local indices t = t_1, . . ., t_M ∈ {1, 2}^M, one for each B in the composition, and M − 1 additional signs n = n_1, . . ., n_{M−1} emerging from the violation of purity of parallel composition. We then denote any element of S_{P_N} by the multi-index t_n. We now define the following compression scheme (E, D) for any N.

1. The encoding E is defined by a set of probability distributions λ^{(i_s)}_{t_n τ}, one for each multi-index i_s, where i ∈ {1, . . ., D_A}^N and s ∈ {+, −}^{N−1} (recall the notation in the proof of lemma A.1). Let h : T_δ^N(p) × {+, −}^{N−1} → S_{P_N} be any injective function that associates each typical string i_s with a distinct element t_n of S_{P_N}. Then, for any such i_s we set λ^{(i_s)}_{h(i_s),+} = 1. Notice that, having defined λ^{(i_s)}_{t_n τ} for any t_n and τ and for any i_s, we have fully specified the action of E on all the pure states of A^⊠N, even if not all of them directly intervene in the evaluation of the figure of merit.
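Under our reading of the elided choice of M, namely M = ⌈N(H(p) + δ + 1)/2⌉, the counting argument can be checked numerically; the sketch below verifies that the register of M bibits has enough pure states to host the typical index-sign strings, and that the rate M/N tends to (H(p) + 1)/2 (helper names are ours):

```python
import math

def bibit_register_pure_states(m: int) -> int:
    """Pure states of a register of M bibits in BCT: 2^(M-1) * 2^M = 2^(2M-1)."""
    return 2 ** (2 * m - 1)

def bibits_needed(n: int, h: float, delta: float) -> int:
    """Assumed reading of the elided choice: M = ceil(N(H + delta + 1)/2)."""
    return math.ceil(n * (h + delta + 1) / 2)

assert bibit_register_pure_states(2) == 8

h, delta = 0.5, 0.1  # illustrative source entropy and typicality parameter
for n in (10, 100, 1000):
    m = bibits_needed(n, h, delta)
    # capacity check in exponents: 2M - 1 >= N(H + delta) + N - 1, so the
    # register hosts the at most 2^(N(H+delta)) * 2^(N-1) typical labels
    assert 2 * m - 1 >= n * (h + delta) + n - 1
# the compression rate M/N tends to (H + 1)/2 as delta -> 0, as in theorem 3.1
assert abs(bibits_needed(10**6, h, 0) / 10**6 - (h + 1) / 2) < 1e-6
```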

Let {µ^{(t_n)}_{i_s τ}} be the probability distributions defining the decoding D. For any t_n ∈ S_{P_N} we simply set µ^{(t_n)}_{h^{−1}(t_n),+} = 1, namely, we invert the action of the encoding. Indeed, for any typical string i_s = h^{−1}(t_n) the decoding recovers the corresponding pure state exactly.

Before giving the proof of proposition 4.1 we recall the following lemma (lemma 1 in [23]), reported below as lemma A.3. While the left hand side is mixed by hypothesis, the right hand side should be pure, since the whole state is assumed to be pure (see the discussion below lemma A.3). We then conclude that |k)_AB ⊠ |k′)_AB must necessarily be mixed. Therefore, Σ^⊠2 has a decomposition in terms of pure states where at least one of the conditioned probability distributions {q_{ℓ|k,k′}}_ℓ is non-trivial. Now, notice that for any system A and any state |ρ)_A = Σ_i p_i |i)_A one has S_i(ρ) = H(p). The argument we give here is the same as the one proposed in [17] for classical theory (see in particular proposition 13 and theorem 3(i)); but since states are separating for effects in any OPT, it also applies to any classical theory defined according to definition 2.13. The case of S_3 is obvious, given the uniqueness of the decomposition in terms of pure states. Now, since states are separating for effects, any other atomic effect is proportional to an effect of the perfectly discriminating test, i.e., (a|_A = λ(i|_A for some i, and H(X|J) = 0 for the perfectly discriminating test. A trivial computation then shows the claim. Finally, notice that the subsequence S_i(Σ^⊠2^k)/2^k is increasing, and the result follows since the lim sup of the whole sequence is greater than the lim sup of any of its subsequences.
Proof of proposition 4.2. BCT is a classical theory, therefore we immediately have that S_i(ρ) = H(p) for any i = 1, 2, 3 (see proposition 4.1). For S_i^reg, we just notice that S_i(ρ^⊠N) is the Shannon entropy of the factorized joint distribution p_{i,s} = p_i/2^{N−1} of equation (15), which equals N H(p) + N − 1, and the result follows by taking the limit.

For k ≤ k_1 we have seen that there exists a pair (Ẽ, D̃) mapping A_1^⊠k_1 onto A_2^⊠k_2, i.e., such that D̃ Ẽ = I_{A^⊠(k_1−k)}. Then consider the encoding map from A_1^⊠k_1 to A_2^⊠mk_2 defined as the composition of Ẽ with m − 1 copies of E, say Ẽ ⊠ E^⊠(m−1)k_1; similarly, for the decoding take D̃ ⊠ D^⊠(m−1)k_1. Here h denotes an injective function; to denote the pair (j, s) ∈ {1, . . ., D_B}^k × {+, −}^{k−1} we use the shortened notation j_s. The action of the encoding E and of the decoding D is defined by two sets of probability distributions, λ^{(i)}_{j_s,τ} and µ^{(j_s)}_{i,τ}.

Proof of proposition 3.2.
Any deterministic transformation C ∈ Tr_1(E, F) acts on the set of pure states of AE through weights λ^{(k)}_{jτ}: for any k, {λ^{(k)}_{jτ}} is a probability distribution, and determining C amounts to determining λ^{(k)}_{jτ} for every k. By applying the channel C to the state |Π)_AE defined in the statement, expanding, and re-collecting the signs suitably, one finds the thesis. In the proof of theorem 3.1, E ∈ Tr_1(A^⊠N → B^⊠M) and D ∈ Tr_1(B^⊠M → A^⊠N); by proposition 3.1, E and D are characterised by two probability distributions, {λ^{(i_s)}_{t_n τ}} and {µ^{(t_n)}_{i_s τ}} respectively, where i_s and t_n are shortened notations for the pairs (i, s) ∈ {1, . . ., D_A}^N × {+, −}^{N−1} and (t, n) ∈ {1, 2}^M × {+, −}^{M−1} respectively. Now, recall that the figure of merit is computed according to proposition 3.3. In particular, for any i ∈ T_δ^N(p) and any s ∈ {+, −}^{N−1}, the diagrammatic equation maps the pure state labelled by i_s to the pure state labelled by h(i_s). If i ∉ T_δ^N(p), for any s set λ^{(i_s)}_{t̄_s,+} = 1 for a fixed t̄_s ∈ S_{P_N}.

Lemma A.3.
Consider a simplicial OPT satisfying n-local discriminability for some positive integer n. Then, for all systems A, B and all non-null extremal states |k)_AB of the composite system AB, there exists a unique product of non-null extremal states |i_k)_A ⊠ |j_k)_B such that |k)_AB convexly refines |i_k)_A ⊠ |j_k)_B.

Since we are focusing on strongly causal theories, pure states are all and only the non-null extremal states, as already mentioned at the end of subsection 2.1. The above lemma tells us that for any |k)_AB ∈ PurSt(AB) there exist |i_k)_A ∈ PurSt(A), |j_k)_B ∈ PurSt(B) and a probability distribution {p_ℓ}_{ℓ∈L} such that

|i_k)_A ⊠ |j_k)_B = Σ_{ℓ∈L} p_ℓ |ℓ)_AB (36)

and |k)_AB = |ℓ)_AB for some ℓ ∈ L. It is easy to see that for any pure state |k)_AB of a bipartite system AB, both (j_k|_B |k)_AB and (i_k|_A |k)_AB are also pure. Indeed, if we apply the effect (j_k|_B on B to the equation above, by purity of |i_k) it follows that |i_k)_A = (j_k|_B |ℓ)_AB for any ℓ ∈ L, and in particular for |k), which is then pure. Similarly if one applies the effect (i_k|_A on A. It will be useful to keep this fact in mind in the proof of proposition 4.1.

Proof of proposition 4.1. By hypothesis, there exist systems A and B and |i)_A ∈ PurSt(A), |j)_B ∈ PurSt(B) such that |Σ)_AB := |i)_A ⊠ |j)_B is mixed, where {p_k}_{k∈I_ij} is a non-trivial probability distribution and {|k)_AB}_{k∈I_ij} is the corresponding set of pure states of AB. Now consider |Σ^⊠2)_AB, which can be decomposed accordingly. If |k)_AB ⊠ |k′)_AB is pure, then one has a contradiction, in that

Σ_{j∈Y} p_j B_j. We say that a convex refinement is trivial if B_j = C for any j ∈ Y.

Definition 2.3 (Atomic, refinable and extremal events). An event C is called atomic if it admits only trivial refinements, and refinable if it is not atomic.
Thus, for any state ρ of a classical theory, it follows that any other atomic observation test {a_i} is such that H(a_i(ρ)) ≥ H(p), by concavity of the function −x log_2 x, and the equality is achieved by the perfectly discriminating test, whence S_1(ρ) = H(p). Finally, notice that for any state ρ of a classical theory one has S_2(ρ) = H(p) − inf_{{a_i}∈O_at} H(X|J).