Defining quantum divergences via convex optimization

We introduce a new quantum R\'enyi divergence $D^{\#}_{\alpha}$ for $\alpha \in (1,\infty)$ defined in terms of a convex optimization program. This divergence has several desirable computational and operational properties, such as an efficient semidefinite programming representation for states and channels, and a chain rule property. An important property of this new divergence is that its regularization is equal to the sandwiched (also known as the minimal) quantum R\'enyi divergence. This allows us to prove several results. First, we use it to obtain a converging hierarchy of upper bounds on the regularized sandwiched $\alpha$-R\'enyi divergence between quantum channels for $\alpha>1$. Second, it allows us to prove a chain rule property for the sandwiched $\alpha$-R\'enyi divergence for $\alpha>1$, which we use to characterize the strong converse exponent for channel discrimination. Finally, it allows us to obtain improved bounds on quantum channel capacities.


Introduction
Given nonnegative vectors $P, Q \in \mathbb{R}^{\Sigma}$, the $\alpha$-R\'enyi divergence is defined, for $\alpha \in (1,\infty)$ and provided $P \ll Q$, as
$$D_{\alpha}(P \| Q) = \frac{1}{\alpha-1} \log \sum_{x \in \Sigma} P(x)^{\alpha} Q(x)^{1-\alpha}. \qquad (1)$$
Here $P \ll Q$ means that for any $x \in \Sigma$, $Q(x) = 0$ implies that $P(x) = 0$, and the log is taken to be base 2. This definition has found many applications in information theory and beyond; we refer to the survey paper [37] for more general definitions and properties of this quantity. To generalize this notion to quantum states $\rho$ and $\sigma$, which are now positive semidefinite operators on $\mathbb{C}^{\Sigma}$, the interpretation of the multiplication appearing in the definition matters, and multiple definitions exist. Such definitions are systematically studied in [35]. We mention two important examples for our work. For positive semidefinite operators $\rho$ and $\sigma$ on $\mathbb{C}^{\Sigma}$, provided $\rho \ll \sigma$ (i.e., the support of $\rho$ is contained in the support of $\sigma$), the geometric divergence was defined by [26] as
$$\widehat{D}_{\alpha}(\rho \| \sigma) = \frac{1}{\alpha-1} \log \mathrm{tr}\left[\sigma \left(\sigma^{-1/2} \rho\, \sigma^{-1/2}\right)^{\alpha}\right] \qquad (2)$$
and the sandwiched divergence by [29,44] as
$$\widetilde{D}_{\alpha}(\rho \| \sigma) = \frac{1}{\alpha-1} \log \mathrm{tr}\left[\left(\sigma^{\frac{1-\alpha}{2\alpha}} \rho\, \sigma^{\frac{1-\alpha}{2\alpha}}\right)^{\alpha}\right]. \qquad (3)$$
The inverses here should be understood as generalized inverses, i.e., inverses on the support. When $\rho \ll \sigma$ is not satisfied, both quantities are set to $\infty$. Whenever $\rho$ and $\sigma$ commute, both definitions agree, and as $\rho$ and $\sigma$ can be diagonalized in the same basis this also matches the classical definition (1). For this reason, throughout the paper, if $\rho$ and $\sigma$ commute, then we simply write $D_{\alpha}(\rho \| \sigma)$ for the classical $\alpha$-R\'enyi divergence in definition (1).
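As an illustration of these definitions, the following numerical sketch (our own, not part of the paper; the helper names `mpow` and `renyi_*` are ours) evaluates the classical definition (1) together with the geometric (2) and sandwiched (3) divergences on a pair of commuting (diagonal) states, where all three coincide. It assumes numpy is available.

```python
import numpy as np

def mpow(A, p):
    """Matrix power of a Hermitian PSD matrix via eigendecomposition,
    with the generalized-inverse convention 0**p = 0 for negative p."""
    w, V = np.linalg.eigh(A)
    wp = np.array([x**p if x > 1e-12 else 0.0 for x in w])
    return (V * wp) @ V.conj().T

def renyi_classical(P, Q, a):
    # (1): D_a(P||Q) = 1/(a-1) * log2 sum_x P(x)^a Q(x)^(1-a)
    return np.log2(np.sum(P**a * Q**(1.0 - a))) / (a - 1)

def renyi_sandwiched(rho, sigma, a):
    # (3): 1/(a-1) * log2 tr[(sigma^((1-a)/2a) rho sigma^((1-a)/2a))^a]
    s = mpow(sigma, (1.0 - a) / (2 * a))
    return np.log2(np.trace(mpow(s @ rho @ s, a)).real) / (a - 1)

def renyi_geometric(rho, sigma, a):
    # (2): 1/(a-1) * log2 tr[sigma (sigma^(-1/2) rho sigma^(-1/2))^a]
    s = mpow(sigma, -0.5)
    return np.log2(np.trace(sigma @ mpow(s @ rho @ s, a)).real) / (a - 1)

P = np.array([0.5, 0.5]); Q = np.array([0.25, 0.75])
rho, sigma = np.diag(P), np.diag(Q)
a = 2.0
# all three agree for commuting (here diagonal) inputs
print(renyi_classical(P, Q, a), renyi_sandwiched(rho, sigma, a), renyi_geometric(rho, sigma, a))
```

For non-commuting inputs the three quantities generally differ, with the geometric divergence the largest of the family.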
Another natural definition is the measured R\'enyi divergence, which is obtained by performing the same measurement on $\rho$ and $\sigma$ and then considering the classical R\'enyi divergence of the outcome distributions:
$$D^{\mathbb{M}}_{\alpha}(\rho \| \sigma) = \sup_{\{|v_x\rangle\}_x} D_{\alpha}\big(\{\langle v_x|\rho|v_x\rangle\}_x \,\big\|\, \{\langle v_x|\sigma|v_x\rangle\}_x\big), \qquad (4)$$
where the supremum is taken over all orthonormal bases $\{|v_x\rangle\}_x$ of $\mathbb{C}^{\Sigma}$. This definition was proposed by [11,18] and we refer to [5] for the equivalence between the different variants.
Contributions. In this paper, we put forward another way of defining quantum R\'enyi divergences, through a convex optimization program. Even if such divergences may not have operational interpretations in terms of some information processing task, we demonstrate in this paper that they can nonetheless be useful tools for proofs and for computations. Given $\alpha \in (1,\infty)$, we define the $\#$ R\'enyi divergence of order $\alpha$ between two positive semidefinite operators $\rho, \sigma$ as
$$D^{\#}_{\alpha}(\rho\|\sigma) = \frac{1}{\alpha-1}\log Q^{\#}_{\alpha}(\rho\|\sigma) \quad \text{with} \quad Q^{\#}_{\alpha}(\rho\|\sigma) = \min\big\{ \mathrm{tr}(A) : A \geq 0,\ \rho \leq \sigma \#_{\frac{1}{\alpha}} A \big\}. \qquad (5)$$
Here $\sigma \#_{\frac{1}{\alpha}} A$ denotes the $\frac{1}{\alpha}$-geometric mean of $\sigma$ and $A$. We recall the definitions and properties of the matrix geometric mean in Section 2 below. Using the joint concavity of the matrix geometric mean, the optimization program in (5) is convex, and for rational values of $\alpha$ it can be expressed as a semidefinite program [14,33]. We show the following properties:
• We prove in Section 3 that $D^{\#}_{\alpha}$ satisfies the data processing inequality and matches $D_{\alpha}$ for commuting operators. In addition, it is subadditive under tensor products and, when regularized, it is equal to the sandwiched divergence, i.e., we have (Proposition 3.4)
$$\lim_{n\to\infty} \frac{1}{n} D^{\#}_{\alpha}(\rho^{\otimes n} \| \sigma^{\otimes n}) = \widetilde{D}_{\alpha}(\rho\|\sigma).$$
• We establish in Section 4 that the extension of D # α to channels has an expression as a convex optimization program (similar to the one for states (5)) and satisfies subadditivity under tensor product as well as a chain rule property. Furthermore, when regularized, it gives the regularized sandwiched divergence between channels.
We then give some applications of D # α in Section 5. • We show that, for α > 1, the regularized sandwiched α-Rényi divergence between quantum channels can be computed to arbitrary precision in finite time (Theorem 5.1).
• We prove a new chain rule property for D α for α > 1 (Corollary 5.2). In turn, the new chain rule property allows us to characterize the strong converse exponent for channel discrimination and show that in this regime adaptive strategies do not offer an advantage over nonadaptive strategies (Section 5.2.1).
• We give improved bounds on amortized entanglement measures, which can be used for example to bound the quantum capacity of channels with free two-way classical communication (Section 5.2.2). We restrict our focus in this paper to the resource of entanglement, but we expect the same techniques to be applicable in many other resource-theoretic frameworks [8].
In addition to these applications, we mention that a close variant of this divergence is introduced in [7] to bound conditional entropies for quantum correlations.

Remark 1.1 (Convention for unnormalized ρ). We note that divergences are most interesting when the first argument is normalized, i.e., $\mathrm{tr}(\rho) = 1$, but it is convenient to keep the definition general. To define divergences for a general positive semidefinite operator $\rho$, we use a convention which is not standard: we choose for this work to use the exact same expressions, e.g., (1) for the classical case, (3) for the sandwiched R\'enyi divergence, etc., even if $\rho$ is not normalized. With this convention, $D_{\alpha}(\rho\|\sigma) = D_{\alpha}\big(\frac{\rho}{\mathrm{tr}(\rho)} \big\| \sigma\big) + \frac{\alpha}{\alpha-1}\log\mathrm{tr}(\rho)$, which is slightly different from the more standard choice made for example in [35], where the correction term for normalization is simply $\log\mathrm{tr}(\rho)$. Note however that the difference between these variants only depends on $\mathrm{tr}(\rho)$ and $\alpha$, and thus the two variants have essentially the same properties even when $\rho$ is not normalized. In particular, we will be using the property that the regularized measured divergence is equal to the sandwiched divergence, a property which clearly holds equally well for both conventions.
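The stated relation between the normalized and unnormalized conventions follows from the $\alpha$-homogeneity in $\rho$ of the quantity inside the logarithm; for the classical definition (1), writing $t = \mathrm{tr}(\rho)$ and $P = t\hat{P}$ with $\hat P$ normalized:

```latex
\sum_{x} \big(t \hat P(x)\big)^{\alpha} Q(x)^{1-\alpha}
  = t^{\alpha} \sum_{x} \hat P(x)^{\alpha} Q(x)^{1-\alpha}
\quad\Longrightarrow\quad
D_{\alpha}(\rho\|\sigma)
  = D_{\alpha}\!\Big(\tfrac{\rho}{\operatorname{tr}(\rho)}\Big\|\,\sigma\Big)
  + \frac{\alpha}{\alpha-1}\log \operatorname{tr}(\rho).
```

The same homogeneity argument applies verbatim to the sandwiched definition (3).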

Notation
Let $H$ be a finite-dimensional Hilbert space; we write $L(H)$ for the set of linear operators on $H$, $P(H)$ for the set of positive semidefinite operators on $H$, and $D(H) = \{\rho \in P(H) : \mathrm{tr}(\rho) = 1\}$.
We let $\|A\|_{\infty} = \max_{\||\psi\rangle\|=1} \|A|\psi\rangle\|$ be the operator norm of $A$. Also, for positive semidefinite operators $A$ and $B$, we write $A \ll B$ when $\mathrm{supp}(A) \subseteq \mathrm{supp}(B)$, where $\mathrm{supp}(A)$ denotes the support of $A$. We denote by $\mathrm{spec}(A)$ the spectrum of $A$. For $\rho, \sigma \in P(H)$ with $\rho \ll \sigma$, we write $D_{\max}(\rho\|\sigma) = \log\inf\{\lambda \in \mathbb{R} : \rho \leq \lambda\sigma\}$. When $H = X \otimes Y$ for some Hilbert spaces $X$ and $Y$, we often explicitly indicate the systems with subscripts, e.g., $\rho_{XY} \in P(X \otimes Y)$. We denote by $\mathrm{CP}(X, Y)$ the set of completely positive maps from $L(X)$ to $L(Y)$, and $I_X$ denotes the identity map on $L(X)$.

Geometric means and the Kubo-Ando theory
In [24], Kubo and Ando developed a general theory of operator means from operator monotone functions. The goal of this section is to recall the properties of these means which will be useful for the rest of this paper. This paper will deal with the operator means obtained from the operator monotone functions f (x) = x β for β ∈ [0, 1] (the so-called β-matrix geometric mean), however we keep the discussion general as we believe other choices of f can be useful.
Given an operator monotone function $f : [0,\infty) \to [0,\infty)$ such that $f(1) = 1$, the Kubo-Ando mean $\#_f$ is defined for any pair of positive semidefinite operators $A, B$ and satisfies the following properties [24], see also [34, Theorem 37.1 and the following discussion]:
(i) Monotonicity: if $A \leq A'$ and $B \leq B'$, then $A \#_f B \leq A' \#_f B'$.
(ii) Transformer inequality: $C(A \#_f B)C^* \leq (CAC^*) \#_f (CBC^*)$ for any operator $C$, with equality when $C$ is invertible.
(iii) Continuity from above: if $A_n \downarrow A$ and $B_n \downarrow B$, then $A_n \#_f B_n \to A \#_f B$.
(iv) Joint concavity: for $\lambda \in [0,1]$,
$$\big(\lambda A_1 + (1-\lambda)A_2\big) \#_f \big(\lambda B_1 + (1-\lambda)B_2\big) \geq \lambda\,(A_1 \#_f B_1) + (1-\lambda)(A_2 \#_f B_2). \qquad (6)$$
(v) Superadditivity: $(A_1 + A_2) \#_f (B_1 + B_2) \geq A_1 \#_f B_1 + A_2 \#_f B_2$.
(vi) For invertible $A$, we have
$$A \#_f B = A^{1/2}\, f\big(A^{-1/2} B A^{-1/2}\big)\, A^{1/2}. \qquad (7)$$
Note that properties (ii) and (v) immediately imply that if $\mathcal{N}$ is a completely positive map, $\mathcal{N}(A \#_f B) \leq \mathcal{N}(A) \#_f \mathcal{N}(B)$. In fact it is known that the inequality above is true even if we only assume that $\mathcal{N}$ is a positive map (instead of completely positive), see e.g., [19, Proposition 3.30] or [26, Lemma 6.3].
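The formula in (vi) makes these properties easy to explore numerically. The sketch below (our own illustration; `kubo_ando`, `mpow`, and `rand_pd` are hypothetical helper names) implements $A \#_f B$ via (7) for invertible $A$ and checks two of the listed properties: the normalization $A \#_f A = A$ and the transformer equality for invertible $C$. It assumes numpy.

```python
import numpy as np

rng = np.random.default_rng(0)

def mpow(A, p):
    """Power of a Hermitian positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)**p) @ V.conj().T

def kubo_ando(A, B, f):
    """A #_f B = A^(1/2) f(A^(-1/2) B A^(-1/2)) A^(1/2) for invertible A (formula (7))."""
    r, ri = mpow(A, 0.5), mpow(A, -0.5)
    w, V = np.linalg.eigh(ri @ B @ ri)
    return r @ ((V * f(np.maximum(w, 0.0))) @ V.conj().T) @ r

def rand_pd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + 0.1 * np.eye(n)   # strictly positive definite

f = lambda x: x**0.5                   # operator monotone; gives the 1/2-geometric mean
A, B = rand_pd(3), rand_pd(3)
C = rng.standard_normal((3, 3))        # generically invertible

norm_check = np.linalg.norm(kubo_ando(A, A, f) - A)          # A #_f A = A
lhs = C @ kubo_ando(A, B, f) @ C.T                           # transformer equality:
rhs = kubo_ando(C @ A @ C.T, C @ B @ C.T, f)                 # C(A#B)C* = (CAC*)#(CBC*)
transformer_check = np.linalg.norm(lhs - rhs)
print(norm_check, transformer_check)   # both numerically ~0
```

Swapping in other operator monotone `f` (e.g., `lambda x: x**beta`) gives the other means used in this paper.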
The Kubo-Ando mean has the integral representation [24]
$$A \#_f B = aA + bB + \int_{(0,\infty)} \frac{1+t}{t}\,\big((tA) : B\big)\, d\mu(t) \qquad (8)$$
for some $a, b \geq 0$ and some measure $\mu$ on $(0,\infty)$ depending on $f$, and where for $A, B \geq 0$, $A : B$ denotes the parallel sum of $A$ and $B$, which satisfies [1, Theorem 9]
$$\langle x, (A : B)\, x\rangle = \inf_{y + z = x}\ \langle y, A y\rangle + \langle z, B z\rangle. \qquad (9)$$
Some additional properties of the Kubo-Ando mean will be needed in this paper:
(vii) Formula (7) is still valid for non-invertible $A$, provided we use the generalized inverse for $A$.
(viii) Direct sums: if $\mathrm{supp}(A_1 + B_1) \perp \mathrm{supp}(A_2 + B_2)$, then $(A_1 + A_2) \#_f (B_1 + B_2) = A_1 \#_f B_1 + A_2 \#_f B_2$.
(ix) For $f(x) = x^{\beta}$ with $\beta \in (0,1)$, $\mathrm{supp}(A \#_f B) = \mathrm{supp}(A) \cap \mathrm{supp}(B)$.
(x) $A \#_f B = \lim_{\varepsilon \downarrow 0} (A + \varepsilon I) \#_f (B + \varepsilon I)$.
Proof. (vii) Let $P$ be the projector onto the orthogonal complement of $\mathrm{supp}(A)$. For any $\varepsilon > 0$, $A + \varepsilon P$ is invertible, thus we have, by (7),
$$(A + \varepsilon P) \#_f B = (A + \varepsilon P)^{1/2}\, f\big((A + \varepsilon P)^{-1/2} B (A + \varepsilon P)^{-1/2}\big)\, (A + \varepsilon P)^{1/2}.$$
Letting $\varepsilon \to 0$, and using the continuity property (iii), we get the desired equality.
(viii) We first assume that $A_1 + A_2$ is invertible. Using the orthogonality of supports condition, this implies that $\mathrm{supp}(A_1) = \mathrm{supp}(A_1 + B_1)$ and $\mathrm{supp}(A_2) = \mathrm{supp}(A_2 + B_2)$, and that these two supports are orthogonal complements of each other. In particular, all the operators involved are block-diagonal with respect to this decomposition. Thus we can use (7) together with the orthogonality of supports condition to compute the mean blockwise. Using again (7) for each one of the two blocks, we have proved the desired statement when $A_1 + A_2$ is invertible. For the general case, we let $P$ be the orthogonal projector onto $\mathrm{supp}(A_1 + B_1)$. Then we apply the previous argument to $A_1 + \varepsilon P$ and $A_2 + \varepsilon(I - P)$ for $\varepsilon > 0$, and use the continuity property (iii) to take the limit $\varepsilon \to 0$ and conclude.
(ix) This follows from the integral representation (8) and the fact that supp((tA) : B) = supp(A) ∩ supp(B), which can be easily shown from the variational formulation (9).
(x) This follows from (vii) and the continuity of $f$.

We will also need some specific properties that hold for $f(x) = x^{\beta}$ for $\beta \in (0,1)$. For such a function, we write the Kubo-Ando mean $\#_f$ as $\#_{\beta}$.

Proposition 2.4. For any $\beta \in (0,1)$, we have
(xi) Tensor products: $(A_1 \otimes A_2) \#_{\beta} (B_1 \otimes B_2) = (A_1 \#_{\beta} B_1) \otimes (A_2 \#_{\beta} B_2)$.

Proof. (xi) When $A_1 \otimes A_2$ is invertible, this follows from the formula (7). If not, note that for $\varepsilon > 0$ we have
$$\big((A_1 + \varepsilon I) \otimes (A_2 + \varepsilon I)\big) \#_{\beta} (B_1 \otimes B_2) = \big((A_1 + \varepsilon I) \#_{\beta} B_1\big) \otimes \big((A_2 + \varepsilon I) \#_{\beta} B_2\big). \qquad (10)$$
When $\varepsilon \downarrow 0$, note that $(A_1 + \varepsilon I) \otimes (A_2 + \varepsilon I) \downarrow A_1 \otimes A_2$. Thus by property (iii) we get the required equality by taking the limit $\varepsilon \downarrow 0$ in (10).
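The tensor product property (xi) can likewise be checked numerically; the sketch below (our own, with the hypothetical helpers `geo_mean` and `rand_pd`) verifies the identity for $\beta = 1/3$ on random positive definite matrices, assuming numpy.

```python
import numpy as np

rng = np.random.default_rng(1)

def mpow(A, p):
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)**p) @ V.conj().T

def geo_mean(A, B, beta):
    """A #_beta B computed via formula (7) with f(x) = x^beta (A positive definite)."""
    r, ri = mpow(A, 0.5), mpow(A, -0.5)
    return r @ mpow(ri @ B @ ri, beta) @ r

def rand_pd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + 0.1 * np.eye(n)

beta = 1 / 3
A1, B1, A2, B2 = rand_pd(2), rand_pd(2), rand_pd(2), rand_pd(2)
# (A1 (x) A2) #_beta (B1 (x) B2) = (A1 #_beta B1) (x) (A2 #_beta B2)
lhs = geo_mean(np.kron(A1, A2), np.kron(B1, B2), beta)
rhs = np.kron(geo_mean(A1, B1, beta), geo_mean(A2, B2, beta))
print(np.linalg.norm(lhs - rhs))   # numerically ~0
```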
We note that the matrix geometric mean is often defined for positive semidefinite operators as the limit as ε → 0 of the formula (7) applied to A + εI and B + εI and this clearly matches with the general approach of Kubo-Ando. This is the way it is presented in [6] and we refer to [22] for a systematic study of the properties of the geometric Rényi divergence with this definition.

Properties for positive semidefinite operators
In this section we state and prove basic properties for the new quantity D # α (ρ σ).
We now show that D # α satisfies the main properties of a Rényi divergence: it satisfies the dataprocessing inequality and for commuting states, it matches with the classical Rényi divergence.
Proposition 3.2. Let $\alpha \in (1,\infty)$. The quantity $Q^{\#}_{\alpha}$ is jointly convex, and $D^{\#}_{\alpha}$ is monotone under trace-preserving positive maps. More precisely, let $\rho$ and $\sigma$ be positive semidefinite operators on the Hilbert space $X$. Then if $\mathcal{N}$ is a positive and trace-preserving map from $L(X)$ to $L(Y)$, then $D^{\#}_{\alpha}(\mathcal{N}(\rho)\|\mathcal{N}(\sigma)) \leq D^{\#}_{\alpha}(\rho\|\sigma)$. In addition, if $\rho$ and $\sigma$ commute, then $D^{\#}_{\alpha}(\rho\|\sigma) = D_{\alpha}(\rho\|\sigma)$.
Proof. Joint convexity follows directly from the joint concavity property of the matrix geometric mean (6). In fact, for any positive semidefinite operators $\rho_0, \rho_1, \sigma_0, \sigma_1$, feasible $A_0, A_1$ (i.e., $\rho_b \leq \sigma_b \#_{1/\alpha} A_b$ for $b \in \{0,1\}$) and $\lambda \in [0,1]$, (6) gives
$$\lambda\rho_0 + (1-\lambda)\rho_1 \leq \big(\lambda\sigma_0 + (1-\lambda)\sigma_1\big) \#_{1/\alpha} \big(\lambda A_0 + (1-\lambda)A_1\big).$$
Taking the minimum over $A_0$ and $A_1$, we obtain the desired result. The data-processing inequality with positive trace-preserving maps follows immediately from the monotonicity property of $\#_{1/\alpha}$ in Proposition 2.1. In fact, assuming that $\rho \ll \sigma$ (otherwise the statement clearly holds), let $A$ be an optimal point for $Q^{\#}_{\alpha}(\rho\|\sigma)$, so that $\rho \leq \sigma \#_{1/\alpha} A$. Then we have
$$\mathcal{N}(\rho) \leq \mathcal{N}(\sigma \#_{1/\alpha} A) \leq \mathcal{N}(\sigma) \#_{1/\alpha} \mathcal{N}(A),$$
so $\mathcal{N}(A)$ is feasible for $Q^{\#}_{\alpha}(\mathcal{N}(\rho)\|\mathcal{N}(\sigma))$ and, as $\mathcal{N}$ is trace-preserving, $Q^{\#}_{\alpha}(\mathcal{N}(\rho)\|\mathcal{N}(\sigma)) \leq \mathrm{tr}(\mathcal{N}(A)) = \mathrm{tr}(A) = Q^{\#}_{\alpha}(\rho\|\sigma)$. To analyze the commutative case, consider commuting operators $\rho$ and $\sigma$. It suffices to assume that $\rho \ll \sigma$ in what follows. To show that $D^{\#}_{\alpha}(\rho\|\sigma) \leq D_{\alpha}(\rho\|\sigma)$, it suffices to take $A = \rho^{\alpha}\sigma^{1-\alpha}$, which commutes with $\rho$ and $\sigma$. Then $\sigma \#_{1/\alpha} A = \sigma^{1-1/\alpha}(\rho^{\alpha}\sigma^{1-\alpha})^{1/\alpha} = \rho$, and $\mathrm{tr}(A) = Q_{\alpha}(\rho\|\sigma)$. To prove $D^{\#}_{\alpha}(\rho\|\sigma) \geq D_{\alpha}(\rho\|\sigma)$, consider a common eigenbasis $|1\rangle, \dots, |d\rangle$ for $\rho$ and $\sigma$ and consider the map $\mathcal{M}(W) = \sum_{i=1}^{d} |i\rangle\langle i| W |i\rangle\langle i|$. Note that $\mathcal{M}$ is completely positive and trace-preserving, and we have $\mathcal{M}(\rho) = \rho$ and $\mathcal{M}(\sigma) = \sigma$. Given an optimal choice of $A$ in the program (5) for $\rho$ and $\sigma$, we can write as before $\rho = \mathcal{M}(\rho) \leq \mathcal{M}(\sigma) \#_{1/\alpha} \mathcal{M}(A) = \sigma \#_{1/\alpha} \mathcal{M}(A)$. Noting that $\mathrm{tr}(\mathcal{M}(A)) = \mathrm{tr}(A)$, we have constructed another optimal solution where the matrix $\mathcal{M}(A)$ commutes with $\rho$ and $\sigma$. In this case $\sigma \#_{1/\alpha} \mathcal{M}(A) = \sigma^{1-1/\alpha}\mathcal{M}(A)^{1/\alpha}$. Thus, the condition $\rho \leq \sigma \#_{1/\alpha} \mathcal{M}(A)$ translates to $\sigma^{1/\alpha - 1}\rho \leq \mathcal{M}(A)^{1/\alpha}$. As all the matrices are diagonal in the same basis, we can take both sides of this inequality to the power $\alpha$ and get $\rho^{\alpha}\sigma^{1-\alpha} \leq \mathcal{M}(A)$. Taking the trace, we get that $Q_{\alpha}(\rho\|\sigma) \leq \mathrm{tr}(A) = Q^{\#}_{\alpha}(\rho\|\sigma)$.
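The commuting case in the proof above is easy to verify numerically: for diagonal $\rho$ and $\sigma$, the choice $A = \rho^{\alpha}\sigma^{1-\alpha}$ satisfies $\sigma \#_{1/\alpha} A = \rho$ exactly, and its trace recovers the classical quantity. A small sketch of this (our own; the variable names are ours), working directly with the diagonals and assuming numpy:

```python
import numpy as np

# diagonal (commuting) states rho = diag(p), sigma = diag(q), p and q probability vectors
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])
alpha = 1.5

a = p**alpha * q**(1 - alpha)                # diagonal of A = rho^alpha sigma^(1-alpha)
# for commuting matrices, sigma #_{1/alpha} A = sigma^(1 - 1/alpha) A^(1/alpha) entrywise:
recovered = q**(1 - 1/alpha) * a**(1/alpha)  # equals p, so the constraint holds with equality
D_sharp = np.log2(a.sum()) / (alpha - 1)     # value achieved by this feasible A
D_classical = np.log2(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)
print(recovered, D_sharp, D_classical)
```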

Remark 3.3 (Non-monotonicity in α).
We would like to emphasize that $D^{\#}_{\alpha}$ is not monotone in $\alpha$; this is illustrated in Figure 1.

Now we turn to properties of the divergence for tensor products and the relation to other quantum R\'enyi divergences.
Proposition 3.4. Let $\alpha \in (1,\infty)$. For positive semidefinite operators $\rho_1, \sigma_1, \rho_2, \sigma_2$, the quantity $Q^{\#}_{\alpha}$ is submultiplicative under tensor products:
$$Q^{\#}_{\alpha}(\rho_1 \otimes \rho_2 \,\|\, \sigma_1 \otimes \sigma_2) \leq Q^{\#}_{\alpha}(\rho_1\|\sigma_1)\, Q^{\#}_{\alpha}(\rho_2\|\sigma_2). \qquad (11)$$
In addition, for $\rho, \sigma \in P(H)$ it can be related to the measured R\'enyi divergence as follows:
$$D^{\mathbb{M}}_{\alpha}(\rho\|\sigma) \leq D^{\#}_{\alpha}(\rho\|\sigma) \leq D^{\mathbb{M}}_{\alpha}(\rho\|\sigma) + \frac{\alpha}{\alpha-1}\log|\mathrm{spec}(\sigma)|. \qquad (12)$$
As a consequence,
$$\lim_{n\to\infty} \frac{1}{n} D^{\#}_{\alpha}(\rho^{\otimes n}\|\sigma^{\otimes n}) = \widetilde{D}_{\alpha}(\rho\|\sigma). \qquad (13)$$
Furthermore,
$$\widetilde{D}_{\alpha}(\rho\|\sigma) \leq D^{\#}_{\alpha}(\rho\|\sigma) \leq \widehat{D}_{\alpha}(\rho\|\sigma). \qquad (14)$$
Proof. In order to show subadditivity, we may assume $\rho_1 \ll \sigma_1$ and $\rho_2 \ll \sigma_2$ (the statement clearly holds otherwise). For $b \in \{1,2\}$, let $A_b$ be a feasible solution for the program (5) for $\rho_b$ and $\sigma_b$, i.e., $\rho_b \leq \sigma_b \#_{1/\alpha} A_b$. Then, using the tensor product property in Proposition 2.2, we get
$$\rho_1 \otimes \rho_2 \leq (\sigma_1 \#_{1/\alpha} A_1) \otimes (\sigma_2 \#_{1/\alpha} A_2) = (\sigma_1 \otimes \sigma_2) \#_{1/\alpha} (A_1 \otimes A_2),$$
so $A_{12} = A_1 \otimes A_2$ is feasible. In addition, $\mathrm{tr}(A_{12}) = \mathrm{tr}(A_1)\mathrm{tr}(A_2)$, and we thus get (11). Next, to relate the different divergences, note that all of these divergences are finite if and only if $\rho \ll \sigma$, so we focus on the case $\rho \ll \sigma$. The first inequality of (12) follows from the data-processing inequality: for any choice of orthonormal basis $\{|v_x\rangle\}_x$, measuring in this basis is a positive trace-preserving map, so $D_{\alpha}(\{\langle v_x|\rho|v_x\rangle\}_x \| \{\langle v_x|\sigma|v_x\rangle\}_x) \leq D^{\#}_{\alpha}(\rho\|\sigma)$. In the other direction, consider the pinching map $\mathcal{P}_{\sigma}$ defined by $\mathcal{P}_{\sigma}(W) = \sum_{\lambda \in \mathrm{spec}(\sigma)} \Pi_{\lambda} W \Pi_{\lambda}$, where $\mathrm{spec}(\sigma)$ denotes the set of eigenvalues of $\sigma$ and $\Pi_{\lambda}$ is the projector onto the eigenspace of $\lambda$. Consider an optimal solution $A$ of (5) for the states $\mathcal{P}_{\sigma}(\rho)$ and $\sigma$. Then we have $\mathcal{P}_{\sigma}(\rho) \leq \sigma \#_{1/\alpha} A$. Using the pinching inequality $\rho \leq |\mathrm{spec}(\sigma)|\,\mathcal{P}_{\sigma}(\rho)$ (see e.g., [17] or [35, Chapter 2]), we obtain
$$\rho \leq |\mathrm{spec}(\sigma)|\,\big(\sigma \#_{1/\alpha} A\big) = \sigma \#_{1/\alpha} \big(|\mathrm{spec}(\sigma)|^{\alpha} A\big).$$
As such, $|\mathrm{spec}(\sigma)|^{\alpha} A$ is feasible for the optimization program (5) for $\rho$ and $\sigma$, and thus $Q^{\#}_{\alpha}(\rho\|\sigma) \leq |\mathrm{spec}(\sigma)|^{\alpha}\, Q^{\#}_{\alpha}(\mathcal{P}_{\sigma}(\rho)\|\sigma)$. But note that $\mathcal{P}_{\sigma}(\sigma) = \sigma$ and it commutes with $\mathcal{P}_{\sigma}(\rho)$, so using Proposition 3.2, $Q^{\#}_{\alpha}(\mathcal{P}_{\sigma}(\rho)\|\sigma) = Q_{\alpha}(\mathcal{P}_{\sigma}(\rho)\|\sigma) \leq Q^{\mathbb{M}}_{\alpha}(\rho\|\sigma)$. Putting everything together, we get the second inequality of (12). As a result, we have that for any integer $n \geq 1$,
$$\frac{1}{n} D^{\mathbb{M}}_{\alpha}(\rho^{\otimes n}\|\sigma^{\otimes n}) \leq \frac{1}{n} D^{\#}_{\alpha}(\rho^{\otimes n}\|\sigma^{\otimes n}) \leq \frac{1}{n} D^{\mathbb{M}}_{\alpha}(\rho^{\otimes n}\|\sigma^{\otimes n}) + \frac{\alpha}{(\alpha-1)n}\log|\mathrm{spec}(\sigma^{\otimes n})|.$$
But it is well-known that $|\mathrm{spec}(\sigma^{\otimes n})| \leq (n+1)^{\dim H - 1}$. This shows that the regularizations of $D^{\#}_{\alpha}$ and $D^{\mathbb{M}}_{\alpha}$ give the same value. But the regularization of the measured R\'enyi divergence is known to be equal to $\widetilde{D}_{\alpha}$ [27] (see also [35, Theorem 4]), which proves (13). For (14), the first inequality follows by combining (11) and (13), as subadditivity implies $\frac{1}{n} D^{\#}_{\alpha}(\rho^{\otimes n}\|\sigma^{\otimes n}) \leq D^{\#}_{\alpha}(\rho\|\sigma)$ for all $n$. For the second inequality, take $A = \sigma^{1/2}\big(\sigma^{-1/2}\rho\,\sigma^{-1/2}\big)^{\alpha}\sigma^{1/2}$ (with generalized inverses); then $\sigma \#_{1/\alpha} A = \rho$ and $\mathrm{tr}(A) = \widehat{Q}_{\alpha}(\rho\|\sigma)$, which means that $A$ is feasible for (5), and so $D^{\#}_{\alpha}(\rho\|\sigma) \leq \widehat{D}_{\alpha}(\rho\|\sigma)$. We note that for $\alpha \in (1,2]$ this inequality actually follows immediately from the fact that $\widehat{D}_{\alpha}$ is the maximal R\'enyi divergence satisfying the data-processing inequality [26]. As illustrated in Figure 1, there can be a large gap between $D^{\#}_{\alpha}$ and $\widehat{D}_{\alpha}$.
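The pinching map and the pinching inequality used in the proof are easy to illustrate numerically. The sketch below (our own; `pinching` is a hypothetical helper name) builds $\mathcal{P}_{\sigma}$ for a $\sigma$ with two distinct eigenvalues and checks that $|\mathrm{spec}(\sigma)|\,\mathcal{P}_{\sigma}(\rho) - \rho$ is positive semidefinite, assuming numpy.

```python
import numpy as np

rng = np.random.default_rng(2)

def pinching(W, sigma):
    """P_sigma(W) = sum over distinct eigenvalues lambda of sigma of Pi_lambda W Pi_lambda."""
    w, V = np.linalg.eigh(sigma)
    out = np.zeros_like(W)
    for lam in np.unique(np.round(w, 10)):
        idx = np.isclose(w, lam)
        Pi = V[:, idx] @ V[:, idx].conj().T   # projector onto the eigenspace of lambda
        out = out + Pi @ W @ Pi
    return out

# sigma with 2 distinct eigenvalues on a 3-dimensional space: |spec(sigma)| = 2
sigma = np.diag([0.2, 0.2, 0.6])
X = rng.standard_normal((3, 3))
rho = X @ X.T
rho /= np.trace(rho)

k = 2                                         # |spec(sigma)|
gap = k * pinching(rho, sigma) - rho          # pinching inequality: rho <= k * P_sigma(rho)
print(np.linalg.eigvalsh(gap).min())          # smallest eigenvalue, nonnegative up to rounding
```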
On the other hand, it is known (see e.g., [29, Proposition 4]) that $\widetilde{D}_{\alpha}(\rho\|\sigma) \to D_{\max}(\rho\|\sigma)$ when $\alpha \to \infty$; thus we get the desired result.
We conclude this section by establishing additional useful properties of Q # α .
$Q^{\#}_{\alpha}$ has the direct-sum property for classical-quantum states: for finite families $\{\rho_x\}_x$ and $\{\sigma_x\}_x$ of positive semidefinite operators,
$$Q^{\#}_{\alpha}\Big(\sum_x |x\rangle\langle x| \otimes \rho_x \,\Big\|\, \sum_x |x\rangle\langle x| \otimes \sigma_x\Big) = \sum_x Q^{\#}_{\alpha}(\rho_x\|\sigma_x).$$

Properties for positive maps
The notion of divergence between states can naturally be extended to a divergence between channels by maximizing over the input states. Here we consider the stabilized version, where a reference system that is unaffected by the channels is allowed. Let $X, X', Y$ be Hilbert spaces with $\dim X = \dim X'$ and let $\mathcal{N}, \mathcal{M}$ be completely positive maps from $L(X')$ to $L(Y)$. It will be convenient to write the definition of the channel divergence in terms of the Choi states of the channels. For this, we define $\Phi_{XX'}$ as an unnormalized maximally entangled state of the form $\Phi_{XX'} = \sum_{x,x'} |x\rangle\langle x'|_X \otimes |x\rangle\langle x'|_{X'}$, where $\{|x\rangle\}_x$ labels a fixed basis of $X$ and $X'$. Then we let
$$J^{\mathcal{N}}_{XY} = (I_X \otimes \mathcal{N})(\Phi_{XX'}) \quad \text{and} \quad J^{\mathcal{M}}_{XY} = (I_X \otimes \mathcal{M})(\Phi_{XX'})$$
be the Choi matrices of these channels. Here $I_X$ denotes the identity map on $L(X)$. Observe that for any density operator $\omega \in D(X)$,
$$(I_X \otimes \mathcal{N})\big(\omega_X^{1/2}\, \Phi_{XX'}\, \omega_X^{1/2}\big) = \omega_X^{1/2}\, J^{\mathcal{N}}_{XY}\, \omega_X^{1/2}.$$
For any divergence $D$, the corresponding channel divergence is defined as:
$$D(\mathcal{N}\|\mathcal{M}) = \sup_{\omega \in D(X)} D\big(\omega_X^{1/2} J^{\mathcal{N}}_{XY} \omega_X^{1/2} \,\big\|\, \omega_X^{1/2} J^{\mathcal{M}}_{XY} \omega_X^{1/2}\big). \qquad (15)$$
We refer to [25] for a more detailed discussion of this definition. For $D = D^{\#}_{\alpha}$, our first result is an expression for the channel divergence in terms of a convex optimization program.
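To make the Choi-matrix notation concrete, the following sketch (our own; the `choi` helper and the Kraus decomposition used here are assumptions, not taken from the paper) builds $J^{\mathcal{N}}$ for a qubit depolarizing channel from its Kraus operators and checks the trace-preserving condition $\mathrm{tr}_Y(J^{\mathcal{N}}) = I_X$. It assumes numpy.

```python
import numpy as np

def choi(kraus_ops, dim_in):
    """J^N_{XY} = (I (x) N)(Phi) with Phi = sum_{x,x'} |x><x'| (x) |x><x'| (unnormalized)."""
    d = dim_in
    Phi = np.zeros((d * d, d * d))
    for x in range(d):
        for xp in range(d):
            E = np.zeros((d, d)); E[x, xp] = 1.0
            Phi += np.kron(E, E)
    return sum(np.kron(np.eye(d), K) @ Phi @ np.kron(np.eye(d), K).conj().T
               for K in kraus_ops)

# depolarizing channel N(W) = (1-p) W + p tr(W) I/2, via its four Kraus operators
p = 0.3
I2 = np.eye(2); X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]]); Z = np.diag([1.0, -1.0])
kraus = [np.sqrt(1 - 3*p/4) * I2, np.sqrt(p/4) * X, np.sqrt(p/4) * Y, np.sqrt(p/4) * Z]

J = choi(kraus, 2)
# trace-preserving channel: partial trace over the output system Y gives the identity on X
trY = J.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)
print(np.allclose(trY, np.eye(2)))
```

The state $\omega_X^{1/2} J^{\mathcal{N}}_{XY} \omega_X^{1/2}$ appearing in (15) is then obtained by sandwiching this matrix with $\omega_X^{1/2} \otimes I_Y$.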
For $\alpha \in (1,\infty)$ and $\mathcal{N}, \mathcal{M}$ as above with $J^{\mathcal{N}}_{XY} \ll J^{\mathcal{M}}_{XY}$, we have
$$Q^{\#}_{\alpha}(\mathcal{N}\|\mathcal{M}) = \min\big\{ \|\mathrm{tr}_Y(A_{XY})\|_{\infty} : A_{XY} \geq 0,\ J^{\mathcal{N}}_{XY} \leq J^{\mathcal{M}}_{XY} \#_{1/\alpha} A_{XY} \big\}, \qquad (17)$$
where $\|\cdot\|_{\infty}$ denotes the operator norm; spelled out via (15), the claim (17) states that
$$\sup_{\omega_X \in D(X)} Q^{\#}_{\alpha}\big(\omega_X^{1/2} J^{\mathcal{N}}_{XY} \omega_X^{1/2} \,\big\|\, \omega_X^{1/2} J^{\mathcal{M}}_{XY} \omega_X^{1/2}\big) = \min_{J^{\mathcal{N}} \leq J^{\mathcal{M}} \#_{1/\alpha} A_{XY}} \|\mathrm{tr}_Y(A_{XY})\|_{\infty}. \qquad (18)$$
We will start by showing that, if we strengthen the condition $\omega_X \in D(X)$ to $\omega_X \in D(X)$, $\omega_X > 0$ in the left-hand side, equality holds. We will later show that the condition $\omega_X > 0$ leads to the same quantity. With this additional condition, the left-hand side of (18) is
$$\sup_{\substack{\omega_X \in D(X)\\ \omega_X > 0}}\ \min\big\{ \mathrm{tr}(A'_{XY}) : \omega_X^{1/2} J^{\mathcal{N}} \omega_X^{1/2} \leq \big(\omega_X^{1/2} J^{\mathcal{M}} \omega_X^{1/2}\big) \#_{1/\alpha} A'_{XY} \big\}. \qquad (20)$$
Now using the fact that $\omega_X$ is invertible together with the transformer inequality (with equality for invertible congruences), we get
$$\big(\omega_X^{1/2} J^{\mathcal{M}} \omega_X^{1/2}\big) \#_{1/\alpha} \big(\omega_X^{1/2} A_{XY}\, \omega_X^{1/2}\big) = \omega_X^{1/2} \big(J^{\mathcal{M}} \#_{1/\alpha} A_{XY}\big) \omega_X^{1/2}.$$
Thus, under the substitution $A'_{XY} = \omega_X^{1/2} A_{XY}\, \omega_X^{1/2}$, the constraint in (20) is equivalent to $J^{\mathcal{N}} \leq J^{\mathcal{M}} \#_{1/\alpha} A_{XY}$.

Thus, by performing the change of variable $A'_{XY} = \omega_X^{1/2} A_{XY}\, \omega_X^{1/2}$, and noting that $\mathrm{tr}(\omega_X^{1/2} A_{XY}\, \omega_X^{1/2}) = \mathrm{tr}(\omega_X\, \mathrm{tr}_Y(A_{XY}))$, the left-hand side of (18) becomes
$$\sup_{\substack{\omega_X \in D(X)\\ \omega_X > 0}}\ \min\big\{ \mathrm{tr}\big(\omega_X\, \mathrm{tr}_Y(A_{XY})\big) : J^{\mathcal{N}} \leq J^{\mathcal{M}} \#_{1/\alpha} A_{XY} \big\}.$$
Now using Sion's minimax theorem, observe that we can exchange the minimization and the maximization. In fact, the objective function is linear in both $\omega_X$ and $A_{XY}$, and the set of invertible density operators is convex. In addition, as we assumed that $J^{\mathcal{N}}_{XY} \ll J^{\mathcal{M}}_{XY}$, using Proposition 3.1 we may restrict the set of $A_{XY}$ we optimize over to be convex and compact. To conclude, it suffices to observe that
$$\sup_{\omega_X \in D(X)} \mathrm{tr}\big(\omega_X\, \mathrm{tr}_Y(A_{XY})\big) = \|\mathrm{tr}_Y(A_{XY})\|_{\infty}.$$
Since replacing the condition $\omega_X \geq 0$ by $\omega_X > 0$ can only decrease the LHS of (18), we have shown the direction $\geq$ of (18). It thus remains to show the direction $\leq$. Take an optimal feasible solution $A_{XY}$ of (17) and let us write $\lambda$ for its value. Now consider an $\omega_X \in D(X)$ and define $A^{\omega}_{XY} = \omega_X^{1/2} A_{XY}\, \omega_X^{1/2}$. By construction, $\mathrm{tr}(A^{\omega}_{XY}) = \mathrm{tr}(\omega_X\, \mathrm{tr}_Y(A_{XY})) \leq \lambda$. In addition, we have
$$\omega_X^{1/2} J^{\mathcal{N}} \omega_X^{1/2} \leq \omega_X^{1/2}\big(J^{\mathcal{M}} \#_{1/\alpha} A_{XY}\big)\omega_X^{1/2} \leq \big(\omega_X^{1/2} J^{\mathcal{M}} \omega_X^{1/2}\big) \#_{1/\alpha} A^{\omega}_{XY},$$
where we used the transformer inequality. As such, $A^{\omega}_{XY}$ is feasible for the defining optimization program for $Q^{\#}_{\alpha}\big(\omega_X^{1/2} J^{\mathcal{N}} \omega_X^{1/2} \,\big\|\, \omega_X^{1/2} J^{\mathcal{M}} \omega_X^{1/2}\big)$, whose value is therefore at most $\lambda$. Taking the supremum over $\omega_X$ completes the proof.
An immediate corollary is that the channel divergence is subadditive.
Proof. Let $A^1_{X_1Y_1}$ be a feasible solution for the program (17) for the channels $\mathcal{N}_1$ and $\mathcal{M}_1$, and $A^2_{X_2Y_2}$ for $\mathcal{N}_2$ and $\mathcal{M}_2$. Then, using the fact that $J^{\mathcal{N}_1 \otimes \mathcal{N}_2} = J^{\mathcal{N}_1} \otimes J^{\mathcal{N}_2}$ and the tensor product property of the mean (Proposition 2.2), we have that $A^{12} = A^1_{X_1Y_1} \otimes A^2_{X_2Y_2}$ is feasible for the program (17) for $\mathcal{N}_1 \otimes \mathcal{N}_2$ and $\mathcal{M}_1 \otimes \mathcal{M}_2$, with objective value $\|\mathrm{tr}_{Y_1Y_2}(A^{12})\|_{\infty} = \|\mathrm{tr}_{Y_1}(A^1)\|_{\infty}\, \|\mathrm{tr}_{Y_2}(A^2)\|_{\infty}$.

Next, we prove that, as for states, the regularized channel divergence is equal to the regularized sandwiched R\'enyi divergence. Note that, unlike for states, the sandwiched R\'enyi divergence of channels is not additive in general; see [13] for an example.
Proof. The second inequality follows immediately from the fact that $\widetilde{D}_{\alpha} \leq D^{\#}_{\alpha}$; see Proposition 3.4. For the other direction, note that the channels $\mathcal{N}^{\otimes n}$ and $\mathcal{M}^{\otimes n}$ are covariant with respect to the representation of the symmetric group $S_n$. In fact, for $\pi \in S_n$, if we denote by $P_X(\pi)$ the operator on the space $X^{\otimes n}$ that permutes the $n$ tensor factors according to $\pi$, then $\mathcal{N}^{\otimes n}(P_X(\pi)\,\cdot\,P_X(\pi)^*) = P_Y(\pi)\,\mathcal{N}^{\otimes n}(\,\cdot\,)\,P_Y(\pi)^*$, and similarly the same relation holds for $\mathcal{M}^{\otimes n}$. Using the definition (15) and [25, Proposition II.4] for the divergence $D^{\#}_{\alpha}$ (which satisfies the data-processing inequality, as shown in Proposition 3.2), we may restrict the optimization to permutation-invariant states. Now consider such a permutation-invariant $\omega_{X^n}$ and use the relation to the measured R\'enyi divergence in (12). Note that if $\omega_{X^n}$ is permutation-invariant, then so is the operator $\omega_{X^n}^{1/2} J^{\mathcal{N}^{\otimes n}} \omega_{X^n}^{1/2}$ on $(X \otimes Y)^{\otimes n}$. As such, using Lemma A.1, the gap between the measured and the sandwiched divergences of these permutation-invariant operators grows only logarithmically in $n$ (with a constant depending on $d := \dim X \dim Y$), and so it vanishes after dividing by $n$. Taking the supremum over all $\omega_{X^n}$ and letting $n \to \infty$ gives the desired result.
The channel divergence satisfies a chain rule property for any $\alpha \in (1,\infty)$, similar to the one satisfied by the geometric divergence $\widehat{D}_{\alpha}$ [12].
Proof. Let $A_{XY}$ be an optimal solution of (17) for the maps $\mathcal{N}$ and $\mathcal{M}$, and let $\bar{A}_{RX}$ be an optimal solution of (5) for $\rho_{RX}$ and $\sigma_{RX}$, so that $\rho_{RX} \leq \sigma_{RX} \#_{1/\alpha} \bar{A}_{RX}$. Combining the properties of $A_{XY}$ and $\bar{A}_{RX}$ using Proposition 2.2 and then the transformer inequality, one checks that a suitable operator built from $\bar{A}_{RX}$ and $A_{XY}$ is feasible for the program (5) defining $Q^{\#}_{\alpha}\big((I_R \otimes \mathcal{N})(\rho_{RX}) \,\big\|\, (I_R \otimes \mathcal{M})(\sigma_{RX})\big)$. To conclude, it suffices to compute the trace of this operator, which is at most $Q^{\#}_{\alpha}(\rho_{RX}\|\sigma_{RX})\, Q^{\#}_{\alpha}(\mathcal{N}\|\mathcal{M})$; taking the logarithm establishes the desired inequality.

Remark 4.6.
Note that the chain rule can be seen as a generalization of the data-processing inequality: taking the $R$ system to be trivial, and noting that if the maps are the same ($\mathcal{N} = \mathcal{M}$) and in addition trace-preserving then $D^{\#}_{\alpha}(\mathcal{N}\|\mathcal{M}) = 0$, the chain rule recovers the data-processing inequality.

Applications
In this section we present some example applications of the newly introduced divergences. Most of these applications are related to the regularized sandwiched divergence between channels. For $\alpha > 1$, we denote
$$D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M}) = \lim_{n\to\infty} \frac{1}{n}\, \widetilde{D}_{\alpha}(\mathcal{N}^{\otimes n}\|\mathcal{M}^{\otimes n}).$$
We note that, as the sequence $\widetilde{D}_{\alpha}(\mathcal{N}^{\otimes n}\|\mathcal{M}^{\otimes n})$ is superadditive in $n$, by Fekete's lemma the limit exists and can be replaced by a supremum over $n$. Regularized entropic quantities appear extensively in quantum information theory, but it is unclear how to compute them (or even whether they are computable to start with), as we do not have control on the convergence speed of the regularization. Using $D^{\#}_{\alpha}$, one can quantify the convergence speed explicitly for $D^{\mathrm{reg}}_{\alpha}$ and thus show that this quantity is computable.

Converging hierarchy of upper bounds on the regularized divergence of channels
Proof. The lower bound follows immediately from Lemma 4.4 with $m = n$. For the upper bound, using the fact that $\widetilde{D}_{\alpha} \leq D^{\#}_{\alpha}$ and the subadditivity property in Corollary 4.3, we have for any $n, m$
$$\widetilde{D}_{\alpha}(\mathcal{N}^{\otimes nm}\|\mathcal{M}^{\otimes nm}) \leq D^{\#}_{\alpha}(\mathcal{N}^{\otimes nm}\|\mathcal{M}^{\otimes nm}) \leq n\, D^{\#}_{\alpha}(\mathcal{N}^{\otimes m}\|\mathcal{M}^{\otimes m}).$$
Dividing by $mn$ and taking the limit as $n \to \infty$ concludes the proof of the upper bound. The inequality (22) follows from this upper bound together with Lemma 4.4 (more specifically, the lower bound there applied with $m = n$).
Note that for any finite $m$, the quantity $\frac{1}{m} D^{\#}_{\alpha}(\mathcal{N}^{\otimes m}\|\mathcal{M}^{\otimes m})$ can be approximated to arbitrary accuracy, and this shows that $D^{\mathrm{reg}}_{\alpha}$ can be approximated within additive $\varepsilon$ in finite time. The precise analysis of the running time as a function of the bit size of the input is a subtle question that is outside the scope of this work. But, staying at a high level, the running time of this algorithm will be exponential in the input and output dimensions of the channels. In fact, we can take $m = \frac{8\alpha d^3}{(\alpha-1)\varepsilon}$, where $d$ is the dimension of the Choi state of the channels $\mathcal{N}$ and $\mathcal{M}$, and then compute $D^{\#}_{\alpha}(\mathcal{N}^{\otimes m}\|\mathcal{M}^{\otimes m})$. The channels $\mathcal{N}^{\otimes m}$ and $\mathcal{M}^{\otimes m}$ have a Choi state of dimension $d^m$, and thus the convex program defining $D^{\#}_{\alpha}(\mathcal{N}^{\otimes m}\|\mathcal{M}^{\otimes m})$ can be approximated in time that is polynomial in $d^m$ using the ellipsoid algorithm. Overall, the running time is exponential in $d$.
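To get a feel for the cost, here is a tiny calculator for the number of copies $m$ used above (our own sketch; the function name is ours):

```python
import math

def copies_needed(alpha, d, eps):
    """m = 8*alpha*d^3 / ((alpha-1)*eps), as in the text; d is the Choi-state dimension."""
    return math.ceil(8 * alpha * d**3 / ((alpha - 1) * eps))

# qubit-to-qubit channels: Choi dimension d = 4; the SDP then lives in dimension d^m
m = copies_needed(alpha=2.0, d=4, eps=0.1)
print(m)   # 10240, so the resulting SDP dimension 4^m is astronomically large
```

This illustrates why the hierarchy, while giving computability in principle, is only practical for very small instances or small $m$.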
As the regularized divergence between channels appears in the analysis of many information processing tasks, we believe this result will be useful in obtaining improved characterizations of such tasks. An example is the task of channel discrimination, for which the regularized Umegaki channel divergence governs the asymptotic error rate [13,39,43]. One could also obtain upper bounds on quantum channel capacities, such as the classical capacity, in terms of regularized divergence between channels (see e.g., [41]). In fact, closely following the approach of [12] to upper bound the classical capacity of a quantum channel and replacing D α with D # α , one does obtain improved bounds, including for the amplitude damping channel. However, the improvements obtained by such a direct application of [12] were small, typically less than 1%. To give an example, for a damping parameter γ = 0.5, we obtain (using D # α with two copies of the channel) an upper bound on the capacity of 0.7694... whereas the previous bound (using D max or D α ) was 0.7716... [40,41]. We leave the further exploration of this question for future work.

A chain rule for the sandwiched Rényi divergence
Our second application is a chain rule for the sandwiched Rényi divergence, which once again features the regularized divergence between channels. Such a chain rule was proved in [13] for the Umegaki relative entropy.
Proof. We apply the chain rule in Proposition 4.5 to the states $\rho^{\otimes n}_{RX}$ and $\sigma^{\otimes n}_{RX}$ and the channels $\mathcal{N}^{\otimes n}$ and $\mathcal{M}^{\otimes n}$, and divide by $n$. Taking the limit as $n \to \infty$, the state divergences become sandwiched divergences using (13), and the channel divergence becomes the regularized channel sandwiched divergence using Lemma 4.4.

Remark 5.3.
It is unclear whether taking the limit $\alpha \to 1$ in this chain rule recovers the chain rule proved in [13]. The reason for this difficulty is that it remains open whether $\lim_{\alpha \downarrow 1} D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M}) = D^{\mathrm{reg}}(\mathcal{N}\|\mathcal{M})$.

It is also possible to phrase the chain rule in terms of amortized divergences, as introduced in [43]. For a divergence $D$, the amortized divergence is defined as
$$D^{a}(\mathcal{N}\|\mathcal{M}) = \sup_{\rho_{RX}, \sigma_{RX} \in D(RX)} D\big((I_R \otimes \mathcal{N})(\rho_{RX}) \,\big\|\, (I_R \otimes \mathcal{M})(\sigma_{RX})\big) - D(\rho_{RX}\|\sigma_{RX}), \qquad (24)$$
where the supremum also runs over all finite dimensional spaces $R$. When $D$ is the sandwiched R\'enyi divergence $\widetilde{D}_{\alpha}$, note that for positive real numbers $\beta$ and $\gamma$, we have $\widetilde{D}_{\alpha}(\beta\rho\|\gamma\sigma) = \widetilde{D}_{\alpha}(\rho\|\sigma) + \frac{\alpha}{\alpha-1}\log\beta - \log\gamma$. As a result, for any nonzero $\rho_{RX}, \sigma_{RX} \in P(RX)$, the difference of divergences appearing in (24) is unchanged when $\rho_{RX}$ and $\sigma_{RX}$ are rescaled to be normalized, \qquad (25)
which means that in (24) we can also take the supremum over all nonzero positive semidefinite operators.
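The scaling identity for $\widetilde{D}_{\alpha}$ used above is easy to confirm numerically; a sketch (our own, with helper names `mpow`, `sandwiched`, and `rand_state` that are not from the paper), assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(3)

def mpow(A, p):
    w, V = np.linalg.eigh(A)
    return (V * np.maximum(w, 0.0)**p) @ V.conj().T

def sandwiched(rho, sigma, a):
    """Sandwiched divergence (3), base-2 logarithm, full-rank sigma assumed."""
    s = mpow(sigma, (1 - a) / (2 * a))
    return np.log2(np.trace(mpow(s @ rho @ s, a)).real) / (a - 1)

def rand_state(n):
    X = rng.standard_normal((n, n))
    W = X @ X.T
    return W / np.trace(W)

rho, sigma = rand_state(3), rand_state(3)
a, beta, gamma = 2.0, 0.7, 1.9
# D_a(beta*rho || gamma*sigma) = D_a(rho||sigma) + a/(a-1) log2(beta) - log2(gamma)
lhs = sandwiched(beta * rho, gamma * sigma, a)
rhs = sandwiched(rho, sigma, a) + a / (a - 1) * np.log2(beta) - np.log2(gamma)
print(lhs, rhs)
```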

Theorem 5.4 (Amortization = regularization for sandwiched divergence). For any completely positive maps $\mathcal{N}, \mathcal{M}$ and any $\alpha > 1$, we have
$$\widetilde{D}^{a}_{\alpha}(\mathcal{N}\|\mathcal{M}) = D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M}). \qquad (26)$$
Proof. The inequality $\leq$ follows immediately from the chain rule in (23). The inequality $\geq$ is actually true for any generalized divergence and was observed in previous works [39,43]. Note that we can equivalently write the channel divergence as
$$\widetilde{D}_{\alpha}(\mathcal{N}\|\mathcal{M}) = \sup_{\rho_{XX'} \in D(XX')} \widetilde{D}_{\alpha}\big((I_X \otimes \mathcal{N})(\rho_{XX'}) \,\big\|\, (I_X \otimes \mathcal{M})(\rho_{XX'})\big),$$
where as usual $X$ and $X'$ have the same dimension. Thus, denoting the $n$ copies of $X'$ by $X'_1, X'_2, \dots, X'_n$, and using the shorthand $\rho^{(i)}$ (resp. $\sigma^{(i)}$) for the state obtained from an input $\rho$ by applying $\mathcal{N}$ (resp. $\mathcal{M}$) to the first $i$ systems, we can write the telescoping sum
$$\widetilde{D}_{\alpha}\big(\rho^{(n)} \big\| \sigma^{(n)}\big) = \sum_{i=0}^{n-1} \Big(\widetilde{D}_{\alpha}\big(\rho^{(i+1)} \big\| \sigma^{(i+1)}\big) - \widetilde{D}_{\alpha}\big(\rho^{(i)} \big\| \sigma^{(i)}\big)\Big),$$
with $\rho^{(0)} = \sigma^{(0)} = \rho$, so that $\widetilde{D}_{\alpha}(\rho^{(0)}\|\sigma^{(0)}) = 0$. Note that in the $i$-th term, we subtract two expressions that differ by an application of the channels $\mathcal{N}$ and $\mathcal{M}$ on the system $X'_{i+1}$, and so the remaining systems can be considered as the $R$ system in the definition (24). Using in addition the observation in (25) saying that $\rho$ and $\sigma$ need not be normalized, we get that each term is bounded by $\widetilde{D}^{a}_{\alpha}(\mathcal{N}\|\mathcal{M})$. Thus, $\widetilde{D}_{\alpha}(\mathcal{N}^{\otimes n}\|\mathcal{M}^{\otimes n}) \leq n\, \widetilde{D}^{a}_{\alpha}(\mathcal{N}\|\mathcal{M})$, which gives the desired result.
The concept of amortization is particularly useful when analyzing information processing tasks that have an adaptive aspect. We discuss some examples below.

Channel discrimination
We discuss the example of channel discrimination, referring to [10,43] for a more detailed and precise presentation of the problem and the relevant references on the topic. Imagine we would like to distinguish between two quantum channels $\mathcal{N}$ and $\mathcal{M}$, having black-box access to $n$ uses of one of them. The task of adaptive channel discrimination is to decide which channel we are dealing with. The word adaptive here refers to the fact that our use of one of the black boxes can depend on the outcomes of a previously used black box. By contrast, a strategy is called parallel (or nonadaptive) if the $n$ black boxes are used in parallel on a fixed input state. As is common in hypothesis testing, we call the type I error the probability $\alpha_n$ that the channel is actually $\mathcal{N}$ but our procedure says $\mathcal{M}$; the type II error $\beta_n$ is the other kind of error, and the goal is to determine the tradeoff between these two errors. Multiple regimes can be considered; the most studied is the asymmetric or Stein setting, where we set $\alpha_n \leq \varepsilon$ for some $\varepsilon \in (0,1)$ and consider the asymptotic behavior of the optimal type II error exponent $-\frac{1}{n}\log\beta_n$. The works [13,39,43] establish that if we take $\varepsilon \to 0$, this is given by the regularized Umegaki relative entropy $D^{\mathrm{reg}}(\mathcal{N}\|\mathcal{M})$. Our focus here is on the strong converse regime, i.e., we require $\beta_n \leq 2^{-rn}$ with $r > D^{\mathrm{reg}}(\mathcal{N}\|\mathcal{M})$ and we consider the behavior of $\alpha_n$. As far as we are aware, it is not known whether in this case we always have $\alpha_n \to 1$ (this would be a strong converse property). However, we can always consider the following quantity, which measures how quickly $\alpha_n$ goes to 1 when it does so:
$$H(r, \mathcal{N}\|\mathcal{M}) = \inf\Big\{ \liminf_{n\to\infty} -\frac{1}{n}\log(1-\alpha_n) \Big\},$$
where the infimum is over all (adaptive) strategies whose type II errors satisfy $\limsup_{n\to\infty}\frac{1}{n}\log\beta_n \leq -r$. Note that if $\alpha_n$ does not converge to 1 exponentially fast, then this quantity is 0.
A lower bound is given in [43, Proposition 20] for this quantity:
$$H(r, \mathcal{N}\|\mathcal{M}) \geq \sup_{\alpha > 1} \frac{\alpha-1}{\alpha}\big(r - D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M})\big). \qquad (27)$$
We note that [43, Proposition 20] actually shows something stronger: for any $n$, any adaptive strategy and any $\alpha > 1$, we have
$$-\frac{1}{n}\log(1-\alpha_n) \geq \frac{\alpha-1}{\alpha}\Big(-\frac{1}{n}\log\beta_n - \widetilde{D}^{a}_{\alpha}(\mathcal{N}\|\mathcal{M})\Big).$$
Thus, for any family of strategies satisfying $\limsup_{n\to\infty} \frac{1}{n}\log\beta_n \leq -r$, we have
$$\liminf_{n\to\infty} -\frac{1}{n}\log(1-\alpha_n) \geq \frac{\alpha-1}{\alpha}\big(r - \widetilde{D}^{a}_{\alpha}(\mathcal{N}\|\mathcal{M})\big) = \frac{\alpha-1}{\alpha}\big(r - D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M})\big),$$
using equality (26), which establishes (27).
Using equality (26) together with the explicit convergence bounds in Theorem 5.1 as well as the strong converse exponent established for states [27], we show that this bound is in fact tight. This generalizes the result of [10] who considered the case where M is a replacer channel, i.e., M(W ) = tr (W )σ for some state σ.
In addition, the achievability uses a nonadaptive strategy and this shows that adaptive strategies do not offer an advantage in this setting.
Remark 5.6 (Continuity of $D^{\mathrm{reg}}_{\alpha}$ when $\alpha \to 1$). Note that this result implies that $H(r, \mathcal{N}\|\mathcal{M}) > 0$ whenever $r > \inf_{\alpha>1} D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M})$. As the behaviour of $D^{\mathrm{reg}}_{\alpha}$ as $\alpha \to 1$ remains unclear, we cannot rule out that $D^{\mathrm{reg}}(\mathcal{N}\|\mathcal{M}) < \inf_{\alpha>1} D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M})$ for some channels, and so it remains open whether a strong converse property holds in general.
Proof. As usual, we will assume J N J M , as otherwise, D reg (N M) = ∞ and the statement is void. The lower bound ≥ follows immediately from (27) and equality (26).
For the upper bound, the idea is to use the characterization of [27] for the strong converse exponent for state discrimination. They show that for any states $\rho$ and $\sigma$ and $r > 0$, there is a family of strategies to distinguish between $\rho^{\otimes n}$ and $\sigma^{\otimes n}$ with type II error probability $\beta_n(\rho, \sigma) \leq 2^{-rn}$ and achieving a type I error probability $\alpha_n(\rho, \sigma)$ satisfying
$$\lim_{n\to\infty} -\frac{1}{n}\log\big(1 - \alpha_n(\rho, \sigma)\big) = \sup_{\alpha > 1} \frac{\alpha-1}{\alpha}\big(r - \widetilde{D}_{\alpha}(\rho\|\sigma)\big).$$
Let $\varepsilon > 0$ and, using the finite convergence bounds in (22), choose an integer $m$ so that $\frac{1}{m} D^{\#}_{\alpha}(\mathcal{N}^{\otimes m}\|\mathcal{M}^{\otimes m})$ is within $\varepsilon$ of $D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M})$. We choose $\omega_m$ to achieve, up to $\varepsilon$, the infimum over $\omega \in D(X^{\otimes m})$ with $\omega > 0$ of the right-hand side, and for this $\omega_m$ we have a strategy achieving the above exponent for the states $\omega_m^{1/2} J^{\mathcal{N}^{\otimes m}} \omega_m^{1/2}$ and $\omega_m^{1/2} J^{\mathcal{M}^{\otimes m}} \omega_m^{1/2}$. Now we observe that for any channels $\mathcal{A}$ and $\mathcal{B}$ such that $J^{\mathcal{A}} \ll J^{\mathcal{B}}$, we can perform the change of variable $u = \frac{\alpha-1}{\alpha}$ and get
$$\inf_{\substack{\omega \in D(X)\\ \omega > 0}} \sup_{\alpha > 1} \frac{\alpha-1}{\alpha}\Big(r - \widetilde{D}_{\alpha}\big(\omega^{1/2} J^{\mathcal{A}} \omega^{1/2} \big\| \omega^{1/2} J^{\mathcal{B}} \omega^{1/2}\big)\Big) = \inf_{\substack{\omega \in D(X)\\ \omega > 0}} \sup_{u \in (0,1)} f(\omega, u),$$
where we defined the function $f(\omega, u) = ur - u\, \widetilde{D}_{\frac{1}{1-u}}\big(\omega^{1/2} J^{\mathcal{A}} \omega^{1/2} \big\| \omega^{1/2} J^{\mathcal{B}} \omega^{1/2}\big)$. Note that $f(\omega, u)$ is finite since $\widetilde{D}_{\alpha} \leq D_{\max}$ and $D_{\max}\big(\omega^{1/2} J^{\mathcal{A}} \omega^{1/2} \big\| \omega^{1/2} J^{\mathcal{B}} \omega^{1/2}\big) < \infty$; and for any $\omega > 0$, $u \mapsto f(\omega, u)$ admits an upper bound independent of $\omega$. As we will see shortly, for any $\omega > 0$, the function $u \mapsto f(\omega, u)$ is continuous on $[0,1]$, so the supremum over $u \in (0,1)$ may be replaced by a maximum over the compact interval $[0,1]$. We are now ready to apply Sion's minimax theorem. To do this, we check the following conditions:
• For any $\omega \in D(X)$ with $\omega > 0$, the function $u \mapsto f(\omega, u)$ is concave and continuous on the compact interval $[0,1]$. This follows from [27, Remark IV.13 or the discussion preceding Lemma IV.9], which shows that $u \mapsto u\, \widetilde{D}_{\frac{1}{1-u}}\big(\omega^{1/2} J^{\mathcal{A}} \omega^{1/2} \big\| \omega^{1/2} J^{\mathcal{B}} \omega^{1/2}\big)$ is convex and continuous on $[0,1]$.
• For any $u \in [0,1]$, the function $\omega \mapsto f(\omega, u)$ is convex on the convex set of invertible density operators.
Applying Sion's minimax theorem, we can exchange the inf and max in (32) and get
$$\inf_{\substack{\omega \in D(X)\\ \omega > 0}} \max_{u \in [0,1]} f(\omega, u) = \max_{u \in [0,1]} \inf_{\substack{\omega \in D(X)\\ \omega > 0}} f(\omega, u) = \max_{u \in [0,1]} \inf_{\omega \in D(X)} f(\omega, u).$$
Note that for the second equality we used equality (36), saying that we can drop the $\omega > 0$ condition in the infimum. Now, as $\widetilde{D}_{\alpha} \leq D_{\max}$, the resulting value is finite, and using the finite convergence bounds in (22) for $D^{\mathrm{reg}}_{\alpha}$, we get that the exponent achieved is at most $\sup_{\alpha>1}\frac{\alpha-1}{\alpha}\big(r - D^{\mathrm{reg}}_{\alpha}(\mathcal{N}\|\mathcal{M})\big)$ up to an additive error that vanishes with $\varepsilon$. In other words, we have constructed a sequence of strategies for distinguishing between $\mathcal{N}^{\otimes mn}$ and $\mathcal{M}^{\otimes mn}$ for $n \geq 1$ with a type II error $\beta_{nm}$ and a type I error $\alpha_{nm}$ satisfying the desired bounds. To conclude, we define a strategy for distinguishing between $\mathcal{N}^{\otimes k}$ and $\mathcal{M}^{\otimes k}$ for $k$ that is not necessarily of the form $mn$ for some $n$.
For that, we write $k = mq + p$ with $0 \leq p < m$, and we only use $mq$ copies of the channel and apply the above argument. We thus obtain exactly the same type I and type II errors as for $mq$ copies, i.e., $\alpha_k = \alpha_{mq}$ and $\beta_k = \beta_{mq}$. As $k \to \infty$, we have $q \to \infty$ and $\frac{mq}{k} \to 1$, so $\limsup_{k\to\infty} \frac{1}{k}\log\beta_k = \limsup_{q\to\infty} \frac{1}{mq}\log\beta_{mq} \leq -r$. In addition, using the same notation, the type I error satisfies $\limsup_{k\to\infty} -\frac{1}{k}\log(1-\alpha_k) = \limsup_{q\to\infty} -\frac{1}{mq}\log(1-\alpha_{mq})$. As a result, (34) implies the desired upper bound on the exponent. As this is valid for any $\varepsilon > 0$, we obtain the claimed result.

Bounds on amortized entanglement measures and applications
Another task that has an adaptive nature is the task of quantum communication using free two-way classical communication. In order to analyze such tasks, one usually considers an entanglement measure and tracks its value during the rounds of the protocol. Here, we will focus on measures of the following form: for $\alpha \in [1,\infty]$ and some convex subset $C(X:Y) \subseteq P(XY)$, we can define for a bipartite state $\rho_{XY}$
$$E^{\alpha, C}(X:Y)_{\rho} = \inf_{\sigma_{XY} \in C(X:Y)} \widetilde{D}_{\alpha}(\rho_{XY}\|\sigma_{XY}).$$
When it is clear from the context, $\alpha$ and $C$ will be dropped from the notation. Note that this quantity is quasiconvex in $\rho_{XY}$. In fact, using the joint quasiconvexity of $\widetilde{D}_{\alpha}$, we have for $\lambda \in [0,1]$, $\rho_0, \rho_1 \in D(XY)$ and $\sigma_0, \sigma_1 \in C(X:Y)$
$$\widetilde{D}_{\alpha}\big(\lambda\rho_0 + (1-\lambda)\rho_1 \,\big\|\, \lambda\sigma_0 + (1-\lambda)\sigma_1\big) \leq \max\big\{\widetilde{D}_{\alpha}(\rho_0\|\sigma_0),\ \widetilde{D}_{\alpha}(\rho_1\|\sigma_1)\big\}.$$
Taking the infimum over $\sigma_0$ and $\sigma_1$, we get the quasiconvexity of $E(X:Y)_{\rho}$ in $\rho$. To make this a useful correlation measure, we will assume that $C(X:Y)$ contains all the product states $\phi_X \otimes \psi_Y$, and as we assumed convexity of $C(X:Y)$, it also contains the set of all separable states. Many studied quantum correlation measures are special cases: • For the relative entropy of entanglement, $C$ is the set of separable states and $\alpha = 1$, but the full range $\alpha \in [1,\infty]$ has also been used, in particular for the study of adaptive protocols [9,30,42].
One can then naturally define the entanglement of a quantum channel N_{X→Y} as

E(N) = sup_{ρ_{X′X} ∈ D(X′X)} E(X′ : Y)_{(I_{X′} ⊗ N)(ρ_{X′X})} ,

where the supremum also runs over arbitrary finite-dimensional auxiliary systems X′. Note that, using the quasiconvexity of E in the state, we may restrict ρ_{X′X} to be pure. Thus, whenever the set C is invariant under local isometries (which will be the case here), it suffices to take X′ to have the same dimension as X. The amortized version is then defined as

E_a(N) = sup_{ρ_{X′XY′} ∈ D(X′XY′)} E(X′ : Y Y′)_{(I ⊗ N)(ρ_{X′XY′})} − E(X′X : Y′)_{ρ_{X′XY′}} ,

where the supremum runs over arbitrary finite-dimensional systems X′ and Y′. Note that if Y′ is trivial, we recover E(N), but in general it is not clear how to bound the dimensions of the systems X′ and Y′. Amortized quantities allow one to place upper bounds on the rates of protocols allowing two-way communication, as shown for example in [3] in the context of bidirectional channel capacities and in [4,23] in the context of quantum/private communication with free two-way classical communication. For completeness, we illustrate this methodology in the following simple lemma, which bounds the quantum correlations that can be obtained by a process of the form given in Figure 2. For convenience of notation, we will use the trivial 1-dimensional system Y_0.
[Figure 2 caption: The state ρ^(n+1) is generated by a sequence of quantum channels as indicated in the figure. The channels F_i should be considered as free operations (e.g., modeling two-way classical communication) between Alice (top) and Bob (bottom), and N is a quantum channel going from Alice to Bob.]
Lemma 5.7. Let ρ^(0) be a quantum state in C(X_0 : Y_0), and assume that the quantum channels F_i generating ρ^(n+1) as in Figure 2 preserve the set C (i.e., they are free operations). Then E(X_{n+1} : Y_{n+1})_{ρ^(n+1)} ≤ (n+1) E_a(N).

Proof. Using the definition of E and the monotonicity of E under the free channels F_i, we can write E(X_{n+1} : Y_{n+1})_{ρ^(n+1)} ≤ E(X_n : Y_n)_{ρ^(n)} + E_a(N), using the definition of the amortized quantity E_a(N). Repeating this argument, and using the fact that E(X_0 : Y_0)_{ρ^(0)} = 0 (as ρ^(0) ∈ C(X_0 : Y_0)), we obtain the desired result.
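The telescoping in this proof can be written out step by step. The following display is a sketch in our own notation (the intermediate state ω^(i), obtained from ρ^(i) by a free operation just before the (i+1)-th use of N, is our label, not from the paper):

```latex
\begin{aligned}
E(X_{n+1}:Y_{n+1})_{\rho^{(n+1)}}
  &\le E(X'_n : Y_n Y'_n)_{(\mathrm{id}\otimes N)(\omega^{(n)})}
     && \text{(monotonicity under the free map producing } \rho^{(n+1)})\\
  &\le E(X'_n X_n : Y'_n)_{\omega^{(n)}} + E_a(N)
     && \text{(definition of } E_a(N))\\
  &\le E(X_n : Y_n)_{\rho^{(n)}} + E_a(N)
     && \text{(monotonicity under the free map producing } \omega^{(n)})\\
  &\;\;\vdots\\
  &\le E(X_0:Y_0)_{\rho^{(0)}} + (n+1)\,E_a(N) = (n+1)\,E_a(N).
\end{aligned}
```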
However, the issue with the amortized quantity E_a(N) is that it is unclear how to compute it. Using our chain rule, one can upper bound E_a(N) in terms of a regularized divergence by finding channels M with the right properties. One can then use Theorem 5.1 to obtain computable upper bounds on the regularized divergence.
Lemma 5.8. Let M_{X→Y} be a completely positive map satisfying the following property: for any σ_{X′XY′} ∈ C(X′X : Y′), we have M(σ_{X′XY′}) ∈ C(X′ : Y Y′). Then E_a(N) ≤ D^reg_α(N ‖ M).

Proof. Consider ρ_{X′XY′} and let σ_{X′XY′} ∈ C(X′X : Y′). Applying the chain rule, we obtain D_α((I ⊗ N)(ρ) ‖ M(σ)) ≤ D_α(ρ ‖ σ) + D^reg_α(N ‖ M). As M(σ) ∈ C(X′ : Y Y′), taking the infimum over σ gives E(X′ : Y Y′)_{(I⊗N)(ρ)} ≤ E(X′X : Y′)_ρ + D^reg_α(N ‖ M), and thus E_a(N) ≤ D^reg_α(N ‖ M).

We can then apply this methodology to a variety of tasks. Here we consider the task of quantum communication between Alice and Bob with free classical two-way communication. For that, we fix C to be the set known as PPT′ [31], defined by C(X : Y) = {σ_XY ∈ P(XY) : ‖σ_XY^{T_Y}‖_1 ≤ 1}, and α ∈ (1, ∞). We then have to find a set of channels M satisfying the condition (35). For that we use the set of channels used in [12] (this choice can be traced back to [20]), namely V_Θ = {M_{X→Y} completely positive : ‖Θ_Y ∘ M‖_⋄ ≤ 1}, where Θ_Y denotes the transpose map and the diamond norm of a linear map A from L(X) to L(Y) is defined by ‖A‖_⋄ = sup{‖(I_{X′} ⊗ A_{X→Y})(W_{X′X})‖_1 : ‖W_{X′X}‖_1 ≤ 1}. Notice that any M ∈ V_Θ satisfies the condition (35), as M(σ_{X′XY′}) ∈ C(X′ : Y Y′) for any σ_{X′XY′} ∈ C(X′X : Y′). In fact, for any σ_{X′XY′} such that ‖σ_{X′XY′}^{T_{Y′}}‖_1 ≤ 1, we have ‖(M(σ_{X′XY′}))^{T_{YY′}}‖_1 ≤ ‖Θ_Y ∘ M‖_⋄ · ‖σ_{X′XY′}^{T_{Y′}}‖_1 ≤ 1.

Proposition 5.9. Let ε ∈ [0, 1], k ∈ N_+ and consider a state ρ^(n+1) generated as in Figure 2 with quantum channels F_i that preserve the property PPT (which is in particular the case for classical two-way communication and local operations). Assume that X_{n+1} and Y_{n+1} are k-qubit systems and that tr(ρ^(n+1) Ψ^⊗k) ≥ 1 − ε, where Ψ is a maximally entangled state of two qubits. Then, for any α ∈ (1, ∞) and any M ∈ V_Θ,

k ≤ (n+1) D^reg_α(N ‖ M) + (α/(α−1)) log(1/(1−ε)) .

Proof. Applying Lemma 5.7 and then Lemma 5.8 for the choice of C and M described above, we have E(X_{n+1} : Y_{n+1})_{ρ^(n+1)} ≤ (n+1) D^reg_α(N ‖ M). We now want to relate the quantity E(X_{n+1} : Y_{n+1})_{ρ^(n+1)} to ε and k. Using the data processing inequality for D_α with the completely positive and trace-preserving map A(W) = tr(W Ψ^⊗k)|0⟩⟨0| + (1 − tr(W Ψ^⊗k))|1⟩⟨1|, together with the fact that tr(σ Ψ^⊗k) ≤ 2^{−k} for any σ ∈ C(X_{n+1} : Y_{n+1}) [31], we get, as α > 1,

E(X_{n+1} : Y_{n+1})_{ρ^(n+1)} ≥ D_α((1−ε, ε) ‖ (2^{−k}, 1−2^{−k})) ≥ k − (α/(α−1)) log(1/(1−ε)) .

Putting everything together, we obtain the desired bound.
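The membership condition for the set C used here is a trace-norm bound on the partial transpose, which is straightforward to check numerically. The sketch below (helper names are ours) evaluates it for a maximally entangled two-qubit state, which violates the condition, and for a product state, which saturates it.

```python
import numpy as np

def partial_transpose(m, dx, dy):
    # Transpose the Y subsystem of an operator on C^dx (x) C^dy
    t = m.reshape(dx, dy, dx, dy)   # indices: x, y, x', y'
    t = t.transpose(0, 3, 2, 1)     # swap y <-> y'
    return t.reshape(dx * dy, dx * dy)

def trace_norm(m):
    # ||m||_1 = sum of singular values
    return np.linalg.svd(m, compute_uv=False).sum()

# Maximally entangled two-qubit state |Phi> = (|00> + |11>)/sqrt(2)
phi = np.zeros(4)
phi[0] = phi[3] = 2 ** -0.5
ent = np.outer(phi, phi)

# A product state |0><0| (x) |0><0|
prod = np.zeros((4, 4))
prod[0, 0] = 1.0
```

The partial transpose of the maximally entangled state has eigenvalues (1/2, 1/2, 1/2, −1/2), giving trace norm 2 > 1, while any product state has trace norm exactly 1 after partial transposition.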
Using the fact that D^reg_α(N ‖ M) ≤ D^#_α(N ‖ M) and the fact that the set of channels V_Θ is representable by a semidefinite program, we obtain efficiently computable bounds min_{M∈V_Θ} D^#_α(N ‖ M) on the quantum capacity assisted by free PPT-preserving operations. As we also have D^reg_α(N ‖ M) ≤ (1/m) D^#_α(N^⊗m ‖ M^⊗m), this quantity is also a valid upper bound, but it is not clear how to compute it efficiently when m ≥ 2. Nonetheless, one can take the map M ∈ V_Θ that achieves min_{M∈V_Θ} D^#_α(N ‖ M) and evaluate (1/m) D^#_α(N^⊗m ‖ M^⊗m) for this map. We illustrate these bounds in Figure 3 for the amplitude damping channel, where we obtain an improved bound compared to using the geometric Rényi divergence D̂_α as in [12].

Discussion
We have presented a family of quantum α-Rényi divergences for α > 1 based on the geometric mean. The framework is in fact more general and allows us to define quantum divergences in a similar way using a Kubo-Ando mean for any operator monotone function g : [0, ∞) → [0, ∞). As we mostly used generic properties of operator means to establish properties of D^#_α, we expect analogous properties to hold for more general functions g. For example, for any convex function f, the f-divergence between distributions P and Q is defined as D_f(P‖Q) = Σ_x f(P(x)/Q(x)) Q(x). When f(t) = t^α with α > 1, we obtain (after applying (1/(α−1)) log) the α-Rényi divergence. Several quantum f-divergences have been proposed, see e.g., [19]. In the special case where f is bijective and its inverse f^{−1} is operator monotone, then using the Kubo-Ando mean associated with g = f^{−1}, we would obtain a quantum version of the f-divergence. Here, we focused on the case f(t) = t^α and correspondingly g(t) = t^{1/α}, but it would be interesting to explore other choices of g and potential applications. In a different direction, a variant of D^#_α is defined in [7], using the (1/2)-geometric mean but taking the geometric mean k times iteratively with different variables. More generally, we hope that our work encourages the study of further quantum divergences that are defined via convex optimization programs.

[Figure 3 caption, displaced fragment: ... we observe a slightly improved bound compared to the solid plot. The dotted plot shows the bound obtained using D̂_α from [12], which happens to match the bound based on D_max for the amplitude damping channel, as shown in [12].]
We leave multiple open questions. A specific question is whether lim_{α→1} D^#_α(ρ‖σ) equals the Belavkin-Staszewski divergence D̂(ρ‖σ) [2]. Numerical examples suggest that this should be the case. Another question is whether it is possible to define D^#_α with similar properties when α < 1. The natural extension would be to define D^#_α(ρ‖σ) = (1/(α−1)) log Q^#_α(ρ‖σ) with Q^#_α(ρ‖σ) = max{tr(A) : ρ ≥ σ #_{1/α} A}. But with this definition, it is simple to check, using the operator monotonicity of t ↦ t^α for α ∈ [0, 1], that D^#_α(ρ‖σ) = D̂_α(ρ‖σ) (the geometric Rényi divergence), which means that we cannot have the property (13), for example. This argument does not go through when α > 1, as t ↦ t^α is not operator monotone in this regime. It would also be interesting to generalize the divergences introduced here to infinite-dimensional spaces or even to von Neumann algebras. Another important question that is left open is whether D^reg_α converges to D^reg when α → 1.

A Various results
The following standard lemma about permutation invariant operators was used for the proof of Lemma 4.4.
Proof. P defines a representation of the symmetric group S n on (C d ) ⊗n and its decomposition into irreducible representations is well-known, see e.g., [16,Section 5.3]. In fact, its irreducible representations are labelled by the set I n,d of Young diagrams of size n with at most d rows. For λ ∈ I n,d , we denote by p λ the corresponding irreducible representation acting on the space V λ .
Each p_λ appears in general multiple times in P, and this is taken into account by introducing the multiplicity space U_λ (which happens to correspond to an irreducible representation of the unitary group, but we will not use this here). Summarizing, the operator P(π) can be written in the Schur basis as

P(π) = Σ_{λ∈I_{n,d}} |λ⟩⟨λ| ⊗ 1_{U_λ} ⊗ p_λ(π) .

We now express the operator X in the Schur basis as

X = Σ_{λ,λ′∈I_{n,d}} Σ_{i∈[m(λ)], i′∈[m(λ′)]} |λ⟩⟨λ′| ⊗ |u_{λ,i}⟩⟨u_{λ′,i′}| ⊗ X_{(λ,i),(λ′,i′)} ,

where we have introduced orthonormal bases {u_{λ,i}}_{i∈[m(λ)]} of the spaces U_λ (m(λ) is the dimension of U_λ) and X_{(λ,i),(λ′,i′)} can be seen as an operator from V_{λ′} to V_λ. We can now write the products P(π)X and XP(π) as

P(π)X = Σ_{λ,λ′∈I_{n,d}} Σ_{i∈[m(λ)], i′∈[m(λ′)]} |λ⟩⟨λ′| ⊗ |u_{λ,i}⟩⟨u_{λ′,i′}| ⊗ p_λ(π) X_{(λ,i),(λ′,i′)} ,
XP(π) = Σ_{λ,λ′∈I_{n,d}} Σ_{i∈[m(λ)], i′∈[m(λ′)]} |λ⟩⟨λ′| ⊗ |u_{λ,i}⟩⟨u_{λ′,i′}| ⊗ X_{(λ,i),(λ′,i′)} p_{λ′}(π) .

If X commutes with every P(π), comparing these two expressions gives p_λ(π) X_{(λ,i),(λ′,i′)} = X_{(λ,i),(λ′,i′)} p_{λ′}(π) for all π ∈ S_n, so by Schur's lemma, X_{(λ,i),(λ′,i′)} vanishes when λ ≠ λ′ and is a multiple of the identity on V_λ when λ = λ′.
We also need the following concavity and continuity statement. Taking the supremum over ω ∈ D(X) gives the desired equality.