Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states

Josep Lumbreras1, Erkka Haapasalo1, and Marco Tomamichel1,2

1Centre for Quantum Technologies, National University of Singapore, Singapore
2Department of Electrical and Computer Engineering, Faculty of Engineering, National University of Singapore, Singapore

Find this paper interesting or want to discuss? Scite or leave a comment on SciRate.


We initiate the study of tradeoffs between exploration and exploitation in online learning of properties of quantum states. Given sequential oracle access to an unknown quantum state, in each round, we are tasked to choose an observable from a set of actions aiming to maximize its expectation value on the state (the reward). Information gained about the unknown state from previous rounds can be used to gradually improve the choice of action, thus reducing the gap between the reward and the maximal reward attainable with the given action set (the regret). We provide various information-theoretic lower bounds on the cumulative regret that an optimal learner must incur, and show that it scales at least as the square root of the number of rounds played. We also investigate the dependence of the cumulative regret on the number of available actions and the dimension of the underlying space. Moreover, we exhibit strategies that are optimal for bandits with a finite number of arms and general mixed states.

► BibTeX data

► References

[1] T. Lattimore and C. Szepesvári. ``Bandit algorithms''. Cambridge University Press. (2020).

[2] A. Slivkins. ``Introduction to multi-armed bandits''. Foundations and Trends in Machine Learning 12, 1–286 (2019).

[3] S. Bubeck and N. Cesa-Bianchi. ``Regret analysis of stochastic and nonstochastic multi-armed bandit problems''. Foundations and Trends in Machine Learning 5, 1–122 (2012).

[4] D. Bouneffouf, I. Rish, and C. Aggarwal. ``Survey on applications of multi-armed and contextual bandits''. In 2020 IEEE Congress on Evolutionary Computation (CEC). Pages 1–8. (2020).

L. Tang, R. Rosales, A. Singh, and D. Agarwal. ``Automatic ad format selection via contextual bandits''. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. Page 1587–1594. Association for Computing Machinery (2013).

[6] M. Cohen, I. Lobel, and R. Paes Leme. ``Feature-based dynamic pricing''. Management Science 66, 4921–4943 (2020).

[7] W. Thompson. ``On the likelihood that one unknown probability exceeds another in view of the evidence of two samples''. Biometrika 25, 285–294 (1933).

[8] H. Robbins. ``Some Aspects of the Sequential Design of Experiments''. Bulletin of the American Mathematical Society 58, 527–535 (1952).

[9] T.L Lai and H. Robbins. ``Asymptotically efficient adaptive allocation rules''. Advances in Applied Mathematics 6, 4–22 (1985).

[10] P. Auer, N. Cesa-Bianchi, and P. Fischer. ``Finite-time analysis of the multiarmed bandit problem''. Mach. Learn. 47, 235–256 (2002).

[11] B. Casalé, G. Di Molfetta, H. Kadri, , and L. Ralaivola. ``Quantum bandits''. Quantum Mach. Intell. 2 (2020).

[12] D. Wang, X. You, T. Li, and A. Childs. ``Quantum exploration algorithms for multi-armed bandits''. In Proceedings of the AAAI Conference on Artificial Intelligence. Volume 35, pages 10102–10110. (2021).

[13] P. Rebentrost, Y. Hamoudi, M. Ray, X. Wang, S. Yang, and M. Santha. ``Quantum algorithms for hedging and the learning of ising models''. Phys. Rev. A 103, 012418 (2021).

[14] O. Shamir. ``On the complexity of bandit linear optimization''. In Proceedings of The 28th Conference on Learning Theory. Volume 40 of Proceedings of Machine Learning Research, pages 1523–1551. PMLR (2015).

[15] P. Rusmevichientong and J. Tsitsiklis. ``Linearly parameterized bandits''. Mathematics of Operations Research 35 (2008).

[16] J. Barry, D.T. Barry, and S. Aaronson. ``Quantum partially observable markov decision processes''. Phys. Rev. A 90, 032311 (2014).

[17] M. Ying, Y. Feng, and S. Ying. ``Optimal policies for quantum markov decision processes''. International Journal of Automation and Computing 18, 410–421 (2021).

[18] M. Paris and J. Rehacek. ``Quantum state estimation''. Springer Publishing Company, Incorporated. (2010). 1st edition.

[19] S. Aaronson. ``Shadow tomography of quantum states''. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. Page 325–338. STOC 2018. Association for Computing Machinery (2018).

[20] S. Aaronson, X. Chen, E. Hazan, S. Kale, and A. Nayak. ``Online learning of quantum states''. Journal of Statistical Mechanics: Theory and Experiment 2019 (2018).

[21] J. Bretagnolle and C. Huber. ``Estimation des densités: risque minimax''. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 47, 119–137 (1979).

[22] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel. ``On Quantum Rényi Entropies: A New Generalization and Some Properties''. Journal of Mathematical Physics 54, 122203 (2013).

[23] M. Wilde, A. Winter, and D. Yang. ``Strong Converse for the Classical Capacity of Entanglement-Breaking and Hadamard Channels via a Sandwiched Rényi Relative Entropy''. Communications in Mathematical Physics 331, 593–622 (2014).

[24] W. Hoeffding. ``Probability inequalities for sums of bounded random variables''. Journal of the American Statistical Association 58, 13–30 (1963).

[25] P. Auer. ``Using confidence bounds for exploitation-exploration trade-offs''. J. Mach. Learn. Res. 3, 397–422 (2003).

[26] D. Varsha, T. Hayes, and S.Kakade. ``Stochastic linear optimization under bandit feedback.''. In Proceedings of the 21st Conference on Learning Theory. Pages 355–366. (2008).

[27] P. Rusmevichientong and J. N. Tsitsiklis. ``Linearly parameterized bandits''. Mathematics of Operations Research 35, 395–411 (2010).

[28] Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. ``Improved algorithms for linear stochastic bandits''. In Advances in Neural Information Processing Systems. Volume 24. Curran Associates, Inc. (2011).

[29] T. L. Lai. ``Adaptive Treatment Allocation and the Multi-Armed Bandit Problem''. The Annals of Statistics 15, 1091 – 1114 (1987).

[30] M. Guţă, J. Kahn, R. Kueng, and J. A. Tropp. ``Fast state tomography with optimal error bounds''. Journal of Physics A: Mathematical and Theoretical 53, 204001 (2020).

[31] T. Lattimore and B. Hao. ``Bandit phase retrieval''. In Advances in Neural Information Processing Systems. Volume 34, pages 18801–18811. Curran Associates, Inc. (2021).

Cited by

[1] M. Sohaib Alam, Noah F. Berthusen, and Peter P. Orth, "Quantum logic gate synthesis as a Markov decision process", npj Quantum Information 9 1, 108 (2023).

[2] Hiroshi Ohno, "Quantum greedy algorithms for multi-armed bandits", Quantum Information Processing 22 2, 101 (2023).

[3] Shantom Kumar Borah, Shivansh Anand, Rishabh Vipul Agarwal, and Sainath Bitragunta, "Analysis of Adversarial Jamming From a Quantum Game Theoretic Perspective", IEEE Systems Journal 17 1, 881 (2023).

[4] Zongqi Wan, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun, "Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets", arXiv:2205.14988, (2022).

[5] Sitan Chen and Weiyuan Gong, "Efficient Pauli channel estimation with logarithmic quantum memory", arXiv:2309.14326, (2023).

[6] Xinyi Chen, Elad Hazan, Tongyang Li, Zhou Lu, Xinzhao Wang, and Rui Yang, "Adaptive Online Learning of Quantum States", arXiv:2206.00220, (2022).

[7] Han Zhong, Jiachen Hu, Yecheng Xue, Tongyang Li, and Liwei Wang, "Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret", arXiv:2302.10796, (2023).

[8] Josep Lumbreras and Marco Tomamichel, "Linear bandits with polylogarithmic minimax regret", arXiv:2402.12042, (2024).

The above citations are from Crossref's cited-by service (last updated successfully 2024-02-27 20:23:05) and SAO/NASA ADS (last updated successfully 2024-02-27 20:23:05). The list may be incomplete as not all publishers provide suitable and complete citation data.