We initiate the study of tradeoffs between exploration and exploitation in online learning of properties of quantum states. Given sequential oracle access to an unknown quantum state, in each round we must choose an observable from a set of actions so as to maximize its expectation value on the state (the reward). Information gained about the unknown state in previous rounds can be used to gradually improve the choice of action, thus reducing the gap between the reward and the maximal reward attainable with the given action set (the regret). We provide several information-theoretic lower bounds on the cumulative regret that an optimal learner must incur, showing that it scales at least as the square root of the number of rounds played. We also investigate how the cumulative regret depends on the number of available actions and on the dimension of the underlying space. Moreover, we exhibit strategies that are optimal for bandits with a finite number of arms and general mixed states.
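The setting above can be illustrated with a minimal classical simulation. The sketch below is not the paper's algorithm: it assumes a fixed qubit state rho, two Pauli observables X and Z as the arms (each with outcomes ±1 and mean Tr(O rho)), and a standard UCB index rule chosen purely for illustration, tracking the cumulative regret against the best fixed observable.

```python
import numpy as np

# Hypothetical sketch of the quantum bandit protocol: an unknown qubit
# state rho, a finite action set of observables, and a learner that each
# round measures one observable, receiving a +/-1 outcome with mean
# Tr(O rho). Cumulative regret sums the gap to the best arm.
rng = np.random.default_rng(0)

# Unknown qubit state rho (fixed by the environment, hidden from the learner).
rho = np.array([[0.8, 0.1], [0.1, 0.2]])

# Action set: two Pauli observables as the bandit arms.
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
arms = [X, Z]

# True expectation values Tr(O rho) and the optimal achievable reward.
means = [float(np.trace(O @ rho)) for O in arms]
best = max(means)

T = 2000
counts = np.zeros(len(arms))
sums = np.zeros(len(arms))
regret = 0.0

for t in range(1, T + 1):
    if t <= len(arms):
        a = t - 1  # play each arm once to initialize
    else:
        # Optimistic UCB index: empirical mean plus confidence bonus.
        ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    # Measuring an observable with eigenvalues +/-1 on rho yields +1
    # with probability (1 + Tr(O rho)) / 2.
    p_plus = (1.0 + means[a]) / 2.0
    outcome = 1.0 if rng.random() < p_plus else -1.0
    counts[a] += 1
    sums[a] += outcome
    regret += best - means[a]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

For a finite action set with a fixed reward gap, such index strategies keep the regret far below the linear worst case; the lower bounds in the paper concern the harder regimes where the gap, the number of actions, or the dimension scale with the number of rounds.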
This paper is published in Quantum under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Copyright remains with the original copyright holders, such as the authors or their institutions.