Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

Adith Swaminathan, Thorsten Joachims
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:814-823, 2015.

Abstract

We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
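For concreteness, the estimator the abstract refers to can be sketched numerically: the candidate policy's probabilities for the logged actions are reweighted by the logging propensities, clipped, averaged into a risk estimate, and penalized by the empirical variance of the weighted losses, which is the CRM idea in a nutshell. The sketch below assumes that form; the function name crm_objective, the clipping threshold M, and the penalty weight lam are illustrative placeholders, not the paper's POEM implementation or its hyperparameter settings.

    import numpy as np

    # Minimal sketch of a clipped inverse-propensity risk estimate with a
    # variance penalty, in the spirit of the CRM objective described in the
    # abstract. All names and constants here are illustrative assumptions.

    def crm_objective(delta, new_prob, log_prob, M=10.0, lam=0.1):
        """Variance-regularized, clipped inverse-propensity risk estimate.

        delta    : observed bandit losses for the logged actions
        new_prob : probability the candidate policy assigns to each logged action
        log_prob : propensity recorded by the logging policy for that action
        M        : clipping threshold for importance weights (assumed value)
        lam      : weight of the variance penalty (assumed value)
        """
        n = len(delta)
        weights = np.minimum(new_prob / log_prob, M)   # clipped importance weights
        u = delta * weights                            # per-example weighted losses
        risk = u.mean()                                # clipped IPS risk estimate
        penalty = lam * np.sqrt(u.var(ddof=1) / n)     # empirical-variance regularizer
        return risk + penalty

    # Toy usage: synthetic logged losses in [-1, 0], logging propensities, and
    # the probabilities a hypothetical candidate policy assigns to the same actions.
    rng = np.random.default_rng(0)
    delta = -rng.random(1000)
    log_prob = rng.uniform(0.05, 1.0, size=1000)
    new_prob = rng.uniform(0.05, 1.0, size=1000)
    print(crm_objective(delta, new_prob, log_prob))

Minimizing such an objective over a parametric family of stochastic policies, rather than evaluating it for a single fixed candidate as above, is what the paper's POEM method does for exponential models.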

Cite this Paper


BibTeX
@InProceedings{pmlr-v37-swaminathan15,
  title     = {Counterfactual Risk Minimization: Learning from Logged Bandit Feedback},
  author    = {Swaminathan, Adith and Joachims, Thorsten},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
  pages     = {814--823},
  year      = {2015},
  editor    = {Bach, Francis and Blei, David},
  volume    = {37},
  series    = {Proceedings of Machine Learning Research},
  address   = {Lille, France},
  month     = {07--09 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v37/swaminathan15.pdf},
  url       = {https://proceedings.mlr.press/v37/swaminathan15.html},
  abstract  = {We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.}
}
Endnote
%0 Conference Paper
%T Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
%A Adith Swaminathan
%A Thorsten Joachims
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-swaminathan15
%I PMLR
%P 814--823
%U https://proceedings.mlr.press/v37/swaminathan15.html
%V 37
%X We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
RIS
TY - CPAPER
TI - Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
AU - Adith Swaminathan
AU - Thorsten Joachims
BT - Proceedings of the 32nd International Conference on Machine Learning
DA - 2015/06/01
ED - Francis Bach
ED - David Blei
ID - pmlr-v37-swaminathan15
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 37
SP - 814
EP - 823
L1 - http://proceedings.mlr.press/v37/swaminathan15.pdf
UR - https://proceedings.mlr.press/v37/swaminathan15.html
AB - We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
ER -
APA
Swaminathan, A. & Joachims, T. (2015). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:814-823. Available from https://proceedings.mlr.press/v37/swaminathan15.html.
