Sample Complexity Bounds for Differentially Private Learning

Kamalika Chaudhuri, Daniel Hsu
Proceedings of the 24th Annual Conference on Learning Theory, PMLR 19:155-186, 2011.

Abstract

This work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that, even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution $\mathcal{U}$ chosen independently of the sensitive data. Given such a reference $\mathcal{U}$, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between $\mathcal{U}$ and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement, by requiring only label privacy – namely, that only the labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.
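
For reference, the privacy notion discussed above is the standard $\epsilon$-differential privacy of Dwork et al. (2006); the sketch below states it in the learning setting, with the caveat that the paper's formal definitions (including the accuracy and confidence parameters of the learning guarantee) are given precisely in the full text. A randomized learning algorithm $A$ is $\epsilon$-differentially private if, for every pair of training sets $D$ and $D'$ differing in a single example and every measurable set $S$ of possible output classifiers,
\[
\Pr[A(D) \in S] \le e^{\epsilon} \cdot \Pr[A(D') \in S].
\]
Label privacy, as described above, relaxes this requirement so that it need only hold when $D$ and $D'$ differ in the label of a single example.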

Cite this Paper


BibTeX
@InProceedings{pmlr-v19-chaudhuri11a,
  title     = {Sample Complexity Bounds for Differentially Private Learning},
  author    = {Chaudhuri, Kamalika and Hsu, Daniel},
  booktitle = {Proceedings of the 24th Annual Conference on Learning Theory},
  pages     = {155--186},
  year      = {2011},
  editor    = {Kakade, Sham M. and von Luxburg, Ulrike},
  volume    = {19},
  series    = {Proceedings of Machine Learning Research},
  address   = {Budapest, Hungary},
  month     = {09--11 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v19/chaudhuri11a/chaudhuri11a.pdf},
  url       = {https://proceedings.mlr.press/v19/chaudhuri11a.html},
  abstract  = {This work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that, even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data distribution in the form of a reference distribution $\mathcal{U}$ chosen independently of the sensitive data. Given such a reference $\mathcal{U}$, we provide an upper bound on the sample requirement that depends (among other things) on a measure of closeness between $\mathcal{U}$ and the unlabeled data distribution. Our upper bound applies to the non-realizable as well as the realizable case. The second approach is to relax the privacy requirement, by requiring only label privacy – namely, that only the labels (and not the unlabeled parts of the examples) be considered sensitive information. An upper bound on the sample requirement of learning with label privacy was shown by Chaudhuri et al. (2006); in this work, we show a lower bound.}
}
APA
Chaudhuri, K. & Hsu, D. (2011). Sample Complexity Bounds for Differentially Private Learning. Proceedings of the 24th Annual Conference on Learning Theory, in Proceedings of Machine Learning Research 19:155-186. Available from https://proceedings.mlr.press/v19/chaudhuri11a.html.
