Classification of Imbalanced Marketing Data with Balanced Random Sets

Vladimir Nikulin, Geoffrey J. McLachlan
Proceedings of KDD-Cup 2009 Competition, PMLR 7:89-100, 2009.

Abstract

With imbalanced data, a classifier built using all of the data tends to ignore the minority class. To overcome this problem, we propose an ensemble classifier constructed from a large number of relatively small and balanced subsets, in which representatives from both patterns are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent the random subsets and whose columns represent the features. Based on this matrix, we assess how stable the influence of a particular feature is, and we propose to keep in the model only features with stable influence. The final model is an average of the base-learners, which need not be linear regressions. Proper data pre-processing is very important for the effectiveness of the whole system, and we propose to reduce the original data to the simplest binary sparse format, which is particularly convenient for the construction of decision trees. As a result, any particular feature is represented by several binary variables, or bins, which are equivalent in terms of data structure. This property is very important and may be used for feature selection. The proposed method exploits not only the contributions of particular variables to the base-learners, but also the diversity of such contributions. Test results on the KDD Cup 2009 competition datasets are presented.
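
The abstract describes the procedure only at a high level. The following sketch, in Python with numpy and scikit-learn, illustrates one possible reading of the balanced-random-sets idea under explicit assumptions: balancing is done by undersampling the majority class, each base-learner is an ordinary linear regression, feature stability is scored as |mean|/std of each coefficient column across the subsets, and the names n_subsets, subset_frac and stability_threshold are illustrative, not taken from the paper. It is a minimal sketch of the idea, not the authors' implementation.

import numpy as np
from sklearn.linear_model import LinearRegression

def balanced_random_sets(X, y, n_subsets=100, subset_frac=0.5, seed=0):
    """Yield index arrays for balanced subsets: a random sample of the
    minority class plus an equally sized random sample of the majority class."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    k = max(1, int(subset_frac * len(minority)))
    for _ in range(n_subsets):
        yield np.concatenate([
            rng.choice(minority, size=k, replace=False),
            rng.choice(majority, size=k, replace=False),
        ])

def fit_ensemble(X, y, n_subsets=100, stability_threshold=2.0, seed=0):
    # Coefficient matrix: one row per random subset, one column per feature.
    coefs = np.empty((n_subsets, X.shape[1]))
    for i, idx in enumerate(balanced_random_sets(X, y, n_subsets, seed=seed)):
        coefs[i] = LinearRegression().fit(X[idx], y[idx]).coef_
    # Stability of a feature's influence: mean coefficient relative to its
    # variability across the random subsets (an assumed scoring rule).
    stability = np.abs(coefs.mean(axis=0)) / (coefs.std(axis=0) + 1e-12)
    keep = stability >= stability_threshold
    if not keep.any():                      # fall back if the threshold is too strict
        keep = np.ones(X.shape[1], dtype=bool)
    # Refit the base-learners on the selected features and average their outputs.
    models = [LinearRegression().fit(X[idx][:, keep], y[idx])
              for idx in balanced_random_sets(X, y, n_subsets, seed=seed)]
    def predict(X_new):
        return np.mean([m.predict(X_new[:, keep]) for m in models], axis=0)
    return predict, keep

# Hypothetical usage:
#   predict, selected = fit_ensemble(X_train, y_train, n_subsets=200)
#   scores = predict(X_test)   # averaged scores, e.g. for ranking / AUC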

Cite this Paper


BibTeX
@InProceedings{pmlr-v7-nikulin09,
  title     = {Classification of Imbalanced Marketing Data with Balanced Random Sets},
  author    = {Nikulin, Vladimir and McLachlan, Geoffrey J.},
  booktitle = {Proceedings of KDD-Cup 2009 Competition},
  pages     = {89--100},
  year      = {2009},
  editor    = {Dror, Gideon and Boullé, Marc and Guyon, Isabelle and Lemaire, Vincent and Vogel, David},
  volume    = {7},
  series    = {Proceedings of Machine Learning Research},
  address   = {New York, New York, USA},
  month     = {28 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v7/nikulin09/nikulin09.pdf},
  url       = {https://proceedings.mlr.press/v7/nikulin09.html}
}
APA
Nikulin, V. & McLachlan, G.J. (2009). Classification of Imbalanced Marketing Data with Balanced Random Sets. Proceedings of KDD-Cup 2009 Competition, in Proceedings of Machine Learning Research 7:89-100. Available from https://proceedings.mlr.press/v7/nikulin09.html.

Related Material

Download PDF: http://proceedings.mlr.press/v7/nikulin09/nikulin09.pdf