Learning Dissimilarities for Categorical Symbols

Jierui Xie, Boleslaw Szymanski and Mohammed Zaki; JMLR W&P 10:97-106, 2010.

Abstract

In this paper we learn a dissimilarity measure for categorical data, for effective classification of the data points. Each categorical feature (with values taken from a finite set of symbols) is mapped onto a continuous feature whose values are real numbers. Guided by the classification error based on a nearest neighbor based technique, we repeatedly update the assignment of categorical symbols to real numbers to minimize this error. Intuitively, the algorithm pushes together points with the same class label, while enlarging the distances to points labeled differently. Our experiments show that 1) the learned dissimilarities improve classification accuracy by using the affinities of categorical symbols; 2) they outperform dissimilarities produced by previous data-driven methods; 3) our enhanced nearest neighbor classifier (called LD) based on the new space is competitive compared with classifiers such as decision trees, RBF neural networks, Naive Bayes and support vector machines, on a range of categorical datasets.



Home Page

Papers

Submissions

News

Scope

Editorial Board

Announcements

Proceedings

Open Source Software

Search index

Login



RSS Feed