<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<atom:link href="http://jmlr.csail.mit.edu/jmlr.xml" rel="self" type="application/rss+xml" />
<link>http://www.jmlr.org</link>
<title>JMLR</title>
<description></description>

<item>
<title>
A Geometric Approach to Sample Compression; Benjamin I.P. Rubinstein, J. Hyam Rubinstein; 13(Apr):1221--1261, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/rubinstein12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/rubinstein12a.html
</link>
<description>
The Sample Compression Conjecture of Littlestone &amp; Warmuth has remained unsolved for a quarter century. While maximum classes (concept classes meeting Sauer's Lemma with equality) can be compressed, the compression of general concept classes reduces to compressing maximal classes (classes that cannot be expanded without increasing VC dimension). Two promising ways forward are: embedding maximal classes into maximum classes with at most a polynomial increase to VC dimension, and compression via operating on geometric representations. This paper presents positive results on the latter approach and a first negative result on the former, through a systematic investigation of finite maximum 
</description>
</item>

<item>
<title>
A Multi-Stage Framework for Dantzig Selector and LASSO; Ji Liu, Peter Wonka, Jieping Ye; 13(Apr):1189--1219, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/liu12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/liu12a.html
</link>
<description>
We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X &#8712; &#8477;^{n&#215;m} (m >> n) and a noisy observation vector y &#8712; &#8477;^n satisfying y = X&#946;^* + &#949;, where &#949; is the noise vector following a Gaussian distribution N(0, &#963;^2 I), how to recover the signal (or parameter vector) &#946;^* when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal &#946;^*. We show that if X obeys a certain condition, then with a large probability the difference 
</description>
</item>

<item>
<title>
Hope and Fear for Discriminative Training of Statistical Translation Models; David Chiang; 13(Apr):1159--1187, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/chiang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/chiang12a.html
</link>
<description>
In machine translation, discriminative models have almost entirely supplanted the classical noisy-channel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured 
</description>
</item>

<item>
<title>
Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies; Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani; 13(Apr):1097--1157, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/tsamardinos12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/tsamardinos12a.html
</link>
<description>
We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets.  This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed 
</description>
</item>

<item>
<title>
Analysis of a Random Forests Model; G&#233;rard Biau; 13(Apr):1063--1095, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/biau12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/biau12a.html
</link>
<description>
Random forests are a scheme proposed by Leo Breiman in the 2000s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and 
</description>
</item>

<item>
<title>
The huge Package for High-dimensional Undirected Graph Estimation in R; Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman; 13(Apr):1059--1062, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhao12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhao12a.html
</link>
<description>
We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation 
</description>
</item>

<item>
<title>
Consistent Model Selection Criteria on High Dimensions; Yongdai Kim, Sunghoon Kwon, Hosik Choi; 13(Apr):1037--1057, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kim12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kim12a.html
</link>
<description>
Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided.  Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently.  Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances 
</description>
</item>

<item>
<title>
Positive Semidefinite Metric Learning Using Boosting-like Algorithms; Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel; 13(Apr):1007--1036, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/shen12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/shen12a.html
</link>
<description>
The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data.  It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance.  In this work, we propose a boosting-based technique, termed BOOSTMETRIC, for learning a quadratic Mahalanobis distance metric.  Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite.  Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement.
</description>
</item>

<item>
<title>
Sampling Methods for the Nystr&#246;m Method; Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar; 13(Apr):981--1006, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kumar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kumar12a.html
</link>
<description>
The Nystr&#246;m method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications.  A key aspect of this method is the procedure according to which columns are sampled from the original matrix.  In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes.  We also propose a family of ensemble-based sampling algorithms for the Nystr&#246;m method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nystr&#246;m method when used in conjunction with 
</description>
</item>

<item>
<title>
Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features; Gil Tahan, Lior Rokach, Yuval Shahar; 13(Apr):949--979, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/tahan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/tahan12a.html
</link>
<description>
This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the 
</description>
</item>

<item>
<title>
Stability of Density-Based Clustering; Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman; 13(Apr):905--948, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/rinaldo12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/rinaldo12a.html
</link>
<description>
High density clusters can be characterized by the connected components of a level set L(&#955;) = {x: p(x)>&#955;} of the underlying probability density function p generating the data, at some appropriate level &#955; &#8805; 0. The complete hierarchical clustering can be characterized by a cluster tree T= &#8746;_&#955;L(&#955;).  In this paper, we study the behavior of a density level set estimate  L&#770;(&#955;) and cluster tree estimate T&#770; based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L&#770;(&#955;) and T&#770; as a function of h, and investigate the theoretical properties of these instability measures.
</description>
</item>

<item>
<title>
Algebraic Geometric Comparison of Probability Distributions; Franz J. Kir&#225;ly, Paul von B&#252;nau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert M&#252;ller; 13(Mar):855--903, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kiraly12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kiraly12a.html
</link>
<description>
We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof 
</description>
</item>

<item>
<title>
NIMFA: A Python Library for Nonnegative Matrix Factorization; Marinka &#381;itnik, Bla&#382; Zupan; 13(Mar):849--853, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zitnik12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zitnik12a.html
</link>
<description>
NIMFA is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. It supports both dense and sparse matrix representation. NIMFA's component-based implementation and hierarchical design should help the users to employ already implemented techniques or design and code new strategies for matrix factorization tasks.
</description>
</item>

<item>
<title>
Causal Bounds and Observable Constraints for Non-deterministic Models; Roland R. Ramsahai; 13(Mar):829--848, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ramsahai12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ramsahai12a.html
</link>
<description>
Conditional independence relations involving latent variables do not necessarily imply observable independences. They may imply inequality constraints on observable parameters and causal bounds, which can be used for falsification and identification. The literature on computing such constraints often involves a deterministic underlying data generating process in a counterfactual framework. If an analyst is ignorant of the nature of the underlying mechanisms then they may wish to use a model which allows the underlying mechanisms to be probabilistic. A method of computation for a weaker model without any determinism is given here and demonstrated for the instrumental variable model, 
</description>
</item>

<item>
<title>
Algorithms for Learning Kernels Based on Centered Alignment; Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh; 13(Mar):795--828, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/cortes12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/cortes12a.html
</link>
<description>
This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression.  Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe 
</description>
</item>

<item>
<title>
Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso; Rahul Mazumder, Trevor Hastie; 13(Mar):781--794, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/mazumder12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/mazumder12a.html
</link>
<description>
We consider the sparse inverse covariance regularization problem or graphical lasso with regularization  parameter &#955;.  Suppose the sample covariance graph formed by thresholding the entries of the sample covariance matrix at &#955; is decomposed into connected components.  We show that the vertex-partition induced by the connected components of the thresholded sample covariance graph (at &#955;) is exactly equal to that induced by the connected components of the estimated concentration graph, obtained by solving the graphical lasso problem for the same &#955;.  This characterizes a very interesting property of a path of graphical lasso solutions.  Furthermore, this simple rule, when used 
</description>
</item>

<item>
<title>
GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression; Chiwoo Park, Jianhua Z. Huang, Yu Ding; 13(Mar):775--779, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/park12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/park12a.html
</link>
<description>
This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper.
</description>
</item>

<item>
<title>
A Kernel Two-Sample Test; Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch&#246;lkopf, Alexander Smola; 13(Mar):723--773, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/gretton12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/gretton12a.html
</link>
<description>
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions.  Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD).  We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic.  The MMD can be computed in quadratic time, although efficient linear time approximations are available.  Our statistic is an instance of an integral probability metric, and 
</description>
</item>

<item>
<title>
A Case Study on Meta-Generalising: A Gaussian Processes Approach; Grigorios Skolidis, Guido Sanguinetti; 13(Mar):691--721, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/skolidis12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/skolidis12a.html
</link>
<description>
We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it.
</description>
</item>

<item>
<title>
Structured Sparsity and Generalization; Andreas Maurer, Massimiliano Pontil; 13(Mar):671--690, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/maurer12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/maurer12a.html
</link>
<description>
We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels.
</description>
</item>

<item>
<title>
Learning Algorithms for the Classification Restricted Boltzmann Machine; Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio; 13(Mar):643--669, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/larochelle12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/larochelle12a.html
</link>
<description>
Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification 
</description>
</item>

<item>
<title>
Non-Sparse Multiple Kernel Fisher Discriminant Analysis; Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir; 13(Mar):607--642, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/yan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/yan12a.html
</link>
<description>
Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general l_p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of the latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on 
</description>
</item>

<item>
<title>
A Primal-Dual Convergence Analysis of Boosting; Matus Telgarsky; 13(Mar):561--606, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/telgarsky12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/telgarsky12a.html
</link>
<description>
Boosting combines weak learners into a predictor with low empirical risk.  Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated.  This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: &#8226; Weak learnability aids the whole loss family: for any &#949; > 0, O(ln(1/&#949;)) iterations suffice to produce a predictor with empirical risk &#949;-close to the infimum; &#8226; The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of 
</description>
</item>

<item>
<title>
ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel; Stephen R. Piccolo, Lewis J. Frey; 13(Mar):555--559, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/piccolo12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/piccolo12a.html
</link>
<description>
Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple 
</description>
</item>

<item>
<title>
MULTIBOOST: A Multi-purpose Boosting Package; Djalel Benbouzid, R&#243;bert Busa-Fekete, Norman Casagrande, Fran&#231;ois-David Collin, Bal&#225;zs K&#233;gl; 13(Mar):549--553, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/benbouzid12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/benbouzid12a.html
</link>
<description>
The MULTIBOOST package provides a fast C++ implementation of multi-class/multi-label/multi-task boosting algorithms. It is based on ADABOOST.MH but it also implements popular cascade classifiers and FILTERBOOST. The package contains common multi-class base learners (stumps, trees, products, Haar filters). Further base learners and strong learners following the boosting paradigm can be easily implemented in a flexible framework.
</description>
</item>

<item>
<title>
Metric and Kernel Learning Using a Linear Transformation; Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon; 13(Mar):519--547, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/jain12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/jain12a.html
</link>
<description>
Metric and kernel learning arise in several machine learning applications.  However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points.  In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem.  In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework---that of minimizing the LogDet divergence subject 
</description>
</item>

<item>
<title>
Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks; Vikas C. Raykar, Shipeng Yu; 13(Feb):491--518, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/raykar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/raykar12a.html
</link>
<description>
With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time.  Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, defined as annotators who assign labels randomly without actually looking at the instance.  Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the final consensus labels.  In this paper we propose an empirical 
</description>
</item>

<item>
<title>
Multi-Assignment Clustering for Boolean Data; Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann; 13(Feb):459--489, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/frank12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/frank12a.html
</link>
<description>
We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the  individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily 
</description>
</item>

<item>
<title>
Online Learning in the Embedded Manifold of Low-rank Matrices; Uri Shalit, Daphna Weinshall, Gal Chechik; 13(Feb):429--458, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/shalit12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/shalit12a.html
</link>
<description>
When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction 
</description>
</item>

<item>
<title>
Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming; Garvesh Raskutti, Martin J. Wainwright, Bin Yu; 13(Feb):389--427, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/raskutti12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/raskutti12a.html
</link>
<description>
Sparse additive models are families of d-variate functions with the additive decomposition f^* = &#8721;_{j &#8712; S} f^*_j, where S is an unknown subset of cardinality s &lt;&lt; d. In this paper, we consider the case where each univariate component function f^*_j lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f^* based on kernels combined with l_1-type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L^2(P) and L^2(P_n) norms over the class F_{d,s,H} of sparse additive models with each univariate function 
</description>
</item>

<item>
<title>
Bounding the Probability of Error for High Precision Optical Character Recognition; Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller; 13(Feb):363--387, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/huang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/huang12a.html
</link>
<description>
We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently.  Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data.  While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant 
</description>
</item>

<item>
<title>
Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics; Michael U. Gutmann, Aapo Hyv&#228;rinen; 13(Feb):307--361, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/gutmann12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/gutmann12a.html
</link>
<description>
We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations 
</description>
</item>

<item>
<title>
Random Search for Hyper-Parameter Optimization; James Bergstra, Yoshua Bengio; 13(Feb):281--305, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/bergstra12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/bergstra12a.html
</link>
<description>
Grid search and manual search are the most widely used strategies for hyper-parameter optimization.  This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid.  Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks.  Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.  Granting random search the same computational budget, random search 
</description>
</item>

<item>
<title>
Active Learning via Perfect Selective Classification; Ran El-Yaniv, Yair Wiener; 13(Feb):255--279, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/el-yaniv12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/el-yaniv12a.html
</link>
<description>
We discover a strong relation between two known learning models: stream-based active learning and perfect selective classification (an extreme case of 'classification with a reject option').  For these models, restricted to the realizable case, we show a reduction of active learning to selective classification that preserves fast rates.  Applying this reduction to recent results for selective classification, we derive exponential target-independent label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high dimensional mixture of Gaussians. Finally, we study the relation between the proposed technique and existing 
</description>
</item>

<item>
<title>
Multi Kernel Learning with Online-Batch Optimization; Francesco Orabona, Luo Jie, Barbara Caputo; 13(Feb):227--253, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/orabona12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/orabona12a.html
</link>
<description>
In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach.  Here we present an MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem.  We prove theoretically 
</description>
</item>

<item>
<title>
Active Clustering of Biological Sequences; Konstantin Voevodski, Maria-Florina Balcan, Heiko R&#246;glin, Shang-Hua Teng, Yu Xia; 13(Jan):203--225, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/voevodski12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/voevodski12a.html
</link>
<description>
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points.  In our model we assume that we have access to one-versus-all queries that, given a point s &#8712; S, return the distances between s and all other points.  We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries.  Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering.  We use our procedure 
</description>
</item>

<item>
<title>
Optimal Distributed Online Prediction Using Mini-Batches; Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao; 13(Jan):165--202, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/dekel12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/dekel12a.html
</link>
<description>
Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms.  We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed 
</description>
</item>

<item>
<title>
An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity; Nir Ailon; 13(Jan):137--164, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ailon12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ailon12a.html
</link>
<description>
Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise).  The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible.  Our performance is measured by two parameters: the number of disagreements (loss) and the query complexity (number of pairwise preference labels).  Our algorithm adaptively queries at most O(&#949;^{-6} n log^5 n) preference labels for a regret of &#949; times the optimal loss.  As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem.
</description>
</item>

<item>
<title>
Refinement of Operator-valued Reproducing Kernels; Haizhang Zhang, Yuesheng Xu, Qinghui Zhang; 13(Jan):91--136, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhang12a.html
</link>
<description>
This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. The study is motivated by the need to update the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. Numerical simulations confirm that the established refinement kernel method is able to meet this need.  Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. Concrete examples of refining translation invariant 
</description>
</item>

<item>
<title>
Plug-in Approach to Active Learning; Stanislav Minsker; 13(Jan):67--90, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/minsker12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/minsker12a.html
</link>
<description>
We present a new active learning algorithm based on nonparametric estimators of the regression function.  Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by the proposed method over a broad class of underlying distributions.  We also prove minimax lower bounds which show that the obtained rates are almost tight.
</description>
</item>

<item>
<title>
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection; Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luj&#225;n; 13(Jan):27--66, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/brown12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/brown12a.html
</link>
<description>
We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation.  This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?".  To answer this, we adopt a different strategy than is usual in the feature selection literature--instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels.  While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', 
</description>
</item>

<item>
<title>
Distance Metric Learning with Eigenvalue Optimization; Yiming Ying, Peng Li; 13(Jan):1--26, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ying12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ying12a.html
</link>
<description>
The main theme of this paper is to develop a novel eigenvalue optimization framework for learning a Mahalanobis metric.  Within this context, we introduce a novel metric learning approach called DML-eig, which is shown to be equivalent to a well-known eigenvalue optimization problem called minimizing the maximal eigenvalue of a symmetric matrix (Overton, 1988; Lewis and Overton, 1996).  Moreover, we formulate LMNN (Weinberger et al., 2005), one of the state-of-the-art metric learning methods, as a similar eigenvalue optimization problem. This novel framework not only provides new insights into metric learning but also opens new avenues to the design of efficient metric learning algorithms.
</description>
</item>

<item>
<title>
Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms; Beno&#238;t Patra; 12(Dec):3431--3466, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/patra11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/patra11a.html
</link>
<description>
Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An in-depth analysis of 
</description>
</item>

<item>
<title>
A Simpler Approach to Matrix Completion; Benjamin Recht; 12(Dec):3413--3430, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/recht11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/recht11a.html
</link>
<description>
This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix.  These results improve on prior work by Cand&#232;s and Recht (2009), Cand&#232;s and Tao (2009), and Keshavan et al. (2009).  The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition.  The proof of this assertion is short, self-contained, 
</description>
</item>

<item>
<title>
Learning with Structured Sparsity; Junzhou Huang, Tong Zhang, Dimitris Metaxas; 12(Nov):3371--3412, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/huang11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/huang11b.html
</link>
<description>
This paper investigates a learning formulation called  structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing.  By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years.  A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure.  It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization.  
</description>
</item>

<item>
<title>
Semi-Supervised Learning with Measure Propagation; Amarnag Subramanya, Jeff Bilmes; 12(Nov):3311--3370, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/subramanya11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/subramanya11a.html
</link>
<description>
We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 10^8 samples.  In this context, we propose a graph node ordering algorithm that is 
</description>
</item>

<item>
<title>
An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models; Piotr Zwiernik; 12(Nov):3283--3310, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zwiernik11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zwiernik11a.html
</link>
<description>
The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold.
</description>
</item>

<item>
<title>
The Sample Complexity of Dictionary Learning; Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein; 12(Nov):3259--3281, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vainsencher11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vainsencher11a.html
</link>
<description>
A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary.  Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples?  We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective.  We develop generalization 
</description>
</item>

<item>
<title>
Robust Gaussian Process Regression with a Student-t Likelihood; Pasi Jyl&#228;nki, Jarno Vanhatalo, Aki Vehtari; 12(Nov):3227--3257, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jylanki11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jylanki11a.html
</link>
<description>
This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions.  In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We 
</description>
</item>

<item>
<title>
Group Lasso Estimation of High-dimensional Covariance Matrices; J&#233;r&#233;mie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Mu&#241;iz-Alvarez; 12(Nov):3187--3225, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/bigot11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/bigot11a.html
</link>
<description>
In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a high-dimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional 
</description>
</item>

<item>
<title>
Adaptive Exact Inference in Graphical Models; &#214;zg&#252;r S&#252;mer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu; 12(Nov):3147--3186, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/sumer11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/sumer11a.html
</link>
<description>
Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch.  In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the 
</description>
</item>

<item>
<title>
Unsupervised Supervised Learning II: Margin-Based Classification Without Labels; Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon; 12(Nov):3119--3145, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/balasubramanian11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/balasubramanian11a.html
</link>
<description>
Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever.
</description>
</item>

<item>
<title>
Efficient and Effective Visual Codebook Generation Using Additive Kernels; Jianxin Wu, Wei-Chian Tan, James M. Rehg; 12(Nov):3097--3118, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wu11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wu11b.html
</link>
<description>
Common visual codebook generation methods used in a bag of visual words model, for example, k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. In this paper, we demonstrate that HIK can be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as the standard k-means. 
</description>
</item>

<item>
<title>
In All Likelihood, Deep Belief Is Not Enough; Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge; 12(Nov):3071--3096, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/theis11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/theis11a.html
</link>
<description>
Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the 
</description>
</item>

<item>
<title>
The Stationary Subspace Analysis Toolbox; Jan Saputra M&#252;ller, Paul von B&#252;nau, Frank C. Meinecke, Franz J. Kir&#225;ly, Klaus-Robert M&#252;ller; 12(Oct):3065--3069, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mueller11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mueller11a.html
</link>
<description>
The Stationary Subspace Analysis (SSA) algorithm linearly factorizes a high-dimensional time series into stationary and non-stationary components.  The SSA Toolbox is a platform-independent efficient stand-alone implementation of the SSA algorithm with a graphical user interface written in Java, that can also be invoked from the command line and from Matlab. The graphical interface guides the user through the whole process; data can be imported and exported from comma separated values (CSV) and Matlab's .mat files.
</description>
</item>

<item>
<title>
Robust Approximate Bilinear Programming for Value Function Approximation; Marek Petrik, Shlomo Zilberstein; 12(Oct):3027--3063, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/petrik11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/petrik11a.html
</link>
<description>
Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. 
</description>
</item>

<item>
<title>
High-dimensional Covariance Estimation Based On Gaussian Graphical Models; Shuheng Zhou, Philipp R&#252;timann, Min Xu, Peter B&#252;hlmann; 12(Oct):2975--3026, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhou11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhou11a.html
</link>
<description>
Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using l_1-penalization methods.  We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many l_1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator.  We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and 
</description>
</item>

<item>
<title>
Hierarchical Knowledge Gradient for Sequential Sampling; Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier; 12(Oct):2931--2974, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mes11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mes11a.html
</link>
<description>
We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multi-dimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation 
</description>
</item>

<item>
<title>
On Equivalence Relationships Between Classification and Ranking Algorithms; &#350;eyda Ertekin, Cynthia Rudin; 12(Oct):2905--2929, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ertekin11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ertekin11a.html
</link>
<description>
We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, 
</description>
</item>

<item>
<title>
Convergence Rates of Efficient Global Optimization Algorithms; Adam D. Bull; 12(Oct):2879--2904, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/bull11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/bull11a.html
</link>
<description>
In the efficient global optimization problem, we minimize an unknown function f, using as few observations f(x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret.  Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour.  Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS).  When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS.  We provide convergence 
</description>
</item>

<item>
<title>
Efficient Learning with Partially Observed Attributes; Nicol&#242; Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir; 12(Oct):2857--2878, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/cesa-bianchi11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/cesa-bianchi11a.html
</link>
<description>
We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the "local budget" setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier "global budget" setting, where it is only the overall number of accessible 
</description>
</item>

<item>
<title>
Neyman-Pearson Classification, Convexity and Stochastic Constraints; Philippe Rigollet, Xin Tong; 12(Oct):2831--2855, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
</link>
<description>
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss &#966;. Given a finite collection of classifiers, we combine them and obtain a new classifier that simultaneously satisfies the following two properties with high probability: (i) its &#966;-type I error is below a pre-specified level, and (ii) it has &#966;-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to 
</description>
</item>

<item>
<title>
Scikit-learn: Machine Learning in Python; Fabian Pedregosa, Ga&#235;l Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, &#201;douard Duchesnay; 12(Oct):2825--2830, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
</link>
<description>
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language.  Emphasis is put on ease of use, performance, documentation, and API consistency.  It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings.  Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
</description>
</item>

<item>
<title>
Structured Variable Selection with Sparsity-Inducing Norms; Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach; 12(Oct):2777--2824, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jenatton11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jenatton11b.html
</link>
<description>
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual l_1-norm and the group l_1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero 
</description>
</item>

<item>
<title>
Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes; Elias Zavitsanos, Georgios Paliouras, George A. Vouros; 12(Oct):2749--2775, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zavitsanos11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zavitsanos11a.html
</link>
<description>
This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model 
</description>
</item>

<item>
<title>
Large Margin Hierarchical Classification with Mutually Exclusive Class Membership; Huixin Wang, Xiaotong Shen, Wei Pan; 12(Sep):2721--2748, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wang11c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wang11c.html
</link>
<description>
In hierarchical classification, class labels are structured, that is, each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the inter-class relationship to enhance the performance of flat classification, which ignores such dependency.  This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized
</description>
</item>

<item>
<title>
Convex and Network Flow Optimization for Structured Sparsity; Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach; 12(Sep):2681--2720, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mairal11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mairal11a.html
</link>
<description>
We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of l_2- or l_&#8734;-norms over groups of variables. Whereas much effort has been put into developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups.  To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of l_&#8734;-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods.  On the other hand, we use proximal splitting techniques,
</description>
</item>

<item>
<title>
Bayesian Co-Training; Shipeng Yu, Balaji Krishnapuram, R&#243;mer Rosales, R. Bharat Rao; 12(Sep):2649--2680, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/yu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/yu11a.html
</link>
<description>
Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of model are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for
</description>
</item>

<item>
<title>
Theoretical Analysis of Bayesian Matrix Factorization; Shinichi Nakajima, Masashi Sugiyama; 12(Sep):2583--2648, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/nakajima11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/nakajima11a.html
</link>
<description>
Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments.  In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method.  Through finite-sample analysis of the VBMF estimator, we show that two types of shrinkage factors exist in the VBMF estimator: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage, both acting on each singular component separately for producing low-rank solutions.  The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach.  Thus, no trace-norm shrinkage
</description>
</item>

<item>
<title>
Kernel Analysis of Deep Networks; Gr&#233;goire Montavon, Mikio L. Braun, Klaus-Robert M&#252;ller; 12(Sep):2563--2581, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/montavon11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/montavon11a.html
</link>
<description>
It is common knowledge that training deep networks forms an efficient and well-generalizing representation of the problem. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed
</description>
</item>

<item>
<title>
Weisfeiler-Lehman Graph Kernels; Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt; 12(Sep):2539--2561, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/shervashidze11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/shervashidze11a.html
</link>
<description>
In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence.  In our experimental evaluation, our kernels outperform state-of-the-art graph
</description>
</item>

<item>
<title>
Natural Language Processing (Almost) from Scratch; Ronan Collobert, Jason Weston, L&#233;on Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa; 12(Aug):2493--2537, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/collobert11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/collobert11a.html
</link>
<description>
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling.  This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge.  Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data.  This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
</description>
</item>

<item>
<title>
LPmade: Link Prediction Made Easy; Ryan N. Lichtenwalter, Nitesh V. Chawla; 12(Aug):2489--2492, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/lichtenwalter11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/lichtenwalter11a.html
</link>
<description>
LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction
</description>
</item>

<item>
<title>
Distance Dependent Chinese Restaurant Processes; David M. Blei, Peter I. Frazier; 12(Aug):2461--2488, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/blei11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/blei11a.html
</link>
<description>
We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements.  This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity.  We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings.  We study its empirical performance with three text corpora.  We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide
</description>
</item>

<item>
<title>
Parallel Algorithm for Learning Optimal Bayesian Network Structure; Yoshinori Tamada, Seiya Imoto, Satoru Miyano; 12(Jul):2437--2459, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tamada11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tamada11a.html
</link>
<description>
We present a parallel algorithm for the score-based optimal structure search of Bayesian networks.  This algorithm is based on a dynamic programming (DP) algorithm having O(n &#8901; 2^n) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes.  The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes.  While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(n^&#963;) time and space overhead calculations, 
</description>
</item>

<item>
<title>
Union Support Recovery in Multi-task Learning; Mladen Kolar, John Lafferty, Larry Wasserman; 12(Jul):2415--2435, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/kolar11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/kolar11a.html
</link>
<description>
We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting.  Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis.  A clearer and simpler picture emerges by studying the Normal means model.  This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures.
</description>
</item>

<item>
<title>
MULAN: A Java Library for Multi-Label Learning; Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas; 12(Jul):2411--2414, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tsoumakas11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tsoumakas11a.html
</link>
<description>
MULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures.
</description>
</item>

<item>
<title>
Universality, Characteristic Kernels and RKHS Embedding of Measures; Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet; 12(Jul):2389--2410, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/sriperumbudur11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/sriperumbudur11a.html
</link>
<description>
Over the last few years, two different notions of positive definite (pd) kernels---universal and characteristic---have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding 
</description>
</item>

<item>
<title>
Waffles: A Machine Learning Toolkit; Michael Gashler; 12(Jul):2383--2387, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gashler11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gashler11a.html
</link>
<description>
We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License.
</description>
</item>

<item>
<title>
Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models; Sharon Goldwater, Thomas L. Griffiths, Mark Johnson; 12(Jul):2335--2382, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/goldwater11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/goldwater11a.html
</link>
<description>
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework.  We discuss two stochastic 
</description>
</item>

<item>
<title>
Proximal Methods for Hierarchical Sparse Coding; Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach; 12(Jul):2297--2334, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jenatton11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jenatton11a.html
</link>
<description>
Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary.  We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them.  More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators.  Our 
</description>
</item>

<item>
<title>
MSVMpack: A Multi-Class Support Vector Machine Package; Fabien Lauer, Yann Guermeur; 12(Jul):2293--2296, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/lauer11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/lauer11a.html
</link>
<description>
This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine.  All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides the first unified implementation of them and offers a convenient basis for developing other instances.  It is also the first parallel implementation of M-SVMs.  The package consists of a set of command-line tools with a callable library.  The documentation includes a tutorial, a user's guide and a developer's guide.
</description>
</item>

<item>
<title>
Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning; Liwei Wang; 12(Jul):2269--2292, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wang11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wang11b.html
</link>
<description>
We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases in which active learning is very useful, it is also easy to construct examples for which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient, which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and 
</description>
</item>

<item>
<title>
Multiple Kernel Learning Algorithms; Mehmet G&#246;nen, Ethem Alpayd&#305;n; 12(Jul):2211--2268, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gonen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gonen11a.html
</link>
<description>
In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there are differences between them in complexity as 
</description>
</item>

<item>
<title>
Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood; Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllym&#228;ki; 12(Jul):2181--2210, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/carvalho11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/carvalho11a.html
</link>
<description>
We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f&#770;CLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion.  The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, 
</description>
</item>

<item>
<title>
On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem; Daniil Ryabko; 12(Jul):2161--2180, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ryabko11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ryabko11a.html
</link>
<description>
A sequence x_1,...,x_n,... of discrete-valued observations is generated according to some unknown probabilistic law (measure) &#956;. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure &#956; belongs to an arbitrary but known class C of process measures. The non-realizable case is when &#956; is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions.
</description>
</item>

<item>
<title>
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization; John Duchi, Elad Hazan, Yoram Singer; 12(Jul):2121--2159, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/duchi11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/duchi11a.html
</link>
<description>
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as 
</description>
</item>

<item>
<title>
Information Rates of Nonparametric Gaussian Process Methods; Aad van der Vaart, Harry van Zanten; 12(Jun):2095--2119, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vandervaart11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vandervaart11a.html
</link>
<description>
We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior.  We illustrate the computation of the upper bound for the Mat&#233;rn  and squared exponential kernels.  For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens 
</description>
</item>

<item>
<title>
Exploiting Best-Match Equations for Efficient Reinforcement Learning; Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering; 12(Jun):2045--2094, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vanseijen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vanseijen11a.html
</link>
<description>
This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods against the space efficiency of model-free methods.  Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model.  We prove that, unlike regular sparse model-based methods, best-match learning is guaranteed to converge to the optimal Q-values in the tabular case.  Empirical results demonstrate that best-match learning can substantially outperform regular sparse model-based methods, as well as several 
</description>
</item>

<item>
<title>
A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis; Trine Julie Abrahamsen, Lars Kai Hansen; 12(Jun):2027--2044, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/abrahamsen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/abrahamsen11a.html
</link>
<description>
Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: first, we propose a computationally less intensive approximate leave-one-out estimator; second, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA, our analysis also suggests a simplified approximate expression.
</description>
</item>

<item>
<title>
The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets; Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, Christian Buchta; 12(Jun):2021--2025, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/hahsler11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/hahsler11a.html
</link>
<description>
This paper describes the ecosystem of R add-on packages developed around the infrastructure provided by the package arules. The packages provide comprehensive functionality for analyzing interesting patterns including frequent itemsets, association rules, frequent sequences and for building applications like associative classification. After discussing the ecosystem's design we illustrate the ease of mining and visualizing rules with a short example.
</description>
</item>

<item>
<title>
Generalized TD Learning; Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii; 12(Jun):1977--2020, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ueno11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ueno11a.html
</link>
<description>
Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed.  Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation.  To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation.  This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both batch and online learning procedures for the value function estimation in a unified way in terms 
</description>
</item>

<item>
<title>
Kernel Regression in the Presence of Correlated Errors; Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor; 12(Jun):1955--1976, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/debrabanter11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/debrabanter11a.html
</link>
<description>
It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exists a wide variety of methods for coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated, which is in contrast to the independent and identically distributed 
</description>
</item>

<item>
<title>
Dirichlet Process Mixtures of Generalized Linear Models; Lauren A. Hannah, David M. Blei, Warren B. Powell; 12(Jun):1923--1953, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/hannah11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/hannah11a.html
</link>
<description>
We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression.  Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models.  DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model.  We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models.  We give conditions for weak consistency of the joint distribution and pointwise consistency of 
</description>
</item>

<item>
<title>
Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms; Vianney Perchet; 12(Jun):1893--1921, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/perchet11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/perchet11a.html
</link>
<description>
We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. These algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved on average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Vorono&#239; diagram but instead in terms of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, 
</description>
</item>

<item>
<title>
Stochastic Methods for l_1-regularized Loss Minimization; Shai Shalev-Shwartz, Ambuj Tewari; 12(Jun):1865--1892, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/shalev-shwartz11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/shalev-shwartz11a.html
</link>
<description>
We describe and analyze two stochastic methods for l_1-regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is made uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.
</description>
</item>

<item>
<title>
A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin; Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng; 12(Jun):1835--1863, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wang11a.html
</link>
<description>
Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error.  Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other 
</description>
</item>

<item>
<title>
Hyper-Sparse Optimal Aggregation; St&#233;phane Ga&#239;ffas, Guillaume Lecu&#233;; 12(Jun):1813--1833, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gaiffas11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gaiffas11a.html
</link>
<description>
Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures have been convex combinations of all elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation.
</description>
</item>

<item>
<title>
Learning Latent Tree Graphical Models; Myung Jin Choi, Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky; 12(May):1771--1812, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/choi11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/choi11b.html
</link>
<description>
We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information 
</description>
</item>

<item>
<title>
A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes; St&#233;phane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann; 12(May):1729--1770, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ross11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ross11a.html
</link>
<description>
Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions.
</description>
</item>

<item>
<title>
Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets; Chiwoo Park, Jianhua Z. Huang, Yu Ding; 12(May):1697--1728, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/park11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/park11a.html
</link>
<description>
Gaussian process regression is a flexible and powerful tool for machine learning, but its high computational complexity hinders its broader application. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has computational complexity comparable to or better than that of other competing methods, but it is easier to be 
</description>
</item>

<item>
<title>
X-Armed Bandits; S&#233;bastien Bubeck, R&#233;mi Munos, Gilles Stoltz, Csaba Szepesv&#225;ri; 12(May):1655--1695, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/bubeck11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/bubeck11a.html
</link>
<description>
We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker.  Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems.  In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, 
</description>
</item>

<item>
<title>
Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates; Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky; 12(May):1617--1653, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tan11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tan11a.html
</link>
<description>
The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n,d,k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning 
</description>
</item>

<item>
<title>
Double Updating Online Learning; Peilin Zhao, Steven C.H. Hoi, Rong Jin; 12(May):1587--1615, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhao11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhao11a.html
</link>
<description>
In most kernel-based online learning algorithms, when an incoming instance is misclassified, it is added to the pool of support vectors and assigned a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the 
</description>
</item>

<item>
<title>
Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation; Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama; 12(May):1537--1586, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tomioka11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tomioka11a.html
</link>
<description>
We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL).  Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm.  We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms.  In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems.  We experimentally 
</description>
</item>

<item>
<title>
Learning from Partial Labels; Timothee Cour, Ben Sapp, Ben Taskar; 12(May):1501--1536, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/cour11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/cour11a.html
</link>
<description>
We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct.  Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available.  The goal is to learn a classifier that can disambiguate the partially-labeled training instances, and generalize to unseen data.  We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled.  Exploiting this property 
</description>
</item>

<item>
<title>
Computationally Efficient Convolved Multiple Output Gaussian Processes; Mauricio A. &#193;lvarez, Neil D. Lawrence; 12(May):1459--1500, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/alvarez11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/alvarez11a.html
</link>
<description>
Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this 
</description>
</item>

<item>
<title>
Learning a Robust Relevance Model for Search Using Kernel Methods; Wei Wu, Jun Xu, Hang Li, Satoshi Oyama; 12(May):1429--1458, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wu11a.html
</link>
<description>
This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal 
</description>
</item>

<item>
<title>
Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning; Dorota G&#322;owacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson; 12(Apr):1425--1428, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/glowacka11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/glowacka11a.html
</link>
<description>
Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of the recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation.
</description>
</item>

<item>
<title>
Clustering Algorithms for Chains; Antti Ukkonen; 12(Apr):1389--1423, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ukkonen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ukkonen11a.html
</link>
<description>
We consider the problem of clustering a set of chains into k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, 
</description>
</item>

<item>
<title>
Faster Algorithms for Max-Product Message-Passing; Julian J. McAuley, Tib&#233;rio S. Caetano; 12(Apr):1349--1388, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mcauley11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mcauley11a.html
</link>
<description>
Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm or loopy belief-propagation. The exact solution to this problem is well-known to be exponential in the size of the maximal cliques of the triangulated model, while approximate inference is typically exponential in the size of the model's factors. In this paper, we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist only of latent variables (i.e., they do not depend on an observation). This is a common case in a wide variety of applications that deal with 
</description>
</item>

<item>
<title>
A Family of Simple Non-Parametric Kernel Learning Algorithms; Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi; 12(Apr):1313--1347, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhuang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhuang11a.html
</link>
<description>
Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by general-purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interior-point SDP solver could be as high as O(N^6.5), which prevents NPKL methods from being applied to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed "SimpleNPKL", which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with 
</description>
</item>

<item>
<title>
Better Algorithms for Benign Bandits; Elad Hazan, Satyen Kale; 12(Apr):1287--1311, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/hazan11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/hazan11a.html
</link>
<description>
The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information.  A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in 
</description>
</item>

<item>
<title>
Locally Defined Principal Curves and Surfaces; Umut Ozertem, Deniz Erdogmus; 12(Apr):1249--1286, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ozertem11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ozertem11a.html
</link>
<description>
Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method and can be used with 
</description>
</item>

<item>
<title>
DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model; Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyv&#228;rinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, Kenneth Bollen; 12(Apr):1225--1248, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/shimizu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/shimizu11a.html
</link>
<description>
Structural equation models and Bayesian networks have been widely used to analyze causal relations between continuous variables. In such frameworks, linear acyclic models are typically used to model the data-generating process of variables.  Recently, it was shown that use of non-Gaussianity identifies the full structure of a linear acyclic model, that is, a causal ordering of variables and their connection strengths, without using any prior knowledge on the network structure, which is not the case with conventional methods.  However, existing estimation methods are based on iterative search algorithms and may not converge to a correct solution in a finite number of steps.  In this paper, 
</description>
</item>

<item>
<title>
The Indian Buffet Process: An Introduction and Review; Thomas L. Griffiths, Zoubin Ghahramani; 12(Apr):1185--1224, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/griffiths11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/griffiths11a.html
</link>
<description>
The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections 
</description>
</item>

<item>
<title>
Laplacian Support Vector Machines Trained in the Primal; Stefano Melacci, Mikhail Belkin; 12(Mar):1149--1184, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/melacci11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/melacci11a.html
</link>
<description>
In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop a better understanding of classifiers exploiting unlabeled data and to improve their quality. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early 
</description>
</item>

<item>
<title>
Anechoic Blind Source Separation Using Wigner Marginals; Lars Omlor, Martin A. Giese; 12(Mar):1111--1148, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/omlor11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/omlor11a.html
</link>
<description>
Blind source separation problems emerge in many applications, where signals can be modeled as superpositions of multiple sources. Many popular applications of blind source separation are based on linear instantaneous mixture models. If specific invariance properties are known about the sources, for example, translation or rotation invariance, the simple linear model can be extended by inclusion of the corresponding transformations.  When the sources are invariant against translations (spatial displacements or time shifts) the resulting model is called an anechoic mixing model. We present a new algorithmic framework for the solution of anechoic problems in arbitrary dimensions. This framework 
</description>
</item>

<item>
<title>
Differentially Private Empirical Risk Minimization; Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate; 12(Mar):1069--1109, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/chaudhuri11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/chaudhuri11a.html
</link>
<description>
Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the &#949;-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before 
</description>
</item>

<item>
<title>
Two Distributed-State Models For Generating High-Dimensional Time Series; Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis; 12(Mar):1025--1068, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/taylor11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/taylor11a.html
</link>
<description>
In this paper we develop a class of nonlinear generative models for high-dimensional time series.  We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued "visible" variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This "conditional" RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost 
</description>
</item>

<item>
<title>
Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data; Zeeshan Syed, John Guttag; 12(Mar):999--1024, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/syed11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/syed11a.html
</link>
<description>
In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in long-term signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients 
</description>
</item>

<item>
<title>
l_p-Norm Multiple Kernel Learning; Marius Kloft, Ulf Brefeld, S&#246;ren Sonnenburg, Alexander Zien; 12(Mar):953--997, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/kloft11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/kloft11a.html
</link>
<description>
Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this l_1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is, l_p-norms with p &#8805; 1. This interleaved optimization is much 
</description>
</item>

<item>
<title>
Forest Density Estimation; Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman; 12(Mar):907--951, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/liu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/liu11a.html
</link>
<description>
We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models.  For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal's algorithm to estimate the optimal forest on held out data.  We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest.  For graph estimation, we consider the problem of estimating forests with restricted tree sizes.  We prove that finding a maximum weight spanning forest with restricted 
</description>
</item>

<item>
<title>
Sparse Linear Identifiable Multivariate Modeling; Ricardo Henao, Ole Winther; 12(Mar):863--905, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/henao11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/henao11a.html
</link>
<description>
In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component &#948;-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to 
</description>
</item>

<item>
<title>
Learning Transformation Models for Ranking and Survival Analysis; Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel; 12(Mar):819--862, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vanbelle11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vanbelle11a.html
</link>
<description>
This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the 'margin' is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual 
</description>
</item>

<item>
<title>
Information, Divergence and Risk for Binary Experiments; Mark D. Reid, Robert C. Williamson; 12(Mar):731--817, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/reid11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/reid11a.html
</link>
<description>
We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives, all of which are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation 
</description>
</item>

<item>
<title>
Inverse Reinforcement Learning in Partially Observable Environments; Jaedeug Choi, Kee-Eung Kim; 12(Mar):691--730, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/choi11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/choi11a.html
</link>
<description>
Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories 
</description>
</item>

<item>
<title>
Efficient Structure Learning of Bayesian Networks using Constraints; Cassio P. de Campos, Qiang Ji; 12(Mar):663--689, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/decampos11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/decampos11a.html
</link>
<description>
This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable.  It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees.  These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion.  Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality.  As an example, structural constraints are used to map the problem of structure learning in Dynamic 
</description>
</item>

<item>
<title>
Parameter Screening and Optimisation for ILP using Designed Experiments; Ashwin Srinivasan, Ganesh Ramakrishnan; 12(Feb):627--662, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/srinivasan11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/srinivasan11a.html
</link>
<description>
Reports of experiments conducted with an Inductive Logic Programming system rarely describe how specific values of parameters of the system are arrived at when constructing models. Usually, no attempt is made to identify sensitive parameters, and those that are used are often given "factory-supplied" default values, or values obtained from some non-systematic exploratory analysis. The immediate consequence of this is, of course, that it is not clear if better models could have been obtained if some form of parameter selection and optimisation had been performed.  Questions follow inevitably on the experiments themselves: specifically, are all algorithms being treated fairly, and is the 
</description>
</item>

<item>
<title>
Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach; Gilles Meyer, Silv&#232;re Bonnabel, Rodolphe Sepulchre; 12(Feb):593--625, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/meyer11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/meyer11a.html
</link>
<description>
The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixed-rank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the 
</description>
</item>

<item>
<title>
Variable Sparsity Kernel Learning; Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman; 12(Feb):565--592, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/aflalo11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/aflalo11a.html
</link>
<description>
This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l_1 norm regularization for promoting sparsity within RKHS norms of each group and l_s, s&#8805;2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels---hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits 
</description>
</item>

<item>
<title>
Minimum Description Length Penalization for Group and Multi-Task Sparse Learning; Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar; 12(Feb):525--564, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/dhillon11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/dhillon11a.html
</link>
<description>
We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using  two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MIC-GROUP)  and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group.  MIC-MULTI applies when there are multiple related tasks that share the same set 
</description>
</item>

<item>
<title>
Learning Multi-modal Similarity; Brian McFee, Gert Lanckriet; 12(Feb):491--523, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mcfee11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mcfee11a.html
</link>
<description>
In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation.  Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video.  Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications.  We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space.  Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human 
</description>
</item>

<item>
<title>
Posterior Sparsity in Unsupervised Dependency Parsing; Jennifer Gillenwater, Kuzman Ganchev, Jo&#227;o Gra&#231;a, Fernando Pereira, Ben Taskar; 12(Feb):455--490, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gillenwater11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gillenwater11a.html
</link>
<description>
A strong inductive bias is essential in unsupervised grammar induction.  In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types.  We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Gra&#231;a et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by
</description>
</item>

<item>
<title>
Approximate Marginals in Latent Gaussian Models; Botond Cseke, Tom Heskes; 12(Feb):417--454, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/cseke11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/cseke11a.html
</link>
<description>
We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986).  We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both 
</description>
</item>

<item>
<title>
Operator Norm Convergence of Spectral Clustering on Level Sets; Bruno Pelletier, Pierre Pudlo; 12(Feb):385--416, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/pelletier11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/pelletier11a.html
</link>
<description>
Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t.  A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed.  Our algorithm is composed of two steps.  First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t.  Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix.  Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator 
</description>
</item>

<item>
<title>
Models of Cooperative Teaching and Learning; Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich; 12(Feb):349--384, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zilles11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zilles11a.html
</link>
<description>
While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects "helpful" training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, "helpful" examples can speed up the learning process.  The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using "coding tricks", has been widely addressed. Nevertheless, the resulting teaching and 
</description>
</item>

<item>
<title>
Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs; Jim C. Huang, Brendan J. Frey; 12(Jan):301--348, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/huang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/huang11a.html
</link>
<description>
We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs). Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives. We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs. In order to perform inference in such models, we describe the 'derivative-sum-product' (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF. We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research.
</description>
</item>

<item>
<title>
A Bayesian Approximation Method for Online Ranking; Ruby C. Weng, Chih-Jen Lin; 12(Jan):267--300, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/weng11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/weng11a.html
</link>
<description>
This paper describes a Bayesian approximation method to obtain online ranking algorithms for games with multiple teams and multiple players. Large online ranking systems have recently become much needed for Internet games. We consider game models in which a k-team game is treated as several two-team games. By approximating the expectation of teams' (or players') performances, we derive simple analytic update rules. These update rules, without numerical integrations, are very easy to interpret and implement. Experiments on game data show that the accuracy of our approach is competitive with state-of-the-art systems such as TrueSkill, but both the running time and the code are much shorter.
</description>
</item>

<item>
<title>
Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm; Vladimir V. V'yugin; 12(Jan):241--266, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vyugin11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vyugin11a.html
</link>
<description>
In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present a modification of the Kalai and Vempala algorithm of following the perturbed leader in which weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has optimal performance when the scaled fluctuations of one-step losses of experts of the pool tend to zero.
</description>
</item>

<item>
<title>
Logistic Stick-Breaking Process; Lu Ren, Lan Du, Lawrence Carin, David Dunson; 12(Jan):203--239, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ren11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ren11a.html
</link>
<description>
A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks.  Efficient variational Bayesian inference is derived, and comparisons are made 
</description>
</item>

<item>
<title>
Training SVMs Without Offset; Ingo Steinwart, Don Hush, Clint Scovel; 12(Jan):141--202, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/steinwart11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/steinwart11a.html
</link>
<description>
We develop, analyze, and test a training algorithm for support vector machine classifiers without offset. Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations. For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset. We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out that the new algorithm needs significantly fewer iterations and 
</description>
</item>

<item>
<title>
Bayesian Generalized Kernel Mixed Models; Zhihua Zhang, Guang Dai, Michael I. Jordan; 12(Jan):111--139, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhang11a.html
</link>
<description>
We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman's g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging 
</description>
</item>

<item>
<title>
Multitask Sparsity via Maximum Entropy Discrimination; Tony Jebara; 12(Jan):75--110, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jebara11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jebara11a.html
</link>
<description>
A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement.  Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces 
</description>
</item>

<item>
<title>
CARP: Software for Fishing Out Good Clustering Algorithms; Volodymyr Melnykov, Ranjan Maitra; 12(Jan):69--73, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/melnykov11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/melnykov11a.html
</link>
<description>
This paper presents the CLUSTERING ALGORITHMS' REFEREE PACKAGE or CARP,  an open source GNU GPL-licensed C package for evaluating clustering algorithms. Calibrating performance of such algorithms is important and CARP addresses this need by generating datasets of different clustering complexity and by assessing the performance of the concerned algorithm in terms of its ability to classify each dataset relative to the true grouping. This paper briefly describes the software and its capabilities. 
</description>
</item>

<item>
<title>
Improved Moves for Truncated Convex Models; M. Pawan Kumar, Olga Veksler, Philip H.S. Torr; 12(Jan):31--67, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/kumar11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/kumar11a.html
</link>
<description>
We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of &#945;&#946;-Swap and &#945;-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels). Furthermore, we show that Range Expansion provides 
</description>
</item>

<item>
<title>
Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation; Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan; 12(Jan):1--30, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ni11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ni11a.html
</link>
<description>
We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art 
</description>
</item>

<item>
<title>
Learning Non-Stationary Dynamic Bayesian Networks; Joshua W. Robinson, Alexander J. Hartemink; 11(Dec):3647--3680, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/robinson10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/robinson10a.html
</link>
<description>
Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data.  An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings.  In this paper, we introduce a new class of graphical model called a non-stationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time.  Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time.  
</description>
</item>

<item>
<title>
PAC-Bayesian Analysis of Co-clustering and Beyond; Yevgeny Seldin, Naftali Tishby; 11(Dec):3595--3646, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/seldin10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/seldin10a.html
</link>
<description>
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two 
</description>
</item>

<item>
<title>
Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory; Sumio Watanabe; 11(Dec):3571--3594, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/watanabe10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/watanabe10a.html
</link>
<description>
In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown.  In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss.  In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems.  First, the Bayes cross-validation loss is asymptotically equivalent 
</description>
</item>

<item>
<title>
Incremental Sigmoid Belief Networks for Grammar Learning; James Henderson, Ivan Titov; 11(Dec):3541--3570, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/henderson10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/henderson10a.html
</link>
<description>
We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network's model structure is a function of the predicted output structure.  These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure.  ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain's complex statistical dependencies benefits from large numbers of latent variables.  While exact inference 
</description>
</item>

<item>
<title>
Rate Minimaxity of the Lasso and Dantzig Selector for the l_q Loss in l_r Balls; Fei Ye, Cun-Hui Zhang; 11(Dec):3519--3540, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ye10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ye10a.html
</link>
<description>
We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in l_r balls, we provide lower bounds for the minimax l_q risk and minimax quantiles of the l_q loss for all design matrices. Under an l_0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the l_q risk and loss in 
</description>
</item>

<item>
<title>
An Exponential Model for Infinite Rankings; Marina Meil&#259;, Le Bao; 11(Dec):3481--3518, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/meila10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/meila10a.html
</link>
<description>
This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many there are. Similarly, a user presented with the ranked output of a search engine will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential 
</description>
</item>

<item>
<title>
Efficient Algorithms for Conditional Independence Inference; Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studen&#253;; 11(Dec):3453--3479, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bouckaert10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bouckaert10b.html
</link>
<description>
The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods on the number of variables involved. We recall/describe the theoretical 
</description>
</item>

<item>
<title>
L_p-Nested Symmetric Distributions; Fabian Sinz, Matthias Bethge; 11(Dec):3409--3451, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sinz10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sinz10a.html
</link>
<description>
In this paper, we introduce a new family of probability densities called L_p-nested symmetric distributions. The common property, shared by all members of the new class, is the same functional form &#961;(x) = ~&#961;(f(x)), where f is a nested cascade of L_p-norms ||x||_p = (&#8721; |x_i|^p)^(1/p). L_p-nested symmetric distributions thereby are a special case of &#957;-spherical distributions for which f is only required to be positively homogeneous of degree one. While both &#957;-spherical and L_p-nested symmetric distributions contain many widely used families of probability models such as the Gaussian, spherically and elliptically symmetric distributions, L_p-spherically symmetric distributions, 
</description>
</item>

<item>
<title>
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion; Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol; 11(Dec):3371--3408, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/vincent10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/vincent10a.html
</link>
<description>
We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, 
</description>
</item>

<item>
<title>
Learning Instance-Specific Predictive Models; Shyam Visweswaran, Gregory F. Cooper; 11(Dec):3333--3369, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/visweswaran10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/visweswaran10a.html
</link>
<description>
This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naive Bayes, 
</description>
</item>

<item>
<title>
Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds; Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra; 11(Dec):3313--3332, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/dmochowski10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/dmochowski10a.html
</link>
<description>
The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain.  While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear.  To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified.  In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem.  On the other hand, under 
</description>
</item>

<item>
<title>
Classification with Incomplete Data Using Dirichlet Process Priors; Chunping Wang, Xuejun Liao, Lawrence Carin, David B. Dunson; 11(Dec):3269--3311, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/wang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/wang10a.html
</link>
<description>
A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local "expert", and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the "experts" allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform 
</description>
</item>

<item>
<title>
Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes; Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen; 11(Nov):3235--3268, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/honkela10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/honkela10a.html
</link>
<description>
Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants.  In this paper we present an efficient algorithm for applying VB to more general models.  The method is based on specifying the functional form of the approximation, such as multivariate Gaussian.  The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations.  This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is 
</description>
</item>

<item>
<title>
A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification; Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin; 11(Nov):3183--3234, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yuan10c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yuan10c.html
</link>
<description>
Large-scale linear classification is widely used in many areas.  The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training.  Although various optimization methods have been proposed in recent years, these have not yet been compared suitably.  In this paper, we first broadly review existing methods.  Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations.  Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.  
</description>
</item>

<item>
<title>
A Generalized Path Integral Control Approach to Reinforcement Learning; Evangelos Theodorou, Jonas Buchli, Stefan Schaal; 11(Nov):3137--3181, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/theodorou10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/theodorou10a.html
</link>
<description>
With the goal of generating more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests using the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has 
</description>
</item>

<item>
<title>
Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials; Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan; 11(Nov):3097--3135, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gupta10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gupta10a.html
</link>
<description>
Many structured information extraction tasks employ collective graphical models that capture inter-instance associativity by coupling them with various clique potentials.  We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials.  We present three families of symmetric potentials---MAX, SUM, and MAJORITY.  We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials.  Our first message computation algorithm, called &#945;-pass, is sub-quadratic in the clique size, outputs exact messages for MAX, and 
</description>
</item>

<item>
<title>
Inducing Tree-Substitution Grammars; Trevor Cohn, Phil Blunsom, Sharon Goldwater; 11(Nov):3053--3096, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cohn10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cohn10b.html
</link>
<description>
Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research.  The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data.  The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions.  We propose a novel compromise by inferring a probabilistic tree 
</description>
</item>

<item>
<title>
Covariance in Unsupervised Learning of Probabilistic Grammars; Shay B. Cohen, Noah A. Smith; 11(Nov):3017--3051, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cohen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cohen10a.html
</link>
<description>
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text.  Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting.  To date, most of the literature has focused on using a Dirichlet prior.  The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar's parameters. Yet, various grammar parameters are expected to be correlated because 
</description>
</item>

<item>
<title>
Gaussian Processes for Machine Learning (GPML) Toolbox; Carl Edward Rasmussen, Hannes Nickisch; 11(Nov):3011--3015, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rasmussen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rasmussen10a.html
</link>
<description>
The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported, including Gaussian and heavy-tailed for regression as well as others suitable for classification.  Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace's method for dealing with non-Gaussian likelihoods, and FITC for dealing with large regression tasks.  
</description>
</item>

<item>
<title>
Semi-Supervised Novelty Detection; Gilles Blanchard, Gyemin Lee, Clayton Scott; 11(Nov):2973--3009, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/blanchard10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/blanchard10a.html
</link>
<description>
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson 
</description>
</item>

<item>
<title>
Tree Decomposition for Large-Scale SVM Problems; Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu; 11(Oct):2935--2972, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/chang10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/chang10b.html
</link>
<description>
To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size
</description>
</item>

<item>
<title>
Linear Algorithms for Online Multitask Classification; Giovanni Cavallanti, Nicol&#242; Cesa-Bianchi, Claudio Gentile; 11(Oct):2901--2934, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cavallanti10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cavallanti10a.html
</link>
<description>
We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks.  We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace.  In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our 
</description>
</item>

<item>
<title>
Expectation Truncation and the Benefits of Preselection In Training Generative Models; J&#246;rg L&#252;cke, Julian Eggert; 11(Oct):2855--2900, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/lucke10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/lucke10a.html
</link>
<description>
We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables.  The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model.  The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes.  The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types 
</description>
</item>

<item>
<title>
Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance; Nguyen Xuan Vinh, Julien Epps, James Bailey; 11(Oct):2837--2854, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/vinh10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/vinh10a.html
</link>
<description>
Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when 
</description>
</item>

<item>
<title>
Regret Bounds and Minimax Policies under Partial Monitoring; Jean-Yves Audibert, S&#233;bastien Bubeck; 11(Oct):2785--2836, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/audibert10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/audibert10a.html
</link>
<description>
This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function &#968; for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for &#968;(x)=exp(&#951; x) + &#947;/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly 
</description>
</item>

<item>
<title>
Mean Field Variational Approximation for Continuous-Time Bayesian Networks; Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman; 11(Oct):2745--2783, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cohn10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cohn10a.html
</link>
<description>
Continuous-time Bayesian networks are a natural structured representation language for multi-component stochastic processes that evolve continuously over time.  Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories.  This variational approach leads to a globally consistent distribution, which can be efficiently queried.  Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning 
</description>
</item>

<item>
<title>
Using Contextual Representations to Efficiently Learn Context-Free Languages; Alexander Clark, R&#233;mi Eyraud, Amaury Habrard; 11(Oct):2707--2744, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/clark10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/clark10a.html
</link>
<description>
We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle.  We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages.  These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning.  This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly 
</description>
</item>

<item>
<title>
Topology Selection in Graphical Models of Autoregressive Processes; Jitkomut Songsiri, Lieven Vandenberghe; 11(Oct):2671--2705, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/songsiri10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/songsiri10a.html
</link>
<description>
An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series.  The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables.  The method proposed in the paper is based on an l_1-type nonsmooth regularization of the conditional maximum likelihood estimation problem.   We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method.  Results of experiments with randomly generated and real data sets are also included.
</description>
</item>

<item>
<title>
Learnability, Stability and Uniform Convergence; Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan; 11(Oct):2635--2670, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shalev-shwartz10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shalev-shwartz10a.html
</link>
<description>
The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization.  In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases.  We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, 
</description>
</item>

<item>
<title>
Stochastic Composite Likelihood; Joshua V. Dillon, Guy Lebanon; 11(Oct):2597--2633, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/dillon10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/dillon10a.html
</link>
<description>
Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolves the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators and provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies 
</description>
</item>

<item>
<title>
Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization; Lin Xiao; 11(Oct):2543--2596, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/xiao10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/xiao10a.html
</link>
<description>
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the l_1-norm for promoting sparsity.  We develop extensions of Nesterov's dual averaging method that can exploit the regularization structure in an online setting.  At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient.  In the case of l_1-regularization, 
</description>
</item>

<item>
<title>
WEKA---Experiences with a Java Open-Source Project; Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten; 11(Sep):2533--2541, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bouckaert10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bouckaert10a.html
</link>
<description>
WEKA is a popular machine learning workbench with a development life of nearly two decades.  This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software's functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project.  
</description>
</item>

<item>
<title>
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data; Milo&#353; Radovanovi&#263;, Alexandros Nanopoulos, Mirjana Ivanovi&#263;; 11(Sep):2487--2531, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/radovanovic10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/radovanovic10a.html
</link>
<description>
Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors. 
</description>
</item>

<item>
<title>
Rademacher Complexities and Bounding the Excess Risk in Active Learning; Vladimir Koltchinskii; 11(Sep):2457--2485, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/koltchinskii10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/koltchinskii10a.html
</link>
<description>
Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way.  Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered.  
</description>
</item>

<item>
<title>
Sparse Semi-supervised Learning Using Conjugate Functions; Shiliang Sun, John Shawe-Taylor; 11(Sep):2423--2455, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sun10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sun10a.html
</link>
<description>
In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared 
</description>
</item>

<item>
<title>
Composite Binary Losses; Mark D. Reid, Robert C. Williamson; 11(Sep):2387--2422, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/reid10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/reid10a.html
</link>
<description>
We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function.  We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and "classification calibrated" losses. We also consider the question of the "best" surrogate binary loss. We introduce a precise notion of "best" and show there exist 
</description>
</item>

<item>
<title>
High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency; Dapo Omidiran, Martin J. Wainwright; 11(Aug):2361--2386, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/omidiran10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/omidiran10a.html
</link>
<description>
We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector &#946;^* &#8712; R^p, estimate the subset of non-zero entries of &#946;^*.  A significant body of work has studied the behavior of l_1-relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli).  In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction &#947; of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one.  Our main result is to
</description>
</item>

<item>
<title>
Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers; Franz Pernkopf, Jeff A. Bilmes; 11(Aug):2323--2360, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/pernkopf10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/pernkopf10a.html
</link>
<description>
We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers.  We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^(k+1)) score evaluations (where constant k is the tree-width of the sub-graph over the attributes).  We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification
</description>
</item>

<item>
<title>
Spectral Regularization Algorithms for Learning Large Incomplete Matrices; Rahul Mazumder, Trevor Hastie, Robert Tibshirani; 11(Aug):2287--2322, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mazumder10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mazumder10a.html
</link>
<description>
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm SOFT-IMPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix.  Exploiting the
</description>
</item>

<item>
<title>
High Dimensional Inverse Covariance Matrix Estimation via Linear Programming; Ming Yuan; 11(Aug):2261--2286, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yuan10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yuan10b.html
</link>
<description>
This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by "sparse" matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such "sparsity".  The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem.
</description>
</item>

<item>
<title>
Restricted Eigenvalue Properties for Correlated Gaussian Designs; Garvesh Raskutti, Martin J. Wainwright, Bin Yu; 11(Aug):2241--2259, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/raskutti10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/raskutti10a.html
</link>
<description>
Methods based on l_1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions.  The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared l_2-error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition.  The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far,
</description>
</item>

<item>
<title>
Erratum: SGDQN is Less Careful than Expected; Antoine Bordes, L&#233;on Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith; 11(Aug):2229--2240, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bordes10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bordes10a.html
</link>
<description>
The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals.  Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008).  This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance.
</description>
</item>

<item>
<title>
Regularized Discriminant Analysis, Ridge Regression and Beyond; Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan; 11(Aug):2199--2228, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/zhang10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/zhang10b.html
</link>
<description>
Fisher linear discriminant analysis (FDA) and its kernel extension--kernel discriminant analysis (KDA)--are well known methods that consider dimensionality reduction and classification jointly.  While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures.  In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA.  We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on
</description>
</item>

<item>
<title>
Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence; Qiang Wu, Justin Guinney, Mauro Maggioni, Sayan Mukherjee; 11(Aug):2175--2198, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/wu10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/wu10a.html
</link>
<description>
The problems of dimension reduction and inference of statistical dependence are addressed by the modeling framework of learning gradients. The models we propose hold for Euclidean spaces as well as the manifold setting. The central quantity in this approach is an estimate of the gradient of the regression or classification function. Two quadratic forms are constructed from gradient estimates: the gradient outer product and gradient based diffusion maps. The first quantity can be used for supervised dimension reduction on manifolds as well as inference of a graphical model encoding dependencies that are predictive of a response variable.  The second quantity can be used for nonlinear projections
</description>
</item>

<item>
<title>
libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models; Joris M. Mooij; 11(Aug):2169--2173, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mooij10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mooij10a.html
</link>
<description>
This paper describes the software package libDAI, a free &amp; open source C++ library that provides implementations of various exact and approximate inference methods for graphical models with discrete-valued variables. libDAI supports directed graphical models (Bayesian networks) as well as undirected ones (Markov random fields and factor graphs). It offers various approximations of the partition sum, marginal probability distributions and maximum probability states. Parameter learning is also supported. A feature comparison with other open source software packages for approximate inference is given. libDAI is licensed under the GPL v2+ license and is available at http://www.libdai.org.
</description>
</item>

<item>
<title>
Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases; Guoqiang Yu, Yuanjian Feng, David J. Miller, Jianhua Xuan, Eric P. Hoffman, Robert Clarke, Ben Davidson, Ie-Ming Shih, Yue Wang; 11(Aug):2141--2167, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yu10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yu10b.html
</link>
<description>
Microarray gene expressions provide new opportunities for molecular classification of heterogeneous diseases. Although various reported classification schemes show impressive performance, most existing gene selection methods are suboptimal and are not well-matched to the unique characteristics of the multicategory classification problem. Matched design of the gene selection method and a committee classifier is needed for identifying a small set of gene markers that achieve accurate multicategory classification while being both statistically reproducible and biologically plausible. We report a simpler and yet more accurate strategy than previous works for multicategory classification of heterogeneous
</description>
</item>

<item>
<title>
Importance Sampling for Continuous Time Bayesian Networks; Yu Fan, Jing Xu, Christian R. Shelton; 11(Aug):2115--2140, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/fan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/fan10a.html
</link>
<description>
A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time.  Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental
</description>
</item>

<item>
<title>
Model-based Boosting 2.0; Torsten Hothorn, Peter B&#252;hlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner; 11(Aug):2109--2113, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/hothorn10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/hothorn10a.html
</link>
<description>
We describe version 2.0 of the R add-on package mboost.  The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data.
</description>
</item>

<item>
<title>
On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation; Gavin C. Cawley, Nicola L. C. Talbot; 11(Jul):2079--2107, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cawley10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cawley10a.html
</link>
<description>
Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation.  The error of such an estimator can be broken down into bias and variance components.  While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model.  While this observation is in hindsight perhaps rather obvious, the degradation in performance
</description>
</item>

<item>
<title>
Matrix Completion from Noisy Entries; Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh; 11(Jul):2057--2078, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/keshavan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/keshavan10a.html
</link>
<description>
Given a matrix M of low rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the 'Netflix problem') to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here OPTSPACE. We prove performance guarantees that are order-optimal in a number of circumstances.
</description>
</item>

<item>
<title>
A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design; Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene, Karel Crombecq; 11(Jul):2051--2055, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gorissen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gorissen10a.html
</link>
<description>
An exceedingly large number of scientific and engineering fields are confronted with the need for computer simulations to study complex, real world phenomena or solve challenging design problems. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques has become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, prototyping, and sensitivity analysis. Consequently, in many fields there is great interest in tools and techniques that facilitate the construction of such regression models,
</description>
</item>

<item>
<title>
Posterior Regularization for Structured Latent Variable Models; Kuzman Ganchev, Jo&#227;o Gra&#231;a, Jennifer Gillenwater, Ben Taskar; 11(Jul):2001--2049, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ganchev10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ganchev10a.html
</link>
<description>
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning.  Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy.  By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and
</description>
</item>

<item>
<title>
Practical Approaches to Principal Component Analysis in the Presence of Missing Values; Alexander Ilin, Tapani Raiko; 11(Jul):1957--2000, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ilin10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ilin10a.html
</link>
<description>
Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to
</description>
</item>

<item>
<title>
Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary &#946;-Mixing Processes; Liva Ralaivola, Marie Szafranski, Guillaume Stempfel; 11(Jul):1927--1956, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ralaivola10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ralaivola10a.html
</link>
<description>
PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints.
</description>
</item>

<item>
<title>
Fast and Scalable Local Kernel Machines; Nicola Segata, Enrico Blanzieri; 11(Jun):1883--1926, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/segata10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/segata10a.html
</link>
<description>
A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy.  Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function into local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds up the learning process.
</description>
</item>

<item>
<title>
Sparse Spectrum Gaussian Process Regression; Miguel L&#225;zaro-Gredilla, Joaquin Qui&#241;onero-Candela, Carl Edward Rasmussen, An&#237;bal R. Figueiras-Vidal; 11(Jun):1865--1881, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/lazaro-gredilla10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/lazaro-gredilla10a.html
</link>
<description>
We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class.
</description>
</item>

<item>
<title>
Permutation Tests for Studying Classifier Performance; Markus Ojala, Gemma C. Garriga; 11(Jun):1833--1863, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ojala10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ojala10a.html
</link>
<description>
We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics.
</description>
</item>

<item>
<title>
How to Explain Individual Classification Decisions; David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert M&#252;ller; 11(Jun):1803--1831, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/baehrens10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/baehrens10a.html
</link>
<description>
After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is.  However, most methods will provide no answer as to why the model predicted a particular label for a single instance and what features were most influential for that particular instance.  The only methods currently able to provide such explanations are decision trees.  This paper proposes a procedure which (based on a set of assumptions) allows one to explain the decisions of any classification method.
</description>
</item>

<item>
<title>
The SHOGUN Machine Learning Toolbox; S&#246;ren Sonnenburg, Gunnar R&#228;tsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojt&#x011B;ch Franc; 11(Jun):1799--1802, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sonnenburg10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sonnenburg10a.html
</link>
<description>
We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely 
</description>
</item>

<item>
<title>
Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing; Ryo Yoshida, Mike West; 11(May):1771--1798, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yoshida10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yoshida10a.html
</link>
<description>
We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices.  We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential
</description>
</item>

<item>
<title>
Evolving Static Representations for Task Transfer; Phillip Verbancsics, Kenneth O. Stanley; 11(May):1737--1769, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/verbancsics10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/verbancsics10a.html
</link>
<description>
An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird's eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map.  For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV.
</description>
</item>

<item>
<title>
FastInf: An Efficient Approximate Inference Library; Ariel Jaimovich, Ofer Meshi, Ian McGraw, Gal Elidan; 11(May):1733--1736, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/jaimovich10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/jaimovich10a.html
</link>
<description>
The FastInf C++ library is designed to perform memory and time efficient approximate inference in large-scale discrete undirected graphical models.  The focus of the library is propagation based approximate inference methods, ranging from the basic loopy belief propagation algorithm to propagation based on convex free energies.  Various message scheduling schemes that improve on the standard synchronous or asynchronous approaches are included. Also implemented are a clique tree based exact inference, Gibbs sampling, and the mean field algorithm.  In addition to inference, FastInf provides parameter estimation capabilities as well as representation and learning of shared parameters. It offers
</description>
</item>

<item>
<title>
Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity; Aapo Hyv&#228;rinen, Kun Zhang, Shohei Shimizu, Patrik O. Hoyer; 11(May):1709--1731, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/hyvarinen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/hyvarinen10a.html
</link>
<description>
Analysis of causal effects between continuous-valued variables typically uses either autoregressive models or structural equation models with instantaneous effects. Estimation of Gaussian, linear structural equation models poses serious identifiability problems, which is why it was recently proposed to use non-Gaussian models. Here, we show how to combine the non-Gaussian instantaneous model with autoregressive models. This is effectively what is called a structural vector autoregression (SVAR) model, and thus our work contributes to the long-standing problem of how to estimate SVARs. We show that such a non-Gaussian model is identifiable without prior knowledge of network structure. We propose
</description>
</item>

<item>
<title>
Consensus-Based Distributed Support Vector Machines; Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis; 11(May):1663--1707, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/forero10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/forero10a.html
</link>
<description>
This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization sub-problems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated
</description>
</item>

<item>
<title>
Introduction to Causal Inference; Peter Spirtes; 11(May):1643--1662, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/spirtes10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/spirtes10a.html
</link>
<description>
The goal of many sciences is to understand the mechanisms by which variables came to take on the values they have (that is, to find a generative model), and to predict what the values of those variables would be if the naturally occurring mechanisms were subject to outside manipulations. The past 30 years have seen a number of conceptual developments that are partial solutions to the problem of causal inference from observational sample data or a mixture of observational sample and experimental data, particularly in the area of graphical causal modeling. However, in many domains, problems such as the large numbers of variables, small sample sizes, and possible presence of unmeasured causes,
</description>
</item>

<item>
<title>
On the Foundations of Noise-free Selective Classification; Ran El-Yaniv, Yair Wiener; 11(May):1605--1641, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/el-yaniv10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/el-yaniv10a.html
</link>
<description>
We consider selective classification, a term we adopt here to refer to 'classification with a reject option.' The essence of selective classification is to trade off classifier coverage for higher accuracy.  We term this trade-off the risk-coverage (RC) trade-off.  Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner.  For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings.
</description>
</item>

<item>
<title>
MOA: Massive Online Analysis; Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer; 11(May):1601--1604, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bifet10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bifet10a.html
</link>
<description>
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.  MOA includes a collection of offline and online methods as well as tools for evaluation.  In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Na&#239;ve Bayes classifiers at the leaves.  MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.
</description>
</item>

<item>
<title>
Near-optimal Regret Bounds for Reinforcement Learning; Thomas Jaksch, Ronald Ortner, Peter Auer; 11(Apr):1563--1600, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/jaksch10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/jaksch10a.html
</link>
<description>
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy.  In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps (on average).  We present a reinforcement learning algorithm with total regret &#213;(DS&#8730;AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.  A corresponding lower bound of &#937;(&#8730;DSAT) on the total regret of any learning algorithm is given as well.  These results are complemented by
</description>
</item>

<item>
<title>
Hilbert Space Embeddings and Metrics on Probability Measures; Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch&#246;lkopf, Gert R. G. Lanckriet; 11(Apr):1517--1561, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sriperumbudur10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sriperumbudur10a.html
</link>
<description>
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as &#947;_k, indexed by the kernel function k that defines the inner product in the RKHS.  We present three theoretical properties of &#947;_k. First, we consider the question of determining the conditions on the kernel k for which &#947;_k is a metric: such k are denoted
</description>
</item>

<item>
<title>
Quadratic Programming Feature Selection; Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, Carlos Santa Cruz; 11(Apr):1491--1516, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rodriguez-lujan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rodriguez-lujan10a.html
</link>
<description>
Identifying a subset of features that preserves classification accuracy is a problem of growing importance, because of the increasing size and dimensionality of real-world data sets.  We propose a new feature selection method, named Quadratic Programming Feature Selection (QPFS), that reduces the task to a quadratic optimization problem.  In order to limit the computational complexity of solving the optimization problem, QPFS uses the Nystr&#246;m method for approximate matrix diagonalization.  QPFS is thus capable of dealing with very large data sets, for which the use of other methods is computationally expensive.  In experiments with small and medium data sets, the QPFS method leads to
</description>
</item>

<item>
<title>
Training and Testing Low-degree Polynomial Data Mappings via Linear SVM; Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin; 11(Apr):1471--1490, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/chang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/chang10a.html
</link>
<description>
Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues.  The approach enjoys fast training and testing, and may sometimes achieve accuracy close to that of highly nonlinear kernels.  Empirical experiments show that the proposed method is useful for certain large-scale data sets.  We successfully apply the proposed
</description>
</item>

<item>
<title>
Characterization, Stability and Convergence of Hierarchical Clustering Methods; Gunnar Carlsson, Facundo M&#233;moli; 11(Apr):1425--1470, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/carlsson10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/carlsson10a.html
</link>
<description>
We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods.
</description>
</item>

<item>
<title>
Consistent Nonparametric Tests of Independence; Arthur Gretton, L&#225;szl&#243; Gy&#246;rfi; 11(Apr):1391--1423, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gretton10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gretton10a.html
</link>
<description>
Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L_1, log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distribution-free strongly consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically &#945;-level tests are obtained from the limiting distribution of the test statistics.
</description>
</item>

<item>
<title>
Learning Translation Invariant Kernels for Classification; Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi; 11(Apr):1353--1390, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ghiasi-shirazi10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ghiasi-shirazi10a.html
</link>
<description>
Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, plays a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on
</description>
</item>

<item>
<title>
Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels; Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian; 11(Apr):1323--1351, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/donmez10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/donmez10a.html
</link>
<description>
Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel  unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data.
</description>
</item>

<item>
<title>
Learning From Crowds; Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, Linda Moy; 11(Apr):1297--1322, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/raykar10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/raykar10a.html
</link>
<description>
For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is  a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels.
</description>
</item>

<item>
<title>
Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation; Vicen&#231; G&#243;mez, Hilbert J. Kappen, Michael Chertkov; 11(Apr):1273--1296, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gomez10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gomez10a.html
</link>
<description>
We introduce novel results for approximate inference on planar graphical models using the loop calculus framework. The loop calculus (Chertkov and Chernyak, 2006a) allows one to express the exact partition function of a graphical model as a finite sum of terms that can be evaluated once the belief propagation (BP) solution is known. In general, full summation over all correction terms is intractable. We develop an algorithm for the approach presented in Chertkov et al. (2008) which represents an efficient truncation scheme on planar graphs and a new representation of the series in terms of Pfaffians of matrices. We analyze the performance of the algorithm for models with binary variables
</description>
</item>

<item>
<title>
Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation; Miki Aoyagi; 11(Apr):1243--1272, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/aoyagi10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/aoyagi10a.html
</link>
<description>
In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation.  It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b).  The zeta function is defined by using a Kullback function.  We use two methods to obtain  the maximum pole: a new eigenvalue analysis method and a recursive blowing up process.  We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models.
</description>
</item>

<item>
<title>
Graph Kernels; S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt; 11(Apr):1201--1242, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/vishwanathan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/vishwanathan10a.html
</link>
<description>
We present a unified framework to study graph kernels, special cases of which include the random walk (G&#228;rtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mah&#233; et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing
</description>
</item>

<item>
<title>
A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning; Jin Yu, S.V.N. Vishwanathan, Simon G&#252;nter, Nicol N. Schraudolph; 11(Mar):1145--1200, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yu10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yu10a.html
</link>
<description>
We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value.  We apply its memory-limited variant (subLBFGS) to L_2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search
</description>
</item>

<item>
<title>
SFO: A Toolbox for Submodular Function Optimization; Andreas Krause; 11(Mar):1141--1144, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/krause10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/krause10a.html
</link>
<description>
In recent years, a fundamental problem structure has emerged as very useful in a variety of machine learning applications: Submodularity is an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems.  We present SFO, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions. A tutorial script illustrates the application of submodularity to machine learning and AI problems such as feature selection, clustering, inference and optimized
</description>
</item>

<item>
<title>
Continuous Time Bayesian Network Reasoning and Learning Engine; Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu; 11(Mar):1137--1140, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shelton10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shelton10a.html
</link>
<description>
We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE).  A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuous-time Markov process.  This software provides libraries and programs for most of the algorithms developed for CTBNs.  For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data.  For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern.  Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence.
</description>
</item>

<item>
<title>
Large Scale Online Learning of Image Similarity Through Ranking; Gal Chechik, Varun Sharma, Uri Shalit, Samy Bengio; 11(Mar):1109--1135, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/chechik10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/chechik10a.html
</link>
<description>
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object.  Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large data sets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space
</description>
</item>

<item>
<title>
Analysis of Multi-stage Convex Relaxation for Sparse Regularization; Tong Zhang; 11(Mar):1081--1107, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/zhang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/zhang10a.html
</link>
<description>
We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem. The first is heuristic methods such as gradient descent, which only find a local minimum; a drawback of this approach is the lack of a theoretical guarantee showing that the local minimum gives a good solution. The second is convex relaxation such as L_1-regularization, which solves the problem under some conditions; however, it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions.
</description>
</item>

<item>
<title>
Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes; Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright; 11(Mar):1043--1080, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ravikumar10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ravikumar10a.html
</link>
<description>
The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on "tree-based" linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences.  As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems.  Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman
</description>
</item>

<item>
<title>
Kronecker Graphs: An Approach to Modeling Networks; Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani; 11(Feb):985--1042, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/leskovec10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/leskovec10a.html
</link>
<description>
How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above-mentioned structural
</description>
</item>

<item>
<title>
Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data; Gideon S. Mann, Andrew McCallum; 11(Feb):955--984, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mann10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mann10a.html
</link>
<description>
In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust,  scalable method for semi-supervised training using weakly-labeled data.  GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data.  This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields.  Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models.
</description>
</item>

<item>
<title>
On Spectral Learning; Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil; 11(Feb):935--953, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/argyriou10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/argyriou10a.html
</link>
<description>
In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate
</description>
</item>

<item>
<title>
On Learning with Integral Operators; Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito; 11(Feb):905--934, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rosasco10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rosasco10a.html
</link>
<description>
A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods, are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new, much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the
</description>
</item>

<item>
<title>
Image Denoising with Kernels Based on Natural Image Relations; Valero Laparra, Juan Guti&#233;rrez, Gustavo Camps-Valls, Jes&#250;s Malo; 11(Feb):873--903, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/laparra10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/laparra10a.html
</link>
<description>
A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding their straightforward extension to deal with other arbitrary noise sources. In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations
</description>
</item>

<item>
<title>
A Streaming Parallel Decision Tree Algorithm; Yael Ben-Haim, Elad Tom-Tov; 11(Feb):849--872, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html
</link>
<description>
We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm's accuracy.  The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that
</description>
</item>

<item>
<title>
Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models; Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin; 11(Feb):815--848, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/huang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/huang10a.html
</link>
<description>
Maximum entropy (Maxent) is useful in natural language processing and many other areas. Iterative scaling (IS) methods are among the most popular approaches for solving Maxent. Because IS methods have many variants, it is difficult to understand them and see how they differ. In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods. We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods.
</description>
</item>

<item>
<title>
Stability Bounds for Stationary &#966;-mixing and &#946;-mixing Processes; Mehryar Mohri, Afshin Rostamizadeh; 11(Feb):789--814, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mohri10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mohri10a.html
</link>
<description>
Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence.  This paper
</description>
</item>

<item>
<title>
Maximum Relative Margin and Data-Dependent Regularization; Pannagadatta K. Shivaswamy, Tony Jebara; 11(Feb):747--788, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shivaswamy10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shivaswamy10a.html
</link>
<description>
Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions.  This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin
</description>
</item>

<item>
<title>
PyBrain; Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas R&#252;ckstie&#223;, J&#252;rgen Schmidhuber; 11(Feb):743--746, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/schaul10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/schaul10a.html
</link>
<description>
PyBrain is a versatile machine learning library for Python. Its goal is to provide flexible, easy-to-use yet still powerful algorithms for machine learning tasks, including a variety of predefined environments and benchmarks to test and compare algorithms.  Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks.
</description>
</item>

<item>
<title>
A Fast Hybrid Algorithm for Large-Scale l_1-Regularized Logistic Regression; Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda; 11(Feb):713--741, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shi10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shi10a.html
</link>
<description>
l_1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of l_1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning.  When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable l_1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining
</description>
</item>

<item>
<title>
On the Rate of Convergence of the Bagged Nearest Neighbor Estimate; G&#233;rard Biau, Fr&#233;d&#233;ric C&#233;rou, Arnaud Guyader; 11(Feb):687--712, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/biau10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/biau10a.html
</link>
<description>
Bagging is a simple way to combine estimates in order to improve their performance. This method, suggested by Breiman in 1996, proceeds by resampling from the original data set, constructing a predictor from each subsample, and deciding by combining. By bagging an n-sample, the crude nearest neighbor regression estimate is turned into a consistent weighted nearest neighbor regression estimate, which is amenable to statistical analysis. Letting the resampling size k_n grow appropriately with n, it is shown that this estimate may achieve the optimal rate of convergence, regardless of whether resampling is done with or without replacement. Since the estimate with the optimal rate of convergence
</description>
</item>

<item>
<title>
Second-Order Bilinear Discriminant Analysis; Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra; 11(Feb):665--685, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/christoforou10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/christoforou10a.html
</link>
<description>
Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model.
</description>
</item>

<item>
<title>
Error-Correcting Output Codes Library; Sergio Escalera, Oriol Pujol, Petia Radeva; 11(Feb):661--664, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/escalera10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/escalera10a.html
</link>
<description>
In this paper, we present an open source Error-Correcting Output Codes (ECOC) library. The ECOC framework is a powerful tool to deal with multi-class categorization problems. This library contains both state-of-the-art coding (one-versus-one, one-versus-all, dense random, sparse random, DECOC, forest-ECOC, and ECOC-ONE) and decoding designs (Hamming, Euclidean, inverse Hamming, Laplacian, &#946;-density, attenuated, loss-based, probabilistic kernel-based, and loss-weighted) with the parameters defined by the authors, as well as the option to include your own coding, decoding, and base classifier.
</description>
</item>

<item>
<title>
Why Does Unsupervised Pre-training Help Deep Learning?; Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio; 11(Feb):625--660, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/erhan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/erhan10a.html
</link>
<description>
Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is
</description>
</item>

<item>
<title>
A Rotation Test to Verify Latent Structure; Patrick O. Perry, Art B. Owen; 11(Feb):603--624, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/perry10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/perry10a.html
</link>
<description>
In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors.  We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure
</description>
</item>

<item>
<title>
On Finding Predictors for Arbitrary Families of Processes; Daniil Ryabko; 11(Feb):581--602, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ryabko10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ryabko10a.html
</link>
<description>
The problem is sequence prediction in the following setting.  A sequence x_1,...,x_n,... of discrete-valued observations is generated according to some unknown probabilistic law (measure) &#956;. After observing each outcome, it is required to give the conditional probabilities of the next observation.  The measure  &#956; belongs to an arbitrary but known class C of stochastic process measures.  We are interested in predictors &#961; whose conditional probabilities converge (in some sense) to the "true" &#956;-conditional probabilities, if any &#956;&#8712;C is chosen to generate the sequence.  The contribution of this work is in characterizing the families C for which such predictors exist,
</description>
</item>

<item>
<title>
Approximate Tree Kernels; Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert M&#252;ller; 11(Feb):555--580, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rieck10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rieck10a.html
</link>
<description>
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable.  In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for
</description>
</item>

<item>
<title>
Generalized Power Method for Sparse Principal Component Analysis; Michel Journ&#233;e, Yurii Nesterov, Peter Richt&#225;rik, Rodolphe Sepulchre; 11(Feb):517--553, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/journee10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/journee10a.html
</link>
<description>
In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient
</description>
</item>

<item>
<title>
Classification Using Geometric Level Sets; Kush R. Varshney, Alan S. Willsky; 11(Feb):491--516, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/varshney10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/varshney10a.html
</link>
<description>
A variational level set method is developed for the supervised classification problem.  Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary.  This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its &#949;-entropy.  For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions.  Geometric level
</description>
</item>

<item>
<title>
Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization; Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, Samuel Kaski; 11(Feb):451--490, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/venna10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/venna10a.html
</link>
<description>
Nonlinear dimensionality reduction methods are often used to visualize high-dimensional data, although the existing methods have been designed for other related tasks such as manifold learning. It has been difficult to assess the quality of visualizations since the task has not been well-defined. We give a rigorous definition for a specific visualization task, resulting in quantifiable goodness measures and new visualization methods. The task is information retrieval given the visualization: to find similar data based on the similarities shown on the display. The fundamental tradeoff between precision and recall of information retrieval can then be quantified in visualizations as well. The user
</description>
</item>

<item>
<title>
Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting; Philippos Mordohai, G&#233;rard Medioni; 11(Jan):411--450, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mordohai10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mordohai10a.html
</link>
<description>
We address instance-based learning from a perceptual organization standpoint and present methods for dimensionality estimation, manifold learning and function approximation. Under our approach, manifolds in high-dimensional spaces are inferred by estimating geometric relationships among the input instances. Unlike conventional manifold learning, we do not perform dimensionality reduction, but instead perform all operations in the original input space. For this purpose we employ a novel formulation of tensor voting, which allows an N-D implementation. Tensor voting is a perceptual organization framework that has mostly been applied to computer vision problems.  Analyzing the estimated local structure
</description>
</item>

<item>
<title>
A Convergent Online Single Time Scale Actor Critic Algorithm; Dotan Di Castro, Ron Meir; 11(Jan):367--410, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html
</link>
<description>
Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward.  Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic
</description>
</item>

<item>
<title>
Bundle Methods for Regularized Risk Minimization; Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le; 11(Jan):311--365, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/teo10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/teo10a.html
</link>
<description>
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L_1 and L_2 penalties. In addition to the unified framework we present tight convergence bounds, which
</description>
</item>

<item>
<title>
Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure; Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano; 11(Jan):285--310, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/kojima10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/kojima10a.html
</link>
<description>
We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure.  The previously derived constrained optimal search (COS) remains limited even for sparse super-structures.  To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them.  Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs.  Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained
</description>
</item>

<item>
<title>
Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):235--284, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/aliferis10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/aliferis10b.html
</link>
<description>
In part I of this work we introduced and evaluated the Generalized Local Learning (GLL) framework for producing local causal and Markov blanket induction algorithms. In the present second part we analyze the behavior of GLL algorithms and provide extensions to the core methods. Specifically, we investigate the empirical convergence of GLL to the true local neighborhood as a function of sample size.  Moreover, we study how predictivity improves with increasing sample size. Then we investigate how sensitive the algorithms are to multiple statistical testing, especially in the presence of many irrelevant features. Next we discuss the role of the algorithm parameters and also show that Markov blanket
</description>
</item>

<item>
<title>
Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):171--234, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/aliferis10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/aliferis10a.html
</link>
<description>
We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples.  The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms.  The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare
</description>
</item>

<item>
<title>
An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data; Yufeng Ding, Jeffrey S. Simonoff; 11(Jan):131--170, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ding10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ding10a.html
</link>
<description>
There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, as well as the existence or non-existence of missing values in the testing data, are the most helpful criteria to distinguish different missing data methods. In particular, separate class is clearly the
</description>
</item>

<item>
<title>
Classification Methods with Reject Option Based on Convex Risk Minimization; Ming Yuan, Marten Wegkamp; 11(Jan):111--130, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yuan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yuan10a.html
</link>
<description>
In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for  infinite sample consistency (both risks share the same minimizer)  is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition.
</description>
</item>

<item>
<title>
On-Line Sequential Bin Packing; Andr&#225;s Gy&#246;rgy, G&#225;bor Lugosi, Gy&#246;rgy Ottucs&#225;k; 11(Jan):89--109, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gyorgy10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gyorgy10a.html
</link>
<description>
We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items.  Special attention is paid to reference strategies
</description>
</item>

<item>
<title>
Model Selection: Beyond the Bayesian/Frequentist Divide; Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley; 11(Jan):61--87, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/guyon10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/guyon10a.html
</link>
<description>
The principle of parsimony, also known as "Ockham's razor", has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms.  We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches.
</description>
</item>

<item>
<title>
Online Learning for Matrix Factorization and Sparse Coding; Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro; 11(Jan):19--60, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mairal10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mairal10a.html
</link>
<description>
Sparse coding--that is, modelling data vectors as sparse linear combinations of basis elements--is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally
</description>
</item>

<item>
<title>
An Efficient Explanation of Individual Classifications using Game Theory; Erik Štrumbelj, Igor Kononenko; 11(Jan):1--18, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/strumbelj10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/strumbelj10a.html
</link>
<description>
We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method's initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful.
</description>
</item>

<item>
<title>
A Survey of Accuracy Evaluation Metrics of Recommendation Tasks; Asela Gunawardana, Guy Shani; 10(Dec):2935--2962, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/gunawardana09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/gunawardana09a.html
</link>
<description>
Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric
</description>
</item>

<item>
<title>
Efficient Online and Batch Learning Using Forward Backward Splitting; John Duchi, Yoram Singer; 10(Dec):2899--2934, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/duchi09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/duchi09a.html
</link>
<description>
We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as l_1. We derive
</description>
</item>

<item>
<title>
Online Learning with Samples Drawn from Non-identical Distributions; Ting Hu, Ding-Xuan Zhou; 10(Dec):2873--2898, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hu09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hu09a.html
</link>
<description>
Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a H&#246;lder space. For regression with least square or insensitive loss, learning rates are given
</description>
</item>

<item>
<title>
Adaptive False Discovery Rate Control under Independence and Dependence; Gilles Blanchard, &#201;tienne Roquain; 10(Dec):2837--2871, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/blanchard09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/blanchard09a.html
</link>
<description>
In the context of multiple hypothesis testing, the proportion &#928;_0 of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficiency is called adaptive.  In this paper, we focus on the issue of false discovery rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR.  In a first part, assuming independence of the p-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation
</description>
</item>

<item>
<title>
Cautious Collective Classification; Luke K. McDowell, Kalyan Moy Gupta, David W. Aha; 10(Dec):2777--2836, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/mcdowell09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/mcdowell09a.html
</link>
<description>
Many collective classification (CC) algorithms have been shown to increase accuracy when instances are interrelated. However, CC algorithms must be carefully applied because their use of estimated labels can in some cases decrease accuracy.  In this article, we show that managing this label uncertainty through cautious algorithmic behavior is essential to achieving maximal, robust performance.  First, we describe cautious inference and explain how four well-known families of CC algorithms can be parameterized to use varying degrees of such caution.  Second, we introduce cautious learning and show how it can be used to improve the performance of almost any CC algorithm, with or without cautious
</description>
</item>

<item>
<title>
Reproducing Kernel Banach Spaces for Machine Learning; Haizhang Zhang, Yuesheng Xu, Jun Zhang; 10(Dec):2741--2775, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/zhang09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/zhang09b.html
</link>
<description>
We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semi-inner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established.
</description>
</item>

<item>
<title>
Learning Halfspaces with Malicious Noise; Adam R. Klivans, Philip M. Long, Rocco A. Servedio; 10(Dec):2715--2740, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/klivans09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/klivans09a.html
</link>
<description>
We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions.  We give poly(n, 1/&#949;)-time algorithms for solving the following problems to accuracy &#949;: Learning origin-centered halfspaces in R^n with respect to the uniform distribution on the unit ball with malicious noise rate &#951; = &#937;(&#949;^2 / log(n/&#949;)). (The best previous result
</description>
</item>

<item>
<title>
Structure Spaces; Brijnesh J. Jain, Klaus Obermayer; 10(Nov):2667--2714, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/jain09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/jain09a.html
</link>
<description>
Finite structures such as point patterns, strings, trees, and graphs occur as "natural" representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces
</description>
</item>

<item>
<title>
Bounded Kernel-Based Online Learning; Francesco Orabona, Joseph Keshet, Barbara Caputo; 10(Nov):2643--2666, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/orabona09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/orabona09a.html
</link>
<description>
A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm  Projectron. While the memory
</description>
</item>

<item>
<title>
DL-Learner: Learning Concepts in Description Logics; Jens Lehmann; 10(Nov):2639--2642, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/lehmann09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/lehmann09a.html
</link>
<description>
In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems.  It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service.
</description>
</item>

<item>
<title>
Hash Kernels for Structured Data; Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan; 10(Nov):2615--2637, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/shi09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/shi09a.html
</link>
<description>
We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs.
</description>
</item>

<item>
<title>
Learning When Concepts Abound; Omid Madani, Michael Connor, Wiley Greiner; 10(Nov):2571--2613, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/madani09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/madani09a.html
</link>
<description>
Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond.  We investigate the learning of sparse class indices to address this challenge.  An index is a mapping from features to classes.  We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines.  We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach
</description>
</item>

<item>
<title>
Maximum Entropy Discrimination Markov Networks; Jun Zhu, Eric P. Xing; 10(Nov):2531--2569, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/zhu09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/zhu09a.html
</link>
<description>
The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy
</description>
</item>

<item>
<title>
When Is There a Representer Theorem?  Vector Versus Matrix Regularizers; Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil; 10(Nov):2507--2529, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/argyriou09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/argyriou09a.html
</link>
<description>
We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the L2 norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers.  We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a
</description>
</item>

<item>
<title>
Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression; Saharon Rosset; 10(Nov):2473--2505, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rosset09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rosset09a.html
</link>
<description>
We show how to follow the path of cross validated solutions to families of regularized optimization problems, defined by a combination of a parameterized loss function and a regularization term. A primary example is kernel quantile regression, where the parameter of the loss function is the quantile being estimated. Even though the bi-level optimization problem we encounter for every quantile is non-convex, the manner in which the optimal cross-validated solution evolves with the parameter of the loss function allows tracking of this solution. We prove this property, construct the resulting algorithm, and demonstrate it on real and artificial data. This algorithm allows us to efficiently
</description>
</item>

<item>
<title>
Prediction With Expert Advice For The Brier Game; Vladimir Vovk, Fedor Zhdanov; 10(Nov):2445--2471, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/vovk09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/vovk09a.html
</link>
<description>
We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it.  The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts.  The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set.
</description>
</item>

<item>
<title>
Reinforcement Learning in Finite MDPs: PAC Analysis; Alexander L. Strehl, Lihong Li, Michael L. Littman; 10(Nov):2413--2444, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/strehl09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/strehl09a.html
</link>
<description>
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples.  These "PAC-MDP" algorithms include the well-known E^3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm.  We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework.  A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
</description>
</item>

<item>
<title>
Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions; Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray, David Page; 10(Oct):2374--2411, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hellerstein09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hellerstein09a.html
</link>
<description>
A Boolean function f is correlation immune if each input variable is independent of the output, under the uniform distribution on inputs. For example, the parity function is correlation immune. We consider the problem of identifying relevant variables of a correlation immune function, in the presence of irrelevant variables. We address this problem in two different contexts. First, we analyze Skewing, a heuristic method that was developed to improve the ability of greedy decision tree algorithms to identify relevant variables of correlation immune Boolean functions, given examples drawn from the uniform distribution (Page and Ray, 2003). We present theoretical results revealing both
</description>
</item>

<item>
<title>
Estimating Labels from Label Proportions; Novi Quadrianto, Alex J. Smola, Tib&#x00E9;rio S. Caetano, Quoc V. Le; 10(Oct):2349--2374, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/quadrianto09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/quadrianto09a.html
</link>
<description>
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.
</description>
</item>

<item>
<title>
Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors; Mathias Drton, Michael Eichler, Thomas S. Richardson; 10(Oct):2329--2348, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/drton09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/drton09a.html
</link>
<description>
In recursive linear models, the multivariate normal joint distribution of all variables exhibits a dependence structure induced by a recursive (or acyclic) system of linear structural equations. These linear models have a long tradition and appear in seemingly unrelated regressions, structural equation modelling, and approaches to causal inference. They are also related to Gaussian graphical models via a classical representation known as a path diagram. Despite the models' long history, a number of problems remain open. In this paper, we address the problem of computing maximum likelihood estimates in the subclass of 'bow-free' recursive linear models. The term 'bow-free' refers to
</description>
</item>

<item>
<title>
The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs; Han Liu, John Lafferty, Larry Wasserman; 10(Oct):2295--2328, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/liu09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/liu09a.html
</link>
<description>
Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula---or "nonparanormal"---for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method's theoretical properties, and show that it works well in many examples.
</description>
</item>

<item>
<title>
Learning Nondeterministic Classifiers; Juan Jos&#x00E9; del Coz, Jorge D&#x00ED;ez, Antonio Bahamonde; 10(Oct):2273--2293, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/delcoz09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/delcoz09a.html
</link>
<description>
Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kinds of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide
</description>
</item>

<item>
<title>
The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List; Cynthia Rudin; 10(Oct):2233--2271, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rudin09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rudin09b.html
</link>
<description>
We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This "push" near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose lp-norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around
</description>
</item>

<item>
<title>
Margin-based Ranking and an Equivalence between AdaBoost and RankBoost; Cynthia Rudin, Robert E. Schapire; 10(Oct):2193--2232, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rudin09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rudin09a.html
</link>
<description>
We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to "approximate coordinate ascent boosting." Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is as good as
</description>
</item>

<item>
<title>
Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization; Vojt&#x011B;ch Franc, S&#246;ren Sonnenburg; 10(Oct):2157--2192, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/franc09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/franc09a.html
</link>
<description>
We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to an &#949;-precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVM^light, SVM^perf and BMRM, achieving a speedup factor of more than 1,200 over SVM^light on some data sets and a speedup factor of 29 over SVM^perf, while obtaining the same precise support vector solution.
</description>
</item>

<item>
<title>
Discriminative Learning Under Covariate Shift; Steffen Bickel, Michael Br&#252;ckner, Tobias Scheffer; 10(Sep):2137--2155, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/bickel09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/bickel09a.html
</link>
<description>
We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution---problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither the training nor the test distribution is modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under
</description>
</item>

<item>
<title>
RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments; Brian Tanner, Adam White; 10(Sep):2133--2136, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/tanner09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/tanner09a.html
</link>
<description>
RL-Glue is a standard, language-independent software package for reinforcement-learning experiments. The standardization provided by RL-Glue facilitates code sharing and collaboration. Code sharing reduces the need to re-engineer tasks and experimental apparatus, both common barriers to comparatively evaluating new ideas in the context of the literature. Our software features a minimalist interface and works with several languages and computing platforms. RL-Glue compatibility can be extended to any programming language that supports network socket communication. RL-Glue has been used to teach classes, to run international competitions, and is currently used by several other open-source software and hardware projects.
</description>
</item>

<item>
<title>
Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods; Christian Rieger, Barbara Zwicknagl; 10(Sep):2115--2132, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rieger09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rieger09a.html
</link>
<description>
We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of &#949;- and &#957;-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples.
</description>
</item>

<item>
<title>
An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems; Luciana Ferrer, Kemal S&#246;nmez, Elizabeth Shriberg; 10(Sep):2079--2114, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/ferrer09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/ferrer09a.html
</link>
<description>
We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average
</description>
</item>

<item>
<title>
Evolutionary Model Type Selection for Global Surrogate Modeling; Dirk Gorissen, Tom Dhaene, Filip De Turck; 10(Sep):2039--2078, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/gorissen09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/gorissen09a.html
</link>
<description>
Due to the scale and computational complexity of currently used simulation codes, global surrogate models (metamodels) have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist
</description>
</item>

<item>
<title>
Ultrahigh Dimensional Feature Selection: Beyond The Linear Model; Jianqing Fan, Richard Samworth, Yichao Wu; 10(Sep):2013--2038, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/fan09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/fan09a.html
</link>
<description>
Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan &#38; Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan &#38; Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when
</description>
</item>

<item>
<title>
Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection; Jie Chen, Haw-ren Fang, Yousef Saad; 10(Sep):1989--2012, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/chen09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/chen09b.html
</link>
<description>
Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes &#920;(dn^2) time for n data points in d-dimensional Euclidean space. We propose two divide-and-conquer methods for computing an approximate kNN graph in &#920;(dn^t) time for high-dimensional data (large d). The exponent t &#8712; (1,2) is an increasing function of an internal parameter &#945; which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained
</description>
</item>

<item>
<title>
Provably Efficient Learning with Typed Parametric Models; Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, Nicholas Roy; 10(Aug):1955--1988, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/brunskill09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/brunskill09a.html
</link>
<description>
To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate
</description>
</item>

<item>
<title>
Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training; Kristian Woodsend, Jacek Gondzio; 10(Aug):1937--1953, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/woodsend09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/woodsend09a.html
</link>
<description>
Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from
</description>
</item>

<item>
<title>
Learning Approximate Sequential Patterns for Classification; Zeeshan Syed, Piotr Indyk, John Guttag; 10(Aug):1913--1936, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/syed09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/syed09a.html
</link>
<description>
In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to
</description>
</item>

<item>
<title>
Learning Acyclic Probabilistic Circuits Using Test Paths; Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin; 10(Aug):1881--1911, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/angluin09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/angluin09a.html
</link>
<description>
We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that
</description>
</item>

<item>
<title>
CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning; Roberto Esposito, Daniele P. Radicioni; 10(Aug):1851--1880, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/esposito09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/esposito09a.html
</link>
<description>
The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels.  In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity. We provide theoretical grounding together with solid empirical results supporting
</description>
</item>

<item>
<title>
Nonlinear Models Using Dirichlet Process Mixtures; Babak Shahbaba, Radford Neal; 10(Aug):1829--1850, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/shahbaba09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/shahbaba09a.html
</link>
<description>
We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on
</description>
</item>

<item>
<title>
Distributed Algorithms for Topic Models; David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling; 10(Aug):1801--1828, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/newman09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/newman09a.html
</link>
<description>
We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is
</description>
</item>

<item>
<title>
Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning; Halbert White, Karim Chalak; 10(Aug):1759--1799, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/white09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/white09a.html
</link>
<description>
Judea Pearl's Causal Model is a rich framework that provides deep insight into the nature of causal relations. As yet, however, the Pearl Causal Model (PCM) has had a lesser impact on economics or econometrics than on other disciplines. This may be due in part to the fact that the PCM is not as well suited to analyzing structures that exhibit features of central interest to economists and econometricians: optimization, equilibrium, and learning. We offer the settable systems framework as an extension of the PCM that permits causal discourse in systems embodying optimization, equilibrium, and learning. Because these are common features of physical, natural, or social systems, our framework may prove generally useful for
</description>
</item>

<item>
<title>
Dlib-ml: A Machine Learning Toolkit; Davis E. King; 10(Jul):1755--1758, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/king09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/king09a.html
</link>
<description>
There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built-in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy
</description>
</item>

<item>
<title>
SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent; Antoine Bordes, L&#233;on Bottou, Patrick Gallinari; 10(Jul):1737--1754, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/bordes09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/bordes09a.html
</link>
<description>
The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the "Wild Track" of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).
</description>
</item>

<item>
<title>
Learning Permutations with Exponential Weights; David P. Helmbold, Manfred K. Warmuth; 10(Jul):1705--1736, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/helmbold09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/helmbold09a.html
</link>
<description>
We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prov
</description>
</item>

<item>
<title>
Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification; Eitan Greenshtein, Junyong Park; 10(Jul):1687--1704, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/greenshtein09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/greenshtein09a.html
</link>
<description>
We consider the problem of classification using a high-dimensional feature space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features' means. This reduction arises because only the means of a few selected variables need to be estimated. We also consider the design of naive-Bayes classifiers. We show that
</description>
</item>

<item>
<title>
Transfer Learning for Reinforcement Learning Domains: A Survey; Matthew E. Taylor, Peter Stone; 10(Jul):1633--1685, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/taylor09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/taylor09a.html
</link>
<description>
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework
</description>
</item>

<item>
<title>
Marginal Likelihood Integrals for Mixtures of Independence Models; Shaowei Lin, Bernd Sturmfels, Zhiqiang Xu; 10(Jul):1611--1631, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/lin09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/lin09a.html
</link>
<description>
Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Veronese varieties.
</description>
</item>

<item>
<title>
Learning Linear Ranking Functions for Beam Search with Application to Planning; Yuehua Xu, Alan Fern, Sungwook Yoon; 10(Jul):1571--1610, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/xu09c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/xu09c.html
</link>
<description>
Beam search is commonly used to help maintain tractability in large search spaces at the expense of completeness and optimality. Here we study supervised learning of linear ranking functions for controlling beam search. The goal is to learn ranking functions that allow for beam search to perform nearly as well as unconstrained search, and hence gain computational efficiency without seriously sacrificing optimality. In this paper, we develop theoretical aspects of this learning problem and investigate the application of this framework to learning in the context of automated planning. We first study the computationa
</description>
</item>

<item>
<title>
Bayesian Network Structure Learning by Recursive Autonomy Identification; Raanan Yehezkel, Boaz Lerner; 10(Jul):1527--1570, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/yehezkel09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/yehezkel09a.html
</link>
<description>
We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous sub-structure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulting undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires
</description>
</item>

<item>
<title>
Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks; Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras; 10(Jul):1511--1526, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/slobodianik09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/slobodianik09a.html
</link>
<description>
In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion.
</description>
</item>

<item>
<title>
Robustness and Regularization of Support Vector Machines; Huan Xu, Constantine Caramanis, Shie Mannor; 10(Jul):1485--1510, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/xu09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/xu09b.html
</link>
<description>
We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new
</description>
</item>

<item>
<title>
Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks; Jean Hausser, Korbinian Strimmer; 10(Jul):1469--1484, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hausser09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hausser09a.html
</link>
<description>
We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by
</description>
</item>

<item>
<title>
Classification with Gaussians and Convex Loss; Dao-Hong Xiang, Ding-Xuan Zhou; 10(Jul):1447--1468, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/xiang09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/xiang09a.html
</link>
<description>
This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is
</description>
</item>

<item>
<title>
A Least-squares Approach to Direct Importance Estimation; Takafumi Kanamori, Shohei Hido, Masashi Sugiyama; 10(Jul):1391--1445, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/kanamori09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/kanamori09a.html
</link>
<description>
We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show
</description>
</item>

<item>
<title>
Model Monitor (M2): Evaluating, Comparing, and Monitoring Models; Troy Raeder, Nitesh V. Chawla; 10(Jul):1387--1390, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/raeder09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/raeder09a.html
</link>
<description>
This paper presents Model Monitor (M2), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
</description>
</item>

<item>
<title>
Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination; Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola; 10(Jul):1341--1366, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/tuv09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/tuv09a.html
</link>
<description>
Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach.
</description>
</item>

<item>
<title>
A Parameter-Free Classification Method for Large Scale Learning; Marc Boull&#233;; 10(Jul):1367--1385, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/boulle09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/boulle09a.html
</link>
<description>
With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, who play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models
</description>
</item>

<item>
<title>
Robust Process Discovery with Artificial Negative Events; Stijn Goedertier, David Martens, Jan Vanthienen, Bart Baesens; 10(Jun):1305--1340, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/goedertier09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/goedertier09a.html
</link>
<description>
Process discovery is the automated construction of structured process models from information system event logs. Such event logs often contain positive examples only. Without negative examples, it is a challenge to strike the right balance between recall and specificity, and to deal with problems such as expressiveness, noise, incomplete event logs, or the inclusion of prior knowledge. In this paper, we present a configurable technique that deals with these challenges by representing process discovery as a multi-relational classification problem on event logs supplemented with Artificially Generated Negative Events (AGNEs). This problem formulation allows
</description>
</item>

<item>
<title>
Perturbation Corrections in Approximate Inference: Mixture Modelling Applications; Ulrich Paquet, Ole Winther, Manfred Opper; 10(Jun):1263--1304, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/paquet09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/paquet09a.html
</link>
<description>
Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with
</description>
</item>

<item>
<title>
Incorporating Functional Knowledge in Neural Networks; Charles Dugas, Yoshua Bengio, Fran&#231;ois B&#233;lisle, Claude Nadeau, Ren&#233; Garcia; 10(Jun):1239--1262, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/dugas09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/dugas09a.html
</link>
<description>
Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but that (1) has those properties and (2) is a universal approximator of Lipschitz functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on
</description>
</item>

<item>
<title>
The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models; Ricardo Silva, Zoubin Ghahramani; 10(Jun):1187--1238, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/silva09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/silva09a.html
</link>
<description>
Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for
</description>
</item>

<item>
<title>
Multi-task Reinforcement Learning in Partially Observable Stochastic Environments; Hui Li, Xuejun Liao, Lawrence Carin; 10(May):1131--1186, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/li09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/li09b.html
</link>
<description>
We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent's behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent's choice of actions is directly based on this conditional distribution, without an intervening model to characterize the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected when following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over
</description>
</item>

<item>
<title>
Universal Kernel-Based Learning with Applications to Regular Languages; Leonid (Aryeh) Kontorovich, Boaz Nadler; 10(May):1095--1129, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/kontorovich09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/kontorovich09a.html
</link>
<description>
We propose a novel framework for supervised learning of discrete concepts. Since the 1970s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and
</description>
</item>

<item>
<title>
An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity; Jose M. Pe&#241;a, Roland Nilsson, Johan Bj&#246;rkegren, Jesper Tegn&#233;r; 10(May):1071--1094, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/pena09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/pena09a.html
</link>
<description>
We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that
</description>
</item>

<item>
<title>
Fourier Theoretic Probabilistic Inference over Permutations; Jonathan Huang, Carlos Guestrin, Leonidas Guibas; 10(May):997--1070, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/huang09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/huang09a.html
</link>
<description>
Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the "low-frequency" terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for
</description>
</item>

<item>
<title>
On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality; Wenxin Jiang; 10(Apr):977--996, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/jiang09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/jiang09a.html
</link>
<description>
The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding's inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding's inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with l1-loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors.
</description>
</item>

<item>
<title>
Nonextensive Information Theoretic Kernels on Measures; Andr&#233; F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, M&#225;rio A. T. Figueiredo; 10(Apr):935--975, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/martins09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/martins09a.html
</link>
<description>
Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce
</description>
</item>

<item>
<title>
Java-ML: A Machine Learning Library; Thomas Abeel, Yves Van de Peer, Yvan Saeys; 10(Apr):931--934, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/abeel09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/abeel09a.html
</link>
<description>
Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license.
</description>
</item>

<item>
<title>
Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods; Holger H&#246;fling, Robert Tibshirani; 10(Apr):883--906, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hoefling09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hoefling09a.html
</link>
<description>
We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that
</description>
</item>

<item>
<title>
Stable and Efficient Gaussian Process Calculations; Leslie Foster, Alex Waagen, Nabeela Aijaz, Michael Hurley, Apolonio Luis, Joel Rinsky, Chandrika Satyavolu, Michael J. Way, Paul Gazis, Ashok Srivastava; 10(Apr):857--882, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/foster09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/foster09a.html
</link>
<description>
The use of Gaussian processes can be an effective approach to prediction in a supervised learning environment. For large data sets, the standard Gaussian process approach requires solving very large systems of linear equations and approximations are required for the calculations to be practical. We will focus on the subset of regressors approximation technique. We will demonstrate that there can be numerical instabilities in a well known implementation of the technique. We discuss alternate implementations that have better numerical stability properties and can lead to better predictions. Our results will be illustrated by looking at an application involving prediction of galaxy redshift from broadband spectrum data.
</description>
</item>

<item>
<title>
Consistency and Localizability; Alon Zakai, Ya'acov Ritov; 10(Apr):827--856, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/zakai09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/zakai09a.html
</link>
<description>
We show that all consistent learning methods---that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y)---are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that
</description>
</item>

<item>
<title>
A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization; Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert; 10(Mar):803--826, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/abernethy09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/abernethy09a.html
</link>
<description>
We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of "users" to a set of possibly desired "objects". In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects---a feature currently lacking in existing regularization-based CF approaches---using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning
</description>
</item>

<item>
<title>
Sparse Online Learning via Truncated Gradient; John Langford, Lihong Li, Tong Zhang; 10(Mar):777--801, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/langford09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/langford09a.html
</link>
<description>
We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: (1) The degree of sparsity is continuous---a parameter controls the rate of sparsification from no sparsification to total sparsification. (2) The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. (3) The approach works well empirically. We apply the approach to several data sets and find that for data sets with large numbers of features, substantial sparsity is discoverable.
</description>
</item>

<item>
<title>
Similarity-based Classification: Concepts and Algorithms; Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti; 10(Mar):747--776, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/chen09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/chen09a.html
</link>
<description>
This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning.
</description>
</item>

<item>
<title>
Nieme: Large-Scale Energy-Based Models; Francis Maes; 10(Mar):743--746, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/maes09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/maes09a.html
</link>
<description>
In this paper we introduce NIEME, a machine learning library for large-scale classification, regression and ranking. NIEME relies on the framework of energy-based models (LeCun et al., 2006) which unifies several learning algorithms ranging from simple perceptrons to recent models such as the pegasos support vector machine or l1-regularized maximum entropy models. This framework also unifies batch and stochastic learning which are both seen as energy minimization problems. NIEME can hence be used in a wide range of
</description>
</item>

</channel>
</rss>

