<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<atom:link href="http://jmlr.csail.mit.edu/jmlr.xml" rel="self" type="application/rss+xml" />
<link>http://www.jmlr.org</link>
<title>JMLR</title>
<description></description>

<item>
<title>
Performance Bounds for &#955; Policy Iteration and Application to the Game of Tetris; Bruno Scherrer; 14(Apr):1181--1227, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/scherrer13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/scherrer13a.html
</link>
<description>
We consider the discrete-time infinite-horizon optimal control
  problem formalized by Markov decision processes
  (Puterman, 1994; Bertsekas and Tsitsiklis, 1996).  We revisit the work of Bertsekas
  and Ioffe (1996), who
  introduced &#955; policy iteration---a family of algorithms parametrized by a
  parameter &#955;---that generalizes the standard algorithms
  value and policy iteration, and has some deep connections
  with the temporal-difference algorithms described by
  Sutton and Barto (1998). We deepen the original theory developed by the
  authors by providing convergence rate bounds which generalize
  standard bounds for value iteration described for instance by
  Puterman (1994).  Then, the main contribution of this paper is to
  develop the theory of this algorithm when it is used in an
  approximate form. We extend and unify the separate analyses
  developed by Munos for approximate value iteration (Munos, 2007)
  and approximate policy iteration (Munos, 2003), and provide
  performance bounds in the discounted and the undiscounted
  situations. Finally, we revisit the use of this algorithm in the
  training of a Tetris playing controller as originally done by
  Bertsekas and Ioffe (1996).  Our empirical results are different from those of
  Bertsekas and Ioffe (which were originally qualified as
  "paradoxical" and "intriguing"). We track down the reason to be
  a minor implementation error of the algorithm, which suggests that,
  in practice, &#955; policy iteration may be more stable than previously thought.
</description>
</item>



<item>
<title>
GPstuff: Bayesian Modeling with Gaussian Processes; Jarno Vanhatalo, Jaakko Riihim&#228;ki, Jouni Hartikainen, Pasi Jyl&#228;nki, Ville Tolvanen, Aki Vehtari; 14(Apr):1175--1179, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/vanhatalo13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/vanhatalo13a.html
</link>
<description>
The GPstuff toolbox is a versatile collection of Gaussian process
  models and computational tools required for Bayesian inference. The tools
  include, among others, various inference methods, sparse
  approximations and model assessment methods.
</description>
</item>



<item>
<title>
Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing; Lisha Chen, Andreas Buja; 14(Apr):1145--1173, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/chen13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/chen13a.html
</link>
<description>
Multidimensional scaling (MDS) is the art of reconstructing
pointsets (embeddings) from pairwise distance data, and as such it
  is at the basis of several approaches to nonlinear dimension
  reduction and manifold learning.  At present, MDS lacks a unifying
  methodology as it consists of a discrete collection of proposals
  that differ in their optimization criteria, called "stress
  functions".  To correct this situation we propose (1) to embed many
  of the extant stress functions in a parametric family of stress
  functions, and (2) to replace the ad hoc choice among discrete
  proposals with a principled parameter selection method.  This
  methodology yields the following benefits and problem solutions:
  (a) it provides guidance in tailoring stress functions to a given
  data situation, responding to the fact that no single stress
  function dominates all others across all data situations; (b) the
  methodology enriches the supply of available stress functions;
  (c) it helps our understanding of stress functions by replacing the
  comparison of discrete proposals with a characterization of the
  effect of parameters on embeddings; (d) it builds a bridge to graph
  drawing, which is the related but not identical art of constructing
  embeddings from graphs.
</description>
</item>



<item>
<title>
Sparse Activity and Sparse Connectivity in Supervised Learning; Markus Thom, G&#252;nther Palm; 14(Apr):1091--1143, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/thom13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/thom13a.html
</link>
<description>
Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks.
This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities.
The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector.
In the theoretical part of this paper, a comprehensive theory for such a projection is developed.
In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function.
The entire model can hence be tuned end-to-end using gradient-based methods.
Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity.
With a combination of both, performance can be significantly better compared to classical non-sparse approaches.
</description>
</item>



<item>
<title>
Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications; Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown; 14(Apr):1033--1090, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/zhao13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/zhao13a.html
</link>
<description>
Fano's inequality lower bounds the probability of transmission error
through a communication channel.
   Applied to classification problems, it provides a lower bound on
the Bayes error rate and motivates the widely used Infomax principle.
   In modern machine learning, we are often interested in more than
just the error rate.
   In medical diagnosis, different errors incur different costs; hence,
the overall risk is cost-sensitive.
   Two other popular criteria are balanced error rate (BER) and
F-score.
   In this work, we focus on the two-class problem and use a general
definition of conditional entropy (including Shannon's as a special
case) to derive upper/lower bounds on the optimal F-score, BER and
cost-sensitive risk, extending Fano's result.
As a consequence, we show that <i>Infomax is not suitable for
    optimizing F-score or cost-sensitive risk</i>, in that it can potentially
lead to low F-score and high risk.
   For cost-sensitive risk, we propose a new conditional entropy
formulation which avoids this inconsistency.
   In addition, we consider the common practice of using a threshold
on the posterior probability to tune performance of a classifier.
As is widely known, a threshold of <i>0.5</i>, where the posteriors
cross, minimizes error rate---we derive similar optimal thresholds for
F-score and BER.
</description>
</item>



<item>
<title>
Variational Inference in Nonconjugate Models; Chong Wang, David M. Blei; 14(Apr):1005--1031, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/wang13b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/wang13b.html
</link>
<description>
Mean-field variational methods are widely used for approximate
  posterior inference in many probabilistic models.  In a typical
  application, mean-field methods approximately compute the posterior
  with a coordinate-ascent optimization algorithm.  When the model is
  conditionally conjugate, the coordinate updates are easily derived
  and in closed form. However, many models of interest---like the
  correlated topic model and Bayesian logistic regression---are
  nonconjugate. In these models, mean-field methods cannot be directly
  applied and practitioners have had to develop variational algorithms
  on a case-by-case basis.  In this paper, we develop two generic
  methods for nonconjugate models, Laplace variational inference and
  delta method variational inference.  Our methods have several
  advantages: they allow for easily derived variational algorithms
  with a wide class of nonconjugate models; they extend and unify some
  of the existing algorithms that have been derived for specific
  models; and they work well on real-world data sets. We studied our
  methods on the correlated topic model, Bayesian logistic regression,
  and hierarchical Bayesian logistic regression.
</description>
</item>



<item>
<title>
Bayesian Canonical Correlation Analysis; Arto Klami, Seppo Virtanen, Samuel Kaski; 14(Apr):965--1003, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/klami13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/klami13a.html
</link>
<description>
Canonical correlation analysis (CCA) is a classical method for
  seeking correlations between two multivariate data sets. During the
  last ten years, it has received more and more attention in the
  machine learning community in the form of novel computational
  formulations and a plethora of applications. We review recent
  developments in Bayesian models and inference methods for CCA which
  are attractive for their potential in hierarchical extensions and
  for coping with the combination of large dimensionalities and small
  sample sizes. The existing methods have not been particularly
  successful in fulfilling the promise yet; we introduce a novel
  efficient solution that imposes group-wise sparsity to estimate the
  posterior of an extended model which not only extracts the
  statistical dependencies (correlations) between data sets but also
  decomposes the data into shared and data set-specific components. In
  the statistics literature the model is known as inter-battery factor
  analysis (IBFA), for which we now provide a Bayesian treatment.
</description>
</item>



<item>
<title>
Query Induction with Schema-Guided Pruning Strategies; Joachim Niehren, J&#233;r&#244;me Champav&#232;re, Aur&#233;lien Lemay, R&#233;mi Gilleron; 14(Apr):927--964, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/niehren13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/niehren13a.html
</link>
<description>
Inference algorithms for tree automata that define node selecting
queries in unranked trees rely on tree pruning strategies. These
impose additional assumptions on node selection that are needed to
compensate for small numbers of annotated examples. Pruning-based
heuristics in query learning algorithms for Web information extraction
often boost the learning quality and speed up the learning process.
We will distinguish the class of regular queries that
are stable under a given schema-guided pruning strategy, and show that
this class is learnable with polynomial time and data. Our learning
algorithm is obtained by adding pruning heuristics to the traditional
learning algorithm for tree automata from positive and negative
examples. While justified by a formal learning model, our learning
algorithm for stable queries also performs very well in practice for
<b>XML</b> information extraction.
</description>
</item>



<item>
<title>
Truncated Power Method for Sparse Eigenvalue Problems; Xiao-Tong Yuan, Tong Zhang; 14(Apr):899--925, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/yuan13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/yuan13a.html
</link>
<description>
This paper considers the sparse eigenvalue problem, which is to
extract dominant (largest) sparse eigenvectors with at most <i>k</i>
non-zero components. We propose a simple yet effective solution
called the <i>truncated power method</i> that can approximately solve the
underlying nonconvex optimization problem. A strong sparse recovery
result is proved for the truncated power method, and this theory is
our key motivation for developing the new algorithm. The proposed
method is tested on applications such as sparse principal component
analysis and the densest <i>k</i>-subgraph problem. Extensive experiments
on several synthetic and real-world data sets demonstrate the
competitive empirical performance of our method.
</description>
</item>

<!-- 2013 Apr -->

<item>
<title>
A Widely Applicable Bayesian Information Criterion; Sumio Watanabe; 14(Mar):867--897, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/watanabe13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/watanabe13a.html
</link>
<description>
<p>A statistical model or a learning machine is called regular if the map taking a parameter 
to a probability distribution is one-to-one and if its Fisher information 
matrix is always positive definite. Otherwise, it is called singular.
In regular statistical models,
the Bayes free energy, which is defined as the negative logarithm of the Bayes marginal likelihood,
can be asymptotically approximated by the 
Schwarz Bayes information criterion (BIC), whereas in singular models
such approximation does not hold. 
</p>

<p>Recently, it was proved that the Bayes free energy of a singular model is
asymptotically given by a generalized formula
using a birational invariant, the real log canonical threshold (RLCT),
instead of half the number of parameters in BIC. 
Theoretical values of RLCTs in several statistical models are now being 
discovered based on algebraic geometrical methodology. 
However, it has been difficult to estimate the Bayes free energy using only training samples, 
because an RLCT depends on an unknown true distribution. </p>

<p>In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by 
the average log likelihood function 
over the posterior distribution with the inverse temperature <i>1/log n</i>,
where <i>n</i> is the number of training samples. We mathematically prove that 
WBIC has the same asymptotic expansion as the Bayes free energy, even if
a true distribution is singular for or unrealizable by a statistical model.
Since WBIC can be numerically calculated without any information about a true 
distribution, 
it is a generalized version of BIC for singular statistical models.</p>
</description>
</item>



<item>
<title>
Quasi-Newton Method: A New Direction; Philipp Hennig, Martin Kiefel; 14(Mar):843--865, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/hennig13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/hennig13a.html
</link>
<description>
Four decades after their invention, quasi-Newton methods are still
  state of the art in unconstrained numerical optimization. Although
  not usually interpreted thus, these are learning algorithms that fit
  a local quadratic approximation to the objective function. We show
  that many, including the most popular, quasi-Newton methods can be
  interpreted as approximations of Bayesian linear regression under
  varying prior assumptions. This new notion elucidates some
  shortcomings of classical algorithms, and lights the way to a novel
  nonparametric quasi-Newton method, which is able to make more
  efficient use of available information at computational cost similar
  to its predecessors.
</description>
</item>



<item>
<title>
Greedy Sparsity-Constrained Optimization; Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos; 14(Mar):807--841, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/bahmani13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/bahmani13a.html
</link>
<description>
Sparsity-constrained optimization has wide applicability in machine
learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsity-constrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without <i>l<sub>2</sub></i>-regularization.
</description>
</item>



<item>
<title>
MLPACK: A Scalable C++ Machine Learning Library; Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray; 14(Mar):801--805, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/curtin13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/curtin13a.html
</link>
<description>
MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning
library released in late 2011 offering both a simple, consistent API accessible
to novice users and high performance and flexibility to expert users by
leveraging modern features of C++.  MLPACK provides cutting-edge algorithms
whose benchmarks exhibit far better performance than other leading machine
learning libraries.  MLPACK version 1.0.3, licensed under the LGPL, is available
at <a href="http://www.mlpack.org">www.mlpack.org</a>.
</description>
</item>



<item>
<title>
Semi-Supervised Learning Using Greedy Max-Cut; Jun Wang, Tony Jebara, Shih-Fu Chang; 14(Mar):771--800, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/wang13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/wang13a.html
</link>
<description>
Graph-based semi-supervised learning (<i>SSL</i>) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial
observed labels, this article proposes a bivariate formulation for graph-based <i>SSL</i>, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (<i>GGMC</i>) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based <i>SSL</i> is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art <i>SSL</i> methods. Moreover, <i>GGMC</i> shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes.
</description>
</item>



<item>
<title>
Sparsity Regret Bounds for Individual Sequences in Online Linear Regression; S&#233;bastien Gerchinovitz; 14(Mar):729--769, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/gerchinovitz13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/gerchinovitz13a.html
</link>
<description>
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension <i>d</i> can be much larger than the number of time rounds <i>T</i>. We introduce the notion of <i>sparsity regret bound</i>, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.
</description>
</item>



<item>
  <title>Welcome New JMLR Co-Editors-in-Chief Kevin Murphy and Bernhard Schoelkopf</title>
  <pubDate>2013-03-25</pubDate>
  <guid isPermaLink="true">http://jmlr.csail.mit.edu/editorial-board.html#2013-new-EIC</guid>
  <link>http://jmlr.csail.mit.edu/editorial-board.html#2013-new-EIC</link>
  <description>
<p>We are very pleased to announce the appointment of new JMLR Co-Editors-in-Chief, Kevin Murphy and Bernhard Schoelkopf.  We are delighted to have them take over as editors and look forward to their stewardship and innovation in running the journal.</p>

<p>We should all take this opportunity to thank outgoing EIC Lawrence Saul for his service to the journal;  he has been a wonderful Editor and we will be sorry to see him go.</p>

<p>We should also thank Aron Culotta and Rich Maclin, the Managing Editor and Production Editor, for agreeing to stay on in service of JMLR, thank Youngmin Cho for his previous service as webmaster, and welcome Chiyuan Zhang as our new webmaster.</p>

<p>All of these people, together with the action editors, put in an enormous amount of work as volunteers and we could not succeed without them.</p>

<p>Leslie Kaelbling, on behalf of the JMLR community
<br />March 25, 2013</p>
  </description>
</item>

<item>
<title>
Differential Privacy for Functions and Functional Data; Rob Hall, Alessandro Rinaldo, Larry Wasserman; 14(Feb):703--727, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/hall13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/hall13a.html
</link>
<description>
Differential privacy is a rigorous cryptographically-motivated characterization of data privacy which may be applied when
releasing summaries of a database.
Previous work has focused mainly on methods for which
the output is a finite dimensional vector, or an element of some discrete set.
We develop methods for releasing functions
while preserving differential privacy.
Specifically, we show that adding an appropriate Gaussian process
to the function of interest yields
differential privacy.  When the functions lie in the reproducing kernel Hilbert space (RKHS) generated by the covariance kernel of the
Gaussian process, then the correct noise level is established by
measuring the "sensitivity" of the function in the RKHS norm.
As examples we consider kernel density estimation,
kernel support vector machines,
and functions in RKHSs.
</description>
</item>



<item>
<title>
Bayesian Nonparametric Hidden Semi-Markov Models; Matthew J. Johnson, Alan S. Willsky; 14(Feb):673--701, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/johnson13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/johnson13a.html
</link>
<description>
There is much interest in the Hierarchical Dirichlet Process Hidden Markov
Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous
Hidden Markov Model for learning from sequential and time-series data. However,
in many settings the HDP-HMM's strict Markovian constraints are undesirable,
particularly if we wish to learn or encode non-geometric state durations. We
can extend the HDP-HMM to capture such structure by drawing upon
explicit-duration semi-Markov modeling, which has been developed mainly in the
parametric non-Bayesian setting, to allow construction of highly interpretable
models that admit natural prior information on state durations.

In this paper we introduce the explicit-duration Hierarchical Dirichlet Process
Hidden semi-Markov Model (HDP-HSMM) and develop sampling algorithms for
efficient posterior inference. The methods we introduce also provide new
methods for sampling inference in the finite Bayesian HSMM.
Our modular Gibbs sampling methods can be embedded in samplers for larger
hierarchical Bayesian models, adding semi-Markov chain modeling as another tool
in the Bayesian inference toolbox.  We demonstrate the utility of the HDP-HSMM
and our inference methods on both synthetic and real experiments.
</description>
</item>



<item>
<title>
CODA: High Dimensional Copula Discriminant Analysis; Fang Han, Tuo Zhao, Han Liu; 14(Feb):629--671, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/han13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/han13a.html
</link>
<description>
We propose a high dimensional classification method, named the 
<i>Copula Discriminant Analysis</i> (CODA). The CODA generalizes the
normal-based linear discriminant analysis to the larger Gaussian
Copula models (or the nonparanormal) as proposed by Liu et al. (2009).
To simultaneously achieve estimation efficiency and robustness, the
nonparametric rank-based methods including Spearman's rho and
Kendall's tau are exploited in estimating the covariance matrix. In
high dimensional settings, we prove that the sparsity pattern of the
discriminant features can be consistently recovered with the
parametric rate, and the expected misclassification error is
consistent to the Bayes risk. Our theory is backed up by careful
numerical experiments, which show that the extra flexibility gained
by the CODA method incurs little efficiency loss even when the data
are truly Gaussian. These results suggest that the CODA method can
be an alternative choice besides the normal-based high dimensional
linear discriminant analysis.
</description>
</item>



<item>
<title>
A <code>C++</code> Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics; Herv&#233; Frezza-Buet, Matthieu Geist; 14(Feb):625--628, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/frezza-buet13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/frezza-buet13a.html
</link>
<description>
This paper introduces the <code>rllib</code> as an original <code>C++</code> template-based
library oriented toward value function estimation. Generic programming is promoted here as a way of having a good
fit between the mathematics of reinforcement learning and their implementation in a
library. The main concepts of <code>rllib</code> are presented, as well as a short example.
</description>
</item>



<item>
<title>
Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality; S&#233;bastien Bubeck, Damien Ernst, Aur&#233;lien Garivier; 14(Feb):601--623, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/bubeck13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/bubeck13a.html
</link>
<description>
We consider an original problem that arises from the issue of security analysis of a power system and that we name optimal discovery with probabilistic expert advice. We address it with an algorithm based on the optimistic paradigm and on the Good-Turing missing mass estimator. We prove two different regret bounds on the performance of this algorithm under weak assumptions on the probabilistic experts. Under more restrictive hypotheses, we also prove a macroscopic optimality result, comparing the algorithm both with an oracle strategy and with uniform sampling.
Finally, we provide numerical experiments illustrating these theoretical findings.
</description>
</item>



<item>
<title>
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization; Shai Shalev-Shwartz, Tong Zhang; 14(Feb):567--599, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/shalev-shwartz13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/shalev-shwartz13a.html
</link>
<description>
Stochastic Gradient Descent (SGD) has become popular for solving
  large scale supervised machine learning optimization problems such
  as SVM, due to its strong theoretical guarantees.  While the
  closely related Dual Coordinate Ascent (DCA) method has been
  implemented in various software packages, it has so far lacked good
  convergence analysis.  This paper presents a new analysis of
  Stochastic Dual Coordinate Ascent (SDCA) showing that this class of
  methods enjoys strong theoretical guarantees that are comparable to or
  better than those of SGD. This analysis justifies the effectiveness of SDCA
  for practical applications.
</description>
</item>



<item>
<title>
Algorithms for Discovery of Multiple Markov Boundaries; Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis; 14(Feb):499--566, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/statnikov13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/statnikov13a.html
</link>
<description>
Algorithms for Markov boundary discovery from data constitute an important recent development in
machine learning, primarily because they offer a principled solution to the variable/feature selection
problem and give insight into local causal structure. Over the last decade many sound algorithms have
been proposed to identify a single Markov boundary of the response variable. Even though faithful
distributions and, more broadly, distributions that satisfy the intersection property always have a
single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the
response variable.  The latter distributions/data sets are common in practical data-analytic
applications, and there are several reasons why it is important to induce multiple Markov boundaries
from such data. However, there are currently no sound and efficient algorithms that can accomplish
this task. This paper describes a family of algorithms, TIE*, that can discover all Markov boundaries
in a distribution.  The broad applicability as well as efficiency of the new algorithmic family is
demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art
algorithms/variants in 15 data sets from a diversity of application domains.
</description>
</item>



<item>
<title>
A Theory of Multiclass Boosting; Indraneel Mukherjee, Robert E. Schapire; 14(Feb):437--497, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/mukherjee13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/mukherjee13a.html
</link>
<description>
Boosting combines weak classifiers to form highly accurate
  predictors. Although the case of binary classification is well
  understood, in the multiclass setting the "correct" requirements
  on the weak classifier and a notion of the most efficient boosting
  algorithms are both missing. In this paper, we create a broad and general
  framework, within which we make precise and identify the optimal
  requirements on the weak classifier, as well as design the most
  effective, in a certain sense, boosting algorithms that assume such
  requirements.
</description>
</item>



<item>
<title>
Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections; Aleksandrs Slivkins, Filip Radlinski, Sreenivas Gollapudi; 14(Feb):399--436, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/slivkins13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/slivkins13a.html
</link>
<description>
Most learning-to-rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this either lack theoretical foundations or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly take document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: <i>ranked bandits</i> (Radlinski et al., 2008) and <i>Lipschitz bandits</i> (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches.
</description>
</item>



<item>
<title>
Learning Theory Approach to Minimum Error Entropy Criterion; Ting Hu, Jun Fan, Qiang Wu, Ding-Xuan Zhou; 14(Feb):377--397, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/hu13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/hu13a.html
</link>
<description>
We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an
approximation of R&#233;nyi's entropy (of order <i>2</i>) by Parzen windowing is minimized. This learning algorithm involves a
Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting
when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation
ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error
associated with R&#233;nyi's entropy and a Parzen windowing function, to overcome technical difficulties arising from the
essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least
squares error is introduced and analyzed, which is related to some ranking algorithms.
</description>
</item>



<item>
<title>
Risk Bounds of Learning Processes for L&#233;vy Processes; Chao Zhang, Dacheng Tao; 14(Feb):351--376, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/zhang13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/zhang13a.html
</link>
<description>
L&#233;vy processes refer to a class of stochastic processes, for example, Poisson processes and Brownian motions, and play an important role in stochastic processes and machine learning. Therefore, it is essential to study risk bounds of the learning process for time-dependent samples drawn from a L&#233;vy process (or, briefly, the learning process for a L&#233;vy process). It is noteworthy that samples in this learning process are not independently and identically distributed (i.i.d.). Therefore, results in traditional statistical learning theory are not applicable (or at least cannot be applied directly), because they are obtained under the sample-i.i.d. assumption. In this paper, we study risk bounds of the learning process for time-dependent samples drawn from a L&#233;vy process, and then analyze the asymptotic behavior of the learning process. In particular, we first develop the deviation inequalities and the symmetrization inequality for the learning process. By using the resultant inequalities, we then obtain the risk bounds based on the covering number. Finally, based on the resulting risk bounds, we study the asymptotic convergence and the rate of convergence of the learning process for a L&#233;vy process.
Meanwhile, we also give a comparison to the related results under the sample-i.i.d. assumption.
</description>
</item>



<item>
<title>
A Framework for Evaluating Approximation Methods for Gaussian Process Regression; Krzysztof Chalupka, Christopher K. I. Williams, Iain Murray; 14(Feb):333--350, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/chalupka13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/chalupka13a.html
</link>
<description>
Gaussian process (GP) predictors are an important component of many
  Bayesian approaches to machine learning. However, even a
  straightforward implementation of Gaussian process regression (GPR)
  requires <i>O(n<sup>2</sup>)</i> space and <i>O(n<sup>3</sup>)</i> time for a data set of <i>n</i>
  examples. Several approximation methods have been proposed, but
  there is a lack of understanding of the relative merits of the
  different approximations, and in what situations they are most
  useful.  We recommend assessing the quality of the predictions
  obtained as a function of the compute time taken, and comparing to
  standard baselines (e.g., Subset of Data and FITC).
  We empirically investigate four different approximation algorithms on
  four different prediction problems, and make our code available to
  encourage future comparisons.
</description>
</item>



<item>
<title>
Using Symmetry and Evolutionary Search to Minimize Sorting Networks; Vinod K. Valsalam, Risto Miikkulainen; 14(Feb):303--331, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/valsalam13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/valsalam13a.html
</link>
<description>
Sorting networks are an interesting class of parallel sorting
  algorithms with applications in multiprocessor computers and
  switching networks. They are built by cascading a series of
  comparison-exchange units called comparators.  Minimizing the number
  of comparators for a given number of inputs is a challenging
  optimization problem.  This paper presents a two-pronged approach
  called Symmetry and Evolution based Network Sort Optimization
  (SENSO) that makes it possible to scale the solutions to networks
  with a larger number of inputs than previously possible.  First, it
  uses the symmetry of the problem to decompose the minimization
  goal into subgoals that are easier to solve.  Second, it minimizes
  the resulting greedy solutions further by using an evolutionary
  algorithm to learn the statistical distribution of comparators in
  minimal networks.  The final solutions improve upon half a century of
  results published in patents, books, and peer-reviewed literature,
  demonstrating the potential of the SENSO approach for solving
  difficult combinatorial problems.
</description>
</item>

<!-- above is Volume 14, Feb. -->

<item>
<title>
Derivative Estimation with Local Polynomial Fitting; Kris De Brabanter, Jos De Brabanter, Bart De Moor, Ir&#232;ne Gijbels; 14(Jan):281--301, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/debrabanter13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/debrabanter13a.html
</link>
<description>
We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is as important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L_1 and L_2 rates and establish consistency of the estimator. The new data sets created by this technique are no
</description>
</item>

<item>
<title>
Sparse Single-Index Model; Pierre Alquier, G&#233;rard Biau; 14(Jan):243--280, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/alquier13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/alquier13a.html
</link>
<description>
Let (X, Y) be a random pair taking values in &#8477;^p &#215; &#8477;. In the so-called single-index model, one has Y=f^*(&#952;^{*T}X)+W, where f^* is an unknown univariate measurable function, &#952;^* is an unknown vector in &#8477;^p, and W denotes a random noise satisfying E[W|X]=0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations ("p larger than n" paradigm).  To circumvent this difficulty, we consider the single-index model estimation
</description>
</item>

<item>
<title>
MAGIC Summoning:  Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use; Daniel Kyu Hwa Kohlsdorf, Thad E. Starner; 14(Jan):209--242, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/kohlsdorf13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/kohlsdorf13a.html
</link>
<description>
Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer.  However, it is a challenge for interface designers to create gestures easily distinguishable from users' normal movements.  Our tool MAGIC Summoning addresses this problem.  Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an "Everyday Gesture Library" or EGL).  The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching.  MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering.
</description>
</item>

<item>
<title>
Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem; Antoine Salomon, Jean-Yves Audibert, Issam El Alaoui; 14(Jan):187--207, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/salomon13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/salomon13a.html
</link>
<description>
This paper is devoted to regret lower bounds in the classical model of the stochastic multi-armed bandit.  A well-known result of Lai and Robbins, later extended by Burnetas and Katehakis, established the presence of a logarithmic bound for all consistent policies. We relax the notion of consistency, and exhibit a generalisation of the bound. We also study the existence of logarithmic bounds in general and in the case of Hannan consistency. Moreover, we prove that it is impossible to design an adaptive policy that would select the best of two algorithms by taking advantage of the properties of the environment. To get these results, we study variants of popular Upper Confidence
</description>
</item>

<item>
<title>
Universal Consistency of Localized Versions of Regularized Kernel Methods; Robert Hable; 14(Jan):153--186, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/hable13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/hable13a.html
</link>
<description>
In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm at a testing point is based only on training data that are close to the testing point.  Every global algorithm such as support vector machines (SVM) can be localized in the following way: at every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In the case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only
</description>
</item>

<item>
<title>
Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models; Aapo Hyv&#228;rinen, Stephen M. Smith; 14(Jan):111--152, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/hyvarinen13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/hyvarinen13a.html
</link>
<description>
We present new measures of the causal direction, or direction of effect, between two non-Gaussian random variables. They are based on the likelihood ratio under the linear non-Gaussian acyclic model (LiNGAM).  We also develop simple first-order approximations of the likelihood ratio and analyze them based on related cumulant-based measures, which can be shown to find the correct causal directions. We show how to apply these measures to estimate LiNGAM for more than two variables, and even in the case of more variables than observations. We further extend the method to cyclic and nonlinear models. The proposed framework is statistically at least as good as existing ones in the cases of few data
</description>
</item>

<item>
<title>
Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood; Jaakko Riihim&#228;ki, Pasi Jyl&#228;nki, Aki Vehtari; 14(Jan):75--109, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/riihimaki13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/riihimaki13a.html
</link>
<description>
This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors.  Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows.  Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations.  In this paper we propose a novel nested EP approach which does
</description>
</item>

<item>
<title>
Ranking Forests; St&#233;phan Cl&#233;men&#231;on, Marine Depecker, Nicolas Vayatis; 14(Jan):39--73, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/clemencon13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/clemencon13a.html
</link>
<description>
The present paper examines how the aggregation and feature randomization principles underlying the algorithm RANDOM FOREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, such as those developed in Cl&#233;men&#231;on and Vayatis (2009c). The present work describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called RANKING FOREST is presented.
</description>
</item>

<item>
<title>
Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization; Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, Ryota Tomioka; 14(Jan):1--37, 2013.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v14/nakajima13a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v14/nakajima13a.html
</link>
<description>
The variational Bayesian (VB) approximation is known to be a promising approach to Bayesian estimation when the rigorous calculation of the Bayes posterior is intractable.  The VB approximation has been successfully applied to matrix factorization (MF), offering automatic dimensionality selection for principal component analysis.  Generally, finding the VB solution is a non-convex problem, and most methods rely on a local search algorithm derived through a standard procedure for the VB approximation.  In this paper, we show that a better option is available for fully-observed VBMF---the global solution can be analytically computed.  More specifically, the global solution is a reweighted SVD
</description>
</item>

<item>
<title>
Exploration in Relational Domains for Model-based Reinforcement Learning; Tobias Lang, Marc Toussaint, Kristian Kersting; 13(Dec):3725--3768, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lang12a.html
</link>
<description>
A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E^3 and R-MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action
</description>
</item>

<item>
<title>
Security Analysis of Online Centroid Anomaly Detection; Marius Kloft, Pavel Laskov; 13(Dec):3681--3724, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kloft12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kloft12b.html
</link>
<description>
Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, and malware detection). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method---online centroid anomaly detection---in the presence of adversarial noise.  Our analysis
</description>
</item>

<item>
<title>
Smoothing Multivariate Performance Measures; Xinhua Zhang, Ankan Saha, S.V.N. Vishwanathan; 13(Dec):3623--3680, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhang12d.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhang12d.html
</link>
<description>
Optimizing multivariate performance measures is an important task in Machine Learning.  Joachims (2005) introduced a Support Vector Method whose underlying optimization problem is commonly solved by cutting plane methods (CPMs) such as SVM-Perf and BMRM.  It can be shown that CPMs converge to an &#949;-accurate solution in O(1/&#955; &#949;) iterations, where &#955; is the trade-off parameter between the regularizer and the loss function.  Motivated by the impressive convergence rate of CPM on a number of practical problems, it was conjectured that these rates can be further improved.  We disprove this conjecture in this paper by constructing counterexamples.  However, surprisingly, we further
</description>
</item>

<item>
<title>
SVDFeature: A Toolkit for Feature-based Collaborative Filtering; Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, Yong Yu; 13(Dec):3619--3622, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/chen12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/chen12a.html
</link>
<description>
In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information.  The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data sets. Using this toolkit, we built solutions that won the KDD Cup in two consecutive years.
</description>
</item>

<item>
<title>
Learning Symbolic Representations of Hybrid Dynamical Systems; Daniel L. Ly, Hod Lipson; 13(Dec):3585--3618, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ly12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ly12a.html
</link>
<description>
A hybrid dynamical system is a mathematical model suitable for describing an extensive spectrum of multi-modal, time-series behaviors, ranging from bouncing balls to air traffic controllers. This paper describes multi-modal symbolic regression (MMSR): a learning algorithm to construct non-linear symbolic representations of discrete dynamical systems with continuous mappings from unlabeled, time-series data. MMSR consists of two subalgorithms---clustered symbolic regression, a method to simultaneously identify distinct behaviors while formulating their mathematical expressions, and transition modeling, an algorithm to infer symbolic inequalities that describe binary classification boundaries.
</description>
</item>

<item>
<title>
Regularized Bundle Methods for Convex and Non-Convex Risks; Trinh Minh Tri Do, Thierry Arti&#232;res; 13(Dec):3539--3583, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/do12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/do12a.html
</link>
<description>
Machine learning is most often cast as an optimization problem. Ideally, one expects a convex objective function to rely on efficient convex optimizers with nice guarantees such as no local optima. Yet, non-convexity is very frequent in practice and it may sometimes be inappropriate to look for convexity at any price. Alternatively one can decide not to limit a priori the modeling expressivity to models whose learning may be solved by convex optimization and rely on non-convex optimization algorithms. The main motivation of this work is to provide efficient and scalable algorithms for non-convex optimization. We focus on regularized unconstrained optimization problems which cover a large
</description>
</item>

<item>
<title>
DARWIN: A Framework for Machine Learning and Computer Vision Research and Development; Stephen Gould; 13(Dec):3533--3537, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/gould12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/gould12a.html
</link>
<description>
We present an open-source platform-independent C++ framework for machine learning and computer vision research. The framework includes a wide range of standard machine learning and graphical model algorithms as well as reference implementations for many machine learning and computer vision applications. The framework contains Matlab wrappers for core components of the library and an experimental graphical user interface for developing and visualizing machine learning data flows.
</description>
</item>

<item>
<title>
PAC-Bayes Bounds with Data Dependent Priors; Emilio Parrado-Hern&#225;ndez, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun; 13(Dec):3507--3531, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/parrado12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/parrado12a.html
</link>
<description>
This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs' generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and &#951;-prior SVM, whose regularization term  pushes towards the minimization of the prior PAC-Bayes bound. The experimental
</description>
</item>

<item>
<title>
Fast Approximation of Matrix Coherence and Statistical Leverage; Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff; 13(Dec):3475--3506, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/drineas12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/drineas12a.html
</link>
<description>
The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score.  These quantities are of interest in recently-popular problems such as matrix completion and Nystr&#246;m-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms.  Our main result is a randomized algorithm that takes as input an arbitrary n &#215; d matrix A, with n &#62;&#62; d, and that returns as output
</description>
</item>


<item>
<title>
Iterative Reweighted Algorithms for Matrix Rank Minimization; Karthik Mohan, Maryam Fazel; 13(Nov):3441--3473, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/mohan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/mohan12a.html
</link>
<description>
The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions.  In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 &#8804; p &#8804; 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p=1, we give theoretical guarantees
</description>
</item>

<item>
<title>
Learning Linear Cyclic Causal Models with Latent Variables; Antti Hyttinen, Frederick Eberhardt, Patrik O. Hoyer; 13(Nov):3387--3439, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/hyttinen12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/hyttinen12a.html
</link>
<description>
Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to
</description>
</item>

<item>
<title>
Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing; Nicolas Gillis; 13(Nov):3349--3386, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/gillis12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/gillis12a.html
</link>
<description>
Nonnegative matrix factorization (NMF) has become a very popular technique in machine learning because it automatically extracts meaningful features through a sparse and part-based representation. However, NMF has the drawback of being highly ill-posed, that is, there typically exist many different but equivalent factorizations.  In this paper, we introduce a completely new way of obtaining more well-posed NMF problems whose solutions are sparser. Our technique is based on the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and the geometric interpretation of NMF.  This approach provably leads to optimal and sparse solutions under the separability
</description>
</item>

<item>
<title>
Large-scale Linear Support Vector Regression; Chia-Hua Ho, Chih-Jen Lin; 13(Nov):3323--3348, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ho12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ho12a.html
</link>
<description>
Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming.  Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications while enjoying much faster training/testing.  However, few studies have focused on linear SVR.  In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR.  We show that the extension is straightforward for some methods, but is not trivial for some others.  Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR.
</description>
</item>

<item>
<title>
Human Gesture Recognition on Product Manifolds; Yui Man Lui; 13(Nov):3297--3321, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lui12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lui12a.html
</link>
<description>
Action videos are multidimensional data and can be naturally represented as data tensors.  While tensor computing is widely used in computer vision, the geometry of tensor space is often ignored. The aim of this paper is to demonstrate the importance of the intrinsic geometry of tensor space which yields a very discriminating structure for action recognition. We characterize data tensors as points on a product manifold and model them statistically using least squares regression. To this end, we factorize a data tensor relating to each order of the tensor using Higher Order Singular Value Decomposition (HOSVD) and then impose each factorized element on a Grassmann manifold. Furthermore, we
</description>
</item>

<item>
<title>
Linear Fitted-Q Iteration with Multiple Reward Functions; Daniel J. Lizotte, Michael Bowling, Susan A. Murphy; 13(Nov):3253--3295, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lizotte12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lizotte12a.html
</link>
<description>
We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions.  We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further
</description>
</item>

<item>
<title>
Sally: A Tool for Embedding Strings in Vector Spaces; Konrad Rieck, Christian Wressnegger, Alexander Bikadorov; 13(Nov):3247--3251, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/rieck12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/rieck12a.html
</link>
<description>
Strings and sequences are ubiquitous in many areas of data analysis. However, only a few learning methods can be directly applied to this form of data.  We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data.  Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. The tool supports several data formats and is capable of interfacing with common learning environments,
</description>
</item>

<item>
<title>
Dynamic Policy Programming; Mohammad Gheshlaghi Azar, Vicen&#231; G&#243;mez, Hilbert J. Kappen; 13(Nov):3207--3245, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/azar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/azar12a.html
</link>
<description>
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes.  DPP is an incremental algorithm that forces a gradual change in policy update.  This allows us to prove finite-iteration and asymptotic l_&#8734;-norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors.  The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be
</description>
</item>

<item>
<title>
Quantum Set Intersection and its Application to Associative Memory; Tamer Salman, Yoram Baram; 13(Nov):3177--3206, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/salman12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/salman12a.html
</link>
<description>
We describe a quantum algorithm for computing the intersection of two sets and its application to associative memory. The algorithm is based on a modification of Grover's quantum search algorithm (Grover, 1996). We present algorithms for pattern retrieval, pattern completion, and pattern correction. We show that the quantum associative memory can store an exponential number of memories and retrieve them in sub-exponential time. We prove that this model has advantages over known classical associative memories as well as previously proposed quantum models.
</description>
</item>

<item>
<title>
Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets; Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan; 13(Nov):3133--3176, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/brodersen12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/brodersen12a.html
</link>
<description>
Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects)
</description>
</item>

<item>
<title>
Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training; Zhuang Wang, Koby Crammer, Slobodan Vucetic; 13(Oct):3103--3131, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/wang12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/wang12b.html
</link>
<description>
Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant
</description>
</item>

<item>
<title>
Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition; Yang Wang, Duan Tran, Zicheng Liao, David Forsyth; 13(Oct):3075--3102, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/wang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/wang12a.html
</link>
<description>
We consider the problem of parsing human poses and recognizing their actions in static images with part-based models. Most previous work in part-based models only considers rigid parts (e.g., torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate. In this paper, we introduce hierarchical poselets---a new representation for modeling the pose configuration of human bodies. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of human bodies (e.g., torso + left arm). In the extreme case, they can be the whole bodies. The hierarchical poselets are organized in a hierarchical way via a
</description>
</item>

<item>
<title>
Finite-Sample Analysis of Least-Squares Policy Iteration; Alessandro Lazaric, Mohammad Ghavamzadeh, R&#233;mi Munos; 13(Oct):3041--3074, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lazaric12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lazaric12a.html
</link>
<description>
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain.
</description>
</item>

<item>
<title>
Multi-Instance Learning with Any Hypothesis Class; Sivan Sabato, Naftali Tishby; 13(Oct):2999--3039, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/sabato12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/sabato12a.html
</link>
<description>
In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to specific settings or applications.  In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class,
</description>
</item>

<item>
<title>
Oger: Modular Learning Architectures For Large-Scale Sequential Processing; David Verstraeten, Benjamin Schrauwen, Sander Dieleman, Philemon Brakel, Pieter Buteneers, Dejan Pecevski; 13(Oct):2995--2998, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/verstraeten12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/verstraeten12a.html
</link>
<description>
Oger (OrGanic Environment for Reservoir computing) is a Python toolbox for building, training and evaluating modular learning architectures on large data sets. It builds on MDP for its modularity, and adds processing of sequential data sets, gradient descent training, several cross-validation schemes and parallel parameter optimization methods. Additionally, several learning algorithms are implemented, such as different reservoir implementations (both sigmoid and spiking), ridge regression, conditional restricted Boltzmann machine (CRBM) and others, including GPU accelerated versions. Oger is released under the GNU LGPL, and is available from http://organic.elis.ugent.be/oger.
</description>
</item>

<item>
<title>
Facilitating Score and Causal Inference Trees for Large Observational Studies; Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, Xin Yan; 13(Oct):2955--2994, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/su12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/su12a.html
</link>
<description>
Assessing treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept termed the facilitating score to account for both the confounding and interacting impacts of covariates on the treatment effect. Several approaches for estimating the facilitating score are discussed. In particular, we put forward a machine learning method, called causal inference tree (CIT), to provide a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits data
</description>
</item>

<item>
<title>
Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices; Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, Arkadi Nemirovski; 13(Oct):2923--2954, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ben-tal12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ben-tal12a.html
</link>
<description>
In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically, K is modeled as a positive affine combination of given positive semidefinite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem to a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy
</description>
</item>

<item>
<title>
Online Submodular Minimization; Elad Hazan, Satyen Kale; 13(Oct):2903--2922, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/hazan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/hazan12a.html
</link>
<description>
We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings.
</description>
</item>

<item>
<title>
Local and Global Scaling Reduce Hubs in Space; Dominik Schnitzer, Arthur Flexer, Markus Schedl, Gerhard Widmer; 13(Oct):2871--2902, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/schnitzer12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/schnitzer12a.html
</link>
<description>
'Hubness' has recently been identified as a general problem of high dimensional data spaces, manifesting itself in the emergence of objects, so-called hubs, which tend to be among the k nearest neighbors of a large number of data items. As a consequence many nearest neighbor relations in the distance space are asymmetric, that is, object y is amongst the nearest neighbors of x but not vice versa. The work presented here discusses two classes of methods that try to symmetrize nearest neighbor relations and investigates to what extent they can mitigate the negative effects of hubs. We evaluate local distance scaling and propose a global variant which has the advantage of
</description>
</item>

<item>
<title>
A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss; Jos&#233; Hern&#225;ndez-Orallo, Peter Flach, C&#232;sar Ferri; 13(Oct):2813--2869, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/hernandez-orallo12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/hernandez-orallo12a.html
</link>
<description>
Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (with its decomposition into refinement and calibration). One way of understanding the relations among these metrics is by means of variable operating conditions (in the form of misclassification costs and/or class distributions). Thus, a metric may correspond to some expected loss over different operating conditions. One dimension for the analysis
</description>
</item>

<item>
<title>
Multi-task Regression using Minimal Penalties; Matthieu Solnon, Sylvain Arlot, Francis Bach; 13(Sep):2773--2812, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/solnon12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/solnon12a.html
</link>
<description>
In this paper we study the kernel  multiple ridge regression framework, which we refer to as multi-task regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this
</description>
</item>

<item>
<title>
Linear Regression With Random Projections; Odalric-Ambrym Maillard, R&#233;mi Munos; 13(Sep):2735--2772, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/maillard12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/maillard12a.html
</link>
<description>
We investigate a method for regression that makes use of a randomly generated subspace G_P&#8834;F (of finite dimension P) of a given large (possibly infinite) dimensional function space F, for example, L_2([0,1]^d;&#8476;).  G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients.  We show practical motivation for the use of this approach, detail the link that this random projections method shares with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in G_P rather than in F, and derive
</description>
</item>

<item>
<title>
Coherence Functions with Applications in Large-Margin Classification Methods; Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan; 13(Sep):2705--2734, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhang12c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhang12c.html
</link>
<description>
Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs cannot directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression.  The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error
</description>
</item>

<item>
<title>
PREA: Personalized Recommendation Algorithms Toolkit; Joonseok Lee, Mingxuan Sun, Guy Lebanon; 13(Sep):2699--2703, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lee12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lee12b.html
</link>
<description>
Recommendation systems are important business applications with significant economic impact. In recent years, a large number of algorithms have been proposed for recommendation systems. In this paper, we describe an open-source toolkit implementing many recommendation algorithms as well as popular evaluation metrics. In contrast to other packages, our toolkit implements recent state-of-the-art algorithms as well as most classic algorithms.
</description>
</item>

<item>
<title>
Selective Sampling and Active Learning from Single and Multiple Teachers; Ofer Dekel, Claudio Gentile, Karthik Sridharan; 13(Sep):2655--2697, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/dekel12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/dekel12b.html
</link>
<description>
We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Additionally, our selective sampling algorithm can be converted into an efficient statistical active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label. Finally, we demonstrate the
</description>
</item>

<item>
<title>
Static Prediction Games for Adversarial Learning Problems; Michael Br&#252;ckner, Christian Kanzow, Tobias Scheffer; 13(Sep):2617--2654, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/brueckner12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/brueckner12a.html
</link>
<description>
The standard assumption of identically distributed training and test data is violated when the test data are generated in response to the presence of a predictive model. This becomes apparent, for example, in the context of email spam filtering. Here, email service providers employ spam filters, and spam senders engineer campaign templates to achieve a high rate of successful deliveries despite the filters. We model the interaction between the learner and the data generator as a static game in which the cost functions of the learner and the data generator are not necessarily antagonistic. We identify conditions under which this prediction game has a unique Nash equilibrium and derive
</description>
</item>

<item>
<title>
Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs; Sunita Nayak, Kester Duncan, Sudeep Sarkar, Barbara Loeding; 13(Sep):2589--2615, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/nayak12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/nayak12a.html
</link>
<description>
We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a
</description>
</item>

<item>
<title>
Nonparametric Guidance of Autoencoder Representations using Label Information; Jasper Snoek, Ryan P. Adams, Hugo Larochelle; 13(Sep):2567--2588, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/snoek12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/snoek12a.html
</link>
<description>
While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks.  Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels.  One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction.  However, pure unsupervised learning with autoencoders can find representations that may or may not be
</description>
</item>

<item>
<title>
Robust Kernel Density Estimation; JooSeuk Kim, Clayton D. Scott; 13(Sep):2529--2565, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kim12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kim12b.html
</link>
<description>
We propose a method for nonparametric density estimation that exhibits robustness to contamination of the training sample. This method achieves robustness by combining a traditional kernel density estimator (KDE) with ideas from classical M-estimation. We interpret the KDE based on a positive semi-definite kernel as a sample mean in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density estimator (RKDE).  An RKDE can be computed efficiently via a kernelized iteratively re-weighted least squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized
</description>
</item>

<item>
<title>
Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints; Mehrdad Mahdavi, Rong Jin, Tianbao Yang; 13(Sep):2503--2528, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/mahdavi12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/mahdavi12a.html
</link>
<description>
In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most  algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While  the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is  the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K  for all rounds, we only require that the constraints, which define the set K, be satisfied in the
</description>
</item>

<item>
<title>
On the Convergence Rate of l_p-Norm Multiple Kernel Learning; Marius Kloft, Gilles Blanchard; 13(Aug):2465--2502, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kloft12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kloft12a.html
</link>
<description>
We derive an upper bound on the local Rademacher complexity of l_p-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p=1 only while our analysis covers all cases 1&#8804;p&#8804;&#8734;, assuming the different feature mappings corresponding to the different kernels to be uncorrelated.  We also give a lower bound showing that the bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order O(n^(-&#945;/(1+&#945;))), where &#945; is the minimum eigenvalue decay rate of the individual kernels.
</description>
</item>

<item>
<title>
Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs; Alain Hauser, Peter B&#252;hlmann; 13(Aug):2409--2464, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/hauser12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/hauser12a.html
</link>
<description>
The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is, the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes.  In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments.  We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models.
</description>
</item>

<item>
<title>
Multi-Target Regression with Rule Ensembles; Timo Aho, Bernard &#381;enko, Sa&#353;o D&#382;eroski, Tapio Elomaa; 13(Aug):2367--2407, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/aho12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/aho12a.html
</link>
<description>
Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller)
</description>
</item>

<item>
<title>
A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally; Michael W. Mahoney, Lorenzo Orecchia, Nisheeth K. Vishnoi; 13(Aug):2339--2365, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/mahoney12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/mahoney12a.html
</link>
<description>
The second eigenvalue of the Laplacian matrix and its associated eigenvector are fundamental features of an undirected graph, and as such they have found widespread use in scientific computing, machine learning, and data analysis.  In many applications, however, graphs that arise have several local regions of interest, and the second eigenvector will typically fail to provide information fine-tuned to each local region.  In this paper, we introduce a locally-biased analogue of the second eigenvector, and we demonstrate its usefulness at highlighting local properties of data graphs in a semi-supervised manner.  To do so, we first view the second eigenvector as the solution to a constrained
</description>
</item>

<item>
<title>
High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion; Animashree Anandkumar, Vincent Y.F. Tan, Furong Huang, Alan S. Willsky; 13(Aug):2293--2337, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/anandkumar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/anandkumar12a.html
</link>
<description>
We consider the problem of high-dimensional Gaussian graphical model selection. We identify a set of graphs for which an efficient estimation algorithm exists, and this algorithm is based on thresholding of empirical conditional covariances. Under a set of transparent conditions, we establish structural consistency (or sparsistency) for the proposed algorithm, when the number of samples n=&#937;(J_min^-2 log p), where p is the number of variables and J_min is the minimum (absolute) edge potential of the graphical model. The sufficient conditions for sparsistency are based on the notion of walk-summability of the model and the presence of sparse local vertex separators in the underlying graph.
</description>
</item>

<item>
<title>
Pairwise Support Vector Machines and their Application to Large Scale Problems; Carl Brunner, Andreas Fischer, Klaus Luig, Thorsten Thies; 13(Aug):2279--2292, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/brunner12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/brunner12a.html
</link>
<description>
Pairwise classification is the task of predicting whether the examples a,b of a pair (a,b) belong to the same class or to different classes. In particular, interclass generalization problems can be treated in this way.  In pairwise classification, the order of the two input examples should not affect the classification result. To achieve this, particular kernels as well as the use of symmetric training sets in the framework of support vector machines were suggested. The paper discusses both approaches in a general way and establishes a strong connection between them. In addition, an efficient implementation is discussed which allows the training of several millions of pairs. The value of these
</description>
</item>

<item>
<title>
MedLDA: Maximum Margin Supervised Topic Models; Jun Zhu, Amr Ahmed, Eric P. Xing; 13(Aug):2237--2278, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhu12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhu12a.html
</link>
<description>
A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism
</description>
</item>

<item>
<title>
A Topic Modeling Toolbox Using Belief Propagation; Jia Zeng; 13(Jul):2233--2236, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zeng12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zeng12a.html
</link>
<description>
Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology.  This paper introduces a topic modeling toolbox (TMBP) based on belief propagation (BP) algorithms.  The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either Windows 7 or Linux.  Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models.  The current version includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM), relational
</description>
</item>

<item>
<title>
Sign Language Recognition using Sub-Units; Helen Cooper, Eng-Jon Ong, Nicolas Pugeault, Richard Bowden; 13(Jul):2205--2231, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/cooper12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/cooper12a.html
</link>
<description>
This paper discusses sign language recognition using linguistic sub-units.  It presents three types of sub-units for consideration: those learnt from appearance data as well as those inferred from either 2D or 3D tracking data.  These sub-units are then combined using a sign level classifier; here, two options are presented.  The first uses Markov Models to encode the temporal changes between sub-units.  The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information.  This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains
</description>
</item>

<item>
<title>
An Introduction to Artificial Prediction Markets for Classification; Adrian Barbu, Nathan Lay; 13(Jul):2177--2204, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/barbu12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/barbu12a.html
</link>
<description>
Prediction markets are used in real life to predict outcomes of interest such as presidential elections. This paper presents a mathematical theory of artificial prediction markets for supervised learning of conditional probability estimators. The artificial prediction market is a novel method for fusing the prediction information of features or trained classifiers, where the fusion result is the contract price on the possible outcomes. The market can be trained online by updating the participants' budgets using training examples. Inspired by the real prediction markets, the equations that govern the market are derived from simple and reasonable assumptions. Efficient numerical algorithms are
</description>
</item>

<item>
<title>
DEAP: Evolutionary Algorithms Made Easy; F&#233;lix-Antoine Fortin, Fran&#231;ois-Michel De Rainville, Marc-Andr&#233; Gardner, Marc Parizeau, Christian Gagn&#233;; 13(Jul):2171--2175, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/fortin12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/fortin12a.html
</link>
<description>
DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Its design departs from most other existing frameworks in that it seeks to make algorithms explicit and data structures transparent, as opposed to the more common black-box frameworks.  Freely available with extensive documentation at http://deap.gel.ulaval.ca, DEAP is an open source project under an LGPL license.
</description>
</item>

<item>
<title>
On the Necessity of Irrelevant Variables; David P. Helmbold, Philip M. Long; 13(Jul):2145--2170, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/helmbold12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/helmbold12a.html
</link>
<description>
This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers.  The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing.  The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant.  We also show that accurate learning is possible even when there are so few examples that one
</description>
</item>

<item>
<title>
A Comparison of the Lasso and Marginal Regression; Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao; 13(Jun):2107--2143, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/genovese12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/genovese12b.html
</link>
<description>
The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance.  But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points.  Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders of magnitude faster than the lasso solutions.  The question that remains is
</description>
</item>

<item>
<title>
Optimistic Bayesian Sampling in Contextual-Bandit Problems; Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie; 13(Jun):2069--2106, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/may12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/may12a.html
</link>
<description>
In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web-based applications, a task with significant commercial interest.  In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing
</description>
</item>

<item>
<title>
Pattern for Python; Tom De Smedt, Walter Daelemans; 13(Jun):2063--2067, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/desmedt12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/desmedt12a.html
</link>
<description>
Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.
</description>
</item>

<item>
<title>
EP-GIG Priors and Applications in Bayesian Sparse Learning; Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan; 13(Jun):2031--2061, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhang12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhang12b.html
</link>
<description>
In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG).  EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures.  Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization.  The densities of EP-GIG can be explicitly expressed.  Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop
</description>
</item>

<item>
<title>
An Improved GLMNET for L1-regularized Logistic Regression; Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin; 13(Jun):1999--2030, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/yuan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/yuan12a.html
</link>
<description>
Recently, Yuan et al. (2010) conducted a comprehensive comparison of software for L1-regularized classification.  They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers.  In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute.  In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations.  In optimization, Newton methods are known to have fewer iterations although each iteration costs more.  Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may
</description>
</item>

<item>
<title>
Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality; Lan Xue, Annie Qu; 13(Jun):1973--1998, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/xue12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/xue12a.html
</link>
<description>
The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients.  It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size.  We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L_0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition
</description>
</item>

<item>
<title>
Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences; Jan Grau, Jens Keilwagen, Andr&#233; Gohr, Berit Haldemann, Stefan Posch, Ivo Grosse; 13(Jun):1967--1971, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/grau12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/grau12a.html
</link>
<description>
Jstacs is an object-oriented Java library for analysing and classifying sequence data, which emerged from the need for a standardized implementation of statistical models, learning principles, classifiers, and performance measures. In Jstacs, these components can be used, combined, and extended easily, which allows for a direct comparison of different approaches and fosters the development of new components.  Jstacs is especially tailored to biological sequence data, but is also applicable to general discrete and continuous data. Jstacs is freely available at http://www.jstacs.de under the GNU GPL license including an API documentation, a cookbook, and code examples.
</description>
</item>

<item>
<title>
Integrating a Partial Model into Model Free Reinforcement Learning; Aviv Tamar, Dotan Di Castro, Ron Meir; 13(Jun):1927--1966, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/tamar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/tamar12a.html
</link>
<description>
In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy.  Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent's knowledge. Our method relies on a novel definition for a partially
</description>
</item>

<item>
<title>
Confidence-Weighted Linear Classification for Text Categorization; Koby Crammer, Mark Dredze, Fernando Pereira; 13(Jun):1891--1926, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/crammer12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/crammer12a.html
</link>
<description>
Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare.  We investigate several versions of confidence-weighted learning that
</description>
</item>

<item>
<title>
Regularization Techniques for Learning with Matrices; Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari; 13(Jun):1865--1890, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kakade12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kakade12a.html
</link>
<description>
There is a growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm.  This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques.  In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate.  Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect
</description>
</item>

<item>
<title>
Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications; Jian Huang, Cun-Hui Zhang; 13(Jun):1839--1864, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/huang12b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/huang12b.html
</link>
<description>
The l_1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets.  Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted l_1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted l_1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special
</description>
</item>

<item>
<title>
Entropy Search for Information-Efficient Global Optimization; Philipp Hennig, Christian J. Schuler; 13(Jun):1809--1837, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/hennig12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/hennig12a.html
</link>
<description>
Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation.
</description>
</item>

<item>
<title>
Variational Multinomial Logit Gaussian Process; Kian Ming A. Chai; 13(Jun):1745--1808, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/chai12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/chai12a.html
</link>
<description>
A Gaussian process prior with an appropriate likelihood function is a flexible non-parametric model for a variety of learning tasks.  One important and standard task is multi-class classification, which is the categorization of an item into one of several fixed classes.  A usual likelihood function for this is the multinomial logistic likelihood function.  However, exact inference with this model has proved to be difficult because high-dimensional integrations are required.  In this paper, we propose a variational approximation to this model, and we describe the optimization of the variational parameters.  Experiments have shown our approximation to be tight.  In addition, we provide data-independent
</description>
</item>

<item>
<title>
Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning; Sangkyun Lee, Stephen J. Wright; 13(Jun):1705--1744, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lee12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lee12a.html
</link>
<description>
Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a low-dimensional manifold of parameter space along which the regularizer is smooth. (When an l_1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.)  This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation
</description>
</item>

<item>
<title>
glm-ie: Generalised Linear Models Inference &amp; Estimation Toolbox; Hannes Nickisch; 13(May):1699--1703, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/nickisch12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/nickisch12a.html
</link>
<description>
The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field.  Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling.
</description>
</item>

<item>
<title>
Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise; Sahand Negahban, Martin J. Wainwright; 13(May):1665--1697, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/negahban12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/negahban12a.html
</link>
<description>
We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case.  We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm.  Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices.  Our results are based on measures of the "spikiness" and "low-rankness" of matrices that are less restrictive than the incoherence conditions imposed in
</description>
</item>

<item>
<title>
Mixability is Bayes Risk Curvature Relative to Log Loss; Tim van Erven, Mark D. Reid, Robert C. Williamson; 13(May):1639--1663, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/vanerven12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/vanerven12a.html
</link>
<description>
Mixability of a loss characterizes fast rates in the online learning setting of prediction with expert advice. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses.  We then extend this result to multiclass proper losses where there are few existing results.  We show that mixability is governed by the maximum eigenvalue of the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms
</description>
</item>

<item>
<title>
A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models; Neil D. Lawrence; 13(May):1609--1638, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/lawrence12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/lawrence12a.html
</link>
<description>
We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU), is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE).  We show that MEU and ALLE are competitive with the leading
</description>
</item>

<item>
<title>
A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives; Aleix Martinez, Shichuan Du; 13(May):1589--1608, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/martinez12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/martinez12a.html
</link>
<description>
In cognitive science and neuroscience, there have been two leading models describing how humans perceive and classify facial expressions of emotion---the continuous and the categorical model. The continuous model defines each facial expression of emotion as a feature vector in a face space. This model explains, for example, how expressions of emotion can be seen at different intensities. In contrast, the categorical model consists of C classifiers, each tuned to a specific emotion category. This model explains, among other findings, why the images in a morphing sequence between a happy and a surprise face are perceived as either happy or surprise but not something in between. While the continuous
</description>
</item>

<item>
<title>
Activized Learning: Transforming Passive to Active with Improved Label Complexity; Steve Hanneke; 13(May):1469--1587, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/hanneke12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/hanneke12a.html
</link>
<description>
We study the theoretical advantages of active learning over passive learning.  Specifically, we prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be transformed into an active learning algorithm with asymptotically strictly superior label complexity for all nontrivial target functions and distributions.  We further provide a general characterization of the magnitudes of these improvements in terms of a novel generalization of the disagreement coefficient.  We also extend these results to active learning in the presence of label noise, and find that even under broad classes of noise distributions, we can typically guarantee strict improvements over
</description>
</item>

<item>
<title>
Structured Sparsity via Alternating Direction Methods; Zhiwei Qin, Donald Goldfarb; 13(May):1435--1468, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/qin12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/qin12a.html
</link>
<description>
We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features.  Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term.  In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l_1/l_2-norm and the l_1/l_&#8734;-norm.  We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved.
</description>
</item>

<item>
<title>
Feature Selection via Dependence Maximization; Le Song, Alex Smola, Arthur Gretton, Justin Bedo, Karsten Borgwardt; 13(May):1393--1434, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/song12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/song12a.html
</link>
<description>
We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels.  Our approach leads to a greedy procedure for feature selection.  We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.
</description>
</item>

<item>
<title>
On Ranking and Generalization Bounds; Wojciech Rejchel; 13(May):1373--1392, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/rejchel12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/rejchel12a.html
</link>
<description>
The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are faster than 1/&#8730;n. We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of the considered estimators on real data sets.
</description>
</item>

<item>
<title>
Transfer in Reinforcement Learning via Shared Features; George Konidaris, Ilya Scheidwasser, Andrew Barto; 13(May):1333--1371, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/konidaris12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/konidaris12a.html
</link>
<description>
We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show
</description>
</item>

<item>
<title>
Query Strategies for Evading Convex-Inducing Classifiers; Blaine Nelson, Benjamin I. P. Rubinstein, Ling Huang, Anthony D. Joseph, Steven J. Lee, Satish Rao, J. D. Tygar; 13(May):1293--1332, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/nelson12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/nelson12a.html
</link>
<description>
Classifiers are often used to detect miscreant activities. We study how an adversary can systematically query a classifier to elicit information that allows the attacker to evade detection while incurring a near-minimal cost of modifying their intended malfeasance. We generalize the theory of Lowd and Meek (2005) to the family of convex-inducing classifiers that partition their feature space into two sets, one of which is convex. We present query algorithms for this family that construct undetected instances of approximately minimal cost using only polynomially-many queries in the dimension of the space and in the level of approximation. Our results demonstrate that near-optimal evasion can
</description>
</item>

<item>
<title>
Minimax Manifold Estimation; Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, Larry Wasserman; 13(May):1263--1291, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/genovese12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/genovese12a.html
</link>
<description>
We find the minimax rate of convergence in Hausdorff distance for estimating a manifold M of dimension d embedded in &#8477;^D given a noisy sample from the manifold.  Under certain conditions, we show that the optimal rate of convergence is n^(-2/(2+d)).  Thus, the minimax rate depends only on the dimension of the manifold, not on the dimension of the space in which M is embedded.
</description>
</item>

<item>
<title>
A Geometric Approach to Sample Compression; Benjamin I.P. Rubinstein, J. Hyam Rubinstein; 13(Apr):1221--1261, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/rubinstein12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/rubinstein12a.html
</link>
<description>
The Sample Compression Conjecture of Littlestone &amp; Warmuth has remained unsolved for a quarter century. While maximum classes (concept classes meeting Sauer's Lemma with equality) can be compressed, the compression of general concept classes reduces to compressing maximal classes (classes that cannot be expanded without increasing VC dimension). Two promising ways forward are: embedding maximal classes into maximum classes with at most a polynomial increase to VC dimension, and compression via operating on geometric representations. This paper presents positive results on the latter approach and a first negative result on the former, through a systematic investigation of finite maximum
</description>
</item>

<item>
<title>
A Multi-Stage Framework for Dantzig Selector and LASSO; Ji Liu, Peter Wonka, Jieping Ye; 13(Apr):1189--1219, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/liu12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/liu12a.html
</link>
<description>
We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X &#8712; &#8477;^(n &#215; m) (m &gt;&gt; n) and a noisy observation vector y &#8712; &#8477;^n satisfying y = X&#946;^* + &#949;, where &#949; is the noise vector following a Gaussian distribution N(0,&#963;^2I), how to recover the signal (or parameter vector) &#946;^* when the signal is sparse?  The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal &#946;^*. We show that if X obeys a certain condition, then with a large probability the difference
</description>
</item>

<item>
<title>
Hope and Fear for Discriminative Training of Statistical Translation Models; David Chiang; 13(Apr):1159--1187, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/chiang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/chiang12a.html
</link>
<description>
In machine translation, discriminative models have almost entirely supplanted the classical noisy-channel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured
</description>
</item>

<item>
<title>
Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies; Ioannis Tsamardinos, Sofia Triantafillou, Vincenzo Lagani; 13(Apr):1097--1157, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/tsamardinos12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/tsamardinos12a.html
</link>
<description>
We present methods able to predict the presence and strength of conditional and unconditional dependencies (correlations) between two variables Y and Z never jointly measured on the same samples, based on multiple data sets measuring a set of common variables. The algorithms are specializations of prior work on learning causal structures from overlapping variable sets.  This problem has also been addressed in the field of statistical matching. The proposed methods are applied to a wide range of domains and are shown to accurately predict the presence of thousands of dependencies. Compared against prototypical statistical matching algorithms and within the scope of our experiments, the proposed
</description>
</item>

<item>
<title>
Analysis of a Random Forests Model; G&#233;rard Biau; 13(Apr):1063--1095, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/biau12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/biau12a.html
</link>
<description>
Random forests are a scheme proposed by Leo Breiman in the 2000s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm.  In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and
</description>
</item>

<item>
<title>
The huge Package for High-dimensional Undirected Graph Estimation in R; Tuo Zhao, Han Liu, Kathryn Roeder, John Lafferty, Larry Wasserman; 13(Apr):1059--1062, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhao12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhao12a.html
</link>
<description>
We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data.  This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010).  Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation
</description>
</item>

<item>
<title>
Consistent Model Selection Criteria on High Dimensions; Yongdai Kim, Sunghoon Kwon, Hosik Choi; 13(Apr):1037--1057, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kim12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kim12a.html
</link>
<description>
Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided.  Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently.  Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances
</description>
</item>

<item>
<title>
Positive Semidefinite Metric Learning Using Boosting-like Algorithms; Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel; 13(Apr):1007--1036, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/shen12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/shen12a.html
</link>
<description>
The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data.  It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance.  In this work, we propose a boosting-based technique, termed BOOSTMETRIC, for learning a quadratic Mahalanobis distance metric.  Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite.  Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement.
</description>
</item>

<item>
<title>
Sampling Methods for the Nystr&#246;m Method; Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar; 13(Apr):981--1006, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kumar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kumar12a.html
</link>
<description>
The Nystr&#246;m method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications.  A key aspect of this method is the procedure according to which columns are sampled from the original matrix.  In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes.  We also propose a family of ensemble-based sampling algorithms for the Nystr&#246;m method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nystr&#246;m method when used in conjunction with
</description>
</item>

<item>
<title>
Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features; Gil Tahan, Lior Rokach, Yuval Shahar; 13(Apr):949--979, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/tahan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/tahan12a.html
</link>
<description>
This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the
</description>
</item>

<item>
<title>
Stability of Density-Based Clustering; Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman; 13(Apr):905--948, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/rinaldo12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/rinaldo12a.html
</link>
<description>
High density clusters can be characterized by the connected components of a level set L(&#955;) = {x: p(x)>&#955;} of the underlying probability density function p generating the data, at some appropriate level &#955; &#8805; 0. The complete hierarchical clustering can be characterized by a cluster tree T= &#8746;_&#955;L(&#955;).  In this paper, we study the behavior of a density level set estimate  L&#770;(&#955;) and cluster tree estimate T&#770; based on a kernel density estimator with kernel bandwidth h. We define two notions of instability to measure the variability of L&#770;(&#955;) and T&#770; as a function of h, and investigate the theoretical properties of these instability measures.
</description>
</item>

<item>
<title>
Algebraic Geometric Comparison of Probability Distributions; Franz J. Kir&#225;ly, Paul von B&#252;nau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert M&#252;ller; 13(Mar):855--903, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/kiraly12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/kiraly12a.html
</link>
<description>
We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof
</description>
</item>

<item>
<title>
NIMFA: A Python Library for Nonnegative Matrix Factorization; Marinka &#381;itnik, Bla&#382; Zupan; 13(Mar):849--853, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zitnik12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zitnik12a.html
</link>
<description>
NIMFA is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. It supports both dense and sparse matrix representation. NIMFA's component-based implementation and hierarchical design should help the users to employ already implemented techniques or design and code new strategies for matrix factorization tasks.
</description>
</item>

<item>
<title>
Causal Bounds and Observable Constraints for Non-deterministic Models; Roland R. Ramsahai; 13(Mar):829--848, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ramsahai12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ramsahai12a.html
</link>
<description>
Conditional independence relations involving latent variables do not necessarily imply observable independences. They may imply inequality constraints on observable parameters and causal bounds, which can be used for falsification and identification. The literature on computing such constraints often involves a deterministic underlying data generating process in a counterfactual framework. If an analyst is ignorant of the nature of the underlying mechanisms then they may wish to use a model which allows the underlying mechanisms to be probabilistic. A method of computation for a weaker model without any determinism is given here and demonstrated for the instrumental variable model,
</description>
</item>

<item>
<title>
Algorithms for Learning Kernels Based on Centered Alignment; Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh; 13(Mar):795--828, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/cortes12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/cortes12a.html
</link>
<description>
This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression.  Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe
</description>
</item>

<item>
<title>
Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso; Rahul Mazumder, Trevor Hastie; 13(Mar):781--794, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/mazumder12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/mazumder12a.html
</link>
<description>
We consider the sparse inverse covariance regularization problem or graphical lasso with regularization  parameter &#955;.  Suppose the sample covariance graph formed by thresholding the entries of the sample covariance matrix at &#955; is decomposed into connected components.  We show that the vertex-partition induced by the connected components of the thresholded sample covariance graph (at &#955;) is exactly equal to that induced by the connected components of the estimated concentration graph, obtained by solving the graphical lasso problem for the same &#955;.  This characterizes a very interesting property of a path of graphical lasso solutions.  Furthermore, this simple rule, when used
</description>
</item>

<item>
<title>
GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression; Chiwoo Park, Jianhua Z. Huang, Yu Ding; 13(Mar):775--779, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/park12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/park12a.html
</link>
<description>
This paper presents getting-started-style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The paper describes the working environment and the usage of the software package.
</description>
</item>

<item>
<title>
A Kernel Two-Sample Test; Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch&#246;lkopf, Alexander Smola; 13(Mar):723--773, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/gretton12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/gretton12a.html
</link>
<description>
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions.  Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD).  We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic.  The MMD can be computed in quadratic time, although efficient linear time approximations are available.  Our statistic is an instance of an integral probability metric, and
</description>
</item>

<item>
<title>
A Case Study on Meta-Generalising: A Gaussian Processes Approach; Grigorios Skolidis, Guido Sanguinetti; 13(Mar):691--721, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/skolidis12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/skolidis12a.html
</link>
<description>
We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it.
</description>
</item>

<item>
<title>
Structured Sparsity and Generalization; Andreas Maurer, Massimiliano Pontil; 13(Mar):671--690, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/maurer12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/maurer12a.html
</link>
<description>
We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels.
</description>
</item>

<item>
<title>
Learning Algorithms for the Classification Restricted Boltzmann Machine; Hugo Larochelle, Michael Mandel, Razvan Pascanu, Yoshua Bengio; 13(Mar):643--669, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/larochelle12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/larochelle12a.html
</link>
<description>
Recent developments have demonstrated the capacity of restricted Boltzmann machines (RBM) to be powerful generative models, able to extract useful features from input data or construct deep artificial neural networks. In such settings, the RBM only yields a preprocessing or an initialization for some other model, instead of acting as a complete supervised model in its own right. In this paper, we argue that RBMs can provide a self-contained framework for developing competitive classifiers. We study the Classification RBM (ClassRBM), a variant on the RBM adapted to the classification setting. We study different strategies for training the ClassRBM and show that competitive classification
</description>
</item>

<item>
<title>
Non-Sparse Multiple Kernel Fisher Discriminant Analysis; Fei Yan, Josef Kittler, Krystian Mikolajczyk, Atif Tahir; 13(Mar):607--642, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/yan12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/yan12a.html
</link>
<description>
Sparsity-inducing multiple kernel Fisher discriminant analysis (MK-FDA) has been studied in the literature. Building on recent advances in non-sparse multiple kernel learning (MKL), we propose a non-sparse version of MK-FDA, which imposes a general l_p norm regularisation on the kernel weights. We formulate the associated optimisation problem as a semi-infinite program (SIP), and adapt an iterative wrapper algorithm to solve it. We then discuss, in light of the latest advances in MKL optimisation techniques, several reformulations and optimisation strategies that can potentially lead to significant improvements in the efficiency and scalability of MK-FDA. We carry out extensive experiments on
</description>
</item>

<item>
<title>
A Primal-Dual Convergence Analysis of Boosting; Matus Telgarsky; 13(Mar):561--606, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/telgarsky12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/telgarsky12a.html
</link>
<description>
Boosting combines weak learners into a predictor with low empirical risk.  Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated.  This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: &#8226; Weak learnability aids the whole loss family: for any &#949; > 0, O(ln(1/&#949;)) iterations suffice to produce a predictor with empirical risk &#949;-close to the infimum; &#8226; The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of
</description>
</item>

<item>
<title>
ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel; Stephen R. Piccolo, Lewis J. Frey; 13(Mar):555--559, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/piccolo12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/piccolo12a.html
</link>
<description>
Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple
</description>
</item>

<item>
<title>
MULTIBOOST: A Multi-purpose Boosting Package; Djalel Benbouzid, R&#243;bert Busa-Fekete, Norman Casagrande, Fran&#231;ois-David Collin, Bal&#225;zs K&#233;gl; 13(Mar):549--553, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/benbouzid12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/benbouzid12a.html
</link>
<description>
The MULTIBOOST package provides a fast C++ implementation of multi-class/multi-label/multi-task boosting algorithms. It is based on ADABOOST.MH but it also implements popular cascade classifiers and FILTERBOOST. The package contains common multi-class base learners (stumps, trees, products, Haar filters). Further base learners and strong learners following the boosting paradigm can be easily implemented in a flexible framework.
</description>
</item>

<item>
<title>
Metric and Kernel Learning Using a Linear Transformation; Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon; 13(Mar):519--547, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/jain12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/jain12a.html
</link>
<description>
Metric and kernel learning arise in several machine learning applications.  However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points.  In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem.  In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework---that of minimizing the LogDet divergence subject
</description>
</item>

<item>
<title>
Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks; Vikas C. Raykar, Shipeng Yu; 13(Feb):491--518, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/raykar12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/raykar12a.html
</link>
<description>
With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time.  Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, defined as annotators who assign labels randomly without actually looking at the instance.  Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the final consensus labels.  In this paper we propose an empirical
</description>
</item>

<item>
<title>
Multi-Assignment Clustering for Boolean Data; Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann; 13(Feb):459--489, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/frank12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/frank12a.html
</link>
<description>
We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the  individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily
</description>
</item>

<item>
<title>
Online Learning in the Embedded Manifold of Low-rank Matrices; Uri Shalit, Daphna Weinshall, Gal Chechik; 13(Feb):429--458, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/shalit12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/shalit12a.html
</link>
<description>
When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction
</description>
</item>

<item>
<title>
Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming; Garvesh Raskutti, Martin J. Wainwright, Bin Yu; 13(Feb):389--427, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/raskutti12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/raskutti12a.html
</link>
<description>
Sparse additive models are families of d-variate functions with the additive decomposition f^* = &#8721;_{j &#8712; S} f^*_j, where S is an unknown subset of cardinality s &lt;&lt; d. In this paper, we consider the case where each univariate component function f^*_j lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f^* based on kernels combined with l_1-type convex regularization.  Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L^2(P) and L^2(P_n) norms over the class F_{d,s,H} of sparse additive models with each univariate function
</description>
</item>

<item>
<title>
Bounding the Probability of Error for High Precision Optical Character Recognition; Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller; 13(Feb):363--387, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/huang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/huang12a.html
</link>
<description>
We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently.  Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data.  While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant
</description>
</item>

<item>
<title>
Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics; Michael U. Gutmann, Aapo Hyv&#228;rinen; 13(Feb):307--361, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/gutmann12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/gutmann12a.html
</link>
<description>
We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation cannot then be used without resorting to numerical approximations
</description>
</item>

<item>
<title>
Random Search for Hyper-Parameter Optimization; James Bergstra, Yoshua Bengio; 13(Feb):281--305, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/bergstra12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/bergstra12a.html
</link>
<description>
Grid search and manual search are the most widely used strategies for hyper-parameter optimization.  This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid.  Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks.  Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.  Granting random search the same computational budget, random search
</description>
</item>

<item>
<title>
Active Learning via Perfect Selective Classification; Ran El-Yaniv, Yair Wiener; 13(Feb):255--279, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/el-yaniv12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/el-yaniv12a.html
</link>
<description>
We discover a strong relation between two known learning models: stream-based active learning and perfect selective classification (an extreme case of 'classification with a reject option').  For these models, restricted to the realizable case, we show a reduction of active learning to selective classification that preserves fast rates.  Applying this reduction to recent results for selective classification, we derive exponential target-independent label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high dimensional mixture of Gaussians. Finally, we study the relation between the proposed technique and existing
</description>
</item>

<item>
<title>
Multi Kernel Learning with Online-Batch Optimization; Francesco Orabona, Luo Jie, Barbara Caputo; 13(Feb):227--253, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/orabona12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/orabona12a.html
</link>
<description>
In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach.  Here we present an MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem.  We prove theoretically
</description>
</item>

<item>
<title>
Active Clustering of Biological Sequences; Konstantin Voevodski, Maria-Florina Balcan, Heiko R&#246;glin, Shang-Hua Teng, Yu Xia; 13(Jan):203--225, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/voevodski12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/voevodski12a.html
</link>
<description>
Given a point set S and an unknown metric d on S, we study the problem of efficiently partitioning S into k clusters while querying few distances between the points.  In our model we assume that we have access to one-versus-all queries that, given a point s &#8712; S, return the distances between s and all other points.  We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only O(k) distance queries.  Our algorithm uses an active selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering.  We use our procedure
</description>
</item>

<item>
<title>
Optimal Distributed Online Prediction Using Mini-Batches; Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao; 13(Jan):165--202, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/dekel12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/dekel12a.html
</link>
<description>
Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms.  We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed
</description>
</item>

<item>
<title>
An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity; Nir Ailon; 13(Jan):137--164, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ailon12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ailon12a.html
</link>
<description>
Given a set V of n elements, we wish to linearly order them given pairwise preference labels, which may be non-transitive (due to irrationality or arbitrary noise).  The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible.  Our performance is measured by two parameters: the number of disagreements (loss) and the query complexity (number of pairwise preference labels).  Our algorithm adaptively queries at most O(&#949;^{-6} n log^5 n) preference labels for a regret of &#949; times the optimal loss.  As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem.
</description>
</item>

<item>
<title>
Refinement of Operator-valued Reproducing Kernels; Haizhang Zhang, Yuesheng Xu, Qinghui Zhang; 13(Jan):91--136, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/zhang12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/zhang12a.html
</link>
<description>
This paper studies the construction of a refinement kernel for a given operator-valued reproducing kernel such that the vector-valued reproducing kernel Hilbert space of the refinement kernel contains that of the given kernel as a subspace. The study is motivated by the need to update the current operator-valued reproducing kernel in multi-task learning when underfitting or overfitting occurs. Numerical simulations confirm that the established refinement kernel method is able to meet this need.  Various characterizations are provided based on feature maps and vector-valued integral representations of operator-valued reproducing kernels. Concrete examples of refining translation invariant
</description>
</item>

<item>
<title>
Plug-in Approach to Active Learning; Stanislav Minsker; 13(Jan):67--90, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/minsker12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/minsker12a.html
</link>
<description>
We present a new active learning algorithm based on nonparametric estimators of the regression function.  Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by the proposed method over a broad class of underlying distributions.  We also prove minimax lower bounds which show that the obtained rates are almost tight.
</description>
</item>

<item>
<title>
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection; Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luj&#225;n; 13(Jan):27--66, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/brown12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/brown12a.html
</link>
<description>
We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation.  This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?".  To answer this, we adopt a different strategy than is usual in the feature selection literature--instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels.  While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy',
</description>
</item>

<item>
<title>
Distance Metric Learning with Eigenvalue Optimization; Yiming Ying, Peng Li; 13(Jan):1--26, 2012.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v13/ying12a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v13/ying12a.html
</link>
<description>
The main theme of this paper is to develop a novel eigenvalue optimization framework for learning a Mahalanobis metric.  Within this context, we introduce a novel metric learning approach called DML-eig  which is shown to be equivalent to  a well-known eigenvalue optimization problem called minimizing the maximal eigenvalue of a symmetric matrix (Overton, 1988; Lewis and Overton, 1996).  Moreover, we formulate LMNN (Weinberger et al., 2005), one of the state-of-the-art metric learning methods, as a similar eigenvalue optimization problem. This novel framework not only provides new insights into metric learning but also opens new avenues  to the design of efficient metric learning algorithms.
</description>
</item>

<item>
<title>
Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms; Beno&#238;t Patra; 12(Dec):3431--3466, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/patra11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/patra11a.html
</link>
<description>
Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An in-depth analysis of
</description>
</item>

<item>
<title>
A Simpler Approach to Matrix Completion; Benjamin Recht; 12(Dec):3413--3430, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/recht11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/recht11a.html
</link>
<description>
This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix.  These results improve on prior work by Cand&#232;s and Recht (2009), Cand&#232;s and Tao (2009), and Keshavan et al. (2009).  The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition.  The proof of this assertion is short, self-contained,
</description>
</item>

<item>
<title>
Learning with Structured Sparsity; Junzhou Huang, Tong Zhang, Dimitris Metaxas; 12(Nov):3371--3412, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/huang11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/huang11b.html
</link>
<description>
This paper investigates a learning formulation called  structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing.  By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years.  A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure.  It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization.
</description>
</item>

<item>
<title>
Semi-Supervised Learning with Measure Propagation; Amarnag Subramanya, Jeff Bilmes; 12(Nov):3311--3370, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/subramanya11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/subramanya11a.html
</link>
<description>
We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets; for example, in one instance we use a data set with over 10^8 samples.  In this context, we propose a graph node ordering algorithm that is
</description>
</item>

<item>
<title>
An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models; Piotr Zwiernik; 12(Nov):3283--3310, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zwiernik11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zwiernik11a.html
</link>
<description>
The standard Bayesian Information Criterion (BIC) is derived under regularity conditions which are not always satisfied in the case of graphical models with hidden variables. In this paper we derive the BIC for the binary graphical tree models where all the inner nodes of a tree represent binary hidden variables. This provides an extension of a similar formula given by Rusakov and Geiger for naive Bayes models. The main tool used in this paper is the connection between the growth behavior of marginal likelihood integrals and the real log-canonical threshold.
</description>
</item>

<item>
<title>
The Sample Complexity of Dictionary Learning; Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein; 12(Nov):3259--3281, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vainsencher11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vainsencher11a.html
</link>
<description>
A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary.  Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples?  We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective.  We develop generalization
</description>
</item>

<item>
<title>
Robust Gaussian Process Regression with a Student-t Likelihood; Pasi Jyl&#228;nki, Jarno Vanhatalo, Aki Vehtari; 12(Nov):3227--3257, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jylanki11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jylanki11a.html
</link>
<description>
This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference, which is why several approximate methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies, but the convergence of EP is known to be problematic with models containing non-log-concave site functions.  In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We
</description>
</item>

<item>
<title>
Group Lasso Estimation of High-dimensional Covariance Matrices; J&#233;r&#233;mie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Mu&#241;iz-Alvarez; 12(Nov):3187--3225, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/bigot11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/bigot11a.html
</link>
<description>
In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a high-dimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional
</description>
</item>

<item>
<title>
Adaptive Exact Inference in Graphical Models; &#214;zg&#252;r S&#252;mer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu; 12(Nov):3147--3186, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/sumer11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/sumer11a.html
</link>
<description>
Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch.  In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the
</description>
</item>

<item>
<title>
Unsupervised Supervised Learning II: Margin-Based Classification Without Labels; Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon; 12(Nov):3119--3145, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/balasubramanian11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/balasubramanian11a.html
</link>
<description>
Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever.
</description>
</item>

<item>
<title>
Efficient and Effective Visual Codebook Generation Using Additive Kernels; Jianxin Wu, Wei-Chian Tan, James M. Rehg; 12(Nov):3097--3118, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wu11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wu11b.html
</link>
<description>
Common visual codebook generation methods used in a bag of visual words model, for example, k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. In this paper, we demonstrate that HIK can be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as the standard k-means.
</description>
</item>

<item>
<title>
In All Likelihood, Deep Belief Is Not Enough; Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge; 12(Nov):3071--3096, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/theis11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/theis11a.html
</link>
<description>
Statistical models of natural images provide an important tool for researchers in the fields of machine learning and computational neuroscience. The canonical measure to quantitatively assess and compare the performance of statistical models is given by the likelihood. One class of statistical models which has recently gained increasing popularity and has been applied to a variety of complex data is formed by deep belief networks. Analyses of these models, however, have often been limited to qualitative analyses based on samples due to the computationally intractable nature of their likelihood. Motivated by these circumstances, the present article introduces a consistent estimator for the
</description>
</item>

<item>
<title>
The Stationary Subspace Analysis Toolbox; Jan Saputra M&#252;ller, Paul von B&#252;nau, Frank C. Meinecke, Franz J. Kir&#225;ly, Klaus-Robert M&#252;ller; 12(Oct):3065--3069, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mueller11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mueller11a.html
</link>
<description>
The Stationary Subspace Analysis (SSA) algorithm linearly factorizes a high-dimensional time series into stationary and non-stationary components.  The SSA Toolbox is a platform-independent, efficient, stand-alone implementation of the SSA algorithm with a graphical user interface written in Java, which can also be invoked from the command line and from Matlab. The graphical interface guides the user through the whole process; data can be imported from and exported to comma-separated values (CSV) and Matlab's .mat files.
</description>
</item>

<item>
<title>
Robust Approximate Bilinear Programming for Value Function Approximation; Marek Petrik, Shlomo Zilberstein; 12(Oct):3027--3063, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/petrik11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/petrik11a.html
</link>
<description>
Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs.
</description>
</item>

<item>
<title>
High-dimensional Covariance Estimation Based On Gaussian Graphical Models; Shuheng Zhou, Philipp R&#252;timann, Min Xu, Peter B&#252;hlmann; 12(Oct):2975--3026, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhou11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhou11a.html
</link>
<description>
Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using l_1-penalization methods.  We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many l_1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator.  We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and
</description>
</item>

<item>
<title>
Hierarchical Knowledge Gradient for Sequential Sampling; Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier; 12(Oct):2931--2974, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mes11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mes11a.html
</link>
<description>
We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multi-dimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation
</description>
</item>

<item>
<title>
On Equivalence Relationships Between Classification and Ranking Algorithms; &#350;eyda Ertekin, Cynthia Rudin; 12(Oct):2905--2929, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ertekin11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ertekin11a.html
</link>
<description>
We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically,
</description>
</item>

<item>
<title>
Convergence Rates of Efficient Global Optimization Algorithms; Adam D. Bull; 12(Oct):2879--2904, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/bull11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/bull11a.html
</link>
<description>
In the efficient global optimization problem, we minimize an unknown function f, using as few observations f(x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret.  Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour.  Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS).  When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS.  We provide convergence
</description>
</item>

<item>
<title>
Efficient Learning with Partially Observed Attributes; Nicol&#242; Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir; 12(Oct):2857--2878, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/cesa-bianchi11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/cesa-bianchi11a.html
</link>
<description>
We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the "local budget" setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier "global budget" setting, where it is only the overall number of accessible
</description>
</item>

<item>
<title>
Neyman-Pearson Classification, Convexity and Stochastic Constraints; Philippe Rigollet, Xin Tong; 12(Oct):2831--2855, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/rigollet11a.html
</link>
<description>
Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss &#966;. Given a finite collection of classifiers, we combine them and obtain a new classifier that simultaneously satisfies the following two properties with high probability: (i) its &#966;-type I error is below a pre-specified level and (ii) it has &#966;-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to
</description>
</item>

<item>
<title>
Scikit-learn: Machine Learning in Python; Fabian Pedregosa, Ga&#235;l Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, &#201;douard Duchesnay; 12(Oct):2825--2830, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
</link>
<description>
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language.  Emphasis is put on ease of use, performance, documentation, and API consistency.  It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings.  Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
</description>
</item>

<item>
<title>
Structured Variable Selection with Sparsity-Inducing Norms; Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach; 12(Oct):2777--2824, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jenatton11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jenatton11b.html
</link>
<description>
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual l_1-norm and the group l_1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero
</description>
</item>

<item>
<title>
Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes; Elias Zavitsanos, Georgios Paliouras, George A. Vouros; 12(Oct):2749--2775, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zavitsanos11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zavitsanos11a.html
</link>
<description>
This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model
</description>
</item>

<item>
<title>
Large Margin Hierarchical Classification with Mutually Exclusive Class Membership; Huixin Wang, Xiaotong Shen, Wei Pan; 12(Sep):2721--2748, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wang11c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wang11c.html
</link>
<description>
In hierarchical classification, class labels are structured, that is, each label value corresponds to one non-root node in a tree, where the inter-class relationship for classification is specified by directed paths of the tree. In such a situation, the focus has been on how to leverage the inter-class relationship to enhance the performance of flat classification, which ignores such dependency.  This is critical when the number of classes becomes large relative to the sample size. This paper considers single-path or partial-path hierarchical classification, where only one path is permitted from the root to a leaf node. A large margin method is introduced based on a new concept of generalized
</description>
</item>

<item>
<title>
Convex and Network Flow Optimization for Structured Sparsity; Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach; 12(Sep):2681--2720, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mairal11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mairal11a.html
</link>
<description>
We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of l_2- or l_&#8734;-norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups.  To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of l_&#8734;-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods.  On the other hand, we use proximal splitting techniques,
</description>
</item>

<item>
<title>
Bayesian Co-Training; Shipeng Yu, Balaji Krishnapuram, R&#243;mer Rosales, R. Bharat Rao; 12(Sep):2649--2680, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/yu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/yu11a.html
</link>
<description>
Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of model are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for
</description>
</item>

<item>
<title>
Theoretical Analysis of Bayesian Matrix Factorization; Shinichi Nakajima, Masashi Sugiyama; 12(Sep):2583--2648, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/nakajima11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/nakajima11a.html
</link>
<description>
Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments.  In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method.  Through finite-sample analysis of the VBMF estimator, we show that two types of shrinkage factors exist in the VBMF estimator: the positive-part James-Stein (PJS) shrinkage and the trace-norm shrinkage, both acting on each singular component separately for producing low-rank solutions.  The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach.  Thus, no trace-norm shrinkage
</description>
</item>

<item>
<title>
Kernel Analysis of Deep Networks; Gr&#233;goire Montavon, Mikio L. Braun, Klaus-Robert M&#252;ller; 12(Sep):2563--2581, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/montavon11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/montavon11a.html
</link>
<description>
When training deep networks it is common knowledge that an efficient and well generalizing representation of the problem is formed. In this paper we aim to elucidate what makes the emerging representation successful. We analyze the layer-wise evolution of the representation in a deep network by building a sequence of deeper and deeper kernels that subsume the mapping performed by more and more layers of the deep network and measuring how these increasingly complex kernels fit the learning problem. We observe that deep networks create increasingly better representations of the learning problem and that the structure of the deep network controls how fast the representation of the task is formed
</description>
</item>

<item>
<title>
Weisfeiler-Lehman Graph Kernels; Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt; 12(Sep):2539--2561, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/shervashidze11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/shervashidze11a.html
</link>
<description>
In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence.  In our experimental evaluation, our kernels outperform state-of-the-art graph
</description>
</item>

<item>
<title>
Natural Language Processing (Almost) from Scratch; Ronan Collobert, Jason Weston, L&#233;on Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa; 12(Aug):2493--2537, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/collobert11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/collobert11a.html
</link>
<description>
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling.  This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge.  Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data.  This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
</description>
</item>

<item>
<title>
LPmade: Link Prediction Made Easy; Ryan N. Lichtenwalter, Nitesh V. Chawla; 12(Aug):2489--2492, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/lichtenwalter11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/lichtenwalter11a.html
</link>
<description>
LPmade is a complete cross-platform software solution for multi-core link prediction and related tasks and analysis. Its first principal contribution is a scalable network library supporting high-performance implementations of the most commonly employed unsupervised link prediction methods. Link prediction in longitudinal data requires a sophisticated and disciplined procedure for correct results and fair evaluation, so the second principal contribution of LPmade is a sophisticated GNU make architecture that completely automates link prediction, prediction evaluation, and network analysis. Finally, LPmade streamlines and automates the procedure for creating multivariate supervised link prediction
</description>
</item>

<item>
<title>
Distance Dependent Chinese Restaurant Processes; David M. Blei, Peter I. Frazier; 12(Aug):2461--2488, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/blei11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/blei11a.html
</link>
<description>
We develop the distance dependent Chinese restaurant process, a flexible class of distributions over partitions that allows for dependencies between the elements.  This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies arising from time, space, and network connectivity.  We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both fully observed and latent mixture settings.  We study its empirical performance with three text corpora.  We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide
</description>
</item>

<item>
<title>
Parallel Algorithm for Learning Optimal Bayesian Network Structure; Yoshinori Tamada, Seiya Imoto, Satoru Miyano; 12(Jul):2437--2459, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tamada11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tamada11a.html
</link>
<description>
We present a parallel algorithm for the score-based optimal structure search of Bayesian networks.  This algorithm is based on a dynamic programming (DP) algorithm having O(n &#8901; 2^n) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes.  The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes.  While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(n^&#963;) time and space overhead calculations,
</description>
</item>

<item>
<title>
Union Support Recovery in Multi-task Learning; Mladen Kolar, John Lafferty, Larry Wasserman; 12(Jul):2415--2435, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/kolar11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/kolar11a.html
</link>
<description>
We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting.  Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis.  A clearer and simpler picture emerges by studying the Normal means model.  This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures.
</description>
</item>

<item>
<title>
MULAN: A Java Library for Multi-Label Learning; Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas; 12(Jul):2411--2414, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tsoumakas11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tsoumakas11a.html
</link>
<description>
MULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures.
</description>
</item>

<item>
<title>
Universality, Characteristic Kernels and RKHS Embedding of Measures; Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet; 12(Jul):2389--2410, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/sriperumbudur11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/sriperumbudur11a.html
</link>
<description>
Over the last few years, two different notions of positive definite (pd) kernels---universal and characteristic---have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding
</description>
</item>

<item>
<title>
<i>Waffles</i>: A Machine Learning Toolkit; Michael Gashler; 12(Jul):2383--2387, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gashler11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gashler11a.html
</link>
<description>
We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License.
</description>
</item>

<item>
<title>
Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models; Sharon Goldwater, Thomas L. Griffiths, Mark Johnson; 12(Jul):2335--2382, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/goldwater11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/goldwater11a.html
</link>
<description>
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework.  We discuss two stochastic
</description>
</item>

<item>
<title>
Proximal Methods for Hierarchical Sparse Coding; Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach; 12(Jul):2297--2334, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jenatton11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jenatton11a.html
</link>
<description>
Sparse coding consists of representing signals as sparse linear combinations of atoms selected from a dictionary.  We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them.  More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators.  Our
</description>
</item>

<item>
<title>
MSVMpack: A Multi-Class Support Vector Machine Package; Fabien Lauer, Yann Guermeur; 12(Jul):2293--2296, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/lauer11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/lauer11a.html
</link>
<description>
This paper describes MSVMpack, an open source software package dedicated to our generic model of multi-class support vector machine.  All four multi-class support vector machines (M-SVMs) proposed so far in the literature appear as instances of this model. MSVMpack provides for them the first unified implementation and offers a convenient basis to develop other instances.  This is also the first parallel implementation for M-SVMs.  The package consists of a set of command-line tools with a callable library.  The documentation includes a tutorial, a user's guide and a developer's guide.
</description>
</item>

<item>
<title>
Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning; Liwei Wang; 12(Jul):2269--2292, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wang11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wang11b.html
</link>
<description>
We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases in which active learning is very useful, it is also easy to construct examples in which no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and
</description>
</item>

<item>
<title>
Multiple Kernel Learning Algorithms; Mehmet G&#246;nen, Ethem Alpayd&#305;n; 12(Jul):2211--2268, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gonen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gonen11a.html
</link>
<description>
In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there are differences between them in complexity as
</description>
</item>

<item>
<title>
Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood; Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllym&#228;ki; 12(Jul):2181--2210, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/carvalho11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/carvalho11a.html
</link>
<description>
We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f&#770;CLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion.  The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion,
</description>
</item>

<item>
<title>
On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem; Daniil Ryabko; 12(Jul):2161--2180, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ryabko11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ryabko11a.html
</link>
<description>
A sequence x_1,...,x_n,... of discrete-valued observations is generated according to some unknown probabilistic law (measure) &#956;.  After observing each outcome, one is required to give  conditional probabilities of the next observation.  The realizable case is when the  measure  &#956; belongs to an arbitrary but known class C  of  process measures.  The non-realizable case is when &#956; is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures.  We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions.
</description>
</item>

<item>
<title>
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization; John Duchi, Elad Hazan, Yoram Singer; 12(Jul):2121--2159, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/duchi11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/duchi11a.html
</link>
<description>
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as
</description>
</item>

<item>
<title>
Information Rates of Nonparametric Gaussian Process Methods; Aad van der Vaart, Harry van Zanten; 12(Jun):2095--2119, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vandervaart11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vandervaart11a.html
</link>
<description>
We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior.  We illustrate the computation of the upper bound for the Mat&#233;rn  and squared exponential kernels.  For these priors the risk, and hence the information criterion, tends to zero for all continuous response functions. However, the rate at which this happens
</description>
</item>

<item>
<title>
Exploiting Best-Match Equations for Efficient Reinforcement Learning; Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering; 12(Jun):2045--2094, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vanseijen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vanseijen11a.html
</link>
<description>
This article presents and evaluates best-match learning, a new approach to reinforcement learning that  trades off the sample efficiency of model-based methods with the space efficiency of model-free methods.  Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model.  We prove that, unlike regular sparse model-based methods, best-match learning is guaranteed to converge to the optimal Q-values in the tabular case.  Empirical results demonstrate that best-match learning can substantially outperform regular sparse model-based methods, as well as several
</description>
</item>

<item>
<title>
A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis; Trine Julie Abrahamsen, Lars Kai Hansen; 12(Jun):2027--2044, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/abrahamsen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/abrahamsen11a.html
</link>
<description>
Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions: first, we propose a computationally less intensive approximate leave-one-out estimator; second, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA. As for PCA, our analysis also suggests a simplified approximate expression.
</description>
</item>

<item>
<title>
The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets; Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, Christian Buchta; 12(Jun):2021--2025, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/hahsler11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/hahsler11a.html
</link>
<description>
This paper describes the ecosystem of R add-on packages developed around the infrastructure provided by the package arules. The packages provide comprehensive functionality for analyzing interesting patterns including frequent itemsets, association rules, frequent sequences and for building applications like associative classification. After discussing the ecosystem's design we illustrate the ease of mining and visualizing rules with a short example.
</description>
</item>

<item>
<title>
Generalized TD Learning; Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii; 12(Jun):1977--2020, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ueno11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ueno11a.html
</link>
<description>
Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed.  Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation.  To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation.  This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both batch and online learning procedures for the value function estimation in a unified way in terms
</description>
</item>

<item>
<title>
Kernel Regression in the Presence of Correlated Errors; Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor; 12(Jun):1955--1976, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/debrabanter11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/debrabanter11a.html
</link>
<description>
It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exists a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated, which is in contrast to the independent and identically distributed
</description>
</item>

<item>
<title>
Dirichlet Process Mixtures of Generalized Linear Models; Lauren A. Hannah, David M. Blei, Warren B. Powell; 12(Jun):1923--1953, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/hannah11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/hannah11a.html
</link>
<description>
We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression.  Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models.  DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model.  We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models.  We give conditions for weak consistency of the joint distribution and pointwise consistency of
</description>
</item>

<item>
<title>
Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms; Vianney Perchet; 12(Jun):1893--1921, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/perchet11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/perchet11a.html
</link>
<description>
We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but receives instead random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved on average by using any other fixed law.  They are based on a generalization of calibration, no longer defined in terms of a Vorono&#239; diagram but instead in terms of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework,  the expected average internal,
</description>
</item>

<item>
<title>
Stochastic Methods for <i>l</i><sub>1</sub>-regularized Loss Minimization; Shai Shalev-Shwartz, Ambuj Tewari; 12(Jun):1865--1892, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/shalev-shwartz11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/shalev-shwartz11a.html
</link>
<description>
We describe and analyze two stochastic methods for l_1 regularized loss minimization problems, such as the Lasso.  The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration.  In both methods, the choice of feature or example is made uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.
</description>
</item>

<item>
<title>
A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin; Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng; 12(Jun):1835--1863, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wang11a.html
</link>
<description>
Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error.  Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other
</description>
</item>

<item>
<title>
Hyper-Sparse Optimal Aggregation; St&#233;phane Ga&#239;ffas, Guillaume Lecu&#233;; 12(Jun):1813--1833, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gaiffas11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gaiffas11a.html
</link>
<description>
Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures have been convex combinations of all elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation.
</description>
</item>

<item>
<title>
Learning Latent Tree Graphical Models; Myung Jin Choi, Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky; 12(May):1771--1812, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/choi11b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/choi11b.html
</link>
<description>
We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all  the observed and latent variables have  the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using   so-called information
</description>
</item>

<item>
<title>
A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes; St&#233;phane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann; 12(May):1729--1770, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ross11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ross11a.html
</link>
<description>
Bayesian learning methods have recently been shown to provide an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs).  The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions.
</description>
</item>

<item>
<title>
Domain Decomposition Approach for Fast Gaussian Process Regression of Large Spatial Data Sets; Chiwoo Park, Jianhua Z. Huang, Yu Ding; 12(May):1697--1728, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/park11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/park11a.html
</link>
<description>
Gaussian process regression is a flexible and powerful tool for machine learning, but the high computational complexity hinders its broader applications. In this paper, we propose a new approach for fast computation of Gaussian process regression with a focus on large spatial data sets. The approach decomposes the domain of a regression function into small subdomains and infers a local piece of the regression function for each subdomain. We explicitly address the mismatch problem of the local pieces on the boundaries of neighboring subdomains by imposing continuity constraints. The new approach has computational complexity comparable to or better than that of other competing methods, but it is easier to be
</description>
</item>

<item>
<title>
X-Armed Bandits; S&#233;bastien Bubeck, R&#233;mi Munos, Gilles Stoltz, Csaba Szepesv&#225;ri; 12(May):1655--1695, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/bubeck11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/bubeck11a.html
</link>
<description>
We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is "locally Lipschitz" with respect to a dissimilarity function that is known to the decision maker.  Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems.  In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree,
</description>
</item>

<item>
<title>
Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates; Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky; 12(May):1617--1653, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tan11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tan11a.html
</link>
<description>
The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An  algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed.   It is shown that this algorithm is both  structurally consistent and risk consistent and  the error probability of structure learning decays faster than any polynomial in the number of samples   under fixed model size.  For the  high-dimensional scenario where the size of the  model d and the number of edges k scale with the number of samples n,  sufficient conditions on (n,d,k) are given for  the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning
</description>
</item>

<item>
<title>
Double Updating Online Learning; Peilin Zhao, Steven C.H. Hoi, Rong Jin; 12(May):1587--1615, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhao11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhao11a.html
</link>
<description>
In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the
</description>
</item>

<item>
<title>
Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation; Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama; 12(May):1537--1586, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/tomioka11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/tomioka11a.html
</link>
<description>
We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL).  Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm.  We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms.  In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems.  We experimentally
</description>
</item>

<item>
<title>
Learning from Partial Labels; Timothee Cour, Ben Sapp, Ben Taskar; 12(May):1501--1536, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/cour11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/cour11a.html
</link>
<description>
We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct.  Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available.  The goal is to learn a classifier that can disambiguate the partially-labeled training instances, and generalize to unseen data.  We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled.  Exploiting this property
</description>
</item>

<item>
<title>
Computationally Efficient Convolved Multiple Output Gaussian Processes; Mauricio A. &#193;lvarez, Neil D. Lawrence; 12(May):1459--1500, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/alvarez11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/alvarez11a.html
</link>
<description>
Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this
</description>
</item>

<item>
<title>
Learning a Robust Relevance Model for Search Using Kernel Methods; Wei Wu, Jun Xu, Hang Li, Satoshi Oyama; 12(May):1429--1458, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/wu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/wu11a.html
</link>
<description>
This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal
</description>
</item>

<item>
<title>
Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning; Dorota G&#322;owacka, John Shawe-Taylor, Alex Clark, Colin de la Higuera, Mark Johnson; 12(Apr):1425--1428, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/glowacka11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/glowacka11a.html
</link>
<description>
Grammar induction refers to the process of learning grammars and languages from data; this finds a variety of applications in syntactic pattern recognition, the modeling of natural language acquisition, data mining and machine translation. This special topic contains several papers presenting some of the recent developments in the area of grammar induction and language learning, as applied to various problems in Natural Language Processing, including supervised and unsupervised parsing and statistical machine translation.
</description>
</item>

<item>
<title>
Clustering Algorithms for Chains; Antti Ukkonen; 12(Apr):1389--1423, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ukkonen11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ukkonen11a.html
</link>
<description>
We consider the problem of clustering a set of chains to k clusters.  A chain is a totally ordered subset of a finite set of items.  Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items.  First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure.  We also present two different approaches for mapping chains to a vector space.  The first one is related to the planted partition model,
</description>
</item>

<item>
<title>
Faster Algorithms for Max-Product Message-Passing; Julian J. McAuley, Tib&#233;rio S. Caetano; 12(Apr):1349--1388, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mcauley11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mcauley11a.html
</link>
<description>
Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm or loopy belief-propagation. The exact solution to this problem is well-known to be exponential in the size of the maximal cliques of the triangulated model, while approximate inference is typically exponential in the size of the model's factors. In this paper, we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist only of latent variables (i.e., they do not depend on an observation). This is a common case in a wide variety of applications that deal with
</description>
</item>

<item>
<title>
A Family of Simple Non-Parametric Kernel Learning Algorithms; Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi; 12(Apr):1313--1347, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhuang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhuang11a.html
</link>
<description>
Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interior-point SDP solver could be as high as O(N^6.5), which prevents NPKL methods from being applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed "SimpleNPKL", which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with
</description>
</item>

<item>
<title>
Better Algorithms for Benign Bandits; Elad Hazan, Satyen Kale; 12(Apr):1287--1311, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/hazan11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/hazan11a.html
</link>
<description>
The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information.  A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in
</description>
</item>

<item>
<title>
Locally Defined Principal Curves and Surfaces; Umut Ozertem, Deniz Erdogmus; 12(Apr):1249--1286, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ozertem11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ozertem11a.html
</link>
<description>
Principal curves are defined as self-consistent smooth curves passing through the middle of the data, and they have been used in many applications of machine learning as a generalization, dimensionality reduction and a feature extraction tool. We redefine principal curves and surfaces in terms of the gradient and the Hessian of the probability density estimate. This provides a geometric understanding of the principal curves and surfaces, as well as a unifying view for clustering, principal curve fitting and manifold learning by regarding those as principal manifolds of different intrinsic dimensionalities. The theory does not impose any particular density estimation method can be used with
</description>
</item>

<item>
<title>
DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model; Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyv&#228;rinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, Kenneth Bollen; 12(Apr):1225--1248, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/shimizu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/shimizu11a.html
</link>
<description>
Structural equation models and Bayesian networks have been widely used to analyze causal relations between continuous variables. In such frameworks, linear acyclic models are typically used to model the data-generating process of variables.  Recently, it was shown that use of non-Gaussianity identifies the full structure of a linear acyclic model, that is, a causal ordering of variables and their connection strengths, without using any prior knowledge on the network structure, which is not the case with conventional methods.  However, existing estimation methods are based on iterative search algorithms and may not converge to a correct solution in a finite number of steps.  In this paper,
</description>
</item>

<item>
<title>
The Indian Buffet Process: An Introduction and Review; Thomas L. Griffiths, Zoubin Ghahramani; 12(Apr):1185--1224, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/griffiths11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/griffiths11a.html
</link>
<description>
The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections
</description>
</item>

<item>
<title>
Laplacian Support Vector Machines  Trained in the Primal; Stefano Melacci, Mikhail Belkin; 12(Mar):1149--1184, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/melacci11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/melacci11a.html
</link>
<description>
In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop a better understanding and improve the quality of classifiers exploiting unlabeled data.  Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification.  In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation.  In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient.  We speed up training by using an early
</description>
</item>

<item>
<title>
Anechoic Blind Source Separation Using Wigner Marginals; Lars Omlor, Martin A. Giese; 12(Mar):1111--1148, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/omlor11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/omlor11a.html
</link>
<description>
Blind source separation problems emerge in many applications, where signals can be modeled as superpositions of multiple sources. Many popular applications of blind source separation are based on linear instantaneous mixture models. If specific invariance properties are known about the sources, for example, translation or rotation invariance, the simple linear model can be extended by inclusion of the corresponding transformations.  When the sources are invariant against translations (spatial displacements or time shifts) the resulting model is called an anechoic mixing model. We present a new algorithmic framework for the solution of anechoic problems in arbitrary dimensions. This framework
</description>
</item>

<item>
<title>
Differentially Private Empirical Risk Minimization; Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate; 12(Mar):1069--1109, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/chaudhuri11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/chaudhuri11a.html
</link>
<description>
Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed.  We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM).  These algorithms are private under the &#949;-differential privacy definition due to Dwork et al. (2006).  First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification.  Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design.  This method entails perturbing the objective function before
</description>
</item>

<item>
<title>
Two Distributed-State Models For Generating High-Dimensional Time Series; Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis; 12(Mar):1025--1068, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/taylor11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/taylor11a.html
</link>
<description>
In this paper we develop a class of nonlinear generative models for high-dimensional time series.  We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued "visible" variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This "conditional" RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost
</description>
</item>

<item>
<title>
Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data; Zeeshan Syed, John Guttag; 12(Mar):999--1024, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/syed11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/syed11a.html
</link>
<description>
In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in long-term signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients
</description>
</item>

<item>
<title>
l_p-Norm Multiple Kernel Learning; Marius Kloft, Ulf Brefeld, S&#246;ren Sonnenburg, Alexander Zien; 12(Mar):953--997, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/kloft11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/kloft11a.html
</link>
<description>
Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability.  Unfortunately, this l_1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is l_p-norms with p &#8805; 1. This interleaved optimization is much
</description>
</item>

<item>
<title>
Forest Density Estimation; Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman; 12(Mar):907--951, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/liu11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/liu11a.html
</link>
<description>
We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models.  For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal's algorithm to estimate the optimal forest on held out data.  We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest.  For graph estimation, we consider the problem of estimating forests with restricted tree sizes.  We prove that finding a maximum weight spanning forest with restricted
</description>
</item>

<item>
<title>
Sparse Linear Identifiable Multivariate Modeling; Ricardo Henao, Ole Winther; 12(Mar):863--905, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/henao11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/henao11a.html
</link>
<description>
In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component &#948;-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to
</description>
</item>

<item>
<title>
Learning Transformation Models for Ranking and Survival Analysis; Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel; 12(Mar):819--862, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vanbelle11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vanbelle11a.html
</link>
<description>
This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis.  The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions.  Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the 'margin' is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual
</description>
</item>

<item>
<title>
Information, Divergence and Risk for Binary Experiments; Mark D. Reid, Robert C. Williamson; 12(Mar):731--817, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/reid11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/reid11a.html
</link>
<description>
We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information.  We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives  which all are related to cost-sensitive binary classification.  As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence.  The new viewpoint also illuminates existing algorithms: it provides a new derivation
</description>
</item>

<item>
<title>
Inverse Reinforcement Learning in Partially Observable Environments; Jaedeug Choi, Kee-Eung Kim; 12(Mar):691--730, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/choi11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/choi11a.html
</link>
<description>
Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories
</description>
</item>

<item>
<title>
Efficient Structure Learning of Bayesian Networks using Constraints; Cassio P. de Campos, Qiang Ji; 12(Mar):663--689, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/decampos11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/decampos11a.html
</link>
<description>
This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable.  It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees.  These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion.  Then a branch-and-bound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality.  As an example, structural constraints are used to map the problem of structure learning in Dynamic
</description>
</item>

<item>
<title>
Parameter Screening and Optimisation for ILP using Designed Experiments; Ashwin Srinivasan, Ganesh Ramakrishnan; 12(Feb):627--662, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/srinivasan11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/srinivasan11a.html
</link>
<description>
Reports of experiments conducted with an Inductive Logic Programming system rarely describe how specific values of parameters of the system are arrived at when constructing models. Usually, no attempt is made to identify sensitive parameters, and those that are used are often given "factory-supplied" default values, or values obtained from some non-systematic exploratory analysis. The immediate consequence of this is, of course, that it is not clear if better models could have been obtained if some form of parameter selection and optimisation had been performed.  Questions follow inevitably on the experiments themselves: specifically, are all algorithms being treated fairly, and is the
</description>
</item>

<item>
<title>
Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach; Gilles Meyer, Silv&#232;re Bonnabel, Rodolphe Sepulchre; 12(Feb):593--625, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/meyer11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/meyer11a.html
</link>
<description>
The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixed-rank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the
</description>
</item>

<item>
<title>
Variable Sparsity Kernel Learning; Jonathan Aflalo, Aharon Ben-Tal, Chiranjib Bhattacharyya, Jagarlapudi Saketha Nath, Sankaran Raman; 12(Feb):565--592, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/aflalo11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/aflalo11a.html
</link>
<description>
This paper presents novel algorithms and applications for a particular class of mixed-norm regularization based Multiple Kernel Learning (MKL) formulations. The formulations assume that the given kernels are grouped and employ l_1 norm regularization for promoting sparsity within RKHS norms of each group and l_s, s&#8805;2 norm regularization for promoting non-sparse combinations across groups. Various sparsity levels in combining the kernels can be achieved by varying the grouping of kernels---hence we name the formulations as Variable Sparsity Kernel Learning (VSKL) formulations. While previous attempts have a non-convex formulation, here we present a convex formulation which admits
</description>
</item>

<item>
<title>
Minimum Description Length Penalization for Group and Multi-Task Sparse Learning; Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar; 12(Feb):525--564, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/dhillon11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/dhillon11a.html
</link>
<description>
We propose a framework, MIC (Multiple Inclusion Criterion), for learning sparse models based on the information-theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group.  MIC-MULTI applies when there are multiple related tasks that share the same set
</description>
</item>

<item>
<title>
Learning Multi-modal Similarity; Brian McFee, Gert Lanckriet; 12(Feb):491--523, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/mcfee11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/mcfee11a.html
</link>
<description>
In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation.  Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video.  Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications.  We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space.  Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human
</description>
</item>

<item>
<title>
Posterior Sparsity in Unsupervised Dependency Parsing; Jennifer Gillenwater, Kuzman Ganchev, Jo&#227;o Gra&#231;a, Fernando Pereira, Ben Taskar; 12(Feb):455--490, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/gillenwater11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/gillenwater11a.html
</link>
<description>
A strong inductive bias is essential in unsupervised grammar induction.  In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types.  We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Gra&#231;a et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by
</description>
</item>

<item>
<title>
Approximate Marginals in Latent Gaussian Models; Botond Cseke, Tom Heskes; 12(Feb):417--454, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/cseke11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/cseke11a.html
</link>
<description>
We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986).  We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both
</description>
</item>

<item>
<title>
Operator Norm Convergence of Spectral Clustering on Level Sets; Bruno Pelletier, Pierre Pudlo; 12(Feb):385--416, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/pelletier11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/pelletier11a.html
</link>
<description>
Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t.  A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed.  Our algorithm is composed of two steps.  First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t.  Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix.  Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator
</description>
</item>

<item>
<title>
Models of Cooperative Teaching and Learning; Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich; 12(Feb):349--384, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zilles11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zilles11a.html
</link>
<description>
While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects "helpful" training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, "helpful" examples can speed up the learning process.  The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using "coding tricks", has been widely addressed. Nevertheless, the resulting teaching and
</description>
</item>

<item>
<title>
Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs; Jim C. Huang, Brendan J. Frey; 12(Jan):301--348, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/huang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/huang11a.html
</link>
<description>
We present a class of graphical models for directly representing the joint cumulative distribution function (CDF) of many random variables, called cumulative distribution networks (CDNs).  Unlike graphs for probability density and mass functions, for CDFs the marginal probabilities for any subset of variables are obtained by computing limits of functions in the model, and conditional probabilities correspond to computing mixed derivatives.  We will show that the conditional independence properties in a CDN are distinct from the conditional independence properties of directed, undirected and factor graphs, but include the conditional independence properties of bi-directed graphs.  In order to perform inference in such models, we describe the 'derivative-sum-product' (DSP) message-passing algorithm in which messages correspond to derivatives of the joint CDF.  We will then apply CDNs to the problem of learning to rank players in multiplayer team-based games and suggest several future directions for research.
</description>
</item>

<item>
<title>
A Bayesian Approximation Method for Online Ranking; Ruby C. Weng, Chih-Jen Lin; 12(Jan):267--300, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/weng11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/weng11a.html
</link>
<description>
This paper describes a Bayesian approximation method to obtain online ranking algorithms for games with multiple teams and multiple players.  Large online ranking systems have recently become much needed for Internet games.  We consider game models in which a k-team game is treated as several two-team games.  By approximating the expectation of teams' (or players') performances, we derive simple analytic update rules.  These update rules, without numerical integrations, are very easy to interpret and implement.  Experiments on game data show that the accuracy of our approach is competitive with state-of-the-art systems such as TrueSkill, but both the running time and the code are much shorter.
</description>
</item>

<item>
<title>
Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm; Vladimir V. V'yugin; 12(Jan):241--266, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/vyugin11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/vyugin11a.html
</link>
<description>
In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present a modification of the Kalai and Vempala follow-the-perturbed-leader algorithm in which weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero.
</description>
</item>

<item>
<title>
Logistic Stick-Breaking Process; Lu Ren, Lan Du, Lawrence Carin, David Dunson; 12(Jan):203--239, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ren11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ren11a.html
</link>
<description>
A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks.  Efficient variational Bayesian inference is derived, and comparisons are made
</description>
</item>

<item>
<title>
Training SVMs Without Offset; Ingo Steinwart, Don Hush, Clint Scovel; 12(Jan):141--202, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/steinwart11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/steinwart11a.html
</link>
<description>
We develop, analyze, and test a training algorithm for support vector machine classifiers without offset.  Key features of this algorithm are a new, statistically motivated stopping criterion, new warm start options, and a set of inexpensive working set selection strategies that significantly reduce the number of iterations.  For these working set strategies, we establish convergence rates that, not surprisingly, coincide with the best known rates for SVMs with offset.  We further conduct various experiments that investigate both the run time behavior and the performed iterations of the new training algorithm. It turns out that the new algorithm needs significantly fewer iterations and
</description>
</item>

<item>
<title>
Bayesian Generalized Kernel Mixed Models; Zhihua Zhang, Guang Dai, Michael I. Jordan; 12(Jan):111--139, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/zhang11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/zhang11a.html
</link>
<description>
We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman's g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging
</description>
</item>

<item>
<title>
Multitask Sparsity via Maximum Entropy Discrimination; Tony Jebara; 12(Jan):75--110, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/jebara11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/jebara11a.html
</link>
<description>
A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm which involves only convex optimization problems that are straightforward to implement.  Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces
</description>
</item>

<item>
<title>
CARP: Software for Fishing Out Good Clustering Algorithms; Volodymyr Melnykov, Ranjan Maitra; 12(Jan):69--73, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/melnykov11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/melnykov11a.html
</link>
<description>
This paper presents the CLUSTERING ALGORITHMS' REFEREE PACKAGE or CARP,  an open source GNU GPL-licensed C package for evaluating clustering algorithms. Calibrating performance of such algorithms is important and CARP addresses this need by generating datasets of different clustering complexity and by assessing the performance of the concerned algorithm in terms of its ability to classify each dataset relative to the true grouping. This paper briefly describes the software and its capabilities.
</description>
</item>

<item>
<title>
Improved Moves for Truncated Convex Models; M. Pawan Kumar, Olga Veksler, Philip H.S. Torr; 12(Jan):31--67, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/kumar11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/kumar11a.html
</link>
<description>
We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of &#945;&#946;-Swap and &#945;-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels).  Furthermore, we show that Range Expansion provides
</description>
</item>

<item>
<title>
Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation; Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan; 12(Jan):1--30, 2011.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v12/ni11a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v12/ni11a.html
</link>
<description>
We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art
</description>
</item>

<item>
<title>
Learning Non-Stationary Dynamic Bayesian Networks; Joshua W. Robinson, Alexander J. Hartemink; 11(Dec):3647--3680, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/robinson10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/robinson10a.html
</link>
<description>
Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data.  An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings.  In this paper, we introduce a new class of graphical model called a non-stationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time.  Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time.
</description>
</item>

<item>
<title>
PAC-Bayesian Analysis of Co-clustering and Beyond; Yevgeny Seldin, Naftali Tishby; 11(Dec):3595--3646, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/seldin10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/seldin10a.html
</link>
<description>
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two
</description>
</item>

<item>
<title>
Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory; Sumio Watanabe; 11(Dec):3571--3594, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/watanabe10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/watanabe10a.html
</link>
<description>
In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown.  In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss.  In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems.  First, the Bayes cross-validation loss is asymptotically equivalent
</description>
</item>

<item>
<title>
Incremental Sigmoid Belief Networks for Grammar Learning; James Henderson, Ivan Titov; 11(Dec):3541--3570, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/henderson10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/henderson10a.html
</link>
<description>
We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network's model structure is a function of the predicted output structure.  These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure.  ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain's complex statistical dependencies benefits from large numbers of latent variables.  While exact inference
</description>
</item>

<item>
<title>
Rate Minimaxity of the Lasso and Dantzig Selector for the l_q Loss in l_r Balls; Fei Ye, Cun-Hui Zhang; 11(Dec):3519--3540, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ye10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ye10a.html
</link>
<description>
We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in l_r balls, we provide lower bounds for the minimax l_q risk and minimax quantiles of the l_q loss for all design matrices. Under an l_0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the l_q risk and loss in
</description>
</item>

<item>
<title>
An Exponential Model for Infinite Rankings; Marina Meil&#259;, Le Bao; 11(Dec):3481--3518, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/meila10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/meila10a.html
</link>
<description>
This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large.  A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential
</description>
</item>

<item>
<title>
Efficient Algorithms for Conditional Independence Inference; Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studen&#253;; 11(Dec):3453--3479, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bouckaert10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bouckaert10b.html
</link>
<description>
The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by computer the corresponding algebraic relation between the vectors, called the independence implication.  We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods on the number of involved variables.  We recall/describe the theoretical
</description>
</item>

<item>
<title>
L_p-Nested Symmetric Distributions; Fabian Sinz, Matthias Bethge; 11(Dec):3409--3451, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sinz10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sinz10a.html
</link>
<description>
In this paper, we introduce a new family of probability densities called L_p-nested symmetric distributions. The common property, shared by all members of the new class, is the same functional form &#961;(x) = ~&#961;(f(x)), where f is a nested cascade of L_p-norms ||x||_p = (&#8721; |x_i|^p)^(1/p). L_p-nested symmetric distributions thereby are a special case of &#957;-spherical distributions for which f is only required to be positively homogeneous of degree one. While both &#957;-spherical and L_p-nested symmetric distributions contain many widely used families of probability models such as the Gaussian, spherically and elliptically symmetric distributions, L_p-spherically symmetric distributions,
</description>
</item>

<item>
<title>
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion; Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol; 11(Dec):3371--3408, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/vincent10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/vincent10a.html
</link>
<description>
We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders,
</description>
</item>

<item>
<title>
Learning Instance-Specific Predictive Models; Shyam Visweswaran, Gregory F. Cooper; 11(Dec):3333--3369, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/visweswaran10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/visweswaran10a.html
</link>
<description>
This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naive Bayes,
</description>
</item>

<item>
<title>
Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds; Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra; 11(Dec):3313--3332, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/dmochowski10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/dmochowski10a.html
</link>
<description>
The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain.  While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear.  To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified.  In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem.  On the other hand, under
</description>
</item>

<item>
<title>
Classification with Incomplete Data Using Dirichlet Process Priors; Chunping Wang, Xuejun Liao, Lawrence Carin, David B. Dunson; 11(Dec):3269--3311, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/wang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/wang10a.html
</link>
<description>
A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local "expert", and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the "experts" allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform
</description>
</item>

<item>
<title>
Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes; Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen; 11(Nov):3235--3268, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/honkela10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/honkela10a.html
</link>
<description>
Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants.  In this paper we present an efficient algorithm for applying VB to more general models.  The method is based on specifying the functional form of the approximation, such as multivariate Gaussian.  The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations.  This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is
</description>
</item>

<item>
<title>
A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification; Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin; 11(Nov):3183--3234, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yuan10c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yuan10c.html
</link>
<description>
Large-scale linear classification is widely used in many areas.  The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training.  Although various optimization methods have been proposed in recent years, these have not yet been compared suitably.  In this paper, we first broadly review existing methods.  Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations.  Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
</description>
</item>

<item>
<title>
A Generalized Path Integral Control Approach to Reinforcement Learning; Evangelos Theodorou, Jonas Buchli, Stefan Schaal; 11(Nov):3137--3181, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/theodorou10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/theodorou10a.html
</link>
<description>
With the goal of generating more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests using the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has
</description>
</item>

<item>
<title>
Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials; Rahul Gupta, Sunita Sarawagi, Ajit A. Diwan; 11(Nov):3097--3135, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gupta10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gupta10a.html
</link>
<description>
Many structured information extraction tasks employ collective graphical models that capture inter-instance associativity by coupling them with various clique potentials.  We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials.  We present three families of symmetric potentials---MAX, SUM, and MAJORITY.  We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials.  Our first message computation algorithm, called &#945;-pass, is sub-quadratic in the clique size, outputs exact messages for MAX, and
</description>
</item>

<item>
<title>
Inducing Tree-Substitution Grammars; Trevor Cohn, Phil Blunsom, Sharon Goldwater; 11(Nov):3053--3096, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cohn10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cohn10b.html
</link>
<description>
Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research.  The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data.  The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions.  We propose a novel compromise by inferring a probabilistic tree
</description>
</item>

<item>
<title>
Covariance in Unsupervised Learning of Probabilistic Grammars; Shay B. Cohen, Noah A. Smith; 11(Nov):3017--3051, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cohen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cohen10a.html
</link>
<description>
Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text.  Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting.  To date, most of the literature has focused on using a Dirichlet prior.  The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar's parameters. Yet, various grammar parameters are expected to be correlated because
</description>
</item>

<item>
<title>
Gaussian Processes for Machine Learning (GPML) Toolbox; Carl Edward Rasmussen, Hannes Nickisch; 11(Nov):3011--3015, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rasmussen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rasmussen10a.html
</link>
<description>
The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification.  Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace's method for dealing with non-Gaussian likelihoods, and FITC for dealing with large regression tasks.
</description>
</item>

<item>
<title>
Semi-Supervised Novelty Detection; Gilles Blanchard, Gyemin Lee, Clayton Scott; 11(Nov):2973--3009, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/blanchard10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/blanchard10a.html
</link>
<description>
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson
</description>
</item>

<item>
<title>
Tree Decomposition for Large-Scale SVM Problems; Fu Chang, Chien-Yang Guo, Xiao-Rong Lin, Chi-Jen Lu; 11(Oct):2935--2972, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/chang10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/chang10b.html
</link>
<description>
To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size
</description>
</item>

<item>
<title>
Linear Algorithms for Online Multitask Classification; Giovanni Cavallanti, Nicol&#242; Cesa-Bianchi, Claudio Gentile; 11(Oct):2901--2934, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cavallanti10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cavallanti10a.html
</link>
<description>
We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks.  We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace.  In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our
</description>
</item>

<item>
<title>
Expectation Truncation and the Benefits of Preselection In Training Generative Models; J&#246;rg L&#252;cke, Julian Eggert; 11(Oct):2855--2900, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/lucke10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/lucke10a.html
</link>
<description>
We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables.  The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model.  The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes.  The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types
</description>
</item>

<item>
<title>
Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance; Nguyen Xuan Vinh, Julien Epps, James Bailey; 11(Oct):2837--2854, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/vinh10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/vinh10a.html
</link>
<description>
Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when
</description>
</item>

<item>
<title>
Regret Bounds and Minimax Policies under Partial Monitoring; Jean-Yves Audibert, S&#233;bastien Bubeck; 11(Oct):2785--2836, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/audibert10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/audibert10a.html
</link>
<description>
This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function &#968; for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for &#968;(x)=exp(&#951; x) + &#947;/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly
</description>
</item>

<item>
<title>
Mean Field Variational Approximation for Continuous-Time Bayesian Networks; Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman; 11(Oct):2745--2783, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cohn10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cohn10a.html
</link>
<description>
Continuous-time Bayesian networks are a natural structured representation language for multi-component stochastic processes that evolve continuously over time.  Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories.  This variational approach leads to a globally consistent distribution, which can be efficiently queried.  Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning
</description>
</item>

<item>
<title>
Using Contextual Representations to Efficiently Learn Context-Free Languages; Alexander Clark, R&#233;mi Eyraud, Amaury Habrard; 11(Oct):2707--2744, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/clark10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/clark10a.html
</link>
<description>
We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle.  We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages.  These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning.  This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly
</description>
</item>

<item>
<title>
Topology Selection in Graphical Models of Autoregressive Processes; Jitkomut Songsiri, Lieven Vandenberghe; 11(Oct):2671--2705, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/songsiri10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/songsiri10a.html
</link>
<description>
An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series.  The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables.  The method proposed in the paper is based on an l_1-type nonsmooth regularization of the conditional maximum likelihood estimation problem.  We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method.  Results of experiments with randomly generated and real data sets are also included.
</description>
</item>

<item>
<title>
Learnability, Stability and Uniform Convergence; Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan; 11(Oct):2635--2670, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shalev-shwartz10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shalev-shwartz10a.html
</link>
<description>
The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization.  In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases.  We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails,
</description>
</item>

<item>
<title>
Stochastic Composite Likelihood; Joshua V. Dillon, Guy Lebanon; 11(Oct):2597--2633, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/dillon10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/dillon10a.html
</link>
<description>
Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolves the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies
</description>
</item>

<item>
<title>
Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization; Lin Xiao; 11(Oct):2543--2596, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/xiao10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/xiao10a.html
</link>
<description>
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as l_1-norm for promoting sparsity.  We develop extensions of Nesterov's dual averaging method, that can exploit the regularization structure in an online setting.  At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient.  In the case of l_1-regularization,
</description>
</item>

<item>
<title>
WEKA---Experiences with a Java Open-Source Project; Remco R. Bouckaert, Eibe Frank, Mark A. Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten; 11(Sep):2533--2541, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bouckaert10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bouckaert10a.html
</link>
<description>
WEKA is a popular machine learning workbench with a development life of nearly two decades.  This article provides an overview of the factors that we believe to be important to its success. Rather than focussing on the software's functionality, we review aspects of project management and historical development decisions that likely had an impact on the uptake of the project.
</description>
</item>

<item>
<title>
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data; Milo&#353; Radovanovi&#263;, Alexandros Nanopoulos, Mirjana Ivanovi&#263;; 11(Sep):2487--2531, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/radovanovic10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/radovanovic10a.html
</link>
<description>
Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent "popular" nearest neighbors.
</description>
</item>

<item>
<title>
Rademacher Complexities and Bounding the Excess Risk in Active Learning; Vladimir Koltchinskii; 11(Sep):2457--2485, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/koltchinskii10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/koltchinskii10a.html
</link>
<description>
Sequential algorithms of active learning based on the estimation of the level sets of the empirical risk are discussed in the paper. Localized Rademacher complexities are used in the algorithms to estimate the sample sizes needed to achieve the required accuracy of learning in an adaptive way.  Probabilistic bounds on the number of active examples have been proved and several applications to binary classification problems are considered.
</description>
</item>

<item>
<title>
Sparse Semi-supervised Learning Using Conjugate Functions; Shiliang Sun, John Shawe-Taylor; 11(Sep):2423--2455, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sun10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sun10a.html
</link>
<description>
In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared
</description>
</item>

<item>
<title>
Composite Binary Losses; Mark D. Reid, Robert C. Williamson; 11(Sep):2387--2422, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/reid10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/reid10a.html
</link>
<description>
We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function.  We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and "classification calibrated" losses. We also consider the question of the "best" surrogate binary loss. We introduce a precise notion of "best" and show there exist
</description>
</item>

<item>
<title>
High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency; Dapo Omidiran, Martin J. Wainwright; 11(Aug):2361--2386, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/omidiran10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/omidiran10a.html
</link>
<description>
We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector &#946;^* &#8712; R^p, estimate the subset of non-zero entries of &#946;^*.  A significant body of work has studied the behavior of l_1-relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli).  In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction &#947; of non-zero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one.  Our main result is to
</description>
</item>

<item>
<title>
Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers; Franz Pernkopf, Jeff A. Bilmes; 11(Aug):2323--2360, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/pernkopf10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/pernkopf10a.html
</link>
<description>
We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers.  We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O(N^(k+1)) score evaluations (where constant k is the tree-width of the sub-graph over the attributes).  We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification
</description>
</item>

<item>
<title>
Spectral Regularization Algorithms for Learning Large Incomplete Matrices; Rahul Mazumder, Trevor Hastie, Robert Tibshirani; 11(Aug):2287--2322, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mazumder10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mazumder10a.html
</link>
<description>
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm SOFT-IMPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix.  Exploiting the
</description>
</item>

<item>
<title>
High Dimensional Inverse Covariance Matrix Estimation via Linear Programming; Ming Yuan; 11(Aug):2261--2286, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yuan10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yuan10b.html
</link>
<description>
This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by "sparse" matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such "sparsity".  The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem.
</description>
</item>

<item>
<title>
Restricted Eigenvalue Properties for Correlated Gaussian Designs; Garvesh Raskutti, Martin J. Wainwright, Bin Yu; 11(Aug):2241--2259, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/raskutti10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/raskutti10a.html
</link>
<description>
Methods based on l_1-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions.  The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix X satisfies the restricted nullspace property, and (2) the squared l_2-error of a Lasso estimate decays at the minimax optimal rate k log p / n, where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition.  The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far,
</description>
</item>

<item>
<title>
Erratum: SGDQN is Less Careful than Expected; Antoine Bordes, L&#233;on Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith; 11(Aug):2229--2240, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bordes10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bordes10a.html
</link>
<description>
The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals.  Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008).  This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance.
</description>
</item>

<item>
<title>
Regularized Discriminant Analysis, Ridge Regression and Beyond; Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan; 11(Aug):2199--2228, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/zhang10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/zhang10b.html
</link>
<description>
Fisher linear discriminant analysis (FDA) and its kernel extension--kernel discriminant analysis (KDA)--are well known methods that consider dimensionality reduction and classification jointly.  While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures.  In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA.  We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on
</description>
</item>

<item>
<title>
Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence; Qiang Wu, Justin Guinney, Mauro Maggioni, Sayan Mukherjee; 11(Aug):2175--2198, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/wu10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/wu10a.html
</link>
<description>
The problems of dimension reduction and inference of statistical dependence are addressed by the modeling framework of learning gradients. The models we propose hold for Euclidean spaces as well as the manifold setting. The central quantity in this approach is an estimate of the gradient of the regression or classification function. Two quadratic forms are constructed from gradient estimates: the gradient outer product and gradient based diffusion maps. The first quantity can be used for supervised dimension reduction on manifolds as well as inference of a graphical model encoding dependencies that are predictive of a response variable.  The second quantity can be used for nonlinear projections
</description>
</item>

<item>
<title>
libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models; Joris M. Mooij; 11(Aug):2169--2173, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mooij10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mooij10a.html
</link>
<description>
This paper describes the software package libDAI, a free &amp; open source C++ library that provides implementations of various exact and approximate inference methods for graphical models with discrete-valued variables. libDAI supports directed graphical models (Bayesian networks) as well as undirected ones (Markov random fields and factor graphs). It offers various approximations of the partition sum, marginal probability distributions and maximum probability states. Parameter learning is also supported. A feature comparison with other open source software packages for approximate inference is given. libDAI is licensed under the GPL v2+ license and is available at http://www.libdai.org.
</description>
</item>

<item>
<title>
Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases; Guoqiang Yu, Yuanjian Feng, David J. Miller, Jianhua Xuan, Eric P. Hoffman, Robert Clarke, Ben Davidson, Ie-Ming Shih, Yue Wang; 11(Aug):2141--2167, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yu10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yu10b.html
</link>
<description>
Microarray gene expressions provide new opportunities for molecular classification of heterogeneous diseases. Although various reported classification schemes show impressive performance, most existing gene selection methods are suboptimal and are not well-matched to the unique characteristics of the multicategory classification problem. Matched design of the gene selection method and a committee classifier is needed for identifying a small set of gene markers that achieve accurate multicategory classification while being both statistically reproducible and biologically plausible. We report a simpler and yet more accurate strategy than previous works for multicategory classification of heterogeneous
</description>
</item>

<item>
<title>
Importance Sampling for Continuous Time Bayesian Networks; Yu Fan, Jing Xu, Christian R. Shelton; 11(Aug):2115--2140, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/fan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/fan10a.html
</link>
<description>
A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time.  Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental
</description>
</item>

<item>
<title>
Model-based Boosting 2.0; Torsten Hothorn, Peter B&#252;hlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner; 11(Aug):2109--2113, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/hothorn10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/hothorn10a.html
</link>
<description>
We describe version 2.0 of the R add-on package mboost.  The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data.
</description>
</item>

<item>
<title>
On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation; Gavin C. Cawley, Nicola L. C. Talbot; 11(Jul):2079--2107, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/cawley10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/cawley10a.html
</link>
<description>
Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation.  The error of such an estimator can be broken down into bias and variance components.  While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model.  While this observation is in hindsight perhaps rather obvious, the degradation in performance
</description>
</item>

<item>
<title>
Matrix Completion from Noisy Entries; Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh; 11(Jul):2057--2078, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/keshavan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/keshavan10a.html
</link>
<description>
Given a matrix M of low rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the 'Netflix problem') to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here OPTSPACE. We prove performance guarantees that are order-optimal in a number of circumstances.
</description>
</item>

<item>
<title>
A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design; Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene, Karel Crombecq; 11(Jul):2051--2055, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gorissen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gorissen10a.html
</link>
<description>
An exceedingly large number of scientific and engineering fields are confronted with the need for computer simulations to study complex, real world phenomena or solve challenging design problems. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques has become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, prototyping, and sensitivity analysis. Consequently, in many fields there is great interest in tools and techniques that facilitate the construction of such regression models,
</description>
</item>

<item>
<title>
Posterior Regularization for Structured Latent Variable Models; Kuzman Ganchev, Jo&#227;o Gra&#231;a, Jennifer Gillenwater, Ben Taskar; 11(Jul):2001--2049, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ganchev10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ganchev10a.html
</link>
<description>
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning.  Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy.  By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and
</description>
</item>

<item>
<title>
Practical Approaches to Principal Component Analysis in the Presence of Missing Values; Alexander Ilin, Tapani Raiko; 11(Jul):1957--2000, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ilin10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ilin10a.html
</link>
<description>
Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In the case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to
</description>
</item>

<item>
<title>
Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary &#946;-Mixing Processes; Liva Ralaivola, Marie Szafranski, Guillaume Stempfel; 11(Jul):1927--1956, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ralaivola10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ralaivola10a.html
</link>
<description>
PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and this is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints.
</description>
</item>

<item>
<title>
Fast and Scalable Local Kernel Machines; Nicola Segata, Enrico Blanzieri; 11(Jun):1883--1926, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/segata10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/segata10a.html
</link>
<description>
A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy.  Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function into local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds up the learning process.
</description>
</item>

<item>
<title>
Sparse Spectrum Gaussian Process Regression; Miguel L&#225;zaro-Gredilla, Joaquin Qui&#241;onero-Candela, Carl Edward Rasmussen, An&#237;bal R. Figueiras-Vidal; 11(Jun):1865--1881, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/lazaro-gredilla10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/lazaro-gredilla10a.html
</link>
<description>
We present a new sparse Gaussian Process (GP) model for regression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practical algorithm for regression tasks. We compare the achievable trade-offs between predictive accuracy and computational requirements, and show that these are typically superior to existing state-of-the-art sparse approximations. We discuss both the weight space and function space representations, and note that the new construction implies priors over functions which are always stationary, and can approximate any covariance function in this class.
</description>
</item>

<item>
<title>
Permutation Tests for Studying Classifier Performance; Markus Ojala, Gemma C. Garriga; 11(Jun):1833--1863, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ojala10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ojala10a.html
</link>
<description>
We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assesses whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics.
</description>
</item>

<item>
<title>
How to Explain Individual Classification Decisions; David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert M&#252;ller; 11(Jun):1803--1831, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/baehrens10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/baehrens10a.html
</link>
<description>
After building a classifier with modern tools of machine learning we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is.  However, most methods will provide no answer as to why the model predicted a particular label for a single instance and what features were most influential for that particular instance.  The only methods currently able to provide such explanations are decision trees.  This paper proposes a procedure which (based on a set of assumptions) makes it possible to explain the decisions of any classification method.
</description>
</item>

<item>
<title>
The SHOGUN Machine Learning Toolbox; S&#246;ren Sonnenburg, Gunnar R&#228;tsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojt&#x011B;ch Franc; 11(Jun):1799--1802, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sonnenburg10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sonnenburg10a.html
</link>
<description>
We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines, hidden Markov models, multiple kernel learning, linear discriminant analysis, and more. Most of the specific algorithms are able to deal with several different data classes. We have used this toolbox in several applications from computational biology, some of them coming with no less than 50 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely
</description>
</item>

<item>
<title>
Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing; Ryo Yoshida, Mike West; 11(May):1771--1798, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yoshida10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yoshida10a.html
</link>
<description>
We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices.  We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential
</description>
</item>

<item>
<title>
Evolving Static Representations for Task Transfer; Phillip Verbancsics, Kenneth O. Stanley; 11(May):1737--1769, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/verbancsics10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/verbancsics10a.html
</link>
<description>
An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird's eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map.  For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV.
</description>
</item>

<item>
<title>
FastInf: An Efficient Approximate Inference Library; Ariel Jaimovich, Ofer Meshi, Ian McGraw, Gal Elidan; 11(May):1733--1736, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/jaimovich10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/jaimovich10a.html
</link>
<description>
The FastInf C++ library is designed to perform memory and time efficient approximate inference in large-scale discrete undirected graphical models.  The focus of the library is propagation based approximate inference methods, ranging from the basic loopy belief propagation algorithm to propagation based on convex free energies.  Various message scheduling schemes that improve on the standard synchronous or asynchronous approaches are included. Also implemented are a clique tree based exact inference, Gibbs sampling, and the mean field algorithm.  In addition to inference, FastInf provides parameter estimation capabilities as well as representation and learning of shared parameters. It offers
</description>
</item>

<item>
<title>
Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity; Aapo Hyv&#228;rinen, Kun Zhang, Shohei Shimizu, Patrik O. Hoyer; 11(May):1709--1731, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/hyvarinen10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/hyvarinen10a.html
</link>
<description>
Analysis of causal effects between continuous-valued variables typically uses either autoregressive models or structural equation models with instantaneous effects. Estimation of Gaussian, linear structural equation models poses serious identifiability problems, which is why it was recently proposed to use non-Gaussian models. Here, we show how to combine the non-Gaussian instantaneous model with autoregressive models. This is effectively what is called a structural vector autoregression (SVAR) model, and thus our work contributes to the long-standing problem of how to estimate SVARs. We show that such a non-Gaussian model is identifiable without prior knowledge of network structure. We propose
</description>
</item>

<item>
<title>
Consensus-Based Distributed Support Vector Machines; Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis; 11(May):1663--1707, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/forero10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/forero10a.html
</link>
<description>
This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization sub-problems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated
</description>
</item>

<item>
<title>
Introduction to Causal Inference; Peter Spirtes; 11(May):1643--1662, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/spirtes10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/spirtes10a.html
</link>
<description>
The goal of many sciences is to understand the mechanisms by which variables came to take on the values they have (that is, to find a generative model), and to predict what the values of those variables would be if the naturally occurring mechanisms were subject to outside manipulations. The past 30 years have seen a number of conceptual developments that are partial solutions to the problem of causal inference from observational sample data or a mixture of observational sample and experimental data, particularly in the area of graphical causal modeling. However, in many domains, problems such as the large numbers of variables, small sample sizes, and possible presence of unmeasured causes,
</description>
</item>

<item>
<title>
On the Foundations of Noise-free Selective Classification; Ran El-Yaniv, Yair Wiener; 11(May):1605--1641, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/el-yaniv10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/el-yaniv10a.html
</link>
<description>
We consider selective classification, a term we adopt here to refer to 'classification with a reject option.' The essence of selective classification is to trade off classifier coverage for higher accuracy.  We term this trade-off the risk-coverage (RC) trade-off.  Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner.  For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings.
</description>
</item>

<item>
<title>
MOA: Massive Online Analysis; Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer; 11(May):1601--1604, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/bifet10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/bifet10a.html
</link>
<description>
Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams.  MOA includes a collection of offline and online methods as well as tools for evaluation.  In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Na&#239;ve Bayes classifiers at the leaves.  MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.
</description>
</item>

<item>
<title>
Near-optimal Regret Bounds for Reinforcement Learning; Thomas Jaksch, Ronald Ortner, Peter Auer; 11(Apr):1563--1600, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/jaksch10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/jaksch10a.html
</link>
<description>
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy.  In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s' there is a policy which moves from s to s' in at most D steps (on average).  We present a reinforcement learning algorithm with total regret &#213;(DS&#8730;AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D.  A corresponding lower bound of &#937;(&#8730;DSAT) on the total regret of any learning algorithm is given as well.  These results are complemented by
</description>
</item>

<item>
<title>
Hilbert Space Embeddings and Metrics on Probability Measures; Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch&#246;lkopf, Gert R. G. Lanckriet; 11(Apr):1517--1561, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/sriperumbudur10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/sriperumbudur10a.html
</link>
<description>
A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as &#947;_k, indexed by the kernel function k that defines the inner product in the RKHS.  We present three theoretical properties of &#947;_k. First, we consider the question of determining the conditions on the kernel k for which &#947;_k is a metric: such k are denoted
</description>
</item>

<item>
<title>
Quadratic Programming Feature Selection; Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, Carlos Santa Cruz; 11(Apr):1491--1516, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rodriguez-lujan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rodriguez-lujan10a.html
</link>
<description>
Identifying a subset of features that preserves classification accuracy is a problem of growing importance, because of the increasing size and dimensionality of real-world data sets.  We propose a new feature selection method, named Quadratic Programming Feature Selection (QPFS), that reduces the task to a quadratic optimization problem.  In order to limit the computational complexity of solving the optimization problem, QPFS uses the Nystr&#246;m method for approximate matrix diagonalization.  QPFS is thus capable of dealing with very large data sets, for which the use of other methods is computationally expensive.  In experiments with small and medium data sets, the QPFS method leads to
</description>
</item>

<item>
<title>
Training and Testing Low-degree Polynomial Data Mappings via Linear SVM; Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin; 11(Apr):1471--1490, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/chang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/chang10a.html
</link>
<description>
Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues.  The approach enjoys fast training and testing, and may sometimes achieve accuracy close to that of using highly nonlinear kernels.  Empirical experiments show that the proposed method is useful for certain large-scale data sets.  We successfully apply the proposed
</description>
</item>

<item>
<title>
Characterization, Stability and Convergence of Hierarchical Clustering Methods; Gunnar Carlsson, Facundo M&#233;moli; 11(Apr):1425--1470, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/carlsson10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/carlsson10a.html
</link>
<description>
We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods.
</description>
</item>

<item>
<title>
Consistent Nonparametric Tests of Independence; Arthur Gretton, L&#225;szl&#243; Gy&#246;rfi; 11(Apr):1391--1423, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gretton10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gretton10a.html
</link>
<description>
Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described.  Two of the associated test statistics (L_1, log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions.  A third test statistic is defined as a kernel-based independence measure.  Two kinds of tests are provided.  Distribution-free strongly consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size.  Asymptotically &#945;-level tests are obtained from the limiting distribution of the test statistics.
</description>
</item>

<item>
<title>
Learning Translation Invariant Kernels for Classification; Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi; 11(Apr):1353--1390, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ghiasi-shirazi10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ghiasi-shirazi10a.html
</link>
<description>
Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on
</description>
</item>

<item>
<title>
Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels; Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian; 11(Apr):1323--1351, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/donmez10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/donmez10a.html
</link>
<description>
Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel  unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data.
</description>
</item>

<item>
<title>
Learning From Crowds; Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, Linda Moy; 11(Apr):1297--1322, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/raykar10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/raykar10a.html
</link>
<description>
For many supervised learning tasks it may be infeasible (or very expensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy) labels from multiple experts or annotators. In practice, there is  a substantial amount of disagreement among the annotators, and hence it is of great practical interest to address conventional supervised learning problems in this scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels.
</description>
</item>

<item>
<title>
Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation; Vicen&#231; G&#243;mez, Hilbert J. Kappen, Michael Chertkov; 11(Apr):1273--1296, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gomez10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gomez10a.html
</link>
<description>
We introduce novel results for approximate inference on planar graphical models using the loop calculus framework.  The loop calculus (Chertkov and Chernyak, 2006a) allows one to express the exact partition function of a graphical model as a finite sum of terms that can be evaluated once the belief propagation (BP) solution is known.  In general, full summation over all correction terms is intractable.  We develop an algorithm for the approach presented in Chertkov et al. (2008) which represents an efficient truncation scheme on planar graphs and a new representation of the series in terms of Pfaffians of matrices.  We analyze the performance of the algorithm for models with binary variables
</description>
</item>

<item>
<title>
Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation; Miki Aoyagi; 11(Apr):1243--1272, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/aoyagi10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/aoyagi10a.html
</link>
<description>
In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation.  It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b).  The zeta function is defined by using a Kullback function.  We use two methods to obtain  the maximum pole: a new eigenvalue analysis method and a recursive blowing up process.  We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models.
</description>
</item>

<item>
<title>
Graph Kernels; S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt; 11(Apr):1201--1242, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/vishwanathan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/vishwanathan10a.html
</link>
<description>
We present a unified framework to study graph kernels, special cases of which include the random walk (G&#228;rtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mah&#233; et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing
</description>
</item>

<item>
<title>
A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning; Jin Yu, S.V.N. Vishwanathan, Simon G&#252;nter, Nicol N. Schraudolph; 11(Mar):1145--1200, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yu10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yu10a.html
</link>
<description>
We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value.  We apply its memory-limited variant (subLBFGS) to L_2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search
</description>
</item>

<item>
<title>
SFO: A Toolbox for Submodular Function Optimization; Andreas Krause; 11(Mar):1141--1144, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/krause10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/krause10a.html
</link>
<description>
In recent years, a fundamental problem structure has emerged as very useful in a variety of machine learning applications: Submodularity is an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems.  We present SFO, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions. A tutorial script illustrates the application of submodularity to machine learning and AI problems such as feature selection, clustering, inference and optimized
</description>
</item>

<item>
<title>
Continuous Time Bayesian Network Reasoning and Learning Engine; Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu; 11(Mar):1137--1140, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shelton10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shelton10a.html
</link>
<description>
We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE).  A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuous-time Markov process.  This software provides libraries and programs for most of the algorithms developed for CTBNs.  For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data.  For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern.  Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence.
</description>
</item>

<item>
<title>
Large Scale Online Learning of Image Similarity Through Ranking; Gal Chechik, Varun Sharma, Uri Shalit, Samy Bengio; 11(Mar):1109--1135, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/chechik10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/chechik10a.html
</link>
<description>
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object.  Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large data sets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space
</description>
</item>

<item>
<title>
Analysis of Multi-stage Convex Relaxation for Sparse Regularization; Tong Zhang; 11(Mar):1081--1107, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/zhang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/zhang10a.html
</link>
<description>
We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution.  Convex relaxation such as L_1-regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality.  This paper tries to remedy the above gap between theory and practice.  In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions.
</description>
</item>

<item>
<title>
Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes; Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright; 11(Mar):1043--1080, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ravikumar10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ravikumar10a.html
</link>
<description>
The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on "tree-based" linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences.  As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems.  Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman
</description>
</item>

<item>
<title>
Kronecker Graphs: An Approach to Modeling Networks; Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani; 11(Feb):985--1042, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/leskovec10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/leskovec10a.html
</link>
<description>
How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time.  Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural
</description>
</item>

<item>
<title>
Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data; Gideon S. Mann, Andrew McCallum; 11(Feb):955--984, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mann10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mann10a.html
</link>
<description>
In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust,  scalable method for semi-supervised training using weakly-labeled data.  GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data.  This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields.  Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models.
</description>
</item>

<item>
<title>
On Spectral Learning; Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil; 11(Feb):935--953, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/argyriou10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/argyriou10a.html
</link>
<description>
In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate
</description>
</item>

<item>
<title>
On Learning with Integral Operators; Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito; 11(Feb):905--934, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rosasco10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rosasco10a.html
</link>
<description>
A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the  quality of such approximations.  The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in  spectral approximation.  2. Using these methods we provide several new results for estimating spectral properties of the
</description>
</item>

<item>
<title>
Image Denoising with Kernels Based on Natural Image Relations; Valero Laparra, Juan Guti&#233;rrez, Gustavo Camps-Valls, Jes&#250;s Malo; 11(Feb):873--903, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/laparra10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/laparra10a.html
</link>
<description>
A successful class of image denoising methods is based on Bayesian approaches working in wavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analytical estimates can be obtained only for particular combinations of analytical models of signal and noise, thus precluding their straightforward extension to deal with other arbitrary noise sources.  In this paper, we propose an alternative non-explicit way to take into account the relations among natural image wavelet coefficients for denoising: we use support vector regression (SVR) in the wavelet domain to enforce these relations
</description>
</item>

<item>
<title>
A Streaming Parallel Decision Tree Algorithm; Yael Ben-Haim, Elad Tom-Tov; 11(Feb):849--872, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ben-haim10a.html
</link>
<description>
We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm's accuracy.  The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that
</description>
</item>

<item>
<title>
Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models; Fang-Lan Huang, Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin; 11(Feb):815--848, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/huang10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/huang10a.html
</link>
<description>
Maximum entropy (Maxent) is useful in natural language processing and many other areas.  Iterative scaling (IS) methods are among the most popular approaches to solving Maxent.  With many variants of IS methods, it is difficult to understand them and see the differences.  In this paper, we create a general and unified framework for iterative scaling methods. This framework also connects iterative scaling and coordinate descent methods.  We prove general convergence results for IS methods and analyze their computational complexity. Based on the proposed framework, we extend a coordinate descent method for linear SVM to Maxent. Results show that it is faster than existing iterative scaling methods.
</description>
</item>

<item>
<title>
Stability Bounds for Stationary &#966;-mixing and &#946;-mixing Processes; Mehryar Mohri, Afshin Rostamizadeh; 11(Feb):789--814, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mohri10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mohri10a.html
</link>
<description>
Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence.  This paper
</description>
</item>

<item>
<title>
Maximum Relative Margin and Data-Dependent Regularization; Pannagadatta K. Shivaswamy, Tony Jebara; 11(Feb):747--788, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shivaswamy10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shivaswamy10a.html
</link>
<description>
Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions.  This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin
</description>
</item>

<item>
<title>
PyBrain; Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas R&#252;ckstie&#223;, J&#252;rgen Schmidhuber; 11(Feb):743--746, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/schaul10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/schaul10a.html
</link>
<description>
PyBrain is a versatile machine learning library for Python. Its goal is to provide flexible, easy-to-use yet still powerful algorithms for machine learning tasks, including a variety of predefined environments and benchmarks to test and compare algorithms.  Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks.
</description>
</item>

<item>
<title>
A Fast Hybrid Algorithm for Large-Scale l_1-Regularized Logistic Regression; Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda; 11(Feb):713--741, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/shi10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/shi10a.html
</link>
<description>
l_1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of l_1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning.  When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable l_1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining
</description>
</item>

<item>
<title>
On the Rate of Convergence of the Bagged Nearest Neighbor Estimate; G&#233;rard Biau, Fr&#233;d&#233;ric C&#233;rou, Arnaud Guyader; 11(Feb):687--712, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/biau10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/biau10a.html
</link>
<description>
Bagging is a simple way to combine estimates in order to improve their performance. This method, suggested by Breiman in 1996, proceeds by resampling from the original data set, constructing a predictor from each subsample, and deciding by combining. By bagging an n-sample, the crude nearest neighbor regression estimate is turned into a consistent weighted nearest neighbor regression estimate, which is amenable to statistical analysis. Letting the resampling size k_n grow appropriately with n, it is shown that this estimate may achieve optimal rate of convergence, independently from the fact that resampling is done with or without replacement. Since the estimate with the optimal rate of convergence
</description>
</item>

<item>
<title>
Second-Order Bilinear Discriminant Analysis; Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra; 11(Feb):665--685, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/christoforou10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/christoforou10a.html
</link>
<description>
Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides a unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model.
</description>
</item>

<item>
<title>
Error-Correcting Output Codes Library; Sergio Escalera, Oriol Pujol, Petia Radeva; 11(Feb):661--664, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/escalera10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/escalera10a.html
</link>
<description>
In this paper, we present an open source Error-Correcting Output Codes (ECOC) library. The ECOC framework is a powerful tool to deal with multi-class categorization problems. This library contains both state-of-the-art coding (one-versus-one, one-versus-all, dense random, sparse random, DECOC, forest-ECOC, and ECOC-ONE) and decoding designs (hamming, euclidean, inverse hamming, laplacian, &#946;-density, attenuated, loss-based, probabilistic kernel-based, and loss-weighted) with the parameters defined by the authors, as well as the option to include your own coding, decoding, and base classifier.
</description>
</item>

<item>
<title>
Why Does Unsupervised Pre-training Help Deep Learning?; Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio; 11(Feb):625--660, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/erhan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/erhan10a.html
</link>
<description>
Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets.  The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is
</description>
</item>

<item>
<title>
A Rotation Test to Verify Latent Structure; Patrick O. Perry, Art B. Owen; 11(Feb):603--624, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/perry10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/perry10a.html
</link>
<description>
In multivariate regression models we have the opportunity to look for hidden structure unrelated to the observed predictors. However, when one fits a model involving such latent variables it is important to be able to tell if the structure is real, or just an artifact of correlation in the regression errors.  We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed to rotate orthogonally to the column space of the regression model. We find that only non-Gaussian latent variables are detectable, a finding that parallels a well known phenomenon in independent components analysis. We base our test on a measure
</description>
</item>

<item>
<title>
On Finding Predictors for Arbitrary Families of Processes; Daniil Ryabko; 11(Feb):581--602, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ryabko10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ryabko10a.html
</link>
<description>
The problem is sequence prediction in the following setting.  A sequence x_1,...,x_n,... of discrete-valued observations is generated according to some unknown probabilistic law (measure) &#956;. After observing each outcome, it is required to give the conditional probabilities of the next observation.  The measure  &#956; belongs to an arbitrary but known class C of stochastic process measures.  We are interested in predictors &#961; whose conditional probabilities converge (in some sense) to the "true" &#956;-conditional probabilities, if any &#956;&#8712;C is chosen to generate the sequence.  The contribution of this work is in characterizing the families C for which such predictors exist,
</description>
</item>

<item>
<title>
Approximate Tree Kernels; Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert M&#252;ller; 11(Feb):555--580, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/rieck10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/rieck10a.html
</link>
<description>
Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable.  In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for
</description>
</item>

<item>
<title>
Generalized Power Method for Sparse Principal Component Analysis; Michel Journ&#233;e, Yurii Nesterov, Peter Richt&#225;rik, Rodolphe Sepulchre; 11(Feb):517--553, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/journee10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/journee10a.html
</link>
<description>
In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient
</description>
</item>

<item>
<title>
Classification Using Geometric Level Sets; Kush R. Varshney, Alan S. Willsky; 11(Feb):491--516, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/varshney10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/varshney10a.html
</link>
<description>
A variational level set method is developed for the supervised classification problem.  Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary.  This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its &#949;-entropy.  For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions.  Geometric level
</description>
</item>

<item>
<title>
Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization; Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, Samuel Kaski; 11(Feb):451--490, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/venna10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/venna10a.html
</link>
<description>
Nonlinear dimensionality reduction methods are often used to visualize high-dimensional data, although the existing methods have been designed for other related tasks such as manifold learning. It has been difficult to assess the quality of visualizations since the task has not been well-defined. We give a rigorous definition for a specific visualization task, resulting in quantifiable goodness measures and new visualization methods. The task is information retrieval given the visualization: to find similar data based on the similarities shown on the display. The fundamental tradeoff between precision and recall of information retrieval can then be quantified in visualizations as well. The user
</description>
</item>

<item>
<title>
Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting; Philippos Mordohai, G&#233;rard Medioni; 11(Jan):411--450, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mordohai10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mordohai10a.html
</link>
<description>
We address instance-based learning from a perceptual organization standpoint and present methods for dimensionality estimation, manifold learning and function approximation. Under our approach, manifolds in high-dimensional spaces are inferred by estimating geometric relationships among the input instances. Unlike conventional manifold learning, we do not perform dimensionality reduction, but instead perform all operations in the original input space. For this purpose we employ a novel formulation of tensor voting, which allows an N-D implementation. Tensor voting is a perceptual organization framework that has mostly been applied to computer vision problems.  Analyzing the estimated local structure
</description>
</item>

<item>
<title>
A Convergent Online Single Time Scale Actor Critic Algorithm; Dotan Di Castro, Ron Meir; 11(Jan):367--410, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/dicastro10a.html
</link>
<description>
Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward.  Linear function approximation is used by the critic in order to estimate the value function and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic
</description>
</item>

<item>
<title>
Bundle Methods for Regularized Risk Minimization; Choon Hui Teo, S.V.N. Vishwanathan, Alex J. Smola, Quoc V. Le; 11(Jan):311--365, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/teo10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/teo10a.html
</link>
<description>
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L_1 and L_2 penalties. In addition to the unified framework we present tight convergence bounds, which
</description>
</item>

<item>
<title>
Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure; Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano; 11(Jan):285--310, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/kojima10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/kojima10a.html
</link>
<description>
We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure.  The previously derived constrained optimal search (COS) remains limited even for sparse super-structures.  To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them.  Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs.  Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained
</description>
</item>

<item>
<title>
Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):235--284, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/aliferis10b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/aliferis10b.html
</link>
<description>
In part I of this work we introduced and evaluated the Generalized Local Learning (GLL) framework for producing local causal and Markov blanket induction algorithms. In the present second part we analyze the behavior of GLL algorithms and provide extensions to the core methods. Specifically, we investigate the empirical convergence of GLL to the true local neighborhood as a function of sample size.  Moreover, we study how predictivity improves with increasing sample size. Then we investigate how sensitive the algorithms are to multiple statistical testing, especially in the presence of many irrelevant features. Next we discuss the role of the algorithm parameters and also show that Markov blanket
</description>
</item>

<item>
<title>
Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation; Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos; 11(Jan):171--234, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/aliferis10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/aliferis10a.html
</link>
<description>
We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples.  The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms.  The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare
</description>
</item>

<item>
<title>
An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data; Yufeng Ding, Jeffrey S. Simonoff; 11(Jan):131--170, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/ding10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/ding10a.html
</link>
<description>
There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the context of classification trees, the relationship between the missingness and the dependent variable, as well as the existence or non-existence of missing values in the testing data, are the most helpful criteria to distinguish different missing data methods. In particular, separate class is clearly the
</description>
</item>

<item>
<title>
Classification Methods with Reject Option Based on Convex Risk Minimization; Ming Yuan, Marten Wegkamp; 11(Jan):111--130, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/yuan10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/yuan10a.html
</link>
<description>
In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Because the natural loss function is non-convex, empirical risk minimization easily becomes infeasible; the paper therefore proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition.
</description>
</item>

<item>
<title>
On-Line Sequential Bin Packing; Andr&#225;s Gy&#246;rgy, G&#225;bor Lugosi, Gy&#246;rgy Ottucs&#224;k; 11(Jan):89--109, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/gyorgy10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/gyorgy10a.html
</link>
<description>
We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items.  Special attention is paid to reference strategies
</description>
</item>

<item>
<title>
Model Selection: Beyond the Bayesian/Frequentist Divide; Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley; 11(Jan):61--87, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/guyon10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/guyon10a.html
</link>
<description>
The principle of parsimony, also known as "Ockham's razor", has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms.  We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches.
</description>
</item>

<item>
<title>
Online Learning for Matrix Factorization and Sparse Coding; Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro; 11(Jan):19--60, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/mairal10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/mairal10a.html
</link>
<description>
Sparse coding--that is, modelling data vectors as sparse linear combinations of basis elements--is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally
</description>
</item>

<item>
<title>
An Efficient Explanation of Individual Classifications using Game Theory; Erik Štrumbelj, Igor Kononenko; 11(Jan):1--18, 2010.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v11/strumbelj10a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v11/strumbelj10a.html
</link>
<description>
We present a general method for explaining individual predictions of classification models. The method is based on fundamental concepts from coalitional game theory and predictions are explained with contributions of individual feature values. We overcome the method's initial exponential time complexity with a sampling-based approximation. In the experimental part of the paper we use the developed method on models generated by several well-known machine learning algorithms on both synthetic and real-world data sets. The results demonstrate that the method is efficient and that the explanations are intuitive and useful.
</description>
</item>

<item>
<title>
A Survey of Accuracy Evaluation Metrics of Recommendation Tasks; Asela Gunawardana, Guy Shani; 10(Dec):2935--2962, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/gunawardana09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/gunawardana09a.html
</link>
<description>
Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric
</description>
</item>

<item>
<title>
Efficient Online and Batch Learning Using Forward Backward Splitting; John Duchi, Yoram Singer; 10(Dec):2899--2934, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/duchi09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/duchi09a.html
</link>
<description>
We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as l_1. We derive
</description>
</item>

<item>
<title>
Online Learning with Samples Drawn from Non-identical Distributions; Ting Hu, Ding-Xuan Zhou; 10(Dec):2873--2898, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hu09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hu09a.html
</link>
<description>
Learning algorithms are based on samples which are often drawn independently from an identical distribution (i.i.d.). In this paper we consider a different setting with samples drawn according to a non-identical sequence of probability distributions. Each time a sample is drawn from a different distribution. In this setting we investigate a fully online learning algorithm associated with a general convex loss function and a reproducing kernel Hilbert space (RKHS). Error analysis is conducted under the assumption that the sequence of marginal distributions converges polynomially in the dual of a H&#246;lder space. For regression with least squares or insensitive loss, learning rates are given
</description>
</item>

<item>
<title>
Adaptive False Discovery Rate Control under Independence and Dependence; Gilles Blanchard, &#201;tienne Roquain; 10(Dec):2837--2871, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/blanchard09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/blanchard09a.html
</link>
<description>
In the context of multiple hypothesis testing, the proportion &#928;_0 of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficiency is called adaptive.  In this paper, we focus on the issue of false discovery rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR.  In a first part, assuming independence of the p-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation
</description>
</item>

<item>
<title>
Cautious Collective Classification; Luke K. McDowell, Kalyan Moy Gupta, David W. Aha; 10(Dec):2777--2836, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/mcdowell09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/mcdowell09a.html
</link>
<description>
Many collective classification (CC) algorithms have been shown to increase accuracy when instances are interrelated. However, CC algorithms must be carefully applied because their use of estimated labels can in some cases decrease accuracy.  In this article, we show that managing this label uncertainty through cautious algorithmic behavior is essential to achieving maximal, robust performance.  First, we describe cautious inference and explain how four well-known families of CC algorithms can be parameterized to use varying degrees of such caution.  Second, we introduce cautious learning and show how it can be used to improve the performance of almost any CC algorithm, with or without cautious
</description>
</item>

<item>
<title>
Reproducing Kernel Banach Spaces for Machine Learning; Haizhang Zhang, Yuesheng Xu, Jun Zhang; 10(Dec):2741--2775, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/zhang09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/zhang09b.html
</link>
<description>
We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semi-inner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established.
</description>
</item>

<item>
<title>
Learning Halfspaces with Malicious Noise; Adam R. Klivans, Philip M. Long, Rocco A. Servedio; 10(Dec):2715--2740, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/klivans09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/klivans09a.html
</link>
<description>
We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labels and the underlying distribution of examples. Our algorithms can tolerate malicious noise rates exponentially larger than previous work in terms of the dependence on the dimension n, and succeed for the fairly broad class of all isotropic log-concave distributions.  We give poly(n, 1/&#949;)-time algorithms for solving the following problems to accuracy &#949;: Learning origin-centered halfspaces in R^n with respect to the uniform distribution on the unit ball with malicious noise rate &#951; = &#937;(&#949;^2 / log(n/&#949;)). (The best previous result
</description>
</item>

<item>
<title>
Structure Spaces; Brijnesh J. Jain, Klaus Obermayer; 10(Nov):2667--2714, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/jain09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/jain09a.html
</link>
<description>
Finite structures such as point patterns, strings, trees, and graphs occur as "natural" representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces
</description>
</item>

<item>
<title>
Bounded Kernel-Based Online Learning; Francesco Orabona, Joseph Keshet, Barbara Caputo; 10(Nov):2643--2666, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/orabona09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/orabona09a.html
</link>
<description>
A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory
</description>
</item>

<item>
<title>
DL-Learner: Learning Concepts in Description Logics; Jens Lehmann; 10(Nov):2639--2642, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/lehmann09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/lehmann09a.html
</link>
<description>
In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems.  It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service.
</description>
</item>

<item>
<title>
Hash Kernels for Structured Data; Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan; 10(Nov):2615--2637, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/shi09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/shi09a.html
</link>
<description>
  We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs.
</description>
</item>

<item>
<title>
Learning When Concepts Abound; Omid Madani, Michael Connor, Wiley Greiner; 10(Nov):2571--2613, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/madani09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/madani09a.html
</link>
<description>
  Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond.  We investigate the learning of sparse class indices to address this challenge.  An index is a mapping from features to classes.  We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines.  We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach
</description>
</item>

<item>
<title>
Maximum Entropy Discrimination Markov Networks; Jun Zhu, Eric P. Xing; 10(Nov):2531--2569, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/zhu09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/zhu09a.html
</link>
<description>
The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy
</description>
</item>

<item>
<title>
When Is There a Representer Theorem?  Vector Versus Matrix Regularizers; Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil; 10(Nov):2507--2529, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/argyriou09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/argyriou09a.html
</link>
<description>
We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the L2 norm, then the learned vector is a linear combination of the input data. This result, known as the representer theorem, lies at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, in the case of differentiable regularizers.  We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a
</description>
</item>

<item>
<title>
Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression; Saharon Rosset; 10(Nov):2473--2505, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rosset09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rosset09a.html
</link>
<description>
We show how to follow the path of cross validated solutions to families of regularized optimization problems, defined by a combination of a parameterized loss function and a regularization term. A primary example is kernel quantile regression, where the parameter of the loss function is the quantile being estimated. Even though the bi-level optimization problem we encounter for every quantile is non-convex, the manner in which the optimal cross-validated solution evolves with the parameter of the loss function allows tracking of this solution. We prove this property, construct the resulting algorithm, and demonstrate it on real and artificial data. This algorithm allows us to efficiently
</description>
</item>

<item>
<title>
Prediction With Expert Advice For The Brier Game; Vladimir Vovk, Fedor Zhdanov; 10(Nov):2445--2471, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/vovk09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/vovk09a.html
</link>
<description>
We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it.  The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts.  The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set.
</description>
</item>

<item>
<title>
Reinforcement Learning in Finite MDPs: PAC Analysis; Alexander L. Strehl, Lihong Li, Michael L. Littman; 10(Nov):2413--2444, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/strehl09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/strehl09a.html
</link>
<description>
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples.  These "PAC-MDP" algorithms include the well-known E^3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm.  We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework.  A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
</description>
</item>

<item>
<title>
Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions; Lisa Hellerstein, Bernard Rosell, Eric Bach, Soumya Ray, David Page; 10(Oct):2374--2411, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hellerstein09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hellerstein09a.html
</link>
<description>
A Boolean function f is correlation immune if each input variable is independent of the output, under the uniform distribution on inputs. For example, the parity function is correlation immune. We consider the problem of identifying relevant variables of a correlation immune function, in the presence of irrelevant variables. We address this problem in two different contexts. First, we analyze Skewing, a heuristic method that was developed to improve the ability of greedy decision tree algorithms to identify relevant variables of correlation immune Boolean functions, given examples drawn from the uniform distribution (Page and Ray, 2003). We present theoretical results revealing both
</description>
</item>

<item>
<title>
Estimating Labels from Label Proportions; Novi Quadrianto, Alex J. Smola, Tib&#x00E9;rio S. Caetano, Quoc V. Le; 10(Oct):2349--2374, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/quadrianto09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/quadrianto09a.html
</link>
<description>
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, possibly with known label proportions. This problem occurs in areas like e-commerce, politics, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.
</description>
</item>

<item>
<title>
Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors; Mathias Drton, Michael Eichler, Thomas S. Richardson; 10(Oct):2329--2348, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/drton09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/drton09a.html
</link>
<description>
In recursive linear models, the multivariate normal joint distribution of all variables exhibits a dependence structure induced by a recursive (or acyclic) system of linear structural equations. These linear models have a long tradition and appear in seemingly unrelated regressions, structural equation modelling, and approaches to causal inference. They are also related to Gaussian graphical models via a classical representation known as a path diagram. Despite the models' long history, a number of problems remain open. In this paper, we address the problem of computing maximum likelihood estimates in the subclass of 'bow-free' recursive linear models. The term 'bow-free' refers to
</description>
</item>

<item>
<title>
The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs; Han Liu, John Lafferty, Larry Wasserman; 10(Oct):2295--2328, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/liu09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/liu09a.html
</link>
<description>
Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula---or "nonparanormal"---for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method's theoretical properties, and show that it works well in many examples.
</description>
</item>

<item>
<title>
Learning Nondeterministic Classifiers; Juan Jos&#x00E9; del Coz, Jorge D&#x00ED;ez, Antonio Bahamonde; 10(Oct):2273--2293, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/delcoz09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/delcoz09a.html
</link>
<description>
Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, these kinds of classifiers can be considered as Information Retrieval (IR) procedures. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide
</description>
</item>

<item>
<title>
The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List; Cynthia Rudin; 10(Oct):2233--2271, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rudin09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rudin09b.html
</link>
<description>
We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This "push" near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose lp-norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around
</description>
</item>

<item>
<title>
Margin-based Ranking and an Equivalence between AdaBoost and RankBoost; Cynthia Rudin, Robert E. Schapire; 10(Oct):2193--2232, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rudin09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rudin09a.html
</link>
<description>
We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to "approximate coordinate ascent boosting." Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is as good as
</description>
</item>

<item>
<title>
Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization; Vojt&#x011B;ch Franc, S&#246;ren Sonnenburg; 10(Oct):2157--2192, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/franc09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/franc09a.html
</link>
<description>
We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to an &#949;-precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVM^light, SVM^perf and BMRM, achieving a speedup factor of more than 1,200 over SVM^light on some data sets and a speedup factor of 29 over SVM^perf, while obtaining the same precise support vector solution.
</description>
</item>

<item>
<title>
Discriminative Learning Under Covariate Shift; Steffen Bickel, Michael Br&#252;ckner, Tobias Scheffer; 10(Sep):2137--2155, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/bickel09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/bickel09a.html
</link>
<description>
We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution---problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither the training nor the test distribution is modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under
</description>
</item>

<item>
<title>
RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments; Brian Tanner, Adam White; 10(Sep):2133--2136, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/tanner09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/tanner09a.html
</link>
<description>
RL-Glue is a standard, language-independent software package for reinforcement-learning experiments. The standardization provided by RL-Glue facilitates code sharing and collaboration. Code sharing reduces the need to re-engineer tasks and experimental apparatus, both common barriers to comparatively evaluating new ideas in the context of the literature. Our software features a minimalist interface and works with several languages and computing platforms. RL-Glue compatibility can be extended to any programming language that supports network socket communication. RL-Glue has been used to teach classes and to run international competitions, and is currently used by several other open-source software and hardware projects.
</description>
</item>

<item>
<title>
Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods; Christian Rieger, Barbara Zwicknagl; 10(Sep):2115--2132, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/rieger09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/rieger09a.html
</link>
<description>
We introduce a new technique for the analysis of kernel-based regression problems. The basic tools are sampling inequalities which apply to all machine learning problems involving penalty terms induced by kernels related to Sobolev spaces. They lead to explicit deterministic results concerning the worst case behaviour of &#949;- and &#957;-SVRs. Using these, we show how to adjust regularization parameters to get best possible approximation orders for regression. The results are illustrated by some numerical examples.
</description>
</item>

<item>
<title>
An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems; Luciana Ferrer, Kemal S&#246;nmez, Elizabeth Shriberg; 10(Sep):2079--2114, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/ferrer09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/ferrer09a.html
</link>
<description>
We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average
</description>
</item>

<item>
<title>
Evolutionary Model Type Selection for Global Surrogate Modeling; Dirk Gorissen, Tom Dhaene, Filip De Turck; 10(Sep):2039--2078, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/gorissen09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/gorissen09a.html
</link>
<description>
Due to the scale and computational complexity of currently used simulation codes, global surrogate models (metamodels) have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist
</description>
</item>

<item>
<title>
Ultrahigh Dimensional Feature Selection: Beyond The Linear Model; Jianqing Fan, Richard Samworth, Yichao Wu; 10(Sep):2013--2038, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/fan09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/fan09a.html
</link>
<description>
Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan &#38; Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan &#38; Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when
</description>
</item>

<item>
<title>
Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection; Jie Chen, Haw-ren Fang, Yousef Saad; 10(Sep):1989--2012, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/chen09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/chen09b.html
</link>
<description>
Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes &#920;(dn^2) time for n data points in the d-dimensional Euclidean space. We propose two divide-and-conquer methods for computing an approximate kNN graph in &#920;(dn^t) time for high dimensional data (large d). The exponent t &#8712; (1,2) is an increasing function of an internal parameter &#945; which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained
</description>
</item>

<item>
<title>
Provably Efficient Learning with Typed Parametric Models; Emma Brunskill, Bethany R. Leffler, Lihong Li, Michael L. Littman, Nicholas Roy; 10(Aug):1955--1988, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/brunskill09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/brunskill09a.html
</link>
<description>
To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate
</description>
</item>

<item>
<title>
Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training; Kristian Woodsend, Jacek Gondzio; 10(Aug):1937--1953, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/woodsend09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/woodsend09a.html
</link>
<description>
Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from
</description>
</item>

<item>
<title>
Learning Approximate Sequential Patterns for Classification; Zeeshan Syed, Piotr Indyk, John Guttag; 10(Aug):1913--1936, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/syed09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/syed09a.html
</link>
<description>
In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to
</description>
</item>

<item>
<title>
Learning Acyclic Probabilistic Circuits Using Test Paths; Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin; 10(Aug):1881--1911, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/angluin09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/angluin09a.html
</link>
<description>
We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that
</description>
</item>

<item>
<title>
CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning; Roberto Esposito, Daniele P. Radicioni; 10(Aug):1851--1880, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/esposito09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/esposito09a.html
</link>
<description>
The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels.  In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity. We provide theoretical grounding together with solid empirical results supporting
</description>
</item>

<item>
<title>
Nonlinear Models Using Dirichlet Process Mixtures; Babak Shahbaba, Radford Neal; 10(Aug):1829--1850, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/shahbaba09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/shahbaba09a.html
</link>
<description>
We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on
</description>
</item>

<item>
<title>
Distributed Algorithms for Topic Models; David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling; 10(Aug):1801--1828, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/newman09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/newman09a.html
</link>
<description>
We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is
</description>
</item>

<item>
<title>
Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning; Halbert White, Karim Chalak; 10(Aug):1759--1799, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/white09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/white09a.html
</link>
<description>
Judea Pearl's Causal Model is a rich framework that provides deep insight into the nature of causal relations. As yet, however, the Pearl Causal Model (PCM) has had a lesser impact on economics or econometrics than on other disciplines. This may be due in part to the fact that the PCM is not as well suited to analyzing structures that exhibit features of central interest to economists and econometricians: optimization, equilibrium, and learning. We offer the settable systems framework as an extension of the PCM that permits causal discourse in systems embodying optimization, equilibrium, and learning. Because these are common features of physical, natural, or social systems, our framework may prove generally useful for
</description>
</item>

<item>
<title>
Dlib-ml: A Machine Learning Toolkit; Davis E. King; 10(Jul):1755--1758, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/king09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/king09a.html
</link>
<description>
There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy
</description>
</item>

<item>
<title>
SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent; Antoine Bordes, L&#233;on Bottou, Patrick Gallinari; 10(Jul):1737--1754, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/bordes09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/bordes09a.html
</link>
<description>
The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the "Wild Track" of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).
</description>
</item>

<item>
<title>
Learning Permutations with Exponential Weights; David P. Helmbold, Manfred K. Warmuth; 10(Jul):1705--1736, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/helmbold09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/helmbold09a.html
</link>
<description>
We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prov
</description>
</item>

<item>
<title>
Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification; Eitan Greenshtein, Junyong Park; 10(Jul):1687--1704, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/greenshtein09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/greenshtein09a.html
</link>
<description>
We consider the problem of classification using a high dimensional feature space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features' means. This reduction arises because only the means of a few selected variables need to be estimated. We also consider the design of naive Bayes classifiers. We show that
</description>
</item>

<item>
<title>
Transfer Learning for Reinforcement Learning Domains: A Survey; Matthew E. Taylor, Peter Stone; 10(Jul):1633--1685, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/taylor09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/taylor09a.html
</link>
<description>
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework
</description>
</item>

<item>
<title>
Marginal Likelihood Integrals for Mixtures of Independence Models; Shaowei Lin, Bernd Sturmfels, Zhiqiang Xu; 10(Jul):1611--1631, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/lin09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/lin09a.html
</link>
<description>
Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of Segre-Veronese varieties.
</description>
</item>

<item>
<title>
Learning Linear Ranking Functions for Beam Search with Application to Planning; Yuehua Xu, Alan Fern, Sungwook Yoon; 10(Jul):1571--1610, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/xu09c.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/xu09c.html
</link>
<description>
Beam search is commonly used to help maintain tractability in large search spaces at the expense of completeness and optimality. Here we study supervised learning of linear ranking functions for controlling beam search. The goal is to learn ranking functions that allow for beam search to perform nearly as well as unconstrained search, and hence gain computational efficiency without seriously sacrificing optimality. In this paper, we develop theoretical aspects of this learning problem and investigate the application of this framework to learning in the context of automated planning. We first study the computationa
</description>
</item>

<item>
<title>
Bayesian Network Structure Learning by Recursive Autonomy Identification; Raanan Yehezkel, Boaz Lerner; 10(Jul):1527--1570, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/yehezkel09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/yehezkel09a.html
</link>
<description>
We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous sub-structure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulting undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires
</description>
</item>

<item>
<title>
Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks; Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras; 10(Jul):1511--1526, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/slobodianik09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/slobodianik09a.html
</link>
<description>
In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion.
</description>
</item>

<item>
<title>
Robustness and Regularization of Support Vector Machines; Huan Xu, Constantine Caramanis, Shie Mannor; 10(Jul):1485--1510, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/xu09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/xu09b.html
</link>
<description>
We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection against noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new
</description>
</item>

<item>
<title>
Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks; Jean Hausser, Korbinian Strimmer; 10(Jul):1469--1484, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hausser09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hausser09a.html
</link>
<description>
We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by
</description>
</item>

<item>
<title>
Classification with Gaussians and Convex Loss; Dao-Hong Xiang, Ding-Xuan Zhou; 10(Jul):1447--1468, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/xiang09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/xiang09a.html
</link>
<description>
This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is
</description>
</item>

<item>
<title>
A Least-squares Approach to Direct Importance Estimation; Takafumi Kanamori, Shohei Hido, Masashi Sugiyama; 10(Jul):1391--1445, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/kanamori09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/kanamori09a.html
</link>
<description>
We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show
</description>
</item>

<item>
<title>
Model Monitor (M2): Evaluating, Comparing, and Monitoring Models; Troy Raeder, Nitesh V. Chawla; 10(Jul):1387--1390, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/raeder09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/raeder09a.html
</link>
<description>
This paper presents Model Monitor (M2), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired.
</description>
</item>

<item>
<title>
Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination; Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola; 10(Jul):1341--1366, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/tuv09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/tuv09a.html
</link>
<description>
Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach.
</description>
</item>

<item>
<title>
A Parameter-Free Classification Method for Large Scale Learning; Marc Boull&#233;; 10(Jul):1367--1385, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/boulle09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/boulle09a.html
</link>
<description>
With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, who play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models
</description>
</item>

<item>
<title>
Robust Process Discovery with Artificial Negative Events; Stijn Goedertier, David Martens, Jan Vanthienen, Bart Baesens; 10(Jun):1305--1340, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/goedertier09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/goedertier09a.html
</link>
<description>
Process discovery is the automated construction of structured process models from information system event logs. Such event logs often contain positive examples only. Without negative examples, it is a challenge to strike the right balance between recall and specificity, and to deal with problems such as expressiveness, noise, incomplete event logs, or the inclusion of prior knowledge. In this paper, we present a configurable technique that deals with these challenges by representing process discovery as a multi-relational classification problem on event logs supplemented with Artificially Generated Negative Events (AGNEs). This problem formulation allows
</description>
</item>

<item>
<title>
Perturbation Corrections in Approximate Inference: Mixture Modelling Applications; Ulrich Paquet, Ole Winther, Manfred Opper; 10(Jun):1263--1304, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/paquet09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/paquet09a.html
</link>
<description>
Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with
</description>
</item>

<item>
<title>
Incorporating Functional Knowledge in Neural Networks; Charles Dugas, Yoshua Bengio, Fran&#231;ois B&#233;lisle, Claude Nadeau, Ren&#233; Garcia; 10(Jun):1239--1262, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/dugas09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/dugas09a.html
</link>
<description>
Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but that (1) has those properties and (2) is a universal approximator of Lipschitz functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on
</description>
</item>

<item>
<title>
The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models; Ricardo Silva, Zoubin Ghahramani; 10(Jun):1187--1238, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/silva09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/silva09a.html
</link>
<description>
Directed acyclic graphs (DAGs) have been widely used as a representation of conditional independence in machine learning and statistics. Moreover, hidden or latent variables are often an important component of graphical models. However, DAG models suffer from an important limitation: the family of DAGs is not closed under marginalization of hidden variables. This means that in general we cannot use a DAG to represent the independencies over a subset of variables in a larger DAG. Directed mixed graphs (DMGs) are a representation that includes DAGs as a special case, and overcomes this limitation. This paper introduces algorithms for performing Bayesian inference in Gaussian and probit DMG models. An important requirement for
</description>
</item>

<item>
<title>
Multi-task Reinforcement Learning in Partially Observable Stochastic Environments; Hui Li, Xuejun Liao, Lawrence Carin; 10(May):1131--1186, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/li09b.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/li09b.html
</link>
<description>
We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent's behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent's choice of actions is directly based on this conditional distribution, without an intervening model to characterize the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected when following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over
</description>
</item>

<item>
<title>
Universal Kernel-Based Learning with Applications to Regular Languages; Leonid (Aryeh) Kontorovich, Boaz Nadler; 10(May):1095--1129, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/kontorovich09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/kontorovich09a.html
</link>
<description>
We propose a novel framework for supervised learning of discrete concepts. Since the 1970s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and
</description>
</item>

<item>
<title>
An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity; Jose M. Pe&#241;a, Roland Nilsson, Johan Bj&#246;rkegren, Jesper Tegn&#233;r; 10(May):1071--1094, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/pena09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/pena09a.html
</link>
<description>
We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that
</description>
</item>

<item>
<title>
Fourier Theoretic Probabilistic Inference over Permutations; Jonathan Huang, Carlos Guestrin, Leonidas Guibas; 10(May):997--1070, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/huang09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/huang09a.html
</link>
<description>
Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the "low-frequency" terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for
</description>
</item>

<item>
<title>
On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality; Wenxin Jiang; 10(Apr):977--996, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/jiang09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/jiang09a.html
</link>
<description>
The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding's inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding's inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with l1-loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors.
</description>
</item>

<item>
<title>
Nonextensive Information Theoretic Kernels on Measures; Andr&#233; F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, M&#225;rio A. T. Figueiredo; 10(Apr):935--975, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/martins09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/martins09a.html
</link>
<description>
Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon's) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon's information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon's entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce
</description>
</item>

<item>
<title>
Java-ML: A Machine Learning Library; Thomas Abeel, Yves Van de Peer, Yvan Saeys; 10(Apr):931--934, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/abeel09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/abeel09a.html
</link>
<description>
Java-ML is a collection of machine learning and data mining algorithms, which aims to be a readily usable and easily extensible API for both software developers and research scientists. The interfaces for each type of algorithm are kept simple and algorithms strictly follow their respective interface. Comparing different classifiers or clustering algorithms is therefore straightforward, and implementing new algorithms is also easy. The implementations of the algorithms are clearly written, properly documented and can thus be used as a reference. The library is written in Java and is available from http://java-ml.sourceforge.net/ under the GNU GPL license.
</description>
</item>

<item>
<title>
Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods; Holger H&#246;fling, Robert Tibshirani; 10(Apr):883--906, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/hoefling09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/hoefling09a.html
</link>
<description>
We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that
</description>
</item>

<item>
<title>
Stable and Efficient Gaussian Process Calculations; Leslie Foster, Alex Waagen, Nabeela Aijaz, Michael Hurley, Apolonio Luis, Joel Rinsky, Chandrika Satyavolu, Michael J. Way, Paul Gazis, Ashok Srivastava; 10(Apr):857--882, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/foster09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/foster09a.html
</link>
<description>
The use of Gaussian processes can be an effective approach to prediction in a supervised learning environment. For large data sets, the standard Gaussian process approach requires solving very large systems of linear equations, and approximations are required for the calculations to be practical. We will focus on the subset of regressors approximation technique. We will demonstrate that there can be numerical instabilities in a well-known implementation of the technique. We discuss alternate implementations that have better numerical stability properties and can lead to better predictions. Our results will be illustrated by looking at an application involving prediction of galaxy redshift from broadband spectrum data.
</description>
</item>

<item>
<title>
Consistency and Localizability; Alon Zakai, Ya'acov Ritov; 10(Apr):827--856, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/zakai09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/zakai09a.html
</link>
<description>
We show that all consistent learning methods---that is, methods that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y)---are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that
</description>
</item>

<item>
<title>
A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization; Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert; 10(Mar):803--826, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/abernethy09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/abernethy09a.html
</link>
<description>
We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of "users" to a set of possibly desired "objects". In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects---a feature currently lacking in existing regularization-based CF approaches---using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning
</description>
</item>

<item>
<title>
Sparse Online Learning via Truncated Gradient; John Langford, Lihong Li, Tong Zhang; 10(Mar):777--801, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/langford09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/langford09a.html
</link>
<description>
We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: (1) The degree of sparsity is continuous---a parameter controls the rate of sparsification from no sparsification to total sparsification. (2) The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. (3) The approach works well empirically. We apply the approach to several data sets and find that, for data sets with large numbers of features, substantial sparsity is discoverable.
</description>
</item>

<item>
<title>
Similarity-based Classification: Concepts and Algorithms; Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti; 10(Mar):747--776, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/chen09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/chen09a.html
</link>
<description>
This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning.
</description>
</item>

<item>
<title>
Nieme: Large-Scale Energy-Based Models; Francis Maes; 10(Mar):743--746, 2009.
</title>
<guid isPermaLink="true">
http://jmlr.csail.mit.edu/papers/v10/maes09a.html
</guid>
<link>
http://jmlr.csail.mit.edu/papers/v10/maes09a.html
</link>
<description>
In this paper we introduce NIEME, a machine learning library for large-scale classification, regression and ranking. NIEME relies on the framework of energy-based models (LeCun et al., 2006) which unifies several learning algorithms ranging from simple perceptrons to recent models such as the pegasos support vector machine or l1-regularized maximum entropy models. This framework also unifies batch and stochastic learning which are both seen as energy minimization problems. NIEME can hence be used in a wide range of
</description>
</item>

</channel>
</rss>
