http://www.jmlr.org
JMLRJournal of Machine Learning Research
Graph Reduction with Spectral and Cut Guarantees
http://jmlr.org/papers/v20/18-680.html
2019
Can one reduce the size of a graph without significantly altering its basic properties? The graph reduction problem is hereby approached from the perspective of restricted spectral approximation, a modification of the spectral similarity measure used for graph sparsification. This choice is motivated by the observation that restricted approximation carries strong spectral and cut guarantees, and that it implies approximation results for unsupervised learning problems relying on spectral embeddings. The article then focuses on coarsening - the most common type of graph reduction. Sufficient conditions are derived for a small graph to approximate a larger one in the sense of restricted approximation. These findings give rise to algorithms that, compared to both standard and advanced graph reduction methods, find coarse graphs of improved quality, often by a large margin, without sacrificing speed.
Learning Attribute Patterns in High-Dimensional Structured Latent Attribute Models
http://jmlr.org/papers/v20/19-197.html
2019
Structured latent attribute models (SLAMs) are a special family of discrete latent variable models widely used in social and biological sciences. This paper considers the problem of learning significant attribute patterns from a SLAM with potentially high-dimensional configurations of the latent attributes. We address the theoretical identifiability issue, propose a penalized likelihood method for the selection of the attribute patterns, and further establish the selection consistency in such an overfitted SLAM with a diverging number of latent patterns. The good performance of the proposed methodology is illustrated by simulation studies and two real datasets in educational assessments.
Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery
http://jmlr.org/papers/v20/19-020.html
2019
Nonconvex matrix recovery is known to contain no spurious local minima under a restricted isometry property (RIP) with a sufficiently small RIP constant $\delta$. If $\delta$ is too large, however, then counterexamples containing spurious local minima are known to exist. In this paper, we introduce a proof technique that is capable of establishing sharp thresholds on $\delta$ to guarantee the inexistence of spurious local minima. Using the technique, we prove that in the case of a rank-1 ground truth, an RIP constant of $\delta<1/2$ is both necessary and sufficient for exact recovery from any arbitrary initial point (such as a random point). We also prove a local recovery result: given an initial point $x_{0}$ satisfying $f(x_{0})\le(1-\delta)^{2}f(0)$, any descent algorithm that converges to second-order optimality guarantees exact recovery.
Distributed Inference for Linear Support Vector Machine
http://jmlr.org/papers/v20/18-801.html
2019
The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for linear support vector machine (SVM) for the binary classification task. Despite a vast literature on SVM, much less is known about the inferential properties of SVM, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving simple weighted least squares problem. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applying to the entire data set in a single machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches, which is commonly assumed in distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator.
Measuring the Effects of Data Parallelism on Neural Network Training
http://jmlr.org/papers/v20/18-789.html
2019
Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.
An Efficient Two Step Algorithm for High Dimensional Change Point Regression Models Without Grid Search
http://jmlr.org/papers/v20/18-460.html
2019
We propose a two step algorithm based on $\ell_1/\ell_0$ regularization for the detection and estimation of parameters of a high dimensional change point regression model and provide the corresponding rates of convergence for the change point as well as the regression parameter estimates. Importantly, the computational cost of our estimator is only $2\cdotp$Lasso$(n,p)$, where Lasso$(n,p)$ represents the computational burden of one Lasso optimization in a model of size $(n,p)$. In comparison, existing grid search based approaches to this problem require a computational cost of at least $n\cdot Lasso(n,p)$ optimizations. Additionally, the proposed method is shown to be able to consistently detect the case of 'no change', i.e., where no finite change point exists in the model. We allow the true change point parameter $\tau_0$ to possibly move to the boundaries of its parametric space, and the jump size $\|b_0-g_0\|_2$ to possibly diverge as $n$ increases. We then characterize the corresponding effects on the rates of convergence of the change point and regression estimates. In particular, we show that, while an increasing jump size may have a beneficial effect on the change point estimate, however the optimal rate of regression parameter estimates are preserved only upto a certain rate of the increasing jump size. This behavior in the rate of regression parameter estimates is unique to high dimensional change point regression models only. Simulations are performed to empirically evaluate performance of the proposed estimators. The methodology is applied to community level socio-economic data of the U.S., collected from the 1990 U.S. census and other sources.
A Representer Theorem for Deep Neural Networks
http://jmlr.org/papers/v20/18-418.html
2019
We propose to optimize the activation functions of a deep neural network by adding a corresponding functional regularization to the cost function. We justify the use of a second-order total-variation criterion. This allows us to derive a general representer theorem for deep neural networks that makes a direct connection with splines and sparsity. Specifically, we show that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots. The bottom line is that the action of each neuron is encoded by a spline whose parameters (including the number of knots) are optimized during the training procedure. The scheme results in a computational structure that is compatible with existing deep-ReLU, parametric ReLU, APL (adaptive piecewise-linear) and MaxOut architectures. It also suggests novel optimization challenges and makes an explicit link with $\ell_1$ minimization and sparsity-promoting techniques.
Learning Unfaithful $K$-separable Gaussian Graphical Models
http://jmlr.org/papers/v20/18-329.html
2019
The global Markov property for Gaussian graphical models ensures graph separation implies conditional independence. Specifically if a node set $S$ graph separates nodes $u$ and $v$ then $X_u$ is conditionally independent of $X_v$ given $X_S$. The opposite direction need not be true, that is, $X_u \perp X_v \mid X_S$ need not imply $S$ is a node separator of $u$ and $v$. When it does, the relation $X_u \perp X_v \mid X_S$ is called faithful. In this paper we provide a characterization of faithful relations and then provide an algorithm to test faithfulness based only on knowledge of other conditional relations of the form $X_i \perp X_j \mid X_S$. We study two classes of separable Gaussian graphical models, namely, weakly $K$-separable and strongly $K$-separable Gaussian graphical models. Using the above test for faithfulness, we introduce algorithms to learn the topologies of weakly $K$-separable and strongly $K$-separable Gaussian graphical models with $\Omega(K\log p)$ sample complexity. For strongly $K$-separable Gaussian graphical models, we additionally provide a method with error bounds for learning the off-diagonal precision matrix entries.
Maximum Likelihood for Gaussian Process Classification and Generalized Linear Mixed Models under Case-Control Sampling
http://jmlr.org/papers/v20/18-298.html
2019
Modern data sets in various domains often include units that were sampled non-randomly from the population and have a latent correlation structure. Here we investigate a common form of this setting, where every unit is associated with a latent variable, all latent variables are correlated, and the probability of sampling a unit depends on its response. Such settings often arise in case-control studies, where the sampled units are correlated due to spatial proximity, family relations, or other sources of relatedness. Maximum likelihood estimation in such settings is challenging from both a computational and statistical perspective, necessitating approximations that take the sampling scheme into account. We propose a family of approximate likelihood approaches which combine composite likelihood and expectation propagation. We demonstrate the efficacy of our solutions via extensive simulations. We utilize them to investigate the genetic architecture of several complex disorders collected in case-control genetic association studies, where hundreds of thousands of genetic variants are measured for every individual, and the underlying disease liabilities of individuals are correlated due to genetic similarity. Our work is the first to provide a tractable likelihood-based solution for case-control data with complex dependency structures.
Scalable Interpretable Multi-Response Regression via SEED
http://jmlr.org/papers/v20/18-200.html
2019
Sparse reduced-rank regression is an important tool for uncovering meaningful dependence structure between large numbers of predictors and responses in many big data applications such as genome-wide association studies and social media analysis. Despite the recent theoretical and algorithmic advances, scalable estimation of sparse reduced-rank regression remains largely unexplored. In this paper, we suggest a scalable procedure called sequential estimation with eigen-decomposition (SEED) which needs only a single top-$r$ sparse singular value decomposition from a generalized eigenvalue problem to find the optimal low-rank and sparse matrix estimate. Our suggested method is not only scalable but also performs simultaneous dimensionality reduction and variable selection. Under some mild regularity conditions, we show that SEED enjoys nice sampling properties including consistency in estimation, rank selection, prediction, and model selection. Moreover, SEED employs only basic matrix operations that can be efficiently parallelized in high performance computing devices. Numerical studies on synthetic and real data sets show that SEED outperforms the state-of-the-art approaches for large-scale matrix estimation problem.
Solving the OSCAR and SLOPE Models Using a Semismooth Newton-Based Augmented Lagrangian Method
http://jmlr.org/papers/v20/18-172.html
2019
The octagonal shrinkage and clustering algorithm for regression (OSCAR), equipped with the $\ell_1$-norm and a pair-wise $\ell_{\infty}$-norm regularizer, is a useful tool for feature selection and grouping in high-dimensional data analysis. The computational challenge posed by OSCAR, for high dimensional and/or large sample size data, has not yet been well resolved due to the non-smoothness and non-separability of the regularizer involved. In this paper, we successfully resolve this numerical challenge by proposing a sparse semismooth Newton-based augmented Lagrangian method to solve the more general SLOPE (the sorted L-one penalized estimation) model. By appropriately exploiting the inherent sparse and low-rank property of the generalized Jacobian of the semismooth Newton system in the augmented Lagrangian subproblem, we show how the computational complexity can be substantially reduced. Our algorithm offers a notable computational advantage in the high-dimensional statistical regression settings. Numerical experiments are conducted on real data sets, and the results demonstrate that our algorithm is far superior, in both speed and robustness, to the existing state-of-the-art algorithms based on first-order iterative schemes, including the widely used accelerated proximal gradient (APG) method and the alternating direction method of multipliers (ADMM).
Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
http://jmlr.org/papers/v20/18-079.html
2019
We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances on finite spaces. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can easily be tuned towards higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation bounds for its accuracy in the case of discrete optimal transport problems. In particular, we show that in many important instances, including images (2D-histograms), the approximation error is independent of the size of the full problem. We present numerical experiments that demonstrate that a very good approximation in typical applications can be obtained in a computation time that is several orders of magnitude smaller than what is required for exact computation of the full problem.
Complete Search for Feature Selection in Decision Trees
http://jmlr.org/papers/v20/18-035.html
2019
The search space for the feature selection problem in decision tree learning is the lattice of subsets of the available features. We design an exact enumeration procedure of the subsets of features that lead to all and only the distinct decision trees built by a greedy top-down decision tree induction algorithm. The procedure stores, in the worst case, a number of trees linear in the number of features. By exploiting a further pruning of the search space, we design a complete procedure for finding $\delta$-acceptable feature subsets, which depart by at most $\delta$ from the best estimated error over any feature subset. Feature subsets with the best estimated error are called best feature subsets. Our results apply to any error estimator function, but experiments are mainly conducted under the wrapper model, in which the misclassification error over a search set is used as an estimator. The approach is also adapted to the design of a computational optimization of the sequential backward elimination heuristic, extending its applicability to large dimensional datasets. The procedures of this paper are implemented in a multi-core data parallel C++ system. We investigate experimentally the properties and limitations of the procedures on a collection of 20 benchmark datasets, showing that oversearching increases both overfitting and instability.
Regularization via Mass Transportation
http://jmlr.org/papers/v20/17-633.html
2019
The goal of regression and classification methods in supervised learning is to minimize the empirical risk, that is, the expectation of some loss function quantifying the prediction error under the empirical distribution. When facing scarce training data, overfitting is typically mitigated by adding regularization terms to the objective that penalize hypothesis complexity. In this paper we introduce new regularization techniques using ideas from distributionally robust optimization, and we give new probabilistic interpretations to existing techniques. Specifically, we propose to minimize the worst-case expected loss, where the worst case is taken over the ball of all (continuous or discrete) distributions that have a bounded transportation distance from the (discrete) empirical distribution. By choosing the radius of this ball judiciously, we can guarantee that the worst-case expected loss provides an upper confidence bound on the loss on test data, thus offering new generalization bounds. We prove that the resulting regularized learning problems are tractable and can be tractably kernelized for many popular loss functions. The proposed approach to regluarization is also extended to neural networks. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.
Non-Convex Matrix Completion and Related Problems via Strong Duality
http://jmlr.org/papers/v20/17-611.html
2019
This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and the dual have the same optimum. This has been well understood for convex optimization, but little was known for non-convex problems. We propose a novel analytical framework and prove that under certain dual conditions, the optimal solution of the matrix factorization program is the same as that of its bi-dual and thus the global optimality of the non-convex program can be achieved by solving its bi-dual which is convex. These dual conditions are satisfied by a wide class of matrix factorization problems, although matrix factorization is hard to solve in full generality. This analytical framework may be of independent interest to non-convex optimization more broadly. We apply our framework to two prototypical matrix factorization problems: matrix completion and robust Principal Component Analysis. These are examples of efficiently recovering a hidden matrix given limited reliable observations. Our framework shows that exact recoverability and strong duality hold with nearly-optimal sample complexity for the two problems.
Low Permutation-rank Matrices: Structural Properties and Noisy Completion
http://jmlr.org/papers/v20/17-545.html
2019
We consider the problem of noisy matrix completion, in which the goal is to reconstruct a structured matrix whose entries are partially observed in noise. Standard approaches to this underdetermined inverse problem are based on assuming that the underlying matrix has low rank, or is well-approximated by a low rank matrix. In this paper, we propose a richer model based on what we term the permutation-rank of a matrix. We first describe how the classical non-negative rank model enforces restrictions that may be undesirable in practice, and how and these restrictions can be avoided by using the richer permutation-rank model. Second, we establish the minimax rates of estimation under the new permutation-based model, and prove that surprisingly, the minimax rates are equivalent up to logarithmic factors to those for estimation under the typical low rank model. Third, we analyze a computationally efficient singular-value-thresholding algorithm, known to be optimal for the low-rank setting, and show that it also simultaneously yields a consistent estimator for the low-permutation rank setting. Finally, we present various structural results characterizing the uniqueness of the permutation-rank decomposition, and characterizing convex approximations of the permutation-rank polytope.
Hamiltonian Monte Carlo with Energy Conserving Subsampling
http://jmlr.org/papers/v20/17-452.html
2019
Hamiltonian Monte Carlo (HMC) samples efficiently from high-dimensional posterior distributions with proposed parameter draws obtained by iterating on a discretized version of the Hamiltonian dynamics. The iterations make HMC computationally costly, especially in problems with large data sets, since it is necessary to compute posterior densities and their derivatives with respect to the parameters. Naively computing the Hamiltonian dynamics on a subset of the data causes HMC to lose its key ability to generate distant parameter proposals with high acceptance probability. The key insight in our article is that efficient subsampling HMC for the parameters is possible if both the dynamics and the acceptance probability are computed from the same data subsample in each complete HMC iteration. We show that this is possible to do in a principled way in a HMC-within-Gibbs framework where the subsample is updated using a pseudo marginal MH step and the parameters are then updated using an HMC step, based on the current subsample. We show that our subsampling methods are fast and compare favorably to two popular sampling algorithms that use gradient estimates from data subsampling. We also explore the current limitations of subsampling HMC algorithms by varying the quality of the variance reducing control variates used in the estimators of the posterior density and its gradients.
Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction
http://jmlr.org/papers/v20/17-352.html
2019
Identifying changes in model parameters is fundamental in machine learning and statistics. However, standard changepoint models are limited in expressiveness, often addressing unidimensional problems and assuming instantaneous changes. We introduce change surfaces as a multidimensional and highly expressive generalization of changepoints. We provide a model-agnostic formalization of change surfaces, illustrating how they can provide variable, heterogeneous, and non-monotonic rates of change across multiple dimensions. Additionally, we show how change surfaces can be used for counterfactual prediction. As a concrete instantiation of the change surface framework, we develop Gaussian Process Change Surfaces (GPCS). We demonstrate counterfactual prediction with Bayesian posterior mean and credible sets, as well as massive scalability by introducing novel methods for additive non-separable kernels. Using two large spatio-temporal datasets we employ GPCS to discover and characterize complex changes that can provide scientific and policy relevant insights. Specifically, we analyze twentieth century measles incidence across the United States and discover previously unknown heterogeneous changes after the introduction of the measles vaccine. Additionally, we apply the model to requests for lead testing kits in New York City, discovering distinct spatial and demographic patterns.
Adaptive Geometric Multiscale Approximations for Intrinsically Low-dimensional Data
http://jmlr.org/papers/v20/17-252.html
2019
We consider the problem of efficiently approximating and encoding high-dimensional data sampled from a probability distribution $\rho$ in $\mathbb{R}^D$, that is nearly supported on a $d$-dimensional set $\mathcal{M}$ - for example supported on a $d$-dimensional manifold. Geometric Multi-Resolution Analysis (GMRA) provides a robust and computationally efficient procedure to construct low-dimensional geometric approximations of $\mathcal{M}$ at varying resolutions. We introduce GMRA approximations that adapt to the unknown regularity of $\mathcal{M}$, by introducing a thresholding algorithm on the geometric wavelet coefficients. We show that these data-driven, empirical geometric approximations perform well, when the threshold is chosen as a suitable universal function of the number of samples $n$, on a large class of measures $\rho$, that are allowed to exhibit different regularity at different scales and locations, thereby efficiently encoding data from more complex measures than those supported on manifolds. These GMRA approximations are associated to a dictionary, together with a fast transform mapping data to $d$-dimensional coefficients, and an inverse of such a map, all of which are data-driven. The algorithms for both the dictionary construction and the transforms have complexity $C D n \log n$ with the constant $C$ exponential in $d$. Our work therefore establishes Adaptive GMRA as a fast dictionary learning algorithm, with approximation guarantees, for intrinsically low-dimensional data. We include several numerical experiments on both synthetic and real data, confirming our theoretical results and demonstrating the effectiveness of Adaptive GMRA.
Relative Error Bound Analysis for Nuclear Norm Regularized Matrix Completion
http://jmlr.org/papers/v20/15-504.html
2019
In this paper, we develop a relative error bound for nuclear norm regularized matrix completion, with the focus on the completion of full-rank matrices. Under the assumption that the top eigenspaces of the target matrix are incoherent, we derive a relative upper bound for recovering the best low-rank approximation of the unknown matrix. Although multiple works have been devoted to analyzing the recovery error of full-rank matrix completion, their error bounds are usually additive, making it impossible to obtain the perfect recovery case and more generally difficult to leverage the skewed distribution of eigenvalues. Our analysis is built upon the optimality condition of the regularized formulation and existing guarantees for low-rank matrix completion. To the best of our knowledge, this is the first relative bound that has been proved for the regularized formulation of matrix completion.
PyOD: A Python Toolbox for Scalable Outlier Detection
http://jmlr.org/papers/v20/19-011.html
2019
PyOD is an open-source Python toolbox for performing scalable outlier detection on multivariate data. Uniquely, it provides access to a wide range of outlier detection algorithms, including established outlier ensembles and more recent neural network-based approaches, under a single, well-documented API designed for use by both practitioners and researchers. With robustness and scalability in mind, best practices such as unit testing, continuous integration, code coverage, maintainability checks, interactive examples and parallelization are emphasized as core components in the toolbox's development. PyOD is compatible with both Python 2 and 3 and can be installed through Python Package Index (PyPI) or https://github.com/yzhao062/pyod.
High-Dimensional Poisson Structural Equation Model Learning via $\ell_1$-Regularized Regression
http://jmlr.org/papers/v20/18-819.html
2019
In this paper, we develop a new approach to learning high-dimensional Poisson structural equation models from only observational data without strong assumptions such as faithfulness and a sparse moralized graph. A key component of our method is to decouple the ordering estimation or parent search where the problems can be efficiently addressed using $\ell_1$-regularized regression and the moments relation. We show that sample size $n = \Omega( d^{2} \log^{9} p)$ is sufficient for our polynomial time Moments Ratio Scoring (MRS) algorithm to recover the true directed graph, where $p$ is the number of nodes and $d$ is the maximum indegree. We verify through simulations that our algorithm is statistically consistent in the high-dimensional $p>n$ setting, and performs well compared to state-of-the-art ODS, GES, and MMHC algorithms. We also demonstrate through multivariate real count data that our MRS algorithm is well-suited to estimating DAG models for multivariate count data in comparison to other methods used for discrete data.
Simultaneous Private Learning of Multiple Concepts
http://jmlr.org/papers/v20/18-549.html
2019
We investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving $k$ learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving $k$ learning tasks without privacy? In our setting, an individual example consists of a domain element $x$ labeled by $k$ unknown concepts $(c_1,\ldots,c_k)$. The goal of a multi-learner is to output $k$ hypotheses $(h_1,\ldots,h_k)$ that generalize the input examples. Without concern for privacy, the sample complexity needed to simultaneously learn $k$ concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with $k$. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in $k$.
iNNvestigate Neural Networks!
http://jmlr.org/papers/v20/18-540.html
2019
In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library innvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of innvestigate, we provide an analysis of image classifications for variety of state-of-the-art neural network architectures.
AffectiveTweets: a Weka Package for Analyzing Affect in Tweets
http://jmlr.org/papers/v20/18-450.html
2019
AffectiveTweets is a set of programs for analyzing emotion and sentiment of social media messages such as tweets. It is implemented as a package for the Weka machine learning workbench and provides methods for calculating state-of-the-art affect analysis features from tweets that can be fed into machine learning algorithms implemented in Weka. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabeled tweets. The package was used by several teams in the shared tasks: EmoInt 2017 and Affect in Tweets SemEval 2018 Task 1.
Best Arm Identification for Contaminated Bandits
http://jmlr.org/papers/v20/18-395.html
2019
This paper studies active learning in the context of robust statistics. Specifically, we propose a variant of the Best Arm Identification problem for contaminated bandits, where each arm pull has probability epsilon of generating a sample from an arbitrary contamination distribution instead of the true underlying distribution. The goal is to identify the best (or approximately best) true distribution with high probability, with a secondary goal of providing guarantees on the quality of this distribution. The primary challenge of the contaminated bandit setting is that the true distributions are only partially identifiable, even with infinite samples. To address this, we develop tight, non-asymptotic sample complexity bounds for high-probability estimation of the first two robust moments (median and median absolute deviation) from contaminated samples. These concentration inequalities are the main technical contributions of the paper and may be of independent interest. Using these results, we adapt several classical Best Arm Identification algorithms to the contaminated bandit setting and derive sample complexity upper bounds for our problem. Finally, we provide matching information-theoretic lower bounds on the sample complexity (up to a small logarithmic factor).
A Particle-Based Variational Approach to Bayesian Non-negative Matrix Factorization
http://jmlr.org/papers/v20/18-153.html
2019
Bayesian Non-negative Matrix Factorization (BNMF) is a promising approach for understanding uncertainty and structure in matrix data. However, a large volume of applied work optimizes traditional non-Bayesian NMF objectives that fail to provide a principled understanding of the non-identifiability inherent in NMF---an issue ideally addressed by a Bayesian approach. Despite their suitability, current BNMF approaches have failed to gain popularity in an applied setting; they sacrifice flexibility in modeling for tractable computation, tend to get stuck in local modes, and can require many thousands of samples for meaningful uncertainty estimates. We address these issues through a particle-based variational approach to BNMF that only requires the joint likelihood to be differentiable for computational tractability, uses a novel transfer-based initialization technique to identify multiple modes in the posterior, and thus allows domain experts to inspect a small set of factorizations that faithfully represent the posterior. On several real datasets, we obtain better particle approximations to the BNMF posterior in less time than baselines and demonstrate the significant role that multimodality plays in NMF-related tasks.
Dependent relevance determination for smooth and structured sparse regression
http://jmlr.org/papers/v20/17-757.html
2019
In many problem settings, parameter vectors are not merely sparse but dependent in such a way that non-zero coefficients tend to cluster together. We refer to this form of dependency as “region sparsity.” Classical sparse regression methods, such as the lasso and automatic relevance determination (ARD), which model parameters as independent a priori, and therefore do not exploit such dependencies. Here we introduce a hierarchical model for smooth, region-sparse weight vectors and tensors in a linear regression setting. Our approach represents a hierarchical extension of the relevance determination framework, where we add a transformed Gaussian process to model the dependencies between the prior variances of regression weights. We combine this with a structured model of the prior variances of Fourier coefficients, which eliminates unnecessary high frequencies. The resulting prior encourages weights to be region-sparse in two different bases simultaneously. We develop Laplace approximation and Monte Carlo Markov Chain (MCMC) sampling to provide efficient inference for the posterior. Furthermore, a two-stage convex relaxation of the Laplace approximation approach is also provided to relax the inevitable non-convexity during the optimization. We finally show substantial improvements over comparable methods for both simulated and real datasets from brain imaging.
Model Selection via the VC Dimension
http://jmlr.org/papers/v20/17-669.html
2019
We derive an objective function that can be optimized to give an estimator for the Vapnik-Chervonenkis dimension for use in model selection in regression problems. We verify our estimator is consistent. Then, we verify it performs well compared to seven other model selection techniques. We do this for a variety of types of data sets.
An asymptotic analysis of distributed nonparametric methods
http://jmlr.org/papers/v20/17-666.html
2019
We investigate and compare the fundamental performance of several distributed learning methods that have been proposed recently. We do this in the context of a distributed version of the classical signal-in-Gaussian-white-noise model, which serves as a benchmark model for studying performance in this setting. The results show how the design and tuning of a distributed method can have great impact on convergence rates and validity of uncertainty quantification. Moreover, we highlight the difficulty of designing nonparametric distributed procedures that automatically adapt to smoothness.
Streaming Principal Component Analysis From Incomplete Data
http://jmlr.org/papers/v20/16-627.html
2019
Linear subspace models are pervasive in computational sciences and particularly used for large datasets which are often incomplete due to privacy issues or sampling constraints. Therefore, a critical problem is developing an efficient algorithm for detecting low-dimensional linear structure from incomplete data efficiently, in terms of both computational complexity and storage. In this paper we propose a streaming subspace estimation algorithm called Subspace Navigation via Interpolation from Partial Entries (SNIPE) that efficiently processes blocks of incomplete data to estimate the underlying subspace model. In every iteration, SNIPE finds the subspace that best fits the new data block but remains close to the previous estimate. We show that SNIPE is a streaming solver for the underlying nonconvex matrix completion problem, that it converges globally {to a stationary point of this program} regardless of initialization, and that the convergence is locally linear with high probability. We also find that SNIPE shows state-of-the-art performance in our numerical simulations.
Bayesian Space-Time Partitioning by Sampling and Pruning Spanning Trees
http://jmlr.org/papers/v20/16-615.html
2019
A typical problem in spatial data analysis is regionalization or spatially constrained clustering, which consists of aggregating small geographical areas into larger regions. A major challenge when partitioning a map is the huge number of possible partitions that compose the search space. This is compounded if we are partitioning spatio-temporal data rather than purely spatial data. We introduce a spatio-temporal product partition model that deals with the regionalization problem in a probabilistic way. Random spanning trees are used as a tool to tackle the problem of searching the space of possible partitions making feasible this exploration. Based on this framework, we propose an efficient Gibbs sampler algorithm to sample from the posterior distribution of the parameters, specially the random partition. The proposed Gibbs sampler scheme carries out a random walk on the space of the spanning trees and the partitions induced by deleting tree edges. In the purely spatial situation, we compare our proposed model with other state-of-art regionalization techniques to partition maps using simulated and real social and health data. To illustrate how the temporal component is handled by the algorithm and to show how the spatial clusters vary along the time we presented an application using human development index data. The analysis shows that our proposed model is better than state-of-art alternatives. Another appealing feature of the method is that the prior distribution for the partition is interpretable with a trivial coin flipping mechanism allowing its easy elicitation.
Differentiable Game Mechanics
http://jmlr.org/papers/v20/19-008.html
2019
Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood -- and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in $n$-player differentiable games. The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs -- while at the same time being applicable to, and having guarantees in, much more general cases.
On the optimality of the Hedge algorithm in the stochastic regime
http://jmlr.org/papers/v20/18-869.html
2019
In this paper, we study the behavior of the Hedge algorithm in the online stochastic setting. We prove that anytime Hedge with decreasing learning rate, which is one of the simplest algorithm for the problem of prediction with expert advice, is remarkably both worst-case optimal and adaptive to the easier stochastic and adversarial with a gap problems. This shows that, in spite of its small, non-adaptive learning rate, Hedge possesses the same optimal regret guarantee in the stochastic case as recently introduced adaptive algorithms. Moreover, our analysis exhibits qualitative differences with other versions of the Hedge algorithm, such as the fixed-horizon variant (with constant learning rate) and the one based on the so-called “doubling trick”, both of which fail to adapt to the easier stochastic setting. Finally, we determine the intrinsic limitations of anytime Hedge in the stochastic case, and discuss the improvements provided by more adaptive algorithms.
SMART: An Open Source Data Labeling Platform for Supervised Learning
http://jmlr.org/papers/v20/18-859.html
2019
SMART is an open source web application designed to help data scientists and research teams efficiently build labeled training data sets for supervised machine learning tasks. SMART provides users with an intuitive interface for creating labeled data sets, supports active learning to help reduce the required amount of labeled data, and incorporates inter-rater reliability statistics to provide insight into label quality. SMART is designed to be platform agnostic and easily deployable to meet the needs of as many different research teams as possible. The project website https://rtiinternational.github.io/SMART/ contains links to the code repository and extensive user documentation.
Tight Lower Bounds on the VC-dimension of Geometric Set Systems
http://jmlr.org/papers/v20/18-719.html
2019
The VC-dimension of a set system is a way to capture its complexity and has been a key parameter studied extensively in machine learning and geometry communities. In this paper, we resolve two longstanding open problems on bounding the VC-dimension of two fundamental set systems: $k$-fold unions/intersections of half-spaces and the simplices set system. Among other implications, it settles an open question in machine learning that was first studied in the foundational paper of Blumer et al. (1989) as well as by Eisenstat and Angluin (2007) and Johnson (2008).
Learning to Match via Inverse Optimal Transport
http://jmlr.org/papers/v20/18-700.html
2019
We propose a unified data-driven framework based on inverse optimal transport that can learn adaptive, nonlinear interaction cost function from noisy and incomplete empirical matching matrix and predict new matching in various matching contexts. We emphasize that the discrete optimal transport plays the role of a variational principle which gives rise to an optimization based framework for modeling the observed empirical matching data. Our formulation leads to a non-convex optimization problem which can be solved efficiently by an alternating optimization method. A key novel aspect of our formulation is the incorporation of marginal relaxation via regularized Wasserstein distance, significantly improving the robustness of the method in the face of noisy or missing empirical matching data. Our model falls into the category of prescriptive models, which not only predict potential future matching, but is also able to explain what leads to empirical matching and quantifies the impact of changes in matching factors. The proposed approach has wide applicability including predicting matching in online dating, labor market, college application and crowdsourcing. We back up our claims with numerical experiments on both synthetic data and real world data sets.
Quantification Under Prior Probability Shift: the Ratio Estimator and its Extensions
http://jmlr.org/papers/v20/18-456.html
2019
The quantification problem consists of determining the prevalence of a given label in a target population. However, one often has access to the labels in a sample from the training population but not in the target population. A common assumption in this situation is that of prior probability shift, that is, once the labels are known, the distribution of the features is the same in the training and target populations. In this paper, we derive a new lower bound for the risk of the quantification problem under the prior shift assumption. Complementing this lower bound, we present a new approximately minimax class of estimators, ratio estimators, which generalize several previous proposals in the literature. Using a weaker version of the prior shift assumption, which can be tested, we show that ratio estimators can be used to build confidence intervals for the quantification problem. We also extend the ratio estimator so that it can: (i) incorporate labels from the target population, when they are available and (ii) estimate how the prevalence of positive labels varies according to a function of certain covariates.
Prediction Risk for the Horseshoe Regression
http://jmlr.org/papers/v20/18-321.html
2019
We show that prediction performance for global-local shrinkage regression can overcome two major difficulties of global shrinkage regression: (i) the amount of relative shrinkage is monotone in the singular values of the design matrix and (ii) the shrinkage is determined by a single tuning parameter. Specifically, we show that the horseshoe regression, with heavy-tailed component-specific local shrinkage parameters, in conjunction with a global parameter providing shrinkage towards zero, alleviates both these difficulties and consequently, results in an improved risk for prediction. Numerical demonstrations of improved prediction over competing approaches in simulations and in a pharmacogenomics data set confirm our theoretical findings.
Nonuniformity of P-values Can Occur Early in Diverging Dimensions
http://jmlr.org/papers/v20/18-314.html
2019
Evaluating the joint significance of covariates is of fundamental importance in a wide range of applications. To this end, p-values are frequently employed and produced by algorithms that are powered by classical large-sample asymptotic theory. It is well known that the conventional p-values in Gaussian linear model are valid even when the dimensionality is a non-vanishing fraction of the sample size, but can break down when the design matrix becomes singular in higher dimensions or when the error distribution deviates from Gaussianity. A natural question is when the conventional p-values in generalized linear models become invalid in diverging dimensions. We establish that such a breakdown can occur early in nonlinear models. Our theoretical characterizations are confirmed by simulation studies.
Generalized Score Matching for Non-Negative Data
http://jmlr.org/papers/v20/18-278.html
2019
A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach becomes computationally intensive. The score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions over $\mathbb{R}^m$. Hyvärinen (2007) extended the approach to distributions supported on the non-negative orthant, $\mathbb{R}_+^m$. In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. As an example, we consider a general class of pairwise interaction models. Addressing an overlooked inexistence problem, we generalize the regularized score matching method of Lin et al. (2016) and improve its theoretical guarantees for non-negative Gaussian graphical models.
Fairness Constraints: A Flexible Approach for Fair Classification
http://jmlr.org/papers/v20/18-262.html
2019
Algorithmic decision making is employed in an increasing number of real-world applicationstions to aid human decision making. While it has shown considerable promise in terms of improved decision accuracy, in some scenarios, its outcomes have been also shown to impose an unfair (dis)advantage on people from certain social groups (e.g., women, blacks). In this context, there is a need for computational techniques to limit unfairness in algorithmic decision making. In this work, we take a step forward to fulfill that need and introduce a flexible constraint-based framework to enable the design of fair margin-based classifiers. The main technical innovation of our framework is a general and intuitive measure of decision boundary unfairness, which serves as a tractable proxy to several of the most popular computational definitions of unfairness from the literature. Leveraging our measure, we can reduce the design of fair margin-based classifiers to adding tractable constraints on their decision boundaries. Experiments on multiple synthetic and real-world datasets show that our framework is able to successfully limit unfairness, often at a small cost in terms of accuracy.
Deep Optimal Stopping
http://jmlr.org/papers/v20/18-232.html
2019
In this paper we develop a deep learning method for optimal stopping problems which directly learns the optimal stopping rule from Monte Carlo samples. As such, it is broadly applicable in situations where the underlying randomness can efficiently be simulated. We test the approach on three problems: the pricing of a Bermudan max-call option, the pricing of a callable multi barrier reverse convertible and the problem of optimally stopping a fractional Brownian motion. In all three cases it produces very accurate results in high-dimensional situations with short computing times.
Analysis of Langevin Monte Carlo via Convex Optimization
http://jmlr.org/papers/v20/18-173.html
2019
In this paper, we provide new insights on the Unadjusted Langevin Algorithm. We show that this method can be formulated as the first order optimization algorithm for an objective functional defined on the Wasserstein space of order $2$. Using this interpretation and techniques borrowed from convex optimization, we give a non-asymptotic analysis of this method to sample from log-concave smooth target distribution on $\mathbb{R}^d$. Based on this interpretation, we propose two new methods for sampling from a non-smooth target distribution. These new algorithms are natural extensions of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, which is a popular extension of the Unadjusted Langevin Algorithm for largescale Bayesian inference. Using the optimization perspective, we provide non-asymptotic convergence analysis for the newly proposed methods.
Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
http://jmlr.org/papers/v20/18-148.html
2019
Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically treated as missing, or as “erasures” at every iteration, whose loss is compensated by the embedded redundancy. For quadratic loss functions, we show that under a simple encoding scheme, many optimization algorithms (gradient descent, L-BFGS, and proximal gradient) operating under data parallelism converge to an approximate solution even when stragglers are ignored. Furthermore, we show a similar result for a wider class of convex loss functions when operating under model parallelism. The applicable classes of objectives covers several popular learning problems such as linear regression, LASSO, support vector machine, collaborative filtering, and generalized linear models including logistic regression. These convergence results are deterministic, i.e., they establish sample path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, and demonstrate its performance over several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.
Lazifying Conditional Gradient Algorithms
http://jmlr.org/papers/v20/18-114.html
2019
Conditional gradient algorithms (also often called Frank-Wolfe algorithms) are popular due to their simplicity of only requiring a linear optimization oracle and more recently they also gained significant traction for online learning. While simple in principle, in many cases the actual implementation of the linear optimization oracle is costly. We show a general method to lazify various conditional gradient algorithms, which in actual computations leads to several orders of magnitude of speedup in wall-clock time. This is achieved by using a faster separation oracle instead of a linear optimization oracle, relying only on few linear optimization oracle calls.
Semi-Analytic Resampling in Lasso
http://jmlr.org/papers/v20/18-109.html
2019
An approximate method for conducting resampling in Lasso, the $\ell_1$ penalized linear regression, in a semi-analytic manner is developed, whereby the average over the resampled datasets is directly computed without repeated numerical sampling, thus enabling an inference free of the statistical fluctuations due to sampling finiteness, as well as a significant reduction of computational time. The proposed method is based on a message passing type algorithm, and its fast convergence is guaranteed by the state evolution analysis, when covariates are provided as zero-mean independently and identically distributed Gaussian random variables. It is employed to implement bootstrapped Lasso (Bolasso) and stability selection, both of which are variable selection methods using resampling in conjunction with Lasso, and resolves their disadvantage regarding computational cost. To examine approximation accuracy and efficiency, numerical experiments were carried out using simulated datasets. Moreover, an application to a real-world dataset, the wine quality dataset, is presented. To process such real-world datasets, an objective criterion for determining the relevance of selected variables is also introduced by the addition of noise variables and resampling. MATLAB codes implementing the proposed method are distributed in (Obuchi, 2018).
On Consistent Vertex Nomination Schemes
http://jmlr.org/papers/v20/18-048.html
2019
Given a vertex of interest in a network $G_1$, the vertex nomination problem seeks to find the corresponding vertex of interest (if it exists) in a second network $G_2$. A vertex nomination scheme produces a list of the vertices in $G_2$, ranked according to how likely they are judged to be the corresponding vertex of interest in $G_2$. The vertex nomination problem and related information retrieval tasks have attracted much attention in the machine learning literature, with numerous applications to social and biological networks. However, the current framework has often been confined to a comparatively small class of network models, and the concept of statistically consistent vertex nomination schemes has been only shallowly explored. In this paper, we extend the vertex nomination problem to a very general statistical model of graphs. Further, drawing inspiration from the long-established classification framework in the pattern recognition literature, we provide definitions for the key notions of Bayes optimality and consistency in our extended vertex nomination framework, including a derivation of the Bayes optimal vertex nomination scheme. In addition, we prove that no universally consistent vertex nomination schemes exist. Illustrative examples are provided throughout.
Variance-based Regularization with Convex Objectives
http://jmlr.org/papers/v20/17-750.html
2019
We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.
Learnability of Solutions to Conjunctive Queries
http://jmlr.org/papers/v20/17-734.html
2019
The problem of learning the solution space of an unknown formula has been studied in multiple embodiments in computational learning theory. In this article, we study a family of such learning problems; this family contains, for each relational structure, the problem of learning the solution space of an unknown conjunctive query evaluated on the structure. A progression of results aimed to classify the learnability of each of the problems in this family, and thus far a culmination thereof was a positive learnability result generalizing all previous ones. This article completes the classification program towards which this progression of results strived, by presenting a negative learnability result that complements the mentioned positive learnability result. In addition, a further negative learnability result is exhibited, which indicates a dichotomy within the problems to which the first negative result applies. In order to obtain our negative results, we make use of universal-algebraic concepts.
Proximal Distance Algorithms: Theory and Practice
http://jmlr.org/papers/v20/17-687.html
2019
Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. If $f(x)$ is the loss function, and $C$ is the constraint set in a constrained minimization problem, then the proximal distance principle mandates minimizing the penalized loss $f(x)+\frac{\rho}{2}dist(x,C)^2$ and following the solution $x_{\rho}$ to its limit as $\rho$ tends to $\infty$. At each iteration the squared Euclidean distance $dist(x,C)^2$ is majorized by the spherical quadratic $\|x-P_C(x_k)\|^2$, where $P_C(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x)+\frac{\rho}{2}\|x-P_C(x_k)\|^2$ is given by the proximal map $prox_{\rho^{-1}f}[P_C(x_k)]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known, it is straightforward to derive and implement novel optimization algorithms in this setting. These algorithms can take hundreds if not thousands of iterations to converge, but the simple nature of each iteration makes proximal distance algorithms competitive with traditional algorithms. For convex problems, proximal distance algorithms reduce to proximal gradient algorithms and therefore enjoy well understood convergence properties. For nonconvex problems, one can attack convergence by invoking Zangwill's theorem. Our numerical examples demonstrate the utility of proximal distance algorithms in various high-dimensional settings, including a) linear programming, b) constrained least squares, c) projection to the closest kinship matrix, d) projection onto a second-order cone constraint, e) calculation of Horn's copositive matrix index, f) linear complementarity programming, and g) sparse principal components analysis. The proximal distance algorithm in each case is competitive or superior in speed to traditional methods such as the interior point method and the alternating direction method of multipliers (ADMM). Source code for the numerical examples can be found at https://github.com/klkeys/proxdist.
Active Learning for Cost-Sensitive Classification
http://jmlr.org/papers/v20/17-681.html
2019
We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label's cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. We empirically compare COAL to passive learning and several active learning baselines, showing significant improvements in labeling effort and test cost on real-world datasets.
A Representer Theorem for Deep Kernel Learning
http://jmlr.org/papers/v20/17-621.html
2019
In this paper we provide a finite-sample and an infinite-sample representer theorem for the concatenation of (linear combinations of) kernel functions of reproducing kernel Hilbert spaces. These results serve as mathematical foundation for the analysis of machine learning algorithms based on compositions of functions. As a direct consequence in the finite-sample case, the corresponding infinite-dimensional minimization problems can be recast into (nonlinear) finite-dimensional minimization problems, which can be tackled with nonlinear optimization algorithms. Moreover, we show how concatenated machine learning problems can be reformulated as neural networks and how our representer theorem applies to a broad class of state-of-the-art deep learning methods.
Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks
http://jmlr.org/papers/v20/17-612.html
2019
We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting $W$ be the number of weights and $L$ be the number of layers, we prove that the VC-dimension is $O(W L \log(W))$, and provide examples with VC-dimension $\Omega( W L \log(W/L) )$. This improves both the previously known upper bounds and lower bounds. In terms of the number $U$ of non-linear units, we prove a tight bound $\Theta(W U)$ on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different non-linearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial.
Multi-scale Online Learning: Theory and Applications to Online Auctions and Pricing
http://jmlr.org/papers/v20/17-498.html
2019
We consider revenue maximization in online auction/pricing problems. A seller sells an identical item in each period to a new buyer, or a new set of buyers. For the online pricing problem, both when the arriving buyer bids or only responds to the posted price, we design algorithms whose regret bounds scale with the best fixed price in-hindsight, rather than the range of the values. Under the bidding model, we further show our algorithms achieve a revenue convergence rate that matches the offline sample complexity of the single-item single-buyer auction. We also show regret bounds that are scale free, and match the offline sample complexity, when comparing to a benchmark that requires a lower bound on the market share. We further expand our results beyond pricing to multi-buyer auctions, and obtain online learning algorithms for auctions, with convergence rates matching the known sample complexity upper bound of online single-item multi-buyer auctions. These results are obtained by generalizing the classical learning from experts and multi-armed bandit problems to their multi-scale versions. In this version, the reward of each action is in a different range, and the regret with respect to a given action scales with its own range, rather than the maximum range. We obtain almost optimal multi-scale regret bounds by introducing a new Online Mirror Descent (OMD) algorithm whose mirror map is the multi-scale version of the negative entropy function. We further generalize to the bandit setting by introducing the stochastic variant of this OMD algorithm.
The Sup-norm Perturbation of HOSVD and Low Rank Tensor Denoising
http://jmlr.org/papers/v20/17-397.html
2019
The higher order singular value decomposition (HOSVD) of tensors is a generalization of matrix SVD. The perturbation analysis of HOSVD under random noise is more delicate than its matrix counterpart. Recently, polynomial time algorithms have been proposed where statistically optimal estimates of the singular subspaces and the low rank tensors are attainable in the Euclidean norm. In this article, we analyze the sup-norm perturbation bounds of HOSVD and introduce estimators of the singular subspaces with sharp deviation bounds in the sup-norm. We also investigate a low rank tensor denoising estimator and demonstrate its fast convergence rate with respect to the entry-wise errors. The sup-norm perturbation bounds reveal unconventional phase transitions for statistical learning applications such as the exact clustering in high dimensional Gaussian mixture model and the exact support recovery in sub-tensor localizations. In addition, the bounds established for HOSVD also elaborate the one-sided sup-norm perturbation bounds for the singular subspaces of unbalanced (or fat) matrices.
Robust Estimation of Derivatives Using Locally Weighted Least Absolute Deviation Regression
http://jmlr.org/papers/v20/17-340.html
2019
In nonparametric regression, the derivative estimation has attracted much attention in recent years due to its wide applications. In this paper, we propose a new method for the derivative estimation using the locally weighted least absolute deviation regression. Different from the local polynomial regression, the proposed method does not require a finite variance for the error term and so is robust to the presence of heavy-tailed errors. Meanwhile, it does not require a zero median or a positive density at zero for the error term in comparison with the local median regression. We further show that the proposed estimator with random difference is asymptotically equivalent to the (infinitely) composite quantile regression estimator. In other words, running one regression is equivalent to combining infinitely many quantile regressions. In addition, the proposed method is also extended to estimate the derivatives at the boundaries and to estimate higher-order derivatives. For the equidistant design, we derive theoretical results for the proposed estimators, including the asymptotic bias and variance, consistency, and asymptotic normality. Finally, we conduct simulation studies to demonstrate that the proposed method has better performance than the existing methods in the presence of outliers and heavy-tailed errors, and analyze the Chinese house price data for the past ten years to illustrate the usefulness of the proposed method.
Kernel Approximation Methods for Speech Recognition
http://jmlr.org/papers/v20/17-026.html
2019
We study the performance of kernel methods on the acoustic modeling task for automatic speech recognition, and compare their performance to deep neural networks (DNNs). To scale the kernel methods to large data sets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, we propose a simple but effective feature selection method which reduces the number of random features required to attain a fixed level of performance. Second, we present a number of metrics which correlate strongly with speech recognition performance when computed on the heldout set; we attain improved performance by using these metrics to decide when to stop training. Additionally, we show that the linear bottleneck method of Sainath et al. (2013a) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Leveraging these three methods, the kernel methods attain token error rates between $0.5\%$ better and $0.1\%$ worse than fully-connected DNNs across four speech recognition data sets, including the TIMIT and Broadcast News benchmark tasks.
The Common-directions Method for Regularized Empirical Risk Minimization
http://jmlr.org/papers/v20/16-309.html
2019
State-of-the-art first- and second-order optimization methods are able to achieve either fast global linear convergence rates or quadratic convergence, but not both of them. In this work, we propose an interpolation between first- and second-order methods for regularized empirical risk minimization that exploits the problem structure to efficiently combine multiple update directions. Our method attains both optimal global linear convergence rate for first-order methods, and local quadratic convergence. Experimental results show that our method outperforms state-of-the-art first- and second-order optimization methods in terms of the number of data accesses, while is competitive in training time.
Multi-class Heterogeneous Domain Adaptation
http://jmlr.org/papers/v20/13-580.html
2019
A crucial issue in heterogeneous domain adaptation (HDA) is the ability to learn a feature mapping between different types of features across domains. Inspired by language translation, a word translated from one language corresponds to only a few words in another language, we present an efficient method named Sparse Heterogeneous Feature Representation (SHFR) in this paper for multi-class HDA to learn a sparse feature transformation between domains with multiple classes. Specifically, we formulate the problem of learning the feature transformation as a compressed sensing problem by building multiple binary classifiers in the target domain as various measurement sensors, which are decomposed from the target multi-class classification problem. We show that the estimation error of the learned transformation decreases with the increasing number of binary classifiers. In other words, for adaptation across heterogeneous domains to be successful, it is necessary to construct a sufficient number of incoherent binary classifiers from the original multi-class classification problem. To achieve this, we propose to apply the error correcting output correcting (ECOC) scheme to generate incoherent classifiers. To speed up the learning of the feature transformation across domains, we apply an efficient batch-mode algorithm to solve the resultant nonnegative sparse recovery problem. Theoretically, we present a generalization error bound of our proposed HDA method under a multi-class setting. Lastly, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate the superiority of our proposed method over existing state-of-the-art HDA methods in terms of prediction accuracy and training efficiency.
Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices
http://jmlr.org/papers/v20/18-875.html
2019
Given a large matrix $A\in\mathbb{R}^{n\times d}$, we consider the problem of computing a sketch matrix $B\in\mathbb{R}^{\ell\times d}$ which is significantly smaller than but still well approximates $A$. We consider the problems in the streaming model, where the algorithm can only make one pass over the input with limited working space, and we are interested in minimizing the <i>covariance error</i> $\|A^TA-B^TB\|_2.$ The popular Frequent Directions algorithm of \cite{liberty2013simple} and its variants achieve optimal space-error tradeoffs. However, whether the running time can be improved remains an unanswered question. In this paper, we almost settle the question by proving that the time complexity of this problem is equivalent to that of matrix multiplication up to lower order terms. Specifically, we provide new space-optimal algorithms with faster running times and also show that the running times of our algorithms can be improved if and only if the state-of-the-art running time of matrix multiplication can be improved significantly.
Neural Architecture Search: A Survey
http://jmlr.org/papers/v20/18-598.html
2019
Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect for this progress are novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated \emph{neural architecture search} methods. We provide an overview of existing work in this field of research and categorize them according to three dimensions: search space, search strategy, and performance estimation strategy.
Deep Reinforcement Learning for Swarm Systems
http://jmlr.org/papers/v20/18-476.html
2019
Recently, deep reinforcement learning (RL) methods have been applied successfully to multi-agent scenarios. Typically, the observation vector for decentralized decision making is represented by a concatenation of the (local) information an agent gathers about other agents. However, concatenation scales poorly to swarm systems with a large number of homogeneous agents as it does not exploit the fundamental properties inherent to these systems: (i) the agents in the swarm are interchangeable and (ii) the exact number of agents in the swarm is irrelevant. Therefore, we propose a new state representation for deep multi-agent RL based on mean embeddings of distributions, where we treat the agents as samples and use the empirical mean embedding as input for a decentralized policy. We define different feature spaces of the mean embedding using histograms, radial basis functions and neural networks trained end-to-end. We evaluate the representation on two well-known problems from the swarm literature in a globally and locally observable setup. For the local setup we furthermore introduce simple communication protocols. Of all approaches, the mean embedding representation using neural network features enables the richest information exchange between neighboring agents, facilitating the development of complex collective strategies.
Tunability: Importance of Hyperparameters of Machine Learning Algorithms
http://jmlr.org/papers/v20/18-444.html
2019
Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.
Thompson Sampling Guided Stochastic Searching on the Line for Deceptive Environments with Applications to Root-Finding Problems
http://jmlr.org/papers/v20/18-263.html
2019
The multi-armed bandit problem forms the foundation for solving a wide range of online stochastic optimization problems through a simple, yet effective mechanism. One simply casts the problem as a gambler who repeatedly pulls one out of N slot machine arms, eliciting random rewards. Learning of reward probabilities is then combined with reward maximization, by carefully balancing reward exploration against reward exploitation. In this paper, we address a particularly intriguing variant of the multi-armed bandit problem, referred to as the <i>Stochastic Point Location (SPL)</i> problem. The gambler is here only told whether the optimal arm (point) lies to the “left” or to the “right” of the arm pulled, with the feedback being erroneous with probability $1-\pi$. This formulation thus targets optimization in continuous action spaces with both <i>informative</i> and <i>deceptive</i> feedback. To tackle this class of problems, we formulate a compact and scalable Bayesian representation of the solution space that simultaneously captures both the location of the optimal arm as well as the probability of receiving correct feedback. We further introduce the accompanying Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme for balancing exploration against exploitation. By learning $\pi$, TS-SPL also supports <i>deceptive</i> environments that are lying about the direction of the optimal arm. This, in turn, allows us to address the fundamental Stochastic Root Finding (SRF) problem. Empirical results demonstrate that our scheme deals with both deceptive and informative environments, significantly outperforming competing algorithms both for SRF and SPL.
Bayesian Combination of Probabilistic Classifiers using Multivariate Normal Mixtures
http://jmlr.org/papers/v20/18-241.html
2019
Ensemble methods are a powerful tool, often outperforming individual prediction models. Existing Bayesian ensembles either do not model the correlations between sources, or they are only capable of combining non-probabilistic predictions. We propose a new model, which overcomes these disadvantages. Transforming the probabilistic predictions with the inverse additive logistic transformation allows us to model the correlations with multivariate normal mixtures. We derive an efficient Gibbs sampler for the proposed model and implement a regularization method to make it more robust. We compare our method to related work and the classical linear opinion pool. Empirical evaluation on several toy and real-world data sets, including a case study on air-pollution forecasting, shows that the method outperforms other methods, while being robust and easy to use.
No-Regret Bayesian Optimization with Unknown Hyperparameters
http://jmlr.org/papers/v20/18-213.html
2019
Bayesian optimization (BO) based on Gaussian process models is a powerful paradigm to optimize black-box functions that are expensive to evaluate. While several BO algorithms provably converge to the global optimum of the unknown function, they assume that the hyperparameters of the kernel are known in advance. This is not the case in practice and misspecification often causes these algorithms to converge to poor local optima. In this paper, we present the first BO algorithm that is provably no-regret and converges to the optimum without knowledge of the hyperparameters. During optimization we slowly adapt the hyperparameters of stationary kernels and thereby expand the associated function class over time, so that the BO algorithm considers more complex function candidates. Based on the theoretical insights, we propose several practical algorithms that achieve the empirical sample efficiency of BO with online hyperparameter estimation, but retain theoretical convergence guarantees. We evaluate our method on several benchmark problems.
Using Simulation to Improve Sample-Efficiency of Bayesian Optimization for Bipedal Robots
http://jmlr.org/papers/v20/18-196.html
2019
Learning for control can acquire controllers for novel robotic tasks, paving the path for autonomous agents. Such controllers can be expert-designed policies, which typically require tuning of parameters for each task scenario. In this context, Bayesian optimization (BO) has emerged as a promising approach for automatically tuning controllers. However, sample-efficiency can still be an issue for high-dimensional policies on hardware. Here, we develop an approach that utilizes simulation to learn structured feature transforms that map the original parameter space into a domain-informed space. During BO, similarity between controllers is now calculated in this transformed space. Experiments on the ATRIAS robot hardware and simulation show that our approach succeeds at sample-efficiently learning controllers for multiple robots. Another question arises: What if the simulation significantly differs from hardware? To answer this, we create increasingly approximate simulators and study the effect of increasing simulation-hardware mismatch on the performance of Bayesian optimization. We also compare our approach to other approaches from literature, and find it to be more reliable, especially in cases of high mismatch. Our experiments show that our approach succeeds across different controller types, bipedal robot models and simulator fidelity levels, making it applicable to a wide range of bipedal locomotion problems.
Efficient augmentation and relaxation learning for individualized treatment rules using observational data
http://jmlr.org/papers/v20/18-191.html
2019
Individualized treatment rules aim to identify if, when, which, and to whom treatment should be applied. A globally aging population, rising healthcare costs, and increased access to patient-level data have created an urgent need for high-quality estimators of individualized treatment rules that can be applied to observational data. A recent and promising line of research for estimating individualized treatment rules recasts the problem of estimating an optimal treatment rule as a weighted classification problem. We consider a class of estimators for optimal treatment rules that are analogous to convex large-margin classifiers. The proposed class applies to observational data and is doubly-robust in the sense that correct specification of either a propensity or outcome model leads to consistent estimation of the optimal individualized treatment rule. Using techniques from semiparametric efficiency theory, we derive rates of convergence for the proposed estimators and use these rates to characterize the bias-variance trade-off for estimating individualized treatment rules with classification-based methods. Simulation experiments informed by these results demonstrate that it is possible to construct new estimators within the proposed framework that significantly outperform existing ones. We illustrate the proposed methods using data from a labor training program and a study of inflammatory bowel syndrome.
Analysis of spectral clustering algorithms for community detection: the general bipartite setting
http://jmlr.org/papers/v20/18-170.html
2019
We consider spectral clustering algorithms for community detection under a general bipartite stochastic block model (SBM). A modern spectral clustering algorithm consists of three steps: (1) regularization of an appropriate adjacency or Laplacian matrix (2) a form of spectral truncation and (3) a kmeans type algorithm in the reduced spectral domain. We focus on the adjacency-based spectral clustering and for the first step, propose a new data-driven regularization that can restore the concentration of the adjacency matrix even for the sparse networks. This result is based on recent work on regularization of random binary matrices, but avoids using unknown population level parameters, and instead estimates the necessary quantities from the data. We also propose and study a novel variation of the spectral truncation step and show how this variation changes the nature of the misclassification rate in a general SBM. We then show how the consistency results can be extended to models beyond SBMs, such as inhomogeneous random graph models with approximate clusters, including a graphon clustering problem, as well as general sub-Gaussian biclustering. A theme of the paper is providing a better understanding of the analysis of spectral methods for community detection and establishing consistency results, under fairly general clustering models and for a wide regime of degree growths, including sparse cases where the average expected degree grows arbitrarily slowly.
Boosted Kernel Ridge Regression: Optimal Learning Rates and Early Stopping
http://jmlr.org/papers/v20/18-063.html
2019
In this paper, we introduce a learning algorithm, boosted kernel ridge regression (BKRR), that combines $L_2$-Boosting with the kernel ridge regression (KRR). We analyze the learning performance of this algorithm in the framework of learning theory. We show that BKRR provides a new bias-variance trade-off via tuning the number of boosting iterations, which is different from KRR via adjusting the regularization parameter. A (semi-)exponential bias-variance trade-off is derived for BKRR, exhibiting a stable relationship between the generalization error and the number of iterations. Furthermore, an adaptive stopping rule is proposed, with which BKRR achieves the optimal learning rate without saturation.
Robust Frequent Directions with Application in Online Learning
http://jmlr.org/papers/v20/17-773.html
2019
The frequent directions (FD) technique is a deterministic approach for online sketching that has many applications in machine learning. The conventional FD is a heuristic procedure that often outputs rank deficient matrices. To overcome the rank deficiency problem, we propose a new sketching strategy called robust frequent directions (RFD) by introducing a regularization term. RFD can be derived from an optimization problem. It updates the sketch matrix and the regularization term adaptively and jointly. RFD reduces the approximation error of FD without increasing the computational cost. We also apply RFD to online learning and propose an effective hyperparameter-free online Newton algorithm. We derive a regret bound for our online Newton algorithm based on RFD, which guarantees the robustness of the algorithm. The experimental studies demonstrate that the proposed method outperforms state-of-the-art second order online learning algorithms.
Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python
http://jmlr.org/papers/v20/17-722.html
2019
We describe a new library named picasso, which implements a unified framework of pathwise coordinate optimization for a variety of sparse learning problems (e.g., sparse linear regression, sparse logistic regression, sparse Poisson regression and scaled sparse linear regression) combined with efficient active set selection strategies. Besides, the library allows users to choose different sparsity-inducing regularizers, including the convex $\ell_1$, nonvoncex MCP and SCAD regularizers. The library is coded in \texttt{C++} and has user-friendly R and Python wrappers. Numerical experiments demonstrate that picasso can scale up to large problems efficiently.
DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization
http://jmlr.org/papers/v20/17-608.html
2019
Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.
Utilizing Second Order Information in Minibatch Stochastic Variance Reduced Proximal Iterations
http://jmlr.org/papers/v20/17-594.html
2019
We present a novel minibatch stochastic optimization method for empirical risk minimization of linear predictors. The method efficiently leverages both sub-sampled first-order and higher-order information, by incorporating variance-reduction and acceleration techniques. We prove improved iteration complexity over state-of-the-art methods under suitable conditions. In particular, the approach enjoys global fast convergence for quadratic convex objectives and local fast convergence for general convex objectives. Experiments are provided to demonstrate the empirical advantage of the proposed method over existing approaches in the literature.
Decontamination of Mutual Contamination Models
http://jmlr.org/papers/v20/17-576.html
2019
Many machine learning problems can be characterized by \emph{mutual contamination models}. In these problems, one observes several random samples from different convex combinations of a set of unknown base distributions and the goal is to infer these base distributions. This paper considers the general setting where the base distributions are defined on arbitrary probability spaces. We examine three popular machine learning problems that arise in this general setting: multiclass classification with label noise, demixing of mixed membership models, and classification with partial labels. In each case, we give sufficient conditions for identifiability and present algorithms for the infinite and finite sample settings, with associated performance guarantees.
Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations
http://jmlr.org/papers/v20/17-526.html
2019
We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as an weak approximation, which leads to a number of precise and useful results on the approximations of stochastic gradient descent (SGD), momentum SGD and stochastic Nesterov's accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting.
A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication
http://jmlr.org/papers/v20/17-451.html
2019
In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical relationship between cost and accuracy is typically unknown, and consequently, it may be difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much computation is needed to achieve a given level of accuracy. In the current paper, we study randomized matrix multiplication (sketching) as a prototype setting for addressing these general problems. As a solution, we develop a bootstrap method for directly estimating the accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds on the accuracy in terms of the reduced dimension). From a computational standpoint, the proposed method does not substantially increase the cost of standard sketching methods, and this is made possible by an “extrapolation” technique. In addition, we provide both theoretical and empirical results to demonstrate the effectiveness of the proposed method.
Approximation Hardness for A Class of Sparse Optimization Problems
http://jmlr.org/papers/v20/17-373.html
2019
In this paper, we consider three typical optimization problems with a convex loss function and a nonconvex sparse penalty or constraint. For the sparse penalized problem, we prove that finding an $\mathcal{O}(n^{c_1}d^{c_2})$-optimal solution to an $n\times d$ problem is strongly NP-hard for any $c_1, c_2\in [0,1)$ such that $c_1+c_2<1$. For two constrained versions of the sparse optimization problem, we show that it is intractable to approximately compute a solution path associated with increasing values of some tuning parameter. The hardness results apply to a broad class of loss functions and sparse penalties. They suggest that one cannot even approximately solve these three problems in polynomial time, unless P $=$ NP.
A Well-Tempered Landscape for Non-convex Robust Subspace Recovery
http://jmlr.org/papers/v20/17-324.html
2019
We present a mathematical analysis of a non-convex energy landscape for robust subspace recovery. We prove that an underlying subspace is the only stationary point and local minimizer in a specified neighborhood under a deterministic condition on a dataset. If the deterministic condition is satisfied, we further show that a geodesic gradient descent method over the Grassmannian manifold can exactly recover the underlying subspace when the method is properly initialized. Proper initialization by principal component analysis is guaranteed with a simple deterministic condition. Under slightly stronger assumptions, the gradient descent method with a piecewise constant step-size scheme achieves linear convergence. The practicality of the deterministic condition is demonstrated on some statistical models of data, and the method achieves almost state-of-the-art recovery guarantees on the Haystack Model for different regimes of sample size and ambient dimension. In particular, when the ambient dimension is fixed and the sample size is large enough, we show that our gradient method can exactly recover the underlying subspace for any fixed fraction of outliers (less than 1).
A New Approach to Laplacian Solvers and Flow Problems
http://jmlr.org/papers/v20/17-286.html
2019
This paper investigates the behavior of the Min-Sum message passing scheme to solve systems of linear equations in the Laplacian matrices of graphs and to compute electric flows. Voltage and flow problems involve the minimization of quadratic functions and are fundamental primitives that arise in several domains. Algorithms that have been proposed are typically centralized and involve multiple graph-theoretic constructions or sampling mechanisms that make them difficult to implement and analyze. On the other hand, message passing routines are distributed, simple, and easy to implement. In this paper we establish a framework to analyze Min-Sum to solve voltage and flow problems. We characterize the error committed by the algorithm on general weighted graphs in terms of hitting times of random walks defined on the computation trees that support the operations of the algorithms with time. For $d$-regular graphs with equal weights, we show that the convergence of the algorithms is controlled by the total variation distance between the distributions of non-backtracking random walks defined on the original graph that start from neighboring nodes. The framework that we introduce extends the analysis of Min-Sum to settings where the contraction arguments previously considered in the literature (based on the assumption of walk summability or scaled diagonal dominance) can not be used, possibly in the presence of constraints.
Optimal Policies for Observing Time Series and Related Restless Bandit Problems
http://jmlr.org/papers/v20/17-185.html
2019
The trade-off between the cost of acquiring and processing data, and uncertainty due to a lack of data is fundamental in machine learning. A basic instance of this trade-off is the problem of deciding when to make noisy and costly observations of a discrete-time Gaussian random walk, so as to minimise the posterior variance plus observation costs. We present the first proof that a simple policy, which observes when the posterior variance exceeds a threshold, is optimal for this problem. The proof generalises to a wide range of cost functions other than the posterior variance. It is based on a new verification theorem by Nino-Mora that guarantees threshold structure for Markov decision processes, and on the relation between binary sequences known as Christoffel words and the dynamics of discontinuous nonlinear maps, which frequently arise in physics, control and biology. This result implies that optimal policies for linear-quadratic-Gaussian control with costly observations have a threshold structure. It also implies that the restless bandit problem of observing multiple such time series, has a well-defined Whittle index policy. We discuss computation of that index, give closed-form formulae for it, and compare the performance of the associated index policy with heuristic policies.
Matched Bipartite Block Model with Covariates
http://jmlr.org/papers/v20/17-153.html
2019
Community detection or clustering is a fundamental task in the analysis of network data. Many real networks have a bipartite structure which makes community detection challenging. In this paper, we consider a model which allows for matched communities in the bipartite setting, in addition to node covariates with information about the matching. We derive a simple fast algorithm for fitting the model based on variational inference ideas and show its effectiveness on both simulated and real data. A variation of the model to allow for degree-correction is also considered, in addition to a novel approach to fitting such degree-corrected models.
The Relationship Between Agnostic Selective Classification, Active Learning and the Disagreement Coefficient
http://jmlr.org/papers/v20/17-147.html
2019
A selective classifier $(f,g)$ comprises a classification function $f$ and a binary selection function $g$, which determines if the classifier abstains from prediction, or uses $f$ to predict. The classifier is called pointwise-competitive if it classifies each point identically to the best classifier in hindsight (from the same class), whenever it does not abstain. The quality of such a classifier is quantified by its rejection mass, defined to be the probability mass of the points it rejects. A “fast” rejection rate is achieved if the rejection mass is bounded from above by $\tilde{O}(1/m)$ where $m$ is the number of labeled examples used to train the classifier (and $\tilde{O}$ hides logarithmic factors). Pointwise-competitive selective (PCS) classifiers are intimately related to disagreement-based active learning and it is known that in the realizable case, a fast rejection rate of a known PCS algorithm (called Consistent Selective Strategy) is equivalent to an exponential speedup of the well-known CAL active algorithm. We focus on the agnostic setting, for which there is a known algorithm called LESS that learns a PCS classifier and achieves a fast rejection rate (depending on Hannekeâs disagreement coefficient) under strong assumptions. We present an improved PCS learning algorithm called ILESS for which we show a fast rate (depending on Hanneke's disagreement coefficient) without any assumptions. Our rejection bound smoothly interpolates the realizable and agnostic settings. The main result of this paper is an equivalence between the following three entities: (i) the existence of a fast rejection rate for any PCS learning algorithm (such as ILESS); (ii) a poly-logarithmic bound for Hanneke's disagreement coefficient; and (iii) an exponential speedup for a new disagreement-based active learner called {\ActiveiLESS}.
NetSDM: Semantic Data Mining with Network Analysis
http://jmlr.org/papers/v20/17-066.html
2019
Semantic data mining (SDM) is a form of relational data mining that uses annotated data together with complex semantic background knowledge to learn rules that can be easily interpreted. The drawback of SDM is a high computational complexity of existing SDM algorithms, resulting in long run times even when applied to relatively small data sets. This paper proposes an effective SDM approach, named NetSDM, which first transforms the available semantic background knowledge into a network format, followed by network analysis based node ranking and pruning to significantly reduce the size of the original background knowledge. The experimental evaluation of the NetSDM methodology on acute lymphoblastic leukemia and breast cancer data demonstrates that NetSDM achieves radical time efficiency improvements and that learned rules are comparable or better than the rules obtained by the original SDM algorithms.
Kernels for Sequentially Ordered Data
http://jmlr.org/papers/v20/16-314.html
2019
We present a novel framework for learning with sequential data of any kind, such as multivariate time series, strings, or sequences of graphs. The main result is a ”sequentialization” that transforms any kernel on a given domain into a kernel for sequences in that domain. This procedure preserves properties such as positive definiteness, the associated kernel feature map is an ordered variant of sample (cross-)moments, and this sequentialized kernel is consistent in the sense that it converges to a kernel for paths if sequences converge to paths (by discretization). Further, classical kernels for sequences arise as special cases of this method. We use dynamic programming and low-rank techniques for tensors to provide efficient algorithms to compute this sequentialized kernel.
Exact Clustering of Weighted Graphs via Semidefinite Programming
http://jmlr.org/papers/v20/16-128.html
2019
As a model problem for clustering, we consider the densest $k$-disjoint-clique problem of partitioning a weighted complete graph into $k$ disjoint subgraphs such that the sum of the densities of these subgraphs is maximized. We establish that such subgraphs can be recovered from the solution of a particular semidefinite relaxation with high probability if the input graph is sampled from a distribution of clusterable graphs. Specifically, the semidefinite relaxation is exact if the graph consists of \(k\) large disjoint subgraphs, corresponding to clusters, with weight concentrated within these subgraphs, plus a moderate number of nodes not belonging to any cluster. Further, we establish that if noise is weakly obscuring these clusters, i.e, the between-cluster edges are assigned very small weights, then we can recover significantly smaller clusters. For example, we show that in approximately sparse graphs, where the between-cluster weights tend to zero as the size $n$ of the graph tends to infinity, we can recover clusters of size polylogarithmic in $n$ under certain conditions on the distribution of edge weights. Empirical evidence from numerical simulations is also provided to support these theoretical phase transitions to perfect recovery of the cluster structure.
Iterated Learning in Dynamic Social Networks
http://jmlr.org/papers/v20/18-539.html
2019
A classic finding by (Kalish et al., 2007) shows that no language can be learned iteratively by rational agents in a self-sustained manner. In other words, if $A$ teaches a foreign language to $B$, who then teaches what she learned to $C$, and so on, the language will quickly get lost and agents will wind up teaching their own common native language. If so, how can linguistic novelty ever be sustained? We address this apparent paradox by considering the case of iterated learning in a social network: we show that by varying the lengths of the learning sessions over time or by keeping the networks dynamic, it is possible for iterated learning to endure forever with arbitrarily small loss.
Pyro: Deep Universal Probabilistic Programming
http://jmlr.org/papers/v20/18-403.html
2019
Pyro is a probabilistic programming language built on Python as a platform for developing advanced probabilistic models in AI research. To scale to large data sets and high-dimensional models, Pyro uses stochastic variational inference algorithms and probability distributions built on top of PyTorch, a modern GPU-accelerated deep learning framework. To accommodate complex or model-specific algorithmic behavior, Pyro leverages Poutine, a library of composable building blocks for modifying the behavior of probabilistic programs.
Monotone Learning with Rectified Wire Networks
http://jmlr.org/papers/v20/18-281.html
2019
We introduce a new neural network model, together with a tractable and monotone online learning algorithm. Our model describes feed-forward networks for classification, with one output node for each class. The only nonlinear operation is rectification using a ReLU function with a bias. However, there is a rectifier on every edge rather than at the nodes of the network. There are also weights, but these are positive, static, and associated with the nodes. Our rectified wire networks are able to represent arbitrary Boolean functions. Only the bias parameters, on the edges of the network, are learned. Another departure in our approach, from standard neural networks, is that the loss function is replaced by a constraint. This constraint is simply that the value of the output node associated with the correct class should be zero. Our model has the property that the exact norm-minimizing parameter update, required to correctly classify a training item, is the solution to a quadratic program that can be computed with a few passes through the network. We demonstrate a training algorithm using this update, called sequential deactivation (SDA), on MNIST and some synthetic datasets. Upon adopting a natural choice for the nodal weights, SDA has no hyperparameters other than those describing the network structure. Our experiments explore behavior with respect to network size and depth in a family of sparse expander networks.
TensorLy: Tensor Learning in Python
http://jmlr.org/papers/v20/18-277.html
2019
Tensors are higher-order extensions of matrices. While matrix methods form the cornerstone of traditional machine learning and data analysis, tensor methods have been gaining increasing traction. However, software support for tensor operations is not on the same footing. In order to bridge this gap, we have developed TensorLy, a Python library that provides a high-level API for tensor methods and deep tensorized neural networks. TensorLy aims to follow the same standards adopted by the main projects of the Python scientific community, and to seamlessly integrate with them. Its BSD license makes it suitable for both academic and commercial applications. TensorLy's backend system allows users to perform computations with several libraries such as NumPy or PyTorch to name but a few. They can be scaled on multiple CPU or GPU machines. In addition, using the deep-learning frameworks as backend allows to easily design and train deep tensorized neural networks. TensorLy is available at https://github.com/tensorly/tensorly
Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations
http://jmlr.org/papers/v20/18-190.html
2019
The success of deep convolutional architectures is often attributed in part to their ability to learn multiscale and invariant representations of natural signals. However, a precise study of these properties and how they affect learning guarantees is still missing. In this paper, we consider deep convolutional representations of signals; we study their invariance to translations and to more general groups of transformations, their stability to the action of diffeomorphisms, and their ability to preserve signal information. This analysis is carried by introducing a multilayer kernel based on convolutional kernel networks and by studying the geometry induced by the kernel mapping. We then characterize the corresponding reproducing kernel Hilbert space (RKHS), showing that it contains a large class of convolutional neural networks with homogeneous activation functions. This analysis allows us to separate data representation from learning, and to provide a canonical measure of model complexity, the RKHS norm, which controls both stability and generalization of any learned model. In addition to models in the constructed RKHS, our stability analysis also applies to convolutional networks with generic activations such as rectified linear units, and we discuss its relationship with recent generalization bounds based on spectral norms.
Joint PLDA for Simultaneous Modeling of Two Factors
http://jmlr.org/papers/v20/18-134.html
2019
Probabilistic linear discriminant analysis (PLDA) is a method used for biometric problems like speaker or face recognition that models the variability of the samples using two latent variables, one that depends on the class of the sample and another one that is assumed independent across samples and models the within-class variability. In this work, we propose a generalization of PLDA that enables joint modeling of two sample-dependent factors: the class of interest and a nuisance condition. The approach does not change the basic form of PLDA but rather modifies the training procedure to consider the dependency across samples of the latent variable that models within-class variability. While the identity of the nuisance condition is needed during training, it is not needed during testing since we propose a scoring procedure that marginalizes over the corresponding latent variable. We show results on a multilingual speaker-verification task, where the language spoken is considered a nuisance condition. The proposed joint PLDA approach leads to significant performance gains in this task for two different data sets, in particular when the training data contains mostly or only monolingual speakers.
Determining the Number of Latent Factors in Statistical Multi-Relational Learning
http://jmlr.org/papers/v20/18-037.html
2019
Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer $s$, RESCAL computes an $s$-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering. The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish its rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistencies when the number of relations is either bounded or diverges at a proper rate of the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.
Random Feature-based Online Multi-kernel Learning in Environments with Unknown Dynamics
http://jmlr.org/papers/v20/18-030.html
2019
Kernel-based methods exhibit well-documented performance in various nonlinear learning tasks. Most of them rely on a preselected kernel, whose prudent choice presumes task-specific prior information. Especially when the latter is not available, multi-kernel learning has gained popularity thanks to its flexibility in choosing kernels from a prescribed kernel dictionary. Leveraging the random feature approximation and its recent orthogonality-promoting variant, the present contribution develops a scalable multi-kernel learning scheme (termed Raker) to obtain the sought nonlinear learning function `on the fly,' first for static environments. To further boost performance in dynamic environments, an adaptive multi-kernel learning scheme (termed AdaRaker) is developed. AdaRaker accounts not only for data-driven learning of kernel combination, but also for the unknown dynamics. Performance is analyzed in terms of both static and dynamic regrets. AdaRaker is uniquely capable of tracking nonlinear learning functions in environments with unknown dynamics, and with with analytic performance guarantees Tests with synthetic and real datasets are carried out to showcase the effectiveness of the novel algorithms.
Spectrum Estimation from a Few Entries
http://jmlr.org/papers/v20/18-027.html
2019
Singular values of a data in a matrix form provide insights on the structure of the data, the effective dimensionality, and the choice of hyper-parameters on higher-level data analysis tools. However, in many practical applications such as collaborative filtering and network analysis, we only get a partial observation. Under such scenarios, we consider the fundamental problem of recovering spectral properties of the underlying matrix from a sampling of its entries. In this paper, we address the problem of directly recovering the spectrum, which is the set of singular values, and also in sample-efficient approaches for recovering a spectral sum function, which is an aggregate sum of a fixed function applied to each of the singular values. Our approach is to first estimate the Schatten $k$-norms of a matrix for a small set of values of $k$, and then apply Chebyshev approximation when estimating a spectral sum function or apply moment matching in Wasserstein distance when estimating the singular values directly. The main technical challenge is in accurately estimating the Schatten norms from a sampling of a matrix. We introduce a novel unbiased estimator based on counting small structures called network motifs in a graph and provide guarantees that match its empirical performance. Our theoretical analysis shows that Schatten norms can be recovered accurately from strictly smaller number of samples compared to what is needed to recover the underlying low-rank matrix. Numerical experiments suggest that we significantly improve upon a competing approach of using matrix completion methods, below the matrix completion threshold, above which matrix completion algorithms recover the underlying low-rank matrix exactly.
Accelerated Alternating Projections for Robust Principal Component Analysis
http://jmlr.org/papers/v20/18-022.html
2019
We study robust PCA for the fully observed setting, which is about separating a low rank matrix $\BL$ and a sparse matrix $\BS$ from their sum $\BD=\BL+\BS$. In this paper, a new algorithm, dubbed accelerated alternating projections, is introduced for robust PCA which significantly improves the computational efficiency of the existing alternating projections proposed in (Netrapalli et al., 2014) when updating the low rank factor. The acceleration is achieved by first projecting a matrix onto some low dimensional subspace before obtaining a new estimate of the low rank matrix via truncated SVD. Exact recovery guarantee has been established which shows linear convergence of the proposed algorithm. Empirical performance evaluations establish the advantage of our algorithm over other state-of-the-art algorithms for robust PCA.
spark-crowd: A Spark Package for Learning from Crowdsourced Big Data
http://jmlr.org/papers/v20/17-743.html
2019
As the data sets increase in size, the process of manually labeling data becomes unfeasible by small groups of experts. Thus, it is common to rely on crowdsourcing platforms which provide inexpensive, but noisy, labels. Although implementations of algorithms to tackle this problem exist, none of them focus on scalability, limiting the area of application to relatively small data sets. In this paper, we present spark-crowd, an Apache Spark package for learning from crowdsourced data with scalability in mind.
Multiplicative local linear hazard estimation and best one-sided cross-validation
http://jmlr.org/papers/v20/17-663.html
2019
This paper develops detailed mathematical statistical theory of a new class of cross-validation techniques of local linear kernel hazards and their multiplicative bias corrections. The new class of cross-validation combines principles of local information and recent advances in indirect cross-validation. A few applications of cross-validating multiplicative kernel hazard estimation do exist in the literature. However, detailed mathematical statistical theory and small sample performance are introduced via this paper and further upgraded to our new class of best one-sided cross-validation. Best one-sided cross-validation turns out to have excellent performance in its practical illustrations, in its small sample performance and in its mathematical statistical theoretical performance.
Delay and Cooperation in Nonstochastic Bandits
http://jmlr.org/papers/v20/17-631.html
2019
We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.
Smooth neighborhood recommender systems
http://jmlr.org/papers/v20/17-629.html
2019
Recommender systems predict users' preferences over a large number of items by pooling similar information from other users and/or items in the presence of sparse observations. One major challenge is how to utilize user-item specific covariates and networks describing user-item interactions in a high-dimensional situation, for accurate personalized prediction. In this article, we propose a smooth neighborhood recommender in the framework of the latent factor models. A similarity kernel is utilized to borrow neighborhood information from continuous covariates over a user-item specific network, such as a user's social network, where the grouping information defined by discrete covariates is also integrated through the network. Consequently, user-item specific information is built into the recommender to battle the `cold-start” issue in the absence of observations in collaborative and content-based filtering. Moreover, we utilize a “divide-and-conquer” version of the alternating least squares algorithm to achieve scalable computation, and establish asymptotic results for the proposed method, demonstrating that it achieves superior prediction accuracy. Finally, we illustrate that the proposed method improves substantially over its competitors in simulated examples and real benchmark data--Last.fm music data.
Automated Scalable Bayesian Inference via Hilbert Coresets
http://jmlr.org/papers/v20/17-613.html
2019
The automation of posterior inference in Bayesian data analysis has enabled experts and nonexperts alike to use more sophisticated models, engage in faster exploratory modeling and analysis, and ensure experimental reproducibility. However, standard automated posterior inference algorithms are not tractable at the scale of massive modern data sets, and modifications to make them so are typically model-specific, require expert tuning, and can break theoretical guarantees on inferential quality. Building on the Bayesian coresets framework, this work instead takes advantage of data redundancy to shrink the data set itself as a preprocessing step, providing fully-automated, scalable Bayesian inference with theoretical guarantees. We begin with an intuitive reformulation of Bayesian coreset construction as sparse vector sum approximation, and demonstrate that its automation and performance-based shortcomings arise from the use of the supremum norm. To address these shortcomings we develop Hilbert coresets, i.e., Bayesian coresets constructed under a norm induced by an inner-product on the log-likelihood function space. We propose two Hilbert coreset construction algorithms---one based on importance sampling, and one based on the Frank-Wolfe algorithm---along with theoretical guarantees on approximation quality as a function of coreset size. Since the exact computation of the proposed inner-products is model-specific, we automate the construction with a random finite-dimensional projection of the log-likelihood functions. The resulting automated coreset construction algorithm is simple to implement, and experiments on a variety of models with real and synthetic data sets show that it provides high-quality posterior approximations and a significant reduction in the computational cost of inference.
Approximations of the Restless Bandit Problem
http://jmlr.org/papers/v20/17-547.html
2019
The multi-armed restless bandit problem is studied in the case where the pay-off distributions are stationary $\varphi$-mixing. This version of the problem provides a more realistic model for most real-world applications, but cannot be optimally solved in practice, since it is known to be PSPACE-hard. The objective of this paper is to characterize a sub-class of the problem where good approximate solutions can be found using tractable approaches. Specifically, it is shown that under some conditions on the $\varphi$-mixing coefficients, a modified version of UCB can prove effective. The main challenge is that, unlike in the i.i.d. setting, the distributions of the sampled pay-offs may not have the same characteristics as those of the original bandit arms. In particular, the $\varphi$-mixing property does not necessarily carry over. This is overcome by carefully controlling the effect of a sampling policy on the pay-off distributions. Some of the proof techniques developed in this paper can be more generally used in the context of online sampling under dependence. Proposed algorithms are accompanied with corresponding regret analysis.
Train and Test Tightness of LP Relaxations in Structured Prediction
http://jmlr.org/papers/v20/17-535.html
2019
Structured prediction is used in areas including computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation for the striking observation that approximations based on linear programming (LP) relaxations are often tight (exact) on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that this training tightness generalizes to test data.
Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds
http://jmlr.org/papers/v20/17-517.html
2019
Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nystrom method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}{\epsilon} (1 + o(1))$ features constructed using a so-called rank-restricted Nystrom approximation results in cluster assignments that satisfy a $1 + \epsilon$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nystrom method. As part of the analysis, this work establishes a novel $1 + \epsilon$ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nystrom approximation. Empirical evaluations on the $8.1$ million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel $k$-means clustering with Nystrom approximation. This work argues that spectral clustering using Nystrom approximation---a popular and computationally efficient, but theoretically unsound approach to non-linear clustering---should be replaced with the efficient and theoretically sound combination of kernel $k$-means clustering with Nystrom approximation. The superior performance of the latter approach is empirically verified.
An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory
http://jmlr.org/papers/v20/17-504.html
2019
In this paper, the problem of one-bit compressed sensing (OBCS) is formulated as a problem in probably approximately correct (PAC) learning. It is shown that the Vapnik-Chervonenkis (VC-) dimension of the set of half-spaces in $\R^n$ generated by $k$-sparse vectors is bounded below by $k ( \lfloor\lg (n/k) \rfloor +1 )$ and above by $\lfloor 2k \lg (en) \rfloor $. By coupling this estimate with well-established results in PAC learning theory, we show that a consistent algorithm can recover a $k$-sparse vector with $O(k \lg n)$ measurements, given only the signs of the measurement vector. This result holds for \textit{all} probability measures on $\R^n$. The theory is also applicable to the case of noisy labels, where the signs of the measurements are flipped with some unknown probability.
Graphical Lasso and Thresholding: Equivalence and Closed-form Solutions
http://jmlr.org/papers/v20/17-501.html
2019
Graphical Lasso (GL) is a popular method for learning the structure of an undirected graphical model, which is based on an $l_1$ regularization technique. The objective of this paper is to compare the computationally-heavy GL technique with a numerically-cheap heuristic method that is based on simply thresholding the sample covariance matrix. To this end, two notions of sign-consistent and inverse-consistent matrices are developed, and then it is shown that the thresholding and GL methods are equivalent if: (i) the thresholded sample covariance matrix is both sign-consistent and inverse-consistent, and (ii) the gap between the largest thresholded and the smallest un-thresholded entries of the sample covariance matrix is not too small. By building upon this result, it is proved that the GL method---as a conic optimization problem---has an explicit closed-form solution if the thresholded sample covariance matrix has an acyclic structure. This result is then generalized to arbitrary sparse support graphs, where a formula is found to obtain an approximate solution of GL. Furthermore, it is shown that the approximation error of the derived explicit formula decreases exponentially fast with respect to the length of the minimum-length cycle of the sparsity graph. The developed results are demonstrated on synthetic data, functional MRI data, traffic flows for transportation networks, and massive randomly generated data sets. We show that the proposed method can obtain an accurate approximation of the GL for instances with the sizes as large as $80,000\times 80,000$ (more than 3.2 billion variables) in less than 30 minutes on a standard laptop computer running MATLAB, while other state-of-the-art methods do not converge within 4 hours
Dynamic Pricing in High-dimensions
http://jmlr.org/papers/v20/17-357.html
2019
We study the pricing problem faced by a firm that sells a large number of products, described via a wide range of features, to customers that arrive over time. Customers independently make purchasing decisions according to a general choice model that includes products features and customers' characteristics, encoded as $d$-dimensional numerical vectors, as well as the price offered. The parameters of the choice model are a priori unknown to the firm, but can be learned as the (binary-valued) sales data accrues over time. The firm's objective is to maximize its revenue. We benchmark the performance using the classic regret minimization framework where the regret is defined as the expected revenue loss against a clairvoyant policy that knows the parameters of the choice model in advance, and always offers the revenue-maximizing price. This setting is motivated in part by the prevalence of online marketplaces that allow for real-time pricing. We assume a structured choice model, parameters of which depend on $s_0$ out of the $d$ product features. Assuming that the market noise distribution is known, we propose a dynamic policy, called Regularized Maximum Likelihood Pricing (RMLP) that leverages the (sparsity) structure of the high-dimensional model and obtains a logarithmic regret in $T$. More specifically, the regret of our algorithm is of $O(s_0 \log d \cdot \log T)$. Furthermore, we show that no policy can obtain regret better than $O(s_0 (\log d + \log T))$. {In addition, we propose a generalization of our policy to a setting that the market noise distribution is unknown but belongs to a parametrized family of distributions. This policy obtains regret of $O(\sqrt{(\log d)T})$. We further show that no policy can obtain regret better than $\Omega(\sqrt{T})$ in such environments.}
Forward-Backward Selection with Early Dropping
http://jmlr.org/papers/v20/17-334.html
2019
Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent with the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.
Scalable Approximations for Generalized Linear Problems
http://jmlr.org/papers/v20/17-279.html
2019
In stochastic optimization, the population risk is generally approximated by the empirical risk which is in turn minimized by an iterative algorithm. However, in the large-scale setting, empirical risk minimization may be computationally restrictive. In this paper, we design an efficient algorithm to approximate the population risk minimizer in generalized linear problems such as binary classification with surrogate losses and generalized linear regression models. We focus on large-scale problems where the iterative minimization of the empirical risk is computationally intractable, i.e., the number of observations $n$ is much larger than the dimension of the parameter $p$ ($n \gg p \gg 1$). We show that under random sub-Gaussian design, the true minimizer of the population risk is approximately proportional to the corresponding ordinary least squares (OLS) estimator. Using this relation, we design an algorithm that achieves the same accuracy as the empirical risk minimizer through iterations that attain up to a quadratic convergence rate, and that are computationally cheaper than any batch optimization algorithm by at least a factor of $\mathcal{O}(p)$. We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions. Finally, we demonstrate the performance of our algorithm on well-known classification and regression problems, through extensive numerical studies on large-scale datasets, and show that it achieves the highest performance compared to several other widely used optimization algorithms.
scikit-multilearn: A Python library for Multi-Label Classification
http://jmlr.org/papers/v20/17-100.html
2019
The scikit-multilearn is a Python library for performing multi-label classification. It is compatible with the scikit-learn and scipy ecosystems and uses sparse matrices for all internal operations; provides native Python implementations of popular multi-label classification methods alongside a novel framework for label space partitioning and division and includes modern algorithm adaptation methods, network-based label space division approaches, which extracts label dependency information and multi-label embedding classifiers. The library provides Python wrapped access to the extensive multi-label method stack from Java libraries and makes it possible to extend deep learning single-label methods for multi-label tasks. The library allows multi-label stratification and data set management. The implementation is more efficient in problem transformation than other established libraries, has good test coverage and follows PEP8. Source code and documentation can be downloaded from http://scikit.ml and also via pip The project is BSD-licensed.
Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression
http://jmlr.org/papers/v20/16-607.html
2019
In this paper, we consider the problem of learning high-dimensional tensor regression problems with low-rank structure. One of the core challenges associated with learning high-dimensional models is computation since the underlying optimization problems are often non-convex. While convex relaxations could lead to polynomial-time algorithms they are often slow in practice. On the other hand, limited theoretical guarantees exist for non-convex methods. In this paper we provide a general framework that provides theoretical guarantees for learning high-dimensional tensor regression models under different low-rank structural assumptions using the projected gradient descent algorithm applied to a potentially non-convex constraint set $\Theta$ in terms of its localized Gaussian width (due to Gaussian design). We juxtapose our theoretical results for non-convex projected gradient descent algorithms with previous results on regularized convex approaches. The two main differences between the convex and non-convex approach are: (i) from a computational perspective whether the non-convex projection operator is computable and whether the projection has desirable contraction properties and (ii) from a statistical error bound perspective, the non-convex approach has a superior rate for a number of examples. We provide three concrete examples of low-dimensional structure which address these issues and explain the pros and cons for the non-convex and convex approaches. We supplement our theoretical results with simulations which show that, under several common settings of generalized low rank tensor regression, the projected gradient descent approach is superior both in terms of statistical error and run-time provided the step-sizes of the projected descent algorithm are suitably chosen.
Convergence Rate of a Simulated Annealing Algorithm with Noisy Observations
http://jmlr.org/papers/v20/16-588.html
2019
In this paper we propose a modified version of the simulated annealing algorithm for solving a stochastic global optimization problem. More precisely, we address the problem of finding a global minimizer of a function with noisy evaluations. We provide a rate of convergence and its optimized parametrization to ensure a minimal number of evaluations for a given accuracy and a confidence level close to 1. This work is completed with a set of numerical experimentations and assesses the practical performance both on benchmark test cases and on real world examples.
Parsimonious Online Learning with Kernels via Sparse Projections in Function Space
http://jmlr.org/papers/v20/16-585.html
2019
Despite their attractiveness, popular perception is that techniques for nonparametric function approximation do not scale to streaming data due to an intractable growth in the amount of storage they require. To solve this problem in a memory-affordable way, we propose an online technique based on functional stochastic gradient descent in tandem with supervised sparsification based on greedy function subspace projections. The method, called parsimonious online learning with kernels (POLK), provides a controllable tradeoff between its solution accuracy and the amount of memory it requires. We derive conditions under which the generated function sequence converges almost surely to the optimal function, and we establish that the memory requirement remains finite. We evaluate POLK for kernel multi-class logistic regression and kernel hinge-loss classification on three canonical data sets: a synthetic Gaussian mixture model, the MNIST hand-written digits, and the Brodatz texture database. On all three tasks, we observe a favorable trade-off of objective function evaluation, classification performance, and complexity of the nonparametric regressor extracted by the proposed method.
Transport Analysis of Infinitely Deep Neural Network
http://jmlr.org/papers/v20/16-243.html
2019
We investigated the feature map inside deep neural networks (DNNs) by tracking the transport map. We are interested in the role of depth---why do DNNs perform better than shallow models?---and the interpretation of DNNs---what do intermediate layers do? Despite the rapid development in their application, DNNs remain analytically unexplained because the hidden layers are nested and the parameters are not faithful. Inspired by the integral representation of shallow NNs, which is the continuum limit of the width, or the hidden unit number, we developed the flow representation and transport analysis of DNNs. The flow representation is the continuum limit of the depth, or the hidden layer number, and it is specified by an ordinary differential equation (ODE) with a vector field. We interpret an ordinary DNN as a transport map or an Euler broken line approximation of the flow. Technically speaking, a dynamical system is a natural model for the nested feature maps. In addition, it opens a new way to the coordinate-free treatment of DNNs by avoiding the redundant parametrization of DNNs. Following Wasserstein geometry, we analyze a flow in three aspects: dynamical system, continuity equation, and Wasserstein gradient flow. A key finding is that we specified a series of transport maps of the denoising autoencoder (DAE), which is a cornerstone for the development of deep learning. Starting from the shallow DAE, this paper develops three topics: the transport map of the deep DAE, the equivalence between the stacked DAE and the composition of DAEs, and the development of the double continuum limit or the integral representation of the flow representation. As partial answers to the research questions, we found that deeper DAEs converge faster and the extracted features are better; in addition, a deep Gaussian DAE transports mass to decrease the Shannon entropy of the data distribution. We expect that further investigations on these questions lead to the development of an interpretable and principled alternatives to DNNs.
Adaptation Based on Generalized Discrepancy
http://jmlr.org/papers/v20/15-192.html
2019
We present a new algorithm for domain adaptation improving upon a discrepancy minimization algorithm, (DM), previously shown to outperform a number of algorithms for this problem. Unlike many previously proposed solutions for domain adaptation, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, the reweighting depends on the hypothesis sought. The algorithm is derived from a less conservative notion of discrepancy than the DM algorithm called <i>generalized discrepancy</i>. We present a detailed description of our algorithm and show that it can be formulated as a convex optimization problem. We also give a detailed theoretical analysis of its learning guarantees which helps us select its parameters. Finally, we report the results of experiments demonstrating that it improves upon discrepancy minimization.