We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains.

The
approach implements this idea in the context of neural network
architectures that are trained on labeled data from the source
domain and unlabeled data from the target domain (no labeled
target-domain data is necessary). As the training progresses,
the approach promotes the emergence of features that are (i)
discriminative for the main learning task on the source domain
and (ii) indiscriminate with respect to the shift between the
domains. We show that this adaptation behaviour can be achieved
in almost any feed-forward model by augmenting it with a few
standard layers and a new *gradient reversal* layer. The
resulting augmented architecture can be trained using standard
backpropagation and stochastic gradient descent, and can thus be
implemented with little effort using any of the deep learning
packages.
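The behaviour of a gradient reversal layer can be sketched in a few lines. The following is a minimal illustration in plain NumPy rather than any particular deep learning package; the class name `GradientReversal` and the strength parameter `lam` are assumptions made for this example. The layer is the identity in the forward pass and flips the sign of the gradient in the backward pass.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical name for the reversal strength

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return features

    def backward(self, grad_from_domain_classifier):
        # The feature extractor receives the negated, scaled gradient, so it is
        # pushed towards features on which the domains are harder to tell apart.
        return -self.lam * grad_from_domain_classifier


layer = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
out = layer.forward(x)
grad = layer.backward(np.array([0.2, 0.4, -0.6]))
```

In an automatic differentiation framework the same effect would typically be obtained by registering a custom backward rule for an identity operation.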

We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.

Q-learning (QL) is a popular reinforcement learning algorithm
that is guaranteed to converge to optimal policies in Markov
decision processes. However, QL exhibits an artifact: in
expectation, the effective rate of updating the value of an
action depends on the probability of choosing that action. In
other words, there is a tight coupling between the learning
dynamics and the underlying execution policy. This coupling can
cause performance degradation in noisy *non-stationary*
environments.
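The coupling described above can be made concrete with the standard tabular Q-learning update. This is a minimal sketch; the function names and the default values of `alpha` and `gamma` are assumptions for illustration. Because only the executed action is updated, an action chosen with probability `pi` is updated, in expectation, at rate `alpha * pi` per step.

```python
import numpy as np


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update, applied only to the executed action."""
    target = reward + gamma * np.max(q[next_state])
    q[state, action] += alpha * (target - q[state, action])
    return q


def expected_update_rate(policy_probs, alpha=0.1):
    """Expected per-step update rate of each action under the execution policy."""
    return alpha * np.asarray(policy_probs, dtype=float)


q = np.zeros((2, 2))
q = q_update(q, state=0, action=1, reward=1.0, next_state=1)
rates = expected_update_rate([0.9, 0.1])
```

With a policy that picks the first action 90% of the time, the second action is effectively learned nine times more slowly, which is the artifact the text refers to.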

Here, we introduce Repeated Update Q-learning (RUQL), a learning algorithm that resolves the undesirable artifact of Q-learning while maintaining simplicity. We analyze the similarities and differences between RUQL, QL, and the closest state-of-the-art algorithms theoretically. Our analysis shows that RUQL maintains the convergence guarantee of QL in stationary environments, while relaxing the coupling between the execution policy and the learning dynamics. Experimental results confirm the theoretical insights and show how RUQL outperforms both QL and the closest state-of-the-art algorithms in noisy non-stationary environments.

A unified view on multi-class support vector machines (SVMs) is presented, covering most prominent variants including the one-vs-all approach and the algorithms proposed by Weston & Watkins, Crammer & Singer, Lee, Lin, & Wahba, and Liu & Yuan. The unification leads to a template for the quadratic training problems and new multi-class SVM formulations. Within our framework, we provide a comparative analysis of the various notions of multi-class margin and margin-based loss. In particular, we demonstrate limitations of the loss function considered, for instance, in the Crammer & Singer machine.
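To illustrate how two of the margin-based losses mentioned above differ, the sketch below uses their common textbook forms (up to scaling conventions): the Weston & Watkins loss sums the hinge violations over all wrong classes, while the Crammer & Singer loss keeps only the largest one. The function names are assumptions for this example.

```python
import numpy as np


def margin_violations(scores, label):
    """Hinge terms max(0, 1 - (f_y - f_j)) for every wrong class j."""
    scores = np.asarray(scores, dtype=float)
    diffs = scores[label] - np.delete(scores, label)
    return np.maximum(0.0, 1.0 - diffs)


def weston_watkins_loss(scores, label):
    # Sum of the violations over all wrong classes.
    return float(margin_violations(scores, label).sum())


def crammer_singer_loss(scores, label):
    # Only the worst violation counts.
    return float(margin_violations(scores, label).max())


scores = [2.0, 1.5, 1.8]
ww = weston_watkins_loss(scores, label=0)
cs = crammer_singer_loss(scores, label=0)
```

For these scores the two wrong classes violate the margin by 0.5 and 0.8, so the summed loss is 1.3 while the max-based loss is 0.8.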

We analyze Fisher consistency of multi-class loss functions and universal consistency of the various machines. On the one hand, we give examples of SVMs that are, in a particular hyperparameter regime, universally consistent without being based on a Fisher consistent loss. These include the canonical extension of SVMs to multiple classes as proposed by Weston & Watkins and Vapnik as well as the one-vs-all approach. On the other hand, it is demonstrated that machines based on Fisher consistent loss functions can fail to identify proper decision boundaries in low-dimensional feature spaces.

We compared the performance of nine different multi-class SVMs in a thorough empirical study. Our results suggest using the Weston & Watkins SVM, which can be trained comparatively fast and gives good accuracies on benchmark functions. If training time is a major concern, the one-vs-all approach is the method of choice.

`CauseEffectPairs` that
consists of data for 100 different cause-effect pairs selected
from 37 data sets from various domains (e.g., meteorology,
biology, medicine, engineering, economy, etc.) and motivate our
decisions regarding the “ground truth” causal directions of all
pairs. We evaluate the performance of several bivariate causal
discovery methods on these real-world benchmark data and in
addition on artificially simulated data. Our empirical results
on real-world data indicate that certain methods are indeed able
to distinguish cause from effect using only purely observational
data, although more benchmark data would be needed to obtain
statistically significant conclusions. One of the best
performing methods overall is the method based on Additive Noise
Models originally proposed by Hoyer et al. (2009),
which obtains an accuracy of 63 $\pm$ 10 % and an AUC of 0.74
$\pm$ 0.05 on the real-world benchmark. As the main theoretical
contribution of this work we prove the consistency of that
method.

We consider two closely related problems: planted clustering and submatrix localization. In the planted clustering problem, a random graph is generated based on an underlying cluster structure of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and the stochastic block model, which are widely used for studying community detection, graph clustering and bi-clustering.

For both problems, we show that the space of the model
parameters (cluster/submatrix size, edge probabilities and the
mean of the submatrices) can be partitioned into four disjoint
regions corresponding to decreasing statistical and
computational complexities: (1) the *impossible* regime,
where all algorithms fail; (2) the *hard* regime, where the
computationally expensive Maximum Likelihood Estimator (MLE)
succeeds; (3) the *easy* regime, where the polynomial-time
convexified MLE succeeds; (4) the *simple* regime, where a
local counting/thresholding procedure succeeds. Moreover, we
show that each of these algorithms provably fails in the harder
regimes.
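As a concrete instance of the planted clustering model described above, the sketch below samples a planted partition graph: nodes in the same cluster are connected with probability `p`, and nodes in different clusters with probability `q`. The function name, the equal-sized round-robin cluster assignment, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np


def sample_planted_partition(n, k, p, q, seed=0):
    """Sample a symmetric 0/1 adjacency matrix with k planted clusters."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n) % k  # round-robin cluster assignment
    same_cluster = labels[:, None] == labels[None, :]
    edge_probs = np.where(same_cluster, p, q)
    # Draw each unordered pair once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < edge_probs, k=1)
    adjacency = (upper | upper.T).astype(int)
    return adjacency, labels


adjacency, labels = sample_planted_partition(n=12, k=3, p=0.9, q=0.1)
```

The recovery task is then to reconstruct `labels` (up to renaming of clusters) from `adjacency` alone.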

Our results establish the minimax recovery limits, which are tight up to universal constants and hold even with a growing number of clusters/submatrices, and provide order-wise stronger performance guarantees for polynomial-time algorithms than previously known. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax limits may not be achievable by polynomial-time algorithms.

High-dimensional datasets are well-approximated by low-dimensional structures. Over the past decade, this empirical observation motivated the investigation of detection, measurement, and modeling techniques to exploit these low-dimensional intrinsic structures, yielding numerous implications for high-dimensional statistics, machine learning, and signal processing. Manifold learning (where the low-dimensional structure is a manifold) and dictionary learning (where the low-dimensional structure is the set of sparse linear combinations of vectors from a finite dictionary) are two prominent theoretical and computational frameworks in this area. Despite their ostensible distinction, the recently-introduced Geometric Multi-Resolution Analysis (GMRA) provides a robust, computationally efficient, multiscale procedure for simultaneously learning manifolds and dictionaries.

In this work, we prove non-asymptotic probabilistic bounds on the approximation error of GMRA for a rich class of data-generating statistical models that includes “noisy” manifolds, thereby establishing the theoretical robustness of the procedure and confirming empirical observations. In particular, if a dataset aggregates near a low-dimensional manifold, our results show that the approximation error of the GMRA is completely independent of the ambient dimension. Our work therefore establishes GMRA as a provably fast algorithm for dictionary learning with approximation and sparsity guarantees. We include several numerical experiments confirming these theoretical results, and our theoretical framework provides new tools for assessing the behavior of manifold learning and dictionary learning procedures on a large class of interesting models.

`print/plot/predict` methods are available;
(b) dedicated methods for trees with

We consider the problem of approximating and learning disjunctions (or equivalently, conjunctions) on symmetric distributions over $\{0,1\}^n$. Symmetric distributions are distributions whose PDF is invariant under any permutation of the variables. We prove that for every symmetric distribution $\mathcal{D}$, there exists a set of $n^{O(\log{(1/\epsilon)})}$ functions $\mathbb{S}$, such that for every disjunction $c$, there is a function $p$, expressible as a linear combination of functions in $\mathbb{S}$, such that $p$ $\epsilon$-approximates $c$ in $\ell_1$ distance on $\mathcal{D}$, that is, $\mathbf{E}_{x \sim \mathcal{D}}[|c(x)-p(x)|] \leq \epsilon$. This implies an agnostic learning algorithm for disjunctions on symmetric distributions that runs in time $n^{O(\log{(1/\epsilon)})}$. The best known previous bound is $n^{O(1/\epsilon^4)}$ and follows from approximation of the more general class of halfspaces (Wimmer, 2010). We also show that there exists a symmetric distribution $\mathcal{D}$, such that the minimum degree of a polynomial that $1/3$-approximates the disjunction of all $n$ variables in $\ell_1$ distance on $\mathcal{D}$ is $\Omega(\sqrt{n})$. Therefore the learning result above cannot be achieved via $\ell_1$-regression with a polynomial basis used in most other agnostic learning algorithms.

Our technique also gives a simple proof that for any product distribution $\mathcal{D}$ and every disjunction $c$, there exists a polynomial $p$ of degree $O(\log{(1/\epsilon)})$ such that $p$ $\epsilon$-approximates $c$ in $\ell_1$ distance on $\mathcal{D}$. This was first proved by Blais et al. (2008) via a more involved argument.

- when the dropout-regularized criterion has a unique minimizer,
- when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded,
- that the dropout regularization can be non-monotonic as individual weights increase from 0, and
- that the dropout regularization penalty may *not* be convex.

`softImpute` in R for implementing our approaches, and a distributed version for very large matrices using the `Spark` cluster programming environment

When learning a directed acyclic graph (DAG) model via observational data, one generally cannot identify the underlying DAG, but can potentially obtain a Markov equivalence class. The size (the number of DAGs) of a Markov equivalence class is crucial to infer causal effects or to learn the exact causal DAG via further interventions. Given a set of Markov equivalence classes, the distribution of their sizes is a key consideration in developing learning methods. However, counting the size of an equivalence class with many vertices is usually computationally infeasible, and the existing literature reports the size distributions only for equivalence classes with ten or fewer vertices.

In this paper, we develop a method to compute the size of a Markov equivalence class. We first show that there are five types of Markov equivalence classes whose sizes can be formulated as five functions of the number of vertices respectively. Then we introduce a new concept of a rooted sub-class. The graph representations of rooted sub-classes of a Markov equivalence class are used to partition this class recursively until the sizes of all rooted sub-classes can be computed via the five functions. The proposed size counting is efficient for Markov equivalence classes of sparse DAGs with hundreds of vertices. Finally, we explore the size and edge distributions of Markov equivalence classes and find experimentally that, in general, (1) most Markov equivalence classes are half completed and their average sizes are small, and (2) the sizes of sparse classes grow approximately exponentially with the numbers of vertices.

Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: it starts with all coefficients equal to zero, and iteratively updates the coefficient (by a small amount $\epsilon$) of the variable that achieves the maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it is known that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size $\epsilon$ goes to zero. Furthermore, essentially the same equivalence holds outside of least squares regression, with the minimization of a differentiable convex loss function subject to an $\ell_1$ norm constraint (the stagewise algorithm now updates the coefficient corresponding to the maximal absolute component of the gradient).
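The forward stagewise strategy described above is short enough to write out directly. The following is a minimal sketch for least squares; the function name, the fixed number of steps, and the toy data are assumptions for illustration.

```python
import numpy as np


def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise regression for least squares.

    Starts from all-zero coefficients and repeatedly nudges, by eps, the
    coefficient of the variable with the largest absolute inner product
    with the current residual.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        residual = y - X @ beta
        inner = X.T @ residual
        j = np.argmax(np.abs(inner))
        beta[j] += eps * np.sign(inner[j])
    return beta


# Toy problem with an orthonormal design, so the target coefficients equal y.
X = np.eye(3)
y = np.array([1.0, 0.0, -0.5])
beta = forward_stagewise(X, y, eps=0.01, n_steps=300)
```

Stopping the loop early yields the sparser, more regularized estimates along the stagewise path; running it longer drives the coefficients to within about `eps` of the least squares solution in this toy setting.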

Even when they do not match their $\ell_1$-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the $\ell_1$ norm and sparsity? The current paper is an attempt to do just this. We present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more.

We consider the query and computational complexity of learning multiplicity tree automata in Angluin's exact learning model. In this model, there is an oracle, called the Teacher, that can answer membership and equivalence queries posed by the Learner. Motivated by this feature, we first characterise the complexity of the equivalence problem for multiplicity tree automata, showing that it is logspace equivalent to polynomial identity testing.

We then move to query complexity, deriving lower bounds on the number of queries needed to learn multiplicity tree automata over both fixed and arbitrary fields. In the latter case, the bound is linear in the size of the target automaton. The best known upper bound on the query complexity over arbitrary fields derives from an algorithm of Habrard and Oncina (2006), in which the number of queries is proportional to the size of the target automaton and the size of a largest counterexample, represented as a tree, that is returned by the Teacher. However, a smallest counterexample tree may already be exponential in the size of the target automaton. Thus the above algorithm has query complexity exponentially larger than our lower bound, and does not run in time polynomial in the size of the target automaton. We give a new learning algorithm for multiplicity tree automata in which counterexamples to equivalence queries are represented as DAGs. The query complexity of this algorithm is quadratic in the target automaton size and linear in the size of a largest counterexample. In particular, if the Teacher always returns DAG counterexamples of minimal size then the query complexity is quadratic in the target automaton size---almost matching the lower bound, and improving the best previously-known algorithm by an exponential factor.

Motivated by problems in insurance, our task is to predict finite upper bounds on a future draw from an unknown distribution $p$ over natural numbers. We can only use past observations generated independently and identically distributed according to $p$. While $p$ is unknown, it is known to belong to a given collection $\mathcal{P}$ of probability distributions on the natural numbers.

The support of the distributions $p
\in \mathcal{P}$ may be unbounded, and the prediction game goes
on for *infinitely* many draws. We are allowed to make
observations without predicting upper bounds for some time. But
we must, with probability $1$, start and then continue to
predict upper bounds after a finite time irrespective of which
$p \in \mathcal{P}$ governs the data.

If it is possible,
without knowledge of $p$ and for any prescribed confidence
however close to $1$, to come up with a sequence of upper bounds
that is never violated over an infinite time window with
confidence at least as big as prescribed, we say the model class
$\mathcal{P}$ is *insurable*. We completely characterize
the insurability of any class $\mathcal{P}$ of distributions
over natural numbers by means of a condition on how the
neighborhoods of distributions in $\mathcal{P}$ should be, one
that is both necessary and sufficient.

`camel` implementing the proposed method is available on the Comprehensive R Archive Network, cran.r-project.org/web/packages/camel.

Stochastic multiplicity automata (SMA) are weighted finite automata that generalize probabilistic automata. They have been used in the context of probabilistic grammatical inference. Observable operator models (OOMs) are a generalization of hidden Markov models, which in turn are models for discrete-valued stochastic processes and are used ubiquitously in the context of speech recognition and bio-sequence modeling. Predictive state representations (PSRs) extend OOMs to stochastic input-output systems and are employed in the context of agent modeling and planning.

We present SMA, OOMs, and PSRs under the common framework of sequential systems, which are an algebraic characterization of multiplicity automata, and examine the precise relationships between them. Furthermore, we establish a unified approach to learning such models from data. Many of the learning algorithms that have been proposed can be understood as variations of this basic learning scheme, and several turn out to be closely related to each other, or even equivalent.
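What these models share is that the probability of a sequence is computed as an initial vector, times one matrix per observed symbol, times a final vector. The sketch below shows this for a small hidden Markov model written in operator form; the specific matrices and all variable names are assumptions chosen for illustration, and the convention used here is that a state emits a symbol and then transitions.

```python
import itertools

import numpy as np

# Hypothetical two-state, two-symbol hidden Markov model.
T = np.array([[0.7, 0.3],   # T[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
O = np.array([[0.9, 0.1],   # O[i, a] = P(symbol a | state i)
              [0.2, 0.8]])
initial = np.array([0.5, 0.5])
final = np.ones(2)

# One operator per symbol: emit the symbol, then transition.
operators = [np.diag(O[:, a]) @ T for a in range(O.shape[1])]


def sequence_probability(symbols):
    """Probability of a symbol sequence as a product of per-symbol operators."""
    state = initial
    for a in symbols:
        state = state @ operators[a]
    return float(state @ final)


# Probabilities of all length-3 sequences sum to one.
total = sum(sequence_probability(seq)
            for seq in itertools.product(range(2), repeat=3))
```

Allowing the vectors and matrices to contain arbitrary real entries, subject to the resulting values still forming a valid distribution, gives the more general models discussed in the text.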

In many applications, one has side information, *e.g.*,
labels that are provided in a semi-supervised manner, about a
specific target region of a large data set, and one wants to
perform machine learning and data analysis tasks "nearby" that
prespecified target region. For example, one might be interested
in the clustering structure of a data graph near a prespecified
"seed set" of nodes, or one might be interested in finding
partitions in an image that are near a prespecified "ground
truth" set of pixels. Locally-biased problems of this sort are
particularly challenging for popular eigenvector-based machine
learning and data analysis tools. At root, the reason is that
eigenvectors are inherently global quantities, thus limiting the
applicability of eigenvector-based methods in situations where
one is interested in very local properties of the data.

In this paper, we address this issue by providing a
methodology to construct *semi-supervised eigenvectors* of
a graph Laplacian, and we illustrate how these locally-biased
eigenvectors can be used to perform *locally-biased machine
learning*. These semi-supervised eigenvectors capture
successively-orthogonalized directions of maximum variance,
conditioned on being well-correlated with an input seed set of
nodes that is assumed to be provided in a semi-supervised
manner. We show that these semi-supervised eigenvectors can be
computed quickly as the solution to a system of linear
equations; and we also describe several variants of our basic
method that have improved scaling properties. We provide several
empirical examples demonstrating how these semi-supervised
eigenvectors can be used to perform locally-biased learning; and
we discuss the relationship between our results and recent
machine learning algorithms that use global eigenvectors of the
graph Laplacian.

Fitting high-dimensional statistical models often requires
the use of non-linear parameter estimation procedures. As a
consequence, it is generally impossible to obtain an exact
characterization of the probability distribution of the
parameter estimates. This in turn implies that it is extremely
challenging to quantify the *uncertainty* associated with a
certain parameter estimate. Concretely, no commonly accepted
procedure exists for computing classical measures of uncertainty
and statistical significance, such as confidence intervals or
$p$-values, for these models.

We consider here the high-dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and $p$-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power.

Our approach is based on constructing a "de-biased" version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and a high-throughput genomic data set about riboflavin production rate, made publicly available by Bühlmann et al. (2014).
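For linear regression, a de-biasing correction is commonly written as $\hat\theta^{u} = \hat\theta + \frac{1}{n} M X^{\top}(y - X\hat\theta)$, where $M$ approximates the inverse covariance of the design. The sketch below only illustrates this form and is not the construction of $M$ proposed in the paper: it assumes a low-dimensional setting ($n > p$), uses ridge regression as a stand-in for the regularized estimator, and takes $M$ to be the inverse sample covariance, in which case the correction exactly undoes the shrinkage and returns the least squares solution.

```python
import numpy as np


def debias(theta_hat, X, y, M):
    """Add the correction (1/n) * M @ X.T @ residual to a regularized estimate."""
    n = X.shape[0]
    return theta_hat + M @ X.T @ (y - X @ theta_hat) / n


rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, 0.0, -2.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Stand-in regularized estimator (assumption): ridge regression, biased towards zero.
ridge_penalty = 5.0
theta_ridge = np.linalg.solve(X.T @ X + ridge_penalty * np.eye(p), X.T @ y)

# Illustrative choice of M (assumption): inverse sample covariance, valid only when n > p.
M = np.linalg.inv(X.T @ X / n)
theta_debiased = debias(theta_ridge, X, y, M)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

In the high-dimensional regime the sample covariance is not invertible, which is why a dedicated construction of $M$ is needed there.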

In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms.

We consider similarity information in the setting
of *contextual bandits*, a natural extension of the basic
MAB problem where before each round an algorithm is given the
*context*---a hint about the payoffs in this round.
Contextual bandits are directly motivated by placing
advertisements on web pages, one of the crucial problems in
sponsored search. A particularly simple way to represent
similarity information in the contextual bandit setting is via a
*similarity distance* between the context-arm pairs which
bounds from above the difference between the respective expected
payoffs.

Prior work on contextual bandits with similarity
uses "uniform" partitions of the similarity space, so that each
context-arm pair is approximated by the closest pair in the
partition. Algorithms based on "uniform" partitions disregard
the structure of the payoffs and the context arrivals, which is
potentially wasteful. We present algorithms that are based on
*adaptive* partitions, and take advantage of "benign"
payoffs and context arrivals without sacrificing the worst-case
performance. The central idea is to maintain a finer partition
in high-payoff regions of the similarity space and in popular
regions of the context space. Our results apply to several other
settings, e.g., MAB with constrained temporal change (Slivkins
and Upfal, 2008) and sleeping bandits (Kleinberg et al.,
2008a).

We give novel algorithms for stochastic strongly-convex optimization in the gradient oracle model which return an $O(\frac{1}{T})$-approximate solution after $T$ iterations. The first algorithm is deterministic, and achieves this rate via gradient updates and historical averaging. The second algorithm is randomized, and is based on pure gradient steps with a random step size.

This rate of convergence is optimal in the gradient oracle model. This improves upon the previously known best rate of $O(\frac{\log(T)}{T})$, which was obtained by applying an online strongly-convex optimization algorithm with regret $O(\log(T))$ to the batch setting.
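To make the gradient oracle model concrete, the sketch below runs plain stochastic gradient descent with step size $1/(\lambda t)$ on a strongly convex quadratic. This is a standard baseline used only for illustration; it is not either of the two algorithms introduced here, and the function names, the objective, and the noise model are all assumptions.

```python
import numpy as np


def sgd_strongly_convex(noisy_grad, x0, strong_convexity, n_steps):
    """Plain SGD with step size 1 / (strong_convexity * t)."""
    x = np.array(x0, dtype=float)
    for t in range(1, n_steps + 1):
        x -= noisy_grad(x) / (strong_convexity * t)
    return x


rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0])  # minimizer of the toy objective


def noisy_grad(x):
    # Gradient oracle for 0.5 * ||x - x_star||^2 with zero-mean Gaussian noise.
    return (x - x_star) + rng.normal(size=x.shape)


x_T = sgd_strongly_convex(noisy_grad, x0=[0.0, 0.0],
                          strong_convexity=1.0, n_steps=10000)
```

Each call to `noisy_grad` counts as one query to the gradient oracle, so `n_steps` plays the role of $T$.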

We complement this result by proving that any algorithm has expected regret of $\Omega(\log(T))$ in the online stochastic strongly-convex optimization setting. This shows that any online-to-batch conversion is inherently suboptimal for stochastic strongly-convex optimization. This is the first formal evidence that online convex optimization is strictly more difficult than batch stochastic convex optimization.

Optimization on manifolds is a rapidly developing branch of nonlinear optimization. Its focus is on problems where the smooth geometry of the search space can be leveraged to design efficient numerical algorithms. In particular, optimization on manifolds is well-suited to deal with rank and orthogonality constraints. Such structured constraints appear pervasively in machine learning applications, including low-rank matrix completion, sensor network localization, camera network registration, independent component analysis, metric learning, dimensionality reduction and so on.
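A small example of how an orthogonality-type constraint is handled geometrically: maximizing $x^{\top} A x$ over the unit sphere by Riemannian gradient ascent. The sketch below is written in plain NumPy and does not use any toolbox API; the function name, step size, and test matrix are assumptions. Each iteration projects the Euclidean gradient onto the tangent space of the sphere, takes a step, and retracts back onto the sphere by normalization.

```python
import numpy as np


def riemannian_gradient_ascent(A, x0, step=0.1, n_steps=500):
    """Maximize x^T A x over the unit sphere."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_steps):
        egrad = 2.0 * A @ x                # Euclidean gradient
        rgrad = egrad - (x @ egrad) * x    # projection onto the tangent space at x
        x = x + step * rgrad               # step along the tangent direction
        x /= np.linalg.norm(x)             # retraction back onto the sphere
    return x


A = np.diag([3.0, 1.0, 0.5])
x = riemannian_gradient_ascent(A, x0=np.ones(3))
```

The iterate converges to a leading eigenvector of `A`, without ever forming a Lagrangian for the unit-norm constraint.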

The Manopt toolbox, available at www.manopt.org, is a user-friendly, documented piece of software dedicated to simplifying experimentation with state-of-the-art Riemannian optimization algorithms. By dealing internally with most of the differential geometry, the package aims particularly at lowering the entrance barrier.