My research aims to build reliable machine learning algorithms that can cope with unprecedented challenges in the modern data-production pipeline. Some directions illustrating this theme that I am particularly excited about:
Inference with Outliers: What are the statistical limits of estimation with outliers in the data? ([CT+22] Optimal Mean Estimation without a Variance, Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett, Michael I. Jordan, COLT 2022; [CC23] Statistical Barriers to Affine-equivariant Estimation, Zihao Chen, Yeshwanth Cherapanamjeri, under submission.) And can we realize these limits computationally? ([CF19] Fast Mean Estimation with Sub-Gaussian Rates, Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter L. Bartlett, COLT 2019; [CH+20] Algorithms for Heavy-Tailed Statistics: Regression, Covariance Estimation, and Beyond, Yeshwanth Cherapanamjeri, Samuel B. Hopkins, Tarun Kathuria, Prasad Raghavendra, Nilesh Tripuraneni, STOC 2020; [CM+20] List Decodable Mean Estimation in Nearly Linear Time, Yeshwanth Cherapanamjeri, Sidhanth Mohanty, Morris Yau, FOCS 2020; [CA+20] Optimal Robust Linear Regression in Nearly Linear Time, Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I. Jordan, Nicolas Flammarion, Peter L. Bartlett, under submission.) These ideas have implications for other questions in statistics and theoretical computer science, including classical PAC learning theory ([AC+23] Optimal PAC Bounds without Uniform Convergence, Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, Nikita Zhivotovskiy, FOCS 2023, invited to the SICOMP Special Issue for FOCS 2023) and streaming data structures ([CN20] On Adaptive Distance Estimation, Yeshwanth Cherapanamjeri, Jelani Nelson, NeurIPS 2020 spotlight; [CN21] Terminal Embeddings in Sublinear Time, Yeshwanth Cherapanamjeri, Jelani Nelson, FOCS 2021; [CN22] Uniform Approximations for Randomized Hadamard Transforms with Applications, Yeshwanth Cherapanamjeri, Jelani Nelson, STOC 2022).
Data from Strategic Interactions: Sometimes, data is systematically biased due to the strategic behavior of the data-generating agents. For instance, data collected from online marketplaces is not a true reflection of the participants' utilities. How should we account for this strategic behavior to design statistically and computationally efficient estimators? ([CD+22] Estimation of Standard Auction Models, Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis, EC 2022; [CD+23] What Makes a Good Fisherman? Linear Regression under Self-Selection Bias, Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis, STOC 2023.)
Dataset Design: Modern datasets are collected from a range of sources of varying quality: low-quality data is cheap while high-quality data is vastly more expensive. This raises several questions: How should a practitioner prioritize among these sources? Are there inherent limitations to the use of certain data sources? What are the statistical costs of low-quality data? ([DC+24] How Much is a Noisy Image Worth? Data Scaling Laws for Ambient Diffusion, Giannis Daras*, Yeshwanth Cherapanamjeri*, Constantinos Daskalakis, under submission; [CD+25] Are Pairwise Comparisons Enough for Preference Learning?, Yeshwanth Cherapanamjeri*, Constantinos Daskalakis, Gabriele Farina, Sobhan Mohammadpour*, in preparation.)
How Much is a Noisy Image Worth? Data Scaling Laws for Ambient Diffusion
Giannis Daras*, Yeshwanth Cherapanamjeri*, Constantinos Daskalakis
Under Submission
The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g., in certain scientific applications where there is no access to clean data due to physical or instrumentation constraints. Ambient Diffusion and related frameworks train diffusion models with solely corrupted data (which are usually cheaper to acquire), but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than \(80\) models on data with different corruption levels across three datasets ranging from \(30,000\) to \(\approx 1.3\)M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when only training on noisy data. Yet, a combination of a small set of clean data (e.g., \(10\%\) of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.
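To make the clean-versus-noisy trade-off concrete, here is a minimal sketch in a toy one-dimensional Gaussian location model (my own illustration with made-up noise levels, not the diffusion setting or code of the paper): an inverse-variance-weighted combination shows how quickly the marginal value of a noisy sample shrinks as its noise grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma_extra = 3.0, 5.0            # true mean and extra corruption noise (made-up values)
n_clean, n_noisy = 100, 10_000

clean = mu + rng.normal(0.0, 1.0, n_clean)
noisy = mu + rng.normal(0.0, np.sqrt(1.0 + sigma_extra**2), n_noisy)

# Inverse-variance weighting: a clean sample carries weight 1,
# a noisy sample carries weight 1 / (1 + sigma_extra^2).
w_noisy = 1.0 / (1.0 + sigma_extra**2)
combined = (clean.sum() + w_noisy * noisy.sum()) / (n_clean + w_noisy * n_noisy)

print(f"clean-only error : {abs(clean.mean() - mu):.4f}")
print(f"noisy-only error : {abs(noisy.mean() - mu):.4f}")
print(f"combined error   : {abs(combined - mu):.4f}")
```

In this toy analogue, one hundred times as many noisy samples contribute an effective weight of only \(n_{\text{noisy}}/(1+\sigma^2)\), mirroring the diminishing marginal utility of noisy data described above.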
Statistical Barriers to Affine-equivariant Estimation
Zihao Chen, Yeshwanth Cherapanamjeri
Under Submission
We investigate the quantitative performance of affine-equivariant estimators for robust mean estimation. As a natural stability requirement, the construction of such affine-equivariant estimators has been extensively studied in the statistics literature. We quantitatively evaluate these estimators under two outlier models which have been the subject of much recent work: the heavy-tailed and adversarial corruption settings. We establish lower bounds which show that affine-equivariance induces a strict degradation in recovery error with quantitative rates degrading by a factor of \(\sqrt{d}\) in both settings. We find that classical estimators such as the Tukey median (Tukey '75) and Stahel-Donoho estimator (Stahel '81 and Donoho '82) are either quantitatively sub-optimal even within the class of affine-equivariant estimators or lack any quantitative guarantees. On the other hand, recent estimators with strong quantitative guarantees are not affine-equivariant or require additional distributional assumptions to achieve it. We remedy this by constructing a new affine-equivariant estimator which nearly matches our lower bound. Our estimator is based on a novel notion of a high-dimensional median which may be of independent interest. Notably, our results are applicable more broadly to any estimator whose performance is evaluated in the Mahalanobis norm which, for affine-equivariant estimators, corresponds to an evaluation in Euclidean norm on isotropic distributions.
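To make the affine-equivariance requirement concrete: an estimator \(\hat{\mu}\) is affine-equivariant if \(\hat{\mu}(AX + b) = A\hat{\mu}(X) + b\) for every invertible \(A\) and shift \(b\). The toy check below (my own example, not from the paper) shows that the sample mean satisfies this exactly, while the coordinate-wise median generally does not.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(size=(2000, 2))              # an asymmetric 2-D sample
A = np.array([[2.0, 1.0], [0.0, 0.5]])           # an arbitrary invertible linear map
b = np.array([1.0, -2.0])
Y = X @ A.T + b                                  # the affinely transformed sample

def equivariance_gap(estimator):
    """|| estimator(AX + b) - (A estimator(X) + b) ||, zero iff the estimator commutes with the map."""
    return np.linalg.norm(estimator(Y) - (A @ estimator(X) + b))

print("sample mean gap       :", equivariance_gap(lambda Z: Z.mean(axis=0)))        # ~0: equivariant
print("coordinatewise median :", equivariance_gap(lambda Z: np.median(Z, axis=0)))  # typically far from 0
```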
Optimal PAC Bounds without Uniform Convergence
Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, Nikita Zhivotovskiy
FOCS 2023 (invited to the SICOMP Special Issue for FOCS 2023)
In statistical learning theory, determining the sample complexity of realizable binary classification for VC classes was a long-standing open problem. The results of Simon and Hanneke established sharp upper bounds in this setting. However, the reliance of their argument on the uniform convergence principle limits its applicability to more general learning settings such as multiclass classification. In this paper, we address this issue by providing optimal high probability risk bounds through a framework that surpasses the limitations of uniform convergence arguments.
Our framework converts the leave-one-out error of permutation invariant predictors into high probability risk bounds. As an application, by adapting the one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth, we propose an algorithm that achieves an optimal PAC bound for binary classification. Specifically, our result shows that certain aggregations of one-inclusion graph algorithms are optimal, addressing a variant of a classic question posed by Warmuth.
We further instantiate our framework in three settings where uniform convergence is provably suboptimal. For multiclass classification, we prove an optimal risk bound that scales with the one-inclusion hypergraph density of the class, addressing the suboptimality of the analysis of Daniely and Shalev-Shwartz. For partial hypothesis classification, we determine the optimal sample complexity bound, resolving a question posed by Alon, Hanneke, Holzman, and Moran. For realizable bounded regression with absolute loss, we derive an optimal risk bound that relies on a modified version of the scale-sensitive dimension, refining the results of Bartlett and Long. Our rates surpass standard uniform convergence-based results due to the smaller complexity measure in our risk bound.
What Makes a Good Fisherman? Linear Regression under Self-Selection Bias
Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis
STOC 2023
In the classical setting of self-selection, the goal is to learn \(k\ \) models, simultaneously from observations \((x^{(i)}, y^{(i)})\ \) where \(y^{(i)}\ \) is the output of one of \(k\ \) underlying models on input \(x^{(i)}\). In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the \(k\ \) models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium.
In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly\((1/\varepsilon, k, d)\ \) sample and time complexity to estimate all model parameters to accuracy \(\varepsilon\ \) in \(d\ \) dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied \(\max\ \) self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general \(k\ \) we provide an algorithm with poly\((d) \exp(\text{poly}(k))\ \) sample and time complexity to estimate the regression parameters up to error \(1/\text{poly}(k)\), and (3) for \(k = 2\ \) we provide an algorithm for any error \(\varepsilon\ \) and poly\((d, 1/\varepsilon)\ \) sample and time complexity.
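The data-generating process for the \(\max\) self-selection criterion is easy to simulate; the sketch below (hypothetical parameter values, purely illustrative and not the estimation algorithm from the paper) makes the distinction between the known-index and unknown-index observation models explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 3, 5, 10_000
W = rng.normal(size=(k, d))                  # the k unknown linear models

X = rng.normal(size=(n, d))
outputs = X @ W.T + 0.1 * rng.normal(size=(n, k))   # every model's output on every x^(i)

idx = outputs.argmax(axis=1)                 # max self-selection: the largest output is observed
y = outputs[np.arange(n), idx]

known_index_data   = (X, y, idx)             # the identity of the selected model is also observed
unknown_index_data = (X, y)                  # only the selected output is observed
print("fraction of samples selected from each model:", np.bincount(idx, minlength=k) / n)
```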
Estimation of Standard Auction Models
Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis
EC 2022
We provide efficient estimation methods for first- and second-price auctions under independent (asymmetric) private values and partial observability. Given a finite set of observations, each comprising the identity of the winner and the price they paid in a sequence of identical auctions, we provide algorithms for non-parametrically estimating the bid distribution of each bidder, as well as their value distributions under equilibrium assumptions. We provide finite-sample estimation bounds which are uniform in that their error rates do not depend on the bid/value distributions being estimated. Our estimation guarantees advance a body of work in Econometrics wherein only identification results have been obtained, unless the setting is symmetric, parametric, or all bids are observable. Our guarantees also provide computationally and statistically effective alternatives to classical techniques from reliability theory. Finally, our results are immediately applicable to Dutch and English auctions.
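As a concrete picture of the partial-observability setting, the following sketch simulates a sequence of identical second-price auctions with asymmetric exponential value distributions (a hypothetical parametric choice made only for illustration; the paper's estimators are non-parametric) and records exactly what the estimator gets to see: the winner's identity and the price they paid.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bidders, n_auctions = 4, 50_000

# Asymmetric independent private values; bidder j draws from Exponential(scale = 1 + j/2).
scales = 1.0 + 0.5 * np.arange(n_bidders)
bids = rng.exponential(scales, size=(n_auctions, n_bidders))   # truthful bids in a second-price auction

order = np.argsort(bids, axis=1)
winner = order[:, -1]                                          # identity of the highest bidder
price = bids[np.arange(n_auctions), order[:, -2]]              # price paid = second-highest bid

observations = np.column_stack([winner, price])                # all that the estimator observes
print("empirical win frequencies:", np.bincount(winner, minlength=n_bidders) / n_auctions)
print("average price paid       :", price.mean())
```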
Algorithms for Heavy-Tailed Statistics: Regression, Covariance Estimation, and Beyond
Yeshwanth Cherapanamjeri, Samuel B. Hopkins, Tarun Kathuria, Prasad Raghavendra, Nilesh Tripuraneni
STOC 2020
We study efficient algorithms for linear regression and covariance estimation in the absence of Gaussian assumptions on the underlying distributions of samples, making assumptions instead about only finitely-many moments. We focus on how many samples are needed to do estimation and regression with high accuracy and exponentially-good success probability.
For covariance estimation, linear regression, and several other problems, estimators have recently been constructed with sample complexities and rates of error matching what is possible when the underlying distribution is Gaussian, but algorithms for these estimators require exponential time. We narrow the gap between the Gaussian and heavy-tailed settings for polynomial-time estimators with:
1. A polynomial-time estimator which takes \(n\ \) samples from a random vector \(X \in \mathbb{R}^d\ \) with covariance \(\Sigma\ \) and produces \(\hat{\Sigma}\ \) such that in spectral norm \(\|\hat{\Sigma} - \Sigma \|_2 \leq \tilde{O}(d^{3/4}/\sqrt{n})\ \) w.p. \(1-2^{-d}\ \). The information-theoretically optimal error bound is \(\tilde{O}(\sqrt{d/n})\); previous approaches to polynomial-time algorithms were stuck at \(\tilde{O}(d/\sqrt{n})\).
2. A polynomial-time algorithm which takes \(n\ \) samples \((X_i,Y_i)\ \) where \(Y_i = \langle u,X_i \rangle + \varepsilon_i\ \) and produces \(\hat{u}\ \) such that the loss \(\|u - \hat{u}\|^2 \leq O(d/n)\ \) w.p. \(1-2^{-d}\ \) for any \(n \geq d^{3/2} \log(d)^{O(1)}\ \). This (information-theoretically optimal) error is achieved by inefficient algorithms for any \(n \gg d\); previous polynomial-time algorithms suffer loss \(\Omega(d^2/n)\ \) and require \(n \gg d^2\).
Our algorithms use degree-\(8\ \) sum-of-squares semidefinite programs. We offer preliminary evidence that improving these rates of error in polynomial time is not possible in the median of means framework our algorithms employ.
Fast Mean Estimation with Sub-Gaussian Rates
Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter L. Bartlett
COLT 2019
We propose an estimator for the mean of a random vector in \(\mathbb{R}^d\ \) that can be computed in time \(O(n^4+n^2d)\ \) for \(n\ \) i.i.d.~samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.
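For context, a classical baseline in this setting is the geometric median-of-means estimator; the sketch below implements that baseline, not the estimator from the paper, and is included only to make the heavy-tailed setup concrete. On any single draw its error can look similar to that of the empirical mean; the difference between such estimators shows up in the probability of large deviations.

```python
import numpy as np

def geometric_median_of_means(X, k, n_iter=100, eps=1e-8):
    """Split the sample into k buckets, average each bucket, and take the geometric
    median of the bucket means via Weiszfeld's iteration."""
    means = np.array([bucket.mean(axis=0) for bucket in np.array_split(X, k)])
    z = means.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(means - z, axis=1), eps)
        z = (means / dist[:, None]).sum(axis=0) / (1.0 / dist).sum()
    return z

rng = np.random.default_rng(0)
n, d = 20_000, 50
X = rng.standard_t(df=2.5, size=(n, d))   # heavy tails: finite variance, no higher moments

print("empirical mean error      :", np.linalg.norm(X.mean(axis=0)))   # true mean is 0
print("geometric median-of-means :", np.linalg.norm(geometric_median_of_means(X, k=200)))
```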
More Publications
Heavy-tailed Contamination is Easier than Adversarial Contamination
Yeshwanth Cherapanamjeri, Daniel Lee
Under Submission
Computing Approximate Centerpoints in Polynomial Time
Yeshwanth Cherapanamjeri
FOCS 2024
The Space Complexity of Learning-Unlearning Algorithms
Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Odisho, Peter Carroll, Bin Yu
Under Submission
Diagnosing Transformers: Illuminating Feature Spaces for Clinical Decision-Making
Aliyah R. Hsu, Yeshwanth Cherapanamjeri, Briton Park, Tristan Naumann, Anobel Y. Odisho, Bin Yu
ICLR 2024
Pre-trained transformers are often fine-tuned to aid clinical decision-making using limited clinical notes. Model interpretability is crucial, especially in high-stakes domains like medicine, to establish trust and ensure safety, which requires human engagement. We introduce SUFO, a systematic framework that enhances interpretability of fine-tuned transformer feature spaces. SUFO utilizes a range of analytic and visualization techniques, including Supervised probing, Unsupervised similarity analysis, Feature dynamics, and Outlier analysis to address key questions about model trust and interpretability. We conduct a case study investigating the impact of pre-training data where we focus on real-world pathology classification tasks, and validate our findings on MedNLI. We evaluate five 110M-sized pre-trained transformer models, categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal that: (1) while PubMedBERT, the domain-specific model, contains valuable information for fine-tuning, it can overfit to minority classes when class imbalances exist. In contrast, mixed-domain models exhibit greater resistance to overfitting, suggesting potential improvements in domain-specific model robustness; (2) in-domain pre-training accelerates feature disambiguation during fine-tuning; and (3) feature spaces undergo significant sparsification during this process, enabling clinicians to identify common outlier modes among fine-tuned models as demonstrated in this paper. These findings showcase the utility of SUFO in enhancing trust and safety when using transformers in medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned language models for other applications in medicine and in more critical domains.
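As a rough illustration of the Supervised probing component, the sketch below fits a linear probe on frozen per-layer features; the arrays `feats` and `y` are hypothetical placeholders for extracted hidden states and task labels, and this is my own schematic rather than the SUFO codebase.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_layers(feats, y, cv=5):
    """Fit a linear probe on the frozen features of each layer and report cross-validated
    accuracy; the layer at which accuracy rises indicates where task information becomes
    linearly decodable."""
    return {layer: cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=cv).mean()
            for layer, X in feats.items()}

# Toy stand-in data; real usage would pass per-layer transformer hidden states.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
feats = {layer: rng.normal(size=(300, 64)) + 0.3 * layer * y[:, None] for layer in range(4)}
print(probe_layers(feats, y))
```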
The One-Inclusion Graph Algorithm is not Always Optimal
The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov's inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes.
Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds.
Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Samson Zhou
SODA 2023
We study fundamental problems in linear algebra, such as finding a maximal linearly independent subset of rows or columns (a basis), solving linear regression, or computing a subspace embedding. For these problems, we consider input matrices \(\mathbf{A}\in\mathbb{R}^{n\times d}\ \) with \(n > d\). The input can be read in \(\text{nnz}(\mathbf{A})\ \) time, which denotes the number of nonzero entries of \(\mathbf{A}\). In this paper, we show that beyond the time required to read the input matrix, these fundamental linear algebra problems can be solved in \(d^{\omega}\ \) time, where \(\omega \approx 2.37\ \) is the current matrix-multiplication exponent.
To do so, we introduce a constant-factor subspace embedding with the optimal \(m=\mathcal{O}(d)\ \) number of rows, and which can be applied in time \(\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}{\alpha}\right) + d^{2 + \alpha}\text{poly}(\log d)\ \) for any trade-off parameter \(\alpha>0\), tightening a recent result by Chepurko et al. [SODA 2022] that achieves an \(\exp(\text{poly}(\log\log n))\ \) distortion with \(m=d\cdot\text{poly}(\log\log d)\ \) rows in \(\mathcal{O}\left(\frac{\text{nnz}(\mathbf{A})}{\alpha}+d^{2+\alpha+o(1)}\right)\ \) time. Our subspace embedding uses a recently shown property of stacked Subsampled Randomized Hadamard Transforms (SRHT), which actually increase the input dimension, to "spread" the mass of an input vector among a large number of coordinates, followed by random sampling. To control the effects of random sampling, we use fast semidefinite programming to reweight the rows. We then use our constant-factor subspace embedding to give the first optimal runtime algorithms for finding a maximal linearly independent subset of columns, regression, and leverage score sampling. To do so, we also introduce a novel subroutine that iteratively grows a set of independent rows, which may be of independent interest.
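To illustrate the basic sketch-and-solve workflow that such subspace embeddings enable, here is a textbook subsampled randomized Hadamard transform applied to a small regression problem; this is a deliberately simple construction with arbitrary problem sizes, not the stacked-SRHT-plus-SDP-reweighting embedding of the paper.

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, m, rng):
    """Subsampled randomized Hadamard transform: sample m rows of H D A (after padding
    the row count to a power of two), with D a random sign diagonal and H orthonormal."""
    n = A.shape[0]
    n_pad = 1 << (n - 1).bit_length()
    A_pad = np.vstack([A, np.zeros((n_pad - n, A.shape[1]))])
    signs = rng.choice([-1.0, 1.0], size=n_pad)
    HDA = hadamard(n_pad) @ (signs[:, None] * A_pad) / np.sqrt(n_pad)
    rows = rng.choice(n_pad, size=m, replace=False)
    return np.sqrt(n_pad / m) * HDA[rows]

rng = np.random.default_rng(0)
n, d = 2048, 40
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

SAb = srht_sketch(np.column_stack([A, b]), m=400, rng=rng)      # sketch [A | b] jointly
x_sketch, *_ = np.linalg.lstsq(SAb[:, :d], SAb[:, d], rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print("relative error of the sketched solution:",
      np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact))
```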
Robust Algorithms on Adaptive Inputs from Bounded Adversaries
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Fred Zhang, Qiuyi Zhang, Samson Zhou
ICLR 2023
We study dynamic algorithms robust to adaptive input generated from sources with bounded capabilities, such as sparsity or limited interaction. For example, we consider robust linear algebraic algorithms when the updates to the input are sparse but given by an adversary with access to a query oracle. We also study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. We first recall a unified framework of [HKM+20, BKM+22, ACSS23] for answering \(Q\ \) adaptive queries that incurs \(\widetilde{\mathcal{O}}(\sqrt{Q})\ \) overhead in space, which is roughly a quadratic improvement over the naïve implementation, and only incurs a logarithmic overhead in query time. Although the general framework has diverse applications in machine learning and data science, such as adaptive distance estimation, kernel density estimation, linear regression, range queries, and point queries and serves as a preliminary benchmark, we demonstrate even better algorithmic improvements for (1) reducing the pre-processing time for adaptive distance estimation and (2) permitting an unlimited number of adaptive queries for kernel density estimation. Finally, we complement our theoretical results with additional empirical evaluations.
Uniform Approximations for Randomized Hadamard Transforms with Applications
Yeshwanth Cherapanamjeri, Jelani Nelson
STOC 2022
Randomized Hadamard Transforms (RHTs) have emerged as a computationally efficient alternative to the use of dense unstructured random matrices across a range of domains in computer science and machine learning. For several applications such as dimensionality reduction and compressed sensing, the theoretical guarantees for methods based on RHTs are comparable to approaches using dense random matrices with i.i.d. entries. However, several such applications are in the low-dimensional regime where the number of rows sampled from the matrix is rather small. Prior arguments are not applicable to the high-dimensional regime often found in machine learning applications like kernel approximation. Given an ensemble of RHTs with Gaussian diagonals, \(\{M^i\}_{i = 1}^m\), and any \(1\)-Lipschitz function, \(f: \mathbb{R} \to \mathbb{R}\ \), we prove that the average of \(f\ \) over the entries of \(\{M^i v\}_{i = 1}^m\ \) converges to its expectation uniformly over \(\| v \| \leq 1\ \) at a rate comparable to that obtained from using truly Gaussian matrices. We use our inequality to then derive improved guarantees for two applications in the high-dimensional regime: 1) kernel approximation and 2) distance estimation. For kernel approximation, we prove the first uniform approximation guarantees for random features constructed through RHTs lending theoretical justification to their empirical success while for distance estimation, our convergence result implies data structures with improved runtime guarantees over previous work by the authors. We believe our general inequality is likely to find use in other applications.
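A quick numerical check of the statement above (with arbitrary illustrative sizes, not an experiment from the paper): averaging the \(1\)-Lipschitz function \(f(t) = |t|\) over the entries of an ensemble of Hadamard-times-Gaussian-diagonal transforms applied to a unit vector lands close to the Gaussian expectation \(\sqrt{2/\pi}\).

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d, m = 256, 32
H = hadamard(d)                       # unnormalized ±1 Hadamard matrix

# One RHT block M = H diag(g) with Gaussian g: the entries of M v are marginally N(0, ||v||^2).
blocks = [H * rng.normal(size=d) for _ in range(m)]
G = rng.normal(size=(m * d, d))       # a truly Gaussian comparison matrix

f = np.abs                            # a 1-Lipschitz test function
v = rng.normal(size=d)
v /= np.linalg.norm(v)

print("RHT ensemble average of f :", np.mean([f(M @ v).mean() for M in blocks]))
print("Gaussian matrix average   :", f(G @ v).mean())
print("E|N(0,1)| = sqrt(2/pi)    :", np.sqrt(2 / np.pi))
```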
Adversarial Examples in Multi-Layer Random ReLU Networks
Peter L. Bartlett, Sébastien Bubeck, Yeshwanth Cherapanamjeri
NeurIPS 2021
We consider the phenomenon of adversarial examples in ReLU networks with independent Gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al. (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
A single gradient step finds adversarial examples on random two-layers neural networks
Sébastien Bubeck, Yeshwanth Cherapanamjeri, Gauthier Gidel, Rémi Tachet des Combes
Spotlight Presentation
NeurIPS 2021
Daniely and Schacham recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term "undercomplete" refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function.
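A numpy toy of the overcomplete two-layer setting (my own scaling choices, not the paper's exact normalization): a single gradient step whose size is proportional to \(|f(x)|/\|\nabla f(x)\|^2\) usually flips the sign of the output while perturbing \(x\) by only a small fraction of its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 200, 2000                                 # overcomplete: width m exceeds the input dimension d
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

f = lambda x: a @ np.maximum(W @ x, 0.0)         # two-layer random ReLU network
grad_f = lambda x: (a * (W @ x > 0)) @ W         # gradient of f away from kinks

x = rng.normal(size=d)
x *= np.sqrt(d) / np.linalg.norm(x)              # a typical input on the sphere of radius sqrt(d)

g = grad_f(x)
step = 1.5 * abs(f(x)) / np.linalg.norm(g) ** 2  # one gradient step sized to cross zero
x_adv = x - np.sign(f(x)) * step * g

print("relative perturbation size:", np.linalg.norm(x_adv - x) / np.linalg.norm(x))
print("output sign flipped       :", np.sign(f(x)) != np.sign(f(x_adv)))
```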
Terminal Embeddings in Sublinear Time
Yeshwanth Cherapanamjeri, Jelani Nelson
FOCS 2021
Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a terminal embedding from one metric space \((X,d_X)\ \) to another \((Y,d_Y)\ \) with a set of designated terminals \(T\subset X\). Such an embedding \(f\ \) is said to have distortion \(\rho\ge 1\ \) if \(\rho\ \) is the smallest value such that there exists a constant \(C>0\ \) satisfying $$ \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . $$ When \(X,Y\) are both Euclidean metrics with \(Y\) being \(m\)-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion \(1+\epsilon\ \) is achievable via such a terminal embedding with \(m = O(\epsilon^{-2}\log n)\ \) for \(n := |T|\). This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within \(T\ \) and not to \(T\ \) from the rest of space. The downside of prior work is that evaluating their embedding on some \(q\in \mathbb{R}^d\ \) required solving a semidefinite program with \(\Theta(n)\ \) constraints in~\(m\ \) variables and thus required some superlinear \(\mathrm{poly}(n)\ \) runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process \(T\ \) to obtain an almost linear-space data structure that supports computing the terminal embedding image of any \(q\in\mathbb{R}^d\ \) in sublinear time \(O^* (n^{1-\Theta(\epsilon^2)} + d)\). To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.
On Adaptive Distance Estimation
Yeshwanth Cherapanamjeri, Jelani Nelson
Spotlight Presentation
NeurIPS 2020
We provide a static data structure for distance estimation which supports adaptive queries. Concretely, given a dataset \(X = \{x_i\}_{i = 1}^n\ \) of \(n\ \) points in \(\mathbb{R}^d\ \) and \(0 < p \leq 2\ \), we construct a randomized data structure with low memory consumption and query time which, when later given any query point \(q \in \mathbb{R}^d\), outputs a \((1+\epsilon)\)-approximation of \(\lVert q - x_i \rVert_p\ \) with high probability for all \(i\in[n]\). The main novelty is our data structure's correctness guarantee holds even when the sequence of queries can be chosen adaptively: an adversary is allowed to choose the \(j\)th query point \(q_j\ \) in a way that depends on the answers reported by the data structure for \(q_1,\ldots,q_{j-1}\). Previous randomized Monte Carlo methods do not provide error guarantees in the setting of adaptively chosen queries. Our memory consumption is \(\tilde O((n+d)d/\epsilon^2)\), slightly more than the \(O(nd)\ \) required to store \(X\ \) in memory explicitly, but with the benefit that our time to answer queries is only \(\tilde O(\epsilon^{-2}(n + d))\), much faster than the naive \(\Theta(nd)\ \) time obtained from a linear scan in the case of \(n\ \) and \(d\ \) very large. Here \(\tilde O\ \) hides \(\log(nd/\epsilon)\ \) factors. We discuss applications to nearest neighbor search and nonparametric estimation.
Our method is simple and likely to be applicable to other domains: we describe a generic approach for transforming randomized Monte Carlo data structures which do not support adaptive queries to ones that do, and show that for the problem at hand, it can be applied to standard nonadaptive solutions to \(\ell_p\ \) norm estimation with negligible overhead in query time and a factor \(d\ \) overhead in memory.
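The reduction can be pictured with a short schematic (a simplified caricature with made-up parameters, not the exact scheme from the paper): keep several independent copies of a nonadaptive Monte Carlo estimator and answer each query with a median over a small random subset of copies, so that no single copy's randomness is over-exposed to the adversary.

```python
import numpy as np

class AdaptiveWrapper:
    """Answer each query with a median over a random subset of r independent copies of a
    nonadaptive Monte Carlo estimator, so no single copy's randomness is over-exposed."""
    def __init__(self, make_estimator, r=50, subset=7, seed=0):
        self.rng = np.random.default_rng(seed)
        self.copies = [make_estimator(np.random.default_rng(1000 + i)) for i in range(r)]
        self.subset = subset

    def query(self, q):
        chosen = self.rng.choice(len(self.copies), size=self.subset, replace=False)
        return np.median(np.stack([self.copies[i](q) for i in chosen]), axis=0)

# Toy nonadaptive estimator: a Johnson-Lindenstrauss sketch of a fixed dataset X,
# returning estimates of all distances ||q - x_i||_2.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 64))

def make_jl_estimator(local_rng, m=32):
    P = local_rng.normal(size=(m, X.shape[1])) / np.sqrt(m)
    PX = X @ P.T
    return lambda q: np.linalg.norm(q @ P.T - PX, axis=1)

wrapper = AdaptiveWrapper(make_jl_estimator)
q = rng.normal(size=64)
true_dists = np.linalg.norm(q - X, axis=1)
print("max relative error:", np.max(np.abs(wrapper.query(q) - true_dists) / true_dists))
```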
Optimal Robust Linear Regression in Nearly Linear Time
Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I. Jordan, Nicolas Flammarion, Peter L. Bartlett
Under Submission
We study the problem of high-dimensional robust linear regression where a learner is given access to \(n\) samples from the generative model \(Y = \langle X,w^* \rangle + \epsilon\) (with \(X \in \mathbb{R}^d\) and \(\epsilon\) independent), in which an \(\eta\) fraction of the samples have been adversarially corrupted. We propose estimators for this problem under two settings: (i) \(X\) is \(L_4\)-\(L_2\) hypercontractive, \(\mathbb{E} [XX^\top]\) has bounded condition number and \(\epsilon\) has bounded variance and (ii) \(X\) is sub-Gaussian with identity second moment and \(\epsilon\) is sub-Gaussian. In both settings, our estimators: (a) Achieve optimal sample complexities and recovery guarantees up to log factors and (b) Run in near linear time (\(\tilde{O}(nd / \eta^6)\)). Prior to our work, polynomial time algorithms achieving near optimal sample complexities were only known in the setting where \(X\) is Gaussian with identity covariance and \(\epsilon\) is Gaussian, and no linear time estimators were known for robust linear regression in any setting. Our estimators and their analysis leverage recent developments in the construction of faster algorithms for robust mean estimation to improve runtimes, and refined concentration of measure arguments alongside Gaussian rounding techniques to improve statistical sample complexities.
List Decodable Mean Estimation in Nearly Linear Time
Yeshwanth Cherapanamjeri, Sidhanth Mohanty, Morris Yau
FOCS 2020
Learning from data in the presence of outliers is a fundamental problem in statistics. Until recently, no computationally efficient algorithms were known to compute the mean of a high dimensional distribution under natural assumptions in the presence of even a small fraction of outliers. In this paper, we consider robust statistics in the presence of overwhelming outliers where the majority of the dataset is introduced adversarially. With only an \(\alpha < 1/2\ \) fraction of "inliers" (clean data), the mean of a distribution is unidentifiable. However, in their influential work, [CSV17] introduced a polynomial time algorithm that recovers the mean of distributions with bounded covariance by outputting a succinct list of \(O(1/\alpha)\ \) candidate solutions, one of which is guaranteed to be close to the true distributional mean; a direct analog of 'List Decoding' in the theory of error correcting codes. In this work, we develop an algorithm for list decodable mean estimation in the same setting that achieves the information-theoretically optimal recovery guarantee and sample complexity up to constants, and runs in nearly linear time up to polylogarithmic factors in the dimension. Our conceptual innovation is to design a descent-style algorithm on a nonconvex landscape, iteratively removing minima to generate a succinct list of solutions. Our runtime bottleneck is a saddle-point optimization for which we design custom primal-dual solvers for generalized packing and covering SDPs under Ky-Fan norms, which may be of independent interest.
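To see what the list-decoding guarantee is asking for, the toy below (a naive clustering baseline, not the descent algorithm of the paper) generates data with an \(\alpha\) fraction of inliers amid adversarial clusters and outputs \(O(1/\alpha)\) candidate means, one of which lands near the true mean while most do not.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, n, alpha = 20, 5_000, 0.1
true_mean = np.full(d, 3.0)

n_in = int(alpha * n)
inliers = true_mean + rng.normal(size=(n_in, d))
fake_centers = 10.0 * rng.normal(size=(3, d))                       # adversarial majority
outliers = np.vstack([c + rng.normal(size=((n - n_in) // 3, d)) for c in fake_centers])
data = np.vstack([inliers, outliers])
rng.shuffle(data)

k = int(np.ceil(1 / alpha))                       # output a list of O(1/alpha) candidates
candidates = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).cluster_centers_
errors = np.linalg.norm(candidates - true_mean, axis=1)
print(f"errors of the {k} candidates (best should be small): {np.sort(errors).round(1)}")
```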
Optimal Mean Estimation without a Variance
Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett, Michael I. Jordan
COLT 2022
We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample \(\mathbf{X} = \{X_i\}_{i = 1}^n\ \) from a distribution \(\mathcal{D}\ \) over \(\mathbb{R}^d\ \) with mean \(\mu\ \) which satisfies the following weak-moment assumption for some \({\alpha \in [0, 1]}\): $$ \forall \|v\| = 1: \mathbb{E}_{X \thicksim \mathcal{D}}[\lvert \langle X - \mu, v\rangle \rvert^{1 + \alpha}] \leq 1, $$ and given a target failure probability, \(\delta\ \), our goal is to design an estimator which attains the smallest possible confidence interval as a function of \(n,d,\delta\). For the specific case of \(\alpha = 1\), foundational work of Lugosi and Mendelson exhibits an estimator achieving subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general \(\alpha\), and establish the following information-theoretic lower bound on the optimal attainable confidence interval: $$ \Omega \left(\sqrt{\frac{d}{n}} + \left(\frac{d}{n}\right)^{\frac{\alpha}{(1 + \alpha)}} + \left(\frac{\log 1 / \delta}{n}\right)^{\frac{\alpha}{(1 + \alpha)}}\right). $$
Moreover, we devise a computationally-efficient estimator which achieves this lower bound.
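As a sanity check, specializing the lower bound to \(\alpha = 1\) (i.e., bounded covariance) makes the first two terms coincide and recovers the sub-Gaussian rate achieved by Lugosi and Mendelson: $$ \Omega \left(\sqrt{\frac{d}{n}} + \left(\frac{d}{n}\right)^{1/2} + \left(\frac{\log 1 / \delta}{n}\right)^{1/2}\right) = \Omega \left(\sqrt{\frac{d}{n}} + \sqrt{\frac{\log 1 / \delta}{n}}\right). $$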
Testing Markov Chains Without Hitting
Yeshwanth Cherapanamjeri, Peter L. Bartlett
COLT 2019
We study the problem of identity testing of Markov chains. In this setting, we are given access to a single trajectory from a Markov chain with unknown transition matrix \(Q\ \) and the goal is to determine whether \(Q = P\ \) for some known matrix \(P\ \) or \(\text{Dist}(P, Q) \geq \epsilon\ \) where \(\text{Dist}\ \) is suitably defined. In recent work by Daskalakis, Dikkala and Gravin, 2018, it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of \(P\ \), which may be arbitrarily large.
In this paper, we propose an algorithm that avoids this dependence on the hitting time, thus enabling efficient testing of Markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of Markov chains.
We consider the problem of outlier robust PCA (OR-PCA) where the goal is to recover principal directions despite the presence of outlier data points. That is, given a data matrix \(M^*\), where a \((1-\alpha)\ \) fraction of the points are noisy samples from a low-dimensional subspace while an \(\alpha\ \) fraction of the points can be arbitrary outliers, the goal is to recover the subspace accurately. Existing results for OR-PCA have serious drawbacks: while some results are quite weak in the presence of noise, other results have runtime quadratic in dimension, rendering them impractical for large-scale applications.
In this work, we provide a novel thresholding based iterative algorithm with per-iteration complexity at most linear in the data size. Moreover, the fraction of outliers, \(\alpha\), that our method can handle is tight up to constants while providing nearly optimal computational complexity for a general noise setting. For the special case where the inliers are obtained from a low-dimensional subspace with additive Gaussian noise, we show that a modification of our thresholding based method leads to significant improvement in recovery error (of the subspace) even in the presence of a large fraction of outliers.
In this paper, we consider the problem of Robust Matrix Completion (RMC) where the goal is to recover a low-rank matrix by observing a small number of its entries out of which a few can be arbitrarily corrupted. We propose a simple projected gradient descent method to estimate the low-rank matrix that alternately performs a projected gradient descent step and cleans up a few of the corrupted entries using hard-thresholding. Our algorithm solves RMC using a nearly optimal number of observations as well as a nearly optimal number of corruptions. Our result also implies significant improvement over the existing time complexity bounds for the low-rank matrix completion problem. Finally, an application of our result to the robust PCA problem (low-rank + sparse matrix separation) leads to a nearly linear time (in matrix dimensions) algorithm for the same; existing state-of-the-art methods require quadratic time. Our empirical results corroborate our theoretical results and show that even for moderate sized problems, our method for robust PCA is an order of magnitude faster than the existing methods.
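A bare-bones version of the alternation described above (with arbitrary step sizes and thresholds, intended as a sketch rather than the tuned algorithm from the paper): take a gradient step on the observed entries, project onto rank-\(r\) matrices via a truncated SVD, and hard-threshold the largest residuals as presumed corruptions.

```python
import numpy as np

def robust_matrix_completion(M_obs, mask, r, n_iters=50, step=1.0, trim_frac=0.05):
    """Alternate (i) a projected gradient step on the observed entries, truncated to rank r,
    and (ii) hard-thresholding of the largest residuals as presumed corruptions."""
    L = np.zeros_like(M_obs)
    S = np.zeros_like(M_obs)                     # current estimate of the sparse corruptions
    p = mask.mean()                              # observation probability
    for _ in range(n_iters):
        G = mask * (M_obs - S - L)               # gradient on the observed, cleaned entries
        U, s, Vt = np.linalg.svd(L + (step / p) * G, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r]          # projection onto rank-r matrices
        R = mask * (M_obs - L)
        thresh = np.quantile(np.abs(R[mask.astype(bool)]), 1 - trim_frac)
        S = R * (np.abs(R) > thresh)             # keep only the largest residuals as corruptions
    return L

# Toy instance: rank-3 ground truth, 30% of entries observed, ~2% of those corrupted.
rng = np.random.default_rng(0)
m, n, r = 200, 200, 3
M_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
mask = (rng.random((m, n)) < 0.3).astype(float)
corrupt = (rng.random((m, n)) < 0.02) * mask
M_obs = mask * M_true + corrupt * rng.normal(scale=20.0, size=(m, n))

L_hat = robust_matrix_completion(M_obs, mask, r)
print("relative recovery error:", np.linalg.norm(L_hat - M_true) / np.linalg.norm(M_true))
```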