Research Interests
Machine learning and artificial intelligence have revolutionized computer science in recent years, from research to practice, but their lack of rigorous guarantees has raised various theoretical, practical, and ethical concerns. How can we resolve, or at least alleviate, these concerns to obtain satisfactory solutions for tackling complex real-world situations?
My research projects and interests aim to answer this question from a classical algorithmic perspective, revolving around the overarching idea of combining classical and machine learning techniques, both algorithmic and analytical, to complement machine learning methods with rigorous guarantees and improve their stability and robustness in theory and practice. Specifically, my work focuses on data-dependent design and analysis methods, manifesting in two directions: (1) online algorithms augmented with untrusted predictions, which utilize the predictive power of machine learning prudently, and (2) instance-by-instance optimality in statistical estimation, which studies whether some distributions can be learned more accurately than others. I plan to continue exploring these fronts through data-dependent algorithm design and analysis, to further our theoretical understanding of machine learning.
Learning-Augmented Algorithms
Many problems in the classical algorithms literature exhibit impossibility results, most prominently in online settings, where algorithms must make irrevocable decisions without knowing future inputs. On the other hand, while machine learning models excel at predicting the future from observed historical data and are efficient in practice, they are prone to outliers and can perform arbitrarily badly in extreme cases. These worst-case scenarios make machine learning methods impractical to deploy in certain safety-critical settings, such as infrastructure management and energy scheduling. A natural question is: How can we combine the efficiency of machine learning with the stability of classical algorithms to obtain the best of both worlds?
The study of algorithms with predictions, or learning-augmented algorithms, aims to answer this question by using untrusted predictions generated by machine learning methods to augment classical algorithms: the goal is to improve performance and break impossibility results when the predictions are accurate, while maintaining performance comparable to classical, non-learning-augmented state-of-the-art algorithms even when the advice is arbitrarily erroneous. Crucially, we make no assumptions about these predictions; it is important to use them prudently, as blindly trusting the advice can lead to undesirable performance.
My previous projects in learning-augmented algorithms involve general-purpose settings, such as online covering and packing linear programs, semidefinite programs, and programs with convex or concave objectives. These settings lead to unifying frameworks that can be applied to a large variety of problems in online optimization.
Specifically, in the context of covering, we solve the problem of minimizing \(f(x)\), a linear or convex monotone non-decreasing function, over non-negative real variables \(x \in \mathbb{R}_{\geq 0}^n\) subject to linear covering constraints given by \(A \in \mathbb{R}_{\geq 0}^{m \times n}\), whose rows arrive online, augmented with an external advice \(x' \in \mathbb{R}_{\geq 0}^n\) on the variables. Our frameworks are based on the online primal-dual method of Buchbinder and Naor~\cite{buchbinder2009online}, and solve the minimization problem and its corresponding dual maximization problem simultaneously by continuously incrementing all relevant variables, generalizing prior work on learning-augmented online set cover by Bamas, Maggiori, and Svensson~\cite{bamas2020primal}. We incorporate the external advice by fine-tuning the growth rate of each variable according to whether it has reached its predicted value: if a variable \(x_j\) has not yet reached the predicted value \(x'_j\), i.e., \(x_j < x'_j\), we increment it at a faster rate until it reaches the predicted value, after which we increment it at a slower rate.
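The following is a minimal, discretized Python sketch of this idea for online fractional set cover with unit costs; the function name, the specific growth rates, and the additive terms are illustrative choices, not the exact update rules or guarantees of our framework.
\begin{verbatim}
# A minimal, discretized sketch of advice-aware growth rates for online
# fractional set cover with unit costs.  The growth rates, additive terms,
# and the function name are illustrative only.

def online_fractional_cover(constraints, advice, lam=0.5, eps=0.01):
    """constraints: index sets, revealed one at a time (online).
       advice: dict mapping variable index j -> predicted value x'_j.
       lam: confidence hyper-parameter, here taken in (0, 1)."""
    x = {}  # current fractional solution, defaulting to 0
    for S in constraints:
        for j in S:
            x.setdefault(j, 0.0)
        while sum(x[j] for j in S) < 1.0:      # constraint not yet covered
            for j in S:
                if x[j] < advice.get(j, 0.0):
                    rate = eps * (2.0 - lam)   # below the advice: grow faster
                else:
                    rate = eps * lam           # at/above the advice: grow slower
                # multiplicative-plus-additive growth, as in primal-dual schemes
                x[j] = x[j] * (1.0 + rate) + rate / len(S)
    return x

# Toy example: three constraints over variables {0, 1, 2}; the advice
# suggests buying variable 0 fully.
solution = online_fractional_cover([{0, 1}, {0, 2}, {1, 2}], advice={0: 1.0})
print({j: round(v, 3) for j, v in solution.items()})
\end{verbatim}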
Our frameworks are shown to achieve good performance: the online solution we obtain is within a multiplicative factor, governed by a confidence hyper-parameter \(\lambda \in [0,1]\) chosen by the user, of both the objective value of the advice (i.e., consistency) and the value of the solution produced by the non-learning-augmented primal-dual method (i.e., robustness). In short, our frameworks obtain constant competitiveness when the advice is accurate, and still match the performance of state-of-the-art algorithms asymptotically even when the advice is arbitrarily inaccurate.
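Schematically, and writing \(c_1(\lambda)\) and \(c_2(\lambda)\) as placeholders for the exact factors derived in the respective papers, the guarantee takes the form
\[
f\big(x^{\mathrm{ALG}}\big) \;\le\; \min\Big\{\, c_1(\lambda)\cdot f(x'),\;\; c_2(\lambda)\cdot f\big(x^{\mathrm{PD}}\big) \Big\},
\]
where \(x^{\mathrm{PD}}\) denotes the solution produced by the classical, advice-free primal-dual method, so that \(\lambda\) trades off consistency against robustness.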
This line of work also led to the (re-)discovery of simple, possibly folklore, learning-augmented algorithms for both covering linear programs and packing concave programs, based on the idea of “switching” between the prediction and the solution of any state-of-the-art online algorithm, used as a black-box procedure. These algorithms are surprisingly simple compared to their sophisticated primal-dual counterparts, yet achieve near-optimal performance. Such switching ideas are classical in online optimization, but are much less prevalent in the learning-augmented literature.
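Below is a schematic Python sketch of such a switching meta-algorithm for an online minimization problem with additively accumulating costs; the policy interfaces and the specific switching threshold are hypothetical simplifications rather than the rules used in our results.
\begin{verbatim}
# A schematic "switching" meta-algorithm for an online minimization problem
# where costs accumulate additively over rounds.  The policy interfaces
# (advice_policy, online_policy) and the threshold are hypothetical.

def switching_algorithm(rounds, advice_policy, online_policy, threshold=2.0):
    """rounds: iterable of online inputs.
       advice_policy / online_policy: callables mapping an input to a
       (decision, cost) pair for that round.
       threshold: abandon the advice once its cumulative cost exceeds
       `threshold` times the black-box algorithm's cumulative cost."""
    advice_cost, online_cost = 0.0, 0.0
    following_advice = True
    decisions = []
    for inp in rounds:
        # simulate both policies to track their cumulative costs
        a_dec, a_cost = advice_policy(inp)
        o_dec, o_cost = online_policy(inp)
        advice_cost += a_cost
        online_cost += o_cost
        # the black-box algorithm's cost serves as a proxy lower bound on OPT
        # (up to its competitive ratio); switch once the advice looks too costly
        if following_advice and advice_cost > threshold * max(online_cost, 1e-9):
            following_advice = False
        decisions.append(a_dec if following_advice else o_dec)
    return decisions
\end{verbatim}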
What structural properties of the problems we studied allow these simple, switching-based algorithms to be optimal? For which problems and settings can such simple algorithms, using a classical online algorithm only as a black box, be optimal? These are meta-level conceptual questions that may be important to our understanding of the fundamental power of external predictions, and I am very interested in exploring them.
The study of learning-augmented algorithms generally makes no assumptions about the external prediction, and the question of how to generate accurate and trustworthy predictions is usually out of scope. However, there is a surge of recent interest in data-dependent algorithm design, which returns to the field of machine learning and studies how to obtain these predictions efficiently via online learning and other methods; this is a direction I am interested in exploring to complement my research in learning-augmented algorithms.
Optimality in Statistical Estimation, Beyond the Worst-Case
Learning properties of an unknown distribution from random samples is a fundamental primitive of statistics and machine learning; a particularly basic instance is estimating the mean of an unknown distribution in one dimension. Given a set of \(n\) i.i.d. samples from an unknown distribution \(p\) with mean \(\mu_p\) and variance \(\sigma_p^2\), how accurately can we estimate its mean with probability \(1 - \delta\) over the draw of the \(n\) samples? It has long been known that the sample mean is optimal when the unknown distribution is Gaussian, but for general distributions the sample mean is prone to outliers in the sample set and can perform exponentially worse. A long line of research has culminated in optimal mean estimators achieving the sub-Gaussian rate, \((\sqrt{2}+o(1))\,\sigma_p\sqrt{\log(1/\delta)/n}\), performing on any distribution as well as the sample mean does on Gaussian distributions, even up to the constant factor.
These results, however, are only optimal in the worst case: no estimator can outperform the sample mean on Gaussian distributions. Can we beat the sub-Gaussian rate, at least for some easier distributions? Can we develop “instance-by-instance” algorithmic and analysis techniques? Is it possible for estimators to leverage beneficial features of their input distribution to beat the sub-Gaussian rate, without explicit knowledge of these features?
My prior work resolves this question with an unexpectedly nuanced answer: yes in certain regimes, for instance in finite-sample settings or under symmetry assumptions, but in general, no. In an appropriate sense of “hardness”, every distribution is at least as “hard” as a Gaussian distribution. More specifically, for any distribution \(p\) with finite mean, there exists a “neighboring” distribution \(q\) such that the means of \(p\) and \(q\) are well-separated according to a ground-truth error function \(\epsilon_{n,\delta}(p)\), yet indistinguishable with probability \(1 - \delta\) to any estimator. This implies, via a standard testing-to-estimation reduction, that no estimator can achieve error \(\frac{1}{2} \epsilon_{n,\delta}(p)\) over \(n\) samples with probability \(1 - \delta\) on both distributions simultaneously. The function \(\epsilon_{n,\delta}(p)\) satisfies \(\epsilon_{n,\delta}(p) = \Omega\big(\sigma_p\sqrt{\log(1/\delta)/n}\big)\) as \(\log \frac{1}{\delta} / n \to 0\), showing that no estimator can asymptotically outperform the sub-Gaussian rate for both \(p\) and \(q\) simultaneously. Beyond the sub-Gaussian rate, our indistinguishability result holds even in the heavy-tailed regime, where no \((1+\alpha)\)-th moment exists for any constant \(\alpha > 0\).
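For intuition, the testing-to-estimation reduction is the standard one: under the separation \(|\mu_p - \mu_q| \ge \epsilon_{n,\delta}(p)\), any estimate \(\hat{\mu}\) within \(\frac{1}{2}\epsilon_{n,\delta}(p)\) of the true mean is strictly closer to it than to the other mean,
\[
|\hat{\mu} - \mu_p| < \tfrac{1}{2}\,\epsilon_{n,\delta}(p)
\;\;\Longrightarrow\;\;
|\hat{\mu} - \mu_q| \;\ge\; |\mu_p - \mu_q| - |\hat{\mu} - \mu_p| \;>\; \tfrac{1}{2}\,\epsilon_{n,\delta}(p) \;>\; |\hat{\mu} - \mu_p|,
\]
so an estimator achieving this accuracy on both \(p\) and \(q\) with probability \(1 - \delta\) would yield a test (output whichever of \(\mu_p, \mu_q\) is closer to \(\hat{\mu}\)) distinguishing the two distributions with probability \(1 - \delta\), contradicting their indistinguishability. (This is only a sketch; the exact constants and failure probabilities in our results differ.)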
We rigorously define the ground-truth error function \(\epsilon_{n,\delta}(p)\), and capture this notion of indistinguishability by defining “neighborhood optimality”, a novel definition of beyond-worst-case optimality for mean estimation. In short, an estimator is neighborhood optimal with respect to a neighborhood function \(N\) if, for any distribution \(p\), it is admissible, i.e., Pareto efficient, over all distributions in the neighborhood \(N(p)\) of \(p\), achieving rate \(\epsilon_{n,\delta}(p)\). Neighborhood optimality smoothly interpolates between instance optimality and admissibility, two classical notions of instance-by-instance optimality that fail in the context of mean estimation over \(\mathbb{R}\), and is to our knowledge the first non-trivial definition of optimality for mean estimation beyond the worst case.
Under our new definition, we show that both the classical median-of-means estimator and the state-of-the-art Lee-Valiant estimator are neighborhood optimal with respect to an appropriate neighborhood function, obtaining an error rate matching our ground-truth error function \(\epsilon_{n,\delta}(p)\). More generally, we establish a connection between robustness against adversarially corrupted data and instance-by-instance optimality, showing that any adversarially robust mean estimator is also neighborhood optimal.
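For concreteness, here is a minimal Python implementation of the classical median-of-means estimator; the choice of the number of groups (typically on the order of \(\log\frac{1}{\delta}\)) and the example parameters below are illustrative, and our analysis concerns the estimator's guarantees rather than this particular implementation.
\begin{verbatim}
import random
import statistics

def median_of_means(samples, num_groups):
    """Classical median-of-means: randomly partition the samples into groups,
    average within each group, and return the median of the group means.
    Taking num_groups on the order of log(1/delta) yields sub-Gaussian-style
    concentration up to constants; the parameters here are illustrative."""
    samples = list(samples)
    random.shuffle(samples)                      # random partition into groups
    k = max(1, min(num_groups, len(samples)))
    groups = [samples[i::k] for i in range(k)]   # k roughly equal-sized groups
    group_means = [sum(g) / len(g) for g in groups]
    return statistics.median(group_means)

# Toy comparison on a heavy-tailed (Pareto) sample, where the empirical mean
# is easily dragged by outliers while median-of-means remains stable.
data = [random.paretovariate(2.1) for _ in range(10_000)]
print("sample mean     :", sum(data) / len(data))
print("median of means :", median_of_means(data, num_groups=20))
\end{verbatim}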
A key message of this line of work is that if we wish to beat the worst-case sub-Gaussian rate for mean estimation, we must identify and leverage additional favorable distributional structure, beyond the existence of higher moments. Follow-up work by Gupta, Lee, and Price showed that, assuming the distribution is symmetric about its mean, one can achieve instance-dependent Fisher-information rates for mean estimation, which can be significantly better than the sub-Gaussian rate. I believe the field should take these results together as a call to arms to approach mean estimation from new perspectives, exploring further structural assumptions that allow us to circumvent our lower bound construction.
Beyond mean estimation, and viewing our results from a higher level: while this line of work is conceptual and definitional, I find it very intriguing that the optimality of certain classical algorithms can extend well beyond the worst case. How can we better explore the landscape of instance-by-instance optimality, for other statistical estimation problems and beyond the field of statistics? I am interested in investigating these questions in the future.