My research group builds AI systems that optimize real-world business objectives, with a particular emphasis on trustworthy and responsible learning methods. We take an interdisciplinary approach spanning several fields including machine learning, operations research, and statistics.
AI models pre-trained on internet data can understand text, code, audio, and video. However, as public data sources become exhausted, it is evident that enabling applications beyond consumer chatbots requires a thoughtful approach to data curation. Mistakes are costly in business applications, and intelligent agents must carefully collect and leverage proprietary data, such as customer feedback and user interactions.
My research group develops trustworthy AI-driven decision-making systems that optimize long-term outcomes. In particular, we take a holistic “process view” of AI systems.
Modern data collection systems acquire data from heterogeneous sources, and classical approaches that optimize average-case performance yield brittle AI systems. They fail to i) make good predictions on underrepresented groups, ii) generalize to new environments, even those similar to the ones seen during training, and iii) remain robust to adversarial examples and long-tailed inputs. Yes, even the largest models trained on the entirety of the internet! Despite recent successes, our limited understanding of the failure modes of AI systems highlights the need for i) models that work reliably and ii) rigorous evaluation schemes and diagnostics that maintain their quality.
Different distribution shifts require different solutions. Understanding why model performance worsened is a fundamental step for informing subsequent methodological and operational interventions. Heterogeneity in data helps robustness, but the cost of data collection is often a binding constraint. We build a nuanced modeling language for quantifying data heterogeneity (or lack thereof), and use it to optimally allocate limited resources in the AI production pipeline. To learn more, watch the following NeurIPS tutorial and take a look at the following two papers.
Our vision is to build robust and reliable learning procedures that make decisions with a guaranteed level of performance over their inputs. My Ph.D. thesis built the statistical
Prediction is never the final goal. To align AI models optimized to predict short-term outcomes (e.g., clicks) with downstream long-term business outcomes (e.g., user utility), we design scalable computational frameworks for learning operational decisions. We derive algorithms from mathematical principles, but test them using rigorous empirical benchmarking practices rather than relying on theoretical guarantees in idealized, contrived settings.
Experimentation is the foundation of scientific decision-making. Adaptive methods can significantly improve data efficiency, but standard algorithms are primarily designed to satisfy good upper bounds on their performance (regret bounds). They do not model important operational constraints, and they are challenging to implement due to infrastructural and organizational difficulties.
We focus on underpowered, short-horizon, and large-batch problems that typically arise in practice. Instead of the typical theory-driven paradigm, we use PyTorch and empirical benchmarking for algorithm development.
Our main observation is that normal approximations, which are universal in statistical inference, can also guide the design of adaptive algorithms. Using large-batch normal approximations, we derive an MDP formulation that optimizes instance-specific constants, instead of relying on regret bounds that only hold over long horizons.
By using standard computational tools, our adaptive designs significantly outperform even Bayesian bandit algorithms (e.g., Thompson sampling) that require full distributional knowledge of individual rewards.
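To make the large-batch normal approximation concrete, here is a minimal sketch under assumptions that are not taken from the text above: Gaussian priors on each arm's mean reward, a known and common noise variance, and hypothetical function names (`posterior_update`, `simulate_batch`). After a batch, each arm's sample mean is treated as approximately normal, so a conjugate update moves the posterior mean and variance. These pairs can serve as the state of an MDP over batches. This illustrates the idea only, not the group's implementation.

```python
import math
import random


def posterior_update(mean, var, batch_mean, noise_var, n_alloc):
    """Conjugate Gaussian update of one arm's posterior after one batch.

    By the central limit theorem, the batch sample mean is treated as
    N(true_mean, noise_var / n_alloc). Assumes a known noise variance.
    """
    if n_alloc == 0:
        return mean, var  # no samples allocated, so the belief is unchanged
    obs_var = noise_var / n_alloc
    new_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    new_mean = new_var * (mean / var + batch_mean / obs_var)
    return new_mean, new_var


def simulate_batch(true_means, noise_var, allocation, batch_size, state, rng):
    """Simulate one batch and return the next state (posterior per arm).

    `allocation` is the share of the batch given to each arm; `state` is a
    list of (posterior_mean, posterior_var) pairs.
    """
    next_state = []
    for mu, share, (mean, var) in zip(true_means, allocation, state):
        n_alloc = int(round(share * batch_size))
        if n_alloc > 0:
            # Draw the batch-level sample mean directly from its normal limit.
            batch_mean = rng.gauss(mu, math.sqrt(noise_var / n_alloc))
        else:
            batch_mean = 0.0  # unused when n_alloc == 0
        next_state.append(
            posterior_update(mean, var, batch_mean, noise_var, n_alloc)
        )
    return next_state


# Usage: two arms with hypothetical true means, one batch of 1000 samples.
rng = random.Random(0)
state = [(0.0, 1.0), (0.0, 1.0)]
state = simulate_batch([0.1, 0.2], 1.0, [0.5, 0.5], 1000, state, rng)
```

Because the state transition depends only on the allocation and the current posterior, an allocation policy can in principle be optimized over this state with standard computational tools, for example by simulation.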
Real-world decision-making requires grappling with a perpetual lack of data as environments change. Intelligent agents must comprehend uncertainty and actively gather information to resolve it. Leveraging scalable computational tools in ML—neural networks, SGD based on auto-differentiation, validation losses—to balance exploration and exploitation is a longstanding challenge.
Consider the cold-start problem in recommender systems, where an online platform must identify high-quality items among a large catalogue. The standard ML approach, which is the de facto industry standard, uses item features to predict engagement. Instead, we propose a sequence modeling approach where we predict the sequence of engagements observed for the item using an autoregressive model (e.g., transformer).
Recent advances in AI present significant opportunities to rethink the design of service systems with AI at the forefront. Endogeneity presents a key intellectual challenge to managing congestion. Prediction is never the goal, but the link between predictive performance and downstream decision-making performance is not straightforward: prioritizing a job based on predictions impacts the delay of other jobs!
We crystallize how classical tools from queueing theory provide managerial insights into the design and operation of AI-based service systems: i) simple policies with heavy traffic optimality guarantees, ii) a novel model selection procedure for prediction models with downstream queueing performance as a central concern, and iii) AI-based triage that trades off predictive performance, hiring costs, and congestion costs.
Off-policy methods can learn sequential decision policies using the rich reservoir of previously collected (non-experimental / observational) data. While prediction models can be easily evaluated on previously collected data, assessing decision-making performance requires counterfactual reasoning. Traditional modeling assumptions that allow adjusting prediction models to learn counterfactuals rarely hold in practice. The growth in the nominal volume of data is no panacea: observed data typically only covers a portion of the state-action space, posing challenges in counterfactual learning. Alongside this data sparsity, shifts in the data distribution are common. Observed decisions depend on unrecorded confounders, and learning good policies requires causal reasoning. Marginalized demographic groups are severely underrepresented; for example, among the more than 10,000 cancer clinical trials the National Cancer Institute funds, fewer than 5% of participants were non-white.
Our existing statistical language falls woefully short as it relies on unverifiable (and often false) assumptions, and we lack diagnostics that can identify failure modes. We develop data analysis tools that can guarantee robust scientific findings and perhaps more importantly, fail in expected ways by highlighting the fundamental epistemic uncertainty in the data.
While large-scale randomized studies offer a “gold standard” for internal validity, their external validity can be called into question over spatiotemporal changes in the population, particularly when the treatment effect is heterogeneous across the population. To assess and improve external validity, we develop sensitivity analysis frameworks that allow researchers to assess the extent to which existing experiments inform the treatment effect in a new target site and to quantify an expected range of the policy effect for each new site.
Off-policy methods can learn decision policies using the rich reservoir of previously collected (observational) data. A universal assumption that enables counterfactual reasoning requires that observed decisions do not depend on any unrecorded confounders that simultaneously affect future states/rewards. This condition is frequently violated in medicine, e-commerce, and public policy. For example, emergency department patients often do not have an existing record in the hospital’s electronic health system, leaving essential patient-specific information unobserved in subsequent counterfactual analysis.
In the presence of unobserved confounding, even with large samples, it is impossible to precisely estimate the performance of the evaluation policy. To guard against spurious counterfactual evaluations, we propose a worst-case approach where we first posit a realistic notion of bounded unobserved confounding that limits the influence of unrecorded variables on observed decisions and develop corresponding worst-case bounds on the reward.
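As a rough illustration of a worst-case bound under bounded confounding, the sketch below uses a simplified single-decision setting and a swapped-in confounding model, the marginal sensitivity model, in which the odds of the true propensity differ from the odds of the nominal propensity by at most a factor `gamma`. This is an assumption made for illustration, not necessarily the notion of bounded confounding used in our work. The function names are hypothetical. The code bounds a self-normalized inverse-propensity estimate of the mean reward by searching over threshold-structured weights, which attain the extremes of this linear-fractional problem.

```python
def weight_bounds(propensity, gamma):
    """Range of the true inverse propensity when the odds ratio between the
    nominal and true propensities lies in [1/gamma, gamma] (assumed model)."""
    inv_odds = 1.0 / propensity - 1.0
    return 1.0 + inv_odds / gamma, 1.0 + inv_odds * gamma


def worst_case_mean(outcomes, propensities, gamma):
    """Lower and upper bounds on the self-normalized weighted mean reward.

    `outcomes` are rewards of units that received the evaluated decision and
    `propensities` are their nominal probabilities of receiving it.
    """
    pairs = sorted(zip(outcomes, propensities))  # ascending in outcome
    ys = [y for y, _ in pairs]
    bounds = [weight_bounds(e, gamma) for _, e in pairs]
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]

    def weighted_mean(weights):
        return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

    n = len(ys)
    # Minimum: the k smallest outcomes get the largest admissible weights.
    lower = min(weighted_mean(hi[:k] + lo[k:]) for k in range(n + 1))
    # Maximum: the k smallest outcomes get the smallest admissible weights.
    upper = max(weighted_mean(lo[:k] + hi[k:]) for k in range(n + 1))
    return lower, upper


# Usage: with gamma = 1 there is no unobserved confounding and the interval
# collapses to the usual point estimate; larger gamma widens it.
no_confounding = worst_case_mean([0.0, 1.0], [0.5, 0.5], gamma=1.0)
some_confounding = worst_case_mean([0.0, 1.0], [0.5, 0.5], gamma=2.0)
```

The width of the resulting interval reflects uncertainty that more samples cannot remove: it depends on `gamma`, not on the sample size.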
While theoretical insights can provide invaluable principles, their successful operationalization requires recognizing and internalizing the limitations of crude approximations and unverifiable assumptions we put in place for mathematical convenience. My group’s research methodology aims to connect two disparate yet complementary worldviews:
I take inspiration from Von Neumann’s perspective on mathematical sciences, which I paraphrase below:
As a mathematical discipline travels far from its empirical source only indirectly inspired from ideas coming from 'reality', it is beset with grave dangers that it will develop along the line of least resistance and become more and more purely aestheticizing. This need not be bad if the discipline is under the influence of researchers with an exceptionally well-developed taste, but the only general remedy is the rejuvenating return to the source: the reinjection of directly empirical ideas. I am convinced that this is a necessary condition to conserve the freshness and the vitality of the subject, and that this will remain so in the future.
I am fortunate to be able to learn from the well-developed taste of my colleagues. Alongside this personal education, I (try to) inject empirical ideas when formulating research directions to increase the impact of my research.