Research Foundations

An overview of the research foundations behind our group's work: trustworthy AI, AI-driven decisions, robust causality, and empirical methodology.

AI models pre-trained on internet data can understand text, code, audio, and video. However, as public data sources become exhausted, it is evident that enabling applications beyond consumer chatbots requires a thoughtful approach to data curation. Mistakes are costly in decision-making problems, and intelligent agents must carefully collect and leverage proprietary data, such as customer feedback and user interactions.

The widespread adoption of AI systems across critical domains has revealed fundamental gaps between academic research and practical deployment. While large language models trained on internet data have achieved impressive capabilities in general tasks, many real-world applications require specialized domain knowledge and careful consideration of reliability, fairness, and safety. Our work addresses these challenges through three key principles:

My research group develops trustworthy AI-driven decision-making systems that optimize long-term outcomes. In particular, we take a holistic “process view” of AI systems.

Process view of AI (not just a single model): Methodological development in ML largely focuses on model training. Taking a system-level view, we identify central bottlenecks in AI systems and resolve them by building computational and data-centric foundations.

Trustworthy AI

AI systems are omnipresent. This is their primary appeal, yet also their biggest shortcoming. During operation, AI systems inevitably encounter tail inputs and extrapolate in unexpected ways. This phenomenon has been widely observed under different names: distribution shift, lack of {fairness, robustness, causality, faithfulness}, and hallucinations.

Scaling internet data is no panacea. Real-world decision-making problems require specialized in-domain datasets, e.g., healthcare, recommender systems, resource allocation. The cost of data collection poses a binding constraint in many disciplines, and we believe new ideas are needed to build intelligent agents. We aim to fill this methodological gap and develop algorithms that can scale to frontier systems.

Comprehending Uncertainty

Intelligent agents must comprehend their own uncertainty and actively make decisions to resolve it. To advance broad AI capabilities, we must target data collection and synthetic data generation towards areas with high epistemic uncertainty. To bound tail-risk, AI systems must understand when their predictions for anomalous inputs are not trustworthy and delegate to human experts when necessary.

Building systems that can reason through epistemic uncertainty based on natural language feedback has been a longstanding challenge. A traditional probabilistic model requires a prior and likelihood over latent variables. But by definition, latents are fundamentally unobservable and often ill-defined.

On the other hand, autoregressive models pre-trained on massive web data exhibit striking predictive capabilities when conditioned on even a small number of demonstrations. Since the 1920s, De Finetti advocated for modeling observables rather than latents. We take his predictive view of uncertainty as coming from future data that has not been observed yet. Let us illustrate with a conceptual example that crystallizes our insight (note: the example is made up and clearly devoid of clinical significance).

De Finetti's predictive view of uncertainty as coming from future data: We show that the sequence prediction loss (perplexity) over exchangeable documents (questions and answers) measures the quality of uncertainty quantification over latent environments (mental state of the patient). This allows us to bring frontier autoregressive models to bear on quantifying uncertainty at scale!
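To make the predictive view concrete, here is a minimal sketch in which a Beta-Bernoulli (Polya urn) predictive rule stands in for a pre-trained autoregressive sequence model. This stand-in, the function names, and the toy context are all assumptions made for illustration; none of it is the group's implementation. Autoregressively imputing future observations and averaging them yields a draw of the latent success rate, so the spread across draws reflects epistemic uncertainty.

```python
import random
import statistics


def next_observation_prob(ones, n, a=1.0, b=1.0):
    """One-step-ahead predictive probability of observing a 1.

    Hypothetical stand-in for a sequence model: a Polya urn with
    pseudo-counts a and b, conditioned on `ones` successes out of n.
    """
    return (a + ones) / (a + b + n)


def imputed_latent_draw(context, horizon, rng):
    """Impute `horizon` future observations one at a time and return their mean.

    Each imputed observation is appended to the running context (tracked
    through the counts), mirroring autoregressive generation.
    """
    ones, n = sum(context), len(context)
    future_ones = 0
    for _ in range(horizon):
        y = 1 if rng.random() < next_observation_prob(ones, n) else 0
        ones += y
        n += 1
        future_ones += y
    return future_ones / horizon


if __name__ == "__main__":
    rng = random.Random(0)
    context = [1, 1, 0, 1, 1, 1, 0, 1]  # toy observed answers (made up)
    draws = [imputed_latent_draw(context, horizon=200, rng=rng) for _ in range(500)]
    print("mean of imputed latent draws:", round(statistics.mean(draws), 3))
    print("spread of imputed latent draws:", round(statistics.pstdev(draws), 3))
```

In this toy setting the draws should approximately follow the Beta(7, 3) posterior implied by the context, which is what makes the urn a convenient sanity check: no explicit posterior inference routine is run, only repeated one-step-ahead prediction.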

Crucially, the AI agent can update beliefs as data is gathered. Unlike cumbersome posterior inference routines for probabilistic models, you can now simply append prior observations to the context of the sequence model. To learn more, watch this recent talk at Simons Institute, and the following recent paper from the group. We are actively working in this space so ping us if you’d like to chat more.

Continual Learning Adaptive Experimentation

Exchangeable Sequence Models Quantify Uncertainty Over Latent Concepts

Naimeng Ye and Hongseok Namkoong


AI models are omnipresent yet extrapolate in unexpected ways, posing a significant barrier to robust and fair systems. Building AI systems that can articulate their own uncertainty has been a longstanding challenge in ML. Such probabilistic reasoning capability is key to bounding downside risk (e.g., delegating to human experts) and continually improving system performance by gathering data to resolve uncertainty. Despite recent advances in large language models, uncertainty quantification remains a challenge, with methods attempting to leverage these deep neural networks (such as Bayesian neural networks) frequently facing scalability limitations.
This work takes an important conceptual step towards building large-scale AI systems that can reason about uncertainty through natural language. We revisit De Finetti’s view of uncertainty coming from missing observations rather than latent parameters, which allows us to pose learning to do statistical inference as a prediction problem involving masked inputs. This formal connection between autoregressive generation and probabilistic reasoning allows pre-trained sequence models to express their epistemic uncertainty on underlying concepts, and refine their beliefs as they gather more information.
Our findings open a promising avenue for addressing uncertainty in complex, data-rich settings in a scalable way. We are excited by how this work leverages a timeless insight to inform a timely topic: guiding the next generation of AI systems.
1. As internet data depletes, the pace of progress in LLM capabilities has been widely observed to slow down (even in public media). This suggests that the limited paradigm of pre-training on passively scraped web data has reached its full potential. To move forward, the authors believe that the next generation of AI systems must be able to understand tasks on which they suffer high uncertainty, and actively gather data in order to continually improve their performance.
2. Since scalable uncertainty quantification poses a key intellectual bottleneck, we resolve this by going back to De Finetti’s insight developed in the 1920s. We believe the connection between Bayesian inference and autoregressive generation provides the groundwork for building LLMs with probabilistic reasoning capabilities.

Taken together, our work showcases how principled scientific insights have the potential to shape the design of even the largest scale AI systems.
@article{YeNa24,
  title = {Exchangeable Sequence Models Quantify Uncertainty Over Latent Concepts},
  author = {Ye, Naimeng and Namkoong, Hongseok},
  journal = {arXiv:2408.03307 [stat.ML]},
  year = {2024},
  selected = false
}

Language for distribution shifts

Different distribution shifts require different solutions. Understanding why model performance worsened is a fundamental step for informing subsequent methodological and operational interventions. Heterogeneity in data helps robustness, but the cost of data collection is often a binding constraint. We build a nuanced modeling language for quantifying data heterogeneity (or lack thereof), and use it to optimally allocate limited resources in the AI production pipeline. To learn more, watch the following NeurIPS tutorial and take a look at the following two papers.

Foundations of distributional robustness

Classical approaches that optimize average-case performance yield brittle AI systems. They fail to i) make good predictions on underrepresented groups, ii) generalize to new environments, even those similar to the ones seen during training, and iii) be robust to adversarial examples and long-tailed inputs. Yes, even the largest models trained on the entirety of the internet! Despite recent successes, our lack of understanding of the failure modes of AI systems highlights the need for i) models that work reliably and ii) rigorous evaluation schemes and diagnostics that maintain their quality.
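The contrast between average-case and robust training can be written compactly. The notation below is an assumption introduced for illustration (loss \ell, parameters \theta, training distribution P, and a set \mathcal{Q}(P) of plausible shifted distributions); it is the generic distributionally robust formulation rather than the specific uncertainty set studied in any one paper.

```latex
% Average-case objective: performance under the training distribution P
\begin{equation*}
  \min_{\theta} \; \mathbb{E}_{Z \sim P}\bigl[\ell(\theta; Z)\bigr]
\end{equation*}
% Distributionally robust counterpart: worst case over plausible shifts Q of P
\begin{equation*}
  \min_{\theta} \; \sup_{Q \in \mathcal{Q}(P)} \mathbb{E}_{Z \sim Q}\bigl[\ell(\theta; Z)\bigr]
\end{equation*}
```

The choice of \mathcal{Q}(P) encodes which shifts one wants protection against, for example reweightings that emphasize underrepresented groups or tail inputs.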

Our vision is to build robust and reliable learning procedures that make decisions with a guaranteed level of performance over their inputs. My Ph.D. thesis built the statistical and computational foundations of robust machine learning. As robustness is a central topic spanning multiple fields, my subsequent works have developed robust algorithms for deep learning, causal inference, reinforcement learning, and safety evaluation of autonomous vehicles. These works have led to new approaches toward fairness by characterizing fundamental connections between robustness and fairness. Watch my talk at Google Brain to learn more.

Trustworthy AI Optimization

Distributionally Robust Losses Against Mixture Covariate Shifts

John C. Duchi, Tatsunori Hashimoto, and Hongseok Namkoong

Operations Research, 2022


@article{DuchiHaNa22,
  author = {Duchi, John C. and Hashimoto, Tatsunori and Namkoong, Hongseok},
  title = {Distributionally Robust Losses Against Mixture Covariate Shifts},
  year = {2022},
  journal = {Operations Research},
  url = {https://pubsonline.informs.org/doi/10.1287/opre.2022.2363},
}

AI-driven decisions

Prediction is never the final goal. To align AI models optimized to predict short-term outcomes with downstream long-term goals, we design scalable computational frameworks for learning operational decisions. We derive algorithms from mathematical principles, but test them using rigorous empirical benchmarking practices rather than relying on theoretical guarantees in idealized, contrived settings.

Optimization-driven adaptive experimentation

Experimentation is the foundation of scientific decision-making. Adaptive experimentation can significantly improve efficiency by focusing resources on promising treatments, and expand the set of testable scientific hypotheses. However, significant practical challenges remain in applying standard adaptive algorithms:

Batching, delayed feedback, and short horizons: Bandit algorithms are designed to be updated after every observation, but real-world experiments are conducted with a few large batches and limited adaptivity due to infrastructure constraints and delayed feedback.
Nonstationarity: Customers at Costco look different on Sunday vs. Monday.
General objectives and metric constraints: Practitioners care about a wide range of objectives and constraints (within- vs. post-experiment) across multiple outcomes/metrics/rewards.
Cost/budget constraints: We often want the best treatment while satisfying budget constraints.

The existing algorithm design paradigm requires you to consider a very specific combination of these features and develop a new bandit algorithm tailored to this setting. This is akin to an optimization solver developed for a particular linear program! Naturally, existing algorithms are extremely brittle and often underperform even a non-adaptive A/B test.

Instead, we develop a mathematical programming framework for developing adaptive experimentation algorithms. We ask the modeler to write down a flexible optimization formulation and use modern machine learning systems to (heuristically) solve for adaptive designs.

Unlike bespoke methods tailored to a particular problem, our mathematical programming framework (RHO) provides consistent and robust efficiency gains across instances.

How do we do this? A naive formulation of the adaptive experimentation problem as a dynamic program is intractable: individual outcome distributions are unknown and lead to combinatorial action spaces. Using a batched view, we model the uncertainty around batch-level sufficient statistics necessary to enable the use of modern computational tools (auto-differentiation and SGD, as opposed to human intuition) in designing adaptive algorithms.

Our main observation is that normal approximations, which are universal in statistical inference, can also guide the design of adaptive algorithms. Using large batch normal approximations, we derive an MDP formulation that optimizes instance-specific constants, instead of relying on regret bounds that only hold in large horizons. Instead of the typical theory-driven paradigm, we use PyTorch and empirical benchmarking for algorithm development.
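As a rough illustration of the batched view, the sketch below treats each arm's batch sample mean as approximately Gaussian and applies a conjugate update to a Gaussian belief over the arm's mean reward. The Gaussian prior, the known noise variances, and all names here are assumptions made for the example; this is not the paper's MDP formulation or the RHO algorithm itself, only the kind of belief-state transition such a formulation could be built on.

```python
def batch_posterior_update(mu, var, batch_mean, noise_var, batch_size, allocation):
    """Update per-arm Gaussian beliefs after one batch.

    mu, var      : prior mean and variance of each arm's average reward
    batch_mean   : observed sample mean of each arm within the batch
    noise_var    : per-observation reward variance of each arm (assumed known)
    batch_size   : total number of units in the batch
    allocation   : fraction of the batch assigned to each arm (sums to 1)

    By the normal approximation, the sample mean of an arm that receives
    batch_size * p units has variance noise_var / (batch_size * p).
    """
    new_mu, new_var = [], []
    for m, v, ybar, s2, p in zip(mu, var, batch_mean, noise_var, allocation):
        n_arm = batch_size * p
        if n_arm == 0:
            # No units assigned to this arm: the belief is unchanged.
            new_mu.append(m)
            new_var.append(v)
            continue
        data_precision = n_arm / s2
        post_precision = 1.0 / v + data_precision
        new_var.append(1.0 / post_precision)
        new_mu.append((m / v + data_precision * ybar) / post_precision)
    return new_mu, new_var


if __name__ == "__main__":
    mu, var = batch_posterior_update(
        mu=[0.0, 0.0], var=[1.0, 1.0],
        batch_mean=[1.0, 0.3], noise_var=[4.0, 4.0],
        batch_size=100, allocation=[0.5, 0.5],
    )
    print(mu, var)
```

Because the update is a smooth function of the allocation, the posterior after a batch can be differentiated with respect to the sampling fractions, which is the property that makes gradient-based design of allocations plausible in this kind of model.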

Adaptive Experimentation Optimization

Optimization-Driven Adaptive Experimentation

Ethan Che, Daniel Jiang, Hongseok Namkoong, and Jimmy Wang

Selected for oral presentations at the Econometric Society Interdisciplinary Frontiers: Economics and AI+ML conference and Conference on Digital Experimentation


Adaptivity can significantly improve efficiency of experimentation, but it is challenging to implement even at large online platforms with mature experimentation systems. As a result, many real-world experiments are deliberately implemented with large batches and a handful of opportunities to update the sampling allocation as a way to reduce operational costs of experimentation.
In this work, we focus on adaptive experiments with limited adaptivity (short horizons T < 10). Bandit algorithms focusing on long-horizon settings are tailored to provide regret guarantees for each specific case, and we find they often underperform static A/B tests on practical problem instances with batched feedback, non-stationarity, multiple objectives and constraints, and personalization.
In response, we develop a mathematical programming framework for developing adaptive experimentation algorithms. Instead of the problem-specific research paradigm (akin to an optimization solver developed for a particular linear program), we ask the modeler to write down a flexible optimization formulation and use modern machine learning systems to (heuristically) solve for adaptive designs. Since a naive formulation of the adaptive experimentation problem as a dynamic program is intractable, we propose a batched view of the experimentation process. We model the uncertainty around batch-level sufficient statistics necessary to make allocation decisions, instead of attempting to model unit-level outcomes whose distributions are commonly unknown and lead to intractable dynamic programs with combinatorial action spaces.
Sequential Gaussian approximations are the main intellectual vehicle powering our mathematical programming framework. CLT-based normal approximations are universal in statistical inference, and a sequential variant we prove provides a simple optimization formulation that lends itself to modern computational tools. Through extensive empirical evaluation, we observe that even a preliminary and heuristic solution approach can provide major robustness benefits. Unlike bespoke methods (e.g., Thompson sampling variants), our mathematical programming framework provides consistent gains over static randomized controlled trials and exhibits robust performance across problem instances.
@article{CheJiNaWa24,
  title = {Optimization-Driven Adaptive Experimentation},
  author = {Che, Ethan and Jiang, Daniel and Namkoong, Hongseok and Wang, Jimmy},
  journal = {arXiv:2408.04570 [cs.LG]},
  year = {2024},
  note = {Selected for oral presentations at the Econometric Society Interdisciplinary Frontiers: Economics and AI+ML conference and Conference on Digital Experimentation},
}

Adaptive Experimentation Benchmarks

AExGym: Benchmarks and Environments for Adaptive Experimentation

Jimmy Wang, Ethan Che, Daniel Jiang, and Hongseok Namkoong


@article{WangChJiNa24,
  title = {AExGym: Benchmarks and Environments for Adaptive Experimentation},
  author = {Wang, Jimmy and Che, Ethan and Jiang, Daniel and Namkoong, Hongseok},
  journal = {arXiv:2408.04531 [cs.LG]},
  year = {2024},
}

Informed Exploration Using Foundation Models

Real-world decision-making requires the AI agent to continually interact with the environment. This requires combining two different modes of learning: static and interactive. We propose a paradigm of learning where the agent initially relies on the rich world prior available in frontier AI models to balance exploration and exploitation. As data gathered online accrues, the agent must rely increasingly on it by updating its beliefs.

Uncertainty from missing data: In a cold-start problem in recommender systems, autoregressively imputing user outcomes implements informed Thompson sampling that leverages a world prior based on foundation models. Training good autoregressive sequence models thus allows you to solve complex online decision-making problems!
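The idea in the caption above can be sketched on a toy two-armed bandit. Everything here is an assumption for illustration: a Polya urn predictive rule plays the role of the trained sequence model, rewards are binary, and the function names are hypothetical. In each round the agent imputes the missing outcomes of every arm, averages them, and pulls the arm whose imputed average is largest, which behaves like Thompson sampling without an explicit posterior.

```python
import random


def impute_arm_mean(successes, pulls, horizon, rng, a=1.0, b=1.0):
    """Autoregressively impute `horizon` unseen outcomes for one arm.

    The one-step predictive rule (a Polya urn with pseudo-counts a, b) is a
    hypothetical stand-in for a pre-trained autoregressive model.
    """
    ones, n, future_ones = successes, pulls, 0
    for _ in range(horizon):
        p = (a + ones) / (a + b + n)
        y = 1 if rng.random() < p else 0
        ones += y
        n += 1
        future_ones += y
    return future_ones / horizon


def run_bandit(true_means, rounds, horizon, seed=0):
    """Pull arms by comparing imputed means; return the pull count per arm."""
    rng = random.Random(seed)
    arms = range(len(true_means))
    successes = [0] * len(true_means)
    pulls = [0] * len(true_means)
    for _ in range(rounds):
        imputed = [impute_arm_mean(successes[k], pulls[k], horizon, rng) for k in arms]
        arm = max(arms, key=lambda k: imputed[k])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(run_bandit(true_means=[0.2, 0.8], rounds=300, horizon=50))
```

Early on the imputed means vary widely across rounds, so both arms get explored; as real outcomes accumulate, the imputations concentrate and the better arm is pulled more often.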

Continual Learning Adaptive Experimentation

Active Exploration via Autoregressive Generation of Missing Data

Tiffany Cai, Hongseok Namkoong, Daniel Russo, and Kelly Zhang

Selected for presentation at the Econometric Society Interdisciplinary Frontiers: Economics and AI+ML conference


@article{CaiNaRuZh25,
  title = {Active Exploration via Autoregressive Generation of Missing Data},
  author = {Cai, Tiffany and Namkoong, Hongseok and Russo, Daniel and Zhang, Kelly},
  journal = {arXiv:2405.19466 [cs.LG]},
  year = {2025},
  note = {Selected for presentation at the Econometric Society
                    Interdisciplinary Frontiers: Economics and AI+ML
                    conference},
}

AI-based service systems

Recent advances in AI present significant opportunities to rethink the design of service systems with AI at the forefront. Endogeneity presents a key intellectual challenge to managing congestion. Prediction is never the goal, but the link between predictive performance and downstream decision-making performance is not straightforward: prioritizing a job based on predictions impacts the delay of other jobs!

Example of a service system based on state-of-the-art AI models: large-scale content moderation systems in online platforms. AI models help human reviewers prioritize the violating content most likely to go viral.

We crystallize how classical tools from queueing theory provide managerial insights into the design and operation of AI-based service systems: i) simple policies with heavy traffic optimality guarantees, ii) a novel model selection procedure for prediction models with downstream queueing performance as a central concern, and iii) AI-based triage that trades off predictive performance, hiring costs, and congestion costs.

Queueing

Design and Scheduling of an AI-based Queueing System

Jiung Lee, Hongseok Namkoong, and Yibo Zeng

Major revision in Management Science; Selected for presentation at SIG day 2025


Recent advances in AI present significant opportunities to rethink the design of service systems with AI at the forefront. Even in the era of LLMs, managing a workforce of human agents (“servers”) is a critical problem. Crowdsourcing workers are vital for aligning LLMs with human values (e.g., ChatGPT) and in many domains, the cost of human annotation is a binding constraint (e.g., medical diagnosis from radiologists). This work models and analyzes modern service systems involving human reviewers and state-of-the-art AI models. A key intellectual challenge in managing congestion within such service systems is endogeneity. Prediction is never the goal, and the link between predictive performance and downstream decision-making performance is not straightforward due to endogeneity. Our work crystallizes how classical tools from queueing theory provide managerial insights into the design of AI-based service systems.
@article{LeeNaZe24,
  title = {Design and Scheduling of an AI-based Queueing System},
  author = {Lee, Jiung and Namkoong, Hongseok and Zeng, Yibo},
  year = {2024},
  journal = {arXiv:2406.06855 [math.OC]},
  note = {Major revision in Management Science; Selected for presentation at SIG day 2025}
}

Robust causality

Off-policy methods can learn sequential decision policies using the rich reservoir of previously collected (non-experimental / observational) data. While prediction models can be easily evaluated on previously collected data, assessing decision-making performance requires counterfactual reasoning. Traditional modeling assumptions that allow adjusting prediction models to learn counterfactuals rarely hold in practice. The growth in the nominal volume of data is no panacea: observed data typically only covers a portion of the state-action space, posing challenges in counterfactual learning. Concomitant to unseen data sparsity, shifts in the data distribution are common. Observed decisions depend on unrecorded confounders, and learning good policies requires causal reasoning. Marginalized demographic groups are severely underrepresented; for example, among 10000+ cancer clinical trials the National Cancer Institute funds, fewer than 5% of participants were non-white.

Our existing statistical language falls woefully short as it relies on unverifiable (and often false) assumptions, and we lack diagnostics that can identify failure modes. We develop data analysis tools that can guarantee robust scientific findings and perhaps more importantly, fail in expected ways by highlighting the fundamental epistemic uncertainty in the data.

External validity

While large-scale randomized studies offer a “gold standard” for internal validity, their external validity can be called into question over spatiotemporal changes in the population, particularly when the treatment effect is heterogeneous across the population. To assess and improve external validity, we develop sensitivity analysis frameworks that allow researchers to assess the extent to which existing experiments inform the treatment effect in a new target site and quantify an expected range of the policy effect for each new site.
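One way to read "worst-case subpopulation treatment effect" is as the most extreme average effect over any subpopulation that makes up at least a fraction alpha of the study population. The sketch below computes that quantity for a list of individual-level effect estimates using the standard tail-average (CVaR) dual formula. The inputs, function names, and the plug-in use of estimated effects are assumptions for illustration; this is not the estimator or inference procedure from the paper.

```python
def largest_subpopulation_mean(tau, alpha):
    """Largest average of tau over any subpopulation of mass at least alpha.

    Uses the dual form  min_eta { eta + E[(tau - eta)_+] / alpha },
    whose minimum over an empirical sample is attained at a sample point.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(tau)
    return min(
        eta + sum(max(t - eta, 0.0) for t in tau) / (alpha * n)
        for eta in tau
    )


def subpopulation_effect_range(tau, alpha):
    """Range of average effects across all subpopulations of mass >= alpha."""
    upper = largest_subpopulation_mean(tau, alpha)
    lower = -largest_subpopulation_mean([-t for t in tau], alpha)
    return lower, upper


if __name__ == "__main__":
    effects = [1.0, 2.0, 3.0, 4.0]  # made-up individual effect estimates
    print(subpopulation_effect_range(effects, alpha=0.5))
    print(subpopulation_effect_range(effects, alpha=1.0))
```

Smaller values of alpha allow more extreme subpopulations and therefore widen the range; at alpha = 1 the range collapses to the overall average effect.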

Causal Inference Trustworthy AI Deployed in Practice

Assessing External Validity via Worst-case Subpopulation Treatment Effects

Sookyo Jeong and Hongseok Namkoong

Short version appeared in Conference on Learning Theory 2020; Major revision in Management Science


@article{JeongNa22,
  title = {Assessing External Validity via Worst-case Subpopulation Treatment Effects},
  author = {Jeong, Sookyo and Namkoong, Hongseok},
  journal = {arXiv:2007.02411 [stat.ML]},
  year = {2022},
  note = {Short version appeared in Conference on Learning Theory 2020; Major revision in Management Science},
}

Unobserved confounding

Off-policy methods can learn decision policies using the rich reservoir of previously collected (observational) data. A universal assumption that enables counterfactual reasoning requires that observed decisions do not depend on any unrecorded confounders that simultaneously affect future states/rewards. This condition is frequently violated in medicine, e-commerce, and public policy, e.g., emergency department patients often do not have an existing record in the hospital’s electronic health system, leaving essential patient-specific information unobserved in subsequent counterfactual analysis.

In the presence of unobserved confounding, even with large samples, it is impossible to precisely estimate the performance of the evaluation policy. To guard against spurious counterfactual evaluations, we propose a worst-case approach where we first posit a realistic notion of bounded unobserved confounding that limits the influence of unrecorded variables on observed decisions and develop corresponding worst-case bounds on the reward.

Causal Inference Trustworthy AI

Bounds on the Conditional and Average Treatment Effect with Unobserved Confounding Factors

Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian

Annals of Statistics, 2022


@article{YadlowskyNaBaDuTi22,
  title = {Bounds on the Conditional and Average Treatment Effect 
                    with Unobserved Confounding Factors},
  author = {Yadlowsky, Steve and Namkoong, Hongseok and Basu, Sanjay and Duchi, John and Tian, Lu},
  journal = {Annals of Statistics},
  volume = {50},
  number = {5},
  pages = {2587--2615},
  year = {2022},
  url = {https://projecteuclid.org/journals/annals-of-statistics/volume-50/issue-5/Bounds-on-the-conditional-and-average-treatment-effect-with-unobserved/10.1214/22-AOS2195.full},
  slide = {YadlowskyNaBaDuTi22-slides.pdf}
}

Research philosophy

While theoretical insights can provide invaluable principles, their successful operationalization requires recognizing and internalizing the limitations of crude approximations and unverifiable assumptions we put in place for mathematical convenience. My group’s research methodology aims to connect two disparate yet complementary worldviews:

computational tools and mathematical insights from statistical learning, optimization, applied probability, and causal inference
rigorous empirical benchmarking practices arising from the AI research community’s data-centric approach.

We take inspiration from Von Neumann’s perspective on mathematical sciences as paraphrased below:

As a mathematical discipline travels far from its empirical source only indirectly inspired from ideas coming from 'reality', it is beset with grave dangers that it will develop along the line of least resistance and become more and more purely aestheticizing. This need not be bad if the discipline is under the influence of researchers with an exceptionally well-developed taste, but the only general remedy is the rejuvenating return to the source: the reinjection of directly empirical ideas. I am convinced that this is a necessary condition to conserve the freshness and the vitality of the subject, and that this will remain so in the future.

Our methodological research is grounded in theoretical principles, but we do not view aesthetic mathematical results as the goal of our impact-driven agenda. We interweave empirical ideas in our algorithmic research, and recognize empirical rigour as a core part of the scientific method (induction). Correspondingly, we are passionate about building empirical foundations for the research community, a perspective we develop further in our recent work on benchmarking in operations research.

Queueing Benchmarks

QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers

Haozhe Chen, Ang Li, Ethan Che, Tianyi Peng, Jing Dong, and Hongseok Namkoong

NeurIPS 2024


@inproceedings{ChenLiChPeDoNa24,
  title = {QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers},
  author = {Chen, Haozhe and Li, Ang and Che, Ethan and Peng, Tianyi and Dong, Jing and Namkoong, Hongseok},
  booktitle = {Advances in Neural Information Processing Systems 37, Datasets and Benchmark Track},
  year = {2024},
}

Adaptive Experimentation Benchmarks

AExGym: Benchmarks and Environments for Adaptive Experimentation

Jimmy Wang, Ethan Che, Daniel Jiang, and Hongseok Namkoong


@article{WangChJiNa24,
  title = {AExGym: Benchmarks and Environments for Adaptive Experimentation},
  author = {Wang, Jimmy and Che, Ethan and Jiang, Daniel and Namkoong, Hongseok},
  journal = {arXiv:2408.04531 [cs.LG]},
  year = {2024},
}

Continual Learning Benchmarks

PersonalLLM: Tailoring LLMs to Individual Preferences

Thomas Zollo*, Andrew Siah*, Naimeng Ye, Ang Li, and Hongseok Namkoong

ICLR 2025


@inproceedings{ZolloSiYeLiNa24,
  title = {PersonalLLM: Tailoring LLMs to Individual Preferences},
  author = {Zollo$*$, Thomas and Siah$*$, Andrew and Ye, Naimeng and Li, Ang and Namkoong, Hongseok},
  booktitle = {Proceedings of the Thirteenth International Conference on 
  Learning Representations},
  year = {2025},
}

Benchmarks

Empirical Rigor Through Benchmarking in Operations Research

Jing Dong, Daksh Mittal, and Hongseok Namkoong

SSRN preprint, 2026


@article{DongMiNa26,
  title = {Empirical Rigor Through Benchmarking in Operations Research},
  author = {Dong, Jing and Mittal, Daksh and Namkoong, Hongseok},
  journal = {SSRN preprint},
  year = {2026},
  url = {https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6974119},
}