Aaron J. Li

Cambridge, MA, 02138

I am an incoming CS PhD student at UC Berkeley affiliated with BAIR, and I will be coadvised by Prof. Bin Yu and Prof. Ion Stoica. I recently completed my Master’s degree in Computational Science and Engineering at Harvard University, where I was fortunate to be advised by Prof. Hima Lakkaraju. Prior to that, I earned my Bachelor’s degree from UC Berkeley, double majoring in Computer Science and Psychology through the EECS Honors Program.

I’ve been working on the intersections of Trustworthy Machine Learning, LLMs, and Mechanistic Interpretability. Moving forward, I’m also broadly interested in LLM evaluation and alignment. The two overarching research questions I aim to address are:

(1) How could we obtain reliable interpretations of learning dynamics and explanations of observed model behaviors?

(2) How could we leverage such actionable insights to improve next-generation foundation models?

selected publications

ICLR
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Aaron J. Li, Satyapriya Krishna, and Himabindu Lakkaraju

In Proceedings of the International Conference on Learning Representations, 2025

Oral Presentation, Top 1.8%

Abs arXiv Bib

The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice, Reinforcement Learning From Human Feedback (RLHF) has been widely used to align LLMs with labeled human preferences, but its assumed effect on model trustworthiness hasn’t been rigorously evaluated. To bridge this knowledge gap, this study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. Our results demonstrate that RLHF on human preferences doesn’t automatically guarantee trustworthiness, and reverse effects are often observed. Furthermore, we propose to adapt efficient influence function based data attribution methods to the RLHF setting to better understand the influence of fine-tuning data on individual trustworthiness benchmarks, and show its feasibility by providing our estimated attribution scores. Together, our results underscore the need for more nuanced approaches for model alignment from both the data and framework perspectives, and we hope this research will guide the community towards developing language models that are increasingly capable without sacrificing trustworthiness.
@inproceedings{li2024more, title = {More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness}, author = {Li, Aaron J. and Krishna, Satyapriya and Lakkaraju, Himabindu}, booktitle = {Proceedings of the International Conference on Learning Representations}, year = {2025}, publisher = {ICLR}, venue = {ICLR 2025}, note = {<span style="color:red;"> Oral Presentation, Top 1.8% </span>}, }
ICML
Improving Prototypical Visual Explanations with Reward Reweighing, Reselection, and Retraining

Aaron J. Li, Robin Netzorg, Zhihan Cheng, Zhuoqin Zhang, and Bin Yu

In Proceedings of the International Conference on Machine Learning, 2024

Abs arXiv Bib

In recent years, work has gone into developing deep interpretable methods for image classification that clearly attributes a model’s output to specific features of the data. One such of these methods is the Prototypical Part Network (ProtoPNet), which attempts to classify images based on meaningful parts of the input. While this architecture is able to produce visually interpretable classifications, it often learns to classify based on parts of the image that are not semantically meaningful. To address this problem, we propose the Reward Reweighing, Reselecting, and Retraining (R3) post-processing framework, which performs three additional corrective updates to a pretrained ProtoPNet in an offline and efficient manner. The first two steps involve learning a reward model based on collected human feedback and then aligning the prototypes with human preferences. The final step is retraining, which realigns the base features and the classifier layer of the original model with the updated prototypes. We find that our R3 framework consistently improves both the interpretability and the predictive accuracy of ProtoPNet and its variants.
@inproceedings{li2023improving, title = {Improving Prototypical Visual Explanations with Reward Reweighing, Reselection, and Retraining}, author = {Li, Aaron J. and Netzorg, Robin and Cheng, Zhihan and Zhang, Zhuoqin and Yu, Bin}, booktitle = {Proceedings of the International Conference on Machine Learning}, year = {2024}, publisher = {PMLR}, venue = {ICML 2024}, }