Aaron J. Li

Cambridge, MA, 02138

I am an incoming CS PhD student at UC Berkeley affiliated with BAIR, where I will be co-advised by Prof. Bin Yu and Prof. Ion Stoica. I recently completed my Master's degree in Computational Science and Engineering at Harvard University, where I was fortunate to be advised by Prof. Hima Lakkaraju. Prior to that, I earned my Bachelor's degree from UC Berkeley, double majoring in Computer Science and Psychology through the EECS Honors Program.

I have been working at the intersection of Trustworthy Machine Learning, LLMs, and Mechanistic Interpretability. Moving forward, I am also broadly interested in LLM evaluation and alignment. The two overarching research questions I aim to address are:

(1) How can we obtain reliable interpretations of learning dynamics and explanations of observed model behaviors?

(2) How can we leverage such actionable insights to improve next-generation foundation models?

selected publications

  1. ICLR
    More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
    Aaron J. Li, Satyapriya Krishna, and Himabindu Lakkaraju
    In Proceedings of the International Conference on Learning Representations, 2025
    Oral Presentation, Top 1.8%
  2. ICML
    Improving Prototypical Visual Explanations with Reward Reweighing, Reselection, and Retraining
    Aaron J. Li, Robin Netzorg, Zhihan Cheng, Zhuoqin Zhang, and Bin Yu
    In Proceedings of the International Conference on Machine Learning, 2024