publications

publications by category in reverse chronological order. generated by jekyll-scholar.

2026

  1. EACL
    Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
    Aaron J. Li, Suraj Srinivas, Usha Bhalla, and Himabindu Lakkaraju
    In Proceedings of the European Chapter of the Association for Computational Linguistics, 2026

2025

  1. ICLR
    More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
    Aaron J. Li, Satyapriya Krishna, and Himabindu Lakkaraju
    In Proceedings of the International Conference on Learning Representations, 2025
    Oral Presentation, Top 1.8%

2024

  1. ICML
    Improving Prototypical Visual Explanations with Reward Reweighing, Reselection, and Retraining
    Aaron J. Li, Robin Netzorg, Zhihan Cheng, Zhuoqin Zhang, and Bin Yu
    In Proceedings of the International Conference on Machine Learning, 2024
  2. COLM
    Certifying LLM Safety against Adversarial Prompting
    Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron J. Li, Soheil Feizi, and Himabindu Lakkaraju
    In Conference on Language Modeling, 2024