Papers

✨ Preprints

Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao

Computer-use agent safety usually screens prompts. OS-BLIND shows even benign instructions can produce harm via task context or execution; safety alignment only activates in the first few steps, so even Claude 4.5 Sonnet hits 92.7% attack success.

Paper | agents | trustworthiness

Yongkang Du, Jieyu Zhao, Yijun Yang, Tianyi Zhou

Fairness-accuracy trade-offs usually hand you one operating point. CPT lets users specify a preferred balance via reference vectors, navigating the Pareto front with stabilized fairness updates and gradient pruning.

Paper | trustworthiness

Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Rahul Gupta, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao

Computer-use agents leave a video trail of their work. We turn that trail into reward signal — rating trajectories at the pixel level instead of grading only the final outcome.

Lincan Li, Jiaqi Li, Catherine Chen, Fred Gui, Hongjia Yang, Chenxiao Yu, Zhengguang Wang, Jianing Cai, Junlong Aaron Zhou, Bolin Shen, et al.

Survey of how LLMs are being applied in political science — what they can credibly contribute, and the methodological pitfalls researchers keep stepping into.

Paper | evaluation | trustworthiness

2026

Taiwei Shi, Zhuoer Wang, Longqi Yang, Ying-Chun Lin, Zexue He, Mengting Wan, Pei Zhou, Sujay Jauhar, Sihao Chen, Shan Xia, Hongfei Zhang, Jieyu Zhao, Xiaofeng Xu, Xia Song, Jennifer Neville

Align LLMs with the messy, in-the-moment feedback users actually leave during real interactions — not the curated preference pairs that look clean on paper.

Paper | alignment

Experiential Reinforcement Learning

Lifelong Agent Workshop @ ICLR 2026

Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, Jieyu Zhao

Treat past episodes as a retrievable experience bank rather than only as a source of gradient updates, letting the policy reason about what worked last time before deciding what to try next.

Jiazhi Li, Mi Zhou, Mahyar Khayatkhoei, Jingyu Shi, Xiang Gao, Jiageng Zhu, Hanchen Xie, Xiyun Song, Zongfang Lin, Heather Yu, Liang Peng, Jieyu Zhao

Text-to-image models trade diversity for fidelity by default. We show you can recover demographic and conceptual diversity without sacrificing how good the images look.

trustworthiness

2025

Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah Smith, Chiyuan Zhang

Six-axis evaluation for machine unlearning in language models — testing whether claimed unlearning actually erases the knowledge or just hides it from a few probes.

Paper | alignment | evaluation

2024

Yubo Zhang, Shudi Hou, Mingyu Derek Ma, Wei Wang, Muhao Chen, Jieyu Zhao

Benchmark probing whether LLM medical advice changes with patient demographics — the bias most likely to cause real-world harm if deployed in clinical workflows.

Paper | trustworthiness | evaluation

Md Khairul Islam, Andrew Wang, Tianhao Wang, Yangfeng Ji, Judy Fox, Jieyu Zhao

Differential privacy and bias mitigation sometimes conflict. We measure exactly when DP training makes downstream bias worse — and which protected groups bear the cost.

trustworthiness

Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Fabbri, Junru Liu, Ryo Kamoi, Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, Kathleen McKeown, Rui Zhang

Summarizers compress text — and often compress out minority viewpoints. We measure that loss and propose summaries that preserve the spread of perspectives in the source.

Paper | trustworthiness

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, ..., Jieyu Zhao, ..., Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, Yue Zhao

A consortium-scale benchmark for trustworthiness in LLMs — truthfulness, safety, fairness, robustness, privacy, and ethics — with consistent protocols across 16 mainstream models.

Paper | trustworthiness | evaluation

2023

2022

Sunipa Dev, Emily Sheng, Jieyu Zhao, Aubrie Amstutz, Jiao Sun, Yu Hou, Mattie Sanseverino, Jiin Kim, Akihiro Nishi, Nanyun Peng, Kai-Wei Chang

An audit of how the NLP community measures bias and harm — what each metric actually captures, what it misses, and where the field has been talking past itself.

Paper | trustworthiness | evaluation

2021

2020

Zuohui Fu*, Yikun Xian*, Ruoyuan Gao, Jieyu Zhao, Qiaoying Huang, Yingqiang Ge, Shuyuan Xu, Shijie Geng, Chirag Shah, Yongfeng Zhang, Gerard de Melo

Explainable recommendation can quietly bake in unfair preferences. We modify the KG-walking objective so the explanation is fair, not just plausible.

Paper | trustworthiness

Andrew Gaut, Tony Sun, Shirlyn Tang, Yuxin Huang, Jing Qian, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, William Yang Wang

Relation extractors learn that some relations belong to some genders — a problem when the same model is then used to build knowledge graphs.

trustworthiness

2019

Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, Kai-Wei Chang

Bias measures developed for English don't transfer cleanly to languages with grammatical gender. We adapt the measurement and find the bias is still there, just shaped differently.

Paper | trustworthiness

2018

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, Kai-Wei Chang

GN-GloVe: train word embeddings whose gender dimension is isolated by construction, so downstream models can use the semantics without inheriting the gender association.

Paper | Code | trustworthiness

2017