BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang • @arXiv • 10 July 2023

TLDR: The BeaverTails dataset is introduced, aimed at fostering research on safety alignment in large language models (LLMs) and providing vital resources for the community, contributing towards the safe development and deployment of LLMs.

Citations: 128

Abstract: In this paper, we introduce the \textsc{BeaverTails} dataset, aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, we have gathered safety meta-labels for 30,207 question-answer (QA) pairs and 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics. In total, we have gathered safety meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert comparison data for both the helpfulness and harmlessness metrics. We further showcase applications of BeaverTails in content moderation and reinforcement learning with human feedback (RLHF), emphasizing its potential for practical safety measures in LLMs. We believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of LLMs. Our project page is available at the following URL: https://sites.google.com/view/pku-beavertails. Warning: this paper contains example data that may be offensive or harmful.

Related Fields of Study

128 Citations No References

Citations

Sort by

Showing results 1 to 0 of 0

References

Sort by

Showing results 1 to 0 of 0