Publication:
Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations
Qingyu Chen, Jingcheng Du, Yan Hu, V. Keloth, Xueqing Peng, Kalpana Raja, Rui Zhang, Zhiyong Lu, Hua Xu • arXiv • 10 May 2023
TLDR: This pilot study establishes the baseline performance of GPT-3 and GPT-4 in both zero-shot and one-shot settings on eight BioNLP datasets across four applications, examines the errors produced by the LLMs and categorizes them into three types (missingness, inconsistencies, and unwanted artificial content), and provides suggestions for using LLMs in BioNLP applications.
Citations: 31
Abstract: Biomedical literature is growing rapidly, making it challenging to curate and extract knowledge manually. Biomedical natural language processing (BioNLP) techniques that can automatically extract information from biomedical literature help alleviate this burden. Recently, large language models (LLMs), such as GPT-3 and GPT-4, have gained significant attention for their impressive performance. However, their effectiveness in BioNLP tasks and impact on method development and downstream users remain understudied. This pilot study (1) establishes the baseline performance of GPT-3 and GPT-4 in both zero-shot and one-shot settings on eight BioNLP datasets across four applications: named entity recognition, relation extraction, multi-label document classification, and semantic similarity and reasoning, (2) examines the errors produced by the LLMs and categorizes the errors into three types: missingness, inconsistencies, and unwanted artificial content, and (3) provides suggestions for using LLMs in BioNLP applications. We make the datasets, baselines, and results publicly available to the community via https://github.com/qingyu-qc/gpt_bionlp_benchmark.
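
The benchmark's core protocol, prompting a general-purpose LLM in a zero-shot or one-shot setting on an extraction task, can be sketched in a few lines. Below is a minimal illustration assuming the openai Python client (v1+); the model name, prompt wording, and disease-NER example are hypothetical stand-ins rather than the paper's actual prompts, which are available in the linked repository.

    # Minimal sketch of zero-shot vs. one-shot prompting for a BioNLP NER task.
    # Model name, instruction text, and examples are illustrative assumptions,
    # not the study's actual prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    TASK = (
        "Extract all disease mentions from the sentence below. "
        "Return them as a comma-separated list, or 'None' if there are none."
    )

    # One-shot setting: a single worked example is prepended to the instruction.
    ONE_SHOT_EXAMPLE = (
        "Sentence: Mutations in BRCA1 are associated with hereditary breast cancer.\n"
        "Diseases: hereditary breast cancer"
    )

    def extract_diseases(sentence: str, one_shot: bool = False) -> str:
        """Query the model in a zero-shot (default) or one-shot setting."""
        parts = [TASK]
        if one_shot:
            parts.append(ONE_SHOT_EXAMPLE)
        parts.append(f"Sentence: {sentence}\nDiseases:")
        response = client.chat.completions.create(
            model="gpt-4",  # assumed; the study evaluates GPT-3 and GPT-4
            messages=[{"role": "user", "content": "\n\n".join(parts)}],
            temperature=0,  # deterministic outputs for benchmarking
        )
        return response.choices[0].message.content.strip()

    if __name__ == "__main__":
        print(extract_diseases(
            "The patient was diagnosed with type 2 diabetes and hypertension.",
            one_shot=True,
        ))

The zero-shot setting simply omits the worked example; temperature 0 is the usual choice in benchmark runs so that outputs are reproducible across evaluations.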