RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports

Sarvesh Soni, Meghana Gudala, A. Pajouhi, Kirk Roberts • International Conference on Language Resources and Evaluation • 01 January 2022

TLDR: A thorough analysis of the proposed RadQA dataset is conducted, examining the broad categories of disagreement in annotation and the reasoning requirements to answer a question (uncovering the huge dependence on medical knowledge for answering the questions).

Citations: 9

Abstract: We present a radiology question answering dataset, RadQA, with 3074 questions posed against radiology reports and annotated with their corresponding answer spans (resulting in a total of 6148 question-answer evidence pairs) by physicians. The questions are manually created using the clinical referral section of the reports; they take into account the actual information needs of ordering physicians and eliminate bias from seeing the answer context (and, further, organically create unanswerable questions). The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy the complex clinical requirements by including complete (yet concise) answer phrases (which are not just entities) that can span multiple lines. We conduct a thorough analysis of the proposed dataset by examining the broad categories of disagreement in annotation (providing insights on the errors made by humans) and the reasoning requirements to answer a question (uncovering the huge dependence on medical knowledge for answering the questions). The advanced transformer language models achieve a best F1 score of 63.55 on the test set, whereas the best human performance is 90.31 (with an average of 84.52). This demonstrates the challenging nature of RadQA, which leaves ample scope for future methods research.
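
The F1 scores quoted in the abstract compare a predicted answer span against an annotated one. The abstract does not spell out the exact scoring procedure, so the following is only a minimal sketch that assumes a SQuAD-style token-overlap F1 with lowercasing and whitespace tokenization; the function name `token_f1` and the example strings are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span.

    Assumption: lowercased whitespace tokens, as in SQuAD-style scoring.
    The scorer actually used for RadQA may normalize text differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Unanswerable questions: both spans empty counts as a match,
    # exactly one empty span counts as a miss.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical spans, for illustration only.
    print(token_f1("small pleural effusion", "pleural effusion"))
```

Under this assumption, a prediction that includes the full reference plus extra words keeps perfect recall but loses precision, which is one reason multi-line answer phrases are harder to score highly than single-entity answers.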
