MultiFin: A Dataset for Multilingual Financial NLP

R. Jørgensen, Oliver Brandt, Mareike Hartmann, Xiang Dai, C. Igel, Desmond Elliott • Findings of the Association for Computational Linguistics • 01 January 2023

TLDR: This work describes MultiFin – a publicly available financial dataset of real-world article headlines covering 15 languages across different writing systems and language families, annotated using both ‘label by native-speaker’ and ‘translate-then-label’ approaches.

Citations: 9

Abstract: Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MultiFin – a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
