Publication:
Creating an Extended Named Entity Dictionary from Wikipedia
Ryuichiro Higashinaka, Kugatsu Sadamitsu, Kuniko Saito, Toshiro Makino, Y. Matsuo • International Conference on Computational Linguistics • 01 December 2012
TLDR: The authors devise an extensive list of features for accurately classifying Wikipedia titles into ENE types, drawing on the surface string of a title, the content of the article, and the metadata provided with Wikipedia.
Citations: 19
Abstract: Automatic methods to create entity dictionaries or gazetteers have used only a small number of entity types (18 at maximum), which could pose a limitation for fine-grained information extraction. This paper aims to create a dictionary of 200 extended named entity (ENE) types. Using Wikipedia as a basic resource, we classify Wikipedia titles into ENE types to create an ENE dictionary. In our method, we derive a large number of features for Wikipedia titles and train a multiclass classifier by supervised learning. We devise an extensive list of features for accurate classification into the ENE types, such as those related to the surface string of a title, the content of the article, and the metadata provided with Wikipedia. Our experiments show that Wikipedia titles can be classified into ENE types with 79.63% accuracy. We applied our classifier to all Wikipedia titles and, by discarding low-confidence classification results, created an ENE dictionary of over one million entities covering 182 ENE types with an estimated accuracy of 89.48%. This is the first large-scale ENE dictionary.
Japanese title: Wikipediaを用いた拡張固有表現辞書の構築 (Construction of an Extended Named Entity Dictionary Using Wikipedia)
Japanese abstract (translated): Conventional named entity dictionaries have used only a small number of named entity types (18 at most), which makes them difficult to apply to pinpoint information extraction. This paper therefore aims to build a named entity dictionary that uses 200 extended named entity types. Specifically, we build the dictionary by classifying Wikipedia headwords into extended named entity types with a multiclass classifier trained by supervised learning. As features, we enumerate and use many that relate to the headword itself, the article body, and metadata such as categories. We found that headwords can be classified into extended named entity types with 79.63% accuracy. By applying the trained multiclass classifier to all Wikipedia headwords and excluding low-confidence classification results, we built an extended named entity dictionary of more than one million entries that covers 182 extended named entity types, with an estimated classification accuracy of 89.48%. This is the first large-scale extended named entity dictionary.
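The pipeline the abstract describes (derive features from a title, its article, and its metadata; classify; discard low-confidence results) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, the `Prediction` record, the `build_dictionary` helper, the example ENE type labels, and the 0.8 threshold are all assumptions made for illustration, and the classifier itself is left out.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Prediction(NamedTuple):
    """Hypothetical classifier output for one Wikipedia title."""
    title: str
    ene_type: str
    confidence: float


def extract_features(title: str, body: str, categories: Iterable[str]) -> Set[str]:
    """Build illustrative features from the three sources named in the abstract."""
    feats: Set[str] = set()
    # Surface string of the title: last two characters and lowercase tokens.
    feats.add(f"title_suffix={title[-2:]}")
    for tok in title.split():
        feats.add(f"title_tok={tok.lower()}")
    # Article content: the first few tokens of the body text.
    for tok in body.split()[:10]:
        feats.add(f"body_tok={tok.lower().strip('.,')}")
    # Metadata: category labels attached to the article.
    for cat in categories:
        feats.add(f"category={cat.lower()}")
    return feats


def build_dictionary(predictions: Iterable[Prediction], threshold: float) -> Dict[str, str]:
    """Keep only predictions whose confidence reaches the threshold."""
    return {p.title: p.ene_type for p in predictions if p.confidence >= threshold}


if __name__ == "__main__":
    preds = [
        Prediction("Tokyo", "City", 0.95),      # assumed scores, not real data
        Prediction("Mercury", "Planet", 0.40),  # ambiguous title, low confidence
    ]
    print(build_dictionary(preds, threshold=0.8))
```

Raising the threshold trades coverage for accuracy, which matches the abstract's report that discarding low-confidence results raised the estimated accuracy while the resulting dictionary covered 182 of the 200 types.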