0
喜欢
0
书签
声明论文
Disambiguating authors in academic publications using random forests   
摘  要:   Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same names, either as full names or as initials and last names (complete name change examples are not considered here). In such a case, the user would like the digital library to differentiate among these authors. Name disambiguation can help in many cases; one being a user in a search of all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing a more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pair-wise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of author last name and the middle name's similarity alone achieves an accuracy of almost 90%.
发  表:   ACM/IEEE Joint Conference on Digital Libraries  2009

论文统计图
共享有5个版本
 [展开全部版本] 

Bibtex
创新指数 
阅读指数 
重现指数 
论文点评
还没有人点评哦

Feedback
Feedback
Feedback
我想反馈:
排行榜