Publications

Wednesday 9, 2016

Tarek Kanan and Edward A. Fox. “Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy”. Journal of the Association for Information Science and Technology, Impact Factor 2.28. Wiley, USA. DOI: 10.1002/asi.23609

Abstract:

Arabic news articles in electronic collections are difficult to study. Browsing by category is rarely supported. Although helpful machine-learning methods have been applied successfully to similar situations for English news articles, limited research has been completed to yield suitable solutions for Arabic news. In connection with a Qatar National Research Fund (QNRF)-funded project to build digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237,000 Arabic news articles, which should be applicable to other Arabic news collections. We designed a simple taxonomy for Arabic news stories that is suitable for the needs of Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was enhanced with the aid of a librarian expert as well as five Arabic-speaking volunteers. We developed tailored stemming (i.e., a new Arabic light stemmer called P-Stemmer) and automatic classification methods (the best being binary Support Vector Machines classifiers) to work with the taxonomy. Using evaluation techniques commonly used in the information retrieval community, including 10-fold cross-validation and the Wilcoxon signed-rank test, we showed that our approach to stemming and classification is superior to state-of-the-art techniques.

Keywords: Natural language processing; Information retrieval; Digital libraries

Kanan, Tarek; Ayoub, Souleiman; Al-Dahoud, Ali; Kanaan, Ghassan; Fox, Edward. “Extracting Named Entities Using Named Entity Recognizer for Arabic News Articles”. Proceedings of the International Computer Science and Informatics Conference (ICSIC 2016). Amman Arab University, Amman, Jordan. January 2016.

Abstract:

This paper describes how to extract named entities and topics from Arabic news articles. High-quality tools for Named Entity Recognition (NER) are lacking for Arabic; therefore, the authors have built an Arabic NER tool (RenA). NER involves extracting information and identifying entity types, such as name, organization, and location. Effective NER tools exist for English, but they are not directly applicable to Arabic. As a result, a new method and tool (i.e., RenA) have been developed. For evaluation purposes, a baseline corpus was built for assessment and comparison with other methods and tools, with help from volunteer graduate students who understand Arabic. RenA produces good results, with accurate Name, Organization, and Location extraction from news articles collected from online resources. A comparison of the RenA results with those of a popular Arabic NER tool showed a noticeable improvement.

Keywords: Arabic language; Named entity recognizer; Natural language processing


Tarek Ghazi Qwaider Kanan
