
Classifying XML documents by using genre features

  • Malcolm Clark
  • Stuart Watt

Research output: Chapter

4 Citations (Scopus)

Abstract

The categorization of documents is traditionally topic-based. This paper presents a complementary analysis of research and experiments on genre to show that encouraging results can be obtained by using genre structure (form) features. We conducted an experiment to assess the effectiveness of using extensible markup language (XML) tag information and part-of-speech (P-O-S) features for the classification of genres, testing the hypothesis that if a focus on genre can lead to high precision on normal textual documents, then good results can be achieved using XML tag information in addition to P-O-S information. The experiment was carried out on a subsection of the Initiative for the Evaluation of XML (INEX) 1.4 collection. The features were extracted and documents were classified using machine learning algorithms, which yielded encouraging results for logistic regression and neural networks. We propose that utilizing these features and training a classifier may benefit retrieval for most World Wide Web (WWW) technologies such as XML and extensible hypertext markup language (XHTML).
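The XML tag features mentioned in the abstract could be computed along the following lines. This is only a minimal sketch, not the authors' implementation: the function name `tag_features`, the choice of relative tag frequencies as the feature representation, and the tag names in the sample document are all assumptions made for illustration. Part-of-speech features and the classifier itself are left out.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_features(xml_text):
    """Return the relative frequency of each element tag in an XML document.

    Hypothetical helper: the abstract does not say how tag information
    was encoded, so relative frequency is an assumption.
    """
    root = ET.fromstring(xml_text)
    # Count every element in the tree, including the root.
    counts = Counter(elem.tag for elem in root.iter())
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Illustrative document; the tag names are made up for this example.
sample = (
    "<article><fm><ti>Title</ti></fm>"
    "<bdy><sec><p>a</p><p>b</p></sec></bdy></article>"
)
features = tag_features(sample)
```

A feature dictionary like this could then be combined with part-of-speech counts and passed to a classifier such as logistic regression, which is one of the algorithms the abstract reports results for.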
Original language: English
Title of host publication: Proceedings - International Workshop on Database and Expert Systems Applications, DEXA
Pages: 242-248
Number of pages: 7
Publication status: Published - 2007

Publication series

Name: Proceedings - International Workshop on Database and Expert Systems Applications, DEXA
