Asian Research Thesis Index

Abstract

The exponential growth of electronic documents, in both proprietary and public information systems, pose new challenges in finding relevant information from these large repositories. Document clustering is a specialized technique that has found its niche in effectively browsing, filtering, managing and summarizing these collections. Document clustering process has three distinct steps: (i) document representation, (ii) computation of pair-wise document similarity, and (iii) application of clustering algorithm. Document clustering methods are very sensitive to document representation schemes. Conventionally, document representations are based on extracting simple features such as terms/n-grams/frequent words/sequences from the documents that can be used as meta-descriptors for documents. These features reduce the dimensionality of the problem but simply fail to capture the semantics of the text in a transformed compact representation. These representations completely ignore the order and relationships among words/features. Documents written in human languages generally contain a context and use of words are mainly dependent on the same context. Motivated by this a novel document representation scheme that first extracts lexical chains from the documents and exploits topic maps structure for the lexical chains is proposed. The scheme takes advantage of lexical cohesion structure along with topic map relationships to get a semantic based representation of document. Topic Maps (TM) is an international standard for codification of knowledge. Moreover, a good similarity measure is essential for the clustering task. The similarity function should make use of semantic relationship among features (lexical topics) to provide a viable clue for relatedness between any pair of documents. A similarity function based on lexical chain similarity and frequent common tree patterns extracted from the topic maps of documents is defined. Hence these patterns (hierarchical lexical topics with different granularity) also inherently capture semantics in similarity calculation. An extensive set of experiments on four publicly available document datasets is performed. The evaluation measures like F-score, purity and entropy clearly established that the proposed approach is better than traditional document clustering approaches.

Add/Update Thesis