Add/Update Thesis
Title*
Author's Name*
Supervisor's Name
Abstract
The exponential growth of electronic documents, in both proprietary and public information systems, poses new challenges in finding relevant information in these large repositories. Document clustering is a specialized technique that has found its niche in effectively browsing, filtering, managing and summarizing these collections. The document clustering process has three distinct steps: (i) document representation, (ii) computation of pair-wise document similarity, and (iii) application of a clustering algorithm. Document clustering methods are very sensitive to the document representation scheme. Conventionally, document representations are based on extracting simple features such as terms, n-grams, frequent words or sequences from the documents, which can be used as meta-descriptors for the documents. These features reduce the dimensionality of the problem but fail to capture the semantics of the text in the transformed compact representation, and they ignore the order of, and relationships among, words and features. Documents written in human languages generally have a context, and the use of words depends mainly on that context. Motivated by this, a novel document representation scheme is proposed that first extracts lexical chains from the documents and then exploits the topic map structure of those lexical chains. The scheme takes advantage of the lexical cohesion structure along with topic map relationships to obtain a semantics-based representation of each document. Topic Maps (TM) is an international standard for the codification of knowledge. Moreover, a good similarity measure is essential for the clustering task. The similarity function should make use of the semantic relationships among features (lexical topics) to provide a viable clue to the relatedness of any pair of documents. A similarity function based on lexical chain similarity and frequent common tree patterns extracted from the topic maps of documents is defined. Because these patterns are hierarchical lexical topics of different granularity, they inherently capture semantics in the similarity calculation. An extensive set of experiments on four publicly available document datasets is performed. Evaluation measures such as F-score, purity and entropy show that the proposed approach performs better than traditional document clustering approaches.
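For readers unfamiliar with the evaluation measures named in the abstract, the following is a minimal sketch of how purity and entropy are commonly computed for a clustering result. It is not taken from the thesis: the function names, the toy labels, and the choice of base-2 logarithm without normalization are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def _group_by_cluster(cluster_ids, class_labels):
    """Collect the true class labels of the documents in each cluster."""
    groups = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        groups[cluster].append(label)
    return groups


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster.

    Higher is better; 1.0 means every cluster contains a single class.
    """
    groups = _group_by_cluster(cluster_ids, class_labels)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in groups.values()
    )
    return majority_total / len(class_labels)


def entropy(cluster_ids, class_labels):
    """Size-weighted average of the class entropy inside each cluster, in bits.

    Lower is better; 0.0 means every cluster contains a single class.
    Assumption: no normalization by the number of classes is applied.
    """
    groups = _group_by_cluster(cluster_ids, class_labels)
    n = len(class_labels)
    total = 0.0
    for members in groups.values():
        size = len(members)
        h = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * h
    return total


# Hypothetical toy data: six documents, two clusters, two true classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["a", "a", "b", "b", "b", "b"]
toy_purity = purity(toy_clusters, toy_classes)
toy_entropy = entropy(toy_clusters, toy_classes)
```

Some definitions of clustering entropy divide by the logarithm of the number of classes so that the value lies in [0, 1]; the sketch above leaves the value unnormalized.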
Subject/Specialization
Language
Program
Faculty/Department's Name
Institute Name
University Type
Public
Private
Campus (if any)
Institute Affiliation Information (if any)
City where institute is located
Province
Country
Degree Starting Year
Degree Completion Year
Year of Viva Voce Exam
Thesis Completion Year
Thesis Status
Completed
Incomplete
Number of Pages
Urdu Keywords
English Keywords
Link
Select Category
Religious Studies
Social Sciences & Humanities
Science
Technology
Any other information you want to share, such as the table of contents or conclusion.
Your email address*