Asian Research Thesis Index

Abstract

With the emergence of big data analytics over the last decade, the analysis of semi-structured and unstructured data (such as text) has gained importance. Since text (such as customer reviews, newspaper articles, etc.) contains significant business information, text analytics has become increasingly important for predicting, inferring, or analysing information that adds value to a business. In this research, we present a unified approach for intelligent association analysis of text, that is, for measuring how strongly a piece of text is related to a customer or a person. In this dissertation, an approach is presented for authorship attribution in Urdu text, for the purpose of forensic analysis, using an LDA model over word n-grams of authors' texts together with improved sqrt-cosine similarity. The proposed approach uses word n-grams to learn representations of stylometric features and uses them to identify the writing style of a particular author. The LDA-based approach emphasizes instance-based and profile-based classification of an author's text. Here, LDA suitably handles high-dimensional and sparse data by allowing a more expressive representation of text. The presented approach is an unsupervised computational methodology that can handle the heterogeneity of the dataset, the diversity in authors' writing styles, and the inherent ambiguity of the Urdu language. A large corpus has been collected to test the performance of the presented approach. The experimental results show the superiority of the proposed approach over state-of-the-art representations and other algorithms used for authorship attribution.
The contributions of the presented work are manifold: the use of improved sqrt-cosine similarity with LDA topics to measure the similarity between vectors of text documents for forensic analysis; the construction of a large dataset of 6000 column documents; and results of 92% on Urdu columns with fifteen authors and 78.57% on the PAN12 English dataset with fourteen authors, obtained without using any labels for the authorship attribution task.
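
The abstract does not define the similarity measure, so the following is only a minimal sketch. It assumes the commonly cited form of improved sqrt-cosine similarity, ISC(x, y) = sum_i sqrt(x_i * y_i) / (sqrt(sum_i x_i) * sqrt(sum_i y_i)), applied to non-negative LDA topic-proportion vectors. The function names, the author names, and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math


def improved_sqrt_cosine(x, y):
    """Improved sqrt-cosine similarity between two non-negative vectors.

    Assumed definition (not confirmed by the abstract):
        sum(sqrt(x_i * y_i)) / (sqrt(sum(x)) * sqrt(sum(y)))
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("vectors must be non-negative")
    sum_x, sum_y = sum(x), sum(y)
    if sum_x == 0 or sum_y == 0:
        return 0.0  # an all-zero vector has no similarity to anything
    numerator = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return numerator / (math.sqrt(sum_x) * math.sqrt(sum_y))


def attribute(doc_topics, author_profiles):
    """Return the author whose topic profile is most similar to the document.

    Hypothetical profile-based step: each author is summarised by a single
    topic-proportion vector, and the document is assigned to the closest one.
    """
    return max(
        author_profiles,
        key=lambda author: improved_sqrt_cosine(doc_topics, author_profiles[author]),
    )


# Toy topic-proportion vectors (illustrative values only).
profiles = {
    "author_a": [0.7, 0.2, 0.1],
    "author_b": [0.1, 0.2, 0.7],
}
unknown_doc = [0.6, 0.3, 0.1]
predicted = attribute(unknown_doc, profiles)
```

In this sketch no author labels are used to train a classifier; the unknown document is simply matched to the nearest profile. When both vectors are probability distributions that sum to 1, the denominator equals 1 and the score reduces to the sum of sqrt(x_i * y_i).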
