Term Discrimination Based Robust Text Classification With Application to E-Mail Spam Filtering

Thesis Info

Access Option

External Link

Author

Junejo, Khurram Nazir

Program

PhD

Institute

Lahore University of Management Sciences

City

Lahore

Province

Punjab

Country

Pakistan

Thesis Completing Year

2008

Thesis Completion Status

Completed

Subject

Computer Science

Language

English

Link

http://prr.hec.gov.pk/jspui/bitstream/123456789/2242/1/2490S.pdf

Added

2021-02-17 19:49:13

Modified

2024-03-24 20:25:49

ARI ID

1676727846154


The Internet has touched every part of our lives, including our interactions and communications. Printed books are being replaced by electronic books (e-books), personal and official correspondence has shifted to electronic mail (e-mail), and news is now read online. This is generating huge volumes of unstructured textual data that need to be analyzed, filtered, and organized automatically in order to harness their wealth of information for profitable gains. By 2013, it is projected that the worldwide volume of e-mail will reach 507 billion messages per day, of which 89% will be spam [Radicati (2009)]. In 2008, the cost of spam to businesses in terms of hardware, software, and human resources was around $140 billion [Research (2008)].
Content-based text classification can automatically organize text documents into predefined thematic categories. However, text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high-dimensional feature space (easily in the hundreds of thousands of dimensions), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to collect training data from sources different from the target domain, which results in a distribution shift between training and test data. Thirdly, although unlabeled data is easily available, its utilization in practical text classification for improved performance remains a challenge.
One important domain for text classification that embodies these challenges is e-mail spam filtering. A typical e-mail service provider (ESP) caters to thousands to millions of users, each of whom can have their own topical interests and their own preferences regarding spam and non-spam e-mails. Personalized service-side spam filtering provides a solution to this problem; however, for such solutions to be practically usable they must be efficient, scalable, and robust to distribution shifts.
In this thesis, we propose a robust text classification technique that combines local generative models and global discriminative classifiers through the use of discriminative term weighting and linear opinion pooling. Terms in the documents are assigned weights that quantify the discrimination information they provide for one category over the others. These weights, called discriminative term weights (DTW), also serve to partition the terms into two sets. An opinion pooling strategy consolidates the discrimination information of the terms in each set to yield a two-dimensional feature space, in which a discriminant function is learned to categorize the documents. In addition to a supervised technique, we also develop two semi-supervised variants for personalizing the local and global models using unlabeled data. We then generalize our technique into a classifier framework that integrates different feature selection criteria, discriminative term weighting schemes, information pooling strategies, and discriminative classifiers. We provide a theoretical comparison of our proposed framework with existing generative, discriminative, and hybrid classifiers.
Our text classification framework is evaluated with five discriminative term weighting strategies, six opinion consolidation techniques, and four discriminative classifiers. We employ nine real-world datasets from different domains in our experimental evaluation, and the results are compared with four benchmark text classification algorithms via accuracy and AUC values. Our framework is also evaluated under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying classifier size. Scalability of our spam filter is also demonstrated for personalized service-side spam filtering. Statistical significance tests confirm that our technique performs significantly better than the compared techniques in both supervised and semi-supervised settings, and in both global and personalized spam filtering. In particular, it performs remarkably well when the distribution shift between training and test data is high, a phenomenon common in e-mail systems.
Additional contributions of this thesis include a systematic analysis of the spam filtering problem and of the challenges to effective global and personalized spam filtering at the service side. We formally define key characteristics of e-mail classification, such as distribution shift and gray e-mails, and relate them to machine learning problem settings. The concept of term discrimination introduced in this work has also found applications in text clustering, visualization, and feature extraction, and it can be extended to keyword extraction and topic identification from textual documents.
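The pipeline described in this abstract (term weights, a partition of the terms into two sets, pooling into a two-dimensional space, and a linear discriminant) can be illustrated with a minimal Python sketch. Everything specific here is an assumption rather than the thesis's actual formulation: the weight is taken to be a ratio of smoothed class-conditional term probabilities, the pool is an equal-weight average, and the discriminant parameters `slope` and `bias` are set by hand instead of being learned. All function names are hypothetical.

```python
from collections import Counter


def term_probs(docs, vocab, smoothing=1.0):
    """Smoothed fraction of documents in `docs` that contain each term."""
    counts = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return {t: (counts[t] + smoothing) / (n + 2 * smoothing) for t in vocab}


def fit_weights(spam_docs, ham_docs):
    """Assumed weighting: probability ratio, which also partitions the terms."""
    vocab = {t for d in spam_docs + ham_docs for t in d}
    p_spam = term_probs(spam_docs, vocab)
    p_ham = term_probs(ham_docs, vocab)
    spam_w, ham_w = {}, {}
    for t in vocab:
        ratio = p_spam[t] / p_ham[t]
        if ratio > 1:          # term favours the spam category
            spam_w[t] = ratio
        elif ratio < 1:        # term favours the non-spam category
            ham_w[t] = 1 / ratio
    return spam_w, ham_w


def pool(doc, weights):
    """Equal-weight linear pool over the document's distinct weighted terms."""
    hits = [weights[t] for t in set(doc) if t in weights]
    return sum(hits) / len(hits) if hits else 0.0


def to_2d(doc, spam_w, ham_w):
    """Map a document to (spam score, non-spam score)."""
    return pool(doc, spam_w), pool(doc, ham_w)


def classify(doc, spam_w, ham_w, slope=1.0, bias=0.0):
    """Linear discriminant in the 2-D space; slope and bias are hand-set here."""
    s_spam, s_ham = to_2d(doc, spam_w, ham_w)
    return "spam" if slope * s_spam - s_ham + bias > 0 else "ham"


# Toy data, invented purely for illustration.
spam_docs = [["win", "cash", "now"], ["cash", "prize", "now"]]
ham_docs = [["meeting", "notes", "now"], ["project", "meeting", "agenda"]]
spam_w, ham_w = fit_weights(spam_docs, ham_docs)
print(to_2d(["cash", "now", "agenda"], spam_w, ham_w))
print(classify(["cash", "now", "agenda"], spam_w, ham_w))
```

Terms that never appeared in training are simply ignored by `pool`, so each document is reduced to two numbers regardless of vocabulary size. In a fuller version, `slope` and `bias` would be fitted on labeled (or, for personalization, unlabeled) data rather than fixed.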

Similar Article Headings

The Linguistic Method of Study

Topic 9: The Linguistic Method of Study
Linguistics is concerned with language, and language plays an important role in shaping society. Language leaves indelible marks on society, and society in turn influences language. When we compare and examine languages, the picture becomes clearer still. Language is found in every region and every corner of the world, and it is continually undergoing change. Keeping the function and necessity of language in view, many experts pruned and refined it, kept it in line with the rules and principles of language, and brought these changes to light.
The history of language is as old as human history itself. Among those who did the early linguistic work on Urdu were some French, some Italians, and some other Europeans, who brought changes to the language. They had a love for the language, curiosity, and passion. To understand any language, one has to make contact with the local people, whereas the local people consider the work of such scholars a waste of time.
Books on linguistics appeared in which the authors expressed their views and theories with reference to literature. Among them was Sir Syed Ahmad Khan's Qawaid-e-Urdu. This treatise sets out various rules of grammar; its year of publication is 1840. Of this treatise, Maulvi Abdul Haq writes:
"Although this book is not particularly noteworthy, it does show how great an interest the late author had in the Urdu language."
Sir Syed Ahmad Khan felt the need for a comprehensive dictionary of the language, and in his own era he took on the task of compiling an Urdu dictionary. In this dictionary he also included the definition and formation of words. He defined the word...

Doctrine of Peace: An Islamic Perspective, an Analysis of the Hindrances in Peace Making Process

The present-day global community is confronted with several challenges, the absence of peace being the biggest one. The UNO and other regional organizations strive hard to achieve peace, yet the world is far from ensuring it. It is also a fact that in most cases the Muslim world is in turmoil and is, or has been, involved in one way or another. In order to find the real causes of unrest, it is imperative to analyze the situation from academic, political, economic, and social aspects. This article deals with the first aspect, i.e. the academic. It has three parts. In the first part, the importance of peace is elaborated in the light of the Quran and the Sunnah. In the second part, the importance of the peace treaties of the Prophet (PBUH) and of one of His successors, Umar, the second Caliph, is brought into focus; the point that the foreign policy of the Islamic State is based on perpetual peace is also discussed in this part. The third part discusses the obstacles that have damaged the peace process in the world.

Vulnerability to Emotional Disturbances in Father of Neurotic and Psychotic Children

The present research studied the difference in anxiety and depression levels between fathers of neurotic children and fathers of psychotic children. To measure these levels, the IPAT Anxiety Scale and the IPAT Depression Scale were used, followed by an interview to obtain information and history from the fathers. Three hundred fathers participated in the study: one hundred had neurotic children, one hundred had psychotic children, and one hundred had normal children. A chi-square test was computed for the statistical analysis of the data. Hypothesis No. 5, when tested for statistical significance on the sample of 100 subjects in each group (i.e. fathers of neurotic and of psychotic children), yielded a χ² of 3.26, which fell just short of significance, since a χ² of 3.84 is required for significance at the p < .05 level. It was then decided to increase the sample by 20 subjects in each group; as expected, the χ² for the enlarged sample was significant at the .05 level. It was concluded that the fathers of neurotic children have higher sten scores on anxiety and depression than the fathers of psychotic children. Moreover, anxiety sten scores were comparatively higher than depression sten scores.