Asian Research Thesis Index

Abstract

News generation in the digital environment is no longer a periodic process with a single fixed output, as with the printed newspaper. News is generated and updated online instantly and continuously. However, for reasons such as the short lifespan of digital information and the speed at which information is generated, it has become vital to preserve digital news for the long term. Digital preservation comprises various actions to ensure that digital information remains accessible and usable for as long as it is considered important. Libraries and archives preserve newspapers by carefully digitizing collections, because newspapers are a good source of historical knowledge. The lifespan of news stories published online varies from one newspaper to another, ranging from one day to a month or even more. Although a newspaper may be backed up and archived by the news publisher or national archives, in the future it will be difficult to access particular information published in various newspapers about the same news. The issue becomes more complicated if a story is to be tracked through a multilingual archive of many online newspapers, which require different access technologies. Based on prior studies, a ten-step systematic approach to web preservation is introduced, which leads to an effective web archive and is followed to create the intended digital news stories archive using a digital news stories extractor. Initially, the archive is populated with three English newspapers, then extended to ten online newspapers, and then upgraded to a dual-language (Urdu and English) news articles archive extracted from fifteen online news sources. The news stories archive preserves about 360 Urdu and 850 English news articles, crawled periodically every second day using the digital news stories extractor tool. The main goal of the dissertation was to link digital news stories during preservation using text processing techniques.
To achieve this goal, the formulated text-processing similarity measures are applied to link two types of news articles in the archive: English news articles (English-to-English) and dual-language news articles (Urdu-to-English). To link English news articles in the archive, the study proposed five content-based similarity measures that find similarity based on news content features and link news articles during preservation. The measures compute a similarity value among news articles based on features such as the number of terms, named entities, named entity positions, title terms, and the position of terms in the titles. The measures are evaluated on the same sets of news articles, of different sizes, and compared with human judgment in order to evaluate their accuracy and assess the effectiveness and significance of the designed similarity measures for linking English news articles. The results showed that the proposed measures are feasible for linking English news articles in the news stories archive. The selection of a measure depends on its performance in a specific category; for example, a measure may perform better on the category “Opinion”. All the proposed measures are evaluated for six categories of news articles, and the results are compared with two known text-based similarity measures to assess the effectiveness and appropriateness of each proposed measure in its best-fitting scenario. The pre-processing step in any web preservation project is of utmost importance because the intention is to archive the targeted content, especially in a language that lacks sophisticated tools and techniques. To link dual-language news articles in the archive, Urdu news articles need extensive pre-processing, which leads to the creation of an Urdu bag of words containing 50502 words and a dictionary containing 78739 pairs of Urdu words with English meanings.
The study proposed five content-based similarity measures that find similarity based on news content features and link Urdu-to-English news articles during preservation. The measures are applied to the same sets of news articles, of different sizes, and mutually compared. The results showed that three of the proposed measures gave appreciable results for linking dual-language news articles in the archive, which can be improved by improving the structure and contents of the Urdu dictionary. In summary, the performance of the different measures has been evaluated individually for linking digital news articles in the digital news stories archive, to ensure that these news articles remain accessible in the future within this enormous collection.
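
As a rough illustration of what a content-based similarity measure of this kind might look like, the following minimal sketch combines title-term overlap and body-term overlap into one score, and maps Urdu terms to English through a word-pair dictionary before comparison. It is not one of the five measures proposed in the dissertation, whose definitions are not given in this abstract. The use of Jaccard overlap, the `title_weight` parameter, and all function names are assumptions made for the example.

```python
import re


def tokens(text):
    """Lowercase word tokens; a simple stand-in for real pre-processing."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two term sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def article_similarity(title_a, body_a, title_b, body_b, title_weight=0.5):
    """Weighted mix of title-term and body-term overlap.

    The weighting scheme is hypothetical, chosen only for illustration.
    """
    title_sim = jaccard(tokens(title_a), tokens(title_b))
    body_sim = jaccard(tokens(body_a), tokens(body_b))
    return title_weight * title_sim + (1 - title_weight) * body_sim


def translate_terms(urdu_terms, dictionary):
    """Map Urdu terms to English using a word-pair dictionary.

    Terms missing from the dictionary are dropped, which is one reason
    dictionary coverage would affect cross-lingual linking quality.
    """
    return {dictionary[t].lower() for t in urdu_terms if t in dictionary}
```

Two articles could then be linked when `article_similarity` exceeds a chosen threshold. For an Urdu-to-English pair, the Urdu term set would first pass through `translate_terms` and then be compared with the English term set using `jaccard`.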
