Thread: automated details

  1. #1
    Member
    Join Date
    Jun 2008
    Posts
    66

    Default automated details

    Automated Online News Classification with Personalization

    Chee-Hong Chan Aixin Sun Ee-Peng Lim

    Center for Advanced Information Systems, Nanyang Technological University

    Nanyang Avenue, Singapore, 639798

    Abstract

    Classification of online news has, in the past, mostly been done manually. In the
    Categorizor system, we have experimented with an automated approach to classifying
    online news using the SVM (Support Vector Machine) classification method. SVM has
    been shown to give good classification results when ample training documents are
    given. In our research, we have applied the SVM classification method to
    personalized classification. In personalized classification, users can define their
    personalized categories using a few keywords. By constructing search queries using
    these keywords, the Categorizor obtains both the positive and negative training
    documents required for the construction of classifiers. In this paper, we describe
    the preliminary version of the Categorizor and present its system architecture.

    1 Introduction

    1.1 Motivation

    Text classification is the process of assigning text documents to one or more
    predefined categories based on their content. This allows users to find desired
    information faster by searching only the relevant categories and not the whole
    information space. The importance of text classification is even more apparent when
    the information space is huge, as with the World Wide Web. Web classification
    services provided by web portals such as Yahoo![15] and Google[7] represent one
    approach to classifying web sites and pages. As such classification services are
    carried out by human experts, they do not scale well with the growth rate of web
    pages on the Internet. To automate the classification process, machine learning
    methods have been introduced. In a text classification method based on machine
    learning, one or more classifiers are built (trained) with a given set of training
    documents. The purpose of a trained classifier is therefore to assign documents to
    the suitable categories with minimal human intervention.

    Online news articles represent a type of web information that is frequently
    referenced and is mostly textual. Currently, online news is provided by many
    dedicated newswires such as Reuters[11] and PR Newswires[10]. These newswires may
    specialize in reporting news in different areas (e.g. financial, sports). It would
    be useful to gather news from these sources and classify it accordingly for ease of
    reference.

    In this paper, we describe a working news classification system, named
    Categorizor[1], that performs automated online news classification. The Categorizor
    adopts the Support Vector Machine (SVM) to classify news articles into several
    general categories and into special categories defined by the users. The latter are
    also known as personalized categories. With personalized categories, the
    Categorizor allows users to quickly locate the desired news articles with minimum
    effort.

    1.2 Related Work

    Text classification is a well-studied problem. Several methods have been proposed
    previously, and many of them can be directly applied to news classification as long
    as the categories are predefined and there exists a good set of training documents
    for each category[17, 6, 14]. Nevertheless, when the categories (i.e., personalized
    categories) are defined on the fly and training documents are not readily
    available, the classification problem becomes much more complex. Text
    classification with user-defined or personalized categories is a form of
    personalization, and there are several existing ways to support personalization.

    In the collaborative filtering approach, each user is associated with a user profile.

    When the user profiles of two users are similar, news articles read by one of them will be

    automatically recommended to the other[2].

    In another personalization approach known as content filtering, one or more sets of
    features, each representing a different interest domain (personalized category) of
    the user, are derived. News articles are then recommended based on their semantic
    similarity with each set of features. In this approach, the interest domain of a
    user is very much independent of that of another user.

    In the subscription-based personalization approach, a user can manually subscribe
    to a subset of a large number of pre-defined news categories. The set of
    pre-defined categories is usually static and corresponds to the categories assigned
    to the news articles when they are first created. In other words, the
    subscription-based personalization approach is rather straightforward and does not
    require much classification effort. Most web sites achieve news personalization by
    adopting the subscription approach, e.g. Newscan-online [9, 4].

    Among the above three approaches, we have chosen the content filtering approach to
    support personalized categories in the Categorizor system. However, there are two
    main difficulties in using this approach for personalized news classification.
    Firstly, it is not easy to obtain the training documents required for the
    generation of classifiers. Secondly, as news articles are generally short, the
    selection of features for classification becomes more important than in the normal
    text classification problem.

    2 The Categorizor

    2.1 General Features

    At present, the Categorizor is developed to classify mainly financial news. It
    offers two kinds of classification, namely general classification and personalized
    classification. In general classification, we have adopted a fixed set of
    categories from the Reuters collection[12]. The Reuters collection was chosen
    because its categories are closely related to financial services and economics. For
    a start, we have designed the Categorizor to classify news articles from Channel
    News Asia [3]. A classifier is developed for each category using the corresponding
    Reuters training documents.

    The unique feature of the Categorizor is that it allows users to create and
    maintain their own personalized categories. Users can register with the Categorizor
    and subsequently create a personalized news category by specifying a few keywords
    associated with the category. There is no restriction on the number of personalized
    categories for each user. To build the classifier for a personalized news category,
    a number of training documents (news articles) have to be obtained. Instead of
    getting the user to perform the time-consuming task of selecting and uploading the
    training documents, we construct a query to the Yahoo News Search Engine[16] using
    the user-supplied keywords for the category. The training documents are then
    selected from the most highly ranked news articles returned by Yahoo News.

    To cope with evolving user interests and to further improve the effectiveness of
    classifiers for personalized categories, our personalized classifiers are defined
    such that they can be retrained upon user request. The retraining of a personalized
    classifier can be performed in two ways. The first involves redefining the keywords
    of the personalized category. The other is to have the user provide feedback as
    he/she reads the categorized news. The news articles carrying feedback are later
    used as training documents for retraining the personalized classifiers.

    2.2 Architecture

    [Figure 1 diagram not reproduced. Modules shown: Webpage Retrieval Module,
    Preprocessing Module, SVM Module, Presentation Module, Storage Module and User
    Registration Module. Components shown: Webpage Crawler, Text Extractor, Document
    Pre-processor, Document Vector Generator, SVM Learn Module, SVM Classify Module,
    Result Interpreter, Display Formatter, News Database, System Database, Reuters Test
    Collection, and the external WWW and Yahoo! News search engine. Data flows
    labelled: HTTP requests, query, search results, news webpages, news text (for
    training or for classification), pre-processed news text, news text document
    vectors (for training or for classification), model file, prediction file, sorted
    classification results, user details, and training news articles for personalized
    categories.]

    Figure 1: Architecture overview of the Categorizor

    The architecture of the Categorizor is shown in Figure 1. The main architecture

    consists of six modules, i.e. the Pre-processing, Presentation, Storage, SVM Classifier,

    User Registration and Webpage Retrieval modules.

    The Webpage Retrieval module employs the web page crawler to download the online
    news articles from the Channel News Asia web site. The Pre-processing module
    consists of the Text Extractor, the Document Pre-processor and the Document Vector
    Generator. The Text Extractor extracts the news text from the downloaded news
    pages. The extracted news text for classification is stored in the News Database.
    The Document Pre-processor performs stop-word removal and word stemming on the
    extracted text. After pre-processing, document vectors are generated by the
    Document Vector Generator using the well-known tf × idf scheme [13]. To cater for
    documents of varying length, the document vectors are normalized to unit length.
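
    The vector generation step could be sketched as follows. This is a minimal
    illustration, not the Categorizor's actual code: the function name
    tfidf_vectors, the sample token lists and the specific weighting (raw term
    frequency times log(N / df)) are assumptions, since the paper only names the
    tf × idf scheme and the normalization to unit length.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists, assumed already stop-word filtered and stemmed."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Normalize to unit (Euclidean) length so document length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Hypothetical pre-processed documents.
docs = [["bank", "rate", "rise"], ["bank", "profit"], ["oil", "price", "rise", "rise"]]
vectors = tfidf_vectors(docs)
```

    In this toy input, "profit" appears in one document and "bank" in two, so
    "profit" receives the larger weight in the second vector.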

    There are three information repositories in the system. The News Database stores
    the attributes of the news articles downloaded from the online news web sites for
    both training (in the case of personalized classification) and classification. The
    attributes stored include the download date, the URL and the news text. The System
    Database holds information about users and their personalized categories. The third
    repository shown in Figure 1, the Reuters Test Collection, supplies the training
    documents for the general categories (Section 2.1).

    The SVM Classifier is a binary classifier which consists of the SVM Learn Module
    and the SVM Classify Module. The SVM Learn Module trains the classifier of a
    category (general or personalized) and produces a model file. Given the model file,
    the SVM Classify Module performs classification on a given set of documents
    (represented by their document vectors). In our prototype system, the SVMlight
    package developed by Joachims is used [8].
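
    SVMlight reads examples as text lines of the form
    "<target> <feature>:<value> <feature>:<value> ...", with feature numbers in
    increasing order. The sketch below shows one way a document vector might be
    written in that format. The helper name to_svmlight_line and the vocabulary
    mapping are hypothetical; the paper does not describe how the Categorizor
    writes its vector files.

```python
def to_svmlight_line(label, weights, vocab):
    """Serialize one document vector as '<label> <id>:<value> ...'.

    label:   +1 for a positive example, -1 for a negative one
    weights: dict mapping term -> weight (for example, normalized tf-idf)
    vocab:   dict mapping term -> positive integer feature id (hypothetical)
    """
    # SVMlight expects feature ids in increasing order; zero weights are omitted.
    pairs = sorted((vocab[t], w) for t, w in weights.items() if t in vocab and w != 0)
    return " ".join([f"{label:+d}"] + [f"{i}:{w:.6f}" for i, w in pairs])

vocab = {"bank": 1, "profit": 2, "rate": 3}
line = to_svmlight_line(1, {"rate": 0.6, "bank": 0.8}, vocab)
print(line)
```

    Files built from such lines can be passed to SVMlight's svm_learn program to
    produce a model file, and to svm_classify together with that model file to
    produce a prediction file.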

    The Presentation module sorts the classification results from the SVM classifier
    according to the score values returned by the SVM Classify Module. The User
    Registration module is responsible for the management of user information and the
    users' personalized categories.

    3 Classification Process

    The Categorizor performs two kinds of classification, as mentioned in Section 2.1.
    The two kinds of classification are performed in different ways. The detailed
    classification processes are described in this section.

    3.1 General Classification

    In general classification, all the categories are taken from the Reuters-21578 text
    collection. Only 10 general categories are currently supported by the Categorizor,
    and users can select any of the general categories for viewing, as shown in
    Figure 2. We build an SVM classifier for each of the 10 categories, as each SVM
    classifier gives a binary decision for an input document. The steps of training and
    using an SVM classifier are as follows:

    1. The SVM classifier is trained with the training documents from the Reuters-21578
    text collection. The positive documents are the ones that belong to the category,
    and an equal number of negative training documents are randomly selected from the
    rest of the categories. After training, the output of the SVM classifier (i.e. the
    model file) is stored in the System Database.

    Figure 2: Selection of general categories

    2. The news articles are downloaded daily from the source website (i.e. Channel
    News Asia), and their text is extracted from the news bodies by the Text Extractor
    and then stored in the News Database. The extracted texts are referred to as
    documents in the later steps.

    3. When the user requests the news from category C_i, the most recently downloaded
    documents are retrieved from the News Database. Their document vectors are
    generated by the Document Pre-processor and Document Vector Generator.

    4. The model file for category C_i is retrieved from the System Database and the
    corresponding SVM classifier will start classifying the document vectors.

    5. The classification results are sorted according to the score values assigned by
    the SVM classifier and displayed in the resultant web page, as shown in Figure 3.
    In the resultant web page, we use a 5-point ranking to indicate the relevance of
    the news articles to the category.
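
    The training set construction in step 1 could look like the sketch below. It is
    a minimal illustration under stated assumptions: the function name
    balanced_training_set, the document ids and the category-to-documents mapping
    are hypothetical, and excluding documents that also carry the target category
    from the negative pool is a design choice of this sketch rather than something
    the paper specifies.

```python
import random

def balanced_training_set(docs_by_category, category, seed=0):
    """Return (doc_id, label) pairs: all positives plus equally many random negatives."""
    positives = list(docs_by_category[category])
    positive_ids = set(positives)
    # Candidate negatives: documents of every other category, deduplicated,
    # skipping any document that also belongs to the target category.
    pool = list(dict.fromkeys(
        doc
        for other, docs in docs_by_category.items() if other != category
        for doc in docs if doc not in positive_ids
    ))
    rng = random.Random(seed)  # fixed seed only to keep the example repeatable
    negatives = rng.sample(pool, min(len(positives), len(pool)))
    return [(d, +1) for d in positives] + [(d, -1) for d in negatives]

# Hypothetical document ids grouped by category.
docs_by_category = {
    "earn": ["d1", "d2"],
    "grain": ["d3", "d4", "d2"],
    "crude": ["d5"],
}
training = balanced_training_set(docs_by_category, "earn")
```

    Here the "earn" classifier receives two positive documents and two negatives
    drawn from d3, d4 and d5.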

    3.2 Personalized Classification

    In personalized classification, the personalized categories are defined by users and each

    category is described by a few keywords. The classification steps are as follows:

    1. The user first registers his/her user name and password with the Categorizor.

    Figure 3: Results returned for general categories

    2. The user defines his/her personalized categories by providing category names and
    a set of keywords for each personalized category that describe the content of the
    category.

    3. To obtain the training news articles (documents) for each personalized category,
    the keywords are submitted to the Yahoo! news search engine, and the news articles
    originally from Reuters returned by Yahoo! are used as the positive training
    documents. The negative training documents are obtained by conducting an inverse
    keyword search on the Yahoo! news search engine. The inverse keyword search can be
    easily achieved by adding a "-" operator before the keyword.

    Figure 4: Entering keywords to define a new personalized category

    4. All the positive and negative training news articles are submitted to the Text
    Extractor and the document vectors are generated with the Document Vector
    Generator.

    5. An SVM classifier is constructed by the SVM Learn Module for each newly defined
    personalized category. The learning process utilizes both the positive and negative
    training document vectors. The generated classifier is stored as a model file
    within the System Database.

    6. When a user requests news under his/her personalized category C_j, the recently
    downloaded news articles are retrieved from the News Database and their document
    vectors are generated by the Document Vector Generator.

    Figure 5: Classifier re-training

    7. Both the document vectors and the model file for C_j are passed to the SVM
    Classify Module, and the classification results, sorted by score values, are
    displayed in HTML format. In the resultant web page, a "Relevant?" checkbox is
    associated with each news entry, as shown in Figure 5, to allow feedback from the
    user.
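
    The query construction in step 3 can be illustrated with a short sketch. The
    function name build_queries and the joining of keywords with spaces are
    assumptions; the paper only states that the inverse search places a "-"
    operator before the keyword.

```python
def build_queries(keywords):
    """Return (positive_query, inverse_query) strings for a personalized category."""
    # Positive query: the user's keywords as given (space-joined by assumption).
    positive = " ".join(keywords)
    # Inverse query: each keyword prefixed with the "-" exclusion operator.
    inverse = " ".join("-" + keyword for keyword in keywords)
    return positive, inverse

positive_query, inverse_query = build_queries(["merger", "acquisition"])
print(positive_query)  # merger acquisition
print(inverse_query)   # -merger -acquisition
```

    Articles returned for the first query would serve as positive training
    documents, and those returned for the second as negative ones.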

    3.3 Re-training of the Classifier

    In order to strengthen the personalization aspect of the Categorizor, it is
    designed to accept feedback from the user. The user, while reading news from a
    category, can indicate whether the content of a news article is relevant to that
    category by checking the "Relevant?" box. When the "Update classifier with selected
    document(s)" button is clicked, the corresponding classifier is re-trained with a
    new training set that includes the documents carrying feedback. In this way, users
    can continually refine the training sets of their personalized categories, with
    the aim of improving accuracy. At present, we have not evaluated the effect of
    re-training in personalized classification. Experiments to evaluate the different
    ways of training will be covered in future research.
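
    One possible way to fold the feedback into the training set is sketched below.
    The names add_feedback, training_set and relevant_doc_ids are hypothetical, and
    treating every checked article as a positive example (while leaving unchecked
    articles out) is an assumption; the paper does not say how unchecked articles
    are handled.

```python
def add_feedback(training_set, relevant_doc_ids):
    """Return a new training set with the checked articles added as positives.

    training_set:     dict mapping doc_id -> label (+1 or -1)
    relevant_doc_ids: ids of articles whose "Relevant?" box was checked
    """
    updated = dict(training_set)  # copy, so the previous training set is kept intact
    for doc_id in relevant_doc_ids:
        updated[doc_id] = +1      # user feedback overrides any earlier label
    return updated

old_set = {"a1": +1, "a2": -1}
new_set = add_feedback(old_set, ["a2", "a3"])
```

    The classifier for the category would then be trained again on the updated set,
    producing a new model file.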

    4 Conclusion

    We have designed and implemented a preliminary version of a news classification
    system based on the SVM classification technique. The system is capable of both
    general classification and personalized classification. Our preliminary
    experiments, not reported in this paper, have shown that our system works well for
    general classification, while there is room for improvement in personalized
    classification.

    As the Categorizor is still in its development and enhancement stage, much work
    needs to be done to make it a full-fledged news classification system, particularly
    for the personalized classification feature. Firstly, we need to enhance the
    Categorizor with a complete set of general categories. Due to the unavailability of
    generic extraction software for extracting the desired news text from HTML web
    pages, the Categorizor is currently restricted to classifying news articles from
    Channel News Asia only. A complete version of the Categorizor will have to
    incorporate an extraction facility that allows users to specify the sources of news
    articles. We are currently conducting experiments to improve the performance of
    personalized classification. For example, we are exploring the use of hierarchy in
    personalized classification, as it has been reported that hierarchical
    classification achieves better performance than flat classification[5].

    References

    [1] Categorizor, CAIS | Loading cheehong/servlet/categorizor.

    [2] R. J. Chen, M. Nathalie, and W. Shawn. Collaborative information agents on the
    world wide web. In Proceedings of the third ACM Conference on Digital libraries,
    pages 279-280, 1998.

    [3] Channel News Asia, Channelnewsasia.com.

    [4] R. Däßler, K. Schirmer, and G. Neher. Business news in 3 dimension, 1998.
    http://fabdp.fh-potsdam.de/infoviz/paper/ieee98.pdf.

    [5] S. Dumais and H. Chen. Hierarchical classification of Web content. In
    Proceedings of the 23rd ACM International Conference on Research and Development in
    Information Retrieval, pages 256-263, Athens, GR, 2000. ACM Press, New York, US.

    [6] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms
    and representations for text categorization. In Proceedings of the 7th
    International Conference on Information and Knowledge Management, pages 148-155,
    1998.

    [7] Google, Google.

    [8] T. Joachims. SVMlight, an implementation of Support Vector Machines (SVMs) in
    C. http://ais.gmd.de/~thorsten/svm_light/.

    [9] Newscan-Online, http://www.newscan-online.de/newscan/index.html.

    [10] PR Newswires, PR Newswires.

    [11] Reuters, World News, Business News, Breaking US & International News |
    Reuters.com.

    [12] Reuters-21578 text categorization test collection, AT&T Labs Research
    lewis/reuters21578.html.

    [13] G. Salton and C. Buckley. Term-weighting approaches in automatic text
    retrieval. Information Processing and Management, 24(5):513-523, 1988.

    [14] F. Sebastiani. Machine learning in automated text categorisation: a survey.
    Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione,
    Consiglio Nazionale delle Ricerche, Pisa, IT, 1999. Revised version, 2001.

    [15] Yahoo!, Yahoo!.

    [16] Yahoo! News, The top news headlines on current events from Yahoo! News.

    [17] Y. Yang and X. Liu. A re-examination of text categorization methods. In 22nd
    Annual International SIGIR, pages 42-49, Berkeley, August 1999.


  2. #2
    Member
    Join Date
    Sep 2010
    Posts
    42

    Default

    Yes! Nice information that you have shared with us.
    Thank you very much!!
