I always thought tagging was a silly name for ‘monkey-based text classification’, because most of the tagging systems I worked on or used relied on the good will of their users to attribute the right tags.

Tags are a very useful indexing and classification tool, most of the time helping to close the gap between natural language and what is stored on our computers. From basic filtering and grouping to more complex processing, this technique can save us a lot of time.

Automatic tagging is a problem that attracts me. From the computing science standpoint, there is plenty of room for applying and learning about text classification algorithms. One may stumble on new ideas or even new algorithms, but just using the well-known cornerstones of the statistical and semantic families of algorithms is a good practical exercise.

The basic idea is to classify a block of text (post, article, message, item) automatically into one or more tags, using an automated process that can be improved over time as it classifies more and more elements. This way, one could design an application that saves the user the hard work of sorting out imported data or reworking tag sets each time a migration happens. Better yet, it would help users organize themselves, as in the case of calendars.

The main ‘automatic tagging’ example one may find is the automated antispam filter. Paul Graham’s ‘A Plan for Spam‘ is a very interesting article about spam filtering using statistical approaches to text classification.
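
To give a feel for the statistical side, here is a minimal sketch in Ruby of the kind of probability combining Graham describes: each token gets an estimated spam probability, and the probabilities of the most significant tokens are merged into a single score. The helper name and the numbers are made up for illustration; this is not Graham’s exact implementation.

#!/usr/bin/env ruby

# Illustration only: combine per-token spam probabilities roughly the way
# 'A Plan for Spam' suggests: p = prod(p_i) / (prod(p_i) + prod(1 - p_i))
def combined_spam_probability(token_probs)
  prod = token_probs.inject(1.0) { |acc, p| acc * p }
  inv  = token_probs.inject(1.0) { |acc, p| acc * (1.0 - p) }
  prod / (prod + inv)
end

# hypothetical per-token probabilities taken from a trained word table
puts combined_spam_probability([0.99, 0.95, 0.20])   # close to 1.0: likely spam
puts combined_spam_probability([0.10, 0.05, 0.30])   # close to 0.0: likely ham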

Well, before going all guerrilla and implementing your own Bayesian or LSI classifier, take a look at the Python and Ruby libraries to learn a little more about how they do their work.

Most of these algorithms have training and classification phases, and one may untrain items and feed new training items back as the application is being used. It’s more or less like the ‘Spam’ and ‘Not spam’ buttons on Gmail and other webmails, which help the internal classification program mark or unmark a given message as unwanted mail.
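
As a minimal sketch of that feedback loop, assuming the classifier gem used below also exposes untrain_* alongside its dynamic train_* methods, correcting a wrong classification looks roughly like this (the categories and sample strings are made up):

#!/usr/bin/env ruby

require 'rubygems'
require 'classifier'

b = Classifier::Bayes.new('spam', 'ham')
b.train_spam 'cheap pills, buy now'
b.train_ham  'meeting rescheduled to friday'

msg = 'project status meeting tomorrow'
puts b.classify(msg)

# the user hits the equivalent of a 'Not spam' button:
b.train_spam msg      # suppose this wrong training happened earlier
b.untrain_spam msg    # feedback: take the wrong training back...
b.train_ham msg       # ...and train the message as the correct category
puts b.classify(msg)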

Ruby has a few main options for text filtering. We will use classifier for naive Bayesian and LSI, and Bishop for another take on Bayesian filtering. There is a binding for the nice CRM114 Discriminator, but it seems broken at the moment (most of these modules seem unmaintained; I had to patch classifier manually after some hard debugging and googling to track down a nasty error. CRM114 hasn’t worked for me yet, but I kept its code in the examples below).

First of all, I made a script to build a training database. I intentionally kept my database small; for a good result, the bigger the training database, the better. Also, I made no interface to feed results back into the database and untrain a wrong entry. The database generation script scrapes information from Google News, just the headlines, and stores it all in YAML files.

-File: dbgen.rb


#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'
require 'logger'
require 'open-uri'
require 'yaml'

# dbgen for google news
# most used:
# sports: http://news.google.com/?ned=us&topic=s
# world: http://news.google.com/?ned=us&topic=w
# entertainment: http://news.google.com/?ned=us&topic=e
# sci/tech: http://news.google.com/?ned=us&topic=t
# health: http://news.google.com/?ned=us&topic=m
# business: http://news.google.com/?ned=us&topic=b

tagSrc = {
    'sports' => 'http://news.google.com/?ned=us&topic=s',
    'world' => 'http://news.google.com/?ned=us&topic=w',
    'entertainment' => 'http://news.google.com/?ned=us&topic=e',
    'scitech' => 'http://news.google.com/?ned=us&topic=t',
    'health' => 'http://news.google.com/?ned=us&topic=m',
    'business' => 'http://news.google.com/?ned=us&topic=b'
}

limit=10

tagSrc.each do |tag,url|

    agent = WWW::Mechanize.new{ |obj| obj.log = Logger.new('dbgen.log') }
    puts("Start processing: #{url} for tag: #{tag}")

    news = Array.new
    page = agent.get(url)
    ni = page.search("div[@class='lh']/font[2]")

    ni.each do |n|
        news << n.to_s.gsub(/(<\/?[^>]*>)|(&(.+)\;)/,'')
        #break unless (news.size < limit)
    end

    #puts news.to_yaml
    File.open(tag+'.yaml', 'a+') { |f| f.puts news.to_yaml }
    puts("Done processing: #{url}")
end
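
Each tag ends up with a YAML file holding an array of headline strings (note the ‘a+’ mode: repeated runs append new YAML documents to the same file, and YAML::load_file will only read the first one back). As a quick sanity check, a snippet along these lines can report how much training material each file holds:

#!/usr/bin/env ruby

require 'yaml'

# load each training file produced by dbgen.rb and report its size
%w[sports world entertainment scitech health business].each do |tag|
    headlines = YAML::load_file("#{tag}.yaml")
    puts "#{tag}: #{headlines.size} headlines"
end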

Then comes the classifier script itself, which uses an RSS feed from Yahoo News to get the top stories and tries to classify each one as one of the stored tags. I chose RSS because it's easier to get to the point and I would also have code ready to use in my projects. It could easily be changed to parse the main page with mechanize instead.

-File: rssclass.rb


#!/usr/bin/env ruby

require 'rss/1.0'
require 'rss/2.0'
require 'rubygems'
require 'mechanize'
require 'logger'
require 'open-uri'
require 'stemmer'
require 'classifier'
require 'yaml'
require 'crm114'
require 'bishop'

class NewsClassifier
    # the classifier objects (@bayesianClassifier, @lsiClassifier,
    # @crm114Classifier, @bishopClassifier) and the per-tag training arrays
    # (@sports, @world, ...) are instance variables set up in initialize

    def loadTrainingFiles
        @sports=YAML::load_file('sports.yaml')
        @world=YAML::load_file('world.yaml')
        @entertainment=YAML::load_file('entertainment.yaml')
        @scitech=YAML::load_file('scitech.yaml')
        @health=YAML::load_file('health.yaml')
        @business=YAML::load_file('business.yaml')
    end

    def trainBayesClassifier
        @bayesianClassifier = Classifier::Bayes.new('sports', 'world','entertainment','scitech','health','business')

        @sports.each { |s| @bayesianClassifier.train_sports s }
        @world.each { |w| @bayesianClassifier.train_world w }
        @entertainment.each { |e| @bayesianClassifier.train_entertainment e }
        @scitech.each { |s| @bayesianClassifier.train_scitech s }
        @health.each { |h| @bayesianClassifier.train_health h }
        @business.each { |b| @bayesianClassifier.train_business b }
    end

    def trainLSIClassifier
        # lsi needs a fix for the newest GSL. see http://rubyforge.org/forum/forum.php?thread_id=10069&forum_id=2816
        @lsiClassifier = Classifier::LSI.new :auto_rebuild => false

        @sports.each { |s| @lsiClassifier.add_item(s, "Sports") }
        @world.each { |w| @lsiClassifier.add_item(w,"World") }
        @entertainment.each { |e| @lsiClassifier.add_item(e,"Entertainment") }
        @scitech.each { |s| @lsiClassifier.add_item(s,"Scitech") }
        @health.each { |h| @lsiClassifier.add_item(h,"Health") }
        @business.each { |b| @lsiClassifier.add_item(b,"Business") }
        # no auto index building
        @lsiClassifier.build_index
    end

    def trainCRM114Classifier
        @crm114Classifier=Classifier::CRM114.new(["Sports", "World","Entertainment","Scitech","Health","Business"])

        @sports.each { |s| @crm114Classifier.learn!("Sports",s) }
        @world.each { |w| @crm114Classifier.learn!("World",w) }
        @entertainment.each { |e| @crm114Classifier.learn!("Entertainment",e) }
        @scitech.each { |s| @crm114Classifier.learn!("Scitech",s) }
        @health.each { |h| @crm114Classifier.learn!("Health",h) }
        @business.each { |b| @crm114Classifier.learn!("Business",b) }

    end

    def trainBishopBayesClassifier
        @bishopClassifier = Bishop::Bayes.new #('sports', 'world','entertainment','scitech','health','business')

        @sports.each { |s| @bishopClassifier.train("sports",s) }
        @world.each { |w| @bishopClassifier.train("world",w) }
        @entertainment.each { |e| @bishopClassifier.train("entertainment",e) }
        @scitech.each { |s| @bishopClassifier.train("scitech",s) }
        @health.each { |h| @bishopClassifier.train("health",h) }
        @business.each { |b| @bishopClassifier.train("business",b) }
    end

    def getBishopResult(t)
        res=@bishopClassifier.guess(t)
        results = res.sort_by{|score| -score.last }
        return results.first[0].capitalize
    end
    def initialize
        loadTrainingFiles

        trainBayesClassifier
        trainLSIClassifier
        #trainCRM114Classifier
        trainBishopBayesClassifier
    end

    def runTest
        agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('news_classifier.log') }
        rssURL = "http://rss.news.yahoo.com/rss/topstories"
        rssBody = agent.get_file(rssURL)

        rss = RSS::Parser.parse(rssBody.to_s, false)
        print "link,bayesian,LSI,bishop,CRM114\n"
        rss.channel.items.each do  |r|
            # download / classify
            page = agent.get(r.link)
            text = page.search("//div[@id='storybody']/p")
            text.search("div").remove
            text.search("a").remove
            text.search("img").remove
            t=text.to_s.gsub(/(<\/?[^>]*>)|(&(.+)\;)/,'')
            print "#{r.link},#{@bayesianClassifier.classify(t)},"
            print "#{@lsiClassifier.classify(t)},"
            print "#{getBishopResult(t)},\n"

            #,#{@crm114Classifier.classify.first}\n"
        end
    end
end

# main
nc = NewsClassifier.new
nc.runTest

The classifier script uses the following algorithms to process the news text: naive Bayesian, LSI, and Bayes with the Robinson algorithm (Bishop). Tags were chosen to match the Google News channels (‘sports’, ‘world’, ‘entertainment’, ‘scitech’, ‘health’, ‘business’). Later this approach proved very limited: since there is no result feedback and untraining, and the training database is very small, there was a clear tendency to generalize around a single tag (world).

To illustrate that, I created a table for a given run of rssclass.rb. The output is a CSV-style table, so I opened it in gnumeric, followed link by link, and wrote down which tag from the group I would choose (monkey-based tagging at its best). I also included a ‘could be’ column with what my tag of choice would really be; the CRM114 column is missing because the binding was not working for me. I’ve edited the links so they would fit my screen:

link | bayesian | LSI | bishop | human | could be
http://news.yahoo.com/s/ap/20071229/ap_on_re_as/pakistan | World | World | World | World |
http://news.yahoo.com/s/ap/20071229/ap_on_re_us/carnation_killings | Entertainment | Entertainment | World | World |
http://news.yahoo.com/s/ap/20071229/ap_on_re_us/missing_woman | Health | Entertainment | Sports | World(?) | Local, Violence
http://news.yahoo.com/s/ap/20071229/ap_on_re_us/tiger_escapes | Health | Scitech | Health | World(?) | Local, Tabloid
http://news.yahoo.com/s/ap/20071229/ap_on_re_af/kenya_elections | World | World | Business | World(?) | Politics, Foreign
http://news.yahoo.com/s/ap/20071229/ap_on_re_la_am_ca/venezuela_colombia_hostages | World | World | World | World |
http://news.yahoo.com/s/ap/20071229/ap_on_el_ge/youth_vote_line_legacy | Health | Scitech | World | World(?) | Politics, Foreign
http://news.yahoo.com/s/ap/20071229/ap_on_en_mo/people_sean_penn | Entertainment | Business | Entertainment | Entertainment |
http://news.yahoo.com/s/ap/20071229/ap_on_fe_st/odd_phony_address | Health | Entertainment | Health | Entertainment | Tabloid, Violence, Fun
http://news.yahoo.com/s/ap/20071229/ap_on_sp_bk_ga_su/bkn_magic_heat | Sports | Sports | Sports | Sports |
http://news.yahoo.com/s/nm/20071229/ts_nm/pakistan_crisis_dc | World | World | World | World | Politics, Foreign
http://news.yahoo.com/s/nm/20071229/pl_nm/usa_politics_dc | World | World | World | World | Politics, Foreign
http://news.yahoo.com/s/nm/20071229/ts_nm/kenya_election_dc | Sports | Business | Business | World(?) | Politics, Foreign
http://news.yahoo.com/s/nm/20071228/pl_nm/bush_defense_dc | World | Scitech | Business | Business | Law
http://news.yahoo.com/s/nm/20071228/ts_nm/chad_france_dc | World | World | World | World |
http://news.yahoo.com/s/nm/20071229/ts_nm/australia_detainee_dc | World | Scitech | World | World | Violence, Terrorism
http://news.yahoo.com/s/nm/20071228/ts_nm/cuba_castro_dc | World | World | Business | World(?) | Politics, Foreign
http://news.yahoo.com/s/nm/20071229/ts_nm/colombia_hostages_dc | World | World | World | World |
http://news.yahoo.com/s/afp/20071229/ts_afp/pakistanattacksbhutto | World | World | World | World(?) | Terrorism, Violence, Politics
http://news.yahoo.com/s/afp/20071229/ts_afp/kenyavoteodinga | World | World | World | World(?) | Politics, Foreign

In other runs, the output showed the same tendency to group results around ‘World’. It’s a result of poor tag choices (cases where there should have been a Politics, Law, or Foreign tag) and of a poor database (cases where World was chosen over other ‘known’ tags, and where Health and Entertainment got mixed up).

The conclusions I drew after all the testing are:

1) I am happy that I could skip a lot of trouble by scripting tests to get a better feel for how each algorithm works. It would be a PITA to do that in ANSI C just to check which one would be best for what I want to do.

2) Although there are some buggy libraries, things seem to flow smoothly after picking up the pace. I managed to patch classifier myself and took the opportunity to look at the gem internals. Check http://rubyforge.org/forum/forum.php?thread_id=10069&forum_id=2816 for more info, but it is basically a change forced by Ruby’s GSL binding, which demands that alloc be called instead of new (see the sketch after this list). Take some time to browse through the source code.

3) Automatic tag classification is possible and works great. I managed to run tests with bigger databases and my results kept getting better. It must be allied with a feedback routine though, so monkey classification must go on!

4) Pick LSI if possible. I still want to test CRM114 in Ruby; I will update this post as soon as I manage to discover what went wrong.

5) Choose your main tags wisely if you don’t want users to create their own tags. You may use tagging for internal purposes only (as I do), and there is no point in freezing a group of tags just for the sake of it. Do some experimentation.
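
For reference, the patch mentioned in item 2 boils down to the kind of change sketched below. The exact lines differ between classifier versions, so treat this as an illustration of the idea, not the literal diff: wherever the gem’s LSI code builds GSL matrices or vectors with new, newer versions of the Ruby GSL binding want alloc instead.

#!/usr/bin/env ruby

require 'rubygems'
require 'gsl'

rows, cols = 3, 4

# old style, which the newer binding rejects:
#   matrix = GSL::Matrix.new(rows, cols)
#   vector = GSL::Vector.new(cols)
# what the patched code has to call instead:
matrix = GSL::Matrix.alloc(rows, cols)
vector = GSL::Vector.alloc(cols)

puts matrix.size1   # 3 rows
puts vector.size    # 4 elements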

There is another very useful algorithm you may check, the “Porter Stemmer”, which is used to remove inflections and common variations from English words. It’s an important step for classification and indexing, because it provides a normalisation of terms. It may work in other languages; in Portuguese, we would look for something like the word base or the ‘radical’ part of the word.

Normalisation is the whole process of turning the text block into an intermediary representation that highlights common elements and optimizes classification and indexing. In some cases, part of this process would be running a spell checker before any other processing.

To know more about Porter Stemmer, check the author’s page, http://tartarus.org/martin/PorterStemmer/index.html.

-File: stemtest.rb


#!/usr/bin/env ruby

require 'rubygems'
require 'stemmer'

puts "category".stem
puts "categorization".stem

puts "programming".stem
puts "programmer".stem

puts "tagging".stem
puts "tags".stem

The results:


$ ruby stemtest.rb
categori
categor
program
programm
tag
tag

Beyond spell checking and stemming, normalisation would include operations such as lowercasing, punctuation removal, and word relevance checks (e.g. ‘the’ and ‘to’ have lower relevance for indexing and classification than ‘hamburger’ and ‘baconcheeseburger’).
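
Putting those steps together, here is a minimal normalisation sketch in Ruby, reusing the stemmer gem from above. The stopword list and the normalize helper are made up for illustration and would need tuning for real use:

#!/usr/bin/env ruby

require 'rubygems'
require 'stemmer'

# illustrative list of low-relevance words only
STOPWORDS = %w[the to a an of and or in on for is are]

# lowercase, strip punctuation, drop low-relevance words, stem the rest
def normalize(text)
    words = text.downcase.gsub(/[^a-z\s]/, ' ').split
    words.reject { |w| STOPWORDS.include?(w) }.map { |w| w.stem }
end

p normalize("The programmer is tagging categories for the baconcheeseburger!")
# prints something like ["programm", "tag", "categori", "baconcheeseburg"]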

Updates:

– 03/12/2008: Fixed some typos and used the WordPress sourcecode tag to help with code formatting. Added a comment about patching classifier, which was already noted in the source code for rssclass.rb. A little more on why I use an RSS feed instead of parsing the main page: it’s easy, since I already have code watching RSS feeds to aggregate news.


– 03/12/2008: added back the stemming and normalisation part.


2 Responses to “Practical text classification with Ruby”

  1. vivekified Says:

    Hi. The post is good and useful. I am working on sentence classification: based on the sentence, I need to identify if it falls under news/entertainment etc. Which algorithm do you think works best for this?

  2. gm Says:

    You could use the same approach I’ve used for the whole news classification. Just make sure you train it with a base which reflects what is news or entertainment for a given phrase. It will be broken into tokens anyway.

