Twitter's whole user base provides a snapshot of what is hip right now and what is not. Using its trends and search, it's possible to harvest a lot of information, and the heavy load of crawling and storing such a huge volume of data is handled by Twitter itself.


I've been trying to finish this post since the middle of December, but after about four almost complete rewrites I've decided to put it online. I still mean to improve it, because I didn't want to sound cocky or give the wrong impression that it's about finding the best text classification algorithm out there.

Here it goes: Practical text classification with Ruby

Thanks to Renato for reviewing it beforehand.

When scraping a website for information, the most time-consuming part is locating where the data you need lives and how it's enclosed. Most of the time, automatically generated HTML is pretty convoluted due to templating systems. Hand-made HTML tends to be cleaner, but it's not so common these days.

Firebug is a Firefox extension that, among other things, can help you find the URL or XPath of a given element, discover action names, figure out how forms are handled, and so on.

Having the full XPath of an element or the right URL for a form in a few clicks is a great productivity improvement. To show how it's done, I will download the contacts stored in a GMail account.

First and foremost, we need to know how to export contacts manually. It's a matter of logging in, clicking the Contacts link below your folders, clicking the Export button, selecting the proper options (All contacts, Outlook CSV format) and clicking another Export button.

More than XPath harvesting, what we need here is automation, so the script can navigate to the right URL. Better yet, if we have the export action's URL we don't need to simulate 'clicking' the way most automation libraries do.

Apart from Firefox loaded with Firebug, we will use Ruby and WWW::Mechanize. WWW::Mechanize uses Hpricot to handle XPath and has nice features like a cookie jar that keeps track of all cookies, redirection following and form handling.

The first step is logging in using GMail's form. It's a simple HTML form, the first one on the page. Let's find out the names of its input fields. Start Firefox, point it to https://www.gmail.com and activate Firebug by clicking the icon in the lower right corner.

login page

Use the inspect feature to see the HTML code of a given element; inspect can also give you its full XPath or DOM name. Take some time to explore the login screen and note that the fields are named Email and Passwd, and that they are case-sensitive. To log in using WWW::Mechanize, the code would look like this:

agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }
page = agent.get('https://www.gmail.com')

form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'

page = agent.submit(form)

After logging in, Mechanize will take care of any redirects and cookies. We can then proceed to request any other page.
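Just to illustrate the point, here is a minimal sketch of a follow-up request made with the same agent; the inbox URL below is only an example of a page that sits behind the login, not something taken from Firebug:

# A minimal sketch: any further request reuses the session cookies
# automatically, and redirects are followed for us.
inbox = agent.get('https://mail.google.com/mail/')  # example URL, assumed

puts inbox.uri    # the final URL, after any redirects Mechanize followed
puts inbox.title  # title of the fetched page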

Our goal is to export the contact list, and 'clicking' our way to it is not the smartest idea. We need the exact URL to get it. Let's find it:

contact management

export contacts

Enable Firebug, select the 'Net' tab and click Export.

Contact export screen

Check Firebug’s console for the list of net requests. There we will find the exact URL we need:

network requests

Mouse over the items to see each URL value. In GMail it will be the one labelled export, but go on and look at the other background requests the page makes.

contact list download

The contact list export URL is http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV.
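For readability, the same URL can be assembled from its query parameters. This is just a convenience sketch; the parameter values are exactly the ones Firebug reported:

require 'cgi'

# Query parameters as shown by Firebug for the export request.
params = {
  'exportType'    => 'ALL',         # export every contact
  'groupToExport' => '',            # empty: not restricted to a group
  'out'           => 'OUTLOOK_CSV'  # Outlook-compatible CSV output
}

query = params.map { |key, value| "#{key}=#{CGI.escape(value)}" }.join('&')
export_url = "http://mail.google.com/mail/contacts/data/export?#{query}"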

After logging in, it’s a matter of just requesting this URL and saving the file:

page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')

page.save_as('gmail_contacts.csv')

And that’s it.
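If you want a quick sanity check that the download worked, a small sketch using Ruby's standard CSV library will do; the exact column names depend on whatever GMail puts in its Outlook CSV export:

require 'csv'

# The first row is the header; everything else is one contact per row.
header, *contacts = CSV.read('gmail_contacts.csv')

puts "Exported #{contacts.size} contacts"
puts "Columns: #{header.join(', ')}"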

Check the Firebug documentation and scripts to learn other ways it can save you heavy lifting. The full script is below.

——- gmail-scrap.rb ————

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'
require 'logger'

# Create the agent and log every request to gmail.log.
agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }

# Fetch the login page.
page = agent.get('https://www.gmail.com')

# Fill in the login form (the first form on the page); the field
# names Email and Passwd are case-sensitive.
form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'

# Submit the form; Mechanize follows redirects and stores the cookies.
page = agent.submit(form)

# Request the contact export URL found with Firebug's Net tab.
page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')

# Save the Outlook-format CSV to disk.
page.save_as('gmail_contacts.csv')

——- gmail-scrap.rb ————