Scraping with Firebug and WWW::Mechanize

November 11, 2007

When scraping a website for information, the most time-consuming part is locating what you need and working out how it is enclosed in the markup. Automatically generated HTML is often convoluted because of templating systems. Handmade HTML tends to be cleaner, but it’s not so common these days.

Firebug is an extension for Firefox which, among other things, can help you find URLs and XPath expressions for specific elements, discover action names, find out how forms are handled, and so on.

Having a full XPath or the right URL for a form in a few clicks is a great productivity improvement. To show how to do it, I will download the contacts stored in a Gmail account.

First and foremost, we need to know how to export contacts manually. It’s a matter of logging in, clicking the Contacts link below your folders, clicking the export button, selecting the proper options (All contacts, Outlook CSV format) and clicking another export button.

More than XPath harvesting, what we need here is an automation tool that can navigate to the right URL. Better yet, if we know the export action URL, we do not need to simulate ‘clicking’ as most automation libraries do.

Apart from Firefox loaded with Firebug, we will use Ruby and WWW::Mechanize. WWW::Mechanize uses Hpricot to handle XPath and has nice features like a cookiejar to handle all cookies, redirection following and form handling.

The first step is to log in using Gmail’s form. It’s a simple HTML form, the first one on the page. Let’s find out the names of the input fields. Start Firefox, point it to https://www.gmail.com and activate Firebug by clicking the icon in the lower right corner.

login page

Use the inspect feature to see the HTML code for a given element. Inspect may return its full XPath or DOM name. Take some time to explore the login screen and note that the field names are Email and Passwd, and that they are case-sensitive. To log in using WWW::Mechanize, the code would look like this:

agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }
page = agent.get('https://www.gmail.com')

form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'

page = agent.submit(form)
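
For intuition about what that submit does: a form post is, roughly, a URL-encoded body built from the field names and their values. The sketch below uses only Ruby’s standard library to build such a body from the two field names found with Firebug. The helper name login_body is made up for illustration, and the real login form may also carry hidden fields that Mechanize sends along for you, so treat this as an illustration rather than a replacement for agent.submit.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize): builds the URL-encoded
# body for the two visible login fields. The field names are
# case-sensitive, so 'Email' and 'Passwd' must be spelled exactly
# as they appear in the form.
def login_body(email, password)
  URI.encode_www_form('Email' => email, 'Passwd' => password)
end

puts login_body('username', 'passwd')
```

Writing 'email' or 'passwd' in lowercase would produce a perfectly valid body that the server would most likely ignore, which is why it pays to copy the names straight from Firebug.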

After logging in, Mechanize will take care of any redirections and cookies. We may proceed to request any other page.

Our goal is to export a contact list, and clicking our way to it is not the smartest idea. We need the exact URL to get it. Let’s find it:

contact management

export contacts

Enable Firebug, select the ‘Net’ tab and click export.

Contact export screen

Check Firebug’s Net tab for the list of network requests. There we will find the exact URL we need:

network requests

Mouse over the items to see the URL value. In Gmail it will be the one labelled export, but go on and look at the other background requests it makes.

contact list download

The contact list export URL is http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV.
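
It can help to read that URL as a base address plus three query parameters. The following is a small sketch using Ruby’s standard URI library that rebuilds the URL from a hash and then splits it apart again. The helper name export_url is an invention for this example, the parameter names and values are simply the ones observed in Firebug, and whether the server accepts other values is not something this post verifies.

```ruby
require 'uri'

# Hypothetical helper: rebuilds the export URL from a hash of options.
# The default base is the address seen in Firebug's Net tab.
def export_url(params, base = 'http://mail.google.com/mail/contacts/data/export')
  "#{base}?#{URI.encode_www_form(params)}"
end

url = export_url('exportType' => 'ALL', 'groupToExport' => '', 'out' => 'OUTLOOK_CSV')
puts url

# Going the other way: split the URL back into name/value pairs.
URI.decode_www_form(URI.parse(url).query).each do |name, value|
  puts "#{name} = #{value.inspect}"
end
```

Keeping the parameters in a hash makes it easier to see which part of the URL corresponds to which option chosen on the export screen.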

After logging in, it’s a matter of just requesting this URL and saving the file:

page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')

page.save_as('gmail_contacts.csv')
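
One thing that can go wrong quietly: if the login did not succeed, the server may answer the export request with an HTML page instead of the contact list, and that page is what would end up in the file. Below is a rough sanity check, assuming that an HTML answer starts with a doctype or an html tag. The helper looks_like_html? and both sample strings are made up for illustration.

```ruby
# Hypothetical helper: true when a response body looks like an HTML
# page rather than CSV data. Only the first 200 characters are checked.
def looks_like_html?(body)
  head = body.to_s.lstrip[0, 200].to_s.downcase
  head.start_with?('<!doctype', '<html') || head.include?('<head')
end

# Made-up sample bodies, not real Gmail output.
sample_csv  = "Name,E-mail Address\nJane Doe,jane@example.com\n"
sample_html = '<html><head><title>Sign in</title></head></html>'

puts looks_like_html?(sample_csv)
puts looks_like_html?(sample_html)
```

With Mechanize, such a check could be run on page.body before calling save_as, so a failed login is noticed right away instead of when the CSV is opened later.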

And that’s it.

Check Firebug’s documentation and scripts to learn other ways it can spare you heavy work. See the full script below.

------- gmail-scrap.rb -------

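#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'
require 'logger'

# Create the agent and log every request to gmail.log
agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }

# Fetch the login page; the login form is the first form on it
page = agent.get('https://www.gmail.com')

# Field names are case-sensitive; replace the values with your own credentials
form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'

# Submit the form; Mechanize follows redirects and keeps the cookies
page = agent.submit(form)

# Request the export URL found with Firebug's Net tab
page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')

# Save the response body as a CSV file
page.save_as('gmail_contacts.csv')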

------- gmail-scrap.rb -------
