icalendar gem
November 19, 2007
iCalendar (iCal) is a standard for calendar data interchange. There’s a gem called icalendar, which helps to parse and generate such files, so you may use data from your Google or Exchange calendar to feed your app (or make it generate data to feed your calendar, e.g., a link to Digg or Facebook in each post of your blog to set up a TODO item).
To parse an .ics file (an iCal invite or TODO item), it’s just a matter of looping through the elements of a given calendar. An .ics file may hold more than one calendar, and each calendar may contain events and TODO items.
#!/usr/bin/env ruby
require 'rubygems'
require 'icalendar'

if ARGV.empty?
  puts "Usage: ical_parse.rb <calendar.ics>"
  exit
end

# Icalendar.parse returns an array with one entry per calendar in the file.
cals = File.open(ARGV[0]) { |cal_file| Icalendar.parse(cal_file) }

if cals.empty?
  puts "Empty calendar"
  exit
end

separator = "---------------------------------------"

cals.each do |c|
  puts "\nEvents\n\n"
  if c.events.empty?
    puts "Empty event list"
  else
    c.events.each do |e|
      puts separator
      # Interpolation avoids a TypeError when an optional property is nil.
      puts "Seq: #{e.sequence}"
      puts "UID: #{e.uid}"
      puts "DTSTART: #{e.dtstart}"
      puts "summary: #{e.summary}"
      puts "location: #{e.location}"
      puts "description: #{e.description}"
      unless e.attendees.nil?
        puts "attendee: "
        e.attendees.each { |a| puts "\t#{a.to}" }
      end
      puts separator
    end
  end

  puts "\nTODO\n\n"
  todos = c.todos
  if todos.empty?
    puts "Empty TODO list"
  else
    todos.each do |t|
      puts separator
      puts "Seq: #{t.sequence}"
      puts "UID: #{t.uid}"
      puts "DTSTART: #{t.dtstart}"
      puts "summary: #{t.summary}"
      puts separator
    end
  end
end
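To make the "more than one calendar, each with events and TODO items" layout concrete, here is a minimal sketch in plain Ruby that does not use the icalendar gem at all. It only counts the BEGIN: lines in a small hand-written sample. The sample text, the sample_ics variable and the count_components helper are made up for illustration; a real file carries many more properties.

```ruby
# Hypothetical sample: two calendars, the first one holding an event and a TODO.
sample_ics = <<~ICS
  BEGIN:VCALENDAR
  VERSION:2.0
  BEGIN:VEVENT
  UID:event-1@example.com
  SUMMARY:Team meeting
  END:VEVENT
  BEGIN:VTODO
  UID:todo-1@example.com
  SUMMARY:Write report
  END:VTODO
  END:VCALENDAR
  BEGIN:VCALENDAR
  VERSION:2.0
  END:VCALENDAR
ICS

# Illustrative helper: count how many components of each type open in the text.
def count_components(ics_text)
  counts = Hash.new(0)
  ics_text.each_line do |line|
    # A component starts with a line such as "BEGIN:VEVENT".
    name = line.strip[/\ABEGIN:(\w+)\z/, 1]
    counts[name] += 1 if name
  end
  counts
end

counts = count_components(sample_ics)
counts.each { |name, total| puts "#{name}: #{total}" }
```

The VCALENDAR blocks are what the script above receives as the array of calendars, and the VEVENT and VTODO blocks inside each one are what it loops over as events and TODO items.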
scraping with firebug and www::mechanize
November 11, 2007
When scraping a website for information, the most time-consuming part is locating what you need and working out how it’s enclosed in the markup. Most of the time, automatically generated HTML is pretty convoluted due to templating systems. Handmade HTML tends to be cleaner, but it’s not so common these days.
Firebug is an extension for Firefox which, among other things, can help you find URLs and XPath expressions for specific elements, discover action names, find out how forms are handled, and so on.
Having a full XPath or the right URL for a form in a few clicks is a great productivity improvement. To show how to do it, I will download my contacts stored in a GMail account.
First and foremost, we need to know how to export contacts manually. It’s a matter of logging in, clicking the Contacts link below your folders, clicking the export button, selecting the proper options (All contacts, Outlook CSV format) and clicking another export button.
More than XPath harvesting, what we need is an automation tool that can navigate to the right URL. Better yet, if we know the export action URL, we don’t need to simulate ‘clicking’ the way most automation libraries do.
Apart from Firefox loaded with Firebug, we will use Ruby and WWW::Mechanize. WWW::Mechanize uses Hpricot to handle XPath and has nice features like a cookiejar to handle all cookies, redirection following and form handling.
The first step is to log in using Gmail’s form. It’s a simple HTML form, the first one on the page. Let’s find out the names of the input fields. Start Firefox, point it to https://www.gmail.com and activate Firebug by clicking the icon in the lower right corner.
Use the inspect feature to see the HTML code for a given element. Inspect may also return its full XPath or DOM name. Take some time to explore the login screen and note that the field names are Email and Passwd, and that they are case-sensitive. To log in using WWW::Mechanize, the code would look like this:
agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }
page = agent.get('https://www.gmail.com')
form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'
page = agent.submit(form)
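To see why the exact field names matter, it may help to look at what a form submission boils down to: a list of name=value pairs sent to the server. The sketch below builds such a body with Ruby’s standard URI module only. It assumes a form holding just these two fields with placeholder credentials; the real login page most likely carries extra hidden fields, which mechanize sends along for you.

```ruby
require 'uri'

# Field names as found with Firebug; the values are placeholders.
fields = { 'Email' => 'username', 'Passwd' => 'passwd' }

# Encode the pairs the way a browser encodes a submitted form.
body = URI.encode_www_form(fields)
puts body
```

A misspelled or lowercased name would simply produce a different key in that body, which is why a server expecting Email would not recognize email.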
After logging in, mechanize will take care of any redirection and cookies. We can go on to request anything else we need.
Our goal is exporting a contact list, and clicking our way to it is not the smartest idea. We need the exact URL to get it. Let’s find it:
Enable Firebug, select the ‘Net’ tab and click the export button.
Check Firebug’s console for the list of net requests. There we will find the exact URL we need:
Mouse over the items to see the URL value. In Gmail it will be the one labelled export, but go on and look at the other background requests it makes.
The contact list export URL is http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV.
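The URL is easier to read once it is split into its parts. Here is a small sketch using only Ruby’s standard URI module; the variable names are made up for illustration, and reading exportType and out as the ‘All contacts’ and ‘Outlook CSV format’ options picked in the export dialog is an assumption based on their names.

```ruby
require 'uri'

# The export URL as seen in Firebug's Net tab.
export_url = 'http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV'

uri = URI.parse(export_url)

# Turn the query string into a hash of parameter names and values.
params = URI.decode_www_form(uri.query).to_h

puts "host: #{uri.host}"
puts "path: #{uri.path}"
params.each { |key, value| puts "#{key} = #{value.inspect}" }
```

Note that groupToExport comes through as an empty string: the parameter is present in the request but carries no value.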
After logging in, it’s a matter of just requesting this URL and saving the file:
page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')
page.save_as('gmail_contacts.csv')
And that’s it.
Check Firebug’s documentation and scripts to learn other ways it can save you heavy work. See the full script below.
——- gmail-scrap.rb ————
#!/usr/bin/env ruby
require 'rubygems'
require 'mechanize'
require 'logger'

# Log every request and response to gmail.log.
agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }

# Fill in and submit the login form, the first form on the page.
page = agent.get('https://www.gmail.com')
form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'
page = agent.submit(form)

# Request the export URL directly and save the CSV.
page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')
page.save_as('gmail_contacts.csv')
——- gmail-scrap.rb ————