Hpricot is an (x)HTML parser for Ruby, which is especially useful when you need to do some quick
scraping. Regular expressions are a great tool for quick and dirty hacks, but in some cases they
get pretty fragile. A good example is a site driven by a CMS where the owner (or editor/writer)
has the freedom to format the text with as many newlines and tags as he wishes.
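As a hypothetical illustration of that fragility, here is a made-up price regexp that works on tidy markup and silently fails the moment an editor sprinkles in newlines and an extra tag:

```ruby
# A regexp tuned to tidy markup (hypothetical example)
pattern = /<b>Price: \$([\d.]+)<\/b>/

tidy  = "<b>Price: $10.00</b>"
messy = "<b>\nPrice:\n<span>$10.00</span>\n</b>"  # same data, CMS-mangled

puts pattern.match(tidy)[1]        # => 10.00
puts pattern.match(messy).nil?     # => true -- the regexp silently misses
```

A parser that works on the element tree instead of the raw text is indifferent to this kind of reformatting.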

Enter Hpricot and its XPath-style syntax to help minimize this problem. You may still end up doing some
regexp work, but it will be considerably less complex and much more specific.

Hpricot is also fast to learn and use, and easy to understand. Once your document is parsed, you can
remove or add tags, search, filter, and save it again.

In one of my side projects, I scrape various sites to gather information about books, videos and downloads.

One of these scrapers looks up a given ISBN on Amazon.com and gets back as much info as possible.
The first version of this script was in Perl, which worked pretty well, but as I'm committed to using Ruby,
a port was on the way. Hpricot really provided an easy way to do that.

The code:

#------------------ isbnscrap.rb ------------------------
#!/usr/bin/env ruby

#
# ISBN screen scraping:
# shows how to use Hpricot to
# scrape book info from a seller.
# Hpricot uses XPath-style notation to reach
# (x)HTML elements, falling back to regexps only when necessary.
# We're after the book title, author name and price, if available.

require 'rubygems'
require 'open-uri'
require 'hpricot'

class ISBNScrap
	def initialize (isbn)
		@urlAmazon="http://www.amazon.com/s/&url=search-alias=stripbooks?field-keywords=#{isbn}"
	end

	def searchAmazon
		puts "Scraped URL: "+@urlAmazon
		begin
			doc = Hpricot(open(@urlAmazon))
			keywords = doc.at("meta[@name='keywords']")['content'].split(',')
			puts "Title: "+keywords[0]
			puts "Author: "+keywords[1]

			# inner_html returns "" (not nil) when nothing matches
			price = (doc/"table.product//tr//td/b.price").inner_html
			puts "Price: "+price unless price.empty?

			(doc/"table//td.bucket//li").each { |bucket|
				next if bucket.at("b").nil?  # skip list items without a label
				info = bucket.at("b").inner_html.gsub("\n", "")
				print info
				# some cleaning up: removes <b> and <a> tags, &nbsp; and \n --
				# we just need the text
				(bucket/"b").remove
				(bucket/"a").remove
				valueTxt = bucket.inner_html.gsub("\n", "").gsub("&nbsp;", "")
				# regexp/splitting for really messy html
				if info =~ /Average/
					# the rating image is named stars-M-N for an M.N rating
					puts "#{$1}.#{$2}" if valueTxt =~ /stars-(\d)-(\d)/
				elsif info =~ /Sales/
					rank = valueTxt.split
					puts rank[0]
				else
					puts valueTxt
				end
			}
		rescue
			# $! is an Exception, not a String, so use #message
			puts "Err: #{$!.message}"
			puts "Trace:"
			puts $!.backtrace
		end
	end
end

isbn = ARGV[0] || '0451167716' # if no argument is given, default to the bible of correct social behavior

isbnScrap = ISBNScrap.new(isbn)

isbnScrap.searchAmazon
#------------------ isbnscrap.rb ------------------------


Title and author come from a meta tag, the price from a special "b" tag (table.product//tr//td/b.price), and the rest
from the contents of a list (table//td.bucket//li) inside a table. Amazon's code is pretty messy, but there is no reason to complain
about it, since we are using it for something other than its original intent. Most HTML on the web is like that - or worse - and
it makes for an interesting challenge.
While looking into the <li> tags, I ended up using regexp matching and scanning to filter the Average Review and Sales Rank. I pretty much had
to figure out the rating system (the rating is an image named stars-M-N, where M and N are numbers to be read in the M.N format),
and to clean up empty lines and other undesirable tags.
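That star-rating trick can be tried in isolation. A standalone sketch of the same regexp idea (the helper name and sample URL are made up):

```ruby
# Turn a rating image name like "stars-4-5" into the numeric rating "4.5"
def rating_from_image(src)
  return nil unless src =~ /stars-(\d)-(\d)/
  "#{$1}.#{$2}"
end

puts rating_from_image("http://images.amazon.com/images/stars-4-5.gif")  # => 4.5
puts rating_from_image("plain.gif").inspect                              # => nil
```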

In the case of The Godfather, the rating was 4.5, so the image was called stars-4-5. You can debug the script by printing the loop element (bucket) and exploring its methods.
Note that this scraper is more robust than a purely regexp-based one, but it can still break. It's important to provide a validation mechanism
for the data you gather this way. Checking for nulls is usually not enough, so I recommend keeping track of a unique identifier inside the HTML.
That's easy on sites like Blogspot and other blogs, which keep a UID or post id in each post.
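A hypothetical sanity check along those lines - confirm the page still carries the post id you recorded on an earlier visit before trusting any scraped fields (the id attribute format here is an assumption):

```ruby
# Hypothetical validation: only trust scraped data when the page
# still contains the unique id we saw on a previous visit
def page_valid?(html, expected_uid)
  html.include?(%Q{id="post-#{expected_uid}"})
end

html = '<div id="post-1234" class="entry">...</div>'
puts page_valid?(html, "1234")   # => true
puts page_valid?(html, "9999")   # => false, layout or content changed
```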
Edit: While XPath may look easy at first glance, it can get very complicated when you're digging through a deep, obfuscated HTML tree.
So, to lend a helping hand, check this tutorial about Firebug, a nice Firefox extension which, among other fine features, has a 'Copy XPath' option.
