RubyConfBR recap

November 1, 2010

Last week I presented at RubyConf here in Brazil. It was the first edition named RubyConf, but the folks behind it already had a tradition with Rails Summit. It was a huge event, especially considering that the main focus was a programming language. I have no exact numbers, but people were talking about more than 600 attendees. The site is still up and you can see for yourself that there were a lot of interesting people talking.

I was invited by Fabio Akita to speak about non-blocking I/O. It was the first time here in Brazil that I got the opportunity to talk about such topics, and I’m glad that all the conferences I attended or presented at this year had more technical topics than vendor-related content and superficial takes on methodologies.

I had the opportunity to talk with Jim Weirich and Blaine Cook about what they are doing, and with other folks that I only happen to see once in a while, even living in Brazil.

Nando Vieira’s lightning talk was one of the best I saw, along with Blaine Cook talking about WebFinger and Emerson Macedo about Node.js.

There goes my slide deck

 


More on memcached

February 17, 2009

Sharing memcached data between different applications is useful and easy, whether as a glorified IPC mechanism, a robust distributed cache, rate limit control or any other architectural approach you come up with.

There are some caveats though:

  1. The Captain Obvious one: if that’s your case, make sure the way you store your data is readable from all the languages involved. Storing data in Python and reading it back in Java or Ruby is trivial, but persisting language-specific objects (a pickled Python object, or the marshaled objects Rails is prone to store) may render the data almost unreadable from the other side. Try to use simple serialization formats if possible (YAML, JSON, XML).
  2. The other Captain Obvious one: saving and invalidating data must be done by the application responsible for its integrity, for simplicity’s and safety’s sake. Remember cache 101: a cache is not a database. It’s not searchable, and its data must reflect a coherent data source.
  3. The not so obvious one: if you use more than one memcached server, make sure both clients understand the hashing algorithm used to select the right server for the key you are asking for. When using the same language and client this is transparent, but there are different known ways to select the right server (see the sketch right after this list), such as:
  • an MD5 hash of the key
  • a CRC32-based hash
  • the native hash (as String.hashCode() in Java)
  • a pure magic hash (some clients implement non-standard hashing of their own)
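
Here is a minimal Ruby sketch (not taken from any real client) of why this matters: depending on the scheme, the same key can be routed to different servers. The MD5 variant below is only illustrative, not the exact algorithm any particular client implements.

require 'zlib'
require 'digest/md5'

key = "mykey"
servers = ["10.0.0.1:11211", "10.0.0.2:11211"] # hypothetical pool

# CRC32-based hash, the scheme memcache-client uses
crc_index = ((Zlib.crc32(key) >> 16) & 0x7fff) % servers.size

# An MD5-derived hash, just to illustrate a different scheme
md5_index = Digest::MD5.digest(key).unpack("N").first % servers.size

puts "CRC32 scheme picks: #{servers[crc_index]}"
puts "MD5 scheme picks:   #{servers[md5_index]}"
# If two clients disagree on the scheme, sets and gets may land on different boxes.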

The case in point is a Ruby application using the memcache-client gem and a Java application using Whalin’s client. If you use more than one server, the Ruby client uses its own algorithm, which is CRC32-based. The Java client defaults to a NATIVE_HASH-based algorithm, but ships with 3 more. Keys would never get correct hits this way.

Let’s see how it works:

Code for hashing in Ruby (straight from the memcache-client gem):


# Note that the method crc32_ITU_T is a patch memcache-client applies to the String class

def hash_for(key)
  (key.crc32_ITU_T >> 16) & 0x7fff
end

Code for the right hashing algorithm in Java:


private static long newCompatHashingAlg( String key ) {
    CRC32 checksum = new CRC32();
    checksum.update( key.getBytes() );
    long crc = checksum.getValue();
    return (crc >> 16) & 0x7fff;
}

The algorithm is selected by this piece of code in Whalin’s memcached client library:

case NATIVE_HASH:
    return (long)key.hashCode();
case OLD_COMPAT_HASH:
    return origCompatHashingAlg( key );
case NEW_COMPAT_HASH:
    return newCompatHashingAlg( key );
case CONSISTENT_HASH:
    return md5HashingAlg( key );
default:
    // use the native hash as a default
    hashingAlg = NATIVE_HASH;
    return (long)key.hashCode();

So, before using the Java client, we need to call setHashingAlg( SockIOPool.NEW_COMPAT_HASH ) on the right SockIOPool object.

That’s it.

Now, for a change …

Really unnecessary section!

We can test the CRC32-based algorithm like this:

Start irb and type:

irb --> require "rubygems"
    ==> true
irb --> require "memcache"
    ==> true
irb --> a = "mykey"
    ==> "mykey"
irb --> (a.crc32_ITU_T() >> 16) & 0x7fff
    ==> 17510

From this, we see that 17510 is the resulting hash for the “mykey” key.

The memcache client was required just to attach the crc32_ITU_T() method to the String class; if you don’t want to install the gem, just paste the following code (which is part of memcache-client) instead:


class String

  ##
  # Uses the ITU-T polynomial in the CRC32 algorithm.

  def crc32_ITU_T
    n = length
    r = 0xFFFFFFFF

    n.times do |i|
      r ^= self[i] # on Ruby 1.8, String#[] returns the byte value; on Ruby 1.9+ use self.getbyte(i)
      8.times do
        if (r & 1) != 0 then
          r = (r>>1) ^ 0xEDB88320
        else
          r >>= 1
        end
      end
    end

    r ^ 0xFFFFFFFF
  end

end

Let’s test it on Java’s end:

TestCRC.java


import java.util.zip.CRC32;

public class TestCRC {

        public static void main(String[] args) {

                CRC32 checksum = new CRC32();
                checksum.update("mykey".getBytes());
                long crc = checksum.getValue();
                System.out.println(((crc >> 16) & 0x7fff));
        }
}

Compile and run as:

$ javac TestCRC.java
$ java -cp . TestCRC
17510

Again, 17510, as in Ruby. That’s the right value for “mykey”.

Both cases yielded 17510. That hash is then taken modulo the number of machines in the pool (e.g. 2), and the result is the index of the right server, both in Java and in Ruby. Weee.
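
In Ruby terms, a rough sketch of that last step (a simplification; real clients also deal with server weights and dead hosts):

require 'zlib'

# Hypothetical two-server pool
servers = ["cache1.example.com:11211", "cache2.example.com:11211"]

# Standard CRC32 gives the same value as the crc32_ITU_T/CRC32 snippets above: 17510 for "mykey"
hash = (Zlib.crc32("mykey") >> 16) & 0x7fff

# The hash modulo the pool size is the index of the server that owns the key
puts servers[hash % servers.size]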

I’ve been trying to finish this post since the middle of December, but after about 4 almost-complete rewrites I’ve decided to put it online. I still mean to make it better, because I didn’t want to sound cocky or give the wrong impression that it is about evaluating the best text classification algorithm out there.

Here it goes: Practical text classification with Ruby

Thanks to Renato for reviewing it beforehand.

I wrote this guide as a result of one of the cross-compiling oddities I’ve been through over the last month. Let me know if you have any suggestions about the build process I’ve been using.

icalendar gem

November 19, 2007

iCalendar (iCal) is a standard for calendar data interchange. There’s a gem called icalendar which helps to parse and generate such files, so you may use data from your Google or Exchange calendar to feed your app (or make your app generate data to feed your calendar, e.g., a link to Digg or Facebook in each post of your blog to set up a TODO item).
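
As a rough sketch of the generating side (assuming the block-style API documented by the icalendar gem; attribute names and call style may differ between gem versions):

#!/usr/bin/env ruby
require 'rubygems'
require 'date'
require 'icalendar'

cal = Icalendar::Calendar.new
cal.event do
  dtstart     DateTime.civil(2007, 11, 20, 9, 0)
  dtend       DateTime.civil(2007, 11, 20, 10, 0)
  summary     "Review blog TODO items"
  description "Example event generated with the icalendar gem"
end

# Dump the calendar as .ics text, ready to be imported elsewhere
puts cal.to_ical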

To parse a .ics file (an iCal invite or TODO item), it’s just a matter of looping through the elements in a given calendar. An .ics file may hold more than one calendar, and each calendar may contain events and TODO items.

#!/usr/bin/env ruby
require 'rubygems'
require 'icalendar'

if ARGV.size < 1
  puts "Usage: ical_parse.rb <calendar.ics>"
  exit
end

cal_file = File.open(ARGV[0])

cals = Icalendar.parse(cal_file)
if cals.size == 0
  puts "Empty calendar"
  exit
end

cals.each do |c|
  puts "\nEvents\n\n"

  if c.events.size == 0
    puts "Empty event list"
  else
    c.events.each do |e|
      puts "---------------------------------------"
      puts "Seq:" + e.sequence.to_s
      puts "UID:" + e.uid.to_s
      puts "DTSTART: " + e.dtstart.to_s
      puts "summary: " + e.summary
      puts "location: " + e.location
      puts "description: " + e.description

      unless e.attendees.nil?
        puts "attendee: "
        e.attendees.each { |a| puts "\t" + a.to }
      end

      puts "---------------------------------------"
    end
  end

  puts "\nTODO\n\n"

  t = c.todos
  if t.size == 0
    puts "Empty TODO list"
  else
    puts "---------------------------------------"
    t.each do |oi|
      puts "Seq:" + oi.sequence.to_s
      puts "UID:" + oi.uid.to_s
      puts oi.dtstart
      puts "summary " + oi.summary
    end
    puts "---------------------------------------"
  end
end

When scraping a website for info, the most time-consuming part is locating what you need and how it’s enclosed. Most of the time, automatically generated HTML can be pretty convoluted due to templating systems. Hand-made HTML tends to be cleaner, but it’s not so common these days.

Firebug is an extension for Firefox which, among other things, can help you find the URL or XPath for certain elements, discover action names, find out how forms are handled, and so on.

Having a full XPath or the right URL for a form in a few clicks is a great productivity improvement. To show how to do it, I will download my contacts stored in a GMail account.

First and foremost, we need to know how to export contacts manually. It’s a matter of logging in, clicking the Contacts link below your folders, clicking the export button, selecting the proper options (All contacts, Outlook CSV format) and clicking another export button.

What we need, more than XPath harvesting, is an automation tool that can navigate to the right URL. Better yet, if we have the export action URL we won’t need to simulate ‘clicking’ the way most automation libraries do.

Apart from Firefox loaded with Firebug, we will use Ruby and WWW::Mechanize. WWW::Mechanize uses Hpricot to handle XPath and has nice features like a cookie jar to handle all cookies, redirection following and form handling.
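
For instance, once Firebug hands you an XPath, a Mechanize page can be queried with it directly (a rough sketch; the URL and XPath below are made up for illustration):

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get('http://example.com/')

# Hypothetical XPath copied from Firebug's inspect feature
page.search("//div[@id='content']/a").each do |link|
  puts link['href']
end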

The first step is logging in using GMail’s form. It’s a simple HTML form, the first one on the page. Let’s find out the names of the input fields. Start Firefox, point it to https://www.gmail.com and activate Firebug by clicking the icon in the lower right corner.

login page

Use the inspect feature to see the HTML code for a given element; inspect may return its full XPath or DOM name. Take some time to explore the login screen and note that the fields’ names are Email and Passwd, and they are case-sensitive. To log in using WWW::Mechanize the code would be like:

agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }
page = agent.get('https://www.gmail.com')

form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'

page = agent.submit(form)

After logging in, Mechanize will take care of any redirection and cookies. We may proceed to request any other URL.

Our goal is exporting a contact list, and clicking our way to it is not the smartest idea. We need the exact URL to get it. Let’s find it:

contact management

export contacts

Enable Firebug, select the ‘Net’ tab and click export.

Contact export screen

Check Firebug’s console for the list of net requests. There we will find the exact URL we need:

network requests

Mouse over the items to see the URL value. In GMail it will be the one labelled export, but go on and look at the other background requests it makes.
contact list download

The contact list export URL is http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV.

After logging in, it’s a matter of just requesting this URL and saving the file:

page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')

page.save_as('gmail_contacts.csv')

And that’s it.

Check Firebug’s documentation and scripts to learn other ways it can spare you heavy work. See the full script below.

——- gmail-scrap.rb ————

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'
require 'logger'

agent = WWW::Mechanize.new { |obj| obj.log = Logger.new('gmail.log') }

page = agent.get('https://www.gmail.com')

form = page.forms.first
form.Email = 'username'
form.Passwd = 'passwd'

page = agent.submit(form)

page = agent.get('http://mail.google.com/mail/contacts/data/export?exportType=ALL&groupToExport=&out=OUTLOOK_CSV')

page.save_as('gmail_contacts.csv')

——- gmail-scrap.rb ————

Ruby and document indexing

October 30, 2007

I did some Ferret testing and the results were pretty good. Check it out.
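
For context, a minimal sketch of the kind of indexing and searching the test exercised (assuming Ferret’s standard Index API; the path and fields are made up):

require 'rubygems'
require 'ferret'

# Build (or open) an index on disk -- the path is just an example
index = Ferret::Index::Index.new(:path => '/tmp/ferret_test_index')

# Index a couple of documents as field => value hashes
index << { :title => "Ruby and document indexing",
           :content => "Ferret is a Lucene-inspired search library for Ruby." }
index << { :title => "Another post",
           :content => "Nothing to do with indexing." }

# Query the :content field and print matching titles with their scores
index.search_each('content:ferret') do |doc_id, score|
  puts "#{index[doc_id][:title]} (#{score})"
end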