Github repo

October 29, 2009

I’m uploading almost all of the code from this blog to my GitHub repo (http://github.com/gleicon/zenmachine).

There’s some really ugly stuff and some old Ruby constructs, but all in all it’s been fun to review and update them. One of my top posts of all time, about text classification with Ruby, is getting a review and will be uploaded too.

Any code you want? Questions? Feel free to post them in the comments.

 

Twisted COMET

October 28, 2009

I was about to write a post about NGINX, Python, Twisted and COMET, but it got so long that I decided to break it into two or three parts.

The first one is a kind of follow-up to the last post. This time the subject is a script that searches Twitter for a given word and updates the results continuously. Once you point your browser to http://localhost:8000/ it will start receiving results from time to time, as one big, never-ending download (which is what COMET is about).

 


I’ve been looking into Twisted to build a COMET-based app. It’s not a hard task, given that you can tune a lot of parameters (including which kind of reactor to use), but the basics are very interesting. Combining this approach with an nginx-based architecture has given me excellent results.


To finish the holy trilogy of Twitter madness, here goes a naive friend recommender script for Twitter. It basically applies a simple idea over the social graph: from all my friends, select all their friends that I don’t follow, rank them by occurrence, filter out whoever is already in my list and recommend the rest to me.
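The idea above can be sketched in a few lines of Ruby. This is only an illustration, not the script from the post: the `recommend_friends` name and the in-memory `graph` hash (user => list of accounts they follow) are assumptions standing in for whatever the Twitter API returns.

```ruby
# Hypothetical sketch: `graph` maps each user to the list of accounts they
# follow. In a real script this data would come from the Twitter API.
def recommend_friends(graph, me, limit = 5)
  mine = graph.fetch(me, [])
  counts = Hash.new(0)

  # Count how often each friend-of-a-friend shows up.
  mine.each do |friend|
    graph.fetch(friend, []).each do |candidate|
      # Skip myself and anyone I already follow.
      next if candidate == me || mine.include?(candidate)
      counts[candidate] += 1
    end
  end

  # Rank by occurrence (ties broken by name so the order is stable).
  ranked = counts.sort_by { |name, count| [-count, name] }
  ranked.first(limit).map { |name, _count| name }
end

graph = {
  "me"    => ["alice", "bob"],
  "alice" => ["bob", "carol", "dave"],
  "bob"   => ["carol", "me"],
}
puts recommend_friends(graph, "me").inspect
```

In this toy graph "carol" is followed by two of my friends and "dave" by one, so "carol" ranks first.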

A twitter bot

August 21, 2009

One of the first experiments that I did with Python and Twitter was a bot. I was interested in testing how an interactive application would work using Twitter, and a reasonable model for me was that it would answer back when a user sent a message, after performing a given action.

Twitter’s whole user base provides us with a snapshot of what is hip and what is not right now. Using its trends and search, it’s possible to harvest a lot of information. The heavy load of crawling and storing such a huge volume is naturally handled by them.


Since a friend of mine told me about using OpenOffice as a daemon to run tasks automatically, I thought it would be nice to try it as part of a proof of concept for a SlideShare mini clone. It would be a matter of uploading the original file, converting it using OO and displaying a page along with it. There’s an API and many clients. I chose not to develop a new client and used JodConverter. Of course I would have to develop a new converter if I wanted to inject or run customized procedures over a document.


My book at google books

June 11, 2009

Long ago I wrote a book about Linux programming (Portuguese only). I just saw that it has been added to Google Books (http://books.google.com/books?id=nMoQKiQyHwwC). There are some cute previews, including a serial port circuit picture :D

link recap

May 7, 2009

Following up on this old post, there is a presentation from MySQL Conf & Expo 2009 about using memcached that shows the basic usage pattern, which by the way does not involve using memcached as a database, asking whether you can search inside its keys, and so on.

Another presentation, from Craigslist’s Jeremy Zawodny, about using Sphinx to index MySQL data is there too.

Ezra Zygmuntowicz’ Nanite was presented at Erlang Factory (which I may or may not write about later) and seems like a perfect case of AMQP and queue use (I got a pretty solid understanding of routing keys from it).

More on memcached

February 17, 2009

Sharing memcache data between different applications is useful and easy, be it as a glorified IPC, a robust distributed cache, rate limit control or any other suggested architecture approach.

There are some caveats though:

  1. The captain obvious one: where it applies, make sure the way you store your data is readable from different languages. For example, storing a pickled object in Python and reading it in Java or Ruby is trivial, but persisting some language-specific objects, as Rails is prone to do, may render the data almost unreadable. Try to use simple serialization formats if possible (like YAML, JSON, XML).
  2. The other captain obvious one: saving and invalidating data must be done by the application responsible for its integrity, for the sake of simplicity and safety. Remember cache 101: a cache is not a database. It’s not searchable, and its data must reflect a coherent source of data.
  3. The not so obvious one: if you use more than one memcached server, make sure all clients use the same hashing algorithm to select the server for a given key. When using the same language and client this is transparent, but there are several known ways to select the server, such as:
  • md5 hash of the key
  • crc32 based hash
  • native hash (as String.hashCode() in java)
  • pure magic hash (some clients implement non-standard memcache
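For the first caveat, here is a minimal sketch of what "simple serialization" can look like in Ruby: the value is stored as JSON text instead of a marshalled object, so any language with a JSON parser can read it. The helper names `store_json` and `load_json`, the key and the payload are made up for illustration, and a plain Hash stands in for the memcached client so the example runs without a server.

```ruby
require "json"

# Hypothetical helpers: `cache` is anything that responds to []= and [],
# such as a Hash here, standing in for a memcached client.
def store_json(cache, key, value)
  # Store plain JSON text rather than a Marshal dump of a Ruby object.
  cache[key] = JSON.generate(value)
end

def load_json(cache, key)
  raw = cache[key]
  raw.nil? ? nil : JSON.parse(raw)
end

cache = {}
store_json(cache, "user:42", { "id" => 42, "name" => "example", "tags" => ["ruby", "java"] })

puts cache["user:42"]                     # the JSON text any other client would see
puts load_json(cache, "user:42")["name"]  # parsed back into a Ruby Hash
```

Some Ruby clients marshal values by default, so with a real client it is worth checking how it stores raw strings.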

The case in point is a Ruby application using the memcache-client gem and a Java application using whalin’s client. If you use more than one server, the Ruby client uses its own algorithm, which is CRC32 based. The Java client defaults to a NATIVE based algorithm (String.hashCode()), but ships with 3 more algorithms. With these defaults, a key written by one application would not reliably be found by the other.
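To make the mismatch concrete, here is a rough Ruby sketch that computes both kinds of hash for the same keys and picks a server index from each. `java_hash_code` is a hand-written emulation of Java’s String.hashCode() for ASCII keys, `crc_hash` uses Zlib.crc32 from the standard library instead of the gem’s own CRC32 method, and the absolute value plus plain modulo in `server_index` is a simplifying assumption, not the exact logic of either client.

```ruby
require "zlib"

# Emulates Java's String.hashCode() for ASCII strings: h = 31 * h + char,
# wrapped to a signed 32-bit integer.
def java_hash_code(str)
  h = 0
  str.each_char { |c| h = (31 * h + c.ord) & 0xFFFFFFFF }
  h >= 0x80000000 ? h - 0x100000000 : h
end

# CRC32-based hash with the same shift and mask used in this post.
def crc_hash(key)
  (Zlib.crc32(key) >> 16) & 0x7fff
end

# Assumption: the server index is simply the hash modulo the pool size.
def server_index(hash, pool_size)
  hash.abs % pool_size
end

pool_size = 2
%w[mykey user:1 user:2 session:abc].each do |key|
  native = server_index(java_hash_code(key), pool_size)
  crc    = server_index(crc_hash(key), pool_size)
  note   = native == crc ? "same server" : "DIFFERENT servers"
  puts "#{key}: native=#{native} crc32=#{crc} (#{note})"
end
```

Whenever the two indexes disagree, a value written by one application is looked up on the wrong server by the other.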

Let’s see how it works:

Code for hashing in Ruby (straight from the memcache-client gem):


# Note that crc32_ITU_T is a method that memcache-client patches into the String class

def hash_for(key)
  (key.crc32_ITU_T >> 16) & 0x7fff
end

Code for the matching hashing algorithm in Java:


private static long newCompatHashingAlg( String key ) {
    CRC32 checksum = new CRC32();
    checksum.update( key.getBytes() );
    long crc = checksum.getValue();
    return (crc >> 16) & 0x7fff;
}

The algorithm is selected by this piece of code from whalin’s memcached client library:

    case NATIVE_HASH:
        return (long)key.hashCode();
    case OLD_COMPAT_HASH:
        return origCompatHashingAlg( key );
    case NEW_COMPAT_HASH:
        return newCompatHashingAlg( key );
    case CONSISTENT_HASH:
        return md5HashingAlg( key );
    default:
        // use the native hash as a default
        hashingAlg = NATIVE_HASH;
        return (long)key.hashCode();

So, before using the client in Java, we need to call setHashingAlg( SockIOPool.NEW_COMPAT_HASH ) on the SockIOPool object that the client uses.

That’s it.

Now, for a change …

Really unnecessary section!

We can test the CRC32 based algorithm like this:

Start irb and type:

irb --> require "rubygems"
    ==> true
irb --> require "memcache"
    ==> true
irb --> a = "mykey"
    ==> "mykey"
irb --> (a.crc32_ITU_T() >> 16) & 0x7fff
    ==> 17510

From this, we see that 17510 is the resulting hash for the key “mykey”.

The memcache client was required just to attach the crc32_ITU_T() method to the String class, but if you don’t want to install it, just paste the following code instead (taken from memcache-client, with the byte access changed to each_byte so it also runs on Ruby 1.9 and later):


class String

  ##
  # Uses the ITU-T polynomial in the CRC32 algorithm.

  def crc32_ITU_T
    r = 0xFFFFFFFF

    # each_byte yields integers on every Ruby version; self[i] returns a
    # one-character String from Ruby 1.9 on, which would break the XOR.
    each_byte do |byte|
      r ^= byte
      8.times do
        if (r & 1) != 0
          r = (r >> 1) ^ 0xEDB88320
        else
          r >>= 1
        end
      end
    end

    r ^ 0xFFFFFFFF
  end

end
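As an alternative to patching String, Ruby’s standard library already ships the same checksum: Zlib.crc32 uses the same polynomial (0xEDB88320 in its reflected form), which is also what java.util.zip.CRC32 implements. The sketch below compares a bitwise implementation with Zlib.crc32; the helper names `bitwise_crc32` and `compat_hash` are made up for this example.

```ruby
require "zlib"

# Bitwise CRC32, written as a plain function instead of a String patch.
def bitwise_crc32(str)
  r = 0xFFFFFFFF
  str.each_byte do |byte|
    r ^= byte
    8.times { r = (r & 1) != 0 ? (r >> 1) ^ 0xEDB88320 : r >> 1 }
  end
  r ^ 0xFFFFFFFF
end

# Same shift and mask as hash_for, but built on the standard library.
def compat_hash(key)
  (Zlib.crc32(key) >> 16) & 0x7fff
end

%w[mykey foo 123456789].each do |key|
  puts "#{key}: bitwise=#{bitwise_crc32(key)} zlib=#{Zlib.crc32(key)} hash=#{compat_hash(key)}"
end
```

Both columns should print the same checksum for every key, so `compat_hash` can replace the patched method when the gem is not installed.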

Let’s test it on the Java side:

TestCRC.java


import java.util.zip.CRC32;

public class TestCRC {

    public static void main(String[] args) {
        // Same computation as newCompatHashingAlg above
        CRC32 checksum = new CRC32();
        checksum.update("mykey".getBytes());
        long crc = checksum.getValue();
        System.out.println((crc >> 16) & 0x7fff);
    }
}

Compile and run as:

$ javac TestCRC.java
$ java -cp . TestCRC
17510

Again, 17510, as in Ruby. That’s the right value for “mykey”.

Both cases yield 17510. That value is then divided by the number of machines in the pool (e.g. 2), and the remainder of that division (17510 % 2 = 0 here) is the index of the right server, both in Java and Ruby. Weee.