Following my scraping and crawling experiences, I was looking for a good indexer. Initially I set out to use Lucene, as I had gotten pretty good recommendations about it. Lucene really shines, but I had decided to use Ruby or another scripting language to avoid bloated code.
Browsing around I found out about Ferret, a text indexing library for Ruby. The benchmarks and references were good, so I set out to do some testing to get used to it. Fortunately, the results were good, and the API is a breeze. Pagination is also built in. How cool is that?
For an initial test, I set out to index the Linux kernel source code. Following Brian McCallister's example, I wrote two small scripts: indexer.rb and search.rb. I ran the indexer over the source tree and came up with some very interesting results. The words I searched for were ‘net’, ‘skb’, ‘x86’ and finally ‘linux’.
Source : http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.1.tar.bz2
System path: /usr/src/linux-2.6.23.1
$ du -hs linux-2.6.23.1/
296M linux-2.6.23.1/
filetypes: mostly text, .c, .h
$ ruby indexer.rb
Files: 23867
Elapsed time: 307.072062 secs
Index size on disk:
$ du -hs /tmp/ferret-test/
469M /tmp/ferret-test/
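To put those indexing figures in perspective, here is a minimal sketch that derives throughput and index overhead from the numbers above. The method name indexing_stats is made up for illustration, and the sizes are the rounded values that du -hs reported, so the results are approximate.

```ruby
# Derive rough indexing statistics from a run's raw figures.
# indexing_stats is a hypothetical helper, not part of Ferret.
def indexing_stats(files, elapsed_secs, source_mb, index_mb)
  {
    :files_per_sec => files / elapsed_secs,
    :mb_per_sec    => source_mb / elapsed_secs,
    :index_ratio   => index_mb / source_mb
  }
end

# Figures copied from the indexer run and the two du -hs calls above.
stats = indexing_stats(23_867, 307.072062, 296.0, 469.0)

puts format("%.1f files/s", stats[:files_per_sec])
puts format("%.2f MB/s", stats[:mb_per_sec])
puts format("index is %.2fx the size of the source tree", stats[:index_ratio])
```

With these inputs the indexer handled a bit under 80 paths per second, at just under 1 MB/s, and the index on disk came out at roughly 1.6 times the size of the source tree.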
Three timed runs per query, limited to the first 10 most relevant results (:offset=0):
$ ruby search.rb net
Searching
/usr/src/linux-2.6.23.1/net/ieee80211/softmac/ieee80211softmac_auth.c score: 1.0
/usr/src/linux-2.6.23.1/drivers/scsi/qla4xxx/Kconfig score: 0.837733394053738
/usr/src/linux-2.6.23.1/drivers/s390/Makefile score: 0.837733394053738
/usr/src/linux-2.6.23.1/drivers/net/usb/pegasus.c score: 0.728849451255887
/usr/src/linux-2.6.23.1/net/core/net-sysfs.c score: 0.722764197161946
/usr/src/linux-2.6.23.1/net/llc/Kconfig score: 0.698111124955755
/usr/src/linux-2.6.23.1/drivers/net/usb/usbnet.c score: 0.676843858509696
/usr/src/linux-2.6.23.1/net/rxrpc/ar-error.c score: 0.653023192474492
/usr/src/linux-2.6.23.1/net/ieee80211/softmac/ieee80211softmac_assoc.c score: 0.634811095955514
/usr/src/linux-2.6.23.1/drivers/net/usb/mcs7830.c score: 0.634811095955514
(1)Elapsed time: 0.010868 secs
(2)Elapsed time: 0.010751 secs
(3)Elapsed time: 0.011172 secs
$ ruby search.rb skb
Searching
/usr/src/linux-2.6.23.1/net/bridge/br_forward.c score: 1.0
/usr/src/linux-2.6.23.1/net/ipv6/xfrm6_mode_transport.c score: 0.859210759788879
/usr/src/linux-2.6.23.1/include/net/x25device.h score: 0.842525497336466
/usr/src/linux-2.6.23.1/net/ipv4/xfrm4_output.c score: 0.828721377551808
/usr/src/linux-2.6.23.1/net/ipv4/xfrm4_mode_transport.c score: 0.825503025430993
/usr/src/linux-2.6.23.1/net/ipv6/ip6_input.c score: 0.822541791389715
/usr/src/linux-2.6.23.1/net/x25/x25_dev.c score: 0.809219293235739
/usr/src/linux-2.6.23.1/net/ipv6/xfrm6_output.c score: 0.799289861992655
/usr/src/linux-2.6.23.1/drivers/isdn/pcbit/capi.c score: 0.79427853649446
/usr/src/linux-2.6.23.1/net/ipv6/xfrm6_mode_beet.c score: 0.793999290076671
(1)Elapsed time: 0.008609 secs
(2)Elapsed time: 0.008855 secs
(3)Elapsed time: 0.010647 secs
$ ruby search.rb x86
Searching
/usr/src/linux-2.6.23.1/include/asm-x86_64/rtc.h score: 1.0
/usr/src/linux-2.6.23.1/include/asm-i386/rtc.h score: 1.0
/usr/src/linux-2.6.23.1/include/asm-um/cache.h score: 0.625
/usr/src/linux-2.6.23.1/arch/i386/kernel/cpu/mcheck/mce.c score: 0.559017013920514
/usr/src/linux-2.6.23.1/include/xen/interface/features.h score: 0.530330073130523
/usr/src/linux-2.6.23.1/kernel/uid16.c score: 0.5182226426261
/usr/src/linux-2.6.23.1/include/asm-x86_64/cpufeature.h score: 0.5
/usr/src/linux-2.6.23.1/include/asm-powerpc/auxvec.h score: 0.5
/usr/src/linux-2.6.23.1/include/asm-m32r/tlb.h score: 0.5
/usr/src/linux-2.6.23.1/include/asm-i386/tlb.h score: 0.5
(1)Elapsed time: 0.011245 secs
(2)Elapsed time: 0.011024 secs
(3)Elapsed time: 0.011324 secs
Full run with all results for:
net:
Elapsed time: 0.094946 secs
Documents found: 488
skb:
Elapsed time: 0.254102 secs
Documents found: 1248
x86:
Elapsed time: 0.061239 secs
Documents found: 332
linux:
Elapsed time: 0.986499 secs
Documents found: 4030
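One way to read the full-run numbers is to divide each elapsed time by the number of documents returned. The sketch below does only that arithmetic on the figures above. The constant FULL_RUNS and the method ms_per_doc are illustrative names, not part of search.rb.

```ruby
# Full-run timings and hit counts copied from the results above.
FULL_RUNS = {
  'net'   => { :secs => 0.094946, :docs => 488 },
  'skb'   => { :secs => 0.254102, :docs => 1248 },
  'x86'   => { :secs => 0.061239, :docs => 332 },
  'linux' => { :secs => 0.986499, :docs => 4030 }
}

# Average wall-clock cost per returned document, in milliseconds.
def ms_per_doc(secs, docs)
  secs * 1000.0 / docs
end

FULL_RUNS.each do |term, run|
  puts format("%-5s %.3f ms per document", term, ms_per_doc(run[:secs], run[:docs]))
end
```

All four queries land at roughly 0.18 to 0.25 ms per document, which suggests the full-run time grows more or less in line with the number of hits fetched and printed.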
I made 3 runs of each 10-result search, and the times were consistent. My machine was under some load, as I use it as a desktop and didn’t close any applications prior to testing:
$ free
             total       used       free     shared    buffers     cached
Mem:       2074644    1532372     542272          0      25572    1182912
-/+ buffers/cache:     323888    1750756
Swap:      4096552     665880    3430672
The processor is an Intel(R) Pentium(R) D CPU 2.80GHz (according to /proc/cpuinfo).
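The three-run timing described above can be reproduced with a small helper built on Ruby's standard Benchmark module. This is only a sketch: time_runs is a made-up name, and the summing loop is a stand-in workload where search.rb would call index.search_each.

```ruby
require 'benchmark'

# Run the given block `runs` times and return the wall-clock
# seconds each run took, in order.
def time_runs(runs = 3)
  (1..runs).map { Benchmark.realtime { yield } }
end

# Stand-in workload; replace with the real search call.
timings = time_runs(3) { (1..100_000).reduce(:+) }

timings.each_with_index do |secs, i|
  puts format("(%d)Elapsed time: %.6f secs", i + 1, secs)
end
puts format("Mean: %.6f secs", timings.inject(:+) / timings.size)
```

Timing the same call several times inside one process keeps Ruby startup and index opening out of the measurement, which is worth remembering when comparing against numbers taken by running the script three times.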
Ferret is pretty tight, and it seems the author is committed to keeping performance good through smart code optimization (since version 0.9 the core of the library is written in C). Setup is as easy as gem install ferret, and the learning curve is smooth. The hardest part so far is FQL, the Ferret Query Language, which can be used to fine-tune queries and results.
———- Code for indexer.rb ———–
#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
require 'find'
include Ferret

index = Index::Index.new(:default_field => 'content', :path => '/tmp/ferret-test')

ini = Time.now
numFiles = 0
Find.find("/usr/src/linux-2.6.23.1/") do |path|
  puts "Indexing: #{path}"
  numFiles = numFiles + 1
  if FileTest.file? path
    File.open(path) do |file|
      index.add_document(:file => path, :content => file.readlines)
    end
  end
end

elapsed = Time.now - ini
puts "Files: #{numFiles}"
puts "Elapsed time: #{elapsed} secs\n"
———- Code for indexer.rb ———–
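One detail worth noting in indexer.rb: as the listing reads, the counter is incremented for every path Find yields, before the FileTest.file? check, so the "Files" figure appears to include directories as well. A minimal sketch of counting the two separately, using only the standard library (count_tree is a hypothetical helper name):

```ruby
require 'find'

# Walk `root` and count regular files and directories separately.
# Find.find yields `root` itself first, so it is counted as a directory.
def count_tree(root)
  counts = { :files => 0, :dirs => 0 }
  Find.find(root) do |path|
    if File.file?(path)
      counts[:files] += 1
    elsif File.directory?(path)
      counts[:dirs] += 1
    end
  end
  counts
end

# Only run when called as a script with a directory argument.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  counts = count_tree(ARGV[0])
  puts "Files: #{counts[:files]}"
  puts "Directories: #{counts[:dirs]}"
end
```

Saved under a name of your choosing and pointed at /usr/src/linux-2.6.23.1, it would show how much of the reported count comes from directories rather than indexed files.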
———- Code for search.rb ———–
#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
require 'find'

wot = ARGV[0]
if wot.nil?
  puts "use: search.rb <query>"
  exit
end

index = Ferret::Index::Index.new(:default_field => 'content', :path => '/tmp/ferret-test')

ini = Time.now
puts "Searching"
docs = 0
# uncomment line below for 10 first results, and comment the subsequent line.
# index.search_each(wot) do |doc, score|
index.search_each(wot, :limit => :all) do |doc, score|
  puts index[doc]['file'] + " score: " + score.to_s
  docs += 1
end

elapsed = Time.now - ini
puts "Elapsed time: #{elapsed} secs\n"
puts "Documents found: #{docs}"
———- Code for search.rb ———–
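search.rb switches between the default 10 results and :limit => :all. The same options hash is where pagination happens: search_each also takes an :offset, so a page is just an :offset/:limit pair. The sketch below shows that mapping. page_options and print_page are made-up helper names, and print_page assumes an index object that behaves like the one opened in search.rb.

```ruby
# Translate a 1-based page number into the :offset/:limit pair
# that search_each accepts. Hypothetical helper, not part of Ferret.
def page_options(page, per_page = 10)
  raise ArgumentError, "page must be >= 1" if page < 1
  { :offset => (page - 1) * per_page, :limit => per_page }
end

# Print one page of results from an index that responds to
# search_each and [] the way the index in search.rb does.
def print_page(index, query, page, per_page = 10)
  index.search_each(query, page_options(page, per_page)) do |doc, score|
    puts "#{index[doc]['file']} score: #{score}"
  end
end
```

With the index from search.rb, print_page(index, 'skb', 2) would request results 11 to 20 for ‘skb’.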