Following my scraping and crawling experiments, I was looking for a good indexer. Initially I had set out to use Lucene, as I had received pretty good recommendations about it. Lucene really shines, but I had decided to use Ruby or another scripting language to avoid bloated code.

Browsing around, I found out about Ferret, a text indexing library for Ruby. The benchmarks and references were good, so I set out to run some tests to get used to it. Fortunately, the results were good, and the API is a breeze. Pagination is also built in. How cool is that?
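As a rough illustration of what built-in pagination means in practice: the searches in this post rely on Ferret's :offset and :limit search options, so a page number only has to be turned into those two values. The helper below is a minimal sketch; page_options is a hypothetical name of my own, not part of Ferret, and the commented usage assumes an index like the one built later in this post.

```ruby
# Hypothetical helper (not part of Ferret): convert a 1-based page number
# into the :offset/:limit options accepted by a paginated search.
def page_options(page, per_page = 10)
  raise ArgumentError, "page must be >= 1" if page < 1
  { :offset => (page - 1) * per_page, :limit => per_page }
end

# Intended use with an already opened Ferret index (needs the ferret gem):
#   index.search_each(query, page_options(2)) do |doc, score|
#     puts "#{index[doc]['file']} score: #{score}"
#   end

first_page = page_options(1)   # results 1..10
third_page = page_options(3)   # results 21..30
puts "page 1 starts at offset #{first_page[:offset]}"
puts "page 3 starts at offset #{third_page[:offset]}"
```

Keeping the arithmetic in one small function means the search code itself never has to know about page numbers, only about offsets and limits.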

For an initial test, I set out to index the Linux kernel source code. Following Brian McCallister's example, I wrote two small scripts: indexer.rb and search.rb. I ran the indexer over the source tree and came up with some very interesting results. The words I searched for were ‘net’, ‘skb’, ‘x86’ and finally ‘linux’.

Source : http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.1.tar.bz2
System path: /usr/src/linux-2.6.23.1

$ du -hs linux-2.6.23.1/
296M linux-2.6.23.1/

filetypes: mostly text, .c, .h

$ ruby indexer.rb
Files: 23867
Elapsed time: 307.072062 secs

Index size on disk:
$ du -hs /tmp/ferret-test/
469M /tmp/ferret-test/
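To put the indexing numbers in perspective, here is a small back-of-the-envelope calculation using only the figures reported above. The variable names are mine, and the sizes are the rounded values printed by du, so treat the results as approximate.

```ruby
# Figures copied from the indexing run above.
entries    = 23_867        # "Files:" as reported by indexer.rb
index_secs = 307.072062    # "Elapsed time:" as reported by indexer.rb
source_mb  = 296.0         # du -hs of the source tree
index_mb   = 469.0         # du -hs of /tmp/ferret-test

entries_per_sec = entries / index_secs    # indexing rate in entries
mb_per_sec      = source_mb / index_secs  # indexing rate in source megabytes
size_ratio      = index_mb / source_mb    # index size relative to the source

puts "Entries per second: #{entries_per_sec.round(1)}"
puts "Source MB per second: #{mb_per_sec.round(2)}"
puts "Index size / source size: #{size_ratio.round(2)}"
```

In other words, the run indexed a bit under 1 MB of source per second, and the index on disk came out noticeably larger than the source tree itself.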

Three timed runs, limited to the first 10 relevant results (:offset=0)

$ ruby search.rb net
Searching
/usr/src/linux-2.6.23.1/net/ieee80211/softmac/ieee80211softmac_auth.c score: 1.0
/usr/src/linux-2.6.23.1/drivers/scsi/qla4xxx/Kconfig score: 0.837733394053738
/usr/src/linux-2.6.23.1/drivers/s390/Makefile score: 0.837733394053738
/usr/src/linux-2.6.23.1/drivers/net/usb/pegasus.c score: 0.728849451255887
/usr/src/linux-2.6.23.1/net/core/net-sysfs.c score: 0.722764197161946
/usr/src/linux-2.6.23.1/net/llc/Kconfig score: 0.698111124955755
/usr/src/linux-2.6.23.1/drivers/net/usb/usbnet.c score: 0.676843858509696
/usr/src/linux-2.6.23.1/net/rxrpc/ar-error.c score: 0.653023192474492
/usr/src/linux-2.6.23.1/net/ieee80211/softmac/ieee80211softmac_assoc.c score: 0.634811095955514
/usr/src/linux-2.6.23.1/drivers/net/usb/mcs7830.c score: 0.634811095955514

(1)Elapsed time: 0.010868 secs
(2)Elapsed time: 0.010751 secs
(3)Elapsed time: 0.011172 secs

$ ruby search.rb skb
Searching
/usr/src/linux-2.6.23.1/net/bridge/br_forward.c score: 1.0
/usr/src/linux-2.6.23.1/net/ipv6/xfrm6_mode_transport.c score: 0.859210759788879
/usr/src/linux-2.6.23.1/include/net/x25device.h score: 0.842525497336466
/usr/src/linux-2.6.23.1/net/ipv4/xfrm4_output.c score: 0.828721377551808
/usr/src/linux-2.6.23.1/net/ipv4/xfrm4_mode_transport.c score: 0.825503025430993
/usr/src/linux-2.6.23.1/net/ipv6/ip6_input.c score: 0.822541791389715
/usr/src/linux-2.6.23.1/net/x25/x25_dev.c score: 0.809219293235739
/usr/src/linux-2.6.23.1/net/ipv6/xfrm6_output.c score: 0.799289861992655
/usr/src/linux-2.6.23.1/drivers/isdn/pcbit/capi.c score: 0.79427853649446
/usr/src/linux-2.6.23.1/net/ipv6/xfrm6_mode_beet.c score: 0.793999290076671

(1)Elapsed time: 0.008609 secs
(2)Elapsed time: 0.008855 secs
(3)Elapsed time: 0.010647 secs

$ ruby search.rb x86
Searching
/usr/src/linux-2.6.23.1/include/asm-x86_64/rtc.h score: 1.0
/usr/src/linux-2.6.23.1/include/asm-i386/rtc.h score: 1.0
/usr/src/linux-2.6.23.1/include/asm-um/cache.h score: 0.625
/usr/src/linux-2.6.23.1/arch/i386/kernel/cpu/mcheck/mce.c score: 0.559017013920514
/usr/src/linux-2.6.23.1/include/xen/interface/features.h score: 0.530330073130523
/usr/src/linux-2.6.23.1/kernel/uid16.c score: 0.5182226426261
/usr/src/linux-2.6.23.1/include/asm-x86_64/cpufeature.h score: 0.5
/usr/src/linux-2.6.23.1/include/asm-powerpc/auxvec.h score: 0.5
/usr/src/linux-2.6.23.1/include/asm-m32r/tlb.h score: 0.5
/usr/src/linux-2.6.23.1/include/asm-i386/tlb.h score: 0.5

(1)Elapsed time: 0.011245 secs
(2)Elapsed time: 0.011024 secs
(3)Elapsed time: 0.011324 secs

Full run with all results for:

net:
Elapsed time: 0.094946 secs
Documents found: 488

skb:
Elapsed time: 0.254102 secs
Documents found: 1248

x86:
Elapsed time: 0.061239 secs
Documents found: 332

linux:
Elapsed time: 0.986499 secs
Documents found: 4030
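The repeated timings above were taken by running the script by hand several times. A minimal sketch of how the same thing could be automated is shown below; time_runs is a hypothetical helper of my own, and the workload in the example is only a stand-in for a real search call.

```ruby
# Hypothetical helper: run the given block `runs` times and return the
# wall-clock duration of each run in seconds.
def time_runs(runs = 3)
  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Stand-in workload; in search.rb this would be the index.search_each call.
timings = time_runs(3) { (1..100_000).reduce(:+) }

timings.each_with_index do |secs, i|
  puts format("(%d)Elapsed time: %.6f secs", i + 1, secs)
end
```

A monotonic clock is used here instead of Time.now so that a system clock adjustment in the middle of a run cannot distort the measurement.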

I made 3 runs of each 10-result search, and the times were consistent. My machine was under some load, as I use it as a desktop and didn’t close any applications prior to testing:

$ free
             total       used       free     shared    buffers     cached
Mem:       2074644    1532372     542272          0      25572    1182912
-/+ buffers/cache:     323888    1750756
Swap:      4096552     665880    3430672

The processor is an Intel(R) Pentium(R) D CPU 2.80GHz (according to /proc/cpuinfo).

Ferret is pretty tight, and it seems the author is committed to keeping good performance through smart code optimization (since version 0.9 the core of the library has been written in C). Setup is as easy as gem install ferret, and the learning curve is smooth. The hardest part so far is FQL, the Ferret Query Language, which can be used to fine-tune queries and results.

---------- Code for indexer.rb ----------

#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
require 'find'

include Ferret

# Index stored on disk at /tmp/ferret-test; 'content' is the default search field.
index = Index::Index.new(:default_field => 'content', :path => '/tmp/ferret-test')

ini = Time.now
numFiles = 0

Find.find("/usr/src/linux-2.6.23.1/") do |path|
  puts "Indexing: #{path}"
  # Note: this counts every path visited, directories included.
  numFiles = numFiles + 1
  if FileTest.file? path
    File.open(path) do |file|
      # One document per file: its path plus its lines.
      index.add_document(:file => path, :content => file.readlines)
    end
  end
end

elapsed = Time.now - ini
puts "Files: #{numFiles}"
puts "Elapsed time: #{elapsed} secs\n"

---------- Code for indexer.rb ----------

---------- Code for search.rb ----------

#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'

# The query comes from the first command-line argument.
wot = ARGV[0]
if wot.nil?
  puts "use: search.rb <query>"
  exit
end

index = Ferret::Index::Index.new(:default_field => 'content', :path => '/tmp/ferret-test')

ini = Time.now
puts "Searching"

docs = 0
# Uncomment the line below for the first 10 results, and comment out the line after it.
# index.search_each(wot) do |doc, score|
index.search_each(wot, :limit => :all) do |doc, score|
  puts index[doc]['file'] + " score: " + score.to_s
  docs += 1
end

elapsed = Time.now - ini
puts "Elapsed time: #{elapsed} secs\n"
puts "Documents found: #{docs}"

---------- Code for search.rb ----------
