Data mining – Harvesting information from Twitter

August 19, 2009

Twitter’s whole user base gives us a snapshot of what is hip and what is not right now. Using its trends and search, it’s possible to harvest a lot of information. The heavy lifting of crawling and storing such a huge volume of data is handled by Twitter itself.

One could argue that the results are biased by user opinions, but that is one of the interesting features of such a dataset. The mash-up of different users and topics is also more than a day newer than what shows up on social news websites such as Digg.

To get a little taste of it, we can start with a small script that extracts links from the top current trends (the ones you see on Twitter’s main login page).

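import simplejson
import twitter
import pickle, re, os, urllib

# Matches http:// links that have a path; no capture groups, so findall()
# returns the links as plain strings
detect_url_pattern = re.compile(r'http://\S+?/\S+', re.I)

filename = "last_topic_ids.db"

# Reload the last status id seen for each trend, if a previous run saved them
if os.path.exists(filename):
    last_topic_ids = pickle.load(open(filename, 'rb'))
else:
    last_topic_ids = {}

api = twitter.Api()
# Fetch the current trends; the endpoint URL belongs in the empty string
trends_current = simplejson.loads(api._FetchUrl(""))
c = trends_current["trends"]
# The list of trends sits under the first key of the "trends" mapping
for a in c[c.keys()[0]]:
    if a['query'] not in last_topic_ids:
        # First time we see this trend: plain search URL for its query
        url = "" % (urllib.quote_plus(a['query']))
    else:
        # Seen before: the search URL also takes the last id we stored
        url = "" % (urllib.quote_plus(a['query']), last_topic_ids[a['query']])
    print "--------------------------------------"
    print "%s: %s" % (a['name'], url)
    statuses = simplejson.loads(api._FetchUrl(url))
    for s in statuses['results']:
        urls = detect_url_pattern.findall(s['text'])
        if len(urls) > 0:
            print urls[0]

    # Remember where we stopped for this trend
    last_topic_ids[a['query']] = statuses['max_id']
    print "--------------------------------------"

print last_topic_ids
pickle.dump(last_topic_ids, open(filename, 'wb'))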

We start by getting the main current trends and, for each of them, retrieving the latest search results. We apply a regular expression to each result to extract links, and that’s it. From there we could store the URLs in a dict to count how many times they appear (to make a tag cloud, for example), resolve the URL shorteners to see whether each link is an image, a video or other content and assemble other kinds of rankings and clouds, analyse the content, and so on.
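
The counting idea could look roughly like the sketch below. It is written for Python 3 rather than the Python 2 of the script above, the `count_links` name and the sample texts are made up for illustration, and the link pattern is a simplified one without capture groups, so treat it as one possible approach and not the only way to do it.

```python
import re
from collections import Counter

# Simplified link pattern (an assumption of this sketch): no capture groups,
# so findall() returns each link as a plain string.
LINK_PATTERN = re.compile(r'http://\S+?/\S+', re.I)


def count_links(texts):
    """Count how often each link appears across a list of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(LINK_PATTERN.findall(text))
    return counts


# Made-up sample texts, for illustration only.
sample_texts = [
    "reading this http://example.com/a now",
    "RT great post http://example.com/a",
    "photo: http://example.org/p/1",
]

# Most frequent links first: the raw material for a tag cloud or a ranking.
for link, n in count_links(sample_texts).most_common():
    print(n, link)
```

A `Counter` is just a dict with counting helpers, so the same data can be fed to whatever draws the cloud.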

This program uses the fine python-twitter library, but it can easily be adapted to use urllib2. The general rule of thumb for URL shorteners is to check the Location header of the response to an HTTP HEAD request, like this:

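import httplib

def get_location(url, uri):
    # url is the shortener's host name, uri is the path without the leading "/"
    conn = httplib.HTTPConnection(url)
    conn.request("HEAD", "/" + uri)
    res = conn.getresponse()

    # Not a redirect (301 or 302): hand back the address we were given
    if res.status not in (301, 302):
        return "http://" + url + "/" + uri
    return res.getheader("location")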
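
Since get_location takes the host and the path separately, while the regular expression yields full URLs, a small helper is needed in between. The sketch below is one way to write it in Python 3 (where `urlparse` lives in `urllib.parse`); the `split_short_url` name and the sample short link are made up for illustration.

```python
from urllib.parse import urlsplit


def split_short_url(short_url):
    """Split a full URL into the (host, path) pair that get_location() takes."""
    parts = urlsplit(short_url)
    # get_location() adds the leading "/" itself, so strip it here.
    path = parts.path.lstrip("/")
    # Keep the query string, if any, as part of the path we request.
    if parts.query:
        path += "?" + parts.query
    return parts.netloc, path


# Made-up short link, for illustration only.
host, path = split_short_url("http://bit.ly/abc123")
print(host, path)
```

The pair can then be passed straight on, as in get_location(host, path).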

For specific shorteners (usually image and video ones), you may need other ways to find the final URL or the embedding URL (the thumbnail, for images).
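
Once a link is resolved, sorting it into image, video or other content can start as a simple lookup. The sketch below (Python 3) is only a starting point: the `classify_url` name, the host table and the extension list are assumptions made for illustration, and a real version would need a much longer table.

```python
from urllib.parse import urlsplit

# Illustrative host table (an assumption of this sketch); extend as needed.
KNOWN_HOSTS = {
    "twitpic.com": "image",
    "youtube.com": "video",
    "www.youtube.com": "video",
}

# File extensions treated as images when the host is not in the table.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def classify_url(url):
    """Return "image", "video" or "other" for an already resolved URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in KNOWN_HOSTS:
        return KNOWN_HOSTS[host]
    if parts.path.lower().endswith(IMAGE_EXTENSIONS):
        return "image"
    return "other"


print(classify_url("http://twitpic.com/abc"))
print(classify_url("http://example.com/some-post"))
```

Keeping the rules in plain tables makes it easy to add a new service without touching the function.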

Have fun !

