Anatomy of loklak: Part I: Twitter Crawler

Loklak is an open-source Twitter crawler. It is distributed, anonymous, and has API clients in several popular programming languages, including our favorite, Python. Today I will start the anatomy of this well-constructed crawler. Let's begin with a few basic packages.

  • package org.loklak.http:

    • ClientConnection: This class makes HTTP requests to Twitter over HTTPS
      (proof: HttpHost twitter = new HttpHost("twitter.com", 443);). It uses Apache's HttpComponents library to manage a connection pool for requests; thanks to that library, the code is thread-safe and can be used from anywhere (the pooling pattern is sketched at the end of this section).
      It also includes a file downloader to fetch files, usually pictures, from the Internet. The core code is here:

OutputStream os = new BufferedOutputStream(new FileOutputStream(target_file));
int count;
byte[] buffer = new byte[2048];
try {
    // copy the download stream to the file in 2 KB chunks
    while ((count = connection.inputStream.read(buffer)) > 0)
        os.write(buffer, 0, count);
} finally {
    os.close(); // flush buffered bytes and release the file handle
}

There is an overloaded method that returns the file's contents as a byte array instead of writing them directly to a file. Another helper method resolves redirects on a link.
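
For the byte-array variant, the same copy loop can simply target an in-memory stream. Here is a minimal sketch of that pattern (the method name downloadBytes is mine, not necessarily loklak's):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DownloadSketch {
    // Read an entire stream into memory; mirrors the file-writing loop above.
    public static byte[] downloadBytes(InputStream in) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        int count;
        byte[] buffer = new byte[2048];
        while ((count = in.read(buffer)) > 0)
            os.write(buffer, 0, count);
        return os.toByteArray();
    }
}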
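
As for the thread-safe request pool mentioned earlier, with Apache HttpClient the pattern looks roughly like this (a minimal sketch under my reading of the code; the pool limits and class name are illustrative, not loklak's exact code):

import org.apache.http.HttpHost;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClientSketch {
    // One shared pool; PoolingHttpClientConnectionManager is thread-safe,
    // so the resulting client can be used from any thread.
    private static final PoolingHttpClientConnectionManager pool =
            new PoolingHttpClientConnectionManager();
    private static final CloseableHttpClient client;
    static {
        pool.setMaxTotal(50);            // illustrative limits
        pool.setDefaultMaxPerRoute(10);
        client = HttpClients.custom().setConnectionManager(pool).build();
    }

    public static void main(String[] args) throws Exception {
        // https on port 443, matching ClientConnection's HttpHost
        HttpHost twitter = new HttpHost("twitter.com", 443, "https");
        try (CloseableHttpResponse response = client.execute(twitter, new HttpGet("/"))) {
            System.out.println(response.getStatusLine());
        }
    }
}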

  • package org.loklak.harvester:

    • org.loklak.data.Timeline:
      Although it does not live in this package, Timeline appears numerous times in the harvester code. Timeline is a structure that makes it easy to organize tweets. It provides an iterator to go through the whole structure, and internally it uses two maps, one for tweets and one for users (a sketch of this two-map design appears after this list).

    • org.loklak.harvester.twitterAPI:
      This is a helper class for parsing the data structures returned by the Twitter API (a small parsing sketch appears after this list).

    • RedirectUnshortener:
      This is a helper class that resolves redirecting (shortened) URLs to their long form (see the sketch after this list).

    • TwitterScraper:
      https://twitter.com/search?f=tweets&vertical=default&q=kaffee&src=typd
      The link above is a typical Twitter search URL, and the scraper uses this URL format to fetch tweets.
      It constructs such a URL and sends it through ClientConnection; once it gets the data back, it parses it into a Timeline. The URL construction is sketched below.
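
To make the Timeline structure concrete, here is a minimal sketch of the two-map design as I read the description; all names except Timeline are illustrative, and loklak's real class carries more logic than this:

import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class TimelineSketch implements Iterable<String> {
    // one map for tweets (keyed by an order key such as the creation date),
    // one map for the users who wrote them (keyed by screen name)
    private final Map<String, String> tweets = new TreeMap<>();
    private final Map<String, String> users = new TreeMap<>();

    public void add(String orderKey, String tweetText, String screenName, String userJson) {
        tweets.put(orderKey, tweetText);
        users.put(screenName, userJson);
    }

    // iterating over the timeline walks the tweet map in key order
    @Override
    public Iterator<String> iterator() {
        return tweets.values().iterator();
    }
}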
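
For the twitterAPI helper, parsing Twitter's JSON might look like the following sketch. It uses the org.json library; the fields text, user, and screen_name come from Twitter's documented tweet object, but the class here is mine, not loklak's:

import org.json.JSONObject;

public class TweetParseSketch {
    public static void main(String[] args) {
        // a tiny stand-in for a tweet object as returned by the Twitter API
        String json = "{\"text\": \"kaffee!\", \"user\": {\"screen_name\": \"someone\"}}";
        JSONObject tweet = new JSONObject(json);
        String text = tweet.getString("text");
        String author = tweet.getJSONObject("user").getString("screen_name");
        System.out.println(author + ": " + text);
    }
}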
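
RedirectUnshortener's job can be illustrated with the JDK alone: request the short URL without following redirects, then read the Location header. A minimal sketch (the method name unshorten is mine):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UnshortenSketch {
    // Return the redirect target of a short URL, or the input if there is none.
    public static String unshorten(String shortUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(shortUrl).openConnection();
        con.setInstanceFollowRedirects(false); // we want the Location header, not the final page
        con.setRequestMethod("HEAD");          // no need to download the body
        int status = con.getResponseCode();
        String location = con.getHeaderField("Location");
        con.disconnect();
        return (status >= 300 && status < 400 && location != null) ? location : shortUrl;
    }
}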
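
Finally, constructing the search URL is mostly a matter of percent-encoding the query into the endpoint shown above. A minimal sketch, assuming the same parameters as the example link:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlSketch {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String query = "kaffee";
        // percent-encode the query and splice it into Twitter's search endpoint
        String url = "https://twitter.com/search?f=tweets&vertical=default&q="
                + URLEncoder.encode(query, "UTF-8") + "&src=typd";
        System.out.println(url); // hand this to ClientConnection, then parse the result into a Timeline
    }
}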