A few days ago, my intern group needed a crawler to scrape open source code so we could compare similarities between projects, and I was assigned to write the main crawler. It seemed easy at first, but GitHub turned out to be harder to crawl than a normal site.
My idea was to go through the search portal and scrape a different set of repositories on each pass. However, the scraper stopped after downloading around 100 repos.
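The search-portal idea can be sketched roughly like this, assuming GitHub's public search API; the object name, the query string, and the page count here are illustrative, not from my actual crawler:

```scala
// A sketch of the "search portal" approach: build paged search URLs
// and walk through them. The GitHub search API serves up to 100
// results per page (exact limits may change; check the API docs).
object SearchPager {
  def pageUrls(query: String, pages: Int): Seq[String] =
    (1 to pages).map { p =>
      s"https://api.github.com/search/repositories?q=$query&per_page=100&page=$p"
    }
}
```

Each page then yields a batch of repository URLs to clone or download.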
My first solution was a small Scala program written in a non-blocking style, because I wanted it to be fast. It turns out, though, that GitHub rate-limits repository crawling, which caps the crawler's throughput anyway, so the non-blocking design isn't necessary in this case.
I tried to get around the limit by logging into GitHub, but that didn't work, and I don't have many proxy servers either. So I decided to throttle the crawler, let it run slowly for a month, and then check the results.
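A throttled blocking crawler is simple enough that the non-blocking machinery isn't worth keeping. Here is a minimal sketch; the object name and the 10-requests-per-minute figure are assumptions (GitHub publishes its actual limits in the `X-RateLimit-*` response headers, which are worth checking at runtime):

```scala
import scala.io.Source

// A minimal throttled crawler sketch. Blocking I/O is fine here:
// the rate limit, not the network, is the bottleneck.
object ThrottledCrawler {
  // Milliseconds to wait between requests to stay under `maxPerMinute`.
  def delayMillis(maxPerMinute: Int): Long = 60000L / maxPerMinute

  // Fetch each URL in turn, sleeping between requests.
  def fetchAll(urls: Seq[String], maxPerMinute: Int): Seq[String] =
    urls.map { url =>
      val body = Source.fromURL(url).mkString
      Thread.sleep(delayMillis(maxPerMinute))
      body
    }
}
```

At 10 requests per minute that's one fetch every 6 seconds, which is why the full run will take on the order of a month.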
To be continued…