Crawling a Chinese Social Network -- Shuoshuo on QQ Kongjian (说说爬虫)

Shuoshuo is a popular social network created by Tencent. Thanks to QQ's enormous user base, billions of posts have been created on it. Its predecessor, taotao, still exposes a convenient API for crawling the network. Using MongoDB and C++, and given an skey (which can be found in the Chrome console) and a QQ number, one can fetch millions of posts every day. The following program is only an experiment, and I am not responsible for any consequences of using it inappropriately.

Once you have the skey, calculating g_tk is the pivotal step for crawling. g_tk is a token Tencent adds as an extra layer of request validation, but the hashing code can be found in the page's JavaScript, so we can reproduce the calculation in C++.

void QQ_info::calculate_g_tk() {
  // Same hash Tencent's JavaScript uses: start from 5381, fold in every
  // character of the skey, then mask down to a positive 31-bit integer.
  long long hash = 5381;
  for (std::size_t i = 0; i < skey.length(); ++i) {
    hash += (hash << 5) + (int)skey[i];
  }
  long long g_tk = hash & 0x7fffffff;
  g_tk_str = std::to_string(g_tk);
}

The code above computes g_tk; the method belongs to the class QQ_info.

Next, we construct the cookie and the request URL.

std::string QQ_info::get_cookie() const{  
  std::string cookie;
  std::string uin = "uin=o0" + qq_number + "; ";
  std::string cookie_skey = "skey=" + skey + "; ";
  cookie = uin + cookie_skey;
  return cookie;
}

std::string QQ_info::get_url(std::string qq_num) const {
  const std::string first("taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6?uin=");
  const std::string middle("&num=2&replynum=100&format=jsonp&g_tk=");
  return first + qq_num + middle + g_tk_str;
}

The cookies have been reduced to the minimum needed for crawling. I deleted cookies one by one in Chrome to test which ones the request depends on, and only the two above are required. The same goes for the URL: the query parameters have been trimmed to the minimum.

Then we can use libcurl to GET the data from the server. Once the cookie has been set on the handle permanently, only the URL changes between requests, so fetching looks like this:

std::string& Fetcher::get(std::string& qq_num) {
  data_buffer = "";                                  // clear the previous response
  std::string url = logined_qq.get_url(qq_num);
  auto code = curl_easy_setopt(easyhandle, CURLOPT_URL, url.c_str());
  if (code != CURLE_OK) fprintf(stderr, "error in changing url\n");
  code = curl_easy_perform(easyhandle);
  if (code != CURLE_OK) fprintf(stderr, "error in performing the request\n");
  return this->data_buffer;
}
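
The cookie itself only needs to be set once on the handle; the real setup lives in the full source. As a rough sketch (assuming Fetcher owns the easyhandle, data_buffer, and logined_qq members used above, and that the constructor is where the handle gets configured), it could look like this:

// Append every chunk libcurl receives to the std::string passed via WRITEDATA.
static size_t write_to_buffer(char* ptr, size_t size, size_t nmemb, void* userdata) {
  auto* buffer = static_cast<std::string*>(userdata);
  buffer->append(ptr, size * nmemb);
  return size * nmemb;
}

Fetcher::Fetcher(const QQ_info& qq) : logined_qq(qq) {
  easyhandle = curl_easy_init();
  // Set the cookie once; every later get() only swaps CURLOPT_URL.
  std::string cookie = logined_qq.get_cookie();
  curl_easy_setopt(easyhandle, CURLOPT_COOKIE, cookie.c_str());
  // Collect the response body into data_buffer.
  curl_easy_setopt(easyhandle, CURLOPT_WRITEFUNCTION, write_to_buffer);
  curl_easy_setopt(easyhandle, CURLOPT_WRITEDATA, &data_buffer);
}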

However, what curl returns is a raw string without any structure. Because the URL asks for format=jsonp, the response is actually JSONP: the JSON payload wrapped in a JavaScript callback. To store the data in the database, we strip the wrapper with a regex and parse the payload with the JsonCpp library:

Json::Value Fetcher::parsed_json(std::string& qq_num) {
  Json::Value root;
  Json::Reader reader;

  get(qq_num);
  // The response is JSONP; pull the JSON payload out of the callback wrapper.
  std::smatch substrings;
  std::string json_string;
  std::regex_match(data_buffer, substrings, match_json);
  if (substrings[1] != "") {
    json_string = substrings[1];
    bool parsedSuccess = reader.parse(json_string, root);
    if (!parsedSuccess) {
      fprintf(stderr, "error in parsing the json document\n");
    } else {
      // The document is parsable; scan it for "uin" fields so that newly
      // discovered QQ numbers can be queued for crawling.
      std::sregex_token_iterator iter(json_string.cbegin(), json_string.cend(), match_qq, 1);
      std::sregex_token_iterator end;
      for (; iter != end; ++iter) {
        std::string found_qq(*iter);
        if (!qq_filter.check_add(found_qq)) {
          qq_queue->push(found_qq);
          std::cout << "new qq added: " << found_qq << std::endl;
        }
      }
    }
  } else {
    reader.parse(std::string(""), root); // empty response: still return a (null) root
  }
  return root;
}
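
Two member regexes appear above but are defined elsewhere in the class: match_json strips the JSONP callback wrapper, and match_qq picks out every uin field so that new QQ numbers can be queued. The exact patterns are in the full source; the following are only plausible stand-ins, not the originals:

// Illustrative guesses -- the real patterns live in the repository.
// Capture everything between "callbackName(" and the trailing ");".
std::regex match_json{R"(^[^(]*\(([\s\S]*)\)\s*;?\s*$)"};
// Capture the digits of every "uin":<number> field in the JSON text.
std::regex match_qq{R"("uin":(\d+))"};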

Then we can organize the data and store it in the database. Let's put the pieces together; the link below points to the complete project, which uses MongoDB.

View the Code on GitHub
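
For a rough idea of the storage side before reading the full source, a post parsed into a Json::Value can be serialized back to a JSON string and handed to the mongocxx driver. This is only a sketch: the store_post helper and the "qq_crawler"/"posts" database and collection names are placeholders, not the project's actual code.

#include <bsoncxx/json.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>
#include <json/json.h>

void store_post(mongocxx::collection& posts, const Json::Value& post) {
  // Serialize the JsonCpp value back to text and let bsoncxx convert it to BSON.
  Json::FastWriter writer;
  bsoncxx::document::value doc = bsoncxx::from_json(writer.write(post));
  posts.insert_one(doc.view());
}

// Usage sketch: one driver instance per process, then pick a database and collection.
// mongocxx::instance inst{};
// mongocxx::client conn{mongocxx::uri{"mongodb://localhost:27017"}};
// auto posts = conn["qq_crawler"]["posts"];   // placeholder names
// store_post(posts, fetcher.parsed_json(qq_num));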


When you build your own version, you can use another database. The main reason I chose MongoDB is that its schema-less design lets us store and modify the data quickly. The project compiles successfully on macOS (I also tried other platforms such as Ubuntu, but the mongo-cxx driver is rather heavyweight, its ABI does not seem stable there, and the program crashed with a segmentation fault). Have fun with the code!