Stack, Queue, and Their Usage in Crawlers

As an experienced crawler designer, stacks and queues are the most common tools I use to store the URLs, or the IDs, that need to be crawled. However, newcomers to this field can hardly tell the difference between the two. Indeed, these two data structures are similar and produce the same final result: both eventually visit every reachable node, and there is no difference between them in speed. However, they do make a difference in the order in which an interconnected structure is crawled.

A stack, as you can imagine, follows FILO: first in, last out. The first item that enters the structure comes out last. This property lets a crawler perform a depth-first crawl. Let’s look at the difference when crawling a tree structure. Depth-first means that the crawler goes deep into the tree until it hits the bottom; after that, it backtracks and descends into the next branch until that branch is exhausted as well. Let’s take a look at a picture of this behavior:

[Figure: depth-first traversal of a tree using a stack]

On the contrary, a queue has a totally different crawling pattern. A queue is FIFO, which means first in, first out. This property lets the crawler perform a breadth-first crawl: it iterates through all the nodes on one level before moving on to the next level, extending in breadth first. Let’s visualize this behavior:

[Figure: breadth-first traversal of a tree using a queue]
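To make the contrast concrete, here is a minimal sketch (my own illustration, not from the original post) that crawls the same tiny in-memory link graph twice: once with a std::stack (depth-first) and once with a std::queue (breadth-first). The graph, the page names, and the printing "visit" step are all made up purely for illustration; the only thing that differs between the two runs is the container holding the frontier.

#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <stack>
#include <string>
#include <vector>

// Toy link graph: each "page" links to a few child pages.
std::map<std::string, std::vector<std::string>> links = {
  {"root", {"a", "b"}},
  {"a", {"a1", "a2"}},
  {"b", {"b1", "b2"}},
};

// Depth-first: the stack hands back the most recently discovered page first.
void crawl_dfs(const std::string& start){
  std::stack<std::string> frontier;
  std::set<std::string> seen{start};
  frontier.push(start);
  while(!frontier.empty()){
    std::string page = frontier.top();
    frontier.pop();
    std::cout << page << ' ';                  // "visit" the page
    for(const auto& child : links[page])
      if(seen.insert(child).second) frontier.push(child);
  }
  std::cout << '\n';
}

// Breadth-first: the queue hands back the oldest discovered page first.
void crawl_bfs(const std::string& start){
  std::queue<std::string> frontier;
  std::set<std::string> seen{start};
  frontier.push(start);
  while(!frontier.empty()){
    std::string page = frontier.front();
    frontier.pop();
    std::cout << page << ' ';
    for(const auto& child : links[page])
      if(seen.insert(child).second) frontier.push(child);
  }
  std::cout << '\n';
}

int main(){
  crawl_dfs("root");   // root b b2 b1 a a2 a1  -- dives to the bottom of a branch first
  crawl_bfs("root");   // root a b a1 a2 b1 b2  -- finishes each level before going deeper
}

Swapping the container is the entire difference between the two traversals; everything else in the loop stays identical.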

Besides the choice of data structure, one of the main topics in crawling is how to crawl fast. Crawling too aggressively is not recommended, since it puts too much pressure on the target server; when you crawl too fast, the server can hardly tell the difference between crawling and a DDoS attack. Nevertheless, it is essential to know how to crawl fast. Crawling is an I/O-intensive application, more specifically a network-intensive one: most of the time the crawler is waiting for the server to respond or is blocked on a network socket. In order to keep the network busy and crawl faster, multi-threading is necessary. However, multi-threading requires one to modify the data structures to prevent race conditions. The simplest way to do this is to hold a lock whenever making any change to a shared structure such as a queue or a stack.

Here is some useful code implementing a thread-safe queue and a thread-safe stack in C++:

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <stack>

template <typename T>
  class Threadsafe_queue{
  private:
    std::queue<T> raw_queue;
    mutable std::mutex mtx; //mutable so that const members such as empty() can still lock it
    std::condition_variable queue_cond;
  public:

    Threadsafe_queue(){}
    explicit Threadsafe_queue(const Threadsafe_queue& other){
      std::lock_guard<std::mutex> guardian1(other.mtx);
      raw_queue = other.raw_queue;
    }
    Threadsafe_queue(const std::queue<T>& other){
      raw_queue = other;
    }
    Threadsafe_queue& operator=(const Threadsafe_queue&) = delete;

    void push(T data){
      std::lock_guard<std::mutex> locker(mtx);
      raw_queue.push(data);
      queue_cond.notify_one();
      return;
    }

    bool try_pop(T& data){
      std::lock_guard<std::mutex> locker(mtx);
      if(raw_queue.empty()) return false;

      data = raw_queue.front();
      raw_queue.pop(); //remove the element once it has been copied out
      return true;
    }
    std::shared_ptr<T> try_pop(){
      std::lock_guard<std::mutex> locker(mtx);
      if(raw_queue.empty()) return std::shared_ptr<T>();

      std::shared_ptr<T> res = std::make_shared<T>(raw_queue.front());
      raw_queue.pop();
      return res;
    }

    void wait_pop(T& data){
      std::unique_lock<std::mutex> locker(mtx);
      queue_cond.wait(locker,[this]{return !raw_queue.empty();});

      data = raw_queue.front();
      raw_queue.pop();
      return;
    }
    std::shared_ptr<T> wait_pop(){
      std::unique_lock<std::mutex> locker(mtx);
      queue_cond.wait(locker,[this]{return !raw_queue.empty();});

      std::shared_ptr<T> res = std::make_shared<T>(raw_queue.front());
      raw_queue.pop();
      return res;
    }

    bool empty() const{
      std::lock_guard<std::mutex> locker(mtx);
      return raw_queue.empty();
    }

  };
template<typename T>  
  class Threadsafe_stack{
  private:
    std::stack<T> raw_stack;
    mutable std::mutex mtx;
    std::condition_variable stack_cond;
  public:
    Threadsafe_stack(){}
    explicit Threadsafe_stack(const Threadsafe_stack& other){
      std::lock_guard<std::mutex> guardian1(other.mtx);
      raw_stack = other.raw_stack;
    }
    explicit Threadsafe_stack(const std::stack<T>& other){
      raw_stack = other;
    }
    Threadsafe_stack& operator=(const Threadsafe_stack& other) = delete;

    void push(T data){
      std::lock_guard<std::mutex> guardian(mtx);
      raw_stack.push(data);
      stack_cond.notify_one();
    }

    bool empty() const{
      std::lock_guard<std::mutex> guardian(mtx);
      return raw_stack.empty();
    }

    void wait_pop(T& data){
      std::unique_lock<std::mutex> locker(mtx);
      stack_cond.wait(locker,[this]{return !raw_stack.empty();});

      data = raw_stack.top();
      raw_stack.pop();
      return;
    }
    std::shared_ptr<T> wait_pop(){
      std::unique_lock<std::mutex> locker(mtx);
      stack_cond.wait(locker,[this]{return !raw_stack.empty();});

      std::shared_ptr<T> res = std::make_shared<T>(raw_stack.top());
      raw_stack.pop();
      return res;
    }

    bool try_pop(T& data){
      std::lock_guard<std::mutex> locker(mtx);
      if(raw_stack.empty()) return false;

      data = raw_stack.top();
      raw_stack.pop(); //remove the element once it has been copied out
      return true;
    }
    std::shared_ptr<T> try_pop(){
      std::lock_guard<std::mutex> locker(mtx);
      if(raw_stack.empty()) return std::shared_ptr<T>();

      std::shared_ptr<T> res = std::make_shared<T>(raw_stack.top());
      raw_stack.pop();
      return res;
    }
  };
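To show how these containers fit into a multi-threaded crawler, here is a minimal usage sketch (my own illustration, not part of the classes above): several worker threads repeatedly try_pop a URL from a shared Threadsafe_queue and "fetch" it. The fetch_page function and the seed URLs are placeholders for whatever HTTP client and frontier-filling logic you actually use.

#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a real HTTP request.
void fetch_page(const std::string& url){
  std::cout << "fetched " + url + "\n";
}

int main(){
  Threadsafe_queue<std::string> frontier;

  // Seed the frontier with some starting URLs (placeholders).
  for(int i = 0; i < 8; ++i)
    frontier.push("http://example.com/page" + std::to_string(i));

  // Each worker drains the queue; try_pop returns false once the queue is
  // empty, so the workers exit instead of blocking forever in wait_pop.
  auto worker = [&frontier]{
    std::string url;
    while(frontier.try_pop(url))
      fetch_page(url);
  };

  std::vector<std::thread> pool;
  for(int i = 0; i < 4; ++i) pool.emplace_back(worker);
  for(auto& t : pool) t.join();
}

In a real crawler the workers would also push newly discovered links back into the frontier, which is exactly where the internal lock and condition variable earn their keep.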