
Create a Quick and Dirty Web Crawler With Ruby

Anemone: an easy-to-use Ruby web spider framework

What is it?

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.

The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.


Saving hours with Anemone and Nokogiri

November 8, 2010

Today I’ve been working on a WordPress site that needed to be moved to another server and to another domain. The site had quite a few absolute internal links in its content, URLs pointing at the old domain, and I needed to replace them with the new domain instead. The URL structure stayed exactly the same; only the domain name had to change. It would have been pretty tedious and boring to go over all these pages and check their source to make sure they contained no reference to the old domain. But in my extreme proactive laziness, I found a solution that made this task a lot more fun and reusable: the Anemone and Nokogiri Ruby gems. Here is the code:

require 'rubygems'
require 'anemone'
require 'nokogiri'

# The start URL was elided in the original post; put the old site's address here.
Anemone.crawl("") do |anemone|
  anemone.on_every_page do |page|
    if page.doc
      page.doc.css('a').each do |link|
        if link.attributes['href'].to_s =~ /oldsite\.com/
          puts "#{link.attributes['href']} in #{page.url}"
        end
      end
    end
  end
end
What this script does: Anemone crawls every page of the site. For each page, I use Nokogiri to look at its links, and I loop over each of those links to find the ones that contain an occurrence of the old domain.

Running this script for a couple of minutes gave me the full list of old-domain links and the pages they appeared in.

Then all I had to do was go into the WordPress admin and replace the old domain in each of those URLs.
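The replacement itself could also be scripted rather than done by hand in the admin. Here is a minimal sketch in plain Ruby with hypothetical domains: "oldsite.com" echoes the regex in the script above, while "newsite.com" is a stand-in for the destination domain, which the post does not name.

```ruby
# Hypothetical domains: "oldsite.com" echoes the script's regex above;
# "newsite.com" is a stand-in for the destination domain.
OLD_HOST = %r{\A(https?://)(?:www\.)?oldsite\.com}i

def migrate_url(href)
  # \1 keeps the original scheme; only the host is swapped.
  href.sub(OLD_HOST, '\1newsite.com')
end

migrate_url("http://www.oldsite.com/2010/11/some-post/")
# => "http://newsite.com/2010/11/some-post/"
```

The path is left untouched, which matches the situation above: same URL structure, only the domain changes.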






I have a starting page which has 1220 listings, broken up by pagination in the standard way, e.g. 20 results per page.

I have code working that parses the first page of results and follows links that contain “example_guide/paris_shops” in their URL. I then use Nokogiri to pull specific data off that final page. All works well, and the 20 results are written to a file.

However, I can’t seem to figure out how to also get Anemone to crawl to the next page of results, then continue to parse that page, then the third page, and so on.

So I’d like to ask if anyone knows how I can get Anemone to start on a page, follow all the links on that page (going one level down for the specific data), but then also follow the pagination to the next page of results so Anemone can start parsing again, and so on. Since the pagination links are different from the links in the results, Anemone of course doesn’t follow them.

At the moment I am loading the URL for the first page of results, letting that finish, and then pasting in the URL for the second page of results, and so on. Very manual and inefficient, especially for hundreds of pages.

Any help would be much appreciated.

require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

# The start URL was elided in the original question; put the first results page here.
Anemone.crawl("", :delay => 3) do |anemone|
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    doc = Nokogiri::HTML(open(page.url))

    name    = doc.at_css("#top h2").text unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text unless doc.at_css("tr:nth-child(5) a").nil?

    open('savedwebdata.txt', 'a') do |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{page.url}"
    end
  end
end
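One way to get Anemone to follow the pagination as well as the listing links is its focus_crawl hook, which hands the block each crawled page and expects back the list of links the crawler should queue next. A sketch under stated assumptions: the pagination URLs are guessed to carry a page parameter (the question does not show them), and follow_link? is a helper name made up for this sketch.

```ruby
# Assumptions: pagination URLs are guessed to look like
# ".../paris_shops?page=2" (the question does not show them), and
# follow_link? is a helper name made up for this sketch.
PAGINATION = %r{example_guide/paris_shops.*[?&]page=\d+}
LISTING    = %r{example_guide/paris_shops/[^?]*$}

def follow_link?(url)
  url.match?(PAGINATION) || url.match?(LISTING)
end

follow_link?("http://example.com/example_guide/paris_shops?page=2")    # => true
follow_link?("http://example.com/example_guide/paris_shops/some-shop") # => true
follow_link?("http://example.com/some_other_guide/")                   # => false

# Wired into the crawl, focus_crawl replaces Anemone's default
# "queue every link" behaviour: its block returns the subset of
# page.links (URI objects) the crawler should visit next.
#
# Anemone.crawl(start_url, :delay => 3) do |anemone|
#   anemone.focus_crawl do |page|
#     page.links.select { |uri| follow_link?(uri.to_s) }
#   end
#   anemone.on_pages_like(LISTING) do |page|
#     # ...scrape name/address/website with Nokogiri as before...
#   end
# end
```

With something like this in place, Anemone starts on the first results page, queues both the shop pages and the subsequent result pages, and on_pages_like keeps firing for every shop page it reaches, so there is no more pasting in page URLs by hand.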