
Create a Quick and Dirty Web Crawler With Ruby

Anemone: an easy-to-use Ruby web spider framework

What is it?

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.

The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

http://anemone.rubyforge.org/information-and-examples.html#pages
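
To give a feel for the DSL, here is a minimal sketch; the root URL and the skip patterns are placeholders, not part of any real site:

require 'rubygems'
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Don't follow links matching these patterns.
  anemone.skip_links_like(/\/login/, /\.pdf$/)

  # Run this block on every page the crawler visits.
  anemone.on_every_page do |page|
    puts page.url
  end
end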

Parsing HTML with Nokogiri

Quickly extract data from raw HTML with the Nokogiri gem

 

http://ruby.bastardsbook.com/chapters/html-parsing/
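
A minimal example of the kind of extraction Nokogiri makes easy; the HTML fragment and the selector here are made up for illustration:

require 'rubygems'
require 'nokogiri'

html = '<ul><li class="shop">Boulangerie</li><li class="shop">Fromagerie</li></ul>'
doc  = Nokogiri::HTML(html)

# Grab every element matching a CSS selector and print its text.
doc.css('li.shop').each do |li|
  puts li.text
end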

 

Scraping a blog with Anemone (Ruby web crawler) and MongoDB

 

http://www.danneu.com/posts/8-scraping-a-blog-with-anemone-ruby-web-crawler-and-mongodb/
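
The gist of that combination, as a rough sketch rather than the code from the post (the selectors, URL pattern, and database names are assumptions), is to let Anemone visit each post and push the extracted fields into a MongoDB collection:

require 'rubygems'
require 'anemone'
require 'mongo'

client = Mongo::Client.new('mongodb://127.0.0.1:27017/scraper')
posts  = client[:posts]

Anemone.crawl("http://www.example-blog.com/") do |anemone|
  # Only extract data from pages that look like individual posts.
  anemone.on_pages_like(/\/posts\/\d+/) do |page|
    title = page.doc.at_css('h1')
    body  = page.doc.at_css('.post-body')
    posts.insert_one(
      :url   => page.url.to_s,
      :title => title && title.text,
      :body  => body && body.text
    )
  end
end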

 

Saving hours with Anemone and Nokogiri

November 8, 2010

Today I've been working on a WordPress site that needed to be moved to another server and to another domain. This site had quite a few absolute internal links in its content, with URLs like http://oldsite.com, and I needed to replace these URLs with http://newsite.com instead. The URL structure stayed exactly the same; only the domain name had to change. It would have been pretty tedious and boring to go over all these pages and check their source to make sure they contained no reference to oldsite.com. But in my extreme, proactive laziness, I found a solution that made this task a lot more fun and reusable: the Anemone and Nokogiri Ruby gems. Here is the code:

require 'rubygems'
require 'anemone'
require 'nokogiri'

Anemone.crawl("http://www.newsite.com/") do |anemone|
  anemone.on_every_page do |page|
    if page.doc
      page.doc.css('a').each do |link|
        if link.attributes['href'].to_s =~ /oldsite\.com/
          puts "#{link.attributes['href']} in #{page.url}"
        end
      end
    end
  end
end

What this script does: Anemone crawls all the pages on newsite.com. For each page, I use Nokogiri to look at its links, then loop over those links to pick out the ones that contain “oldsite.com”.

After running this script for a couple of minutes, I got this output:

http://www.oldsite.com/the-link in http://www.newsite.com
http://www.oldsite.com/wp-content/uploads/2010/09/dude1.jpg in http://www.newsite.com/the-guy

Then all I had to do was go to the WordPress admin at www.newsite.com/wp-admin and replace the old domain in the URLs.

Done!

Best sample: screen scraping with Anemone and Nokogiri

 

I have a starting page of http://www.example.com/startpage which has 1220 listings broken up by pagination in the standard way, e.g. 20 results per page.

I have working code that parses the first page of results and follows links that contain “example_guide/paris_shops” in their URL. I then use Nokogiri to pull specific data off each of those final pages. All works well, and the 20 results are written to a file.

However, I can't seem to figure out how to also get Anemone to crawl to the next page of results (http://www.example.com/startpage?page=2), continue to parse that page, then the third page (http://www.example.com/startpage?page=3), and so on.

So I'd like to ask if anyone knows how I can get Anemone to start on a page, parse all the links on that page (and follow them one level deeper for the specific data), then follow the pagination to the next page of results so Anemone can start parsing again, and so on. Since the pagination links are different from the links in the results, Anemone of course doesn't follow them.

At the moment I am loading the URL for the first page of results, letting that finish, and then pasting in the URL for the second page of results, and so on. Very manual and inefficient, especially for hundreds of pages.

Any help would be much appreciated.

require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do | page |

doc = Nokogiri::HTML(open(page.url))

name = doc.at_css("#top h2").text unless doc.at_css("#top h2").nil?
address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
website = doc.at_css("tr:nth-child(5) a").text unless doc.at_css("tr:nth-child(5) a").nil?

open('savedwebdata.txt', 'a') { |f|
  f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
}
  end
end
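
One way to get Anemone to follow the pagination as well (a sketch of one possible approach, not the accepted answer from the thread) is to use focus_crawl to tell the crawler which links on each page are worth following: the ?page=N links plus the shop detail links. The extraction block can also reuse page.doc, the document Anemone has already parsed, instead of re-fetching each page with open-uri:

require 'rubygems'
require 'anemone'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Follow only the pagination links and the shop detail links.
  anemone.focus_crawl do |page|
    page.links.select do |uri|
      uri.to_s =~ /startpage\?page=\d+/ || uri.to_s =~ /example_guide\/paris_shops/
    end
  end

  # Extract data from the shop detail pages, reusing the parsed document.
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    doc = page.doc

    name    = doc.at_css("#top h2") && doc.at_css("#top h2").text
    address = doc.at_css(".info tr:nth-child(3) td") && doc.at_css(".info tr:nth-child(3) td").text
    website = doc.at_css("tr:nth-child(5) a") && doc.at_css("tr:nth-child(5) a").text

    File.open('savedwebdata.txt', 'a') do |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    end
  end
end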


http://stackoverflow.com/questions/3836597/help-needed-with-screen-scraping-using-anemone-and-nokogiri