Anemone: an easy-to-use Ruby web spider framework

What is it?

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site.

The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.
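To make the "shortest path to a given page" idea concrete, here is a minimal sketch that uses only plain Ruby and no gems. It is not Anemone's implementation. The link graph, the page paths, and the `shortest_path` helper are all hypothetical, chosen only to show how a breadth-first search over a site's links finds the fewest clicks between two pages.

```ruby
# Toy link graph standing in for a website: each page maps to the pages it links to.
# All paths here are hypothetical.
links = {
  "/"       => ["/about", "/blog"],
  "/about"  => ["/team"],
  "/blog"   => ["/blog/1", "/about"],
  "/blog/1" => ["/team"],
  "/team"   => []
}

# Breadth-first search: returns the shortest click path from start to target,
# or nil when the target cannot be reached.
def shortest_path(links, start, target)
  queue = [[start]]
  visited = { start => true }
  until queue.empty?
    path = queue.shift
    page = path.last
    return path if page == target
    links.fetch(page, []).each do |next_page|
      next if visited[next_page]
      visited[next_page] = true
      queue << (path + [next_page])
    end
  end
  nil
end

p shortest_path(links, "/", "/team")
```

Because breadth-first search visits pages in order of distance from the start, the first path that reaches the target is also a shortest one.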

http://anemone.rubyforge.org/information-and-examples.html#pages

Parsing HTML with Nokogiri

Quickly extract data from raw HTML with the Nokogiri gem

 

http://ruby.bastardsbook.com/chapters/html-parsing/

 

Scraping a blog with Anemone (Ruby web crawler) and MongoDB

 

http://www.danneu.com/posts/8-scraping-a-blog-with-anemone-ruby-web-crawler-and-mongodb/

 

Saving hours with Anemone and Nokogiri

// November 8, 2010

Today I’ve been working on a WordPress site that needed to be moved to another server and another domain. Its content contained quite a few absolute internal links with URLs like http://oldsite.com, and I needed to replace them with http://newsite.com. The URL structure stayed exactly the same; only the domain name changed. It would have been tedious and boring to go through all these pages and check their source for references to oldsite.com. But in my extreme proactive laziness, I found a way to make the task a lot more fun and reusable: the Anemone and Nokogiri Ruby gems. Here is the code:

require 'rubygems'
require 'anemone'
require 'nokogiri'

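# Crawl the new site and report every link that still points at the old domain.
Anemone.crawl("http://www.newsite.com/") do |anemone|
  anemone.on_every_page do |page|
    # page.doc is the parsed Nokogiri document; skip pages that have none.
    if page.doc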
      page.doc.css('a').each do |link|
        if link.attributes['href'].to_s =~ /oldsite\.com/
          puts "#{link.attributes['href']} in #{page.url}"
        end
      end
    end
  end
end

Here is what the script does: Anemone crawls every page on newsite.com. For each page, I use Nokogiri to find its links, then loop over them and print the ones that contain “oldsite.com”.

After running the script for a couple of minutes, I got this output:

http://www.oldsite.com/the-link in http://www.newsite.com
http://www.oldsite.com/wp-content/uploads/2010/09/dude1.jpg in http://www.newsite.com/the-guy

Then all I had to do was go to the WordPress admin at www.newsite.com/wp-admin and replace the old domain in those URLs.

Done!
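The matching step in the script above can be tried offline, without crawling anything. The sketch below is an illustration under assumptions, not part of the original post: the `links_to_old_domain` helper and the sample hrefs are hypothetical, and it compares the parsed host using Ruby's standard `uri` library instead of running a regex over the whole href, so a URL that merely mentions the old domain in its path or query is not reported.

```ruby
require 'uri'

# Hypothetical helper: keeps the hrefs whose host is old_domain or one of its subdomains.
def links_to_old_domain(hrefs, old_domain)
  hrefs.select do |href|
    begin
      host = URI.parse(href.to_s).host
    rescue URI::InvalidURIError
      host = nil # unparseable hrefs are treated as non-matching
    end
    !host.nil? && (host == old_domain || host.end_with?(".#{old_domain}"))
  end
end

# Sample hrefs standing in for the links found on one crawled page.
hrefs = [
  "http://www.oldsite.com/the-link",
  "http://www.newsite.com/about?ref=oldsite.com",
  "/relative/path",
  "http://www.oldsite.com/wp-content/uploads/2010/09/dude1.jpg"
]

links_to_old_domain(hrefs, "oldsite.com").each { |href| puts href }
```

Relative links have no host, so they are skipped, which matches the goal of finding only absolute links that still point at the old domain.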

Best sample: screen scraping with Anemone and Nokogiri

I have a starting page, http://www.example.com/startpage, with 1220 listings split across pages in the standard way, e.g. 20 results per page.

My code parses the first page of results and follows links that contain “example_guide/paris_shops” in their URL. I then use Nokogiri to pull specific data from each of those final pages. This all works well, and the 20 results are written to a file.

However, I can’t figure out how to get Anemone to also crawl to the next page of results (http://www.example.com/startpage?page=2), parse it, then move on to the third page (http://www.example.com/startpage?page=3), and so on.

So I’d like to ask whether anyone knows how to get Anemone to start on a page, parse all the links on that page (and the next level down for specific data), and then follow the pagination to the next page of results so it can start parsing again, and so on. Because the pagination links differ from the links in the results, Anemone of course doesn’t follow them.

At the moment I load the URL for the first page of results, let that finish, and then paste in the URL for the second page of results, and so on. This is very manual and inefficient, especially for hundreds of pages.

Any help would be much appreciated.

require 'rubygems'
require 'anemone'
require 'nokogiri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Only handle detail pages under example_guide/paris_shops/ that have no query string.
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    # Anemone has already fetched and parsed the page, so reuse page.doc
    # instead of downloading it a second time with open-uri.
    doc = page.doc
    next unless doc

    # &. leaves a field nil when its selector matches nothing.
    name    = doc.at_css("#top h2")&.text
    address = doc.at_css(".info tr:nth-child(3) td")&.text
    website = doc.at_css("tr:nth-child(5) a")&.text

    # Append one tab-separated record per shop.
    File.open('savedwebdata.txt', 'a') do |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    end
  end
end


http://stackoverflow.com/questions/3836597/help-needed-with-screen-scraping-using-anemone-and-nokogiri
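One possible way around the manual copy-and-paste step is to generate every results-page URL up front. The sketch below is plain Ruby and is only an illustration: the `pagination_urls` helper is hypothetical, and it assumes the `?page=N` pattern and the 1220 listings at 20 per page described in the question hold for the whole site.

```ruby
# Hypothetical helper: builds the URL of every results page.
# Assumes page 1 is the bare start URL and later pages use ?page=N.
def pagination_urls(start_url, total_listings, per_page)
  page_count = (total_listings + per_page - 1) / per_page # integer ceiling
  (1..page_count).map do |n|
    n == 1 ? start_url : "#{start_url}?page=#{n}"
  end
end

urls = pagination_urls("http://www.example.com/startpage", 1220, 20)
puts urls.length
puts urls.last
```

The resulting list could then be fed to the crawler one URL at a time in a loop. Anemone's core also appears to accept an array of start URLs, and it has a `focus_crawl` hook for choosing which links to follow on each page; both are worth checking against the gem's documentation before relying on them, as neither is confirmed by the question above.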