hpricot vs nokogiri
Last edited by yakczh on 2012-02-12 13:12
Same URL, same XPath:
url='http://zu.cq.soufun.com/house/c21000-d22000-g22-s31-kw%bd%f0%c9%bd%c3%fb%b6%bc/'
xpath="//p[@class='housetitle']/a"
Fetching with hpricot:
require 'open-uri'
require 'hpricot'
doc = Hpricot(open(url))
doc.search(xpath).each do |item|
  puts item['href']
end
This returns results.
Fetching with nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML(open(url))
# puts doc
doc.xpath(xpath).each do |link|
  puts link.content
  puts link['href']
end
This returns nothing.

The module's name is really odd.

Reply to 2# Sevk:
require 'pp'
require 'open-uri'
require 'nokogiri'
url='http://zu.cq.soufun.com/house/c21000-d22000-g22-s31-kw%bd%f0%c9%bd%c3%fb%b6%bc/'
xpath="//p[@class='housetitle']/a"
doc = Nokogiri::HTML(open(url))
# puts doc
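doc.xpath(xpath).each do |link|
  puts link.content
  puts link['href']
  pp link
end
Still nothing.
doc.css("p.housetitle").each do |link|
  puts link
end
The CSS selector doesn't return anything either.

When fetching with hpricot, if you want to print the link's inner text, for example: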
text = link.inner_text
puts text
it raises an error:
Ruby192/lib/ruby/gems/1.9.1/gems/hpricot-0.8.6-x86-mswin32/lib/hpricot/builder.rb:9:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
You have to force a transcode before it works.
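nokogiri lets you pass the encoding straight to the constructor (it is the third argument; the second is the document URL):
doc = Nokogiri::HTML(open(url), nil, 'gbk')
hpricot doesn't seem to have anything like that; it only accepts UTF-8.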
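
For anyone landing here later, the forced transcode mentioned above might look roughly like the sketch below. It uses only Ruby's built-in String methods, so it applies whether the text comes from hpricot or anywhere else. Assumptions: the page is GBK like the keyword in the URL, the sample bytes are just that keyword with the percent signs removed, and gbk_to_utf8 is a made-up helper name, not something either library provides.

```ruby
# Bytes of the percent-encoded keyword from the thread's URL
# (%bd%f0%c9%bd%c3%fb%b6%bc), held as raw binary data.
raw = "\xBD\xF0\xC9\xBD\xC3\xFB\xB6\xBC".b

# Hypothetical helper (not part of hpricot or nokogiri):
# label the bytes as GBK, then convert them to UTF-8.
def gbk_to_utf8(bytes)
  bytes.dup
       .force_encoding(Encoding::GBK)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Treated as UTF-8, the bytes are malformed, which is what triggers
# "invalid byte sequence in UTF-8" inside gsub.
puts raw.dup.force_encoding(Encoding::UTF_8).valid_encoding?  # false

text = gbk_to_utf8(raw)
puts text.valid_encoding?
puts text.encoding
```

With hpricot you would presumably run the fetched body through a conversion like this before handing it to Hpricot(), since force_encoding only relabels the bytes while encode actually rewrites them.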