ChinaUnix forum | Views: 1503 | Replies: 0
Problem using Scrapy in Python

Posted on 2016-08-22 22:58
I'm trying to use Scrapy to automatically download a novel's chapter links from a website. The code is as follows:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from teizi.items import TeiziItem
from scrapy import log


class XunduSpider(CrawlSpider):
    name = "teizi"
    download_delay = 1
    allowed_domains = ['http://www.xunread.com/']
    start_urls = ["http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml"]
    rules = [Rule(LinkExtractor(allow=(r'\d\.shtml')), callback='parse_item', follow=True)]

    def parse_item(self, response):
        log.msg("parse_item", level=log.INFO)
        item = TeiziItem()
        sel = Selector(response)
        script_content = sel.xpath('//div[@id="content"]/script/div/text()').extract()
        script_title = sel.xpath('//div[@id="title"]/script/div/text()').extract()
        item['content'] = [n.encode('utf-8') for n in script_content]
        item['title'] = [n.encode('utf-8') for n in script_title]
        yield item
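As a side note on how the `Rule` selects links: `LinkExtractor`'s `allow` pattern is applied with `re.search` semantics, so `r'\d\.shtml'` keeps any URL that contains a digit directly before `.shtml` (chapter pages) and skips `index.shtml`. A minimal sketch of that matching (the `1.shtml` chapter URL is hypothetical, modeled on the one truncated in the log below):

```python
import re

# LinkExtractor keeps a link when the allow pattern is found anywhere
# in the URL (re.search semantics), so chapter pages like .../1.shtml
# are followed while index.shtml is not.
pattern = re.compile(r'\d\.shtml')

urls = [
    'http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml',
    'http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml',
]
matches = [bool(pattern.search(u)) for u in urls]
print(matches)  # [False, True]
```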

The output of the run is as follows:

C:\Users\Administrator\teizi>scrapy crawl teizi
C:\Users\Administrator\teizi\teizi\spiders\tiezi_spider.py:5: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log
2016-08-22 22:43:09 [scrapy] INFO: Scrapy 1.1.0 started (bot: teizi)
2016-08-22 22:43:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'teizi.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['teizi.spiders'], 'BOT_NAME': 'teizi', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0'}
2016-08-22 22:43:10 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-22 22:43:10 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-22 22:43:11 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-22 22:43:11 [scrapy] INFO: Enabled item pipelines:
['teizi.pipelines.TeiziPipeline']
2016-08-22 22:43:11 [scrapy] INFO: Spider opened
2016-08-22 22:43:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-22 22:43:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-22 22:43:12 [scrapy] DEBUG: Crawled (200) <GET http://www.xunread.com/robots.txt> (referer: None)
2016-08-22 22:43:14 [scrapy] DEBUG: Crawled (200) <GET http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml> (referer: None)
2016-08-22 22:43:14 [scrapy] DEBUG: Filtered offsite request to 'www.xunread.com': <GET http://www.xunread.com/article/8 ... c-148eee4d6615/1.shtml>
2016-08-22 22:43:14 [scrapy] INFO: Closing spider (finished)
2016-08-22 22:43:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 556,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 44647,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 22, 14, 43, 14, 353000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 657,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 22, 14, 43, 11, 360000)}
2016-08-22 22:43:14 [scrapy] INFO: Spider closed (finished)

Judging from the log, the crawl stops as soon as the first page is downloaded; the parse_item function apparently never runs. Could anyone take a look at what the cause might be? Thanks.
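For what it's worth, the `Filtered offsite request` line and the `'offsite/filtered': 657` stat in the log point at a likely cause: `allowed_domains` is supposed to hold bare domain names, not URLs. Scrapy's OffsiteMiddleware builds a hostname regex from the entries, so the entry `'http://www.xunread.com/'` can never match the request host `www.xunread.com`, and every extracted chapter link is dropped before `parse_item` is ever called. Below is a simplified re-implementation of that hostname check (not Scrapy's actual code, just a sketch of its logic) showing the difference:

```python
import re
from urllib.parse import urlparse

def host_allowed(url, allowed_domains):
    # Simplified version of the check in Scrapy's OffsiteMiddleware:
    # build one regex out of the allowed_domains entries and match it
    # against the request URL's hostname.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ''
    return re.search(pattern, host) is not None

url = 'http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml'
print(host_allowed(url, ['http://www.xunread.com/']))  # False: links get filtered as offsite
print(host_allowed(url, ['www.xunread.com']))          # True: links are followed
```

So changing the spider to `allowed_domains = ['www.xunread.com']` (or removing the attribute entirely) should let the chapter requests through to `parse_item`.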