ChinaUnix forum | Views: 1503 | Replies: 0
Problem using Scrapy in Python

Posted on 2016-08-22 22:58
I'm trying to use Scrapy to automatically download a novel's chapter links from a website. The code is as follows:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from teizi.items import TeiziItem
from scrapy import log


class XunduSpider(CrawlSpider):
    name = "teizi"
    download_delay = 1
    allowed_domains = ['http://www.xunread.com/']
    start_urls = ["http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml"]
    rules = [Rule(LinkExtractor(allow=(r'\d\.shtml')), callback='parse_item', follow=True)]

    def parse_item(self, response):
        log.msg("parse_item", level=log.INFO)
        item = TeiziItem()
        sel = Selector(response)
        script_content = sel.xpath('//div[@id="content"]/script/div/text()').extract()
        script_title = sel.xpath('//div[@id="title"]/script/div/text()').extract()
        item['content'] = [n.encode('utf-8') for n in script_content]
        item['title'] = [n.encode('utf-8') for n in script_title]
        yield item
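As a side note on how the `Rule` selects links: `LinkExtractor`'s `allow` pattern is applied with `re.search` semantics, so `r'\d\.shtml'` keeps any URL that contains a digit directly before `.shtml` (chapter pages) and skips `index.shtml`. A minimal sketch of that matching (the `1.shtml` chapter URL is hypothetical, modeled on the one truncated in the log below):

```python
import re

# LinkExtractor keeps a link when the allow pattern is found anywhere
# in the URL (re.search semantics), so chapter pages like .../1.shtml
# are followed while index.shtml is not.
pattern = re.compile(r'\d\.shtml')

urls = [
    'http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml',
    'http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml',
]
matches = [bool(pattern.search(u)) for u in urls]
print(matches)  # [False, True]
```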

The output of the run is as follows:

C:\Users\Administrator\teizi>scrapy crawl teizi
C:\Users\Administrator\teizi\teizi\spiders\tiezi_spider.py:5: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  from scrapy import log
2016-08-22 22:43:09 [scrapy] INFO: Scrapy 1.1.0 started (bot: teizi)
2016-08-22 22:43:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'teizi.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['teizi.spiders'], 'BOT_NAME': 'teizi', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0'}
2016-08-22 22:43:10 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-22 22:43:10 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-22 22:43:11 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-22 22:43:11 [scrapy] INFO: Enabled item pipelines:
['teizi.pipelines.TeiziPipeline']
2016-08-22 22:43:11 [scrapy] INFO: Spider opened
2016-08-22 22:43:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-22 22:43:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-22 22:43:12 [scrapy] DEBUG: Crawled (200) <GET http://www.xunread.com/robots.txt> (referer: None)
2016-08-22 22:43:14 [scrapy] DEBUG: Crawled (200) <GET http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml> (referer: None)
2016-08-22 22:43:14 [scrapy] DEBUG: Filtered offsite request to 'www.xunread.com': <GET http://www.xunread.com/article/8 ... c-148eee4d6615/1.shtml>
2016-08-22 22:43:14 [scrapy] INFO: Closing spider (finished)
2016-08-22 22:43:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 556,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 44647,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 22, 14, 43, 14, 353000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 657,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 22, 14, 43, 11, 360000)}
2016-08-22 22:43:14 [scrapy] INFO: Spider closed (finished)

Judging from the log, the crawl stops as soon as the first page is downloaded; the parse_item function apparently never runs. Could anyone take a look at what the cause might be? Thanks.
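For what it's worth, the `Filtered offsite request` line and the `'offsite/filtered': 657` stat in the log point at a likely cause: `allowed_domains` is supposed to hold bare domain names, not URLs. Scrapy's OffsiteMiddleware builds a hostname regex from the entries, so the entry `'http://www.xunread.com/'` can never match the request host `www.xunread.com`, and every extracted chapter link is dropped before `parse_item` is ever called. Below is a simplified re-implementation of that hostname check (not Scrapy's actual code, just a sketch of its logic) showing the difference:

```python
import re
from urllib.parse import urlparse

def host_allowed(url, allowed_domains):
    # Simplified version of the check in Scrapy's OffsiteMiddleware:
    # build one regex out of the allowed_domains entries and match it
    # against the request URL's hostname.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ''
    return re.search(pattern, host) is not None

url = 'http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml'
print(host_allowed(url, ['http://www.xunread.com/']))  # False: links get filtered as offsite
print(host_allowed(url, ['www.xunread.com']))          # True: links are followed
```

So changing the spider to `allowed_domains = ['www.xunread.com']` (or removing the attribute entirely) should let the chapter requests through to `parse_item`.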