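I added the following to middlewares.py:

    import gtk
    import webkit
    import jswebkit
    from scrapy.http import FormRequest, HtmlResponse

    from taobao import settings  # assumption: the project's settings module

    class WebkitDownloader(object):
        def process_request(self, request, spider):
            # Only render pages for spiders listed in settings.WEBKIT_DOWNLOADER.
            if spider.name in settings.WEBKIT_DOWNLOADER:
                # Form submissions are left to the default downloader.
                if not isinstance(request, FormRequest):
                    webview = webkit.WebView()
                    # Leave the GTK main loop once the page has finished loading.
                    webview.connect('load-finished', lambda v, f: gtk.main_quit())
                    webview.load_uri(request.url)
                    gtk.main()
                    js = jswebkit.JSContext(webview.get_main_frame().get_global_context())
                    rendered_body = str(js.EvaluateScript('document.body.innerHTML'))
                    return HtmlResponse(request.url, body=rendered_body)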
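
For reference, the branching in the middleware above can be sketched without GTK. As far as I understand Scrapy's downloader middleware contract, returning None from process_request lets the request continue to the normal downloader, while returning a response object skips the download. The names `process_request_sketch`, `fake_render` and `is_form` below are hypothetical stand-ins, not part of Scrapy or of my project.

```python
def process_request_sketch(url, is_form, spider_name, enabled_spiders, render):
    """Mimic the branching of WebkitDownloader.process_request.

    `render` is a hypothetical callable standing in for the WebKit step;
    it takes a URL and returns the rendered HTML as a string.
    """
    if spider_name not in enabled_spiders:
        return None  # spider not opted in: the page is downloaded normally
    if is_form:
        return None  # form submissions are left to the default downloader
    return render(url)  # the rendered body replaces the normal download


def fake_render(url):
    # Stand-in for the GTK/WebKit rendering; no browser is started here.
    return "<p>rendered %s</p>" % url


print(process_request_sketch("http://example.com", False,
                             "taobao_spider", ["taobao_spider"], fake_render))
print(process_request_sketch("http://example.com", False,
                             "other_spider", ["taobao_spider"], fake_render))
```

The second call prints None, which is the case where the spider name is not in the enabled list and the WebKit step is never reached.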
In settings.py I added:

    DOWNLOADER_MIDDLEWARES = {
        # 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
        # 'search_spider.middlewares.ProxyMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'taobao.middlewares.WebkitDownloader': 543,  # newly added, following advice found online
        'taobao.middlewares.UserAgentMiddleware': 400,
    }
    # Spiders whose pages should be rendered by WebkitDownloader; the name must
    # match the settings.WEBKIT_DOWNLOADER lookup in middlewares.py.
    WEBKIT_DOWNLOADER = ['taobao_spider']

    import os
    # Point GTK at the X display to use.
    os.environ["DISPLAY"] = ":0"
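
A rough sketch of how I understand the numbers in DOWNLOADER_MIDDLEWARES: a value of None disables a middleware, and process_request is called on the remaining ones in ascending order of their value. The helper `active_order` is hypothetical, and it only looks at the project dict. Real Scrapy also merges in its built-in defaults, which this ignores.

```python
def active_order(middlewares):
    """Return enabled middleware paths, lowest order value first."""
    enabled = [(order, path) for path, order in middlewares.items()
               if order is not None]  # None means "disabled"
    return [path for order, path in sorted(enabled)]


DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'taobao.middlewares.WebkitDownloader': 543,
    'taobao.middlewares.UserAgentMiddleware': 400,
}

for path in active_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

Under these assumptions the custom UserAgentMiddleware (400) sees the request before WebkitDownloader (543), and the built-in UserAgentMiddleware is switched off.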
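Then I ran scrapy crawl taobao_spider, and it exited immediately with no output at all. I can't figure out what went wrong.
Has anyone used scrapy + webkit to handle DOM nodes that are generated dynamically by JavaScript and then scraped the data? If you have experience with this, or know another approach that works, please share. Many thanks!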
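
Two things that may be worth ruling out for a silent exit like this are a setting name that differs from the one the middleware reads, and a missing X display for GTK. Below is a minimal sketch of such a check. `find_setup_problems` and `ExampleSettings` are hypothetical names, and the check does not prove that either of these is the actual cause.

```python
import os


def find_setup_problems(settings, setting_name="WEBKIT_DOWNLOADER", environ=None):
    """Return a list of human-readable problems; an empty list means none found."""
    environ = os.environ if environ is None else environ
    problems = []
    if not hasattr(settings, setting_name):
        problems.append("settings has no attribute %r" % setting_name)
    elif not getattr(settings, setting_name):
        problems.append("%s is empty, so no spider is rendered" % setting_name)
    if not environ.get("DISPLAY"):
        problems.append("DISPLAY is not set, so GTK has no X display to use")
    return problems


class ExampleSettings(object):
    # Stand-in for the project's settings module.
    WEBKIT_DOWNLOADER = ['taobao_spider']


print(find_setup_problems(ExampleSettings, environ={"DISPLAY": ":0"}))  # prints []
```

Passing the real settings module instead of `ExampleSettings`, and leaving `environ` unset so that os.environ is used, would run the same checks against the actual project.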