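I posted a thread about Kaixin001 (开心网) earlier, but it was deleted, perhaps because it was taken for an advertisement. I am posting it again; please help me figure out the next step.
What works so far: I can log in to the home page of the WAP version of Kaixin001, parse the returned HTML, and extract every hyperlink on the page.
What I still need to solve: I want to visit one specific hyperlink on that page, for example the one labeled "组件" (Components). How do I extract just that link?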
urllister.py

    from sgmllib import SGMLParser

    class URLLister(SGMLParser):
        """Collect the href of every <a> tag fed to the parser."""

        def reset(self):
            SGMLParser.reset(self)
            self.urls = []

        def start_a(self, attrs):
            # attrs is a list of (name, value) pairs for the <a> tag
            href = [v for k, v in attrs if k == 'href']
            if href:
                self.urls.extend(href)
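
One possible way to pick out a single link by its label is to record the text inside each `<a>` element next to its `href`, then search the pairs. The sketch below is only an illustration, not a tested fix for the site: it uses the standard-library `HTMLParser` instead of `sgmllib` so it runs on both Python 2 and Python 3, and the names `LinkTextParser` and `find_link`, the sample HTML, and the path `/app/list.php` are all made up for the example. On Python 2 the page would probably need to be decoded to unicode first (for example `html.decode('utf-8')`, assuming the page is UTF-8) so that the label comparison matches.

```python
# -*- coding: utf-8 -*-
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class LinkTextParser(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a href=...> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def find_link(html, label):
    """Return the href of the first link whose text equals label, else None."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    for text, href in parser.links:
        if text == label:
            return href
    return None


# Hypothetical sample page; the real markup and paths may differ.
sample = u'<p><a href="/home/">首页</a> <a href="/app/list.php">组件</a></p>'
print(find_link(sample, u'组件'))   # prints /app/list.php
```

If the label on the real page carries extra whitespace or nested tags, comparing with `label in text` instead of `text == label` may be more forgiving.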
kaixin001.py

    # -*- coding: utf-8 -*-
    # File name: kaixin001.py
    import urllib
    import urllib2
    import httplib
    import cookielib
    import urllister

    def login(url, params):
        """POST the login form; return the response body, or False on failure."""
        try:
            params = urllib.urlencode(params)
            request = urllib2.Request(url, params)
            response = urllib2.urlopen(request)
            return response.read()
        except Exception:
            return False

    def getpage(url):
        """GET a page; return its body, or False on failure."""
        try:
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            return response.read()
        except Exception:
            return False

    if __name__ == '__main__':
        print 'Logging in...'
        url = 'http://wap.kaixin001.com/home/'
        # The "login" value 登录 ("Log in") is sent to the server as-is, so it is not translated
        params = {"email":"aaa@bbb.cn","password":"123456","from":"","refuid":"0","refcode":"","bind":"","gotourl":"","login":"登录"}
        html = login(url, params)
        if html:
            parser = urllister.URLLister()
            parser.feed(html)
            parser.close()
            for link in parser.urls:
                print link
        else:
            print 'Login request failed'
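
For the step after that, actually visiting the chosen link, two things are probably needed: the `href` has to be resolved against the page it came from (links on WAP pages are often relative), and the request has to carry the login cookies. The script above imports `cookielib` but never installs a cookie handler, so a later `getpage()` call would likely arrive without the session. The following is a rough sketch under those assumptions; `make_session` and `fetch_link` are hypothetical helper names, and I have not checked how Kaixin001 actually tracks the login.

```python
try:                                   # Python 3
    from urllib.request import build_opener, HTTPCookieProcessor
    from urllib.parse import urljoin
    from http.cookiejar import CookieJar
except ImportError:                    # Python 2
    from urllib2 import build_opener, HTTPCookieProcessor
    from urlparse import urljoin
    from cookielib import CookieJar


def make_session():
    """Return an opener that stores cookies and sends them on later requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def fetch_link(opener, page_url, href):
    """Resolve href against the page it was found on, then fetch it."""
    target = urljoin(page_url, href)   # handles both relative and absolute hrefs
    return opener.open(target).read()
```

The idea would be to create one opener with `make_session()`, send the login POST through `opener.open(url, data)` instead of `urllib2.urlopen`, and then pass the same opener to `fetch_link` together with the login page URL and the extracted `href`, so the cookies set at login are reused.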