Last edited by misslushui on 2013-11-08 11:07
- #!/usr/bin/python
- #coding=utf8
- # Python 2 script: scrape the proxy list published on cnproxy.com.
- import urllib2
- import re
- import threading
- import time
- '''
- http://www.cnproxy.com/proxy1.html
- '''
- proxylist1 = []
- # The page writes each port through JavaScript variables; map each letter to its digit.
- portdicts = {'v':"3",'m':"4",'a':"2",'l':"9",'q':"0",'b':"5",'i':"7",'w':"6",'r':"8",'c':"1"}
- def get_proxy_from_cnproxy():
-     global proxylist1
-     # Captured groups: IP, obfuscated port expression, protocol, location.
-     p = re.compile(r'''<tr><td>(.+?)<SCRIPT type=text/javascript>document.write\(":"\+(.+?)\)</SCRIPT></td><td>(.+?)</td><td>.+?</td><td>(.+?)</td></tr>''')
-     for i in range(1, 2):  # only proxy1.html for now
-         target = r"http://www.cnproxy.com/proxy%d.html" % i
-         print target
-         req = urllib2.urlopen(target)
-         result = req.read()
-         matchs = p.findall(result)
-         # print matchs
-         for row in matchs:
-             ip = row[0]
-             port = row[1]
-             # Turn an expression such as r+q+r+q into the digit string 8080.
-             port = map(lambda x: portdicts[x], port.split('+'))
-             port = ''.join(port)
-             agent = row[2]
-             #addr = row[3].decode("GBK").encode("utf8")
-             #addr = row[3].decode("cp936").encode("utf8")
-             # Decoding and re-encoding as UTF-8 leaves the bytes unchanged.
-             addr = row[3].decode("utf8").encode("utf8")
-             l = [ip, port, agent, addr]
-             print l
-             proxylist1.append(l)
- if __name__ == "__main__":
-     get_proxy_from_cnproxy()
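To check the regular expression and the port table without hitting the site, here is a minimal sketch in Python 3 (the script above is Python 2). The names `PORT_DIGITS`, `ROW_PATTERN`, `decode_port`, `parse_rows` and `SAMPLE_ROW` are my own, and the sample row is made up to fit the pattern, not copied from the real page.

```python
import re

# Letter-to-digit table copied from the script above.
PORT_DIGITS = {'v': "3", 'm': "4", 'a': "2", 'l': "9", 'q': "0",
               'b': "5", 'i': "7", 'w': "6", 'r': "8", 'c': "1"}

# Same pattern as in the script, split over several lines for readability.
ROW_PATTERN = re.compile(
    r'<tr><td>(.+?)<SCRIPT type=text/javascript>'
    r'document.write\(":"\+(.+?)\)</SCRIPT></td>'
    r'<td>(.+?)</td><td>.+?</td><td>(.+?)</td></tr>'
)


def decode_port(expr):
    """Turn an obfuscated expression such as 'r+q+r+q' into '8080'."""
    return ''.join(PORT_DIGITS[letter] for letter in expr.split('+'))


def parse_rows(html):
    """Return one [ip, port, protocol, location] list per matching table row."""
    return [[ip, decode_port(expr), protocol, location]
            for ip, expr, protocol, location in ROW_PATTERN.findall(html)]


# Hypothetical row, written by hand to match the pattern; the third cell
# ("1,2") stands in for the column that the pattern skips.
SAMPLE_ROW = ('<tr><td>110.232.72.174<SCRIPT type=text/javascript>'
              'document.write(":"+r+q+r+q)</SCRIPT></td>'
              '<td>HTTP</td><td>1,2</td><td>印度尼西亚</td></tr>')

if __name__ == "__main__":
    print(parse_rows(SAMPLE_ROW))
```

In Python 3 the location prints as readable text even inside a list, because the `repr` of a `str` keeps printable non-ASCII characters.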
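I adjusted the script a bit so that it no longer uses Baidu to check whether the proxies work. The parsed Chinese text still has a problem, as shown below:
- http://www.cnproxy.com/proxy1.html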
- ['217.169.209.2', '6666', 'HTTP', '\xe5\xa1\x9e\xe5\xb0\x94\xe7\xbb\xb4\xe4\xba\x9a']
- ['192.227.139.106', '7808', 'HTTP', 'United States']
- ['110.4.12.170', '83', 'HTTP', '\xe9\xa6\x99\xe6\xb8\xaf \xe9\xa6\x99\xe6\xb8\xaf\xe7\xa7\xbb\xe5\x8a\xa8\xe9\x80\x9a\xe8\xae\xaf\xe6\x9c\x89\xe9\x99\x90\xe5\x85\xac\xe5\x8f\xb8']
- ['69.197.132.80', '7808', 'HTTP', '\xe7\xbe\x8e\xe5\x9b\xbd WholeSale Internet\xe5\x85\xac\xe5\x8f\xb8']
- ['205.164.41.101', '3128', 'HTTP', '\xe7\xbe\x8e\xe5\x9b\xbd \xe5\x8a\xa0\xe5\x88\xa9\xe7\xa6\x8f\xe5\xb0\xbc\xe4\xba\x9a\xe5\xb7\x9e\xe5\x9c\xa3\xe4\xbd\x95\xe5\xa1\x9eEgihosting\xe5\x85\xac\xe5\x8f\xb8']
- ['63.141.249.37', '8089', 'HTTP', '\xe7\xbe\x8e\xe5\x9b\xbd \xe5\xaf\x86\xe8\x8b\x8f\xe9\x87\x8c\xe5\xb7\x9e\xe5\xa0\xaa\xe8\x90\xa8\xe6\x96\xaf\xe5\x9f\x8eDataShack\xe5\x85\xac\xe5\x8f\xb8']
- ['27.34.142.47', '9090', 'HTTP', '\xe6\x97\xa5\xe6\x9c\xac']
- ['211.115.113.35', '8088', 'HTTP', '\xe9\x9f\xa9\xe5\x9b\xbd \xe9\xa6\x96\xe5\xb0\x94']
- ['110.232.72.174', '8080', 'HTTP', '\xe5\x8d\xb0\xe5\xba\xa6\xe5\xb0\xbc\xe8\xa5\xbf\xe4\xba\x9a']
- ['177.69.195.4', '3128', 'HTTP', '\xe5\xb7\xb4\xe8\xa5\xbf']
- ['112.218.71.120', '80', 'HTTP', '\xe9\x9f\xa9\xe5\x9b\xbd']
- ['139.0.15.186', '8080', 'HTTP', '\xe5\x8d\xb0\xe5\xba\xa6\xe5\xb0\xbc\xe8\xa5\xbf\xe4\xba\x9a']
- ['54.247.119.128', '3128', 'HTTP', '\xe7\xbe\x8e\xe5\x9b\xbd \xe5\x8d\x8e\xe7\x9b\x9b\xe9\xa1\xbf\xe5\xb7\x9e\xe8\xa5\xbf\xe9\x9b\x85\xe5\x9b\xbe\xe5\xb8\x82\xe4\xba\x9a\xe9\xa9\xac\xe9\x80\x8a\xe5\x85\xac\xe5\x8f\xb8\xe6\x95\xb0\xe6\x8d\xae\xe4\xb8\xad\xe5\xbf\x83']
- ['84.22.41.1', '3128', 'HTTP', '\xe5\xa1\x9e\xe5\xb0\x94\xe7\xbb\xb4\xe4\xba\x9a']
- ['200.54.92.187', '80', 'HTTP', '\xe6\x99\xba\xe5\x88\xa9']
- ['108.61.89.152', '7808', 'HTTP', 'United States']
- ['190.228.33.114', '8080', 'HTTP', '\xe9\x98\xbf\xe6\xa0\xb9\xe5\xbb\xb7']
- ['200.42.56.146', '8080', 'HTTP', '\xe9\x98\xbf\xe6\xa0\xb9\xe5\xbb\xb7']
- ['110.50.80.30', '8888', 'HTTP', '\xe5\x8d\xb0\xe5\xba\xa6\xe5\xb0\xbc\xe8\xa5\xbf\xe4\xba\x9a']
- ['189.112.117.5', '8080', 'HTTP', '\xe5\xb7\xb4\xe8\xa5\xbf']
- ['110.4.12.170', '80', 'HTTP', '\xe9\xa6\x99\xe6\xb8\xaf \xe9\xa6\x99\xe6\xb8\xaf\xe7\xa7\xbb\xe5\x8a\xa8\xe9\x80\x9a\xe8\xae\xaf\xe6\x9c\x89\xe9\x99\x90\xe5\x85\xac\xe5\x8f\xb8']
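The escaped bytes in this output look like valid UTF-8 rather than a parsing error. Python 2 prints the `repr` of every element when it prints a whole list, so non-ASCII bytes show up as `\x..` escapes. Printing each string on its own (for example `print addr`) on a UTF-8 terminal should display the Chinese text. The sketch below, in Python 3, decodes three of the byte strings copied from the output above; `raw_locations` and `decode_location` are names I made up for the illustration.

```python
# Byte strings copied from the output above, written as Python 3 bytes literals.
raw_locations = [
    b'\xe5\xa1\x9e\xe5\xb0\x94\xe7\xbb\xb4\xe4\xba\x9a',
    b'\xe6\x97\xa5\xe6\x9c\xac',
    b'\xe9\x9f\xa9\xe5\x9b\xbd \xe9\xa6\x96\xe5\xb0\x94',
]


def decode_location(raw):
    """Decode one scraped location field, assuming it is UTF-8 encoded."""
    return raw.decode('utf-8')


if __name__ == "__main__":
    for raw in raw_locations:
        # Prints 塞尔维亚, 日本 and 韩国 首尔, one per line.
        print(decode_location(raw))
```

If the readable form is needed inside the Python 2 script itself, one option is to print the fields joined into a single string instead of printing the list object.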