Scraping web page data with Python (regular expression problem): urgent help needed, waiting online, many thanks!

#1 | Posted 2014-08-31 08:36
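I urgently need to scrape some information from a web page for analysis. I'm new to both Python and regular expressions and have been fiddling with this for a few days, but some parts still confuse me, so I'd appreciate some guidance.
The page source is shown below. From the line
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
I want to extract ur2509775.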

<div>
<small>1801 out of 2099 people found the following review useful:</small><br>
<a href="/user/ur2509775/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTk0MzY5MjExM15BMl5BanBnXkFtZTcwOTQyNjE3OQ@@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Tied for the best movie I have ever seen</h2>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
<small>26 November 2003</small><br>

</div>

I tried the code below, but the result is always empty: nothing is printed because nothing is matched.
import re
import urllib2

j = 0
for i in range(0, 20, 10):
    url = 'xxx'
    hash = 'start=%d' % i
    url = url + hash
    content = urllib2.urlopen(url).read()
    # Capture the user ID from each author link
    name = re.findall(r' <a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>', content)
    for i in range(0, len(name)):
        j = j + 1
        print(name[i])
print(j)
print('done')

Output:
>>>
0
done
>>>


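I suspect the regular expression is the problem, but I'm not sure. I've tried several variations without success. Any help would be greatly appreciated!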

#2 | Posted 2014-08-31 09:36
See if this works:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def fa():
    text='''
<div>
<small>1801 out of 2099 people found the following review useful:</small><br>
<a href="/user/ur2509775/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTk0MzY5MjExM15BMl5BanBnXkFtZTcwOTQyNjE3OQ@@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Tied for the best movie I have ever seen</h2>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
<small>26 November 2003</small><br>

</div>

        '''
    # Capture the user ID from links whose text is a single word
    print('\n'.join(re.findall(r'<a href="/user/([\d\w]+)/">\w+</a>',text)))

fa()
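
One possible reason the pattern in #1 finds nothing, judging only from the snippet pasted there: it begins with a literal space, while the author link in the snippet sits at the very start of its line. The Python 3 sketch below compares the two forms. The names `SNIPPET`, `ORIGINAL`, and `NO_LEADING_SPACE` are illustrative, and the live page may be laid out differently, so treat this as a hypothesis to check rather than a diagnosis.

```python
import re

# Author lines copied from the snippet in #1; the link starts its line.
SNIPPET = (
    '<b>Author:</b>\n'
    '<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>\n'
    '<small>26 November 2003</small><br>\n'
)

# Pattern from #1: note the literal space before "<a".
ORIGINAL = r' <a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>'
# The same pattern with the leading space removed.
NO_LEADING_SPACE = r'<a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>'

print(re.findall(ORIGINAL, SNIPPET))          # []
print(re.findall(NO_LEADING_SPACE, SNIPPET))  # ['ur2509775']
```

If the real page indents the link with spaces or tabs, `\s*` in place of the literal space would tolerate either layout.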

#3 | Posted 2014-08-31 09:59
Reply to #2 whitelotus19


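    Thanks! I tried that snippet and it works. However, I want to pull the reviewer IDs and ratings (when a rating is present) directly from this page (http://www.imdb.com/title/tt0111161/reviews?start=0). I can extract the ratings, but the user IDs never come out. I just pasted your regular expression into my code and tried it, but the result is still empty.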

My full code is:
# coding=utf-8
import urllib2
import re
import sys

type = sys.getfilesystemencoding()
j = 0
for i in range(0, 20, 10):
    url = 'http://www.imdb.com/title/tt0111161/reviews?'
    hash = 'start=%d' % i
    url = url + hash
    content = urllib2.urlopen(url).read()
    content = content.decode('UTF-8','ignore')
    # Extract the user IDs
    name = re.findall(r' <a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>', content)
    # Extract the scores
    score = re.findall(r'<img width="102" height="12" alt="(.*?)/10" .*?>', content)
    for i in range(0,len(name)):
        if score[i]!="":
            j = j + 1
            print(name[i]+"+"+score[i])
print('Total: ' + str(j) + ' entries')
print('done')
It really does look like a regular expression problem. Could someone take a look and tell me whether it can be fixed? Thanks!
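
When a pattern that looks right returns nothing, printing the raw lines with `repr()` can make leading spaces, tabs, and line endings visible. A minimal Python 3 sketch of that check follows; `lines_containing` is a hypothetical helper, and `sample` is a stand-in for the downloaded `content`, whose real whitespace may differ.

```python
def lines_containing(html, needle):
    """Return repr() of every line of html that contains needle."""
    return [repr(line) for line in html.splitlines() if needle in line]

# Stand-in for the downloaded page (assumption: the real layout may differ).
sample = (
    '<b>Author:</b>\n'
    '<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>\n'
)

for shown in lines_containing(sample, '/user/'):
    print(shown)
```

Comparing the printed line character by character against the pattern shows whether the text before `<a` is a space, a tab, or nothing at all.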

#4 | Posted 2014-08-31 12:55
Is this what you're after?
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def fa():
    text='''
        <div>
        <small>865 out of 1160 people found the following review useful:</small><br>
        <a href="/user/ur1285640/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
        <h2>The best story ever told on film</h2>
        <img width="102" height="12" alt="8/10" src="http://i.media-imdb.com/images/showtimes/80.gif"><br>
        <b>Author:</b>
        <a href="/user/ur1285640/">Si Cole</a><br>
        <small>3 August 2001</small><br>
        <p><b>*** This review may contain spoilers ***</b></p>

        </div>
        <p>

        I believe that this film is the best story ever told on film, and I'm about
        to tell you why.<br><br>Tim Robbins plays Andy Dufresne, a city banker, wrongfully convicted of
        murdering his wife and her lover. He is sent to Shawshank Prison in 1947
        and
        receives a double life sentence for the crime. Andy forms an unlikely
        friendship with &quot;Red&quot; (Morgan Freeman), the man who knows how to get
        things.
        Andy faces many trials in prison, but forms an alliance with the wardens
        because he is able to use his banking experience to help the corrupt
        officials amass personal fortunes. The story unfolds....<br><br>I was so impressed with how every single subplot was given a great deal of
        respect and attention from the director. The acting was world-class. I have
        never seen Tim Robbins act as well since, Morgan Freeman maybe (e.g.
        Seven).
        The twists were unexpected, an although this film had a familiar feel, it
        wasn't even slightly pretentious or cliched, it was original. The
        cinematography was grand and expressive. It gave a real impression of the
        sheer magnitude of this daunting prison.<br><br>But the one thing which makes THE SHAWSHANK REDEMPTION stand above all
        other
        films, is the attention given to the story. The film depends on the story
        and the way in which it unravels. It's a powerful, poignant,
        thought-provoking, challenging film like no other. If Andy were to comment
        on this film, I think he might say: &quot;Get busy watching, or get busy dying.&quot;
        Take his advice.<br><br>Thoroughly recommended.
        </p>

        <div class="yn" id="ynd_348829">

        <form method="get"

         action="/register/login"

        >
        Was the above review useful to you?

        <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348829_yes" value="Yes"

         rel="login"

        >
        <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348829_no" value="No"

         rel="login"

        >

        </form>

        </div>

        <hr noshade="1" size="1" width="50%" align="center">

        <div>
        <small>583 out of 721 people found the following review useful:</small><br>
        <a href="/user/ur0257957/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
        <h2>The Shawshank Redemption</h2>
        <img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
        <b>Author:</b>
        <a href="/user/ur0257957/">Tim Cox</a> <small>from Marietta, OH</small><br>
        <small>25 March 1999</small><br>

        </div>
        <p>

        One of the finest films made in recent years. It's a poignant story
        about hope. Hope gets me. That's what makes a film like this more than a
        movie. It tells a lesson about life.
        Those are the films people talk about 50 or even 100 years from you. It's
        also a story for freedom. Freedom from isolation,
        from rule, from bigotry and hate. Freeman and Robbins are
        majestic in their performances. Each learns from the other.
        Their relationship is strong and you feel that from the first
        moment they make contact with one another. There is also a
        wonderful performance from legend James Whitmore as Brooks.<br><br>He shines when it is his time to go back into the world,
        only
        to find that the world grew up so fast he never even got
        a chance to blink. Stephen King's story is brought to the
        screen with great elegance and excitement. It is an extraordinary motion
        that people &quot;will&quot; be talking about in
        50 or 100 years.
        </p>

        <div class="yn" id="ynd_348222">

        <form method="get"

         action="/register/login"

        >
        Was the above review useful to you?

        <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348222_yes" value="Yes"

         rel="login"

        >
        <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348222_no" value="No"

         rel="login"

        >

        </form>

        </div>

        <hr noshade="1" size="1" width="50%" align="center">

        <div>
        <small>574 out of 706 people found the following review useful:</small><br>
        <a href="/user/ur0611718/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
        <h2>Powerful</h2>
        <img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
        <b>Author:</b>
        <a href="/user/ur0611718/">Thomas McFadden (tmac4)</a> <small>from Houston, Texas</small><br>
        <small>25 July 2001</small><br>

        </div>
        '''
    # Capture (score, user ID) per review; re.DOTALL lets .*? span line breaks
    for x in re.findall(r'<img width="\d+" height="\d+" alt="(\d+)/\d+" src=.*?<a href="/user/([\d\w]+)/">[^<]+</a>',text,re.DOTALL):
        print(x)

fa()
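
The script in #3 collects IDs and scores into two separate lists and pairs them by index, which would drift out of step if any review has no score. One alternative, sketched here in Python 3, is to split the page into per-review blocks and search each block on its own. It assumes reviews are separated by `<hr` tags, as in the sample above. `parse_reviews`, `AUTHOR_RE`, and `SCORE_RE` are illustrative names, and the second review in `sample` is made up to show the no-score case.

```python
import re

# User ID from the link that follows the "Author:" label.
AUTHOR_RE = re.compile(r'<b>Author:</b>\s*<a href="/user/(\w+)/">')
# Score from the rating image's alt text, e.g. alt="8/10".
SCORE_RE = re.compile(r'<img width="\d+" height="\d+" alt="(\d+)/10"')

def parse_reviews(html):
    """Return (user_id, score) pairs; score is None when no rating is found."""
    reviews = []
    for block in re.split(r'<hr\b', html):
        author = AUTHOR_RE.search(block)
        if author is None:
            continue  # this block holds no author line
        score = SCORE_RE.search(block)
        reviews.append((author.group(1), score.group(1) if score else None))
    return reviews

# Trimmed sample; "x.gif" and the second review are placeholders.
sample = '''
<div>
<img width="102" height="12" alt="8/10" src="x.gif"><br>
<b>Author:</b>
<a href="/user/ur1285640/">Si Cole</a><br>
</div>
<hr noshade="1" size="1" width="50%" align="center">
<div>
<b>Author:</b>
<a href="/user/ur0000001/">Nobody</a> <small>from Nowhere</small><br>
</div>
'''

print(parse_reviews(sample))  # [('ur1285640', '8'), ('ur0000001', None)]
```

Keeping each ID next to its own score means a missing rating shows up as `None` instead of shifting every later pair.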

#5 | Posted 2014-08-31 17:44
Reply to #4 whitelotus19


    It works! Thank you so much!!!

#6 | Posted 2014-09-15 16:26
# Capture the user ID, display name, and location from each author line
pattern_match = re.compile(r'<a href="/user/(.*)/">(.*)</a> <small>(.*)</small><br>')
results = pattern_match.findall(content)
if results:
    print(results)
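
Note that this pattern, like the one in #1, requires a `<small>` location after the name, so an author line without one (such as the "Si Cole" line in #4) would be skipped. A hedged Python 3 variant that makes the location optional with a non-capturing group is sketched below; `AUTHOR_LINE` is an illustrative name, and the two input lines are copied from the samples in #1 and #4.

```python
import re

# ID, display name, and an optional " <small>location</small>" part.
AUTHOR_LINE = re.compile(
    r'<a href="/user/(\w+)/">([^<]+)</a>(?: <small>(.*?)</small>)?<br>'
)

lines = (
    '<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>\n'
    '<a href="/user/ur1285640/">Si Cole</a><br>\n'
)

# findall() yields '' for the location group when it did not take part.
for user_id, name, location in AUTHOR_LINE.findall(lines):
    print(user_id, name, location or '(no location)')
```

Using `[^<]+` for the name also keeps the avatar links out of the results, since their link text starts with an `<img` tag.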