Cam_Un posted on 2014-08-31 08:36

Python web scraping (regular expression problem): urgent help needed, waiting online, many thanks!

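I urgently need to scrape some information from a web page for analysis. I'm new to Python and regular expressions and have been fiddling with this for a few days, but some parts still confuse me. Any guidance would be appreciated.
The page source is shown below. From the line
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
I want to extract ur2509775.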

<div>
<small>1801 out of 2099 people found the following review useful:</small><br>
<a href="/user/ur2509775/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTk0MzY5MjExM15BMl5BanBnXkFtZTcwOTQyNjE3OQ@@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Tied for the best movie I have ever seen</h2>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
<small>26 November 2003</small><br>

</div>

I tried the code below, but it captures nothing: every result is empty and there is no output at all.
j=0
for i in range(0, 20, 10):
    url = 'xxx'
    hash = 'start=%d' % i
    url = url + hash
    content = urllib2.urlopen(url).read()
    name = re.findall(r' <a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>', content)
    for i in range(0,len(name)):
      j = j+1
      print mid
print j
print 'done'

Output:
>>>
0
done
>>>


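I suspect the regular expression is the problem, but I'm not sure. I've tried several times without success. Any help would be greatly appreciated, many thanks!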
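One likely cause, judging only from the HTML excerpt in this post: the pattern starts with a literal space, but in the excerpt the `<a href="/user/...">` author line begins right after a line break, so that space can never match. A minimal sketch of the difference, assuming the live page is laid out like the excerpt (the `sample` string is just two lines copied from it):

```python
import re

# Two lines copied from the HTML excerpt in the post. Assumption: the live
# page has the same layout, with no space before the <a> tag.
sample = (
    '<b>Author:</b>\n'
    '<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>\n'
)

# Original pattern from the post: note the leading space before <a.
with_space = re.findall(
    r' <a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>', sample)

# Same pattern with the leading space removed.
without_space = re.findall(
    r'<a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>', sample)

print(with_space)     # []
print(without_space)  # ['ur2509775']
```

Separately, `print mid` refers to a name that is never assigned in the posted code, so once the pattern does match, the loop would stop with a NameError; printing `name[i]` is probably what was intended.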

whitelotus19 posted on 2014-08-31 09:36

See if this works:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def fa():
    text='''
<div>
<small>1801 out of 2099 people found the following review useful:</small><br>
<a href="/user/ur2509775/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTk0MzY5MjExM15BMl5BanBnXkFtZTcwOTQyNjE3OQ@@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Tied for the best movie I have ever seen</h2>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
<small>26 November 2003</small><br>

</div>

      '''
    print '\n'.join(re.findall(r'<a href="/user/([\d\w]+)/">\w+</a>',text))

fa()

Cam_Un posted on 2014-08-31 09:59

Reply to #2 whitelotus19


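Thanks! I tried this snippet and it works. However, what I want is to extract the reviewer ids and the rating scores (when a score is present) directly from this page (http://www.imdb.com/title/tt0111161/reviews?start=0). The scores come out fine, but I can never get the user ids. I just pasted your regular expression into my code and tried it, but the result is still empty.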

My full code is:
# coding=utf-8 ##
import urllib2
import re
import sys

type = sys.getfilesystemencoding()
j = 0
for i in range(0, 20, 10):
    url = 'http://www.imdb.com/title/tt0111161/reviews?'
    hash = 'start=%d' % i
    url = url + hash
    content = urllib2.urlopen(url).read()
    content = content.decode('UTF-8','ignore')
    # extract the user ids
    name = re.findall(r' <a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>', content)
    # extract the scores
    score = re.findall(r'<img width="102" height="12" alt="(.*?)/10" .*?>', content)
    for i in range(0,len(name)):
      if score!="":   
            j = j + 1               
            print mid+"+"+score   
print ('Total: ' + str(j) + ' entries').decode('UTF-8','ignore')
print 'done'
It really does look like a regular expression problem. Could someone take a look and tell me whether this can be fixed? Thanks!
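A side note on the loop in this code, independent of the pattern itself: `re.findall` returns a list, so the test `score != ""` compares a list with a string and is always true, even when nothing was matched. A small sketch of this (the one-line `content` string and its `x.gif` source are made-up placeholders, not taken from the page):

```python
import re

# Hypothetical one-line stand-in for the downloaded page.
content = '<img width="102" height="12" alt="10/10" src="x.gif"><br>'

# Score pattern from the post: returns a list of captured strings.
score = re.findall(r'<img width="102" height="12" alt="(.*?)/10" .*?>', content)
print(score)          # ['10']
print(score != "")    # True: a list is never equal to a string

# Even an empty result list is "not equal" to the empty string.
empty = re.findall(r'no-such-text', content)
print(empty)          # []
print(empty != "")    # True
```

For the same reason `mid+"+"+score` cannot work as written: `mid` is never defined, and `score` would have to be indexed (for example `score[i]`) before it can be joined to a string.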

whitelotus19 posted on 2014-08-31 12:55

Not sure if this is what you are after?
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def fa():
    text='''
      <div>
      <small>865 out of 1160 people found the following review useful:</small><br>
      <a href="/user/ur1285640/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
      <h2>The best story ever told on film</h2>
      <img width="102" height="12" alt="8/10" src="http://i.media-imdb.com/images/showtimes/80.gif"><br>
      <b>Author:</b>
      <a href="/user/ur1285640/">Si Cole</a><br>
      <small>3 August 2001</small><br>
      <p><b>*** This review may contain spoilers ***</b></p>

      </div>
      <p>

      I believe that this film is the best story ever told on film, and I'm about
      to tell you why.<br><br>Tim Robbins plays Andy Dufresne, a city banker, wrongfully convicted of
      murdering his wife and her lover. He is sent to Shawshank Prison in 1947
      and
      receives a double life sentence for the crime. Andy forms an unlikely
      friendship with &quot;Red&quot; (Morgan Freeman), the man who knows how to get
      things.
      Andy faces many trials in prison, but forms an alliance with the wardens
      because he is able to use his banking experience to help the corrupt
      officials amass personal fortunes. The story unfolds....<br><br>I was so impressed with how every single subplot was given a great deal of
      respect and attention from the director. The acting was world-class. I have
      never seen Tim Robbins act as well since, Morgan Freeman maybe (e.g.
      Seven).
      The twists were unexpected, an although this film had a familiar feel, it
      wasn't even slightly pretentious or cliched, it was original. The
      cinematography was grand and expressive. It gave a real impression of the
      sheer magnitude of this daunting prison.<br><br>But the one thing which makes THE SHAWSHANK REDEMPTION stand above all
      other
      films, is the attention given to the story. The film depends on the story
      and the way in which it unravels. It's a powerful, poignant,
      thought-provoking, challenging film like no other. If Andy were to comment
      on this film, I think he might say: &quot;Get busy watching, or get busy dying.&quot;
      Take his advice.<br><br>Thoroughly recommended.
      </p>

      <div class="yn" id="ynd_348829">

      <form method="get"

         action="/register/login"

      >
      Was the above review useful to you?

      <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348829_yes" value="Yes"

         rel="login"

      >
      <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348829_no" value="No"

         rel="login"

      >

      </form>

      </div>

      <hr noshade="1" size="1" width="50%" align="center">

      <div>
      <small>583 out of 721 people found the following review useful:</small><br>
      <a href="/user/ur0257957/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
      <h2>The Shawshank Redemption</h2>
      <img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
      <b>Author:</b>
      <a href="/user/ur0257957/">Tim Cox</a> <small>from Marietta, OH</small><br>
      <small>25 March 1999</small><br>

      </div>
      <p>

      One of the finest films made in recent years. It's a poignant story
      about hope. Hope gets me. That's what makes a film like this more than a
      movie. It tells a lesson about life.
      Those are the films people talk about 50 or even 100 years from you. It's
      also a story for freedom. Freedom from isolation,
      from rule, from bigotry and hate. Freeman and Robbins are
      majestic in their performances. Each learns from the other.
      Their relationship is strong and you feel that from the first
      moment they make contact with one another. There is also a
      wonderful performance from legend James Whitmore as Brooks.<br><br>He shines when it is his time to go back into the world,
      only
      to find that the world grew up so fast he never even got
      a chance to blink. Stephen King's story is brought to the
      screen with great elegance and excitement. It is an extraordinary motion
      that people &quot;will&quot; be talking about in
      50 or 100 years.
      </p>

      <div class="yn" id="ynd_348222">

      <form method="get"

         action="/register/login"

      >
      Was the above review useful to you?

      <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348222_yes" value="Yes"

         rel="login"

      >
      <input class="click linkasbutton-secondary" type="submit"
         name="ynb_348222_no" value="No"

         rel="login"

      >

      </form>

      </div>

      <hr noshade="1" size="1" width="50%" align="center">

      <div>
      <small>574 out of 706 people found the following review useful:</small><br>
      <a href="/user/ur0611718/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
      <h2>Powerful</h2>
      <img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
      <b>Author:</b>
      <a href="/user/ur0611718/">Thomas McFadden (tmac4)</a> <small>from Houston, Texas</small><br>
      <small>25 July 2001</small><br>

      </div>
      '''
    for x in re.findall(r'<img width="\d+" height="\d+" alt="(\d+)/\d+" src=.*?<a href="/user/([\d\w]+)/">[^<]+</a>',text,re.DOTALL):
      print x

fa()
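A short note on why this version behaves differently, as far as can be told from the sample: the score and the user id are captured by one pattern, so each pair stays together; `re.DOTALL` lets `.*?` run across the line breaks between the rating `<img>` and the author link; and `[^<]+` accepts author names containing spaces, without requiring a `<small>` location. A trimmed sketch using two review headers from the sample above (`\w+` is used in place of `[\d\w]+`, which is equivalent because `\w` already includes digits):

```python
import re

# Trimmed copy of two review headers from the sample above. The first
# author has no <small> location element.
text = '''
<img width="102" height="12" alt="8/10" src="http://i.media-imdb.com/images/showtimes/80.gif"><br>
<b>Author:</b>
<a href="/user/ur1285640/">Si Cole</a><br>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur0257957/">Tim Cox</a> <small>from Marietta, OH</small><br>
'''

# One pattern, two groups: (score, user id). DOTALL makes "." match newlines.
pattern = re.compile(
    r'<img width="\d+" height="\d+" alt="(\d+)/\d+" src=.*?'
    r'<a href="/user/(\w+)/">[^<]+</a>',
    re.DOTALL)

pairs = pattern.findall(text)
for score, user_id in pairs:
    print(user_id + '+' + score)
# ur1285640+8
# ur0257957+10
```

Since the pattern starts at the rating image, a review without a rating is not reported at all, which matches the "only when a score is present" requirement in the question.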

Cam_Un posted on 2014-08-31 17:44

Reply to #4 whitelotus19


It works! Thank you so much!!!

lizhihui_kevin posted on 2014-09-15 16:26

pattern_match = re.compile(r'<a href="/user/(.*)/">(.*)</a> <small>(.*)</small><br>')
results = re.findall(pattern_match, content)
if results:
    print results
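For reference, a small sketch of what this pattern returns on two author lines taken from the sample in #4 (the `content` string here is a stand-in for the downloaded page). Because `.` does not match a newline by default, the greedy `(.*)` groups stay within a single line; the pattern does require a `<small>` element, though, so an author line without a location is skipped:

```python
import re

# Stand-in for the downloaded page: two author lines from the sample in #4.
content = (
    '<a href="/user/ur1285640/">Si Cole</a><br>\n'
    '<a href="/user/ur0257957/">Tim Cox</a> <small>from Marietta, OH</small><br>\n'
)

pattern_match = re.compile(
    r'<a href="/user/(.*)/">(.*)</a> <small>(.*)</small><br>')

# Each match is a tuple: (user id, display name, location).
results = pattern_match.findall(content)
print(results)
# [('ur0257957', 'Tim Cox', 'from Marietta, OH')]
```

If authors without a location should also be captured, the `<small>` part would need to be made optional or dropped, as in the pattern from #4.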