Scraping web page information with Python (regular expression problem): urgent, any help greatly appreciated!

Cam_Un posted on 2014-08-31 08:36

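I urgently need to scrape some information from a web page for analysis. I'm new to Python and regular expressions; I've been tinkering for a few days but some parts still confuse me, so I'd appreciate any pointers.
The page source is shown below. I want to extract the user ID ur2509775 from this line:
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>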

<div>
<small>1801 out of 2099 people found the following review useful:</small><br>
<a href="/user/ur2509775/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTk0MzY5MjExM15BMl5BanBnXkFtZTcwOTQyNjE3OQ@@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Tied for the best movie I have ever seen</h2>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
<small>26 November 2003</small><br>

</div>

I tried the code below, but every result comes back empty and nothing is printed from the loop; it simply captures nothing.
j=0
for i in range(0, 20, 10):
    url = 'xxx'
    hash = 'start=%d' % i
    url = url + hash
    content = urllib2.urlopen(url).read()
    name = re.findall(r' <a href="/user/（.*?）/">.*?</a> <small>.*?</small><br>', content)
    for i in range(0,len(name)):
        j = j+1
        print mid
print j
print 'done'

Output:
>>>
0
done
>>>

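I suspect the regular expression is the problem, but I'm not sure. I've tried several variations without success. Any help would be greatly appreciated!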

whitelotus19 posted on 2014-08-31 09:36

See if this works:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def fa():
    text='''
<div>
<small>1801 out of 2099 people found the following review useful:</small><br>
<a href="/user/ur2509775/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMTk0MzY5MjExM15BMl5BanBnXkFtZTcwOTQyNjE3OQ@@._V1._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
<h2>Tied for the best movie I have ever seen</h2>
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>
<small>26 November 2003</small><br>

</div>

'''
    # Capture the user ID from author links whose text is a single word
    print('\n'.join(re.findall(r'<a href="/user/([\d\w]+)/">\w+</a>',text)))

fa()

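Cam_Un posted on 2014-08-31 09:59

Reply to #2 whitelotus19

Thanks! I tried this snippet and it works. What I actually want, though, is to extract each reviewer's ID and rating (when a rating is present) directly from this page (http://www.imdb.com/title/tt0111161/reviews?start=0). The ratings come out fine, but the user IDs never do. I just pasted your regular expression into my code and tried it, but the result is still empty.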

My full code is:
# coding=utf-8 ##
import urllib2
import re
import sys

type = sys.getfilesystemencoding()
j = 0
for i in range(0, 20, 10):
    url = 'http://www.imdb.com/title/tt0111161/reviews?'
    hash = 'start=%d' % i
    url = url + hash
    content = urllib2.urlopen(url).read()
    content = content.decode('UTF-8','ignore')
    # Read the user IDs
    name = re.findall(r' <a href="/user/（.*?）/">.*?</a> <small>.*?</small><br>', content)
    # Read the scores
    score = re.findall(r'<img width="102" height="12" alt="(.*?)/10" .*?>', content)
    for i in range(0,len(name)):
        if score!="":
            j = j + 1
            print mid+"+"+score
print ('共有'+ str(j) +'条').decode('UTF-8','ignore')  # prints "<j> in total"
print 'done'

It really does look like a regular expression problem. Could someone take a look and see whether this can be fixed? Thanks!
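
One thing visible in the pattern as posted: the capture group is written with full-width parentheses (`（` and `）`) rather than ASCII `(` and `)`. The `re` module treats full-width parentheses as ordinary literal characters, so the pattern searches the HTML for a literal `（` and never matches. The sketch below assumes the posted characters are what the script really contained. The `sample` string is copied from the first post, the variable names are illustrative, and the leading space of the posted pattern is dropped so that only the parentheses differ.

```python
# -*- coding: utf-8 -*-
import re

# Author line copied from the page source in the first post.
sample = u'<b>Author:</b>\n<a href="/user/ur2509775/">carflo</a> <small>from Texas</small><br>\n'

# Pattern as posted: the full-width parentheses are literal characters, not a group.
posted = u'<a href="/user/（.*?）/">.*?</a> <small>.*?</small><br>'
# Same pattern with ASCII parentheses, which do form a capture group.
fixed = u'<a href="/user/(.*?)/">.*?</a> <small>.*?</small><br>'

posted_hits = re.findall(posted, sample)  # no literal full-width parenthesis in the HTML
fixed_hits = re.findall(fixed, sample)    # the captured user ID

print(posted_hits)
print(fixed_hits)
```

If the real script did use ASCII parentheses, this is not the cause and the pattern would need to be checked against the raw page source instead.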

whitelotus19 posted on 2014-08-31 12:55

Is this what you're after?
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def fa():
    text='''
   <div>
   <small>865 out of 1160 people found the following review useful:</small><br>
   <a href="/user/ur1285640/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
   <h2>The best story ever told on film</h2>
   <img width="102" height="12" alt="8/10" src="http://i.media-imdb.com/images/showtimes/80.gif"><br>
   <b>Author:</b>
   <a href="/user/ur1285640/">Si Cole</a><br>
   <small>3 August 2001</small><br>
   <p><b>*** This review may contain spoilers ***</b></p>

   </div>
   <p>

   I believe that this film is the best story ever told on film, and I'm about
   to tell you why.<br><br>Tim Robbins plays Andy Dufresne, a city banker, wrongfully convicted of
   murdering his wife and her lover. He is sent to Shawshank Prison in 1947
   and
   receives a double life sentence for the crime. Andy forms an unlikely
   friendship with "Red" (Morgan Freeman), the man who knows how to get
   things.
   Andy faces many trials in prison, but forms an alliance with the wardens
   because he is able to use his banking experience to help the corrupt
   officials amass personal fortunes. The story unfolds....<br><br>I was so impressed with how every single subplot was given a great deal of
   respect and attention from the director. The acting was world-class. I have
   never seen Tim Robbins act as well since, Morgan Freeman maybe (e.g.
   Seven).
   The twists were unexpected, an although this film had a familiar feel, it
   wasn't even slightly pretentious or cliched, it was original. The
   cinematography was grand and expressive. It gave a real impression of the
   sheer magnitude of this daunting prison.<br><br>But the one thing which makes THE SHAWSHANK REDEMPTION stand above all
   other
   films, is the attention given to the story. The film depends on the story
   and the way in which it unravels. It's a powerful, poignant,
   thought-provoking, challenging film like no other. If Andy were to comment
   on this film, I think he might say: "Get busy watching, or get busy dying."
   Take his advice.<br><br>Thoroughly recommended.
   </p>

   <div class="yn" id="ynd_348829">

   <form method="get"

      action="/register/login"

   >
   Was the above review useful to you?

   <input class="click linkasbutton-secondary" type="submit"
      name="ynb_348829_yes" value="Yes"

      rel="login"

   >
   <input class="click linkasbutton-secondary" type="submit"
      name="ynb_348829_no" value="No"

      rel="login"

   >

   </form>

   </div>

   <hr noshade="1" size="1" width="50%" align="center">

   <div>
   <small>583 out of 721 people found the following review useful:</small><br>
   <a href="/user/ur0257957/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
   <h2>The Shawshank Redemption</h2>
   <img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
   <b>Author:</b>
   <a href="/user/ur0257957/">Tim Cox</a> <small>from Marietta, OH</small><br>
   <small>25 March 1999</small><br>

   </div>
   <p>

   One of the finest films made in recent years. It's a poignant story
   about hope. Hope gets me. That's what makes a film like this more than a
   movie. It tells a lesson about life.
   Those are the films people talk about 50 or even 100 years from you. It's
   also a story for freedom. Freedom from isolation,
   from rule, from bigotry and hate. Freeman and Robbins are
   majestic in their performances. Each learns from the other.
   Their relationship is strong and you feel that from the first
   moment they make contact with one another. There is also a
   wonderful performance from legend James Whitmore as Brooks.<br><br>He shines when it is his time to go back into the world,
   only
   to find that the world grew up so fast he never even got
   a chance to blink. Stephen King's story is brought to the
   screen with great elegance and excitement. It is an extraordinary motion
   that people "will" be talking about in
   50 or 100 years.
   </p>

   <div class="yn" id="ynd_348222">

   <form method="get"

      action="/register/login"

   >
   Was the above review useful to you?

   <input class="click linkasbutton-secondary" type="submit"
      name="ynb_348222_yes" value="Yes"

      rel="login"

   >
   <input class="click linkasbutton-secondary" type="submit"
      name="ynb_348222_no" value="No"

      rel="login"

   >

   </form>

   </div>

   <hr noshade="1" size="1" width="50%" align="center">

   <div>
   <small>574 out of 706 people found the following review useful:</small><br>
   <a href="/user/ur0611718/"><img class="avatar" src="http://ia.media-imdb.com/images/M/MV5BMjI2NDEyMjYyMF5BMl5BanBnXkFtZTcwMzM3MDk0OQ@@._SX40_SY40_SS40_.jpg" height=${avatar.image.size} width=${avatar.image.size}></a>
   <h2>Powerful</h2>
   <img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
   <b>Author:</b>
   <a href="/user/ur0611718/">Thomas McFadden (tmac4)</a> <small>from Houston, Texas</small><br>
   <small>25 July 2001</small><br>

   </div>
   '''
    # re.DOTALL lets .*? span the line breaks between the score image and the author link
    for x in re.findall(r'<img width="\d+" height="\d+" alt="(\d+)/\d+" src=.*?<a href="/user/([\d\w]+)/">[^<]+</a>',text,re.DOTALL):
        print(x)

fa()
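
Two details of this pattern matter. `re.DOTALL` lets `.*?` cross the line breaks between the score image and the author link, and `[^<]+` accepts author names containing spaces or parentheses, which the `\w+` in the earlier reply does not. A minimal sketch on two reviews trimmed from the sample above; `REVIEW_RE` and `extract_reviews` are illustrative names, not part of the original reply.

```python
import re

# Two reviews trimmed from the sample text in this thread.
html = '''
<img width="102" height="12" alt="8/10" src="http://i.media-imdb.com/images/showtimes/80.gif"><br>
<b>Author:</b>
<a href="/user/ur1285640/">Si Cole</a><br>
<hr noshade="1" size="1" width="50%" align="center">
<img width="102" height="12" alt="10/10" src="http://i.media-imdb.com/images/showtimes/100.gif"><br>
<b>Author:</b>
<a href="/user/ur0611718/">Thomas McFadden (tmac4)</a> <small>from Houston, Texas</small><br>
'''

# Score image first, then (lazily, across newlines) the next author link with text.
REVIEW_RE = re.compile(
    r'<img width="\d+" height="\d+" alt="(\d+)/\d+" src='
    r'.*?<a href="/user/(\w+)/">[^<]+</a>',
    re.DOTALL
)

def extract_reviews(page):
    """Return (user_id, score) pairs for reviews that carry a score image."""
    return [(user_id, int(score)) for score, user_id in REVIEW_RE.findall(page)]

pairs = extract_reviews(html)
for user_id, score in pairs:
    print('%s+%d' % (user_id, score))
```

Because each match starts at a score image, a review without one produces no match, which fits the requirement of only reporting reviews that have a rating.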

Cam_Un posted on 2014-08-31 17:44

Reply to #4 whitelotus19

It works! Thank you so much!!!

lizhihui_kevin posted on 2014-09-15 16:26

pattern_match = re.compile(r'<a href="/user/(.*)/">(.*)</a> <small>(.*)</small><br>')
results = re.findall(pattern_match, content)
if results:
    print(results)
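
Note that this pattern requires a `<small>...</small>` location after the author link, so an author line without one (such as "Si Cole" in the sample earlier in the thread) is skipped. A hedged variant that makes the location optional is sketched below; the names `strict`, `relaxed`, and `lines` are illustrative, and the two input lines are copied from that sample.

```python
import re

# Two author lines from the sample: one without and one with a location.
lines = (
    '<a href="/user/ur1285640/">Si Cole</a><br>\n'
    '<a href="/user/ur0257957/">Tim Cox</a> <small>from Marietta, OH</small><br>\n'
)

# Pattern from the reply above: the <small> part is mandatory.
strict_re = re.compile(r'<a href="/user/(.*)/">(.*)</a> <small>(.*)</small><br>')
# Variant: the <small> part sits in an optional non-capturing group.
relaxed_re = re.compile(
    r'<a href="/user/(\w+)/">([^<]+)</a>(?: <small>([^<]*)</small>)?<br>'
)

strict = strict_re.findall(lines)    # only the line that has a location
relaxed = relaxed_re.findall(lines)  # both lines; a missing location becomes ''

print(strict)
print(relaxed)
```

`findall` fills a group that did not take part in the match with an empty string, so every tuple keeps three fields.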

Chinaunix's Archiver
