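Hello everyone, I have a question.
I want to scrape the news titles from the Legal Daily (法制网) page http://www.legaldaily.com.cn/locality/node_32245.htm and save them to a CSV file. I have only just started learning Python and don't know enough yet, so I would like to ask for your help.
Since I have quite a few questions, I will describe them separately.
Difficulty: I cannot extract the text I need correctly.
Part of the page source looks like this: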
</HR><A class="f14 blue001" href="content/2013-11/01/content_4983464.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>玉门工商清理整顿一批无照经营户 <SPAN class="f12 black">2013-11-01</SPAN></A> <BR><A class="f14 blue001" href="content/2013-11/01/content_4983441.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>临夏州举办涉法涉诉信访工作改革培训班 <SPAN class="f12 black">2013-11-01</SPAN></A> <BR><A class="f14 blue001" href="content/2013-11/01/content_4983439.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉市肃州区马营河水闸道路工程顺利通车 <SPAN class="f12 black">2013-11-01</SPAN></A> <BR><A class="f14 blue001" href="content/2013-11/01/content_4983401.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉狠抓四环节推进涉法涉诉信访工作改革 <SPAN class="f12 black">2013-11-01</SPAN></A> <BR><A class="f14 blue001" href="content/2013-10/30/content_4974324.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉瓜州工商局开展群众路线教育实践活动 <SPAN class="f12 black">2013-10-30</SPAN></A> <BR><A class="f14 blue001" href="content/2013-10/29/content_4971723.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉市瓜州县工商局开展酒类市场集中整治 <SPAN class="f12 black">2013-10-29</SPAN></A> <BR><A class="f14 blue001" href="content/2013-10/21/content_4948889.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉市信访局开设“道德讲堂” <SPAN class="f12 black">2013-10-21</SPAN></A> <BR><A class="f14 blue001" href="content/2013-10/21/content_4948876.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>打造制度建设新亮点 推动酒泉经济发展 <SPAN class="f12 black">2013-10-21</SPAN></A> <BR><A class="f14 blue001" href="content/2013-10/18/content_4944212.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉加强行政程序建设提高依法行政水平 <SPAN class="f12 black">2013-10-18</SPAN></A> <BR><A class="f14 blue001" href="content/2013-10/16/content_4940043.htm?node=32245" target=_blank><SPAN class="f14 blue001">·</SPAN>酒泉肃州西峰乡进一步落实矛盾排查制度 <SPAN class="f12 black">2013-10-16</SPAN></A> <BR>
<HR SIZE=1>
The goal is to extract the news titles.
My current code is:
from bs4 import BeautifulSoup
import re
import urllib2
# Download the listing page and parse it
url = "http://www.legaldaily.com.cn/locality/node_32245.htm"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# Collect every <span> element on the page
xinwen = soup.find_all('span')
for xw in xinwen:
    print xw
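But when I run it, the title text that comes out is incomplete, and the output contains many entries like <span class="f14 blue001">路</span>.
How can I extract the titles correctly? Thanks.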
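For reference, here is one possible direction, given only as a rough sketch under assumptions and not as a tested solution for the live site. In the source quoted above, the title is plain text sitting directly inside each <A class="f14 blue001"> tag, while the <SPAN> tags hold only the bullet and the date, so looping over the spans cannot return the titles. The "路" is probably the "·" bullet shown in the wrong encoding, though that is a guess. The sketch uses Python 3 and the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and it works on two entries copied from the quoted source rather than downloading the page. The names SAMPLE_HTML, TitleParser, extract_titles and titles_to_csv are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Two entries copied from the page source quoted in the question.
SAMPLE_HTML = (
    '<A class="f14 blue001" href="content/2013-11/01/content_4983464.htm?node=32245" target=_blank>'
    '<SPAN class="f14 blue001">·</SPAN>玉门工商清理整顿一批无照经营户 '
    '<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>'
    '<A class="f14 blue001" href="content/2013-11/01/content_4983441.htm?node=32245" target=_blank>'
    '<SPAN class="f14 blue001">·</SPAN>临夏州举办涉法涉诉信访工作改革培训班 '
    '<SPAN class="f12 black">2013-11-01</SPAN></A> <BR>'
)


class TitleParser(HTMLParser):
    """Hypothetical helper: keep text directly inside <a class="f14 ...">,
    skipping anything nested in a <span> (the bullet and the date)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_link = False
        self._span_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "f14" in classes:
            self._in_link = True
            self._span_depth = 0
            self._parts = []
        elif tag == "span" and self._in_link:
            self._span_depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._in_link and self._span_depth:
            self._span_depth -= 1
        elif tag == "a" and self._in_link:
            title = "".join(self._parts).strip()
            if title:
                self.titles.append(title)
            self._in_link = False

    def handle_data(self, data):
        # Only text outside any nested <span> belongs to the title.
        if self._in_link and self._span_depth == 0:
            self._parts.append(data)


def extract_titles(html):
    """Return the list of title strings found in the given HTML text."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


def titles_to_csv(titles):
    """Return CSV text with a header row and one title per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])
    return buffer.getvalue()


if __name__ == "__main__":
    print(titles_to_csv(extract_titles(SAMPLE_HTML)))
```

To write a real file instead of a string, the same csv.writer can be given a file opened with open("titles.csv", "w", newline="", encoding="utf-8-sig"); the file name and the encoding are assumptions, the latter chosen so that spreadsheet programs are more likely to show the Chinese text correctly. With BeautifulSoup the same idea would be to loop over the 'a' tags rather than the 'span' tags.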