I am writing a spider with Scrapy.
The relevant source of the page I need to crawl looks like this:
<tr class="even" data-id=salinas leteliernicolás fabrizziom18-24> <td>54</td> <td>1</td> <td class="name"><span class="icon-silver"></span> Salinas Letelier, Nicolás Fabrizzio</td> <td>CHL</td> <td>4</td> <td>18:45:59</td> <td>9,848</td> </tr>
<tr class="odd" data-id=borges carneirolucasm18-24> <td>55</td> <td>2</td> <td class="name"><span class="icon-silver"></span> Borges Carneiro, Lucas</td> <td>BRA</td> <td>3</td> <td>28:13:30</td> <td>9,825</td> </tr>
The code I wrote is as follows:
def parse_RR(self, response):
    # RRItem is the project's own Item class (defined in items.py).
    # Select both odd and even result rows with one expression.
    rows = response.xpath(
        '//*[@id="mainContentCol4"]/div/div/div/section/div[2]/table/tbody'
        '/tr[@class="odd" or @class="even"]'
    )
    items = []
    for row in rows:
        item = RRItem()
        # This is the field that comes back truncated.
        item['DataID'] = row.xpath('@data-id').extract()
        item['OverallPosition'] = row.xpath('td[1]/text()').extract()
        item['CountryPosition'] = row.xpath('td[2]/text()').extract()
        item['Icon'] = row.xpath('td[3]/span/@class').extract()
        item['Name'] = row.xpath('td[3]/text()').extract()
        item['Country'] = row.xpath('td[4]/text()').extract()
        item['Races'] = row.xpath('td[5]/text()').extract()
        item['OverallTime'] = row.xpath('td[6]/text()').extract()
        item['Points'] = row.xpath('td[7]/text()').extract()
        items.append(item)
    return items
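The problem I have run into is that, when reading data-id, xpath('@data-id') drops everything after the first space, so the value comes back incomplete.
For example, the extracted DataID is salinas instead of salinas leteliernicolás fabrizziom18-24
How should I write this? Thanks.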
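A likely cause, judging from the pasted source: the data-id value is not wrapped in quotes, and in HTML an unquoted attribute value ends at the first whitespace. The parser therefore sees data-id=salinas, and the remaining words become separate, valueless attributes, so the XPath expression itself is not what truncates the value. Below is a minimal sketch of one possible workaround that reads the value from the raw markup with a regular expression. The helper names quote_data_id and extract_data_ids are made up for this illustration, and the pattern assumes data-id is the last attribute of the <tr> tag and that its value contains no ">" or quote characters, as in the two rows shown.

```python
import re

# Matches an unquoted data-id attribute at the end of a <tr ...> start tag.
# Assumption: data-id is the last attribute, so its value runs up to '>'.
DATA_ID_RE = re.compile(r'(<tr\b[^>]*?\bdata-id=)(?!["\'])([^>]*?)\s*>')


def extract_data_ids(html):
    """Return the full, untruncated data-id values found in the raw HTML."""
    return [m.group(2) for m in DATA_ID_RE.finditer(html)]


def quote_data_id(html):
    """Wrap unquoted data-id values in double quotes so that an HTML parser
    keeps the whole value, spaces included. Already quoted values are left
    untouched."""
    return DATA_ID_RE.sub(lambda m: f'{m.group(1)}"{m.group(2)}">', html)


if __name__ == "__main__":
    sample = ('<tr class="odd" data-id=borges carneirolucasm18-24> '
              '<td>55</td> </tr>')
    print(extract_data_ids(sample))
    print(quote_data_id(sample))
```

Inside the spider, the repaired markup could then be parsed with something like scrapy.Selector(text=quote_data_id(response.text)), after which '@data-id' should return the whole string. This is untested against the real page, so treat it as a starting point rather than a definitive fix.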