blackantt posted on 2021-04-20 19:14

Why does the re.findall result contain extra , " and similar characters?

import requests
import re
url = 'http://www.shubang.net/book/66_2151.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'}
web_data = requests.get(url, headers=headers)
web_data.encoding = 'utf-8'
txt = web_data.text
items = re.findall(r'line_en\" \>(.*)<|line_cn\" title=\"(.*)\"', txt)   
for item in items:
    print(item)

The output looks like this:
......
('&#34;It doesn&#39;t look new. It looks old,&#34; one of the boys said.', '')
('', '“房子一点也不新,旧死了,”其中一个男孩说。')
('It just couldn&#39;t be.', '')
('', '绝对不可能。')
('The other members of his family turned to stare at me.', '')
('', '其他人都把目光转向了我。')
............


Questions:
1. Where do the ') , ( in the output above come from?
2. Why did couldn't turn into couldn&#39;?


blackantt posted on 2021-04-21 11:24

Figured it out: use the replace function to do the substitution.
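
For anyone landing here later, a rough sketch of what seems to be going on. When a pattern has more than one capture group, re.findall returns a list of tuples (one slot per group, with '' for the group that did not take part in the match), so print(item) shows the tuple's parentheses, quotes and commas. The &#39; and &#34; are HTML character references for ' and " that are already present in the page source. Instead of chaining replace calls, the standard library's html.unescape can decode them all at once. The sample string and the helper name extract_lines below are made up for illustration; the real page markup may differ.

```python
import html
import re

# Hypothetical sample standing in for the page source; the real markup may differ.
sample = (
    '<p class="line_en" >It just couldn&#39;t be.</p>\n'
    '<p class="line_cn" title="绝对不可能。"></p>\n'
)

# Same pattern as in the question: two alternatives, two capture groups.
pattern = r'line_en\" \>(.*)<|line_cn\" title=\"(.*)\"'


def extract_lines(text):
    """Return each match as a plain string with HTML entities decoded."""
    results = []
    for en, cn in re.findall(pattern, text):
        # At most one of the two groups is non-empty for any given match,
        # so "en or cn" picks whichever one actually captured text.
        results.append(html.unescape(en or cn))
    return results


for line in extract_lines(sample):
    print(line)
```

Unpacking each tuple and printing only the non-empty group gets rid of the extra punctuation, and html.unescape also covers entities other than &#39; and &#34; without listing them one by one.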