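Since s1 and dd are both results that have already been parsed out, they should not need to be parsed again. You can simply use replace:
    # Pair each parsed string in s1 with its replacement in dd.
    for old, new in zip(s1, dd):
        a = a.replace(old, new)
    print(a)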
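Here is a small self-contained sketch of the replace approach. The values of a, s1 and dd are made up for illustration (the real ones come from the earlier parsing step), and replace_all is just a hypothetical helper name.

```python
# Hypothetical sample data: s1 holds the substrings to find, dd their replacements.
a = "x [<b>one</b>] y [<i>two</i>] z"
s1 = ["[<b>one</b>]", "[<i>two</i>]"]
dd = ["[one]", "[two]"]

def replace_all(text, olds, news):
    # str.replace treats both arguments as literal text, so no escaping is needed.
    for old, new in zip(olds, news):
        text = text.replace(old, new)
    return text

result = replace_all(a, s1, dd)
print(result)  # x [one] y [two] z
```

Because replace works on literal text, characters such as [ and ] in s1 need no special handling here.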
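Alternatively, re.sub works too, but you need re.escape to escape the special characters so that they are not interpreted as regular-expression syntax, for example:
    import re
    for i in range(len(s1)):
        # re.escape makes s1[i] match as literal text, not as a pattern.
        a = re.sub(re.escape(s1[i]), dd[i], a)
    print(a)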
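To see why the escaping matters, here is a minimal sketch with invented values (text, old and new are not from the thread). It also shows a related caveat: re.sub processes backslash escapes in a replacement string, so if the replacement may contain backslashes, passing a function is one way to insert it verbatim.

```python
import re

text = "total = (a+b) * 2"
old = "(a+b)"      # hypothetical search string containing regex metacharacters
new = r"sum\n"     # hypothetical replacement containing a backslash

# Without escaping, "(a+b)" means "one or more 'a' followed by 'b'",
# which does not occur in text, so nothing is replaced.
unescaped = re.sub(old, "S", text)

# re.escape makes the pattern match the literal characters "(a+b)".
escaped = re.sub(re.escape(old), "S", text)

# A function replacement is inserted as-is, with no backslash processing.
literal = re.sub(re.escape(old), lambda m: new, text)

print(unescaped)
print(escaped)
print(literal)
```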
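It can be simplified further and done in a single pass:
    #coding=cp936
    import re
    a = '<p>吖<span lang=EN-US> ā [</span>吖嗪<span lang=EN-US>] (ā</span>q<span lang=EN-US>í</span>n<span lang=EN-US>) '
    def do_sub(m):
        # Strip every tag inside the matched [...] span.
        return re.sub(r'<.*?>', '', m.group())
    # Find each [...] span (non-greedy) and clean it with do_sub.
    a = re.sub(r'\[.*?\]', do_sub, a)
    print(a)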
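That said, your requirement itself seems to have a problem. In [</span>, for example, the </span> closes the preceding <span>. If you remove it, the tags no longer pair up correctly. The same goes for the later one.
Also, you can handle this with the .*? form (non-greedy matching), and the r prefix (raw strings) makes the patterns simpler to write. Here is a passage from the Python documentation about ?, for reference: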
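A rough sketch of what that pairing problem looks like, using the sample string from this thread. first_span_text is a hypothetical helper added only for this illustration; it returns the text between the first <span ...> and the next </span>.

```python
import re

a = '<p>吖<span lang=EN-US> ā [</span>吖嗪<span lang=EN-US>] (ā</span>q<span lang=EN-US>í</span>n<span lang=EN-US>) '

def do_sub(m):
    # Remove every tag inside one [...] match.
    return re.sub(r'<.*?>', '', m.group())

def first_span_text(html):
    # Text between the first <span ...> and the first </span> after it.
    return re.search(r'<span[^>]*>(.*?)</span>', html).group(1)

cleaned = re.sub(r'\[.*?\]', do_sub, a)

before = first_span_text(a)        # what the first span wrapped originally
after = first_span_text(cleaned)   # what it wraps once the tags are removed
print(repr(before))
print(repr(after))
```

After the substitution the first EN-US span runs on until a later </span>, so it now also wraps the Chinese text 吖嗪 that was originally outside it.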
*?, +?, ??
The "*", "+", and "?" qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
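The example from that passage can be tried directly. This short sketch just runs both patterns against the string quoted in the documentation:

```python
import re

s = '<H1>title</H1>'

# Greedy: .* grabs as much as possible, up to the last '>'.
greedy = re.match(r'<.*>', s).group()

# Non-greedy: .*? stops at the first '>'.
lazy = re.match(r'<.*?>', s).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```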