Last edited by yakczh_cu on 2016-12-21 20:31
from bs4 import BeautifulSoup

html_doc = '''
<!DOCTYPE HTML>
<html lang="ru-RU">
<head>
<title></title>
<meta charset="UTF-8">
</head>
<body>
<img src="/upload/some.jpg" />
<script src="/js/jquery" > </script>
<script >

document.write(navigator.userAgent);
</script>
</body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml', from_encoding='utf8')

def has_class_but_no_id(tag):
    # True for any tag that carries a src attribute
    return tag.has_attr('src')

# Attempt: the function is passed as the second positional argument (attrs)
scripts = soup.find_all('script', has_class_but_no_id)
print(scripts)
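scripts=soup.find_all(has_class_but_no_id)
also returns nodes such as img src=xxx.
soup.find_all("script")
also returns the inline script blocks that have no src.
How can I select only the nodes whose tag is script and that also have a 'src' attribute?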
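One possible approach, sketched here under a few assumptions: Beautiful Soup's find_all accepts attribute filters as keyword arguments, so src=True can be combined with the tag name. The sample below uses a shortened copy of the HTML from the post and the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies. The helper name is_script_with_src and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the post
html_doc = '''
<html><body>
<img src="/upload/some.jpg" />
<script src="/js/jquery"> </script>
<script>document.write(navigator.userAgent);</script>
</body></html>
'''

# 'html.parser' is an assumption made here to avoid requiring lxml
soup = BeautifulSoup(html_doc, 'html.parser')

# Option 1: tag name plus keyword filter; src=True matches any src value
by_keyword = soup.find_all('script', src=True)

# Option 2: a single function filter that checks both conditions
def is_script_with_src(tag):  # hypothetical helper name
    return tag.name == 'script' and tag.has_attr('src')

by_function = soup.find_all(is_script_with_src)

# Option 3: CSS attribute selector
by_selector = soup.select('script[src]')

print(by_keyword)
```

All three variants should return only the script tag that points at /js/jquery; the img tag and the inline script are skipped. The keyword form is the shortest, while the function form is easier to extend if more conditions are needed later.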