About some concepts in HTMLParser

论坛徽章:
0
跳转到指定楼层
1 [收藏(0)] [报告]
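Posted 2009-11-20 19:33
What do flowing_data, paragraph, line_break, and label_data in the htmllib module each cover? I can't make sense of them.
For example, I ran the htmllib module directly (with the Formatter taking an AbstractWriter as its argument). The full HTML source used as input is in the attachment.
An excerpt of the HTML: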

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>

<head>
  <title>A big win for Creamer; a big move for Karlsson - Golf - Yahoo! Sports</title>
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
  <meta
name="description" content="Paula Creamer held off Lorena Ochoa, among
others, to win the Samsung World Championship on Sunday.  - Golf
news"/>
  <link rel="stylesheet" type="text/css"
href="http://l.yimg.com/img.sports.yahoo.com/static/versioned_asset/v3/css/editorial/css/yui/reset-fonts-grids_2.1.0.r1.3.css;editorial/css/adops/uhbt1_v27_1.8.r1.5.css;editorial/css/sports.r1.65.css;editorial/css/experts.r1.7.css;editorial/css/player_search.r1.5.css;editorial/css/scorethin.r1.26.css;editorial/css/article.r1.69.css;editorial/css/video.r1.23.css;editorial/css/sitewide_nav_header_footer_test.r1.14.2.29.css"
/>
<link rel="stylesheet" type="text/css" media="print"
href="http://l.yimg.com/img.sports.yahoo.com/static/versioned_asset/v3/print_css/editorial/print_css/article.r1.6.css"
/>

  <script type="text/javascript"
src="http://l.yimg.com/img.sports.yahoo.com/static/versioned_asset/v3/minify/js/editorial/js/yui/yuiloader-beta-min_2.5.1.r1.4.js;editorial/js/yui/dom-min_2.5.1.r1.4.js;editorial/js/yui/event-min_2.5.1.r1.4.js;editorial/js/yui/connection-min_2.5.1.r1.4.js;editorial/js/yui/animation-min_2.5.1.r1.4.js;editorial/js/yui/json-min.r1.3.js;editorial/js/constants.r1.15.js;editorial/js/globalsearch.r1.3.js;editorial/js/sports.r1.23.js;editorial/js/tabs.r1.20.js;editorial/js/cookie.r1.3.js;editorial/js/article.r1.15.js;editorial/js/window.r1.16.js;editorial/js/scorethin.r1.16.js;editorial/js/manager.r1.3.js;editorial/js/carousel.r1.12.js;editorial/js/nav_test.r1.6.js;editorial/js/ult.r1.3.js;editorial/js/mlbtv.r1.5.js;editorial/js/player_search.r1.3.js;editorial/js/flyout_test.r1.22.js;editorial/js/ult.r1.3.js"></script>

<script type="text/javascript">
  YAHOO.Sports.oAC
= { uri :
'http://l.yimg.com/img.sports.yahoo.com/static/versioned_asset/v3/minify/js/editorial/js/yui/autocomplete-min_2.5.1.r1.4.js;editorial/js/yui/element-beta-min_2.5.1.r1.4.js;editorial/js/yui/datasource-beta-min_2.5.1.r1.4.js;editorial/js/yui/datatable-beta-min_2.5.1.r1.4.js;editorial/js/nav_ac.r1.4.js'
};
</script>

  <!-- SpaceID=96837903 loc=RICH noad -->
<script language=javascript>
if(window.yzq_d==null)window.yzq_d=new Object();
window.yzq_d['qNPGCNj8fbo-']='&U=12cg7hqgf%2fN%3dqNPGCNj8fbo-%2fC%3d-1%2fD%3dRICH%2fB%3d-1%2fV%3d0';
</script><noscript><img
width=1 height=1 alt=""
src="http://us.bc.yahoo.com/b?P=UBl_6GKIPE7mNOczSKKh1gKFdOLXUEjrFAoACr3e&T=14vkhi4mo%2fX%3d1223365642%2fE%3d96837903%2fR%3dsports%2fK%3d5%2fV%3d2.1%2fW%3dH%2fY%3dYAHOO%2fF%3d4292398135%2fH%3dY29udGVudD0ibGVhZ3VlPWdvbGY-%2fQ%3d-1%2fS%3d1%2fJ%3d723C8862&U=12cg7hqgf%2fN%3dqNPGCNj8fbo-%2fC%3d-1%2fD%3dRICH%2fB%3d-1%2fV%3d0"></noscript>

      <STYLE>
      #ysports .iysmcm { padding:5px; border:1px solid #999; width:auto;}
      #ysports .iysmcm h3 { margin-bottom:0; font-size:92%; color:#000; font-weight:bold; }
      #ysports .iysmcm h4 { font-size:100%; }
      #ysports .iysmcm-col { margin:0px 20px 0 0; padding:0px; list-style-type:none; }
      #ysports .iysmcm-col li { margin-top:10px; }
      #ysports .iysmcm-desc { margin:0px; }
      #ysports #bd .iysmcm-desc a { text-decoration:none; color:#000; }
      #ysports #bd .iysmcm-url { color: #008000; }
      #ysports .iysmcm div { clear:both; }
      #ysports #oly-sport_nav { display: none; }
      </STYLE>
</head>

<body id="ysports" class="article memo ">
<div id="doc" class="yui-t4">

  <div id="hd">

    <script>YAHOO.Sports.ultPageInfo = {"ult" : true, "spaceid" : 96837903};</script>

<div class="mast">
  <div id="ysp-hd">
<div id="ysp-network-nav">
        <form action="http://srd.yahoo.com/loc=head&st=yahoo/*http://search.yahoo.com/search" class="yahoo-functions" id="ysp-yahoo-search-form" method="get">
          <fieldset>
            <legend>Web Search</legend>

            <ul>
              <li class="greeting">New User?</li>

            <li class="signup"><a
href="http://us.ard.yahoo.com/SIG=1508k7nqf/M=289534.10117939.10793562.1918016/D=sports/S=96837903:HEAD/_ylt=AqsFtID6bJGhg9cGKl.q7HwPocUF/Y=YAHOO/EXP=1223372842/L=UBl_6GKIPE7mNOczSKKh1gKFdOLXUEjrFAoACr3e/B=n9PGCNj8fbo-/J=1223365642787583/A=4352158/R=0/SIG=150mqajph/*https://edit.yahoo.com/config/eval_register?.done=http://sports.yahoo.com%2fgolf%2fpga%2fnews%3fslug%3dtxgolfnotes%26prov%3dst%26type%3dlgns&.src=spt&.intl=us">Sign Up</a></li>

            <li class="signin"><a
href="http://us.ard.yahoo.com/SIG=1508k7nqf/M=289534.10117939.10793562.1918016/D=sports/S=96837903:HEAD/_ylt=AqsFtID6bJGhg9cGKl.q7HwPocUF/Y=YAHOO/EXP=1223372842/L=UBl_6GKIPE7mNOczSKKh1gKFdOLXUEjrFAoACr3e/B=n9PGCNj8fbo-/J=1223365642787583/A=4352158/R=1/SIG=14pi7g4c9/*https://login.yahoo.com/config/login?.done=http://sports.yahoo.com%2fgolf%2fpga%2fnews%3fslug%3dtxgolfnotes%26prov%3dst%26type%3dlgns&.src=spt&.intl=us">Sign In</a></li>

            <li class="help"><a
href="http://us.ard.yahoo.com/SIG=1508k7nqf/M=289534.10117939.10793562.1918016/D=sports/S=96837903:HEAD/_ylt=AqsFtID6bJGhg9cGKl.q7HwPocUF/Y=YAHOO/EXP=1223372842/L=UBl_6GKIPE7mNOczSKKh1gKFdOLXUEjrFAoACr3e/B=n9PGCNj8fbo-/J=1223365642787583/A=4352158/R=2/SIG=11abslvoe/*http://help.yahoo.com/l/us/yahoo/sports/">Help</a></li>
              <li class="searchbox">
                <label for="web-search">Web Search</label>

                <input id="web-search" name="p" type="text">
                <input name="fr" type="hidden" value="ush-sports">
              </li>

            <li class="search-submit"><input alt="Web Search"
id="ysp-web-search-submit"
src="http://us.i1.yimg.com/us.yimg.com/i/us/sp/ed/nav08/web-search-01.png"
type="image"></li>


The output is as follows:
send_flowing_data("YAHOO.Sports.oAC = { uri : 'http://l.yimg.com/img.sports.yahoo.com/static/versioned_asset/v3/minify/js/editorial/js/yui/autocomplete-min_2.5.1.r1.4.js;editorial/js/yui/element-beta-min_2.5.1.r1.4.js;editorial/js/yui/datasource-beta-min_2.5.1.r1.4.js;editorial/js/yui/datatable-beta-min_2.5.1.r1.4.js;editorial/js/nav_ac.r1.4.js' };")
send_flowing_data(" if(window.yzq_d==null)window.yzq_d=new Object(); window.yzq_d['qNPGCNj8fbo-']='")
send_flowing_data("=12cg7hqgf%2fN%3dqNPGCNj8fbo-%2fC%3d-1%2fD%3dRICH%2fB%3d-1%2fV%3d0';")
send_flowing_data(' #ysports .iysmcm { padding:5px; border:1px solid #999; width:auto;} #ysports .iysmcm h3 { margin-bottom:0; font-size:92%; color:#000; font-weight:bold; } #ysports .iysmcm h4 { font-size:100%; } #ysports .iysmcm-col { margin:0px 20px 0 0; padding:0px; list-style-type:none; } #ysports .iysmcm-col li { margin-top:10px; } #ysports .iysmcm-desc { margin:0px; } #ysports #bd .iysmcm-desc a { text-decoration:none; color:#000; } #ysports #bd .iysmcm-url { color: #008000; } #ysports .iysmcm div { clear:both; } #ysports #oly-sport_nav { display: none; }')
send_flowing_data(' YAHOO.Sports.ultPageInfo = {"ult" : true, "spaceid" : 96837903};')
send_flowing_data(' Web Search')
send_line_break()
send_paragraph(1)
new_margin('ul', 1)
send_label_data('*')
send_flowing_data('New User?')
send_line_break()
send_label_data('*')
send_flowing_data('Sign Up')
send_flowing_data('[1]')
send_line_break()
send_label_data('*')
send_flowing_data('Sign In')
send_flowing_data('[2]')
send_line_break()
send_label_data('*')
send_flowing_data('Help')
send_flowing_data('[3]')
send_line_break()
send_label_data('*')
send_flowing_data('Web Search')
send_line_break()
send_label_data('*')
new_margin(None, 0)
new_margin('ul', 1)
send_line_break()
……

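What confuses me most:
I had assumed flowing_data was simply the content inside an HTML tag, but the code inside a single JS tag turned out to be split into two flowing_data chunks, which upended that assumption. Any pointers would be appreciated.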
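
A possible explanation, offered as an assumption rather than something confirmed in this thread: htmllib is built on sgmllib, which does not appear to treat the body of a script tag as opaque text, so the `&U` in the script line is scanned as an entity reference. `U` is not a known entity, so it is dropped, and the text on either side reaches the formatter as two separate data chunks. This is consistent with the output above, where `&U` is missing between the two send_flowing_data calls. htmllib does not exist in Python 3, so the sketch below uses the standard html.parser with convert_charrefs=False as a stand-in to show the same kind of splitting. EventRecorder and collect_events are hypothetical names, and the input is a shortened version of the script line, placed in a p tag because html.parser, unlike htmllib, passes script bodies through verbatim.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records text and entity events in order."""

    def __init__(self):
        # convert_charrefs=False makes the parser report "&name" separately
        # instead of merging it back into the surrounding text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        # Called once per run of plain text between tags/entities.
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for "&name"; the name need not be a real HTML entity.
        self.events.append(("entityref", name))


def collect_events(html):
    """Parse html and return the recorded (kind, text) events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Shortened stand-in for the window.yzq_d[...] line in the post.
    for event in collect_events("<p>d='&U=12cg7hqgf';</p>"):
        print(event)
```

The single text run comes back as a data event ending in `='`, an entityref event for `U`, and a second data event starting with `=12cg7hqgf`, which mirrors the two send_flowing_data calls in the trace. If this is indeed the cause, flowing_data is best read as "a run of text handed to the formatter", not "the whole content of one tag".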

Attachment: test.tar.gz (201 Bytes, downloaded 16 times)
