[Text processing] Help needed: extracting data from HTML text and generating a CSV file

#1 Posted 2017-05-18 00:20
Last edited by bikkuri on 2017-05-18 00:38

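Hello everyone! I have a problem I would like help with.
I have some HTML text and need to extract certain data from it and generate a CSV file.
Here is a sample: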

<HTML style="background-color: #CCCCCC"><HEAD><TITLE>7x50 HCT</TITLE><link rel="stylesheet" type="text/css" href="css/myStyle.css?v=1.2.3"/><link rel="shortcut icon" href="favicon.ico"/><script type="text/javascript" src="js/hct.js?v=1.0"></script><script type="text/javascript" src="js/overlib.js"></script></HEAD><BODY ><div id="overDiv" style="position:absolute; visibility:hidden; z-index:1000;"></div><center><center><table class="intro" width="99%" height="98%" border=0 cellpadding=0 cellspacing=0><tr><td align=center><table border=0; cellpadding="0" cellspacing="0" id="top1" width="98%" style="border-bottom:1px solid black; margin-top: 10px;"><tr><th nowrap width="22%" style="padding-left: 10px; padding-right: 10px;" class="newproperty"><style="margin-top: 1px; margin-left: 5px"><b style="font-size: 15pt;"> Health Check Tool v5 </b></th><th id="welcome" valign="middle" align="center" style="color:#333333; font-weight: bold"><form action="logon.php" method="POST">Welcome ZHU John!   |   <a href="index.php" style="color: #000000">UPLOAD</a>   |   <input type="hidden" name="logout" value="ZHU John,,0"><input  class="btn" type="submit" value="Log out" style="font-size:8pt" /></form></th></tr></table><TABLE id="content" WIDTH="98%" BORDER="0" valign ="top" style="border-left:1px solid #606060; border-right: 1px solid #606060; border-bottom: 2px solid #606060;"><TR><TD WIDTH="20%" ALIGN="LEFT" VALIGN="TOP"><table class="imagetable" border="3" align="center" cellpadding="0"><tr><th class="newproperty" colspan="3"><i> REPORT </i></th></tr><tr><td class="level_1" colspan="2">LEVEL 1</td></tr><tr><td class="darknav" >Error</td><td class="diff">Information requires further analysis by TEC</td></tr><tr><td class="darknav" >Component information</td><td class="value">IMM 7</td></tr><tr><td class="darknav" >Information</td><td class="value">TEC needs to review the TS files to determine the root cause.</td></tr><tr><td class="darknav" >Action Plan</td><td 
class="value">Escalate according to the severity.</td></tr><tr><td class="level_2" colspan="2">LEVEL 2</td></tr><tr><td class="darknav" >Error</td><td class="diff">Enable vprn-network-exceptions</td></tr><tr><td class="darknav" >Component information</td><td class="value">SF/CPM A[integrated]</td></tr><tr><td class="darknav" >Information</td><td class="value">General Recommendation: Implement TA 12-1435. <br>Enable vprn-network-exceptions under "config>system>security#" context.</td></tr><tr><td class="darknav" >Action Plan</td><td class="value"><a href="DOCS/7x50/TSN/TA12-1435.pdf" target="myspot">TA12-1435.pdf</a></td></tr></table></td><td width="70%" height="100%" align="center" VALIGN="TOP" rowspan="6"><iframe src="../HCT/upload/jzhu039/errorInfo.html" style="border:1px solid #CCCCCC" name="myspot" width="98%" height="100%"></iframe></td></tr><TR><TD WIDTH="20%" ALIGN="LEFT" VALIGN="TOP"><table class="imagetable" border="3" VALIGN="TOP" align="left" cellpadding="0"><tr><th class="newproperty" colspan="3"><i>ERROR DETAILS</i></th></tr><tr><td class="value" colspan="3"><a href="../HCT/upload/jzhu039/errorInfo.html" target="myspot"><img src="images/view.png"></a></td></tr></table></TD></TR><TR><TD WIDTH="20%" ALIGN="LEFT" VALIGN="TOP"><table class="imagetable" border="3" VALIGN="TOP" align="left" cellpadding="0"><tr><th class="newproperty" colspan=3><i>ANALYSIS / COMPARING TOOLS</i></th></tr><tr><td class="darknav"> </td><td class="darknav">DOWNLOAD</td><td class="darknav">DIFF</td></TR><tr><td class="value">CLI Show Output</td><td class="value"><a href="../HCT/upload/jzhu039/cliShow.zip"><img src="images/download.gif"></a></td><td class="value"><a href="diff.php?f1=../HCT/upload/jzhu039/ar1cta1gru.ts1.txt.show&f2=../HCT/upload/jzhu039/ar1cta1gru.ts2.txt.show" target="myspot"><img src="images/view.png"></a></td></tr><tr><td class="value">LOG 99 & 100</td><td class="value"><a href="../HCT/upload/jzhu039/logs.zip"><img src="images/download.gif"></a></td><td 
class="value"><a href="diff.php?f1=../HCT/upload/jzhu039/ar1cta1gru.ts1.txt.log&f2=../HCT/upload/jzhu039/ar1cta1gru.ts2.txt.log" target="myspot"><img src="images/view.png"></a></td></tr><tr><td class="value">CONFIG</td><td class="value"><a href="../HCT/upload/jzhu039/configs.zip"><img src="images/download.gif"></a></td><td class="value"><a href="diff.php?f1=../HCT/upload/jzhu039/ar1cta1gru.ts1.txt.cfg&f2=../HCT/upload/jzhu039/ar1cta1gru.ts2.txt.cfg" target="myspot"><img src="images/view.png"></a></td></tr><tr><td class="value">PORT STATS</td><td class="value"><a href="../HCT/upload/jzhu039/portStats.zip"><img src="images/download.gif"></a></td><td class="value"><a href="diff.php?f1=../HCT/upload/jzhu039/ar1cta1gru.ts1.txt.portstats&f2=../HCT/upload/jzhu039/ar1cta1gru.ts2.txt.portstats" target="myspot"><img src="images/view.png"></a></td></tr><tr><td class="value">CLI History</td><td class="value"><a href="../HCT/upload/jzhu039/cliHistory.zip"><img src="images/download.gif"></a></td><td class="value"><a href="diff.php?f1=../HCT/upload/jzhu039/ar1cta1gru.ts1.txt.hist&f2=../HCT/upload/jzhu039/ar1cta1gru.ts2.txt.hist" target="myspot"><img src="images/view.png"></a></td></tr><tr><td class="value">All</td><td class="value"><a href="../HCT/upload/jzhu039/All.zip"><img src="images/download.gif"></a></td></table></TD></TR><TR><TD width="20%" VALIGN="TOP"><table class="imagetable" border="3" align="center" cellpadding="0"><tr><th class="newproperty" colspan="3"><i>TIME INFO</i></th><tr><td class="darknav"></td><td class="darknav">CAPTURED</td><td class="darknav">DIFF</td></tr><tr><td class="value"> ar1cta1gru.ts1.txt </td><td class="value"> FRI MAR 24 12:10:42 2017 UTC
</td><td class="value" rowspan="2"><font color="black"> 168:00:2</font></td></tr><tr><td class="value"> ar1cta1gru.ts2.txt </td><td class="value"> FRI MAR 31 12:10:44 2017 UTC
</td></tr></table></td></tr><TR><TD width="20%" height="40%">  <BR><BR><BR><BR><BR><BR><BR><BR><BR><BR></TD></TR></table></td></FORM></tr></table></TD></TR></TABLE></BODY></HTML><HTML style="background-color: #F6EEEE"><HEAD><link rel="stylesheet" type="text/css" href="../../css/myStyle.css?v=1.2.3"/><script type="text/javascript" src="../../js/hct.js?v=1.0"></script><script type="text/javascript" src="../../js/overlib.js"></script></HEAD><BODY style="background-color: #F6EEEE"><table class="system" width="100%" border="1" align="center" valign="top" cellpadding="1" cellspacing="1" bgcolor="#CCCCCC"><tr><td class="type" width="40%"> System Name </td><td class="data"><b>ar1.cta1.gru
</b></td></tr><tr><td class="type" width="40%"> System Type </td><td class="data"><b>7750 SR-12
</b></td></tr><tr><td class="type" width="40%"> System Version </td><td class="data"><b>C-12.0.R6
</b></td></tr><tr><td class="type" width="40%"> Chassis MAC </td><td class="data"><b>e4:81:84:2d:9c:0f
</b></td></tr><tr><td class="infodata" width="40%"> TS file 1 : ar1cta1gru.ts1 </td><td class="data"><b>Information as of FRI MAR 24 12:10:42 2017 UTC
</b></td></tr><tr><td class="infodata" width="40%"> TS file 2 : ar1cta1gru.ts2 </td><td class="data"><b>Information as of FRI MAR 31 12:10:44 2017 UTC
</b></td></tr></table><table class="system" style="background-color: #E4B77B" width="100%" border="0" align="center" cellpadding="0" cellspacing="1"><tr><td style="color:#333333; font-weight: bold;" align="center" colspan="8"><i>ERRORs</i></td></tr><tr><th class="categories">Component Information</th><th class="categories">Error</th><th class="categories">Level</th><th class="categories">Description</th><th class="categories">Action Plan</th></tr><tr><td colspan="5" align="left" style="padding-left: 320px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> CHASSIS 1 </font> - [<i>Standalone</i>] - [<i>200G per slot capable</i>] - [<i>NS142861360
</i>] - [<i>2016/12/23 11:54:50
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> CPM A </font> - [<i>sfm4-12</i>] - [<i>sfm4-12</i>] - [<i>up</i>] - [<i>up_active</i>] - [<i>NS1424F0495
</i>] - [<i>2016/12/23 11:54:50
</i>]</td></tr><tr><td class="diff" >SF/CPM A[integrated]</td><td class="bookedl2" >Enable vprn-network-exceptions</td><td class="diff" >2</td><td class="diff" colspan=2>General Recommendation: Implement TA 12-1435. <br>Enable vprn-network-exceptions under "config>system>security#" context.</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> CPM B </font> - [<i>sfm4-12</i>] - [<i>sfm4-12</i>] - [<i>up</i>] - [<i>up_standby</i>] - [<i>NS1425F0644
</i>] - [<i>2016/12/29 01:23:16
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 1 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS141263567
</i>] - [<i>2016/12/23 11:55:43
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 2 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS1424F1146
</i>] - [<i>2016/12/23 11:55:41
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 3 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS1424F1035
</i>] - [<i>2016/12/23 11:55:41
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IOM 4 </font> - [<i>iom3-xp</i>] - [<i>iom3-xp</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS142663050
</i>] - [<i>2016/12/23 11:55:42
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 380px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> MDA 4/1 </font> - [<i>m20-1gb-xp-sfp</i>] - [<i>m20-1gb-xp-sfp</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS1415F0994
</i>] - [<i>2016/12/23 11:55:58
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 380px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> MDA 4/2 </font> - [<i>m20-1gb-xp-sfp</i>] - [<i>m20-1gb-xp-sfp</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS1415F0669
</i>] - [<i>2016/12/23 11:55:58
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 5 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS152168464
</i>] - [<i>2016/12/23 11:55:41
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 6 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS152168452
</i>] - [<i>2016/12/23 11:55:42
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 7 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS152168467
</i>] - [<i>2016/12/23 11:55:46
</i>]</td></tr><tr><td class="diff" >IMM 7</td><td class="booked" >Information requires further analysis by TEC</td><td class="diff" >1</td><td class="diff" colspan=2>TEC needs to review the TS files to determine the root cause.</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 8 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS152168463
</i>] - [<i>2017/02/23 13:14:09
</i>]</td></tr><tr><td colspan="5" align="left" style="padding-left: 350px; background: #FFFFFF; color: #432F21; font-size: 8pt; font-family: arial, sans-serif;"><font style="font-weight: bold; font-size: 12pt;"> IMM 9 </font> - [<i>imm-2pac-fp3</i>] - [<i>imm-2pac-fp3</i>] - [<i>up</i>] - [<i>up</i>] - [<i>NS152168460
</i>] - [<i>2016/12/23 11:55:41
</i>]</td></tr></table>

The HTML source may contain many errors, each introduced by LEVEL 1, LEVEL 2, or LEVEL 3.
I would like to process these errors in the HTML source and output a CSV file like this:
"NODE","LEVEL","ERROR","COMPONENT INFORMATION","INFORMATION","ACTION PLAN"
"ar1.cta1.gru","LEVEL 1","Information requires further analysis by TEC","IMM 7","TEC needs to review the TS files to determine the root cause.","Escalate according to the severity."
"ar1.cta1.gru","LEVEL 2","Enable vprn-network-exceptions","SF/CPM A[integrated]","General Recommendation: Implement TA 12-1435. <br>Enable vprn-network-exceptions under "config>system>security#" context.","TA12-1435.pdf"

If no errors are found, output only the system name:
"NODE","LEVEL","ERROR","COMPONENT INFORMATION","INFORMATION","ACTION PLAN"
"ar1.cta1.gru"
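
A note on quoting: the LEVEL 2 row above carries bare double quotes inside a quoted field ("config>system>security#"). If the file is meant for a standard CSV reader, the usual convention (RFC 4180) is to double such quotes. A minimal Python sketch of that, where `csv_line` is a hypothetical helper name and not part of the original request:

```python
import csv
import io


def csv_line(values):
    """Return one CSV line with every field quoted (hypothetical helper)."""
    buf = io.StringIO()
    # QUOTE_ALL quotes every field; embedded quotes are doubled by default.
    csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(values)
    return buf.getvalue()
```

With this, a field such as `under "config>system>security#" context.` would be written with the inner quotes doubled, and a lone system name comes out as a single quoted field.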

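Here are two samples of the HTML source files:
output-1.jzhu039.txt (12.79 KB, downloads: 0)
output-2.jzhu039.txt (9.68 KB, downloads: 0)

Here is the output for ar1.cta1.gru:
ar1.cta1.gru_HCT_output.pdf (328.66 KB, downloads: 5)

How can I use awk to process this kind of HTML file, extract the required information, and output it in the required format?
Thanks, everyone.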


#2 Posted 2017-05-18 07:07
I'd suggest Python. I haven't tried producing CSV from the shell.
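
To make the Python suggestion concrete, here is a rough sketch using only the standard library. It assumes the layout of the sample in post #1: a table whose `<th>` reads REPORT, `level_N` cells opening each block, `value`/`diff` cells holding the four fields in order, and a `type` cell reading "System Name" followed by a `data` cell. The names `ReportParser` and `report_to_csv` are made up for illustration, `<br>` is turned into a space rather than kept literally, and the sketch has only been reasoned through on a cut-down snippet, not on the full attachments.

```python
import csv
import io
from html.parser import HTMLParser

HEADER = ["NODE", "LEVEL", "ERROR", "COMPONENT INFORMATION", "INFORMATION", "ACTION PLAN"]


class ReportParser(HTMLParser):
    """Collect the system name and the LEVEL blocks of the REPORT table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.node = ""
        self.rows = []            # one list per block: [level, error, component, info, action]
        self._in_report = False   # True while inside the table headed REPORT
        self._want_node = False   # True right after the "System Name" label cell
        self._cls = None          # class of the cell currently open, None outside cells
        self._text = []           # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_report = False          # any new table ends the REPORT block
        elif tag in ("td", "th"):
            self._cls = dict(attrs).get("class") or ""
            self._text = []
        elif tag == "br" and self._cls is not None:
            self._text.append(" ")           # assumption: a line break becomes a space

    def handle_data(self, data):
        if self._cls is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag not in ("td", "th") or self._cls is None:
            return
        text = " ".join("".join(self._text).split())   # collapse whitespace and newlines
        cls, self._cls = self._cls, None
        if tag == "th":
            if text == "REPORT":
                self._in_report = True
        elif cls == "type":
            self._want_node = text == "System Name"
        elif cls == "data" and self._want_node:
            self.node, self._want_node = text, False
        elif self._in_report and cls.startswith("level_"):
            self.rows.append([text])
        elif self._in_report and cls in ("value", "diff") and self.rows:
            self.rows[-1].append(text)


def report_to_csv(html):
    """Return the CSV text for one HTML report."""
    parser = ReportParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    if parser.rows:
        for row in parser.rows:
            writer.writerow([parser.node] + row)
    else:
        writer.writerow([parser.node])       # no errors: system name only
    return out.getvalue()
```

Something like `print(report_to_csv(open("a.html").read()), end="")` would then print the CSV for one file. Using a real HTML parser rather than splitting on angle brackets means a `>` inside cell text does not cut the field short.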

#3 Posted 2017-05-20 07:09
In reply to #1 bikkuri

for your reference ...

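$ awk -F'[<>]' '
# Needs GNU awk: match() with a third (array) argument is a gawk extension.
# tlc: lower-case an upper-case opening tag in $0, e.g. <TR becomes <tr
function tlc(t, s) { s = toupper(t); gsub("<" s, "<" t) }
# mch: class of a td/th field; with k given, 1 if the class equals k, else 0
function mch(s, k) {
  if (match(s, "^t. class=\"([^\"]+)\"", m)) {
    if (k == "") return m[1]
    return (m[1] == k ? 1 : 0)
  }
  return ""
}
function q(s) { return ("\"" s "\"") }
{
  tlc("tr"); tlc("td")
  gsub("</[^>]+>", ""); gsub("<[bi]>", ""); gsub("<a href=[^>]+>", "")
  for (n = 1; n <= NF; ++n) {
    if ($n ~ "^table class=") g = 0     # a new table ends the REPORT block
    if (mch($n, "newproperty") && $(n+1) == " REPORT ") g = 1
    if (mch($n)) {
      k = m[1]
      if (k ~ /^level_/) { a[++N] = q($(n+1)); if (N == 1) h = q("LEVEL"); continue }
      if (g && N == 1 && k == "darknav") h = h "," q($(n+1))
      if (g && (k == "value" || k == "diff")) a[N] = a[N] "," q($(n+1))
      if (k == "type" && $(n+1) == " System Name " && mch($(n+2), "data")) Node = $(n+3)
    }
  }
}
END { print q("Node") "," h; for (n = 1; n <= N; ++n) print q(Node) "," a[n] }' a.html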
"Node","LEVEL","Error","Component information","Information","Action Plan"
"ar1.cta1.gru","LEVEL 1","Information requires further analysis by TEC","IMM 7","TEC needs to review the TS files to determine the root cause.","Escalate according to the severity."
"ar1.cta1.gru","LEVEL 2","Enable vprn-network-exceptions","SF/CPM A[integrated]","General Recommendation: Implement TA 12-1435. ","TA12-1435.pdf"
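
Note that the LEVEL 2 Information field above stops at "TA 12-1435. ", while the output requested in post #1 continues past the `<br>`. The likely cause is the field separator: with `-F'[<>]'` every `<` or `>` starts a new field, so both the `<br>` tag and the `>` characters in "config>system>security#" cut the cell text apart, and the script only reads the field right after the `td`. A small Python illustration of the same splitting (`split_like_awk` is a hypothetical name used only here):

```python
import re

# One cell from the sample in post #1, joined onto a single line.
CELL = (
    '<td class="value">General Recommendation: Implement TA 12-1435. '
    '<br>Enable vprn-network-exceptions under "config>system>security#" context.</td>'
)


def split_like_awk(line):
    """Split on every < or >, the same separator set as awk -F'[<>]'."""
    return re.split(r"[<>]", line)


fields = split_like_awk(CELL)
# fields[1] is the td tag and fields[2] is only the text up to <br>;
# the remainder of the sentence is spread over the following fields.
```

To keep the whole sentence, the fields up to the next `td` would have to be joined back together, or the cell text extracted with a parser that understands tags.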

