[Text processing] [Regular expressions] Problem extracting content with sed

Posted 2013-08-23 14:58
Input format:
1080*#<h2>1080</h2>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a poisonous preparation of sodium fluoroacetate used as a rodenticide and pesticide </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong>1080</strong>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> also  <strong>ten–eighty</strong><span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2>*#<span>Origin:*#</span>*#</h2>*#from its laboratory serial numberFirst Known Use: 1945*#</>

1080s*#<h2>1080</h2>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a poisonous preparation of sodium fluoroacetate used as a rodenticide and pesticide </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong>1080</strong>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> also  <strong>ten–eighty</strong><span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2>*#<span>Origin:*#</span>*#</h2>*#from its laboratory serial numberFirst Known Use: 1945*#</>

12-step*#<h2>12–step</h2>*#<a href="sound://12ste01v.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >adj</em></span>*#<span  class="pr">\<span class="unicode">ˈ</span>twelv-<span class="unicode">ˌ</span>step\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>of, relating to, characteristic of, or being a program that is designed especially to help an individual overcome an addiction, compulsion, serious shortcoming, or traumatic experience by adherence to 12 tenets emphasizing personal growth and dependence on a higher spiritual being </span>*#<h2>*#<span>First Known Use:*#</span>*#</h2>*#1983*#</>

18-wheeler*#<h2><a href="entry://18–wheel·er">18–wheel·er</a></h2>*#<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\<span class="unicode">ˌ</span>ā(t)-(<span class="unicode">ˌ</span>)tēn-<span class="unicode">ˈ</span>wē-lər\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a trucking rig consisting of a tractor and a trailer and typically having eighteen wheels </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong><a href="entry://18–wheel·er">18–wheel·er</a></strong>*#<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> or  <strong><a href="entry://eighteen–wheeler">eigh·teen–wheel·er</a></strong><span  class="pr">\<span class="unicode">ˌ</span>ā(t)-(<span class="unicode">ˌ</span>)tēn-<span class="unicode">ˈ</span>wē-lər\</span>*#<h2>*#<span>First Known Use:*#</span>*#</h2>*#1976*#</>


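Note: each entry above is a single record. Everything from the leading 1080*# to the closing *#</> is one record on one line, with no line breaks in between.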

I want to extract the first two fields, which are separated by *#, like this:

1080 <h2>1080</h2>
So I wrote:
sed -n 's/\([^\*#]*\)\*#\([^\*#]*\)\*#.*/\1:\2/p'

But at the fourth record the output comes out like this:
1080:<h2>1080</h2>
1080s:<h2>1080</h2>
12-step:<h2>12–step</h2>
18-wheeler*#<h2><a href="entry://18–wheel·er">18–wheel·er</a></h2>:<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>
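
An alternative worth considering is to treat `*#` as a two-character field separator instead of describing each field by the characters it may not contain. The following is only a minimal sketch using awk: the helper name `first_two_fields` and the shortened sample record are assumptions made for illustration, not part of the original post.

```shell
# Print "field1:field2" for every record that has at least two "*#" separators.
# first_two_fields is a hypothetical helper name; it reads records on stdin.
first_two_fields() {
    # -F'[*]#' sets the field separator to the literal two-character string "*#"
    # ([*] is a bracket expression holding a literal asterisk).
    awk -F'[*]#' 'NF >= 3 { print $1 ":" $2 }'
}

# Shortened sample record (an assumption; the real records are much longer).
printf '%s\n' '1080*#<h2>1080</h2>*#<span>noun</span>*#</>' | first_two_fields
# prints: 1080:<h2>1080</h2>
```

Blank lines and lines with fewer than two separators are skipped by the `NF >= 3` condition, which mirrors the `sed -n ... p` behaviour of printing only matching records.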

Posted 2013-08-23 15:18
Reply to 1# peterdocter

   The output I get differs from yours. Also: in the fourth line, is the a tag (with its attributes) between <h2> and </h2> what you expect?
  sed -n 's/\([^\*#]*\)\*#\([^\*#]*\)\*#.*/\1:\2/p' urfile
  1080:<h2>1080</h2>
  1080s:<h2>1080</h2>
  12-step:<h2>12–step</h2>
  18-wheeler:<h2><a href="entry://18–wheel·er">18–wheel·er</a></h2>

Posted 2013-08-23 15:45
Reply to 2# guogang225
Thanks a lot! That is exactly the result I want. I tested this regex as well, though, and I still see the problem.

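Maybe the cause is that the dot in my source file is not a literal dot?
It is actually & #183 ;
written without the spaces (that is, &#183;), which is displayed as a middle dot.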
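
That diagnosis is consistent with how the expression is built. `[^\*#]` is a bracket expression, so it excludes the single characters `\`, `*` and `#`, not the two-character sequence `*#`. The `#` inside `&#183;` therefore stops `[^\*#]*`, and the match can only begin later in the line. The sketch below reproduces this on a shortened ASCII record and shows one possible fix; the record and the function names `original_sed` and `first_two_sed` are assumptions made for illustration.

```shell
# Shortened ASCII record (an assumption): the dot is stored as the entity &#183;
record='18-wheeler*#<h2>18-wheel&#183;er</h2>*#<em>noun</em>*#</>'

# The expression from the thread: [^\*#]* cannot cross the '#' in &#183;,
# so the match starts after it and the beginning of the line is left as-is.
original_sed() {
    sed -n 's/\([^\*#]*\)\*#\([^\*#]*\)\*#.*/\1:\2/p'
}

# Sketch of a fix: turn the first "*#" into ":" and delete everything from
# the next "*#" onward; the line is printed only if that second "*#" exists.
first_two_sed() {
    sed -n 's/\*#/:/; s/\*#.*//p'
}

printf '%s\n' "$record" | original_sed
# prints: 18-wheeler*#<h2>18-wheel&#183;er</h2>:<em>noun</em>
printf '%s\n' "$record" | first_two_sed
# prints: 18-wheeler:<h2>18-wheel&#183;er</h2>
```

The second form never needs to describe what a field may contain, so a lone `#` or `*` inside a field is harmless. It does assume that the first field contains no `:` of its own, otherwise the output becomes ambiguous.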
   