Chinaunix

Title: [Regex] How to implement dictionary-style filtering? [Solved]

Author: peterdocter    Time: 2013-09-09 18:15
Last edited by peterdocter on 2013-09-17 12:40

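How can I implement dictionary-style filtering? I suppose it could also be called loop-style filtering.
There are two files. test1.txt holds the original records, one record per line. test2.txt holds the strings to filter on.
Whenever a string from test2.txt is present, the whole record it belongs to should be filtered out (deleted).
I tried shell + sed with 10,000 filter strings, and after 3 days it still had not finished. awk would probably be faster, but I don't know how to write this in awk.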
test1.txt:
    1080*#<h2>1080</h2>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a poisonous preparation of sodium fluoroacetate used as a rodenticide and pesticide </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong>1080</strong>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> also  <strong>ten–eighty</strong><span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2>*#<span>Origin:*#</span>*#</h2>*#from its laboratory serial numberFirst Known Use: 1945*#</>
    1080s*#<h2>1080</h2>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a poisonous preparation of sodium fluoroacetate used as a rodenticide and pesticide </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong>1080</strong>*#<a href="sound://10800001.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> also  <strong>ten–eighty</strong><span  class="pr">\(<span class="unicode">ˌ</span>)ten-<span class="unicode">ˈ</span>ā-tē\</span>*#<h2>*#<span>Origin:*#</span>*#</h2>*#from its laboratory serial numberFirst Known Use: 1945*#</>
    12-step*#<h2>12–step</h2>*#<a href="sound://12ste01v.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >adj</em></span>*#<span  class="pr">\<span class="unicode">ˈ</span>twelv-<span class="unicode">ˌ</span>step\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>of, relating to, characteristic of, or being a program that is designed especially to help an individual overcome an addiction, compulsion, serious shortcoming, or traumatic experience by adherence to 12 tenets emphasizing personal growth and dependence on a higher spiritual being </span>*#<h2>*#<span>First Known Use:*#</span>*#</h2>*#1983*#</>
    18-wheeler*#<h2><a href="entry://18–wheel.er">18–wheel.er</a></h2>*#<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\<span class="unicode">ˌ</span>ā(t)-(<span class="unicode">ˌ</span>)tēn-<span class="unicode">ˈ</span>wē-lər\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a trucking rig consisting of a tractor and a trailer and typically having eighteen wheels </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong><a href="entry://18–wheel.er">18–wheel.er</a></strong>*#<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> or  <strong><a href="entry://eighteen–wheeler">eigh.teen–wheel.er</a></strong><span  class="pr">\<span class="unicode">ˌ</span>ā(t)-(<span class="unicode">ˌ</span>)tēn-<span class="unicode">ˈ</span>wē-lər\</span>*#<h2>*#<span>First Known Use:*#</span>*#</h2>*#1976*#</>
    18-wheelers*#<h2><a href="entry://18–wheel.er">18–wheel.er</a></h2>*#<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a><span class="main-fl"><em >noun</em></span>*#<span  class="pr">\<span class="unicode">ˌ</span>ā(t)-(<span class="unicode">ˌ</span>)tēn-<span class="unicode">ˈ</span>wē-lər\</span>*#<h2 class="def-header">*#<span>Definition:*#</span>*#</h2>*#<span class="ssens"><strong>:</strong>a trucking rig consisting of a tractor and a trailer and typically having eighteen wheels </span>*#<h2>*#<span>Variants:*#</span>*#</h2>*#<strong><a href="entry://18–wheel.er">18–wheel.er</a></strong>*#<a href="sound://18_whe01.spx"><img align="absmiddle" border="0" src="/webster2012_audio.gif"></a> or  <strong><a href="entry://eighteen–wheeler">eigh.teen–wheel.er</a></strong><span  class="pr">\<span class="unicode">ˌ</span>ā(t)-(<span class="unicode">ˌ</span>)tēn-<span class="unicode">ˈ</span>wē-lər\</span>*#<h2>*#<span>First Known Use:*#</span>*#</h2>*#1976*#</>
test2.txt:
    12-step
    18-wheelers
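For example, given this line in test2.txt:
    12-step
the following record in test1.txt should be filtered out (deleted) in its entirety:
    12-step*#<h2>12–step</h2>*#...
Note: a record is filtered only when it matches ^12-step*#, that is, when the keyword sits at the very start of the record and is immediately followed by the *# separator.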

The forum changed the formatting of what I posted, so please use the demo in the attachment. Thanks a lot!
test.rar (910 Bytes, downloads: 8)

Author: rdcwayx    Time: 2013-09-09 19:14
    awk -F "*#" 'NR==FNR{a[$1];next} {if ($1 in a) next}1' file2.txt file1.txt
(Here file2.txt is the keyword list, test2.txt, and file1.txt is the record file, test1.txt.)
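For readers new to the NR==FNR idiom, here is a minimal sketch of the same two-pass logic. The sample records are shortened, hypothetical stand-ins for the real test1.txt, and the array name drop is an illustrative choice. The separator is written as '[*]#' on the assumption that not every awk treats a leading * in a field-separator regex as a literal, which is a portability tweak rather than part of rdcwayx's answer.

```shell
# Hypothetical miniature versions of test1.txt / test2.txt, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' \
  '1080*#<h2>1080</h2>*#body' \
  '12-step*#<h2>12-step</h2>*#body' \
  '18-wheeler*#body that mentions 18-wheelers' \
  '18-wheelers*#body' > "$workdir/test1.txt"
printf '%s\n' '12-step' '18-wheelers' > "$workdir/test2.txt"

# While reading the first file (NR == FNR), store each keyword as an array key.
# While reading the second file, print a record only if its first field
# (the headword before the first "*#") is not one of the stored keys.
awk -F '[*]#' 'NR == FNR { drop[$1]; next } !($1 in drop)' \
  "$workdir/test2.txt" "$workdir/test1.txt" > "$workdir/filtered.txt"

cat "$workdir/filtered.txt"
```

With this sample data the 1080 and 18-wheeler records survive, even though the 18-wheeler record mentions 18-wheelers in its body, because only the first field is compared. Each record costs one hash lookup regardless of how many keywords there are, which is one likely reason this approach scales better than running one sed pass per keyword.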

Author: seesea2517    Time: 2013-09-10 09:27

I wonder how grep -vf test2 test1 compares with awk in terms of speed. If the original poster has test results, please share them.
Author: peterdocter    Time: 2013-09-10 10:25
Reply to #2 rdcwayx
Thanks a lot! It processed more than 100,000 records in a few minutes. awk really is much more efficient!

   
Author: Shell_HAT    Time: 2013-09-10 10:29
Reply to #4 peterdocter
    grep -vFf test2 test1
How does this one perform?
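Author: LikeLx    Time: 2013-09-10 10:29
seesea2517 posted on 2013-09-10 09:27:
I wonder how grep -vf test2 test1 compares with awk in terms of speed. If the original poster has test results, please share them.

I tried it on my machine, and awk is faster!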
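Author: peterdocter    Time: 2013-09-10 10:36
Reply to #3 seesea2517
This has a problem. A record that merely contains the keyword somewhere, for example
    18-wheelers....
is filtered out as well. I only want a record filtered when it starts with 18-wheelers*#.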


   
Author: peterdocter    Time: 2013-09-10 10:39
Last edited by peterdocter on 2013-09-10 10:39

Reply to #5 Shell_HAT
Same as above: it also filters out records that do not match the condition.


   
Author: Shell_HAT    Time: 2013-09-10 10:47
Reply to #8 peterdocter
    grep -vxFf test2 test1
    grep -vwFf test2 test1

Author: peterdocter    Time: 2013-09-10 10:58
Reply to #9 Shell_HAT
Same result, these don't work either?
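One plausible reading of why none of the grep variants meets the requirement, sketched on two hypothetical records and one keyword (not the real data), assuming a grep that supports the standard -v, -F, -x and -f options plus the widely available -w:

```shell
# Hypothetical sample: the first record only mentions the keyword in its body.
workdir=$(mktemp -d)
printf '%s\n' \
  '18-wheeler*#a rig; see also 18-wheelers' \
  '18-wheelers*#plural of 18-wheeler' > "$workdir/test1.txt"
printf '%s\n' '18-wheelers' > "$workdir/test2.txt"

# -F alone: fixed-string match anywhere in the line, so both records are dropped.
# (grep -v exits with status 1 when it prints nothing, hence "|| true".)
grep -vFf "$workdir/test2.txt" "$workdir/test1.txt" > "$workdir/plain.txt" || true

# -w: whole-word match, but still anywhere in the line, so both are dropped again.
grep -vwFf "$workdir/test2.txt" "$workdir/test1.txt" > "$workdir/word.txt" || true

# -x: the keyword must equal the entire line; no record does, so nothing is dropped.
grep -vxFf "$workdir/test2.txt" "$workdir/test1.txt" > "$workdir/exact.txt" || true

wc -l "$workdir/plain.txt" "$workdir/word.txt" "$workdir/exact.txt"
```

On this sample the -F and -w runs remove both records, including the one that should be kept, while the -x run removes nothing. None of these options ties the match to the start of the record followed by *#, which is the condition the original poster asked for.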

   
Author: 惟吾无为    Time: 2013-09-10 19:52
Last edited by 惟吾无为 on 2013-09-10 19:55

So what dictionary are you building this time?

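Author: 惟吾无为    Time: 2013-09-10 20:04
Reply to #9 Shell_HAT

This is a dictionary, so a filter keyword can show up in the body of an entry whose headword is perfectly innocent. Matching anywhere in the line therefore produces false positives. Better to stick with awk.
Otherwise you would have to preprocess the filter list first, escaping the regex metacharacters, and then match with an anchored regex of the form:
    ^escaped-headword\*#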
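The preprocessing described above could be sketched as follows. This is an illustration under stated assumptions, not something posted in the thread: the sample records, the a.m. keyword, and the patterns.txt file name are all hypothetical, and the escaping covers only basic-regex (BRE) metacharacters, which is what grep uses by default.

```shell
# Hypothetical sample data; "aXm." is there to show that the escaped "." no
# longer matches an arbitrary character.
workdir=$(mktemp -d)
printf '%s\n' \
  '1080*#body' \
  '12-step*#body' \
  '18-wheeler*#see also 18-wheelers' \
  '18-wheelers*#body' \
  'a.m.*#body' \
  'aXm.*#body' > "$workdir/test1.txt"
printf '%s\n' '12-step' '18-wheelers' 'a.m.' > "$workdir/test2.txt"

# 1. Backslash-escape the BRE metacharacters ] [ \ . * ^ $ in each keyword.
# 2. Prefix "^" and append "\*#" so the keyword must be the headword field.
sed -e 's/[][\.*^$]/\\&/g' -e 's/^/^/' -e 's/$/\\*#/' \
  "$workdir/test2.txt" > "$workdir/patterns.txt"

# Drop every record that matches one of the anchored patterns.
grep -v -f "$workdir/patterns.txt" "$workdir/test1.txt" > "$workdir/filtered.txt"

cat "$workdir/filtered.txt"
```

On this sample the 1080, 18-wheeler and aXm. records are kept and the other three are removed. Whether this is competitive with the awk lookup on a 10,000-keyword list depends on the grep implementation, so it is worth timing on the real data before relying on it.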




