Chinaunix


Question about extracting data

#1  Posted 2008-11-07 15:03
/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Edadisz%2Ecn%3Fadtype%3D2%26pdid%3D102%26cid%3D9090%26grid%
3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Ezhicai%2Ecom%2Ecn
%2Findex%2Easpx%3Fzpdid%3D70%26zcid%3D5001%26zgrid%3D2859%26zkwid%3D4916%3Fadtype%3D2%26pdid%3D102%26cid%3D250544%26grid%3D2859%26kw
id%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Emyzhuce%2Ecom%2Ecn%3Fadtype%
3D2%26pdid%3D102%26cid%3D7978%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url
=http%3A%2F%2Fwww%2Egzjundao168%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9902%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.act
ion?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2E91zhuce%2Ecom%2Ecn%2Farticle%2FZhuCeGongSi%2F%3Fadtype%3D2%26pdid
%3D102%26cid%3D250878%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A
%2F%2Fwww%2E13723464777%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9774%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||   注册公司

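For a record like the one above, the extracted result should look like this, with the last word included as well.
9090,5001,250544,7978,9902,250878,9774,注册公司

Each line contains N occurrences of cid%3D...%, and N varies from line to line.

I want to extract the values between cid%3D and the following %, separated by commas or spaces. Is there a good way to do this?

[ Last edited by 于仁洁 on 2008-11-7 15:32 ]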

#2  Posted 2008-11-07 15:13
Your data looks like it came from Mars.
Read one search.action entry at a time, splitting on ||, and then:
echo "123cid%456%123" | sed 's/.*cid%\(.*\)%.*/\1/'
456

[ Last edited by ubuntuer on 2008-11-7 15:19 ]

#3  Posted 2008-11-07 15:15
Originally posted by ubuntuer on 2008-11-7 15:13
Your data looks like it came from Mars.
echo "123cid%456%123" | sed 's/.*cid%\(.*\)%.*/\1/'
456


That doesn't work.

The output comes out like this:
3D10103%26grid%3D38098%26kwid%3D104065%26prid%3D%26ct
        1
        3
3D250681%26grid%3D254333%26kwid%3D176917%26prid%3D%26ct
3D8671%26grid%3D1513%26kwid%3D36261%26prid%3D%26ct
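
A note on the output above: in `s/.*cid%\(.*\)%.*/\1/` both `.*` are greedy, so the capture runs from just after the last `cid%` on the line to the last `%`, and sed keeps only that one match per line. A minimal sketch of an alternative, assuming a grep that supports `-o` (GNU and BSD grep do; it is not in POSIX). `list_cids` is a hypothetical helper name, and the sample line is shortened from the data in post #1:

```shell
# Hypothetical helper: print each number that follows "cid%3D", one per line.
# grep -o emits every match separately, so greedy matching is not an issue.
list_cids() {
    grep -o 'cid%3D[0-9][0-9]*' | sed 's/^cid%3D//'
}

# Shortened sample; "zcid%3D5001" also matches, as in the expected output.
printf '%s\n' 'x%26cid%3D9090%26grid%3D2742||y%26zcid%3D5001%26cid%3D250544%26ct%3D0' | list_cids
```

Piping the result through `paste -sd, -` would join the numbers with commas.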

#4  Posted 2008-11-07 15:16
awk -F% '{for(i=1;i<NF;i++) if($i~/cid$/) print $(i+1)}' urfile
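
To see why this matches: with `-F%` every `%` starts a new field, so `%26cid%3D9090` becomes the fields `26cid` and `3D9090`. The test `$i~/cid$/` finds the first and `$(i+1)` prints the second, still carrying the `3D` left over from the encoded `=`. A small sketch of the split, using a shortened sample and a hypothetical helper name `show_fields`:

```shell
# Hypothetical helper: show how awk numbers the fields when "%" is the separator.
show_fields() {
    awk -F% '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

printf '%s\n' 'x%26cid%3D9090%26grid%3D2742' | show_fields
# 1: x
# 2: 26cid
# 3: 3D9090
# 4: 26grid
# 5: 3D2742
```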

#5  Posted 2008-11-07 15:20
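Originally posted by ly5066113 on 2008-11-7 15:16
awk -F% '{for(i=1;i<NF;i++) if($i~/cid$/) print $(i+1)}' urfile


Thanks, that's fairly close.

The problem should be stated like this:

I want to extract the values between cid%3D and the following %, separated by commas or spaces. Is there a good way to do this?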

#6  Posted 2008-11-07 15:30
/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Edadisz%2Ecn%3Fadtype%3D2%26pdid%3D102%26cid%3D9090%26grid%
3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Ezhicai%2Ecom%2Ecn
%2Findex%2Easpx%3Fzpdid%3D70%26zcid%3D5001%26zgrid%3D2859%26zkwid%3D4916%3Fadtype%3D2%26pdid%3D102%26cid%3D250544%26grid%3D2859%26kw
id%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Emyzhuce%2Ecom%2Ecn%3Fadtype%
3D2%26pdid%3D102%26cid%3D7978%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url
=http%3A%2F%2Fwww%2Egzjundao168%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9902%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.act
ion?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2E91zhuce%2Ecom%2Ecn%2Farticle%2FZhuCeGongSi%2F%3Fadtype%3D2%26pdid
%3D102%26cid%3D250878%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A
%2F%2Fwww%2E13723464777%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9774%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||   注册公司

For a record like the one above, the extracted result should look like this, with the last word included as well.
9090,5001,250544,7978,9902,250878,9774,注册公司

[ Last edited by 于仁洁 on 2008-11-7 15:31 ]

#7  Posted 2008-11-07 15:34

Reply to post #6 by 于仁洁

awk -F'[% ]+' '{for(i=1;i<NF;i++) if($i~/cid$/) printf substr($(i+1),3)","}END{print $NF}' urfile

#8  Posted 2008-11-07 15:36
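Originally posted by ly5066113 on 2008-11-7 15:34
awk -F'[% ]+' '{for(i=1;i<NF;i++) if($i~/cid$/) printf substr($(i+1),3)","}END{print $NF}' urfile


Impressive.

But the result comes out like this:
0418,250843,250605,250674,250837,250704,250378,250905
with no line breaks.

Each group of N IDs should go with the word at the end of its own line.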

#9  Posted 2008-11-07 15:39
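Originally posted by 于仁洁 on 2008-11-7 15:36


Impressive.

But the result comes out like this:
0418,250843,250605,250674,250837,250704,250378,250905
with no line breaks.

Each group of N IDs should go with the word at the end of its own line.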

$ cat urfile
/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Edadisz%2Ecn%3Fadtype%3D2%26pdid%3D102%26cid%3D9090%26grid%
3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Ezhicai%2Ecom%2Ecn
%2Findex%2Easpx%3Fzpdid%3D70%26zcid%3D5001%26zgrid%3D2859%26zkwid%3D4916%3Fadtype%3D2%26pdid%3D102%26cid%3D250544%26grid%3D2859%26kw
id%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2Emyzhuce%2Ecom%2Ecn%3Fadtype%
3D2%26pdid%3D102%26cid%3D7978%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url
=http%3A%2F%2Fwww%2Egzjundao168%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9902%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.act
ion?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A%2F%2Fwww%2E91zhuce%2Ecom%2Ecn%2Farticle%2FZhuCeGongSi%2F%3Fadtype%3D2%26pdid
%3D102%26cid%3D250878%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||/search.action?kw=%E6%B3%A8%E5%86%8C%E5%85%AC%E5%8F%B8&url=http%3A
%2F%2Fwww%2E13723464777%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9774%26grid%3D2742%26kwid%3D4704%26prid%3D%26ct%3D0||   注册公司
$ awk -F'[% ]+' '{for(i=1;i<NF;i++) if($i~/cid$/) printf substr($(i+1),3)","}END{print $NF}' urfile
9090,5001,250544,7978,9902,250878,9774,注册公司

#10  Posted 2008-11-07 15:47
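Strange... it works on this single line, but when I run the whole file through it, the output becomes:

250440,10304,10749,10749,250357,250357,10101,10101,7823,9610,250214,9600,9295,7823,9610,250214,9600,9295,7823,9610,250214,9600,9295,7823,9610,250214,9600,9295,250325,250849 ......

Original content (each line ends with the Chinese keyword):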
/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Eszlawyerhuzi%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D7985%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Ehlaw88%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D7881%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Egzlawyer168%2Ecn%3Fadtype%3D2%26pdid%3D102%26cid%3D7667%26grid%3D1568%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fczplaw%2Efl168%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D10351%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Etonylawyer%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D9738%26grid%3D1568%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Egz124%2Ecom%2Fhunyin%2Easp%3Fadtype%3D2%26pdid%3D102%26cid%3D8836%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Esz164%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D250777%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Egzlawyer88%2Ecom%3Fadtype%3D2%26pdid%3D102%26cid%3D10110%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Etjdlawyer%2Ecom%2Fintro%2Ehtm%3FActiveNode%3D2%3Fadtype%3D2%26pdid%3D102%26cid%3D250759%26grid%3D1367%26kwid%3D11491%26prid%3D%26ct%3D0||/q1.action?kw=%E7%A6%BB%E5%A9%9A%E5%BE%8B%E5%B8%88&url=http%3A%2F%2Fwww%2Egzlawyer007%2Ecom%2Ffw%2Ehtm%3FActiveNode%3D3%3Fadtype%3D2%26pdid%3D102%26cid%3D250253%26grid%3D4922%26kwid%3D11491%26prid%3D%26ct%3D0||      离婚律师
/q1.action?kw=%E5%A9%9A%E5%A4%96%E6%83%85%E8%B0%83%E6%9F%A5&url=http%3A%2F%2Fwww%2Eliehu007%2Ecom%2Ffwxm%2Ehtm%3FActiveNode%3D8%3Fadtype%3D2%26pdid%3D102%26cid%3D250898%26grid%3D3480%26kwid%3D9872%26prid%3D%26ct%3D0||/q1.action?kw=%E5%A9%9A%E5%A4%96%E6%83%85%E8%B0%83%E6%9F%A5&url=http%3A%2F%2Fwww%2Ehonghai007%2Ecom%2Fhh%2Fnew%2Flove%2Easp%3Fadtype%3D2%26pdid%3D102%26cid%3D8594%26grid%3D2012%26kwid%3D9872%26prid%3D%26ct%3D0||/q1.action?kw=%E5%A9%9A%E5%A4%96%E6%83%85%E8%B0%83%E6%9F%A5&url=http%3A%2F%2Fwww%2Egzsjk%2Ecom%2Fg1%2Ehtml%3Fadtype%3D2%26pdid%3D102%26cid%3D8861%26grid%3D2012%26kwid%3D9872%26prid%3D%26ct%3D0||/q1.action?kw=%E5%A9%9A%E5%A4%96%E6%83%85%E8%B0%83%E6%9F%A5&url=http%3A%2F%2Fwww%2Ewing007%2Ecom%2Fwedlock%2Easp%3Fadtype%3D2%26pdid%3D102%26cid%3D250850%26grid%3D3480%26kwid%3D9872%26prid%3D%26ct%3D0||        婚外情调查

[ Last edited by 于仁洁 on 2008-11-7 15:48 ]
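
A possible explanation for the run-together output: the one-liner's `printf` never writes a newline, and `END{print $NF}` runs only once, after the last line has been read. A minimal sketch that instead builds one output line per input line. `cids_per_line` is a hypothetical helper name, the tab in the separator is an assumption about how the keyword is delimited, and the sample lines are shortened from the data above:

```shell
# Hypothetical helper: for each input line, print its cid values followed by
# the trailing keyword, comma-separated, one output line per input line.
cids_per_line() {
    awk -F'[% \t]+' '{
        out = ""
        for (i = 1; i < NF; i++)
            if ($i ~ /cid$/)                        # also matches "zcid", as in the thread
                out = out substr($(i + 1), 3) ","   # drop the leading "3D"
        print out $NF                               # keyword is the last field of this line
    }' "$@"
}

printf '%s\n' \
    'a%26cid%3D7985%26grid%3D4922%26ct%3D0||b%26cid%3D7881%26ct%3D0||      离婚律师' \
    'c%26cid%3D250898%26grid%3D3480%26ct%3D0||        婚外情调查' | cids_per_line
# 7985,7881,离婚律师
# 250898,婚外情调查
```

With a file, `cids_per_line urfile` would be the equivalent call. A line with trailing whitespace would leave the last field empty, so that case would need extra handling.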