[Text processing] Counting word frequencies and line character lengths

论坛徽章:
1
白羊座
日期:2014-11-13 10:19:16
跳转到指定楼层
1 [收藏(0)] [报告]
发表于 2014-05-25 02:06 |只看该作者 |倒序浏览
本帖最后由 iocg 于 2014-05-25 06:35 编辑

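I have a text file that I need statistics on.
1. Rank all the words in the text by how often they occur. Output format:
    AAA   23%(230/1000)
    BBB   21%(210/1000)
    CCC  12%(120/1000)
    DDD  6%(60/1000)
    ...
(Note: in "AAA   23%(230/1000)", 230 is the number of times the word AAA occurs and 1000 is the total number of words.)
I only need the top 10 entries, or alternatively every entry at 10% or above.

2. Count the characters in each line. Output format:

    57 chars      20%(20/100)
    56 chars      13%(13/100)
    55 chars      14%(14/100)
    ......
(Note: "57 chars      20%(20/100)" means that 20 lines are 57 characters long, out of 100 lines in total.)
I only need the 10 most frequent lengths, ordered from highest count to lowest, or alternatively every length at 10% or above.

All of the statistics should be written to a single text file.
Is there a way to do this?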

#2, posted 2014-05-25 12:54
Reply to #1 iocg


    Untested, just eyeballing it:

1. Assuming the text has one word per line:
    awk '{a[$1]++}END{for(i in a)print i,int(a[i]/NR*100)"%("a[i]"/"NR")"|"sort -k2nr|head"}' urfile >out.txt
2. Line lengths:
    awk '{a[length()]++}END{for(i in a)print i,int(a[i]/NR*100)"%("a[i]"/"NR")"|"sort -k2nr|head"}' urfile >>out.txt
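Both commands above stop at `head`, that is, at the top 10. The other cutoff asked for in #1, listing everything at 10% or above, is not covered. Below is a minimal sketch of that variant for the line-length report. The function name `line_len_freq`, its optional second argument, and the tab-separated layout are my own assumptions and do not come from the thread.

```shell
# line_len_freq FILE [MIN_PCT]
# Print "<length> chars<TAB><pct>%(<count>/<lines>)" for every line length
# whose share of all lines is at least MIN_PCT (default 10), most frequent first.
# Hypothetical helper: the name and the tab-separated layout are assumptions.
line_len_freq() {
    awk -v min="${2:-10}" '
        { count[length($0)]++ }             # tally lines by their length
        END {
            for (len in count) {
                pct = 100 * count[len] / NR # share of all lines, in percent
                if (pct >= min)
                    printf "%d chars\t%.0f%%(%d/%d)\n", len, pct, count[len], NR
            }
        }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr
}
```

Usage would look like `line_len_freq urfile >> out.txt`. The threshold is applied inside awk before sorting, so `head` is not needed at all.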

#3, posted 2014-05-26 00:04
Could the original poster show a sample of the data?

#4, posted 2014-05-26 08:12
Reply to #3 yestreenstars


    It is just ordinary text (English words and numbers) separated by spaces, roughly 40-80 characters per line, and a given line generally does not repeat.
    aa  cc  bdf  ew wet reg  54
    qwf  wef htr qw  we  reg qwd
    qw fgweg wqef qw efw q
    as  aa  afd   dw   dw   fe
    ......
Can any expert help solve this?

#5, posted 2014-05-26 09:50
    awk '{c+=NF;for(i=1;i<=NF;i++){a[$i]++;b[length($i)]++}}END{for(i in a) printf "%s\t%3d%%(%d/%d)\n",i,int(a[i]/c*100),a[i],c;printf"\n";for(i in b) printf "%s char(s)\t%3d%%(%d/%d)\n",i,int(b[i]/c*100),b[i],c}' test

#6, posted 2014-05-26 10:25
Last edited by Herowinter on 2014-05-26 10:26

Reply to #1 iocg

What about punctuation and contractions in the text? Take the passage below: if awk splits it into fields with the default separator,
the "words" it extracts include things like "time,", "found." and "I've".
    As a student of adversity, I've been struck over the years by how some people with major challenges seem to draw strength from them, and I've heard the popular wisdom that that has to do with finding meaning. And for a long time, I thought the meaning was out there, some great truth waiting to be found.
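One possible way to handle this, sketched under my own assumptions: strip non-alphanumeric characters from both ends of each field and leave the inside alone, so "time," and "found." become "time" and "found" while "I've" stays one word. The function name `word_freq` is hypothetical. The explicit ASCII ranges assume English-only input, as described in #4, and were chosen over `\W` because `\W` is a gawk extension.

```shell
# word_freq FILE
# Top 10 words with "<word><TAB><pct>%(<count>/<total>)".
# Hypothetical helper: name and output layout are assumptions for illustration.
word_freq() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                w = $i
                sub(/^[^A-Za-z0-9]+/, "", w)   # drop leading punctuation
                sub(/[^A-Za-z0-9]+$/, "", w)   # drop trailing punctuation
                if (w == "") continue          # field was punctuation only, e.g. "--"
                count[w]++
                total++
            }
        }
        END {
            for (w in count)
                printf "%s\t%.0f%%(%d/%d)\n", w, 100 * count[w] / total, count[w], total
        }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n 10
}
```

Two separate `sub()` calls are used on purpose: a single `sub()` with an alternation replaces only the first match, so it would strip one end of the word but not the other. Case is left untouched here, so "And" and "and" are still counted separately.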

#7, posted 2014-05-26 13:20
Last edited by Herowinter on 2014-05-26 13:38

Reply to #1 iocg
Here is the code I wrote. Run separately, the two reports work fine.
    awk '{b[length($0)]++;total_lines++;for(i=1;i<=NF;i++){sub(/^\W+|\W+$/,"",$i);a[$i]++;total_words++}} END{print "Top 10 words frequency:";for(i in a)printf "%s %.2f%%(%d/%d)\n",i,100*a[i]/total_words,a[i],total_words| "sort -nr -k2 | head -10"}' i
    Top 10 words frequency:
    and 3.55%(23/647)
    to 3.40%(22/647)
    I 3.25%(21/647)
    the 2.94%(19/647)
    a 2.78%(18/647)
    that 2.32%(15/647)
    of 2.01%(13/647)
    with 1.39%(9/647)
    was 1.24%(8/647)
    our 1.24%(8/647)
    awk '{b[length($0)]++;total_lines++;for(i=1;i<=NF;i++){sub(/^\W+|\W+$/,"",$i);a[$i]++;total_words++}} END{print "Top 10 line characters:";for(i in b)printf "%d chars %.2f%%(%d/%d)\n",i,100*b[i]/total_lines,b[i],total_lines| "sort -nr -k1 | head -10"}' i
    Top 10 line characters:
    1016 chars 7.69%(1/13)
    682 chars 7.69%(1/13)
    478 chars 7.69%(1/13)
    445 chars 7.69%(1/13)
    421 chars 7.69%(1/13)
    304 chars 7.69%(1/13)
    135 chars 7.69%(1/13)
    0 chars 46.15%(6/13)
Test file i:
    As a student of adversity, I've been struck over the years by how some people with major challenges seem to draw strength from them, and I've heard the popular wisdom that that has to do with finding meaning. And for a long time, I thought the meaning was out there, some great truth waiting to be found.

    But over time, I've come to feel that the truth is irrelevant. We call it finding meaning, but we might better call it forging meaning.

    My last book was about how families manage to deal with various kinds of challenging or unusual offspring, and one of the mothers I interviewed, who had two children with multiple severe disabilities, said to me, "People always give us these little sayings like, 'God doesn't give you any more than you can handle,' but children like ours are not preordained as a gift. They're a gift because that's what we have chosen."

    We make those choices all our lives. When I was in second grade, Bobby Finkel had a birthday party and invited everyone in our class but me. My mother assumed there had been some sort of error, and she called Mrs. Finkel, who said that Bobby didn't like me and didn't want me at his party. And that day, my mom took me to the zoo and out for a hot fudge sundae. When I was in seventh grade, one of the kids on my school bus nicknamed me "Percy" as a shorthand for my demeanor, and sometimes, he and his cohort would chant that provocation the entire school bus ride, 45 minutes up, 45 minutes back, "Percy! Percy! Percy! Percy!" When I was in eighth grade, our science teacher told us that all male homosexuals develop fecal incontinence because of the trauma to their anal sphincter. And I graduated high school without ever going to the cafeteria, where I would have sat with the girls and been laughed at for doing so, or sat with the boys and been laughed at for being a boy who should be sitting with the girls.

    I survived that childhood through a mix of avoidance and endurance. What I didn't know then, and do know now, is that avoidance and endurance can be the entryway to forging meaning. After you've forged meaning, you need to incorporate that meaning into a new identity. You need to take the traumas and make them part of who you've come to be, and you need to fold the worst events of your life into a narrative of triumph, evincing a better self in response to things that hurt.

    One of the other mothers I interviewed when I was working on my book had been raped as an adolescent, and had a child following that rape, which had thrown away her career plans and damaged all of her emotional relationships. But when I met her, she was 50, and I said to her, "Do you often think about the man who raped you?" And she said, "I used to think about him with anger, but now only with pity." And I thought she meant pity because he was so unevolved as to have done this terrible thing. And I said, "Pity?" And she said, "Yes, because he has a beautiful daughter and two beautiful grandchildren and he doesn't know that, and I do. So as it turns out, I'm the lucky one."

    Some of our struggles are things we're born to: our gender, our sexuality, our race, our disability. And some are things that happen to us: being a political prisoner, being a rape victim, being a Katrina survivor. Identity involves entering a community to draw strength from that community, and to give strength there too. It involves substituting "and" for "but" -- not "I am here but I have cancer," but rather, "I have cancer and I am here."
But when the two are combined, why does the output come out in the wrong order? Is it a problem with the pipes? Could an expert explain?
    awk '{b[length($0)]++;total_lines++;for(i=1;i<=NF;i++){sub(/^\W+|\W+$/,"",$i);a[$i]++;total_words++}} END{print "Top 10 words frequency:";for(i in a)printf "%s %.2f%%(%d/%d)\n",i,100*a[i]/total_words,a[i],total_words| "sort -nr -k2 | head -10";print "Top 10 line characters:";for(i in b)printf "%d chars %.2f%%(%d/%d)\n",i,100*b[i]/total_lines,b[i],total_lines| "sort -nr -k1 | head -10"}' i
    Top 10 words frequency:
    Top 10 line characters:
    1016 chars 7.69%(1/13)
    682 chars 7.69%(1/13)
    478 chars 7.69%(1/13)
    445 chars 7.69%(1/13)
    421 chars 7.69%(1/13)
    304 chars 7.69%(1/13)
    135 chars 7.69%(1/13)
    0 chars 46.15%(6/13)
    and 3.55%(23/647)
    to 3.40%(22/647)
    I 3.25%(21/647)
    the 2.94%(19/647)
    a 2.78%(18/647)
    that 2.32%(15/647)
    of 2.01%(13/647)
    with 1.39%(9/647)
    was 1.24%(8/647)
    our 1.24%(8/647)

#8, posted 2014-05-26 15:46
A small improvement:
    awk '{gsub(/\W+/," ");c+=NF;for(i=1;i<=NF;i++){a[$i]++;b[length($i)]++}}END{for(i in a) printf "%s\t%3d%%(%d/%d)\n",i,int(a[i]/c*100),a[i],c;printf"\n";for(i in b) printf "%s char(s)\t%3d%%(%d/%d)\n",i,int(b[i]/c*100),b[i],c}' test

#9, posted 2014-05-26 17:03
"But when the two are combined, why does the output come out in the wrong order? Is it a problem with the pipes? Could an expert explain?"


It is probably because | "sort -nr -k2 | head -10" starts a child process.
That is, awk hands the output of for(i in a) to sort and head and then carries on with the rest of its own work, and there is no telling when "sort -nr -k2 | head -10" will actually return its results.
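To spell the mechanism out a little: `sort` cannot print anything until it has read all of its input, which only ends when awk closes the pipe. Without an explicit `close()`, that happens when awk exits, whereas the two plain `print` headers go straight to awk's own output. A minimal sketch of one way to restore the order follows: flush awk's output, then `close()` each pipe as soon as its loop finishes. The function name `report` and the "chars" wording are my own assumptions, and the punctuation stripping from #7 is left out to keep the example short.

```shell
# report FILE
# Word report followed by line-length report, in that order.
# Hypothetical helper: the name and exact output layout are assumptions.
report() {
    awk '
        {
            len[length($0)]++
            for (i = 1; i <= NF; i++) { word[$i]++; total++ }
        }
        END {
            by_word = "LC_ALL=C sort -k2,2nr | head -n 10"
            by_len  = "LC_ALL=C sort -k3,3nr | head -n 10"

            print "Top 10 words frequency:"
            fflush()            # push the header out before the child writes anything
            for (w in word)
                printf "%s %.2f%%(%d/%d)\n", w, 100 * word[w] / total, word[w], total | by_word
            close(by_word)      # sends EOF to sort and waits for the pipeline to finish

            print "Top 10 line characters:"
            fflush()
            for (n in len)
                printf "%d chars %.2f%%(%d/%d)\n", n, 100 * len[n] / NR, len[n], NR | by_len
            close(by_len)
        }
    ' "$1"
}
```

Since `close()` waits for the command to finish, each sorted block is fully written before awk moves on to the next `print`. The second report is sorted by percentage here rather than by length, which is closer to what #1 asked for.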

#10, posted 2014-05-26 17:08
Reply to #9 cao627
That may well be it, thanks for the explanation. Without the pipe, though,
picking the 10 largest values out of an array in awk is not easy to write.
It is a classic heap-sort interview question.
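For a fixed, small N such as 10, a heap is not strictly needed. A plain repeated-maximum selection, which is a simpler technique than the heap sort mentioned above and not a replacement for it on large N, can be written in awk alone. This is only a sketch: the function name `top_words`, the optional count argument, and the alphabetical tie-break are assumptions of mine.

```shell
# top_words FILE [N]
# Print the N (default 10) most frequent words without calling sort.
# Hypothetical helper: repeated selection of the maximum, not a heap.
top_words() {
    awk -v n="${2:-10}" '
        { for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
        END {
            for (rank = 1; rank <= n; rank++) {
                best = ""
                # scan the remaining words for the highest count;
                # ties go to the alphabetically smaller word
                for (w in count)
                    if (best == "" || count[w] > count[best] ||
                        (count[w] == count[best] && w < best))
                        best = w
                if (best == "") break          # fewer than n distinct words
                printf "%s %.2f%%(%d/%d)\n", best, 100 * count[best] / total, count[best], total
                delete count[best]             # remove it so the next pass finds the runner-up
            }
        }
    ' "$1"
}
```

Each pass scans every remaining word, so the cost is roughly N times the number of distinct words. That is fine for a top 10, and because no child process is involved the output order is never in question.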
   