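东风何处是人间 (Ruby version)
That thread has been very popular lately, and I finally found the time to write a Ruby version.
My skills are limited and the program is not pretty; at the very least it looks a good deal more complicated than the code in the original post. Perhaps a Ruby expert could write a cleaner example for everyone.
Data: the text of 《全宋词》 (Quan Song Ci).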
Ruby code:

    # coding: utf-8

    # Clause-delimiting punctuation, converted to GBK to match the data file.
    s1 = "，".encode("GBK")
    s2 = "。".encode("GBK")
    s3 = "！".encode("GBK")
    s4 = "？".encode("GBK")
    s5 = "、".encode("GBK")

    NUM1 = 2    # n-gram length, in characters
    NUM2 = 500  # only show n-grams that occur at least this many times

    # Split string s into every overlapping substring of length l.
    def splitword(s, l)
      k = Array.new
      0.upto(s.length - l) do |i|
        k << s[i..i+l-1]
      end
      return k
    end

    x = Array.new  # collects every n-gram found

    # ci.txt is read as GBK so that it matches the punctuation strings above.
    File.open("ci.txt", "r:GBK") do |file|
      file.each do |line|
        if line.length < 500 and line.length > 10
          line.gsub!(s2, s1)  # turn every punctuation mark into "，", then split once
          line.gsub!(s3, s1)
          line.gsub!(s4, s1)
          line.gsub!(s5, s1)
          line.chomp!
          column = line.split(s1)                # split into clauses at the commas
          column.delete_if {|i| i.length > 10 }  # drop clauses longer than 10 characters
          column.each do |col|
            splitword(col, NUM1).each {|i| x << i } if col.length >= NUM1  # extract n-grams
          end
        end
      end
    end

    # Count how often each n-gram occurs.
    h = x.inject(Hash.new(0)) {|hash, w| hash[w] += 1; hash }
    h.delete_if {|key, value| value < NUM2 }  # drop n-grams below the threshold

    y = h.sort {|a, b| b[1] <=> a[1] }  # sort by count, descending
    y.each_index {|i| puts "#{i+1} #{y[i][0]} = #{y[i][1]}" }
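For anyone who wants a shorter variant of the same idea, here is a minimal sketch. It is not the original poster's code: it assumes Ruby 1.9 or later and input lines that are already UTF-8 strings, and the names `CLAUSE_BREAK`, `ngrams`, `count_ngrams` and `top_ngrams` are made up for illustration. It keeps the same filters as the listing (lines between 10 and 500 characters, clauses of at most 10 characters) but uses one regular expression for the punctuation and `each_cons` for the sliding window.

```ruby
# Punctuation treated as clause boundaries (full-width and ASCII forms).
CLAUSE_BREAK = /[，。！？、,!?]/

# Every run of n consecutive characters in a clause.
# Clauses shorter than n simply yield an empty array.
def ngrams(clause, n)
  clause.each_char.each_cons(n).map(&:join)
end

# Count n-grams over any enumerable of lines.
# Lines outside the 10..500 character range and clauses longer than
# max_clause characters are skipped, as in the listing above.
def count_ngrams(lines, n = 2, max_clause = 10)
  counts = Hash.new(0)
  lines.each do |line|
    next unless line.length > 10 && line.length < 500
    line.chomp.split(CLAUSE_BREAK).each do |clause|
      next if clause.length > max_clause
      ngrams(clause, n).each { |gram| counts[gram] += 1 }
    end
  end
  counts
end

# Keep n-grams seen at least min_count times, most frequent first
# (ties are broken by the n-gram itself so the order is stable).
def top_ngrams(counts, min_count)
  counts.select { |_, c| c >= min_count }.sort_by { |gram, c| [-c, gram] }
end
```

To run it over the same data, the lines could probably be supplied with something like `File.foreach("ci.txt", encoding: "GBK:UTF-8")`, assuming the file really is GBK-encoded, and the result printed with `top_ngrams(counts, 500).each_with_index { |(gram, c), i| puts "#{i + 1} #{gram} = #{c}" }`.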