[Hadoop&HBase] Incorrect results after adding LZO compression to Hadoop [Solved]

#1 | Posted 2012-02-23 10:02
Last edited by 懒烊烊 on 2012-02-24 12:48

Hi everyone,
I set up a Hadoop cluster and installed LZO compression. However, when I run a job over the LZO-compressed input, the result is different from the result computed without LZO.

The original file:
cat a.log
192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /js/272/272893.js HTTP/1.1 "304" 0 "http://www.86zw.com/Html/Book/33/33137/2794580.shtml" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) QQBrowser/6.9.11153.201" "-"
192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /okno.php?user=troryzh HTTP/1.1 "200" 5591 "http://www.renao001.com/detail22_7555.shtml" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)" "2.52"
192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /js/282/282002.js HTTP/1.1 "200" 220 "http://gg.ux120.com/zc/0005/00016.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Sicent; .NET CLR 2.0.50727)" "-"
192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /js/282/282016.js HTTP/1.1 "304" 0 "http://www.bookbao.com/Search/q_%25u5341%25u5E74%25u4E00%25u54C1%25u6E29%25u5982%25u8A00" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; @ZOhdam%{qEY?-9:*EF6cSUp=G{gxfX:v4Us,G; SV1; QQDownload 691; 360SE)" "-"
192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /js/270/270653.js HTTP/1.1 "304" 0 "http://www.kyks8.com/zuixin520/3/3889/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; Microsoft Windows Media Center PC 6.0)" "-"
192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /ok.php?user=lmxh521 HTTP/1.1 "200" 5809 "http://www.wwe7.cn/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; 360SE)" "2.48"
192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /xvi.php HTTP/1.1 "200" 4559 "http://www.shushuw.cn/search/%E8%8B%8D%E7%A9%B9/0.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" "2.61"
192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /js/281/281779.js HTTP/1.1 "200" 356 "http://www.pp456.com/guochanju/17305/play.html?17305-0-13" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" "-"
192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /js/281/281640.js HTTP/1.1 "304" 0 "http://www.lenovo2008.com/files/article/html/0/30/6912.html" "-" "-"
192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /vi.php HTTP/1.1 "200" 4547 "http://www.morui.com/book/5/5578/1126529.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.2)" "2.62"
I compressed the file, uploaded it to HDFS, and built the LZO index as follows:
lzop a.log

bin/hadoop fs -put a.log.lzo lzoinputlzo

bin/hadoop jar /home/hadoop/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer lzoinputlzo
12/02/23 09:44:44 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
12/02/23 09:44:44 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ]
12/02/23 09:44:44 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://zsqy13:9000/user/hadoop/lzoinputlzo/a.log.lzo to indexing list (no index currently exists)
12/02/23 09:44:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/02/23 09:44:45 INFO input.FileInputFormat: Total input paths to process : 1
12/02/23 09:44:45 INFO mapred.JobClient: Running job: job_201202221049_0013
12/02/23 09:44:46 INFO mapred.JobClient:  map 0% reduce 0%
12/02/23 09:44:59 INFO mapred.JobClient:  map 100% reduce 0%
12/02/23 09:45:04 INFO mapred.JobClient: Job complete: job_201202221049_0013
12/02/23 09:45:04 INFO mapred.JobClient: Counters: 15
12/02/23 09:45:04 INFO mapred.JobClient:   Job Counters
12/02/23 09:45:04 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12736
12/02/23 09:45:04 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/02/23 09:45:04 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/02/23 09:45:04 INFO mapred.JobClient:     Rack-local map tasks=1
12/02/23 09:45:04 INFO mapred.JobClient:     Launched map tasks=1
12/02/23 09:45:04 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/02/23 09:45:04 INFO mapred.JobClient:   File Output Format Counters
12/02/23 09:45:04 INFO mapred.JobClient:     Bytes Written=0
12/02/23 09:45:04 INFO mapred.JobClient:   FileSystemCounters
12/02/23 09:45:04 INFO mapred.JobClient:     HDFS_BYTES_READ=172
12/02/23 09:45:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21162
12/02/23 09:45:04 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=8
12/02/23 09:45:04 INFO mapred.JobClient:   File Input Format Counters
12/02/23 09:45:04 INFO mapred.JobClient:     Bytes Read=51
12/02/23 09:45:04 INFO mapred.JobClient:   Map-Reduce Framework
12/02/23 09:45:04 INFO mapred.JobClient:     Map input records=1
12/02/23 09:45:04 INFO mapred.JobClient:     Spilled Records=0
12/02/23 09:45:04 INFO mapred.JobClient:     Map output records=1
12/02/23 09:45:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=117
Counting the number of occurrences of each IP in the raw data locally gives the correct result:
cat a.log | ./mapper.py | sort | ./reducer.py
192.168.0.211        4
192.168.0.212        6
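For reference, a minimal sketch of what mapper.py and reducer.py presumably do. The original scripts are not posted, so the exact field handling is an assumption: the mapper is assumed to emit the first whitespace-separated field (the client IP) with a count of 1, and the reducer to sum counts per key from the sorted stream.

mapper.py:

#!/usr/bin/env python
# Sketch (assumed): emit "<first field>\t1" for every input line.
# With the plain-text log the first field is the client IP.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        sys.stdout.write("%s\t1\n" % fields[0])

reducer.py:

#!/usr/bin/env python
# Sketch (assumed): sum the counts of consecutive identical keys.
# The input arrives sorted by key, so a single pass is enough.
import sys

current_key, count = None, 0
for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        continue
    key, value = parts
    if key != current_key:
        if current_key is not None:
            sys.stdout.write("%s\t%d\n" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, count))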
Running the same computation over the LZO input with the LZO input format:
bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat -input lzoinputlzo -output outputlzo -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py
packageJobJar: [mapper.py, reducer.py, /home/hadoop/double/hadoop-hadoop/hadoop-unjar7460131778399100974/] [] /tmp/streamjob8185284251347424727.jar tmpDir=null
12/02/23 09:46:31 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
12/02/23 09:46:31 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ]
12/02/23 09:46:31 INFO mapred.FileInputFormat: Total input paths to process : 2
12/02/23 09:46:31 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/double/hadoop-hadoop/mapred/local]
12/02/23 09:46:31 INFO streaming.StreamJob: Running job: job_201202221049_0014
12/02/23 09:46:31 INFO streaming.StreamJob: To kill this job, run:
12/02/23 09:46:31 INFO streaming.StreamJob: /home/hadoop/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=zsqy13:9001 -kill job_201202221049_0014
12/02/23 09:46:31 INFO streaming.StreamJob: Tracking URL: http://zsqy13:50030/jobdetails.jsp?jobid=job_201202221049_0014
12/02/23 09:46:32 INFO streaming.StreamJob:  map 0%  reduce 0%
12/02/23 09:46:45 INFO streaming.StreamJob:  map 100%  reduce 0%
12/02/23 09:46:56 INFO streaming.StreamJob:  map 100%  reduce 100%
12/02/23 09:47:02 INFO streaming.StreamJob: Job complete: job_201202221049_0014
12/02/23 09:47:02 INFO streaming.StreamJob: Output: outputlzo
Opening the result, it is completely different from the local test, and I cannot make sense of it:
File: /user/hadoop/outputlzo/part-00000

0        1
1074        9
Has anyone run into this problem, and how did you solve it?

#2 | Posted 2012-02-24 10:58
Answering my own question:
After the data goes through LZO, the first field of every line the mapper reads is garbage that has nothing to do with what we want to count. Starting the processing from the second field onward fixes the result (it took me a lot of testing to notice this).
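My guess at the underlying cause: with the default TextInputFormat, Hadoop Streaming drops the record key and writes only the line text to the mapper's stdin. With a custom -inputformat such as com.hadoop.mapred.DeprecatedLzoTextInputFormat, the key (a byte offset, which is also what the 0 and 1074 in the output look like) appears to be written to stdin as well, in the form "key<TAB>line", so the first field the mapper sees is that offset rather than the IP. A sketch of an adjusted mapper under that assumption (the tab-stripping logic is mine, not the original script):

#!/usr/bin/env python
# Assumption: with -inputformat DeprecatedLzoTextInputFormat each record
# reaches the mapper as "<byte offset>\t<original log line>".
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    if "\t" in line:
        line = line.split("\t", 1)[1]            # drop the prepended record key
    fields = line.split()
    if fields:
        sys.stdout.write("%s\t1\n" % fields[0])  # fields[0] is the client IP again

Later streaming releases also added a stream.map.input.ignoreKey setting that tells streaming to drop the key for any input format, but I am not sure whether hadoop-streaming-0.20.203.0 supports it.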

#3 | Posted 2012-02-24 11:06

Nice self-answer!