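Last edited by 懒烊烊 on 2012-02-24 12:48
Hi everyone,
I set up a Hadoop cluster and installed LZO compression. However, a job run on the LZO-compressed input gives a different result from the same computation run without LZO.
Original file:
- cat a.log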
- 192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /js/272/272893.js HTTP/1.1 "304" 0 "http://www.86zw.com/Html/Book/33/33137/2794580.shtml" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) QQBrowser/6.9.11153.201" "-"
- 192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /okno.php?user=troryzh HTTP/1.1 "200" 5591 "http://www.renao001.com/detail22_7555.shtml" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)" "2.52"
- 192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /js/282/282002.js HTTP/1.1 "200" 220 "http://gg.ux120.com/zc/0005/00016.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Sicent; .NET CLR 2.0.50727)" "-"
- 192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /js/282/282016.js HTTP/1.1 "304" 0 "http://www.bookbao.com/Search/q_%25u5341%25u5E74%25u4E00%25u54C1%25u6E29%25u5982%25u8A00" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; @ZOhdam%{qEY?-9:*EF6cSUp=G{gxfX:v4Us,G; SV1; QQDownload 691; 360SE)" "-"
- 192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /js/270/270653.js HTTP/1.1 "304" 0 "http://www.kyks8.com/zuixin520/3/3889/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; Microsoft Windows Media Center PC 6.0)" "-"
- 192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /ok.php?user=lmxh521 HTTP/1.1 "200" 5809 "http://www.wwe7.cn/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; 360SE)" "2.48"
- 192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /xvi.php HTTP/1.1 "200" 4559 "http://www.shushuw.cn/search/%E8%8B%8D%E7%A9%B9/0.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" "2.61"
- 192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /js/281/281779.js HTTP/1.1 "200" 356 "http://www.pp456.com/guochanju/17305/play.html?17305-0-13" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" "-"
- 192.168.0.211 - - [26/Dec/2011:15:10:01 +0800] GET /js/281/281640.js HTTP/1.1 "304" 0 "http://www.lenovo2008.com/files/article/html/0/30/6912.html" "-" "-"
- 192.168.0.212 - - [26/Dec/2011:15:10:01 +0800] GET /vi.php HTTP/1.1 "200" 4547 "http://www.morui.com/book/5/5578/1126529.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.2)" "2.62"
Compress the file, upload it to Hadoop, and build the index as follows:
- lzop a.log
- bin/hadoop fs -put a.log.lzo lzoinputlzo
- bin/hadoop jar /home/hadoop/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer lzoinputlzo
- 12/02/23 09:44:44 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
- 12/02/23 09:44:44 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ]
- 12/02/23 09:44:44 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://zsqy13:9000/user/hadoop/lzoinputlzo/a.log.lzo to indexing list (no index currently exists)
- 12/02/23 09:44:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
- 12/02/23 09:44:45 INFO input.FileInputFormat: Total input paths to process : 1
- 12/02/23 09:44:45 INFO mapred.JobClient: Running job: job_201202221049_0013
- 12/02/23 09:44:46 INFO mapred.JobClient: map 0% reduce 0%
- 12/02/23 09:44:59 INFO mapred.JobClient: map 100% reduce 0%
- 12/02/23 09:45:04 INFO mapred.JobClient: Job complete: job_201202221049_0013
- 12/02/23 09:45:04 INFO mapred.JobClient: Counters: 15
- 12/02/23 09:45:04 INFO mapred.JobClient: Job Counters
- 12/02/23 09:45:04 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12736
- 12/02/23 09:45:04 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
- 12/02/23 09:45:04 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
- 12/02/23 09:45:04 INFO mapred.JobClient: Rack-local map tasks=1
- 12/02/23 09:45:04 INFO mapred.JobClient: Launched map tasks=1
- 12/02/23 09:45:04 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
- 12/02/23 09:45:04 INFO mapred.JobClient: File Output Format Counters
- 12/02/23 09:45:04 INFO mapred.JobClient: Bytes Written=0
- 12/02/23 09:45:04 INFO mapred.JobClient: FileSystemCounters
- 12/02/23 09:45:04 INFO mapred.JobClient: HDFS_BYTES_READ=172
- 12/02/23 09:45:04 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21162
- 12/02/23 09:45:04 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=8
- 12/02/23 09:45:04 INFO mapred.JobClient: File Input Format Counters
- 12/02/23 09:45:04 INFO mapred.JobClient: Bytes Read=51
- 12/02/23 09:45:04 INFO mapred.JobClient: Map-Reduce Framework
- 12/02/23 09:45:04 INFO mapred.JobClient: Map input records=1
- 12/02/23 09:45:04 INFO mapred.JobClient: Spilled Records=0
- 12/02/23 09:45:04 INFO mapred.JobClient: Map output records=1
- 12/02/23 09:45:04 INFO mapred.JobClient: SPLIT_RAW_BYTES=117
Counting the occurrences of each IP in the original data locally gives the following (the result is correct):
- cat a.log | ./mapper.py | sort | ./reducer.py
- 192.168.0.211 4
- 192.168.0.212 6
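The scripts `mapper.py` and `reducer.py` are not shown in this post. For reference, here is a minimal sketch of what such an IP-counting pair might look like, assuming the client IP is the first whitespace-separated field of each log line. The function names `map_lines`, `reduce_pairs`, and `count_ips` are illustrative only and are not taken from the real scripts.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit 'ip<TAB>1' for every non-empty log line."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"{fields[0]}\t1"


def reduce_pairs(sorted_pairs):
    """Reducer step: sum the counts of consecutive identical keys.

    The input must already be sorted by key, which `sort` does locally
    and the shuffle phase does on the cluster.
    """
    parsed = (pair.rstrip("\n").split("\t", 1) for pair in sorted_pairs)
    for ip, group in groupby(parsed, key=lambda kv: kv[0]):
        yield ip, sum(int(count) for _, count in group)


def count_ips(lines):
    """Local equivalent of: cat a.log | mapper | sort | reducer."""
    return dict(reduce_pairs(sorted(map_lines(lines))))
```

In a streaming job the two steps would live in separate scripts that read `sys.stdin` and print to standard output; they are combined here only so the logic can be checked in one place.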
Running the same computation on the LZO input:
- bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat -input lzoinputlzo -output outputlzo -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py
- packageJobJar: [mapper.py, reducer.py, /home/hadoop/double/hadoop-hadoop/hadoop-unjar7460131778399100974/] [] /tmp/streamjob8185284251347424727.jar tmpDir=null
- 12/02/23 09:46:31 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
- 12/02/23 09:46:31 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ]
- 12/02/23 09:46:31 INFO mapred.FileInputFormat: Total input paths to process : 2
- 12/02/23 09:46:31 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/double/hadoop-hadoop/mapred/local]
- 12/02/23 09:46:31 INFO streaming.StreamJob: Running job: job_201202221049_0014
- 12/02/23 09:46:31 INFO streaming.StreamJob: To kill this job, run:
- 12/02/23 09:46:31 INFO streaming.StreamJob: /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=zsqy13:9001 -kill job_201202221049_0014
- 12/02/23 09:46:31 INFO streaming.StreamJob: Tracking URL: http://zsqy13:50030/jobdetails.jsp?jobid=job_201202221049_0014
- 12/02/23 09:46:32 INFO streaming.StreamJob: map 0% reduce 0%
- 12/02/23 09:46:45 INFO streaming.StreamJob: map 100% reduce 0%
- 12/02/23 09:46:56 INFO streaming.StreamJob: map 100% reduce 100%
- 12/02/23 09:47:02 INFO streaming.StreamJob: Job complete: job_201202221049_0014
- 12/02/23 09:47:02 INFO streaming.StreamJob: Output: outputlzo
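Opening the output shows a result completely different from the local test (I cannot make sense of it):
File: /user/hadoop/outputlzo/part-00000
Could anyone tell me how to solve this problem?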
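The job output and the scripts are not reproduced in this post, so the cause cannot be confirmed from what is shown. One thing that may be worth ruling out, stated purely as an assumption: when an input format other than the default text input format is used, Hadoop Streaming may pass each record to the mapper as `key<TAB>value`, so every log line could arrive prefixed with its byte offset. A mapper that takes the first field as the IP would then count offsets instead. The sketch below shows one way a mapper could tolerate such a prefix; `strip_offset_key` and `client_ip` are hypothetical helper names.

```python
def strip_offset_key(line):
    """Drop a leading '<digits><TAB>' prefix if present.

    Lines without such a prefix are returned unchanged, so the same
    mapper works for both plain and LZO input.
    """
    key, sep, rest = line.partition("\t")
    if sep and key.isdigit():
        return rest
    return line


def client_ip(line):
    """Return the first whitespace-separated field after removing any key."""
    fields = strip_offset_key(line).split()
    return fields[0] if fields else None
```

A quick check is to run the streaming job with `-mapper cat` and no reducer, then look at the first few output lines to see whether a numeric prefix is actually there.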