
Posted on 2011-12-23 02:32

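<div>References</div>
<div>http://hadoop.apache.org/common/docs/current/streaming.html</div>
<div>http://dongxicheng.org/mapreduce/hadoop-streaming-programming/</div>
<div>1.</div>
<div>Hadoop Streaming is a programming utility that ships with Hadoop. It lets you create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.</div>
<div>For example, the simplest case:</div>
<div>$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \</div>
<div>    -input myInputDirs \</div>
<div>    -output myOutputDir \</div>
<div>    -mapper /bin/cat \</div>
<div>    -reducer /bin/wc</div>
<div>2. How Streaming works</div>
<div>In the example above, both the mapper and the reducer read their input line by line from standard input and write their results to standard output. Streaming creates the map/reduce job, submits it to the cluster, and monitors its progress until it completes.</div>
<div>When an executable or script is specified for the mappers, each mapper task launches it as a separate process when the mapper is initialized. As the mapper task runs, it converts its input into lines and feeds those lines to the standard input of the process. Meanwhile, the mapper collects the lines written to the standard output of the process and converts each line into a key/value pair, which becomes the mapper's output. By default, the part of a line up to the first tab is the key and the rest of the line (excluding the tab) is the value. If the line contains no tab, the entire line is the key and the value is null. The way the key is extracted can be customized, as described later.</div>
<div>When an executable or script is specified for the reducers, each reducer task launches it as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds those lines to the standard input of the process. Meanwhile, the reducer collects the lines written to the standard output of the process and converts each line into a key/value pair, which becomes the reducer's output. By default, the part of a line up to the first tab is the key and the rest of the line (excluding the tab) is the value. If the line contains no tab, the entire line is the key and the value is null. The way the key is extracted can be customized, as described later.</div>
<div>The above is how the Map/Reduce framework interacts with the streaming mapper and reducer.</div>
<div><br></div>
<div>A job can run the map phase only, with no reduce phase:</div>
<div>Specifying Map-Only Jobs</div>
<div>Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.</div>
<div><br></div>
<div>    -D mapred.reduce.tasks=0</div>
<div> </div>
<div>To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0".</div>
<div><br></div>
<div>Options:</div>
<div>(1) -input: input file path</div>
<div>(2) -output: output file path</div>
<div>(3) -mapper: the user-written mapper program, either an executable or a script</div>
<div>(4) -reducer: the user-written reducer program, either an executable or a script</div>
<div>(5) -file: packages a file into the submitted job. This can be any input file the mapper or reducer needs, such as a configuration file or a dictionary.</div>
<div>(6) -partitioner: a user-defined partitioner program</div>
<div>(7) -combiner: a user-defined combiner program (must be implemented in Java)</div>
<div>(8) -D: job properties (formerly set with -jobconf), specifically:</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>Note: the -D options must come first, before the streaming options. To set several properties, repeat -D &lt;property=value&gt; once per property.</div>
<div>             1) mapred.map.tasks: number of map tasks</div>
<div>             2) mapred.reduce.tasks: number of reduce tasks</div>
<div>             3) stream.map.input.field.separator/stream.map.output.field.separator: field separator for the map task's input/output data; both default to \t.</div>
<div>             4) stream.num.map.output.key.fields: the number of fields that make up the key in each map task output record (that is, how many fields are used for sorting. If set to 2, the first and second fields together form the key and take part in sorting. It does not seem possible to designate the second field alone as the key.)</div>
<div>             5) stream.reduce.input.field.separator/stream.reduce.output.field.separator: field separator for the reduce task's input/output data; both default to \t.</div>
<div>             6) stream.num.reduce.output.key.fields: the number of fields that make up the key in each reduce task output record</div>
<div>-D &lt;property=value&gt;<span class="Apple-tab-span" style="white-space:pre"> </span>Use value for given property.</div>
<div>Here, property can be any property that may be set in the core, hdfs, or mapred configuration files; all of them can be passed this way.</div>
<div>The -D option is widely supported across hadoop commands, for example dfs. For details see: http://hadoop.apache.org/common/docs/current/commands_manual.html</div>
<div>For example, to set the replication factor when uploading a file:</div>
<div>hadoop dfs -D dfs.replication=10 -put 70M  logs/2</div>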
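To make the key/value rule above concrete, here is a minimal Python sketch of how one line of mapper or reducer output could be split. It only mirrors the behavior described in this post (split at the first separator, or after the first N fields when a key-field count such as stream.num.map.output.key.fields is set). It is not Hadoop's actual implementation, and the function name `split_key_value` and its parameters are made up for illustration.

```python
def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one output line into (key, value).

    Hypothetical helper that imitates the rule described in the post;
    it is not code taken from Hadoop itself.
    """
    line = line.rstrip("\n")
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Too few separators: the whole line is the key, the value is null.
        return line, None
    # The first num_key_fields fields together form the key.
    key = separator.join(parts[:num_key_fields])
    # Everything after that (separator excluded) is the value.
    value = separator.join(parts[num_key_fields:])
    return key, value


if __name__ == "__main__":
    print(split_key_value("apple\t1\tred"))                    # default: key is the first field
    print(split_key_value("apple\t1\tred", num_key_fields=2))  # first two fields form the key
    print(split_key_value("no tab here"))                      # whole line is the key
```

With the defaults, the key is everything before the first tab. With `num_key_fields=2`, the first two fields are treated as a single key, which matches the note above that the second field alone cannot be picked as the key.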


[Hadoop&HBase] Hadoop Streaming
