1. What are the main application areas of MapReduce? In which scenarios is it unlikely to be replaced?
a. Offline computing and batch computing.
b. Querying data with SQL?
c. It is very hard to replace for batch processing.
2. Compare the strengths and weaknesses of YARN and Mesos. What is the future direction of the YARN framework?
a. YARN supports the Capacity and Fair schedulers, which allocate memory and CPU with fine-grained scheduling.
b. Mesos supports coarse-grained scheduling and can run YARN jobs alongside non-YARN jobs.
3. Which features that you need does HDFS lack, or which of its features do you like most? You can also say which storage system you are optimistic about, and why.
a. I like how easily HDFS scales. It keeps 3 replicas by default for high availability. It also treats server failure as a normal event and runs on commodity servers, which reduces server-farm cost.
b. Compared to GlusterFS, rebalancing in HDFS has less impact.
c. Compared to FastDFS, I think HDFS commits data replication more accurately, which is very hard to do under high-volume writes.
d. However, to sync HDFS data between clusters or data centers we have to use the distcp tool, unlike NFS, where syncing data is easy.
e. Unlike newer technologies such as Ignite and Tachyon, which support memory-based storage and provide faster data access, HDFS stores its data on disk, and disk I/O is always a performance bottleneck.
4. How should Hadoop practitioners plan their careers?
Hadoop is a big ecosystem that includes storage, databases, processing, and security. I think it is better to gain some project experience under a mentor if possible. You also need strong Java coding skills, since Hadoop is built on Java. After you have done some projects, try to understand the principles behind Hadoop.
Try to fix some bugs on GitHub or Google Groups. Most importantly, stay hungry until you truly understand Hadoop.
Just part of my opinion.
Author: heguangwu Time: 2016-01-25 14:02
Large companies are now gradually adopting Spark. That does not mean MapReduce is no longer used; at present the two coexist.
Reply to 10# Steddywr
I have some concerns about YARN scheduling: even with the Fair Scheduler or the Capacity Scheduler, there are still situations that cannot be met.
For example, say I have four queues, A, B, C, and D, under the Capacity Scheduler,
with capacities A 10%, B 20%, C 40%, D 40%, each with a maximum capacity of 90%. If queue C is using 80% absolute capacity and another job arrives in queue D, it looks like that job will not run until enough resources become available.
Author: yehuafeilang Time: 2016-01-26 14:24
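The Capacity Scheduler scenario raised just before this post (queue C at 80% of the cluster, a new job arriving in queue D) can be checked with a little arithmetic. The following is only a sketch and not YARN code: `claimable_now` is a hypothetical helper, the cluster is treated as 100 abstract units, and preemption is assumed to be disabled. The percentages as posted add up to 110%, so the sketch uses only C's usage, D's configured capacity, and the 90% maximum.

```python
def claimable_now(cluster_units, used, queue, max_capacity_pct):
    """Units `queue` could claim right now, assuming no preemption.

    Hypothetical helper for illustration; it is not part of any YARN API.
    """
    free = cluster_units - sum(used.values())          # unallocated units
    ceiling = cluster_units * max_capacity_pct // 100  # queue's elastic limit
    headroom = ceiling - used.get(queue, 0)            # room left under limit
    return max(0, min(free, headroom))


cluster_units = 100   # assumption: model the cluster as 100 units
used = {"C": 80}      # queue C holds 80% absolute capacity
guaranteed_d = 40     # queue D's configured capacity (40%)

available_d = claimable_now(cluster_units, used, "D", max_capacity_pct=90)
shortfall_d = guaranteed_d - available_d

print(f"D can claim {available_d} units now, {shortfall_d} short of its guarantee")
```

Under these assumptions queue D can use whatever is still free, but it stays below its configured share until containers in queue C finish or preemption is enabled.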
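1. What are the main application areas of MapReduce? In which scenarios is it unlikely to be replaced?
MapReduce is currently used mostly for log analysis, as well as for building search indexes; the machine learning library Mahout is another example. Of course it can do much more, such as distributed grep, distributed sort, web access log analysis, inverted index construction, document clustering, machine learning, statistical machine translation, data mining, and information extraction.
The nature of large-scale data processing means that huge numbers of records cannot all fit in memory and usually have to be processed from external storage. Because sequential disk access is far faster than random access, MapReduce is designed mainly for sequential, disk-based processing of large-scale data, so it is unlikely to be replaced in such scenarios.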
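Inverted index construction, one of the applications listed above, shows the map, shuffle, and reduce steps quite compactly. The following is a minimal single-process sketch in plain Python, not the Hadoop API; the function names and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse the doc ids for one word into a sorted, unique list."""
    return word, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = [kv for doc_id, text in docs.items()
             for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())


# Hypothetical sample input
docs = {
    "d1": "hadoop stores data on disk",
    "d2": "spark keeps data in memory",
}
index = build_inverted_index(docs)
print(index["data"])
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data between them through disk and network, which is where the sequential-access design matters.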