Spark newbie: how do I write binary files to HDFS from Spark?

tianhailong posted on 2018-03-14 20:58

This post was last edited by tianhailong on 2018-03-14 21:01

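I'm building a Spark application that reads point cloud data (binary format), runs a series of processing steps, and then saves the results as plain binary files on HDFS.
I'm developing in Python and writing the files to HDFS through the Hadoop file API:
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration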

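def save_file_content(fileContent, fileName, savePath, Path, FileSystem, Configuration):
    # Open the target file on HDFS through the JVM-side Hadoop FileSystem API.
    fs = FileSystem.get(Configuration())
    output = fs.create(Path(savePath + "/" + fileName))
    try:
        # py4j converts a Python bytearray to a Java byte[].
        output.write(bytearray(fileContent))
    finally:
        # Close the stream even if the write fails.
        output.close()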
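However, this only runs on the driver machine, so I bring the RDD to the driver and write its elements in a loop:
content = rdd.collect()
for i in content:
    save(i, savePath, Path, FileSystem, Configuration)
But with a large amount of data this approach runs out of memory. Is there a better way to write to HDFS?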
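
One possible direction, offered as a minimal sketch rather than a definitive answer: keep the write on the driver but stream the records instead of collecting them all. The helper below takes any iterable of (file_name, content) pairs plus a callable that opens an output stream. The names write_records, open_output, and open_local are hypothetical, and the pair layout of the records is an assumption, since the post does not say what each RDD element contains. The demo at the bottom writes to a local temporary directory so the sketch runs without a cluster.

```python
import os
import tempfile


def write_records(records, save_path, open_output):
    """Write (file_name, content) pairs one at a time; return how many were written.

    records     -- any iterable; with Spark this could be rdd.toLocalIterator()
    open_output -- hypothetical hook: takes a full path and returns an object
                   with write() and close() (not part of the original post)
    """
    written = 0
    for file_name, content in records:
        output = open_output(save_path + "/" + file_name)
        try:
            output.write(bytearray(content))
        finally:
            # Close each stream before moving on to the next record.
            output.close()
        written += 1
    return written


def open_local(path):
    # Local stand-in for fs.create(Path(path)) so the sketch runs anywhere.
    return open(path, "wb")


# Local dry run: a generator plays the role of the RDD iterator.
demo_dir = tempfile.mkdtemp()
demo_records = (
    ("cloud_%d.bin" % i, bytes(bytearray([i, i + 1, i + 2]))) for i in range(3)
)
demo_count = write_records(demo_records, demo_dir, open_local)
```

On the driver, the same helper could be called as write_records(rdd.toLocalIterator(), savePath, lambda p: fs.create(Path(p))), with fs obtained from FileSystem.get(Configuration()) as in the post. RDD.toLocalIterator() fetches one partition at a time, so the driver needs roughly as much memory as the largest partition rather than the whole dataset. All data still passes through the driver, so this trades speed for memory; it does not parallelize the writes.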


Chinaunix's Archiver
