Python: removing duplicates from large files

yhizyh posted on 2015-11-20 23:45

Last edited by yhizyh on 2015-11-20 23:46

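Today I ran into a problem while processing several large files: one is about 2 GB and another about 4 GB. I checked, and there are 129115369 records in them. A lot of this data is duplicated, but the duplicates are not on adjacent lines; any record may repeat any other. I need to remove the duplicates, but neither set nor sort can process these files. I'm fairly sure my program is not at fault, because it handles files with the same kind of content at 800-900 MB without trouble; only the two very large ones fail. Is there a way to deal with files this large? Thanks.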

-rw-rw-r-- 1 root root 2.7G Nov 20 21:24 f10.txt
-rw-rw-r-- 1 root root 4.6G Nov 20 22:40 f11.txt
-rw-rw-r-- 1 root root  65M Nov 20 20:33 f6.txt
-rw-rw-r-- 1 root root 218M Nov 20 20:34 f7.txt
-rw-rw-r-- 1 root root 604M Nov 20 20:38 f8.txt
-rw-rw-r-- 1 root root 1.4G Nov 20 20:51 f9.txt
-rw-r----- 1 root root  838 Nov 20 22:50 RemoveSimilar.py

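At the moment only the two largest files cannot be processed.
Part of the code, which uses the list(set()) approach:

def modi_File(filename):
    sFile = "out/" + filename
    oFile = "out1/" + filename
    fp = open(sFile, "r")
    lines = fp.readlines()  # reads the whole file into one list
    fp.close()
    index = 0
    count = len(lines)
    while index < count:
        # strip the trailing newline from each line in place
        lines[index] = lines[index].strip("\n")
        index += 1
    flines = list(set(lines))  # set() drops duplicates; order is not kept
    fp_w = open(oFile, "w")
    count = 0
    for line in flines:
        fp_w.write(str(line) + "\n")
        count += 1
    fp_w.write("Total records: %s" % count)
    fp_w.close()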

substr函数 posted on 2015-11-21 10:13

I'm a beginner myself, so I'd welcome corrections from the more experienced people here.

#!/usr/bin/python2
# coding: utf-8

def modi(filename):
    IN = '/tmp/' + filename          # "out/" + filename
    OUT = '/tmp/' + '_' + filename   # "out1/" + filename
    fhi = open(IN)
    fho = open(OUT, 'w')
    uniq = set()
    count = 0

    # Iterate over the file object so only one line is read at a time.
    for line in fhi:
        if line in uniq:
            continue
        uniq.add(line)
        count += 1
        fho.write(line)

    fho.write("Total records: %s" % count)
    fhi.close()
    fho.close()

modi('xyz')
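
The loop above no longer loads the whole file, but the set still holds every distinct line, so memory grows with the number of unique records. One possible way to bound that, sketched here under assumptions that are not from the thread (Python 3, byte-for-byte comparison of lines, output order not important, and the hypothetical names dedup_in_buckets and n_buckets), is to scatter lines into temporary bucket files by hash and then deduplicate one bucket at a time:

```python
import os
import tempfile
import zlib

def dedup_in_buckets(src_path, dst_path, n_buckets=64):
    """Write the distinct lines of src_path to dst_path; return their count.

    Pass 1 scatters lines into n_buckets temporary files by a hash of the
    line, so identical lines always land in the same bucket. Pass 2
    deduplicates one bucket at a time, so the set only ever holds the
    distinct lines of a single bucket. Output order differs from input order.
    """
    kept = 0
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, "bucket%d" % i) for i in range(n_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            with open(src_path, "rb") as src:
                for line in src:
                    # Normalise the line ending so a final line without a
                    # newline still matches earlier copies of itself.
                    line = line.rstrip(b"\r\n")
                    buckets[zlib.crc32(line) % n_buckets].write(line + b"\n")
        finally:
            for b in buckets:
                b.close()

        with open(dst_path, "wb") as dst:
            for p in paths:
                seen = set()
                with open(p, "rb") as bucket:
                    for line in bucket:
                        if line not in seen:
                            seen.add(line)
                            dst.write(line)
                            kept += 1
    return kept
```

With a reasonably even hash, peak memory is roughly the distinct data divided by n_buckets, at the cost of writing the input to disk once more.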

yhizyh posted on 2015-11-21 15:05

Reply to #2 substr函数

Excellent. You helped me out last time as well. Thank you very much.

Hadron74 posted on 2016-04-05 16:36

Reply to #1 yhizyh
For a problem like this there is no need to write a program; use the UNIX command: cat filename | sort | uniq
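
For readers who want to stay in Python, the same sort-then-drop-adjacent-duplicates idea can be sketched as an external merge sort. This is only an illustration under stated assumptions, not what the sort command itself runs: it assumes Python 3, compares lines as raw bytes (so the order resembles a byte-wise sort rather than a locale-aware one), and the names sort_uniq and chunk_lines are hypothetical.

```python
import heapq
import itertools
import os
import tempfile

def sort_uniq(src_path, dst_path, chunk_lines=1000000):
    """Write the sorted, distinct lines of src_path to dst_path; return their count."""
    kept = 0
    with tempfile.TemporaryDirectory() as tmp:
        runs = []
        with open(src_path, "rb") as src:
            # Give every line exactly one trailing newline so that the
            # in-memory sort and the later merge compare identical bytes.
            normalized = (line.rstrip(b"\r\n") + b"\n" for line in src)
            while True:
                # Take at most chunk_lines lines, dedup and sort them in memory.
                chunk = sorted(set(itertools.islice(normalized, chunk_lines)))
                if not chunk:
                    break
                path = os.path.join(tmp, "run%d" % len(runs))
                with open(path, "wb") as run:
                    run.writelines(chunk)
                runs.append(path)

        files = [open(p, "rb") for p in runs]
        try:
            with open(dst_path, "wb") as dst:
                # heapq.merge streams the sorted runs in order, so equal lines
                # come out adjacent; groupby collapses each group of equals.
                for line, _ in itertools.groupby(heapq.merge(*files)):
                    dst.write(line)
                    kept += 1
        finally:
            for f in files:
                f.close()
    return kept
```

Only one chunk is held in memory at a time, so chunk_lines can be tuned to the available RAM; the merge step keeps one open file per sorted run.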

mswsg posted on 2016-04-19 14:15

The problem is in this line of code: lines = fp.readlines()
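
readlines() returns a list containing every line as a separate string object, so the whole file plus a per-object overhead has to fit in memory at once. A rough way to gauge that footprint is sketched below, assuming Python 3 on a 64-bit build; the function name estimate_readlines_bytes and the 40-character average line length are illustrative guesses, since the thread gives the record count but not the line lengths.

```python
import sys

def estimate_readlines_bytes(n_lines, avg_line_len):
    """Rough lower bound on the memory held by the list readlines() returns."""
    sample = "x" * avg_line_len          # stand-in for one ASCII line
    per_string = sys.getsizeof(sample)   # string object, header included
    per_slot = 8                         # one list pointer on a 64-bit build
    return n_lines * (per_string + per_slot)

# 129115369 is the record count quoted in the first post; 40 is a guess.
print(estimate_readlines_bytes(129115369, 40) / 2**30, "GiB")
```

In the original script the stripped copies of each line, the set and the second list all add to this figure, which is consistent with the smaller files working and the largest ones failing.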


Chinaunix's Archiver
