File reading question

#1, posted 2010-01-27 00:28
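When reading a file, is it possible to read in just one character from each line, instead of the usual approach of reading it in line by line?
The files hold a lot of data and there are also many of them, so this idea suddenly occurred to me. If it is feasible, it should improve efficiency.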

#2, posted 2010-01-27 00:54
There may be cases where you need to read a file only a few characters at a time instead of line by line. This may be the case for binary data. To do just that you can use the read function.
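open FILE, '<', 'picture.jpg' or die "Cannot open picture.jpg: $!";
binmode FILE;                        # no newline translation for binary data
my ($buf, $data, $n);
while ($n = read FILE, $data, 4) {   # read returns 0 at end of file, undef on error
    print "$n bytes read\n";
    $buf .= $data;                   # accumulate the chunks in $buf
}
defined $n or die "Read error: $!";
close(FILE);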
There is a lot going on here, so let's take it step by step. In the first line of the above code fragment a file is opened. As you can guess from the filename, it is a binary file. Binary files need to be treated differently from text files on some operating systems (e.g., Windows). The reason is that on these platforms a newline "character" is actually represented within text files by the two-character sequence \cM\cJ (that's control-M, control-J). When reading a text file, Perl converts the \cM\cJ sequence into a single \n newline character. The converse also holds when writing files. Clearly, this behavior is undesirable when reading binary data, and calling binmode on the filehandle makes sure that the conversion is avoided.

The read function takes either 3 or 4 arguments. The 3-argument form is:
read FILEHANDLE, SCALAR, LENGTH
while the 4-argument form is:
read FILEHANDLE, SCALAR, LENGTH, OFFSET
In the first case, LENGTH characters of data are read from FILEHANDLE into the variable specified by SCALAR. The return value of read is the number of characters actually read, 0 at the end of the file, or undef in the case of an error. Returning to our example above, the read call in the while condition reads at most 4 characters of data into the $data variable, and the number of characters read is stored in $n. Each read advances the current file position, so the next read on the same filehandle starts at the first unread character. Thus the code above reads the contents of the file picture.jpg and stores them in $buf, printing the number of characters read at every iteration.

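If OFFSET is specified, the characters read are placed at that position within SCALAR. Taking advantage of this, we could rewrite the loop above as follows, so that the data accumulates directly in $data and $buf is no longer needed:
my ($data, $n);
my $offset = 0;                      # start writing at the beginning of $data
while ($n = read FILE, $data, 4, $offset) {
    print "$n bytes read\n";
    $offset += $n;                   # the next chunk lands after the data read so far
}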

Even though the example above demonstrates binary reading, the read function works just as well on text files. Just make sure to call binmode for binary data and to leave it out for text.
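As a small self-contained illustration of the 4-argument form and the return values described above, here is a sketch that reads from an in-memory filehandle (a string opened as a file), used only so that it runs without picture.jpg. The sub name read_chunks and the sample data are made up for illustration, and a read error (undef) is not distinguished from end of file here.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in chunks of $size bytes using the
# 4-argument form of read, and return the data plus each chunk length.
sub read_chunks {
    my ($fh, $size) = @_;
    my $data   = '';
    my $offset = 0;
    my @lengths;
    while (my $n = read($fh, $data, $size, $offset)) {
        push @lengths, $n;   # number of bytes this call returned
        $offset += $n;       # append the next chunk after this one
    }
    return ($data, @lengths);
}

# Sample data: 10 bytes, so 4-byte reads return 4, 4 and 2.
my $sample = "0123456789";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
binmode $fh;
my ($data, @lengths) = read_chunks($fh, 4);
close $fh;

print "chunk sizes: @lengths\n";   # prints "chunk sizes: 4 4 2"
print "data: $data\n";             # prints "data: 0123456789"
```

The last chunk is shorter than LENGTH because read returns only what is left before end of file.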

#3, posted 2010-01-27 00:58

Reply to #2 by chenhao392

This is making my head spin. Could you add some comments in Chinese??

#4, posted 2010-01-27 01:22
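Mate, I just felt I had seen this somewhere, so I looked it up for you. I am far from an expert, but you can give it a try.
open FILE, "picture.jpg" or die $!;   # open the file for reading
binmode FILE;                         # switch the handle to binmode
my ($buf, $data, $n);                 # declare the variables
# This is the read FILEHANDLE, SCALAR, LENGTH form. My guess is that it reads the characters of a line into $data in groups of 4.
while (($n = read FILE, $data, 4) != 0) {
    print "$n bytes read\n";          # print how many characters were read
    $buf .= $data;                    # append $data to $buf
}
close(FILE);

[Last edited by chenhao392 on 2010-1-27 01:27]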

#5, posted 2010-01-27 02:45

Reply to #4 by chenhao392

What I mean is: I only want to read the first character of each line and nothing else.

#6, posted 2010-01-27 08:04
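I see... sorry, I misunderstood.
Are you on Linux?
You can just process the file with cut, for example:
cut -c 1 file_name >new_file
That saves only the first character of each line to a new file.

Or you could keep asking for help.
awk, sed, or similar tools should be able to do the same kind of thing.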

#7, posted 2010-01-27 17:09
Originally posted by panwenbo363 on 2010-1-27 02:45:
What I mean is: I only want to read the first character of each line and nothing else.

while(<$fd>) {
    print $1,"\n" if /^(.{1})/;
}

A regex should do it~
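A regex-free variant of the same loop, as a sketch: substr picks out the first character directly. Either way, the line-reading operator still reads each whole line into memory before the first character is extracted. The sub name first_chars is hypothetical, and the in-memory sample text stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first character of every non-empty line.
sub first_chars {
    my ($fd) = @_;
    my @chars;
    while (my $line = <$fd>) {
        chomp $line;
        next if $line eq '';               # like /^(.{1})/, skip empty lines
        push @chars, substr($line, 0, 1);  # first character only
    }
    return @chars;
}

# Sample input: three non-empty lines and one empty line.
my $text = "Error one\nInfo two\n\nWarn three\n";
open my $fd, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for first_chars($fd);         # prints E, I and W on separate lines
close $fd;
```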

#8, posted 2010-01-27 19:46
What is the use of reading a single character? Can you give a useful example?

#9, posted 2010-01-27 22:51

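Reply to #7 by 兰花仙子

Hi 仙子,
I thought of that, but I doubt it differs much in efficiency from reading the whole line.
I don't really understand how this code actually works. Does it first read the whole line in somewhere and then pick out the first character?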

#10, posted 2010-01-27 22:56

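Reply to #8 by Ray001

I am writing an error-analysis script for logs. Whenever the first character at the start of a line is "E", the line is an error record, and these records need to be counted.
The business volume is large: a single log file runs to several million lines, and there are quite a few log files. So I wanted to use this approach to make scanning the logs more efficient.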
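For the log-scanning use case described here, one possible approach (a sketch, not a benchmarked solution) is to read the file in large blocks with read and count the places where an "E" directly follows a newline, so that no per-line strings are created. The sub name count_e_lines, the 1 MB default block size, and the sample log text are assumptions for illustration. A read error is not distinguished from end of file, and whether this is actually faster than a plain line loop should be measured on real logs.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines in $fh whose first character is 'E',
# reading fixed-size blocks instead of individual lines.
sub count_e_lines {
    my ($fh, $block_size) = @_;
    $block_size ||= 1024 * 1024;      # assumed default: 1 MB blocks
    my $count         = 0;
    my $at_line_start = 1;            # the file begins at a line start
    my $block;
    while (read($fh, $block, $block_size)) {
        # An 'E' opening the block counts only if the previous block
        # ended with a newline (or this is the first block).
        $count++ if $at_line_start && substr($block, 0, 1) eq 'E';
        # Every "\nE" inside the block is a line that starts with 'E'.
        my $hits = () = $block =~ /\nE/g;
        $count += $hits;
        $at_line_start = substr($block, -1) eq "\n";
    }
    return $count;
}

# Sample log with two error lines; the tiny block size exercises the
# handling of lines that straddle block boundaries.
my $log = "E disk full\nI started\nE timeout\nW slow\n";
open my $in, '<', \$log or die "Cannot open in-memory file: $!";
print count_e_lines($in, 8), "\n";    # prints 2
close $in;
```

For real files, the same sub could be called once per log file with a filehandle opened on that file, summing the returned counts.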