I tested Heritrix's ARCWriter and ARCReader, and I ran into a serious problem when reading Chinese content back from an ARC file.
I defined the page and HTTP header as follows:
final String PAGE = " TEST test 测试中文 ";
final String CONTENT = "HTTP/1.1 200 OK\r\n"
+ "Content-Type: text/html\r\n\r\n" + PAGE;
and then wrote it to an ARC file in a loop.
The problems appear when reading the records back.
I used ARCRecord.dump to dump the content to the console, and got this:
HTTP/1.1 200 OK
Content-Type: text/html
TEST test 测试中文
final String PAGE = " TEST test 测试中文 ";
final String CONTENT = "HTTP/1.1 200 OK\r\n"
+ "Content-Type: text/html\r\n\r\n" + PAGE;
ARCWriter aw = new ARCWriter(SERIAL_NO, Arrays.asList(ARC_DIRs),
PREFIX, COMPRESS, DEFAULT_MAX_ARC_FILE_SIZE);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
baos.write(CONTENT.getBytes());
// write first record
aw.write(URL, TYPE, HOST, DATE, CONTENT.length(), baos);
for (int i = 0; i file:"
+ aw.getFile().getAbsolutePath() + "\t offset: " + start
+ "\t size:" + (end - start));
}
aw.close();
}
// NOTE: change the file and offset below before use.
public static void testARCReader() throws IOException {
final String arcFile = "d:\\tmp\\arc1\\TMP-20070912062413-00000.arc.gz";
ARCReader reader = ARCReaderFactory.get(new URL("file:////" + arcFile));
ARCRecord r = (ARCRecord) reader.get(309);
System.out.println(r.getBodyOffset());
System.out.println(r.getHeader().getDate());
System.out.println(r.getHeader().getLength());
System.out.println(r.getHeader().getOffset());
System.out.println(r.getHeader().getMimetype());
System.out.println(r.getHeader().getUrl());
// r.dumpHttpHeader();
//r.skipHttpHeader();
r.dump();
r.close();
/*
Dumping to a file instead gives the same problem.
But when I add several more r.dump() calls after the first r.dump(), each call outputs one more character,
and the whole content can be dumped this way until an exception is thrown.
}
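One possible cause, offered as a guess rather than a confirmed diagnosis: the writer code above passes CONTENT.length() as the record length, while the stream actually holds CONTENT.getBytes(). String.length() counts UTF-16 chars and getBytes() produces encoded bytes. For pure ASCII the two numbers agree, but each Chinese character encodes to more than one byte, so the declared length may be shorter than the data written. The sketch below uses only the JDK, not Heritrix. The class name RecordLengthCheck and the helper encode are hypothetical, and UTF-8 is an assumed encoding (the original code uses the platform default).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Heritrix.
class RecordLengthCheck {
    static final String PAGE = " TEST test 测试中文 ";
    static final String CONTENT = "HTTP/1.1 200 OK\r\n"
            + "Content-Type: text/html\r\n\r\n" + PAGE;

    // Encode with an explicit charset (assumed UTF-8) instead of the
    // platform default used by the no-argument getBytes().
    static byte[] encode(String content) {
        return content.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(encode(CONTENT));

        // Number of UTF-16 chars in the string.
        System.out.println("chars: " + CONTENT.length());
        // Number of bytes actually placed in the stream.
        System.out.println("bytes: " + baos.size());
        // A non-zero difference means a length taken from String.length()
        // would under-report the record size.
        System.out.println("difference: " + (baos.size() - CONTENT.length()));
    }
}
```

If the difference is non-zero, one thing to try is passing the byte count (for example baos.size()) to aw.write in place of CONTENT.length(), and checking whether the record then reads back completely.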
This article is from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u1/37897/showart_380298.html