Parsing Microsoft Word with Apache POI: both text and pictures

Posted 2008-12-19 19:14

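I needed this for a project, so I wrote an MS Word parser and am posting it here to share.
The Apache POI components are mainly used to parse Microsoft Word, PowerPoint, Excel, and Visio documents. The overview below, quoted from the POI project, has the details.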
Overview
The following are components of the entire POI project and a brief summary of their purpose.
POIFS for OLE 2 Documents
POIFS is the oldest and most stable part of the project. It is our port of the OLE 2 Compound Document Format to pure Java. It supports both read and write functionality. All of our components ultimately rely on it by definition. Please see the POIFS project page for more information.
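Since every binary Office format mentioned in this overview sits on top of an OLE 2 container, a quick sanity check before handing a stream to POI is to look at its first eight bytes. The following is only a minimal sketch using the plain JDK; the class name `Ole2Signature` and the method `looksLikeOle2` are made up for illustration and are not part of POI.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of POI): detects the OLE 2 compound-document signature. */
final class Ole2Signature {
    // The eight-byte signature at the start of an OLE 2 compound document.
    private static final int[] MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    private Ole2Signature() {}

    /**
     * Returns true if the stream starts with the OLE 2 signature.
     * The stream must support mark/reset; it is rewound before returning.
     */
    static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        try {
            for (int expected : MAGIC) {
                // read() yields 0..255, or -1 at end of stream
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }
}
```

In practice one would wrap a `FileInputStream` in a `BufferedInputStream` (which supports mark/reset), call `Ole2Signature.looksLikeOle2(stream)`, and only then pass the same stream on to a POI class.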
HSSF for Excel Documents
HSSF is our port of the Microsoft Excel 97(-2003) file format (BIFF8) to pure Java. It supports read and write capability. (Support for Excel 2007 .xlsx files is in progress). Please see the HSSF project page for more information.
HWPF for Word Documents
HWPF is our port of the Microsoft Word 97 file format to pure Java. It supports read, and limited write capabilities. Please see the HWPF project page for more information. This component is in the early stages of development. It can already read and write simple files.
Presently we are looking for a contributor to foster the HWPF development. Jump in!
HSLF for PowerPoint Documents
HSLF is our port of the Microsoft PowerPoint 97(-2003) file format to pure Java. It supports read and write capabilities. Please see the HSLF project page for more information.
HDGF for Visio Documents
HDGF is our port of the Microsoft Visio 97(-2003) file format to pure Java. It currently only supports reading at a very low level, and simple text extraction. Please see the HDGF project page for more information.
HPSF for Document Properties
HPSF is our port of the OLE 2 property set format to pure Java. Property sets are mostly used to store a document's properties (title, author, date of last modification etc.), but they can be used for application-specific purposes as well.
HPSF supports reading and writing of properties. However, you will need to be using version 3.0 of POI to utilise the write support.
Please see the HPSF project page for more information.
package org.osforce.document.extractor;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.PicturesTable;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.Range;
/**
 * Microsoft Word document extractor: extracts text and pictures.
 *
 * @author huhaozhong
 * @version 1.0 date 2008.7.27
 */

public class MSWordExtractor {
    private HWPFDocument msWord;
    /**
     *
     * @param input
     *            InputStream from file system which has word document stream
     * @throws IOException
     */
    public MSWordExtractor(InputStream input) throws IOException {
        msWord = new HWPFDocument(input);
    }
    /**
     *
     * @return all paragraphs of text
     */

    public String[] extractParagraphTexts() {
        Range range = msWord.getRange();
        int numParagraph = range.numParagraphs();
        String[] paragraphs = new String[numParagraph];
        for (int i = 0; i < numParagraph; i++) {
            Paragraph p = range.getParagraph(i);
            paragraphs[i] = p.text();
        }
        return paragraphs;
    }
    /**
     *
     * @return all text of the Word document
     */

    public String extractMSWordText() {
        Range range = msWord.getRange();
        String msWordText = range.text();
        return msWordText;
    }
    /**
     *
     * @param directory
     *            local file directory that store the images
     * @throws IOException
     */

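    public void extractImagesIntoDirectory(String directory) throws IOException {
        PicturesTable pTable = msWord.getPicturesTable();
        Range range = msWord.getRange();
        int numCharacterRuns = range.numCharacterRuns();
        // NOTE: the posted listing is cut off right after "for (int i = 0; i".
        // The loop below is a reconstruction inferred from the imports at the
        // top of the file; it is an assumption, not the author's verbatim code.
        for (int i = 0; i < numCharacterRuns; i++) {
            CharacterRun run = range.getCharacterRun(i);
            if (pTable.hasPicture(run)) {
                // false: do not fill the picture bytes while extracting
                Picture picture = pTable.extractPicture(run, false);
                File target = new File(directory, picture.suggestFullFileName());
                try (OutputStream out = new FileOutputStream(target)) {
                    picture.writeImageContent(out);
                }
            }
        }
    }
}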
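
One thing to watch with extractParagraphTexts(): the strings come back as raw paragraph text, so they can still carry control characters such as the trailing paragraph mark (\r) or the table cell mark (U+0007). Below is a rough sketch of a cleanup step in plain Java. The class `ParagraphTextCleaner` and its methods are hypothetical names for illustration, not part of POI, and which characters actually show up depends on the document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI) for tidying raw paragraph text. */
final class ParagraphTextCleaner {
    private ParagraphTextCleaner() {}

    /**
     * Drops control characters below U+0020 (tabs are kept),
     * then trims leading and trailing whitespace.
     */
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\t' || c >= 0x20) {
                sb.append(c);
            }
        }
        return sb.toString().trim();
    }

    /** Cleans every paragraph and skips the ones that end up empty. */
    static List<String> cleanAll(String[] paragraphs) {
        List<String> result = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String cleaned = clean(paragraph);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }
}
```

Typical use would be `ParagraphTextCleaner.cleanAll(extractor.extractParagraphTexts())`, where `extractor` is an instance of the class above.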


This post comes from a ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u1/41814/showart_1730158.html