Posted on 2005-11-11 17:26

Converting HTML to XML with nekohtml
nekohtml download page:
http://people.apache.org/~andyc/neko/doc/html/
Source code:
html2xml.java
import org.w3c.dom.Node;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.html.HTMLDocument;
import org.xml.sax.InputSource;
import org.apache.html.dom.HTMLDocumentImpl;
import org.cyberneko.html.parsers.DOMFragmentParser;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import java.util.Properties;
import java.util.Calendar;
import java.io.File;
import java.io.InputStreamReader;
import java.io.InputStream;
import java.io.FileReader;
import java.net.HttpURLConnection;
import java.net.URL;
public class html2xml {
public static void main(String args[]){
      if(args!=null&&args.length>=2){
         try {
            String path=args[0];
            String fromfile=args[1];
            String outputfile=getFileName();
            if(args.length>2){
                  outputfile=args[2];
            }
            boolean b=Boolean.valueOf(fromfile).booleanValue();
            html2xml h2x=new html2xml();
            DocumentFragment df=h2x.getSourceNode(path,b);
            File file=new File(outputfile);
            if(file.exists())
                  file.delete();
            h2x.genXmlFile(df,file);
            System.out.println("generate "+file.getCanonicalPath()+" successfully!");
         } catch (Exception e) {
            e.printStackTrace();
         }
      }else{
         System.out.println("usage:html2xml path fromfile [outputfile]");
         System.out.println("html2xml http://www.sina.com.cn false D:/tempfile.xml");
         System.out.println("html2xml D:/htmlfile.htm true D:/tempfile.xml");
         System.out.println("--");
      }
}
// Serializes the parsed node to an XML file via an identity Transformer.
public void genXmlFile(Node output,File file) throws Exception{
         TransformerFactory tf=TransformerFactory.newInstance();
         Transformer transformer=tf.newTransformer();
         DOMSource source=new DOMSource(output);
         // Output settings: GB2312 bytes, XML syntax, no <?xml ...?> declaration.
         // With the declaration omitted, the file itself does not record its encoding.
         Properties props = new Properties();
         props.setProperty("encoding", "GB2312");
         props.setProperty("method", "xml");
         props.setProperty("omit-xml-declaration", "yes");
         transformer.setOutputProperties(props);
         // try-with-resources closes the stream even if transform() throws
         try (java.io.FileOutputStream fos=new java.io.FileOutputStream(file)) {
            transformer.transform(source,new StreamResult(fos));
         }
}
// Parses HTML from a local file (fromfile=true) or a URL (fromfile=false)
// into a DOM fragment; returns null when path is null or blank.
public DocumentFragment getSourceNode(String path,boolean fromfile) throws Exception{
      if(path==null||path.trim().equals("")){
         return null;
      }
      DOMFragmentParser parser = new DOMFragmentParser();
      HTMLDocument document = new HTMLDocumentImpl();
      DocumentFragment fragment = document.createDocumentFragment();
      if(fromfile){
         // FileReader decodes the file with the platform default charset
         try (FileReader fr=new FileReader(new File(path))) {
            parser.parse(new InputSource(fr),fragment);
         }
      }else{
         URL url = new URL(path);
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
         // The remote page is assumed to be GBK-encoded
         try (InputStream inputs = con.getInputStream();
              InputStreamReader isr=new InputStreamReader(inputs,"GBK")) {
            parser.parse(new InputSource(isr),fragment);
         }
      }
      return fragment;
}
// Builds the default output name: "tmp" followed by a yyyyMMddHHmmssSSS timestamp.
public static String getFileName() throws Exception{
      Calendar c=Calendar.getInstance();
      String name="tmp"+c.get(Calendar.YEAR)+(c.get(Calendar.MONTH)<9?"0":"")+
            (c.get(Calendar.MONTH)+1)+(c.get(Calendar.DAY_OF_MONTH)<10?"0":"")+
            c.get(Calendar.DAY_OF_MONTH)+(c.get(Calendar.HOUR_OF_DAY)<10?"0":"")+
            c.get(Calendar.HOUR_OF_DAY)+(c.get(Calendar.MINUTE)<10?"0":"")+
            c.get(Calendar.MINUTE)+(c.get(Calendar.SECOND)<10?"0":"")+
            c.get(Calendar.SECOND)+(c.get(Calendar.MILLISECOND)<10?"0":"")+
            (c.get(Calendar.MILLISECOND)<100?"0":"")+c.get(Calendar.MILLISECOND);
      return name;
}
}
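The serialization step in genXmlFile can be tried on its own, without nekohtml on the classpath. The following is only a sketch under that assumption: the class name SerializeSketch and the hand-built <P> fragment are illustrative and not part of the original program. It applies the same three output properties to a small DocumentFragment and returns the result as a string.

```java
import java.io.StringWriter;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of html2xml.java.
public class SerializeSketch {

    // Builds a tiny fragment by hand and serializes it with the same
    // output properties that genXmlFile sets.
    public static String serializeSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element p = doc.createElement("P");
            p.appendChild(doc.createTextNode("hello"));
            fragment.appendChild(p);

            // An identity transformer copies the DOM to the output unchanged.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            Properties props = new Properties();
            props.setProperty("encoding", "GB2312");
            props.setProperty("method", "xml");
            props.setProperty("omit-xml-declaration", "yes");
            transformer.setOutputProperties(props);

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(fragment), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so that callers do not need to declare checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeSample());
    }
}
```

Because "omit-xml-declaration" is "yes", the output carries no <?xml ...?> line, so whoever reads the generated file has to know separately that it was written as GB2312.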
Directory layout:
html2xml
├─classes
├─lib
└─src
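In the html2xml directory, create a batch file named run.bat with the following content:
java -cp "./lib/nekohtml.jar;./lib/xercesImpl.jar;./lib/xml-apis.jar;./lib/commons-logging.jar;./lib/commons-discovery.jar;./lib/saaj.jar;./classes" html2xml %1 %2 %3
Usage:
1. File conversion. On the command line, enter: run D:\test.html true D:\test.xml. The first argument is the HTML file to convert. The second is a flag and must be "true" for file conversion. The third is the output XML file; if it is omitted, a temporary file is generated to hold the output XML.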
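2. Conversion from a web address. On the command line, enter: run http://www.sina.com.cn false D:\sina.xml. The first argument is the page URL, the second should be set to false, and the third is the same as above.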
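In URL mode the program always decodes the page as GBK, which suits sites like the one in the example but not every page. One possible refinement, shown here only as a sketch, is to take the charset from the HTTP Content-Type header and fall back to GBK when none is given. The class CharsetSketch and the method charsetFor are hypothetical names and do not appear in the original source.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, not part of html2xml.java.
public class CharsetSketch {

    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is missing, has no charset parameter,
    // or names a charset this JVM does not support.
    public static String charsetFor(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    if (Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: ignore it and use the fallback.
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetFor("text/html; charset=UTF-8", "GBK"));
        System.out.println(charsetFor("text/html", "GBK"));
    }
}
```

In getSourceNode this could be used as new InputStreamReader(inputs, CharsetSketch.charsetFor(con.getContentType(), "GBK")). Pages that declare their encoding only in a meta tag would still be read with the fallback.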
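Because nekohtml is quite fault-tolerant, the conversion succeeds in most cases. Note, however, that an error occurs when the target page contains more than one html tag. Another problem found in use is an attribute written like attribute='a'b', which cannot be parsed, although markup this badly written rarely appears.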
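To spot problem pages after the fact, the generated output can be checked for XML well-formedness with the JDK parser alone. This is a sketch, not part of the original program: WellFormedSketch and isWellFormed are hypothetical names, and the check works on a string that has already been decoded, since the generated file carries no encoding declaration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of html2xml.java.
public class WellFormedSketch {

    // Returns true when the text parses as a single well-formed XML document.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness violation ends up here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<HTML><BODY>ok</BODY></HTML>"));
        System.out.println(isWellFormed("<HTML></HTML><HTML></HTML>"));
    }
}
```

A document with two top-level HTML elements is rejected by this check, because XML allows only one root element.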
nekohtml can be downloaded from:
http://people.apache.org/~andyc/neko/doc/html/

This article comes from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u/9085/showart_56542.html