[Web] Wrapper Definition

Posted on 2008-05-23 09:38
Wrappers are specialised program routines that automatically extract data from Internet websites and convert the information into a structured format. More specifically, wrappers have three main functions. Firstly, they must be able to download HTML pages from a website. Secondly, they must search for, recognise and extract specified data. Thirdly, they must save this data in a suitably structured format to enable further manipulation [6]. The data can then be imported into other applications for additional processing. According to [20], over 80% of the published information on the WWW is based on databases running in the background. When this data is compiled into HTML documents, the structure of the underlying databases is completely lost. Wrappers try to reverse this process by restoring the information to a structured format [21]. With the right programs, it is even possible to use the WWW as a large database. By using several wrappers to extract data from the various information sources of the WWW, the retrieved data can be made available in an appropriately structured format [4].
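The three functions described above (download, extract, save) can be sketched in a few lines of Python. This is only a minimal illustration, not a method taken from the cited papers: the row pattern, the field names `name` and `price`, and the choice of CSV as the structured output format are all assumptions made for the example.

```python
import csv
import io
import re
from urllib.request import urlopen

# Hypothetical pattern: assumes every record is rendered as a table row
# containing exactly a name cell followed by a price cell.
ROW_PATTERN = re.compile(
    r"<tr>\s*<td>(?P<name>[^<]+)</td>\s*<td>(?P<price>[^<]+)</td>\s*</tr>"
)


def download(url):
    """Function 1: fetch an HTML page and return it as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Function 2: find the specified data and return a list of dicts."""
    return [
        {"name": m.group("name").strip(), "price": m.group("price").strip()}
        for m in ROW_PATTERN.finditer(html)
    ]


def save(records, out):
    """Function 3: write the records in a structured format (CSV here)."""
    writer = csv.DictWriter(out, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)


# Usage with an inline sample page, so no network access is needed.
sample_html = (
    "<table><tr><td>Widget</td><td>9.99</td></tr>\n"
    "<tr> <td> Gadget </td> <td>19.50</td> </tr></table>"
)
buffer = io.StringIO()
save(extract(sample_html), buffer)
print(buffer.getvalue())
```

For a live source, `save(extract(download(url)), out)` would chain the three steps, with `url` pointing at a page whose rows actually match the assumed pattern.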
As a rule, a specially developed wrapper is required for each individual data source, because of the different and unique structures of websites. The WWW is also extremely dynamic and continually evolving, which results in frequent changes in the structures of websites. Consequently, it is often necessary to update or even completely rewrite existing wrappers in order to maintain the desired data extraction capabilities [1]. The Extensible Markup Language (XML) has the potential to alleviate such problems. Whereas HTML is presentation oriented, XML keeps the data structure separate from the presentation. However, it may take some time before all data is provided in the XML format, and it remains to be seen whether XML can establish itself in all areas of electronic information processing [11]. Because XML documents are based on varying Document Type Definitions (DTDs) or XML Schemas, the current problems regarding data extraction from HTML documents can be reduced, but not completely resolved. Wrappers will, therefore, retain an important role in the integration of data from WWW sources for some time to come.
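The difference between presentation-oriented HTML and structure-oriented XML can be made concrete with a small sketch. The sample documents, tag names and helper functions below are invented for illustration. The example shows a layout-dependent pattern breaking after a page redesign, a lookup by element name working on XML, and the same lookup failing on a second XML source that uses a different vocabulary, which is why varying DTDs reduce the problem without removing it.

```python
import re
import xml.etree.ElementTree as ET

# The same record as rendered by a hypothetical site before and after a redesign.
html_v1 = "<tr><td><b>Widget</b></td><td>9.99</td></tr>"
html_v2 = '<div class="item"><span>Widget</span> <em>9.99</em></div>'

# The same record in two hypothetical XML vocabularies (different DTDs).
xml_a = "<product><name>Widget</name><price>9.99</price></product>"
xml_b = "<item><title>Widget</title><cost>9.99</cost></item>"


def price_from_html(html):
    """Layout-dependent extraction: relies on the price being the second table cell."""
    match = re.search(r"</td><td>([^<]+)</td>", html)
    return match.group(1) if match else None


def price_from_xml(xml_text):
    """Structure-based extraction: looks the value up by element name."""
    return ET.fromstring(xml_text).findtext("price")


print(price_from_html(html_v1))  # matches the original table layout
print(price_from_html(html_v2))  # None: the redesign broke the pattern
print(price_from_xml(xml_a))     # found by element name
print(price_from_xml(xml_b))     # None: this source uses <cost>, not <price>
```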
Wrapper-Generating Toolkits
Every wrapper can be manually developed from scratch, for example, in an established programming language using regular expressions. For smaller applications, this can prove to be a sensible approach. However, if a larger number of wrappers is required, this inevitably leads to the use of so-called toolkits, which can generate a complete wrapper based on user-defined parameters for a given data source. One of the most important features of generated wrappers is the format in which the extracted data can be exported. If, for example, the extracted data is converted into an XML format, then it can be imported and processed by a large number of software applications. Toolkits for generating wrappers can be differentiated in a number of ways. They can be categorised by their output methods, interface type, Web crawling capability, use of a graphical user interface (GUI) and several other characteristics. Laender et al. [12] categorise a number of toolkits based on the methods used for generating wrappers. These methods include specially designed wrapper development languages and algorithms based on HTML-awareness, induction, modelling, ontology and natural language processing. However, a detailed presentation of such technical details is beyond the scope of this survey paper. Therefore, the toolkits are simply divided into two basic categories based on commercial and non-commercial availability.
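As an illustration of the XML export mentioned above, the following sketch converts already extracted records into an XML string using Python's standard `xml.etree.ElementTree` module. The function name `records_to_xml`, the tag names `records` and `record`, and the sample fields are assumptions for this example and do not correspond to the output of any particular toolkit.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialise a list of dicts as XML, one child element per field.

    Assumes every dict key is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


extracted = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Nuts & Bolts", "price": "4.25"},
]
xml_text = records_to_xml(extracted)
print(xml_text)

# Any XML-aware application can read the data back by element name.
for record in ET.fromstring(xml_text).iter("record"):
    print(record.findtext("name"), record.findtext("price"))
```

Special characters such as `&` are escaped by the serialiser, so the output stays well-formed even when the extracted text contains markup characters.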

The wrapper-generating programs within both of these categories offer several different means of user interaction. Some toolkits are solely command-line based and require routines written in a predetermined, toolkit-specific scripting language in order to generate an appropriate wrapper for a specified data source. These wrapper development scripting languages are used in standard text editors and can be seen as application-specific alternatives to general-purpose languages such as Perl and Java. A large number of toolkits offer a GUI, whereby the relevant data within an HTML document is highlighted with a mouse, and the program then generates a wrapper based on the specified information. Several toolkits combine both of the features described above. Initially, the relevant data is highlighted with a mouse and the program generates a wrapper from this input. If the automatically generated result does not meet the specified requirements, the user can also make changes via an editor integrated within the toolkit. Whether frequent corrections are necessary depends largely on the underlying algorithms and the functional maturity of the toolkit.
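The idea of generating a wrapper from a highlighted example can be sketched very roughly as follows. This toy version simply takes the few characters surrounding the highlighted text as delimiters and builds a regular expression from them. Real toolkits use considerably more sophisticated algorithms, so the function `induce_wrapper` and its `context` parameter are purely hypothetical.

```python
import re


def induce_wrapper(html, highlighted, context=4):
    """Build a regex from the text immediately around one highlighted example.

    `context` is the number of characters taken on each side as delimiters.
    """
    start = html.find(highlighted)
    if start == -1:
        raise ValueError("highlighted text not found in page")
    end = start + len(highlighted)
    left = html[max(0, start - context):start]
    right = html[end:end + context]
    # Anything between the two delimiters is treated as a data item.
    return re.compile(re.escape(left) + r"(.*?)" + re.escape(right), re.DOTALL)


page = "<ul><li>Alpha</li><li>Beta</li><li>Gamma</li></ul>"

# The user "highlights" one item; the generated wrapper finds the others.
wrapper = induce_wrapper(page, "Beta")
print(wrapper.pattern)
print(wrapper.findall(page))
```

If the generated pattern matches too much or too little, adjusting `context` or editing the pattern by hand plays the role of the manual correction step described above.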
For more information, please visit our website: http://www.knowlesys.com

[Image: img_web2db.gif, web data extraction]