[Web] Deep Web Data Extraction

Posted on 2008-05-16 11:21
Problem
The unabated growth of the Web has resulted in a situation in which more information is available to more people than ever in human history. Along with this unprecedented growth has come the inevitable problem of information overload. To counteract this information overload, users typically rely on search engines (like Google and AllTheWeb) or on manually created categorization hierarchies (like Yahoo! and the Open Directory Project). Though excellent for accessing Web pages on the so-called "crawlable" web, these approaches overlook a much more massive and high-quality resource: the Deep Web.

The Deep Web (or Hidden Web) comprises all information that resides in autonomous databases behind portals and information providers' web front-ends. Web pages in the Deep Web are dynamically generated in response to a query through a web site's search form and often contain rich content. A recent study has estimated the size of the Deep Web to be more than 500 billion pages, whereas the size of the "crawlable" web is only 1% of the Deep Web (i.e., less than 5 billion pages). Even those web sites with some static links that are "crawlable" by a search engine often have much more information available only through a query interface. Unlocking this vast Deep Web content presents a major research challenge.
In analogy to search engines over the "crawlable" web, we argue that one way to unlock the Deep Web is to employ a fully automated approach to extracting, indexing, and searching the query-related information-rich regions from dynamic web pages. For this miniproject, we focus on the first of these: extracting data from the Deep Web.
Extracting the interesting information from a Deep Web site requires many things, including scalable and robust methods for analyzing dynamic web pages of a given web site, discovering and locating the query-related information-rich content regions, and extracting itemized objects within each region. By full automation, we mean that the extraction algorithms should be designed independently of the presentation features or specific content of the web pages, such as the specific ways in which the query-related information is laid out or the specific locations where the navigational links and advertisement information are placed in the web pages.
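One way to picture "independent of the presentation features" is a simple cross-page comparison: text that shows up on every result page of a site is probably template material (navigation links, advertisements), while text that changes with the query is probably the interesting region. The sketch below is only an illustration of that idea under this assumption, not a method prescribed here; the names TextCollector, text_chunks, and query_specific_chunks are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_chunks(html):
    """Return the text chunks of an HTML page in document order."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


def query_specific_chunks(page, other_pages):
    """Keep the chunks of `page` that appear in none of `other_pages`.

    Assumption: chunks shared with other result pages from the same
    site are template text, so whatever is left is query-related.
    """
    shared = set()
    for other in other_pages:
        shared.update(text_chunks(other))
    return [chunk for chunk in text_chunks(page) if chunk not in shared]
```

Nothing in this heuristic looks at tag names or page positions, so it does not care where the links and advertisements are placed. It does need at least two result pages from the same site, and it will wrongly discard query-related text that happens to be identical across queries.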
There are many possible 7001-miniprojects. Feel free to talk to either of us for more details. Here are a few possibilities to consider:
1. Develop a Web-based demo for clustering pages of a similar type from a single Deep Web source. For example, AllMusic produces three types of pages in response to a user query: a direct match page (e.g. for Elvis Presley), a list of links to match pages (e.g. a list of all artists named Jackson), and a page with no matches. As a first step toward extracting the relevant data from each page, you may develop techniques to separate the pages that contain query matches from the pages that contain no matches, and perhaps rank each group based on some metric of quality.
2. Design a system for extracting interesting data from a collection of pages from a Deep Web source. You might define a set of regular expressions that can identify dates, prices, or names. Develop a small program that converts a page into a type structure. For example, given a DOM model of a web page, identify all of the types that you have defined, and replace the string tokens with XML tags identifying the types. Replace all non-type tokens with a generic type, and return the tree as a full type structure. Alternatively, you may suggest your own approach for extracting data.
3. Develop a system to recognize names in a page. Given a list of names and a web page, identify possible matches in the page. Based on the structure of the page and the distribution of recognized names, identify strings that may also be names based on their location in the DOM tree hierarchy representing the page.
4. Write a survey paper about current approaches for understanding and analyzing the Deep Web. Be sure to include many of your own comments on the viability of the approaches you review.
5. Or, feel free to suggest a miniproject of your own.
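To make the second idea more concrete, here is a minimal sketch of type tagging with regular expressions. It is only one possible starting point: the two patterns, the generic tag name "text", and the function name tag_text are assumptions chosen for illustration, and the function works on the text of a single node rather than on a whole DOM tree.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical type patterns. The first pattern that matches a token wins.
TYPE_PATTERNS = [
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),    # e.g. 2008-05-16
    ("price", re.compile(r"\$\d+(?:\.\d{2})?")),   # e.g. $9.99
]

GENERIC_TYPE = "text"  # assumed name for the generic (non-type) tag


def tag_text(text):
    """Wrap each whitespace-delimited token in an XML tag naming its type."""
    tagged = []
    for token in text.split():
        for name, pattern in TYPE_PATTERNS:
            if pattern.fullmatch(token):
                break
        else:
            # No pattern matched, so fall back to the generic type.
            name = GENERIC_TYPE
        tagged.append(f"<{name}>{escape(token)}</{name}>")
    return " ".join(tagged)
```

For example, tag_text("Released 2008-05-16 for $9.99") returns "<text>Released</text> <date>2008-05-16</date> <text>for</text> <price>$9.99</price>". A fuller version would apply the same step to every text node of the page's DOM and keep the tree shape, and names would need something richer than a single-token pattern.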
Background: Knowledge of Java or Python would be helpful. Some knowledge of information retrieval and machine learning may be useful but is not required.
Deliverables: You should submit a report that clearly describes what you have learned and what you have accomplished. The report should include useful references. You should also provide any source code you may have written to validate your ideas.
Evaluation: You will be graded on the novelty and quality of your report and implementation.
......

For more information, please visit our website: http://www.knowlesys.com