PHP Cookbook Reading Notes – Chapter 13: Web Automation
Author: cu_Cbear
Posted: 2011-11-25 14:40
Fetching the Contents of a URL with GET
There are three ways to fetch the contents of a URL:
1. PHP's built-in file function file_get_contents()
2. The cURL extension
3. PEAR's HTTP_Request class
// Method 1: file_get_contents()
$page = file_get_contents('http://www.example.com/robots.txt');

// Method 2: cURL
$c = curl_init('http://www.example.com/robots.txt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

// Method 3: PEAR HTTP_Request
require_once 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/robots.txt');
$r->sendRequest();
$page = $r->getResponseBody();
The same techniques can be used to fetch XML documents. Combine them with http_build_query() to build a query string, embed credentials in the URL as username:password@host to access protected pages, and follow redirects with cURL or PEAR's HTTP_Client class.
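For example, a minimal sketch of the first two points (the URLs, parameters, and credentials below are made up for illustration):

// Build a query string from an array of parameters
$params = array('monkey' => 'uncle', 'rhino' => 'aunt');
$url = 'http://www.example.com/search.php?' . http_build_query($params);
$page = file_get_contents($url);

// Fetch a page protected by HTTP Basic authentication
$page = file_get_contents('http://username:password@www.example.com/protected.html');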
Fetching a URL with POST
Have PHP send a POST request and read the server's response:
// Method 1: stream context with file_get_contents()
$url = 'http://www.example.com/submit.php';
$body = 'monkey=uncle&rhino=aunt';
$options = array('method' => 'POST', 'content' => $body);
$context = stream_context_create(array('http' => $options));
print file_get_contents($url, false, $context);

// Method 2: cURL
$url = 'http://www.example.com/submit.php';
$body = 'monkey=uncle&rhino=aunt';
$c = curl_init($url);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $body);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

// Method 3: PEAR HTTP_Request
require 'HTTP/Request.php';
$url = 'http://www.example.com/submit.php';
$r = new HTTP_Request($url);
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->addPostData('rhino','aunt');
$r->sendRequest();
$page = $r->getResponseBody();
Fetching a URL with Cookies
// Method 2: cURL
$c = curl_init('http://www.example.com/needs-cookies.php');
curl_setopt($c, CURLOPT_COOKIE, 'user=ellen; activity=swimming');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

// Method 3: PEAR HTTP_Request
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/needs-cookies.php');
$r->addHeader('Cookie','user=ellen; activity=swimming');
$r->sendRequest();
$page = $r->getResponseBody();
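The method-1 variant is missing above; presumably it sends the Cookie header through a stream context, as in the GET and POST sections. A minimal sketch of that approach, assuming the same cookie values:

// Method 1 (sketch): send the Cookie header via a stream context
$opts = array('http' => array('header' => "Cookie: user=ellen; activity=swimming\r\n"));
$context = stream_context_create($opts);
$page = file_get_contents('http://www.example.com/needs-cookies.php', false, $context);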
Fetching a URL with Headers
By modifying the request headers you can spoof the Referer or User-Agent before requesting the target URL. Many hotlink-protection sites decide whether to allow a download or access to a resource by checking where the Referer says the request came from, so some background knowledge of HTTP headers is required.
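A minimal cURL sketch of setting both headers (the URL and header values are only illustrative):

$c = curl_init('http://www.example.com/image.jpg');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
// Pretend the request came from a page on the same site
curl_setopt($c, CURLOPT_REFERER, 'http://www.example.com/gallery.html');
// Pretend to be an ordinary browser
curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Example/1.0)');
$page = curl_exec($c);
curl_close($c);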
Marking Up a Web Page
With a few small changes, this code can also be used to replace sensitive keywords in a page, which is a very useful feature in China.
$body = '
I like pickles and herring.
<a href="pickle.php"><img src="pickle.jpg">A pickle picture</a>
I have a herringbone-patterned toaster cozy.
<herring>Herring</herring> is not a real HTML element!
';

$words = array('pickle','herring');
$patterns = array();
$replacements = array();
foreach ($words as $i => $word) {
    $patterns[] = '/' . preg_quote($word) . '/i';
    $replacements[] = "<span class=\"word-$i\">\\0</span>";
}

// Split up the page into chunks delimited by a
// reasonable approximation of what an HTML element
// looks like.
$parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                    $body,
                    -1, // Unlimited number of chunks
                    PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => $part) {
    // Skip if this part is an HTML element
    if (isset($part[0]) && ($part[0] == '<')) { continue; }
    // Wrap the words with <span>s
    $parts[$i] = preg_replace($patterns, $replacements, $part);
}

// Reconstruct the body
$body = implode('', $parts);

print $body;
Extracting All Links from a Page
Another handy feature, useful when writing scraping or crawling programs.
An implementation using the tidy extension:
$doc = new DOMDocument();
$opts = array('output-xml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_file('linklist.html', $opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $link = $node->nodeValue;
    print $link . "\n";
}
Extracting links with a regular expression:
$html = file_get_contents('linklist.html');
$links = pc_link_extractor($html);
foreach ($links as $link) {
    print $link[0] . "\n";
}

function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                   $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $links[] = array($match[1], $match[2]);
    }
    return $links;
}
Converting Plain Text to HTML
BBCode is built on a very similar idea, so this is worth noting:
function pc_text2html($s) {
    $s = htmlentities($s);
    // split() is deprecated; explode() is enough for a literal "\n\n" delimiter
    $grafs = explode("\n\n", $s);
    for ($i = 0, $j = count($grafs); $i < $j; $i++) {
        // Turn http/ftp URLs into hyperlinks
        $grafs[$i] = preg_replace('/((ht|f)tp:\/\/[^\s&]+)/',
                                  '<a href="$1">$1</a>', $grafs[$i]);
        // Turn email addresses into mailto: links
        $grafs[$i] = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
                                  '<a href="mailto:$0">$0</a>', $grafs[$i]);
        // Wrap each paragraph in <p> tags
        $grafs[$i] = '<p>' . $grafs[$i] . '</p>';
    }
    return implode("\n\n", $grafs);
}
Converting HTML to Plain Text
Ready-made code already exists for this:
http://www.chuggnutt.com/html2text.php
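If you only need a rough result and don't want to pull in that class, a crude sketch using built-in functions (this is not the html2text approach from the link above):

function rough_html2text($html) {
    // Drop tags, decode entities, and collapse runs of whitespace
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES);
    return trim(preg_replace('/\s+/', ' ', $text));
}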
Removing HTML and PHP Tags
The built-in strip_tags() function removes HTML and PHP tags from a string.
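A quick sketch (the sample string is made up):

$html = '<p>Hello, <b>world</b>! <?php echo "hi"; ?></p>';
print strip_tags($html);          // "Hello, world! " — both HTML and PHP tags are gone
print strip_tags($html, '<b>');   // second argument keeps the listed tags: "Hello, <b>world</b>! "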