Using the CURL library in PHP

Posted 2008-04-22 14:40
Abstract
In this article you will learn what the CURL library is, how to use it, and some of its (advanced) options.
Introduction
Sooner or later you're bound to run across a certain problem in your script: how to retrieve content from other websites. There are several methods for this, and the simplest one is probably the fopen() function (if URL access is enabled through the allow_url_fopen setting), but fopen() gives you few options out of the box. What if you're building a web spider and want a custom user agent, or need to choose the request method (GET or POST)? With fopen() that means setting up a stream context, which quickly gets awkward.
That's where the CURL library comes in. This library, usually included with PHP, allows you to retrieve other pages, and also makes it possible to define dozens of different options.
In this article we'll have a look at how to use the CURL library, what it can do, and explore some of its options. But first, let's get started with the basics of CURL.
The Basics
The first step in using CURL is to create a new CURL resource, by calling the curl_init() function, like so:

<?php
// create a new curl resource
$ch = curl_init();
?>
Now that you've got a curl resource, it's possible to retrieve a URL, by first setting the URL you want to retrieve using the curl_setopt() function:

<?php
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/");
?>
After that, to get the page, call the curl_exec() function, which will execute the retrieval and automatically print the page:

<?php
// grab URL and pass it to the browser
curl_exec($ch);
?>
Finally, it's probably wise to close the curl resource to free up system resources. This can be done with the curl_close() function, as follows:

<?php
// close curl resource, and free up system resources
curl_close($ch);
?>
That's all there is to it, and the above code snippets together form the following working demo:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.nl/");
// grab URL and pass it to the browser
curl_exec($ch);
// close curl resource, and free up system resources
curl_close($ch);
?>
The only problem we have now is that the output of the page is immediately printed, but what if we want to use the output in some other way? That's no problem, as there's an option called CURLOPT_RETURNTRANSFER which, when set to TRUE, will make sure the output of the page is returned instead of printed. See the example below:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.nl/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// grab URL, and return output
$output = curl_exec($ch);
// close curl resource, and free up system resources
curl_close($ch);
// Replace 'Google' with 'PHPit'
$output = str_replace('Google', 'PHPit', $output);
// Print output
echo $output;
?>
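One thing the example above leaves out is failure handling: curl_exec() returns false when the transfer itself fails (for instance on a DNS error), and curl_error() then describes what went wrong. A minimal sketch, assuming a helper named fetch_url() that is not part of the original article:

```php
<?php
// Hypothetical helper (not from the article): fetch a URL and return its
// body, throwing if the transfer itself fails.
function fetch_url(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $output = curl_exec($ch);
    if ($output === false) {
        // curl_error() holds a human-readable description of the failure
        throw new RuntimeException('curl transfer failed: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope at the end of the function.
    return $output;
}
```

A caller could then wrap fetch_url('http://www.google.nl/') in a try/catch block and decide what to do when the page can't be retrieved. Note that an HTTP error page such as a 404 still counts as a successful transfer here.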
In the previous two examples you might've noticed we used the curl_setopt() function to define how the page should be retrieved, and that's where the real power of curl lies. By setting all kinds of different options, pretty much anything is possible, so let's have a look at that a bit more.
What's possible with the curl options
If you have a look at the manual for the curl_setopt() function you'll notice there's a huge list of different options. Let's go through the most interesting.
The first interesting option is CURLOPT_FOLLOWLOCATION. When this is set to true, curl will automatically follow any redirect it gets sent. For example, when you try to retrieve a PHP page, and the PHP page uses header("Location: http://new_url"), curl will automatically follow it. The example below demonstrates this:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// grab URL, and print
curl_exec($ch);
?>
If Google decides to send a redirect, the example above will now follow to the new location. Two options that are related to this are the CURLOPT_MAXREDIRS and CURLOPT_AUTOREFERER options.
The CURLOPT_MAXREDIRS option allows you to define how many redirects should be followed, and any more after that won't be followed. If the CURLOPT_AUTOREFERER option is set to TRUE, curl will automatically include the Referer header in each redirect. Not that important really, but could be useful in certain cases.
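The three redirect options can be combined in one request. The sketch below is only an illustration: the helper names redirect_options() and final_url() and the default limit of 5 redirects are assumptions, not something from the article. It uses curl_setopt_array() to apply several options at once and CURLINFO_EFFECTIVE_URL to see where the redirects ended up.

```php
<?php
// Hypothetical helper: the redirect-related options as one array,
// ready to be applied with curl_setopt_array().
function redirect_options(int $maxRedirects = 5): array
{
    return [
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many redirects
        CURLOPT_AUTOREFERER    => true,          // set the Referer header on each redirect
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
    ];
}

// Hypothetical helper: fetch a URL and return the last URL curl ended up at,
// or false if the transfer failed (e.g. too many redirects).
function final_url(string $url, int $maxRedirects = 5)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, redirect_options($maxRedirects));
    if (curl_exec($ch) === false) {
        return false;
    }
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

Calling final_url('http://www.google.com/') would return the address of the page that was finally served, which may differ from the one requested.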
Next up is the CURLOPT_POST option. This is a very useful option, as it allows you to send POST requests instead of GET requests, which means you can submit forms to other pages without actually filling in the form. The example below demonstrates what I mean:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://projects/phpit/content/using%20curl%20php/demos/handle_form.php");
// Do a POST
$data = array('name' => 'Dennis', 'surname' => 'Pallett');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
// grab URL, and print
curl_exec($ch);
?>
And the handle_form.php file:

<?php
echo 'Form variables I received: ';
echo '<pre>';
print_r($_POST);
echo '</pre>';
?>
As you can see this makes it really easy to submit forms, and it's a great way to test all your forms, without having to fill them in all the time.
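One detail worth knowing, stated here as general curl behaviour rather than something the article covers: when CURLOPT_POSTFIELDS is given an array, the request is sent as multipart/form-data, while a URL-encoded string is sent as application/x-www-form-urlencoded, which is what a plain HTML form submits by default. A minimal sketch of the string variant, with the hypothetical helpers encode_form() and post_form():

```php
<?php
// Hypothetical helper: encode form fields as a URL-encoded string.
function encode_form(array $fields): string
{
    // Pass '&' explicitly so the result does not depend on php.ini settings.
    return http_build_query($fields, '', '&');
}

// Hypothetical helper: POST the fields to $url and return the response
// body, or false if the transfer failed.
function post_form(string $url, array $fields)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, encode_form($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}
```

With the data from the example above, encode_form() produces the string name=Dennis&surname=Pallett, and handle_form.php would see the same $_POST values either way.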
The CURLOPT_CONNECTTIMEOUT option sets how long (in seconds) curl should wait whilst trying to connect. This is a very important option: set it too low and requests to slow servers will fail, but set it too high (e.g. 1000, or 0 for unlimited) and an unresponsive host can leave your PHP scripts hanging. A related option is CURLOPT_TIMEOUT, which sets how long (again in seconds) the whole curl request is allowed to take. If you set this to a low value, slow pages might come back incomplete or not at all, since they take a while to download.
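As a rough illustration of the two timeouts working together, the sketch below sets both and reports whether a failure was caused by a timeout. The helper name fetch_with_timeouts() and the default values of 5 and 30 seconds are assumptions chosen for the example, not recommendations from the article.

```php
<?php
// Hypothetical helper: fetch with explicit limits and report whether a
// failure was a timeout.
function fetch_with_timeouts(string $url, int $connectSeconds = 5, int $totalSeconds = 30): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectSeconds); // limit for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalSeconds);          // limit for the whole request

    $body = curl_exec($ch);

    return [
        'ok'        => $body !== false,
        'body'      => $body === false ? null : $body,
        // 28 is libcurl's error code for "operation timed out"
        'timed_out' => curl_errno($ch) === 28,
        'error'     => curl_error($ch),
    ];
}
```

A caller can then retry or skip the page when 'timed_out' is true, instead of treating every failure the same way.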
The final interesting option is the CURLOPT_USERAGENT option, which allows you to set the user agent of the request. This makes it possible to create your own web spiders, with their own user agent, like so:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.useragent.org/");
curl_setopt($ch, CURLOPT_USERAGENT, 'My custom web spider/0.1');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// grab URL, and print
curl_exec($ch);
?>
Now that we've had most of the interesting options, let's have a look at the curl_getinfo() function and what it can do for us.
Getting info about the page
The curl_getinfo() function is used to get all kinds of different information about the page that was retrieved and the request itself. You can either specify what information you want by setting the second argument, or you can simply leave the second argument out and get an associative array with every detail. The example below demonstrates this:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.com");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FILETIME, true);
// grab URL
$output = curl_exec($ch);
// Print info
echo '<pre>';
print_r(curl_getinfo($ch));
echo '</pre>';
?>
Most of the information returned is about the request itself, like the amount of time it took and the response header that was returned, but there's also some information on the page, like the content-type and last modified time (only if you explicitly state you want to get the last modified time, like I did in the example).
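When only a few values are needed, passing a CURLINFO_* constant as the second argument returns just that value. A small sketch, assuming a hypothetical helper called summarize_transfer() that is meant to be called after curl_exec():

```php
<?php
// Hypothetical helper: pick a few fields out of a curl handle.
// Intended to be called after curl_exec() has run on $ch.
function summarize_transfer($ch): array
{
    return [
        'http_code'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),    // e.g. 200 or 404
        'content_type' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE), // Content-Type header, if any
        'total_time'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),   // seconds the transfer took
    ];
}
```

In the example above, print_r(summarize_transfer($ch)) could replace the full curl_getinfo($ch) dump when the other fields aren't of interest.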
That's all about curl_getinfo(), so let's have a look at some practical uses now.
Practical uses
The first useful thing the curl library could be used for is checking whether a page really exists. To do this, we first have to retrieve the page, and then check the response code (404=not found, and thus it doesn't exist). See the example below:

<?php
// create a new curl resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.com/does/not/exist");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// grab URL
$output = curl_exec($ch);
// Get response code (returned as an integer)
$response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// Not found?
if ($response_code == 404) {
    echo 'Page doesn\'t exist';
} else {
    echo $output;
}
?>
Another possibility is an automatic link checker: fetch a page, check that each link on it works (using the code above), then retrieve each linked page and repeat the process.
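The two building blocks of such a link checker might look roughly like this. Both helper names, extract_links() and link_works(), are made up for this sketch. It uses PHP's DOMDocument class to find the links and a body-less request (CURLOPT_NOBODY) to test each one, and it does not resolve relative links such as /about against the page's own address, which a real checker would need to do.

```php
<?php
// Hypothetical helper: collect the href of every <a> tag in an HTML string.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    // Keep sloppy real-world markup from raising warnings while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical helper: true unless the server answers 404 or the request fails.
function link_works(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // ask for headers only, skip the body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // a redirected link still works
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (curl_exec($ch) === false) {
        return false;
    }
    return curl_getinfo($ch, CURLINFO_HTTP_CODE) != 404;
}
```

A checker would call extract_links() on the output of a fetched page and then link_works() on each absolute URL it finds. Some servers answer body-less requests differently from normal ones, so a full GET may be needed as a fallback.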
Curl also makes it possible to write your own web spider, similar to Google's web spider, or any other web spider. This article isn't about writing a web spider, so I won't talk about it any further, but a future article on PHPit will show you exactly how to create your own web spider.
Conclusion
In this article I've shown how to use the CURL library, and taken you through most of its options.
For most basic tasks, like simply getting a page, you probably won't need the curl library, since PHP comes with inbuilt support for remote pages. But as soon as you want to do anything slightly advanced, you're probably going to want to use the curl library.
In the near future I will show you exactly how to build your own web spider, similar to Google's web spider, so stay tuned to PHPit.
If you have any questions or comments on this article, feel free to leave them below, or join us at PHPit Forums.


This article comes from a ChinaUnix blog; to view the original, see: http://blog.chinaunix.net/u/10599/showart_572710.html