Using cURL to Grab Data

Posted 2009-04-25 07:21
Using cURL is a simple and effective way to gather data from another website, run it through a script, parse it, and transform it into something useful for your own site. Whether you are "scraping" a page to build a summary of a link, pulling an XML file to parse into a database, or simply fetching the contents of a file, cURL lets you pull data from an outside source into your page.
Making sure cURL is enabled and set up
First things first, you need to make sure cURL is enabled on your web host. The easiest way to check is to look at the phpinfo() output on your server. Simply deploy a PHP file with the following contents onto your server, and name it whatever you want.
<?php
phpinfo();
?>
After the file is uploaded/saved onto your web server, open it in your browser and look through the phpinfo() output for a cURL section like the following.

[Screenshot: the cURL section of the phpinfo() output, with "cURL support" listed as enabled]

If your phpinfo() output doesn't have this section, or anything similar to it, then your hosting service may not support cURL, or the extension may not be enabled. If you are on a shared hosting service, you can ask your host to enable it for you; if you run your own server, you can modify your php.ini file to enable the extension.
You can modify your php.ini file as follows.
(If you can't find php.ini, look near the top of the phpinfo() output we generated above; it lists the path to the loaded configuration file.)
// Find this line in your php.ini
;extension=php_curl.dll

// Remove the semicolon in front, so the line looks like this:
extension=php_curl.dll
After modifying and saving your php.ini file, you are going to have to restart your web service.
- If you are running Apache, you should be able to restart it with a simple "apachectl restart" command.
- If you are running an IIS web server, you will have to restart IIS or just recycle the application pool that is running your PHP. This can be done through the IIS Manager MMC snap-in.
- If you are running WAMP on your local machine, simply right-click on the WAMP icon in your system tray, find the Apache menu, and click “Restart”.
Just make sure you go back to the file running phpinfo() to confirm that cURL now shows up. If it doesn't, you may want to seek additional support from your IT staff, co-workers, or web hosting provider to find out why cURL will not work on your server.
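If you would rather check from code than eyeball the phpinfo() page, a quick sketch along these lines (just a convenience check, not part of the original walkthrough) will tell you whether the extension is loaded:
<?php
// Quick sanity check: is the cURL extension available to this PHP installation?
if (extension_loaded('curl') && function_exists('curl_init')) {
    echo "cURL is enabled\n";
} else {
    echo "cURL is NOT enabled - check your php.ini\n";
}
?>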
Assuming everything is running now, and cURL is enabled, we will continue onwards.
A simple cURL Request
Pulling data in with cURL isn't hard at all, as illustrated below.
<?php
// Init $curl as a cURL handle
$curl = curl_init();

// Tell cURL what URL we are going after
curl_setopt($curl, CURLOPT_URL, 'http://www.google.com');

// Tell cURL we would like the headers as well
curl_setopt($curl, CURLOPT_HEADER, 1);

// Tell cURL we would like the result as a string instead of having it dumped straight to the screen
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

// Execute the cURL request
$data = curl_exec($curl);

// Close the cURL request
curl_close($curl);

// Display the data from the variable to ensure it's there
var_dump($data);
?>
The above set of code will go out to http://www.google.com and will set the variable $data to contain the HTML contents of the website. The var_dump($data) at the end of the file merely spits it back out onto your screen so you can see the data you have to work with.
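One thing to keep in mind: because we set CURLOPT_HEADER to 1, the string in $data contains the raw HTTP response headers followed by the HTML body. If you only want the body, a minimal sketch like this will separate the two (the splitResponse() helper is just an illustrative name, not part of the original example):
<?php
// Split a response fetched with CURLOPT_HEADER enabled into headers and body.
// Note: redirects or "100 Continue" responses can prepend extra header blocks,
// so this simple split covers only the basic single-response case.
function splitResponse($response) {
    // Headers and body are separated by a blank line (\r\n\r\n)
    $parts = explode("\r\n\r\n", $response, 2);
    $headers = $parts[0];
    $body = isset($parts[1]) ? $parts[1] : '';
    return array($headers, $body);
}

list($headers, $body) = splitResponse($data);
echo $headers . "\n";                  // the raw HTTP response headers
echo strlen($body) . " bytes of HTML\n";
?>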
Now, what you end up doing with this data is up to you! You could run it through some regex statements to pull relevant information, you could parse it line by line and store certain portions of code somewhere, or if you are pulling an XML file, you could begin to parse the XML. Since this article is just about cURL, we won’t get into that.
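As a quick taste of the regex route, here is a minimal sketch (not part of the original example) that pulls the page title out of $data:
<?php
// Grab the contents of the <title> tag from the HTML we fetched above.
if (preg_match('/<title>(.*?)<\/title>/is', $data, $matches)) {
    echo "Page title: " . $matches[1] . "\n";
} else {
    echo "No title found\n";
}
?>
Anything more involved than that really does deserve a post of its own.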
Using a cURL Request Object
A bit more on the advanced side, but if you want to create an object to handle all your requests for you, I’ve pulled one out of my code library that you may find useful.
<?php
class curlHandler {
    public $url = '';
    public $output = '';
    public $curl = null;

    function __construct($url) {
        // Create the cURL handle and point it at the requested URL
        $this->curl = curl_init();
        $this->url($url);

        // Return the result as a string and pretend to be a Firefox browser
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');

        // Execute the request and keep the response in $this->output
        $this->output = curl_exec($this->curl);
    }

    function __destruct() {
        // Close the cURL handle when the object goes away
        curl_close($this->curl);
    }

    function url($url) {
        // Store and apply the URL for this request
        $this->url = $url;
        curl_setopt($this->curl, CURLOPT_URL, $url);
    }
}

// Init the object and do the request; the handle is closed automatically by the destructor
$curlHandler = new curlHandler("http://www.google.com");

// Display what we've found
var_dump($curlHandler);
?>
Well, gathering data this way is pretty simple once you know what you are passing in. Notice that in the class above I am passing a Firefox user-agent string into the cURL request. Why? Some websites try to block cURL or other automated requests (such as the World of Warcraft Armory, which is what I was scraping), so by mimicking a browser we can get past those obstacles.
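If you are not using the class, the same trick works on a plain handle; here is a minimal sketch that reuses the Firefox user-agent string from the class above:
<?php
// Mimic a regular browser by sending a browser-style User-Agent header.
$curl = curl_init('http://www.google.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2');
$data = curl_exec($curl);
curl_close($curl);
?>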
Now, what you do with all of this newfound data is up to you. Eventually I will write a post about parsing the data you find, but that is for another day.


This article comes from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u/78/showart_1910281.html