Chinaunix

Title: How to use the wget command.

Author: wazhl    Time: 2010-07-02 13:26
Title: How to use the wget command.
As the title says.

For example, I want to download the files under

http://mirrors.kernel.org/opensuse/distribution/11.2/iso/

How can I get them?
Author: dexter_yccs    Time: 2010-07-02 13:53
wget http://mirrors.kernel.org/opensu ... Addon-Lang-i586.iso
But this doesn't seem to handle batch downloads.
Author: gamester88    Time: 2010-07-02 14:07
Reply to 1# wazhl


   
wget usage tips

2007-10-14, Toy, posted in Tips

wget is a command-line download tool. For those of us who use Linux, it is something we use almost every day. Below are a few useful wget tips that can help you use it more efficiently and flexibly.

    * $ wget -r -np -nd http://example.com/packages/

This command downloads all the files in the packages directory of http://example.com. Here, -np means do not ascend into the parent directory, and -nd means do not recreate the remote directory structure locally.

    * $ wget -r -np -nd --accept=iso http://example.com/centos-5/i386/

Similar to the previous command, but with an extra --accept=iso option, which tells wget to download only the files in the i386 directory whose extension is iso. You can also specify multiple extensions, separated by commas.
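For example, to grab both the iso images and their md5 checksum files in one pass (the host and path here are placeholders, not from the original post):

    * $ wget -r -np -nd --accept=iso,md5 http://example.com/centos-5/i386/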

    * $ wget -i filename.txt

This command is commonly used for batch downloads: put the URLs of all the files you need into filename.txt, and wget will download them all for you automatically.
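As a sketch, filename.txt is just a plain text file with one URL per line (the URLs below are made up for illustration):

    http://example.com/disc1.iso
    http://example.com/disc2.iso

    * $ wget -i filename.txt

wget then fetches each URL in the file in turn.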

    * $ wget -c http://example.com/really-big-file.iso

The -c option used here resumes an interrupted download from where it left off.
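On a flaky connection, -c combines well with wget's retry option; a sketch with a placeholder URL (-t 0, i.e. --tries=0, means retry indefinitely):

    * $ wget -c -t 0 http://example.com/really-big-file.iso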

    * $ wget -m -k (-H) http://www.example.com/

This command can be used to mirror a website; wget will convert the links so they work locally. If the site's images are hosted on another site, you can add the -H option to let wget span hosts.
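If you do use -H, it is usually wise to limit which extra hosts wget may wander onto; the -D (--domains) option takes a comma-separated list of allowed domains. A sketch with placeholder domains:

    * $ wget -m -k -H -D example.com,images.example.com http://www.example.com/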

Author: 一路征程一路笑    Time: 2010-07-02 14:35
Notice: the author has been banned or deleted; the content has been automatically hidden.
Author: 俺小时候可帅了    Time: 2010-07-02 15:31
Dude, can't you just read the man page?!
Author: klanet    Time: 2010-07-02 21:58
The poster on the 3rd floor is impressive... all I know is wget url
Author: expresss    Time: 2010-07-03 17:54
This post was last edited by expresss on 2010-07-05 08:09

Why can't I successfully download the files the OP asked about?
wget -r -np -nd --accept=md5 http://mirrors.kernel.org/opensuse/distribution/11.2/iso/
This does not manage to download the md5 files; all I get is a single robots.txt file. Changing to other extensions gives the same result.
Is there something wrong with these options? I have tried several times and it fails every time; it will not download files of the specified type.
Author: gamester88    Time: 2010-07-05 11:58
Reply to 7# expresss
  9.1 Robot Exclusion

  It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. ‘wget -r site’, and you're set. Great? Not for the server admin.

  As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the ‘--wait’ option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by a CGI Perl script that converts Info files to html on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the user (This task of converting Info files could be done locally and access to Info documentation for all installed GNU software on a system is available from the info command).

  To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion was invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from robots and those they will permit access.

  The most popular mechanism, and the de facto standard supported by all the major robots, is the “Robots Exclusion Standard” (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in /robots.txt in the server root, which the robots are expected to download and parse.

  Although Wget is not a web robot in the strictest sense of the word, it can download large parts of the site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:

       wget -r http://www.server.com/

  First the index of ‘www.server.com’ will be downloaded. If Wget finds that it wants to download more documents from that server, it will request ‘http://www.server.com/robots.txt’ and, if found, use it for further downloads. robots.txt is loaded only once per each server.

  Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft ‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”. The draft, which has as far as I know never made to an rfc, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.

  This manual no longer includes the text of the Robot Exclusion Standard.

  The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

       <meta name="robots" content="nofollow">

  This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.

  If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to ‘off’ in your .wgetrc. You can achieve the same effect from the command line using the -e switch, e.g. ‘wget -e robots=off url...’.
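As the quoted manual text notes, the same effect as -e robots=off can be made permanent by setting the variable in ~/.wgetrc; a minimal sketch:

    # ~/.wgetrc
    robots = off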

Author: gamester88    Time: 2010-07-05 12:01
This post was last edited by gamester88 on 2010-07-05 12:02

Reply to 7# expresss


    It's because of the robots.txt file that the options above get you nowhere, so:
  [gamester88@gamester88 iso]$ mkdir iso
  [gamester88@gamester88 iso]$ cd iso
  [gamester88@gamester88 iso]$ ls
  [gamester88@gamester88 iso]$ wget -e robots=off -r -np -nd --accept=md5 http://mirrors.kernel.org/opensuse/distribution/11.2/iso/
  [gamester88@gamester88 iso]$ ls
  openSUSE-11.2-Addon-Lang-i586.iso.md5
  openSUSE-11.2-DVD-x86_64.iso.md5
  openSUSE-11.2-KDE4-LiveCD-x86_64.iso.md5
  openSUSE-11.2-Addon-Lang-x86_64.iso.md5
  openSUSE-11.2-GNOME-LiveCD-i686.iso.md5
  openSUSE-11.2-NET-i586.iso.md5
  openSUSE-11.2-Addon-NonOss-BiArch-i586-x86_64.iso.md5
  openSUSE-11.2-GNOME-LiveCD-x86_64.iso.md5
  openSUSE-11.2-NET-x86_64.iso.md5
  openSUSE-11.2-DVD-i586.iso.md5
  openSUSE-11.2-KDE4-LiveCD-i686.iso.md5

Author: expresss    Time: 2010-07-06 21:22
This post was last edited by expresss on 2010-07-07 09:16

Reply to 9# gamester88


    Thank you for such a helpful answer, I really appreciate it. It looks like if I want to learn Linux well, I really do have to improve my English, heh.
Thanks again for the kind explanation. I think I get it now: because robots.txt contains Disallow: /, crawling the whole directory is not allowed; with -e robots=off wget does not follow the robots rules, i.e. it bypasses the restrictions in robots.txt.
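For reference, a robots.txt that turns away all well-behaved crawlers looks roughly like this (a minimal sketch, not necessarily the actual file served by mirrors.kernel.org):

    User-agent: *
    Disallow: /

With -e robots=off, wget simply does not honor these rules during recursive retrieval.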




Welcome to Chinaunix (http://bbs.chinaunix.net/) Powered by Discuz! X3.2