How to use the wget command

#1 wazhl, posted 2010-07-02 13:26

As the title says.

For example, I want to download the files under

http://mirrors.kernel.org/opensuse/distribution/11.2/iso/

How do I get them?

#3, posted 2010-07-02 14:07 (reply to #1 wazhl)

wget usage tips

2007-10-14, by Toy, posted in Tips

wget is a command-line download tool that those of us on Linux use almost every day. Here are a few handy wget tips that can make your use of it more efficient and flexible.

    * $ wget -r -np -nd http://example.com/packages/

This command downloads every file in the packages directory on http://example.com. The -np option tells wget not to ascend into the parent directory, and -nd tells it not to recreate the remote directory structure locally.
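If you would rather keep part of the remote path instead of flattening everything with -nd, wget also provides -nH (skip creating the host-name directory) and --cut-dirs=N (drop the first N remote path components). A minimal sketch against the same hypothetical URL:

      $ wget -r -np -nH --cut-dirs=1 http://example.com/packages/

With this variant, files directly under packages/ land in the current directory, while deeper subdirectories are preserved.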

    * $ wget -r -np -nd --accept=iso http://example.com/centos-5/i386/

Similar to the previous command, but with an extra --accept=iso option, which tells wget to download only the files under the i386 directory whose extension is iso. You can also list several extensions, separated by commas, as in the sketch below.
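For example, to fetch both the ISO images and their checksum files in one run (the path here is hypothetical):

      $ wget -r -np -nd --accept=iso,md5 http://example.com/centos-5/i386/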

    * $ wget -i filename.txt

This command is handy for batch downloads: put the URLs of every file you want into filename.txt, and wget will fetch them all for you automatically, as in the sketch below.
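A hypothetical filename.txt simply lists one URL per line:

      http://example.com/file1.iso
      http://example.com/file2.tar.gz
      http://example.com/notes/readme.txt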

    * $ wget -c http://example.com/really-big-file.iso

The -c option shown here resumes a partially downloaded file instead of starting over.
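If the transfer is interrupted, re-running the same command with -c continues from where it stopped; adding --tries lets wget retry on its own (a sketch using the same hypothetical URL):

      $ wget -c --tries=20 http://example.com/really-big-file.iso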

    * $ wget -m -k (-H) http://www.example.com/

This command mirrors a website, with wget rewriting the links so the copy can be browsed locally. If the site's images live on a different host, add the -H option so wget will follow links to other hosts.
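Because -H lets wget leave the original host, it is usually paired with -D to whitelist which extra domains may be followed (a sketch; the domains are hypothetical):

      $ wget -m -k -H -D example.com,images.example.com http://www.example.com/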
#4 一路征程一路笑 (user deleted), posted 2010-07-02 14:35

Note: the author has been banned or deleted; the content was automatically hidden.

#5, posted 2010-07-02 15:31

Can't you just read the man page?!

#6, posted 2010-07-02 21:58

The poster in #3 is impressive... all I ever use is wget url.

#7 expresss, posted 2010-07-03 17:54 (last edited by expresss on 2010-07-05 08:09)

Why can't I download the files from the URL the OP gave?

wget -r -np -nd --accept=md5 http://mirrors.kernel.org/opensuse/distribution/11.2/iso/

This does not actually fetch the .md5 files; all I end up with is a robots.txt file, and the same thing happens with other extensions. Is something wrong with these options? I have tried several times and it always fails to download files of the specified type.

#8, posted 2010-07-05 11:58 (reply to #7 expresss)

9.1 Robot Exclusion

It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. ‘wget -r site’, and you're set. Great? Not for the server admin.

As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the ‘--wait’ option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by a CGI Perl script that converts Info files to html on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the user (This task of converting Info files could be done locally and access to Info documentation for all installed GNU software on a system is available from the info command).

To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion was invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from robots and those they will permit access.

The most popular mechanism, and the de facto standard supported by all the major robots, is the “Robots Exclusion Standard” (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in /robots.txt in the server root, which the robots are expected to download and parse.

Although Wget is not a web robot in the strictest sense of the word, it can download large parts of the site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:

     wget -r http://www.server.com/

First the index of ‘www.server.com’ will be downloaded. If Wget finds that it wants to download more documents from that server, it will request ‘http://www.server.com/robots.txt’ and, if found, use it for further downloads. robots.txt is loaded only once per each server.

Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft ‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”. The draft, which has as far as I know never made to an rfc, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.

This manual no longer includes the text of the Robot Exclusion Standard.

The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

     <meta name="robots" content="nofollow">

This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.

If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to ‘off’ in your .wgetrc. You can achieve the same effect from the command line using the -e switch, e.g. ‘wget -e robots=off url...’.
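As the quoted manual says, robot exclusion can be turned off per run with -e robots=off, or permanently by setting the robots variable in ~/.wgetrc. A minimal sketch of the latter:

     # ~/.wgetrc: make every wget run ignore robots.txt and robots META tags
     robots = off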

#9 gamester88, posted 2010-07-05 12:01 (last edited by gamester88 on 2010-07-05 12:02, reply to #7 expresss)

Because of the robots.txt file, the options above have no effect on their own, so:
[gamester88@gamester88 iso]$ mkdir iso
[gamester88@gamester88 iso]$ cd iso
[gamester88@gamester88 iso]$ ls
[gamester88@gamester88 iso]$ wget -e robots=off -r -np -nd --accept=md5 http://mirrors.kernel.org/opensuse/distribution/11.2/iso/
[gamester88@gamester88 iso]$ ls
openSUSE-11.2-Addon-Lang-i586.iso.md5
openSUSE-11.2-DVD-x86_64.iso.md5
openSUSE-11.2-KDE4-LiveCD-x86_64.iso.md5
openSUSE-11.2-Addon-Lang-x86_64.iso.md5
openSUSE-11.2-GNOME-LiveCD-i686.iso.md5
openSUSE-11.2-NET-i586.iso.md5
openSUSE-11.2-Addon-NonOss-BiArch-i586-x86_64.iso.md5
openSUSE-11.2-GNOME-LiveCD-x86_64.iso.md5
openSUSE-11.2-NET-x86_64.iso.md5
openSUSE-11.2-DVD-i586.iso.md5
openSUSE-11.2-KDE4-LiveCD-i686.iso.md5

#10 expresss, posted 2010-07-06 21:22 (last edited by expresss on 2010-07-07 09:16, reply to #9 gamester88)

Thank you for such a helpful answer, I really appreciate it. It seems that to learn Linux properly I'll have to get my English up to scratch as well, heh.
I think I understand now: the Disallow: / line in robots.txt forbids crawling the whole tree, and -e robots=off tells wget to ignore what robots.txt says, so it bypasses that restriction.
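For reference, a robots.txt that blocks every path for every crawler looks roughly like this (the actual file served by mirrors.kernel.org may differ):

     User-agent: *
     Disallow: /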