How to use the wget command

#1 wazhl, posted 2010-07-02 13:26

As the title says.

For example, I want to download the files under

http://mirrors.kernel.org/opensuse/distribution/11.2/iso/

How do I get them?

#3, posted 2010-07-02 14:07 (reply to #1 wazhl)

wget usage tips

2007-10-14, by Toy, posted in Tips

wget is a command-line download tool that those of us on Linux use almost every day. Here are a few handy wget tips that can make your use of it more efficient and flexible.

    * $ wget -r -np -nd http://example.com/packages/

This command downloads every file in the packages directory on http://example.com. The -np option tells wget not to ascend into the parent directory, and -nd tells it not to recreate the remote directory structure locally.
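If you would rather keep part of the remote path instead of flattening everything with -nd, wget also provides -nH (skip creating the host-name directory) and --cut-dirs=N (drop the first N remote path components). A minimal sketch against the same hypothetical URL:

      $ wget -r -np -nH --cut-dirs=1 http://example.com/packages/

With this variant, files directly under packages/ land in the current directory, while deeper subdirectories are preserved.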

    * $ wget -r -np -nd --accept=iso http://example.com/centos-5/i386/

Similar to the previous command, but with an extra --accept=iso option, which tells wget to download only the files under the i386 directory whose extension is iso. You can also list several extensions, separated by commas, as in the sketch below.
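For example, to fetch both the ISO images and their checksum files in one run (the path here is hypothetical):

      $ wget -r -np -nd --accept=iso,md5 http://example.com/centos-5/i386/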

    * $ wget -i filename.txt

This command is handy for batch downloads: put the URLs of every file you want into filename.txt, and wget will fetch them all for you automatically, as in the sketch below.
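A hypothetical filename.txt simply lists one URL per line:

      http://example.com/file1.iso
      http://example.com/file2.tar.gz
      http://example.com/notes/readme.txt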

    * $ wget -c http://example.com/really-big-file.iso

The -c option shown here resumes a partially downloaded file instead of starting over.
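If the transfer is interrupted, re-running the same command with -c continues from where it stopped; adding --tries lets wget retry on its own (a sketch using the same hypothetical URL):

      $ wget -c --tries=20 http://example.com/really-big-file.iso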

    * $ wget -m -k (-H) http://www.example.com/

This command mirrors a website, with wget rewriting the links so the copy can be browsed locally. If the site's images live on a different host, add the -H option so wget will follow links to other hosts.
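Because -H lets wget leave the original host, it is usually paired with -D to whitelist which extra domains may be followed (a sketch; the domains are hypothetical):

      $ wget -m -k -H -D example.com,images.example.com http://www.example.com/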
#4 一路征程一路笑 (user deleted), posted 2010-07-02 14:35

Note: the author has been banned or deleted; the content was automatically hidden.

#5, posted 2010-07-02 15:31

Can't you just read the man page?!

#6, posted 2010-07-02 21:58

The poster in #3 is impressive... all I ever use is wget url.

#7 expresss, posted 2010-07-03 17:54 (last edited by expresss on 2010-07-05 08:09)

Why can't I download the files from the URL the OP gave?

wget -r -np -nd --accept=md5 http://mirrors.kernel.org/opensuse/distribution/11.2/iso/

This does not actually fetch the .md5 files; all I end up with is a robots.txt file, and the same thing happens with other extensions. Is something wrong with these options? I have tried several times and it always fails to download files of the specified type.

#8, posted 2010-07-05 11:58 (reply to #7 expresss)

9.1 Robot Exclusion

It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. ‘wget -r site’, and you're set. Great? Not for the server admin.

As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the ‘--wait’ option), there's not much of a problem. The trouble is that Wget can't tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by a CGI Perl script that converts Info files to html on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone's recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the user (This task of converting Info files could be done locally and access to Info documentation for all installed GNU software on a system is available from the info command).

To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion was invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from robots and those they will permit access.

The most popular mechanism, and the de facto standard supported by all the major robots, is the “Robots Exclusion Standard” (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in /robots.txt in the server root, which the robots are expected to download and parse.

Although Wget is not a web robot in the strictest sense of the word, it can download large parts of the site without the user's intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:

     wget -r http://www.server.com/

First the index of ‘www.server.com’ will be downloaded. If Wget finds that it wants to download more documents from that server, it will request ‘http://www.server.com/robots.txt’ and, if found, use it for further downloads. robots.txt is loaded only once per each server.

Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/wc/norobots.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft ‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”. The draft, which has as far as I know never made to an rfc, is available at http://www.robotstxt.org/wc/norobots-rfc.txt.

This manual no longer includes the text of the Robot Exclusion Standard.

The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

     <meta name="robots" content="nofollow">

This is explained in some detail at http://www.robotstxt.org/wc/meta-user.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.

If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to ‘off’ in your .wgetrc. You can achieve the same effect from the command line using the -e switch, e.g. ‘wget -e robots=off url...’.
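As the quoted manual says, robot exclusion can be turned off per run with -e robots=off, or permanently by setting the robots variable in ~/.wgetrc. A minimal sketch of the latter:

     # ~/.wgetrc: make every wget run ignore robots.txt and robots META tags
     robots = off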

#9 gamester88, posted 2010-07-05 12:01 (last edited by gamester88 on 2010-07-05 12:02, reply to #7 expresss)

Because of the robots.txt file, the options above have no effect on their own, so:
[gamester88@gamester88 iso]$ mkdir iso
[gamester88@gamester88 iso]$ cd iso
[gamester88@gamester88 iso]$ ls
[gamester88@gamester88 iso]$ wget -e robots=off -r -np -nd --accept=md5 http://mirrors.kernel.org/opensuse/distribution/11.2/iso/
[gamester88@gamester88 iso]$ ls
openSUSE-11.2-Addon-Lang-i586.iso.md5
openSUSE-11.2-DVD-x86_64.iso.md5
openSUSE-11.2-KDE4-LiveCD-x86_64.iso.md5
openSUSE-11.2-Addon-Lang-x86_64.iso.md5
openSUSE-11.2-GNOME-LiveCD-i686.iso.md5
openSUSE-11.2-NET-i586.iso.md5
openSUSE-11.2-Addon-NonOss-BiArch-i586-x86_64.iso.md5
openSUSE-11.2-GNOME-LiveCD-x86_64.iso.md5
openSUSE-11.2-NET-x86_64.iso.md5
openSUSE-11.2-DVD-i586.iso.md5
openSUSE-11.2-KDE4-LiveCD-i686.iso.md5

#10 expresss, posted 2010-07-06 21:22 (last edited by expresss on 2010-07-07 09:16, reply to #9 gamester88)

Thank you for such a helpful answer, I really appreciate it. It seems that to learn Linux properly I'll have to get my English up to scratch as well, heh.
I think I understand now: the Disallow: / line in robots.txt forbids crawling the whole tree, and -e robots=off tells wget to ignore what robots.txt says, so it bypasses that restriction.
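For reference, a robots.txt that blocks every path for every crawler looks roughly like this (the actual file served by mirrors.kernel.org may differ):

     User-agent: *
     Disallow: /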