[Web] On the synchronization algorithms and suitable use cases of zsync and rsync+inotify (advanced discussion, proceed with care)

#1 | Posted 2010-03-17 20:09
This post was last edited by shinelian on 2010-03-18 17:26

Quote: http://everythinglinux.org/rsync/
rsync
Diffs - Only actual changed pieces of files are transferred, rather than the whole file. This makes updates faster, especially over slower links like modems. FTP would transfer the entire file, even if only one byte changed.


Quote: http://zsync.moria.org.uk/index
zsync is a file transfer program. It allows you to download a file from a remote server, where you have a copy of an older version of the file on your computer already. zsync downloads only the new parts of the file. It uses the same algorithm as rsync. However, where rsync is designed for synchronising data from one computer to another within an organisation, zsync is designed for file distribution, with one file on a server to be distributed to thousands of downloaders. zsync requires no special server software, just a web server to host the files, and imposes no extra load on the server, making it ideal for large scale file distribution.

zsync also has other features, such as rsync over HTTP and handling for compressed files; see the documentation for details.


Question 1: What are the suitable use cases for rsync and for zsync respectively, why is that the case, and if one of them is adopted, how should it be tuned?


rsync has several options that may be relevant here (see the example commands after this list):
    -W, --whole-file            copy whole files, no incremental checks   (may help in understanding this case)
        --no-whole-file         turn off --whole-file

    -S, --sparse                handle sparse files efficiently   (how should this option be understood, and where does the efficiency gain show up?)

    -c, --checksum              always checksum   (how does this affect rsync's behaviour and efficiency?)

    --partial                   keep partially transferred files   (this is also a very interesting option)
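
Some illustrative invocations of these options (a rough sketch; the host name and paths are made up for the example):

    # force the delta-transfer algorithm even for a purely local copy
    # (rsync defaults to --whole-file when both paths are local)
    rsync -av --no-whole-file /data/big.img /mnt/backup/

    # disable the delta algorithm over the network and always send whole files
    rsync -av --whole-file /data/ backup.example.com:/data/

    # preserve holes in sparse files (e.g. VM images) instead of writing them out as runs of zeros
    rsync -av --sparse /var/lib/vm/ backup.example.com:/vm/

    # decide what to transfer by full checksum instead of size+mtime (reads every file on both ends)
    rsync -av --checksum /data/ backup.example.com:/data/

    # keep partially transferred files so an interrupted transfer can be resumed later
    rsync -av --partial /data/big.img backup.example.com:/data/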



I have read some of the official documentation and drawn a few conclusions, but I am not sure whether they are correct.

For example, if 5000 rsync clients simultaneously rsync a 1 GB file in which only 50 bytes have changed, the rsync server ends up running the checksum computation 5000 times, which can overload the rsyncd server, even though the data actually synchronized to each client is only about 50 bytes.
Likewise, if 5000 zsync clients simultaneously zsync a 1 GB file in which only 50 bytes have changed, the differences are effectively worked out in advance and only the differing parts are transferred, so the HTTP server behind zsync is not overloaded: zsync adds a metadata file that already contains the information needed to work out the differences, and the data synchronized to each zsync client is again only about 50 bytes.
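
To make the comparison concrete, here is a rough sketch (hypothetical host name, module and file names): with rsync the daemon has to run the delta algorithm over the 1 GB file once per connecting client, whereas with zsync the expensive checksumming is done once by zsyncmake when the file is published, and afterwards the clients only talk to an ordinary web server.

    # rsync: each of the 5000 clients makes the server re-scan and checksum the 1 GB file
    rsync -av rsync://mirror.example.com/pub/big.img .

    # zsync: the server-side work happens once, whenever big.img changes
    zsyncmake -u http://mirror.example.com/big.img big.img    # writes big.img.zsync
    # from then on the web server only answers ordinary GET / Range requests for two static files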


I hope someone with experience in this area can offer a few pointers.



Appendix: some simple zsync usage examples:

  zsyncmake -C -u http://ftp.uk.debian.org/debian/dists/sarge/main/binary-i386/Packages.gz Packages.gz
       Note use of -C to save the client compressing the file on receipt; the Debian package system uses the file uncompressed.

  zsyncmake -z my-subversion-dump
       In this case there is a large, compressible file to transfer. This creates a gzipped version of the file (optimised for zsync), and a .zsync file. A URL is automatically added assuming that the two files will be served from the same directory on the web server.

  zsyncmake -e -u http://www.mirrorservice.org/sites/ftp.freebsd.org/pub/FreeBSD/ports/distfiles/zsync-0.2.2.tar.gz zsync-0.2.2.tar.gz
       This creates a zsync file referring to the named source tarball, which the client should download from the given URL. This example is for downloading a source tarball for a FreeBSD port, hence -e is specified so the client will be able to match its md5sum.
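
For completeness, the client side of the last example might look roughly like this (a sketch; the URL hosting the .zsync file and the old local tarball used as the seed are made up; with -i, zsync reuses blocks from the old copy and fetches only the changed blocks over HTTP):

    # download the new tarball, seeding from an older version already on disk
    zsync -i zsync-0.2.1.tar.gz http://example.org/zsync-0.2.2.tar.gz.zsync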


zsyncmake man page:
http://www.helplinux.cn/man/1/zsyncmake.html
zsync man page:
http://www.helplinux.cn/man/1/zsync.html


References (I had translated several paragraphs, but the translation was lost when saving, sadly).
Reading the following overview paragraphs should confirm that the conclusions above are correct.
      HTTP already provides the Range header for transferring partial content of files. This is useful only if you are able to determine from some other source of information which are the changed sections. If you know that a file is a log and will only ever grow — existing content will not change — then Range is an effective tool. But it does not solve the problem by itself.
      There are alternative download technologies like BitTorrent, which break up the desired file into blocks, and retrieve these blocks from a range of sources [BitT2003]. As BitTorrent provides checksums on fragments of file content, these could be used to identify content that is already known to the client (and it is used for this, to resume partial downloads, I believe). But reusing data from older files is not a purpose of this data in BitTorrent — only if exactly matching blocks could be identified would the data be any use.

     The best existing solution from the point of view of minimising data transfer is rsync. rsync uses a rolling checksum algorithm that allows the checksum over a given block length at all points in a file to be calculated efficiently. Generally speaking, a checksum would have to be run at every possible start point to achieve this — the algorithm used in rsync (see [Rsync1998]) allows the checksum window to be rolled forward over the file and the checksum for each new location to be trivially derived from the previous checksum and the values at the window edges. So rsync can calculate the checksum at all points in the input file by streaming through the file data just once. While doing so it compares each calculated checksum against the list of checksums for the existing data file, and spots any chunks from the old data file which can be reused.

      So rsync achieves a high level of data reuse. It comes at a high computational cost, however. The current rsync implementation calculates the checksums for a set of blocks on the client, then uploads these to the server; the server then uses the rsync algorithm to work out which blocks the client has and which it needs, and pushes back the blocks it needs. But this approach suffers many drawbacks:


         1. The server must reparse the data each time. It cannot save the computed checksums. This is because the client sends just the checksums for disjoint blocks of data from its pool of known data. The server must calculate the checksum at all offsets, not just at the block boundaries. The client cannot send the checksum at all points, because this would be four times larger than the data file itself — and the server does not want to pre-compute the checksums at all points, because again it would be four times larger, and require four times as much disk activity, as reading the original data file. So CPU requirements on the server are high. Also the server must read the entire file, even if the final answer is that the client requires only a small fragment updated. (Poster's note: a CPU-intensive workload.)


    2. Memory requirements for the server are high - it must store a hash table or equivalent structure of all the checksums received from the client while parsing its own data. (Poster's note: a memory-intensive workload.)

    3. The server must receive and act on a large volume of data from the client, storing it in memory, parsing data, etc — so there is the opportunity for denial of service attacks and security holes. In practice rsync has had a remarkably good security record: there have been a few vulnerabilities in the past few years (although at least one of these was actually a zlib bug, if I remember rightly). (Poster's note: there is some DoS and security risk.)



    The drawbacks with rsync have prevented it being deployed widely to distribute files to the general public. Instead, it has been used in areas closer to the existing use of cvs and sup, where a limited community of users use an rsync server to pull daily software snapshots. rsync is also very widely used inside organisations for efficient transfer of files between private systems, using rcp or scp as a tunnel. rsync also has very powerful functionality parallelling cp -a and tar's abilities, with transfers of file permissions, directory trees, special files, etc. But public releases are rarely made with rsync, as far as I can tell.

     I should also mention rproxy. While I have not used it myself, it is an attempt to integrate the rsync algorithm into the HTTP protocol [RProxy]. An rproxy-enabled client transmits the rsync checksums of blocks of data it already has to the server as part of the HTTP request; the server calculates the rolling checksum over the page it would have transmitted, and transmits only the blocks and the meta-information needed for the client to construct the full page. It has the advantage of integrating with the existing protocol and working even for dynamic pages. But it will, I suppose, suffer the same disk and CPU load problems as rsync on large files, and is an unwelcome overhead on the server even for small files. Since server administrators are rarely as concerned about bandwidth and download time as the client, it is hard to see them wanting to put extra work on their servers by offering either rsync or rproxy generally.

     CVS and subversion provide specialised server programs and protocols for calculating diffs on a per-client basis. They have the advantage of efficiency once again, by constructing exactly the diff the client needs — but lose on complexity, because the server must calculate on a per-client basis, and the relatively complicated server processing client requests increases the risk of security vulnerabilities. CVS is also poor at handling binary data, although subversion does do better in this area. But one would hardly distribute ISO images over either of these systems.
Hybrid protocols have been designed, which incorporate ideas from several of the systems above. For instance, CVSup [CVSup1999] uses CVS and deltas for version-controlled files, and the rsync algorithm for files outside of version control. While it offers significantly better performance than either rsync or CVS, due to efficient pipelining of requests for multiple files, it does not fundamentally improve on either, so the discussion above — in particular the specialised server and high server processing cost per client — apply.


Tentative conclusion:
1. With rsync+inotify, once the number of clients grows large, the server's CPU and memory cannot keep up.
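
For reference, the rsync+inotify setup under discussion usually amounts to a small watch loop like the following (a minimal sketch using inotifywait from inotify-tools; the watched path, destination host and module name are made up, and with many destination nodes each event triggers one such rsync per node):

    #!/bin/sh
    # push changes to a mirror as soon as inotify reports them;
    # note that every event still triggers a full rsync pass over the watched tree
    inotifywait -mrq -e modify,create,delete,move /data/www |
    while read path events file; do
        rsync -az --delete /data/www/ rsync://mirror.example.com/www/
    done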


Third-party reference:
http://www.gaojinbo.com/rsync%E7%9A%84%E5%87%A0%E7%A7%8D%E4%BC%98%E5%8C%96%E5%BA%94%E7%94%A8%E6%96%B9%E6%A1%88.html (Gao's article really is written from hands-on experience.)

#2 | Posted 2010-03-17 20:42
I have never used zsync, heh.

#3 | Posted 2010-03-18 00:04
Marking my place; I will come back and read this in detail.

#4 | Posted 2010-03-18 17:03
Quoting the tentative conclusion:
1. With rsync+inotify, once the number of clients grows large, CPU and memory cannot keep up.

How many clients counts as "large"? Anything will stop keeping up once there are enough of them.

#5 | Posted 2010-03-18 17:12
This post was last edited by shinelian on 2010-03-18 17:14

Tentative conclusion:
If the two were used in the same scenario, rsync is a CPU- and memory-sensitive algorithm while zsync is not, so rsync should be the first to hit the CPU and memory bottleneck.
The exact threshold depends on many factors, I think, and needs to be measured in the concrete scenario.
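
One crude way to see where the bottleneck shows up in a given setup (a sketch; the module and file name are made up): time a single pull with GNU time to get CPU time and peak memory on the client side, and watch the per-connection rsync processes on the server with top while several clients pull at once.

    # CPU time and peak resident memory of the local rsync for one pull
    /usr/bin/time -v rsync -a rsync://mirror.example.com/pub/big.img .

    # on the server: watch the per-connection rsync processes while N clients connect
    top -c -d 1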

#6 | Posted 2010-04-26 11:44
rsync 3.0.7 is already quite good; I am continuing to follow this.

#7 | Posted 2010-05-29 12:33
I also only just read up yesterday on the rsync 3.0.7 mentioned above; it has some attractive new features.

#8 | Posted 2010-05-29 17:37
I have not used rsync 3.0.7 yet; hoping for more material on it.

#9 | Posted 2010-06-08 11:09
Does zsync support HTTPS?

#10 | Posted 2012-04-28 20:13
shinelian wrote on 2010-03-17 20:09:
Quote: http://everythinglinux.org/rsync/
rsync
Diffs - Only actual changed pieces of files are t ...


Good to learn from. I recently used this to download an Ubuntu image; it works well and saves bandwidth.