[Web] On the synchronization algorithms and suitable use cases of zsync and rsync+inotify (advanced discussion, proceed with care)

#1 | Posted 2010-03-17 20:09
This post was last edited by shinelian on 2010-03-18 17:26

Quote: http://everythinglinux.org/rsync/
rsync
Diffs - Only actual changed pieces of files are transferred, rather than the whole file. This makes updates faster, especially over slower links like modems. FTP would transfer the entire file, even if only one byte changed.


Quote: http://zsync.moria.org.uk/index
zsync is a file transfer program. It allows you to download a file from a remote server, where you have a copy of an older version of the file on your computer already. zsync downloads only the new parts of the file. It uses the same algorithm as rsync. However, where rsync is designed for synchronising data from one computer to another within an organisation, zsync is designed for file distribution, with one file on a server to be distributed to thousands of downloaders. zsync requires no special server software, just a web server to host the files, and imposes no extra load on the server, making it ideal for large scale file distribution.

zsync also has other features, such as rsync over HTTP and handling for compressed files; see the documentation for details.


Question 1: What are the suitable use cases for rsync and for zsync respectively, why is that the case, and if one of them is adopted, how should it be tuned?


rsync has several options that may be relevant here (see the example commands after this list):
    -W, --whole-file            copy whole files, no incremental checks   (may help in understanding this case)
        --no-whole-file         turn off --whole-file

    -S, --sparse                handle sparse files efficiently   (how should this option be understood, and where does the efficiency gain show up?)

    -c, --checksum              always checksum   (how does this affect rsync's behaviour and efficiency?)

    --partial                   keep partially transferred files   (this is also a very interesting option)
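
Some illustrative invocations of these options (a rough sketch; the host name and paths are made up for the example):

    # force the delta-transfer algorithm even for a purely local copy
    # (rsync defaults to --whole-file when both paths are local)
    rsync -av --no-whole-file /data/big.img /mnt/backup/

    # disable the delta algorithm over the network and always send whole files
    rsync -av --whole-file /data/ backup.example.com:/data/

    # preserve holes in sparse files (e.g. VM images) instead of writing them out as runs of zeros
    rsync -av --sparse /var/lib/vm/ backup.example.com:/vm/

    # decide what to transfer by full checksum instead of size+mtime (reads every file on both ends)
    rsync -av --checksum /data/ backup.example.com:/data/

    # keep partially transferred files so an interrupted transfer can be resumed later
    rsync -av --partial /data/big.img backup.example.com:/data/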



I have read some of the official documentation and drawn a few conclusions, but I am not sure whether they are correct.

For example, if 5000 rsync clients simultaneously rsync a 1 GB file in which only 50 bytes have changed, the rsync server ends up running the checksum computation 5000 times, which can overload the rsyncd server, even though the data actually synchronized to each client is only about 50 bytes.
Likewise, if 5000 zsync clients simultaneously zsync a 1 GB file in which only 50 bytes have changed, the differences are effectively worked out in advance and only the differing parts are transferred, so the HTTP server behind zsync is not overloaded: zsync adds a metadata file that already contains the information needed to work out the differences, and the data synchronized to each zsync client is again only about 50 bytes.
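
To make the comparison concrete, here is a rough sketch (hypothetical host name, module and file names): with rsync the daemon has to run the delta algorithm over the 1 GB file once per connecting client, whereas with zsync the expensive checksumming is done once by zsyncmake when the file is published, and afterwards the clients only talk to an ordinary web server.

    # rsync: each of the 5000 clients makes the server re-scan and checksum the 1 GB file
    rsync -av rsync://mirror.example.com/pub/big.img .

    # zsync: the server-side work happens once, whenever big.img changes
    zsyncmake -u http://mirror.example.com/big.img big.img    # writes big.img.zsync
    # from then on the web server only answers ordinary GET / Range requests for two static files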


I hope someone with experience in this area can offer a few pointers.



Appendix: some simple zsync usage examples:

  zsyncmake -C -u http://ftp.uk.debian.org/debian/dists/sarge/main/binary-i386/Packages.gz Packages.gz
       Note use of -C to save the client compressing the file on receipt; the Debian package system uses the file uncompressed.

  zsyncmake -z my-subversion-dump
       In this case there is a large, compressible file to transfer. This creates a gzipped version of the file (optimised for zsync), and a .zsync file. A URL is automatically added assuming that the two files will be served from the same directory on the web server.

  zsyncmake -e -u http://www.mirrorservice.org/sites/ftp.freebsd.org/pub/FreeBSD/ports/distfiles/zsync-0.2.2.tar.gz zsync-0.2.2.tar.gz
       This creates a zsync file referring to the named source tarball, which the client should download from the given URL. This example is for downloading a source tarball for a FreeBSD port, hence -e is specified so the client will be able to match its md5sum.
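
For completeness, the client side of the last example might look roughly like this (a sketch; the URL hosting the .zsync file and the old local tarball used as the seed are made up; with -i, zsync reuses blocks from the old copy and fetches only the changed blocks over HTTP):

    # download the new tarball, seeding from an older version already on disk
    zsync -i zsync-0.2.1.tar.gz http://example.org/zsync-0.2.2.tar.gz.zsync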


zsyncmake man page:
http://www.helplinux.cn/man/1/zsyncmake.html
zsync man page:
http://www.helplinux.cn/man/1/zsync.html


References (I had translated several paragraphs, but the translation was lost when saving, sadly).
Reading the following overview paragraphs should confirm that the conclusions above are correct.
      HTTP already provides the Range header for transferring partial content of files. This is useful only if you are able to determine from some other source of information which are the changed sections. If you know that a file is a log and will only ever grow — existing content will not change — then Range is an effective tool. But it does not solve the problem by itself.
      There are alternative download technologies like BitTorrent, which break up the desired file into blocks, and retrieve these blocks from a range of sources [BitT2003]. As BitTorrent provides checksums on fragments of file content, these could be used to identify content that is already known to the client (and it is used for this, to resume partial downloads, I believe). But reusing data from older files is not a purpose of this data in BitTorrent — only if exactly matching blocks could be identified would the data be any use.

     The best existing solution from the point of view of minimising data transfer is rsync. rsync uses a rolling checksum algorithm that allows the checksum over a given block length at all points in a file to be calculated efficiently. Generally speaking, a checksum would have to be run at every possible start point to achieve this — the algorithm used in rsync (see [Rsync1998]) allows the checksum window to be rolled forward over the file and the checksum for each new location to be trivially derived from the previous checksum and the values at the window edges. So rsync can calculate the checksum at all points in the input file by streaming through the file data just once. While doing so it compares each calculated checksum against the list of checksums for the existing data file, and spots any chunks from the old data file which can be reused.

      So rsync achieves a high level of data reuse. It comes at a high computational cost, however. The current rsync implementation calculates the checksums for a set of blocks on the client, then uploads these to the server; the server then uses the rsync algorithm to work out which blocks the client has and which it needs, and pushes back the blocks it needs. But this approach suffers many drawbacks:


         1. The server must reparse the data each time. It cannot save the computed checksums. This is because the client sends just the checksums for disjoint blocks of data from its pool of known data. The server must calculate the checksum at all offsets, not just at the block boundaries. The client cannot send the checksum at all points, because this would be four times larger than the data file itself — and the server does not want to pre-compute the checksums at all points, because again it would be four times larger, and require four times as much disk activity, as reading the original data file. So CPU requirements on the server are high. Also the server must read the entire file, even if the final answer is that the client requires only a small fragment updated. (Poster's note: a CPU-intensive workload.)


    2. Memory requirements for the server are high - it must store a hash table or equivalent structure of all the checksums received from the client while parsing its own data. (Poster's note: a memory-intensive workload.)

    3. The server must receive and act on a large volume of data from the client, storing it in memory, parsing data, etc — so there is the opportunity for denial of service attacks and security holes. In practice rsync has had a remarkably good security record: there have been a few vulnerabilities in the past few years (although at least one of these was actually a zlib bug, if I remember rightly). (Poster's note: there is some DoS and security risk.)



    The drawbacks with rsync have prevented it being deployed widely to distribute files to the general public. Instead, it has been used in areas closer to the existing use of cvs and sup, where a limited community of users use an rsync server to pull daily software snapshots. rsync is also very widely used inside organisations for efficient transfer of files between private systems, using rcp or scp as a tunnel. rsync also has very powerful functionality parallelling cp -a and tar's abilities, with transfers of file permissions, directory trees, special files, etc. But public releases are rarely made with rsync, as far as I can tell.

     I should also mention rproxy. While I have not used it myself, it is an attempt to integrate the rsync algorithm into the HTTP protocol [RProxy]. An rproxy-enabled client transmits the rsync checksums of blocks of data it already has to the server as part of the HTTP request; the server calculates the rolling checksum over the page it would have transmitted, and transmits only the blocks and the meta-information needed for the client to construct the full page. It has the advantage of integrating with the existing protocol and working even for dynamic pages. But it will, I suppose, suffer the same disk and CPU load problems as rsync on large files, and is an unwelcome overhead on the server even for small files. Since server administrators are rarely as concerned about bandwidth and download time as the client, it is hard to see them wanting to put extra work on their servers by offering either rsync or rproxy generally.

     CVS and subversion provide specialised server programs and protocols for calculating diffs on a per-client basis. They have the advantage of efficiency once again, by constructing exactly the diff the client needs — but lose on complexity, because the server must calculate on a per-client basis, and the relatively complicated server processing client requests increases the risk of security vulnerabilities. CVS is also poor at handling binary data, although subversion does do better in this area. But one would hardly distribute ISO images over either of these systems.
Hybrid protocols have been designed, which incorporate ideas from several of the systems above. For instance, CVSup [CVSup1999] uses CVS and deltas for version-controlled files, and the rsync algorithm for files outside of version control. While it offers significantly better performance than either rsync or CVS, due to efficient pipelining of requests for multiple files, it does not fundamentally improve on either, so the discussion above — in particular the specialised server and high server processing cost per client — apply.


Tentative conclusion:
1. With rsync+inotify, once the number of clients grows large, the server's CPU and memory cannot keep up.
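
For reference, the rsync+inotify setup under discussion usually amounts to a small watch loop like the following (a minimal sketch using inotifywait from inotify-tools; the watched path, destination host and module name are made up, and with many destination nodes each event triggers one such rsync per node):

    #!/bin/sh
    # push changes to a mirror as soon as inotify reports them;
    # note that every event still triggers a full rsync pass over the watched tree
    inotifywait -mrq -e modify,create,delete,move /data/www |
    while read path events file; do
        rsync -az --delete /data/www/ rsync://mirror.example.com/www/
    done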


Third-party reference:
http://www.gaojinbo.com/rsync%E7%9A%84%E5%87%A0%E7%A7%8D%E4%BC%98%E5%8C%96%E5%BA%94%E7%94%A8%E6%96%B9%E6%A1%88.html (Gao's article really is written from hands-on experience.)

#2 | Posted 2010-03-17 20:42
I have never used zsync, heh.

#3 | Posted 2010-03-18 00:04
Marking my place; I will come back and read this in detail.

#4 | Posted 2010-03-18 17:03
Quoting the tentative conclusion:
1. With rsync+inotify, once the number of clients grows large, CPU and memory cannot keep up.

How many clients counts as "large"? Anything will stop keeping up once there are enough of them.

#5 | Posted 2010-03-18 17:12
This post was last edited by shinelian on 2010-03-18 17:14

Tentative conclusion:
If the two were used in the same scenario, rsync is a CPU- and memory-sensitive algorithm while zsync is not, so rsync should be the first to hit the CPU and memory bottleneck.
The exact threshold depends on many factors, I think, and needs to be measured in the concrete scenario.
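
One crude way to see where the bottleneck shows up in a given setup (a sketch; the module and file name are made up): time a single pull with GNU time to get CPU time and peak memory on the client side, and watch the per-connection rsync processes on the server with top while several clients pull at once.

    # CPU time and peak resident memory of the local rsync for one pull
    /usr/bin/time -v rsync -a rsync://mirror.example.com/pub/big.img .

    # on the server: watch the per-connection rsync processes while N clients connect
    top -c -d 1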

#6 | Posted 2010-04-26 11:44
rsync 3.0.7 is already quite good; I am continuing to follow this.

#7 | Posted 2010-05-29 12:33
I also only just read up yesterday on the rsync 3.0.7 mentioned above; it has some attractive new features.

#8 | Posted 2010-05-29 17:37
I have not used rsync 3.0.7 yet; hoping for more material on it.

#9 | Posted 2010-06-08 11:09
Does zsync support HTTPS?

#10 | Posted 2012-04-28 20:13
shinelian wrote on 2010-03-17 20:09:
Quote: http://everythinglinux.org/rsync/
rsync
Diffs - Only actual changed pieces of files are t ...


Good to learn from. I recently used this to download an Ubuntu image; it works well and saves bandwidth.