浩存 (Haocun) - a cluster storage system for massive data such as databases and virtual machines, providing simultaneous NFS/iSCSI access (ChinaUnix forum thread, started by yftty)

#101 | Posted 2005-06-22 10:50

A distributed file system project for massive storage under Unix (mail, search, network drives, and similar workloads)

http://zgp.org/linux-elitists/20040101205016.E5998@shaitan.lightconsulting.com.html
3. Elastic Quota File System (EQFS) Proposal
23 Jun 2004 - 30 Jun 2004 (46 posts) Archive Link: "Elastic Quota File System
(EQFS)"
People: Amit Gud, Olaf Dabrunz, Mark Cooke

Amit Gud said:

    Recently I'm into developing an Elastic Quota File System (EQFS). This file
    system works on a simple concept ... give it to others if you're not using
    it, let others use it, but on the guarantee that you get it back when you
    need it!!

    Here I'm talking about disk quotas. In any typical network, e.g.
    SourceForge, each user is given a fixed quota: 100 MB in the case of
    SourceForge. 100 MB is far more than some projects need and too little
    for others. EQFS tries to solve this problem by exploiting the users'
    usage behavior at runtime: quota that a user doesn't need is given to
    the users who do need it, with the full assurance that the original
    user can reclaim his/her quota at any time.

    Before getting into implementation details I want to get public opinion
    about this system. All EQFS tries to do is maximize disk space usage
    that would otherwise be wasted when a user doesn't really need his
    allocated quota; on the other hand, it helps avoid starving the user
    who needs more space. It also frees the administrator from the problem
    of variable quota needs, since EQFS adjusts itself to what the users
    need.

Mark Watts asked how it would be possible to "guarantee" that the user would
get the space back when they wanted it. Amit expanded:

    Ok, this is what I propose:

    Let's say there are just 2 users with 100 MB of individual quota: user A
    is using 20 MB and user B is running out of his quota. Now what B could
    do is delete some of his files to make free space for storing other
    files. What I propose is that instead of deleting the files, he declares
    those files elastic.

    Now, the moment he makes those files elastic, that much space is added
    to his quota. Here Mark Cooke's equation applies, with some
    modifications:

        N  = number of users
        Qi = allocated quota of the i-th user
        Ui = individual disk usage of the i-th user (should be <= Qi)
        D  = disk threshold, the amount of disk space the admin wants to
             let the users consume in total (should be >= the sum of all
             users' allocated quotas, i.e. the sum of Qi for i = 0 to N-1)

    The total usage of all the users (here A and B) should at _any time_ be
    at most D, i.e. the sum of Ui <= D for i = 0 to N-1.

    The point to note here is that we don't care how much quota the admin
    has allocated to an individual user; we are more interested in the
    usage patterns of the users. E.g. if user B wants, say, 25 MB of
    additional space, he picks 25 MB of his files and 'marks' them elastic.
    His quota is now increased to 125 MB and he can add another 25 MB of
    files, while user A's allocated quota is left unaffected. Applying the
    above equation, the total usage is now A: 20 MB, B: 125 MB, i.e.
    145 MB <= D (say 200 MB). This is fine for the system, since the usage
    is within bounds.

    Now what happens if the sum of Ui exceeds D? This can happen when user
    A tries to reclaim his space: if user A adds, say, 70 MB more of files,
    the total usage becomes A: 90 MB, B: 125 MB, i.e. 215 MB > D. The
    moment the total usage crosses D, 'action' will be taken on the elastic
    files. Here the elastic files belong to user B, so only those will be
    affected and user A's data will be untouched; in that sense this is
    completely transparent to user A. Which action should be taken can be
    specified by the user when making the files elastic: he can opt to
    delete the file, compress it, or move it to some place (backup) where
    he knows he has write access. The corresponding action is taken until
    the usage is back under the threshold.

    Will this work? We are relying on the 'free' space (i.e. D minus the
    total usage) for the users to benefit. The chance of that free space
    being large increases with the number of users N. Here we are talking
    about 2 users, but think of 10000+ users, who will probably never all
    use up _all_ of their allocated disk space. This user behavior can be
    exploited well.

    EQFS fits mail servers best. Here, e.g., I make my whole linux-kernel
    mailing list folder elastic. As long as the total usage stays <= D I
    get to keep all the messages; whenever it exceeds D, the messages with
    the latest dates will be 'acted' upon.

    For variable quota needs, the admin can allocate different quotas to
    different users, but this gets tiresome when N is large. With EQFS he
    can allocate a fixed quota for each user (old and new), set a value for
    D, and relax; the users automatically get the quota they need. One
    might say this could be done just by setting a value for D, checking it
    against the sum of Ui, and not allocating individual quotas at all. But
    when the sum of Ui crosses D, whose files do you act on? Moreover, with
    both individual quotas and D, we give users 'controlled' flexibility,
    just like an elastic: it can be stretched, but not beyond a certain
    range.

    What happens when a user tries to eat up all the free (D minus total
    usage) space? The answer is implementation-dependent, because you need
    to make a decision: should a user be allowed to make a file elastic
    when the total usage has already reached D? I think by saying 'yes' we
    eliminate some users' mischief of eating up all the free space.

Olaf Dabrunz replied:

      + having files disappear at the discretion of the filesystem seems to be
        bad behaviour: either I need this file, then I do not want it to just
        disappear, or I do not need it, and then I can delete it myself.

        Since my idea of which files I need and which I do not need changes
        over time, I believe it is far better that I can control which files I
        need and which I do not need whenever other constraints (e.g. quota
        filled up) make this decision necessary. Also, then I can opt to try to
        convince someone to increase my quota.

      + moving the file to some other place (backup) does not seem to be a
        viable option:

          o If the backup media is always accessible, then why can't the user
            store the "elastic" files there immediately?

            -> advantages:

              # the user knows where his file is
              # applications that remember the path to a file will be able to
                access it

          o If the backup media will only be accessible after manually
            inserting it into some drive, this amounts to sending an E-Mail
            to the backup admin and then passing a list of backup files to
            the backup software.

            But now getting the file back involves a considerable amount of
            manual and administrative work. And it involves bugging the backup
            admin, who now becomes the bottleneck of your EQFS.

    So this narrows down to the effective handling of backup procedures and the
    effective administration of fixed quotas and centralization of data.

    If you have many users it is also likely that there are more people
    interested in big data-files. So you need to help these people organize
    themselves, e.g. by helping them create mailing-lists and web pages, or
    by letting them install servers that make the data centrally available
    through some interface they can use to select parts of the data.

    I would rather suggest that if the file does not fit within a given quota,
    the user should apply for more quota and give reasons for that.

    I believe that flexible or "elastic" allocation of resources is a good
    idea in general, but it only works if you have cheap and easy ways to
    control both allocation and deallocation. So in the case of CBQ in
    networks this works, since bandwidth can easily and quickly be
    allocated and deallocated.

    But for filesystem space this requires something like a "slower (= less
    expensive), bigger, always accessible" third level of storage in the "RAM,
    disk, ..." hierarchy. And then you would need an easy or even transparent
    way to access files on this third level storage. And you need to make sure
    that, although you obviously *need* the data for something, you still can
    afford to increase retrieval times by several orders of magnitude at the
    discretion of the filesystem.

    But usually all this can be done by scripts as well.

    Still, there is a scenario and a combination of features for such a
    filesystem that IMHO would make it useful:

      + Provide allocation of overquota as you described it.
      + Let the filesystem move (parts of) the "elastic" files to some
        third-level backing-store on an as-needed basis. This provides you with
        a not-so-cheap (but cheaper than manual handling) resource management
        facility.

    Now you can use the third-level storage as a backing store for hard-drive
    space, analogous to what swap-space provides for RAM. And you can "swap
    in" parts of files from there and cache them on the hard drive. So
    "elastic" files are actually files that are "swappable" to backing store.

    This assumes that the "elastic" files meet the requirements for a "working
    set" in a similar fashion as for RAM-based data. I.e. the swap operations
    need only be invoked relatively seldom.

    If this is not the case, your site/customer needs to consider buying more
    hard drive space (and maybe also RAM).

    The tradeoff for the user now is:

      + do not have the big file(s) OR
      + have them and be able to use them in a random-access fashion from any
        application, perhaps with (quite) slow access times, but without
        additional administrative/manual hassle

    Maybe this is a good tradeoff for a significant amount of users. Maybe
    there are sites/customers that have the required backing store (or would
    consider buying into this). I do not know. Find a sponsor, do some field
    research and give it a try.

#102 | Posted 2005-06-22 11:11

I see the word "scalability" in many computer books, and cluster computing
is also said to have this property, but I don't really understand what it
means. Could someone who knows please explain? Thank you.



If your computer or cluster has a bottleneck,
you want to solve the bottleneck.
Scalability is the ability to do that.

Usually there are three types of bottlenecks:
1. CPU - add more CPU power: either add more
compute nodes (horizontal scalability) or
buy a bigger computer (vertical scalability).
2. Network - add more switches, network cards, or buy
Myrinet, etc.
3. I/O - disk I/O bottlenecks can be solved by
better I/O (IDE -> SCSI, for example) or I/O clustering
(http://www.erexi.com.tw/solutions/NFS_fileserve_solution.pdf)
and similar.

Some applications are scalable (Oracle DB: you can
scale horizontally by moving from 9i to 9i RAC), some
are not (the Linux vi editor).

Then you have scalability limits (e.g. "x scales to up to
16 nodes", etc.).

Sean



The official definition according to webopedia (www.webopedia.com)
is:

(1) A popular buzzword that refers to how well a hardware or software system can adapt to increased demands. For example,
a scalable network system would be one that can start with just a few nodes but can easily expand to thousands of nodes.
Scalability can be a very important feature because it means that you can invest in a system with confidence you won't outgrow it.

(2) Refers to anything whose size can be changed.
For example, a font is said to be scalable if it can be represented in different sizes.

(3) When used to describe a computer system, the ability to run more than one processor.


Normally, scalability means that the system can adapt to changes you make
to it, for example supporting more CPUs, etc.
Hope I have made myself clear.

#103 | Posted 2005-06-23 09:32

Allan Fields wrote:
> On Tue, Jun 21, 2005 at 09:35:56AM -0500, Eric Anderson wrote:
>
>> This is something I've brought up before on other lists, but I'm curious
>> if anyone is interested in developing a BSD licensed clustered
>> filesystem for FreeBSD (and anyone else)?
>
>
> A few questions:
>
> Could this be done as a stackable file system (vnode layer distributed
> file system) or did you have something else in mind (i.e. specifically
> a full implementation of a network filesystem including storage
> layer)?

Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
dreaming of.  I currently have about 1000 clients needing access to the
same pools of data (read/write) all the time.  The data changes
constantly.  There is a lot of this data.  We use NFS currently.
FreeBSD is *very* fast and stable at serving NFS data.  The problem is,
that even though it is very fast and stable, I still cannot pump out
enough bits fast enough with one machine, and if that one machine fails
(hardware problems, etc), then all my machines are hung waiting for me
to bring it back online.

So, what I would love to have, is this kind of setup: shared media
storage (fibre channel SAN, iscsi, or something like ggated possibly),
connected up to a cluster of hosts running FreeBSD.  Each FreeBSD server
has access to the logical disks, same partitions, and can mount them all
r/w.  Now, I can kind of do this now, however there are obviously some
issues with this currently.  I want all machines in this cluster to be
able to serve the data via NFS (or http, or anything else for that
matter really - if you can make NFS work, anything will pretty much
work) simultaneously from the same partitions, and see writes
immediately as the other hosts in the cluster commit them.

I currently have a solution just like this for Linux - Polyserve
(http://www.polyserve.com) has a clustered filesystem for linux, that
works very well.  I've even tried to convince them to port it to
FreeBSD, but it falls on deaf ears, so it's time to make our own.


> Why not a port of an existing network filesystem say from Linux?
> (A BSD rewrite could be done, if the code was GPLed.)  Would
> cross-platform capabilities make sense?

That would work fine I'm sure - but I have found some similar threads in
the past that claim it would be just as hard and time consuming to port
one as it would be to create one from scratch.   Cross platform
capabilities would be great, but I'm mostly interested in getting
FreeBSD into this arena (as it will soon be an extremely important one
to be in).


> How do you see this comparing to device-level solutions?  I know
> the argument can be made to implement file systems/storage
> abstractions at multiple layers, but I thought I might ask.

I'm not sure of a device level solution that does this.  I think the OS
has to know to commit the meta-data to a journal, or otherwise let the
other machines know about locking, etc, in order for this to work.


> The other thing is, there is a wealth of filesystem papers out there;
> any in specific caught your eye?

No - can you point me to some?

I'll be honest here - I'm not a code developer.  I would love to learn
some C here, and 'just do it', but filesystems aren't exactly simple, so
I'm looking for a group of people that would love to code up something
amazing like this - I'll support the developers and hopefully learn
something in the process.  My goal personally would be to do anything I
could to make the developers work most productively, and do testing.  I
can probably provide equipment, and a good testbed for it.

Eric

#104 | Posted 2005-06-23 09:33

> Hmm.  I'm not sure if it can or not.  I'll try to explain what I'm
> dreaming of.  I currently have about 1000 clients needing access to the
> same pools of data (read/write) all the time.  The data changes
> constantly.  There is a lot of this data.  We use NFS currently.

Sounds like you want SGI's clustered xfs....

> I'll be honest here - I'm not a code developer.  I would love to learn
> some C here, and 'just do it', but filesystems aren't exactly simple, so
> I'm looking for a group of people that would love to code up something
> amazing like this - I'll support the developers and hopefully learn
> something in the process.  My goal personally would be to do anything I
> could to make the developers work most productively, and do testing.  I
> can probably provide equipment, and a good testbed for it.

If you are not a seasoned programmer in _some_ language, this
will not be easy at all.

One suggestion is to develop an abstract model of what a CFS
is.  Coming up with a clear detailed precise specification is
not an easy task either but it has to be done and if you can
do it, it will be immensely helpful all around.  You will
truly understand what you are doing, you have a basis for
evaluating design choices, you will have made choices before
writing any code, you can write test cases, writing code is
far easier etc.  etc.  Google for clustered filesystems.
The citeseer site has some papers as well.

A couple FS specific suggestions:
- perhaps clustering can be built on top of existing
  filesystems.  Each machine's local filesystem is considered
  a cache and you use some sort of cache coherency protocol
  (a rough sketch follows below).  That way you don't have to
  deal with filesystem allocation and layout issues.

- a network wide stable storage `disk' may be easier to do
  given GEOM.  There are at least N copies of each data block.
  Data may be cached locally at any site but writing data is
  done as a distributed transaction.  So again cache
  coherency is needed.  A network RAID if you will!
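As a rough, single-process illustration in C of the first suggestion (local filesystems acting as caches plus a coherency protocol), here is a toy write-invalidate scheme: each node's local copy of a shared block is a cache, and a writer invalidates every other node's copy before modifying it. All names are hypothetical; this is not an existing FreeBSD or GEOM interface.

    /* Toy simulation of write-invalidate cache coherency: each node's
     * local filesystem caches a shared block, and a write invalidates
     * every other node's copy first.  Hypothetical names only. */
    #include <stdio.h>

    #define NODES 3
    enum state { INVALID, SHARED, EXCLUSIVE };
    static enum state cache[NODES];        /* state of one block, per node */

    static void node_read(int n)
    {
        if (cache[n] == INVALID) {
            printf("node %d: fetch block from current owner/storage\n", n);
            cache[n] = SHARED;
        }
    }

    static void node_write(int n)
    {
        /* the "lock service" grants exclusivity by invalidating the rest */
        for (int i = 0; i < NODES; i++)
            if (i != n && cache[i] != INVALID) {
                printf("invalidate copy cached on node %d\n", i);
                cache[i] = INVALID;
            }
        cache[n] = EXCLUSIVE;
        printf("node %d: write block locally\n", n);
    }

    int main(void)
    {
        node_read(0);
        node_read(1);
        node_write(2);   /* forces nodes 0 and 1 to re-fetch on next read */
        node_read(0);
        return 0;
    }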

But again, let me stress that one must have a clear *model*
of the problem being solved.  Getting distributed programs
right is very hard even at an abstract model level.
Debugging a distributed program that doesn't have a clear
model is, well, for masochists (nothing against them -- I
bet even they'd rather get their pain some other way).

#105 | Posted 2005-06-23 09:42

Originally posted by "yftty":
    I see the word "scalability" in many computer books, and cluster
    computing is also said to have this property, but I don't really
    understand what it means. Could someone who knows please explain?
    Thank you.


Scalability
    the ease with which a system or component can be modified to fit the problem area.
That is the SEI definition:
http://www.sei.cmu.edu/str/indexes/glossary/scalability.html

#106 | Posted 2005-06-26 11:41

http://www.onlamp.com/pub/a/onlamp/2005/06/23/whatdevswant.html

"Irrespective of the language programmers choose for expressing solutions, their wants and needs are similar. They need to be productive and efficient, with technologies that do not get in the way but rather help them produce high-quality software. In this article, we share our top ten list of programmers' common wants and needs."

#107 | Posted 2005-06-27 18:52

http://bbs.chinaunix.net/forum/viewtopic.php?t=568208&show_type=

In our work, however much salary and stock we take home, that is still just
carrying water; we forget to use the hours after work to dig a well of our
own and build up our abilities in another area. Later, when you are older
and can no longer out-work the young on stamina alone, you will still have
water to drink, and you will be able to drink it at your leisure.

#108 | Posted 2005-06-28 12:04

Amenable to extensive parallelization, Google’s web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google’s architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.

Luiz André Barroso
Jeffrey Dean
Urs Hölzle


Few Web services require as much computation per request as search engines. On average, a single query on Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PCs with fault-tolerant software creates a solution that is more cost-effective than a comparable system built out of a smaller number of high-end servers.

Here we present the architecture of the Google cluster, and discuss the most important factors that influence its design: energy efficiency and price-performance ratio. Energy efficiency is key at our scale of operation, as power consumption and cooling issues become significant operational factors, taxing the limits of available data center power densities.

Our application affords easy parallelization:
Different queries can run on different processors, and the overall index is partitioned so that a single query can use multiple processors. Consequently, peak processor performance is less important than its price/performance. As such, Google is an example of a throughput-oriented workload, and should benefit from processor architectures that offer more on-chip parallelism, such as simultaneous multithreading or on-chip multiprocessors.

Google architecture overview

Google’s software architecture arises from two basic insights. First, we provide reliability in software rather than in server-class hardware, so we can use commodity PCs to build a high-end computing cluster at a low-end price. Second, we tailor the design for best aggregate request throughput, not peak server response time, since we can manage response times by parallelizing individual requests.

We believe that the best price/performance tradeoff for our applications comes from fashioning a reliable computing infrastructure from clusters of unreliable commodity PCs. We provide reliability in our environment at the software level, by replicating services across many different machines and automatically detecting and handling failures. This software-based reliability encompasses many different areas and involves all parts of our system design. Examining the control flow in handling a query provides insight into the high-level structure of the query-serving system, as well as insight into reliability considerations.

Serving a Google query

When a user enters a query to Google (such as www.google.com/search?q=ieee+society), the user’s browser first performs a domain name system (DNS) lookup to map www.google.com to a particular IP address. To provide sufficient capacity to handle query traffic, our service consists of multiple clusters distributed worldwide. Each cluster has around a few thousand
machines, and the geographically distributed setup protects us against catastrophic data center failures (like those arising from earthquakes and large-scale power failures). A DNS-based load-balancing system selects a cluster by accounting for the user’s geographic proximity to each physical cluster. The load-balancing system minimizes round-trip time for the user’s request, while also considering the available
capacity at the various clusters.

The user’s browser then sends a hypertext transport protocol (HTTP) request to one of these clusters, and thereafter, the processing of that query is entirely local to that cluster. A hardware-based load balancer in each cluster monitors the available set of Google Web servers (GWSs) and performs local load balancing of requests across a set of them. After receiving a query, a GWS machine coordinates the query execution and formats the results into a Hypertext Markup Language
(HTML) response to the user’s browser. Figure 1 illustrates these steps.

Query execution consists of two major phases [1]. In the first phase, the index servers consult an inverted index that maps each query word to a matching list of documents (the hit list). The index servers then determine a set of relevant documents by intersecting the hit lists of the individual query words, and they compute a relevance score for each document. This relevance score determines the order of results on the output page.
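As a toy illustration of that first phase (not Google's code; the hit lists and scoring function are made up), intersecting two sorted per-word hit lists and scoring the surviving docids looks roughly like this in C:

    /* Intersect two sorted per-word hit lists of docids and score the
     * survivors.  Toy data and a stand-in scoring function. */
    #include <stdio.h>

    static int hits_ieee[]    = { 3, 8, 15, 42, 77 };
    static int hits_society[] = { 8, 9, 42, 50 };

    static double score(int docid) { return 1.0 / docid; }  /* placeholder */

    int main(void)
    {
        int i = 0, j = 0;
        int ni = sizeof hits_ieee    / sizeof *hits_ieee;
        int nj = sizeof hits_society / sizeof *hits_society;

        while (i < ni && j < nj) {            /* merge-style intersection */
            if (hits_ieee[i] < hits_society[j])      i++;
            else if (hits_ieee[i] > hits_society[j]) j++;
            else {
                printf("doc %d, score %.3f\n", hits_ieee[i], score(hits_ieee[i]));
                i++; j++;
            }
        }
        return 0;
    }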

The search process is challenging because of the large amount of data: The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data. Fortunately, the search is highly parallelizable by dividing the index into pieces
(index shards), each having a randomly chosen subset of documents from the full index. A pool of machines serves requests for each shard, and the overall index cluster contains one pool for each shard. Each request chooses a machine within a pool using an intermediate load balancer—in other words, each query goes to one
machine (or a subset of machines) assigned to each shard. If a shard’s replica goes down, the load balancer will avoid using it for queries, and other components of our cluster-management system will try to revive it or eventually replace it with another machine. During the downtime, the system capacity is reduced in proportion to the total fraction of capacity that this machine represented. However, service remains uninterrupted, and all parts of the index remain available.

The final result of this first phase of query execution is an ordered list of document identifiers (docids). As Figure 1 shows, the second phase involves taking this list of docids and computing the actual title and uniform resource locator of these documents, along with a query-specific document summary. Document servers (docservers) handle this job, fetching each document from disk to extract the title and the keyword-in-context snippet. As with the index lookup phase, the strategy is to partition the processing of all documents by randomly distributing documents into smaller shards having multiple server replicas responsible for handling each shard, and routing requests through a load balancer. The docserver cluster must have access to an online, low-latency copy of the entire Web. In fact, because of the replication required for performance and availability, Google stores dozens of copies of the Web across its clusters.

In addition to the indexing and document-serving phases, a GWS also initiates several other ancillary tasks upon receiving a query, such as sending the query to a spell-checking system and to an ad-serving system to generate relevant advertisements (if any). When all phases are complete, a GWS generates the appropriate HTML for the output page and returns it to the user’s browser.

Using replication for capacity and fault-tolerance

We have structured our system so that most accesses to the index and other data structures involved in answering a query are read-only: Updates are relatively infrequent, and we can often perform them safely by diverting queries away from a service replica during an update. This principle sidesteps many of the consistency issues that typically arise in using a general-purpose database.

We also aggressively exploit the very large amounts of inherent parallelism in the application: For example, we transform the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. Similarly, we divide the query stream into multiple streams, each handled by a cluster. Adding machines to each
pool increases serving capacity, and adding shards accommodates index growth. By parallelizing the search over many machines, we reduce the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. Because individual shards don’t need to communicate with each
other, the resulting speedup is nearly linear. In other words, the CPU speed of the individual index servers does not directly influence the search’s overall performance, because we can increase the number of shards to accommodate slower CPUs, and vice versa. Consequently, our hardware selection process focuses on machines that offer an excellent request throughput for our application, rather than machines that offer the highest single-thread performance.

In summary, Google clusters follow three key design principles:

* Software reliability. We eschew fault-tolerant hardware features such as redundant power supplies, a redundant array of inexpensive disks (RAID), and high-quality components, instead focusing on tolerating failures in software.

* Use replication for better request throughput and availability. Because machines are inherently unreliable, we replicate each of our internal services across many machines. Because we already replicate services across multiple machines to obtain sufficient capacity, this type of fault tolerance almost comes for free.

* Price/performance beats peak performance. We purchase the CPU generation that currently gives the best performance per unit price, not the CPUs that give the best absolute performance.

Using commodity PCs reduces the cost of computation. As a result, we can afford to use more computational resources per query, employ more expensive techniques in our ranking algorithm, or search a larger index of documents.

Leveraging commodity parts

Google’s racks consist of 40 to 80 x86-based servers mounted on either side of a custom made rack (each side of the rack contains twenty 2u or forty 1u servers). Our focus on price/performance favors servers that resemble mid-range desktop PCs in terms of their components, except for the choice of large disk drives. Several CPU generations are in active service, ranging from single-processor 533MHz Intel-Celeron-based servers to dual 1.4GHz Intel Pentium III servers. Each server contains one or more integrated drive electronics (IDE) drives, each holding 80 Gbytes. Index servers typically have less disk space than document servers because the former have a more CPU-intensive workload. The servers on each side of a rack interconnect via a 100-Mbps Ethernet switch that has one or two gigabit uplinks to a core gigabit switch that connects all racks together.

Our ultimate selection criterion is cost per query, expressed as the sum of capital expense (with depreciation) and operating costs (hosting, system administration, and repairs) divided by performance. Realistically, a server will not last beyond two or three years, because of its disparity in performance when compared to newer machines. Machines older than three years are so much slower than current-generation machines that it is difficult to achieve proper load distribution and configuration in clusters containing both types. Given the relatively short amortization period, the equipment cost figures prominently in the overall cost equation.

Because Google servers are custom made, we’ll use pricing information for comparable PC-based server racks for illustration. For example, in late 2002 a rack of 88 dual-CPU 2-GHz Intel Xeon servers with 2 Gbytes of RAM and an 80-Gbyte hard disk was offered on RackSaver.com for around $278,000. This
figure translates into a monthly capital cost of $7,700 per rack over three years. Personnel and hosting costs are the remaining major contributors to overall cost.
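Spelled out, that monthly figure is simply straight-line depreciation of the quoted rack price over the assumed three-year service life:

    \[ \frac{\$278{,}000}{36\ \text{months}} \approx \$7{,}720\ \text{per rack-month} \]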

The relative importance of equipment cost makes traditional server solutions less appealing for our problem because they increase performance but decrease the price/performance. For example, four-processor motherboards are expensive, and because our application parallelizes very well, such a motherboard doesn’t recoup its additional cost with better performance. Similarly, although SCSI disks are faster and more reliable, they typically cost two or three times as much as an equal-capacity IDE drive.

The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelizable application like ours. The example $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000 [2]. In other words, the multiprocessor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google’s highly redundant architecture does not rely on either of these attributes.

Operating thousands of mid-range PCs instead of a few high-end multiprocessor
servers incurs significant system administration and repair costs. However, for a relatively homogenous application like Google, where most servers run one of very few applications, these costs are manageable. Assuming tools to install and upgrade software on groups of machines are available, the time and cost to maintain 1,000 servers isn’t much more than the cost of maintaining 100 servers because all machines have identical configurations. Similarly, the cost of monitoring a
cluster using a scalable application-monitoring system does not increase greatly with cluster size. Furthermore, we can keep repair costs reasonably low by batching repairs and ensuring that we can easily swap out components with the highest failure rates, such as disks and power supplies.

The power problem

Even without special, high-density packaging, power consumption and cooling issues can become challenging. A mid-range server with dual 1.4-GHz Pentium III processors draws about 90 W of DC power under load: roughly 55 W for the two CPUs, 10 W for a disk drive, and 25 W to power DRAM and the motherboard. With a typical efficiency of about 75 percent for an ATX power supply, this translates into 120 W of AC power per server, or roughly 10 kW per rack. A rack comfortably fits in 25 ft² of space, resulting in a power density of 400 W/ft². With higher-end processors, the power density of a rack can exceed 700 W/ft².
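The density numbers follow directly from the per-server figures, assuming an 80-server rack:

    \[ \frac{90\ \text{W DC}}{0.75} = 120\ \text{W AC per server}, \qquad 80 \times 120\ \text{W} \approx 10\ \text{kW per rack}, \qquad \frac{10{,}000\ \text{W}}{25\ \text{ft}^2} = 400\ \text{W/ft}^2 \]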

Unfortunately, the typical power density for commercial data centers lies between 70 and 150 W/ft², much lower than that required for PC clusters. As a result, even low-tech PC clusters using relatively straightforward packaging need special cooling or additional space to bring down power density to that which is tolerable in typical data centers. Thus, packing even more servers into a rack could be of limited practical use for large-scale deployment as long as such racks reside in standard data centers. This situation leads to the question of whether it is possible to reduce the power usage per server.

Reduced-power servers are attractive for large-scale clusters, but you must keep some caveats in mind. First, reduced power is desirable, but, for our application, it must come without a corresponding performance penalty: What counts is watts per unit of performance, not watts alone. Second, the lower-power server must not be considerably more expensive, because the cost of depreciation typically outweighs the cost of power. The earlier-mentioned 10 kW rack consumes about 10 MWh of power per month (including cooling overhead). Even at a generous 15 cents per kilowatt-hour (half for the actual power, half to amortize uninterruptible power supply [UPS] and power distribution equipment), power and cooling cost only $1,500 per month. Such a cost is small in comparison to the depreciation cost of $7,700 per month. Thus, low-power servers must not be more expensive than regular servers to have an overall cost advantage in our setup.
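As a quick check of those figures (taking a 730-hour month):

    \[ 10\ \text{kW} \times 730\ \text{h} \approx 7.3\ \text{MWh}\ (\approx 10\ \text{MWh including cooling overhead}), \qquad 10{,}000\ \text{kWh} \times \$0.15 \approx \$1{,}500\ \text{per month} \]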

Hardware-level application characteristics

Examining various architectural characteristics of our application helps illustrate which hardware platforms will provide the best price/performance for our query-serving system. We’ll concentrate on the characteristics of the index server, the component of our infrastructure whose price/performance most heavily impacts overall price/performance. The main activity in the index server consists of decoding
compressed information in the inverted index and finding matches against a set of documents that could satisfy a query. Table 1 shows some basic instruction-level measurements of the index server program running on a 1-GHz dual-processor Pentium III system.

The application has a moderately high CPI, considering that the Pentium III is capable of issuing three instructions per cycle. We expect such behavior, considering that the application traverses dynamic data structures and that control flow is data dependent, creating a significant number of difficult-to-predict branches. In fact, the same workload running on the newer Pentium 4 processor exhibits nearly twice the CPI and approximately the same branch prediction performance, even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic. In essence, there isn’t that much exploitable instruction-level parallelism (ILP) in the workload. Our measurements suggest that the level of aggressive out-of-order, speculative execution present in modern processors is already beyond the point of diminishing performance returns for such programs.

A more profitable way to exploit parallelism for applications such as the index server is to leverage the trivially parallelizable computation. Processing each query shares mostly read-only data with the rest of the system, and constitutes a work unit that requires little communication. We already take advantage of that at the cluster level by deploying large numbers of inexpensive nodes, rather than fewer high-end ones. Exploiting such abundant thread-level parallelism at the microarchitecture level appears equally promising. Both simultaneous multithreading (SMT) and chip multiprocessor (CMP) architectures target thread-level parallelism and should improve the performance of many of our servers. Some early experiments with a dual-context (SMT) Intel Xeon processor show more than a 30 percent performance improvement over a single-context setup. This speedup is at the upper bound of improvements reported by Intel for their SMT implementation.

We believe that the potential for CMP systems is even greater. CMP designs, such as Hydra [4] and Piranha [5], seem especially promising. In these designs, multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core. The penalties of in-order execution should be minor given how little ILP our application yields, and shorter pipelines would reduce or eliminate branch mispredict penalties. The available thread-level parallelism should allow near-linear speedup with the number of cores, and a shared L2 cache of reasonable size would speed up interprocessor communication.

Memory system

Table 1 also outlines the main memory system performance parameters. We observe
good performance for the instruction cache and instruction translation look-aside buffer, a result of the relatively small inner-loop code size. Index data blocks have no temporal locality, due to the sheer size of the index data and the unpredictability in access patterns for the index’s data block. However, accesses within an index data block do benefit from spatial locality, which hardware prefetching (or possibly larger cache lines) can exploit. The net effect is good overall cache hit ratios, even for relatively modest cache sizes.

Memory bandwidth does not appear to be a bottleneck. We estimate the memory bus utilization of a Pentium-class processor system to be well under 20 percent. This is mainly due to the amount of computation required (on average) for every cache line of index data brought into the processor caches, and to the data-dependent nature of the data fetch stream. In many ways, the index server’s memory system behavior resembles the behavior reported for the Transaction Processing Performance Council’s benchmark D (TPC-D) [6]. For such workloads, a memory system with a relatively modest-sized L2 cache, short L2 cache and memory latencies, and longer (perhaps 128 byte) cache lines is likely to be the most effective.

Large-scale multiprocessing

As mentioned earlier, our infrastructure consists of a massively large cluster of inexpensive desktop-class machines, as opposed to a smaller number of large-scale shared-memory machines. Large shared-memory machines are most useful when the computation-to-communication ratio is low; communication patterns or data partitioning are dynamic or hard to predict; or when total cost of ownership dwarfs hardware costs (due to management overhead and software licensing prices). In those situations they justify their high price tags.

At Google, none of these requirements apply, because we partition index data and
computation to minimize communication and evenly balance the load across servers. We also produce all our software in-house, and minimize system management overhead through extensive automation and monitoring, which makes hardware costs a significant fraction of the total system operating expenses. Moreover, large-scale shared-memory machines still do not handle individual hardware component or software failures gracefully, with most fault types causing a full system crash. By deploying many small multiprocessors, we contain the effect of faults to smaller pieces of the system. Overall, a cluster solution fits the performance and availability
requirements of our service at significantly lower costs.

At first sight, it might appear that there are few applications that share Google’s characteristics, because there are few services that require many thousands of servers and petabytes of storage. However, many applications share the essential traits that allow for a PC-based cluster architecture. As long as the application is oriented toward price/performance and can run on servers that have no private state (so servers can be replicated), it might benefit from using a similar architecture. Common examples include high-volume Web servers or application servers that are computationally intensive but essentially stateless. All of these applications have plenty of request-level parallelism, a characteristic exploitable by running individual requests on separate servers. In fact, larger Web sites already commonly use such architectures.

At Google’s scale, some limits of massive server parallelism do become apparent, such as the limited cooling capacity of commercial data centers and the less-than-optimal fit of current CPUs for throughput-oriented applications. Nevertheless, using inexpensive PCs to handle Google’s large-scale computations
has drastically increased the amount of computation we can afford to spend per query, thus helping to improve the Internet search experience of tens of millions of users.

Acknowledgments

Over the years, many others have made contributions to Google’s hardware architecture that are at least as significant as ours. In particular, we acknowledge the work of Gerald Aigner, Ross Biro, Bogdan Cocosel, and Larry Page.

References

1. S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Proc. Seventh World Wide Web Conf. (WWW7), International World Wide Web Conference Committee (IW3C2), 1998, pp. 107-117.

2. “TPC Benchmark C Full Disclosure Report for IBM eserver xSeries 440 using Microsoft SQL Server 2000 Enterprise Edition and Microsoft Windows .NET Datacenter Server 2003, TPC-C Version 5.0,” http://www.tpc.org/results/FDR/TPCC/ibm.x4408way.c5.fdr.02110801.pdf.

3. D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture: A Hypertext History,” Intel Technology J., vol. 6, issue 1, Feb. 2002.

4. L. Hammond, B. Nayfeh, and K. Olukotun, “A Single-Chip Multiprocessor,” Computer, vol. 30, no. 9, Sept. 1997, pp. 79-85.

5. L.A. Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” Proc. 27th ACM Int’l Symp. Computer Architecture, ACM Press, 2000, pp. 282-293.

6. L.A. Barroso, K. Gharachorloo, and E. Bugnion, “Memory System Characterization of Commercial Workloads,” Proc. 25th ACM Int’l Symp. Computer Architecture, ACM Press, 1998, pp. 3-14.

Luiz André Barroso is a member of the Systems Lab at Google, where he has focused on improving the efficiency of Google’s Web search and on Google’s hardware architecture. Barroso has a BS and an MS in electrical engineering from Pontifícia Universidade Católica, Brazil, and a PhD in computer engineering from the University of Southern California. He is a member of the ACM.

Jeffrey Dean is a distinguished engineer in the Systems Lab at Google and has worked on the crawling, indexing, and query serving systems, with a focus on scalability and improving relevance. Dean has a BS in computer science and economics from the University of Minnesota and a PhD in computer science from the University of Washington. He is a member of the ACM.

Urs Hölzle is a Google Fellow and in his previous role as vice president of engineering was
responsible for managing the development
and operation of the Google search engine
during its first two years. Hölzle has a diploma from the Eidgenössische Technische
Hochschule Zürich and a PhD from Stanford
University, both in computer science. He is a
member of IEEE and the ACM.

Direct questions and comments about this
article to Urs Hölzle, 2400 Bayshore Parkway,
Mountain View, CA 94043; urs@google.com.

For further information on this or any other
computing topic, visit our Digital Library at
http://computer.org/publications/dlib.

#109 | Posted 2005-06-30 17:54

http://www-128.ibm.com/developerworks/opensource/library/os-openafs/

Next-generation NFS-like file system might be the answer to data headaches

Level: Introductory

Frank Pohlmann (frank@linuxuser.co.uk)
U.K. Technical Editor, Linuxuser and Developer
17 May 2005

    Distributed file systems haven't had much press lately because it's mostly corporate and educational networks that use them, adding up to only thousands of users. Conceptually, it isn't always clear how such systems fit into the open source file system puzzle. The Open Andrew File System (OpenAFS) is a mature alternative to the Network File System (NFS): it not only scales to large numbers of users but also relieves much of the management pain.

Users understand the concept of a file system in two ways. The first is a way to organize files, the directories that contain them, and the partitions holding a directory structure. And second, a file system is the way in which files are organized and mapped to the raw metal. Naturally, further layers exist in between, like the virtual file system (VFS) layer and the actual memory management routines, but regarding managing structured information accessible to users, it makes sense for power users to peer into file system internals and get just a sulfurous whiff of the kernel's infernal recesses.

The metal might consist of RAM or hard disks, but in either case, file system data structures organize the sectors and bytes formatted by the hardware manufacturer. Although rather crude, users can sustain this conceptual split fairly comfortably in their working lives. Tools are available that increase, for example, the speed with which users can access files greater than a certain size. Tools are also available to help reorganize directories and files, but these tools keep us safe from bits, bytes, and sectors.

File system metaconcepts
A classic case of this conceptual distinction is the way that FreeBSD -- harking back to the BSD UNIX® world -- uses UNIX File System V2 (UFS2) to organize data on the disk and the Fast File System (FFS) to organize files into directories and optimize directory access. Linux® systems work a bit differently because Linux permits much more than just one or two file systems natively. Thus, the VFS layer makes it possible for Linux users to add new file system support without worrying too much about the way in which Linux manages memory.

When I talk about further distinctions like static and journal file systems, I'm emphasizing the consistency and, to some extent, security of file system contents. Again, in terms that the BSD UNIX world used to view things, static and journal file systems relate to the way in which the UNIX File System (UFS) organizes and secures files. Although Linux file systems have encompassed journal file systems since the Journal File System (JFS), the next-generation file system (XFS), and the early ReiserFS were made available, another area in which neither technical journalism nor corporate publicity sheds much light is distributed file systems.

What we learned from NFS
This state of affairs is related to the fact that today, it would be judged imprudent to make networkwide file system layers available via TCP or User Datagram Protocol (UDP) to a large number of users. Horror scenarios surrounding pre-V3 NFS put off many administrators managing networks with less than a few dozen users. In addition, the appearance of multiple-processor architectures supported by extremely fast motherboard architectures seems to make distributed file system issues a lesser priority. Speed seems guaranteed by hardware, rather than by intelligently implemented distributed systems. Given that distributed file systems tend to rely on underlying file system implementations -- for example, the existing ext2, ext3, and ReiserFS file system drivers -- distributed file systems appear to be confined to the realms of large university networks and the occasional scientific or corporate network.

So, are distributed file systems a third layer on top of the two we have mentioned? One large issue in modern networking is getting heterogeneous networks to cooperate. (Samba is a prominent example.) But you need to understand that today, we have three major players in the file system puzzle: the group of Microsoft® Windows® file systems (FAT16, FAT32, and NTFS file system); Apple Mac OS X (HFS+); and native Linux journal file systems (mostly ReiserFS and ext3). Samba helps get Windows and Linux file systems to cooperate, but it is not meant to make access to files on all major file systems uniformly quick and easy to administer.

One could cite NFS V4 as an attempt to resolve this problem, but given that Request for Comments (RFC) 3530 dealing with NFS V4 is only two years old and NFS4 for kernel V2.6 is fairly new, I'd hesitate to recommend it for production servers. Fedora cores 2 and 3 provide NFS4 patches and NFS4 utilities that demonstrate the rather impressive progress developers have made since NFS forced suffering network administrators to open more ports and configure separate clients for each namespace exported to nervous users. RFC 3530 addresses most security concerns. Still, NFS directories have to be mounted individually. You can make things secure using unified sign-ons and Kerberos, but it all needs work.

OpenAFS rationale
OpenAFS tries to take the pain out of installing and administering software that makes differing file systems cooperate. OpenAFS also works to make differing file systems cooperate efficiently. Although the original metaphor for UNIX and its fascinating successor, Plan 9, was the file, commercial realities dictated that rather than rearchitect modern networked file systems completely, another distributed file system layer had to be added.

Carnegie Mellon University programmers developed AFS in 1983. Soon after, the university set up a company called Transarc to sell services based on AFS. IBM acquired Transarc in 1998 and made AFS available as an open source product under the name OpenAFS. The saga does not end there, however, because OpenAFS spawned other distributed file systems like Coda and Arla, which I cover later. Clients exist for all major operating systems, and documentation is plentiful, if somewhat dated. Gentoo.org made a special effort for OpenAFS to be accessible to Linux users, even though other organizations still seem to refer to NFS when they need distributed file systems.

OpenAFS architecture
OpenAFS is organized around a group of file servers, known as a cell. Each server's identity is usually hidden under the file system itself. Users logging in from an AFS client would not be able to tell which server they were working on because from the users' point of view, they would work on a single system with recognizable UNIX file system semantics. File system content is usually replicated across the cell so that failure of one hard disk would not impair working at the OpenAFS client. OpenAFS requires large client-caching facilities of up to 1 GB to enable accessing frequently used files. It also works as a fully secure Kerberos-based system that uses access control lists (ACLs) to make fine-grained access possible that is not based on the usual Linux and UNIX security models.

Except for the cache manager, which happens to be part of OpenAFS -- curiously only running with ext2 as an underlying file system -- the basic superficial structure of OpenAFS resembles modern NFS implementations. The basic architectures do not look alike at all, though, and you must view any parallels with a large dose of skepticism. For those of us who still prefer to use NFS, but would like to take advantage of OpenAFS facilities, it is possible to use a so-called NFS/AFS translator. As long as an OpenAFS client machine is configured as an NFS server machine, you should be able to enjoy the advantages of both file systems.

How OpenAFS manages its world
NFS is location-dependent, mapping local directories to remote file system locations. OpenAFS hides file locations from users. Because all source files are likely to be saved in read-write copies at various replicated file server locations, you must keep the replicated copies in sync. You do so through a technology known as Ubik, a play on the word ubiquitous and in Eastern European spelling. Ubik processes keep the files, directories, and volumes on the AFS file system in sync, but usually systems with more than three file server processes running benefit the most. A system administrator can group several AFS cells -- the old AFS abbreviation has been retained within OpenAFS file system semantics -- to an AFS site. The administrator would decide on the amount of AFS cells and the extent to which the cells can make storage and files available to other AFS cells within the site.

Partitions, volumes, and directories
AFS administrators divide cells into so-called volumes. Although volumes can be co-extensive with hard-disk partitions, most administrators would not fill a complete partition with a single volume. AFS volumes are actually managed by a separate UNIX-type process called the Volume Manager. You can mount a volume in a manner familiar from a UNIX file system directory. However, you can move an AFS volume from file server to file server -- again, a UNIX-type process -- but a UNIX directory cannot be physically moved from partition to partition. AFS automatically tracks the location of volumes and directories via the Volume Location Manager and keeps track of replicated volumes and files. Therefore, the user never needs to worry whenever a file server ceases operation unexpectedly because AFS would just switch the user to a replicated volume on a different file server machine without the user likely noticing.

Users never work on files located on AFS servers. They work on files that have been fetched from file servers by the client-side cache managers. The Cache Manager is a rather interesting beast that lives in the client's operating system kernel. In the case of Linux, a patch would be added to the kernel. (You can run the Cache Manager on any kernel from 2.4 onward.)

Cache Manager
The Cache Manager can respond to requests from a local application to fetch a file from across the AFS file system. Of course, if the file is a source file you change often, it might not be ideal that the file is likely to exist in several replicated versions. Because users are likely to change an often-requested source file frequently, you have two sets of problems: First, the file is likely to be kept in the client cache, as well as on several replicated volumes on several file server machines; and second, the Cache Manager has to update all volumes. The file server process sends the file to the client cache with a callback attached to it so that the system can deal with any changes happening somewhere else. If a user adds changes to a replicated file cached somewhere else, the original file server will activate the callback and remind the original cached version that it needs to be updated.
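The callback idea can be pictured with a toy single-process model in C; the names are hypothetical and this is not the actual OpenAFS code. The server remembers a callback promise for every cached copy it hands out and breaks the promise when the file changes, so a Cache Manager knows its copy is stale before it uses it again.

    /* Toy model of AFS-style callbacks: the server hands out a callback
     * promise with each file and breaks it when the file changes, so
     * clients know their cached copy is stale.  Hypothetical names. */
    #include <stdio.h>

    #define NCLIENTS 2

    struct cached_copy {
        int valid;      /* callback promise still held?           */
        int version;    /* version of the file in the local cache */
    };

    static int server_version = 1;
    static struct cached_copy clients[NCLIENTS];

    static void client_fetch(int c)
    {
        clients[c].version = server_version;
        clients[c].valid = 1;       /* server registers a callback for c */
        printf("client %d fetched v%d with a callback\n", c, server_version);
    }

    static void client_store(int c)
    {
        server_version++;
        clients[c].version = server_version;
        /* the server breaks callbacks held by every other client */
        for (int i = 0; i < NCLIENTS; i++)
            if (i != c && clients[i].valid) {
                clients[i].valid = 0;
                printf("callback broken for client %d\n", i);
            }
    }

    static void client_open(int c)
    {
        if (!clients[c].valid)
            client_fetch(c);        /* stale copy: refetch, get a new promise */
        printf("client %d reads v%d from its cache\n", c, clients[c].version);
    }

    int main(void)
    {
        client_fetch(0);
        client_fetch(1);
        client_store(1);            /* client 1 writes; client 0's promise breaks */
        client_open(0);             /* client 0 refetches version 2 */
        return 0;
    }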

Distributed version control systems face this classic problem, but with an important difference: Distributed version control systems work perfectly well when disconnected, while AFS cannot have part of its file system cut off. The separated AFS section would not be able to reconnect with the original file system. File server processes that fail have to resynchronize with the still-running AFS file servers, but cannot add new changes that might have been preserved locally after it was cut off.

AFS descendants
AFS has provided an obvious point of departure for several attempts at new file systems. Two such systems incorporate lessons developers learned from the original distributed file system architecture: Coda and the Swedish open source volunteer effort, Arla.

The Coda file system was the first attempt at improving the original AFS. Starting in 1987 at Carnegie Mellon University, developers meant for Coda to be a conscious improvement on AFS, which had reached V2.0 by that time. In the late 1980s and early '90s, the Coda file system premiered a different cache manager: Venus. Although the basic feature set of Coda resembles that of AFS, Venus enables continued operation for the Coda-enabled client even if the client has been disconnected from the distributed file system. Venus has exactly the same function as the AFS Cache Manager, which takes its file system jobs from the VFS layer inside the kernel.

Connection breakdowns between Coda servers and the Venus cache manager are not always detrimental to network function: A laptop client must be able to work away from the central servers. Thus, Venus stores all updates in the client modification log. When the cache manager reconnects to the central servers, the system reintegrates the client modification log, making all file system updates available to the client.

Disconnected operation can create other problems, but the Venus cache manager illustrates that distributed file systems can be extended to encompass much more than complex networks that are always running in a connected fashion.

Programmers have been developing Arla, a Swedish project that provides a GPLed implementation of OpenAFS, since 1993, even though most of the development and ports have taken place since 1997. Arla imitates OpenAFS fairly well, except that the XFS file system must function on all operating systems that Arla runs on. Arla has reached V0.39 and, just like OpenAFS, runs on all BSD flavors, a good number of Linux kernels since kernel V2.0x, and Sun Solaris. Arla does partly implement a feature for AFS that was not originally in the AFS code: disconnected operation. Mileage may vary, however, and developers have not completed testing.

Other AFS-type file systems are available, like the GPLed InterMezzo, but they do not replicate AFS command-line semantics or its architecture. The world of open source distributed file systems is very much alive, and other distributed file systems have found applications in the mobile computing world.

Resources

    * Check out OpenAFS for sources, binaries, and documentation.

    * NFS has progressed, and you can find the RFC and other documentation on the NFS Version 4 Web site.

    * Find information about the original Andrew File System, although many commands are identical to the OpenAFS version.

    * Carnegie Mellon University still maintains the Coda file system.

    * Find Coda file system documentation, even though this version is somewhat dated.

    * Arla provides an entry point. Documentation tends to be between terse and nonexistent.

    * A fairly popular attempt at writing a new distributed file system is the InterMezzo distributed file system.

    * Gentoo offers downloads, documentation, and news about this compile-it-from-scratch version of Linux.


About the author
Author photoFrank Pohlmann dabbled in the history of Middle Eastern religions before various funding committees decided that research in the history of religious polemics was quite irrelevant to the modern world. He has focused on his hobby -- free software -- ever since. He admits to being the technical editor of the U.K.-based Linuxuser and Developer and has had an interest in scripts and character sets since the days when he was trying to learn Old and Middle Persian.

#110 | Posted 2005-07-02 22:21

I am currently developing a GFS prototype for Linux based on the published Google material, implemented in C++, with more than 20,000 lines of source code; it basically runs already. But I haven't found much related material, nor any very detailed documentation on similar file systems. If anyone has some, please share it with me. Anyone interested is welcome to join the development; it is non-commercial.