Original article:
http://kerneltrap.org/node/6505/print
Linux: Explaining splice() and tee()
By Jeremy
Created 04/21/2006 - 07:14

An informative thread on the lkml [1] began with a request for a description of the recently added splice() and tee() system calls. Linux creator Linus Torvalds responded with a lengthy description, beginning with a simplified overview:

    "The _really_ high-level concept is that there is now a notion of a 'random kernel buffer' that is exposed to user space.

    "In other words, splice() and tee() work on a kernel buffer that the user has control over, where "splice()" moves data to/from the buffer from/to an arbitrary file descriptor, while "tee()" copes the data in one buffer to another.

    "So in a very real (but abstract) sense, "splice()" is nothing but read()/write() to a kernel buffer, and "tee()" is a memcpy() from one kernel buffer to another."
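
For reference (an editorial addition, not part of the original article), these
are the call signatures as they appear in today's splice(2) and tee(2) man
pages; at the time of this thread glibc had no wrappers yet, so the raw
syscall numbers had to be used:

#define _GNU_SOURCE
#include <fcntl.h>

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
               size_t len, unsigned int flags);
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);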

From: Diego Calleja [email blocked]
To: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 20:00:01 +0200
Could someone give a long high-level description of what splice() and tee()
are? I need a description for wiki.kernelnewbies.org/Linux_2_6_17 (while
we're at it, it'd be nice if some people could review it in case it's missing
something ;) I've named it "generic zero-copy mechanism", but I bet
there's a better description; if it's as cool as people say, it'd be nice
to do some "advertising" of it (notifying people of new features is not
something Linux has done too well historically :)

What kind of apps available today could get performance benefits from using
this? Is there a new class of "processes" (or apps) that couldn't be written
before and can be now using splice, or are there apps that become
too complex internally today because they try to avoid extra copies of data,
and that could get much simpler by using splice? Why do people see it as a
"radical" improvement in some cases over the typical way of doing I/O in
Unix? Is this similar to, or comparable with, Ritchie's/SysV STREAMS?

From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 11:44:25 -0700 (PDT)

On Wed, 19 Apr 2006, Diego Calleja wrote:
>
> Could someone give a long high-level description of what splice() and tee()
> are?

The _really_ high-level concept is that there is now a notion of a "random
kernel buffer" that is exposed to user space.

In other words, splice() and tee() work on a kernel buffer that the user
has control over, where "splice()" moves data to/from the buffer from/to
an arbitrary file descriptor, while "tee()" copies the data in one buffer
to another.

So in a very real (but abstract) sense, "splice()" is nothing but
read()/write() to a kernel buffer, and "tee()" is a memcpy() from one
kernel buffer to another.

Now, to get slightly less abstract, there are two important practical
details:

- the "buffer" implementation is nothing but a regular old-fashioned UNIX
pipe.

This actually makes sense on so many levels, but mostly simply because
that is _exactly_ what a UNIX pipe has always been: it's a buffer in
kernel space. That's what a pipe has always been. So the splice usage
isn't conceptually anything new for pipes - it's just exposing that
old buffer in a new way.

Using a pipe for the in-kernel buffer means that we already have all
the infrastructure in place to create these things (the "pipe()" system
call), and refer to them (user space uses a regular file descriptor as
a "pointer" to the kernel buffer).

It also means that we already know how to fill (or read) the kernel
buffer from user space: the bog-standard pre-existing "read()" and
"write()" system calls to the pipe work the obvious ways: they read the
data from the kernel buffer into user space, and write user space data
into the kernel buffer.

- the second part of the deal is that the buffer is actually implemented
as a set of reference-counted pointers, which means that you can copy
them around without actually physically copying memory. So while "tee()"
from a _conceptual_ standpoint is exactly the same as a "memcpy()" on
the kernel buffer, from an implementation standpoint it really just
copies the pointers and increments the refcounts.
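
A minimal sketch of that pipe-as-buffer idea (an editorial example, not from
the thread): an ordinary write() fills the pipe from user space, and splice()
drains it into a file. "out.txt" is just a placeholder name, and error
checking is omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int p[2];
        const char msg[] = "hello, kernel buffer\n";
        int out = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        pipe(p);                                /* create the kernel buffer */
        write(p[1], msg, sizeof msg - 1);       /* fill it from user space  */
        splice(p[0], NULL, out, NULL,           /* drain it into the file   */
               sizeof msg - 1, 0);
        return 0;
}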

There are some other buffer management system calls that I haven't done
yet (and when I say "I haven't done yet", I obviously mean "that I hope
some other sucker will do for me, since I'm lazy"), but that are obvious
future extensions:

- an ioctl/fcntl to set the maximum size of the buffer. Right now it's
hardcoded to 16 "buffer entries" (which in turn are normally limited to
one page each, although there's nothing that _requires_ that a buffer
entry always be a page).

- vmsplice() system call to basically do a "write to the buffer", but
using the reference counting and VM traversal to actually fill the
buffer. This means that the user needs to be careful not to re-use the
user-space buffer it spliced into the kernel-space one (contrast this
to "write()", which copies the actual data, and you can thus re-use the
buffer immediately after a successful write), but that is often easy to
do.
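
Both extensions did materialize later, for what it's worth: the buffer-size
knob arrived in Linux 2.6.35 as the F_SETPIPE_SZ fcntl, and vmsplice() was
merged in 2.6.17 itself. Hedged sketches of each, based on the current
fcntl(2) and vmsplice(2) man pages:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

/* Ask for a bigger kernel buffer; returns the size actually set, or -1. */
static int grow_pipe(int pipefd)
{
        return fcntl(pipefd, F_SETPIPE_SZ, 1024 * 1024);
}

/* Gift user pages to the pipe without copying the data. As noted above,
 * the caller must not reuse buf while the kernel may still reference it. */
static ssize_t push_buffer(int pipe_wr, void *buf, size_t len)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        return vmsplice(pipe_wr, &iov, 1, 0);
}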

Anyway, when would you actually _use_ a kernel buffer? Normally you'd use
it if you want to copy things from one source into another, and you don't
actually want to see the data you are copying, so using a kernel buffer
allows you to possibly do it more efficiently, and you can avoid
allocating user VM space for it (with all the overhead that implies: not
just the memcpy() to/from user space, but also simply the book-keeping).

It should be noted that splice() is very much _not_ the same as
sendfile(). The buffer is really the big difference, both conceptually,
and in how you actually end up using it.

A "sendfile()" call (which a lot of other OS's also implement) doesn't
actually _need_ a buffer at all, because it uses the file cache directly
as the buffer it works on. So sendfile() is really easy to use, and really
efficient, but fundamentally limited in what it can do.
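
For contrast, the Linux sendfile(2) interface, which exposes no buffer at
all (its in_fd must be a file backed by the page cache):

#include <sys/sendfile.h>

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);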

In contrast, the whole point of splice() very much is that buffer. It
means that in order to copy a file, you literally do it like you would
have done it traditionally in user space:

int ret;

for (;;) {
        char *p;

        ret = read(input, buffer, BUFSIZE);
        if (!ret)
                break;
        if (ret < 0) {
                if (errno == EINTR)
                        continue;
                .. exit with an input error ..
        }

        p = buffer;
        do {
                int written = write(output, p, ret);
                if (!written)
                        .. exit with filesystem full ..
                if (written < 0) {
                        if (errno == EINTR)
                                continue;
                        .. exit with an output error ..
                }
                p += written;
                ret -= written;
        } while (ret);
}

except you'd not have a buffer in user space, and the "read()" and
"write()" system calls would instead be "splice()" system calls to/from a
pipe you set up as your _kernel_ buffer. But the _construct_ would all be
identical - the only thing that changes is really where that "buffer"
exists.
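
As a concrete illustration (an editorial sketch, not from the original
thread): the same loop with the buffer moved into the kernel. A pipe supplies
the buffer, and two splice() calls replace read() and write(); input and
output are assumed to be already-open descriptors, and BUFSIZE is an
arbitrary chunk size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define BUFSIZE (64 * 1024)

/* Copy input to output through a pipe: splice() in, splice() out. */
int kcopy(int input, int output)
{
        int p[2];

        if (pipe(p) < 0)
                return -1;

        for (;;) {
                ssize_t ret = splice(input, NULL, p[1], NULL,
                                     BUFSIZE, SPLICE_F_MOVE);
                if (!ret)
                        break;                  /* end of input */
                if (ret < 0) {
                        if (errno == EINTR)
                                continue;
                        return -1;              /* input error */
                }
                while (ret) {
                        ssize_t n = splice(p[0], NULL, output, NULL,
                                           ret, SPLICE_F_MOVE);
                        if (!n)
                                return -1;      /* output full */
                        if (n < 0) {
                                if (errno == EINTR)
                                        continue;
                                return -1;      /* output error */
                        }
                        ret -= n;
                }
        }
        close(p[0]);
        close(p[1]);
        return 0;
}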

Now, the advantage of splice()/tee() is that you can do zero-copy movement
of data, and unlike sendfile() you can do it on _arbitrary_ data (and, as
shown by "tee()", it's more than just sending the data to somebody else:
you can duplicate the data and choose to forward it to two or more
different users - for things like logging etc).

So while sendfile() can send files (surprise surprise), splice() really is
a general "read/write in user space" and then some, so you can forward
data from one socket to another, without ever copying it into user space.

Or, rather than just a boring socket->socket forwarding, you could, for
example, forward data that comes from a MPEG-4 hardware encoder, and tee()
it to duplicate the stream, and write one of the streams to disk, and the
other one to a socket for a real-time broadcast. Again, all without
actually physically copying it around in memory.
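
A sketch of that fan-out (editorial, with hypothetical descriptors encoder,
disk and sock): splice the source into one pipe, tee() it into a second, and
drain each pipe to its own destination. Short transfers and errors are not
handled here.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int fan_out(int encoder, int disk, int sock, size_t chunk)
{
        int a[2], b[2];
        ssize_t n;

        if (pipe(a) < 0 || pipe(b) < 0)
                return -1;

        n = splice(encoder, NULL, a[1], NULL, chunk, SPLICE_F_MOVE);
        if (n <= 0)
                return (int)n;
        tee(a[0], b[1], n, 0);          /* duplicate the buffer, no copy */
        splice(a[0], NULL, disk, NULL, n, SPLICE_F_MOVE);
        splice(b[0], NULL, sock, NULL, n, SPLICE_F_MOVE);
        return 0;
}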

So splice() is strictly more powerful than sendfile(), even if it's a bit
more complex to use (the explicit buffer management in the middle). That
said, I think we're actually going to _remove_ sendfile() from the kernel
entirely, and just leave a compatibility system call that uses splice()
internally to keep legacy users happy.

Splice really is that much more powerful a concept, that having sendfile()
just doesn't make any sense except as some legacy compatibility layer
around the more powerful splice().

Linus

From: Grzegorz Kulewski [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 21:20:56 +0200 (CEST)

On Wed, 19 Apr 2006, Linus Torvalds wrote:
> On Wed, 19 Apr 2006, Diego Calleja wrote:
>>
>> Could someone give a long high-level description of what splice() and tee()
>> are?
>
> The _really_ high-level concept is that there is now a notion of a "random
> kernel buffer" that is exposed to user space.

Suppose I am implementing a high-performance (non-caching) HTTP proxy that
reads (part of?) the HTTP header from A, decides from it where to send the
request, connects to the right host (B), sends the (partial) HTTP header it
has already received, and then wants to:

- make all further bytes from A be copied to B without going through user
space, but no more than n bytes (n = the request size it knows from the
header) or up to the end of data (disconnect or something like that),

- make all bytes from B be copied to A without going through user space,
but no more than m bytes (m = the response size from the response header),

- stop both operations as soon as they have copied enough data (assuming both
sides are still connected) and then use the sockets normally - to implement,
for example, multiple requests per connection (keepalive).

Could it be done with splice() or tee() or some other kernel
"accelerator"? Or should it be done in userspace by plain read and write?

And what if n or m is not known in advance, but, for example, the end of the
request is marked by <CR><LF><CR><LF> or something like that (common in some
older protocols)?

Thanks in advance,

Grzegorz Kulewski

From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 13:09:08 -0700 (PDT)

On Wed, 19 Apr 2006, Grzegorz Kulewski wrote:
>
> Suppose I am implementing a high-performance (non-caching) HTTP proxy that
> reads (part of?) the HTTP header from A, decides from it where to send the
> request, connects to the right host (B), sends the (partial) HTTP header it
> has already received, and then wants to:
>
> - make all further bytes from A be copied to B without going through user
> space, but no more than n bytes (n = the request size it knows from the
> header) or up to the end of data (disconnect or something like that),
>
> - make all bytes from B be copied to A without going through user space,
> but no more than m bytes (m = the response size from the response header),
>
> - stop both operations as soon as they have copied enough data (assuming
> both sides are still connected) and then use the sockets normally - to
> implement, for example, multiple requests per connection (keepalive).
>
> Could it be done with splice() or tee() or some other kernel "accelerator"?
> Or should it be done in userspace by plain read and write?

You'd not use "tee()" here, because you never have any data that you want
to go to two different destinations, but yes, you could very well
use splice() for this.

(well, technically you have the header part that you want to duplicate,
and you _could_ use "tee()" for that, but it would be stupid - since you
want to see the header in user space _anyway_ to see where to forward
things, you just want to start out with a MSG_PEEK on the incoming socket
to see the header, and then use splice, to splice it to the destination
socket).
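
A sketch of that pattern (editorial; pick_backend() is a hypothetical
routing helper, and error handling is trimmed): MSG_PEEK lets user space
inspect the header without consuming it, so the subsequent splice() still
carries the full request.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/socket.h>

int pick_backend(const char *hdr, ssize_t len);         /* hypothetical */

static int forward_request(int client, size_t total)
{
        char hdr[4096];
        ssize_t seen = recv(client, hdr, sizeof hdr, MSG_PEEK);
        int backend = pick_backend(hdr, seen);
        int p[2];

        pipe(p);
        while (total) {                 /* header and body, still unread */
                ssize_t n = splice(client, NULL, p[1], NULL,
                                   total, SPLICE_F_MOVE);
                if (n <= 0)
                        break;
                splice(p[0], NULL, backend, NULL, n, SPLICE_F_MOVE);
                total -= (size_t)n;
        }
        return backend;
}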

> And what if n or m is not known in advance, but, for example, the end of
> the request is marked by <CR><LF><CR><LF> or something like that (common in
> some older protocols)?

At that point, you need to actually watch the data in user space, and so
you need to do a real read() system call.

(Of course, the "kernel buffer" notion does allow for a notion of "kernel
filters" too, but then you get to shades of STREAMS, and that just scares
the crap out of me, so..)

Linus

From: Trond Myklebust [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 17:23:47 -0400

On Wed, 2006-04-19 at 11:44 -0700, Linus Torvalds wrote:
>
> On Wed, 19 Apr 2006, Diego Calleja wrote:
> >
> > Could someone give a long high-level description of what splice() and tee()
> > are?
>
> The _really_ high-level concept is that there is now a notion of a "random
> kernel buffer" that is exposed to user space.
>
> In other words, splice() and tee() work on a kernel buffer that the user
> has control over, where "splice()" moves data to/from the buffer from/to
> an arbitrary file descriptor, while "tee()" copies the data in one buffer
> to another.

Any chance this could be adapted to work with all those DMA (and RDMA)
engines that litter our motherboards? I'm thinking in particular of
stuff like the drm drivers, and userspace rdma.

Cheers,
Trond

From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 14:49:50 -0700 (PDT)

On Wed, 19 Apr 2006, Trond Myklebust wrote:
>
> Any chance this could be adapted to work with all those DMA (and RDMA)
> engines that litter our motherboards? I'm thinking in particular of
> stuff like the drm drivers, and userspace rdma.

Absolutely. Especially with "vmsplice()" (the not-yet-implemented "move
these user pages into a kernel buffer") it should be entirely possible to
set up an efficient zero-copy setup that does NOT have any of the problems
with aio and TLB shootdown etc.

Note that a driver would have to support the splice_in() and splice_out()
interfaces (which are basically just given the pipe buffers to do with as
they wish), and perhaps more importantly: note that you need specialized
apps that actually use splice() to do this.

That's the biggest downside by far, and is why I'm not 100% convinced
splice() usage will be all that wide-spread. If you look at sendfile(),
it's been available for a long time, and is actually even almost portable
across different OS's _and_ it is easy to use. But almost nobody actually
does. I suspect the only users are some apache mods, perhaps an ftp daemon
or two, and probably samba. And that's probably largely it.

There's a _huge_ downside to specialized interfaces. Admittedly, splice()
is a lot less specialized (ie it works in a much wider variety of loads),
but it's still very much a "corner-case" thing. You can always do the same
thing splice() does with a read/write pair instead, and be portable.

Also, the genericity of splice() does come at the cost of complexity. For
example, to do a zero-copy from a user space buffer to some RDMA network
interface, you'd have to basically keep track of _two_ buffers:

- keep track of how much of the user space buffer you have moved into
kernel space with "vmsplice()" (or, for that matter, with any other
source of data for the buffer - it might be a file, it might be another
socket, whatever. I say "vmsplice()", but that's just an example for
when you have the data in user space).

The kernel space buffer is - for obvious reasons - size limited in the
way a user-space buffer is not. People are used to doing megabytes of
buffers in user space. The splice buffer, in comparison, is maybe a few
hundred kB at most. For some apps, that's "infinity". For others, it's
just a few tens of pages of data.

- keep track of how much of the kernel space buffer you have moved to the
RDMA network interface with "splice()".

The splice buffer _is_ another buffer, and you have to feed the data
from that buffer to the RDMA device manually.

In many usage scenarios, this means that you end up having the normal
kind of poll/select loop. Now, that's nothing new: people are used to
them, but people still hate them, and it just means that very few
environments are going to spend the effort on another buffering setup.

So the upside of splice() is that it really can do some things very
efficiently, by "copying" data with just a simple reference counted
pointer. But the downside is that it makes for another level of buffering,
and behind an interface that is in kernel space (for obvious reasons),
which means that it's somewhat harder to wrap your hands and head around
than just a regular user-space buffer.

So I'd expect this to be most useful for perhaps things like some HPC
apps, where you can have specialized libraries for data communication. And
servers, of course (but they might just continue to use the old
"sendfile()" interface, without even knowing that it's not sendfile() any
more, but just a wrapper around splice()).

Linus



From: "Hua Zhong" [email blocked]

Subject: RE: Linux 2.6.17-rc2

Date:        Wed, 19 Apr 2006 11:04:47 -0700

http://lwn.net/Articles/178199/ [2]

From: [email blocked] (Jonathan Corbet)
Subject: Re: splice and tee [was Linux 2.6.17-rc2]
Date: Wed, 19 Apr 2006 13:40:45 -0600

> http://lwn.net/Articles/178199/ [3]

Additionally, the article on tee() can be found at:

http://lwn.net/SubscriberLink/179492/14a99324520b744f/ [4]

jon


Related Links:

    * Archive of above thread [5]

Links
[1] http://www.tux.org/lkml/
[2] http://lwn.net/Articles/178199/
[3] http://lwn.net/Articles/178199/
[4] http://lwn.net/SubscriberLink/179492/14a99324520b744f/
[5] http://marc.theaimsgroup.com/?l= ... 47013019100&w=2
Source URL: http://kerneltrap.org/node/6505