Linux asynchronous file IO
The Linux kernel includes a facility for performing asynchronous file IO (Linux aio). At RustFest, during his keynote about Tokio, Alex Crichton said that he was not aware of any means to do asynchronous IO with files. I had a chat with him afterwards, and he was surprised that he had never heard about this, so I thought it might be a good idea to write down some notes about it. What's more, the interface is not very well documented, and there are several intricacies that I tend to forget every time I use it.
Asynchronous IO
Why would one want asynchronous IO? By default, IO operations (e.g., reading or writing a file via read/write) are synchronous. That is, the caller is suspended by the kernel until the operation is done. For example, calling read on a file might put the executing thread to sleep until the data are fetched from a storage device. Calling read on a socket might block the thread until data become available from the network.
For applications that want to handle many IO operations, performing them synchronously is problematic because no useful work is done during the time that the thread is blocked. One way to deal with this is to use multiple threads. At any point in time, some of them will be blocked, but others will continue to make progress. While this approach generally works well for a small number of threads, large numbers cause scalability issues that degrade performance. A frequently referenced case where this approach is problematic is building network servers where each connection is handled by a single thread. Needing 10K threads to handle 10K connections is generally considered inefficient (see: C10k problem). (A related issue is the long-standing threads-vs-events debate. For those interested, a great starting point is the five-part blog series by Adrian Colyer in his excellent blog, the morning paper.)
When dealing with the network, the standard way to do asynchronous IO in Linux is via epoll, which, however, does not work for normal files.1 Hence a different mechanism is needed, and this is what Linux aio is intended for.
Linux aio
Linux aio is implemented via the following system calls:
- io_setup and io_destroy, for creating and destroying an aio context
- io_submit for submitting IO requests
- io_getevents for retrieving the completions of the submitted requests
- io_cancel for cancelling IO requests
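Note that glibc does not provide wrappers for these system calls, so they have to be invoked via syscall(2). A minimal sketch of such wrappers (this is boilerplate I write myself, not part of any header):

```c
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>   /* aio_context_t, struct iocb, struct io_event */
#include <time.h>            /* struct timespec */

static inline int io_setup(unsigned nr_events, aio_context_t *ctx) {
    return syscall(SYS_io_setup, nr_events, ctx);
}

static inline int io_destroy(aio_context_t ctx) {
    return syscall(SYS_io_destroy, ctx);
}

static inline int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp) {
    return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

static inline int io_getevents(aio_context_t ctx, long min_nr, long nr,
                               struct io_event *events, struct timespec *timeout) {
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

static inline int io_cancel(aio_context_t ctx, struct iocb *iocb, struct io_event *result) {
    return syscall(SYS_io_cancel, ctx, iocb, result);
}
```

The snippets below assume wrappers like these are in place.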
To use aio, an aio context is needed:
aio_context_t ioctx = 0;
unsigned maxevents = 128;
if (io_setup(maxevents, &ioctx) < 0) {
    perror("io_setup");
    exit(1);
}
(Yes, the ioctx needs to be zeroed before the call.)
If successful, the context can be used to submit IO operations via:
int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
Each operation is represented by an IO control block (struct iocb), whose definition can be found in include/linux/aio_abi.h. The header defines a number of different operations, but in the latest kernel only four are supported: IOCB_CMD_PREAD, IOCB_CMD_PWRITE, IOCB_CMD_PREADV, and IOCB_CMD_PWRITEV, which correspond to the pread, pwrite, preadv, and pwritev system calls.
If we want, for example, to issue two 512-byte read operations on a file descriptor fd, we can do the following (yes, zeroing the iocbs is necessary):
// first operation
char buff1[512];
struct iocb iocb1 = {0};
iocb1.aio_data = 0xbeef; // will be returned in completion
iocb1.aio_fildes = fd;
iocb1.aio_lio_opcode = IOCB_CMD_PREAD;
iocb1.aio_reqprio = 0;
iocb1.aio_buf = (uintptr_t)buff1;
iocb1.aio_nbytes = sizeof(buff1);
iocb1.aio_offset = 0; // read file at offset 0
// second operation
char buff2[512];
struct iocb iocb2 = {0};
iocb2.aio_data = 0xbaba; // will be returned in completion
iocb2.aio_fildes = fd;
iocb2.aio_lio_opcode = IOCB_CMD_PREAD;
iocb2.aio_reqprio = 0;
iocb2.aio_buf = (uintptr_t)buff2;
iocb2.aio_nbytes = sizeof(buff2);
iocb2.aio_offset = 4096; // read file at offset 4096 (bytes)
struct iocb *iocb_ptrs[2] = { &iocb1, &iocb2 };
// submit operations
int ret = io_submit(ioctx, 2, iocb_ptrs);
if (ret < 0) {
    perror("io_submit");
    exit(1);
} else if (ret != 2) {
    /* errno is not set on partial success, so perror is not useful here */
    fprintf(stderr, "io_submit: unhandled partial success\n");
    exit(1);
}
Eventually, we need to ask the kernel for the completions of the requests that we submitted.
// wait for at least one event
size_t nevents = 2;
struct io_event events[nevents];
ret = io_getevents(ioctx, 1 /* min */, nevents, events, NULL);
if (ret < 0) {
    perror("io_getevents");
    exit(1);
}
for (int i = 0; i < ret; i++) {
    struct io_event *ev = &events[i];
    assert(ev->data == 0xbeef || ev->data == 0xbaba);
    printf("Event returned with res=%lld res2=%lld\n", ev->res, ev->res2);
    nevents--;
}
res stores the result of the operation. For IOCB_CMD_PREAD, for example, it seems to be either a negative error number or a positive value of how many bytes were read. In our example, if everything went well, we will see:
Event returned with res=512 res2=0
Event returned with res=512 res2=0
If, for example, the operations fail with EINVAL (22), we will see:
Event returned with res=-22 res2=0
Event returned with res=-22 res2=0
I have no idea what res2 is, but my guess is that if it is not 0, an error occurred.
The full version of the above example can be found at: https://github.com/kkourt/aio-test/blob/master/aio-example.c
There is also libaio, a thin library on top of the above system calls. Many undocumented features are illustrated there. For example, to prepare a vectored operation:
static inline void io_prep_preadv(struct iocb *iocb, int fd, const struct iovec *iov,
                                  int iovcnt, long long offset)
{
    memset(iocb, 0, sizeof(*iocb));
    iocb->aio_fildes = fd;
    iocb->aio_lio_opcode = IO_CMD_PREADV;
    iocb->aio_reqprio = 0;
    iocb->u.c.buf = (void *)iov;
    iocb->u.c.nbytes = iovcnt;
    iocb->u.c.offset = offset;
}
Another hidden feature is the ability to be notified via an eventfd for completion events:
static inline void io_set_eventfd(struct iocb *iocb, int eventfd)
{
    iocb->u.c.flags |= IOCB_FLAG_RESFD;
    iocb->u.c.resfd = eventfd;
}
Does it work as expected?
Well, kind of. For the whole thing to make sense, io_submit must issue the operation but not complete it. In other words, if the operation is completed within io_submit and is immediately available the next time we call io_getevents, then we might as well use synchronous calls. In that sense, async IO does not work with buffered IO; instead, direct IO is needed (O_DIRECT). Furthermore, there are cases where the process blocks when calling io_submit, but some recent changes in the kernel try to address this.
If you want to experiment, I’ve written aio-test, a small utility for that purpose.
POSIX aio != Linux aio
It is worth noting that Linux aio is sometimes confused with the glibc implementation of POSIX aio, which is something different. The glibc implementation uses thread pools that perform the IO operations synchronously. As mentioned in the corresponding manpage:
The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn’t yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.
Hence, while in principle the POSIX aio interface could be implemented using Linux aio, in practice this is not the case.
Other links
1. There are a number of issues when trying to do this. Setting O_NONBLOCK on regular files does not cause read and write operations to return prematurely when they would block on IO, and epoll returns EPERM when trying to add a descriptor that refers to a regular file. ↩