Linux asynchronous file IO
The Linux kernel includes a facility for performing asynchronous file IO (Linux aio). At RustFest, during his keynote about Tokio, Alex Crichton said that he was not aware of any means to do asynchronous IO with files. I had a chat with him afterwards, and he was surprised that he had never heard about this, so I thought it might be a good idea to write down some notes about it. What's more, the interface is not very well documented, and there are several intricacies that I tend to forget every time I use it.
Asynchronous IO
Why would one want asynchronous IO? By default, IO operations (e.g., reading or writing a file via read/write) are synchronous. That is, the caller is suspended by the kernel until the operation is done. For example, calling read on a file might put the executing thread to sleep until the data are fetched from a storage device. Calling read on a socket might block the thread until data become available from the network.
For applications that want to handle many IO operations, performing them synchronously is problematic because no useful work is done during the time that the thread is blocked. One way to deal with this is to use multiple threads. At any point in time, some of them will be blocked, but others will continue to make progress. While this approach generally works well for a small number of threads, large numbers cause scalability issues that degrade performance. A frequently referenced case where this approach is problematic is building network servers where each connection is handled by a single thread. Needing 10K threads to handle 10K connections is generally considered inefficient (see: C10k problem). (A related issue is the long-standing threads-vs-events debate. For those interested, a great starting point is the five-part blog series by Adrian Colyer in his excellent blog, the morning paper.)
When dealing with the network, the standard way to do asynchronous IO in Linux is via epoll, which, however, does not work for normal files.1 Hence a different mechanism is needed, and this is what Linux aio is intended for.
Linux aio
Linux aio is implemented via the following system calls:
- io_setup and io_destroy, for creating and destroying an aio context
- io_submit for submitting IO requests
- io_getevents for retrieving the completions of the submitted requests
- io_cancel for cancelling IO requests
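Note that glibc does not provide wrappers for these system calls, so they have to be invoked via syscall(2). A minimal sketch of such wrappers (this is boilerplate I write myself, not part of any header):

```c
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>   /* aio_context_t, struct iocb, struct io_event */
#include <time.h>            /* struct timespec */

static inline int io_setup(unsigned nr_events, aio_context_t *ctx) {
    return syscall(SYS_io_setup, nr_events, ctx);
}

static inline int io_destroy(aio_context_t ctx) {
    return syscall(SYS_io_destroy, ctx);
}

static inline int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp) {
    return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

static inline int io_getevents(aio_context_t ctx, long min_nr, long nr,
                               struct io_event *events, struct timespec *timeout) {
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

static inline int io_cancel(aio_context_t ctx, struct iocb *iocb, struct io_event *result) {
    return syscall(SYS_io_cancel, ctx, iocb, result);
}
```

The snippets below assume wrappers like these are in place.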
To use aio, an aio context is needed:
aio_context_t ioctx = 0;
unsigned maxevents = 128;
if (io_setup(maxevents, &ioctx) < 0) {
    perror("io_setup");
    exit(1);
}
(Yes, the ioctx needs to be zeroed before the call.)
If successful, the context can be used to submit IO operations via:
int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
Each operation is represented by an IO control block (struct iocb), whose definition can be found in include/linux/aio_abi.h. The header defines a number of different operations, but in the latest kernel only four are supported: IOCB_CMD_PREAD, IOCB_CMD_PWRITE, IOCB_CMD_PREADV, and IOCB_CMD_PWRITEV, which correspond to the pread, pwrite, preadv, and pwritev system calls.
If we want, for example, to issue two 512-byte read operations on a file descriptor fd, we can do the following (yes, zeroing the iocbs is necessary):
// first operation
char buff1[512];
struct iocb iocb1 = {0};
iocb1.aio_data = 0xbeef; // will be returned in completion
iocb1.aio_fildes = fd;
iocb1.aio_lio_opcode = IOCB_CMD_PREAD;
iocb1.aio_reqprio = 0;
iocb1.aio_buf = (uintptr_t)buff1;
iocb1.aio_nbytes = sizeof(buff1);
iocb1.aio_offset = 0; // read file at offset 0
// second operation
char buff2[512];
struct iocb iocb2 = {0};
iocb2.aio_data = 0xbaba; // will be returned in completion
iocb2.aio_fildes = fd;
iocb2.aio_lio_opcode = IOCB_CMD_PREAD;
iocb2.aio_reqprio = 0;
iocb2.aio_buf = (uintptr_t)buff2;
iocb2.aio_nbytes = sizeof(buff2);
iocb2.aio_offset = 4096; // read file at offset 4096 (bytes)
struct iocb *iocb_ptrs[2] = { &iocb1, &iocb2 };
// submit operations
int ret = io_submit(ioctx, 2, iocb_ptrs);
if (ret < 0) {
    perror("io_submit");
    exit(1);
} else if (ret != 2) {
    /* errno is not set on partial success, so perror is not useful here */
    fprintf(stderr, "io_submit: unhandled partial success\n");
    exit(1);
}
Eventually, we need to ask the kernel for the completions of the requests that we submitted.
// wait for at least one event
size_t nevents = 2;
struct io_event events[nevents];
ret = io_getevents(ioctx, 1 /* min */, nevents, events, NULL);
if (ret < 0) {
    perror("io_getevents");
    exit(1);
}
for (int i = 0; i < ret; i++) {
    struct io_event *ev = &events[i];
    assert(ev->data == 0xbeef || ev->data == 0xbaba);
    printf("Event returned with res=%lld res2=%lld\n", ev->res, ev->res2);
    nevents--;
}
res stores the result of the operation. For IOCB_CMD_PREAD, for example, it seems to be either a negative error number or a positive value of how many bytes were read. In our example, if everything went well, we will see:
Event returned with res=512 res2=0
Event returned with res=512 res2=0
If, for example, the operations fail with EINVAL (22), we will see:
Event returned with res=-22 res2=0
Event returned with res=-22 res2=0
I have no idea what res2 is, but my guess is that if it is not 0, an error occurred.
The full version of the above example can be found at: https://github.com/kkourt/aio-test/blob/master/aio-example.c
There is also libaio, a thin library on top of the above system calls. Many undocumented features are illustrated there. For example, to prepare a vectored operation:
static inline void io_prep_preadv(struct iocb *iocb, int fd, const struct iovec *iov,
                                  int iovcnt, long long offset)
{
    memset(iocb, 0, sizeof(*iocb));
    iocb->aio_fildes = fd;
    iocb->aio_lio_opcode = IO_CMD_PREADV;
    iocb->aio_reqprio = 0;
    iocb->u.c.buf = (void *)iov;
    iocb->u.c.nbytes = iovcnt;
    iocb->u.c.offset = offset;
}
Another hidden feature is the ability to be notified via an eventfd for completion events:
static inline void io_set_eventfd(struct iocb *iocb, int eventfd)
{
    iocb->u.c.flags |= IOCB_FLAG_RESFD;
    iocb->u.c.resfd = eventfd;
}
Does it work as expected?
Well, kind of. For the whole thing to make sense, io_submit must issue the operation but not complete it. In other words, if the operation is completed within io_submit and is immediately available the next time we call io_getevents, then we might as well use synchronous calls. In that sense, async IO does not work with buffered IO; instead, direct IO is needed (O_DIRECT). Furthermore, there are cases where the process blocks when calling io_submit, but some recent changes in the kernel try to address this.
If you want to experiment, I’ve written aio-test, a small utility for that purpose.
POSIX aio != Linux aio
It is worth noting that Linux aio is sometimes confused with the glibc implementation of POSIX aio, which is something different. The glibc implementation uses thread pools that perform the IO operations synchronously. As mentioned in the corresponding manpage:
The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn’t yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.
Hence, while in principle the POSIX aio interface could be implemented using Linux aio, in practice this is not the case.
Other links
1. There are a number of issues when trying to do this. Setting O_NONBLOCK on regular files does not cause read and write operations to return prematurely when they would block on IO, and epoll returns EPERM when trying to add a descriptor that refers to a regular file. ↩