For the past two years, I have been studying file IO on and off and developing my own Java library for io_uring, called jasyncfio. In this article, I want to share my experience and insights.

Contents:

  • About io_uring
  • io_uring API
  • Adding io_uring to Java, JNI and structures
  • io_uring configuration
  • Example of a file read operation using io_uring
  • System calls and constants
  • Working with C structures from Java
  • Java API for working with io_uring
  • jasyncfio API overview
  • Benchmarks
  • Epilogue

About io_uring

io_uring is a relatively new API in the Linux Kernel, introduced in version 5.1. It was built with the idea of high-performance asynchronous input/output (IO) for both files and network sockets. In this article I will only cover the file API; I might explore network IO at a later time. Before io_uring, Linux had only one asynchronous API for file IO, POSIX AIO, which has several limitations that make it unsuitable for widespread use.

io_uring API

io_uring is based on two queues that are shared between the kernel and user space: the Submission Queue (sq) and the Completion Queue (cq). The user writes an operation request into the Submission Queue, and the kernel writes the result of the operation into the Completion Queue. The user then needs to read and process this result.

This approach allows the program to make requests for multiple IO operations to the kernel with a single system call, and correspondingly, the kernel can return the results of multiple IO operations through the appropriate queue.

The io_uring interface is located in the header file io_uring.h. In addition to constants, the file contains two main structures for working with queues - io_uring_sqe (Submission Queue Entry) and io_uring_cqe (Completion Queue Entry).

Three new system calls were also added to the Linux Kernel API:

  • io_uring_setup - creates a context for working with io_uring,
  • io_uring_enter - notifies the kernel and starts operations,
  • io_uring_register - registers files and buffers in the kernel for more efficient use with io_uring.

You can find more detailed information about io_uring in an excellent article by the API's author, Jens Axboe - Efficient IO with io_uring.

Adding io_uring to Java, JNI and structures

To determine what we need to add to Java, first we need to familiarize ourselves with how interaction with io_uring occurs. As I mentioned earlier, the file io_uring.h defines two structures - io_uring_sqe and io_uring_cqe, and io_uring itself consists of two queues containing elements of these structures. To request the kernel to perform an operation, you need to populate the appropriate fields of the io_uring_sqe structure. To process the result of the operation, you need to read the fields of the io_uring_cqe structure.

io_uring configuration

Before we can start working with io_uring, we need to set up the required context, which is a rather involved process. For this, we declare a variable of the io_uring_params structure type (defined in the io_uring.h header file), set the flags that we need, and make the io_uring_setup() system call, passing in the required queue size and the structure with parameters. If all goes well, we get back an io_uring file descriptor.

struct io_uring_params p;
int ring_fd;

memset(&p, 0, sizeof(p));

p.sq_thread_idle = sq_thread_idle;
p.sq_thread_cpu = sq_thread_cpu;
p.cq_entries = cq_size;
// ... and so on

ring_fd = sys_io_uring_setup(entries, &p);

// error handling

Next comes the most challenging part of the configuration process - we need to allocate memory for the queues using mmap(), calculating the buffer sizes correctly from the number of entries the io_uring file descriptor was created with. For the jasyncfio library, I took this code from the C library liburing. After all these manipulations, we end up with an io_uring that is ready to use.
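
To give a sense of this step, here is a minimal sketch of mapping the submission queue the way liburing does it (error handling omitted; the offsets come from the io_uring_params structure that io_uring_setup() filled in, and the IORING_OFF_* constants are defined in io_uring.h):

#include <sys/mman.h>
#include <linux/io_uring.h>

// Map the SQ ring; its size is derived from the offsets returned by the kernel.
size_t sq_ring_size = p.sq_off.array + p.sq_entries * sizeof(unsigned);
char *sq_ring_ptr = mmap(NULL, sq_ring_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);

// Map the array of io_uring_sqe entries; the CQ ring is mapped the same way
// using IORING_OFF_CQ_RING and the p.cq_off offsets.
struct io_uring_sqe *sqes = mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
                                 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                                 ring_fd, IORING_OFF_SQES);

// Pointers to the ring fields (head, tail, mask, ...) are computed from p.sq_off.
unsigned *sq_tail = (unsigned *) (sq_ring_ptr + p.sq_off.tail);
unsigned *sq_ring_mask = (unsigned *) (sq_ring_ptr + p.sq_off.ring_mask);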

Example of a file read operation using io_uring

Now, let’s take a file read operation as an example to see what we need to implement in the Java library. To start, let’s retrieve the next available element from the sqes queue, so we can populate it with all necessary parameters for the IO operation we want to execute. This is done in the following way: during the configuration of io_uring, we calculate a special mask called ring_mask, which is used to obtain the next available io_uring_sqe element:

struct io_uring_sqe *sqe;
sqe = &sqes[tail & ring_mask]; // retrieve the next available sqe
tail++; // don't forget to increment the tail counter

Here I deliberately omit error handling details; for now we just need to get the main idea. Once we have a pointer to the next io_uring_sqe element in the queue, we need to populate the fields that are necessary for our operation. In the case of a read operation, the C code would look like this:

sqe->opcode = IORING_OP_READ; // the operation we want to execute
sqe->flags = 0; // we don't need any flags
sqe->fd = fd; // file descriptor of the file we want to read from
sqe->off = 0; // the offset from which we want to start reading
// ... and so on

jasyncfio provides the ability to write data into a C structure from Java code. Then we need to call io_uring_enter() to notify the kernel that the submission queue is not empty. After the operation is completed, the kernel adds the io_uring_cqe element to the cqes queue, which operates in a manner similar to sqes:

struct io_uring_cqe *cqe;
cqe = &cqes[head & ring_mask]; // retrieve the next CQE
head++;

To process the result, we need to read the fields of the io_uring_cqe structure.

unsigned long user_data = cqe->user_data;
int res = cqe->res;
unsigned int flags = cqe->flags;

It is important to note that, in addition to simple reading and writing from the sqes and cqes queues, it is essential to employ the correct memory synchronization primitives. Failure to do so can lead to unpredictable errors, such as the kernel only seeing a partially written io_uring_sqe structure! For Java code, jasyncfio does all the heavy lifting.
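
To illustrate what this means in C: liburing hides the barriers behind helpers like io_uring_smp_store_release(), which boil down to roughly the following (a sketch using C11 atomics; sq_tail matches the mmap sketch above, local_tail is our local tail counter, and cq_tail is the analogous pointer into the CQ ring):

#include <stdatomic.h>

// Publish the new SQ tail. The release barrier guarantees the kernel cannot
// observe the updated tail before the fully written io_uring_sqe entries.
atomic_store_explicit((_Atomic unsigned *) sq_tail, local_tail,
                      memory_order_release);

// Load the CQ tail with acquire semantics before reading new cqes, so that
// the cqe contents written by the kernel are visible to us.
unsigned kernel_cq_tail = atomic_load_explicit((_Atomic unsigned *) cq_tail,
                                               memory_order_acquire);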

In the next chapter, we will take a closer look at the approaches that can be used to interact with io_uring from Java code.

System calls and constants

Unfortunately, there is no easy way of working with the C world from Java code yet. I use ‘yet’ because JDK 21 is expected to feature the release of Project Panama, which aims to simplify this task and make the interaction both safer and faster. But since there is no stable version of JDK with Panama yet, we will have to resort to the good old JNI and employ some tricks that I have picked up from other projects.

Let's start with the io_uring system calls. Here everything is simple: for each system call, we need to write a JNI wrapper. (The one complication is that these system calls are not exposed by the C library, so we first need C code that invokes them directly. That goes beyond the scope of this article; you can find the code in jasyncfio.) In my library I am using an unconventional way of working with JNI, which I found in the Netty project. I liked it because it allows one to get rid of the pesky JNIEXPORT and JNICALL macros, as well as avoid the special method naming convention, which is far from typical C style.

Example:

static void java_io_uring_register(JNIEnv *env, jclass clazz, jint fd, jint opcode, jlong arg, jint nr_args) {
    int result;
    result = sys_io_uring_register(fd, opcode, (void *) arg, nr_args);
    if (result < 0) {
        throwRuntimeExceptionErrorNo(env, "Failed to call sys_io_uring_register: ", errno);
    }
}

static JNINativeMethod method_table[] = {
    {"ioUringRegister", "(IIJI)V", (void *) java_io_uring_register},
};

jint jni_iouring_on_load(JNIEnv *env) {
    jclass native_class = (*env)->FindClass(env, IOURING_NATIVE_CLASS_NAME);
    // check errors
    return (*env)->RegisterNatives(env, native_class, method_table, sizeof(method_table)/sizeof(method_table[0]));
}

// called by the JVM
JNIEXPORT jint JNI_OnLoad(JavaVM *vm, void *reserved) {
    JNIEnv *env;
    (*vm)->GetEnv(vm, (void**) &env, JNI_VERSION_1_6);

    // register natives
    if (jni_iouring_on_load(env) == JNI_ERR) {
        return JNI_ERR;
    }

    return JNI_VERSION_1_6;
}

Once we have dealt with the system calls, we need a similar approach for the io_uring constants. The supported operations are represented by the io_uring_op enum in the io_uring.h file, and the constants for io_uring_setup() are defined as numbers in the same file. Several approaches are possible. The simplest is to manually copy the numbers and enum values into Java, but in my opinion the safest method is to write C JNI wrappers and create static variables in Java that call them, as sketched below.
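
A minimal sketch of such a wrapper, registered the same way as the system call wrappers above (the ioringOpRead name is illustrative, not jasyncfio's actual API):

// The value comes straight from io_uring.h, so the Java side can never
// drift out of sync with the kernel headers we compiled against.
static jint java_ioring_op_read(JNIEnv *env, jclass clazz) {
    return IORING_OP_READ;
}

static JNINativeMethod constants_table[] = {
    {"ioringOpRead", "()I", (void *) java_ioring_op_read},
};

On the Java side, a static final field then captures the value once during class initialization by calling the native method.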

Working with C structures from Java

This is the trickiest part: we need to understand how to work with C structures in Java code. The first and most obvious approach that comes to mind is to move all the work with io_uring_sqe and io_uring_cqe into JNI. When we want to populate the io_uring_sqe structure, we simply call the appropriate JNI method, which retrieves an element from the sqes queue and fills in all the necessary fields, roughly as in the sketch below. The same approach works for io_uring_cqe: after we find the required io_uring_cqe element, we copy its values into the Java code. And we mustn't forget about memory synchronization!
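
A hypothetical sketch of what such a JNI method could look like (this is not jasyncfio's actual code; sqes, sq_tail and sq_ring_mask are assumed to be globals saved during ring setup, and the sq array indirection is omitted for brevity):

static void java_prep_read(JNIEnv *env, jclass clazz, jint fd, jlong buf_addr,
                           jint len, jlong offset, jlong user_data) {
    // Grab the next free sqe and fill in every field needed for a read.
    struct io_uring_sqe *sqe = &sqes[*sq_tail & *sq_ring_mask];
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READ;
    sqe->fd = fd;
    sqe->addr = (unsigned long) buf_addr;
    sqe->len = len;
    sqe->off = offset;
    sqe->user_data = user_data;
    (*sq_tail)++; // a real implementation must publish the tail with a release barrier
}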

But JNI has some overhead which, ideally, we would like to avoid, since the whole point of using io_uring is performance. How can we do this when there is no standard way in Java to work with C structures directly?

To address this issue, let's try to understand what working with structures looks like from the computer's point of view. To do that, let's open Compiler Explorer and write the following code:

struct sample {
    int a;
    int b;
};
void write_struct() {
    struct sample s;
    s.a = 1;
    s.b = 2;    
}

Let’s take a look at the output produced by the GCC compiler:

write_struct:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-8], 1
        mov     DWORD PTR [rbp-4], 2
        nop
        pop     rbp
        ret

As we can see, there are no structures at the assembly level; the compiler simply generates memory writes at specific offsets. Let’s take advantage of this knowledge and write the same code in Java.

First, let’s get access to the pointers for the sqes and cqes queues in the Java code. We are going to represent the pointers as regular long variables. To do this, we’ll have to write a bit of JNI code. Let’s write a method that, after creating io_uring and allocating the queues, will copy all the necessary pointers into arrays. I also came across this approach in the Netty project:

jobjectArray setup_io_uring(JNIEnv *env) {
    setup_iouring(...);

    jclass longArrayClass = (*env)->FindClass(env, "[J");
    // create an array of two long[] arrays, one for the sq and one for the cq
    jobjectArray array = (*env)->NewObjectArray(env, 2, longArrayClass, NULL);

    // initialize the first array
    jlong submissionArrayElements[] = {
        (jlong) ring.sq.khead,
        (jlong) ring.sq.ktail,
        (jlong) ring.sq.kring_mask,
        // ...
    };
    jlongArray submissionArray = (*env)->NewLongArray(env, 10);
    (*env)->SetLongArrayRegion(env, submissionArray, 0, 10, submissionArrayElements);

    // initialize the second array
    jlong completionArrayElements[] = {
        (jlong) ring.cq.khead,
        (jlong) ring.cq.ktail,
        (jlong) ring.cq.kring_mask,
        // ...
    };
    jlongArray completionArray = (*env)->NewLongArray(env, 10);
    (*env)->SetLongArrayRegion(env, completionArray, 0, 10, completionArrayElements);

    (*env)->SetObjectArrayElement(env, array, 0, submissionArray);
    (*env)->SetObjectArrayElement(env, array, 1, completionArray);
    return array;
}

Now we have all the necessary pointers for working with C structures from Java code!

Here is a code example from the jasyncfio library, demonstrating how to work with io_uring_sqe:

long sqe = submissionArrayQueueAddress + (tail++ & ringMask) * SQE_SIZE; // equivalent to &sqes[tail & ring_mask] in C
MemoryUtils.putByte(sqe + SQE_OP_CODE_FIELD, op); // write each field at its offset, acting like the compiler
MemoryUtils.putByte(sqe + SQE_FLAGS_FIELD, (byte) flags);
MemoryUtils.putInt(sqe + SQE_FD_FIELD, fd);
MemoryUtils.putLong(sqe + SQE_OFFSET_FIELD, offset);
// ... and so forth

Here submissionArrayQueueAddress is the address of the sqes queue, sqe is the base address of the io_uring_sqe element within the queue, and the constants are the offsets used to access the corresponding fields of the structure. To calculate the correct offsets, you need to take the padding in structures into account. I wrote all the offsets manually, but generally speaking, it would be safer to write JNI wrappers that return the offset of each field in the structure for the OS and CPU architecture being used, as sketched below.
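
Such a wrapper is a one-liner thanks to the standard offsetof macro (the method name here is illustrative):

#include <stddef.h>
#include <linux/io_uring.h>

// Returns the offset of the user_data field within struct io_uring_sqe for
// the current OS, CPU architecture and compiler, with padding accounted for.
static jint java_sqe_user_data_offset(JNIEnv *env, jclass clazz) {
    return (jint) offsetof(struct io_uring_sqe, user_data);
}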

We can work with io_uring_cqe in a similar way:

long cqeAddress = completionArrayQueueAddress + (head++ & ringMask) * CQE_SIZE;
long userData = MemoryUtils.getLong(cqeAddress + CQE_USER_DATA_FIELD);
int res = MemoryUtils.getInt(cqeAddress + CQE_RES_FIELD);
int flags = MemoryUtils.getInt(cqeAddress + CQE_FLAGS_FIELD);

The MemoryUtils class is simply a wrapper around the Unsafe class. As an example, here is the implementation of the getLong method:

public static long getLong(long address) {
    return unsafe.getLong(address);
}

That wraps up the low-level part of the library! Now we have access to the Linux Kernel io_uring API from Java code, and all that's left is to wrap it in an API that's convenient for Java programmers. A native API for Kotlin will be implemented separately, because there is potential to use Kotlin Multiplatform, and Kotlin Coroutines make it easy to write asynchronous code.

Java API for working with io_uring

When I was developing the jasyncfio API, I aimed to make it feel natural for Java developers. io_uring is an asynchronous interface, so I use the CompletableFuture class to work with it, as there isn't a more convenient API in the Java standard library. Internally, the library implements an event loop for managing events. I won't go into the details of the event loop implementation, but I do want to discuss the solution to a particularly interesting problem - parking and resuming the thread that processes the queues.

The event loop implementation assumes that the thread can be parked to reduce CPU load when there are no events to process, and then resumed when events arrive from either the user or the kernel through io_uring. To solve this problem, I use io_uring itself along with an event file descriptor. The event file descriptor (eventfd) is a Linux Kernel API for creating a special file descriptor capable of generating events, including for io_uring.
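
Stripped of the io_uring part, the eventfd mechanics look roughly like this in C (a minimal sketch; eventfd values are always 8-byte integers):

#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

int efd = eventfd(0, 0); // create the counter-backed file descriptor

// A read completes once the counter is non-zero and resets it to zero;
// with io_uring, an IORING_OP_READ on efd completes the same way.
uint64_t value;
read(efd, &value, sizeof(value));

// A write from any thread increments the counter and wakes up the reader.
uint64_t one = 1;
write(efd, &one, sizeof(one));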

At the very beginning of the event loop, we register the event file descriptor in io_uring:

private void run() {
    addEventFdRead();
    resetSleepTimeout();
    // ...
}        

The implementation of the addEventFdRead() method simply adds a read operation to the Submission Queue:

executeCommand(Command.read(
    eventFd,
    0,
    8,
    eventFdBuffer,
    PollableStatus.NON_POLLABLE,
    this,
    eventFdReadResultProvider
));

When it's time to park the event loop thread, we call the submitTasksAndWait() method:

while (true) {
    try {
        state.set(WAIT);
        if (canSleep()) {
            if (sleepTimeout()) {
                submitTasksAndWait(); // here we park the thread
                //...
            }
        }
        //...
    } catch (Throwable t) {
        // error handling
    }
}

The submitTasksAndWait() method calls io_uring_enter() with the min_complete argument set to one.
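
In C terms, that call boils down to the following (a sketch; ring_fd is from the setup example, to_submit is the number of new sqes, and the raw system call has no glibc wrapper, hence syscall(2)):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

// Submit the pending sqes and sleep until at least one completion arrives.
int ret = syscall(__NR_io_uring_enter, ring_fd, to_submit,
                  1 /* min_complete */, IORING_ENTER_GETEVENTS, NULL, 0);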

When it’s time to launch the event loop thread, we simply write to the event file descriptor:

void unpark() {
    Native.eventFdWrite(eventFd, 1L);
}

As a result, an event is generated that causes the thread to wake up, return from the io_uring_enter() system call, and start the next iteration of the event loop. Writing to the event file descriptor is used when a user request needs to be processed while the event loop thread is parked.

When io_uring finishes processing any previous request, it also causes the thread to return from the io_uring_enter() system call and kicks off the next iteration of the event loop.

It's interesting to note that this approach comes with its own set of pitfalls, specifically tied to io_uring. If we use the IORING_SETUP_IOPOLL flag while setting up an io_uring instance, this approach won't work, because this flag changes the behaviour of the io_uring_enter() system call, which we rely on to park the thread.

To address this issue, jasyncfio offers two event loop variants. The first parks the thread when needed and processes requests that io_uring does not support with the IORING_SETUP_IOPOLL flag (like open), so its io_uring is always created without that flag. The second handles requests that io_uring does support with the IORING_SETUP_IOPOLL flag (like read), and it is only created if the user explicitly specifies in the configuration that they want to use this flag.

The event loop makes it possible to implement an API based on CompletableFuture. I tried to make it as similar as possible to FileChannel from the Java standard library, with the only difference being that FileChannel methods return the result of the operation immediately, while jasyncfio methods return a CompletableFuture with the result.

jasyncfio API overview

Let me give you a few examples of working with the library.

First and foremost, we need to initialize the EventExecutor, which encapsulates the event loop and everything related to io_uring:

EventExecutor eventExecutor = EventExecutor.initDefault();

The initDefault() method initializes EventExecutor with reasonable default values, and internally a single io_uring instance is created. For customization, you can use the builder() method:

EventExecutor eventExecutor = EventExecutor.builder()
                .entries(128)
                .ioRingSetupIoPoll() // this parameter creates two io_urings, which I wrote about earlier
                .ioRingSetupSqPoll(1000)
                .build();

After creating the EventExecutor, everything is set and ready for working with files:

CompletableFuture<AsyncFile> asyncFile = AsyncFile.open(filePath, eventExecutor, OpenOption.READ_WRITE);
AsyncFile file = asyncFile.get();
ByteBuffer buffer = ByteBuffer.allocateDirect(4096);
CompletableFuture<Integer> readCompletableFuture = file.read(buffer);

Please note that only a DirectByteBuffer can be used. This is a deliberate decision and a departure from the standard Java library, which accepts any type of buffer. I did it because I don't like how the standard library conceals additional overhead: if the buffer passed to a standard library call is not a DirectByteBuffer, a DirectByteBuffer is allocated anyway, the read is performed into it, and the data is then copied into the user-provided buffer. This creates hidden overhead, which is unacceptable for a high-performance IO library.

Benchmarks

I will use the results of the fio benchmark as reference values. fio ships a handy script for io_uring called one-core-peak.sh, which configures io_uring optimally for achieving the maximum number of IOPS in the current environment. I have a laptop with an AMD Ryzen 7 4800H CPU, 32GB RAM and a Samsung SSD 970 EVO Plus 500GB, running Ubuntu 22.04 with Linux Kernel version 6.2.14-060214-generic.

I have the following results for my PC (with poll_queues enabled):

io_uring: Running taskset -c 0,1 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2  /dev/nvme0n1

IOPS=329.81K, BW=161MiB/s, IOS/call=31/31
IOPS=329.51K, BW=160MiB/s, IOS/call=31/32
IOPS=330.12K, BW=161MiB/s, IOS/call=31/31

Now let's have a look at the results of the jasyncfio benchmark. For my own convenience and that of library users, I've written a program in Java that, I believe, performs a benchmark similar to what fio does. Please note that registration of file descriptors and of the io_uring file descriptor is currently not supported.

Let’s run the jasyncfio benchmark with a configuration that is closest to fio:

java -jar benchmark/build/libs/benchmark-1.0-SNAPSHOT.jar -b512 -d128 -c32 -s32 -p=true -O=true -w2 /dev/nvme0n1

IOPS=288736, BW=140MiB/s, IOS/call=32/31
IOPS=286624, BW=139MiB/s, IOS/call=32/31
IOPS=285728, BW=139MiB/s, IOS/call=32/31

The difference is about 15%. The problem is that I have not yet found an optimal way to work with io_uring when the IORING_SETUP_IOPOLL flag is used. If we remove the poll (-p) flag from the benchmark configuration, the results are as follows:

java -jar benchmark/build/libs/benchmark-1.0-SNAPSHOT.jar -b512 -d128 -c32 -s32 -p=false -O=true -w2 /dev/nvme0n1

IOPS=320672, BW=156MiB/s, IOS/call=32/31
IOPS=316672, BW=154MiB/s, IOS/call=32/32
IOPS=315232, BW=153MiB/s, IOS/call=32/32

315 thousand IO operations per second on an SSD!

Now let's run the same configuration for fio:

taskset -c 0,1 t/io_uring -b512 -d128 -c32 -s32 -p0 -F0 -B0 -n2  /dev/nvme0n1

IOPS=328.00K, BW=160MiB/s, IOS/call=32/32
IOPS=328.80K, BW=160MiB/s, IOS/call=32/31
IOPS=329.47K, BW=160MiB/s, IOS/call=32/32

Now the difference between Java and C is less than 5%! io_uring is an amazing IO API that allows for extracting almost the maximum performance from the hardware’s IO subsystem with minimal CPU and memory usage.



Epilogue

I hope you found it interesting to delve into the workings of this new, fully asynchronous, high-performance Linux Kernel IO API, and into my Java library for it. I encourage you to try jasyncfio and share your results!

Author: Ilya Korennoy, editor: Artem Zinnatullin.