Adding io_uring to Java
For the past two years, I have occasionally been studying file IO and developing my own Java
library for using io_uring,
called jasyncfio. In this article, I want to share my experience and insights.
Contents:
- About io_uring
- Adding io_uring to Java, JNI and structures
- Java API for working with io_uring
- Epilogue
About io_uring
io_uring is a relatively new API in the Linux Kernel, introduced in version 5.1. io_uring was built with the idea of high-performance asynchronous input/output (IO) for both files and network sockets. In this article I will only cover the file API, and I might explore network IO at a later time. Before io_uring, Linux had only one asynchronous API for file IO - POSIX AIO, which has several limitations that make it unsuitable for widespread use.
io_uring API
io_uring is based on two queues that are shared between the kernel and user space memory: the Submission Queue (sq) and the Completion Queue (cq). The user writes an operation request into the Submission Queue, and the kernel writes the result of the operation into the Completion Queue. The user then needs to read and process this result.
This approach allows the program to make requests for multiple IO operations to the kernel with a single system call, and correspondingly, the kernel can return the results of multiple IO operations through the appropriate queue.
The io_uring interface is located in the header file io_uring.h.
In addition to constants, the file contains two main structures for working with queues - io_uring_sqe
(Submission Queue Entry) and io_uring_cqe
(Completion Queue Entry).
Three new system calls were also added to the Linux Kernel API:
- io_uring_setup - allows you to create a context for working with io_uring,
- io_uring_enter - allows you to notify the kernel and start operations,
- io_uring_register - allows you to register files and buffers in the kernel for more efficient use with io_uring.
You can find more detailed information about io_uring in an excellent article by the API's author, Jens Axboe - Efficient IO with io_uring.
Adding io_uring to Java, JNI and structures
To determine what we need to add to Java
, first we need to familiarize ourselves with how interaction with io_uring
occurs. As I mentioned earlier, the file io_uring.h
defines two structures - io_uring_sqe
and io_uring_cqe
, and io_uring itself consists of two queues containing elements
of these structures. To request the kernel to perform an operation, you need to populate the appropriate fields of the
io_uring_sqe
structure. To process the result of the operation, you need to read the fields of the io_uring_cqe
structure.
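For reference, here is roughly what the relevant parts of these structures look like (an abbreviated sketch based on io_uring.h; unions and several fields are omitted):

```c
#include <linux/types.h>

/* Abbreviated sketch of the two structures from io_uring.h. */
struct io_uring_sqe {
    __u8  opcode;     /* operation type, e.g. IORING_OP_READ       */
    __u8  flags;      /* IOSQE_* flags                             */
    __u16 ioprio;     /* IO priority                               */
    __s32 fd;         /* file descriptor to operate on             */
    __u64 off;        /* offset within the file                    */
    __u64 addr;       /* pointer to the buffer                     */
    __u32 len;        /* buffer length                             */
    /* ... op-specific flags, user_data, buf_index, padding ...    */
};

struct io_uring_cqe {
    __u64 user_data;  /* value copied from the matching sqe        */
    __s32 res;        /* operation result, like a syscall return   */
    __u32 flags;      /* completion flags                          */
};
```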
io_uring configuration
Before we can start working with io_uring, it is necessary to properly configure the required context, which is a rather
complex process. For this, we declare a variable of the io_uring_params
structure type (defined in the io_uring.h
header file), set the flags that we need, and make the io_uring_setup()
system call, passing in the required queue
size and the structure with parameters. If all goes well, you’ll get back an io_uring file descriptor.
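A minimal C sketch of this step (error handling omitted; since the C library does not expose io_uring_setup, the raw system call is used):

```c
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int setup_ring(unsigned entries)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    /* Set the flags we need here, e.g. params.flags |= IORING_SETUP_IOPOLL; */

    /* glibc has no wrapper, so we invoke the raw system call. */
    int ring_fd = (int) syscall(__NR_io_uring_setup, entries, &params);
    /* On success, ring_fd is the io_uring file descriptor and
       params now contains the offsets of the queues. */
    return ring_fd;
}
```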
Next comes the most challenging part of the configuration process - we need to allocate memory for the queues using mmap(), making sure the sizes are calculated correctly for the number of entries the io_uring file descriptor was created with. For the jasyncfio library, I took this code from the C library liburing.
After all these manipulations, we end up with an io_uring that is ready to use.
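A rough sketch of the mmap() step, following the liburing approach (error handling and the IORING_FEAT_SINGLE_MMAP optimization are omitted):

```c
#include <linux/io_uring.h>
#include <sys/mman.h>

struct rings {
    void *sq_ring;               /* SQ ring metadata: head, tail, mask, index array */
    void *cq_ring;               /* CQ ring metadata plus the cqe array             */
    struct io_uring_sqe *sqes;   /* the array of submission entries                 */
};

void map_rings(int ring_fd, struct io_uring_params *p, struct rings *r)
{
    /* Sizes depend on the number of entries the ring was created with. */
    size_t sq_size  = p->sq_off.array + p->sq_entries * sizeof(unsigned);
    size_t cq_size  = p->cq_off.cqes  + p->cq_entries * sizeof(struct io_uring_cqe);
    size_t sqe_size = p->sq_entries * sizeof(struct io_uring_sqe);

    r->sq_ring = mmap(NULL, sq_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
    r->cq_ring = mmap(NULL, cq_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    r->sqes    = mmap(NULL, sqe_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQES);
}
```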
Example of a file read operation using io_uring
Now, let’s take a file read operation as an example to see what we need to implement in the Java
library. To start, let’s
retrieve the next available element from the sqes
queue, so we can populate it with all necessary parameters for the IO
operation we want to execute. This is done in the following way: during the configuration of io_uring, we calculate
a special mask called ring_mask
, which is used to obtain the next available io_uring_sqe
element:
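In C, this could look roughly like the following simplified sketch (error handling omitted):

```c
#include <linux/io_uring.h>

/* Simplified sketch: the ring mask and tail live in the mapped SQ ring memory. */
struct io_uring_sqe *get_sqe(struct io_uring_sqe *sqes,
                             unsigned *sq_tail, unsigned ring_mask)
{
    unsigned tail = *sq_tail;                            /* local copy of the tail index */
    struct io_uring_sqe *sqe = &sqes[tail & ring_mask];  /* wrap around with the mask    */
    return sqe;                                          /* the new tail is published later */
}
```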
Here I deliberately omit error-handling details; for now, we just need to get the main idea. When we have a pointer to the
next io_uring_sqe
element in the queue, we need to populate the fields that are necessary for our operation. In the case
of a read operation, the C
code would look like this:
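A simplified sketch of such code (field names follow io_uring.h; publishing the new tail and error handling are omitted):

```c
#include <linux/io_uring.h>
#include <string.h>

/* Fill a submission entry: "read len bytes from fd at offset off into buf". */
void prep_read(struct io_uring_sqe *sqe, int fd, void *buf,
               unsigned len, __u64 off, __u64 user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode    = IORING_OP_READ;       /* the operation we want the kernel to run */
    sqe->fd        = fd;                   /* file descriptor to read from            */
    sqe->addr      = (unsigned long) buf;  /* destination buffer                      */
    sqe->len       = len;                  /* how many bytes to read                  */
    sqe->off       = off;                  /* offset within the file                  */
    sqe->user_data = user_data;            /* copied into the matching cqe            */
}
```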
jasyncfio
provides the ability to write data into a C
structure from Java
code. Then we need to call
io_uring_enter()
to notify the kernel that the submission queue is not empty. After the operation is completed, the
kernel adds the io_uring_cqe
element to the cqes
queue, which operates in a manner similar to sqes
:
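A simplified sketch of draining the completion queue (handle_completion is a hypothetical callback; real code also needs the memory barriers discussed below):

```c
#include <linux/io_uring.h>

void handle_completion(__u64 user_data, __s32 res); /* hypothetical callback */

/* Process all completions the kernel has published so far. */
void drain_cqes(struct io_uring_cqe *cqes, unsigned *cq_head,
                unsigned cq_tail, unsigned cq_ring_mask)
{
    unsigned head = *cq_head;
    while (head != cq_tail) {                            /* entries are pending   */
        struct io_uring_cqe *cqe = &cqes[head & cq_ring_mask];
        handle_completion(cqe->user_data, cqe->res);     /* res is the op result  */
        head++;
    }
    *cq_head = head;  /* tell the kernel these entries have been consumed */
}
```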
To process the result, we need to read the fields of the io_uring_cqe
structure.
It is important to note that, in addition to simple reading and writing from the sqes and cqes queues, it is essential to employ the correct memory synchronization primitives. Failure to do so can lead to unpredictable errors, such as the kernel only seeing a partially written io_uring_sqe structure! For Java code, jasyncfio does all the heavy lifting.
In the next chapter, we will take a closer look at the approaches that can be used to interact with io_uring from Java
code.
System calls and constants
Unfortunately, there is no easy way of working with the C
world from Java
code yet. I use ‘yet’ because JDK 21 is
expected to feature the release of Project Panama, which aims to simplify this task
and make the interaction both safer and faster. But since there is no stable version of JDK with Panama yet, we will have
to resort to the good old JNI and employ some tricks that I have picked up from other projects.
Let’s start with the io_uring system calls. Here, everything is simple (well, almost: the necessary system calls are not exposed by the C library, so we first need to write C code that invokes them directly; that code is beyond the scope of this article, but you can find it in jasyncfio) - for each system call, we need to write a JNI
wrapper. In my library I am using an unconventional way of working with JNI, which I found in the
Netty project. I liked it because it allows one to get rid of pesky JNIEXPORT
and
JNICALL
macros, as well as avoid using the special method naming convention, which is far from the typical C
style.
Example:
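A sketch of what this style looks like (class, method and signature names below are illustrative, not the exact ones from jasyncfio):

```c
#include <jni.h>
#include <sys/syscall.h>
#include <unistd.h>

/* An ordinary static function: no JNIEXPORT/JNICALL, no Java_pkg_Class_method name. */
static jint io_uring_enter0(JNIEnv *env, jclass clazz, jint ring_fd,
                            jint to_submit, jint min_complete, jint flags)
{
    return (jint) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                          min_complete, flags, NULL, 0);
}

/* The binding between Java method names and C functions. */
static JNINativeMethod methods[] = {
    /* Java name,     JNI signature, C function pointer */
    { "ioUringEnter", "(IIII)I",     (void *) io_uring_enter0 },
};

jint JNI_OnLoad(JavaVM *vm, void *reserved)
{
    JNIEnv *env;
    if ((*vm)->GetEnv(vm, (void **) &env, JNI_VERSION_1_8) != JNI_OK)
        return JNI_ERR;
    /* "com/example/Native" is an illustrative class name. */
    jclass clazz = (*env)->FindClass(env, "com/example/Native");
    (*env)->RegisterNatives(env, clazz, methods,
                            sizeof(methods) / sizeof(methods[0]));
    return JNI_VERSION_1_8;
}
```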
Once we have dealt with the system calls, we need to employ a similar approach for incorporating the io_uring constants.
The supported operations are represented in the io_uring_op
enum in the io_uring.h
file, and the constants for
io_uring_setup()
are represented as numbers in the same file. We can use different approaches. The simplest approach is
to manually transfer the numbers and constants from the enum
, but in my opinion, the safest method involves writing C
JNI wrappers and creating static variables in Java
that call these wrappers.
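A sketch of that second approach (class and method names are illustrative): each constant is exposed through a tiny native getter and cached in a Java static field, so the values always match the io_uring.h the native library was compiled against.

```java
// Illustrative sketch, not the jasyncfio source.
final class UringConstants {
    // Hypothetical native getters implemented in the C part of the library.
    private static native byte ioRingOpRead();
    private static native int ioRingSetupIoPoll();

    static final byte IORING_OP_READ = ioRingOpRead();
    static final int IORING_SETUP_IOPOLL = ioRingSetupIoPoll();

    private UringConstants() {
    }
}
```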
Working with C structures from Java
This is the trickiest part. We need to understand how to work with C
structures in Java
code. The first and most
obvious approach that comes to mind is to move all the work with io_uring_sqe
and io_uring_cqe
into JNI. When we
want to populate the io_uring_sqe
structure, we simply call the appropriate JNI method, which retrieves an element
from the sqes
queue and fills all the necessary fields. The same approach works for io_uring_cqe
. After we find the
required io_uring_cqe
element, we need to copy its values into the Java
code, and we mustn’t forget about memory
synchronization!
But JNI has some overhead that we would ideally like to avoid, since the whole point of using io_uring is performance. How can we do this when there is no standard way in Java
to directly
work with C
structures?
To address this issue, let’s try to understand what working with structures looks like from the computer’s point of view. To do that, let’s use Compiler Explorer and write the following code:
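For instance, a struct that mimics the first fields of io_uring_sqe, and a function that fills a couple of them:

```c
/* A small example to paste into Compiler Explorer. */
struct sqe_like {
    unsigned char      opcode;
    unsigned char      flags;
    unsigned short     ioprio;
    int                fd;
    unsigned long long off;
};

void fill(struct sqe_like *sqe, int fd, unsigned long long off)
{
    sqe->opcode = 22; /* some opcode value */
    sqe->fd     = fd;
    sqe->off    = off;
}
```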
Let’s take a look at the output produced by the GCC compiler:
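For the fill() function above, GCC with -O2 emits roughly the following on x86-64 (approximate listing):

```asm
fill:
        mov     BYTE PTR [rdi], 22      # sqe->opcode: store at offset 0
        mov     DWORD PTR [rdi+4], esi  # sqe->fd:     store at offset 4
        mov     QWORD PTR [rdi+8], rdx  # sqe->off:    store at offset 8
        ret
```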
As we can see, there are no structures at the assembly level; the compiler simply generates memory writes at specific
offsets. Let’s take advantage of this knowledge and write the same code in Java
.
First, let’s get access to the pointers for the sqes
and cqes
queues in the Java
code. We are going to represent the
pointers as regular long
variables. To do this, we’ll have to write a bit of JNI code. Let’s write a method that,
after creating io_uring and allocating the queues, will copy all the necessary pointers into arrays. I also came across
this approach in the Netty project:
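A condensed sketch of that idea (in the real code this happens inside the JNI function that also performs the setup and mmap steps; names here are illustrative):

```c
#include <jni.h>
#include <linux/io_uring.h>
#include <stdint.h>

/* Copy the addresses the Java side needs into a long[] it passed in.
   sq_ring_ptr/cq_ring_ptr/sqes are the pointers returned by mmap(),
   p holds the offsets filled in by io_uring_setup(). */
static void export_ring_pointers(JNIEnv *env, jlongArray out,
                                 void *sq_ring_ptr, void *cq_ring_ptr,
                                 struct io_uring_sqe *sqes,
                                 struct io_uring_params *p)
{
    jlong values[] = {
        (jlong) (uintptr_t) ((char *) sq_ring_ptr + p->sq_off.head),
        (jlong) (uintptr_t) ((char *) sq_ring_ptr + p->sq_off.tail),
        (jlong) (uintptr_t) ((char *) sq_ring_ptr + p->sq_off.ring_mask),
        (jlong) (uintptr_t) sqes,
        (jlong) (uintptr_t) ((char *) cq_ring_ptr + p->cq_off.head),
        (jlong) (uintptr_t) ((char *) cq_ring_ptr + p->cq_off.tail),
        (jlong) (uintptr_t) ((char *) cq_ring_ptr + p->cq_off.ring_mask),
        (jlong) (uintptr_t) ((char *) cq_ring_ptr + p->cq_off.cqes),
    };
    (*env)->SetLongArrayRegion(env, out, 0,
                               sizeof(values) / sizeof(values[0]), values);
}
```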
Now we have all necessary pointers for working with C
structures from Java
code!
Here is a code example from the jasyncfio
library, demonstrating how to work with io_uring_sqe
:
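The snippet below is a condensed illustration of that code rather than the exact jasyncfio source; the offset constants assume the io_uring_sqe layout from io_uring.h, and the MemoryUtils put* helpers are assumed counterparts of the getLong method shown later:

```java
// Condensed illustration; helper names are illustrative.
final class SubmissionQueueSketch {
    private static final int SQE_SIZE = 64;
    private static final int OPCODE_OFFSET = 0;          // __u8  opcode
    private static final int FD_OFFSET = 4;              // __s32 fd
    private static final int OFFSET_OFFSET = 8;          // __u64 off
    private static final int BUFFER_ADDRESS_OFFSET = 16; // __u64 addr
    private static final int LENGTH_OFFSET = 24;         // __u32 len
    private static final int USER_DATA_OFFSET = 32;      // __u64 user_data

    private final long submissionArrayQueueAddress; // base address of the sqes queue
    private final int ringMask;
    private int tail;

    SubmissionQueueSketch(long submissionArrayQueueAddress, int ringMask) {
        this.submissionArrayQueueAddress = submissionArrayQueueAddress;
        this.ringMask = ringMask;
    }

    void addRead(byte readOpCode, int fd, long bufferAddress, int length, long offset, long userData) {
        // Base address of the next free io_uring_sqe element within the queue.
        long sqe = submissionArrayQueueAddress + (long) (tail++ & ringMask) * SQE_SIZE;
        MemoryUtils.putByte(sqe + OPCODE_OFFSET, readOpCode);
        MemoryUtils.putInt(sqe + FD_OFFSET, fd);
        MemoryUtils.putLong(sqe + OFFSET_OFFSET, offset);
        MemoryUtils.putLong(sqe + BUFFER_ADDRESS_OFFSET, bufferAddress);
        MemoryUtils.putInt(sqe + LENGTH_OFFSET, length);
        MemoryUtils.putLong(sqe + USER_DATA_OFFSET, userData);
        // In real code the new tail must then be published with a release store.
    }
}
```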
Here, submissionArrayQueueAddress is the address of the sqes queue, sqe
is the base address of the
io_uring_sqe
structure element within the queue, and the constants are offsets used to access the corresponding elements
of this structure. To calculate the correct offsets, you need to take into account the padding in structures. I manually
wrote all the offsets, but generally speaking, it would be safer to write JNI wrappers that would return the offset of
each element in the structure, depending on the OS and CPU architecture being used.
We can work with io_uring_cqe
in a similar way:
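A similarly condensed illustration for the completion side (io_uring_cqe is 16 bytes: user_data at offset 0, res at 8, flags at 12; helper names are again illustrative):

```java
// Condensed illustration; helper names are illustrative.
final class CompletionQueueSketch {
    private static final int CQE_SIZE = 16;
    private static final int USER_DATA_OFFSET = 0;
    private static final int RES_OFFSET = 8;

    private final long completionArrayQueueAddress; // base address of the cqes queue
    private final int ringMask;
    private int head;

    CompletionQueueSketch(long completionArrayQueueAddress, int ringMask) {
        this.completionArrayQueueAddress = completionArrayQueueAddress;
        this.ringMask = ringMask;
    }

    void processCompletions(int tail) {
        while (head != tail) {
            long cqe = completionArrayQueueAddress + (long) (head++ & ringMask) * CQE_SIZE;
            long userData = MemoryUtils.getLong(cqe + USER_DATA_OFFSET);
            int res = MemoryUtils.getInt(cqe + RES_OFFSET);
            completeRequest(userData, res);
        }
        // In real code the new head must then be published with a release store.
    }

    private void completeRequest(long userData, int res) {
        // Look up the pending request by userData and complete it with res.
    }
}
```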
The MemoryUtils
class is simply a wrapper around the Unsafe
class. As an example, here is the implementation of the
getLong
method:
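In essence it is a one-line delegation to Unsafe; a sketch of the idea:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch of the MemoryUtils idea: a thin wrapper around sun.misc.Unsafe.
final class MemoryUtils {
    private static final Unsafe UNSAFE = loadUnsafe();

    static long getLong(long address) {
        return UNSAFE.getLong(address);
    }

    private static Unsafe loadUnsafe() {
        try {
            // The usual reflection trick to get hold of the Unsafe instance.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private MemoryUtils() {
    }
}
```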
That wraps up the low-level part of the library! Now we have access to the Linux Kernel API io_uring from Java
code,
and all that’s left is to wrap it in an API that’s convenient for Java
programmers. A native API for Kotlin
will be
implemented separately because there is potential to use Kotlin Multiplatform, and Kotlin Coroutines make it easy
to write asynchronous code.
Java API for working with io_uring
When I was developing the jasyncfio
API, I aimed to make it feel natural for Java
developers. io_uring is an
asynchronous interface, so I use the CompletableFuture
class to work with it, as there isn’t a more
convenient API in the Java
standard library. Internally, the library implements an event loop for managing events. I won’t go
into the details of the event loop implementation, but I do want to discuss the solution to a particularly interesting
problem - parking and resuming the thread that processes the queues.
The event loop implementation assumes that the thread can be parked to reduce CPU load when there are no events to process, and then resumed when events arrive from either the user or the kernel through io_uring. To solve this problem, I am using io_uring itself along with an event file descriptor (eventfd). The event file descriptor is a Linux Kernel API that allows the creation of a special file descriptor capable of generating events, including for io_uring.
At the very beginning of the event loop, we register the event file descriptor in io_uring:
The implementation of the addEventFdRead()
method simply adds a read
operation to the Submission Queue:
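Conceptually, it boils down to something like this (field and helper names are illustrative, not the actual jasyncfio code):

```java
// Sketch: enqueue a read of 8 bytes from the event file descriptor, so that a write
// to the eventfd produces a completion and wakes up io_uring_enter().
private void addEventFdRead() {
    submissionQueue.addRead(
            IORING_OP_READ,           // opcode
            eventFd,                  // the event file descriptor
            eventFdReadBufferAddress, // 8-byte native buffer for the eventfd counter
            8,                        // eventfd values are always 8 bytes
            0,                        // offset is ignored for an eventfd
            EVENT_FD_USER_DATA);      // marker to recognize this completion later
}
```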
When it’s time to stop the event loop thread, we call the submitTasksAndWait()
method:
The submitTasksAndWait()
method calls io_uring_enter()
with the min_complete
argument set to one.
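In terms of the underlying system call, the parking boils down to roughly this (wrapper and field names are illustrative):

```java
// Sketch: submit whatever is pending and block until at least one completion arrives.
private void submitTasksAndWait() {
    // Native.ioUringEnter is an illustrative JNI wrapper around io_uring_enter().
    Native.ioUringEnter(ringFd, pendingSqes, 1 /* min_complete */, IORING_ENTER_GETEVENTS);
}
```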
When it’s time to launch the event loop thread, we simply write to the event file descriptor:
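Writing any non-zero 8-byte value (for example 1) to the eventfd completes the pending read; a sketch, with an illustrative native wrapper name:

```java
// Sketch: wake up the event loop thread by completing the pending eventfd read.
void unpark() {
    Native.eventFdWrite(eventFd, 1L); // illustrative JNI wrapper around eventfd_write()
}
```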
As a result, an event will be generated that causes the thread to wake up, return from the io_uring_enter()
system
call, and initiate the next iteration of the event loop. Writing to the event file descriptor is employed when there
is a need to process a user request, but the event loop thread is stopped.
When io_uring finishes processing any previous request, it also causes the thread to return from the io_uring_enter()
system call and kicks off the next iteration of the event loop.
It’s interesting to note that this approach comes with its own set of pitfalls, specifically tied to io_uring. If we
use the IORING_SETUP_IOPOLL
flag while setting up an io_uring instance, this approach won’t work, because this
flag changes the behaviour of the io_uring_enter()
system call, which we rely on when we want to park the thread.
To address this issue, jasyncfio offers two event loop variants. One of them blocks the thread when needed and processes requests that are not supported by io_uring when the IORING_SETUP_IOPOLL flag is used (such as open), so its io_uring instance is always created without the IORING_SETUP_IOPOLL flag. The second one handles requests that are supported by io_uring when the IORING_SETUP_IOPOLL flag is used (such as read), but it is only created if the user explicitly specifies in the configuration that they want to use this flag.
The event loop makes it possible to implement an API using CompletableFuture
. I tried to make it as similar as
possible to FileChannel
from Java
standard library, with the only difference being that FileChannel
methods return
the result of the operation immediately, while jasyncfio
methods return a CompletableFuture
with the result.
jasyncfio API overview
Let me give you a few examples of working with the library.
First and foremost, we need to initialize the EventExecutor
, which encapsulates the event loop and everything related
to io_uring:
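In the simplest case this is a single call:

```java
// Create an event loop with reasonable defaults; a single io_uring instance is set up inside.
EventExecutor eventExecutor = EventExecutor.initDefault();
```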
The initDefault()
method initializes EventExecutor
with reasonable default values, and internally a single
io_uring instance is created. For customization, you can use the builder()
method:
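A sketch of a customized setup (the specific builder options shown are illustrative, not necessarily the exact jasyncfio method names):

```java
EventExecutor eventExecutor = EventExecutor.builder()
        // The options below are illustrative; consult the jasyncfio documentation for the real ones.
        .entries(4096)        // size of the io_uring queues
        .ioRingSetupIoPoll()  // ask for IORING_SETUP_IOPOLL
        .build();
```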
After creating the EventExecutor, everything is set and ready for working with files:
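A sketch of reading a file; the class and method names here are assumptions used to show the CompletableFuture-based flow rather than the exact jasyncfio API:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

// Illustrative flow: open a file, read into a direct buffer, get the result asynchronously.
ByteBuffer buffer = ByteBuffer.allocateDirect(4096); // only direct buffers are accepted

CompletableFuture<Integer> bytesRead =
        AsyncFile.open("/tmp/test-file", eventExecutor)  // AsyncFile.open is assumed here
                .thenCompose(file -> file.read(buffer)); // read() returns a CompletableFuture

bytesRead.thenAccept(n -> System.out.println("Read " + n + " bytes"));
```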
Please note that only DirectByteBuffer
is allowed to be used. This is a deliberate decision, and that is where it
differs from the standard Java
library, which accepts any type of buffer. I did it because I don’t like how the
standard library conceals the additional overhead. Inside a call to the standard Java
library, if a DirectByteBuffer
is not passed, a DirectByteBuffer
will be allocated regardless. The reading will be performed into this buffer,
and then the data will be copied to the user-provided buffer. This creates hidden overhead, which is unacceptable
for a high-performance IO library.
Benchmarks
I will use the results from the fio benchmark as reference values. In fio
, there is a
handy script called one-core-peak.sh
for io_uring, which configures io_uring optimally for achieving the maximum
number of IOPS in the current environment. I have a laptop with AMD Ryzen 7 4800H CPU, 32GB RAM and a Samsung SSD 970 EVO
Plus 500GB. It is running Ubuntu 22.04
with Linux Kernel version 6.2.14-060214-generic
.
I have the following results for my PC (with poll_queues enabled):
Now let’s have a look at the results of the jasyncfio benchmark. For my convenience and for library users, I’ve written
a program in Java
that, I believe, performs a benchmark similar to what fio
does. Please note that registration of file descriptors and of the io_uring file descriptor is currently not supported.
Let’s run the jasyncfio
benchmark with a configuration that is closest to fio
:
The difference is about -15%. The problem is that I have not yet found an optimal way to implement working with io_uring when using the IORING_SETUP_IOPOLL
flag. If we remove the poll (-p)
flag from the benchmark configuration,
then the results will be as follows:
315 thousand IO operations per second on an SSD!
Run the same configuration for fio
:
Now the difference between
Java
and C
is less than 5%! io_uring is an amazing IO API that allows for extracting almost the maximum performance from the hardware’s IO subsystem with minimal CPU and memory usage.
Discuss on Hacker News
Epilogue
I hope you found it interesting to delve into the workings of this new, fully asynchronous, high-performance Linux Kernel API for input/output, and into my Java
library for it. I encourage you to try jasyncfio and share your results!
- Library source code on GitHub - jasyncfio.
- My contacts for questions:
- Email: [email protected]
- Github: ikorennoy
Author: Ilya Korennoy, editor: Artem Zinnatullin.