io_uring
io_uring is a modern Linux I/O interface introduced in Linux 5.1 (2019) to address the limitations of traditional models like epoll, which has been the standard since Linux 2.6 (2003). Unlike epoll, which requires a syscall for every read and write, io_uring uses memory-mapped (mmap) submission and completion queues, allowing user space and the kernel to communicate while minimizing syscalls and avoiding unnecessary memory copies. This reduces context switching, lowers CPU overhead, and improves efficiency, making it ideal for high-throughput applications. While epoll follows a readiness-based model, where the application must perform the reads and writes itself after being notified, io_uring enables true asynchronous execution: the kernel performs the I/O in the background and notifies the application only when the operations are done.
epoll is still heavily used in web servers like Nginx, proxies like Envoy, and other network services that rely on efficient event-driven I/O, while io_uring is being adopted by high-performance applications that demand lower latency and higher throughput. epoll works only with pollable file descriptors such as sockets and pipes and cannot handle regular file I/O, whereas io_uring supports both network and file I/O. Though still evolving, io_uring is gaining adoption in projects like MariaDB, PostgreSQL, and NVMe storage stacks, making it a strong candidate for the future of Linux I/O.
How it works
io_uring is built around a shared ring buffer architecture that allows user space and the kernel to communicate efficiently without frequent syscalls. It consists of two main components:
- Submission Queue (SQ): the application places I/O requests (e.g. reads and writes) into this queue, avoiding direct syscalls for each request.
- Completion Queue (CQ): once the kernel finishes processing a request, it writes the result here, allowing the application to retrieve it asynchronously.
-----------------------------------------------------------------
|user space                                                     |
|                                                               |
|                                                               |
|     application produces           application consumes      |
|               |                              ^                |
|               v                              |                |
|       ------------------             ------------------      |
|------| submission queue |-----------| completion queue |------|
|       ------------------             ------------------      |
|               |                              ^                |
|               v                              |                |
|     kernel consumes -> exec syscalls -> kernel produces      |
|                                                               |
|kernel                                                         |
-----------------------------------------------------------------
Mostly lock-free
io_uring's ring buffer is mostly lock-free thanks to the single-producer, single-consumer (SPSC) design of the Submission Queue (SQ) and Completion Queue (CQ). These queues are memory-mapped (mmap), allowing direct access without syscalls. Instead of locks, atomic operations update each queue's head and tail pointers.
However, applications that access the SQ from multiple threads may still need their own synchronization to manage concurrent submissions safely.
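To make the head/tail protocol concrete, here is an illustrative user-space SPSC ring using C11 atomics. It mirrors the idea rather than the kernel's actual code; the ring_buf type, RING_SIZE, and function names are invented for this sketch.
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8 // power of two, so indices can be masked instead of mod'ed

struct ring_buf {
    _Atomic unsigned head; // advanced by the consumer
    _Atomic unsigned tail; // advanced by the producer
    int entries[RING_SIZE];
};

// producer side (like the application filling the SQ)
static bool ring_push(struct ring_buf *r, int v) {
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false; // ring is full
    r->entries[tail & (RING_SIZE - 1)] = v;
    // release ordering: the entry must be visible before the new tail
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

// consumer side (like the kernel draining the SQ)
static bool ring_pop(struct ring_buf *r, int *v) {
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false; // ring is empty
    *v = r->entries[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}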
// simplified liburing APIs
// main structure representing an io_uring instance
struct io_uring {
    int ring_fd;           // file descriptor for the io_uring instance
    struct io_uring_sq sq; // submission queue (SQ)
    struct io_uring_cq cq; // completion queue (CQ)
    // ...
};
// I/O submission data structure
// representing a Submission Queue Entry (SQE)
struct io_uring_sqe {
    __u8 opcode;  // operation type
    __u8 flags;   // submission flags
    __u16 ioprio; // I/O priority
    __s32 fd;     // file descriptor for the I/O operation
    // ...
};
// I/O completion data structure
// representing a Completion Queue Entry (CQE)
struct io_uring_cqe {
    __u64 user_data; // user-defined identifier for tracking requests
    __s32 res;       // result code of the I/O operation
    __u32 flags;     // completion flags
    // ...
};
// Submission Queue (SQ) metadata
struct io_uring_sq {
    unsigned *khead;           // kernel-managed head pointer
    unsigned *ktail;           // kernel-managed tail pointer
    struct io_uring_sqe *sqes; // array of SQEs
    // ...
};
// Completion Queue (CQ) metadata
struct io_uring_cq {
    unsigned *khead;           // kernel-managed head pointer
    unsigned *ktail;           // kernel-managed tail pointer
    struct io_uring_cqe *cqes; // array of CQEs
    // ...
};
// initializes an io_uring instance (allocates SQ and CQ)
// - entries: number of SQEs to allocate (power of 2 is recommended)
// - ring: pointer to the io_uring instance to initialize
// - flags: additional flags
// returns: 0 on success, negative error code on failure
int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags);
// cleans up and frees resources after using io_uring
// - ring: pointer to the io_uring instance to clean up
void io_uring_queue_exit(struct io_uring *ring);
// gets a free submission queue entry (sqe) to prepare an I/O request
// - ring: pointer to the io_uring instance
// returns: a pointer to an available sqe, or NULL if the queue is full
struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);
// prepares a read operation (reads data from a fd into a buffer)
// - sqe: the submission queue entry to be configured
// - fd: file descriptor to read from (must be opened before calling this)
// - buf: pointer to the buffer where data will be stored
// - nbytes: number of bytes to read
// - offset: position in the file (0 for sequential reading)
void io_uring_prep_read(struct io_uring_sqe *sqe, int fd, void *buf, unsigned nbytes, off_t offset);
// prepares a write operation (writes data from a buffer to a file descriptor)
// - sqe: the submission queue entry to be configured
// - fd: file descriptor to write to (must be opened before calling this)
// - buf: pointer to the buffer containing data to be written
// - nbytes: number of bytes to write
// - offset: position in the file (0 for sequential writing)
void io_uring_prep_write(struct io_uring_sqe *sqe, int fd, const void *buf, unsigned nbytes, off_t offset);
// submit all queued sqes to the kernel for execution
// - ring: pointer to the io_uring instance
// returns: number of submitted sqes or a negative error code
int io_uring_submit(struct io_uring *ring);
// waits for a cqe (blocks until at least one event completes)
// - ring: pointer to the io_uring instance
// - cqe: pointer to store the completed cqe
// returns: 0 on success, negative error on failure
int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe);
// non-blocking check for a completed I/O request
// - ring: pointer to the io_uring instance
// - cqe: pointer to store the completed cqe
// returns: 0 if a cqe is available, -EAGAIN if no cqe is available
int io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe);
// marks a cqe as processed so the kernel can reuse its slot
// - ring: pointer to the io_uring instance
// - cqe: the completed cqe to mark as seen
void io_uring_cqe_seen(struct io_uring *ring, struct io_uring_cqe *cqe);
// marks multiple cqes as processed
// - ring: pointer to the io_uring instance
// - nr: number of cqes to mark as processed
void io_uring_cq_advance(struct io_uring *ring, unsigned nr);
io_uring_cq_advance or io_uring_cqe_seen
- Use io_uring_cqe_seen() if you process one CQE at a time
- Use io_uring_cq_advance() when processing multiple CQEs in a batch for efficiency (see the sketch below)
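A minimal sketch of the batched variant (the same pattern the TCP server below uses); the drain_completions name, handle_cqe callback, and batch size of 32 are invented for illustration:
#include <liburing.h>

// drain all currently available completions, then release them in one step
static void drain_completions(struct io_uring *ring,
                              void (*handle_cqe)(struct io_uring_cqe *)) {
    struct io_uring_cqe *cqes[32]; // batch size is arbitrary
    unsigned n = io_uring_peek_batch_cqe(ring, cqes, 32);
    for (unsigned i = 0; i < n; i++)
        handle_cqe(cqes[i]);      // process each completion
    io_uring_cq_advance(ring, n); // mark all n CQEs as seen at once
}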
Demo
Setup
sudo apt update # update package list
sudo apt install pkg-config # pkg-config to manage library paths and dependencies
sudo apt install liburing-dev # io_uring: liburing.h headers, liburing.so shared lib
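To verify the install, pkg-config should now be able to locate liburing:
pkg-config --modversion liburing # prints the installed liburing version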
File I/O
// async file read with io_uring
#include <stdlib.h>
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FILE_PATH "test.txt"
#define BUFFER_SIZE 1024

int main() {
    struct io_uring ring;           // io_uring instance
    struct io_uring_sqe *sqe;       // pointer to an SQE
    struct io_uring_cqe *cqe;       // pointer to a CQE
    char buffer[BUFFER_SIZE] = {0}; // store data read from the file
    int ret;

    int fd = open(FILE_PATH, O_RDONLY); // open the file for reading
    if (fd < 0) {
        perror("failed to open file");
        return EXIT_FAILURE; // 1, not -1 (-1 becomes 255)
    }

    // init io_uring instance: queue depth = 8
    // liburing functions return -errno instead of setting errno,
    // so report errors with strerror(-ret) rather than perror
    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init failed: %s\n", strerror(-ret));
        close(fd);
        return EXIT_FAILURE;
    }

    // get a free submission queue entry (SQE) to submit a request
    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        fprintf(stderr, "failed to get SQE\n"); // not a syscall failure, so perror would not help
        io_uring_queue_exit(&ring);             // clean up io_uring instance
        close(fd);
        return EXIT_FAILURE;
    }

    // prepare the read operation
    io_uring_prep_read(sqe, fd, buffer, BUFFER_SIZE, 0);

    // submit the request to the kernel for execution
    ret = io_uring_submit(&ring);
    if (ret < 0) {
        fprintf(stderr, "io_uring_submit failed: %s\n", strerror(-ret));
        io_uring_queue_exit(&ring);
        close(fd);
        return EXIT_FAILURE;
    }

    // wait for completion of the request
    // this blocks until the I/O operation completes and a CQE is available
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0) {
        fprintf(stderr, "io_uring_wait_cqe failed: %s\n", strerror(-ret));
        io_uring_queue_exit(&ring);
        close(fd);
        return EXIT_FAILURE;
    }

    // check the result of the operation
    if (cqe->res < 0) { // a negative result is a negated errno value
        fprintf(stderr, "read failed: %s\n", strerror(-cqe->res));
    } else {
        // successfully read some data: print the byte count and the content
        printf("read %d bytes: %s\n", cqe->res, buffer);
    }

    // mark the completion queue event as processed
    io_uring_cqe_seen(&ring, cqe);

    // clean up resources
    io_uring_queue_exit(&ring);
    close(fd);
    return EXIT_SUCCESS; // 0 (success)
}
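The program expects test.txt in the working directory. One way to create it (echo appends a trailing newline, which is why the run below reports 12 bytes for an 11-character string):
echo "hi io_uring" > test.txt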
cmake_minimum_required(VERSION 3.20)
project(io_uring_file_io)
set(CMAKE_C_STANDARD 11)
# find liburing
find_package(PkgConfig REQUIRED)
pkg_check_modules(LIBURING REQUIRED liburing)
add_executable(main main.c)
# link liburing
target_link_libraries(main PRIVATE ${LIBURING_LIBRARIES})
# include liburing headers
target_include_directories(main PRIVATE ${LIBURING_INCLUDE_DIRS})
tianb@tpc1:~/w/networks-d/io_uring_file$ cmake -Bbuild
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/tianb/w/networks-d/io_uring_file/build
tianb@tpc1:~/w/networks-d/io_uring_file$ cmake --build build
[ 50%] Building C object CMakeFiles/main.dir/main.c.o
[100%] Linking C executable main
[100%] Built target main
tianb@tpc1:~/w/networks-d/io_uring_file$ ./build/main
# output
read 12 bytes: hi io_uring
tianb@tpc1:~/w/networks-d/io_uring_file$
Network I/O
// a tcp echo server using io_uring
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

#define EVENT_ACCEPT 0      // accept new connections
#define EVENT_READ 1        // read data from a socket
#define EVENT_WRITE 2       // write data to a socket
#define ENTRIES_LENGTH 1024 // number of entries in io_uring
#define BUFFER_LENGTH 1024  // buffer size for reading/writing

// structure to hold connection details
struct conn_info {
    int fd;    // file descriptor of the connection
    int event; // type of event (ACCEPT, READ, WRITE)
};

// init a tcp server socket, return socket fd or -1 on failure
int init_server(unsigned short port) {
    // create a socket (IPv4, TCP)
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) {
        perror("socket failed");
        return -1;
    }

    // set SO_REUSEADDR to reuse the port immediately after closing
    int opt = 1;
    setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    // define server address (bind to all interfaces, specified port)
    struct sockaddr_in server_addr = {0};
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    server_addr.sin_port = htons(port);

    // bind the socket to the address
    if (bind(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
        perror("bind failed");
        close(sockfd);
        return -1;
    }

    // start listening for connections (queue up to 10 clients)
    if (listen(sockfd, 10) < 0) {
        perror("listen failed");
        close(sockfd);
        return -1;
    }
    return sockfd;
}

// queue an accept request in io_uring (submitted later by io_uring_submit)
void set_event_accept(struct io_uring *ring, int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    // get a submission queue entry, NULL check omitted for simplicity
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    // create connection info for the ACCEPT event
    struct conn_info accept_info = { .fd = sockfd, .event = EVENT_ACCEPT };
    // prepare an accept operation
    io_uring_prep_accept(sqe, sockfd, addr, addrlen, 0);
    // attach user data (conn_info fits in the 64-bit user_data field)
    memcpy(&sqe->user_data, &accept_info, sizeof(accept_info));
}

// queue a recv request in io_uring
void set_event_recv(struct io_uring *ring, int sockfd, void *buf, size_t len) {
    // get an SQE
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    // create connection info for the READ event
    struct conn_info recv_info = { .fd = sockfd, .event = EVENT_READ };
    // prepare a recv operation
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    // attach user data
    memcpy(&sqe->user_data, &recv_info, sizeof(recv_info));
}

// queue a send request in io_uring
void set_event_send(struct io_uring *ring, int sockfd, void *buf, size_t len) {
    // get an SQE
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    // create connection info for the WRITE event
    struct conn_info send_info = { .fd = sockfd, .event = EVENT_WRITE };
    // prepare a send operation
    io_uring_prep_send(sqe, sockfd, buf, len, 0);
    // attach user data
    memcpy(&sqe->user_data, &send_info, sizeof(send_info));
}

int main() {
    unsigned short port = 9999;
    int sockfd = init_server(port);
    if (sockfd < 0) return EXIT_FAILURE;

    // declare io_uring instance
    struct io_uring ring;
    // initialize io_uring with a submission queue of size ENTRIES_LENGTH
    io_uring_queue_init(ENTRIES_LENGTH, &ring, 0);

    // define client address storage
    struct sockaddr_in client_addr;
    socklen_t addrlen = sizeof(client_addr);

    // submit an accept request for new connections
    set_event_accept(&ring, sockfd, (struct sockaddr *)&client_addr, &addrlen);

    // a single shared buffer keeps the demo simple;
    // a production server needs a buffer per connection
    char buffer[BUFFER_LENGTH] = {0};

    while (1) {
        // submit queued events to the kernel
        io_uring_submit(&ring);

        // block until at least one event has completed
        // (the CQEs themselves are collected below via the batch peek)
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);

        // retrieve all currently completed events in one batch
        struct io_uring_cqe *cqes[128];
        int nready = io_uring_peek_batch_cqe(&ring, cqes, 128);
        for (int i = 0; i < nready; i++) {
            struct io_uring_cqe *entry = cqes[i];

            // retrieve user data from the CQE
            struct conn_info result;
            memcpy(&result, &entry->user_data, sizeof(result));

            if (result.event == EVENT_ACCEPT) {
                // accept the new connection; res is the connection fd or -errno
                int connfd = entry->res;
                if (connfd < 0) {
                    fprintf(stderr, "accept failed: %s\n", strerror(-connfd));
                    continue;
                }
                printf("new connection accepted: %d\n", connfd);
                // re-arm to accept new clients
                set_event_accept(&ring, sockfd, (struct sockaddr *)&client_addr, &addrlen);
                // start reading from the new connection
                set_event_recv(&ring, connfd, buffer, BUFFER_LENGTH);
            } else if (result.event == EVENT_READ) {
                int ret = entry->res;
                if (ret == 0) {
                    // client disconnected, close the socket
                    printf("client %d disconnected\n", result.fd);
                    close(result.fd);
                } else if (ret > 0) {
                    // echo received data back to the client
                    printf("received %d bytes: %s\n", ret, buffer);
                    set_event_send(&ring, result.fd, buffer, ret);
                } else {
                    // recv failed; res is a negated errno value
                    fprintf(stderr, "recv failed: %s\n", strerror(-ret));
                    close(result.fd);
                }
            } else if (result.event == EVENT_WRITE) {
                // re-arm the recv to receive more data
                set_event_recv(&ring, result.fd, buffer, BUFFER_LENGTH);
            }
        }
        // mark completed events as processed
        io_uring_cq_advance(&ring, nready);
    }

    // clean up io_uring (unreachable in this demo loop)
    io_uring_queue_exit(&ring);
    return 0;
}
cmake_minimum_required(VERSION 3.20)
project(io_uring_tcp_server)
set(CMAKE_C_STANDARD 11)
# find liburing
find_package(PkgConfig REQUIRED)
pkg_check_modules(LIBURING REQUIRED liburing)
add_executable(server main.c)
# link liburing
target_link_libraries(server PRIVATE ${LIBURING_LIBRARIES})
# include liburing headers
target_include_directories(server PRIVATE ${LIBURING_INCLUDE_DIRS})
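The server builds the same way as the file I/O demo and can be exercised with netcat (assuming nc is installed):
cmake -Bbuild && cmake --build build
./build/server &  # starts the echo server on port 9999
nc localhost 9999 # type a line; the server echoes it back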
Reactor and Proactor
Reactor and Proactor are two common event-driven I/O design patterns used in high-performance applications like web servers, databases, and networking services.
Reactor Pattern
The Reactor pattern is synchronous and non-blocking. It waits for events (like socket reads or writes) and dispatches them to handlers, but the actual I/O operation (e.g. reading or writing data) is performed by the application. Interfaces like epoll, kqueue, and select follow this model.
- Event Demultiplexer (e.g. epoll) waits for I/O events.
- When an event occurs, the reactor notifies the application.
- The application performs the actual read/write, as the sketch below shows.
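For contrast, a minimal reactor-style loop over an already-configured epoll instance (epfd setup and error handling omitted; this is a sketch of the pattern, not a complete server):
#include <sys/epoll.h>
#include <unistd.h>

// reactor: epoll only reports readiness; the application does the I/O itself
void reactor_loop(int epfd) {
    struct epoll_event events[64];
    char buf[1024];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);   // 1. wait for readiness
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;             // 2. dispatch the ready fd
            ssize_t r = read(fd, buf, sizeof(buf)); // 3. the application reads...
            if (r > 0)
                write(fd, buf, r);                  // ...and writes (echo)
            else if (r == 0)
                close(fd);                          // peer closed the connection
        }
    }
}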
Proactor Pattern
The Proactor pattern is asynchronous and non-blocking. The system (kernel or an I/O service) performs the I/O operations and notifies the application only when an operation is complete, which allows better scalability in high-performance systems. io_uring and Windows IOCP (I/O Completion Ports) follow this model.
- The application submits an I/O request.
- The kernel completes the operation in the background.
- When done, the application is notified with the result (sketched below).
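The same receive step in proactor style with liburing, condensed from the echo server above (it blocks on a single completion for brevity; a real proactor keeps many operations in flight):
#include <liburing.h>
#include <sys/types.h>

// proactor: submit the recv, let the kernel perform it, handle the finished result
ssize_t proactor_recv(struct io_uring *ring, int fd, void *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring); // NULL check omitted
    io_uring_prep_recv(sqe, fd, buf, len, 0); // 1. describe the operation
    io_uring_submit(ring);                    // 2. kernel executes it in the background
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);            // 3. woken only when the recv is done
    ssize_t n = cqe->res;                     // bytes received, or a negated errno
    io_uring_cqe_seen(ring, cqe);             // release the CQE slot
    return n;
}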
Reactor works well when you have many connections but lightweight I/O, while Proactor is better for high-throughput workloads where reducing syscalls and CPU overhead matters. io_uring enables a true Proactor model on Linux, making it a strong contender for the future of high-performance Linux I/O.