io_uring

io_uring is a modern Linux I/O interface introduced in Linux 5.1 (2019) to address the limitations of traditional models like epoll, which has been the standard since Linux 2.6 (2003). Unlike epoll, where every read and write is still a separate syscall, io_uring uses memory-mapped (mmap) submission and completion queues that let user space and the kernel communicate with far fewer syscalls and without unnecessary memory copies. This reduces context switching, lowers CPU overhead, and improves efficiency, making it ideal for high-throughput applications.

While epoll follows a readiness-based model, where the application is notified that a descriptor is ready and must then perform the reads and writes itself, io_uring enables true asynchronous execution: the kernel performs the I/O in the background and notifies the application only when the operations are done.

epoll is still heavily used in web servers like Nginx, proxies like Envoy, and other network services that rely on efficient event-driven I/O, while io_uring is being adopted by high-performance applications that demand lower latency and higher throughput. epoll only works with pollable descriptors such as sockets and pipes (regular files always report ready), while io_uring supports both network and file I/O. Though still evolving, io_uring is gaining adoption in projects like MariaDB, PostgreSQL, and NVMe storage stacks, making it a strong candidate for the future of Linux I/O.

How it works

io_uring is built around a shared ring buffer architecture that allows user space and the kernel to communicate efficiently without frequent syscalls. It consists of two main components:

  1. Submission Queue (SQ): the application places I/O requests (e.g. reads and writes) into this queue, avoiding direct syscalls for each request.
  2. Completion Queue (CQ): once the kernel finishes processing a request, it writes the result here, allowing the application to retrieve it asynchronously.
-----------------------------------------------------------------
|user space                                                     |
|                                                               |
|                                                               |
|      application produces           application consumes      |
|              |                               ^                |
|              v                               |                |
|       ------------------             ------------------       |
|------| submission queue |-----------| completion queue |------|
|       ------------------             ------------------       |
|              |                               ^                |
|              v                               |                |
|       kernel consumes -> executes I/O -> kernel produces      |
|                                                               |
|kernel                                                         |
-----------------------------------------------------------------

mostly lock-free

io_uring's ring buffer is mostly lock-free due to its single-producer, single-consumer (SPSC) design for the Submission Queue (SQ) and Completion Queue (CQ). These queues are memory-mapped (mmap), allowing direct access without syscalls. Instead of locks, atomic operations update the queue's head and tail pointers.

However, applications with multi-threaded access to the SQ may still need synchronization mechanisms to manage concurrent submissions safely.
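
To make the SPSC idea concrete, here is a toy ring in the same spirit (not io_uring's actual layout; names and sizes are illustrative). The producer is the only writer of tail, the consumer the only writer of head, so each index has exactly one writer and no lock is needed; acquire/release atomics order the entry data against the index updates.

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8 // power of 2, so (index & (RING_SIZE - 1)) wraps cheaply

struct spsc_ring {
  _Atomic unsigned head; // advanced only by the consumer
  _Atomic unsigned tail; // advanced only by the producer
  int entries[RING_SIZE];
};

// producer side (like the application filling SQEs, or the kernel filling CQEs)
bool ring_push(struct spsc_ring *r, int value) {
  unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
  unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
  if (tail - head == RING_SIZE) return false; // ring is full
  r->entries[tail & (RING_SIZE - 1)] = value;
  atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
  return true;
}

// consumer side (like the kernel draining SQEs, or the application draining CQEs)
bool ring_pop(struct spsc_ring *r, int *value) {
  unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
  unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
  if (head == tail) return false; // ring is empty
  *value = r->entries[head & (RING_SIZE - 1)];
  atomic_store_explicit(&r->head, head + 1, memory_order_release);
  return true;
}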

// simplified liburing APIs

// main structure representing an io_uring instance
struct io_uring {
    int ring_fd; // file descriptor for the io_uring instance
    struct io_uring_sq sq; // submission queue (SQ)
    struct io_uring_cq cq; // completion queue (CQ)
    // ...
};

// I/O submission data structure
// representing a Submission Queue Entry (SQE)
struct io_uring_sqe {
    __u8 opcode;  // operation type
    __u8 flags;   // submission flags 
    __u16 ioprio; // I/O priority
    __s32 fd;     // file descriptor for the I/O operation
    // ...
};

// I/O completion data structure
// representing a Completion Queue Entry (CQE)
struct io_uring_cqe {
    __u64 user_data; // user-defined identifier for tracking requests
    __s32 res;       // result code of the I/O operation
    __u32 flags;     // completion flags
    // ...
};

// Submission Queue (SQ) metadata
struct io_uring_sq {
    unsigned *khead; // kernel-managed head pointer
    unsigned *ktail; // kernel-managed tail pointer
    struct io_uring_sqe *sqes; // a list of SQEs
    // ...
};

// Completion Queue (CQ) metadata
struct io_uring_cq {
    unsigned *khead; // kernel-managed head pointer
    unsigned *ktail; // kernel-managed tail pointer
    struct io_uring_cqe *cqes; // a list of CQEs
    // ...
};

// initializes an io_uring instance (allocates SQ and CQ)
// - entries: number of SQEs to allocate (rounded up to a power of 2)
// - ring: pointer to the io_uring instance to initialize
// - flags: additional flags
// returns: 0 on success, negative error code on failure
int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags);

// cleans up and frees resources after using io_uring
// - ring: pointer to the io_uring instance to clean up
void io_uring_queue_exit(struct io_uring *ring);

// gets a free submission queue entry (sqe) to prepare an I/O request
// - ring: pointer to the io_uring instance
// returns: a pointer to an available sqe, or NULL if the queue is full
struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);

// prepares a read operation (reads data from a fd into a buffer)
// - sqe: the submission queue entry to be configured
// - fd: file descriptor to read from (must be opened before calling this)
// - buf: pointer to the buffer where data will be stored
// - nbytes: number of bytes to read
// - offset: absolute position in the file to read from
void io_uring_prep_read(struct io_uring_sqe *sqe, int fd, void *buf, unsigned nbytes, off_t offset);

// prepares a write operation (writes data from a buffer to a file descriptor)
//  - sqe: the submission queue entry to be configured
//  - fd: file descriptor to write to (must be opened before calling this)
//  - buf: pointer to the buffer containing data to be written
//  - nbytes: number of bytes to write
//  - offset: absolute position in the file to write to
void io_uring_prep_write(struct io_uring_sqe *sqe, int fd, const void *buf, unsigned nbytes, off_t offset);

// submits all queued sqes to the kernel for execution
// - ring: pointer to the io_uring instance
// returns: number of submitted sqes or a negative error code
int io_uring_submit(struct io_uring *ring);

// waits for a cqe (blocks until at least one event completes)
// - ring: pointer to the io_uring instance
// - cqe: pointer to store the completed cqe
// returns: 0 on success, negative error on failure
int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe);

// non-blocking check for a completed I/O request
// - ring: pointer to the io_uring instance
// - cqe: pointer to store the completed cqe
// returns: 0 if a cqe is available, -EAGAIN if no cqe is available
int io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe);

// marks a cqe as processed so the kernel can reuse its slot
// - ring: pointer to the io_uring instance
// - cqe: the completed cqe to mark as seen
void io_uring_cqe_seen(struct io_uring *ring, struct io_uring_cqe *cqe);

// marks multiple cqes as processed
// - ring: pointer to the io_uring instance
// - nr: number of cqes to mark as processed
void io_uring_cq_advance(struct io_uring *ring, unsigned nr);

io_uring_cq_advance or io_uring_cqe_seen

  • Use io_uring_cqe_seen() if you process one CQE at a time
  • Use io_uring_cq_advance() when processing multiple CQEs in a batch for efficiency (see the sketch below)
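
For the batch case, a minimal sketch, assuming an initialized ring and a hypothetical handle_completion() callback: io_uring_for_each_cqe() iterates the available CQEs without consuming them, then a single io_uring_cq_advance() releases them all.

struct io_uring_cqe *cqe;
unsigned head;
unsigned count = 0;

// walk all currently available CQEs without consuming them
io_uring_for_each_cqe(&ring, head, cqe) {
  handle_completion(cqe); // hypothetical per-CQE handler
  count++;
}

// hand all iterated CQE slots back to the kernel in one step
io_uring_cq_advance(&ring, count);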

Demo

Setup

sudo apt update                 # update package list
sudo apt install pkg-config     # pkg-config to manage library paths and dependencies
sudo apt install liburing-dev   # io_uring: liburing.h headers, liburing.so shared lib
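
To verify the library is visible to the build system:

pkg-config --modversion liburing   # should print the installed liburing version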

File I/O

// async file read with io_uring

#include <stdlib.h>
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FILE_PATH "test.txt"
#define BUFFER_SIZE 1024

int main() {
  struct io_uring ring; // declare io_uring
  struct io_uring_sqe *sqe; // declare pointer to SQE
  struct io_uring_cqe *cqe; // declare pointer to CQE

  char buffer[BUFFER_SIZE] = {0}; // store data read from file

  int fd = open(FILE_PATH, O_RDONLY); // open the file for read
  if (fd < 0) {
    perror("failed to open file");
    return EXIT_FAILURE; // 1 not -1, -1 becomes 255
  }

  // init io_uring instance: queue depth = 8
  // liburing returns -errno rather than setting errno, so perror() would mislead here
  int ret = io_uring_queue_init(8, &ring, 0);
  if (ret < 0) {
    fprintf(stderr, "io_uring_queue_init failed: %s\n", strerror(-ret));
    close(fd);
    return EXIT_FAILURE;
  }

  // get a free submission queue entry (SQE) to submit a request
  sqe = io_uring_get_sqe(&ring);
  if (!sqe) {
    fprintf(stderr, "failed to get SQE\n"); // perror is only useful when a system call fails and sets errno 
    io_uring_queue_exit(&ring); // clean up io_uring instance
    close(fd);
    return EXIT_FAILURE;
  }

  // prepare the read operation
  io_uring_prep_read(sqe, fd, buffer, BUFFER_SIZE, 0);

  // submit the request to the kernel for execution
  ret = io_uring_submit(&ring);
  if (ret < 0) {
    fprintf(stderr, "io_uring_submit failed: %s\n", strerror(-ret));
    io_uring_queue_exit(&ring);
    close(fd);
    return EXIT_FAILURE;
  }

  // wait for completion of the request
  // this blocks until an event (I/O operation) is completed and a CQE is available
  ret = io_uring_wait_cqe(&ring, &cqe);
  if (ret < 0) {
    fprintf(stderr, "io_uring_wait_cqe failed: %s\n", strerror(-ret));
    io_uring_queue_exit(&ring);
    close(fd);
    return EXIT_FAILURE;
  }

  // check for the result of the operation
  if (cqe->res < 0) { // if result is negative, it indicates an error
    fprintf(stderr, "read failed %s\n", strerror(-cqe->res));
  } else {
    // successfully read some data, print the number of bytes read and the content
    printf("read %d bytes: %s\n", cqe->res, buffer);
  }

  // mark the completion queue event as processed
  io_uring_cqe_seen(&ring, cqe);

  // clean up resources
  io_uring_queue_exit(&ring);
  close(fd);

  return EXIT_SUCCESS; // 0 (success)
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.20)

project(io_uring_file_io)

set(CMAKE_C_STANDARD 11)

# find liburing
find_package(PkgConfig REQUIRED)
pkg_check_modules(LIBURING REQUIRED liburing)

add_executable(main main.c)

# link liburing
target_link_libraries(main PRIVATE ${LIBURING_LIBRARIES})

# include liburing headers
target_include_directories(main PRIVATE ${LIBURING_INCLUDE_DIRS})

# test.txt, the file the demo reads
hi io_uring

tianb@tpc1:~/w/networks-d/io_uring_file$ cmake -Bbuild
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/tianb/w/networks-d/io_uring_file/build
tianb@tpc1:~/w/networks-d/io_uring_file$ cmake --build build
[ 50%] Building C object CMakeFiles/main.dir/main.c.o
[100%] Linking C executable main
[100%] Built target main
tianb@tpc1:~/w/networks-d/io_uring_file$ ./build/main

# output
read 12 bytes: hi io_uring

tianb@tpc1:~/w/networks-d/io_uring_file$

Network I/O

// a tcp echo server using io_uring

#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

#define EVENT_ACCEPT 0 // accept new connections
#define EVENT_READ   1 // read data from a socket
#define EVENT_WRITE  2 // write data to a socket

#define ENTRIES_LENGTH 1024 // number of entries in io_uring
#define BUFFER_LENGTH  1024 // buffer size for reading/writing

// structure to hold connection details
struct conn_info {
  int fd;    // file descriptor of the connection
  int event; // type of event (ACCEPT, READ, WRITE)
};
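
// note: struct conn_info is 8 bytes (two 32-bit ints), exactly the size of the
// 64-bit user_data field of an SQE; the memcpy calls below rely on this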

// init a tcp server socket, return socket fd or -1 on failure
int init_server(unsigned short port) {
  // create a socket (IPv4, TCP)
  int sockfd = socket(AF_INET, SOCK_STREAM, 0);
  if (sockfd < 0) {
    perror("socket failed");
    return -1;
  }

  // set SO_REUSEADDR to reuse the port immediately after closing
  int opt = 1;
  setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

  // define server address (bind to all interfaces, specified port)
  struct sockaddr_in server_addr = {0};
  server_addr.sin_family = AF_INET;
  server_addr.sin_addr.s_addr = htonl(INADDR_ANY);
  server_addr.sin_port = htons(port);

  // bind the socket to the address
  if (bind(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0) {
    perror("bind failed");
    close(sockfd);
    return -1;
  }

  // start listening for connections (queue up to 10 clients)
  if (listen(sockfd, 10) < 0) {
    perror("listen failed");
    close(sockfd);
    return -1;
  }

  return sockfd;
}

// submit an accept request to io_uring
void set_event_accept(struct io_uring *ring, int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
  // get a submission queue entry, NULL check omitted for simplicity
  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

  // create connection info for ACCEPT event
  struct conn_info accept_info = { .fd = sockfd, .event = EVENT_ACCEPT };

  // prepare an accept operation
  io_uring_prep_accept(sqe, sockfd, addr, addrlen, 0);

  // attach user data (store event type in SQE)
  memcpy(&sqe->user_data, &accept_info, sizeof(accept_info));
}

// submit a read request to io_uring
void set_event_recv(struct io_uring *ring, int sockfd, void *buf, size_t len) {
  // get an SQE
  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

  // create connection info for READ event
  struct conn_info recv_info = { .fd = sockfd, .event = EVENT_READ };

  // prepare a read operation
  io_uring_prep_recv(sqe, sockfd, buf, len, 0);

  // attach user data
  memcpy(&sqe->user_data, &recv_info, sizeof(recv_info));
}

// submit a write request to io_uring
void set_event_send(struct io_uring *ring, int sockfd, void *buf, size_t len) {
  // get an SQE
  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

  // create connection info for WRITE event
  struct conn_info send_info = { .fd = sockfd, .event = EVENT_WRITE };

  // prepare a send operation
  io_uring_prep_send(sqe, sockfd, buf, len, 0);

  // attach user data
  memcpy(&sqe->user_data, &send_info, sizeof(send_info));
}

int main() {
  unsigned short port = 9999;

  int sockfd = init_server(port);

  if (sockfd < 0) return EXIT_FAILURE;

  // declare io_uring instance
  struct io_uring ring;

  // initialize io_uring with a submission queue of size ENTRIES_LENGTH
  io_uring_queue_init(ENTRIES_LENGTH, &ring, 0);

  // define client address storage
  struct sockaddr_in client_addr;
  socklen_t addrlen = sizeof(client_addr);

  // submit an accept request for new connections
  set_event_accept(&ring, sockfd, (struct sockaddr*)&client_addr, &addrlen);

  // buffer for reading and writing data
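  // (one shared buffer keeps the demo simple; a real server would use one per connection)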
  char buffer[BUFFER_LENGTH] = {0};

  while(1) {
    // submit queued events to the kernel
    io_uring_submit(&ring);

    // wait for completion events
    // blocks until at least one event is completed
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);

    // retrieve multiple completed events
    struct io_uring_cqe *cqes[128];
    int nready = io_uring_peek_batch_cqe(&ring, cqes, 128);

    for (int i = 0; i < nready; i++) {
      struct io_uring_cqe *entry = cqes[i];
      // retrieve user data from CQE 
      struct conn_info result;
      memcpy(&result, &entry->user_data, sizeof(result));

      if (result.event == EVENT_ACCEPT) {
        // accept the new connection
        int connfd = entry->res;
        if (connfd < 0) {
          // res holds -errno on failure; errno itself isn't set, so perror() would mislead
          fprintf(stderr, "accept failed: %s\n", strerror(-connfd));
          continue;
        }
        printf("new connection accepted: %d\n", connfd);

        // re-arm to accept new clients
        set_event_accept(&ring, sockfd, (struct sockaddr*)&client_addr, &addrlen);

        // start reading from the new connection
        set_event_recv(&ring, connfd, buffer, BUFFER_LENGTH);
      } else if (result.event == EVENT_READ) {
        int ret = entry->res;
        if (ret == 0) {
          // client disconnected, close the socket
          printf("client %d disconnected\n", result.fd);
          close(result.fd);
        } else if (ret > 0) {
          // echo received data back to the client
          printf("received %d bytes: %s\n", ret, buffer);
          set_event_send(&ring, result.fd, buffer, ret);
        } else {
          // res is negative: the recv failed, close the socket
          fprintf(stderr, "recv failed: %s\n", strerror(-ret));
          close(result.fd);
        }
      } else if (result.event == EVENT_WRITE) {
        // re-arm read event to receive more data
        set_event_recv(&ring, result.fd, buffer, BUFFER_LENGTH);
      }
    }

    // mark completed events as processed
    io_uring_cq_advance(&ring, nready);
  }

  // clean up io_uring
  io_uring_queue_exit(&ring);
  return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.20)

project(io_uring_tcp_server)

set(CMAKE_C_STANDARD 11)

# find liburing
find_package(PkgConfig REQUIRED)
pkg_check_modules(LIBURING REQUIRED liburing)

add_executable(server main.c)

# link liburing
target_link_libraries(server PRIVATE ${LIBURING_LIBRARIES})

# include liburing headers
target_include_directories(server PRIVATE ${LIBURING_INCLUDE_DIRS})
# server
cmake -Bbuild
cmake --build build
./build/server

new connection accepted: 5        # server output
received 6 bytes: hello

client 5 disconnected

# client
echo "hello" | nc localhost 9999  # send tcp package and wait for response
hello                             # received server response

Reactor and Proactor

Reactor and Proactor are two common event-driven I/O design patterns used in high-performance applications like web servers, databases, and networking services.

Reactor Pattern

The Reactor pattern is synchronous and non-blocking: it waits for events (like a socket becoming readable or writable) and dispatches them to handlers, but the actual I/O (e.g. reading or writing data) is performed by the application. Interfaces like epoll, kqueue, and select follow this model; a minimal epoll sketch follows the steps below.

  1. Event Demultiplexer (e.g. epoll) waits for I/O events.
  2. When an event occurs, the reactor notifies the application.
  3. The application performs the actual read/write.
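
As a contrast with the io_uring echo server above, here is a minimal reactor-style loop built on epoll. It is a sketch, not a hardened server: reactor_loop() is a made-up name, and it assumes a listening socket like the one returned by init_server() in the demo.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void reactor_loop(int listen_fd) {
  int epfd = epoll_create1(0);
  struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
  epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

  char buf[1024];
  while (1) {
    struct epoll_event events[64];
    // 1. the event demultiplexer waits for readiness events
    int n = epoll_wait(epfd, events, 64, -1);

    // 2. the reactor dispatches each ready descriptor
    for (int i = 0; i < n; i++) {
      int fd = events[i].data.fd;
      if (fd == listen_fd) {
        // listening socket is readable: a client is waiting to be accepted
        int connfd = accept(listen_fd, NULL, NULL);
        struct epoll_event cev = { .events = EPOLLIN, .data.fd = connfd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &cev);
      } else {
        // 3. the application performs the actual I/O itself
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0) {
          epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
          close(fd);
        } else {
          write(fd, buf, len); // echo back
        }
      }
    }
  }
}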

Proactor Pattern

The Proactor pattern is asynchronous and non-blocking: the system (kernel or an I/O service) performs the I/O itself and notifies the application only when an operation is complete, which scales better for high-performance systems. io_uring and Windows IOCP (I/O Completion Ports) follow this model.

  1. The application submits an I/O request.
  2. The kernel completes the operation in the background.
  3. When done, the application gets notified with the result.
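
The TCP echo server above maps directly onto these steps: io_uring_prep_recv() submits the request, the kernel performs the receive in the background, and the result later arrives as a CQE whose res field carries the byte count.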

Reactor works well when you have many connections but lightweight I/O, while Proactor is better for high-throughput workloads where reducing syscalls and CPU overhead matters. io_uring brings a true Proactor model to Linux, making it a strong candidate for the future of high-performance I/O.

🌕