
[WIP] Add io_uring support to redis #9440

Open · wants to merge 8 commits into base: unstable

Conversation

@HanGuangyu commented Sep 1, 2021

io_uring is a powerful new asynchronous I/O API for Linux, created by Jens Axboe at Facebook, and has been part of the Linux kernel since version 5.1. For more details about io_uring, see Lord of the io_uring.

This work is a POC (Proof of Concept) produced in cooperation between the Alibaba team of the Anolis community and Uniontech Software Technology Co., Ltd. We have experimentally added io_uring support to Redis and measured a good performance improvement in our tests. If the community approves of this work, we will proceed.

Test Environment

  • System: Anolis 8.4 [an OS compatible with CentOS 8]
  • Kernel: Linux 4.19 with io_uring patches backported
  • Kernel Configuration: default CPU mitigation settings; mitigations were not disabled. Note that this CPU is not affected by some of the vulnerabilities. For details:
# cat /sys/devices/system/cpu/vulnerabilities/meltdown 
Not affected
# cat /sys/devices/system/cpu/vulnerabilities/tsx_async_abort
Not affected
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling
  • Server Model: Hygon C86 7280
  • CPU: Hygon C86 7285 32-core Processor * 2
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        8
Vendor ID:           HygonGenuine
CPU family:          24
Model:               1
Model name:          Hygon C86 7285 32-core Processor
Stepping:            1
CPU MHz:             2486.209
CPU max MHz:         2000.0000              
CPU min MHz:         1200.0000
BogoMIPS:            4000.20
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

Test Data

Please see the table below for the performance-improvement data. We have also published it on the Anolis Chinese website.

| Benchmark (RPS, requests per second) | redis_origin | redis_add_uring + close_sqpoll | redis_add_uring + open_sqpoll | improvement of close_sqpoll | improvement of open_sqpoll |
|---|---|---|---|---|---|
| PING_INLINE | 55493.90 | 73336.37 | 86625.09 | 32.15% | 56.10% |
| PING_BULK | 55224.21 | 72344.24 | 84141.09 | 31.00% | 52.36% |
| SET | 55540.44 | 72844.88 | 86079.27 | 31.16% | 54.98% |
| GET | 55549.07 | 73514.27 | 84939.39 | 32.34% | 52.91% |
| INCR | 55652.95 | 73542.93 | 85744.91 | 32.15% | 54.07% |
| LPUSH | 56640.22 | 72688.68 | 84075.30 | 28.33% | 48.44% |
| RPUSH | 56493.03 | 73133.63 | 85814.08 | 29.46% | 51.90% |
| LPOP | 56379.64 | 73871.06 | 85478.12 | 31.02% | 51.61% |
| RPOP | 56280.96 | 72998.56 | 85657.51 | 29.70% | 52.20% |
| SADD | 55788.94 | 72183.41 | 85462.05 | 29.39% | 53.19% |
| HSET | 56218.62 | 72502.65 | 85095.52 | 28.97% | 51.37% |
| SPOP | 55697.90 | 73361.65 | 84509.42 | 31.71% | 51.73% |
| LPUSH (needed to benchmark LRANGE) | 56774.31 | 72535.25 | 84403.15 | 27.76% | 48.66% |
| LRANGE_100 | 33953.20 | 43849.07 | 51821.53 | 29.15% | 52.63% |
| LRANGE_300 | 15290.75 | 18304.73 | 18953.36 | 19.71% | 23.95% |
| LRANGE_500 | 12066.89 | 13540.14 | 14276.23 | 12.21% | 18.31% |
| LRANGE_600 | 9730.44 | 10733.82 | 11673.47 | 10.31% | 19.97% |
| MSET (10 keys) | 57729.39 | 61616.95 | 83557.55 | 6.73% | 44.74% |

Test Command

Test Tool: redis-benchmark
We ran redis-benchmark and redis-server on the same machine, with both bound to fixed CPU cores.

## SQ polling disabled
taskset -c 24-28 redis-server
taskset -c 30 redis-benchmark -c 500 -n 10000000 -q

## SQ polling enabled
taskset -c 24-28 redis-server --sqpoll
# find the pid of the sq thread and bind it to a CPU
ps -ef | grep io_uring-sq
taskset -pc 29 {sq_pid}
taskset -c 30 redis-benchmark -c 500 -n 10000000 -q

Note: if the machine uses NUMA (non-uniform memory access), the sq thread and the redis-server process should be bound to CPU cores on the same NUMA node.

Looking forward to your comments. Best wishes to you.
Related-Issue: #9441

Use io_uring to do poll_add/poll_remove. Make sure you have liburing
installed and a kernel that supports it.

Signed-off-by: Betty <[email protected]>
Signed-off-by: Betty <[email protected]>
Signed-off-by: Jiufei Xue <[email protected]>
Signed-off-by: Betty <[email protected]>
Signed-off-by: Jiufei Xue <[email protected]>
Signed-off-by: Betty <[email protected]>
Signed-off-by: Jiufei Xue <[email protected]>
Signed-off-by: Betty <[email protected]>
Signed-off-by: Jiufei Xue <[email protected]>
io_uring_prep_poll_add() mistakenly always uses EPOLLIN.
Correct it by using poll_mask.

Signed-off-by: Hao Xu <[email protected]>
Register files when creating clients in redis-benchmark.

Signed-off-by: Hao Xu <[email protected]>
Signed-off-by: Joseph Qi <[email protected]>
Add an argument to turn the sqpoll feature on/off.

Signed-off-by: Hao Xu <[email protected]>
@madolson (Contributor) left a comment:

How much CPU is used while this is ongoing? Someone mentioned to me that the submit queue side of io_uring will use additional CPU in the background for sending the requests. It would be good to understand the performance gain of this compared to just enabling io-threads for writes.

@HanGuangyu (Author) commented Sep 6, 2021

> How much CPU is used while this is ongoing? Someone mentioned to me that the submit queue side of io_uring will use additional CPU in the background for sending the requests. It would be good to understand the performance gain of this compared to just enabling io-threads for writes.

Thank you for your comments, my friend. Yes, you are right. Five CPUs were assigned to redis-server for its five threads (we use "taskset -c 24-28 redis-server"). When we enable sqpoll, it needs one additional CPU. But if we disable sqpoll, it needs the same number of CPUs, with no additional CPU, and it still gets an improvement.

@madolson (Contributor) commented Sep 6, 2021

@HanGuangyu What is the difference between open and closed SQ polling? I'm not familiar with that terminology. Also, to clarify your setup: you ran both the benchmark and the server locally? I want to compare this against some testing we did internally, where we didn't find very high returns from just using io-threads, but we also have custom kernel tuning, which might have reduced the benefits.

The code looks pretty straightforward all things considered, and it seems still in line with the longer-term multi-threaded improvements (having multiple event loops serving a subset of clients). I'd be inclined to take a deeper look at the code to try to understand it better and see if we can figure out how to merge it.

@HanGuangyu (Author) commented Sep 7, 2021

@madolson Hi, I'm very happy about your quick responses and professional attitude. You made me feel the full vitality of the Redis community. I'm sorry I didn't explain the information clearly in advance. I will offer whatever I can to help you understand it.

1. sqpoll (Submission Queue Polling)
An io_uring instance works through two ring buffers shared between kernel and user space: a submission queue (SQ) and a completion queue (CQ). The queues are single-producer, single-consumer.

We set the two ring buffers up with io_uring_setup(). We then tell the kernel via the io_uring_enter() system call that we have added a submission queue entry (SQE) to the submission queue ring buffer. We can also add multiple SQEs before making the system call.

But io_uring also has a polling mode, in which the kernel polls for new entries in the submission queue. This avoids the system-call overhead of calling io_uring_enter() every time we submit entries for processing.

This is done via the submission queue polling feature that io_uring supports. After our program sets up polling mode, io_uring starts a special kernel thread that polls the shared submission queue for entries our program adds. That way, we just have to place entries in the shared queue, and the kernel thread sees and picks up the SQE without our program making the io_uring_enter() system call.
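
To make this concrete, here is a minimal sketch (not from this PR) of setting up SQPOLL with liburing; the queue depth and the 2000 ms idle time are arbitrary example values, and SQPOLL may require elevated privileges on older kernels:

```c
/* Minimal sketch contrasting the two modes with liburing. Without SQPOLL,
 * io_uring_submit() issues an io_uring_enter() syscall; with
 * IORING_SETUP_SQPOLL, a kernel thread polls the SQ and the submit call
 * usually returns without entering the kernel. */
#include <liburing.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    struct io_uring_params params = {0};

    /* Enable SQ polling; the kernel thread sleeps after 2000 ms of no work. */
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;

    int ret = io_uring_queue_init_params(256, &ring, &params);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init_params failed: %d\n", ret);
        return 1;
    }

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe) return 1;
    io_uring_prep_nop(sqe);                  /* placeholder request */

    io_uring_submit(&ring);                  /* no syscall on the common SQPOLL path */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) /* reap the completion */
        io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```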

2. Clarifying the setup
Yes, we ran both the benchmark and the server locally. We used the following commands:

## SQ polling disabled
taskset -c 24-28 redis-server
taskset -c 30 redis-benchmark -c 500 -n 10000000 -q

## SQ polling enabled
taskset -c 24-28 redis-server --sqpoll
# find the pid of the sq thread and bind it to a CPU
ps -ef | grep io_uring-sq
taskset -pc 29 {sq_pid}
taskset -c 30 redis-benchmark -c 500 -n 10000000 -q

Note: if the machine uses NUMA (non-uniform memory access), the sq thread and the redis-server process should be bound to CPU cores on the same NUMA node.

Best wishes to you. If you have any questions, just ask.

@HanGuangyu requested a review from madolson September 14, 2021 01:45
@madolson (Contributor) commented:
@HanGuangyu Sorry for being so slow to circle back, I'm going to review this tomorrow!

@HanGuangyu (Author) commented Sep 30, 2021

> @HanGuangyu Sorry for being so slow to circle back, I'm going to review this tomorrow!

@madolson Don't feel bad, my friend. I understand; I know you're very busy.
Take your time. Looking forward to further work.

@HanGuangyu (Author) commented Oct 25, 2021

@madolson Could I get a response when you're not busy (ಥ﹏ಥ), my friend?
I always look forward to working with you and contributing to the Redis community. (๑・`◡´・๑)

@madolson (Contributor) left a comment:
@HanGuangyu Sorry for the delay. I've looked through everything except the io_uring details and left some questions. Still trying to think through how to better encapsulate the async code.

Comment on lines +1162 to +1166
#ifdef HAVE_IO_URING
if (aeCreateFileEvent(el, fd, AE_READABLE, acceptTcpHandler, NULL) == AE_ERR) {
serverPanic("Unrecoverable error creating server.ipfd file event.");
}
#endif
Contributor:

Just to clarify my understanding, we are re-registering this since it will be de-registered each time it fires. Right?

Author:

Yes

@NKU-zhubingpeng commented Feb 12, 2025
In ae_iouring.c, aeApiAddEvent() calls io_uring_prep_poll_add(), which issues a single-shot poll request. Using io_uring_prep_poll_multishot() would avoid re-registering the event here.
Ref: https://man.archlinux.org/man/extra/liburing/io_uring_prep_poll_add.3.en
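
A hedged sketch of that suggestion (the helper name and user-data argument are illustrative; it needs a kernel with IORING_POLL_ADD_MULTI, 5.13+, and a liburing version that provides io_uring_prep_poll_multishot):

```c
/* Arm one multishot poll instead of re-adding a single-shot poll after
 * every completion. */
#include <liburing.h>
#include <poll.h>

static int add_multishot_poll(struct io_uring *ring, int fd, void *udata) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;
    io_uring_prep_poll_multishot(sqe, fd, POLLIN);
    io_uring_sqe_set_data(sqe, udata);   /* identifies the fd on each CQE */
    /* Each readiness event yields a CQE with IORING_CQE_F_MORE set while the
     * poll stays armed; once F_MORE is clear, the poll must be re-armed. */
    return io_uring_submit(ring);
}
```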

size_t objlen;
clientReplyBlock *o;

#if defined(HAVE_IO_URING)
Contributor:

It would help readability if we used a config flag to control io_uring, i.e., a regular Redis config instead of a macro.

Author:

I agree with you; a config flag is a better option. We used a macro just to implement the POC quickly.
Could you give me a suggestion about how to implement the config flag? Or is there an existing example in Redis I can reference?
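
For reference, a hedged sketch of what such a flag could look like, following the createBoolConfig pattern in Redis's config.c standardConfig table (the name "io-uring-enabled" and the server field are hypothetical):

```c
/* Hypothetical entry for the standardConfig table in config.c; the name
 * "io-uring-enabled" and server.io_uring_enabled are illustrative. */
createBoolConfig("io-uring-enabled", NULL, IMMUTABLE_CONFIG,
                 server.io_uring_enabled, 0, NULL, NULL),
```

Startup code could then test server.io_uring_enabled instead of the compile-time macro.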

Comment on lines +1567 to +1568
c->wiov.iov_len = c->bufpos-c->sentlen;
c->wiov.iov_base = c->buf+c->sentlen;
Contributor:

What are the failure modes of submitting a request to the submission queue? It would be nice if we could encapsulate all of the submission logic into connWrite; that would make this code much easier to maintain. We just need an accounting mechanism.

While reading this code, it seems like we should introduce a new "buffered IO" type so we can abstract the different requirements of the submission queue. The type would have APIs for adding data to the buffer, pulling data out of the buffer, and acknowledging data from the buffer, I think?

My other thought is that connections should maybe also have a callback function?

@HanGuangyu (Author) commented Nov 15, 2021

  1. Failure modes of submitting a request to the submission queue

We use io_uring_queue_init(MAX_ENTRIES, state->ring, 0); to initialize an io_uring instance, which has the two queues. MAX_ENTRIES is a macro defined at the beginning of ae_iouring.c as #define MAX_ENTRIES 16384; it sets the length of the io_uring queue, i.e., the number of SQEs (submission queue entries) it can hold. This is a POC solution.
(In my opinion, it could be defined in redis.conf so users can modify it. Could I get your opinion, my friends?)

Before we submit a request to the submission queue, we use io_uring_get_sqe() to get an available SQE, then populate its data. In this POC, we currently return -1 if all SQEs are in use:

struct io_uring_sqe *sqe = io_uring_get_sqe(state->ring);
if (!sqe) return -1;

We can try to make it more robust if you agree and are interested.
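
One possible way to make it more robust, as a sketch (not in the PR): flush the queue and retry once before giving up:

```c
#include <liburing.h>

/* If the SQ is full, drain pending entries to the kernel and retry;
 * the caller still has to handle a NULL result. */
static struct io_uring_sqe *get_sqe_or_flush(struct io_uring *ring) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) {
        io_uring_submit(ring);          /* hand queued SQEs to the kernel */
        sqe = io_uring_get_sqe(ring);   /* retry; may still be NULL */
    }
    return sqe;
}
```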

@HanGuangyu (Author) commented Nov 15, 2021

  1. Encapsulating all of the submission into connWrite
    Sounds good, I will implement it.

  2. New "buffered IO" type
    I basically agree, but I'm sorry that I need some time to understand it more deeply before implementing it.

  3. A callback function in connections
    Could you give me more information about what you have in mind?
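
On point 2, a hypothetical sketch of what the proposed "buffered IO" type might look like; all names are illustrative, not from the PR:

```c
#include <stddef.h>
#include <string.h>

typedef struct bufferedIO {
    char buf[16 * 1024];
    size_t len;     /* bytes currently buffered */
    size_t sent;    /* bytes handed to the kernel via an SQE */
    size_t acked;   /* bytes confirmed written by a CQE */
} bufferedIO;

/* Add data to the buffer; returns the number of bytes actually copied. */
static size_t bioAppend(bufferedIO *io, const char *data, size_t n) {
    size_t room = sizeof(io->buf) - io->len;
    if (n > room) n = room;
    memcpy(io->buf + io->len, data, n);
    io->len += n;
    return n;
}

/* Pull the next unsubmitted span, e.g. to populate an SQE. */
static void bioNext(bufferedIO *io, char **p, size_t *n) {
    *p = io->buf + io->sent;
    *n = io->len - io->sent;
    io->sent = io->len;
}

/* Acknowledge n bytes once the corresponding CQE arrives. */
static void bioAck(bufferedIO *io, size_t n) {
    io->acked += n;
    if (io->acked == io->len)   /* everything submitted was written: reset */
        io->len = io->sent = io->acked = 0;
}
```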

}
#if defined(HAVE_IO_URING)
if (aeCreateFileEventWithBuf(server.el, c->conn->fd, AE_WRITABLE,
Contributor:

As a general comment, we want to hide all of the ae code within the connection objects, so this would be breaking that particular barrier.

@HanGuangyu (Author) commented Nov 15, 2021

I will try to fix it. Could I get more information about this rule?

@@ -3341,6 +3351,7 @@ void initServer(void) {

/* Register a readable event for the pipe used to awake the event loop
* when a blocked client in a module needs attention. */
aeRegisterFile(server.el, server.module_blocked_pipe[0]);
Contributor:

?

@HanGuangyu (Author) commented Nov 16, 2021

I'm sorry I didn't explain it in advance. This is an advanced io_uring feature: file registration.

Every time an operation is issued for a file descriptor, the kernel has to spend cycles mapping the file descriptor to its internal representation. For repeated operations on the same file, io_uring lets you pre-register those files and save on the lookup.

The io_uring_register() system call registers user buffers or files for use in an io_uring(7) instance referenced by fd. Registering files or user buffers allows the kernel to take long-term references to internal data structures or create long-term mappings of application memory, greatly reducing per-I/O overhead.

For more details, see https://unixism.net/loti/ref-iouring/io_uring_register.html, or talk to me if anything is unclear.
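
A minimal sketch (not from the PR) of what that registration looks like with liburing; the fds array is illustrative:

```c
#include <liburing.h>

/* The kernel takes long-term references to these files, skipping the
 * per-I/O fd lookup; later requests pass an index into this table and
 * set the IOSQE_FIXED_FILE flag. */
static int register_fds(struct io_uring *ring, int *fds, unsigned nfds) {
    return io_uring_register_files(ring, fds, nfds);
}
```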

@@ -1127,6 +1139,10 @@ static void acceptCommonHandler(connection *conn, int flags, char *ip) {
freeClient(connGetPrivateData(conn));
return;
}

#ifdef HAVE_IO_URING
readQueryFromClient(conn);
Contributor:

What is this for?

@@ -115,10 +115,19 @@ client *createClient(connection *conn) {
* in the context of a client. When commands are executed in other
* contexts (for instance a Lua script) we need a non connected client. */
if (conn) {
#if !defined(HAVE_IO_URING)
Contributor:

Why not enable tcp no delay?

Author:

I'm so sorry, this is a typo. I will fix it. TCP no delay should be enabled.

#if defined(HAVE_IO_URING)
/* Do not create poll event */
conn->read_handler = readDoneFromClient;
conn->write_handler = writeDoneToClient;
Contributor:

Why install a write handler immediately?


I have the same question...

@ranshid (Collaborator) commented Dec 1, 2021

Hello @HanGuangyu. I work with @madolson at AWS, and we are interested in this great effort to improve Redis performance!
Without diving too deep into the implementation, I would like to ask some questions about your setup:

  1. I understand your results with SQ polling, but since that work is done by a separate kernel thread, to be fair shouldn't we compare them against Redis running an extra io-thread?
  2. I am very interested in your kernel configuration, specifically the kernel mitigations, which add overhead to system calls. I am primarily interested in:
  • whether page table isolation is disabled?
  • whether MDS and TSX async abort are disabled?
  • whether Spectre v2 mitigations are disabled?

All of these may have good reasons to be enabled on some systems, but I think we need a clear picture of the io_uring performance boost with and without the extra io-thread and with kernel mitigations disabled.
Is it possible to get these results appended to the original results you posted?

Thank you, and it's nice to read about your good work!

@HanGuangyu (Author) commented Dec 6, 2021

Hello @ranshid, nice to meet you.

  1. Yes, you are right. We will run a test comparing them against Redis operating an extra io-thread.
  2. I have added the mitigation information above the original results. I didn't disable them, but this CPU is not affected by some of the vulnerabilities. I've also put the information here for your convenience:
    # cat /sys/devices/system/cpu/vulnerabilities/meltdown 
    Not affected
    # cat /sys/devices/system/cpu/vulnerabilities/tsx_async_abort
    Not affected
    # cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
    Mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling

We will also find a new environment to test io_uring performance with and without the kernel mitigations disabled, but I'm sorry this may take some time.

Thank you. If you want to know anything about this work, just ask. It's great to work with you and to participate in the Redis community.

@madolson (Contributor) commented Jul 7, 2022

@HanGuangyu Hey, sorry for falling off for a bit. We ended up getting busy with 7.0, but this is an important item that we want to improve in Redis, so I'm wondering if you are still around and want to help merge this?

@oranagra (Member) commented Jul 7, 2022

If you are, please look into the possibility of building this on top of #9320.
It will require extending the connection abstraction interfaces for async IO and somehow pulling the event loop (ae.c) into it too, which would be great since it'll make it more flexible for other interfaces/mediums.

@oranagra (Member) commented:

@HanGuangyu Now that #9320 is merged, it would be interesting to see if we can leverage and extend it so that it's flexible enough for io_uring support to be added in a similar way.
Maybe you'd like to pick it up and come up with a design.

@HanGuangyu (Author) commented:

Sorry for my late response.

The problem is that we currently lack a sufficient understanding of Redis and #9320.
Perhaps another member of the team will take over, learn it, and continue this work, if the community is interested.

@HanGuangyu (Author) commented:

@oranagra @madolson
Hello, sorry to bother you with the message below.
We have always felt that this PR is meaningful and enjoyable work, but after searching and trying for a while, our team really does not have the right people or enough energy to continue it.
Is anyone else in the community interested in this work and willing to carry it on? If so, we would be very happy to provide the necessary support and assistance to whoever picks it up.

@madolson (Contributor) commented Dec 7, 2022

@HanGuangyu No worries. There is definitely interest in the community, so I'm sure someone will pick it up eventually. Appreciate starting the conversation though!

@He-Jingkai commented Nov 21, 2023

@HanGuangyu I am interested in this work and found some possible bugs.

Please see #9441 (comment) for details.

@lipzhu (Contributor) commented Nov 23, 2023

Hi @madolson, @oranagra, @HanGuangyu,
Is there any further plan to add io_uring support to Redis?

@lipzhu (Contributor) commented Nov 23, 2023

> There is definitely interest in the community, so I'm sure someone will pick it up eventually. Appreciate starting the conversation though!

I can pick it up and continue the work if the community is still interested.

@madolson (Contributor) commented:

@lipzhu Hey, we are interested in this. Please make sure to post a summary of what you are planning to look into; we just want to make sure what you are doing is productive and likely to get merged.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

if (c->flags & CLIENT_CLOSE_AFTER_REPLY) {
freeClientAsync(c);
}
if (c->flags & CLIENT_CLOSE_AFTER_REPLY) return;


I don't understand this logic; it seems the same as line 1746?

@congwang commented:

@HanGuangyu Any updates on this? The results look very promising; we are interested in this as well.

@sundb (Collaborator) commented Oct 29, 2024

@congwang Welcome to continue the work if you are interested.

@congwang commented:

@sundb Thank you! I will take a look. From a quick glance, it looks like registered buffers are not yet utilized to eliminate the data-copy overhead between kernel space and user space. I think that would make the performance even better.
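
For reference, a hedged sketch (not from the PR) of the registered-buffer path with liburing; the buffer size and function names are illustrative:

```c
#include <liburing.h>
#include <sys/uio.h>

static char buf[16 * 1024];

/* One-time setup: register the buffer with the ring as fixed buffer 0. */
static int setup_fixed_buf(struct io_uring *ring) {
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    return io_uring_register_buffers(ring, &iov, 1);
}

/* Per-I/O: read into the pre-registered buffer, so the kernel can skip
 * the per-request page mapping and pinning. */
static int read_with_fixed_buf(struct io_uring *ring, int fd) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;
    io_uring_prep_read_fixed(sqe, fd, buf, sizeof(buf), 0, 0 /* buf index */);
    return io_uring_submit(ring);
}
```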
