This repository was archived by the owner on May 14, 2025. It is now read-only.

@k0kubun
Member

@k0kubun k0kubun commented Apr 11, 2025

This PR implements JIT-to-JIT calls using call/ret instructions.

Benchmark

def fib(n)
  if n < 2
    return n
  end
  return fib(n-1) + fib(n-2)
end

fib(32) # profile and compile
t = Time.new
fib(32)
puts "%.3fs" % (Time.new - t)

x86_64

# Interpreter (master)
$ ruby -v ~/tmp/fib.rb
ruby 3.5.0dev (2025-04-13T07:55:52Z send-iseq e84b495b38) +PRISM [x86_64-linux]
0.117s

# ZJIT (master)
$ ruby -v --zjit-call-threshold=34 --zjit-num-profiles=3 ~/tmp/fib.rb
ruby 3.5.0dev (2025-04-13T07:55:52Z send-iseq e84b495b38) +ZJIT +PRISM [x86_64-linux]
0.010s

# YJIT (3.4.2)
$ ruby -v --yjit-call-threshold=1 ~/tmp/fib.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
0.016s

arm64

# Interpreter (master)
$ ruby -v ~/tmp/fib.rb
ruby 3.5.0dev (2025-04-13T08:18:53Z send-iseq 18d8203af5) +PRISM [arm64-darwin24]
0.119s

# ZJIT (master)
$ ruby -v --zjit-call-threshold=34 --zjit-num-profiles=3 ~/tmp/fib.rb
ruby 3.5.0dev (2025-04-13T08:18:53Z send-iseq 18d8203af5) +ZJIT +PRISM [arm64-darwin24]
0.014s

# YJIT (3.4.2)
$ ruby -v --yjit-call-threshold=1 ~/tmp/fib.rb
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
0.018s
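
To put the timings above in relative terms, the ratios can be computed directly from the reported numbers. This is only a rough sketch: `TIMINGS` and `speedup` are names introduced here for illustration, each figure is the single wall-clock run quoted above, and the YJIT runs use Ruby 3.4.2 with a different call threshold, so the comparison is indicative rather than rigorous.

```ruby
# Wall-clock seconds copied from the results above (one run each).
TIMINGS = {
  "x86_64" => { interpreter: 0.117, zjit: 0.010, yjit: 0.016 },
  "arm64"  => { interpreter: 0.119, zjit: 0.014, yjit: 0.018 },
}

# How many times faster ZJIT ran than the given baseline on one platform,
# rounded to one decimal place.
def speedup(platform, baseline)
  t = TIMINGS.fetch(platform)
  (t.fetch(baseline) / t.fetch(:zjit)).round(1)
end

TIMINGS.each_key do |platform|
  puts "#{platform}: #{speedup(platform, :interpreter)}x vs interpreter, " \
       "#{speedup(platform, :yjit)}x vs YJIT"
end
```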

Generated code

Details
  # Block: bb0(v0)
  # Insn: v2 Const Value(2)
  # Insn: v4 PatchPoint BOPRedefined(INTEGER_REDEFINED_OP_FLAG, BOP_LT)
  # Insn: v5 GuardType v0, Fixnum
  0x560c4f9e5032: test sil, 1
  0x560c4f9e5036: je 0x560c4f9e5000
  # Insn: v7 FixnumLt v5, v2
  0x560c4f9e503c: cmp rsi, 5
  0x560c4f9e5040: mov edi, 0x14
  0x560c4f9e5045: mov edx, 0
  0x560c4f9e504a: cmovge rdi, rdx
  # Insn: v8 Test v7
  0x560c4f9e504e: test rdi, -5
  0x560c4f9e5055: movabs rdi, 0
  0x560c4f9e505f: mov edx, 1
  0x560c4f9e5064: cmovne rdi, rdx
  # Insn: v9 IfFalse v8, bb1(v0)
  0x560c4f9e5068: test rdi, rdi
  0x560c4f9e506b: jne 0x560c4f9e5076
  # set branch params: 1
  0x560c4f9e5071: jmp 0x560c4f9e5083
  # Insn: v10 Return v0
  # pop stack frame
  0x560c4f9e5076: add r13, 0x38
  0x560c4f9e507a: mov qword ptr [r12 + 0x10], r13
  0x560c4f9e507f: mov rax, rsi
  0x560c4f9e5082: ret
  # Block: bb1(v11)
  # Insn: v13 PutSelf
  # Insn: v14 Const Value(1)
  # Insn: v16 PatchPoint BOPRedefined(INTEGER_REDEFINED_OP_FLAG, BOP_MINUS)
  # Insn: v17 GuardType v11, Fixnum
  0x560c4f9e5083: test sil, 1
  0x560c4f9e5087: je 0x560c4f9e5005
  # Insn: v19 FixnumSub v17, v14
  0x560c4f9e508d: mov rdi, rsi
  0x560c4f9e5090: sub rdi, 3
  0x560c4f9e5094: jo 0x560c4f9e500a
  0x560c4f9e509a: add rdi, 1
  # Insn: v39 PatchPoint MethodRedefined(Object@0x79fdca45ec40, fib@0xb6f1)
  # Insn: v40 GuardBitEquals v13, VALUE(0x79fdca44c3a0)
  0x560c4f9e509e: movabs r11, 0x79fdca44c3a0
  0x560c4f9e50a8: cmp qword ptr [r13 + 0x18], r11
  0x560c4f9e50ac: jne 0x560c4f9e500f
  # Insn: v41 SendWithoutBlockDirect v40, :fib (0x7ffc1ea23068), v19
  # push callee control frame
  0x560c4f9e50b2: mov rdx, qword ptr [r13 + 0x18]
  0x560c4f9e50b6: mov qword ptr [r13 - 0x20], rdx
  # switch to new CFP
  0x560c4f9e50ba: sub r13, 0x38
  0x560c4f9e50be: mov qword ptr [r12 + 0x10], r13
  # set method params: 1
  0x560c4f9e50c3: push rsi
  0x560c4f9e50c4: push rsi
  0x560c4f9e50c5: mov rsi, rdi
  0x560c4f9e50c8: call 0x560c4f9e5032
  0x560c4f9e50cd: pop rsi
  0x560c4f9e50ce: pop rsi
  # Insn: v22 PatchPoint CalleeModifiedLocals(v41)
  # Insn: v23 PutSelf
  # Insn: v24 Const Value(2)
  # Insn: v26 PatchPoint BOPRedefined(INTEGER_REDEFINED_OP_FLAG, BOP_MINUS)
  # Insn: v27 GuardType v11, Fixnum
  0x560c4f9e50cf: test sil, 1
  0x560c4f9e50d3: je 0x560c4f9e5014
  # Insn: v29 FixnumSub v27, v24
  0x560c4f9e50d9: sub rsi, 5
  0x560c4f9e50dd: jo 0x560c4f9e5019
  0x560c4f9e50e3: add rsi, 1
  # Insn: v42 PatchPoint MethodRedefined(Object@0x79fdca45ec40, fib@0xb6f1)
  # Insn: v43 GuardBitEquals v23, VALUE(0x79fdca44c3a0)
  0x560c4f9e50e7: movabs r11, 0x79fdca44c3a0
  0x560c4f9e50f1: cmp qword ptr [r13 + 0x18], r11
  0x560c4f9e50f5: jne 0x560c4f9e501e
  # Insn: v44 SendWithoutBlockDirect v43, :fib (0x7ffc1ea23068), v29
  # push callee control frame
  0x560c4f9e50fb: mov rdi, qword ptr [r13 + 0x18]
  0x560c4f9e50ff: mov qword ptr [r13 - 0x20], rdi
  # switch to new CFP
  0x560c4f9e5103: sub r13, 0x38
  0x560c4f9e5107: mov qword ptr [r12 + 0x10], r13
  # set method params: 1
  0x560c4f9e510c: mov rdi, rax
  0x560c4f9e510f: push rdi
  0x560c4f9e5110: push rdi
  0x560c4f9e5111: call 0x560c4f9e5032
  0x560c4f9e5116: pop rdi
  0x560c4f9e5117: pop rdi
  # Insn: v32 PatchPoint CalleeModifiedLocals(v44)
  # Insn: v34 PatchPoint BOPRedefined(INTEGER_REDEFINED_OP_FLAG, BOP_PLUS)
  # Insn: v35 GuardType v41, Fixnum
  0x560c4f9e5118: test dil, 1
  0x560c4f9e511c: je 0x560c4f9e5023
  # Insn: v36 GuardType v44, Fixnum
  0x560c4f9e5122: test al, 1
  0x560c4f9e5125: je 0x560c4f9e5028
  # Insn: v37 FixnumAdd v35, v36
  0x560c4f9e512b: sub rdi, 1
  0x560c4f9e512f: add rdi, rax
  0x560c4f9e5132: jo 0x560c4f9e502d
  # Insn: v38 Return v37
  # pop stack frame
  0x560c4f9e5138: add r13, 0x38
  0x560c4f9e513c: mov qword ptr [r12 + 0x10], r13
  0x560c4f9e5141: mov rax, rdi
  0x560c4f9e5144: ret

  # ZJIT entry point: fib@/home/k0kubun/tmp/fib.rb:2
  0x560c4f9e5145: push r13
  0x560c4f9e5147: push r12
  0x560c4f9e5149: push rbx
  0x560c4f9e514a: push rbx
  0x560c4f9e514b: mov r12, rdi
  0x560c4f9e514e: mov r13, rsi
  0x560c4f9e5151: mov rbx, qword ptr [r13 + 8]
  # set method params: 1
  0x560c4f9e5155: mov rsi, qword ptr [rbx - 0x20]
  0x560c4f9e5159: call 0x560c4f9e5032
  # exit to the interpreter
  0x560c4f9e515e: pop rbx
  0x560c4f9e515f: pop rbx
  0x560c4f9e5160: pop r12
  0x560c4f9e5162: pop r13
  0x560c4f9e5164: ret
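
The immediates in the listing above (5 for `Value(2)`, 3 for `Value(1)`, `0x14`, and the `-5` mask) follow from CRuby's tagged value encoding on 64-bit builds. The sketch below models that arithmetic in plain Ruby so the constants can be checked by hand. All helper names are made up for illustration, and the `jo` overflow side exits are left out because Ruby integers do not overflow.

```ruby
# Special constants as CRuby encodes them on 64-bit builds.
QFALSE = 0x00
QNIL   = 0x04
QTRUE  = 0x14

# A Fixnum n is stored as 2n + 1, so the low bit is always set.
def tag(n)
  (n << 1) | 1
end

def untag(v)
  v >> 1
end

# GuardType ..., Fixnum  ->  test sil, 1
def fixnum?(v)
  (v & 1) == 1
end

# Test  ->  test rdi, -5: clearing the nil bit leaves zero only for nil and false.
def truthy?(v)
  (v & ~QNIL) != 0
end

# FixnumLt  ->  cmp on the tagged values, then cmov selects Qtrue or Qfalse.
def fixnum_lt(a, b)
  a < b ? QTRUE : QFALSE
end

# FixnumSub  ->  sub rdi, <tagged b>; add rdi, 1 (restores the tag bit)
def fixnum_sub(a, b)
  a - b + 1
end

# FixnumAdd  ->  sub rdi, 1; add rdi, rax (drops one of the two tag bits)
def fixnum_add(a, b)
  (a - 1) + b
end

puts tag(2)                                   # the 5 in `cmp rsi, 5`
puts untag(fixnum_sub(tag(32), tag(1)))       # 31
puts untag(fixnum_add(tag(3), tag(4)))        # 7
puts truthy?(fixnum_lt(tag(1), tag(2)))       # true
```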

@maximecb
Contributor

> I'll get it done before RubyKaigi.

Love the ambition, excited to see how it performs!

@k0kubun k0kubun changed the title from "Implement JIT-to-JIT calls (WIP)" to "Implement JIT-to-JIT calls" Apr 13, 2025
@k0kubun k0kubun marked this pull request as ready for review April 13, 2025 08:21
@k0kubun k0kubun requested a review from a team as a code owner April 13, 2025 08:21
@k0kubun
Member Author

k0kubun commented Apr 13, 2025

I finished up the implementation using call/ret instructions. I updated the PR description to include the benchmark results of fib on x86_64 and arm64.

@maximecb
Contributor

Just arrived at the airport hotel in Narita. Getting settled and flying to Matsuyama tomorrow :)

> I finished up the implementation using call/ret instructions. I updated the PR description to include the benchmark results of fib on x86_64 and arm64.

Wow, go Kokubun! Awesome that we can indeed beat YJIT on this microbenchmark with fast JIT-to-JIT calls!

For performance I think we should avoid manipulating the CFP in JIT-to-JIT calls, to be maximally efficient. We can recover callees and unwind the stack using C stack pointers instead. We can still merge this PR without that change though.

# push callee control frame
0x560c4f9e50fb: mov rdi, qword ptr [r13 + 0x18]
0x560c4f9e50ff: mov qword ptr [r13 - 0x20], rdi
# switch to new CFP
0x560c4f9e5103: sub r13, 0x38
0x560c4f9e5107: mov qword ptr [r12 + 0x10], r13

Comment on lines +116 to +129
// Recursively compile callee ISEQs
while let Some((branch, iseq)) = branch_iseqs.pop() {
    // Disable profiling. This will be the last use of the profiling information for the ISEQ.
    unsafe { rb_zjit_profile_disable(iseq); }

    // Compile the ISEQ
    if let Some((callee_ptr, callee_branch_iseqs)) = gen_iseq(cb, iseq) {
        let callee_addr = callee_ptr.raw_ptr(cb);
        branch.regenerate(cb, |asm| {
            asm.ccall(callee_addr, vec![]);
        });
        branch_iseqs.extend(callee_branch_iseqs);
    }
}
Contributor

We want to avoid this long term. @tekknolagi had suggested using a trampoline to implement calls. The advantage is that this permits each ISEQ to only get compiled when its call threshold is hit.

Still open to merging this PR, but we should move towards using a trampoline for ISEQs that were not yet compiled.
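
The trampoline idea can be illustrated with a toy model in plain Ruby. This is not how ZJIT does or would implement it: in a real JIT the trampoline is a machine-code stub and "patching" rewrites the call instruction, whereas here a call site is simply a mutable slot holding a lambda. `CallSite`, `install_trampoline`, and `CALL_THRESHOLD` are hypothetical names introduced for this sketch.

```ruby
CALL_THRESHOLD = 2 # hypothetical value for this sketch

# A call site modeled as a mutable slot holding whatever should run next.
class CallSite
  attr_accessor :target

  def call(*args)
    target.call(*args)
  end
end

# Point the call site at a trampoline. The trampoline falls back to the
# interpreter until the callee has been called CALL_THRESHOLD times, then
# compiles it once and patches the call site to jump straight to the result.
def install_trampoline(site, interpret:, compile:, events:)
  calls = 0
  site.target = lambda do |*args|
    calls += 1
    next interpret.call(*args) if calls < CALL_THRESHOLD

    compiled = compile.call
    site.target = compiled # later calls bypass the trampoline entirely
    events << :compiled
    compiled.call(*args)
  end
end

events = []
site = CallSite.new
install_trampoline(
  site,
  interpret: ->(n) { n + 1 },      # stand-in for interpreting the callee
  compile: -> { ->(n) { n + 1 } }, # stand-in for compiling the callee
  events: events
)
results = 3.times.map { |i| site.call(i) }
p results
p events
```

The point of the design is that the caller can be compiled without compiling its callees up front. Each callee pays its compilation cost only once it proves hot, and only the first call past the threshold goes through the slow path.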

Member Author

Yeah sure, we can leave a trampoline to lazily compile callee ISEQs.

I'd like to land the changes in this PR first and work on it in a separate PR though.

Contributor

Sure, we can land the changes in this PR first 👍

I really appreciate the work that you put into this :)

Contributor

@maximecb maximecb left a comment

Want to give others a chance to review the PR as well. Impressed you got it working this fast.

@k0kubun
Member Author

k0kubun commented Apr 13, 2025

> For performance I think we should avoid manipulating the CFP in JIT-to-JIT calls, to be maximally efficient. We can recover callees and unwind the stack using C stack pointers instead. We can still merge this PR without that change though.

Yeah I agree. We should probably make cfp->self part of the function arguments. The dynamic dispatch currently relies on the CFP register and ec->cfp, but we could do it lazily as needed. I want to scope them out in this PR though.

@maximecb
Contributor

> Yeah I agree. We should probably make cfp->self part of the function arguments. The dynamic dispatch currently relies on the CFP register and ec->cfp, but we could do it lazily as needed. I want to scope them out in this PR though.

I've been leaning towards that too. Logically, it seems like self should be a method argument.

We can also have logic (later on) to simply not pass function arguments that are not used by the callee when we know the identity of the callee (call direct), so arguments that are not used can be "free" in many cases.
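
As a rough illustration of that last idea, the sketch below treats `self` as an ordinary leading parameter and drops any argument the known callee never reads. It is a toy model only: `KnownCallee` and `args_to_pass` are hypothetical names, and a real compiler would derive the used set from its IR rather than take it as input.

```ruby
# Hypothetical description of a callee whose identity is known at compile time:
# its formal parameters (with self listed explicitly) and the subset it reads.
KnownCallee = Struct.new(:params, :used)

# Pair each formal with its actual value, keeping only those the callee reads.
def args_to_pass(callee, actuals)
  callee.params.zip(actuals).select { |name, _| callee.used.include?(name) }.to_h
end

# A method like `def add(a, b) = a + b` never reads self,
# so a direct call to it would not need to pass self at all.
add = KnownCallee.new([:self, :a, :b], [:a, :b])
p args_to_pass(add, [:main_object, 1, 2])
```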

@k0kubun k0kubun merged commit b78daed into master Apr 14, 2025
9 checks passed
@k0kubun k0kubun deleted the send-iseq branch April 14, 2025 07:08
k0kubun added a commit that referenced this pull request Apr 18, 2025
* Implement JIT-to-JIT calls

* Use a closer dummy address for Arm64

* Revert an obsoleted change

* Revert a few more obsoleted changes

* Fix outdated comments

* Explain PosMarkers for CCall

* s/JIT code/machine code/

* Get rid of ParallelMov
k0kubun added a commit to k0kubun/ruby that referenced this pull request Apr 18, 2025
* Implement JIT-to-JIT calls

* Use a closer dummy address for Arm64

* Revert an obsoleted change

* Revert a few more obsoleted changes

* Fix outdated comments

* Explain PosMarkers for CCall

* s/JIT code/machine code/

* Get rid of ParallelMov