ARM 64bit has come!
Tetsuyuki Kobayashi
2014.5.23 Japan Technical Jamboree
2014.5.25 Updated for カーネル/VM探検隊 (Kernel/VM Expedition)
- The latest version of these slides will be available here:
  http://www.slideshare.net/tetsu.koba/presentations
Who am I?
- 20+ years involved in embedded systems
  - 10 years in real-time OS, such as iTRON
  - 10 years in embedded Java Virtual Machine
  - Now GCC, Linux, QEMU, Android, …
- Blogs
  - http://d.hatena.ne.jp/embedded/ (Personal)
  - http://blog.kmckk.com/ (Corporate)
  - http://kobablog.wordpress.com/ (English)
- Twitter
  - @tetsu_koba
Today's topics
- Introduction to ARM 64-bit
  - Does not cover everything, only the parts I find interesting :)
- Try AArch64 using QEMU
ARMv8 terminology
- AArch64: 64-bit execution state
  - 1 instruction set: A64
  - A64: 32-bit fixed-length instructions
- AArch32: 32-bit execution state
  - Backward compatible with the ARMv7-A architecture
  - 2 instruction sets: A32, T32
  - A32: ARM, 32-bit fixed-length instructions
  - T32: Thumb-2, mixed 16-bit/32-bit instructions
"ARM64" is not the official name
- It is the name used in the kernel source
  - arch/arm64
Exception levels
- 4 levels
- Typical usage
  - EL0: User application
  - EL1: OS kernel
  - EL2: Hypervisor
  - EL3: Secure monitor
- Switching between AArch64 and AArch32 is possible only when the exception level changes
- cf. PL0-PL2 (privilege levels) in ARMv7
AArch64 execution model
- R0 – R30: 64-bit general purpose registers
  - Wn: lower 32 bits
  - Xn: all 64 bits
  - The 32nd register (number 31) means the zero register (XZR, WZR) or SP, depending on the instruction
- SP: Stack Pointer
  - Must be 16-byte aligned
  - WSP for the lower 32 bits
- PC: Program Counter
  - Cannot be used as the destination of a calculation
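The Wn/Xn views and the SP alignment rule above can be modeled in plain C. This is only a sketch: `gpr_t`, `read_w`, `write_w`, and `sp_is_aligned` are hypothetical names invented for illustration, not part of any ARM or compiler API. It relies on the architectural rule that a write to Wn clears the upper 32 bits of Xn.

```c
#include <stdint.h>

/* Hypothetical model of one general purpose register (illustration only). */
typedef struct {
    uint64_t x;   /* the full 64-bit Xn view */
} gpr_t;

/* Reading Wn yields the lower 32 bits of Xn. */
uint32_t read_w(const gpr_t *r)
{
    return (uint32_t)r->x;
}

/* Writing Wn zero-extends: the upper 32 bits of Xn become 0. */
void write_w(gpr_t *r, uint32_t value)
{
    r->x = (uint64_t)value;
}

/* SP is expected to be a multiple of 16 when it is used to access memory. */
int sp_is_aligned(uint64_t sp)
{
    return (sp & 0xF) == 0;
}
```

For example, after `write_w` on a register that held `0xFFFFFFFF12345678`, the Xn view holds only the newly written 32-bit value.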
AArch64 execution model (cont.)
- V0 – V31: 128-bit registers
  - For floating point and SIMD
  - AArch64 must have an FPU. There is no calling standard for soft-float.
- Scalar
  - Bn, Hn, Sn, Dn, Qn
- Vector
  - Vn.8B, Vn.16B, Vn.4H, Vn.8H, Vn.2S, Vn.4S, Vn.1D, Vn.2D
- FPCR: Floating Point Control Register
- FPSR: Floating Point Status Register
AArch64 addressing model
- Without tag: 64-bit virtual address
- With tag: 8-bit tag + 56-bit virtual address
  - The tag is ignored on load, store, and branch
  - Good for implementing dynamically typed languages
- The effective virtual address length is 48 bits.
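As a rough illustration of the 8-bit tag + 56-bit address layout, a language runtime could keep a type tag in the top byte of a pointer. The helpers below (`tag_pointer`, `pointer_tag`, `strip_tag`) are hypothetical names, and the sketch assumes user-space pointers whose top byte is zero. It strips the tag before dereferencing so that it also runs on machines without tag support; on AArch64 with tagging enabled the hardware would ignore the tag itself.

```c
#include <stdint.h>

#define TAG_SHIFT 56                                  /* tag lives in bits 63..56 */
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)    /* low 56 bits              */

/* Combine a pointer and an 8-bit tag into one 64-bit value. */
uint64_t tag_pointer(const void *p, uint8_t tag)
{
    uint64_t addr = (uint64_t)(uintptr_t)p & ADDR_MASK;
    return addr | ((uint64_t)tag << TAG_SHIFT);
}

/* Extract the 8-bit tag. */
uint8_t pointer_tag(uint64_t tagged)
{
    return (uint8_t)(tagged >> TAG_SHIFT);
}

/* Recover a plain pointer (assumes the original top byte was zero). */
void *strip_tag(uint64_t tagged)
{
    return (void *)(uintptr_t)(tagged & ADDR_MASK);
}
```

A dynamically typed runtime might use the tag to distinguish, say, strings from arrays without an extra memory access.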
Calling standard (AAPCS64)
- R30 = LR (Link Register)
- R29 = FP (Frame Pointer)
- Parameter passing
  - R0 – R7 for integers and pointers
  - V0 – V7 for floating point
- Callee must preserve
  - R19 – R29, SP
  - V8 – V15 (only the lower 64 bits)
- No calling standard for soft-float
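The parameter-passing rule above can be sketched as a tiny allocator. This is a deliberately simplified model: `assign_args` is a hypothetical helper, it handles only scalar integer/pointer (`'i'`) and floating-point (`'f'`) arguments, and it ignores structs, vectors, and variadic functions, which the real AAPCS64 treats with additional rules.

```c
#include <stdio.h>

/*
 * kinds: one character per argument, 'i' = integer or pointer,
 *        'f' = float or double. Anything else is placed on the stack.
 * out:   receives a space-separated list such as "X0 V0 X1 stack".
 */
void assign_args(const char *kinds, char *out, size_t outlen)
{
    int next_x = 0, next_v = 0;   /* next free X and V register numbers */
    size_t used = 0;

    if (outlen == 0)
        return;
    out[0] = '\0';
    for (const char *k = kinds; *k != '\0'; k++) {
        char loc[16];
        if (*k == 'i' && next_x < 8)
            snprintf(loc, sizeof loc, "X%d", next_x++);
        else if (*k == 'f' && next_v < 8)
            snprintf(loc, sizeof loc, "V%d", next_v++);
        else
            snprintf(loc, sizeof loc, "stack");   /* registers exhausted */

        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? " " : "", loc);
        if (n < 0 || (size_t)n >= outlen - used)
            break;                                /* output buffer full */
        used += (size_t)n;
    }
}
```

Integer and floating-point arguments are counted independently, so `"ifif"` maps to X0, V0, X1, V1 rather than consuming four registers of one kind.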
A64 instruction set
- Brand-new, clean design for the 64-bit architecture
- Not every instruction can be conditional; there is only a very small set of "conditional data processing" instructions
- No equivalent of Thumb-2's IT instruction
No multiple load/store
- No instructions that load/store multiple GP registers, such as LDM/STM, PUSH/POP
- Instead, there are register-pair load/store instructions: LDP/STP
YIELD instruction
- A NOP that hints the current thread is doing nothing important
- Used in spin loops; can trigger a context switch on SMT (Simultaneous Multi-Threading)
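A minimal sketch of how YIELD might appear in a spin loop, using C11 atomics and GCC/Clang inline assembly. The names `cpu_relax`, `demo_spinlock`, `spin_lock`, and `spin_unlock` are hypothetical. On targets other than AArch64 the hint compiles to nothing, so the code stays portable.

```c
#include <stdatomic.h>

/* Tell the core we are busy-waiting. Emits YIELD only on AArch64. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ __volatile__("yield" ::: "memory");
#endif
}

typedef struct {
    atomic_flag locked;
} demo_spinlock;

/* Spin until the flag was previously clear, hinting on every failed try. */
void spin_lock(demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        cpu_relax();
}

void spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A lock would be declared as `demo_spinlock l = { ATOMIC_FLAG_INIT };`. Whether YIELD changes anything at run time depends on the core; on a non-SMT core it may behave as a plain NOP.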
Sample #1 source
#include <stdio.h>
int main()
{
    int i;
    for (i = 5; i >= 0; i--) {
        printf("count down: %d\n", i);
    }
    return 0;
}
Sample #1 Thumb2
000083f8 <main>:
83f8: b570 push {r4, r5, r6, lr}
83fa: 2405 movs r4, #5
83fc: f248 456c movw r5, #33900 ; 0x846c
8400: f2c0 0500 movt r5, #0
8404: 2601 movs r6, #1
8406: 4630 mov r0, r6
8408: 4629 mov r1, r5
840a: 4622 mov r2, r4
840c: f7ff ef7a blx 8304 <_init+0x38>
8410: 3c01 subs r4, #1
8412: f1b4 3fff cmp.w r4, #4294967295 ; 0xffffffff
8416: d1f6 bne.n 8406 <main+0xe>
8418: 2000 movs r0, #0
841a: bd70 pop {r4, r5, r6, pc}
Sample #1 A64
0000000000400440 <main>:
400440: a9be7bfd stp x29, x30, [sp,#-32]!
400444: 910003fd mov x29, sp
400448: a90153f3 stp x19, x20, [sp,#16]
40044c: 90000014 adrp x20, 400000 <_init-0x3c0>
400450: 528000b3 mov w19, #0x5 // #5
400454: 911a0294 add x20, x20, #0x680
400458: 2a1303e2 mov w2, w19
40045c: 52800020 mov w0, #0x1 // #1
400460: aa1403e1 mov x1, x20
400464: 97ffffeb bl 400410 <__printf_chk@plt>
400468: 51000673 sub w19, w19, #0x1
40046c: 3100067f cmn w19, #0x1
400470: 54ffff41 b.ne 400458 <main+0x18>
400474: 52800000 mov w0, #0x0 // #0
400478: a94153f3 ldp x19, x20, [sp,#16]
40047c: a8c27bfd ldp x29, x30, [sp],#32
400480: d65f03c0 ret
Sample #2 source
int iaload(int *base, int index)
{
    return base[index];
}
long long laload(long long *base, int index)
{
    return base[index];
}
char ibload(char *base, int index)
{
    return base[index];
}
short isload(short *base, int index)
{
    return base[index];
}
Sample #2 Thumb2
00000000 <iaload>:
0: f850 0021 ldr.w r0, [r0, r1, lsl #2]
4: 4770 bx lr
6: bf00 nop
00000008 <laload>:
8: eb00 01c1 add.w r1, r0, r1, lsl #3
c: e9d1 0100 ldrd r0, r1, [r1]
10: 4770 bx lr
12: bf00 nop
00000014 <ibload>:
14: 5c40 ldrb r0, [r0, r1]
16: 4770 bx lr
00000018 <isload>:
18: f930 0011 ldrsh.w r0, [r0, r1, lsl #1]
1c: 4770 bx lr
1e: bf00 nop
Sample #2 A64
0000000000000000 <iaload>:
0: b861d800 ldr w0, [x0,w1,sxtw #2]
4: d65f03c0 ret
0000000000000008 <laload>:
8: f861d800 ldr x0, [x0,w1,sxtw #3]
c: d65f03c0 ret
0000000000000010 <ibload>:
10: 3861c800 ldrb w0, [x0,w1,sxtw]
14: d65f03c0 ret
0000000000000018 <isload>:
18: 7861d800 ldrh w0, [x0,w1,sxtw #1]
1c: d65f03c0 ret
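The `[x0,w1,sxtw #2]` operand in the A64 listing folds the sign extension of the 32-bit index and the scaling by the element size into the load itself. A sketch of that address computation in C, where `ea_sxtw` is a hypothetical helper name:

```c
#include <stdint.h>

/* Effective address of the operand [Xn, Wm, SXTW #shift]. */
uint64_t ea_sxtw(uint64_t base, uint32_t wm, unsigned shift)
{
    /* SXTW: sign-extend the 32-bit index to 64 bits. */
    int64_t offset = (int64_t)(int32_t)wm;

    /* Scale by the element size. The shift is done on an unsigned value
       because left-shifting a negative signed value is undefined in C. */
    return base + ((uint64_t)offset << shift);
}
```

With `shift` set to 2 this matches `&base[index]` for an `int` array, including negative indexes, which is why `iaload` needs only a single load instruction.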
Sample #3 source
double range(double x, double min, double max)
{
    if (x < min)
        return min;
    else if (x > max)
        return max;
    else
        return x;
}
Sample #3 Thumb2
00000000 <range>:
0: eeb4 0bc1 vcmpe.f64 d0, d1
4: eef1 fa10 vmrs APSR_nzcv, fpscr
8: d407 bmi.n 1a <range+0x1a>
a: eeb4 0bc2 vcmpe.f64 d0, d2
e: eef1 fa10 vmrs APSR_nzcv, fpscr
12: bfc8 it gt
14: eeb0 0b42 vmovgt.f64 d0, d2
18: 4770 bx lr
1a: eeb0 0b41 vmov.f64 d0, d1
1e: 4770 bx lr
Sample #3 A64
0000000000000000 <range>:
0: 1e612010 fcmpe d0, d1
4: 540000a4 b.mi 18 <range+0x18>
8: 1e622010 fcmpe d0, d2
c: 1e604041 fmov d1, d2
10: 5400004c b.gt 18 <range+0x18>
14: 1e604001 fmov d1, d0
18: 1e604020 fmov d0, d1
1c: d65f03c0 ret
Cache control
- Application-level cache instructions
  - Data cache
    - DC CVAU
    - DC CVAC
    - DC CIVAC
  - Instruction cache
    - IC IVAU
- No need to make a kernel system call
- JIT friendly
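From C, a JIT would normally not issue these instructions by hand. GCC and Clang provide `__builtin___clear_cache(begin, end)`, which, to my understanding, is implemented on AArch64 Linux with a user-space DC CVAU / IC IVAU sequence plus barriers. The sketch below uses a hypothetical helper, `publish_code`. It only copies instruction words and synchronizes the caches; allocating executable memory and jumping to it are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n instruction words into dst and make them visible to
 * instruction fetch. dst is assumed to be writable memory; it must
 * also be executable if the code is going to be called afterwards.
 */
void publish_code(uint32_t *dst, const uint32_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *src);
    __builtin___clear_cache((char *)dst, (char *)(dst + n));
}
```

As a usage example, the two words `0x52800540` (`mov w0, #42`) and `0xd65f03c0` (`ret`) form a complete A64 function that returns 42. On architectures with coherent instruction caches the builtin may compile to nothing.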
Preloading cache
- PRFM <prfop>, addr|label
  - <prfop> ::= <type><target><policy>
  - <type> ::= PLD | PST | PLI
  - <target> ::= L1 | L2 | L3
  - <policy> ::= KEEP | STRM
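In C, the usual way to request a prefetch is the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`. On AArch64 the compiler is expected to lower it to a PRFM instruction, with `rw` selecting PLD or PST and `locality` selecting the target level and KEEP/STRM policy; the exact mapping is compiler-dependent, so treat it as an assumption. `sum_with_prefetch` and `PREFETCH_AHEAD` are hypothetical names, and the distance of 16 elements is an arbitrary value, not a tuned one.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* how many elements ahead to prefetch (arbitrary) */

/* Sum an array while hinting that upcoming elements will be read once. */
long sum_with_prefetch(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0: prefetch for load; locality = 0: streaming data. */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
        }
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint and never changes the result; whether it helps depends on the access pattern and on the hardware prefetcher.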
Non-temporal load/store
- LDNP/STNP
  - Hint that the data is unlikely to be accessed again (like streaming)
AArch32
- Backward compatible with ARMv7
- Added cryptography extension
- Added some other new instructions to align with AArch64
- Removed Jazelle, ThumbEE
Let's try AArch64 using QEMU
- QEMU 2.0 supports AArch64 user-mode emulation
- Ubuntu 14.04 has QEMU 2.0 and a cross compiler for AArch64
$ sudo apt-get install qemu-user-static
$ sudo apt-get install g++-aarch64-linux-gnu
Prepare gdb for aarch64
$ sudo apt-get build-dep gdb
$ wget http://ftp.gnu.org/gnu/gdb/gdb-7.7.1.tar.bz2
$ tar xf gdb-7.7.1.tar.bz2
$ mkdir obj
$ cd obj
$ ../gdb-7.7.1/configure --target=aarch64-linux-gnu
$ make
$ sudo make install
Execute with QEMU and connect gdb
$ aarch64-linux-gnu-gcc -g a.c
$ export QEMU_LD_PREFIX=/usr/aarch64-linux-gnu/
$ qemu-aarch64-static -g 1234 ./a.out
$ aarch64-linux-gnu-gdb ./a.out   # run in a second terminal; qemu is waiting for gdb on port 1234
  ...
(gdb) target remote :1234
(gdb) b main
(gdb) c
(gdb) x/i $pc
=> 0x4005a0 <main>: stp x29, x30, [sp,#-48]!
(gdb)
DEMO
References
- ARMv8 Technology Preview
- ARMv8 Instruction Set Overview
- ARM® Architecture Reference Manual
- Procedure Call Standard for the ARM 64-bit Architecture (AArch64)
- ARM 64bit ARMv8 のアーキテクチャの概要 (Overview of the ARM 64-bit ARMv8 architecture)
- Ubuntu 14.04 で arm 64bit (aarch64) のコードをコンパイルして動かしてみる (Compiling and running ARM 64-bit (aarch64) code on Ubuntu 14.04)
Any comment?
@tetsu_koba
Thank you for listening!
