ARM 64bit has come!

Tetsuyuki Kobayashi
2014.5.23 Japan Technical Jamboree
2014.5.25 Updated for カーネル/VM探検隊 (Kernel/VM Expedition)

 The latest version of these slides is available here:
 http://www.slideshare.net/tetsu.koba/presentations

Who am I?
 20+ years involved in embedded systems
 10 years in real-time OSes, such as iTRON
 10 years in embedded Java Virtual Machine
 Now GCC, Linux, QEMU, Android, …
 Blogs
 http://d.hatena.ne.jp/embedded/ (Personal)
 http://blog.kmckk.com/ (Corporate)
 http://kobablog.wordpress.com/ (English)
 Twitter
 @tetsu_koba

Today's topics
 Introduction to ARM 64-bit
 Not comprehensive: only the parts I find interesting :)
 Trying AArch64 using QEMU

ARMv8 terminology
 AArch64: 64-bit execution state
 1 instruction set: A64
 A64: 32-bit fixed-length instructions
 AArch32: 32-bit execution state
 Backward compatible with the ARMv7-A architecture
 2 instruction sets: A32, T32
 A32: ARM, 32-bit fixed-length instructions
 T32: Thumb-2, mixed 16-bit/32-bit instructions

ARM64 is not an official name
 The official name of the 64-bit state is AArch64
 "arm64" is the name used in the Linux kernel source: arch/arm64

Exception level
 4 levels
 Typical usage
 EL0: User application
 EL1: Kernel of OS
 EL2: Hypervisor
 EL3: Secure monitor
 Switching between AArch64 and AArch32 happens only on a change of exception level
 cf. PL0-PL2 (privilege levels) in ARMv7

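AArch64 execution model
 R0-R30: 64-bit general-purpose registers
 Wn: lower 32 bits
 Xn: full 64 bits
 Register number 31 encodes either the zero register (XZR, WZR) or SP, depending on the instruction
 SP: Stack Pointer
 Must be 16-byte aligned
 WSP for the lower 32 bits
 PC: Program Counter
 Not a general-purpose register: cannot be the destination of data-processing instructions; changed only by branches, exceptions and exception returns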
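Since SP must stay 16-byte aligned, compilers round each stack frame up to a multiple of 16; in Sample #1, `stp x29, x30, [sp,#-32]!` reserves 32 bytes for four 8-byte registers. A minimal C sketch of that rounding, assuming a frame holds only saved 64-bit registers plus locals (`align_up16` and `frame_size` are hypothetical helpers, not a compiler API):

```c
#include <stddef.h>

/* Round a byte count up to the next multiple of 16, the alignment
 * AArch64 requires of SP. Works because 16 is a power of two. */
static size_t align_up16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Hypothetical helper: bytes of stack needed to save `nregs`
 * 64-bit registers plus `locals` bytes of local variables,
 * keeping SP 16-byte aligned. */
static size_t frame_size(size_t nregs, size_t locals)
{
    return align_up16(nregs * 8u + locals);
}
```

For example, saving three registers (24 bytes) still costs a 32-byte frame, which is one reason register pairs are saved together.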

AArch64 execution model (cont.)
 V0-V31: 128-bit registers
 For floating point and SIMD
 AArch64 must have an FPU, so there is no calling standard for soft-float
 Scalar
 Bn, Hn, Sn, Dn, Qn (8, 16, 32, 64, 128 bits)
 Vector
 Vn.8B, Vn.16B, Vn.4H, Vn.8H, Vn.2S, Vn.4S, Vn.1D, Vn.2D
 FPCR: Floating Point Control Register
 FPSR: Floating Point Status Register
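To make the arrangement names concrete, the union below is a portable C model of one 128-bit V register; it is only an illustration, not the `arm_neon.h` API. In an arrangement such as `.4S` the number is the lane count and the letter the element size, so the 64-bit arrangements (`.8B`, `.4H`, `.2S`, `.1D`) use only the low half of the register (`vreg128` and `lanes` are hypothetical names):

```c
#include <stdint.h>

/* Model of one 128-bit V register and its lane arrangements.
 * Full-width: .16B, .8H, .4S, .2D
 * Low 64 bits only: .8B, .4H, .2S, .1D */
typedef union {
    uint8_t  b[16];  /* .16B (.8B = b[0..7]) */
    uint16_t h[8];   /* .8H  (.4H = h[0..3]) */
    uint32_t s[4];   /* .4S  (.2S = s[0..1]) */
    uint64_t d[2];   /* .2D  (.1D = d[0])    */
    float    f[4];   /* .4S lanes viewed as single precision */
    double   df[2];  /* .2D lanes viewed as double precision */
} vreg128;

/* Number of lanes of `elem_bytes`-byte elements in a `bits`-bit view,
 * e.g. lanes(128, 4) == 4 for .4S, lanes(64, 1) == 8 for .8B. */
static unsigned lanes(unsigned bits, unsigned elem_bytes)
{
    return bits / (elem_bytes * 8u);
}
```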

AArch64 addressing model
 Without tag: 64-bit virtual address
 With tag: 8-bit tag + 56-bit virtual address
 The tag (top byte) is ignored by loads, stores and branches
 Useful for implementing dynamically typed languages
 The effective virtual address size is 48 bits
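A minimal sketch of how a language runtime might keep a type tag in bits 63:56 of a pointer, assuming a 64-bit target. The helper names are hypothetical, and the tag is stripped before dereferencing so the same code also runs on CPUs that do not ignore the top byte:

```c
#include <stdint.h>

#define TAG_SHIFT 56
#define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)   /* low 56 bits */

/* Put an 8-bit tag into the top byte of a pointer value. */
static uint64_t tag_ptr(const void *p, uint8_t tag)
{
    return ((uint64_t)(uintptr_t)p & ADDR_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Read the tag back out of a tagged pointer. */
static uint8_t ptr_tag(uint64_t tagged)
{
    return (uint8_t)(tagged >> TAG_SHIFT);
}

/* Clear the tag before dereferencing. AArch64 hardware would ignore
 * it on load/store, but other CPUs would fault on a tagged address. */
static void *untag_ptr(uint64_t tagged)
{
    return (void *)(uintptr_t)(tagged & ADDR_MASK);
}
```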

Calling standard (AAPCS64)
 R30 = LR (Link Register)
 R29 = FP (Frame Pointer)
 Parameter passing
 R0-R7 for integers and pointers
 V0-V7 for floating point
 Callee must preserve
 R19-R29, SP
 V8-V15 (only the lower 64 bits, D8-D15)
 No calling standard for soft-float
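The register assignment can be illustrated with two ordinary C functions. The comments give the registers AAPCS64 assigns as I read the standard, and compiling with `aarch64-linux-gnu-gcc -O2 -S` is a way to check them (`mixed` and `sum9` are hypothetical). Integer/pointer and floating-point arguments are counted independently, and arguments beyond the eighth of each kind go on the stack:

```c
/* Integer and FP arguments use separate register sequences. */
double mixed(int i,         /* W0 */
             double d,      /* D0 */
             long l,        /* X1 */
             float f,       /* S1 */
             const int *p)  /* X2 */
{
    return i + d + l + f + *p;   /* double result returned in D0 */
}

/* X0-X7 hold a0..a7; the ninth integer argument (a8)
 * is passed on the stack. */
long sum9(long a0, long a1, long a2, long a3, long a4,
          long a5, long a6, long a7, long a8)
{
    return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8;  /* result in X0 */
}
```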

A64 instruction set
 A brand-new, clean design for a 64-bit architecture
 Most instructions are not conditional; there is only a very small set of "conditional data processing" instructions (e.g. CSEL, CSINC, CCMP)
 No equivalent of Thumb2's IT instruction.
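In place of IT blocks, A64 offers conditional-select instructions (CSEL, CSINC, CSINV, CSNEG). The integer functions below are the kind of code GCC and Clang usually compile to a compare plus one of these at -O2; that is a typical outcome rather than a guarantee, so check the output with `-S`. `clamp_int` is a hypothetical integer analogue of Sample #3's `range()`:

```c
/* Each function is a candidate for branch-free code on AArch64:
 * a CMP followed by a conditional-select-family instruction. */
int max_int(int a, int b)
{
    return a > b ? a : b;
}

int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Clamp v into [lo, hi]; assumes lo <= hi. */
int clamp_int(int v, int lo, int hi)
{
    return max_int(lo, min_int(v, hi));
}
```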

No multiple load/store
 No instructions that load/store multiple general-purpose registers, such as LDM/STM or PUSH/POP
 Instead, there are register-pair loads/stores: LDP/STP

YIELD instruction
 A NOP that hints the thread is doing unimportant work, such as spinning
 Used in spin loops; on SMT (Simultaneous Multi-Threading) cores it can trigger a switch to another hardware thread
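A sketch of a spin lock that issues YIELD while waiting, using C11 `<stdatomic.h>`. The inline `yield` is only emitted when compiling for AArch64 (`__aarch64__`), so the code also builds elsewhere; names such as `cpu_relax` and `struct spinlock` are hypothetical:

```c
#include <stdatomic.h>

/* Spin-wait hint: YIELD on AArch64, nothing on other targets. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ __volatile__("yield" ::: "memory");
#endif
}

struct spinlock {
    atomic_flag locked;   /* initialize with ATOMIC_FLAG_INIT */
};

static void spin_lock(struct spinlock *l)
{
    /* test_and_set returns the previous value: true means the lock
     * is held by someone else, so hint the core and retry. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        cpu_relax();
}

/* Returns nonzero if the lock was acquired without waiting. */
static int spin_trylock(struct spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

static void spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```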

Sample #1 source
#include <stdio.h>
int main()
{
    int i;
    for (i = 5; i >= 0; i--) {
        printf("count down: %d\n", i);
    }
    return 0;
}

Sample #1 Thumb2
000083f8 <main>:
83f8: b570 push {r4, r5, r6, lr}
83fa: 2405 movs r4, #5
83fc: f248 456c movw r5, #33900 ; 0x846c
8400: f2c0 0500 movt r5, #0
8404: 2601 movs r6, #1
8406: 4630 mov r0, r6
8408: 4629 mov r1, r5
840a: 4622 mov r2, r4
840c: f7ff ef7a blx 8304 <_init+0x38>
8410: 3c01 subs r4, #1
8412: f1b4 3fff cmp.w r4, #4294967295 ; 0xffffffff
8416: d1f6 bne.n 8406 <main+0xe>
8418: 2000 movs r0, #0
841a: bd70 pop {r4, r5, r6, pc}

Sample #1 A64
0000000000400440 <main>:
400440: a9be7bfd stp x29, x30, [sp,#-32]!
400444: 910003fd mov x29, sp
400448: a90153f3 stp x19, x20, [sp,#16]
40044c: 90000014 adrp x20, 400000 <_init-0x3c0>
400450: 528000b3 mov w19, #0x5 // #5
400454: 911a0294 add x20, x20, #0x680
400458: 2a1303e2 mov w2, w19
40045c: 52800020 mov w0, #0x1 // #1
400460: aa1403e1 mov x1, x20
400464: 97ffffeb bl 400410 <__printf_chk@plt>
400468: 51000673 sub w19, w19, #0x1
40046c: 3100067f cmn w19, #0x1
400470: 54ffff41 b.ne 400458 <main+0x18>
400474: 52800000 mov w0, #0x0 // #0
400478: a94153f3 ldp x19, x20, [sp,#16]
40047c: a8c27bfd ldp x29, x30, [sp],#32
400480: d65f03c0 ret

Sample #2 source
int iaload(int *base, int index)
{
    return base[index];
}
long long laload(long long *base, int index)
{
    return base[index];
}
char ibload(char *base, int index)
{
    return base[index];
}
short isload(short *base, int index)
{
    return base[index];
}

Sample #2 Thumb2
00000000 <iaload>:
0: f850 0021 ldr.w r0, [r0, r1, lsl #2]
4: 4770 bx lr
6: bf00 nop
00000008 <laload>:
8: eb00 01c1 add.w r1, r0, r1, lsl #3
c: e9d1 0100 ldrd r0, r1, [r1]
10: 4770 bx lr
12: bf00 nop
00000014 <ibload>:
14: 5c40 ldrb r0, [r0, r1]
16: 4770 bx lr
00000018 <isload>:
18: f930 0011 ldrsh.w r0, [r0, r1, lsl #1]
1c: 4770 bx lr
1e: bf00 nop

Sample #2 A64
0000000000000000 <iaload>:
0: b861d800 ldr w0, [x0,w1,sxtw #2]
4: d65f03c0 ret
0000000000000008 <laload>:
8: f861d800 ldr x0, [x0,w1,sxtw #3]
c: d65f03c0 ret
0000000000000010 <ibload>:
10: 3861c800 ldrb w0, [x0,w1,sxtw]
14: d65f03c0 ret
0000000000000018 <isload>:
18: 7861d800 ldrh w0, [x0,w1,sxtw #1]
1c: d65f03c0 ret
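In these A64 listings the 32-bit `int` index is sign-extended (`sxtw`) and scaled by the element size inside the addressing mode, so each load is a single instruction. The C sketch below spells out that effective-address computation (`effective_address` is a hypothetical name); the sign extension is what keeps a negative index correct when pointers are 64 bits wide:

```c
#include <stdint.h>

/* Effective address of "ldr w0, [x0, w1, sxtw #shift]":
 * sign-extend the 32-bit index to 64 bits, scale it by
 * 2^shift (the element size), then add it to the base. */
static uintptr_t effective_address(uintptr_t base, int32_t index, unsigned shift)
{
    /* Multiply instead of shifting to avoid left-shifting a negative value. */
    int64_t offset = (int64_t)index * ((int64_t)1 << shift);
    return base + (uintptr_t)offset;   /* wraps modulo 2^N, like the hardware */
}
```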

Sample #3 source
double range(double x, double min, double max)
{
    if (x < min)
        return min;
    else if (x > max)
        return max;
    else
        return x;
}

Sample #3 Thumb2
00000000 <range>:
0: eeb4 0bc1 vcmpe.f64 d0, d1
4: eef1 fa10 vmrs APSR_nzcv, fpscr
8: d407 bmi.n 1a <range+0x1a>
a: eeb4 0bc2 vcmpe.f64 d0, d2
e: eef1 fa10 vmrs APSR_nzcv, fpscr
12: bfc8 it gt
14: eeb0 0b42 vmovgt.f64 d0, d2
18: 4770 bx lr
1a: eeb0 0b41 vmov.f64 d0, d1
1e: 4770 bx lr

Sample #3 A64
0000000000000000 <range>:
0: 1e612010 fcmpe d0, d1
4: 540000a4 b.mi 18 <range+0x18>
8: 1e622010 fcmpe d0, d2
c: 1e604041 fmov d1, d2
10: 5400004c b.gt 18 <range+0x18>
14: 1e604001 fmov d1, d0
18: 1e604020 fmov d0, d1
1c: d65f03c0 ret

Cache control
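 Cache maintenance instructions usable at application level (EL0)
 Data cache
 DC CVAU: clean to the point of unification
 DC CVAC: clean to the point of coherency
 DC CIVAC: clean and invalidate to the point of coherency
 Instruction cache
 IC IVAU: invalidate to the point of unification
 No kernel system call needed
 JIT-friendly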
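From C, GCC and Clang expose this through `__builtin___clear_cache(begin, end)`. On AArch64 Linux it is expected to expand to the user-space DC CVAU / IC IVAU sequence, whereas on 32-bit ARM Linux it typically goes through the `cacheflush` system call. A minimal sketch of publishing generated code; `emit_code` is hypothetical, and allocating executable memory (e.g. with `mmap`) is left out:

```c
#include <stddef.h>
#include <string.h>

/* Copy generated machine code into `dst`, then make it visible to
 * instruction fetch: clean the data cache and invalidate the
 * instruction cache for [dst, dst + len). */
static void emit_code(void *dst, const void *code, size_t len)
{
    memcpy(dst, code, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

The bytes `c0 03 5f d6` are the `ret` instruction (0xd65f03c0) from the listings above in little-endian order, a convenient one-instruction payload for trying this out.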

Non-temporal load/store
 LDNP/STNP
 Hint that the data is unlikely to be accessed again soon (e.g. streaming)

AArch32
 Backward compatible with ARMv7
 Adds the (optional) Cryptography Extension: AES, SHA-1, SHA-256
 Adds some other new instructions that mirror AArch64 (e.g. load-acquire/store-release)
 Removes Jazelle and ThumbEE
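ARMv8 added load-acquire/store-release instructions to AArch32 (LDA, STL and related forms), matching AArch64's LDAR/STLR. C11 acquire and release atomics are what compilers typically lower to them on ARMv8. A minimal message-passing sketch; `publish` and `try_consume` are hypothetical names:

```c
#include <stdatomic.h>

static int payload;        /* plain data written by the producer */
static atomic_int ready;   /* 0 until payload is valid */

void publish(int value)
{
    payload = value;
    /* store-release: typically STLR (AArch64) or STL (ARMv8 AArch32) */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Returns 1 and stores the payload in *out once it has been published. */
int try_consume(int *out)
{
    /* load-acquire: typically LDAR (AArch64) or LDA (ARMv8 AArch32) */
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;
        return 1;
    }
    return 0;
}
```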

Let's try AArch64 using QEMU
 QEMU 2.0 supports AArch64 user-mode emulation
 Ubuntu 14.04 has QEMU 2.0 and a cross compiler for AArch64
$ sudo apt-get install qemu-user-static
$ sudo apt-get install g++-aarch64-linux-gnu

Prepare gdb for aarch64
$ sudo apt-get build-dep gdb
$ wget http://ftp.gnu.org/gnu/gdb/gdb-7.7.1.tar.bz2
$ tar xf gdb-7.7.1.tar.bz2
$ mkdir obj
$ cd obj
$ ../gdb-7.7.1/configure --target=aarch64-linux-gnu
$ make
$ sudo make install

Run with QEMU and connect gdb
$ aarch64-linux-gnu-gcc -g a.c
$ export QEMU_LD_PREFIX=/usr/aarch64-linux-gnu/
$ qemu-aarch64-static -g 1234 ./a.out
$ aarch64-linux-gnu-gdb ./a.out
...
(gdb) target remote :1234
(gdb) b main
(gdb) c
(gdb) x/i $pc
=> 0x4005a0 <main>: stp x29, x30, [sp,#-48]!
(gdb)


Any comments?
@tetsu_koba
Thank you for listening!
