False Sharing in Java
I wanted to explain the problem of false sharing to colleagues with the following code:
Code.
Without volatile (which on x86 puts a lock-prefixed instruction after each store) the effect is much less visible: the CPU does not need to drain the store buffer and can use store-to-load forwarding.
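As a stand-in illustration of the effect (not the benchmark code referred to above), here is a minimal sketch. It assumes 64-byte cache lines and uses an AtomicLongArray so that every increment ends in a volatile store; the class name, slot indices and iteration count are made up for illustration. Slots 0 and 1 will usually sit on the same cache line, while slots 0 and 16 are 128 bytes apart and cannot share one.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingSketch {

    /** Two threads each bump their own slot; returns the elapsed time in nanoseconds. */
    static long run(AtomicLongArray counters, int slotA, int slotB, int iterations) {
        Thread a = new Thread(() -> count(counters, slotA, iterations));
        Thread b = new Thread(() -> count(counters, slotB, iterations));
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    // Single writer per slot, so a volatile read followed by a volatile write loses no updates.
    private static void count(AtomicLongArray counters, int slot, int iterations) {
        for (int i = 0; i < iterations; i++) {
            counters.set(slot, counters.get(slot) + 1);
        }
    }

    public static void main(String[] args) {
        int iterations = 50_000_000; // arbitrary value, chosen for illustration
        long sameLine = run(new AtomicLongArray(32), 0, 1, iterations);
        long apart = run(new AtomicLongArray(32), 0, 16, iterations);
        System.out.printf("neighbouring slots: %d ms%n", sameLine / 1_000_000);
        System.out.printf("128 bytes apart:    %d ms%n", apart / 1_000_000);
    }
}
```

A timed main like this has no warm-up and no statistics, so it only hints at the effect; a harness such as JMH is the better tool for real numbers.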
Windows 11 [Version 10.0.22621.3374]
Intel(R) Core(TM) i7-8850H CPU @ (2.60 - 4.30 GHz)
Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Microcode signature: 000000F0
Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
Unified Cache 0, Level 2, 256 KB, Assoc 4, LineSize 64
Unified Cache 1, Level 3, 9 MB, Assoc 12, LineSize 64
# JMH version: 1.37
# VM version: JDK 17.0.8, OpenJDK 64-Bit Server VM, 17.0.8+7-LTS
Benchmark                                                    Mode  Cnt   Score   Error  Units
FalseSharingBenchmarks.countWith1Thread                     thrpt   10   1,419 ± 0,032  ops/s
FalseSharingBenchmarks.countWith1ThreadCounterNonVolatile   thrpt   10  11,765 ± 0,228  ops/s
FalseSharingBenchmarks.countWith2Threads                    thrpt   10   0,468 ± 0,041  ops/s
FalseSharingBenchmarks.countWith2ThreadsDifferentCacheLine  thrpt   10   2,698 ± 0,064  ops/s
FalseSharingBenchmarks.countWith4Threads                    thrpt   10   0,393 ± 0,009  ops/s
FalseSharingBenchmarks.countWith4ThreadsDifferentCacheLine  thrpt   10   4,763 ± 0,100  ops/s
FalseSharingBenchmarks.countWith8Threads                    thrpt   10   0,470 ± 0,009  ops/s
FalseSharingBenchmarks.countWith8ThreadsDifferentCacheLine  thrpt   10   7,895 ± 0,147  ops/s
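To put the table into numbers, the following sketch divides each DifferentCacheLine score by its shared-cache-line counterpart. The scores are copied from the table above (with decimal points instead of commas); the class and method names are made up for illustration.

```java
final class SpeedupFromScores {

    /** Ratio of two JMH throughput scores (ops/s). */
    static double speedup(double fasterOpsPerSec, double slowerOpsPerSec) {
        return fasterOpsPerSec / slowerOpsPerSec;
    }

    public static void main(String[] args) {
        // {threads, score on separate cache lines, score on a shared cache line}
        double[][] rows = {
            {2, 2.698, 0.468},
            {4, 4.763, 0.393},
            {8, 7.895, 0.470},
        };
        for (double[] row : rows) {
            System.out.printf("%d threads: %.1fx%n", (int) row[0], speedup(row[1], row[2]));
        }
        // Single thread: non-volatile counter versus volatile counter.
        System.out.printf("non-volatile vs. volatile: %.1fx%n", speedup(11.765, 1.419));
    }
}
```

That works out to roughly 5.8x, 12.1x and 16.8x for 2, 4 and 8 threads, and about 8.3x for dropping volatile in the single-threaded case.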
Now let's take a look at the generated assembly code:
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel -jar target/jmh-benchmarks.jar
The arguments follow the Windows register order (RCX, RDX, R8, R9), because I ran this on Windows 11.
Code snippet:
int end = 1_000_000_000 / 8;
...
Thread t8 = new Thread(() -> {
for (int i = 0; i < end; i++) {
blackhole.consume(clng7++);
if ((i & 7) == 0) flushStoreBuffer8++; // flush/drain every 8th iteration
}
});
This compiles down to the following assembly code:
[Verified Entry Point]
# {method} {0x000001b12b480950} 'lambda$29' '(ILorg/openjdk/jmh/infra/Blackhole;)V' in 'codecoverage/de/FalseSharingBenchmarks'
# parm0: rdx = int
# parm1: r8:r8 = 'org/openjdk/jmh/infra/Blackhole'
# [sp+0x30] (sp of caller)
0x000001b0b8199180: mov DWORD PTR [rsp-0x7000],eax
0x000001b0b8199187: push rbp
0x000001b0b8199188: sub rsp,0x20
0x000001b0b819918c: mov r10d,edx ; r10=rdx=end (COUNT_TO(1_000_000_000) / 8 = 125.000.000)
0x000001b0b819918f: test edx,edx
0x000001b0b8199191: jle 0x000001b0b81992d1 ; RET if end <= 0; for loop
0x000001b0b8199197: test r8,r8 ; NPE test
0x000001b0b819919a: je 0x000001b0b81992fe
0x000001b0b81991a0: mov ebx,edx
0x000001b0b81991a2: dec ebx
0x000001b0b81991a4: mov r11d,0x80000000
0x000001b0b81991aa: cmp edx,ebx
0x000001b0b81991ac: cmovl ebx,r11d ; ebx = 124.999.999
0x000001b0b81991b0: movabs rsi,0x44b976180 ; rsi ptr to class 'codecoverage/de/FalseSharingBenchmarks'
0x000001b0b81991ba: mov r11,QWORD PTR [rsi+0x4b8] ; load clng7
0x000001b0b81991c1: mov r9,QWORD PTR [rsi+0x270] ; load flushStoreBuffer8
0x000001b0b81991c8: add r11,0x1
0x000001b0b81991cc: mov QWORD PTR [rsi+0x4b8],r11 ; store clng7+=1
0x000001b0b81991d3: add r9,0x1
0x000001b0b81991d7: mov QWORD PTR [rsi+0x270],r9 ; store flushStoreBuffer8+=1
0x000001b0b81991de: lock add DWORD PTR [rsp-0x40],0x0 ; StoreLoad barrier ...heavy one...flush
0x000001b0b81991e4: mov ecx,0x1
0x000001b0b81991e9: cmp ebx,0x1
0x000001b0b81991ec: jle 0x000001b0b81992a9
0x000001b0b81991f2: xor edx,edx ; edx = 0
0x000001b0b81991f4: mov edi,0x7d0 ; 2000
|---------0x000001b0b81991f9: jmp 0x000001b0b819928a
| 0x000001b0b81991fe: mov r9,QWORD PTR [rsi+0x270] ; load flushStoreBuffer8
| 0x000001b0b8199205: add r9,0x1
| 0x000001b0b8199209: mov QWORD PTR [rsi+0x270],r9
| 0x000001b0b8199210: lock add DWORD PTR [rsp-0x40],0x0 ; StoreLoad
| 0x000001b0b8199216: data16 nop WORD PTR [rax+rax*1+0x0] ; NOP
| 0x000001b0b8199220: add ecx,0x2 ; ecx = 3 <-------------------|
| 0x000001b0b8199223: cmp ecx,r11d ; |
| |------0x000001b0b8199226: jge 0x000001b0b819927c ; ~2000 pagefault if SafePoint |
| | |--->0x000001b0b8199228: mov r9,QWORD PTR [rsi+0x4b8] ; load clng7 |
| | | 0x000001b0b819922f: add r9,0x1 ; |
| | | 0x000001b0b8199233: mov QWORD PTR [rsi+0x4b8],r9 ; store clng7+=1 |
| | | 0x000001b0b819923a: test ecx,0x7 ; if ((i & 7) == 0) then |
| | ||---0x000001b0b8199240: je 0x000001b0b819925b ; add 1 flushStoreBuffer8 and flush |
| | |||->0x000001b0b8199242: mov ebp,ecx ; |
| | ||| 0x000001b0b8199244: inc ebp ; i++ |
| | ||| 0x000001b0b8199246: add r9,0x1 ; |
| | ||| 0x000001b0b819924a: mov QWORD PTR [rsi+0x4b8],r9 ; store clng7+=1 |
| | ||| 0x000001b0b8199251: test ebp,0x7 ; if ((i & 7) == 0) then |
| | ||| 0x000001b0b8199257: je 0x000001b0b81991fe ; flushStoreBuffer8++ |
| | ||| 0x000001b0b8199259: jmp 0x000001b0b8199220 ; if ((i & 7) != 0) --------------------|
| | ||+->0x000001b0b819925b: mov r9,QWORD PTR [rsi+0x270] ; load flushStoreBuffer8
| | | | 0x000001b0b8199262: add r9,0x1
| | | | 0x000001b0b8199266: mov QWORD PTR [rsi+0x270],r9 ; flushStoreBuffer8++
| | | | 0x000001b0b819926d: lock add DWORD PTR [rsp-0x40],0x0 ; StoreLoad
| | | | 0x000001b0b8199273: mov r9,QWORD PTR [rsi+0x4b8] ; load clng7
| | | |--0x000001b0b819927a: jmp 0x000001b0b8199242
| |-+--->0x000001b0b819927c: mov r11,QWORD PTR [r15+0x358] ;
| | 0x000001b0b8199283: test DWORD PTR [r11],eax ; SafePoint {poll}
| | 0x000001b0b8199286: cmp ecx,ebx
| | |--0x000001b0b8199288: jge 0x000001b0b81992a9
|----+-+->0x000001b0b819928a: mov r11d,ebx
| | 0x000001b0b819928d: sub r11d,ecx ; 124.999.998
| | 0x000001b0b8199290: cmp ebx,ecx
| | 0x000001b0b8199292: cmovl r11d,edx ; 124.999.998
| | 0x000001b0b8199296: cmp r11d,0x7d0
| | 0x000001b0b819929d: cmova r11d,edi ; 2000
| | 0x000001b0b81992a1: add r11d,ecx ; 2001
|-+--0x000001b0b81992a4: jmp 0x000001b0b8199228
|->0x000001b0b81992a9: cmp ecx,r10d
0x000001b0b81992ac: jge 0x000001b0b81992d1 ; done, return
0x000001b0b81992ae: xchg ax,ax
0x000001b0b81992b0: mov r11,QWORD PTR [rsi+0x4b8]
0x000001b0b81992b7: add r11,0x1
0x000001b0b81992bb: mov QWORD PTR [rsi+0x4b8],r11
0x000001b0b81992c2: test ecx,0x7
0x000001b0b81992c8: je 0x000001b0b81992e4
0x000001b0b81992ca: inc ecx
0x000001b0b81992cc: cmp ecx,r10d
0x000001b0b81992cf: jl 0x000001b0b81992b0
0x000001b0b81992d1: add rsp,0x20
0x000001b0b81992d5: pop rbp
0x000001b0b81992d6: cmp rsp,QWORD PTR [r15+0x350]
0x000001b0b81992dd: ja 0x000001b0b8199314
0x000001b0b81992e3: ret
0x000001b0b81992e4: mov r11,QWORD PTR [rsi+0x270]
0x000001b0b81992eb: add r11,0x1
0x000001b0b81992ef: mov QWORD PTR [rsi+0x270],r11
0x000001b0b81992f6: lock add DWORD PTR [rsp-0x40],0x0 ; StoreLoad
0x000001b0b81992fc: jmp 0x000001b0b81992ca
0x000001b0b81992fe: mov edx,0xffffff76
0x000001b0b8199303: mov QWORD PTR [rsp],r8
0x000001b0b8199307: mov DWORD PTR [rsp+0x8],r10d
0x000001b0b819930c: data16 xchg ax,ax
0x000001b0b819930f: call 0x000001b0b7a76900 ; ImmutableOopMap {[0]=Oop }
.........
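Read back into Java, the listing has roughly the shape sketched below: the first iteration is peeled off, the main loop is unrolled by two and cut into chunks of at most 2000 (0x7d0) counter steps with a safepoint poll between chunks, and a simple post-loop handles the remainder. This is a hand-written reading of the assembly, not compiler output. It assumes that clng7 is a plain field and flushStoreBuffer8 is volatile (only the stores to the latter are followed by lock add), and it leaves out the Blackhole call, which shows up in the listing only as a null check.

```java
final class CompiledLoopShape {
    static long clng7;                       // assumed plain field
    static volatile long flushStoreBuffer8;  // assumed volatile field

    /** The loop as written in the source snippet (Blackhole call omitted). */
    static void asWritten(int end) {
        for (int i = 0; i < end; i++) {
            clng7++;
            if ((i & 7) == 0) flushStoreBuffer8++;
        }
    }

    /** The same work, arranged the way the listing arranges it. */
    static void asCompiled(int end) {
        if (end <= 0) return;
        // Peeled iteration i = 0: (0 & 7) == 0, so both counters are bumped.
        clng7++;
        flushStoreBuffer8++;
        int i = 1;
        int mainLimit = end - 1;             // the unrolled body needs i + 1 < end
        while (i < mainLimit) {
            int chunkEnd = i + Math.min(mainLimit - i, 2000);
            do {                             // main loop, unrolled by two
                clng7++;
                if ((i & 7) == 0) flushStoreBuffer8++;
                clng7++;
                if (((i + 1) & 7) == 0) flushStoreBuffer8++;
                i += 2;
            } while (i < chunkEnd);
            // the safepoint poll sits here in the listing
        }
        for (; i < end; i++) {               // post-loop for the remainder
            clng7++;
            if ((i & 7) == 0) flushStoreBuffer8++;
        }
    }
}
```

Both methods leave the two counters at the same values for any end; only the control flow differs, which is what makes the listing longer than the three-line source loop suggests.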
To get some deeper insight, I switched to Linux.
Enable only the 4-thread benchmark and measure:
perf c2c record -g -- java -XX:+UnlockDiagnosticVMOptions -XX:+DumpPerfMapAtExit -XX:+PreserveFramePointer -jar target/jmh-benchmarks.jar
A single cache line accounts for 99,5% of all loads that hit a modified cache line (HITM).
perf c2c report -g:
4 threads:
Shared Data Cache Line Table (2 entries, sorted on Total HITMs)
----------- Cacheline ----------     Tot  ------- Load Hitm -------    Total    Total    Total  --------- Stores --------  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
Index      Address  Node  PA cnt    Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss      N/A       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
-   0  0x716a688c0     0  343497  99,49%     6895     6895        0   367830   203445   164385   145056    19329        0   116151    75827      683      3882     6895         0        0         7         0
start_thread
thread_native_entry(Thread*)
Thread::call_run()
JavaThread::thread_main_inner() [clone .part.0]
thread_entry(JavaThread*, JavaThread*)
JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)
JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)
StubRoutines (initial stubs)
Interpreter
Interpreter
- Interpreter
13,93% void de.codecoverage.FalseSharingBenchmarks.lambda$12(int, org.openjdk.jmh.infra.Blackhole)
12,37% void de.codecoverage.FalseSharingBenchmarks.lambda$15(int, org.openjdk.jmh.infra.Blackhole)
11,74% void de.codecoverage.FalseSharingBenchmarks.lambda$14(int, org.openjdk.jmh.infra.Blackhole)
11,41% void de.codecoverage.FalseSharingBenchmarks.lambda$13(int, org.openjdk.jmh.infra.Blackhole)
...
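One common way to take a hot counter off a shared cache line is to surround it with padding fields through a small class hierarchy. The sketch below assumes 64-byte cache lines and relies on HotSpot laying out superclass fields before subclass fields; all class names are made up, and this is not necessarily how the DifferentCacheLine benchmarks above are built.

```java
class LeftPad {
    long p1, p2, p3, p4, p5, p6, p7;      // 56 bytes in front of the value
}

class CounterValue extends LeftPad {
    volatile long value;                  // the hot field
}

final class PaddedCounter extends CounterValue {
    long q1, q2, q3, q4, q5, q6, q7;      // 56 bytes behind the value

    void increment() {
        value = value + 1;                // single writer only: not an atomic read-modify-write
    }
}

final class PaddedCounterDemo {
    /** Each thread increments its own padded counter; returns the final values. */
    static long[] countInParallel(int threads, int iterations) {
        PaddedCounter[] counters = new PaddedCounter[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            PaddedCounter counter = counters[t] = new PaddedCounter();
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) counter.increment();
            });
            workers[t].start();
        }
        long[] result = new long[threads];
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            result[t] = counters[t].value;
        }
        return result;
    }
}
```

With 56 bytes of padding on each side, no field of another object can land on the 64-byte line that holds value. HotSpot also has an internal @Contended annotation for the same purpose, but using it outside the JDK needs extra JVM flags.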