I wanted to explain the problem of false sharing to colleagues with the following benchmark code.
Without volatile (which makes the JIT emit a lock-prefixed instruction on x86) the effect is not as visible: the CPU doesn't need to drain the store buffer and can use store buffer forwarding (store-to-load forwarding).

Windows 11 [Version 10.0.22621.3374]
Intel(R) Core(TM) i7-8850H CPU @ (2.60 - 4.30 GHz)
Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
Microcode signature: 000000F0
Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
Unified Cache       0, Level 2,  256 KB, Assoc   4, LineSize  64
Unified Cache       1, Level 3,    9 MB, Assoc  12, LineSize  64
# JMH version: 1.37
# VM version: JDK 17.0.8, OpenJDK 64-Bit Server VM, 17.0.8+7-LTS
Benchmark                                                    Mode  Cnt   Score   Error  Units
FalseSharingBenchmarks.countWith1Thread                     thrpt   10   1,419 ± 0,032  ops/s
FalseSharingBenchmarks.countWith1ThreadCounterNonVolatile   thrpt   10  11,765 ± 0,228  ops/s
FalseSharingBenchmarks.countWith2Threads                    thrpt   10   0,468 ± 0,041  ops/s
FalseSharingBenchmarks.countWith2ThreadsDifferentCacheLine  thrpt   10   2,698 ± 0,064  ops/s
FalseSharingBenchmarks.countWith4Threads                    thrpt   10   0,393 ± 0,009  ops/s
FalseSharingBenchmarks.countWith4ThreadsDifferentCacheLine  thrpt   10   4,763 ± 0,100  ops/s
FalseSharingBenchmarks.countWith8Threads                    thrpt   10   0,470 ± 0,009  ops/s
FalseSharingBenchmarks.countWith8ThreadsDifferentCacheLine  thrpt   10   7,895 ± 0,147  ops/s
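
The DifferentCacheLine variants in the table keep each thread's counter on its own cache line. The following is a minimal sketch of that idea, not the benchmark code itself. The class name PaddedCountersSketch, the method count and the use of an AtomicLongArray with a stride are assumptions made for illustration. The only value taken from above is the line size of 64 bytes. With stride 1 the slots are 8 bytes apart and will usually share a line. With stride 8 they are 64 bytes apart and can never share one.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: one volatile counter per thread, placed either
// next to each other (stride 1) or 64 bytes apart (stride 8).
final class PaddedCountersSketch {
    // 64-byte line size (from the cache listing above) / 8 bytes per long
    static final int LONGS_PER_LINE = 64 / Long.BYTES;

    // Runs `threads` workers; worker t increments slot t * stride `perThread` times.
    // Returns the final value of every worker's slot.
    static long[] count(int threads, int stride, long perThread) {
        AtomicLongArray slots = new AtomicLongArray(threads * stride);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int idx = t * stride;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < perThread; i++) {
                    // volatile load + volatile store; safe because each slot has one writer
                    slots.set(idx, slots.get(idx) + 1);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long[] result = new long[threads];
        for (int t = 0; t < threads; t++) {
            result[t] = slots.get(t * stride);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int stride : new int[] {1, LONGS_PER_LINE}) {
            long start = System.nanoTime();
            count(4, stride, 50_000_000L);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("stride %d: %d ms%n", stride, millis);
        }
    }
}
```

This is only a quick way to see the effect. For real numbers a harness such as JMH is the better tool, because a hand-rolled timing loop like main above includes JIT warm-up and thread start-up.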

Now let's take a look at the generated assembly code:

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel -jar target/jmh-benchmarks.jar

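The argument registers are derived from the Windows x64 calling convention (RCX, RDX, R8, R9), because I ran this on Windows 11. In the listing below the int parameter arrives in RDX and the Blackhole in R8.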

Code snippet:

int end = 1_000_000_000 / 8;
...
Thread t8 = new Thread(() -> {
	for (int i = 0; i < end; i++) {
		blackhole.consume(clng7++);
		if ((i & 7) == 0) flushStoreBuffer8++; // flush/drain every 8th iteration
	}
});

This compiles down to the following assembly code:

        [Verified Entry Point]
          # {method} {0x000001b12b480950} 'lambda$29' '(ILorg/openjdk/jmh/infra/Blackhole;)V' in 'codecoverage/de/FalseSharingBenchmarks'
          # parm0:    rdx       = int
          # parm1:    r8:r8     = 'org/openjdk/jmh/infra/Blackhole'
          #           [sp+0x30]  (sp of caller)
          0x000001b0b8199180:   mov    DWORD PTR [rsp-0x7000],eax
          0x000001b0b8199187:   push   rbp
          0x000001b0b8199188:   sub    rsp,0x20                     
          0x000001b0b819918c:   mov    r10d,edx                     ; r10=rdx=end (COUNT_TO(1_000_000_000) / 8 = 125.000.000)
          0x000001b0b819918f:   test   edx,edx                      
          0x000001b0b8199191:   jle    0x000001b0b81992d1           ; RET if end <= 0; for loop
          0x000001b0b8199197:   test   r8,r8                        ; NPE test
          0x000001b0b819919a:   je     0x000001b0b81992fe
          0x000001b0b81991a0:   mov    ebx,edx
          0x000001b0b81991a2:   dec    ebx
          0x000001b0b81991a4:   mov    r11d,0x80000000
          0x000001b0b81991aa:   cmp    edx,ebx
          0x000001b0b81991ac:   cmovl  ebx,r11d                     ; ebx = 124.999.999
          0x000001b0b81991b0:   movabs rsi,0x44b976180              ; rsi ptr to class 'codecoverage/de/FalseSharingBenchmarks'
          0x000001b0b81991ba:   mov    r11,QWORD PTR [rsi+0x4b8]    ; load clng7
          0x000001b0b81991c1:   mov    r9,QWORD PTR [rsi+0x270]     ; load flushStoreBuffer8
          0x000001b0b81991c8:   add    r11,0x1
          0x000001b0b81991cc:   mov    QWORD PTR [rsi+0x4b8],r11    ; store clng7+=1
          0x000001b0b81991d3:   add    r9,0x1
          0x000001b0b81991d7:   mov    QWORD PTR [rsi+0x270],r9     ; store flushStoreBuffer8+=1
          0x000001b0b81991de:   lock add DWORD PTR [rsp-0x40],0x0   ; StoreLoad barrier (expensive, drains the store buffer)
          0x000001b0b81991e4:   mov    ecx,0x1
          0x000001b0b81991e9:   cmp    ebx,0x1
          0x000001b0b81991ec:   jle    0x000001b0b81992a9
          0x000001b0b81991f2:   xor    edx,edx                      ; edx = 0
          0x000001b0b81991f4:   mov    edi,0x7d0                    ; 2000
|---------0x000001b0b81991f9:   jmp    0x000001b0b819928a           
|         0x000001b0b81991fe:   mov    r9,QWORD PTR [rsi+0x270]     ; load flushStoreBuffer8
|         0x000001b0b8199205:   add    r9,0x1
|         0x000001b0b8199209:   mov    QWORD PTR [rsi+0x270],r9
|         0x000001b0b8199210:   lock add DWORD PTR [rsp-0x40],0x0   ; StoreLoad
|         0x000001b0b8199216:   data16 nop WORD PTR [rax+rax*1+0x0] ; NOP
|         0x000001b0b8199220:   add    ecx,0x2                      ; ecx = 3               <-------------------|
|         0x000001b0b8199223:   cmp    ecx,r11d                     ;                                           |
|  |------0x000001b0b8199226:   jge    0x000001b0b819927c           ; every ~2000 iterations: safepoint poll    |
|  | |--->0x000001b0b8199228:   mov    r9,QWORD PTR [rsi+0x4b8]     ; load clng7                                |
|  | |    0x000001b0b819922f:   add    r9,0x1                       ;                                           |
|  | |    0x000001b0b8199233:   mov    QWORD PTR [rsi+0x4b8],r9     ; store clng7+=1                            |
|  | |    0x000001b0b819923a:   test   ecx,0x7                      ; if ((i & 7) == 0) then                    |
|  | ||---0x000001b0b8199240:   je     0x000001b0b819925b           ; add 1 flushStoreBuffer8 and flush         |
|  | |||->0x000001b0b8199242:   mov    ebp,ecx                      ;                                           |
|  | |||  0x000001b0b8199244:   inc    ebp                          ; i++                                       |
|  | |||  0x000001b0b8199246:   add    r9,0x1                       ;                                           |
|  | |||  0x000001b0b819924a:   mov    QWORD PTR [rsi+0x4b8],r9     ; store clng7+=1                            |
|  | |||  0x000001b0b8199251:   test   ebp,0x7                      ; if ((i & 7) == 0) then                    |
|  | |||  0x000001b0b8199257:   je     0x000001b0b81991fe           ; flushStoreBuffer8++                       |
|  | |||  0x000001b0b8199259:   jmp    0x000001b0b8199220           ; if ((i & 7) != 0)     --------------------|
|  | ||+->0x000001b0b819925b:   mov    r9,QWORD PTR [rsi+0x270]     ; load flushStoreBuffer8
|  | | |  0x000001b0b8199262:   add    r9,0x1
|  | | |  0x000001b0b8199266:   mov    QWORD PTR [rsi+0x270],r9     ; flushStoreBuffer8++
|  | | |  0x000001b0b819926d:   lock add DWORD PTR [rsp-0x40],0x0   ; StoreLoad
|  | | |  0x000001b0b8199273:   mov    r9,QWORD PTR [rsi+0x4b8]     ; load clng7
|  | | |--0x000001b0b819927a:   jmp    0x000001b0b8199242           
|  |-+--->0x000001b0b819927c:   mov    r11,QWORD PTR [r15+0x358]    ;
|    |    0x000001b0b8199283:   test   DWORD PTR [r11],eax          ; SafePoint  {poll}
|    |    0x000001b0b8199286:   cmp    ecx,ebx
|    | |--0x000001b0b8199288:   jge    0x000001b0b81992a9
|----+-+->0x000001b0b819928a:   mov    r11d,ebx
     | |  0x000001b0b819928d:   sub    r11d,ecx                     ; 124.999.998
     | |  0x000001b0b8199290:   cmp    ebx,ecx
     | |  0x000001b0b8199292:   cmovl  r11d,edx                     ; 124.999.998
     | |  0x000001b0b8199296:   cmp    r11d,0x7d0
     | |  0x000001b0b819929d:   cmova  r11d,edi                     ; 2000
     | |  0x000001b0b81992a1:   add    r11d,ecx                     ; 2001
     |-+--0x000001b0b81992a4:   jmp    0x000001b0b8199228
       |->0x000001b0b81992a9:   cmp    ecx,r10d
          0x000001b0b81992ac:   jge    0x000001b0b81992d1           ; done, return
          0x000001b0b81992ae:   xchg   ax,ax                        
          0x000001b0b81992b0:   mov    r11,QWORD PTR [rsi+0x4b8]    
          0x000001b0b81992b7:   add    r11,0x1
          0x000001b0b81992bb:   mov    QWORD PTR [rsi+0x4b8],r11    
          0x000001b0b81992c2:   test   ecx,0x7
          0x000001b0b81992c8:   je     0x000001b0b81992e4
          0x000001b0b81992ca:   inc    ecx                          
          0x000001b0b81992cc:   cmp    ecx,r10d
          0x000001b0b81992cf:   jl     0x000001b0b81992b0           
          0x000001b0b81992d1:   add    rsp,0x20
          0x000001b0b81992d5:   pop    rbp
          0x000001b0b81992d6:   cmp    rsp,QWORD PTR [r15+0x350]    
          0x000001b0b81992dd:   ja     0x000001b0b8199314
          0x000001b0b81992e3:   ret                                 
          0x000001b0b81992e4:   mov    r11,QWORD PTR [rsi+0x270]
          0x000001b0b81992eb:   add    r11,0x1
          0x000001b0b81992ef:   mov    QWORD PTR [rsi+0x270],r11
          0x000001b0b81992f6:   lock add DWORD PTR [rsp-0x40],0x0   ; StoreLoad
          0x000001b0b81992fc:   jmp    0x000001b0b81992ca
          0x000001b0b81992fe:   mov    edx,0xffffff76
          0x000001b0b8199303:   mov    QWORD PTR [rsp],r8
          0x000001b0b8199307:   mov    DWORD PTR [rsp+0x8],r10d
          0x000001b0b819930c:   data16 xchg ax,ax
          0x000001b0b819930f:   call   0x000001b0b7a76900           ; ImmutableOopMap {[0]=Oop }
 .........
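
The listing addresses clng7 as [rsi+0x4b8] and flushStoreBuffer8 as [rsi+0x270], with rsi = 0x44b976180. A small sketch to check whether these two fields share a cache line, using the 64-byte line size from the cache listing above. The class name CacheLineOfField and the helper lineBase are made up for this example. The addresses are only valid for this one run, since the GC may move the object.

```java
// Hypothetical helper: maps an address to the base address of its 64-byte cache line.
final class CacheLineOfField {
    static final long LINE_SIZE = 64; // LineSize from the cache listing above

    static long lineBase(long address) {
        return address & ~(LINE_SIZE - 1); // clear the low 6 bits
    }

    public static void main(String[] args) {
        long base = 0x44b976180L;             // value loaded into rsi by movabs
        long flushStoreBuffer8 = base + 0x270; // [rsi+0x270]
        long clng7 = base + 0x4b8;             // [rsi+0x4b8]

        System.out.printf("flushStoreBuffer8 at 0x%x, line 0x%x%n",
                flushStoreBuffer8, lineBase(flushStoreBuffer8));
        System.out.printf("clng7             at 0x%x, line 0x%x%n",
                clng7, lineBase(clng7));
        System.out.println("same line: " + (lineBase(clng7) == lineBase(flushStoreBuffer8)));
    }
}
```

In this run the two fields are 0x248 = 584 bytes apart and fall into the lines starting at 0x44b9763c0 and 0x44b976600, so the plain counter and the volatile field that triggers the barrier do not share a line with each other.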

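The constant 0x7d0 (2000) and the safepoint poll in the listing come from the JIT splitting the counted loop into strips: an inner loop of at most 2000 iterations without a poll, wrapped in an outer loop that polls once per strip. The first iteration is peeled off, the inner loop is unrolled by 2, and a post loop handles what is left. The sketch below writes that shape out in Java and checks it against the plain loop. It is a hand-written approximation of the listing, not output of the compiler. The names StripMiningSketch, plainLoop and stripMinedLoop are made up, and local variables stand in for clng7 and flushStoreBuffer8.

```java
// Hypothetical sketch of the loop shape seen in the listing.
final class StripMiningSketch {
    static final int STRIP = 2000; // 0x7d0 in the listing

    // Straightforward loop; returns {counter, flushes}.
    static long[] plainLoop(int end) {
        long counter = 0, flushes = 0;
        for (int i = 0; i < end; i++) {
            counter++;
            if ((i & 7) == 0) flushes++;
        }
        return new long[] {counter, flushes};
    }

    // Same work in strips; returns {counter, flushes, polls}.
    static long[] stripMinedLoop(int end) {
        long counter = 0, flushes = 0, polls = 0;
        if (end <= 0) return new long[] {0, 0, 0};

        // peeled first iteration (i == 0, so the flush branch is taken)
        counter++;
        flushes++;
        int i = 1;

        int limit = end - 1; // the unrolled loop needs two iterations per pass
        while (i < limit) {
            int stripEnd = i + Math.min(limit - i, STRIP);
            while (i < stripEnd) { // inner loop: no safepoint poll, unrolled by 2
                counter++;
                if ((i & 7) == 0) flushes++;
                counter++;
                if (((i + 1) & 7) == 0) flushes++;
                i += 2;
            }
            polls++; // the safepoint poll would sit here, once per strip
        }

        while (i < end) { // post loop for the remaining iterations
            counter++;
            if ((i & 7) == 0) flushes++;
            i++;
        }
        return new long[] {counter, flushes, polls};
    }

    public static void main(String[] args) {
        long[] plain = plainLoop(10_000);
        long[] strips = stripMinedLoop(10_000);
        System.out.println("plain:  counter=" + plain[0] + " flushes=" + plain[1]);
        System.out.println("strips: counter=" + strips[0] + " flushes=" + strips[1]
                + " polls=" + strips[2]);
    }
}
```

The point of this shape is that the hot inner loop carries no poll, while the thread still reaches a safepoint check after a bounded number of iterations.
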
To get deeper insight, I switched to Linux.

I enabled only the 4-thread benchmark and measured:

perf c2c record -g -- java -XX:+UnlockDiagnosticVMOptions -XX:+DumpPerfMapAtExit -XX:+PreserveFramePointer -jar target/jmh-benchmarks.jar

99.5% of the loads that hit a modified cache line (HITM) land on a single cache line.

perf c2c report -g:

4 threads:

Shared Data Cache Line Table     (2 entries, sorted on Total HITMs)
         ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  --------- Stores --------  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
  Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss      N/A       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
-     0         0x716a688c0     0  343497   99,49%     6895     6895        0   367830   203445   164385   145056    19329        0   116151    75827      683      3882     6895         0        0         7         0
     start_thread
     thread_native_entry(Thread*)
     Thread::call_run()
     JavaThread::thread_main_inner() [clone .part.0]
     thread_entry(JavaThread*, JavaThread*)
     JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)
     JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)
     StubRoutines (initial stubs)
     Interpreter
     Interpreter
   - Interpreter
        13,93% void de.codecoverage.FalseSharingBenchmarks.lambda$12(int, org.openjdk.jmh.infra.Blackhole)
        12,37% void de.codecoverage.FalseSharingBenchmarks.lambda$15(int, org.openjdk.jmh.infra.Blackhole)
        11,74% void de.codecoverage.FalseSharingBenchmarks.lambda$14(int, org.openjdk.jmh.infra.Blackhole)
        11,41% void de.codecoverage.FalseSharingBenchmarks.lambda$13(int, org.openjdk.jmh.infra.Blackhole)
...