Gregory Price wrote:
On Sun, Feb 05, 2023 at 05:02:29PM -0800, Dan Williams wrote:
Summary:
CXL RAM support allows for the dynamic provisioning of new CXL RAM regions, and more routinely, assembling a region from an existing configuration established by platform firmware. The latter is motivated by CXL memory RAS (Reliability, Availability and Serviceability) support, which requires associating device events with System Physical Address ranges and vice versa.
Ok, I simplified my tests and reverted a bunch of stuff. Figured I should report this before I dive further in.
Earlier I was carrying the DOE patches and others; I've dropped most of that to make sure I could replicate on the base kernel and QEMU images.
QEMU branch: https://gitlab.com/jic23/qemu/-/tree/cxl-2023-01-26 (this is a little out of date at this point, I think, but it shouldn't matter; the results are the same regardless of what else I pull in)
Kernel branch: https://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=for-6.3/c...
Note that I acted on this feedback from Greg to break out a fix and merge it for v6.2-final
http://lore.kernel.org/r/Y+CSOeHVLKudN0A6@kroah.com
...i.e. you are missing at least the passthrough decoder fix, but that would show up as a region creation failure not a QEMU crash.
So I would move to testing cxl/next.
[..]
Let's attempt to use the memory:
[root@fedora ~]# numactl --membind=1 python
KVM internal error. Suberror: 3
extra data[0]: 0x0000000080000b0e
extra data[1]: 0x0000000000000031
extra data[2]: 0x0000000000000d81
extra data[3]: 0x0000000390074ac0
extra data[4]: 0x0000000000000010
RAX=0000000080000000 RBX=0000000000000000 RCX=0000000000000000 RDX=0000000000000001
RSI=0000000000000000 RDI=0000000390074000 RBP=ffffac1c4067bca0 RSP=ffffac1c4067bc88
R8 =0000000000000000 R9 =0000000000000001 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=ffff99eed0074000 R14=0000000000000000 R15=0000000000000000
RIP=ffffffff812b3d62 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff99ec3bc00000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 fffffe1d13135000 00004087 00008b00 DPL=0 TSS64-busy
GDT= fffffe1d13133000 0000007f
IDT= fffffe0000000000 00000fff
CR0=80050033 CR2=ffffffff812b3d62 CR3=0000000390074000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=5d 9c 01 0f b7 db 48 09 df 48 0f ba ef 3f 0f 22 df 0f 1f 00 <5b> 41 5c 41 5d 5d c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 cd 0d 76 01
At first glance that looks like a QEMU issue, but I would capture a cxl list -vvv before attempting to use the memory just to verify the decoder setup looks sane.
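A minimal sketch of capturing that state, assuming the `cxl` tool from ndctl is installed in the guest. The function name, the output path, and the `CXL_BIN` override are hypothetical and exist only so the snippet can be exercised on a machine without `cxl`:

```shell
# Sketch: save the verbose CXL topology (decoders, regions, memdevs)
# to a file before the memory is onlined or touched.
# CXL_BIN is a hypothetical override for the cxl binary location.
snapshot_cxl() {
    out="$1"
    "${CXL_BIN:-cxl}" list -vvv > "$out" || return 1
    echo "saved CXL state to $out"
}

# Example (run inside the guest, before numactl --membind=1 ...):
#   snapshot_cxl /tmp/cxl-list-before.json
```

Attaching the resulting file to the report makes it possible to compare the decoder programming against what QEMU believes it emulates.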
I also tested lowering the RAM sizes (2GB RAM, 1GB "CXL") to see if there's something going on with the PCI hole or something, but no, same results.
Double-checked whether there was an issue using a single root port, so I registered a second one. Same results.
In prior tests I accessed the memory directly via devmem2.
This still works when mapping the memory manually:
[root@fedora map]# ./map_memory.sh
echo ram > /sys/bus/cxl/devices/decoder2.0/mode
echo 0x40000000 > /sys/bus/cxl/devices/decoder2.0/dpa_size
echo region0 > /sys/bus/cxl/devices/decoder0.0/create_ram_region
echo 4096 > /sys/bus/cxl/devices/region0/interleave_granularity
echo 1 > /sys/bus/cxl/devices/region0/interleave_ways
echo 0x40000000 > /sys/bus/cxl/devices/region0/size
echo decoder2.0 > /sys/bus/cxl/devices/region0/target0
echo 1 > /sys/bus/cxl/devices/region0/commit
[root@fedora devmem]# ./devmem2 0x290000000 w 0x12345678
/dev/mem opened.
Memory mapped at address 0x7fb4d4ed3000.
Value at address 0x290000000 (0x7fb4d4ed3000): 0x0
Written 0x12345678; readback 0x12345678
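Before poking a physical address with devmem2 it can help to confirm that the address actually falls inside the committed region. A rough sketch, assuming the region's `resource` and `size` sysfs attributes report the base and length in hex; the helper name `in_region` is made up for illustration:

```shell
# in_region BASE SIZE ADDR
# Exit status 0 if ADDR lies in [BASE, BASE + SIZE), non-zero otherwise.
# Arguments may be decimal or 0x-prefixed hex.
in_region() {
    base=$(( $1 ))
    size=$(( $2 ))
    addr=$(( $3 ))
    [ "$addr" -ge "$base" ] && [ "$addr" -lt $(( base + size )) ]
}

# Example (inside the guest, after the region is committed):
#   r=/sys/bus/cxl/devices/region0
#   in_region "$(cat $r/resource)" "$(cat $r/size)" 0x290000000 && echo "inside region0"
```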
Likely it is sensitive to crossing an interleave threshold.
This kind of implies there's a disagreement about the state of memory between Linux and QEMU.
But even just onlining a region produces memory usage:
[root@fedora ~]# cat /sys/bus/node/devices/node1/meminfo
Node 1 MemTotal: 1048576 kB
Node 1 MemFree: 1048112 kB
Node 1 MemUsed: 464 kB
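The MemUsed value is MemTotal minus MemFree (1048576 - 1048112 = 464 kB here), so the node is already being written to once it is onlined. A small sketch for watching that number over time; the helper name `node_mem_used` is hypothetical, and it only assumes the "Node N Field: value kB" layout shown in the output:

```shell
# node_mem_used: read a per-node meminfo dump on stdin and print
# MemTotal - MemFree in kB.
node_mem_used() {
    awk '$3 == "MemTotal:" { total = $4 }
         $3 == "MemFree:"  { free = $4 }
         END { print total - free }'
}

# Example:
#   node_mem_used < /sys/bus/node/devices/node1/meminfo
```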
Which I would expect to set off some fireworks.
Maybe an issue at the NUMA level? I just... I have no idea.
I will need to dig through the email chains to figure out what others have been doing that I'm missing. Everything *looks* nominal, but the reactors are exploding so... ¯\_(ツ)_/¯
I'm not sure where to start here, but I'll bash my face on the keyboard for a bit until I have some ideas.
Not ruling out the driver yet, but Fan's tests with hardware have me leaning more towards QEMU.