Hi,
Food for thought for today's sync up. I've been writing QEMU plugins to exercise the plugin system and see what sort of useful information you can extract when you can control the instruction stream.
For example, I now have a plugin that can break down instruction counts for any given run, such as a kernel boot:
Instruction Classes:
Class: UDEF not counted
Class: SVE (68 hits)
Class: Reserved (0 hits)
Class: PCrel addr (4589078 hits)
Class: Add/Sub (imm,tags) (0 hits)
Class: Add/Sub (imm) (26832113 hits)
Class: Logical (imm) (74304974 hits)
Class: Move Wide (imm) (10933759 hits)
Class: Bitfield (71470957 hits)
Class: Extract (85655 hits)
Class: Data Proc Imm (0 hits)
Class: Cond Branch (imm) (37227632 hits)
Class: Exception Gen (6 hits)
Class: NOP not counted
Class: Hints (244825554 hits)
Class: Barriers (1668558 hits)
Class: PSTATE (202144 hits)
Class: System Insn (7132992 hits)
Class: System Reg (2268308 hits)
Class: Branch (reg) (6280976 hits)
Class: Branch (imm) (18347905 hits)
Class: Cmp & Branch (180167025 hits)
Class: Tst & Branch (4092972 hits)
Class: Branches (0 hits)
Class: AdvSimd ldstmult (0 hits)
Class: AdvSimd ldstmult++ (0 hits)
Class: AdvSimd ldst (0 hits)
Class: AdvSimd ldst++ (0 hits)
Class: ldst excl (160861365 hits)
Class: Prefetch (0 hits)
Class: Load Reg (lit) (12828544 hits)
Class: ldst noalloc pair (0 hits)
Class: ldst pair (60381349 hits)
Class: ldst reg (0 hits)
Class: Atomic ldst (0 hits)
Class: ldst reg (reg off) (0 hits)
Class: ldst reg (pac) (0 hits)
Class: ldst reg (imm) (119597941 hits)
Class: Loads & Stores (0 hits)
Class: Data Proc Reg (113586343 hits)
Class: Scalar FP (0 hits)
Class: Unclassified (0 hits)
You can break each class down into individual instructions. For example, the Hints are mostly:
Individual Instructions:
Instr: wfe (132400072 hits) (op=0xd503205f/ Hints)
Instr: sevl (66433640 hits) (op=0xd50320bf/ Hints)
Instr: yield (29619246 hits) (op=0xd503203f/ Hints)
Instr: wfi (2865 hits) (op=0xd503207f/ Hints)
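The plugin's own decode tables are not shown in this thread, so the following is only a minimal sketch of how the Hints breakdown could be reproduced offline, assuming a simple mask-and-pattern match on the 32-bit opcode. The mask and pattern follow the AArch64 HINT encoding (they agree with the four opcodes listed above); the function names `hint_name` and `count_hints` are hypothetical and are not part of any QEMU API.

```python
from collections import Counter

# AArch64 HINT space: fixed bits 0xd503201f with a 7-bit immediate in bits [11:5].
HINT_MASK = 0xFFFFF01F
HINT_PATTERN = 0xD503201F

# Immediate -> mnemonic for the common hints (not an exhaustive list).
HINT_NAMES = {0: "nop", 1: "yield", 2: "wfe", 3: "wfi", 4: "sev", 5: "sevl"}


def hint_name(opcode):
    """Return the hint mnemonic, or None if the opcode is not a hint."""
    if opcode & HINT_MASK != HINT_PATTERN:
        return None
    imm = (opcode >> 5) & 0x7F
    return HINT_NAMES.get(imm, f"hint #{imm}")


def count_hints(opcodes):
    """Count executed hint instructions by mnemonic (hypothetical helper)."""
    counts = Counter()
    for op in opcodes:
        name = hint_name(op)
        if name is not None:
            counts[name] += 1
    return counts
```

Feeding it a stream of executed opcodes, e.g. `count_hints([0xd503205f, 0xd50320bf])`, yields per-mnemonic totals in the same spirit as the listing above.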
So I'm looking for a similar experiment that would be useful for the memory sub-system. When I chatted to Maxim we thought a simplified cache line simulator might be useful. The aim wouldn't be to simulate what a real cache does, but to be useful for, say, identifying regions of code which might be susceptible to cache line bouncing. So, as compiler writers, what sort of run-time memory behaviour would you like to track? What sort of information would be useful to extract with such a tool?
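To make the cache line idea concrete, here is a rough sketch of what such a simplified simulator might compute. It is not a real cache model and not an existing plugin: it assumes a 64-byte line, assumes the plugin can hand over a trace of `(cpu, pc, addr)` tuples for stores, and counts a "bounce" whenever a line is written by a different CPU than the one that wrote it last. All names here are hypothetical.

```python
from collections import Counter

LINE_SIZE = 64  # assumed cache line size in bytes


def find_bounces(store_trace, line_size=LINE_SIZE):
    """Count write-ownership changes per PC and per cache line.

    store_trace: iterable of (cpu, pc, addr) tuples, one per store.
    Returns (bounces_by_pc, bounces_by_line) as Counters.
    """
    last_writer = {}            # line index -> CPU that wrote it last
    bounces_by_pc = Counter()
    bounces_by_line = Counter()
    for cpu, pc, addr in store_trace:
        line = addr // line_size
        prev = last_writer.get(line)
        if prev is not None and prev != cpu:
            # A different CPU now takes over the line: count one bounce.
            bounces_by_pc[pc] += 1
            bounces_by_line[line] += 1
        last_writer[line] = cpu
    return bounces_by_pc, bounces_by_line
```

Sorting `bounces_by_pc` by count would point at the code regions most likely to be fighting over a line; loads are ignored here purely to keep the sketch short.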
I'm open to ideas ;-)
-- Alex Bennée
On 30/05/2019 07:27, Alex Bennée wrote:
[...]
Back at IBM, one internal project we used regularly was an instruction tracer based on an out-of-tree patch to valgrind. The idea was to capture the precise instruction sequence within specific text segment boundaries so we could load it later into a PowerPC simulator to post-analyse the code's behaviour regarding instruction latency, op-port utilization, CPU stalls, etc.
I'm not sure it would be that useful without a post-analysis tool, but I think it might help with some arch-specific optimization. What do you think?
On Thu, 30 May 2019 at 11:28, Alex Bennée alex.bennee@linaro.org wrote:
[...]
In our embedded compiler team we used a Fast Model plugin to check that our Cortex-M3 execute-only code did indeed not read the executable instructions (no literal pools etc.). You may have this emulated already though. Another demo I saw was a cache visualisation plugin that gave a graphical display of the cache as the program was running. Pretty, but it sure did slow the model down.
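As an illustration of the execute-only check (not the actual Fast Model plugin, whose interface is not described here), a minimal sketch could scan the data reads reported by a plugin and flag any that overlap the text region. The trace format `(pc, addr, size)`, the half-open `(start, end)` ranges and the name `find_text_reads` are all assumptions.

```python
def find_text_reads(reads, text_ranges):
    """Return (pc, addr) for every data read that overlaps executable code.

    reads: iterable of (pc, addr, size) tuples, one per data load.
    text_ranges: list of half-open (start, end) address ranges holding code.
    """
    violations = []
    for pc, addr, size in reads:
        for start, end in text_ranges:
            # Overlap test between [addr, addr + size) and [start, end).
            if addr < end and addr + size > start:
                violations.append((pc, addr))
                break
    return violations
```

An empty result would mean no load touched the code region during the run, which is what execute-only code (no literal pools) should produce.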
Peter
--
Alex Bennée
_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
On Thu, 30 May 2019 at 12:28, Alex Bennée alex.bennee@linaro.org wrote:
[...]
On our side (ST), we use qemu plugins for various things:
- code coverage
- code profiling
- loop analysis (more compiler developer oriented than the previous ones)
Christophe
On Mon, 3 Jun 2019 at 21:36, Christophe Lyon christophe.lyon@linaro.org wrote:
[...]
Actually, for more detailed info you can have a look at the example plugins reference: https://github.com/atos-tools/qemu/tree/stable-3.1.plugins/tcg/plugins
I.e. example plugins list:
- Full instruction trace
- DineroIV cache simulator
- Instruction group/mnemonic dynamic count
- coverage
- per-function profile
- oprofile
- function call stack
- global instruction count
- I/O memory mapped simulation
- block trace
Other more advanced plugins (ref for instance paper: https://ppopp19.sigplan.org/details/PPoPP-2019-papers/9/Data-Flow-Dependence...):
- Function call graph, CFGs and function call-stack sampling (flamegraph)
- Dynamic Dependence Graph
Christophe