From: Jason Gunthorpe <jgg@nvidia.com>
Sent: Thursday, September 4, 2025 1:46 AM
[...]
This is enough to implement the 8 initial format variations with all of their features:
- Entries comprised of contiguous blocks of IO PTEs for larger page sizes (AMDv1, ARMv8)
- Multi-level tables, up to 6 levels. Runtime selected top level
- Runtime variable table level size (ARM's concatenated tables)
- Expandable top level (AMDv1)
any more context about this one? how is it different from the earlier "runtime selected top level"?
--- /dev/null
+++ b/drivers/iommu/generic_pt/pt_common.h
@@ -0,0 +1,355 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
- Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
- This header is included after the format. It contains definitions
- that build on the format definitions to create the basic format API.
- The format API is listed here, with kdocs, in alphabetical order. The
Is alphabetical order important here? It's not strictly followed, e.g.:
  pt_entry_make_write_dirty()
  pt_dirty_supported()
  pt_entry_num_contig_lg2()
and several other violations at a glance. IMHO grouping related functions
together (e.g. the dirty tracking ones) is more meaningful and less of a
burden than following alphabetical order.
- functions without bodies are implemented in the format using the
pattern:
static inline FMTpt_XXX(..) {..}
#define pt_XXX FMTpt_XXX
or provided by pt_fmt_defaults.h
- The routines marked "@pts: Entry to query" operate on the entire
- contiguous entry and can be called with a pts->index pointing to any
- sub item that makes up that entry.
- The header order is:
- pt_defs.h
- fmt_XX.h
s/fmt_XX.h/FMT.h/
or rename amdv1.h etc. to fmt_amdv1.h etc. to be consistent
+/**
- pt_entry_make_write_dirty() - Make an entry dirty
- @pts: Table index to change
is it about the entire entry instead of a specific index? if yes then "entry to change" makes more sense.
+/**
- pt_entry_oa_full() - Return the full OA for an entry
- @pts: Entry to query
s/full/exact/?
+/**
- pt_entry_set_write_clean() - Make the entry write clean
- @pts: Table index to change
ditto "entry to change"
+/**
- pt_has_system_page() - True if level 0 can install a PAGE_SHIFT entry
- @common: Page table to query
pt_has_system_page_size()
+/**
- pt_install_leaf_entry() - Write a leaf entry to the table
- @pts: Table index to change
- @oa: Output Address for this leaf
- @oasz_lg2: Size in VA for this leaf
- @attrs: Attributes to modify the entry
- A leaf OA entry will return PT_ENTRY_OA from pt_load_entry(). It
- translates the VA indicated by pts to the given OA.
- For a single item non-contiguous entry oasz_lg2 is pt_table_item_lg2sz().
- For contiguous it is pt_table_item_lg2sz() + num_contig_lg2.
this sounds like a fixed thing, so could it be computed within the function instead of having the caller pass it in?
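i.e. could it be something like (untested; assuming pt_entry_num_contig_lg2()
already reflects the intended contiguity at this point):

	unsigned int oasz_lg2 = pt_table_item_lg2sz(pts) +
				pt_entry_num_contig_lg2(pts);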
+/**
- pt_max_output_address_lg2() - Return the maximum OA the table
format can hold
- @common: Page table to query
pt_max_oa_lg2()
+/**
- DOC: Generic Page Table Language
- Language used in Generic Page Table
- VA
The input address to the page table, often the virtual address.
- OA
The output address from the page table, often the physical address.
- leaf
An entry that results in an output address. I.e. a physical memory addr
"I.e. a physical ..." is redundant to what OA already explains
- start/end
An half-open range, e.g. [0,0) refers to no VA.
- start/last
An inclusive closed range, e.g. [0,0] refers to the VA 0
- common
The generic page table container struct pt_common
- level
The number of table hops from the lowest leaf. Level 0
is always a table of only leaves of the least significant VA bits. The
labels used by HW descriptions are never used.
- top_level
The inclusive highest level of the table. A two-level table
has a top level of 1.
- table
A linear array of entries representing the translation items for that
level.
to not mix 'entry' and 'item' in one description:
"A linear array of translation items for that level"
- index
The position in a table of an element: item = table[index]
- item
A single position in a table
'position' is called 'index'
- entry
A single logical element in a table. If contiguous pages are not
supported then item and entry are the same thing, otherwise entry refers
to the all the items that comprise a single contiguous translation.
'refers to all the items'
- item/entry_size
The number of bytes of VA the table translates for.
If the item is a table entry then the next table covers
this size. If the entry is an output address then the
s/is/translates/
full OA is: OA | (VA % entry_size)
- contig_count
The number of consecutive items fused into a single entry.
item_size * contig_count is the size of that entry's translation.
- lg2
Indicates the value is encoded as log2, i.e. 1<<x is the actual value.
Normally the compiler is fine to optimize divide and mod with log2
values automatically when inlining, however if the values are not
constant expressions it can't. So we do it by hand; we want to avoid
64-bit divmod.
- */
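btw a small usage example in the doc might make "by hand" concrete, e.g.
(assuming the obvious shift/mask expansion):

	pt_vaddr_t pfn = log2_div(va, PT_GRANULE_LG2SZ); /* va >> lg2 */
	pt_vaddr_t off = log2_mod(va, PT_GRANULE_LG2SZ); /* va & mask */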
+/* Returned by pt_load_entry() and for_each_pt_level_entry() */
+enum pt_entry_type {
- PT_ENTRY_EMPTY,
- PT_ENTRY_TABLE,
add a comment to be consistent with the following line
- /* Entry is valid and returns an output address */
- PT_ENTRY_OA,
+};
+struct pt_range {
- struct pt_common *common;
- struct pt_table_p *top_table;
- pt_vaddr_t va;
- pt_vaddr_t last_va;
- u8 top_level;
- u8 max_vasz_lg2;
+};
+/*
- Similar to xa_state, this records information about an in-progress
- parse at a single level.
- */
+struct pt_state {
- struct pt_range *range;
- struct pt_table_p *table;
- struct pt_table_p *table_lower;
- u64 entry;
- enum pt_entry_type type;
- unsigned short index;
- unsigned short end_index;
- u8 level;
+};
+#define pt_cur_table(pts, type) ((type *)((pts)->table))
+/*
- Try to install a new table pointer. The locking methodology requires
- this to be atomic (multiple threads can race to install a pointer) the
- losing threads
"... install a pointer). The losing threads..."
+static inline bool pt_feature(const struct pt_common *common,
unsigned int feature_nr)
+{
- if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
return true;
- if (!PT_SUPPORTED_FEATURE(feature_nr))
return false;
- return common->features & BIT(feature_nr);
+}
common->features is already verified in pt_init_common(), so is the above kind of an optimization, using the compiler to filter out the static checks in the fast path?
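e.g. I'd expect a guard like the below (flush_range() being a hypothetical
caller) to compile away entirely when
PT_SUPPORTED_FEATURE(PT_FEAT_FLUSH_RANGE) is constant false, leaving the
run-time common->features test only for genuinely optional features:

	if (pt_feature(common, PT_FEAT_FLUSH_RANGE))
		flush_range(common);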
+/*
- PT_WARN_ON is used for invariants that the kunit should be checking
- can't happen.
- */
+#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
+#define PT_WARN_ON WARN_ON
+#else
+static inline bool PT_WARN_ON(bool condition)
+{
- return false;
+}
+#endif
Then call it PT_DBG_WARN_ON() to be more explicit?
btw it looks like there is no plain WARN_ON() used in generic-pt. Just curious about the rationale behind that. Is it a new trend to contain all warnings under a debug option?
+/* These all work on the VA type */
+#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
+#define log2_to_max_int(a_lg2) log2_to_max_int_t(pt_vaddr_t, a_lg2)
+#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
+#define log2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_vaddr_t, a, b, c_lg2)
+#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
+#define log2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_vaddr_t, a, val, b_lg2)
+#define log2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
+#define log2_ffs(a) log2_ffs_t(pt_vaddr_t, a)
+#define log2_fls(a) log2_fls_t(pt_vaddr_t, a)
+#define log2_ffz(a) log2_ffz_t(pt_vaddr_t, a)
the last three (log2_ffs/fls/ffz) are not related to log2
+/* If not supplied by the format then contiguous pages are not supported */
+#ifndef pt_entry_num_contig_lg2
+static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
+{
- return ilog2(1);
+}
+static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
+{
- return ilog2(1);
+}
what is the difference between the above two helpers?
pt_contig_count_lg2() is currently not implemented by any driver, so it
always has the default version returning 0, and it is only used by the
default pt_possible_sizes(), which then returns only one page size
accordingly.
I kind of think it's useless and we could simply move pt_possible_sizes()
here and simplify it to return only one size explicitly, assuming a format
should implement both pt_entry_num_contig_lg2() and pt_possible_sizes().
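i.e. something like (untested; guessing at the exact signature):

	/* Without contiguous page support only the base granule is possible */
	static inline pt_vaddr_t pt_possible_sizes(const struct pt_state *pts)
	{
		return log2_to_int(pt_table_item_lg2sz(pts));
	}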
+#ifndef pt_pgsz_lg2_to_level
+static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
unsigned int pgsize_lg2)
+{
- return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
(PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
- return 0;
+}
+#endif
remove the 2nd 'return'
+/* If not supplied by the format then dirty tracking is not supported */
+#ifndef pt_entry_write_is_dirty
+static inline bool pt_entry_write_is_dirty(const struct pt_state *pts)
+{
- return false;
+}
+static inline void pt_entry_set_write_clean(struct pt_state *pts)
+{
+}
+static inline bool pt_dirty_supported(struct pt_common *common)
+{
- return true;
should return false here.
+/*
- Format supplies either:
- pt_entry_oa - OA is at the start of a contiguous entry
- or
- pt_item_oa - OA is correct for every item in a contiguous entry
what is the meaning of 'correct'?
- Build the missing one
- */
+#ifdef pt_entry_oa
+static inline pt_oaddr_t pt_item_oa(const struct pt_state *pts)
+{
- return pt_entry_oa(pts) |
log2_mul(pts->index, pt_table_item_lg2sz(pts));
+}
+#define _pt_entry_oa_fast pt_entry_oa
+#endif
+#ifdef pt_item_oa
+static inline pt_oaddr_t pt_entry_oa(const struct pt_state *pts)
+{
- return log2_set_mod(pt_item_oa(pts), 0,
pt_entry_num_contig_lg2(pts) +
pt_table_item_lg2sz(pts));
+}
+#define _pt_entry_oa_fast pt_item_oa
+#endif
I have a problem understanding _pt_entry_oa_fast() here.
Obviously pt_entry_oa()/pt_item_oa() generate different OAs for a given pts, based on the aligned size. Why is it OK to alias a common macro to either of them? It looks like the assumption is that the caller doesn't care about the offset within the entry range, e.g. it will do its own masking. Some comment clarifying this would be welcome.
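e.g. something like:

	/*
	 * _pt_entry_oa_fast() may return either the entry OA or the item
	 * OA; callers must not rely on the low bits within the entry's
	 * translation size and should mask as needed.
	 */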
+/*
- If not supplied by the format then use the constant
- PT_MAX_OUTPUT_ADDRESS_LG2.
- */
+#ifndef pt_max_output_address_lg2
+static inline unsigned int
+pt_max_output_address_lg2(const struct pt_common *common)
+{
- return PT_MAX_OUTPUT_ADDRESS_LG2;
+}
+#endif
+#ifndef pt_has_system_page
+static inline bool pt_has_system_page(const struct pt_common *common)
+{
- return PT_GRANULE_LG2SZ == PAGE_SHIFT;
+}
+#endif
will there be an implementation supporting the system page size while breaking the above check? if not it could be moved to pt_common.h
+/**
- pt_item_fully_covered() - Check if the item or entry is entirely contained
within pts->range
when it takes a pts, it's more accurate to call it pt_entry_fully_covered()
- The system is divided into three logical levels:
- The page table format and its manipulation functions
- Generic helpers to give a consistent API regardless of underlying format
- An algorithm implementation (e.g. IOMMU/DRM/KVM/MM)
- Multiple implementations are supported. The intention is to have the
- generic format code be re-usable for whatever specalized implementation is
s/specalized/specialized/
required.
- The generic code is solely about the format of the radix tree; it does not
- include memory allocation or higher level decisions that are left for the
- implementation.
- The generic framework supports a superset of functions across many HW
- implementations:
- Entries comprised of contiguous blocks of IO PTEs for larger page sizes
- Multi-level tables, up to 6 levels. Runtime selected top level
- Runtime variable table level size (ARM's concatenated tables)
- Expandable top level allowing dynamic sizing of table levels
- Optional leaf entries at any level
- 32-bit/64-bit virtual and output addresses, using every address bit
- Dirty tracking
- Sign extended addressing
and any more context about "Sign extended addressing" too?
- /**
* @PT_FEAT_FLUSH_RANGE: IOTLB maintenance is done by flushing IOVA
* ranges which will clean out any walk cache or any IOPTE fully
* contained by the range. The optimization objective is to minimize
* the number of flushes even if ranges include IOVA gaps that do not
* need to be flushed.
*/
- PT_FEAT_FLUSH_RANGE,
- /**
* @PT_FEAT_FLUSH_RANGE_NO_GAPS: Like PT_FEAT_FLUSH_RANGE except
* that the optimization objective is to only flush IOVA that has been
* changed. This mode is suitable for cases like hypervisor shadowing
* where flushing unchanged ranges may cause the hypervisor to reparse
* significant amount of page table.
*/
- PT_FEAT_FLUSH_RANGE_NO_GAPS,
FLUSH_RANGE and FLUSH_RANGE_NO_GAPS are mutually exclusive, but must a format select one of the two? If so we could keep just one flag (NO_GAPS), with the feature off meaning FLUSH_RANGE.
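i.e. a sketch of what I mean, reusing the existing kdoc wording:

	/**
	 * @PT_FEAT_FLUSH_RANGE_NO_GAPS: Only flush IOVA that has been
	 * changed. When this feature is off, ranges are flushed to
	 * minimize the number of flushes even if they include IOVA gaps
	 * that do not need to be flushed (today's PT_FEAT_FLUSH_RANGE
	 * behavior).
	 */
	PT_FEAT_FLUSH_RANGE_NO_GAPS,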