On Thu, Sep 18, 2025 at 06:49:08AM +0000, Tian, Kevin wrote:
This is enough to implement the 8 initial format variations with all of their features:
- Entries comprised of contiguous blocks of IO PTEs for larger page sizes (AMDv1, ARMv8)
- Multi-level tables, up to 6 levels. Runtime selected top level
- Runtime variable table level size (ARM's concatenated tables)
- Expandable top level (AMDv1)
any more context about this one? how is it different from the earlier "runtime selected top level"?
How about:
* The size of the top table level can be selected at runtime (ARM's
  concatenated tables)
* The number of levels in the table can optionally increase dynamically
  during map (AMDv1)
--- /dev/null
+++ b/drivers/iommu/generic_pt/pt_common.h
@@ -0,0 +1,355 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
- Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
- This header is included after the format. It contains definitions
- that build on the format definitions to create the basic format API.
- The format API is listed here, with kdocs, in alphabetical order. The
Is alphabetical order important here? It's not strictly followed, e.g.:
Let's drop it, I think you are right the grouping is better for the generated kdoc anyhow.
- functions without bodies are implemented in the format using the pattern:
-    static inline FMTpt_XXX(..) {..}
-    #define pt_XXX FMTpt_XXX
- or provided by pt_fmt_defaults.h
* The format API is listed here, with kdocs. The functions without bodies are
* implemented in the format using the pattern:
*    static inline FMTpt_XXX(..) {..}
*    #define pt_XXX FMTpt_XXX
*
* If the format doesn't implement a function then pt_fmt_defaults.h can provide
* a generic version.
- The routines marked "@pts: Entry to query" operate on the entire contiguous
- entry and can be called with a pts->index pointing to any sub item that
- makes up that entry.
- The header order is:
- pt_defs.h
- fmt_XX.h
s/fmt_XX.h/FMT.h/
Oops, yep
+/**
- pt_entry_make_write_dirty() - Make an entry dirty
- @pts: Table index to change
it's about the entire entry instead of a specific index? if yes then "entry to change" makes more sense.
Yes, cut and pasto. All the "_entry_" function should internally replicate to all the contiguous items.
+/**
- pt_entry_oa_full() - Return the full OA for an entry
- @pts: Entry to query
s/full/exact/?
OK
+/**
- pt_has_system_page() - True if level 0 can install a PAGE_SHIFT entry
- @common: Page table to query
pt_has_system_page_size()
Ok
+/**
- pt_install_leaf_entry() - Write a leaf entry to the table
- @pts: Table index to change
- @oa: Output Address for this leaf
- @oasz_lg2: Size in VA for this leaf
- @attrs: Attributes to modify the entry
- A leaf OA entry will return PT_ENTRY_OA from pt_load_entry(). It translates
- the VA indicated by pts to the given OA.
- For a single item non-contiguous entry oasz_lg2 is pt_table_item_lg2sz().
- For contiguous it is pt_table_item_lg2sz() + num_contig_lg2.
this sounds like a fixed thing, so couldn't it be computed within the function instead of letting the caller set it?
The point is the caller specifies the size of the entry using oasz_lg2, the documentation above is explaining how that math works.
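To make that math concrete, here is a toy sketch of how a caller would derive oasz_lg2 (the helper name and parameters are invented for illustration; the real code uses pt_table_item_lg2sz() and pt_entry_num_contig_lg2() on a struct pt_state):

```c
#include <assert.h>

/*
 * Hypothetical sketch: the size (in lg2) covered by a leaf entry is the
 * per-item size plus the lg2 count of contiguous items. A non-contiguous
 * entry has num_contig_lg2 == 0, so it covers exactly one item.
 */
static unsigned int leaf_oasz_lg2(unsigned int item_lg2sz,
				  unsigned int num_contig_lg2)
{
	return item_lg2sz + num_contig_lg2;
}
```

For example, with 4K items (lg2 12) a 16-item contiguous entry passes oasz_lg2 = 12 + 4 = 16, i.e. a 64K mapping.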
+/**
- pt_max_output_address_lg2() - Return the maximum OA the table format can hold
- @common: Page table to query
pt_max_oa_lg2()
Ok
- leaf
An entry that results in an output address. I.e. a physical memory addr
"I.e. a physical ..." is redundant to what OA already explains
Yep
- table
A linear array of entries representing the translation items for that
level.
to not mix 'entry' and 'item' in one description:
"A linear array of translation items for that level"
Ok
- item
A single position in a table
'position' is called 'index'
Yep
- entry
A single logical element in a table. If contiguous pages are not
supported then item and entry are the same thing, otherwise entry
refers
to the all the items that comprise a single contiguous translation.
'refers to all the items"
Yep
- item/entry_size
The number of bytes of VA the table translates for.
If the item is a table entry then the next table covers
this size. If the entry is an output address then the
s/is/translates/
Don't follow?
+enum pt_entry_type {
- PT_ENTRY_EMPTY,
- PT_ENTRY_TABLE,
add a comment to be consistent with following line
/* Item points to a lower table level */
+/*
- Try to install a new table pointer. The locking methodology requires this to
- be atomic (multiple threads can race to install a pointer) the losing threads
"... install a pointer). The losing threads..."
Yep
+static inline bool pt_feature(const struct pt_common *common,
+			      unsigned int feature_nr)
+{
+	if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
+		return true;
+	if (!PT_SUPPORTED_FEATURE(feature_nr))
+		return false;
+	return common->features & BIT(feature_nr);
+}
common->features is already verified in pt_init_common(), so the above is a kind of optimization that uses the compiler to filter out the static checks in the fast path?
Yes.
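The constant-folding being discussed can be modeled outside the kernel with a small standalone sketch (the feature masks and BIT() definition here are made up for illustration, not the real generic_pt values):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Minimal model: because PT_FORCE_ENABLED_FEATURES and the supported-
 * features mask are compile-time constants, the first two branches fold
 * away. A call site testing an unsupported feature compiles down to
 * 'return false' with no load of common->features at all.
 */
#define BIT(nr) (1u << (nr))
#define PT_FORCE_ENABLED_FEATURES BIT(0)
#define PT_SUPPORTED_FEATURES (BIT(0) | BIT(1))
#define PT_SUPPORTED_FEATURE(nr) (PT_SUPPORTED_FEATURES & BIT(nr))

struct pt_common {
	unsigned int features;
};

static inline bool pt_feature(const struct pt_common *common,
			      unsigned int feature_nr)
{
	if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
		return true;
	if (!PT_SUPPORTED_FEATURE(feature_nr))
		return false;
	return common->features & BIT(feature_nr);
}
```

Feature 0 is forced on regardless of the runtime mask, feature 1 follows common->features, and feature 2 is statically false.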
+/*
+ * PT_WARN_ON is used for invariants that the kunit should be checking
+ * can't happen.
+ */
+#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
+#define PT_WARN_ON WARN_ON
+#else
+static inline bool PT_WARN_ON(bool condition)
+{
+	return false;
+}
+#endif
Then call it PT_DBG_WARN_ON() to be more explicit?
Ah, I'd rather leave it..
btw, it looks like there is no plain WARN_ON() used in generic-pt. Just curious about the rationale behind that. Is it a new trend to contain all warnings under a debug option?
It is sort of like VM_WARN_ON(); the places where it got used are largely performance paths, so you don't want it enabled unless you are debugging.
Generally the idea is to use PT_WARN_ON() on performance path cases only, and leave normal stuff to normal WARN_ON.
+/* These all work on the VA type */
+#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
+#define log2_to_max_int(a_lg2) log2_to_max_int_t(pt_vaddr_t, a_lg2)
+#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
+#define log2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_vaddr_t, a, b, c_lg2)
+#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
+#define log2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_vaddr_t, a, val, b_lg2)
+#define log2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
+#define log2_ffs(a) log2_ffs_t(pt_vaddr_t, a)
+#define log2_fls(a) log2_fls_t(pt_vaddr_t, a)
+#define log2_ffz(a) log2_ffz_t(pt_vaddr_t, a)
above three are not related to log2
Hrm, maybe not, but I also don't have a better name.. It needs the type multiplexer. Let me think on it.
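The "type multiplexer" role of these wrappers can be sketched with guessed definitions (the `_t` macro bodies below are plausible shapes, not the real generic_pt implementations, which live in the series' log2 header):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Guessed shapes for the type-multiplexed helpers: every macro threads
 * a type parameter through so the same math works whether pt_vaddr_t is
 * 32 or 64 bits. The un-suffixed wrappers just pin the type.
 */
typedef uint64_t pt_vaddr_t;

#define log2_to_int_t(type, a_lg2) ((type)1 << (a_lg2))
#define log2_div_t(type, a, b_lg2) ((type)(a) >> (b_lg2))
#define log2_mod_t(type, a, b_lg2) \
	((type)(a) & (log2_to_int_t(type, b_lg2) - 1))
#define log2_mul_t(type, a, b_lg2) ((type)(a) << (b_lg2))

#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
```

For a 4K item size (lg2 12), log2_div() yields the item index and log2_mod() the offset within the item.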
+/* If not supplied by the format then contiguous pages are not supported */
+#ifndef pt_entry_num_contig_lg2
+static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
+{
+	return ilog2(1);
+}
+
+static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
+{
+	return ilog2(1);
+}
what is the difference between above two helpers?
Oh, pt_contig_count_lg2 didn't get kdocs because they are internal helpers to build other functions..
Like this:
/*
 * If not supplied by the format then contiguous pages are not supported.
 *
 * If contiguous pages are supported then the format must also provide
 * pt_contig_count_lg2() if it supports a single contiguous size per level,
 * or pt_possible_sizes() if it supports multiple sizes per level.
 */
#ifndef pt_entry_num_contig_lg2
static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
{
	return ilog2(1);
}

/*
 * Return the number of contiguous OA items forming an entry at this table
 * level
 */
static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
{
	return ilog2(1);
}
#endif
It's currently not implemented by any driver, so it will have the default version returning 0, and it is only used by the default pt_possible_sizes(), which then returns only one page size accordingly.
ARM and RISCV use it. AMD is the only format that support more than one size of contiguous per level.
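The lg2 convention here is easy to misread: ilog2(1) == 0 means "one item per entry", not "no items". A toy helper makes the counting explicit (the function name is invented; the real helpers operate on a struct pt_state):

```c
#include <assert.h>

/*
 * Number of items forming one entry, given the lg2 contig count. The
 * default ilog2(1) == 0 means every entry is a single item; a format
 * like ARM's 16-entry contiguous hint at 4K granule (16 * 4K = 64K)
 * would return ilog2(16) == 4 from pt_contig_count_lg2().
 */
static inline unsigned int contig_items(unsigned int contig_count_lg2)
{
	return 1u << contig_count_lg2;
}
```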
+#ifndef pt_pgsz_lg2_to_level
+static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
+						unsigned int pgsize_lg2)
+{
+	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
+	       (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
+	return 0;
+}
+#endif
remove the 2nd 'return'
Yep
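A worked example of that level formula, using typical constants (4K granule, 4K table memory, 8-byte items, so each level decodes 12 - 3 = 9 bits of VA; the constants are assumptions for illustration, not taken from any specific format header):

```c
#include <assert.h>

/* Assumed constants: 4K granule, 4K tables, 8-byte table items */
#define PT_GRANULE_LG2SZ 12
#define PT_TABLEMEM_LG2SZ 12
#define PT_ITEM_WORD_SIZE 8

/* Tiny stand-in for the kernel's ilog2() on small positive values */
static unsigned int ilog2_u(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Same math as the quoted default: divide the extra VA bits above the
 * granule by the bits decoded per level.
 */
static unsigned int pgsz_lg2_to_level(unsigned int pgsize_lg2)
{
	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
	       (PT_TABLEMEM_LG2SZ - ilog2_u(PT_ITEM_WORD_SIZE));
}
```

With these constants a 4K page (lg2 12) is level 0, a 2M page (lg2 21) is level 1, and a 1G page (lg2 30) is level 2.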
+/* If not supplied by the format then dirty tracking is not supported */
+#ifndef pt_entry_write_is_dirty
+static inline bool pt_entry_write_is_dirty(const struct pt_state *pts)
+{
+	return false;
+}
+
+static inline void pt_entry_set_write_clean(struct pt_state *pts)
+{
+}
+
+static inline bool pt_dirty_supported(struct pt_common *common)
+{
+	return true;
should return false here.
Huh.. Seems like yes. This is only used by the kunit, but I have to figure out why the kunit doesn't fail..
- Format supplies either:
- pt_entry_oa - OA is at the start of a contiguous entry
- or
- pt_item_oa - OA is correct for every item in a contiguous entry
what is the meaning of 'correct'?
I will use 'adjusted' instead.
I have a problem understanding _pt_entry_oa_fast() here.
Obviously pt_entry_oa/pt_item_oa generate different OAs for a given pts, based on the aligned size. Why is it ok to alias a common macro to either of them? It looks like the assumption is that the caller doesn't care about the offset within the entry range, e.g. it will do its own masking.
Yes, this is correct.
Probably some comment is welcomed to clarify it.
* The internal helper _pt_entry_oa_fast() allows generating an
* efficient pt_entry_oa_exact(), it doesn't care which option is
* selected.
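The masking argument can be shown with a toy calculation (all names and constants below are invented for the example, not the real generic_pt API): whichever variant the format supplies, the caller recovers the same exact address once it masks in the low VA bits of the full entry size.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pt_oaddr_t;

/*
 * Whether helper_oa is the OA of the entry start (pt_entry_oa style) or
 * the OA adjusted to the current item (pt_item_oa style), both agree in
 * the bits above entry_sz_lg2, and the caller overwrites everything
 * below entry_sz_lg2 with the VA's low bits anyway.
 */
static pt_oaddr_t oa_exact(pt_oaddr_t helper_oa, uint64_t va,
			   unsigned int entry_sz_lg2)
{
	pt_oaddr_t low_mask = ((pt_oaddr_t)1 << entry_sz_lg2) - 1;

	return (helper_oa & ~low_mask) | (va & low_mask);
}
```

For a 64K entry (lg2 16) starting at OA 0x100000, the entry-start OA (0x100000) and an item-adjusted OA (say 0x103000) produce the same exact result for the same VA.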
+#ifndef pt_has_system_page
+static inline bool pt_has_system_page(const struct pt_common *common)
+{
+	return PT_GRANULE_LG2SZ == PAGE_SHIFT;
+}
+#endif
will there be an implementation supporting the system page size while breaking the above check? If not, it could be moved to pt_common.h
Yes, DART does in my draft and I expect to rework ARMv8 to have variable page sizes.
+/**
- pt_item_fully_covered() - Check if the item or entry is entirely contained within pts->range
when using pts it's more accurate to call it pt_entry_fully_covered()
Not so much related to pts, as pts could be either, but it makes more sense to refer to a table pointer as an entry than to refer to a contiguous entry as an item.
- PT_FEAT_FLUSH_RANGE,
- /**
-  * @PT_FEAT_FLUSH_RANGE_NO_GAPS: Like PT_FEAT_FLUSH_RANGE except that
-  * the optimization objective is to only flush IOVA that has been
-  * changed. This mode is suitable for cases like hypervisor shadowing
-  * where flushing unchanged ranges may cause the hypervisor to reparse
-  * a significant amount of page table.
-  */
- PT_FEAT_FLUSH_RANGE_NO_GAPS,
FLUSH_RANGE and FLUSH_RANGE_NO_GAPS are mutually exclusive, but a format must select one? Then we could just keep one flag (NO_GAPS), with the feature off meaning FLUSH_RANGE.
I think at least one or two more flushing modes will be needed, and they would be mutually exclusive. It is set up with this one-hot encoding because of how the pt_feature() compile optimization works.
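The one-hot point can be sketched briefly (bit numbers and the supported mask are invented here): giving each mutually exclusive mode its own feature bit lets a format's compile-time supported mask statically exclude the other modes, so pt_feature()-style constant folding deletes their code paths entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(nr) (1u << (nr))

/* Hypothetical bit numbers for the mutually exclusive flush modes */
enum {
	PT_FEAT_FLUSH_RANGE = 5,
	PT_FEAT_FLUSH_RANGE_NO_GAPS = 6,
};

/* This format compiles in only one mode */
#define PT_SUPPORTED_FEATURES BIT(PT_FEAT_FLUSH_RANGE)

/*
 * A compile-time-constant check: the NO_GAPS branch below folds to
 * false, so its flush logic can be discarded by the compiler. A single
 * boolean flag could not statically rule out either mode this way.
 */
static inline bool supported(unsigned int feature_nr)
{
	return PT_SUPPORTED_FEATURES & BIT(feature_nr);
}
```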
Thanks, Jason