On Thu, Sep 18, 2025 at 06:49:08AM +0000, Tian, Kevin wrote:
This is enough to implement the 8 initial format variations with all of their features:
- Entries comprised of contiguous blocks of IO PTEs for larger page sizes (AMDv1, ARMv8)
- Multi-level tables, up to 6 levels. Runtime selected top level
- Runtime variable table level size (ARM's concatenated tables)
- Expandable top level (AMDv1)
any more context about this one? how is it different from the earlier "runtime selected top level"?
How about:
* The size of the top table level can be selected at runtime (ARM's
  concatenated tables)
* The number of levels in the table can optionally increase dynamically
  during map (AMDv1)
--- /dev/null
+++ b/drivers/iommu/generic_pt/pt_common.h
@@ -0,0 +1,355 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
- Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
- This header is included after the format. It contains definitions
- that build on the format definitions to create the basic format API.
- The format API is listed here, with kdocs, in alphabetical order. The
Is alphabetical order important here? It's not strictly followed, e.g.:
Let's drop it, I think you are right the grouping is better for the generated kdoc anyhow.
- functions without bodies are implemented in the format using the pattern:
-    static inline FMTpt_XXX(..) {..}
-    #define pt_XXX FMTpt_XXX
- or provided by pt_fmt_defaults.h
* The format API is listed here, with kdocs. The functions without bodies are
* implemented in the format using the pattern:
*    static inline FMTpt_XXX(..) {..}
*    #define pt_XXX FMTpt_XXX
*
* If the format doesn't implement a function then pt_fmt_defaults.h can provide
* a generic version.
- The routines marked "@pts: Entry to query" operate on the entire contiguous
- entry and can be called with a pts->index pointing to any sub item that
- makes up that entry.
- The header order is:
- pt_defs.h
- fmt_XX.h
s/fmt_XX.h/FMT.h/
Oops, yep
+/**
- pt_entry_make_write_dirty() - Make an entry dirty
- @pts: Table index to change
it's about the entire entry instead of a specific index? if yes then "entry to change" makes more sense.
Yes, cut and pasto. All the "_entry_" function should internally replicate to all the contiguous items.
+/**
- pt_entry_oa_full() - Return the full OA for an entry
- @pts: Entry to query
s/full/exact/?
OK
+/**
- pt_has_system_page() - True if level 0 can install a PAGE_SHIFT entry
- @common: Page table to query
pt_has_system_page_size()
Ok
+/**
- pt_install_leaf_entry() - Write a leaf entry to the table
- @pts: Table index to change
- @oa: Output Address for this leaf
- @oasz_lg2: Size in VA for this leaf
- @attrs: Attributes to modify the entry
- A leaf OA entry will return PT_ENTRY_OA from pt_load_entry(). It translates
- the VA indicated by pts to the given OA.
- For a single item non-contiguous entry oasz_lg2 is pt_table_item_lg2sz().
- For contiguous it is pt_table_item_lg2sz() + num_contig_lg2.
this sounds like a fixed thing, so couldn't it be computed within the function instead of letting the caller set it?
The point is the caller specifies the size of the entry using oasz_lg2, the documentation above is explaining how that math works.
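To make that math concrete, here is a toy sketch of how a caller would derive oasz_lg2 (the helper name and parameters are invented for illustration; the real code uses pt_table_item_lg2sz() and pt_entry_num_contig_lg2() on a struct pt_state):

```c
#include <assert.h>

/*
 * Hypothetical sketch: the size (in lg2) covered by a leaf entry is the
 * per-item size plus the lg2 count of contiguous items. A non-contiguous
 * entry has num_contig_lg2 == 0, so it covers exactly one item.
 */
static unsigned int leaf_oasz_lg2(unsigned int item_lg2sz,
				  unsigned int num_contig_lg2)
{
	return item_lg2sz + num_contig_lg2;
}
```

For example, with 4K items (lg2 12) a 16-item contiguous entry passes oasz_lg2 = 12 + 4 = 16, i.e. a 64K mapping.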
+/**
- pt_max_output_address_lg2() - Return the maximum OA the table format can hold
- @common: Page table to query
pt_max_oa_lg2()
Ok
- leaf
An entry that results in an output address. I.e. a physical memory addr
"I.e. a physical ..." is redundant to what OA already explains
Yep
- table
A linear array of entries representing the translation items for that
level.
to not mix 'entry' and 'item' in one description:
"A linear array of translation items for that level"
Ok
- item
A single position in a table
'position' is called 'index'
Yep
- entry
A single logical element in a table. If contiguous pages are not
supported then item and entry are the same thing, otherwise entry
refers
to the all the items that comprise a single contiguous translation.
'refers to all the items"
Yep
- item/entry_size
The number of bytes of VA the table translates for.
If the item is a table entry then the next table covers
this size. If the entry is an output address then the
s/is/translates/
Don't follow?
+enum pt_entry_type {
- PT_ENTRY_EMPTY,
- PT_ENTRY_TABLE,
add a comment to be consistent with following line
/* Item points to a lower table level */
+/*
- Try to install a new table pointer. The locking methodology requires this to
- be atomic (multiple threads can race to install a pointer) the losing threads
"... install a pointer). The losing threads..."
Yep
+static inline bool pt_feature(const struct pt_common *common,
+			      unsigned int feature_nr)
+{
+	if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
+		return true;
+	if (!PT_SUPPORTED_FEATURE(feature_nr))
+		return false;
+	return common->features & BIT(feature_nr);
+}
common->features is already verified in pt_init_common(), so the above is a kind of optimization that uses the compiler to filter out the static checks in the fast path?
Yes.
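The constant-folding being discussed can be modeled outside the kernel with a small standalone sketch (the feature masks and BIT() definition here are made up for illustration, not the real generic_pt values):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Minimal model: because PT_FORCE_ENABLED_FEATURES and the supported-
 * features mask are compile-time constants, the first two branches fold
 * away. A call site testing an unsupported feature compiles down to
 * 'return false' with no load of common->features at all.
 */
#define BIT(nr) (1u << (nr))
#define PT_FORCE_ENABLED_FEATURES BIT(0)
#define PT_SUPPORTED_FEATURES (BIT(0) | BIT(1))
#define PT_SUPPORTED_FEATURE(nr) (PT_SUPPORTED_FEATURES & BIT(nr))

struct pt_common {
	unsigned int features;
};

static inline bool pt_feature(const struct pt_common *common,
			      unsigned int feature_nr)
{
	if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
		return true;
	if (!PT_SUPPORTED_FEATURE(feature_nr))
		return false;
	return common->features & BIT(feature_nr);
}
```

Feature 0 is forced on regardless of the runtime mask, feature 1 follows common->features, and feature 2 is statically false.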
+/*
+ * PT_WARN_ON is used for invariants that the kunit should be checking
+ * can't happen.
+ */
+#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
+#define PT_WARN_ON WARN_ON
+#else
+static inline bool PT_WARN_ON(bool condition)
+{
+	return false;
+}
+#endif
Then call it PT_DBG_WARN_ON() to be more explicit?
Ah, I'd rather leave it..
btw, it looks like there is no plain WARN_ON() used in generic-pt. Just curious about the rationale behind that. Is it a new trend to contain all warnings under a debug option?
It is sort of like VM_WARN_ON(); the places where it got used are largely performance paths, so you don't want it enabled unless you are debugging.
Generally the idea is to use PT_WARN_ON() on performance path cases only, and leave normal stuff to normal WARN_ON.
+/* These all work on the VA type */
+#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
+#define log2_to_max_int(a_lg2) log2_to_max_int_t(pt_vaddr_t, a_lg2)
+#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
+#define log2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_vaddr_t, a, b, c_lg2)
+#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
+#define log2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_vaddr_t, a, val, b_lg2)
+#define log2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
+#define log2_ffs(a) log2_ffs_t(pt_vaddr_t, a)
+#define log2_fls(a) log2_fls_t(pt_vaddr_t, a)
+#define log2_ffz(a) log2_ffz_t(pt_vaddr_t, a)
above three are not related to log2
Hrm, maybe not, but I also don't have a better name.. It needs the type multiplexer. Let me think on it.
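The "type multiplexer" role of these wrappers can be sketched with guessed definitions (the `_t` macro bodies below are plausible shapes, not the real generic_pt implementations, which live in the series' log2 header):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Guessed shapes for the type-multiplexed helpers: every macro threads
 * a type parameter through so the same math works whether pt_vaddr_t is
 * 32 or 64 bits. The un-suffixed wrappers just pin the type.
 */
typedef uint64_t pt_vaddr_t;

#define log2_to_int_t(type, a_lg2) ((type)1 << (a_lg2))
#define log2_div_t(type, a, b_lg2) ((type)(a) >> (b_lg2))
#define log2_mod_t(type, a, b_lg2) \
	((type)(a) & (log2_to_int_t(type, b_lg2) - 1))
#define log2_mul_t(type, a, b_lg2) ((type)(a) << (b_lg2))

#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
```

For a 4K item size (lg2 12), log2_div() yields the item index and log2_mod() the offset within the item.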
+/* If not supplied by the format then contiguous pages are not supported */
+#ifndef pt_entry_num_contig_lg2
+static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
+{
+	return ilog2(1);
+}
+
+static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
+{
+	return ilog2(1);
+}
what is the difference between above two helpers?
Oh, pt_contig_count_lg2 didn't get kdocs because they are internal helpers to build other functions..
Like this:
/*
 * If not supplied by the format then contiguous pages are not supported.
 *
 * If contiguous pages are supported then the format must also provide
 * pt_contig_count_lg2() if it supports a single contiguous size per level,
 * or pt_possible_sizes() if it supports multiple sizes per level.
 */
#ifndef pt_entry_num_contig_lg2
static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
{
	return ilog2(1);
}

/*
 * Return the number of contiguous OA items forming an entry at this table
 * level
 */
static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
{
	return ilog2(1);
}
#endif
It's currently not implemented by any driver, so it will have the default version returning 0, and it is only used by the default pt_possible_sizes(), which then returns only one page size accordingly.
ARM and RISCV use it. AMD is the only format that support more than one size of contiguous per level.
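The lg2 convention here is easy to misread: ilog2(1) == 0 means "one item per entry", not "no items". A toy helper makes the counting explicit (the function name is invented; the real helpers operate on a struct pt_state):

```c
#include <assert.h>

/*
 * Number of items forming one entry, given the lg2 contig count. The
 * default ilog2(1) == 0 means every entry is a single item; a format
 * like ARM's 16-entry contiguous hint at 4K granule (16 * 4K = 64K)
 * would return ilog2(16) == 4 from pt_contig_count_lg2().
 */
static inline unsigned int contig_items(unsigned int contig_count_lg2)
{
	return 1u << contig_count_lg2;
}
```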
+#ifndef pt_pgsz_lg2_to_level
+static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
+						unsigned int pgsize_lg2)
+{
+	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
+	       (PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
+	return 0;
+}
+#endif
remove the 2nd 'return'
Yep
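A worked example of that level formula, using typical constants (4K granule, 4K table memory, 8-byte items, so each level decodes 12 - 3 = 9 bits of VA; the constants are assumptions for illustration, not taken from any specific format header):

```c
#include <assert.h>

/* Assumed constants: 4K granule, 4K tables, 8-byte table items */
#define PT_GRANULE_LG2SZ 12
#define PT_TABLEMEM_LG2SZ 12
#define PT_ITEM_WORD_SIZE 8

/* Tiny stand-in for the kernel's ilog2() on small positive values */
static unsigned int ilog2_u(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Same math as the quoted default: divide the extra VA bits above the
 * granule by the bits decoded per level.
 */
static unsigned int pgsz_lg2_to_level(unsigned int pgsize_lg2)
{
	return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
	       (PT_TABLEMEM_LG2SZ - ilog2_u(PT_ITEM_WORD_SIZE));
}
```

With these constants a 4K page (lg2 12) is level 0, a 2M page (lg2 21) is level 1, and a 1G page (lg2 30) is level 2.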
+/* If not supplied by the format then dirty tracking is not supported */
+#ifndef pt_entry_write_is_dirty
+static inline bool pt_entry_write_is_dirty(const struct pt_state *pts)
+{
+	return false;
+}
+
+static inline void pt_entry_set_write_clean(struct pt_state *pts)
+{
+}
+
+static inline bool pt_dirty_supported(struct pt_common *common)
+{
+	return true;
should return false here.
Huh.. Seems like yes. This is only used by the kunit, but I have to figure out why the kunit doesn't fail..
- Format supplies either:
- pt_entry_oa - OA is at the start of a contiguous entry
- or
- pt_item_oa - OA is correct for every item in a contiguous entry
what is the meaning of 'correct'?
I will use 'adjusted' instead.
I have a problem understanding _pt_entry_oa_fast() here.
Obviously pt_entry_oa/pt_item_oa generate different OAs for a given pts, based on the aligned size. Why is it ok to alias a common macro to either of them? It looks like the assumption is that the caller doesn't care about the offset within the entry range, e.g. it will do its own masking.
Yes, this is correct.
Probably some comment is welcomed to clarify it.
* The internal helper _pt_entry_oa_fast() allows generating an
* efficient pt_entry_oa_exact(), it doesn't care which option is
* selected.
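The masking argument can be shown with a toy calculation (all names and constants below are invented for the example, not the real generic_pt API): whichever variant the format supplies, the caller recovers the same exact address once it masks in the low VA bits of the full entry size.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pt_oaddr_t;

/*
 * Whether helper_oa is the OA of the entry start (pt_entry_oa style) or
 * the OA adjusted to the current item (pt_item_oa style), both agree in
 * the bits above entry_sz_lg2, and the caller overwrites everything
 * below entry_sz_lg2 with the VA's low bits anyway.
 */
static pt_oaddr_t oa_exact(pt_oaddr_t helper_oa, uint64_t va,
			   unsigned int entry_sz_lg2)
{
	pt_oaddr_t low_mask = ((pt_oaddr_t)1 << entry_sz_lg2) - 1;

	return (helper_oa & ~low_mask) | (va & low_mask);
}
```

For a 64K entry (lg2 16) starting at OA 0x100000, the entry-start OA (0x100000) and an item-adjusted OA (say 0x103000) produce the same exact result for the same VA.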
+#ifndef pt_has_system_page
+static inline bool pt_has_system_page(const struct pt_common *common)
+{
+	return PT_GRANULE_LG2SZ == PAGE_SHIFT;
+}
+#endif
will there be an implementation supporting the system page size while breaking the above check? If not, it could be moved to pt_common.h
Yes, DART does in my draft and I expect to rework ARMv8 to have variable page sizes.
+/**
- pt_item_fully_covered() - Check if the item or entry is entirely contained within pts->range
when using pts it's more accurate to call it pt_entry_fully_covered()
Not so much related to pts, as pts could be either, but it makes more sense to refer to a table pointer as an entry than to refer to a contiguous entry as an item.
- PT_FEAT_FLUSH_RANGE,
- /**
-  * @PT_FEAT_FLUSH_RANGE_NO_GAPS: Like PT_FEAT_FLUSH_RANGE except that
-  * the optimization objective is to only flush IOVA that has been
-  * changed. This mode is suitable for cases like hypervisor shadowing
-  * where flushing unchanged ranges may cause the hypervisor to reparse
-  * a significant amount of page table.
-  */
- PT_FEAT_FLUSH_RANGE_NO_GAPS,
FLUSH_RANGE and FLUSH_RANGE_NO_GAPS are mutually exclusive, but a format must select one? Then we could just keep one flag (NO_GAPS), with the feature off meaning FLUSH_RANGE.
I think at least one or two more flushing modes will be needed, and they would be mutually exclusive. It is set up with this one-hot encoding because of how the pt_feature() compile optimization works.
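The one-hot point can be sketched briefly (bit numbers and the supported mask are invented here): giving each mutually exclusive mode its own feature bit lets a format's compile-time supported mask statically exclude the other modes, so pt_feature()-style constant folding deletes their code paths entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(nr) (1u << (nr))

/* Hypothetical bit numbers for the mutually exclusive flush modes */
enum {
	PT_FEAT_FLUSH_RANGE = 5,
	PT_FEAT_FLUSH_RANGE_NO_GAPS = 6,
};

/* This format compiles in only one mode */
#define PT_SUPPORTED_FEATURES BIT(PT_FEAT_FLUSH_RANGE)

/*
 * A compile-time-constant check: the NO_GAPS branch below folds to
 * false, so its flush logic can be discarded by the compiler. A single
 * boolean flag could not statically rule out either mode this way.
 */
static inline bool supported(unsigned int feature_nr)
{
	return PT_SUPPORTED_FEATURES & BIT(feature_nr);
}
```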
Thanks, Jason