From: Jason Gunthorpe <jgg@nvidia.com>
Sent: Thursday, September 4, 2025 1:46 AM
[...]
This is enough to implement the 8 initial format variations with all of their features:
- Entries comprised of contiguous blocks of IO PTEs for larger page sizes (AMDv1, ARMv8)
- Multi-level tables, up to 6 levels. Runtime selected top level
- Runtime variable table level size (ARM's concatenated tables)
- Expandable top level (AMDv1)
any more context about this one? how is it different from the earlier "runtime selected top level"?
--- /dev/null
+++ b/drivers/iommu/generic_pt/pt_common.h
@@ -0,0 +1,355 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
- Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES
- This header is included after the format. It contains definitions
- that build on the format definitions to create the basic format API.
- The format API is listed here, with kdocs, in alphabetical order. The
Is alphabetical order important here? It's not strictly followed, e.g.:
  pt_entry_make_write_dirty()
  pt_dirty_supported()
  pt_entry_num_contig_lg2()
and several other violations at a glance. IMHO grouping related functions
together (e.g. the dirty tracking ones) is more meaningful and less of a
burden than following alphabetical order.
- functions without bodies are implemented in the format using the
pattern:
static inline FMTpt_XXX(..) {..}
#define pt_XXX FMTpt_XXX
or provided by pt_fmt_defaults.h
- The routines marked "@pts: Entry to query" operate on the entire
- contiguous entry and can be called with a pts->index pointing to any
- sub item that makes up that entry.
- The header order is:
- pt_defs.h
- fmt_XX.h
s/fmt_XX.h/FMT.h/
or rename amdv1.h etc. to fmt_amdv1.h etc. to be consistent
+/**
- pt_entry_make_write_dirty() - Make an entry dirty
- @pts: Table index to change
is it about the entire entry instead of a specific index? if yes then "entry to change" makes more sense.
+/**
- pt_entry_oa_full() - Return the full OA for an entry
- @pts: Entry to query
s/full/exact/?
+/**
- pt_entry_set_write_clean() - Make the entry write clean
- @pts: Table index to change
ditto "entry to change"
+/**
- pt_has_system_page() - True if level 0 can install a PAGE_SHIFT entry
- @common: Page table to query
pt_has_system_page_size()
+/**
- pt_install_leaf_entry() - Write a leaf entry to the table
- @pts: Table index to change
- @oa: Output Address for this leaf
- @oasz_lg2: Size in VA for this leaf
- @attrs: Attributes to modify the entry
- A leaf OA entry will return PT_ENTRY_OA from pt_load_entry(). It
- translates the VA indicated by pts to the given OA.
- For a single item non-contiguous entry oasz_lg2 is pt_table_item_lg2sz().
- For contiguous it is pt_table_item_lg2sz() + num_contig_lg2.
this sounds like a fixed thing, so could it be computed within the function instead of having the caller pass it in?
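i.e. could it be something like (untested; assuming pt_entry_num_contig_lg2()
already reflects the intended contiguity at this point):

	unsigned int oasz_lg2 = pt_table_item_lg2sz(pts) +
				pt_entry_num_contig_lg2(pts);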
+/**
- pt_max_output_address_lg2() - Return the maximum OA the table
format can hold
- @common: Page table to query
pt_max_oa_lg2()
+/**
- DOC: Generic Page Table Language
- Language used in Generic Page Table
- VA
The input address to the page table, often the virtual address.
- OA
The output address from the page table, often the physical address.
- leaf
An entry that results in an output address. I.e. a physical memory addr
"I.e. a physical ..." is redundant to what OA already explains
- start/end
An half-open range, e.g. [0,0) refers to no VA.
- start/last
An inclusive closed range, e.g. [0,0] refers to the VA 0
- common
The generic page table container struct pt_common
- level
The number of table hops from the lowest leaf. Level 0
is always a table of only leaves of the least significant VA bits. The
labels used by HW descriptions are never used.
- top_level
The inclusive highest level of the table. A two-level table
has a top level of 1.
- table
A linear array of entries representing the translation items for that
level.
to not mix 'entry' and 'item' in one description:
"A linear array of translation items for that level"
- index
The position in a table of an element: item = table[index]
- item
A single position in a table
'position' is called 'index'
- entry
A single logical element in a table. If contiguous pages are not
supported then item and entry are the same thing, otherwise entry refers
to the all the items that comprise a single contiguous translation.
'refers to all the items'
- item/entry_size
The number of bytes of VA the table translates for.
If the item is a table entry then the next table covers
this size. If the entry is an output address then the
s/is/translates/
full OA is: OA | (VA % entry_size)
- contig_count
The number of consecutive items fused into a single entry.
item_size * contig_count is the size of that entry's translation.
- lg2
Indicates the value is encoded as log2, i.e. 1<<x is the actual value.
Normally the compiler is fine to optimize divide and mod with log2
values automatically when inlining, however if the values are not
constant expressions it can't. So we do it by hand; we want to avoid
64-bit divmod.
- */
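btw a small usage example in the doc might make "by hand" concrete, e.g.
(assuming the obvious shift/mask expansion):

	pt_vaddr_t pfn = log2_div(va, PT_GRANULE_LG2SZ); /* va >> lg2 */
	pt_vaddr_t off = log2_mod(va, PT_GRANULE_LG2SZ); /* va & mask */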
+/* Returned by pt_load_entry() and for_each_pt_level_entry() */
+enum pt_entry_type {
- PT_ENTRY_EMPTY,
- PT_ENTRY_TABLE,
add a comment to be consistent with the following line
- /* Entry is valid and returns an output address */
- PT_ENTRY_OA,
+};
+struct pt_range {
- struct pt_common *common;
- struct pt_table_p *top_table;
- pt_vaddr_t va;
- pt_vaddr_t last_va;
- u8 top_level;
- u8 max_vasz_lg2;
+};
+/*
- Similar to xa_state, this records information about an in-progress
- parse at a single level.
- */
+struct pt_state {
- struct pt_range *range;
- struct pt_table_p *table;
- struct pt_table_p *table_lower;
- u64 entry;
- enum pt_entry_type type;
- unsigned short index;
- unsigned short end_index;
- u8 level;
+};
+#define pt_cur_table(pts, type) ((type *)((pts)->table))
+/*
- Try to install a new table pointer. The locking methodology requires
- this to be atomic (multiple threads can race to install a pointer) the
- losing threads
"... install a pointer). The losing threads..."
+static inline bool pt_feature(const struct pt_common *common,
unsigned int feature_nr)
+{
- if (PT_FORCE_ENABLED_FEATURES & BIT(feature_nr))
return true;
- if (!PT_SUPPORTED_FEATURE(feature_nr))
return false;
- return common->features & BIT(feature_nr);
+}
common->features is already verified in pt_init_common(), so is the above kind of an optimization, using the compiler to filter out the static checks in the fast path?
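e.g. I'd expect a guard like the below (flush_range() being a hypothetical
caller) to compile away entirely when
PT_SUPPORTED_FEATURE(PT_FEAT_FLUSH_RANGE) is constant false, leaving the
run-time common->features test only for genuinely optional features:

	if (pt_feature(common, PT_FEAT_FLUSH_RANGE))
		flush_range(common);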
+/*
- PT_WARN_ON is used for invariants that the kunit should be checking
- can't happen.
- */
+#if IS_ENABLED(CONFIG_DEBUG_GENERIC_PT)
+#define PT_WARN_ON WARN_ON
+#else
+static inline bool PT_WARN_ON(bool condition)
+{
- return false;
+}
+#endif
Then call it PT_DBG_WARN_ON() to be more explicit?
btw it looks like there is no plain WARN_ON() used in generic-pt. Just curious about the rationale behind that. Is it a new trend to contain all warnings under a debug option?
+/* These all work on the VA type */
+#define log2_to_int(a_lg2) log2_to_int_t(pt_vaddr_t, a_lg2)
+#define log2_to_max_int(a_lg2) log2_to_max_int_t(pt_vaddr_t, a_lg2)
+#define log2_div(a, b_lg2) log2_div_t(pt_vaddr_t, a, b_lg2)
+#define log2_div_eq(a, b, c_lg2) log2_div_eq_t(pt_vaddr_t, a, b, c_lg2)
+#define log2_mod(a, b_lg2) log2_mod_t(pt_vaddr_t, a, b_lg2)
+#define log2_mod_eq_max(a, b_lg2) log2_mod_eq_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_set_mod(a, val, b_lg2) log2_set_mod_t(pt_vaddr_t, a, val, b_lg2)
+#define log2_set_mod_max(a, b_lg2) log2_set_mod_max_t(pt_vaddr_t, a, b_lg2)
+#define log2_mul(a, b_lg2) log2_mul_t(pt_vaddr_t, a, b_lg2)
+#define log2_ffs(a) log2_ffs_t(pt_vaddr_t, a)
+#define log2_fls(a) log2_fls_t(pt_vaddr_t, a)
+#define log2_ffz(a) log2_ffz_t(pt_vaddr_t, a)
the last three (log2_ffs/fls/ffz) are not related to log2
+/* If not supplied by the format then contiguous pages are not supported */
+#ifndef pt_entry_num_contig_lg2
+static inline unsigned int pt_entry_num_contig_lg2(const struct pt_state *pts)
+{
- return ilog2(1);
+}
+static inline unsigned short pt_contig_count_lg2(const struct pt_state *pts)
+{
- return ilog2(1);
+}
what is the difference between the above two helpers?
pt_contig_count_lg2() is currently not implemented by any driver, so it
always has the default version returning 0, and it is only used by the
default pt_possible_sizes(), which then returns only one page size
accordingly.
I kind of think it's useless and we could simply move pt_possible_sizes()
here and simplify it to return only one size explicitly, assuming a format
should implement both pt_entry_num_contig_lg2() and pt_possible_sizes().
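i.e. something like (untested; guessing at the exact signature):

	/* Without contiguous page support only the base granule is possible */
	static inline pt_vaddr_t pt_possible_sizes(const struct pt_state *pts)
	{
		return log2_to_int(pt_table_item_lg2sz(pts));
	}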
+#ifndef pt_pgsz_lg2_to_level
+static inline unsigned int pt_pgsz_lg2_to_level(struct pt_common *common,
unsigned int pgsize_lg2)
+{
- return (pgsize_lg2 - PT_GRANULE_LG2SZ) /
(PT_TABLEMEM_LG2SZ - ilog2(PT_ITEM_WORD_SIZE));
- return 0;
+}
+#endif
remove the 2nd 'return'
+/* If not supplied by the format then dirty tracking is not supported */
+#ifndef pt_entry_write_is_dirty
+static inline bool pt_entry_write_is_dirty(const struct pt_state *pts)
+{
- return false;
+}
+static inline void pt_entry_set_write_clean(struct pt_state *pts)
+{
+}
+static inline bool pt_dirty_supported(struct pt_common *common)
+{
- return true;
should return false here.
+/*
- Format supplies either:
- pt_entry_oa - OA is at the start of a contiguous entry
- or
- pt_item_oa - OA is correct for every item in a contiguous entry
what is the meaning of 'correct'?
- Build the missing one
- */
+#ifdef pt_entry_oa
+static inline pt_oaddr_t pt_item_oa(const struct pt_state *pts)
+{
- return pt_entry_oa(pts) |
log2_mul(pts->index, pt_table_item_lg2sz(pts));
+}
+#define _pt_entry_oa_fast pt_entry_oa
+#endif
+#ifdef pt_item_oa
+static inline pt_oaddr_t pt_entry_oa(const struct pt_state *pts)
+{
- return log2_set_mod(pt_item_oa(pts), 0,
pt_entry_num_contig_lg2(pts) +
pt_table_item_lg2sz(pts));
+}
+#define _pt_entry_oa_fast pt_item_oa
+#endif
I have a problem understanding _pt_entry_oa_fast() here.
Obviously pt_entry_oa()/pt_item_oa() generate different OAs for a given pts, based on the aligned size. Why is it OK to alias a common macro to either of them? It looks like the assumption is that the caller doesn't care about the offset within the entry range, e.g. it will do its own masking. Some comment clarifying this would be welcome.
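e.g. something like:

	/*
	 * _pt_entry_oa_fast() may return either the entry OA or the item
	 * OA; callers must not rely on the low bits within the entry's
	 * translation size and should mask as needed.
	 */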
+/*
- If not supplied by the format then use the constant
- PT_MAX_OUTPUT_ADDRESS_LG2.
- */
+#ifndef pt_max_output_address_lg2
+static inline unsigned int
+pt_max_output_address_lg2(const struct pt_common *common)
+{
- return PT_MAX_OUTPUT_ADDRESS_LG2;
+}
+#endif
+#ifndef pt_has_system_page
+static inline bool pt_has_system_page(const struct pt_common *common)
+{
- return PT_GRANULE_LG2SZ == PAGE_SHIFT;
+}
+#endif
will there be an implementation supporting the system page size while breaking the above check? if not it could be moved to pt_common.h
+/**
- pt_item_fully_covered() - Check if the item or entry is entirely contained
within pts->range
when it takes a pts, it's more accurate to call it pt_entry_fully_covered()
- The system is divided into three logical levels:
- The page table format and its manipulation functions
- Generic helpers to give a consistent API regardless of underlying format
- An algorithm implementation (e.g. IOMMU/DRM/KVM/MM)
- Multiple implementations are supported. The intention is to have the
- generic format code be re-usable for whatever specalized implementation is
s/specalized/specialized/
required.
- The generic code is solely about the format of the radix tree; it does not
- include memory allocation or higher level decisions that are left for the
- implementation.
- The generic framework supports a superset of functions across many HW
- implementations:
- Entries comprised of contiguous blocks of IO PTEs for larger page sizes
- Multi-level tables, up to 6 levels. Runtime selected top level
- Runtime variable table level size (ARM's concatenated tables)
- Expandable top level allowing dynamic sizing of table levels
- Optional leaf entries at any level
- 32-bit/64-bit virtual and output addresses, using every address bit
- Dirty tracking
- Sign extended addressing
and any more context about "Sign extended addressing" too?
- /**
* @PT_FEAT_FLUSH_RANGE: IOTLB maintenance is done by flushing IOVA
* ranges which will clean out any walk cache or any IOPTE fully
* contained by the range. The optimization objective is to minimize
* the number of flushes even if ranges include IOVA gaps that do not
* need to be flushed.
*/
- PT_FEAT_FLUSH_RANGE,
- /**
* @PT_FEAT_FLUSH_RANGE_NO_GAPS: Like PT_FEAT_FLUSH_RANGE except
* that the optimization objective is to only flush IOVA that has been
* changed. This mode is suitable for cases like hypervisor shadowing
* where flushing unchanged ranges may cause the hypervisor to reparse
* significant amount of page table.
*/
- PT_FEAT_FLUSH_RANGE_NO_GAPS,
FLUSH_RANGE and FLUSH_RANGE_NO_GAPS are mutually exclusive, but must a format select one of the two? If so we could keep just one flag (NO_GAPS), with the feature off meaning FLUSH_RANGE.
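i.e. a sketch of what I mean, reusing the existing kdoc wording:

	/**
	 * @PT_FEAT_FLUSH_RANGE_NO_GAPS: Only flush IOVA that has been
	 * changed. When this feature is off, ranges are flushed to
	 * minimize the number of flushes even if they include IOVA gaps
	 * that do not need to be flushed (today's PT_FEAT_FLUSH_RANGE
	 * behavior).
	 */
	PT_FEAT_FLUSH_RANGE_NO_GAPS,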