July 2023 - Linux-kselftest-mirror

[PATCH v24 0/5] Implement IOCTL to get and optionally clear info about PTEs

by Muhammad Usama Anjum

*Changes in v24*:
- Rebase on top of next-20230710
- Place WP markers in case of hole as well

*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity

*Changes in v22*:
- Interface change:
  - Replace [start, start + len) with [start, end)
  - Return the ending address of the address walk in start

*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on
  partial hugetlb

*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO

*Changes in v19*
- Minor changes and interface updates

*Changes in v18*
- Rebase on top of next-20230613
- Minor updates

*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch

*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back

*Changes in v15*
- Build fix (Add missed build fix in RESEND)

*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs

*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
  once

*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates

*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
  UFFDIO_WRITEPROTECT

*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
  async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation

*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code

*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation

*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
  flags

*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
Windows GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the
addresses of the pages that are written to in a region of virtual memory.

This syscall is used in Windows applications and games etc. It is
currently emulated in a pretty slow manner in userspace. Our purpose is to
enhance the kernel so that it can be translated efficiently. Currently
some out-of-tree hack patches are being used to emulate it efficiently in
some kernels. We intend to replace those with these patches, so that the
whole of gaming on Linux can benefit from this. It means there would be
tons of users of this code.

CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.

Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
  effectively. The current interface can clear dirty bits for the entire
  process only. In addition, reading info about pages is a separate
  operation. It means we must freeze the process to read information
  about all its pages, reset dirty bits, and only then can we start
  dumping pages. The information about pages becomes more and more
  outdated while we are processing pages. The new interface solves both
  these downsides. First, it allows us to read pte bits and clear the
  soft-dirty bit atomically. It means that CRIU will not need to freeze
  processes to pre-dump their memory. Second, it clears soft-dirty bits
  for a specified region of memory. It means CRIU will have actual info
  about pages up to the moment of dumping them.
* The new interface has to be much faster because basic page filtering is
  happening in the kernel. With the old interface, we have to read
  pagemap for each page.

*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we felt that the kernel's
soft-dirty feature could be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
  clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically

So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But when using the soft-dirty flag, we sometimes get
extra pages which weren't even written. They had become soft-dirty because
of VMA merging and the VM_SOFTDIRTY flag. This breaks the definition of
GetWriteWatch(). We were able to bypass this shortcoming by ignoring
VM_SOFTDIRTY until David reported that mprotect etc. messes up the
soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
until [6] got introduced. We discussed whether we could revert these
patches, but we could not reach any conclusion. So at this point, I made
a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
  cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and
  merges. The reply I got was: don't increase the size of the VMA by
  8 bytes.

At this point, we left soft-dirty behind, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used.
It was straightforward to add the WP_ASYNC feature to userfaultfd. Now
only those pages which have really been written to are reported as dirty
or written-to.
(PS: another userfaultfd feature, WP_UNPOPULATED, is also required; it is
needed to avoid pre-faulting memory before write-protecting [9].)

All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com

*Original Cover letter from v8*
Hello,

Note: Soft-dirty pages and pages which have been written-to are synonyms.
As the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.

This IOCTL, PAGEMAP_SCAN on the pagemap file, can be used to get and/or
clear the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to
  (PAGE_IS_WRITTEN), file mapped (PAGE_IS_FILE), present
  (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
  pages have been written-to.
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping

Some benchmarks can be seen here[1]. This series adds features that
weren't present earlier:
- There is no atomic get soft-dirty/written-to status and clear present
  in the kernel.
- The pages which have been written-to cannot be found in an accurate
  way. (Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more
  soft-dirty pages than there actually are.)

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this tracking and clear mechanism
for a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.

*(Moved to using UFFD instead of the soft-dirty feature to find pages
which have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been written
to. It is too delicate and wrong, as it shows more soft-dirty pages than
the actual soft-dirty pages. There is no interest in correcting it [2][3]
as this is how the feature was written years ago, and its behaviour
shouldn't be changed now. Peter Xu has suggested using the async version
of the UFFD WP [4] as it is based inherently on the PTEs.

So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the UFFD
WP is used, the page faults are resolved automatically by the kernel. The
pages which have been written-to can be found by reading the pagemap file
(!PM_UFFD_WP). This feature can be used successfully to find which pages
have been written to from the time the pages were write protected.
This works just like the soft-dirty flag without showing any extra pages
which aren't soft-dirty in reality.

The information about whether a page is file mapped, present or swapped is
required for the CRIU project [5][6]. The addition of the required mask,
any mask, excluded mask and return mask is also required for the CRIU
project [5].

The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only
wants to get a specific number of pages, so there is no need to find all
the pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal or greater than the
vec_size. This restriction is needed to handle the worst case, when one
page_region only contains info of one page and it cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.

The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
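
The mask-based filtering and the compact page_region output described above can be modelled in plain userspace C. The sketch below is not the kernel implementation and does not call the ioctl. All `demo_*` names, the flag bit values and the exact mask semantics (required: all bits must be set; any-of: at least one bit set when the mask is non-zero; excluded: no bit may be set; return: bits reported back) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; the series defines the real ones. */
#define DEMO_PAGE_IS_WRITTEN (1u << 0)
#define DEMO_PAGE_IS_FILE    (1u << 1)
#define DEMO_PAGE_IS_PRESENT (1u << 2)
#define DEMO_PAGE_IS_SWAPPED (1u << 3)

/* Assumed mask semantics, modelled on the mask names in the cover letter. */
struct demo_masks {
	uint32_t required;   /* every bit here must be set on the page */
	uint32_t anyof;      /* if non-zero, at least one bit must be set */
	uint32_t excluded;   /* no bit here may be set on the page */
	uint32_t ret;        /* bits copied into the reported region */
};

/* A run of consecutive matching pages that share the same reported flags. */
struct demo_region {
	size_t first_page;
	size_t npages;
	uint32_t flags;
};

static int demo_page_matches(uint32_t flags, const struct demo_masks *m)
{
	if ((flags & m->required) != m->required)
		return 0;
	if (m->anyof && !(flags & m->anyof))
		return 0;
	if (flags & m->excluded)
		return 0;
	return 1;
}

/*
 * Walk page_flags[0..npages), append matching pages to vec[] and merge
 * neighbours with identical reported flags. Stops when vec[] is full or,
 * if max_pages is non-zero, once max_pages pages were reported.
 * Returns the number of regions written.
 */
static size_t demo_scan(const uint32_t *page_flags, size_t npages,
			const struct demo_masks *m,
			struct demo_region *vec, size_t vec_len,
			size_t max_pages)
{
	size_t nregions = 0, found = 0;

	assert(page_flags && m && vec);

	for (size_t i = 0; i < npages; i++) {
		uint32_t out;
		struct demo_region *last;

		if (max_pages && found == max_pages)
			break;
		if (!demo_page_matches(page_flags[i], m))
			continue;

		out = page_flags[i] & m->ret;
		last = nregions ? &vec[nregions - 1] : NULL;
		if (last && last->first_page + last->npages == i &&
		    last->flags == out) {
			last->npages++;          /* extend the current run */
		} else {
			if (nregions == vec_len)
				break;           /* no room for a new region */
			vec[nregions].first_page = i;
			vec[nregions].npages = 1;
			vec[nregions].flags = out;
			nregions++;
		}
		found++;
	}
	return nregions;
}
```

In this model, six pages flagged {W|P, W|P, P, W|P|F, W|P, 0} scanned with required = WRITTEN and excluded = FILE yield two regions: pages 0-1 and page 4. In the worst case every matching page ends up in its own region, which is why the size of the output vector and max_pages are tied together in the text above.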
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (4):
  fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
  tools headers UAPI: Update linux/fs.h with the kernel sources
  mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
  selftests: mm: add pagemap ioctl tests

Peter Xu (1):
  userfaultfd: UFFD_FEATURE_WP_ASYNC

 Documentation/admin-guide/mm/pagemap.rst     |   58 +
 Documentation/admin-guide/mm/userfaultfd.rst |   35 +
 fs/proc/task_mmu.c                           |  583 +++++++
 fs/userfaultfd.c                             |   26 +-
 include/linux/hugetlb.h                      |    1 +
 include/linux/userfaultfd_k.h                |   21 +-
 include/uapi/linux/fs.h                      |   55 +
 include/uapi/linux/userfaultfd.h             |    9 +-
 mm/hugetlb.c                                 |   34 +-
 mm/memory.c                                  |   27 +-
 tools/include/uapi/linux/fs.h                |   55 +
 tools/testing/selftests/mm/.gitignore        |    2 +
 tools/testing/selftests/mm/Makefile          |    3 +-
 tools/testing/selftests/mm/config            |    1 +
 tools/testing/selftests/mm/pagemap_ioctl.c   | 1464 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh    |    4 +
 16 files changed, 2354 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
 mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh

--
2.39.2

[PATCH v2 00/12] tools/nolibc: shrink arch support

by Zhangjin Wu

Hi, Willy

This is v2 of the "tools/nolibc: shrink arch support" [1]. This v2 has no
core code logic change, but applies some suggestions from Willy and
Thomas: one is using post-whitespaces instead of post-tab, another is
restructuring the arch support directory and files [2].

Like musl, this v2 creates an <ARCH> directory for every arch and splits
the old arch-<ARCH>.h into <ARCH>/{crt.h, sys.h}; at the same time, it
splits the old arch.h into crt_arch.h and sys_arch.h. In the end, crt.h
only needs to include crt_arch.h and sys.h only needs to include
sys_arch.h, and the other common headers no longer need to include arch.h:

    crt.h <-- crt_arch.h <-- <ARCH>/crt.h
    sys.h <-- sys_arch.h <-- <ARCH>/sys.h

It is based on the 20230705-nolibc-series2 branch of the nolibc repo [3].
It should be applied after the v6 __sysret helper series [4] and the v4
min config support series [5].

Here is the test report for all of the supported architectures:

    arch/board            | result
    ----------------------|------------
    arm/vexpress-a9       | 142 test(s) passed, 1 skipped, 0 failed.
    arm/virt              | 142 test(s) passed, 1 skipped, 0 failed.
    aarch64/virt          | 142 test(s) passed, 1 skipped, 0 failed.
    ppc/g3beige           | not supported
    ppc/ppce500           | not supported
    i386/pc               | 142 test(s) passed, 1 skipped, 0 failed.
    x86_64/pc             | 142 test(s) passed, 1 skipped, 0 failed.
    mipsel/malta          | 142 test(s) passed, 1 skipped, 0 failed.
    loongarch64/virt      | 142 test(s) passed, 1 skipped, 0 failed.
    riscv64/virt          | 142 test(s) passed, 1 skipped, 0 failed.
    riscv32/virt          | 0 test(s) passed, 0 skipped, 0 failed.
    s390x/s390-ccw-virtio | 142 test(s) passed, 1 skipped, 0 failed.

Changes from v1 --> v2:

* tools/nolibc: rename arch-<ARCH>.h to <ARCH>/arch.h
  tools/nolibc: split arch.h to crt.h and sys.h

    Restructure the arch support directory and files.
    Fix up the errors reported by scripts/checkpatch.pl.

* tools/nolibc: sys.h: remove the old sys_stat support

    Rebase on the new arch support directory and files.
* tools/nolibc: crt.h: add _start_c

    Move #include "compiler.h" in the common crt.h too.

* tools/nolibc: arm/crt.h: shrink _start with _start_c
  tools/nolibc: aarch64/crt.h: shrink _start with _start_c
  tools/nolibc: i386/crt.h: shrink _start with _start_c
  tools/nolibc: x86_64/crt.h: shrink _start with _start_c
  tools/nolibc: mips/crt.h: shrink _start with _start_c
  tools/nolibc: loongarch/crt.h: shrink _start with _start_c
  tools/nolibc: riscv/crt.h: shrink _start with _start_c
  tools/nolibc: s390/crt.h: shrink _start with _start_c

    Rebase on the new arch support directory and files.
    Use post-whitespaces instead of post-tab.

Best regards,
Zhangjin
---
[1]: https://lore.kernel.org/lkml/cover.1687976753.git.falcon@tinylab.org/
[2]: https://lore.kernel.org/lkml/20230703145500.500460-1-falcon@tinylab.org/
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/nolibc.git
[4]: https://lore.kernel.org/lkml/cover.1688739492.git.falcon@tinylab.org/
[5]: https://lore.kernel.org/lkml/cover.1688750763.git.falcon@tinylab.org/

Zhangjin Wu (12):
  tools/nolibc: rename arch-<ARCH>.h to <ARCH>/arch.h
  tools/nolibc: split arch.h to crt.h and sys.h
  tools/nolibc: sys.h: remove the old sys_stat support
  tools/nolibc: crt.h: add _start_c
  tools/nolibc: arm/crt.h: shrink _start with _start_c
  tools/nolibc: aarch64/crt.h: shrink _start with _start_c
  tools/nolibc: i386/crt.h: shrink _start with _start_c
  tools/nolibc: x86_64/crt.h: shrink _start with _start_c
  tools/nolibc: mips/crt.h: shrink _start with _start_c
  tools/nolibc: loongarch/crt.h: shrink _start with _start_c
  tools/nolibc: riscv/crt.h: shrink _start with _start_c
  tools/nolibc: s390/crt.h: shrink _start with _start_c

 tools/include/nolibc/Makefile                 | 36 ++++---
 tools/include/nolibc/aarch64/crt.h            | 24 +++++
 .../nolibc/{arch-aarch64.h => aarch64/sys.h}  | 68 +------------
 tools/include/nolibc/arch.h                   | 36 -------
 tools/include/nolibc/arm/crt.h                | 25 +++++
 .../include/nolibc/{arch-arm.h => arm/sys.h}  | 96 +------------------
 tools/include/nolibc/crt.h                    | 60 ++++++++++++
 tools/include/nolibc/crt_arch.h               | 32 +++++++
 tools/include/nolibc/i386/crt.h               | 33 +++++++
 .../nolibc/{arch-i386.h => i386/sys.h}        | 77 +--------------
 tools/include/nolibc/loongarch/crt.h          | 30 ++++++
 .../{arch-loongarch.h => loongarch/sys.h}     | 64 +------------
 tools/include/nolibc/mips/crt.h               | 32 +++++++
 .../nolibc/{arch-mips.h => mips/sys.h}        | 87 +----------------
 tools/include/nolibc/nolibc.h                 |  2 +-
 tools/include/nolibc/riscv/crt.h              | 28 ++++++
 .../nolibc/{arch-riscv.h => riscv/sys.h}      | 83 +---------------
 tools/include/nolibc/s390/crt.h               | 21 ++++
 .../nolibc/{arch-s390.h => s390/sys.h}        | 74 +-------------
 tools/include/nolibc/signal.h                 |  1 -
 tools/include/nolibc/stdio.h                  |  1 -
 tools/include/nolibc/stdlib.h                 |  2 +-
 tools/include/nolibc/sys.h                    | 65 +++----------
 tools/include/nolibc/sys_arch.h               | 32 +++++++
 tools/include/nolibc/time.h                   |  1 -
 tools/include/nolibc/types.h                  |  4 +-
 tools/include/nolibc/unistd.h                 |  1 -
 tools/include/nolibc/x86_64/crt.h             | 33 +++++++
 .../nolibc/{arch-x86_64.h => x86_64/sys.h}    | 74 +-------------
 29 files changed, 421 insertions(+), 701 deletions(-)
 create mode 100644 tools/include/nolibc/aarch64/crt.h
 rename tools/include/nolibc/{arch-aarch64.h => aarch64/sys.h} (76%)
 delete mode 100644 tools/include/nolibc/arch.h
 create mode 100644 tools/include/nolibc/arm/crt.h
 rename tools/include/nolibc/{arch-arm.h => arm/sys.h} (74%)
 create mode 100644 tools/include/nolibc/crt.h
 create mode 100644 tools/include/nolibc/crt_arch.h
 create mode 100644 tools/include/nolibc/i386/crt.h
 rename tools/include/nolibc/{arch-i386.h => i386/sys.h} (73%)
 create mode 100644 tools/include/nolibc/loongarch/crt.h
 rename tools/include/nolibc/{arch-loongarch.h => loongarch/sys.h} (73%)
 create mode 100644 tools/include/nolibc/mips/crt.h
 rename tools/include/nolibc/{arch-mips.h => mips/sys.h} (74%)
 create mode 100644 tools/include/nolibc/riscv/crt.h
 rename tools/include/nolibc/{arch-riscv.h => riscv/sys.h} (70%)
 create mode 100644 tools/include/nolibc/s390/crt.h
 rename tools/include/nolibc/{arch-s390.h => s390/sys.h} (68%)
 create mode 100644 tools/include/nolibc/sys_arch.h
 create mode 100644 tools/include/nolibc/x86_64/crt.h
 rename tools/include/nolibc/{arch-x86_64.h => x86_64/sys.h} (76%)

--
2.25.1
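
The patch titles suggest that per-arch `_start` code shrinks because work moves into a common C function, `_start_c`. As a rough illustration of what a C-level start routine can do, the sketch below decodes argc, argv and envp from the initial process stack. It assumes the usual Linux layout (argc, then the argv pointers, a NULL, then the envp pointers, a NULL) and models the stack as an array of pointer-sized slots. `demo_decode_stack` and `struct demo_start_args` are hypothetical names, not nolibc code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for what a start routine hands to main(). */
struct demo_start_args {
	int argc;
	char **argv;
	char **envp;
};

/*
 * Decode an initial stack modelled as pointer-sized slots:
 *   sp[0]                    argc (stored as an integer in a pointer slot)
 *   sp[1] .. sp[argc]        argv[0] .. argv[argc - 1]
 *   sp[argc + 1]             NULL terminator of argv
 *   sp[argc + 2] ..          envp entries, NULL terminated
 */
static struct demo_start_args demo_decode_stack(char **sp)
{
	struct demo_start_args args;

	assert(sp != NULL);

	args.argc = (int)(uintptr_t)sp[0];
	args.argv = sp + 1;
	/* envp starts right after argv's NULL terminator. */
	args.envp = args.argv + args.argc + 1;
	return args;
}
```

With such a helper, the architecture-specific part only has to pass the initial stack pointer to C code; how the real series does this for each architecture is in the individual patches.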

[PATCH 0/4] selftests/nolibc: simplify conditions and testcases

by Thomas Weißschuh

A few cleanups to the existing test logic.

Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Thomas Weißschuh (4):
      selftests/nolibc: make evaluation of test conditions
      selftests/nolibc: simplify status printing
      selftests/nolibc: simplify status argument
      selftests/nolibc: avoid gaps in test numbers

 tools/testing/selftests/nolibc/nolibc-test.c | 201 +++++++++++----------------
 1 file changed, 85 insertions(+), 116 deletions(-)
---
base-commit: 078cda365b3f47f61047a08230925a1478e9a1c8
change-id: 20230711-nolibc-sizeof-long-gaps-0f28cba7ee4d

Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>

[PATCH bpf-next v5 0/7] Add SO_REUSEPORT support for TC bpf_sk_assign

by Lorenz Bauer

We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign.

I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.

Joint work with Daniel Borkmann.

Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v5:
- Drop reuse_sk == sk check in inet[6]_steal_sock (Kuniyuki)
- Link to v4: https://lore.kernel.org/r/20230613-so-reuseport-v4-0-4ece76708bba@isovalent…

Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent…

Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…

Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)

---
Daniel Borkmann (1):
      selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper

Lorenz Bauer (6):
      udp: re-score reuseport groups when connected sockets are present
      net: export inet_lookup_reuseport and inet6_lookup_reuseport
      net: remove duplicate reuseport_lookup functions
      net: document inet[6]_lookup_reuseport sk_state requirements
      net: remove duplicate sk_lookup helpers
      bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

 include/net/inet6_hashtables.h                     |  81 ++++++++-
 include/net/inet_hashtables.h                      |  74 +++++++-
 include/net/sock.h                                 |   7 +-
 include/uapi/linux/bpf.h                           |   3 -
 net/core/filter.c                                  |   2 -
 net/ipv4/inet_hashtables.c                         |  68 ++++---
 net/ipv4/udp.c                                     |  88 ++++-----
 net/ipv6/inet6_hashtables.c                        |  71 +++++---
 net/ipv6/udp.c                                     |  98 ++++------
 tools/include/uapi/linux/bpf.h                     |   3 -
 tools/testing/selftests/bpf/network_helpers.c      |   3 +
 .../selftests/bpf/prog_tests/assign_reuse.c        | 197 +++++++++++++++++++++
 .../selftests/bpf/progs/test_assign_reuse.c        | 142 +++++++++++++++
 13 files changed, 658 insertions(+), 179 deletions(-)
---
base-commit: c20f9cef725bc6b19efe372696e8000fb5af0d46
change-id: 20230613-so-reuseport-e92c526173ee

Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>

[PATCH] selftests/arm64: fix build failure during the "emit_tests" step

by John Hubbard

The build failure reported in [1] occurred because commit 9fc96c7c19df
("selftests: error out if kernel header files are not yet built") added
a new "kernel_header_files" dependency to "all", and that triggered
another, pre-existing problem. Specifically, the arm64 selftests
override the emit_tests target, and that override improperly declares
itself to depend upon the "all" target.

This is a problem because the "emit_tests" target in lib.mk was not
intended to be overridden. emit_tests is a very simple, sequential build
target that was originally invoked from the "install" target, which in
turn, depends upon "all".

That approach worked for years. But with 9fc96c7c19df in place,
emit_tests failed, because it does not set up all of the elaborate
things that "install" does. And that caused the new
"kernel_header_files" target (which depends upon $(KBUILD_OUTPUT) being
correct) to fail.

Some detail: The "all" target is .PHONY. Therefore, each target that
depends on "all" will cause it to be invoked again, and because
dependencies are managed quite loosely in the selftests Makefiles, many
things will run, even if "all" is invoked several times in immediate
succession. So this is not a "real" failure, as far as build steps go:
everything gets built, but "all" reports a problem when invoked a second
time from a bad environment.

To fix this, simply remove the unnecessary "all" dependency from the
overridden emit_tests target. The dependency is still effectively
honored, because again, invocation is via "install", which also depends
upon "all".

An alternative approach would be to harden the emit_tests target so that
it can depend upon "all", but that's a lot more complicated and hard to
get right, and doesn't seem worth it, especially given that emit_tests
should probably not be overridden at all.

[1] https://lore.kernel.org/20230710-kselftest-fix-arm64-v1-1-48e872844f25@kern…

Fixes: 9fc96c7c19df ("selftests: error out if kernel header files are not yet built")
Reported-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: John Hubbard <jhubbard(a)nvidia.com>
---
 tools/testing/selftests/arm64/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
index 9460cbe81bcc..ace8b67fb22d 100644
--- a/tools/testing/selftests/arm64/Makefile
+++ b/tools/testing/selftests/arm64/Makefile
@@ -42,7 +42,7 @@ run_tests: all
 	done
 
 # Avoid any output on non arm64 on emit_tests
-emit_tests: all
+emit_tests:
 	@for DIR in $(ARM64_SUBTARGETS); do \
 		BUILD_TARGET=$(OUTPUT)/$$DIR; \
 		make OUTPUT=$$BUILD_TARGET -C $$DIR $@; \

base-commit: d5fe758c21f4770763ae4c05580be239be18947d
--
2.41.0

[linux-next:master] BUILD REGRESSION 8e4b7f2f3d6071665b1dfd70786229c8a5d6c256

by kernel test robot

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: 8e4b7f2f3d6071665b1dfd70786229c8a5d6c256  Add linux-next specific files for 20230711

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202306122223.HHER4zOo-lkp@intel.com
https://lore.kernel.org/oe-kbuild-all/202306260401.qZlYQpV2-lkp@intel.com
https://lore.kernel.org/oe-kbuild-all/202307111309.401QvMTN-lkp@intel.com

Error/Warning: (recently discovered and may have been fixed)

arch/parisc/kernel/pdt.c:67:6: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
arch/s390/include/asm/io.h:29:17: error: implicit declaration of function 'iounmap'; did you mean 'vunmap'? [-Werror=implicit-function-declaration]
drivers/mfd/max77541.c:176:18: warning: cast to smaller integer type 'enum max7754x_ids' from 'const void *' [-Wvoid-pointer-to-enum-cast]
drivers/net/arcnet/arc-rimi.c:107:13: error: implicit declaration of function 'ioremap'; did you mean 'ifr_map'? [-Werror=implicit-function-declaration]
drivers/net/arcnet/com90xx.c:225:24: error: implicit declaration of function 'ioremap'; did you mean 'ifr_map'? [-Werror=implicit-function-declaration]
drivers/net/ethernet/8390/pcnet_cs.c:290:12: error: implicit declaration of function 'ioremap'; did you mean 'ifr_map'? [-Werror=implicit-function-declaration]
drivers/net/ethernet/fujitsu/fmvj18x_cs.c:549:12: error: implicit declaration of function 'ioremap'; did you mean 'iounmap'? [-Werror=implicit-function-declaration]
drivers/net/ethernet/smsc/smc91c92_cs.c:447:17: error: implicit declaration of function 'ioremap'; did you mean 'ifr_map'? [-Werror=implicit-function-declaration]
drivers/net/ethernet/xircom/xirc2ps_cs.c:843:28: error: implicit declaration of function 'ioremap'; did you mean 'iounmap'? [-Werror=implicit-function-declaration]
drivers/pcmcia/cistpl.c:103:31: error: implicit declaration of function 'ioremap'; did you mean 'iounmap'? [-Werror=implicit-function-declaration]
drivers/tty/ipwireless/main.c:115:30: error: implicit declaration of function 'ioremap'; did you mean 'iounmap'? [-Werror=implicit-function-declaration]
lib/kunit/executor_test.c:138:4: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict]
lib/kunit/test.c:775:38: warning: cast from 'void (*)(const void *)' to 'kunit_action_t *' (aka 'void (*)(void *)') converts to incompatible function type [-Wcast-function-type-strict]

Unverified Error/Warning (likely false positive, please contact us if interested):

drivers/clk/imx/clk-imx93.c:294 imx93_clocks_probe() error: uninitialized symbol 'base'.
drivers/net/ethernet/mellanox/mlx5/core/lib/devcom.c:98 mlx5_devcom_register_device() error: uninitialized symbol 'tmp_dev'.
net/wireless/scan.c:373 cfg80211_gen_new_ie() warn: potential spectre issue 'sub->data' [r]
net/wireless/scan.c:397 cfg80211_gen_new_ie() warn: possible spectre second half.  'ext_id'
{standard input}: Error: local label `"2" (instance number 9 of a fb label)' is not defined

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- arm64-randconfig-m041-20230710
|   `-- drivers-clk-imx-clk-imx93.c-imx93_clocks_probe()-error:uninitialized-symbol-base-.
|-- parisc-randconfig-r083-20230710
|   `-- arch-parisc-kernel-pdt.c:warning:no-previous-prototype-for-arch_report_meminfo
|-- s390-allmodconfig
|   |-- arch-s390-include-asm-io.h:error:implicit-declaration-of-function-iounmap
|   |-- drivers-net-arcnet-arc-rimi.c:error:implicit-declaration-of-function-ioremap
|   |-- drivers-net-arcnet-com9x.c:error:implicit-declaration-of-function-ioremap
|   |-- drivers-net-ethernet-fujitsu-fmvj18x_cs.c:error:implicit-declaration-of-function-ioremap
|   |-- drivers-net-ethernet-pcnet_cs.c:error:implicit-declaration-of-function-ioremap
|   |-- drivers-net-ethernet-smsc-smc91c92_cs.c:error:implicit-declaration-of-function-ioremap
|   |-- drivers-net-ethernet-xircom-xirc2ps_cs.c:error:implicit-declaration-of-function-ioremap
|   |-- drivers-pcmcia-cistpl.c:error:implicit-declaration-of-function-ioremap
|   `-- drivers-tty-ipwireless-main.c:error:implicit-declaration-of-function-ioremap
|-- sh-allmodconfig
|   `-- standard-input:Error:local-label-(instance-number-of-a-fb-label)-is-not-defined
`-- x86_64-randconfig-m001-20230710
    |-- drivers-net-ethernet-mellanox-mlx5-core-lib-devcom.c-mlx5_devcom_register_device()-error:uninitialized-symbol-tmp_dev-.
    |-- net-wireless-scan.c-cfg80211_gen_new_ie()-warn:possible-spectre-second-half.-ext_id
    `-- net-wireless-scan.c-cfg80211_gen_new_ie()-warn:potential-spectre-issue-sub-data-r
clang_recent_errors
|-- arm-randconfig-r001-20230710
|   |-- lib-kunit-executor_test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- arm64-randconfig-r013-20230710
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- arm64-randconfig-r024-20230710
|   |-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void
|   |-- lib-kunit-executor_test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- hexagon-randconfig-r041-20230710
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- hexagon-randconfig-r045-20230710
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|-- riscv-randconfig-r042-20230710
|   |-- lib-kunit-executor_test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
|   `-- lib-kunit-test.c:warning:cast-from-void-(-)(const-void-)-to-kunit_action_t-(aka-void-(-)(void-)-)-converts-to-incompatible-function-type
`-- x86_64-buildonly-randconfig-r002-20230711
    `-- drivers-mfd-max77541.c:warning:cast-to-smaller-integer-type-enum-max7754x_ids-from-const-void

elapsed time: 720m

configs tested: 140
configs skipped: 4

tested configs:
alpha        allyesconfig                          gcc
alpha        defconfig                             gcc
alpha        randconfig-r004-20230710              gcc
alpha        randconfig-r005-20230710              gcc
alpha        randconfig-r034-20230710              gcc
arc          alldefconfig                          gcc
arc          allyesconfig                          gcc
arc          axs103_defconfig                      gcc
arc          defconfig                             gcc
arc          randconfig-r043-20230710              gcc
arm          allmodconfig                          gcc
arm          allyesconfig                          gcc
arm          aspeed_g4_defconfig                   clang
arm          defconfig                             gcc
arm          dove_defconfig                        clang
arm          lpc18xx_defconfig                     gcc
arm          mvebu_v7_defconfig                    gcc
arm          netwinder_defconfig                   clang
arm          omap2plus_defconfig                   gcc
arm          randconfig-r001-20230710              clang
arm          randconfig-r026-20230710              gcc
arm          randconfig-r046-20230710              gcc
arm          sama5_defconfig                       gcc
arm          spear3xx_defconfig                    clang
arm          stm32_defconfig                       gcc
arm          versatile_defconfig                   clang
arm64        allyesconfig                          gcc
arm64        defconfig                             gcc
arm64        randconfig-r013-20230710              clang
arm64        randconfig-r024-20230710              clang
csky         defconfig                             gcc
csky         randconfig-r006-20230710              gcc
csky         randconfig-r016-20230710              gcc
csky         randconfig-r036-20230710              gcc
hexagon      defconfig                             clang
hexagon      randconfig-r041-20230710              clang
hexagon      randconfig-r045-20230710              clang
i386         allyesconfig                          gcc
i386         buildonly-randconfig-r004-20230711    clang
i386         buildonly-randconfig-r005-20230711    clang
i386         buildonly-randconfig-r006-20230711    clang
i386         debian-10.3                           gcc
i386         defconfig                             gcc
i386         randconfig-i001-20230710              gcc
i386         randconfig-i002-20230710              gcc
i386         randconfig-i003-20230710              gcc
i386         randconfig-i004-20230710              gcc
i386         randconfig-i005-20230710              gcc
i386         randconfig-i006-20230710              gcc
i386         randconfig-i011-20230710              clang
i386         randconfig-i012-20230710              clang
i386         randconfig-i013-20230710              clang
i386         randconfig-i014-20230710              clang
i386         randconfig-i015-20230710              clang
i386         randconfig-i016-20230710              clang
i386         randconfig-r011-20230710              clang
loongarch    allmodconfig                          gcc
loongarch    allnoconfig                           gcc
loongarch    defconfig                             gcc
m68k         allmodconfig                          gcc
m68k         allyesconfig                          gcc
m68k         amcore_defconfig                      gcc
m68k         defconfig                             gcc
m68k         m5307c3_defconfig                     gcc
m68k         randconfig-r021-20230710              gcc
m68k         stmark2_defconfig                     gcc
m68k         virt_defconfig                        gcc
microblaze   mmu_defconfig                         gcc
mips         allmodconfig                          gcc
mips         allyesconfig                          gcc
mips         cobalt_defconfig                      gcc
mips         maltaup_defconfig                     clang
nios2        defconfig                             gcc
parisc       allyesconfig                          gcc
parisc       defconfig                             gcc
parisc       randconfig-r015-20230710              gcc
parisc64     defconfig                             gcc
powerpc      allmodconfig                          gcc
powerpc      allnoconfig                           gcc
powerpc      asp8347_defconfig                     gcc
powerpc      linkstation_defconfig                 gcc
powerpc      mvme5100_defconfig                    clang
powerpc      ppc64_defconfig                       gcc
powerpc      randconfig-r035-20230710              gcc
powerpc      tqm8560_defconfig                     clang
powerpc      walnut_defconfig                      clang
riscv        allmodconfig                          gcc
riscv        allnoconfig                           gcc
riscv        allyesconfig                          gcc
riscv        defconfig                             gcc
riscv        randconfig-r002-20230710              gcc
riscv        randconfig-r031-20230710              gcc
riscv        randconfig-r032-20230710              gcc
riscv        randconfig-r042-20230710              clang
riscv        rv32_defconfig                        gcc
s390         alldefconfig                          clang
s390         allmodconfig                          gcc
s390         allyesconfig                          gcc
s390         defconfig                             gcc
s390         randconfig-r044-20230710              clang
sh           allmodconfig                          gcc
sh           j2_defconfig                          gcc
sh           migor_defconfig                       gcc
sh           rts7751r2dplus_defconfig              gcc
sh           se7750_defconfig                      gcc
sparc        allyesconfig                          gcc
sparc        defconfig                             gcc
sparc64      randconfig-r022-20230710              gcc
um           allmodconfig                          clang
um           allnoconfig                           clang
um           allyesconfig                          clang
um           defconfig                             gcc
um           i386_defconfig                        gcc
um           randconfig-r023-20230710              gcc
um           x86_64_defconfig                      gcc
x86_64       allyesconfig                          gcc
x86_64       buildonly-randconfig-r001-20230711    clang
x86_64       buildonly-randconfig-r002-20230711    clang
x86_64       buildonly-randconfig-r003-20230711    clang
x86_64       defconfig                             gcc
x86_64       kexec                                 gcc
x86_64       randconfig-r025-20230710              clang
x86_64       randconfig-r033-20230710              gcc
x86_64       randconfig-x001-20230710              clang
x86_64       randconfig-x002-20230710              clang
x86_64       randconfig-x003-20230710              clang
x86_64       randconfig-x004-20230710              clang
x86_64       randconfig-x005-20230710              clang
x86_64       randconfig-x006-20230710              clang
x86_64       randconfig-x011-20230710              gcc
x86_64       randconfig-x012-20230710              gcc
x86_64       randconfig-x013-20230710              gcc
x86_64       randconfig-x014-20230710              gcc
x86_64       randconfig-x015-20230710              gcc
x86_64       randconfig-x016-20230710              gcc
x86_64       rhel-8.3-rust                         clang
x86_64       rhel-8.3                              gcc
xtensa       alldefconfig                          gcc
xtensa       generic_kc705_defconfig               gcc
xtensa       randconfig-r012-20230710              gcc

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Re: [PATCH v3 03/10] eventfs: adding eventfs dir add functions

by Ajay Kaher

> On 10-Jul-2023, at 7:24 AM, Steven Rostedt <rostedt(a)goodmis.org> wrote:
>
> On Mon, 3 Jul 2023 15:52:26 -0400
> Steven Rostedt <rostedt(a)goodmis.org> wrote:
>
>> On Mon, 3 Jul 2023 18:51:22 +0000
>> Ajay Kaher <akaher(a)vmware.com> wrote:
>>
>>>>
>>>> We can also look to see if we can implement this with RCU. What exactly
>>>> is this rwsem protecting?
>>>>
>>>
>>> - struct eventfs_file holds the meta-data for file or dir.
>>> https://github.com/intel-lab-lkp/linux/blob/dfe0dc15a73261ed83cdc728e43f4b3…
>>> - eventfs_rwsem is supposed to protect the 'link-list which is made of struct eventfs_file'
>>> and elements of struct eventfs_file.
>>
>> RCU is usually the perfect solution for protecting link lists though. I'll
>> take a look at this when I get back to work.
>>
>
> So I did the below patch on top of this series. If you could fold this
> into the appropriate patches, it should get us closer to an acceptable
> solution.
>
> What I did was:
>
> 1. Moved the struct eventfs_file and eventfs_inode into event_inode.c as it
> really should not be exposed to all users.
>
> 2. Added a recursion check to eventfs_remove_rec() as it is really
> dangerous to have unchecked recursion in the kernel (we do have a fixed
> size stack).
>
> 3. Removed all the eventfs_rwsem code and replaced it with an srcu lock for
> the readers, and a mutex to synchronize the writers of the list.
>
> 4. Added a eventfs_mutex that is used for the modifications of the
> dentry itself (as well as modifying the list from 3 above).
>
> 5. Have the free use srcu callbacks. After the srcu grace periods are done,
> it adds the eventfs_file onto a llist (lockless link list) and wakes up a
> work queue. Then the work queue does the freeing (this needs to be done in
> task/workqueue context, as srcu callbacks are done in softirq context).
>
> This appears to pass through some of my instance stress tests as well as
> the in tree ftrace selftests.
>

Awesome :)

I have manually applied the patches and the ftracetest results are the same as v3. No more complaints from lockdep. I will merge this into the appropriate patches of v3 and send v4 soon.

You have renamed eventfs_create_dir() to create_dir() and kept eventfs_create_dir() as just a wrapper that takes the lock; same for eventfs_create_file(). However, these wrappers are not used anywhere, so I will drop them.

I was trying to have an independent lock for each instance of events, since a common lock for every instance of events is not a must.

Something was broken in your mail (I guess the cc list) and it either couldn't reach lkml or was ignored by lkml. I just wanted to track the auto test results from linux-kselftest.

-Ajay

> > --- > fs/tracefs/event_inode.c | 333 ++++++++++++++++++++++---------------------- > include/linux/tracefs.h | 26 --- > kernel/trace/trace.h | 1 > kernel/trace/trace_events.c | 6 > 4 files changed, 179 insertions(+), 187 deletions(-) > > Index: linux-trace.git/fs/tracefs/event_inode.c > =================================================================== > --- linux-trace.git.orig/fs/tracefs/event_inode.c 2023-07-07 22:04:44.490812310 -0400 > +++ linux-trace.git/fs/tracefs/event_inode.c 2023-07-09 21:48:28.162874719 -0400 > @@ -16,71 +16,69 @@ > #include <linux/fsnotify.h> > #include <linux/fs.h> > #include <linux/namei.h> > +#include <linux/workqueue.h> > #include <linux/security.h> > #include <linux/tracefs.h> > #include <linux/kref.h> > #include <linux/delay.h> > #include "internal.h" > > -/** > - * eventfs_dentry_to_rwsem - Return corresponding eventfs_rwsem > - * @dentry: a pointer to dentry > - * > - * helper function to return crossponding eventfs_rwsem for given dentry > - */ > -static struct rw_semaphore *eventfs_dentry_to_rwsem(struct dentry *dentry) > -{ > - if (S_ISDIR(dentry->d_inode->i_mode)) > - return (struct rw_semaphore *)dentry->d_inode->i_private; > - else > - return (struct rw_semaphore *)dentry->d_parent->d_inode->i_private; > -} > +struct eventfs_inode { > + 
struct list_head e_top_files; > +}; > > -/** > - * eventfs_down_read - acquire read lock function > - * @eventfs_rwsem: a pointer to rw_semaphore > - * > - * helper function to perform read lock. Nested locking requires because > - * lookup(), release() requires read lock, these could be called directly > - * or from open(), remove() which already hold the read/write lock. > - */ > -static void eventfs_down_read(struct rw_semaphore *eventfs_rwsem) > -{ > - down_read_nested(eventfs_rwsem, SINGLE_DEPTH_NESTING); > -} > +struct eventfs_file { > + const char *name; > + struct dentry *d_parent; > + struct dentry *dentry; > + struct list_head list; > + struct eventfs_inode *ei; > + const struct file_operations *fop; > + const struct inode_operations *iop; > + union { > + struct rcu_head rcu; > + struct llist_node llist; /* For freeing after RCU */ > + }; > + void *data; > + umode_t mode; > + bool created; > +}; > > -/** > - * eventfs_up_read - release read lock function > - * @eventfs_rwsem: a pointer to rw_semaphore > - * > - * helper function to release eventfs_rwsem lock if locked > - */ > -static void eventfs_up_read(struct rw_semaphore *eventfs_rwsem) > -{ > - up_read(eventfs_rwsem); > -} > +static DEFINE_MUTEX(eventfs_mutex); > +DEFINE_STATIC_SRCU(eventfs_srcu); > > -/** > - * eventfs_down_write - acquire write lock function > - * @eventfs_rwsem: a pointer to rw_semaphore > - * > - * helper function to perform write lock on eventfs_rwsem > - */ > -static void eventfs_down_write(struct rw_semaphore *eventfs_rwsem) > +static struct dentry *create_file(const char *name, umode_t mode, > + struct dentry *parent, void *data, > + const struct file_operations *fop) > { > - while (!down_write_trylock(eventfs_rwsem)) > - msleep(10); > -} > + struct tracefs_inode *ti; > + struct dentry *dentry; > + struct inode *inode; > > -/** > - * eventfs_up_write - release write lock function > - * @eventfs_rwsem: a pointer to rw_semaphore > - * > - * helper function to perform write lock 
on eventfs_rwsem > - */ > -static void eventfs_up_write(struct rw_semaphore *eventfs_rwsem) > -{ > - up_write(eventfs_rwsem); > + if (!(mode & S_IFMT)) > + mode |= S_IFREG; > + > + if (WARN_ON_ONCE(!S_ISREG(mode))) > + return NULL; > + > + dentry = eventfs_start_creating(name, parent); > + > + if (IS_ERR(dentry)) > + return dentry; > + > + inode = tracefs_get_inode(dentry->d_sb); > + if (unlikely(!inode)) > + return eventfs_failed_creating(dentry); > + > + inode->i_mode = mode; > + inode->i_fop = fop; > + inode->i_private = data; > + > + ti = get_tracefs(inode); > + ti->flags |= TRACEFS_EVENT_INODE; > + d_instantiate(dentry, inode); > + fsnotify_create(dentry->d_parent->d_inode, dentry); > + return eventfs_end_creating(dentry); > } > > /** > @@ -111,21 +109,30 @@ static struct dentry *eventfs_create_fil > struct dentry *parent, void *data, > const struct file_operations *fop) > { > - struct tracefs_inode *ti; > struct dentry *dentry; > - struct inode *inode; > > if (security_locked_down(LOCKDOWN_TRACEFS)) > return NULL; > > - if (!(mode & S_IFMT)) > - mode |= S_IFREG; > + mutex_lock(&eventfs_mutex); > + dentry = create_file(name, mode, parent, data, fop); > + mutex_unlock(&eventfs_mutex); > > - if (WARN_ON_ONCE(!S_ISREG(mode))) > - return NULL; > + return dentry; > +} > > - dentry = eventfs_start_creating(name, parent); > +static struct dentry *create_dir(const char *name, umode_t mode, > + struct dentry *parent, void *data, > + const struct file_operations *fop, > + const struct inode_operations *iop) > +{ > + struct tracefs_inode *ti; > + struct dentry *dentry; > + struct inode *inode; > > + WARN_ON(!S_ISDIR(mode)); > + > + dentry = eventfs_start_creating(name, parent); > if (IS_ERR(dentry)) > return dentry; > > @@ -134,13 +141,17 @@ static struct dentry *eventfs_create_fil > return eventfs_failed_creating(dentry); > > inode->i_mode = mode; > + inode->i_op = iop; > inode->i_fop = fop; > inode->i_private = data; > > ti = get_tracefs(inode); > ti->flags |= 
TRACEFS_EVENT_INODE; > + > + inc_nlink(inode); > d_instantiate(dentry, inode); > - fsnotify_create(dentry->d_parent->d_inode, dentry); > + inc_nlink(dentry->d_parent->d_inode); > + fsnotify_mkdir(dentry->d_parent->d_inode, dentry); > return eventfs_end_creating(dentry); > } > > @@ -175,37 +186,18 @@ static struct dentry *eventfs_create_dir > const struct file_operations *fop, > const struct inode_operations *iop) > { > - struct tracefs_inode *ti; > struct dentry *dentry; > - struct inode *inode; > > if (security_locked_down(LOCKDOWN_TRACEFS)) > return NULL; > > WARN_ON(!S_ISDIR(mode)); > > - dentry = eventfs_start_creating(name, parent); > - > - if (IS_ERR(dentry)) > - return dentry; > - > - inode = tracefs_get_inode(dentry->d_sb); > - if (unlikely(!inode)) > - return eventfs_failed_creating(dentry); > + mutex_lock(&eventfs_mutex); > + dentry = create_dir(name, mode, parent, data, fop, iop); > + mutex_unlock(&eventfs_mutex); > > - inode->i_mode = mode; > - inode->i_op = iop; > - inode->i_fop = fop; > - inode->i_private = data; > - > - ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > - > - inc_nlink(inode); > - d_instantiate(dentry, inode); > - inc_nlink(dentry->d_parent->d_inode); > - fsnotify_mkdir(dentry->d_parent->d_inode, dentry); > - return eventfs_end_creating(dentry); > + return dentry; > } > > /** > @@ -241,13 +233,14 @@ static void eventfs_post_create_dir(stru > { > struct eventfs_file *ef_child; > struct tracefs_inode *ti; > + int idx; > > - eventfs_down_read((struct rw_semaphore *) ef->data); > + /* srcu lock already held */ > /* fill parent-child relation */ > - list_for_each_entry(ef_child, &ef->ei->e_top_files, list) { > + list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list, > + srcu_read_lock_held(&eventfs_srcu)) { > ef_child->d_parent = ef->dentry; > } > - eventfs_up_read((struct rw_semaphore *) ef->data); > > ti = get_tracefs(ef->dentry->d_inode); > ti->private = ef->ei; > @@ -271,40 +264,43 @@ static struct dentry 
*eventfs_root_looku > struct eventfs_inode *ei; > struct eventfs_file *ef; > struct dentry *ret = NULL; > - struct rw_semaphore *eventfs_rwsem; > + int idx; > > ti = get_tracefs(dir); > if (!(ti->flags & TRACEFS_EVENT_INODE)) > return NULL; > > ei = ti->private; > - eventfs_rwsem = (struct rw_semaphore *) dir->i_private; > - eventfs_down_read(eventfs_rwsem); > - list_for_each_entry(ef, &ei->e_top_files, list) { > + idx = srcu_read_lock(&eventfs_srcu); > + list_for_each_entry_srcu(ef, &ei->e_top_files, list, > + srcu_read_lock_held(&eventfs_srcu)) { > if (strcmp(ef->name, dentry->d_name.name)) > continue; > ret = simple_lookup(dir, dentry, flags); > if (ef->created) > continue; > + mutex_lock(&eventfs_mutex); > ef->created = true; > if (ef->ei) > - ef->dentry = eventfs_create_dir(ef->name, ef->mode, ef->d_parent, > - ef->data, ef->fop, ef->iop); > + ef->dentry = create_dir(ef->name, ef->mode, ef->d_parent, > + ef->data, ef->fop, ef->iop); > else > - ef->dentry = eventfs_create_file(ef->name, ef->mode, ef->d_parent, > - ef->data, ef->fop); > + ef->dentry = create_file(ef->name, ef->mode, ef->d_parent, > + ef->data, ef->fop); > > if (IS_ERR_OR_NULL(ef->dentry)) { > ef->created = false; > + mutex_unlock(&eventfs_mutex); > } else { > if (ef->ei) > eventfs_post_create_dir(ef); > ef->dentry->d_fsdata = ef; > + mutex_unlock(&eventfs_mutex); > dput(ef->dentry); > } > break; > } > - eventfs_up_read(eventfs_rwsem); > + srcu_read_unlock(&eventfs_srcu, idx); > return ret; > } > > @@ -318,21 +314,20 @@ static int eventfs_release(struct inode > struct tracefs_inode *ti; > struct eventfs_inode *ei; > struct eventfs_file *ef; > - struct dentry *dentry = file_dentry(file); > - struct rw_semaphore *eventfs_rwsem; > + int idx; > > ti = get_tracefs(inode); > if (!(ti->flags & TRACEFS_EVENT_INODE)) > return -EINVAL; > > ei = ti->private; > - eventfs_rwsem = eventfs_dentry_to_rwsem(dentry); > - eventfs_down_read(eventfs_rwsem); > - list_for_each_entry(ef, &ei->e_top_files, list) { > + 
idx = srcu_read_lock(&eventfs_srcu); > + list_for_each_entry_srcu(ef, &ei->e_top_files, list, > + srcu_read_lock_held(&eventfs_srcu)) { > if (ef->created) > dput(ef->dentry); > } > - eventfs_up_read(eventfs_rwsem); > + srcu_read_unlock(&eventfs_srcu, idx); > return dcache_dir_close(inode, file); > } > > @@ -352,30 +347,30 @@ static int dcache_dir_open_wrapper(struc > struct eventfs_file *ef; > struct inode *f_inode = file_inode(file); > struct dentry *dentry = file_dentry(file); > - struct rw_semaphore *eventfs_rwsem; > + int idx; > > ti = get_tracefs(f_inode); > if (!(ti->flags & TRACEFS_EVENT_INODE)) > return -EINVAL; > > ei = ti->private; > - eventfs_rwsem = eventfs_dentry_to_rwsem(dentry); > - eventfs_down_read(eventfs_rwsem); > - list_for_each_entry(ef, &ei->e_top_files, list) { > + idx = srcu_read_lock(&eventfs_srcu); > + list_for_each_entry_rcu(ef, &ei->e_top_files, list) { > if (ef->created) { > dget(ef->dentry); > continue; > } > > + mutex_lock(&eventfs_mutex); > ef->created = true; > > inode_lock(dentry->d_inode); > if (ef->ei) > - ef->dentry = eventfs_create_dir(ef->name, ef->mode, dentry, > - ef->data, ef->fop, ef->iop); > + ef->dentry = create_dir(ef->name, ef->mode, dentry, > + ef->data, ef->fop, ef->iop); > else > - ef->dentry = eventfs_create_file(ef->name, ef->mode, dentry, > - ef->data, ef->fop); > + ef->dentry = create_file(ef->name, ef->mode, dentry, > + ef->data, ef->fop); > inode_unlock(dentry->d_inode); > > if (IS_ERR_OR_NULL(ef->dentry)) { > @@ -385,8 +380,9 @@ static int dcache_dir_open_wrapper(struc > eventfs_post_create_dir(ef); > ef->dentry->d_fsdata = ef; > } > + mutex_unlock(&eventfs_mutex); > } > - eventfs_up_read(eventfs_rwsem); > + srcu_read_unlock(&eventfs_srcu, idx); > return dcache_dir_open(inode, file); > } > > @@ -463,13 +459,11 @@ static struct eventfs_file *eventfs_prep > * @parent: a pointer to the parent dentry for this file. This should be a > * directory dentry if set. 
If this parameter is NULL, then the > * directory will be created in the root of the tracefs filesystem. > - * @eventfs_rwsem: a pointer to rw_semaphore > * > * This function creates the top of the trace event directory. > */ > struct dentry *eventfs_create_events_dir(const char *name, > - struct dentry *parent, > - struct rw_semaphore *eventfs_rwsem) > + struct dentry *parent) > { > struct dentry *dentry = tracefs_start_creating(name, parent); > struct eventfs_inode *ei; > @@ -489,7 +483,6 @@ struct dentry *eventfs_create_events_dir > return ERR_PTR(-ENOMEM); > } > > - init_rwsem(eventfs_rwsem); > INIT_LIST_HEAD(&ei->e_top_files); > > ti = get_tracefs(inode); > @@ -499,7 +492,6 @@ struct dentry *eventfs_create_events_dir > inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; > inode->i_op = &eventfs_root_dir_inode_operations; > inode->i_fop = &eventfs_file_operations; > - inode->i_private = eventfs_rwsem; > > /* directory inodes start off with i_nlink == 2 (for "." entry) */ > inc_nlink(inode); > @@ -513,15 +505,13 @@ struct dentry *eventfs_create_events_dir > * eventfs_add_subsystem_dir - add eventfs subsystem_dir to list to create later > * @name: a pointer to a string containing the name of the file to create. > * @parent: a pointer to the parent dentry for this dir. > - * @eventfs_rwsem: a pointer to rw_semaphore > * > * This function adds eventfs subsystem dir to list. > * And all these dirs are created on the fly when they are looked up, > * and the dentry and inodes will be removed when they are done. 
> */ > struct eventfs_file *eventfs_add_subsystem_dir(const char *name, > - struct dentry *parent, > - struct rw_semaphore *eventfs_rwsem) > + struct dentry *parent) > { > struct tracefs_inode *ti_parent; > struct eventfs_inode *ei_parent; > @@ -536,16 +526,15 @@ struct eventfs_file *eventfs_add_subsyst > ef = eventfs_prepare_ef(name, > S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO, > &eventfs_file_operations, > - &eventfs_root_dir_inode_operations, > - (void *) eventfs_rwsem); > + &eventfs_root_dir_inode_operations, NULL); > > if (IS_ERR(ef)) > return ef; > > - eventfs_down_write(eventfs_rwsem); > + mutex_lock(&eventfs_mutex); > list_add_tail(&ef->list, &ei_parent->e_top_files); > ef->d_parent = parent; > - eventfs_up_write(eventfs_rwsem); > + mutex_unlock(&eventfs_mutex); > return ef; > } > > @@ -553,15 +542,13 @@ struct eventfs_file *eventfs_add_subsyst > * eventfs_add_dir - add eventfs dir to list to create later > * @name: a pointer to a string containing the name of the file to create. > * @ef_parent: a pointer to the parent eventfs_file for this dir. > - * @eventfs_rwsem: a pointer to rw_semaphore > * > * This function adds eventfs dir to list. > * And all these dirs are created on the fly when they are looked up, > * and the dentry and inodes will be removed when they are done. 
> */ > struct eventfs_file *eventfs_add_dir(const char *name, > - struct eventfs_file *ef_parent, > - struct rw_semaphore *eventfs_rwsem) > + struct eventfs_file *ef_parent) > { > struct eventfs_file *ef; > > @@ -571,16 +558,15 @@ struct eventfs_file *eventfs_add_dir(con > ef = eventfs_prepare_ef(name, > S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO, > &eventfs_file_operations, > - &eventfs_root_dir_inode_operations, > - (void *) eventfs_rwsem); > + &eventfs_root_dir_inode_operations, NULL); > > if (IS_ERR(ef)) > return ef; > > - eventfs_down_write(eventfs_rwsem); > + mutex_lock(&eventfs_mutex); > list_add_tail(&ef->list, &ef_parent->ei->e_top_files); > ef->d_parent = ef_parent->dentry; > - eventfs_up_write(eventfs_rwsem); > + mutex_unlock(&eventfs_mutex); > return ef; > } > > @@ -608,7 +594,6 @@ int eventfs_add_top_file(const char *nam > struct tracefs_inode *ti; > struct eventfs_inode *ei; > struct eventfs_file *ef; > - struct rw_semaphore *eventfs_rwsem; > > if (!parent) > return -EINVAL; > @@ -629,11 +614,10 @@ int eventfs_add_top_file(const char *nam > if (IS_ERR(ef)) > return -ENOMEM; > > - eventfs_rwsem = (struct rw_semaphore *) parent->d_inode->i_private; > - eventfs_down_write(eventfs_rwsem); > + mutex_lock(&eventfs_mutex); > list_add_tail(&ef->list, &ei->e_top_files); > ef->d_parent = parent; > - eventfs_up_write(eventfs_rwsem); > + mutex_unlock(&eventfs_mutex); > return 0; > } > > @@ -658,7 +642,6 @@ int eventfs_add_file(const char *name, u > const struct file_operations *fop) > { > struct eventfs_file *ef; > - struct rw_semaphore *eventfs_rwsem; > > if (!ef_parent) > return -EINVAL; > @@ -670,14 +653,42 @@ int eventfs_add_file(const char *name, u > if (IS_ERR(ef)) > return -ENOMEM; > > - eventfs_rwsem = (struct rw_semaphore *) ef_parent->data; > - eventfs_down_write(eventfs_rwsem); > + mutex_lock(&eventfs_mutex); > list_add_tail(&ef->list, &ef_parent->ei->e_top_files); > ef->d_parent = ef_parent->dentry; > - eventfs_up_write(eventfs_rwsem); > + 
mutex_unlock(&eventfs_mutex); > return 0; > } > > +static LLIST_HEAD(free_list); > + > +static void eventfs_workfn(struct work_struct *work) > +{ > + struct eventfs_file *ef, *tmp; > + struct llist_node *llnode; > + > + llnode = llist_del_all(&free_list); > + llist_for_each_entry_safe(ef, tmp, llnode, llist) { > + if (ef->created && ef->dentry) > + dput(ef->dentry); > + kfree(ef->name); > + kfree(ef->ei); > + kfree(ef); > + } > +} > + > +DECLARE_WORK(eventfs_work, eventfs_workfn); > + > +static void free_ef(struct rcu_head *head) > +{ > + struct eventfs_file *ef = container_of(head, struct eventfs_file, rcu); > + > + if (!llist_add(&ef->llist, &free_list)) > + return; > + > + queue_work(system_unbound_wq, &eventfs_work); > +} > + > /** > * eventfs_remove_rec - remove eventfs dir or file from list > * @ef: a pointer to eventfs_file to be removed. > @@ -685,51 +696,51 @@ int eventfs_add_file(const char *name, u > * This function recursively remove eventfs_file which > * contains info of file or dir. > */ > -static void eventfs_remove_rec(struct eventfs_file *ef) > +static void eventfs_remove_rec(struct eventfs_file *ef, int level) > { > - struct eventfs_file *ef_child, *n; > + struct eventfs_file *ef_child; > > if (!ef) > return; > + /* > + * Check recursion depth. 
It should never be greater than 3: > + * 0 - events/ > + * 1 - events/group/ > + * 2 - events/group/event/ > + * 3 - events/group/event/file > + */ > + if (WARN_ON_ONCE(level > 3)) > + return; > > if (ef->ei) { > /* search for nested folders or files */ > - list_for_each_entry_safe(ef_child, n, &ef->ei->e_top_files, list) { > - eventfs_remove_rec(ef_child); > + list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list, > + lockdep_is_held(&eventfs_mutex)) { > + eventfs_remove_rec(ef_child, level + 1); > } > - kfree(ef->ei); > } > > - if (ef->created && ef->dentry) { > + if (ef->created && ef->dentry) > d_invalidate(ef->dentry); > - dput(ef->dentry); > - } > - list_del(&ef->list); > - kfree(ef->name); > - kfree(ef); > + > + list_del_rcu(&ef->list); > + call_srcu(&eventfs_srcu, &ef->rcu, free_ef); > } > > /** > * eventfs_remove - remove eventfs dir or file from list > * @ef: a pointer to eventfs_file to be removed. > * > - * This function acquire the eventfs_rwsem lock and call eventfs_remove_rec() > + * This function acquire the eventfs_mutex lock and calls eventfs_remove_rec() > */ > void eventfs_remove(struct eventfs_file *ef) > { > - struct rw_semaphore *eventfs_rwsem; > - > if (!ef) > return; > > - if (ef->ei) > - eventfs_rwsem = (struct rw_semaphore *) ef->data; > - else > - eventfs_rwsem = (struct rw_semaphore *) ef->d_parent->d_inode->i_private; > - > - eventfs_down_write(eventfs_rwsem); > - eventfs_remove_rec(ef); > - eventfs_up_write(eventfs_rwsem); > + mutex_lock(&eventfs_mutex); > + eventfs_remove_rec(ef, 0); > + mutex_unlock(&eventfs_mutex); > } > > /** > Index: linux-trace.git/include/linux/tracefs.h > =================================================================== > --- linux-trace.git.orig/include/linux/tracefs.h 2023-07-07 22:04:44.490812310 -0400 > +++ linux-trace.git/include/linux/tracefs.h 2023-07-07 22:04:44.486812271 -0400 > @@ -21,22 +21,7 @@ struct file_operations; > > #ifdef CONFIG_TRACING > > -struct eventfs_inode { > - struct 
list_head e_top_files; > -}; > - > -struct eventfs_file { > - const char *name; > - struct dentry *d_parent; > - struct dentry *dentry; > - struct list_head list; > - struct eventfs_inode *ei; > - const struct file_operations *fop; > - const struct inode_operations *iop; > - void *data; > - umode_t mode; > - bool created; > -}; > +struct eventfs_file; > > struct dentry *eventfs_start_creating(const char *name, struct dentry *parent); > > @@ -45,16 +30,13 @@ struct dentry *eventfs_failed_creating(s > struct dentry *eventfs_end_creating(struct dentry *dentry); > > struct dentry *eventfs_create_events_dir(const char *name, > - struct dentry *parent, > - struct rw_semaphore *eventfs_rwsem); > + struct dentry *parent); > > struct eventfs_file *eventfs_add_subsystem_dir(const char *name, > - struct dentry *parent, > - struct rw_semaphore *eventfs_rwsem); > + struct dentry *parent); > > struct eventfs_file *eventfs_add_dir(const char *name, > - struct eventfs_file *ef_parent, > - struct rw_semaphore *eventfs_rwsem); > + struct eventfs_file *ef_parent); > > int eventfs_add_file(const char *name, umode_t mode, > struct eventfs_file *ef_parent, void *data, > Index: linux-trace.git/kernel/trace/trace.h > =================================================================== > --- linux-trace.git.orig/kernel/trace/trace.h 2023-07-07 22:04:44.490812310 -0400 > +++ linux-trace.git/kernel/trace/trace.h 2023-07-07 22:04:44.486812271 -0400 > @@ -359,7 +359,6 @@ struct trace_array { > struct dentry *options; > struct dentry *percpu_dir; > struct dentry *event_dir; > - struct rw_semaphore eventfs_rwsem; > struct trace_options *topts; > struct list_head systems; > struct list_head events; > Index: linux-trace.git/kernel/trace/trace_events.c > =================================================================== > --- linux-trace.git.orig/kernel/trace/trace_events.c 2023-07-07 22:04:44.490812310 -0400 > +++ linux-trace.git/kernel/trace/trace_events.c 2023-07-07 22:04:44.486812271 -0400 > @@ 
-2337,7 +2337,7 @@ event_subsystem_dir(struct trace_array * > } else > __get_system(system); > > - dir->ef = eventfs_add_subsystem_dir(name, parent, &tr->eventfs_rwsem); > + dir->ef = eventfs_add_subsystem_dir(name, parent); > if (IS_ERR(dir->ef)) { > pr_warn("Failed to create system directory %s\n", name); > __put_system(system); > @@ -2439,7 +2439,7 @@ event_create_dir(struct dentry *parent, > return -ENOMEM; > > name = trace_event_name(call); > - file->ef = eventfs_add_dir(name, ef_subsystem, &tr->eventfs_rwsem); > + file->ef = eventfs_add_dir(name, ef_subsystem); > if (IS_ERR(file->ef)) { > pr_warn("Could not create tracefs '%s' directory\n", name); > return -1; > @@ -3647,7 +3647,7 @@ create_event_toplevel_files(struct dentr > if (!entry) > return -ENOMEM; > > - d_events = eventfs_create_events_dir("events", parent, &tr->eventfs_rwsem); > + d_events = eventfs_create_events_dir("events", parent); > if (IS_ERR(d_events)) { > pr_warn("Could not create tracefs 'events' directory\n"); > return -ENOMEM;
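The deferred-free step in point 5 of the quoted description (push the object onto a lockless list from the callback, let a worker drain it later) can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: it uses C11 atomics rather than the kernel's llist, SRCU, or workqueue APIs, and `struct pending`, `pending_add()`, `pending_drain()` and `pending_new()` are names invented for this example. The one property it tries to mirror is that the add operation reports whether the list was empty beforehand, so the caller knows when the drain worker has to be scheduled.

```c
/*
 * Userspace sketch of the "collect on a lockless list, free later" pattern.
 * All names here are hypothetical; this is not the kernel llist API.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct pending {
	struct pending *next;
	int id;
};

/* Head of the singly linked list of objects waiting to be freed. */
static _Atomic(struct pending *) free_list;

/*
 * Push one object. Returns true when the list was empty beforehand, which
 * is the caller's cue to schedule the drain worker exactly once.
 */
static bool pending_add(struct pending *p)
{
	struct pending *first = atomic_load(&free_list);

	do {
		p->next = first;
	} while (!atomic_compare_exchange_weak(&free_list, &first, p));

	return first == NULL;
}

/* Detach the whole list in one step and free every entry; returns the count. */
static int pending_drain(void)
{
	struct pending *p = atomic_exchange(&free_list, NULL);
	int freed = 0;

	while (p) {
		struct pending *next = p->next;

		free(p);
		p = next;
		freed++;
	}
	return freed;
}

/* Helper for the example: allocate an entry and queue it for freeing. */
static bool pending_new(int id)
{
	struct pending *p = malloc(sizeof(*p));

	assert(p != NULL);
	p->id = id;
	return pending_add(p);
}
```

In this sketch only the first `pending_new()` after a drain returns true, which corresponds to the "queue the work only if the list was empty" check in the patch; the actual freeing context and grace-period handling are outside what this example models.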


[PATCH v4 0/9] cgroup/cpuset: Support remote partitions

by Waiman Long

v4:
 - [v3] https://lore.kernel.org/lkml/20230627005529.1564984-1-longman@redhat.com/
 - Fix compilation problem reported by kernel test robot.

v3:
 - [v2] https://lore.kernel.org/lkml/20230531163405.2200292-1-longman@redhat.com/
 - Change the new control file from root-only "cpuset.cpus.reserve" to
   non-root "cpuset.cpus.exclusive" which lists the set of exclusive
   CPUs distributed down the hierarchy.
 - Add a patch to restrict boot-time isolated CPUs to isolated
   partitions only.
 - Update the test_cpuset_prs.sh test script and documentation
   accordingly.

This patch series introduces a new cpuset control file "cpuset.cpus.exclusive" which must be a subset of "cpuset.cpus" and the parent's "cpuset.cpus.exclusive". This control file lists the exclusive CPUs to be distributed down the hierarchy. Any one of the exclusive CPUs can only be distributed to at most one child cpuset. Unlike "cpuset.cpus", invalid input to "cpuset.cpus.exclusive" will be rejected with an error. This new control file has no effect on the behavior of the cpuset until it turns into a partition root. At that point, its effective CPUs will be set to its exclusive CPUs unless some of them are offline.

This patch series also introduces a new category of cpuset partition called remote partitions. The existing partition category, where the partition roots have to be clustered around the root cgroup in a hierarchical way, is now referred to as local partitions.

A remote partition can be formed far from the root cgroup with no partition root parent. Local partitions can be created without touching "cpuset.cpus.exclusive", as it can be set automatically when a cpuset becomes a local partition root. Properly set "cpuset.cpus.exclusive" values down the hierarchy are, however, required to create a remote partition.

Both scheduling and isolated partitions can be formed in a remote partition. A local partition can be created under a remote partition. A remote partition, however, cannot be formed under a local partition for now.

Modern container orchestration tools like Kubernetes use the cgroup hierarchy to manage different containers, and they rely on other middleware like systemd to help manage it. If a container needs to use isolated CPUs, it is hard to get those with local partitions, as that requires the administrative parent cgroup to be a partition root too, which tools like systemd may not be ready to manage.

With this patch series, we allow the creation of a remote partition far from the root. The container management tool can manage the "cpuset.cpus.exclusive" file without impacting the other cpuset files that are managed by other middleware. Of course, invalid "cpuset.cpus.exclusive" values will be rejected, and changes to "cpuset.cpus" can affect the value of "cpuset.cpus.exclusive" due to the requirement that it has to be a subset of the former control file.

Waiman Long (9):
  cgroup/cpuset: Inherit parent's load balance state in v2
  cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling
  cgroup/cpuset: Improve temporary cpumasks handling
  cgroup/cpuset: Allow suppression of sched domain rebuild in update_cpumasks_hier()
  cgroup/cpuset: Add cpuset.cpus.exclusive for v2
  cgroup/cpuset: Introduce remote partition
  cgroup/cpuset: Check partition conflict with housekeeping setup
  cgroup/cpuset: Documentation update for partition
  cgroup/cpuset: Extend test_cpuset_prs.sh to test remote partition

 Documentation/admin-guide/cgroup-v2.rst       |  100 +-
 kernel/cgroup/cpuset.c                        | 1347 ++++++++++++-----
 .../selftests/cgroup/test_cpuset_prs.sh       |  398 +++--
 3 files changed, 1291 insertions(+), 554 deletions(-)

-- 
2.31.1
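The constraints stated in the cover letter for "cpuset.cpus.exclusive" (subset of the cpuset's own "cpuset.cpus", subset of the parent's "cpuset.cpus.exclusive", and each exclusive CPU going to at most one child) can be pictured with a small bitmask model. The sketch below is an assumption-laden illustration, not the kernel code: CPUs are bits in a 64-bit word, and `struct cs_model`, `is_subset()` and `exclusive_ok()` are hypothetical names; the real checks in kernel/cgroup/cpuset.c operate on cpumasks and cover more cases.

```c
/*
 * Sketch of the subset rules for a hypothetical "exclusive" CPU mask.
 * CPUs are modelled as bits in a 64-bit word; names are illustrative only.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cs_model {
	uint64_t cpus;       /* models "cpuset.cpus" */
	uint64_t exclusive;  /* models "cpuset.cpus.exclusive" */
};

/* True when every bit set in a is also set in b. */
static bool is_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/*
 * Check a proposed exclusive mask for `child`:
 *  - it must be a subset of the child's own cpus,
 *  - it must be a subset of the parent's exclusive mask,
 *  - it must not overlap any sibling's exclusive mask.
 */
static bool exclusive_ok(const struct cs_model *parent,
			 const struct cs_model *child,
			 const struct cs_model *siblings, size_t nr_siblings,
			 uint64_t proposed)
{
	if (!is_subset(proposed, child->cpus))
		return false;
	if (!is_subset(proposed, parent->exclusive))
		return false;
	for (size_t i = 0; i < nr_siblings; i++)
		if (proposed & siblings[i].exclusive)
			return false;
	return true;
}
```

For example, with a parent exclusive mask of CPUs 0-3 and a sibling already holding CPUs 0-1, a child allowed CPUs 0-5 could take CPUs 2-3 but not CPUs 1-2 (sibling overlap) or CPUs 4-5 (outside the parent's exclusive set).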


[PATCH bpf-next v4 0/7] Add SO_REUSEPORT support for TC bpf_sk_assign

by Lorenz Bauer

We want to replace iptables TPROXY with a BPF program at TC ingress. To make this work in all cases we need to assign a SO_REUSEPORT socket to an skb, which is currently prohibited. This series adds support for such sockets to bpf_sk_assign.

I did some refactoring to cut down on the amount of duplicate code. The key to this is to use INDIRECT_CALL in the reuseport helpers. To show that this approach is not just beneficial to TC sk_assign I removed duplicate code for bpf_sk_lookup as well.

Joint work with Daniel Borkmann.

Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Changes in v4:
- WARN_ON_ONCE if reuseport socket is refcounted (Kuniyuki)
- Use inet[6]_ehashfn_t to shorten function declarations (Kuniyuki)
- Shuffle documentation patch around (Kuniyuki)
- Update commit message to explain why IPv6 needs EXPORT_SYMBOL
- Link to v3: https://lore.kernel.org/r/20230613-so-reuseport-v3-0-907b4cbb7b99@isovalent…

Changes in v3:
- Fix warning re udp_ehashfn and udp6_ehashfn (Simon)
- Return higher scoring connected UDP reuseport sockets (Kuniyuki)
- Fix ipv6 module builds
- Link to v2: https://lore.kernel.org/r/20230613-so-reuseport-v2-0-b7c69a342613@isovalent…

Changes in v2:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)

---
Daniel Borkmann (1):
  selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper

Lorenz Bauer (6):
  udp: re-score reuseport groups when connected sockets are present
  net: export inet_lookup_reuseport and inet6_lookup_reuseport
  net: remove duplicate reuseport_lookup functions
  net: document inet[6]_lookup_reuseport sk_state requirements
  net: remove duplicate sk_lookup helpers
  bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

 include/net/inet6_hashtables.h | 81 ++++++++-
 include/net/inet_hashtables.h | 74 +++++++-
 include/net/sock.h | 7 +-
 include/uapi/linux/bpf.h | 3 -
 net/core/filter.c | 2 -
 net/ipv4/inet_hashtables.c | 67 ++++---
 net/ipv4/udp.c | 88 ++++-----
 net/ipv6/inet6_hashtables.c | 70 +++++---
 net/ipv6/udp.c | 98 ++++------
 tools/include/uapi/linux/bpf.h | 3 -
 tools/testing/selftests/bpf/network_helpers.c | 3 +
 .../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
 .../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
 13 files changed, 656 insertions(+), 179 deletions(-)
---
base-commit: 970308a7b544fa1c7ee98a2721faba3765be8dd8
change-id: 20230613-so-reuseport-e92c526173ee

Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
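The cover letter names INDIRECT_CALL in the reuseport helpers as the key to sharing one lookup helper between protocols. The userspace sketch below illustrates that general idea only: a single helper takes the hash function as a pointer, and a macro compares the pointer against the expected function so the common case becomes a direct call. Every name prefixed with sketch_ or SKETCH_ is hypothetical, the hash functions are toys, and the modulo selection is a simplification; none of this is taken from the patches themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the INDIRECT_CALL idea: when the pointer equals
 * the expected function, call that function by name (a direct call);
 * otherwise fall back to the indirect call through the pointer. */
#define SKETCH_INDIRECT_CALL_1(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Hypothetical hash function type, loosely modelled on the
 * inet[6]_ehashfn_t typedef mentioned in the v4 changelog. */
typedef uint32_t (*sketch_ehashfn_t)(uint32_t laddr, uint16_t lport,
				     uint32_t faddr, uint16_t fport);

/* Toy 4-tuple hashes; real kernel hashes are keyed and far stronger. */
static inline uint32_t sketch_tcp_ehashfn(uint32_t laddr, uint16_t lport,
					  uint32_t faddr, uint16_t fport)
{
	return (laddr ^ faddr) * 31u + (((uint32_t)lport << 16) | fport);
}

static inline uint32_t sketch_udp_ehashfn(uint32_t laddr, uint16_t lport,
					  uint32_t faddr, uint16_t fport)
{
	return laddr + faddr + lport + fport;
}

/* One shared helper instead of a copy per protocol: hash the 4-tuple with
 * the caller's function and pick an index in a group of nsocks sockets. */
static inline unsigned int sketch_reuseport_select(sketch_ehashfn_t ehashfn,
						   uint32_t laddr, uint16_t lport,
						   uint32_t faddr, uint16_t fport,
						   unsigned int nsocks)
{
	uint32_t hash;

	assert(ehashfn);
	assert(nsocks > 0);

	hash = SKETCH_INDIRECT_CALL_1(ehashfn, sketch_tcp_ehashfn,
				      laddr, lport, faddr, fport);
	return hash % nsocks;
}
```

A TCP-style caller would pass sketch_tcp_ehashfn and take the direct-call branch, while a UDP-style caller passing sketch_udp_ehashfn falls through to the indirect call. Both share the same selection logic, which is the kind of de-duplication the series describes.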


[PATCH bpf-next v3 0/6] Support defragmenting IPv(4|6) packets in BPF

by Daniel Xu

=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle. The full 5-tuple of a packet is often only available in the first fragment, which makes enforcing consistent policy difficult. There are really only two stateless options, neither of which is very nice:

1. Enforce policy on first fragment and accept all subsequent fragments. This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments. This does not really work b/c some protocols may rely on fragmentation. For example, DNS may rely on oversized UDP packets for large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.

=== BPF related bits ===

Policy has traditionally been enforced from XDP/TC hooks. Both hooks run before kernel reassembly facilities. However, with the new BPF_PROG_TYPE_NETFILTER, we can rather easily hook into existing netfilter reassembly infra.

The basic idea is we bump a refcnt on the netfilter defrag module and then run the bpf prog after the defrag module runs. This allows bpf progs to transparently see full, reassembled packets. The nice thing about this is that progs don't have to carry around logic to detect fragments.

=== Changelog ===

Changes from v2:

* module_put() if ->enable() fails
* Fix CI build errors

Changes from v1:

* Drop bpf_program__attach_netfilter() patches
* static -> static const where appropriate
* Fix callback assignment order during registration
* Only request_module() if callbacks are missing
* Fix retval when modprobe fails in userspace
* Fix v6 defrag module name (nf_defrag_ipv6_hooks -> nf_defrag_ipv6)
* Simplify priority checking code
* Add warning if module doesn't assign callbacks in the future
* Take refcnt on module while defrag link is active

[0]: https://datatracker.ietf.org/doc/html/rfc8900

Daniel Xu (6):
  netfilter: defrag: Add glue hooks for enabling/disabling defrag
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  netfilter: bpf: Prevent defrag module unload while link active
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 include/linux/netfilter.h | 15 +
 include/uapi/linux/bpf.h | 5 +
 net/ipv4/netfilter/nf_defrag_ipv4.c | 17 +-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c | 11 +
 net/netfilter/core.c | 6 +
 net/netfilter/nf_bpf_link.c | 150 +++++++++-
 tools/include/uapi/linux/bpf.h | 5 +
 tools/testing/selftests/bpf/Makefile | 4 +-
 .../selftests/bpf/generate_udp_fragments.py | 90 ++++++
 .../selftests/bpf/ip_check_defrag_frags.h | 57 ++++
 tools/testing/selftests/bpf/network_helpers.c | 26 +-
 tools/testing/selftests/bpf/network_helpers.h | 3 +
 .../bpf/prog_tests/ip_check_defrag.c | 282 ++++++++++++++++++
 .../selftests/bpf/progs/ip_check_defrag.c | 104 +++++++
 14 files changed, 753 insertions(+), 22 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

--
2.41.0
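The Context section of the cover letter notes that the full 5-tuple is often only available in the first fragment. As an illustration of the per-packet logic a program would otherwise have to carry, here is a minimal userspace sketch of IPv4 fragment detection from the raw header bytes. The flag and offset masks follow the IPv4 header layout in RFC 791; all sketch_ and SKETCH_ names are hypothetical and this code is not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_IP_MF     0x2000u /* "more fragments" flag */
#define SKETCH_IP_OFFSET 0x1FFFu /* 13-bit fragment offset, in 8-byte units */

/* Read the flags/fragment-offset field (bytes 6 and 7 of the IPv4 header,
 * network byte order) and return it in host byte order. */
static inline uint16_t sketch_ipv4_frag_off(const uint8_t *hdr)
{
	assert(hdr);
	return (uint16_t)((hdr[6] << 8) | hdr[7]);
}

/* A packet is a fragment if more fragments follow or its offset is
 * non-zero. The "don't fragment" bit (0x4000) is deliberately ignored. */
static inline bool sketch_ipv4_is_fragment(uint16_t frag_off)
{
	return (frag_off & (SKETCH_IP_MF | SKETCH_IP_OFFSET)) != 0;
}

/* Only a packet whose fragment offset is zero starts with the L4 header,
 * so only there can the ports of the 5-tuple be read. */
static inline bool sketch_ipv4_has_l4_header(uint16_t frag_off)
{
	return (frag_off & SKETCH_IP_OFFSET) == 0;
}
```

A stateless policy can only match ports when sketch_ipv4_has_l4_header() is true, which is why both stateless options in the list above are unsatisfying. Running the program after reassembly sidesteps the check entirely.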

