Remove some of the outmoded "Why KUnit" rationale -- which focuses
significantly on UML -- and update the Getting Started guide to mention
running tests without the kunit_tool wrapper.
Signed-off-by: David Gow <davidgow(a)google.com>
---
This is an attempt at resolving some of the issues with the KUnit
documentation pointed out here:
https://lore.kernel.org/linux-kselftest/CABVgOSkiLi0UNijH1xTSvmsJEE5+ocCZ7n…
There's definitely room for further work on the KUnit documentation
(e.g., adding more information around the environment tests run in), but
this hopefully is better than nothing as a starting point.
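For readers new to KUnit, the kind of test these docs describe is, roughly, a suite
like the minimal sketch below. It uses the existing KUnit macros from <kunit/test.h>;
the suite and case names are made up for illustration, and the Kconfig wiring for
building it is omitted:

#include <kunit/test.h>

/* A trivial test case: KUNIT_EXPECT_EQ() logs a failure but keeps running. */
static void example_add_test(struct kunit *test)
{
	KUNIT_EXPECT_EQ(test, 3, 1 + 2);
}

static struct kunit_case example_test_cases[] = {
	KUNIT_CASE(example_add_test),
	{}
};

static struct kunit_suite example_test_suite = {
	.name = "example",
	.test_cases = example_test_cases,
};
kunit_test_suites(&example_test_suite);

When built in (or loaded as a module), a suite like this reports one TAP line per
case in the kernel log, which is what kunit_tool parses.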
Documentation/dev-tools/kunit/index.rst | 60 ++++---------------
Documentation/dev-tools/kunit/start.rst | 80 +++++++++++++++++++++----
2 files changed, 78 insertions(+), 62 deletions(-)
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index d16a4d2c3a41..6064cd14dfad 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -17,63 +17,23 @@ What is KUnit?
==============
KUnit is a lightweight unit testing and mocking framework for the Linux kernel.
-These tests are able to be run locally on a developer's workstation without a VM
-or special hardware.
KUnit is heavily inspired by JUnit, Python's unittest.mock, and
Googletest/Googlemock for C++. KUnit provides facilities for defining unit test
cases, grouping related test cases into test suites, providing common
infrastructure for running tests, and much more.
-Get started now: :doc:`start`
-
-Why KUnit?
-==========
-
-A unit test is supposed to test a single unit of code in isolation, hence the
-name. A unit test should be the finest granularity of testing and as such should
-allow all possible code paths to be tested in the code under test; this is only
-possible if the code under test is very small and does not have any external
-dependencies outside of the test's control like hardware.
-
-Outside of KUnit, there are no testing frameworks currently
-available for the kernel that do not require installing the kernel on a test
-machine or in a VM and all require tests to be written in userspace running on
-the kernel; this is true for Autotest, and kselftest, disqualifying
-any of them from being considered unit testing frameworks.
+KUnit consists of a kernel component, which provides a set of macros for easily
+writing unit tests. Tests written against KUnit will run on kernel boot if
+built-in, or when loaded if built as a module. These tests write out results to
+the kernel log in `TAP <https://testanything.org/>`_ format.
-KUnit addresses the problem of being able to run tests without needing a virtual
-machine or actual hardware with User Mode Linux. User Mode Linux is a Linux
-architecture, like ARM or x86; however, unlike other architectures it compiles
-to a standalone program that can be run like any other program directly inside
-of a host operating system; to be clear, it does not require any virtualization
-support; it is just a regular program.
+To make running these tests (and reading the results) easier, KUnit offers
+:doc:`kunit_tool <kunit-tool>`, which builds a `User Mode Linux
+<http://user-mode-linux.sourceforge.net>`_ kernel, runs it, and parses the test
+results. This provides a quick way of running KUnit tests during development.
-Alternatively, kunit and kunit tests can be built as modules and tests will
-run when the test module is loaded.
-
-KUnit is fast. Excluding build time, from invocation to completion KUnit can run
-several dozen tests in only 10 to 20 seconds; this might not sound like a big
-deal to some people, but having such fast and easy to run tests fundamentally
-changes the way you go about testing and even writing code in the first place.
-Linus himself said in his `git talk at Google
-<https://gist.github.com/lorn/1272686/revisions#diff-53c65572127855f1b003db4…>`_:
-
- "... a lot of people seem to think that performance is about doing the
- same thing, just doing it faster, and that is not true. That is not what
- performance is all about. If you can do something really fast, really
- well, people will start using it differently."
-
-In this context Linus was talking about branching and merging,
-but this point also applies to testing. If your tests are slow, unreliable, are
-difficult to write, and require a special setup or special hardware to run,
-then you wait a lot longer to write tests, and you wait a lot longer to run
-tests; this means that tests are likely to break, unlikely to test a lot of
-things, and are unlikely to be rerun once they pass. If your tests are really
-fast, you run them all the time, every time you make a change, and every time
-someone sends you some code. Why trust that someone ran all their tests
-correctly on every change when you can just run them yourself in less time than
-it takes to read their test log?
+Get started now: :doc:`start`
How do I use it?
================
@@ -81,3 +41,5 @@ How do I use it?
* :doc:`start` - for new users of KUnit
* :doc:`usage` - for a more detailed explanation of KUnit features
* :doc:`api/index` - for the list of KUnit APIs used for testing
+* :doc:`kunit-tool` - for more information on the kunit_tool helper script
+* :doc:`faq` - for answers to some common questions about KUnit
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index 4e1d24db6b13..e1c5ce80ce12 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -9,11 +9,10 @@ Installing dependencies
KUnit has the same dependencies as the Linux kernel. As long as you can build
the kernel, you can run KUnit.
-KUnit Wrapper
-=============
-Included with KUnit is a simple Python wrapper that helps format the output to
-easily use and read KUnit output. It handles building and running the kernel, as
-well as formatting the output.
+Running tests with the KUnit Wrapper
+====================================
+Included with KUnit is a simple Python wrapper which runs tests under User Mode
+Linux, and formats the test results.
The wrapper can be run with:
@@ -21,22 +20,42 @@ The wrapper can be run with:
./tools/testing/kunit/kunit.py run --defconfig
-For more information on this wrapper (also called kunit_tool) checkout the
+For more information on this wrapper (also called kunit_tool) check out the
:doc:`kunit-tool` page.
Creating a .kunitconfig
-=======================
-The Python script is a thin wrapper around Kbuild. As such, it needs to be
-configured with a ``.kunitconfig`` file. This file essentially contains the
-regular Kernel config, with the specific test targets as well.
-
+-----------------------
+If you want to run a specific set of tests (rather than those listed in the
+KUnit defconfig), you can provide Kconfig options in the ``.kunitconfig`` file.
+This file essentially contains the regular Kernel config, with the specific
+test targets as well. The ``.kunitconfig`` should also contain any other config
+options required by the tests.
+
+A good starting point for a ``.kunitconfig`` is the KUnit defconfig:
.. code-block:: bash
cd $PATH_TO_LINUX_REPO
cp arch/um/configs/kunit_defconfig .kunitconfig
-Verifying KUnit Works
----------------------
+You can then add any other Kconfig options you wish, e.g.:
+.. code-block:: none
+
+ CONFIG_LIST_KUNIT_TEST=y
+
+:doc:`kunit_tool <kunit-tool>` will ensure that all config options set in
+``.kunitconfig`` are set in the kernel ``.config`` before running the tests.
+It'll warn you if you haven't included the dependencies of the options you're
+using.
+
+.. note::
+ Note that removing something from the ``.kunitconfig`` will not trigger a
+ rebuild of the ``.config`` file: the configuration is only updated if the
+ ``.kunitconfig`` is not a subset of ``.config``. This means that you can use
+ other tools (such as make menuconfig) to adjust other config options.
+
+
+Running the tests
+-----------------
To make sure that everything is set up correctly, simply invoke the Python
wrapper from your kernel repo:
@@ -62,6 +81,41 @@ followed by a list of tests that are run. All of them should be passing.
Because it is building a lot of sources for the first time, the
``Building KUnit kernel`` step may take a while.
+Running tests without the KUnit Wrapper
+=======================================
+
+If you'd rather not use the KUnit Wrapper (if, for example, you need to
+integrate with other systems, or use an architecture other than UML), KUnit can
+be included in any kernel, and the results read out and parsed manually.
+
+.. note::
+ KUnit is not designed for use in a production system, and it's possible that
+ tests may reduce the stability or security of the system.
+
+
+
+Configuring the kernel
+----------------------
+
+In order to enable KUnit itself, you simply need to enable the ``CONFIG_KUNIT``
+Kconfig option (it's under Kernel Hacking/Kernel Testing and Coverage in
+menuconfig). From there, you can enable any KUnit tests you want: they usually
+have config options ending in ``_KUNIT_TEST``.
+
+KUnit and KUnit tests can be compiled as modules: in this case the tests in a
+module will be run when the module is loaded.
+
+Running the tests
+-----------------
+
+Build and run your kernel as usual. Test output will be written to the kernel
+log in `TAP <https://testanything.org/>`_ format.
+
+.. note::
+ It's possible that there will be other lines and/or data interspersed in the
+ TAP output.
+
+
Writing your first test
=======================
--
2.25.0.341.g760bfbb309-goog
From: Randy Dunlap <rdunlap(a)infradead.org>
Fix a Documentation warning due to a missing blank line after a directive:
linux/Documentation/dev-tools/kunit/usage.rst:553: WARNING: Error in "code-block" directive:
maximum 1 argument(s) allowed, 3 supplied.
.. code-block:: bash
modprobe example-test
Fixes: 6ae2bfd3df06 ("kunit: update documentation to describe module-based build")
Signed-off-by: Randy Dunlap <rdunlap(a)infradead.org>
Cc: Alan Maguire <alan.maguire(a)oracle.com>
Cc: Knut Omang <knut.omang(a)oracle.com>
Cc: Brendan Higgins <brendanhiggins(a)google.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: Shuah Khan <skhan(a)linuxfoundation.org>
---
Documentation/dev-tools/kunit/usage.rst | 1 +
1 file changed, 1 insertion(+)
--- lnx-56-rc1.orig/Documentation/dev-tools/kunit/usage.rst
+++ lnx-56-rc1/Documentation/dev-tools/kunit/usage.rst
@@ -551,6 +551,7 @@ options to your ``.config``:
Once the kernel is built and installed, a simple
.. code-block:: bash
+
modprobe example-test
...will run the tests.
[Resend the v9 patch set to Shuah Khan and the linux-kselftest mailing list.
No code or commit message changes.]
With more and more resctrl features being added by Intel, AMD
and ARM, a test tool is becoming increasingly useful to validate
that both hardware and software functionalities work as expected.
We introduce a resctrl selftest to cover resctrl features on Intel, AMD
and ARM architectures. It first implements MBM (Memory Bandwidth
Monitoring), MBA (Memory Bandwidth Allocation), L3 CAT (Cache Allocation
Technology), and CQM (Cache QoS Monitoring) tests. We will enhance
the selftest tool to include more functionality tests in the future.
The tool has been tested on Intel RDT, AMD QoS and ARM MPAM and is
in tools/testing/selftests/resctrl in order to have a generic test code
base for all architectures.
The selftest tool we are introducing here does automatic resctrl
testing, is easily available in the kernel tree, and covers Intel RDT,
AMD QoS and ARM MPAM.
There is an existing resctrl test suite, 'intel_cmt_cat', but its major
purpose is to test Intel RDT hardware by writing and reading MSR
registers. It does access the resctrl file system, but the functionality
covered is very limited, and it doesn't support automated testing; a lot
of manual verification is involved.
Changelog:
v9:
- Per Boris suggestion, add Co-developed-by in each patch to make it
clear who contributed to the patch set.
v8:
Update code per comments from Andre Przywara from ARM:
- Change Makefile and remove inline assembly code to build and test the
tool on ARM
- Change the output to TAP format because the format is readable both by
  humans and by other test tools.
- Detect resctrl feature from /proc/cpuinfo instead of dmesg to support
generic detection on all architectures.
- Fix a few coding issues.
v7:
- Fix a few warnings when compiling patches separately, pointed out by Babu
v6:
- Fix a benchmark reading optimized out issue in newer GCC.
- Fix a few coding style issues.
- Re-arrange code among patches to make the code cleaner. No change in patch
  structure.
v5:
- Based the v4 patches submitted by Fenghua Yu and added changes to support
AMD.
- Changed the function name get_sock_num to get_resource_id. Intel uses
socket number for schemata and AMD uses l3 index id. To generalize,
changed the function name to get_resource_id.
- Added the code to detect vendor.
- Disabled the few tests for AMD where the test results are not clear.
Also AMD does not have IMC.
- Fixed a few compile issues.
- Some cleanup to make each patch independent.
- Tested the patches on an AMD system. Fenghua, need your help to test on
  an Intel box. Please feel free to change and resubmit if something is
  broken.
v4:
- address comments from Balu and Randy
- Add CAT and CQM tests
v3:
- Change code based on comments from Babu Moger
- Remove some unnecessary code and use a pipe to communicate b/w processes
v2:
- Change code based on comments from Babu Moger
- Clean up other places.
Babu Moger (3):
selftests/resctrl: Add vendor detection mechanism
selftests/resctrl: Use cache index3 id for AMD schemata masks
selftests/resctrl: Disable MBA and MBM tests for AMD
Fenghua Yu (6):
selftests/resctrl: Add README for resctrl tests
selftests/resctrl: Add MBM test
selftests/resctrl: Add MBA test
selftests/resctrl: Add Cache QoS Monitoring (CQM) selftest
selftests/resctrl: Add Cache Allocation Technology (CAT) selftest
selftests/resctrl: Add the test in MAINTAINERS
Sai Praneeth Prakhya (4):
selftests/resctrl: Add basic resctrl file system operations and data
selftests/resctrl: Read memory bandwidth from perf IMC counter and
from resctrl file system
selftests/resctrl: Add callback to start a benchmark
selftests/resctrl: Add built in benchmark
MAINTAINERS | 1 +
tools/testing/selftests/resctrl/Makefile | 17 +
tools/testing/selftests/resctrl/README | 53 ++
tools/testing/selftests/resctrl/cache.c | 272 +++++++
tools/testing/selftests/resctrl/cat_test.c | 250 ++++++
tools/testing/selftests/resctrl/cqm_test.c | 176 +++++
tools/testing/selftests/resctrl/fill_buf.c | 213 +++++
tools/testing/selftests/resctrl/mba_test.c | 171 ++++
tools/testing/selftests/resctrl/mbm_test.c | 145 ++++
tools/testing/selftests/resctrl/resctrl.h | 107 +++
.../testing/selftests/resctrl/resctrl_tests.c | 202 +++++
tools/testing/selftests/resctrl/resctrl_val.c | 744 ++++++++++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 722 +++++++++++++++++
13 files changed, 3073 insertions(+)
create mode 100644 tools/testing/selftests/resctrl/Makefile
create mode 100644 tools/testing/selftests/resctrl/README
create mode 100644 tools/testing/selftests/resctrl/cache.c
create mode 100644 tools/testing/selftests/resctrl/cat_test.c
create mode 100644 tools/testing/selftests/resctrl/cqm_test.c
create mode 100644 tools/testing/selftests/resctrl/fill_buf.c
create mode 100644 tools/testing/selftests/resctrl/mba_test.c
create mode 100644 tools/testing/selftests/resctrl/mbm_test.c
create mode 100644 tools/testing/selftests/resctrl/resctrl.h
create mode 100644 tools/testing/selftests/resctrl/resctrl_tests.c
create mode 100644 tools/testing/selftests/resctrl/resctrl_val.c
create mode 100644 tools/testing/selftests/resctrl/resctrlfs.c
--
2.19.1
Hi,
Here's another update after another round of reviews from Kirill and Jan.
There is a git repo and branch, for convenience in reviewing:
git@github.com:johnhubbard/linux.git track_user_pages_v5
============================================================
Changes since v4:
* Added documentation about the huge page behavior of the new
/proc/vmstat items.
* Added a missing mod_node_page_state() call to put_compound_head().
* Fixed a tracepoint call in page_ref_sub_return().
* Added a trailing underscore to a URL in pin_user_pages.rst, to fix
a broken generated link.
* Added ACKs and reviewed-by's from Jan Kara and Kirill Shutemov.
* Rebased onto today's linux.git, and
* I am experimenting here with "git format-patch --base=<commit>".
This generated the "base-commit:" tag you'll see at the end of this
cover letter. I was inspired to do so after trying out a new
get-lore-mbox.py tool (it's very nice), mentioned in a recent LWN
article (https://lwn.net/Articles/811528/ ). That tool relies on the
base-commit tag for some things.
============================================================
Changes since v3:
* Rebased onto latest linux.git
* Added ACKs and reviewed-by's from Kirill Shutemov and Jan Kara.
* /proc/vmstat:
* Renamed items, after realizing that I hate the previous names:
nr_foll_pin_requested --> nr_foll_pin_acquired
nr_foll_pin_returned --> nr_foll_pin_released
* Removed the CONFIG_DEBUG_VM guard, and collapsed away a wrapper
routine: now just calls mod_node_page_state() directly.
* Tweaked the WARN_ON_ONCE() statements in mm/hugetlb.c to be more
informative, and added comments above them as well.
* Fixed gup_benchmark: signed int --> unsigned long.
* One or two minor formatting changes.
============================================================
Changes since v2:
* Rebased onto linux.git, because the akpm tree for 5.6 has been merged.
* Split the tracking patch into even more patches, as requested.
* Merged Matthew Wilcox's dump_page() changes into mine, as part of the
first patch.
* Renamed: page_dma_pinned() --> page_maybe_dma_pinned(), in response to
Kirill Shutemov's review.
* Moved a WARN to the top of a routine, and fixed a typo in the commit
description of patch #7, also as suggested by Kirill.
============================================================
Changes since v1:
* Split the tracking patch into 6 smaller patches
* Rebased onto today's linux-next/akpm (there weren't any conflicts).
* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported
by Nathan Chancellor. (I don't see it in my local builds, probably
because they use gcc, but an LLVM test found the mismatch.)
* Fixed a huge page pincount problem (add/subtract vs.
increment/decrement), spotted by Jan Kara.
============================================================
There is a reasonable case to be made for merging two of the patches
(patches 7 and 8), given that patch 7 provides tracking that has upper
limits on the number of pins that can be done with huge pages. Let me
know if anyone wants those merged, but unless there is some weird chance
of someone grabbing patch 7 and not patch 8, I don't really see the
need. Meanwhile, it's easier to review in this form.
Also, patch 3 has been revived. Earlier reviewers asked for it to be
merged into the tracking patch (one cannot please everyone, heh), but
now it's back out on its own.
This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].
FOLL_PIN support is now in the main linux tree. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.
This patch definitively solves that kind of overflow problem, by adding
an exact pincount, for compound pages (of order > 1), in the 3rd struct
page of a compound page. If available, that form of pincounting is used,
instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara
for that idea.
Other interesting changes:
* dump_page(): added one, or two new things to report for compound
pages: head refcount (for all compound pages), and map_pincount (for
compound pages of order > 1).
* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
huge page refcount upper limit problems, and added notes about how it
works now. Also added a note about the dump_page() enhancements.
* Added some comments in gup.c and mm.h, to explain that there are two
ways to count pinned pages: exact (for compound pages of order > 1)
and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
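To make those two counting schemes concrete, here is a rough sketch
(simplified, and not the exact code; the real helpers in this series live in
include/linux/mm.h) of how a "maybe pinned" query distinguishes them:

/* Sketch only: mirrors the description above rather than the exact helpers. */
static inline bool page_maybe_dma_pinned_sketch(struct page *page)
{
	/* Exact scheme: compound pages of order > 1 keep a dedicated pin
	 * count in the third struct page (hpage_pinned_refcount). */
	if (PageCompound(page) && compound_order(page) > 1)
		return atomic_read(&page[2].hpage_pinned_refcount) > 0;

	/* Fuzzy scheme: each FOLL_PIN adds GUP_PIN_COUNTING_BIAS to the
	 * normal refcount, so a large refcount implies "maybe pinned". */
	return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
}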
============================================================
General notes about the tracking patch:
This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)
In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:
get_user_pages() (sets FOLL_GET)
put_page()
to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
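As a concrete illustration of that conversion (the surrounding driver context,
user_addr, and the fixed page count are placeholders; the function calls are the
kernel APIs named above):

	struct page *pages[16];
	long nr, i;

	/* Before: references taken with FOLL_GET */
	nr = get_user_pages(user_addr, 16, FOLL_WRITE, pages, NULL);
	/* ... set up DMA to/from the pages ... */
	for (i = 0; i < nr; i++)
		put_page(pages[i]);

	/* After: "pins" taken with FOLL_PIN, tracked by this series */
	nr = pin_user_pages(user_addr, 16, FOLL_WRITE, pages, NULL);
	/* ... set up DMA to/from the pages ... */
	for (i = 0; i < nr; i++)
		unpin_user_page(pages[i]);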
============================================================
Next steps:
* Convert more subsystems from get_user_pages() to pin_user_pages().
* Work with Ira and others to connect this all up with file system
leases.
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages()
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
John Hubbard (12):
mm: dump_page(): better diagnostics for compound pages
mm/gup: split get_user_pages_remote() into two routines
mm/gup: pass a flags arg to __gup_device_* functions
mm: introduce page_ref_sub_return()
mm/gup: pass gup flags to two more routines
mm/gup: require FOLL_GET for get_user_pages_fast()
mm/gup: track FOLL_PIN pages
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
mm: dump_page(): better diagnostics for huge pinned pages
mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
mm/gup_benchmark: support pin_user_pages() and related calls
selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage
Documentation/core-api/pin_user_pages.rst | 86 ++--
include/linux/mm.h | 108 ++++-
include/linux/mm_types.h | 7 +-
include/linux/mmzone.h | 2 +
include/linux/page_ref.h | 9 +
mm/debug.c | 61 ++-
mm/gup.c | 452 ++++++++++++++++-----
mm/gup_benchmark.c | 71 +++-
mm/huge_memory.c | 29 +-
mm/hugetlb.c | 60 ++-
mm/page_alloc.c | 2 +
mm/rmap.c | 6 +
mm/vmstat.c | 2 +
tools/testing/selftests/vm/gup_benchmark.c | 15 +-
tools/testing/selftests/vm/run_vmtests | 22 +
15 files changed, 752 insertions(+), 180 deletions(-)
base-commit: 90568ecf561540fa330511e21fcd823b0c3829c6
--
2.25.0
Hello,
This patchset implements a new futex operation, called FUTEX_WAIT_MULTIPLE,
which allows a thread to wait on several futexes at the same time, and be
awoken by any of them.
The use case lies in the Wine implementation of the Windows NT interface
WaitForMultipleObjects. This Windows API function allows a thread to sleep
waiting for the first of a set of event sources (mutexes, timers, signals,
console input, etc.) to signal. Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side and quickly go to sleep on the
consumer side is essential for good performance of applications running over Wine.
Since this API exposes a mechanism to wait on multiple objects, and
we might have multiple waiters for each of these events, an M->N
relationship, the current Linux interfaces fell short in our performance
evaluation of large M,N scenarios. We experimented, for instance, with
eventfd, which has the performance problems discussed below, but we also
experimented with userspace solutions, like making each consumer wait on
a condition variable guarding the entire list of objects, and then
waking up multiple variables on the producer side. This is
prohibitively expensive, since we either need to signal many condition
variables or share one condition variable among multiple waiters and
then check in userspace whether the event was signaled, which means
dealing with frequent false positive wake-ups.
The natural interface to implement the behavior we want, also
considering that one of the waitable objects is a mutex itself, would be
the futex interface. Therefore, this patchset proposes a mechanism for
a thread to wait on multiple futexes at once, and wake up on the first
futex that is awoken.
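To make the proposed interface concrete, here is a rough userspace sketch of
how a caller might wait on any of a group of futexes. The struct layout,
argument placement and the FUTEX_WAIT_MULTIPLE value below reflect my reading
of this series and are illustrative, not a finalized uapi:

#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#ifndef FUTEX_WAIT_MULTIPLE
#define FUTEX_WAIT_MULTIPLE 13	/* assumed op number; use the series' uapi header */
#endif

/* Illustrative: each entry describes one futex to wait on. */
struct futex_wait_block {
	void *uaddr;		/* futex word to wait on */
	uint32_t val;		/* expected value of *uaddr */
	uint32_t bitset;	/* wake bitset, as with FUTEX_WAIT_BITSET */
};

static long futex_wait_any(struct futex_wait_block *blocks, uint32_t count,
			   const struct timespec *timeout)
{
	/* uaddr points at the array of wait blocks, val carries the count. */
	return syscall(SYS_futex, blocks, FUTEX_WAIT_MULTIPLE, count,
		       timeout, NULL, 0);
}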
In particular, using futexes in our Wine use case reduced the CPU
utilization by 4% for the game Beat Saber and by 1.5% for the game
Shadow of Tomb Raider, both running over Proton (a Wine based solution
for Windows emulation), when compared to the eventfd interface. This
implementation also doesn't rely on file descriptors, so it doesn't risk
exhausting that resource.
In time, we are also proposing modifications to glibc and libpthread to
make this feature available for Linux native multithreaded applications
using libpthread, which can benefit from the behavior of waiting on any
of a group of futexes.
Technically, the existing FUTEX_WAIT implementation can be easily
reworked by using futex_wait_multiple() with a count of one, and I
have a patch showing how it works. I'm not proposing it, since
futex is such tricky code that I'd be more comfortable having
FUTEX_WAIT_MULTIPLE run upstream for a couple of development cycles
before considering modifying FUTEX_WAIT.
The patch series includes an extensive set of kselftests validating
the behavior of the interface. We also implemented support[1] in
Syzkaller and survived the fuzz testing.
Finally, if you'd rather pull a branch with this set directly, you can
find it here:
https://gitlab.collabora.com/tonyk/linux/commits/futex-dev
=== Performance of eventfd ===
Polling on several eventfd contexts with semaphore semantics would
provide us with the exact semantics we are looking for. However, as
shown below, in a scenario with sufficient producers and consumers, the
eventfd interface itself becomes a bottleneck, in particular because
each thread will compete to acquire a sequence of waitqueue locks for
each eventfd context in the poll list. In addition, in the uncontended
case, where the producer is ready for consumption, eventfd still
requires going into the kernel on the consumer side.
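For reference, the pattern being measured against looks roughly like the sketch
below (each event source is an eventfd created with EFD_SEMAPHORE; the function
name and error handling are illustrative):

#include <stdint.h>
#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>

/* Wait until any of 'n' semaphore-mode eventfds becomes readable, then
 * consume one count from it; returns the index that fired, or -1. */
static int wait_any_eventfd(const int *efds, int n)
{
	struct pollfd pfds[64];
	uint64_t v;
	int i;

	if (n > 64)
		return -1;
	for (i = 0; i < n; i++) {
		pfds[i].fd = efds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}
	if (poll(pfds, n, -1) <= 0)
		return -1;
	for (i = 0; i < n; i++) {
		if (pfds[i].revents & POLLIN) {
			read(efds[i], &v, sizeof(v)); /* EFD_SEMAPHORE: take one */
			return i;
		}
	}
	return -1;
}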
When a write or a read operation on an eventfd file succeeds, it will try
to wake up all threads that are waiting to perform some operation on
the file. The lock (ctx->wqh.lock) that guards the file value
(ctx->count) is the same lock used to control access to the waitqueue. When
all those threads wake, they will compete for this lock. In addition,
poll() also manipulates the waitqueue and needs to hold this same lock.
The lock is especially hard to acquire when poll() calls
poll_freewait(), where it tries to free all waitqueues associated with
the poll. While doing that, it competes with the many read and
write operations that have just been woken.
In our use case, with a huge number of parallel reads, writes and polls,
this lock is a bottleneck and hurts the performance of applications. Our
futex implementation, however, decreases the number of spin lock
acquisitions by more than 80% in some user applications.
Finally, eventfd operates on file descriptors, which is a limited
resource that has shown its limitations in our use cases. Even though the
Windows interface never waits on more than 64 objects at once, we still
have multiple waiters at the same time, and we were easily able to
exhaust the FD limits in applications like games.
The RFC for this patch can be found here:
https://lkml.org/lkml/2019/7/30/1399
Thanks,
André
[1] https://github.com/andrealmeid/syzkaller/tree/futex-wait-multiple
Gabriel Krisman Bertazi (4):
futex: Implement mechanism to wait on any of several futexes
selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test
selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test
selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test
include/uapi/linux/futex.h | 20 +
kernel/futex.c | 356 +++++++++++++++++-
.../selftests/futex/functional/.gitignore | 1 +
.../selftests/futex/functional/Makefile | 3 +-
.../futex/functional/futex_wait_multiple.c | 173 +++++++++
.../futex/functional/futex_wait_timeout.c | 38 +-
.../futex/functional/futex_wait_wouldblock.c | 28 +-
.../testing/selftests/futex/functional/run.sh | 3 +
.../selftests/futex/include/futextest.h | 22 +
9 files changed, 635 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/futex/functional/futex_wait_multiple.c
--
2.25.0