When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:

- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
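A quick back-of-the-envelope check of those figures (a userspace sketch only, assuming 8-byte pointers; the kernel additionally allocates the per-fd bitmaps, which account for the few hundred MB on top):

#include <limits.h>
#include <stdio.h>

int main(void)
{
        unsigned long long nr = 1073741880ULL;              /* fd passed to dup2() */
        unsigned long long fd_array = nr * sizeof(void *);  /* one struct file * per slot */

        printf("fd array: %llu bytes\n", fd_array);         /* 8,589,935,040 */
        printf("INT_MAX : %d\n", INT_MAX);                  /* 2,147,483,647 */
        printf("exceeds the kvmalloc() limit: %s\n", fd_array > INT_MAX ? "yes" : "no");
        return 0;
}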
Reproducer:

1. Set /proc/sys/fs/nr_open to 1073741816:

   # echo 1073741816 > /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:

   #include <unistd.h>
   #include <sys/resource.h>

   int main() {
           struct rlimit rlim = {1073741824, 1073741824};
           setrlimit(RLIMIT_NOFILE, &rlim);
           dup2(2, 1073741880); // Triggers the warning
           return 0;
   }
3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that:
1. The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting
2. Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations
3. The resulting allocations (>8GB) are impractical and will always fail
systemd's algorithm starts with INT_MAX and keeps halving the value until the kernel accepts it. On most systems, this results in nr_open being set to 1073741816 (0x3ffffff8), which is just under 2^30 file descriptors.
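(To spell out the arithmetic behind that number, assuming a typical 64-bit build where the kernel caps fs.nr_open at INT_MAX & -BITS_PER_LONG = 2147483584: systemd's first attempt is INT_MAX rounded down to a pointer multiple, 2147483640, which the kernel rejects; halving gives 1073741820, which rounds down to a multiple of 8 as 1073741816 = 0x3ffffff8 and is accepted.)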
While processes rarely use file descriptors near this limit in normal operation, certain selftests (like tools/testing/selftests/core/unshare_test.c) and programs that test file descriptor limits can trigger this issue.
Fix this by adding a check in alloc_fdtable() to ensure the requested allocation size does not exceed INT_MAX. This causes the operation to fail with -EMFILE instead of triggering a kernel warning and avoids the impractical >8GB memory allocation request.
Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open")
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/file.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
diff --git a/fs/file.c b/fs/file.c
index b6db031545e65..6d2275c3be9c6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -197,6 +197,21 @@ static struct fdtable *alloc_fdtable(unsigned int slots_wanted)
 			return ERR_PTR(-EMFILE);
 	}
 
+	/*
+	 * Check if the allocation size would exceed INT_MAX. kvmalloc_array()
+	 * and kvmalloc() will warn if the allocation size is greater than
+	 * INT_MAX, as filp_cachep objects are not __GFP_NOWARN.
+	 *
+	 * This can happen when sysctl_nr_open is set to a very high value and
+	 * a process tries to use a file descriptor near that limit. For example,
+	 * if sysctl_nr_open is set to 1073741816 (0x3ffffff8) - which is what
+	 * systemd typically sets it to - then trying to use a file descriptor
+	 * close to that value will require allocating a file descriptor table
+	 * that exceeds 8GB in size.
+	 */
+	if (unlikely(nr > INT_MAX / sizeof(struct file *)))
+		return ERR_PTR(-EMFILE);
+
 	fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL_ACCOUNT);
 	if (!fdt)
 		goto out;
On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:

   # echo 1073741816 > /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:

   #include <unistd.h>
   #include <sys/resource.h>

   int main() {
           struct rlimit rlim = {1073741824, 1073741824};
           setrlimit(RLIMIT_NOFILE, &rlim);
           dup2(2, 1073741880); // Triggers the warning
           return 0;
   }

3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that:
- The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting
- Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations
- The resulting allocations (>8GB) are impractical and will always fail
alloc_fdtable() seems like the wrong place to do it.
If there is an explicit de facto limit, the machinery which alters fs.nr_open should validate against it.
I understand this might result in systemd setting a new value which is significantly lower than what it uses now, which technically is a change in behavior, but I don't think it's a big deal.
I'm assuming the kernel can't just set the value to something very high by default.
But in that case perhaps it could expose the max settable value? Then systemd would not have to guess.
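(Not part of the posted patch, just to sketch that alternative: fs/file.c already bounds the sysctl via sysctl_nr_open_min/sysctl_nr_open_max, so the idea would roughly be to fold the allocator's reachable maximum into that bound, along these lines; the exact expression is only an illustration:)

/* Sketch only: make the fs.nr_open ceiling reflect what alloc_fdtable()
 * can actually back with a single kvmalloc() (at most INT_MAX bytes of
 * struct file pointers), so writing a larger value fails up front
 * instead of succeeding and then being unreachable. */
unsigned int sysctl_nr_open_max =
        __const_min(INT_MAX / sizeof(struct file *),
                    ~(size_t)0 / sizeof(void *)) & -BITS_PER_LONG;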
On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:

   # echo 1073741816 > /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:

   #include <unistd.h>
   #include <sys/resource.h>

   int main() {
           struct rlimit rlim = {1073741824, 1073741824};
           setrlimit(RLIMIT_NOFILE, &rlim);
           dup2(2, 1073741880); // Triggers the warning
           return 0;
   }

3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that:
- The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting
- Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations
- The resulting allocations (>8GB) are impractical and will always fail
alloc_fdtable() seems like the wrong place to do it.
If there is an explicit de facto limit, the machinery which alters fs.nr_open should validate against it.
I understand this might result in systemd setting a new value which is significantly lower than what it uses now, which technically is a change in behavior, but I don't think it's a big deal.
I'm assuming the kernel can't just set the value to something very high by default.
But in that case perhaps it could expose the max settable value? Then systemd would not have to guess.
The patch is in alloc_fdtable() because it's addressing a memory allocator limitation, not a fundamental file descriptor limitation.
The INT_MAX restriction comes from kvmalloc(), not from any inherent constraint on how many FDs a process can have. If we implemented sparse FD tables or if kvmalloc() later supports larger allocations, the same nr_open value could become usable without any changes to FD handling code.
Putting the check at the sysctl layer would codify a temporary implementation detail of the memory allocator as if it were a fundamental FD limit. By keeping it at the allocation point, the check reflects what it actually is - a current limitation of how large a contiguous allocation we can make.
This placement also means the limit naturally adjusts if the underlying implementation changes, rather than requiring coordinated updates between the sysctl validation and the allocator capabilities.
I don't have a strong opinion either way...
On Mon, Jun 30, 2025 at 5:13 AM Sasha Levin <sashal@kernel.org> wrote:
On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:

   # echo 1073741816 > /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:

   #include <unistd.h>
   #include <sys/resource.h>

   int main() {
           struct rlimit rlim = {1073741824, 1073741824};
           setrlimit(RLIMIT_NOFILE, &rlim);
           dup2(2, 1073741880); // Triggers the warning
           return 0;
   }

3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that:
- The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting
- Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations
- The resulting allocations (>8GB) are impractical and will always fail
alloc_fdtable() seems like the wrong place to do it.
If there is an explicit de facto limit, the machinery which alters fs.nr_open should validate against it.
I understand this might result in systemd setting a new value which is significantly lower than what it uses now, which technically is a change in behavior, but I don't think it's a big deal.
I'm assuming the kernel can't just set the value to something very high by default.
But in that case perhaps it could expose the max settable value? Then systemd would not have to guess.
The patch is in alloc_fdtable() because it's addressing a memory allocator limitation, not a fundamental file descriptor limitation.
The INT_MAX restriction comes from kvmalloc(), not from any inherent constraint on how many FDs a process can have. If we implemented sparse FD tables or if kvmalloc() later supports larger allocations, the same nr_open value could become usable without any changes to FD handling code.
Putting the check at the sysctl layer would codify a temporary implementation detail of the memory allocator as if it were a fundamental FD limit. By keeping it at the allocation point, the check reflects what it actually is - a current limitation of how large a contiguous allocation we can make.
This placement also means the limit naturally adjusts if the underlying implementation changes, rather than requiring coordinated updates between the sysctl validation and the allocator capabilities.
I don't have a strong opinion either way...
Allowing privileged userspace to set a limit which the kernel knows it cannot reach sounds like a bug to me.
Indeed the limitation is an artifact of the current implementation, I don't understand the logic behind pretending it's not there.
Regardless, not my call :)
On Mon, Jun 30, 2025 at 01:35:08PM +0200, Mateusz Guzik wrote:
On Mon, Jun 30, 2025 at 5:13 AM Sasha Levin <sashal@kernel.org> wrote:
On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near
Note that systemd caps all services/processes it starts to 500k fds by default. So someone would have to hand-massage the per-process limit like in your example.
And fwiw, allocating file descriptors above INT_MAX is inherently unsafe because we have stuff like:
#define AT_FDCWD -100
If we allow file descriptor allocation above INT_MAX, it's easy to allocate a file descriptor at 4294967196, which is AT_FDCWD. If you pass that to fchmodat() or something similar you have a problem: instead of changing whatever the file descriptor points to, you're changing your current working directory.

Since we have a bunch of system calls that return file descriptors, such as pidfd_open(), returning fds above INT_MAX would also mean we'd return values that read as errnos, e.g., an allocation in the AT_FDCWD range would come back looking like -ENETDOWN.
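(A tiny userspace illustration of that truncation hazard, assuming the usual two's-complement int truncation and Linux's AT_FDCWD = -100 and ENETDOWN = 100:)

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
        unsigned long fd = 4294967196UL;  /* a hypothetical fd above INT_MAX */
        int as_int = (int)fd;             /* what an int-typed syscall argument sees */

        /* All three print -100: the fd collides with AT_FDCWD and, as a
         * syscall return value, would be indistinguishable from -ENETDOWN. */
        printf("as int: %d, AT_FDCWD: %d, -ENETDOWN: %d\n", as_int, AT_FDCWD, -ENETDOWN);
        return 0;
}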
But what's annoying is that we are communicating very confusing things to userspace by being inconsistent in our system call interface.
We have system calls that accept int as the file descriptor type (fallocate(), faccessat(), fchmodat(), etc.) and then we have system calls that accept unsigned int as the file descriptor type (close(), ftruncate(), fchdir(), fchmod(), etc.).
What makes it all worse is that glibc enforces that all fd-based system calls take int as an argument:
close(2)                      System Calls Manual                     close(2)

NAME
       close - close a file descriptor

LIBRARY
       Standard C library (libc, -lc)

SYNOPSIS
       #include <unistd.h>

       int close(int fd);
So we now also have a userspace-kernel disconnect.
the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:

   # echo 1073741816 > /proc/sys/fs/nr_open

2. Run a program that uses a high file descriptor:

   #include <unistd.h>
   #include <sys/resource.h>

   int main() {
           struct rlimit rlim = {1073741824, 1073741824};
           setrlimit(RLIMIT_NOFILE, &rlim);
           dup2(2, 1073741880); // Triggers the warning
           return 0;
   }

3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that:
- The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting
- Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations
- The resulting allocations (>8GB) are impractical and will always fail
alloc_fdtable() seems like the wrong place to do it.
If there is an explicit de facto limit, the machinery which alters fs.nr_open should validate against it.
I understand this might result in systemd setting a new value which is significantly lower than what it uses now, which technically is a change in behavior, but I don't think it's a big deal.
I'm assuming the kernel can't just set the value to something very high by default.
But in that case perhaps it could expose the max settable value? Then systemd would not have to guess.
The patch is in alloc_fdtable() because it's addressing a memory allocator limitation, not a fundamental file descriptor limitation.
The INT_MAX restriction comes from kvmalloc(), not from any inherent constraint on how many FDs a process can have. If we implemented sparse FD tables or if kvmalloc() later supports larger allocations, the same nr_open value could become usable without any changes to FD handling code.
Putting the check at the sysctl layer would codify a temporary implementation detail of the memory allocator as if it were a fundamental FD limit. By keeping it at the allocation point, the check
Yeah, I tend to agree.
reflects what it actually is - a current limitation of how large a contiguous allocation we can make.
This placement also means the limit naturally adjusts if the underlying implementation changes, rather than requiring coordinated updates between the sysctl validation and the allocator capabilities.
I don't have a strong opinion either way...
I think Mateusz' idea of exposing the maximum supported value in procfs as a read-only file is probably pretty sensible. Userspace like systemd has to do stuff like the following if it wants to allow a large number of fds by default:
#if BUMP_PROC_SYS_FS_NR_OPEN
        int v = INT_MAX;

        /* Argh! The kernel enforces maximum and minimum values on the fs.nr_open, but we don't really know
         * what they are. The expression by which the maximum is determined is dependent on the architecture,
         * and is something we don't really want to copy to userspace, as it is dependent on implementation
         * details of the kernel. Since the kernel doesn't expose the maximum value to us, we can only try
         * and hope. Hence, let's start with INT_MAX, and then keep halving the value until we find one that
         * works. Ugly? Yes, absolutely, but kernel APIs are kernel APIs, so what do can we do... 🤯 */

        for (;;) {
                int k;

                v &= ~(__SIZEOF_POINTER__ - 1); /* Round down to next multiple of the pointer size */
                if (v < 1024) {
                        log_warning("Can't bump fs.nr_open, value too small.");
                        break;
                }

                k = read_nr_open();
                if (k < 0) {
                        log_error_errno(k, "Failed to read fs.nr_open: %m");
                        break;
                }
                if (k >= v) { /* Already larger */
                        log_debug("Skipping bump, value is already larger.");
                        break;
                }

                r = sysctl_writef("fs/nr_open", "%i", v);
                if (r == -EINVAL) {
                        log_debug("Couldn't write fs.nr_open as %i, halving it.", v);
                        v /= 2;
                        continue;
                }
                if (r < 0) {
                        log_full_errno(IN_SET(r, -EROFS, -EPERM, -EACCES) ? LOG_DEBUG : LOG_WARNING, r,
                                       "Failed to bump fs.nr_open, ignoring: %m");
                        break;
                }

                log_debug("Successfully bumped fs.nr_open to %i", v);
                break;
        }
#endif
On Sun, 29 Jun 2025 03:40:21 -0400, Sasha Levin wrote:
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
[...]
Applied to the vfs-6.17.misc branch of the vfs/vfs.git tree. Patches in the vfs-6.17.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch.
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.17.misc

[1/1] fs: Prevent file descriptor table allocations exceeding INT_MAX
      https://git.kernel.org/vfs/vfs/c/c608a019c82f