The series is aimed at adding support to maintain individual timestamp ranges for filesystems. This helps futimens, utimensat and utimes syscalls to conform to POSIX defined behavior when the time being set is outside of the corresponding filesystem's supported limits.
The series was developed with discussions and guidance from Arnd Bergmann.
The original thread is at https://lkml.org/lkml/2016/11/2/294
I will be submitting follow up kernel patches to update all filesystems. Currently ext4 is the only filesystem that reflects correct limits.
The branch is available at https://github.com/deepa-hub/vfs.git refs/heads/range
Changes since v5: * Dropped y2038-specific changes Changes since v4: * Added documentation for boot param Changes since v3: * Remove redundant initializations in libfs.c * Change early_param to __setup similar to other root mount options. * Fix documentation warning Changes since v2: * Introduce early boot param override for checks. * Drop afs patch for timestamp limits. Changes since v1: * return EROFS on mount errors * fix mtime copy/paste error in utimes
Deepa Dinamani (4): vfs: Add file timestamp range support ext4: Initialize timestamps limits vfs: Add timestamp_truncate() api utimes: Clamp the timestamps before update
fs/ext4/ext4.h | 4 ++++ fs/ext4/super.c | 7 ++++++- fs/inode.c | 32 +++++++++++++++++++++++++++++++- fs/super.c | 2 ++ fs/utimes.c | 17 +++++++++++++---- include/linux/fs.h | 3 +++ include/linux/time64.h | 2 ++ 7 files changed, 61 insertions(+), 6 deletions(-)
Add fields to the superblock to track the min and max timestamps supported by filesystems.
Initially, when a superblock is allocated, initialize it to the max and min values the fields can hold. Individual filesystems override these to match their actual limits.
Pseudo filesystems are assumed to always support the min and max allowable values for the fields.
Note that the time ranges are saved in type time64_t rather than time_t. This is required because if we save ranges in time_t then we would not be able to save timestamp ranges for files that support timestamps beyond y2038.
Signed-off-by: Deepa Dinamani deepa.kernel@gmail.com --- fs/super.c | 2 ++ include/linux/fs.h | 3 +++ include/linux/time64.h | 2 ++ 3 files changed, 7 insertions(+)
diff --git a/fs/super.c b/fs/super.c index 672538ca9831..9e0c97e54e46 100644 --- a/fs/super.c +++ b/fs/super.c @@ -244,6 +244,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, s->s_maxbytes = MAX_NON_LFS; s->s_op = &default_op; s->s_time_gran = 1000000000; + s->s_time_min = TIME64_MIN; + s->s_time_max = TIME64_MAX; s->cleancache_poolid = CLEANCACHE_NO_POOL;
s->s_shrink.seeks = DEFAULT_SEEKS; diff --git a/include/linux/fs.h b/include/linux/fs.h index f9d01c0951a8..406f3de71c22 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1379,6 +1379,9 @@ struct super_block { /* Granularity of c/m/atime in ns. Cannot be worse than a second */ u32 s_time_gran; + /* Time limits for c/m/atime in seconds. */ + time64_t s_time_min; + time64_t s_time_max;
/* * The next field is for VFS *only*. No filesystems have any business diff --git a/include/linux/time64.h b/include/linux/time64.h index 93d39499838e..76ed46db7a7f 100644 --- a/include/linux/time64.h +++ b/include/linux/time64.h @@ -36,6 +36,8 @@ struct itimerspec64 {
/* Located here for timespec[64]_valid_strict */ #define TIME64_MAX ((s64)~((u64)1 << 63)) +#define TIME64_MIN (-TIME64_MAX - 1) + #define KTIME_MAX ((s64)~((u64)1 << 63)) #define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC)
ext4 has different overflow limits for max filesystem timestamps based on the extra bytes available.
Signed-off-by: Deepa Dinamani deepa.kernel@gmail.com Cc: "Theodore Ts'o" tytso@mit.edu Cc: Andreas Dilger adilger.kernel@dilger.ca Cc: linux-ext4@vger.kernel.org --- fs/ext4/ext4.h | 4 ++++ fs/ext4/super.c | 7 ++++++- 2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 3241475a1733..fe4d7a168664 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1602,6 +1602,10 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_GOOD_OLD_INODE_SIZE 128
+#define EXT4_EXTRA_TIMESTAMP_MAX (((s64)1 << 34) - 1 + S32_MIN) +#define EXT4_NON_EXTRA_TIMESTAMP_MAX S32_MAX +#define EXT4_TIMESTAMP_MIN S32_MIN + /* * Feature set definitions */ diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 0843ebfeace1..7c2b227aa319 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3768,8 +3768,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sbi->s_inode_size); goto failed_mount; } - if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) + if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) { sb->s_time_gran = 1 << (EXT4_EPOCH_BITS - 2); + sb->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX; + } else + sb->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX; + + sb->s_time_min = EXT4_TIMESTAMP_MIN; }
sbi->s_desc_size = le16_to_cpu(es->s_desc_size);
timespec_trunc() function is used to truncate a filesystem timestamp to the right granularity. But, the function does not clamp tv_sec part of the timestamps according to the filesystem timestamp limits.
Also, timespec_trunc() is exclusively used for filesystem timestamps. Move the api to be part of vfs.
The replacement api: timestamp_truncate() also alters the signature of the function to accommodate filesystem timestamp clamping according to flesystem limits.
Note that the clamp_t macro is used for clamping here as vfs is not yet using struct timespec64 internally. This is required for compilation purposes. Also note that clamp won't do the right thing for timestamps beyond 2038 on 32-bit machines until the vfs uses timespec64. After the vfs is transitioned to use timespec64 for timestamps, clamp_t() can be replaced by clamp().
Signed-off-by: Deepa Dinamani deepa.kernel@gmail.com --- fs/inode.c | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/fs/inode.c b/fs/inode.c index ef362364d396..0e2820ade554 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2107,6 +2107,36 @@ void inode_nohighmem(struct inode *inode) } EXPORT_SYMBOL(inode_nohighmem);
+/** + * timestamp_truncate - Truncate timespec to a granularity + * @t: Timespec + * @inode: inode being updated + * + * Truncate a timespec to the granularity supported by the fs + * containing the inode. Always rounds down. gran must + * not be 0 nor greater than a second (NSEC_PER_SEC, or 10^9 ns). + */ +struct timespec timestamp_truncate(struct timespec t, struct inode *inode) +{ + struct super_block *sb = inode->i_sb; + unsigned int gran = sb->s_time_gran; + + t.tv_sec = clamp_t(time64_t, t.tv_sec, sb->s_time_min, sb->s_time_max); + + /* Avoid division in the common cases 1 ns and 1 s. */ + if (gran == 1) { + /* nothing */ + } else if (gran == NSEC_PER_SEC) { + t.tv_nsec = 0; + } else if (gran > 1 && gran < NSEC_PER_SEC) { + t.tv_nsec -= t.tv_nsec % gran; + } else { + WARN(1, "illegal file time granularity: %u", gran); + } + return t; +} +EXPORT_SYMBOL(timestamp_truncate); + /** * current_time - Return FS time * @inode: inode. @@ -2126,6 +2156,6 @@ struct timespec current_time(struct inode *inode) return now; }
- return timespec_trunc(now, inode->i_sb->s_time_gran); + return timestamp_truncate(now, inode); } EXPORT_SYMBOL(current_time);
On Sun, Jan 21, 2018 at 6:04 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
t.tv_nsec -= t.tv_nsec % gran;
This doesn't actuall ywork if tv_nsec is negative.
Which may not be an issue in most cases, but did somebody check utimensat() or whatever?
WARN(1, "illegal file time granularity: %u", gran);
.. small nit: we generally should use 'invalid' rather than 'illegal'.
No cops will hunt you down for things like this.
Linus
On Mon, Jan 22, 2018 at 12:25 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Jan 21, 2018 at 6:04 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
t.tv_nsec -= t.tv_nsec % gran;
This doesn't actuall ywork if tv_nsec is negative.
Right.
Which may not be an issue in most cases, but did somebody check utimensat() or whatever?
I checked POSIX again. There is no mention of tv_nsec being positive always for utimes. And, the long term plan is to replace all the callers of timespec_trunc() to use this new api instead for filesystems. So this will need to be fixed. I will fix this and post an update.
WARN(1, "illegal file time granularity: %u", gran);
.. small nit: we generally should use 'invalid' rather than 'illegal'.
Will update this as well.
Thanks, Deepa
On Tue, Jan 23, 2018 at 5:25 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
On Mon, Jan 22, 2018 at 12:25 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Jan 21, 2018 at 6:04 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
t.tv_nsec -= t.tv_nsec % gran;
This doesn't actuall ywork if tv_nsec is negative.
Right.
Which may not be an issue in most cases, but did somebody check utimensat() or whatever?
I checked POSIX again. There is no mention of tv_nsec being positive always for utimes. And, the long term plan is to replace all the callers of timespec_trunc() to use this new api instead for filesystems. So this will need to be fixed. I will fix this and post an update.
I found this on http://pubs.opengroup.org/onlinepubs/9699919799/functions/utimes.html
ERRORS These functions shall fail if: ... [EINVAL] Either of the times argument structures specified a tv_nsec value that was neither UTIME_NOW nor UTIME_OMIT, and was a value less than zero or greater than or equal to 1000 million.
which is the same as the Linux man page and what the kernel actually does for all the syscalls. The POSIX description seems a bit ambiguous to whether it also expects or allows EINVAL for utimes() with a tv_usec over 1000000 microseconds, or if it just applies to the utimensat and futimens(). Older descriptions that only explain utimes() don't mention the range check on tv_usec either.
Arnd
On Wed, Jan 24, 2018 at 3:56 AM, Arnd Bergmann arnd@arndb.de wrote:
On Tue, Jan 23, 2018 at 5:25 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
On Mon, Jan 22, 2018 at 12:25 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Sun, Jan 21, 2018 at 6:04 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
t.tv_nsec -= t.tv_nsec % gran;
This doesn't actuall ywork if tv_nsec is negative.
Right.
Which may not be an issue in most cases, but did somebody check utimensat() or whatever?
I checked POSIX again. There is no mention of tv_nsec being positive always for utimes. And, the long term plan is to replace all the callers of timespec_trunc() to use this new api instead for filesystems. So this will need to be fixed. I will fix this and post an update.
I found this on http://pubs.opengroup.org/onlinepubs/9699919799/functions/utimes.html
ERRORS These functions shall fail if: ... [EINVAL] Either of the times argument structures specified a tv_nsec value that was neither UTIME_NOW nor UTIME_OMIT, and was a value less than zero or greater than or equal to 1000 million.
which is the same as the Linux man page and what the kernel actually does for all the syscalls. The POSIX description seems a bit ambiguous to whether it also expects or allows EINVAL for utimes() with a tv_usec over 1000000 microseconds, or if it just applies to the utimensat and futimens(). Older descriptions that only explain utimes() don't mention the range check on tv_usec either.
Right. This is in keeping with the kernel implementation of the corresponding syscalls.
But, this timespec_truncate() is being called from current_time() and will be extended to other calls.
C99 says "When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded (often called "truncation toward zero"). If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a."
Also, we are already checking for gran being non-zero and in the nanoseconds range.
So I think the right answer here is to add a comment that the function expects timespec to be normalized, and the functions calling it can take care of validation.
-Deepa
On Wed, Jan 24, 2018 at 3:56 AM, Arnd Bergmann arnd@arndb.de wrote:
On Tue, Jan 23, 2018 at 5:25 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
I checked POSIX again. There is no mention of tv_nsec being positive always for utimes. And, the long term plan is to replace all the callers of timespec_trunc() to use this new api instead for filesystems. So this will need to be fixed. I will fix this and post an update.
I found this on http://pubs.opengroup.org/onlinepubs/9699919799/functions/utimes.html
ERRORS These functions shall fail if: ... [EINVAL] Either of the times argument structures specified a tv_nsec value that was neither UTIME_NOW nor UTIME_OMIT, and was a value less than zero or greater than or equal to 1000 million.
The thing is, we shouldn't check the standards, we should check what we actually _do_.
And what we actually _do_ is to have a tv_nsec that is of type "long", and while we do have a
timespec64_valid(const struct timespec64 *ts)
and fs/utimes.c has a 'nsec_valid()' that apparently the utimes* family of system calls all do end up using, I'm more worried about something where we don't.
Because I'm almost certain that we've had cases where we just normalize things after-the-fact.
This all likely isn't a big deal, but it does end up worrying me if we then _depend_ on that granularity thing actually giving the proper granularity even for odd inputs. If they can happen.
Maybe we don't care?
Linus
On Wed, Jan 24, 2018 at 7:00 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Jan 24, 2018 at 3:56 AM, Arnd Bergmann arnd@arndb.de wrote:
On Tue, Jan 23, 2018 at 5:25 PM, Deepa Dinamani deepa.kernel@gmail.com wrote:
I checked POSIX again. There is no mention of tv_nsec being positive always for utimes. And, the long term plan is to replace all the callers of timespec_trunc() to use this new api instead for filesystems. So this will need to be fixed. I will fix this and post an update.
I found this on http://pubs.opengroup.org/onlinepubs/9699919799/functions/utimes.html
ERRORS These functions shall fail if: ... [EINVAL] Either of the times argument structures specified a tv_nsec value that was neither UTIME_NOW nor UTIME_OMIT, and was a value less than zero or greater than or equal to 1000 million.
The thing is, we shouldn't check the standards, we should check what we actually _do_.
The issue for tv_sec is that we don't do anything interesting at the moment, and that's bad.
- The Linux man page says we return -EINVAL for an out-of-range tv_sec, but the VFS code at the moment ends up running into an integer overflow, resulting in an arbitrary (i.e. file system specific) tv_sec when we read back a number that was set out of range.
- POSIX (IEEE Std 1003.1-2008) appears to say we should cap the tv_sec to the maximum supported value to update the inode /and/ return -EINVAL, which seems better than what we do today, but would be surprising behavior, as -EINVAL usually indicates that we did not do anything. My best guess is that this is not intentional in the spec and should not be done like that.
- Deepa's patch implements a third option, which is to cap to the maximum (or minimum for times before the fs specific epoch) and keep always returning success for utimensat() etc. This seems the most reasonable behavior.
Between these three, we really need to make a decision.
And what we actually _do_ is to have a tv_nsec that is of type "long", and while we do have a
timespec64_valid(const struct timespec64 *ts)
and fs/utimes.c has a 'nsec_valid()' that apparently the utimes* family of system calls all do end up using, I'm more worried about something where we don't.
Because I'm almost certain that we've had cases where we just normalize things after-the-fact.
This all likely isn't a big deal, but it does end up worrying me if we then _depend_ on that granularity thing actually giving the proper granularity even for odd inputs. If they can happen.
Maybe we don't care?
This part seems easy, while there are two aspects here, I think they each have just one answer:
- truncating the nanoseconds in the in-memory inode to the resolution supported by the file system is currently done by Linux (since 2.6.10). This behavior matches the Linux and POSIX documentation and is sensible, so there is no reason to change it. Before 2.6.10, we could end up with a timestamp moving backwards when an inode got evicted from the icache and loaded back from disk.
- the range nsec validation is currently correct, I double-checked the relevant corner cases. We have to be careful when we introduce the 64-bit time_t based system call to make sure we can deal with glibc using 32-bit padding in the upper bits. For 64-bit user space, we must keep returning -EINVAL when those bits are nonzero, but for 32-bit tasks (compat or native), the current plan is to ignore the padding and instead take only the lower 32-bit before performing the range check. Deepa has posted patches for this in the past, but that's a differnent series.
Arnd
POSIX.1 section for futimens, utimensat and utimes says: The file's relevant timestamp shall be set to the greatest value supported by the file system that is not greater than the specified time.
Clamp the timestamps accordingly before assignment.
Note that the clamp_t macro is used for clamping here as vfs is not yet using struct timespec64 internally. This is required for compilation purposes. Also note that clamp won't do the right thing for timestamps beyond 2038 on 32-bit machines until the vfs uses timespec64. After the vfs is transitioned to use timespec64 for timestamps, clamp_t() can be replaced by clamp().
Signed-off-by: Deepa Dinamani deepa.kernel@gmail.com --- fs/utimes.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/fs/utimes.c b/fs/utimes.c index e4b3d7c2c9f5..82fdcc3284b9 100644 --- a/fs/utimes.c +++ b/fs/utimes.c @@ -50,6 +50,7 @@ static int utimes_common(const struct path *path, struct timespec64 *times) int error; struct iattr newattrs; struct inode *inode = path->dentry->d_inode; + struct super_block *sb = inode->i_sb; struct inode *delegated_inode = NULL;
error = mnt_want_write(path->mnt); @@ -65,16 +66,24 @@ static int utimes_common(const struct path *path, struct timespec64 *times) if (times[0].tv_nsec == UTIME_OMIT) newattrs.ia_valid &= ~ATTR_ATIME; else if (times[0].tv_nsec != UTIME_NOW) { - newattrs.ia_atime.tv_sec = times[0].tv_sec; - newattrs.ia_atime.tv_nsec = times[0].tv_nsec; + newattrs.ia_atime.tv_sec = + clamp_t(time64_t, times[0].tv_sec, sb->s_time_min, sb->s_time_max); + if (times[0].tv_sec >= sb->s_time_max) + newattrs.ia_atime.tv_nsec = 0; + else + newattrs.ia_atime.tv_nsec = times[0].tv_nsec; newattrs.ia_valid |= ATTR_ATIME_SET; }
if (times[1].tv_nsec == UTIME_OMIT) newattrs.ia_valid &= ~ATTR_MTIME; else if (times[1].tv_nsec != UTIME_NOW) { - newattrs.ia_mtime.tv_sec = times[1].tv_sec; - newattrs.ia_mtime.tv_nsec = times[1].tv_nsec; + newattrs.ia_mtime.tv_sec = + clamp_t(time64_t, times[1].tv_sec, sb->s_time_min, sb->s_time_max); + if (times[1].tv_sec >= sb->s_time_max) + newattrs.ia_mtime.tv_nsec = 0; + else + newattrs.ia_mtime.tv_nsec = times[1].tv_nsec; newattrs.ia_valid |= ATTR_MTIME_SET; } /*