btrfs-progs: offline filesystem resize feature #1007

loemraw · 2025-07-29T16:07:05Z

This patch introduces the ability to resize a btrfs filesystem while it is not mounted via a new --offline flag. Currently only increasing the size of the filesystem is supported, though I believe it would be possible to implement shrinking the filesystem to the end of the last device extent.

This is a more general, and hopefully more useful, solution to the problem I was trying to solve with the
("btrfs-progs: add slack space for mkfs --shrink") patch. This patch should enable users to resize a filesystem without the higher capabilities needed for mounting a filesystem.

cmds/filesystem.c

Forza-tng · 2025-07-29T16:43:17Z

Will this allow for resizing a fs that would normally go ro due to enospc during mount?

loemraw · 2025-07-29T20:00:29Z

Will this allow for resizing a fs that would normally go ro due to enospc during mount?

Yeah, I don't see why not!

adam900710 · 2025-07-29T22:12:50Z

Will this allow for resizing a fs that would normally go ro due to enospc during mount?
Yeah, I don't see why not!

Unfortunately not that simple. If the fs has metadata full, it may still fail.
As this is still using regular transaction protection, metadata still need space for COW.

Thankfully this one is only mostly modifying chunk tree, which is not that common to exhaust system chunks.
But your btrfs_start_transaction() is using fs root, which is the wrong target, and can lead to unexpected ENOSPC error.

Please use chunk root instead since you're only modifying chunk tree.

Furthermore there are a lot of code style problems, like unexpected new lines splitting variable definitions (e.g. between new_size and old_total inside offline_resize()).
IIRC you can use checkpatch.pl from the kernel, as we mostly follow the same code style here.

And the parameter list for that function is also pretty bad. I'd say normally we put more important parameter (like path, which specifies a whole fs) first, and more specific parameters like new size after that.
Furthermore, the amount parameter as char doesn't make much sense, why not just parse the value and directly pass a s64 instead?

adam900710 · 2025-07-29T22:31:32Z

And a lot of error handling is missing.

E.g. if btrfs_update_device() failed, you should abort the current transaction other than continue to commit the transaction.
This also gets rid of the unnecessary ret2 variable.

loemraw · 2025-07-30T00:17:08Z

Will this allow for resizing a fs that would normally go ro due to enospc during mount?
Yeah, I don't see why not!

Unfortunately not that simple. If the fs has metadata full, it may still fail. As this is still using regular transaction protection, metadata still need space for COW.

Ahh yeah, that makes sense.

Thankfully this one is only mostly modifying chunk tree, which is not that common to exhaust system chunks. But your btrfs_start_transaction() is using fs root, which is the wrong target, and can lead to unexpected ENOSPC error.

Good catch...

I don't have the best understanding of btrfs transactions. When I was testing this patch it appeared like the filesystem was being resized correctly, why would this be the case if I'm passing the wrong target? Was it making changes to the wrong tree? Was I just getting lucky?

Please use chunk root instead since you're only modifying chunk tree.

Will do.

Furthermore there are a lot of code style problems, like unexpected new lines splitting variable definitions (e.g. between new_size and old_total inside offline_resize()). IIRC you can use checkpatch.pl from the kernel, as we mostly follow the same code style here.

Will check this in v2.

And the parameter list for that function is also pretty bad. I'd say normally we put more important parameter (like path, which specifies a whole fs) first, and more specific parameters like new size after that. Furthermore, the amount parameter as char doesn't make much sense, why not just parse the value and directly pass a s64 instead?

I'll rethink the parameter ordering.

I pass the amount parameter as a char because I call check_offline_resize_args inside offline_resize and I wanted to keep all of the argument logic together.

loemraw · 2025-07-30T00:18:03Z

And a lot of error handling is missing.

E.g. if btrfs_update_device() failed, you should abort the current transaction other than continue to commit the transaction. This also gets rid of the unnecessary ret2 variable.

Ok, makes sense will fix in v2.

adam900710 · 2025-07-30T00:56:09Z

I don't have the best understanding of btrfs transactions. When I was testing this patch it appeared like the filesystem was being resized correctly, why would this be the case if I'm passing the wrong target? Was it making changes to the wrong tree? Was I just getting lucky?

The target root for btrfs_start_transaction() is only for space reservation purpose. For btrfs-progs, it's only checking if we have enough space. So a wrong root here won't cause a problem, until the metadata space is almost exhausted.
In that case, btrfs_start_transaction() will return ENOSPC.

I pass the amount parameter as a char because I call check_offline_resize_args inside offline_resize and I wanted to keep all of the argument logic together.

You can check how other codes in btrfs-progs is doing.
For most cases, if we want to pass a size (or a diff of size), we would pass u64 or s64 instead, and do the parsing before calling the final function.

There are exceptions like the existing check_resize_args() where amount can be the string cancel.
But for most cases, the string parsing is done as soon as possible.
E.g. the size variable inside cmd_filesystem_mkswapfile().

In your particular case, we won't support anything other than a number, thus parsing it early will be a more common solution.

BTW, for your update, you can just force push the same branch. No need to create a new PR.

loemraw · 2025-07-31T16:49:51Z

I pass the amount parameter as a char because I call check_offline_resize_args inside offline_resize and I wanted to keep all of the argument logic together.

You can check how other codes in btrfs-progs is doing. For most cases, if we want to pass a size (or a diff of size), we would pass u64 or s64 instead, and do the parsing before calling the final function.

There are exceptions like the existing check_resize_args() where amount can be the string cancel. But for most cases, the string parsing is done as soon as possible. E.g. the size variable inside cmd_filesystem_mkswapfile().

In your particular case, we won't support anything other than a number, thus parsing it early will be a more common solution.

In this case there is a difference between passing "+1G" and "1G" that would be lost if I just passed a u64. "+1G" increases the filesystem size by 1G and "1G" sets the filesystem size to 1G. I could pass a u64 along with a boolean to indicate whether the size is relative to the existing filesystem size, but it feels like this logic would be better encapsulated inside check_offline_resize_args.

Zygo · 2025-07-31T19:03:35Z

Will this allow for resizing a fs that would normally go ro due to enospc during mount?

An offline tool that increases a device's size--and does nothing else--is much more likely to succeed in that situation than mount + fi-resize, but it can still fail in one really specific case.

mount does a number of things that can result in enospc failure. Off the top of my head: orphan inode reclaim, the snapshot cleaner thread, tearing down an incomplete reloc tree, and resuming a balance (this is the only one that can be turned off by a mount option). Avoiding these other tasks makes success far more likely.

The chunk tree is stored in the system chunk, which is a dedicated contiguous storage area that is usually already large enough for two copies of the chunk tree. To run out of space there, you'd need hundreds of thousands of chunks, like a single profile 100 TiB filesystem that has no unallocated space at the moment that the chunk tree needs to be enlarged, with a system chunk that still has the original 32MB size from mkfs. On the other hand, while that's a rare case, it would be an especially painful one that you can't fix with this method.

Users can run into this case fairly often when they have filesystems that are close to this threshold size (100 TiB or multiples thereof). A smaller filesystem doesn't allocate enough chunks to require a new system chunk, while a bigger filesystem would have unallocated space when it needs a new system chunk. It also comes up if you're running any striped profile (raid0, raid10, raid5, or raid6) and you don't do something to reduce dev_extent fragmentation--I hit this with a 26T filesystem once, that had 250k chunks after several drive upgrades and naive balances.

If you're willing to accept overwriting a metadata page in place, you can resize a device by finding the page in the chunk tree where the device item is, changing the size field in the dev item, and updating the csum (in addition to the superblock changes, which are already overwrite-in-place, and repeated across all mirrors). Increasing or decreasing a device size (without relocating any chunks) doesn't require any new allocations.

adam900710 · 2025-07-31T22:33:14Z

In this case there is a difference between passing "+1G" and "1G" that would be lost if I just passed a u64. "+1G" increases the filesystem size by 1G and "1G" sets the filesystem size to 1G.

Just use s64

loemraw · 2025-08-06T21:31:19Z

Force pushed with a v2

use chunk root instead of fs root
fix offline resize error handling
fix variable declarations to not have newlines
fix parameter list to have more important arguments first

In this case there is a difference between passing "+1G" and "1G" that would be lost if I just passed a u64. "+1G" increases the filesystem size by 1G and "1G" sets the filesystem size to 1G.

Just use s64

I'm still passing the amount as a cstring because a s64 doesn't convey the difference between "+1G" and "1G".

maharmstone · 2025-08-22T09:59:26Z

cmds/filesystem.c

+		/* For target sizes without +/- sign prefix (e.g. 1:150g) */
+		if (mod == 0) {
+			new_size = diff;
+		} else if (mod > 0) {


The bot is right here. mod should be a bool, it's only ever 0 or 1

loemraw · 2025-08-13T17:45:26Z

@adam900710 ping for v2 review

Use SUBVOL_SYNC_WAIT ioctl for 'btrfs subvolume sync' command before checking periodically and add an option to not use sync wait ioctl call and force to check periodically. This patch calls a new function wait_for_subvolume_sync() that calls BTRFS_IOC_SUBVOL_SYNC_WAIT for each subvol. Issue: kdave#953 Pull-request: kdave#989 Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> Signed-off-by: David Sterba <dsterba@suse.com>

In the upstream equivalent of btrfs_remove_block_group(), the function remove_block_group_free_space() is called to delete free spaces associated with the block group being freed. However, this function is defined in btrfs-progs but not called anywhere. To address this issue, I added a call to remove_block_group_free_space() in btrfs_remove_block_group(). This ensures that the free spaces are properly deleted when a block group is removed. I also added a check to remove_block_group_free_space to make sure that free-space-tree is enabled. Signed-off-by: Leo Martins <loemra.dev@gmail.com>

check_free_space_tree() is intended to check that the free-space tree doesn't contain any entries which don't correspond to a block group. Unfortunately it calls btrfs_lookup_first_block_group(), which only returns NULL if the search offset is larger than the start of any block group. Fix check_free_space_tree() to use btrfs_lookup_block_group() as was obviously intended, and fix fsck-tests/054-orphan-directory to get rid of its spurious free-space entries that this now diagnoses. Signed-off-by: Mark Harmstone <mark@harmstone.com>

There is a note about a bug that mount a fs RO first, then remount it RW, will make btrfs to skip the orphan roots cleanup. However it's no longer the case after kernel commit 44c0ca211a4d ("btrfs: lift read-write mount setup from mount and remount"), as that commit unify the pre-RW mount checks, and will always do the orphan roots cleanup. Just update the note so that it won't cause any confusion. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com>

[BEHAVIOR CHANGE] In the incoming v6.17 kernel release, due to the changes in commit 40426dd147ff ("btrfs: use the super_block as holder when mounting file systems"), we can no longer mount a seed device through both sprouted fs and the seed device. E.g. # mkfs.btrfs -f /dev/test/scratch1 # mount /dev/test/scratch1 /mnt/btrfs # xfs_io -f -c "pwrite 0 16m" /mnt/btrfs/foobar # umount /mnt/btrfs # btrfstune -S1 /dev/test/scratch1 # mount /dev/test/scratch1 /mnt/btrfs # btrfs device add /dev/test/scratch2 /mnt/btrfs Now the sprouted fs is mount, but if one wants to mount the seed device, it will fail: # mount /dev/test/scratch1 /mnt/btrfs/ mount: /mnt/btrfs: /dev/mapper/test-scratch1 already mounted or mount point busy. dmesg(1) may have more information after failed mount system call. The only new dmesg is: BTRFS error: failed to open device for path /dev/mapper/test-scratch1 with flags 0x23: -16 [CAUSE] After that kernel commit, each block device will have its own unique holder (super block). This super block block device holder is critical to pass device events like freeze/thaw/missing to each filesystem. And after the seed device is sprouted, the holder for the seed device is the super block of the sprouted fs. But if some one else tries to mount the seed device again, it will be a new super block as the holder passed into bdev_file_open_by_path(). Since the seed device already has a different holder, this new bdev_file_open_by_path() will fail with -EBUSY. [ENHANCEMENT] This is a kernel behavior change, but considering the benefit (proper bdev events passing into the fs) I'd say the old behavior is more like a hack or a coincidence, other than a properly designed behavior. So update the documentation to make it more explicit. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com>

…erent filesystems Since kernel commit 40426dd147ff ("btrfs: use the super_block as holder when mounting file systems"), the kernel will not allow a block device belonging to two different filesystems. This means a seed device can only be mounted through either the sprouted fs, or the seed device, not both at the same time. Although a seed device can still be shared between different sprouted fs, only one of those fs can be mounted. Considering the extra benefit (extra protection, better device events handling), it's worthy to do the kernel behavior change. Update the test case to follow the new limits. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com>

Currently we do the validation of subvolume parameters inside mkfs/main.c, but considering all things like structure rootdir_subvol are all inside rootdir.[ch], it's better to move the validation part into rootdir.[ch]. Extract the validation part into a helper, btrfs_mkfs_validate_subvols(), into rootdir.[ch]. Furthermore since we're here, also slightly enhance the duplicated subvolume check, so that the runtime is halved. Signed-off-by: Qu Wenruo <wqu@suse.com>

helper Currently we do the validation of inode flag parameters inside mkfs/main.c, but considering all things like structure rootdir_inode_flags_entry are all inside rootdir.[ch], it's better to move the validation part into rootdir.[ch]. Extract the validation part into a helper, btrfs_mkfs_validate_inode_flags(), into rootdir.[ch]. Signed-off-by: Qu Wenruo <wqu@suse.com>

Currently for --subvol parameter, we save the full path of the host fs into rootdir_subvol sturcture, then compare each inode we hit with that saved full path. This string comparison can be time consuming (up to PATH_MAX characters), and we're doing it for every directory inode. On the other hand, nftw() also provides a stat structure of the current inode, stat::st_dev and stat::st_ino pair can always uniquely locate an inode in the host filesystem. With that said, we can just save the st_dev/st_ino pair when validating the subvol parameters, and use st_dev/st_ino to check if we hit the target subvolume. This has two benefits: - Reduce the memory usage of each rootdir_subvol Now we need no full_path member, replacing it with two u64. This saves (4K - 16) bytes per rootdir_subvol. - Reduce the runtime of direct inode comparison Instead of doing strcmp() for up to 4K bytes, it's just two u64 comparison. Signed-off-by: Qu Wenruo <wqu@suse.com>

Currently for --inode-flags parameter, we save the full path of the host fs into rootdir_inode_flags_entry sturcture, then compare each inode we hit with that saved full path. This string comparison can be time consuming (up to PATH_MAX characters), and we're doing it for every inode. On the other hand, nftw() also provides a stat structure of the current inode, stat::st_dev and stat::st_ino pair can always uniquely locate an inode in the host filesystem. With that said, we can just save the st_dev/st_ino pair when validating the inode flag parameters, and use st_dev/st_ino to check if we hit the target inode. This has two benefits: - Reduce the memory usage of each rootdir_inode_flags_entry Now we need no full_path member, replacing it with two u64. This saves (4K - 16) bytes per rootdir_inode_flags_entry. - Reduce the runtime of inode comparison Instead of doing strcmp() for up to 4K bytes, it's just two u64 comparison. Signed-off-by: Qu Wenruo <wqu@suse.com>

With the recent enhancements to --subvol and --inode-flags options, the memory usage for them are just 4K+, and even for the older 8K+ memory usage, it's really hard to consider them as heavy memory usage. E.g. even if the end user specified over 1000 entries of --subvol/--inode-flags, it only takes a little over 8MiB for the older code or over 4MiB for the newer code. Since we're in the user space and the year is 2025 not 1995, such memory usage is far from heavy. Just remove the paranoid note from the man page. Signed-off-by: Qu Wenruo <wqu@suse.com>

This patch introduces the ability to resize a btrfs filesystem while it is not mounted via a new `--offline` flag. Currently only increasing the size of the filesystem is supported, though I believe it would be possible to implement shrinking the filesystem to the end of the last device extent. This is a more general, and hopefully more useful, solution to the problem I was trying to solve with the ("btrfs-progs: add slack space for mkfs --shrink") patch. This patch should enable users to resize a filesystem without the higher capabilities needed for mounting a filesystem. Signed-off-by: Leo Martins <loemra.dev@gmail.com> --- Changelog: v1->v2: - use chunk root instead of fs root - fix offline resize error handling - fix variable declarations to not have newlines - fix parameter list to have more important arguments first

maharmstone · 2025-08-22T09:41:32Z

cmds/filesystem.c

+	return 0;
+}
+
+static int offline_resize(const char *path, const char *amount)


Maybe just return bool, if you don't particularly care about the actual return value? It'd simplify some of the code.
(bool is a relatively recent addition to C, so there's still a historic tendency to use int for the same thing in some places)

maharmstone · 2025-08-22T09:44:41Z

cmds/filesystem.c

+
+	root = open_ctree(path, 0, OPEN_CTREE_WRITES | OPEN_CTREE_CHUNK_ROOT_ONLY);
+	if (!root) {
+		error("could not open file at %s\n"


You might not even need to print an error here, it looks like open_ctree is already quite chatty. But if you did keep it I think you need to make it clear it works with block devices as well as files

maharmstone · 2025-08-22T09:48:17Z

cmds/filesystem.c

+		goto close;
+	}
+
+	ret = snprintf(dev_name, BTRFS_VOL_NAME_MAX, "%s", device->name);


sprintf with %s is a code smell - that's just a more complicated strcpy. strncpy is the particular function you're looking for here.
But I'd just move it off the stack and use strdup, that way you don't have to worry about it being too long. Don't forget to free it later.

maharmstone · 2025-08-22T09:48:49Z

cmds/filesystem.c

+		btrfs_abort_transaction(trans, ret);
+		goto close;
+	}
+	ret |= btrfs_commit_transaction(trans, root);


No need for |=, ret is guaranteed to be 0 here

maharmstone · 2025-08-22T09:49:17Z

cmds/filesystem.c

+
+	if (path_is_reg_file(dev_name))
+		ret = truncate(dev_name, new_size);
+	if (ret)


I'd move this check into the scope of the if on line 1520

maharmstone · 2025-08-22T09:56:19Z

cmds/filesystem.c

+
+	if (devstr && !dev_found) {
+		/* Devid specified but not found. */
+		error("cannot find devid: %lld", devid);


devid is unsigned, this should be %llu. The -Wformat-signedness warning will diagnose this

maharmstone · 2025-08-22T09:57:45Z

cmds/filesystem.c

+				new_size = 0;
+			} else {
+				new_size = stat_buf.st_size;
+			}


Nit-picking, but the usual kernel style is that you wouldn't need curly brackets here

maharmstone · 2025-08-22T09:59:26Z

cmds/filesystem.c

+		/* For target sizes without +/- sign prefix (e.g. 1:150g) */
+		if (mod == 0) {
+			new_size = diff;
+		} else if (mod > 0) {


The bot is right here. mod should be a bool, it's only ever 0 or 1

maharmstone · 2025-08-22T10:00:13Z

cmds/filesystem.c

+
+	if (new_size < 256 * SZ_1M)
+		warning("the new size %lld (%s) is < 256MiB, this may be rejected by kernel",
+			new_size, pretty_size_mode(new_size, UNITS_DEFAULT));


maharmstone · 2025-08-22T10:00:25Z

cmds/filesystem.c

+		warning("the new size %lld (%s) is < 256MiB, this may be rejected by kernel",
+			new_size, pretty_size_mode(new_size, UNITS_DEFAULT));
+
+	pr_verbose(LOG_DEFAULT, "Resize device id %lld from %s to %s\n", devid,


github-advanced-security bot found potential problems Jul 29, 2025

View reviewed changes

cmds/filesystem.c Fixed Show fixed Hide fixed

adam900710 force-pushed the devel branch from 2eaf551 to a92826e Compare August 2, 2025 04:22

loemraw force-pushed the feature-offline-resize branch from d7e49b7 to 44e653a Compare August 6, 2025 21:27

github-advanced-security bot found potential problems Aug 6, 2025

View reviewed changes

kdave force-pushed the devel branch 2 times, most recently from f660864 to a452b1e Compare August 8, 2025 17:39

kdave force-pushed the devel branch from 0ad9c81 to 9c1d53d Compare August 13, 2025 18:05

loemraw and others added 10 commits August 15, 2025 07:36

loemraw force-pushed the feature-offline-resize branch 2 times, most recently from b62465f to f4799f1 Compare August 21, 2025 23:00

loemraw force-pushed the feature-offline-resize branch from f4799f1 to 7a8c4e5 Compare August 21, 2025 23:15

maharmstone reviewed Aug 22, 2025

View reviewed changes

kdave force-pushed the devel branch 2 times, most recently from 05e416f to 539b0cb Compare August 27, 2025 02:30

btrfs-progs: offline filesystem resize feature #1007

Are you sure you want to change the base?

btrfs-progs: offline filesystem resize feature #1007

Uh oh!

Conversation

loemraw commented Jul 29, 2025

Uh oh!

Uh oh!

Forza-tng commented Jul 29, 2025

Uh oh!

loemraw commented Jul 29, 2025

Uh oh!

adam900710 commented Jul 29, 2025

Uh oh!

adam900710 commented Jul 29, 2025

Uh oh!

loemraw commented Jul 30, 2025

Uh oh!

loemraw commented Jul 30, 2025

Uh oh!

adam900710 commented Jul 30, 2025

Uh oh!

loemraw commented Jul 31, 2025

Uh oh!

Zygo commented Jul 31, 2025

Uh oh!

adam900710 commented Jul 31, 2025

Uh oh!

loemraw commented Aug 6, 2025

Uh oh!

Check warning

Choose a reason for hiding this comment

Uh oh!

loemraw commented Aug 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!