|
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9 |
|
| #
e2f48c48 |
| 07-May-2024 |
Petr Vorel <[email protected]> |
bcachefs: Move BCACHEFS_STATFS_MAGIC value to UAPI magic.h
Move BCACHEFS_STATFS_MAGIC value to UAPI <linux/magic.h> under BCACHEFS_SUPER_MAGIC definition (use common approach for name) and reuse the definition in bcachefs_format.h BCACHEFS_STATFS_MAGIC.
There are other bcachefs magic definitions: BCACHE_MAGIC, BCHFS_MAGIC, which use UUID_INIT() and are used only in libbcachefs. Therefore move only BCACHEFS_STATFS_MAGIC value, which can be used outside of libbcachefs for f_type field in struct statfs in statfs() or fstatfs().
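As a minimal userspace sketch of the use case described above, the following checks the f_type field returned by statfs(). The helper names magic_matches() and path_is_on_fs() are hypothetical, and the fallback value for BCACHEFS_SUPER_MAGIC is an assumption for build environments whose <linux/magic.h> predates this commit; verify it against your kernel headers.

```c
#include <stdbool.h>
#include <sys/vfs.h>      /* statfs(), struct statfs */
#include <linux/magic.h>  /* *_SUPER_MAGIC constants */

/* Assumed fallback for headers older than this commit; check your tree. */
#ifndef BCACHEFS_SUPER_MAGIC
#define BCACHEFS_SUPER_MAGIC 0xca451a4e
#endif

/*
 * Compare an f_type value with a magic constant. f_type is a signed
 * type on some architectures, so only the low 32 bits are compared.
 */
static bool magic_matches(unsigned long f_type, unsigned long magic)
{
	return (f_type & 0xffffffffUL) == (magic & 0xffffffffUL);
}

/* Returns 1 if path is on a filesystem with this magic, 0 if not, -1 on error. */
static int path_is_on_fs(const char *path, unsigned long magic)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;
	return magic_matches((unsigned long)sfs.f_type, magic) ? 1 : 0;
}
```

A caller would use it as path_is_on_fs("/mnt", BCACHEFS_SUPER_MAGIC); the same helper works with fstatfs() if the statfs() call is swapped for a descriptor-based one.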
Suggested-by: Su Yue <[email protected]> Signed-off-by: Petr Vorel <[email protected]> Acked-by: Brian Foster <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
|
|
Revision tags: v6.9-rc7, v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8, v6.8-rc7, v6.8-rc6, v6.8-rc5 |
|
| #
cb12fd8e |
| 12-Feb-2024 |
Christian Brauner <[email protected]> |
pidfd: add pidfs
This moves pidfds from the anonymous inode infrastructure to a tiny pseudo filesystem. This has been on my todo for quite a while as it will unblock further work that we weren't able to do simply because of the very justified limitations of anonymous inodes. Moving pidfds to a tiny pseudo filesystem allows:
* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of file->private_data. This means it is now possible to introduce concepts that operate on a process once all file descriptors have been closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct pid will refer to the same inode if it's opened multiple times. In contrast to now where each struct pid refers to the same inode. Even if we were to move to anon_inode_create_getfile() which creates new inodes we'd still be associating the same struct pid with multiple different inodes.
The tiny pseudo filesystem is not visible anywhere in userspace exactly like e.g., pipefs and sockfs. There's no lookup, there's no complex inode operations, nothing. Dentries and inodes are always deleted when the last pidfd is closed.
We allocate a new inode for each struct pid and we reuse that inode for all pidfds. We use iget_locked() to find that inode again based on the inode number which isn't recycled. We allocate a new dentry for each pidfd that uses the same inode. That is similar to anonymous inodes which reuse the same inode for thousands of dentries. For pidfds we're talking way less than that. There usually won't be a lot of concurrent openers of the same struct pid. They can probably often be counted on two hands. I know that systemd does use separate pidfd for the same struct pid for various complex process tracking issues. So I think with that things actually become way simpler. Especially because we don't have to care about lookup. Dentries and inodes continue to be always deleted.
The code is entirely optional and fairly small. If it's not selected we fall back to anonymous inodes. Heavily inspired by nsfs which uses a similar stashing mechanism just for namespaces.
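The inode-number comparison listed among the benefits above can be sketched from userspace as follows. This is an illustration under assumptions, not code from the patch: the helper names are hypothetical, fstat() stands in for statx() for brevity, and the fallback syscall number 434 is only used when the libc headers lack SYS_pidfd_open. On kernels without pidfs all pidfds share one anonymous inode, so the comparison would report "same" for any two pidfds there.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumed fallback for old headers */
#endif

/* Thin wrapper (hypothetical name) around the pidfd_open(2) system call. */
static int pidfd_open_raw(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Two stat results name the same object if device and inode both match. */
static bool same_inode(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/* Returns 1 if both pidfds refer to the same struct pid, 0 if not, -1 on error. */
static int pidfds_same_process(int fd_a, int fd_b)
{
	struct stat a, b;

	if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0)
		return -1;
	return same_inode(&a, &b) ? 1 : 0;
}
```

Two descriptors obtained from pidfd_open_raw(getpid()) would be expected to compare as the same process, since the same struct pid maps to the same inode.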
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
|
|
Revision tags: v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5, v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6, v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2 |
|
| #
68f2736a |
| 07-Jun-2022 |
Matthew Wilcox (Oracle) <[email protected]> |
mm: Convert all PageMovable users to movable_operations
These drivers are rather uncomfortably hammered into the address_space_operations hole. They aren't filesystems and don't behave like filesystems. They just need their own movable_operations structure, which we can point to directly from page->mapping.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
|
|
Revision tags: v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1 |
|
| #
c086df49 |
| 10-Jan-2022 |
Jeff Layton <[email protected]> |
fuse: move FUSE_SUPER_MAGIC definition to magic.h
...to help userland apps that need to identify FUSE mounts.
Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
|
| #
dea29037 |
| 11-Jan-2022 |
Jeff Layton <[email protected]> |
cifs: move superblock magic definitions to magic.h
Help userland apps to identify cifs and smb2 mounts.
Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Steve French <[email protected]>
|
| #
a0b3a15e |
| 10-Jan-2022 |
Jeff Layton <[email protected]> |
ceph: move CEPH_SUPER_MAGIC definition to magic.h
The uapi headers are missing the ceph definition. Move it there so userland apps can ID cephfs.
Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Revision tags: v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3 |
|
| #
1ed147e2 |
| 25-Nov-2021 |
Namjae Jeon <[email protected]> |
exfat: move super block magic number to magic.h
Move exfat superblock magic number from local definition to magic.h. It is also needed by userspace programs that call fstatfs().
Acked-by: Christian Brauner <[email protected]> Signed-off-by: Namjae Jeon <[email protected]>
|
|
Revision tags: v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1 |
|
| #
1507f512 |
| 08-Jul-2021 |
Mike Rapoport <[email protected]> |
mm: introduce memfd_secret system call to create "secret" memory areas
Introduce "memfd_secret" system call with the ability to create memory areas visible only in the context of the owning process and mapped neither into other processes nor into the kernel page tables.
The secretmem feature is off by default and the user must explicitly enable it at the boot time.
Once secretmem is enabled, the user will be able to create a file descriptor using the memfd_secret() system call. The memory areas created by mmap() calls from this file descriptor will be unmapped from the kernel direct map and they will be only mapped in the page table of the processes that have access to the file descriptor.
Secretmem is designed to provide the following protections:
* Enhanced protection (in conjunction with all the other in-kernel attack prevention systems) against ROP attacks. Secretmem makes "simple" ROP insufficient to perform exfiltration, which increases the required complexity of the attack. Along with other protections like the kernel stack size limit and address space layout randomization, which make finding gadgets really hard, the absence of any in-kernel primitive for accessing secret memory means the one-gadget ROP attack can't work. Since the only way to access secret memory is to reconstruct the missing mapping entry, the attacker has to recover the physical page and insert a PTE pointing to it in the kernel and then retrieve the contents. That takes at least three gadgets, which is a level of difficulty beyond most standard attacks.
* Prevent cross-process secret userspace memory exposures. Once the secret memory is allocated, the user can't accidentally pass it into the kernel to be transmitted somewhere. The secretmem pages cannot be accessed via the direct map and they are disallowed in GUP.
* Harden against exploited kernel flaws. In order to access secretmem, a kernel-side attack would need to either walk the page tables and create new ones, or spawn a new privileged userspace process to perform secrets exfiltration using ptrace.
The file descriptor based memory has several advantages over the "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). The file descriptor approach allows explicit and controlled sharing of the memory areas, and it allows operations to be sealed. Besides, file descriptor based memory paves the way for VMMs to remove the secret memory range from the userspace hypervisor process, for instance QEMU. Andy Lutomirski says:
"Getting fd-backed memory into a guest will take some possibly major work in the kernel, but getting vma-backed memory into a guest without mapping it in the host user address space seems much, much worse."
memfd_secret() is made a dedicated system call rather than an extension to memfd_create() because its purpose is to allow the user to create more secure memory mappings rather than to simply allow file based access to the memory. Nowadays the cost of a new system call is negligible, while it is way simpler for userspace to deal with clear-cut system calls than with a multiplexer or an overloaded syscall. Moreover, the initial implementation of memfd_secret() is completely distinct from memfd_create(), so there is not much sense in overloading memfd_create() to begin with. If there is a need for code sharing between these implementations, it can be easily achieved without a need to adjust user visible APIs.
The secret memory remains accessible in the process context using uaccess primitives, but it is not exposed to the kernel otherwise; secret memory areas are removed from the direct map and functions in the follow_page()/get_user_page() family will refuse to return a page that belongs to the secret memory area.
Once a use case arises that requires exposing secretmem to the kernel, it will be an opt-in request in the system call flags so that the user has to decide what data can be exposed to the kernel.
Removing of the pages from the direct map may cause its fragmentation on architectures that use large pages to map the physical memory which affects the system performance. However, the original Kconfig text for CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736 ("x86: add gbpages switches")) and the recent report [1] showed that "... although 1G mappings are a good default choice, there is no compelling evidence that it must be the only choice". Hence, it is sufficient to have secretmem disabled by default with the ability of a system administrator to enable it at boot time.
Pages in the secretmem regions are unevictable and unmovable to avoid accidental exposure of the sensitive data via swap or during page migration.
Since the secretmem mappings are locked in memory they cannot exceed RLIMIT_MEMLOCK. Since these mappings are already locked independently from mlock(), an attempt to mlock()/munlock() secretmem range would fail and mlockall()/munlockall() will ignore secretmem mappings.
However, unlike mlock()ed memory, secretmem currently behaves more like long-term GUP: secretmem mappings are unmovable mappings directly consumed by user space. With default limits, there is no excessive use of secretmem and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in the future this should be addressed to allow balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.
A page that was a part of the secret memory area is cleared when it is freed to ensure the data is not exposed to the next user of that page.
The following example demonstrates creation of a secret mapping (error handling is omitted):
    fd = memfd_secret(0);
    ftruncate(fd, MAP_SIZE);
    ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
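Since the example above omits error handling, a slightly fuller sketch may help. It is an illustration under assumptions: secret_map() is a hypothetical helper, the system call is invoked through syscall() because a libc wrapper may not exist, and the fallback number 447 (the x86-64 and asm-generic value) is only used when the headers lack SYS_memfd_secret. The call is expected to fail on kernels without the feature or when secretmem has not been enabled at boot.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447 /* assumed: x86-64 and asm-generic number */
#endif

/*
 * Map `size` bytes of secret memory. Returns the mapping, or NULL if
 * memfd_secret() is unavailable or disabled, or if any later step fails.
 * On success *fd_out holds the descriptor, which the caller must close.
 */
static void *secret_map(size_t size, int *fd_out)
{
	int fd = (int)syscall(SYS_memfd_secret, 0);
	void *ptr;

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return NULL;
	}
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return NULL;
	}
	*fd_out = fd;
	return ptr;
}
```

A caller should treat a NULL return as "secret memory not available" and fall back to ordinary memory or abort, depending on its threat model.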
[1] https://lore.kernel.org/linux-mm/[email protected]/
[[email protected]: suppress Kconfig whine]
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Hagen Paul Pfeifer <[email protected]> Acked-by: James Bottomley <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Elena Reshetova <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Bottomley <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rick Edgecombe <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tycho Andersen <[email protected]> Cc: Will Deacon <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: kernel test robot <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12, v5.12-rc8, v5.12-rc7, v5.12-rc6, v5.12-rc5, v5.12-rc4, v5.12-rc3, v5.12-rc2, v5.12-rc1, v5.12-rc1-dontuse, v5.11, v5.11-rc7, v5.11-rc6, v5.11-rc5, v5.11-rc4, v5.11-rc3, v5.11-rc2, v5.11-rc1, v5.10, v5.10-rc7, v5.10-rc6, v5.10-rc5, v5.10-rc4, v5.10-rc3, v5.10-rc2, v5.10-rc1, v5.9, v5.9-rc8, v5.9-rc7, v5.9-rc6, v5.9-rc5, v5.9-rc4, v5.9-rc3, v5.9-rc2, v5.9-rc1, v5.8, v5.8-rc7, v5.8-rc6, v5.8-rc5, v5.8-rc4, v5.8-rc3, v5.8-rc2, v5.8-rc1, v5.7, v5.7-rc7 |
|
| #
3234ac66 |
| 21-May-2020 |
Dan Williams <[email protected]> |
/dev/mem: Revoke mappings when a driver claims the region
Close the hole of holding a mapping over kernel driver takeover event of a given address range.
Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges") introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the kernel against scenarios where a /dev/mem user tramples memory that a kernel driver owns. However, this protection only prevents *new* read(), write() and mmap() requests. Established mappings prior to the driver calling request_mem_region() are left alone.
Especially with persistent memory, and the core kernel metadata that is stored there, there are plentiful scenarios for a /dev/mem user to violate the expectations of the driver and cause amplified damage.
Teach request_mem_region() to find and shoot down active /dev/mem mappings that it believes it has successfully claimed for the exclusive use of the driver. Effectively a driver call to request_mem_region() becomes a hole-punch on the /dev/mem device.
The typical usage of unmap_mapping_range() is part of truncate_pagecache() to punch a hole in a file, but in this case the implementation is only doing the "first half" of a hole punch. Namely it is just evacuating current established mappings of the "hole", and it relies on the fact that /dev/mem establishes mappings in terms of absolute physical address offsets. Once existing mmap users are invalidated they can attempt to re-establish the mapping, or attempt to continue issuing read(2) / write(2) to the invalidated extent, but they will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can block those subsequent accesses.
Cc: Arnd Bergmann <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Russell King <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges") Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Revision tags: v5.7-rc6, v5.7-rc5, v5.7-rc4, v5.7-rc3, v5.7-rc2, v5.7-rc1, v5.6, v5.6-rc7, v5.6-rc6, v5.6-rc5, v5.6-rc4, v5.6-rc3, v5.6-rc2, v5.6-rc1, v5.5, v5.5-rc7, v5.5-rc6, v5.5-rc5, v5.5-rc4 |
|
| #
8dcc1a9d |
| 25-Dec-2019 |
Damien Le Moal <[email protected]> |
fs: New zonefs file system
zonefs is a very simple file system exposing each zone of a zoned block device as a file. Unlike a regular file system with zoned block device support (e.g. f2fs), zonefs does not hide the sequential write constraint of zoned block devices from the user. Files representing sequential write zones of the device must be written sequentially starting from the end of the file (append only writes).
As such, zonefs is in essence closer to a raw block device access interface than to a full featured POSIX file system. The goal of zonefs is to simplify the implementation of zoned block device support in applications by replacing raw block device file accesses with a richer file API, avoiding relying on direct block device file ioctls which may be more obscure to developers. One example of this approach is the implementation of LSM (log-structured merge) tree structures (such as used in RocksDB and LevelDB) on zoned block devices by allowing SSTables to be stored in a zone file similarly to a regular file system rather than as a range of sectors of a zoned device. The introduction of the higher level construct "one file is one zone" can help reduce the amount of changes needed in the application as well as introduce support for different application programming languages.
Zonefs on-disk metadata is reduced to an immutable super block to persistently store a magic number and optional feature flags and values. On mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration and populates the mount point with a static file tree solely based on this information. E.g. file sizes come from the device zone type and write pointer offset managed by the device itself.
The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together under a common sub-directory:
   * For conventional zones, the sub-directory "cnv" is used.
   * For sequential write zones, the sub-directory "seq" is used.
   These two directories are the only directories that exist in zonefs. Users cannot create other directories and cannot rename nor delete the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size. Conventional zone files cannot be truncated.
4) The size of sequential zone files represents the file's zone write pointer position relative to the zone start sector. Truncating these files is allowed only down to 0, in which case the zone is reset to rewind the zone write pointer position to the start of the zone, or up to the zone size, in which case the file's zone is transitioned to the FULL state (finish zone operation).
5) Read and write operations to files are not allowed beyond the file zone size. Any access exceeding the zone size fails with the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations that can be issued to conventional zone files. Buffered, direct and mmap read & write operations are accepted. For sequential zone files, there are no restrictions on read operations, but all write operations must be direct IO append writes. mmap write of sequential files is not allowed.
Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional zones can be aggregated into a single larger file instead of the default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0 (root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.
The mkzonefs tool is used to format zoned block devices for use with zonefs. This tool is available on GitHub at:
git@github.com:damien-lemoal/zonefs-tools.git.
zonefs-tools also includes a test suite which can be run against any zoned block device, including null_blk block device created with zoned mode.
Example: the following formats a 15TB host-managed SMR HDD with 256 MB zones with the conventional zones aggregation feature enabled.
    $ sudo mkzonefs -o aggr_cnv /dev/sdX
    $ sudo mount -t zonefs /dev/sdX /mnt
    $ ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
The size of each zone file sub-directory indicates the number of files existing for that type of zone. In this example, there is only one conventional zone file (all conventional zones are aggregated under a single file).
    $ ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
This aggregated conventional zone file can be used as a regular file.
    $ sudo mkfs.ext4 /mnt/cnv/0
    $ sudo mount -o loop /mnt/cnv/0 /data
The "seq" sub-directory grouping files for sequential write zones has in this example 55356 zones.
    $ ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355
For sequential write zone files, the file size changes as data is appended at the end of the file, similarly to any regular file system.
    $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

    $ ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
The written file can be truncated to the zone size, preventing any further write operation.
    $ truncate -s 268435456 /mnt/seq/0
    $ ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
Truncation to 0 size allows freeing the file zone storage space and restart append-writes to the file.
    $ truncate -s 0 /mnt/seq/0
    $ ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
Since files are statically mapped to zones on the disk, the number of blocks of a file as reported by stat() and fstat() indicates the size of the file zone.
    $ stat /mnt/seq/0
      File: /mnt/seq/0
      Size: 0          Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
     Birth: -
The number of blocks of the file ("Blocks") in units of 512B blocks gives the maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone size in this example. Of note is that the "IO block" field always indicates the minimum IO size for writes and corresponds to the device physical sector size.
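The arithmetic above can be expressed as a small userspace sketch. It assumes only what the text states: st_blocks is reported in 512-byte units and gives the zone size, while st_size tracks the write pointer of a sequential zone file. The helper names are hypothetical and not part of zonefs or zonefs-tools.

```c
#include <stdint.h>
#include <sys/stat.h>

/* st_blocks counts 512-byte units, independent of the IO block size. */
static uint64_t zone_size_bytes(const struct stat *st)
{
	return (uint64_t)st->st_blocks * 512u;
}

/*
 * For a sequential zone file, st_size is the write pointer offset, so
 * the difference to the zone size is what can still be appended.
 */
static uint64_t zone_remaining_bytes(const struct stat *st)
{
	uint64_t cap = zone_size_bytes(st);
	uint64_t used = (uint64_t)st->st_size;

	return used < cap ? cap - used : 0;
}
```

With the stat output shown above (Blocks: 524288, Size: 0), zone_size_bytes() yields 268435456 bytes, i.e. 256 MB, and the whole zone remains available for append writes.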
This code contains contributions from: * Johannes Thumshirn <[email protected]>, * Darrick J. Wong <[email protected]>, * Christoph Hellwig <[email protected]>, * Chaitanya Kulkarni <[email protected]> and * Ting Yao <[email protected]>.
Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Dave Chinner <[email protected]>
|
|
Revision tags: v5.5-rc3, v5.5-rc2, v5.5-rc1, v5.4, v5.4-rc8, v5.4-rc7, v5.4-rc6 |
|
| #
fe030c9b |
| 31-Oct-2019 |
David Hildenbrand <[email protected]> |
powerpc/pseries/cmm: Implement balloon compaction
We can now get rid of the cmm_lock and completely rely on the balloon compaction internals, which now also manage the page list and the lock.
Inflated/"loaned" pages are now movable. Memory blocks that contain such pages can get offlined. Also, all such pages will be marked PageOffline() and can therefore be excluded in memory dumps using recent versions of makedumpfile.
Don't switch to balloon_page_alloc() yet (due to the GFP_NOIO). Will do that separately to discuss this change in detail.
Signed-off-by: David Hildenbrand <[email protected]> [mpe: Add isolated_pages-- in cmm_migratepage() as suggested by David] Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
|
|
Revision tags: v5.4-rc5, v5.4-rc4, v5.4-rc3, v5.4-rc2, v5.4-rc1, v5.3, v5.3-rc8, v5.3-rc7, v5.3-rc6 |
|
| #
47e4937a |
| 22-Aug-2019 |
Gao Xiang <[email protected]> |
erofs: move erofs out of staging
EROFS filesystem has been merged into linux-staging for a year.
EROFS is designed to be a better solution for saving extra storage space with guaranteed end-to-end performance for read-only files, with the help of reduced metadata, fixed-sized output compression and in-place decompression technologies.
In the past year, EROFS was greatly improved by many people as a staging driver, self-tested, beta-tested by a large number of our internal users, successfully applied to almost all in-service HUAWEI smartphones as part of EMUI 9.1 and proven to be stable enough to be moved out of staging.
EROFS is a self-contained filesystem driver. Although there are still some TODOs to make it more generic, we have a dedicated team actively working on EROFS in order to improve it along with the evolution of the Linux kernel, like the other in-kernel filesystems.
As Pavel suggested, it's better to do as one commit since git can do moves and all histories will be saved in this way.
Let's promote it from staging and enhance it more actively as a "real" part of the kernel for wider scenarios!
Cc: Greg Kroah-Hartman <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Pavel Machek <[email protected]> Cc: David Sterba <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Darrick J . Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jan Kara <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Chao Yu <[email protected]> Cc: Miao Xie <[email protected]> Cc: Li Guifu <[email protected]> Cc: Fang Wei <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Revision tags: v5.3-rc5, v5.3-rc4, v5.3-rc3, v5.3-rc2, v5.3-rc1, v5.2, v5.2-rc7, v5.2-rc6, v5.2-rc5 |
|
| #
ed63bb1d |
| 13-Jun-2019 |
Greg Hackmann <[email protected]> |
dma-buf: give each buffer a full-fledged inode
By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_SYS_ADMIN can get a lot of fine-grained data about how shmem buffers are shared among processes. stat(2) on each entry gives the caller a unique ID (st_ino), the buffer's size (st_size), and even the number of pages currently charged to the buffer (st_blocks / 512).
In contrast, all dma-bufs share the same anonymous inode. So while we can count how many dma-buf fds or mappings a process has, we can't get the size of the backing buffers or tell if two entries point to the same dma-buf. On systems with debugfs, we can get a per-buffer breakdown of size and reference count, but can't tell which processes are actually holding the references to each buffer.
Replace the singleton inode with full-fledged inodes allocated by alloc_anon_inode(). This involves creating and mounting a mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c.
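To illustrate the stat(2) fields mentioned above, here is a minimal userspace sketch that collects them for an already open descriptor. It is an assumption-laden illustration: struct fd_stat_info and describe_fd() are hypothetical names, fstat() on a local descriptor stands in for stat(2) on a /proc/*/fd entry, and the byte count relies on st_blocks being expressed in 512-byte units. On kernels where all dma-bufs share one anonymous inode, the reported inode number would be identical for every buffer.

```c
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical summary of the stat(2) fields named in the text. */
struct fd_stat_info {
	uint64_t ino;           /* st_ino: ID of the backing inode */
	uint64_t size;          /* st_size: object size in bytes */
	uint64_t bytes_charged; /* st_blocks * 512: bytes currently allocated */
};

/* Fill *out for an open descriptor. Returns 0 on success, -1 on error. */
static int describe_fd(int fd, struct fd_stat_info *out)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	out->ino = (uint64_t)st.st_ino;
	out->size = (uint64_t)st.st_size;
	out->bytes_charged = (uint64_t)st.st_blocks * 512u;
	return 0;
}
```

Comparing the ino field of two results is then enough to tell whether two descriptors point to the same buffer, which is the property the per-buffer inode provides.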
Signed-off-by: Greg Hackmann <[email protected]> Signed-off-by: Chenbo Feng <[email protected]> Signed-off-by: Sumit Semwal <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Revision tags: v5.2-rc4, v5.2-rc3, v5.2-rc2 |
|
| #
ea8157ab |
| 21-May-2019 |
David Howells <[email protected]> |
zsfold: Convert zsfold to use the new mount API
Convert the zsfold filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem.
See Documentation/filesystems/mount_api.txt for more information.
Signed-off-by: David Howells <[email protected]>
|
|
Revision tags: v5.2-rc1, v5.1, v5.1-rc7, v5.1-rc6, v5.1-rc5, v5.1-rc4, v5.1-rc3, v5.1-rc2, v5.1-rc1, v5.0, v5.0-rc8, v5.0-rc7, v5.0-rc6, v5.0-rc5, v5.0-rc4, v5.0-rc3, v5.0-rc2, v5.0-rc1, v4.20, v4.20-rc7 |
|
| #
3ad20fe3 |
| 14-Dec-2018 |
Christian Brauner <[email protected]> |
binder: implement binderfs
As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the implementation of binderfs.
/* Abstract */
binderfs is a backwards-compatible filesystem for Android's binder ipc mechanism. Each ipc namespace will mount a new binderfs instance. Mounting binderfs multiple times at different locations in the same ipc namespace will not cause a new super block to be allocated and hence it will be the same filesystem instance. Each new binderfs mount will have its own set of binder devices only visible in the ipc namespace it has been mounted in. All devices in a new binderfs mount will follow the scheme binder%d and numbering will always start at 0.

/* Backwards compatibility */
Devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES for the initial ipc namespace will work as before. They will be registered via misc_register() and appear in the devtmpfs mount. Specifically, the standard devices binder, hwbinder, and vndbinder will all appear in their standard locations in /dev. Mounting or unmounting the binderfs mount in the initial ipc namespace will have no effect on these devices, i.e. they will neither show up in the binderfs mount nor will they disappear when the binderfs mount is gone.

/* binder-control */
Each new binderfs instance comes with a binder-control device. No other devices will be present at first. The binder-control device can be used to dynamically allocate binder devices. All requests operate on the binderfs mount the binder-control device resides in. Assuming a new instance of binderfs has been mounted at /dev/binderfs via mount -t binderfs binderfs /dev/binderfs. Then a request to create a new binder device can be made as illustrated in [2]. Binderfs devices can simply be removed via unlink().

/* Implementation details */
- dynamic major number allocation:
  When binderfs is registered as a new filesystem it will dynamically allocate a new major number. The allocated major number will be returned in struct binderfs_device when a new binder device is allocated.
- global minor number tracking:
  Minor are tracked in a global idr struct that is capped at BINDERFS_MAX_MINOR. The minor number tracker is protected by a global mutex. This is the only point of contention between binderfs mounts.
- struct binderfs_info:
  Each binderfs super block has its own struct binderfs_info that tracks specific details about a binderfs instance:
  - ipc namespace
  - dentry of the binder-control device
  - root uid and root gid of the user namespace the binderfs instance was mounted in
- mountable by user namespace root:
  binderfs can be mounted by user namespace root in a non-initial user namespace. The devices will be owned by user namespace root.
- binderfs binder devices without misc infrastructure:
  New binder devices associated with a binderfs mount do not use the full misc_register() infrastructure. The misc_register() infrastructure can only create new devices in the host's devtmpfs mount. binderfs does however only make devices appear under its own mountpoint and thus allocates new character device nodes from the inode of the root dentry of the super block. This will have the side-effect that binderfs specific device nodes do not appear in sysfs. This behavior is similar to devpts allocated pts devices and has no effect on the functionality of the ipc mechanism itself.
[1]: https://goo.gl/JL2tfX
[2]: program to allocate a new binderfs binder device:

     #define _GNU_SOURCE
     #include <errno.h>
     #include <fcntl.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include <sys/ioctl.h>
     #include <sys/stat.h>
     #include <sys/types.h>
     #include <unistd.h>
     #include <linux/android/binder_ctl.h>

     int main(int argc, char *argv[])
     {
             int fd, ret, saved_errno;
             size_t len;
             struct binderfs_device device = { 0 };

             if (argc < 2)
                     exit(EXIT_FAILURE);

             len = strlen(argv[1]);
             if (len > BINDERFS_MAX_NAME)
                     exit(EXIT_FAILURE);

             memcpy(device.name, argv[1], len);

             fd = open("/dev/binderfs/binder-control", O_RDONLY | O_CLOEXEC);
             if (fd < 0) {
                     printf("%s - Failed to open binder-control device\n",
                            strerror(errno));
                     exit(EXIT_FAILURE);
             }

             ret = ioctl(fd, BINDER_CTL_ADD, &device);
             saved_errno = errno;
             close(fd);
             errno = saved_errno;
             if (ret < 0) {
                     printf("%s - Failed to allocate new binder device\n",
                            strerror(errno));
                     exit(EXIT_FAILURE);
             }

             printf("Allocated new binder device with major %d, minor %d, and "
                    "name %s\n", device.major, device.minor,
                    device.name);

             exit(EXIT_SUCCESS);
     }
Cc: Martijn Coenen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Acked-by: Todd Kjos <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Revision tags: v4.20-rc6, v4.20-rc5, v4.20-rc4, v4.20-rc3, v4.20-rc2, v4.20-rc1, v4.19 |
|
| #
dddde68b |
| 18-Oct-2018 |
Adam Borowski <[email protected]> |
xfs: add a define for statfs magic to uapi
Needed by userspace programs that call fstatfs().
It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two have identical values, they have different semantic meaning: one is an enum cookie meant for statfs, the other a signature of the on-disk format.
Signed-off-by: Adam Borowski <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
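As an illustration of the userspace use case this commit serves, the sketch below compares the f_type field returned by statfs(2) against XFS_SUPER_MAGIC from <linux/magic.h>. The helper names is_xfs_type() and path_is_xfs() are hypothetical, and the fallback #define is an assumption for building against headers older than this commit; it uses the same value as the on-disk signature, as the commit message notes.

```c
#include <sys/vfs.h>       /* statfs(2), struct statfs */
#include <linux/magic.h>   /* XFS_SUPER_MAGIC on headers that include this commit */

/* Fallback for older headers; same value as the on-disk XFS signature. */
#ifndef XFS_SUPER_MAGIC
#define XFS_SUPER_MAGIC 0x58465342
#endif

/* Hypothetical helper: returns 1 if f_type is the XFS statfs magic, else 0. */
static int is_xfs_type(unsigned long f_type)
{
	return f_type == XFS_SUPER_MAGIC;
}

/*
 * Hypothetical helper: returns 1 if path lives on XFS, 0 if it lives on
 * another filesystem, and -1 if statfs(2) fails (errno is left set).
 */
static int path_is_xfs(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;

	return is_xfs_type((unsigned long)sfs.f_type);
}
```

The same pattern works with fstatfs(2) on an open file descriptor and with any other *_SUPER_MAGIC constant exported in <linux/magic.h>.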
|
|
Revision tags: v4.19-rc8, v4.19-rc7, v4.19-rc6, v4.19-rc5, v4.19-rc4, v4.19-rc3, v4.19-rc2, v4.19-rc1, v4.18, v4.18-rc8, v4.18-rc7, v4.18-rc6, v4.18-rc5, v4.18-rc4, v4.18-rc3, v4.18-rc2, v4.18-rc1, v4.17, v4.17-rc7, v4.17-rc6, v4.17-rc5, v4.17-rc4, v4.17-rc3, v4.17-rc2, v4.17-rc1, v4.16, v4.16-rc7, v4.16-rc6, v4.16-rc5, v4.16-rc4, v4.16-rc3, v4.16-rc2, v4.16-rc1, v4.15, v4.15-rc9, v4.15-rc8, v4.15-rc7, v4.15-rc6, v4.15-rc5, v4.15-rc4, v4.15-rc3, v4.15-rc2, v4.15-rc1, v4.14, v4.14-rc8 |
|
| #
f044c884 |
| 02-Nov-2017 |
David Howells <[email protected]> |
afs: Lay the groundwork for supporting network namespaces
Lay the groundwork for supporting network namespaces (netns) to the AFS filesystem by moving various global features to a network-namespace struct (afs_net) and providing an instance of this as a temporary global variable that everything uses via accessor functions for the moment.
The following changes have been made:
(1) Store the netns in the superblock info. This will be obtained from the mounter's nsproxy on a manual mount and inherited from the parent superblock on an automount.
(2) The cell list is made per-netns. It can be viewed through /proc/net/afs/cells and also be modified by writing commands to that file.
(3) The local workstation cell is set per-ns in /proc/net/afs/rootcell. This is unset by default.
(4) The 'rootcell' module parameter, which sets a cell and VL server list modifies the init net namespace, thereby allowing an AFS root fs to be theoretically used.
(5) The volume location lists and the file lock manager are made per-netns.
(6) The AF_RXRPC socket and associated I/O bits are made per-ns.
The various workqueues remain global for the moment.
Changes still to be made:
(1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced from the old name.
(2) A per-netns subsys needs to be registered for AFS into which it can store its per-netns data.
(3) Rather than the AF_RXRPC socket being opened on module init, it needs to be opened on the creation of a superblock in that netns.
(4) The socket needs to be closed when the last superblock using it is destroyed and all outstanding client calls on it have been completed. This prevents a reference loop on the namespace.
(5) It is possible that several namespaces will want to use AFS, in which case each one will need its own UDP port. These can either be set through /proc/net/afs/cm_port or the kernel can pick one at random. The init_ns gets 7001 by default.
Other issues that need resolving:
(1) The DNS keyring needs net-namespacing.
(2) Where do upcalls go (eg. DNS request-key upcall)?
(3) Need something like open_socket_in_file_ns() syscall so that AFS command line tools attempting to operate on an AFS file/volume have their RPC calls go to the right place.
Signed-off-by: David Howells <[email protected]>
|
| #
6f52b16c |
| 01-Nov-2017 |
Greg Kroah-Hartman <[email protected]> |
License cleanup: add SPDX license identifier to uapi header files with no license
Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license.
By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file:
NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work".
otherwise syscall usage would not be possible.
Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched.
Reviewed-by: Kate Stewart <[email protected]> Reviewed-by: Philippe Ombredanne <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Revision tags: v4.14-rc7, v4.14-rc6, v4.14-rc5, v4.14-rc4, v4.14-rc3, v4.14-rc2, v4.14-rc1, v4.13, v4.13-rc7, v4.13-rc6, v4.13-rc5, v4.13-rc4, v4.13-rc3, v4.13-rc2, v4.13-rc1 |
|
| #
62aa81d7 |
| 06-Jul-2017 |
Fabian Frederick <[email protected]> |
ocfs2: use magic.h
Filesystems generally use SUPER_MAGIC values from magic.h instead of a local definition.
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Fabian Frederick <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Joseph Qi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v4.12, v4.12-rc7, v4.12-rc6, v4.12-rc5, v4.12-rc4, v4.12-rc3 |
|
| #
a481f4d9 |
| 25-May-2017 |
John Johansen <[email protected]> |
apparmor: add custom apparmorfs that will be used by policy namespace files
AppArmor policy needs to be able to be resolved based on the policy namespace a task is confined by. Add a base apparmorfs filesystem that (like nsfs) will exist as a kern mount and be accessed via jump_link through a securityfs file.
Setup the base apparmorfs fns and data, but don't use it yet.
Signed-off-by: John Johansen <[email protected]> Reviewed-by: Seth Arnold <[email protected]> Reviewed-by: Kees Cook <[email protected]>
|
|
Revision tags: v4.12-rc2, v4.12-rc1, v4.11, v4.11-rc8, v4.11-rc7, v4.11-rc6, v4.11-rc5, v4.11-rc4, v4.11-rc3, v4.11-rc2, v4.11-rc1, v4.10, v4.10-rc8, v4.10-rc7, v4.10-rc6, v4.10-rc5, v4.10-rc4, v4.10-rc3, v4.10-rc2, v4.10-rc1, v4.9, v4.9-rc8, v4.9-rc7, v4.9-rc6, v4.9-rc5, v4.9-rc4, v4.9-rc3 |
|
| #
5ff193fb |
| 28-Oct-2016 |
Fenghua Yu <[email protected]> |
x86/intel_rdt: Add basic resctrl filesystem support
Use kernfs as basis for our user interface filesystem. This patch supports mount/umount, and one mount parameter "cdp" to enable code/data prioritization (though all we do at this point is ensure that the system can support CDP). The file system is not populated yet in this patch.
[ tglx: Fixed up a few nits and added cdp handling in case of error ]
Signed-off-by: Fenghua Yu <[email protected]> Cc: "Ravi V Shankar" <[email protected]> Cc: "Tony Luck" <[email protected]> Cc: "Shaohua Li" <[email protected]> Cc: "Sai Prakhya" <[email protected]> Cc: "Peter Zijlstra" <[email protected]> Cc: "Stephane Eranian" <[email protected]> Cc: "Dave Hansen" <[email protected]> Cc: "David Carrillo-Cisneros" <[email protected]> Cc: "Nilay Vaish" <[email protected]> Cc: "Vikas Shivappa" <[email protected]> Cc: "Ingo Molnar" <[email protected]> Cc: "Borislav Petkov" <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Revision tags: v4.9-rc2, v4.9-rc1, v4.8, v4.8-rc8, v4.8-rc7, v4.8-rc6, v4.8-rc5, v4.8-rc4, v4.8-rc3, v4.8-rc2, v4.8-rc1 |
|
| #
3bc52c45 |
| 25-Jul-2016 |
Dan Williams <[email protected]> |
dax: define a unified inode/address_space for device-dax mappings
In support of enabling resize / truncate of device-dax instances, define a pseudo-fs to provide a unified inode/address space for vm operations.
Cc: Al Viro <[email protected]> Signed-off-by: Dan Williams <[email protected]>
|
| #
48b4800a |
| 26-Jul-2016 |
Minchan Kim <[email protected]> |
zsmalloc: page migration support
This patch introduces run-time migration feature for zspage.
For migration, VM uses page.lru field so it would be better to not use page.next field which is unified with page.lru for own purpose. For that, firstly, we can get first object offset of the page via runtime calculation instead of using page.index so we can use page.index as link for page chaining instead of page.next.
In case of huge object, it stores handle to page.index instead of next link of page chaining because huge object doesn't need to next link for page chaining. So get_next_page need to identify huge object to return NULL. For it, this patch uses PG_owner_priv_1 flag of the page flag.
For migration, it supports three functions
* zs_page_isolate
It isolates a zspage which includes a subpage VM want to migrate from class so anyone cannot allocate new object from the zspage.
We could try to isolate a zspage by the number of subpage so subsequent isolation trial of other subpage of the zspage shouldn't fail. For that, we introduce zspage.isolated count. With that, zs_page_isolate can know whether zspage is already isolated or not for migration so if it is isolated for migration, subsequent isolation trial can be successful without trying further isolation.
* zs_page_migrate
First of all, it holds write-side zspage->lock to prevent migrate other subpage in zspage. Then, lock all objects in the page VM want to migrate. The reason we should lock all objects in the page is due to race between zs_map_object and zs_page_migrate.
zs_map_object                              zs_page_migrate

pin_tag(handle)
obj = handle_to_obj(handle)
obj_to_location(obj, &page, &obj_idx);

                                           write_lock(&zspage->lock)
                                           if (!trypin_tag(handle))
                                                   goto unpin_object

zspage = get_zspage(page);
read_lock(&zspage->lock);
If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can be stale by migration so it goes crash.
If it locks all of objects successfully, it copies content from old page to new one, finally, create new zspage chain with new page. And if it's last isolated subpage in the zspage, put the zspage back to class.
* zs_page_putback
It returns isolated zspage to right fullness_group list if it fails to migrate a page. If it find a zspage is ZS_EMPTY, it queues zspage freeing to workqueue. See below about async zspage freeing.
This patch introduces asynchronous zspage free. The reason to need it is we need page_lock to clear PG_movable but unfortunately, zs_free path should be atomic so the approach is to try to grab page_lock. If it gets page_lock of all of the pages successfully, it can free zspage immediately. Otherwise, it queues a free request and frees the zspage via workqueue in process context.
If zs_free finds the zspage is isolated when it tries to free the zspage, it delays the freeing until zs_page_putback finds it, so it will free the zspage finally.
In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL. First of all, it will use ZS_EMPTY list for delay freeing. And with adding ZS_FULL list, it makes to identify whether zspage is isolated or not via list_empty(&zspage->list) test.
[[email protected]: zsmalloc: keep first object offset in struct page] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: zsmalloc: zspage sanity check] Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
| #
b1123ea6 |
| 26-Jul-2016 |
Minchan Kim <[email protected]> |
mm: balloon: use general non-lru movable page feature
Now, VM has a feature to migrate non-lru movable pages so balloon doesn't need custom migration hooks in migrate.c and compaction.c.
Instead, this patch implements the page->mapping->a_ops-> {isolate|migrate|putback} functions.
With that, we could remove hooks for ballooning in general migration functions and make balloon compaction simple.
[[email protected]: compaction.h requires that the includer first include node.h] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gioh Kim <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Rafael Aquini <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
|
|
Revision tags: v4.7, v4.7-rc7, v4.7-rc6, v4.7-rc5, v4.7-rc4, v4.7-rc3, v4.7-rc2, v4.7-rc1, v4.6, v4.6-rc7, v4.6-rc6 |
|
| #
2a28900b |
| 28-Apr-2016 |
Jan Kara <[email protected]> |
udf: Export superblock magic to userspace
Currently UDF superblock magic doesn't appear in any userspace header files and thus userspace apps have hard time checking for this fs. Let's export the magic to userspace as with any other filesystem.
Signed-off-by: Jan Kara <[email protected]>
|