The Linux Kernel/Storage

Storage functionality provides access to various storage devices via files and directories. Most storage is persistent: flash memory, SSDs and legacy hard disks. Other storage is temporary. The file system provides an abstraction that organizes information into separate pieces of data (called files), each identified by a unique name. Each file system type defines its own structures and logic rules used to manage these groups of information and their names. Linux supports a plethora of different file system types: local and remote, native and from other operating systems. To accommodate such disparity the kernel defines a common top layer, the virtual file system (VFS) layer.



Files and directories
Four basic file-access system calls:
 * open ↪ - opens a file by name and returns a file descriptor (fd). The functions below operate on an fd.

A file in Linux and UNIX is not only a physical file on persistent storage. The file interface is also used to access pipes, sockets and other pseudo-files.

🔧 TODO
 * – manipulate file descriptor

⚙️ Files and directories internals

📚 Files and directories references
 * Input/Output, The GNU C Library
 * VFS in Linux Kernel 2.4 Internals

File locks
File locks are mechanisms that allow processes to coordinate access to shared files. These locks help prevent conflicts when multiple processes or threads attempt to access the same file simultaneously.

💾 Historical: The mandatory locking feature is no longer supported in Linux 5.15 and above because the implementation was unreliable.

⚲ API
 * – list local system locks
 * – apply, test or remove a POSIX lock on an open file
 * – apply or remove an advisory BSD lock on an open file
 * – manipulate file descriptor
 * – advisory record lock
 * – Open File Description Lock
 * – lock parameters

⚙️ Internals

Asynchronous I/O
🚀 advanced features

AIO


 * https://lwn.net/Kernel/Index/#Asynchronous_IO



🌱 New since release 5.1 in May 2019


 * https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
 * https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/
 * https://lwn.net/Kernel/Index/#io_uring
 * io_uring, SCM_RIGHTS, and reference-count cycles
 * The rapid growth of io_uring
 * Automatic buffer selection for io_uring
 * Operations restrictions for io_uring
 * Redesigned workqueues for io_uring

Allow non-blocking access to multiple file descriptors.

Efficient event polling 

⚲ API:

⚙️ Internals:


💾 Historical: The select and poll system calls are derived from UNIX.

⚲ API:

⚙️ Internals:

Vectored I/O
🚀 advanced feature

Vectored I/O, also known as scatter/gather I/O, is a method of input and output by which a single procedure call sequentially reads data from multiple buffers and writes it to a single data stream, or reads data from a data stream and writes it to multiple buffers, as defined in a vector of buffers. Scatter/gather refers to the process of gathering data from, or scattering data into, the given set of buffers. Vectored I/O can operate synchronously or asynchronously. The main reasons for using vectored I/O are efficiency and convenience.

⚲ API:

⚙️ Internals:
 * ↯ call hierarchy:
 * ↯ call hierarchy:

📚 References
 * Fast Scatter-Gather I/O, The GNU C Library
 * https://lwn.net/Kernel/Index/#Vectored_IO
 * https://lwn.net/Kernel/Index/#Scattergather_chaining

Virtual File System
The virtual file system (VFS) is an abstract layer on top of a concrete logical file system. The purpose of a VFS is to allow client applications to access different types of logical file systems in a uniform way. A VFS can, for example, be used to access local and network storage devices transparently without the client application noticing the difference. It can be used to bridge the differences in Windows, classic Mac OS/macOS and Unix filesystems, so that applications can access files on local file systems of those types without having to know what type of file system they are accessing. A VFS specifies an interface (or a "contract") between the kernel and a logical file system. Therefore, it is easy to add support for new file system types to the kernel simply by fulfilling the contract.

🔧 TODO

📚 VFS References
 * VFS in Linux Kernel 2.4 Internals

Logical file systems
A file system (or filesystem) is used to control how data is stored and retrieved. Without a file system, information placed in a storage area would be one large body of data with no way to tell where one piece of information stops and the next begins. By separating the data into individual pieces, and giving each piece a name, the information is easily separated and identified. Each group of data is called a "file". The structure and logic rules used to manage the groups of information and their names is called a "file system".

There are many different kinds of file systems. Each one has different structure and logic, properties of speed, flexibility, security, size and more. Some file systems have been designed to be used for specific applications. For example, the ISO 9660 file system is designed specifically for optical discs.

File systems can be used on many different kinds of storage devices, and each storage device uses a different kind of media: hard disks, magnetic tape and optical discs, among others. In some cases, the computer's main memory (RAM) is used to create a temporary file system for short-term use. Raw storage is called a block device.

Linux supports many different file systems, but common choices for the system disk on a block device include the ext* family (such as ext2, ext3 and ext4), XFS, JFS and btrfs. For raw flash without a flash translation layer (FTL) or Memory Technology Device (MTD), there are UBIFS, JFFS2 and YAFFS, among others. SquashFS is a common compressed read-only file system. NFS and other network file systems are described further in the paragraph Network storage.

⚲ Shell interfaces:
 * cat /proc/filesystems
 * ls /sys/fs/

Infrastructure ⚲ API: function register_filesystem registers a struct file_system_type and stores it in the linked list file_systems ⚙️. The operation of opening a file system is called mounting:

⚙️ Internals:

📚 References:
 * Kernel wikis: EXT4, btrfs, Reiser4, RAID, XFS

Page cache
A page cache or disk cache is a transparent cache for the memory pages originating from a secondary storage device such as a hard disk drive. The operating system keeps a page cache in otherwise unused portions of the main memory, resulting in quicker access to the contents of cached pages and overall performance improvements. The page cache is implemented by the kernel, and is mostly transparent to applications.

Usually, all physical memory not directly allocated to applications is used by the operating system for the page cache. Since the memory would otherwise be idle and is easily reclaimed when applications request it, there is generally no associated performance penalty and the operating system might even report such memory as "free" or "available".

The page cache also aids in writing to a disk. Pages in the main memory that have been modified during writing data to disk are marked as "dirty" and have to be flushed to disk before they can be freed. When a file write occurs, the page backing the particular block is looked up. If it is already found in the page cache, the write is done to that page in the main memory. If not, and the write perfectly falls on page size boundaries, the page is not even read from disk, but allocated and immediately marked dirty. Otherwise, the page(s) are fetched from disk and the requested modifications are done.

Not all cached pages can be written to as program code is often mapped as read-only or copy-on-write; in the latter case, modifications to code will only be visible to the process itself and will not be written to disk.

⚲ API:

📚 References

More
 * The future of DAX - direct access bypassing the cache
 * Linux Page Cache in Linux Kernel 2.4 Internals

Zero-copy
🚀 advanced features

Writing data to storage and reading it back are very resource-consuming operations. Copying memory is a time- and CPU-consuming operation too. The set of methods that avoid copying operations is called zero-copy. The goal of zero-copy methods is fast and efficient data transfer within the system.

The first and simplest method is the pipe, invoked by the operator "|" in shells. Instead of writing data into a temporary file and reading it back, the data is passed efficiently via a pipe, bypassing storage.

⚲ Syscalls:

⚲ API and ⚙️ Internals:


 *  pipe ↪ - creates a pipe
 * uses ,
 *  tee ↪ - duplicates pipe content
 * calls
 *  sendfile ↪ - transfers data between file descriptors; the output can be a socket. Used in network storage and servers.
 * Calls: ,


 *  copy_file_range ↪ - transfers data between files
 * calls custom like
 * or custom like
 * or


 *  splice ↪ - splices data to/from a pipe.
 * There are three cases regarding which end being a pipe:
 * - only input is a pipe
 * Calls or custom
 * or :, ,
 * - only output is a pipe.
 * Calls or custom
 * or :
 * - both are pipes


 * vmsplice – splices user pages to a pipe
 * – splices a pipe to user pages

⚲ API

⚙️ Internals:

🔧 TODO: builds a zerocopy skb datagram from an iov_iter. Used in and.

📚 References
 * LTP:, , , , , ,

Block device layer
Linux storage is based on block devices.

Block devices provide buffered access to the hardware, always allowing reads and writes of any sized block (including single characters/bytes), and are not subject to alignment restrictions. They are commonly used to represent hardware like hard disks.

⚲ Interfaces:
 * allocates
 * struct bio - the main unit of I/O for the block layer and lower layers (i.e. drivers and stacking drivers)

⚙️ Internals.

👁 Examples:
 * drivers/block/brd.c - small RAM backed block device driver

📚 References
 * https://lwn.net/Kernel/Index/#Block_layer
 * LDD3:Block Drivers
 * LDD1:Loading Block Drivers
 * ULK3 Chapter 14. Block Device Drivers

Device mapper
The device mapper is a framework provided by the kernel for mapping physical block devices onto higher-level "virtual block devices". It forms the foundation of LVM2, software RAIDs and dm-crypt disk encryption, and offers additional features such as file system snapshots.

Device mapper works by passing data from a virtual block device, which is provided by the device mapper itself, to another block device. Data can be also modified in transition, which is performed, for example, in the case of device mapper providing disk encryption.

User space applications that need to create new mapped devices talk to the device mapper via the libdevmapper shared library, which in turn issues ioctls to the /dev/mapper/control device node.

Functions provided by the device mapper include linear, striped and error mappings, as well as crypt and multipath targets. For example, two disks may be concatenated into one logical volume with a pair of linear mappings, one for each disk. As another example, crypt target encrypts the data passing through the specified device, by using the Linux kernel's Crypto API.

The following mapping targets are available:
 * cache - allows the creation of hybrid volumes, by using solid-state drives (SSDs) as caches for hard disk drives (HDDs)
 * crypt - provides data encryption, by using the Linux kernel's Crypto API
 * delay - delays reads and/or writes to different devices (used for testing)
 * era - behaves in a way similar to the linear target, while it keeps track of blocks that were written to within a user-defined period of time
 * error - simulates I/O errors for all mapped blocks (used for testing)
 * flakey - simulates periodic unreliable behaviour (used for testing)
 * linear - maps a continuous range of blocks onto another block device
 * mirror - maps a mirrored logical device, while providing data redundancy
 * multipath - supports the mapping of multipathed devices, through usage of their path groups
 * raid - offers an interface to the Linux kernel's software RAID driver (md)
 * snapshot and snapshot-origin - used for creation of LVM snapshots, as part of the underlying copy-on-write scheme
 * striped - strips the data across physical devices, with the number of stripes and the striping chunk size as parameters
 * zero - an equivalent of /dev/zero: all reads return blocks of zeros, and writes are discarded
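
As a hedged illustration (the commands require root; device names, sector counts and offsets are arbitrary examples), the zero and linear targets can be created with dmsetup. Each table line has the form "start length target args", counted in 512-byte sectors:

```shell
# Create a 1 GiB virtual device whose reads return zeros
# (1 GiB = 2097152 sectors of 512 bytes).
dmsetup create zerodev --table '0 2097152 zero'

# Map the first 1000 sectors of /dev/sdb1 onto a new virtual device.
dmsetup create lineardev --table '0 1000 linear /dev/sdb1 0'

# Inspect and remove the mappings.
dmsetup table zerodev
dmsetup remove lineardev
dmsetup remove zerodev
```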

📚 References
 * https://lwn.net/Kernel/Index/#Device_mapper

I/O scheduler
I/O scheduling (or disk scheduling) is the method chosen by the kernel to decide in which order the block I/O operations will be submitted to the storage volumes. I/O scheduling usually has to work with hard disk drives that have long access times for requests placed far away from the current position of the disk head (this operation is called a seek). To minimize the effect this has on system performance, most I/O schedulers implement a variant of the elevator algorithm that reorders the incoming randomly ordered requests so the associated data would be accessed with minimal arm/head movement.

The particular I/O scheduler used with a certain block device can be switched at run time by modifying the corresponding /sys/block/<device>/queue/scheduler file in the sysfs filesystem. Some I/O schedulers also have tunable parameters that can be set through files in /sys/block/<device>/queue/iosched/.
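
For example (sda is an arbitrary device name; writing requires root):

```shell
# Show the available schedulers; the active one is in brackets,
# e.g.: [mq-deadline] kyber bfq none
cat /sys/block/sda/queue/scheduler

# Switch the scheduler at run time.
echo kyber > /sys/block/sda/queue/scheduler

# Tunable parameters of the active scheduler.
ls /sys/block/sda/queue/iosched/
```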

⚲ Interfaces:
 * Function elv_register registers struct elevator_type.

⚙️ Internals:

📚 References:
 * https://www.cloudbees.com/blog/linux-io-scheduler-tuning/
 * https://wiki.ubuntu.com/Kernel/Reference/IOSchedulers

drivers
🔧 TODO

⚙️ Internals
 * - Non Volatile Memory devices like ,
 * - for 🤖 embedded devices

NVMe
NVMe drivers provide access to a computer's non-volatile storage. Local storage is attached via the PCIe bus; the PCI NVMe device driver entry point is nvme_init. The remote storage driver is called the target and the local driver is called the host. NVMe over Fabrics (NVMe-oF) connects remote targets with a local host. A fabric can be based on RDMA, TCP or Fibre Channel protocols.

⚲ API:
 * nvme-cli
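A few common nvme-cli invocations (the device name /dev/nvme0 is an arbitrary example; the commands require root):

```shell
# Enumerate NVMe devices and namespaces.
nvme list

# Controller identification data (model, firmware, capabilities).
nvme id-ctrl /dev/nvme0

# SMART / health information for a drive.
nvme smart-log /dev/nvme0
```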

⚙️ Internals:

Host :

⚲ Interfaces:
 * initializes a NVMe controller structures with operations
 * a subroutine of adds a new disk with


 * - local PCI nvme module init
 * - module init

Fabrics

⚲ interfaces:
 * registers
 * - fabrics module init

⚙️ internals:
 * - fabrics module init
 * binds

Target :

⚲ Interfaces:
 * registers


 * - module init
 * - loopback test module init which can be useful to test NVMe-FC transport interfaces.

👁 Example: nvme loopback
 * - fabrics operations
 * - target operation

Appendices
🚀 Advanced
 * – reports task statistics
 * /proc/self/io – I/O statistics for the process

💾 Historical storage drivers

📖 Further reading about storage
 * bcc/ebpf storage and filesystems tools