Linux 3.8 was released on Mon, 18 Feb 2013.
This Linux release includes support in Ext4 for embedding very small files in the inode, which greatly improves the performance for these files and saves some disk space. There is also a new Btrfs feature that allows quickly replacing a disk, a new filesystem (F2FS) optimized for SSDs, support for filesystem mount, UTS, IPC, PID, and network stack namespaces for unprivileged users, accounting of kernel memory in the memory resource controller, journal checksums in XFS, a NUMA policy redesign and, of course, the removal of support for 386 processors. Many small features, new drivers and fixes are also available.
Contents
- Prominent features in Linux 3.8
  - Ext4 embeds very small files in the inode
  - Btrfs fast device replacement
  - F2FS, an SSD friendly file system
  - User namespace support completed
  - XFS log checksums
  - Huge Pages support a zero page
  - The memory resource controller supports accounting of kernel memory
  - Automatic NUMA balancing
  - Removal of support for 386 processors
- Driver and architecture-specific changes
- Various core changes
- Filesystems
- Block
- Crypto/keyring
- Security
- Perf
- Virtualization
- Networking
- Other news sites that track the changes of this release
1. Prominent features in Linux 3.8
1.1. Ext4 embeds very small files in the inode
Every file in Ext4 has a corresponding inode which stores various information about the file, such as its size, creation date and owner (users can see that information with the stat(1) command). But the inode doesn't store the actual data, it just holds information about where the data is placed.
The size used by each inode is predetermined at mkfs.ext4(8) time, and defaults to 256 bytes. But the space isn't always used entirely (although small extended attributes make use of it), and there are millions of inodes in a typical file system, so some space is wasted. At the same time, at least one data block (typically 4KB) is always allocated for file data, even if the file only uses a few bytes. And there is an extra seek involved in reading these few bytes, because the data blocks aren't allocated contiguously to the inodes.
Ext4 has added support for storing very small files in the unused inode space. With this feature the unused inode space gets some use, a data block isn't allocated for the file, and reading these small files is faster, because once the inode has been read, the data is already available without extra disk seeks. Some simple tests show that with a vanilla linux-3.0 source tree, the new system can save more than 1% of disk space. For a sample /usr directory, it saved more than 3% of space. Performance for small files is also improved. The size of the files that can be inlined can be tweaked indirectly by increasing the inode size (-I mkfs.ext4(8) option): the bigger the inode, the bigger the files that can be inlined (but if the workload doesn't make extensive use of small files, the space will be wasted).
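The space argument above can be illustrated with a rough model. This is only a sketch, not the real ext4 on-disk calculation: the function names are made up for illustration, and inline_capacity is a hypothetical number of free bytes in the inode (the real limit depends on the inode size and on how much of it extended attributes already use).

```python
import math

BLOCK_SIZE = 4096  # typical data block size mentioned in the text


def data_bytes_allocated(file_size, inline_capacity, block_size=BLOCK_SIZE):
    """Return the data-block bytes a file consumes in this simplified model.

    inline_capacity is a hypothetical number of free bytes in the inode;
    it is an assumption of this sketch, not a value taken from ext4.
    """
    if file_size <= inline_capacity:
        return 0  # the data lives in the inode, no data block is allocated
    # Otherwise the file is rounded up to whole data blocks.
    return math.ceil(file_size / block_size) * block_size


def total_saved(file_sizes, inline_capacity):
    """Bytes saved by inlining, compared with no inlining at all."""
    without = sum(data_bytes_allocated(s, 0) for s in file_sizes)
    with_inline = sum(data_bytes_allocated(s, inline_capacity) for s in file_sizes)
    return without - with_inline
```

In this model, every file that fits in the assumed inline capacity saves one whole data block, which is why trees with many tiny files benefit the most.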
Recommended LWN article: Improving ext4: bigalloc, inline data, and metadata checksums
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8)
1.2. Btrfs fast device replacement
As a filesystem that can span multiple devices, Btrfs can remove a disk easily, for example because you want to shrink your storage pool, or because the device is failing and you want to replace it:
- btrfs device add new_disk mountpoint
- btrfs device delete old_disk mountpoint
But the process is not as fast as it could be. Btrfs has added an explicit device replacement operation which is much faster:
- btrfs replace start old_disk new_disk mountpoint
The copy usually takes place at 90% of the available platter speed if no additional disk I/O is ongoing during the copy operation. The operation takes place at runtime on a live filesystem: it does not require unmounting the filesystem or stopping active tasks. It is also safe to crash or lose power during the operation, as the process will resume with the next mount. It's also possible to use the command "btrfs replace status" to check the status of the operation, or "btrfs replace cancel" to cancel it. The userspace patches for the btrfs program can be found at git://btrfs.giantdisaster.de/git/btrfs-progs.
Code: (commit 1, 2, 3, 4, 5, 6)
1.3. F2FS, an SSD friendly file system
F2FS is a new experimental file system, contributed by Samsung, optimized for flash memory storage devices. Linux has several file systems targeted at flash devices (logfs, jffs2, ubifs), but they are designed for "native" flash devices that expose the flash storage device directly to the computer. Many of the flash storage devices commonly used (SSD disks) aren't "native" flash devices. Instead, they have an FTL ("flash translation layer") that emulates a block based device and hides the true nature of flash memory devices. This makes it possible to use the existing block storage stacks and file systems on those devices. These file systems have made some optimizations to work better with SSDs (like trimming), but their on-disk formats have not been changed to optimize for them.
F2FS is a filesystem for SSDs that tries to keep in mind the existence of the Flash Translation Layer, and tries to make good use of it. For more details about the design choices made by F2FS, reading the following LWN article is recommended:
Recommended LWN article: An f2fs teardown
Code: fs/f2fs
1.4. User namespace support completed
Per-process namespaces allow processes to have different views of several resources. For example, a process might see one set of mount points, PID numbers, and network stack state, and a process in another namespace might see others. The per-process namespace support has been developed for many years: the command unshare(1), available in modern Linux distros, allows starting a process with the mount, UTS, IPC or network namespaces "unshared" from its parent; and systemd uses mount namespaces for the ReadWriteDirectories, ReadOnlyDirectories or InaccessibleDirectories unit configuration options, and for systemd-nspawn. But the use of namespaces was limited to root.
This release adds the ability for unprivileged users to use per-process namespaces safely. The resources with namespace support available are filesystem mount points, UTS, IPC, PIDs, and network stack.
For more details about the Linux namespace support, what they are, how they work, details about the API and some example programs, you should read the article series from LWN:
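As a small illustration of the idea that every process belongs to a set of namespaces, the following sketch lists the namespace identifiers of a process by reading the links under /proc/<pid>/ns. The function name is made up for this example, and which entries appear depends on the running kernel and its configuration. Two processes that print the same identifier for an entry share that namespace.

```python
import os


def current_namespaces(pid="self"):
    """Map namespace name -> identifier string for a process.

    Returns an empty dict when /proc/<pid>/ns is unavailable.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Entry is not a symlink or is not readable; skip it.
            continue
    return result


if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(name, ident)
```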
Namespaces in operation, part 1: namespaces overview
Namespaces in operation, part 2: the namespaces API
Namespaces in operation, part 3: PID namespaces
Namespaces in operation, part 4: more on PID namespaces
(The remaining namespaces will be covered in future LWN articles)
1.5. XFS log checksums
XFS is planning to add full metadata checksumming in the future. As part of that effort, this release adds support for checksums in the journal.
1.6. Huge Pages support a zero page
Huge pages are a type of memory page provided by the CPU memory management unit, which are much bigger than usual. They are typically used by big databases and applications which make use of large portions of memory. On the other hand, a "zero page" is a memory page full of zeros. This page is used by the kernel to save memory: some applications allocate large portions of memory full of zeros but they don't write to all parts of it, so instead of allocating that zeroed memory, the kernel just makes all the memory point to the zero page. The zero page was only available for normal sized pages (4KB in x86); this release adds a zero huge page for applications that use huge pages.
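The behaviour that makes a zero page possible can be observed from userspace: memory that has been allocated but never written reads back as zeros. The sketch below only shows that observable behaviour with an anonymous mapping. Whether the kernel actually backs those reads with the zero page (or the huge zero page) is an internal detail that this script cannot see, and the function name and sizes are arbitrary choices for the example.

```python
import mmap

SIZE = 16 * 1024 * 1024  # 16 MiB of anonymous memory (arbitrary size)


def untouched_memory_is_zero(size=SIZE, step=4096):
    """Read one byte every `step` bytes of a fresh anonymous mapping.

    Nothing is ever written to the mapping, so every byte read
    is expected to be zero.
    """
    with mmap.mmap(-1, size) as region:
        return all(region[offset] == 0 for offset in range(0, size, step))


if __name__ == "__main__":
    print(untouched_memory_is_zero())
```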
Recommended LWN article: Adding a huge zero page
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
1.7. The memory resource controller supports accounting of kernel memory
The Linux memory controller is a control group that can limit, account and isolate memory usage for arbitrary groups of processes. In this release, the memory controller has gained support for accounting two types of kernel memory usage: stack and slab usage. These limits can be useful for things like stopping fork bombs.
The files created in the control group are:
- memory.kmem.limit_in_bytes: set/show hard limit for kernel memory
- memory.kmem.usage_in_bytes: show current kernel memory allocation
- memory.kmem.failcnt: show the number of times kernel memory usage hit the limit
- memory.kmem.max_usage_in_bytes: show max kernel memory usage recorded
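The files listed above hold plain numbers, so they can be read with a few lines of code. The following is a minimal sketch: the function name is made up, and the cgroup directory passed to it is an assumption, since the location of the memory controller hierarchy depends on how the system mounts its control groups.

```python
import os

KMEM_FILES = (
    "memory.kmem.limit_in_bytes",
    "memory.kmem.usage_in_bytes",
    "memory.kmem.failcnt",
    "memory.kmem.max_usage_in_bytes",
)


def read_kmem_stats(cgroup_dir):
    """Return {file name: int value} for the kmem files that exist.

    cgroup_dir is the directory of one memory control group; its
    location is system dependent and must be supplied by the caller.
    """
    stats = {}
    for name in KMEM_FILES:
        path = os.path.join(cgroup_dir, name)
        try:
            with open(path) as handle:
                stats[name] = int(handle.read().strip())
        except (OSError, ValueError):
            # File missing, unreadable or not a plain integer: skip it.
            continue
    return stats
```

A monitoring script could, for example, compare memory.kmem.usage_in_bytes against memory.kmem.limit_in_bytes and watch whether memory.kmem.failcnt grows.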
Recommended LWN article: KS2012: memcg/mm: Improving kernel-memory accounting for memory cgroups
1.8. Automatic NUMA balancing
A lot of modern machines are "non uniform memory access" (NUMA) architectures: they have per-processor memory controllers, and accessing the memory in the local processor is faster than accessing the memory of other processors, so the placement of memory in the same node where processes will reference it is critical for performance. This is especially true in huge boxes with dozens or hundreds of processors.
The Linux NUMA implementation had some deficiencies. This release includes a new NUMA foundation which will allow building smarter NUMA policies in the next releases. For more details, see the LWN article:
Recommended LWN article: NUMA in a hurry
Some parts of the code: (commit 1, 2, 3, 4, 5, 6, 7)
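To see whether a machine has more than one NUMA node, and which CPUs are local to each node, the node directories exposed in sysfs can be read. This is only an inspection sketch with a made-up function name, assuming the usual /sys/devices/system/node layout (nodeN directories with a cpulist file). It does not show anything about the new balancing code itself.

```python
import os
import re

NODE_DIR = re.compile(r"^node(\d+)$")


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node number: CPU list string} read from sysfs.

    An empty dict means the directory is missing or exposes no nodes.
    """
    nodes = {}
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return nodes
    for entry in entries:
        match = NODE_DIR.match(entry)
        if not match:
            continue  # other files in the directory are not nodes
        try:
            with open(os.path.join(sysfs_root, entry, "cpulist")) as handle:
                nodes[int(match.group(1))] = handle.read().strip()
        except OSError:
            continue
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: CPUs {cpus}")
```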
1.9. Removal of support for 386 processors
As has been widely reported, this release no longer supports the Intel 386 processor (486 is still supported, though).
Code: (commit)
2. Driver and architecture-specific changes
All the driver and architecture-specific changes can be found in the Linux_3.8_DriverArch page
3. Various core changes
modules: add syscall to load module from file descriptor. Contributed by Chrome OS, which wants to be able to enforce that kernel modules are loaded only from its read-only, crypto-hash verified (dm_verity) root filesystem. Related LWN article: Loading modules from file descriptors (commit), (commit)
Support more page sizes for MAP_HUGETLB/SHM_HUGETLB, as some large applications want to use 1GB huge pages on some mappings (commit)
Allow assigning a node which has only movable memory, so that the whole memory of the node can be hotplugged (commit), (commit)
- SYSV IPC
Add 3 new variables and sysctls to tune them. These variables can be used to set the desired ID for the next allocated IPC object. Used by checkpoint/restart (commit)
Introduce message queue copy feature, needed by checkpoint/restart, as it requires some way to get all pending IPC messages without deleting them from the queue (commit)
freezer cgroup: implement proper hierarchy support (commit)
ptrace: introduce PTRACE_O_EXITKILL. If the tracer exits it sends SIGKILL to every tracee which has this bit set (commit)
procfs: add VmFlags field in smaps output, as checkpoint/restart needs to get these VMA associated flags (commit)
tmpfs: support SEEK_DATA and SEEK_HOLE lseek() flags (commit)
tty: Add new ioctls for fetching TTY flags: TIOCGPKT, TIOCGPTLCK and TIOCGEXCL fetch the PTY's packet mode and locking state, and the exclusive mode of the TTY (commit)
RCU locking: Add a module parameter to force use of expedited RCU primitives (commit), add callback-free CPUs, as RCU callback execution can add significant OS jitter and can also degrade scheduling latency (commit)
PCI: Single Root I/O Virtualization control and status via sysfs (commit)
devfreq: Add sysfs node to expose available frequencies (commit)
- Power Management
4. Filesystems
- ext4
Disable the ability to disable extended attributes (commit)
Introduce lseek() SEEK_DATA/SEEK_HOLE support (commit 1, 2, 3, 4, 5)
- XFS
- GFS2
SMB: Add SMB2.02 dialect support by specifying vers=2.0 on mount (commit)
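The lseek() SEEK_DATA/SEEK_HOLE support listed above for ext4 (and for tmpfs in the core changes) lets a program skip over the unallocated regions of a sparse file. The sketch below, with made-up function names, creates a file with a leading hole and asks where the first data and the following hole start. The exact offsets reported depend on the filesystem: one without hole support may simply report data at offset 0 and a hole at the end of the file.

```python
import os
import tempfile


def first_data_and_hole(path):
    """Return (offset of first data, offset of first hole at/after it)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.lseek(fd, 0, os.SEEK_DATA)
        hole = os.lseek(fd, data, os.SEEK_HOLE)
        return data, hole
    finally:
        os.close(fd)


def make_sparse_file(directory, hole_size=1024 * 1024, payload=b"x" * 4096):
    """Create a file with a leading hole followed by some real data."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Seeking past the end and writing leaves a hole before the data.
        os.lseek(fd, hole_size, os.SEEK_SET)
        os.write(fd, payload)
    finally:
        os.close(fd)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(first_data_and_hole(make_sparse_file(workdir)))
```

A copy tool could use the same two calls in a loop to copy only the data regions and recreate the holes in the destination.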
5. Block
6. Crypto/keyring
camellia: add AES-NI/AVX/x86_64 assembler implementation of Camellia cipher (commit)
crc32c: Optimize CRC-32C calculation with PCLMULQDQ instruction (commit)
keyring: Make the session and process keyrings per-thread rather than per-process, but still inherited from the parent thread to solve a problem with PAM and GDM (commit)
keyring: Reduce initial permissions on keys. This gives the creator a chance to adjust the permissions mask before other processes can access the new key or create a link to it (commit)
7. Security
Smack: create a sysfs mount point for Smackfs at /sys/fs/smackfs (commit)
Add "Seccomp" field at /proc/pid/status, necessary to examine the state of seccomp for a given process (commit)
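Reading the new "Seccomp" field is a matter of parsing /proc/pid/status. A minimal sketch with a made-up function name follows. According to proc(5) the value is 0 when seccomp is disabled, 1 for strict mode and 2 for filter mode. On kernels without the field the function returns None.

```python
def seccomp_mode(pid="self"):
    """Return the integer in the Seccomp field, or None if it is absent."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("Seccomp:"):
                    return int(line.split(":", 1)[1].strip())
    except (OSError, ValueError):
        # No such process, /proc not mounted, or an unexpected value.
        return None
    return None


if __name__ == "__main__":
    print(seccomp_mode())
```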
8. Perf
Integrate script browser into main browser. Users can press function key 'r' to list all perf scripts and select one of them to run that script, the output will be shown in a separate browser (commit 1, 2, 3)
diff: Add -b option for perf diff to display paired entries only (commit), add -p option to display period values for hist entries (commit), add option to sort entries based on diff computation (commit), add -F option to display formula for computation (commit), add ratio computation way to compare hist entries (commit)
inject: "perf inject" could previously only handle data from a pipe. Now it works with files too (commit)
stat: Add --pre and --post command (commit)
Add gtk.<command> config option for launching GTK browser (commit)
trace: Add an event duration column (commit), add duration filter (commit), support interrupted syscalls (commit), use sched:sched_stat_runtime to provide a thread summary (commit)
x86: Make hardware event translations available in sysfs /sys/devices/cpu/events/ (commit)
tracing: Add trace_options kernel command line parameter (commit)
9. Virtualization
KVM: paravirtual clock vsyscall support. It reduces clock_gettime from 500 cycles to 200 cycles in a testbox (commit: https://git.kernel.org/linus/3dc4f7cfb7441e5e0fed3a02fc81cdaabd28300a), (commit)
virtio-net: multiqueue support (commit), support changing the number of queue pairs through ethtool (commit)
virtio-console: Add support for remoteproc serial (commit)
vhost-net: enable zerocopy tx by default (commit)
Add Microsoft Hyper-V balloon driver (commit)
Xen: ACPI PAD driver (commit)
10. Networking
- Wireless
Allow aborting low priority scan requests (commit)
Allow flushing old scan results (commit)
Allow drivers to support P2P GO powersave configuration (commit)
Provide partial VHT radiotap information (commit)
Support VHT association (commit)
Allow per interface TX power setting, instead of per device (commit)
Add the NL80211_CMD_SET_MCAST_RATE command, which enables the user to change the rate used to send multicast frames for a vif configured as IBSS or MESH_POINT (commit)
Support P2P GO powersave configuration (commit)
RFC 5961 5.2 TCP blind data injection attack mitigation (commit)
Change default TCP hash size to be more in line with current day realities. The existing heuristics were chosen a decade ago (commit)
Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW) (commit)
B.A.T.M.A.N. mesh: Add Distributed ARP Table, a DHT-based mechanism that increases ARP reliability on sparse wireless mesh networks (commit 1, 2, 3, 4, 5, 6, 7)
Allow BPF filter access to VLAN tags (commit)
Support Distributed Overlay Virtual Ethernet (DOVE) extensions for VXLAN (commit)
Openvswitch: add IPv6 'set' action (commit)
IPv6: add support of equal-cost multi-path (ECMP) routing (commit)
IPIP tunnel: add GSO support (commit)
IPvs: Complete IPv6 fragment handling for IPVS (commit)
- Netfilter
pkt_sched: turn QFQ into QFQ+, a variant of QFQ that provides some benefits (commit)
tuntap: multiqueue support (commit)
SCTP: support per-association statistics (commit)
SCTP: Make HMAC algorithm selection for cookie generation dynamic (commit)
- Add support of link creation via rtnl 'ip link .. type
IPv6 tunnel: add support of link creation via rtnl 'ip link .. type ip6tnl' (commit)
IPIP tunnel: add support of link creation via rtnl 'ip link .. type ipip' (commit)
sit: add support of link creation via rtnl 'ip link .. type sit' (commit)
sk-filter: Add ability to get socket filter program (v2) (commit)
11. Other news sites that track the changes of this release
H-Online Kernel Log - Coming in 3.8 Part 1: Filesystems and storage, Part 2: Infrastructure, Part 3: Drivers