Cgroups v2: Comprehensive Guide
Control Groups (cgroups) v2 is the modern Linux kernel feature for organizing processes hierarchically and distributing system resources along the hierarchy in a controlled and configurable manner. This guide provides a comprehensive overview of cgroups v2, its architecture, controllers, and practical usage.
Overview and Evolution
What are Cgroups?
Cgroups (Control Groups) provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behavior. They enable:
- Resource Limitation: Limit resource usage (CPU, memory, I/O, network)
- Prioritization: Set relative priorities for resource allocation
- Accounting: Monitor resource usage
- Control: Freeze, checkpoint, and restart groups of processes
Cgroups v1 vs v2 Evolution
Cgroups v1 Limitations
- Multiple Hierarchies: Each controller had its own hierarchy
- Complex Management: Difficult to coordinate between controllers
- Inconsistent Interfaces: Different controllers had different APIs
- Race Conditions: Process migration between hierarchies was problematic
Cgroups v2 Improvements
- Unified Hierarchy: Single tree structure for all controllers
- Consistent Interface: Standardized API across all controllers
- Better Process Management: Simplified process migration
- Improved Performance: More efficient implementation
- Thread Safety: Better handling of multi-threaded applications
Architecture and Design
Unified Hierarchy
Cgroups v2 uses a single unified hierarchy where all controllers are mounted together:
/sys/fs/cgroup/
├── cgroup.controllers # Available controllers
├── cgroup.procs # Processes in root cgroup
├── cgroup.subtree_control # Enabled controllers for children
├── memory.current # Current memory usage
├── cpu.stat # CPU statistics
└── user.slice/ # Systemd user slice
    ├── cgroup.controllers
    ├── cgroup.procs
    ├── user-1000.slice/
    │   └── session-1.scope/
    └── user-1001.slice/
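To confirm that a system is actually using the unified hierarchy, check how /sys/fs/cgroup is mounted (the paths below are the defaults assumed throughout this guide):
# Show cgroup2 mounts; on a pure v2 system /sys/fs/cgroup has filesystem type cgroup2
mount -t cgroup2
findmnt -t cgroup2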
Key Concepts
1. Cgroup Tree Structure
Root Cgroup (/)
├── system.slice/ # System services
│ ├── sshd.service
│ ├── nginx.service
│ └── docker.service
├── user.slice/ # User sessions
│ ├── user-1000.slice/
│ └── user-1001.slice/
└── machine.slice/ # Virtual machines/containers
    ├── docker-container1.scope
    └── libvirt-vm1.scope
2. Controllers
Controllers implement resource management policies:
- cpu: CPU time distribution
- memory: Memory usage limits and accounting
- io: Block I/O bandwidth control
- pids: Process number limits
- cpuset: CPU and memory node assignment
- hugetlb: HugeTLB usage limits
- perf_event: Performance monitoring
- rdma: RDMA/IB resource limits
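Controller availability is top-down: a controller only appears inside a child cgroup after it has been enabled in the parent's cgroup.subtree_control. A minimal sketch, assuming a freshly created cgroup named demo directly under the root:
mkdir /sys/fs/cgroup/demo
cat /sys/fs/cgroup/demo/cgroup.controllers      # only controllers already enabled in the root
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
cat /sys/fs/cgroup/demo/cgroup.controllers      # now includes cpu and memory
ls /sys/fs/cgroup/demo/                         # cpu.* and memory.* interface files appear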
3. Control Files
Each cgroup directory contains control files:
- cgroup.controllers: Available controllers
- cgroup.subtree_control: Controllers enabled for children
- cgroup.procs: Process IDs in this cgroup
- cgroup.threads: Thread IDs in this cgroup (threaded mode)
- cgroup.events: Event notifications
Core Interface Files
Essential Control Files
cgroup.controllers
Lists controllers available in the current cgroup:
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
cgroup.subtree_control
Controls which controllers are enabled for child cgroups:
# Enable cpu and memory controllers for children
echo "+cpu +memory" > /sys/fs/cgroup/myapp/cgroup.subtree_control
# Disable io controller
echo "-io" > /sys/fs/cgroup/myapp/cgroup.subtree_control
# Check current settings
cat /sys/fs/cgroup/myapp/cgroup.subtree_control
cgroup.procs
Contains PIDs of processes in the cgroup:
# Add process to cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs
# List all processes in cgroup
cat /sys/fs/cgroup/myapp/cgroup.procs
cgroup.events
Provides event notifications:
$ cat /sys/fs/cgroup/myapp/cgroup.events
populated 1
frozen 0
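populated indicates whether the cgroup or any of its descendants still contains live processes, and frozen mirrors the cgroup.freeze control file. A minimal polling sketch that removes a cgroup once it empties (myapp is a placeholder path; the kernel also signals changes to this file via poll/inotify, which a more robust tool could use instead of polling):
#!/bin/bash
CG=/sys/fs/cgroup/myapp
# Wait until "populated" flips to 0, then remove the now-empty cgroup
while grep -q "populated 1" "$CG/cgroup.events"; do
    sleep 1
done
echo "cgroup $CG is empty, removing it"
rmdir "$CG"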
Controllers Deep Dive
1. CPU Controller
Configuration Files
- cpu.weight: Relative weight (1-10000, default 100)
- cpu.weight.nice: Nice value equivalent (-20 to 19)
- cpu.max: CPU bandwidth limit
- cpu.stat: CPU usage statistics
Examples
Setting CPU Weight:
# Create cgroup
mkdir /sys/fs/cgroup/high_priority_app
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
# Set high priority (weight 200, double the default)
echo 200 > /sys/fs/cgroup/high_priority_app/cpu.weight
# Alternative: use nice value
echo -5 > /sys/fs/cgroup/high_priority_app/cpu.weight.nice
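Weights are purely proportional: under contention each sibling cgroup receives roughly its weight divided by the sum of sibling weights, and unused time remains available to the others. A small illustration (normal_app is a hypothetical sibling cgroup):
echo 200 > /sys/fs/cgroup/high_priority_app/cpu.weight
echo 100 > /sys/fs/cgroup/normal_app/cpu.weight
# Under full contention:
#   high_priority_app: 200 / (200 + 100) ~ 67% of CPU time
#   normal_app:        100 / (200 + 100) ~ 33% of CPU time
# With no contention, either cgroup may still use all idle CPU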
CPU Bandwidth Limiting:
# Limit to 50% of one CPU core
echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# Format: quota period (both in microseconds)
# 50000/100000 = 50% of CPU time
# Limit to 1.5 CPU cores
echo "150000 100000" > /sys/fs/cgroup/myapp/cpu.max
# Remove limit
echo "max" > /sys/fs/cgroup/myapp/cpu.max
Monitoring CPU Usage:
$ cat /sys/fs/cgroup/myapp/cpu.stat
usage_usec 12345678
user_usec 8765432
system_usec 3580246
nr_periods 1234
nr_throttled 56
throttled_usec 789012
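usage_usec is a cumulative counter, so utilization has to be computed from deltas. A minimal sketch (myapp is a placeholder path; 100% corresponds to one fully used core):
#!/bin/bash
CG=/sys/fs/cgroup/myapp
INTERVAL=5
u1=$(awk '/^usage_usec/ {print $2}' "$CG/cpu.stat")
sleep "$INTERVAL"
u2=$(awk '/^usage_usec/ {print $2}' "$CG/cpu.stat")
# delta_usec / (interval_sec * 1,000,000) * 100  ==  delta / (interval * 10000)
echo "CPU usage: $(( (u2 - u1) / (INTERVAL * 10000) ))%"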
2. Memory Controller
Configuration Files
- memory.current: Current memory usage
- memory.max: Memory usage limit
- memory.min: Memory protection (guaranteed minimum)
- memory.low: Best-effort memory protection
- memory.high: Memory throttling threshold
- memory.swap.current: Current swap usage
- memory.swap.max: Swap usage limit
Memory Hierarchy
memory.max (hard limit)
↑
memory.high (throttling threshold)
↑
memory.low (best-effort protection)
↑
memory.min (guaranteed minimum)
Examples
Basic Memory Limiting:
# Set hard memory limit to 1GB
echo 1G > /sys/fs/cgroup/myapp/memory.max
# Set soft limit (throttling starts here)
echo 800M > /sys/fs/cgroup/myapp/memory.high
# Set memory protection (try to keep at least this much)
echo 200M > /sys/fs/cgroup/myapp/memory.low
# Disable swap for this cgroup
echo 0 > /sys/fs/cgroup/myapp/memory.swap.max
Memory Monitoring:
# Current memory usage
$ cat /sys/fs/cgroup/myapp/memory.current
524288000
# Detailed memory statistics
$ cat /sys/fs/cgroup/myapp/memory.stat
anon 134217728
file 67108864
kernel_stack 32768
pagetables 2097152
percpu 8192
sock 0
shmem 0
file_mapped 33554432
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 67108864
active_anon 67108864
inactive_file 33554432
active_file 33554432
unevictable 0
slab_reclaimable 4194304
slab_unreclaimable 2097152
Memory Events:
$ cat /sys/fs/cgroup/myapp/memory.events
low 12
high 34
max 5
oom 0
oom_kill 0
oom_group_kill 0
3. I/O Controller
Configuration Files
- io.max: Bandwidth and IOPS limits
- io.weight: Relative I/O weight
- io.stat: I/O statistics
- io.pressure: PSI (Pressure Stall Information) for I/O
Examples
I/O Bandwidth Limiting:
# Limit read bandwidth to 100MB/s on device 8:0 (sda)
echo "8:0 rbps=104857600" > /sys/fs/cgroup/myapp/io.max
# Limit write IOPS to 1000 on device 8:0
echo "8:0 wiops=1000" > /sys/fs/cgroup/myapp/io.max
# Combined limits
echo "8:0 rbps=104857600 wbps=52428800 riops=2000 wiops=1000" > /sys/fs/cgroup/myapp/io.max
I/O Weight (Proportional Control):
# Set I/O weight (1-10000, default 100); proportional weights only take effect
# when a weight-based I/O policy (e.g. the BFQ scheduler or io.cost) is active
echo "8:0 200" > /sys/fs/cgroup/myapp/io.weight
# This cgroup gets twice the default share of I/O time on device 8:0
I/O Monitoring:
$ cat /sys/fs/cgroup/myapp/io.stat
8:0 rbytes=1048576000 wbytes=524288000 rios=25600 wios=12800 dbytes=0 dios=0
8:16 rbytes=0 wbytes=0 rios=0 wios=0 dbytes=0 dios=0
4. PID Controller
Configuration Files
- pids.current: Current number of processes/threads
- pids.max: Maximum number of processes/threads
- pids.events: PID-related events
Examples
Process Limiting:
# Limit to maximum 100 processes
echo 100 > /sys/fs/cgroup/myapp/pids.max
# Check current process count
cat /sys/fs/cgroup/myapp/pids.current
# Monitor PID events
cat /sys/fs/cgroup/myapp/pids.events
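When pids.current reaches pids.max, further fork()/clone() calls in the cgroup fail (typically with "Resource temporarily unavailable") and the max counter in pids.events is incremented. A rough, hypothetical demonstration using a throwaway cgroup:
mkdir /sys/fs/cgroup/pidtest                     # throwaway test cgroup
echo 5 > /sys/fs/cgroup/pidtest/pids.max
echo $$ > /sys/fs/cgroup/pidtest/cgroup.procs    # move the current shell in
for i in $(seq 1 10); do sleep 60 & done         # some forks will be rejected
cat /sys/fs/cgroup/pidtest/pids.events           # "max N" counts the rejected forks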
5. CPUSet Controller
Configuration Files
- cpuset.cpus: Allowed CPU cores
- cpuset.mems: Allowed memory nodes (NUMA)
- cpuset.cpus.effective: Actually available CPUs
- cpuset.mems.effective: Actually available memory nodes
Examples
CPU Affinity:
# Allow only CPUs 0, 1, and 4-7
echo "0-1,4-7" > /sys/fs/cgroup/myapp/cpuset.cpus
# Allow only NUMA node 0
echo "0" > /sys/fs/cgroup/myapp/cpuset.mems
# Check effective settings
cat /sys/fs/cgroup/myapp/cpuset.cpus.effective
cat /sys/fs/cgroup/myapp/cpuset.mems.effective
Practical Usage Examples
1. Creating and Managing Cgroups
Manual Cgroup Creation
#!/bin/bash
# Create a new cgroup for a web application
CGROUP_PATH="/sys/fs/cgroup/webapp"
mkdir -p $CGROUP_PATH
# Enable controllers in the parent (root) so they become available inside the new cgroup
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/cgroup.subtree_control
# Configure resource limits
echo "2G" > $CGROUP_PATH/memory.max # 2GB memory limit
echo "1500M" > $CGROUP_PATH/memory.high # Throttle at 1.5GB
echo "150000 100000" > $CGROUP_PATH/cpu.max # 1.5 CPU cores
echo "200" > $CGROUP_PATH/pids.max # Max 200 processes
# Set I/O limits (assuming /dev/sda is 8:0)
echo "8:0 rbps=209715200 wbps=104857600" > $CGROUP_PATH/io.max # 200MB/s read, 100MB/s write
# Start application and add to cgroup
./start_webapp.sh &
APP_PID=$!
echo $APP_PID > $CGROUP_PATH/cgroup.procs
echo "Web application started with PID $APP_PID in cgroup $CGROUP_PATH"
Monitoring Script
#!/bin/bash
CGROUP_PATH="/sys/fs/cgroup/webapp"
while true; do
echo "=== $(date) ==="
echo "Memory Usage: $(cat $CGROUP_PATH/memory.current | numfmt --to=iec)"
echo "Memory Limit: $(cat $CGROUP_PATH/memory.max)"
echo "CPU Usage: $(grep usage_usec $CGROUP_PATH/cpu.stat)"
echo "Process Count: $(cat $CGROUP_PATH/pids.current)"
echo "I/O Stats:"
cat $CGROUP_PATH/io.stat
echo
sleep 5
done
2. Systemd Integration
Systemd is the primary interface for cgroups v2 on modern Linux systems.
Service Configuration
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/myapp
User=myapp
Group=myapp
# Cgroups v2 resource limits
MemoryMax=1G
MemoryHigh=800M
CPUWeight=200
CPUQuota=150%
TasksMax=100
IOWeight=200
# Block device specific I/O limits
IOReadBandwidthMax=/dev/sda 100M
IOWriteBandwidthMax=/dev/sda 50M
[Install]
WantedBy=multi-user.target
Systemd Commands
# Start service with resource limits
systemctl start myapp.service
# Check cgroup path
systemctl show myapp.service -p ControlGroup
# Monitor resource usage
systemd-cgtop
# Show detailed cgroup info
systemctl status myapp.service
# Change limits at runtime
systemctl set-property myapp.service MemoryMax=2G
systemctl set-property myapp.service CPUQuota=200%
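For ad-hoc workloads, systemd-run can create a transient unit with the same resource properties without writing a unit file (the unit name and command below are placeholders):
# Run a one-off command in a transient service with cgroup v2 limits
systemd-run --unit=batchjob -p MemoryMax=512M -p CPUQuota=50% -p TasksMax=20 /usr/local/bin/my_batch_job
# Wrap an interactive shell in a transient scope
systemd-run --scope -p MemoryMax=1G bash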
3. Container Integration
Docker with Cgroups v2
# Run container with resource limits
docker run -d \
--name myapp \
--memory=1g \
--memory-reservation=800m \
--cpus=1.5 \
--pids-limit=100 \
--device-read-bps=/dev/sda:100mb \
--device-write-bps=/dev/sda:50mb \
myapp:latest
# Check container's cgroup
docker inspect myapp | grep -i cgroup
# Monitor container resources
docker stats myapp
Kubernetes Pod Resource Limits
apiVersion: v1
kind: Pod
metadata:
  name: resource-limited-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      limits:
        memory: "1Gi"
        cpu: "1500m"
        ephemeral-storage: "2Gi"
      requests:
        memory: "800Mi"
        cpu: "500m"
        ephemeral-storage: "1Gi"
4. Advanced Scenarios
Multi-Tier Application Setup
#!/bin/bash
# Create hierarchical cgroup structure for multi-tier app
BASE_PATH="/sys/fs/cgroup/myapp"
mkdir -p $BASE_PATH/{frontend,backend,database}
# Enable controllers
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu +memory +io +pids" > $BASE_PATH/cgroup.subtree_control
# Frontend tier (web servers)
echo "1G" > $BASE_PATH/frontend/memory.max
echo "100000 100000" > $BASE_PATH/frontend/cpu.max # 1 CPU core
echo "50" > $BASE_PATH/frontend/pids.max
# Backend tier (application servers)
echo "2G" > $BASE_PATH/backend/memory.max
echo "200000 100000" > $BASE_PATH/backend/cpu.max # 2 CPU cores
echo "100" > $BASE_PATH/backend/pids.max
# Database tier (highest priority)
echo "4G" > $BASE_PATH/database/memory.max
echo "400000 100000" > $BASE_PATH/database/cpu.max # 4 CPU cores
echo "500" > $BASE_PATH/database/cpu.weight # Higher priority
echo "200" > $BASE_PATH/database/pids.max
# Set I/O priorities
echo "8:0 100" > $BASE_PATH/frontend/io.weight # Lower I/O priority
echo "8:0 200" > $BASE_PATH/backend/io.weight # Normal I/O priority
echo "8:0 500" > $BASE_PATH/database/io.weight # Higher I/O priority
Dynamic Resource Adjustment
#!/bin/bash
# Dynamic resource adjustment based on load
CGROUP_PATH="/sys/fs/cgroup/myapp"
monitor_and_adjust() {
while true; do
# Get current memory usage percentage (memory.max may be the string "max" if no limit is set)
current=$(cat $CGROUP_PATH/memory.current)
max=$(cat $CGROUP_PATH/memory.max)
if [ "$max" = "max" ]; then usage_percent=0; else usage_percent=$((current * 100 / max)); fi
# Get CPU pressure
cpu_pressure=$(awk '/some/ {print $2}' $CGROUP_PATH/cpu.pressure | cut -d= -f2)
echo "Memory usage: ${usage_percent}%, CPU pressure: ${cpu_pressure}"
# Adjust resources based on usage
if [ $usage_percent -gt 80 ]; then
# Increase memory limit by 500MB
new_limit=$((max + 524288000))
echo $new_limit > $CGROUP_PATH/memory.max
echo "Increased memory limit to $(numfmt --to=iec $new_limit)"
fi
if [ $(echo "$cpu_pressure > 50" | bc -l) -eq 1 ]; then
# Increase CPU quota
current_quota=$(awk '{print $1}' $CGROUP_PATH/cpu.max)
if [ "$current_quota" != "max" ]; then
new_quota=$((current_quota + 50000))
echo "$new_quota 100000" > $CGROUP_PATH/cpu.max
echo "Increased CPU quota to $new_quota"
fi
fi
sleep 30
done
}
monitor_and_adjust &
Pressure Stall Information (PSI)
Cgroups v2 provides PSI metrics to understand resource pressure:
CPU Pressure
$ cat /sys/fs/cgroup/myapp/cpu.pressure
some avg10=2.50 avg60=1.20 avg300=0.80 total=12345678
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Memory Pressure
$ cat /sys/fs/cgroup/myapp/memory.pressure
some avg10=15.50 avg60=12.30 avg300=8.90 total=87654321
full avg10=2.10 avg60=1.80 avg300=1.20 total=9876543
I/O Pressure
$ cat /sys/fs/cgroup/myapp/io.pressure
some avg10=5.20 avg60=3.40 avg300=2.10 total=45678901
full avg10=1.80 avg60=1.20 avg300=0.90 total=23456789
PSI Monitoring Script
#!/bin/bash
CGROUP_PATH="/sys/fs/cgroup/myapp"
monitor_pressure() {
echo "Monitoring PSI for $CGROUP_PATH"
while true; do
echo "=== $(date) ==="
# CPU Pressure
if [ -f "$CGROUP_PATH/cpu.pressure" ]; then
echo "CPU Pressure:"
cat $CGROUP_PATH/cpu.pressure
fi
# Memory Pressure
if [ -f "$CGROUP_PATH/memory.pressure" ]; then
echo "Memory Pressure:"
cat $CGROUP_PATH/memory.pressure
fi
# I/O Pressure
if [ -f "$CGROUP_PATH/io.pressure" ]; then
echo "I/O Pressure:"
cat $CGROUP_PATH/io.pressure
fi
echo
sleep 10
done
}
monitor_pressure
Migration from Cgroups v1
Key Differences
| Aspect | Cgroups v1 | Cgroups v2 |
|---|---|---|
| Hierarchy | Multiple hierarchies per controller | Single unified hierarchy |
| Mount Point | /sys/fs/cgroup/<controller> | /sys/fs/cgroup |
| Process Assignment | Can be in different hierarchies | Must be in same hierarchy |
| Thread Control | Limited thread support | Better thread management |
| Interface | Controller-specific | Standardized interface |
Migration Steps
1. Check Current System
# Check if cgroups v2 is available
ls /sys/fs/cgroup/cgroup.controllers
# Check current cgroups version
mount | grep cgroup
# Check systemd support
systemctl --version
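The filesystem type of /sys/fs/cgroup is the quickest way to tell which mode is active:
# cgroup2fs -> pure cgroups v2 (unified hierarchy); tmpfs -> cgroups v1 or hybrid mode
stat -fc %T /sys/fs/cgroup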
2. Enable Cgroups v2
# Append to the kernel command line in /etc/default/grub
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"
# Regenerate the GRUB config and reboot
# (Debian/Ubuntu: update-grub; RHEL/Fedora: grub2-mkconfig -o /boot/grub2/grub.cfg)
sudo update-grub
sudo reboot
3. Update Scripts and Configurations
# Old cgroups v1 path
OLD_PATH="/sys/fs/cgroup/memory/myapp"
# New cgroups v2 path
NEW_PATH="/sys/fs/cgroup/myapp"
# Old memory limit setting
echo 1073741824 > $OLD_PATH/memory.limit_in_bytes
# New memory limit setting
echo 1G > $NEW_PATH/memory.max
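Commonly encountered v1 interface files map to v2 as follows (not exhaustive; semantics are close but not always identical):
# memory.limit_in_bytes                 -> memory.max
# memory.soft_limit_in_bytes            -> memory.high (similar intent, different semantics)
# memory.usage_in_bytes                 -> memory.current
# cpu.cfs_quota_us + cpu.cfs_period_us  -> cpu.max ("quota period" in a single file)
# cpu.shares                            -> cpu.weight (1024 shares ~ weight 100)
# blkio.throttle.read_bps_device        -> io.max (rbps=...)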
Troubleshooting
Common Issues
1. Permission Denied
# Check ownership and permissions
ls -la /sys/fs/cgroup/myapp/
# Cgroup control files are writable only by their owner (root by default).
# Either perform the writes as root, or delegate the subtree to the service
# user (user name below is illustrative)
sudo chown -R appuser:appuser /sys/fs/cgroup/myapp/
2. Controller Not Available
# Check available controllers
cat /sys/fs/cgroup/cgroup.controllers
# Enable controller in parent
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
3. Process Migration Fails
# Check if process exists
ps -p $PID
# Check current cgroup
cat /proc/$PID/cgroup
# Ensure target cgroup exists
ls -d /sys/fs/cgroup/myapp/
Debugging Tools
1. Systemd Analysis
# Show cgroup tree
systemd-cgls
# Show resource usage
systemd-cgtop
# Analyze service
systemd-analyze critical-chain myapp.service
2. Process Cgroup Information
# Show process cgroup membership
cat /proc/self/cgroup
# Show all processes in cgroup
systemd-cgls /sys/fs/cgroup/myapp
3. Resource Usage Analysis
#!/bin/bash
# Comprehensive resource analysis
CGROUP_PATH="/sys/fs/cgroup/myapp"
echo "=== Cgroup Resource Analysis ==="
echo "Path: $CGROUP_PATH"
echo
# Controllers
echo "Available Controllers:"
cat $CGROUP_PATH/cgroup.controllers
echo
echo "Enabled Controllers:"
cat $CGROUP_PATH/cgroup.subtree_control
echo
# Memory
if [ -f "$CGROUP_PATH/memory.current" ]; then
echo "Memory Usage:"
echo " Current: $(numfmt --to=iec $(cat $CGROUP_PATH/memory.current))"
echo " Max: $(cat $CGROUP_PATH/memory.max)"
echo " High: $(cat $CGROUP_PATH/memory.high 2>/dev/null || echo 'not set')"
echo
fi
# CPU
if [ -f "$CGROUP_PATH/cpu.stat" ]; then
echo "CPU Usage:"
cat $CGROUP_PATH/cpu.stat
echo
fi
# Processes
if [ -f "$CGROUP_PATH/pids.current" ]; then
echo "Process Count: $(cat $CGROUP_PATH/pids.current)"
echo "Process Limit: $(cat $CGROUP_PATH/pids.max 2>/dev/null || echo 'not set')"
echo
fi
Best Practices
1. Hierarchy Design
- Keep hierarchy shallow (2-3 levels max)
- Group related processes together
- Use meaningful names for cgroups
- Follow systemd slice conventions when possible
2. Resource Limits
- Set both soft (memory.high) and hard (memory.max) limits
- Use memory.low for important services
- Set reasonable CPU weights rather than hard limits when possible
- Monitor pressure metrics to tune limits
3. Monitoring and Alerting
# Example monitoring script for production
#!/bin/bash
CGROUP_PATH="/sys/fs/cgroup/production-app"
ALERT_THRESHOLD=80
check_memory_usage() {
current=$(cat $CGROUP_PATH/memory.current)
max=$(cat $CGROUP_PATH/memory.max)
# Skip the check if no hard limit is configured (memory.max reads "max")
[ "$max" = "max" ] && return
usage_percent=$((current * 100 / max))
if [ $usage_percent -gt $ALERT_THRESHOLD ]; then
echo "ALERT: Memory usage ${usage_percent}% exceeds threshold ${ALERT_THRESHOLD}%"
# Send alert (email, Slack, etc.)
fi
}
check_oom_events() {
# Match the "oom" line exactly (a plain grep would also match oom_kill and oom_group_kill)
oom_count=$(awk '$1 == "oom" {print $2}' $CGROUP_PATH/memory.events)
if [ "${oom_count:-0}" -gt 0 ]; then
echo "ALERT: OOM events detected: $oom_count"
fi
}
check_memory_usage
check_oom_events