In the Linux kernel, the following vulnerability has been resolved:
net: skbuff: preserve shared-frag marker during coalescing
skb_try_coalesce() can attach paged frags from @from to @to. If @from
has SKBFL_SHARED_FRAG set, the resulting @to skb can contain the same
externally-owned or page-cache-backed frags, but the shared-frag marker
is currently lost.
That breaks the invariant relied on by later in-place writers. In
particular, ESP input checks skb_has_shared_frag() before deciding
whether an uncloned nonlinear skb can skip skb_cow_data(). If TCP
receive coalescing has moved shared frags into an unmarked skb, ESP can
see skb_has_shared_frag() as false and decrypt in place over page-cache
backed frags.
Propagate SKBFL_SHARED_FRAG when skb_try_coalesce() transfers paged
frags. The tailroom copy path does not need the marker because it copies
bytes into @to's linear data rather than transferring frag descriptors.
In the Linux kernel, the following vulnerability has been resolved:
unshare: fix unshare_fs() handling
There's an unpleasant corner case in unshare(2), when we have a
CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that
case copy_mnt_ns() gets passed current->fs instead of a private copy,
which causes interesting warts in proof of correctness]
> I guess if private means fs->users == 1, the condition could still be true.
Unfortunately, it's worse than just a convoluted proof of correctness.
Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS
(and current->fs->users == 1).
We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and
flips current->fs->{pwd,root} to corresponding locations in the new namespace.
Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM).
We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's
destroyed and its mount tree is dissolved, but... current->fs->root and
current->fs->pwd are both left pointing to now detached mounts.
They are pinning those, so it's not a UAF, but it leaves the calling
process with unshare(2) failing with -ENOMEM _and_ leaving it with
pwd and root on detached isolated mounts. The last part is clearly a bug.
There is other fun related to that mess (races with pivot_root(), including
the one between pivot_root() and fork(), of all things), but this one
is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new
fs_struct even if it hadn't been shared in the first place". Sure, we could
go for something like "if both CLONE_NEWNS *and* one of the things that might
end up failing after copy_mnt_ns() call in create_new_namespaces() are set,
force allocation of new fs_struct", but let's keep it simple - the cost
of copy_fs_struct() is trivial.
Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets
a freshly allocated fs_struct, yet to be attached to anything. That
seriously simplifies the analysis...
FWIW, that bug had been there since the introduction of unshare(2) ;-/
In the Linux kernel, the following vulnerability has been resolved:
bonding: fix type confusion in bond_setup_by_slave()
kernel BUG at net/core/skbuff.c:2306!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:pskb_expand_head+0xa08/0xfe0 net/core/skbuff.c:2306
RSP: 0018:ffffc90004aff760 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88807e3c8780 RCX: ffffffff89593e0e
RDX: ffff88807b7c4900 RSI: ffffffff89594747 RDI: ffff88807b7c4900
RBP: 0000000000000820 R08: 0000000000000005 R09: 0000000000000000
R10: 00000000961a63e0 R11: 0000000000000000 R12: ffff88807e3c8780
R13: 00000000961a6560 R14: dffffc0000000000 R15: 00000000961a63e0
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe1a0ed8df0 CR3: 000000002d816000 CR4: 00000000003526f0
Call Trace:
<TASK>
ipgre_header+0xdd/0x540 net/ipv4/ip_gre.c:900
dev_hard_header include/linux/netdevice.h:3439 [inline]
packet_snd net/packet/af_packet.c:3028 [inline]
packet_sendmsg+0x3ae5/0x53c0 net/packet/af_packet.c:3108
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0xa54/0xc30 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe1a0e6c1a9
When a non-Ethernet device (e.g. GRE tunnel) is enslaved to a bond,
bond_setup_by_slave() directly copies the slave's header_ops to the
bond device:
bond_dev->header_ops = slave_dev->header_ops;
This causes a type confusion when dev_hard_header() is later called
on the bond device. Functions like ipgre_header(), ip6gre_header(),all use
netdev_priv(dev) to access their device-specific private data. When
called with the bond device, netdev_priv() returns the bond's private
data (struct bonding) instead of the expected type (e.g. struct
ip_tunnel), leading to garbage values being read and kernel crashes.
Fix this by introducing bond_header_ops with wrapper functions that
delegate to the active slave's header_ops using the slave's own
device. This ensures netdev_priv() in the slave's header functions
always receives the correct device.
The fix is placed in the bonding driver rather than individual device
drivers, as the root cause is bond blindly inheriting header_ops from
the slave without considering that these callbacks expect a specific
netdev_priv() layout.
The type confusion can be observed by adding a printk in
ipgre_header() and running the following commands:
ip link add dummy0 type dummy
ip addr add 10.0.0.1/24 dev dummy0
ip link set dummy0 up
ip link add gre1 type gre local 10.0.0.1
ip link add bond1 type bond mode active-backup
ip link set gre1 master bond1
ip link set gre1 up
ip link set bond1 up
ip addr add fe80::1/64 dev bond1
In the Linux kernel, the following vulnerability has been resolved:
serial: caif: hold tty->link reference in ldisc_open and ser_release
A reproducer triggers a KASAN slab-use-after-free in pty_write_room()
when caif_serial's TX path calls tty_write_room(). The faulting access
is on tty->link->port.
Hold an extra kref on tty->link for the lifetime of the caif_serial line
discipline: get it in ldisc_open() and drop it in ser_release(), and
also drop it on the ldisc_open() error path.
With this change applied, the reproducer no longer triggers the UAF in
my testing.
In the Linux kernel, the following vulnerability has been resolved:
netfilter: nfnetlink_cthelper: fix OOB read in nfnl_cthelper_dump_table()
nfnl_cthelper_dump_table() has a 'goto restart' that jumps to a label
inside the for loop body. When the "last" helper saved in cb->args[1]
is deleted between dump rounds, every entry fails the (cur != last)
check, so cb->args[1] is never cleared. The for loop finishes with
cb->args[0] == nf_ct_helper_hsize, and the 'goto restart' jumps back
into the loop body bypassing the bounds check, causing an 8-byte
out-of-bounds read on nf_ct_helper_hash[nf_ct_helper_hsize].
The 'goto restart' block was meant to re-traverse the current bucket
when "last" is no longer found, but it was placed after the for loop
instead of inside it. Move the block into the for loop body so that
the restart only occurs while cb->args[0] is still within bounds.
BUG: KASAN: slab-out-of-bounds in nfnl_cthelper_dump_table+0x9f/0x1b0
Read of size 8 at addr ffff888104ca3000 by task poc_cthelper/131
Call Trace:
nfnl_cthelper_dump_table+0x9f/0x1b0
netlink_dump+0x333/0x880
netlink_recvmsg+0x3e2/0x4b0
sock_recvmsg+0xde/0xf0
__sys_recvfrom+0x150/0x200
__x64_sys_recvfrom+0x76/0x90
do_syscall_64+0xc3/0x6e0
Allocated by task 1:
__kvmalloc_node_noprof+0x21b/0x700
nf_ct_alloc_hashtable+0x65/0xd0
nf_conntrack_helper_init+0x21/0x60
nf_conntrack_init_start+0x18d/0x300
nf_conntrack_standalone_init+0x12/0xc0
In the Linux kernel, the following vulnerability has been resolved:
netfilter: nfnetlink_queue: fix entry leak in bridge verdict error path
nfqnl_recv_verdict() calls find_dequeue_entry() to remove the queue
entry from the queue data structures, taking ownership of the entry.
For PF_BRIDGE packets, it then calls nfqa_parse_bridge() to parse VLAN
attributes. If nfqa_parse_bridge() returns an error (e.g. NFQA_VLAN
present but NFQA_VLAN_TCI missing), the function returns immediately
without freeing the dequeued entry or its sk_buff.
This leaks the nf_queue_entry, its associated sk_buff, and all held
references (net_device refcounts, struct net refcount). Repeated
triggering exhausts kernel memory.
Fix this by dropping the entry via nfqnl_reinject() with NF_DROP verdict
on the error path, consistent with other error handling in this file.
In the Linux kernel, the following vulnerability has been resolved:
netfilter: x_tables: guard option walkers against 1-byte tail reads
When the last byte of options is a non-single-byte option kind, walkers
that advance with i += op[i + 1] ? : 1 can read op[i + 1] past the end
of the option area.
Add an explicit i == optlen - 1 check before dereferencing op[i + 1]
in xt_tcpudp and xt_dccp option walkers.
In the Linux kernel, the following vulnerability has been resolved:
ALSA: pcm: fix use-after-free on linked stream runtime in snd_pcm_drain()
In the drain loop, the local variable 'runtime' is reassigned to a
linked stream's runtime (runtime = s->runtime at line 2157). After
releasing the stream lock at line 2169, the code accesses
runtime->no_period_wakeup, runtime->rate, and runtime->buffer_size
(lines 2170-2178) — all referencing the linked stream's runtime without
any lock or refcount protecting its lifetime.
A concurrent close() on the linked stream's fd triggers
snd_pcm_release_substream() → snd_pcm_drop() → pcm_release_private()
→ snd_pcm_unlink() → snd_pcm_detach_substream() → kfree(runtime).
No synchronization prevents kfree(runtime) from completing while the
drain path dereferences the stale pointer.
Fix by caching the needed runtime fields (no_period_wakeup, rate,
buffer_size) into local variables while still holding the stream lock,
and using the cached values after the lock is released.
In the Linux kernel, the following vulnerability has been resolved:
cgroup: fix race between task migration and iteration
When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset->tasks to cset->mg_tasks via:
list_move_tail(&task->cg_list, &cset->mg_tasks);
If a css_task_iter currently has it->task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to ->mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset->tasks, as
well as tasks queued on cset->mg_tasks, can be skipped by iteration.
Fix this by calling css_set_skip_task_iters() before unlinking
task->cg_list from cset->tasks. This advances all active iterators to
the next task on cset->tasks, so iteration continues correctly even
when a task is concurrently being migrated.
This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:
1) Spawn three long-running tasks (PIDs 101, 102, 103).
2) Create a test cgroup and move the tasks into it.
3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
4) In one shell, read cgroup.procs from the test cgroup.
5) Within the delay window, in another shell migrate PID 102 by
writing it to a different cgroup.procs file.
Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.
Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset->tasks. Iterators may
also start after the task has been moved to cset->mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.
The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
compared to the correctness fix provided here.