Security Vulnerabilities
- CVEs Published In May 2025
In the Linux kernel, the following vulnerability has been resolved:
cxl/region: Fix decoder allocation crash
When an intermediate port's decoders have been exhausted by existing
regions, and creating a new region with the port in question in it's
hierarchical path is attempted, cxl_port_attach_region() fails to find a
port decoder (as would be expected), and drops into the failure / cleanup
path.
However, during cleanup of the region reference, a sanity check attempts
to dereference the decoder, which in the above case didn't exist. This
causes a NULL pointer dereference BUG.
To fix this, refactor the decoder allocation and de-allocation into
helper routines, and in this 'free' routine, check that the decoder,
@cxld, is valid before attempting any operations on it.
In the Linux kernel, the following vulnerability has been resolved:
cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
When a cxl_nvdimm object goes through a ->remove() event (device
physically removed, nvdimm-bridge disabled, or nvdimm device disabled),
then any associated regions must also be disabled. As highlighted by the
cxl-create-region.sh test [1], a single device may host multiple
regions, but the driver was only tracking one region at a time. This
leads to a situation where only the last enabled region per nvdimm
device is cleaned up properly. Other regions are leaked, and this also
causes cxl_memdev reference leaks.
Fix the tracking by allowing cxl_nvdimm objects to track multiple region
associations.
In the Linux kernel, the following vulnerability has been resolved:
btrfs: fix tree mod log mishandling of reallocated nodes
We have been seeing the following panic in production
kernel BUG at fs/btrfs/tree-mod-log.c:677!
invalid opcode: 0000 [#1] SMP
RIP: 0010:tree_mod_log_rewind+0x1b4/0x200
RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293
RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000
RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00
RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001
R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00
FS: 00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
btrfs_get_old_root+0x12b/0x420
btrfs_search_old_slot+0x64/0x2f0
? tree_mod_log_oldest_root+0x3d/0xf0
resolve_indirect_ref+0xfd/0x660
? ulist_alloc+0x31/0x60
? kmem_cache_alloc_trace+0x114/0x2c0
find_parent_nodes+0x97a/0x17e0
? ulist_alloc+0x30/0x60
btrfs_find_all_roots_safe+0x97/0x150
iterate_extent_inodes+0x154/0x370
? btrfs_search_path_in_tree+0x240/0x240
iterate_inodes_from_logical+0x98/0xd0
? btrfs_search_path_in_tree+0x240/0x240
btrfs_ioctl_logical_to_ino+0xd9/0x180
btrfs_ioctl+0xe2/0x2ec0
? __mod_memcg_lruvec_state+0x3d/0x280
? do_sys_openat2+0x6d/0x140
? kretprobe_dispatcher+0x47/0x70
? kretprobe_rethook_handler+0x38/0x50
? rethook_trampoline_handler+0x82/0x140
? arch_rethook_trampoline_callback+0x3b/0x50
? kmem_cache_free+0xfb/0x270
? do_sys_openat2+0xd5/0x140
__x64_sys_ioctl+0x71/0xb0
do_syscall_64+0x2d/0x40
Which is this code in tree_mod_log_rewind()
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
This occurs because we replay the nodes in order that they happened, and
when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot,
starting at 0. 'n' here is the number of items in this block, which in
this case was 1, but we had 2 REMOVE_WHILE_FREEING operations.
The actual root cause of this was that we were replaying operations for
a block that shouldn't have been replayed. Consider the following
sequence of events
1. We have an already modified root, and we do a btrfs_get_tree_mod_seq().
2. We begin removing items from this root, triggering KEY_REPLACE for
it's child slots.
3. We remove one of the 2 children this root node points to, thus triggering
the root node promotion of the remaining child, and freeing this node.
4. We modify a new root, and re-allocate the above node to the root node of
this other root.
The tree mod log looks something like this
logical 0 op KEY_REPLACE (slot 1) seq 2
logical 0 op KEY_REMOVE (slot 1) seq 3
logical 0 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 4
logical 4096 op LOG_ROOT_REPLACE (old logical 0) seq 5
logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 1) seq 6
logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 7
logical 0 op LOG_ROOT_REPLACE (old logical 8192) seq 8
>From here the bug is triggered by the following steps
1. Call btrfs_get_old_root() on the new_root.
2. We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is
currently logical 0.
3. tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which
gives us the KEY_REPLACE seq 2, and since that's not a
LOG_ROOT_REPLACE we incorrectly believe that we don't have an old
root, because we expect that the most recent change should be a
LOG_ROOT_REPLACE.
4. Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE,
so we don't set old_root, we simply use our e
---truncated---
In the Linux kernel, the following vulnerability has been resolved:
fscrypt: stop using keyrings subsystem for fscrypt_master_key
The approach of fs/crypto/ internally managing the fscrypt_master_key
structs as the payloads of "struct key" objects contained in a
"struct key" keyring has outlived its usefulness. The original idea was
to simplify the code by reusing code from the keyrings subsystem.
However, several issues have arisen that can't easily be resolved:
- When a master key struct is destroyed, blk_crypto_evict_key() must be
called on any per-mode keys embedded in it. (This started being the
case when inline encryption support was added.) Yet, the keyrings
subsystem can arbitrarily delay the destruction of keys, even past the
time the filesystem was unmounted. Therefore, currently there is no
easy way to call blk_crypto_evict_key() when a master key is
destroyed. Currently, this is worked around by holding an extra
reference to the filesystem's request_queue(s). But it was overlooked
that the request_queue reference is *not* guaranteed to pin the
corresponding blk_crypto_profile too; for device-mapper devices that
support inline crypto, it doesn't. This can cause a use-after-free.
- When the last inode that was using an incompletely-removed master key
is evicted, the master key removal is completed by removing the key
struct from the keyring. Currently this is done via key_invalidate().
Yet, key_invalidate() takes the key semaphore. This can deadlock when
called from the shrinker, since in fscrypt_ioctl_add_key(), memory is
allocated with GFP_KERNEL under the same semaphore.
- More generally, the fact that the keyrings subsystem can arbitrarily
delay the destruction of keys (via garbage collection delay, or via
random processes getting temporary key references) is undesirable, as
it means we can't strictly guarantee that all secrets are ever wiped.
- Doing the master key lookups via the keyrings subsystem results in the
key_permission LSM hook being called. fscrypt doesn't want this, as
all access control for encrypted files is designed to happen via the
files themselves, like any other files. The workaround which SELinux
users are using is to change their SELinux policy to grant key search
access to all domains. This works, but it is an odd extra step that
shouldn't really have to be done.
The fix for all these issues is to change the implementation to what I
should have done originally: don't use the keyrings subsystem to keep
track of the filesystem's fscrypt_master_key structs. Instead, just
store them in a regular kernel data structure, and rework the reference
counting, locking, and lifetime accordingly. Retain support for
RCU-mode key lookups by using a hash table. Replace fscrypt_sb_free()
with fscrypt_sb_delete(), which releases the keys synchronously and runs
a bit earlier during unmount, so that block devices are still available.
A side effect of this patch is that neither the master keys themselves
nor the filesystem keyrings will be listed in /proc/keys anymore.
("Master key users" and the master key users keyrings will still be
listed.) However, this was mostly an implementation detail, and it was
intended just for debugging purposes. I don't know of anyone using it.
This patch does *not* change how "master key users" (->mk_users) works;
that still uses the keyrings subsystem. That is still needed for key
quotas, and changing that isn't necessary to solve the issues listed
above. If we decide to change that too, it would be a separate patch.
I've marked this as fixing the original commit that added the fscrypt
keyring, but as noted above the most important issue that this patch
fixes wasn't introduced until the addition of inline encryption support.
In the Linux kernel, the following vulnerability has been resolved:
wifi: cfg80211: fix memory leak in query_regdb_file()
In the function query_regdb_file() the alpha2 parameter is duplicated
using kmemdup() and subsequently freed in regdb_fw_cb(). However,
request_firmware_nowait() can fail without calling regdb_fw_cb() and
thus leak memory.
In the Linux kernel, the following vulnerability has been resolved:
KVM: Reject attempts to consume or refresh inactive gfn_to_pfn_cache
Reject kvm_gpc_check() and kvm_gpc_refresh() if the cache is inactive.
Not checking the active flag during refresh is particularly egregious, as
KVM can end up with a valid, inactive cache, which can lead to a variety
of use-after-free bugs, e.g. consuming a NULL kernel pointer or missing
an mmu_notifier invalidation due to the cache not being on the list of
gfns to invalidate.
Note, "active" needs to be set if and only if the cache is on the list
of caches, i.e. is reachable via mmu_notifier events. If a relevant
mmu_notifier event occurs while the cache is "active" but not on the
list, KVM will not acquire the cache's lock and so will not serailize
the mmu_notifier event with active users and/or kvm_gpc_refresh().
A race between KVM_XEN_ATTR_TYPE_SHARED_INFO and KVM_XEN_HVM_EVTCHN_SEND
can be exploited to trigger the bug.
1. Deactivate shinfo cache:
kvm_xen_hvm_set_attr
case KVM_XEN_ATTR_TYPE_SHARED_INFO
kvm_gpc_deactivate
kvm_gpc_unmap
gpc->valid = false
gpc->khva = NULL
gpc->active = false
Result: active = false, valid = false
2. Cause cache refresh:
kvm_arch_vm_ioctl
case KVM_XEN_HVM_EVTCHN_SEND
kvm_xen_hvm_evtchn_send
kvm_xen_set_evtchn
kvm_xen_set_evtchn_fast
kvm_gpc_check
return -EWOULDBLOCK because !gpc->valid
kvm_xen_set_evtchn_fast
return -EWOULDBLOCK
kvm_gpc_refresh
hva_to_pfn_retry
gpc->valid = true
gpc->khva = not NULL
Result: active = false, valid = true
3. Race ioctl KVM_XEN_HVM_EVTCHN_SEND against ioctl
KVM_XEN_ATTR_TYPE_SHARED_INFO:
kvm_arch_vm_ioctl
case KVM_XEN_HVM_EVTCHN_SEND
kvm_xen_hvm_evtchn_send
kvm_xen_set_evtchn
kvm_xen_set_evtchn_fast
read_lock gpc->lock
kvm_xen_hvm_set_attr case
KVM_XEN_ATTR_TYPE_SHARED_INFO
mutex_lock kvm->lock
kvm_xen_shared_info_init
kvm_gpc_activate
gpc->khva = NULL
kvm_gpc_check
[ Check passes because gpc->valid is
still true, even though gpc->khva
is already NULL. ]
shinfo = gpc->khva
pending_bits = shinfo->evtchn_pending
CRASH: test_and_set_bit(..., pending_bits)
In the Linux kernel, the following vulnerability has been resolved:
KVM: x86: smm: number of GPRs in the SMRAM image depends on the image format
On 64 bit host, if the guest doesn't have X86_FEATURE_LM, KVM will
access 16 gprs to 32-bit smram image, causing out-ouf-bound ram
access.
On 32 bit host, the rsm_load_state_64/enter_smm_save_state_64
is compiled out, thus access overflow can't happen.
In the Linux kernel, the following vulnerability has been resolved:
KVM: Initialize gfn_to_pfn_cache locks in dedicated helper
Move the gfn_to_pfn_cache lock initialization to another helper and
call the new helper during VM/vCPU creation. There are race
conditions possible due to kvm_gfn_to_pfn_cache_init()'s
ability to re-initialize the cache's locks.
For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and
kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock.
(thread 1) | (thread 2)
|
kvm_xen_set_evtchn_fast |
read_lock_irqsave(&gpc->lock, ...) |
| kvm_gfn_to_pfn_cache_init
| rwlock_init(&gpc->lock)
read_unlock_irqrestore(&gpc->lock, ...) |
Rename "cache_init" and "cache_destroy" to activate+deactivate to
avoid implying that the cache really is destroyed/freed.
Note, there more races in the newly named kvm_gpc_activate() that will
be addressed separately.
[sean: call out that this is a bug fix]
In the Linux kernel, the following vulnerability has been resolved:
ACPI: APEI: Fix integer overflow in ghes_estatus_pool_init()
Change num_ghes from int to unsigned int, preventing an overflow
and causing subsequent vmalloc() to fail.
The overflow happens in ghes_estatus_pool_init() when calculating
len during execution of the statement below as both multiplication
operands here are signed int:
len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
The following call trace is observed because of this bug:
[ 9.317108] swapper/0: vmalloc error: size 18446744071562596352, exceeds total pages, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0-1
[ 9.317131] Call Trace:
[ 9.317134] <TASK>
[ 9.317137] dump_stack_lvl+0x49/0x5f
[ 9.317145] dump_stack+0x10/0x12
[ 9.317146] warn_alloc.cold+0x7b/0xdf
[ 9.317150] ? __device_attach+0x16a/0x1b0
[ 9.317155] __vmalloc_node_range+0x702/0x740
[ 9.317160] ? device_add+0x17f/0x920
[ 9.317164] ? dev_set_name+0x53/0x70
[ 9.317166] ? platform_device_add+0xf9/0x240
[ 9.317168] __vmalloc_node+0x49/0x50
[ 9.317170] ? ghes_estatus_pool_init+0x43/0xa0
[ 9.317176] vmalloc+0x21/0x30
[ 9.317177] ghes_estatus_pool_init+0x43/0xa0
[ 9.317179] acpi_hest_init+0x129/0x19c
[ 9.317185] acpi_init+0x434/0x4a4
[ 9.317188] ? acpi_sleep_proc_init+0x2a/0x2a
[ 9.317190] do_one_initcall+0x48/0x200
[ 9.317195] kernel_init_freeable+0x221/0x284
[ 9.317200] ? rest_init+0xe0/0xe0
[ 9.317204] kernel_init+0x1a/0x130
[ 9.317205] ret_from_fork+0x22/0x30
[ 9.317208] </TASK>
[ rjw: Subject and changelog edits ]
In the Linux kernel, the following vulnerability has been resolved:
x86/tdx: Panic on bad configs that #VE on "private" memory access
All normal kernel memory is "TDX private memory". This includes
everything from kernel stacks to kernel text. Handling
exceptions on arbitrary accesses to kernel memory is essentially
impossible because they can happen in horribly nasty places like
kernel entry/exit. But, TDX hardware can theoretically _deliver_
a virtualization exception (#VE) on any access to private memory.
But, it's not as bad as it sounds. TDX can be configured to never
deliver these exceptions on private memory with a "TD attribute"
called ATTR_SEPT_VE_DISABLE. The guest has no way to *set* this
attribute, but it can check it.
Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it
is unset. There is no sane way for Linux to run with this
attribute clear so a panic() is appropriate.
There's small window during boot before the check where kernel
has an early #VE handler. But the handler is only for port I/O
and will also panic() as soon as it sees any other #VE, such as
a one generated by a private memory access.
[ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo().
Add Kirill's tested-by because I made changes since
he wrote this. ]