CVE-2021-1048: Android kernel refcount increment on mid-destruction file

Jann Horn

The Basics

NOTE: The original vulnerability was in the Linux kernel, but in-the-wild exploitation was only seen on Android-based devices, which run Android-specific kernel forks.

Disclosure or Patch Date: it's complicated (but the Android bulletin is from 6 November 2021)

Product: Android / Linux kernel

Advisory: ASB 2021-11

Affected Versions (upstream Linux):

5.9-rc2 - 5.9-rc3 (mainline: only release candidates affected)
5.8.4 - 5.8.7 (short-lived stable branch)
- date range: 2020-08-26 - 2020-09-09
5.7.18 and higher (short-lived stable branch, EOL before fix)
- date range: 2020-08-26 - EOL
5.4.61 - 5.4.63 (LTS stable branch)
- date range: 2020-08-26 - 2020-09-09
4.19.142 - 4.19.143 (LTS stable branch)
- date range: 2020-08-26 - 2020-09-09
4.14.195 - 4.14.196 (LTS stable branch)
- date range: 2020-08-26 - 2020-09-09
4.9.234 - 4.9.235 (LTS stable branch)
- date range: 2020-08-26 - 2020-09-12
4.4.234 - 4.4.235 (LTS stable branch)
- date range: 2020-08-26 - 2020-09-12

Affected Versions (Android devices): possibly some Android devices before SPL 2021-11-06, depending on LTS syncs

First Patched Version:

upstream: 5.9-rc4, 5.8.8, 5.4.64, 4.19.144, 4.14.197, 4.9.236, 4.4.236
Android devices: SPL 2021-11-06 or earlier (see "context of bug" section for explanation)

Issue/Bug Report (upstream Linux): https://lore.kernel.org/linux-fsdevel/000000000000dc862405ae31ae9b@google.com/T/#u

Issue/Bug Report (Android devices): unknown

Patch CL: https://git.kernel.org/linus/77f4689de17c

Bug-Introducing CL: https://git.kernel.org/linus/a9ed4a6560b8 (bugfix for another memory corruption)

Reporter(s) (upstream Linux): syzbot/syzkaller

Reporter(s) (Android devices): unknown

The Code

Proof-of-concept: N/A

Exploit sample: N/A

Did you have access to the exploit sample when doing the analysis? no

The Vulnerability

Bug class: object state confusion leading to use-after-free

Vulnerability details:

ep_loop_check_proc() tries to increment the refcount of a file with get_file(). However, get_file() may only be called when the caller already holds a refcounted reference to the file. ep_loop_check_proc() instead relies on holding ep->mtx to protect its weak reference to the file from concurrent removal by eventpoll_release(). That locking does not prevent ep_loop_check_proc() from encountering a file whose refcount is already zero.

[Diagram not reproduced here: the relevant lifetime states of struct file.]

Essentially, get_file() is called on an object that may be in a state in which get_file() is not permitted.

Patch analysis:

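get_file() is replaced with get_file_rcu(), which takes a reference only if the refcount is still non-zero and reports failure otherwise. It is therefore valid for (a superset of) all possible states of the file.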
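The difference between the two helpers can be modelled in userspace. The following is a simplified sketch using C11 atomics, not the kernel code: struct file_model, model_get_file() and model_get_file_rcu() are hypothetical stand-ins, and the sketch assumes that get_file() behaves as an unconditional increment while get_file_rcu() behaves as an increment-unless-zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the refcount in struct file. */
struct file_model {
    atomic_long f_count; /* 0 means teardown of the object has started */
};

/*
 * Models get_file(): unconditional increment. This is only valid while
 * the caller already holds a counted reference, i.e. while f_count > 0.
 */
void model_get_file(struct file_model *f)
{
    assert(f != NULL);
    atomic_fetch_add(&f->f_count, 1);
}

/*
 * Models get_file_rcu(): increment unless the count is zero.
 * Returns true if a reference was taken, false if the object is dying.
 */
bool model_get_file_rcu(struct file_model *f)
{
    long old;

    assert(f != NULL);
    old = atomic_load(&f->f_count);
    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&f->f_count, &old, old + 1))
            return true;
    }
    return false;
}
```

In this model, calling model_get_file() on an object whose count is 0 raises the count back to 1 even though teardown has already started, which is the state confusion described above. model_get_file_rcu() returns false in that situation, so the caller can skip the file instead of taking a bogus reference.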

Thoughts on how this vuln might have been found (fuzzing, code auditing, variant analysis, etc.): Since the bug was quickly fixed in upstream Linux, but not in all Android devices, there's a good chance that the attackers specifically searched for memory corruption fixes that are present upstream but not in Android devices.

This reminds me of https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html , another case where a bug was fixed upstream but not in all Android kernels.

(Historical/present/future) context of bug:

The commit that introduced the bug (and fixed another one) was included in the Android Security Bulletin for December 2020, forcing all Android vendors to include that commit. However, the fix for this bug, despite quickly landing in upstream stable kernels (see "Affected Versions" above), was only included in an Android Security Bulletin in November 2021.

This means that devices by Android vendors who only cherrypick bugfixes referenced in Android Security Bulletins, rather than pulling the complete Android common kernel tree, will have been vulnerable for almost a year, even though upstream stable releases (and Android common kernels) were only affected for ~2-3 weeks.

That doesn't necessarily mean that all Android devices were affected that long though; for example, Pixel 4 XL devices seem to have been patched in their March 2021 security update through the periodic LTS update from 4.14.191 to 4.14.199. The kernel versions that were shipped to Pixel 4 XL devices are (from running strings on boot.img in the firmware images):

in the December 2020 update: 4.14.191-gf6c9439f069c-ab6924784 (still vulnerable?)
in the January 2021 update: 4.14.191-gd36f32db91a3-ab6960308 (still vulnerable?)
in the February 2021 update: 4.14.191-gd36f32db91a3-ab7006457 (still vulnerable?)
in the March 2021 update: 4.14.199-g815ef3fd6754-ab7079165 (fixed)
in the April 2021 update: 4.14.199-gb0863551cb91-ab7132611 (fixed)
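The comparison behind the "(still vulnerable?)" and "(fixed)" annotations can be sketched as a small helper. This is only an illustration: lts_base_has_fix() is a hypothetical name, the table is copied from the "First Patched Version" list above, and the code assumes the release string starts with major.minor.sublevel as in the strings shown here. A result of 0 only means that the LTS base predates the upstream fix. It does not prove that a device is vulnerable, because a vendor may have cherrypicked the fix separately or may never have picked up the bug-introducing commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* First upstream stable releases that contain the fix (from the list above). */
struct fixed_release {
    int major, minor, sublevel;
};

static const struct fixed_release first_fixed[] = {
    {5, 8, 8}, {5, 4, 64}, {4, 19, 144}, {4, 14, 197}, {4, 9, 236}, {4, 4, 236},
};

/*
 * Returns 1 if the LTS base of 'release' is at or past the first fixed
 * release of its branch, 0 if it is older, and -1 if the string cannot be
 * parsed or the branch is not in the table (for example 5.7, which went
 * EOL before the fix).
 */
int lts_base_has_fix(const char *release)
{
    int major, minor, sublevel;
    size_t i;

    assert(release != NULL);
    /* Parsing stops at the first character after the sublevel, e.g. '-'. */
    if (sscanf(release, "%d.%d.%d", &major, &minor, &sublevel) != 3)
        return -1;

    for (i = 0; i < sizeof(first_fixed) / sizeof(first_fixed[0]); i++) {
        if (first_fixed[i].major == major && first_fixed[i].minor == minor)
            return sublevel >= first_fixed[i].sublevel ? 1 : 0;
    }
    return -1;
}
```

For the Pixel 4 XL strings above, the 4.14.191 builds fall below 4.14.197 and the 4.14.199 builds are at or past it, matching the annotations in the list.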

The Exploit

(The terms exploit primitive, exploit strategy, exploit technique, and exploit flow are defined here.)

Exploit strategy (or strategies): N/A - no exploit sample to analyze

Exploit flow:

Known cases of the same exploit flow:

Part of an exploit chain?

The Next Steps

Variant analysis

Areas/approach for variant analysis (and why):

I think there are two approaches for variant analysis here:

1. Check whether any Linux kernel patches listed in Android Security Bulletins are referenced by other commits in the Fixes: tag, and verify for any hits that they either aren't security-relevant or have also been included in an ASB.
2. Look for other codepaths that extract a file from an epoll item and assume that its refcount is non-zero.
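The first approach comes down to scanning upstream commit messages for Fixes: tags that name a commit from a bulletin. A minimal sketch of that check follows. message_fixes_picked_commit() is a hypothetical helper, and it assumes the raw commit message is available as a string, that tags start at the beginning of a line, and that commit IDs are lowercase hex abbreviations.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true if any "Fixes:" line in 'msg' names one of the commit IDs
 * in 'picked'. IDs are compared over the length of the shorter of the two
 * abbreviations, so a 12-character ID matches a full 40-character one.
 */
bool message_fixes_picked_commit(const char *msg, const char *const *picked,
                                 size_t n_picked)
{
    const char *line = msg;

    assert(msg != NULL && picked != NULL);
    while (line != NULL && *line != '\0') {
        if (strncmp(line, "Fixes:", 6) == 0) {
            const char *id = line + 6;
            size_t id_len = 0, i;

            while (*id == ' ' || *id == '\t')
                id++;
            while (isxdigit((unsigned char)id[id_len]))
                id_len++;

            for (i = 0; id_len > 0 && i < n_picked; i++) {
                size_t cmp_len = strlen(picked[i]);

                if (id_len < cmp_len)
                    cmp_len = id_len;
                if (cmp_len > 0 && strncmp(id, picked[i], cmp_len) == 0)
                    return true;
            }
        }
        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return false;
}
```

Every hit still needs manual review, since many followups turn out to be deadlock or performance fixes rather than security fixes.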

Found variants:

I found no variants with clear security implications.

Re #1, the following upstream Linux commits referenced in bulletins from 2020 and 2021 are referenced by followup fix commits:

d0cb50185ae9 (do_last(): fetch directory ->i_mode and ->i_uid before it's too late)
- followup: 6404674acd59 (vfs: fix do_last() regression)
  - reported by syzkaller: https://syzkaller.appspot.com/bug?extid=190005201ced78a74ad6
  - looks like just a NULL deref when racing?
07e6124a1a46 (vt: selection, close sel_buffer race)
- followup: e8c75a30a23c (vt: selection, push sel_lock up)
  - deadlock fix
- followup: 4b70dd57a15d (vt: selection, push console lock down)
  - deadlock fix
594cc251fdd0 (make 'user_access_begin()' do 'access_ok()')
- followup: ab10ae1c3bef (lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user())
  - looks like a powerpc-specific performance regression fix?
6d390e4b5d48 (locks: fix a potential use-after-free problem when wakeup a waiter)
- followup: dcf23ac3e846 (locks: reinstate locks_delete_block optimization)
  - performance regression fix
a9ed4a6560b8 (epoll: Keep a reference on files added to the check list)
- followup: 77f4689de17c (fix regression in "epoll: Keep a reference on files added to the check list")
  - original case
21998a351512 (x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.)
- followup: 33fc379df76b (x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb)
  - fixes incorrect reporting of speculation mitigation status on X86
- followup: 1978b3a53a74 (x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP)
  - fixes not being able to turn on IBPB on X86
8019ad13ef7f (futex: Fix inode life-time issue)
- followup: 8d67743653dc (futex: Unbreak futex hashing)
  - performance regression fix, theoretically also correctness fix

Re #2: The only place that looks vaguely interesting in that regard is ep_item_poll(): From what I can tell, it can invoke vfs_poll() on a file whose refcount is already zero, but only before the file's ->release() handler is called. But I think that's fine.

Structural improvements

What are structural improvements such as ways to kill the bug class, prevent the introduction of this vulnerability, mitigate the exploit flow, make this type of vulnerability harder to exploit, etc.?

Ideas to kill the bug class: In my opinion, the bug class here is "object state confusion", and killing the bug class would have to involve using static analysis and annotations to sanity-check whether object states match the requirements.

Ideas to mitigate the exploit flow: N/A

Other potential improvements: When cherrypicking specific security fixes, it would probably be a good idea to at least monitor the upstream repository for commits that refer to the cherrypicked patch with Fixes:.

0-day detection methods

What are potential detection methods for similar 0-days? That is, are there any ideas for how this exploit or similar exploits could be detected as a 0-day?