Ben Hawkes, Project Zero

The Basics

Disclosure or Patch Date: 1 May 2021

Product: Qualcomm Adreno GPU

Advisory: https://www.qualcomm.com/company/product-security/bulletins/may-2021-bulletin

Affected Versions: Prior to Android 2021-05-01 security patch level

Note: the Qualcomm Adreno GPU kernel driver may be used in other platforms aside from Android, but the following analysis was performed with Android in mind, since Android is a high priority area of interest for Project Zero.

First Patched Version: Android 2021-05-01 security patch level

Issue/Bug Report: N/A

Patch CL:
https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?id=d236d315145f8250523ce9e14897d62e5d6639fc
https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?id=ec3c8cf016991818ca286c4fd92255393c211405

Bug-Introducing CL: N/A

Reporter(s): N/A

The Code

Proof-of-concept: N/A

Exploit sample: N/A

Did you have access to the exploit sample when doing the analysis? No

The Vulnerability

Bug class: use-after-free (UaF)

Vulnerability details:

There are two conditions required to trigger this vulnerability.

The first condition is to trigger a state error in a core GPU structure used to track GPU mappings. A GPU shared mapping with multiple VMAs (Linux kernel virtual memory areas) is created (e.g. by splitting a larger mapping). One of the mappings is closed, which results in the kgsl_gpumem_vm_close function being called via the registered struct vm_operations_struct. kgsl_gpumem_vm_close then clears the entry->memdesc.useraddr field of the GPU shared mapping's struct kgsl_mem_entry. Unfortunately, this has an unintended logical effect for the remaining VMA: the entry structure is shared, and this field is used to check whether the entry is already mapped.

Specifically this means that get_mmap_entry will successfully return this entry when the GPU mapping is mapped for a second time. This occurs in both kgsl_mmap and kgsl_get_unmapped_area, but the latter looks most interesting for this attack.
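The state error can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct model_entry, its live_vmas counter, and every model_* function are hypothetical stand-ins invented for illustration, not the driver's code. The only behavior taken from the analysis above is that closing one VMA clears useraddr, and that a zero useraddr is treated as "not mapped".

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct kgsl_mem_entry. */
struct model_entry {
    unsigned long useraddr; /* nonzero is treated as "already mapped" */
    int live_vmas;          /* model-only bookkeeping, used to show the mismatch */
};

/* Models a VMA being created over the entry (initial mmap or a split). */
static void model_vm_open(struct model_entry *e, unsigned long addr)
{
    if (e->useraddr == 0)
        e->useraddr = addr;
    e->live_vmas++;
}

/* Models the pre-patch close: useraddr is cleared even if other VMAs remain. */
static void model_vm_close_prepatch(struct model_entry *e)
{
    e->live_vmas--;
    e->useraddr = 0;
}

/* Models the "is this entry already mapped?" check made before mapping. */
static struct model_entry *model_get_mmap_entry(struct model_entry *e)
{
    return e->useraddr == 0 ? e : NULL;
}

/* Returns 1 if the entry can be mapped again while a VMA is still live. */
int model_state_error(void)
{
    struct model_entry e = { 0, 0 };

    model_vm_open(&e, 0x10000000UL);  /* first VMA */
    model_vm_open(&e, 0x10001000UL);  /* second VMA, e.g. from a split */
    model_vm_close_prepatch(&e);      /* close only one of them */

    return e.live_vmas == 1 && model_get_mmap_entry(&e) != NULL;
}
```

In this model, model_state_error() reports the inconsistent state: one VMA still refers to the entry, yet the lookup treats the entry as unmapped.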

The kgsl_get_unmapped_area function is called by the Linux kernel's mmap implementation. A semaphore (mmap_sem) is held which prevents multiple threads in the same process from calling this function concurrently. In the Qualcomm GPU design, multiple processes can share the same GPU address space (such as a child process that is forked after the KGSL file descriptor is opened), and so multiple VMAs can share the same underlying struct kgsl_mem_entry.

The second condition is to trigger a race condition in kgsl_get_unmapped_area between two processes trying to map the same GPU mapping at the same time. Because this occurs after the first condition has been triggered, the same struct kgsl_mem_entry can be in use in both processes at the same time. Since there are no locks held on this structure, this can lead to unexpected behavior.
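The race window can be sketched as a deterministic interleaving rather than with real threads. This is a model under stated assumptions: the check and the address assignment below are hypothetical placeholders for whatever kgsl_get_unmapped_area does with the entry, and the assignments counter exists only to make the double use visible. The one property taken from the analysis is that nothing serializes the two callers.

```c
/* Hypothetical, simplified stand-in for the shared entry. */
struct model_entry {
    unsigned long useraddr; /* nonzero is treated as "already mapped" */
    int assignments;        /* model-only: how many callers set up the entry */
};

/* Placeholder for the unlocked "not mapped yet?" check. */
static int model_check(const struct model_entry *e)
{
    return e->useraddr == 0;
}

/* Placeholder for the unlocked update that follows a successful check. */
static void model_assign(struct model_entry *e, unsigned long addr)
{
    e->useraddr = addr;
    e->assignments++;
}

/* Both callers check before either one updates: the racy ordering. */
int model_interleaved(void)
{
    struct model_entry e = { 0, 0 };
    int a_ok = model_check(&e);
    int b_ok = model_check(&e);

    if (a_ok)
        model_assign(&e, 0x10000000UL);
    if (b_ok)
        model_assign(&e, 0x20000000UL);
    return e.assignments;
}

/* Each caller's check and update run back to back, as a lock would enforce. */
int model_serialized(void)
{
    struct model_entry e = { 0, 0 };

    if (model_check(&e))
        model_assign(&e, 0x10000000UL);
    if (model_check(&e))
        model_assign(&e, 0x20000000UL);
    return e.assignments;
}
```

In the interleaved ordering both callers pass the check and both set up the same entry, while the serialized ordering lets only the first caller through.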

There are a number of paths that could be explored to exploit this issue, such as using an error path to call kgsl_iommu_put_gpuaddr on a successfully allocated mapping.

Patch analysis:

Although only one patch is listed in the Qualcomm advisory, we believe both patches listed above are relevant to this issue. The first patch changes the way kgsl_gpumem_vm_close accounts for the fact that multiple VMAs may point to the same GPU shared mapping. The second patch adds locking to the memdesc field of the struct kgsl_mem_entry, which aims to prevent similar race conditions in memory management routines.
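One way to picture the first change is a per-entry count of mappings, so that the "mapped" state is cleared only when the last VMA goes away. The sketch below assumes that approach for illustration; the map_count field and the model_* functions are hypothetical names and are not taken from the patch itself. The locking added by the second patch is not modeled here.

```c
/* Hypothetical, simplified stand-in for the shared entry. */
struct model_entry {
    unsigned long useraddr; /* nonzero is treated as "already mapped" */
    int map_count;          /* assumed counter of VMAs over this entry */
};

/* Models a VMA being created over the entry (initial mmap or a split). */
static void model_vm_open(struct model_entry *e, unsigned long addr)
{
    if (e->useraddr == 0)
        e->useraddr = addr;
    e->map_count++;
}

/* Models a counted close: useraddr is cleared only by the last VMA. */
static void model_vm_close_counted(struct model_entry *e)
{
    if (--e->map_count == 0)
        e->useraddr = 0;
}

/* Models the "is this entry free to be mapped?" check. */
static int model_can_map(const struct model_entry *e)
{
    return e->useraddr == 0;
}

/* Returns 1 if the entry stays "mapped" until the final close. */
int model_patched_close(void)
{
    struct model_entry e = { 0, 0 };
    int after_first, after_last;

    model_vm_open(&e, 0x10000000UL);
    model_vm_open(&e, 0x10001000UL);

    model_vm_close_counted(&e);
    after_first = model_can_map(&e); /* one VMA remains */

    model_vm_close_counted(&e);
    after_last = model_can_map(&e);  /* no VMAs remain */

    return after_first == 0 && after_last == 1;
}
```

With the count in place, closing one of two VMAs no longer makes the entry look unmapped, which removes the first of the two conditions in this model.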

Thoughts on how this vuln might have been found (fuzzing, code auditing, variant analysis, etc.):

Given the complex interplay of discretely triggering one condition followed by winning a race condition, this issue would be challenging to fuzz, but it might be possible with a well-crafted fuzzer designed specifically for the Qualcomm GPU driver (e.g. by biasing system calls toward relevant process management, memory management and well-formed KGSL ioctl system calls).

It is possible that this issue was found manually, either by observing the lack of locking on the shared struct kgsl_mem_entry and working backward to establish a path to triggering this, or by observing the suspicious state management in kgsl_gpumem_vm_close and building the attack up from there.

(Historical/present/future) context of bug:

A different use-after-free (UaF) vulnerability was discovered and fixed by Man Yue Mo from the GitHub Security Lab. This vulnerability was in a different part of GPU memory management code, and was not known to be exploited in-the-wild. His write-up of this attack can be found here.

Another issue, CVE-2021-1906, was fixed by Qualcomm at the same time and reported as exploited in-the-wild. This change is believed to be related to CVE-2020-11261 (also marked as exploited in-the-wild), and is not directly useful by itself.

The Exploit

(The terms exploit primitive, exploit strategy, exploit technique, and exploit flow are defined here.)

Exploit strategy (or strategies): N/A

Exploit flow: N/A

Known cases of the same exploit flow: N/A

Part of an exploit chain? N/A

The Next Steps

Variant analysis

Areas/approach for variant analysis (and why):

Generally all of the structures that can be shared between multiple processes (such as struct kgsl_process_private) should be carefully investigated for state assumptions, reference counting issues, and race conditions.

Found variants:

A cursory review of relevant structure members and memory management related ioctls and callbacks didn't surface any variants of this issue.

Structural improvements

What are structural improvements such as ways to kill the bug class, prevent the introduction of this vulnerability, mitigate the exploit flow, make this type of vulnerability harder to exploit, etc.?

Ideas to kill the bug class:

In this case it's hard to say whether the attack would have proceeded with the classical memory corruption route (e.g. using the freed object to achieve arbitrary R/W), or with a GPU-specific approach (such as granting arbitrary physical memory R/W to an attacker-controlled GPU context). If the former, then upcoming memory tagging designs would likely help. The latter approach would require further study.

Ideas to mitigate the exploit flow: N/A

Other potential improvements: N/A

0-day detection methods

What are potential detection methods for similar 0-days? Meaning are there any ideas of how this exploit or similar exploits could be detected as a 0-day?

Kernel crash log analysis might be one approach, but establishing the root-cause of an issue like this using only crash output would be challenging. Runtime anomaly detection might be another option, but would require specialist tooling.

Other References