ActiveBreach

Function Peekaboo: Crafting self-masking functions using LLVM

Introduction

The LLVM compiler infrastructure is powerful because of its modular design, its flexibility, and its rich intermediate representation (IR), which enables deep analysis and transformation of code. Unlike many traditional compilers, LLVM separates the front end (language parsing) from the back end (code generation), allowing developers to support multiple languages and targets with minimal duplication. Its IR is language-agnostic and designed for optimization, making it ideal for building advanced tooling such as static analyzers, custom code generators, obfuscators, and JIT compilers. Additionally, LLVM’s extensive ecosystem, including Clang, LLD, and MLIR (Multi-Level Intermediate Representation), makes it a robust foundation for research, experimentation, and production-grade compiler development.

Clang/Clang++ is a frontend built on top of LLVM that parses and compiles source code written in C, C++, Objective-C, and Objective-C++. It translates this code into LLVM IR, which is then processed by LLVM’s backend. Together, Clang and LLVM form a flexible and extensible toolchain capable of supporting a wide range of languages, platforms, and custom compiler features.

In this post, we will customize the LLVM compiler infrastructure to give ordinary user-defined functions in a C++ source file self-masking capabilities. Self-masking means that a function remains in a masked (obfuscated or encrypted) state until it is invoked. Once execution enters the function, it is temporarily unmasked, and upon returning, it reverts to its masked state.

The Clang compiler built on top of this customized LLVM generates binaries in which every registered function is equipped with this masking behavior. The functions are not masked at compile time; instead, they are masked at startup and stay masked at runtime except while they are executing. This approach enhances binary protection and obfuscation without altering the original control flow or affecting the handling of arguments and return values.
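The whole scheme rests on one property of XOR: applying the same key twice restores the original bytes, so a single routine can both mask and unmask. The C++ sketch below models this on an ordinary byte vector rather than executable memory (there are no VirtualProtect calls). xor_toggle and UnmaskScope are hypothetical names; the RAII scope stands in for the prologue (unmask on entry) and the epilogue (re-mask on exit).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: XOR every byte in [data, data + len) with a one-byte key.
// Applying it twice restores the original bytes, so the same routine serves
// as both the masking and the unmasking operation.
void xor_toggle(uint8_t* data, size_t len, uint8_t key) {
    for (size_t i = 0; i < len; ++i) {
        data[i] ^= key;
    }
}

// Hypothetical RAII model of the prologue/epilogue pair: the constructor
// plays the prologue (unmask), the destructor plays the epilogue (re-mask).
class UnmaskScope {
public:
    UnmaskScope(std::vector<uint8_t>& body, uint8_t key) : body_(body), key_(key) {
        xor_toggle(body_.data(), body_.size(), key_);
    }
    ~UnmaskScope() { xor_toggle(body_.data(), body_.size(), key_); }
    UnmaskScope(const UnmaskScope&) = delete;
    UnmaskScope& operator=(const UnmaskScope&) = delete;

private:
    std::vector<uint8_t>& body_;
    uint8_t key_;
};
```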

The Game Plan

To impart self-masking capabilities to ordinary functions, we must manipulate the program’s control flow by injecting custom code during compilation.
It is crucial that these modifications do not alter the original control flow or interfere with the handling of function arguments and return values.

In this section, we will explore how to implement custom prologue and epilogue stubs for each function within a compilation unit. These stubs will manage execution flow and invoke a dedicated masking handler responsible for performing masking and unmasking operations.

The custom Clang compiler we develop integrates these transformations directly into the compiled binary. The functions are still emitted unmasked, however; the design assumes that every registered function is already masked at runtime.

To enforce this, we introduce an additional component that redirects the entry point and runs initialization logic before the CRT starts. This component masks the original function bodies and their epilogues up front, so that they stay masked throughout the binary’s lifetime in memory except while a function is executing.

PE Layout

The PE file generated by our custom Clang++ compiler includes two custom sections, as illustrated in the image below:

  • The .funcmeta section stores the XOR key along with metadata required to identify the start and length of each function. We will explore the structure and purpose of this section in detail in the following sections. A high-level data layout is illustrated in the image below:
  • The .stub section contains shellcode that serves as the entry point for the PE file and is executed immediately upon program launch.

Function Registration

Our custom LLVM implementation does not indiscriminately mask every function within a compilation unit; instead, it selectively applies modifications only to those functions explicitly chosen by the user. We need to provide users with an intuitive and efficient mechanism to register functions that require masking.

Function Meta Data

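A custom section named .funcmeta is created to hold the XOR key and the function metadata. Section names in PE images are limited to eight bytes, so the name appears as .funcmet in the final executable; this is why the assembly stubs later in this post search for ".fun" followed by "cmet".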

#include <cstdint>

// XOR key; the stubs expect it at offset 0 of the section
__attribute__((section(".funcmeta")))
uint32_t myfuncsec_key = 0x12345678;

The function metadata consists of the function’s start address and the length of its body. We encapsulate this information in a structure named FunctionMetaData, as shown below. For each registered function, an instance of FunctionMetaData is placed in the .funcmeta section.

struct FunctionMetaData 
{
    void* func;
    uint32_t len;
};
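The assembly stubs later in this post hard-code this layout: the handler reads len at offset 0x8 and steps through entries 0x10 bytes apart, because on a 64-bit target the compiler pads the struct to 16 bytes. A small sketch that checks those assumptions at compile time (it repeats the struct only so that the snippet compiles on its own):

```cpp
#include <cstddef>
#include <cstdint>

// Same fields as FunctionMetaData above.
struct FunctionMetaData {
    void*    func;  // offset 0x0: function start address (rebased by the loader)
    uint32_t len;   // offset 0x8: 0xDEADBEEF until the initialization phase runs
};                  // 4 bytes of tail padding follow on x64

// These checks encode the assumptions made by the assembly stubs.
static_assert(sizeof(void*) == 8, "the stubs assume a 64-bit target");
static_assert(offsetof(FunctionMetaData, len) == 0x8, "len is read at +0x8");
static_assert(sizeof(FunctionMetaData) == 0x10, "entries are 0x10 bytes apart");
```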

Registering Functions

Let’s define a macro named REGISTER_FUNCTION(), shown below, which inserts a metadata entry into the .funcmeta section. The macro takes the name of the function to be masked and records that function’s address in the custom section. Because the length of the function body is not known to the source code at compile time, we use the placeholder value 0xDEADBEEF; the real length is filled in at runtime. To distinguish registered functions from regular ones, we follow a naming convention in which each registered function name begins with the prefix REG_, which allows our LLVM pass to identify them during compilation.

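// Macro to register a function: emits a FunctionMetaData entry into .funcmeta.
// The length is a placeholder that the handler overwrites at startup.
#define REGISTER_FUNCTION(fn) \
    __attribute__((section(".funcmeta"))) \
    struct FunctionMetaData fn##_entry = { (void*)fn, 0xDEADBEEF }

void REG_foo()
{
    // function body
}

REGISTER_FUNCTION(REG_foo);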

Custom Prologue and Epilogue

Before diving into the details, we need to discuss the initialization phase. As mentioned earlier, each registered function in the compiled binary gets a custom prologue and epilogue that redirect control flow to the handler code. The prologue expects the original function body, together with the epilogue appended to it, to already be masked, so that it can instruct the handler to unmask the body before executing it. After the body has executed, the epilogue is responsible for masking the code again. This means the function body must already be masked when control first reaches the prologue. We therefore introduce an initialization phase that runs before the program resumes normal execution by invoking the CRT. Both the prologue and the epilogue must be able to distinguish between the normal execution flow and the initialization phase, as discussed in the following sections.

Prologue

The prologue works as follows:

  • Compute the function’s start address and make it available to the handler. This is done with the call/pop trick combined with RIP-relative addressing. The computed start address is stored in the UserReserved[1] member of the TEB for later use.
  • Check whether we are in the initialization phase by reading the value stored in the UserReserved[0] member of the TEB. If the value is 0x80000001, we are in the initialization phase and simply jump to the epilogue stub appended to the function body. Otherwise, the prologue calls the handler to unmask the function body and then falls through into it.
  • The prologue must leave RCX/RDX/R8/R9 (the function arguments), the non-volatile registers, and the stack pointer exactly as it found them.
;
; DO NOT modify RCX/RDX/R8/R9, as this would corrupt the function parameters, and preserve
; the non-volatile registers - RBX/RBP/RDI/RSI/RSP/R12/R13/R14/R15
;

;
; gs:[0xE8] => TEB -> UserReserved[0] => init_phase status value : 0x80000001
; gs:[0xF0] => TEB -> UserReserved[1] => Target function start address
; gs:[0xF8] => TEB -> UserReserved[2] => VirtualProtect address
;


BITS 64

prologue:
    call get_rip                    ; Calculate start address
get_rip:
    pop rax
    push rcx                        ; Save rcx
    push rdx                        ; Save rdx
    lea rcx, [rel get_rip]
    lea rdx, [rel prologue]
    sub rcx, rdx
    sub rax, rcx                    ; Function start address

    pop rdx                         ; Restore rdx
    pop rcx                         ; Restore rcx


    ;
    ; Read TEB -> UserReserved[0] to check if we are in the initialization phase.
    ; The value 0x80000001 indicates the initialization phase.
    ;

    mov gs:[0xF0], rax              ; Store the function start address in TEB -> UserReserved[1]
    mov r10, gs:[0xE8]              ; TEB -> UserReserved[0]
    xor rax, rax
    mov rax, 0x80000001             ; Init-phase marker
    cmp r10, rax
    ;
    ; Initialization phase,
    ;
    je epilogue               

    ;
    ; Normal execution flow
    ; Handler is invoked to unmask the code
    ;

    call handler             
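The following C++ model restates the two decisions the prologue makes. It is illustrative only, and the names are hypothetical. Because `call get_rip` is a 5-byte `E8 rel32` instruction, the popped return address lies 5 bytes past the prologue’s first byte, and subtracting the `get_rip - prologue` distance recovers the start address.

```cpp
#include <cstdint>

// Value the entry stub stores in TEB -> UserReserved[0] (gs:[0xE8]) while
// the initialization phase is running.
constexpr uint64_t kInitPhase = 0x80000001;

// What the prologue does after computing the function start address.
enum class PrologueAction {
    JumpToEpilogue,      // initialization: skip the (still clear) body
    CallHandlerToUnmask  // normal run: unmask the body, then fall into it
};

// Model of the call/pop trick: the popped return address is the address of
// get_rip, so subtracting the distance get_rip - prologue yields the start.
uint64_t function_start(uint64_t popped_rax, uint64_t get_rip_offset,
                        uint64_t prologue_offset) {
    return popped_rax - (get_rip_offset - prologue_offset);
}

// Model of the UserReserved[0] comparison that follows.
PrologueAction prologue_action(uint64_t user_reserved0) {
    return user_reserved0 == kInitPhase ? PrologueAction::JumpToEpilogue
                                        : PrologueAction::CallHandlerToUnmask;
}
```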

Epilogue

The epilogue works as follows:

  • In the initialization phase, it calls the handler directly. The call pushes the address of the instruction that follows it, which marks the function boundary, onto the stack as the return address, so the handler can easily compute the function size. We will discuss this in detail later in this post.
  • Outside the initialization phase, it simply jumps to the handler. A jump leaves the caller’s return address on top of the stack, so the handler can return directly to the original caller of the masked function.
  • It must not modify RAX (the function’s return value) or any non-volatile register.
;
; DO NOT modify rax here
;


;
; gs:[0xE8] => TEB -> UserReserved[0] => init_phase status value : 0x80000001
; gs:[0xF0] => TEB -> UserReserved[1] => Target function start address
; gs:[0xF8] => TEB -> UserReserved[2] => VirtualProtect address
;

BITS 64
epilogue:

    mov rdx, gs:[0xE8]          ;init_phase check, TEB -> UserReserved[0]
    xor rcx, rcx
    mov rcx, 0x80000001
    cmp rdx, rcx                                 
    je init_stage              ;initialization phase


    jmp handler               ;use jmp here so that we can return to original caller of this function from the handler

init_stage:                    ;Initialization phase
    call handler               ;The ret addr pushed will be used to compute func size
    ret                        ;function boundary

Masking Handler

The handler works as follows:

  • It saves the caller’s register values before executing its own logic.
  • It locates the .funcmeta section and parses the function metadata stored there. In the initialization phase (TEB.UserReserved[0] == 0x80000001), the handler branches to the init_phase label and records the body size of the current function in its .funcmeta entry. Afterwards, the handler can look up the size for any function start address simply by consulting .funcmeta, and use it to mask or unmask the code.
  • By the time control reaches the handler, the function’s prologue has already placed the function’s start address in TEB.UserReserved[1], and the initialization code has placed the address of the VirtualProtect API in TEB.UserReserved[2].
  • In this POC, we use simple XOR encoding to mask the function body and the epilogue stub. The encoding loop XORs each byte with SIL, so only the low byte of the 32-bit key is actually used (0x78 for the key 0x12345678). The handler calls VirtualProtect with PAGE_EXECUTE_READWRITE (0x40) before the XOR pass and with PAGE_EXECUTE_READ (0x20) after it.
  • After the XOR pass, we restore the registers and return. In the initialization phase, this returns to the epilogue and from there back to the initialization code; in normal execution, it returns either into the function body (when called from the prologue) or to the original caller (when jumped to from the epilogue).
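Both the handler and the entry stub locate the metadata section by walking the PE section table. A C++ sketch of the same walk over an in-memory image (find_section_rva is a hypothetical helper) is shown below; note that NumberOfSections and SizeOfOptionalHeader are 16-bit fields, and that the name is matched on all eight bytes, i.e. as .funcmet.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper mirroring the stubs' section walk over a mapped image.
// `name` must point to at least 8 bytes. Returns the section RVA, or 0.
uint32_t find_section_rva(const uint8_t* image, const char* name) {
    auto rd16 = [&](size_t off) { uint16_t v; std::memcpy(&v, image + off, 2); return v; };
    auto rd32 = [&](size_t off) { uint32_t v; std::memcpy(&v, image + off, 4); return v; };

    if (rd16(0) != 0x5A4D) return 0;                  // 'MZ'
    uint32_t nt = rd32(0x3C);                         // e_lfanew
    if (rd32(nt) != 0x00004550) return 0;             // 'PE\0\0'

    uint16_t count = rd16(nt + 0x6);                  // FileHeader.NumberOfSections (WORD)
    uint16_t opt_size = rd16(nt + 0x14);              // FileHeader.SizeOfOptionalHeader
    size_t table = nt + 0x18 + opt_size;              // first IMAGE_SECTION_HEADER

    for (uint16_t i = 0; i < count; ++i) {
        size_t hdr = table + static_cast<size_t>(i) * 0x28;  // sizeof(IMAGE_SECTION_HEADER)
        if (std::memcmp(image + hdr, name, 8) == 0) {
            return rd32(hdr + 0x0C);                  // VirtualAddress
        }
    }
    return 0;
}
```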
BITS 64

struc FunctionMetaData          
    .FunctionStartAddress  resq 1          
    .FunctionLength        resd 1                
endstruc

;
; gs:[0xE8] => TEB -> UserReserved[0] => init_phase status value : 0x80000001
; gs:[0xF0] => TEB -> UserReserved[1] => Target function start address
; gs:[0xF8] => TEB -> UserReserved[2] => VirtualProtect address
;

;
;Save caller state
;

push rcx
push rdx
push r8
push r9
push rax
push rbx
push rsi
push rdi
push rbp
push r12 
push r13
push r14
push r15

;
;Fetch runtime ImageBase from PEB
;

mov rax, gs:[0x60]         ; Get PEB base
mov rsi, [rax + 0x10]      ; ImageBaseAddress

;
;Fetch function start address from TEB -> UserReserved[1] 
;

mov rax, gs:[0xF0]

find_pe_header:
    ;
    ; PE file validation, optional
    ;
    cmp word [rsi], 0x5A4D      ; 'MZ'
    jne done
    mov edi, [rsi + 0x3C]       ; e_lfanew
    add rdi, rsi
    cmp dword [rdi], 0x00004550 ;'PE\0\0'
    jne done

    ;
    ;Locate section table
    ;

    movzx ecx, word [rdi + 0x6] ; NumberOfSections (WORD)
    xor rbx,rbx
    mov bx, [rdi + 0x14]        ; SizeOfOptionalHeader
    add rbx, rdi
    add rbx, 0x18               ; Section table starts here


section_lookup:
    cmp dword [rbx], 0x6E75662E         ; ".fun"
    jne next_section

    cmp dword [rbx + 4], 0x74656D63     ; "cmet"
    jne next_section

    mov edx, dword [rbx + 0x0C]         ; RVA
    add rdx, rsi                        ; Convert to VA
    jmp fetch_metadata

next_section:
    add rbx, 0x28                       ; IMAGE_SECTION_HEADER size
    loop section_lookup
    jmp done                            ; .funcmet not found, leave memory untouched

init_phase:
    ;
    ; RAX/R8 contain the function start address
    ; [RSP + 0x68] holds the epilogue's return address (function boundary)
    ; RBX points to FunctionMetaData[i]
    ; RSI holds the XOR key
    ; xor_encoder expects the function length in RDX
    ;

    xor r9, r9
    mov rcx, rsp
    add rcx, 0x68                                       ;Skip the 13 saved registers
    mov r9, [rcx]                                       ;Fetch epilogue return address, function boundary
    sub r9, rax                                         ;Function size
    mov [rbx + FunctionMetaData.FunctionLength], r9d    ;FunctionLength is a DWORD
    mov rdx, r9
    jmp xor_encoder

fetch_metadata:
    xor r9, r9
    xor r10,r10
    mov r9d, dword [rdx]            ; r9 = xor key
    add rdx, 0x8      
    mov rbx, rdx                    ; RBX = points to metadata struct in .funcmeta

process_metadata_entries:
    mov r8, [rbx + FunctionMetaData.FunctionStartAddress]
    cmp r8, 0
    je done
    mov edx, dword [rbx + FunctionMetaData.FunctionLength]  ;DWORD field, zero-extended into RDX
    xor rsi,rsi                                         ;Clear data
    mov rsi, r9                                         ;Place xor key in rsi
    cmp rax, r8                                         ;Check if caller's start address is same as metadata entry    
    je function_found
    add rbx, 0x10
    jmp process_metadata_entries


function_found:
    ;
    ; Read TEB -> UserReserved[0] to check if we are in the initialization phase.
    ; The value 0x80000001 indicates the initialization phase.
    ;

    mov r10, gs:[0xE8] 
    xor rcx, rcx
    mov rcx, 0x80000001
    cmp r10, rcx            ;init phase check
    je init_phase

    ;   
    ;RSI -->key
    ;R8  -->Target Memory
    ;RDX -->Length
    ;

xor_encoder:

    add r8, 0x46                ; Prologue length 
                                ; Delta between prologue stub and LLVM emitted code - 0xB.
                                ; Total length = prologue stub length + delta

    sub rdx, 0x46               ; Update length (Function length - prologue length)

    test    rdx, rdx            ; check if length is zero
    jz      done                ; if zero, exit

    xor rbx,rbx       
    mov r11, gs:[0xF8]          ;TEB -> UserReserved[2]
    push rdx
    push r8

    mov rcx, r8                 ; lpAddress (RDX already holds dwSize)
    mov r8, 0x40                ; flNewProtect = PAGE_EXECUTE_READWRITE

    sub rsp, 8                  ; Reserve 8 bytes
    mov qword [rsp], 0          ; Optional: zero it
    mov r9, rsp                 ; R9 = pointer to old protection


    sub rsp, 0x20               ; Shadow space

    call r11                    ; VirtualProtect()

    add rsp, 0x28               ; Stack clean-up    

    ;
    ; Restore data
    ;     

    pop r8
    pop rdx

    ;
    ;Save data for future VirtualProtect call
    ;

    push r8
    push rdx

encode_loop:
    mov     bl, [r8]         ; Load target instruction
    xor     bl, sil          ; perform xor on instruction
    mov     [r8], bl         ; store encoded byte back
    inc     r8               ; move to next byte
    dec     rdx              ; decrement function length
    jnz     encode_loop     

    pop rdx             ; dwSize
    pop rcx             ; lpAddress
    mov r8, 0x20        ; flNewProtect = PAGE_EXECUTE_READ

    sub rsp, 8                 
    mov qword [rsp], 0        
    mov r9, rsp         ; lpflOldProtect

    sub rsp, 0x20       ; Shadow space allocation

    mov r11, gs:[0xF8]  ; Fetch VirtualProtect address from TEB -> UserReserved[2]

    call r11            ; VirtualProtect()

    add rsp, 0x28       ; Stack clean-up

done:

    ;
    ;Restore caller state
    ;

    pop r15
    pop r14
    pop r13
    pop r12
    pop rbp
    pop rdi
    pop rsi
    pop rbx
    pop rax
    pop r9
    pop r8
    pop rdx
    pop rcx

    ret
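Stripped of the PE parsing and the VirtualProtect calls, the handler’s core logic can be modeled in C++ as follows. This is a sketch on a flat byte buffer that stands in for .text; `handler_model` and `Entry` are hypothetical names. It keeps the stub’s conventions: the first 0x46 bytes of each function stay in the clear, the size is computed as boundary minus start in the initialization phase, and only the low byte of the key is used. It also rejects lengths that are not larger than the prologue, whereas the assembly only checks for exactly zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes left in the clear at the start of each function (prologue stub plus
// the 0xB-byte delta), matching the 0x46 constant used by xor_encoder.
constexpr uint64_t kPrologueLen = 0x46;

struct Entry {            // simplified FunctionMetaData
    uint64_t func;        // function start address
    uint32_t len;         // total length, filled in during initialization
};

// Hypothetical model of the handler. `text_base` is the address that text[0]
// represents; `boundary` is only used in the initialization phase.
bool handler_model(std::vector<uint8_t>& text, uint64_t text_base,
                   std::vector<Entry>& table, uint32_t key, uint64_t start,
                   bool init_phase, uint64_t boundary) {
    for (Entry& e : table) {
        if (e.func == 0) break;                 // zero entry terminates the table
        if (e.func != start) continue;
        if (init_phase) {
            e.len = static_cast<uint32_t>(boundary - start);  // ret addr - start
        }
        if (e.len <= kPrologueLen) return false;
        uint8_t k = static_cast<uint8_t>(key);  // the stub XORs with SIL only
        uint64_t first = start + kPrologueLen - text_base;
        uint64_t count = e.len - kPrologueLen;
        for (uint64_t i = 0; i < count; ++i) text[first + i] ^= k;
        return true;
    }
    return false;
}
```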

Redirecting Entrypoint and Initialization Phase

After our custom LLVM toolchain has produced the binary, with the handler code embedded in the .text section and the prologue/epilogue stubs attached to every registered function, one more step remains: embedding an entry point stub into the binary. To direct control to this stub, we patch the AddressOfEntryPoint member of the PE optional header. The entry point stub runs the initialization phase and serves two purposes: first, computing the size of each registered function; and second, masking each of them before normal execution resumes.

The code below can be summarized as follows:

  • The code includes optional IAT parsing to fetch the address of the VirtualProtect API. It is optional here because this logic could be deferred to the handler stub, which is a more suitable place for it: if custom OPSEC code (for example, code implementing stack spoofing) has already hooked the IAT, resolving API addresses from the handler picks up those hooks transparently. I will leave that as an exercise to the reader.
  • After the IAT parsing, we store 0x80000001 in TEB.UserReserved[0] so that the prologue, the epilogue, and the handler know that the initialization phase is active.
  • The code uses the placeholder 0x12345678 for the original entry point RVA. It is patched externally by a Python script, which we will discuss in detail later.
  • Next, the stub fetches the data embedded in the .funcmeta section, where each entry is represented by the structure below. At compile time, the address of a registered function is stored in the first member, func. Because it is stored as a pointer, the loader rebases it, so the rebased address is available here at runtime. The size of the function is not stored at compile time. Instead, the entry point logic calls each registered function, whose prologue and epilogue route control to the handler, and the handler stores the dynamically computed size in the len field.
  struct FunctionMetaData 
  {
      void* func;
      uint32_t len;
  };
  • The initialize_loop label takes care of the initialization phase by calling each registered function and recording each function’s size in the .funcmeta section, as discussed above. Before we call into each function we will
  • Many junk instructions have been added towards the end of the code to break a generic Microsoft Defender signature.
  • Finally, we exit the initialization loop, store 0 in TEB.UserReserved[0] to indicate that initialization is complete, and jump to the original entry point (the CRT).
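The IAT walk performed at the start of the stub can be modeled in C++ as shown below. resolve_import is a hypothetical helper operating on a mapped image; it follows the IMAGE_IMPORT_DESCRIPTOR layout (0x14 bytes, OriginalFirstThunk at +0x0, Name at +0xC, FirstThunk at +0x10) and 64-bit thunks. Unlike a fixed-length prefix comparison, it compares the full import name, so VirtualProtectEx is never mistaken for VirtualProtect, and it compares the DLL name case-insensitively.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstring>

static uint32_t rd32(const uint8_t* p) { uint32_t v; std::memcpy(&v, p, 4); return v; }
static uint64_t rd64(const uint8_t* p) { uint64_t v; std::memcpy(&v, p, 8); return v; }

// Case-insensitive comparison for DLL names ("KERNEL32.dll" vs "kernel32.dll").
static bool iequals(const char* a, const char* b) {
    for (; *a && *b; ++a, ++b) {
        if (std::tolower(static_cast<unsigned char>(*a)) !=
            std::tolower(static_cast<unsigned char>(*b))) return false;
    }
    return *a == *b;
}

// Hypothetical helper: returns the resolved address stored in the IAT slot
// for dll!func, or 0 if the import is not present.
uint64_t resolve_import(const uint8_t* base, uint32_t import_dir_rva,
                        const char* dll, const char* func) {
    for (const uint8_t* d = base + import_dir_rva; rd32(d) != 0; d += 0x14) {
        const char* name = reinterpret_cast<const char*>(base + rd32(d + 0x0C));
        if (!iequals(name, dll)) continue;
        const uint8_t* int_entry = base + rd32(d + 0x00);  // OriginalFirstThunk (INT)
        const uint8_t* iat_entry = base + rd32(d + 0x10);  // FirstThunk (IAT)
        for (; rd64(int_entry) != 0; int_entry += 8, iat_entry += 8) {
            uint64_t thunk = rd64(int_entry);
            if (thunk & 0x8000000000000000ULL) continue;    // import by ordinal
            const char* import_name = reinterpret_cast<const char*>(
                base + static_cast<uint32_t>(thunk) + 2);   // skip the 2-byte Hint
            if (std::strcmp(import_name, func) == 0) return rd64(iat_entry);
        }
    }
    return 0;
}
```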
BITS 64



struc FunctionMetaData          
    .FunctionStartAddress  resq 1          
    .FunctionLength        resd 1                
endstruc

    ;
    ;Fetch runtime ImageBase from PEB
    ;
    mov rax, gs:[0x60]         ; Get PEB base
    mov rsi, [rax + 0x10]      ; ImageBaseAddress


    ;
    ;
    ;IAT Parser begin
    ;
    ;

    mov rbx, rsi
    mov eax, dword [rbx + 0x3C]        ; e_lfanew
    add rbx, rax                       ; NT Headers

    ; Get Optional Header
    add rbx, 0x18                      ; skip Signature + FileHeader
    mov rdx, rbx                       ; Optional Header

    ; Get RVA of Import Directory
    mov eax, dword [rdx + 0x78]        ; DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress
    test eax, eax
    jz not_found
    add rax, rsi                       ; Import Directory VA
    mov rdi, rax                       ; IMAGE_IMPORT_DESCRIPTOR

find_kernel32:
    mov eax, dword [rdi]               ; Check if descriptor is null
    test eax, eax
    jz not_found

    ; Get DLL name
    mov eax, dword [rdi + 0x0C]        ; Name RVA
    add rax, rsi
    mov r8, rax                        ; DLL name


    mov rcx, 0x32334C454E52454B        ; "KERNEL32"           
    cmp qword [r8], rcx
    jne next_descriptor
    mov ecx, dword [r8 + 8]
    cmp ecx, 0x6C6C642E                ; ".dll" (case-sensitive)
    jne next_descriptor

    ; Found kernel32.dll
    xor rbx, rbx
    mov ebx, dword [rdi + 0x10]             ; FirstThunk RVA
    add rbx, rsi                            ; rbx = IAT
    mov eax, dword [rdi + 0x00]             ; OriginalFirstThunk RVA (DWORD, zero-extends into RAX)
    add rax, rsi                            ; rax = INT
    mov rcx, rbx                            ; rcx = IAT
    mov rdx, rax                            ; rdx = INT

loop_thunks:
    mov r8, [rdx]
    test r8, r8
    jz not_found

    ; Check if import by ordinal
    mov rax, 0x8000000000000000
    test r8, rax
    jnz next_thunk

    ; Get function name
    add r8, rsi
    add r8, 2                               
    mov r9, r8                              ;r9 = function name

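    ;
    ; Compare with "VirtualProtect" (exact match, so "VirtualProtectEx" is rejected)
    ;

    mov rax, 0x506c617574726956             ; "VirtualP"
    cmp qword [r9], rax
    jne next_thunk
    mov eax, dword [r9 + 8]
    cmp eax, 0x65746f72                     ; "rote"
    jne next_thunk
    cmp word [r9 + 12], 0x7463              ; "ct"
    jne next_thunk
    cmp byte [r9 + 14], 0                   ; terminating NUL
    jne next_thunk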

    ;
    ; Found thunk for VirtualProtect
    ;

    mov rax, [rcx]                          ; Get resolved address of VirtualProtect from IAT
    mov gs:[0xF8], rax
    jmp iat_parsing_success

next_thunk:
    add rcx, 8                         ; Next IAT entry
    add rdx, 8                         ; Next INT entry
    jmp loop_thunks

next_descriptor:
    add rdi, 0x14                      ; Next IMAGE_IMPORT_DESCRIPTOR
    jmp find_kernel32

not_found:
    xor rax, rax
    ret                                ; Return to NTDLL


    ;
    ;
    ;IAT Parser end
    ;
    ;

iat_parsing_success:

    ;
    ;Store the value 0x80000001 in TEB -> UserReserved[0] to indicate initialization phase.
    ;

    xor rbx, rbx
    mov rbx, 0x80000001
    mov gs:[0xE8], rbx

    ;
    ;Fetch AddressOfEntryPoint - DWORD offset
    ;


    xor r14, r14
    mov r14, 0x12345678        ; Placeholder 0x12345678

find_pe_header:
    cmp word [rsi], 0x5A4D      ; 'MZ'
    jne done
    mov edi, [rsi + 0x3C]       ; e_lfanew
    add rdi, rsi 
    cmp dword [rdi], 0x00004550 ;'PE\0\0'
    jne done
    ;
    ; Junk instructions 
    ;
    nop
    xor eax, eax
    inc eax
    dec eax

    ;
    ; Locate section table
    ;

    movzx ecx, word [rdi + 0x6] ; NumberOfSections (WORD)
    xor rbx, rbx
    mov bx, [rdi + 0x14]        ; Size of optional header
    add rbx, rdi
    add rbx, 0x18               ; Section table starts here

    ;
    ; Junk instructions start
    ;

    push rax
    pop rax
    mov rax, rax
    nop

    ;
    ; Junk instructions end
    ;

section_lookup:
    cmp dword [rbx], 0x6E75662E     ; ".fun"
    jne next_section

    ;
    ; Junk instructions start
    ;
    xor r8, r8
    test r8, r8
    jz .skip1
    .skip1:
    ;
    ; Junk instructions end
    ;

    cmp dword [rbx + 4], 0x74656D63 ; "cmet"
    jne next_section

    ;
    ; Custom section .funcmet found
    ;

    mov edx, dword [rbx + 0x0C]          ; RVA
    add rdx, rsi                         ; Convert to VA
    jmp fetch_metadata

next_section:
    add rbx, 0x28                        ; IMAGE_SECTION_HEADER size
    ;
    ; Junk instructions start
    ;

    xor r9, r9
    mov r9, r9
    loop section_lookup
    jmp done                             ; .funcmet not found, skip initialization

    ;
    ; Junk instructions end
    ;

fetch_metadata:
    add rdx, 0x8                         ; Skip xor-key
    mov rbx, rdx

    ;
    ; Junk instructions start
    ;
    nop
    pushfq
    popfq

    ;
    ; Junk instructions end
    ;

    ;
    ; Perform initialization - Mask registered functions before execution of Main()
    ;

initialize_loop:
    mov r13, [rbx + FunctionMetaData.FunctionStartAddress] 
    cmp r13, 0
    je done

    ;
    ; Junk instructions start
    ;

    xor r10, r10
    test r10, r10
    jz .skip2
    .skip2:

    ;
    ; Junk instructions end
    ;

    call r13                             ; Call registered function 
    add rbx, 16                          ; Move to next FunctionMetaData entry
    jmp initialize_loop                  ; RCX is not a counter here; the table ends at a zero entry

done:
    ;
    ; Initialization finished 
    ; Make sure we change the value 0x80000001 in TEB->UserReserved[0] to 0
    ;

    xor eax, eax
    mov gs:[0xE8], eax

    ;
    ; Junk instructions 
    ;
    nop
    mov rcx, rcx
    push rdx
    pop rdx

    ;
    ; Junk instructions end
    ;

    ;
    ; Execute original entry point (CRT)
    ;

    add r14, rsi        ; AddressOfEntryPoint DWORD offset + ImageBaseAddress
    jmp r14
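The post patches the placeholder with a Python script that is not shown here. As a rough C++ equivalent (not the author’s script), a hypothetical patcher reads the original AddressOfEntryPoint, writes it over the 0x12345678 placeholder in the stub bytes, and then points AddressOfEntryPoint at the stub. AddressOfEntryPoint sits 0x10 bytes into the optional header, i.e. at e_lfanew + 0x28. The sketch assumes the placeholder appears exactly once, as a little-endian 32-bit immediate, and that the stub’s RVA is already known; copying the stub into the .stub section is not covered.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint32_t kPlaceholder = 0x12345678;

// AddressOfEntryPoint sits at OptionalHeader + 0x10 = e_lfanew + 0x18 + 0x10.
static size_t entry_point_offset(const std::vector<uint8_t>& pe) {
    uint32_t e_lfanew;
    std::memcpy(&e_lfanew, pe.data() + 0x3C, 4);
    return static_cast<size_t>(e_lfanew) + 0x28;
}

// Replaces the single little-endian occurrence of the placeholder in the stub
// with `value`. Returns false if it is missing or appears more than once.
bool patch_placeholder(std::vector<uint8_t>& stub, uint32_t value) {
    size_t hit = 0, found = 0;
    for (size_t i = 0; i + 4 <= stub.size(); ++i) {
        uint32_t v;
        std::memcpy(&v, stub.data() + i, 4);
        if (v == kPlaceholder) { hit = i; ++found; }
    }
    if (found != 1) return false;
    std::memcpy(stub.data() + hit, &value, 4);
    return true;
}

// Stores the original entry RVA in the stub, then points the image at the stub.
bool redirect_entry(std::vector<uint8_t>& pe, std::vector<uint8_t>& stub,
                    uint32_t stub_rva) {
    size_t off = entry_point_offset(pe);
    uint32_t original_rva;
    std::memcpy(&original_rva, pe.data() + off, 4);
    if (!patch_placeholder(stub, original_rva)) return false;
    std::memcpy(pe.data() + off, &stub_rva, 4);
    return true;
}
```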

LLVM Compiler Infrastructure

LLVM compilation is organized into several phases, each responsible for transforming source code into optimized machine code. Here’s a breakdown of the key phases:

  1. Frontend Phase
  • Purpose: Parses source code and generates LLVM IR.
  • Tools: Clang (for C/C++), Flang (Fortran), etc.
  • Steps:
    • Lexical and syntax analysis
    • Semantic analysis
    • AST (Abstract Syntax Tree) generation
    • IR generation (.ll or .bc)
  2. Middle-End (Optimizer) Phase
  • Purpose: Performs target-independent optimizations on LLVM IR.
  • Key Components:
    • Pass Manager: Runs a sequence of optimization passes.
    • Optimization Passes:
      • Constant folding
      • Dead code elimination
      • Loop unrolling
      • Inlining
      • Scalar replacement
      • Global value numbering
  3. Backend Phase
  • Purpose: Converts optimized IR to target-specific machine code.
  • Steps:
    • Instruction selection
    • Register allocation
    • Instruction scheduling
    • Emission of assembly or object code
  4. Code Generation Phase
  • Purpose: Emits final machine code or object files.
  • Tools: LLVM CodeGen, MC layer
  • Output: .o, .obj, .exe, .dll, etc.
  5. Link-Time Optimization (LTO) [Optional]
  • Purpose: Performs whole-program optimization across modules.
  • Types:
    • Full LTO: Uses LLVM IR across modules.
    • Thin LTO: More scalable, uses summaries for cross-module optimization.

Picking The Right Phase

For this project, we do not interact with LLVM’s IR-level code, meaning no modifications are required at that stage. Instead, our focus is on attaching a custom prologue and epilogue to the beginning and end of each registered function, respectively, and embedding a handler stub within the .text section. These transformations must occur during the backend phase. To emit the stub code correctly, we will need to modify specific components of LLVM’s code generation infrastructure.

Before diving into the backend modifications, we must first address a critical issue: patching return instructions. Each registered function typically ends with a return instruction, which would return to the caller before the custom epilogue stub appended after the body could run. To resolve this, we implement a custom backend pass that scans the body of each registered function, erases the final return instruction so that execution falls through into the epilogue, and rewrites any other return instructions as jumps to the handler.

To create a custom machine function pass, let’s declare a subclass X86RetModPass that inherits from the MachineFunctionPass class and overrides its runOnMachineFunction method.

#ifndef LLVM_LIB_TARGET_X86_X86RETMODPASS_H
#define LLVM_LIB_TARGET_X86_X86RETMODPASS_H

#include "llvm/CodeGen/MachineFunctionPass.h"

namespace llvm {

class X86RetModPass : public MachineFunctionPass {
public:
  static char ID;
  X86RetModPass();

  bool runOnMachineFunction(MachineFunction &MF) override;
  StringRef getPassName() const override;
};

} // end namespace llvm

#endif // LLVM_LIB_TARGET_X86_X86RETMODPASS_H

Let’s implement runOnMachineFunction so that the pass performs the following tasks:

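  • Get the name of the current function and demangle it.
  • Check whether the demangled name contains REG_; if it does, apply the pass. Otherwise, skip to the next function. A substring search is used rather than a prefix check because, with the Microsoft C++ ABI, the demangled name can include the return type and calling convention (for example, void __cdecl REG_foo(void)).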
  • The logic of the pass is straightforward: if a return instruction is located at the end of a function block, we simply erase it. However, if a return is found elsewhere, we replace it with a jump instruction that redirects control to our handler stub.
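#include "X86RetModPass.h"
#include "X86InstrInfo.h"                      // X86::JMP_1
#include "llvm/ADT/STLExtras.h"                // llvm::reverse
#include "llvm/CodeGen/MachineInstrBuilder.h"  // BuildMI
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/Demangle/Demangle.h"            // llvm::demangle
#include "llvm/MC/MCContext.h"

using namespace llvm;

char X86RetModPass::ID = 0;

X86RetModPass::X86RetModPass() : MachineFunctionPass(ID) {}

StringRef X86RetModPass::getPassName() const {
    return "X86 REG_ function return rewriting";
}

bool X86RetModPass::runOnMachineFunction(MachineFunction &MF) {
    const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
    MCContext &Ctx = MF.getContext();

    // Demangle function name
    std::string MangledName = MF.getName().str();
    std::string FuncName = llvm::demangle(MangledName);

    // Skip functions whose demangled name doesn't contain "REG_"
    if (FuncName.find("REG_") == std::string::npos) {
      return false; // Nothing was modified
    }

    // Find the last RET instruction in the function
    MachineInstr *LastRetInstr = nullptr;
    for (auto &MBB : llvm::reverse(MF)) {
      for (auto &MI : llvm::reverse(MBB)) {
        if (MI.isReturn()) {
          LastRetInstr = &MI;
          break;
        }
      }
      if (LastRetInstr) break;
    }

    bool Changed = false;
    for (auto &MBB : MF) {
      for (auto MI = MBB.begin(); MI != MBB.end(); ) {
        if (MI->isReturn()) {
          DebugLoc DL = MI->getDebugLoc();

          if (&*MI == LastRetInstr) {
            MI = MBB.erase(MI); // Erase last RET; control falls into the epilogue
          } else {
            // Early return: jump straight to the handler, which re-masks
            // the function and returns to the original caller
            MCSymbol *Sym = Ctx.getOrCreateSymbol("handler");
            BuildMI(MBB, MI, DL, TII->get(X86::JMP_1)).addSym(Sym);
            MI = MBB.erase(MI);
          }
          Changed = true;
        } else {
          ++MI;
        }
      }
    }

    return Changed;
}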
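The rewrite rule can be exercised without building LLVM. The C++ model below is hypothetical: blocks of mnemonic strings stand in for MachineBasicBlocks. It applies the same two steps as the pass, erasing the last ret so the final block falls through into the appended epilogue and turning every remaining ret into a jump to the handler. Like the pass, it assumes that the block holding the last ret is laid out last, immediately before the epilogue.

```cpp
#include <string>
#include <vector>

// A basic block modeled as a list of mnemonics (hypothetical stand-in for
// MachineBasicBlock).
using Block = std::vector<std::string>;

// Mirrors X86RetModPass: erase the last "ret" so the final block falls through
// into the appended epilogue, and turn every other "ret" into "jmp handler".
// Returns true if anything changed.
bool rewrite_returns(std::vector<Block>& blocks) {
    int last_b = -1;
    int last_i = -1;
    // Find the last ret by scanning blocks and instructions in reverse.
    for (int b = static_cast<int>(blocks.size()) - 1; b >= 0 && last_b < 0; --b) {
        for (int i = static_cast<int>(blocks[b].size()) - 1; i >= 0; --i) {
            if (blocks[b][i] == "ret") {
                last_b = b;
                last_i = i;
                break;
            }
        }
    }
    if (last_b < 0) return false;

    blocks[last_b].erase(blocks[last_b].begin() + last_i);  // fall into epilogue
    for (Block& block : blocks) {
        for (std::string& insn : block) {
            if (insn == "ret") insn = "jmp handler";         // early exits
        }
    }
    return true;
}
```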

Registering The Pass

Registering a pass is the process of informing LLVM about our custom transformation so it can be integrated into the compilation pipeline. As discussed earlier, it’s crucial to choose the appropriate phase for registration. Since our work does not involve IR-level transformations, we want LLVM to execute our pass during the Pre-Emit phase. This phase is ideal for performing low-level code transformations just before the machine instructions are emitted for the target architecture.

The LLVM X86 target provides several hook points for injecting custom passes, as outlined below. For our use case, we utilize the addPreEmitPass() hook to register our X86RetModPass, ensuring it runs just before machine code emission.

//source--> https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86TargetMachine.cpp
void addIRPasses() override;
bool addInstSelector() override;
bool addIRTranslator() override;
bool addLegalizeMachineIR() override;
bool addRegBankSelect() override;
bool addGlobalInstructionSelect() override;
bool addILPOpts() override;
bool addPreISel() override;
void addMachineSSAOptimization() override;
void addPreRegAlloc() override;
bool addPostFastRegAllocRewrite() override;
void addPostRegAlloc() override;
void addPreEmitPass() override;
void addPreEmitPass2() override;
void addPreSched2() override;
bool addRegAssignAndRewriteOptimized() override;

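To register our pass, instantiate the class and pass it to addPass() at the end of X86PassConfig::addPreEmitPass(), as outlined below. The declaration from X86RetModPass.h must be visible in X86TargetMachine.cpp, and the new source file must be added to the X86 target’s CMakeLists.txt.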

void X86PassConfig::addPreEmitPass() 
{
  /*
         DO NOT Modify existing code here in this function, only append your code!

  */

  addPass(new X86RetModPass());
}

Modifying Backend Components

The MC Layer (Machine Code Layer) in LLVM is a critical part of the backend responsible for emitting machine code, assembly, and object files. It acts as the final stage in the compilation pipeline, translating MachineInstr representations into actual binary or textual output.

Key Responsibilities of the MC Layer:

  1. Instruction Encoding
  • Converts instructions (MCInst, lowered from MachineInstr) into binary opcodes.
  • Uses MCCodeEmitter to encode instructions for the target architecture.
  2. Assembly Emission
  • Emits human-readable assembly via AsmPrinter.
  • Handles formatting, symbol resolution, and directives.
  3. Object File Generation
  • Uses MCObjectStreamer to write .o or .obj files.
  • Handles sections, relocations, symbol tables, and alignment.
  4. Section and Symbol Management
  • Manages .text, .data, .bss, and custom sections.
  • Uses MCSection, MCSymbol, and MCContext.
  5. Debug and Metadata Emission
  • Emits DWARF debug info, line tables, and other metadata.
  • Supports .debug_* sections and symbol annotations.

Core Components in the MC Layer

A solid understanding of the various components within LLVM’s MC Layer is essential if you intend to manipulate the code generation process effectively. These components form the backbone of instruction encoding, section management, and final output emission, making them critical for any low-level backend customization.

Component          Role
MCStreamer         Abstract interface for emitting code (assembly or object).
MCObjectStreamer   Base for object-file streamers; format-specific subclasses write ELF, COFF, or Mach-O.
MCAsmStreamer      Emits textual assembly output.
MCCodeEmitter      Encodes instructions into binary form.
MCInst             Lightweight representation of a single machine instruction (target opcode plus operands).
MCContext          Manages symbols, sections, and other state.
MCSection          Represents a section in the output file.
MCSymbol           Represents labels and symbols in code.
AsmPrinter         Lowers MachineInstr to MCInst and drives an MCStreamer to emit assembly or object code.

To accomplish our goal, we will modify X86AsmPrinter, the x86 subclass of AsmPrinter, so that each registered function (in the code below, a function whose demangled name contains REG_) receives a custom prologue and epilogue. Additionally, we will inject a handler stub into the .text section. These modifications leverage X86AsmPrinter’s role as the bridge between MachineInstr and the MC Layer, allowing us to control how instructions and auxiliary code are emitted during the final stages of code generation.

Before proceeding, it’s important to clearly delineate the responsibilities between our custom prologue/epilogue code and LLVM’s built-in infrastructure. Specifically, we need to decide which parts of the code generation process will be handled by our implementation, and which aspects will rely on support from LLVM’s AsmPrinter. This separation ensures that our custom logic—such as injecting prologues, epilogues, and handler stubs—is integrated seamlessly with LLVM’s existing emission pipeline.

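In the listings below, the commented-out instructions are emitted by our X86AsmPrinter code, because they reference symbols (the epilogue stub, init_stage, and handler) whose positions are only known during emission. The remaining code is assembled ahead of time by us and handed to X86AsmPrinter as raw bytes.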

Modified Prologue

; .\nasm.exe -f bin -o .\out.bin prologue.asm  (Windows)

BITS 64

prologue:
    call get_rip                  
get_rip:
    pop rax
    push rcx                       
    push rdx                      
    lea rcx, [rel get_rip]
    lea rdx, [rel prologue]
    sub rcx, rdx
    sub rax, rcx                   

    pop rdx                        
    pop rcx                      

    mov gs:[0xF0], rax           
    mov r10, gs:[0xE8]             
    xor rax, rax
    mov rax, 0x80000001
    cmp r10, rax

    ;
    ; LLVM will emit below instruction
    ;
    ;je epilogue               

    ;call handler            
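To see why the prologue does not depend on where the image is loaded, here is a small Python model of its arithmetic. It is a sketch, not part of the toolchain. It assumes the stub is the first thing emitted for the function, so the label prologue coincides with the function start, and it treats the runtime address as a hypothetical input. The call/pop pair yields the runtime address of get_rip. Subtracting the assembly-time distance get_rip - prologue (5 bytes, the size of the call) rewinds to the first byte of the stub. That is the value stored in gs:[0xF0].

```python
# Sketch: model of the prologue's position-independent base-address computation.
# Assumption: the stub is emitted at the very start of the function, so the
# address of label "prologue" equals the function's start address.

CALL_REL32_LEN = 5  # e8 00 00 00 00: call to the very next instruction


def prologue_base(stub_address: int) -> int:
    """Return the value the prologue writes to gs:[0xF0] for a stub at stub_address."""
    get_rip = stub_address + CALL_REL32_LEN  # return address pushed by the call
    rax = get_rip                            # pop rax
    rcx = get_rip                            # lea rcx, [rel get_rip]
    rdx = stub_address                       # lea rdx, [rel prologue]
    rcx -= rdx                               # sub rcx, rdx -> 5 regardless of load address
    rax -= rcx                               # sub rax, rcx -> first byte of the stub
    return rax


if __name__ == "__main__":
    for base in (0x140001000, 0x7FF6_1234_5000):
        print(hex(prologue_base(base)))
```

Both LEA operands are RIP-relative, so their difference is fixed when the stub is assembled. The same bytes therefore work at any load address.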

Modified Epilogue

; .\nasm.exe -f bin -o .\out.bin epilogue.asm  (Windows)
BITS 64
epilogue:

    mov rdx, gs:[0xE8]          
    xor rcx, rcx
    mov rcx, 0x80000001
    cmp rdx, rcx

    ;
    ;  LLVM will emit below instructions
    ;

    ;je init_stage              


    ;jmp handler             

;init_stage:                    
    ;call handler               
    ;ret                        
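Both listings branch on the same value: the flag at gs:[0xE8]. The entry stub sets this flag to 0x80000001 while it initializes registered functions and clears it before jumping to the original entry point. The Python sketch below is a simplified model of the resulting control-flow paths. "run body" is a hypothetical placeholder for the original function body. The model ignores registers and does not try to describe what the handler does internally.

```python
# Sketch: which instructions run in a registered function, depending on the
# gs:[0xE8] flag. "run body" is a hypothetical placeholder for the original code.

INIT_FLAG = 0x80000001  # value the entry stub stores in gs:[0xE8] while initializing


def prologue_path(flag):
    # cmp r10, rax ; je epilogue ; call handler
    if flag == INIT_FLAG:
        return ["je epilogue"]              # skip the body entirely
    return ["call handler", "run body"]     # handler runs first, then the body


def epilogue_path(flag):
    # cmp rdx, rcx ; je init_stage ; jmp handler ; init_stage: call handler ; ret
    if flag == INIT_FLAG:
        return ["call handler", "ret"]      # init_stage: call the handler, then return
    return ["jmp handler"]                  # no return address pushed: handler's ret goes to our caller


def trace(flag):
    return prologue_path(flag) + epilogue_path(flag)


if __name__ == "__main__":
    print(trace(INIT_FLAG))
    print(trace(0))
```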

The X86AsmPrinter class is a subclass of LLVM’s AsmPrinter, and it exposes several key functions that can be customized. In our case, we will modify three of these functions to enable X86AsmPrinter to emit a custom prologue and epilogue for each registered function, as well as inject our handler stub into the .text section during code generation.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86AsmPrinter.cpp

void emitFunctionBodyStart() override;
void emitFunctionBodyEnd() override;
void emitEndOfAsmFile(Module &M) override;

Modifying emitFunctionBodyStart and emitFunctionBodyEnd

During the code generation process, LLVM invokes emitFunctionBodyStart for each function in the compilation unit to emit the beginning of its machine-level representation. This makes it an ideal insertion point for our custom prologue code, allowing us to seamlessly integrate additional logic at the start of registered functions.

void X86AsmPrinter::emitFunctionBodyStart() {
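  // Our custom code starts here.
  // llvm::demangle() is declared in "llvm/Demangle/Demangle.h".
  llvm::StringRef Mangled = CurrentFnSym->getName();
  std::string Demangled = llvm::demangle(Mangled.str());
  llvm::errs() << Demangled << "\n";  // Debug output: prints every function name.
  // Only functions whose demangled name contains "REG_" receive the stubs.
  if (Demangled.find("REG_") != std::string::npos)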
  {

    /*                  Prologue stub

          0:  e8 00 00 00 00          call   0x5
          5:  58                      pop    rax
          6:  51                      push   rcx
          7:  52                      push   rdx
          8:  48 8d 0d f6 ff ff ff    lea    rcx,[rip+0xfffffffffffffff6]        # 0x5
          f:  48 8d 15 ea ff ff ff    lea    rdx,[rip+0xffffffffffffffea]        # 0x0
          16: 48 29 d1                sub    rcx,rdx
          19: 48 29 c8                sub    rax,rcx
          1c: 5a                      pop    rdx
          1d: 59                      pop    rcx
          1e: 65 48 89 04 25 f0 00    mov    QWORD PTR gs:0xf0,rax
          25: 00 00
          27: 65 4c 8b 14 25 e8 00    mov    r10,QWORD PTR gs:0xe8
          2e: 00 00
          30: 48 31 c0                xor    rax,rax
          33: b8 01 00 00 80          mov    eax,0x80000001
          38: 49 39 c2                cmp    r10,rax

                    <rest emitted by LLVM>

    */
    static const uint8_t PrologueStub[] = {
    0xE8, 0x00, 0x00, 0x00, 0x00, 0x58, 0x51, 0x52, 0x48, 0x8D, 0x0D, 0xF6, 0xFF, 0xFF, 0xFF,
    0x48, 0x8D, 0x15, 0xEA, 0xFF, 0xFF, 0xFF, 0x48, 0x29, 0xD1, 0x48, 0x29, 0xC8, 0x5A, 0x59,
    0x65, 0x48, 0x89, 0x04, 0x25, 0xF0, 0x00, 0x00, 0x00, 0x65, 0x4C, 0x8B, 0x14, 0x25, 0xE8,
    0x00, 0x00, 0x00, 0x48, 0x31, 0xC0, 0xB8, 0x01, 0x00, 0x00, 0x80, 0x49, 0x39, 0xC2
    };
    for (uint8_t Byte : PrologueStub) {
      OutStreamer->emitIntValue(Byte, 1);
    }


      /*

        Rest of the prologue code is emitted here by LLVM

        je epilogue
        call handler

    */

    EpilogueStubSymbol = OutContext.getOrCreateSymbol(Twine(CurrentFnSym->getName()) + "_epilogue_stub");
    MCSymbol *HandlerSym = OutContext.getOrCreateSymbol("handler");
    MCSymbol *AfterJE = OutContext.createTempSymbol();

    MCInst CallInst;
    CallInst.setOpcode(X86::CALL64pcrel32);
    CallInst.addOperand(MCOperand::createExpr(MCSymbolRefExpr::create(HandlerSym, OutContext)));

    // je epilogue: raw near-JE opcode (0F 84)...
    OutStreamer->emitBytes("\x0F\x84");

    // ...followed by its 4-byte rel32 displacement, EpilogueStubSymbol - AfterJE,
    // which the assembler resolves once the function layout is known.
    const MCExpr *RelExpr = MCBinaryExpr::createSub(
        MCSymbolRefExpr::create(EpilogueStubSymbol, OutContext),
        MCSymbolRefExpr::create(AfterJE, OutContext),
        OutContext);
    OutStreamer->emitValue(RelExpr, 4);
    OutStreamer->emitLabel(AfterJE);

    // call handler
    OutStreamer->emitInstruction(CallInst, *TM.getMCSubtargetInfo());
  }
  // Our custom code ends here.

  if (EmitFPOData) {
    auto *XTS =
        static_cast<X86TargetStreamer *>(OutStreamer->getTargetStreamer());
    XTS->emitFPOProc(
        CurrentFnSym,
        MF->getInfo<X86MachineFunctionInfo>()->getArgumentStackSize());
  }


}
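PrologueStub[] is hand-assembled, so it is worth checking that its bytes still match the listing before rebuilding LLVM. The following minimal Python check assumes the byte array is copied verbatim from above. It confirms that the stub is 0x3B bytes long and decodes the displacements of the call and the two LEA instructions. Each RIP-relative target is measured from the end of its instruction.

```python
import struct

# Copied from PrologueStub[] above.
PROLOGUE_STUB = bytes([
    0xE8, 0x00, 0x00, 0x00, 0x00, 0x58, 0x51, 0x52, 0x48, 0x8D, 0x0D, 0xF6, 0xFF, 0xFF, 0xFF,
    0x48, 0x8D, 0x15, 0xEA, 0xFF, 0xFF, 0xFF, 0x48, 0x29, 0xD1, 0x48, 0x29, 0xC8, 0x5A, 0x59,
    0x65, 0x48, 0x89, 0x04, 0x25, 0xF0, 0x00, 0x00, 0x00, 0x65, 0x4C, 0x8B, 0x14, 0x25, 0xE8,
    0x00, 0x00, 0x00, 0x48, 0x31, 0xC0, 0xB8, 0x01, 0x00, 0x00, 0x80, 0x49, 0x39, 0xC2,
])


def rip_relative_target(stub, insn_offset, insn_len):
    """Target offset of an instruction whose last 4 bytes are a signed disp32."""
    (disp,) = struct.unpack_from("<i", stub, insn_offset + insn_len - 4)
    return insn_offset + insn_len + disp  # RIP = offset of the next instruction


if __name__ == "__main__":
    print(hex(len(PROLOGUE_STUB)))
    print(hex(rip_relative_target(PROLOGUE_STUB, 0x0, 5)))  # call get_rip
    print(hex(rip_relative_target(PROLOGUE_STUB, 0x8, 7)))  # lea rcx, [rel get_rip]
    print(hex(rip_relative_target(PROLOGUE_STUB, 0xF, 7)))  # lea rdx, [rel prologue]
```

The custom prologue also includes the 6-byte je and the 5-byte call that follow the stub, so it occupies 0x3B + 6 + 5 = 0x46 bytes in total.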

Emitting je epilogue and call handler instructions

  • PrologueStub[] contains the assembled position-independent code (PIC) stub generated from the modified prologue code discussed earlier.
  • At runtime, during the initialization phase, the prologue must jump directly to the epilogue. This causes a problem at code generation time. When the prologue is emitted, the epilogue does not exist yet: it is appended to the end of the function only when LLVM later invokes X86AsmPrinter::emitFunctionBodyEnd(), which is a separate event in the pipeline. To resolve this, we define a shared symbol, EpilogueStubSymbol. It is created in emitFunctionBodyStart() and given its address when emitFunctionBodyEnd() emits it as a label. This requires adding a member to the X86AsmPrinter class definition (in X86AsmPrinter.h) so that the symbol is accessible from both emission hooks.
  private:
      MCSymbol *EpilogueStubSymbol = nullptr;
  • The next challenge is emitting a je (jump if equal) instruction that targets the epilogue stub. The near form of je is the opcode 0F 84 followed by a 32-bit displacement relative to the address of the next instruction. We therefore emit the raw opcode bytes (\x0F\x84) with emitBytes(). We then emit a 4-byte value, RelExpr = EpilogueStubSymbol - AfterJE, where AfterJE is a temporary label emitted immediately after the displacement. Both symbols are in the same section, so the assembler resolves the difference once the layout is final, and the je lands on the epilogue stub.
  MCSymbol *AfterJE = OutContext.createTempSymbol();
  // je <RIP-relative offset to epilogue stub>
  OutStreamer->emitBytes("\x0F\x84");

  const MCExpr *RelExpr = MCBinaryExpr::createSub(
      MCSymbolRefExpr::create(EpilogueStubSymbol, OutContext),
      MCSymbolRefExpr::create(AfterJE, OutContext),
      OutContext);
  OutStreamer->emitValue(RelExpr, 4);
  OutStreamer->emitLabel(AfterJE);
  • Emitting the call handler is more straightforward. We obtain a symbol named handler (HandlerSym), which emitEndOfAsmFile() later defines as the label in front of the handler stub. The call itself is an MCInst named CallInst. setOpcode() selects X86::CALL64pcrel32 (the E8 rel32 form of call), and addOperand() adds the target operand, an expression referencing HandlerSym. emitInstruction() then encodes the call, and the assembler resolves its displacement.
  MCSymbol *HandlerSym = OutContext.getOrCreateSymbol("handler");
  MCInst CallInst;
  CallInst.setOpcode(X86::CALL64pcrel32);
  CallInst.addOperand(MCOperand::createExpr(MCSymbolRefExpr::create(HandlerSym, OutContext)));
  OutStreamer->emitInstruction(CallInst, *TM.getMCSubtargetInfo());
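The displacement arithmetic used for both instructions can be checked in isolation. The Python sketch below encodes the near je (0F 84 rel32) and the call (E8 rel32), measuring each displacement from the end of the instruction. This matches what RelExpr = EpilogueStubSymbol - AfterJE expresses. The offsets in the example are hypothetical positions within a section, not values from a real build.

```python
import struct


def encode_je_rel32(je_offset, target_offset):
    """Near 'je target': 0F 84 + signed 32-bit displacement from the end of the insn."""
    after_je = je_offset + 6                 # where the AfterJE label sits
    return b"\x0f\x84" + struct.pack("<i", target_offset - after_je)


def encode_call_rel32(call_offset, target_offset):
    """'call target' in the E8 rel32 form (the encoding of X86::CALL64pcrel32)."""
    after_call = call_offset + 5
    return b"\xe8" + struct.pack("<i", target_offset - after_call)


if __name__ == "__main__":
    # Hypothetical layout: PrologueStub[] fills 0x00..0x3A, so je starts at 0x3B;
    # the epilogue stub starts at 0x120 and the handler at 0x200.
    print(encode_je_rel32(0x3B, 0x120).hex())
    print(encode_call_rel32(0x41, 0x200).hex())
```

In a real build, the MC layer evaluates the same subtraction. The sketch only makes the arithmetic visible, including negative displacements for backward targets.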

We will apply the same strategy described above to emit the epilogue code, as demonstrated below. The EpilogueStub[] contains the assembled position-independent code (PIC) stub generated from the modified epilogue code discussed earlier.

void X86AsmPrinter::emitFunctionBodyEnd() {


  llvm::StringRef Mangled = CurrentFnSym->getName();
  std::string Demangled = llvm::demangle(Mangled.str());
  //llvm::errs() << Demangled << "\n";
  if(Demangled.find("REG_") != std::string::npos )
  {
      MCSymbol *InitStageSym = OutContext.getOrCreateSymbol(Twine(CurrentFnSym->getName()) + "init_stage");
      MCSymbol *AfterJE = OutContext.createTempSymbol(); // Marks address after full JE instruction
      MCSymbol *HandlerSym = OutContext.getOrCreateSymbol("handler");

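      MCInst JumpInst;
      // JMP_1 is the short (EB rel8) form; when writing an object file the MC
      // layer relaxes it to the rel32 form if "handler" is out of rel8 range.
      JumpInst.setOpcode(X86::JMP_1); 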
      const MCExpr *TargetExpr = MCSymbolRefExpr::create(HandlerSym, OutContext);
      JumpInst.addOperand(MCOperand::createExpr(TargetExpr));


      MCInst CallInst;
      CallInst.setOpcode(X86::CALL64pcrel32);
      CallInst.addOperand(MCOperand::createExpr(MCSymbolRefExpr::create(HandlerSym, OutContext)));

    /*
                          Epilogue stub

          0:  65 48 8b 14 25 e8 00    mov    rdx,QWORD PTR gs:0xe8
          7:  00 00
          9:  48 31 c9                xor    rcx,rcx
          c:  b9 01 00 00 80          mov    ecx,0x80000001
          11: 48 39 ca                cmp    rdx,rcx

                  <rest emitted by llvm>
    */

      OutStreamer->emitLabel(EpilogueStubSymbol);
    static const uint8_t EpilogueStub[] = {
      0x65, 0x48, 0x8B, 0x14, 0x25, 0xE8, 0x00, 0x00, 0x00,
        0x48, 0x31, 0xC9, 0xB9, 0x01, 0x00, 0x00, 0x80,
        0x48, 0x39, 0xCA
    };

    for (uint8_t Byte : EpilogueStub) {
      OutStreamer->emitIntValue(Byte, 1);
    }


      /*

        Rest of the epilogue code is emitted here by LLVM

        je init_stage              
        jmp handler            
        init_stage:                     
          call handler               
          ret          

    */

      // Emit JE init_stage

      OutStreamer->emitBytes("\x0F\x84");

      // Emit the 4-byte rel32 displacement to init_stage (InitStageSym - AfterJE)

      const MCExpr *RelExpr = MCBinaryExpr::createSub(
          MCSymbolRefExpr::create(InitStageSym, OutContext),
          MCSymbolRefExpr::create(AfterJE, OutContext),
          OutContext
      );
      OutStreamer->emitValue(RelExpr, 4); 
      OutStreamer->emitLabel(AfterJE);

    //jump handler 
      OutStreamer->emitInstruction(JumpInst, *TM.getMCSubtargetInfo());

      // init_stage:
      OutStreamer->emitLabel(InitStageSym);

    //call handler 
      OutStreamer->emitInstruction(CallInst, *TM.getMCSubtargetInfo());
    //ret
      OutStreamer->emitBytes("\xC3");

  }

  if (EmitFPOData) {
    auto *XTS =
        static_cast<X86TargetStreamer *>(OutStreamer->getTargetStreamer());
    XTS->emitFPOEndProc();
  }

}
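One detail in the epilogue is that X86::JMP_1 is the short jump (EB rel8). The handler stub is emitted at the end of the module's .text, and when it is more than 127 bytes away, MC relaxation is expected to widen the jump to the E9 rel32 form. That shifts init_stage, which is why the je displacement is expressed as InitStageSym - AfterJE rather than as a constant. The Python sketch below models this layout under two simplifying assumptions: only the jmp can change size, and all offsets are hypothetical positions within .text.

```python
import struct

EPILOGUE_STUB_LEN = 20  # bytes in EpilogueStub[] (mov rdx, gs:[0xE8] ... cmp rdx, rcx)


def encode_jmp(jmp_offset, target_offset):
    """'jmp target': EB rel8 when it fits, otherwise the relaxed E9 rel32 form."""
    short_rel = target_offset - (jmp_offset + 2)
    if -128 <= short_rel <= 127:
        return b"\xeb" + struct.pack("<b", short_rel)
    return b"\xe9" + struct.pack("<i", target_offset - (jmp_offset + 5))


def epilogue_tail(epilogue_offset, handler_offset):
    """Bytes after EpilogueStub[]: je init_stage; jmp handler; init_stage: call handler; ret."""
    je_offset = epilogue_offset + EPILOGUE_STUB_LEN
    jmp_offset = je_offset + 6                        # AfterJE
    jmp = encode_jmp(jmp_offset, handler_offset)
    init_stage = jmp_offset + len(jmp)                # moves when the jmp is relaxed
    je = b"\x0f\x84" + struct.pack("<i", init_stage - jmp_offset)
    call = b"\xe8" + struct.pack("<i", handler_offset - (init_stage + 5))
    return je + jmp + call + b"\xc3"


if __name__ == "__main__":
    print(epilogue_tail(0x100, 0x150).hex())    # handler within rel8 range
    print(epilogue_tail(0x100, 0x1000).hex())   # handler far away: jmp becomes E9 rel32
```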

Modifying emitEndOfAsmFile

Finally, we need to emit the assembly stub for the handler logic, as previously discussed. This includes emitting the handler label to ensure that all call and jump instructions correctly transfer control to the handler stub. The symbol is emitted using emitLabel(StubSym), and the handler’s position-independent code stub is emitted byte-by-byte using emitIntValue(Byte, 1).

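void X86AsmPrinter::emitEndOfAsmFile(Module &M) 
{
  // Upstream LLVM already defines a body for X86AsmPrinter::emitEndOfAsmFile
  // (target-specific end-of-file handling); keep that code and add the
  // handler emission below rather than replacing it.

  OutStreamer->switchSection(getObjFileLowering().getTextSection());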

  MCSymbol *StubSym = OutContext.getOrCreateSymbol("handler");
  OutStreamer->emitLabel(StubSym);

  /*
    0:  51                      push   rcx
    1:  52                      push   rdx
    2:  41 50                   push   r8
    4:  41 51                   push   r9
    6:  50                      push   rax
    7:  53                      push   rbx
    8:  56                      push   rsi
    9:  57                      push   rdi
    a:  55                      push   rbp
    b:  41 54                   push   r12
    d:  41 55                   push   r13
    f:  41 56                   push   r14
    11: 41 57                   push   r15
    13: 65 48 8b 04 25 60 00    mov    rax,QWORD PTR gs:0x60
    1a: 00 00
    1c: 48 8b 70 10             mov    rsi,QWORD PTR [rax+0x10]
    20: 65 48 8b 04 25 f0 00    mov    rax,QWORD PTR gs:0xf0
    27: 00 00
    29: 66 81 3e 4d 5a          cmp    WORD PTR [rsi],0x5a4d
    2e: 0f 85 24 01 00 00       jne    0x158
    34: 8b 7e 3c                mov    edi,DWORD PTR [rsi+0x3c]
    37: 48 01 f7                add    rdi,rsi
    3a: 81 3f 50 45 00 00       cmp    DWORD PTR [rdi],0x4550
    40: 0f 85 12 01 00 00       jne    0x158
    46: 8b 4f 06                mov    ecx,DWORD PTR [rdi+0x6]
    49: 48 31 db                xor    rbx,rbx
    4c: 66 8b 5f 14             mov    bx,WORD PTR [rdi+0x14]
    50: 48 01 fb                add    rbx,rdi
    53: 48 83 c3 18             add    rbx,0x18
    57: 81 3b 2e 66 75 6e       cmp    DWORD PTR [rbx],0x6e75662e
    5d: 75 11                   jne    0x70
    5f: 81 7b 04 63 6d 65 74    cmp    DWORD PTR [rbx+0x4],0x74656d63
    66: 75 08                   jne    0x70
    68: 8b 53 0c                mov    edx,DWORD PTR [rbx+0xc]
    6b: 48 01 f2                add    rdx,rsi
    6e: eb 1f                   jmp    0x8f
    70: 48 83 c3 28             add    rbx,0x28
    74: e2 e1                   loop   0x57
    76: 4d 31 c9                xor    r9,r9
    79: 48 89 e1                mov    rcx,rsp
    7c: 48 83 c1 68             add    rcx,0x68
    80: 4c 8b 09                mov    r9,QWORD PTR [rcx]
    83: 49 29 c1                sub    r9,rax
    86: 4c 89 4b 08             mov    QWORD PTR [rbx+0x8],r9
    8a: 4c 89 ca                mov    rdx,r9
    8d: eb 48                   jmp    0xd7
    8f: 4d 31 c9                xor    r9,r9
    92: 4d 31 d2                xor    r10,r10
    95: 44 8b 0a                mov    r9d,DWORD PTR [rdx]
    98: 48 83 c2 08             add    rdx,0x8
    9c: 48 89 d3                mov    rbx,rdx
    9f: 4c 8b 03                mov    r8,QWORD PTR [rbx]
    a2: 49 83 f8 00             cmp    r8,0x0
    a6: 0f 84 ac 00 00 00       je     0x158
    ac: 48 8b 53 08             mov    rdx,QWORD PTR [rbx+0x8]
    b0: 48 31 f6                xor    rsi,rsi
    b3: 4c 89 ce                mov    rsi,r9
    b6: 4c 39 c0                cmp    rax,r8
    b9: 74 06                   je     0xc1
    bb: 48 83 c3 10             add    rbx,0x10
    bf: eb de                   jmp    0x9f
    c1: 65 4c 8b 14 25 e8 00    mov    r10,QWORD PTR gs:0xe8
    c8: 00 00
    ca: 48 31 c9                xor    rcx,rcx
    cd: b9 01 00 00 80          mov    ecx,0x80000001
    d2: 49 39 ca                cmp    r10,rcx
    d5: 74 9f                   je     0x76
    d7: 49 83 c0 46             add    r8,0x46
    db: 48 83 ea 46             sub    rdx,0x46
    df: 48 85 d2                test   rdx,rdx
    e2: 74 74                   je     0x158
    e4: 48 31 db                xor    rbx,rbx
    e7: 65 4c 8b 1c 25 f8 00    mov    r11,QWORD PTR gs:0xf8
    ee: 00 00
    f0: 52                      push   rdx
    f1: 41 50                   push   r8
    f3: 4c 89 c1                mov    rcx,r8
    f6: 41 b8 40 00 00 00       mov    r8d,0x40
    fc: 48 83 ec 08             sub    rsp,0x8
    100:    48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
    107:    00
    108:    49 89 e1                mov    r9,rsp
    10b:    48 83 ec 20             sub    rsp,0x20
    10f:    41 ff d3                call   r11
    112:    48 83 c4 28             add    rsp,0x28
    116:    41 58                   pop    r8
    118:    5a                      pop    rdx
    119:    41 50                   push   r8
    11b:    52                      push   rdx
    11c:    41 8a 18                mov    bl,BYTE PTR [r8]
    11f:    40 30 f3                xor    bl,sil
    122:    41 88 18                mov    BYTE PTR [r8],bl
    125:    49 ff c0                inc    r8
    128:    48 ff ca                dec    rdx
    12b:    75 ef                   jne    0x11c
    12d:    5a                      pop    rdx
    12e:    59                      pop    rcx
    12f:    41 b8 20 00 00 00       mov    r8d,0x20
    135:    48 83 ec 08             sub    rsp,0x8
    139:    48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
    140:    00
    141:    49 89 e1                mov    r9,rsp
    144:    48 83 ec 20             sub    rsp,0x20
    148:    65 4c 8b 1c 25 f8 00    mov    r11,QWORD PTR gs:0xf8
    14f:    00 00
    151:    41 ff d3                call   r11
    154:    48 83 c4 28             add    rsp,0x28
    158:    41 5f                   pop    r15
    15a:    41 5e                   pop    r14
    15c:    41 5d                   pop    r13
    15e:    41 5c                   pop    r12
    160:    5d                      pop    rbp
    161:    5f                      pop    rdi
    162:    5e                      pop    rsi
    163:    5b                      pop    rbx
    164:    58                      pop    rax
    165:    41 59                   pop    r9
    167:    41 58                   pop    r8
    169:    5a                      pop    rdx
    16a:    59                      pop    rcx
    16b:    c3                      ret

  */

  static const uint8_t HandlerStub[] = {
    0x51, 0x52, 0x41, 0x50, 0x41, 0x51, 0x50, 0x53, 0x56, 0x57, 0x55, 0x41, 0x54, 0x41, 0x55, 0x41, 0x56, 0x41, 0x57,
    0x65, 0x48, 0x8B, 0x04, 0x25, 0x60, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x70, 0x10, 0x65, 0x48, 0x8B, 0x04, 0x25, 0xF0,
    0x00, 0x00, 0x00, 0x66, 0x81, 0x3E, 0x4D, 0x5A, 0x0F, 0x85, 0x24, 0x01, 0x00, 0x00, 0x8B, 0x7E, 0x3C, 0x48, 0x01,
    0xF7, 0x81, 0x3F, 0x50, 0x45, 0x00, 0x00, 0x0F, 0x85, 0x12, 0x01, 0x00, 0x00, 0x8B, 0x4F, 0x06, 0x48, 0x31, 0xDB,
    0x66, 0x8B, 0x5F, 0x14, 0x48, 0x01, 0xFB, 0x48, 0x83, 0xC3, 0x18, 0x81, 0x3B, 0x2E, 0x66, 0x75, 0x6E, 0x75, 0x11,
    0x81, 0x7B, 0x04, 0x63, 0x6D, 0x65, 0x74, 0x75, 0x08, 0x8B, 0x53, 0x0C, 0x48, 0x01, 0xF2, 0xEB, 0x1F, 0x48, 0x83,
    0xC3, 0x28, 0xE2, 0xE1, 0x4D, 0x31, 0xC9, 0x48, 0x89, 0xE1, 0x48, 0x83, 0xC1, 0x68, 0x4C, 0x8B, 0x09, 0x49, 0x29,
    0xC1, 0x4C, 0x89, 0x4B, 0x08, 0x4C, 0x89, 0xCA, 0xEB, 0x48, 0x4D, 0x31, 0xC9, 0x4D, 0x31, 0xD2, 0x44, 0x8B, 0x0A,
    0x48, 0x83, 0xC2, 0x08, 0x48, 0x89, 0xD3, 0x4C, 0x8B, 0x03, 0x49, 0x83, 0xF8, 0x00, 0x0F, 0x84, 0xAC, 0x00, 0x00,
    0x00, 0x48, 0x8B, 0x53, 0x08, 0x48, 0x31, 0xF6, 0x4C, 0x89, 0xCE, 0x4C, 0x39, 0xC0, 0x74, 0x06, 0x48, 0x83, 0xC3,
    0x10, 0xEB, 0xDE, 0x65, 0x4C, 0x8B, 0x14, 0x25, 0xE8, 0x00, 0x00, 0x00, 0x48, 0x31, 0xC9, 0xB9, 0x01, 0x00, 0x00,
    0x80, 0x49, 0x39, 0xCA, 0x74, 0x9F, 0x49, 0x83, 0xC0, 0x46, 0x48, 0x83, 0xEA, 0x46, 0x48, 0x85, 0xD2, 0x74, 0x74,
    0x48, 0x31, 0xDB, 0x65, 0x4C, 0x8B, 0x1C, 0x25, 0xF8, 0x00, 0x00, 0x00, 0x52, 0x41, 0x50, 0x4C, 0x89, 0xC1, 0x41,
    0xB8, 0x40, 0x00, 0x00, 0x00, 0x48, 0x83, 0xEC, 0x08, 0x48, 0xC7, 0x04, 0x24, 0x00, 0x00, 0x00, 0x00, 0x49, 0x89,
    0xE1, 0x48, 0x83, 0xEC, 0x20, 0x41, 0xFF, 0xD3, 0x48, 0x83, 0xC4, 0x28, 0x41, 0x58, 0x5A, 0x41, 0x50, 0x52, 0x41,
    0x8A, 0x18, 0x40, 0x30, 0xF3, 0x41, 0x88, 0x18, 0x49, 0xFF, 0xC0, 0x48, 0xFF, 0xCA, 0x75, 0xEF, 0x5A, 0x59, 0x41,
    0xB8, 0x20, 0x00, 0x00, 0x00, 0x48, 0x83, 0xEC, 0x08, 0x48, 0xC7, 0x04, 0x24, 0x00, 0x00, 0x00, 0x00, 0x49, 0x89,
    0xE1, 0x48, 0x83, 0xEC, 0x20, 0x65, 0x4C, 0x8B, 0x1C, 0x25, 0xF8, 0x00, 0x00, 0x00, 0x41, 0xFF, 0xD3, 0x48, 0x83,
    0xC4, 0x28, 0x41, 0x5F, 0x41, 0x5E, 0x41, 0x5D, 0x41, 0x5C, 0x5D, 0x5F, 0x5E, 0x5B, 0x58, 0x41, 0x59, 0x41, 0x58,
    0x5A, 0x59, 0xC3
  };

  for (uint8_t Byte : HandlerStub) {
    OutStreamer->emitIntValue(Byte, 1);
  }

}
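As we read the handler disassembly, the masking itself is a single-byte XOR. The handler finds the function's entry in the .funcmet section by comparing it with gs:[0xF0]. It then calls VirtualProtect (cached at gs:[0xF8]) with 0x40 (PAGE_EXECUTE_READWRITE) and XORs every byte from function start + 0x46 up to the recorded function size. The key is the low byte (sil) of the first dword in .funcmet. Finally, it calls VirtualProtect again with 0x20 (PAGE_EXECUTE_READ). The skipped 0x46 bytes are exactly the custom prologue: 0x3B bytes of PrologueStub[], the 6-byte je, and the 5-byte call. The Python sketch below models only the XOR step on a bytearray. The key and offsets are hypothetical, and page protection is not modeled.

```python
# Sketch of the handler's XOR loop (xor bl, sil over [r8, r8 + rdx)).
# Assumptions: 'image' stands in for process memory; key and offsets are hypothetical.
# VirtualProtect calls are not modeled.

PROLOGUE_STUB_LEN = 0x3B  # bytes in PrologueStub[]
JE_REL32_LEN = 6          # 0F 84 rel32 (je epilogue)
CALL_REL32_LEN = 5        # E8 rel32 (call handler)
UNMASKED_HEAD = PROLOGUE_STUB_LEN + JE_REL32_LEN + CALL_REL32_LEN  # 0x46


def toggle_mask(image, func_start, func_size, key):
    """XOR the function after its custom prologue with a one-byte key.

    XOR is its own inverse, so the same call masks a clear function and
    unmasks a masked one."""
    key &= 0xFF  # only the low byte (sil) is used
    for addr in range(func_start + UNMASKED_HEAD, func_start + func_size):
        image[addr] ^= key


if __name__ == "__main__":
    memory = bytearray(range(256))
    original = bytes(memory)
    toggle_mask(memory, 0x10, 0x60, 0x1A5)   # effective key is 0xA5
    print(memory[0x56] == original[0x56] ^ 0xA5)
    toggle_mask(memory, 0x10, 0x60, 0x1A5)
    print(bytes(memory) == original)
```

The prologue bytes stay in the clear, so calling the function can always reach the handler. The unmasked prologue then unmasks the rest of the function.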

Injecting Entrypoint Stub

We inject a new section named .stub into the final PE file generated by our custom LLVM-based Clang++ compiler.
For convenience, this is done externally using a Python script. As discussed in the Redirecting Entrypoint and Initialization section of this post, the .stub section embeds the assembly code responsible for handling entry point redirection and pre-CRT execution logic.
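Entry point redirection depends on one patched constant. The stub listed at the end of this section ends with mov r14d, 0x12345678 / add r14, rsi / jmp r14. The placeholder holds the original entry-point RVA, rsi holds the image base read from the PEB (PEB+0x10), and the 32-bit write to r14d zero-extends into r14. The script below performs the patch with bytes.replace(). The Python sketch here shows the same idea with an extra check that the placeholder occurs exactly once. That check is a safeguard added for illustration, not something the original script does.

```python
import struct

PLACEHOLDER = struct.pack("<I", 0x12345678)  # b"\x78\x56\x34\x12", as in the stub


def patch_entry_rva(stub, original_entry_rva):
    """Return a copy of stub with the placeholder replaced by the original entry RVA."""
    if stub.count(PLACEHOLDER) != 1:
        raise ValueError("expected exactly one 0x12345678 placeholder in the stub")
    return stub.replace(PLACEHOLDER, struct.pack("<I", original_entry_rva))


if __name__ == "__main__":
    # Tail of the entry stub: mov r14d, 0x12345678 ; add r14, rsi ; jmp r14
    tail = bytes.fromhex("41be78563412" "4901f6" "41ffe6")
    print(patch_entry_rva(tail, 0x1420).hex())
```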

The Python script shown below creates the new section and copies the position-independent stub into it. It also patches the stub's 0x12345678 placeholder with the original entry-point RVA and points AddressOfEntryPoint at the new section, so the stub runs before the original entry point.

import pefile
import struct
import mmap
import argparse

def add_section_and_modify_entry(pe_path, shellcode, output_path):
    pe = pefile.PE(pe_path)

    # Patch shellcode with original entry point
    ep = pe.OPTIONAL_HEADER.AddressOfEntryPoint
    ep_little_endian = struct.pack("<I", ep)
    print(ep_little_endian.hex())
    placeholder = b"\x78\x56\x34\x12"
    modified_stub = shellcode.replace(placeholder, ep_little_endian)
    escaped = ''.join(f'\\x{b:02x}' for b in modified_stub)
    print(escaped)
    # Section setup
    new_section_name = b'.stub\x00\x00\x00'
    new_section_size = len(modified_stub)

    file_alignment = pe.OPTIONAL_HEADER.FileAlignment
    section_alignment = pe.OPTIONAL_HEADER.SectionAlignment

    aligned_raw_size = (new_section_size + file_alignment - 1) & ~(file_alignment - 1)
    aligned_virtual_size = (new_section_size + section_alignment - 1) & ~(section_alignment - 1)

    # Calculate safe placement
    last_raw_end = max(s.PointerToRawData + s.SizeOfRawData for s in pe.sections)
    last_virtual_end = max(s.VirtualAddress + s.Misc_VirtualSize for s in pe.sections)

    new_section_raw_address = (last_raw_end + file_alignment - 1) & ~(file_alignment - 1)
    new_section_virtual_address = (last_virtual_end + section_alignment - 1) & ~(section_alignment - 1)

    # Ensure raw data doesn't overwrite headers
    if new_section_raw_address < pe.OPTIONAL_HEADER.SizeOfHeaders:
        raise RuntimeError("New section raw data would overwrite PE headers.")

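    # Ensure there's space for another section header. The section table starts
    # right after the NT headers: "PE\0\0" (4) + IMAGE_FILE_HEADER (20) +
    # SizeOfOptionalHeader (normally 0xF0 for PE32+, so a fixed 248 is 16 bytes short).
    section_table_start = (pe.DOS_HEADER.e_lfanew + 4 + 20
                           + pe.FILE_HEADER.SizeOfOptionalHeader)
    max_section_headers = (pe.OPTIONAL_HEADER.SizeOfHeaders - section_table_start) // 40
    if pe.FILE_HEADER.NumberOfSections >= max_section_headers:
        raise RuntimeError("Not enough space in PE header for new section header.")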

    # Create new section header and set its file offset
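    new_section = pefile.SectionStructure(pe.__IMAGE_SECTION_HEADER_format__, pe=pe)
    # Zero-fill every field first; pefile needs unpacked values to pack the header on write().
    new_section.__unpack__(bytearray(new_section.sizeof()))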
    last_section_header_offset = pe.sections[-1].get_file_offset()
    new_section.set_file_offset(last_section_header_offset + 40)

    new_section.Name = new_section_name
    new_section.Misc = new_section.Misc_VirtualSize = aligned_virtual_size
    new_section.VirtualAddress = new_section_virtual_address
    new_section.SizeOfRawData = aligned_raw_size
    new_section.PointerToRawData = new_section_raw_address
    new_section.PointerToRelocations = 0
    new_section.PointerToLinenumbers = 0
    new_section.NumberOfRelocations = 0
    new_section.NumberOfLinenumbers = 0
    new_section.Characteristics = 0x60000020  # Read + Execute + Code

    # Inject section
    pe.__structures__.append(new_section)
    pe.sections.append(new_section)

    # Update headers
    pe.FILE_HEADER.NumberOfSections += 1
    pe.OPTIONAL_HEADER.SizeOfImage = new_section.VirtualAddress + aligned_virtual_size
    pe.OPTIONAL_HEADER.AddressOfEntryPoint = new_section.VirtualAddress


    required_size = new_section_raw_address + aligned_raw_size
    if isinstance(pe.__data__, mmap.mmap):
        pe.__data__ = bytearray(pe.__data__)
    if len(pe.__data__) < required_size:
        pe.__data__.extend(b'\x00' * (required_size - len(pe.__data__)))

    # Write shellcode
    pe.set_bytes_at_offset(new_section_raw_address, modified_stub.ljust(aligned_raw_size, b'\x00'))

    # Save modified PE
    pe.write(output_path)
    print(f" Modified PE saved to {output_path}")


parser = argparse.ArgumentParser()
parser.add_argument("input", help="Path to the input file")
parser.add_argument("output", help="Path to the output file")
args = parser.parse_args()

'''

                                    shellcode_stub


            0:  65 48 8b 04 25 60 00    mov    rax,QWORD PTR gs:0x60
            7:  00 00
            9:  48 8b 70 10             mov    rsi,QWORD PTR [rax+0x10]
            d:  48 89 f3                mov    rbx,rsi
            10: 8b 43 3c                mov    eax,DWORD PTR [rbx+0x3c]
            13: 48 01 c3                add    rbx,rax
            16: 48 83 c3 18             add    rbx,0x18
            1a: 48 89 da                mov    rdx,rbx
            1d: 8b 42 78                mov    eax,DWORD PTR [rdx+0x78]
            20: 85 c0                   test   eax,eax
            22: 0f 84 a5 00 00 00       je     0xcd
            28: 48 01 f0                add    rax,rsi
            2b: 48 89 c7                mov    rdi,rax
            2e: 8b 07                   mov    eax,DWORD PTR [rdi]
            30: 85 c0                   test   eax,eax
            32: 0f 84 95 00 00 00       je     0xcd
            38: 8b 47 0c                mov    eax,DWORD PTR [rdi+0xc]
            3b: 48 01 f0                add    rax,rsi
            3e: 49 89 c0                mov    r8,rax
            41: 48 b9 4b 45 52 4e 45    movabs rcx,0x32334c454e52454b
            48: 4c 33 32
            4b: 49 39 08                cmp    QWORD PTR [r8],rcx
            4e: 75 74                   jne    0xc4
            50: 41 8b 48 08             mov    ecx,DWORD PTR [r8+0x8]
            54: 81 f9 2e 64 6c 6c       cmp    ecx,0x6c6c642e
            5a: 75 68                   jne    0xc4
            5c: 48 31 db                xor    rbx,rbx
            5f: 8b 5f 10                mov    ebx,DWORD PTR [rdi+0x10]
            62: 48 01 f3                add    rbx,rsi
            65: 48 8b 07                mov    rax,QWORD PTR [rdi]
            68: 48 01 f0                add    rax,rsi
            6b: 48 89 d9                mov    rcx,rbx
            6e: 48 89 c2                mov    rdx,rax
            71: 4c 8b 02                mov    r8,QWORD PTR [rdx]
            74: 4d 85 c0                test   r8,r8
            77: 74 54                   je     0xcd
            79: 48 b8 00 00 00 00 00    movabs rax,0x8000000000000000
            80: 00 00 80
            83: 49 85 c0                test   r8,rax
            86: 75 32                   jne    0xba
            88: 49 01 f0                add    r8,rsi
            8b: 49 83 c0 02             add    r8,0x2
            8f: 4d 89 c1                mov    r9,r8
            92: 48 b8 56 69 72 74 75    movabs rax,0x506c617574726956
            99: 61 6c 50
            9c: 49 39 01                cmp    QWORD PTR [r9],rax
            9f: 75 19                   jne    0xba
            a1: 41 8b 41 08             mov    eax,DWORD PTR [r9+0x8]
            a5: 3d 72 6f 74 65          cmp    eax,0x65746f72
            aa: 75 0e                   jne    0xba
            ac: 48 8b 01                mov    rax,QWORD PTR [rcx]
            af: 65 48 89 04 25 f8 00    mov    QWORD PTR gs:0xf8,rax
            b6: 00 00
            b8: eb 17                   jmp    0xd1
            ba: 48 83 c1 08             add    rcx,0x8
            be: 48 83 c2 08             add    rdx,0x8
            c2: eb ad                   jmp    0x71
            c4: 48 83 c7 14             add    rdi,0x14
            c8: e9 61 ff ff ff          jmp    0x2e
            cd: 48 31 c0                xor    rax,rax
            d0: c3                      ret
            d1: 48 31 db                xor    rbx,rbx
            d4: bb 01 00 00 80          mov    ebx,0x80000001
            d9: 65 48 89 1c 25 e8 00    mov    QWORD PTR gs:0xe8,rbx
            e0: 00 00
            e2: 4d 31 f6                xor    r14,r14
            e5: 41 be 78 56 34 12       mov    r14d,0x12345678
            eb: 66 81 3e 4d 5a          cmp    WORD PTR [rsi],0x5a4d
            f0: 75 7d                   jne    0x16f
            f2: 8b 7e 3c                mov    edi,DWORD PTR [rsi+0x3c]
            f5: 48 01 f7                add    rdi,rsi
            f8: 81 3f 50 45 00 00       cmp    DWORD PTR [rdi],0x4550
            fe: 75 6f                   jne    0x16f
            100:    90                      nop
            101:    31 c0                   xor    eax,eax
            103:    ff c0                   inc    eax
            105:    ff c8                   dec    eax
            107:    8b 4f 06                mov    ecx,DWORD PTR [rdi+0x6]
            10a:    48 31 db                xor    rbx,rbx
            10d:    66 8b 5f 14             mov    bx,WORD PTR [rdi+0x14]
            111:    48 01 fb                add    rbx,rdi
            114:    48 83 c3 18             add    rbx,0x18
            118:    50                      push   rax
            119:    58                      pop    rax
            11a:    48 89 c0                mov    rax,rax
            11d:    90                      nop
            11e:    81 3b 2e 66 75 6e       cmp    DWORD PTR [rbx],0x6e75662e
            124:    75 19                   jne    0x13f
            126:    4d 31 c0                xor    r8,r8
            129:    4d 85 c0                test   r8,r8
            12c:    74 00                   je     0x12e
            12e:    81 7b 04 63 6d 65 74    cmp    DWORD PTR [rbx+0x4],0x74656d63
            135:    75 08                   jne    0x13f
            137:    8b 53 0c                mov    edx,DWORD PTR [rbx+0xc]
            13a:    48 01 f2                add    rdx,rsi
            13d:    eb 0c                   jmp    0x14b
            13f:    48 83 c3 28             add    rbx,0x28
            143:    4d 31 c9                xor    r9,r9
            146:    4d 89 c9                mov    r9,r9
            149:    e2 d3                   loop   0x11e
            14b:    48 83 c2 08             add    rdx,0x8
            14f:    48 89 d3                mov    rbx,rdx
            152:    90                      nop
            153:    9c                      pushf
            154:    9d                      popf
            155:    4c 8b 2b                mov    r13,QWORD PTR [rbx]
            158:    49 83 fd 00             cmp    r13,0x0
            15c:    74 11                   je     0x16f
            15e:    4d 31 d2                xor    r10,r10
            161:    4d 85 d2                test   r10,r10
            164:    74 00                   je     0x166
            166:    41 ff d5                call   r13
            169:    48 83 c3 10             add    rbx,0x10
            16d:    e2 e6                   loop   0x155
            16f:    31 c0                   xor    eax,eax
            171:    65 89 04 25 e8 00 00    mov    DWORD PTR gs:0xe8,eax
            178:    00
            179:    90                      nop
            17a:    48 89 c9                mov    rcx,rcx
            17d:    52                      push   rdx
            17e:    5a                      pop    rdx
            17f:    49 01 f6                add    r14,rsi
            182:    41 ff e6                jmp    r14


'''

shellcode_stub =b"\x65\x48\x8B\x04\x25\x60\x00\x00\x00\x48\x8B\x70\x10\x48\x89\xF3\x8B\x43\x3C\x48\x01\xC3\x48\x83\xC3\x18\x48\x89\xDA\x8B\x42\x78\x85\xC0\x0F\x84\xA5\x00\x00\x00\x48\x01\xF0\x48\x89\xC7\x8B\x07\x85\xC0\x0F\x84\x95\x00\x00\x00\x8B\x47\x0C\x48\x01\xF0\x49\x89\xC0\x48\xB9\x4B\x45\x52\x4E\x45\x4C\x33\x32\x49\x39\x08\x75\x74\x41\x8B\x48\x08\x81\xF9\x2E\x64\x6C\x6C\x75\x68\x48\x31\xDB\x8B\x5F\x10\x48\x01\xF3\x48\x8B\x07\x48\x01\xF0\x48\x89\xD9\x48\x89\xC2\x4C\x8B\x02\x4D\x85\xC0\x74\x54\x48\xB8\x00\x00\x00\x00\x00\x00\x00\x80\x49\x85\xC0\x75\x32\x49\x01\xF0\x49\x83\xC0\x02\x4D\x89\xC1\x48\xB8\x56\x69\x72\x74\x75\x61\x6C\x50\x49\x39\x01\x75\x19\x41\x8B\x41\x08\x3D\x72\x6F\x74\x65\x75\x0E\x48\x8B\x01\x65\x48\x89\x04\x25\xF8\x00\x00\x00\xEB\x17\x48\x83\xC1\x08\x48\x83\xC2\x08\xEB\xAD\x48\x83\xC7\x14\xE9\x61\xFF\xFF\xFF\x48\x31\xC0\xC3\x48\x31\xDB\xBB\x01\x00\x00\x80\x65\x48\x89\x1C\x25\xE8\x00\x00\x00\x4D\x31\xF6\x41\xBE\x78\x56\x34\x12\x66\x81\x3E\x4D\x5A\x75\x7D\x8B\x7E\x3C\x48\x01\xF7\x81\x3F\x50\x45\x00\x00\x75\x6F\x90\x31\xC0\xFF\xC0\xFF\xC8\x8B\x4F\x06\x48\x31\xDB\x66\x8B\x5F\x14\x48\x01\xFB\x48\x83\xC3\x18\x50\x58\x48\x89\xC0\x90\x81\x3B\x2E\x66\x75\x6E\x75\x19\x4D\x31\xC0\x4D\x85\xC0\x74\x00\x81\x7B\x04\x63\x6D\x65\x74\x75\x08\x8B\x53\x0C\x48\x01\xF2\xEB\x0C\x48\x83\xC3\x28\x4D\x31\xC9\x4D\x89\xC9\xE2\xD3\x48\x83\xC2\x08\x48\x89\xD3\x90\x9C\x9D\x4C\x8B\x2B\x49\x83\xFD\x00\x74\x11\x4D\x31\xD2\x4D\x85\xD2\x74\x00\x41\xFF\xD5\x48\x83\xC3\x10\xE2\xE6\x31\xC0\x65\x89\x04\x25\xE8\x00\x00\x00\x90\x48\x89\xC9\x52\x5A\x49\x01\xF6\x41\xFF\xE6"

add_section_and_modify_entry(args.input, shellcode_stub, args.output)
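For readers who want to inspect a patched binary offline, the lookup performed by the stub can be mirrored in Python. The sketch below is a hypothetical helper: `find_funcmeta_entries` is not part of the companion code. It assumes a 64-bit image that has already been laid out as it would be in memory, so RVAs can be used directly as buffer offsets, just as the stub uses them relative to the image base in `rsi`. Following the disassembly above, it:

- checks the `MZ` and `PE\0\0` signatures;
- walks the 40-byte section headers that follow the optional header;
- matches the 8-byte header name against `.funcmet`;
- skips the first 8 bytes of the section (the key plus padding);
- reads 16-byte entries until it reaches a null function pointer.

```python
import struct

SECTION_HEADER_SIZE = 0x28  # sizeof(IMAGE_SECTION_HEADER)
KEY_SLOT_SIZE = 0x08        # 4-byte key + padding, skipped by "add rdx,0x8"
ENTRY_SIZE = 0x10           # {void* func; uint32_t len;} padded to 16 bytes on x64
ENTRY_FMT = "<QI"           # 8-byte pointer followed by 4-byte length


def find_funcmeta_entries(image: bytes) -> list[tuple[int, int]]:
    """Hypothetical helper: list the (func, len) pairs stored in .funcmeta.

    `image` is assumed to be a memory-mapped 64-bit PE, so every RVA can be
    used directly as an offset into the buffer (as the stub does at runtime).
    """
    if image[:2] != b"MZ":                                   # cmp WORD PTR [rsi],0x5a4d
        return []
    nt = struct.unpack_from("<I", image, 0x3C)[0]            # e_lfanew
    if image[nt:nt + 4] != b"PE\0\0":                        # cmp DWORD PTR [rdi],0x4550
        return []
    num_sections = struct.unpack_from("<H", image, nt + 0x06)[0]
    opt_header_size = struct.unpack_from("<H", image, nt + 0x14)[0]
    header = nt + 0x18 + opt_header_size                     # first section header

    for _ in range(num_sections):
        # The stub compares only the 8-byte header name: ".fun" then "cmet".
        if image[header:header + 8] == b".funcmet":
            rva = struct.unpack_from("<I", image, header + 0x0C)[0]  # VirtualAddress
            entries = []
            cursor = rva + KEY_SLOT_SIZE
            while cursor + ENTRY_SIZE <= len(image):
                func, length = struct.unpack_from(ENTRY_FMT, image, cursor)
                if func == 0:                                # null pointer ends the walk
                    break
                entries.append((func, length))
                cursor += ENTRY_SIZE
            return entries
        header += SECTION_HEADER_SIZE
    return []
```

A file read straight from disk must first have its sections copied to their RVAs, for example with pefile's `get_memory_mapped_image()`, before this helper applies.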

Test Program

// test.cpp
#include <windows.h>
#include <iostream>
#include <stdint.h>

struct MyStruct 
{
    void* func;
    uint32_t len;
};

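// Key at the start of the section. The entry stub finds this section by the
// first eight bytes of its header name (".funcmet") and skips the first 8 bytes
// (the key plus alignment padding) before reading entries.
__attribute__((section(".funcmeta")))
uint32_t myfuncsec_key = 0x12345678;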

// Macro to register a function
#define REGISTER_FUNCTION(fn) \
    __attribute__((section(".funcmeta"))) \
    struct MyStruct fn##_entry = { (void*)fn, 0xDEADBEEF };

void REG_foo2()
{
    std::cout << "\nhello from foo2";
}
int REG_foo(int a, int b, int c, int d, int e) 
{
    int i = 0;
    int x = a + b + c + d + e;
    std::cout << "\n" << x;
    MessageBoxA(NULL, "Hello from foo", "Test", MB_OK);
    if (i)
    {
        return 0;
    }
    else{
        i++;
    }
    return x;
}

REGISTER_FUNCTION(REG_foo)
REGISTER_FUNCTION(REG_foo2)

int main()
{
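    // Presumably here so that VirtualProtect is imported from KERNEL32.dll: the
    // entry stub resolves its address from the import table. This call with null
    // arguments is expected to fail.
    bool p = VirtualProtect(0,0,0,0);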
    std::cout << "MAIN here";
    int a = REG_foo(1,2,3,4,5);
    std::cout << "\n Ret val :" << a;
    REG_foo2();
    return 0;
}
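The stride constants in the stub match how this test program lays out `.funcmeta` on x64. `MyStruct` holds a pointer plus a `uint32_t`, so it pads to 16 bytes, which is the `add rbx,0x10` step. The first entry lands 8 bytes after the 4-byte key, which is the `add rdx,0x8` skip. The stub also stops at the first null `func`, so it relies on the section being zero-filled after the last entry.

The ctypes model below checks this arithmetic. It assumes a 64-bit Python and assumes the compiler keeps `myfuncsec_key` and the `*_entry` objects in declaration order within the section. That ordering is typical but not guaranteed. `FuncMetaSection` is a hypothetical name used only for illustration.

```python
import ctypes


class MyStruct(ctypes.Structure):
    # Mirrors: struct MyStruct { void* func; uint32_t len; };
    _fields_ = [("func", ctypes.c_void_p), ("len", ctypes.c_uint32)]


class FuncMetaSection(ctypes.Structure):
    # Hypothetical model of .funcmeta for this test program: the key,
    # followed by the REG_foo and REG_foo2 entries.
    _fields_ = [("key", ctypes.c_uint32), ("entries", MyStruct * 2)]


if __name__ == "__main__":
    # On a 64-bit Python these print 16 and 8: the stub's stride and key skip.
    print("entry stride:", ctypes.sizeof(MyStruct))
    print("first entry offset:", FuncMetaSection.entries.offset)
```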

The image below shows the REG_foo function before the initialization phase, while its body is still unmasked. The custom prologue and epilogue stubs attached to the function body are clearly visible; they manage the execution flow and prepare for the masking operations.

After the initialization phase, in which the entry stub calls each registered function once before jumping to the original entry point, the body of REG_foo is masked together with its epilogue. Only the prologue remains visible, as shown in the image below. This is the intended runtime state: the function body and epilogue stay protected, and each function remains masked for the rest of the binary's execution except while it is running.
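The runtime states described above can be illustrated with a toy model. This is not the project's masking routine, which is not shown in this section. The sketch assumes a repeating 4-byte XOR with a key such as `myfuncsec_key`. It also uses a hypothetical `prologue_len` parameter for the number of leading bytes that stay in the clear.

Because XOR is its own inverse, applying the same transform on entry and again on return restores the original bytes. The prologue is never touched, which matches the picture of a readable prologue in front of a masked body and epilogue.

```python
import struct


def toggle_mask(code: bytearray, prologue_len: int, key: int) -> None:
    """Toy stand-in for masking: XOR every byte after the prologue with a
    repeating little-endian 4-byte key. Applying it twice restores the input."""
    key_bytes = struct.pack("<I", key)
    for i in range(prologue_len, len(code)):
        code[i] ^= key_bytes[(i - prologue_len) % 4]


if __name__ == "__main__":
    # push rbp / mov rbp,rsp | xor eax,eax / pop rbp / ret
    func = bytearray.fromhex("554889e531c05dc3")
    toggle_mask(func, 4, 0x12345678)   # masked: prologue intact, rest scrambled
    print(func.hex())                  # 554889e5499669d1
    toggle_mask(func, 4, 0x12345678)   # unmasked again
    print(func.hex())                  # 554889e531c05dc3
```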

This post was written by saab_sec.

You can find the companion code to this release on the MDSec GitHub.

Written by MDSec Research


Copyright 2025 MDSec