Binary Analysis - The Detective's Field Guide

Binary analysis is the process of examining and understanding compiled software (binary files) without access to the original source code. It's a critical discipline in cybersecurity, reverse engineering, and digital forensics, acting much like a detective investigating a crime scene.

Binary Files

A binary file is a sequence of bytes interpreted by a computer as something other than plain text. In binary analysis, we primarily deal with executable files (binaries) containing machine code—instructions directly executable by the CPU.

  • File Formats: Executable files adhere to specific formats that organize code, data, and metadata.
    • PE (Portable Executable): Used on Windows systems (.exe, .dll).
    • ELF (Executable and Linkable Format): Used on Linux and other Unix-like systems.

Reverse Engineering

Binary analysis is often synonymous with reverse engineering—the process of deconstructing a man-made object to determine how it was designed or how it operates. For software, this involves translating machine code back into a human-readable form.

  • Disassembly: Translating machine code into Assembly Language, a low-level, human-readable representation of CPU instructions.
  • Decompilation: Attempting to translate assembly code back into a high-level language (like C or C++), though this is often imprecise.

Today, we're not just going to tell you what to do - we're going to tell you what to look for and why it matters. Every step in binary analysis has a purpose. Let's create a field guide for what to look for at each step and why it's critical for security analysis.

The Binary to Analyze: "mystery_program"

// mystery_program.c
#include <stdio.h>
#include <string.h>

// Global variables in different sections
int global_initialized = 42;               // Will be in .data
int global_uninitialized;                  // Will be in .bss
const char secret[] = "S3cr3tP@ssw0rd!";   // Will be in .rodata

// A simple function
void print_message() {
    printf("Welcome to the mystery program!\n");
}

// A function with a buffer
void process_input() {
    char buffer[16];  // Small buffer
    printf("Enter your name: ");
    gets(buffer);     // UNSAFE - but good for analysis!
    printf("Hello, %s!\n", buffer);
}

// Main function
int main() {
    printf("Program starting...\n");
    print_message();

    printf("Global initialized: %d\n", global_initialized);
    printf("Global uninitialized: %d\n", global_uninitialized);

    process_input();

    printf("Program ending...\n");
    return 0;
}

First, let's compile it with the right flags. We'll build three versions:

# Version 1: With debug symbols (easier to analyze)
gcc -std=c99 -fno-stack-protector -z execstack -no-pie -g -o my_mystryapp_debug mystery_program.c

# Version 2: Stripped (harder, like real-world binaries)
gcc -std=c99 -fno-stack-protector -z execstack -no-pie -o my_mystryapp_stripped mystery_program.c
strip my_mystryapp_stripped

# Version 3: Vulnerable version (for exploit practice)
gcc -fno-stack-protector -z execstack -no-pie -o mystery_vuln mystery_program.c

We are going to analyze these three binaries:

my_mystryapp_debug     - Binary with debug symbols
my_mystryapp_stripped  - Stripped binary (no symbols)
mystery_vuln           - Vulnerable binary for exploit dev

The Analysis Process: What to Look For and Why

Phase 1: Initial Reconnaissance - The First Clues

Why we do this: To understand what we're dealing with before diving deep. Like checking the cover of a book before reading it.

Step 1: file command - Identifying the Target


file mystery_program

What to look for:

  • "ELF 64-bit" vs "ELF 32-bit" → Determines architecture and register sizes (64-bit uses rax, rbx, etc.; 32-bit uses eax, ebx, etc.)
  • "LSB" (Little Endian) → Confirms byte order (critical for interpreting memory correctly)
  • "stripped" vs "not stripped" → Tells us if we have function names or not (stripped = harder)
  • "dynamically linked" → Means it uses shared libraries (easier to trace)
  • "statically linked" → Contains all libraries inside (larger file, harder to analyze)
  • "executable" → Can be run (vs "shared object" which is a library)

Why it matters for security:

  • 32-bit vs 64-bit affects exploit development (different stack layouts, register names)
  • Stripped binaries are common in malware and commercial software
  • Dynamically linked binaries reveal what libraries are used (attack surface)

Step 2: checksec or manual protection checking


readelf -a mystery_program | grep -E "(GNU_STACK|GNU_RELRO)"

What to look for:

  • NX (Non-eXecutable stack) → GNU_STACK segment flags (R=Read, W=Write, E=Execute)
    • RW = NX enabled (stack cannot execute code)
    • RWE = NX disabled (stack CAN execute code - easier to exploit!)
  • PIE (Position Independent Executable) → Check if entry point address starts with 0x4...
    • Fixed address (0x400000) = No PIE
    • Random-looking address = PIE enabled (harder to exploit)
  • Stack Canary → Look for __stack_chk_fail in symbols (present = canary enabled)
  • RELRO → GNU_RELRO segment present
    • Partial RELRO = GOT (Global Offset Table) can be overwritten
    • Full RELRO = GOT is read-only (prevents GOT overwrite attacks)

Why it matters for security:

  • These are the defenses we need to bypass in exploit development
  • NX disabled = We can execute shellcode on the stack (classic buffer overflow)
  • No PIE = Addresses are predictable (easier to target specific functions)
  • No stack canary = Buffer overflows won't be detected
  • Partial RELRO = We can overwrite function pointers in GOT

Step 3: strings - The Human Readable Clues

strings mystery_program | grep -i "pass\|secret\|key\|admin\|root\|error\|fail"

What to look for:

  • Hardcoded credentials → Passwords, API keys, tokens
  • Error messages → Reveal function names and logic flow
  • URLs and paths → Network connections, file operations
  • Format strings → %s, %d, %x can indicate format string vulnerabilities
  • Command strings → system(), exec(), popen() calls

Why it matters for security:

  • Hardcoded credentials = Instant compromise
  • Error messages help understand program flow for reverse engineering
  • URLs might indicate C2 (Command & Control) servers in malware
  • Format strings in printf() without proper arguments = format string vulnerability
  • Command strings might indicate possible command injection points

Phase 2: Static Analysis - Reading the Blueprint

Step 1: nm - The Symbol Table (if not stripped)

nm mystery_program | grep -E " (T|t|U) "

What to look for:

  • "T" or "t" → Defined functions (T = global, t = local)
    • Look for: main, vuln, login, auth, encrypt, decrypt
  • "U" → Undefined functions (imported from libraries)
    • Look for dangerous functions: gets, strcpy, sprintf, system, exec
  • "D" or "d" → Initialized data variables (D = global, d = local); "B" or "b" → uninitialized data in .bss
    • Look for: flags, configuration variables, encryption keys

Why it matters for security:

  • Dangerous imported functions (gets, strcpy) = potential buffer overflows
  • system() or exec() calls = potential command injection
  • Custom encryption functions = might have weak implementation
  • Authentication functions = target for bypass attacks

Step 2: objdump -d - Reading the Assembly


objdump -d mystery_program | grep -B5 -A5 "call.*gets\|call.*strcpy"


What to look for in disassembly:

Function Prologues (Start of functions):

assembly

push   rbp            ; Save base pointer

mov    rbp, rsp       ; Set up new stack frame

sub    rsp, 0x20      ; Allocate 32 bytes on stack ← BUFFER SIZE!

  • sub rsp, 0xXX → Tells us stack buffer sizes
  • Small buffers (0x10, 0x20) = easier to overflow

Dangerous Function Calls:

assembly

call   0x400500 <gets@plt>    ; NO bounds checking!

call   0x400510 <strcpy@plt>  ; NO bounds checking!

call   0x400520 <strcat@plt>  ; NO bounds checking!

call   0x400530 <sprintf@plt> ; Format string vulnerability possible

Buffer Allocation Patterns:

assembly

lea    rax, [rbp-0x10]  ; Buffer starts at rbp-0x10 (16 bytes)

lea    rdx, [rbp-0x20]  ; Another buffer at rbp-0x20 (32 bytes)

Return Instruction Patterns:

assembly

leave                   ; Clean up stack frame

ret                     ; Return to caller ← WHERE WE HIJACK CONTROL!

Why it matters for security:

  • Buffer sizes tell us how much data needed to overflow
  • Dangerous functions are vulnerability indicators
  • Return instructions are where we hijack control flow
  • Leave/ret sequences are where we insert our exploit

Step 3: readelf -S - Memory Layout


readelf -S mystery_program | grep -E "(text|data|rodata|bss|plt|got)"

What to look for:

  • .text → Executable code section (where shellcode might go if executable)
  • .data → Writable data (where we might write exploit data)
  • .rodata → Read-only data (hardcoded strings, might contain secrets)
  • .plt → Procedure Linkage Table (function stubs for dynamic linking)
  • .got → Global Offset Table (actual addresses of imported functions)
  • Section flags: A (allocated in memory), X (executable), W (writable) — e.g. .text is AX, .data is WA

Why it matters for security:

  • .got is writable with Partial RELRO → We can overwrite function addresses!
  • .text executable → We can place shellcode there if we can write to it
  • .data writable → Good place to store exploit strings
  • Knowing memory layout helps in ROP (Return-Oriented Programming) chain building


Phase 3: Dynamic Analysis - Watching It Run

Step 1: strace - System Call Tracing

bash

strace -e trace=file,network ./mystery_program

What to look for:

  • File operations: open, read, write, close
    • Look for: config files, password files, log files
  • Network operations: socket, connect, send, recv
    • Look for: IP addresses, ports (malware C2)
  • Process operations: fork, execve, system
    • Look for: command execution (potential injection)
  • Memory operations: mprotect, mmap
    • Look for: memory protection changes

Why it matters for security:

  • File operations reveal sensitive data access
  • Network operations show communication patterns (data exfiltration?)
  • execve with user input = possible command injection
  • mprotect changing permissions = anti-debugging or self-modifying code

Step 2: ltrace - Library Call Tracing

bash

ltrace ./mystery_program 2>&1 | grep -E "(gets|strcpy|printf|system)"

What to look for:

  • Input functions: gets, fgets, scanf, read
  • String functions: strcpy, strcat, sprintf
  • Memory functions: malloc, free, memcpy
  • Format functions: printf, fprintf, snprintf

Why it matters for security:

  • See what data flows into dangerous functions
  • Track user input through the program
  • Identify format string vulnerabilities (printf with user-controlled format)
  • Spot heap operations (potential heap overflows)

Step 3: gdb - Interactive Debugging

3.1: Initial Setup

GNU gdb (Debian 13.1-3) 13.1

...

(gdb) help

List of classes of commands:

...

What's happening: GDB started successfully. Asking for help shows all command categories. This is good!

What to do next: We need to load a binary to analyze.

3.2: Loading the Debug Binary

gdb -q my_mystryapp_debug 

Reading symbols from my_mystryapp_debug...

Success!

  • "Reading symbols" means this binary has debug information (compiled with -g)
  • We can use function names like main, process_input, etc.
  • This will make our analysis easier

3.3: Setting Breakpoints - The Critical Issue

(gdb) break process_input

Breakpoint 1 at 0x401164

(gdb) break *strcpy

No symbol "strcpy" in current context.

What's happening here:

  1. break process_input worked → Breakpoint set at 0x401164
  2. break *strcpy failed → "No symbol "strcpy" in current context"

Why this matters:

  • Our program doesn't use strcpy()! We only use gets()
  • Check our source code:

gets(buffer);     // We have this

// No strcpy() anywhere!

  • The compiler warning suggested fgets() as a replacement for gets(); it never mentioned strcpy

What to do:

  • Only set breakpoints for functions that actually exist
  • Let's check what functions we have:

gdb

(gdb) info functions

3.4: Switching to Stripped Binary - The Confusion

(gdb) file my_mystryapp_stripped 

Load new symbol table from "my_mystryapp_stripped"? (y or n) y

Reading symbols from my_mystryapp_stripped...

(No debugging symbols found in my_mystryapp_stripped)

What's happening:

  1. GDB is asking: "You loaded symbols for debug, now switching to stripped. Reload symbols?"
  2. We say y (yes)
  3. "No debugging symbols found" → This is expected! The binary is stripped

The critical mistake happening here:

gdb

(gdb) break *strcpy

No symbol table is loaded.  Use the "file" command.

Why this error?

  • When you have no symbols (stripped binary), you cannot use function names
  • You must use addresses instead
  • Example: break *0x401050 (use the address we found earlier)

3.5: The Confusion Loop

(gdb) file my_mystryapp_stripped 

Reading symbols from my_mystryapp_stripped...

(No debugging symbols found in my_mystryapp_stripped)

(gdb) break *gets

Note: breakpoint 1 also set at pc 0x1050.

Breakpoint 2 at 0x1050

Interesting finding:

  • Even though the binary is stripped, break *gets works! strip removes the regular symbol table, but gets lives in the dynamic symbol table (.dynsym), which the dynamic linker still needs
  • But look at the address: 0x1050 (same as in the debug version)
  • And then later: Breakpoint 3 at 0x401050

Why two different addresses?

  • 0x1050 = File-relative address (the symbol's offset inside the binary)
  • 0x401050 = Absolute address (the 0x400000 load base plus the 0x1050 offset)
  • Because we compiled with -no-pie, the binary always loads at the fixed base 0x400000, so the two addresses differ by a constant

3.6: The Final Confusion

(gdb) info frame

No stack.

Why "No stack"?

  • The program isn't running yet!
  • info frame shows the current stack frame... but we haven't started execution
  • We need to run the program first

(gdb) break *strcpy

No symbol table is loaded.  Use the "file" command.


Why this keeps happening:

  • You're trying to use function names on a stripped binary
  • Stripped = No function names available
  • You must use addresses

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The Detective's Checklist: What to Document

For Every Binary Analysis, Document:

  • Basic Information
    • Architecture (32/64-bit)
    • Endianness
    • Stripped or not
    • Linked statically or dynamically
  • Protections (Defenses)
    • NX (Stack executable?): [ ] Yes [ ] No
    • ASLR/PIE: [ ] Enabled [ ] Disabled
    • Stack Canary: [ ] Present [ ] Absent
    • RELRO: [ ] Full [ ] Partial [ ] None
  • Vulnerability Indicators
    • Dangerous functions used: gets, strcpy, sprintf, system
    • Buffer sizes found: [ ] Small (<32) [ ] Medium (32-128) [ ] Large (>128)
    • Format strings with user input: [ ] Yes [ ] No
    • Command execution with user input: [ ] Yes [ ] No
  • Attack Surface
    • User input points identified: [ ] Network [ ] File [ ] Command line [ ] Environment
    • Authentication functions found: __________
    • Encryption functions found: __________
  • Exploit Development Notes
    • Crash confirmed at: ____ bytes
    • Return address offset: ____ bytes from buffer start
    • Available gadgets: [ ] pop rdi; ret [ ] pop rsi; ret [ ] execve available
    • Writable memory regions: __________
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Why This Systematic Approach Matters

In the real world:

  • Real Malware Analysis: Malware is almost always stripped - you must work from addresses
  • Exploit Development: You need exact addresses for payloads
  • Reverse Engineering: You gradually rebuild symbols as you analyze
  • CTF Challenges: Often give stripped binaries to make it harder

Remember: Every piece of information you gather tells a story:

  • Small buffer + gets() = Classic buffer overflow
  • system() with user input = Command injection
  • Partial RELRO + GOT entry = GOT overwrite attack possible
  • No PIE + known function address = Return-to-libc attack