<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Shiven&apos;s blog</title>
    <description>Musings of a programmer.</description>
    <link>https://sh7ven.github.io/</link>
    <atom:link href="https://sh7ven.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 10 Jan 2025 08:23:56 +0000</pubDate>
    <lastBuildDate>Fri, 10 Jan 2025 08:23:56 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Anatomy of a function call</title>
        <description>&lt;!-- more --&gt;

&lt;p&gt;&lt;strong&gt;PS:&lt;/strong&gt; to keep the main content concise, the technical details relating to some concepts are mentioned as &lt;a href=&quot;#footnotes&quot;&gt;Footnotes&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;the-program-counter-pc&quot;&gt;The Program Counter (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PC&lt;/code&gt;)&lt;/h1&gt;
&lt;p&gt;A Program Counter (or instruction pointer, in the context of &lt;a href=&quot;https://en.wikipedia.org/wiki/X86&quot;&gt;x86 systems&lt;/a&gt;) is a register that holds the address of the next instruction to be executed.
In a simple environment, the instructions (which are stored in RAM) are “fetched” sequentially.
&lt;a href=&quot;https://docs.oracle.com/cd/E19120-01/open.solaris/817-5477/eoizl/index.html&quot;&gt;Control Transfer&lt;/a&gt; instructions, however, can change this sequence by placing a new value in the PC.
Even returning&lt;a href=&quot;#function-call-cycle&quot;&gt;&lt;sup&gt;[ref]&lt;/sup&gt;&lt;/a&gt; from a function is a &lt;em&gt;control transfer&lt;/em&gt;. Other examples include the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jump&lt;/code&gt; instruction.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;goto&lt;/code&gt; in C behaves analogously to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jump&lt;/code&gt;, and the compiler might generate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jump&lt;/code&gt; instructions in the assembler output as well. These control transfer instructions can branch off execution. In multi-core and multi-threaded processors, each core or thread has its own PC, which means that each core/thread can execute a different instruction sequence, independently of the others. In SIMD applications, by contrast, each parallel computation unit performs the exact same instruction, just with different data.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
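&lt;p&gt;The effect of a control transfer on the PC can be sketched with a toy instruction list (a hypothetical mini-machine in Python, not real x86):&lt;/p&gt;

```python
# Toy machine: the "pc" register selects the next instruction to run.
# A "jmp" instruction overwrites the pc instead of letting it advance.
def run(program):
    pc = 0
    trace = []
    while pc != len(program):
        op, arg = program[pc]
        trace.append(pc)
        if op == "jmp":      # control transfer: place a new value in the PC
            pc = arg
        else:                # ordinary instruction: the PC advances sequentially
            pc = pc + 1
    return trace

# Instruction 2 is never executed, because the jump skips over it.
print(run([("nop", None), ("jmp", 3), ("nop", None), ("nop", None)]))  # [0, 1, 3]
```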
&lt;h2 id=&quot;synchronization&quot;&gt;Synchronization&lt;/h2&gt;
&lt;p&gt;The program counter is also important for synchronization. Locks and mutexes prevent concurrent accesses to shared resources, so the program counters of the parallel units
must be coordinated carefully across the different threads to ensure mutual exclusion. Say the PCs get stuck in a waiting state, unable to progress, because each thread is waiting for
a resource held by another thread, causing a circular wait: this is a deadlock. A livelock occurs when the PCs of the threads involved are continuously changing state without making any progress. It is a risk with algorithms that recover from a deadlock: if more than one process takes recovery action when changing states, the deadlock detection algorithm can get triggered repeatedly. Read more about synchronization problems in &lt;a href=&quot;https://web.cs.wpi.edu/~cs3013/c07/lectures/Section06-Sync.pdf&quot;&gt;this article&lt;/a&gt; by Jerry Breecher.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
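&lt;p&gt;A minimal sketch of mutual exclusion, using Python’s threading primitives for brevity (the lock plays the same role as the mutexes discussed above; without it, two threads could read the same counter value and lose an update):&lt;/p&gt;

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:               # only one thread may hold the lock at a time
            counter = counter + 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000, since no updates were lost
```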

&lt;h1 id=&quot;instruction-cycle&quot;&gt;Instruction cycle&lt;/h1&gt;
&lt;p&gt;The instruction cycle is the fundamental sequence of steps that a CPU performs to execute a single instruction. Simply put, it is a fetch $\rightarrow$ decode $\rightarrow$ execute cycle,
and it looks something like this:
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It starts with a fetch, which retrieves the next instruction to be executed from memory, at the address held in the Program Counter&lt;sup&gt;[1]&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Control_unit&quot;&gt;Control Unit&lt;/a&gt; interprets the &lt;a href=&quot;https://en.wikipedia.org/wiki/Opcode&quot;&gt;instructions&lt;/a&gt; via a &lt;a href=&quot;https://www.sciencedirect.com/topics/engineering/instruction-decoder&quot;&gt;decoder&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;The ALU executes the arithmetic and logic operations, reading the operands from registers or from memory.
&lt;br /&gt;
Some intermediate read/write phases might also be involved.
&lt;br /&gt;&lt;/li&gt;
&lt;/ol&gt;
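&lt;p&gt;The three phases can be sketched as a loop in Python (a simplified model of the cycle, not of any real CPU):&lt;/p&gt;

```python
# Toy fetch/decode/execute loop over a machine with two registers.
REGS = {"eax": 0, "edx": 0}

def decode(opcode):
    # the "decoder" maps an opcode to the operation it stands for
    table = {
        "mov": lambda reg, val: REGS.update({reg: val}),
        "add": lambda reg, val: REGS.update({reg: REGS[reg] + val}),
    }
    return table[opcode]

def execute(memory):
    pc = 0
    while pc != len(memory):
        opcode, reg, val = memory[pc]   # 1. fetch the instruction
        operation = decode(opcode)      # 2. decode it
        operation(reg, val)             # 3. execute it
        pc = pc + 1

execute([("mov", "eax", 5), ("mov", "edx", 7), ("add", "eax", 2)])
print(REGS["eax"])  # 7
```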

&lt;h1 id=&quot;function-call-cycle&quot;&gt;Function call cycle&lt;/h1&gt;
&lt;p&gt;When a function is called, a stack frame is created, which includes the return address, function arguments and the variables local to the function.
The return address specifies where control should return after the function call completes.
&lt;br /&gt;
A simplified breakdown of a function call looks something like this:
&lt;br /&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Arguments of the function are placed on the stack.&lt;/li&gt;
  &lt;li&gt;Some memory is allocated for the return value from the function.&lt;/li&gt;
  &lt;li&gt;A platform-specific &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call&lt;/code&gt; instruction is executed. This pushes the &lt;strong&gt;return address&lt;/strong&gt; of the function onto the stack.
The &lt;em&gt;program counter&lt;/em&gt; is then loaded with the function’s address, which transfers control to the function.&lt;/li&gt;
  &lt;li&gt;The function reads the arguments from the stack and the code in the function body is run.&lt;/li&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ret&lt;/code&gt; instruction pops the return address from the stack into the program counter, and thus control returns back to the caller; the return value itself is typically handed back in a register.
&lt;br /&gt;
&lt;br /&gt;&lt;/li&gt;
&lt;/ol&gt;
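&lt;p&gt;The steps above can be mimicked with an explicit stack (a conceptual Python model; as the disassembly section below shows, real calling conventions also pass arguments in registers):&lt;/p&gt;

```python
stack = []

def call(func, args):
    stack.append(args)                # step 1: arguments are placed on the stack
    stack.append("return-address")    # step 3: "call" pushes the return address
    result = func(stack[-2])          # step 4: the function reads its arguments
    stack.pop()                       # step 5: "ret" pops the return address...
    stack.pop()                       # ...and the argument slot is cleaned up
    return result                     # control (and the value) go back to the caller

print(call(lambda a: a[0] + a[1], (123, 456)))  # 579
```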

&lt;h1 id=&quot;disassembling&quot;&gt;Disassembling&lt;/h1&gt;
&lt;p&gt;Consider a simple C program:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;123&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;456&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;To get the assembler output that the GCC compiler generates:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc &lt;span class=&quot;nt&quot;&gt;-S&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fverbose-asm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; asm.s add.c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Let us examine the assembler output:&lt;/p&gt;
&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;kc&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;456&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;esi&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;kc&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;123&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edi&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;add&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;First, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;456&lt;/code&gt; is moved into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;esi&lt;/code&gt;, which holds the &lt;em&gt;second&lt;/em&gt; integer argument, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;123&lt;/code&gt; into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edi&lt;/code&gt;, which holds the first. On x86-64, the first few integer arguments travel in registers rather than on the stack, per the System V calling convention; this is the argument-passing step of the call sequence.
As I explained earlier, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call add&lt;/code&gt; will now pass control to the function body, which looks like this:
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;add:&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;.LFB0:&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_startproc&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;pushq&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;    &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_def_cfa_offset&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_offset&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movq&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_def_cfa_register&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;esi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;b&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;tmp100&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;eax&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;tmp101&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;addl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;eax&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;tmp100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;_3&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;popq&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_def_cfa&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;ret&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_endproc&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;.LFE0:&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.size&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;add&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.globl&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;main&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.type&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;function&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;
This looks rather complicated for a simple addition, so I’m going to focus on the important bits of this piece of assembler.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cfi_startproc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cfi_endproc&lt;/code&gt; are directives that emit call-frame information, used for debugging and exception handling. They mark the start and end of the procedure.
The base pointer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt; is pushed onto the stack, and the stack pointer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsp&lt;/code&gt; is moved into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt;; this makes it easier to access function parameters and local variables.
&lt;br /&gt;
The arguments are stored at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-4(%rbp)&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-8(%rbp)&lt;/code&gt;. Notice the difference of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4&lt;/code&gt;, which is the size of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; on my machine.
This notation roughly translates to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4&lt;/code&gt; bytes below the address in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt;. Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; are moved into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edx&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eax&lt;/code&gt; respectively, and their sum is computed using
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;addl&lt;/code&gt; opcode.
&lt;br /&gt;
Subsequently, the stack frame is cleaned up by popping &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt; from the stack, and control is returned via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ret&lt;/code&gt;, with the result left in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eax&lt;/code&gt;.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;cc-specifics&quot;&gt;C/C++ specifics&lt;/h1&gt;
&lt;p&gt;From the GNU GCC manual:
&lt;br /&gt;
You can actually grab the return address or the frame address of a function. The prototype of this builtin looks like this:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__builtin_return_address&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
This function returns the return address of the function, or one of its callers. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level&lt;/code&gt; is the number of frames to scan up the call stack.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level = 0&lt;/code&gt; yields the return address of the current function.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level = 1&lt;/code&gt; yields the address of the caller.&lt;/p&gt;

&lt;p&gt;and so on.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;On some platforms, additional post-processing is needed to get the &lt;em&gt;actual&lt;/em&gt; address from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__builtin_return_address&lt;/code&gt;. This is done as follows:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;addr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__builtin_extract_return_addr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__builtin_return_address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(...)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
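&lt;p&gt;As a loose analogue, Python lets you walk frames up the call stack with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys._getframe&lt;/code&gt;, whose argument plays the same role as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level&lt;/code&gt; (this is CPython-specific, and of course distinct from GCC’s builtin):&lt;/p&gt;

```python
import sys

def who_called_me():
    # sys._getframe(0) is the current frame; 1 is the caller, like level = 1
    return sys._getframe(1).f_code.co_name

def outer():
    return who_called_me()

print(outer())  # outer
```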

&lt;p&gt;To give an example, on x86/x86-64 systems, stack protection is implemented with something called a Stack Canary.
It is a random value that is inserted, typically, between the local variables and the return address. Before the function returns, this Canary value is checked.
If that value has been altered, the program is usually terminated to prevent malicious code from running. The GCC compiler is awesome, so it provides some stack smashing protections&lt;sup&gt;[2]&lt;/sup&gt; as well.&lt;/p&gt;
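&lt;p&gt;The check can be modelled in a few lines of Python (purely illustrative; this is not GCC’s actual mechanism):&lt;/p&gt;

```python
import os

CANARY = os.urandom(8)   # a random value, chosen when the "function" starts

def write_then_check(buffer_size, data):
    # layout sketch: [buffer][canary]; writing past the buffer hits the canary
    frame = bytearray(buffer_size) + bytearray(CANARY)
    frame[:len(data)] = data              # an unchecked write, as in a C overflow
    if bytes(frame[buffer_size:]) != CANARY:
        return "stack smashing detected"  # GCC would abort the program here
    return "ok"

print(write_then_check(8, b"A" * 8))    # ok
print(write_then_check(8, b"A" * 12))   # stack smashing detected
```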

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h1&gt;
&lt;p&gt;&lt;sup&gt;[1]&lt;/sup&gt; The &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_(computing)#Address_bus&quot;&gt;Address Bus&lt;/a&gt; carries the address held in the PC to memory, and the fetched instruction
is carried from memory to the &lt;em&gt;Instruction Register&lt;/em&gt; over the &lt;em&gt;Data Bus&lt;/em&gt;. The instruction is only temporarily placed in the IR.
Modern &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_(computing)&quot;&gt;memory buses&lt;/a&gt; connect directly to the DRAM chips, and have very low latency.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;[2a]&lt;/sup&gt; You can pass the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fstack-protector&lt;/code&gt; flag to detect buffer overflows on the stack. This is done by placing the Canary next to critical stack data, as explained earlier. However, this works only for functions that the compiler &lt;em&gt;thinks&lt;/em&gt; are vulnerable. You can override this behaviour with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fstack-protector-all&lt;/code&gt; flag, which protects all functions.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TARGET_STACK_PROTECT_GUARD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TARGET_STACK_PROTECT_GUARD&lt;/code&gt; hook in GCC, the compiler can obtain the value of the Canary. I think this value is set during compilation. 
A “guard” variable is created to detect stack smashing attacks, and it exists in the form of a global variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__stack_chk_guard&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This hook allows each architecture to provide its own implementation. I’m not familiar with GCC internals though, you can read more &lt;a href=&quot;https://gcc.gnu.org/onlinedocs/gccint/target-macros/stack-layout-and-calling-conventions/stack-smashing-protection.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;[2b]&lt;/sup&gt; Position Independent Executables (PIEs) allow the program code to be loaded at different addresses, making it more difficult for an attacker to predict the locations of specific code segments. You can enable PIE during compilation with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fPIE&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;PIE exists to support Address Space Layout Randomization (ASLR) in executables. You can explore this in action by compiling a simple C program with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-g&lt;/code&gt; flag and setting breakpoints at different positions in the source file; you’ll observe the randomization of addresses at the breakpoints.&lt;/p&gt;

&lt;p&gt;I think you’ll need to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set disable-randomization off&lt;/code&gt; in GDB to turn on ASLR for the process. Otherwise, GDB defaults to fixed addresses.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
On Linux, you can check if ASLR is on by running:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; /proc/sys/kernel/randomize_va_space
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;With modern Linux kernels it is usually enabled by default, set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt; (full randomization). You can read in-depth about PIE by taking a look at &lt;a href=&quot;http://www.openbsd.org/papers/nycbsdcon08-pie/&quot;&gt;OpenBSD’s PIE implementation&lt;/a&gt;.&lt;/p&gt;
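&lt;p&gt;A small helper to read and interpret that value (the meanings below are for Linux’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;randomize_va_space&lt;/code&gt; setting; the file read is guarded, since the path exists only on Linux):&lt;/p&gt;

```python
def describe_aslr(value):
    # documented meanings of /proc/sys/kernel/randomize_va_space
    meanings = {
        0: "ASLR disabled",
        1: "randomize stack, mmap base and shared libraries",
        2: "also randomize the heap (full randomization)",
    }
    return meanings.get(value, "unknown")

try:
    with open("/proc/sys/kernel/randomize_va_space") as f:
        print(describe_aslr(int(f.read())))
except OSError:
    print("not available on this platform")
```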
</description>
        <pubDate>Fri, 19 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://sh7ven.github.io/2024/07/19/Anatomy-of-a-function-call.html</link>
        <guid isPermaLink="true">https://sh7ven.github.io/2024/07/19/Anatomy-of-a-function-call.html</guid>
        
        <category>C</category>
        
        <category>Assembly</category>
        
        <category>Low-level</category>
        
        
      </item>
    
      <item>
        <title>Why is Numpy so fast?</title>
        <description>&lt;!-- more --&gt;

&lt;h2 id=&quot;homogeneity&quot;&gt;Homogeneity&lt;/h2&gt;
&lt;p&gt;Numpy arrays have elements with homogeneous types, whilst native Python lists are just containers holding pointers to objects - even when they are of the same type.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Locality_of_reference&quot;&gt;Principle of Locality&lt;/a&gt; is the tendency of a processor to access the same set of memory locations, or locations near them, repeatedly over a short period of time. Since Numpy arrays are homogeneous and contiguous, neighbouring elements share cache lines, and future accesses to them are relatively fast. This subdivision of the principle is called &lt;strong&gt;Spatial Locality&lt;/strong&gt;.&lt;/p&gt;


&lt;p&gt;&lt;img src=&quot;/assets/img/Principle-of-Locality.png&quot; alt=&quot;Principle of Locality diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This figure explains Spatial Locality. You can see how, during instruction fetches, $n$ loop iterations access the same memory locations many times.
Numpy arrays are contiguous, which means the processor can load an entire block into cache at once, so accesses to all elements within the block are faster.&lt;/p&gt;

&lt;p&gt;Due to this homogeneity, a lot of latency is saved on &lt;a href=&quot;https://en.wikipedia.org/wiki/Indirection&quot;&gt;pointer indirection&lt;/a&gt; and per-element type checking. With Python lists, it doesn’t even matter if the list has the same type of elements, because Python treats even primitive values (like integers) as objects.
When you add a variable, say $x = 20$, to a list, a reference to the object $20$ gets appended. Now the list and $x$ both hold a reference to $20$. When you reassign either of them, only that reference changes, meaning $x$ or the list will then hold a reference to the new object.&lt;/p&gt;
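&lt;p&gt;The difference in representation is easy to see from memory sizes (assuming Numpy is installed):&lt;/p&gt;

```python
import sys
import numpy as np

n = 1000
arr = np.arange(n, dtype=np.int64)   # one contiguous buffer of raw 8-byte ints
lst = list(range(n))                 # n pointers, each to a separate int object

print(arr.nbytes)                    # 8000: n elements * 8 bytes, data only
# the list object itself holds only the pointer table; the int objects are extra
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))
```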

&lt;h2 id=&quot;vectorized-operations&quot;&gt;Vectorized operations&lt;/h2&gt;
&lt;p&gt;Arithmetic operations are applied to the entire array at once, instead of through an explicit Python loop over the elements. The $O(N)$ loop still happens, but Numpy offloads it to compiled C code, avoiding per-element interpreter overhead; array processing, like element-wise arithmetic, should therefore always be done with such vectorized operations.&lt;/p&gt;

&lt;p&gt;Array broadcasting is done when arrays are of different shapes, making it possible to perform arithmetic operations between arrays and scalars. The scalar value is essentially expanded, so the number 2 will be treated as an array filled with twos, but needless copies of data are not actually created, and the looping occurs in C instead of Python; efficiency is almost always a by-product. Under the hood, broadcasting (like many other vectorized operations) is almost always done in the form of &lt;a href=&quot;https://numpy.org/doc/stable/reference/ufuncs.html&quot;&gt;“Universal functions”&lt;/a&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ufunc&lt;/code&gt; for short). These are functions that operate on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndarrays&lt;/code&gt; in an element-by-element fashion. Most of the built-in operations, like addition, are carried out this way.&lt;/p&gt;
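&lt;p&gt;For instance, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.add&lt;/code&gt; is the ufunc behind &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt;, and broadcasting lets a scalar (or a smaller array) stand in for a full-size array:&lt;/p&gt;

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# the scalar 2 is broadcast across the array; no array of twos is materialized
plus_two = a + 2
same = np.add(a, 2)               # `+` on ndarrays dispatches to the np.add ufunc
print((plus_two == same).all())   # True

# a 1-D row is broadcast across every row of the 2-D array
row = np.array([10, 20, 30])
print(a + row)                    # rows become [11 22 33] and [14 25 36]
```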

&lt;h2 id=&quot;efficient-internal-organization&quot;&gt;Efficient internal organization&lt;/h2&gt;
&lt;p&gt;Numpy arrays aren’t &lt;em&gt;actually&lt;/em&gt; arrays. They’re a data structure that consists of a contiguous data buffer and some metadata.&lt;/p&gt;

&lt;h4 id=&quot;internal-data-buffer&quot;&gt;Internal data buffer&lt;/h4&gt;
&lt;p&gt;It is essentially a C-style array - a contiguous, fixed block of memory, containing similar elements. This is the portion that actually holds the array elements.&lt;/p&gt;

&lt;p&gt;This is the fundamental aspect of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndarray&lt;/code&gt; - It is essentially a chunk of memory starting at some location.&lt;/p&gt;

&lt;h4 id=&quot;buffer-metadata&quot;&gt;Buffer Metadata&lt;/h4&gt;
&lt;p&gt;It contains the following (quoted directly from the Numpy documentation):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Size of the basic data element (e.g. an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int32&lt;/code&gt;, which is 4 bytes, irrespective of 32-bit or 64-bit systems)&lt;/li&gt;
  &lt;li&gt;an offset relative to the start of data buffer&lt;/li&gt;
  &lt;li&gt;the number of dimensions and the size of each dimension&lt;/li&gt;
  &lt;li&gt;the &lt;a href=&quot;https://en.wikipedia.org/wiki/Stride_of_an_array&quot;&gt;stride&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;byte order&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;read flag (whether the buffer is read-only, or not)&lt;/li&gt;
  &lt;li&gt;dtype: interpretation of the basic data element (yes, the users can create &lt;a href=&quot;https://numpy.org/doc/stable/glossary.html#term-structured-data-type&quot;&gt;arbitrarily complex, composite data types&lt;/a&gt; as the basic array element!)&lt;/li&gt;
  &lt;li&gt;Memory ordering: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; is Row-Major, meaning that elements of a row are stored adjacent to each other, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Fortran&lt;/code&gt; is Column-Major, so elements of a column are stored adjacent to each other.&lt;/li&gt;
&lt;/ul&gt;
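&lt;p&gt;Much of this metadata is visible from Python. Transposing, for instance, doesn’t move any data; it just swaps the shape and strides over the same buffer:&lt;/p&gt;

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.int32)

print(a.itemsize)    # 4: size of the basic data element
print(a.shape)       # (3, 4): number of dimensions and the size of each
print(a.strides)     # (16, 4): bytes to step per dimension (C order, row-major)

t = a.T              # a view over the same buffer with swapped metadata
print(t.strides)     # (4, 16): the walk is now column-major
print(t.base is a)   # True: no data was copied
```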

&lt;p&gt;If you want to read more about how Numpy works under the hood, check out &lt;a href=&quot;https://www.amazon.com/Guide-NumPy-Travis-Oliphant-PhD/dp/151730007X&quot;&gt;Guide to Numpy&lt;/a&gt; 
by Travis Oliphant, the creator of Numpy and a driving force behind Scipy and the wider scientific Python ecosystem.&lt;/p&gt;

&lt;h1 id=&quot;some-benchmarking&quot;&gt;Some Benchmarking&lt;/h1&gt;
&lt;p&gt;Let’s see how fast Numpy is. I’ll use the product of two $N\times N$ matrices as the reference workload.
The elements of both matrices are randomly generated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;numpy.random.rand()&lt;/code&gt;. Naturally,
the random element generation is excluded from the timed region.&lt;/p&gt;

&lt;p&gt;The benchmarks were run in an IPython kernel on an Intel Xeon(R) CPU. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%timeit&lt;/code&gt; magic
was used to time the function calls. The Python documentation for &lt;a href=&quot;https://docs.python.org/3/library/timeit.html&quot;&gt;timeit&lt;/a&gt; 
notes that even executing a bare &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pass&lt;/code&gt; statement carries a small base overhead. 
You can measure that overhead on your machine by invoking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; without a statement argument.&lt;/p&gt;
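&lt;p&gt;That baseline can also be measured with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; module directly (the iteration count below is an arbitrary choice):&lt;/p&gt;

```python
import timeit

# Time one million executions of a pass statement: this is the baseline
# loop overhead that any timeit measurement includes.
overhead = timeit.timeit('pass', number=1_000_000)
print(overhead)  # small but nonzero; the exact value is machine-dependent
```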


&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;timeit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/numpy_plot.png&quot; alt=&quot;Numpy benchmark plot&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;what-next&quot;&gt;What next?&lt;/h1&gt;
&lt;p&gt;It is important to note that Numpy is not &lt;em&gt;always&lt;/em&gt; fast. I’ll talk more about Numpy-specific use cases, and where it loses to vanilla Python. I also want to compare Numpy to BLAS and LAPACK, since Numpy relies on both for some operations when they are installed. Benchmarking is a tricky subject though, so the next article in this series will be devoted entirely to it. There, I’ll also compare Numpy to other linear algebra APIs, perhaps in C/C++ (something like &lt;a href=&quot;https://eigen.tuxfamily.org/&quot;&gt;Eigen&lt;/a&gt;). Until then, read more about BLAS and LAPACK with Numpy &lt;a href=&quot;https://superfastpython.com/what-is-blas-and-lapack-in-numpy/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
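&lt;p&gt;As a hint of where that comparison is headed, here is a rough sketch: for very small inputs, per-call overhead often makes Numpy slower than plain Python (the input size and iteration count below are arbitrary, and results vary by machine):&lt;/p&gt;

```python
import timeit
import numpy as np

# On a tiny input, the fixed overhead of dispatching into Numpy's C code
# can outweigh the work itself.
data = [1.0, 2.0, 3.0]
arr = np.array(data)

t_py = timeit.timeit(lambda: sum(data), number=100_000)
t_np = timeit.timeit(lambda: np.sum(arr), number=100_000)
print(t_py, t_np)  # sum(data) is typically faster at this size
```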
</description>
        <pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate>
        <link>https://sh7ven.github.io/2024/06/25/why-is-numpy-so-fast.html</link>
        <guid isPermaLink="true">https://sh7ven.github.io/2024/06/25/why-is-numpy-so-fast.html</guid>
        
        <category>Numpy</category>
        
        <category>Python</category>
        
        <category>Optimization</category>
        
        <category>Benchmarking</category>
        
        
      </item>
    
  </channel>
</rss>
