<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Shiven&apos;s blog</title>
    <description>Musings of a programmer.</description>
    <link>https://sh7ven.github.io/</link>
    <atom:link href="https://sh7ven.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 10 Jan 2025 08:23:56 +0000</pubDate>
    <lastBuildDate>Fri, 10 Jan 2025 08:23:56 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Anatomy of a function call</title>
        <description>&lt;!-- more --&gt;

&lt;p&gt;&lt;strong&gt;PS:&lt;/strong&gt; to keep the main content concise, the technical details relating to some concepts are mentioned as &lt;a href=&quot;#footnotes&quot;&gt;Footnotes&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;the-program-counter-pc&quot;&gt;The Program Counter (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PC&lt;/code&gt;)&lt;/h1&gt;
&lt;p&gt;A Program Counter (or instruction pointer, in the context of &lt;a href=&quot;https://en.wikipedia.org/wiki/X86&quot;&gt;x86 systems&lt;/a&gt;) is a register that holds the address of the next instruction to be executed.
In a simple environment, the instructions (which are stored in RAM) are “fetched” sequentially.
&lt;a href=&quot;https://docs.oracle.com/cd/E19120-01/open.solaris/817-5477/eoizl/index.html&quot;&gt;Control Transfer&lt;/a&gt; instructions, however, can change this sequence by placing a new value in the PC.
Even returning&lt;a href=&quot;#function-call-cycle&quot;&gt;&lt;sup&gt;[ref]&lt;/sup&gt;&lt;/a&gt; from a function is a &lt;em&gt;control transfer&lt;/em&gt;. Other examples include the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jump&lt;/code&gt; instruction.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;goto&lt;/code&gt; in C behaves analogously to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jump&lt;/code&gt;, and the compiler might generate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jump&lt;/code&gt; instructions in the assembler output as well. These control transfer instructions can branch off execution. In multi-core and multi-threaded processors, each core or thread has its own PC, which means that each core/thread can execute a different instruction sequence, independently of the others. In SIMD applications, by contrast, each parallel computation unit performs the exact same instruction, just with different data.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
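&lt;p&gt;The effect of a control transfer on the PC can be sketched with a toy instruction list (a hypothetical mini-machine in Python, not real x86):&lt;/p&gt;

```python
# Toy machine: the "pc" register selects the next instruction to run.
# A "jmp" instruction overwrites the pc instead of letting it advance.
def run(program):
    pc = 0
    trace = []
    while pc != len(program):
        op, arg = program[pc]
        trace.append(pc)
        if op == "jmp":      # control transfer: place a new value in the PC
            pc = arg
        else:                # ordinary instruction: the PC advances sequentially
            pc = pc + 1
    return trace

# Instruction 2 is never executed, because the jump skips over it.
print(run([("nop", None), ("jmp", 3), ("nop", None), ("nop", None)]))  # [0, 1, 3]
```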
&lt;h2 id=&quot;synchronization&quot;&gt;Synchronization&lt;/h2&gt;
&lt;p&gt;The program counter is also important for synchronization. Locks and mutexes prevent concurrent accesses to shared resources, so the program counters of the parallel units
must be coordinated carefully across the different threads to ensure mutual exclusion. Say the PCs get stuck in a waiting state, unable to progress, because each thread is waiting for
a resource held by another thread, causing a circular wait: this is a deadlock. A livelock occurs when the PCs of the threads involved are continuously changing state without making any progress. It is a risk with algorithms that recover from a deadlock: if more than one process takes recovery action when changing states, the deadlock detection algorithm can get triggered repeatedly. Read more about synchronization problems in &lt;a href=&quot;https://web.cs.wpi.edu/~cs3013/c07/lectures/Section06-Sync.pdf&quot;&gt;this article&lt;/a&gt; by Jerry Breecher.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
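&lt;p&gt;A minimal sketch of mutual exclusion, using Python’s threading primitives for brevity (the lock plays the same role as the mutexes discussed above; without it, two threads could read the same counter value and lose an update):&lt;/p&gt;

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:               # only one thread may hold the lock at a time
            counter = counter + 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000, since no updates were lost
```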

&lt;h1 id=&quot;instruction-cycle&quot;&gt;Instruction cycle&lt;/h1&gt;
&lt;p&gt;The instruction cycle is the fundamental sequence of steps that a CPU performs to execute a single instruction. Simply put, it is a fetch $\rightarrow$ decode $\rightarrow$ execute cycle,
and it looks something like this:
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It starts with a fetch, which retrieves the next instruction to be executed from memory, at the address held in the Program Counter&lt;sup&gt;[1]&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Control_unit&quot;&gt;Control Unit&lt;/a&gt; interprets the &lt;a href=&quot;https://en.wikipedia.org/wiki/Opcode&quot;&gt;instructions&lt;/a&gt; via a &lt;a href=&quot;https://www.sciencedirect.com/topics/engineering/instruction-decoder&quot;&gt;decoder&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;The ALU executes the arithmetic and logic operations, reading the operands from registers or from memory.
&lt;br /&gt;
Some intermediate read/write phases might also be involved.
&lt;br /&gt;&lt;/li&gt;
&lt;/ol&gt;
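&lt;p&gt;The three phases can be sketched as a loop in Python (a simplified model of the cycle, not of any real CPU):&lt;/p&gt;

```python
# Toy fetch/decode/execute loop over a machine with two registers.
REGS = {"eax": 0, "edx": 0}

def decode(opcode):
    # the "decoder" maps an opcode to the operation it stands for
    table = {
        "mov": lambda reg, val: REGS.update({reg: val}),
        "add": lambda reg, val: REGS.update({reg: REGS[reg] + val}),
    }
    return table[opcode]

def execute(memory):
    pc = 0
    while pc != len(memory):
        opcode, reg, val = memory[pc]   # 1. fetch the instruction
        operation = decode(opcode)      # 2. decode it
        operation(reg, val)             # 3. execute it
        pc = pc + 1

execute([("mov", "eax", 5), ("mov", "edx", 7), ("add", "eax", 2)])
print(REGS["eax"])  # 7
```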

&lt;h1 id=&quot;function-call-cycle&quot;&gt;Function call cycle&lt;/h1&gt;
&lt;p&gt;When a function is called, a stack frame is created, which includes the return address, function arguments and the variables local to the function.
The return address specifies where control should return after the function call completes.
&lt;br /&gt;
A simplified breakdown of a function call looks something like this:
&lt;br /&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Arguments of the function are placed on the stack.&lt;/li&gt;
  &lt;li&gt;Some memory is allocated for the return value from the function.&lt;/li&gt;
  &lt;li&gt;A platform-specific &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call&lt;/code&gt; instruction is executed. This pushes the &lt;strong&gt;return address&lt;/strong&gt; of the function onto the stack.
The &lt;em&gt;program counter&lt;/em&gt; is then loaded with the function’s address, which transfers control to the function.&lt;/li&gt;
  &lt;li&gt;The function reads the arguments from the stack and the code in the function body is run.&lt;/li&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ret&lt;/code&gt; instruction pops the return address from the stack into the program counter, and thus control returns back to the caller; the return value itself is typically handed back in a register.
&lt;br /&gt;
&lt;br /&gt;&lt;/li&gt;
&lt;/ol&gt;
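&lt;p&gt;The steps above can be mimicked with an explicit stack (a conceptual Python model; as the disassembly section below shows, real calling conventions also pass arguments in registers):&lt;/p&gt;

```python
stack = []

def call(func, args):
    stack.append(args)                # step 1: arguments are placed on the stack
    stack.append("return-address")    # step 3: "call" pushes the return address
    result = func(stack[-2])          # step 4: the function reads its arguments
    stack.pop()                       # step 5: "ret" pops the return address...
    stack.pop()                       # ...and the argument slot is cleaned up
    return result                     # control (and the value) go back to the caller

print(call(lambda a: a[0] + a[1], (123, 456)))  # 579
```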

&lt;h1 id=&quot;disassembling&quot;&gt;Disassembling&lt;/h1&gt;
&lt;p&gt;Consider a simple C program:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; 
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; 
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;123&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;456&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;To get the assembler output that the GCC compiler generates:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcc &lt;span class=&quot;nt&quot;&gt;-S&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-fverbose-asm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; asm.s add.c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Let us examine the assembler output:&lt;/p&gt;
&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;kc&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;456&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;esi&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;kc&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;123&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edi&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;    &lt;span class=&quot;nv&quot;&gt;add&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;First, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;456&lt;/code&gt; is moved into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;esi&lt;/code&gt;, which holds the &lt;em&gt;second&lt;/em&gt; integer argument, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;123&lt;/code&gt; into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edi&lt;/code&gt;, which holds the first. On x86-64, the first few integer arguments travel in registers rather than on the stack, per the System V calling convention; this is the argument-passing step of the call sequence.
As I explained earlier, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call add&lt;/code&gt; will now pass control to the function body, which looks like this:
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;div class=&quot;language-nasm highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;add:&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;.LFB0:&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_startproc&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;pushq&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;    &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_def_cfa_offset&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_offset&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movq&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rsp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_def_cfa_register&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;esi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;b&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;tmp100&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;movl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;eax&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;tmp101&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;addl&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;edx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;eax&lt;/span&gt;  &lt;span class=&quot;err&quot;&gt;#&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;tmp100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;_3&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;popq&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rbp&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_def_cfa&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;ret&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.cfi_endproc&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;.LFE0:&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.size&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;add&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.globl&lt;/span&gt;  &lt;span class=&quot;nv&quot;&gt;main&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;.type&lt;/span&gt;   &lt;span class=&quot;nv&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;function&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;br /&gt;
This looks rather complicated for a simple addition, so I’m going to focus on the important bits of this piece of assembler.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cfi_startproc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.cfi_endproc&lt;/code&gt; are directives that emit call-frame information, used for debugging and exception handling. They mark the start and end of the procedure.
The base pointer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt; is pushed onto the stack, and the stack pointer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsp&lt;/code&gt; is moved into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt;; this makes it easier to access function parameters and local variables.
&lt;br /&gt;
The arguments are stored at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-4(%rbp)&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-8(%rbp)&lt;/code&gt;. Notice the difference of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4&lt;/code&gt;, which is the size of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; on my machine.
This notation roughly translates to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4&lt;/code&gt; bytes below the address in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt;. Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;b&lt;/code&gt; are moved into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;edx&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eax&lt;/code&gt; respectively, and their sum is computed using
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;addl&lt;/code&gt; opcode.
&lt;br /&gt;
Subsequently, the stack frame is cleaned up by popping &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rbp&lt;/code&gt; from the stack, and control is returned via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ret&lt;/code&gt;, with the result left in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;eax&lt;/code&gt;.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h1 id=&quot;cc-specifics&quot;&gt;C/C++ specifics&lt;/h1&gt;
&lt;p&gt;From the GNU GCC manual:
&lt;br /&gt;
You can actually grab the return address or the frame address of a function. The prototype of this builtin looks like this:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__builtin_return_address&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;
This function returns the return address of the function, or one of its callers. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level&lt;/code&gt; is the number of frames to scan up the call stack.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level = 0&lt;/code&gt; yields the return address of the current function.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level = 1&lt;/code&gt; yields the address of the caller.&lt;/p&gt;

&lt;p&gt;and so on.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;On some platforms, additional post-processing is needed to get the &lt;em&gt;actual&lt;/em&gt; address from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__builtin_return_address&lt;/code&gt;. This is done as follows:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;addr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__builtin_extract_return_addr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__builtin_return_address&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(...)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
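&lt;p&gt;As a loose analogue, Python lets you walk frames up the call stack with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sys._getframe&lt;/code&gt;, whose argument plays the same role as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;level&lt;/code&gt; (this is CPython-specific, and of course distinct from GCC’s builtin):&lt;/p&gt;

```python
import sys

def who_called_me():
    # sys._getframe(0) is the current frame; 1 is the caller, like level = 1
    return sys._getframe(1).f_code.co_name

def outer():
    return who_called_me()

print(outer())  # outer
```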

&lt;p&gt;To give an example, on x86/x86-64 systems, stack protection is implemented with something called a Stack Canary.
It is a random value that is inserted, typically, between the local variables and the return address. Before the function returns, this Canary value is checked.
If that value has been altered, the program is usually terminated to prevent malicious code from running. The GCC compiler is awesome, so it provides some stack smashing protections&lt;sup&gt;[2]&lt;/sup&gt; as well.&lt;/p&gt;
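&lt;p&gt;The check can be modelled in a few lines of Python (purely illustrative; this is not GCC’s actual mechanism):&lt;/p&gt;

```python
import os

CANARY = os.urandom(8)   # a random value, chosen when the "function" starts

def write_then_check(buffer_size, data):
    # layout sketch: [buffer][canary]; writing past the buffer hits the canary
    frame = bytearray(buffer_size) + bytearray(CANARY)
    frame[:len(data)] = data              # an unchecked write, as in a C overflow
    if bytes(frame[buffer_size:]) != CANARY:
        return "stack smashing detected"  # GCC would abort the program here
    return "ok"

print(write_then_check(8, b"A" * 8))    # ok
print(write_then_check(8, b"A" * 12))   # stack smashing detected
```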

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h1&gt;
&lt;p&gt;&lt;sup&gt;[1]&lt;/sup&gt; The &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_(computing)#Address_bus&quot;&gt;Address Bus&lt;/a&gt; carries the address held in the PC to memory, and the fetched instruction
is carried from memory to the &lt;em&gt;Instruction Register&lt;/em&gt; over the &lt;em&gt;Data Bus&lt;/em&gt;. The instruction is only temporarily placed in the IR.
Modern &lt;a href=&quot;https://en.wikipedia.org/wiki/Bus_(computing)&quot;&gt;memory buses&lt;/a&gt; connect directly to the DRAM chips, and have very low latency.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;[2a]&lt;/sup&gt; You can pass the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fstack-protector&lt;/code&gt; flag to detect buffer overflows on the stack. This is done by placing the Canary next to critical stack data, as explained earlier. However, this works only for functions that the compiler &lt;em&gt;thinks&lt;/em&gt; are vulnerable. You can override this behaviour with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fstack-protector-all&lt;/code&gt; flag, which protects all functions.
&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TARGET_STACK_PROTECT_GUARD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TARGET_STACK_PROTECT_GUARD&lt;/code&gt; hook in GCC, the compiler can obtain the value of the Canary. I think this value is set during compilation. 
A “guard” variable is created to detect stack smashing attacks, and it exists in the form of a global variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;__stack_chk_guard&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This hook allows each architecture to provide its own implementation. I’m not familiar with GCC internals though, you can read more &lt;a href=&quot;https://gcc.gnu.org/onlinedocs/gccint/target-macros/stack-layout-and-calling-conventions/stack-smashing-protection.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;[2b]&lt;/sup&gt; Position Independent Executables (PIEs) allow the program code to be loaded at different addresses, making it more difficult for an attacker to predict the locations of specific code segments. You can enable PIE during compilation with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-fPIE&lt;/code&gt; flag.&lt;/p&gt;

&lt;p&gt;PIE exists to support Address Space Layout Randomization (ASLR) in executables. You can explore this in action by compiling a simple C program with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-g&lt;/code&gt; flag and setting breakpoints at different positions in the source file; you’ll observe the randomization of addresses at the breakpoints.&lt;/p&gt;

&lt;p&gt;I think you’ll need to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set disable-randomization off&lt;/code&gt; in GDB to turn on ASLR for the process. Otherwise, GDB defaults to fixed addresses.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;
On Linux, you can check if ASLR is on by running:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; /proc/sys/kernel/randomize_va_space
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;With modern Linux kernels it is usually enabled by default, set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt; (full randomization). You can read in-depth about PIE by taking a look at &lt;a href=&quot;http://www.openbsd.org/papers/nycbsdcon08-pie/&quot;&gt;OpenBSD’s PIE implementation&lt;/a&gt;.&lt;/p&gt;
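&lt;p&gt;A small helper to read and interpret that value (the meanings below are for Linux’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;randomize_va_space&lt;/code&gt; setting; the file read is guarded, since the path exists only on Linux):&lt;/p&gt;

```python
def describe_aslr(value):
    # documented meanings of /proc/sys/kernel/randomize_va_space
    meanings = {
        0: "ASLR disabled",
        1: "randomize stack, mmap base and shared libraries",
        2: "also randomize the heap (full randomization)",
    }
    return meanings.get(value, "unknown")

try:
    with open("/proc/sys/kernel/randomize_va_space") as f:
        print(describe_aslr(int(f.read())))
except OSError:
    print("not available on this platform")
```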
</description>
        <pubDate>Fri, 19 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://sh7ven.github.io/2024/07/19/Anatomy-of-a-function-call.html</link>
        <guid isPermaLink="true">https://sh7ven.github.io/2024/07/19/Anatomy-of-a-function-call.html</guid>
        
        <category>C</category>
        
        <category>Assembly</category>
        
        <category>Low-level</category>
        
        
      </item>
    
      <item>
        <title>Why is Numpy so fast?</title>
        <description>&lt;!-- more --&gt;

&lt;h2 id=&quot;homogeneity&quot;&gt;Homogeneity&lt;/h2&gt;
&lt;p&gt;Numpy arrays have elements with homogeneous types, whilst native Python lists are just containers holding pointers to objects - even when they are of the same type.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Locality_of_reference&quot;&gt;Principle of Locality&lt;/a&gt; is the tendency of a processor to access the same set of memory locations, or locations near them, repeatedly over a short period of time. Since Numpy arrays are homogeneous and contiguous, neighbouring elements share cache lines, and future accesses to them are relatively fast. This subdivision of the principle is called &lt;strong&gt;Spatial Locality&lt;/strong&gt;.&lt;/p&gt;


&lt;p&gt;&lt;img src=&quot;/assets/img/Principle-of-Locality.png&quot; alt=&quot;Principle of Locality diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This figure explains Spatial Locality. You can see how, during instruction fetches, $n$ loop iterations access the same memory locations many times.
Numpy arrays are contiguous, which means the processor can load an entire block into cache at once, so accesses to all elements within the block are faster.&lt;/p&gt;

&lt;p&gt;Due to this homogeneity, a lot of latency is saved on &lt;a href=&quot;https://en.wikipedia.org/wiki/Indirection&quot;&gt;pointer indirection&lt;/a&gt; and per-element type checking. With Python lists, it doesn’t even matter if the list has the same type of elements, because Python treats even primitive values (like integers) as objects.
When you add a variable, say $x = 20$, to a list, a reference to the object $20$ gets appended. Now the list and $x$ both hold a reference to $20$. When you reassign either of them, only that reference changes, meaning $x$ or the list will then hold a reference to the new object.&lt;/p&gt;
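&lt;p&gt;The difference in representation is easy to see from memory sizes (assuming Numpy is installed):&lt;/p&gt;

```python
import sys
import numpy as np

n = 1000
arr = np.arange(n, dtype=np.int64)   # one contiguous buffer of raw 8-byte ints
lst = list(range(n))                 # n pointers, each to a separate int object

print(arr.nbytes)                    # 8000: n elements * 8 bytes, data only
# the list object itself holds only the pointer table; the int objects are extra
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))
```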

&lt;h2 id=&quot;vectorized-operations&quot;&gt;Vectorized operations&lt;/h2&gt;
&lt;p&gt;Arithmetic operations are applied to the entire array at once, instead of through an explicit Python loop over the elements. The $O(N)$ loop still happens, but Numpy offloads it to compiled C code, avoiding per-element interpreter overhead; array processing, like element-wise arithmetic, should therefore always be done with such vectorized operations.&lt;/p&gt;

&lt;p&gt;Array broadcasting is done when arrays are of different shapes, making it possible to perform arithmetic operations between arrays and scalars. The scalar value is essentially expanded, so the number 2 will be treated as an array filled with twos, but needless copies of data are not actually created, and the looping occurs in C instead of Python; efficiency is almost always a by-product. Under the hood, broadcasting (like many other vectorized operations) is almost always done in the form of &lt;a href=&quot;https://numpy.org/doc/stable/reference/ufuncs.html&quot;&gt;“Universal functions”&lt;/a&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ufunc&lt;/code&gt; for short). These are functions that operate on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndarrays&lt;/code&gt; in an element-by-element fashion. Most of the built-in operations, like addition, are carried out this way.&lt;/p&gt;
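&lt;p&gt;For instance, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;np.add&lt;/code&gt; is the ufunc behind &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt;, and broadcasting lets a scalar (or a smaller array) stand in for a full-size array:&lt;/p&gt;

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# the scalar 2 is broadcast across the array; no array of twos is materialized
plus_two = a + 2
same = np.add(a, 2)               # `+` on ndarrays dispatches to the np.add ufunc
print((plus_two == same).all())   # True

# a 1-D row is broadcast across every row of the 2-D array
row = np.array([10, 20, 30])
print(a + row)                    # rows become [11 22 33] and [14 25 36]
```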

&lt;h2 id=&quot;efficient-internal-organization&quot;&gt;Efficient internal organization&lt;/h2&gt;
&lt;p&gt;Numpy arrays aren’t &lt;em&gt;actually&lt;/em&gt; arrays. They’re a data structure that consists of a contiguous data buffer and some metadata.&lt;/p&gt;

&lt;h4 id=&quot;internal-data-buffer&quot;&gt;Internal data buffer&lt;/h4&gt;
&lt;p&gt;It is essentially a C-style array - a contiguous, fixed block of memory, containing similar elements. This is the portion that actually holds the array elements.&lt;/p&gt;

&lt;p&gt;This is the fundamental aspect of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ndarray&lt;/code&gt; - It is essentially a chunk of memory starting at some location.&lt;/p&gt;

&lt;h4 id=&quot;buffer-metadata&quot;&gt;Buffer Metadata&lt;/h4&gt;
&lt;p&gt;It contains the following (quoted directly from the Numpy documentation):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Size of the basic data element (e.g. an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int32&lt;/code&gt;, which is 4 bytes, irrespective of 32-bit or 64-bit systems)&lt;/li&gt;
  &lt;li&gt;an offset relative to the start of data buffer&lt;/li&gt;
  &lt;li&gt;the number of dimensions and the size of each dimension&lt;/li&gt;
  &lt;li&gt;the &lt;a href=&quot;https://en.wikipedia.org/wiki/Stride_of_an_array&quot;&gt;stride&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;byte order&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;read flag (whether the buffer is read-only, or not)&lt;/li&gt;
  &lt;li&gt;dtype: interpretation of the basic data element (yes, the users can create &lt;a href=&quot;https://numpy.org/doc/stable/glossary.html#term-structured-data-type&quot;&gt;arbitrarily complex, composite data types&lt;/a&gt; as the basic array element!)&lt;/li&gt;
  &lt;li&gt;Memory ordering: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; is Row-Major, meaning that elements of a row are stored adjacent to each other, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Fortran&lt;/code&gt; is Column-Major, so elements of a column are stored adjacent to each other.&lt;/li&gt;
&lt;/ul&gt;
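&lt;p&gt;Much of this metadata is visible from Python. Transposing, for instance, doesn’t move any data; it just swaps the shape and strides over the same buffer:&lt;/p&gt;

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.int32)

print(a.itemsize)    # 4: size of the basic data element
print(a.shape)       # (3, 4): number of dimensions and the size of each
print(a.strides)     # (16, 4): bytes to step per dimension (C order, row-major)

t = a.T              # a view over the same buffer with swapped metadata
print(t.strides)     # (4, 16): the walk is now column-major
print(t.base is a)   # True: no data was copied
```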

&lt;p&gt;If you want to read more about how Numpy works under the hood, check out &lt;a href=&quot;https://www.amazon.com/Guide-NumPy-Travis-Oliphant-PhD/dp/151730007X&quot;&gt;Guide to Numpy&lt;/a&gt; 
by Travis Oliphant, the creator of Numpy and a driving force behind Scipy and the wider scientific Python ecosystem.&lt;/p&gt;

&lt;h1 id=&quot;some-benchmarking&quot;&gt;Some Benchmarking&lt;/h1&gt;
&lt;p&gt;Let’s see how fast Numpy is. I’ll use the product of two $N\times N$ matrices as the reference workload.
The elements of both matrices are randomly generated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;numpy.random.rand()&lt;/code&gt;. Naturally,
the random element generation is excluded from the timed region.&lt;/p&gt;

&lt;p&gt;The benchmarks were run in an IPython kernel on an Intel Xeon(R) CPU. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%timeit&lt;/code&gt; magic
was used to time the function calls. The Python documentation for &lt;a href=&quot;https://docs.python.org/3/library/timeit.html&quot;&gt;timeit&lt;/a&gt; 
notes that even executing a bare &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pass&lt;/code&gt; statement carries a small base overhead. 
You can measure that overhead on your machine by invoking &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; without a statement argument.&lt;/p&gt;
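&lt;p&gt;That baseline can also be measured with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; module directly (the iteration count below is an arbitrary choice):&lt;/p&gt;

```python
import timeit

# Time one million executions of a pass statement: this is the baseline
# loop overhead that any timeit measurement includes.
overhead = timeit.timeit('pass', number=1_000_000)
print(overhead)  # small but nonzero; the exact value is machine-dependent
```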


&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;timeit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/numpy_plot.png&quot; alt=&quot;Numpy benchmark plot&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;what-next&quot;&gt;What next?&lt;/h1&gt;
&lt;p&gt;It is important to note that Numpy is not &lt;em&gt;always&lt;/em&gt; fast. I’ll talk more about Numpy-specific use cases, and where it loses to vanilla Python. I also want to compare Numpy to BLAS and LAPACK, since Numpy relies on both for some operations when they are installed. Benchmarking is a tricky subject though, so the next article in this series will be devoted entirely to it. There, I’ll also compare Numpy to other linear algebra APIs, perhaps in C/C++ (something like &lt;a href=&quot;https://eigen.tuxfamily.org/&quot;&gt;Eigen&lt;/a&gt;). Until then, read more about BLAS and LAPACK with Numpy &lt;a href=&quot;https://superfastpython.com/what-is-blas-and-lapack-in-numpy/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
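&lt;p&gt;As a hint of where that comparison is headed, here is a rough sketch: for very small inputs, per-call overhead often makes Numpy slower than plain Python (the input size and iteration count below are arbitrary, and results vary by machine):&lt;/p&gt;

```python
import timeit
import numpy as np

# On a tiny input, the fixed overhead of dispatching into Numpy's C code
# can outweigh the work itself.
data = [1.0, 2.0, 3.0]
arr = np.array(data)

t_py = timeit.timeit(lambda: sum(data), number=100_000)
t_np = timeit.timeit(lambda: np.sum(arr), number=100_000)
print(t_py, t_np)  # sum(data) is typically faster at this size
```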
</description>
        <pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate>
        <link>https://sh7ven.github.io/2024/06/25/why-is-numpy-so-fast.html</link>
        <guid isPermaLink="true">https://sh7ven.github.io/2024/06/25/why-is-numpy-so-fast.html</guid>
        
        <category>Numpy</category>
        
        <category>Python</category>
        
        <category>Optimization</category>
        
        <category>Benchmarking</category>
        
        
      </item>
    
  </channel>
</rss>
