Compiling

Computers think in binary. Yes or no, true or false, on or off, one or zero.

Each binary digit is one bit. A byte is eight bits, and one hexadecimal digit represents four bits, so a byte can be written as two hex digits. 32-bit processors were the norm until a few years ago; 64-bit processors are the standard today. So, the largest word a modern personal computer can handle natively is 64 bits long.

It is very difficult for us humans to think in binary, so we develop higher-level languages that we can understand and that tools can translate into the binary the computer uses to process information. That translation happens in four stages:

  • Preprocessing
  • Compiling
  • Assembling
  • Linking

GCC, which stands for GNU Compiler Collection, is a suite of compilers and related tools developed by the GNU Project for compiling programs written in various programming languages into machine code. Initially created for the C language, GCC has evolved to support numerous languages including C++, Objective-C, Fortran, Ada, and Go, among others. Here’s an in-depth look at GCC and how it functions:

GCC compiles source code into assembly or directly into binary machine code that can be executed by a computer. It’s known for its portability across different hardware architectures and operating systems, making it a cornerstone of open-source software development. It’s released under the GNU General Public License (GPL), ensuring that GCC remains free software.

GCC supports each language through a dedicated frontend (parser), invoked through a language-specific driver:

  • C (gcc): The original C compiler.
  • C++ (g++): The C++ compiler.
  • Fortran (gfortran): For Fortran code.
  • Ada (GNAT): The Ada compiler.
  • Go (gccgo): For compiling Go programs.

The gcc command itself acts as a driver, orchestrating the compilation process by calling appropriate tools in sequence. Frontends include language-specific parsers that turn source code into an intermediate representation (IR), known as GIMPLE in GCC’s terminology.

A middle end optimizes the code in the GIMPLE form. It’s language-independent and focuses on transformations like loop unrolling, dead code elimination and other optimizations.

The backend converts the optimized IR into assembly or machine code for the target architecture. It handles code generation (translating IR into target instructions), instruction selection (choosing the best machine instructions for the given operations) and register allocation (deciding which variables live in hardware registers). The assembler (as) then converts the assembly code into object code (machine code in binary form), and the linker (ld) combines object files into an executable; both tools are part of GNU Binutils and are invoked by GCC.

Users run GCC via the gcc command. For example, gcc -o output input.c would compile input.c into an executable named output. First, GCC invokes the preprocessor (cpp) to handle directives like #include and #define. This step results in an expanded source file with all macros replaced and included files merged in.

During parsing, the frontend reads the source code, checks syntax and transforms it into GIMPLE. Semantic analysis checks for semantic correctness, like type checking.

The middle end optimizes the code, which might involve tree optimizations such as on the GIMPLE representation. RTL (Register Transfer Language) optimizations are a lower-level representation before machine code generation.

The backend generates assembly code or directly machine code for the target architecture. This step includes instruction scheduling, optimizing the order of instructions for better performance. Register allocation maps variables to CPU registers for efficiency.

The GNU Assembler (as) transforms assembly code into object code. GCC calls the linker (ld) to combine object files, resolve symbols and produce an executable or library. This includes static vs. dynamic linking. Depending on flags, GCC might statically link libraries or prepare for dynamic linking. Finally, GCC outputs either an executable file, a shared library or object code, depending on the compilation options.

GCC can compile code for an architecture different from the one it’s running on, which is crucial for embedded systems development. It offers various levels of optimization (-O0, -O1, -O2, -O3, -Os) that can significantly improve performance or reduce code size. GCC provides extensive error and warning messages to help developers debug their code. It supports plugins for extending its functionality, like additional optimizations or analysis tools. Profile-Guided Optimization (PGO) uses runtime information to guide optimizations.

GCC has numerous options for controlling the compilation process, affecting everything from optimization to output format. Environment variables such as CPATH and LIBRARY_PATH can influence GCC's behavior directly, while build systems conventionally pass C compilation flags through CFLAGS.

In summary, GCC is a versatile and powerful compiler suite that plays a fundamental role in software development across various platforms and programming languages. Its modular design allows for extensive customization and optimization, making it suitable for both simple scripts and complex, performance-critical applications.

Preprocessor

The preprocessor pulls in the header files before the compiler sees the main function. That way, every function declared in those headers is available when main is compiled.

A preprocessor is a program or module that processes source code before it undergoes compilation or interpretation. It’s an intermediate step in the compilation process, primarily used in languages like C and C++ but also found in other programming environments. Here’s an in-depth look at what a preprocessor does:

The preprocessor does not understand the programming language’s syntax or semantics in depth; it operates on the source code as plain text, performing macro expansions and file inclusions. By processing directives (often starting with # in C/C++), it prepares the source for the actual compiler by transforming the code according to these directives.

Macros are essentially shorthand notations for larger pieces of code or constants. Conditional compilation uses directives like #ifdef, #ifndef, #if, #else, #elif and #endif to control which parts of the code are compiled based on certain conditions. This is useful for debugging, including or excluding debug code based on whether DEBUG is defined. Platform-specific code involves using #ifdef for code segments that are OS-specific.

Line control can be used to alter the compiler’s perception of line numbers and file names, which is useful for generated code or when dealing with macro expansions for better error reporting. #pragma offers compiler-specific features or instructions, which can control compiler behavior like optimization settings or warning suppression. Error generation: #error can be used to halt compilation and display an error message if certain conditions aren’t met, helping to enforce build rules or configurations.

The preprocessor reads source files with embedded preprocessor directives. It interprets these directives, performing text substitutions, including files, or conditionally compiling sections of code. The result is an expanded version of the source code, which is then fed into the compiler.

Macros and includes promote code reuse without redundancy. Conditional compilation allows for writing code that can be compiled for different platforms or configurations from a single source. Preprocessors make it easier to toggle features or debug-specific code sections.

Excessive use of macros can lead to code bloat and might obscure the program’s logic, making maintenance harder. While beneficial, preprocessor directives can sometimes reduce portability if not used carefully, especially with platform-specific #ifdef blocks.

Preprocessors provide powerful, albeit simple, text manipulation tools that significantly enhance the flexibility and maintainability of programming in languages like C and C++. They allow developers to manage complexity, adapt code for different environments, and streamline the development process.

Compiling

The compiler takes the C programming code you write and converts it into Assembly code. Compiling is the process of converting source code written in a high-level programming language (like C, C++ or Rust) into a lower-level language, typically machine code, that a computer’s processor can execute. It transforms human-readable code into efficient, executable instructions that a computer can run directly. This process is performed by a specialized software program called a compiler.

A byte is eight bits, and word sizes in computer architecture (16 bit, 32 bit, 64 bit) are multiples of a byte. Because each hexadecimal digit maps to exactly four bits, hexadecimal is a convenient shorthand for the binary the machine actually uses. On a 32-bit architecture, a 64-bit value is sometimes handled as two 32-bit words joined together: the registers are 32 bits wide, and the assembly code combines two 32-bit long words into one 64-bit long long word.

ASCII is a system that assigns a number to every letter and symbol we can type into a computer. When we enter text, the computer stores each character as its ASCII number in binary; it is these binary numbers, not the characters themselves, that the CPU processes.

In source code, the prefix 0x tells the compiler that the number that follows is hexadecimal rather than decimal; other prefixes exist for other number systems. Hexadecimal matters because it is the standard human-readable shorthand for the binary the computer actually calculates with.

The reason we use hexadecimal instead of decimal is that binary groups naturally into four-bit chunks, and each four-bit chunk maps to exactly one of the 16 hexadecimal digits. No such clean grouping exists for base 10.

So, the compiler compiles the C programming language into assembly code. Assembly is a very simple and precise language, telling the CPU exactly what to do, such as move a number from one register to another or add the contents of two registers together in the ALU, the Arithmetic Logic Unit.

Registers are circuits in the CPU where data is stored temporarily while the computer processes it. A modern CPU performs billions of operations every second, and a laptop with eight cores can process a great deal of information very quickly.

The CPU reads text as ASCII codes stored in binary. The binary number 01000000 is 64, which the computer recognizes as the @ symbol; 01000001 is 65, the ASCII code for A. In hexadecimal, those two values are 0x40 and 0x41.

Assembling

When discussing the compilation of computer programs, “assembling” refers to a specific phase in the transformation of source code into executable machine code. This process is particularly relevant when dealing with low-level programming languages like assembly language.

Compilation is the process of translating human-readable source code into machine-executable instructions. The journey from source code to executable involves several steps:

  • Lexical Analysis (Tokenization)
  • Syntax Analysis (Parsing)
  • Semantic Analysis
  • Intermediate Code Generation
  • Optimization
  • Code Generation
  • Assembling (for languages compiled to assembly)
  • Linking

Assembling specifically comes into play when the source code is written in or transformed into assembly language. Assembly language is a low-level programming language where each statement corresponds to a single machine instruction. It uses mnemonics (human-readable names) for operations, making it more readable than raw machine code. The assembler converts these mnemonic instructions into their binary or hexadecimal equivalents, which the computer’s hardware can execute directly.

The assembler reads the assembly source file, which consists of mnemonics, labels, directives and data definitions. It checks for syntax errors, ensuring each statement adheres to the assembly language rules of the target architecture.

The assembler creates a symbol table to keep track of where each label points in memory. Labels are programmer-defined names for memory locations. As the assembler makes its passes through the source, it resolves addresses for all labels; if a label is used before it is defined, a later pass assigns the correct memory addresses.

Each mnemonic instruction is translated into its opcode (operation code) and operands (data or memory locations). This step converts a human-readable instruction like MOV AX, BX into a binary sequence such as 10001001 11011000 (hex 89 D8). The assembler combines all translated opcodes and operands into a sequence of bytes that form the machine code. This includes:

  • Instruction bytes: The actual instructions to be executed.
  • Data: Any constants or initialized data areas.
  • Directives: Handling special assembly directives like ORG for setting the program counter, or EQU for defining constants.

Throughout the process, the assembler checks for errors like undefined symbols, incorrect operand sizes or improper use of instructions. It reports these to the programmer to be corrected. The final product of assembling is an object file or an executable file, depending on whether further steps like linking are needed. An object file contains machine code, symbol information and relocation information (for dynamic linking).


If the assembly program uses external libraries or multiple source files, a linker combines these into a single executable, resolving external references. The executable file is loaded into memory by the operating system whenever the program is run.

Assembly allows for fine-tuned control over hardware, which can be crucial for performance-critical sections of code. Writing or understanding assembly can be essential for legacy systems or for developing drivers or firmware where direct hardware manipulation is necessary. Sometimes, debugging at the assembly level provides insights not visible in higher-level languages.

Assembling is a pivotal step in the compilation process for programs written or transitioned to assembly language. It bridges the gap between human-readable code and machine-executable instructions, ensuring that software can run on the intended hardware architecture with precision and efficiency.

The Instruction Set is the vocabulary of the Assembler. x86/x64 computers use a complex instruction set computer (CISC) architecture. Here are a few examples of general purpose registers:

  • Register – Name – Function
  • EAX – Accumulator – Arithmetic operations
  • ECX – Counter – Loop counter and shift/rotate counter
  • EDX – Data – Arithmetic and I/O operations
  • EBX – Base – Pointer to data
  • ESP – Stack pointer – Pointer to the top of the stack
  • EBP – Base pointer – Pointer to the base of the stack within a function
  • ESI – Source index – Pointer to the source location within array operations
  • EDI – Destination index – Pointer to the destination location in array operations

Linking

Linking is a critical phase in the compilation process of computer programs, bridging the gap between compiled object modules and creating a final executable or library. This phase is responsible for resolving external symbol references, combining separate parts of a program and ensuring that all components can work together seamlessly.

Before focusing on linking, it’s essential to understand where it fits in the overall compilation pipeline:

  • Lexical Analysis (Tokenization)
  • Syntax Analysis (Parsing)
  • Semantic Analysis
  • Intermediate Code Generation
  • Optimization
  • Code Generation or Assembling (producing object files)
  • Linking
  • Loading (at runtime)

Linking is the process of taking the object files generated by the compiler or assembler and combining them into a single executable file or library. During this process, the linker resolves references to code or data defined in other modules or libraries. These are the outputs from the compiler or assembler, containing machine code, data, and symbol tables. They are not yet executable because they might reference external symbols (functions or variables defined elsewhere).

The linker begins by collecting all the object files that constitute the program. This might also include additional libraries or modules. Each object file contains a symbol table listing symbols (functions, variables) defined within it and symbols it references but does not define. The linker uses this information to understand what needs to be resolved.

For internal symbols (defined and used within the same module), no linking action is required. For external symbols (referenced but not defined in the current module), the linker searches for definitions in other object files or libraries. With static libraries, the linker includes only the parts of the library that are actually used; with dynamic libraries, it records what must be resolved at runtime.

The linker assigns final memory addresses to all symbols. In the object files, these addresses might have been placeholders or relative to some base address. It adjusts any references in the code to point to these new, correct addresses. This ensures that when the program loads into memory, all code and data are correctly located.

Different sections of code and data (.text for code, .data for initialized data, .bss for uninitialized data, and so on) from the various object files are merged into one coherent layout.

The linker constructs the executable file, which includes:

  • Headers: Metadata about the file format, entry point, segments, etc.
  • Code and Data: The actual machine instructions and data from all modules.
  • Symbol and Relocation Information: For debugging or dynamic linking purposes.

Any unresolved symbols at this stage lead to linker errors. Common issues include missing or mismatched library versions, incorrect symbol definitions, or circular dependencies.

In static linking, all necessary library code is included in the final executable, making it self-contained but potentially larger. In dynamic linking, only references (stubs) to libraries are placed in the executable; the actual library code is loaded at runtime, allowing for smaller executables and easier updates to libraries without recompiling the program.

Modularity allows developers to work on different parts of a program independently, then combine them later. Dynamic linking can save memory and allow for more flexible updates to shared components. Linking makes it easier to patch or update individual libraries without affecting the entire application.

Dependency Management helps ensure all required libraries or modules are present and compatible. Symbol conflicts involve two modules defining the same symbol with different meanings. Decisions between static and dynamic linking can affect load times, memory usage, and runtime performance.

The computer would read the binary sequence 01000101 01000001 01011000 as the ASCII characters EAX. Data like this flows through the CPU at billions of bits per second.

Each CPU architecture has its own instruction set. The assembly language for an ARM-based laptop is different from the assembly language for an x86-based one. The compiler translates the C programming language into the correct assembly language for the machine it is targeting.

The assembler then passes the object code to the linker, which links all the pieces of the program together into one executable file. For example, the libraries declared by header files contain functions the current program needs to run, so you don’t have to keep reinventing the wheel every time you need a function. The linker ties all these pieces together.

In summary, linking is the phase where the promise of modularity in software design is fulfilled, ensuring that various parts of a program can interact as intended. It’s a sophisticated process that requires careful management to produce an executable that will run correctly on the target system.

Loading

Loading, in the context of compiling, refers to the process where the executable code, which has been prepared through compilation, assembly, and linking, is actually moved from storage (like a hard drive or SSD) into the computer’s main memory (RAM) for execution.

Source code is transformed into object code or assembly by a compiler or assembler. The linker combines object files and libraries into a single executable file, resolving external references and setting up the memory layout. The final step before execution is loading, where the executable is read into memory.

The executable is stored in a specific binary format (like ELF for Unix/Linux systems, PE for Windows, or Mach-O for macOS). These formats define how data should be loaded into memory. The OS or a loader program reads the executable’s headers to understand the memory requirements. Different segments of the program (code, data, stack, etc.) are mapped into memory. For instance:

  • Text Segment: Contains the actual machine instructions (code) and is often marked as read-only to prevent accidental overwrites.
  • Data Segment: Holds initialized data.
  • BSS Segment: For uninitialized data, which is zeroed out by the loader.
  • Heap and Stack: Dynamic memory and the function call stack, respectively, are set up.

If the executable uses dynamic libraries, the loader must locate and load those shared libraries into memory, adjusting pointers in the executable to reference the libraries’ locations. Relocation then adjusts addresses in the program to reflect where everything is actually placed in memory, which is especially important for position-independent code.

After loading, the program counter is set to the entry point of the program as specified in the executable’s headers, and some initialization may be required, such as setting up the environment, installing signal handlers or calling constructors for global objects in C++.

Source: Modern Computer Architecture and Organization, Jim Ledin, p 252, 2020

During static loading, the entire program is loaded into memory at once before execution. This was more common in earlier systems but is less flexible in terms of memory usage. There are several kinds of dynamic loading; modern systems often implement:

  • Lazy Loading: Only parts of the program are loaded as they’re needed (e.g., when a function is first called).
  • Demand Paging: The OS loads pages of memory only when they’re accessed, which helps manage memory more efficiently.
  • Shared Libraries/Shared Objects: Loading shared libraries at runtime allows multiple programs to share the same memory segment for common code, reducing memory usage.

Efficient loading requires careful management of memory to ensure programs do not exceed available RAM, potentially leading to swapping or thrashing. Loading must be secure to prevent unauthorized code execution or tampering with the program’s memory space.

Loading can affect startup time, so techniques like preloading or optimizing the loading sequence are important for applications where quick start-up is critical. With dynamic linking, ensuring that all required libraries are present and compatible with the executable version can be challenging. Address Space Layout Randomization (ASLR) is a security feature where the base address of an executable is randomized each time it’s loaded to make attacks like buffer overflows more difficult.

Loading is the bridge between the compilation process and actual program execution. It involves not only moving code and data into memory but also setting up the environment for the program to run correctly, efficiently, and securely. The specifics of this process can vary significantly based on the operating system, hardware architecture, and the executable’s format.

Interpreting Code

Interpreting a computer program involves executing source code directly without first compiling it into machine code. Instead of translating the entire program into a form that can be run by the hardware all at once, an interpreter reads, translates, and executes the code line by line or statement by statement.

An interpreter is a program that directly executes instructions written in a programming or scripting language without requiring them to be compiled into machine code beforehand.

Compilation converts a whole program into machine code before execution. Interpretation translates and executes the code at runtime, one piece at a time. Languages like Python, JavaScript, Ruby and PHP are commonly interpreted, although some have compilers or just-in-time (JIT) compilers for performance optimization.

The first step in the interpretation process is source code reading: the interpreter reads the source code, which could come from a file or be entered directly in an interactive environment.

During lexical analysis (tokenization), source code is broken down into tokens. This involves recognizing keywords, operators, identifiers, literals, etc., much like the first stage of compilation. During syntax analysis (parsing), tokens are analyzed to ensure they conform to the language’s grammatical rules. This step constructs an Abstract Syntax Tree (AST) or similar structure representing the program’s structure.

Semantic analysis checks for semantic correctness, like type checking, ensuring variables are used correctly and functions are called with the right number and types of arguments. Execution then proceeds directly: the interpreter walks through the AST (or equivalent) and executes each node:

  • Expressions: Evaluates immediate values or performs operations.
  • Statements: Executes control structures (if, while, for), function calls, assignments, etc.
  • Environment: Maintains a state or environment where variables, functions and their scopes are stored.

Unlike compiled programs where errors are often detected at compile-time, interpreters can catch and report errors at runtime, offering more immediate feedback. As each line or block of code is processed, any outputs, including side effects like printing to the console or modifying external resources, occur.

During interpretation it is easier to modify and test code since changes take effect immediately without needing to recompile. If the interpreter is available on a platform, the same code can run there without recompilation. Environments like Python’s REPL allow for immediate execution of code snippets. Runtime errors can be identified and fixed more dynamically.

Interpreted programs are generally slower than compiled code because of the overhead of interpreting each command at runtime. Keeping the source code and interpretation machinery in memory can consume more resources. There is less opportunity for comprehensive optimizations across the whole program, although some interpreters implement JIT compilation to mitigate this.

Pure interpreters execute each line of code directly, like early versions of BASIC or shell scripts. Bytecode interpreters first compile the source code to bytecode, an intermediate form that is easier to interpret than source text but not as low-level as machine code. Python, for example, compiles to bytecode (.pyc files) which is then interpreted. This adds a layer of abstraction but can improve performance.

Just-In-Time (JIT) Compilers combine interpretation with on-the-fly compilation. They interpret code but can compile frequently executed parts into machine code for speed, e.g., V8 engine for JavaScript.

Interpreters often provide or interact with a runtime environment, offering built-in functions, managing memory, handling exceptions, etc. Many interpreted languages come with powerful debugging and profiling tools since the source code is available at runtime. Since interpreters execute code dynamically, there are considerations for ensuring safety, particularly with user-generated content or scripts.

Interpreting a computer program provides a direct, interactive way of executing code, which is particularly beneficial for development, learning and environments where code needs to adapt quickly to changes. While it might not match the performance of compiled code in all scenarios, advancements like JIT compilation help bridge this gap, making interpreted languages viable for a wide range of applications from web development to scripting system tasks.

Processes

In the realm of computer science, managing processes and threads is fundamental to achieving efficiency, responsiveness, and maximizing hardware utilization. Here’s a comprehensive look at what processes and threads are, how they function, and why they are essential in modern computing.

A process is an instance of a program being executed. When you run an application, your operating system creates a new process for that application. At the most basic level a process includes program code, the actual instructions that the CPU will execute; data, including variables, constants and any dynamic data the program is working with. It includes a process state, encompassing the current activity of the process (e.g., running, waiting, stopped); a private virtual address space in memory, where the process stores its data; and information the operating system needs to manage the process, like a program counter, stack pointer and various registers.

Each process runs in its own memory space, which means processes are generally isolated from one another, enhancing security and stability. However, this isolation also means that communication between processes is more complex, often requiring mechanisms like inter-process communication (IPC) methods such as pipes, sockets or shared memory.

Threads

Threads are the smallest units of processing that can be scheduled by an operating system. They exist within processes and share the process’s resources, including memory and file handles. A thread state is similar to a process state but at a thread level, indicating whether the thread is currently running, ready to run or blocked. The execution context includes the program counter, stack and local variables for the thread. Shared resources are the memory space shared by multiple threads within the same process, which allows for easy data sharing but also introduces the need for synchronization to avoid race conditions. A race condition occurs when more than one thread accesses the same memory at the same time and at least one of the accesses is a write.

Threads are crucial because they allow for concurrency within a single process. This means that multiple tasks can be executed seemingly at the same time on modern multi-core processors, improving performance, especially for applications with multiple independent units of work.

Each process has its own memory space, whereas threads share the memory space of their parent process. Creating a new process is heavier compared to spawning a new thread because of the memory and resource allocation involved. Inter-thread communication is generally simpler than inter-process communication since threads share memory. Processes offer more protection against errors; if one process crashes, it doesn’t necessarily affect others. With threads, a bad thread can potentially corrupt shared data or crash the entire process.

Processes enable true multitasking across different applications or services on a computer. Concurrency is the ability of threads within an application handling different aspects of the program simultaneously, like UI updates and background computations. Threads can share data more easily, reducing the need for complex communication mechanisms. On systems with multiple CPUs or cores, multi-threading can improve performance by leveraging parallel processing.

Synchronization ensures that threads do not interfere with each other when accessing shared data, often managed by mutexes, semaphores or other synchronization primitives. Deadlocks are situations where two or more threads are waiting indefinitely for each other to release resources, leading to a program freeze. Managing multiple threads or processes can significantly increase the complexity of software, particularly in terms of debugging and maintenance.

Understanding processes and threads is crucial for anyone involved in software development or system administration. They are the backbone of how modern operating systems manage and execute software, making computers more efficient and responsive. By leveraging processes for isolation and threads for concurrency, developers can craft applications that are both powerful and resource-efficient, tailored to the complexities of today’s computing environments.

x86/x64
CISC/RISC

32 bit ARM
64 bit ARM

RISC V

Instruction Set Architecture (ISA)

#include <stdio.h>

int main()
{
    printf( "Hello, world!\n" );
    return 0;
}

#include is a preprocessor directive that pulls the stdio.h (standard input/output) header into the program, making the declarations of the standard I/O functions, such as printf, available to it.

You write your program in C. A compiler such as Clang transforms the program from C into LLVM IR, and then from LLVM IR into assembly; GCC uses its own intermediate representations (GIMPLE and RTL) instead.

Desktop
Mobile
Network