Machine Language

Let’s start this story at the bare metal stage of development. As usual, my stories are relatively superficial, probably imperfect, and often artificial-intelligence-assisted; I write them to learn as much as to teach anyone who is interested in the things I’m interested in.

Computer Architecture

Digital logic is made of electrical circuits built from transistors, capacitors and resistors. A voltage of less than 0.1 volts may represent a zero, or off; a voltage of more than 0.5 volts may represent a one, or on. Transistors enable you to use a small voltage to turn a larger voltage on or off.

A DRAM is a Dynamic Random-Access Memory integrated circuit made of MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors) and capacitors. These components can be arranged into a variety of structures, including DRAM memory banks, which are arrays of bit cells.

A DRAM bit cell circuit is a MOSFET and a capacitor; each cell stores one bit. The control unit sends a signal to the appropriate word line of the 64-bit array. The signal turns the MOSFET on, allowing the capacitor to discharge its voltage, which is either high or low depending on whether a 1 or 0 is stored there, onto the bit line.

A bit line is a conductor or a wire that connects multiple memory cells in a column within a memory array. Each memory cell in a memory chip is typically connected to a unique row (via a word line) and column (via a bit line) for addressing. The bit line carries electrical signals that represent binary data (0s and 1s). It is paired with a complementary bit line in many memory designs (e.g., in DRAM and SRAM) to improve signal integrity and noise immunity.

Integrated Circuits

There is a set of instructions that controls the flow of information, much as syntax and grammar control the meaning of sounds in human language. The instruction set includes addressing modes, instruction categories, interrupt processing and input/output operations.

In modern computers, these electrical circuits, the DRAM bit cells, are arranged in arrays of 64-bit words. The computer uses the instruction set to read and write these electrical circuits billions of times every second.

A typical integrated circuit has billions of these tiny circuits in it. There are several different kinds of circuits that accomplish different tasks, but integrated circuits are built from billions of copies of just six or seven basic circuit types: logic gates, latches, flip-flops, registers, adders, clocking circuits and sequential logic.

The arrays are arranged into control units, arithmetic logic units (ALUs) and registers. There is usually more than one core on each integrated circuit chip. Many cores support simultaneous multithreading, which lets each physical core work on two instruction streams at once, so the operating system counts each physical core as two logical cores. A chip with four physical cores will report eight logical cores and use the instruction set to schedule work accordingly.

Each core has many registers arranged into the control unit and the ALU, plus several layers of cache memory. The L1 layer of cache memory is as close to the ALU as possible; L2 is slightly farther away. The control unit directs the flow of information among the registers, which are small arrays of fast static memory cells built from MOSFETs (unlike main memory, which uses DRAM), with word lines and bit lines carrying information in and out of each array.

The Random Access Memory (RAM) is a card holding several chips full of DRAM cells that the CPU can use as working memory in its calculations. It is plugged into the motherboard as close to the CPU as possible.

The integrated circuits are all plugged into a motherboard. There is usually some kind of heat sink that draws heat away from the central processor, because all that processing generates a lot of heat, and usually a fan blowing over the heat sink.

Modern computers usually have a graphics processing unit (GPU) in addition to the central processing unit. Graphics processing units are designed for massively parallel work, so for workloads like graphics and matrix arithmetic they can be much faster than central processing units.

The next level up from binary is assembly language. It consists of simple instructions that move data around and process it: add, subtract, multiply, divide, etc. The computer translates both the commands and the data into binary. Assembly language is usually specific to a particular machine architecture; each processor family has its own assembly language.

There are two major syntaxes for writing assembly, the Intel and AT&T styles, and most assembly dialects are variations on these two themes. With a toolchain like LLVM, the compiler translates C into LLVM IR, then translates the LLVM IR into assembly code, and then the assembler translates the assembly code into the binary code the computer processes.
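You can typically watch each of those stages yourself with the LLVM tools: clang -S -emit-llvm hello.c emits the LLVM IR (hello.ll), llc hello.ll lowers it to assembly (hello.s), and clang hello.s -o hello assembles and links the result into an executable.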

Some higher-level languages, like C and C++, are compiled; others, like Python, are interpreted. Compiled languages are translated into machine code all at once, ahead of time. Interpreted languages are translated and executed as the program runs, roughly one statement at a time.

Binary

Binary is a base-2 numeral system, which means it uses two digits (0 and 1) to represent numbers. This contrasts with the decimal (base-10) system we commonly use, which has ten digits from 0 to 9. Each digit in a binary number is called a “bit” (binary digit). A bit can be either 0 or 1. For example, the number 10 in binary is represented as “1010”.

Like in decimal, each position in a binary number represents a power of the base (2 in this case). From right to left, the positions are 2^0, 2^1, 2^2, 2^3 and so on. For instance:
1010 in binary translates to 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 8 + 0 + 2 + 0 = 10 in decimal.

0 in binary is simply 0.
1 in binary is 1.
2 is 10 (1*2^1 + 0*2^0).
3 is 11 (1*2^1 + 1*2^0).
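To make the positional arithmetic concrete, here is a minimal C sketch (the function name binary_to_decimal is mine, just for illustration) that converts a string of binary digits to its decimal value the same way the worked example does:

#include <stdio.h>
#include <string.h>

/* Walk the bits left to right; each step doubles the running value
   (shifting the earlier bits up one power of two) and adds the next bit. */
unsigned binary_to_decimal(const char *bits) {
    unsigned value = 0;
    for (size_t i = 0; i < strlen(bits); i++)
        value = value * 2 + (bits[i] - '0');
    return value;
}

int main(void) {
    printf("%u\n", binary_to_decimal("1010")); /* prints 10 */
    printf("%u\n", binary_to_decimal("11"));   /* prints 3 */
    return 0;
}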

Every piece of data in a computer is stored in binary form. Text, images, videos – all are converted into sequences of bits. Memory (RAM, hard drives) stores data as bits and processors execute instructions by interpreting binary code. Programs are written in high-level languages but are compiled or interpreted into machine code, which is binary.

Binary addition works like decimal addition, but with carry-overs whenever the sum in a column exceeds 1. For example:

1011
+0101
10000

Subtraction, multiplication and division operations also have binary algorithms, often involving bitwise operations like AND, OR, XOR and NOT.
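As a sketch of how that connects to bitwise operations, here is a small C function (my own illustration, not a standard routine) that adds two numbers using only XOR and AND, the same way a hardware adder chains its carries:

#include <stdio.h>

/* XOR adds the bits while ignoring carries; AND finds where carries
   occur; shifting the carries left and repeating finishes the sum. */
unsigned add(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned carry = (a & b) << 1;
        a = a ^ b;
        b = carry;
    }
    return a;
}

int main(void) {
    /* 1011 (11) + 0101 (5) = 10000 (16), matching the example above. */
    printf("%u\n", add(11, 5)); /* prints 16 */
    return 0;
}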

Digital circuits use logic gates which operate on binary inputs to produce binary outputs. Examples include AND, OR and NOT gates. Data transmission protocols in networks often encode information in binary form for transmission over various media.

IPv4 addresses are typically expressed in dotted-decimal notation but are binary at their core (e.g., 192.168.1.1 is 11000000.10101000.00000001.00000001 in binary). Data sizes like bytes, kilobytes, etc., are multiples of bits (8 bits = 1 byte).

Binary is not just a numbering system; it’s the backbone of digital information processing, storage, and communication. Understanding binary helps in grasping how computers manage, interpret, and store data at its most fundamental level.

Each character in binary is called a bit. Bits are stored in registers during processing, and they are often grouped into four-bit sections called nibbles. A byte is typically 8 bits. These 8-bit bytes are used throughout the computer, in both physical hardware and mathematical logic.

The central processing unit and other circuitry are arranged into words of four, eight, sixteen, 32 and now 64 bits. A conductor is laid out in a microscopic line along a row of memory cells, which are arranged into rows of four with one end of each row next to the microscopic line of conductor. The cells are also microscopic. Each can be switched on or off to register a 1 or 0. Multiply that by billions of cells and billions of operations per second.

Bits and Bytes

A bit is the smallest unit of digital information. It can have only one of two values, typically represented as 0 or 1. Bits are used to encode data at the most fundamental level in digital systems. For instance, a single bit could represent “on” or “off,” “true” or “false,” or any binary choice.

A byte is a group of 8 bits. It’s the basic unit of storage in most computer architectures. A byte can represent 2^8 = 256 different values, from 0 to 255 in decimal or 00 to FF in hexadecimal. Bytes are used for encoding characters (like in ASCII), storing small integers, or as a fundamental unit of memory allocation.

The terms “bit” and “byte” are often confused or misused, especially in casual conversation. “Bit” is singular for binary digit, and “byte” refers to a collection of bits (usually 8). “Bits per second” (bps) measures data transfer rates, while “bytes” are used when referring to file sizes or memory capacity (e.g., kilobytes, megabytes).

A 32-bit processor can process data in chunks of 32 bits at once; that width shows up in the size of its registers, its memory address space and its data bus. 64-bit processors can handle 64 bits of data at once, offering a larger address space and potentially better performance for certain operations.

32-bit processors can address up to 2^32 bytes of memory, which is 4 GB. This limit can be a constraint in modern computing contexts. 64-bit processors can theoretically address up to 2^64 bytes, which is 16 exabytes, providing virtually unlimited memory for practical purposes.
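A quick C sketch of those sizes on whatever machine you run it on (the output differs between 32-bit and 64-bit builds):

#include <stdio.h>

int main(void) {
    /* Pointers are 4 bytes on a 32-bit build and 8 bytes on a 64-bit build. */
    printf("pointer size: %zu bytes\n", sizeof(void *));
    /* 2^32 bytes is the 4 GB ceiling of a 32-bit address space. */
    printf("32-bit address space: %llu bytes\n", 1ULL << 32);
    return 0;
}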

64-bit processors generally offer better performance for certain operations due to larger registers, which can do more with each clock cycle. However, for many applications, the difference might not be noticeable unless dealing with large datasets or complex calculations. 64-bit processors can run 32-bit software through emulation or compatibility modes, but 32-bit processors cannot run 64-bit software natively.

32-bit operating systems can only use up to 4 GB of memory and run 32-bit applications. 64-bit operating systems can leverage the full capacity of 64-bit hardware, run both 32-bit and 64-bit applications (with appropriate support), and they can handle more RAM.

32-bit integer operations are limited to 32 bits, which might require multiple steps for larger numbers. 64-bit processors can handle larger integer sizes natively, improving efficiency in number crunching tasks.
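A minimal sketch of that limit, using the fixed-width integer types from C’s <stdint.h>: the 64-bit multiplication keeps the whole product, while squeezing it back into 32 bits throws the high bits away.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t a = 100000;
    /* 100000 * 100000 = 10,000,000,000, far beyond the 32-bit maximum
       of 2,147,483,647, so the math must be done in 64 bits. */
    int64_t wide = (int64_t)a * a;
    int32_t narrow = (int32_t)wide; /* truncation loses the high bits */
    printf("64-bit result: %lld\n", (long long)wide);
    printf("truncated to 32 bits: %d\n", narrow);
    return 0;
}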

64-bit processors are often better for multimedia applications due to the ability to handle larger chunks of data at once, which can improve video editing, gaming and 3D rendering performance. While 64-bit processors are designed to be backward compatible with 32-bit code, there might be performance penalties or specific compatibility issues in some scenarios.

Understanding bits, bytes and the capabilities of 32-bit versus 64-bit processors is fundamental for appreciating how computers handle data and perform operations. The transition from 32-bit to 64-bit computing has significantly expanded computational capabilities, particularly in terms of memory management and performance for complex applications. However, for many everyday tasks, the difference might not be immediately apparent unless one is pushing the boundaries of current technology or dealing with very large datasets.

ASCII

The American Standard Code for Information Interchange (ASCII) is a character encoding standard used in computers and other devices that handle text. Originally developed in the early 1960s, ASCII was designed to standardize the communication of text between various types of data processing equipment.

ASCII was first published as ASA X3.4-1963 by the American Standards Association (now ANSI). It was later revised in 1967 and 1986. ASCII was originally created to ensure compatibility between different types of data processing systems, especially for teletype machines and early computers. ASCII uses 7 bits for each character, allowing for 128 possible characters, 0 to 127 in decimal.

The first 32 characters, 0-31, are non-printable control characters used for communication control, like line feed (LF), carriage return (CR) and tab (HT). The next 95 characters, 32-126, are printable, including:

  • Space and punctuation (32-47, 58-64, 91-96, 123-126)
  • Numerals (48-57)
  • Uppercase letters (65-90)
  • Lowercase letters (97-122)

Character 127 is the delete character, originally used to punch all holes in paper tape to erase any data.
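Since every character is just a small number, a few lines of C will print the printable portion of the chart below (codes 32 through 126):

#include <stdio.h>

int main(void) {
    /* Each printable ASCII code, 32 through 126, alongside its character. */
    for (int c = 32; c <= 126; c++)
        printf("%3d : %c\n", c, c);
    return 0;
}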

ASCII’s simplicity made it widely adopted for basic text representation in computers. One weakness is that it only supports English characters, which led to the creation of extended ASCII and, later, Unicode for more comprehensive language support.

With the advent of 8-bit systems, the extended ASCII character set (characters 128-255) was introduced, although these characters are not standardized and can vary by implementation (e.g., different for Windows-1252 vs. ISO-8859-1).

ASCII is still used in many contexts for its simplicity and because it is a subset of Unicode (UTF-8), ensuring backwards compatibility. ASCII characters are crucial in programming, especially for string manipulation, file encoding and network protocols.

ASCII Chart

Below is a simplified list showing all 128 ASCII characters by their decimal value.

Decimal : Char : Name/Description

  • 0 : NUL : Null
  • 1 : SOH : Start Heading
  • 2 : STX : Start Text
  • 3 : ETX : End Text
  • 4 : EOT : End Transmission
  • 5 : ENQ : Enquiry
  • 6 : ACK : Acknowledge
  • 7 : BEL : Bell
  • 8 : BS : Backspace
  • 9 : HT : Horizontal Tab
  • 10 : LF : Line Feed
  • 11 : VT : Vertical Tab
  • 12 : FF : Form Feed
  • 13 : CR : Carriage Return
  • 14 : SO : Shift Out
  • 15 : SI : Shift In
  • 16 : DLE : Data Link Escape
  • 17 : DC1 : Device Control 1
  • 18 : DC2 : Device Control 2
  • 19 : DC3 : Device Control 3
  • 20 : DC4 : Device Control 4
  • 21 : NAK : Negative Acknowledge
  • 22 : SYN : Synchronous Idle
  • 23 : ETB : End of Transmission Block
  • 24 : CAN : Cancel
  • 25 : EM : End of Medium
  • 26 : SUB : Substitute
  • 27 : ESC : Escape
  • 28 : FS : File Separator
  • 29 : GS : Group Separator
  • 30 : RS : Record Separator
  • 31 : US : Unit Separator
  • 32 : (space) : Space
  • 33-47 : !"#$%&'()*+,-./ : Punctuation
  • 48-57 : 0-9 : Numerals
  • 58-64 : :;<=>?@ : More Punctuation
  • 65-90 : A-Z : Uppercase Letters
  • 91-96 : [\]^_` : More Punctuation
  • 97-122 : a-z : Lowercase Letters
  • 123-126 : {|}~ : Punctuation
  • 127 : DEL : Delete

Source: Grok

This list provides a basic overview. For detailed descriptions of each character, including hexadecimal and octal representations, comprehensive ASCII tables are easy to find online.

Hexadecimal

The hexadecimal number system, commonly known as “hex,” is a base-16 number system, which means it uses sixteen distinct symbols for representing numbers.

Hexadecimal uses sixteen digits, from 0 to 9 and A to F, where A=10, B=11, C=12, D=13, E=14, F=15. This system is more compact than binary (base-2) but still closely related to it. The sixteen hex digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F.

Similar to other numeral systems, each position in a hexadecimal number represents a power of 16 from right to left. For example:

1A3 in hexadecimal translates to:
1*16^2 + 10*16^1 + 3*16^0 = 256 + 160 + 3 = 419 in decimal.

0x1F in hex equals 1*16^1 + 15*16^0 = 16 + 15 = 31 in decimal. 0x signifies that the number is a hexadecimal number.

Hex is particularly useful because each hex digit can represent exactly four binary digits (bits). This makes conversions between binary and hexadecimal straightforward: 1101 1010 in binary easily translates into DA in hex.
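That four-bits-per-digit mapping is easy to check in C with the standard strtoul function, which can parse a string in any base:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Parse "11011010" as base-2; printing it in hex shows each group
       of four bits (1101, 1010) becoming one hex digit (D, A). */
    unsigned long value = strtoul("11011010", NULL, 2);
    printf("0x%lX\n", value); /* prints 0xDA */
    return 0;
}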

Because each hex digit corresponds to exactly four bits, hex is often used in computing to represent binary data in a more human-readable format. For example, a byte (8 bits) can be represented by two hex digits.

In computer architecture, memory addresses are often displayed in hex because it’s more compact than binary and easier to read than decimal. Hexadecimal is used in web design for specifying colors in RGB format, like #FF0000 for red. Many file formats use hex signatures or markers, like the start of a JPEG file with FF D8 FF.

Hex is used in debugging for representing binary data, machine code and memory dumps. Programmers often convert hex to binary or decimal to understand what’s stored or happening in memory.

Hexadecimal addition is similar to decimal, but with carry-overs when the sum exceeds F (15 in decimal):
Example:

1A3E
+0B5D
259B

Subtraction, multiplication and division follow similar rules to other number systems but with base-16 arithmetic.
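The worked sum above is easy to verify with the same strtoul routine, this time parsing base-16:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Verify the example: 0x1A3E + 0x0B5D = 0x259B. */
    unsigned long a = strtoul("1A3E", NULL, 16);
    unsigned long b = strtoul("0B5D", NULL, 16);
    printf("%lX + %lX = %lX\n", a, b, a + b); /* prints 1A3E + B5D = 259B */
    return 0;
}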

Hexadecimal is used for representing cryptographic keys and hashes because it’s compact and less error-prone when manually entering or reading long sequences of numbers. Hex is used in checksums for data integrity checks, like in network protocols.

Hexadecimal serves as an intermediary between human readability and machine-level binary data. Its application spans from low-level programming and hardware interaction to high-level applications like color coding in web design. Understanding hexadecimal is crucial for anyone delving into the internals of computing, software development or digital system design.

Assembly

Assembly language is a low-level programming language that provides a symbolic representation of a computer’s machine code. It acts as an intermediary between high-level programming languages (e.g., C, C++ and Rust) and the raw binary instructions executed by a computer’s processor (CPU). Unlike high-level languages, assembly is hardware-specific, meaning the instructions are tailored to a particular processor’s architecture, such as x86, ARM or RISC-V.

Assembly language uses mnemonic codes (human-readable commands) to represent the binary instructions (opcodes) and operands used in machine language. These mnemonics make it easier for humans to write and understand the instructions, compared to directly using binary or hexadecimal.

Assembly language provides direct control over hardware, including registers, memory and I/O. Each processor architecture has its own assembly language syntax and instruction set.

Programs written in assembly can be highly optimized for performance or specific hardware constraints. While not as user-friendly as high-level languages, assembly is easier to read and write than raw machine code.

An assembly program typically consists of instructions that represent operations to be executed by the CPU (e.g., MOV, ADD); operands that specify the data or memory locations on which instructions operate; labels that define points in the program for jumps or loops; and directives that provide instructions to the assembler for program organization (e.g., declaring data).

Example of x86 Assembly:

section .data
    msg db 'Hello, World!', 0

section .text
    global _start

_start:
    mov rax, 1        ; System call for write
    mov rdi, 1        ; File descriptor (stdout)
    mov rsi, msg      ; Address of the message
    mov rdx, 13       ; Length of the message
    syscall           ; Call the kernel

    mov rax, 60       ; System call for exit
    xor rdi, rdi      ; Return 0
    syscall           ; Call the kernel
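To try this on an x86-64 Linux machine, assuming the NASM assembler and the GNU linker are installed, you would save it as hello.asm and run nasm -f elf64 hello.asm -o hello.o, then ld hello.o -o hello, then ./hello.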

An assembler is a software tool that converts assembly code into machine code. The output is an executable binary file that can be run directly by the processor. The programmer (or, more often, a compiler) writes assembly instructions. The assembler converts mnemonics (e.g., MOV) into binary opcodes and maps symbolic labels to memory addresses, then outputs the binary file containing the processor’s native instructions.

Assembly Language Concepts

  • MOV: Move data between registers or memory.
  • ADD: Add two values.
  • SUB: Subtract one value from another.
  • Assembly instructions operate on registers, small storage locations within the CPU.
    • General-purpose registers: RAX, RBX, RCX (x86-64).
    • Special-purpose registers: Program counter (PC), stack pointer (SP).
  • Assembly allows direct manipulation of memory via addresses:
    • Immediate values: Direct data (MOV RAX, 5).
    • Direct addressing: Access a memory location (MOV RAX, [0x4000]).
    • Indirect addressing: Use register contents as addresses (MOV RAX, [RBX]).
  • Control flow is managed using jumps and branches (see the sketch after this list):
    • JMP: Unconditional jump.
    • JE, JNE: Conditional jumps (e.g., jump if equal or not equal).

Advantages of assembly programs include interacting with the operating system through interrupts or system calls for tasks like I/O or process termination; fine-tuning programs for maximum speed and minimal resource usage; directly manipulating CPU registers, memory and peripherals; and producing smaller binaries than higher-level languages.

Writing and debugging assembly is challenging compared to high-level languages. Assembly code is tied to a specific processor architecture, making it non-portable. It requires a detailed understanding of the underlying hardware.

Assembly language is used in embedded systems for programming micro-controllers or devices with limited resources; in performance-critical applications like games, drivers or real-time systems; in operating systems development for writing kernel-level code; and in reverse engineering for analyzing or modifying compiled binaries.

Assembly language provides the closest interaction with hardware while still being somewhat human-readable. Although higher-level languages dominate most modern programming, assembly remains essential in areas where low-level control or extreme optimization is required.